Metabolomics analysis of Papaya sensory profiles

1 Aims

The aim of this project was to determine the role of flavour related metabolites in papaya fruit flavour, focussing on taste and aroma. As demonstrated by Colantonio et al. (2022), metabolomic analysis of fruit can identify metabolite markers for fruit flavour selection in breeding programs. Subsequently, genetic markers may be developed, which are neccesary for papaya breeding due to the challenges associated with growing thousands of mature trees for phenotyping (Lomax et al. 2024). We aimed to build on the work reported by Zhou et al. (2021) and Zhou et al. (2022).

1.1 Objectives

Examine the general differences in sensory and metabolite profiles between 27 papaya genotypes using principal component analysis.
Apply machine learning models (GBM and BGLR) to analyse how specific metabolites contribute to aroma and flavor characteristics in papaya.
Identify trends in consumer liking scores and consumer demographic information using a clustering algorithm (APcluster)
Apply machine learning models (GBM and BGLR) to analyse the effect of sensory attributes and metabolite concentrations on papaya liking scores
Generate a predictive machine learning model (XGBoost) to estimate fruit liking based on targeted metabolite concentrations

2 Methods

Sensory data was collected by a trained and untrained panel of tasters. The trained panel undertook standard quantitative descriptive analysis (QDA) for 27 papaya genotypes. The untrained panelists consisted of 125 consumers recruit for a “tropical fruit tasting”, who then scored nine disctinctly flavoured papaya genotypes using a labelled affective magnitude scale between 0 to 100 for liking.

Papaya fruit was collected from a single location (Dimbulah, QLD, AUS) on four separate occasions: August 2022 (QDA), February 2023 (QDA), January 2024 (Consumer) and February 2024 (QDA). The fruit was collected and treated aligned with industry management practices (e.g. ripening index 2-3, fungicide/ethylene, refrigerated truck transportation).

Chemical analysis of the fruit employed various methods, including traditional measurements for fruit quality (i.e. TSS, TA, pH, water content) and analytical methods (HS-SPME-GC-MS/MS, LC-MS/MS and IC-Au detector). To preserve the metabolite profile of the fruit after sensory preparation, the fruit was flash frozen and stored at -80. Further, the fruit subject to VOC analysis was prepared at temperatures below -80°C using liquid nitrogen. Samples for LC and IC were freeze dried and subject to solvent extraction: IC, sugars (ethanol); LC, GTR (methanol).

Target VOCs were selected based on the concentration ranges reported in Brewer et al. (2021), focusing on species that exceeded their odour threshold values in water (OAV > 1). The SPME optimisation and GC-MS conditions largely emulated the work done by Pico et al. (2022).

3 Data analysis preprocessing

The initial steps of the data analysis pipeline included:

Installing/loading required packages

# list of packages
required_packages <- c(
    "juba/rmdformats","kableExtra","gt","flextable","DT", "xfun", # Document formatting & tables

    "tidyverse","readxl","janitor", # Data wrangling

    "ggpmisc","ggthemes","ggrepel","geomtextpath","scales", # Plotting stuff

    "FactoMineR","factoextra", # PCA & visualization

    "car","agricolae","tableone", "summarytools", # Stats & summaries

    "BGLR","caret","gbm","xgboost","apcluster","lme4","emmeans" # Modeling & ML
)

# Install all packages
# pak::pak(required_packages)

# load all packages
pacman::p_load(char = basename(required_packages), install=FALSE)

Compiling meta data

# Load metadata from Excel
genotype_table <- 
  read_excel("data/papaya_sample_meta_data.xlsx",
             sheet = "meta_data")

# Create sensory metadata table
sensory_meta <- 
  data.frame(
    "var" = c(
        # Aroma attributes
        "aroma_intensity_AR", "sweet_fruit_AR", "musty_off_note_AR",
        "fishy_AR", "citrus_AR", "floral_AR",
        # Texture attributes
        "resistance_TX", "velvety_TX", "juiciness_TX", "dissolving_TX",
        "fibrous_TX",
        # Flavor attributes
        "flavour_intensity_FL", "sweetness_FL", "bitterness_FL",
        "musty_FL", "floral_FL",
        # Aftertaste attributes
        "bitter_AT", "sweet_AT", "metallic_AT", "prickly_AT"),
    
    "group" = c(
        rep("aroma", 6),
        rep("texture", 5),
        rep("flavour", 5),
        rep("aftertaste", 4))) %>% 
  
   mutate(group = snakecase::to_title_case(group),
          label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var)))

# Create classification key for VOCs
class_key <- 
  read_excel("data/papaya_sample_meta_data.xlsx",
             sheet = "VOCs") %>%
  select(-(Class:Class2)) %>%
  rename("var" = "variable") %>%
  mutate(
    Class3 = str_replace_all(Class3,
                            c("Ester" = "Lipid-derived"))
  ) %>%
  rename("group" = "Class3") %>%
  mutate(
    label = str_replace_all(var,
                           c("Neryl_Geranyl_acetone" = "(E/Z)-Geranylacetone",
                             "E_2_nonal" = "(E)-2-Nonanal",
                             "P_cymene" = "p-Cymene",
                             "Beta_Cyclocitral" = "β-Cyclocitral",
                             "Beta_ionone" = "β-Ionone",
                             "Gamma_octalactone" = "γ-Octalactone",
                             "GTR_conc" = "GTR",
                             "wet_fraction"="Water fraction",
                             "_" = " ")))

glimpse(genotype_table)
> Rows: 123
> Columns: 6
> $ ID        <chr> "01-050", "01-144", "01-217", "01-487", "02-081", "02-480", …
> $ label     <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1_Malay", …
> $ assay     <chr> "Feb24", "Feb24", "Feb24", "Feb24", "Feb24", "Feb24", "Feb24…
> $ Flesh     <chr> "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red…
> $ Genotype  <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1_Malay", …
> $ Genotype2 <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1 Malay", …

glimpse(sensory_meta)
> Rows: 20
> Columns: 3
> $ var   <chr> "aroma_intensity_AR", "sweet_fruit_AR", "musty_off_note_AR", "fi…
> $ group <chr> "Aroma", "Aroma", "Aroma", "Aroma", "Aroma", "Aroma", "Texture",…
> $ label <chr> "Aroma intensity", "Sweet fruit", "Musty off note", "Fishy", "Ci…

glimpse(class_key)
> Rows: 34
> Columns: 3
> $ var   <chr> "Hexanal", "P_cymene", "Terpinolene", "Sulcatone", "Beta_Cycloci…
> $ group <chr> "Lipid-derived", "Carotenoid/Terpenes", "Carotenoid/Terpenes", "…
> $ label <chr> "Hexanal", "p-Cymene", "Terpinolene", "Sulcatone", "β-Cyclocitra…

Calculating VOC matrix effects

calculate_matrix_effect <- function(data,col_range) {
  data %>%
    pivot_longer(
      cols = all_of(col_range),
      names_to = "species",
      values_to = "peak_area"
    ) %>%
    group_by(Spiked, type, species) %>%
    summarise(mean_peak_area = mean(peak_area), .groups = "drop") %>%
    mutate(sample = paste(Spiked, type, sep = "_")) %>%
    select(-c(Spiked, type)) %>%
    pivot_wider(
      names_from = "sample",
      values_from = "mean_peak_area"
    ) %>%
    mutate(
      red_matrix_effect = (((T_Red - F_Red)/F_Standard) - 1)*100,
      yellow_matrix_effect = (((T_Yellow - F_Yellow)/F_Standard) - 1)*100
    )
}

# Read and process data
matrix_effect_low <- read_xlsx("data/voc_matrix_effect.xlsx", "Matrix low split") %>% 
  filter(!ID %in% c("R8","Y8")) %>% 
  calculate_matrix_effect(col_range = 4:28)

matrix_effect_high <- read_xlsx("data/voc_matrix_effect.xlsx", "matrix high split") %>%
  calculate_matrix_effect(col_range = 4:7)

# Combine and clean species names
matrix_effect <- rbind(matrix_effect_high, matrix_effect_low) %>%
  mutate(species = str_replace_all(species, "-", "_"))

kableExtra::kbl(matrix_effect[,c(1,7:8)],
      col.names = c("Flesh Color",
                    "Matrix effect red",
                    "Matrix effect yellow"),
      digits = 1,
      align = c('l', 'c', 'c'),
      caption = "matrix effect for VOC quantification") %>% 
  kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

Table 1: matrix effect for VOC quantification
Flesh Color	Matrix effect red	Matrix effect yellow
Cis_Linalool oxide	-196.4	-123.0
Linalool	410.2	1583.0
Linalool_oxide	-138.3	-110.3
Trans_Linalool oxide (furanoid)	-85.2	-98.8
2_Octanone	-46.3	-52.0
BITC	-62.8	-71.7
Benzaldehyde	-40.0	-36.8
Beta_Cyclocitral	-58.6	-55.2
Beta_ionone	-85.8	-84.0
Decanal	-58.4	-65.3
E_2_nonal	-67.7	-70.2
Gamma_octalactone	-54.2	-17.9
Geraniol	471.1	377.2
Geranylacetone	-60.8	-72.1
Heptanal	45.2	-36.6
Hexanal	2.5	-60.9
Limonene	-1.1	-52.6
Methyl_dodecanoate	-92.4	-95.0
Methyl_hexanoate	-61.7	-47.4
Methyl_octanoate	-36.7	-49.8
Neryl_Geranyl_acetone	-65.1	-73.2
Nerylacetone	-72.5	-75.1
Nonanal	-30.6	-59.2
Octanal	-40.3	-62.4
P_cymene	-54.9	-64.1
Phenylacetaldehyde	-39.1	-8.0
Sulcatone	-7.9	-20.1
Terpinene	-39.7	-66.8
Terpinolene	-51.1	-56.9

Transforming VOC data to ng/kg

calibration_table <- read_excel("data/Results_22_23.xlsx", "densities") %>% 
  select(Hexanal:Linalool) %>% 
  t() %>% as.data.frame() %>% rownames_to_column("species") %>% 
  set_names(c("species", "densities", "OT", "purity"))

res_sheets <- c("Aug22", "Feb23", "Feb24")

VOCs <- res_sheets %>%
  # Read and combine sheets
  map(~read_excel("data/Results_22_23.xlsx", .x) %>%
        pivot_longer(cols = (Hexanal:BITC),
                     names_to = "species",
                     values_to = "concentration") %>%
        mutate(assay = .x)) %>%
  list_rbind() %>%
  
  # join data
  left_join(calibration_table) %>%
  left_join(genotype_table) %>%
  
  # Calculate concentrations
  mutate(
    conc_ppb = (concentration * densities * purity) * (0.00425/0.0042), # ug/L -> ug/KgFW = ug/L * volume(L)/mass(Kg)
    species = str_replace_all(species, "-", "_")) %>%
  
  # update matrix corrections
  left_join(
    matrix_effect %>% 
      mutate(across(
        ends_with("matrix_effect"),
        ~ case_when(
          species %in% c("Linalool","Linalool_oxide") ~ .[species == "Trans_Linalool oxide (furanoid)"],
          TRUE ~ .
          )
        ))
    ) %>%
  
   # Apply matrix corrections 
  mutate(
    corrected_conc_ppb = case_when(
      Flesh == "Red" ~ conc_ppb / (1 + (red_matrix_effect/100)),
      Flesh == "Yellow" ~ conc_ppb / (1 + (yellow_matrix_effect)/100)
      ),
    OAV = corrected_conc_ppb/OT,
    ) %>%
  relocate(c(assay, Flesh, label), .before = species)

# summary of VOC concentrations

VOC_summary <- VOCs %>%
  group_by(species) %>%
  summarise(
    n = n(),                                    
    mean_conc = mean(corrected_conc_ppb, na.rm = TRUE),
    sd_conc = sd(corrected_conc_ppb, na.rm = TRUE),      
    se_conc = sd_conc/sqrt(n),                 
    median_conc = median(corrected_conc_ppb, na.rm = TRUE),
    min_conc = min(corrected_conc_ppb, na.rm = TRUE),
    max_conc = max(corrected_conc_ppb, na.rm = TRUE),
    n_detected = sum(corrected_conc_ppb>0.01),        
    perc_detected = n_detected/n * 100         
  ) %>%
  arrange(desc(mean_conc)) %>%  
  mutate(across(where(is.numeric), ~ round(.x, digits = 2))) %>% 
  ungroup()

Sugar extraction efficiency

sugar_key <- read_xlsx("data/July24_all.xlsx", sheet = "eth_key")
extraction_efficiency <-
  left_join(sugar_key, read_xlsx("data/July24_all.xlsx",
            sheet = "all_raw"), by = "sample_id") %>%
  filter(batch_id == "pooled") %>%
  mutate(
    Flesh = c(rep("Yellow", 5), rep("Red", 5)),
    Sorbitol_extraction_prop = Sorbitol/20000
  ) %>%
  group_by(Flesh) %>%
  summarise(
    extraction_efficiency_mean = mean(Sorbitol_extraction_prop*100),
    extraction_efficiency_sd = sd(Sorbitol_extraction_prop*100)
  )
kableExtra::kbl(extraction_efficiency,
      col.names = c("Flesh Color",
                    "Mean Recovery (%)",
                    "SD (%)"),
      digits = 1,
      align = c('l', 'c', 'c'),
      caption = "Sorbitol extraction efficiency from freeze-dried tissue samples") %>% 
  kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

Table 2: Sorbitol extraction efficiency from freeze-dried tissue samples
Flesh Color	Mean Recovery (%)	SD (%)
Red	84.4	0.6
Yellow	85.3	2.3

Sugar data unit transformation and imputation of missing values

bad_dry <- paste0("JL", 10:53) # compromised samples
UI <- paste0("UI", 1:5) # auxiliary samples

# Data import and initial cleaning
raw_sugar <- left_join(sugar_key,
                      read_xlsx("data/July24_all.xlsx", sheet = "all_raw"),
                      by = "sample_id") %>%
  filter(!sample_id %in% bad_dry,
         !sensory_id %in% UI,
         batch_id != "pooled",
         mass != "75") %>% # auxiliary sample
  select(-c(sample_id, Sorbitol, batch_id)) %>%
  rename(ID = sensory_id) %>%
  full_join(genotype_table, by = "ID") %>%
  left_join(read_xlsx("data/July24_all.xlsx", sheet = "weight"),
            by = "ID") %>%
  mutate(FW_factor = wet_fraction/dry_fraction)

# Missing values
na_rows <- raw_sugar[!complete.cases(raw_sugar),]
kableExtra::kbl(na_rows,
                   caption = "Missing values in sugar dataset") %>% 
  kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

Table 3: Missing values in sugar dataset
ID	mass	Glucose	Fructose	Sucrose	label	assay	Flesh	Genotype	Genotype2	Batch_id	dry_fraction	wet_fraction	FW_factor
01-105	100	20923.45	22637.53	NA	PBL10	Aug22	Yellow	PBL10	PBL10	1	0.1371277	0.8628723	6.292473
01-1	100	17030.04	17982.28	22326.106	PBL10	Aug22	Yellow	PBL10	PBL10	NA	NA	NA	NA
02-1	100	21081.63	20147.66	16771.018	PBL11	Aug22	Yellow	PBL11	PBL11	NA	NA	NA	NA
03-1	100	16326.40	18285.30	15781.072	PBL13	Aug22	Yellow	PBL13	PBL13	NA	NA	NA	NA
05-1	100	19006.88	19580.25	19986.072	PBL15	Aug22	Yellow	PBL15	PBL15	NA	NA	NA	NA
07-1	101	18744.57	18958.28	11242.871	1B	Aug22	Yellow	1B	1B	NA	NA	NA	NA
02-654	101	28661.26	28256.75	1213.175	PBL11	Aug22	Yellow	PBL11	PBL11	NA	NA	NA	NA
03-504	101	28795.46	30279.16	1207.327	PBL13	Aug22	Yellow	PBL13	PBL13	NA	NA	NA	NA
05-603	100	28863.26	29848.19	NA	PBL15	Aug22	Yellow	PBL15	PBL15	1	0.1311861	0.8688139	6.622759
02-021	NA	NA	NA	NA	PBL11	Aug22	Yellow	PBL11	PBL11	1	0.1350356	0.8649644	6.405455
12-795	NA	NA	NA	NA	Solo	Aug22	Red	Solo	Solo	1	0.1160407	0.8839593	7.617667


# Get all imputed datasets using default parameters
set.seed(11)

imp <- mice::mice(raw_sugar, m = 25, maxit = 100, print = FALSE)
> Warning: Number of logged events: 7

save(imp, file = "data/sugar_imp.RData")
load("data/sugar_imp.RData")
imputed_sugar <- complete(imp, "long") %>%
  group_by(ID) %>%
  summarise(across(c(mass, FW_factor, Glucose, Fructose, Sucrose), mean))

# new dataset with mean values from 25 random imputations after 100 iterations
final_sugar <- raw_sugar %>%
  select(-c(mass, FW_factor, Glucose, Fructose, Sucrose)) %>%
  left_join(imputed_sugar, by = "ID")

# Unit transformation of raw data to g sugar/100g FW
sugar_vars <- c("Glucose", "Fructose", "Sucrose")

sugar_data <- final_sugar %>%
  mutate(across(all_of(sugar_vars), ~.x * 0.001),  # Accounts for 0.001L extraction volume
         FW = mass * FW_factor,                      # Converts to fresh weight basis
         across(all_of(sugar_vars), ~(.x / FW) * 100),  # Convert to g/100g FW
         total_sugar = rowSums(across(all_of(sugar_vars))),
         across(all_of(sugar_vars), ~.x / total_sugar, .names = "{.col}_prop"),
         wet_fraction = FW_factor/(1+FW_factor),
         dry_fraction = 1-wet_fraction) %>%
  select(assay, Genotype, ID, Flesh, all_of(sugar_vars),
         total_sugar, ends_with("_prop"),label,wet_fraction)

glimpse(sugar_data)
> Rows: 123
> Columns: 13
> $ assay         <chr> "Feb23", "Feb23", "Feb23", "Feb23", "Feb23", "Feb23", "F…
> $ Genotype      <chr> "PBL8", "Sunlight 3", "PBL9", "PBL6", "PBL7", "RB1", "PB…
> $ ID            <chr> "01-618", "02-424", "03-849", "04-924", "05-mon", "06-46…
> $ Flesh         <chr> "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", …
> $ Glucose       <dbl> 2.388368, 2.819370, 2.166258, 2.116345, 2.759456, 3.0948…
> $ Fructose      <dbl> 2.177387, 2.402008, 2.156498, 1.926594, 2.610099, 2.7535…
> $ Sucrose       <dbl> 7.654067, 7.026162, 6.312745, 4.587775, 6.481786, 3.1513…
> $ total_sugar   <dbl> 12.219822, 12.247540, 10.635501, 8.630714, 11.851341, 8.…
> $ Glucose_prop  <dbl> 0.1954503, 0.2301989, 0.2036818, 0.2452109, 0.2328391, 0…
> $ Fructose_prop <dbl> 0.1781848, 0.1961216, 0.2027641, 0.2232254, 0.2202366, 0…
> $ Sucrose_prop  <dbl> 0.6263649, 0.5736795, 0.5935541, 0.5315638, 0.5469243, 0…
> $ label         <chr> "PBL8", "Sunlight 3", "PBL9", "PBL6", "PBL7", "RB1_2", "…
> $ wet_fraction  <dbl> 0.8515992, 0.8561540, 0.8649311, 0.8783233, 0.8591738, 0…

Glucotropaeolin unit transformation and QC analysis

# Read sample metadata and define sample IDs
GTR_key <- read_xlsx("data/July24_all.xlsx", sheet = "meth_key")
samples <- paste0("ME", seq(10, 136))

# Process GTR (Glucotropaeolin) concentration data
# Initial concentration in mg/L (mg glucotropaeolin per L methanol extract)
GTR_data <- read_xlsx("data/GTR_data.xlsx", sheet = "GTR") %>%
  # Filter relevant samples
  filter(sample_id %in% samples) %>%

  # Handle non-detects: replace 'n.a.' with detection limit (0.0001 mg/L)
  mutate(
    conc = str_replace(conc, "n.a.", "0.0001"),
    conc = as.numeric(conc)
  ) %>%

  # Reassign sample IDs and remove duplicate
  mutate(sample_id = paste0("ME", seq(10, 135))) %>%
  filter(!sample_id %in% c("ME12","ME57")) %>%

  # Join with sample metadata and moisture content data
  left_join(GTR_key) %>%
  rename(ID = sensory_id) %>%
  left_join(imputed_sugar %>% select(ID, FW_factor)) %>%

  # Convert concentration from mg/L to mg/kgFW
  # Step 1: mg/L * (0.001L/mass_mg) = mg/mgDW
  # Step 2: mg/mgDW / FW_factor = mg/mgFW
  # Step 3: mg/mgFW * 100000 = mg/kgFW
  mutate(
    conc = ((conc * (0.001/mass))/FW_factor) * 100000
  ) %>%

  # Clean up column names and select final variables
  rename(GTR_conc = conc) %>%
  select(ID, GTR_conc) %>% 
  filter(!ID %in% UI)

# Calculate Quality Control metrics for Continuing Calibration Verification (CCV) samples
# Reference standard (1 mg/L) analyzed every 10 samples
CCV_samples <- read_xlsx("data/GTR_data.xlsx", sheet = "GTR") %>%
  mutate(conc = as.numeric(conc)) %>%
  filter(sample_id == "STD_1ppm") %>%
  summarise(
    mean_conc = mean(conc),
    sd_conc = sd(conc),
    rsd_percent = (sd_conc/mean_conc) * 100
  )

# Display QC results
kbl(CCV_samples,
                   caption = "Quality Control Metrics for 1 mg/L Reference Standard (n=14)",
                   col.names = c("Mean (mg/L)", "SD (mg/L)", "RSD (%)"),
                   digits = c(4, 4, 2)) %>% 
   kable_styling()

Table 4: Quality Control Metrics for 1 mg/L Reference Standard (n=14)
Mean (mg/L)	SD (mg/L)	RSD (%)
1.1513	0.1589	13.8

Combining data and imputing missing data

# Load proxy sugar/acid data, excluding redundant Genotype column
proxy <- read_xlsx("data/22_23_proxy_sugar.xlsx", "all") 
# Combine multiple datasets
conc_matrix <- full_join(sugar_data, GTR_data, relationship = "many-to-many") %>%  
  full_join(proxy, relationship = "many-to-many") %>% 
  full_join(VOCs, relationship = "many-to-many") %>% 
  # Remove auxiliary samples
  filter(!ID %in% c("01-1","02-1","03-1","04-1","05-1","07-1")) %>% 
  # Remove intermediate calculation columns
  select(-c(concentration:purity,conc_ppb:yellow_matrix_effect,OAV)) %>% 
  # Reshape data for analysis
  pivot_wider(names_from = species, 
              values_from = corrected_conc_ppb)

# missing results
kbl(conc_matrix[!complete.cases(conc_matrix),],
                   caption = "Results with missing values",
                   digits = 1) %>% 
   kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

(#tab:combine data)Results with missing values
assay	Genotype	ID	Flesh	Glucose	Fructose	Sucrose	total_sugar	Glucose_prop	Fructose_prop	Sucrose_prop	label	wet_fraction	GTR_conc	pH	Brix	TA	Genotype2	Hexanal	P_cymene	Terpinolene	Sulcatone	Linalool_oxide	Linalool	Beta_Cyclocitral	Terpinene	Heptanal	Methyl_hexanoate	Limonene	Octanal	Methyl_octanoate	Nonanal	Decanal	Benzaldehyde	E_2_nonal	Neryl_Geranyl_acetone	Gamma_octalactone	Beta_ionone	BITC
Feb24	F1_Malay	02-734	Red	3.8	3.6	4.2	11.6	0.3	0.3	0.4	F1_Malay	0.9	NA	5.5	13.4	0.2	F1 Malay	2.1	7.8	68.1	66.4	16913.1	5811.8	2.9	22.7	10.5	0.1	62.7	1.5	1.6	1.8	3.6	6.9	1.5	10.6	10.6	5.2	1.3
Aug22	Holland_5	09-1	Red	2.7	3.0	2.1	7.9	0.3	0.4	0.3	Holland_5_1	0.9	0.2	NA	NA	NA	Holland 5	16.2	4.5	0.0	32.6	0.0	54.8	0.0	0.0	34.2	0.0	10.5	7.9	0.5	9.4	0.0	5.1	0.0	35.1	4.9	4.5	4.6
Aug22	Solo	12-1	Red	3.2	3.4	2.0	8.7	0.4	0.4	0.2	Solo	0.9	0.1	NA	NA	NA	Solo	19.5	4.7	15.1	72.5	670.6	178.7	0.0	5.9	33.8	3.7	14.0	8.4	0.6	11.4	0.0	6.6	0.0	76.1	2.4	6.7	0.0
Aug22	PBL15	05-575	Yellow	5.5	6.0	0.1	11.7	0.5	0.5	0.0	PBL15	0.8	NA	5.4	12.3	0.2	PBL15	85.3	5.9	16.6	2.8	3593.0	387.3	0.0	0.0	85.2	0.0	22.1	14.3	0.0	19.5	0.0	7.9	0.0	32.8	1.2	4.7	6.2
Aug22	PBL11	02-021	Yellow	3.4	3.5	2.8	9.7	0.3	0.4	0.3	PBL11	0.9	NA	5.1	10.6	0.4	PBL11	43.9	6.4	19.0	4.6	335918.5	8204.7	0.0	10.7	68.3	48.5	34.9	12.2	4.0	14.3	0.0	24.8	0.0	59.9	9.0	7.6	8.1
Aug22	Solo	12-795	Red	2.9	3.0	1.8	7.7	0.4	0.4	0.2	Solo	0.9	NA	5.4	10.4	0.4	Solo	13.0	7.2	32.9	15.0	6139.8	2893.9	0.0	23.2	57.6	2.8	44.0	17.5	0.5	19.6	0.0	11.7	0.0	168.5	41.5	4.8	25.2


# Get all imputed datasets using default parameters
imp_vars <- c("GTR_conc","pH","Brix","TA")

set.seed(11)
init <- mice::mice(conc_matrix,maxit=0)
> Warning: Number of logged events: 6

init$predictorMatrix[,] <- 0 
init$predictorMatrix[imp_vars, c(sugar_vars,"BITC")] <- 1

# imp2 <- mice::mice(conc_matrix, predictorMatrix = init$predictorMatrix, m = 25, maxit = 100, 
#                    print = FALSE)
# 
# imputed_conc <- complete(imp2, "long") %>%
#   group_by(ID) %>%
#   summarise(across(all_of(imp_vars), mean))

# save(imputed_conc, file = "data/imputed_conc.RData")
load("data/imputed_conc.RData")
# new dataset with mean values from 25 random imputations after 100 iterations (probably overkill)
final_conc_matrix <- conc_matrix %>%
  select(-(all_of(imp_vars))) %>%
  left_join(imputed_conc, by = "ID") %>% 
  relocate(all_of(imp_vars), .before = Hexanal)

# Attaching sensory data
full_data_matrix <- final_conc_matrix %>% left_join(read_xlsx("data/all_sensory_data.xlsx"))

# full_data_matrix %>% select(Flesh, sweetness_FL, Sulcatone, Heptanal) %>% print(n=118)
  
data_summary_genotype <- 
  full_data_matrix %>%
  group_by(Genotype) %>%
  summarise(across(where(is.numeric),
                   ~sprintf("%.2f (%.2f)", mean(.x), sd(.x)))) %>%
  pivot_longer(cols = -Genotype, names_to = "variable") %>%
  pivot_wider(names_from = Genotype, values_from = value)

4 Sensory and metabolite data summary

Sensory data evaluations were conducted in a purpose-built sensory laboratory (Food Quality lab at the Centre for Nutrition and Food Sciences; The Queensland Alliance for Agriculture and Food Innovation, University of Queensland), equipped with 14 isolated sensory booths, temperature control (22°C), and LED daylight-equivalent lighting. Six formal sessions were conducted in a single-blind randomised block design, with participants recording their ratings for each genotype. To minimize taster fatigue, sessions were limited to two per day. The data from the first two sessions were considered dummy data and discarded (detailed methods can be found at Sensory methods supplementary)

print(summarytools::dfSummary(full_data_matrix,
                graph.magnif = 0.82,
                varnumbers = FALSE,
                style = "grid",
                valid.col = TRUE,  # Changed to TRUE to match your desired output
                graph.col = TRUE),  # Added to ensure graphs are displayed
      method = 'render',  # Important for proper HTML rendering
      table.classes = 'st-table st-table-bordered st-table-striped')

Data Frame Summary

full_data_matrix

Dimensions: 118 x 60
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

assay [character]

1. Aug22

2. Feb23

3. Feb24

46	(	39.0%	)
24	(	20.3%	)
48	(	40.7%	)

118 (100.0%)

0 (0.0%)

Genotype [character]

1. RB1

2. Holland_5

3. PBL15

4. 1B

5. Eksotika

6. F1_Malay

7. First_Lady

8. PBL1

9. PBL10

10. PBL11

[ 17 others ]

11	(	9.3%	)
8	(	6.8%	)
5	(	4.2%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
66	(	55.9%	)

118 (100.0%)

0 (0.0%)

ID [character]

1. 01-039
2. 01-050
3. 01-105
4. 01-144
5. 01-215
6. 01-217
7. 01-487
8. 01-534
9. 01-618
10. 01-776
[ 108 others ]

1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
1	(	0.8%	)
108	(	91.5%	)

118 (100.0%)

0 (0.0%)

Flesh [character]

1. Red

2. Yellow

90	(	76.3%	)
28	(	23.7%	)

118 (100.0%)

0 (0.0%)

Glucose [numeric]

Mean (sd) : 3 (0.8)

min ≤ med ≤ max:

1.3 ≤ 2.8 ≤ 5.5

IQR (CV) : 1 (0.3)

118 distinct values

118 (100.0%)

0 (0.0%)

Fructose [numeric]

Mean (sd) : 2.9 (0.8)

min ≤ med ≤ max:

1.4 ≤ 2.7 ≤ 6

IQR (CV) : 1.1 (0.3)

118 distinct values

118 (100.0%)

0 (0.0%)

Sucrose [numeric]

Mean (sd) : 3.3 (2.3)

min ≤ med ≤ max:

0.1 ≤ 2.7 ≤ 8.9

IQR (CV) : 3.7 (0.7)

118 distinct values

118 (100.0%)

0 (0.0%)

total_sugar [numeric]

Mean (sd) : 9.2 (2.1)

min ≤ med ≤ max:

4.2 ≤ 9.2 ≤ 13.6

IQR (CV) : 3.2 (0.2)

118 distinct values

118 (100.0%)

0 (0.0%)

Glucose_prop [numeric]

Mean (sd) : 0.3 (0.1)

min ≤ med ≤ max:

0.2 ≤ 0.3 ≤ 0.5

IQR (CV) : 0.1 (0.3)

118 distinct values

118 (100.0%)

0 (0.0%)

Fructose_prop [numeric]

Mean (sd) : 0.3 (0.1)

min ≤ med ≤ max:

0.1 ≤ 0.3 ≤ 0.5

IQR (CV) : 0.2 (0.3)

118 distinct values

118 (100.0%)

0 (0.0%)

Sucrose_prop [numeric]

Mean (sd) : 0.3 (0.2)

min ≤ med ≤ max:

0 ≤ 0.3 ≤ 0.7

IQR (CV) : 0.3 (0.6)

118 distinct values

118 (100.0%)

0 (0.0%)

label [character]

1. PBL15

2. 1B

3. Eksotika

4. F1_Malay

5. First_Lady

6. Holland_5_1

7. Holland_5_3

8. PBL1

9. PBL10

10. PBL11

[ 20 others ]

5	(	4.2%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
77	(	65.3%	)

118 (100.0%)

0 (0.0%)

wet_fraction [numeric]

Mean (sd) : 0.9 (0)

min ≤ med ≤ max:

0.8 ≤ 0.9 ≤ 0.9

IQR (CV) : 0 (0)

118 distinct values

118 (100.0%)

0 (0.0%)

Genotype2 [character]

1. RB1

2. Holland 5

3. PBL15

4. 1B

5. Eksotika

6. F1 Malay

7. First Lady

8. PBL1

9. PBL10

10. PBL11

[ 17 others ]

11	(	9.3%	)
8	(	6.8%	)
5	(	4.2%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
4	(	3.4%	)
66	(	55.9%	)

118 (100.0%)

0 (0.0%)

GTR_conc [numeric]

Mean (sd) : 0.8 (1.5)

min ≤ med ≤ max:

0 ≤ 0.3 ≤ 12.1

IQR (CV) : 0.9 (1.9)

118 distinct values

118 (100.0%)

0 (0.0%)

pH [numeric]

Mean (sd) : 5.5 (0.2)

min ≤ med ≤ max:

5 ≤ 5.5 ≤ 6.1

IQR (CV) : 0.2 (0)

14 distinct values

118 (100.0%)

0 (0.0%)

Brix [numeric]

Mean (sd) : 11.3 (1.8)

min ≤ med ≤ max:

6.6 ≤ 11.1 ≤ 14.2

IQR (CV) : 2.8 (0.2)

59 distinct values

118 (100.0%)

0 (0.0%)

TA [numeric]

Mean (sd) : 0.2 (0.1)

min ≤ med ≤ max:

0 ≤ 0.2 ≤ 0.4

IQR (CV) : 0.1 (0.4)

34 distinct values

118 (100.0%)

0 (0.0%)

Hexanal [numeric]

Mean (sd) : 18.4 (21.3)

min ≤ med ≤ max:

0 ≤ 9.8 ≤ 85.3

IQR (CV) : 22.8 (1.2)

118 distinct values

118 (100.0%)

0 (0.0%)

P_cymene [numeric]

Mean (sd) : 5.2 (3.3)

min ≤ med ≤ max:

0 ≤ 5.1 ≤ 20.5

IQR (CV) : 4.2 (0.6)

113 distinct values

118 (100.0%)

0 (0.0%)

Terpinolene [numeric]

Mean (sd) : 20.9 (33.9)

min ≤ med ≤ max:

0 ≤ 14.7 ≤ 213.3

IQR (CV) : 16 (1.6)

96 distinct values

118 (100.0%)

0 (0.0%)

Sulcatone [numeric]

Mean (sd) : 62.9 (84.2)

min ≤ med ≤ max:

0 ≤ 46 ≤ 629.1

IQR (CV) : 66.4 (1.3)

118 distinct values

118 (100.0%)

0 (0.0%)

Linalool_oxide [numeric]

Mean (sd) : 52682.5 (108929.3)

min ≤ med ≤ max:

0 ≤ 7529 ≤ 747850.9

IQR (CV) : 42821.2 (2.1)

117 distinct values

118 (100.0%)

0 (0.0%)

Linalool [numeric]

Mean (sd) : 7313.7 (18543.3)

min ≤ med ≤ max:

0 ≤ 682.3 ≤ 112060.6

IQR (CV) : 4922.4 (2.5)

117 distinct values

118 (100.0%)

0 (0.0%)

Beta_Cyclocitral [numeric]

Mean (sd) : 2.2 (2.9)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 5 (1.3)

50 distinct values

118 (100.0%)

0 (0.0%)

Terpinene [numeric]

Mean (sd) : 10.5 (15.4)

min ≤ med ≤ max:

0 ≤ 5.9 ≤ 105.5

IQR (CV) : 9.5 (1.5)

107 distinct values

118 (100.0%)

0 (0.0%)

Heptanal [numeric]

Mean (sd) : 32 (24.5)

min ≤ med ≤ max:

0 ≤ 23.4 ≤ 112.2

IQR (CV) : 30.4 (0.8)

118 distinct values

118 (100.0%)

0 (0.0%)

Methyl_hexanoate [numeric]

Mean (sd) : 42.8 (199.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 2002.8

IQR (CV) : 8.8 (4.6)

52 distinct values

118 (100.0%)

0 (0.0%)

Limonene [numeric]

Mean (sd) : 26.4 (39.9)

min ≤ med ≤ max:

0 ≤ 12.1 ≤ 250.1

IQR (CV) : 30.3 (1.5)

90 distinct values

118 (100.0%)

0 (0.0%)

Octanal [numeric]

Mean (sd) : 7 (5.9)

min ≤ med ≤ max:

0 ≤ 5.2 ≤ 33.5

IQR (CV) : 8.4 (0.8)

118 distinct values

118 (100.0%)

0 (0.0%)

Methyl_octanoate [numeric]

Mean (sd) : 2.2 (3.8)

min ≤ med ≤ max:

0 ≤ 1.4 ≤ 20.9

IQR (CV) : 1.1 (1.7)

101 distinct values

118 (100.0%)

0 (0.0%)

Nonanal [numeric]

Mean (sd) : 9.5 (7.9)

min ≤ med ≤ max:

0.4 ≤ 8.1 ≤ 36.4

IQR (CV) : 13.7 (0.8)

118 distinct values

118 (100.0%)

0 (0.0%)

Decanal [numeric]

Mean (sd) : 0.4 (1.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 7.8

IQR (CV) : 0 (3)

20 distinct values

118 (100.0%)

0 (0.0%)

Benzaldehyde [numeric]

Mean (sd) : 14.1 (14.7)

min ≤ med ≤ max:

1.8 ≤ 7.8 ≤ 74.8

IQR (CV) : 15.6 (1)

118 distinct values

118 (100.0%)

0 (0.0%)

E_2_nonal [numeric]

Mean (sd) : 0.5 (0.9)

min ≤ med ≤ max:

0 ≤ 0 ≤ 5.1

IQR (CV) : 0.8 (1.7)

45 distinct values

118 (100.0%)

0 (0.0%)

Phenylacetaldehyde [numeric]

Mean (sd) : 0 (0)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0

IQR (CV) : 0 (1)

3 distinct values

118 (100.0%)

0 (0.0%)

Neryl_Geranyl_acetone [numeric]

Mean (sd) : 39.9 (42.8)

min ≤ med ≤ max:

2.6 ≤ 32.9 ≤ 312.7

IQR (CV) : 53.9 (1.1)

118 distinct values

118 (100.0%)

0 (0.0%)

Gamma_octalactone [numeric]

Mean (sd) : 11.4 (13.7)

min ≤ med ≤ max:

1.2 ≤ 6.4 ≤ 63.2

IQR (CV) : 9.1 (1.2)

118 distinct values

118 (100.0%)

0 (0.0%)

Beta_ionone [numeric]

Mean (sd) : 6.8 (4.9)

min ≤ med ≤ max:

3.5 ≤ 5.3 ≤ 50.2

IQR (CV) : 3 (0.7)

118 distinct values

118 (100.0%)

0 (0.0%)

BITC [numeric]

Mean (sd) : 4.5 (7.1)

min ≤ med ≤ max:

0 ≤ 2.6 ≤ 67.3

IQR (CV) : 5.3 (1.6)

93 distinct values

118 (100.0%)

0 (0.0%)

aroma_intensity_AR [numeric]

Mean (sd) : 52.6 (16.7)

min ≤ med ≤ max:

18.8 ≤ 49.4 ≤ 86

IQR (CV) : 23.9 (0.3)

113 distinct values

118 (100.0%)

0 (0.0%)

sweet_fruit_AR [numeric]

Mean (sd) : 35.3 (18.6)

min ≤ med ≤ max:

5.4 ≤ 31.6 ≤ 71.1

IQR (CV) : 33.2 (0.5)

116 distinct values

118 (100.0%)

0 (0.0%)

musty_off_note_AR [numeric]

Mean (sd) : 40.4 (12.3)

min ≤ med ≤ max:

13.1 ≤ 39.3 ≤ 74.8

IQR (CV) : 16.2 (0.3)

112 distinct values

118 (100.0%)

0 (0.0%)

fishy_AR [numeric]

Mean (sd) : 30.2 (16.5)

min ≤ med ≤ max:

0.6 ≤ 29 ≤ 68.9

IQR (CV) : 27 (0.5)

112 distinct values

118 (100.0%)

0 (0.0%)

citrus_AR [numeric]

Mean (sd) : 21.2 (18.9)

min ≤ med ≤ max:

0.9 ≤ 14.2 ≤ 71.4

IQR (CV) : 26.1 (0.9)

108 distinct values

118 (100.0%)

0 (0.0%)

floral_AR [numeric]

Mean (sd) : 21.6 (19.2)

min ≤ med ≤ max:

0.3 ≤ 13.9 ≤ 73.7

IQR (CV) : 22.3 (0.9)

109 distinct values

118 (100.0%)

0 (0.0%)

resistance_TX [numeric]

Mean (sd) : 42.1 (22.7)

min ≤ med ≤ max:

3.4 ≤ 40.9 ≤ 89.7

IQR (CV) : 35.8 (0.5)

115 distinct values

118 (100.0%)

0 (0.0%)

velvety_TX [numeric]

Mean (sd) : 45.4 (13.1)

min ≤ med ≤ max:

16.8 ≤ 44.5 ≤ 75.3

IQR (CV) : 17.5 (0.3)

113 distinct values

118 (100.0%)

0 (0.0%)

juiciness_TX [numeric]

Mean (sd) : 48.4 (19.4)

min ≤ med ≤ max:

6.8 ≤ 50.4 ≤ 85.4

IQR (CV) : 34.6 (0.4)

117 distinct values

118 (100.0%)

0 (0.0%)

dissolving_TX [numeric]

Mean (sd) : 54.9 (21.3)

min ≤ med ≤ max:

6.9 ≤ 56.4 ≤ 92

IQR (CV) : 28.8 (0.4)

117 distinct values

118 (100.0%)

0 (0.0%)

fibrous_TX [numeric]

Mean (sd) : 44.6 (13.5)

min ≤ med ≤ max:

15.9 ≤ 44.3 ≤ 84.6

IQR (CV) : 17.9 (0.3)

111 distinct values

118 (100.0%)

0 (0.0%)

flavour_intensity_FL [numeric]

Mean (sd) : 58.2 (13.7)

min ≤ med ≤ max:

14.1 ≤ 59 ≤ 87.2

IQR (CV) : 19.4 (0.2)

109 distinct values

118 (100.0%)

0 (0.0%)

sweetness_FL [numeric]

Mean (sd) : 54.8 (16.6)

min ≤ med ≤ max:

7 ≤ 57 ≤ 80.8

IQR (CV) : 22.7 (0.3)

114 distinct values

118 (100.0%)

0 (0.0%)

bitterness_FL [numeric]

Mean (sd) : 26.8 (18.7)

min ≤ med ≤ max:

2.4 ≤ 20.8 ≤ 96.2

IQR (CV) : 19.8 (0.7)

109 distinct values

118 (100.0%)

0 (0.0%)

musty_FL [numeric]

Mean (sd) : 42.1 (11.5)

min ≤ med ≤ max:

18.2 ≤ 39.7 ≤ 69.6

IQR (CV) : 16.8 (0.3)

112 distinct values

118 (100.0%)

0 (0.0%)

floral_FL [numeric]

Mean (sd) : 26.6 (16.4)

min ≤ med ≤ max:

1.9 ≤ 21.2 ≤ 66.6

IQR (CV) : 20.4 (0.6)

114 distinct values

118 (100.0%)

0 (0.0%)

bitter_AT [numeric]

Mean (sd) : 23.6 (17)

min ≤ med ≤ max:

5 ≤ 18.1 ≤ 91.1

IQR (CV) : 16.5 (0.7)

108 distinct values

118 (100.0%)

0 (0.0%)

sweet_AT [numeric]

Mean (sd) : 37.8 (15.5)

min ≤ med ≤ max:

1.5 ≤ 39.4 ≤ 66.9

IQR (CV) : 21 (0.4)

116 distinct values

118 (100.0%)

0 (0.0%)

metallic_AT [numeric]

Mean (sd) : 11.8 (6.1)

min ≤ med ≤ max:

0.8 ≤ 10.7 ≤ 34.7

IQR (CV) : 7 (0.5)

102 distinct values

118 (100.0%)

0 (0.0%)

prickly_AT [numeric]

Mean (sd) : 8.6 (4.7)

min ≤ med ≤ max:

1.4 ≤ 7.8 ≤ 23.3

IQR (CV) : 5.3 (0.5)

103 distinct values

118 (100.0%)

0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.2.1)
2025-06-20

5 Sensory and metabolite Principal Component Analysis

Metabolites with odour activity values less than one likely play a very minor role in fruit flavour; due to the small ranges and fluctuations, their influence on flavour may be overestimated in modelling. Moreover, sensory terms including prickly aftertaste and metallic aftertaste poorly differentiated the genotypes throughout descriptive analysis. Therefore these variables were removed to focus on highly relevant flavour variables.

# cut VOC variables
no_aroma <- VOCs %>% 
  select(1:6,OAV) %>% 
  pivot_wider(names_from = species, values_from = OAV) %>% 
  select(Hexanal:BITC) %>% 
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE))) %>%
  select(where(function(x) any(x<1))) %>% 
  colnames()

# cut sensory variables
poor_lexicon_use <- c("prickly_AT", "metallic_AT")

Once these variables were removed the data was visualised using PCA to look at sensory and metabolite profiles of each genotype.

# data set up
remove_vars <- c("Glucose_prop","Fructose_prop","Sucrose_prop","label","Genotype2")
PCA_data <- full_data_matrix %>% 
  column_to_rownames(var = "ID") %>% 
  select(-all_of(c(poor_lexicon_use, no_aroma,remove_vars,"Beta_Cyclocitral")))
# Beta_Cyclocitral appears to have very low range and is exacerbating small differences due to presence absence

# PCA script custom bi-plot
PCA_vocs <- PCA(PCA_data, scale.unit = TRUE, 
                ncp = 5, graph = F, 
                quali.sup = 1:31) # omit all but sensory columns
 
# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>% 
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>% 
  top_n(17, contribution)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(sensory_meta) %>% 
  mutate(Category = snakecase::to_title_case(group),
         sensory_label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var))) 

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
                    as_tibble() %>% 
                    mutate(across(starts_with("Dim"), ~r*0.6*.x)) # this number scales arrow length

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>% 
               as.data.frame() %>% 
               rownames_to_column("ID") %>% 
               left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>% 
               group_by(label) %>% 
               summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
               left_join(genotype_table %>% distinct(label,Flesh, Genotype, Genotype2, assay))


# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) + 
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=0.20, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2, 
                  size=3, hjust=-0.2, vjust=0.5, box.padding = 0.2,
                  colour = "black", fontface = "italic") +
   scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh", 
        colour="Attribute", shape = "Havest Month")

Figure 1: PCA biplot of 27 papaya genotype dipicting the variations in sensory profiles from quantitative descriptive analysis.

PCA_vocs <- PCA(PCA_data, scale.unit = TRUE, 
                ncp = 6, graph = F, 
                quali.sup = c(1:3,31:48)) # all chem PCA

# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib[,1:2] %>% as.data.frame() %>% 
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>% 
  top_n(22, contribution)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key, by = "var") %>% 
  mutate(var=str_replace_all(var, c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone", "E_2_nonal"="E-2-Nonanal","P_cymene"="P-Cymene","_"=" ")), group=str_replace_all(group, c("Acids"="Sugar/acid","Sugars"="Sugar/acid")))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
  as_tibble() %>% 
  mutate(across(starts_with("Dim"), ~r*0.6*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>% 
  as.data.frame() %>% 
  rownames_to_column("ID") %>% 
  left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>% 
  group_by(label,Flesh) %>% 
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  left_join(genotype_table %>% distinct(label, Flesh, Genotype, Genotype2,assay))

# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) + 
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=0.3, vjust=-0.6, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090","darkorange")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2, 
                  size=3, hjust=1.1, vjust=1.2, box.padding = 0.1,
                  colour = "black", fontface = "italic") +
  scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh", 
        colour="Attribute", shape = "Havest Month")

CA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype.

Figure 2: CA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype.

There appears to be a significant batch effect associated with metabolites with highly skewed distributions. After removing decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal, the control genotypes are qualitatively similar in distance to what is observed in the sensory data.

PCA_vocs <- PCA(PCA_data, scale.unit = TRUE,
                ncp = 6, graph = F,
                quali.sup = c(1:3,21,23:26,31:48)) # omit decanal, methyl hexanoate, octanal, (E)-2-nonanal, nonanal

# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>%
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>%
  top_n(22, contribution)
# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>%
  as.data.frame() %>%
  select(1:2) %>%
  rownames_to_column(var = "var") %>%
  left_join(class_key) %>%
  mutate(var=str_replace_all(var, c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone", "E_2_nonal"="E-2-Nonanal","P_cymene"="P-Cymene","_"=" ")), group=str_replace_all(group, c("Acids"="Sugar/acid","Sugars"="Sugar/acid")))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs,
                       element = "var",
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE],
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))),
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>%
  as_tibble() %>%
  mutate(across(starts_with("Dim"), ~r*0.6*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>%
  as.data.frame() %>%
  rownames_to_column("ID") %>%
  left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>%
  group_by(label,Flesh) %>%
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  left_join(genotype_table %>% distinct(label, Flesh, Genotype, Genotype2,assay))

# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) +
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) +
  geom_hline(yintercept = 0, alpha = 0.6,
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6,
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2),
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=0.3, vjust=-0.6, size=3.2, box.padding = 0.1) +
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090","darkorange")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(),
                     panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2,
                  size=3, hjust=1.1, vjust=1.2, box.padding = 0.1,
                  colour = "black", fontface = "italic") +
  scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh",
        colour="Attribute", shape = "Havest Month")

PCA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype. Excluding decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal.

Figure 3: PCA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype. Excluding decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal.

6 Modelling the interactions between sensory and metabolite profiles

Variable influences on traits were assessed using two complementary approaches: (1) generalised boosted regression (Friedman 2001) to capture both main effects and interactions (x-axis), optimized using 10-fold cross-validation, and (2) Bayesian linear regression (BayesA; Meuwissen et al. (2001)) to estimate individual additive effects without interactions (y-axis). Only variables that stood out from the rest were labelled due to space limitations.

scaled_matrix <- PCA_data %>% select(-c(1:3,21,23:26)) %>% scale()

# Sensory subsetting
# column naming
traits <- colnames(scaled_matrix) [23:40]
gbm_name <- c(seq(from = 1, to = 35, by = 2))
beta_name <- c(seq(from = 2, to = 36, by = 2))

set.seed(11)

for (i in 1:18) {
  
  # caret GBM
  carX <- as.matrix(cbind(Y=scaled_matrix[,22+i], scaled_matrix[,1:22]))
  
  grid <- expand.grid(.n.trees = c(500, 750, 1000), 
                      .interaction.depth = c(5,7,10), 
                      .shrinkage = c(0.01,0.02,0.05), 
                      .n.minobsinnode = c(3,5))
  
  trnCtrl <- trainControl(method = "cv", number = 10)
  
  gbm <- train(Y ~ ., data = carX, method = "gbm", trControl = trnCtrl, tuneGrid = grid)
  
  # BGLR
  trnY <- scaled_matrix[,22+i]
  
  ETA <- list(list(X = scaled_matrix[,1:22], model = 'BayesA'))
  
  fit_BA <- BGLR::BGLR(y = trnY, 
                 ETA = ETA, 
                 nIter = 30000, 
                 burnIn = 10000,
                 thin = 100, 
                 saveAt = 'data/', 
                 df0 = 5, 
                 S0 = NULL,
                 weights = NULL, 
                 R2 = 0.5)

 # write results table
  if(i == 1){
    wts <- as.data.frame(varImp(object = gbm)$importance) %>% 
           cbind(fit_BA$ETA[[1]]$b)
    names(wts)[i] <- paste0(traits[i],"_gbm")
    names(wts)[i+1] <- paste0(traits[i],"_beta")
  } else {
    wts <- cbind(wts,varImp(object = gbm)$importance,fit_BA$ETA[[1]]$b)
    names(wts)[gbm_name[i]] <- paste0(traits[i],"_gbm")
    names(wts)[beta_name[i]] <- paste0(traits[i],"_beta")
  }
}

save(wts,traits,beta_name, gbm_name,  file = "data/sensory_modelling.RData")

# Plot prep ------------
bglr_data <- wts %>% 
  select(all_of(beta_name)) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key) %>% 
  pivot_longer(cols = 2:19, values_to = "y", names_to = "trait") %>% 
  mutate(trait=str_replace_all(trait,"_beta",""))

gbm_data <- wts %>% 
  select(all_of(gbm_name)) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key) %>% 
  pivot_longer(cols = 2:19, values_to = "x", names_to = "trait") %>% 
  mutate(trait=str_replace_all(trait,"_gbm",""))

all_plot_data <- left_join(bglr_data,gbm_data[c(1,4:5)],by=c("var" = "var", "trait" = "trait")) %>% 
                mutate(sensory_group = str_sub(trait, - 2, - 1),
                       trait_label = str_sub(trait,0,-4),
                       trait_label = str_replace_all(trait_label,"_"," "),
                       trait_label = str_to_sentence(trait_label),
                       label = str_replace_all(label,c("total sugar"="Total sugar",
                                                       "wet fraction"="Water content")))
# Filter for AR labels -------
ar_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_min(y, n=4),
    
    all_plot_data %>%  filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_max(y, n=4),
    
    all_plot_data %>% filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_max(x, n=8)) %>%
  
  distinct()
  
my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Need labels to remove
# Determined by initially plotting AR_label 

AR_intensity <- c("Terpinolene","Heptanal","Terpinene","(E/Z)-Geranylacetone","Sucrose",
                  "Hexanal")
AR_citrus <- c("Glucose","Hexanal","Limonene","Sucrose","Heptanal","Sulcatone","β-Ionone","Terpinene")
AR_fishy <- c("Sucrose","Glucose","P-Cymene","pH","β-Ionone","Sulcatone","TA")
AR_floral <- c("(E/Z)-Geranylacetone","β-Ionone", "Hexanal","Sulcatone","(E/Z)-Geranylacetone")
AR_musty <- c("Linalool oxide", "Terpinolene", "GTR","(E/Z)-Geranylacetone")
AR_sweet <- c("BITC","β-Ionone","Linalool oxide","P-Cymene","Hexanal","GTR","pH","Total sugar")

# Filter labels and plot

ar_label1 <- ar_label %>%
  filter(!label %in% AR_intensity & trait=="aroma_intensity_AR"|
         !label %in% AR_citrus & trait == "citrus_AR" |
         !label %in% AR_fishy & trait == "fishy_AR"|
         !label %in% AR_floral & trait == "floral_AR"|
         !label %in% AR_musty & trait == "musty_off_note_AR"|
         !label %in% AR_sweet & trait == "sweet_fruit_AR")
  

#plot AROMA trait variables
ggplot(all_plot_data %>% filter(sensory_group == "AR")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data[all_plot_data$sensory_group=="AR",] %>% select(y)), 
       max(all_plot_data[all_plot_data$sensory_group=="AR",] %>% select(y))) +


  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = ar_label1,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = -.1,
                  vjust = -0.4,
                  size = 3,
                  box.padding = 0.1) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 4: Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes.

# Filter for FL labels -------
FL_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_min(y, n=5), 
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_max(y, n=5),
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_max(x, n=8)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Need labels to remove
# Determined by initially plotting FL_label 

FL_bitterness <- c("γ-Octalactone", "Sucrose","Hexanal","Glucose","Terpinolene","Water fraction","Brix")
FL_intensity <- c("Sulcatone", "β-Ionone","Linalool","TA","BITC")
FL_floral <- c("Brix","Linalool oxide","(E/Z)-Geranylacetone","Terpinene")
FL_musty <- c("Terpinene","Sulcatone")
FL_sweet <- c("Fructose","Linalool oxide","Glucose","GTR", "TA","pH","γ-Octalactone")

# Filter labels and plot

FL_label1 <- FL_label %>%
  filter(!label %in% FL_bitterness & trait=="bitterness_FL"|
           !label %in% FL_intensity & trait == "flavour_intensity_FL" |
           !label %in% FL_floral & trait == "floral_FL"|
           !label %in% FL_musty & trait == "musty_FL"|
           !label %in% FL_sweet & trait == "sweetness_FL")


# Plot flavour trait variables
ggplot(all_plot_data %>% filter(sensory_group == "FL")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "FL") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "FL") %>% select(y))) +

  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = FL_label1,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = -0,
                  vjust = -0.5,
                  size = 3,
                  box.padding = 0.1) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 5: Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes

# First filter for AT labels -------
AT_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_min(y, n=4), 
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_max(y, n=2),
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_max(x, n=7)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Plot aftertaste trait variables
ggplot(all_plot_data %>% filter(sensory_group == "AT")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "AT") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "AT") %>% select(y))) +

  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = AT_label,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = 0,
                  vjust = 0,
                  size = 3,
                  box.padding = 0.15) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

(#fig:aftertaste metabolites)Variation in papaya aftertaste characteristics explained by targeted metabolites in 27 papaya genotypes.

#First filter for TX labels -------
  TX_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "TX") %>%
      group_by(trait) %>% slice_min(y, n=2), 
    all_plot_data %>% filter(sensory_group == "TX") %>% 
      group_by(trait) %>% slice_max(y, n=2),
    all_plot_data %>% filter(sensory_group == "TX") %>% 
      group_by(trait) %>% slice_max(x, n=2)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Plot TEXTURE trait variables
ggplot(all_plot_data %>% filter(sensory_group == "TX")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "TX") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "TX") %>% select(y))) +
  
  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = TX_label,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = 0,
                  vjust = 0,
                  size = 3,
                  box.padding = 0.15) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )

Figure 6: Variation in papaya texture characteristics explained by targeted metabolites in 27 papaya genotypes.

7 Modelling consumer liking

Trends within the individual consumer liking scores for each genotype were identified using the affinity propagation clustering algorithm (Frey and Dueck 2007). The grouping of individual consumers into three distinct clusters reveals the interaction between their regular papaya consumption habit and how they scored each genotype during tasting. Moreover, there are indications that multiple genotypes maybe be accepted into the Australian market by different clusters of consumers.

7.1 Cluster analysis

consumer_data <- read_xlsx("data/consumer_data.xlsx")

z <- consumer_data %>% 
  select(participant,liking,var) %>%
  pivot_wider(names_from = var, 
              values_from = liking)

# to plot the eps values
set.seed(11)
a = apcluster(negDistMat(r=2), x=z, q=0)

# cluster exemplars
clusters_eg <- a@exemplars %>% unname()
kableExtra::kbl(consumer_data[clusters_eg,],
      col.names = c("Participant", "Genotype", "Flesh Colour","Liking Score","Like Comment","Dislike Comment","Gender","Age","Consumption Habit"),
      caption = "Cluster exemplars generated using apcluster")

> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

# consumer clusters
c1 <- a@clusters[[1]]%>% unname()
c2 <- a@clusters[[2]]%>% unname()
c3 <- a@clusters[[3]]%>% unname()

consumer_clusters <- z %>% mutate(Cluster = c(1:125))
consumer_clusters$Cluster <- as.character(car::recode(consumer_clusters$Cluster,"c1='1';c2='2';c3='3'"))

# attach cluster membership to consumer data
cluster_data <- 
  left_join(consumer_clusters, 
            consumer_data %>% select(participant,Gender,age,habit) %>% distinct()) %>% 
  relocate(Cluster:habit, .before = RB1) %>% 
  select(-participant)

# Summarise cluster demographic data using fisher test and kruskal
cluster_membership <- cluster_data %>% select(1:4)

tab1 <- CreateTableOne(vars = c("Gender", "age", "habit"),
                       strata = "Cluster",
                       data = cluster_membership,
                       factorVars = c("Gender", "habit"))



# summarise cluster liking for each genotype

cluster_summary <- consumer_clusters %>% select(-participant) %>%
  pivot_longer(1:9,names_to = "var",values_to = "liking")

# ANOVA and Post-hoc 
cluster_aov <- aov(liking ~ var * Cluster, data = cluster_summary)
cluster_ph <- SNK.test(cluster_aov, c("var","Cluster"))

# Extract data for plotting
aov_plot <- 
  left_join(cluster_ph$groups %>% rownames_to_column(var="var"),
            cluster_ph$means %>% rownames_to_column(var="var")) %>% 
  separate(var,
           into = c("var", "cluster"),
           sep = ":")
 
# second y-axis values
lam_breaks <- c(0, 13, 23, 33, 45, 50, 55, 67, 77, 87, 100)
lam_labels <- c('Greatest imaginable dislike',
                'Dislike extremely',
                'Dislike very much',
                'Dislike moderately',
                'Dislike slightly',
                'Neither like nor dislike',
                'Like slightly',
                'Like moderately',
                'Like very much',
                'like extremely',
                'Greatest imaginable like')
ggplot(aov_plot) +
aes(x = reorder(var,-liking), 
    y = liking, 
    group = interaction(var, cluster), 
    color = cluster, fill = cluster) + 
  # Add boxplot-style elements
  geom_boxplot(
    aes(ymin = Min,
        lower = Q25,
        middle = Q50,
        upper = Q75,
        ymax = Max),
    stat = "identity",
    alpha = 0.3,  
    width = 0.9,
    position = position_dodge(0.9)) +
  # Add post-hoc letters
  geom_text(aes(label = groups, y = Q50 + 2),
            position = position_dodge(0.9),
            size = 2.5) +
  # Customize colors
scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster') +
scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster',
                  labels= parse(text = c(
                    "1~(italic(n)==61)",
                    "2~(italic(n)==24)",
                    "3~(italic(n)==40)"))) +
  # remove letter in legend
  guides(color = "none") +
  # Customize theme
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(fill = NA, color = 'black'),
        legend.position = 'bottom', 
        legend.box = "horizontal",   
        legend.direction = "horizontal",  
        legend.margin = margin(t = 0, b = 10),  
        legend.title = element_text(size = 10),  
        legend.text = element_text(size = 9)) +
  # Add labels
  labs(x = 'Papaya Genotype',
       y = 'Consumer Liking Score (0-100)') +
  # Add secondary y-axis for LAM scale with no expansion
  scale_y_continuous(
    limits = c(0, 100),
    expand = c(0, 0), 
    sec.axis = sec_axis(
      ~.,
      breaks = lam_breaks,
      labels = lam_labels,
      name = 'Labeled Affective Magnitude (LAM) Scale'))

Consumer liking score for nine papaya genotypes recorded by 125 untrained panelists. Each panelist was grouped into a cluster based on the affinity propagation clustering algorithm. There is an interaction between geotype liking scores and cluster membership (ANOVA; p <= 0.001). The differences between genotype liking scores among all clusters are indicated by different letters (SNK: p <= 0.05).

Figure 7: Consumer liking score for nine papaya genotypes recorded by 125 untrained panelists. Each panelist was grouped into a cluster based on the affinity propagation clustering algorithm. There is an interaction between geotype liking scores and cluster membership (ANOVA; p <= 0.001). The differences between genotype liking scores among all clusters are indicated by different letters (SNK: p <= 0.05).

# very convoluted way to make it look good in html
tab_df <- 
  as.data.frame(print(tab1)) %>%
  rownames_to_column(var = "Consumer Cluster") %>% 
  gt()  %>% 
  tab_caption("Breakdown of demographic data and papaya consumption habits within each consumer cluster.") %>% 
  tab_footnote(
    footnote = "Kruskal wallis was applied to age and fishers test was applied to gender and papaya consumption habit.",
    locations = cells_column_labels(columns = "Consumer Cluster")
  )

>                         Stratified by Cluster
>                          1             2             3             p      test
>   n                         61            24            40                    
>   Gender (%)                                                        0.541     
>      Female                 43 (70.5)     15 (62.5)     23 (57.5)             
>      Male                   17 (27.9)      9 (37.5)     16 (40.0)             
>      Non-binary              1 ( 1.6)      0 ( 0.0)      0 ( 0.0)             
>      Unspecified             0 ( 0.0)      0 ( 0.0)      1 ( 2.5)             
>   age (mean (SD))        36.56 (10.90) 35.92 (12.60) 36.17 (10.70)  0.968     
>   habit (%)                                                         0.002     
>      1-3 times per month    23 (37.7)      3 (12.5)     20 (50.0)             
>      2-4 times a week        3 ( 4.9)      1 ( 4.2)      2 ( 5.0)             
>      never                   2 ( 3.3)      8 (33.3)      2 ( 5.0)             
>      once a week             6 ( 9.8)      1 ( 4.2)      2 ( 5.0)             
>      rarely                 27 (44.3)     11 (45.8)     14 (35.0)

# # Create flextable
# ft <- flextable(tab_df) %>%
#   theme_vanilla() %>%
#   autofit()
# Display the table
# ft

  # PCA clusters as variables -----
pca_data <- cluster_data %>%
  pivot_longer(5:13,names_to = "var",values_to = "liking") %>%
  group_by(Cluster,var) %>%
  summarise(mean_liking = mean(liking)) %>%  
  pivot_wider(names_from = "Cluster",values_from = "mean_liking") %>% 
  column_to_rownames(var = "var")

plot_data <- PCA(pca_data, graph = F)
# make a dataframe with variable coordinates for arrows
PCAvar_coords <- plot_data$var$coord %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% mutate(var=paste("Cluster",var))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(plot_data, element = "var", 
                       result = c("coord", "contrib", "cos2"))
colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(plot_data$ind$coord[, drop = FALSE], stringsAsFactors = TRUE)
colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))
# update the variable coordinates
PCAvar_plot_data <- PCAvar_coords %>% as_tibble() %>% 
  mutate(across(starts_with("Dim"), ~r*0.8*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- plot_data$ind$coord %>% 
  as.data.frame() %>% 
  rownames_to_column("Genotype2") %>% 
  left_join(genotype_table %>% select(Genotype2,Flesh) %>% distinct())

# choose colours
my_colours <- c("#E69F00", "#56B4E9", "#009E73")

# Get component variance information for axis labels
percentVar <- plot_data$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(PCAvoctable) + 
  aes(x = Dim.1, y = Dim.2, fill=Flesh) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(xend=Dim.1, yend=Dim.2, colour = var), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.5,
               arrow = arrow(angle = 30, length = unit(0.2, "cm")),
               lineend = "round",linejoin = "mitre") +
  geom_text_repel(mapping = aes(x=Dim.1, y=Dim.2, label=var),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=-0.5, size=3.2, colour= my_colours) + 
  scale_fill_manual(values = c("#E31A1C","gold2")) +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster',
                  labels= parse(text = c(
                    "1~(italic(n)==61)",
                    "2~(italic(n)==24)",
                    "3~(italic(n)==40)"))) +
  geom_point(shape = 21,alpha = 0.9, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12),
                     plot.title = element_text(hjust = 0.5)) +
  geom_text_repel(label = PCAvoctable$Genotype2,
                  size=3, hjust=-0.15, vjust=-0.3,
                  colour = "black", fontface = "italic") +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), 
        fill="Flesh Colour", 
        colour="Consumer Cluster")

Principal component analysis of the liking scores for nine papaya genotypes reported by three distinct consumer clusters (consumer clusters were generated using agglomerative clustering algorithm): Cluster 1, n = 61; Cluster 2, n = 21; Cluster 3, n = 40.

Figure 8: Principal component analysis of the liking scores for nine papaya genotypes reported by three distinct consumer clusters (consumer clusters were generated using agglomerative clustering algorithm): Cluster 1, n = 61; Cluster 2, n = 21; Cluster 3, n = 40.

There is an inherent limitation in the dataset whereby fruit included in the consumer study was not able to be used for chemical analysis. This is due to the logistical issues regarding adequate fruit collection and processing: the fruit that was collected was completely consumed by the panelists. To deal with this limitation best linear unbiased estimates were combined with previously collected sensory and metabolite data to represent the broad differences between genotype liking scores. This caveat likely leads to the underestimation of complex nuances between variables that influence papaya fruit liking. Therefore, the model generated from this data highlights larger variable variation between genotypes.

liking_fit <- lmer(liking ~ var +
                     (1|participant),
                   data = consumer_data)

blue_df <- 
  as.data.frame(
    emmeans(liking_fit, 
            specs = "var")) %>% 
  rename("Genotype2" = "var") 

model_data <- 
  left_join(PCA_data %>% rownames_to_column(var = "ID"), genotype_table) %>% 
  left_join(blue_df) %>%
  filter(!is.na(emmean)) %>% 
  column_to_rownames(var = "ID") %>% 
  select(-c(assay:Genotype,SE:upper.CL,label:Genotype2)) %>% 
  relocate(Flesh, .after = "emmean") %>% 
  select(-c(Decanal,Methyl_hexanoate,Octanal,E_2_nonal, Nonanal)) 

kbl(blue_df, caption = "Estimated marginal means from the mixed effects model including genotype as a fixed effect and participant as a random effect", 
      col.names = c("Genotype","Least-squares mean",
                    colnames(blue_df)[3:6]),
      align = c("l",rep("c",5))) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(0, bold = TRUE)

> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

Table 5: Estimated marginal means from the mixed effects model including genotype as a fixed effect and participant as a random effect
Genotype	Least-squares mean	SE	df	lower.CL	upper.CL
1B	39.664	1.66757	607.539	36.3891	42.9389
PBL14	56.832	1.66757	607.539	53.5571	60.1069
PBL15	50.784	1.66757	607.539	47.5091	54.0589
PBL16	62.720	1.66757	607.539	59.4451	65.9949
PBL21	70.104	1.66757	607.539	66.8291	73.3789
PBL9	70.368	1.66757	607.539	67.0931	73.6429
RB1	71.888	1.66757	607.539	68.6131	75.1629
Solo	64.888	1.66757	607.539	61.6131	68.1629
Sunlight 1	65.992	1.66757	607.539	62.7171	69.2669

Models indicating variable influence on fruit liking can be produced from red-fleshed and yellow-fleshed data separately; however, with limited power due to sample size. Moreover, the liking scores are more accurately compared as a whole because the consumer participants tasted all samples in one sitting, as opposed to only red-fleshed and then only yellow-fleshed samples.

Interestingly, it became clear in consumer tasting that PBL9 wasn’t an inherently bitter genotype; although it was thought to be from earlier sensory panels. Due to the constants of genotype liking, any samples displaying uncharacteristic flavour (likely due to poor fruit maturation) cause significant inaccuracies in modelling efforts. These samples were detected by comparing genotypes with equivalent liking scores. Therefore, samples were removed for further modelling: 03-302 and 03-669.

# find the source of confusion in cosumer liking modelling - filter for RS5/T2-6-5.27.12 and equivalently liked genotypes
sample_review <- full_data_matrix %>% 
  select(-(5:40)) %>% 
  filter(Genotype %in% c("PBL9","RB1")) %>% 
  column_to_rownames(var = "ID")

# PCA
pca_data <- PCA(sample_review, quali.sup = 1:4, graph = F)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- pca_data$var$coord %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  slice_max(abs(Dim.1)+abs(Dim.2), n = 10) %>% 
  left_join(sensory_meta) %>% 
  mutate(Category = snakecase::to_title_case(group),
         sensory_label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var)))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(pca_data, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(pca_data$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
                    as_tibble() %>% 
                    mutate(across(starts_with("Dim"), ~r*0.7*.x)) # this number scales arrow length

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- pca_data$ind$coord %>% 
               as.data.frame() %>% 
               rownames_to_column("ID") %>% 
               left_join(genotype_table)

# choose colours
my_colours <- c("Aug22" = "#E69F00", "Feb23" = "#56B4E9", "Feb24" = "#009E73")


# Get component variance information for axis labels
percentVar <- pca_data$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(PCAvoctable) + 
  aes(x = Dim.1, y = Dim.2, fill = assay, shape = Genotype2) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=0.20, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090")) +
  geom_point(alpha = 1, size=2.5,color = "black") +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = PCAvoctable$ID, 
                  size=3, hjust=-0.2, vjust=0.5, box.padding = 0.2,
                  colour = "black", fontface = "italic") +
   scale_shape_manual(values = c(21,22,24)) + scale_fill_manual(values = my_colours,
                                                                guide = guide_legend(override.aes = list(shape = 22, size = 6))) +
   guides(colour=guide_legend(override.aes = list(label="")))+
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Harvest Month", 
        colour="Attribute", shape = "Genotype")

PCA Bi-plot comparing red-fleshed papaya genotypes RB1, Hybrid 6 and RS5. All three genotypes received similar liking scores from 125 untrained panelists.

Figure 9: PCA Bi-plot comparing red-fleshed papaya genotypes RB1, Hybrid 6 and RS5. All three genotypes received similar liking scores from 125 untrained panelists.

# issue with 2 samples during sensory: overly bitter and less sweet than expected
bitter_samples <- c("03-302","03-669")

# update model data
model_data <- model_data[!rownames(model_data) %in% bitter_samples,]

After these two samples were removed from the dataset, modelling the effect of sensory traits and metabolite traits on papaya liking scores better reflects what was observed. Similar to before gbm and bayesA were implemented to determine how the different variable groups influence papaya liking.

# Function to run models for a specific flesh type
run_models <- function(data, flesh_type, met_cols, sens_cols, y_col = 41) {
  # 
  # Filter and prepare data
  pap <- data %>%
    filter(grepl(flesh_type, Flesh, perl = TRUE)) %>%
    select(-Flesh) %>% 
    scale()

  # Define model parameters based on flesh type
  model_params <- if(flesh_type == "Yellow") {
    list(
      n_trees = c(500),
      int_depth = c(5),
      shrinkage = c(0.01),
      min_obs = c(1),
      cv_folds = 5,
      bglr_iter = 30000,
      bglr_burnin = 10000
    )
  } else {
    list(
      n_trees = c(500, 750, 1000),
      int_depth = c(5, 7, 10),
      shrinkage = c(0.01, 0.02, 0.05),
      min_obs = c(3, 5),
      cv_folds = 10,
      bglr_iter = 30000,
      bglr_burnin = 10000
    )
  }

  # Function to run a single model set (metabolites or sensory)
  run_model_set <- function(cols, type) {
    # Prepare data
    carX <- as.matrix(cbind(Y = pap[,y_col], pap[,cols]))

    # GBM
    grid <- expand.grid(
      .n.trees = model_params$n_trees,
      .interaction.depth = model_params$int_depth,
      .shrinkage = model_params$shrinkage,
      .n.minobsinnode = model_params$min_obs
    )

    gbm_model <- train(
      Y ~ .,
      data = carX,
      method = "gbm",
      trControl = trainControl(method = "cv", number = model_params$cv_folds),
      tuneGrid = grid
    )

    # BGLR
    bglr_model <- BGLR::BGLR(
      y = pap[,y_col],
      ETA = list(list(X = pap[,cols], model = 'BayesA')),
      nIter = model_params$bglr_iter,
      burnIn = model_params$bglr_burnin,
      thin = 100,
      saveAt = 'data/',
      df0 = 5,
      R2 = 0.5
    )

    # Combine results
    bind_rows(
      varImp(gbm_model)$importance %>%
        rownames_to_column("var") %>%
        mutate(model = paste0(flesh_type, "_", type, "_gbm")),
      tibble(
        var = colnames(pap[,cols]),
        Overall = bglr_model$ETA[[1]]$b,
        model = paste0(flesh_type, "_", type, "_beta")
      )
    ) %>%
      rename(coord = Overall)
  }

  # Run both model sets and combine results
  bind_rows(
    run_model_set(met_cols, "mets"),
    run_model_set(sens_cols, "sens")
  )
}

# run functions
set.seed(11)
# flesh_types <- c("Red", "Yellow", "Red_Yellow")
flesh_types <- c("Red", "Yellow", "Red|Yellow")
met_cols <- 1:22
sens_cols <- 23:40

results <- flesh_types %>%
  map(~run_models(model_data, ., met_cols, sens_cols)) %>%
  list_rbind() %>% 
  select(var, model, coord) %>% 
  mutate(model = sub("|", "_", model, fixed = TRUE))
save(results, file = "data/liking_var_model.RData")

# liking variable plots
create_importance_plot <- function(data,
                                 data_type = c("mets", "sens"),
                                 variety = "Red & Yellow",
                                 label_data,
                                 color_palette,
                                 group_breaks,
                                 facet_by_variety = FALSE,
                                 facet_scales = "free_y",
                                 facet_ncol = NULL,
                                 facet_theme = NULL) {
  # Input validation
  data_type <- match.arg(data_type)

  # Common theme settings
  plot_theme <- theme_bw() +
    theme(
      plot.background = element_blank(),
      panel.background = element_blank(),
      panel.border = element_rect(linetype = "solid", colour = "Black"),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.title = element_text(size = 11),
      axis.title.x = element_text(vjust = -2),
      axis.title.y = element_text(vjust = 2),
      axis.text = element_text(size = 8),
      legend.background = element_blank(),
      strip.background = element_blank(),
      strip.text = element_text(size = 11, face = "bold"),
      panel.spacing.y = unit(0, "cm", data = NULL),
      legend.position = "bottom"
    )

  # Filter data based on type and variety
  wts_filtered <- data %>%
    filter(data_type == !!data_type) %>%
    {if(!facet_by_variety) filter(., variety == !!variety) else .}

  # Filter labels
  labels_filtered <- label_data %>%
    filter(data_type == !!data_type) %>%
    {if(!facet_by_variety) filter(., variety == !!variety) else .}

  # Create base plot
  p <- ggplot(wts_filtered) +
    aes(x = x, y = y, fill = group, colour = group, label = label) +
    geom_hline(
      yintercept = 0,
      alpha = 0.6,
      linetype = "dashed",
      linewidth = 0.5
    ) +
    plot_theme +
    xlim(0, 100) +
    {if(!facet_by_variety) ylim(min(wts_filtered$y), max(wts_filtered$y))} +
    geom_point(shape = 21, alpha = 0.7, size = 2.5) +
    geom_text_repel(
      data = labels_filtered,
      aes(x = x, y = y, label = label, color = group),
      hjust = -0.1, vjust = 0, size = 3, box.padding = 0.10
    ) +
    scale_fill_manual(breaks = group_breaks, values = color_palette) +
    scale_color_manual(breaks = group_breaks, values = color_palette) +
    guides(colour = "none") +
    labs(
      x = "Variable importance (GBM)",
      y = "β coefficient (BayesA)",
      fill = ""
    )

  # Add faceting if requested
  if(facet_by_variety) {
    p <- p +
      facet_wrap(~variety,
                 scales = facet_scales,
                 ncol = facet_ncol) +
      {if(!is.null(facet_theme)) facet_theme else
        theme(
          strip.background = element_rect(fill = "white"),
          strip.text = element_text(size = 11, face = "bold")
        )
      }
  }

  return(p)
}

# load results
load("data/liking_var_model.RData")
# Prepare the data
wts <- results %>%
  left_join(rbind(class_key, sensory_meta), by = "var") %>%
  mutate(
    variety = case_when(
      str_detect(model, "Red_Yellow") ~ "Red & Yellow",
      str_detect(model, "Red") ~ "Red",
      str_detect(model, "Yellow") ~ "Yellow"
    ),
    data_type = str_extract(model, "mets|sens"),
    model = str_extract(model, "gbm|beta"),
    label = str_replace_all(label,"total sugar","Total sugar")
  ) %>%
  pivot_wider(names_from = model, values_from = coord) %>%
  rename(x = gbm, y = beta)

# Define colors
my_colours <- c(
  "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22"
)

# Prepare labels
mets_labels <- rbind(
  wts %>% group_by(variety, data_type) %>% slice_min(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(y, n = 7),
  wts %>% group_by(variety, data_type) %>% slice_max(x, n = 6)
) %>% distinct()

sens_labels <- rbind(
  wts %>% group_by(variety, data_type) %>% slice_min(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(y, n = 7),
  wts %>% group_by(variety, data_type) %>% slice_max(x, n = 6)
) %>% distinct()

# Define group breaks
mets_breaks <- c("Carotenoid/Terpenes", "Lipid-derived", "Phenylalanine-derived",
                 "Sugars", "Acids", "Other")
sens_breaks <- c("Flavour", "Aftertaste", "Aroma", "Texture")

# metabolite variable plot
metabolites_plot<- create_importance_plot(
  data = wts,
  data_type = "mets",
  variety = "Red & Yellow",
  label_data = mets_labels,
    color_palette = my_colours,
  group_breaks = mets_breaks,
  facet_by_variety = FALSE
)
# sensory variable plot
Sensory_plot <- create_importance_plot(
  data = wts,
  data_type = "sens",
  variety = "Red & Yellow",
  label_data = sens_labels,
  color_palette = my_colours,
  group_breaks = sens_breaks,
  facet_by_variety = FALSE
)

metabolites_plot

Metabolite characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Figure 10: Metabolite characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Sensory_plot

Sensory characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Figure 11: Sensory characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Variance decomposition analysis is implemented to understand the variation in liking scores between the genotypes based on class groupings for sensory and metabolite variables. Such as taste or aroma for sensory characteristics and terpenes (volatile organic compound) or sugars (non-volatile organic compound) for metabolite variables. This provides a broader sense of the variables influencing fruit liking scores.

# update model data
model_data <- model_data %>% select(-Flesh)

# Function to create Gaussian kernel matrix------------
create_kernel_matrix <- function(data, vars) {
    W <- data %>%
        select(all_of(vars)) %>%
        as.matrix()

    G <- as.matrix(dist(W, method="euclidean", diag=TRUE, upper=TRUE))
    K <- 1 - G/max(G, na.rm=TRUE)
    return(K)
}

# Main variance decomposition function
variance_decomp <- function(data, var_groups, response = "emmean") {
    # Create list to store kernel matrices
    kernel_matrices <- list()

    # Generate kernel matrix for each group
    for(group_name in names(var_groups)) {
        kernel_matrices[[group_name]] <- create_kernel_matrix(
            data,
            var_groups[[group_name]]
        )
    }

    # Create ETA list for BGLR
    ETA <- lapply(kernel_matrices, function(K) {
        list(K = K, model = 'RKHS')
    })

    # Extract response variable
    
    y <- scale(data)[,response]

    # Fit BGLR model
    fm <- BGLR::BGLR(
        y = y,
        ETA = ETA,
        nIter = 30000,
        burnIn = 10000,
        thin = 100,
        R2 = 0.5,
        saveAt = "data/"
    )

    # Calculate variance components and proportions
    varE <- fm$varE  # Residual variance
    var_components <- sapply(fm$ETA, function(x) x$varU)
    total_var <- sum(var_components) + varE

    # Calculate proportions for each group
    var_proportions <- c(var_components, 
                         residual = varE) / total_var

    # Create results list
    results <- list(
        model = fm,
        variance_components = var_components,
        residual_variance = varE,
        variance_proportions = var_proportions,
        total_variance = total_var
    )

    return(results)
}

# Create list of metabolite variables----------
included_vars <- class_key %>% 
  filter(var %in% colnames(model_data))

phe <- included_vars %>%
    filter(group == "Phenylalanine-derived" & !var == "GTR_conc") %>%
     select(var) %>% pull()
lip <- included_vars %>%
    filter(group == "Lipid-derived") %>%
    select(var) %>% pull()
car <- included_vars %>%
    filter(group == "Carotenoid/Terpenes") %>%
    select(var) %>% pull()
sugar <- included_vars %>%
    filter(group == "Sugars",
           !str_detect(var,"_prop")) %>%
    select(var) %>% pull()
acid <- c("TA")
GTR <- c("GTR_conc")

# Create groups list
mets_groups <- list(
    phe = phe,
    lip = lip,
    car = car,
    sugar = sugar,
    acid = acid,
    GTR = GTR
)

# Metabolites analysis
# mets_results <- variance_decomp(model_data, mets_groups, response = "emmean")
save(mets_results, file = "data/mets_results.RData")

# Create list of Sensory variables -------
include_vars <- sensory_meta %>% 
  filter(var %in% colnames(model_data))
AR <- include_vars %>%
    filter(group == "Aroma") %>%
    select(var) %>% pull()
TX <- include_vars %>%
    filter(group == "Texture") %>%
    select(var) %>% pull()
FL <- include_vars %>%
    filter(group == "Flavour") %>%
    select(var) %>% pull()
AT <- include_vars %>%
    filter(group == "Aftertaste",
           !str_detect(var,"_prop")) %>%
    select(var) %>% pull()

# Create groups list
sens_groups <- list(
    Aroma = AR,
    Texture = TX,
    Flavour = FL,
    Aftertaste = AT)

# Sensory analysis
sens_results <- variance_decomp(model_data, sens_groups, response = "emmean")
save(sens_results, file = "data/sens_results.RData")

load("data/sens_results.RData")
# Organize data with row groups
sens_table_data <- 
  data.frame(category = names(sens_results$variance_proportions),
             proportion = sens_results$variance_proportions) %>%
  mutate(proportion = round(proportion * 100, 1),
         category = str_replace(category,'residual', 'Residual'))
# write table
sens_table <- sens_table_data %>%
    gt() %>%
    tab_caption("Variance components analysis showing the proportion of variance in liking scores explained by sensory groups") %>% 
    # tab_header(
    #     title = "Variance Components Analysis",
    #     subtitle = "Proportion of Variance in Liking Scores Explained by Sensory Groups"
    # ) %>%
    fmt_percent(
        columns = proportion,
        decimals = 1,
        scale_values = FALSE
    ) %>%
    cols_label(
        category = "Sensory Group",
        proportion = "Variance Explained (%)"
    ) %>%
    tab_style(
        style = cell_borders(
            sides = "bottom",
            weight = px(2)
        ),
        locations = cells_row_groups()
    ) %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )

load("data/mets_results.RData")
# Organize data with row groups
mets_table_data <- 
  data.frame(category = names(mets_results$variance_proportions),
             proportion = mets_results$variance_proportions) %>%
  mutate(proportion = round(proportion * 100, 1),
         group_type = case_when(
           category %in% c("phe", "lip", "car") ~ "Volatiles",
           category %in% c("sugar", "acid", "GTR") ~ "Non-volatiles",
           TRUE ~ "Residual"),
         category = 
           str_replace_all(category,
                           c("phe" = "Phenylalanine-derived",
                             "lip" = "Lipid-derived",
                             "car" = "Carotenoid/Terpenes",
                             "sugar" = "Sugars",
                             "acid" = "Acids",
                             "GTR" = "GTR")))
# write table
mets_table <- mets_table_data %>%
    select(group_type, category, proportion) %>%
    gt(
        groupname_col = "group_type"
    ) %>%
    tab_caption("Variance components analysis showing the proportion of variance in liking scores explained by metabolite groups") %>% 
    # tab_header(
    #     title = "Variance Components Analysis",
    #     subtitle = "Proportion of Variance in Liking Scores Explained by Metabolite Groups"
    # ) %>%
    fmt_percent(
        columns = proportion,
        decimals = 1,
        scale_values = FALSE
    ) %>%
    cols_label(
        category = "Metabolite Group",
        proportion = "Variance Explained (%)"
    ) %>%
    summary_rows(
        groups = c("Non-volatiles", "Residual", "Volatiles"),
        columns = proportion,
        fns = list("Group Total" = ~sum(.)),
        fmt = ~ fmt_percent(., decimals = 1, scale_values = FALSE)
    ) %>%
    grand_summary_rows(
        columns = proportion,
        fns = list("Total" = ~sum(.)),
        fmt = ~ fmt_percent(., decimals = 1, scale_values = FALSE)
    ) %>%
    tab_style(
        style = cell_borders(
            sides = "bottom",
            weight = px(2)
        ),
        locations = cells_row_groups()
    ) %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )

Finally a predictive model was generated to predict papaya fruit liking in future selection processes. The variables included in the model were reduced to four to prevent overfitting considering the limited sample size: sulcatone (terpene), linalool oxide (terpene), γ-Octalactone (lipid-derived) and total sugars (sucrose, glucose and fructose; sugars). This model may be implemented for selection purposes if the sampling and instrumentation protocols are repeated. New samples may be indexed within the scores of samples used to build the model. Considering the model isn’t equipped for extrapolation, any values outside the ranges of what was used to build the model may distort the predicted liking score. Unscaled data was input to the model algorithm to make it more accessible for new data imputation.

explantory_vars <- function(data, num_vars = 4, target_var) {
  
  # Validate inputs
  if (!is.data.frame(data)) stop("data must be a data frame")
  if (!is.numeric(num_vars) || num_vars < 1) stop("num_vars must be a positive number")
  if (!target_var %in% colnames(data)) stop("target_var not found in data")
  if (num_vars >= ncol(data)) stop("num_vars must be less than number of predictor variables")
  
  # Get variable names, excluding the target variable
  variable_names <- colnames(data)
  variable_names <- variable_names[variable_names != target_var]

  # Generate all combinations
  combinations <- combn(seq_along(variable_names), num_vars)

  # Convert to named list of variable combinations
  variable_combinations <- setNames(
      lapply(1:ncol(combinations), function(i) {
          variable_names[combinations[,i]]
      }),
      paste("Model", 1:ncol(combinations))
  )

  return(variable_combinations)
}

# Function to run 5-fold cross-validation using hyperparameter grid search
run_xgb_cv <- function(params, dtrain, nrounds = 500, nfold = 5, early_stopping = 50) {
  cv <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = nrounds,
    nfold = nfold,
    early_stopping_rounds = early_stopping,
    verbose = FALSE,
    metrics = "rmse"
  )
  
  return(list(
    rmse = min(cv$evaluation_log$test_rmse_mean),
    iteration = cv$best_iteration
  ))
}

# Grid search function
grid_search_xgb <- function(param_grid, dtrain) {
  grid <- expand.grid(param_grid)
  
  results <- apply(grid, 1, function(params) {
    param_list <- list(
      objective = "reg:squarederror",
      eta = params["eta"],
      max_depth = params["max_depth"],
      subsample = params["subsample"],
      colsample_bytree = params["colsample_bytree"]
    )
    
    cv_result <- run_xgb_cv(param_list, dtrain)
    
    return(c(
      params,
      best_rmse = cv_result$rmse,
      best_iteration = cv_result$iteration
    ))
  })
  
  results_df <- as.data.frame(t(results))
  return(results_df)
}

# Function to prepare data for a specific variable combination
prepare_data <- function(data, vars, target_var) {
  boost_data <- data %>%
    select(all_of(c(target_var, vars)))
  
  set.seed(11)
  train_index <- createDataPartition(boost_data[[target_var]], p = 0.8, list = FALSE)
  train_data <- boost_data[train_index, ]
  test_data <- boost_data[-train_index, ]
  
  train_predictors <- as.matrix(train_data[, vars])
  test_predictors <- as.matrix(test_data[, vars])
  
  list(
    dtrain = xgb.DMatrix(data = train_predictors, label = train_data[[target_var]]),
    dtest = xgb.DMatrix(data = test_predictors, label = test_data[[target_var]]),
    test_data = test_data
  )
}

# Modified grid search function to return best parameters and performance metrics
run_model_combination <- function(data, vars, target_var) {
  # Prepare data
  prepared_data <- prepare_data(data, vars, target_var)
  
  # Grid search parameters
  param_grid <- list(
    eta = c(0.01, 0.05, 0.1),
    max_depth = c(2, 3, 4),
    subsample = c(0.8, 0.9, 1.0),
    colsample_bytree = c(0.7, 0.85, 1.0)
  )
  
  # Run grid search
  results <- grid_search_xgb(param_grid, prepared_data$dtrain)
  best_params <- results[which.min(results$best_rmse), ]
  
  # Train final model with best parameters
  set.seed(11)
  xgb_model <- xgboost(
    data = prepared_data$dtrain,
    nrounds = best_params$best_iteration,
    eta = best_params$eta,
    max_depth = best_params$max_depth,
    subsample = best_params$subsample,
    colsample_bytree = best_params$colsample_bytree,
    objective = "reg:squarederror",
    verbose = 0  # Suppress output
  )
  
  # Make predictions
  predictions <- predict(xgb_model, newdata = prepared_data$dtest)
  
  # Calculate performance metrics
  metrics <- postResample(pred = predictions, obs = prepared_data$test_data[[target_var]])
  
  # Add cleanup
  rm(xgb_model, prepared_data)
  gc()  # Force garbage collection
  
  # Return results
  list(
    variables = paste(vars, collapse = ", "),
    RMSE = metrics["RMSE"],
    Rsquared = metrics["Rsquared"],
    MAE = metrics["MAE"],
    best_param = best_params
  )
}

# Model comparison function
model_comparison <- function(data, variable_combinations, target_var) {
  total_models <- length(variable_combinations)
  pb <- txtProgressBar(min = 0, max = total_models, style = 3)
  
  results <- imap(variable_combinations, \(vars, name) {
    setTxtProgressBar(pb, as.numeric(gsub("Model ", "", name)))
    message(sprintf("Processing %s with variables: %s", name, paste(vars, collapse=", ")))
    tryCatch({
      result <- run_model_combination(data, vars, target_var)
      tibble(
        Model_ID = name,
        Variables = result$variables,
        RMSE = result$RMSE,
        R_squared = result$Rsquared,
        MAE = result$MAE,
        params = result$best_param,
        Error = NA_character_
      )
    }, error = function(e) {
      tibble(
        Model_ID = name,
        Variables = paste(vars, collapse=", "),
        RMSE = NA_real_,
        R_squared = NA_real_,
        MAE = NA_real_,
        params = NA_character_,
        Error = as.character(e)
      )
    })
  }) 
  close(pb)
  return(bind_rows(results))
}

# prep data
model_data1 <- model_data %>% select(1:22, "emmean")
target_var <- "emmean"
# Execute models
variable_combinations <- explantory_vars(model_data1, num_vars = 4, target_var)
model_results <- model_comparison(model_data1, variable_combinations, target_var)
# Write results
flattened_results <- model_results %>% unnest(params)
writexl::write_xlsx(flattened_results,"results/model_results.xlsx")

# Function to run 5-fold cross-validation using hyperparameter grid search
run_xgb_cv <- function(params, dtrain, nrounds = 500, nfold = 5, early_stopping = 50) {
  cv <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = nrounds,
    nfold = nfold,
    early_stopping_rounds = early_stopping,
    verbose = FALSE,
    metrics = "rmse"
  )

  return(list(
    rmse = min(cv$evaluation_log$test_rmse_mean),
    iteration = cv$best_iteration
  ))
}

grid_search_xgb <- function(param_grid, dtrain) {
  # Create grid of all combinations
  grid <- expand.grid(param_grid)

  # Apply CV to each parameter combination
  results <- apply(grid, 1, function(params) {
    param_list <- list(
      objective = "reg:squarederror",
      eta = params["eta"],
      max_depth = params["max_depth"],
      subsample = params["subsample"],
      colsample_bytree = params["colsample_bytree"]#,
      # min_child_weight = params["min_child_weight"],
      # gamma = params["gamma"] 
    )

    cv_result <- run_xgb_cv(param_list, dtrain)

    return(c(
      params,
      best_rmse = cv_result$rmse,
      best_iteration = cv_result$iteration
    ))
  })

  # Convert results to dataframe
  results_df <- as.data.frame(t(results))

  return(results_df)
}

# Data set up
boost_data <- model_data %>%
    select(emmean, total_sugar, Sulcatone, Linalool, Beta_ionone)

# Use createDataPartition instead of sample
set.seed(11)
train_index <- createDataPartition(boost_data$emmean, p = 0.8, list = FALSE)
train_data <- boost_data[train_index, ]
test_data <- boost_data[-train_index, ]

# Create matrices consistently
train_predictors <- as.matrix(train_data[, -1])
test_predictors <- as.matrix(test_data[, -1])
dtrain <- xgb.DMatrix(data = train_predictors, label = train_data$emmean)
dtest <- xgb.DMatrix(data = test_predictors, label = test_data$emmean)


# Grid search parameters
  # param_grid <- list(
  #   eta = c(0.01, 0.05, 0.1),
  #   max_depth = c(2, 3, 4),
  #   subsample = c(0.8, 0.9, 1.0),
  #   colsample_bytree = c(0.7, 0.85, 1.0)
  # )

param_grid <- list(
    eta = seq(0.06,0.15,0.01),
    max_depth = c(3),
    subsample = c(1),
    colsample_bytree = c(1)#,
    # min_child_weight = c(1, 3),
    # gamma = c(0, 0.1) 
)

# Grid search results
set.seed(11)
results <- grid_search_xgb(param_grid, dtrain)
best_params <- results[which.min(results$best_rmse),]

# Best params
#     eta max_depth subsample colsample_bytree best_rmse best_iteration
# 10 0.15         3         1                1  5.103844             34

# run model with best parameters.
set.seed(11)
xgb_model <- xgboost(data = dtrain,
                     nrounds = best_params$best_iteration,
                     eta = best_params$eta,
                     max_depth = best_params$max_depth,
                     subsample = best_params$subsample,
                     colsample_bytree = best_params$colsample_bytree,
                     objective = "reg:squarederror",
                     verbose = 0)  # Suppress output

# Make predictions on test set
predictions <- predict(xgb_model, newdata = dtest)
# Feature Importance
importance_matrix <- xgb.importance(feature_names = colnames(train_data[,-1]),
                                    model = xgb_model)

save(dtrain, dtest, results, predictions, xgb_model, importance_matrix, 
     file = "data/xgboost_train_results.RData")

load("data/xgboost_train_results.RData")
# Model evaluation
postResample(pred = predictions, obs = test_data[,"emmean"]) %>%
  as.data.frame() %>% rownames_to_column(var="Metric") %>% gt() %>%
    tab_caption("XGBoost performance metrics")

Table 6: XGBoost performance metrics
Metric	.
RMSE	1.4845620
Rsquared	0.9895691
MAE	1.3567530

# Create a scatter plot of predicted vs actual values for test data
ggplot(data.frame(Predicted = predictions, 
                        Actual = test_data[,"emmean"]), 
             aes(x = Actual, y = Predicted)) + 
  stat_poly_line(color = "firebrick4", fill = "red3") + 
  stat_poly_eq() + theme_tufte() + 
  geom_point(color = "gray15", alpha = .5, shape = 20, size = 3.5) +
  labs( x="Observed Liking", y="Predicted Liking", title = "Predicted Liking ~ Observed Liking") +
  theme(plot.title =  element_text(hjust = 0.5), axis.title = element_text(size=14),title = element_text(size = 14))

Figure 12: XGBoost model predicted liking vs observed liking based on concentrations of sulcatone, Beta ionone, linalool and total sugar.

xgb.plot.importance(importance_matrix,
                    main = "Feature Importance Plot")

Figure 13: XGBoost variable importance

importance_matrix %>% mutate(Feature = str_replace_all(Feature,c("_"=" ","total"="Total"))) %>% 
    gt() %>%
    tab_caption("XGBoost importance of features contributing to liking"
    ) %>%
    # tab_header(
    #     title = "XGBoost Feature importance",
    #     subtitle = "Features contributing to liking"
    # ) %>% 
    fmt_percent(
        columns = everything(),
        decimals = 1, 
        scale_values = TRUE
    )  %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )

Table 7: XGBoost importance of features contributing to liking
Feature	Gain	Cover	Frequency
Sulcatone	62.7%	57.5%	44.4%
Beta ionone	23.5%	7.6%	9.3%
Linalool	8.3%	21.3%	24.1%
Total sugar	5.5%	13.6%	22.2%

8 Data Summary

The summarised volatile concentrations are reported below

> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")

Table 8: Summary of VOC concentrations (ppb) by chemical species
Species	CAS Number	RT (min)	RI	Chemical Class	n	Mean	SD	SE	Median	Min	Max	N detected	% detected
Hexanal	66-25-1	22.22	809	Aldehyde	118	18.38	21.25	1.96	9.84	0.03	85.25	118	100.00
Heptanal	111-71-7	26.91	912	Aldehyde	118	32.05	24.52	2.26	23.41	0.00	112.15	117	99.15
Methyl_hexanoate	106-70-7	27.69	929	Ester	118	42.84	199.13	18.33	0.00	0.00	2002.82	49	41.53
Benzaldehyde	100-52-7	30.3	988	Aldehyde	118	14.06	14.68	1.35	7.81	1.77	74.75	118	100.00
Sulcatone	110-93-0	30.53	993	Ketone	118	62.88	84.21	7.75	46.03	0.00	629.09	117	99.15
Octanal	124-13-0	31.43	1013	Aldehyde	118	7.05	5.93	0.55	5.22	0.00	33.49	117	99.15
P_cymene	99-87-6	32.89	1047	Terpene	118	5.20	3.25	0.30	5.11	0.00	20.54	112	94.92
Limonene	5989-27-5	33.15	1054	Terpene	118	26.41	39.93	3.68	12.14	0.00	250.14	87	73.73
Phenylacetaldehyde	122-78-1	33.70	1067	Aldehyde	118	0.00	0.00	0.00	0.00	0.00	0.00	0	0.00
Terpinene	99-85-4	34.22	1079	Terpene	118	10.53	15.37	1.41	5.92	0.00	105.47	105	88.98
Linalool_oxide	34995-77-2	34.77/35.39	1092/1108	Terpene	118	52682.49	108929.32	10027.76	7529.00	0.00	747850.87	116	98.31
Linalool	78-70-6	35.56	1112	Terpene	118	7313.67	18543.25	1707.04	682.34	0.00	112060.59	116	98.31
Nonanal	124-19-6	35.65	1114	Aldehyde	118	9.50	7.91	0.73	8.09	0.43	36.41	118	100.00
Methyl_octanoate	111-11-5	36.14	1126	Ester	118	2.21	3.80	0.35	1.38	0.00	20.87	99	83.90
E_2_nonal	18829-56-6	37.94	1172	Aldehyde	118	0.53	0.88	0.08	0.00	0.00	5.13	43	36.44
Decanal	112-31-2	39.59	1216	Aldehyde	118	0.39	1.20	0.11	0.00	0.00	7.83	18	15.25
Beta_Cyclocitral	432-25-7	40.98	1255	Terpene	118	2.24	2.94	0.27	0.00	0.00	10.01	48	40.68
Gamma_octalactone	104-50-7	41.85	1280	Ester	118	11.38	13.69	1.26	6.45	1.17	63.22	118	100.00
BITC	622-78-6	46.21	1414	Sulfur	118	4.53	7.13	0.66	2.62	0.00	67.27	91	77.12
Neryl_Geranyl_acetone	3796-70-1	48.11	1478	Terpene	118	39.89	42.79	3.94	32.89	2.65	312.72	118	100.00
Beta_ionone	79-77-6	49.4	1524	Terpene	118	6.81	4.93	0.45	5.26	3.47	50.19	118	100.00

Sugar values reported in g/100gFW; total soluble solids (brix) reported in °Brix; titratable acids reported in %CA; VOCs are reported in ppb; GTR reported in ppm.

Table 9: Variable means for 27 papaya genotypes (standard deviation)

9 General information

This document was last updated at 2025-06-20 14:07:44 using R Markdown (built with R version 4.2.1 (2022-06-23 ucrt)). The source code for this website can be found on https://github.com/JoshLomax/Papaya_sensory_metabolomics.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful at authoring documents and reports which include code and can execute code and use the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com, R Markdown: The Definitive Guide and Rmarkdown cheatsheet.

Bibliography

Brewer S, Plotto A, Bai J, et al. (2021) Evaluation of 21 papaya (Carica papaya L.) accessions in southern Florida for fruit quality, aroma, plant height, and yield components. Scientia Horticulturae 288:110387. doi: 10.1016/j.scienta.2021.110387

Colantonio V, Ferrão LFV, Tieman DM, et al. (2022) Metabolomic selection for enhanced fruit flavor. Proc Natl Acad Sci 119:e2115865119. doi: 10.1073/pnas.2115865119

Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315:972–976. doi: 10.1126/science.1136800

Friedman JH (2001) Greedy function approximation: A gradient boosting machine. Ann Stat 29:1189–1232. doi: 10.1214/aos/1013203451

Lomax J, Ford R, Bar I (2024) Multi-omic applications for understanding and enhancing tropical fruit flavour. Plant Mol Biol 114:83. doi: 10.1007/s11103-024-01480-7

Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829. doi: 10.1093/genetics/157.4.1819

Pico J, Gerbrandt EM, Castellarin SD (2022) Optimization and validation of a SPME-GC/MS method for the determination of volatile compounds, including enantiomeric analysis, in northern highbush blueberries (Vaccinium corymbosum L.). Food Chem 368:130812. doi: 10.1016/j.foodchem.2021.130812

Zhou Z, Bar I, Ford R, et al. (2022) Biochemical, Sensory, and Molecular Evaluation of Flavour and Consumer Acceptability in Australian Papaya (Carica papaya L.) Varieties. International Journal of Molecular Sciences 23:6313. doi: 10.3390/ijms23116313

Zhou Z, Ford R, Bar I, Kanchana-udomkan C (2021) Papaya (Carica papaya L.) Flavour Profiling. Genes 12:1416. doi: 10.3390/genes12091416