Metabolomics analysis of Papaya sensory profiles

1 Aims

The aim of this project was to determine the role of flavour related metabolites in papaya fruit flavour, focussing on taste and aroma. As demonstrated by Colantonio et al. (2022), metabolomic analysis of fruit can identify metabolite markers for fruit flavour selection in breeding programs. Subsequently, genetic markers may be developed, which are neccesary for papaya breeding due to the challenges associated with growing thousands of mature trees for phenotyping (Lomax et al. 2024). We aimed to build on the work reported by Zhou et al. (2021) and Zhou et al. (2022).

1.1 Objectives

  1. Examine the general differences in sensory and metabolite profiles between 27 papaya genotypes using principal component analysis.
  2. Apply machine learning models (GBM and BGLR) to analyse how specific metabolites contribute to aroma and flavor characteristics in papaya.
  3. Identify trends in consumer liking scores and consumer demographic information using a clustering algorithm (APcluster)
  4. Apply machine learning models (GBM and BGLR) to analyse the effect of sensory attributes and metabolite concentrations on papaya liking scores
  5. Generate a predictive machine learning model (XGBoost) to estimate fruit liking based on targeted metabolite concentrations

2 Methods

Sensory data was collected by a trained and untrained panel of tasters. The trained panel undertook standard quantitative descriptive analysis (QDA) for 27 papaya genotypes. The untrained panelists consisted of 125 consumers recruit for a “tropical fruit tasting”, who then scored nine disctinctly flavoured papaya genotypes using a labelled affective magnitude scale between 0 to 100 for liking.

Papaya fruit was collected from a single location (Dimbulah, QLD, AUS) on four separate occasions: August 2022 (QDA), February 2023 (QDA), January 2024 (Consumer) and February 2024 (QDA). The fruit was collected and treated aligned with industry management practices (e.g. ripening index 2-3, fungicide/ethylene, refrigerated truck transportation).

Chemical analysis of the fruit employed various methods, including traditional measurements for fruit quality (i.e. TSS, TA, pH, water content) and analytical methods (HS-SPME-GC-MS/MS, LC-MS/MS and IC-Au detector). To preserve the metabolite profile of the fruit after sensory preparation, the fruit was flash frozen and stored at -80. Further, the fruit subject to VOC analysis was prepared at temperatures below -80°C using liquid nitrogen. Samples for LC and IC were freeze dried and subject to solvent extraction: IC, sugars (ethanol); LC, GTR (methanol).

Target VOCs were selected based on the concentration ranges reported in Brewer et al. (2021), focusing on species that exceeded their odour threshold values in water (OAV > 1). The SPME optimisation and GC-MS conditions largely emulated the work done by Pico et al. (2022).

3 Data analysis preprocessing

The initial steps of the data analysis pipeline included:

  • Installing/loading required packages
# list of packages
required_packages <- c(
    "juba/rmdformats","kableExtra","gt","flextable","DT", "xfun", # Document formatting & tables

    "tidyverse","readxl","janitor", # Data wrangling

    "ggpmisc","ggthemes","ggrepel","geomtextpath","scales", # Plotting stuff

    "FactoMineR","factoextra", # PCA & visualization

    "car","agricolae","tableone", "summarytools", # Stats & summaries

    "BGLR","caret","gbm","xgboost","apcluster","lme4","emmeans" # Modeling & ML
)

# Install all packages
# pak::pak(required_packages)

# load all packages
pacman::p_load(char = basename(required_packages), install=FALSE)
  • Compiling meta data
# Load metadata from Excel
genotype_table <- 
  read_excel("data/papaya_sample_meta_data.xlsx",
             sheet = "meta_data")

# Create sensory metadata table
sensory_meta <- 
  data.frame(
    "var" = c(
        # Aroma attributes
        "aroma_intensity_AR", "sweet_fruit_AR", "musty_off_note_AR",
        "fishy_AR", "citrus_AR", "floral_AR",
        # Texture attributes
        "resistance_TX", "velvety_TX", "juiciness_TX", "dissolving_TX",
        "fibrous_TX",
        # Flavor attributes
        "flavour_intensity_FL", "sweetness_FL", "bitterness_FL",
        "musty_FL", "floral_FL",
        # Aftertaste attributes
        "bitter_AT", "sweet_AT", "metallic_AT", "prickly_AT"),
    
    "group" = c(
        rep("aroma", 6),
        rep("texture", 5),
        rep("flavour", 5),
        rep("aftertaste", 4))) %>% 
  
   mutate(group = snakecase::to_title_case(group),
          label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var)))

# Create classification key for VOCs
class_key <- 
  read_excel("data/papaya_sample_meta_data.xlsx",
             sheet = "VOCs") %>%
  select(-(Class:Class2)) %>%
  rename("var" = "variable") %>%
  mutate(
    Class3 = str_replace_all(Class3,
                            c("Ester" = "Lipid-derived"))
  ) %>%
  rename("group" = "Class3") %>%
  mutate(
    label = str_replace_all(var,
                           c("Neryl_Geranyl_acetone" = "(E/Z)-Geranylacetone",
                             "E_2_nonal" = "(E)-2-Nonanal",
                             "P_cymene" = "p-Cymene",
                             "Beta_Cyclocitral" = "β-Cyclocitral",
                             "Beta_ionone" = "β-Ionone",
                             "Gamma_octalactone" = "γ-Octalactone",
                             "GTR_conc" = "GTR",
                             "wet_fraction"="Water fraction",
                             "_" = " ")))

glimpse(genotype_table)
> Rows: 123
> Columns: 6
> $ ID        <chr> "01-050", "01-144", "01-217", "01-487", "02-081", "02-480", …
> $ label     <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1_Malay", …
> $ assay     <chr> "Feb24", "Feb24", "Feb24", "Feb24", "Feb24", "Feb24", "Feb24…
> $ Flesh     <chr> "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red…
> $ Genotype  <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1_Malay", …
> $ Genotype2 <chr> "Eksotika", "Eksotika", "Eksotika", "Eksotika", "F1 Malay", …
glimpse(sensory_meta)
> Rows: 20
> Columns: 3
> $ var   <chr> "aroma_intensity_AR", "sweet_fruit_AR", "musty_off_note_AR", "fi…
> $ group <chr> "Aroma", "Aroma", "Aroma", "Aroma", "Aroma", "Aroma", "Texture",…
> $ label <chr> "Aroma intensity", "Sweet fruit", "Musty off note", "Fishy", "Ci…
glimpse(class_key)
> Rows: 34
> Columns: 3
> $ var   <chr> "Hexanal", "P_cymene", "Terpinolene", "Sulcatone", "Beta_Cycloci…
> $ group <chr> "Lipid-derived", "Carotenoid/Terpenes", "Carotenoid/Terpenes", "
> $ label <chr> "Hexanal", "p-Cymene", "Terpinolene", "Sulcatone", "β-Cyclocitra…
  • Calculating VOC matrix effects
calculate_matrix_effect <- function(data,col_range) {
  data %>%
    pivot_longer(
      cols = all_of(col_range),
      names_to = "species",
      values_to = "peak_area"
    ) %>%
    group_by(Spiked, type, species) %>%
    summarise(mean_peak_area = mean(peak_area), .groups = "drop") %>%
    mutate(sample = paste(Spiked, type, sep = "_")) %>%
    select(-c(Spiked, type)) %>%
    pivot_wider(
      names_from = "sample",
      values_from = "mean_peak_area"
    ) %>%
    mutate(
      red_matrix_effect = (((T_Red - F_Red)/F_Standard) - 1)*100,
      yellow_matrix_effect = (((T_Yellow - F_Yellow)/F_Standard) - 1)*100
    )
}

# Read and process data
matrix_effect_low <- read_xlsx("data/voc_matrix_effect.xlsx", "Matrix low split") %>% 
  filter(!ID %in% c("R8","Y8")) %>% 
  calculate_matrix_effect(col_range = 4:28)

matrix_effect_high <- read_xlsx("data/voc_matrix_effect.xlsx", "matrix high split") %>%
  calculate_matrix_effect(col_range = 4:7)

# Combine and clean species names
matrix_effect <- rbind(matrix_effect_high, matrix_effect_low) %>%
  mutate(species = str_replace_all(species, "-", "_"))

kableExtra::kbl(matrix_effect[,c(1,7:8)],
      col.names = c("Flesh Color",
                    "Matrix effect red",
                    "Matrix effect yellow"),
      digits = 1,
      align = c('l', 'c', 'c'),
      caption = "matrix effect for VOC quantification") %>% 
  kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
Table 1: matrix effect for VOC quantification
Flesh Color Matrix effect red Matrix effect yellow
Cis_Linalool oxide -196.4 -123.0
Linalool 410.2 1583.0
Linalool_oxide -138.3 -110.3
Trans_Linalool oxide (furanoid) -85.2 -98.8
2_Octanone -46.3 -52.0
BITC -62.8 -71.7
Benzaldehyde -40.0 -36.8
Beta_Cyclocitral -58.6 -55.2
Beta_ionone -85.8 -84.0
Decanal -58.4 -65.3
E_2_nonal -67.7 -70.2
Gamma_octalactone -54.2 -17.9
Geraniol 471.1 377.2
Geranylacetone -60.8 -72.1
Heptanal 45.2 -36.6
Hexanal 2.5 -60.9
Limonene -1.1 -52.6
Methyl_dodecanoate -92.4 -95.0
Methyl_hexanoate -61.7 -47.4
Methyl_octanoate -36.7 -49.8
Neryl_Geranyl_acetone -65.1 -73.2
Nerylacetone -72.5 -75.1
Nonanal -30.6 -59.2
Octanal -40.3 -62.4
P_cymene -54.9 -64.1
Phenylacetaldehyde -39.1 -8.0
Sulcatone -7.9 -20.1
Terpinene -39.7 -66.8
Terpinolene -51.1 -56.9
  • Transforming VOC data to ng/kg
calibration_table <- read_excel("data/Results_22_23.xlsx", "densities") %>% 
  select(Hexanal:Linalool) %>% 
  t() %>% as.data.frame() %>% rownames_to_column("species") %>% 
  set_names(c("species", "densities", "OT", "purity"))

res_sheets <- c("Aug22", "Feb23", "Feb24")

VOCs <- res_sheets %>%
  # Read and combine sheets
  map(~read_excel("data/Results_22_23.xlsx", .x) %>%
        pivot_longer(cols = (Hexanal:BITC),
                     names_to = "species",
                     values_to = "concentration") %>%
        mutate(assay = .x)) %>%
  list_rbind() %>%
  
  # join data
  left_join(calibration_table) %>%
  left_join(genotype_table) %>%
  
  # Calculate concentrations
  mutate(
    conc_ppb = (concentration * densities * purity) * (0.00425/0.0042), # ug/L -> ug/KgFW = ug/L * volume(L)/mass(Kg)
    species = str_replace_all(species, "-", "_")) %>%
  
  # update matrix corrections
  left_join(
    matrix_effect %>% 
      mutate(across(
        ends_with("matrix_effect"),
        ~ case_when(
          species %in% c("Linalool","Linalool_oxide") ~ .[species == "Trans_Linalool oxide (furanoid)"],
          TRUE ~ .
          )
        ))
    ) %>%
  
   # Apply matrix corrections 
  mutate(
    corrected_conc_ppb = case_when(
      Flesh == "Red" ~ conc_ppb / (1 + (red_matrix_effect/100)),
      Flesh == "Yellow" ~ conc_ppb / (1 + (yellow_matrix_effect)/100)
      ),
    OAV = corrected_conc_ppb/OT,
    ) %>%
  relocate(c(assay, Flesh, label), .before = species)

# summary of VOC concentrations

VOC_summary <- VOCs %>%
  group_by(species) %>%
  summarise(
    n = n(),                                    
    mean_conc = mean(corrected_conc_ppb, na.rm = TRUE),
    sd_conc = sd(corrected_conc_ppb, na.rm = TRUE),      
    se_conc = sd_conc/sqrt(n),                 
    median_conc = median(corrected_conc_ppb, na.rm = TRUE),
    min_conc = min(corrected_conc_ppb, na.rm = TRUE),
    max_conc = max(corrected_conc_ppb, na.rm = TRUE),
    n_detected = sum(corrected_conc_ppb>0.01),        
    perc_detected = n_detected/n * 100         
  ) %>%
  arrange(desc(mean_conc)) %>%  
  mutate(across(where(is.numeric), ~ round(.x, digits = 2))) %>% 
  ungroup() 
  • Sugar extraction efficiency
sugar_key <- read_xlsx("data/July24_all.xlsx", sheet = "eth_key")
extraction_efficiency <-
  left_join(sugar_key, read_xlsx("data/July24_all.xlsx",
            sheet = "all_raw"), by = "sample_id") %>%
  filter(batch_id == "pooled") %>%
  mutate(
    Flesh = c(rep("Yellow", 5), rep("Red", 5)),
    Sorbitol_extraction_prop = Sorbitol/20000
  ) %>%
  group_by(Flesh) %>%
  summarise(
    extraction_efficiency_mean = mean(Sorbitol_extraction_prop*100),
    extraction_efficiency_sd = sd(Sorbitol_extraction_prop*100)
  )
kableExtra::kbl(extraction_efficiency,
      col.names = c("Flesh Color",
                    "Mean Recovery (%)",
                    "SD (%)"),
      digits = 1,
      align = c('l', 'c', 'c'),
      caption = "Sorbitol extraction efficiency from freeze-dried tissue samples") %>% 
  kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
Table 2: Sorbitol extraction efficiency from freeze-dried tissue samples
Flesh Color Mean Recovery (%) SD (%)
Red 84.4 0.6
Yellow 85.3 2.3
  • Sugar data unit transformation and imputation of missing values
bad_dry <- paste0("JL", 10:53) # compromised samples
UI <- paste0("UI", 1:5) # auxiliary samples

# Data import and initial cleaning
raw_sugar <- left_join(sugar_key,
                      read_xlsx("data/July24_all.xlsx", sheet = "all_raw"),
                      by = "sample_id") %>%
  filter(!sample_id %in% bad_dry,
         !sensory_id %in% UI,
         batch_id != "pooled",
         mass != "75") %>% # auxiliary sample
  select(-c(sample_id, Sorbitol, batch_id)) %>%
  rename(ID = sensory_id) %>%
  full_join(genotype_table, by = "ID") %>%
  left_join(read_xlsx("data/July24_all.xlsx", sheet = "weight"),
            by = "ID") %>%
  mutate(FW_factor = wet_fraction/dry_fraction)

# Missing values
na_rows <- raw_sugar[!complete.cases(raw_sugar),]
kableExtra::kbl(na_rows,
                   caption = "Missing values in sugar dataset") %>% 
  kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
Table 3: Missing values in sugar dataset
ID mass Glucose Fructose Sucrose label assay Flesh Genotype Genotype2 Batch_id dry_fraction wet_fraction FW_factor
01-105 100 20923.45 22637.53 NA PBL10 Aug22 Yellow PBL10 PBL10 1 0.1371277 0.8628723 6.292473
01-1 100 17030.04 17982.28 22326.106 PBL10 Aug22 Yellow PBL10 PBL10 NA NA NA NA
02-1 100 21081.63 20147.66 16771.018 PBL11 Aug22 Yellow PBL11 PBL11 NA NA NA NA
03-1 100 16326.40 18285.30 15781.072 PBL13 Aug22 Yellow PBL13 PBL13 NA NA NA NA
05-1 100 19006.88 19580.25 19986.072 PBL15 Aug22 Yellow PBL15 PBL15 NA NA NA NA
07-1 101 18744.57 18958.28 11242.871 1B Aug22 Yellow 1B 1B NA NA NA NA
02-654 101 28661.26 28256.75 1213.175 PBL11 Aug22 Yellow PBL11 PBL11 NA NA NA NA
03-504 101 28795.46 30279.16 1207.327 PBL13 Aug22 Yellow PBL13 PBL13 NA NA NA NA
05-603 100 28863.26 29848.19 NA PBL15 Aug22 Yellow PBL15 PBL15 1 0.1311861 0.8688139 6.622759
02-021 NA NA NA NA PBL11 Aug22 Yellow PBL11 PBL11 1 0.1350356 0.8649644 6.405455
12-795 NA NA NA NA Solo Aug22 Red Solo Solo 1 0.1160407 0.8839593 7.617667

# Get all imputed datasets using default parameters
set.seed(11)

imp <- mice::mice(raw_sugar, m = 25, maxit = 100, print = FALSE)
> Warning: Number of logged events: 7
save(imp, file = "data/sugar_imp.RData")
load("data/sugar_imp.RData")
imputed_sugar <- complete(imp, "long") %>%
  group_by(ID) %>%
  summarise(across(c(mass, FW_factor, Glucose, Fructose, Sucrose), mean))

# new dataset with mean values from 25 random imputations after 100 iterations
final_sugar <- raw_sugar %>%
  select(-c(mass, FW_factor, Glucose, Fructose, Sucrose)) %>%
  left_join(imputed_sugar, by = "ID")

# Unit transformation of raw data to g sugar/100g FW
sugar_vars <- c("Glucose", "Fructose", "Sucrose")

sugar_data <- final_sugar %>%
  mutate(across(all_of(sugar_vars), ~.x * 0.001),  # Accounts for 0.001L extraction volume
         FW = mass * FW_factor,                      # Converts to fresh weight basis
         across(all_of(sugar_vars), ~(.x / FW) * 100),  # Convert to g/100g FW
         total_sugar = rowSums(across(all_of(sugar_vars))),
         across(all_of(sugar_vars), ~.x / total_sugar, .names = "{.col}_prop"),
         wet_fraction = FW_factor/(1+FW_factor),
         dry_fraction = 1-wet_fraction) %>%
  select(assay, Genotype, ID, Flesh, all_of(sugar_vars),
         total_sugar, ends_with("_prop"),label,wet_fraction)

glimpse(sugar_data)
> Rows: 123
> Columns: 13
> $ assay         <chr> "Feb23", "Feb23", "Feb23", "Feb23", "Feb23", "Feb23", "F…
> $ Genotype      <chr> "PBL8", "Sunlight 3", "PBL9", "PBL6", "PBL7", "RB1", "PB…
> $ ID            <chr> "01-618", "02-424", "03-849", "04-924", "05-mon", "06-46…
> $ Flesh         <chr> "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", …
> $ Glucose       <dbl> 2.388368, 2.819370, 2.166258, 2.116345, 2.759456, 3.0948…
> $ Fructose      <dbl> 2.177387, 2.402008, 2.156498, 1.926594, 2.610099, 2.7535…
> $ Sucrose       <dbl> 7.654067, 7.026162, 6.312745, 4.587775, 6.481786, 3.1513…
> $ total_sugar   <dbl> 12.219822, 12.247540, 10.635501, 8.630714, 11.851341, 8.…
> $ Glucose_prop  <dbl> 0.1954503, 0.2301989, 0.2036818, 0.2452109, 0.2328391, 0…
> $ Fructose_prop <dbl> 0.1781848, 0.1961216, 0.2027641, 0.2232254, 0.2202366, 0…
> $ Sucrose_prop  <dbl> 0.6263649, 0.5736795, 0.5935541, 0.5315638, 0.5469243, 0…
> $ label         <chr> "PBL8", "Sunlight 3", "PBL9", "PBL6", "PBL7", "RB1_2", "
> $ wet_fraction  <dbl> 0.8515992, 0.8561540, 0.8649311, 0.8783233, 0.8591738, 0
  • Glucotropaeolin unit transformation and QC analysis
# Read sample metadata and define sample IDs
GTR_key <- read_xlsx("data/July24_all.xlsx", sheet = "meth_key")
samples <- paste0("ME", seq(10, 136))

# Process GTR (Glucotropaeolin) concentration data
# Initial concentration in mg/L (mg glucotropaeolin per L methanol extract)
GTR_data <- read_xlsx("data/GTR_data.xlsx", sheet = "GTR") %>%
  # Filter relevant samples
  filter(sample_id %in% samples) %>%

  # Handle non-detects: replace 'n.a.' with detection limit (0.0001 mg/L)
  mutate(
    conc = str_replace(conc, "n.a.", "0.0001"),
    conc = as.numeric(conc)
  ) %>%

  # Reassign sample IDs and remove duplicate
  mutate(sample_id = paste0("ME", seq(10, 135))) %>%
  filter(!sample_id %in% c("ME12","ME57")) %>%

  # Join with sample metadata and moisture content data
  left_join(GTR_key) %>%
  rename(ID = sensory_id) %>%
  left_join(imputed_sugar %>% select(ID, FW_factor)) %>%

  # Convert concentration from mg/L to mg/kgFW
  # Step 1: mg/L * (0.001L/mass_mg) = mg/mgDW
  # Step 2: mg/mgDW / FW_factor = mg/mgFW
  # Step 3: mg/mgFW * 100000 = mg/kgFW
  mutate(
    conc = ((conc * (0.001/mass))/FW_factor) * 100000
  ) %>%

  # Clean up column names and select final variables
  rename(GTR_conc = conc) %>%
  select(ID, GTR_conc) %>% 
  filter(!ID %in% UI)

# Calculate Quality Control metrics for Continuing Calibration Verification (CCV) samples
# Reference standard (1 mg/L) analyzed every 10 samples
CCV_samples <- read_xlsx("data/GTR_data.xlsx", sheet = "GTR") %>%
  mutate(conc = as.numeric(conc)) %>%
  filter(sample_id == "STD_1ppm") %>%
  summarise(
    mean_conc = mean(conc),
    sd_conc = sd(conc),
    rsd_percent = (sd_conc/mean_conc) * 100
  )

# Display QC results
kbl(CCV_samples,
                   caption = "Quality Control Metrics for 1 mg/L Reference Standard (n=14)",
                   col.names = c("Mean (mg/L)", "SD (mg/L)", "RSD (%)"),
                   digits = c(4, 4, 2)) %>% 
   kable_styling()
Table 4: Quality Control Metrics for 1 mg/L Reference Standard (n=14)
Mean (mg/L) SD (mg/L) RSD (%)
1.1513 0.1589 13.8
  • Combining data and imputing missing data
# Load proxy sugar/acid data, excluding redundant Genotype column
proxy <- read_xlsx("data/22_23_proxy_sugar.xlsx", "all") 
# Combine multiple datasets
conc_matrix <- full_join(sugar_data, GTR_data, relationship = "many-to-many") %>%  
  full_join(proxy, relationship = "many-to-many") %>% 
  full_join(VOCs, relationship = "many-to-many") %>% 
  # Remove auxiliary samples
  filter(!ID %in% c("01-1","02-1","03-1","04-1","05-1","07-1")) %>% 
  # Remove intermediate calculation columns
  select(-c(concentration:purity,conc_ppb:yellow_matrix_effect,OAV)) %>% 
  # Reshape data for analysis
  pivot_wider(names_from = species, 
              values_from = corrected_conc_ppb)

# missing results
kbl(conc_matrix[!complete.cases(conc_matrix),],
                   caption = "Results with missing values",
                   digits = 1) %>% 
   kable_styling()
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
(#tab:combine data)Results with missing values
assay Genotype ID Flesh Glucose Fructose Sucrose total_sugar Glucose_prop Fructose_prop Sucrose_prop label wet_fraction GTR_conc pH Brix TA Genotype2 Hexanal P_cymene Terpinolene Sulcatone Linalool_oxide Linalool Beta_Cyclocitral Terpinene Heptanal Methyl_hexanoate Limonene Octanal Methyl_octanoate Nonanal Decanal Benzaldehyde E_2_nonal Phenylacetaldehyde Neryl_Geranyl_acetone Gamma_octalactone Beta_ionone BITC
Feb24 F1_Malay 02-734 Red 3.8 3.6 4.2 11.6 0.3 0.3 0.4 F1_Malay 0.9 NA 5.5 13.4 0.2 F1 Malay 2.1 7.8 68.1 66.4 16913.1 5811.8 2.9 22.7 10.5 0.1 62.7 1.5 1.6 1.8 3.6 6.9 1.5 0 10.6 10.6 5.2 1.3
Aug22 Holland_5 09-1 Red 2.7 3.0 2.1 7.9 0.3 0.4 0.3 Holland_5_1 0.9 0.2 NA NA NA Holland 5 16.2 4.5 0.0 32.6 0.0 54.8 0.0 0.0 34.2 0.0 10.5 7.9 0.5 9.4 0.0 5.1 0.0 0 35.1 4.9 4.5 4.6
Aug22 Solo 12-1 Red 3.2 3.4 2.0 8.7 0.4 0.4 0.2 Solo 0.9 0.1 NA NA NA Solo 19.5 4.7 15.1 72.5 670.6 178.7 0.0 5.9 33.8 3.7 14.0 8.4 0.6 11.4 0.0 6.6 0.0 0 76.1 2.4 6.7 0.0
Aug22 PBL15 05-575 Yellow 5.5 6.0 0.1 11.7 0.5 0.5 0.0 PBL15 0.8 NA 5.4 12.3 0.2 PBL15 85.3 5.9 16.6 2.8 3593.0 387.3 0.0 0.0 85.2 0.0 22.1 14.3 0.0 19.5 0.0 7.9 0.0 0 32.8 1.2 4.7 6.2
Aug22 PBL11 02-021 Yellow 3.4 3.5 2.8 9.7 0.3 0.4 0.3 PBL11 0.9 NA 5.1 10.6 0.4 PBL11 43.9 6.4 19.0 4.6 335918.5 8204.7 0.0 10.7 68.3 48.5 34.9 12.2 4.0 14.3 0.0 24.8 0.0 0 59.9 9.0 7.6 8.1
Aug22 Solo 12-795 Red 2.9 3.0 1.8 7.7 0.4 0.4 0.2 Solo 0.9 NA 5.4 10.4 0.4 Solo 13.0 7.2 32.9 15.0 6139.8 2893.9 0.0 23.2 57.6 2.8 44.0 17.5 0.5 19.6 0.0 11.7 0.0 0 168.5 41.5 4.8 25.2

# Get all imputed datasets using default parameters
imp_vars <- c("GTR_conc","pH","Brix","TA")

set.seed(11)
init <- mice::mice(conc_matrix,maxit=0)
> Warning: Number of logged events: 6
init$predictorMatrix[,] <- 0 
init$predictorMatrix[imp_vars, c(sugar_vars,"BITC")] <- 1

# imp2 <- mice::mice(conc_matrix, predictorMatrix = init$predictorMatrix, m = 25, maxit = 100, 
#                    print = FALSE)
# 
# imputed_conc <- complete(imp2, "long") %>%
#   group_by(ID) %>%
#   summarise(across(all_of(imp_vars), mean))

# save(imputed_conc, file = "data/imputed_conc.RData")
load("data/imputed_conc.RData")
# new dataset with mean values from 25 random imputations after 100 iterations (probably overkill)
final_conc_matrix <- conc_matrix %>%
  select(-(all_of(imp_vars))) %>%
  left_join(imputed_conc, by = "ID") %>% 
  relocate(all_of(imp_vars), .before = Hexanal)

# Attaching sensory data
full_data_matrix <- final_conc_matrix %>% left_join(read_xlsx("data/all_sensory_data.xlsx"))

# full_data_matrix %>% select(Flesh, sweetness_FL, Sulcatone, Heptanal) %>% print(n=118)
  
data_summary_genotype <- 
  full_data_matrix %>%
  group_by(Genotype) %>%
  summarise(across(where(is.numeric),
                   ~sprintf("%.2f (%.2f)", mean(.x), sd(.x)))) %>%
  pivot_longer(cols = -Genotype, names_to = "variable") %>%
  pivot_wider(names_from = Genotype, values_from = value)
  

4 Sensory and metabolite data summary

Sensory data evaluations were conducted in a purpose-built sensory laboratory (Food Quality lab at the Centre for Nutrition and Food Sciences; The Queensland Alliance for Agriculture and Food Innovation, University of Queensland), equipped with 14 isolated sensory booths, temperature control (22°C), and LED daylight-equivalent lighting. Six formal sessions were conducted in a single-blind randomised block design, with participants recording their ratings for each genotype. To minimize taster fatigue, sessions were limited to two per day. The data from the first two sessions were considered dummy data and discarded (detailed methods can be found at Sensory methods supplementary)

print(summarytools::dfSummary(full_data_matrix,
                graph.magnif = 0.82,
                varnumbers = FALSE,
                style = "grid",
                valid.col = TRUE,  # Changed to TRUE to match your desired output
                graph.col = TRUE),  # Added to ensure graphs are displayed
      method = 'render',  # Important for proper HTML rendering
      table.classes = 'st-table st-table-bordered st-table-striped')

Data Frame Summary

full_data_matrix

Dimensions: 118 x 60
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
assay [character]
1. Aug22
2. Feb23
3. Feb24
46(39.0%)
24(20.3%)
48(40.7%)
118 (100.0%) 0 (0.0%)
Genotype [character]
1. RB1
2. Holland_5
3. PBL15
4. 1B
5. Eksotika
6. F1_Malay
7. First_Lady
8. PBL1
9. PBL10
10. PBL11
[ 17 others ]
11(9.3%)
8(6.8%)
5(4.2%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
66(55.9%)
118 (100.0%) 0 (0.0%)
ID [character]
1. 01-039
2. 01-050
3. 01-105
4. 01-144
5. 01-215
6. 01-217
7. 01-487
8. 01-534
9. 01-618
10. 01-776
[ 108 others ]
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
1(0.8%)
108(91.5%)
118 (100.0%) 0 (0.0%)
Flesh [character]
1. Red
2. Yellow
90(76.3%)
28(23.7%)
118 (100.0%) 0 (0.0%)
Glucose [numeric]
Mean (sd) : 3 (0.8)
min ≤ med ≤ max:
1.3 ≤ 2.8 ≤ 5.5
IQR (CV) : 1 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Fructose [numeric]
Mean (sd) : 2.9 (0.8)
min ≤ med ≤ max:
1.4 ≤ 2.7 ≤ 6
IQR (CV) : 1.1 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Sucrose [numeric]
Mean (sd) : 3.3 (2.3)
min ≤ med ≤ max:
0.1 ≤ 2.7 ≤ 8.9
IQR (CV) : 3.7 (0.7)
118 distinct values 118 (100.0%) 0 (0.0%)
total_sugar [numeric]
Mean (sd) : 9.2 (2.1)
min ≤ med ≤ max:
4.2 ≤ 9.2 ≤ 13.6
IQR (CV) : 3.2 (0.2)
118 distinct values 118 (100.0%) 0 (0.0%)
Glucose_prop [numeric]
Mean (sd) : 0.3 (0.1)
min ≤ med ≤ max:
0.2 ≤ 0.3 ≤ 0.5
IQR (CV) : 0.1 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Fructose_prop [numeric]
Mean (sd) : 0.3 (0.1)
min ≤ med ≤ max:
0.1 ≤ 0.3 ≤ 0.5
IQR (CV) : 0.2 (0.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Sucrose_prop [numeric]
Mean (sd) : 0.3 (0.2)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 0.7
IQR (CV) : 0.3 (0.6)
118 distinct values 118 (100.0%) 0 (0.0%)
label [character]
1. PBL15
2. 1B
3. Eksotika
4. F1_Malay
5. First_Lady
6. Holland_5_1
7. Holland_5_3
8. PBL1
9. PBL10
10. PBL11
[ 20 others ]
5(4.2%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
77(65.3%)
118 (100.0%) 0 (0.0%)
wet_fraction [numeric]
Mean (sd) : 0.9 (0)
min ≤ med ≤ max:
0.8 ≤ 0.9 ≤ 0.9
IQR (CV) : 0 (0)
118 distinct values 118 (100.0%) 0 (0.0%)
Genotype2 [character]
1. RB1
2. Holland 5
3. PBL15
4. 1B
5. Eksotika
6. F1 Malay
7. First Lady
8. PBL1
9. PBL10
10. PBL11
[ 17 others ]
11(9.3%)
8(6.8%)
5(4.2%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
4(3.4%)
66(55.9%)
118 (100.0%) 0 (0.0%)
GTR_conc [numeric]
Mean (sd) : 0.8 (1.5)
min ≤ med ≤ max:
0 ≤ 0.3 ≤ 12.1
IQR (CV) : 0.9 (1.9)
118 distinct values 118 (100.0%) 0 (0.0%)
pH [numeric]
Mean (sd) : 5.5 (0.2)
min ≤ med ≤ max:
5 ≤ 5.5 ≤ 6.1
IQR (CV) : 0.2 (0)
14 distinct values 118 (100.0%) 0 (0.0%)
Brix [numeric]
Mean (sd) : 11.3 (1.8)
min ≤ med ≤ max:
6.6 ≤ 11.1 ≤ 14.2
IQR (CV) : 2.8 (0.2)
59 distinct values 118 (100.0%) 0 (0.0%)
TA [numeric]
Mean (sd) : 0.2 (0.1)
min ≤ med ≤ max:
0 ≤ 0.2 ≤ 0.4
IQR (CV) : 0.1 (0.4)
34 distinct values 118 (100.0%) 0 (0.0%)
Hexanal [numeric]
Mean (sd) : 18.4 (21.3)
min ≤ med ≤ max:
0 ≤ 9.8 ≤ 85.3
IQR (CV) : 22.8 (1.2)
118 distinct values 118 (100.0%) 0 (0.0%)
P_cymene [numeric]
Mean (sd) : 5.2 (3.3)
min ≤ med ≤ max:
0 ≤ 5.1 ≤ 20.5
IQR (CV) : 4.2 (0.6)
113 distinct values 118 (100.0%) 0 (0.0%)
Terpinolene [numeric]
Mean (sd) : 20.9 (33.9)
min ≤ med ≤ max:
0 ≤ 14.7 ≤ 213.3
IQR (CV) : 16 (1.6)
96 distinct values 118 (100.0%) 0 (0.0%)
Sulcatone [numeric]
Mean (sd) : 62.9 (84.2)
min ≤ med ≤ max:
0 ≤ 46 ≤ 629.1
IQR (CV) : 66.4 (1.3)
118 distinct values 118 (100.0%) 0 (0.0%)
Linalool_oxide [numeric]
Mean (sd) : 52682.5 (108929.3)
min ≤ med ≤ max:
0 ≤ 7529 ≤ 747850.9
IQR (CV) : 42821.2 (2.1)
117 distinct values 118 (100.0%) 0 (0.0%)
Linalool [numeric]
Mean (sd) : 7313.7 (18543.3)
min ≤ med ≤ max:
0 ≤ 682.3 ≤ 112060.6
IQR (CV) : 4922.4 (2.5)
117 distinct values 118 (100.0%) 0 (0.0%)
Beta_Cyclocitral [numeric]
Mean (sd) : 2.2 (2.9)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 5 (1.3)
50 distinct values 118 (100.0%) 0 (0.0%)
Terpinene [numeric]
Mean (sd) : 10.5 (15.4)
min ≤ med ≤ max:
0 ≤ 5.9 ≤ 105.5
IQR (CV) : 9.5 (1.5)
107 distinct values 118 (100.0%) 0 (0.0%)
Heptanal [numeric]
Mean (sd) : 32 (24.5)
min ≤ med ≤ max:
0 ≤ 23.4 ≤ 112.2
IQR (CV) : 30.4 (0.8)
118 distinct values 118 (100.0%) 0 (0.0%)
Methyl_hexanoate [numeric]
Mean (sd) : 42.8 (199.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 2002.8
IQR (CV) : 8.8 (4.6)
52 distinct values 118 (100.0%) 0 (0.0%)
Limonene [numeric]
Mean (sd) : 26.4 (39.9)
min ≤ med ≤ max:
0 ≤ 12.1 ≤ 250.1
IQR (CV) : 30.3 (1.5)
90 distinct values 118 (100.0%) 0 (0.0%)
Octanal [numeric]
Mean (sd) : 7 (5.9)
min ≤ med ≤ max:
0 ≤ 5.2 ≤ 33.5
IQR (CV) : 8.4 (0.8)
118 distinct values 118 (100.0%) 0 (0.0%)
Methyl_octanoate [numeric]
Mean (sd) : 2.2 (3.8)
min ≤ med ≤ max:
0 ≤ 1.4 ≤ 20.9
IQR (CV) : 1.1 (1.7)
101 distinct values 118 (100.0%) 0 (0.0%)
Nonanal [numeric]
Mean (sd) : 9.5 (7.9)
min ≤ med ≤ max:
0.4 ≤ 8.1 ≤ 36.4
IQR (CV) : 13.7 (0.8)
118 distinct values 118 (100.0%) 0 (0.0%)
Decanal [numeric]
Mean (sd) : 0.4 (1.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 7.8
IQR (CV) : 0 (3)
20 distinct values 118 (100.0%) 0 (0.0%)
Benzaldehyde [numeric]
Mean (sd) : 14.1 (14.7)
min ≤ med ≤ max:
1.8 ≤ 7.8 ≤ 74.8
IQR (CV) : 15.6 (1)
118 distinct values 118 (100.0%) 0 (0.0%)
E_2_nonal [numeric]
Mean (sd) : 0.5 (0.9)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5.1
IQR (CV) : 0.8 (1.7)
45 distinct values 118 (100.0%) 0 (0.0%)
Phenylacetaldehyde [numeric]
Mean (sd) : 0 (0)
min ≤ med ≤ max:
0 ≤ 0 ≤ 0
IQR (CV) : 0 (1)
3 distinct values 118 (100.0%) 0 (0.0%)
Neryl_Geranyl_acetone [numeric]
Mean (sd) : 39.9 (42.8)
min ≤ med ≤ max:
2.6 ≤ 32.9 ≤ 312.7
IQR (CV) : 53.9 (1.1)
118 distinct values 118 (100.0%) 0 (0.0%)
Gamma_octalactone [numeric]
Mean (sd) : 11.4 (13.7)
min ≤ med ≤ max:
1.2 ≤ 6.4 ≤ 63.2
IQR (CV) : 9.1 (1.2)
118 distinct values 118 (100.0%) 0 (0.0%)
Beta_ionone [numeric]
Mean (sd) : 6.8 (4.9)
min ≤ med ≤ max:
3.5 ≤ 5.3 ≤ 50.2
IQR (CV) : 3 (0.7)
118 distinct values 118 (100.0%) 0 (0.0%)
BITC [numeric]
Mean (sd) : 4.5 (7.1)
min ≤ med ≤ max:
0 ≤ 2.6 ≤ 67.3
IQR (CV) : 5.3 (1.6)
93 distinct values 118 (100.0%) 0 (0.0%)
aroma_intensity_AR [numeric]
Mean (sd) : 52.6 (16.7)
min ≤ med ≤ max:
18.8 ≤ 49.4 ≤ 86
IQR (CV) : 23.9 (0.3)
113 distinct values 118 (100.0%) 0 (0.0%)
sweet_fruit_AR [numeric]
Mean (sd) : 35.3 (18.6)
min ≤ med ≤ max:
5.4 ≤ 31.6 ≤ 71.1
IQR (CV) : 33.2 (0.5)
116 distinct values 118 (100.0%) 0 (0.0%)
musty_off_note_AR [numeric]
Mean (sd) : 40.4 (12.3)
min ≤ med ≤ max:
13.1 ≤ 39.3 ≤ 74.8
IQR (CV) : 16.2 (0.3)
112 distinct values 118 (100.0%) 0 (0.0%)
fishy_AR [numeric]
Mean (sd) : 30.2 (16.5)
min ≤ med ≤ max:
0.6 ≤ 29 ≤ 68.9
IQR (CV) : 27 (0.5)
112 distinct values 118 (100.0%) 0 (0.0%)
citrus_AR [numeric]
Mean (sd) : 21.2 (18.9)
min ≤ med ≤ max:
0.9 ≤ 14.2 ≤ 71.4
IQR (CV) : 26.1 (0.9)
108 distinct values 118 (100.0%) 0 (0.0%)
floral_AR [numeric]
Mean (sd) : 21.6 (19.2)
min ≤ med ≤ max:
0.3 ≤ 13.9 ≤ 73.7
IQR (CV) : 22.3 (0.9)
109 distinct values 118 (100.0%) 0 (0.0%)
resistance_TX [numeric]
Mean (sd) : 42.1 (22.7)
min ≤ med ≤ max:
3.4 ≤ 40.9 ≤ 89.7
IQR (CV) : 35.8 (0.5)
115 distinct values 118 (100.0%) 0 (0.0%)
velvety_TX [numeric]
Mean (sd) : 45.4 (13.1)
min ≤ med ≤ max:
16.8 ≤ 44.5 ≤ 75.3
IQR (CV) : 17.5 (0.3)
113 distinct values 118 (100.0%) 0 (0.0%)
juiciness_TX [numeric]
Mean (sd) : 48.4 (19.4)
min ≤ med ≤ max:
6.8 ≤ 50.4 ≤ 85.4
IQR (CV) : 34.6 (0.4)
117 distinct values 118 (100.0%) 0 (0.0%)
dissolving_TX [numeric]
Mean (sd) : 54.9 (21.3)
min ≤ med ≤ max:
6.9 ≤ 56.4 ≤ 92
IQR (CV) : 28.8 (0.4)
117 distinct values 118 (100.0%) 0 (0.0%)
fibrous_TX [numeric]
Mean (sd) : 44.6 (13.5)
min ≤ med ≤ max:
15.9 ≤ 44.3 ≤ 84.6
IQR (CV) : 17.9 (0.3)
111 distinct values 118 (100.0%) 0 (0.0%)
flavour_intensity_FL [numeric]
Mean (sd) : 58.2 (13.7)
min ≤ med ≤ max:
14.1 ≤ 59 ≤ 87.2
IQR (CV) : 19.4 (0.2)
109 distinct values 118 (100.0%) 0 (0.0%)
sweetness_FL [numeric]
Mean (sd) : 54.8 (16.6)
min ≤ med ≤ max:
7 ≤ 57 ≤ 80.8
IQR (CV) : 22.7 (0.3)
114 distinct values 118 (100.0%) 0 (0.0%)
bitterness_FL [numeric]
Mean (sd) : 26.8 (18.7)
min ≤ med ≤ max:
2.4 ≤ 20.8 ≤ 96.2
IQR (CV) : 19.8 (0.7)
109 distinct values 118 (100.0%) 0 (0.0%)
musty_FL [numeric]
Mean (sd) : 42.1 (11.5)
min ≤ med ≤ max:
18.2 ≤ 39.7 ≤ 69.6
IQR (CV) : 16.8 (0.3)
112 distinct values 118 (100.0%) 0 (0.0%)
floral_FL [numeric]
Mean (sd) : 26.6 (16.4)
min ≤ med ≤ max:
1.9 ≤ 21.2 ≤ 66.6
IQR (CV) : 20.4 (0.6)
114 distinct values 118 (100.0%) 0 (0.0%)
bitter_AT [numeric]
Mean (sd) : 23.6 (17)
min ≤ med ≤ max:
5 ≤ 18.1 ≤ 91.1
IQR (CV) : 16.5 (0.7)
108 distinct values 118 (100.0%) 0 (0.0%)
sweet_AT [numeric]
Mean (sd) : 37.8 (15.5)
min ≤ med ≤ max:
1.5 ≤ 39.4 ≤ 66.9
IQR (CV) : 21 (0.4)
116 distinct values 118 (100.0%) 0 (0.0%)
metallic_AT [numeric]
Mean (sd) : 11.8 (6.1)
min ≤ med ≤ max:
0.8 ≤ 10.7 ≤ 34.7
IQR (CV) : 7 (0.5)
102 distinct values 118 (100.0%) 0 (0.0%)
prickly_AT [numeric]
Mean (sd) : 8.6 (4.7)
min ≤ med ≤ max:
1.4 ≤ 7.8 ≤ 23.3
IQR (CV) : 5.3 (0.5)
103 distinct values 118 (100.0%) 0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.2.1)
2025-06-20

5 Sensory and metabolite Principal Component Analysis

Metabolites with odour activity values less than one likely play a very minor role in fruit flavour; due to the small ranges and fluctuations, their influence on flavour may be overestimated in modelling. Moreover, sensory terms including prickly aftertaste and metallic aftertaste poorly differentiated the genotypes throughout descriptive analysis. Therefore these variables were removed to focus on highly relevant flavour variables.

# cut VOC variables
no_aroma <- VOCs %>% 
  select(1:6,OAV) %>% 
  pivot_wider(names_from = species, values_from = OAV) %>% 
  select(Hexanal:BITC) %>% 
  summarise(across(where(is.numeric), ~ max(.x, na.rm = TRUE))) %>%
  select(where(function(x) any(x<1))) %>% 
  colnames()

# cut sensory variables
poor_lexicon_use <- c("prickly_AT", "metallic_AT")

Once these variables were removed the data was visualised using PCA to look at sensory and metabolite profiles of each genotype.

# data set up
remove_vars <- c("Glucose_prop","Fructose_prop","Sucrose_prop","label","Genotype2")
PCA_data <- full_data_matrix %>% 
  column_to_rownames(var = "ID") %>% 
  select(-all_of(c(poor_lexicon_use, no_aroma,remove_vars,"Beta_Cyclocitral")))
# Beta_Cyclocitral appears to have very low range and is exacerbating small differences due to presence absence

# PCA script custom bi-plot
PCA_vocs <- PCA(PCA_data, scale.unit = TRUE, 
                ncp = 5, graph = F, 
                quali.sup = 1:31) # omit all but sensory columns
 
# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>% 
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>% 
  top_n(17, contribution)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(sensory_meta) %>% 
  mutate(Category = snakecase::to_title_case(group),
         sensory_label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var))) 

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
                    as_tibble() %>% 
                    mutate(across(starts_with("Dim"), ~r*0.6*.x)) # this number scales arrow length

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>% 
               as.data.frame() %>% 
               rownames_to_column("ID") %>% 
               left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>% 
               group_by(label) %>% 
               summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
               left_join(genotype_table %>% distinct(label,Flesh, Genotype, Genotype2, assay))


# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) + 
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=0.20, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2, 
                  size=3, hjust=-0.2, vjust=0.5, box.padding = 0.2,
                  colour = "black", fontface = "italic") +
   scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh", 
        colour="Attribute", shape = "Havest Month") 
PCA biplot of 27 papaya genotype dipicting the variations in sensory profiles from quantitative descriptive analysis.

Figure 1: PCA biplot of 27 papaya genotype dipicting the variations in sensory profiles from quantitative descriptive analysis.

PCA_vocs <- PCA(PCA_data, scale.unit = TRUE, 
                ncp = 6, graph = F, 
                quali.sup = c(1:3,31:48)) # all chem PCA

# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib[,1:2] %>% as.data.frame() %>% 
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>% 
  top_n(22, contribution)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key, by = "var") %>% 
  mutate(var=str_replace_all(var, c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone", "E_2_nonal"="E-2-Nonanal","P_cymene"="P-Cymene","_"=" ")), group=str_replace_all(group, c("Acids"="Sugar/acid","Sugars"="Sugar/acid")))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
  as_tibble() %>% 
  mutate(across(starts_with("Dim"), ~r*0.6*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>% 
  as.data.frame() %>% 
  rownames_to_column("ID") %>% 
  left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>% 
  group_by(label,Flesh) %>% 
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  left_join(genotype_table %>% distinct(label, Flesh, Genotype, Genotype2,assay))

# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) + 
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=0.3, vjust=-0.6, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090","darkorange")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2, 
                  size=3, hjust=1.1, vjust=1.2, box.padding = 0.1,
                  colour = "black", fontface = "italic") +
  scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh", 
        colour="Attribute", shape = "Havest Month")
CA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype.

Figure 2: CA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype.

There appears to be a significant batch effect associated with metabolites with highly skewed distributions. After removing decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal, the control genotypes are qualitatively similar in distance to what is observed in the sensory data.

PCA_vocs <- PCA(PCA_data, scale.unit = TRUE,
                ncp = 6, graph = F,
                quali.sup = c(1:3,21,23:26,31:48)) # omit decanal, methyl hexanoate, octanal, (E)-2-nonanal, nonanal

# make a list of top variables based on visual assessment
topvars <- PCA_vocs$var$contrib [,1:2] %>% as.data.frame() %>%
  rownames_to_column(var = "species") %>% mutate (contribution = Dim.1 + Dim.2) %>%
  top_n(22, contribution)
# make a dataframe with variable coordinates for arrows
PCAvar_coords <- PCA_vocs$var$coord[topvars$species,] %>%
  as.data.frame() %>%
  select(1:2) %>%
  rownames_to_column(var = "var") %>%
  left_join(class_key) %>%
  mutate(var=str_replace_all(var, c("Neryl_Geranyl_acetone" = "Neryl/Geranyl Acetone", "E_2_nonal"="E-2-Nonanal","P_cymene"="P-Cymene","_"=" ")), group=str_replace_all(group, c("Acids"="Sugar/acid","Sugars"="Sugar/acid")))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(PCA_vocs,
                       element = "var",
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(PCA_vocs$ind$coord[, drop = FALSE],
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))),
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>%
  as_tibble() %>%
  mutate(across(starts_with("Dim"), ~r*0.6*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- PCA_vocs$ind$coord %>%
  as.data.frame() %>%
  rownames_to_column("ID") %>%
  left_join(genotype_table)

# create a table with the average coordinates for each phenotype
Mean_coords <- PCAvoctable %>%
  group_by(label,Flesh) %>%
  summarise(across(where(is.numeric), ~mean(.x, na.rm = TRUE))) %>%
  left_join(genotype_table %>% distinct(label, Flesh, Genotype, Genotype2,assay))

# choose colours
my_colours <- c("#E31A1C","gold2")


# Get component variance information for axis labels
percentVar <- PCA_vocs$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(Mean_coords) +
  aes(x = Dim.1, y = Dim.2, fill = Flesh, shape = assay) +
  geom_hline(yintercept = 0, alpha = 0.6,
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6,
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2),
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=0.3, vjust=-0.6, size=3.2, box.padding = 0.1) +
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090","darkorange")) +
  geom_point(alpha = 1, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(),
                     panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = Mean_coords$Genotype2,
                  size=3, hjust=1.1, vjust=1.2, box.padding = 0.1,
                  colour = "black", fontface = "italic") +
  scale_shape_manual(values = c(21,22,24)) +scale_fill_manual(values = my_colours) +
  guides(colour=guide_legend(override.aes = list(label="")))+
  scale_fill_identity(guide = guide_legend(override.aes = list(shape = 21))) +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Flesh",
        colour="Attribute", shape = "Havest Month")
PCA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype. Excluding decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal.

Figure 3: PCA biplot of 27 papaya genotype dipicting the variations in metabolite profiles from targeted quantitative chemical assays. Including metabolites with concentrations theorectically above an odour threshold of one in one or more genotype. Excluding decanal, methyl hexanoate, octanal, (E)-2-nonanal and nonanal.

6 Modelling the interactions between sensory and metabolite profiles

Variable influences on traits were assessed using two complementary approaches: (1) generalised boosted regression (Friedman 2001) to capture both main effects and interactions (x-axis), optimized using 10-fold cross-validation, and (2) Bayesian linear regression (BayesA; Meuwissen et al. (2001)) to estimate individual additive effects without interactions (y-axis). Only variables that stood out from the rest were labelled due to space limitations.

scaled_matrix <- PCA_data %>% select(-c(1:3,21,23:26)) %>% scale()

# Sensory subsetting
# column naming
traits <- colnames(scaled_matrix) [23:40]
gbm_name <- c(seq(from = 1, to = 35, by = 2))
beta_name <- c(seq(from = 2, to = 36, by = 2))

set.seed(11)

for (i in 1:18) {
  
  # caret GBM
  carX <- as.matrix(cbind(Y=scaled_matrix[,22+i], scaled_matrix[,1:22]))
  
  grid <- expand.grid(.n.trees = c(500, 750, 1000), 
                      .interaction.depth = c(5,7,10), 
                      .shrinkage = c(0.01,0.02,0.05), 
                      .n.minobsinnode = c(3,5))
  
  trnCtrl <- trainControl(method = "cv", number = 10)
  
  gbm <- train(Y ~ ., data = carX, method = "gbm", trControl = trnCtrl, tuneGrid = grid)
  
  # BGLR
  trnY <- scaled_matrix[,22+i]
  
  ETA <- list(list(X = scaled_matrix[,1:22], model = 'BayesA'))
  
  fit_BA <- BGLR::BGLR(y = trnY, 
                 ETA = ETA, 
                 nIter = 30000, 
                 burnIn = 10000,
                 thin = 100, 
                 saveAt = 'data/', 
                 df0 = 5, 
                 S0 = NULL,
                 weights = NULL, 
                 R2 = 0.5)

 # write results table
  if(i == 1){
    wts <- as.data.frame(varImp(object = gbm)$importance) %>% 
           cbind(fit_BA$ETA[[1]]$b)
    names(wts)[i] <- paste0(traits[i],"_gbm")
    names(wts)[i+1] <- paste0(traits[i],"_beta")
  } else {
    wts <- cbind(wts,varImp(object = gbm)$importance,fit_BA$ETA[[1]]$b)
    names(wts)[gbm_name[i]] <- paste0(traits[i],"_gbm")
    names(wts)[beta_name[i]] <- paste0(traits[i],"_beta")
  }
}

save(wts,traits,beta_name, gbm_name,  file = "data/sensory_modelling.RData")
# Plot prep ------------
bglr_data <- wts %>% 
  select(all_of(beta_name)) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key) %>% 
  pivot_longer(cols = 2:19, values_to = "y", names_to = "trait") %>% 
  mutate(trait=str_replace_all(trait,"_beta",""))

gbm_data <- wts %>% 
  select(all_of(gbm_name)) %>% 
  rownames_to_column(var = "var") %>% 
  left_join(class_key) %>% 
  pivot_longer(cols = 2:19, values_to = "x", names_to = "trait") %>% 
  mutate(trait=str_replace_all(trait,"_gbm",""))

all_plot_data <- left_join(bglr_data,gbm_data[c(1,4:5)],by=c("var" = "var", "trait" = "trait")) %>% 
                mutate(sensory_group = str_sub(trait, - 2, - 1),
                       trait_label = str_sub(trait,0,-4),
                       trait_label = str_replace_all(trait_label,"_"," "),
                       trait_label = str_to_sentence(trait_label),
                       label = str_replace_all(label,c("total sugar"="Total sugar",
                                                       "wet fraction"="Water content")))
# Filter for AR labels -------
ar_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_min(y, n=4),
    
    all_plot_data %>%  filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_max(y, n=4),
    
    all_plot_data %>% filter(sensory_group == "AR") %>% 
      group_by(trait) %>% slice_max(x, n=8)) %>%
  
  distinct()
  
my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Need labels to remove
# Determined by initially plotting AR_label 

AR_intensity <- c("Terpinolene","Heptanal","Terpinene","(E/Z)-Geranylacetone","Sucrose",
                  "Hexanal")
AR_citrus <- c("Glucose","Hexanal","Limonene","Sucrose","Heptanal","Sulcatone","β-Ionone","Terpinene")
AR_fishy <- c("Sucrose","Glucose","P-Cymene","pH","β-Ionone","Sulcatone","TA")
AR_floral <- c("(E/Z)-Geranylacetone","β-Ionone", "Hexanal","Sulcatone","(E/Z)-Geranylacetone")
AR_musty <- c("Linalool oxide", "Terpinolene", "GTR","(E/Z)-Geranylacetone")
AR_sweet <- c("BITC","β-Ionone","Linalool oxide","P-Cymene","Hexanal","GTR","pH","Total sugar")

# Filter labels and plot

ar_label1 <- ar_label %>%
  filter(!label %in% AR_intensity & trait=="aroma_intensity_AR"|
         !label %in% AR_citrus & trait == "citrus_AR" |
         !label %in% AR_fishy & trait == "fishy_AR"|
         !label %in% AR_floral & trait == "floral_AR"|
         !label %in% AR_musty & trait == "musty_off_note_AR"|
         !label %in% AR_sweet & trait == "sweet_fruit_AR")
  

#plot AROMA trait variables
ggplot(all_plot_data %>% filter(sensory_group == "AR")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data[all_plot_data$sensory_group=="AR",] %>% select(y)), 
       max(all_plot_data[all_plot_data$sensory_group=="AR",] %>% select(y))) +


  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = ar_label1,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = -.1,
                  vjust = -0.4,
                  size = 3,
                  box.padding = 0.1) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )
Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes.

Figure 4: Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes.

# Filter for FL labels -------
FL_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_min(y, n=5), 
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_max(y, n=5),
    all_plot_data %>% filter(sensory_group == "FL") %>% 
      group_by(trait) %>% slice_max(x, n=8)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Need labels to remove
# Determined by initially plotting FL_label 

FL_bitterness <- c("γ-Octalactone", "Sucrose","Hexanal","Glucose","Terpinolene","Water fraction","Brix")
FL_intensity <- c("Sulcatone", "β-Ionone","Linalool","TA","BITC")
FL_floral <- c("Brix","Linalool oxide","(E/Z)-Geranylacetone","Terpinene")
FL_musty <- c("Terpinene","Sulcatone")
FL_sweet <- c("Fructose","Linalool oxide","Glucose","GTR", "TA","pH","γ-Octalactone")

# Filter labels and plot

FL_label1 <- FL_label %>%
  filter(!label %in% FL_bitterness & trait=="bitterness_FL"|
           !label %in% FL_intensity & trait == "flavour_intensity_FL" |
           !label %in% FL_floral & trait == "floral_FL"|
           !label %in% FL_musty & trait == "musty_FL"|
           !label %in% FL_sweet & trait == "sweetness_FL")


# Plot flavour trait variables
ggplot(all_plot_data %>% filter(sensory_group == "FL")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "FL") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "FL") %>% select(y))) +

  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = FL_label1,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = -0,
                  vjust = -0.5,
                  size = 3,
                  box.padding = 0.1) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )
Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes

Figure 5: Variation in papaya aroma characteristics explained by targeted metabolites in 27 papaya genotypes

# First filter for AT labels -------
AT_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_min(y, n=4), 
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_max(y, n=2),
    all_plot_data %>% filter(sensory_group == "AT") %>% 
      group_by(trait) %>% slice_max(x, n=7)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Plot aftertaste trait variables
ggplot(all_plot_data %>% filter(sensory_group == "AT")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "AT") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "AT") %>% select(y))) +

  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = AT_label,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = 0,
                  vjust = 0,
                  size = 3,
                  box.padding = 0.15) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )
Variation in papaya aftertaste characteristics explained by targeted metabolites in 27 papaya genotypes.

(#fig:aftertaste metabolites)Variation in papaya aftertaste characteristics explained by targeted metabolites in 27 papaya genotypes.

#First filter for TX labels -------
  TX_label <- 
  rbind(
    all_plot_data %>% filter(sensory_group == "TX") %>%
      group_by(trait) %>% slice_min(y, n=2), 
    all_plot_data %>% filter(sensory_group == "TX") %>% 
      group_by(trait) %>% slice_max(y, n=2),
    all_plot_data %>% filter(sensory_group == "TX") %>% 
      group_by(trait) %>% slice_max(x, n=2)) %>% 
  distinct()

my_colours <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22")

# Plot TEXTURE trait variables
ggplot(all_plot_data %>% filter(sensory_group == "TX")) +
  aes(x = x, y = y,
      fill = group,
      colour = group,
      label = label) +

  # Base plot elements
  geom_hline(yintercept = 0,
             alpha = 0.6,
             linetype = "dashed",
             linewidth = 0.5) +
  geom_point(shape = 21,
             alpha = 0.7,
             size = 2.5) +

  # Faceting and coordinates
  facet_wrap(vars(trait_label)) +
  xlim(0, 100) +
  ylim(min(all_plot_data %>% filter(sensory_group == "TX") %>% select(y)), 
       max(all_plot_data %>% filter(sensory_group == "TX") %>% select(y))) +
  
  # Labels
  labs(x = "Variable importance (GBM)",
       y = expression(beta~"coefficient (BayesA)"),
       fill = "") +

  # Point labels
  geom_text_repel(data = TX_label,
                  aes(x = x, y = y,
                      label = label,
                      color = group),
                  hjust = 0,
                  vjust = 0,
                  size = 3,
                  box.padding = 0.15) +

  # Colours and guides
  scale_fill_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                              "Phenylalanine-derived", "Sugars",
                              "Acids", "Other"),
                   values = my_colours) +
  scale_color_manual(breaks = c("Carotenoid/Terpenes", "Lipid-derived",
                               "Phenylalanine-derived", "Sugars",
                               "Acids", "Other"),
                    values = my_colours) +
  guides(colour = "none") +

  # Theme customisation
  theme_bw() +
  theme(
    plot.background = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(linetype = "solid", colour = "Black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_text(size = 11),
    axis.title.x = element_text(vjust = -2),
    axis.title.y = element_text(vjust = 2),
    axis.text = element_text(size = 8),
    legend.background = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    panel.spacing.y = unit(0, "cm", data = NULL),
    legend.position = "bottom"
  )
Variation in papaya texture characteristics explained by targeted metabolites in 27 papaya genotypes.

Figure 6: Variation in papaya texture characteristics explained by targeted metabolites in 27 papaya genotypes.

7 Modelling consumer liking

Trends within the individual consumer liking scores for each genotype were identified using the affinity propagation clustering algorithm (Frey and Dueck 2007). The grouping of individual consumers into three distinct clusters reveals the interaction between their regular papaya consumption habit and how they scored each genotype during tasting. Moreover, there are indications that multiple genotypes maybe be accepted into the Australian market by different clusters of consumers.

7.1 Cluster analysis

consumer_data <- read_xlsx("data/consumer_data.xlsx")

z <- consumer_data %>% 
  select(participant,liking,var) %>%
  pivot_wider(names_from = var, 
              values_from = liking)

# to plot the eps values
set.seed(11)
a = apcluster(negDistMat(r=2), x=z, q=0)

# cluster exemplars
clusters_eg <- a@exemplars %>% unname()
kableExtra::kbl(consumer_data[clusters_eg,],
      col.names = c("Participant", "Genotype", "Flesh Colour","Liking Score","Like Comment","Dislike Comment","Gender","Age","Consumption Habit"),
      caption = "Cluster exemplars generated using apcluster")
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
# consumer clusters
c1 <- a@clusters[[1]]%>% unname()
c2 <- a@clusters[[2]]%>% unname()
c3 <- a@clusters[[3]]%>% unname()

consumer_clusters <- z %>% mutate(Cluster = c(1:125))
consumer_clusters$Cluster <- as.character(car::recode(consumer_clusters$Cluster,"c1='1';c2='2';c3='3'"))

# attach cluster membership to consumer data
cluster_data <- 
  left_join(consumer_clusters, 
            consumer_data %>% select(participant,Gender,age,habit) %>% distinct()) %>% 
  relocate(Cluster:habit, .before = RB1) %>% 
  select(-participant)

# Summarise cluster demographic data using fisher test and kruskal
cluster_membership <- cluster_data %>% select(1:4)

tab1 <- CreateTableOne(vars = c("Gender", "age", "habit"),
                       strata = "Cluster",
                       data = cluster_membership,
                       factorVars = c("Gender", "habit"))



# summarise cluster liking for each genotype

cluster_summary <- consumer_clusters %>% select(-participant) %>%
  pivot_longer(1:9,names_to = "var",values_to = "liking")

# ANOVA and Post-hoc 
cluster_aov <- aov(liking ~ var * Cluster, data = cluster_summary)
cluster_ph <- SNK.test(cluster_aov, c("var","Cluster"))

# Extract data for plotting
aov_plot <- 
  left_join(cluster_ph$groups %>% rownames_to_column(var="var"),
            cluster_ph$means %>% rownames_to_column(var="var")) %>% 
  separate(var,
           into = c("var", "cluster"),
           sep = ":")
 
# second y-axis values
lam_breaks <- c(0, 13, 23, 33, 45, 50, 55, 67, 77, 87, 100)
lam_labels <- c('Greatest imaginable dislike',
                'Dislike extremely',
                'Dislike very much',
                'Dislike moderately',
                'Dislike slightly',
                'Neither like nor dislike',
                'Like slightly',
                'Like moderately',
                'Like very much',
                'like extremely',
                'Greatest imaginable like')
ggplot(aov_plot) +
aes(x = reorder(var,-liking), 
    y = liking, 
    group = interaction(var, cluster), 
    color = cluster, fill = cluster) + 
  # Add boxplot-style elements
  geom_boxplot(
    aes(ymin = Min,
        lower = Q25,
        middle = Q50,
        upper = Q75,
        ymax = Max),
    stat = "identity",
    alpha = 0.3,  
    width = 0.9,
    position = position_dodge(0.9)) +
  # Add post-hoc letters
  geom_text(aes(label = groups, y = Q50 + 2),
            position = position_dodge(0.9),
            size = 2.5) +
  # Customize colors
scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster') +
scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster',
                  labels= parse(text = c(
                    "1~(italic(n)==61)",
                    "2~(italic(n)==24)",
                    "3~(italic(n)==40)"))) +
  # remove letter in legend
  guides(color = "none") +
  # Customize theme
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(fill = NA, color = 'black'),
        legend.position = 'bottom', 
        legend.box = "horizontal",   
        legend.direction = "horizontal",  
        legend.margin = margin(t = 0, b = 10),  
        legend.title = element_text(size = 10),  
        legend.text = element_text(size = 9)) +
  # Add labels
  labs(x = 'Papaya Genotype',
       y = 'Consumer Liking Score (0-100)') +
  # Add secondary y-axis for LAM scale with no expansion
  scale_y_continuous(
    limits = c(0, 100),
    expand = c(0, 0), 
    sec.axis = sec_axis(
      ~.,
      breaks = lam_breaks,
      labels = lam_labels,
      name = 'Labeled Affective Magnitude (LAM) Scale'))
Consumer liking score for nine papaya genotypes recorded by 125 untrained panelists. Each panelist was grouped into a cluster based on the affinity propagation clustering algorithm. There is an interaction between geotype liking scores and cluster membership (ANOVA; p <= 0.001). The differences between genotype liking scores among all clusters are indicated by different letters (SNK: p <= 0.05).

Figure 7: Consumer liking score for nine papaya genotypes recorded by 125 untrained panelists. Each panelist was grouped into a cluster based on the affinity propagation clustering algorithm. There is an interaction between geotype liking scores and cluster membership (ANOVA; p <= 0.001). The differences between genotype liking scores among all clusters are indicated by different letters (SNK: p <= 0.05).

# very convoluted way to make it look good in html
tab_df <- 
  as.data.frame(print(tab1)) %>%
  rownames_to_column(var = "Consumer Cluster") %>% 
  gt()  %>% 
  tab_caption("Breakdown of demographic data and papaya consumption habits within each consumer cluster.") %>% 
  tab_footnote(
    footnote = "Kruskal wallis was applied to age and fishers test was applied to gender and papaya consumption habit.",
    locations = cells_column_labels(columns = "Consumer Cluster")
  )
>                         Stratified by Cluster
>                          1             2             3             p      test
>   n                         61            24            40                    
>   Gender (%)                                                        0.541     
>      Female                 43 (70.5)     15 (62.5)     23 (57.5)             
>      Male                   17 (27.9)      9 (37.5)     16 (40.0)             
>      Non-binary              1 ( 1.6)      0 ( 0.0)      0 ( 0.0)             
>      Unspecified             0 ( 0.0)      0 ( 0.0)      1 ( 2.5)             
>   age (mean (SD))        36.56 (10.90) 35.92 (12.60) 36.17 (10.70)  0.968     
>   habit (%)                                                         0.002     
>      1-3 times per month    23 (37.7)      3 (12.5)     20 (50.0)             
>      2-4 times a week        3 ( 4.9)      1 ( 4.2)      2 ( 5.0)             
>      never                   2 ( 3.3)      8 (33.3)      2 ( 5.0)             
>      once a week             6 ( 9.8)      1 ( 4.2)      2 ( 5.0)             
>      rarely                 27 (44.3)     11 (45.8)     14 (35.0)
# # Create flextable
# ft <- flextable(tab_df) %>%
#   theme_vanilla() %>%
#   autofit()
# Display the table
# ft
  # PCA clusters as variables -----
pca_data <- cluster_data %>%
  pivot_longer(5:13,names_to = "var",values_to = "liking") %>%
  group_by(Cluster,var) %>%
  summarise(mean_liking = mean(liking)) %>%  
  pivot_wider(names_from = "Cluster",values_from = "mean_liking") %>% 
  column_to_rownames(var = "var")

plot_data <- PCA(pca_data, graph = F)
# make a dataframe with variable coordinates for arrows
PCAvar_coords <- plot_data$var$coord %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% mutate(var=paste("Cluster",var))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(plot_data, element = "var", 
                       result = c("coord", "contrib", "cos2"))
colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(plot_data$ind$coord[, drop = FALSE], stringsAsFactors = TRUE)
colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))
# update the variable coordinates
PCAvar_plot_data <- PCAvar_coords %>% as_tibble() %>% 
  mutate(across(starts_with("Dim"), ~r*0.8*.x))

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- plot_data$ind$coord %>% 
  as.data.frame() %>% 
  rownames_to_column("Genotype2") %>% 
  left_join(genotype_table %>% select(Genotype2,Flesh) %>% distinct())

# choose colours
my_colours <- c("#E69F00", "#56B4E9", "#009E73")

# Get component variance information for axis labels
percentVar <- plot_data$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(PCAvoctable) + 
  aes(x = Dim.1, y = Dim.2, fill=Flesh) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(xend=Dim.1, yend=Dim.2, colour = var), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.5,
               arrow = arrow(angle = 30, length = unit(0.2, "cm")),
               lineend = "round",linejoin = "mitre") +
  geom_text_repel(mapping = aes(x=Dim.1, y=Dim.2, label=var),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=-0.5, size=3.2, colour= my_colours) + 
  scale_fill_manual(values = c("#E31A1C","gold2")) +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73"),
                   name = 'Consumer\nCluster',
                  labels= parse(text = c(
                    "1~(italic(n)==61)",
                    "2~(italic(n)==24)",
                    "3~(italic(n)==40)"))) +
  geom_point(shape = 21,alpha = 0.9, size=2.5) +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12),
                     plot.title = element_text(hjust = 0.5)) +
  geom_text_repel(label = PCAvoctable$Genotype2,
                  size=3, hjust=-0.15, vjust=-0.3,
                  colour = "black", fontface = "italic") +
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), 
        fill="Flesh Colour", 
        colour="Consumer Cluster") 
Principal component analysis of the liking scores for nine papaya genotypes reported by three distinct consumer clusters (consumer clusters were generated using agglomerative clustering algorithm): Cluster 1, n = 61; Cluster 2, n = 21; Cluster 3, n = 40.

Figure 8: Principal component analysis of the liking scores for nine papaya genotypes reported by three distinct consumer clusters (consumer clusters were generated using agglomerative clustering algorithm): Cluster 1, n = 61; Cluster 2, n = 21; Cluster 3, n = 40.

There is an inherent limitation in the dataset whereby fruit included in the consumer study was not able to be used for chemical analysis. This is due to the logistical issues regarding adequate fruit collection and processing: the fruit that was collected was completely consumed by the panelists. To deal with this limitation best linear unbiased estimates were combined with previously collected sensory and metabolite data to represent the broad differences between genotype liking scores. This caveat likely leads to the underestimation of complex nuances between variables that influence papaya fruit liking. Therefore, the model generated from this data highlights larger variable variation between genotypes.

liking_fit <- lmer(liking ~ var +
                     (1|participant),
                   data = consumer_data)

blue_df <- 
  as.data.frame(
    emmeans(liking_fit, 
            specs = "var")) %>% 
  rename("Genotype2" = "var") 

model_data <- 
  left_join(PCA_data %>% rownames_to_column(var = "ID"), genotype_table) %>% 
  left_join(blue_df) %>%
  filter(!is.na(emmean)) %>% 
  column_to_rownames(var = "ID") %>% 
  select(-c(assay:Genotype,SE:upper.CL,label:Genotype2)) %>% 
  relocate(Flesh, .after = "emmean") %>% 
  select(-c(Decanal,Methyl_hexanoate,Octanal,E_2_nonal, Nonanal)) 

kbl(blue_df, caption = "Estimated marginal means from the mixed effects model including genotype as a fixed effect and participant as a random effect", 
      col.names = c("Genotype","Least-squares mean",
                    colnames(blue_df)[3:6]),
      align = c("l",rep("c",5))) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(0, bold = TRUE)
> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
Table 5: Estimated marginal means from the mixed effects model including genotype as a fixed effect and participant as a random effect
Genotype Least-squares mean SE df lower.CL upper.CL
1B 39.664 1.66757 607.539 36.3891 42.9389
PBL14 56.832 1.66757 607.539 53.5571 60.1069
PBL15 50.784 1.66757 607.539 47.5091 54.0589
PBL16 62.720 1.66757 607.539 59.4451 65.9949
PBL21 70.104 1.66757 607.539 66.8291 73.3789
PBL9 70.368 1.66757 607.539 67.0931 73.6429
RB1 71.888 1.66757 607.539 68.6131 75.1629
Solo 64.888 1.66757 607.539 61.6131 68.1629
Sunlight 1 65.992 1.66757 607.539 62.7171 69.2669

Models indicating variable influence on fruit liking can be produced from red-fleshed and yellow-fleshed data separately; however, with limited power due to sample size. Moreover, the liking scores are more accurately compared as a whole because the consumer participants tasted all samples in one sitting, as opposed to only red-fleshed and then only yellow-fleshed samples.

Interestingly, it became clear in consumer tasting that PBL9 wasn’t an inherently bitter genotype; although it was thought to be from earlier sensory panels. Due to the constants of genotype liking, any samples displaying uncharacteristic flavour (likely due to poor fruit maturation) cause significant inaccuracies in modelling efforts. These samples were detected by comparing genotypes with equivalent liking scores. Therefore, samples were removed for further modelling: 03-302 and 03-669.

# find the source of confusion in cosumer liking modelling - filter for RS5/T2-6-5.27.12 and equivalently liked genotypes
sample_review <- full_data_matrix %>% 
  select(-(5:40)) %>% 
  filter(Genotype %in% c("PBL9","RB1")) %>% 
  column_to_rownames(var = "ID")

# PCA
pca_data <- PCA(sample_review, quali.sup = 1:4, graph = F)

# make a dataframe with variable coordinates for arrows
PCAvar_coords <- pca_data$var$coord %>% 
  as.data.frame() %>% 
  select(1:2) %>% 
  rownames_to_column(var = "var") %>% 
  slice_max(abs(Dim.1)+abs(Dim.2), n = 10) %>% 
  left_join(sensory_meta) %>% 
  mutate(Category = snakecase::to_title_case(group),
         sensory_label = snakecase::to_sentence_case(sub("_[A-Z]+$", "", var)))

# create a scale value to the size of the scatter plot to adjust variable coordinates
var <- facto_summarize(pca_data, 
                       element = "var", 
                       result = c("coord", "contrib", "cos2"))

colnames(var)[2:3] <- c("x", "y")

ind <- data.frame(pca_data$ind$coord[, drop = FALSE], 
                  stringsAsFactors = TRUE)

colnames(ind) <- c("x", "y")

r <- min((max(ind[, "x"]) - min(ind[, "x"])/(max(var[, "x"]) - min(var[, "x"]))), 
         (max(ind[, "y"]) - min(ind[, "y"])/(max(var[,"y"]) - min(var[, "y"]))))

# update the variable coordinates

PCAvar_plot_data <- PCAvar_coords %>% 
                    as_tibble() %>% 
                    mutate(across(starts_with("Dim"), ~r*0.7*.x)) # this number scales arrow length

# get coordinates for individuals to map group means with meta data for aesthetics
PCAvoctable <- pca_data$ind$coord %>% 
               as.data.frame() %>% 
               rownames_to_column("ID") %>% 
               left_join(genotype_table)

# choose colours
my_colours <- c("Aug22" = "#E69F00", "Feb23" = "#56B4E9", "Feb24" = "#009E73")


# Get component variance information for axis labels
percentVar <- pca_data$eig[1:2,2]

#plot PCA coords using ggplot

ggplot(PCAvoctable) + 
  aes(x = Dim.1, y = Dim.2, fill = assay, shape = Genotype2) + 
  geom_hline(yintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_vline(xintercept = 0, alpha = 0.6, 
             linetype="dashed", linewidth=0.5) +
  geom_segment(mapping = aes(colour=group, xend=Dim.1, yend=Dim.2), 
               data = PCAvar_plot_data,
               inherit.aes = F,
               x=0, y=0, linewidth=0.3,
               arrow = arrow(angle = 30, length = unit(0.2, "cm"))) +
  geom_text_repel(mapping = aes(colour=group, x=Dim.1, y=Dim.2, label=label),
                  data = PCAvar_plot_data,
                  inherit.aes = F,
                  hjust=-0.15, vjust=0.20, size=3.2, box.padding = 0.1) + 
  scale_color_manual(values = c("#8B0000", "#000080", "#228B22", "#708090")) +
  geom_point(alpha = 1, size=2.5,color = "black") +
  theme_bw() + theme(plot.background = element_blank(), 
                     panel.grid.major = element_blank(), 
                     panel.grid.minor = element_blank(),
                     axis.title = element_text(size=12)) +
  geom_text_repel(label = PCAvoctable$ID, 
                  size=3, hjust=-0.2, vjust=0.5, box.padding = 0.2,
                  colour = "black", fontface = "italic") +
   scale_shape_manual(values = c(21,22,24)) + scale_fill_manual(values = my_colours,
                                                                guide = guide_legend(override.aes = list(shape = 22, size = 6))) +
   guides(colour=guide_legend(override.aes = list(label="")))+
  labs( x=glue::glue("C1: {round(percentVar[1],2)}% variance"),
        y=glue::glue("C2: {round(percentVar[2],2)}% variance"), fill="Harvest Month", 
        colour="Attribute", shape = "Genotype") 
PCA Bi-plot comparing red-fleshed papaya genotypes RB1, Hybrid 6 and RS5. All three genotypes received similar liking scores from 125 untrained panelists.

Figure 9: PCA Bi-plot comparing red-fleshed papaya genotypes RB1, Hybrid 6 and RS5. All three genotypes received similar liking scores from 125 untrained panelists.

# issue with 2 samples during sensory: overly bitter and less sweet than expected
bitter_samples <- c("03-302","03-669")

# update model data
model_data <- model_data[!rownames(model_data) %in% bitter_samples,]

After these two samples were removed from the dataset, modelling the effect of sensory traits and metabolite traits on papaya liking scores better reflects what was observed. Similar to before gbm and bayesA were implemented to determine how the different variable groups influence papaya liking.

# Function to run models for a specific flesh type
run_models <- function(data, flesh_type, met_cols, sens_cols, y_col = 41) {
  # 
  # Filter and prepare data
  pap <- data %>%
    filter(grepl(flesh_type, Flesh, perl = TRUE)) %>%
    select(-Flesh) %>% 
    scale()

  # Define model parameters based on flesh type
  model_params <- if(flesh_type == "Yellow") {
    list(
      n_trees = c(500),
      int_depth = c(5),
      shrinkage = c(0.01),
      min_obs = c(1),
      cv_folds = 5,
      bglr_iter = 30000,
      bglr_burnin = 10000
    )
  } else {
    list(
      n_trees = c(500, 750, 1000),
      int_depth = c(5, 7, 10),
      shrinkage = c(0.01, 0.02, 0.05),
      min_obs = c(3, 5),
      cv_folds = 10,
      bglr_iter = 30000,
      bglr_burnin = 10000
    )
  }

  # Function to run a single model set (metabolites or sensory)
  run_model_set <- function(cols, type) {
    # Prepare data
    carX <- as.matrix(cbind(Y = pap[,y_col], pap[,cols]))

    # GBM
    grid <- expand.grid(
      .n.trees = model_params$n_trees,
      .interaction.depth = model_params$int_depth,
      .shrinkage = model_params$shrinkage,
      .n.minobsinnode = model_params$min_obs
    )

    gbm_model <- train(
      Y ~ .,
      data = carX,
      method = "gbm",
      trControl = trainControl(method = "cv", number = model_params$cv_folds),
      tuneGrid = grid
    )

    # BGLR
    bglr_model <- BGLR::BGLR(
      y = pap[,y_col],
      ETA = list(list(X = pap[,cols], model = 'BayesA')),
      nIter = model_params$bglr_iter,
      burnIn = model_params$bglr_burnin,
      thin = 100,
      saveAt = 'data/',
      df0 = 5,
      R2 = 0.5
    )

    # Combine results
    bind_rows(
      varImp(gbm_model)$importance %>%
        rownames_to_column("var") %>%
        mutate(model = paste0(flesh_type, "_", type, "_gbm")),
      tibble(
        var = colnames(pap[,cols]),
        Overall = bglr_model$ETA[[1]]$b,
        model = paste0(flesh_type, "_", type, "_beta")
      )
    ) %>%
      rename(coord = Overall)
  }

  # Run both model sets and combine results
  bind_rows(
    run_model_set(met_cols, "mets"),
    run_model_set(sens_cols, "sens")
  )
}

# run functions
set.seed(11)
# flesh_types <- c("Red", "Yellow", "Red_Yellow")
flesh_types <- c("Red", "Yellow", "Red|Yellow")
met_cols <- 1:22
sens_cols <- 23:40

results <- flesh_types %>%
  map(~run_models(model_data, ., met_cols, sens_cols)) %>%
  list_rbind() %>% 
  select(var, model, coord) %>% 
  mutate(model = sub("|", "_", model, fixed = TRUE))
save(results, file = "data/liking_var_model.RData")
# liking variable plots
create_importance_plot <- function(data,
                                 data_type = c("mets", "sens"),
                                 variety = "Red & Yellow",
                                 label_data,
                                 color_palette,
                                 group_breaks,
                                 facet_by_variety = FALSE,
                                 facet_scales = "free_y",
                                 facet_ncol = NULL,
                                 facet_theme = NULL) {
  # Input validation
  data_type <- match.arg(data_type)

  # Common theme settings
  plot_theme <- theme_bw() +
    theme(
      plot.background = element_blank(),
      panel.background = element_blank(),
      panel.border = element_rect(linetype = "solid", colour = "Black"),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.title = element_text(size = 11),
      axis.title.x = element_text(vjust = -2),
      axis.title.y = element_text(vjust = 2),
      axis.text = element_text(size = 8),
      legend.background = element_blank(),
      strip.background = element_blank(),
      strip.text = element_text(size = 11, face = "bold"),
      panel.spacing.y = unit(0, "cm", data = NULL),
      legend.position = "bottom"
    )

  # Filter data based on type and variety
  wts_filtered <- data %>%
    filter(data_type == !!data_type) %>%
    {if(!facet_by_variety) filter(., variety == !!variety) else .}

  # Filter labels
  labels_filtered <- label_data %>%
    filter(data_type == !!data_type) %>%
    {if(!facet_by_variety) filter(., variety == !!variety) else .}

  # Create base plot
  p <- ggplot(wts_filtered) +
    aes(x = x, y = y, fill = group, colour = group, label = label) +
    geom_hline(
      yintercept = 0,
      alpha = 0.6,
      linetype = "dashed",
      linewidth = 0.5
    ) +
    plot_theme +
    xlim(0, 100) +
    {if(!facet_by_variety) ylim(min(wts_filtered$y), max(wts_filtered$y))} +
    geom_point(shape = 21, alpha = 0.7, size = 2.5) +
    geom_text_repel(
      data = labels_filtered,
      aes(x = x, y = y, label = label, color = group),
      hjust = -0.1, vjust = 0, size = 3, box.padding = 0.10
    ) +
    scale_fill_manual(breaks = group_breaks, values = color_palette) +
    scale_color_manual(breaks = group_breaks, values = color_palette) +
    guides(colour = "none") +
    labs(
      x = "Variable importance (GBM)",
      y = "β coefficient (BayesA)",
      fill = ""
    )

  # Add faceting if requested
  if(facet_by_variety) {
    p <- p +
      facet_wrap(~variety,
                 scales = facet_scales,
                 ncol = facet_ncol) +
      {if(!is.null(facet_theme)) facet_theme else
        theme(
          strip.background = element_rect(fill = "white"),
          strip.text = element_text(size = 11, face = "bold")
        )
      }
  }

  return(p)
}

# load results
load("data/liking_var_model.RData")
# Prepare the data
wts <- results %>%
  left_join(rbind(class_key, sensory_meta), by = "var") %>%
  mutate(
    variety = case_when(
      str_detect(model, "Red_Yellow") ~ "Red & Yellow",
      str_detect(model, "Red") ~ "Red",
      str_detect(model, "Yellow") ~ "Yellow"
    ),
    data_type = str_extract(model, "mets|sens"),
    model = str_extract(model, "gbm|beta"),
    label = str_replace_all(label,"total sugar","Total sugar")
  ) %>%
  pivot_wider(names_from = model, values_from = coord) %>%
  rename(x = gbm, y = beta)

# Define colors
my_colours <- c(
  "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22"
)

# Prepare labels
mets_labels <- rbind(
  wts %>% group_by(variety, data_type) %>% slice_min(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(y, n = 7),
  wts %>% group_by(variety, data_type) %>% slice_max(x, n = 6)
) %>% distinct()

sens_labels <- rbind(
  wts %>% group_by(variety, data_type) %>% slice_min(y, n = 5),
  wts %>% group_by(variety, data_type) %>% slice_max(y, n = 7),
  wts %>% group_by(variety, data_type) %>% slice_max(x, n = 6)
) %>% distinct()

# Define group breaks
mets_breaks <- c("Carotenoid/Terpenes", "Lipid-derived", "Phenylalanine-derived",
                 "Sugars", "Acids", "Other")
sens_breaks <- c("Flavour", "Aftertaste", "Aroma", "Texture")

# metabolite variable plot
metabolites_plot<- create_importance_plot(
  data = wts,
  data_type = "mets",
  variety = "Red & Yellow",
  label_data = mets_labels,
    color_palette = my_colours,
  group_breaks = mets_breaks,
  facet_by_variety = FALSE
)
# sensory variable plot
Sensory_plot <- create_importance_plot(
  data = wts,
  data_type = "sens",
  variety = "Red & Yellow",
  label_data = sens_labels,
  color_palette = my_colours,
  group_breaks = sens_breaks,
  facet_by_variety = FALSE
)
metabolites_plot
Metabolite characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Figure 10: Metabolite characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Sensory_plot
Sensory characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Figure 11: Sensory characteristics that influence the consumer liking scores for nine papaya genotypes. Variable importance is estimated by gradient boosting machine analysis and Bayesian linear modelling (BayesA) estimates the correlation of each variable with papaya liking scores.

Variance decomposition analysis is implemented to understand the variation in liking scores between the genotypes based on class groupings for sensory and metabolite variables. Such as taste or aroma for sensory characteristics and terpenes (volatile organic compound) or sugars (non-volatile organic compound) for metabolite variables. This provides a broader sense of the variables influencing fruit liking scores.

# update model data
model_data <- model_data %>% select(-Flesh)

# Function to create Gaussian kernel matrix------------
create_kernel_matrix <- function(data, vars) {
    W <- data %>%
        select(all_of(vars)) %>%
        as.matrix()

    G <- as.matrix(dist(W, method="euclidean", diag=TRUE, upper=TRUE))
    K <- 1 - G/max(G, na.rm=TRUE)
    return(K)
}

# Main variance decomposition function
variance_decomp <- function(data, var_groups, response = "emmean") {
    # Create list to store kernel matrices
    kernel_matrices <- list()

    # Generate kernel matrix for each group
    for(group_name in names(var_groups)) {
        kernel_matrices[[group_name]] <- create_kernel_matrix(
            data,
            var_groups[[group_name]]
        )
    }

    # Create ETA list for BGLR
    ETA <- lapply(kernel_matrices, function(K) {
        list(K = K, model = 'RKHS')
    })

    # Extract response variable
    
    y <- scale(data)[,response]

    # Fit BGLR model
    fm <- BGLR::BGLR(
        y = y,
        ETA = ETA,
        nIter = 30000,
        burnIn = 10000,
        thin = 100,
        R2 = 0.5,
        saveAt = "data/"
    )

    # Calculate variance components and proportions
    varE <- fm$varE  # Residual variance
    var_components <- sapply(fm$ETA, function(x) x$varU)
    total_var <- sum(var_components) + varE

    # Calculate proportions for each group
    var_proportions <- c(var_components, 
                         residual = varE) / total_var

    # Create results list
    results <- list(
        model = fm,
        variance_components = var_components,
        residual_variance = varE,
        variance_proportions = var_proportions,
        total_variance = total_var
    )

    return(results)
}

# Create list of metabolite variables----------
included_vars <- class_key %>% 
  filter(var %in% colnames(model_data))

phe <- included_vars %>%
    filter(group == "Phenylalanine-derived" & !var == "GTR_conc") %>%
     select(var) %>% pull()
lip <- included_vars %>%
    filter(group == "Lipid-derived") %>%
    select(var) %>% pull()
car <- included_vars %>%
    filter(group == "Carotenoid/Terpenes") %>%
    select(var) %>% pull()
sugar <- included_vars %>%
    filter(group == "Sugars",
           !str_detect(var,"_prop")) %>%
    select(var) %>% pull()
acid <- c("TA")
GTR <- c("GTR_conc")

# Create groups list
mets_groups <- list(
    phe = phe,
    lip = lip,
    car = car,
    sugar = sugar,
    acid = acid,
    GTR = GTR
)

# Metabolites analysis
# mets_results <- variance_decomp(model_data, mets_groups, response = "emmean")
save(mets_results, file = "data/mets_results.RData")

# Create list of Sensory variables -------
include_vars <- sensory_meta %>% 
  filter(var %in% colnames(model_data))
AR <- include_vars %>%
    filter(group == "Aroma") %>%
    select(var) %>% pull()
TX <- include_vars %>%
    filter(group == "Texture") %>%
    select(var) %>% pull()
FL <- include_vars %>%
    filter(group == "Flavour") %>%
    select(var) %>% pull()
AT <- include_vars %>%
    filter(group == "Aftertaste",
           !str_detect(var,"_prop")) %>%
    select(var) %>% pull()

# Create groups list
sens_groups <- list(
    Aroma = AR,
    Texture = TX,
    Flavour = FL,
    Aftertaste = AT)

# Sensory analysis
sens_results <- variance_decomp(model_data, sens_groups, response = "emmean")
save(sens_results, file = "data/sens_results.RData")
load("data/sens_results.RData")
# Organize data with row groups
sens_table_data <- 
  data.frame(category = names(sens_results$variance_proportions),
             proportion = sens_results$variance_proportions) %>%
  mutate(proportion = round(proportion * 100, 1),
         category = str_replace(category,'residual', 'Residual'))
# write table
sens_table <- sens_table_data %>%
    gt() %>%
    tab_caption("Variance components analysis showing the proportion of variance in liking scores explained by sensory groups") %>% 
    # tab_header(
    #     title = "Variance Components Analysis",
    #     subtitle = "Proportion of Variance in Liking Scores Explained by Sensory Groups"
    # ) %>%
    fmt_percent(
        columns = proportion,
        decimals = 1,
        scale_values = FALSE
    ) %>%
    cols_label(
        category = "Sensory Group",
        proportion = "Variance Explained (%)"
    ) %>%
    tab_style(
        style = cell_borders(
            sides = "bottom",
            weight = px(2)
        ),
        locations = cells_row_groups()
    ) %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )
load("data/mets_results.RData")
# Organize data with row groups
mets_table_data <- 
  data.frame(category = names(mets_results$variance_proportions),
             proportion = mets_results$variance_proportions) %>%
  mutate(proportion = round(proportion * 100, 1),
         group_type = case_when(
           category %in% c("phe", "lip", "car") ~ "Volatiles",
           category %in% c("sugar", "acid", "GTR") ~ "Non-volatiles",
           TRUE ~ "Residual"),
         category = 
           str_replace_all(category,
                           c("phe" = "Phenylalanine-derived",
                             "lip" = "Lipid-derived",
                             "car" = "Carotenoid/Terpenes",
                             "sugar" = "Sugars",
                             "acid" = "Acids",
                             "GTR" = "GTR")))
# write table
mets_table <- mets_table_data %>%
    select(group_type, category, proportion) %>%
    gt(
        groupname_col = "group_type"
    ) %>%
    tab_caption("Variance components analysis showing the proportion of variance in liking scores explained by metabolite groups") %>% 
    # tab_header(
    #     title = "Variance Components Analysis",
    #     subtitle = "Proportion of Variance in Liking Scores Explained by Metabolite Groups"
    # ) %>%
    fmt_percent(
        columns = proportion,
        decimals = 1,
        scale_values = FALSE
    ) %>%
    cols_label(
        category = "Metabolite Group",
        proportion = "Variance Explained (%)"
    ) %>%
    summary_rows(
        groups = c("Non-volatiles", "Residual", "Volatiles"),
        columns = proportion,
        fns = list("Group Total" = ~sum(.)),
        fmt = ~ fmt_percent(., decimals = 1, scale_values = FALSE)
    ) %>%
    grand_summary_rows(
        columns = proportion,
        fns = list("Total" = ~sum(.)),
        fmt = ~ fmt_percent(., decimals = 1, scale_values = FALSE)
    ) %>%
    tab_style(
        style = cell_borders(
            sides = "bottom",
            weight = px(2)
        ),
        locations = cells_row_groups()
    ) %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )

Finally a predictive model was generated to predict papaya fruit liking in future selection processes. The variables included in the model were reduced to four to prevent overfitting considering the limited sample size: sulcatone (terpene), linalool oxide (terpene), γ-Octalactone (lipid-derived) and total sugars (sucrose, glucose and fructose; sugars). This model may be implemented for selection purposes if the sampling and instrumentation protocols are repeated. New samples may be indexed within the scores of samples used to build the model. Considering the model isn’t equipped for extrapolation, any values outside the ranges of what was used to build the model may distort the predicted liking score. Unscaled data was input to the model algorithm to make it more accessible for new data imputation.

explantory_vars <- function(data, num_vars = 4, target_var) {
  
  # Validate inputs
  if (!is.data.frame(data)) stop("data must be a data frame")
  if (!is.numeric(num_vars) || num_vars < 1) stop("num_vars must be a positive number")
  if (!target_var %in% colnames(data)) stop("target_var not found in data")
  if (num_vars >= ncol(data)) stop("num_vars must be less than number of predictor variables")
  
  # Get variable names, excluding the target variable
  variable_names <- colnames(data)
  variable_names <- variable_names[variable_names != target_var]

  # Generate all combinations
  combinations <- combn(seq_along(variable_names), num_vars)

  # Convert to named list of variable combinations
  variable_combinations <- setNames(
      lapply(1:ncol(combinations), function(i) {
          variable_names[combinations[,i]]
      }),
      paste("Model", 1:ncol(combinations))
  )

  return(variable_combinations)
}

# Function to run 5-fold cross-validation using hyperparameter grid search
run_xgb_cv <- function(params, dtrain, nrounds = 500, nfold = 5, early_stopping = 50) {
  cv <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = nrounds,
    nfold = nfold,
    early_stopping_rounds = early_stopping,
    verbose = FALSE,
    metrics = "rmse"
  )
  
  return(list(
    rmse = min(cv$evaluation_log$test_rmse_mean),
    iteration = cv$best_iteration
  ))
}

# Grid search function
grid_search_xgb <- function(param_grid, dtrain) {
  grid <- expand.grid(param_grid)
  
  results <- apply(grid, 1, function(params) {
    param_list <- list(
      objective = "reg:squarederror",
      eta = params["eta"],
      max_depth = params["max_depth"],
      subsample = params["subsample"],
      colsample_bytree = params["colsample_bytree"]
    )
    
    cv_result <- run_xgb_cv(param_list, dtrain)
    
    return(c(
      params,
      best_rmse = cv_result$rmse,
      best_iteration = cv_result$iteration
    ))
  })
  
  results_df <- as.data.frame(t(results))
  return(results_df)
}

# Function to prepare data for a specific variable combination
prepare_data <- function(data, vars, target_var) {
  boost_data <- data %>%
    select(all_of(c(target_var, vars)))
  
  set.seed(11)
  train_index <- createDataPartition(boost_data[[target_var]], p = 0.8, list = FALSE)
  train_data <- boost_data[train_index, ]
  test_data <- boost_data[-train_index, ]
  
  train_predictors <- as.matrix(train_data[, vars])
  test_predictors <- as.matrix(test_data[, vars])
  
  list(
    dtrain = xgb.DMatrix(data = train_predictors, label = train_data[[target_var]]),
    dtest = xgb.DMatrix(data = test_predictors, label = test_data[[target_var]]),
    test_data = test_data
  )
}

# Modified grid search function to return best parameters and performance metrics
run_model_combination <- function(data, vars, target_var) {
  # Prepare data
  prepared_data <- prepare_data(data, vars, target_var)
  
  # Grid search parameters
  param_grid <- list(
    eta = c(0.01, 0.05, 0.1),
    max_depth = c(2, 3, 4),
    subsample = c(0.8, 0.9, 1.0),
    colsample_bytree = c(0.7, 0.85, 1.0)
  )
  
  # Run grid search
  results <- grid_search_xgb(param_grid, prepared_data$dtrain)
  best_params <- results[which.min(results$best_rmse), ]
  
  # Train final model with best parameters
  set.seed(11)
  xgb_model <- xgboost(
    data = prepared_data$dtrain,
    nrounds = best_params$best_iteration,
    eta = best_params$eta,
    max_depth = best_params$max_depth,
    subsample = best_params$subsample,
    colsample_bytree = best_params$colsample_bytree,
    objective = "reg:squarederror",
    verbose = 0  # Suppress output
  )
  
  # Make predictions
  predictions <- predict(xgb_model, newdata = prepared_data$dtest)
  
  # Calculate performance metrics
  metrics <- postResample(pred = predictions, obs = prepared_data$test_data[[target_var]])
  
  # Add cleanup
  rm(xgb_model, prepared_data)
  gc()  # Force garbage collection
  
  # Return results
  list(
    variables = paste(vars, collapse = ", "),
    RMSE = metrics["RMSE"],
    Rsquared = metrics["Rsquared"],
    MAE = metrics["MAE"],
    best_param = best_params
  )
}

# Model comparison function
model_comparison <- function(data, variable_combinations, target_var) {
  total_models <- length(variable_combinations)
  pb <- txtProgressBar(min = 0, max = total_models, style = 3)
  
  results <- imap(variable_combinations, \(vars, name) {
    setTxtProgressBar(pb, as.numeric(gsub("Model ", "", name)))
    message(sprintf("Processing %s with variables: %s", name, paste(vars, collapse=", ")))
    tryCatch({
      result <- run_model_combination(data, vars, target_var)
      tibble(
        Model_ID = name,
        Variables = result$variables,
        RMSE = result$RMSE,
        R_squared = result$Rsquared,
        MAE = result$MAE,
        params = result$best_param,
        Error = NA_character_
      )
    }, error = function(e) {
      tibble(
        Model_ID = name,
        Variables = paste(vars, collapse=", "),
        RMSE = NA_real_,
        R_squared = NA_real_,
        MAE = NA_real_,
        params = NA_character_,
        Error = as.character(e)
      )
    })
  }) 
  close(pb)
  return(bind_rows(results))
}

# prep data
model_data1 <- model_data %>% select(1:22, "emmean")
target_var <- "emmean"
# Execute models
variable_combinations <- explantory_vars(model_data1, num_vars = 4, target_var)
model_results <- model_comparison(model_data1, variable_combinations, target_var)
# Write results
flattened_results <- model_results %>% unnest(params)
writexl::write_xlsx(flattened_results,"results/model_results.xlsx")
# Function to run 5-fold cross-validation using hyperparameter grid search
run_xgb_cv <- function(params, dtrain, nrounds = 500, nfold = 5, early_stopping = 50) {
  cv <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = nrounds,
    nfold = nfold,
    early_stopping_rounds = early_stopping,
    verbose = FALSE,
    metrics = "rmse"
  )

  return(list(
    rmse = min(cv$evaluation_log$test_rmse_mean),
    iteration = cv$best_iteration
  ))
}

grid_search_xgb <- function(param_grid, dtrain) {
  # Create grid of all combinations
  grid <- expand.grid(param_grid)

  # Apply CV to each parameter combination
  results <- apply(grid, 1, function(params) {
    param_list <- list(
      objective = "reg:squarederror",
      eta = params["eta"],
      max_depth = params["max_depth"],
      subsample = params["subsample"],
      colsample_bytree = params["colsample_bytree"]#,
      # min_child_weight = params["min_child_weight"],
      # gamma = params["gamma"] 
    )

    cv_result <- run_xgb_cv(param_list, dtrain)

    return(c(
      params,
      best_rmse = cv_result$rmse,
      best_iteration = cv_result$iteration
    ))
  })

  # Convert results to dataframe
  results_df <- as.data.frame(t(results))

  return(results_df)
}

# Data set up
boost_data <- model_data %>%
    select(emmean, total_sugar, Sulcatone, Linalool, Beta_ionone)

# Use createDataPartition instead of sample
set.seed(11)
train_index <- createDataPartition(boost_data$emmean, p = 0.8, list = FALSE)
train_data <- boost_data[train_index, ]
test_data <- boost_data[-train_index, ]

# Create matrices consistently
train_predictors <- as.matrix(train_data[, -1])
test_predictors <- as.matrix(test_data[, -1])
dtrain <- xgb.DMatrix(data = train_predictors, label = train_data$emmean)
dtest <- xgb.DMatrix(data = test_predictors, label = test_data$emmean)


# Grid search parameters
  # param_grid <- list(
  #   eta = c(0.01, 0.05, 0.1),
  #   max_depth = c(2, 3, 4),
  #   subsample = c(0.8, 0.9, 1.0),
  #   colsample_bytree = c(0.7, 0.85, 1.0)
  # )

param_grid <- list(
    eta = seq(0.06,0.15,0.01),
    max_depth = c(3),
    subsample = c(1),
    colsample_bytree = c(1)#,
    # min_child_weight = c(1, 3),
    # gamma = c(0, 0.1) 
)

# Grid search results
set.seed(11)
results <- grid_search_xgb(param_grid, dtrain)
best_params <- results[which.min(results$best_rmse),]

# Best params
#     eta max_depth subsample colsample_bytree best_rmse best_iteration
# 10 0.15         3         1                1  5.103844             34
# run model with best parameters.
set.seed(11)
xgb_model <- xgboost(data = dtrain,
                     nrounds = best_params$best_iteration,
                     eta = best_params$eta,
                     max_depth = best_params$max_depth,
                     subsample = best_params$subsample,
                     colsample_bytree = best_params$colsample_bytree,
                     objective = "reg:squarederror",
                     verbose = 0)  # Suppress output

# Make predictions on test set
predictions <- predict(xgb_model, newdata = dtest)
# Feature Importance
importance_matrix <- xgb.importance(feature_names = colnames(train_data[,-1]),
                                    model = xgb_model)

save(dtrain, dtest, results, predictions, xgb_model, importance_matrix, 
     file = "data/xgboost_train_results.RData")
load("data/xgboost_train_results.RData")
# Model evaluation
postResample(pred = predictions, obs = test_data[,"emmean"]) %>%
  as.data.frame() %>% rownames_to_column(var="Metric") %>% gt() %>%
    tab_caption("XGBoost performance metrics")
Table 6: XGBoost performance metrics
Metric .
RMSE 1.4845620
Rsquared 0.9895691
MAE 1.3567530
# Create a scatter plot of predicted vs actual values for test data
ggplot(data.frame(Predicted = predictions, 
                        Actual = test_data[,"emmean"]), 
             aes(x = Actual, y = Predicted)) + 
  stat_poly_line(color = "firebrick4", fill = "red3") + 
  stat_poly_eq() + theme_tufte() + 
  geom_point(color = "gray15", alpha = .5, shape = 20, size = 3.5) +
  labs( x="Observed Liking", y="Predicted Liking", title = "Predicted Liking ~ Observed Liking") +
  theme(plot.title =  element_text(hjust = 0.5), axis.title = element_text(size=14),title = element_text(size = 14))
XGBoost model predicted liking vs observed liking based on concentrations of sulcatone, Beta ionone, linalool and total sugar.

Figure 12: XGBoost model predicted liking vs observed liking based on concentrations of sulcatone, Beta ionone, linalool and total sugar.

xgb.plot.importance(importance_matrix,
                    main = "Feature Importance Plot")
XGBoost variable importance

Figure 13: XGBoost variable importance

importance_matrix %>% mutate(Feature = str_replace_all(Feature,c("_"=" ","total"="Total"))) %>% 
    gt() %>%
    tab_caption("XGBoost importance of features contributing to liking"
    ) %>%
    # tab_header(
    #     title = "XGBoost Feature importance",
    #     subtitle = "Features contributing to liking"
    # ) %>% 
    fmt_percent(
        columns = everything(),
        decimals = 1, 
        scale_values = TRUE
    )  %>%
    opt_row_striping() %>%
    tab_options(
        row_group.background.color = "gray90",
        table.border.top.width = px(3),
        table.border.bottom.width = px(3)
    )
Table 7: XGBoost importance of features contributing to liking
Feature Gain Cover Frequency
Sulcatone 62.7% 57.5% 44.4%
Beta ionone 23.5% 7.6% 9.3%
Linalool 8.3% 21.3% 24.1%
Total sugar 5.5% 13.6% 22.2%

8 Data Summary

The summarised volatile concentrations are reported below

> Warning: 'xfun::attr()' is deprecated.
> Use 'xfun::attr2()' instead.
> See help("Deprecated")
Table 8: Summary of VOC concentrations (ppb) by chemical species
Species CAS Number RT (min) RI Chemical Class n Mean SD SE Median Min Max N detected % detected
Hexanal 66-25-1 22.22 809 Aldehyde 118 18.38 21.25 1.96 9.84 0.03 85.25 118 100.00
Heptanal 111-71-7 26.91 912 Aldehyde 118 32.05 24.52 2.26 23.41 0.00 112.15 117 99.15
Methyl_hexanoate 106-70-7 27.69 929 Ester 118 42.84 199.13 18.33 0.00 0.00 2002.82 49 41.53
Benzaldehyde 100-52-7 30.3 988 Aldehyde 118 14.06 14.68 1.35 7.81 1.77 74.75 118 100.00
Sulcatone 110-93-0 30.53 993 Ketone 118 62.88 84.21 7.75 46.03 0.00 629.09 117 99.15
Octanal 124-13-0 31.43 1013 Aldehyde 118 7.05 5.93 0.55 5.22 0.00 33.49 117 99.15
P_cymene 99-87-6 32.89 1047 Terpene 118 5.20 3.25 0.30 5.11 0.00 20.54 112 94.92
Limonene 5989-27-5 33.15 1054 Terpene 118 26.41 39.93 3.68 12.14 0.00 250.14 87 73.73
Phenylacetaldehyde 122-78-1 33.70 1067 Aldehyde 118 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00
Terpinene 99-85-4 34.22 1079 Terpene 118 10.53 15.37 1.41 5.92 0.00 105.47 105 88.98
Linalool_oxide 34995-77-2 34.77/35.39 1092/1108 Terpene 118 52682.49 108929.32 10027.76 7529.00 0.00 747850.87 116 98.31
Linalool 78-70-6 35.56 1112 Terpene 118 7313.67 18543.25 1707.04 682.34 0.00 112060.59 116 98.31
Nonanal 124-19-6 35.65 1114 Aldehyde 118 9.50 7.91 0.73 8.09 0.43 36.41 118 100.00
Methyl_octanoate 111-11-5 36.14 1126 Ester 118 2.21 3.80 0.35 1.38 0.00 20.87 99 83.90
E_2_nonal 18829-56-6 37.94 1172 Aldehyde 118 0.53 0.88 0.08 0.00 0.00 5.13 43 36.44
Decanal 112-31-2 39.59 1216 Aldehyde 118 0.39 1.20 0.11 0.00 0.00 7.83 18 15.25
Beta_Cyclocitral 432-25-7 40.98 1255 Terpene 118 2.24 2.94 0.27 0.00 0.00 10.01 48 40.68
Gamma_octalactone 104-50-7 41.85 1280 Ester 118 11.38 13.69 1.26 6.45 1.17 63.22 118 100.00
BITC 622-78-6 46.21 1414 Sulfur 118 4.53 7.13 0.66 2.62 0.00 67.27 91 77.12
Neryl_Geranyl_acetone 3796-70-1 48.11 1478 Terpene 118 39.89 42.79 3.94 32.89 2.65 312.72 118 100.00
Beta_ionone 79-77-6 49.4 1524 Terpene 118 6.81 4.93 0.45 5.26 3.47 50.19 118 100.00

Sugar values reported in g/100gFW; total soluble solids (brix) reported in °Brix; titratable acids reported in %CA; VOCs are reported in ppb; GTR reported in ppm.

Table 9: Variable means for 27 papaya genotypes (standard deviation)

9 General information

This document was last updated at 2025-06-20 14:07:44 using R Markdown (built with R version 4.2.1 (2022-06-23 ucrt)). The source code for this website can be found on https://github.com/JoshLomax/Papaya_sensory_metabolomics.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is especially powerful at authoring documents and reports which include code and can execute code and use the results in the output. For more details on using R Markdown see http://rmarkdown.rstudio.com, R Markdown: The Definitive Guide and Rmarkdown cheatsheet.

Bibliography

Brewer S, Plotto A, Bai J, et al. (2021) Evaluation of 21 papaya (Carica papaya L.) accessions in southern Florida for fruit quality, aroma, plant height, and yield components. Scientia Horticulturae 288:110387. doi: 10.1016/j.scienta.2021.110387
Colantonio V, Ferrão LFV, Tieman DM, et al. (2022) Metabolomic selection for enhanced fruit flavor. Proc Natl Acad Sci 119:e2115865119. doi: 10.1073/pnas.2115865119
Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315:972–976. doi: 10.1126/science.1136800
Friedman JH (2001) Greedy function approximation: A gradient boosting machine. Ann Stat 29:1189–1232. doi: 10.1214/aos/1013203451
Lomax J, Ford R, Bar I (2024) Multi-omic applications for understanding and enhancing tropical fruit flavour. Plant Mol Biol 114:83. doi: 10.1007/s11103-024-01480-7
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829. doi: 10.1093/genetics/157.4.1819
Pico J, Gerbrandt EM, Castellarin SD (2022) Optimization and validation of a SPME-GC/MS method for the determination of volatile compounds, including enantiomeric analysis, in northern highbush blueberries (Vaccinium corymbosum L.). Food Chem 368:130812. doi: 10.1016/j.foodchem.2021.130812
Zhou Z, Bar I, Ford R, et al. (2022) Biochemical, Sensory, and Molecular Evaluation of Flavour and Consumer Acceptability in Australian Papaya (Carica papaya L.) Varieties. International Journal of Molecular Sciences 23:6313. doi: 10.3390/ijms23116313
Zhou Z, Ford R, Bar I, Kanchana-udomkan C (2021) Papaya (Carica papaya L.) Flavour Profiling. Genes 12:1416. doi: 10.3390/genes12091416