suppressPackageStartupMessages({

library(readxl)
library(readr)

library(limma)
library(stringr)
library(rvest)
library(kableExtra)
library(tidyverse)

library(BiocFileCache)

})

1 Introduction

Enrichment Analysis -also called Pathway Analysis- generically describes a variety of analyses, -not always related neither with enrichment or with pathways- that can be done, once a set of biological features has been obtained from an omics study, in order to gain insight on its biological interpretation.

1.1 Components required for enrichment analysis

Although the analyses can differ from one another the input is usually the same. It is usally required:

  1. A list of biological features, usually described by their identifiers in some biological database, such as HMDB or KEGG.
  2. Some type of table or relation between the identifiers and a knowlwdge database, which is queried during the analysis.
  3. [Optional] a background or universe that represents the set of all possible featueres that could have been obtained in the study. These may be the IDs of all features in a given “ome” (genome, metabolome, …) or also the IDs of all the features that were queried by the experiment, eg. genes or metabolites in a given panel,
  4. [Optional] An effect size (Fold change, correlation, t-test value,…) measuring the relevance of a given feature. This document contains the preprocessing and analysis of two metabolomics datasets that will be used for benchmarking several enrichment analysis packages.

In this document we present and prepare three datasets, that will be later used for Enrichment Analyses.

The datasets are:

  • The Cachexia dataset a well known case study used in many MetaboAnalyst tutorials.
  • The MST000291 dataset, resulting from a nutritional study, and used by the fobitools package as an example. The original data required to generate the lists are available in MetabolomicsWorkbench.
  • The KrasG12D mutation dataset, a list of cancer related metabolites used by the Enrichmet package in a tutorial.

2 The cachexia dataset

This is an example introduced by Xia et alt. in MetaboAnalyst and is used in several tutorials on how to use this software.

The dataset is available as a .csv file in the metaboData repository in github (https://github.com/nutrimetabolomics/metaboData/blob/main/Datasets/2024-Cachexia/human_cachexia.csv). For implicity it has been downladed from there and re-stored as an Excel File.

human_cachexia <- read_excel("data/human_cachexia.xlsx")
dim(human_cachexia)
## [1] 63 78
colnames(human_cachexia)[c(1:3, 20:22, 70:72)]
## [1] "Metabolite"       "cachPIF_178"      "cachPIF_087"      "cachPIF_113"     
## [5] "cachPIF_143"      "cachNETCR_007_V1" "ctrlNETCR_009_V1" "ctrlNETCR_009_V2"
## [9] "ctrlNETL_007_V1"

There seem to be two groups defined by the prefix cach / ctrl in the data

2.1 Selection of differentially quantified metabolites

In order to run an Over Representation Anaalysis (ORA) we only need a list of differentially quantified metabolites which we usually may obtain from some comparison test.

This dataset, however, contains only metabolites that have been shown to be significantly different among the groups. so, in principle this previous step may not seem necessary.

Anyway the comparison is performed, not only because it allows to check this, but also because it will provide an effect size measure that can be use for MSEA analysis.

The code below shows the analysis performed below, using the Bioconductor limma package to compare the two groups provided in the dataset.

# 1. Definir el vector de grups a partir dels noms de les columnes
sample_names <- colnames(human_cachexia)[-1]   # excepte 'Metabolite'

group <- ifelse(str_detect(sample_names, "^ctrl"), "Control", "Cachexia")
group <- factor(group)

# 2. Construir la matriu d'expressió (metabolits x mostres)
X <- as.matrix(human_cachexia[ , -1])
rownames(X) <- human_cachexia$Metabolite

# 3. Disseny experimental
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

fit <- lmFit(X, design)
cont.matrix <- makeContrasts(Cachexia - Control, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

topCachexia <- topTable(fit2, number = Inf) %>% mutate(ordByLimma =1:nrow(X))

head(round(topCachexia,3),10)
tail (round(topCachexia,3))

Alternatively a row-wise t-test can be applied

topCachexiaByT<- genefilter::rowttests(x=X, fac=group) %>% mutate(p.adj= p.adjust(p.value, method="BH")) %>% arrange(p.value) %>% mutate(orderByT=1:nrow(X))

Both tests yield similar results as can be seen in the naive comparison showed below.

head(round(topCachexia[,c(3,1,4,5,7)],4))
head(round(topCachexiaByT,4))
merged_table <- merge(
  topCachexia[,c(3,1,4,5,7)],
  topCachexiaByT,
  by="row.names"
) %>% arrange(p.value)

rownames(merged_table) <- merged_table$Row.names
merged_table$Row.names <- NULL

The results of both tests can be saved to disk for later use

# 6. Mostrar els primers resultats

save(topCachexia, topCachexiaByT, file="data/Cachexia_topTable.Rda")
write.table(rownames(topCachexia), file="data/cachexia_ids.txt", row.names = FALSE, quote=FALSE)

2.2 Selected Metabolites and its identifiers

Metabolites in the table are identified by their common names, a type of identifier very useful for scientists working with these metabolites at chemical or nutritional level.

However, most enrichment analysis tools require a more objective ID from public databases such as PubChem, HumanDB or KEGG.

Retrieving metabolites by their name (for example from MetaboliteIDmapping) may be a frustrating task because small changes in the names make them unrecognizable.

In this case, however, the associated identifiers for these metabolites, when available, have been obtained using MetaboAnalyst Compound ID converter.

Probably the fact that the example came from MtaboAnalyst has greatly facilitated the conversion.

The equivalence names-IDs has been saved as a .csv file

cachexia_map <- read.csv("data/Cachexia_map.csv")
head(cachexia_map)

The resulting identifiers may be used as input for enrichment analysis.

2.3 Components for Enrichment Analysis

If we check the required elements described above we have:

  1. The list of selected identifiers? YES: Any of the columns in the cachexia_map table HMDB, KEGG.
  2. A table relating Pathwyas and IDs YES. Although it is not provided with the example -because MetaboAnalystR usually queries the web- a standard EnrichmentSet object relating the IDs and the KEGG or the SMPDB database are available as described in the document DB4Metabolomics.html.
  3. The background? NO. We don’t know which metabolites were analyzed in this study, so, by default we will use the whole Pathway Database.
  4. An effect size? YES. Either the difference groupCachexia-groupControl or the associated F-statistic.

3 The ST000291 dataset

This dataset comes from a study that has been uploaded to Metabolomics Workbench with ID ST0000291: LC-MS Based Approaches to Investigate Metabolomic Differences in the Urine of Young Women after Drinking Cranberry Juice or Apple Juice

WARNING Although the original study followed a 2×2 crossover design, the Metabolomics Workbench dataset (ST000291) does not include subject identifiers, and therefore paired or repeated-measures models cannot be applied. As a result, treatment groups must be analysed as independent samples.

This choice is conservative: ignoring the within-subject pairing increases residual variance and therefore raises the probability of false negatives, not false positives. Consequently, any metabolites that remain significant under this independent-samples analysis are likely to represent strong and biologically robust treatment effects.

Once downloaded the data consists of three separate files

  • The features data
    • 1541 variables, 45 samples
  • The metadata
    • 45 rows, two columns (sample name, group name)

The metabolite names - 1541 row, 3 columns (original name, PubChem ID and KEGG ID)

The downloaded data are stored in a particular format “MWTab” so some functions to extract data from it have been prepared.

3.1 Data loading and preprocessing

Data in Metabolomics Workbench is not preprocessed so some work has to be performed before they can be analyzed.

The preprocessing steps are performed using the POMA Bioconductor package and can be found in the acompanying script Rcode/scripts/MW_ST000291_getTopTables.R

After preprocessing the Bioconductor limma package may be used to perform pairwise comparisons, whose results have been saved to a binary file.

limma_res = list(ApplvsBase = limma_resAB, 
                 CranvsBase = limma_resCB,
                 ApplvsCran = limma_resAC)

3.2 Restictring the analysis to metabolites with HMDB_ID

Pubchem identifiers are widely used but they be ambiguous and this may result in small number of coincidences among the list and the EnrichmentSet IDs, which have been created separately.

To simplify this we have converted PubChem IDs to other systems using again that has been generated MetaboAnalyst ID converter.

The list of equivalences has been stored in file MW_ST000291__map.csv

MW_map <- readr::read_csv("data/MW_ST000291__map.csv", show_col_types = FALSE) %>%
  janitor::clean_names() %>%
  mutate(
    PubChemCID = as.character(query), 
    HMDB = ifelse(hmdb == "" | is.na(hmdb), NA, hmdb)
  ) %>%
  select(PubChemCID, HMDB)

head(MW_map)

This has been used to filter out all metabolites not mapped to HMDB.

The analysis has been reproduced with this restricted dataset and a new topTable for the comparison Appl vs Cranberry has been generated.

The resulting topTables have been saved to file=data/MW_ST000291-limma_res.Rda Focussing on the HMDB based top table:

load("data/MW_ST000291-limma_res.Rda")
head(limma_HMDB_CB)

3.3 Components for Enrichment Analysis

If we check the required elements described above we have:

  1. The list of selected identifiers? YES: The first 19 (adj.pvalue < 0.49) identifiers in the column Metabolite
  2. A table relating Pathwyas and IDs YES. Although it is not provided with the example a standard EnrichmentSet object relating the IDs and the KEGG or the SMPDB database are available as described in the document DB4Metabolomics.html.
  3. The background? YES. We can use as background the set of all identifiers in the column “metabolite”
  4. An effect size? YES. Either the columns log2FC or the associated t-statistic.

4 The KrasG12D mutation dataset

The enrichmet package includes a tutorial that uses these example data, available from Zotero (records: 15498097).

The data used in the tutorial are described only briefly as summary statistics from a differential metabolomics analysis involving KRAS G12D mutation and KRAS G12D mutation with TAp73 deletion. The authors indicate that they selected the top 57 metabolites, represented using KEGG identifiers, and used as background the set of all human KEGG metabolites associated with KEGG pathways. The ranked metabolite list appears to include at least log2 fold change and p-value information, since these variables are used for the MetSEA analysis. Beyond this, the article does not report the original data matrix, sample size, experimental platform, biological model, preprocessing steps, or the full list of metabolites, so the dataset is only minimally documented in the paper.

For ease of reproducibility we have previously downloaded all the objects included in the use-case and saved them into a binary file: data/enrichmet_example.Rda

4.1 Components require for enrichment analysis

The objects provided by the package can be directly mapped to what we have called the “components require for enrichment:

  1. The list of selected identifiers? YES: A list of 57 KEGG identifiers representing the statistically significant metabolites are stored in the column met_id object EM_summaryStats.
  2. A table relating Pathwyas and IDs YES. The table EM_PathwayVsMetabolites provided with the example data relates Pathways with their associated KEGG IDs. For simplicity of uses a standard EnrichmentSet object has been prepared from this object.
  3. The background? NO. The authors have decided to use as background all the metabolites, which means all metabolites contained in the previous table.
  4. An effect size? YES. Columns pval or log2fc from object EM_summarystats can be used as effect size measures.

4.2 Optional Input Files

Enrichmetmay use some optional files to enhance the analysis and its biological interpretation:

  • KEGGLookup
    A reference file linking KEGG compound IDs (https://www.kegg.jp/) to metabolite names or other annotations, which helps standardize identifiers and improve pathway mapping.

  • STITCHInteractions
    A dataset of known or predicted interactions between metabolites or proteins, obtained from the freely available STITCH database (http://stitch.embl.de/). These interactions are useful for constructing networks to support pathway analysis. The example file is supplied as: chemical_chemical.tsv. Users may provide their own interaction data appropriate for their species.

  • mapping_df
    A mapping data frame that links STITCH identifiers with metabolite names, facilitating integration between datasets. The provided example file is available here: mapping_df.xlsx
    Users can substitute this with a mapping file relevant to their species of interest.

4.3 Data downloading and preparation

The code below describes how the data can be downloaded and saved into a binary file.

# InputMetabolites: A vector or list containing the metabolites of interest.
inputMetabolites <- c(
  "C02721", "C03736", "C03139", "C00074", "C02862", "C00092", "C00275",
  "C00386", "C10172", "C00361", "C00242", "C19806", "C16358", "C16357",
  "C16353", "C02140", "C05488", "C01181", "C03017", "C02632", "C00246",
  "C00079", "C08317", "C00588", "C16527", "C00065", "C00249", "C00836",
  "C02990", "C03690", "C04025", "C06424", "C00123", "C00606", "C03067",
  "C00539", "C00633", "C00519", "C00366", "C00148", "C00219", "C08261",
  "C00385", "C06429", "C05637", "C00295", "C06428", "C16513", "C01909",
  "C03299", "C01530", "C01571", "C08278", "C00599", "C15587", "C00144",
  "C03739"
)
# Standard cache location
cache_dir <- tools::R_user_dir("enrichmet", "cache")
bfc <- BiocFileCache(cache_dir, ask = FALSE)

# Helper function to download & cache any file from Zenodo
get_cached_file <- function(url) {
  bfcrpath(bfc, url)
}

# -------------------------
# Load files with caching
# -------------------------

# PathwayVsMetabolites
pathway_url <- "https://zenodo.org/api/records/15498097/files/Human-PathwaysVsMetabolites.csv/content"
pathway_path <- get_cached_file(pathway_url)
PathwayVsMetabolites <- read.csv(pathway_path)
str(PathwayVsMetabolites)
## 'data.frame':    351 obs. of  2 variables:
##  $ Pathway    : chr  "Glycolysis / Gluconeogenesis" "Citrate cycle (TCA cycle)" "Pentose phosphate pathway" "Pentose and glucuronate interconversions" ...
##  $ Metabolites: chr  "C00022,C00024,C00031,C00033,C00036,C00068,C00074,C00084,C00085,C00103,C00111,C00118,C00186,C00197,C00221,C00236"| __truncated__ "C00022,C00024,C00026,C00036,C00042,C00068,C00074,C00091,C00122,C00149,C00158,C00311,C00417,C05125,C05379,C05381"| __truncated__ "C00022,C00031,C00085,C00117,C00118,C00119,C00121,C00197,C00198,C00199,C00204,C00221,C00231,C00257,C00258,C00279"| __truncated__ "C00022,C00026,C00029,C00103,C00111,C00116,C00167,C00181,C00191,C00199,C00204,C00216,C00231,C00259,C00266,C00309"| __truncated__ ...
# SummaryStatistics
example_url <- "https://zenodo.org/api/records/15498097/files/example_data.xlsx/content"
example_path <- get_cached_file(example_url)
example_data <- read_excel(example_path)
show(example_data[1:3,])
## # A tibble: 3 × 4
##   met_id      pval    padj log2fc
##   <chr>      <dbl>   <dbl>  <dbl>
## 1 C02721 0.0000124 0.00315  -2.32
## 2 C03736 0.0000694 0.00739   2.63
## 3 C03139 0.0000869 0.00739   1.75
# KEGGLookup
kegg_url <- "https://zenodo.org/api/records/15498097/files/kegg_lookup.xlsx/content"
kegg_path <- get_cached_file(kegg_url)
kegg_lookup <- read_excel(kegg_path)
head(kegg_lookup)
# Mapping_df
mapping_url <- "https://zenodo.org/api/records/15498097/files/mapping_df.xlsx/content"
mapping_path <- get_cached_file(mapping_url)
mapping_df <- read_excel(mapping_path)
head(mapping_df)
# STITCHInteractions
# stitch_url <- "https://zenodo.org/records/15498097/files/chemical_chemical.tsv"
# stitch_path <- get_cached_file(stitch_url)
# stitch_df <- read_tsv(stitch_path)
# head(stitch_df)

The data from “EnrichMetExample” is not an EnrichmentSet, so a new one is created from the PathwayvsMetabolites data.frame:

First we prepare the long version of PathwayVsMetabolites:

tmp <- PathwayVsMetabolites %>%
  mutate(Metabolites = strsplit(Metabolites, ","))

unnested <- tidyr::unnest(tmp, cols = c(Metabolites))

PathwayVsMetabolites_long <- dplyr::rename(unnested, kegg_id = Metabolites)

head(PathwayVsMetabolites_long)
PathwayVsMetabolites_long <- PathwayVsMetabolites_long %>%
  mutate(PathwayID = paste0("PW_", as.integer(as.factor(Pathway))))
head(PathwayVsMetabolites_long)
PathwayVsMetabolites_long <- PathwayVsMetabolites_long %>%
  left_join(kegg_lookup, by = c("kegg_id" = "kegg_id"))
## Warning in left_join(., kegg_lookup, by = c(kegg_id = "kegg_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 132 of `x` matches multiple rows in `y`.
## ℹ Row 159 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(PathwayVsMetabolites_long)

With this we can directly build the EnrichmentSet object:

require(localEnrichment)
## Cargando paquete requerido: localEnrichment
KEGGset_KEGG <- buildEnrichmentSet(
  data = PathwayVsMetabolites_long,
  id_col = "kegg_id",
  category_col = "Pathway",
  set_id_col = "PathwayID",
  set_name = "KEGG_hsa_pathways",
  source = "KEGG",
  species = "Homo sapiens",
  version = Sys.Date(),
  description = "KEGG human pathways mapped to KEGG compound IDs",
  sep = ";"
)

summary(KEGGset_KEGG)
## EnrichmentSet summary:
##   Mapping name: KEGG_hsa_pathways 
##   Source: KEGG 
##   Feature IDs: kegg_id 
##   Number of sets: 351 
##   Mean set size: 16.37607 features
##   Median set size: 6 features
head(KEGGset_KEGG@data)

All the data is saved as a binary file to be used later

EM_inputMetabolites <- inputMetabolites
EM_summaryStats <- example_data
EM_PathwayVsMetabolites<- PathwayVsMetabolites
EM_mapping_df <- mapping_df
# EM_stitch_df <- stitch_df
EM_kegg_lookup <- kegg_lookup
save(EM_inputMetabolites, KEGGset_KEGG, EM_PathwayVsMetabolites, EM_summaryStats, EM_kegg_lookup, EM_mapping_df, file="data/enrichmet_example.Rda")