Enrichment analysis aims to facilitate the biological interpretation of omics results by relating lists of features (genes, proteins, metabolites) to prior biological knowledge stored in databases (pathways, ontologies, chemical classes, etc.).
In genomics, ORA and GSEA workflows are relatively well standardized,
and code-based (clusterProfiler or graphical-interface
(g-profiler or Reactome) are well established
tools.
In metabolomics, however, the scenario is more-fragmented and available tools take more heterogeneorus approaches and rely on different identifier systems and pathway databases.
The goals of this document are:
We will focus on:
enrichmet (pathway enrichment, GSEA, centrality,
plots)localEnrichment (our own infrastructure for metabolite
sets and ORA/GSEA)If time permits we will progressively incorporate other tools/approaches such as:
ChemRich (https://raw.githubusercontent.com/barupal/ChemRICH/refs/heads/master/chemrich_minimum_analysis.R)Fella (https://bioconductor.org/packages//release/bioc/vignettes/FELLA/inst/doc/quickstart.html)MetaboAnalystR, which has become very hard to make it
work so the examples will be run from its web site and downloaded.Rather than exhaustively exploring each tool, we restrict ourselves to a reproducible workflow:
We have selected a few public metabolite datasets. The selection has been based on “popularity” and availability, but the workflow should be easily adaptable to new datasets.
We have prepared an ad-hoc collection of metabolite
sets, that is, Metabolite sets that have been programatically
prepared, as data.frames and EnrichmentSet (an S4 class
created to store Metabolite Sets Collections), from the
Lheuristic package, to be used for these analyses.
For each dataset we have prepared:
Given these components we will run enrichment analysis on them (ORA and GSEA when possible) with the different tools on the same sets.
As more analyses are avilable we will increase comparisons among results to check which analyses are related and which lead different results because they are indeed different analyses.
We have decided to adopt the term GSEA also for metabolomics to avoid confusions because the authors who introduced the term MSEA (Xia and Wishart 2010) used this term to refer to both ORA and GSEA applied to metabolite sets.
Any pathway analysis requires, at least, two inputs:
A list of features usually associated with a phenotype (e.g. metabolites differentially quantified between two groups).
A source of annotations for these features, that can be generically described as Feature sets or, as in this case Metabolite Sets.
We use three datasets that have been preprocessed elsewhere (see file: PrepareData-4-EnrichmentAnalysis.Rmd) and
The ia dataset. This is a popular example used
by MetaboAnalyst, where a comparison between affected and
control patients is performed. It has the drawback that only
differential metabolites are provided, which makes it only
partially interesting for some types of analyses. We keep it for its
popularity.
The ST000291 dataset, downloaded from
Metabolomics Workbench, and used in a fobitools use case.
This is obtained by a nutritional intervention study where a comparion
among consumption of Apple, Cranberries and Baseline has been performed.
In this case we have the full dataset and the results of comparing the
groups using limma whih leads to around 20 differential
metabolites. Here, we focus only on the Cranberry vs Baseline
comparison, although lists from other comparisons are also available and
may be used when a compare-comparisons approach is considered. This
study seems to have been made using a cross-over design that the authors
have later omitted when providing the samples. Because it is only for
instructional purposes, this flaw will be ignored.
The KrasG12D mutation dataset, a list of cancer related
metabolites used by the Enrichmet package in one of its
tutorial.
The data for this use case has been prepared elsewhere and is
available in the data directory.
For each dataset we have the list of metabolites, the analysis results in the form of a topTable and a mapping performed with MetaboAnalyst between the metabolite ids and several databases as shown below.
# Cachexia: topTable + ID mapping
cachexia_map_path <- "data/Cachexia_map.csv"
cachexia_tt_path <- "data/Cachexia_topTable.Rda"
Cachexia_map <- read.csv(cachexia_map_path, stringsAsFactors = FALSE)
load(cachexia_tt_path) # loads 'topCachexia'
cat("Cachexia_map:\n")
## Cachexia_map:
print_tbl(head(Cachexia_map))
## # A tibble: 6 × 9
## Query Match HMDB PubChem ChEBI KEGG METLIN SMILES Comment
## <chr> <chr> <chr> <int> <chr> <chr> <int> <chr> <int>
## 1 2-Hydroxyisobutyrate 2-Hydrox… HMDB… 4277439 19641 <NA> NA CC(C)… 1
## 2 Creatinine Creatini… HMDB… 588 16737 C007… 8 CN1CC… 1
## 3 Quinolinate Quinolin… HMDB… 1066 16675 C037… 330 OC(=O… 1
## 4 Valine L-Valine HMDB… 6287 16414 C001… 5842 CC(C)… 1
## 5 Dimethylamine Dimethyl… HMDB… 674 17170 C005… 3758 CNC 1
## 6 Tryptophan L-Trypto… HMDB… 6305 16828 C000… 5879 N[C@@… 1
cat("\ntopCachexia:\n")
##
## topCachexia:
print_tbl(head(topCachexia))
## # A tibble: 6 × 7
## logFC AveExpr t P.Value adj.P.Val B ordByLimma
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 44.4 66.4 4.06 0.000119 0.00426 -3.92 1
## 2 25.4 35.7 3.97 0.000164 0.00426 -3.95 2
## 3 20.9 26.3 3.90 0.000203 0.00426 -3.97 3
## 4 17.7 24.4 3.72 0.000378 0.00431 -4.02 4
## 5 245. 358. 3.69 0.000417 0.00431 -4.03 5
## 6 151. 211. 3.67 0.000455 0.00431 -4.04 6
# ST000291: Cranberry vs Baseline topTable + ID mapping
st291_map_path <- "data/MW_ST000291__map.csv"
st291_tt_path <- "data/MW_ST000291-limma_res.Rda"
MS_ST000291_map <- read.csv(st291_map_path, stringsAsFactors = FALSE)
load(st291_tt_path) # loads 'topCranVSBase'
topCranVSBase<- limma_resCB; rm(limma_resCB)
topCranVSBase_HMDB<- limma_HMDB_CB; rm(limma_HMDB_CB)
cat("\nMS_ST000291_map:\n")
##
## MS_ST000291_map:
print_tbl(head(MS_ST000291_map))
## # A tibble: 6 × 9
## Query Match HMDB PubChem ChEBI KEGG METLIN SMILES Comment
## <int> <chr> <chr> <int> <int> <chr> <int> <chr> <int>
## 1 192798 Digitalose <NA> 192798 NA <NA> NA <NA> 1
## 2 79034 12-Hydroxydodecanoic… HMDB… 79034 39567 C083… 6465 OCCCC… 1
## 3 10413 4-Hydroxybutyric acid HMDB… 10413 30830 C009… 5678 OCCCC… 1
## 4 439230 Mevalonic acid HMDB… 439230 17710 C004… 127 C[C@@… 1
## 5 20975673 <NA> <NA> NA NA <NA> NA <NA> 0
## 6 5275508 <NA> <NA> NA NA <NA> NA <NA> 0
cat("\ntopCranVSBase:\n")
##
## topCranVSBase:
print_tbl(head(topCranVSBase))
## # A tibble: 6 × 7
## PubChemCID log2FC AveExpr t pvalue adj_pvalue B
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 54678503 2.92 3.57e-16 6.45 0.0000000623 0.0000847 8.11
## 2 71485 3.27 -2.40e-16 5.85 0.000000484 0.000329 6.19
## 3 94214 2.18 -8.93e-17 5.69 0.000000843 0.000382 5.68
## 4 5378303 2.41 1.10e-16 5.54 0.00000140 0.000403 5.20
## 5 3037 3.18 8.39e-17 5.46 0.00000184 0.000403 4.95
## 6 10178 3.33 3.81e-17 5.45 0.00000189 0.000403 4.92
cat("\ntopCranVSBase (HMDB):\n")
##
## topCranVSBase (HMDB):
print_tbl(head(topCranVSBase_HMDB))
## # A tibble: 6 × 7
## feature log2FC AveExpr t pvalue adj_pvalue B
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HMDB0251200 3.18 8.39e-17 5.48 0.00000172 0.000876 5.01
## 2 HMDB0248806 2.26 3.85e-17 5.34 0.00000281 0.000876 4.54
## 3 HMDB0258615 2.05 2.68e-16 5.00 0.00000881 0.00154 3.47
## 4 HMDB0006115 1.56 -4.28e-16 4.96 0.00000986 0.00154 3.36
## 5 HMDB0000738 1.46 1.01e-15 4.72 0.0000221 0.00233 2.60
## 6 HMDB0000714 1.51 -8.74e-16 4.72 0.0000225 0.00233 2.58
enrichmet package example datasetenrichmet_path <- "data/enrichmet_example.Rda"
load(enrichmet_path) # loads 'topCranVSBase'
cat("\nEM_summaryStats:\n")
##
## EM_summaryStats:
print_tbl(head(EM_summaryStats))
## # A tibble: 6 × 5
## met_id pval padj log2fc met_id_raw
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 C02721 0.0000124 0.00315 -2.32 C02721
## 2 C03736 0.0000694 0.00739 2.63 C03736
## 3 C03139 0.0000869 0.00739 1.75 C03139
## 4 C00074 0.000120 0.00763 -3.22 C00074
## 5 C02862 0.000198 0.00819 -1.70 C02862|
## 6 C00386 0.000225 0.00819 1.68 C00386
# cat("\nEM_mapping_df:\n")
# print_tbl(head(EM_mapping_df))
# cat("\nEM_kegg_lookup:\n")
# print_tbl(head(EM_kegg_lookup))
# cat("\nEM_PathwayVsMetabolites:\n")
# print_tbl(head(EM_PathwayVsMetabolites))
# cat("\nKEGGset_KEGG:\n")
# show(KEGGset_KEGG)
We also use a unified collection of metabolite sets, previously constructed and saved as:
KEGGset_HMDB, KEGGset_HMDB_dfKEGGset_PubChem, KEGGset_PubChem_dfChemicalClassSet, ChemicalClassSet_dfSMPDBset, SMPDBset_dfAll sets collections, but KEGGset_PubChem use
HMDB IDs as feature identifiers.
KEGGset_PubChem uses PubChemIDs.
Metabolite sets have been prepared elsewhere and saved as
EnrichmentSet objects of the class defined in the
localEnrichment package.
# Metabolite sets (EnrichmentSet collection)
# mets_collection_path <- "databases/MetaboliteSets_Collection.Rda"
# load(mets_collection_path)
load("../Metabolites_Sets/results/MetaboliteSets_Collection.Rda")
### Parche. Pendent de netejar-ho al script que ho crea
KEGGset_HMDB_df <- df_HMDB
KEGGset_PubChem_df <- df_PubChem
###
cat("\nEnrichmentSet objects loaded:\n")
##
## EnrichmentSet objects loaded:
summary(KEGGset_HMDB); cat("\n")
## EnrichmentSet summary:
## Mapping name: KEGG_pathways_hsa_HMDB
## Source: KEGG
## Feature IDs: HMDB
## Number of sets: 283
## Mean set size: 22.5689 features
## Median set size: 10 features
summary(KEGGset_PubChem); cat("\n")
## EnrichmentSet summary:
## Mapping name: KEGG_pathways_hsa_PubChem
## Source: KEGG
## Feature IDs: PubChem
## Number of sets: 267
## Mean set size: 18.15356 features
## Median set size: 8 features
summary(ChemicalClassSet); cat("\n")
## EnrichmentSet summary:
## Mapping name: ChemicalClasses
## Source: HMDB
## Feature IDs: HMDB
## Number of sets: 31
## Mean set size: 13.54839 features
## Median set size: 4 features
summary(SMPDBset); cat("\n")
## EnrichmentSet summary:
## Mapping name: SMPDB_pathways
## Source: SMPDB
## Feature IDs: HMDB
## Number of sets: 48654
## Mean set size: 17.30571 features
## Median set size: 18 features
cat("\nKEGGset_HMDB_df:\n")
##
## KEGGset_HMDB_df:
print_tbl(head(KEGGset_HMDB_df))
## # A tibble: 6 × 3
## set_name set_id Metabolites
## <chr> <chr> <chr>
## 1 Glycolysis / Gluconeogenesis hsa00010 HMDB0000243;HMDB0001206;HMD…
## 2 Citrate cycle (TCA cycle) hsa00020 HMDB0000243;HMDB0001206;HMD…
## 3 Pentose phosphate pathway hsa00030 HMDB0000243;HMDB0000124;HMD…
## 4 Pentose and glucuronate interconversions hsa00040 HMDB0000243;HMDB0000208;HMD…
## 5 Fructose and mannose metabolism hsa00051 HMDB0000124;HMDB0001163;HMD…
## 6 Galactose metabolism hsa00052 HMDB0000286;HMDB0000302;HMD…
cat("\nKEGGset_PubChem_df:\n")
##
## KEGGset_PubChem_df:
print_tbl(head(KEGGset_PubChem_df))
## # A tibble: 6 × 3
## set_name set_id Metabolites
## <chr> <chr> <chr>
## 1 Glycolysis / Gluconeogenesis hsa00010 1060;107735;176;175;970;906…
## 2 Citrate cycle (TCA cycle) hsa00020 1060;107735;51;164533;970;1…
## 3 Pentose phosphate pathway hsa00030 1060;107735;439168;5311110;…
## 4 Pentose and glucuronate interconversions hsa00040 1060;107735;51;164533;8629;…
## 5 Fructose and mannose metabolism hsa00051 439709;668;439168;107689;79…
## 6 Galactose metabolism hsa00052 8629;5988;439709;668;753;43…
cat("\nChemicalClassSet_df:\n")
##
## ChemicalClassSet_df:
print_tbl(head(ChemicalClassSet_df))
## # A tibble: 6 × 3
## set_name set_id Metabolites
## <chr> <chr> <chr>
## 1 Acylcarnitines Acylcarnitines HMDB0000062;HMDB0000201;HMDB0…
## 2 Amine oxide Amine oxide HMDB0000925
## 3 Amino Acids Amino Acids HMDB0000161;HMDB0000517;HMDB0…
## 4 Amino Acids Derivatives Amino Acids Derivatives HMDB0000092;HMDB0000766;HMDB0…
## 5 Amino acid - related Amino acid - related HMDB0000510;HMDB0000452;HMDB0…
## 6 Azoles Azoles HMDB0000462
cat("\nSMPDBset_df:\n")
##
## SMPDBset_df:
print_tbl(head(SMPDBset_df))
## # A tibble: 6 × 3
## set_name set_id Metabolites
## <chr> <chr> <chr>
## 1 Citrullinemia Type I SMP0000001 HMDB0000641;HMDB0000161;…
## 2 Carbamoyl Phosphate Synthetase Deficiency SMP0000002 HMDB0000641;HMDB0000161;…
## 3 Argininosuccinic Aciduria SMP0000003 HMDB0000641;HMDB0000161;…
## 4 Glycine and Serine Metabolism SMP0000004 HMDB0002134;HMDB0001377;…
## 5 Pterine Biosynthesis SMP0000005 HMDB0006822;HMDB0002111;…
## 6 Tyrosine Metabolism SMP0000006 HMDB0000158;HMDB0000208;…
The Metabolite sets for the Enrichmet package example
data have been loaded when loading the example data but are shown here
fore consistency.
They consist of a data.frame and an
EnrichmentSet object that has been preprare dfrom this
cat("\nEM_PathwayVsMetabolites:\n")
##
## EM_PathwayVsMetabolites:
print_tbl(head(EM_PathwayVsMetabolites))
## # A tibble: 6 × 2
## Pathway Metabolites
## <chr> <chr>
## 1 Glycolysis / Gluconeogenesis C00022,C00024,C00031,C00033,C00036,C…
## 2 Citrate cycle (TCA cycle) C00022,C00024,C00026,C00036,C00042,C…
## 3 Pentose phosphate pathway C00022,C00031,C00085,C00117,C00118,C…
## 4 Pentose and glucuronate interconversions C00022,C00026,C00029,C00103,C00111,C…
## 5 Fructose and mannose metabolism C00085,C00095,C00096,C00111,C00118,C…
## 6 Galactose metabolism C00029,C00031,C00052,C00085,C00089,C…
cat("\nKEGGset_KEGG:\n")
##
## KEGGset_KEGG:
show(KEGGset_KEGG)
## EnrichmentSet: KEGG_hsa_pathways
## Source: KEGG
## Feature IDs: kegg_id
## Number of sets: 351
## Example set: ABC transporters
print_tbl(head(KEGGset_KEGG@data))
## # A tibble: 6 × 4
## set_id set_name feature_ids feature_list
## <chr> <chr> <chr> <list>
## 1 PW_1 ABC transporters C00009;C00025;C… <chr [138]>
## 2 PW_10 Aldosterone-regulated sodium reabsorption C00238;C00575;C… <chr [8]>
## 3 PW_100 Epstein-Barr virus infection C00076;C01245;C… <chr [3]>
## 4 PW_101 ErbB signaling pathway C00076;C00165;C… <chr [4]>
## 5 PW_102 Estrogen signaling pathway C00076;C00165;C… <chr [8]>
## 6 PW_103 Ether lipid metabolism C00670;C00958;C… <chr [25]>
KEGGset_KEGG_df <- as.MetaboliteSetDataFrame(KEGGset_KEGG, "both")
Notice that Metabolite identifiers for this example data are neither HMDB, nor PubChem but KEGG ids!
Our aim is to derive signatures and
backgrounds that are compatible with the
EnrichmentSet objects, which use HMDB
IDs.
We therefore:
filterEnrichmentSet(Eset, ids) to restrict the
metabolite sets to those that contain at least one of the signature
metabolites.library(tibble)
## Warning: package 'tibble' was built under R version 4.4.3
# Convert topCachexia to tibble with metabolite name
cachexia_tt <- topCachexia %>%
rownames_to_column(var = "Metabolite") %>%
left_join(Cachexia_map, by = c("Metabolite" = "Query"))
cat("Cachexia limma table with mapping:\n")
## Cachexia limma table with mapping:
dim(cachexia_tt)
## [1] 63 16
print_tbl(cachexia_tt)
## # A tibble: 63 × 16
## Metabolite logFC AveExpr t P.Value adj.P.Val B ordByLimma Match HMDB
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 Quinolinate 44.4 66.4 4.06 1.19e-4 0.00426 -3.92 1 Quin… HMDB…
## 2 Valine 25.4 35.7 3.97 1.64e-4 0.00426 -3.95 2 L-Va… HMDB…
## 3 N,N-Dimeth… 20.9 26.3 3.90 2.03e-4 0.00426 -3.97 3 Dime… HMDB…
## 4 Leucine 17.7 24.4 3.72 3.78e-4 0.00431 -4.02 4 Leuc… HMDB…
## 5 Dimethylam… 245. 358. 3.69 4.17e-4 0.00431 -4.03 5 Dime… HMDB…
## 6 Pyroglutam… 151. 211. 3.67 4.55e-4 0.00431 -4.04 6 Pyro… HMDB…
## # ℹ 57 more rows
## # ℹ 6 more variables: PubChem <int>, ChEBI <chr>, KEGG <chr>, METLIN <int>,
## # SMILES <chr>, Comment <int>
tail(cachexia_tt)
## Metabolite logFC AveExpr t P.Value adj.P.Val
## 58 Acetone 4.926383 11.42701 0.88147387 0.3808486 0.4136804
## 59 Tartrate 18.558681 40.00403 0.76840327 0.4446356 0.4747804
## 60 4-Hydroxyphenylacetate 20.023887 112.02104 0.70996703 0.4799035 0.5038987
## 61 Pantothenate -12.678624 44.88377 -0.62491564 0.5339035 0.5514086
## 62 Uracil 5.020284 35.55766 0.60799978 0.5450055 0.5537959
## 63 1-Methylnicotinamide -2.590745 71.57364 -0.08306038 0.9340225 0.9340225
## B ordByLimma Match HMDB PubChem ChEBI
## 58 -4.609633 58 Acetone HMDB0001659 180 15347
## 59 -4.619378 59 Tartaric acid HMDB0000956 444305 15671
## 60 -4.623907 60 p-Hydroxyphenylacetic acid HMDB0000020 127 18101
## 61 -4.629872 61 Pantothenic acid HMDB0000210 6613 46905
## 62 -4.630970 62 Uracil HMDB0000300 1174 17568
## 63 -4.650155 63 1-Methylnicotinamide HMDB0000699 457 16797
## KEGG METLIN SMILES Comment
## 58 C00207 3745 CC(C)=O 1
## 59 C00898 NA O[C@H]([C@@H](O)C(O)=O)C(O)=O 1
## 60 C00642 130 OC(=O)CC1=CC=C(O)C=C1 1
## 61 C00864 NA CC(C)(CO)[C@@H](O)C(=O)NCCC(O)=O 1
## 62 C00106 258 O=C1NC=CC(=O)N1 1
## 63 C02918 5667 C[N+]1=CC=CC(=C1)C(N)=O 1
# p-value column and threshold
p_col_cachexia <- "adj.P.Val"
alpha_cachexia <- 0.05
# Signature: HMDB IDs of significant metabolites
cachexia_sig <- cachexia_tt %>%
filter(!is.na(HMDB),
!!rlang::sym(p_col_cachexia) < alpha_cachexia) %>%
pull(HMDB) %>%
unique()
length(cachexia_sig)
## [1] 36
cachexia_sig
## [1] "HMDB0000232" "HMDB0000883" "HMDB0000092" "HMDB0000687" "HMDB0000087"
## [6] "HMDB0000267" "HMDB0000562" "HMDB0000641" "HMDB0000161" "HMDB0000929"
## [11] "HMDB0000211" "HMDB0000043" "HMDB0000167" "HMDB0000164" "HMDB0000187"
## [16] "HMDB0000072" "HMDB0000174" "HMDB0000042" "HMDB0242161" "HMDB0000168"
## [21] "HMDB0000682" "HMDB0000754" "HMDB0000177" "HMDB0000158" "HMDB0000254"
## [26] "HMDB0000094" "HMDB0000875" "HMDB0304356" "HMDB0000243" "HMDB0000479"
## [31] "HMDB0000714" "HMDB0000958" "HMDB0000149" "HMDB0000123" "HMDB0000448"
## [36] "HMDB0000452"
Note that in this Cachexia dataset we have mostly (> 50%) differential metabolites, although it is hard to say if this is due to the fact that the researches only quantified those metabolites or they only reported those. A realistic background based purely on the data is therefore. Instead, we derive a functional background from the metabolite-set collection.
The background will be constructed artifically by taking all sets in
the KEGGset_HMDB collection and, from there the HMDB ids of
each set.
# Extract background as the union of all HMDB IDs in the filtered sets
cachexia_background <- unique(unlist(KEGGset_HMDB@data$feature_list))
length(cachexia_background)
## [1] 2105
intersect(cachexia_sig, cachexia_background) |> length()
## [1] 25
We now have all we need to do the Enrichment Analysis on the Cachexia dataset.
cachexia_sigstr(cachexia_sig)
## chr [1:36] "HMDB0000232" "HMDB0000883" "HMDB0000092" "HMDB0000687" ...
cachexia_backgroundstr(cachexia_background)
## chr [1:2105] "HMDB0000243" "HMDB0001206" "HMDB0000042" "HMDB0000223" ...
KEGGset_HMDB.show(KEGGset_HMDB)
## EnrichmentSet: KEGG_pathways_hsa_HMDB
## Source: KEGG
## Feature IDs: HMDB
## Number of sets: 283
## Example set: Glycolysis / Gluconeogenesis
SMPDBset and
ChemicalClassSet because these are sets made of HMDB
ids.show(SMPDBset)
## EnrichmentSet: SMPDB_pathways
## Source: SMPDB
## Feature IDs: HMDB
## Number of sets: 48654
## Example set: Citrullinemia Type I
show(ChemicalClassSet)
## EnrichmentSet: ChemicalClasses
## Source: HMDB
## Feature IDs: HMDB
## Number of sets: 31
## Example set: Acylcarnitines
The original ST000291 dataset is annotated with PubChem
identifiers, so we are restricted to work with Metabolite Sets created
with PubChem identifiers.
The code below shows how to prepare a list of selected metabolites and background straight from the available information.
To define the signature we select metabolites with an adjusted p-value of at most 0.33
## ST000291: signature (PubChem) and background
# top table from limma
st291_tt <- topCranVSBase # ja carregat anteriorment
# signature (adj.P.Val < 0.05)
ST291_sig <- st291_tt %>%
filter(adj_pvalue < 0.33) %>%
pull(PubChemCID) %>%
unique()
# background = ALL tested metabolites
ST291_background <- st291_tt %>%
pull(PubChemCID) %>%
unique()
cat("Signature size:", length(ST291_sig), "\n")
## Signature size: 50
cat("Background size:", length(ST291_background), "\n")
## Background size: 1358
Given that many programs can only work with HMDB or KEGG identifiers we have prepared and analyzed a subset of metabolites made only by those that have such ids.
## ST000291: signature (HMDB) and background
# top table from limma
st291_tt_HMDB <- topCranVSBase_HMDB # ja carregat anteriorment
# signature (adj.P.Val < 0.05)
ST291_HMDB_sig <- st291_tt_HMDB %>%
filter(pvalue < 0.05) %>%
pull(feature) %>%
unique()
# background = ALL tested metabolites
ST291_HMDB_background <- st291_tt_HMDB %>%
pull(feature) %>%
unique()
cat("Signature size:", length(ST291_HMDB_sig), "\n")
## Signature size: 29
cat("Background size:", length(ST291_HMDB_background), "\n")
## Background size: 623
The object loaded already contains the signature as
EM_inputMetabolites. The background can be obtained from
the EM_summaryStats object.
EM_sig <- EM_inputMetabolites
EM_background <- EM_summaryStats$met_id
cat("Signature size:", length(EM_sig), "\n")
## Signature size: 57
cat("Background size:", length(EM_background), "\n")
## Background size: 249
Cachexia datalocalEnrichment and KEGGsets_HMDBWe start doing ORA using KEGG metabolite sets with HMDB identifiers
First perform Fisher test for all sets and select the enriched ones.
library(localEnrichment)
# ORA per Cachexia vs KEGG (HMDB)
ORA_cachexia_KEGG_LE <- ora_test(
eset = KEGGset_HMDB,
selected = cachexia_sig,
background = cachexia_background,
test = "fisher",
p_adjust = "BH",
min_set_size = 3,
max_set_size = 500
)
# veure com és el resultat
dplyr::glimpse(ORA_cachexia_KEGG_LE)
## Rows: 91
## Columns: 8
## $ set_id <chr> "hsa05230", "hsa00970", "hsa04974", "hsa04978", "hsa…
## $ set_name <chr> "Central carbon metabolism in cancer", "Aminoacyl-tR…
## $ n_selected_in_set <int> 13, 11, 12, 9, 8, 6, 9, 7, 6, 4, 4, 3, 3, 3, 3, 3, 3…
## $ n_in_set <int> 35, 23, 47, 27, 47, 26, 82, 45, 40, 15, 17, 10, 12, …
## $ p_value <dbl> 1.103121e-15, 8.109981e-15, 2.935896e-12, 1.626280e-…
## $ overlap_features <chr> "HMDB0000883;HMDB0000687;HMDB0000641;HMDB0000161;HMD…
## $ p_adj <dbl> 1.003840e-13, 3.690042e-13, 8.905551e-11, 3.699786e-…
## $ fold_enrichment <dbl> 21.718254, 27.964976, 14.929078, 19.490741, 9.952719…
head(ORA_cachexia_KEGG_LE)
## # A tibble: 6 × 8
## set_id set_name n_selected_in_set n_in_set p_value overlap_features p_adj
## <chr> <chr> <int> <int> <dbl> <chr> <dbl>
## 1 hsa052… Central… 13 35 1.10e-15 HMDB0000883;HMD… 1.00e-13
## 2 hsa009… Aminoac… 11 23 8.11e-15 HMDB0000883;HMD… 3.69e-13
## 3 hsa049… Protein… 12 47 2.94e-12 HMDB0000883;HMD… 8.91e-11
## 4 hsa049… Mineral… 9 27 1.63e-10 HMDB0000883;HMD… 3.70e- 9
## 5 hsa006… Glyoxyl… 8 47 6.32e- 7 HMDB0000641;HMD… 1.15e- 5
## 6 hsa002… Alanine… 6 26 2.92e- 6 HMDB0000641;HMD… 4.43e- 5
## # ℹ 1 more variable: fold_enrichment <dbl>
ORA_cachexia_KEGG_LE_sig <- ORA_cachexia_KEGG_LE %>%
dplyr::filter(p_adj < 0.05) %>%
dplyr::arrange(p_adj)
nrow(ORA_cachexia_KEGG_LE_sig)
## [1] 20
head(ORA_cachexia_KEGG_LE_sig[,1:6], 10)
## # A tibble: 10 × 6
## set_id set_name n_selected_in_set n_in_set p_value overlap_features
## <chr> <chr> <int> <int> <dbl> <chr>
## 1 hsa05230 Central carbon… 13 35 1.10e-15 HMDB0000883;HMD…
## 2 hsa00970 Aminoacyl-tRNA… 11 23 8.11e-15 HMDB0000883;HMD…
## 3 hsa04974 Protein digest… 12 47 2.94e-12 HMDB0000883;HMD…
## 4 hsa04978 Mineral absorp… 9 27 1.63e-10 HMDB0000883;HMD…
## 5 hsa00630 Glyoxylate and… 8 47 6.32e- 7 HMDB0000641;HMD…
## 6 hsa00250 Alanine, aspar… 6 26 2.92e- 6 HMDB0000641;HMD…
## 7 hsa02010 ABC transporte… 9 82 5.31e- 6 HMDB0000883;HMD…
## 8 hsa00470 D-Amino acid m… 7 45 6.62e- 6 HMDB0000641;HMD…
## 9 hsa00260 Glycine, serin… 6 40 4.09e- 5 HMDB0000092;HMD…
## 10 hsa00020 Citrate cycle … 4 15 8.61e- 5 HMDB0000072;HMD…
This can be visualized by different approaches.
First clean the Pathway names
clean_pathway_name <- function(x) {
return(x)
}
ORA_cachexia_KEGG_LE_sig <- ORA_cachexia_KEGG_LE_sig %>%
dplyr::mutate(set_name_clean = clean_pathway_name(set_name))
top20 <- ORA_cachexia_KEGG_LE_sig %>% arrange(p_adj)
# %>% dplyr::slice(1:20)
# IMPORTANT: nivells únics per ordenar el gràfic
top20$set_name_clean <- factor(
top20$set_name_clean,
levels = rev(unique(top20$set_name_clean))
)
localEnrichment::plot_enrichment(
results =top20 ,
type = c("bar", "dot"),
top = 20,
x_measure = "-log10(p_adj)",
color_by = "p_adj",
title = NULL
)
enrichmet and KEGGsets_HMDBenrichment cannot work with EnrichmentSets objects
-which are defined in localEnrichment. Instead it uses a
data.frame with two columns, “Pathways” and “Metabolites” where
metabolites are comma separated IDs.
Such data.frame can be extracted from the EnrichmentSet as:
library(dplyr)
PwVSMet_HMDB <- as.MetaboliteSetDataFrame(
KEGGset_HMDB,
id_type = "name",
id_sep = ","
) %>%
dplyr::transmute(
Pathway = set_name,
Metabolites = Metabolites
)
Now we con proceed with the enrichmet wrapper
function:
ORA_cachexia_KEGG_EM<- perform_enrichment_analysis(
inputMetabolites= cachexia_sig,
PathwayVsMetabolites= PwVSMet_HMDB,
top_n = 100,
p_value_cutoff = .25
)
plot <- create_enrichment_plot(ORA_cachexia_KEGG_EM)
plot
To run GSEA we will use the whole dataset, not only significant metabolites and, additionally, a measure of effect size.
localEnrichment and KEGGsets_HMDBGSEA in localEnrichment has been written to match
MetaboAnalyst function so it is also called qea_test.
The test requires all the metabolites with an effect size measure.
cachexia_scores<- cachexia_tt$t
names(cachexia_scores) <- cachexia_tt$HMDB
The test can now be run as:
QEA_cachexia_KEGG_LE<- qea_test(
eset = KEGGset_HMDB,
scores = cachexia_scores,
nperm = 1000,
p_adjust = "BH",
min_set_size = 3,
max_set_size = 500,
return_profiles = FALSE,
quiet = TRUE
)
The results can be plotted similarly to how we did with ORA.
# GSEA/QEA plot with localEnrichment
top_cachexia_qea <- QEA_cachexia_KEGG_LE$table %>%
dplyr::arrange(p_adj) %>%
dplyr::slice(1:20)
top_cachexia_qea$set_name <- factor(
top_cachexia_qea$set_name,
levels = rev(unique(top_cachexia_qea$set_name))
)
localEnrichment::plot_enrichment(
results = top_cachexia_qea,
type = c("bar", "dot"),
top = 20,
x_measure = "-log10(p_adj)",
color_by = "p_adj",
title = "Cachexia: GSEA/QEA with localEnrichment"
)
enrichmet and KEGGsets_HMDBIn this case we have to provide a more rigid data structure that can be extracted from cachegia_tt.
data4gsea <- cachexia_tt[,c("HMDB", "P.Value", "logFC")]
colnames(data4gsea) <- c("met_id", "pval", "log2fc")
head(PwVSMet_HMDB, n=10)
head(data4gsea, n=12)
GSEA_cachexia_KEGG_EM<- perform_gsea_analysis(
example_data =data4gsea ,
PathwayVsMetabolites= PwVSMet_HMDB
)
head(GSEA_cachexia_KEGG_EM)
enrichmet and KEGGsets_KEGGenrichmet seems to have a strange design flaw: in some
recent versions of the package the perform_gsea_analysis()
function expects KEGG-like identifiers in the met_id
column, so for this step we use the KEGG annotation instead of HMDB.
First, we build the summary statistics table required by
enrichmet:
data4gsea_KEGG <- cachexia_tt %>%
dplyr::select(KEGG, P.Value, logFC) %>%
dplyr::rename(
met_id = KEGG,
pval = P.Value,
log2fc = logFC
) %>%
dplyr::filter(!is.na(met_id), met_id != "") %>%
dplyr::distinct(met_id, .keep_all = TRUE)
head(data4gsea_KEGG, 10)
## met_id pval log2fc
## 1 C03722 0.0001187889 44.42323
## 2 C00183 0.0001640210 25.44989
## 3 C01026 0.0002029160 20.89312
## 4 C00123 0.0003775515 17.70504
## 5 C00543 0.0004169006 244.89730
## 6 C01879 0.0004546884 151.03434
## 7 C00791 0.0004792226 5102.96555
## 8 C00064 0.0009999880 216.98309
## 9 C00041 0.0011130175 190.00706
## 10 C00078 0.0018802610 39.99104
We also need a pathway-to-metabolite table using KEGG identifiers:
PwVSMet_KEGG <- as.MetaboliteSetDataFrame(
KEGGset_KEGG,
id_type = "name",
id_sep = ","
) %>%
dplyr::transmute(
Pathway = set_name,
Metabolites = Metabolites
)
head(PwVSMet_KEGG, 10)
## # A tibble: 10 × 2
## Pathway Metabolites
## <chr> <chr>
## 1 ABC transporters C00009,C00025,C00031,C00032,C00034…
## 2 Aldosterone-regulated sodium reabsorption C00238,C00575,C00698,C00735,C00762…
## 3 Epstein-Barr virus infection C00076,C01245,C05981
## 4 ErbB signaling pathway C00076,C00165,C01245,C05981
## 5 Estrogen signaling pathway C00076,C00165,C00238,C00334,C00533…
## 6 Ether lipid metabolism C00670,C00958,C01233,C01264,C02773…
## 7 Fanconi anemia pathway NoMetabolitesfound
## 8 Fat digestion and absorption C00010,C00040,C00162,C00165,C00187…
## 9 Fatty acid biosynthesis C00024,C00083,C00154,C00173,C00229…
## 10 Fatty acid degradation C00010,C00024,C00071,C00136,C00154…
Now we can run the enrichmet GSEA-like function:
GSEA_cachexia_KEGG_EM <- perform_gsea_analysis(
example_data = data4gsea_KEGG,
PathwayVsMetabolites = PwVSMet_KEGG
)
## | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
head(GSEA_cachexia_KEGG_EM)
## pathway pval padj log2err
## <char> <num> <num> <num>
## 1: Protein_digestion_and_absorption 0.003707435 0.02224461 0.43170770
## 2: Aminoacyl-tRNA_biosynthesis 0.013144417 0.03943325 0.38073040
## 3: Mineral_absorption 0.022942484 0.04588497 0.35248786
## 4: ABC_transporters 0.093945720 0.14091858 0.14551615
## 5: Central_carbon_metabolism_in_cancer 0.181438999 0.21772680 0.09957912
## 6: Glyoxylate_and_dicarboxylate_metabolism 0.470779221 0.47077922 0.05205700
## ES NES size leadingEdge input_count
## <num> <num> <int> <list> <int>
## 1: 0.6575265 1.744456 14 C00183, .... 11
## 2: 0.6341130 1.663106 13 C00183, .... 10
## 3: 0.6221943 1.596308 11 C00183, .... 8
## 4: 0.5135357 1.386298 17 C00183, .... 8
## 5: 0.4651647 1.262874 18 C00183, .... 12
## 6: 0.3908266 1.002709 11 C00064, .... 7
ANd plot the results.
# GSEA plot with enrichmet
plot <- create_gsea_plot(GSEA_cachexia_KEGG_EM, top_n = 20)
plot
In any case the results of both tests are relatively similat even if not significant.
head(QEA_cachexia_KEGG_LE$table[1:10,c(1,2,6,7)],10)
## set_id set_name p_value p_adj
## 1 hsa04974 Protein digestion and absorption 0.006 0.1040
## 2 hsa04978 Mineral absorption 0.007 0.1040
## 3 hsa00970 Aminoacyl-tRNA biosynthesis 0.008 0.1040
## 4 hsa04976 Bile secretion 0.031 0.2860
## 5 hsa00280 Valine, leucine and isoleucine degradation 0.044 0.2860
## 6 hsa04382 Cornified envelope formation 0.044 0.2860
## 7 hsa05230 Central carbon metabolism in cancer 0.082 0.3465
## 8 hsa04922 Glucagon signaling pathway 0.091 0.3465
## 9 hsa02010 ABC transporters 0.101 0.3465
## 10 hsa00770 Pantothenate and CoA biosynthesis 0.115 0.3465
head(GSEA_cachexia_KEGG_EM[,1:3],10)
## pathway pval padj
## <char> <num> <num>
## 1: Protein_digestion_and_absorption 0.003707435 0.02224461
## 2: Aminoacyl-tRNA_biosynthesis 0.013144417 0.03943325
## 3: Mineral_absorption 0.022942484 0.04588497
## 4: ABC_transporters 0.093945720 0.14091858
## 5: Central_carbon_metabolism_in_cancer 0.181438999 0.21772680
## 6: Glyoxylate_and_dicarboxylate_metabolism 0.470779221 0.47077922
The enrichmet package includes a small example dataset
derived from a metabolomics comparison involving KRAS G12D mutation.
Unlike the Cachexia example, these data are already distributed in a
format very close to what enrichment tools require:
EM_inputMetabolites)EM_summaryStats)EM_PathwayVsMetabolites)EnrichmentSet object built from the same KEGG-based
pathway mapping (KEGGset_KEGG)We begin by defining the signature and the corresponding background.
EM_sig <- unique(EM_inputMetabolites)
EM_background <- unique(EM_summaryStats$met_id)
cat("Signature size:", length(EM_sig), "\n")
## Signature size: 57
cat("Background size:", length(EM_background), "\n")
## Background size: 249
intersect(EM_sig, EM_background) |> length()
## [1] 51
The ORA can now be carried out with localEnrichment::ora_test() using the KEGG-based EnrichmentSet.
library(localEnrichment)
library(dplyr)
ORA_EM_KEGG_LE <- ora_test(
eset = KEGGset_KEGG,
selected = EM_sig,
background = EM_background,
test = "fisher",
p_adjust = "BH",
min_set_size = 3,
max_set_size = 500
)
dplyr::glimpse(ORA_EM_KEGG_LE)
## Rows: 47
## Columns: 8
## $ set_id <chr> "PW_45", "PW_106", "PW_107", "PW_27", "PW_269", "PW_…
## $ set_name <chr> "Biosynthesis of unsaturated fatty acids", "Fatty ac…
## $ n_selected_in_set <int> 7, 4, 2, 1, 1, 2, 2, 4, 1, 2, 2, 2, 5, 1, 1, 1, 1, 1…
## $ n_in_set <int> 11, 7, 3, 14, 15, 4, 4, 11, 13, 5, 5, 5, 33, 10, 10,…
## $ p_value <dbl> 0.003633786, 0.050183860, 0.132051780, 0.200320067, …
## $ overlap_features <chr> "C16527;C00249;C00219;C06429;C06428;C16513;C01530", …
## $ p_adj <dbl> 0.1707879, 1.0000000, 1.0000000, 1.0000000, 1.000000…
## $ fold_enrichment <dbl> 2.7799043, 2.4962406, 2.9122807, 0.3120301, 0.291228…
head(ORA_EM_KEGG_LE)
## # A tibble: 6 × 8
## set_id set_name n_selected_in_set n_in_set p_value overlap_features p_adj
## <chr> <chr> <int> <int> <dbl> <chr> <dbl>
## 1 PW_45 Biosynthesis… 7 11 0.00363 C16527;C00249;C… 0.171
## 2 PW_106 Fatty acid b… 4 7 0.0502 C00249;C06424;C… 1
## 3 PW_107 Fatty acid d… 2 3 0.132 C00249;C02990 1
## 4 PW_27 Arginine and… 1 14 0.200 C00148 1
## 5 PW_269 Pyrimidine m… 1 15 0.202 C00295 1
## 6 PW_297 Sphingolipid… 2 4 0.226 C00065;C00836 1
## # ℹ 1 more variable: fold_enrichment <dbl>
We may restrict attention to the pathways that remain significant after multiple testing correction.
ORA_EM_KEGG_LE_sig <- ORA_EM_KEGG_LE %>%
dplyr::filter(p_adj < 0.25) %>%
dplyr::arrange(p_adj)
nrow(ORA_EM_KEGG_LE_sig)
## [1] 1
head(ORA_EM_KEGG_LE_sig[, 1:6], 10)
## # A tibble: 1 × 6
## set_id set_name n_selected_in_set n_in_set p_value overlap_features
## <chr> <chr> <int> <int> <dbl> <chr>
## 1 PW_45 Biosynthesis of un… 7 11 0.00363 C16527;C00249;C…
The enrichment results can be visualized with the standard plotting functions from localEnrichment.
top_kras_ora <- ORA_EM_KEGG_LE %>%
dplyr::arrange(p_adj) %>%
dplyr::slice(1:20)
top_kras_ora$set_name <- factor(
top_kras_ora$set_name,
levels = rev(unique(top_kras_ora$set_name))
)
localEnrichment::plot_enrichment(
results = top_kras_ora,
type = c("bar", "dot"),
top = 20,
x_measure = "-log10(p_adj)",
color_by = "p_adj",
title = "KRASG12D example: ORA with localEnrichment"
)
enrichmet and EM_PathwayVsMetabolitesThe same ORA-like analysis can be repeated with enrichmet, using directly the pathway mapping table provided with the example data.
ORA_EM_KEGG_EM <- perform_enrichment_analysis(
inputMetabolites = EM_sig,
PathwayVsMetabolites = EM_PathwayVsMetabolites,
top_n = 100,
p_value_cutoff = 0.25
)
head(ORA_EM_KEGG_EM)
## Pathway P_value Log_P_value Impact
## 1 Central carbon metabolism in cancer 4.817117e-05 4.317213 0.05461853
## 2 Protein digestion and absorption 1.925909e-04 3.715364 0.06838474
## 3 Biosynthesis of unsaturated fatty acids 3.699903e-04 3.431810 0.70197902
## 4 Caffeine metabolism 6.220368e-04 3.206184 0.09487611
## 5 Mineral absorption 1.832389e-03 2.736982 0.09229047
## Coverage Count Adjusted_P_value Q_value
## 1 0.16216216 6 0.01690808 0.0003367180
## 2 0.12765957 6 0.03379971 0.0006731084
## 3 0.09459459 7 0.04328886 0.0008620812
## 4 0.18181818 4 0.05458373 0.0010870143
## 5 0.13793103 4 0.12863372 0.0025616917
And visualized with the plotting function supplied by enrichmet.
plot <- create_enrichment_plot(ORA_EM_KEGG_EM)
plot
Because both analyses use the same KEGG identifiers and a pathway
mapping derived from the same source, this example is particularly
convenient to compare the behaviour of the two implementations.
To run a quantitative enrichment analysis with localEnrichment, we need a numeric score for all metabolites in the ranked list. Here we will use the log2fc column from EM_summaryStats.
EM_scores <- EM_summaryStats$log2fc
names(EM_scores) <- EM_summaryStats$met_id
summary(EM_scores)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.7235 -0.7930 -0.1904 -0.2540 0.4430 3.1852
head(EM_scores)
## C02721 C03736 C03139 C00074 C02862 C00386
## -2.324807 2.634161 1.748349 -3.219306 -1.697477 1.683674
The GSEA-like quantitative enrichment analysis can then be run with qea_test().
QEA_EM_KEGG_LE <- qea_test(
eset = KEGGset_KEGG,
scores = EM_scores,
nperm = 1000,
p_adjust = "BH",
min_set_size = 3,
max_set_size = 500,
return_profiles = FALSE,
quiet = TRUE
)
head(QEA_EM_KEGG_LE$table[, c("set_id", "set_name", "p_value", "p_adj")], 10)
## set_id set_name p_value p_adj
## 1 PW_263 Protein digestion and absorption 0.000 0.00000000
## 2 PW_269 Pyrimidine metabolism 0.000 0.00000000
## 3 PW_16 Aminoacyl-tRNA biosynthesis 0.001 0.02833333
## 4 PW_111 Ferroptosis 0.003 0.06375000
## 5 PW_195 Mineral absorption 0.007 0.11900000
## 6 PW_86 Diabetic cardiomyopathy 0.024 0.25500000
## 7 PW_45 Biosynthesis of unsaturated fatty acids 0.026 0.25500000
## 8 PW_85 D-Amino acid metabolism 0.026 0.25500000
## 9 PW_81 Cysteine and methionine metabolism 0.027 0.25500000
## 10 PW_53 cAMP signaling pathway 0.031 0.26350000
If desired, we can inspect only the most significant pathways.
QEA_EM_KEGG_LE_sig <- QEA_EM_KEGG_LE$table %>%
dplyr::filter(p_adj < 0.25) %>%
dplyr::arrange(p_adj)
nrow(QEA_EM_KEGG_LE_sig)
## [1] 5
head(QEA_EM_KEGG_LE_sig[, c("set_id", "set_name", "p_value", "p_adj")], 10)
## set_id set_name p_value p_adj
## 1 PW_263 Protein digestion and absorption 0.000 0.00000000
## 2 PW_269 Pyrimidine metabolism 0.000 0.00000000
## 3 PW_16 Aminoacyl-tRNA biosynthesis 0.001 0.02833333
## 4 PW_111 Ferroptosis 0.003 0.06375000
## 5 PW_195 Mineral absorption 0.007 0.11900000
And finmally GSEA results can be plotted:
# GSEA/QEA plot with localEnrichment
top_kras_qea <- QEA_EM_KEGG_LE$table %>%
dplyr::arrange(p_adj) %>%
dplyr::slice(1:20)
top_kras_qea$set_name <- factor(
top_kras_qea$set_name,
levels = rev(unique(top_kras_qea$set_name))
)
localEnrichment::plot_enrichment(
results = top_kras_qea,
type = c("bar", "dot"),
top = 20,
x_measure = "-log10(p_adj)",
color_by = "p_adj",
title = "KRASG12D example: GSEA/QEA with localEnrichment"
)
The enrichmet package already provides a function to perform a GSEA-like analysis from the summary statistics table.
GSEA_EM_KEGG_EM <- perform_gsea_analysis(
example_data = EM_summaryStats,
PathwayVsMetabolites = EM_PathwayVsMetabolites
)
## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
head(GSEA_EM_KEGG_EM[, 1:3], 10)
## pathway pval padj
## <char> <num> <num>
## 1: Mineral_absorption 0.0008854647 0.009906069
## 2: Protein_digestion_and_absorption 0.0012382586 0.009906069
## 3: Aminoacyl-tRNA_biosynthesis 0.0034658893 0.018484743
## 4: Central_carbon_metabolism_in_cancer 0.0159814878 0.063925951
## 5: D-Amino_acid_metabolism 0.0228011134 0.072963563
## 6: ABC_transporters 0.1144366197 0.236621911
## 7: Pyrimidine_metabolism 0.1194295900 0.236621911
## 8: Biosynthesis_of_unsaturated_fatty_acids 0.1230769231 0.236621911
## 9: Cysteine_and_methionine_metabolism 0.1330998249 0.236621911
## 10: Arginine_and_proline_metabolism 0.1631944444 0.261111111
And a plot:
create_gsea_plot(GSEA_EM_KEGG_EM, top_n = 15)
plot
This allows a direct comparison with the localEnrichment implementation, since both analyses are now based on the same ranked metabolite list and the same KEGG pathway definitions.
A simple way to compare both implementations is to inspect the top ranked pathways side by side.
top_qea_le <- QEA_EM_KEGG_LE$table %>%
dplyr::arrange(p_adj) %>%
dplyr::select(set_name, p_value, p_adj) %>%
dplyr::slice(1:10) %>%
dplyr::rename(
pathway = set_name,
p_value_LE = p_value,
p_adj_LE = p_adj
)
top_gsea_em <- GSEA_EM_KEGG_EM %>%
as_tibble() %>%
dplyr::arrange(padj) %>%
dplyr::select(pathway, pval, padj) %>%
dplyr::slice(1:10) %>%
dplyr::rename(
p_value_EM = pval,
p_adj_EM = padj
)
top_qea_le
## pathway p_value_LE p_adj_LE
## 1 Protein digestion and absorption 0.000 0.00000000
## 2 Pyrimidine metabolism 0.000 0.00000000
## 3 Aminoacyl-tRNA biosynthesis 0.001 0.02833333
## 4 Ferroptosis 0.003 0.06375000
## 5 Mineral absorption 0.007 0.11900000
## 6 Diabetic cardiomyopathy 0.024 0.25500000
## 7 Biosynthesis of unsaturated fatty acids 0.026 0.25500000
## 8 D-Amino acid metabolism 0.026 0.25500000
## 9 Cysteine and methionine metabolism 0.027 0.25500000
## 10 cAMP signaling pathway 0.031 0.26350000
top_gsea_em
## # A tibble: 10 × 3
## pathway p_value_EM p_adj_EM
## <chr> <dbl> <dbl>
## 1 Mineral_absorption 0.000885 0.00991
## 2 Protein_digestion_and_absorption 0.00124 0.00991
## 3 Aminoacyl-tRNA_biosynthesis 0.00347 0.0185
## 4 Central_carbon_metabolism_in_cancer 0.0160 0.0639
## 5 D-Amino_acid_metabolism 0.0228 0.0730
## 6 ABC_transporters 0.114 0.237
## 7 Pyrimidine_metabolism 0.119 0.237
## 8 Biosynthesis_of_unsaturated_fatty_acids 0.123 0.237
## 9 Cysteine_and_methionine_metabolism 0.133 0.237
## 10 Arginine_and_proline_metabolism 0.163 0.261
In this KRASG12D example the compatibility between both tools is
cleaner than in the Cachexia case, because: - the identifiers are
already KEGG compound identifiers, - the pathway mapping used by
enrichmet is directly available, - the EnrichmentSet object
used by localEnrichment has been built from the same
source.
This makes the example particularly useful as a controlled use case to compare the behaviour of the two packages under nearly identical inputs.
Over-Representation Analysis (ORA) is a classical approach used to identify biological sets (such as pathways, chemical classes, or ontological categories) that are statistically enriched in a predefined list of significant features, typically metabolites or genes.
The key idea is to test whether a particular set contains more significant features than expected by chance, given a specified background (Rivals et al. 2006; Khatri, Sirota, and Butte 2012).
For each set, a contingency table is constructed:
| In Set | Not in Set | Total | |
|---|---|---|---|
| Selected (hits) | k | K - k | K |
| Not selected | m - k | N - m - K + k | N - K |
where: - N = total number of features in the
background - K = number of selected (significant)
features
- m = number of features in the given set
- k = number of selected features that belong to that
set
A statistical test (typically Fisher’s exact test or the hypergeometric test) is then applied to compute the probability of observing at least k selected features in the set under a random sampling model (Rivals et al. 2006).
The resulting p-values are adjusted for multiple testing (e.g., Benjamini–Hochberg FDR correction), and the fold enrichment is often reported as:
\[ \text{Fold Enrichment} = \frac{k/K}{m/N} \]
indicating how much more represented the set is among selected features than expected by chance.
localEnrichmentThe function ora_test() implements this classical ORA
workflow in a flexible and reproducible way using the
EnrichmentSet object as input. It can be applied to any
type of feature-to-set mapping, including pathways, metabolite classes,
or ontological categories.
EnrichMetEnrichMet follows the same classical ORA framework,
specifically applied to KEGG pathways. For each pathway, it constructs a
2×2 contingency table comparing detected metabolites versus a background
set, and applies Fisher’s exact test to evaluate
enrichment (Mekonnen et al. 2026).
Key characteristics:
Conceptually, this implementation is equivalent to standard ORA, with the main specificity being the direct integration of curated KEGG pathway mappings and a streamlined one-step workflow.
Quantitative Enrichment Analysis (QEA), also known as Metabolite Set Enrichment Analysis (GSEA), is the analogue of GSEA for metabolomics and other continuous-valued omics data (Subramanian et al. 2005; Xia, Wishart, and Valencia 2011).
Instead of dichotomizing features into “significant/non-significant,” QEA uses the entire ranked list of features, each with an associated score such as:
For each metabolite set, QEA tests whether its members tend to appear towards the top or bottom of the ranked list more often than expected by chance.
This avoids arbitrary cutoffs and provides sensitivity to subtle but consistent shifts across a pathway or chemical class.
localEnrichmentThere are several variants of GSEA/QEA in the literature:
ORA-like GSEA (MetaboAnalyst’s early implementation): uses Fisher tests on selected metabolites → not what we use here.
SSPA / Mean/median tests (subset-based
statistics):
compares mean scores inside vs. outside a set → not used
here.
Correlation-based global tests (e.g.,
GlobalTest):
model score–membership associations → not used here.
GSEA-style running-sum statistic (Subramanian et
al., PNAS 2005):
uses an enrichment score based on a weighted random walk and permutation
testing: this is the method implemented in
qea_test() (Subramanian et
al. 2005).
localEnrichment implements a GSEA-style
running-score, adapted to metabolomics.
Specifically:
+1/Nh-1/(N - Nh)This is the most statistically robust form of QEA/GSEA and is the same approach used in modern tools such as fgsea (Korotkevich, Sukhov, and Sergushichev 2016), clusterProfiler::GSEA(), and the updated MetaboAnalyst 5.0 “QEA” module.
EnrichMet (MetSEA)EnrichMet implements GSEA (referred to as
MetSEA) using the fgsea algorithm,
which corresponds to a fast implementation of the GSEA running-sum
statistic (Korotkevich, Sukhov, and Sergushichev
2016; Mekonnen et al. 2026).
Main elements:
In practice, this implementation is methodologically equivalent to the GSEA-style approach described above, with two practical differences:
Overall, both localEnrichment and EnrichMet
implement conceptually the same class of QEA/GSEA methods (GSEA-like),
differing mainly in computational backend and workflow integration
rather than in statistical principles.
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.2 (2024-10-31 ucrt)
## os Windows 11 x64 (build 26200)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate Spanish_Spain.utf8
## ctype Spanish_Spain.utf8
## tz Europe/Madrid
## date 2026-03-23
## pandoc 3.6.3 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## quarto NA @ C:\\Users\\Usuario\\AppData\\Local\\Programs\\Quarto\\bin\\quarto.exe
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## BiocGenerics 0.52.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.1)
## BiocManager * 1.30.27 2025-11-14 [1] CRAN (R 4.4.3)
## BiocParallel 1.40.2 2025-09-29 [1] Bioconductor
## bslib 0.9.0 2025-01-30 [1] CRAN (R 4.4.3)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## circlize 0.4.16 2024-02-20 [1] CRAN (R 4.4.2)
## cli 3.6.5 2025-04-23 [1] CRAN (R 4.4.3)
## clue 0.3-67 2026-02-18 [1] CRAN (R 4.4.3)
## cluster 2.1.8.1 2025-03-12 [1] CRAN (R 4.4.3)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.2)
## colorspace 2.1-2 2025-09-22 [1] CRAN (R 4.4.3)
## ComplexHeatmap 2.22.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.1)
## cowplot 1.2.0 2025-07-07 [1] CRAN (R 4.4.3)
## crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.1)
## data.table 1.17.8 2025-07-10 [1] CRAN (R 4.4.3)
## devtools 2.4.6 2025-10-03 [1] CRAN (R 4.4.3)
## dichromat 2.0-0.1 2022-05-02 [1] CRAN (R 4.4.0)
## digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.2)
## doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.4.0)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## enrichmet * 0.99.7 2025-10-19 [1] git2r (https://github.com/biodatalab/enrichmet.git@4c12848)
## evaluate 1.0.5 2025-08-27 [1] CRAN (R 4.4.3)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## fastmatch 1.1-6 2024-12-23 [1] CRAN (R 4.4.2)
## fgsea 1.32.4 2025-03-20 [1] Bioconductor 3.20 (R 4.4.3)
## forcats 1.0.1 2025-09-25 [1] CRAN (R 4.4.3)
## foreach 1.5.2 2022-02-02 [1] CRAN (R 4.4.0)
## fs 1.6.7 2026-03-06 [1] CRAN (R 4.4.3)
## generics 0.1.4 2025-05-09 [1] CRAN (R 4.4.3)
## GetoptLong 1.1.0 2025-11-28 [1] CRAN (R 4.4.2)
## ggforce 0.5.0 2025-06-18 [1] CRAN (R 4.4.3)
## ggplot2 4.0.2 2026-02-03 [1] CRAN (R 4.4.3)
## ggraph 2.2.2 2025-08-24 [1] CRAN (R 4.4.3)
## ggrepel 0.9.6 2024-09-07 [1] CRAN (R 4.4.1)
## GlobalOptions 0.1.3 2025-11-28 [1] CRAN (R 4.4.2)
## glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
## graphlayouts 1.2.3 2026-02-21 [1] CRAN (R 4.4.3)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.4.0)
## gtable 0.3.6 2024-10-25 [1] CRAN (R 4.4.2)
## hms 1.1.4 2025-10-17 [1] CRAN (R 4.4.3)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## igraph 2.2.2 2026-02-12 [1] CRAN (R 4.4.3)
## IRanges 2.40.1 2024-12-05 [1] Bioconductor 3.20 (R 4.4.2)
## iterators 1.0.14 2022-02-05 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 2.0.0 2025-03-27 [1] CRAN (R 4.4.3)
## knitr 1.50 2025-03-16 [1] CRAN (R 4.4.3)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## lattice 0.22-7 2025-04-02 [2] CRAN (R 4.4.3)
## lifecycle 1.0.5 2026-01-08 [1] CRAN (R 4.4.3)
## localEnrichment * 0.1.0 2025-11-29 [1] local
## magrittr 2.0.4 2025-09-12 [1] CRAN (R 4.4.3)
## MASS 7.3-65 2025-02-28 [2] CRAN (R 4.4.3)
## Matrix 1.7-4 2025-08-28 [2] CRAN (R 4.4.3)
## matrixStats 1.5.0 2025-01-07 [1] CRAN (R 4.4.3)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## pillar 1.11.1 2025-09-17 [1] CRAN (R 4.4.3)
## pkgbuild 1.4.8 2025-05-26 [1] CRAN (R 4.4.3)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.1 2025-09-23 [1] CRAN (R 4.4.3)
## png 0.1-8 2022-11-29 [1] CRAN (R 4.4.0)
## polyclip 1.10-7 2024-07-23 [1] CRAN (R 4.4.1)
## purrr 1.1.0 2025-07-10 [1] CRAN (R 4.4.3)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.4.3)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.0)
## Rcpp 1.1.0 2025-07-02 [1] CRAN (R 4.4.3)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## rjson 0.2.23 2024-09-16 [1] CRAN (R 4.4.1)
## rlang 1.1.6 2025-04-11 [1] CRAN (R 4.4.3)
## rmarkdown 2.30 2025-09-28 [1] CRAN (R 4.4.3)
## rstudioapi 0.18.0 2026-01-16 [1] CRAN (R 4.4.3)
## S4Vectors 0.44.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.1)
## S7 0.2.0 2024-11-07 [1] CRAN (R 4.4.2)
## sass 0.4.10 2025-04-11 [1] CRAN (R 4.4.3)
## scales 1.4.0 2025-04-24 [1] CRAN (R 4.4.3)
## sessioninfo 1.2.3 2025-02-05 [1] CRAN (R 4.4.3)
## shape 1.4.6.1 2024-02-23 [1] CRAN (R 4.4.0)
## stringi 1.8.7 2025-03-27 [1] CRAN (R 4.4.3)
## stringr * 1.5.2 2025-09-08 [1] CRAN (R 4.4.3)
## tibble * 3.3.0 2025-06-08 [1] CRAN (R 4.4.3)
## tidygraph 1.3.1 2024-01-30 [1] CRAN (R 4.4.0)
## tidyr * 1.3.2 2025-12-19 [1] CRAN (R 4.4.3)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## tweenr 2.0.3 2024-02-26 [1] CRAN (R 4.4.0)
## tzdb 0.5.0 2025-03-15 [1] CRAN (R 4.4.3)
## usethis 3.2.1 2025-09-06 [1] CRAN (R 4.4.3)
## utf8 1.2.6 2025-06-08 [1] CRAN (R 4.4.3)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## viridis 0.6.5 2024-01-29 [1] CRAN (R 4.4.0)
## viridisLite 0.4.3 2026-02-04 [1] CRAN (R 4.4.3)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.2)
## xfun 0.54 2025-10-30 [1] CRAN (R 4.4.2)
## yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.1)
##
## [1] C:/Users/Usuario/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.2/library
## * ── Packages attached to the search path.
##
## ──────────────────────────────────────────────────────────────────────────────