1 Introduction

Enrichment analysis aims to facilitate the biological interpretation of omics results by relating lists of features (genes, proteins, metabolites) to prior biological knowledge stored in databases (pathways, ontologies, chemical classes, etc.).

In genomics, ORA and GSEA workflows are relatively well standardized, and code-based (clusterProfiler or graphical-interface (g-profiler or Reactome) are well established tools.

In metabolomics, however, the scenario is more-fragmented and available tools take more heterogeneorus approaches and rely on different identifier systems and pathway databases.

The goals of this document are:

  1. Quick-start: show how to run basic over-representation analysis (ORA) and metabolite set enrichment analysis (GSEA-like) with several tools.
  2. Naïve benchmarking: compare the results of these tools on the same datasets.

We will focus on:

  • enrichmet (pathway enrichment, GSEA, centrality, plots)
  • localEnrichment (our own infrastructure for metabolite sets and ORA/GSEA)

If time permits we will progressively incorporate other tools/approaches such as:

Rather than exhaustively exploring each tool, we restrict ourselves to a reproducible workflow:

  1. We have selected a few public metabolite datasets. The selection has been based on “popularity” and availability, but the workflow should be easily adaptable to new datasets.

  2. We have prepared an ad-hoc collection of metabolite sets, that is, Metabolite sets that have been programatically prepared, as data.frames and EnrichmentSet (an S4 class created to store Metabolite Sets Collections), from the Lheuristic package, to be used for these analyses.

  3. For each dataset we have prepared:

    • A list of significant metabolites (signature),
    • A background tailored to that signature created either ad-hoc or from the original
    • An effect size table with summary statistics that can be used for GSEA tests.
  4. Given these components we will run enrichment analysis on them (ORA and GSEA when possible) with the different tools on the same sets.

  5. As more analyses are avilable we will increase comparisons among results to check which analyses are related and which lead different results because they are indeed different analyses.

We have decided to adopt the term GSEA also for metabolomics to avoid confusions because the authors who introduced the term MSEA (Xia and Wishart 2010) used this term to refer to both ORA and GSEA applied to metabolite sets.

2 Data and metabolite sets

Any pathway analysis requires, at least, two inputs:

  • A list of features usually associated with a phenotype (e.g. metabolites differentially quantified between two groups).

  • A source of annotations for these features, that can be generically described as Feature sets or, as in this case Metabolite Sets.

2.1 The datasets

We use three datasets that have been preprocessed elsewhere (see file: PrepareData-4-EnrichmentAnalysis.Rmd) and

  1. The ia dataset. This is a popular example used by MetaboAnalyst, where a comparison between affected and control patients is performed. It has the drawback that only differential metabolites are provided, which makes it only partially interesting for some types of analyses. We keep it for its popularity.

  2. The ST000291 dataset, downloaded from Metabolomics Workbench, and used in a fobitools use case. This is obtained by a nutritional intervention study where a comparion among consumption of Apple, Cranberries and Baseline has been performed. In this case we have the full dataset and the results of comparing the groups using limma whih leads to around 20 differential metabolites. Here, we focus only on the Cranberry vs Baseline comparison, although lists from other comparisons are also available and may be used when a compare-comparisons approach is considered. This study seems to have been made using a cross-over design that the authors have later omitted when providing the samples. Because it is only for instructional purposes, this flaw will be ignored.

  3. The KrasG12D mutation dataset, a list of cancer related metabolites used by the Enrichmet package in one of its tutorial.

2.1.1 Loading datasets

The data for this use case has been prepared elsewhere and is available in the data directory.

For each dataset we have the list of metabolites, the analysis results in the form of a topTable and a mapping performed with MetaboAnalyst between the metabolite ids and several databases as shown below.

2.1.1.1 Cachexia dataset

# Cachexia: topTable + ID mapping
cachexia_map_path  <- "data/Cachexia_map.csv"
cachexia_tt_path   <- "data/Cachexia_topTable.Rda"

Cachexia_map <- read.csv(cachexia_map_path, stringsAsFactors = FALSE)
load(cachexia_tt_path)  # loads 'topCachexia'

cat("Cachexia_map:\n")
## Cachexia_map:
print_tbl(head(Cachexia_map))
## # A tibble: 6 × 9
##   Query                Match     HMDB  PubChem ChEBI KEGG  METLIN SMILES Comment
##   <chr>                <chr>     <chr>   <int> <chr> <chr>  <int> <chr>    <int>
## 1 2-Hydroxyisobutyrate 2-Hydrox… HMDB… 4277439 19641 <NA>      NA CC(C)…       1
## 2 Creatinine           Creatini… HMDB…     588 16737 C007…      8 CN1CC…       1
## 3 Quinolinate          Quinolin… HMDB…    1066 16675 C037…    330 OC(=O…       1
## 4 Valine               L-Valine  HMDB…    6287 16414 C001…   5842 CC(C)…       1
## 5 Dimethylamine        Dimethyl… HMDB…     674 17170 C005…   3758 CNC          1
## 6 Tryptophan           L-Trypto… HMDB…    6305 16828 C000…   5879 N[C@@…       1
cat("\ntopCachexia:\n")
## 
## topCachexia:
print_tbl(head(topCachexia))
## # A tibble: 6 × 7
##   logFC AveExpr     t  P.Value adj.P.Val     B ordByLimma
##   <dbl>   <dbl> <dbl>    <dbl>     <dbl> <dbl>      <int>
## 1  44.4    66.4  4.06 0.000119   0.00426 -3.92          1
## 2  25.4    35.7  3.97 0.000164   0.00426 -3.95          2
## 3  20.9    26.3  3.90 0.000203   0.00426 -3.97          3
## 4  17.7    24.4  3.72 0.000378   0.00431 -4.02          4
## 5 245.    358.   3.69 0.000417   0.00431 -4.03          5
## 6 151.    211.   3.67 0.000455   0.00431 -4.04          6

2.1.1.2 Metabolomics Workbench ST000291 dataset

# ST000291: Cranberry vs Baseline topTable + ID mapping
st291_map_path   <- "data/MW_ST000291__map.csv"
st291_tt_path    <- "data/MW_ST000291-limma_res.Rda"

MS_ST000291_map <- read.csv(st291_map_path, stringsAsFactors = FALSE)
load(st291_tt_path)  # loads 'topCranVSBase'
topCranVSBase<- limma_resCB; rm(limma_resCB)
topCranVSBase_HMDB<- limma_HMDB_CB; rm(limma_HMDB_CB)

cat("\nMS_ST000291_map:\n")
## 
## MS_ST000291_map:
print_tbl(head(MS_ST000291_map))
## # A tibble: 6 × 9
##      Query Match                 HMDB  PubChem ChEBI KEGG  METLIN SMILES Comment
##      <int> <chr>                 <chr>   <int> <int> <chr>  <int> <chr>    <int>
## 1   192798 Digitalose            <NA>   192798    NA <NA>      NA <NA>         1
## 2    79034 12-Hydroxydodecanoic… HMDB…   79034 39567 C083…   6465 OCCCC…       1
## 3    10413 4-Hydroxybutyric acid HMDB…   10413 30830 C009…   5678 OCCCC…       1
## 4   439230 Mevalonic acid        HMDB…  439230 17710 C004…    127 C[C@@…       1
## 5 20975673 <NA>                  <NA>       NA    NA <NA>      NA <NA>         0
## 6  5275508 <NA>                  <NA>       NA    NA <NA>      NA <NA>         0
cat("\ntopCranVSBase:\n")
## 
## topCranVSBase:
print_tbl(head(topCranVSBase))
## # A tibble: 6 × 7
##   PubChemCID log2FC   AveExpr     t       pvalue adj_pvalue     B
##   <chr>       <dbl>     <dbl> <dbl>        <dbl>      <dbl> <dbl>
## 1 54678503     2.92  3.57e-16  6.45 0.0000000623  0.0000847  8.11
## 2 71485        3.27 -2.40e-16  5.85 0.000000484   0.000329   6.19
## 3 94214        2.18 -8.93e-17  5.69 0.000000843   0.000382   5.68
## 4 5378303      2.41  1.10e-16  5.54 0.00000140    0.000403   5.20
## 5 3037         3.18  8.39e-17  5.46 0.00000184    0.000403   4.95
## 6 10178        3.33  3.81e-17  5.45 0.00000189    0.000403   4.92
cat("\ntopCranVSBase (HMDB):\n")
## 
## topCranVSBase (HMDB):
print_tbl(head(topCranVSBase_HMDB))
## # A tibble: 6 × 7
##   feature     log2FC   AveExpr     t     pvalue adj_pvalue     B
##   <chr>        <dbl>     <dbl> <dbl>      <dbl>      <dbl> <dbl>
## 1 HMDB0251200   3.18  8.39e-17  5.48 0.00000172   0.000876  5.01
## 2 HMDB0248806   2.26  3.85e-17  5.34 0.00000281   0.000876  4.54
## 3 HMDB0258615   2.05  2.68e-16  5.00 0.00000881   0.00154   3.47
## 4 HMDB0006115   1.56 -4.28e-16  4.96 0.00000986   0.00154   3.36
## 5 HMDB0000738   1.46  1.01e-15  4.72 0.0000221    0.00233   2.60
## 6 HMDB0000714   1.51 -8.74e-16  4.72 0.0000225    0.00233   2.58

2.1.1.3 enrichmet package example dataset

enrichmet_path <- "data/enrichmet_example.Rda"
load(enrichmet_path)  # loads 'topCranVSBase'
cat("\nEM_summaryStats:\n")
## 
## EM_summaryStats:
print_tbl(head(EM_summaryStats))
## # A tibble: 6 × 5
##   met_id      pval    padj log2fc met_id_raw
##   <chr>      <dbl>   <dbl>  <dbl> <chr>     
## 1 C02721 0.0000124 0.00315  -2.32 C02721    
## 2 C03736 0.0000694 0.00739   2.63 C03736    
## 3 C03139 0.0000869 0.00739   1.75 C03139    
## 4 C00074 0.000120  0.00763  -3.22 C00074    
## 5 C02862 0.000198  0.00819  -1.70 C02862|   
## 6 C00386 0.000225  0.00819   1.68 C00386
# cat("\nEM_mapping_df:\n")
# print_tbl(head(EM_mapping_df))
# cat("\nEM_kegg_lookup:\n")
# print_tbl(head(EM_kegg_lookup))
# cat("\nEM_PathwayVsMetabolites:\n")
# print_tbl(head(EM_PathwayVsMetabolites))
# cat("\nKEGGset_KEGG:\n")
# show(KEGGset_KEGG)

2.2 Metabolite Sets

We also use a unified collection of metabolite sets, previously constructed and saved as:

  • KEGGset_HMDB, KEGGset_HMDB_df
  • KEGGset_PubChem, KEGGset_PubChem_df
  • ChemicalClassSet, ChemicalClassSet_df
  • SMPDBset, SMPDBset_df

All sets collections, but KEGGset_PubChem use HMDB IDs as feature identifiers. KEGGset_PubChem uses PubChemIDs.

2.2.1 Load Metabolite Sets

Metabolite sets have been prepared elsewhere and saved as EnrichmentSet objects of the class defined in the localEnrichment package.

# Metabolite sets (EnrichmentSet collection)
# mets_collection_path <- "databases/MetaboliteSets_Collection.Rda"
# load(mets_collection_path)

load("../Metabolites_Sets/results/MetaboliteSets_Collection.Rda")

### Parche. Pendent de netejar-ho al script  que ho crea
KEGGset_HMDB_df <- df_HMDB
KEGGset_PubChem_df <- df_PubChem

###

cat("\nEnrichmentSet objects loaded:\n")
## 
## EnrichmentSet objects loaded:
summary(KEGGset_HMDB);      cat("\n")
## EnrichmentSet summary:
##   Mapping name: KEGG_pathways_hsa_HMDB 
##   Source: KEGG 
##   Feature IDs: HMDB 
##   Number of sets: 283 
##   Mean set size: 22.5689 features
##   Median set size: 10 features
summary(KEGGset_PubChem);      cat("\n")
## EnrichmentSet summary:
##   Mapping name: KEGG_pathways_hsa_PubChem 
##   Source: KEGG 
##   Feature IDs: PubChem 
##   Number of sets: 267 
##   Mean set size: 18.15356 features
##   Median set size: 8 features
summary(ChemicalClassSet);    cat("\n")
## EnrichmentSet summary:
##   Mapping name: ChemicalClasses 
##   Source: HMDB 
##   Feature IDs: HMDB 
##   Number of sets: 31 
##   Mean set size: 13.54839 features
##   Median set size: 4 features
summary(SMPDBset);            cat("\n")
## EnrichmentSet summary:
##   Mapping name: SMPDB_pathways 
##   Source: SMPDB 
##   Feature IDs: HMDB 
##   Number of sets: 48654 
##   Mean set size: 17.30571 features
##   Median set size: 18 features
cat("\nKEGGset_HMDB_df:\n")
## 
## KEGGset_HMDB_df:
print_tbl(head(KEGGset_HMDB_df))
## # A tibble: 6 × 3
##   set_name                                 set_id   Metabolites                 
##   <chr>                                    <chr>    <chr>                       
## 1 Glycolysis / Gluconeogenesis             hsa00010 HMDB0000243;HMDB0001206;HMD…
## 2 Citrate cycle (TCA cycle)                hsa00020 HMDB0000243;HMDB0001206;HMD…
## 3 Pentose phosphate pathway                hsa00030 HMDB0000243;HMDB0000124;HMD…
## 4 Pentose and glucuronate interconversions hsa00040 HMDB0000243;HMDB0000208;HMD…
## 5 Fructose and mannose metabolism          hsa00051 HMDB0000124;HMDB0001163;HMD…
## 6 Galactose metabolism                     hsa00052 HMDB0000286;HMDB0000302;HMD…
cat("\nKEGGset_PubChem_df:\n")
## 
## KEGGset_PubChem_df:
print_tbl(head(KEGGset_PubChem_df))
## # A tibble: 6 × 3
##   set_name                                 set_id   Metabolites                 
##   <chr>                                    <chr>    <chr>                       
## 1 Glycolysis / Gluconeogenesis             hsa00010 1060;107735;176;175;970;906…
## 2 Citrate cycle (TCA cycle)                hsa00020 1060;107735;51;164533;970;1…
## 3 Pentose phosphate pathway                hsa00030 1060;107735;439168;5311110;…
## 4 Pentose and glucuronate interconversions hsa00040 1060;107735;51;164533;8629;…
## 5 Fructose and mannose metabolism          hsa00051 439709;668;439168;107689;79…
## 6 Galactose metabolism                     hsa00052 8629;5988;439709;668;753;43…
cat("\nChemicalClassSet_df:\n")
## 
## ChemicalClassSet_df:
print_tbl(head(ChemicalClassSet_df))
## # A tibble: 6 × 3
##   set_name                set_id                  Metabolites                   
##   <chr>                   <chr>                   <chr>                         
## 1 Acylcarnitines          Acylcarnitines          HMDB0000062;HMDB0000201;HMDB0…
## 2 Amine oxide             Amine oxide             HMDB0000925                   
## 3 Amino Acids             Amino Acids             HMDB0000161;HMDB0000517;HMDB0…
## 4 Amino Acids Derivatives Amino Acids Derivatives HMDB0000092;HMDB0000766;HMDB0…
## 5 Amino acid - related    Amino acid - related    HMDB0000510;HMDB0000452;HMDB0…
## 6 Azoles                  Azoles                  HMDB0000462
cat("\nSMPDBset_df:\n")
## 
## SMPDBset_df:
print_tbl(head(SMPDBset_df))
## # A tibble: 6 × 3
##   set_name                                  set_id     Metabolites              
##   <chr>                                     <chr>      <chr>                    
## 1 Citrullinemia Type I                      SMP0000001 HMDB0000641;HMDB0000161;…
## 2 Carbamoyl Phosphate Synthetase Deficiency SMP0000002 HMDB0000641;HMDB0000161;…
## 3 Argininosuccinic Aciduria                 SMP0000003 HMDB0000641;HMDB0000161;…
## 4 Glycine and Serine Metabolism             SMP0000004 HMDB0002134;HMDB0001377;…
## 5 Pterine Biosynthesis                      SMP0000005 HMDB0006822;HMDB0002111;…
## 6 Tyrosine Metabolism                       SMP0000006 HMDB0000158;HMDB0000208;…

The Metabolite sets for the Enrichmet package example data have been loaded when loading the example data but are shown here fore consistency.

They consist of a data.frame and an EnrichmentSet object that has been preprare dfrom this

cat("\nEM_PathwayVsMetabolites:\n")
## 
## EM_PathwayVsMetabolites:
print_tbl(head(EM_PathwayVsMetabolites))
## # A tibble: 6 × 2
##   Pathway                                  Metabolites                          
##   <chr>                                    <chr>                                
## 1 Glycolysis / Gluconeogenesis             C00022,C00024,C00031,C00033,C00036,C…
## 2 Citrate cycle (TCA cycle)                C00022,C00024,C00026,C00036,C00042,C…
## 3 Pentose phosphate pathway                C00022,C00031,C00085,C00117,C00118,C…
## 4 Pentose and glucuronate interconversions C00022,C00026,C00029,C00103,C00111,C…
## 5 Fructose and mannose metabolism          C00085,C00095,C00096,C00111,C00118,C…
## 6 Galactose metabolism                     C00029,C00031,C00052,C00085,C00089,C…
cat("\nKEGGset_KEGG:\n")
## 
## KEGGset_KEGG:
show(KEGGset_KEGG)
## EnrichmentSet: KEGG_hsa_pathways 
##   Source: KEGG 
##   Feature IDs: kegg_id 
##   Number of sets: 351 
##   Example set: ABC transporters
print_tbl(head(KEGGset_KEGG@data))
## # A tibble: 6 × 4
##   set_id set_name                                  feature_ids      feature_list
##   <chr>  <chr>                                     <chr>            <list>      
## 1 PW_1   ABC transporters                          C00009;C00025;C… <chr [138]> 
## 2 PW_10  Aldosterone-regulated sodium reabsorption C00238;C00575;C… <chr [8]>   
## 3 PW_100 Epstein-Barr virus infection              C00076;C01245;C… <chr [3]>   
## 4 PW_101 ErbB signaling pathway                    C00076;C00165;C… <chr [4]>   
## 5 PW_102 Estrogen signaling pathway                C00076;C00165;C… <chr [8]>   
## 6 PW_103 Ether lipid metabolism                    C00670;C00958;C… <chr [25]>
KEGGset_KEGG_df <- as.MetaboliteSetDataFrame(KEGGset_KEGG, "both")

Notice that Metabolite identifiers for this example data are neither HMDB, nor PubChem but KEGG ids!

3 Building signatures and backgrounds

Our aim is to derive signatures and backgrounds that are compatible with the EnrichmentSet objects, which use HMDB IDs.

We therefore:

  1. Map each topTable to HMDB IDs.
  2. Construct a vector of significant HMDB IDs (signature).
  3. Use filterEnrichmentSet(Eset, ids) to restrict the metabolite sets to those that contain at least one of the signature metabolites.
  4. Build a background as the union of all metabolites in the filtered sets.

3.1 Cachexia: HMDB mapping and signature

library(tibble)
## Warning: package 'tibble' was built under R version 4.4.3
# Convert topCachexia to tibble with metabolite name
cachexia_tt <- topCachexia %>%
  rownames_to_column(var = "Metabolite") %>%
  left_join(Cachexia_map, by = c("Metabolite" = "Query"))

cat("Cachexia limma table with mapping:\n")
## Cachexia limma table with mapping:
dim(cachexia_tt)
## [1] 63 16
print_tbl(cachexia_tt)
## # A tibble: 63 × 16
##   Metabolite  logFC AveExpr     t P.Value adj.P.Val     B ordByLimma Match HMDB 
##   <chr>       <dbl>   <dbl> <dbl>   <dbl>     <dbl> <dbl>      <int> <chr> <chr>
## 1 Quinolinate  44.4    66.4  4.06 1.19e-4   0.00426 -3.92          1 Quin… HMDB…
## 2 Valine       25.4    35.7  3.97 1.64e-4   0.00426 -3.95          2 L-Va… HMDB…
## 3 N,N-Dimeth…  20.9    26.3  3.90 2.03e-4   0.00426 -3.97          3 Dime… HMDB…
## 4 Leucine      17.7    24.4  3.72 3.78e-4   0.00431 -4.02          4 Leuc… HMDB…
## 5 Dimethylam… 245.    358.   3.69 4.17e-4   0.00431 -4.03          5 Dime… HMDB…
## 6 Pyroglutam… 151.    211.   3.67 4.55e-4   0.00431 -4.04          6 Pyro… HMDB…
## # ℹ 57 more rows
## # ℹ 6 more variables: PubChem <int>, ChEBI <chr>, KEGG <chr>, METLIN <int>,
## #   SMILES <chr>, Comment <int>
tail(cachexia_tt)
##                Metabolite      logFC   AveExpr           t   P.Value adj.P.Val
## 58                Acetone   4.926383  11.42701  0.88147387 0.3808486 0.4136804
## 59               Tartrate  18.558681  40.00403  0.76840327 0.4446356 0.4747804
## 60 4-Hydroxyphenylacetate  20.023887 112.02104  0.70996703 0.4799035 0.5038987
## 61           Pantothenate -12.678624  44.88377 -0.62491564 0.5339035 0.5514086
## 62                 Uracil   5.020284  35.55766  0.60799978 0.5450055 0.5537959
## 63   1-Methylnicotinamide  -2.590745  71.57364 -0.08306038 0.9340225 0.9340225
##            B ordByLimma                      Match        HMDB PubChem ChEBI
## 58 -4.609633         58                    Acetone HMDB0001659     180 15347
## 59 -4.619378         59              Tartaric acid HMDB0000956  444305 15671
## 60 -4.623907         60 p-Hydroxyphenylacetic acid HMDB0000020     127 18101
## 61 -4.629872         61           Pantothenic acid HMDB0000210    6613 46905
## 62 -4.630970         62                     Uracil HMDB0000300    1174 17568
## 63 -4.650155         63       1-Methylnicotinamide HMDB0000699     457 16797
##      KEGG METLIN                           SMILES Comment
## 58 C00207   3745                          CC(C)=O       1
## 59 C00898     NA    O[C@H]([C@@H](O)C(O)=O)C(O)=O       1
## 60 C00642    130            OC(=O)CC1=CC=C(O)C=C1       1
## 61 C00864     NA CC(C)(CO)[C@@H](O)C(=O)NCCC(O)=O       1
## 62 C00106    258                  O=C1NC=CC(=O)N1       1
## 63 C02918   5667          C[N+]1=CC=CC(=C1)C(N)=O       1
# p-value column and threshold
p_col_cachexia <- "adj.P.Val"
alpha_cachexia <- 0.05

# Signature: HMDB IDs of significant metabolites
cachexia_sig <- cachexia_tt %>%
  filter(!is.na(HMDB),
         !!rlang::sym(p_col_cachexia) < alpha_cachexia) %>%
  pull(HMDB) %>%
  unique()

length(cachexia_sig)
## [1] 36
cachexia_sig
##  [1] "HMDB0000232" "HMDB0000883" "HMDB0000092" "HMDB0000687" "HMDB0000087"
##  [6] "HMDB0000267" "HMDB0000562" "HMDB0000641" "HMDB0000161" "HMDB0000929"
## [11] "HMDB0000211" "HMDB0000043" "HMDB0000167" "HMDB0000164" "HMDB0000187"
## [16] "HMDB0000072" "HMDB0000174" "HMDB0000042" "HMDB0242161" "HMDB0000168"
## [21] "HMDB0000682" "HMDB0000754" "HMDB0000177" "HMDB0000158" "HMDB0000254"
## [26] "HMDB0000094" "HMDB0000875" "HMDB0304356" "HMDB0000243" "HMDB0000479"
## [31] "HMDB0000714" "HMDB0000958" "HMDB0000149" "HMDB0000123" "HMDB0000448"
## [36] "HMDB0000452"

Note that in this Cachexia dataset we have mostly (> 50%) differential metabolites, although it is hard to say if this is due to the fact that the researches only quantified those metabolites or they only reported those. A realistic background based purely on the data is therefore. Instead, we derive a functional background from the metabolite-set collection.

The background will be constructed artifically by taking all sets in the KEGGset_HMDB collection and, from there the HMDB ids of each set.

# Extract background as the union of all HMDB IDs in the filtered sets

cachexia_background <- unique(unlist(KEGGset_HMDB@data$feature_list))
length(cachexia_background)
## [1] 2105
intersect(cachexia_sig, cachexia_background) |> length()
## [1] 25

We now have all we need to do the Enrichment Analysis on the Cachexia dataset.

  • The list of significant metabolites: cachexia_sig
str(cachexia_sig)
##  chr [1:36] "HMDB0000232" "HMDB0000883" "HMDB0000092" "HMDB0000687" ...
  • The background associated with this list: cachexia_background
str(cachexia_background)
##  chr [1:2105] "HMDB0000243" "HMDB0001206" "HMDB0000042" "HMDB0000223" ...
  • The collections of metabolite sets to do the enrichment Analysis: KEGGset_HMDB.
show(KEGGset_HMDB)
## EnrichmentSet: KEGG_pathways_hsa_HMDB 
##   Source: KEGG 
##   Feature IDs: HMDB 
##   Number of sets: 283 
##   Example set: Glycolysis / Gluconeogenesis
  • We can also use SMPDBset and ChemicalClassSet because these are sets made of HMDB ids.
show(SMPDBset)
## EnrichmentSet: SMPDB_pathways 
##   Source: SMPDB 
##   Feature IDs: HMDB 
##   Number of sets: 48654 
##   Example set: Citrullinemia Type I
show(ChemicalClassSet)
## EnrichmentSet: ChemicalClasses 
##   Source: HMDB 
##   Feature IDs: HMDB 
##   Number of sets: 31 
##   Example set: Acylcarnitines

3.2 ST000291: PubChem and HMDB mappings and signatures

3.3 ST000291 with PubChem identifiers

The original ST000291 dataset is annotated with PubChem identifiers, so we are restricted to work with Metabolite Sets created with PubChem identifiers.

The code below shows how to prepare a list of selected metabolites and background straight from the available information.

To define the signature we select metabolites with an adjusted p-value of at most 0.33

## ST000291: signature (PubChem) and background

# top table from limma
st291_tt <- topCranVSBase   # ja carregat anteriorment

# signature (adj.P.Val < 0.05)
ST291_sig <- st291_tt %>%
  filter(adj_pvalue  < 0.33) %>%
  pull(PubChemCID) %>%
  unique()

# background = ALL tested metabolites
ST291_background <- st291_tt %>%
  pull(PubChemCID) %>%
  unique()

cat("Signature size:", length(ST291_sig), "\n")
## Signature size: 50
cat("Background size:", length(ST291_background), "\n")
## Background size: 1358

3.4 ST000291 with HMDB identifiers

Given that many programs can only work with HMDB or KEGG identifiers we have prepared and analyzed a subset of metabolites made only by those that have such ids.

## ST000291: signature (HMDB) and background

# top table from limma
st291_tt_HMDB <- topCranVSBase_HMDB   # ja carregat anteriorment

# signature (adj.P.Val < 0.05)
ST291_HMDB_sig <- st291_tt_HMDB %>%
  filter(pvalue  < 0.05) %>%
  pull(feature) %>%
  unique()

# background = ALL tested metabolites
ST291_HMDB_background <- st291_tt_HMDB %>%
  pull(feature) %>%
  unique()

cat("Signature size:", length(ST291_HMDB_sig), "\n")
## Signature size: 29
cat("Background size:", length(ST291_HMDB_background), "\n")
## Background size: 623

3.5 Enrichmet Example: KEGG mappings and signature

The object loaded already contains the signature as EM_inputMetabolites. The background can be obtained from the EM_summaryStats object.

EM_sig <- EM_inputMetabolites
EM_background <- EM_summaryStats$met_id
cat("Signature size:", length(EM_sig), "\n")
## Signature size: 57
cat("Background size:", length(EM_background), "\n")
## Background size: 249

4 Enrichment analysis of the Cachexia data

4.1 Over Representation Analysis (ORA)

4.1.1 ORA with localEnrichment and KEGGsets_HMDB

We start doing ORA using KEGG metabolite sets with HMDB identifiers

First perform Fisher test for all sets and select the enriched ones.

library(localEnrichment)

# ORA per Cachexia vs KEGG (HMDB)

ORA_cachexia_KEGG_LE <- ora_test(
  eset       = KEGGset_HMDB,
  selected   = cachexia_sig,
  background = cachexia_background,
  test       = "fisher",
  p_adjust   = "BH",
  min_set_size = 3,
  max_set_size = 500
)

# veure com és el resultat
dplyr::glimpse(ORA_cachexia_KEGG_LE)
## Rows: 91
## Columns: 8
## $ set_id            <chr> "hsa05230", "hsa00970", "hsa04974", "hsa04978", "hsa…
## $ set_name          <chr> "Central carbon metabolism in cancer", "Aminoacyl-tR…
## $ n_selected_in_set <int> 13, 11, 12, 9, 8, 6, 9, 7, 6, 4, 4, 3, 3, 3, 3, 3, 3…
## $ n_in_set          <int> 35, 23, 47, 27, 47, 26, 82, 45, 40, 15, 17, 10, 12, …
## $ p_value           <dbl> 1.103121e-15, 8.109981e-15, 2.935896e-12, 1.626280e-…
## $ overlap_features  <chr> "HMDB0000883;HMDB0000687;HMDB0000641;HMDB0000161;HMD…
## $ p_adj             <dbl> 1.003840e-13, 3.690042e-13, 8.905551e-11, 3.699786e-…
## $ fold_enrichment   <dbl> 21.718254, 27.964976, 14.929078, 19.490741, 9.952719…
head(ORA_cachexia_KEGG_LE)
## # A tibble: 6 × 8
##   set_id  set_name n_selected_in_set n_in_set  p_value overlap_features    p_adj
##   <chr>   <chr>                <int>    <int>    <dbl> <chr>               <dbl>
## 1 hsa052… Central…                13       35 1.10e-15 HMDB0000883;HMD… 1.00e-13
## 2 hsa009… Aminoac…                11       23 8.11e-15 HMDB0000883;HMD… 3.69e-13
## 3 hsa049… Protein…                12       47 2.94e-12 HMDB0000883;HMD… 8.91e-11
## 4 hsa049… Mineral…                 9       27 1.63e-10 HMDB0000883;HMD… 3.70e- 9
## 5 hsa006… Glyoxyl…                 8       47 6.32e- 7 HMDB0000641;HMD… 1.15e- 5
## 6 hsa002… Alanine…                 6       26 2.92e- 6 HMDB0000641;HMD… 4.43e- 5
## # ℹ 1 more variable: fold_enrichment <dbl>
ORA_cachexia_KEGG_LE_sig <- ORA_cachexia_KEGG_LE %>%
  dplyr::filter(p_adj < 0.05) %>%
  dplyr::arrange(p_adj)

nrow(ORA_cachexia_KEGG_LE_sig)
## [1] 20
head(ORA_cachexia_KEGG_LE_sig[,1:6], 10)
## # A tibble: 10 × 6
##    set_id   set_name        n_selected_in_set n_in_set  p_value overlap_features
##    <chr>    <chr>                       <int>    <int>    <dbl> <chr>           
##  1 hsa05230 Central carbon…                13       35 1.10e-15 HMDB0000883;HMD…
##  2 hsa00970 Aminoacyl-tRNA…                11       23 8.11e-15 HMDB0000883;HMD…
##  3 hsa04974 Protein digest…                12       47 2.94e-12 HMDB0000883;HMD…
##  4 hsa04978 Mineral absorp…                 9       27 1.63e-10 HMDB0000883;HMD…
##  5 hsa00630 Glyoxylate and…                 8       47 6.32e- 7 HMDB0000641;HMD…
##  6 hsa00250 Alanine, aspar…                 6       26 2.92e- 6 HMDB0000641;HMD…
##  7 hsa02010 ABC transporte…                 9       82 5.31e- 6 HMDB0000883;HMD…
##  8 hsa00470 D-Amino acid m…                 7       45 6.62e- 6 HMDB0000641;HMD…
##  9 hsa00260 Glycine, serin…                 6       40 4.09e- 5 HMDB0000092;HMD…
## 10 hsa00020 Citrate cycle …                 4       15 8.61e- 5 HMDB0000072;HMD…

This can be visualized by different approaches.

First clean the Pathway names

clean_pathway_name <- function(x) {
  return(x)
}

ORA_cachexia_KEGG_LE_sig <- ORA_cachexia_KEGG_LE_sig %>%
  dplyr::mutate(set_name_clean = clean_pathway_name(set_name))

top20 <- ORA_cachexia_KEGG_LE_sig %>%  arrange(p_adj) 
  # %>% dplyr::slice(1:20)


# IMPORTANT: nivells únics per ordenar el gràfic
top20$set_name_clean <- factor(
  top20$set_name_clean,
  levels = rev(unique(top20$set_name_clean))
)
localEnrichment::plot_enrichment(
  results =top20 ,
  type = c("bar", "dot"),
  top = 20,
  x_measure = "-log10(p_adj)",
  color_by = "p_adj",
  title = NULL
)

4.1.2 ORA with enrichmet and KEGGsets_HMDB

enrichment cannot work with EnrichmentSets objects -which are defined in localEnrichment. Instead it uses a data.frame with two columns, “Pathways” and “Metabolites” where metabolites are comma separated IDs.

Such data.frame can be extracted from the EnrichmentSet as:

library(dplyr)

PwVSMet_HMDB <- as.MetaboliteSetDataFrame(
  KEGGset_HMDB,
  id_type = "name",
  id_sep  = ","
) %>%
  dplyr::transmute(
    Pathway = set_name,
    Metabolites = Metabolites
  )

Now we con proceed with the enrichmet wrapper function:

ORA_cachexia_KEGG_EM<- perform_enrichment_analysis(
  inputMetabolites= cachexia_sig,
  PathwayVsMetabolites= PwVSMet_HMDB,
  top_n = 100,
  p_value_cutoff = .25
)
plot <- create_enrichment_plot(ORA_cachexia_KEGG_EM)
plot

4.2 Metabolite Set Enrichment Analysis (GSEA)

To run GSEA we will use the whole dataset, not only significant metabolites and, additionally, a measure of effect size.

4.2.1 GSEA with localEnrichment and KEGGsets_HMDB

GSEA in localEnrichment has been written to match MetaboAnalyst function so it is also called qea_test.

The test requires all the metabolites with an effect size measure.

cachexia_scores<- cachexia_tt$t
names(cachexia_scores) <- cachexia_tt$HMDB

The test can now be run as:

QEA_cachexia_KEGG_LE<- qea_test(
  eset       = KEGGset_HMDB,
  scores = cachexia_scores,
  nperm = 1000,
  p_adjust = "BH",
  min_set_size = 3,
  max_set_size = 500,
  return_profiles = FALSE,
  quiet = TRUE
)

The results can be plotted similarly to how we did with ORA.

# GSEA/QEA plot with localEnrichment
top_cachexia_qea <- QEA_cachexia_KEGG_LE$table %>%
  dplyr::arrange(p_adj) %>%
  dplyr::slice(1:20)

top_cachexia_qea$set_name <- factor(
  top_cachexia_qea$set_name,
  levels = rev(unique(top_cachexia_qea$set_name))
)

localEnrichment::plot_enrichment(
  results = top_cachexia_qea,
  type = c("bar", "dot"),
  top = 20,
  x_measure = "-log10(p_adj)",
  color_by = "p_adj",
  title = "Cachexia: GSEA/QEA with localEnrichment"
)

4.2.2 GSEA with enrichmet and KEGGsets_HMDB

In this case we have to provide a more rigid data structure that can be extracted from cachegia_tt.

data4gsea <- cachexia_tt[,c("HMDB", "P.Value", "logFC")]
colnames(data4gsea) <- c("met_id", "pval", "log2fc")
head(PwVSMet_HMDB, n=10)
head(data4gsea, n=12)
GSEA_cachexia_KEGG_EM<- perform_gsea_analysis(
  example_data =data4gsea ,
  PathwayVsMetabolites= PwVSMet_HMDB
)
head(GSEA_cachexia_KEGG_EM)

4.2.3 GSEA with enrichmet and KEGGsets_KEGG

enrichmet seems to have a strange design flaw: in some recent versions of the package the perform_gsea_analysis() function expects KEGG-like identifiers in the met_id column, so for this step we use the KEGG annotation instead of HMDB.

First, we build the summary statistics table required by enrichmet:

data4gsea_KEGG <- cachexia_tt %>%
  dplyr::select(KEGG, P.Value, logFC) %>%
  dplyr::rename(
    met_id = KEGG,
    pval   = P.Value,
    log2fc = logFC
  ) %>%
  dplyr::filter(!is.na(met_id), met_id != "") %>%
  dplyr::distinct(met_id, .keep_all = TRUE)

head(data4gsea_KEGG, 10)
##    met_id         pval     log2fc
## 1  C03722 0.0001187889   44.42323
## 2  C00183 0.0001640210   25.44989
## 3  C01026 0.0002029160   20.89312
## 4  C00123 0.0003775515   17.70504
## 5  C00543 0.0004169006  244.89730
## 6  C01879 0.0004546884  151.03434
## 7  C00791 0.0004792226 5102.96555
## 8  C00064 0.0009999880  216.98309
## 9  C00041 0.0011130175  190.00706
## 10 C00078 0.0018802610   39.99104

We also need a pathway-to-metabolite table using KEGG identifiers:

PwVSMet_KEGG <- as.MetaboliteSetDataFrame(
  KEGGset_KEGG,
  id_type = "name",
  id_sep = ","
) %>%
  dplyr::transmute(
    Pathway = set_name,
    Metabolites = Metabolites
  )

head(PwVSMet_KEGG, 10)
## # A tibble: 10 × 2
##    Pathway                                   Metabolites                        
##    <chr>                                     <chr>                              
##  1 ABC transporters                          C00009,C00025,C00031,C00032,C00034…
##  2 Aldosterone-regulated sodium reabsorption C00238,C00575,C00698,C00735,C00762…
##  3 Epstein-Barr virus infection              C00076,C01245,C05981               
##  4 ErbB signaling pathway                    C00076,C00165,C01245,C05981        
##  5 Estrogen signaling pathway                C00076,C00165,C00238,C00334,C00533…
##  6 Ether lipid metabolism                    C00670,C00958,C01233,C01264,C02773…
##  7 Fanconi anemia pathway                    NoMetabolitesfound                 
##  8 Fat digestion and absorption              C00010,C00040,C00162,C00165,C00187…
##  9 Fatty acid biosynthesis                   C00024,C00083,C00154,C00173,C00229…
## 10 Fatty acid degradation                    C00010,C00024,C00071,C00136,C00154…

Now we can run the enrichmet GSEA-like function:

GSEA_cachexia_KEGG_EM <- perform_gsea_analysis(
  example_data = data4gsea_KEGG,
  PathwayVsMetabolites = PwVSMet_KEGG
)
##   |                                                                              |                                                                      |   0%  |                                                                              |=======================                                               |  33%  |                                                                              |===============================================                       |  67%  |                                                                              |======================================================================| 100%
head(GSEA_cachexia_KEGG_EM)
##                                    pathway        pval       padj    log2err
##                                     <char>       <num>      <num>      <num>
## 1:        Protein_digestion_and_absorption 0.003707435 0.02224461 0.43170770
## 2:             Aminoacyl-tRNA_biosynthesis 0.013144417 0.03943325 0.38073040
## 3:                      Mineral_absorption 0.022942484 0.04588497 0.35248786
## 4:                        ABC_transporters 0.093945720 0.14091858 0.14551615
## 5:     Central_carbon_metabolism_in_cancer 0.181438999 0.21772680 0.09957912
## 6: Glyoxylate_and_dicarboxylate_metabolism 0.470779221 0.47077922 0.05205700
##           ES      NES  size  leadingEdge input_count
##        <num>    <num> <int>       <list>       <int>
## 1: 0.6575265 1.744456    14 C00183, ....          11
## 2: 0.6341130 1.663106    13 C00183, ....          10
## 3: 0.6221943 1.596308    11 C00183, ....           8
## 4: 0.5135357 1.386298    17 C00183, ....           8
## 5: 0.4651647 1.262874    18 C00183, ....          12
## 6: 0.3908266 1.002709    11 C00064, ....           7

ANd plot the results.

# GSEA plot with enrichmet
plot <- create_gsea_plot(GSEA_cachexia_KEGG_EM, top_n = 20)
plot

4.2.4 Comparison of results

In any case the results of both tests are relatively similat even if not significant.

head(QEA_cachexia_KEGG_LE$table[1:10,c(1,2,6,7)],10)
##      set_id                                   set_name p_value  p_adj
## 1  hsa04974           Protein digestion and absorption   0.006 0.1040
## 2  hsa04978                         Mineral absorption   0.007 0.1040
## 3  hsa00970                Aminoacyl-tRNA biosynthesis   0.008 0.1040
## 4  hsa04976                             Bile secretion   0.031 0.2860
## 5  hsa00280 Valine, leucine and isoleucine degradation   0.044 0.2860
## 6  hsa04382               Cornified envelope formation   0.044 0.2860
## 7  hsa05230        Central carbon metabolism in cancer   0.082 0.3465
## 8  hsa04922                 Glucagon signaling pathway   0.091 0.3465
## 9  hsa02010                           ABC transporters   0.101 0.3465
## 10 hsa00770          Pantothenate and CoA biosynthesis   0.115 0.3465
head(GSEA_cachexia_KEGG_EM[,1:3],10)
##                                    pathway        pval       padj
##                                     <char>       <num>      <num>
## 1:        Protein_digestion_and_absorption 0.003707435 0.02224461
## 2:             Aminoacyl-tRNA_biosynthesis 0.013144417 0.03943325
## 3:                      Mineral_absorption 0.022942484 0.04588497
## 4:                        ABC_transporters 0.093945720 0.14091858
## 5:     Central_carbon_metabolism_in_cancer 0.181438999 0.21772680
## 6: Glyoxylate_and_dicarboxylate_metabolism 0.470779221 0.47077922

5 Enrichment analysis of the enrichmet example data (KRASG12D)

The enrichmet package includes a small example dataset derived from a metabolomics comparison involving KRAS G12D mutation. Unlike the Cachexia example, these data are already distributed in a format very close to what enrichment tools require:

  • a list of significant metabolites (EM_inputMetabolites)
  • a summary statistics table (EM_summaryStats)
  • a pathway-to-metabolite mapping table (EM_PathwayVsMetabolites)
  • an EnrichmentSet object built from the same KEGG-based pathway mapping (KEGGset_KEGG)

5.1 ORA with localEnrichment and KEGGset_KEGG

We begin by defining the signature and the corresponding background.

EM_sig <- unique(EM_inputMetabolites)
EM_background <- unique(EM_summaryStats$met_id)

cat("Signature size:", length(EM_sig), "\n")
## Signature size: 57
cat("Background size:", length(EM_background), "\n")
## Background size: 249
intersect(EM_sig, EM_background) |> length()
## [1] 51

The ORA can now be carried out with localEnrichment::ora_test() using the KEGG-based EnrichmentSet.

library(localEnrichment)
library(dplyr)

ORA_EM_KEGG_LE <- ora_test(
  eset = KEGGset_KEGG,
  selected = EM_sig,
  background = EM_background,
  test = "fisher",
  p_adjust = "BH",
  min_set_size = 3,
  max_set_size = 500
)

dplyr::glimpse(ORA_EM_KEGG_LE)
## Rows: 47
## Columns: 8
## $ set_id            <chr> "PW_45", "PW_106", "PW_107", "PW_27", "PW_269", "PW_…
## $ set_name          <chr> "Biosynthesis of unsaturated fatty acids", "Fatty ac…
## $ n_selected_in_set <int> 7, 4, 2, 1, 1, 2, 2, 4, 1, 2, 2, 2, 5, 1, 1, 1, 1, 1…
## $ n_in_set          <int> 11, 7, 3, 14, 15, 4, 4, 11, 13, 5, 5, 5, 33, 10, 10,…
## $ p_value           <dbl> 0.003633786, 0.050183860, 0.132051780, 0.200320067, …
## $ overlap_features  <chr> "C16527;C00249;C00219;C06429;C06428;C16513;C01530", …
## $ p_adj             <dbl> 0.1707879, 1.0000000, 1.0000000, 1.0000000, 1.000000…
## $ fold_enrichment   <dbl> 2.7799043, 2.4962406, 2.9122807, 0.3120301, 0.291228…
head(ORA_EM_KEGG_LE)
## # A tibble: 6 × 8
##   set_id set_name      n_selected_in_set n_in_set p_value overlap_features p_adj
##   <chr>  <chr>                     <int>    <int>   <dbl> <chr>            <dbl>
## 1 PW_45  Biosynthesis…                 7       11 0.00363 C16527;C00249;C… 0.171
## 2 PW_106 Fatty acid b…                 4        7 0.0502  C00249;C06424;C… 1    
## 3 PW_107 Fatty acid d…                 2        3 0.132   C00249;C02990    1    
## 4 PW_27  Arginine and…                 1       14 0.200   C00148           1    
## 5 PW_269 Pyrimidine m…                 1       15 0.202   C00295           1    
## 6 PW_297 Sphingolipid…                 2        4 0.226   C00065;C00836    1    
## # ℹ 1 more variable: fold_enrichment <dbl>

We may restrict attention to the pathways that remain significant after multiple testing correction.

ORA_EM_KEGG_LE_sig <- ORA_EM_KEGG_LE %>%
  dplyr::filter(p_adj < 0.25) %>%
  dplyr::arrange(p_adj)

nrow(ORA_EM_KEGG_LE_sig)
## [1] 1
head(ORA_EM_KEGG_LE_sig[, 1:6], 10)
## # A tibble: 1 × 6
##   set_id set_name            n_selected_in_set n_in_set p_value overlap_features
##   <chr>  <chr>                           <int>    <int>   <dbl> <chr>           
## 1 PW_45  Biosynthesis of un…                 7       11 0.00363 C16527;C00249;C…

The enrichment results can be visualized with the standard plotting functions from localEnrichment.

top_kras_ora <- ORA_EM_KEGG_LE %>%
  dplyr::arrange(p_adj) %>%
  dplyr::slice(1:20)

top_kras_ora$set_name <- factor(
  top_kras_ora$set_name,
  levels = rev(unique(top_kras_ora$set_name))
)

localEnrichment::plot_enrichment(
  results = top_kras_ora,
  type = c("bar", "dot"),
  top = 20,
  x_measure = "-log10(p_adj)",
  color_by = "p_adj",
  title = "KRASG12D example: ORA with localEnrichment"
)

5.2 ORA with enrichmet and EM_PathwayVsMetabolites

The same ORA-like analysis can be repeated with enrichmet, using directly the pathway mapping table provided with the example data.

ORA_EM_KEGG_EM <- perform_enrichment_analysis(
  inputMetabolites = EM_sig,
  PathwayVsMetabolites = EM_PathwayVsMetabolites,
  top_n = 100,
  p_value_cutoff = 0.25
)

head(ORA_EM_KEGG_EM)
##                                   Pathway      P_value Log_P_value     Impact
## 1     Central carbon metabolism in cancer 4.817117e-05    4.317213 0.05461853
## 2        Protein digestion and absorption 1.925909e-04    3.715364 0.06838474
## 3 Biosynthesis of unsaturated fatty acids 3.699903e-04    3.431810 0.70197902
## 4                     Caffeine metabolism 6.220368e-04    3.206184 0.09487611
## 5                      Mineral absorption 1.832389e-03    2.736982 0.09229047
##     Coverage Count Adjusted_P_value      Q_value
## 1 0.16216216     6       0.01690808 0.0003367180
## 2 0.12765957     6       0.03379971 0.0006731084
## 3 0.09459459     7       0.04328886 0.0008620812
## 4 0.18181818     4       0.05458373 0.0010870143
## 5 0.13793103     4       0.12863372 0.0025616917

And visualized with the plotting function supplied by enrichmet.

plot <- create_enrichment_plot(ORA_EM_KEGG_EM)
plot

Because both analyses use the same KEGG identifiers and a pathway mapping derived from the same source, this example is particularly convenient to compare the behaviour of the two implementations.

5.3 GSEA / QEA with localEnrichment and KEGGset_KEGG

To run a quantitative enrichment analysis with localEnrichment, we need a numeric score for all metabolites in the ranked list. Here we will use the log2fc column from EM_summaryStats.

EM_scores <- EM_summaryStats$log2fc
names(EM_scores) <- EM_summaryStats$met_id

summary(EM_scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.7235 -0.7930 -0.1904 -0.2540  0.4430  3.1852
head(EM_scores)
##    C02721    C03736    C03139    C00074    C02862    C00386 
## -2.324807  2.634161  1.748349 -3.219306 -1.697477  1.683674

The GSEA-like quantitative enrichment analysis can then be run with qea_test().

QEA_EM_KEGG_LE <- qea_test(
  eset = KEGGset_KEGG,
  scores = EM_scores,
  nperm = 1000,
  p_adjust = "BH",
  min_set_size = 3,
  max_set_size = 500,
  return_profiles = FALSE,
  quiet = TRUE
)

head(QEA_EM_KEGG_LE$table[, c("set_id", "set_name", "p_value", "p_adj")], 10)
##    set_id                                set_name p_value      p_adj
## 1  PW_263        Protein digestion and absorption   0.000 0.00000000
## 2  PW_269                   Pyrimidine metabolism   0.000 0.00000000
## 3   PW_16             Aminoacyl-tRNA biosynthesis   0.001 0.02833333
## 4  PW_111                             Ferroptosis   0.003 0.06375000
## 5  PW_195                      Mineral absorption   0.007 0.11900000
## 6   PW_86                 Diabetic cardiomyopathy   0.024 0.25500000
## 7   PW_45 Biosynthesis of unsaturated fatty acids   0.026 0.25500000
## 8   PW_85                 D-Amino acid metabolism   0.026 0.25500000
## 9   PW_81      Cysteine and methionine metabolism   0.027 0.25500000
## 10  PW_53                  cAMP signaling pathway   0.031 0.26350000

If desired, we can inspect only the most significant pathways.

QEA_EM_KEGG_LE_sig <- QEA_EM_KEGG_LE$table %>%
  dplyr::filter(p_adj < 0.25) %>%
  dplyr::arrange(p_adj)

nrow(QEA_EM_KEGG_LE_sig)
## [1] 5
head(QEA_EM_KEGG_LE_sig[, c("set_id", "set_name", "p_value", "p_adj")], 10)
##   set_id                         set_name p_value      p_adj
## 1 PW_263 Protein digestion and absorption   0.000 0.00000000
## 2 PW_269            Pyrimidine metabolism   0.000 0.00000000
## 3  PW_16      Aminoacyl-tRNA biosynthesis   0.001 0.02833333
## 4 PW_111                      Ferroptosis   0.003 0.06375000
## 5 PW_195               Mineral absorption   0.007 0.11900000

And finmally GSEA results can be plotted:

# GSEA/QEA plot with localEnrichment
top_kras_qea <- QEA_EM_KEGG_LE$table %>%
  dplyr::arrange(p_adj) %>%
  dplyr::slice(1:20)

top_kras_qea$set_name <- factor(
  top_kras_qea$set_name,
  levels = rev(unique(top_kras_qea$set_name))
)

localEnrichment::plot_enrichment(
  results = top_kras_qea,
  type = c("bar", "dot"),
  top = 20,
  x_measure = "-log10(p_adj)",
  color_by = "p_adj",
  title = "KRASG12D example: GSEA/QEA with localEnrichment"
)

5.4 GSEA with enrichmet

The enrichmet package already provides a function to perform a GSEA-like analysis from the summary statistics table.

GSEA_EM_KEGG_EM <- perform_gsea_analysis(
  example_data = EM_summaryStats,
  PathwayVsMetabolites = EM_PathwayVsMetabolites
)
##   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
head(GSEA_EM_KEGG_EM[, 1:3], 10)
##                                     pathway         pval        padj
##                                      <char>        <num>       <num>
##  1:                      Mineral_absorption 0.0008854647 0.009906069
##  2:        Protein_digestion_and_absorption 0.0012382586 0.009906069
##  3:             Aminoacyl-tRNA_biosynthesis 0.0034658893 0.018484743
##  4:     Central_carbon_metabolism_in_cancer 0.0159814878 0.063925951
##  5:                 D-Amino_acid_metabolism 0.0228011134 0.072963563
##  6:                        ABC_transporters 0.1144366197 0.236621911
##  7:                   Pyrimidine_metabolism 0.1194295900 0.236621911
##  8: Biosynthesis_of_unsaturated_fatty_acids 0.1230769231 0.236621911
##  9:      Cysteine_and_methionine_metabolism 0.1330998249 0.236621911
## 10:         Arginine_and_proline_metabolism 0.1631944444 0.261111111

And a plot:

create_gsea_plot(GSEA_EM_KEGG_EM, top_n = 15)

plot

This allows a direct comparison with the localEnrichment implementation, since both analyses are now based on the same ranked metabolite list and the same KEGG pathway definitions.

5.5 Quick comparison of the top pathways

A simple way to compare both implementations is to inspect the top ranked pathways side by side.

top_qea_le <- QEA_EM_KEGG_LE$table %>%
  dplyr::arrange(p_adj) %>%
  dplyr::select(set_name, p_value, p_adj) %>%
  dplyr::slice(1:10) %>%
  dplyr::rename(
    pathway = set_name,
    p_value_LE = p_value,
    p_adj_LE = p_adj
  )

top_gsea_em <- GSEA_EM_KEGG_EM %>%
  as_tibble() %>%
  dplyr::arrange(padj) %>%
  dplyr::select(pathway, pval, padj) %>%
  dplyr::slice(1:10) %>%
  dplyr::rename(
    p_value_EM = pval,
    p_adj_EM = padj
  )

top_qea_le
##                                    pathway p_value_LE   p_adj_LE
## 1         Protein digestion and absorption      0.000 0.00000000
## 2                    Pyrimidine metabolism      0.000 0.00000000
## 3              Aminoacyl-tRNA biosynthesis      0.001 0.02833333
## 4                              Ferroptosis      0.003 0.06375000
## 5                       Mineral absorption      0.007 0.11900000
## 6                  Diabetic cardiomyopathy      0.024 0.25500000
## 7  Biosynthesis of unsaturated fatty acids      0.026 0.25500000
## 8                  D-Amino acid metabolism      0.026 0.25500000
## 9       Cysteine and methionine metabolism      0.027 0.25500000
## 10                  cAMP signaling pathway      0.031 0.26350000
top_gsea_em
## # A tibble: 10 × 3
##    pathway                                 p_value_EM p_adj_EM
##    <chr>                                        <dbl>    <dbl>
##  1 Mineral_absorption                        0.000885  0.00991
##  2 Protein_digestion_and_absorption          0.00124   0.00991
##  3 Aminoacyl-tRNA_biosynthesis               0.00347   0.0185 
##  4 Central_carbon_metabolism_in_cancer       0.0160    0.0639 
##  5 D-Amino_acid_metabolism                   0.0228    0.0730 
##  6 ABC_transporters                          0.114     0.237  
##  7 Pyrimidine_metabolism                     0.119     0.237  
##  8 Biosynthesis_of_unsaturated_fatty_acids   0.123     0.237  
##  9 Cysteine_and_methionine_metabolism        0.133     0.237  
## 10 Arginine_and_proline_metabolism           0.163     0.261

In this KRASG12D example the compatibility between both tools is cleaner than in the Cachexia case, because: - the identifiers are already KEGG compound identifiers, - the pathway mapping used by enrichmet is directly available, - the EnrichmentSet object used by localEnrichment has been built from the same source.

This makes the example particularly useful as a controlled use case to compare the behaviour of the two packages under nearly identical inputs.

6 Apendix I Methods and packages explained

6.1 Over-Representation Analysis (ORA)

6.1.1 Concept

Over-Representation Analysis (ORA) is a classical approach used to identify biological sets (such as pathways, chemical classes, or ontological categories) that are statistically enriched in a predefined list of significant features, typically metabolites or genes.

The key idea is to test whether a particular set contains more significant features than expected by chance, given a specified background (Rivals et al. 2006; Khatri, Sirota, and Butte 2012).

For each set, a contingency table is constructed:

In Set Not in Set Total
Selected (hits) k K - k K
Not selected m - k N - m - K + k N - K

where: - N = total number of features in the background - K = number of selected (significant) features
- m = number of features in the given set
- k = number of selected features that belong to that set

A statistical test (typically Fisher’s exact test or the hypergeometric test) is then applied to compute the probability of observing at least k selected features in the set under a random sampling model (Rivals et al. 2006).

The resulting p-values are adjusted for multiple testing (e.g., Benjamini–Hochberg FDR correction), and the fold enrichment is often reported as:

\[ \text{Fold Enrichment} = \frac{k/K}{m/N} \]

indicating how much more represented the set is among selected features than expected by chance.

6.1.2 Implementation in localEnrichment

The function ora_test() implements this classical ORA workflow in a flexible and reproducible way using the EnrichmentSet object as input. It can be applied to any type of feature-to-set mapping, including pathways, metabolite classes, or ontological categories.

6.1.3 Implementation in EnrichMet

EnrichMet follows the same classical ORA framework, specifically applied to KEGG pathways. For each pathway, it constructs a 2×2 contingency table comparing detected metabolites versus a background set, and applies Fisher’s exact test to evaluate enrichment (Mekonnen et al. 2026).

Key characteristics:

  • Uses a user-provided list of significant metabolites together with a predefined KEGG-based background.
  • Computes enrichment independently for each pathway.
  • Adjusts p-values using Benjamini–Hochberg correction and reports q-values.
  • Outputs ranked pathways together with significance metrics, typically visualized using minus \(log10(p)\) plots.

Conceptually, this implementation is equivalent to standard ORA, with the main specificity being the direct integration of curated KEGG pathway mappings and a streamlined one-step workflow.

6.2 Quantitative Enrichment Analysis (QEA / GSEA)

6.2.1 Concept

Quantitative Enrichment Analysis (QEA), also known as Metabolite Set Enrichment Analysis (GSEA), is the analogue of GSEA for metabolomics and other continuous-valued omics data (Subramanian et al. 2005; Xia, Wishart, and Valencia 2011).

Instead of dichotomizing features into “significant/non-significant,” QEA uses the entire ranked list of features, each with an associated score such as:

  • log fold change
  • t-statistic
  • correlation coefficient
  • loading from PCA/MFA/PLS
  • any quantity that defines an ordering of features

For each metabolite set, QEA tests whether its members tend to appear towards the top or bottom of the ranked list more often than expected by chance.

This avoids arbitrary cutoffs and provides sensitivity to subtle but consistent shifts across a pathway or chemical class.

6.2.2 GSEA variant implemented in localEnrichment

There are several variants of GSEA/QEA in the literature:

  1. ORA-like GSEA (MetaboAnalyst’s early implementation): uses Fisher tests on selected metabolites → not what we use here.

  2. SSPA / Mean/median tests (subset-based statistics):
    compares mean scores inside vs. outside a set → not used here.

  3. Correlation-based global tests (e.g., GlobalTest):
    model score–membership associations → not used here.

  4. GSEA-style running-sum statistic (Subramanian et al., PNAS 2005):
    uses an enrichment score based on a weighted random walk and permutation testing: this is the method implemented in qea_test() (Subramanian et al. 2005).

localEnrichment implements a GSEA-style running-score, adapted to metabolomics.

Specifically:

  • The ranked list is used without thresholding.
  • For each set, a running enrichment profile is computed:
    • hits contribute +1/Nh
    • misses contribute -1/(N - Nh)
  • The Enrichment Score (ES) is the maximum (positive enrichment) or minimum (negative enrichment) deviation from zero.
  • Significance is evaluated by permuting feature positions, exactly as in classical GSEA (Subramanian et al. 2005).

This is the most statistically robust form of QEA/GSEA and is the same approach used in modern tools such as fgsea (Korotkevich, Sukhov, and Sergushichev 2016), clusterProfiler::GSEA(), and the updated MetaboAnalyst 5.0 “QEA” module.

6.2.3 Implementation in EnrichMet (MetSEA)

EnrichMet implements GSEA (referred to as MetSEA) using the fgsea algorithm, which corresponds to a fast implementation of the GSEA running-sum statistic (Korotkevich, Sukhov, and Sergushichev 2016; Mekonnen et al. 2026).

Main elements:

  • Metabolites are ranked using continuous statistics (typically log2 fold change and/or p-values).
  • Pathways are formatted as gene/metabolite sets (GMT format) and evaluated using fgsea.
  • A running enrichment score (ES) is computed along the ranked list:
    • increased when a metabolite belongs to the set
    • decreased otherwise
  • The maximum deviation from zero defines the ES.
  • Significance is assessed via permutation, and results are reported as:
    • ES
    • Normalized Enrichment Score (NES)
    • adjusted p-values

In practice, this implementation is methodologically equivalent to the GSEA-style approach described above, with two practical differences:

  • it leverages fgsea for computational efficiency, enabling fast analysis of large datasets
  • it is tightly integrated with KEGG pathway definitions and the ORA workflow within the same pipeline

Overall, both localEnrichment and EnrichMet implement conceptually the same class of QEA/GSEA methods (GSEA-like), differing mainly in computational backend and workflow integration rather than in statistical principles.

8 References

Khatri, Purvesh, Marina Sirota, and Atul J. Butte. 2012. “Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges.” PLoS Computational Biology 8. https://doi.org/10.1371/JOURNAL.PCBI.1002375.
Korotkevich, Gennady, Vladimir Sukhov, and Alexey A. Sergushichev. 2016. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://api.semanticscholar.org/CorpusID:229281802.
Mekonnen, Yonatan Ayalew, Neha Dhake, Vanessa Rubio, Shreya Jaiswal, Isis Narváez-Bandera, Ashley Lui, Augustine Takyi, et al. 2026. “EnrichMet: R Package for Integrated Pathway and Network Analysis for Metabolomics.” bioRxiv. https://doi.org/10.1101/2025.08.28.672951.
Rivals, Isabelle, Léon Personnaz, Lieng Taing, and Marie-Claude Potier. 2006. “Enrichment or Depletion of a GO Category Within a Class of Genes: Which Test?” Bioinformatics 23 (4): 401–7. https://doi.org/10.1093/bioinformatics/btl633.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (October): 15545–50. https://doi.org/10.1073/PNAS.0506580102.
Xia, Jianguo, and David S. Wishart. 2010. “MSEA: A Web-Based Tool to Identify Biologically Meaningful Patterns in Quantitative Metabolomic Data.” Nucleic Acids Research 38 (Web Server issue): W71–77. https://doi.org/10.1093/nar/gkq329.
Xia, Jianguo, David S. Wishart, and Alfonso Valencia. 2011. “MetPA: A Web-Based Metabolomics Tool for Pathway Analysis and Visualization.” Bioinformatics 27: 2342–44. https://doi.org/10.1093/BIOINFORMATICS/BTQ418.

9 Session information

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.2 (2024-10-31 ucrt)
##  os       Windows 11 x64 (build 26200)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  Spanish_Spain.utf8
##  ctype    Spanish_Spain.utf8
##  tz       Europe/Madrid
##  date     2026-03-23
##  pandoc   3.6.3 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##  quarto   NA @ C:\\Users\\Usuario\\AppData\\Local\\Programs\\Quarto\\bin\\quarto.exe
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package         * version date (UTC) lib source
##  BiocGenerics      0.52.0  2024-10-29 [1] Bioconductor 3.20 (R 4.4.1)
##  BiocManager     * 1.30.27 2025-11-14 [1] CRAN (R 4.4.3)
##  BiocParallel      1.40.2  2025-09-29 [1] Bioconductor
##  bslib             0.9.0   2025-01-30 [1] CRAN (R 4.4.3)
##  cachem            1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
##  circlize          0.4.16  2024-02-20 [1] CRAN (R 4.4.2)
##  cli               3.6.5   2025-04-23 [1] CRAN (R 4.4.3)
##  clue              0.3-67  2026-02-18 [1] CRAN (R 4.4.3)
##  cluster           2.1.8.1 2025-03-12 [1] CRAN (R 4.4.3)
##  codetools         0.2-20  2024-03-31 [2] CRAN (R 4.4.2)
##  colorspace        2.1-2   2025-09-22 [1] CRAN (R 4.4.3)
##  ComplexHeatmap    2.22.0  2024-10-29 [1] Bioconductor 3.20 (R 4.4.1)
##  cowplot           1.2.0   2025-07-07 [1] CRAN (R 4.4.3)
##  crayon            1.5.3   2024-06-20 [1] CRAN (R 4.4.1)
##  data.table        1.17.8  2025-07-10 [1] CRAN (R 4.4.3)
##  devtools          2.4.6   2025-10-03 [1] CRAN (R 4.4.3)
##  dichromat         2.0-0.1 2022-05-02 [1] CRAN (R 4.4.0)
##  digest            0.6.37  2024-08-19 [1] CRAN (R 4.4.2)
##  doParallel        1.0.17  2022-02-07 [1] CRAN (R 4.4.0)
##  dplyr           * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
##  ellipsis          0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
##  enrichmet       * 0.99.7  2025-10-19 [1] git2r (https://github.com/biodatalab/enrichmet.git@4c12848)
##  evaluate          1.0.5   2025-08-27 [1] CRAN (R 4.4.3)
##  farver            2.1.2   2024-05-13 [1] CRAN (R 4.4.0)
##  fastmap           1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
##  fastmatch         1.1-6   2024-12-23 [1] CRAN (R 4.4.2)
##  fgsea             1.32.4  2025-03-20 [1] Bioconductor 3.20 (R 4.4.3)
##  forcats           1.0.1   2025-09-25 [1] CRAN (R 4.4.3)
##  foreach           1.5.2   2022-02-02 [1] CRAN (R 4.4.0)
##  fs                1.6.7   2026-03-06 [1] CRAN (R 4.4.3)
##  generics          0.1.4   2025-05-09 [1] CRAN (R 4.4.3)
##  GetoptLong        1.1.0   2025-11-28 [1] CRAN (R 4.4.2)
##  ggforce           0.5.0   2025-06-18 [1] CRAN (R 4.4.3)
##  ggplot2           4.0.2   2026-02-03 [1] CRAN (R 4.4.3)
##  ggraph            2.2.2   2025-08-24 [1] CRAN (R 4.4.3)
##  ggrepel           0.9.6   2024-09-07 [1] CRAN (R 4.4.1)
##  GlobalOptions     0.1.3   2025-11-28 [1] CRAN (R 4.4.2)
##  glue              1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
##  graphlayouts      1.2.3   2026-02-21 [1] CRAN (R 4.4.3)
##  gridExtra         2.3     2017-09-09 [1] CRAN (R 4.4.0)
##  gtable            0.3.6   2024-10-25 [1] CRAN (R 4.4.2)
##  hms               1.1.4   2025-10-17 [1] CRAN (R 4.4.3)
##  htmltools         0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
##  igraph            2.2.2   2026-02-12 [1] CRAN (R 4.4.3)
##  IRanges           2.40.1  2024-12-05 [1] Bioconductor 3.20 (R 4.4.2)
##  iterators         1.0.14  2022-02-05 [1] CRAN (R 4.4.0)
##  jquerylib         0.1.4   2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite          2.0.0   2025-03-27 [1] CRAN (R 4.4.3)
##  knitr             1.50    2025-03-16 [1] CRAN (R 4.4.3)
##  labeling          0.4.3   2023-08-29 [1] CRAN (R 4.4.0)
##  lattice           0.22-7  2025-04-02 [2] CRAN (R 4.4.3)
##  lifecycle         1.0.5   2026-01-08 [1] CRAN (R 4.4.3)
##  localEnrichment * 0.1.0   2025-11-29 [1] local
##  magrittr          2.0.4   2025-09-12 [1] CRAN (R 4.4.3)
##  MASS              7.3-65  2025-02-28 [2] CRAN (R 4.4.3)
##  Matrix            1.7-4   2025-08-28 [2] CRAN (R 4.4.3)
##  matrixStats       1.5.0   2025-01-07 [1] CRAN (R 4.4.3)
##  memoise           2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
##  pillar            1.11.1  2025-09-17 [1] CRAN (R 4.4.3)
##  pkgbuild          1.4.8   2025-05-26 [1] CRAN (R 4.4.3)
##  pkgconfig         2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload           1.4.1   2025-09-23 [1] CRAN (R 4.4.3)
##  png               0.1-8   2022-11-29 [1] CRAN (R 4.4.0)
##  polyclip          1.10-7  2024-07-23 [1] CRAN (R 4.4.1)
##  purrr             1.1.0   2025-07-10 [1] CRAN (R 4.4.3)
##  R6                2.6.1   2025-02-15 [1] CRAN (R 4.4.3)
##  RColorBrewer      1.1-3   2022-04-03 [1] CRAN (R 4.4.0)
##  Rcpp              1.1.0   2025-07-02 [1] CRAN (R 4.4.3)
##  readr           * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
##  remotes           2.5.0   2024-03-17 [1] CRAN (R 4.4.0)
##  rjson             0.2.23  2024-09-16 [1] CRAN (R 4.4.1)
##  rlang             1.1.6   2025-04-11 [1] CRAN (R 4.4.3)
##  rmarkdown         2.30    2025-09-28 [1] CRAN (R 4.4.3)
##  rstudioapi        0.18.0  2026-01-16 [1] CRAN (R 4.4.3)
##  S4Vectors         0.44.0  2024-10-29 [1] Bioconductor 3.20 (R 4.4.1)
##  S7                0.2.0   2024-11-07 [1] CRAN (R 4.4.2)
##  sass              0.4.10  2025-04-11 [1] CRAN (R 4.4.3)
##  scales            1.4.0   2025-04-24 [1] CRAN (R 4.4.3)
##  sessioninfo       1.2.3   2025-02-05 [1] CRAN (R 4.4.3)
##  shape             1.4.6.1 2024-02-23 [1] CRAN (R 4.4.0)
##  stringi           1.8.7   2025-03-27 [1] CRAN (R 4.4.3)
##  stringr         * 1.5.2   2025-09-08 [1] CRAN (R 4.4.3)
##  tibble          * 3.3.0   2025-06-08 [1] CRAN (R 4.4.3)
##  tidygraph         1.3.1   2024-01-30 [1] CRAN (R 4.4.0)
##  tidyr           * 1.3.2   2025-12-19 [1] CRAN (R 4.4.3)
##  tidyselect        1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
##  tweenr            2.0.3   2024-02-26 [1] CRAN (R 4.4.0)
##  tzdb              0.5.0   2025-03-15 [1] CRAN (R 4.4.3)
##  usethis           3.2.1   2025-09-06 [1] CRAN (R 4.4.3)
##  utf8              1.2.6   2025-06-08 [1] CRAN (R 4.4.3)
##  vctrs             0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  viridis           0.6.5   2024-01-29 [1] CRAN (R 4.4.0)
##  viridisLite       0.4.3   2026-02-04 [1] CRAN (R 4.4.3)
##  withr             3.0.2   2024-10-28 [1] CRAN (R 4.4.2)
##  xfun              0.54    2025-10-30 [1] CRAN (R 4.4.2)
##  yaml              2.3.10  2024-07-26 [1] CRAN (R 4.4.1)
## 
##  [1] C:/Users/Usuario/AppData/Local/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.2/library
##  * ── Packages attached to the search path.
## 
## ──────────────────────────────────────────────────────────────────────────────