1 Introduction

Enrichment analysis aims to facilitate the biological interpretation of omics results by relating lists of features (genes, proteins, metabolites) to prior biological knowledge stored in databases (pathways, ontologies, chemical classes, etc.).

In genomics, ORA and GSEA workflows are now fairly standardized. In metabolomics, however, tools are more fragmented and rely on different identifier systems and pathway databases.

The goals of this document are:

Quick-start: show how to run basic over-representation analysis (ORA) and metabolite set enrichment analysis (MSEA/GSEA-like) with several tools.
Naïve benchmarking: compare the results of these tools on the same datasets, using MetaboAnalyst Web output as a “practical reference”gold standard”.

We will focus on:

enrichmet (pathway enrichment, GSEA, centrality, plots)
localEnrichment (our own infrastructure for metabolite sets and ORA/MSEA)
ChemRich (which does not prohttps://raw.githubusercontent.com/barupal/ChemRICH/refs/heads/master/chemrich_minimum_analysis.R)
MetaboAnalystR, has become very hard to make it work so the examples will be run from its web site and downloaded.

Rather than exhaustively exploring each tool, we restrict ourselves to a reproducible workflow:

Use an ad-hoc collection of metabolite sets, that is Metabolite sets that have been programatically prepared, as data.frames and EnrichmentSet objects, to be used for these analyses..
For each dataset, derive:
- a list of significant metabolites (signature),
- a background tailored to that signature created either ad-hoc or from the original dataset.
Run ORA (and MSEA when possible) with the different tools on the same sets.

2 Data and metabolite sets

Any pathway analysis requires two inputs:

A list of features usually associated with a phenotype (e.g. metabolites differentially quantified between two groups).
A source of annotations for these features, that can be generically described as Feature sets or, as in this cas Metabolite Sets.

2.1 The datasets

We use three datasets that have been preprocessed elsewhere (see files Benchmarking-PrepareData.Rmd) and testEnrichMet.Rmd.

The Cachexia dataset. This is a popular example used by MetaboAnalyst, where a comparison between affected and control patients is performed. It has the drawback that only differential metabolites are provided, which makes it only partially interesting for some types of analyses. We keep it for its popularity in MetaboAnalyst.
The ST000291 dataset, downloaded from Metabolomics Workbench, and used in a fobitools use case. This is obtained by a nutritional intervention study where a comparion among consumption of Apple, Cranberries and Baseline has been performed. In this case we have the full dataset and the results of comparing the groups using limma whih leads to around 40 differential metabolites. We focus only on the Cranberry vs Baseline comparison. This study seems to have been made using a cross-over design that the authors have later omitted when providing the samples. Because it is only for instructional purposes, this flaw will be ignored.
The Case study of KrasG12D mutation dataset is an undocumented dataset used to illustrate how the R package enrichmet works.

Although it seems that this package provides all that is needed it is being compared to localEnrichment which was developed independently.

2.1.1 Loading datasets

The data for this use case has been prepared elsewhere and is available in the datadirectory.

For each dataset we have the list of metabolites, the analysis results in the form of a topTable and a mapping performed with MetaboAnalyst between the metabolite ids and several databases.

2.1.1.1 Cachexia dataset

# Cachexia: topTable + ID mapping
cachexia_map_path  <- "data/Cachexia_map.csv"
cachexia_tt_path   <- "data/Cachexia_topTable.Rda"

Cachexia_map <- read.csv(cachexia_map_path, stringsAsFactors = FALSE)
load(cachexia_tt_path)  # loads 'topCachexia'

cat("Cachexia_map:\n")

## Cachexia_map:

print_tbl(head(Cachexia_map))

## # A tibble: 6 × 9
##   Query                Match     HMDB  PubChem ChEBI KEGG  METLIN SMILES Comment
##   <chr>                <chr>     <chr>   <int> <chr> <chr>  <int> <chr>    <int>
## 1 2-Hydroxyisobutyrate 2-Hydrox… HMDB… 4277439 19641 <NA>      NA CC(C)…       1
## 2 Creatinine           Creatini… HMDB…     588 16737 C007…      8 CN1CC…       1
## 3 Quinolinate          Quinolin… HMDB…    1066 16675 C037…    330 OC(=O…       1
## 4 Valine               L-Valine  HMDB…    6287 16414 C001…   5842 CC(C)…       1
## 5 Dimethylamine        Dimethyl… HMDB…     674 17170 C005…   3758 CNC          1
## 6 Tryptophan           L-Trypto… HMDB…    6305 16828 C000…   5879 N[C@@…       1

cat("\ntopCachexia:\n")

## 
## topCachexia:

print_tbl(head(topCachexia))

## # A tibble: 6 × 6
##   groupCachexia groupControl AveExpr     F  P.Value adj.P.Val
##           <dbl>        <dbl>   <dbl> <dbl>    <dbl>     <dbl>
## 1          43.2         27.9    37.3 102.  3.11e-22  1.96e-20
## 2       10722.        5619.   8734.   88.7 1.40e-20  4.41e-19
## 3          83.7         39.3    66.4  85.7 3.48e-20  7.31e-19
## 4          45.6         20.1    35.7  72.8 2.25e-18  3.55e-17
## 5         454.         209.    358.   68.1 1.17e-17  1.48e-16
## 6          81.8         41.8    66.2  65.1 3.56e-17  3.74e-16

2.1.1.2 Metabolomics Workbench ST000291 dataset

# ST000291: Cranberry vs Baseline topTable + ID mapping
st291_map_path   <- "data/MW_ST000291__map.csv"
st291_tt_path    <- "data/MW_ST000291-limma_res.Rda"

MS_ST000291_map <- read.csv(st291_map_path, stringsAsFactors = FALSE)
load(st291_tt_path)  # loads 'topCranVSBase'
topCranVSBase<- limma_resCB; rm(limma_resCB)
topCranVSBase_HMDB<- limma_HMDB_CB; rm(limma_HMDB_CB)

cat("\nMS_ST000291_map:\n")

## 
## MS_ST000291_map:

print_tbl(head(MS_ST000291_map))

## # A tibble: 6 × 9
##      Query Match                 HMDB  PubChem ChEBI KEGG  METLIN SMILES Comment
##      <int> <chr>                 <chr>   <int> <int> <chr>  <int> <chr>    <int>
## 1   192798 Digitalose            <NA>   192798    NA <NA>      NA <NA>         1
## 2    79034 12-Hydroxydodecanoic… HMDB…   79034 39567 C083…   6465 OCCCC…       1
## 3    10413 4-Hydroxybutyric acid HMDB…   10413 30830 C009…   5678 OCCCC…       1
## 4   439230 Mevalonic acid        HMDB…  439230 17710 C004…    127 C[C@@…       1
## 5 20975673 <NA>                  <NA>       NA    NA <NA>      NA <NA>         0
## 6  5275508 <NA>                  <NA>       NA    NA <NA>      NA <NA>         0

cat("\ntopCranVSBase:\n")

## 
## topCranVSBase:

print_tbl(head(topCranVSBase))

## # A tibble: 6 × 7
##   PubChemCID log2FC   AveExpr     t       pvalue adj_pvalue     B
##   <chr>       <dbl>     <dbl> <dbl>        <dbl>      <dbl> <dbl>
## 1 54678503     2.92  3.57e-16  6.45 0.0000000623  0.0000847  8.11
## 2 71485        3.27 -2.40e-16  5.85 0.000000484   0.000329   6.19
## 3 94214        2.18 -8.93e-17  5.69 0.000000843   0.000382   5.68
## 4 5378303      2.41  1.10e-16  5.54 0.00000140    0.000403   5.20
## 5 3037         3.18  8.39e-17  5.46 0.00000184    0.000403   4.95
## 6 10178        3.33  3.81e-17  5.45 0.00000189    0.000403   4.92

cat("\ntopCranVSBase (HMDB):\n")

## 
## topCranVSBase (HMDB):

print_tbl(head(topCranVSBase_HMDB))

## # A tibble: 6 × 7
##   feature     log2FC   AveExpr     t     pvalue adj_pvalue     B
##   <chr>        <dbl>     <dbl> <dbl>      <dbl>      <dbl> <dbl>
## 1 HMDB0251200   3.18  8.39e-17  5.48 0.00000172   0.000876  5.01
## 2 HMDB0248806   2.26  3.85e-17  5.34 0.00000281   0.000876  4.54
## 3 HMDB0258615   2.05  2.68e-16  5.00 0.00000881   0.00154   3.47
## 4 HMDB0006115   1.56 -4.28e-16  4.96 0.00000986   0.00154   3.36
## 5 HMDB0000738   1.46  1.01e-15  4.72 0.0000221    0.00233   2.60
## 6 HMDB0000714   1.51 -8.74e-16  4.72 0.0000225    0.00233   2.58

2.1.1.3 `enrichmet` package example dataset

The instructions below load a list of metrabolites, a Summary Statistics table and an EnrichmentSet object, required to perform Enrichment Analysis using ORA and MSEA. Besides this the raw versions of these data to be used with Enrichment are also loaded, so avoiding the need to extract them from the EnrichmentSet object.

enrichmet_path <- "data/enrichmet_example.Rda"
load(enrichmet_path)  # loads 'topCranVSBase'

2.2 Metabolite Sets

We also use a unified collection of metabolite sets, previously constructed and saved as:

KEGGset_HMDB, KEGGset_HMDB_df
KEGGset_PubChem, KEGGset_PubChem_df
ChemicalClassSet, ChemicalClassSet_df
SMPDBset, SMPDBset_df

All sets collections, but KEGGset_PubChem use HMDB IDs as feature identifiers. KEGGset_PubChem uses PubChemIDs.

2.2.1 Load Metabolite Sets

Metabolite sets have been prepared elsewhere and saved as EnrichmentSet objects of the class defined in the localEnrichment package.

# Metabolite sets (EnrichmentSet collection)
# mets_collection_path <- "databases/MetaboliteSets_Collection.Rda"
# load(mets_collection_path)

load("C:/Users/Usuario/OneDrive - Universitat de Barcelona/EstBioinfo/Databases4Metabolomics/results/MetaboliteSets_Collection.Rda")

### Parche. Pendent de netejar-ho al script  que ho crea
KEGGset_HMDB_df <- df_HMDB
KEGGset_PubChem_df <- df_PubChem
rm(df_HMDB, df_PubChem, KEGGset_global, KEGGset_global_df, KEGGset_study, KEGGset_study_df)
###

cat("\nEnrichmentSet objects loaded:\n")

## 
## EnrichmentSet objects loaded:

summary(KEGGset_HMDB);      cat("\n")

## EnrichmentSet summary:
##   Mapping name: KEGG_pathway_HMDB 
##   Source: KEGG 
##   Feature IDs: HMDB 
##   Number of sets: 282 
##   Mean set size: 22.60993 features
##   Median set size: 9.5 features

summary(KEGGset_PubChem);      cat("\n")

## EnrichmentSet summary:
##   Mapping name: KEGG_pathway_PubChem 
##   Source: KEGG 
##   Feature IDs: PubChem 
##   Number of sets: 266 
##   Mean set size: 18.18045 features
##   Median set size: 8 features

summary(ChemicalClassSet);    cat("\n")

## EnrichmentSet summary:
##   Mapping name: ChemicalClasses 
##   Source: HMDB 
##   Feature IDs: HMDB 
##   Number of sets: 30 
##   Mean set size: 9.366667 features
##   Median set size: 3.5 features

summary(SMPDBset);            cat("\n")

## EnrichmentSet summary:
##   Mapping name: SMPDB_pathways 
##   Source: SMPDB 
##   Feature IDs: HMDB 
##   Number of sets: 48654 
##   Mean set size: 17.30571 features
##   Median set size: 18 features

cat("\nKEGGset_HMDB_df:\n")

## 
## KEGGset_HMDB_df:

print_tbl(head(KEGGset_HMDB_df))

## # A tibble: 6 × 3
##   set_name                                                    set_id Metabolites
##   <chr>                                                       <chr>  <chr>      
## 1 Glycolysis / Gluconeogenesis - Homo sapiens (human)         hsa00… HMDB000024…
## 2 Citrate cycle (TCA cycle) - Homo sapiens (human)            hsa00… HMDB000024…
## 3 Pentose phosphate pathway - Homo sapiens (human)            hsa00… HMDB000024…
## 4 Pentose and glucuronate interconversions - Homo sapiens (h… hsa00… HMDB000024…
## 5 Fructose and mannose metabolism - Homo sapiens (human)      hsa00… HMDB000012…
## 6 Galactose metabolism - Homo sapiens (human)                 hsa00… HMDB000028…

cat("\nKEGGset_PubChem_df:\n")

## 
## KEGGset_PubChem_df:

print_tbl(head(KEGGset_PubChem_df))

## # A tibble: 6 × 3
##   set_name                                                    set_id Metabolites
##   <chr>                                                       <chr>  <chr>      
## 1 Glycolysis / Gluconeogenesis - Homo sapiens (human)         hsa00… 1060;10773…
## 2 Citrate cycle (TCA cycle) - Homo sapiens (human)            hsa00… 1060;10773…
## 3 Pentose phosphate pathway - Homo sapiens (human)            hsa00… 1060;10773…
## 4 Pentose and glucuronate interconversions - Homo sapiens (h… hsa00… 1060;10773…
## 5 Fructose and mannose metabolism - Homo sapiens (human)      hsa00… 439709;668…
## 6 Galactose metabolism - Homo sapiens (human)                 hsa00… 8629;5988;…

cat("\nChemicalClassSet_df:\n")

## 
## ChemicalClassSet_df:

print_tbl(head(ChemicalClassSet_df))

## # A tibble: 6 × 3
##   set_name                set_id                  Metabolites                   
##   <chr>                   <chr>                   <chr>                         
## 1 Acylcarnitines          Acylcarnitines          HMDB0000201;HMDB0000824;HMDB0…
## 2 Amine oxide             Amine oxide             HMDB0000925                   
## 3 Amino Acids             Amino Acids             HMDB0000187;HMDB0000687;HMDB0…
## 4 Amino Acids Derivatives Amino Acids Derivatives HMDB0000092;HMDB0000812;HMDB0…
## 5 Amino acid - related    Amino acid - related    HMDB0000452;HMDB0000510;HMDB0…
## 6 Azoles                  Azoles                  HMDB0000462

cat("\nSMPDBset_df:\n")

## 
## SMPDBset_df:

print_tbl(head(SMPDBset_df))

## # A tibble: 6 × 3
##   set_name                                  set_id     Metabolites              
##   <chr>                                     <chr>      <chr>                    
## 1 Citrullinemia Type I                      SMP0000001 HMDB0000641;HMDB0000161;…
## 2 Carbamoyl Phosphate Synthetase Deficiency SMP0000002 HMDB0000641;HMDB0000161;…
## 3 Argininosuccinic Aciduria                 SMP0000003 HMDB0000641;HMDB0000161;…
## 4 Glycine and Serine Metabolism             SMP0000004 HMDB0002134;HMDB0001377;…
## 5 Pterine Biosynthesis                      SMP0000005 HMDB0006822;HMDB0002111;…
## 6 Tyrosine Metabolism                       SMP0000006 HMDB0000158;HMDB0000208;…

The Metabolite set for the Enrichmet package example data has been loaded before.

summary(KEGGset_KEGG);      cat("\n")

## EnrichmentSet summary:
##   Mapping name: KEGG_hsa_pathways 
##   Source: KEGG 
##   Feature IDs: kegg_id 
##   Number of sets: 351 
##   Mean set size: 16.37607 features
##   Median set size: 6 features

print_tbl(head(KEGGset_KEGG@data))

## # A tibble: 6 × 4
##   set_id set_name                                  feature_ids      feature_list
##   <chr>  <chr>                                     <chr>            <list>      
## 1 PW_1   ABC transporters                          C00009;C00025;C… <chr [138]> 
## 2 PW_10  Aldosterone-regulated sodium reabsorption C00238;C00575;C… <chr [8]>   
## 3 PW_100 Epstein-Barr virus infection              C00076;C01245;C… <chr [3]>   
## 4 PW_101 ErbB signaling pathway                    C00076;C00165;C… <chr [4]>   
## 5 PW_102 Estrogen signaling pathway                C00076;C00165;C… <chr [8]>   
## 6 PW_103 Ether lipid metabolism                    C00670;C00958;C… <chr [25]>

Notice that Metabolite identifiers for this example data are neither HMDB, nor PubChem but KEGG ids!

3 Building signatures and backgrounds

Our aim is to derive signatures and backgrounds that are compatible with the EnrichmentSet objects, which use HMDB IDs.

We therefore:

Map each topTable to HMDB IDs.
Construct a vector of significant HMDB IDs (signature).
Use filterEnrichmentSet(Eset, ids) to restrict the metabolite sets to those that contain at least one of the signature metabolites.
Build a background as the union of all metabolites in the filtered sets.

3.1 Cachexia: HMDB mapping and signature

library(tibble)

## Warning: package 'tibble' was built under R version 4.4.3

# Convert topCachexia to tibble with metabolite name
cachexia_tt <- topCachexia %>%
  rownames_to_column(var = "Metabolite") %>%
  left_join(Cachexia_map, by = c("Metabolite" = "Query"))

cat("Cachexia limma table with mapping:\n")

## Cachexia limma table with mapping:

print_tbl(cachexia_tt)

## # A tibble: 63 × 15
##   Metabolite   groupCachexia groupControl AveExpr     F  P.Value adj.P.Val Match
##   <chr>                <dbl>        <dbl>   <dbl> <dbl>    <dbl>     <dbl> <chr>
## 1 2-Hydroxyis…          43.2         27.9    37.3 102.  3.11e-22  1.96e-20 2-Hy…
## 2 Creatinine         10722.        5619.   8734.   88.7 1.40e-20  4.41e-19 Crea…
## 3 Quinolinate           83.7         39.3    66.4  85.7 3.48e-20  7.31e-19 Quin…
## 4 Valine                45.6         20.1    35.7  72.8 2.25e-18  3.55e-17 L-Va…
## 5 Dimethylami…         454.         209.    358.   68.1 1.17e-17  1.48e-16 Dime…
## 6 Tryptophan            81.8         41.8    66.2  65.1 3.56e-17  3.74e-16 L-Tr…
## # ℹ 57 more rows
## # ℹ 7 more variables: HMDB <chr>, PubChem <int>, ChEBI <chr>, KEGG <chr>,
## #   METLIN <int>, SMILES <chr>, Comment <int>

# p-value column and threshold
p_col_cachexia <- "adj.P.Val"
alpha_cachexia <- 0.05

# Signature: HMDB IDs of significant metabolites
cachexia_sig <- cachexia_tt %>%
  filter(!is.na(HMDB),
         !!rlang::sym(p_col_cachexia) < alpha_cachexia) %>%
  pull(HMDB) %>%
  unique()

length(cachexia_sig)

## [1] 61

cachexia_sig

##  [1] "HMDB0242161" "HMDB0000562" "HMDB0000232" "HMDB0000883" "HMDB0000087"
##  [6] "HMDB0000929" "HMDB0000164" "HMDB0000267" "HMDB0000687" "HMDB0000168"
## [11] "HMDB0000092" "HMDB0000479" "HMDB0000167" "HMDB0000043" "HMDB0000161"
## [16] "HMDB0000682" "HMDB0000174" "HMDB0000641" "HMDB0000187" "HMDB0000149"
## [21] "HMDB0000094" "HMDB0000958" "HMDB0000157" "HMDB0000128" "HMDB0000158"
## [26] "HMDB0000172" "HMDB0000177" "HMDB0000300" "HMDB0000123" "HMDB0000754"
## [31] "HMDB0000042" "HMDB0000020" "HMDB0000211" "HMDB0000243" "HMDB0000182"
## [36] "HMDB0000714" "HMDB0304356" "HMDB0000640" "HMDB0000072" "HMDB0000251"
## [41] "HMDB0000254" "HMDB0000925" "HMDB0000875" "HMDB0001522" "HMDB0000062"
## [46] "HMDB0000001" "HMDB0000452" "HMDB0000201" "HMDB0000134" "HMDB0000448"
## [51] "HMDB0000699" "HMDB0000064" "HMDB0000210" "HMDB0001659" "HMDB0000122"
## [56] "HMDB0000258" "HMDB0000208" "HMDB0003911" "HMDB0000098" "HMDB0000190"
## [61] "HMDB0000956"

Note that in this Cachexia dataset we only have differential metabolites, not the full list of measured features. A realistic background based purely on the data is therefore impossible. Instead, we derive a functional background from the metabolite-set collection.

The background will be constructed artifically by taking:

All sets in the KEGGset_HMDB collection.
The HMDB ids of each set.

# Extract background as the union of all HMDB IDs in the filtered sets

cachexia_background <- unique(unlist(KEGGset_HMDB@data$feature_list))
length(cachexia_background)

## [1] 2106

intersect(cachexia_sig, cachexia_background) |> length()

## [1] 45

We now have all we need to do the Enrichment Analysis on the Cachexia dataset.

The list of significant metabolites: cachexia_sig

str(cachexia_sig)

##  chr [1:61] "HMDB0242161" "HMDB0000562" "HMDB0000232" "HMDB0000883" ...

The background associated with this list: cachexia_background

str(cachexia_background)

##  chr [1:2106] "HMDB0000243" "HMDB0001206" "HMDB0000042" "HMDB0000223" ...

The collections of metabolite sets to do the enrichment Analysis: KEGGset_HMDB.

summary(KEGGset_HMDB)

## EnrichmentSet summary:
##   Mapping name: KEGG_pathway_HMDB 
##   Source: KEGG 
##   Feature IDs: HMDB 
##   Number of sets: 282 
##   Mean set size: 22.60993 features
##   Median set size: 9.5 features

We can also use SMPDBset and ChemicalClassSet because these are sets made of HMDB ids.

summary(SMPDBset)

## EnrichmentSet summary:
##   Mapping name: SMPDB_pathways 
##   Source: SMPDB 
##   Feature IDs: HMDB 
##   Number of sets: 48654 
##   Mean set size: 17.30571 features
##   Median set size: 18 features

summary(ChemicalClassSet)

## EnrichmentSet summary:
##   Mapping name: ChemicalClasses 
##   Source: HMDB 
##   Feature IDs: HMDB 
##   Number of sets: 30 
##   Mean set size: 9.366667 features
##   Median set size: 3.5 features

3.2 ST000291: PubChem and HMDB mappings and signature

3.3 ST000291 with PubChem identifiers

The original ST000291 dataset is annotated with PubChem identifiers, so we are restricted to work Metabolite Sets created with PubChem identifiers.

The code below shows how to prepare a list of selected metabolites and background straight from the available information.

To define the signature we select metabolites with an adjusted p-value of at most 0.33

## ST000291: signature (PubChem) and background

# top table from limma
st291_tt <- topCranVSBase   # ja carregat anteriorment

# signature (adj.P.Val < 0.05)
ST291_sig <- st291_tt %>%
  filter(pvalue  < 0.05) %>%
  pull(PubChemCID) %>%
  unique()

# background = ALL tested metabolites
ST291_background <- st291_tt %>%
  pull(PubChemCID) %>%
  unique()

cat("Signature size:", length(ST291_sig), "\n")

## Signature size: 76

cat("Background size:", length(ST291_background), "\n")

## Background size: 1358

3.4 ST000291 with HMDB identifiers

Given that many programs can only work with HMDB or KEGG identifiers we have prepared and analyzed a subset of metabolites made only by those that have such ids.

## ST000291: signature (HMDB) and background

# top table from limma
st291_tt_HMDB <- topCranVSBase_HMDB   # ja carregat anteriorment

# signature (adj.P.Val < 0.05)
ST291_HMDB_sig <- st291_tt_HMDB %>%
  filter(pvalue  < 0.05) %>%
  pull(feature) %>%
  unique()

# background = ALL tested metabolites
ST291_HMDB_background <- st291_tt_HMDB %>%
  pull(feature) %>%
  unique()

cat("Signature size:", length(ST291_HMDB_sig), "\n")

## Signature size: 29

cat("Background size:", length(ST291_HMDB_background), "\n")

## Background size: 623

3.5 Enrichmet Example: KEGG mappings and signature

The object loaded already contains the signature as EM_inputMetaboilites. The background can be obtained from the EM_summaryStats object.

EM_sig <- EM_inputMetabolites
EM_background <- EM_summaryStats$met_id
cat("Signature size:", length(EM_sig), "\n")

## Signature size: 57

cat("Background size:", length(EM_background), "\n")

## Background size: 255

4 Enrichment with `localEnrichment`

4.1 Overrepresentation Analysis (ORA) for the Cachexia dataset

4.1.1 ORA based on KEGGsets_HMDB

We start doing ORA using KEGG metabolite sets with HMDB identifiers

First perform Fisher test for all sets and select the enriched ones.

library(localEnrichment)

# ORA per Cachexia vs KEGG (HMDB)

ORA_cachexia_KEGG <- ora_test(
  eset       = KEGGset_HMDB,
  selected   = cachexia_sig,
  background = cachexia_background,
  test       = "fisher",
  p_adjust   = "BH",
  min_set_size = 3,
  max_set_size = 500
)

# veure com és el resultat
dplyr::glimpse(ORA_cachexia_KEGG)

## Rows: 99
## Columns: 8
## $ set_id            <chr> "hsa05230", "hsa00970", "hsa04974", "hsa04978", "hsa…
## $ set_name          <chr> "Central carbon metabolism in cancer - Homo sapiens …
## $ n_selected_in_set <int> 17, 13, 14, 10, 14, 8, 10, 6, 9, 8, 6, 5, 5, 5, 5, 3…
## $ n_in_set          <int> 35, 23, 47, 27, 82, 26, 47, 15, 45, 40, 22, 17, 18, …
## $ p_value           <dbl> 2.039174e-18, 2.441065e-15, 1.026504e-11, 1.122660e-…
## $ overlap_features  <chr> "HMDB0000883;HMDB0000929;HMDB0000687;HMDB0000168;HMD…
## $ p_adj             <dbl> 2.018783e-16, 1.208327e-13, 3.387462e-10, 2.778584e-…
## $ fold_enrichment   <dbl> 16.769087, 19.513899, 10.283920, 12.786885, 5.894442…

head(ORA_cachexia_KEGG)

## # A tibble: 6 × 8
##   set_id  set_name n_selected_in_set n_in_set  p_value overlap_features    p_adj
##   <chr>   <chr>                <int>    <int>    <dbl> <chr>               <dbl>
## 1 hsa052… Central…                17       35 2.04e-18 HMDB0000883;HMD… 2.02e-16
## 2 hsa009… Aminoac…                13       23 2.44e-15 HMDB0000883;HMD… 1.21e-13
## 3 hsa049… Protein…                14       47 1.03e-11 HMDB0000883;HMD… 3.39e-10
## 4 hsa049… Mineral…                10       27 1.12e- 9 HMDB0000883;HMD… 2.78e- 8
## 5 hsa020… ABC tra…                14       82 3.11e- 8 HMDB0000883;HMD… 6.15e- 7
## 6 hsa002… Alanine…                 8       26 3.23e- 7 HMDB0000168;HMD… 5.32e- 6
## # ℹ 1 more variable: fold_enrichment <dbl>

ORA_cachexia_KEGG_sig <- ORA_cachexia_KEGG %>%
  dplyr::filter(p_adj < 0.05) %>%
  dplyr::arrange(p_adj)

nrow(ORA_cachexia_KEGG_sig)

## [1] 26

head(ORA_cachexia_KEGG_sig, 10)

## # A tibble: 10 × 8
##    set_id set_name n_selected_in_set n_in_set  p_value overlap_features    p_adj
##    <chr>  <chr>                <int>    <int>    <dbl> <chr>               <dbl>
##  1 hsa05… Central…                17       35 2.04e-18 HMDB0000883;HMD… 2.02e-16
##  2 hsa00… Aminoac…                13       23 2.44e-15 HMDB0000883;HMD… 1.21e-13
##  3 hsa04… Protein…                14       47 1.03e-11 HMDB0000883;HMD… 3.39e-10
##  4 hsa04… Mineral…                10       27 1.12e- 9 HMDB0000883;HMD… 2.78e- 8
##  5 hsa02… ABC tra…                14       82 3.11e- 8 HMDB0000883;HMD… 6.15e- 7
##  6 hsa00… Alanine…                 8       26 3.23e- 7 HMDB0000168;HMD… 5.32e- 6
##  7 hsa00… Glyoxyl…                10       47 4.39e- 7 HMDB0000641;HMD… 6.21e- 6
##  8 hsa00… Citrate…                 6       15 1.88e- 6 HMDB0000094;HMD… 2.33e- 5
##  9 hsa00… D-Amino…                 9       45 3.08e- 6 HMDB0000167;HMD… 3.39e- 5
## 10 hsa00… Glycine…                 8       40 1.15e- 5 HMDB0000929;HMD… 1.14e- 4
## # ℹ 1 more variable: fold_enrichment <dbl>

str(ORA_cachexia_KEGG_sig)

## EnrchmnR [26 × 8] (S3: EnrichmentResult/tbl_df/tbl/data.frame)
##  $ set_id           : chr [1:26] "hsa05230" "hsa00970" "hsa04974" "hsa04978" ...
##  $ set_name         : chr [1:26] "Central carbon metabolism in cancer - Homo sapiens (human)" "Aminoacyl-tRNA biosynthesis - Homo sapiens (human)" "Protein digestion and absorption - Homo sapiens (human)" "Mineral absorption - Homo sapiens (human)" ...
##  $ n_selected_in_set: int [1:26] 17 13 14 10 14 8 10 6 9 8 ...
##  $ n_in_set         : int [1:26] 35 23 47 27 82 26 47 15 45 40 ...
##  $ p_value          : num [1:26] 2.04e-18 2.44e-15 1.03e-11 1.12e-09 3.11e-08 ...
##  $ overlap_features : chr [1:26] "HMDB0000883;HMDB0000929;HMDB0000687;HMDB0000168;HMDB0000161;HMDB0000641;HMDB0000187;HMDB0000094;HMDB0000158;HMD"| __truncated__ "HMDB0000883;HMDB0000929;HMDB0000687;HMDB0000168;HMDB0000167;HMDB0000161;HMDB0000641;HMDB0000187;HMDB0000158;HMD"| __truncated__ "HMDB0000883;HMDB0000929;HMDB0000687;HMDB0000168;HMDB0000167;HMDB0000161;HMDB0000641;HMDB0000187;HMDB0000158;HMD"| __truncated__ "HMDB0000883;HMDB0000929;HMDB0000687;HMDB0000168;HMDB0000167;HMDB0000161;HMDB0000641;HMDB0000187;HMDB0000172;HMDB0000123" ...
##  $ p_adj            : num [1:26] 2.02e-16 1.21e-13 3.39e-10 2.78e-08 6.15e-07 ...
##  $ fold_enrichment  : num [1:26] 16.77 19.51 10.28 12.79 5.89 ...
##  - attr(*, "metadata")=List of 6
##   ..$ mapping_name   : chr "KEGG_pathway_HMDB"
##   ..$ feature_id_type: chr "HMDB"
##   ..$ feature_species: chr "Homo sapiens"
##   ..$ set_source     : chr "KEGG"
##   ..$ version        : chr "2025-11-29"
##   ..$ description    : chr "KEGG pathway to HMDB metabolites via metaboliteIDmapping"

This can be visualized by different approaches.

First clean the Pathway names

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

clean_pathway_name <- function(x) {
  gsub(" - Homo sapiens \\(human\\)$", "", x, perl = TRUE) |>
    trimws()
}

ORA_cachexia_KEGG_sig <- ORA_cachexia_KEGG_sig %>%
  dplyr::mutate(set_name_clean = clean_pathway_name(set_name))

top20 <- ORA_cachexia_KEGG_sig %>%  arrange(p_adj) 
  # %>% dplyr::slice(1:20)


# IMPORTANT: nivells únics per ordenar el gràfic
top20$set_name_clean <- factor(
  top20$set_name_clean,
  levels = rev(unique(top20$set_name_clean))
)


ggplot(top20, aes(x = set_name_clean, y = -log10(p_adj))) +
  geom_col(fill = "#3B82F6") +
  coord_flip() +
  labs(
    title = "Cachexia – KEGG Enrichment (Top 20)",
    x = "Pathway",
    y = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

ggplot(top20, aes(x = -log10(p_adj), y = set_name_clean)) +
  geom_segment(aes(x = 0, xend = -log10(p_adj),
                   y = set_name_clean, yend = set_name_clean),
               color = "#999999") +
  geom_point(size = 3, color = "#E11D48") +
  labs(
    title = "Cachexia – KEGG Enrichment (Top 20)",
    x = "-log10(FDR)",
    y = "Pathway"
  ) +
  theme_minimal(base_size = 12)

top30 <- ORA_cachexia_KEGG_sig %>%  arrange(p_adj) 
# %>% dplyr::slice(1:30)

ggplot(top30, aes(
  x = fold_enrichment,
  y = -log10(p_adj),
  size = n_selected_in_set,
  color = -log10(p_adj)
)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(low = "#8B5CF6", high = "#1E40AF") +
  labs(
    title = "Cachexia – KEGG Enrichment (Bubble plot)",
    x = "Fold Enrichment",
    y = "-log10(FDR)",
    size = "# hits"
  ) +
  theme_minimal(base_size = 12)

4.1.2 ORA based on `SMPDBsets`

We can proceed similarly with SMPDB

# ORA Cachexia vs SMPDB

ORA_cachexia_SMPDB <- ora_test(
  eset         = SMPDBset,
  selected     = cachexia_sig,
  background   = cachexia_background,
  test         = "fisher",
  p_adjust     = "BH",
  min_set_size = 3,
  max_set_size = 500
)


# Resultats significatius
ORA_cachexia_SMPDB_sig <- ORA_cachexia_SMPDB %>%
  filter(p_adj < 0.05) %>%
  arrange(p_adj)

nrow(ORA_cachexia_SMPDB_sig)

## [1] 87

head(ORA_cachexia_SMPDB_sig, 10)

## # A tibble: 10 × 8
##    set_id   set_name n_selected_in_set n_in_set p_value overlap_features   p_adj
##    <chr>    <chr>                <int>    <int>   <dbl> <chr>              <dbl>
##  1 SMP0000… Nevirap…                10       41 1.09e-7 HMDB0000687;HMD… 3.42e-5
##  2 SMP0000… Azithro…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  3 SMP0000… Clarith…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  4 SMP0000… Clindam…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  5 SMP0000… Erythro…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  6 SMP0000… Roxithr…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  7 SMP0000… Telithr…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  8 SMP0000… Amikaci…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
##  9 SMP0000… Gentami…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
## 10 SMP0000… Kanamyc…                 7       19 4.63e-7 HMDB0000883;HMD… 3.42e-5
## # ℹ 1 more variable: fold_enrichment <dbl>

top_smpdb <- ORA_cachexia_SMPDB_sig %>% #  dplyr::slice(1:70) %>%
  mutate(set_name = factor(set_name, levels = rev(set_name)))


ggplot(top_smpdb, aes(x = set_name, y = -log10(p_adj))) +
  geom_col(fill = "#10B981") +   # color verd (diferent del KEGG)
  coord_flip() +
  labs(
    title = "Cachexia – SMPDB Enrichment (Top 20)",
    x = "SMPDB pathway",
    y = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

4.1.3 ORA for chemical classes

ORA_cachexia_Chem <- ora_test(
  eset         = ChemicalClassSet,
  selected     = cachexia_sig,
  background   = cachexia_background,
  test         = "fisher",
  p_adjust     = "BH",
  min_set_size = 1,
  max_set_size = 500
)

ORA_cachexia_Chem_sig <- ORA_cachexia_Chem %>%  filter(p_adj < 0.25) %>%
  arrange(p_adj)

nrow(ORA_cachexia_Chem_sig)

## [1] 12

head(ORA_cachexia_Chem_sig, 10)

## # A tibble: 10 × 8
##    set_id set_name n_selected_in_set n_in_set  p_value overlap_features    p_adj
##    <chr>  <chr>                <int>    <int>    <dbl> <chr>               <dbl>
##  1 "Amin… "Amino …                10       15 5.23e-13 HMDB0000687;HMD… 6.28e-12
##  2 "Orga… "Organi…                 8       24 1.59e- 7 HMDB0000232;HMD… 9.54e- 7
##  3 "Acyl… "Acylca…                 2        2 8.26e- 4 HMDB0000062;HMD… 3.30e- 3
##  4 "Amin… "Amino …                 2        4 4.77e- 3 HMDB0000267;HMD… 9.54e- 3
##  5 "Biog… "Biogen…                 2        4 4.77e- 3 HMDB0000562;HMD… 9.54e- 3
##  6 "Orga… "Organi…                 2        4 4.77e- 3 HMDB0000128;HMD… 9.54e- 3
##  7 "Amin… "Amino …                 2        7 1.58e- 2 HMDB0000479;HMD… 2.71e- 2
##  8 "Carb… "Carbox…                 1        1 2.90e- 2 HMDB0000072      4.34e- 2
##  9 "Nucl… "Nucleo…                 1        2 5.71e- 2 HMDB0000157      6.85e- 2
## 10 "Orga… "Organo…                 1        2 5.71e- 2 HMDB0000149      6.85e- 2
## # ℹ 1 more variable: fold_enrichment <dbl>

top_chem <- ORA_cachexia_Chem_sig %>% 
  dplyr::slice(1:10) %>%
  mutate(set_name = factor(set_name, levels = rev(set_name)))

ggplot(top_chem, aes(x = set_name, y = -log10(p_adj))) +
  geom_col(fill = "#F59E0B") +   # taronja per diferenciar
  coord_flip() +
  labs(
    title = "Cachexia – Chemical Class Enrichment (Top 20)",
    x = "Chemical class",
    y = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

4.2 ORA and MSEA for the ST000291 dataset

4.2.1 ORA for the ST000291 dataset

4.2.1.1 ORA Based on PubChem annotated results

## -------------------------------
## 1) ORA (Over-Representation Analysis)
## -------------------------------

ORA_ST291_KEGG <- ora_test(
  eset         = KEGGset_PubChem,
  selected     = ST291_sig,
  background   = ST291_background,
  test         = "fisher",
  p_adjust     = "none",
  min_set_size = 1,
  max_set_size = 500,
  quiet        = TRUE
)

# Resultats ordenats i amb nom net

ORA_ST291_KEGG_sig <- ORA_ST291_KEGG %>%
  filter(p_value < 0.05) %>%
  arrange(p_adj) %>%
  mutate(set_name_clean = clean_pathway_name(set_name))

# Vista ràpida dels top 10
head(ORA_ST291_KEGG_sig, 10)

## # A tibble: 2 × 9
##   set_id   set_name   n_selected_in_set n_in_set p_value overlap_features  p_adj
##   <chr>    <chr>                  <int>    <int>   <dbl> <chr>             <dbl>
## 1 hsa05208 Chemical …                 2        5  0.0277 996;241          0.0277
## 2 hsa00350 Tyrosine …                 3       13  0.0321 5816;996;780     0.0321
## # ℹ 2 more variables: fold_enrichment <dbl>, set_name_clean <chr>

## Barplot ORA – Top 20 pathways
top_ora <- ORA_ST291_KEGG_sig %>%
  dplyr::slice(1:20) %>%
  mutate(set_name_clean = factor(set_name_clean, levels = rev(set_name_clean)))

ggplot(top_ora, aes(x = set_name_clean, y = -log10(p_adj))) +
  geom_col(fill = "#3B82F6") +
  coord_flip() +
  labs(
    title = "ST000291 – KEGG ORA (Top 20 pathways)",
    x     = "KEGG pathway",
    y     = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

## Bubble plot ORA – Top 30
top_ora <- ORA_ST291_KEGG_sig %>%
  dplyr::slice(1:30)

ggplot(top_ora, aes(
  x = fold_enrichment,
  y = -log10(p_adj),
  size = n_selected_in_set,
  color = -log10(p_adj)
)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(low = "#8B5CF6", high = "#1E40AF") +
  labs(
    title = "ST000291 – KEGG ORA (Bubble plot, Top 30)",
    x     = "Fold enrichment",
    y     = "-log10(FDR)",
    size  = "# hits"
  ) +
  theme_minimal(base_size = 12)

4.2.1.2 ORA based on HMDB annotated results

## -------------------------------
## 1) ORA (Over-Representation Analysis)
## -------------------------------

ORA_ST291_HMDB_KEGG <- ora_test(
  eset         = KEGGset_HMDB,
  selected     = ST291_HMDB_sig,
  background   = ST291_HMDB_background,
  test         = "fisher",
  p_adjust     = "none",
  min_set_size = 1,
  max_set_size = 500,
  quiet        = TRUE
)

# Resultats ordenats i amb nom net

ORA_ST291_HMDB_KEGG_sig <- ORA_ST291_HMDB_KEGG %>%
  filter(p_value < 0.05) %>%
  arrange(p_adj) %>%
  mutate(set_name_clean = clean_pathway_name(set_name))

# Vista ràpida dels top 10
head(ORA_ST291_HMDB_KEGG_sig, 10)

## # A tibble: 5 × 9
##   set_id   set_name   n_selected_in_set n_in_set p_value overlap_features  p_adj
##   <chr>    <chr>                  <int>    <int>   <dbl> <chr>             <dbl>
## 1 hsa05208 Chemical …                 2        6  0.0280 HMDB0000228;HMD… 0.0280
## 2 hsa00350 Tyrosine …                 3       15  0.0283 HMDB0000068;HMD… 0.0283
## 3 hsa00982 Drug meta…                 2        7  0.0380 HMDB0060686;HMD… 0.0380
## 4 hsa04081 Hormone s…                 1        1  0.0465 HMDB0000068      0.0465
## 5 hsa04261 Adrenergi…                 1        1  0.0465 HMDB0000068      0.0465
## # ℹ 2 more variables: fold_enrichment <dbl>, set_name_clean <chr>

## Barplot ORA – Top 20 pathways
top_ora <- ORA_ST291_HMDB_KEGG_sig %>%
  dplyr::slice(1:20) %>%
  mutate(set_name_clean = factor(set_name_clean, levels = rev(set_name_clean)))

ggplot(top_ora, aes(x = set_name_clean, y = -log10(p_adj))) +
  geom_col(fill = "#3B82F6") +
  coord_flip() +
  labs(
    title = "ST000291 – KEGG ORA (Top 20 pathways)",
    x     = "KEGG pathway",
    y     = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

## Bubble plot ORA – Top 30
top_ora <- ORA_ST291_HMDB_KEGG_sig %>%
  dplyr::slice(1:30)

ggplot(top_ora, aes(
  x = fold_enrichment,
  y = -log10(p_adj),
  size = n_selected_in_set,
  color = -log10(p_adj)
)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(low = "#8B5CF6", high = "#1E40AF") +
  labs(
    title = "ST000291 – KEGG ORA (Bubble plot, Top 30)",
    x     = "Fold enrichment",
    y     = "-log10(FDR)",
    size  = "# hits"
  ) +
  theme_minimal(base_size = 12)

4.2.2 MSEA for the ST000291 dataset

MSEA, requires more information than ENrichment Analysis and indeed it also returns different results, even if performed on the same collection of sets.

The idea behind MSEA is that, even when we cannot find Enriched sets in a given list iot may be possible to observe some tendency among features belonging to a given sets to show similar behavior, this meaning that they tend to show a positive (or a negative) effect at the same time or even to show the same positive or negative correlation. When this is the case we describe thses sets as differentially enriched (Metabolite) sets.

The test to decide if a certain set is differentially enriched is based on observing if the scores of the metabolites in this set tend to vary at random or, instead they tend to be consistently high (consistently low).

Although the first versions oF MSEA based the decision on the distribution of the p-values, later methods have realized that it can alo be based on some effect size measure such as the t-statistic. the log-fold change or even the correlation.

## -------------------------------
## 2) MSEA / QEA (Quantitative Enrichment Analysis)
## -------------------------------
## ST291_scores ha de ser un vector numèric amb noms = PubChem IDs.
## Exemple d’estructura (NO s’executa):
##   ST291_scores <- setNames(topTable$logFC, topTable$PubChemCID)

# QEA/MSEA

## ============================
## Construir ST291_scores
## ============================

# Comprovem que PubChemCID és caracter i no factor
topCranVSBase <- topCranVSBase %>%
  dplyr::mutate(PubChemCID = as.character(PubChemCID))

# Vector numeric amb noms = PubChem
ST291_scores <- setNames(topCranVSBase$log2FC, topCranVSBase$PubChemCID)
ST291_scores <- ST291_scores[!is.na(ST291_scores)]
str(ST291_scores)

##  Named num [1:1358] 2.92 3.27 2.18 2.41 3.18 ...
##  - attr(*, "names")= chr [1:1358] "54678503" "71485" "94214" "5378303" ...

head(ST291_scores)

## 54678503    71485    94214  5378303     3037    10178 
## 2.916890 3.267987 2.180373 2.414795 3.179669 3.331534

stopifnot(is.numeric(ST291_scores))
stopifnot(length(ST291_scores) == length(names(ST291_scores)))
stopifnot(sum(is.na(ST291_scores)) == 0)

# Comprovar estructura
QEA_ST291_KEGG <- qea_test(
  eset            = KEGGset_PubChem,
  scores          = ST291_scores,
  nperm           = 1000,
  p_adjust        = "BH",
  min_set_size    = 3,
  max_set_size    = 500,
  return_profiles = FALSE,
  quiet           = TRUE
)

QEA_ST291_table <- QEA_ST291_KEGG$table %>%
  filter(!is.na(p_value)) %>%
  arrange(p_adj) %>%
  mutate(set_name_clean = clean_pathway_name(set_name))

# Vista ràpida dels top 10 (QEA)
head(QEA_ST291_table, 10)

##      set_id
## 1  hsa04923
## 2  hsa00440
## 3  hsa00120
## 4  hsa05207
## 5  hsa00040
## 6  hsa05208
## 7  hsa00480
## 8  hsa00982
## 9  hsa00310
## 10 hsa00330
##                                                                    set_name
## 1              Regulation of lipolysis in adipocytes - Homo sapiens (human)
## 2             Phosphonate and phosphinate metabolism - Homo sapiens (human)
## 3                     Primary bile acid biosynthesis - Homo sapiens (human)
## 4      Chemical carcinogenesis - receptor activation - Homo sapiens (human)
## 5           Pentose and glucuronate interconversions - Homo sapiens (human)
## 6  Chemical carcinogenesis - reactive oxygen species - Homo sapiens (human)
## 7                             Glutathione metabolism - Homo sapiens (human)
## 8                  Drug metabolism - cytochrome P450 - Homo sapiens (human)
## 9                                 Lysine degradation - Homo sapiens (human)
## 10                   Arginine and proline metabolism - Homo sapiens (human)
##    n_in_set         ES direction p_value     p_adj
## 1         4  0.7215657  positive   0.005 0.2550000
## 2         4  0.6809453  positive   0.010 0.2550000
## 3         3  0.7660517  positive   0.012 0.2550000
## 4         4  0.6399557  positive   0.015 0.2550000
## 5         4  0.6676514  positive   0.017 0.2550000
## 6         5  0.5461936  positive   0.025 0.2892857
## 7         6 -0.5342702  negative   0.027 0.2892857
## 8         3 -0.6863469  negative   0.033 0.3093750
## 9        10  0.3632047  positive   0.047 0.3675000
## 10       13 -0.3286817  negative   0.049 0.3675000
##                                       set_name_clean
## 1              Regulation of lipolysis in adipocytes
## 2             Phosphonate and phosphinate metabolism
## 3                     Primary bile acid biosynthesis
## 4      Chemical carcinogenesis - receptor activation
## 5           Pentose and glucuronate interconversions
## 6  Chemical carcinogenesis - reactive oxygen species
## 7                             Glutathione metabolism
## 8                  Drug metabolism - cytochrome P450
## 9                                 Lysine degradation
## 10                   Arginine and proline metabolism

## Barplot QEA – Top 20 pathways
top20_qea <- QEA_ST291_table %>%
  dplyr::slice(1:20) %>%
  mutate(set_name_clean = factor(set_name_clean, levels = rev(set_name_clean)))

ggplot(top20_qea, aes(x = set_name_clean, y = -log10(p_adj))) +
  geom_col(fill = "#EF4444") +
  coord_flip() +
  labs(
    title = "ST000291 – KEGG QEA/MSEA (Top 20 pathways)",
    x     = "KEGG pathway",
    y     = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

## Bubble plot QEA – Top 30 (ES vs -log10(FDR))
top30_qea <- QEA_ST291_table %>%
  dplyr::slice(1:30)

ggplot(top30_qea, aes(
  x = ES,
  y = -log10(p_adj),
  size = n_in_set,
  color = direction
)) +
  geom_point(alpha = 0.8) +
  scale_color_manual(values = c(positive = "#22C55E", negative = "#F97316")) +
  labs(
    title = "ST000291 – KEGG QEA/MSEA (Bubble plot, Top 30)",
    x     = "Enrichment score (ES)",
    y     = "-log10(FDR)",
    size  = "Set size",
    color = "Direction"
  ) +
  theme_minimal(base_size = 12)

4.3 ORA and MSEA for the EnrichMet Example

4.3.1 ORA for the EnrichMet Example

## -------------------------------
## 1) ORA (Over-Representation Analysis)
## -------------------------------

ORA_EMex_KEGG <- ora_test(
  eset         = KEGGset_KEGG,
  selected     = EM_sig,
  background   = EM_background,
  test         = "fisher",
  p_adjust     = "none",
  min_set_size = 1,
  max_set_size = 500,
  quiet        = TRUE
)

# Resultats ordenats i amb nom net

ORA_EMex_KEGG_sig <- ORA_EMex_KEGG %>%
  filter(p_value < 0.05) %>%
  arrange(p_adj) %>%
  mutate(set_name_clean = clean_pathway_name(set_name))

# Vista ràpida dels top 10
head(ORA_EMex_KEGG_sig, 10)

## # A tibble: 2 × 9
##   set_id set_name    n_selected_in_set n_in_set p_value overlap_features   p_adj
##   <chr>  <chr>                   <int>    <int>   <dbl> <chr>              <dbl>
## 1 PW_45  Biosynthes…                 7       11 0.00314 C16527;C00249;C… 0.00314
## 2 PW_106 Fatty acid…                 4        7 0.0463  C00249;C06424;C… 0.0463 
## # ℹ 2 more variables: fold_enrichment <dbl>, set_name_clean <chr>

## Barplot ORA – Top 20 pathways
top_ora <- ORA_EMex_KEGG_sig %>%
  dplyr::slice(1:20) %>%
  mutate(set_name_clean = factor(set_name_clean, levels = rev(set_name_clean)))

ggplot(top_ora, aes(x = set_name_clean, y = -log10(p_adj))) +
  geom_col(fill = "#3B82F6") +
  coord_flip() +
  labs(
    title = "ST000291 – KEGG ORA (Top 20 pathways)",
    x     = "KEGG pathway",
    y     = "-log10(FDR)"
  ) +
  theme_minimal(base_size = 12)

## Bubble plot ORA – Top 30
top_ora <- ORA_EMex_KEGG_sig %>%
  dplyr::slice(1:30)

ggplot(top_ora, aes(
  x = fold_enrichment,
  y = -log10(p_adj),
  size = n_selected_in_set,
  color = -log10(p_adj)
)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(low = "#8B5CF6", high = "#1E40AF") +
  labs(
    title = "ST000291 – KEGG ORA (Bubble plot, Top 30)",
    x     = "Fold enrichment",
    y     = "-log10(FDR)",
    size  = "# hits"
  ) +
  theme_minimal(base_size = 12)

4.3.2 MSEA for the EnrichMet Example

First we build a “scores” vector with the effect sizes, named by the Metabolites names

head(EM_summaryStats)

## # A tibble: 6 × 4
##   met_id             pval    padj log2fc
##   <chr>             <dbl>   <dbl>  <dbl>
## 1 C02721        0.0000124 0.00315  -2.32
## 2 C03736        0.0000694 0.00739   2.63
## 3 C03139        0.0000869 0.00739   1.75
## 4 C00074        0.000120  0.00763  -3.22
## 5 C02862|       0.000198  0.00819  -1.70
## 6 C00092|C00275 0.000206  0.00819   1.31

EM_scores <- EM_summaryStats$log2fc
names(EM_scores) <- EM_summaryStats$met_id
head(EM_scores)

##        C02721        C03736        C03139        C00074       C02862| 
##     -2.324807      2.634161      1.748349     -3.219306     -1.697477 
## C00092|C00275 
##      1.308356

This vector is passed to the MSEA function

# Comprovar estructura
QEA_EMex_KEGG <- qea_test(
  eset            = KEGGset_KEGG,
  scores          = EM_scores,
  nperm           = 1000,
  p_adjust        = "BH",
  min_set_size    = 3,
  max_set_size    = 500,
  return_profiles = FALSE,
  quiet           = TRUE
)

QEA_EMex_table <- QEA_EMex_KEGG$table %>%
  filter(!is.na(p_value)) %>%
  arrange(p_adj) %>%
  mutate(set_name_clean = clean_pathway_name(set_name))

# Vista ràpida dels top 10 (QEA)
head(QEA_EMex_table, 10)

##    set_id                                 set_name n_in_set         ES
## 1  PW_111                              Ferroptosis        6 -0.7911647
## 2  PW_269                    Pyrimidine metabolism       15 -0.4583333
## 3  PW_263         Protein digestion and absorption       13 -0.4370629
## 4   PW_16              Aminoacyl-tRNA biosynthesis       11 -0.4165425
## 5   PW_45  Biosynthesis of unsaturated fatty acids       11 -0.4076006
## 6  PW_239 Pentose and glucuronate interconversions        5 -0.5360000
## 7   PW_59      Central carbon metabolism in cancer       15 -0.3333333
## 8    PW_1                         ABC transporters       24 -0.2581169
## 9  PW_320                Thyroid hormone synthesis        3 -0.6666667
## 10 PW_326                    Tryptophan metabolism        5 -0.5520000
##    direction p_value     p_adj                           set_name_clean
## 1   negative   0.000 0.0000000                              Ferroptosis
## 2   negative   0.001 0.0355000                    Pyrimidine metabolism
## 3   negative   0.010 0.2351875         Protein digestion and absorption
## 4   negative   0.022 0.2351875              Aminoacyl-tRNA biosynthesis
## 5   negative   0.030 0.2351875  Biosynthesis of unsaturated fatty acids
## 6   negative   0.034 0.2351875 Pentose and glucuronate interconversions
## 7   negative   0.034 0.2351875      Central carbon metabolism in cancer
## 8   negative   0.035 0.2351875                         ABC transporters
## 9   negative   0.037 0.2351875                Thyroid hormone synthesis
## 10  negative   0.038 0.2351875                    Tryptophan metabolism

5 Enrichmet

5.1 ORA for the Enrichmet Example data

enrichmet provides a wrapper function to run all analysis in one step:

enrichmetResults <- enrichmet(
  inputMetabolites = EM_inputMetabolites,
  PathwayVsMetabolites = EM_PathwayVsMetabolites,
  example_data =EM_summaryStats, 
  kegg_lookup = EM_kegg_lookup,
  top_n = 100,
  p_value_cutoff = 1
)

## Running pathway enrichment analysis...

## Running GSEA analysis...

##   |                                                                              |                                                                      |   0%  |                                                                              |=======================                                               |  33%  |                                                                              |===============================================                       |  67%  |                                                                              |======================================================================| 100%

## Running centrality analysis...

## Generating metabolite-pathway network visualization...

## Generating enrichment heatmap...

## Generating pathway membership plot...

## Generating STITCH interaction network...

## Warning in create_interaction_plot(inputMetabolites, mapping_df, stitch_df, :
## STITCH interaction analysis requested but mapping_df or stitch_df not provided.

names(enrichmetResults)

##  [1] "pathway_enrichment_results" "pathway_plot"              
##  [3] "impact_plot"                "gsea_results"              
##  [5] "gsea_plot"                  "metabolite_centrality"     
##  [7] "rbc_plot"                   "network_plot"              
##  [9] "heatmap_plot"               "membership_plot"

ORAEM_EMex_KEGG <- enrichmetResults$pathway_enrichment_results
MSEAEMex_KEGG <- enrichmetResults$gsea_results

6 ChemRich (outline)

ChemRich performs enrichment based on chemical similarity clusters, not predefined pathways. It typically requires:

effect size (e.g. log2FC),
p-values,
a structural identifier (InChIKey, SMILES, etc.).

Since ChemRich is provided as a web tool, the usual workflow is:

Prepare a CSV with at least ID, log2FC and p-value.
Upload to ChemRich web.
Export the results table.
Re-import into R for comparison.

7 Comparison of results

7.1 ORA Comparison

It is straightforward to realize that even with the same input, the results of the different pacakges differ:

We use as inputs:

show(EM_inputMetabolites)

##  [1] "C02721" "C03736" "C03139" "C00074" "C02862" "C00092" "C00275" "C00386"
##  [9] "C10172" "C00361" "C00242" "C19806" "C16358" "C16357" "C16353" "C02140"
## [17] "C05488" "C01181" "C03017" "C02632" "C00246" "C00079" "C08317" "C00588"
## [25] "C16527" "C00065" "C00249" "C00836" "C02990" "C03690" "C04025" "C06424"
## [33] "C00123" "C00606" "C03067" "C00539" "C00633" "C00519" "C00366" "C00148"
## [41] "C00219" "C08261" "C00385" "C06429" "C05637" "C00295" "C06428" "C16513"
## [49] "C01909" "C03299" "C01530" "C01571" "C08278" "C00599" "C15587" "C00144"
## [57] "C03739"

show(EM_background)

##   [1] "C02721"                             
##   [2] "C03736"                             
##   [3] "C03139"                             
##   [4] "C00074"                             
##   [5] "C02862|"                            
##   [6] "C00092|C00275"                      
##   [7] "C00386"                             
##   [8] "C10172"                             
##   [9] "C00361"                             
##  [10] "C00242"                             
##  [11] "C19806"                             
##  [12] "C16358|C16357|C16353"               
##  [13] "C02140|C05488"                      
##  [14] "C01181"                             
##  [15] "C03017"                             
##  [16] "C02632|C00246"                      
##  [17] "C00079"                             
##  [18] "C08317|"                            
##  [19] "C00588"                             
##  [20] "C16527"                             
##  [21] "C00065"                             
##  [22] "C00249"                             
##  [23] "C00836"                             
##  [24] "C02990"                             
##  [25] "C03690|"                            
##  [26] "C04025"                             
##  [27] "C06424"                             
##  [28] "C00123"                             
##  [29] "C00606"                             
##  [30] "C03067|C00539|C00633"               
##  [31] "C00519"                             
##  [32] "C00366"                             
##  [33] "C00148"                             
##  [34] "C00219"                             
##  [35] "C08261"                             
##  [36] "C00385"                             
##  [37] "C06429"                             
##  [38] "C05637"                             
##  [39] "C00295"                             
##  [40] "C06428"                             
##  [41] "C16513"                             
##  [42] "C01909"                             
##  [43] "C03299"                             
##  [44] "C01530"                             
##  [45] "C01571"                             
##  [46] "C08278"                             
##  [47] "C00599"                             
##  [48] "C15587"                             
##  [49] "C00144"                             
##  [50] "C03739"                             
##  [51] "C06423"                             
##  [52] "C00847"                             
##  [53] "C00105"                             
##  [54] "C00026"                             
##  [55] "C00082"                             
##  [56] "||C18218"                           
##  [57] "C00407|C00123|C01933"               
##  [58] "C00051"                             
##  [59] "C06426"                             
##  [60] "C00879"                             
##  [61] "C00183||C01799"                     
##  [62] "C01083|C00243|C00089|C00208|C06422|"
##  [63] "C01216"                             
##  [64] "C00245"                             
##  [65] "C01062"                             
##  [66] "C01157"                             
##  [67] "C00106"                             
##  [68] "C00035"                             
##  [69] "C00805"                             
##  [70] "C00263"                             
##  [71] "C05842"                             
##  [72] "C02656"                             
##  [73] "C01601"                             
##  [74] "C01727"                             
##  [75] "C00022"                             
##  [76] "C00152"                             
##  [77] "C01987"                             
##  [78] "C07086"                             
##  [79] "C00188|C00263|C05519"               
##  [80] "C00299|"                            
##  [81] "C16512"                             
##  [82] "C03045|"                            
##  [83] "C00127"                             
##  [84] "C03242"                             
##  [85] "C00122"                             
##  [86] "C00140|C05021"                      
##  [87] "C00755|C03663|C12064"               
##  [88] "C01685"                             
##  [89] "|C00043"                            
##  [90] "C01104"                             
##  [91] "C00300"                             
##  [92] "|C05984"                            
##  [93] "C08362"                             
##  [94] "C17714"                             
##  [95] "C00041"                             
##  [96] "C03547"                             
##  [97] "C10833"                             
##  [98] "C00475"                             
##  [99] "C01879"                             
## [100] "C05472"                             
## [101] "C02710"                             
## [102] "C01586"                             
## [103] "C08356|C00137|C00936|C01487||C19891"
## [104] "C00212"                             
## [105] "C00418"                             
## [106] "C00158|C00311"                      
## [107] "C14829"                             
## [108] "C01481"                             
## [109] "C00380"                             
## [110] "C00015"                             
## [111] "C00530"                             
## [112] "C00587"                             
## [113] "C00299"                             
## [114] "C01046"                             
## [115] "C00153"                             
## [116] "C05635"                             
## [117] "C00525"                             
## [118] "C00715"                             
## [119] "C00624"                             
## [120] "C00328"                             
## [121] "C00314"                             
## [122] "C15923|C08356|C00936|C00267|C01487" 
## [123] "C03736|C00231"                      
## [124] "C02067"                             
## [125] "C00570"                             
## [126] "C00645|C00140|C05021"               
## [127] "C00881"                             
## [128] "C00049"                             
## [129] "C00093|C02979"                      
## [130] "C19910"                             
## [131] "C00257"                             
## [132] "C00048"                             
## [133] "C02353|C00575"                      
## [134] "C00581"                             
## [135] "C19615"                             
## [136] "C04148"                             
## [137] "C22040"                             
## [138] "C03406"                             
## [139] "C01595"                             
## [140] "C04294"                             
## [141] "C00539"                             
## [142] "C05378"                             
## [143] "C00318"                             
## [144] "C05629"                             
## [145] "C11439¬†"                           
## [146] "C00408"                             
## [147] "C01005"                             
## [148] "C00006"                             
## [149] "C00008"                             
## [150] "C00387"                             
## [151] "C00557"                             
## [152] "C05122"                             
## [153] "C00025|C00979"                      
## [154] "C01412"                             
## [155] "C02953"                             
## [156] "C00099"                             
## [157] "C01047"                             
## [158] "C00506"                             
## [159] "C00213|C00041|C00133"               
## [160] "C01234"                             
## [161] "C00597"                             
## [162] "||C06104"                           
## [163] "C01384"                             
## [164] "C00334"                             
## [165] "C02180"                             
## [166] "C12989"                             
## [167] "C00345"                             
## [168] "C04083"                             
## [169] "C01035"                             
## [170] "C02642"                             
## [171] "C00301"                             
## [172] "C02376"                             
## [173] "C05135"                             
## [174] "C00181|C02479"                      
## [175] "C03722"                             
## [176] "C00163"                             
## [177] "C00064"                             
## [178] "C00491"                             
## [179] "C00149"                             
## [180] "C00455"                             
## [181] "C00429"                             
## [182] "C00116"                             
## [183] "C00111"                             
## [184] "C00055"                             
## [185] "C03739|C01585"                      
## [186] "C19501"                             
## [187] "C01239"                             
## [188] "C00191"                             
## [189] "C03196"                             
## [190] "C02714"                             
## [191] "C00209"                             
## [192] "C01161"                             
## [193] "C16526"                             
## [194] "C00446|C00275||C00103"              
## [195] "C00003"                             
## [196] "C00109"                             
## [197] "C00170"                             
## [198] "C00476|C01508"                      
## [199] "C06470"                             
## [200] "C00956"                             
## [201] "C05942"                             
## [202] "C00188|C05519"                      
## [203] "C02918"                             
## [204] "C01494"                             
## [205] "C03570"                             
## [206] "C00951"                             
## [207] "C02835"                             
## [208] "C01019|C07326"                      
## [209] "C00130"                             
## [210] "C01762"                             
## [211] "C02240"                             
## [212] "C01697|C00392|C00794"               
## [213] "C00166"                             
## [214] "C00213"                             
## [215] "C00255"                             
## [216] "|C01712|C00712"                     
## [217] "C00258"                             
## [218] "C00550"                             
## [219] "C11512"                             
## [220] "C00186|C02154"                      
## [221] "C01262"                             
## [222] "C00253"                             
## [223] "C00236"                             
## [224] "C00671|C00233"                      
## [225] "C00791"                             
## [226] "C00037"                             
## [227] "C00777"                             
## [228] "C00294"                             
## [229] "C04067"                             
## [230] "C00020"                             
## [231] "C00262"                             
## [232] "C01042"                             
## [233] "C00147"                             
## [234] "C02305"                             
## [235] "C01004"                             
## [236] "C00327"                             
## [237] "C00379|C00474|C00532"               
## [238] "C02679"                             
## [239] "C07588"                             
## [240] "C01551"                             
## [241] "C00378"                             
## [242] "C00695"                             
## [243] "|C00803"                            
## [244] "|C00141"                            
## [245] "C03519|"                            
## [246] "C00864"                             
## [247] "C03761"                             
## [248] "C02946"                             
## [249] "C00346"                             
## [250] "C00785"                             
## [251] "C06231"                             
## [252] "C00969"                             
## [253] "C00157"                             
## [254] "C01620"                             
## [255] "C00042|C02170"

extended_background <- unique(unlist(KEGGset_KEGG@data$feature_list))
length(extended_background)

## [1] 3035

dim(EM_PathwayVsMetabolites)

## [1] 351   2

dim(KEGGset_KEGG@data)

## [1] 351   4

For instance, when ORA is performed with localEnrichment:

ORA_EMex_KEGG <- ora_test(
  eset         = KEGGset_KEGG,
  selected     = EM_inputMetabolites,
  background   = extended_background,
  test         = "fisher",
  p_adjust     = "none",
  min_set_size = 1,
  max_set_size = 500,
  quiet        = TRUE
)

head(ORA_EMex_KEGG, 10)

## # A tibble: 10 × 8
##    set_id set_name   n_selected_in_set n_in_set p_value overlap_features   p_adj
##    <chr>  <chr>                  <int>    <int>   <dbl> <chr>              <dbl>
##  1 PW_59  Central c…                 6       37 4.98e-5 C00074;C00092;C… 4.98e-5
##  2 PW_263 Protein d…                 6       47 1.99e-4 C02632;C00246;C… 1.99e-4
##  3 PW_45  Biosynthe…                 7       74 3.83e-4 C16527;C00249;C… 3.83e-4
##  4 PW_51  Caffeine …                 4       22 6.36e-4 C16358;C16357;C… 6.36e-4
##  5 PW_195 Mineral a…                 4       29 1.87e-3 C00079;C00065;C… 1.87e-3
##  6 PW_16  Aminoacyl…                 4       52 1.55e-2 C00079;C00065;C… 1.55e-2
##  7 PW_106 Fatty aci…                 4       58 2.24e-2 C00249;C06424;C… 2.24e-2
##  8 PW_274 Regulatio…                 2       14 2.73e-2 C02140;C00219    2.73e-2
##  9 PW_298 Sphingoli…                 2       15 3.11e-2 C00065;C00836    3.11e-2
## 10 PW_268 Purine me…                 5      101 3.94e-2 C00361;C00242;C… 3.94e-2
## # ℹ 1 more variable: fold_enrichment <dbl>

And when it is done with enrichmet:

ORAEM_EMex_KEGG <- perform_enrichment_analysis(
  EM_inputMetabolites,
  EM_PathwayVsMetabolites,
  top_n = 100,
  p_value_cutoff = 1
)

head(ORAEM_EMex_KEGG,20)

##                                   Pathway      P_value Log_P_value     Impact
## 1     Central carbon metabolism in cancer 4.817117e-05    4.317213 0.05461853
## 2        Protein digestion and absorption 1.925909e-04    3.715364 0.06838474
## 3 Biosynthesis of unsaturated fatty acids 3.699903e-04    3.431810 0.70197902
## 4                     Caffeine metabolism 6.220368e-04    3.206184 0.09487611
## 5                      Mineral absorption 1.832389e-03    2.736982 0.09229047
## 6             Aminoacyl-tRNA biosynthesis 1.519901e-02    1.818185 0.09400822
##     Coverage Count Adjusted_P_value      Q_value
## 1 0.16216216     6       0.01690808 0.0003367180
## 2 0.12765957     6       0.03379971 0.0006731084
## 3 0.09459459     7       0.04328886 0.0008620812
## 4 0.18181818     4       0.05458373 0.0010870143
## 5 0.13793103     4       0.12863372 0.0025616917
## 6 0.07692308     4       0.88914232 0.0177069319

7.2 MSEA comparison

Similarly we have

From localEnrichment

head(QEA_EMex_table, 20)

##    set_id                                 set_name n_in_set         ES
## 1  PW_111                              Ferroptosis        6 -0.7911647
## 2  PW_269                    Pyrimidine metabolism       15 -0.4583333
## 3  PW_263         Protein digestion and absorption       13 -0.4370629
## 4   PW_16              Aminoacyl-tRNA biosynthesis       11 -0.4165425
## 5   PW_45  Biosynthesis of unsaturated fatty acids       11 -0.4076006
## 6  PW_239 Pentose and glucuronate interconversions        5 -0.5360000
## 7   PW_59      Central carbon metabolism in cancer       15 -0.3333333
## 8    PW_1                         ABC transporters       24 -0.2581169
## 9  PW_320                Thyroid hormone synthesis        3 -0.6666667
## 10 PW_326                    Tryptophan metabolism        5 -0.5520000
## 11  PW_81       Cysteine and methionine metabolism       13 -0.3684043
## 12 PW_305                        Sulfur metabolism        3 -0.6746032
## 13  PW_85                  D-Amino acid metabolism       13 -0.3296249
## 14  PW_51                      Caffeine metabolism        3 -0.6587302
## 15 PW_268                        Purine metabolism       20 -0.2680851
## 16 PW_311       Taurine and hypotaurine metabolism        8 -0.3977733
## 17 PW_178                   Lipoic acid metabolism        4 -0.5507968
## 18 PW_195                       Mineral absorption        8 -0.3977733
## 19  PW_86                  Diabetic cardiomyopathy        6 -0.4397590
## 20 PW_236                       Pathways in cancer        4  0.5179283
##    direction p_value     p_adj                           set_name_clean
## 1   negative   0.000 0.0000000                              Ferroptosis
## 2   negative   0.001 0.0355000                    Pyrimidine metabolism
## 3   negative   0.010 0.2351875         Protein digestion and absorption
## 4   negative   0.022 0.2351875              Aminoacyl-tRNA biosynthesis
## 5   negative   0.030 0.2351875  Biosynthesis of unsaturated fatty acids
## 6   negative   0.034 0.2351875 Pentose and glucuronate interconversions
## 7   negative   0.034 0.2351875      Central carbon metabolism in cancer
## 8   negative   0.035 0.2351875                         ABC transporters
## 9   negative   0.037 0.2351875                Thyroid hormone synthesis
## 10  negative   0.038 0.2351875                    Tryptophan metabolism
## 11  negative   0.039 0.2351875       Cysteine and methionine metabolism
## 12  negative   0.043 0.2351875                        Sulfur metabolism
## 13  negative   0.046 0.2351875                  D-Amino acid metabolism
## 14  negative   0.048 0.2351875                      Caffeine metabolism
## 15  negative   0.052 0.2351875                        Purine metabolism
## 16  negative   0.053 0.2351875       Taurine and hypotaurine metabolism
## 17  negative   0.065 0.2642778                   Lipoic acid metabolism
## 18  negative   0.067 0.2642778                       Mineral absorption
## 19  negative   0.081 0.2716522                  Diabetic cardiomyopathy
## 20  positive   0.082 0.2716522                       Pathways in cancer

From enrichmet

MSEAEMex_KEGG

##                                     pathway        pval      padj    log2err
##                                      <char>       <num>     <num>      <num>
##  1:     Central_carbon_metabolism_in_cancer 0.002924586 0.0409442 0.43170770
##  2:             Aminoacyl-tRNA_biosynthesis 0.015148521 0.1060396 0.38073040
##  3:        Protein_digestion_and_absorption 0.028466548 0.1073712 0.35248786
##  4:                 D-Amino_acid_metabolism 0.030677475 0.1073712 0.35248786
##  5:                   Pyrimidine_metabolism 0.108256881 0.2663717 0.18302394
##  6: Biosynthesis_of_unsaturated_fatty_acids 0.127240143 0.2663717 0.16565670
##  7:      Cysteine_and_methionine_metabolism 0.136029412 0.2663717 0.16197895
##  8:                        ABC_transporters 0.152212389 0.2663717 0.14920754
##  9:         Arginine_and_proline_metabolism 0.215073529 0.3345588 0.12563992
## 10: Glyoxylate_and_dicarboxylate_metabolism 0.323159785 0.4164866 0.09754492
## 11:                       Purine_metabolism 0.327239488 0.4164866 0.09787733
## 12:                 beta-Alanine_metabolism 0.496629213 0.5794007 0.08578444
## 13:  Nicotinate_and_nicotinamide_metabolism 0.948623853 0.9838420 0.04415231
## 14: Neuroactive_ligand-receptor_interaction 0.983842011 0.9838420 0.04148804
##             ES        NES  size  leadingEdge input_count
##          <num>      <num> <int>       <list>       <int>
##  1: -0.6593812 -1.8162068    15 C00074, ....          10
##  2: -0.6465586 -1.6413875    11 C00079, ....           7
##  3: -0.5957926 -1.5763897    13 C00079, ....           7
##  4: -0.5925153 -1.5677183    13 C00079, ....           7
##  5: -0.4947670 -1.3627917    15 C00295, ....          12
##  6: -0.5322925 -1.3513056    11 C16527, ....           7
##  7: -0.5020263 -1.3282962    13 C00065, ....           6
##  8: -0.4147425 -1.2938273    24 C00079, ....          12
##  9: -0.4635643 -1.2265309    13 C00148, ....           6
## 10: -0.4617537 -1.1332444    10 C00065, ....           3
## 11: -0.3721362 -1.1189039    20 C00242, ....           5
## 12:  0.3765549  0.9559418    10       C00386           1
## 13: -0.2181045 -0.6007493    15 C05842, ....           8
## 14: -0.2010963 -0.4935342    10 C00245, ....           5

intersect(QEA_EMex_table$set_name, MSEAEMex_KEGG$pathway)

## character(0)

8 Apendix I Methods and packages explained

8.1 Over-Representation Analysis (ORA)

8.1.1 Concept

Over-Representation Analysis (ORA) is a classical approach used to identify biological sets (such as pathways, chemical classes, or ontological categories) that are statistically enriched in a predefined list of significant features, typically metabolites or genes.

The key idea is to test whether a particular set contains more significant features than expected by chance, given a specified background universe.

For each set, a contingency table is constructed:

	In Set	Not in Set	Total
Selected (hits)	k	K - k	K
Not selected	m - k	N - m - K + k	N - K

where: - N = total number of features in the background universe
- K = number of selected (significant) features
- m = number of features in the given set
- k = number of selected features that belong to that set

A statistical test (typically Fisher’s exact test or the hypergeometric test) is then applied to compute the probability of observing at least k selected features in the set under a random sampling model.

The resulting p-values are adjusted for multiple testing (e.g., Benjamini–Hochberg FDR correction), and the fold enrichment is often reported as:

\[ \text{Fold Enrichment} = \frac{k/K}{m/N} \]

indicating how much more represented the set is among selected features than expected by chance.

8.2 Implementation in `localEnrichment::ora_test()`

The function ora_test() implements this classical ORA workflow in a flexible and reproducible way using the EnrichmentSet object as input. It can be applied to any type of feature-to-set mapping, including pathways, metabolite classes, or ontological categories.

8.2.1 Main Steps

Validation of Inputs
- The function checks that the input eset is a valid EnrichmentSet object, containing the mapping between sets and features.
- The argument selected must be a character vector of feature identifiers.
- If no background is provided, the union of all features present in the EnrichmentSet is used as the background universe.
Iterating Over Sets
- For each set in the EnrichmentSet, the overlap with the selected features is computed.
- Sets outside the specified size range (min_set_size and max_set_size) are skipped.
- Depending on the argument test, either Fisher’s exact test or the hypergeometric test is applied to compute the p-value of enrichment.
Result Assembly
- For each significant overlap, the function records:
  - set_id and set_name
  - number of selected features in the set (n_selected_in_set)
  - total number of features in the set (n_in_set)
  - raw and adjusted p-values (p_value, p_adj)
  - the fold enrichment ratio
  - the list of overlapping feature identifiers (overlap_features)
Output
- Results are returned as a tibble with one row per set.
- The class "EnrichmentResult" is assigned to the output for consistency with downstream visualization or summary functions.
- Metadata from the original EnrichmentSet is preserved as an attribute.

8.2.2 Key Arguments

Argument	Description
`eset`	An `EnrichmentSet` object defining sets and their associated features.
`selected`	Character vector of feature IDs to test for enrichment.
`background`	Optional universe of features; if `NULL`, all unique features in `eset` are used.
`test`	Either `"fisher"` (default) or `"hypergeom"`.
`p_adjust`	Multiple testing correction method (default: `"BH"`).
`min_set_size`, `max_set_size`	Filters for valid set sizes (default: 3–500).
`quiet`	If `FALSE`, displays progress messages.

8.2.3 Output Structure

The function returns a tibble with columns:

Column	Description
`set_id`	Identifier of the set.
`set_name`	Descriptive name of the set.
`n_selected_in_set`	Number of selected features in the set.
`n_in_set`	Total number of features in the set.
`p_value`, `p_adj`	Raw and adjusted p-values.
`fold_enrichment`	Ratio of observed vs. expected proportion of hits.
`overlap_features`	Semicolon-separated list of overlapping feature IDs.

8.2.4 Example

# Define a small example enrichment mapping
meta <- list(
  mapping_name = "toy_example",
  feature_id_type = "HMDB_ID",
  feature_species = "Homo sapiens",
  set_source = "ExampleDB",
  version = "v1",
  description = "Toy example for ORA"
)

df <- data.frame(
  set_id = c("A", "B"),
  set_name = c("Pathway A", "Pathway B"),
  feature_ids = c("x1;x2;x3;x4", "x3;x4;x5;x6")
)

eset <- EnrichmentSet(df, meta)
selected <- c("x1", "x3", "x6")

# Run ORA
ora_test(eset, selected)

8.3 Quantitative Enrichment Analysis (QEA / MSEA)

8.3.1 Concept

Quantitative Enrichment Analysis (QEA), also known as Metabolite Set Enrichment Analysis (MSEA), is the analogue of GSEA for metabolomics and other continuous-valued omics data. Instead of dichotomizing features into “significant/non-significant,” QEA uses the entire ranked list of features, each with an associated score such as:

log fold change
t-statistic
correlation coefficient
loading from PCA/MFA/PLS
any quantity that defines an ordering of features

For each metabolite set, QEA tests whether its members tend to appear towards the top or bottom of the ranked list more often than expected by chance.

This avoids arbitrary cutoffs and provides sensitivity to subtle but consistent shifts across a pathway or chemical class.

8.4 Which MSEA variant is implemented here?

There are several variants of MSEA/QEA in the literature:

ORA-like MSEA (MetaboAnalyst’s early implementation):
uses Fisher tests on selected metabolites → not what we use here.
SSPA / Mean/median tests (subset-based statistics):
compares mean scores inside vs. outside a set → not used here.
Correlation-based global tests (e.g., GlobalTest):
model score–membership associations → not used here.
GSEA-style running-sum statistic (Subramanian et al., PNAS 2005):
uses an enrichment score based on a weighted random walk and permutation testing: this is the method implemented in qea_test().

8.4.1 localEnrichment implementss a GSEA-style running-score MSEA, adapted to metabolomics.

Specifically:

The ranked list is used without thresholding.
For each set, a running enrichment profile is computed:
- hits contribute +1/Nh
- misses contribute −1/(N − Nh)
The Enrichment Score (ES) is the maximum (positive enrichment) or minimum (negative enrichment) deviation from zero.
Significance is evaluated by permuting feature positions, exactly as in classical GSEA.

This is the most statistically robust form of QEA/MSEA and is the same approach used in modern tools such as fgsea, clusterProfiler::GSEA(), and the updated MetaboAnalyst 5.0 “QEA” module.

8.5 What the function `qea_test()` does

8.5.1 1. Input and validation

The function receives:
- an EnrichmentSet mapping sets → features
- a named numeric vector of scores (one per feature)
Scores are sorted in decreasing order, defining the ranked universe \(f_1, f_2, \dots, f_N\).
Only sets whose features intersect with this ranking are tested.

8.5.2 2. Running-score statistic (GSEA-style)

For each set:

Identify the indices of its features in the ranked list.
Let:
- \(N\) = total number of features
- \(N_h\) = number of hits (features in the set)
Construct the running-score profile:

\[ RS(i) = \sum_{j=1}^i \left( \frac{\mathbf{1}_{\text{hit}(j)}}{N_h} - \frac{\mathbf{1}_{\text{miss}(j)}}{N - N_h} \right) \]

Define the Enrichment Score:
- ES = max(RS) for positively enriched sets
- ES = min(RS) for negatively enriched sets

This mirrors the original GSEA algorithm without weighting (i.e., weight \(p = 0\)), which is appropriate for metabolomics where effect-size scales are not comparable across features.

8.5.3 3. Permutation testing

If nperm > 0, the function computes empirical p-values:

For each permutation:
- randomly reassign the hit positions
- recompute the ES
Compare observed ES to the null distribution:
- \(p = P(\text{ES}_{\text{perm}} \ge \text{ES}_{\text{obs}})\) for positive sets
- \(p = P(\text{ES}_{\text{perm}} \le \text{ES}_{\text{obs}})\) for negative sets

Multiple testing correction is applied via p.adjust().

If nperm = 0, p-values are returned as NA.

8.5.4 4. Output table

The function returns a list with:

8.5.4.1 (A) `table`

One row per tested set, containing:

Column	Meaning
`set_id`	Set identifier
`set_name`	Descriptive name
`n_in_set`	Number of overlapping features
`ES`	Enrichment Score
`direction`	`"positive"` or `"negative"`
`p_value`	Empirical p-value
`p_adj`	Adjusted p-value

Rows are sorted by p_value.

8.5.4.2 (B) `profiles` (optional)

If return_profiles = TRUE, the function additionally returns the running-score profile used to compute each ES, enabling direct visualization of enrichment curves.

8.6 Summary: What this implementation gives you

A GSEA-based quantitative enrichment test adapted to metabolomics.
Full use of a continuous ranking, avoiding arbitrary cutoffs.
Accurate empirical p-values via permutations.
Optional extraction of running-score profiles for plotting.
Compatibility with any set definition (KEGG, SMPDB, chemical classes, ontology categories, etc.).

This makes qea_test() suitable for replacing MetaboAnalyst’s QEA/MSEA module in automated or large-scale workflows.

Benchmarking Enrichment Packages

ORA and MSEA for metabolomics using ad-hoc metabolite sets

Alex Sanchez

2025-12-01