suppressPackageStartupMessages({
library(readxl)
library(readr)
library(limma)
library(stringr)
library(rvest)
library(kableExtra)
library(tidyverse)
library(BiocFileCache)
})
Enrichment Analysis -also called Pathway Analysis- generically describes a variety of analyses, -not always related neither with enrichment or with pathways- that can be done, once a set of biological features has been obtained from an omics study, in order to gain insight on its biological interpretation.
Although the analyses can differ from one another the input is usually the same. It is usally required:
In this document we present and prepare three datasets, that will be later used for Enrichment Analyses.
The datasets are:
Enrichmet package in a
tutorial.This is an example introduced by Xia et alt. in MetaboAnalyst and is used in several tutorials on how to use this software.
The dataset is available as a .csv file in the metaboData repository in github (https://github.com/nutrimetabolomics/metaboData/blob/main/Datasets/2024-Cachexia/human_cachexia.csv). For implicity it has been downladed from there and re-stored as an Excel File.
human_cachexia <- read_excel("data/human_cachexia.xlsx")
dim(human_cachexia)
## [1] 63 78
colnames(human_cachexia)[c(1:3, 20:22, 70:72)]
## [1] "Metabolite" "cachPIF_178" "cachPIF_087" "cachPIF_113"
## [5] "cachPIF_143" "cachNETCR_007_V1" "ctrlNETCR_009_V1" "ctrlNETCR_009_V2"
## [9] "ctrlNETL_007_V1"
There seem to be two groups defined by the prefix cach / ctrl in the data
In order to run an Over Representation Anaalysis (ORA) we only need a list of differentially quantified metabolites which we usually may obtain from some comparison test.
This dataset, however, contains only metabolites that have been shown to be significantly different among the groups. so, in principle this previous step may not seem necessary.
Anyway the comparison is performed, not only because it allows to check this, but also because it will provide an effect size measure that can be use for MSEA analysis.
The code below shows the analysis performed below, using the
Bioconductor limma package to compare the two groups
provided in the dataset.
# 1. Definir el vector de grups a partir dels noms de les columnes
sample_names <- colnames(human_cachexia)[-1] # excepte 'Metabolite'
group <- ifelse(str_detect(sample_names, "^ctrl"), "Control", "Cachexia")
group <- factor(group)
# 2. Construir la matriu d'expressió (metabolits x mostres)
X <- as.matrix(human_cachexia[ , -1])
rownames(X) <- human_cachexia$Metabolite
# 3. Disseny experimental
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
fit <- lmFit(X, design)
cont.matrix <- makeContrasts(Cachexia - Control, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
topCachexia <- topTable(fit2, number = Inf) %>% mutate(ordByLimma =1:nrow(X))
head(round(topCachexia,3),10)
tail (round(topCachexia,3))
Alternatively a row-wise t-test can be applied
topCachexiaByT<- genefilter::rowttests(x=X, fac=group) %>% mutate(p.adj= p.adjust(p.value, method="BH")) %>% arrange(p.value) %>% mutate(orderByT=1:nrow(X))
Both tests yield similar results as can be seen in the naive comparison showed below.
head(round(topCachexia[,c(3,1,4,5,7)],4))
head(round(topCachexiaByT,4))
merged_table <- merge(
topCachexia[,c(3,1,4,5,7)],
topCachexiaByT,
by="row.names"
) %>% arrange(p.value)
rownames(merged_table) <- merged_table$Row.names
merged_table$Row.names <- NULL
The results of both tests can be saved to disk for later use
# 6. Mostrar els primers resultats
save(topCachexia, topCachexiaByT, file="data/Cachexia_topTable.Rda")
write.table(rownames(topCachexia), file="data/cachexia_ids.txt", row.names = FALSE, quote=FALSE)
Metabolites in the table are identified by their common names, a type of identifier very useful for scientists working with these metabolites at chemical or nutritional level.
However, most enrichment analysis tools require a more objective ID from public databases such as PubChem, HumanDB or KEGG.
Retrieving metabolites by their name (for example from MetaboliteIDmapping) may be a frustrating task because small changes in the names make them unrecognizable.
In this case, however, the associated identifiers for these metabolites, when available, have been obtained using MetaboAnalyst Compound ID converter.
Probably the fact that the example came from MtaboAnalyst has greatly facilitated the conversion.
The equivalence names-IDs has been saved as a .csv file
cachexia_map <- read.csv("data/Cachexia_map.csv")
head(cachexia_map)
The resulting identifiers may be used as input for enrichment analysis.
If we check the required elements described above we have:
cachexia_map table HMDB, KEGG.groupCachexia-groupControl or the associated
F-statistic.This dataset comes from a study that has been uploaded to Metabolomics Workbench with ID ST0000291: LC-MS Based Approaches to Investigate Metabolomic Differences in the Urine of Young Women after Drinking Cranberry Juice or Apple Juice
WARNING Although the original study followed a 2×2 crossover design, the Metabolomics Workbench dataset (ST000291) does not include subject identifiers, and therefore paired or repeated-measures models cannot be applied. As a result, treatment groups must be analysed as independent samples.
This choice is conservative: ignoring the within-subject pairing increases residual variance and therefore raises the probability of false negatives, not false positives. Consequently, any metabolites that remain significant under this independent-samples analysis are likely to represent strong and biologically robust treatment effects.
Once downloaded the data consists of three separate files
The metabolite names - 1541 row, 3 columns (original name, PubChem ID and KEGG ID)
The downloaded data are stored in a particular format “MWTab” so some functions to extract data from it have been prepared.
Data in Metabolomics Workbench is not preprocessed so some work has to be performed before they can be analyzed.
The preprocessing steps are performed using the POMA Bioconductor
package and can be found in the acompanying script
Rcode/scripts/MW_ST000291_getTopTables.R
After preprocessing the Bioconductor limma package may be used to perform pairwise comparisons, whose results have been saved to a binary file.
limma_res = list(ApplvsBase = limma_resAB,
CranvsBase = limma_resCB,
ApplvsCran = limma_resAC)
Pubchem identifiers are widely used but they be
ambiguous and this may result in small number of coincidences among the
list and the EnrichmentSet IDs, which have been created separately.
To simplify this we have converted PubChem IDs to other systems using again that has been generated MetaboAnalyst ID converter.
The list of equivalences has been stored in file
MW_ST000291__map.csv
MW_map <- readr::read_csv("data/MW_ST000291__map.csv", show_col_types = FALSE) %>%
janitor::clean_names() %>%
mutate(
PubChemCID = as.character(query),
HMDB = ifelse(hmdb == "" | is.na(hmdb), NA, hmdb)
) %>%
select(PubChemCID, HMDB)
head(MW_map)
This has been used to filter out all metabolites not mapped to HMDB.
The analysis has been reproduced with this restricted dataset and a new topTable for the comparison Appl vs Cranberry has been generated.
The resulting topTables have been saved to
file=data/MW_ST000291-limma_res.Rda Focussing on the HMDB
based top table:
load("data/MW_ST000291-limma_res.Rda")
head(limma_HMDB_CB)
If we check the required elements described above we have:
log2FC or the associated t-statistic.The enrichmet package includes a tutorial that uses
these example data, available from Zotero (records: 15498097).
The data used in the tutorial are described only briefly as summary statistics from a differential metabolomics analysis involving KRAS G12D mutation and KRAS G12D mutation with TAp73 deletion. The authors indicate that they selected the top 57 metabolites, represented using KEGG identifiers, and used as background the set of all human KEGG metabolites associated with KEGG pathways. The ranked metabolite list appears to include at least log2 fold change and p-value information, since these variables are used for the MetSEA analysis. Beyond this, the article does not report the original data matrix, sample size, experimental platform, biological model, preprocessing steps, or the full list of metabolites, so the dataset is only minimally documented in the paper.
For ease of reproducibility we have previously downloaded all the
objects included in the use-case and saved them into a binary file:
data/enrichmet_example.Rda
The objects provided by the package can be directly mapped to what we have called the “components require for enrichment:
met_id object
EM_summaryStats.EM_PathwayVsMetabolites provided with the example
data relates Pathways with their associated KEGG IDs. For simplicity of
uses a standard EnrichmentSet object has been prepared from this
object.pval or log2fc from object
EM_summarystats can be used as effect size measures.Enrichmetmay use some optional files to enhance the
analysis and its biological interpretation:
KEGGLookup
A reference file linking KEGG compound IDs (https://www.kegg.jp/) to
metabolite names or other annotations, which helps standardize
identifiers and improve pathway mapping.
STITCHInteractions
A dataset of known or predicted interactions between metabolites or
proteins, obtained from the freely available STITCH database (http://stitch.embl.de/).
These interactions are useful for constructing networks to support
pathway analysis. The example file is supplied as: chemical_chemical.tsv.
Users may provide their own interaction data appropriate for their
species.
mapping_df
A mapping data frame that links STITCH identifiers with metabolite
names, facilitating integration between datasets. The provided example
file is available here: mapping_df.xlsx
Users can substitute this with a mapping file relevant to their species
of interest.
The code below describes how the data can be downloaded and saved into a binary file.
# InputMetabolites: A vector or list containing the metabolites of interest.
inputMetabolites <- c(
"C02721", "C03736", "C03139", "C00074", "C02862", "C00092", "C00275",
"C00386", "C10172", "C00361", "C00242", "C19806", "C16358", "C16357",
"C16353", "C02140", "C05488", "C01181", "C03017", "C02632", "C00246",
"C00079", "C08317", "C00588", "C16527", "C00065", "C00249", "C00836",
"C02990", "C03690", "C04025", "C06424", "C00123", "C00606", "C03067",
"C00539", "C00633", "C00519", "C00366", "C00148", "C00219", "C08261",
"C00385", "C06429", "C05637", "C00295", "C06428", "C16513", "C01909",
"C03299", "C01530", "C01571", "C08278", "C00599", "C15587", "C00144",
"C03739"
)
# Standard cache location
cache_dir <- tools::R_user_dir("enrichmet", "cache")
bfc <- BiocFileCache(cache_dir, ask = FALSE)
# Helper function to download & cache any file from Zenodo
get_cached_file <- function(url) {
bfcrpath(bfc, url)
}
# -------------------------
# Load files with caching
# -------------------------
# PathwayVsMetabolites
pathway_url <- "https://zenodo.org/api/records/15498097/files/Human-PathwaysVsMetabolites.csv/content"
pathway_path <- get_cached_file(pathway_url)
PathwayVsMetabolites <- read.csv(pathway_path)
str(PathwayVsMetabolites)
## 'data.frame': 351 obs. of 2 variables:
## $ Pathway : chr "Glycolysis / Gluconeogenesis" "Citrate cycle (TCA cycle)" "Pentose phosphate pathway" "Pentose and glucuronate interconversions" ...
## $ Metabolites: chr "C00022,C00024,C00031,C00033,C00036,C00068,C00074,C00084,C00085,C00103,C00111,C00118,C00186,C00197,C00221,C00236"| __truncated__ "C00022,C00024,C00026,C00036,C00042,C00068,C00074,C00091,C00122,C00149,C00158,C00311,C00417,C05125,C05379,C05381"| __truncated__ "C00022,C00031,C00085,C00117,C00118,C00119,C00121,C00197,C00198,C00199,C00204,C00221,C00231,C00257,C00258,C00279"| __truncated__ "C00022,C00026,C00029,C00103,C00111,C00116,C00167,C00181,C00191,C00199,C00204,C00216,C00231,C00259,C00266,C00309"| __truncated__ ...
# SummaryStatistics
example_url <- "https://zenodo.org/api/records/15498097/files/example_data.xlsx/content"
example_path <- get_cached_file(example_url)
example_data <- read_excel(example_path)
show(example_data[1:3,])
## # A tibble: 3 × 4
## met_id pval padj log2fc
## <chr> <dbl> <dbl> <dbl>
## 1 C02721 0.0000124 0.00315 -2.32
## 2 C03736 0.0000694 0.00739 2.63
## 3 C03139 0.0000869 0.00739 1.75
# KEGGLookup
kegg_url <- "https://zenodo.org/api/records/15498097/files/kegg_lookup.xlsx/content"
kegg_path <- get_cached_file(kegg_url)
kegg_lookup <- read_excel(kegg_path)
head(kegg_lookup)
# Mapping_df
mapping_url <- "https://zenodo.org/api/records/15498097/files/mapping_df.xlsx/content"
mapping_path <- get_cached_file(mapping_url)
mapping_df <- read_excel(mapping_path)
head(mapping_df)
# STITCHInteractions
# stitch_url <- "https://zenodo.org/records/15498097/files/chemical_chemical.tsv"
# stitch_path <- get_cached_file(stitch_url)
# stitch_df <- read_tsv(stitch_path)
# head(stitch_df)
The data from “EnrichMetExample” is not an EnrichmentSet, so a new
one is created from the PathwayvsMetabolites
data.frame:
First we prepare the long version of
PathwayVsMetabolites:
tmp <- PathwayVsMetabolites %>%
mutate(Metabolites = strsplit(Metabolites, ","))
unnested <- tidyr::unnest(tmp, cols = c(Metabolites))
PathwayVsMetabolites_long <- dplyr::rename(unnested, kegg_id = Metabolites)
head(PathwayVsMetabolites_long)
PathwayVsMetabolites_long <- PathwayVsMetabolites_long %>%
mutate(PathwayID = paste0("PW_", as.integer(as.factor(Pathway))))
head(PathwayVsMetabolites_long)
PathwayVsMetabolites_long <- PathwayVsMetabolites_long %>%
left_join(kegg_lookup, by = c("kegg_id" = "kegg_id"))
## Warning in left_join(., kegg_lookup, by = c(kegg_id = "kegg_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 132 of `x` matches multiple rows in `y`.
## ℹ Row 159 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
head(PathwayVsMetabolites_long)
With this we can directly build the EnrichmentSet
object:
require(localEnrichment)
## Cargando paquete requerido: localEnrichment
KEGGset_KEGG <- buildEnrichmentSet(
data = PathwayVsMetabolites_long,
id_col = "kegg_id",
category_col = "Pathway",
set_id_col = "PathwayID",
set_name = "KEGG_hsa_pathways",
source = "KEGG",
species = "Homo sapiens",
version = Sys.Date(),
description = "KEGG human pathways mapped to KEGG compound IDs",
sep = ";"
)
summary(KEGGset_KEGG)
## EnrichmentSet summary:
## Mapping name: KEGG_hsa_pathways
## Source: KEGG
## Feature IDs: kegg_id
## Number of sets: 351
## Mean set size: 16.37607 features
## Median set size: 6 features
head(KEGGset_KEGG@data)
All the data is saved as a binary file to be used later
EM_inputMetabolites <- inputMetabolites
EM_summaryStats <- example_data
EM_PathwayVsMetabolites<- PathwayVsMetabolites
EM_mapping_df <- mapping_df
# EM_stitch_df <- stitch_df
EM_kegg_lookup <- kegg_lookup
save(EM_inputMetabolites, KEGGset_KEGG, EM_PathwayVsMetabolites, EM_summaryStats, EM_kegg_lookup, EM_mapping_df, file="data/enrichmet_example.Rda")