installifnot <- function (packageName){
 if (!(require(packageName, character.only=TRUE))) {
    install.packages(packageName)
  }else{
    detach(paste ("package", packageName, sep=":"), character.only=TRUE)
  } 
}
bioCifnot <- function (packageName){
 if (!(require(packageName, character.only=TRUE))) {
    BiocManager::install(packageName)
 }else{
  detach(paste ("package", packageName, sep=":"), character.only=TRUE)
}  
}
installifnot("knitr")
installifnot("xml2") # May yield problems if some libraries (xml2-config) not available in linux
installifnot("ggnewscale")
bioCifnot ("org.Hs.eg.db")
bioCifnot ("hgu133a.db")
bioCifnot ("GO.db")
bioCifnot ("annotate")
bioCifnot ("Rgraphviz")
bioCifnot ("GOstats")
bioCifnot ("clusterProfiler")

Introduction

This document provides some examples on the analyses that can be perfomed on one or more gene lists to help gain biological insight on the results of a differential expression analysis. Overall these analyses are known as Pathway Analysis or, also, Functional Analysis.

Functional analysis can be performed in many different ways that lead to similar (or not-so-similar) results. Because there is not a universal acceptance of what is a complete, well done functional analysis some different approaches will be shown.

Input Data for Functional Analysis

Functional analysis can be made, on a first approach on:

One or more lists of genes selected for being differentially expressed in a given experimental setting. In this case we usually work with gene identifiers.
One or more list of values measuring the difference between groups (i.e. Fold Changes, p-values, or t-statistics) for all genes being compared.

Most tools require that gene list consist of gene identifiers in some standard notation such as Entrez, ENSEMBL or other related to these.

These gene lists can be usually extracted from output tables provided by microarrays or RNA-seq data analysis tools.

The examples shown in the document use several gene lists obtained from a maicroarray analysi performed on data from a breast cancer study, but it can be easily extended to more lists or other studies.

Read data

We start by reading two files that contain the expression values (expres_AvsB.csv2) and the results (Top_AvsB.csv2) of a differential expression analysis performed using microarrays.

The code and text for the analysis that, using these data, generated these tables, can be found at: https://github.com/ASPteaching/Ejemplo_de_Analisis_de_Microarrays_con_Bioconductor

The code below assumes the files have been stored in a subdirectory of the current folder named datasets.

inputDir="datasets"
topTabAvsB <- read.table (file.path(inputDir, "Top_AvsB.csv2"), head=T, sep=";", dec=",", row.names=1)
expresAvsB <- read.table (file.path(inputDir, "expres_AvsB.csv2"), head=T, sep=";", dec=",", row.names=1)
comparisonName <- "AvsB"
dim(topTabAvsB); head(topTabAvsB)

## [1] 6218    6

##                 logFC  AveExpr          t      P.Value    adj.P.Val         B
## 204667_at   -3.038344 8.651157 -14.361559 5.741537e-11 3.570088e-07 14.649763
## 215729_s_at  3.452290 6.137595  12.814617 3.438484e-10 1.069025e-06 13.150062
## 220192_x_at -3.016315 9.521883 -10.859041 4.336167e-09 7.020345e-06 10.929050
## 214451_at   -5.665059 7.432823 -10.829695 4.516144e-09 7.020345e-06 10.892594
## 217528_at   -5.622086 6.763101  -9.666279 2.429965e-08 3.021904e-05  9.364339
## 217284_x_at -4.313116 9.133307  -9.528468 2.994504e-08 3.103304e-05  9.172112

dim(expresAvsB); head(expresAvsB)

## [1] 6218   10

##             GSM26878.CEL GSM26883.CEL GSM26887.CEL GSM26903.CEL GSM26910.CEL
## 204667_at       9.822291     9.514461     9.919010     9.600700     9.592063
## 215729_s_at     4.736669     4.761001     6.254886     4.820416     4.848366
## 220192_x_at    10.484295    10.914847    10.510820    11.509958    10.265186
## 214451_at      10.176920    10.060391    11.201140    10.888825    10.403751
## 217528_at      10.533946    10.036022    11.325578     8.053482    10.618710
## 217284_x_at    11.726573     9.740720    11.436035    12.818812    12.686506
##             GSM26888.CEL GSM26889.CEL GSM26892.CEL GSM26898.CEL GSM26906.CEL
## 204667_at       6.483792     6.551345     7.001161     6.685062     6.535446
## 215729_s_at     8.265870     8.962679     8.304247     8.768668     8.381325
## 220192_x_at     7.823961     7.809752     7.522203     8.427221     7.020395
## 214451_at       4.818335     4.784154     4.975607     4.912010     4.915627
## 217528_at       4.580566     4.537627     4.518766     4.357199     4.463147
## 217284_x_at     7.274321     7.298390     7.490518     7.562369     7.217467

Exploring gene lists

A given gene list contains useful information that can be extracted by querying databases.

Let us see how we can obtain information fom the probesets in table (comparison) AvsB.

myProbes <- rownames(expresAvsB)
head(myProbes)

## [1] "204667_at"   "215729_s_at" "220192_x_at" "214451_at"   "217528_at"  
## [6] "217284_x_at"

We need to load the library (“package”) that contains specific annotations for the microarray type that was used in this study. It has to be noticed also that each row does not represent a gene, but a probeset, a sequence that has been designed to detect if a given gene is expressed. Microarrays contain multiple probesets for many genes and this is something that ha to be dealt with.

ID conversion

In order to do many analyses it is convenient to use a universally accepted identifier such as Entrez or ENSEMBL. For instance Bioconductor organism annotation packages rely on Entrez identifiers as main key for most mappings.

It is possible to easily find out which mappings are available for each ID.

library(hgu133a.db)
keytypes(hgu133a.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"      
## [26] "UCSCKG"       "UNIPROT"

Annotation packages make it possible to annotate genes and in a similar manner other omics features. For example, we can obtain gene symbol, entrez ID and gene name with a single SQL instruction.

geneAnots <- AnnotationDbi::select(hgu133a.db, myProbes, c("SYMBOL", "ENTREZID", "GENENAME"))
head(geneAnots)

##       PROBEID SYMBOL ENTREZID
## 1   204667_at  FOXA1     3169
## 2 215729_s_at  VGLL1    51442
## 3 220192_x_at  SPDEF    25803
## 4   214451_at TFAP2B     7021
## 5   217528_at  CLCA2     9635
## 6 217284_x_at SERHL2   253190
##                                                 GENENAME
## 1                                        forkhead box A1
## 2                         vestigial like family member 1
## 3 SAM pointed domain containing ETS transcription factor
## 4                         transcription factor AP-2 beta
## 5                           chloride channel accessory 2
## 6                                serine hydrolase like 2

Now we can provide a more informative list of differentially expressed genes in topTable

selected<- topTabAvsB[,"adj.P.Val"]<0.05 & topTabAvsB[,"logFC"] > 1
sum(selected)

## [1] 243

selectedTopTab <- topTabAvsB[selected,]
selectedProbes <- rownames(selectedTopTab)
selectedAnots <-  AnnotationDbi::select(hgu133a.db, selectedProbes, c("SYMBOL", "ENTREZID", "GENENAME"))
selectedTopTab2 <- cbind(PROBEID=rownames(selectedTopTab), selectedTopTab)
# selectedInfo <- cbind(selectedAnots, selectedTopTab)
selectedInfo = merge(selectedAnots, selectedTopTab2, by="PROBEID")
write.csv2(selectedInfo, file="selectedTopTab_AvsB.csv2")

From gene lists to pathway analysis

Pathway Analysis is an extensive field of research and application and the aim of this document is not to summarize it but simply to illustrate some applications with R.

See (https://github.com/ASPteaching/An-Introduction-to-Pathway-Analysis-with-R-and-Bioconductor/raw/main/Slides/Intro2PathwayEnrichmentAnalysis-SHORT.pdf) for an introduction to the topic.

The most used database in Pathway Analysis is the Gene Ontology, which, as suggested by its name, is not a dabase but an ontology. The GO.db can be accessed using the same syntaxis as with gene identifiers.

Basic GO Annotation

Bioconductor libraries allow for both: - Exploration of functional information on genes based on the Gene Ontology - Different types of Gene Enrichment and Pathway Analysis based on the GO or other pathway databases such as the KEGG, Reactome etc.

The code below shows some easy ways to retrieve GO information associated with genes

Start by loading the appropriate packages

require(GOstats)
require(GO.db)
require(hgu133a.db); 
require(annotate) # Loads the required libraries.

Select the “top 5” genes from the list

probes <- rownames(expresAvsB)[1:5]

In the first versions of Bioconductor, identifiers were managed using environments and functions such as get or mget. While it still works, nowadays it has been substituted by the use of the select function which allows for a muach easier retrieving of identifiers.

Select returns a data.frame. If we need a character vector we can obtain it using an ad-hoc function sucha as the simplecharFromSelectbelow.

charFromSelect <- function(df,posNames=1, posValues=2){
  res <- df[,posValues]
  names(res) <- df[,posNames]
  return(res)
}
  
require(annotate)
geneIDs <-  AnnotationDbi::select(hgu133a.db, probes, c("ENTREZID", "SYMBOL"))
entrezs <-  charFromSelect(geneIDs, 1, 2)
simbols <-  charFromSelect(geneIDs, 1, 3)

Given a set of gene identifiers its associated GO identifiers can be extracted from the microarrays (for probesets ids) or from the organism (for Entrez IDS) annotation package.

# % WANING
#  The previous chunk can be substituted by the followink code chunk, shorter and more efficient.
library(hgu133a.db)
keytypes(hgu133a.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"      
## [26] "UCSCKG"       "UNIPROT"

res <-AnnotationDbi::select(hgu133a.db, keys=probes, keytype = "PROBEID", columns = c("ENTREZID", "SYMBOL","ONTOLOGY"))
res1 <- AnnotationDbi::select(hgu133a.db, keys=probes, keytype = "PROBEID",  columns = c("ENTREZID", "SYMBOL","GO"))

The resulting tables can be printed or saved.

print(head(res, n=10))

##        PROBEID ENTREZID SYMBOL ONTOLOGY
## 1    204667_at     3169  FOXA1       BP
## 2    204667_at     3169  FOXA1       CC
## 3    204667_at     3169  FOXA1       MF
## 4  215729_s_at    51442  VGLL1       BP
## 5  215729_s_at    51442  VGLL1       CC
## 6  215729_s_at    51442  VGLL1       MF
## 7  220192_x_at    25803  SPDEF       BP
## 8  220192_x_at    25803  SPDEF       CC
## 9  220192_x_at    25803  SPDEF       MF
## 10   214451_at     7021 TFAP2B       BP

print(head(res1, n=10))

##      PROBEID ENTREZID SYMBOL         GO EVIDENCE ONTOLOGY
## 1  204667_at     3169  FOXA1 GO:0000122      IEA       BP
## 2  204667_at     3169  FOXA1 GO:0000785      ISA       CC
## 3  204667_at     3169  FOXA1 GO:0000976      IDA       MF
## 4  204667_at     3169  FOXA1 GO:0000978      IBA       MF
## 5  204667_at     3169  FOXA1 GO:0000981      IBA       MF
## 6  204667_at     3169  FOXA1 GO:0000981      ISA       MF
## 7  204667_at     3169  FOXA1 GO:0001228      ISS       MF
## 8  204667_at     3169  FOXA1 GO:0001650      IDA       CC
## 9  204667_at     3169  FOXA1 GO:0003677      IDA       MF
## 10 204667_at     3169  FOXA1 GO:0003682      IEA       MF

This process yield GO identifiers but its meaning is not clear. For this we need to query the GO.db package

What do these GO:XXXXXXX mean?

The We can query theGO.db` package can be queried as any anotation package.

library(GO.db)
keytypes(GO.db)

## [1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"

columns(GO.db)

## [1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"

Assume we are only interested in the first three GO identifiers. We can obtain its name directly using the appropriate select call. First we remove duplicates then we proceed to extract information.

uniqueGOTerms <- unique(res1$GO)
selectedGOTerms <- uniqueGOTerms[1:5]
select(GO.db, selectedGOTerms, c("GOID", "ONTOLOGY","TERM" ))

##         GOID ONTOLOGY
## 1 GO:0000122       BP
## 2 GO:0000785       CC
## 3 GO:0000976       MF
## 4 GO:0000978       MF
## 5 GO:0000981       MF
##                                                                    TERM
## 1             negative regulation of transcription by RNA polymerase II
## 2                                                             chromatin
## 3                           transcription cis-regulatory region binding
## 4 RNA polymerase II cis-regulatory region sequence-specific DNA binding
## 5 DNA-binding transcription factor activity, RNA polymerase II-specific

Navigating the GO Graph

The Gene Ontology has a hierarchichal structure. At some moment we can be interested in getting information about the ancestors or the descendants from one term.

The GO.db database contains a series of mappings that provide the ancestors or childrens of any given GOTerm

Take the top GO Term of the previous analysis:

oneTerm <- selectedGOTerms[1]
oneParent <- get(oneTerm, GOBPPARENTS)   # the vector of its parent terms in the BP ontology.
oneParent

##                  isa negatively regulates                  isa 
##         "GO:0006357"         "GO:0006366"         "GO:0045892"

oneChildren<-get(oneTerm, GOBPCHILDREN) # the vector of its children terms in the BP ontolog
oneChildren

##          isa          isa          isa          isa          isa          isa 
## "GO:0000437" "GO:0010674" "GO:0010688" "GO:0010768" "GO:0034396" "GO:0060633" 
##          isa          isa          isa          isa          isa          isa 
## "GO:0097201" "GO:0120191" "GO:0150165" "GO:1900463" "GO:1900464" "GO:1900477" 
##          isa          isa          isa          isa          isa          isa 
## "GO:1901581" "GO:1904089" "GO:1904489" "GO:1904490" "GO:1905381" "GO:2000820"

oneOffspring<-get(oneTerm, GOBPOFFSPRING) # the vector of its offspring terms in the BP ontology.
oneOffspring

##  [1] "GO:0000434" "GO:0000437" "GO:0010674" "GO:0010688" "GO:0010689"
##  [6] "GO:0010690" "GO:0010691" "GO:0010768" "GO:0017055" "GO:0034396"
## [11] "GO:0060633" "GO:0061415" "GO:0061417" "GO:0061428" "GO:0090375"
## [16] "GO:0097201" "GO:0120191" "GO:0150165" "GO:1900463" "GO:1900464"
## [21] "GO:1900477" "GO:1901581" "GO:1904089" "GO:1904489" "GO:1904490"
## [26] "GO:1905381" "GO:1990441" "GO:2000805" "GO:2000820"

oneChildren %in% oneOffspring

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE

While this is interesting and can be used to produce nice plots it is not directly related with Gene Enrichment Analysis and will be ignored from here on.

Gene Enrichment Analysis

There are two main types of enrichment analysis:

Over-Representation Analysis takes a list of differentially expressed genes and it searches for biological categories in which a number of genes appear with “unusually” high frequencies. That is it looks for genes appearing more often than they would be expected by chance in any category.
Gene Set Expression Analyses works with all genes and looks for differentially expressed gene sets (categories). That is it searches for categories that, without containing an unusually high number of differentially expressed genes, are associated with genes that are in the upper or lower parts of the list of genes ordered by some measure of intensity of difference, such as the “log-Fold Change”.

Over-Representation Analysis

Over-Representation Analysis is applied on a “truncated” list of genes that one considers to be differentially expressed.

These are checked for enrichment versus a “Universe” gene list, usually, all the genes that have entered in the analysis

require(hgu133a.db)
topTab <- topTabAvsB 
probesUniverse <- rownames(topTab)
# entrezUniverse = unlist(mget(rownames(topTab), hgu133aENTREZID, ifnotfound=NA)) 
entrezUniverse<- select(hgu133a.db, probesUniverse, "ENTREZID")
entrezUniverse <- entrezUniverse$ENTREZID
whichGenes<- topTab["adj.P.Val"]<0.05 & topTab["logFC"] > 1
sum(whichGenes)

## [1] 243

topGenes <-   entrezUniverse[whichGenes]

# Remove possible duplicates

topGenes <- topGenes[!duplicated(topGenes)]
entrezUniverse <- entrezUniverse[!duplicated(entrezUniverse)]

Many packages can be used to do a Gene Enrichment Analysis. Each of them perfoms slightly different analysis but the underlying ideas are the same. Some of these packages are:

GOstats
topGO
gprofiler
clusterProfiler

We show examples on how to do it using GOstats and clusterProfiler

Enrichment Analysis using `GOstats`

In this section we use GOstats which was one of the first packages available in Bioconduictor to do Enrichment Analysis.

The idea for using it is pretty simple: We need to create special type of object called “Hyperparameter” which may be of class either:

GOHyperGParams if we want to do the enrichment analysis using the Gene Ontology as reference databases or
KEGGHyperGParams if we wish to do the analysis using the KEGG database, or
PFAMHyperGParams if we wish to do the analysis using the PFAM database.

A good strategy is to use all three.

First, we create the hyper-parameters.

library(GOstats)

# This parameter has an "ontology" argument. It may be "BP", "MF" or "CC"
# Other arguments are taken by default. Check the help for more information.

GOparams = new("GOHyperGParams",
    geneIds=topGenes, universeGeneIds=entrezUniverse,
    annotation="hgu133a.db", ontology="BP",
    pvalueCutoff=0.001)

# These hyperparameters do not have an "ontology" argument.

KEGGparams = new("KEGGHyperGParams",
    geneIds=topGenes, universeGeneIds=entrezUniverse,
    annotation="hgu133a.db",
    pvalueCutoff=0.001)

PFAMparams = new("PFAMHyperGParams",
    geneIds=topGenes, universeGeneIds=entrezUniverse,
    annotation="hgu133a.db", 
    pvalueCutoff=0.001)

Next, we run the analyses:

GOhyper = hyperGTest(GOparams)
KEGGhyper = hyperGTest(KEGGparams)
# PFAMhyper = hyperGTest(PFAMparams)

We can extract information directly from the analysis results using summary:

dim(summary(GOhyper))

## [1] 6 7

head(summary(GOhyper))

##       GOBPID       Pvalue OddsRatio  ExpCount Count Size
## 1 GO:0019370 0.0000917494 31.294444 0.2904762     4    7
## 2 GO:0046395 0.0004796855  3.282433 4.4816327    13  108
## 3 GO:0016054 0.0005736863  3.213594 4.5646259    13  110
## 4 GO:0072329 0.0007287196  4.193992 2.4897959     9   60
## 5 GO:0006691 0.0007580159 13.402381 0.4564626     4   11
## 6 GO:0045109 0.0008510094  8.401076 0.7884354     5   19
##                                    Term
## 1      leukotriene biosynthetic process
## 2     carboxylic acid catabolic process
## 3        organic acid catabolic process
## 4 monocarboxylic acid catabolic process
## 5         leukotriene metabolic process
## 6    intermediate filament organization

Apart of this we can create an automated report per each result, or, with a litle more effort we can combine all outputs into a single one.

# Creamos un informe html con los resultados
comparison = "AvsB"
GOfilename =file.path(paste("GOResults.",comparison,".html", sep=""))
htmlReport(GOhyper, file = GOfilename, summary.args=list("htmlLinks"=TRUE))

The resulting file has the contents provided by the call to summarybut with additional links.

Enrichment Analysis using `clusterProfiler`

GO enrichment analysis can be performed using enrichGO function from clusterProfiler package. We will perform the analysis over the Biological Process (BP) GO category.

The Gene Universe, the set of all genes, can be specified as in GOstatsor simply assumed to be all the human genes.

An aside comment about the format of identifiers: - GOstats expects entrez identifiers to be integer values stored as characters. - clusterProfiler requires the same identifiers, but they have to be stored as integers.

This means that, in order to reuse identifierswe need to recast them into one or another direction. In this case, because topGenesis a character vector, to use them in Bioconductor we need to convert them into integers using as.integer(topGenes).

# In GOstats use the following code 
#
# GOparams = new("GOHyperGParams",
#     geneIds=topGenes, universeGeneIds=entrezUniverse,
#     annotation="hgu133a.db", ontology="BP",
#     pvalueCutoff=0.001)

library(clusterProfiler)

## Run GO enrichment analysis 

ego <- enrichGO(gene = as.integer(topGenes), # selgenes, 
                # universe = entrezUniverse, # universe = all_genes,
                keyType = "ENTREZID",
                OrgDb = org.Hs.eg.db, 
                ont = "BP", 
                pAdjustMethod = "BH", 
                qvalueCutoff = 0.25, 
                readable = TRUE)
head(ego, n=10)

##                    ID                           Description GeneRatio   BgRatio
## GO:0046395 GO:0046395     carboxylic acid catabolic process    13/244 238/18800
## GO:0016054 GO:0016054        organic acid catabolic process    13/244 242/18800
## GO:0072329 GO:0072329 monocarboxylic acid catabolic process     9/244 124/18800
## GO:0006635 GO:0006635             fatty acid beta-oxidation     7/244  76/18800
##                  pvalue   p.adjust     qvalue
## GO:0046395 1.475561e-05 0.02886389 0.02710290
## GO:0016054 1.760530e-05 0.02886389 0.02710290
## GO:0072329 3.500820e-05 0.03826396 0.03592947
## GO:0006635 5.809675e-05 0.04762481 0.04471921
##                                                                                 geneID
## GO:0046395 GLS/CRAT/ACAA2/ECHDC2/GCAT/ECI2/IRS2/AKR1A1/SCP2/CRABP1/ACOX2/KYAT3/ALDH5A1
## GO:0016054 GLS/CRAT/ACAA2/ECHDC2/GCAT/ECI2/IRS2/AKR1A1/SCP2/CRABP1/ACOX2/KYAT3/ALDH5A1
## GO:0072329                        CRAT/ACAA2/ECHDC2/ECI2/IRS2/AKR1A1/SCP2/CRABP1/ACOX2
## GO:0006635                                      CRAT/ACAA2/ECHDC2/ECI2/IRS2/SCP2/ACOX2
##            Count
## GO:0046395    13
## GO:0016054    13
## GO:0072329     9
## GO:0006635     7

# Output results from GO analysis to a table
ego_results <- data.frame(ego)

write.csv(ego_results, "clusterProfiler_ORAresults_UpGO.csv")

Visualization of enrichment results

Dotplot of top 10 enriched terms

dotplot(ego, showCategory=10)

Visualization of GO terms in hierarchy

Enriched GO terms can be visualized as a directed acyclic graph (only for GO):

goplot(ego, showCategory=10)

Gene network for the top terms

cnetplot(ego)

Enrichment Map

Enriched terms can be grouped by some similarity measure (eg. overlap of genes between terms) to summarize the results.

## Enrichmap clusters the 50 most significant (by adj.P.Va) GO terms to visualize relationships between terms
library(enrichplot)
ego_sim <- pairwise_termsim(ego)
emapplot(ego_sim, cex_label_category=0.5)

Note:

For overrepresentation analysis based on Reactome Pathways database one can use function enrichPathway from ReactomePA package.

Challenge: perform an overrepresentation analysis of the top down-regulated genes with a logFC < -2 and adjusted p-value < 0.05 over the GO-BP ontology.

Gene Set Enrichment Analysis

If, instead of relying on the gene lists we decided to use all the genes on the array and confront them to selected sets of genes we may use the Gene Set Enrichment Analysis approach.

Classical GSEA

The clusterProfiler package implements the classical GSEA method as introduced by Subramanian et alt (2005).

entrezIDs <- AnnotationDbi::select(hgu133a.db, rownames(topTabAvsB), c("ENTREZID"))
# entrezIDs <- charFromSelect(entrezIDs)
topTabAvsB2<- cbind( PROBEID= rownames(topTabAvsB), topTabAvsB)
geneList  <- merge(topTabAvsB2, entrezIDs, by="PROBEID")

# sort by absolute logFC to remove duplicates with smallest absolute logFC
geneList <- geneList[order(abs(geneList$logFC), decreasing=T),]
geneList <- geneList[ !duplicated(geneList$ENTREZ), ]  ### Keep highest
# re-order based on logFC to be GSEA ready
geneList <- geneList[order(geneList$logFC, decreasing=T),]
genesVector <- geneList$logFC
names(genesVector) <- geneList$ENTREZ

set.seed(123)
library(clusterProfiler)
gseResulti <- gseKEGG(geneList = genesVector,
                      organism = "hsa",
                      keyType = "kegg",
                      exponent = 1,
                      minGSSize = 10,maxGSSize = 500,
                      pvalueCutoff = 0.05,pAdjustMethod = "BH",
                      # nPerm = 10000, #augmentem permutacions a 10000
                      verbose = TRUE,
                      use_internal_data = FALSE,
                      seed = TRUE,
                      eps=0,
                      by = "fgsea"
                )

# keggResultsList[[i]] <- gseResulti

library(kableExtra)
gsea.result <- setReadable(gseResulti, OrgDb = org.Hs.eg.db, keyType ="ENTREZID" )

gsea.result.df <- as.data.frame(gsea.result)
print(kable(gsea.result.df[,c("Description","setSize","NES","p.adjust")])%>% scroll_box(height = "500px"))

	Description	setSize	NES	p.adjust
hsa04110	Cell cycle	89	2.078872	0.0000981
hsa04146	Peroxisome	43	-2.225763	0.0002159
hsa04061	Viral protein interaction with cytokine and cytokine receptor	42	2.164157	0.0002870
hsa05133	Pertussis	39	2.134750	0.0002870
hsa04064	NF-kappa B signaling pathway	56	2.069468	0.0005921
hsa05152	Tuberculosis	94	1.930153	0.0005921
hsa00280	Valine, leucine and isoleucine degradation	33	-2.165861	0.0010073
hsa05323	Rheumatoid arthritis	54	1.988814	0.0010073
hsa00350	Tyrosine metabolism	14	-2.172127	0.0019756
hsa04060	Cytokine-cytokine receptor interaction	87	1.853078	0.0019756
hsa05171	Coronavirus disease - COVID-19	116	1.752238	0.0019756
hsa04145	Phagosome	95	1.784051	0.0041828
hsa05150	Staphylococcus aureus infection	41	1.914447	0.0045621
hsa04668	TNF signaling pathway	69	1.804001	0.0049851
hsa05164	Influenza A	89	1.761631	0.0054464
hsa04640	Hematopoietic cell lineage	40	1.896413	0.0055800
hsa03040	Spliceosome	84	1.765446	0.0055800
hsa04621	NOD-like receptor signaling pathway	89	1.752585	0.0055800
hsa04115	p53 signaling pathway	49	1.824305	0.0064536
hsa04062	Chemokine signaling pathway	93	1.756455	0.0101539
hsa04650	Natural killer cell mediated cytotoxicity	58	1.740036	0.0101539
hsa05166	Human T-cell leukemia virus 1 infection	137	1.641323	0.0101539
hsa05140	Leishmaniasis	42	1.863793	0.0112608
hsa04612	Antigen processing and presentation	44	1.822988	0.0136566
hsa00053	Ascorbate and aldarate metabolism	10	-1.953422	0.0210111
hsa00982	Drug metabolism - cytochrome P450	21	-1.899255	0.0217437
hsa05332	Graft-versus-host disease	22	1.793265	0.0217437
hsa01240	Biosynthesis of cofactors	76	-1.673501	0.0217437
hsa05416	Viral myocarditis	34	1.759287	0.0227068
hsa00040	Pentose and glucuronate interconversions	12	-1.943107	0.0237546
hsa00410	beta-Alanine metabolism	18	-1.915210	0.0254396
hsa00650	Butanoate metabolism	14	-1.861185	0.0335622
hsa05206	MicroRNAs in cancer	107	1.586085	0.0335622
hsa04657	IL-17 signaling pathway	53	1.672150	0.0340398
hsa05320	Autoimmune thyroid disease	18	1.750445	0.0406686
hsa05417	Lipid and atherosclerosis	108	1.537359	0.0425294
hsa05330	Allograft rejection	19	1.715923	0.0450705
hsa04672	Intestinal immune network for IgA production	18	1.730816	0.0457146
hsa05167	Kaposi sarcoma-associated herpesvirus infection	112	1.533654	0.0457146
hsa00900	Terpenoid backbone biosynthesis	15	-1.814110	0.0479823

library(ggplot2)
# for (i in 1:length(files)){
#   cat("\nComparison: ", namesC[i],"\n")
   cat("DOTPLOT\n")
#   if(nrow(keggResultsList[[i]]) > 0){
 if(nrow(gseResulti) > 0){
   p<- dotplot(gseResulti, showCategory = 20, font.size = 15,
            title =paste("Enriched Pathways\n", comparisonName ,
            split=".sign") + facet_grid(.~.sign))
   plot(p)
   cat("\nENRICHMENT MAP\n")
   em<- emapplot(gseResulti)
   plot(em)
   #guardem en pdf
   pdf(file = paste0("KEGGplots.",comparisonName,".pdf"), 
                        width = 14, height = 14)
   print(p)
   print(em)
   dev.off()
   }else{
      cat("\nNo enriched terms found\n")
 }

An Introduction to Pathway Analysis with R and Bioconductor

Alex Sanchez

2023-02-22

Introduction

Input Data for Functional Analysis

Read data

Exploring gene lists

ID conversion

From gene lists to pathway analysis

Basic GO Annotation

What do these GO:XXXXXXX mean?

Navigating the GO Graph

Gene Enrichment Analysis

Over-Representation Analysis

Enrichment Analysis using `GOstats`

Enrichment Analysis using `clusterProfiler`

Visualization of enrichment results

Gene Set Enrichment Analysis

Classical GSEA

An Introduction to Pathway Analysis with R and Bioconductor

Alex Sanchez

2023-02-22

Introduction

Input Data for Functional Analysis

Read data

Exploring gene lists

ID conversion

From gene lists to pathway analysis

Basic GO Annotation

What do these GO:XXXXXXX mean?

Navigating the GO Graph

Gene Enrichment Analysis

Over-Representation Analysis

Enrichment Analysis using GOstats

Enrichment Analysis using clusterProfiler

Visualization of enrichment results

Gene Set Enrichment Analysis

Classical GSEA

Enrichment Analysis using `GOstats`

Enrichment Analysis using `clusterProfiler`