EnrichmentScope Object

omics.EnrichmentScope()

The EnrichmentScope is a dedicated module within OmicScope designed for conducting in silico enrichment analyses. It offers two primary types of enrichment analyses: Over-Representation Analysis (ORA) and Gene-Set Enrichment Analysis (GSEA). These analyses are powered by the Enrichr libraries, providing access to over 224 databases for thorough analysis.

To perform enrichment analysis, users must first import their data and run the statistical analysis with the OmicScope module.

The resulting OmicScope object is then passed to the EnrichmentScope module. Note that ORA uses only the differentially regulated proteins found by OmicScope, while the GSEA algorithm uses all quantified proteins to perform the statistical analysis.

Users also need to select the databases to query, with KEGG_2021_Human used by default. They can additionally define an alternative background, the target organism, or the pAdjusted cutoff below which a term is considered enriched.

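import omicscope as omics

# Import the data and run the statistical analysis
data = omics.OmicScope('../../tests/data/proteins/progenesis.xls', Method='Progenesis')

# Run Over-Representation Analysis against the KEGG library
ora = omics.EnrichmentScope(data, Analysis='ORA', dbs=['KEGG_2021_Human'])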
OmicScope v 1.4.0 For help: https://omicscope.readthedocs.io/en/latest/ or https://omicscope.ib.unicamp.br
If you use  in published research, please cite:
'Reis-de-Oliveira, G., et al (2024). OmicScope unravels systems-level insights from quantitative proteomics data

User already performed statistical analysis
OmicScope identifies: 697 deregulations

Enrichment results

Enrichment results are stored in object.results as a table (DataFrame), with the following columns:

Column Name                         Description
Gene_set                            Gene set library used for enrichment analysis
Term                                Enriched term
Overlap                             Number of proteins from the experimental gene list found in the term, over the total number of genes in the library term (e.g., 58/249)
P-value                             Nominal p-value from Fisher’s exact test
Adjusted P-value                    p-value adjusted according to Benjamini-Hochberg (pAdjusted)
Combined Score                      Score for the enrichment analysis (ORA only)
Genes                               Genes overlapped between the experimental data and the database
-log10(pAdj)                        -log10 of the pAdjusted value
N_Proteins                          Number of proteins overlapped between the experimental gene list and the target library term
regulation                          Log2(fold-change) of each overlapped protein
down-regulated                      Number of down-regulated proteins
up-regulated                        Number of up-regulated proteins
Normalized Enrichment Score (NES)   An attempt to predict the effect of proteins on pathways (GSEA only)

ora.results.head(4)
index Gene_set Term Overlap P-value Adjusted P-value Old P-value Old Adjusted P-value Odds Ratio Combined Score Genes -log10(pAdj) N_Proteins regulation down-regulated up-regulated
0 0 KEGG_2021_Human Parkinson disease 58/249 1.704579e-31 4.789868e-29 0 0 9.082385 643.458087 [NDUFA11, CALML3, COX6A1, UBE2L3, TUBB8, UCHL1... 28.319676 58 [0.2670808325175823, -0.10715415448907055, 0.7... 33 25
1 1 KEGG_2021_Human Pathways of neurodegeneration 78/475 6.471702e-31 9.092742e-29 0 0 6.000855 417.135594 [NDUFA11, CALML3, ATP2A1, COX6A1, UBE2L3, TUBB... 28.041305 78 [0.2670808325175823, -0.10715415448907055, -0.... 51 27
2 2 KEGG_2021_Human Prion disease 54/273 1.174929e-25 1.100517e-23 0 0 7.318264 420.093386 [NDUFA11, COX6A1, TUBB8, PPP3CB, TUBB6, PPP3CC... 22.958403 54 [0.2670808325175823, 0.7932637717587971, -0.33... 29 25
3 3 KEGG_2021_Human Amyotrophic lateral sclerosis 61/364 8.377698e-25 5.885333e-23 0 0 6.014281 333.426032 [NDUFA11, COX6A1, ACTG1, TUBB8, ACTR1A, PPP3CB... 22.230229 61 [0.2670808325175823, 0.7932637717587971, -0.22... 38 23
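
As a rough illustration of how the derived columns relate to each other, the sketch below recomputes N_Proteins and -log10(pAdj) from the Overlap and Adjusted P-value of the first row shown above. It uses only the Python standard library; the helper name parse_overlap and the coverage variable are hypothetical and are not part of OmicScope.

```python
import math

# Values copied from the first row of the table above.
row = {
    "Term": "Parkinson disease",
    "Overlap": "58/249",
    "Adjusted P-value": 4.789868e-29,
}

def parse_overlap(overlap):
    """Split an 'Overlap' string such as '58/249' into (hits, term_size)."""
    hits, term_size = overlap.split("/")
    return int(hits), int(term_size)

n_proteins, term_size = parse_overlap(row["Overlap"])

# Fraction of the library term that is covered by the experimental list.
coverage = n_proteins / term_size

# Same transformation as the -log10(pAdj) column.
neg_log10_padj = -math.log10(row["Adjusted P-value"])

print(n_proteins, term_size, round(coverage, 3), round(neg_log10_padj, 4))
```

The recomputed values should agree with the N_Proteins and -log10(pAdj) columns of the same row, up to rounding.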

Background - ORA only

In Over-Representation Analysis (ORA), the background gene list is the reference set against which the experimental gene list is compared. Put simply, the background comprises all the genes or proteins that could potentially be present in the experimental dataset.

By default (background = None), EnrichmentScope uses all genes found in the database as the background. Alternatively, users can set background = True to use all proteins identified in the experiment, background = int to specify the background size (for instance, the size of the reviewed human proteome in human experiments, although this is not recommended), or background = [ListOfGenes] to provide a specific gene set as the reference.
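
To see why the background matters, the following sketch computes a one-sided Fisher's exact (hypergeometric tail) p-value for the same overlap under two background sizes. It is a minimal standard-library illustration, not the code EnrichmentScope or Enrichr actually runs; the function name ora_p_value and all numbers are hypothetical.

```python
from math import comb

def ora_p_value(hits, term_size, list_size, background_size):
    """One-sided hypergeometric tail: P(X >= hits).

    hits            -- proteins shared by the experimental list and the term
    term_size       -- number of genes annotated to the term
    list_size       -- number of proteins in the experimental list
    background_size -- number of genes in the background (reference set)
    """
    max_hits = min(term_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total

# Hypothetical example: 10 of 100 regulated proteins fall in a 50-gene term.
p_small_background = ora_p_value(10, 50, 100, background_size=2_000)
p_large_background = ora_p_value(10, 50, 100, background_size=20_000)

print(f"background = 2,000 genes:  p = {p_small_background:.3e}")
print(f"background = 20,000 genes: p = {p_large_background:.3e}")
```

With a larger background, the same overlap is less likely to occur by chance, so the p-value drops. This is why an unrealistically large background can inflate the apparent significance of a term.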

Plots and Figures

EnrichmentScope introduces a variety of figures that aim to integrate the enrichment outcomes with the differentially regulated proteins in biological systems.

Users can save the generated plots in vector format (vector=True) or in .png format (vector=False), set the figure resolution (for example, dpi=300), and specify a file path for saving the plots. The color scheme of a plot can be changed with the palette parameter, which accepts color palettes from Matplotlib.

Dotplot - object.dotplot()

The dotplot function ranks enriched terms on the y-axis by their adjusted p-values, while the x-axis represents the adjusted p-values. The size of each dot is proportional to -log10(pAdjusted), indicating the significance of the enrichment, and the color of each dot encodes the number of proteins associated with the term.

How to interpret: The positioning of each dot on the plot indicates the statistical significance of the term, with more statistically significant terms located towards the top-right side of the plot. Additionally, the color of each dot corresponds to the number of proteins associated with that term, with darker blue indicating a higher number of associated proteins.

ora.dotplot(dpi=90, palette='PuBu')
_images/enrichmentscope_7_0.png

Heatmap - object.heatmap()

The heatmap is a valuable tool within the EnrichmentScope workflow, aiding in the visualization of proteins that are shared between enriched terms, helping to reduce data redundancy. In this heatmap, proteins are depicted on the y-axis, while terms are assigned to the x-axis.

By default, the heatmap colors are mapped according to the adjusted p-value. However, users have the option to color each protein based on its fold-change by setting foldchange=True.

How to interpret: When looking for specific proteins, users can identify the specific pathways (terms) associated with those proteins. Conversely, when exploring several pathways, users can observe the group of proteins that are shared between those pathways (terms). In the examples provided below, we highlight the default parameters and color coding based on fold change.

ora.heatmap(linewidths=0.5)
_images/enrichmentscope_9_0.png
# color based on protein fold-change
ora.heatmap(linewidths=0.5, foldchange=True)
_images/enrichmentscope_10_0.png
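
The data behind such a heatmap can be thought of as a protein-by-term matrix. The sketch below builds one from hypothetical results, filling each cell with either the adjusted p-value of the term or the fold-change of the protein. The function membership_matrix, the term and protein names, and all values are made up for illustration and do not reflect OmicScope internals.

```python
# Hypothetical results: term -> (adjusted p-value, {protein: log2 fold-change}).
results = {
    "Term A": (1e-8, {"P1": 0.8, "P2": -0.4, "P3": 1.2}),
    "Term B": (1e-5, {"P2": -0.4, "P3": 1.2, "P4": -0.9}),
    "Term C": (1e-3, {"P3": 1.2, "P5": 0.3}),
}

def membership_matrix(results, foldchange=False):
    """Return (terms, {protein: row}); a cell is None when the protein is absent."""
    terms = list(results)
    proteins = sorted({p for _, members in results.values() for p in members})
    matrix = {}
    for protein in proteins:
        row = []
        for term in terms:
            padj, members = results[term]
            if protein not in members:
                row.append(None)              # protein not annotated to this term
            elif foldchange:
                row.append(members[protein])  # color by protein fold-change
            else:
                row.append(padj)              # color by term adjusted p-value
        matrix[protein] = row
    return terms, matrix

terms, by_padj = membership_matrix(results)
_, by_foldchange = membership_matrix(results, foldchange=True)

for protein, row in by_padj.items():
    print(protein, row)
```

Rows with few empty cells (P3 in this toy data) correspond to proteins shared by several terms, which is the redundancy the heatmap is meant to expose.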

Number of DEPs - object.number_deps()

The number_deps function counts the number of up- and down-regulated entities (x-axis) and plots them according to each enriched term (y-axis). In this plot, sizes indicate the number of proteins found in each group.

How to interpret: This plot answers a question that often arises in ORA and GSEA analyses: how many up- and down-regulated proteins are associated with each enriched term.

ora.number_deps(palette=['firebrick', 'darkcyan'], dpi=90)
_images/enrichmentscope_12_0.png
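
The counts shown in this plot can be derived from the per-protein fold-changes of each term. The sketch below assumes that a negative log2(fold-change) means down-regulation and a positive one means up-regulation; the function count_regulation and the example values are hypothetical.

```python
def count_regulation(log2_fold_changes):
    """Return (down, up) counts; values equal to zero fall in neither group."""
    down = sum(1 for fc in log2_fold_changes if fc < 0)
    up = sum(1 for fc in log2_fold_changes if fc > 0)
    return down, up

# Hypothetical log2(fold-change) lists, shaped like the regulation column.
regulation_by_term = {
    "Term A": [0.27, -0.11, 0.79, -0.33, -0.52],
    "Term B": [1.10, 0.45, -0.20],
}

counts = {
    term: count_regulation(values)
    for term, values in regulation_by_term.items()
}

for term, (down, up) in counts.items():
    print(f"{term}: {down} down-regulated, {up} up-regulated")
```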

Enrichment Network - object.enrichment_network()

In proteomics, major pathways frequently share several proteins, and visualizing pathways and proteins together in a network can be highly informative.

The Enrichment Network function visually connects terms to their associated proteins. In this visualization, terms are depicted in gray, and the node size is proportional to -log10(p-adjusted). Proteins are represented uniformly in size and are color-coded based on their fold-change. Labels can be added to the plot by using the labels=True option (default: False).

Note: Visualizing graphs can be complex, particularly when dealing with substantial amounts of information. Several software tools, such as Cytoscape and Gephi, have been designed specifically for this purpose. Users can export the graph to these external tools by specifying save=PATH_TO_SAVE.

ora.enrichment_network(top = 10, dpi = 90)
_images/enrichmentscope_14_0.png
[<networkx.classes.graph.Graph at 0x182bed80d90>]
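
Conceptually, the network is a two-layer graph with one edge for every term and protein pair. The sketch below builds such an edge list with the standard library and lists the proteins connected to more than one term. The input dictionary and all names are hypothetical, and the structure of the graph that OmicScope returns may differ.

```python
from collections import defaultdict

# Hypothetical enrichment output: term -> overlapped proteins.
term_to_proteins = {
    "Term A": ["P1", "P2", "P3"],
    "Term B": ["P2", "P3", "P4"],
    "Term C": ["P3", "P5"],
}

# One edge per (term, protein) pair.
edges = [
    (term, protein)
    for term, proteins in term_to_proteins.items()
    for protein in proteins
]

# Protein degree = number of enriched terms the protein is connected to.
protein_degree = defaultdict(int)
for _, protein in edges:
    protein_degree[protein] += 1

shared_proteins = sorted(p for p, d in protein_degree.items() if d > 1)

print(f"{len(edges)} edges")
print("Proteins shared by several terms:", shared_proteins)
```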

Enrichment Map - object.enrichment_map()

Graph representations can also reduce redundancy in enrichment results. The enrichment_map function takes advantage of this by rendering terms as nodes and similarity scores as edges, with Jaccard similarity used by default. If users enable modules=True, the Louvain method is used to identify communities within the network. When labels=True is specified, each community is described by a single term, typically the one with the highest degree.

Similar to the enrichment_network function, users can easily export the generated enrichment map to external tools for further exploration and visualization by adding save=PATH_TO_SAVE.

How to interpret: To investigate pathways that share proteins, users can look inside modules to identify pathways with highly similar protein content. To reduce redundancy, users can instead pick the node with the highest degree (number of connections) and/or the lowest p-value within each module and treat that node as representative of the whole module.

ora.enrichment_map(dpi=90, modules=True)
_images/enrichmentscope_16_0.png
[<networkx.classes.graph.Graph at 0x182bc0941d0>]
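
The similarity edges of an enrichment map can be sketched with the Jaccard index, defined as the size of the intersection of two gene sets divided by the size of their union. The example below is a standard-library illustration on hypothetical gene sets; the cutoff of 0.5 is an arbitrary assumption, and community detection with the Louvain method is not reproduced here.

```python
from itertools import combinations

# Hypothetical gene sets for five enriched terms.
term_genes = {
    "Term A": {"G1", "G2", "G3", "G4"},
    "Term B": {"G2", "G3", "G4", "G5"},
    "Term C": {"G3", "G4", "G5", "G6"},
    "Term D": {"G7", "G8"},
    "Term E": {"G8", "G9"},
}

def jaccard(a, b):
    """Jaccard similarity: |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(term_genes, cutoff=0.5):
    """Keep term pairs whose Jaccard similarity reaches the cutoff."""
    edges = {}
    for t1, t2 in combinations(term_genes, 2):
        score = jaccard(term_genes[t1], term_genes[t2])
        if score >= cutoff:
            edges[(t1, t2)] = score
    return edges

edges = similarity_edges(term_genes)

# Degree = number of similarity edges attached to each term.
degree = {term: 0 for term in term_genes}
for t1, t2 in edges:
    degree[t1] += 1
    degree[t2] += 1

# The best-connected term is a candidate representative for its group.
hub = max(degree, key=degree.get)

print(edges)
print("Highest-degree term:", hub)
```

In this toy data, Term B is linked to both Term A and Term C, so it would be a natural label for that group of related terms.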