Input

OmicScope offers eight methods for integrating data into its pipeline, six of which rely on proteomic software for protein identification and quantitative analyses:

Important: Example files for each data format can be downloaded from the OmicScope Web App.

  • Progenesis QI for Proteomics (Method: ‘Progenesis’)

    • Progenesis QI for Proteomics (Waters Corporation) is software that enables protein quantification and identification (via APEX3D and ProteinLynx Global Server) for experiments that use Data Independent Acquisition (DIA).

    • Input format: Table with normalized abundance values - .csv (recommended) or .xlsx/xls.

  • PatternLab V (Method: ‘PatternLab’)

    • PatternLab V is an integrated computational environment for analyzing shotgun proteomic data and is considered one of the best options for quantitative proteomics using data-dependent acquisition, due to its high-confidence parameters for protein quantitation and identification.

    • Input format: Excel file (.xlsx) exported from the XIC section (users can apply filtering steps before exporting the file).

  • MaxQuant (Method: ‘MaxQuant’)

    • MaxQuant is the most widely used software for quantitative proteomics, offering users a range of parameter options for quantitative analyses.

    • Input format: proteinGroups.txt and pdata.xlsx files.

    • Note: Users must ensure that sample names (indicated in the LFQ Intensity columns) are the same in the proteinGroups and pdata files. Additionally, verify that the LFQ Intensity and/or Intensity columns contain valid values.

  • DIA-NN (Method: ‘DIA-NN’)

    • DIA-NN is popular software for protein identification and quantification in DIA experiments, offering users a variety of parameter options for quantitative analyses.

    • Input format: main_output.tsv and pdata.xlsx files.

    • Note: Users must ensure that the Run column in main_output.tsv matches the Sample column in the pdata file. Additionally, verify that the PG.MaxLFQ column is present in main_output.tsv and contains valid values.

  • FragPipe (Method: ‘FragPipe’)

    • FragPipe is a suite of computational tools enabling comprehensive analysis of mass spectrometry-based proteomics data.

    • Input format: combined_protein.tsv and pdata.xlsx files.

    • Note: Users must ensure that sample names (indicated in the MaxLFQ columns) are the same in the combined_protein and pdata files. Additionally, verify that the MaxLFQ and/or Intensity columns contain valid values.

  • Proteome Discoverer (Method: ‘ProteomeDiscoverer’)

    • Proteome Discoverer (Thermo Fisher Scientific) performs protein identification and quantitation.

    • Input format: Quantitative data on protein levels (.xls/.xlsx files) containing “Normalized” or “Abundance:” columns presenting quantitative values.

  • General (Method: ‘General’)

    • To import data from other sources, this generic input method requires users to import an Excel workbook with 3 sheets: quantitative values (sheet1 = assay), protein features (sheet2 = rdata), and sample information (sheet3 = pdata).

    • Input format: Excel file containing three sheets.

  • Snapshot (Method: ‘Snapshot’)

    • The Snapshot method is the simplest approach to be used in the OmicScope workflow. It involves using a single, concise Excel sheet that contains essential information about proteins in a study, including fold change, p-value, and grouping.

    • Input format: Excel file.

For more information about the formatting of these import methods, see the appropriate sections below.

Import OmicScope

First, the OmicScope package must be imported into the Python environment.

import omicscope as omics
OmicScope v 1.4.0 For help: https://omicscope.readthedocs.io/en/latest/ or https://omicscope.ib.unicamp.br
If you use OmicScope in published research, please cite:
'Reis-de-Oliveira, G., et al (2024). OmicScope unravels systems-level insights from quantitative proteomics data

Progenesis QI for Proteomics

Progenesis exports protein quantitation data in a CSV file containing information about samples, protein groups, and quantitative values.

OmicScope imports Progenesis output and extracts the abundance levels of each protein (assay), the features of each protein (rdata), and features of each sample (pdata). OmicScope can also accept Excel spreadsheets (with extensions .xls or .xlsx) that contain a single sheet for the Progenesis workflow, as many users may use Excel to visualize and handle data.

progenesis = omics.OmicScope('../../tests/data/proteins/progenesis.xls', Method='Progenesis')
User already performed statistical analysis
OmicScope identifies: 697 deregulations

Only for OmicScope Package (not available in OmicScope App)

Since Progenesis exports only limited information about sample groupings, OmicScope allows users to supply an Excel file with the full sample description through the pdata argument (for more information about the pdata format, see below). Users can also filter identifications by the minimum number of unique peptides with the UniquePeptides parameter (recommended: UniquePeptides = 1).

progenesis_uniquepepfilt = omics.OmicScope('../../tests/data/proteins/progenesis.xls', Method='Progenesis', UniquePeptides=1)
print('Original proteomics data: ' + str(len(progenesis.quant_data)) + '\n'+
      'Filtered proteomics data: ' + str(len(progenesis_uniquepepfilt.quant_data))
      )
User already performed statistical analysis
OmicScope identifies: 582 deregulations
Original proteomics data: 2179
Filtered proteomics data: 1797

IMPORTANT: Progenesis performs differential proteomics analyses based on preset groups, and OmicScope takes these statistical analyses into account. However, if the user has a specific experimental design, OmicScope Statistical Workflow can be used by renaming two columns in the original .csv file, as follows:

  • “Anova (p)” → “Original Anova (p)”

  • “q Value” → “Original q Value”
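The renaming step above can be scripted rather than done by hand. The following is a minimal sketch, assuming the two labels appear as exact cell values in the header rows of the exported CSV; the helper name `rename_progenesis_stats` and the tiny example table are hypothetical and are not part of OmicScope.

```python
import csv
import io

# Labels taken from the list above. Per the text, renaming them lets
# OmicScope run its own statistical workflow.
RENAMES = {
    "Anova (p)": "Original Anova (p)",
    "q Value": "Original q Value",
}


def rename_progenesis_stats(csv_text: str) -> str:
    """Return the CSV text with the two statistics headers renamed.

    Only cells whose full content equals one of the labels are changed,
    so numeric data cells are left untouched.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([RENAMES.get(cell, cell) for cell in row])
    return out.getvalue()


# Made-up table standing in for a Progenesis export (assumption).
example = "Accession,Anova (p),q Value\nP0DJI8,0.01,0.04\n"
renamed = rename_progenesis_stats(example)
print(renamed)
```

For a real export, read the file with `open(path, newline="")`, pass its contents to the helper, and write the result to a new file so the original export is preserved.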

PatternLab

PatternLab exports an Excel file with an .xlsx extension, which contains the same type of information as Progenesis, including assay, pdata, and rdata. However, this exported file does not include differential proteomics statistics. Therefore, OmicScope automatically performs statistical analyses for PatternLab data.

plv = omics.OmicScope('../../tests/data/proteins/patternlab.xlsx', Method='PatternLab')

MaxQuant

MaxQuant exports the proteinGroups.txt file, which provides a comprehensive description of the assay and rdata. However, since pdata is missing from this file, the MaxQuant method requires an additional Excel file for pdata. See the pdata section below for instructions on formatting this file.

Troubleshooting: If you encounter issues with MaxQuant data, please ensure the following:

  • LFQ Intensity or Intensity columns are present in the data: OmicScope typically uses LFQ Intensity columns for statistical analysis, falling back to ‘Intensity’ columns if LFQ Intensity columns are absent.

  • LFQ Intensity or Intensity columns contain valid values: MaxQuant may sometimes export null values for quantitative data, hindering OmicScope’s statistical analysis.

  • Verify that the MaxQuant output includes the following columns (exact labels): ‘Majority protein IDs’, ‘Fasta headers’, and ‘Gene names’ (mapped to ‘gene_name’). Older versions of MaxQuant might use different column labels, which can cause issues in OmicScope.

maxquant = omics.OmicScope('../../tests/data/proteins/MQ.txt', Method='MaxQuant',
                           pdata='../../tests/data/proteins/MQ_pdata.xlsx')
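The troubleshooting checks above can be run on the header line before importing. This is a rough sketch, not OmicScope's own validation. It assumes that quantitative columns are labelled "LFQ intensity <sample>" and that the pdata Sample column holds the part after that prefix; the function name `inspect_header` and all sample names are made up for illustration.

```python
import csv
import io

LFQ_PREFIX = "lfq intensity "  # assumed prefix, compared case-insensitively
REQUIRED = ("Majority protein IDs", "Fasta headers", "Gene names")


def inspect_header(header, pdata_samples):
    """Report missing required columns and sample-name mismatches."""
    missing_columns = [c for c in REQUIRED if c not in header]
    table_samples = {
        c[len(LFQ_PREFIX):] for c in header if c.lower().startswith(LFQ_PREFIX)
    }
    pdata_set = set(pdata_samples)
    return {
        "missing_columns": missing_columns,
        "only_in_table": sorted(table_samples - pdata_set),
        "only_in_pdata": sorted(pdata_set - table_samples),
    }


# Made-up header standing in for the first line of proteinGroups.txt.
header_line = (
    "Majority protein IDs\tFasta headers\tGene names\t"
    "LFQ intensity CTRL_1\tLFQ intensity COVID_1\n"
)
header = next(csv.reader(io.StringIO(header_line), delimiter="\t"))

# Made-up pdata Sample column; 'COVID_01' is a deliberate mismatch.
report = inspect_header(header, ["CTRL_1", "COVID_01"])
print(report)
```

Any name listed under `only_in_table` or `only_in_pdata` points to a label that should be corrected in one of the two files.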

DIA-NN

DIA-NN exports the main_output.tsv file, which provides a comprehensive description of the assay and rdata. However, since pdata is missing from this file, the DIA-NN method requires an additional Excel file for pdata. See the pdata section below for instructions on formatting this file.

IMPORTANT: main_output.tsv files from DIA-NN may be larger than 1 GB, so importing and analyzing these data can take a while.

Troubleshooting: If you encounter issues with DIA-NN data, please ensure the following:

  • PG.MaxLFQ column is present in the data: OmicScope uses PG.MaxLFQ columns for statistical analysis.

  • PG.MaxLFQ contains valid values: DIA-NN may sometimes export null values for quantitative data, hindering OmicScope’s statistical analysis.

diann = omics.OmicScope('../../tests/data/proteins/main_output.tsv', Method='DIA-NN',
                        pdata='../../tests/data/proteins/pdata.xlsx')
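Because the report can be very large, it may help to check it in a single streaming pass before handing it to OmicScope. The sketch below is only an illustration. The helper `scan_diann_report` is hypothetical, the column names Run and PG.MaxLFQ come from the text above, and treating empty, non-numeric, or non-positive cells as invalid is an assumption.

```python
import csv
import io


def scan_diann_report(handle, pdata_samples):
    """Stream a DIA-NN report once.

    Returns (runs_missing_in_pdata, n_rows, n_valid_maxlfq).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    if "Run" not in fields or "PG.MaxLFQ" not in fields:
        raise ValueError("Expected 'Run' and 'PG.MaxLFQ' columns")
    runs = set()
    n_rows = 0
    n_valid = 0
    for row in reader:
        n_rows += 1
        runs.add(row["Run"])
        try:
            value = float(row["PG.MaxLFQ"])
        except (TypeError, ValueError):
            continue  # empty or non-numeric cell
        if value > 0:  # also rejects NaN, because NaN > 0 is False
            n_valid += 1
    return sorted(runs - set(pdata_samples)), n_rows, n_valid


# Made-up report fragment; 'Protein.Group' is included only for readability.
report = (
    "Run\tProtein.Group\tPG.MaxLFQ\n"
    "run_A\tP0DJI8\t29388.47\n"
    "run_A\tP63313\t\n"
    "run_B\tP0DJI8\t31109.27\n"
)
missing, n_rows, n_valid = scan_diann_report(io.StringIO(report), ["run_A"])
print(missing, n_rows, n_valid)
```

With a real file, replace the `io.StringIO` object with `open("main_output.tsv", newline="")`; only one row is held in memory at a time.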

FragPipe

FragPipe exports the combined_protein.tsv file, which provides a comprehensive description of the assay and rdata. However, since pdata is missing from this file, the FragPipe method requires an additional Excel file for pdata. See the pdata section below for instructions on formatting this file.

Troubleshooting: If you encounter issues with FragPipe data, please ensure the following:

  • MaxLFQ or Intensity columns are present in the data: OmicScope uses the MaxLFQ columns for statistical analysis.

  • MaxLFQ or Intensity contain valid values: FragPipe may sometimes export null values for quantitative data, hindering OmicScope’s statistical analysis.

fragpipe = omics.OmicScope('../../tests/data/proteins/fragpipe.txt', Method='FragPipe',
                           pdata='../../tests/data/proteins/fragpipe.xlsx')
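To see whether each sample actually carries quantitative values, the valid entries can be counted per column. This is a sketch under stated assumptions: the column suffix " MaxLFQ Intensity", the helper `count_valid_by_column`, and the rule that empty, non-numeric, and zero cells count as missing are all illustrative choices, not OmicScope behaviour.

```python
import csv
import io

SUFFIX = " MaxLFQ Intensity"  # assumed FragPipe column suffix


def count_valid_by_column(handle, suffix=SUFFIX):
    """Count strictly positive numeric cells in each column ending with suffix."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = [c for c in (reader.fieldnames or []) if c.endswith(suffix)]
    counts = {c: 0 for c in columns}
    for row in reader:
        for c in columns:
            try:
                if float(row[c]) > 0:
                    counts[c] += 1
            except (TypeError, ValueError):
                pass  # empty or non-numeric cell counts as missing
    return counts


# Made-up two-sample table standing in for combined_protein.tsv.
table = (
    "Protein ID\tS1 MaxLFQ Intensity\tS2 MaxLFQ Intensity\n"
    "P0DJI8\t1200.5\t0\n"
    "P63313\t980.0\t\n"
)
counts = count_valid_by_column(io.StringIO(table))
print(counts)
```

A column with a count of zero, such as S2 here, would leave that sample without usable values for statistical analysis.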

Proteome Discoverer

Proteome Discoverer (PD) exports protein quantitation data in an Excel file with a single sheet that comprises samples, protein groups, and quantitative values, from which OmicScope derives the assay, rdata, and pdata.

Since PD lets users select which columns to export, we strongly recommend exporting the following columns: ‘Description’, ‘Accession’, and the ‘Normalized’/‘Abundance:’ columns. When importing statistical analyses exported by PD, also include ‘Abundance Ratio P-Value’ and ‘Abundance Ratio Adj’.

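# Named 'proteome_discoverer' so it is not overwritten by 'import pandas as pd' later on this page
proteome_discoverer = omics.OmicScope('../../tests/data/proteins/pd.xlsx', Method='ProteomeDiscoverer')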

General

The General workflow allows users to analyze data generated by other platforms, including Genomics and Transcriptomics. To do this, users need to organize an Excel file into three sheets: assay, rdata, and pdata.

  • Assay: Contains the abundance of N proteins (rows) from M samples (columns).

  • Rdata: Includes N proteins (rows) with their respective features within each column.

  • Pdata: Contains M samples (rows) with their respective characteristics, such as conditions, as well as the organization of biological and technical replicates.

For more information about how to properly format and import each of these sheets, see the respective sections below.

general = omics.OmicScope('../../tests/data/proteins/general.xlsx', Method='General')
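Before importing a General workbook, the three sheets can be checked against the layout described above (N rows in both assay and rdata, M samples shared by assay and pdata). The following is a minimal sketch; `check_general_layout` is a hypothetical helper, and the required column names are those given in the rdata and pdata sections of this page.

```python
def check_general_layout(assay_columns, n_assay_rows,
                         rdata_columns, n_rdata_rows,
                         pdata_columns, pdata_samples):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for col in ("Accession", "Description"):
        if col not in rdata_columns:
            problems.append(f"rdata is missing the '{col}' column")
    for col in ("Sample", "Condition", "Biological"):
        if col not in pdata_columns:
            problems.append(f"pdata is missing the '{col}' column")
    if n_assay_rows != n_rdata_rows:
        problems.append(
            f"assay has {n_assay_rows} rows but rdata has {n_rdata_rows}")
    only_assay = sorted(set(assay_columns) - set(pdata_samples))
    only_pdata = sorted(set(pdata_samples) - set(assay_columns))
    if only_assay:
        problems.append(f"assay samples absent from pdata: {only_assay}")
    if only_pdata:
        problems.append(f"pdata samples absent from assay: {only_pdata}")
    return problems


# Made-up sheet summaries containing three deliberate inconsistencies.
problems = check_general_layout(
    assay_columns=["S1", "S2", "S3"],
    n_assay_rows=100,
    rdata_columns=["Accession", "Description"],
    n_rdata_rows=99,
    pdata_columns=["Sample", "Condition", "Biological"],
    pdata_samples=["S1", "S2", "S4"],
)
for problem in problems:
    print(problem)
```

When the sheets are loaded with pandas, the arguments can be taken from the data frames, for example `list(assay.columns)`, `len(assay)`, and `list(pdata['Sample'])`.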

Assay

The assay sheet should contain the abundance data for each protein/feature/transcript. The first row contains the sample names for each of the abundance values below.

import pandas as pd

assay = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=0)
# Slicing example to facilitate visualization
assay.head().iloc[:,0:5]
   VCC_HB_1_1_2020    VCC_HB_1_2    VCC_HB_2_1  VCC_HB_2_1_2    VCC_HB_3_1
0     2.938847e+04  3.110927e+04  2.521807e+04  3.090703e+04  2.383499e+04
1     7.081308e+04  6.446946e+04  5.825493e+04  5.931610e+04  6.309095e+04
2     1.007536e+05  1.011999e+05  7.301329e+04  7.349391e+04  9.766835e+04
3     2.588031e+04  3.769105e+04  2.992691e+04  3.460095e+04  2.596320e+04
4     1.019192e+06  1.109406e+06  1.060396e+06  1.078239e+06  1.003426e+06

rdata

The rdata sheet needs to have at least two columns: ‘Accession’ and ‘Description’.

  1. Accession: An array of unique values that represent the proteins in the assay dataframe.

  2. Description: The header from UniProt Fasta.

Optionally, users may add a “gene_name” column for alternative names.

rdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=1)
rdata.head(3)
   Accession  Peptide count  Unique peptides  Confidence score     Anova (p)   q Value  Max fold change     Power  Highest mean condition  Lowest mean condition  Description
0     P0DJI8              1                1            6.8809  0.000000e+00  0.000000         2.192654  1.000000                   COVID                   CTRL  Serum amyloid A-1 protein OS=Homo sapiens OX=9...
1     P63313              2                0           24.1939  0.000000e+00  0.000000         3.823799  1.000000                   COVID                   CTRL  Thymosin beta-10 OS=Homo sapiens OX=9606 GN=TM...
2     P03886              3                0           24.0213  1.299387e-07  0.000041         1.386199  0.999998                    CTRL                  COVID  NADH-ubiquinone oxidoreductase chain 1 OS=Homo...

pdata

Pdata contains a description of each sample analyzed in the workflow. Pdata must have at least the following 3 columns: ‘Sample’, ‘Condition’, and ‘Biological’.

  1. Sample: The name of each sample to be analyzed, matching those in the first row of the Assay sheet.

  2. Condition: Respective group for each sample. All technical and biological replicates belonging to an experimental condition should have the same identifier here.

  3. Biological: Respective biological replicate for each sample. If two or more technical replicates were used for a single biological replicate, those replicates should have the same identifier here.

When performing longitudinal analysis, users must also include a TimeCourse column containing the day/hour/time/etc. associated with each sample.

See the example below for how to construct a pdata sheet. In this example, two groups are compared: COVID vs. CTRL. COVID contains 12 biological replicates and CTRL contains 7. Each biological replicate was injected twice, giving two technical (instrumental) replicates. These technical replicates are averaged and are not treated as individual samples for t-test purposes.

pdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=2)
pdata
    Sample           Condition  Biological
 0  VCC_HB_1_1_2020  COVID               1
 1  VCC_HB_1_2       COVID               1
 2  VCC_HB_2_1       COVID               2
 3  VCC_HB_2_1_2     COVID               2
 4  VCC_HB_3_1       COVID               3
 5  VCC_HB_3_1_2     COVID               3
 6  VCC_HB_4_1       COVID               4
 7  VCC_HB_4_1_2     COVID               4
 8  VCC_HB_5_1       COVID               5
 9  VCC_HB_5_1_2     COVID               5
10  VCC_HB_6_1       COVID               6
11  VCC_HB_6_1_2     COVID               6
12  VCC_HB_7_1       COVID               7
13  VCC_HB_7_1_2     COVID               7
14  VCC_HB_8_1       COVID               8
15  VCC_HB_8_1_2     COVID               8
16  VCC_HB_9_1       COVID               9
17  VCC_HB_9_1_2     COVID               9
18  VCC_HB_10_1      COVID              10
19  VCC_HB_10_1_2_   COVID              10
20  VCC_HB_11_1      COVID              11
21  VCC_HB_11_1_2_   COVID              11
22  VCC_HB_12_1      COVID              12
23  VCC_HB_12_1_2_   COVID              12
24  VCC_HB_A_1       CTRL                1
25  VCC_HB_A_1_2     CTRL                1
26  VCC_HB_B_1       CTRL                2
27  VCC_HB_B_1_2     CTRL                2
28  VCC_HB_C_1       CTRL                3
29  VCC_HB_C_1_2     CTRL                3
30  VCC_HB_D_1       CTRL                4
31  VCC_HB_D_1_2     CTRL                4
32  VCC_HB_E_1       CTRL                5
33  VCC_HB_E_1_2     CTRL                5
34  VCC_HB_F_1       CTRL                6
35  VCC_HB_F_1_2     CTRL                6
36  VCC_HB_G_1       CTRL                7
37  VCC_HB_G_1_2     CTRL                7

For detailed instructions on constructing pdata and integrating it into your experimental design, please refer to the page titled How to Make Pdata.
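To make the role of the Biological column concrete, the sketch below averages technical replicates that share the same Condition and Biological identifiers. It only illustrates the idea and is not OmicScope's internal code. In particular, taking an arithmetic mean of the raw abundances is an assumption. The sample names and abundance values are the first four entries of the assay and pdata examples on this page.

```python
from collections import defaultdict
from statistics import mean

# (Sample, Condition, Biological) rows following the pdata layout.
pdata_rows = [
    ("VCC_HB_1_1_2020", "COVID", 1),
    ("VCC_HB_1_2", "COVID", 1),
    ("VCC_HB_2_1", "COVID", 2),
    ("VCC_HB_2_1_2", "COVID", 2),
]

# Abundance of one protein per sample (first row of the assay example).
abundance = {
    "VCC_HB_1_1_2020": 29388.47,
    "VCC_HB_1_2": 31109.27,
    "VCC_HB_2_1": 25218.07,
    "VCC_HB_2_1_2": 30907.03,
}

# Collect technical replicates under their (Condition, Biological) key.
groups = defaultdict(list)
for sample, condition, biological in pdata_rows:
    groups[(condition, biological)].append(abundance[sample])

# One averaged value per biological replicate.
averaged = {key: mean(values) for key, values in groups.items()}
for (condition, biological), value in averaged.items():
    print(condition, biological, round(value, 2))
```

Four injections collapse into two values, one per biological replicate, which is why the number of samples entering a t-test equals the number of distinct Biological identifiers in each condition.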

Snapshot

The Snapshot method is an alternative option in OmicScope for analyzing multiple ’omics studies by importing pre-analyzed data from other platforms.

To use the Snapshot method, the user needs to upload a CSV or Excel file organized as follows:

  1. First row: ControlGroup: LIST_YOUR_CONTROL_HERE

  2. Second row: Experimental: LIST_YOUR_EXPERIMENTAL_GROUPS_SEPARATED_BY_COMMAS

  3. Third row: A table header containing the following values: ‘Accession’, ‘gene_name’, ‘log2(fc)’, and either ‘pvalue’ or ‘pAdjusted’.

  4. Subsequent rows: The molecular data to fill the columns listed in the third row.

It is important to note that Snapshot contains a comparatively limited amount of information, which means that not all plots and enrichment analyses will be available. Nevertheless, once the data is imported into OmicScope, it can still be exported as an .omics file and used in the Nebula module.
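The four-part layout described above can be generated programmatically. The sketch below writes it as CSV text. It assumes that each of the first two rows is a single cell of the form "ControlGroup: NAME" or "Experimental: NAME1,NAME2", which should be compared against the Snapshot example file from the OmicScope Web App before use. The helper `build_snapshot_csv`, the accessions, and the numeric values are made up.

```python
import csv
import io

HEADER = ["Accession", "gene_name", "log2(fc)", "pAdjusted"]


def build_snapshot_csv(control, experimental, rows):
    """Return Snapshot-style CSV text: two metadata rows, a header, then data."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([f"ControlGroup: {control}"])
    writer.writerow(["Experimental: " + ",".join(experimental)])
    writer.writerow(HEADER)
    writer.writerows(rows)
    return out.getvalue()


# Made-up molecular data (hypothetical accessions and statistics).
rows = [
    ["ACC0001", "GENE1", 1.13, 0.001],
    ["ACC0002", "GENE2", -0.47, 0.2],
]
snapshot = build_snapshot_csv("CTRL", ["COVID"], rows)
print(snapshot)
```

Note that with more than one experimental group the csv module quotes the second row because it contains commas; whether OmicScope expects that form is not confirmed here.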

Additional Information

Users can also define any of the following additional parameters that are in the OmicScope function to optimize their analysis.

  1. ControlGroup (default, ControlGroup = None): Users can define a control group to perform comparisons against a specific group. The name of this group has to match a value in the ‘Condition’ column of the pdata table.

  2. ExperimentalDesign (default, ExperimentalDesign = 'static') (options: ‘static’, ‘longitudinal’): Comparisons among independent groups are called static experimental designs. However, if the experiment takes into account several time points of related samples, then it is performing a longitudinal experimental design. Note: in this case, the pdata table must present a ‘TimeCourse’ column.

  3. pvalue (default, pvalue = 'pAdjusted') (options: ‘pvalue’, ‘pAdjusted’, ‘pTukey’): Defines the type of statistics used to report differentially regulated proteins. The options are nominal p-value (‘pvalue’), Benjamini-Hochberg adjusted p-value (‘pAdjusted’), or Tukey post-hoc correction (‘pTukey’, only available for multiple group comparisons in static experiments).

  4. PValue_cutoff (default, PValue_cutoff = 0.05): Statistical cutoff to consider proteins differentially regulated.

  5. normalization_method (default, normalization_method = None): Certain data may require a normalization preprocessing step. OmicScope offers three methods of normalization: ‘average’, ‘median’, ‘quantile’.

  6. imputation_method (default, imputation_method = None): Some data may require imputation to handle null values as a preprocessing step. OmicScope provides three methods of data imputation: ‘mean’, ‘median’, ‘knn’.

  7. FoldChange_cutoff (default, FoldChange_cutoff = 0): Cutoff of the absolute abundance ratio to consider a protein to be differentially regulated. 0 indicates that p-values alone are sufficient to determine dysregulation.

  8. logTransformed (default, logTransformed = False): Usually, analysis software reports abundance in nominal values, which require log-transformation to normalize the abundance data, so OmicScope log-transforms them by default. If users performed the transformation before the OmicScope workflow, set logTransformed=True.

  9. ExcludeContaminants (default, ExcludeContaminants = True): Recently, Frankenfield (2022) evaluated the most common contaminants found in proteomics workflows. By default, OmicScope removes them from analyses. If this is not desired, OmicScope can leave them in the final results with ExcludeContaminants=False.

  10. degrees_of_freedom (default, degrees_of_freedom = 2): For longitudinal experiments, users can optimize this parameter according to their study, choosing a greater degree of freedom to perform the subsequent statistical analyses. Note that ExperimentalDesign and pdata must still be appropriately configured.

  11. independent_ttest (default, independent_ttest = True): If running a t-test, the user can specify if data sampling was independent (True) or paired (False).
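Several of these parameters accept only a fixed set of values, so it can be convenient to check them before starting a long import. The sketch below is one possible way to do that. The helpers `check_options` and `run_omicscope` are hypothetical wrappers, the allowed values are copied from the list above, and the group name 'CTRL' is taken from the pdata example.

```python
# Allowed values as listed in the parameter descriptions above.
ALLOWED = {
    "ExperimentalDesign": {"static", "longitudinal"},
    "pvalue": {"pvalue", "pAdjusted", "pTukey"},
    "normalization_method": {None, "average", "median", "quantile"},
    "imputation_method": {None, "mean", "median", "knn"},
}


def check_options(**kwargs):
    """Raise ValueError if an option is outside its documented values."""
    for name, value in kwargs.items():
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {ALLOWED[name]}")
    return kwargs


def run_omicscope(path, method, **options):
    """Call OmicScope with checked options (requires the omicscope package)."""
    import omicscope as omics  # imported lazily so the check runs without it
    return omics.OmicScope(path, Method=method, **check_options(**options))


options = check_options(
    ControlGroup="CTRL",
    ExperimentalDesign="static",
    pvalue="pAdjusted",
    PValue_cutoff=0.05,
    FoldChange_cutoff=0,
    normalization_method="median",
    imputation_method=None,
    ExcludeContaminants=True,
)
print(options)
```

With the package installed, a call such as `run_omicscope('../../tests/data/proteins/general.xlsx', 'General', **options)` would pass the checked options on to `omics.OmicScope`.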