Input
=====

*OmicScope* offers eight methods for integrating data into its pipeline, six of which rely on proteomic software for protein identification and quantitative analyses:

**Important:** Example files for each data format can be downloaded from the `OmicScope Web App `__.

- `Progenesis QI for Proteomics `__ (*Method: ‘Progenesis’*)

  - Progenesis QI for Proteomics (Waters Corporation) is software that enables protein quantification and identification (via APEX3D and ProteinLynx Global Server) for experiments that use Data Independent Acquisition (DIA).
  - **Input format:** Table with normalized abundance values - .csv (recommended) or .xlsx/.xls.

- `PatternLab V `__ (*Method: ‘PatternLab’*)

  - PatternLab V is an integrated computational environment for analyzing shotgun proteomic data and is considered one of the best options for quantitative proteomics using data-dependent acquisition, due to its high-confidence parameters for protein quantitation and identification.
  - **Input format:** PLP file exported from the XIC section (filtering steps can be applied before exporting the file) - .plp.
  - **This import method was developed in collaboration with the PatternLab developers.**

- `MaxQuant `__ (*Method: ‘MaxQuant’*)

  - MaxQuant is one of the most widely used software packages for quantitative proteomics, offering users a range of parameter options for quantitative analyses.
  - **Input format:** proteinGroups.txt and `pdata <#pdata>`__.xlsx files.
  - *Note: The sample names (indicated in the LFQ Intensity columns) must be the same in the proteinGroups and pdata files. Additionally, verify that the LFQ Intensity and/or Intensity columns contain valid values.*

- `DIA-NN `__ (*Method: ‘DIA-NN’*)

  - DIA-NN is a popular software that performs protein identification and quantification for DIA experiments, offering users a variety of parameter options for quantitative analyses.
  - **Input format:** main_output.tsv and `pdata <#pdata>`__.xlsx files.
  - *Note: The values in the Run column of main_output.tsv must match the Sample column of the pdata file. Additionally, verify that PG.MaxLFQ is present in main_output.tsv and contains valid values.*

- `FragPipe `__ (*Method: ‘FragPipe’*)

  - FragPipe is a suite of computational tools enabling comprehensive analysis of mass spectrometry-based proteomics data.
  - **Input format:** combined_protein.tsv and `pdata <#pdata>`__.xlsx files.
  - *Note: The sample names (indicated in the MaxLFQ columns) must be the same in the combined_protein and pdata files. Additionally, verify that the MaxLFQ and/or Intensity columns contain valid values.*

- `Proteome Discoverer `__ (*Method: ‘ProteomeDiscoverer’*)

  - Proteome Discoverer (Thermo Fisher Scientific) performs protein identification and quantitation.
  - **Input format:** Quantitative data on protein levels (.xls/.xlsx files) containing “Normalized” or “Abundance:” columns with the quantitative values.

- `General `__ (*Method: ‘General’*)

  - To import data from other sources, this generic input method requires an Excel workbook with three sheets: quantitative values (sheet 1 = assay), protein features (sheet 2 = rdata), and sample information (sheet 3 = pdata).
  - **Input format:** Excel file containing three sheets.

- `Snapshot `__ (*Method: ‘Snapshot’*)

  - The Snapshot method is the simplest way to enter the OmicScope workflow. It uses a single, concise Excel sheet that contains the essential information about the proteins in a study, including fold change, p-value, and grouping.
  - **Input format:** Excel file.

For more information about the formatting of these import methods, see the appropriate sections below.

Import OmicScope
----------------

First, import the OmicScope package into the Python environment.

.. code:: ipython3

    import omicscope as omics

.. parsed-literal::

    OmicScope v 1.4.0 For help: https://omicscope.readthedocs.io/en/latest/ or https://omicscope.ib.unicamp.br
    If you use in published research, please cite:
    'Reis-de-Oliveira, G., et al (2024). OmicScope unravels systems-level insights from quantitative proteomics data

Progenesis QI for Proteomics
----------------------------

Progenesis exports protein quantitation data in a CSV file containing information about samples, protein groups, and quantitative values. *OmicScope* imports the Progenesis output and extracts the abundance levels of each protein (assay), the features of each protein (rdata), and the features of each sample (pdata).

Because many users open and handle their data in Excel, *OmicScope* also accepts Excel spreadsheets (.xls or .xlsx) for the Progenesis workflow, provided they contain a **single sheet**.

.. code:: ipython3

    progenesis = omics.OmicScope('../../tests/data/proteins/progenesis.xls', Method='Progenesis')

.. parsed-literal::

    User already performed statistical analysis
    OmicScope identifies: 697 deregulations

**Only for OmicScope Package (not available in OmicScope App)**

Since Progenesis exports certain information about sample groupings, *OmicScope* also lets the user supply an Excel file containing the full sample information through the ``pdata`` argument (for more information about the pdata format, see below). Furthermore, users can filter identifications by the minimum number of unique peptides with the ``UniquePeptides`` parameter (recommended: ``UniquePeptides = 1``).

.. code:: ipython3

    progenesis_uniquepepfilt = omics.OmicScope('../../tests/data/proteins/progenesis.xls', Method='Progenesis', UniquePeptides=1)
    print('Original proteomics data: ' + str(len(progenesis.quant_data)) + '\n' +
          'Filtered proteomics data: ' + str(len(progenesis_uniquepepfilt.quant_data)))

.. parsed-literal::

    User already performed statistical analysis
    OmicScope identifies: 582 deregulations
    Original proteomics data: 2179
    Filtered proteomics data: 1797

**IMPORTANT**: Progenesis performs differential proteomics analyses based on preset groups, and *OmicScope* takes these statistical analyses into account. However, if the user has a specific experimental design, the *OmicScope* Statistical Workflow can be used instead by renaming two columns in the original .csv file, as follows:

- “Anova (p)” → “Original Anova (p)”
- “q Value” → “Original q Value”

PatternLab
----------

PatternLab exports a .plp file, which contains the same type of information as the Progenesis output, including assay, pdata, and rdata. However, this exported file does not include differential proteomics statistics. Therefore, *OmicScope* automatically performs the statistical analyses for PatternLab data. Users who want to keep the statistics produced by the T-Fold workflow in PatternLab should use the "Snapshot" import method instead.

.. code:: ipython3

    plv = omics.OmicScope('../../tests/data/proteins/patternlab.plp', Method='PatternLab')

MaxQuant
--------

MaxQuant exports the **proteinGroups.txt** file, which provides a comprehensive description of the assay and rdata. However, this file does not contain pdata, so this method **requires** an additional Excel file for pdata. See the `pdata section <#pdata>`__ below for instructions on formatting this file.

**Troubleshooting:** If you encounter issues with MaxQuant data, please ensure the following:

- *LFQ Intensity or Intensity columns are present in the data*: OmicScope typically uses LFQ Intensity columns for statistical analysis, falling back to ‘Intensity’ columns if LFQ Intensity columns are absent.
- *LFQ Intensity or Intensity columns contain valid values*: MaxQuant may sometimes export null values for quantitative data, hindering OmicScope’s statistical analysis.
- *The MaxQuant output includes the following columns (exact labels)*: ‘Majority protein IDs’, ‘Fasta headers’, and ‘Gene names’ (mapped to ‘gene_name’). Older versions of MaxQuant might use different column labels, which can cause issues in OmicScope.

.. code:: ipython3

    maxquant = omics.OmicScope('../../tests/data/proteins/MQ.txt', Method='MaxQuant', pdata='../../tests/data/proteins/MQ_pdata.xlsx')

DIA-NN
------

DIA-NN exports the **main_output.tsv** file, which provides a comprehensive description of the assay and rdata. However, this file does not contain pdata, so this method **requires** an additional Excel file for pdata. See the `pdata section <#pdata>`__ below for instructions on formatting this file.

**IMPORTANT**: main_output.tsv files from DIA-NN may be larger than 1 GB, so importing and analyzing these data can take a while.

**Troubleshooting:** If you encounter issues with DIA-NN data, please ensure the following:

- *PG.MaxLFQ column is present in the data*: OmicScope uses the PG.MaxLFQ column for statistical analysis.
- *PG.MaxLFQ contains valid values*: DIA-NN may sometimes export null values for quantitative data, hindering OmicScope’s statistical analysis.

.. code:: ipython3

    diann = omics.OmicScope('../../tests/data/proteins/main_output.tsv', Method='DIA-NN', pdata='../../tests/data/proteins/pdata.xlsx')

FragPipe
--------

FragPipe exports the **combined_protein.tsv** file, which provides a comprehensive description of the assay and rdata. However, this file does not contain pdata, so this method **requires** an additional Excel file for pdata. See the `pdata section <#pdata>`__ below for instructions on formatting this file.

**Troubleshooting:** If you encounter issues with FragPipe data, please ensure the following:

- *MaxLFQ or Intensity columns are present in the data*: OmicScope uses these columns for statistical analysis.
- *MaxLFQ or Intensity columns contain valid values*: FragPipe may sometimes export null values for quantitative data, hindering OmicScope’s statistical analysis.

.. code:: ipython3

    fragpipe = omics.OmicScope('../../tests/data/proteins/fragpipe.txt', Method='FragPipe', pdata='../../tests/data/proteins/fragpipe.xlsx')

Proteome Discoverer
-------------------

Proteome Discoverer (PD) exports protein quantitation data in an Excel file with a single sheet that comprises samples, protein groups, and quantitative values. *OmicScope* uses this sheet to separate the data into assay, rdata, and pdata.

Since PD allows users to select which columns are exported, we **strongly recommend** exporting the following columns: ‘Description’, ‘Accession’, ‘Normalizing’/‘Abundance:’. When importing statistical analyses exported by PD, also include: ‘Abundance Ratio P-Value’, ‘Abundance Ratio Adj’.

.. code:: ipython3

    # Named proteome_discoverer (not pd) to avoid clashing with the pandas alias used below
    proteome_discoverer = omics.OmicScope('../../tests/data/proteins/pd.xlsx', Method='ProteomeDiscoverer')

General
-------

The General workflow allows users to analyze data generated by other platforms, including Genomics and Transcriptomics. To do this, users need to organize an Excel file into three sheets: assay, rdata, and pdata.

- **Assay:** Contains the abundance of N proteins (rows) from M samples (columns).
- **Rdata:** Includes N proteins (rows) with their respective features within each column.
- **Pdata:** Contains M samples (rows) with their respective characteristics, such as conditions, as well as the organization of biological and technical replicates.

For more information about how to properly format and import each of these sheets, see the respective sections below.

.. code:: ipython3

    general = omics.OmicScope('../../tests/data/proteins/general.xlsx', Method='General')

Assay
~~~~~

The assay sheet should contain the abundance data for each protein/feature/transcript. The first row contains the sample names for each of the abundance values below.

.. code:: ipython3

    import pandas as pd

    assay = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=0)
    # Slicing example to facilitate visualization
    assay.head().iloc[:,0:5]

.. code:: text

       VCC_HB_1_1_2020    VCC_HB_1_2    VCC_HB_2_1  VCC_HB_2_1_2    VCC_HB_3_1
    0     2.938847e+04  3.110927e+04  2.521807e+04  3.090703e+04  2.383499e+04
    1     7.081308e+04  6.446946e+04  5.825493e+04  5.931610e+04  6.309095e+04
    2     1.007536e+05  1.011999e+05  7.301329e+04  7.349391e+04  9.766835e+04
    3     2.588031e+04  3.769105e+04  2.992691e+04  3.460095e+04  2.596320e+04
    4     1.019192e+06  1.109406e+06  1.060396e+06  1.078239e+06  1.003426e+06
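The shape constraints described for the General workbook (N proteins shared by assay and rdata, M samples shared by assay and pdata) can be checked before import. The following is a minimal sketch using only the standard library; ``check_general_sheets`` is a hypothetical helper written for this illustration, not part of OmicScope, and it assumes the sample names and accessions have already been read into plain Python lists.

```python
def check_general_sheets(assay_samples, n_assay_rows, rdata_accessions, pdata_samples):
    """Return a list of problems; an empty list means the three sheets agree.

    Hypothetical helper (not an OmicScope function):
    assay_samples    -- column names of the assay sheet (M samples)
    n_assay_rows     -- number of data rows in the assay sheet (N proteins)
    rdata_accessions -- values of the 'Accession' column in rdata
    pdata_samples    -- values of the 'Sample' column in pdata
    """
    problems = []
    if n_assay_rows != len(rdata_accessions):
        problems.append(
            f"assay has {n_assay_rows} rows but rdata has {len(rdata_accessions)}"
        )
    if len(set(rdata_accessions)) != len(rdata_accessions):
        problems.append("rdata 'Accession' values are not unique")
    only_in_assay = sorted(set(assay_samples) - set(pdata_samples))
    only_in_pdata = sorted(set(pdata_samples) - set(assay_samples))
    if only_in_assay:
        problems.append(f"samples in assay but not in pdata: {only_in_assay}")
    if only_in_pdata:
        problems.append(f"samples in pdata but not in assay: {only_in_pdata}")
    return problems


# Matching sheets produce no problems.
ok = check_general_sheets(
    ["VCC_HB_1_2", "VCC_HB_2_1"], 2, ["P0DJI8", "P63313"], ["VCC_HB_1_2", "VCC_HB_2_1"]
)
# A single stray character in a sample name is reported from both sides.
bad = check_general_sheets(["VCC_HB_10_1_2_"], 1, ["P0DJI8"], ["VCC_HB_10_1_2"])
print(ok)
print(bad)
```

Comparing names as exact strings is deliberate: a trailing underscore or space is enough to make a sample appear in one sheet and not in the other.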
rdata
~~~~~

The rdata sheet must have at least two columns: ‘Accession’ and ‘Description’.

1. **Accession:** An array of unique values that represent the proteins in the assay dataframe.
2. **Description:** The header from the UniProt FASTA file.

Optionally, users may add a “gene_name” column with alternative names.

.. code:: ipython3

    rdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=1)
    rdata.head(3)

.. code:: text

       Accession  Peptide count  Unique peptides  Confidence score  Anova (p)     q Value   Max fold change  Power     Highest mean condition  Lowest mean condition  Description
    0  P0DJI8     1              1                6.8809            0.000000e+00  0.000000  2.192654         1.000000  COVID                   CTRL                   Serum amyloid A-1 protein OS=Homo sapiens OX=9...
    1  P63313     2              0                24.1939           0.000000e+00  0.000000  3.823799         1.000000  COVID                   CTRL                   Thymosin beta-10 OS=Homo sapiens OX=9606 GN=TM...
    2  P03886     3              0                24.0213           1.299387e-07  0.000041  1.386199         0.999998  CTRL                    COVID                  NADH-ubiquinone oxidoreductase chain 1 OS=Homo...
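When the ‘Description’ column holds UniProt FASTA headers, the optional “gene_name” column can usually be derived from the ``GN=`` field of the header. The sketch below is one way to do this with the standard library; ``gene_name_from_description`` is a hypothetical helper, not an OmicScope function, and the two headers are illustrative strings written in the UniProt header layout.

```python
import re

# UniProt headers carry the gene name as "GN=<name>"; the field is optional.
_GN_PATTERN = re.compile(r"\bGN=(\S+)")


def gene_name_from_description(description):
    """Return the gene name found in a UniProt FASTA header, or None if absent."""
    match = _GN_PATTERN.search(description)
    return match.group(1) if match else None


# Illustrative headers (assumed for this example, not read from the test data).
with_gene = "Serum amyloid A-1 protein OS=Homo sapiens OX=9606 GN=SAA1 PE=1 SV=1"
without_gene = "Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1"

print(gene_name_from_description(with_gene))
print(gene_name_from_description(without_gene))
```

Returning ``None`` for headers without a ``GN=`` field leaves it to the user to decide how such proteins should be labelled, for example by falling back to the accession.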
pdata
~~~~~

Pdata contains a description of each sample analyzed in the workflow. Pdata must have at least the following three columns: ‘Sample’, ‘Condition’, and ‘Biological’.

1. **Sample:** The name of each sample to be analyzed, matching those in the first row of the Assay sheet.
2. **Condition:** Respective group for each sample. All technical and biological replicates belonging to an experimental condition should have the same identifier here.
3. **Biological:** Respective biological replicate for each sample. If two or more technical replicates were used for a single biological replicate, those replicates should have the same identifier here.

When performing longitudinal analysis, users must also include a ``TimeCourse`` column containing the day/hour/time/etc. associated with each sample.

See the example below for how to construct a pdata sheet. In this example, two groups are compared: COVID *vs.* CTRL. COVID contains 12 biological replicates and CTRL contains 7 biological replicates. Each biological replicate was injected twice, giving two instrumental (technical) replicates. These technical replicates are averaged and are not treated as individual samples for t-test purposes.

.. code:: ipython3

    pdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=2)
    pdata

.. code:: text

        Sample           Condition  Biological
     0  VCC_HB_1_1_2020  COVID      1
     1  VCC_HB_1_2       COVID      1
     2  VCC_HB_2_1       COVID      2
     3  VCC_HB_2_1_2     COVID      2
     4  VCC_HB_3_1       COVID      3
     5  VCC_HB_3_1_2     COVID      3
     6  VCC_HB_4_1       COVID      4
     7  VCC_HB_4_1_2     COVID      4
     8  VCC_HB_5_1       COVID      5
     9  VCC_HB_5_1_2     COVID      5
    10  VCC_HB_6_1       COVID      6
    11  VCC_HB_6_1_2     COVID      6
    12  VCC_HB_7_1       COVID      7
    13  VCC_HB_7_1_2     COVID      7
    14  VCC_HB_8_1       COVID      8
    15  VCC_HB_8_1_2     COVID      8
    16  VCC_HB_9_1       COVID      9
    17  VCC_HB_9_1_2     COVID      9
    18  VCC_HB_10_1      COVID      10
    19  VCC_HB_10_1_2_   COVID      10
    20  VCC_HB_11_1      COVID      11
    21  VCC_HB_11_1_2_   COVID      11
    22  VCC_HB_12_1      COVID      12
    23  VCC_HB_12_1_2_   COVID      12
    24  VCC_HB_A_1       CTRL       1
    25  VCC_HB_A_1_2     CTRL       1
    26  VCC_HB_B_1       CTRL       2
    27  VCC_HB_B_1_2     CTRL       2
    28  VCC_HB_C_1       CTRL       3
    29  VCC_HB_C_1_2     CTRL       3
    30  VCC_HB_D_1       CTRL       4
    31  VCC_HB_D_1_2     CTRL       4
    32  VCC_HB_E_1       CTRL       5
    33  VCC_HB_E_1_2     CTRL       5
    34  VCC_HB_F_1       CTRL       6
    35  VCC_HB_F_1_2     CTRL       6
    36  VCC_HB_G_1       CTRL       7
    37  VCC_HB_G_1_2     CTRL       7
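To make the role of the ‘Condition’ and ‘Biological’ columns concrete, the sketch below averages the technical replicates of one protein within each biological replicate, which is the grouping the pdata table above encodes. It illustrates the idea only and is not OmicScope's internal implementation; ``average_technical_replicates`` is a hypothetical helper and the abundance values are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def average_technical_replicates(pdata_rows, abundances):
    """Average one protein's abundance over samples sharing (Condition, Biological).

    pdata_rows -- list of dicts with 'Sample', 'Condition' and 'Biological' keys
    abundances -- dict mapping sample name to the abundance of a single protein
    """
    grouped = defaultdict(list)
    for row in pdata_rows:
        key = (row["Condition"], row["Biological"])
        grouped[key].append(abundances[row["Sample"]])
    return {key: mean(values) for key, values in grouped.items()}


# Four rows taken from the pdata example: two injections per biological replicate.
pdata_rows = [
    {"Sample": "VCC_HB_1_1_2020", "Condition": "COVID", "Biological": 1},
    {"Sample": "VCC_HB_1_2", "Condition": "COVID", "Biological": 1},
    {"Sample": "VCC_HB_A_1", "Condition": "CTRL", "Biological": 1},
    {"Sample": "VCC_HB_A_1_2", "Condition": "CTRL", "Biological": 1},
]
# Made-up abundances for a single protein (illustration only).
abundances = {
    "VCC_HB_1_1_2020": 10.0,
    "VCC_HB_1_2": 12.0,
    "VCC_HB_A_1": 8.0,
    "VCC_HB_A_1_2": 9.0,
}

averaged = average_technical_replicates(pdata_rows, abundances)
print(averaged)
```

The key is the pair (Condition, Biological) rather than ‘Biological’ alone because the identifiers restart at 1 in each condition, as in the table above.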
For detailed instructions on constructing pdata and integrating it into your experimental design, please refer to the page titled `How to Make Pdata `__.

Snapshot
--------

The Snapshot method is an alternative option in OmicScope for analyzing multiple ’omics studies by importing pre-analyzed data from other platforms. To use the Snapshot method, the user needs to upload a CSV or Excel file organized as follows:

1. First row: **ControlGroup: LIST_YOUR_CONTROL_HERE**
2. Second row: **Experimental: LIST_YOUR_EXPERIMENTAL_GROUPS_SEPARATED_BY_COMMAS**
3. Third row: A table header containing the following values: ‘Accession’, ‘gene_name’, ‘log2(fc)’, and either ‘pvalue’ or ‘pAdjusted’.
4. Subsequent rows: The molecular data to fill the columns listed in the third row.

Note that Snapshot contains a comparatively limited amount of information, which means that not all plots and enrichment analyses will be available. Nevertheless, once the data is imported into OmicScope, it can still be exported as an .omics file and used in the Nebula module.

Additional Information
----------------------

Users can also set any of the following additional parameters of the OmicScope function to tune their analysis.

1. **ControlGroup** (default, ``ControlGroup = None``): Users can define a control group to perform comparisons against a specific group. The name of this group must appear in the ‘Condition’ column of the pdata table.
2. **ExperimentalDesign** (default, ``ExperimentalDesign = 'static'``) (options: ‘static’, ‘longitudinal’): Comparisons among independent groups are called static experimental designs. If the experiment instead follows related samples over several time points, it is a longitudinal experimental design. **Note:** in this case, the pdata table must contain a ‘TimeCourse’ column.
3. **pvalue** (default, ``pvalue = 'pAdjusted'``) (options: ‘pvalue’, ‘pAdjusted’, ‘pTukey’): Defines the type of statistics used to report differentially regulated proteins. The options are nominal p-value (‘pvalue’), Benjamini-Hochberg adjusted p-value (‘pAdjusted’), or Tukey post-hoc correction (‘pTukey’, only available for multiple group comparisons in static experiments).
4. **PValue_cutoff** (default, ``PValue_cutoff = 0.05``): Statistical cutoff to consider proteins differentially regulated.
5. **normalization_method** (default, ``normalization_method = None``): Certain data may require a normalization preprocessing step. OmicScope offers three methods of normalization: ‘average’, ‘median’, ‘quantile’.
6. **imputation_method** (default, ``imputation_method = None``): Some data may require imputation of null values as a preprocessing step. OmicScope provides three methods of data imputation: ‘mean’, ‘median’, ‘knn’.
7. **FoldChange_cutoff** (default, ``FoldChange_cutoff = 0``): Cutoff of the absolute abundance ratio to consider a protein to be differentially regulated. 0 indicates that p-values alone are sufficient to determine dysregulation.
8. **logTransform** (default, ``logTransform = True``): Usually, analysis software reports abundance in nominal values, requiring a log-transformation of the values to normalize abundance data. If the data were already log-transformed before the OmicScope workflow, set ``logTransform=False``.
9. **ExcludeContaminants** (default, ``ExcludeContaminants = True``): Frankenfield et al. (2022) evaluated the most common contaminants found in proteomics workflows. By default, OmicScope removes them from analyses. If this is not desired, they can be kept in the final results with ``ExcludeContaminants=False``.
10. **degrees_of_freedom** (default, ``degrees_of_freedom = 2``): For longitudinal experiments, users can tune this parameter according to their study, choosing a higher number of degrees of freedom for the subsequent statistical analyses. Note that ExperimentalDesign and pdata must still be appropriately configured.
11. **independent_ttest** (default, ``independent_ttest = True``): If running a t-test, the user can specify whether data sampling was independent (True) or paired (False).
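To illustrate how the ‘pAdjusted’ option and the two cutoffs interact, the sketch below implements the standard Benjamini-Hochberg adjustment and a simple threshold test in plain Python. It is not OmicScope's code: ``benjamini_hochberg`` and ``is_differential`` are hypothetical helpers, and two points are assumptions made for this example, namely that the fold-change cutoff is compared against the absolute log2 fold change and that both comparisons are inclusive.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def is_differential(p_adjusted, log2_fc, pvalue_cutoff=0.05, foldchange_cutoff=0.0):
    """Assumed rule: significant p-value AND |log2(fc)| at or above the cutoff."""
    return p_adjusted <= pvalue_cutoff and abs(log2_fc) >= foldchange_cutoff


# Made-up nominal p-values and log2 fold changes for four proteins.
pvalues = [0.01, 0.04, 0.03, 0.005]
log2_fc = [1.2, -0.3, 0.8, -2.0]

adjusted = benjamini_hochberg(pvalues)
flags = [
    is_differential(p, fc, pvalue_cutoff=0.05, foldchange_cutoff=0.5)
    for p, fc in zip(adjusted, log2_fc)
]
print(adjusted)
print(flags)
```

With a fold-change cutoff of 0, the second condition is always true, which matches the statement above that p-values alone then decide whether a protein is reported.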