How to make pdata
Statistical Analysis Overview
OmicScope offers a toolkit for differential proteomics analysis, with statistical approaches for both static and longitudinal experimental designs (as illustrated in the figure below). By default, OmicScope assumes a static workflow (‘ExperimentalDesign=static’) and uses t-tests or analysis of variance (ANOVA). For longitudinal analyses (‘ExperimentalDesign=longitudinal’), OmicScope assumes that protein abundance varies over time according to a natural cubic spline, as proposed by Storey et al. (2005). This method assesses differences within and between groups over time.
After calculating nominal p-values, OmicScope applies the Benjamini-Hochberg correction to account for multiple hypothesis testing and reports the adjusted p-value (pAdjusted). Alternatively, users have the flexibility to import statistical analyses from other software tools by including a “pvalue” or “pAdjusted” column in the rdata using the General input method.
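As an illustration of the correction step, here is a minimal pure-Python sketch of the Benjamini-Hochberg procedure. The function name `benjamini_hochberg` and the input values are made up for this example; this is not OmicScope's own implementation, only the textbook procedure.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal p-values for four proteins
nominal = [0.01, 0.04, 0.03, 0.20]
p_adjusted = benjamini_hochberg(nominal)
print(p_adjusted)
```

Each nominal p-value is scaled by the number of tests divided by its rank, and the running minimum keeps a smaller p-value from ending up with a larger adjusted value than the one ranked after it.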
Pdata Role
Pdata (also known as phenotype data or metadata) tells OmicScope how the samples relate to one another, which determines the statistical analysis it runs. Pdata must therefore describe each sample in enough detail to compare two or more groups, different time courses, or different classes over time.
When importing data into OmicScope, users must select the appropriate method for data handling. While performing static analysis for Progenesis, Proteome Discoverer, and PatternLab, OmicScope can automatically identify and select groups based on the software’s output. However, when performing longitudinal analysis or using any of the General, DIA-NN, and MaxQuant methods, OmicScope requires pdata to select the appropriate test.
For all cases, OmicScope allows users to incorporate external pdata into the workflow, which helps tailor the statistical analysis to the specific experimental design (as described below).
Static Experimental Design
Static Workflow
Most proteomics experiments aim to compare proteomic signatures between
independent groups, which is why OmicScope defaults to a static
experimental design (designated as 'ExperimentalDesign=static'). The
static workflow involves two main statistical tests: the t-test and
ANOVA.
When comparing two groups, OmicScope conducts an independent t-test
(when independent_ttest=True) if the groups are independent, or a
paired t-test (when independent_ttest=False) if the groups are
related. For comparisons involving more than two groups, OmicScope
employs a one-way ANOVA. Additionally, for proteins with a pAdjusted
value less than 0.05, OmicScope performs a Tukey post-hoc test to
identify and highlight the groups with significant differences.
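The choice between these tests can be sketched with SciPy. The helper `static_test` and the abundance values below are hypothetical and only mirror the decision rule described above; they are not OmicScope's internal code, and the Tukey post-hoc step is not shown.

```python
from scipy import stats


def static_test(groups, independent_ttest=True):
    """Pick and run a test based on the number of groups (hypothetical helper)."""
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    if len(groups) == 2:
        a, b = groups
        if independent_ttest:
            return "independent t-test", stats.ttest_ind(a, b).pvalue
        # Paired test: a[i] and b[i] must be measurements of the same subject
        return "paired t-test", stats.ttest_rel(a, b).pvalue
    return "one-way ANOVA", stats.f_oneway(*groups).pvalue


# Made-up abundances of one protein in three conditions
ctrl = [10.1, 9.8, 10.3, 10.0]
covid = [11.2, 11.0, 11.5, 10.9]
mild = [10.6, 10.4, 10.8, 10.5]

name_two, p_two = static_test([ctrl, covid])
name_paired, p_paired = static_test([ctrl, covid], independent_ttest=False)
name_three, p_three = static_test([ctrl, covid, mild])
print(name_two, name_paired, name_three)
```

In a real analysis the function would be applied per protein, and the resulting nominal p-values would then be corrected for multiple testing.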
Static Pdata
To create a pdata file for running the static workflow, users should include the following columns:
Sample: The name of each sample to be analyzed, matching those in the first row of the Assay sheet.
Condition: Respective group for each sample. All technical and biological replicates belonging to an experimental condition should have the same identifier here.
Biological: Respective biological replicate for each sample. If two or more technical replicates were used for a single biological replicate, those replicates should have the same identifier here.
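The three columns listed above can be assembled and checked with pandas before handing the table to OmicScope. This is a minimal sketch: the sample names and the helper `missing_columns` are invented for illustration and are not part of the OmicScope API.

```python
import pandas as pd

REQUIRED_STATIC = ("Sample", "Condition", "Biological")


def missing_columns(pdata, required=REQUIRED_STATIC):
    """Return the required pdata columns that are absent (hypothetical helper)."""
    return [col for col in required if col not in pdata.columns]


# Two conditions, one biological replicate each, acquired twice
pdata = pd.DataFrame({
    "Sample": ["ctrl_1_a", "ctrl_1_b", "treat_1_a", "treat_1_b"],
    "Condition": ["CTRL", "CTRL", "TREAT", "TREAT"],
    "Biological": [1, 1, 1, 1],
})

missing = missing_columns(pdata)
# Sample names must be unique so each one maps to a single assay column
duplicated_samples = bool(pdata["Sample"].duplicated().any())
print(missing, duplicated_samples)
```

Technical replicates of the same biological replicate share both the Condition and the Biological identifier, as in the rows above.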
Pdata Example 1
In the example below, each sample is assigned to a specific Condition, and each biological replicate is reported. Here, two distinct conditions were documented, and each biological replicate was acquired twice. Once this ‘pdata’ is integrated into the workflow, OmicScope averages the technical replicates and performs an independent t-test.
# Pdata for static experimental designs
import pandas as pd

# Forward slashes keep the relative path portable across operating systems
pdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=2)
pdata
| | Sample | Condition | Biological |
|---|---|---|---|
| 0 | VCC_HB_1_1_2020 | COVID | 1 |
| 1 | VCC_HB_1_2 | COVID | 1 |
| 2 | VCC_HB_2_1 | COVID | 2 |
| 3 | VCC_HB_2_1_2 | COVID | 2 |
| 4 | VCC_HB_3_1 | COVID | 3 |
| 5 | VCC_HB_3_1_2 | COVID | 3 |
| 6 | VCC_HB_4_1 | COVID | 4 |
| 7 | VCC_HB_4_1_2 | COVID | 4 |
| 8 | VCC_HB_5_1 | COVID | 5 |
| 9 | VCC_HB_5_1_2 | COVID | 5 |
| 10 | VCC_HB_6_1 | COVID | 6 |
| 11 | VCC_HB_6_1_2 | COVID | 6 |
| 12 | VCC_HB_7_1 | COVID | 7 |
| 13 | VCC_HB_7_1_2 | COVID | 7 |
| 14 | VCC_HB_8_1 | COVID | 8 |
| 15 | VCC_HB_8_1_2 | COVID | 8 |
| 16 | VCC_HB_9_1 | COVID | 9 |
| 17 | VCC_HB_9_1_2 | COVID | 9 |
| 18 | VCC_HB_10_1 | COVID | 10 |
| 19 | VCC_HB_10_1_2_ | COVID | 10 |
| 20 | VCC_HB_11_1 | COVID | 11 |
| 21 | VCC_HB_11_1_2_ | COVID | 11 |
| 22 | VCC_HB_12_1 | COVID | 12 |
| 23 | VCC_HB_12_1_2_ | COVID | 12 |
| 24 | VCC_HB_A_1 | CTRL | 1 |
| 25 | VCC_HB_A_1_2 | CTRL | 1 |
| 26 | VCC_HB_B_1 | CTRL | 2 |
| 27 | VCC_HB_B_1_2 | CTRL | 2 |
| 28 | VCC_HB_C_1 | CTRL | 3 |
| 29 | VCC_HB_C_1_2 | CTRL | 3 |
| 30 | VCC_HB_D_1 | CTRL | 4 |
| 31 | VCC_HB_D_1_2 | CTRL | 4 |
| 32 | VCC_HB_E_1 | CTRL | 5 |
| 33 | VCC_HB_E_1_2 | CTRL | 5 |
| 34 | VCC_HB_F_1 | CTRL | 6 |
| 35 | VCC_HB_F_1_2 | CTRL | 6 |
| 36 | VCC_HB_G_1 | CTRL | 7 |
| 37 | VCC_HB_G_1_2 | CTRL | 7 |
print('Number of Conditions: ' + str(len(pdata.Condition.drop_duplicates())))
Number of Conditions: 2
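The averaging of technical replicates mentioned for Example 1 can be sketched with a pandas group-by. The `assay` table, its values, and the short sample names are hypothetical; OmicScope's internal handling may differ.

```python
import pandas as pd

pdata = pd.DataFrame({
    "Sample": ["s1", "s2", "s3", "s4"],
    "Condition": ["COVID", "COVID", "CTRL", "CTRL"],
    "Biological": [1, 1, 1, 1],
})
# Hypothetical abundances: one row per protein, one column per sample
assay = pd.DataFrame(
    {"s1": [10.0, 5.0], "s2": [12.0, 7.0], "s3": [8.0, 4.0], "s4": [10.0, 6.0]},
    index=["P1", "P2"],
)

# Put samples on the rows and attach their Condition and Biological labels
per_sample = assay.T.join(pdata.set_index("Sample")[["Condition", "Biological"]])
# Average all technical replicates that share a (Condition, Biological) pair
averaged = per_sample.groupby(["Condition", "Biological"]).mean()
print(averaged)
```

The result has one row per biological replicate, which is the unit a subsequent t-test would compare between conditions.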
Longitudinal Experimental Design
Longitudinal Workflow
To accommodate the potential complexities of longitudinal experimental designs, OmicScope categorizes these experiments into two primary types:
Within-group experiments: These designs aim to identify differentially regulated proteins over time within a single group.
Between-group experiments: These designs aim to detect differential protein regulation over time by comparing different groups.
Pdata workflow
OmicScope manages these distinctions much like the static workflow, examining the number of conditions (#conditions) in the ‘Condition’ column. It selects “Within-group” if #conditions equals 1, and “Between-group” if #conditions exceeds 1. Additionally, in the longitudinal workflow, the user is required to add a “TimeCourse” column to define the sampling frequency of the study.
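The selection rule described above can be sketched as follows. The function `longitudinal_mode` and the two small tables are hypothetical and only restate the rule; they are not taken from OmicScope's source.

```python
import pandas as pd


def longitudinal_mode(pdata):
    """Return the analysis type implied by the Condition column (hypothetical)."""
    if "TimeCourse" not in pdata.columns:
        raise ValueError("longitudinal pdata needs a 'TimeCourse' column")
    n_conditions = pdata["Condition"].nunique()
    return "Within-group" if n_conditions == 1 else "Between-group"


single_group = pd.DataFrame({
    "Condition": ["Control", "Control"],
    "TimeCourse": [1, 3],
})
two_groups = pd.DataFrame({
    "Condition": ["Control", "Treatment"],
    "TimeCourse": [1, 1],
})

mode_single = longitudinal_mode(single_group)
mode_two = longitudinal_mode(two_groups)
print(mode_single, mode_two)
```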
Pdata Example 2
In the example below, the ‘pdata’ contains two distinct groups (12
Control and 12 Treatment biological replicates) in the ‘Condition’
column, indicating a Between-group analysis. Additionally, the
TimeCourse column includes 4 time points, and each biological
replicate was acquired twice.
# Forward slashes keep the relative path portable across operating systems
pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=0)
pdata
| | Sample | Condition | TimeCourse | Biological |
|---|---|---|---|---|
| 0 | Sample1_Day1_Bio1_1 | Control | 1 | 1 |
| 1 | Sample1_Day1_Bio1_2 | Control | 1 | 1 |
| 2 | Sample2_Day1_Bio2_1 | Control | 1 | 2 |
| 3 | Sample2_Day1_Bio2_2 | Control | 1 | 2 |
| 4 | Sample3_Day1_Bio3_1 | Control | 1 | 3 |
| 5 | Sample3_Day1_Bio3_2 | Control | 1 | 3 |
| 6 | Sample4_Day2_Bio1_1 | Control | 3 | 4 |
| 7 | Sample4_Day2_Bio1_2 | Control | 3 | 4 |
| 8 | Sample5_Day2_Bio2_1 | Control | 3 | 5 |
| 9 | Sample5_Day2_Bio2_2 | Control | 3 | 5 |
| 10 | Sample6_Day2_Bio3_1 | Control | 3 | 6 |
| 11 | Sample6_Day2_Bio3_2 | Control | 3 | 6 |
| 12 | Sample7_Day3_Bio1_1 | Control | 5 | 7 |
| 13 | Sample7_Day3_Bio1_2 | Control | 5 | 7 |
| 14 | Sample8_Day3_Bio2_1 | Control | 5 | 8 |
| 15 | Sample8_Day3_Bio2_2 | Control | 5 | 8 |
| 16 | Sample9_Day3_Bio3_1 | Control | 5 | 9 |
| 17 | Sample9_Day3_Bio3_2 | Control | 5 | 9 |
| 18 | Sample10_Day4_Bio1_1 | Control | 7 | 10 |
| 19 | Sample10_Day4_Bio1_2 | Control | 7 | 10 |
| 20 | Sample11_Day4_Bio2_1 | Control | 7 | 11 |
| 21 | Sample11_Day4_Bio2_2 | Control | 7 | 11 |
| 22 | Sample12_Day5_Bio3_1 | Control | 7 | 12 |
| 23 | Sample12_Day5_Bio3_2 | Control | 7 | 12 |
| 24 | Sample13_Day1_Bio1_1 | Treatment | 1 | 13 |
| 25 | Sample13_Day1_Bio1_2 | Treatment | 1 | 13 |
| 26 | Sample14_Day1_Bio2_1 | Treatment | 1 | 14 |
| 27 | Sample14_Day1_Bio2_2 | Treatment | 1 | 14 |
| 28 | Sample15_Day1_Bio3_1 | Treatment | 1 | 15 |
| 29 | Sample15_Day1_Bio3_2 | Treatment | 1 | 15 |
| 30 | Sample16_Day2_Bio1_1 | Treatment | 3 | 16 |
| 31 | Sample16_Day2_Bio1_2 | Treatment | 3 | 16 |
| 32 | Sample17_Day2_Bio2_1 | Treatment | 3 | 17 |
| 33 | Sample17_Day2_Bio2_2 | Treatment | 3 | 17 |
| 34 | Sample18_Day2_Bio3_1 | Treatment | 3 | 18 |
| 35 | Sample18_Day2_Bio3_2 | Treatment | 3 | 18 |
| 36 | Sample19_Day3_Bio1_1 | Treatment | 5 | 19 |
| 37 | Sample19_Day3_Bio1_2 | Treatment | 5 | 19 |
| 38 | Sample20_Day3_Bio2_1 | Treatment | 5 | 20 |
| 39 | Sample20_Day3_Bio2_2 | Treatment | 5 | 20 |
| 40 | Sample21_Day3_Bio3_1 | Treatment | 5 | 21 |
| 41 | Sample21_Day3_Bio3_2 | Treatment | 5 | 21 |
| 42 | Sample22_Day4_Bio1_1 | Treatment | 7 | 22 |
| 43 | Sample22_Day4_Bio1_2 | Treatment | 7 | 22 |
| 44 | Sample23_Day4_Bio2_1 | Treatment | 7 | 23 |
| 45 | Sample23_Day4_Bio2_2 | Treatment | 7 | 23 |
| 46 | Sample24_Day5_Bio3_1 | Treatment | 7 | 24 |
| 47 | Sample24_Day5_Bio3_2 | Treatment | 7 | 24 |
Pdata Example 3
It’s important to note that in some cases researchers may employ independent or related sampling over time. Independent sampling involves evaluating different individuals over time, while related sampling entails assessing the same individuals repeatedly. As OmicScope assumes independent sampling by default, it’s essential to add a fifth column labeled “Individual” if the experimental design involves related sampling. This column associates each sample with its respective individual number.
Using the example provided, when conducting related sampling, the user
should add the Individual column to associate each biological sample
with the corresponding individual.
import pandas as pd
pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=1)
pdata
| | Sample | Condition | Biological | TimeCourse | Individual |
|---|---|---|---|---|---|
| 0 | Sample1_Day1_Bio1_1 | Control | 1 | 1 | 1 |
| 1 | Sample1_Day1_Bio1_2 | Control | 1 | 1 | 1 |
| 2 | Sample2_Day1_Bio2_1 | Control | 2 | 1 | 2 |
| 3 | Sample2_Day1_Bio2_2 | Control | 2 | 1 | 2 |
| 4 | Sample3_Day1_Bio3_1 | Control | 3 | 1 | 3 |
| 5 | Sample3_Day1_Bio3_2 | Control | 3 | 1 | 3 |
| 6 | Sample4_Day2_Bio1_1 | Control | 4 | 3 | 1 |
| 7 | Sample4_Day2_Bio1_2 | Control | 4 | 3 | 1 |
| 8 | Sample5_Day2_Bio2_1 | Control | 5 | 3 | 2 |
| 9 | Sample5_Day2_Bio2_2 | Control | 5 | 3 | 2 |
| 10 | Sample6_Day2_Bio3_1 | Control | 6 | 3 | 3 |
| 11 | Sample6_Day2_Bio3_2 | Control | 6 | 3 | 3 |
| 12 | Sample7_Day3_Bio1_1 | Control | 7 | 5 | 1 |
| 13 | Sample7_Day3_Bio1_2 | Control | 7 | 5 | 1 |
| 14 | Sample8_Day3_Bio2_1 | Control | 8 | 5 | 2 |
| 15 | Sample8_Day3_Bio2_2 | Control | 8 | 5 | 2 |
| 16 | Sample9_Day3_Bio3_1 | Control | 9 | 5 | 3 |
| 17 | Sample9_Day3_Bio3_2 | Control | 9 | 5 | 3 |
| 18 | Sample10_Day4_Bio1_1 | Control | 10 | 7 | 1 |
| 19 | Sample10_Day4_Bio1_2 | Control | 10 | 7 | 1 |
| 20 | Sample11_Day4_Bio2_1 | Control | 11 | 7 | 2 |
| 21 | Sample11_Day4_Bio2_2 | Control | 11 | 7 | 2 |
| 22 | Sample12_Day5_Bio3_1 | Control | 12 | 7 | 3 |
| 23 | Sample12_Day5_Bio3_2 | Control | 12 | 7 | 3 |
| 24 | Sample13_Day1_Bio1_1 | Treatment | 13 | 1 | 4 |
| 25 | Sample13_Day1_Bio1_2 | Treatment | 13 | 1 | 4 |
| 26 | Sample14_Day1_Bio2_1 | Treatment | 14 | 1 | 5 |
| 27 | Sample14_Day1_Bio2_2 | Treatment | 14 | 1 | 5 |
| 28 | Sample15_Day1_Bio3_1 | Treatment | 15 | 1 | 6 |
| 29 | Sample15_Day1_Bio3_2 | Treatment | 15 | 1 | 6 |
| 30 | Sample16_Day2_Bio1_1 | Treatment | 16 | 3 | 4 |
| 31 | Sample16_Day2_Bio1_2 | Treatment | 16 | 3 | 4 |
| 32 | Sample17_Day2_Bio2_1 | Treatment | 17 | 3 | 5 |
| 33 | Sample17_Day2_Bio2_2 | Treatment | 17 | 3 | 5 |
| 34 | Sample18_Day2_Bio3_1 | Treatment | 18 | 3 | 6 |
| 35 | Sample18_Day2_Bio3_2 | Treatment | 18 | 3 | 6 |
| 36 | Sample19_Day3_Bio1_1 | Treatment | 19 | 5 | 4 |
| 37 | Sample19_Day3_Bio1_2 | Treatment | 19 | 5 | 4 |
| 38 | Sample20_Day3_Bio2_1 | Treatment | 20 | 5 | 5 |
| 39 | Sample20_Day3_Bio2_2 | Treatment | 20 | 5 | 5 |
| 40 | Sample21_Day3_Bio3_1 | Treatment | 21 | 5 | 6 |
| 41 | Sample21_Day3_Bio3_2 | Treatment | 21 | 5 | 6 |
| 42 | Sample22_Day4_Bio1_1 | Treatment | 22 | 7 | 4 |
| 43 | Sample22_Day4_Bio1_2 | Treatment | 22 | 7 | 4 |
| 44 | Sample23_Day4_Bio2_1 | Treatment | 23 | 7 | 5 |
| 45 | Sample23_Day4_Bio2_2 | Treatment | 23 | 7 | 5 |
| 46 | Sample24_Day5_Bio3_1 | Treatment | 24 | 7 | 6 |
| 47 | Sample24_Day5_Bio3_2 | Treatment | 24 | 7 | 6 |
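When sample names follow a pattern like the one in this table, the Individual column can be derived rather than typed by hand. The sketch below assumes that the third underscore-separated token (for example "Bio1") identifies the subject within a condition; that naming convention is an assumption about this example, not an OmicScope requirement.

```python
import pandas as pd

# A four-row excerpt using the naming pattern from the table above
pdata = pd.DataFrame({
    "Sample": ["Sample1_Day1_Bio1_1", "Sample2_Day1_Bio2_1",
               "Sample4_Day2_Bio1_1", "Sample13_Day1_Bio1_1"],
    "Condition": ["Control", "Control", "Control", "Treatment"],
    "TimeCourse": [1, 1, 3, 1],
})

# Assumption: the third token of the sample name identifies the subject
subject = pdata["Sample"].str.split("_").str[2]
# Number each (Condition, subject) pair in order of first appearance
codes, _ = pd.factorize(pdata["Condition"] + "/" + subject)
pdata["Individual"] = codes + 1
print(pdata)
```

Rows 0 and 2 receive the same Individual number because they are the same Control subject sampled at two time points, which is what marks the sampling as related.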