How to make pdata¶

Statistical Analysis Overview¶

OmicScope offers a toolkit for differential proteomics analysis, covering statistical approaches for both static and longitudinal experimental designs (as illustrated in the figure below). By default, OmicScope assumes a static workflow (designated as ‘ExperimentalDesign=static’) and uses t-tests or analysis of variance (ANOVA). For longitudinal analyses (designated as ‘ExperimentalDesign=longitudinal’), OmicScope assumes that protein abundance varies over time according to a natural cubic spline, as proposed by Storey et al. (2005). This method assesses differences within and between groups over time.

After calculating nominal p-values, OmicScope applies the Benjamini-Hochberg correction to account for multiple hypothesis testing and reports the adjusted p-value (pAdjusted). Alternatively, users can import statistical results from other software by including a “pvalue” or “pAdjusted” column in the rdata when using the General input method.
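The Benjamini-Hochberg step can be illustrated with a short, dependency-free sketch. This is not OmicScope's own code; the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up nominal p-values for four proteins
nominal = [0.01, 0.04, 0.03, 0.20]
p_adjusted = benjamini_hochberg(nominal)
print(p_adjusted)
```

Each nominal p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the nominal p-values increase.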

Pdata Role¶

Pdata (also known as phenotype data or metadata) is what allows OmicScope to conduct the statistical analysis correctly. To do so, pdata must describe each sample in enough detail for OmicScope to compare two or more groups, different time courses, or different classes over time.

When importing data into OmicScope, users must select the appropriate method for data handling. For static analyses of Progenesis, Proteome Discoverer, and PatternLab data, OmicScope can automatically identify and select groups based on the software’s output. However, for longitudinal analyses, or when using the General, DIA-NN, or MaxQuant methods, OmicScope requires pdata to select the appropriate test.

For all cases, OmicScope allows users to incorporate external pdata into the workflow, which helps tailor the statistical analysis to the specific experimental design (as described below).

Static Experimental Design¶

Static Workflow¶

Most proteomics experiments aim to compare proteomic signatures between independent groups, which is why OmicScope defaults to a static experimental design (designated as 'ExperimentalDesign=static'). The static workflow involves two main statistical tests: the t-test and ANOVA.

When comparing two groups, OmicScope conducts an independent t-test (when independent_ttest=True) if the groups are independent, or a paired t-test (when independent_ttest=False) if the groups are related. For comparisons involving more than two groups, OmicScope employs a one-way ANOVA. Additionally, for proteins with a pAdjusted value below 0.05, OmicScope performs a Tukey post-hoc test, which identifies the pairs of groups that differ significantly.
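The decision rule described above can be summarized in a small sketch. The helper `select_static_test` is hypothetical and only mirrors the description in this section; it is not part of the OmicScope API.

```python
def select_static_test(conditions, independent_ttest=True):
    """Name the test that the static workflow description implies."""
    n_conditions = len(set(conditions))
    if n_conditions < 2:
        raise ValueError("A static comparison needs at least two conditions")
    if n_conditions == 2:
        return "independent t-test" if independent_ttest else "paired t-test"
    # More than two groups: one-way ANOVA, followed by a Tukey post-hoc
    # test for proteins whose pAdjusted is below 0.05
    return "one-way ANOVA + Tukey post-hoc"


print(select_static_test(["COVID", "COVID", "CTRL", "CTRL"]))
print(select_static_test(["A", "B", "A", "B"], independent_ttest=False))
print(select_static_test(["A", "B", "C"]))
```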

Static Pdata¶

To create a pdata file for running the static workflow, users should include the following columns:

Sample: The name of each sample to be analyzed, matching those in the first row of the Assay sheet.
Condition: Respective group for each sample. All technical and biological replicates belonging to an experimental condition should have the same identifier here.
Biological: Respective biological replicate for each sample. If two or more technical replicates were used for a single biological replicate, those replicates should have the same identifier here.
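As a minimal sketch, a static pdata table with these three columns can be assembled in pandas and checked before it is saved. The sample names and group labels below are made up, and the checks are a suggestion rather than something OmicScope is known to run.

```python
import pandas as pd

# Hypothetical sample names; in practice they must match the Assay sheet header
pdata = pd.DataFrame({
    "Sample": ["ctrl_1_a", "ctrl_1_b", "ctrl_2_a",
               "treat_1_a", "treat_1_b", "treat_2_a"],
    "Condition": ["CTRL", "CTRL", "CTRL", "TREAT", "TREAT", "TREAT"],
    # Technical replicates of the same biological replicate share an identifier
    "Biological": [1, 1, 2, 1, 1, 2],
})

required = ["Sample", "Condition", "Biological"]
missing = [col for col in required if col not in pdata.columns]
duplicated_samples = pdata.loc[pdata["Sample"].duplicated(), "Sample"].tolist()

print("Missing columns:", missing)
print("Duplicated sample names:", duplicated_samples)

# Saving to a spreadsheet needs an Excel writer such as openpyxl installed
# pdata.to_excel("pdata.xlsx", index=False)
```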

Pdata Example 1¶

In the example below, each sample is assigned to a specific Condition, and each biological replicate is reported. Two distinct conditions are documented, and each biological replicate was acquired twice. Once this ‘pdata’ is integrated into the OmicScope workflow, OmicScope averages the technical replicates of each biological replicate and performs an independent t-test.

# Pdata for static experimental designs
import pandas as pd
pdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=2)
pdata

	Sample	Condition	Biological
0	VCC_HB_1_1_2020	COVID	1
1	VCC_HB_1_2	COVID	1
2	VCC_HB_2_1	COVID	2
3	VCC_HB_2_1_2	COVID	2
4	VCC_HB_3_1	COVID	3
5	VCC_HB_3_1_2	COVID	3
6	VCC_HB_4_1	COVID	4
7	VCC_HB_4_1_2	COVID	4
8	VCC_HB_5_1	COVID	5
9	VCC_HB_5_1_2	COVID	5
10	VCC_HB_6_1	COVID	6
11	VCC_HB_6_1_2	COVID	6
12	VCC_HB_7_1	COVID	7
13	VCC_HB_7_1_2	COVID	7
14	VCC_HB_8_1	COVID	8
15	VCC_HB_8_1_2	COVID	8
16	VCC_HB_9_1	COVID	9
17	VCC_HB_9_1_2	COVID	9
18	VCC_HB_10_1	COVID	10
19	VCC_HB_10_1_2_	COVID	10
20	VCC_HB_11_1	COVID	11
21	VCC_HB_11_1_2_	COVID	11
22	VCC_HB_12_1	COVID	12
23	VCC_HB_12_1_2_	COVID	12
24	VCC_HB_A_1	CTRL	1
25	VCC_HB_A_1_2	CTRL	1
26	VCC_HB_B_1	CTRL	2
27	VCC_HB_B_1_2	CTRL	2
28	VCC_HB_C_1	CTRL	3
29	VCC_HB_C_1_2	CTRL	3
30	VCC_HB_D_1	CTRL	4
31	VCC_HB_D_1_2	CTRL	4
32	VCC_HB_E_1	CTRL	5
33	VCC_HB_E_1_2	CTRL	5
34	VCC_HB_F_1	CTRL	6
35	VCC_HB_F_1_2	CTRL	6
36	VCC_HB_G_1	CTRL	7
37	VCC_HB_G_1_2	CTRL	7

print('Number of Conditions: ' + str(len(pdata.Condition.drop_duplicates())))

Number of Conditions: 2
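The averaging of technical replicates mentioned in this example can be sketched with a pandas group-by. The abundance values and short sample names are invented for illustration, and the sketch does not claim to reproduce OmicScope's internal implementation.

```python
import pandas as pd

# Reduced, hypothetical pdata: one biological replicate per condition,
# each acquired twice
pdata = pd.DataFrame({
    "Sample": ["s1", "s2", "s3", "s4"],
    "Condition": ["COVID", "COVID", "CTRL", "CTRL"],
    "Biological": [1, 1, 1, 1],
})

# Made-up abundances of a single protein, keyed by sample name
abundance = {"s1": 10.0, "s2": 12.0, "s3": 20.0, "s4": 22.0}

# Attach abundances to the pdata rows, then average the technical
# replicates within each (Condition, Biological) pair
merged = pdata.assign(abundance=pdata["Sample"].map(abundance))
per_replicate = merged.groupby(["Condition", "Biological"])["abundance"].mean()
print(per_replicate)
```

The resulting per-replicate means are the values a two-group test would then compare.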

Longitudinal Experimental Design¶

Longitudinal Workflow¶

To accommodate the potential complexities of longitudinal experimental designs, OmicScope categorizes these experiments into two primary types:

Within-group experiments: These designs aim to identify differentially regulated proteins over time within a single group.
Between-group experiments: These designs aim to detect differential protein regulation over time by comparing different groups.

Pdata workflow¶

OmicScope distinguishes these designs much as it does in the static workflow, by counting the number of distinct conditions (#conditions) in the ‘Condition’ column. It selects “Within-group” if #conditions equals 1, and “Between-group” if #conditions exceeds 1. In addition, the longitudinal workflow requires a “TimeCourse” column that records the time point at which each sample was collected.
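A minimal sketch of this selection rule follows. The function `describe_longitudinal_design` is hypothetical and takes the pdata columns as a plain dictionary; it only restates the rule above and is not OmicScope's own code.

```python
def describe_longitudinal_design(pdata_columns):
    """Return the analysis mode and the sorted time points.

    pdata_columns maps a column name to the list of values in that column.
    """
    if "TimeCourse" not in pdata_columns:
        raise KeyError("Longitudinal pdata needs a 'TimeCourse' column")
    n_conditions = len(set(pdata_columns["Condition"]))
    if n_conditions == 0:
        raise ValueError("The 'Condition' column is empty")
    mode = "Within-group" if n_conditions == 1 else "Between-group"
    time_points = sorted(set(pdata_columns["TimeCourse"]))
    return mode, time_points


single_group = {"Condition": ["Control"] * 4, "TimeCourse": [1, 1, 3, 3]}
two_groups = {"Condition": ["Control", "Control", "Treatment", "Treatment"],
              "TimeCourse": [1, 3, 1, 3]}

print(describe_longitudinal_design(single_group))
print(describe_longitudinal_design(two_groups))
```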

Pdata Example 2¶

In the example below, the ‘pdata’ contains two distinct groups in the ‘Condition’ column (12 Control and 12 Treatment biological replicates), indicating a Between-group analysis. The TimeCourse column includes 4 time points, and each biological replicate was acquired twice.

pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=0)
pdata

	Sample	Condition	TimeCourse	Biological
0	Sample1_Day1_Bio1_1	Control	1	1
1	Sample1_Day1_Bio1_2	Control	1	1
2	Sample2_Day1_Bio2_1	Control	1	2
3	Sample2_Day1_Bio2_2	Control	1	2
4	Sample3_Day1_Bio3_1	Control	1	3
5	Sample3_Day1_Bio3_2	Control	1	3
6	Sample4_Day2_Bio1_1	Control	3	4
7	Sample4_Day2_Bio1_2	Control	3	4
8	Sample5_Day2_Bio2_1	Control	3	5
9	Sample5_Day2_Bio2_2	Control	3	5
10	Sample6_Day2_Bio3_1	Control	3	6
11	Sample6_Day2_Bio3_2	Control	3	6
12	Sample7_Day3_Bio1_1	Control	5	7
13	Sample7_Day3_Bio1_2	Control	5	7
14	Sample8_Day3_Bio2_1	Control	5	8
15	Sample8_Day3_Bio2_2	Control	5	8
16	Sample9_Day3_Bio3_1	Control	5	9
17	Sample9_Day3_Bio3_2	Control	5	9
18	Sample10_Day4_Bio1_1	Control	7	10
19	Sample10_Day4_Bio1_2	Control	7	10
20	Sample11_Day4_Bio2_1	Control	7	11
21	Sample11_Day4_Bio2_2	Control	7	11
22	Sample12_Day5_Bio3_1	Control	7	12
23	Sample12_Day5_Bio3_2	Control	7	12
24	Sample13_Day1_Bio1_1	Treatment	1	13
25	Sample13_Day1_Bio1_2	Treatment	1	13
26	Sample14_Day1_Bio2_1	Treatment	1	14
27	Sample14_Day1_Bio2_2	Treatment	1	14
28	Sample15_Day1_Bio3_1	Treatment	1	15
29	Sample15_Day1_Bio3_2	Treatment	1	15
30	Sample16_Day2_Bio1_1	Treatment	3	16
31	Sample16_Day2_Bio1_2	Treatment	3	16
32	Sample17_Day2_Bio2_1	Treatment	3	17
33	Sample17_Day2_Bio2_2	Treatment	3	17
34	Sample18_Day2_Bio3_1	Treatment	3	18
35	Sample18_Day2_Bio3_2	Treatment	3	18
36	Sample19_Day3_Bio1_1	Treatment	5	19
37	Sample19_Day3_Bio1_2	Treatment	5	19
38	Sample20_Day3_Bio2_1	Treatment	5	20
39	Sample20_Day3_Bio2_2	Treatment	5	20
40	Sample21_Day3_Bio3_1	Treatment	5	21
41	Sample21_Day3_Bio3_2	Treatment	5	21
42	Sample22_Day4_Bio1_1	Treatment	7	22
43	Sample22_Day4_Bio1_2	Treatment	7	22
44	Sample23_Day4_Bio2_1	Treatment	7	23
45	Sample23_Day4_Bio2_2	Treatment	7	23
46	Sample24_Day5_Bio3_1	Treatment	7	24
47	Sample24_Day5_Bio3_2	Treatment	7	24
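Before running a longitudinal analysis, it can help to tabulate how many biological replicates fall into each Condition and time point. The sketch below uses a reduced, hypothetical pdata with the same columns as the example above; the sample names are invented.

```python
import pandas as pd

# Reduced, hypothetical pdata: two conditions, two time points,
# each biological replicate acquired twice
pdata = pd.DataFrame({
    "Sample": ["c_d1_b1_1", "c_d1_b1_2", "c_d3_b2_1", "c_d3_b2_2",
               "t_d1_b3_1", "t_d1_b3_2", "t_d3_b4_1", "t_d3_b4_2"],
    "Condition": ["Control"] * 4 + ["Treatment"] * 4,
    "TimeCourse": [1, 1, 3, 3, 1, 1, 3, 3],
    "Biological": [1, 1, 2, 2, 3, 3, 4, 4],
})

# Distinct biological replicates in each Condition x TimeCourse cell
design = (
    pdata.groupby(["Condition", "TimeCourse"])["Biological"]
    .nunique()
    .unstack()
)
print(design)
```

An empty or unexpectedly small cell in this table usually points to a mislabelled sample.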

Pdata Example 3¶

Longitudinal studies may use independent or related sampling. Independent sampling evaluates different individuals at each time point, while related sampling assesses the same individuals repeatedly. Because OmicScope assumes independent sampling by default, a fifth column labeled “Individual” must be added if the experimental design involves related sampling. This column associates each sample with its respective individual number.

In the example below, which uses related sampling, the Individual column associates each biological sample with the corresponding individual.

import pandas as pd
pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=1)
pdata

	Sample	Condition	Biological	TimeCourse	Individual
0	Sample1_Day1_Bio1_1	Control	1	1	1
1	Sample1_Day1_Bio1_2	Control	1	1	1
2	Sample2_Day1_Bio2_1	Control	2	1	2
3	Sample2_Day1_Bio2_2	Control	2	1	2
4	Sample3_Day1_Bio3_1	Control	3	1	3
5	Sample3_Day1_Bio3_2	Control	3	1	3
6	Sample4_Day2_Bio1_1	Control	4	3	1
7	Sample4_Day2_Bio1_2	Control	4	3	1
8	Sample5_Day2_Bio2_1	Control	5	3	2
9	Sample5_Day2_Bio2_2	Control	5	3	2
10	Sample6_Day2_Bio3_1	Control	6	3	3
11	Sample6_Day2_Bio3_2	Control	6	3	3
12	Sample7_Day3_Bio1_1	Control	7	5	1
13	Sample7_Day3_Bio1_2	Control	7	5	1
14	Sample8_Day3_Bio2_1	Control	8	5	2
15	Sample8_Day3_Bio2_2	Control	8	5	2
16	Sample9_Day3_Bio3_1	Control	9	5	3
17	Sample9_Day3_Bio3_2	Control	9	5	3
18	Sample10_Day4_Bio1_1	Control	10	7	1
19	Sample10_Day4_Bio1_2	Control	10	7	1
20	Sample11_Day4_Bio2_1	Control	11	7	2
21	Sample11_Day4_Bio2_2	Control	11	7	2
22	Sample12_Day5_Bio3_1	Control	12	7	3
23	Sample12_Day5_Bio3_2	Control	12	7	3
24	Sample13_Day1_Bio1_1	Treatment	13	1	4
25	Sample13_Day1_Bio1_2	Treatment	13	1	4
26	Sample14_Day1_Bio2_1	Treatment	14	1	5
27	Sample14_Day1_Bio2_2	Treatment	14	1	5
28	Sample15_Day1_Bio3_1	Treatment	15	1	6
29	Sample15_Day1_Bio3_2	Treatment	15	1	6
30	Sample16_Day2_Bio1_1	Treatment	16	3	4
31	Sample16_Day2_Bio1_2	Treatment	16	3	4
32	Sample17_Day2_Bio2_1	Treatment	17	3	5
33	Sample17_Day2_Bio2_2	Treatment	17	3	5
34	Sample18_Day2_Bio3_1	Treatment	18	3	6
35	Sample18_Day2_Bio3_2	Treatment	18	3	6
36	Sample19_Day3_Bio1_1	Treatment	19	5	4
37	Sample19_Day3_Bio1_2	Treatment	19	5	4
38	Sample20_Day3_Bio2_1	Treatment	20	5	5
39	Sample20_Day3_Bio2_2	Treatment	20	5	5
40	Sample21_Day3_Bio3_1	Treatment	21	5	6
41	Sample21_Day3_Bio3_2	Treatment	21	5	6
42	Sample22_Day4_Bio1_1	Treatment	22	7	4
43	Sample22_Day4_Bio1_2	Treatment	22	7	4
44	Sample23_Day4_Bio2_1	Treatment	23	7	5
45	Sample23_Day4_Bio2_2	Treatment	23	7	5
46	Sample24_Day5_Bio3_1	Treatment	24	7	6
47	Sample24_Day5_Bio3_2	Treatment	24	7	6
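With related sampling, each individual is expected to appear at every time point. A quick consistency check can be sketched as follows; the reduced pdata and its sample names are hypothetical, and the check is a suggestion rather than an OmicScope requirement stated in this guide.

```python
import pandas as pd

# Reduced, hypothetical pdata: two individuals followed over two time points
pdata = pd.DataFrame({
    "Sample": ["i1_d1_1", "i1_d1_2", "i2_d1_1", "i2_d1_2",
               "i1_d3_1", "i1_d3_2", "i2_d3_1", "i2_d3_2"],
    "Condition": ["Control"] * 8,
    "Biological": [1, 1, 2, 2, 3, 3, 4, 4],
    "TimeCourse": [1, 1, 1, 1, 3, 3, 3, 3],
    "Individual": [1, 1, 2, 2, 1, 1, 2, 2],
})

# Count the distinct time points at which each individual was sampled
expected_points = pdata["TimeCourse"].nunique()
points_per_individual = pdata.groupby("Individual")["TimeCourse"].nunique()

# Individuals missing at least one time point
incomplete = points_per_individual[
    points_per_individual < expected_points
].index.tolist()
print("Individuals missing time points:", incomplete)
```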