How to make pdata

Statistical Analysis Overview

OmicScope offers a toolkit for conducting differential proteomics analyses, covering statistical approaches for both static and longitudinal experimental designs (as illustrated in the figure below). By default, OmicScope assumes a static workflow (designated as ‘ExperimentalDesign=static’) and uses t-tests or analysis of variance (ANOVA) for statistical analysis. For longitudinal analyses (designated as ‘ExperimentalDesign=longitudinal’), OmicScope models protein abundance over time with a natural cubic spline, as proposed by Storey et al. (2005). This method assesses differences within and between groups over time.

After calculating nominal p-values, OmicScope applies the Benjamini-Hochberg correction to account for multiple hypothesis testing and reports the adjusted p-value (pAdjusted). Alternatively, users have the flexibility to import statistical analyses from other software tools by including a “pvalue” or “pAdjusted” column in the rdata using the General input method.
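As an illustration of the correction step described above, the sketch below computes Benjamini-Hochberg adjusted p-values in plain Python. The function name `benjamini_hochberg` and the example p-values are assumptions made for this illustration; this is not OmicScope's internal code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the nominal p-value decreases
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal p-values for four proteins
nominal = [0.01, 0.04, 0.03, 0.20]
p_adjusted = benjamini_hochberg(nominal)
print(p_adjusted)
```

Each adjusted value is the nominal p-value scaled by n/rank, then capped so that the ordering of the proteins is preserved.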

Pdata Role

Pdata (also known as phenotype data or metadata) plays a crucial role in allowing OmicScope to correctly conduct statistical analysis. To do so, pdata must describe each sample well enough for OmicScope to compare two or more groups, different time courses, or different classes over time.

When importing data into OmicScope, users must select the appropriate method for data handling. When performing static analysis on Progenesis, Proteome Discoverer, or PatternLab output, OmicScope can automatically identify and select groups based on the software’s output. However, when performing longitudinal analysis or using the General, DIA-NN, or MaxQuant methods, OmicScope requires pdata to select the appropriate test.

For all cases, OmicScope allows users to incorporate external pdata into the workflow, which helps tailor the statistical analysis to the specific experimental design (as described below).

Static Experimental Design

Static Workflow

Most proteomics experiments aim to compare proteomic signatures between independent groups, which is why OmicScope defaults to a static experimental design (designated as 'ExperimentalDesign=static'). The static workflow involves two main statistical tests: the t-test and ANOVA.

When comparing two groups, OmicScope conducts an independent t-test (when independent_ttest=True) if the groups are independent, or a paired t-test (when independent_ttest=False) if the groups are related. For comparisons involving more than two groups, OmicScope employs a one-way ANOVA. Additionally, for proteins with a pAdjusted value less than 0.05, OmicScope performs Tukey's post-hoc test, which identifies and highlights the groups with significant differences.
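The decision rule above can be summarized in a short sketch. The helper `choose_static_test` is hypothetical and only mirrors the rule as described in this section; it is not part of the OmicScope API. Only the `independent_ttest` flag comes from the text.

```python
def choose_static_test(conditions, independent_ttest=True):
    """Name the test suggested by the number of distinct conditions."""
    n_conditions = len(set(conditions))
    if n_conditions < 2:
        raise ValueError("A static comparison needs at least two conditions")
    if n_conditions == 2:
        # Two groups: independent or paired t-test, depending on the flag
        return "independent t-test" if independent_ttest else "paired t-test"
    # More than two groups: one-way ANOVA (followed by Tukey's post-hoc test)
    return "one-way ANOVA"


print(choose_static_test(["COVID", "COVID", "CTRL", "CTRL"]))
print(choose_static_test(["A", "B", "C"]))
```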

Static Pdata

To create a pdata file for running the static workflow, users should include the following columns:

  1. Sample: The name of each sample to be analyzed, matching those in the first row of the Assay sheet.

  2. Condition: Respective group for each sample. All technical and biological replicates belonging to an experimental condition should have the same identifier here.

  3. Biological: Respective biological replicate for each sample. If two or more technical replicates were used for a single biological replicate, those replicates should have the same identifier here.
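As a minimal sketch of the three columns listed above, the snippet below builds a small pdata table with pandas and runs two basic sanity checks. The sample names and condition labels are invented for illustration, and the checks are a suggestion rather than something OmicScope is known to perform.

```python
import pandas as pd

# Hypothetical samples: two conditions, two biological replicates each,
# every biological replicate acquired twice (technical replicates)
pdata = pd.DataFrame({
    "Sample": ["C1_a", "C1_b", "C2_a", "C2_b", "T1_a", "T1_b", "T2_a", "T2_b"],
    "Condition": ["CTRL"] * 4 + ["TREATED"] * 4,
    "Biological": [1, 1, 2, 2, 1, 1, 2, 2],
})

# Check that the three expected columns are present
required = ["Sample", "Condition", "Biological"]
missing = [col for col in required if col not in pdata.columns]

# Sample names should be unique so they can be matched to the Assay sheet
duplicated_samples = pdata["Sample"][pdata["Sample"].duplicated()].tolist()

print("Missing columns:", missing)
print("Duplicated samples:", duplicated_samples)
```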

Pdata Example 1

In the example below, each sample is assigned to a specific Condition, and each biological replicate is reported. Here, two distinct conditions were documented, and each biological replicate was acquired twice. Once this ‘pdata’ is integrated into the OmicScope workflow, OmicScope averages the technical replicates of each biological replicate and performs an independent t-test.

# Pdata for static experimental designs
import pandas as pd
pdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=2)
pdata
Sample Condition Biological
0 VCC_HB_1_1_2020 COVID 1
1 VCC_HB_1_2 COVID 1
2 VCC_HB_2_1 COVID 2
3 VCC_HB_2_1_2 COVID 2
4 VCC_HB_3_1 COVID 3
5 VCC_HB_3_1_2 COVID 3
6 VCC_HB_4_1 COVID 4
7 VCC_HB_4_1_2 COVID 4
8 VCC_HB_5_1 COVID 5
9 VCC_HB_5_1_2 COVID 5
10 VCC_HB_6_1 COVID 6
11 VCC_HB_6_1_2 COVID 6
12 VCC_HB_7_1 COVID 7
13 VCC_HB_7_1_2 COVID 7
14 VCC_HB_8_1 COVID 8
15 VCC_HB_8_1_2 COVID 8
16 VCC_HB_9_1 COVID 9
17 VCC_HB_9_1_2 COVID 9
18 VCC_HB_10_1 COVID 10
19 VCC_HB_10_1_2_ COVID 10
20 VCC_HB_11_1 COVID 11
21 VCC_HB_11_1_2_ COVID 11
22 VCC_HB_12_1 COVID 12
23 VCC_HB_12_1_2_ COVID 12
24 VCC_HB_A_1 CTRL 1
25 VCC_HB_A_1_2 CTRL 1
26 VCC_HB_B_1 CTRL 2
27 VCC_HB_B_1_2 CTRL 2
28 VCC_HB_C_1 CTRL 3
29 VCC_HB_C_1_2 CTRL 3
30 VCC_HB_D_1 CTRL 4
31 VCC_HB_D_1_2 CTRL 4
32 VCC_HB_E_1 CTRL 5
33 VCC_HB_E_1_2 CTRL 5
34 VCC_HB_F_1 CTRL 6
35 VCC_HB_F_1_2 CTRL 6
36 VCC_HB_G_1 CTRL 7
37 VCC_HB_G_1_2 CTRL 7
print('Number of Conditions: ' + str(len(pdata.Condition.drop_duplicates())))
Number of Conditions: 2
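The averaging of technical replicates mentioned in this example can be sketched with pandas as follows. The assay values, sample names, and protein names are invented, and the approach (transpose, join with pdata, group by Condition and Biological) is an assumption about one reasonable way to do it, not OmicScope's actual implementation.

```python
import pandas as pd

pdata = pd.DataFrame({
    "Sample": ["A_1", "A_2", "B_1", "B_2"],
    "Condition": ["COVID", "COVID", "CTRL", "CTRL"],
    "Biological": [1, 1, 1, 1],
})

# Hypothetical abundances: one row per protein, one column per sample
assay = pd.DataFrame(
    {"A_1": [10.0, 4.0], "A_2": [12.0, 6.0], "B_1": [20.0, 8.0], "B_2": [22.0, 8.0]},
    index=["ProteinX", "ProteinY"],
)

# Put samples on the rows, attach their pdata annotation, then average
# all technical replicates that share a Condition and Biological identifier
annotated = assay.T.join(pdata.set_index("Sample"))
bio_means = annotated.groupby(["Condition", "Biological"]).mean()
print(bio_means)
```

The resulting table has one row per biological replicate, which is the unit the t-test then compares between conditions.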

Longitudinal Experimental Design

Longitudinal Workflow

To accommodate the potential complexities of longitudinal experimental designs, OmicScope categorizes these experiments into two primary types:

  1. Within-group experiments: These designs aim to identify differentially regulated proteins over time within a single group.

  2. Between-group experiments: These designs aim to detect differential protein regulation over time by comparing different groups.

Pdata workflow

OmicScope manages these distinctions much like the static workflow, examining the number of conditions (#conditions) in the ‘Condition’ column. It selects “Within-group” if #conditions is equal to 1, and “Between-group” if #conditions exceeds 1. Additionally, in the longitudinal workflow, the user is required to add a “TimeCourse” column that records the time point at which each sample was collected.
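The selection rule above can be written as a small sketch. The function `longitudinal_mode` is hypothetical and simply restates the rule from this section; it is not an OmicScope function.

```python
def longitudinal_mode(columns, conditions):
    """Return the longitudinal analysis type implied by the pdata."""
    # The longitudinal workflow requires a TimeCourse column
    if "TimeCourse" not in columns:
        raise ValueError("Longitudinal pdata must contain a 'TimeCourse' column")
    n_conditions = len(set(conditions))
    return "Within-group" if n_conditions == 1 else "Between-group"


columns = ["Sample", "Condition", "TimeCourse", "Biological"]
print(longitudinal_mode(columns, ["Control", "Control", "Control"]))
print(longitudinal_mode(columns, ["Control", "Treatment"]))
```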

Pdata Example 2

In the example below, the ‘pdata’ contains two distinct groups in the ‘Condition’ column (Control and Treatment, with 12 biological replicates each), indicating a Between-group analysis. Additionally, the TimeCourse column includes 4 time points (1, 3, 5, and 7), and each biological replicate was acquired twice.

pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=0)
pdata
Sample Condition TimeCourse Biological
0 Sample1_Day1_Bio1_1 Control 1 1
1 Sample1_Day1_Bio1_2 Control 1 1
2 Sample2_Day1_Bio2_1 Control 1 2
3 Sample2_Day1_Bio2_2 Control 1 2
4 Sample3_Day1_Bio3_1 Control 1 3
5 Sample3_Day1_Bio3_2 Control 1 3
6 Sample4_Day2_Bio1_1 Control 3 4
7 Sample4_Day2_Bio1_2 Control 3 4
8 Sample5_Day2_Bio2_1 Control 3 5
9 Sample5_Day2_Bio2_2 Control 3 5
10 Sample6_Day2_Bio3_1 Control 3 6
11 Sample6_Day2_Bio3_2 Control 3 6
12 Sample7_Day3_Bio1_1 Control 5 7
13 Sample7_Day3_Bio1_2 Control 5 7
14 Sample8_Day3_Bio2_1 Control 5 8
15 Sample8_Day3_Bio2_2 Control 5 8
16 Sample9_Day3_Bio3_1 Control 5 9
17 Sample9_Day3_Bio3_2 Control 5 9
18 Sample10_Day4_Bio1_1 Control 7 10
19 Sample10_Day4_Bio1_2 Control 7 10
20 Sample11_Day4_Bio2_1 Control 7 11
21 Sample11_Day4_Bio2_2 Control 7 11
22 Sample12_Day5_Bio3_1 Control 7 12
23 Sample12_Day5_Bio3_2 Control 7 12
24 Sample13_Day1_Bio1_1 Treatment 1 13
25 Sample13_Day1_Bio1_2 Treatment 1 13
26 Sample14_Day1_Bio2_1 Treatment 1 14
27 Sample14_Day1_Bio2_2 Treatment 1 14
28 Sample15_Day1_Bio3_1 Treatment 1 15
29 Sample15_Day1_Bio3_2 Treatment 1 15
30 Sample16_Day2_Bio1_1 Treatment 3 16
31 Sample16_Day2_Bio1_2 Treatment 3 16
32 Sample17_Day2_Bio2_1 Treatment 3 17
33 Sample17_Day2_Bio2_2 Treatment 3 17
34 Sample18_Day2_Bio3_1 Treatment 3 18
35 Sample18_Day2_Bio3_2 Treatment 3 18
36 Sample19_Day3_Bio1_1 Treatment 5 19
37 Sample19_Day3_Bio1_2 Treatment 5 19
38 Sample20_Day3_Bio2_1 Treatment 5 20
39 Sample20_Day3_Bio2_2 Treatment 5 20
40 Sample21_Day3_Bio3_1 Treatment 5 21
41 Sample21_Day3_Bio3_2 Treatment 5 21
42 Sample22_Day4_Bio1_1 Treatment 7 22
43 Sample22_Day4_Bio1_2 Treatment 7 22
44 Sample23_Day4_Bio2_1 Treatment 7 23
45 Sample23_Day4_Bio2_2 Treatment 7 23
46 Sample24_Day5_Bio3_1 Treatment 7 24
47 Sample24_Day5_Bio3_2 Treatment 7 24

Pdata Example 3

It’s important to note that in some cases researchers may employ independent or related sampling over time. Independent sampling involves evaluating different individuals over time, while related sampling entails assessing the same individuals repeatedly. As OmicScope assumes independent sampling by default, it’s essential to add a fifth column labeled “Individual” if the experimental design involves related sampling. This column associates each sample with its respective individual number.

Using the example provided, when conducting related sampling, the user should add the Individual column to associate each biological sample with the corresponding individual.

import pandas as pd
pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=1)
pdata
Sample Condition Biological TimeCourse Individual
0 Sample1_Day1_Bio1_1 Control 1 1 1
1 Sample1_Day1_Bio1_2 Control 1 1 1
2 Sample2_Day1_Bio2_1 Control 2 1 2
3 Sample2_Day1_Bio2_2 Control 2 1 2
4 Sample3_Day1_Bio3_1 Control 3 1 3
5 Sample3_Day1_Bio3_2 Control 3 1 3
6 Sample4_Day2_Bio1_1 Control 4 3 1
7 Sample4_Day2_Bio1_2 Control 4 3 1
8 Sample5_Day2_Bio2_1 Control 5 3 2
9 Sample5_Day2_Bio2_2 Control 5 3 2
10 Sample6_Day2_Bio3_1 Control 6 3 3
11 Sample6_Day2_Bio3_2 Control 6 3 3
12 Sample7_Day3_Bio1_1 Control 7 5 1
13 Sample7_Day3_Bio1_2 Control 7 5 1
14 Sample8_Day3_Bio2_1 Control 8 5 2
15 Sample8_Day3_Bio2_2 Control 8 5 2
16 Sample9_Day3_Bio3_1 Control 9 5 3
17 Sample9_Day3_Bio3_2 Control 9 5 3
18 Sample10_Day4_Bio1_1 Control 10 7 1
19 Sample10_Day4_Bio1_2 Control 10 7 1
20 Sample11_Day4_Bio2_1 Control 11 7 2
21 Sample11_Day4_Bio2_2 Control 11 7 2
22 Sample12_Day5_Bio3_1 Control 12 7 3
23 Sample12_Day5_Bio3_2 Control 12 7 3
24 Sample13_Day1_Bio1_1 Treatment 13 1 4
25 Sample13_Day1_Bio1_2 Treatment 13 1 4
26 Sample14_Day1_Bio2_1 Treatment 14 1 5
27 Sample14_Day1_Bio2_2 Treatment 14 1 5
28 Sample15_Day1_Bio3_1 Treatment 15 1 6
29 Sample15_Day1_Bio3_2 Treatment 15 1 6
30 Sample16_Day2_Bio1_1 Treatment 16 3 4
31 Sample16_Day2_Bio1_2 Treatment 16 3 4
32 Sample17_Day2_Bio2_1 Treatment 17 3 5
33 Sample17_Day2_Bio2_2 Treatment 17 3 5
34 Sample18_Day2_Bio3_1 Treatment 18 3 6
35 Sample18_Day2_Bio3_2 Treatment 18 3 6
36 Sample19_Day3_Bio1_1 Treatment 19 5 4
37 Sample19_Day3_Bio1_2 Treatment 19 5 4
38 Sample20_Day3_Bio2_1 Treatment 20 5 5
39 Sample20_Day3_Bio2_2 Treatment 20 5 5
40 Sample21_Day3_Bio3_1 Treatment 21 5 6
41 Sample21_Day3_Bio3_2 Treatment 21 5 6
42 Sample22_Day4_Bio1_1 Treatment 22 7 4
43 Sample22_Day4_Bio1_2 Treatment 22 7 4
44 Sample23_Day4_Bio2_1 Treatment 23 7 5
45 Sample23_Day4_Bio2_2 Treatment 23 7 5
46 Sample24_Day5_Bio3_1 Treatment 24 7 6
47 Sample24_Day5_Bio3_2 Treatment 24 7 6
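Before running a related-sampling analysis, it can help to verify that the Individual column is consistent with the design. The sketch below, using a small invented pdata, checks that every individual was sampled at every time point and belongs to a single condition. These checks are a suggestion made under the assumption of a balanced design; OmicScope itself may not require or perform them.

```python
import pandas as pd

# Hypothetical related-sampling pdata: four individuals, two time points
pdata = pd.DataFrame({
    "Sample": [f"S{i}" for i in range(1, 9)],
    "Condition": ["Control"] * 4 + ["Treatment"] * 4,
    "Biological": list(range(1, 9)),
    "TimeCourse": [1, 1, 3, 3, 1, 1, 3, 3],
    "Individual": [1, 2, 1, 2, 3, 4, 3, 4],
})

all_times = set(pdata["TimeCourse"])

# Time points observed for each individual
times_per_individual = pdata.groupby("Individual")["TimeCourse"].apply(set)
incomplete = [ind for ind, times in times_per_individual.items() if times != all_times]

# Each individual should belong to exactly one condition
conditions_per_individual = pdata.groupby("Individual")["Condition"].nunique()
mixed = conditions_per_individual[conditions_per_individual > 1].index.tolist()

print("Individuals missing a time point:", incomplete)
print("Individuals assigned to several conditions:", mixed)
```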