How to make pdata
=================
Statistical Analysis Overview
-----------------------------
OmicScope offers a toolkit for differential proteomics analysis, with
statistical approaches for both static and longitudinal experimental
designs (as illustrated in the figure below). By default, OmicScope
assumes a static workflow (``ExperimentalDesign='static'``) and applies
a t-test or an analysis of variance (ANOVA). For longitudinal analyses
(``ExperimentalDesign='longitudinal'``), OmicScope assumes that protein
abundance varies over time according to a natural cubic spline, as
proposed by Storey et al. (2005). This method assesses differences
within and between groups over time.
After calculating nominal p-values, OmicScope applies the
Benjamini-Hochberg correction to account for multiple hypothesis testing
and reports the adjusted p-value (``pAdjusted``). Alternatively, users
can import statistical results from other software by including a
``pvalue`` or ``pAdjusted`` column in the rdata when using the General
input method.
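
The Benjamini-Hochberg step can be illustrated with a minimal,
standard-library sketch. The helper name ``benjamini_hochberg`` is
hypothetical and is not part of the OmicScope API; the code only shows
the textbook procedure, not OmicScope's internal implementation.

.. code:: ipython3

    def benjamini_hochberg(pvalues):
        """Return Benjamini-Hochberg adjusted p-values in the input order."""
        m = len(pvalues)
        # Indices of the p-values sorted from smallest to largest
        order = sorted(range(m), key=lambda i: pvalues[i])
        adjusted = [0.0] * m
        running_min = 1.0
        # Walk from the largest p-value down so adjusted values stay monotonic
        for rank in range(m, 0, -1):
            idx = order[rank - 1]
            running_min = min(running_min, pvalues[idx] * m / rank)
            adjusted[idx] = running_min
        return adjusted

    # Illustrative nominal p-values (made up for this sketch)
    nominal = [0.01, 0.04, 0.03, 0.20]
    print(benjamini_hochberg(nominal))

Each adjusted value is the nominal p-value scaled by the number of tests
divided by its rank, capped so that a smaller nominal p-value never
receives a larger adjusted value.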
Pdata Role
----------
Pdata (also known as phenotype data or metadata) tells OmicScope how
samples relate to each other, which determines the statistical analysis
it performs. Pdata must therefore describe every factor needed for the
comparison: two or more groups, different time courses, or different
classes over time.
When importing data into OmicScope, users must select the appropriate
method for data handling. For static analyses of Progenesis, Proteome
Discoverer, and PatternLab data, OmicScope can automatically identify
and select groups based on the software’s output. However, for
longitudinal analyses, or when using the General, DIA-NN, or MaxQuant
methods, OmicScope requires pdata to select the appropriate test.
In all cases, OmicScope allows users to incorporate external pdata into
the workflow, which helps tailor the statistical analysis to the
specific experimental design (as described below).
Static Experimental Design
--------------------------
Static Workflow
~~~~~~~~~~~~~~~
Most proteomics experiments aim to compare proteomic signatures between
independent groups, which is why OmicScope defaults to a static
experimental design (``ExperimentalDesign='static'``). The static
workflow relies on two statistical tests: the t-test and ANOVA.
When comparing two groups, OmicScope conducts an independent t-test
(``independent_ttest=True``) if the groups are independent, or a
paired t-test (``independent_ttest=False``) if the groups are
related. For comparisons involving more than two groups, OmicScope
employs a one-way ANOVA. Additionally, for proteins with a pAdjusted
value below 0.05, OmicScope performs a Tukey post-hoc test to identify
which groups differ significantly.
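
The decision rule described above can be summarized in a short sketch.
The function ``choose_static_test`` is a hypothetical helper written for
illustration, not OmicScope's code; only the ``independent_ttest`` flag
comes from the text above.

.. code:: ipython3

    def choose_static_test(conditions, independent_ttest=True):
        """Name the test implied by the number of distinct conditions."""
        n_groups = len(set(conditions))
        if n_groups < 2:
            raise ValueError("A static comparison needs at least two conditions")
        if n_groups == 2:
            # Two groups: independent or paired t-test, depending on the flag
            return "independent t-test" if independent_ttest else "paired t-test"
        # More than two groups: one-way ANOVA
        return "one-way ANOVA"

    print(choose_static_test(["COVID", "COVID", "CTRL", "CTRL"]))
    print(choose_static_test(["A", "B", "C"]))
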
Static Pdata
~~~~~~~~~~~~
To create a pdata file for running the static workflow, users should
include the following columns:

1. **Sample:** The name of each sample to be analyzed, matching those in
   the first row of the Assay sheet.
2. **Condition:** Respective group for each sample. All technical and
   biological replicates belonging to an experimental condition should
   have the same identifier here.
3. **Biological:** Respective biological replicate for each sample. If
   two or more technical replicates were used for a single biological
   replicate, those replicates should have the same identifier here.

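
Before passing a pdata table to OmicScope, it can be useful to confirm
that the three columns are present and that every assay sample is
described. The sketch below uses plain dictionaries and a hypothetical
helper, ``check_static_pdata``; the sample names are invented for
illustration and the check is not part of OmicScope.

.. code:: ipython3

    REQUIRED_COLUMNS = ("Sample", "Condition", "Biological")

    def check_static_pdata(rows, assay_samples):
        """Return a list of problems found in a static pdata table."""
        problems = []
        for i, row in enumerate(rows):
            missing = [c for c in REQUIRED_COLUMNS if c not in row]
            if missing:
                problems.append(f"row {i}: missing columns {missing}")
        # Every sample in the assay header should appear in the Sample column
        pdata_samples = {row.get("Sample") for row in rows}
        unmatched = sorted(set(assay_samples) - pdata_samples)
        if unmatched:
            problems.append(f"assay samples absent from pdata: {unmatched}")
        return problems

    # Two conditions, one biological replicate each, acquired twice
    rows = [
        {"Sample": "ctrl_A_run1", "Condition": "CTRL", "Biological": 1},
        {"Sample": "ctrl_A_run2", "Condition": "CTRL", "Biological": 1},
        {"Sample": "covid_1_run1", "Condition": "COVID", "Biological": 1},
        {"Sample": "covid_1_run2", "Condition": "COVID", "Biological": 1},
    ]
    assay_samples = [row["Sample"] for row in rows]
    print(check_static_pdata(rows, assay_samples))

An empty list means that no problem was detected by these two checks.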
Pdata Example 1
~~~~~~~~~~~~~~~
In the example below, each sample is assigned to a Condition, and each
biological replicate is reported. There are two conditions, and each
biological replicate was acquired twice. Once this pdata is passed to
the OmicScope workflow, OmicScope averages the technical replicates and
performs an independent t-test.
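
.. code:: ipython3

    # Pdata for static experimental designs
    import pandas as pd

    # Forward slashes keep the relative path portable across operating systems;
    # sheet_name=2 selects the third sheet (zero-based index)
    pdata = pd.read_excel('../../tests/data/proteins/general.xlsx', sheet_name=2)
    pdata
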

==  ===================  =========  ==========
..  Sample               Condition  Biological
==  ===================  =========  ==========
0   ``VCC_HB_1_1_2020``  COVID      1
1   ``VCC_HB_1_2``       COVID      1
2   ``VCC_HB_2_1``       COVID      2
3   ``VCC_HB_2_1_2``     COVID      2
4   ``VCC_HB_3_1``       COVID      3
5   ``VCC_HB_3_1_2``     COVID      3
6   ``VCC_HB_4_1``       COVID      4
7   ``VCC_HB_4_1_2``     COVID      4
8   ``VCC_HB_5_1``       COVID      5
9   ``VCC_HB_5_1_2``     COVID      5
10  ``VCC_HB_6_1``       COVID      6
11  ``VCC_HB_6_1_2``     COVID      6
12  ``VCC_HB_7_1``       COVID      7
13  ``VCC_HB_7_1_2``     COVID      7
14  ``VCC_HB_8_1``       COVID      8
15  ``VCC_HB_8_1_2``     COVID      8
16  ``VCC_HB_9_1``       COVID      9
17  ``VCC_HB_9_1_2``     COVID      9
18  ``VCC_HB_10_1``      COVID      10
19  ``VCC_HB_10_1_2_``   COVID      10
20  ``VCC_HB_11_1``      COVID      11
21  ``VCC_HB_11_1_2_``   COVID      11
22  ``VCC_HB_12_1``      COVID      12
23  ``VCC_HB_12_1_2_``   COVID      12
24  ``VCC_HB_A_1``       CTRL       1
25  ``VCC_HB_A_1_2``     CTRL       1
26  ``VCC_HB_B_1``       CTRL       2
27  ``VCC_HB_B_1_2``     CTRL       2
28  ``VCC_HB_C_1``       CTRL       3
29  ``VCC_HB_C_1_2``     CTRL       3
30  ``VCC_HB_D_1``       CTRL       4
31  ``VCC_HB_D_1_2``     CTRL       4
32  ``VCC_HB_E_1``       CTRL       5
33  ``VCC_HB_E_1_2``     CTRL       5
34  ``VCC_HB_F_1``       CTRL       6
35  ``VCC_HB_F_1_2``     CTRL       6
36  ``VCC_HB_G_1``       CTRL       7
37  ``VCC_HB_G_1_2``     CTRL       7
==  ===================  =========  ==========


.. code:: ipython3

    # Count the distinct groups listed in the Condition column
    print('Number of Conditions: ' + str(len(pdata.Condition.drop_duplicates())))

.. parsed-literal::

    Number of Conditions: 2

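
The averaging of technical replicates mentioned in Pdata Example 1 can
be pictured as follows. This is a sketch under stated assumptions:
replicates are grouped by the pair (``Condition``, ``Biological``),
because the ``Biological`` identifiers in the example restart within
each condition, and the abundance values and sample names are invented.
``average_technical_replicates`` is a hypothetical helper, not an
OmicScope function.

.. code:: ipython3

    from collections import defaultdict
    from statistics import mean

    def average_technical_replicates(pdata_rows, abundance):
        """Average abundances of samples sharing Condition and Biological."""
        groups = defaultdict(list)
        for row in pdata_rows:
            key = (row["Condition"], row["Biological"])
            groups[key].append(abundance[row["Sample"]])
        return {key: mean(values) for key, values in groups.items()}

    pdata_rows = [
        {"Sample": "s1", "Condition": "COVID", "Biological": 1},
        {"Sample": "s2", "Condition": "COVID", "Biological": 1},
        {"Sample": "s3", "Condition": "CTRL", "Biological": 1},
        {"Sample": "s4", "Condition": "CTRL", "Biological": 1},
    ]
    # Invented abundances of a single protein, keyed by sample name
    abundance = {"s1": 10.0, "s2": 12.0, "s3": 7.0, "s4": 9.0}
    print(average_technical_replicates(pdata_rows, abundance))

Each biological replicate then contributes a single value per protein to
the statistical test.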
Longitudinal Experimental Design
--------------------------------
Longitudinal Workflow
~~~~~~~~~~~~~~~~~~~~~
To accommodate the potential complexities of longitudinal experimental
designs, OmicScope categorizes these experiments into two primary types:

1. *Within-group experiments*: These designs aim to identify
   differentially regulated proteins over time within a single group.
2. *Between-group experiments*: These designs aim to detect differential
   protein regulation over time by comparing different groups.

Pdata workflow
~~~~~~~~~~~~~~
OmicScope handles this distinction much as it does in the static
workflow, by counting the conditions (#conditions) in the ``Condition``
column. It selects "Within-group" if #conditions equals 1 and
"Between-group" if #conditions exceeds 1. Additionally, in the
longitudinal workflow, the user is **required to add a "TimeCourse"
column** that records the time point at which each sample was collected.
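
The rule above can be written as a few lines of Python. The function
``choose_longitudinal_mode`` is a hypothetical illustration of the rule
as described in this section, not OmicScope's implementation.

.. code:: ipython3

    def choose_longitudinal_mode(pdata_rows):
        """Return the longitudinal design implied by the Condition column."""
        if not pdata_rows:
            raise ValueError("pdata is empty")
        # The longitudinal workflow needs a time point for every sample
        if any("TimeCourse" not in row for row in pdata_rows):
            raise ValueError("Longitudinal pdata requires a 'TimeCourse' column")
        n_conditions = len({row["Condition"] for row in pdata_rows})
        return "Within-group" if n_conditions == 1 else "Between-group"

    rows = [
        {"Sample": "a", "Condition": "Control", "TimeCourse": 1},
        {"Sample": "b", "Condition": "Treatment", "TimeCourse": 1},
    ]
    print(choose_longitudinal_mode(rows))
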
Pdata Example 2
~~~~~~~~~~~~~~~
In the example below, the pdata contains two distinct groups in the
``Condition`` column (12 Control and 12 Treatment biological
replicates), indicating a Between-group analysis. Additionally, the
``TimeCourse`` column includes 4 time points (1, 3, 5, and 7), and each
biological replicate was acquired twice.

.. code:: ipython3

    # Pdata for a between-group longitudinal design (first sheet)
    pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=0)
    pdata


==  ====================  =========  ==========  ==========
..  Sample                Condition  TimeCourse  Biological
==  ====================  =========  ==========  ==========
0   Sample1_Day1_Bio1_1   Control    1           1
1   Sample1_Day1_Bio1_2   Control    1           1
2   Sample2_Day1_Bio2_1   Control    1           2
3   Sample2_Day1_Bio2_2   Control    1           2
4   Sample3_Day1_Bio3_1   Control    1           3
5   Sample3_Day1_Bio3_2   Control    1           3
6   Sample4_Day2_Bio1_1   Control    3           4
7   Sample4_Day2_Bio1_2   Control    3           4
8   Sample5_Day2_Bio2_1   Control    3           5
9   Sample5_Day2_Bio2_2   Control    3           5
10  Sample6_Day2_Bio3_1   Control    3           6
11  Sample6_Day2_Bio3_2   Control    3           6
12  Sample7_Day3_Bio1_1   Control    5           7
13  Sample7_Day3_Bio1_2   Control    5           7
14  Sample8_Day3_Bio2_1   Control    5           8
15  Sample8_Day3_Bio2_2   Control    5           8
16  Sample9_Day3_Bio3_1   Control    5           9
17  Sample9_Day3_Bio3_2   Control    5           9
18  Sample10_Day4_Bio1_1  Control    7           10
19  Sample10_Day4_Bio1_2  Control    7           10
20  Sample11_Day4_Bio2_1  Control    7           11
21  Sample11_Day4_Bio2_2  Control    7           11
22  Sample12_Day5_Bio3_1  Control    7           12
23  Sample12_Day5_Bio3_2  Control    7           12
24  Sample13_Day1_Bio1_1  Treatment  1           13
25  Sample13_Day1_Bio1_2  Treatment  1           13
26  Sample14_Day1_Bio2_1  Treatment  1           14
27  Sample14_Day1_Bio2_2  Treatment  1           14
28  Sample15_Day1_Bio3_1  Treatment  1           15
29  Sample15_Day1_Bio3_2  Treatment  1           15
30  Sample16_Day2_Bio1_1  Treatment  3           16
31  Sample16_Day2_Bio1_2  Treatment  3           16
32  Sample17_Day2_Bio2_1  Treatment  3           17
33  Sample17_Day2_Bio2_2  Treatment  3           17
34  Sample18_Day2_Bio3_1  Treatment  3           18
35  Sample18_Day2_Bio3_2  Treatment  3           18
36  Sample19_Day3_Bio1_1  Treatment  5           19
37  Sample19_Day3_Bio1_2  Treatment  5           19
38  Sample20_Day3_Bio2_1  Treatment  5           20
39  Sample20_Day3_Bio2_2  Treatment  5           20
40  Sample21_Day3_Bio3_1  Treatment  5           21
41  Sample21_Day3_Bio3_2  Treatment  5           21
42  Sample22_Day4_Bio1_1  Treatment  7           22
43  Sample22_Day4_Bio1_2  Treatment  7           22
44  Sample23_Day4_Bio2_1  Treatment  7           23
45  Sample23_Day4_Bio2_2  Treatment  7           23
46  Sample24_Day5_Bio3_1  Treatment  7           24
47  Sample24_Day5_Bio3_2  Treatment  7           24
==  ====================  =========  ==========  ==========

Pdata Example 3
~~~~~~~~~~~~~~~
In some studies, researchers employ either independent or related
sampling over time. Independent sampling evaluates different individuals
at each time point, while related sampling assesses the same individuals
repeatedly. Because OmicScope assumes independent sampling by default,
an experimental design with related sampling requires a fifth column
labeled ``Individual``, which associates each sample with its
individual.
In the example below, which adapts Example 2 to related sampling, the
``Individual`` column associates each biological sample with the
corresponding individual.

.. code:: ipython3

    import pandas as pd

    # Pdata for related sampling (second sheet, which adds the Individual column)
    pdata = pd.read_excel('../../tests/data/proteins/longitudinal_pdata.xlsx', sheet_name=1)
    pdata


==  ====================  =========  ==========  ==========  ==========
..  Sample                Condition  Biological  TimeCourse  Individual
==  ====================  =========  ==========  ==========  ==========
0   Sample1_Day1_Bio1_1   Control    1           1           1
1   Sample1_Day1_Bio1_2   Control    1           1           1
2   Sample2_Day1_Bio2_1   Control    2           1           2
3   Sample2_Day1_Bio2_2   Control    2           1           2
4   Sample3_Day1_Bio3_1   Control    3           1           3
5   Sample3_Day1_Bio3_2   Control    3           1           3
6   Sample4_Day2_Bio1_1   Control    4           3           1
7   Sample4_Day2_Bio1_2   Control    4           3           1
8   Sample5_Day2_Bio2_1   Control    5           3           2
9   Sample5_Day2_Bio2_2   Control    5           3           2
10  Sample6_Day2_Bio3_1   Control    6           3           3
11  Sample6_Day2_Bio3_2   Control    6           3           3
12  Sample7_Day3_Bio1_1   Control    7           5           1
13  Sample7_Day3_Bio1_2   Control    7           5           1
14  Sample8_Day3_Bio2_1   Control    8           5           2
15  Sample8_Day3_Bio2_2   Control    8           5           2
16  Sample9_Day3_Bio3_1   Control    9           5           3
17  Sample9_Day3_Bio3_2   Control    9           5           3
18  Sample10_Day4_Bio1_1  Control    10          7           1
19  Sample10_Day4_Bio1_2  Control    10          7           1
20  Sample11_Day4_Bio2_1  Control    11          7           2
21  Sample11_Day4_Bio2_2  Control    11          7           2
22  Sample12_Day5_Bio3_1  Control    12          7           3
23  Sample12_Day5_Bio3_2  Control    12          7           3
24  Sample13_Day1_Bio1_1  Treatment  13          1           4
25  Sample13_Day1_Bio1_2  Treatment  13          1           4
26  Sample14_Day1_Bio2_1  Treatment  14          1           5
27  Sample14_Day1_Bio2_2  Treatment  14          1           5
28  Sample15_Day1_Bio3_1  Treatment  15          1           6
29  Sample15_Day1_Bio3_2  Treatment  15          1           6
30  Sample16_Day2_Bio1_1  Treatment  16          3           4
31  Sample16_Day2_Bio1_2  Treatment  16          3           4
32  Sample17_Day2_Bio2_1  Treatment  17          3           5
33  Sample17_Day2_Bio2_2  Treatment  17          3           5
34  Sample18_Day2_Bio3_1  Treatment  18          3           6
35  Sample18_Day2_Bio3_2  Treatment  18          3           6
36  Sample19_Day3_Bio1_1  Treatment  19          5           4
37  Sample19_Day3_Bio1_2  Treatment  19          5           4
38  Sample20_Day3_Bio2_1  Treatment  20          5           5
39  Sample20_Day3_Bio2_2  Treatment  20          5           5
40  Sample21_Day3_Bio3_1  Treatment  21          5           6
41  Sample21_Day3_Bio3_2  Treatment  21          5           6
42  Sample22_Day4_Bio1_1  Treatment  22          7           4
43  Sample22_Day4_Bio1_2  Treatment  22          7           4
44  Sample23_Day4_Bio2_1  Treatment  23          7           5
45  Sample23_Day4_Bio2_2  Treatment  23          7           5
46  Sample24_Day5_Bio3_1  Treatment  24          7           6
47  Sample24_Day5_Bio3_2  Treatment  24          7           6
==  ====================  =========  ==========  ==========  ==========

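
For related sampling, a quick sanity check is to confirm that every
individual was measured at every time point. The sketch below is
illustrative only: ``individuals_missing_timepoints`` is a hypothetical
helper, the rows are invented, and whether OmicScope itself requires
complete time series is not stated here.

.. code:: ipython3

    from collections import defaultdict

    def individuals_missing_timepoints(pdata_rows):
        """Map each individual to the time points at which it has no sample."""
        all_times = {row["TimeCourse"] for row in pdata_rows}
        seen = defaultdict(set)
        for row in pdata_rows:
            seen[row["Individual"]].add(row["TimeCourse"])
        # Keep only individuals that lack at least one time point
        return {
            individual: sorted(all_times - times)
            for individual, times in seen.items()
            if all_times - times
        }

    rows = [
        {"Sample": "a", "Individual": 1, "TimeCourse": 1},
        {"Sample": "b", "Individual": 1, "TimeCourse": 3},
        {"Sample": "c", "Individual": 2, "TimeCourse": 1},
    ]
    print(individuals_missing_timepoints(rows))

An empty dictionary indicates that each individual appears at all
sampled time points, as in the table above, where individuals 1 to 6 are
each sampled at time points 1, 3, 5, and 7.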