HeLa data quality control (QC) and maintenance (MNT) samples

We provide 7,444 samples through the PRIDE archiv (PXD042233).

Publication:

Webel, Henry, Yasset Perez-Riverol, Annelaura Bach Nielson, and Simon Rasmussen (2024): Mass spectrometry-based proteomics data from thousands of HeLa control samples. Sci Data 11, 112 https://doi.org/10.1038/s41597-024-02922-z

The projocet folder contains notebooks for data processing of MaxQaunt text output folders which were created using the workflows, processing each file separately.

Curate dumps

We provide metadata from the raw files together with the MaxQuant summary statistics of analzying the raw files in pride_metadata.csv. You can read it in python using pandas directly from the PRIDE FTP folder of the project.

import pandas as pd
pd.options.display.max_columns = 80

ftp_folder = 'https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233'
file = 'pride_metadata.csv'

df = pd.read_csv(f'{ftp_folder}/{file}', index_col=0)
df.sample(5, random_state=42).sort_index()

will yield

Sample ID	Pathname	bytes	size_gb	Version	Content Creation Date	Thermo Scientific instrument model	instrument attribute	instrument serial number	Software Version	firmware version	Number of MS1 spectra	Number of MS2 spectra	MS min charge	MS max charge	MS min RT	MS max RT	MS min MZ	MS max MZ	scan start time	mass resolution	mass unit	Number of scans	MS scan range	Retention time range	Mz range	beam-type collision-induced dissociation	sample number	Vial	injection volume setting	Row	dilution factor	Comment	collision-induced dissociation	sample name	Type	Enzyme	Enzyme mode	Use enzyme first search	Variable modifications	Fixed modifications	Use variable modifications first search	Requantify	Multiplicity	Max. missed cleavages	LC-MS run type	MS	MS/MS	MS/MS Submitted	MS/MS Submitted (SIL)	MS/MS Submitted (PEAK)	MS/MS Identified	MS/MS Identified (SIL)	MS/MS Identified (PEAK)	MS/MS Identified [%]	MS/MS Identified (SIL) [%]	MS/MS Identified (PEAK) [%]	Peptide Sequences Identified	Peaks	Peaks Sequenced	Peaks Sequenced [%]	Peaks Repeatedly Sequenced	Peaks Repeatedly Sequenced [%]	Isotope Patterns	Isotope Patterns Sequenced	Isotope Patterns Sequenced (z>1)	Isotope Patterns Sequenced [%]	Isotope Patterns Sequenced (z>1) [%]	Isotope Patterns Repeatedly Sequenced	Isotope Patterns Repeatedly Sequenced [%]	Recalibrated	Av. Absolute Mass Deviation [ppm]	Mass Standard Deviation [ppm]	Av. Absolute Mass Deviation [mDa]	Mass Standard Deviation [mDa]
2016_01_15_15_17_Q-Exactive-HF-Orbitrap_148	MNT_2016_Proteomics/2016_01_15_15_17_Q-Exactive-HF-Orbitrap_148.raw	1501447123	1.39833	66	2016-01-15 15:17:43	Q Exactive HF Orbitrap	Q Exactive HF Orbitrap	Exactive Series slot #148	2.5-204201/2.5.0.2042	rev. 1	12920	80807	2	19	0.00376826	144.005	300.147	1731.34	0.00376826	0.5	nan	93727	1:93727	0.0037682586:144.00486	100:6000	HCD	nan	A4	5	4	1	nan	nan	nan	nan	Trypsin/P	Specific	False	Oxidation (M);Acetyl (Protein N-term)	Carbamidomethyl (C)	False	False	1	2	Standard	12920	80807	89679	71935	17744	44494	43025	1469	50	60	8.3	34109	1288500	77301	6	1906	2.5	197041	67285	66513	34	38	4056	6	+	0.62615	0.92066	0.41545	0.63975
2016_04_16_01_56_Q-Exactive-HF-Orbitrap_147	MNT_2016_Proteomics/2016_04_16_01_56_Q-Exactive-HF-Orbitrap_147.raw	2464568579	2.29531	66	2016-04-16 01:56:38	Q Exactive HF Orbitrap	Q Exactive HF Orbitrap	Exactive Series slot #147	2.5-204201/2.5.0.2042	rev. 1	12233	87090	2	5	0.00228172	264.002	300.165	1730.02	0.00228172	0.5	nan	99323	1:99323	0.0022817164:264.00169	50:6000	HCD	A1	G12	5	nan	1	nan	nan	nan	nan	Trypsin/P	Specific	False	Oxidation (M);Acetyl (Protein N-term)	Carbamidomethyl (C)	False	False	1	2	Standard	12233	87090	89595	84585	5010	54273	53380	893	61	63	18	35604	1818019	70726	3.9	13395	19	272433	65019	64483	24	26	15913	24	+	0.41731	0.59783	0.2454	0.35516
2018_09_07_15_59_Q-Exactive-HF-X-Orbitrap_6011	MNT_2018_Proteomics/2018_09_07_15_59_Q-Exactive-HF-X-Orbitrap_6011.raw	3097450763	2.88473	66	2018-09-07 15:59:23	Q Exactive HF-X Orbitrap	Q Exactive HF-X Orbitrap	Exactive Series slot #6011	2.9-290033/2.9.0.2923	rev. 1	13344	118993	2	58	0.00364904	144.002	300.129	1657.28	0.00364904	0.5	nan	132337	1:132337	0.0036490414:144.00233	100:6000	HCD	1	B03	5	2	1	nan	nan	nan	nan	Trypsin/P	Specific	False	Oxidation (M);Acetyl (Protein N-term)	Carbamidomethyl (C)	False	False	1	2	Standard	13344	118993	137425	100557	36868	48931	47151	1780	36	47	4.8	35470	1968826	111792	5.7	2231	2	296137	92837	91437	31	34	6567	7.1	+	0.74379	1.0626	0.48075	0.72643
2018_09_28_13_00_Q-Exactive-HF-X-Orbitrap_6011	MNT_2018_Proteomics/2018_09_28_13_00_Q-Exactive-HF-X-Orbitrap_6011.raw	3192042749	2.97282	66	2018-09-28 13:00:47	Q Exactive HF-X Orbitrap	Q Exactive HF-X Orbitrap	Exactive Series slot #6011	2.9-290033/2.9.0.2923	rev. 1	11372	119429	2	58	0.00367022	144.001	300.129	1532.72	0.00367022	0.5	nan	130801	1:130801	0.0036702249:144.0008	100:6000	HCD	1	A01	2.5	1	1	nan	nan	nan	nan	Trypsin/P	Specific	False	Oxidation (M);Acetyl (Protein N-term)	Carbamidomethyl (C)	False	False	1	2	Standard	11372	119429	136853	102005	34848	50287	48043	2244	37	47	6.4	37767	1408019	113534	8.1	3886	3.4	201256	91094	89599	45	49	9025	9.9	+	0.70728	1.0271	0.44402	0.67985
2019_05_28_17_51_Q-Exactive-HF-X-Orbitrap_6096	MNT_2019_Proteomics/MNT/2019_05_28_17_51_Q-Exactive-HF-X-Orbitrap_6096.raw	1958832977	1.82431	66	2019-05-28 17:51:55	Q Exactive HF-X Orbitrap	Q Exactive HF-X Orbitrap	Exactive Series slot #6096	2.9-290033/2.9.0.2926	rev. 1	11353	97342	2	60	0.00371359	144.003	300.145	1625.79	0.00371359	0.5	nan	108695	1:108695	0.0037135892:144.00287	100:6000	HCD	1	C7	2.5	2	1	nan	nan	nan	nan	Trypsin/P	Specific	False	Oxidation (M);Acetyl (Protein N-term)	Carbamidomethyl (C)	False	False	1	2	Standard	11353	97342	106758	87926	18832	49905	48385	1520	47	55	8.1	37908	1399019	93506	6.7	2974	3.2	204304	79464	78172	39	43	7184	9	+	0.68956	0.96467	0.42647	0.62328

There are three curated intensity dumps available - for the protein groups level (aggregated using the gene sets) from proteinGroups.txt, the peptide level (aggregate the peptide sequence) from peptides.txt and the precursors (aggregate ions - sequence + charge) from evidence.txt. See a description of the MaxQuant output tables here.

Linked dumps in table format (FTP-Download):

geneGroups_aggregated.zip [0.2GB] - 7,444 samples, 4,547 selected protein groups
peptides_aggregated.zip [1.2 GB] - 7,444 samples, 42,723 selected peptides
precursors_aggregated.zip [1.3 GB] - 7,444 samples, 49,339 selected precursors

The aggregated dumps are best unzipped and a small script is provided in the zipped folder for loading the data.

These were build using cleaned MaxQuant output tables, which are provided here:

proteinGroups_single_dumps.zip [2.5 GB] - 7,484 files (some duplicates)
peptides_single_dumps.zip [9.4 GB] - 7,444 files
precursors_single_dumps.zip [7.1 GB] - 7,444 files

These contain the filtered, but not downsampled data for each samples. You might need the metadata file to filter the data for your analysis.

from pathlib import Path
import pandas as pd
import zipfile

# ARCHIVE = 'peptides_single_dumps.zip'
# ARCHIVE = 'precursors_single_dumps.zip'
ARCHIVE = 'proteinGroups_single_dumps.zip'

ftp_folder = 'https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233'
file = 'pride_metadata.csv'

meta = pd.read_csv(f'{ftp_folder}/{file}', index_col=0)

def len_csv_files_in_archive(archiv_path: str) -> dict:
    """Read csv files in zip and keep their length, i.e. number of features."""
    stats = dict()
    with zipfile.ZipFile(archiv_path, 'r') as zip_archive:
        for fname in zip_archive.namelist():
            # Check if the file is a csv
            if fname.endswith('.csv'):
                # Open each file
                with zip_archive.open(fname) as f:
                    df = pd.read_csv(f)
                    fname = Path(fname).stem
                    stats[fname] = len(df)
    return stats


stats = len_csv_files_in_archive(ARCHIVE)
stats = pd.Series(stats) # contains some duplicates which were removed from aggregated data.
stats = stats.loc[meta.index]
stats


count	7444
mean	3833.45
std	698.794
min	1196
25%	3390.75
50%	3849
75%	4181.25
max	5925

So for the 7,444 selected samples there is a median of 3849 protein groups quantified per sample.

Download using python

Download a file programaically using python:

import urllib.request
ARCHIVE = 'proteinGroups_single_dumps.zip'

ftp_folder = 'https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233'
file = 'pride_metadata.csv'

urllib.request.urlretrieve(f'{ftp_folder}/{file}', f'{file}')
urllib.request.urlretrieve(f'{ftp_folder}/{ARCHIVE}', f'{ARCHIVE}')

download on the command line

You can also downlaod the zip-archives on the command line using curl (which is available for Windows, MacOS, and Linux distributions):

curl -O https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233/geneGroups_aggregated.zip
curl -O https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233/peptides_aggregated.zip
curl -O https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233/precursors_aggregated.zip
curl -O https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233/proteinGroups_single_dumps.zip
curl -O https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233/peptides_single_dumps.zip
curl -O https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233/precursors_single_dumps.zip

Could be executed from a .bat or .sh script.

Project

The project folder contains notebooks for data processing of MaxQaunt text output folders and the analysis of the data for the publication, see its README for details.

Workflows

The workflows folder in the repository contains snakemake workflows used for rawfile data processing, both for running MaxQuant over a large set of HeLa raw files and ThermoRawFileParser on a list of raw files to extract their meta data.

Once the PRIDE archiv is public, one could start to adapt the workflows to fetch the data from there.

MaxQuant

Process single raw files using MaxQuant. See README for details.

Metadata

Read metadata from single raw files using MaxQuant. See README for details.

Setup

Create a new environment in case you want to have a separate installation for running

conda create -n hela_data -c conda-forge -c defaults python numpy matplotlib pandas plotly seaborn fastcore omegaconf ipywidgets tqdm pyyaml umap-learn scikit-learn openpyxl xmltodict
# jupyterlab # maybe add a jupyterlab installation (depends on your setup)
conda activate hela_data

and then install the custom code (and dependencies, which should be installed already installed if you use the above environment):

pip install . # . for current directory which should be this repository

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
project		project
src/hela_data		src/hela_data
test		test
workflows		workflows
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

project

project

src/hela_data

src/hela_data

test

test

workflows

workflows

.gitignore

.gitignore

LICENSE.md

LICENSE.md

README.md

README.md

environment.yml

environment.yml

pyproject.toml

pyproject.toml

setup.cfg

setup.cfg

setup.py

setup.py

Repository files navigation

HeLa data quality control (QC) and maintenance (MNT) samples

Curate dumps

Download using python

download on the command line

Project

Workflows

MaxQuant

Metadata

Setup

About

Releases

Packages

Languages

License

RasmussenLab/hela_qc_mnt_data

Folders and files

Latest commit

History

Repository files navigation

HeLa data quality control (QC) and maintenance (MNT) samples

Curate dumps

Download using python

download on the command line

Project

Workflows

MaxQuant

Metadata

Setup

About

Topics

Resources

License

Stars

Watchers

Forks

Languages