bigbio/pgt-pangenome

Pangenome Proteogenomics

Proteogenomics analysis based on Pangenome references

The aim of this project is to search normal tissue proteomics datasets to identify novel proteins using the latest genome assemblies published via the PanGenome project.

Project Aims

  • Develop a workflow based on quantms to reanalyze public proteomics datasets with custom proteogenomics databases.

  • Develop a workflow that enables systematic validation of novel (non-canonical) peptides using multiple existing tools.

  • Perform a comprehensive analysis of multiple normal tissue datasets from the public domain using databases generated from the latest Pangenome assemblies.

    • Compare scores and FDR for known canonical and novel non-canonical peptides, check distributions, etc.
    • Revisit FDR calculations and significance measures for non-canonical peptides.
    • Analyze the novel non-canonical peptides: locations, gene types, other evidence for expression, etc.
  • Provide a FASTA database with all the novel proteins observed.

  • Draft manuscript layout and sections.

Proteogenomics workflow

Workflow components:

  • Database generation: The proteogenomics database is created with pypgatk.
  • quantms peptide identification: The proteomics data is searched against the database using quantms. The workflow uses three search engines (COMET, SAGE and MSGF+) to perform the peptide identification. Percolator is then used to boost the number of peptide identifications, and the ProteomicsLFQ or ProteinQuantifier tools are used to quantify the peptides and filter them statistically with the target-decoy approach (TDA).
  • Post-processing: The identified peptides are then post-processed to identify novel peptides and perform a comprehensive analysis of the results.
    • Peptide Alignment: The identified peptides are aligned to a global set of canonical protein sequences (ENSEMBL, UniProt TrEMBL) to identify novel peptides.
    • Spectrum Validation: Spectrum identification validation is based on MS2PIP and the signal-to-noise ratio (SNR).
    • Variant annotation: The identified peptides that contain single amino acid variants (SAAVs) are validated using the PySpectrumAI tool.
    • Retention time prediction: The retention time of the identified peptides is predicted using DeepLC.
  • Manual inspection of results using USIs and the PRIDE USI Viewer.
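The peptide alignment step listed above can be illustrated with a minimal sketch. It assumes exact substring matching and treats isoleucine and leucine as interchangeable, since they have the same mass. The function name is_canonical, the toy sequences, and the I/L handling are illustrative assumptions, not the repository's implementation.

```python
def is_canonical(peptide, canonical_proteins, il_equivalent=True):
    """Return True if the peptide is an exact substring of any canonical protein.

    il_equivalent is a hypothetical option: when True, I and L are treated
    as the same residue before matching.
    """
    def normalize(sequence):
        return sequence.replace("I", "L") if il_equivalent else sequence

    query = normalize(peptide)
    return any(query in normalize(protein) for protein in canonical_proteins)


# Toy canonical database and identified peptides (made-up sequences).
canonical = ["MKTAYIAKQRQISFVK", "MSDNGELEDKPPAPPVR"]
peptides = ["AYIAKQR", "AYLAKQR", "AYIVKQR"]

# Peptides that do not map to any canonical protein are candidate novel peptides.
novel = [p for p in peptides if not is_canonical(p, canonical)]
print(novel)  # ['AYIVKQR']
```

In this toy example the first two peptides map to the first protein (the second only through the I/L equivalence), so only the third is retained as a candidate novel peptide.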

Spectrum identification validation

Spectrum identifications are validated with the Python script ms2pip_novel.py.

ms2pip_novel.py contains a series of functions that together help create an MGF file from peptide data, run MS2PIP predictions, and compute additional metrics for each spectrum such as signal-to-noise ratio, number of peaks, and difference between the highest and lowest peaks.

Here's a brief overview of the main components of the code:

  • create_mgf: The command function that creates an MGF file from a peptide file and the corresponding mzML files. It reads the mzML files (either locally or from an FTP server) and uses the read_spectra_from_mzml function to parse spectra. The function then writes the spectra to an MGF file.
  • run_ms2pip: The command function that runs MS2PIP predictions on a given peptide and MGF file. It merges predictions with the original data and saves the results to an output file.
  • filter-ms2pip: The command function that runs the MS2PIP filtering process to remove low-quality peptides based on certain thresholds. It removes peptides whose sequence length is below a specified threshold and then sets thresholds for the signal-to-noise ratio dynamically, based on percentiles.

These functions and command-line commands together facilitate the process of working with peptide and MGF data files, running predictions using MS2PIP, and filtering and computing metrics for the resulting spectra.
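The filtering idea described above (a length cutoff followed by a percentile-based signal-to-noise threshold) can be sketched as follows. This is only an illustration under stated assumptions: the signal-to-noise definition (most intense peak divided by the median peak intensity), the default cutoffs, and the names signal_to_noise, percentile and filter_psms are hypothetical and are not taken from ms2pip_novel.py.

```python
import statistics


def signal_to_noise(intensities):
    """Assumed definition: most intense peak divided by the median intensity."""
    noise = statistics.median(intensities)
    return max(intensities) / noise if noise > 0 else float("inf")


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def filter_psms(psms, min_length=7, snr_percentile=5):
    """Drop short peptides, then drop PSMs below a data-driven SNR cutoff.

    Each PSM is a dict with 'sequence' and 'snr' keys (hypothetical layout).
    """
    long_enough = [p for p in psms if len(p["sequence"]) >= min_length]
    if not long_enough:
        return []
    # The cutoff is derived from the data itself rather than fixed in advance.
    cutoff = percentile([p["snr"] for p in long_enough], snr_percentile)
    return [p for p in long_enough if p["snr"] >= cutoff]


psms = [
    {"sequence": "PEPTIDEK", "snr": 10.0},
    {"sequence": "PEPK", "snr": 50.0},
    {"sequence": "LONGPEPTIDER", "snr": 2.0},
    {"sequence": "ANOTHERPEPK", "snr": 8.0},
]
kept = filter_psms(psms, min_length=7, snr_percentile=50)
print([p["sequence"] for p in kept])
```

Deriving the cutoff from a percentile keeps the filter comparable across runs with different overall intensity scales, at the cost of always discarding a fixed fraction of the spectra.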

Variant annotation

The SpectrumAI algorithm was originally published in Nature Communications by Yafeng Zhu et al. and was originally implemented in R. We implemented the algorithm in Python in the pypgatk toolbox, which makes it faster to run and easier to integrate into other Python workflows. The original algorithm is described as follows:

Assume a 12-amino-acid peptide is identified with single substitution at 8th residue, in order to pass SpectrumAI, it must have matched MS2 peaks (within fragment ion mass tolerance) from at least one of the following groups: b7&b8, y4&y5, y4&b7 or y5&b8. Second, the sum intensity of the supporting flanking MS2 ions must be larger than the median intensity of all fragmentation ions. An exception to these criteria is made when the substituted amino acid has a proline residue to its N-terminal side. Because CID/HCD fragmentation at the C-terminal side of a proline residue is thermodynamically unfavored, SpectrumAI only demands the presence of any b or y fragment ions containing substituted amino acids, in this case, b8 to b11, y5 to y11.
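The rule quoted above can be written down as a small sketch. It is an illustration of the stated criteria only, not the pypgatk implementation. The function names, the dictionary of matched ions (label to intensity), and the choice to take the median over the matched fragment ions passed in are assumptions made for this example.

```python
import statistics


def flanking_ions(length, pos):
    """Ion labels flanking a substitution at 1-based position pos."""
    return f"b{pos - 1}", f"b{pos}", f"y{length - pos}", f"y{length - pos + 1}"


def flanking_groups(length, pos):
    """The four ion pairs, any one of which satisfies the first criterion."""
    b_prev, b_cur, y_prev, y_cur = flanking_ions(length, pos)
    return [(b_prev, b_cur), (y_prev, y_cur), (y_prev, b_prev), (y_cur, b_cur)]


def passes_spectrumai(peptide, pos, matched):
    """matched maps ion labels such as 'b7' to the intensity of the matched peak."""
    n = len(peptide)
    if not matched:
        return False
    # Proline exception: a proline directly N-terminal to the substituted residue.
    if pos >= 2 and peptide[pos - 2] == "P":
        containing = [f"b{i}" for i in range(pos, n)]
        containing += [f"y{i}" for i in range(n - pos + 1, n)]
        return any(ion in matched for ion in containing)
    # Criterion 1: at least one complete flanking pair is matched.
    # Labels that cannot exist (b0, y0) are never in matched, so they fail here.
    if not any(a in matched and b in matched for a, b in flanking_groups(n, pos)):
        return False
    # Criterion 2: summed intensity of matched flanking ions exceeds the median.
    support = sum(matched.get(ion, 0.0) for ion in flanking_ions(n, pos))
    return support > statistics.median(matched.values())


print(flanking_groups(12, 8))
# [('b7', 'b8'), ('y4', 'y5'), ('y4', 'b7'), ('y5', 'b8')]
```

For the 12-residue example with a substitution at position 8, the groups reproduce the pairs named in the description, and the proline branch checks b8 to b11 and y5 to y11.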

Retention time prediction

The script deeplc_novel.py is designed to evaluate the performance of DeepLC on the novel peptides. It uses canonical peptides (e.g. GRCh38 peptides) to train DeepLC and novel peptides to evaluate its performance and filter them.

  • DeepLC Training and Prediction: For each sample ID, the script trains a DeepLC model on canonical peptide data. It then uses this trained model to predict retention times (preds_tr) for the novel peptides in the GCA dataset.
  • Error Calculation and Percentiles: The script calculates error (the difference between actual and predicted retention times) and absolute error. It also calculates error percentiles, which measure how the errors of the novel peptides compare to the canonical peptides.
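The error and percentile calculation described above can be sketched without DeepLC itself, assuming the observed and predicted retention times are already available. The function names and the percentile definition used here (the share of canonical absolute errors that are less than or equal to a novel peptide's absolute error) are assumptions for illustration and may differ from deeplc_novel.py.

```python
from bisect import bisect_right


def absolute_errors(observed, predicted):
    """Absolute difference between observed and predicted retention times."""
    return [abs(o - p) for o, p in zip(observed, predicted)]


def error_percentiles(canonical_abs_errors, novel_abs_errors):
    """Place each novel error within the canonical error distribution.

    Returns, for each novel peptide, the percentage of canonical peptides
    whose absolute error is less than or equal to that peptide's error.
    """
    ordered = sorted(canonical_abs_errors)
    return [100.0 * bisect_right(ordered, e) / len(ordered) for e in novel_abs_errors]


# Toy retention times in minutes (made-up values).
canonical_errors = absolute_errors([10.0, 20.0, 30.0, 40.0], [10.5, 19.0, 31.5, 38.0])
novel_errors = absolute_errors([15.0, 25.0, 35.0], [16.2, 30.0, 34.9])

print(error_percentiles(canonical_errors, novel_errors))
```

A novel peptide whose percentile is close to 100 has a larger retention time error than almost all canonical peptides of the same sample, which makes it a candidate for removal.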

Pangenome reanalysis of normal tissue datasets

Datasets of normal tissues

We used two big normal tissue datasets to detect novel peptides from pangenomes and to validate the results. The datasets are:

Database information

Results from analysis

The original PSMs are stored in quantms.io format.

Structure of the repository

Filtering scripts and notebooks

  • gca_canonical_validation.ipynb: The GCA peptides in the results are compared against canonical proteins to prevent misassignment, and the output is converted to the input format supported by DeepLC.
  • deeplc_novel.py: For each sample, the GRCh38 peptides of that sample are used to calibrate the model.
  • ms2pip_novel.py: The script validates the identified peptides using the signal-to-noise ratio and MS2PIP.

Other notebooks during the analysis

Other files generated during the analysis

Authors

  • Dong Wang - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China.
  • Husen M. Umer - Bioscience Core Laboratory, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.
  • Yasset Perez-Riverol - PRIDE Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.