
ProHap & ProVar

Proteogenomics database-generation tool for protein haplotypes and variants. Preprint describing the tool: doi.org/10.1101/2023.12.24.572591.

A database created using ProHap on the 1000 Genomes Project data set can be found at DOI.

Input & Usage

Below is a brief overview; for details on input file formats and configuration, please refer to the Wiki page.

Required ingredients:

  • GTF annotation file (Ensembl - downloaded automatically by Snakemake)
  • cDNA FASTA file (Ensembl - downloaded automatically by Snakemake)
  • (optional) ncRNA FASTA file (Ensembl - downloaded automatically by Snakemake)
  • For ProHap: VCF with phased genotypes, one file per chromosome (such as 1000 Genomes Project - downloaded automatically by Snakemake)
  • For ProVar: VCF, single file per dataset
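Note that ProHap specifically requires *phased* genotypes. In the VCF format, phased genotype calls use `|` as the allele separator (e.g. `0|1`), while unphased calls use `/` (e.g. `0/1`). As a quick sanity check on an uncompressed VCF (the file name below is a placeholder), you can count records containing unphased calls; a non-zero count means the file is not fully phased:

```shell
# Placeholder path -- substitute your per-chromosome VCF
VCF=genotypes_chr1.vcf

# Count data records with at least one unphased genotype call (e.g. 0/1);
# '|| true' keeps the command from failing when there are no matches
grep -v '^#' "$VCF" | grep -c '[0-9]/[0-9]' || true
```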

Required software: Snakemake & Conda. ProHap was tested with Ubuntu 22.04.3 LTS. Windows users are encouraged to use the Windows Subsystem for Linux.

Using ProHap with the full 1000 Genomes Project data set (as per default) requires about 1TB disk space!
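Given this requirement, it is worth confirming free space in the working directory before launching a full run; a minimal check:

```shell
# Show free space on the filesystem holding the working directory;
# the default full 1000 Genomes run needs roughly 1 TB
df -h .
```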

Usage:

  1. Clone this repository: git clone https://github.com/ProGenNo/ProHap.git; cd ProHap/;
  2. Create a configuration file called config.yaml using https://progenno.github.io/ProHap/. Please refer to the Wiki page for details.
  3. Test Snakemake with a dry-run: snakemake --cores <# provided cores> -n -q
  4. Run the Snakemake pipeline to create your protein database: snakemake --cores <# provided cores> -p --use-conda

Example: ProHap on 1000 Genomes

In the first usage example, we provide a small example dataset taken from the 1000 Genomes Project on GRCh38. We will use ProHap to create a database of protein haplotypes aligned with Ensembl v.111 (January 2024) using only MANE Select transcripts.

Expected runtime using 4 CPU cores: ~1 hour. Expected runtime using 23 CPU cores: ~30 minutes.

Requirements: Install Conda / Mamba and Snakemake using this guide. Minimum hardware requirements: 1 CPU core, ~5 GB disk space, 3 GB RAM.

Use the following commands to run this example:

# Clone this repository:
git clone https://github.com/ProGenNo/ProHap.git ;
cd ProHap;

# Unpack the sample dataset
cd sample_data ;
gunzip sample_1kGP_common_global.tar.gz ;
tar xf sample_1kGP_common_global.tar ;
cd .. ;

# Copy the configuration to config.yaml
cp config_example1.yaml config.yaml ;

# Activate the snakemake conda environment and run the pipeline
conda activate snakemake ;
snakemake --cores 4 -p --use-conda ;

Using the database for proteomic searches

Once you obtain a list of peptide-spectrum matches (PSMs), you can use the pipeline provided in the PeptideAnnotator repository to map the peptides back to the respective protein haplotype / variant sequences, and to trace the identified variants back to their genetic origin. For usage details, please refer to the wiki page.

Output

The ProHap / ProVar pipeline produces three kinds of output files. Below is a brief description; please refer to the wiki page for further details.

  1. Concatenated FASTA file: The main result of the pipeline is the concatenated FASTA file, consisting of the ProHap and/or ProVar output, reference sequences from Ensembl, and common contaminant sequences (cRAP). The file can be used with any search engine, but is optimized for compatibility with SearchGUI and PeptideShaker. Optionally, headers are extracted and provided in an attached tab-separated file.
  2. Metadata table: Additional information on the variant / haplotype sequences produced by the pipeline, such as genomic coordinates of the variants covered, variant consequence type, etc.
  3. cDNA translations FASTA: This FASTA file contains the original translations of the variant / haplotype cDNA sequences prior to any optimization, i.e. before the removal of UTR sequences and the merging with canonical proteins and contaminants.
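As a quick sanity check on the concatenated FASTA (the path below is a placeholder; substitute the output file configured in your config.yaml), you can count the sequence entries and inspect a few headers:

```shell
# Placeholder path -- substitute the concatenated FASTA produced by your run
FASTA=results/database.fasta

# Number of protein entries (FASTA headers) in the database
grep -c '^>' "$FASTA"

# Peek at the first few headers
grep '^>' "$FASTA" | head -n 5
```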