# Scripts relevant to "Neoantigen characteristics in the context of the complete predicted MHC class I self-immunopeptidome" (doi:10.1080/2162402X.2018.1556080)
- Python v3.4.5 (and python_requirements.txt)
- Samtools v0.1.8
- sqlite3 v3.6.20
- NetMHCpan v3.0
- Condensing proteomes
- Prepare and run predictions
- Parse the results of the predictions
- Import parsed data into SQLite3 database
- Creating SQLite3 database from downloaded data
- Lookup self-immunopeptidome sizes from HLA genotypes
- Generating random amino acid changes
- Getting variant RNA-seq support
## Condensing proteomes

This condenses a given proteome (one or more .fasta files) into sets of artificial proteins ("contigs").
Extract all unique n-mers from the proteome file(s):
$ python processUniqueNmersProteome.py output_directory/8mers.txt -n 8 --fasta proteome_file1.fasta [proteome_file2.fasta ...]
Combine the unique n-mers into a set of artificial proteins ("contigs") containing each unique n-mer exactly once (when parsed with a sliding window of size n):
$ python makeContigsFromUniqueNmers.py output_directory/8mers.txt 8 output_directory/8mers_contigs.txt
Record the lengths of the resulting contigs:
$ awk '{print length($0);}' output_directory/8mers_contigs.txt > output_directory/8mers_lengths.txt
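The condensing steps above can be sketched in Python (a minimal illustration with hypothetical function names, not the repository scripts):

```python
def unique_nmers(sequences, n):
    """Collect every distinct n-mer across a set of protein sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            seen.add(seq[i:i + n])
    return seen

def nmers_in_contig(contig, n):
    """Parse a contig back into n-mers with a sliding window of size n."""
    return [contig[i:i + n] for i in range(len(contig) - n + 1)]

# Toy check: two overlapping 3-mers condense into one 4-character contig.
mers = unique_nmers(["ABCD"], 3)
assert set(nmers_in_contig("ABCD", 3)) == mers  # {"ABC", "BCD"}
```

A contig is valid when sliding-window parsing recovers exactly the unique n-mer set, which is what the final assertion checks.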
## Prepare and run predictions

First, get the list of all HLA alleles available for prediction in NetMHCpan:
$ /path/to/netMHCpan-3.0/netMHCpan -listMHC | awk '{print $1}' | grep -e "HLA-[ABC]" > allHLAI.txt
Note: this was designed to create input files for clusterTAS, a cluster submission and management script used locally at the BC Genome Sciences Centre, and will likely need modification to run on your system. Each line of scripts.sh has the format: job_name bash_command;
Prepare the NetMHCpan invocations using the condensed proteomes. First edit lines 22-24 of prepareJobs.py to match the paths on your system, then run:
$ python prepareJobs.py --species HUMAN --contig8mer output_directory/8mers_contigs.txt --contig9mer output_directory/9mers_contigs.txt --contig10mer output_directory/10mers_contigs.txt --contig11mer output_directory/11mers_contigs.txt --contigsPerJob 1000 --hlaAlleleList allHLAI.txt --destDir /path/to/output/jobs_dir/
This breaks the input into many smaller individual jobs; the number depends on the proteome size. An example line from the scripts.sh file:
HUMAN_HLA-B13-23_8_32 source /home/sbrown/bin/pythonvenv/python3/bin/activate; /path/to/netMHCpan-3.0/netMHCpan -tdir tmpdirXXXXXX -a HLA-B13:23 -l 8 -f prot8_32_HUMAN.fa > HUMAN_HLA-B13-23_8_32.pMHC; python /home/sbrown/scripts/parseNetMHCpanOutput.py HUMAN_HLA-B13-23_8_32.pMHC HUMAN_HLA-B13-23_8_32.pMHC.parsed; rm HUMAN_HLA-B13-23_8_32.pMHC;
Note: since the HLA allele is encoded in the file name and prot8_32_HUMAN.fa lists the peptides, the output is parsed down to just the IC50 scores for each peptide (in the same order as prot8_32_HUMAN.fa) to save space.
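Assuming the parsed file holds one IC50 value per line, the scores can be re-associated with peptides by re-deriving the n-mers from the job's contigs in the same order (an illustrative sketch; the exact parsed format may differ):

```python
def pair_peptides_with_scores(contigs, score_lines, n):
    """Re-associate parsed IC50 scores with the n-mer peptides NetMHCpan
    derives from each contig, in order. score_lines would normally come
    from open()-ing the .parsed file; one score per line is assumed."""
    peptides = [c[i:i + n] for c in contigs for i in range(len(c) - n + 1)]
    scores = [float(line) for line in score_lines if line.strip()]
    if len(peptides) != len(scores):
        raise ValueError("peptide/score count mismatch")
    return list(zip(peptides, scores))

# Toy example: one 4-character contig yields two 3-mers.
pairs = pair_peptides_with_scores(["ABCD"], ["12.5", "600.0"], 3)
# pairs == [("ABC", 12.5), ("BCD", 600.0)]
```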
## Parse the results of the predictions

Within the folder holding the results of all the predictions, check that all jobs completed successfully and generate simple summaries:
$ python tallyParsedData_multiProc.py -v ../scripts.sh analysis_results/ singleHLAdata.tsv timeCharacteristics.tsv failedJobs.txt 16
Make sure that failedJobs.txt is an empty file before continuing.
## Import parsed data into SQLite3 database

Now build an SQLite3 (v3.6.20) database to hold information on pMHC binding:
$ python makeDatabaseOfBinders.py HUMAN /path/to/results/ allHLAI.txt HUMAN_binders.db 16
Note: the database holds all peptides and HLA alleles, but only those pMHC interactions (binders) with IC50 < 500 nM.
Details on the schema of the created database:
$ sqlite3 HUMAN_binders.db
SQLite version 3.6.20
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
binders hla peptide
sqlite> .schema peptide
CREATE TABLE peptide(id INT, sequence TEXT);
CREATE INDEX peptide_ind ON peptide(id);
CREATE INDEX peptide_seq ON peptide(sequence);
sqlite> .schema hla
CREATE TABLE hla(id INT, allele TEXT);
sqlite> .schema binders
CREATE TABLE binders(hla_id INT, pep_id INT, ic50 REAL);
CREATE INDEX binder_hla_ind ON binders(hla_id);
CREATE INDEX binder_pep_ind ON binders(pep_id);
Note that the indices need to be created manually after running makeDatabaseOfBinders.py.
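As a quick sanity check, the schema above can be exercised from Python's sqlite3 module (toy data, illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Recreate the schema shown above.
cur.execute("CREATE TABLE peptide(id INT, sequence TEXT)")
cur.execute("CREATE TABLE hla(id INT, allele TEXT)")
cur.execute("CREATE TABLE binders(hla_id INT, pep_id INT, ic50 REAL)")
cur.execute("CREATE INDEX binder_hla_ind ON binders(hla_id)")
cur.execute("CREATE INDEX binder_pep_ind ON binders(pep_id)")

# Toy data: one allele, two peptides, one predicted binder (< 500 nM).
cur.execute("INSERT INTO hla VALUES (1, 'HLA-B13:23')")
cur.executemany("INSERT INTO peptide VALUES (?, ?)",
                [(1, "SIINFEKL"), (2, "AAAAAAAA")])
cur.execute("INSERT INTO binders VALUES (1, 1, 32.0)")

# Peptides bound by a given allele, ordered by predicted affinity.
rows = cur.execute(
    """SELECT p.sequence, b.ic50
       FROM binders b
       JOIN peptide p ON p.id = b.pep_id
       JOIN hla h ON h.id = b.hla_id
       WHERE h.allele = 'HLA-B13:23'
       ORDER BY b.ic50""").fetchall()
assert rows == [("SIINFEKL", 32.0)]
```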
## Creating SQLite3 database from downloaded data

Dataset DOI:
If you are using our human immunopeptidome data, you can generate a comparable SQLite3 database by running:
$ python makeDatabaseFromFlatFiles.py human_immunopeptidome_database_flat/ HUMAN_binders.db
Note: to avoid repeated database querying, this script loads all data into memory and is RAM-intensive (> 70 GB); modifications may be necessary to run it on your machine.
## Lookup self-immunopeptidome sizes from HLA genotypes

Given a list of HLA genotypes, calculate the size of each self-immunopeptidome:
$ python lookupHLAgenotypesSQL.py samples_hlaGeno.tsv HUMAN_binders.db output_selfimmunopeptidome_sizes.tsv 12
Where samples_hlaGeno.tsv is a tab-separated file like:
sample_id HLA-A02-01_HLA-A28-01_HLA-B57-01_HLA-B27-05_HLA-C03-03_HLA-C03-03
...
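Conceptually, a genotype's self-immunopeptidome size is the number of distinct peptides predicted to bind at least one of its six alleles. One way to express that against the schema above (a sketch, not lookupHLAgenotypesSQL.py itself):

```python
import sqlite3

def self_immunopeptidome_size(con, alleles):
    """Distinct peptides bound by at least one of the genotype's alleles.
    No IC50 filter is needed: the binders table only stores < 500 nM hits."""
    placeholders = ",".join("?" for _ in alleles)
    query = ("SELECT COUNT(DISTINCT b.pep_id) FROM binders b "
             "JOIN hla h ON h.id = b.hla_id "
             "WHERE h.allele IN (%s)" % placeholders)
    return con.execute(query, list(alleles)).fetchone()[0]

# Toy database: peptide 10 binds both alleles but is counted once.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hla(id INT, allele TEXT);
    CREATE TABLE binders(hla_id INT, pep_id INT, ic50 REAL);
    INSERT INTO hla VALUES (1, 'HLA-A02-01');
    INSERT INTO hla VALUES (2, 'HLA-B57-01');
    INSERT INTO binders VALUES (1, 10, 30.0);
    INSERT INTO binders VALUES (1, 11, 120.0);
    INSERT INTO binders VALUES (2, 10, 45.0);
""")
size = self_immunopeptidome_size(con, ["HLA-A02-01", "HLA-B57-01"])  # 2
```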
## Generating random amino acid changes

To generate random proteome mutations (TCGA_aaChange_counts.tsv is included in this repository, but can be substituted with your own frequencies):
$ python generateRandomProteomeMutations.py proteome_reference.fasta /path/for/output/ generated_aaChange_counts.tsv --seed 171201 --num_mutations 50000 --transition_counts TCGA_aaChange_counts.tsv
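The weighted sampling at the core of this step can be sketched as follows (hypothetical function, assuming the counts file reduces to a mapping of (ref, alt) pairs to observed counts):

```python
import random

def sample_substitutions(transition_counts, k, seed=171201):
    """Draw k amino-acid changes with probability proportional to
    observed counts. A sketch only: the repository script additionally
    places each change at a proteome position matching the reference
    residue."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    changes = list(transition_counts)
    weights = [transition_counts[c] for c in changes]
    return rng.choices(changes, weights=weights, k=k)

# Toy counts (not TCGA data): A->V should dominate a 90/10 draw.
counts = {("A", "V"): 90, ("R", "H"): 10}
draws = sample_substitutions(counts, 1000)
```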
## Getting variant RNA-seq support

To get transcriptome read support for SNVs, first edit line 29 of lookupMutationReadSupport.py to point to your Samtools v0.1.8 binary, then run:
$ python lookupMutationReadSupport.py variants_and_bams_toQuery.tsv variants_readCounts.tsv 36
Where variants_and_bams_toQuery.tsv is a tab-separated file like:
subject sample NCBI_Build Chromosome Start_Position End_Position STRAND HGVSc file
TCGA-VQ-A91D TCGA-VQ-A91D-01A-11D-A410-08 GRCh37 19 15234051 15234051 -1 c.341C>T /path/to/TCGA-VQ-A91D-01A-11R-A414-31_diseased_HG19_GSC_merge_RNA-Seq.bam
...
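A minimal sketch of the per-variant lookup, assuming an indexed BAM and using `samtools view` with a region argument (supported in v0.1.8); the repository script also inspects base calls for allele-specific support, which this sketch omits:

```python
import subprocess

def region_string(chrom, start, end):
    """1-based samtools region, e.g. '19:15234051-15234051'."""
    return "%s:%d-%d" % (chrom, start, end)

def count_overlapping_reads(samtools, bam, chrom, start, end):
    """Count RNA-seq reads overlapping a variant position by piping
    'samtools view' over the region (requires an indexed BAM)."""
    cmd = [samtools, "view", bam, region_string(chrom, start, end)]
    out = subprocess.check_output(cmd, universal_newlines=True)
    return sum(1 for line in out.splitlines() if line.strip())
```

For the example row above, the region would be built from Chromosome, Start_Position, and End_Position, and `samtools` would point at the v0.1.8 binary.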