Comprehensive cancer signatures with reusable modules written in python, integrating SNV, SV and MSI profiles in signatures decomposed using non-negative matrix factorisation, and produce production ready pdf reports.

jessada/pyCancerSig

pyCancerSig

A Python package for deciphering cancer signatures.

pyCancerSig provides comprehensive cancer signature analysis with reusable modules written in Python. It integrates SNV, SV and MSI profiles into signatures decomposed using non-negative matrix factorisation, and produces production-ready PDF reports.

Installation

Dependencies - Currently, feature extraction of structural variants is based on data generated by FindSV, and feature extraction of microsatellite instability is based on data generated by MSIsensor.

Install the dependencies, then download and install pyCancerSig:

git clone https://github.com/jessada/pyCancerSig.git
cd pyCancerSig
python setup.py install
echo -e "# set pyCancerSig environment variable\nexport PYCANCERSIG=`pwd`\n" >> ~/.bashrc
source ~/.bashrc      # or logout and re-login

Workflow steps

The workflow consists of 4 steps

  1. Data preprocessing - The purpose of this step is to generate a list of variants and/or related information. This step has to be performed by third-party software.

    • Single nucleotide variant (SNV) - MuTect2 is recommended; otherwise Muse, VarScan2, or SomaticSniper.
    • Structural variant (SV) - dependency on FindSV
    • Microsatellite instability (MSI) - dependency on MSIsensor

    A note regarding VCF files generated by FindSV: even though the VCF standard has support for SVs, callers may not always be fully interchangeable. Specifically, the “END” tag added by many callers and a “CHR2” tag are parsed out from the INFO field. Other information not evident from the VCF definition could be parsed by replacing or modifying a custom parseVCFLine function, as was done for FindSV. If any other SV caller is used, we advise users to develop a parser to replace cancersig profile sv.

    For MSIsensor, cancersig profile msi will look at the *_somatic output files. Each line represents one MSI locus. The fifth column indicates the repeat pattern.

  2. Profiling (feature extraction) - cancersig profile - The purpose of this step is to turn the information generated in the first step into matrix features usable by the model in the next step. The output of this stage has a format similar to https://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt, with three leading columns followed by one column per sample.

    1. Column 1, Variant type (Substitution Type in COSMIC)
    2. Column 2, Variant subgroup (Trinucleotide in COSMIC)
    3. Column 3, Feature ID (Somatic Mutation Type in COSMIC)
    4. From column 4 onward, each column represents one sample

    There is one subcommand for each type of genetic variation, plus one for merging:

    • cancersig profile snv extracts single nucleotide variant features
    • cancersig profile sv extracts structural variant features
    • cancersig profile msi extracts microsatellite instability features
    • cancersig profile merge merges all feature profiles into one single profile ready to be used by the next step
  3. Deciphering mutational signatures - cancersig signature decipher - The purpose of this step is to use an unsupervised learning model to find mutational signature components in the tumors.

  4. Visualizing profiles - cancersig signature visualize - The purpose of this step is to visualize the mutational signature components of each tumor.
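The profile layout described in step 2 (three leading columns, then one column per sample) can be read with a few lines of Python. This is a minimal sketch and not part of pyCancerSig: the function name read_profile, the tab delimiter, the header labels and the example values are all assumptions made for illustration.

```python
import csv
import io

def read_profile(handle):
    """Read a profile with 3 leading columns followed by one column per sample.

    Assumption: the file is tab-delimited and has a single header line.
    Returns {sample_id: {feature_id: value}}.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[3:]
    profiles = {sample_id: {} for sample_id in sample_ids}
    for row in reader:
        if not row:
            continue  # skip blank lines
        feature_id = row[2]  # column 3 holds the feature ID
        for sample_id, value in zip(sample_ids, row[3:]):
            profiles[sample_id][feature_id] = float(value)
    return profiles

# Hypothetical two-sample profile, for illustration only.
example = io.StringIO(
    "Variant type\tVariant subgroup\tFeature ID\tsample1\tsample2\n"
    "C>A\tACA\tA[C>A]A\t0.25\t0.10\n"
    "C>A\tACC\tA[C>A]C\t0.75\t0.90\n"
)
profiles = read_profile(example)
print(profiles["sample1"]["A[C>A]A"])
```

Keying the result by sample first mirrors how the later steps treat each sample column as one profile.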

Example of a visualized pdf report of an SNV-only profile

Example of a visualized pdf report of a combined profile

Usage

usage: cancersig <command> [options]

Key commands:

profile             extract mutational profile
signature           decipher mutational cancer signature component and visualization from mutational profiles

cancersig profile key commands:

snv                 extract SNV mutational profile
sv                  extract SV mutational profile
msi                 extract MSI mutational profile
merge               merge all mutational profiles into a single profile

cancersig signature key commands:

decipher            perform unsupervised learning model to find mutational signature components
visualize           visualize mutational signatures identified in tumors

cancersig profile snv [options]:

-i {file}           input VCF file (required)
-r {file}           path to genome reference (required)
-o {file}           output snv feature file (required)

cancersig profile sv [options]:

-i {file}           input VCF file (required)
-o {file}           output sv feature file (required)

cancersig profile msi [options]:

--raw_msisensor_report {file}    an output from "msisensor msi" that has only the msi score (percentage of MSI loci) (required)
--raw_msisensor_somatic {file}   an output from "msisensor msi" that have suffix "_somatic" (required)
--sample_id {id}                 a sample id to be used as a column header in the output file (required)
-o {file}                        output msi feature file (required)

cancersig profile merge [options]:

-i {directories}                 comma-separated directories containing feature files to be merged (required)
-o {file}                        output merged feature file (required)
--profile_types [SV,SNV,MSI]     profile types to be merged, (default: SV,SNV,MSI)

cancersig signature decipher [options]:

--mutation_profiles {file}      input mutation catalog to be deciphered (required)
--min_signatures                minimum number of signatures to be deciphered (default=2)
--max_signatures                maximum number of signatures to be deciphered (default=15)
--out_prefix                    output file prefix (required)

cancersig signature visualize [options]:

--mutation_profiles {file}         input mutation catalog to be reconstructed (required)
--signatures_probabilities {file}  input file with deciphered cancer signatures probabilities (required)
--output_dir {directory}           output directory (required)
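
The commands listed above can be chained for one sample. The following is a rough sketch that only assembles the argument lists, using the flags from the usage listing; the function build_workflow, the dictionary keys and every file name are hypothetical, and the name of the signatures file written by cancersig signature decipher is not documented here, so it is passed in as a parameter.

```python
import shlex

def build_workflow(sample_id, inputs, reference, feature_dir, merged_file,
                   out_prefix, signatures_file, report_dir):
    """Return the cancersig commands for one sample as argument lists."""
    return [
        ["cancersig", "profile", "snv", "-i", inputs["snv_vcf"], "-r", reference,
         "-o", f"{feature_dir}/{sample_id}_snv_feature.txt"],
        ["cancersig", "profile", "sv", "-i", inputs["sv_vcf"],
         "-o", f"{feature_dir}/{sample_id}_sv_feature.txt"],
        ["cancersig", "profile", "msi",
         "--raw_msisensor_report", inputs["msi_report"],
         "--raw_msisensor_somatic", inputs["msi_somatic"],
         "--sample_id", sample_id,
         "-o", f"{feature_dir}/{sample_id}_msi_feature.txt"],
        ["cancersig", "profile", "merge", "-i", feature_dir, "-o", merged_file],
        ["cancersig", "signature", "decipher",
         "--mutation_profiles", merged_file, "--out_prefix", out_prefix],
        ["cancersig", "signature", "visualize",
         "--mutation_profiles", merged_file,
         "--signatures_probabilities", signatures_file,
         "--output_dir", report_dir],
    ]

# All paths below are placeholders.
commands = build_workflow(
    sample_id="example_sample",
    inputs={"snv_vcf": "snv.vcf.gz", "sv_vcf": "sv.vcf",
            "msi_report": "msisensor_out", "msi_somatic": "msisensor_out_somatic"},
    reference="/path/to/reference.fa",
    feature_dir="features",
    merged_file="merged_feature.txt",
    out_prefix="deciphered",
    signatures_file="signatures_probabilities.txt",
    report_dir="reports",
)
for command in commands:
    print(shlex.join(command))
```

Each list could be handed to subprocess.run(command, check=True) once the paths point at real files.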

Examples and details - Step 1 Data preprocessing

As this part is performed by third-party software, please check the original websites for the documentation.

Examples and details - Step 2 Profiling (Feature extraction)

2.1 SNV profiling

cancersig profile snv will

  • scan the genotype field of the VCF (or vcf.gz) file for SNV changes on both strands
  • then, use the genomic coordinates to look up the 5' and 3' bases in the reference fasta (using samtools)
  • then, perform SNV profiling of the sample by counting the number of SNVs in each category and dividing it by the total number of variants in the sample.

The sample id in the output feature file will be the same as sample id in the input VCF file.
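
The counting and normalisation described above can be sketched in a few lines. This is not the pyCancerSig implementation: it assumes the flanking bases have already been looked up, it assumes the COSMIC convention of reporting every change relative to the pyrimidine (C or T) strand, and the names snv_feature and snv_profile as well as the example variants are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def snv_feature(five_prime, ref, alt, three_prime):
    """Return a strand-collapsed feature ID such as 'A[C>T]G'."""
    if ref in ("A", "G"):
        # Describe the change on the opposite strand: complement the
        # bases and swap the flanks (reverse complement of the context).
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
    return f"{five_prime}[{ref}>{alt}]{three_prime}"

def snv_profile(variants):
    """Count SNVs per category and divide by the total number of SNVs."""
    counts = Counter(snv_feature(*variant) for variant in variants)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

# Hypothetical variants as (5' base, ref, alt, 3' base).
variants = [
    ("A", "C", "T", "G"),
    ("C", "G", "A", "T"),  # same event as A[C>T]G, seen on the other strand
    ("T", "C", "A", "A"),
    ("A", "C", "T", "G"),
]
profile = snv_profile(variants)
print(profile)
```

In this example three of the four variants collapse into A[C>T]G, so that category gets a fraction of 0.75.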

Example run:

cancersig profile snv -i input.vcf.gz -r /path/to/reference.fa -o snv_feature.txt

Example SNV feature output from Example SNV input.vcf.gz

2.2 SV profiling

cancersig profile sv will

  • check INFO field "SVTYPE" to determine type of structural variation
  • check INFO field "END" for calculating the length of the event
  • then, perform SV profiling of the sample by counting the number of SVs in each category and dividing it by the total number of variants in the sample.

The sample id in the output feature file will be the same as sample id in the input VCF file (column 10).
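
As an illustration of the steps above, the sketch below reads SVTYPE and END from an INFO string, derives an event length, and normalises the counts. The length bins, the feature labels and the function names are assumptions made for this example; the categories actually used by cancersig profile sv are not listed in this README.

```python
from collections import Counter

# Hypothetical length bins (upper bound in bp, label); not taken from pyCancerSig.
LENGTH_BINS = [(10_000, "lt10kb"), (1_000_000, "10kb_1Mb")]

def parse_info(info):
    """Turn 'SVTYPE=DEL;END=6000' into {'SVTYPE': 'DEL', 'END': '6000'}."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields

def sv_feature(pos, info):
    """Combine the SV type with a length bin computed from POS and END."""
    fields = parse_info(info)
    svtype = fields["SVTYPE"]
    if "END" not in fields:
        return svtype  # no length available, keep the type only
    length = abs(int(fields["END"]) - pos)
    for upper, label in LENGTH_BINS:
        if length < upper:
            return f"{svtype}_{label}"
    return f"{svtype}_gt1Mb"

def sv_profile(records):
    """Count SVs per category and divide by the total number of SVs."""
    counts = Counter(sv_feature(pos, info) for pos, info in records)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

# Hypothetical (POS, INFO) pairs taken from VCF columns 2 and 8.
records = [
    (1000, "SVTYPE=DEL;END=6000"),
    (5000, "SVTYPE=DUP;END=2005000"),
    (700, "SVTYPE=DEL;END=3700"),
    (100, "SVTYPE=BND"),
]
profile = sv_profile(records)
print(profile)
```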

Example run:

cancersig profile sv -i input.vcf -o sv_feature.txt

Note: Currently, cancersig profile sv only accepts uncompressed VCF files, so a compressed file has to be decompressed first (for example with gunzip).

Example SV feature output from Example SV input.vcf

2.3 MSI profiling

cancersig profile msi will

  • scan for all possible repeat patterns of repeat units with a size of 1 to 3
  • for repeat units with a size of 4 to 5, count them without sub-categories
  • then, perform MSI profiling of the sample by counting the number of repeats in each category and dividing it by the total number of repeats.

The sample id in the output feature file has to be supplied as an input argument (--sample_id).
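
A small sketch of this categorisation follows. It assumes the *_somatic file is tab-delimited without a header line and takes the repeat pattern from the fifth column, as described in the workflow section; the feature labels, the handling of longer repeat units and the example lines are hypothetical.

```python
from collections import Counter

def msi_feature(repeat_unit):
    """Map a repeat unit to a feature category."""
    size = len(repeat_unit)
    if 1 <= size <= 3:
        return repeat_unit  # one category per repeat pattern
    if 4 <= size <= 5:
        return f"repeat_unit_length_{size}"  # counted without sub-categories
    return None  # assumption: other sizes are ignored

def msi_profile(lines):
    """Count loci per category and divide by the total number of counted loci."""
    counts = Counter()
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        feature = msi_feature(columns[4])  # fifth column: repeat pattern
        if feature is not None:
            counts[feature] += 1
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

# Hypothetical lines; only the fifth column matters for this sketch.
lines = [
    "1\t1000\tACGTA\t12\tA\tTTGCA",
    "1\t2000\tGGCTA\t15\tA\tCCGTA",
    "2\t3000\tTTGAC\t8\tAC\tGGATC",
    "3\t4000\tCATGC\t6\tACGT\tTTAGG",
]
profile = msi_profile(lines)
print(profile)
```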

Example run:

cancersig profile msi --raw_msisensor_report msisensor_out --raw_msisensor_somatic msisensor_out_somatic --sample_id example_sample -o msi_feature.txt

Example MSI feature output from Example msisensor_out and Example msisensor_out_somatic

2.4 Merge profile

cancersig profile merge will

  • scan for *feature.txt or *profile.txt files in the input folder(s)
  • if a sample has features for all mutation types (SNV, SV, MSI), they will be merged into one profile. The percentage weights of SNV, SV and MSI are 70%, 30% and 10% respectively, which can be redefined in features.py.
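
One way to picture the weighting is to scale each per-type profile by its weight before combining them. The sketch below uses the weights quoted above; whether pyCancerSig rescales the merged profile afterwards is not stated here, so no rescaling is done, and the function name, the key layout and the example fractions are assumptions.

```python
# Weights as quoted in the text above; they can be redefined in features.py.
WEIGHTS = {"SNV": 0.7, "SV": 0.3, "MSI": 0.1}

def merge_profiles(profiles_by_type, weights=WEIGHTS):
    """Scale each profile by the weight of its type and combine the features.

    Keys of the result are (profile type, feature ID) so that identical
    feature IDs from different types cannot collide.
    """
    merged = {}
    for profile_type, profile in profiles_by_type.items():
        for feature, fraction in profile.items():
            merged[(profile_type, feature)] = weights[profile_type] * fraction
    return merged

# Hypothetical per-type profiles of one sample.
sample_profiles = {
    "SNV": {"A[C>T]G": 1.0},
    "SV": {"DEL": 0.5, "DUP": 0.5},
    "MSI": {"A": 1.0},
}
merged = merge_profiles(sample_profiles)
print(merged)
```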

Example run:

cancersig profile merge -i /path/to/first/dir,/path/to/second/dir -o merged_feature.txt

Example run for merging certain profile types (SV and SNV in this case):

cancersig profile merge -i /path/to/first/dir,/path/to/second/dir -o merged_feature.txt --profile_types SNV,SV

Example merged feature file from example input directories -i /path1,/path2,/path3,/path4,/path5,/path6

Examples and details - Step 3 Deciphering mutational signatures

cancersig signature decipher will use an unsupervised learning model to find mutational signature components in the tumors.

Example run:

cancersig signature decipher --mutation_profiles merged_mutational_profile.txt --out_prefix deciphered_output_file_prefix
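
The package description names non-negative matrix factorisation as the decomposition method. For readers unfamiliar with it, here is a generic NumPy sketch using the standard multiplicative update rules. It is not the pyCancerSig code: the update scheme, the iteration count, the error measure and the way candidate signature counts are compared are all assumptions, and only the 2 and 15 defaults come from the usage listing.

```python
import numpy as np

def nmf(V, n_signatures, n_iter=500, seed=0, eps=1e-9):
    """Factorise V (features x samples) into W (signatures) and H (exposures)."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = V.shape
    W = rng.random((n_features, n_signatures)) + eps
    H = rng.random((n_signatures, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def decipher(V, min_signatures=2, max_signatures=15):
    """Return the reconstruction error for each candidate number of signatures."""
    errors = {}
    for k in range(min_signatures, max_signatures + 1):
        W, H = nmf(V, k)
        errors[k] = float(np.linalg.norm(V - W @ H))
    return errors

# Hypothetical catalog: 6 features x 8 samples built from 2 hidden signatures.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))
errors = decipher(V, min_signatures=2, max_signatures=3)
print(errors)
```

The rows of V correspond to the feature rows of the merged profile and the columns to samples, so each column of W is one signature and each column of H holds the signature weights of one sample.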

Example output:

Examples and details - Step 4 Visualizing profile

cancersig signature visualize will

  • display mutational signature composition of the sample
  • display the original mutational profile
  • display the reconstructed mutational profile (based on the recomposition)
  • display the reconstruction error
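
The relation between the four displayed items can be illustrated with a toy calculation: the reconstructed profile is the weighted sum of the signatures, and the error compares it with the original profile. The error measure used here (sum of absolute differences) is an assumption, as are the function names and all numbers.

```python
def reconstruct(signatures, weights):
    """Weighted sum of signatures; signatures[i][j] is feature i of signature j."""
    return [sum(s * w for s, w in zip(row, weights)) for row in signatures]

def reconstruction_error(original, reconstructed):
    """Sum of absolute differences between the two profiles (assumed measure)."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed))

# Hypothetical sample with 3 features and 2 signatures.
signatures = [
    [0.5, 0.0],
    [0.5, 0.5],
    [0.0, 0.5],
]
weights = [0.6, 0.4]          # signature composition of the sample
original = [0.3, 0.4, 0.3]    # original mutational profile

reconstructed = reconstruct(signatures, weights)
error = reconstruction_error(original, reconstructed)
print(reconstructed, error)
```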

Example run:

cancersig signature visualize --mutation_profiles merged_mutational_profile.txt --signatures_probabilities signatures_probabilities.txt --output_dir /path/to/output/dir

Example cancersig profile of sample1,sample2,sample3,sample4, and normalized_weights from input mutation_profile and signatures_probabilities

Runtime estimation

The amount of time needed for processing variants depends on the size of the data and the configuration of the machine. The following performance figures are based on execution on the Uppsala Multidisciplinary Center for Advanced Computational Science computational cluster “bianca”, on a single Intel Xeon E5-2630 v3 core with 8 GB RAM allocated.

  • cancersig profile snv can process 3523 variants/second
  • cancersig profile sv can process 17550 variants/second
  • cancersig profile msi can process 218450 loci/second
  • cancersig signature decipher took 33 minutes to process the combined profiles of 130 samples
  • cancersig signature visualize took 9 seconds to generate the pdf file of one sample.
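
These rates allow a back-of-the-envelope estimate for the profiling steps. The helper below is only arithmetic on the figures listed above; real timings will differ on other hardware, and the function name and the input size are made up for the example.

```python
# Records per second, as reported above for a single core.
RATES = {"snv": 3523, "sv": 17550, "msi": 218450}

def estimated_seconds(step, n_records):
    """Estimate the run time of a profiling step from the reported rate."""
    return n_records / RATES[step]

# Hypothetical input of one million SNVs: a little under five minutes.
seconds = estimated_seconds("snv", 1_000_000)
print(f"{seconds:.0f} seconds")
```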

Workflow flexibility

If users run variant callers whose output format cancersig profile snv, cancersig profile sv, or cancersig profile msi cannot recognize, they can replace any profiler in this package with their own parser. We have provided example input and output files for every process in the example sections. As long as the new parser generates output files in the same format as the given examples, the workflow should continue to work correctly.
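
A custom profiler mainly needs to write the three leading columns followed by a sample column. The sketch below shows one possible writer; the tab delimiter, the header labels and the feature rows are assumptions, so the output should be compared with the example files before being fed into the next step.

```python
import csv
import io

def write_profile(handle, sample_id, rows):
    """Write a one-sample profile with 3 leading columns and one sample column.

    rows: iterable of (variant type, variant subgroup, feature ID, fraction).
    Assumption: tab-delimited output with hypothetical header labels.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Variant type", "Variant subgroup", "Feature ID", sample_id])
    for variant_type, subgroup, feature_id, fraction in rows:
        writer.writerow([variant_type, subgroup, feature_id, fraction])

# Hypothetical features produced by a custom parser.
rows = [
    ("C>T", "ACG", "A[C>T]G", 0.75),
    ("C>A", "TCA", "T[C>A]A", 0.25),
]
buffer = io.StringIO()
write_profile(buffer, "my_sample", rows)
print(buffer.getvalue())
```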

Contact

If you need more information or have any questions, please don't hesitate to contact jessada.thutkawkorapin@gmail.com
