CUPPA: Cancer of Unknown Primary Prediction Algorithm

Introduction

CUPPA is a tissue of origin classifier that uses features derived from whole genome sequencing (WGS; DNA) and/or whole transcriptome sequencing data (WTS; RNA). The Python classifier component of CUPPA is built off the scikit-learn library and is based on multiple logistic regressions chained together. Please see the section Classifier structure for more information.

CUPPA is intended to provide:

Molecular tumor type prediction to verify histopathological classification
Support for specific tumor type classification in case of inconclusive histopathological outcome (differential diagnosis)
Predictions of primary tumor location for Cancer of Unknown Primary (CUP)

Usage

Feature extraction (Java component)

The CuppaDataPrep java class can be used to extract the features required for CUPPA. CuppaDataPrep can be run for a single sample or multiple samples.

Required input files

Tool	Filename suffix	File details
PURPLE	.purple.somatic.vcf.gz	SNVs; used for the GEN_POS and SNV96 features
PURPLE	.purple.sv.vcf.gz	Structural variants
PURPLE	.purple.purity.tsv	Sample sex and WGD presence (amongst other data)
PURPLE	.purple.qc	WGS quality control stats
LINX	.linx.clusters.tsv	Structural variant clusters
LINX	.linx.driver.catalog.tsv	Driver mutations
LINX	.linx.fusion.tsv	Gene fusions
Virus Interpreter	.virus.annotated.tsv	Viral sequence insertions

Single sample example

The below command will extract the DNA features for one sample:

java -cp cuppa.jar com.hartwig.hmftools.cup.prep.CuppaDataPrep \
  -sample COLO829v003T \
  -categories DNA \
  -ref_genome_version V37 \
  -sample_data_dir /dir/with/all/input/files/ \
  -output_dir /output/dir/

This will produce a TSV file with the following format:

Source      Category       Key  Value
   DNA         snv96   C>A_ACA	  131
   DNA	     gen_pos  1_500000      2
   DNA  event.driver  TP53.mut    1.0
   ...           ...       ...    ...

Multi sample example

The below command will extract the DNA and RNA features for multiple samples. The wildcards (*) are converted to sample ids as specified in the -sample_id_file arg.

java -cp cuppa.jar com.hartwig.hmftools.cup.prep.CuppaDataPrep \
  -sample_id_file sample_ids.tsv \
  -categories DNA \
  -ref_genome_version V37 \
  -purple_dir /data/datasets/*/purple/ \
  -linx_dir /data/datasets/*/linx/ \
  -virus_dir /data/datasets/*/virus_interpreter/ \
  -isofox_dir /data/rna/*/ \
  -ref_alt_sj_sites alt_sj.selected_loci.tsv.gz \
  -write_by_category \
  -threads 8

This will produce multi TSV files:

cuppa_data.cohort.sv.tsv.gz
cuppa_data.cohort.alt_sj.tsv.gz
cuppa_data.cohort.feature.tsv.gz
cuppa_data.cohort.gene_exp.tsv.gz
cuppa_data.cohort.sample_trait.tsv.gz
cuppa_data.cohort.snv.tsv.gz

NOTE: the python component of CUPPA does not yet support loading of multi-sample CuppaDataPrep output

All arguments

Argument	Example	Description
`-sample`	COLO829v003T	Sample name
`-sample_id_file`	sample_ids.tsv	One column text file with "SampleId" as the header, and where each row is a sample name
`-categories`	DNA	Valid values: DNA, RNA, ALL (both DNA and RNA)
`-ref_genome_version`	V37	Valid values: V37, V38
`-output_dir`		Directory to write the output files
`-sample_data_dir`		Directory containing all of the above mentioned input files
`-purple_dir`	/data/datasets/*/purple/	Directory containing the PURPLE files. Accepts wildcards (*) that are replaced with sample names
`-linx_dir`	/data/datasets/*/linx/	Directory containing the LINX files. Accepts wildcards (*) that are replaced with sample names
`-virus_dir`	/data/datasets/*/virus_interpreter/	Directory containing the Virus Interpreter files. Accepts wildcards (*) that are replaced with sample names
`-isofox_dir`	/data/rna/*/	Directory containing the ISOFOX files. Accepts wildcards (*) that are replaced with sample names
`-ref_alt_sj_sites`	alt_sj.selected_loci.tsv	TSV file containing the required alternative splice junctions. Required columns: GeneId, Chromosome, PosStart, PosEnd
`-threads`	8	Number of threads to use. Each thread processes one sample at a time
`-write_by_category`		Flag. Output of `CuppaDataPrep` will be split over multiple files to prevent each find being to big. Only applies to multi-sample mode

Classification (Python component)

Installation

The core classification component of CUPPA is a python package (pycuppa) and can be installed using the below bash commands:

## Clone the `hmftools` git repository
git clone https://github.com/hartwigmedical/hmftools/

## Create a python virtual environment
python3 -m venv pycuppa_env

## Activate the environment
source pycuppa_env/bin/activate

## Install `pycuppa` using `pip`. This will also install required python packages
pip install hmftools/cuppa/src/main/python/pycuppa

Running CUPPA within python

Training of CUPPA and making predictions can be done interactively in Python. Please refer to these jupyter notebooks for example code at doc/notebooks/. This below links:

predict_example.ipynb: view on nbviewer.org
train_example.ipynb: view on nbviewer.org

Command line interface

CUPPA can also be called from the command line (which internally calls the PredictionRunner and TrainingRunner classes).

To predict on a single sample, the below example commands can be used. This produces the outputs as specified in section: Classifier output.

## Activate the previously created virtual environment (see above)
source pycuppa_env/bin/activate

## Create output dir
mkdir predict_output/

## Call the module `pycuppa/cuppa/predict.py`
python3 -m cuppa.predict \
--features_path=hmftools/cuppa/src/main/python/pycuppa/cuppa/resources/mock_data/input_data/new_format/COLO829v003T.cuppa_data.tsv.gz \
--classifier_path=hmftools/cuppa/src/main/python/pycuppa/cuppa/resources/cuppa_classifier.pickle.gz \
--output_dir=predict_output/

To train a new model, the following example commands can be used.

## Activate the previously created virtual environment (see above)
source pycuppa_env/bin/activate

## Call the module `pycuppa/cuppa/train.py`
python3 -m cuppa.train \
--features_path=/path/to/features/ \
--metadata_path=/path/to/metadata/ \
--output_dir=/path/to/output_dir/ \
--using_old_features_format \
--n_jobs=5 \
--log_to_file

The below table lists all possible arguments of TrainingRunner and PredictionRunner.

Argument	Runner	Description
`--features_path`	Both	Required. Old features format: Directory with the input features csv files. New features format: Path to the input features tsv file
`--output_dir`	Both	Required. Output directory
`--metadata_path`	TrainingRunner	Required. Path to the metadata file with cancer type labels per sample
`--classifier_path`	PredictionRunner	Required. Path to the pickle file containing a CuppaClassifier python object
`--cv_predictions_path`	PredictionRunner	Path to a CuppaPrediction tsv file containing the cross-validation predictions. If provided, sample ids found in this file will have their predictions returned from this file instead of being computed
`--sample_id`	PredictionRunner	If provided, will prepend `sample_id` to the output filenames when running `PredictionRunner`. `sample_id` will also be reflected in the sample_id column of the output tables
`--compress_tsv_files`	PredictionRunner	Compress tsv files with gzip? (will add .gz to the file extension)
`--using_old_features_format`	Both	Are the features from `features_path` in the old input features format?
`--genome_version`	Both	Genome version, Can be 37 or 38. Used for getting the gen_pos bin names. Default: 37
`--excl_chroms`	Both	Comma separated list of chromosomes to exclude from the gen_pos matrix. E.g. 'chrY' or 'chr1,chr2'. Default: ChrY,Y
`--excl_classes`	TrainingRunner	Comma separated list of cancer subtypes to exclude from training. E.g. 'Breast' or 'Breast,Lung'. Default: '_Other,_Unknown'
`--min_samples_with_rna`	TrainingRunner	Minimum number of samples with RNA in each cancer subtype. If the cancer subtype has fewer samples with RNA than this value, the cancer subtype will be excluded from training. Default: 5
`--fusion_overrides_path`	TrainingRunner	Path to the fusion overrides tsv file. See section FusionProbOverrider for how this file should be formatted
`--cv_folds`	TrainingRunner	Number of cross-validation folds. Default: 10
`--skip_cv`	TrainingRunner	Skip cross-validation?
`--no_cache_training`	TrainingRunner	Don't cache models during cross-validation and training of final model?
`--n_jobs`	TrainingRunner	Number of threads to use for cross-validation. If -1, all threads are used. Default = 1
`--log_to_file`	TrainingRunner	Output logs to a file?
`--log_path`	TrainingRunner	Path to log file. Default: `$output_dir/run.log`

Input files format

Features

The features for one sample are provided to CUPPA in the following format (as described in the previous section):

Source      Category       Key  Value
   DNA         snv96   C>A_ACA	  131
   DNA	     gen_pos  1_500000      2
   DNA  event.driver  TP53.mut    1.0
   ...           ...       ...    ...

Metadata

This file specifies the cancer type labels for each sample. Internally, CancerSubtype is used as the label for each sample, and corresponds to the cancer types that CUPPA classifies.

RnaReadLength together with CancerSubtype is used to create stratified cross-validation splits so that RnaReadLength and CancerSubtype is evenly distributed across the cross-validation folds. RnaReadLength is also used when the min_samples_with_rna argument is specified (see above table).

SampleId CancerType   CancerSubtype  RnaReadLength
sample_1   Prostate        Prostate            151
sample_2       Skin  Skin: Melanoma              0
     ...        ...             ...            ...

Fusion overrides file

  feat_prefix  feat_basename                     target_class
event.fusion.     CBFB_MYH11  Myeloid: Acute myeloid leukemia
event.fusion.   CTNNB1_PLAG1    Head and neck: Salivary gland
event.fusion.      FUS_DDIT3    Bone/Soft tissue: Liposarcoma
          ...            ...                              ...

Classifier output

For each sample CUPPA outputs a visualization of the predictions (PNG file), as well as the raw visualization data as a TSV file. Please see later sections for more info on the features and classifiers shown in the visualization.

Probabilities by classifier

The CUPPA visualization is split into 4 panels. The first panel shows the probabilities for each cancer subtype (columns) for each classifier (rows). The grey tabs indicate cancer type groupings based on molecular/histological supertypes (e.g. Lung cancer) but also by organ type (Bone/Soft tissue cancers). Group probabilities are shown for the COMBINED, DNA_COMBINED and RNA_COMBINED classifiers, where these probabilities are simply the sum of probabilities within the cancer type grouping and classifier (merged cells).

SNV96: Mutational signatures

This panel shows the presence of certain mutational signatures relative to their presence in each cancer type in the training set. Each heatmap cell shows the where a given sample lies in the cancer type cohort distribution for a particular signature (quantile).

NOTE: CUPPA does not use signatures as features internally. These are only displayed for informational purposes to confirm or reject CUPPA's predictions, such as in the following scenario:

A pathologist determines that a tumor is a breast cancer
CUPPA predicts the tumor to be a lung cancer
This tumor has over 10,000 SBS4 (smoking) mutations, which falls within the typical range for lung cancers
This tumor is therefore more likely to be a lung cancer

EVENT: Feature contributions

This panel shows which EVENT features (driver mutations, fusions, viral insertions, and various other features) were most important for the resulting prediction (from the EVENT classifier). Each heatmap cell shows an odds ratio, where a higher odds ratio means the feature was predictive for the respective cancer type. The odds ratio specifically refers to the odds of the sample belonging to the target cancer type vs the odds of not belonging to the target cancer type. For a given feature, the odds ratio is calculated from SHAP values as follows:

shap_value = coef * (x – X_mean) ## SHAP values are in log(odds) space
odds_ratio = e ** shap_value ## Exponentiate by e (=2.718) to get to odds space

DNA_COMBINED: Training set performance

For each cancer type, the number of training samples, as well recall and precision (as determined by cross-validation) are shown to give an idea about how reliably each cancer type can be predicted.

Number of training samples: Cancer types with more training samples are usually more reliably predicted
Recall: A recall of e.g. 0.95 for prostate cancer means that CUPPA expects to miss 5% of prostate cancer cases
Precision: A precision of e.g. 0.95 for prostate cancer means that of 100 samples CUPPA predicts as prostate cancer, 5% of those predictions are wrong

TSV file

The visualization data TSV is a long form table. The first few rows of the table are shown below.

    sample_id     data_type clf_group  clf_name         feat_name  feat_value                               cancer_type  data_value  rank  rank_group
0    SAMPLE_X          prob  combined  combined               NaN         NaN                                Anogenital    0.000218     6           0
1    SAMPLE_X          prob  combined  combined               NaN         NaN  Bone/Soft tissue: Cartilaginous neoplasm         NaN    34           0
2    SAMPLE_X          prob  combined  combined               NaN         NaN                    Bone/Soft tissue: GIST    0.000218     7           0
3    SAMPLE_X          prob  combined  combined               NaN         NaN          Bone/Soft tissue: Leiomyosarcoma    0.000218     8           0
4    SAMPLE_X          prob  combined  combined               NaN         NaN             Bone/Soft tissue: Liposarcoma    0.000218     9           0
          ...           ...       ...       ...               ...         ...                                       ...         ...   ...         ...
635  SAMPLE_X  feat_contrib       dna     event  sv.TELOMERIC_SGL         1.0                                  Prostate    1.000000    30          28
636  SAMPLE_X  feat_contrib       dna     event  sv.TELOMERIC_SGL         1.0                            Skin: Melanoma    1.306432     4          28
637  SAMPLE_X  feat_contrib       dna     event  sv.TELOMERIC_SGL         1.0                               Skin: Other    1.000000    31          28
638  SAMPLE_X  feat_contrib       dna     event  sv.TELOMERIC_SGL         1.0                             Thyroid gland    1.000000    32          28
639  SAMPLE_X  feat_contrib       dna     event  sv.TELOMERIC_SGL         1.0                          Urothelial tract    1.000000    33          28

Below is a description of the columns:

Column name	Column description	Data type	Valid values / Value description
`sample_id`	Sample ID	String
`data_type`	Data type	String	prob: probabilities feat_contrib: feature contributions sig_quantiles: mutational signature quantiles cv_performance: training set cross-validation performance metrics
`clf_group`	Classifier group	String	combined, dna, rna
`clf_name`	Classifier name	String	combined, dna_combined, gen_pos, snv96, event, rna_combined, gene_exp, alt_sj
`feat_name`	EVENT feature name	String	e.g. tmb.snv_count. Only when `data_type` is "feat_contrib" or "cv_performance", otherwise empty
`feat_value`	EVENT feature value	Float	e.g. 20000.0. Only when `data_type` is "feat_contrib", otherwise empty
`data_value`	Value of `data_type`	Float	e.g. if `data_type` is "prob", `data_value` represents probabilities
`rank`	Rank of `data_value`	Integer
`rank_group`	Grouping of [`data_type`, `clf_name`, `feat_name`] by which data_value is ranked	Integer

Features

The features used by CUPPA are derived from DNA and RNA sequencing. These features are provided to CUPPA as a single dataframe containing the features (columns) for each sample (rows).

Source	Feature type	Feature description	No. of features
DNA	GEN_POS	No. of SNVs per 500kb bin across the genome	6,101
DNA	SNV96	No. of SNVs per 96-trinucleotide context	96
DNA	EVENT	Mutations in genes (driver likelihood)	620
DNA	EVENT	Gene fusions (presence)	143
DNA	EVENT	Viral insertions (presence) of HBV, HPV, HSV or MCPyV	4
DNA	EVENT	Structural variant (SV) types	6
DNA	EVENT	Tumor mutational burden (TMB; no. of SNVs and indels)	2
DNA	EVENT	Whole genome duplication (presence)	1
DNA	EVENT	Sample sex (male/female)	1
RNA	GENE_EXP	log(transcripts per million + 1) for each gene	37,985
RNA	ALT_SJ	log(number of reads + 1) for each alternative splice junction	236,328

GEN_POS

The genomic position of somatic SNVs, also known as regional mutational density (RMD). These have been previously associated with tissue type-specific chromatin states [Lehner et al. 2012, Nature] and were shown to have strong predictive power for tissue of origin [Jiao et al. 2020, NatComm].

For CUPPA, we count the number of SNVs in each 500kb bin across the genome for each sample. Chromosome Y is excluded to avoid sex-related effects. We also exclude the mutations characteristic of AID-APOBEC hypermutation (T[C>T]N and T[C>G]N) as we observe them to have a distinct genomic position profile which does not generally match the cancer specific profile (predominantly affecting Lung, Breast and Urothelial cancers).

SNV96

The SNV counts in the 96-trinucleotide contexts reflects specific mutagenic effects such as UV radiation, smoking, and background mutational processes. While these trinucleotide contexts are often used for the COSMIC single base substitution (SBS) signatures, CUPPA does not use the COSMIC signatures directly. However, the SNV96 sub-classifier is designed to capture the obvious similarities that can also be observed via signatures.

EVENT

The EVENT group of features contains various driver and passenger mutational features that are specific to certain cancer types.

Driver mutations

Driver (or driver like) features used include all driver point mutation, high amplification, homozygous deletion and homozygous disruptions in the driver catalog. Homozygous deletions, disruptions and mutations drivers are considered together, but amplifications are treated as a separate feature as some oncogenes tend to be amplified in specific cancer types and mutated in others (notable examples are KRAS, EGFR, SPOP & FOXA1).

Indels in repeat contexts of 6 or fewer bases in 3 lineage defining genes: ALB (highly specific to Liver cancer) and SFTPB & SLC34A2 (highly specific to Lung cancer) are also treated as additional features (note though that they are ignored for MSI samples). A set of Lung cancer specific EGFR hotspots (including T790M, L858R and exon 19 and 20 inframe deletions) are also treated as a single feature.

Fusions

Known pathogenic fusion pairs, immunoglobulin (IG) rearrangement pairs and exon deletions/duplications configured in the HMF fusion knowledge base are all considered as features, as are fusions with highly promiscuous exons such as ALK exon 20-21.

Viral insertions

Some viruses can integrate their DNA into the host genome and are thus carcinogenic [Zapatka et al. 2020, NatGen]. Some of these are also cancer type specific and have been included as features, including human papillomavirus (HPV; cervical and head & neck cancers), merkel cell polyomavirus (MCPyV; skin carcinomas), hepatitis B virus (HBV; liver cancer), and herpes simplex virus (HSV; various cancer types).

Structural variants (SVs)

CUPPA uses various cancer type specific SV types as features:

LINE: Long interspersed nuclear elements. Often found in esophageal, head & neck, and colorectal cancers [Roderiguez-Martin et al. 2020, NatGen]
SIMPLE_DEL_20KB_1MB: Simple structural deletions, 10kb to 1Mb in size
SIMPLE_DUP_32B_200B: Simple structural duplications, 30bp to 200bp in size
SIMPLE_DUP_100KB_5MB: Simple structural duplications, 100kb to 5Mb in size
MAX_COMPLEX_SIZE: Size of the largest complex structural event (number of chained SVs), a proxy measure for the presence of chromothripsis
TELOMERIC_SGL: Telomeric single breakends. Associated with telomere lengthening in Leiomyosarcomas [Chudasama et al. 2018, NatComm]

Tumor mutational burden (TMB)

Certain cancer types have an extremely high number of mutations, such as skin and lung cancers (due to UV radiation and smoking respectively), while other have very few mutations, such as medulloblastomas and pilocytic astrocytomas (due to typically being pediatric cancers). CUPPA thus uses the total number of SNVs and indels as features.

Whole genome duplication (WGD)

The rate of WGD varies by cancer type [Priestley et al. 2019, Nature].

Sex

Cancers of the reproductive organs are exclusively male (e.g. prostate cancer) or female (e.g. ovarian, uterine, cervical cancers). Sex (male/female) is determined using the copy number of the sex chromosomes and is used as a feature in CUPPA.

GENE_EXP

Genes are differentially expressed across tissue types and can thus be used for CUP classification. Normalized transcript counts for each gene in log(transcripts per million + 1) are used as input for CUPPA.

ALT_SJ

A novel splice junction is defined in this context as any splice junction that is not annotated in ENSEMBL. A set of recurring novel splice junctions sites were identified within each cancer cohort – i.e. those with 3 or more fragments supporting a novel site in 2 or more samples. Intronic splice sites were excluded from the analysis. A reference file was then formed by calculating the average fragment count per cancer cohort at each of these novel sites.

Classifier structure

This section provides details on each component of the CUPPA classifier. CamelCase headings refer to the class name of the respective component.

CuppaClassifier

CUPPA is composed of 2 main layers of multinomial logistic regressions, and a final post-processing layer:

Sub-classifier layer: Each of the 5 feature types (GEN_POS, SNV96, EVENT, GENE_EXP, ALT_SJ) is passed as input to a sub-classifier for the corresponding feature type. Each sub-classifier performs some feature transformations, after which a logistic regression weighs the importance of each transformed feature.
Meta-classifier layer: This layer uses 2 logistic regressions to weigh the importance of the DNA and RNA sub-classifiers, followed by some post-processing steps.
Combined layer: Combines the probabilities from the meta-classifiers by multiplying them together.

LogisticRegression

The logistic regressions in all CUPPA classifiers share the following arguments:

multi_class="multinomial": A multinomial regression is used as our task is to classify multiple cancer types ('classes')
solver="saga": The SAGA (Stochastic Average Gradient Accelerated) solver is used as it is the only one that supports multinomial LASSO logistic regression in scikit-learn. SAGA is an optimized variant of stochastic gradient descent (SGD).
class_weight="balanced": Some cancer types in the training set have more samples than others. The "balanced" argument forces the logistic regression to optimize equally every cancer type cohort by weighing cohorts by the inverse of their sample size: cancer_type_weight = n_samples_total x (1 / n_samples_in_cancer_type).
random_state=0: Set the random seed so that re-running the training produces the same model.

The GEN_POS, SNV96, and EVENT logistic regressions use the following non-default arguments:

penalty="l1": LASSO (aka L1) regularization is used to remove irrelevant features during training by setting their weights to 0
max_iter=1000: Number of gradient descent iterations

The GENE_EXP and ALT_SJ logistic regressions use the following non-default arguments:

penalty="l1": See above
max_iter=10000: See above

The DNA_COMBINED and RNA_COMBINED logistic regressions use the following non-default arguments:

penalty='l2': Ridge regularization reduces the weights of less relevant features. Ridge is used instead of LASSO here as all output probabilities from sub-classifiers are relevant albeit to different extents.
C=4: Increased C from 1 (default) to 4 to reduce the strength of ridge regularization
max_iter=1000: See above

ProbCombiner

This layer combines the probabilities of the DNA_COMBINED and RNA_COMBINED meta-classifiers by multiplying them together. It is however possible that the probability for a cancer type is 0.99 for DNA_COMBINED but 0.00 for RNA_COMBINED, which when multiplied together equals 0.00. To avoid this issue, all probabilities <0.01 are set to 0.01 (prob_floor=0.01) prior to multiplication. After multiplying the probabilities, they are normalized to sum to 1.

Feature transformers

ProfileSimilarityTransformer

For the GEN_POS classifier, consensus GEN_POS profiles are calculated per cancer type at training time. At prediction time, the cosine similarity of a sample's profile to each per cancer type profile is then calculated.

The consensus profile is the sum (agg_func="sum") of each 500k bin in samples of a particular cancer type. Sum is used instead of mean as we want samples with low TMB (which have a noisier GEN_POS profile) to contribute less to the consensus profile.

Conversely, we want hypermutator samples to contribute less to the consensus profile as to not dominate the profile. Thus, if a sample has more than 10,000 SNVs (count_ceiling=10000), the GEN_POS profile of that sample will be scaled down such that the TMB sums to 10,000.

For the GENE_EXP and ALT_SJ classifiers, no count_ceiling is used and the mean (agg_func="mean") to generate the consensus profiles. This is because the (log transformed) GENE_EXP and ALT_SJ features are not skewed by outliers.

NoiseProfileAdder

For cancer types cohorts with few samples, the addition of one hypothetical new sample would result in a drastic change in the consensus profile. Small cancer type cohorts therefore tend to have highly variable consensus GEN_POS profiles.

To counteract this, we add a background 'noise' GEN_POS profile totalling 500 mutations (noise_counts=500) to every sample before generating the consensus profiles. The noise GEN_POS profile is the median of the per-cancer type profiles as generated by ProfileSimilarityTransformer, and is calculated at training time.

The GEN_POS consensus profiles of small cohorts will therefore tend towards the noise profile, thereby diminishing the importance of GEN_POS for these cohorts (and increasing the relative importance of other feature types).

NonBestSimilarityScaler

A major problem with cosine similarities is that they tend towards 1.0 with higher the numbers of dimensions (i.e. features). With 6101 GEN_POS bins, the (sorted) cosine similarities are often squished towards 1.0 like this:

cos_sims = [0.95, 0.92, 0.87, 0.86, 0.75, …]

However, we want larger differences between consecutive cosine similarities as this improves the performance of the downstream logistic regression. To achieve this, we perform the following computations:

exponent = 5
max_value = max(cos_sims) ## 0.95
scale_factors = cos_sims - max_value + 1 ## [1.00, 0.97, 0.92, 0.91, 0.80, ...]
scale_factors = scale_factors ** exponent ## [1.00, 0.86, 0.67, 0.62, 0.32, ...]
cos_sims_scaled = scale_factors * cos_sims ## [0.95, 0.79, 0.57, 0.54, 0.23, ...]

Larger exponent values push the bottom values closer to zero. exponent values were empirically chosen based on the one that yield the best performance:

GEN_POS: exponent=5
GENE_EXP: exponent=5
ALT_SJ: exponent=2

Log1pTransformer

Log transformation is performed to eliminate the effect of outliers (e.g. samples with exceptional high number of SNVs). The "+1" pseudocount is to prevent calculating log(0) which is undefined.

MaxScaler

Max scaling is performed to scale (non-negative) features to range from 0 to 1. This is because the downstream logistic regression expects features to be (roughly) in the same scale and range. This is performed for the SV and TMB features as well as for the GENE_EXP and ALT_SJ features.

Chi2FeatureSelector (drivers, fusions, virus, sex, whole genome duplication)

The chi2 test is used to as a preliminary feature selection step remove irrelevant features. This test which quantifies how much each feature varies across every cancer type in the training set compared to an expected (background) value for that feature. The below pseudo-code describes the feature selection procedure.

## Inputs
X ## Feature values per sample. Matrix of shape [n_samples, n_features]
y ## Cancer type labels per sample. Array of length n_samples

## Sum of the values of each feature across all samples. Array of length n_features
feature_counts = X.sum(axis="columns") 

## Proportions of each cancer type in `y`. Array of length n_cancer_types
cancer_type_freq = count_samples_per_cancer_type(y) / len(y)

## Calculate expected counts per feature per cancer type. Matrix of shape [n_cancer_types, n_features]
expected_counts = dot_product(cancer_type_freq, feature_counts)

## Calculate observed feature counts per cancer type
X_split = split_rows_by_cancer_type(X)
observed_counts = [X_i.sum(axis="columns") for X_i in X_split] ## List of length n_cancer_types
observed_counts = as_matrix(observed_counts) ## Matrix of shape [n_cancer_types, n_features]

## Calculate chi2 statistic. Array of length n_features
chi2_stat = (observed_counts – expected_counts)**2

## Calculate pvalue = area under the right-hand tail of the chi2 probability density function
## Array of length n_features
pvalue = chi2_pdf(chi2_stat)

## Calculate false discovery rate and select features
## Array of length n_features
fdr = benjamini_hochberg(pvalue) 
selected_feature_names = X.columns[fdr<0.001]

For the GENE_EXP and ALT_SJ classifiers, MaxScaledChi2FeatureSelector combines the MaxScaler and Chi2FeatureSelector steps.

NaRowFilter

Samples with no RNA data will bypass the RNA classifiers (GENE_EXP, ALT_SJ, RNA_COMBINED).

Adjustment of probabilities

RollingAvgCalibration

Probability calibration is performed to ensure that a probability of 0.8 means that: of 100 samples predicted as e.g. breast cancer, 80 of these are actually breast cancer. Logistic regressions generally output under-confident probabilities, e.g. a probability of 0.75 in practice means a probability of 0.85. Calibration of logistic regression probabilities increases the number of high confidence probabilities (>0.8, the threshold for a diagnosis), allowing for more CUP patients to be diagnosed. CUPPA uses a modified probability calibration approach based on isotonic regression. For each cancer type, this is performed with the below procedure:

Given:

An array of probabilities
An array of booleans, indicating whether a sample belongs to a cancer type

Perform the following steps:

Sort both arrays by the probabilities
Slide a window along the boolean array and take the mean of the values in the window at each step
Fit an isotonic regression, with the x-values being the probabilities and y-values being the sliding window means

This approach results in more granular y-values, which improves the isotonic regression fit.

After calibrating the probabilities, they are normalized to sum to 1.

FusionProbOverrider

Certain gene fusions are highly characteristic of certain cancer types, e.g. RUNX1-RUNX1T1 for acute myeloid leukemia (AML), but were not prevalent enough in the training set (i.e. too few training samples) to be picked up as dominant feature for that cancer type. If a RUNX1-RUNX1T1 fusion were found in a sample for example, we would force a high probability of AML. This 'fusion override' for this example sample is done using the below pseudo-code:

## Raw probabilities for sample with RUNX1-RUNX1T1 fusion
## Acute myeloid leukemia (AML) is often misclassified as myeloproliferative neoplasm (MPN)
classes = ["AML", "MPN", "Liver", "Prostate", ...]
probs = [0.26, 0.31, 0.02, 0.01, ...]

## Make 'mask' array
## 1.00 for target cancer type(s). Can be multiple cancer types
## 0.01 for other cancer types. Value specified with the `mask_base_value` argument
mask = [1.00, 0.01, 0.01, 0.01, ...]

## Apply mask
probs = probs * mask ## [0.26, 0.03, 0.00, 0.00, ...]

## Make probs sum to 1
probs = probs / sum(probs)

In total, there are 64 pathognomic fusion-cancer type mappings for which we apply a fusion override. These mappings are provided to the overrides argument of FusionProbFilter as a dataframe in the format shown in section Fusion overrides file

After overriding the probabilities, they are normalized to sum to 1.

SexProbFilter

It is possible for e.g. a female sample to have a non-zero probability of Prostate, since the logistic regressions do not treat female and male cancer types differently. We therefore force the probability of male cancer types to be 0 in female cancer types, and vice versa.

Keywords for male and female cancer types are specified to the keywords_male_classes and keywords_female_classes SexProbFilter arguments. SexProbFilter will be applied to any cancer types containing these keywords (case insensitive). By default, the keywords are:

_DEFAULT_KEYWORDS_MALE_CLASSES = [
    "prostate",
    "penis","penile",
    "testis","testicular"
]

_DEFAULT_KEYWORDS_FEMALE_CLASSES = [
    "ovary","fallopian","ovarian",
    "uterus","uterine","endometrium","endometrial",
    "cervix","cervical"
] ## We do not override "breast" as breast cancers are also found in male patients

After overriding the probabilities, they are normalized to sum to 1.

Training set

Sample selection

Samples were included in the training set if they passed the following criteria:

From the purity SQL table:
- qcStatus == "PASS"
- purity > 0.195
- fitMethod != "NO_TUMOR"
From LAMA (patient consent database):
- intUse == TRUE
- storeData == TRUE

As multiple samples can originate from the same patient, we de-duplicate samples by prioritizing samples with 151bp read length RNA data, followed by highest tumor purity.

Cancer subtype definitions

The below table summarizes the cancer subtype mappings based on clinical annotations in the HMF database as well as in the PCAWG metadata. The definitions are specified in following format:

primaryTumorLocation/pcawgCancerType: primaryTumorSublocation
(primaryTumorType: primaryTumorSubType)
<histology>

Additionally, two special operators are used:

[ ]: list/grouping
not: excludes the proceeding cancer types

Cancer subtype	Rules	Samples	Samples w/ RNA
Anogenital	Anus, Penis, Vulva, Vagina, Uterus: Cervix, Cervix-AdenoCA, Cervix-SCC	128	46
Bone/Soft tissue: Cartilaginous neoplasm	Bone-Cart	9	0
Bone/Soft tissue: GIST	Bone/Soft tissue (Gastrointestinal stromal tumor)	66	27
Bone/Soft tissue: Leiomyosarcoma	Bone/Soft tissue (Leiomyosarcoma), SoftTissue-Leiomyo	80	19
Bone/Soft tissue: Liposarcoma	Bone/Soft tissue (Liposarcoma), SoftTissue-Liposarc	59	5
Bone/Soft tissue: Osteosarcoma	Bone/Soft tissue (Osteosarcoma), Bone-Osteosarc	47	0
Bone/Soft tissue: Undiff. sarcoma	Bone/Soft tissue (Undifferentiated sarcoma), Bone/Soft tissue (Fibrous histiocytoma)	37	6
Bone/Soft tissue: Other	Bone/Soft tissue ([not Skin, unspecified]), Bone-Epith, Bone-Osteoblast	157	49
Breast: Other	Breast (unspecified: not Triple negative), [Breast-AdenoCA, Breast-DCIS, Breast-LobularCA]	908	412
Breast: Triple negative	Breast (unspecified: Triple negative), [Breast-AdenoCA, Breast-DCIS, Breast-LobularCA]	183	59
CNS: Glioma	Nervous system (Glioma), CNS-GBM, CNS-Oligo	159	63
CNS: Medulloblastoma	Nervous system (unspecified: Medulloblastoma), CNS-Medullo	138	0
CNS: Pilocytic astrocytoma	CNS-PiloAstro	58	0
Colorectum/Small intestine/Appendix	[Colorectum, Appendix, Small intestine] (not Neuroendocrine tumor), ColoRect-AdenoCA	837	336
Esophagus/Stomach	[Esophagus, Stomach, Gastroesophageal] (not Neuroendocrine tumor), Stomach-AdenoCA, Eso-AdenoCA	349	130
Gynecologic: Endometrium	Uterus: Endometrium (not Neuroendocrine tumor)	85	24
Gynecologic: Ovary/Fallopian tube	[Ovary, Fallopian tube, Ovary-AdenoCA] (not Neuroendocrine tumor)	320	112
Head and neck: Adenoid cystic	(Carcinoma: Adenoid cystic carcinoma)	10	5
Head and neck: Salivary gland	Salivary gland, Trachea, Head and neck: [Salivary gland, Parotid gland, Sublingual gland]	26	15
Head and neck: Other	Head and neck: unspecified	101	28
HPB: Bile duct/Gallbladder	Bile duct, Hepatobiliary system, Gallbladder, Biliary-AdenoCA	131	78
HPB: Liver	Liver, Liver-HCC	341	35
HPB: Pancreas	Pancreas (not Neuroendocrine tumor), Panc-AdenoCA	372	65
Kidney: Chromophobe	Kidney (unspecified: Chromophobe renal cell carcinoma), Kidney-ChRCC	44	0
Kidney: Other	Kidney (unspecified), Kidney-RCC	277	76
Lung: Non-small cell: LUAD	Lung [(Carcinoma: Adenocarcinoma), ], Lung-AdenoCA	254	107
Lung: Non-small cell: LUSC	Lung [(Carcinoma: Squamous cell carcinoma), ], Lung-SCC	78	12
Lung: Small cell	Lung (Carcinoma: [Small cell carcinoma, Small cell carcinoma combined type])	56	35
Lymphoid tissue	Lymphoid tissue, Lymph-BNHL, Lymph-CLL	216	20
Mesothelium	Mesothelium	77	23
Myeloid: Acute myeloid leukemia	Myeloid-AML	12	0
Myeloid: Myeloproliferative neoplasm	Myeloid-MPN	18	0
NET: Colorectum/Small intestine	[Colorectum, Small intestine] (Neuroendocrine tumor)	56	34
NET: Lung	Lung (Neuroendocrine tumor)	46	5
NET: Pancreas	Pancreas (Neuroendocrine tumor), Panc-Endocrine	118	21
Prostate	Prostate (not Neuroendocrine tumor), Prost-AdenoCA	607	161
Skin: Melanoma	Skin (Melanoma), Skin-Melanoma	408	133
Skin: Other	Skin (unspecified), Bone/Soft tissue: Skin	49	19
Thyroid gland	Thyroid gland, Thy-AdenoCA	71	16
Urothelial tract	Urothelial tract, Bladder-TCC	241	45

Cancer subtype groups

Grouping of subtypes are made based on molecular/histological supertypes (e.g. Lung cancer) but also by organ type (Bone/Soft tissue cancers) for the visualization of CUPPA.

The cancer subtype group is simply the prefix before the first colon (":"). For example, "Bone/Soft tissue: GIST" maps to the "Bone/Soft tissue" group. Subtypes with no prefix (i.e. no ":") belong to their own standalone group. For example, "Anogenital" maps to the "Anogenital" supertype.

CUPPA internally predicts subtype probabilities. Group probabilities are simply the sum of the subtype probabilities of a particular group.

Training procedure

The training data for CUPPA includes: i) a dataframe containing the features (columns) per sample (rows), and ii) an array of known cancer type labels for each sample. All samples from the training set are used to train the final CUPPA classifier.

10-fold cross-validation was used to assess the performance of CUPPA. Here, the training data was split into 10 parts, training was performed on 90% of samples, and tested on the remaining 10% of samples. This was repeated for the 10 different 'folds', each yielding cancer type probabilities for a different 10% subset of samples, and ultimately for every sample in the training set. These probabilities were then used to calculate various performance metrics (e.g. number of samples correctly predicted per cancer type).

Files

README.md

Latest commit

History

README.md

File metadata and controls

CUPPA: Cancer of Unknown Primary Prediction Algorithm

Introduction

Table of contents

Usage

Feature extraction (Java component)

Required input files

Single sample example

Multi sample example

All arguments

Classification (Python component)

Installation

Running CUPPA within python

Command line interface

Input files format

Features

Metadata

Fusion overrides file

Classifier output

Probabilities by classifier

SNV96: Mutational signatures

EVENT: Feature contributions

DNA_COMBINED: Training set performance

TSV file

Features

GEN_POS

SNV96

EVENT

Driver mutations

Fusions

Viral insertions

Structural variants (SVs)

Tumor mutational burden (TMB)

Whole genome duplication (WGD)

Sex

GENE_EXP

ALT_SJ

Classifier structure

CuppaClassifier

LogisticRegression

ProbCombiner

Feature transformers

ProfileSimilarityTransformer

NoiseProfileAdder

NonBestSimilarityScaler

Log1pTransformer

MaxScaler

Chi2FeatureSelector (drivers, fusions, virus, sex, whole genome duplication)

NaRowFilter

Adjustment of probabilities

RollingAvgCalibration

FusionProbOverrider

SexProbFilter

Training set

Sample selection

Cancer subtype definitions

Cancer subtype groups

Training procedure

Version History and Download Links