Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tumor only variant calling #81

Open
kokyriakidis opened this issue Jun 24, 2021 · 19 comments
Open

Tumor only variant calling #81

kokyriakidis opened this issue Jun 24, 2021 · 19 comments

Comments

@kokyriakidis
Copy link

Hi everyone,

Can I use this workflow for tumor only variant calling to detect both Somatic and Germline variants? Or should I first run my data with Mutect in tumor only mode and then use varlociraptor?

In general, do you have any recommendations for Tumor only variant calling?

Thank you for your time,
Konstantinos Kyriakidis

@dlaehnemann
Copy link
Contributor

Yes, generally you can use this workflow to call variants on any type of sample setup, including the case where you only have a tumor sample. You can even specify a normal sample along-side it (for which you don't have data) and thus leverage a tumor sample purity value that you usually get for tumor sequencing data. You just have to create a calling scenario that describes your sample setup, have a look at this example:
https://varlociraptor.github.io/varlociraptor-scenarios/scenarios/tumor-only-prior/

And for general infos on what you need to do in calling (and descriptions of all the possible entries in this scenario YAML file), have a look here:
https://varlociraptor.github.io/docs/calling/

For any variant type, you'll need candidate calls. Currently, this pipeline supports generating those candidates with Freebayes (SNVs, MNVs, some short indels) and Delly (insertions and deletions), but varlociraptor itself also supports larger structural variants specified via their breakpoints.

And for getting started with actually deploying the workflow, have a look at the deployment instructions for this workflow (they are also linked to in the main README.md):
https://snakemake.github.io/snakemake-workflow-catalog/?usage=snakemake-workflows%2Fdna-seq-varlociraptor

I hope this gives some orientation around the varlociraptor docs and the capabilities of the caller and the workflow. Don't hesitate to ask further questions via the issues here or in the varlociraptor repo. And feel free to close the issue, if this answers the current question for you.

@kokyriakidis
Copy link
Author

Thank you for your detailed answer!

Is it always better to generate candidates with multiple callers (eg. Mutect + Vardict + Strelka + Octopus) and let Varlociraptor detect the true variants, or are there cases in which using only one caller yields better results? I do not care about runtimes. I care about Precision and Recall.

@dlaehnemann
Copy link
Contributor

For the candidate callers, all you should care about is sensitivity / recall, not precision, as varlociraptor will take care of that and you can easily filter results in a meaningful way using the false discovery rate or Bayesian odds:
https://varlociraptor.github.io/docs/filtering/

Thus, generally using more callers to generate candidates should improve the recall of your results, as every caller will yield different results. Unless one caller yields a clear subset of candidates of another caller, it makes sense to include both callers in candidates calling (if you REALLY don't care about runtimes;). Also, it makes sense to have a look at the possible parameters of candidate callers and turn off any automatic (heuristic) filtering of variants they might be doing.

That said, I would say that for SNVs, using a tool like Freebayes without any filtering is good enough to detect any potential variants as sites with some non-reference nucleotide present. But for MNVs, insertions, deletions and structural variants, the callsets will differ a lot more among callers. So it might make sense to investigate callsets a bot among the respective callers, if you want to increase recall for those variant types. If you do have a look into this, we're interested in any feedback and suggestions you might come up with! Which candidate caller combination is best is still something to be investigated in detail, see e.g. this issue in the varlociraptor repo:
varlociraptor/varlociraptor#149

@kokyriakidis
Copy link
Author

kokyriakidis commented Jun 27, 2021

I will try to investigate a few things about the Tumor-only case in which I need to detect both Germline and Somatic variants:

  1. Whether Mutect2 detects all variants in the sample (both Germline and Somatic), contrasting it with the callset of Haplotypecaller on the same Tumor-only sample. If Mutect2 does not detect all Germline variants then it may worth to also run Haplotypecaller.

  2. Whether Freebayes calls as many variants as the above callers

  3. Investigate the Vardict callset.

  4. How well Varlociraptor works with one callset vs the union of all of them (does it accurately distinguish somatic vs germline variants? or does the number of variants considered change the behavior of Varlociraptor? Does it call the same variant as Somatic the first time vs Germline the other time depending on the number of variants considered? It shouldn't, based on the fact that retrieves all information of a variant directly from the BAM, right?)

My Tumor-only samples are WES so I will not investigate SV callers like Delly.

@kokyriakidis kokyriakidis reopened this Jun 27, 2021
@kokyriakidis
Copy link
Author

kokyriakidis commented Jun 27, 2021

These are the preliminary results using MAF 0.01 (1%) and no filters:

HaplotypeCaller: 75,727 variants detected
Mutect2: 62,190 variants detected
Freebayes: 76,378 variants detected
Vardict: 436,506 variants detected

Ensemble calls (each variant in at least 1 caller): 447,550 variants detected
[gatk-haplotypecaller 919 unique variants
freebayes 3969 unique variants
mutect2 141 unique variants]

Ensemble calls (each variant in at least 1 caller) using PASS filter: ~220,000 variants detected

As you can see Vardict detected most of the variants. I am really worried about keeping those variant that do not have a PASS filter but as you said we want RECALL and Varlociraptor will take care of the rest. Are you confident that I do not have to do a more strict analysis and filtering and Varlociraptor will handle it?

@dlaehnemann
Copy link
Contributor

Nice, thanks a lot for investigating and documenting this here!

For freebayes, are you using the parameters as they are specified in this workflow?

# genotyping is performed by varlociraptor, hence we deactivate it in freebayes by
# always setting --pooled-continuous
extra="--pooled-continuous --min-alternate-count 1 --min-alternate-fraction {}".format(
config["params"]["freebayes"].get("min_alternate_fraction", "0.05")

Namely, this turns off genotyping (speeding up what freebayes does) and basically calls everything with at least one alternative read. For further increasing sensitivity (for low-frequency somatic variants), and as you eventually don't care about runtimes, you should decrease the min_alternate_fraction further down, e.g. to 0.001. This way, freebayes will probably generate quite a bit more candidates, a lot of whom will probably be due to sequencing errors, but some might be low-frequency somatic variants.

As for Vardict, those seem to be a lot of unique variants. But if you don't care about runtimes, varlociraptor should remove all junk calls reliably. I.e. I would not worry about false positives (precision), if you use the false discovery rate for filtering. So definitely no harder filtering necessary by varlociraptor, the false discovery rate should do the job rigorously.

The only trouble you might have with tumor-only calling, might actually be that the lack of data from the normal sample that you specify for the purity, might actually reduce the eventual varlociraptor recall quite a bit. Especially if allele frequencies become ambiguous (for example if the purity is 0.5 and several alternative allele frequencies have multiple interpretations; for example an alternative allele frequency of 0.25 can mean a clonal somatic heterozygous variant, but also a germline heterozygous variant lost somatically). But you can only genuinely address that problem (with any caller) by actually sequencing a normal sample, which is what I would always try to do (or get collaborators to do). But I understand this is not an option in your setup (and have had similar situations myself)...

@kokyriakidis
Copy link
Author

kokyriakidis commented Jun 28, 2021

I am running this code for freebayes:

freebayes -f hg19.fa --genotype-qualities --strict-vcf --ploidy 1 --targets regions.bed --min-repeat-entropy 1 --no-partial-observations --min-alternate-fraction 0.01 --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off -b prealign.bam | bcftools filter -i 'ALT="<*>" || QUAL > 5'
  1. Are you confident that Varlociraptor can accurately call fractions down to 0.001? (I know I am asking the same question over and over haha)

  2. What are your recommendations for FDR on Tumor-only samples? Stay with 5% or decrease it a bit? I will also filter by odds first as you do in your workflow "--odds barely"

  3. It is not feasible to have normal sample right now. Should I keep contamination fraction to 0.25? What did you do in a similar situation?

  4. Should I exclude the following biases in the case of WES data? (--omit-strand-bias, --omit-read-position-bias, --omit-softclip-bias, --omit-read-orientation-bias)

Sorry for the long list of questions. I am just trying to get the most of this analysis.

Thank you so much for addressing my conserns!

@johanneskoester
Copy link
Contributor

johanneskoester commented Jun 28, 2021

  1. Are you confident that Varlociraptor can accurately call fractions down to 0.001? (I know I am asking the same question over and over haha)

Yes, the statistical model will be able to deal with that as long as the coverage is large enough. Of course, probabilities will get weaker and weaker the lower the fraction becomes if the coverage is not sufficient.

  1. What are your recommendations for FDR on Tumor-only samples? Stay with 5% or decrease it a bit? I will also filter by odds first as you do in your workflow "--odds barely"

Actually for FDR control on tumor-only samples I would always provide multiple callsets. The first should control FDR over the somatic event only. The second over the germline event, and the third over the union of both events (the control-fdr command can take multiple events at the same time; it will then calculate the sum of their posteriors before filtering by FDR). This way, you also capture the unclear cases (which will be many). One could also consider just controling the third event and leave the distinguishing to manual inspection. Often this makes more sense (e.g. in a clinical setting).

  1. It is not feasible to have normal sample right now. Should I keep contamination fraction to 0.25? What did you do in a similar situation?

You will need a proper estimate for the contamination, which one can retrieve from pathologists, or by looking at certain variants from which you know that they have to be somatic and clonal (e.g. from your tumor entity). TP53 can perhaps be such a candidate.

  1. Should I include the following biases in the case of WES data? (--omit-strand-bias, --omit-read-position-bias, --omit-softclip-bias, --omit-read-orientation-bias)

Bias estimation usually only needs to be excluded in case of amplicon sequencing (where sometimes all reads are e.g. from the same strand or only a single amplicon overlaps a locus).

@kokyriakidis
Copy link
Author

kokyriakidis commented Jun 28, 2021

What should I expect from the FDR controled union of both events? What differencies may I observe between the somatic only and germline only?

Does FDR control filtering Tag variants to events in the vcf file (in the union case)? eg. variant X SOMATIC

@johanneskoester
Copy link
Contributor

The FDR control will simply output a filtered VCF (actually BCF) with only the variants that are selected (i.e. the FILTER column is not used).

When you take the union of events, you will get both germline and somatic. But the records keep the posterior probabilities, so that you can see the uncertainty between the two cases. If you filter only for somatic or only for germline via the FDR control, cases where Varlociraptor cannot decide between the two will be lost, only the certain cases will remain. And naturally, since you don't have a normal sample in your case, there will be lots of uncertain cases. Hence the union is important to look at.

@johanneskoester
Copy link
Contributor

Btw. a quick hint as your scenario sounds like you are in a clinical context: use local FDR control (--local) instead of global. This way, you also don't need the odds filtering. Global FDR control is more for explorative cases: it will fill up your callset with more uncertain cases up to the desired FDR.

@kokyriakidis
Copy link
Author

kokyriakidis commented Jun 28, 2021

Please correct me if I say something wrong:

If you filter only for somatic or only for germline via the FDR control, cases where Varlociraptor cannot decide between the two will be lost, only the certain cases will remain.

This means that --local and FDR control on Somatic only will give me a filtered bcf with the desired fdr control from cases that are at least favorable to be somatic (those that you call certain)? FDR control will then be applied only to those cases?

And --local and fdr control on Union will treat ALL cases (even the not certain ones) and then apply fdr control on those?

@johanneskoester
Copy link
Contributor

Yes

@kokyriakidis
Copy link
Author

One more question!

The Union will contain uncertain cases (Somatic or Germline) that Varlociraptor is certain that are not sequencing errors etc?

@dlaehnemann
Copy link
Contributor

The short answer is yes. ;)

But maybe this quick example helps for a good intuition of how the FDR control works:

The posterior probabilities that varlociraptor provides for all the events you find in the output BCF add up to 1, as they have been marginalized over their sum. Say you have a site where the PROB_SOMATIC is 0.49 and PROB_GERMLINE is 0.48, with the rest of the probability up to 1 in PROB_ABSENT and PROB_ARTIFACT.This implies that the data does not give a clear indication which one it is, and with tumor-only calling, this can happen quite a bit. The FDR would e.g. be looking at 1 - PROB_SOMATIC and this would be very large. In the case of controlling the local FDR, this would surely be filtered (unless you accept an FDR above 0.51), for the general FDR this depends on the other variants in the callset. Thus, when doing the FDR on either of those events by itself, this will most likely be classified as a false discovery for either of them, as neither probability is very high. But their joint probability, simply the sum PROB_SOMATIC + PROB_GERMLINE = 0.97 is quite high, so would easily pass the FDR filter allowing a maximum of 0.05, for example.

Thus, to generally call variants on a tumor-only sample, controlling for the false discovery rate of both jointly makes the most sense. And yes, this basically gives you all the variants that varlociraptor is sure are real (vs. some type of artifact). This should work with:

varlociraptor filter-calls control-fdr calls.bcf --events SOMATIC GERMLINE --fdr 0.05 --var SNV > calls.filtered.bcf

And then in some cases, the data will be clearer towards SOMATIC or GERMLINE and you can use FDR control to get those. And those you can then clearly label as SOMATIC or GERMLINE for your downstream analyses. For the others, you will probably need some orthogonal data to be sure, or use some good prior knowledge (e.g. the variant is in a gene that usually has somatic hits, or the variant is known as an inherited one, indicating a germline variant).

@kokyriakidis
Copy link
Author

Your answer is exceptional!

Can you please also give me an explanation of how --local behaves?

@dlaehnemann
Copy link
Contributor

dlaehnemann commented Jun 30, 2021

The general FDR approach basically sorts 1 - PROB_EVENT from the smallest to the largest, and finds a cutoff so that the sum of all the sites up to that point, divided by their total number, falls beneath the FDR you want to control for.

The local FDR simply checks whether a site has a 1 - PROB_EVENT that's smaller than the local FDR you want to control for. So the "local" here means local to a particular site.

One key difference is, that with the general FDR approach, you can theoretically have 1 - PROB_EVENT values that are well above the FDR you want to control for. As a simple example, take a setup where you have 4 very very certain variants that all have a 1 - PROB_EVENT of 0.00001. If you want to control for a false discovery rate of 0.05 and apart from these 4 very certain variants, you only have very uncertain ones, the next one in the sorted list will have a great jump. If we go for the maximum jump so that this site is still included in the filter set, this would in this case be (0.05 * 5) - (0.00001 * 4) = 0.24996. So simply due to how the method works, we would be filling up the callset we filter to with a very uncertain variant that has a probability of almost 0.25 of not being a variant.

I hope this gives a good intuition of when to use which. I'd opt for regular / global FDR when going for new discoveries that you then go and analyze further somehow. And especially if you have lots of certain variants you're looking at, and don't expect any such big jumps that fill up your callset, so usually what you'd expect from whole genome or whole exome callsets. If you're looking to identify certain variants from smaller callsets (capture kits, e.g.), I would definitely go for a local FDR. Similarly, if avoiding this "filling up" with uncertain variants is important, e.g. for diagnosis where such uncertain variants could be problematic (you can of course see this by simply looking at the PROB_EVENT, but this requires that the person interpreting the results actually knows this AND has a look).

@kokyriakidis
Copy link
Author

Another exceptional answer! Thank you so much!

@kokyriakidis
Copy link
Author

kokyriakidis commented Jul 12, 2021

Using FDR 5% GLOBAL on SOMATIC event got me around 109.000 variants on exome Tumor-only sample.
Using FDR 1% GLOBAL on SOMATIC event got me around 49.314 variants on exome Tumor-only sample.
Using FDR 5% and LOCAL on SOMATIC event got me around 55.380 variants on exome Tumor-only sample.
Using FDR 10% and LOCAL on SOMATIC event got me around 91.566 variants on exome Tumor-only sample.
Using FDR 12% and LOCAL on SOMATIC event got me around 107.800 variants on exome Tumor-only sample.

So it seems that the GLOBAL 5% FDR gets me variants up to 12% LOCAL FDR to fill up the list of variants.

109.000 seem a lot of Somatic variants for an exome. Mutect2 with filtering gives me around 20.000 variants. So 55.380 variants from LOCAL seem to be reasonable. Do you have any prior expectations about the average number of somatic mutations in exome sequencing data?

Number of total observations before any filtering: 3,134,662.
Using FDR 5% and LOCAL on SOMATIC event got me around 55,380 variants on exome Tumor-only sample.
Using FDR 5% and LOCAL on GERMLINE event got me around 73,312 variants on exome Tumor-only sample.
The UNION 5% FDR LOCAL callset (SOMATIC+GERMLINE) contains 7,863 cases it cannot categorise. This is a relatively small number and I am pretty happy about it!

Can you tell me your thoughts?

I found 9,977 variants that exist in PON provided by GATK and FDR 5% LOCAL Somatic callset.
Creating a FDR 1% LOCAL Somatic callset yields 8,527 variants found both in PON and in my Somatic callset.

Thank you for your time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants