Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question re. reproducing the example VCF file #198

Open
BarryDigby opened this issue Nov 2, 2022 · 7 comments
Open

Question re. reproducing the example VCF file #198

BarryDigby opened this issue Nov 2, 2022 · 7 comments

Comments

@BarryDigby
Copy link

BarryDigby commented Nov 2, 2022

Hi Sigve,

I'm curious if you have any hints towards producing the combined TAL field for a variant site that has been called by mutliple tools for a sample - just like your example VCF file:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	TCGA-A2-A04W-01A-31D-A10Y-09_TCGA-A2-A04W-10A-01D-A110-09
1	1900265	.	C	T	.	PASS	TCGA_CODE=BRCA;TVAF=0.222222222222;TDP=144;TAL=muse,mutect2,somaticsniper,varscan2	GT:DPC:DPT:ADC:ADT:AL	0/1:237:144:.,.:112,32:1,3,4,2

I have the following minimal example, a variant that has been called by Mutect2, FreeBayes, and Strelka, respectively. I have manually calculated the TAF/NAF, TDP/NDP and TAL/AL INFO/FORMAT fields:

chrX	154405616	.	G	A	.	PASS	AS_SB_TABLE=25,31|21,33;DP=112;ECNT=1;MBQ=20,20;MFRL=154,183;MMQ=60,60;MPOS=33;NALOD=1.58;NLOD=11.14;POPAF=3.18;TLOD=173.02;TDP=54;NDP=56;TAF=1;NAF=0;TAL=mutect2	GT:AD:AF:DP:F1R2:F2R1:FAD:SB:AL	0/0:56,0:0.026:56:9,0:23,0:37,0:25,31,0,0:1	0/1:0,54:0.976:54:0,8:0,29:0,39:0,0,21,33:1
chrX	154405616	.	G	A	1892.72	PASS	AB=0;ABP=0;AC=2;AF=0.5;AN=4;AO=56;CIGAR=1X;DP=112;DPB=112;DPRA=1;EPP=3.63072;EPPR=3.63072;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=2;NUMALT=1;ODDS=37.8354;PAIRED=1;PAIREDR=0.982143;PAO=0;PQA=0;PQR=0;PRO=0;QA=2172;QR=2214;RO=56;RPL=35;RPP=10.6105;RPPR=3.16541;RPR=21;RUN=1;SAF=23;SAP=6.88793;SAR=33;SRF=25;SRP=4.40625;SRR=31;TYPE=snp;technology.ILLUMINA=1;TDP=56;NDP=56;TAF=1;NAF=0;TAL=freebayes	GT:AD:AO:DP:GQ:PL:QA:QR:RO:AL	0/0:56,0:0:56:99:0,169,1994:0:2214:56:2	1/1:0,56:56:56:99:1957,169,0:2172:0:0:2
chrX	154405616	.	G	A	.	PASS	DP=112;MQ=60;MQ0=0;NT=ref;QSS=309;QSS_NT=3070;ReadPosRankSum=0;SGT=GG->AG;SNVSB=0;SOMATIC;SomaticEVS=19.73;TQSS=1;TQSS_NT=1;TDP=56;NDP=55;TAF=1;NAF=0;TAL=strelka	DP:FDP:SDP:SUBDP:AU:CU:GU:TU:AD:AL	55:0:0:0:0,0:0,0:55,56:0,0:0,0:3	56:0:0:0:56,56:0,0:0,0:0,0:0,56:3

Unfortunately, the deprecated CombineVariants, bcftools merge and GATK MergeVcfs didn't do the trick. Is there a tool or set of packages you have used for this previously?

Thinking out loud.. might it be a good start to reduce the VCF files to the bare minimum required by PCGR (call confidence, dp_tumor, dp_control, af_tumor, af_control)? It looks like the genotype information is not required

def simplify_vcf(input_vcf, vcf, output_dir, keep_uncompressed, logger, debug):
"""
Function that performs the following on the validated input VCF:
1. Strip of any genotype data

Best,

Barry

@sigven
Copy link
Owner

sigven commented Nov 2, 2022

Hi Barry,

Thanks for reaching out! And yeah, you have figured essentially all things out correctly, meaning
that PCGR ignores the genotype fields altogether, simply relying on a few INFO fields/tags provided by the user (TDP, TAF etc.. you can of course name them as you like as long as you pass them to PCGR).

In an ideal world (in my opinion..:-)), all somatic variant callers should have formatted their genotype data in the same manner, but unfortunately, it is our experience that there are a lot of exceptions here, which is why it is inherently difficult to grab this automatically. I have never gotten around to create a versatile VCF genotype grabber that can accomplish this. Others have pointed to the same issue, and even made some attempts to accomplish it:
#136

As for now, the consequence is that the user needs to do some preparation of the input VCF, which is where you are at the moment, raising the bar somewhat on how easy one can feed a somatic VCF directly to PCGR. Any way, I have not been able to find a tool that can combine somatic variant call VCFs (so that it tags each variant with the caller that called the specific variant (e.g. the TAL field) - even that is of course a subjective matter - people may potentially use different criteria to consider a "PASSED" MuTect call vs. a "PASSED" Strelka call, using caller-specific confidence properties). I have pretty much ended up with writing a small script that combines the set of VCFs called for a given sample (T/N pair), often considering the average when it comes to AF/DP values (callers may differ slightly wrt the reads that go into these calculations). Given your already great understanding, I suspect you should be able to pull this through - although standing on some existing tools would make your (and my) life much easier! :-)

Sorry not being able to help you further here, but hopefully this gives you some additional perspectives.

kind regards,
Sigve

@sigven
Copy link
Owner

sigven commented Nov 2, 2022

If you want to contribute code that will enhance this part, you are of course more than welcome :-)

@BarryDigby
Copy link
Author

No worries Sigve, thanks for the prompt response!

I have been developing a nextflow pipeline for PCGR for an internship, the main idea being that it can work directly off the outputs from one of nf-cores flagship pipelines - sarek. I made a stab at extending the utility of the script produced by gudeqing (#136), it can now handle mutect2, strelka and freebayes outputs. The missing piece is consolidating the algorithm calls and as you mentioned, averaging AF/DP values. Thanks for the discussion, it's encouraging to know I'm at least thinking on the right track. I'll keep chipping away at it and let you know if I come up with a solution :)

p.s in the event I get this workflow over the line would you be ok with me sharing it with the nf-core community? It's in dev and lives here https://github.com/BarryDigby/nf-core-pcgr

@sigven
Copy link
Owner

sigven commented Nov 2, 2022

This is more than welcome, Barry. Keep up the good work !

@BarryDigby
Copy link
Author

Hi @sigven, just wanted to ask a similar follow-up question re. CPSR here, if that is ok.

What are the minimum input requirements for CPSR in terms of VCF fields? I have noticed that cpsr_validate_input.py does not strip the VCF as heavily as pcgr_validate_input.py does. On the other hand, CPSR works without requiring any manipulation of VCF files from strelka, haplotypecaller, freebayes and deepvariant from my testing, so I'm hoping this means CPSR uses very simplified versions of these VCF files.

The goal is to merge multiple germline VCF files before running CPSR on a per-sample basis, but I'm not sure what genotype information I should be averaging &/or omitting, as I did for PCGR TAF/NAF/TDP/NDP.

Thanks again,

-Barry

@sigven
Copy link
Owner

sigven commented Jan 5, 2023

Hi Barry,

Thanks for reaching out! The input requirements for CPSR is not as strict as they are for PCGR. Basically, it expects a single-sample VCF file, preferably also with the FORMAT and sample genotype data columns available. Notably, if these are present (e.g. GT:DP:.. etc), CPSR will attempt to pull out the genotype status pr. variant (heterozygous or homozygous (0/1 or 1/1), if not present, the genotype status pr. variant will be undetermined. See also:

https://github.com/sigven/pcgr/blob/master/pcgrr/R/utils.R#L1315-L1337
https://sigven.github.io/cpsr/articles/input.html

Genotypes are not used to a great extent for the variant classification etc., but is useful for final interpretation of the variants. If you have merged multiple germline variant calls, I think the main point is to ensure that the genotype (GT) is non-ambiguous. We might of course also add a confidence tag denoting the algorithms that called each variant (as is done for PCGR), but I think this is generally more used in the somatic setting (using multiple callers that is, germline calling is typically more robust using a single caller).

There are basically no requirements as to depth and allelic fraction for now, and while allelic fraction are not that relevant for germline calls, depth might be useful for filtering, so we might add this at a later point, but for now, there are no requirements for this part.

One part that I can recommend to ensure prior to analysis with CPSR is to decompose the variants, meaning that there are no multiallelic variant records. CPSR will do so also internally (using vcftools), but it's not bullet-proof I believe, so if you can put some effort into this, I would recommend doing that.

Thanks again for working with the workflows! Data bundle updates to come :-)

best,
Sigve

@BarryDigby
Copy link
Author

Thank you for all of your help Sigve!

The workflow is 99% ready at this repository https://github.com/BarryDigby/nf-pcgr I would be keen for you to cast a critical eye over it if you get the chance :)

The only thing that remains is to touch up documentation and test it on large VCF files.

I've noticed you advised subsetting WGS tumor-only variants to exonic regions, I can add a process & parameter for this recommendation. I'm hoping users will apply sensible filtering expressions using bcftools filter to reduce the input variant set. Is there anything else that jumps to mind w.r.t filtering input files?

Best ,

Barry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants