New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question re. reproducing the example VCF file #198
Comments
Hi Barry, Thanks for reaching out! And yeah, you have figured essentially all things out correctly, meaning In an ideal world (in my opinion..:-)), all somatic variant callers should have formatted their genotype data in the same manner, but unfortunately, it is our experience that there are a lot of exceptions here, which is why it is inherently difficult to grab this automatically. I have never gotten around to create a versatile VCF genotype grabber that can accomplish this. Others have pointed to the same issue, and even made some attempts to accomplish it: As for now, the consequence is that the user needs to do some preparation of the input VCF, which is where you are at the moment, raising the bar somewhat on how easy one can feed a somatic VCF directly to PCGR. Any way, I have not been able to find a tool that can combine somatic variant call VCFs (so that it tags each variant with the caller that called the specific variant (e.g. the TAL field) - even that is of course a subjective matter - people may potentially use different criteria to consider a "PASSED" MuTect call vs. a "PASSED" Strelka call, using caller-specific confidence properties). I have pretty much ended up with writing a small script that combines the set of VCFs called for a given sample (T/N pair), often considering the average when it comes to AF/DP values (callers may differ slightly wrt the reads that go into these calculations). Given your already great understanding, I suspect you should be able to pull this through - although standing on some existing tools would make your (and my) life much easier! :-) Sorry not being able to help you further here, but hopefully this gives you some additional perspectives. kind regards, |
If you want to contribute code that will enhance this part, you are of course more than welcome :-) |
No worries Sigve, thanks for the prompt response! I have been developing a nextflow pipeline for PCGR for an internship, the main idea being that it can work directly off the outputs from one of nf-cores flagship pipelines - sarek. I made a stab at extending the utility of the script produced by gudeqing (#136), it can now handle mutect2, strelka and freebayes outputs. The missing piece is consolidating the algorithm calls and as you mentioned, averaging AF/DP values. Thanks for the discussion, it's encouraging to know I'm at least thinking on the right track. I'll keep chipping away at it and let you know if I come up with a solution :) p.s in the event I get this workflow over the line would you be ok with me sharing it with the nf-core community? It's in dev and lives here https://github.com/BarryDigby/nf-core-pcgr |
This is more than welcome, Barry. Keep up the good work ! |
Hi @sigven, just wanted to ask a similar follow-up question re. What are the minimum input requirements for The goal is to merge multiple germline VCF files before running Thanks again, -Barry |
Hi Barry, Thanks for reaching out! The input requirements for CPSR is not as strict as they are for PCGR. Basically, it expects a single-sample VCF file, preferably also with the https://github.com/sigven/pcgr/blob/master/pcgrr/R/utils.R#L1315-L1337 Genotypes are not used to a great extent for the variant classification etc., but is useful for final interpretation of the variants. If you have merged multiple germline variant calls, I think the main point is to ensure that the genotype ( There are basically no requirements as to depth and allelic fraction for now, and while allelic fraction are not that relevant for germline calls, depth might be useful for filtering, so we might add this at a later point, but for now, there are no requirements for this part. One part that I can recommend to ensure prior to analysis with CPSR is to decompose the variants, meaning that there are no multiallelic variant records. CPSR will do so also internally (using Thanks again for working with the workflows! Data bundle updates to come :-) best, |
Thank you for all of your help Sigve! The workflow is 99% ready at this repository https://github.com/BarryDigby/nf-pcgr I would be keen for you to cast a critical eye over it if you get the chance :) The only thing that remains is to touch up documentation and test it on large VCF files. I've noticed you advised subsetting WGS tumor-only variants to exonic regions, I can add a process & parameter for this recommendation. I'm hoping users will apply sensible filtering expressions using Best , Barry |
Hi Sigve,
I'm curious if you have any hints towards producing the combined
TAL
field for a variant site that has been called by mutliple tools for a sample - just like your example VCF file:I have the following minimal example, a variant that has been called by
Mutect2
,FreeBayes
, andStrelka
, respectively. I have manually calculated theTAF/NAF
,TDP/NDP
andTAL/AL
INFO/FORMAT fields:chrX 154405616 . G A . PASS AS_SB_TABLE=25,31|21,33;DP=112;ECNT=1;MBQ=20,20;MFRL=154,183;MMQ=60,60;MPOS=33;NALOD=1.58;NLOD=11.14;POPAF=3.18;TLOD=173.02;TDP=54;NDP=56;TAF=1;NAF=0;TAL=mutect2 GT:AD:AF:DP:F1R2:F2R1:FAD:SB:AL 0/0:56,0:0.026:56:9,0:23,0:37,0:25,31,0,0:1 0/1:0,54:0.976:54:0,8:0,29:0,39:0,0,21,33:1
chrX 154405616 . G A 1892.72 PASS AB=0;ABP=0;AC=2;AF=0.5;AN=4;AO=56;CIGAR=1X;DP=112;DPB=112;DPRA=1;EPP=3.63072;EPPR=3.63072;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=2;NUMALT=1;ODDS=37.8354;PAIRED=1;PAIREDR=0.982143;PAO=0;PQA=0;PQR=0;PRO=0;QA=2172;QR=2214;RO=56;RPL=35;RPP=10.6105;RPPR=3.16541;RPR=21;RUN=1;SAF=23;SAP=6.88793;SAR=33;SRF=25;SRP=4.40625;SRR=31;TYPE=snp;technology.ILLUMINA=1;TDP=56;NDP=56;TAF=1;NAF=0;TAL=freebayes GT:AD:AO:DP:GQ:PL:QA:QR:RO:AL 0/0:56,0:0:56:99:0,169,1994:0:2214:56:2 1/1:0,56:56:56:99:1957,169,0:2172:0:0:2
chrX 154405616 . G A . PASS DP=112;MQ=60;MQ0=0;NT=ref;QSS=309;QSS_NT=3070;ReadPosRankSum=0;SGT=GG->AG;SNVSB=0;SOMATIC;SomaticEVS=19.73;TQSS=1;TQSS_NT=1;TDP=56;NDP=55;TAF=1;NAF=0;TAL=strelka DP:FDP:SDP:SUBDP:AU:CU:GU:TU:AD:AL 55:0:0:0:0,0:0,0:55,56:0,0:0,0:3 56:0:0:0:56,56:0,0:0,0:0,0:0,56:3
Unfortunately, the deprecated
CombineVariants
,bcftools merge
andGATK MergeVcfs
didn't do the trick. Is there a tool or set of packages you have used for this previously?Thinking out loud.. might it be a good start to reduce the VCF files to the bare minimum required by PCGR (call confidence, dp_tumor, dp_control, af_tumor, af_control)? It looks like the genotype information is not required
pcgr/scripts/pcgr_validate_input.py
Lines 366 to 369 in d471ee8
Best,
Barry
The text was updated successfully, but these errors were encountered: