New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
control_dp_min is not properly transfered #136
Comments
Hi @gudeqing, best, |
Hi @sigve, best wishes ! |
Sure thing, if you have an example VCF that you think adheres to these parameters, I could do such a check. I would also need the assembly version (grch37 or 38) |
Hi, sigven, from pysam import VariantFile
def reformat_vcf(vcf_file, out):
with VariantFile(vcf_file) as fr:
header = fr.header
header.info.add('TDP', number=1, type='Integer', description='Tumor sample depth')
header.info.add('NDP', number=1, type='Integer', description='Normal sample depth')
header.info.add('TAF', number=1, type='Float', description='Tumor sample AF')
header.info.add('NAF', number=1, type='Float', description='Normal sample AF')
with VariantFile(out, 'w', header=header) as fw:
for record in fr:
record.info['TDP'] = record.samples[1]['DP']
record.info['NDP'] = record.samples[0]['DP']
record.info['TAF'] = record.samples[1]['AF'][0]
record.info['NAF'] = record.samples[0]['AF'][0]
fw.write(record) |
Hi @gudeqing, Cool! Very nice small script for this purpose. Although I agree that retrieving this information directly from the FORMAT column would by far be most convenient for most people, the fact that there is no concensus (among callers) on how to format sample-specific variant data in the FORMAT column makes it inevitably fairly challenging to parse this information robustly. This has at least been my experience over the years, I have seen callers with poor consistency in their way of formatting. E.g. how can one be sure that |
Hi, @sigven , First, pcgr requires the user to provide tumor sample id, thus we could use this ID to differentiate tumor sample from normal sample. Here is a small script that first splits multi-allelic records, autodetects tumor sample, and reformats INFO field. from pysam import VariantFile
import os
def reformat_vcf(vcf_file, out, reference, tumor_sample=None):
"""
[split and left-normalization] -> [re-calculate AF] -> [bgzip and tabix]
:param vcf_file:
:param out:
:param reference:
:param tumor_sample: tumor sample name or position, if the second sample is tumor, then it is 1 else 0
:return:
"""
os.system(f'bcftools norm -f {reference} -m -both {vcf_file} -o tmp_.vcf')
with VariantFile('tmp_.vcf') as fr:
header = fr.header
header.info.add('TDP', number=1, type='Integer', description='Tumor sample depth')
header.info.add('NDP', number=1, type='Integer', description='Normal sample depth')
header.info.add('TAF', number=1, type='Float', description='Tumor sample AF')
header.info.add('NAF', number=1, type='Float', description='Normal sample AF')
samples = list(header.samples)
if tumor_sample is not None:
if tumor_sample not in [0, 1, '0', '1']:
if tumor_sample in samples:
tumor_idx = samples.index(tumor_sample)
normal_idx = 1 - tumor_idx
else:
raise Exception(f'{tumor_sample} is not in samples {samples} recorded in vcf')
else:
tumor_idx = int(tumor_sample)
normal_idx = 1 - tumor_idx
else:
tumor_idx = guess_tumor_idx(vcf_file)
normal_idx = 1 - tumor_idx
with VariantFile(out, 'w', header=header) as fw:
for record in fr:
record.info['TDP'] = record.samples[tumor_idx]['DP']
record.info['NDP'] = record.samples[normal_idx]['DP']
# re-calculate AF since caller like sentieon may report AF that is not consistent with AD info
record.info['TAF'] = round(record.samples[tumor_idx]['AD'][1]/record.samples[tumor_idx]['DP'], 3)
record.info['NAF'] = round(record.samples[normal_idx]['AD'][1]/record.samples[normal_idx]['DP'], 3)
fw.write(record)
os.remove('tmp_.vcf')
os.system(f'bgzip {out}')
os.system(f'tabix {out}.gz')
def guess_tumor_idx(vcf_file):
tumor_is_first = 0
tumor_is_second = 0
with VariantFile(vcf_file) as fr:
samples = list(fr.header.samples)
formats = list(fr.header.formats)
if 'AF' not in formats:
raise Exception('No AF in format info to detect tumor sample')
for record in fr:
if record.samples[0]['AF'][0] > record.samples[1]['AF'][0]:
tumor_is_first += 1
else:
tumor_is_second += 1
tumor_idx = tumor_is_second >= tumor_is_first
print(f'we guess tumor sample is {samples[tumor_idx]} ')
return tumor_idx
if __name__ == '__main__':
from xcmds import xcmds
xcmds.xcmds(locals()) |
Hi @gudeqing, I will follow up now on the great contribution you have done here. I am in the process of updating PCGR, and will be looking into how your work could be incorporated, either fully, or as a utility that can be offered to the users. |
The text was updated successfully, but these errors were encountered: