Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

control_dp_min is not properly transfered #136

Open
gudeqing opened this issue Sep 14, 2021 · 7 comments
Open

control_dp_min is not properly transfered #136

gudeqing opened this issue Sep 14, 2021 · 7 comments

Comments

@gudeqing
Copy link

image

@sigven
Copy link
Owner

sigven commented Sep 14, 2021

Hi @gudeqing,
Thanks a lot for noticing this! Indeed a bug, pushed a fix to pcgr.py now.

best,
Sigve

@gudeqing
Copy link
Author

Hi @sigve
will you check if the following arguments work as expected ?
--tumor_dp_min 5
--tumor_af_min 0.05
--tumor_dp_tag TDP
--tumor_af_tag TAF
--control_dp_tag NDP
--control_af_tag NAF

best wishes !

@sigven
Copy link
Owner

sigven commented Sep 14, 2021

Sure thing, if you have an example VCF that you think adheres to these parameters, I could do such a check. I would also need the assembly version (grch37 or 38)

@gudeqing
Copy link
Author

gudeqing commented Sep 14, 2021

Hi, sigven,
Nice to have it work properly now.
By the way, here is a small script that helps to copy "DP/AF" in FORMAT to INFO field, However, I still hope by default, pcgr will extract info from FORMAT, and I believe this will cover 90% real cases of somatic variant calling. Simply, you can also allow users to specify the tag location, such as --tumor_dp_tag FOMAT/DP --tumor_af_tag INFO/AF

from pysam import VariantFile


def reformat_vcf(vcf_file, out):
    with VariantFile(vcf_file) as fr:
        header = fr.header
        header.info.add('TDP', number=1, type='Integer', description='Tumor sample depth')
        header.info.add('NDP', number=1, type='Integer', description='Normal sample depth')
        header.info.add('TAF', number=1, type='Float', description='Tumor sample AF')
        header.info.add('NAF', number=1, type='Float', description='Normal sample AF')

        with VariantFile(out, 'w', header=header) as fw:
            for record in fr:
                record.info['TDP'] = record.samples[1]['DP']
                record.info['NDP'] = record.samples[0]['DP']
                record.info['TAF'] = record.samples[1]['AF'][0]
                record.info['NAF'] = record.samples[0]['AF'][0]
                fw.write(record)

@sigven
Copy link
Owner

sigven commented Sep 14, 2021

Hi @gudeqing,

Cool! Very nice small script for this purpose. Although I agree that retrieving this information directly from the FORMAT column would by far be most convenient for most people, the fact that there is no concensus (among callers) on how to format sample-specific variant data in the FORMAT column makes it inevitably fairly challenging to parse this information robustly. This has at least been my experience over the years, I have seen callers with poor consistency in their way of formatting. E.g. how can one be sure that record.samples[1] denotes the tumor sample? And how do you ensure that the record.samples[1]['AF'][0] is a single value and not a comma-delimited value (for multiallelic calls)? So IMO there are some potential pitfalls here, but if you can demonstrate that a robust parsing scheme for this across the most common somatic callers, I am happy to have a look at it for integration into PCGR :-)

@gudeqing
Copy link
Author

gudeqing commented Sep 14, 2021

Hi, @sigven ,

First, pcgr requires the user to provide tumor sample id, thus we could use this ID to differentiate tumor sample from normal sample.
Second, in most times, we can guess which sample is tumor sample by comparing AF between two samples.

Here is a small script that first splits multi-allelic records, autodetects tumor sample, and reformats INFO field.
I have tested this script with 28 vcf files, tumor sample is 100% correctly guessed.

from pysam import VariantFile
import os


def reformat_vcf(vcf_file, out, reference, tumor_sample=None):
    """
    [split and left-normalization] -> [re-calculate AF] -> [bgzip and tabix]
    :param vcf_file:
    :param out:
    :param reference:
    :param tumor_sample: tumor sample name or position, if the second sample is tumor, then it is 1 else 0
    :return:
    """
    os.system(f'bcftools norm -f {reference} -m -both {vcf_file} -o tmp_.vcf')
    with VariantFile('tmp_.vcf') as fr:
        header = fr.header
        header.info.add('TDP', number=1, type='Integer', description='Tumor sample depth')
        header.info.add('NDP', number=1, type='Integer', description='Normal sample depth')
        header.info.add('TAF', number=1, type='Float', description='Tumor sample AF')
        header.info.add('NAF', number=1, type='Float', description='Normal sample AF')
        samples = list(header.samples)

        if tumor_sample is not None:
            if tumor_sample not in [0, 1, '0', '1']:
                if tumor_sample in samples:
                    tumor_idx = samples.index(tumor_sample)
                    normal_idx = 1 - tumor_idx
                else:
                    raise Exception(f'{tumor_sample} is not in samples {samples} recorded in vcf')
            else:
                tumor_idx = int(tumor_sample)
                normal_idx = 1 - tumor_idx
        else:
            tumor_idx = guess_tumor_idx(vcf_file)
            normal_idx = 1 - tumor_idx

        with VariantFile(out, 'w', header=header) as fw:
            for record in fr:
                record.info['TDP'] = record.samples[tumor_idx]['DP']
                record.info['NDP'] = record.samples[normal_idx]['DP']
                # re-calculate AF since caller like sentieon may report AF that is not consistent with AD info
                record.info['TAF'] = round(record.samples[tumor_idx]['AD'][1]/record.samples[tumor_idx]['DP'], 3)
                record.info['NAF'] = round(record.samples[normal_idx]['AD'][1]/record.samples[normal_idx]['DP'], 3)
                fw.write(record)

    os.remove('tmp_.vcf')
    os.system(f'bgzip {out}')
    os.system(f'tabix {out}.gz')


def guess_tumor_idx(vcf_file):
    tumor_is_first = 0
    tumor_is_second = 0

    with VariantFile(vcf_file) as fr:
        samples = list(fr.header.samples)
        formats = list(fr.header.formats)
        if 'AF' not in formats:
            raise Exception('No AF in format info to detect tumor sample')
        for record in fr:
            if record.samples[0]['AF'][0] > record.samples[1]['AF'][0]:
                tumor_is_first += 1
            else:
                tumor_is_second += 1
    tumor_idx = tumor_is_second >= tumor_is_first
    print(f'we guess tumor sample is {samples[tumor_idx]} ')
    return tumor_idx


if __name__ == '__main__':
    from xcmds import xcmds
    xcmds.xcmds(locals())

@sigven
Copy link
Owner

sigven commented Jan 11, 2022

Hi @gudeqing, I will follow up now on the great contribution you have done here. I am in the process of updating PCGR, and will be looking into how your work could be incorporated, either fully, or as a utility that can be offered to the users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants