Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

##INFO fields being dropped #302

Open
FlexLuthORF opened this issue Mar 13, 2024 · 6 comments
Open

##INFO fields being dropped #302

FlexLuthORF opened this issue Mar 13, 2024 · 6 comments

Comments

@FlexLuthORF
Copy link

FlexLuthORF commented Mar 13, 2024

 cat 2202415007_hemi.vcf | head

/home/zmvanw01/projects/12-sample-comparison/hifiasm-path/geno_analysis/per_samp/2202415007/change_to_hemi/2202415007_hemi.vcf

=mpileup -B -a QS -Ou -f /home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta --threads 11 /home/zmvanw01/projects/12-sample-comparison/hifiasm-path/2202415007/2202415007.editRG.bam

##reference=file:///home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta

##contig=<ID=chr10,length=133797422>

##contig=<ID=chr11,length=135086622>

##contig=<ID=chr12,length=133275309>

##contig=<ID=chr13,length=114364328>

##contig=<ID=chr14,length=105480000>

##contig=<ID=chr15,length=101991189>

##contig=<ID=chr16,length=90338345>

(bcftools) [zmvanw01@bigdata change_to_hemi]$ cat ../2202415007.vcf | head

##fileformat=VCFv4.2

##FILTER=<ID=PASS,Description="All filters passed">

##bcftoolsVersion=1.19+htslib-1.19.1

##bcftoolsCommand=mpileup -B -a QS -Ou -f /home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta --threads 11 /home/zmvanw01/projects/12-sample-comparison/hifiasm-path/2202415007/2202415007.editRG.bam

##reference=file:///home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta

I am editing the below vcf file with the following script:

def change_genotypes(input_vcf, bedfile, output_path):
    vcf_reader = VCF(input_vcf, strict_gt=True)

    # Create a new VCF Writer using the input VCF as a template
    vcf_writer = Writer(output_path, vcf_reader)

    coords = read_bedfile(bedfile)

    for record in vcf_reader:
        for index, sample in enumerate(vcf_reader.samples):
            change_snp = False
            for chrom, start, end in coords:
                if record.CHROM != chrom:
                    continue
                if record.POS < start:
                    continue
                if record.POS > end:
                    continue
                change_snp = True
                break

            if change_snp:
                current_genotype = record.genotypes[index]
                if current_genotype is None or -1 in current_genotype:
                    record.genotypes[index] = [-1, -1, False]  # Set genotype to unknown
                elif 1 in current_genotype:
                    record.genotypes[index] = [-1, 1, False]  # Set genotype to ./1
                else:
                    record.genotypes[index] = [-1, 0, False]  # Set genotype to ./0
                
                # Reassign the genotypes field
                record.genotypes = record.genotypes

        vcf_writer.write_record(record)

    vcf_writer.close()
    vcf_reader.close()

I expect it to copy all ##info fields from the template to the new vcf, but it seems to truncate it? I am missing:
##fileformat=VCFv4.2

##FILTER=<ID=PASS,Description="All filters passed">

##bcftoolsVersion=1.19+htslib-1.19.1

##bcftoolsCommand

from the newly created file and it is causing downstream issues. It also seems to be placing the path to the new file right at the top with now ## before it.

@brentp
Copy link
Owner

brentp commented Mar 14, 2024

hi @FlexLuthORF can you attach a small VCF (can be just the header and a single variant) that shows the issue?

@FlexLuthORF
Copy link
Author

2201410002_head70.vcf.txt
2201410002_hemi_head70.vcf.txt

The first vcf is the one I am editing with cyvcf2 and the second one is the output that has strangely missing and cleaved ## fields.

@brentp
Copy link
Owner

brentp commented Mar 14, 2024

If I run this:

python issue302.py 2201410002_head70.vcf.txt > o.vcf
bcftools view o.vcf > /dev/null

I get no error. Here are the contents of issue302.py. If you can show me how to recreate, then we can debug, but I suspect something else is going on:

from cyvcf2 import VCF, Writer

def change_genotypes(input_vcf, bedfile, output_path):
    vcf_reader = VCF(input_vcf, strict_gt=True)

    # Create a new VCF Writer using the input VCF as a template
    vcf_writer = Writer(output_path, vcf_reader)


    for record in vcf_reader:
        for index, sample in enumerate(vcf_reader.samples):

            if True:
                current_genotype = record.genotypes[index]
                if current_genotype is None or -1 in current_genotype:
                    record.genotypes[index] = [-1, -1, False]  # Set genotype to unknown
                elif 1 in current_genotype:
                    record.genotypes[index] = [-1, 1, False]  # Set genotype to ./1
                else:
                    record.genotypes[index] = [-1, 0, False]  # Set genotype to ./0
                
                # Reassign the genotypes field
                record.genotypes = record.genotypes

        vcf_writer.write_record(record)

    vcf_writer.close()
    vcf_reader.close()

if __name__ == "__main__":
    import sys
    change_genotypes(sys.argv[1], None, "-")

@FlexLuthORF
Copy link
Author

FlexLuthORF commented Mar 14, 2024

Can you show the output vcf file? I get no traceback error either. It's that it messes up the ## fields and if ##fileformat is missing then I cant bcftools index the .vcf.gz

It does change the genotypes and all that. Its just dropping ## fields and appending the new file location at the top with no ## before it.

`cat 2201410002_hemi.bed

chr2 90225183 90249908 IGKV1-NL1`

thats the bed file it is using in case that helps. Only one line

@brentp
Copy link
Owner

brentp commented Mar 14, 2024

This happens for you using my code above? The o.vcf is well formed with the correct headers.

@FlexLuthORF
Copy link
Author

FlexLuthORF commented Mar 14, 2024

Interesting. You are correct, your code has all the ## fields. I cant figure out why the bit of me checking if the entry is within coords on a bed file is causing the issue. Thank you for your help. If I figure out the issue I will post it here... Though for now, since it was only the top few ## I am just using some bash commands to strip and re-add the info fields so I can continue my analysis.

Perhaps how I am writing the vcf afterwards?

    output_vcf="${sample_outd}/change_to_hemi/${sample}_hemi.vcf"
    
    python "${changeg}" "${of}.vcf" "$sample_sv_results" \
        "${SV_regions_entire}" "${sample}" > "${output_vcf}"

I am calling the python script within a batch script and redircting the returned vcf into a output file. Yeah I think this is the problem and its an issue on my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants