Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash reading valid but weird VCF #187

Open
davmlaw opened this issue Jan 13, 2021 · 7 comments
Open

Crash reading valid but weird VCF #187

davmlaw opened this issue Jan 13, 2021 · 7 comments

Comments

@davmlaw
Copy link
Contributor

davmlaw commented Jan 13, 2021

Hi, I'm getting strange genotype calls and then a crash with my GATK VCF.

The zygosity call is strange, but the VCF passes validation.

I've reduced it down to a few lines (VCF at bottom of issue), so you can reproduce the issue.

import cyvcf2 

print(f"version: {cyvcf2.__version__}") 
reader = cyvcf2.Reader(open("./bad_vcf.vcf")) 
for v in reader: 
    print(v) 
    print(f"gt_types: {v.gt_types}") 

Output:

version: 0.30.4
chr7	42924353	rs552132089;rs771560161;rs869187830;rs370736797	CAA	.	57.23	PASS	.	GT:AD:DP:GQ:PGT:PID:PL:PS	.|.:0:1:.:0|1:42924353_CAAA_C:0:42924353

gt_types: [3]
chr13	37564190	rs1253322554	A	.	109.74	PASS	.	GT:AD:DP:GQ:PGT:PID:PL:PS	.|.:0:.:.:0|1:37564182_A_T:.:37564182

gt_types: [3]

Issue 1 - crash:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-c7bb21f9ef99> in <module>
      3 print(f"version: {cyvcf2.__version__}")
      4 reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
----> 5 for v in reader:
      6     print(v)
      7     print(f"gt_types: {v.gt_types}")

/usr/local/lib/python3.8/dist-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.__next__()

Exception: error parsing variant with `htslib::bcf_read` error-code: 0

I think it's uninitialised memory as sometimes you get a different exception, ie:

[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
	1	.	.	.	.	.	.

gt_types: []
[W::vcf_parse] Contig '
                       3D�' is not defined in the header. (Quick workaround: index the file with tabix.)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-c7bb21f9ef99> in <module>
      4 reader = cyvcf2.Reader(open("./bad_vcf.vcf"))
      5 for v in reader:
----> 6     print(v)
      7     print(f"gt_types: {v.gt_types}")
      8 

/usr/local/lib/python3.8/dist-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.Variant.__str__()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 5: invalid start byte

Issue 2

Dunno what GATK is doing here, but ALT is ".", and GT = ".|."

cyVCF calls the genotype as 3 (HOM ALT) - I think it should call this as 2 (UNKNOWN)

VCF

##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr13,length=114364328>
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    31_CoriellNA12878
chr7    42924353    rs552132089;rs771560161;rs869187830;rs370736797    CAA    .    57.23    PASS    .    GT:AD:DP:GQ:PGT:PID:PL:PS    .|.:0:1:.:0|1:42924353_CAAA_C:0:42924353
chr13    37564190    rs1253322554    A    .    109.74    PASS    .    GT:AD:DP:GQ:PGT:PID:PL:PS    .|.:0:.:.:0|1:37564182_A_T:.:37564182
@brentp
Copy link
Owner

brentp commented Jan 14, 2021

just use: reader = cyvcf2.VCF("./bad_vcf.vcf") and this problem goes away.

you might work if you open with "rb", but I never use that so haven't seen this.

@nh13
Copy link
Contributor

nh13 commented Jan 20, 2021

I think this may be something very insidious going on here.

The following works:

import cyvcf2
  
print(f"version: {cyvcf2.__version__}")

fh = open("./bad_vcf.vcf")
reader = cyvcf2.Reader(fh)
for v in reader:
    print(v)
    print(f"gt_types: {v.gt_types}")

I am not too familiar with CPython vs. Python, but could the fact we cache the file handle returned by open in a local variable cause the garbage collection to be delayed, whereas it is immediately garbage collected in your example?

@nh13
Copy link
Contributor

nh13 commented Jan 20, 2021

Some quick googling found maybe a red herring, or maybe not: https://stackoverflow.com/a/8011863

@brentp
Copy link
Owner

brentp commented Jan 20, 2021

oh, good catch, we probably need a weakref to the file handle.

@brentp
Copy link
Owner

brentp commented Jan 20, 2021

can also use a :

with open("./bad_vcf.vcf") as fh:
    ...

around the entire script so it doesn't get collected

@nh13
Copy link
Contributor

nh13 commented Jan 21, 2021

@brentp where do you store the weakref? VCF? HtsFile?

@brentp
Copy link
Owner

brentp commented Jan 21, 2021

I think in the htsfile makes sense but don't have a strong preference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants