Issue querying db's made with latest cyvcf2 #238

Open
matthdsm opened this issue Mar 29, 2022 · 5 comments

@matthdsm

matthdsm commented Mar 29, 2022

Hi @brentp,

I've come across a weird issue.
I've updated one of our bcbio installations, resulting in an environment with the latest vcf2db + cyvcf2=0.30.14.
When I query one of the databases generated with this setup through the gemini Python API, I get a funky result when fetching genotypes.

When I print the gts field from a table row (gemini.GeminiQuery.GeminiRow), I get the following numpy array:

["T" "" "" "T" "" "" "C" "" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "T"
"" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "" "" "" "" "" "" "" "" "T"
"" "" "/" "" "" "T" "" "" "T" "" "" "C" "" "" "" "" "" "" "" "" "" "" ""]
but it should show:

["TTC/TTC","T/TTC","T/TTC"]
i.e. the genotypes for three individuals.
That is what our older installs running cyvcf2=0.20.9 return.

I suppose this error may have something to do with #227.
When downgrading cyvcf2 to the older version and regenerating the db, everything seems to work again.

Any thoughts?
M

PS, it seems others have also run into similar issues: https://github.com/chapmanb/cloudbiolinux/blob/master/contrib/flavor/ngs_pipeline_minimal/packages-conda.yaml#L354

xpost from quinlan-lab/vcf2db#69
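
For readers trying to picture the symptom reported above: output of this general shape (single characters separated by empty entries) can appear when the raw buffer of a numpy unicode array is read back as one-byte strings. The sketch below is only an illustration under that assumption. It does not reproduce what vcf2db or gemini actually do when storing and loading genotypes, and the spacing it produces is not an exact match for the output quoted in the issue.

```python
import numpy as np

# Three genotype strings stored as a little-endian unicode array.
# "<U7" uses 4 bytes per character, so each item occupies 28 bytes.
gts = np.array(["TTC/TTC", "T/TTC", "T/TTC"], dtype="<U7")
raw = gts.tobytes()

# Hypothetical reader that wrongly assumes one byte per character.
# Each real character is followed by three NUL bytes, which numpy
# displays as empty byte strings.
misread = np.frombuffer(raw, dtype="S1")

# Reader that uses the dtype the data was written with.
correct = np.frombuffer(raw, dtype="<U7")

print(misread[:8])
print(correct)
```

The point of the comparison is that the bytes on disk are identical in both reads; only the dtype assumed by the reader differs.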

@brentp
Owner

brentp commented Mar 29, 2022

Hi Matthias, can you share a VCF with these 2 rows?

@matthdsm
Author

Sure,

here's an excerpt with the 10 first variants from that file.
tmp.vcf.txt

@brentp
Owner

brentp commented Mar 29, 2022

with this script:

import cyvcf2

print(f"version: {cyvcf2.__version__}")
for v in cyvcf2.VCF("tmp.vcf.gz"):
    print(v.gt_bases)

run on that file, I see:

version: 0.30.15
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']

which seems to match what I expect. Can you verify that you see the same?
If so, then it must be something elsewhere in vcf2db, or in how vcf2db is interacting with cyvcf2.

Perhaps it's because we changed how cyvcf2 is built?

@matthdsm
Author

matthdsm commented Mar 29, 2022

Hi Brent

I'm getting the following results.

py2.7

cyvcf2 version: 0.30.15
numpy version: 1.16.5
[u'AC|AC' u'*|*' u'AC|.']
[u'AC|AC' u'./.' u'AC|A']
[u'./.' u'A/C' u'./.']
[u'./.' u'T/G' u'T/T']
[u'G/G' u'G/A' u'G/A']
[u'T/T' u'./.' u'T/A']
[u'C/T' u'./.' u'C/T']
[u'T/G' u'T/G' u'T/G']
[u'C/T' u'C/T' u'C/T']
[u'A/G' u'./.' u'A/A']

and

cyvcf2 version: 0.20.9
numpy version: 1.16.5
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']

py3

cyvcf2 version: 0.30.15
numpy version: 1.22.3
['AC|AC' '*|*' 'AC|.']
['AC|AC' './.' 'AC|A']
['./.' 'A/C' './.']
['./.' 'T/G' 'T/T']
['G/G' 'G/A' 'G/A']
['T/T' './.' 'T/A']
['C/T' './.' 'C/T']
['T/G' 'T/G' 'T/G']
['C/T' 'C/T' 'C/T']
['A/G' './.' 'A/A']

It seems to me the latest version returns the strings as unicode on Python 2, whereas the older version did not. I can imagine that's the cause of the issues downstream.

I know it's kind of stupid to still be using Python 2, but that's the way it is currently configured in the bcbio environment.

Ping @naumenko-sa.
Sergey, do you remember why cyvcf2/vcf2db hadn't been moved to the main environment? I know we talked about it somewhere, but I can't find it right now.

Matthias
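
The unicode-versus-bytes difference noted above is easier to confirm from the array dtype than from the printed `u'...'` prefixes. A minimal sketch, using only numpy and hand-written sample genotypes rather than real cyvcf2 output (the `describe` helper is a hypothetical name introduced here):

```python
import numpy as np

def describe(arr):
    """Return (dtype kind, bytes per item) for a numpy string array.

    Kind 'S' means byte strings, kind 'U' means unicode strings.
    """
    return arr.dtype.kind, arr.dtype.itemsize

# Same genotypes, once as byte strings and once as unicode strings.
as_bytes = np.array([b"A/G", b"./.", b"A/A"])
as_text = np.array(["A/G", "./.", "A/A"])

print(describe(as_bytes))  # ('S', 3)
print(describe(as_text))   # ('U', 12)
```

The same check could presumably be applied to `v.gt_bases` in the script earlier in the thread by printing `v.gt_bases.dtype` under each cyvcf2 version.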

@matthdsm
Author

matthdsm commented Mar 29, 2022

I suppose it's a numpy issue

return np.array(bases, str)

vs
return np.array(bases, np.str)

xref: #191
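
As a side note on the two spellings above: `np.str` was only an alias for the builtin `str`; it was deprecated in NumPy 1.20 and removed in 1.24, so it is avoided below. If downstream code expects byte strings, one possible workaround (an assumption, not a fix confirmed anywhere in this thread) is to normalise the array explicitly before it is stored. The helper name `as_byte_strings` is hypothetical:

```python
import numpy as np

def as_byte_strings(bases):
    """Return genotype strings as a numpy byte-string ('S') array.

    Accepts a sequence or array of either unicode or byte strings.
    Genotype text is assumed to be plain ASCII.
    """
    arr = np.asarray(bases)
    if arr.dtype.kind == "U":
        # Encode unicode items one by one into byte strings.
        return np.char.encode(arr, "ascii")
    if arr.dtype.kind == "S":
        return arr
    raise TypeError("expected a string array, got dtype %s" % arr.dtype)

out = as_byte_strings(["TTC/TTC", "T/TTC", "T/TTC"])
print(out.dtype, out.tolist())
```

Making the conversion explicit keeps the result independent of what `str` happens to mean in the interpreter or build that produced the array.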
