vcfcombine takes files with many sites and produces a tiny file #104
Comments
Ah. I bet it is because vcfcombine dies if there are overlapping records |
Could be that. Could you share the test? On Mon, Sep 21, 2015 at 1:02 PM, Zamin Iqbal notifications@github.com
|
It's my fault (yes, I can share if you like) - I have a lot of lines in these VCFs which overlap. |
Well, this is what vcfcombine claims to do: Combines VCF files positionally, combining samples when sites and alleles So if it's not, I should try to fix it. On Mon, Sep 21, 2015 at 2:38 PM, Zamin Iqbal notifications@github.com
|
So what would you do with chr1 1 A G leave aside the issue of left-alignment and the fact that no one calls a variant in the middle of a homopolymer. I made this up, and it's only the coords and ref-allele lengths that matter here. |
It does not understand overlaps or try to resolve these. There isn't enough It will assume that it can merge records with "chr1 11 AAA TTAT". But it needs to be exactly this; for instance, it won't merge that record with "chr1 13 AAA GGAGA". See: https://github.com/ekg/vcflib/blob/master/src/vcfcombine.cpp#L115
variantsByChromPosAltFile[var->sequenceName][var->position][var->alt][vcf] = var;
The var->alt string needs to be identical. ... and I realize that this is buggy. It should also require an identical On Mon, Sep 21, 2015 at 2:59 PM, Zamin Iqbal notifications@github.com
|
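To illustrate the keying described above, here is a minimal Python sketch. It is a simplified model and not vcflib's actual C++ code: the function name `group_by_chrom_pos_alt` and the tuple layout are assumptions made for illustration. It shows that two records land in the same group only when chromosome, position and ALT string match exactly, so an overlapping record at a different position is never merged.

```python
from collections import defaultdict

def group_by_chrom_pos_alt(records):
    """Group records by an exact (chrom, pos, alt) key, one slot per source file.

    Hypothetical helper: `records` holds (source, chrom, pos, ref, alt) tuples.
    """
    groups = defaultdict(dict)
    for source, chrom, pos, ref, alt in records:
        # The ALT string is part of the key, so it must match character for character.
        groups[(chrom, pos, alt)][source] = (ref, alt)
    return groups

records = [
    ("file1.vcf", "chr1", 11, "AAA", "TTAT"),
    ("file2.vcf", "chr1", 11, "AAA", "TTAT"),   # same key as the file1 record: grouped together
    ("file2.vcf", "chr1", 13, "AAA", "GGAGA"),  # overlaps chr1:11-13 but has a different key
]
groups = group_by_chrom_pos_alt(records)
```

Note that REF is not part of the key in this model, which matches the comment above that the current behaviour is buggy.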
For all my VCFs, the first 9 columns are 100% identical. It's only the genotyping column that differs. |
OK, there is still an issue. I have 2 files, with totally identical isolated SNPs within them
vcflib/bin/vcfcombine file1.vcf file2.vcf > file3.vcf
|
Any way I can attach VCFs to this? |
Not sure. Email?
|
Incoming... |
Bah. Those files had some things that were non-overlapping, but only if you think chr1 10 A G I think those confused it. When I removed all non-SNPs from the VCFs, it seems to have merged things OK. |
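Removing the non-SNP records, as in the workaround above, could look roughly like the following Python sketch. It assumes tab-separated VCF lines with REF in column 4 and ALT in column 5, and treats a record as a SNP only if REF and every ALT allele is a single A/C/G/T base. The helper names `is_snp_record` and `keep_snps` are hypothetical, not part of vcflib.

```python
BASES = {"A", "C", "G", "T"}

def is_snp_record(line):
    """Return True if REF and every ALT allele of a VCF data line is a single base."""
    fields = line.rstrip("\n").split("\t")
    ref = fields[3].upper()
    alts = fields[4].upper().split(",")
    return ref in BASES and all(alt in BASES for alt in alts)

def keep_snps(lines):
    """Keep header lines (starting with '#') and SNP records; drop everything else."""
    return [ln for ln in lines if ln.startswith("#") or is_snp_record(ln)]
```

Filtering each input this way before running vcfcombine sidesteps the overlapping-record problem rather than fixing it.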
OK, well, this is a bug I think. Current status, from my point of view: if there are no overlapping variants and only SNPs, then vcfcombine works very well, provided you are not merging too many files. 50 seems too much, but I can do it in batches of 10. |
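The batching workaround mentioned above can be planned with a short Python sketch. It only builds the command lines and does not run anything. The intermediate file names (`batch0.vcf`, `combined.vcf`) are made up for illustration, and it is an assumption that combining the per-batch outputs in a final pass gives the same result as a single large merge.

```python
def batches(paths, size=10):
    """Split a list of paths into consecutive chunks of at most `size` items."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]

def plan_batched_combine(paths, size=10, tool="vcfcombine"):
    """Return (argv, output_name) pairs: one per batch, plus a final merge if needed."""
    plan = []
    intermediates = []
    for n, chunk in enumerate(batches(paths, size)):
        out = f"batch{n}.vcf"  # hypothetical intermediate file name
        plan.append(([tool, *chunk], out))
        intermediates.append(out)
    if len(intermediates) > 1:
        # Final pass: combine the per-batch outputs into one file.
        plan.append(([tool, *intermediates], "combined.vcf"))
    return plan
```

Each planned step could then be run with something like `subprocess.run(argv, stdout=open(out, "w"), check=True)`, which mirrors the `vcfcombine a.vcf b.vcf > out.vcf` invocation used in this thread.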
I see how this could be extended. Basically, I should be tracking unique allele sets rather than just alternate alleles, which appears to be causing the problem in your case. |
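One way to read "tracking unique allele sets" is to put REF and the full list of ALT alleles into the grouping key, instead of the ALT string alone. The Python sketch below shows that interpretation. It is an assumption about the intended fix and not vcflib code, and the record at `chr1 10 AT G` is invented for the example.

```python
from collections import defaultdict

def group_by_allele_set(records):
    """Group source files by (chrom, pos, ref, alts), so REF must match as well as ALT.

    Hypothetical helper: `records` holds (source, chrom, pos, ref, alts) tuples.
    """
    groups = defaultdict(list)
    for source, chrom, pos, ref, alts in records:
        key = (chrom, pos, ref, tuple(alts))
        groups[key].append(source)
    return groups

records = [
    ("file1.vcf", "chr1", 10, "A", ["G"]),
    ("file2.vcf", "chr1", 10, "A", ["G"]),   # identical allele set: merged
    ("file2.vcf", "chr1", 10, "AT", ["G"]),  # same position and ALT, different REF: kept apart
]
allele_groups = group_by_allele_set(records)
```

With an ALT-only key, the third record would collide with the first two even though it describes a different variant.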
This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days |
This is still an open bug afaik |
A patch would be welcome. |
I've got 2 VCF files with 671,775 PASS lines in each. Both VCFs have the same 1st 9 columns - just the sample columns differ
When I try this
vcflib/bin/vcfcombine file1.vcf file2.vcf > bob.vcf
bob.vcf only has 37 PASS elements.
reproduced several times with different inputs
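To check the symptom on other inputs, the PASS records in each file can be counted before and after combining. This is a minimal Python sketch, assuming tab-separated VCF lines with FILTER in the 7th column. `count_pass` is a hypothetical helper, not a vcflib tool.

```python
def count_pass(lines):
    """Count VCF data lines whose FILTER column (7th field) is exactly 'PASS'."""
    count = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            count += 1
    return count
```

For example, `count_pass(open("bob.vcf"))` would give the number to compare against the counts for file1.vcf and file2.vcf.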