Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcfcombine takes files with many sites and produces a tiny file #104

Open
iqbal-lab opened this issue Sep 21, 2015 · 17 comments
Open

vcfcombine takes files with many sites and produces a tiny file #104

iqbal-lab opened this issue Sep 21, 2015 · 17 comments
Labels

Comments

@iqbal-lab
Copy link

I've got 2 VCF files with 671,775 PASS lines in each. Both VCFs have the same 1st 9 columns - just the sample columns differ
When I try this
vcflib/bin/vcfcombine file1.vcf file2.vcf > bob.vcf

bob.vcf only has 37 PASS elements.

reproduced several times with different inputs

@iqbal-lab
Copy link
Author

Ah. I bet it is because vcfcombine dies if there are overlapping records

@ekg
Copy link
Collaborator

ekg commented Sep 21, 2015

Could be that. Could you share the test?

On Mon, Sep 21, 2015 at 1:02 PM, Zamin Iqbal notifications@github.com
wrote:

Ah. I bet it is because vcfcombine dies if there are overlapping records


Reply to this email directly or view it on GitHub
#104 (comment).

@iqbal-lab
Copy link
Author

It's my fault (yes, I can share if you like) - i have a lot of lines in these VCFs which overlap.
I just wanted to ignore that, and "cat columns" But maybe simpler to remove overlapping stuff. Happy to close this

@ekg
Copy link
Collaborator

ekg commented Sep 21, 2015

Well, this is what vcfcombine claims to do:

Combines VCF files positionally, combining samples when sites and alleles
are identical.
Any number of VCF files may be combined. The INFO field and other columns
are taken from
one of the files which are combined when records in multiple files match.
Alleles must
have identical ordering to be combined into one record. If they do not,
multiple records
will be emitted.

So if it's not, I should try to fix it.

On Mon, Sep 21, 2015 at 2:38 PM, Zamin Iqbal notifications@github.com
wrote:

It's my fault (yes, I can share if you like) - i have a lot of lines in
these VCFs which overlap.
I just wanted to ignore that, and "cat columns" But maybe simpler to
remove overlapping stuff. Happy to close this


Reply to this email directly or view it on GitHub
#104 (comment).

@iqbal-lab
Copy link
Author

So what would you do with

chr1 1 A G
chr1 10 AAAAAA AAGATAGAGAA
chr1 11 AAA TTAT
chr1 13 AAA GGAGA

leave aside the issue of left-alignment and the fact that noone calls a variant in the middle of a homopolymer, I made this up, and it's only the coords and ref-allele lengths that matter here.
If all my sample VCFs looked like this, would it combine them thinking chr1 10 AAA...blah blah was the same in all samples? Or does it look ahead and notice there are overlapping thinga?

@ekg
Copy link
Collaborator

ekg commented Sep 21, 2015

It does not understand overlaps or try to resolve these. There isn't enough
information in the VCF file to do so and maintain the genotypes correctly.

It will assume that it can merge records with "chr1 11 AAA TTAT". But it needs to be exactly this, for instance it won't merge that record with "chr1 13 AAA GGAGA".

See: https://github.com/ekg/vcflib/blob/master/src/vcfcombine.cpp#L115

variantsByChromPosAltFile[var->sequenceName][var->position][var->alt][vcf]
= var;

The var->alt string needs to be identical.

... and I realize that this is buggy. It should also require an identical
REF. Hopefully thought this makes sense to get started?

On Mon, Sep 21, 2015 at 2:59 PM, Zamin Iqbal notifications@github.com
wrote:

So what would you do with

chr1 1 A G
chr1 10 AAAAAA AAGATAGAGAA
chr1 11 AAA TTAT
chr1 13 AAA GGAGA

leave aside the issue of left-alignment and the fact that noone calls a
variant in the middle of a homopolymer, I made this up, and it's only the
coords and ref-allele lengths that matter here.
If all my sample VCFs looked like this, would it combine them thinking
chr1 10 AAA...blah blah was the same in all samples? Or does it look ahead
and notice there are overlapping thinga?


Reply to this email directly or view it on GitHub
#104 (comment).

@iqbal-lab
Copy link
Author

For all my VCFs, the first 9 columns are 100% identical. It's only the genotyping column that differs.
So that ought to work...

@iqbal-lab
Copy link
Author

OK, there is still an issue. I have 2 files, with totally identical isolated SNPs within them
-bash-4.1$ grep -c PASS file1.vcf
26892
-bash-4.1$ grep -c PASS file2.vcf
26892

vcflib/bin/vcfcombine file1.vcf file2.vcf > file3.vcf
-bash-4.1$ grep -c PASS file3.vcf
1691

@iqbal-lab
Copy link
Author

Any way I can attach VCFs to this?

@ekg
Copy link
Collaborator

ekg commented Sep 21, 2015

Not sure. Email?
On Sep 21, 2015 5:15 PM, "Zamin Iqbal" notifications@github.com wrote:

Any way I can attach VCFs to this?


Reply to this email directly or view it on GitHub
#104 (comment).

@iqbal-lab
Copy link
Author

Incoming...

@iqbal-lab
Copy link
Author

Bah. Those files had some things that were non overlapping, but only if you think
eg

chr1 10 A G
chr1 10 AG A

I think those confused it. when I removed all non-SNPs from the VCFs, it seems to have merged things ok.

@iqbal-lab
Copy link
Author

OK, well , this is a bug I think. Current status, from ym point of view

if there are no overlapping variants and onlt SNPs, then vcfcombine works very well, providing not merging too many files. 50 seems too much, but I can do it in batches of 10.

@ekg
Copy link
Collaborator

ekg commented Oct 2, 2015

I see how this could be extended. Basically, I should be tracking unique allele sets rather than just alternate alleles, which appears to be causing the problem in your case.

@github-actions
Copy link

This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the stale label Dec 17, 2020
@iqbal-lab
Copy link
Author

This is still an open bug afaik

@pjotrp
Copy link
Contributor

pjotrp commented Dec 17, 2020

A patch would be welcome.

@pjotrp pjotrp added bug Genuine bug help wanted labels Dec 17, 2020
@github-actions github-actions bot removed the stale label Jan 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants