Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hetcount ignores ALT-only sites (such as 1/2) #313

Open
janxkoci opened this issue Feb 28, 2021 · 4 comments
Open

hetcount ignores ALT-only sites (such as 1/2) #313

janxkoci opened this issue Feb 28, 2021 · 4 comments
Labels
documentation Documentation missing or wrong enhancement help wanted

Comments

@janxkoci
Copy link

janxkoci commented Feb 28, 2021

Describe the bug

The vcfhetcount tool ignores heterozygous sites with just ALT alleles as genotype (such as 1/2, 1/3, 2/3 and so on).

Based on the wording in documantation I would actually expect these to be counted twice:

count the number of alternate alleles in heterozygous genotypes

But I think the tool is meant to rather just count heterozygous sites, no matter how many ALT alleles they hold. However at the moment it seems to require REF to be one of the alleles in genotype (e.g. 0/1, 0/2 etc).

To Reproduce

Take any VCF with multi-allelic sites, bi-allelic won't do. Run vcfhetcount on it.

Then you can compare your result with this example command, which counts possible GT combinations for the first sample ($10):

grep -v '#' INPUT.vcf | awk '{print $10}' | awk -F":" '{print $1}' | sort | uniq -c

Expected behavior

I would expect the vcfhetcount tool to count all heterozygous sites in a sample, including ALT-only sites like 1/2. And they should be counted once, even if they contain two ALT alleles. Consider also rewording its description in the documentation, please.

Additional context

I suspect that the vcfhethomratio tool may have the same limitation, however it's much less straightforward to test, especially since I'm still only guessing what is it actually reporting.

@janxkoci janxkoci added the bug Genuine bug label Feb 28, 2021
@janxkoci janxkoci changed the title hetcount ignores ALT-only sites (such as 1/2 hetcount ignores ALT-only sites (such as 1/2) Feb 28, 2021
@pjotrp
Copy link
Contributor

pjotrp commented Apr 14, 2021

Please provide a patch for the docs and/or the tool. That helps everyone.

@pjotrp pjotrp added documentation Documentation missing or wrong enhancement help wanted and removed bug Genuine bug labels Apr 14, 2021
@janxkoci
Copy link
Author

I've tried to look at the code, but I don't know C++ that well so I haven't figured out what needs changing, sorry. I'm not sure if fixing the documentation makes sense at this point, I think it's better to fix the issue first and then properly describe what the tool does afterwards.

@pjotrp
Copy link
Contributor

pjotrp commented Jun 22, 2021

OK. Thanks for trying! Maybe someone will get round to it.

@janxkoci
Copy link
Author

Yeah, sorry, last time I got lost while looking at the files referenced in the source for this tool. I'm looking at the code again though and I noticed this bit inside a while loop:

string& genotype = sample["GT"].front();
vector<string> gt = split(genotype, "|/");
int alt = 0;
for (vector<string>::iterator g = gt.begin(); g != gt.end(); ++g) {
if (*g != "0")
++alt;
}

As I understand the condition at line 63, it should count all alleles that are not 0 (i.e. not REF). So I don't see how it could get it wrong.

But then there is this bit just under it:

if (alt != gt.size()) {
hetCounts[name] += alt;
//hetAlleleCount += alt;
}

Could it be that the count is saved in a wrong variable (note the comment at line 68) or something like that? I'm not sure how the function works exactly, there is a lot of < and > symbols for instance that I don't know what they do (in C++).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Documentation missing or wrong enhancement help wanted
Projects
None yet
Development

No branches or pull requests

2 participants