Genotyping assertion failure #16

Open
prithikasritharan opened this issue Aug 5, 2019 · 6 comments

Comments

@prithikasritharan

prithikasritharan commented Aug 5, 2019

Hi,

I am using BayesTyper to genotype my yeast strains, but after the clustering step I'm getting an assertion failure. I ran bayestyper genotype with the --noise-genotyping parameter because of the smaller genome size, and it produces the following error:

[02/08/2019 10:36:21] You are using BayesTyper (v1.5)

[02/08/2019 10:36:21] Seeding pseudo-random number generator with 1564565781 ...
[02/08/2019 10:36:21] Setting the kmer size to 55 ...

[02/08/2019 10:36:21] Parsed information for 19 sample(s)

[02/08/2019 10:36:21] Parsing reference genome ...
[02/08/2019 10:36:21] Parsed 17 reference genome chromosomes(s) (12157105 nucleotides)

[02/08/2019 10:36:21] Parsing decoy sequence(s) ...
[02/08/2019 10:36:21] Parsed 0 decoy sequence(s) (0 nucleotides)

[02/08/2019 10:36:21] Maximum resident set size: 0.017216 Gb


[02/08/2019 10:36:21] Parsing variant clusters ...
[02/08/2019 10:36:59] Parsed 3299 variant clusters (1776718 variants)

[02/08/2019 10:37:09] Parsing parameter kmers ...
[02/08/2019 10:37:09] Parsed 5028 kmers

[02/08/2019 10:37:09] Maximum resident set size: 18.2243 Gb


[02/08/2019 10:37:09] Counting kmers in variant cluster paths ...
[02/08/2019 11:04:13] Counting kmers in inter-cluster regions and decoy sequence(s) ...

[02/08/2019 11:04:16] Parsing KMC table containing 16766385 kmers for sample 1 ...
[02/08/2019 11:05:20] Parsing KMC table containing 20050264 kmers for sample 2 ...
[02/08/2019 11:06:18] Parsing KMC table containing 23337507 kmers for sample 3 ...
[02/08/2019 11:07:43] Parsing KMC table containing 14484357 kmers for sample 4 ...
[02/08/2019 11:08:38] Parsing KMC table containing 19351468 kmers for sample 5 ...
[02/08/2019 11:09:47] Parsing KMC table containing 190540458 kmers for sample 6 ...
[02/08/2019 11:14:59] Parsing KMC table containing 17861269 kmers for sample 7 ...
[02/08/2019 11:15:57] Parsing KMC table containing 16852594 kmers for sample 8 ...
[02/08/2019 11:16:55] Parsing KMC table containing 16318134 kmers for sample 9 ...
[02/08/2019 11:17:46] Parsing KMC table containing 22293085 kmers for sample 10 ...
[02/08/2019 11:18:47] Parsing KMC table containing 14517377 kmers for sample 11 ...
[02/08/2019 11:19:36] Parsing KMC table containing 14126999 kmers for sample 12 ...
[02/08/2019 11:20:30] Parsing KMC table containing 18857437 kmers for sample 13 ...
[02/08/2019 11:21:27] Parsing KMC table containing 17636840 kmers for sample 14 ...
[02/08/2019 11:22:21] Parsing KMC table containing 20092469 kmers for sample 15 ...
[02/08/2019 11:23:22] Parsing KMC table containing 230337421 kmers for sample 16 ...
[02/08/2019 11:28:47] Parsing KMC table containing 18801072 kmers for sample 17 ...
[02/08/2019 11:30:04] Parsing KMC table containing 18662345 kmers for sample 18 ...
[02/08/2019 11:31:14] Parsing KMC table containing 6201477 kmers for sample 19 ...

[02/08/2019 11:31:37] Classifying kmers in variant cluster paths ...
[02/08/2019 12:06:09] Out of 26475843 kmers:

        - 21164548 have a match to a single variant cluster
        - 480057 have a match to single variant cluster group and multiple variant clusters

        - 0 have match to at least one variant cluster and has match to a decoy sequence (not used for inference)
        - 199 have match to at least one variant cluster and has a maximum haploid multiplicity higher than 127 (not used for inference)
        - 4784833 have matches to multiple variant cluster groups within or across inference units (not used for inference)

        - 46206 have no match to a variant cluster (includes parameter kmers)

[02/08/2019 12:06:09] Maximum resident set size: 18.7509 Gb

[02/08/2019 12:06:09] Estimating genomic haploid kmer count distribution(s) from parameter kmers ...


WARNING: Low number of kmers used for negative binomial parameters estimation for sample 1 (0 < 10000)
WARNING: The mean and variance estimates might be biased due to the genome used being too small, too variant dense and/or too repetitive

bayesTyper: /isdata/kroghgrp/jasi/bayesTyper/code/releases/v1.5_static/BayesTyper-1.5/src/bayesTyper/CountDistribution.cpp:115: void CountDistribution::setGenomicCountDistributions(const std::vector<std::vector<std::vector<KmerStats> > >&, const string&): Assertion `max_genomic_kmer_multiplicity > 0' failed. 
Aborted 

Do you have any suggestions as to what may be causing this error?

Many thanks,
Prithika

@jonassibbesen
Contributor

Hi Prithika,

Thank you for writing. I am currently on paternity leave. I will look into your issue next week, when I am back at work.

Best,

Jonas

@jonassibbesen
Contributor

Hi Prithika,

This error arises when there are no kmers available to estimate the parameters of the negative binomial genomic kmer count distribution for each sample. These parameters are estimated from kmers that are unique to regions of the genome with no variation. In your case it seems that there are no such kmers, which is likely due to the large number of input variants compared to the size of the genome (1 variant per 6.84 base pairs).
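
The density figure above can be reproduced with a few lines of Python. This is only a rough sketch: the genome size, variant count and kmer size are taken from the log in the first comment, while the function names and the assumption that variants are spread uniformly along the genome (a Poisson model) are purely illustrative. It is not how BayesTyper itself selects parameter kmers.

```python
import math

def bp_per_variant(genome_bp: int, n_variants: int) -> float:
    """Average number of reference bases per input variant."""
    return genome_bp / n_variants

def variant_free_window_fraction(genome_bp: int, n_variants: int, k: int) -> float:
    """Expected fraction of k-long windows that overlap no variant position.

    ASSUMPTION: variant positions are uniformly distributed (Poisson),
    so P(no variant in k bases) = exp(-k * variants_per_bp).
    Real variants cluster, so treat this as an order-of-magnitude guide.
    """
    variants_per_bp = n_variants / genome_bp
    return math.exp(-k * variants_per_bp)

# Values reported in the BayesTyper log above.
GENOME_BP = 12157105   # "Parsed 17 reference genome chromosomes(s) (12157105 nucleotides)"
N_VARIANTS = 1776718   # "Parsed 3299 variant clusters (1776718 variants)"
KMER_SIZE = 55         # "Setting the kmer size to 55"

spacing = bp_per_variant(GENOME_BP, N_VARIANTS)
fraction = variant_free_window_fraction(GENOME_BP, N_VARIANTS, KMER_SIZE)

print(f"{spacing:.2f} bp per variant")
print(f"{fraction:.4%} of {KMER_SIZE}-bp windows expected to be variant-free")
```

Under this simplified model, only a very small fraction of 55-bp windows would be free of variation at this density, which is in line with the warning that 0 kmers were available for sample 1.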

Are you using a variant database as input or are these variants predicted directly from your 19 samples? If you are using a database, I would recommend filtering out some of the variation using, for example, a frequency threshold. If predicted, is it from aligned assemblies or mapped reads?
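
As a minimal sketch of what such a frequency filter could look like, the Python below keeps only VCF records whose allele frequency is at or above a cutoff. It assumes the database VCF carries an AF tag in the INFO column, which may not be the case for your file; the function names and the 0.05 cutoff are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Optional

def max_allele_frequency(info: str) -> Optional[float]:
    """Return the largest AF value in a VCF INFO string, or None if absent.

    ASSUMPTION: allele frequencies are stored as AF=<v1>[,<v2>...].
    """
    for entry in info.split(";"):
        if entry.startswith("AF="):
            values = [v for v in entry[3:].split(",") if v != "."]
            return max(float(v) for v in values) if values else None
    return None

def filter_vcf_lines(lines: Iterable[str], min_af: float) -> Iterator[str]:
    """Yield header lines plus records with at least one allele at AF >= min_af.

    Multi-allelic records are kept whole if any alternative allele passes;
    records without a usable AF tag are dropped.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        af = max_allele_frequency(fields[7])  # column 8 is INFO
        if af is not None and af >= min_af:
            yield line

# Tiny in-memory example (made-up records, not real data).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chrI\t100\t.\tA\tG\t.\tPASS\tAF=0.20\n",
    "chrI\t200\t.\tC\tT\t.\tPASS\tAF=0.001\n",
]
kept = list(filter_vcf_lines(example, min_af=0.05))
print("".join(kept), end="")
```

For a real file, the same generator can be fed an open file handle and its output written to a new VCF before re-running the pipeline from the start.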

Best,

Jonas

@prithikasritharan
Author

Hi Jonas,

Many thanks for getting back to me. I had initially used a large variant database containing variants from 1011 strains (totalling 1,953,710 variants) as input. I have now tried re-running the pipeline using just the variants called from mapped reads for my 19 samples (282,435 variants, i.e. 1 variant per 43.04 base pairs), but it still seems to produce the same error.

I will try filtering by a frequency threshold to further reduce the number of variants in the input VCF.

Many thanks,
Prithika

@jonassibbesen
Contributor

jonassibbesen commented Aug 19, 2019

That is strange. I would have expected it to work using the smaller variant set. Would it be possible for you to share with me the smaller variant set and the reference genome? I think it might be easier for me to debug if I had access to the data.

@prithikasritharan
Author

prithikasritharan commented Aug 20, 2019

Hi Jonas,

I re-ran the pipeline from scratch and this seems to have done the trick. I'm still running bayestyper genotype with the smaller variant set, and it seems to be running fine, as it has got past the stage where the assertion failed.

Many thanks again,
Prithika

@jonassibbesen
Contributor

Glad to hear that it works now. Please let me know if you run into any other issues.
