ConvertAllele and Variant normalization #19

Open
malmarri opened this issue Nov 25, 2019 · 3 comments

Comments

@malmarri

Hi,

Firstly thank you for creating such a great resource. I have two questions (using v1.5):

  1. I am trying to genotype SVs identified by Manta (SVs only). When following the steps and using bayesTyperTools convertAllele, almost 20% of variants are skipped, for example:

	Skipped 1219 unsupported allele(s):
	- 307 <INS> alternative allele(s)
	- 912 translocation alternative allele(s)

I understand that your new version adds support for insertions; is there a way to rescue these skipped insertions?

  2. At the variant normalization step using bcftools norm, a significant number of variants produce errors and are not normalized, for example:

Non-ACGTN reference allele at chr3:52803269

Do you have any recommendations for this? I tried bcftools norm --check-ref ws to fix the 'bad sites'.
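
A minimal sketch of that normalization command, with reference and file names as placeholders:

    bcftools norm -f reference.fa --check-ref ws -O z -o normalized.vcf.gz input.vcf.gz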

Best wishes,
Mo

@jonassibbesen
Contributor

Hi Mo,

Thank you for your interest in our tool. BayesTyper does not support translocations, and these will therefore always be filtered by convertAllele. For the insertions, the inserted sequence needs to be present in either the INFO field or given as a separate fasta file (--alt-file); BayesTyper needs the sequence of the insertion in order to be able to genotype it. For Manta specifically, you can use the --keep-partial option, which will allow partial insertions (insertions where only the left and right sides are known, not the whole sequence) to be added as well. The left and right sides of a partial insertion are connected with N's.
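
A sketch of such a convertAllele call with partial insertions enabled; apart from --keep-partial and --alt-file, which are named above, the option letters and file names here are assumptions, so check bayesTyperTools convertAllele --help for the authoritative interface:

    bayesTyperTools convertAllele -v manta.vcf -g reference.fa --keep-partial -o manta_converted
    # or, if the inserted sequences live in a separate fasta:
    bayesTyperTools convertAllele -v manta.vcf -g reference.fa --alt-file manta_insertions.fa -o manta_converted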

Regarding the normalization, it sounds like either you are running on a vcf that still contains symbolic alleles (like <INS> or <DEL>), or the sequences in the file contain nucleotides that are not A, C, G, T or N.
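
A quick way to check for both cases before normalizing, sketched with standard shell tools (input.vcf is a placeholder):

    # Count symbolic alternative alleles such as <INS> or <DEL>
    grep -v '^#' input.vcf | cut -f5 | grep -c '<'
    # Print records whose REF allele contains characters other than A, C, G, T or N
    grep -v '^#' input.vcf | awk 'toupper($4) ~ /[^ACGTN]/ {print $1, $2, $4}'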

Please let me know if you have any other questions.

Best,

Jonas

@malmarri
Author

malmarri commented Nov 26, 2019

Thank you for your quick reply Jonas, I very much appreciate it. I just have a few more questions:

  1. If I'm only attempting to genotype structural variation, do I still need SNPs and indels from these samples included in the candidate_variants file, or are the SNPs and indels in the variation prior sufficient?

  2. Is the variation prior necessary if I have a large high-coverage human dataset (around 1000 samples) from over 30 diverse human populations? My dataset captures the (at least common) variation from most human populations, so I was wondering whether the prior is needed in this case.

  3. Counting kmers for just one sample using kmc results in a huge file (150 GB), and the subsequent bloom step creates a 50 GB file. Creating these for the whole dataset would require very large amounts of storage. Is this normal (or might I be doing something wrong), and if it is, do you have any recommendations?

Thanks again,
Mo

@jonassibbesen
Contributor

Hi, Sorry for the delayed reply.

  1. I would recommend using the SNVs and indels from the samples if possible. These variants are important for correctly matching kmers in the sequencing reads to the SVs. Given that the prior only contains common SNVs, there are likely going to be many SNVs close to SVs that are only present in your samples (see the first sketch after this list).

  2. The prior is not strictly necessary, but it is used to potentially increase sensitivity if your candidate set does not contain all putative variants. You are right that in your case the prior might not provide as big an advantage.

  3. Yes, the kmc output can get quite big. I am a bit surprised that the bloom filter is also that big; in my experience it is normally closer to ~20 GB for high-coverage (~35x) data. One trick you can use is to filter out singleton kmers (kmers observed only once) by removing the -ci1 option, in which case kmc falls back to its default minimum count of 2. This should result in far fewer kmers and thus smaller files. It might lower accuracy, but for high-coverage data it should not be by much (see the second sketch after this list).
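
Regarding 1., a sketch of merging the per-sample SNV/indel calls with the converted Manta SVs into one candidate set; the combine interface shown here (comma-separated <name>:<vcf> pairs) follows my reading of the BayesTyper readme, and all names and paths are placeholders:

    bayesTyperTools combine -v manta:manta_converted.vcf,gatk:snv_indel.vcf -o candidate_variants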
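Regarding 3., a sketch contrasting the two kmc invocations; k=55 is the kmer size BayesTyper works with, while the read list, output prefix, and the makeBloom call details are assumptions to be checked against the documentation:

    # With singletons (larger output):
    kmc -k55 -ci1 @reads.lst sample_kmers kmc_tmp/
    # Without singletons; dropping -ci1 leaves kmc's default minimum count of 2:
    kmc -k55 @reads.lst sample_kmers kmc_tmp/
    # Build the bloom filter from the kmc output:
    bayesTyperTools makeBloom -k sample_kmers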

Please let me know if you have any other questions.

Best,

Jonas
