Documentation #121

timothymillar · 2022-02-24T04:01:22Z

Need to write some more extensive documentation.

General background
How tools fit together assemble, call etc.
Use of more advanced features such as parallel tempering
Worked examples with public data
Some examples of downstream analysis?
Atomizing haplotypes to get SNPs

The text was updated successfully, but these errors were encountered:

etnite · 2023-06-23T20:50:05Z

Hello Tim,

Thank you for all of your work on MCHap! It's a really great tool. I am planning to use it on some Flex-Seq data generated in alfalfa, but I had a question that I thought might make a good documentation topic:

Do you have any recommended practices for upstream processing of the VCF used to identify SNVs? I use Freebayes for calling, typically followed by decomposition of primitive alleles using vcflib. However, I'm unsure of whether or not the input VCF should be filtered - e.g. is it best to filter by quality, missing data percentage, depth, MAF, etc.? The raw Freebayes output contains a lot of junk with very low allele frequencies or call rates - I'm not sure how MCHap would handle such variants.

Thank you!

Brian

timothymillar · 2023-06-24T02:30:51Z

Hi @etnite, thanks for the feedback! That's a good question and something that I need to investigate more thoroughly. In my PhD work I used a similar process to what you've described: Freebayes -> decomposition of primitive alleles -> Some filtering. However, these days I would suggest using minimal, if any, filtering. Then review the results and consider further filtering if necessary. If the results are very messy (lots of unique haplotypes, low posterior probabilities (GPH and PHPM)) then filtering the input SNVs is one option that may help cleaning up the result.

MCHap assemble will automatically exclude SNVs at the sample level if they have a very high probability of being homozygous (configured by the --mcmc-fix-homozygous argument). This essentially means that filtering by MAF shouldn't be necessary.

My understanding is that the QUAL score from FreeBayes is essentially a probability of some variation being present. So a low QUAL score shouldn't be too different from a low MAF.

Low read depth can be a real issue for MCHap and results in low confidence calls with many unique haplotypes. However, with targeted sequencing like Flex-seq, it's probably more robust to adjust the assembly windows (bed intervals) to exclude any low coverage areas.

I have come across sites with high QUAL scores and read depth, but low GQ scores. My assumption was that this may be indicative of some sort of copy number variation or misaligned/off-target sequences. MCHap won't handle either of these situations very well. Filtering these variants is an option, but in may just be masking the underlying issue. So, these days I'd recommend running MCHap without filtering. The MCI field reported by MCHap can be useful to help diagnose this sort of issue although it's not bulletproof.

Overall, I think filtering by depth is the best option if you do want to apply some filtering. Removing the low call rate and low MAF stuff shouldn't haven a big impact on the results of MCHap, but it might speed up the analysis a bit.

Hopefully that's helpful!

etnite · 2023-06-25T13:35:49Z

Many thanks @timothymillar. This all makes sense to me. You are correct that in our Flex-seq data depth doesn't tend to be much of an issue, though there are always a handful of targeted regions where the assay doesn't work very efficiently for whatever reason. As you mention it might be easiest to simply exclude these from the .bed file.

I watched your presentation for Polyploid Tools so I think I understand what the MCI field is reporting.

I'm looking forwards to testing MCHap on a few regions of particular interest - then I can try scaling up to the whole genome once I get some experience working with the output.

timothymillar · 2023-10-27T07:57:30Z

Just an update @etnite, I've released version 0.9.0 which includes a new tool mchap find-snvs. This is a very simplistic approach to finding putative SNVs for haplotype assembly but I find it much more strait forward than using FreeBayes or similar. I won't promise that it produces better results, but it allows you to filter based on depth as described above.

timothymillar added the documentation label Feb 24, 2022

timothymillar added this to the Publication milestone Feb 24, 2022

timothymillar self-assigned this Feb 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation #121

Documentation #121

timothymillar commented Feb 24, 2022 •

edited

etnite commented Jun 23, 2023

timothymillar commented Jun 24, 2023 •

edited

etnite commented Jun 25, 2023

timothymillar commented Oct 27, 2023

Documentation #121

Documentation #121

Comments

timothymillar commented Feb 24, 2022 • edited

etnite commented Jun 23, 2023

timothymillar commented Jun 24, 2023 • edited

etnite commented Jun 25, 2023

timothymillar commented Oct 27, 2023

timothymillar commented Feb 24, 2022 •

edited

timothymillar commented Jun 24, 2023 •

edited