Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation #121

Open
3 of 6 tasks
timothymillar opened this issue Feb 24, 2022 · 4 comments
Open
3 of 6 tasks

Documentation #121

timothymillar opened this issue Feb 24, 2022 · 4 comments
Assignees
Milestone

Comments

@timothymillar
Copy link
Collaborator

timothymillar commented Feb 24, 2022

Need to write some more extensive documentation.

  • General background
  • How tools fit together assemble, call etc.
  • Use of more advanced features such as parallel tempering
  • Worked examples with public data
  • Some examples of downstream analysis?
  • Atomizing haplotypes to get SNPs
@timothymillar timothymillar added this to the Publication milestone Feb 24, 2022
@timothymillar timothymillar self-assigned this Feb 24, 2022
@etnite
Copy link

etnite commented Jun 23, 2023

Hello Tim,

Thank you for all of your work on MCHap! It's a really great tool. I am planning to use it on some Flex-Seq data generated in alfalfa, but I had a question that I thought might make a good documentation topic:

Do you have any recommended practices for upstream processing of the VCF used to identify SNVs? I use Freebayes for calling, typically followed by decomposition of primitive alleles using vcflib. However, I'm unsure of whether or not the input VCF should be filtered - e.g. is it best to filter by quality, missing data percentage, depth, MAF, etc.? The raw Freebayes output contains a lot of junk with very low allele frequencies or call rates - I'm not sure how MCHap would handle such variants.

Thank you!

Brian

@timothymillar
Copy link
Collaborator Author

timothymillar commented Jun 24, 2023

Hi @etnite, thanks for the feedback! That's a good question and something that I need to investigate more thoroughly. In my PhD work I used a similar process to what you've described: Freebayes -> decomposition of primitive alleles -> Some filtering. However, these days I would suggest using minimal, if any, filtering. Then review the results and consider further filtering if necessary. If the results are very messy (lots of unique haplotypes, low posterior probabilities (GPH and PHPM)) then filtering the input SNVs is one option that may help cleaning up the result.

MCHap assemble will automatically exclude SNVs at the sample level if they have a very high probability of being homozygous (configured by the --mcmc-fix-homozygous argument). This essentially means that filtering by MAF shouldn't be necessary.

My understanding is that the QUAL score from FreeBayes is essentially a probability of some variation being present. So a low QUAL score shouldn't be too different from a low MAF.

Low read depth can be a real issue for MCHap and results in low confidence calls with many unique haplotypes. However, with targeted sequencing like Flex-seq, it's probably more robust to adjust the assembly windows (bed intervals) to exclude any low coverage areas.

I have come across sites with high QUAL scores and read depth, but low GQ scores. My assumption was that this may be indicative of some sort of copy number variation or misaligned/off-target sequences. MCHap won't handle either of these situations very well. Filtering these variants is an option, but in may just be masking the underlying issue. So, these days I'd recommend running MCHap without filtering. The MCI field reported by MCHap can be useful to help diagnose this sort of issue although it's not bulletproof.

Overall, I think filtering by depth is the best option if you do want to apply some filtering. Removing the low call rate and low MAF stuff shouldn't haven a big impact on the results of MCHap, but it might speed up the analysis a bit.

Hopefully that's helpful!

@etnite
Copy link

etnite commented Jun 25, 2023

Many thanks @timothymillar. This all makes sense to me. You are correct that in our Flex-seq data depth doesn't tend to be much of an issue, though there are always a handful of targeted regions where the assay doesn't work very efficiently for whatever reason. As you mention it might be easiest to simply exclude these from the .bed file.

I watched your presentation for Polyploid Tools so I think I understand what the MCI field is reporting.

I'm looking forwards to testing MCHap on a few regions of particular interest - then I can try scaling up to the whole genome once I get some experience working with the output.

@timothymillar
Copy link
Collaborator Author

Just an update @etnite, I've released version 0.9.0 which includes a new tool mchap find-snvs. This is a very simplistic approach to finding putative SNVs for haplotype assembly but I find it much more strait forward than using FreeBayes or similar. I won't promise that it produces better results, but it allows you to filter based on depth as described above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants