add step to break up misassemblies in de novo contigs #804

Open
notestaff opened this issue Mar 29, 2018 · 5 comments

@notestaff
Contributor

De novo contigs sometimes glue together non-adjacent pieces of the genome, or repeat the same piece twice. Add a step, as in QUAST and SHIVER, to blast the contigs against known references and, if a contig has more than one local match, break the contig up. If we break up too much, scaffolding and gapfilling should be able to restore contiguity.
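
A minimal sketch of the splitting step, assuming the hits for one contig have already been parsed into 1-based inclusive `(qstart, qend)` query intervals (for example from BLAST tabular output). The function name `split_contig_by_hits` and the rule of dropping hits fully contained in a longer hit are illustrative assumptions, not what QUAST or SHIVER do:

```python
from typing import List, Tuple


def split_contig_by_hits(seq: str, hits: List[Tuple[int, int]]) -> List[str]:
    """Return one piece per distinct local match, or the contig unchanged.

    hits are 1-based inclusive (qstart, qend) intervals on the contig.
    Assumption: a hit fully contained in a longer hit is redundant.
    """
    # Sort by start, longest first, so contained hits follow their container.
    intervals = sorted(
        ((min(s, e), max(s, e)) for s, e in hits),
        key=lambda iv: (iv[0], -iv[1]),
    )
    kept: List[Tuple[int, int]] = []
    for start, end in intervals:
        if kept and end <= kept[-1][1]:
            continue  # contained in the previous kept hit
        kept.append((start, end))
    # Zero or one local match: leave the contig alone.
    if len(kept) < 2:
        return [seq]
    # More than one local match: emit the matched span of each hit.
    return [seq[start - 1:end] for start, end in kept]
```

Pieces from partially overlapping hits share the overlapping bases, and unmatched sequence between hits is dropped; whether that is acceptable depends on how well scaffolding and gapfilling recover it.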

@notestaff notestaff added the bug label Mar 29, 2018
@notestaff notestaff self-assigned this Mar 29, 2018
@tomkinsc
Member

tomkinsc commented Mar 29, 2018

Since we're planning to blast contigs for metagenomic analysis (#795, in progress), maybe we can reuse the hits for both splitting and LCA assignment? At a minimum, we could re-run blast for assignment only on the contigs that need to be split.

@notestaff
Contributor Author

So which workflow would this be in? Right now metagenomics analysis and reference-assisted assembly are different workflows. For the latter we'd blast against a small db of just the taxon we're assembling, e.g. all known mumps genomes.

@dpark01
Member

dpark01 commented Mar 29, 2018

Which application is this for, and what kind of problem are we trying to solve? If this is about the assembly process, the simplest approach would be to leave the contigs alone and just make sure the scaffolding code tolerates splitting up contigs. It’s already aligning the contigs to the references the user wants to align to at that point, so that’s probably the right step to solve the problem. It’s possible that the current code already handles this appropriately?

If this is about a metagenomic workflow, perhaps the first step is really to determine how often this problem happens and whether metaSPAdes makes it go away. And if not, maybe Chris’s suggestion makes the most sense: have the downstream contig classifier break things up based on what it sees.

@notestaff
Contributor Author

I was thinking assembly. For metagenomics, it doesn't necessarily matter where a contig blasts to, only that it does? In assembly, a misassembled contig might not align well to a reference under novoalign: if it glues together two somewhat-far-away parts, it'll look like an unreasonably big insertion.
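
To make the "glued far-away parts" case concrete, here is a rough sketch that flags a likely misjoin when two adjacent hits on a contig are much further apart on the reference than on the contig. It assumes forward-strand hits to a single reference, given as `(qstart, qend, sstart, send)`; the name `find_misjoins` and the 500 bp threshold are hypothetical:

```python
from typing import List, Tuple

Hit = Tuple[int, int, int, int]  # (qstart, qend, sstart, send), forward strand


def find_misjoins(hits: List[Hit], max_extra_gap: int = 500) -> List[Tuple[int, int, int]]:
    """Return (prev_qend, next_qstart, extra) for each suspicious junction.

    extra is the reference gap minus the contig gap between adjacent hits.
    The default threshold is an arbitrary illustrative value.
    """
    ordered = sorted(hits)  # order hits along the contig
    flagged = []
    for prev, nxt in zip(ordered, ordered[1:]):
        query_gap = nxt[0] - prev[1]
        ref_gap = nxt[2] - prev[3]
        extra = ref_gap - query_gap
        # A large mismatch means the two pieces are not neighbours on the reference.
        if abs(extra) > max_extra_gap:
            flagged.append((prev[1], nxt[0], extra))
    return flagged
```

Each flagged junction is a candidate breakpoint; reverse-strand hits and hits to different references would need separate handling.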

@notestaff
Contributor Author

There are also de novo scaffolding tools I wanted to look at that scaffold based on read pairs rather than a reference (as part of a general effort to make the process less reference-dependent). These tools might get confused by misassembled contigs. Of course, using a collection of references to fix misassemblies is itself a reference-dependent process, but less so than reference-assisted scaffolding.
