Best practice for large datasets? #92

Open
JohnsonStev opened this issue Feb 4, 2021 · 4 comments

JohnsonStev commented Feb 4, 2021

Dear Isaac,

I would like to apply McCortex to a large-scale resequencing project (~400 individuals, ~1 Gb genome size).
I read through the wiki, and here is what I think a possible workflow might look like:

  1. Build graphs for each sample and for the reference with one chosen k-mer size
  2. Clean each of the graphs
  3. Merge the clean graphs
  4. Read threading to produce link files
  5. Clean link files
  6. Merge the clean link files
  7. Call the variants
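
Here is roughly how I imagine those steps as commands, based on my reading of the wiki (the mccortex31 binary name, the memory/thread settings, and the exact flags are my guesses, so please correct me if any of them are wrong):

# Steps 1-2: build and clean one graph per sample (plus one for the reference)
$ mccortex31 build -m 8G -t 4 -k 31 -s SampleA -1 SampleA.reads.fq.gz SampleA.raw.ctx
$ mccortex31 clean -m 8G -t 4 -o SampleA.clean.ctx SampleA.raw.ctx
# Step 3: merge the cleaned graphs (and the reference graph) into one multi-colour graph
$ mccortex31 join -m 32G -o all.clean.ctx ref.ctx SampleA.clean.ctx SampleB.clean.ctx
# Steps 4-5: thread each sample's reads through its cleaned graph, then clean the links
$ mccortex31 thread -m 8G -t 4 -1 SampleA.reads.fq.gz -o SampleA.ctp.gz SampleA.clean.ctx
# (link cleaning is done with the 'links' subcommand; the wiki describes the thresholding options)
# Steps 6-7: merge the link files and call variants
$ mccortex31 pjoin -o all.ctp.gz SampleA.ctp.gz SampleB.ctp.gz
$ mccortex31 bubbles -m 32G -o bubbles.txt.gz -p all.ctp.gz all.clean.ctx
$ mccortex31 calls2vcf -F ref.fa -o calls.vcf bubbles.txt.gz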

Do you have any suggestions about this workflow, or are there any pitfalls I need to be aware of?
Thank you so much.

winni2k commented Feb 5, 2021

That should work in principle. If you don't hear back from Isaac, you might try asking @kvg.

@JohnsonStev (Author)

Thanks for the answer. I tried merging all of the cleaned graphs together in a single command, and it took a lot of time.
Would it save time to merge a few graphs in parallel first, and then merge those merged graphs?
Thank you again.

kvg commented Feb 5, 2021

McCortex loads all the graphs into memory before joining them, and yes, this can be a bit slow. I think what you've outlined would be faster, but it's not clear that the improvement would be particularly significant (I'd imagine it depends on the contents of the graphs, particularly the number of k-mers shared between samples).
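
For example, a batched merge might look something like this (file names here are placeholders, and the only mccortex option I'm assuming is join's -o):

# first pass: merge batches of cleaned graphs, run in parallel
$ mccortex31 join -o batch1.ctx sample001.clean.ctx ... sample050.clean.ctx &
$ mccortex31 join -o batch2.ctx sample051.clean.ctx ... sample100.clean.ctx &
$ wait
# second pass: merge the batch graphs into the final graph
$ mccortex31 join -o all.samples.ctx batch1.ctx batch2.ctx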

An alternate strategy that might help you is the "Join" command we wrote in a companion tool, Corticall. This assumes your graphs are stored in sorted order (with the '-s' option in mccortex commands), and then the graphs are merged linearly. This tends to be much faster than the built-in McCortex join command; I've used this to merge a couple hundred microbial genomes. The resulting joined graphs will remain compatible with all of mccortex's subcommands.

After downloading and building Corticall, the command line for this would be:

$ java -jar build/jars/corticall.jar Join -g <graph_1.sorted.ctx> -g <graph_2.sorted.ctx> ... -g <graph_N.sorted.ctx> -o joined.ctx

Please let me know if that does or doesn't work for you.

@JohnsonStev (Author)

Dear KVG,

Thanks for your response; I will try it.

Meanwhile, I am still working through the whole workflow using a subset of the data.
I've done link threading plus link error cleaning for each sample. Now I am trying to merge the link files, but I've found that I don't know how to generate a "ref.ctp.gz" or "refAndSamples.ctp.gz" file. All I get after running "thread" are "sample.ctp.gz" files.
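
For reference, the per-sample command I ran to produce those link files was along these lines (file names simplified, and the exact flags may not match my real command):

$ mccortex31 thread -m 8G -t 4 -1 SampleA.reads.fq.gz -o SampleA.ctp.gz SampleA.clean.ctx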

Thank you so much for the help!
