Create and query an index of many human genomes? #45

Open
jeremymsimon opened this issue May 26, 2023 · 0 comments

Hey Team Pufferfish-
I'd like your opinions on best practices for indexing many (hundreds? thousands?) of whole human genomes and then querying them. Say I have a handful of 25-mers (k=25) and I want to find out whether each one occurs in the index, in which genomes, at which positions, and how many times. Although the index is huge, the query will always be small, no more than a few hundred k-mers at a time, unlike alignment of RNA-seq reads or similar.
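
To make the desired output concrete, here is a toy in-memory sketch of the query I have in mind. It is not how Pufferfish works and would never scale to whole genomes; the function names (`build_index`, `query`) and the `{genome: {contig: sequence}}` input layout are made up for illustration, and it assumes both strands should be collapsed into a canonical k-mer.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def build_index(genomes, k):
    """Map each canonical k-mer to every (genome, contig, position) where it occurs.

    `genomes` is a hypothetical layout: {genome_name: {contig_name: sequence}}.
    Positions are 0-based offsets of the k-mer's first base on the forward strand.
    """
    index = defaultdict(list)
    for genome, contigs in genomes.items():
        for contig, seq in contigs.items():
            seq = seq.upper()
            for pos in range(len(seq) - k + 1):
                kmer = seq[pos:pos + k]
                # Skip k-mers containing N or other ambiguous bases.
                if set(kmer) <= set("ACGT"):
                    index[canonical(kmer)].append((genome, contig, pos))
    return index


def query(index, kmers):
    """Report whether, in which genomes, where, and how often each k-mer occurs."""
    report = {}
    for kmer in kmers:
        hits = index.get(canonical(kmer.upper()), [])
        per_genome = defaultdict(int)
        for genome, _contig, _pos in hits:
            per_genome[genome] += 1
        report[kmer] = {
            "found": bool(hits),
            "total": len(hits),
            "per_genome": dict(per_genome),
            "positions": list(hits),
        }
    return report
```

The point is only the shape of the answer: for each query k-mer, a yes/no, a total count, a per-genome count, and a list of locations.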

In a somewhat-miniaturized (albeit still relatively giant) test, I grabbed sequences from the Human Pangenome Reference (n=94 genomes + CHM13) and am attempting to index them with pufferfish. The counting step alone used >600GB of RAM, and the run is seemingly nowhere near complete after 24 hours of runtime with 12 threads.

Is this at all feasible? Is Pufferfish an appropriate tool for this? Or are related tools like fulgor or cuttlefish or others better suited for this scale?

I figured other standard k-mer counting tools may be more efficient, but from my non-expert perspective it seemed I'd likely sacrifice knowing the genomic locations of each match, and perhaps also whether a given k-mer matched multiple times within the same genome and/or across the whole index. Unless it is truly necessary to give it up, this is information I'd like to retain in my output. Also note that I'm currently approaching this without any downsampling or sparsity (in other words, it is a fully dense index in which all k-mers are represented), but if needed I may be able to employ some k-mer selection tricks.

Curious to hear your thoughts on this, and thanks as always!
