Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question about optimizing foldseek for metagenomics gene catalogue #240

Open
szimmerman92 opened this issue Feb 8, 2024 · 7 comments
Open

Comments

@szimmerman92
Copy link

Hi,

I have a little over 1 million proteins from a metagenome I would like to perform an easy-search on Alphafold/UniProt50. I am running foldseek on Google Cloud so I am fairly flexible on CPUs and RAM, but ideally would like to keep costs down. I have tried running foldseek with over 600 GBs of memory but it still dies during prefiltering.

What are suggestions for optimizing foldseek on such a large query? I tried the command --sort-by-structure-bits 0 . I could try --prefilter-mode 1 but that only seems to be effective on smaller queries. In this dataset, I also have some proteins larger than 1000 amino acids long. Would you recommend I remove those?

I also tried to run createindex --split-memory-limit 60G database_afuniprot50 temp to make the database use a little less memory but I got the error "Database database_afuniprot50 needs header information". Do you know why this occured?

Is this too much data to use foldseek on? Any help would be appreciated and if you need more information please let me know.

Below please find the command I run and the output with a machine of 624 GBs of memory and 49 CPU cores.

Thank you very much for this incredible software.

Best,
Sam

foldseek easy-search --sort-by-structure-bits 0 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

easy-search --sort-by-structure-bits 0 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

MMseqs Version: 3bf3cdf
Seq. id. threshold 0
Coverage threshold 0
Coverage mode 0
Max reject 2147483647
Max accept 2147483647
Add backtrace false
TMscore threshold 0
TMalign hit order 0
TMalign fast 1
Preload mode 0
Threads 96
Verbosity 3
LDDT threshold 0
Sort by structure bit score 0
Alignment type 2
Substitution matrix aa:3di.out,nucl:3di.out
Alignment mode 3
Alignment mode 0
E-value threshold 10
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Gap open cost aa:10,nucl:10
Gap extension cost aa:1,nucl:1
Compressed 0
Seed substitution matrix aa:3di.out,nucl:3di.out
Sensitivity 9.5
k-mer length 6
Target search mode 0
k-score seq:2147483647,prof:2147483647
Max results per query 1000
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 0
Mask residues probability 0.99995
Mask lower case residues 1
Minimum diagonal score 30
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Exhaustive search mode false
Prefilter mode 0
Search iterations 1
Remove temporary files true
MPI runner
Force restart with latest tmp false
Cluster search 0
Chain name mode 0
Write mapping file 0
Mask b-factor threshold 0
Coord store mode 2
Write lookup file 1
Tar Inclusion Regex .*
Tar Exclusion Regex ^$
Input format 0
File Inclusion Regex .*
File Exclusion Regex ^$
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
Greedy best hits false

Create directory temp/5295610052143856966/search_tmp
search /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 temp/5295610052143856966/result temp/5295610052143856966/search_tmp --sort-by-structure-bits 0 --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 6 --mask 0 --mask-prob 0.99995 --remove-tmp-files 1

prefilter /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db_ss /mnt/disks/mydisk/database_afuniprot50/afuniprot50_ss temp/5295610052143856966/search_tmp/13789935573118591569/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

Query database size: 1073247 type: Aminoacid
Estimated memory consumption: 337G
Target database size: 53665860 type: Aminoacid
Index table k-mer threshold: 78 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 53.67M 36s 22ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 53.67M 1m 4s 179ms
Index statistics
Entries: 11018018466
DB size: 63533 MB
Avg k-mer size: 172.156539
Top 10 k-mers
DDDDDD 25475954
DDDDDP 22448571
DDDDPP 19428803
VVVVPD 17790099
DPVVVV 16769600
SVSVVV 14134938
SVVVVV 13707739
DDVVVV 12394056
LVVVVV 11880759
VSVVVV 10654575
Time for index table init: 0h 4m 32s 714ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 78
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1073247
Target db start 1 to 53665860
Killed ] 0.00% 1 eta -
Error: Kmer matching step died
Error: Search died

@milot-mirdita
Copy link
Member

milot-mirdita commented Feb 9, 2024

We had an issue where we could lose good hits and implemented a fix for correctness. However, that resulted in potentially huge memory allocations for some queries matching many target sequences.

We are aware of the issue and are thinking how to fix this. In the meantime, try forcing the use of 7-mers instead of 6-mers with -k 7 and using less threads --threads 32 (currently it seems to be using 96 threads). The issue is the per thread memory consumption, not the index.

I would recommend to avoid creating a precomputed index with createindex, especially when you have many queries. That comes with very different trade-offs.

@szimmerman92
Copy link
Author

Hi Milot,

Thank you very much for the quick response. I added the arguments you suggested but it still crashed. Here is the command I ran foldseek easy-search -k 7 --threads 32 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

And here is the output. Any more suggestions that you think might work? Do you think it is worth trying to use the previous version before the update?

Thanks again!

Best,
Sam

easy-search -k 7 --threads 32 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

MMseqs Version: 2503196
Seq. id. threshold 0
Coverage threshold 0
Coverage mode 0
Max reject 2147483647
Max accept 2147483647
Add backtrace false
TMscore threshold 0
TMalign hit order 0
TMalign fast 1
Preload mode 0
Threads 32
Verbosity 3
LDDT threshold 0
Sort by structure bit score 1
Alignment type 2
Substitution matrix aa:3di.out,nucl:3di.out
Alignment mode 3
Alignment mode 0
E-value threshold 10
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Gap open cost aa:10,nucl:10
Gap extension cost aa:1,nucl:1
Compressed 0
Seed substitution matrix aa:3di.out,nucl:3di.out
Sensitivity 9.5
k-mer length 7
Target search mode 0
k-score seq:2147483647,prof:2147483647
Max results per query 1000
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 0
Mask residues probability 0.99995
Mask lower case residues 1
Minimum diagonal score 30
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Exhaustive search mode false
Prefilter mode 0
Search iterations 1
Remove temporary files true
MPI runner
Force restart with latest tmp false
Cluster search 0
Chain name mode 0
Write mapping file 0
Mask b-factor threshold 0
Coord store mode 2
Write lookup file 1
Tar Inclusion Regex .*
Tar Exclusion Regex ^$
Input format 0
File Inclusion Regex .*
File Exclusion Regex ^$
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
Greedy best hits false

search /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 temp/12968833095025548755/result temp/12968833095025548755/search_tmp --threads 32 --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 7 --mask 0 --mask-prob 0.99995 --remove-tmp-files 1

prefilter /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db_ss /mnt/disks/mydisk/database_afuniprot50/afuniprot50_ss temp/12968833095025548755/search_tmp/12349765448822118560/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 32 --compressed 0 -v 3

Query database size: 1073247 type: Aminoacid
Estimated memory consumption: 185G
Target database size: 53665860 type: Aminoacid
Index table k-mer threshold: 90 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 53.67M 1m 15s 39ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 53.67M 3m 15s 8ms
Index statistics
Entries: 11335407711
DB size: 74627 MB
Avg k-mer size: 8.855787
Top 10 k-mers
DDDDDDD 22900829
DDDDDDP 20036462
DDDDDPP 16763068
DPVVVVV 15787301
VVVVVPP 13227255
DDVVVVV 13114181
SVVVVVV 12197158
LVVVVVV 9253579
DDPVVVV 8343123
VLVVVVV 7762948
Time for index table init: 0h 7m 57s 22ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 90
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1073247
Target db start 1 to 53665860
Killed ] 0.00% 1 eta -
Error: Kmer matching step died
Error: Search died

@martin-steinegger
Copy link
Collaborator

@szimmerman92 could you please try the newest version (commit f629bbe)? I implemented a different strategy that avoids reallocations in the prefilter.

@szimmerman92
Copy link
Author

Hi Martin,

Sorry for the late response. I am going to try the commit you made above. How do you install foldseek from source or using that commit?

Sorry for this very basic question.

Thanks again.

Best,
Sam

@milot-mirdita
Copy link
Member

You can download precompiled binaries here: https://mmseqs.com/foldseek

Otherwise follow the instructions in the wiki to compile from source.

@szimmerman92
Copy link
Author

Thank you very much for the instructions. Foldseek no longer dies when running but it seems like no progress is being made. It hangs during the step "Starting prefiltering scores calculation (Step 1 of 2).

This is what the output looks like so far.

prefilter /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db_ss /mnt/disks/mydisk/database_afuniprot50/afuniprot50_ss temp/13593466129177909235/search_tmp/4380869008468293709/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 16 --compressed 0 -v 3

Query database size: 1073247 type: Aminoacid
Target split mode. Searching through 2 splits
Estimated memory consumption: 69G
Target database size: 53665860 type: Aminoacid
Process prefiltering step 1 of 2

Index table k-mer threshold: 78 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 26.82M 2m 56s 668ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 26.82M 4m 41s 335ms

Index statistics
Entries: 5505266600
DB size: 31989 MB
Avg k-mer size: 86.019791
Top 10 k-mers
DDDDDD 12729457
DDDDDP 11218285
DDDDPP 9710346
VVVVPD 8890212
DPVVVV 8379919
SVSVVV 7066127
SVVVVV 6849741
DDVVVV 6193142
LVVVVV 5936668
VSVVVV 5325698
Time for index table init: 0h 11m 42s 309ms
k-mer similarity threshold: 78
Starting prefiltering scores calculation (step 1 of 2)
Query db start 1 to 1073247
Target db start 1 to 26821505
] 0.00% 1 eta -

Its hard to tell if its just slowly making progress or completely hanging. Is there a way to tell?

Thank you for all your help!

Sam

@milot-mirdita
Copy link
Member

It's making very slow progress. This search is quite large. I would recommend to run on more CPU cores.
For comparison 340K vs 50M that I did recently took a couple days on 128 cores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants