question about optimizing foldseek for metagenomics gene catalogue #240
Comments
We had an issue where we could lose good hits and implemented a fix for correctness. However, that fix can result in very large memory allocations for queries that match many target sequences. We are aware of the problem and are thinking about how to fix it. In the meantime, try forcing the use of 7-mers instead of 6-mers. I would also recommend avoiding a precomputed index for now.
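The exact flags were lost from the comment above when the inline code spans were stripped, so the following is only a hedged sketch: in MMseqs2-derived tools such as Foldseek, the prefilter k-mer length is typically controlled with `-k`, and skipping `createindex` avoids the precomputed index. The database paths are hypothetical placeholders.

```shell
# Hedged sketch -- assumes the MMseqs2-style `-k` option sets the
# prefilter k-mer length; verify against `foldseek easy-search --help`.
# query_db / target_db / tmp are placeholder paths.
foldseek easy-search query_db target_db results tmp -k 7
```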
Hi Milot, Thank you very much for the quick response. I added the arguments you suggested, but it still crashed. Here is the command I ran, and here is the output. Any more suggestions that you think might work? Do you think it is worth trying the previous version, from before the update? Thanks again! Best,
@szimmerman92 could you please try the newest version (commit f629bbe)? I implemented a different strategy that avoids reallocations in the prefilter.
Hi Martin, Sorry for the late response. I am going to try the commit you made above. How do I install foldseek from source, or from that specific commit? Sorry for this very basic question. Thanks again. Best,
You can download precompiled binaries here: https://mmseqs.com/foldseek Otherwise, follow the instructions in the wiki to compile from source.
Thank you very much for the instructions. Foldseek no longer dies while running, but it seems like no progress is being made. It hangs during the step "Starting prefiltering scores calculation (Step 1 of 2)". This is what the output looks like so far.
It's hard to tell whether it is just making slow progress or hanging completely. Is there a way to tell? Thank you for all your help! Sam
It's making very slow progress. This search is quite large. I would recommend running on more CPU cores.
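One practical way to bring more cores (or more machines) to bear on a search this size is to split the query set into chunks and run an independent search per chunk. This is a hedged sketch of that idea, not an official Foldseek workflow; the file names are hypothetical, and a tiny demo FASTA stands in for the real million-protein query set.

```shell
# Tiny demo input standing in for the real query set.
printf '>a\nAAAA\n>b\nCCCC\n>c\nGGGG\n>d\nTTTT\n>e\nMMMM\n' > queries.fasta

n=4
mkdir -p chunks
# Round-robin whole FASTA records (header line + sequence lines)
# into n chunk files: a new output file is chosen at each '>' header.
awk -v n="$n" '/^>/ { f = sprintf("chunks/part_%d.fasta", ++i % n) }
               { print > f }' queries.fasta

ls chunks    # part_0.fasta part_1.fasta part_2.fasta part_3.fasta
# Each chunk could then be searched in parallel (hypothetical command):
# foldseek easy-search chunks/part_0.fasta target_db hits_0 tmp
```

Result files from the per-chunk searches can simply be concatenated afterwards, since each query's hits are independent of the other queries.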
Hi,
I have a little over 1 million proteins from a metagenome on which I would like to perform an easy-search against AlphaFold/UniProt50. I am running foldseek on Google Cloud, so I am fairly flexible on CPUs and RAM, but I would ideally like to keep costs down. I have tried running foldseek with over 600 GB of memory, but it still dies during prefiltering.
What are your suggestions for optimizing foldseek on such a large query set? I tried the option `--sort-by-structure-bits 0`. I could try `--prefilter-mode 1`, but that only seems to be effective on smaller queries. This dataset also contains some proteins longer than 1000 amino acids. Would you recommend removing those?

I also tried running `createindex --split-memory-limit 60G database_afuniprot50 temp` to make the database use a little less memory, but I got the error "Database database_afuniprot50 needs header information". Do you know why this occurred?

Is this too much data to use foldseek on? Any help would be appreciated, and if you need more information, please let me know.
Below please find the command I ran and the output, on a machine with 624 GB of memory and 49 CPU cores.
Thank you very much for this incredible software.
Best,
Sam
foldseek easy-search --sort-by-structure-bits 0 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp
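For readers hitting the same wall, the command above can be combined with the memory- and thread-related options discussed in this thread. This is a hedged sketch: the paths match the command above, but the flag values (`500G`, `49`) are illustrative assumptions to tune for your machine, not verified settings.

```shell
# Hedged sketch -- same search as above, with explicit resource limits.
# --split-memory-limit caps prefilter memory by splitting the target DB;
# --threads should match the available CPU cores. Values are assumptions.
foldseek easy-search \
    /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db \
    /mnt/disks/mydisk/database_afuniprot50/afuniprot50 \
    foldseek_results_fhs_aa temp \
    --sort-by-structure-bits 0 \
    --split-memory-limit 500G \
    --threads 49
```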