
Memory usage with large metaproteomics databases #1575

Closed
magnusarntzen opened this issue May 14, 2024 · 3 comments

@magnusarntzen

Hi,
We use FragPipe on Linux in headless mode for metaproteomics. This works very nicely in most cases, but when databases grow too big, FragPipe runs out of memory.

For example: we do metagenomic sequencing of environmental samples and generate genomes of the microbes within each sample. Depending on the microbial complexity, there can easily be hundreds to many thousands of microbes present, yielding proteomics databases with 1-5 million protein entries. Such large databases work nicely up to 2-3 million entries, but at 5 million (plus decoys) we run out of memory. By enabling swap on the drive I can even run the JVM with 1300 GiB of memory, but this is still not enough when mass calibration is switched on. Switching it off, however, fixes the problem with even half the memory. I have also split the database into 16 parts, reduced missed cleavages to 1, set the maximum peptide length to 35 aa, and allowed only oxidation of M as a variable modification.
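For reference, these settings correspond roughly to the fragger.params entries in the sketch below. The key names follow recent MSFragger releases and are an assumption to check against your own params file; the database split count itself is handled by FragPipe / msfragger_pep_split.py rather than by fragger.params.

```python
# Sketch: fragger.params entries matching the settings described above.
# Key names are assumptions based on recent MSFragger releases.
settings = {
    "allowed_missed_cleavage_1": "1",   # missed cleavages reduced to 1
    "digest_max_length": "35",          # max peptide length 35 aa
    "variable_mod_01": "15.9949 M 3",   # oxidation of M as the only variable mod
}
for key, value in settings.items():
    print(f"{key} = {value}")
```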

When we monitor the memory usage of FragPipe it has some peaks, one at the beginning of the MSFragger first search (msfragger_pep_split.py, line 614, the calibrate function) and one during IonQuant; the rest of the time the memory usage is rather low. My question is whether it is possible to rethink the memory usage of FragPipe to avoid these memory peaks. Could the calibration, for example, use just a subset of the FASTA file? The main search uses database splits, but does the calibration do that as well?
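For context, the peaks can be located with a small sampler like the sketch below: it sums VmRSS over all `java` processes every 30 seconds. It is Linux-only (/proc), and the `java` process-name filter is an assumption about how MSFragger and IonQuant appear on our node.

```python
# Sketch: sample total JVM resident memory to locate FragPipe's memory peaks.
import os
import time

def rss_gib(pid: int) -> float:
    """Resident set size of one process in GiB (VmRSS is reported in kB)."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / (1024 ** 2)
    except FileNotFoundError:
        pass
    return 0.0

def java_pids():
    """Yield PIDs of processes named 'java' (assumption: JVM-based FragPipe tools)."""
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as fh:
                    if fh.read().strip() == "java":
                        yield int(entry)
            except FileNotFoundError:
                continue

if __name__ == "__main__":
    while True:
        total = sum(rss_gib(pid) for pid in java_pids())
        print(f"{time.strftime('%H:%M:%S')}  total JVM RSS: {total:.1f} GiB", flush=True)
        time.sleep(30)
```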

For the record, I am well aware of the other issues that come with large databases, such as protein inference, deterioration of FDR, etc. Ideally we would not use such big databases, but our options are limited. We could perhaps filter to the most abundant microbes (by DNA read count), or cluster similar proteins, but that is another discussion.

Any input on this would be highly appreciated!

Warm regards,
and thanks for a great software,
Magnus Arntzen
NMBU, Norway

@fcyu self-assigned this May 14, 2024
@fcyu (Member) commented May 14, 2024

Hi Magnus,

Thanks for the feedback and suggestions.

The issue is due to the mass calibration. MSFragger doesn't split the database when doing the first search for the mass calibration, so no matter how many splits you specify, the memory footprint of the first search won't change. As you have also figured out, disabling the mass calibration resolves this problem.
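In fragger.params terms that corresponds to `calibrate_mass = 0`. As a rough illustration (the file path is just an example, and the parameter line format should be checked against your own params file), flipping it programmatically could look like:

```python
# Sketch: switch off mass calibration in an existing fragger.params file.
# calibrate_mass is the MSFragger parameter behind the calibration setting;
# "fragger.params" is an example path.
from pathlib import Path

params = Path("fragger.params")
lines = params.read_text().splitlines()
with params.open("w") as out:
    for line in lines:
        if line.strip().startswith("calibrate_mass"):
            # 0 = off, 1 = mass calibration, 2 = calibration + parameter optimization
            out.write("calibrate_mass = 0\n")
        else:
            out.write(line + "\n")
```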

> My question is whether it is possible to rethink the memory usage of FragPipe to avoid these memory peaks. Could the calibration, for example, use just a subset of the FASTA file? The main search uses database splits, but does the calibration do that as well?

This is due to how the split-database searching is implemented. It uses a Python script that splits the FASTA file and runs MSFragger multiple times, which doesn't work with MSFragger's internal first search and the subsequent mass calibration. Making the first search support a split database is not trivial: we would need to either re-implement the database splitting inside MSFragger or figure out a smart way to make it work with the Python script, which is even more complicated.
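Conceptually, the split search does something like the sketch below (a simplified illustration, not the actual msfragger_pep_split.py; the file names and the java command line are placeholders). The point is that the whole split/run/merge loop sits outside any single MSFragger invocation, whereas mass calibration happens inside one invocation and therefore sees the full, unsplit database.

```python
# Simplified illustration of split-database searching (NOT the real
# msfragger_pep_split.py): split the FASTA into N chunks, point a copy of
# fragger.params at each chunk, run MSFragger once per chunk, then merge.
import subprocess

def read_fasta(path):
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line.rstrip(), []
            else:
                seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def split_fasta(path, n_splits):
    entries = list(read_fasta(path))
    chunk = (len(entries) + n_splits - 1) // n_splits
    parts = []
    for i in range(n_splits):
        part = f"split_{i:02d}.fasta"
        with open(part, "w") as out:
            for header, seq in entries[i * chunk:(i + 1) * chunk]:
                out.write(f"{header}\n{seq}\n")
        parts.append(part)
    return parts

def params_for(part, template="fragger.params"):
    out_path = f"fragger_{part}.params"
    with open(template) as src, open(out_path, "w") as out:
        for line in src:
            if line.strip().startswith("database_name"):
                out.write(f"database_name = {part}\n")   # point the search at this chunk
            else:
                out.write(line)
    return out_path

for part in split_fasta("huge_metaproteome.fasta", 16):
    subprocess.run(["java", "-Xmx64g", "-jar", "MSFragger.jar", params_for(part), "sample.mzML"])
# ...followed by merging/rescoring of the per-split results (omitted here).
```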

Best,

Fengchao

@magnusarntzen (Author)

Dear Fengchao,
thank you for a quick and detailed answer!

Would it be possible to make an implementation where the user could select a different database for the first search? In that case, we could make a small subset of our database containing, e.g., species that we know are abundantly present in all samples, ensuring significant hits for all the raw files provided. Would this be sufficient for mass calibration and parameter optimization?
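For illustration only, building such a calibration subset from the full database could look like the sketch below. The header convention and the bin IDs are assumptions about our own FASTA files, not anything FragPipe provides.

```python
# Sketch: write a small calibration-only FASTA containing just the most
# abundant genomes/bins. Assumes headers encode the genome/bin ID as the
# first '|'-separated token after '>', e.g. ">bin_0042|protein_123 ...";
# adjust keep() for your own header format.
abundant_bins = {"bin_0042", "bin_0107", "bin_0583"}   # hypothetical IDs from DNA read counts

def keep(header: str) -> bool:
    bin_id = header[1:].split("|", 1)[0].strip()
    return bin_id in abundant_bins

with open("full_metaproteome.fasta") as src, open("calibration_subset.fasta", "w") as dst:
    writing = False
    for line in src:
        if line.startswith(">"):
            writing = keep(line)
        if writing:
            dst.write(line)
```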

KR,
Magnus

@anesvi (Collaborator) commented May 16, 2024

Hi Magnus, we are working on metaproteomics projects and improvements in FragPipe, but they are not ready for public release. Perhaps you can email us directly to discuss. Best, Alexey

@fcyu closed this as completed May 19, 2024