
Memory usage with large metaproteomics databases #1575

Closed
magnusarntzen opened this issue May 14, 2024 · 3 comments

@magnusarntzen

Hi,
We use FragPipe on Linux in headless mode for metaproteomics. This works very nicely in most cases, but when databases grow too big, FragPipe runs out of memory.

For example: we do metagenomic sequencing of environmental samples and generate genomes of the microbes within each sample. Depending on the microbial complexity, there can easily be hundreds to many thousands of microbes present, yielding proteomics databases with 1-5 million protein entries. Such large databases work nicely up to 2-3 million entries, but at 5 million (plus decoys) we run out of memory. By enabling swap on the drive I can even run the JVM with 1300 GiB of memory, but this is still not enough when mass calibration is switched on. Switching it off, however, fixes the problem with even half the memory. I have also split the database into 16 parts, reduced missed cleavages to 1, set the maximum peptide length to 35 aa, and allowed only oxidation of M as a variable modification.
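For reference, these settings correspond roughly to the fragger.params entries in the sketch below. The key names follow recent MSFragger releases and are an assumption to check against your own params file; the database split count itself is handled by FragPipe / msfragger_pep_split.py rather than by fragger.params.

```python
# Sketch: fragger.params entries matching the settings described above.
# Key names are assumptions based on recent MSFragger releases.
settings = {
    "allowed_missed_cleavage_1": "1",   # missed cleavages reduced to 1
    "digest_max_length": "35",          # max peptide length 35 aa
    "variable_mod_01": "15.9949 M 3",   # oxidation of M as the only variable mod
}
for key, value in settings.items():
    print(f"{key} = {value}")
```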

When we monitor the memory usage of FragPipe it has some peaks, one at the beginning of the MSFragger first search (msfragger_pep_split.py, line 614, the calibrate function) and one during IonQuant; the rest of the time the memory usage is rather low. My question is whether it is possible to rethink the memory usage of FragPipe to avoid these memory peaks. Could the calibration, for example, use just a subset of the FASTA file? The main search uses database splits, but does the calibration do that as well?
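For context, the peaks can be located with a small sampler like the sketch below: it sums VmRSS over all `java` processes every 30 seconds. It is Linux-only (/proc), and the `java` process-name filter is an assumption about how MSFragger and IonQuant appear on our node.

```python
# Sketch: sample total JVM resident memory to locate FragPipe's memory peaks.
import os
import time

def rss_gib(pid: int) -> float:
    """Resident set size of one process in GiB (VmRSS is reported in kB)."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / (1024 ** 2)
    except FileNotFoundError:
        pass
    return 0.0

def java_pids():
    """Yield PIDs of processes named 'java' (assumption: JVM-based FragPipe tools)."""
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as fh:
                    if fh.read().strip() == "java":
                        yield int(entry)
            except FileNotFoundError:
                continue

if __name__ == "__main__":
    while True:
        total = sum(rss_gib(pid) for pid in java_pids())
        print(f"{time.strftime('%H:%M:%S')}  total JVM RSS: {total:.1f} GiB", flush=True)
        time.sleep(30)
```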

For the record, I am well aware of the other issues that come with large databases, such as protein inference, deterioration of FDR, etc. Ideally we would not use such big databases, but our options are limited. We could perhaps filter to the most abundant microbes (by DNA read count), or cluster similar proteins, but that is another discussion.

Any input on this would be highly appreciated!

Warm regards,
and thanks for a great software,
Magnus Arntzen
NMBU, Norway

@fcyu self-assigned this May 14, 2024
@fcyu (Member) commented May 14, 2024

Hi Magnus,

Thanks for the feedback and suggestions.

The issue is due to the mass calibration. MSFragger doesn't split the database when doing the first search for the mass calibration, so no matter how many splits you specify, the memory footprint of the first search won't change. As you have also figured out, disabling the mass calibration resolves this problem.
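In fragger.params terms that corresponds to `calibrate_mass = 0`. As a rough illustration (the file path is just an example, and the parameter line format should be checked against your own params file), flipping it programmatically could look like:

```python
# Sketch: switch off mass calibration in an existing fragger.params file.
# calibrate_mass is the MSFragger parameter behind the calibration setting;
# "fragger.params" is an example path.
from pathlib import Path

params = Path("fragger.params")
lines = params.read_text().splitlines()
with params.open("w") as out:
    for line in lines:
        if line.strip().startswith("calibrate_mass"):
            # 0 = off, 1 = mass calibration, 2 = calibration + parameter optimization
            out.write("calibrate_mass = 0\n")
        else:
            out.write(line + "\n")
```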

> My question is whether it is possible to rethink the memory usage of FragPipe to avoid these memory peaks. Could the calibration, for example, use just a subset of the FASTA file? The main search uses database splits, but does the calibration do that as well?

This is due to how the split-database searching is implemented. It uses a Python script that splits the FASTA file and runs MSFragger multiple times, which doesn't work with MSFragger's internal first search and the subsequent mass calibration. Making the first search support a split database is not trivial: we would need to either re-implement the database splitting inside MSFragger or figure out a smart way to make it work with the Python script, which is even more complicated.
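Conceptually, the split search does something like the sketch below (a simplified illustration, not the actual msfragger_pep_split.py; the file names and the java command line are placeholders). The point is that the whole split/run/merge loop sits outside any single MSFragger invocation, whereas mass calibration happens inside one invocation and therefore sees the full, unsplit database.

```python
# Simplified illustration of split-database searching (NOT the real
# msfragger_pep_split.py): split the FASTA into N chunks, point a copy of
# fragger.params at each chunk, run MSFragger once per chunk, then merge.
import subprocess

def read_fasta(path):
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line.rstrip(), []
            else:
                seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def split_fasta(path, n_splits):
    entries = list(read_fasta(path))
    chunk = (len(entries) + n_splits - 1) // n_splits
    parts = []
    for i in range(n_splits):
        part = f"split_{i:02d}.fasta"
        with open(part, "w") as out:
            for header, seq in entries[i * chunk:(i + 1) * chunk]:
                out.write(f"{header}\n{seq}\n")
        parts.append(part)
    return parts

def params_for(part, template="fragger.params"):
    out_path = f"fragger_{part}.params"
    with open(template) as src, open(out_path, "w") as out:
        for line in src:
            if line.strip().startswith("database_name"):
                out.write(f"database_name = {part}\n")   # point the search at this chunk
            else:
                out.write(line)
    return out_path

for part in split_fasta("huge_metaproteome.fasta", 16):
    subprocess.run(["java", "-Xmx64g", "-jar", "MSFragger.jar", params_for(part), "sample.mzML"])
# ...followed by merging/rescoring of the per-split results (omitted here).
```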

Best,

Fengchao

@magnusarntzen (Author)

Dear Fengchao,
thank you for a quick and detailed answer!

Would it be possible to make an implementation where the user could select a different database for the first search? In that case, we could make a small subset of our database containing, e.g., species that we know are abundantly present in all samples, ensuring significant hits for all the raw files provided. Would this be sufficient for mass calibration and parameter optimization?
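For illustration only, building such a calibration subset from the full database could look like the sketch below. The header convention and the bin IDs are assumptions about our own FASTA files, not anything FragPipe provides.

```python
# Sketch: write a small calibration-only FASTA containing just the most
# abundant genomes/bins. Assumes headers encode the genome/bin ID as the
# first '|'-separated token after '>', e.g. ">bin_0042|protein_123 ...";
# adjust keep() for your own header format.
abundant_bins = {"bin_0042", "bin_0107", "bin_0583"}   # hypothetical IDs from DNA read counts

def keep(header: str) -> bool:
    bin_id = header[1:].split("|", 1)[0].strip()
    return bin_id in abundant_bins

with open("full_metaproteome.fasta") as src, open("calibration_subset.fasta", "w") as dst:
    writing = False
    for line in src:
        if line.startswith(">"):
            writing = keep(line)
        if writing:
            dst.write(line)
```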

KR,
Magnus

@anesvi (Collaborator) commented May 16, 2024

Hi Magnus, we are working on metaproteomics projects and improvements in FragPipe, but they are not ready for public release. Perhaps you can email us directly to discuss. Best, Alexey

@fcyu closed this as completed May 19, 2024