Reduce MultiQC memory consumption #2517

rhpvorderman · 2024-04-30T09:47:56Z

Description of feature

See #1961 for the issue.

I have done a lot of memory profiling over the last few days and identified the issue. Rampant memory usage is mostly caused by the modules. The MultiQC core itself is fairly economical with memory.

The core problem is that MultiQC uses global variables and does not generate the report iteratively. All the modules are executed and put in a list. Only then is the report built. This means all the modules stay in scope for the rest of the run after they have executed.
The problem then arises not from the MultiQC plots, but from the data structures that are added to the modules themselves in a self.data[sample_name] = all_data_i_just_parsed pattern. This self.data variable holds a lot of data and so uses a lot of memory, but the garbage collector can't clean it up because the module is still referenced.

The primary means of fixing this is executing the modules lazily. This will bring the memory usage down from multiqc_mem_usage + sum(per_module_mem_usage) to multiqc_mem_usage + max(per_module_mem_usage).
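To make the sum versus max difference concrete, here is a minimal sketch. FakeModule, SPECS, run_eager and run_lazy are hypothetical names made up for illustration, not MultiQC code, and the sketch relies on CPython's reference counting to free a module as soon as its last reference is dropped.

```python
import weakref


class FakeModule:
    """Hypothetical stand-in for a MultiQC module that keeps parsed data on itself."""

    live = weakref.WeakSet()  # tracks instances that are still reachable

    def __init__(self, name, n_values):
        self.name = name
        self.data = {"sample": list(range(n_values))}  # the self.data pattern
        FakeModule.live.add(self)

    def section(self):
        # A report section only needs a small summary, not the raw data.
        return f"{self.name}: {len(self.data['sample'])} values"


SPECS = [("fastqc", 1000), ("sequali", 500), ("mosdepth", 2000)]


def run_eager(specs):
    """All modules are built first, so all of them are alive at report time."""
    modules = [FakeModule(name, n) for name, n in specs]
    peak_live = len(FakeModule.live)
    return [m.section() for m in modules], peak_live


def run_lazy(specs):
    """Each module is built, rendered and dropped before the next one starts."""
    sections, peak_live = [], 0
    for name, n in specs:
        module = FakeModule(name, n)
        peak_live = max(peak_live, len(FakeModule.live))
        sections.append(module.section())
        del module  # last reference gone, so CPython frees it right away
    return sections, peak_live
```

Both functions produce the same sections, but the eager variant has every module alive at once while the lazy variant never holds more than one.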

The following steps need to be taken:

- Reduce MultiQC core memory usage (Stream json data to a file to save 30% of the memory. #2510, Reduce memory requirement of MultiQC main functions #2515)
- Refactor MultiQC so only one module remains in scope at module run time. Report sections should immediately be generated and the module discarded before the next module is run.
- Reduce memory usage of modules that load a lot of data in memory
  - FastQC (FastQC: Skip per tile sequence quality section for better performance #2552)
  - Sequali (Reduce Sequali and (slightly) FastQC memory footprint #2516)
  - mosdepth
  - Other modules... (please add other modules to the list)
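As an illustration of the "stream JSON data to a file" idea in the first step, the sketch below writes JSON chunk by chunk instead of first building one large string. This is not taken from #2510. write_report_data is a hypothetical helper, and only the standard library json.JSONEncoder.iterencode call is assumed.

```python
import io
import json


def write_report_data(data, fh):
    """Write JSON to an open file handle chunk by chunk."""
    # iterencode yields small string pieces, so the full serialized
    # document never has to exist in memory as a single string.
    for chunk in json.JSONEncoder().iterencode(data):
        fh.write(chunk)
```

The handle can be any object with a write method, for example a file opened with open(path, "w") or an io.StringIO buffer.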

I have done the things I could easily do. I feel the refactoring needs to be done by a seasoned core MultiQC developer. Ping @vladsavelyev @ewels.

vladsavelyev · 2024-04-30T12:38:37Z

> Refactor MultiQC so only one module remains in scope at module run time. Report sections should immediately be generated and the module discarded before the next module is run.

Cannot agree more! I started getting rid of module-level fields like self.data in some modules, and was planning to complete it after some bigger refactoring in #2442.

Modules also write into report.saved_raw_data, which we don't really need to keep for the entire run, unless the user requested that explicitly (e.g. for an interactive use case).
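A rough sketch of that idea, under stated assumptions: Report and run_module below are hypothetical stand-ins and keep_raw_data is an invented flag. Only the attribute name saved_raw_data comes from the comment above. This is not how MultiQC or #2551 implements it.

```python
class Report:
    """Hypothetical, minimal stand-in for MultiQC's global report state."""

    def __init__(self):
        self.saved_raw_data = {}
        self.sections = []


def run_module(report, name, parsed, keep_raw_data=False):
    # The module stores its parsed data on the report, as described above.
    report.saved_raw_data[name] = parsed
    report.sections.append(f"{name}: {len(parsed)} samples")
    if not keep_raw_data:
        # Drop the raw data as soon as the section exists, unless the
        # user explicitly asked to keep it (e.g. for interactive use).
        report.saved_raw_data.pop(name, None)
```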

That's something on my radar, and I'll prioritize it.

> Reduce memory usage of modules that load a lot of data in memory

Absolutely. There are a bunch of modules that can be improved. mosdepth is another example.

Thanks for creating this issue! Super helpful to have this in such a structured way.

ewels · 2024-05-01T10:28:26Z

Great stuff, thanks @rhpvorderman! Actually really nice that FastQC has some easy pickings re: memory, as that is one of the most commonly used modules in MultiQC, so will have a big impact (it was also the first module, so some of the first Python code I ever wrote!)

Agree about the module refactoring. My grand plan that I've been mulling over for years is broadly:

- Refactor internal code to use Pydantic
  - Improves code quality, and using Pydantic will give easy access to modern data formats
- Develop / adopt a new "intermediate file format"
  - Refactor modules to dump parsed data and separate parsing / report generation.
  - Could be files, parquet, flat database or anything really.
  - Allows multiple MultiQC runs to be combined
  - Means everything doesn't need to be stored in memory
  - Means MultiQC could be run in stages, which could make sense in some high-throughput analysis pipeline settings
- Look into multi-threading modules
  - Once parsing is separated, may be able to get performance gains by running module parsing in parallel
  - May not be worth the effort in terms of complexity vs. speed-up
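The split between parsing and report generation could look roughly like the sketch below. It is only an illustration: JSON Lines is chosen as the intermediate format purely for convenience (the plan above leaves the format open), and parse_stage, report_stage and demo are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def parse_stage(module_name, samples, out_dir):
    """Write one JSON line per sample instead of keeping everything in memory."""
    path = Path(out_dir) / f"{module_name}.jsonl"
    with path.open("w") as fh:
        for sample_name, values in samples:
            fh.write(json.dumps({"sample": sample_name, "values": values}) + "\n")
    return path


def report_stage(paths):
    """Read intermediate files one record at a time and keep only a summary."""
    rows = []
    for path in paths:
        path = Path(path)
        with path.open() as fh:
            for line in fh:
                record = json.loads(line)
                rows.append((path.stem, record["sample"], sum(record["values"])))
    return rows


def demo():
    # Two parse runs, possibly at different times, combined into one report.
    with tempfile.TemporaryDirectory() as tmp:
        first = parse_stage("fastqc", [("s1", [1, 2, 3])], tmp)
        second = parse_stage("mosdepth", [("s1", [10]), ("s2", [5, 5])], tmp)
        return report_stage([first, second])
```

Because the report stage only needs the files, the parse stage could run earlier, on another machine, or once per batch of samples.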

This is a fairly loose plan and doesn't have to happen in this order. @vladsavelyev is already making good progress on the first stages, as mentioned 🙌🏻

rhpvorderman · 2024-05-01T11:37:42Z

That is a very interesting roadmap! I am looking forward to seeing that come to fruition.

Regarding multithreading: MultiQC executes in roughly a minute on my machine on the production data you have provided to me (with thousands of reports!). That means that longer execution times on clusters are usually caused by storage and memory bottlenecks. Multithreading is not going to mitigate those bottlenecks. While it may bring some performance improvements locally, I doubt they will materialize on clusters and in cloud computing.

rhpvorderman · 2024-05-10T11:57:22Z

I fixed FastQC. It turned out to be a two-liner fix (with one additional comment line explaining the two lines).

How did I find out? I used a tip from the great Mike Acton: if you want to get a feel for what data you are handling, just use print. So I used print, and I found a lot of similar lines. These all turned out to come from the same section in the FastQC report, which is not used by MultiQC. So these lines are now skipped.
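For readers unfamiliar with the FastQC data file, skipping a section while parsing might be sketched as follows. This is not the actual two-line fix. parse_fastqc_data and SKIPPED_SECTIONS are hypothetical names, the skipped section name is an assumption based on the title of #2552, and the sketch assumes the usual layout where a section starts with a ">>Name" line and ends with ">>END_MODULE".

```python
# Assumption: this is the unused section, going by the title of #2552.
SKIPPED_SECTIONS = {"Per tile sequence quality"}


def parse_fastqc_data(lines):
    """Collect data rows per section, never storing rows of skipped sections."""
    sections = {}
    current = None  # name of the section being collected, or None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>END_MODULE"):
            current = None
        elif line.startswith(">>"):
            name = line[2:].split("\t")[0]
            current = None if name in SKIPPED_SECTIONS else name
            if current is not None:
                sections[current] = []
        elif current is not None and not line.startswith("#"):
            # Header lines start with "#"; everything else is a data row.
            sections[current].append(line.split("\t"))
    return sections
```

Rows of a skipped section are dropped as they are read, so they never end up in a long-lived data structure.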

vladsavelyev · 2024-05-10T12:27:42Z

Wow, great find 😮

I suspected unzipping or HTML parsing, but I wouldn't have guessed it would have to do with something as simple as that. In retrospect, it makes sense given how unreasonably much memory FastQC was using compared to other modules.

Thanks a lot, Ruben!

vladsavelyev added the core: back end label Apr 30, 2024

vladsavelyev mentioned this issue May 9, 2024

Clean up module raw data after running each module #2551

Merged
