Reduce MultiQC memory consumption #2517

Open
3 of 7 tasks
rhpvorderman opened this issue Apr 30, 2024 · 5 comments

Comments

@rhpvorderman
Contributor

rhpvorderman commented Apr 30, 2024

Description of feature

See #1961 for the issue.

I have done a lot of memory profiling over the last few days and identified the issue. The rampant memory usage is caused mostly by the modules. MultiQC core itself is fairly OK with memory.

The core problem is that MultiQC uses global variables and does not use an iterative model to generate the report. All the modules are executed and put in a list. Then the report is built. This means all the modules stay in scope for the rest of the run after execution.
The problem then arises not from the MultiQC plots, but from the data structures that are added to the modules themselves in a self.data[sample_name] = all_data_i_just_parsed pattern. This self.data variable holds a lot of data and therefore uses a lot of memory, but the garbage collector can't clean it up because the module is still referenced.

The primary means of fixing this is executing the modules lazily. This will bring the memory usage down from multiqc_mem_usage + sum(per_module_mem_usage) to multiqc_mem_usage + max(per_module_mem_usage).
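To make the sum-versus-max argument concrete, here is a minimal sketch using only the standard library. ToyModule, run_eager, run_lazy and the sizes are hypothetical stand-ins, not MultiQC code; the point is only that dropping each module before building the next one lowers the peak.

```python
import tracemalloc


class ToyModule:
    """Hypothetical stand-in for a MultiQC module (not the real API)."""

    def __init__(self, name, n_values):
        self.name = name
        # Mimics the self.data[sample_name] = all_data_i_just_parsed pattern.
        self.data = {"sample": list(range(n_values))}

    def render_section(self):
        return f"{self.name}: {len(self.data['sample'])} values"


SIZES = {"a": 200_000, "b": 100_000, "c": 50_000}


def run_eager():
    # All modules are built first and stay referenced by the list.
    modules = [ToyModule(name, n) for name, n in SIZES.items()]
    return [m.render_section() for m in modules]


def run_lazy():
    sections = []
    for name, n in SIZES.items():
        module = ToyModule(name, n)
        sections.append(module.render_section())
        del module  # drop the reference before the next module is built
    return sections


def peak_bytes(func):
    """Run func and return its result plus the peak traced allocation."""
    tracemalloc.start()
    result = func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


eager_sections, eager_peak = peak_bytes(run_eager)
lazy_sections, lazy_peak = peak_bytes(run_lazy)
print(f"eager peak: {eager_peak} bytes, lazy peak: {lazy_peak} bytes")
```

Both variants produce the same sections. The eager peak tracks the sum of the module sizes, the lazy peak tracks the largest single module.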

The following steps need to be taken:

I have done the things I could easily do. I feel the refactoring needs to be done by a seasoned core MultiQC developer. Ping @vladsavelyev @ewels .

@vladsavelyev
Member

vladsavelyev commented Apr 30, 2024

> Refactor MultiQC so only one module remains in scope at module run time. Report sections should immediately be generated and the module discarded before the next module is run.

Cannot agree more! I started getting rid of module-level fields like self.data in some modules, and was planning to complete it after some bigger refactoring in #2442.

Modules also write into report.saved_raw_data, which we don't really need to keep for the entire run, unless the users requested that explicitly (e.g. for interactive use case).
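A rough sketch of that idea, under the assumption that raw data can always go to disk and only stays in memory on request. ToyReport, keep_raw_data and save_raw_data are hypothetical names for illustration; only the attribute name saved_raw_data comes from the comment above, and this is not how MultiQC's report object actually works.

```python
import json
import tempfile
from pathlib import Path


class ToyReport:
    """Hypothetical report object; not MultiQC's real class."""

    def __init__(self, out_dir, keep_raw_data=False):
        self.out_dir = Path(out_dir)
        self.keep_raw_data = keep_raw_data  # assumed opt-in flag
        self.saved_raw_data = {}

    def save_raw_data(self, key, data):
        # Always persist to disk so the data can still be exported.
        path = self.out_dir / f"{key}.json"
        path.write_text(json.dumps(data))
        # Hold the in-memory copy only when explicitly requested,
        # e.g. for interactive use.
        if self.keep_raw_data:
            self.saved_raw_data[key] = data
        return path


with tempfile.TemporaryDirectory() as tmp:
    batch = ToyReport(tmp)
    interactive = ToyReport(tmp, keep_raw_data=True)
    written = batch.save_raw_data("toy_module", {"sample_1": 42})
    interactive.save_raw_data("toy_module", {"sample_1": 42})
    on_disk = json.loads(written.read_text())
```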

That's something on my radar, and I'll prioritize it.

> Reduce memory usage of modules that load a lot of data in memory

Absolutely. There are a bunch of modules that can be improved. mosdepth is another example.

Thanks for creating this issue! Super helpful to have this in such a structured way.

@ewels
Member

ewels commented May 1, 2024

Great stuff, thanks @rhpvorderman! Actually really nice that FastQC has some easy pickings re: memory, as that is one of the most commonly used modules in MultiQC, so will have a big impact (it was also the first module, so some of the first Python code I ever wrote!)

Agree about the module refactoring. My grand plan that I've been mulling over for years is broadly:

  1. Refactor internal code to use Pydantic
    • Improves code quality, and using Pydantic will give easy access to modern data formats
  2. Develop / adopt a new "intermediate file format"
    • Refactor modules to dump parsed data and separate parsing / report generation.
    • Could be files, parquet, flat database or anything really.
    • Allows multiple MultiQC runs to be combined
    • Means everything doesn't need to be stored in memory
    • Means MultiQC could be run in stages, could make sense in some high throughput analysis pipeline settings
  3. Look into multi-threading modules
    • Now that parsing is separated, may be able to get performance gains by running module parsing in parallel
    • May not be worth the effort in terms of complexity vs. speed up
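Step 2 of the plan above could look roughly like the following sketch. It assumes JSON files as the intermediate format purely for illustration (the plan leaves the format open), and parse_stage, report_stage and the toy inputs are hypothetical names, not MultiQC functions.

```python
import json
import tempfile
from pathlib import Path


def parse_stage(raw_inputs, out_dir):
    """Parse each module's input and dump it to one JSON file per module."""
    paths = []
    for module_name, lines in raw_inputs.items():
        parsed = {}
        for line in lines:
            sample, value = line.split("\t")
            parsed[sample] = float(value)
        path = Path(out_dir) / f"{module_name}.json"
        path.write_text(json.dumps(parsed))
        paths.append(path)
        # parsed goes out of scope here, so nothing accumulates in memory.
    return paths


def report_stage(paths):
    """Load one intermediate file at a time and yield a report section."""
    for path in paths:
        parsed = json.loads(path.read_text())
        mean = sum(parsed.values()) / len(parsed)
        yield f"{path.stem}: {len(parsed)} samples, mean {mean:.1f}"


raw_inputs = {
    "module_a": ["s1\t10", "s2\t20"],
    "module_b": ["s1\t30.5"],
}

with tempfile.TemporaryDirectory() as tmp:
    dumped = parse_stage(raw_inputs, tmp)
    sections = list(report_stage(dumped))

print(sections)
```

Because the two stages only communicate through files, they could run at different times, and files from several runs could be handed to the same report stage.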

This is a fairly loose plan and doesn't have to happen in this order. @vladsavelyev is already making good progress on the first stages, as mentioned 🙌🏻

@rhpvorderman
Contributor Author

That is a very interesting roadmap! I am looking forward to seeing that come to fruition.

Regarding multithreading: MultiQC executes in roughly a minute on my machine on the production data you have provided to me (with thousands of reports!). That means that longer execution times on clusters are usually caused by storage and memory bottlenecks. Multithreading is not going to mitigate those bottlenecks. While it may bring some performance improvements locally, I doubt they will materialize on clusters and cloud computing.

@rhpvorderman
Contributor Author

rhpvorderman commented May 10, 2024

I fixed FastQC. It turned out to be a two-liner fix (with one additional comment line explaining the two lines).

How did I find out? I used a tip from the great Mike Acton: if you want to get a feel for what data you are handling, just use print. So I used print, and I found a lot of similar lines. These all turned out to be from the same section in the FastQC report, which is not used by MultiQC. So these lines are now skipped.
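For readers who want a feel for this kind of fix, here is a small sketch of skipping an unused section while parsing. It is not the actual MultiQC patch. It assumes the section layout of fastqc_data.txt as I understand it (a `>>Name<TAB>status` header, `#` column lines, and a `>>END_MODULE` terminator); parse_sections and the "Unused Section" name are placeholders, since the comment above does not say which section was skipped.

```python
def parse_sections(lines, skip_sections=frozenset()):
    """Collect tab-separated rows per section, never storing skipped ones."""
    sections = {}
    current = None
    skipping = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>END_MODULE"):
            current = None
            skipping = False
        elif line.startswith(">>"):
            name = line[2:].split("\t")[0]
            skipping = name in skip_sections
            current = None if skipping else sections.setdefault(name, [])
        elif skipping or current is None or line.startswith("#"):
            # Rows of skipped sections are dropped before they are split
            # or stored, so they never take up memory.
            continue
        else:
            current.append(line.split("\t"))
    return sections


example = [
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    "Total Sequences\t1000",
    ">>END_MODULE",
    ">>Unused Section\twarn",  # placeholder name, not a claim about the real fix
    "#Key\tCount",
    "AAAA\t1",
    "CCCC\t2",
    ">>END_MODULE",
]

parsed = parse_sections(example, skip_sections={"Unused Section"})
print(parsed)
```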

@vladsavelyev
Member

Wow, great find 😮

I suspected unzipping or HTML parsing, but I wouldn't have guessed it had to do with something as simple as that. In retrospect, it makes sense given how unreasonably much memory FastQC uses compared to other modules.

Thanks a lot, Ruben!
