Reduce MultiQC memory consumption #2517
Comments
Cannot agree more! I started getting rid of module-level fields like Modules also write into That's something on my radar, and I'll prioritize it.
Absolutely. There are a bunch of modules that can be improved. Thanks for creating this issue! Super helpful to have this in such a structured way.
Great stuff, thanks @rhpvorderman! Actually really nice that FastQC has some easy pickings re: memory, as that is one of the most commonly used modules in MultiQC, so will have a big impact (it was also the first module, so some of the first Python code I ever wrote!) Agree about the module refactoring. My grand plan that I've been mulling over for years is broadly:
This is a fairly loose plan and doesn't have to happen in this order. @vladsavelyev is already making good progress on the first stages, as mentioned 🙌🏻
That is a very interesting roadmap! I am looking forward to seeing it come to fruition. As for multithreading: MultiQC executes in roughly a minute on my machine on the production data you have provided to me (with thousands of reports!). That means that longer execution times on clusters are usually caused by storage and memory bottlenecks. Multithreading is not going to mitigate those bottlenecks. While it may bring some performance improvements locally, I doubt they will materialize on clusters and in cloud computing.
I fixed FastQC. It turned out to be a two-line fix (with one additional comment line explaining the two lines). How did I find out? I used a tip from the great Mike Acton: if you want to get a feel for what data you are handling, just use print. So I used print, and I found a lot of similar lines. These all turned out to come from the same section of the FastQC report, one that is not used by MultiQC. So these lines are now skipped.
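As a rough illustration of the kind of fix described above (not the actual MultiQC patch), here is a minimal sketch of a parser that never stores the lines of a section it does not need. It assumes the `fastqc_data.txt` layout in which a section opens with `>>Name<TAB>status` and closes with `>>END_MODULE`. The function name `parse_fastqc_sections`, the `skip_sections` parameter, and the section called "Some Unused Section" are all hypothetical; the comment above does not say which section was skipped.

```python
def parse_fastqc_sections(lines, skip_sections=frozenset()):
    """Collect the data rows of each section, ignoring sections in skip_sections.

    Hypothetical helper for illustration only; not part of MultiQC.
    """
    sections = {}
    current = None  # name of the section being collected, or None when skipping
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>END_MODULE"):
            current = None
        elif line.startswith(">>"):
            name = line[2:].split("\t")[0]
            if name in skip_sections:
                # Rows of this section are never stored, so they cost no memory.
                current = None
            else:
                current = name
                sections[current] = []
        elif current is not None and not line.startswith("#"):
            sections[current].append(line.split("\t"))
    return sections


# Made-up sample data; the second section name is a placeholder.
SAMPLE = [
    "##FastQC\t0.0.0\n",
    ">>Basic Statistics\tpass\n",
    "#Measure\tValue\n",
    "Total Sequences\t1000\n",
    ">>END_MODULE\n",
    ">>Some Unused Section\tpass\n",
    "#Key\tValue\n",
    "a\t1\n",
    "b\t2\n",
    ">>END_MODULE\n",
]

if __name__ == "__main__":
    print(parse_fastqc_sections(SAMPLE, skip_sections={"Some Unused Section"}))
```

The point of the design is that skipped rows are dropped while reading, rather than parsed first and discarded later.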
Wow, great find 😮 I suspected unzipping or HTML parsing, but I wouldn't have guessed it would have to do with something as simple as that. In retrospect, it makes sense given how unreasonably much memory FastQC is using compared to other modules. Thanks a lot, Ruben!
Description of feature
See #1961 for the issue.
I have done a lot of memory profiling over the last few days and identified the issue. Rampant memory usage is mostly caused by the modules. The MultiQC core itself is fairly OK with memory.
The core problem is that MultiQC uses global variables and does not use an iterative model to generate the report. All the modules are executed and put in a list. Then the report is built. This means all the modules stay in scope for the entire run after they have executed.
The problem then arises not from the MultiQC plots, but from the data structures that are added to the modules themselves in a `self.data[sample_name] = all_data_i_just_parsed` pattern. This `self.data` variable contains a lot of data, which uses a lot of memory, but the garbage collector can't clean it up because the module is still referenced.
The primary means of fixing this is executing the modules lazily. This will bring the memory usage down from `multiqc_mem_usage + sum(per_module_mem_usage)` to `multiqc_mem_usage + max(per_module_mem_usage)`.
The following steps need to be taken:
I have done the things I could easily do. I feel the refactoring needs to be done by a seasoned core MultiQC developer. Ping @vladsavelyev @ewels.
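The sum-versus-max argument above can be sketched with a toy model. This is only an illustration under stated assumptions: `FakeModule`, `SPECS`, `run_eager`, `run_lazy`, and `peak_bytes` are hypothetical names and do not correspond to MultiQC's real classes or functions. The eager variant keeps every module in a list, as described; the lazy variant keeps only a small summary per module and drops the module before the next one is built. Peak allocation is measured with the standard library's `tracemalloc`.

```python
import tracemalloc


class FakeModule:
    """Hypothetical stand-in for a MultiQC module; not the real API."""

    def __init__(self, name, n_samples):
        self.name = name
        # Mirrors the self.data[sample_name] = all_data_i_just_parsed pattern.
        self.data = {f"{name}_sample_{i}": list(range(200)) for i in range(n_samples)}

    def summary(self):
        # The small result that report generation actually needs in this toy model.
        return (self.name, len(self.data))


# Made-up module names and sample counts.
SPECS = [("fastqc", 300), ("samtools", 100), ("picard", 200)]


def run_eager(specs):
    # All modules are alive at once: memory grows with the sum over modules.
    modules = [FakeModule(name, n) for name, n in specs]
    return [m.summary() for m in modules]


def run_lazy(specs):
    # One module alive at a time: memory is bounded by the largest module.
    summaries = []
    for name, n in specs:
        module = FakeModule(name, n)
        summaries.append(module.summary())
        del module  # drop the last reference before building the next module
    return summaries


def peak_bytes(func, specs):
    """Return func's result and the peak traced allocation in bytes."""
    tracemalloc.start()
    result = func(specs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    _, eager_peak = peak_bytes(run_eager, SPECS)
    _, lazy_peak = peak_bytes(run_lazy, SPECS)
    print(f"eager peak: {eager_peak} bytes, lazy peak: {lazy_peak} bytes")
```

Both variants produce the same summaries. The lazy one only works if nothing else, such as a global list or a module-level field, still holds a reference to the finished module.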