New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory requirement of MultiQC main functions #2515
Conversation
@@ -283,3 +301,57 @@ def _replace(obj): | |||
return obj | |||
|
|||
return _replace(data) | |||
|
|||
|
|||
def compress_number_lists_for_json(obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing!
@@ -252,7 +267,10 @@ def multiqc_dump_json(report): | |||
elif s == "report": | |||
d = {f"{s}_{k}": getattr(report, k)} | |||
if d: | |||
dump_json(d, ensure_ascii=False) # Test that exporting to JSON works | |||
with open(os.devnull, "wt") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice trick!
""" | ||
Recursively replace NaNs and Infinities with None | ||
""" | ||
# Do checking in order of likelyhood of occurence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
# Stream to an in-memory buffer rather than compressing the big string | ||
# at once. This saves memory. | ||
buffer = io.BytesIO() | ||
with gzip.open(buffer, "wt", encoding="utf-8", compresslevel=9) as gzip_buffer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More streaming! Great
@@ -1398,13 +1398,15 @@ def include_file(name, fdir=tmp_dir, b64=False): | |||
|
|||
# Use jinja2 to render the template and overwrite | |||
config.analysis_dir = [os.path.realpath(d) for d in config.analysis_dir] | |||
report_output = j_template.render(report=report, config=config) | |||
report_output_generator = j_template.generate(report=report, config=config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great improvements, thanks again @rhpvorderman. I simply love reading your code - I'm learning from every trick you implement. And really appreciate that you consider possible edge cases (even if they are not covered by tests). I'll merge the PR right now.
The bulk of the memory is used by the modules.
Yeah, that's the next priority. Keeping parsed data in local variables instead of class attributes in module objects is the first step. Then I'll look into iteratively looping through modules and dropping them. We still want to keep the module objects for the interactive use in notebooks, but for standard command line runs, it's not necessary.
And love the use of arrays. So many more places where they can optimize memory usage. Will look into that more when we start exploring Apache Arrow for intermediate data.
Thanks for the compliments! It has been a pleasure contributing to MultiQC!
Great! Arrays have a very nice interface that allows reading from all sorts of buffers, including files. |
This PR fixes some of the areas where a lot of data was stored in memory unnecessarily. A lot of the patterns involved writing to an in-memory block and then dumping it at once. This PR streams in these places.
To make the multiqc plot data more scalable, every time a plot is generated the data is saved as an
array.array
where possible. This approximately halves the memory requirements for the plot data. You could argue whether this particular save is worth it. If the in-memory plot data is 125 MB, then it saves about 60 megabytes. Not exactly a showstopper. On the other hand, very little code is needed to enact this change.On the test data provided via the google drive, the PR saves about 100 MiB of memory. This is not a big deal.
The bulk of the memory is used by the modules. The reason why this is the case is that some modules store their data as an object in during class initialization. The data remains alive as long as the module object remains alive. This is not desirable.
A solution would be to refactor MultiQC to build the report module by module (using an iterator) and in that way discard module objects that are no longer used. That way the memory usage is no longer determined by adding all modules together, but by the worst performing module.
The worst performing module in the test data set is no doubt FastQC. Which parses all the reports and stores all the data in memory leading to an enormous amount of memory being allocated.