Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory requirement of MultiQC main functions #2515

Merged
merged 10 commits into from May 2, 2024

Conversation

rhpvorderman
Copy link
Contributor

  • This comment contains a description of changes (with reason)

This PR fixes some of the areas where a lot of data was stored in memory unnecessarily. A lot of the patterns involved writing to an in-memory block and then dumping it at once. This PR streams in these places.

To make the multiqc plot data more scalable, every time a plot is generated the data is saved as an array.array where possible. This approximately halves the memory requirements for the plot data. You could argue whether this particular save is worth it. If the in-memory plot data is 125 MB, then it saves about 60 megabytes. Not exactly a showstopper. On the other hand, very little code is needed to enact this change.

On the test data provided via the google drive, the PR saves about 100 MiB of memory. This is not a big deal.

The bulk of the memory is used by the modules. The reason why this is the case is that some modules store their data as an object in during class initialization. The data remains alive as long as the module object remains alive. This is not desirable.
A solution would be to refactor MultiQC to build the report module by module (using an iterator) and in that way discard module objects that are no longer used. That way the memory usage is no longer determined by adding all modules together, but by the worst performing module.

The worst performing module in the test data set is no doubt FastQC. Which parses all the reports and stores all the data in memory leading to an enormous amount of memory being allocated.

@vladsavelyev vladsavelyev self-requested a review April 30, 2024 15:00
@@ -283,3 +301,57 @@ def _replace(obj):
return obj

return _replace(data)


def compress_number_lists_for_json(obj):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing!

@@ -252,7 +267,10 @@ def multiqc_dump_json(report):
elif s == "report":
d = {f"{s}_{k}": getattr(report, k)}
if d:
dump_json(d, ensure_ascii=False) # Test that exporting to JSON works
with open(os.devnull, "wt") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick!

"""
Recursively replace NaNs and Infinities with None
"""
# Do checking in order of likelyhood of occurence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

# Stream to an in-memory buffer rather than compressing the big string
# at once. This saves memory.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8", compresslevel=9) as gzip_buffer:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More streaming! Great

@@ -1398,13 +1398,15 @@ def include_file(name, fdir=tmp_dir, b64=False):

# Use jinja2 to render the template and overwrite
config.analysis_dir = [os.path.realpath(d) for d in config.analysis_dir]
report_output = j_template.render(report=report, config=config)
report_output_generator = j_template.generate(report=report, config=config)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@vladsavelyev vladsavelyev self-requested a review May 2, 2024 00:23
Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great improvements, thanks again @rhpvorderman. I simply love reading your code - I'm learning from every trick you implement. And really appreciate that you consider possible edge cases (even if they are not covered by tests). I'll merge the PR right now.

The bulk of the memory is used by the modules.

Yeah, that's the next priority. Keeping parsed data in local variables instead of class attributes in module objects is the first step. Then I'll look into iteratively looping through modules and dropping them. We still want to keep the module objects for the interactive use in notebooks, but for standard command line runs, it's not necessary.

And love the use of arrays. So many more places where they can optimize memory usage. Will look into that more when we start exploring Apache Arrow for intermediate data.

@vladsavelyev vladsavelyev merged commit 744e2a8 into MultiQC:main May 2, 2024
6 checks passed
@rhpvorderman rhpvorderman deleted the memoryreduce branch May 2, 2024 13:42
@rhpvorderman
Copy link
Contributor Author

Thanks for the compliments! It has been a pleasure contributing to MultiQC!

And love the use of arrays. So many more places where they can optimize memory usage. Will look into that more when we start exploring Apache Arrow for intermediate data.

Great! Arrays have a very nice interface that allows reading from all sorts of buffers, including files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants