Reduce memory requirement of MultiQC main functions #2515

rhpvorderman · 2024-04-30T07:06:07Z

This comment contains a description of changes (with reason)

This PR fixes some of the areas where a lot of data was stored in memory unnecessarily. A lot of the patterns involved writing to an in-memory block and then dumping it at once. This PR streams in these places.

To make the multiqc plot data more scalable, every time a plot is generated the data is saved as an array.array where possible. This approximately halves the memory requirements for the plot data. You could argue whether this particular save is worth it. If the in-memory plot data is 125 MB, then it saves about 60 megabytes. Not exactly a showstopper. On the other hand, very little code is needed to enact this change.

On the test data provided via the google drive, the PR saves about 100 MiB of memory. This is not a big deal.

The bulk of the memory is used by the modules. The reason why this is the case is that some modules store their data as an object in during class initialization. The data remains alive as long as the module object remains alive. This is not desirable.
A solution would be to refactor MultiQC to build the report module by module (using an iterator) and in that way discard module objects that are no longer used. That way the memory usage is no longer determined by adding all modules together, but by the worst performing module.

The worst performing module in the test data set is no doubt FastQC. Which parses all the reports and stores all the data in memory leading to an enormous amount of memory being allocated.

…ce memory

vladsavelyev · 2024-05-02T00:01:49Z

multiqc/utils/util_functions.py

@@ -283,3 +301,57 @@ def _replace(obj):
        return obj

    return _replace(data)
+
+
+def compress_number_lists_for_json(obj):


This is amazing!

vladsavelyev · 2024-05-02T00:03:01Z

multiqc/utils/util_functions.py

@@ -252,7 +267,10 @@ def multiqc_dump_json(report):
                elif s == "report":
                    d = {f"{s}_{k}": getattr(report, k)}
                if d:
-                    dump_json(d, ensure_ascii=False)  # Test that exporting to JSON works
+                    with open(os.devnull, "wt") as f:


Nice trick!

vladsavelyev · 2024-05-02T00:04:54Z

multiqc/utils/util_functions.py

+        """
+        Recursively replace NaNs and Infinities with None
+        """
+        # Do checking in order of likelyhood of occurence


vladsavelyev · 2024-05-02T00:09:19Z

multiqc/utils/report.py

+    # Stream to an in-memory buffer rather than compressing the big string
+    # at once. This saves memory.
+    buffer = io.BytesIO()
+    with gzip.open(buffer, "wt", encoding="utf-8", compresslevel=9) as gzip_buffer:


More streaming! Great

vladsavelyev · 2024-05-02T00:09:48Z

multiqc/multiqc.py

@@ -1398,13 +1398,15 @@ def include_file(name, fdir=tmp_dir, b64=False):

        # Use jinja2 to render the template and overwrite
        config.analysis_dir = [os.path.realpath(d) for d in config.analysis_dir]
-        report_output = j_template.render(report=report, config=config)
+        report_output_generator = j_template.generate(report=report, config=config)


vladsavelyev

Really great improvements, thanks again @rhpvorderman. I simply love reading your code - I'm learning from every trick you implement. And really appreciate that you consider possible edge cases (even if they are not covered by tests). I'll merge the PR right now.

The bulk of the memory is used by the modules.

Yeah, that's the next priority. Keeping parsed data in local variables instead of class attributes in module objects is the first step. Then I'll look into iteratively looping through modules and dropping them. We still want to keep the module objects for the interactive use in notebooks, but for standard command line runs, it's not necessary.

And love the use of arrays. So many more places where they can optimize memory usage. Will look into that more when we start exploring Apache Arrow for intermediate data.

rhpvorderman · 2024-05-03T07:14:02Z

Thanks for the compliments! It has been a pleasure contributing to MultiQC!

And love the use of arrays. So many more places where they can optimize memory usage. Will look into that more when we start exploring Apache Arrow for intermediate data.

Great! Arrays have a very nice interface that allows reading from all sorts of buffers, including files.

rhpvorderman added 8 commits April 29, 2024 22:15

Write test json dump to /dev/null decrease allocated memory

3862225

Generate the report block by block rather than one big string to redu…

8cf4735

…ce memory

Save memory by using streaming compression

9ac9f29

Compress in-memory plot data using array.array

db78ba9

Use replace_nan to prevent errors

d9edc24

replace_nan should also replace infinities

936c05c

Move callable to check to the JSON encoder

80ac38f

Fix error where gzip stream was not properly closed

74a41a5

rhpvorderman mentioned this pull request Apr 30, 2024

Reduce MultiQC memory consumption #2517

Open

7 tasks

vladsavelyev added the core: back end label Apr 30, 2024

vladsavelyev added this to the MultiQC v1.22: Pydantic milestone Apr 30, 2024

Merge branch 'main' into memoryreduce

f308804

vladsavelyev self-requested a review April 30, 2024 15:00

vladsavelyev reviewed May 2, 2024

View reviewed changes

Merge branch 'main' into memoryreduce

dffae79

vladsavelyev self-requested a review May 2, 2024 00:23

vladsavelyev approved these changes May 2, 2024

View reviewed changes

vladsavelyev merged commit 744e2a8 into MultiQC:main May 2, 2024
6 checks passed

rhpvorderman deleted the memoryreduce branch May 2, 2024 13:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce memory requirement of MultiQC main functions #2515

Reduce memory requirement of MultiQC main functions #2515

rhpvorderman commented Apr 30, 2024

vladsavelyev May 2, 2024

vladsavelyev May 2, 2024

vladsavelyev May 2, 2024

vladsavelyev May 2, 2024

vladsavelyev May 2, 2024

vladsavelyev left a comment

rhpvorderman commented May 3, 2024

Reduce memory requirement of MultiQC main functions #2515

Reduce memory requirement of MultiQC main functions #2515

Conversation

rhpvorderman commented Apr 30, 2024

vladsavelyev May 2, 2024

Choose a reason for hiding this comment

vladsavelyev May 2, 2024

Choose a reason for hiding this comment

vladsavelyev May 2, 2024

Choose a reason for hiding this comment

vladsavelyev May 2, 2024

Choose a reason for hiding this comment

vladsavelyev May 2, 2024

Choose a reason for hiding this comment

vladsavelyev left a comment

Choose a reason for hiding this comment

rhpvorderman commented May 3, 2024