Reduce Sequali and (slightly) FastQC memory footprint #2516
For FastQC, all I have done is delete the report data it stores for aggregation, which takes up a huge amount of memory. Any modules that run after the FastQC module, and any other MultiQC data, can then use this memory.
For Sequali, the changes are much more substantial. I thought hard about using clever methods, but these were always going to complicate the code a lot. Simply loading everything into memory and aggregating later is so much simpler.
So I did the same as for FastQC: the data is no longer stored as a class variable but as a normal variable that is passed to the class functions. This accomplishes the same thing as in FastQC: once the variable goes out of scope, the memory can be used again. On top of that, I added a pruning function that removes all data not used by MultiQC from each sample's JSON dictionary immediately after loading. This pruning saves a massive amount of memory. According to memray, it reduces the memory used by Sequali from 600+ MiB to just 150 MiB for 800 reports!
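To illustrate the idea, here is a minimal sketch of pruning on load combined with passing the data as a local variable. It is not the code in this PR: the names `KEPT_KEYS`, `prune_sample`, `load_pruned`, `aggregate` and `run`, and the JSON keys `summary` and `total_reads`, are hypothetical and only stand in for whatever the module actually reads.

```python
import json
from typing import Any, Dict, Iterable

# Hypothetical whitelist: the real set of keys MultiQC reads from a
# Sequali report is not reproduced here.
KEPT_KEYS = frozenset({"summary"})


def prune_sample(sample: Dict[str, Any], kept: Iterable[str] = KEPT_KEYS) -> Dict[str, Any]:
    """Return a new dict holding only the top-level entries named in `kept`."""
    kept = set(kept)
    return {key: value for key, value in sample.items() if key in kept}


def load_pruned(raw_json: str) -> Dict[str, Any]:
    """Parse one report and prune it straight away.

    The full parsed dict is only referenced inside this function, so it can
    be freed as soon as the function returns.
    """
    return prune_sample(json.loads(raw_json))


def aggregate(samples: Dict[str, Dict[str, Any]]) -> Dict[str, int]:
    """Stand-in for aggregation: takes the data as an argument instead of
    reading it from an attribute on the module object."""
    return {name: data["summary"]["total_reads"] for name, data in samples.items()}


def run(reports: Dict[str, str]) -> Dict[str, int]:
    # `samples` is a local variable, so nothing keeps it alive after return.
    samples = {name: load_pruned(text) for name, text in reports.items()}
    return aggregate(samples)
```

Because only the small aggregated result is returned from `run`, the per-sample dictionaries become unreachable when it finishes, which is the same effect as the FastQC change.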
I did not add `Sequali:` in front of the title, as the v1.22 release, which contains Sequali, has not been released yet.