
Faster JSON dumping. #2518

Open
rhpvorderman opened this issue Apr 30, 2024 · 1 comment

Comments

@rhpvorderman
Contributor

Description of feature

As suggested in #1920, there might be possibilities to speed up the JSON serialization. Unfortunately, many of the "fast" JSON libraries do not support streaming to a file, so using Python's native json module seems the best option.

I profiled the current MultiQC and it spends a fair bit of time (roughly 30%) on dumping the JSON. Essentially it is done three times:

  • Once to generate the JSON that is embedded in the HTML.
  • Once for each report section, to check whether it is serializable.
  • Once to write all the report sections to the data file.
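Profiling like this can be reproduced with the standard library. Below is a minimal sketch: the `payload` dict is a hypothetical stand-in for MultiQC's report data, and the loop mimics the three dump passes listed above.

```python
import cProfile
import io
import json
import pstats

# Hypothetical payload standing in for the real MultiQC report data.
payload = {"sections": [{"id": i, "values": list(range(100))} for i in range(100)]}

profiler = cProfile.Profile()
profiler.enable()
for _ in range(3):  # mimic the three separate JSON dumps
    json.dumps(payload)
profiler.disable()

# Capture the profiler output to see where the time goes.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Running this against the real report object (rather than the toy `payload` above) is how the ~30% figure can be checked.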

Technically it could be possible to create the JSON dump only once, as an in-memory gzip blob. That blob can be base64-encoded for the embedded HTML, and written out decompressed to the data file. However, that loses the ability to selectively truncate misbehaving reports for the data file. Using the --no-data-dir option already makes sure the dump is only done once.
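The single-dump idea above can be sketched as follows. This is not MultiQC's implementation; `report_data` and the file path are hypothetical, and the point is only that both outputs derive from one `json.dumps` call.

```python
import base64
import gzip
import json
import tempfile

# Hypothetical report payload standing in for MultiQC's report sections.
report_data = {"general_stats": {"sample_1": {"reads": 1_000_000}}}

# Serialize once, keep only the compressed blob in memory.
raw = json.dumps(report_data).encode("utf-8")
blob = gzip.compress(raw)

# Variant 1: base64-encode the gzip blob for embedding in the HTML.
embedded = base64.b64encode(blob).decode("ascii")

# Variant 2: decompress the same blob and write it to the data file.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as fh:
    fh.write(gzip.decompress(blob))
    data_path = fh.name
```

The trade-off mentioned above applies: because the blob is opaque once compressed, a misbehaving section can no longer be dropped from the data file after the fact.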

In general I think it is not worth the effort, as it is a simple CPU-bound problem and not much of an issue in the context of very extensive workflows. With the --no-data-dir option the speed is already close to optimal. I just want to report my findings here. If I find a JSON library that can actually dump JSON faster while streaming to a file, I will report it here.

@vladsavelyev
Member

vladsavelyev commented Apr 30, 2024

Thanks for putting up an issue, it helps to have it written up in a structured way!

We actually have on the roadmap replacing JSON as an intermediate format for data with something like Parquet: #1256

So we will be looking into this more when we start working on that. Agreed that writing JSON three times is suboptimal and can be improved.
