What to measure during benchmarking? #15

Open
JackKelly opened this issue Sep 27, 2023 · 1 comment

Comments

@JackKelly (Collaborator) commented Sep 27, 2023

The plan is to implement a benchmarking tool which automatically runs a suite of "Zarr workloads" across a range of compute platforms, storage media, chunk sizes, and Zarr implementations.

What would we like to measure for each workload?

Existing benchmarking tools measure only the runtime of each workload. That doesn't feel sufficient for Zarr, because two of our main questions during benchmarking are whether the Zarr implementation can saturate the IO subsystem, and how much CPU and RAM it needs to do so.

I'd propose measuring the following parameters each time each workload is run:

  • Total execution time of the workload
  • Total bytes read/written (disk and network)
  • Total IO operations
  • Total bytes in the final numpy array
  • Average CPU utilization (per CPU)
  • Max RAM usage during execution of the workload
  • CPU cache hit ratio
(Each run would also record metadata about the environment: the compute platform, storage media, chunk sizes, Zarr implementation name and version, etc.)
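
For concreteness, here is a rough sketch (not code from this repo) of how most of these per-workload totals could be gathered with psutil plus the standard library. The `measure_workload` function and `workload_fn` argument are made-up names, the disk and network counters it reads are system-wide rather than per-process, and the CPU cache hit ratio is omitted because psutil doesn't expose it (that would need something like `perf stat`):

```python
import resource  # Note: not available on Windows.
import time

import psutil


def measure_workload(workload_fn):
    """Run `workload_fn` and return a dict of per-workload totals.

    Sketch only: the disk/network counters are system-wide (not
    per-process), so other processes on the machine will pollute them.
    """
    # psutil.disk_io_counters() can return None on some platforms.
    disk_before = psutil.disk_io_counters()
    net_before = psutil.net_io_counters()
    psutil.cpu_percent(percpu=True)  # First call resets the per-CPU counters.
    t0 = time.perf_counter()

    result = workload_fn()  # e.g. a function that reads a Zarr array.

    elapsed_secs = time.perf_counter() - t0
    avg_cpu = psutil.cpu_percent(percpu=True)  # Average per CPU since reset.
    disk_after = psutil.disk_io_counters()
    net_after = psutil.net_io_counters()

    return {
        "total_secs": elapsed_secs,
        "disk_bytes_read": disk_after.read_bytes - disk_before.read_bytes,
        "disk_bytes_written": disk_after.write_bytes - disk_before.write_bytes,
        "disk_io_ops": (disk_after.read_count + disk_after.write_count)
                       - (disk_before.read_count + disk_before.write_count),
        "net_bytes_recv": net_after.bytes_recv - net_before.bytes_recv,
        "net_bytes_sent": net_after.bytes_sent - net_before.bytes_sent,
        "result_nbytes": getattr(result, "nbytes", None),  # Final numpy array size.
        "avg_cpu_percent_per_cpu": avg_cpu,
        # Peak RSS: KiB on Linux, bytes on macOS.
        "max_rss": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
    }
```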

I had previously gotten over-excited and started thinking about capturing a full "trace" during the execution of each workload, e.g. capturing a timeseries of IO utilization every 100 milliseconds. This might be useful, but it makes the benchmarking code rather more complex, and it may not tell us much more than the per-workload totals do. Some benchmark workloads might also run for less than 100 ms, and psutil's documentation states that some of its counters aren't reliable when polled more than 10 times a second.
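
To make concrete what a trace would involve, here is a minimal sketch of a background sampler (the `IOTraceSampler` class is invented for illustration, not something in this repo) that polls psutil's system-wide disk counters on a timer; even this simple version adds a thread and shared mutable state to every workload run:

```python
import threading
import time

import psutil


class IOTraceSampler:
    """Sketch of a background sampler that records system-wide disk IO
    counters on a timer (no faster than ~10 Hz, per the psutil caveat
    mentioned above)."""

    def __init__(self, interval_secs: float = 0.1):
        self.interval_secs = interval_secs
        self.samples = []  # (timestamp, cumulative_read_bytes, cumulative_write_bytes)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._sample_loop, daemon=True)

    def _sample_loop(self):
        while not self._stop.is_set():
            counters = psutil.disk_io_counters()
            self.samples.append(
                (time.time(), counters.read_bytes, counters.write_bytes)
            )
            self._stop.wait(self.interval_secs)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
```

Usage would be something like `with IOTraceSampler() as sampler: workload_fn()`, after which `sampler.samples` holds the timeseries.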

What do you folks think? Do we need to record a full "trace" during each workload? Or is it sufficient to just capture totals per workload? Are there any changes you'd make to the list of parameters I proposed above?

@jbms (Contributor) commented Sep 28, 2023

I think all of those metrics are helpful. We can start implementing specific benchmark sequences corresponding to particular synthetic workloads and, separately, add the metrics that we record; those two efforts are basically orthogonal, so we don't need to block one on the other.

@joshmoore joshmoore transferred this issue from zarr-developers/zarr-python Oct 2, 2023
JackKelly added a commit that referenced this issue Oct 16, 2023
JackKelly added a commit that referenced this issue Oct 17, 2023
Perf counters are now much more modular.

I have implemented Disk IO perf counters.

#4
#15