What to measure during benchmarking? #15

Open
JackKelly opened this issue Sep 27, 2023 · 1 comment

Comments

@JackKelly (Collaborator) commented Sep 27, 2023

The plan is to implement a benchmarking tool which automatically runs a suite of "Zarr workloads" across a range of compute platforms, storage media, chunk sizes, and Zarr implementations.

What would we like to measure for each workload?

Existing benchmarking tools measure only the runtime of each workload. That doesn't feel sufficient for Zarr, because two of our main questions during benchmarking are whether the Zarr implementation can saturate the IO subsystem, and how much CPU and RAM it needs to do so.

I'd propose measuring the following parameters each time each workload is run:

  • Total execution time of the workload
  • Total bytes read/written (disk and network)
  • Total IO operations
  • Total bytes in the final numpy array
  • Average CPU utilization (per CPU)
  • Max RAM usage during execution of the workload
  • CPU cache hit ratio
(Each run would also record metadata about the environment: the compute platform, storage media, chunk sizes, Zarr implementation name and version, etc.)
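
For concreteness, here is a rough sketch (not code from this repo) of how most of these per-workload totals could be gathered with psutil plus the standard library. The `measure_workload` function and `workload_fn` argument are made-up names, the disk and network counters it reads are system-wide rather than per-process, and the CPU cache hit ratio is omitted because psutil doesn't expose it (that would need something like `perf stat`):

```python
import resource  # Note: not available on Windows.
import time

import psutil


def measure_workload(workload_fn):
    """Run `workload_fn` and return a dict of per-workload totals.

    Sketch only: the disk/network counters are system-wide (not
    per-process), so other processes on the machine will pollute them.
    """
    # psutil.disk_io_counters() can return None on some platforms.
    disk_before = psutil.disk_io_counters()
    net_before = psutil.net_io_counters()
    psutil.cpu_percent(percpu=True)  # First call resets the per-CPU counters.
    t0 = time.perf_counter()

    result = workload_fn()  # e.g. a function that reads a Zarr array.

    elapsed_secs = time.perf_counter() - t0
    avg_cpu = psutil.cpu_percent(percpu=True)  # Average per CPU since reset.
    disk_after = psutil.disk_io_counters()
    net_after = psutil.net_io_counters()

    return {
        "total_secs": elapsed_secs,
        "disk_bytes_read": disk_after.read_bytes - disk_before.read_bytes,
        "disk_bytes_written": disk_after.write_bytes - disk_before.write_bytes,
        "disk_io_ops": (disk_after.read_count + disk_after.write_count)
                       - (disk_before.read_count + disk_before.write_count),
        "net_bytes_recv": net_after.bytes_recv - net_before.bytes_recv,
        "net_bytes_sent": net_after.bytes_sent - net_before.bytes_sent,
        "result_nbytes": getattr(result, "nbytes", None),  # Final numpy array size.
        "avg_cpu_percent_per_cpu": avg_cpu,
        # Peak RSS: KiB on Linux, bytes on macOS.
        "max_rss": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
    }
```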

I had previously gotten over-excited and started thinking about capturing a full "trace" during the execution of each workload, e.g. capturing a timeseries of IO utilization every 100 milliseconds. This might be useful, but it makes the benchmarking code rather more complex, and it may not tell us much more than the per-workload totals do. Some benchmark workloads might also run for less than 100 ms, and psutil's documentation states that some of its counters aren't reliable when polled more than 10 times a second.
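
To make concrete what a trace would involve, here is a minimal sketch of a background sampler (the `IOTraceSampler` class is invented for illustration, not something in this repo) that polls psutil's system-wide disk counters on a timer; even this simple version adds a thread and shared mutable state to every workload run:

```python
import threading
import time

import psutil


class IOTraceSampler:
    """Sketch of a background sampler that records system-wide disk IO
    counters on a timer (no faster than ~10 Hz, per the psutil caveat
    mentioned above)."""

    def __init__(self, interval_secs: float = 0.1):
        self.interval_secs = interval_secs
        self.samples = []  # (timestamp, cumulative_read_bytes, cumulative_write_bytes)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._sample_loop, daemon=True)

    def _sample_loop(self):
        while not self._stop.is_set():
            counters = psutil.disk_io_counters()
            self.samples.append(
                (time.time(), counters.read_bytes, counters.write_bytes)
            )
            self._stop.wait(self.interval_secs)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
```

Usage would be something like `with IOTraceSampler() as sampler: workload_fn()`, after which `sampler.samples` holds the timeseries.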

What do you folks think? Do we need to record a full "trace" during each workload? Or is it sufficient to just capture totals per workload? Are there any changes you'd make to the list of parameters I proposed above?

@jbms (Contributor) commented Sep 28, 2023

I think all of those metrics are helpful. We can start implementing specific benchmark sequences corresponding to particular synthetic workloads and, separately, add the metrics that we record; those two efforts are basically orthogonal, so we don't need to block one on the other.

@joshmoore joshmoore transferred this issue from zarr-developers/zarr-python Oct 2, 2023
JackKelly added a commit that referenced this issue Oct 16, 2023
JackKelly added a commit that referenced this issue Oct 17, 2023
Perf counters are now much more modular.

I have implemented Disk IO perf counters.

#4
#15