
Feature request: Measure IO bandwidth & latency #173

Open
garloff opened this issue Mar 22, 2024 · 7 comments
Assignees
Labels
enhancement (New feature or request), Ops (Issues or pull requests relevant for Team 3: Ops Tooling)

Comments

garloff (Contributor) commented Mar 22, 2024

We could do something like
fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12
and report (average) bandwidth, IOPS, and the percentage of I/O latencies above 10 ms.
The results could end up in InfluxDB/Grafana (and, of course, be reported to the console).
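A sketch of how the share of I/Os above 10 ms could be pulled from fio's human-readable output. The sample summary line below is made up for illustration (real values come from the fio run above); the grep/sed pipeline just picks the percentage out of the 10 ms latency bucket:

```shell
# Illustrative fio latency-bucket line; a real one is produced by the
# fio command above (the bucket percentages here are hypothetical).
SAMPLE='  lat (msec)   : 2=0.02%, 4=0.05%, 10=1.34%, 20=0.12%'
# Extract the percentage of I/Os that landed in the 10 ms bucket.
PCT=$(echo "$SAMPLE" | grep ', 10=' | sed 's/^.*, 10=\([0-9.]*\)%.*$/\1/')
echo "$PCT"   # -> 1.34
```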

garloff added the enhancement and Ops labels Mar 22, 2024
garloff self-assigned this Mar 22, 2024
garloff added a commit that referenced this issue Mar 24, 2024
This addresses issue #173.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
garloff added a commit that referenced this issue Mar 25, 2024
* Run fio benchmark (disk BW, IOPS, Lat>10ms)
   This addresses issue #173.
* Allow a bit more time to assign names to volumes.
* Output Disk bandwidth stats.
* Add option -M to run disk benchmarks.
* 1.106. Fix calculation of setup & test time, adjust for bench.
   In particular, we had not counted the PI benchmark toward testing time.
   This is fixed now (and the fio disk bench is also assigned to testing
   time).
   The calculation of the maximum cycle time now depends on both
   the PI bench and the disk bench activation.
* Fix waiting for fio and detecting success.
* Don't output stats name if no stats are available.
* Fix logging FIO results to logfile.
* Scale IOPS to kIOPS. Rename Grafana fio labels.
* Log fio output for logfile. Remove + for fioLat10ms+.
* Add fio outputs to benchmark data in dashboard.
* Update output of `-h` (help) into README.md
* A word on benchmarks.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
Nils98Ar (Member) commented Mar 26, 2024

@garloff Would you rather use the smallest mandatory SSD flavor via export JHFLAVOR="SCS-2V-4-20s" in the run_*.sh script, or create a smaller one (e.g. SCS-1V-2-10s) and use that?

Maybe we should even measure both volumes and local storage performance in the future?

Currently our mean value for fioLat10ms with a Cinder volume root disk is 1.49, which is too high for etcd if I understood you correctly.

Nils98Ar (Member) commented Mar 26, 2024

Apparently using an SSD flavor for the jumphost via JHFLAVOR is not enough, and it still uses a volume as the root disk?

At least fioLat10ms does not change, but it does when I create an SSD-flavor instance manually and run the command there:

debian@test-ssd:~$ BENCH=$(cd /tmp; fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12; rm test.?.? 2>/dev/null)
debian@test-ssd:~$ echo "$BENCH" | grep '  lat (msec)' | grep ', 10=' | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/'                               
0.01

Compared to a volume root disk instance:

debian@test-ceph:~$ BENCH=$(cd /tmp; fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12; rm test.?.? 2>/dev/null)
debian@test-ceph:~$ echo "$BENCH" | grep '  lat (msec)' | grep ', 10=' | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/'                              
1.34

What should I do to make the jumphost use the nova disk?

garloff (Contributor) commented Apr 26, 2024

With ~1.5% of writes above 10ms latency, you'll see some spurious leader changes with etcd. Probably not yet breaking it, but not very robust either.

For the JumpHosts, we currently create a volume manually that we use for booting. We don't do this for the normal VMs (although they do get a volume via nova for diskless flavors). I could add an option not to do this, so you can measure local disk performance.
We could also add more disk measurements by running fio on some of the normal VMs as well, not just the JumpHosts.
(But I don't think you want many VMs created with the SSD flavors, so we'd still need this local disk option for the JumpHosts.)
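To make the etcd concern above concrete, here is a hedged sketch of a threshold check. The 1% cutoff and the function name `check_lat10ms` are my own assumptions for illustration, not part of the tooling:

```shell
# Hypothetical check: flag results where more than 1% of I/Os took longer
# than 10 ms (etcd wants its storage writes to stay mostly below that).
check_lat10ms() {
  # $1 = percentage of I/Os above 10 ms, as parsed from fio output
  if awk -v p="$1" 'BEGIN { exit !(p > 1.0) }'; then
    echo "too slow for etcd"
  else
    echo "ok"
  fi
}
check_lat10ms 0.01   # -> ok
check_lat10ms 1.34   # -> too slow for etcd
```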

Nils98Ar (Member) commented

If I understood correctly, with -Z from #184 you can switch the disk measurements from a volume to the local storage disk? Thank you for that!

Would there be an easy way to also implement measuring both volume and local storage disk?

garloff (Contributor) commented Apr 29, 2024

With -Z you disable the manual creation of a volume for the JumpHosts to boot from. This means that you will get whatever the JumpHost flavor specifies:

  • An automatically allocated (networked) Cinder volume for diskless flavors (obviously not what you want)
  • A "local" disk for flavors with a root disk. Note that "local" might be not-so-local in setups where local disks are rbd-backed; for flavors with the s suffix (SSD), that should not be the case, though.

garloff (Contributor) commented Apr 29, 2024

As for measuring both:

  • We could install fio also on the normal VMs and run it on a few of them (maybe one per AZ).
  • If you use a different flavor for the VMs vs the JumpHosts, you could measure a different disk performance.
  • If we wanted to avoid zig-zag lines for these cases, we'd have to report these measurements with a different tag to telegraf/influx and draw three additional lines in the dashboard.
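The extra-tag idea from the last bullet could look like this in InfluxDB line protocol. Measurement, tag, and field names below are invented for illustration (the real tooling's names may differ); the point is that a distinct tag per disk type yields separate dashboard series instead of one zig-zagging line:

```shell
# Emit one hypothetical line-protocol record per disk type; the "disk"
# tag keeps the volume and local-disk measurements apart in InfluxDB.
for disk in volume local; do
  printf 'disk_bench,disk=%s fioLat10ms=0.5\n' "$disk"
done
```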

Is this what you want?

Maybe we wait for the next generation health monitor from VP12 before adding another three lines...

Nils98Ar (Member) commented May 2, 2024

Sounds good, but for me it would also be okay to wait for the new health monitor :)
