
Feature request: Measure IO bandwidth & latency #173

Open
garloff opened this issue Mar 22, 2024 · 7 comments
Assignees
Labels
enhancement (New feature or request), Ops (Issues or pull requests relevant for Team 3: Ops Tooling)

Comments

garloff (Contributor) commented Mar 22, 2024

We could do something like
fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12
and report (average) bandwidth, IOPS, and the percentage of I/O latencies above 10 ms.
The results could end up in InfluxDB/Grafana (and, of course, be reported to the console).
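A sketch of how the share of I/Os above 10 ms could be pulled from fio's human-readable output. The sample summary line below is made up for illustration (real values come from the fio run above); the grep/sed pipeline just picks the percentage out of the 10 ms latency bucket:

```shell
# Illustrative fio latency-bucket line; a real one is produced by the
# fio command above (the bucket percentages here are hypothetical).
SAMPLE='  lat (msec)   : 2=0.02%, 4=0.05%, 10=1.34%, 20=0.12%'
# Extract the percentage of I/Os that landed in the 10 ms bucket.
PCT=$(echo "$SAMPLE" | grep ', 10=' | sed 's/^.*, 10=\([0-9.]*\)%.*$/\1/')
echo "$PCT"   # -> 1.34
```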

garloff added the enhancement and Ops labels Mar 22, 2024
garloff self-assigned this Mar 22, 2024
garloff added a commit that referenced this issue Mar 24, 2024
This addresses issue #173.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
garloff added a commit that referenced this issue Mar 25, 2024
* Run fio benchmark (disk BW, IOPS, Lat>10ms)
   This addresses issue #173.
* Allow a bit more time to assign names to volumes.
* Output Disk bandwidth stats.
* Add option -M to run disk benchmarks.
* 1.106. Fix calculation of setup & test time, adjust for bench.
   In particular, we had not counted the PI benchmark toward testing time.
   This is fixed now (and the fio disk bench is also assigned to testing
   time).
   The calculation of the maximum cycle time now depends on both
   the PI bench and the disk bench activation.
* Fix waiting for fio and detecting success.
* Don't output stats name if no stats are available.
* Fix logging FIO results to logfile.
* Scale IOPS to kIOPS. Rename Grafana fio labels.
* Log fio output for logfile. Remove + for fioLat10ms+.
* Add fio outputs to benchmark data in dashboard.
* Update output of `-h` (help) into README.md
* A word on benchmarks.

Signed-off-by: Kurt Garloff <kurt@garloff.de>
Nils98Ar (Member) commented Mar 26, 2024

@garloff Would you rather use the smallest mandatory SSD flavor via export JHFLAVOR="SCS-2V-4-20s" in the run_*.sh script, or create a smaller one (e.g. SCS-1V-2-10s) and use that?

Maybe we should even measure both volumes and local storage performance in the future?

Currently our mean value for fioLat10ms with a Cinder volume root disk is 1.49, which is too high for etcd if I understood you correctly.

Nils98Ar (Member) commented Mar 26, 2024

Apparently using an SSD flavor for the jumphost via JHFLAVOR is not enough, and it still uses a volume as the root disk?

At least fioLat10ms does not change, but it does when I create an SSD-flavor instance manually and run the command there:

debian@test-ssd:~$ BENCH=$(cd /tmp; fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12; rm test.?.? 2>/dev/null)
debian@test-ssd:~$ echo "$BENCH" | grep '  lat (msec)' | grep ', 10=' | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/'                               
0.01

Compared to a volume root disk instance:

debian@test-ceph:~$ BENCH=$(cd /tmp; fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12; rm test.?.? 2>/dev/null)
debian@test-ceph:~$ echo "$BENCH" | grep '  lat (msec)' | grep ', 10=' | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/'                              
1.34

What should I do to make the jumphost use the nova disk?

garloff (Contributor) commented Apr 26, 2024

With ~1.5% of writes above 10ms latency, you'll see some spurious leader changes with etcd. Probably not yet breaking it, but not very robust either.

For the JumpHosts, we currently create a volume manually that we use for booting. We don't do this for the normal VMs (although they do get a volume via nova for diskless flavors). I could add an option not to do this, so you can measure local disk performance.
We could also add more disk measurements by running fio on some of the normal VMs as well, not just the JumpHosts.
(But I don't think you want many VMs created with the SSD flavors, so we'd still need this local disk option for the JumpHosts.)
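To make the etcd concern above concrete, here is a hedged sketch of a threshold check. The 1% cutoff and the function name `check_lat10ms` are my own assumptions for illustration, not part of the tooling:

```shell
# Hypothetical check: flag results where more than 1% of I/Os took longer
# than 10 ms (etcd wants its storage writes to stay mostly below that).
check_lat10ms() {
  # $1 = percentage of I/Os above 10 ms, as parsed from fio output
  if awk -v p="$1" 'BEGIN { exit !(p > 1.0) }'; then
    echo "too slow for etcd"
  else
    echo "ok"
  fi
}
check_lat10ms 0.01   # -> ok
check_lat10ms 1.34   # -> too slow for etcd
```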

Nils98Ar (Member) commented

If I understood correctly, with -Z from #184 you can switch the disk measurements from a volume to the local storage disk? Thank you for that!

Would there be an easy way to also implement measuring both volume and local storage disk?

garloff (Contributor) commented Apr 29, 2024

With -Z you disable the manual creation of a volume for the JumpHosts to boot from. This means that you will get whatever the JumpHost flavor specifies:

  • An automatically allocated (networked) Cinder volume for diskless flavors (obviously not what you want)
  • A "local" disk for flavors with a root disk. Note that "local" might be not-so-local in setups where local disks are rbd-backed; for flavors with the s suffix (SSD), that should not be the case, though.

garloff (Contributor) commented Apr 29, 2024

As for measuring both:

  • We could install fio also on the normal VMs and run it on a few of them (maybe one per AZ).
  • If you use a different flavor for the VMs vs the JumpHosts, you could measure a different disk performance.
  • If we wanted to avoid zig-zag lines for these cases, we'd have to report these measurements with a different tag to telegraf/influx and draw three additional lines in the dashboard.
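The extra-tag idea from the last bullet could look like this in InfluxDB line protocol. Measurement, tag, and field names below are invented for illustration (the real tooling's names may differ); the point is that a distinct tag per disk type yields separate dashboard series instead of one zig-zagging line:

```shell
# Emit one hypothetical line-protocol record per disk type; the "disk"
# tag keeps the volume and local-disk measurements apart in InfluxDB.
for disk in volume local; do
  printf 'disk_bench,disk=%s fioLat10ms=0.5\n' "$disk"
done
```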

Is this what you want?

Maybe we wait for the next generation health monitor from VP12 before adding another three lines...

Nils98Ar (Member) commented May 2, 2024

Sounds good, but for me it would also be okay to wait for the new health monitor :)
