Unexpected change in throughput running z_sub_thr and restarting z_pub_thr #1017

Closed
jackg0 opened this issue May 11, 2024 · 10 comments
Labels: bug (Something isn't working)

jackg0 commented May 11, 2024

Describe the bug

Hi,

Thanks for all the great work on zenoh! We're excited to use this project in various applications.

I am seeing varying bandwidth when running the z_sub_thr example and restarting the z_pub_thr example with a payload size of 1 MiB.

Using the attached annotated stdout for z_sub_thr as an example: on the first run of z_pub_thr 1048576, z_sub_thr reports roughly 2300 msg/s. After starting z_pub_thr a second time, I see a much higher rate of about 4600 msg/s. On further restarts of z_pub_thr, I see 2300 msg/s again.

Is this an issue with the examples, my build of zenoh, or possibly some other issue?

zenoh_z_sub_thr_output.txt

To reproduce

  1. Run ./target/release/examples/z_sub_thr -s 100000 -n 100
  2. Start/restart ./target/release/examples/z_pub_thr 1048576
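
For reference, here is a minimal shell sketch of that restart cycle (the loop count and the 30-second timeout are arbitrary illustration values, not part of the original report):

# Terminal 1: keep the subscriber running
./target/release/examples/z_sub_thr -s 100000 -n 100

# Terminal 2: start the publisher, stop it after 30 s, and restart it a few times
for i in 1 2 3 4 5; do
    timeout 30 ./target/release/examples/z_pub_thr 1048576
    sleep 2
done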

System info

Platform: Docker container running Ubuntu 24.04
CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
zenoh commit: b8dd01d
rustc: 1.75

jackg0 added the bug label on May 11, 2024

jackg0 commented May 11, 2024

Here is another case where the first run of z_pub_thr results in a throughput of ~2800 msg/s according to z_sub_thr, but multiple restarts of z_pub_thr afterwards always have much higher bandwidth of ~5000 msg/s according to z_sub_thr.

zenoh_z_sub_thr_output.1.txt

@YuanYuYuan (Contributor)

Hi @jackg0! The instability possibly comes from the system itself. Could you please try the following (a combined example is sketched after the list):

  1. Set the nice level of the process, e.g. sudo nice -n -20 PROCESS.
  2. Configure the CPU affinity, e.g. taskset -c 0,2 PROCESS_1 and taskset -c 1,3 PROCESS_2.
  3. Check whether other programs are running at the same time.
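
For example, points 1 and 2 can be combined like this (a sketch only; the core numbers 0,2 and 1,3 are placeholders that depend on your CPU topology):

sudo nice -n -20 taskset -c 0,2 ./target/release/examples/z_sub_thr -s 100000 -n 100
sudo nice -n -20 taskset -c 1,3 ./target/release/examples/z_pub_thr 1048576

# Verify the affinity of a running process
taskset -cp $(pgrep -f z_pub_thr)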


jackg0 commented May 12, 2024

Hi @YuanYuYuan, thanks for the quick response and the suggestions!

I've attached two videos to help debug. One with both the z_sub_thr and z_pub_thr processes running with nice level of -20 and another using both nice and taskset.

I also ran top and bmon in separate terminals to confirm system resource usage is as expected.

https://github.com/eclipse-zenoh/zenoh/assets/8713234/b10202d2-a5aa-4fdf-82cc-fc704e9168aa
https://github.com/eclipse-zenoh/zenoh/assets/8713234/10621585-c324-4a0f-91ef-389417917d2c

@YuanYuYuan (Contributor)

Hi @jackg0, I think you could increase the number of messages in each measurement to reduce the variance. This is what I observed on my laptop.

sudo nice -n -20 taskset -c 0,2 ./target/release/examples/z_sub_thr -s 5 -n 5000
sudo nice -n -20 taskset -c 1,3 ./target/release/examples/z_pub_thr 1048576

Repeat 10 times.

Press CTRL-C to quit...
2186.625858047293 msg/s
1976.6553852170493 msg/s
2159.7951844801387 msg/s
2060.118765706755 msg/s
2322.7678741585833 msg/s
Press CTRL-C to quit...
2017.45538633758 msg/s
2048.95934592392 msg/s
2367.598719428168 msg/s
2013.1891291489683 msg/s
2492.418014551374 msg/s
Press CTRL-C to quit...
1945.737483207945 msg/s
1840.192912987812 msg/s
1962.4937566106748 msg/s
1852.0520909223058 msg/s
1989.227550648963 msg/s
Press CTRL-C to quit...
2044.7209231016272 msg/s
2075.8386407413896 msg/s
2078.8020593474657 msg/s
2058.1868709512523 msg/s
2078.3503005702096 msg/s
Press CTRL-C to quit...
2041.9730133957385 msg/s
2062.873456989323 msg/s
2146.1786394368332 msg/s
2036.5213537008829 msg/s
2190.538801541054 msg/s
Press CTRL-C to quit...
2037.434580970634 msg/s
2043.238797955243 msg/s
2040.4466059339704 msg/s
2055.1036757058237 msg/s
2042.4498654281867 msg/s
Press CTRL-C to quit...
2053.5388791367764 msg/s
2083.8771818636915 msg/s
2053.47317582037 msg/s
2066.2914473976557 msg/s
2095.531008961677 msg/s
Press CTRL-C to quit...
2810.2802030513103 msg/s
2064.075646021032 msg/s
2077.619939098062 msg/s
2073.4854773456614 msg/s
2062.563490860657 msg/s
Press CTRL-C to quit...
2447.5786489674642 msg/s
2207.575150459812 msg/s
2038.9994240784958 msg/s
2062.0763335292995 msg/s
2078.794691355887 msg/s
Press CTRL-C to quit...
2114.984155824615 msg/s
2059.924117907853 msg/s
2035.1727948625066 msg/s
2064.2951988374552 msg/s
2028.676008906326 msg/s

To me, the variance is not that high.


jackg0 commented May 13, 2024

Hi @YuanYuYuan, I tried a much larger sample size of 10k, and I also used bmon to measure the bandwidth independently of the examples, ruling out the sample size as the cause. bmon should be accurately reporting the bandwidth over localhost in the videos shared above. Even measured independently of the examples, the bandwidth still changes by a factor of 2.

I also only see the issue when I have a subscriber that doesn't exit after a few samples. In your case, it seems the subscriber exits after 5 samples with -s 5?

@YuanYuYuan (Contributor)

> Here is another case where the first run of z_pub_thr results in a throughput of ~2800 msg/s according to z_sub_thr, but multiple restarts of z_pub_thr afterwards always have much higher bandwidth of ~5000 msg/s according to z_sub_thr.
>
> zenoh_z_sub_thr_output.1.txt

We also observed the same thing, until we realized that setting the CPU affinity is necessary.

Here is another long-running result:

sudo nice -n -20 taskset -c 1,3 ./target/release/examples/z_pub_thr 1048576
sudo nice -n -20 taskset -c 0,2 ./target/release/examples/z_sub_thr -s 100 -n 5000 # repeat five times

sub.log

The variance seems within an acceptable range.


jackg0 commented May 14, 2024

Ok, thank you. I'll keep looking at it on my system; I'm assuming the variance is a system issue. Do you recommend that we always set the CPU affinity when using zenoh? Is that necessary for each subscriber and publisher?

Thanks for the help!

@YuanYuYuan (Contributor)

> Ok, thank you. I'll keep looking at it on my system; I'm assuming the variance is a system issue. Do you recommend that we always set the CPU affinity when using zenoh? Is that necessary for each subscriber and publisher?
>
> Thanks for the help!

It depends on your use case. The fewer CPUs used, the more stable the performance, but this also sacrifices the maximal throughput. (Note that using all CPUs does not necessarily give the optimal throughput, because of the cost of context switching.) Do you frequently require high bandwidth in your scenario?
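
As a rough way to see that context-switch cost on Linux, you could watch the publisher's context-switch counters while it runs (a sketch assuming the sysstat package, which provides pidstat, is installed; any process-inspection tool works):

# Report voluntary/involuntary context switches of the publisher once per second
pidstat -w -p $(pgrep -f z_pub_thr) 1

# Or read the counters straight from procfs
grep ctxt_switches /proc/$(pgrep -f z_pub_thr)/status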


jackg0 commented May 15, 2024

Yes, we're using zenoh in high bandwidth applications right now. I'll make sure to keep this in mind as we add more publishers and subscribers.

Feel free to close this issue - thanks again!

@YuanYuYuan (Contributor)

> Yes, we're using zenoh in high bandwidth applications right now. I'll make sure to keep this in mind as we add more publishers and subscribers.
>
> Feel free to close this issue - thanks again!

Thanks for the information. That brings us to scalability, a topic we are highly interested in. Don't hesitate to ping us if you run into any issues. 😃
