
Actual Report for Reference Setup, Performance, Scalability, and Sizing Guidelines #7905

Open
PhanLe1010 wants to merge 2 commits into master from the 2598-report branch
Conversation

@PhanLe1010 (Contributor Author)

Hi team, I am starting to push the report incrementally, beginning with the Public Cloud - Medium Node Spec report. Hope that we can finish this one soon.

@PhanLe1010 PhanLe1010 force-pushed the 2598-report branch 4 times, most recently from 1fe5d44 to d57e948 on February 9, 2024 05:29
@ejweber (Contributor) left a comment


It may be good to add a section of "whys" at some point. For example, why 1TiB? Why those particular replica scheduling settings? This can help us reason about the testing in the future.

@PhanLe1010 (Contributor Author) commented Feb 9, 2024

> It may be good to add a section of "whys" at some point. For example, why 1TiB? Why those particular replica scheduling settings? This can help us reason about the testing in the future.

Good idea, I will add the explanation (a small sketch of applying these settings follows the list):

  • 1 TB is purely due to cost: we will have many nodes and each node will have a 1 TB disk.
  • Storage Minimal Available Percentage setting: 10%. Since we are using a dedicated disk, we don't need a large storage reserve, as mentioned in the best practices: https://longhorn.io/docs/1.6.0/best-practices/#minimal-available-storage-and-over-provisioning
  • Storage Over Provisioning Percentage setting: 110%:
    • We are planning to fill 15 GB of each 20 GB volume.
    • If we schedule the maximum amount, it would be 1200 GiB, and the actual usage would be (15/20) * 1100 = 825 GiB. This leaves 100 GiB for the 10% Storage Minimal Available Percentage reserve plus some filesystem space overhead.
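For reference, a minimal sketch of applying these two settings with the Kubernetes Python client. It assumes the standard Longhorn setting names (`storage-minimum-available-percentage`, `storage-over-provisioning-percentage`) and the `longhorn.io/v1beta2` Setting API, so verify against the installed Longhorn version before using it:

```python
# Sketch: set the two Longhorn settings discussed above via their Setting CRs.
# Assumes Longhorn's v1beta2 API and the default "longhorn-system" namespace.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
api = client.CustomObjectsApi()

settings = {
    "storage-minimum-available-percentage": "10",
    "storage-over-provisioning-percentage": "110",
}

for name, value in settings.items():
    # Longhorn Setting CRs keep their value in a top-level "value" field.
    api.patch_namespaced_custom_object(
        group="longhorn.io",
        version="v1beta2",
        namespace="longhorn-system",
        plural="settings",
        name=name,
        body={"value": value},
    )
    print(f"set {name} = {value}")
```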

@PhanLe1010 PhanLe1010 force-pushed the 2598-report branch 19 times, most recently from dd2c03e to cf26a63 on February 13, 2024 18:53
@PhanLe1010 PhanLe1010 force-pushed the 2598-report branch 8 times, most recently from fe6d2c9 to 17f0897 on March 14, 2024 21:15
@PhanLe1010 (Contributor Author) commented May 7, 2024

Update:

I am benchmarking the Longhorn control plane. It looks like Longhorn cannot scale past 310 pods (with 310 volumes) in the medium-spec cluster (1 control-plane + 3 worker nodes, EC2 instance type m5zn.2xlarge: 8 vCPUs, 32 GB RAM).

The error event on pending pods:

  Warning  FailedMount         12s (x5 over 9m16s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[www], unattached volumes=[www], failed to process volumes=[]: timed out waiting for the condition

(Grafana screenshot: Longhorn control plane scale test dashboard, captured 2024-05-06.)
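For anyone reproducing this, a quick way to count how many pods are stuck on volume attach/mount is sketched below; it assumes the Kubernetes Python client, and the namespace is a placeholder:

```python
# Sketch: count pods stuck waiting on volume attach/mount (FailedMount events).
# NAMESPACE is a placeholder; point it at wherever the test workload runs.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

NAMESPACE = "default"  # placeholder

stuck = []
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if pod.status.phase != "Pending":
        continue  # only pods still waiting to start can be blocked on mounts
    events = v1.list_namespaced_event(
        NAMESPACE,
        field_selector=f"involvedObject.name={pod.metadata.name},reason=FailedMount",
    )
    if events.items:
        stuck.append(pod.metadata.name)

print(f"{len(stuck)} pod(s) blocked on FailedMount in {NAMESPACE}")
```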

@PhanLe1010 (Contributor Author)

The issue above (#7905 (comment)) is a known issue: #7919.

@PhanLe1010 PhanLe1010 force-pushed the 2598-report branch 7 times, most recently from 022e585 to bc1580e on May 10, 2024 01:03
@PhanLe1010 PhanLe1010 marked this pull request as ready for review May 10, 2024 01:03
@PhanLe1010 PhanLe1010 requested a review from a team as a code owner May 10, 2024 01:03
@PhanLe1010 (Contributor Author) commented May 10, 2024

This PR is ready for review.

I finished the report for the cloud medium-spec cluster (the last piece of information, about max volume size, is coming soon). As discussed in the US team meeting, I decided to leave the reports for the cloud big-spec and bare-metal clusters for the next release and focus on the 1.7.0 backlog. This is the first version of the report and I am producing it manually; for the next version, I will try to add automation to speed up the process.

Thank you for all the helpful feedback!

cc @ejweber @james-munson @shuo-wu @derekbit @innobead

@ejweber (Contributor) left a comment

Sorry for the nitpicky review! I don't think @jillian-maroket will be handling this one, so I paid a bit more attention to it grammatically than I normally would. From a technical perspective, things are looking quite good.

I only made it about halfway through so far. There aren't any dealbreakers for me yet, so please feel free to consider my comments/suggestions, either adopt them or not, and directly resolve the conversation.

**Comment:**
* We choose 10,000 for the EBS disk's IOPS simply because it is the middle value between the gp3 EBS disk's minimum of 3,000 and maximum of 16,000.
* We choose 360 MiB/s for the EBS disk's bandwidth because the m5zn.2xlarge EC2 instance has an EBS bandwidth of 396.25 MiB/s.
If we chose a value bigger than 396.25 MiB/s for the EBS disk's bandwidth, the EC2 instance would not be able to push the EBS disk to that value.
Contributor

NIT: Don't bother if it's too difficult or annoying, but links to where the reader can find this information might be useful.

Contributor Author

I decided to keep the info here since there isn't a single page which explains this behavior; it would require multiple links if we decided to put them here.
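To make the arithmetic behind those two choices explicit, here is a tiny sketch; the 3,170 Mbps instance EBS bandwidth is an assumption inferred from the 396.25 figure quoted above:

```python
# Sketch of the reasoning behind the gp3 IOPS and throughput choices quoted above.
GP3_MIN_IOPS, GP3_MAX_IOPS = 3_000, 16_000
midpoint_iops = (GP3_MIN_IOPS + GP3_MAX_IOPS) // 2   # 9,500; rounded up to 10,000 in the report

instance_ebs_mbps = 3_170                 # assumed EBS bandwidth of m5zn.2xlarge, in megabits/s
instance_ebs_cap = instance_ebs_mbps / 8  # ≈ 396.25 MB/s, the figure quoted above
chosen_throughput = 360                   # MiB/s, deliberately kept below the instance cap

print(midpoint_iops, round(instance_ebs_cap, 2), chosen_throughput)
```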


> Result:
> * Each Kbench pod is able to achieve 386 MiB/s random read bandwidth on its Longhorn volume
> * Total random read bandwidth can be achieved by all 3 Longhorn volumes is 1158
Contributor

Suggested change
> * Total random read bandwidth can be achieved by all 3 Longhorn volumes is 1158
> * Total sequential read bandwidth can be achieved by all 3 Longhorn volumes is 1158

Contributor Author

I think it should be random

Scaling workload from 3 to 6, then 6 to 9, then 9 to 12, then 12 to 15

> Result:
> * At 6 pods, the average random read bandwidth per Longhorn volume is 196 MiB/s. Total random bandwidth is 1176 MiB/s
Contributor

Suggested change
> * At 6 pods, the average random read bandwidth per Longhorn volume is 196 MiB/s. Total random bandwidth is 1176 MiB/s
> * At 6 pods, the average sequential read bandwidth per Longhorn volume is 196 MiB/s. Total random bandwidth is 1176 MiB/s

And below.

Contributor Author

I think it should be random
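For context on the scaling numbers quoted in this thread, a small sketch of the per-volume vs. aggregate arithmetic; it assumes total read bandwidth across the 3 workers saturates near the measured ~1,160 MiB/s:

```python
# Sketch: once aggregate read bandwidth saturates, per-volume bandwidth
# drops roughly as total / pod_count (numbers taken from the results quoted above).
measured = {3: 386, 6: 196}        # pods -> measured MiB/s per volume
saturated_total = 3 * 386          # ≈ 1,158 MiB/s across the 3 worker nodes

for pods in (3, 6, 9, 12, 15):
    per_volume = saturated_total / pods
    suffix = f" (measured: {measured[pods]})" if pods in measured else ""
    print(f"{pods:>2} pods: ~{per_volume:.0f} MiB/s per volume{suffix}")
```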




### Random Read Bandwidth - Stress Tests
Contributor

Stopped here for now. Will continue later.

…elines

longhorn-2598

Signed-off-by: Phan Le <phan.le@suse.com>
@PhanLe1010 (Contributor Author)

Thanks @ejweber, I resolved most of the comments. Looking forward to your next review!

Signed-off-by: Phan Le <phan.le@suse.com>