[IMPROVEMENT] Investigate performance bottleneck in v1 data path #8436

Closed

PhanLe1010 opened this issue Apr 24, 2024 · 19 comments
Assignees
Labels
area/benchmark Performance Benchmark related backport/1.5.6 backport/1.6.2 kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated
Milestone

Comments

@PhanLe1010
Contributor

Is your improvement request related to a feature? Please describe (👍 if you like this request)

Investigate performance bottlenecks in the v1 data path. So far we have identified 2 bottlenecks:

  1. Number of engine-replica connections
  2. Inefficient revision counter logic

We are still investigating whether there are more bottlenecks.

Describe the solution you'd like

Resolve the bottlenecks and improve the performance of the v1 data path.

Additional Context:

This is the test result on equinix metal m3.small.x86, ubuntu 22.04, 5.15.0-101-generic:

[benchmark result screenshots]
@PhanLe1010 PhanLe1010 added require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated kind/improvement Request for improvement of existing function require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Apr 24, 2024
@PhanLe1010 PhanLe1010 self-assigned this Apr 24, 2024
@PhanLe1010 PhanLe1010 added this to the v1.7.0 milestone Apr 24, 2024
@PhanLe1010
Contributor Author

This is potentially the 3rd bottleneck:

cc @shuo-wu @ejweber @derekbit

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) area/benchmark Performance Benchmark related labels Apr 24, 2024
@derekbit
Member

This is potentially the 3rd bottleneck:

cc @shuo-wu @ejweber @derekbit

We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for the data at the caller end to prevent the extra data copy in the two functions?
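
For illustration, a minimal sketch of one way to do that (a hypothetical helper, not the longhorn-engine API): over-allocate the slice and return a sub-slice that starts at the first properly aligned address, so it can be passed to direct IO without an extra copy.

package alignedbuf // hypothetical helper package, for illustration only

import "unsafe"

// allocAligned returns a size-byte slice whose first element sits at an
// address that is a multiple of align (e.g. 4096 for direct IO).
func allocAligned(size, align int) []byte {
	buf := make([]byte, size+align)
	off := int(uintptr(unsafe.Pointer(&buf[0])) % uintptr(align))
	if off != 0 {
		off = align - off
	}
	return buf[off : off+size]
}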

@PhanLe1010
Contributor Author

We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for the data at the caller end to prevent the extra data copy in the two functions?

Agree with this idea. I will give it a try and see if it improves the read speed.

@PhanLe1010
Contributor Author

PhanLe1010 commented Apr 25, 2024

Sidetracking a little bit: do you think there is a danger of integer overflow when we keep increasing the counter over here https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L143 ?

If we keep writing long enough, it might become negative.

@derekbit
Member

Sidetracking a little bit: do you think there is a danger of integer overflow when we keep increasing the counter over here https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L143 ?

If we keep writing long enough, it might become negative.

It is int64. Probably no danger?

@PhanLe1010
Contributor Author

It is int64. Probably no danger?

You are right. If we are writing at a speed of 50k IOPS, it would take 5,849,424 years to overflow. None of us will see that day :)))
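
A quick back-of-the-envelope check of that figure, assuming a constant 50,000 counter increments per second:

package main

import (
	"fmt"
	"math"
)

func main() {
	const iops = 50000 // assumed constant write rate
	seconds := float64(math.MaxInt64) / iops
	years := seconds / (365 * 24 * 3600)
	fmt.Printf("~%.0f years until the int64 revision counter overflows\n", years) // prints ~5849424
}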

@PhanLe1010
Contributor Author

Update: a tiny improvement to the revision counter logic

Current implementation

We are converting the int64 to a string and then to a slice of bytes https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L45-L46

The strconv.FormatInt function seems to eat noticeable CPU time in profiling.

Modification:

Using binary.LittleEndian to encode and decode the int64:

func (r *Replica) writeRevisionCounter(counter int64) error {
	if r.revisionFile == nil {
		return fmt.Errorf("BUG: revision file wasn't initialized")
	}

	// revisionCounterBuf is assumed to be a pre-allocated, block-sized buffer;
	// only its first 8 bytes carry the encoded counter.
	copy(revisionCounterBuf, int64ToBytes(counter))
	_, err := r.revisionFile.WriteAt(revisionCounterBuf, 0)
	if err != nil {
		return errors.Wrap(err, "failed to write to revision counter file")
	}
	return nil
}

// int64ToBytes encodes the counter as 8 little-endian bytes, replacing the
// strconv.FormatInt string conversion.
func int64ToBytes(n int64) []byte {
	bytes := make([]byte, 8)
	binary.LittleEndian.PutUint64(bytes, uint64(n))
	return bytes
}

// bytesToInt64 is the matching decoder for int64ToBytes.
func bytesToInt64(bytes []byte) int64 {
	return int64(binary.LittleEndian.Uint64(bytes))
}
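
As a quick sanity check (illustrative test only, assuming the standard "testing" import), the two helpers round-trip the counter value:

func TestRevisionCounterRoundTrip(t *testing.T) {
	for _, counter := range []int64{0, 1, 12345, -1} {
		if got := bytesToInt64(int64ToBytes(counter)); got != counter {
			t.Fatalf("round trip mismatch: got %d, want %d", got, counter)
		}
	}
}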

Result

A small improvement in write performance, maybe 1-2%.
However, the implementation cost is big: we would need to migrate from the old encoding to the new one.
I think this is not a good idea.

@PhanLe1010
Contributor Author

PhanLe1010 commented Apr 25, 2024

We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for the data at the caller end to prevent the extra data copy in the two functions?

Agree with this idea. I will give it a try and see if it improves the read speed.

@derekbit Looks like the performance gets worse. So we will not do this:

IOPS (Read/Write)
        Random:          45,395 / 19,553
    Sequential:          62,778 / 33,106

Bandwidth in KiB/sec (Read/Write)
        Random:        714,421 / 348,190
    Sequential:        976,918 / 353,571
                                        

Latency in ns (Read/Write)
        Random:        821,898 / 455,709
    Sequential:        814,332 / 456,455

@PhanLe1010
Contributor Author

@shuo-wu This is the CPU usage of strconv.FormatInt that we were talking about in the sync up. It is small but noticeable

Screenshot from 2024-04-25 12-25-24

@PhanLe1010
Contributor Author

PhanLe1010 commented Apr 25, 2024

As discussed with @shuo-wu in the sync up, we reran some of the Broadcom tests. The setup is the same as in our initial Broadcom report in the SUSE data center.

Case 1: With CPU/RAM constraints

  1. Random read IOPs test
    • IOPs: similar
    • CPU usage: similar
  2. Sequential read IOPs test
    • IOPs: similar
    • CPU usage: similar
  3. Random write IOPs test
    • IOPs: increased from 2988 to 3150
    • CPU usage: similar
  4. Sequential write IOPs test
    • IOPs: similar
    • CPU usage: similar

Case 2: No CPU/RAM constraints and the workload is NOT rate-limited

With master-head engine:

IOPS (Read/Write)
        Random:          26,603 / 28,501
    Sequential:          49,915 / 43,092

Bandwidth in KiB/sec (Read/Write)
        Random:      1,587,307 / 505,325
    Sequential:      1,827,467 / 535,291
                                        

Latency in ns (Read/Write)
        Random:        272,445 / 260,084
    Sequential:        273,663 / 257,785

With the PR:

IOPS (Read/Write)
        Random:          26,946 / 33,542
    Sequential:          50,484 / 51,998

Bandwidth in KiB/sec (Read/Write)
        Random:      1,595,310 / 541,211
    Sequential:      1,611,390 / 528,907
                                        

Latency in ns (Read/Write)
        Random:        274,136 / 262,201
    Sequential:        277,235 / 252,450

Note: in this case the read performance somehow doesn't increase. I believe it is something weird with tgt on Photon RT OS. Even with a tgt + local file backend, it still gives the same read performance as tgt + Longhorn => the bottleneck is at tgt.

Conclusion

  • Does the increased number of engine-replica connections hurt the resource-constrained env?
    -> Looks like no

@PhanLe1010
Contributor Author

Btw, this article is very helpful for understanding a common mistake in Go concurrency: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/
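
For reference, a minimal self-contained sketch of the mistake the article describes (pre-Go 1.22 loop-variable semantics):

package main

import (
	"fmt"
	"sync"
)

func main() {
	items := []string{"a", "b", "c"}
	var wg sync.WaitGroup

	// Buggy: every goroutine closes over the same loop variable, so with
	// pre-1.22 semantics they may all print the last element.
	for _, item := range items {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("buggy:", item)
		}()
	}
	wg.Wait()

	// Fixed: pass the value as an argument so each goroutine gets its own copy.
	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			fmt.Println("fixed:", item)
		}(item)
	}
	wg.Wait()
}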

@PhanLe1010
Contributor Author

Update:

There is potentially another bottleneck in the read flow in the replica. For every read, we have to make a system call os.(*File).Stat to get the size of the snapshot file here. This system call is eating noticeable CPU. Do you think we can cache the snapshot size value so we don't have to issue a system call for every read? @shuo-wu @derekbit @ejweber

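A minimal sketch of the caching idea (hypothetical types, not the longhorn-engine implementation): record the size once when the snapshot file is opened and reuse it on every read, relying on snapshot immutability.

package snapshotcache // hypothetical package, for illustration only

import "os"

// snapshotFile caches the file size at open time so the read path does not
// need an os.(*File).Stat system call per request.
type snapshotFile struct {
	f    *os.File
	size int64 // valid as long as the snapshot stays immutable
}

func openSnapshot(path string) (*snapshotFile, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	fi, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	return &snapshotFile{f: f, size: fi.Size()}, nil
}

// Size returns the cached size without touching the kernel.
func (s *snapshotFile) Size() int64 { return s.size }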

@shuo-wu
Contributor

shuo-wu commented Apr 26, 2024

For every read, we have to make a system call os.(*File).Stat to get the size of the snapshot file here.

Good catch! I am fine with using the cached size. The snapshot immutability should be guaranteed by the ctime/checksum mechanism

@PhanLe1010
Contributor Author

After testing, the snapshot file size caching idea doesn't seem to improve performance in practice even though it sounds good in theory. CPU usage is also relatively the same. Let's abandon it. Sorry for the noise.

@longhorn-io-github-bot

longhorn-io-github-bot commented Apr 30, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [IMPROVEMENT] Investigate performance bottleneck in v1 data path #8436 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix some v1 data path bottlenecks longhorn-engine#1085

  • Which areas/issues this PR might have potential impacts on?
    Area Volume Read/Write
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@PhanLe1010
Contributor Author

Do you think we should backport this to v1.6.x and v1.5.x? @shuo-wu @ejweber @derekbit

@ejweber
Contributor

ejweber commented May 3, 2024

Do you think we should backport this to v1.6.x and v1.5.x? @shuo-wu @ejweber @derekbit

I don't think there is much danger here, especially around compatibility, as the changes are all self-contained within the engine. On the other hand, the connection count will increase (and possibly CPU utilization as well) as a tradeoff for the performance gain. Maybe that's surprising in a patch update?

I lean slightly towards backporting, but I'm also fine with the gains being associated with a new minor version of Longhorn. I'll defer to whatever you decide.

@PhanLe1010
Contributor Author

Test Plan:

  1. Create a cluster of 3 worker nodes with a similarly big spec, like equinix metal m3.small.x86, ubuntu 22.04, 5.15.0-101-generic, 20Gbps network
  2. Deploy Longhorn 1.6.1
  3. Run kbench
  4. Upgrade Longhorn to master-head and upgrade the engine to master-head
  5. Run kbench again
  6. Verify that you see better IOPS results
  7. See the issue description for an example

@roger-ryao

Verified on master-head 20240508

The test steps
#8436 (comment)

Result Passed

  1. In the AWS EC2 t2.xlarge environment, I did not observe significant differences in IOPS.
    Therefore, I decided to build a cluster on local virtual machines to conduct this test.
    The test results are as follows.

v1.6.1

=========================
FIO Benchmark Summary
For: test_device
CPU Idleness Profiling: disabled
Size: 30G
Quick Mode: disabled
=========================
IOPS (Read/Write)
        Random:           10,329 / 4,684
    Sequential:          17,677 / 10,204

Bandwidth in KiB/sec (Read/Write)
        Random:        413,252 / 141,958
    Sequential:        461,723 / 170,358
                                        

Latency in ns (Read/Write)
        Random:        431,773 / 553,568
    Sequential:        357,682 / 547,949

master-head

=========================
FIO Benchmark Summary
For: test_device
CPU Idleness Profiling: disabled
Size: 30G
Quick Mode: disabled
=========================
IOPS (Read/Write)
        Random:           11,024 / 5,082
    Sequential:          18,149 / 10,672

Bandwidth in KiB/sec (Read/Write)
        Random:        415,402 / 137,600
    Sequential:        445,611 / 167,325
                                        

Latency in ns (Read/Write)
        Random:        409,800 / 525,464
    Sequential:        342,326 / 525,112

Here is the summary table

                     IOPS (Random Read/Write)     IOPS (Sequential Read/Write)
v1.6.1-1st           10,329 / 4,684               17,677 / 10,204
master-head-1st      11,024 / 5,082               18,149 / 10,672

                     Bandwidth KiB/sec (Random Read/Write)     Bandwidth KiB/sec (Sequential Read/Write)
v1.6.1-1st           413,252 / 141,958                         461,723 / 170,358
master-head-1st      415,402 / 137,600                         445,611 / 167,325

                     Latency ns (Random Read/Write)     Latency ns (Sequential Read/Write)
v1.6.1-1st           431,773 / 553,568                  357,682 / 547,949
master-head-1st      409,800 / 525,464                  342,326 / 525,112

Hi @PhanLe1010
I think we can close this ticket. If there are any concerns regarding the test results, please let me know.

@roger-ryao roger-ryao self-assigned this May 9, 2024