[IMPROVEMENT] Investigate performance bottleneck in v1 data path #8436

Closed

PhanLe1010 opened this issue Apr 24, 2024 · 19 comments
Assignees
Labels
area/benchmark Performance Benchmark related backport/1.5.6 backport/1.6.2 kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated
Milestone

Comments

@PhanLe1010
Contributor

Is your improvement request related to a feature? Please describe (👍 if you like this request)

Investigate performance bottlenecks in the v1 data path. So far we have identified 2 bottlenecks:

  1. Number of engine-replica connections
  2. Inefficient revision counter logic

We are still investigating whether there are more bottlenecks.

Describe the solution you'd like

Resolve the bottlenecks and improve the performance of the v1 data path.

Additional Context:

This is the test result on equinix metal m3.small.x86, ubuntu 22.04, 5.15.0-101-generic:

[benchmark result screenshots]
@PhanLe1010 PhanLe1010 added require/doc Require updating the longhorn.io documentation require/manual-test-plan Require adding/updating manual test cases if they can't be automated kind/improvement Request for improvement of existing function require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Apr 24, 2024
@PhanLe1010 PhanLe1010 self-assigned this Apr 24, 2024
@PhanLe1010 PhanLe1010 added this to the v1.7.0 milestone Apr 24, 2024
@PhanLe1010
Contributor Author

This is potentially the 3rd bottleneck:

cc @shuo-wu @ejweber @derekbit

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) area/benchmark Performance Benchmark related labels Apr 24, 2024
@derekbit
Member

This is potentially the 3rd bottleneck:

cc @shuo-wu @ejweber @derekbit

We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for the data at the caller end to prevent the extra data copy in the two functions?
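
For illustration, a minimal sketch of one way to do that (a hypothetical helper, not the longhorn-engine API): over-allocate the slice and return a sub-slice that starts at the first properly aligned address, so it can be passed to direct IO without an extra copy.

package alignedbuf // hypothetical helper package, for illustration only

import "unsafe"

// allocAligned returns a size-byte slice whose first element sits at an
// address that is a multiple of align (e.g. 4096 for direct IO).
func allocAligned(size, align int) []byte {
	buf := make([]byte, size+align)
	off := int(uintptr(unsafe.Pointer(&buf[0])) % uintptr(align))
	if off != 0 {
		off = align - off
	}
	return buf[off : off+size]
}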

@PhanLe1010
Contributor Author

We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for the data at the caller end to prevent the extra data copy in the two functions?

Agree with this idea. I will give it a try and see if it improves the read speed.

@PhanLe1010
Contributor Author

PhanLe1010 commented Apr 25, 2024

Sidetracking a little bit: do you think there is a danger of integer overflow when we keep increasing the counter over here https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L143 ?

If we keep writing long enough, it might become negative.

@derekbit
Member

Sidetracking a little bit: do you think there is a danger of integer overflow when we keep increasing the counter over here https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L143 ?

If we keep writing long enough, it might become negative.

It is int64. Probably no danger?

@PhanLe1010
Contributor Author

It is int64. Probably no danger?

You are right. If we are writing at a speed of 50k IOPS, it would take 5,849,424 years to overflow. None of us will see that day :)))
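
A quick back-of-the-envelope check of that figure, assuming a constant 50,000 counter increments per second:

package main

import (
	"fmt"
	"math"
)

func main() {
	const iops = 50000 // assumed constant write rate
	seconds := float64(math.MaxInt64) / iops
	years := seconds / (365 * 24 * 3600)
	fmt.Printf("~%.0f years until the int64 revision counter overflows\n", years) // prints ~5849424
}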

@PhanLe1010
Contributor Author

Update: a tiny improvement to the revision counter logic

Current implementation

We are converting the int64 to a string and then to a slice of bytes https://github.com/longhorn/longhorn-engine/blob/be5e072e02cd959ab371a04264957624223b01fb/pkg/replica/revision_counter.go#L45-L46

The strconv.FormatInt function seems to eat noticeable CPU time in profiling.

Modification:

Using binary.LittleEndian to encode and decode the int64:

func (r *Replica) writeRevisionCounter(counter int64) error {
	if r.revisionFile == nil {
		return fmt.Errorf("BUG: revision file wasn't initialized")
	}

	// revisionCounterBuf is assumed to be a pre-allocated, block-sized buffer;
	// only its first 8 bytes carry the encoded counter.
	copy(revisionCounterBuf, int64ToBytes(counter))
	_, err := r.revisionFile.WriteAt(revisionCounterBuf, 0)
	if err != nil {
		return errors.Wrap(err, "failed to write to revision counter file")
	}
	return nil
}

// int64ToBytes encodes the counter as 8 little-endian bytes, replacing the
// strconv.FormatInt string conversion.
func int64ToBytes(n int64) []byte {
	bytes := make([]byte, 8)
	binary.LittleEndian.PutUint64(bytes, uint64(n))
	return bytes
}

// bytesToInt64 is the matching decoder for int64ToBytes.
func bytesToInt64(bytes []byte) int64 {
	return int64(binary.LittleEndian.Uint64(bytes))
}
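
As a quick sanity check (illustrative test only, assuming the standard "testing" import), the two helpers round-trip the counter value:

func TestRevisionCounterRoundTrip(t *testing.T) {
	for _, counter := range []int64{0, 1, 12345, -1} {
		if got := bytesToInt64(int64ToBytes(counter)); got != counter {
			t.Fatalf("round trip mismatch: got %d, want %d", got, counter)
		}
	}
}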

Result

A small improvement in write performance, maybe 1-2%.
However, the implementation cost is big: we would need to migrate from the old encoding to the new one.
I think this is not a good idea.

@PhanLe1010
Contributor Author

PhanLe1010 commented Apr 25, 2024

We already know the input data []byte needs to be aligned for direct IO.
Can we always allocate an aligned buffer for the data at the caller end to prevent the extra data copy in the two functions?

Agree with this idea. I will give it a try and see if it improves the read speed.

@derekbit Looks like the performance gets worse. So we will not do this:

IOPS (Read/Write)
        Random:          45,395 / 19,553
    Sequential:          62,778 / 33,106

Bandwidth in KiB/sec (Read/Write)
        Random:        714,421 / 348,190
    Sequential:        976,918 / 353,571
                                        

Latency in ns (Read/Write)
        Random:        821,898 / 455,709
    Sequential:        814,332 / 456,455

@PhanLe1010
Contributor Author

@shuo-wu This is the CPU usage of strconv.FormatInt that we were talking about in the sync up. It is small but noticeable

Screenshot from 2024-04-25 12-25-24

@PhanLe1010
Contributor Author

PhanLe1010 commented Apr 25, 2024

As discussed with @shuo-wu in the sync up, we reran some of the Broadcom tests. The setup is the same as in our initial Broadcom report in the SUSE data center.

Case 1: With CPU/RAM constraints

  1. Random read IOPs test
    • IOPs: similar
    • CPU usage: similar
  2. Sequential read IOPs test
    • IOPs: similar
    • CPU usage: similar
  3. Random write IOPs test
    • IOPs: increased from 2988 to 3150
    • CPU usage: similar
  4. Sequential write IOPs test
    • IOPs: similar
    • CPU usage: similar

Case 2: No CPU/RAM constraints and the workload is NOT rate-limited

With master-head engine:

IOPS (Read/Write)
        Random:          26,603 / 28,501
    Sequential:          49,915 / 43,092

Bandwidth in KiB/sec (Read/Write)
        Random:      1,587,307 / 505,325
    Sequential:      1,827,467 / 535,291
                                        

Latency in ns (Read/Write)
        Random:        272,445 / 260,084
    Sequential:        273,663 / 257,785

With the PR:

IOPS (Read/Write)
        Random:          26,946 / 33,542
    Sequential:          50,484 / 51,998

Bandwidth in KiB/sec (Read/Write)
        Random:      1,595,310 / 541,211
    Sequential:      1,611,390 / 528,907
                                        

Latency in ns (Read/Write)
        Random:        274,136 / 262,201
    Sequential:        277,235 / 252,450

Note: in this case the read performance somehow doesn't increase. I believe it is something weird with tgt on Photon RT OS. Even with a tgt + local file backend, it still gives the same read performance as tgt + Longhorn => the bottleneck is at tgt.

Conclusion

  • Does the increased number of engine-replica connections hurt the resource-constrained env?
    -> Looks like no

@PhanLe1010
Contributor Author

Btw, this article is very helpful for understanding a common mistake in Go concurrency: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/
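
For reference, a minimal self-contained sketch of the mistake the article describes (pre-Go 1.22 loop-variable semantics):

package main

import (
	"fmt"
	"sync"
)

func main() {
	items := []string{"a", "b", "c"}
	var wg sync.WaitGroup

	// Buggy: every goroutine closes over the same loop variable, so with
	// pre-1.22 semantics they may all print the last element.
	for _, item := range items {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("buggy:", item)
		}()
	}
	wg.Wait()

	// Fixed: pass the value as an argument so each goroutine gets its own copy.
	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			fmt.Println("fixed:", item)
		}(item)
	}
	wg.Wait()
}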

@PhanLe1010
Contributor Author

Update:

There is potentially another bottleneck in the read flow in the replica. For every read, we have to make a system call os.(*File).Stat to get the size of the snapshot file here. This system call is eating noticeable CPU. Do you think we can cache the snapshot size value so we don't have to issue a system call for every read? @shuo-wu @derekbit @ejweber

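A minimal sketch of the caching idea (hypothetical types, not the longhorn-engine implementation): record the size once when the snapshot file is opened and reuse it on every read, relying on snapshot immutability.

package snapshotcache // hypothetical package, for illustration only

import "os"

// snapshotFile caches the file size at open time so the read path does not
// need an os.(*File).Stat system call per request.
type snapshotFile struct {
	f    *os.File
	size int64 // valid as long as the snapshot stays immutable
}

func openSnapshot(path string) (*snapshotFile, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	fi, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	return &snapshotFile{f: f, size: fi.Size()}, nil
}

// Size returns the cached size without touching the kernel.
func (s *snapshotFile) Size() int64 { return s.size }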

@shuo-wu
Contributor

shuo-wu commented Apr 26, 2024

For every read, we have to make a system call os.(*File).Stat to get the size of the snapshot file here.

Good catch! I am fine with using the cached size. The snapshot immutability should be guaranteed by the ctime/checksum mechanism

@PhanLe1010
Contributor Author

After testing, the snapshot file size caching idea doesn't seem to improve performance in practice even though it sounds good in theory. CPU usage is also relatively the same. Let's abandon it. Sorry for the noise.

@longhorn-io-github-bot

longhorn-io-github-bot commented Apr 30, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [IMPROVEMENT] Investigate performance bottleneck in v1 data path #8436 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix some v1 data path bottlenecks longhorn-engine#1085

  • Which areas/issues this PR might have potential impacts on?
    Area Volume Read/Write
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@PhanLe1010
Contributor Author

Do you think we should backport this to v1.6.x and v1.5.x? @shuo-wu @ejweber @derekbit

@ejweber
Contributor

ejweber commented May 3, 2024

Do you think we should backport this to v1.6.x and v1.5.x? @shuo-wu @ejweber @derekbit

I don't think there is much danger here, especially around compatibility, as the changes are all self-contained within the engine. On the other hand, the connection count will increase (and possibly CPU utilization as well) as a tradeoff for the performance gain. Maybe that's surprising in a patch update?

I lean slightly towards backporting, but I'm also fine with the gains being associated with a new minor version of Longhorn. I'll defer to whatever you decide.

@PhanLe1010
Contributor Author

Test Plan:

  1. Create a cluster of 3 worker nodes with a similarly big spec, like equinix metal m3.small.x86, ubuntu 22.04, 5.15.0-101-generic, 20Gbps network
  2. Deploy Longhorn 1.6.1
  3. Run kbench
  4. Upgrade Longhorn to master-head and upgrade the engine to master-head
  5. Run kbench again
  6. Verify that you see better IOPS results
  7. See the issue description for an example

@roger-ryao

Verified on master-head 20240508

The test steps
#8436 (comment)

Result Passed

  1. In the AWS EC2 t2.xlarge environment, I did not observe significant differences in IOPS.
    Therefore, I decided to build a cluster on local virtual machines to conduct this test.
    The test results are as follows.

v1.6.1

=========================
FIO Benchmark Summary
For: test_device
CPU Idleness Profiling: disabled
Size: 30G
Quick Mode: disabled
=========================
IOPS (Read/Write)
        Random:           10,329 / 4,684
    Sequential:          17,677 / 10,204

Bandwidth in KiB/sec (Read/Write)
        Random:        413,252 / 141,958
    Sequential:        461,723 / 170,358
                                        

Latency in ns (Read/Write)
        Random:        431,773 / 553,568
    Sequential:        357,682 / 547,949

master-head

=========================
FIO Benchmark Summary
For: test_device
CPU Idleness Profiling: disabled
Size: 30G
Quick Mode: disabled
=========================
IOPS (Read/Write)
        Random:           11,024 / 5,082
    Sequential:          18,149 / 10,672

Bandwidth in KiB/sec (Read/Write)
        Random:        415,402 / 137,600
    Sequential:        445,611 / 167,325
                                        

Latency in ns (Read/Write)
        Random:        409,800 / 525,464
    Sequential:        342,326 / 525,112

Here is the summary table

                     IOPS (Random Read/Write)     IOPS (Sequential Read/Write)
v1.6.1-1st           10,329 / 4,684               17,677 / 10,204
master-head-1st      11,024 / 5,082               18,149 / 10,672

                     Bandwidth KiB/sec (Random Read/Write)     Bandwidth KiB/sec (Sequential Read/Write)
v1.6.1-1st           413,252 / 141,958                         461,723 / 170,358
master-head-1st      415,402 / 137,600                         445,611 / 167,325

                     Latency ns (Random Read/Write)     Latency ns (Sequential Read/Write)
v1.6.1-1st           431,773 / 553,568                  357,682 / 547,949
master-head-1st      409,800 / 525,464                  342,326 / 525,112

Hi @PhanLe1010
I think we can close this ticket. If there are any concerns regarding the test results, please let me know.

@roger-ryao roger-ryao self-assigned this May 9, 2024