
[IMPROVEMENT] Disable revision counter by default #8563

Open
derekbit opened this issue May 14, 2024 · 20 comments
Labels: area/performance, area/volume-replica-rebuild, component/longhorn-instance-manager, component/longhorn-manager, kind/improvement, priority/0, require/auto-e2e-test, require/backport, require/doc, require/manual-test-plan

Comments

derekbit (Member) commented May 14, 2024

Is your improvement request related to a feature? Please describe (👍 if you like this request)

The purpose of the revision counter is to help choose the replica containing the latest data. However, the mechanism introduces a significant performance drop in Longhorn volumes.

After revisiting the design, we think disabling the revision counter (RC) is safe because (see the sketch below):

  • Without a sync, the filesystem and block layers don't guarantee data integrity, so choosing any replica for rebuilding should be fine.
  • After a sync, all IOs are flushed, so all replicas should have identical data.
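A minimal sketch of the sync contract this rationale relies on (illustrative Go, not Longhorn code; the file path is made up): an application may only assume its data is flushed, and therefore identical across healthy replicas, once Sync returns.

    package main

    import "os"

    func main() {
        // Open a file on a mounted Longhorn volume (path is illustrative).
        f, err := os.OpenFile("/mnt/vol/data.bin", os.O_RDWR|os.O_CREATE, 0o644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        if _, err := f.Write([]byte("payload")); err != nil {
            panic(err)
        }
        // Before Sync returns, this write may still be in flight and may
        // differ across replicas; once it returns, all IOs are flushed.
        if err := f.Sync(); err != nil {
            panic(err)
        }
    }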

Describe the solution you'd like

Describe alternatives you've considered

Additional context

https://suse.slack.com/archives/C02EDCA93RA/p1715712709278379?thread_ts=1715659059.436039&cid=C02EDCA93RA

@derekbit (Member Author)

@shuo-wu @PhanLe1010 @WebberHuang1118 @Vicente-Cheng Any thoughts on the improvement?

@PhanLe1010 (Contributor)

We discussed this internally. The conclusion so far is that the reasons we can safely disable the RC are:

  1. Without a sync, the filesystem and block layers don't guarantee data integrity, so choosing any replica for rebuilding should be fine.
  2. After a sync, all IOs are flushed, so all replicas should have identical data.

There is one concern we need to solve:

Even though the filesystem doesn't guarantee data integrity, I think Longhorn must guarantee data consistency across different replicas. The current revision counter tries (best-effort) to compare the RCs of the replicas when the engine starts, and rebuilds the replica data if there is a mismatch. That mechanism is not perfect, but it works maybe 70-80% of the time. Now, if we want to remove the RC, should we come up with a new mechanism to detect replica data mismatches when the engine starts?

And the idea for this one is:

We need to improve the auto-salvage logic so that only one replica is salvaged when all replicas/the engine crash. This should eliminate the mismatch issue.

Btw, we will have this in v1.7.0 instead of the stable v1.6.x releases.

Thanks @innobead @derekbit @shuo-wu for the discussion!

shuo-wu (Contributor) commented May 16, 2024

The auto salvage feature should pick up one reusable replica only. Considering the concurrent replica write out-of-order issue, this is required even if the revision counter is enabled.

WebberHuang1118 commented May 16, 2024

Hi @PhanLe1010 @derekbit @shuo-wu
Just a question: after a sync, would subsequent in-flight IOs make the replicas inconsistent at some point, or will the replicas always be the same? Thanks.

@derekbit (Member Author)

Hi @PhanLe1010 @derekbit @shuo-wu Just a question: after a sync, would subsequent in-flight IOs make the replicas inconsistent at some point, or will the replicas always be the same? Thanks.

All replicas should have the same data content.
However, I just found that Longhorn doesn't implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16 in tgt (code).
Longhorn uses direct I/O, but that doesn't mean data is really persisted after a sync. We might need to implement the two commands.

cc @WebberHuang1118 @PhanLe1010 @shuo-wu @innobead

@PhanLe1010 (Contributor)

All replicas should have the same data content.
However, I just found that Longhorn doesn't implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16 in tgt (code).
Longhorn uses direct I/O, but that doesn't mean data is really persisted after a sync. We might need to implement the two commands.

I am not sure what we would have to do to implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16, because the engine doesn't hold any data in a cache. @derekbit Could you give more details about the idea?

PhanLe1010 (Contributor) commented May 24, 2024

The auto salvage feature should pick up one reusable replica only. Considering the concurrent replica write out-of-order issue, this is required even if the revision counter is enabled.

For this idea, @shuo-wu @ejweber @james-munson and I discussed in the US sync and agreed that the Longhorn manager can salvage multiple replicas, and it is the job of the engine to select the best replica and mark the other replicas as ERR. The reasons are:

  1. The engine has more info about the replicas' data state than the Longhorn manager.
  2. The Longhorn manager actually gets the replica information from the engine anyway.

shuo-wu (Contributor) commented May 24, 2024

Actually, we only need to do a few extra things after disabling the revision counter, as most of the logic is already there:

  1. The volume controller in the longhorn-manager will pick up all candidates from the failed replicas.
  2. The engine will automatically pick the one replica that has the latest modified time and the largest head size. The main concern is that, after introducing the filesystem trim feature, the head with the largest actual size may not be the latest replica. But I am fine with that, since picking only one replica guarantees no inconsistency among replicas.

PhanLe1010 (Contributor) commented May 25, 2024

I am walking through the flow of auto-salvage for a volume with the RC disabled to double-check the logic. From a closer look, it seems that the engine will pick multiple replicas whose last modification timestamps fall within a 5s window of the latest one (see the simplified sketch below): https://github.com/longhorn/longhorn-engine/blob/e39b7f0313b22d5c435ce57d1442800999a0f4ac/pkg/controller/control.go#L644-L652

IMO, 5s can be problematic, as during this window there might be IO differences between the replicas, which could leave the replicas' data inconsistent. However, if we reduce or even eliminate this value (so that the engine always picks only the replica with the latest modification timestamp), we will trigger more replica rebuilding (potentially unnecessary, expensive rebuilding).

So this becomes a trade-off between the risk of data inconsistency and potentially unnecessary, expensive rebuilding. I am thinking of reducing this value to 1s as a middle ground between the two options. WDYT? @shuo-wu @derekbit @ejweber @innobead
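For reference, a simplified Go paraphrase of that selection window (names are illustrative, not the verbatim longhorn-engine code):

    package salvage

    import "time"

    // salvageWindow is the 5s tolerance discussed above.
    const salvageWindow = 5 * time.Second

    // pickCandidates keeps every replica whose volume-head file was modified
    // within salvageWindow of the most recently modified one, which is why
    // more than one replica can survive the selection.
    func pickCandidates(lastModified map[string]time.Time) []string {
        var latest time.Time
        for _, t := range lastModified {
            if t.After(latest) {
                latest = t
            }
        }
        var candidates []string
        for addr, t := range lastModified {
            if latest.Sub(t) <= salvageWindow {
                candidates = append(candidates, addr)
            }
        }
        return candidates
    }

Shrinking salvageWindow (to 1s, or to 0 so that only the newest replica survives) is exactly the trade-off described above: fewer surviving candidates, more rebuilding.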

ejweber (Contributor) commented May 28, 2024

I am leaning towards eliminating it and only salvaging from one replica:

  • Even if two replicas fail only milliseconds apart, they can have data inconsistency. We can't guess how critical the inconsistency is.
  • Before this change, there was a reasonably high likelihood of rebuilding all replicas but one in the autosalvage case, wasn't there? Two replicas that failed milliseconds apart would likely have slightly different revision counter values. (As a counter to this, maybe there are common situations in which we mark all replicas failed, even when the engine was not actively writing to them?)

shuo-wu (Contributor) commented May 28, 2024

I prefer to salvage only one replica so that replica consistency can be guaranteed.

  1. The replica rebuilding may not be as expensive as you expect, since we already introduced the fast-rebuild feature for v1 engines, which quickly reuses existing snapshots by checking the checksum and ctime (sketched below).
  2. Any potential risk of data inconsistency should be eliminated. Data consistency is the bedrock of a storage system. Besides, inconsistency may lead to annoying issues like filesystem crashes.
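For point 1, a hedged sketch of the reuse rule (type and field names are assumptions, not the actual longhorn-engine types): a snapshot on the rebuilding replica is reused only when both its checksum and ctime match the source replica's copy, so only changed snapshots are re-transferred.

    package rebuild

    // SnapshotInfo stands in for the per-snapshot metadata the fast-rebuild
    // check relies on (illustrative fields).
    type SnapshotInfo struct {
        Checksum string // content checksum recorded for the snapshot
        Ctime    string // change time recorded when the snapshot was taken
    }

    // canReuseSnapshot returns true only when both checksum and ctime match,
    // so an unchanged snapshot is reused instead of re-copied.
    func canReuseSnapshot(src, dst SnapshotInfo) bool {
        return src.Checksum != "" &&
            src.Checksum == dst.Checksum &&
            src.Ctime == dst.Ctime
    }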

PhanLe1010 (Contributor) commented May 28, 2024

Even if two replicas fail only milliseconds apart, they can have data inconsistency. We can't guess how critical the inconsistency is.

I agree that we can't guess how critical the inconsistency is

Before this change, there was a reasonably high likelihood of rebuilding all replicas but one in the autosalvage case, wasn't there?

I think when the volume has no IO during the incident (e.g., an instance-manager pod crash), the replicas would not have to rebuild before this change. After this change, if we only select one replica to use, the other replicas will always have to be rebuilt (because the last modification timestamps are hardly ever identical between replicas, I think).

PhanLe1010 (Contributor) commented May 28, 2024

Thanks @ejweber and @shuo-wu for the feedback! I agree! I will modify the code to always keep only 1 replica for salvaging

@PhanLe1010 (Contributor)

Update:

I am sorry for the wrong statement above. It looks like the original design already attempted to select only one replica and mark the other replicas as ERR. However, there is a BUG in the implementation.

The intended design from the original LEP:

  1. Based on the 'volume-head-xxx.img' last modified time, get the latest one; any replica within 5 seconds of it can be put in the candidate list for now.
  2. Compare the head file size of all candidate replicas and pick the one with the most blocks as the 'Source of Truth'.
  3. Mark only one candidate replica as 'RW' mode; the rest of the replicas are marked as 'ERR' mode.

The actual implementation:

  1. Based on the 'volume-head-xxx.img' last modified time, get the latest one; any replica within 5 seconds of it can be put in the candidate list for now.
  2. Mark a random replica from that list as RW and the rest as ERR. This is a bug, because we forget to update largestSize in this for loop (illustrated below).
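A hypothetical reconstruction of the buggy loop (illustrative names, not the verbatim engine code):

    package salvage

    // pickSourceOfTruth is meant to return the candidate with the largest
    // head size, per step 2 of the intended LEP design.
    func pickSourceOfTruth(headSizes map[string]int64) string {
        var largestSize int64
        var sourceOfTruth string
        for addr, size := range headSizes {
            if size >= largestSize {
                sourceOfTruth = addr
                // BUG: the assignment below is missing, so largestSize stays
                // 0, every candidate passes the comparison, and Go's
                // randomized map iteration order makes the winner arbitrary.
                // largestSize = size // <- the fix
            }
        }
        return sourceOfTruth
    }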

@PhanLe1010 (Contributor)

I created a new ticket for the bug at #8659, because I think we should backport the fix to the older versions, while this ticket is not intended to be backported.

derekbit (Member Author) commented May 29, 2024

All replicas should have the same data content.
However, I just found that Longhorn doesn't implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16 in tgt (code).
Longhorn uses direct I/O, but that doesn't mean data is really persisted after a sync. We might need to implement the two commands.

I am not sure what we would have to do to implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16, because the engine doesn't hold any data in a cache. @derekbit Could you give more details about the idea?

We are using direct I/O, but not synchronous I/O. Without synchronous I/O, there is a higher risk of data loss even if the user has issued a sync.

O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (direct memory access, if possible). Data does not go into caches, but there is no strict guarantee that the call returns only after all data has been transferred.

O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't sitting in the hard disk's write cache, but it is as much as the OS can guarantee.
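A minimal Linux-only Go sketch of that distinction (illustrative, not Longhorn code): combining the two flags gives both the cache bypass and the completion-on-return guarantee.

    //go:build linux

    package durability

    import (
        "os"
        "syscall"
    )

    // openDurable opens a file with O_DIRECT (DMA past the page cache; user
    // buffers must be properly aligned) plus O_SYNC (each write returns only
    // once the OS considers the data transferred to the device).
    func openDurable(path string) (*os.File, error) {
        return os.OpenFile(path, os.O_RDWR|syscall.O_DIRECT|syscall.O_SYNC, 0o644)
    }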

@PhanLe1010 (Contributor)

We are using direct I/O, but not synchronous I/O. Without synchronous I/O, there is a higher risk of data loss even if the user has issued a sync. [...]

Discussed with @derekbit; we will handle this at #8662.

PhanLe1010 (Contributor) commented May 29, 2024

Update:

Proposing this modified logic for the engine to select the replica candidate in the auto-salvage case: longhorn/longhorn-engine#1114 (comment)

longhorn-io-github-bot commented May 29, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [IMPROVEMENT] Disable revision counter by default #8563 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is: manually disable the revision counter via the UI and the Longhorn StorageClass.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: Disable revision counter by default #8664
    The PR for the chart change is at: Disable revision counter by default #8664

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at

  • Disable revision counter by default longhorn-manager#2833

  • Fix bug the engine might choose a replica with a smaller head size to be the source of truth for auto-salvage longhorn-engine#1114

  • Which areas/issues this PR might have potential impacts on?
    Area: Data consistency, Performance
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at Disable revision counter by default website#918

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template) [TEST][IMPROVEMENT] Disable revision counter by default #8665

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

PhanLe1010 (Contributor) commented May 29, 2024

Test Plan:

Test the setting

Case 1: Longhorn UI

  1. Create a new volume using the UI
  2. Verify that the volume has the revision counter disabled by default

Case 2: Default Longhorn StorageClass

  1. Create a PVC using the default Longhorn SC
  2. Verify that the provisioned Longhorn volume has the revision counter disabled by default

Case 3: A StorageClass without the disableRevisionCounter parameter

  1. Create the SC:
        kind: StorageClass
        apiVersion: storage.k8s.io/v1
        metadata:
          name: test-sc
        provisioner: driver.longhorn.io
        allowVolumeExpansion: true
  2. Create a PVC using the above SC
  3. Verify that the provisioned Longhorn volume has the revision counter disabled by default

Test resilience

(This should be implemented by an e2e test. If you are testing it manually, you can try 5 times instead of 20 times to save time)

  1. Create a Longhorn volume testvol-1 with the RC enabled and 2 replicas
  2. Attach the volume to a node, mount the volume, and make a filesystem on the volume
  3. On one thread, create and delete files
  4. On another thread, crash the 2 replicas
  5. Wait for the volume to become faulted, auto-salvage, and become healthy again
  6. Remount the volume and check whether the filesystem is corrupted. If yes, increment fs_corruption_count_with_rc_enabled
  7. Unmount, detach, and delete the volume
  8. Repeat steps 1-7 20 times
  9. Create a Longhorn volume testvol-2 with the RC disabled and 2 replicas
  10. Attach the volume to a node, mount the volume, and make a filesystem on the volume
  11. On one thread, create and delete files
  12. On another thread, crash the 2 replicas
  13. Wait for the volume to become faulted, auto-salvage, and become healthy again
  14. Remount the volume and check whether the filesystem is corrupted. If yes, increment fs_corruption_count_with_rc_disabled
  15. Unmount, detach, and delete the volume
  16. Repeat steps 9-15 20 times
  17. Verify that fs_corruption_count_with_rc_disabled <= fs_corruption_count_with_rc_enabled
