
[IMPROVEMENT] Disable revision counter by default #8563

Open
derekbit opened this issue May 14, 2024 · 20 comments
Labels: area/performance, area/volume-replica-rebuild, component/longhorn-instance-manager, component/longhorn-manager, kind/improvement, priority/0, require/auto-e2e-test, require/backport, require/doc, require/manual-test-plan

Comments

derekbit (Member) commented May 14, 2024

Is your improvement request related to a feature? Please describe (👍 if you like this request)

The purpose of the revision counter is to help choose the replica containing the latest data. However, the mechanism introduces a significant performance drop in Longhorn volumes.

After revisiting the design, we think disabling the revision counter (RC) is safe because (see the sketch below):

  • Without a sync, the filesystem and block layers don't guarantee data integrity, so choosing any replica for rebuilding should be fine.
  • After a sync, all IOs are flushed, so all replicas should have identical data.
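A minimal sketch of the sync contract this rationale relies on (illustrative Go, not Longhorn code; the file path is made up): an application may only assume its data is flushed, and therefore identical across healthy replicas, once Sync returns.

    package main

    import "os"

    func main() {
        // Open a file on a mounted Longhorn volume (path is illustrative).
        f, err := os.OpenFile("/mnt/vol/data.bin", os.O_RDWR|os.O_CREATE, 0o644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        if _, err := f.Write([]byte("payload")); err != nil {
            panic(err)
        }
        // Before Sync returns, this write may still be in flight and may
        // differ across replicas; once it returns, all IOs are flushed.
        if err := f.Sync(); err != nil {
            panic(err)
        }
    }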

Describe the solution you'd like

Describe alternatives you've considered

Additional context

https://suse.slack.com/archives/C02EDCA93RA/p1715712709278379?thread_ts=1715659059.436039&cid=C02EDCA93RA

@derekbit (Member Author)

@shuo-wu @PhanLe1010 @WebberHuang1118 @Vicente-Cheng Any thoughts on the improvement?

@PhanLe1010 (Contributor)

We discussed this internally. The conclusion so far is that the reasons we can safely disable the RC are:

  1. Without a sync, the filesystem and block layers don't guarantee data integrity, so choosing any replica for rebuilding should be fine.
  2. After a sync, all IOs are flushed, so all replicas should have identical data.

There is one concern we need to solve:

Even though the filesystem doesn't guarantee data integrity, I think Longhorn must guarantee data consistency across different replicas. The current revision counter tries (best-effort) to compare the RCs of the replicas when the engine starts, and rebuilds the replica data if there is a mismatch. That mechanism is not perfect, but it works maybe 70-80% of the time. Now, if we want to remove the RC, should we come up with a new mechanism to detect replica data mismatches when the engine starts?

And the idea for this one is:

We need to improve the auto-salvage logic so that only one replica is salvaged when all replicas/the engine crash. This should eliminate the mismatch issue.

Btw, we will have this in v1.7.0 instead of the stable v1.6.x releases.

Thanks @innobead @derekbit @shuo-wu for the discussion!

shuo-wu (Contributor) commented May 16, 2024

The auto salvage feature should pick up one reusable replica only. Considering the concurrent replica write out-of-order issue, this is required even if the revision counter is enabled.

WebberHuang1118 commented May 16, 2024

Hi @PhanLe1010 @derekbit @shuo-wu
Just a question: after a sync, would subsequent in-flight IOs make the replicas inconsistent at some point, or will the replicas always be the same? Thanks.

@derekbit (Member Author)

Hi @PhanLe1010 @derekbit @shuo-wu Just a question: after a sync, would subsequent in-flight IOs make the replicas inconsistent at some point, or will the replicas always be the same? Thanks.

All replicas should have the same data content.
However, I just found that Longhorn doesn't implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16 in tgt (code).
Longhorn uses direct I/O, but that doesn't mean data is really persisted after a sync. We might need to implement the two commands.

cc @WebberHuang1118 @PhanLe1010 @shuo-wu @innobead

@PhanLe1010 (Contributor)

All replicas should have the same data content.
However, I just found that Longhorn doesn't implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16 in tgt (code).
Longhorn uses direct I/O, but that doesn't mean data is really persisted after a sync. We might need to implement the two commands.

I am not sure what we would have to do to implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16, because the engine doesn't hold any data in a cache. @derekbit Could you give more details about the idea?

PhanLe1010 (Contributor) commented May 24, 2024

The auto salvage feature should pick up one reusable replica only. Considering the concurrent replica write out-of-order issue, this is required even if the revision counter is enabled.

For this idea, @shuo-wu @ejweber @james-munson and I discussed in the US sync and agreed that the Longhorn manager can salvage multiple replicas, and it is the job of the engine to select the best replica and mark the other replicas as ERR. The reasons are:

  1. The engine has more info about the replicas' data state than the Longhorn manager.
  2. The Longhorn manager actually gets the replica information from the engine anyway.

shuo-wu (Contributor) commented May 24, 2024

Actually, we only need to do a few extra things after disabling the revision counter, as most of the logic is already there:

  1. The volume controller in the longhorn-manager will pick up all candidates from the failed replicas.
  2. The engine will automatically pick the one replica that has the latest modified time and the largest head size. The main concern is that, after introducing the filesystem trim feature, the head with the largest actual size may not be the latest replica. But I am fine with that, since picking only one replica guarantees no inconsistency among replicas.

PhanLe1010 (Contributor) commented May 25, 2024

I am walking through the flow of auto-salvage for a volume with the RC disabled to double-check the logic. From a closer look, it seems that the engine will pick multiple replicas whose last modification timestamps fall within a 5s window of the latest one (see the simplified sketch below): https://github.com/longhorn/longhorn-engine/blob/e39b7f0313b22d5c435ce57d1442800999a0f4ac/pkg/controller/control.go#L644-L652

IMO, 5s can be problematic, as during this window there might be IO differences between the replicas, which could leave the replicas' data inconsistent. However, if we reduce or even eliminate this value (so that the engine always picks only the replica with the latest modification timestamp), we will trigger more replica rebuilding (potentially unnecessary, expensive rebuilding).

So this becomes a trade-off between the risk of data inconsistency and potentially unnecessary, expensive rebuilding. I am thinking of reducing this value to 1s as a middle ground between the two options. WDYT? @shuo-wu @derekbit @ejweber @innobead
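For reference, a simplified Go paraphrase of that selection window (names are illustrative, not the verbatim longhorn-engine code):

    package salvage

    import "time"

    // salvageWindow is the 5s tolerance discussed above.
    const salvageWindow = 5 * time.Second

    // pickCandidates keeps every replica whose volume-head file was modified
    // within salvageWindow of the most recently modified one, which is why
    // more than one replica can survive the selection.
    func pickCandidates(lastModified map[string]time.Time) []string {
        var latest time.Time
        for _, t := range lastModified {
            if t.After(latest) {
                latest = t
            }
        }
        var candidates []string
        for addr, t := range lastModified {
            if latest.Sub(t) <= salvageWindow {
                candidates = append(candidates, addr)
            }
        }
        return candidates
    }

Shrinking salvageWindow (to 1s, or to 0 so that only the newest replica survives) is exactly the trade-off described above: fewer surviving candidates, more rebuilding.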

ejweber (Contributor) commented May 28, 2024

I am leaning towards eliminating it and only salvaging from one replica:

  • Even if two replicas fail only milliseconds apart, they can have data inconsistency. We can't guess how critical the inconsistency is.
  • Before this change, there was a reasonably high likelihood of rebuilding all replicas but one in the autosalvage case, wasn't there? Two replicas that failed milliseconds apart would likely have slightly different revision counter values. (As a counter to this, maybe there are common situations in which we mark all replicas failed, even when the engine was not actively writing to them?)

shuo-wu (Contributor) commented May 28, 2024

I prefer to salvage only one replica so that replica consistency can be guaranteed.

  1. The replica rebuilding may not be as expensive as you expect, since we already introduced the fast-rebuild feature for v1 engines, which quickly reuses existing snapshots by checking the checksum and ctime (sketched below).
  2. Any potential risk of data inconsistency should be eliminated. Data consistency is the bedrock of a storage system. Besides, inconsistency may lead to annoying issues like filesystem crashes.
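For point 1, a hedged sketch of the reuse rule (type and field names are assumptions, not the actual longhorn-engine types): a snapshot on the rebuilding replica is reused only when both its checksum and ctime match the source replica's copy, so only changed snapshots are re-transferred.

    package rebuild

    // SnapshotInfo stands in for the per-snapshot metadata the fast-rebuild
    // check relies on (illustrative fields).
    type SnapshotInfo struct {
        Checksum string // content checksum recorded for the snapshot
        Ctime    string // change time recorded when the snapshot was taken
    }

    // canReuseSnapshot returns true only when both checksum and ctime match,
    // so an unchanged snapshot is reused instead of re-copied.
    func canReuseSnapshot(src, dst SnapshotInfo) bool {
        return src.Checksum != "" &&
            src.Checksum == dst.Checksum &&
            src.Ctime == dst.Ctime
    }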

PhanLe1010 (Contributor) commented May 28, 2024

Even if two replicas fail only milliseconds apart, they can have data inconsistency. We can't guess how critical the inconsistency is.

I agree that we can't guess how critical the inconsistency is

Before this change, there was a reasonably high likelihood of rebuilding all replicas but one in the autosalvage case, wasn't there?

I think when the volume has no IO during the incident (e.g., an instance-manager pod crash), the replicas would not have to rebuild before this change. After this change, if we only select one replica to use, the other replicas will always have to be rebuilt (because the last modification timestamps are hardly ever identical between replicas, I think).

PhanLe1010 (Contributor) commented May 28, 2024

Thanks @ejweber and @shuo-wu for the feedback! I agree! I will modify the code to always keep only 1 replica for salvaging

@PhanLe1010 (Contributor)

Update:

I am sorry for the wrong statement above. It looks like the original design already attempted to select only one replica and mark the other replicas as ERR. However, there is a BUG in the implementation.

The intended design from the original LEP:

  1. Based on the 'volume-head-xxx.img' last modified time, get the latest one; any replica within 5 seconds of it can be put in the candidate list for now.
  2. Compare the head file size of all candidate replicas and pick the one with the most blocks as the 'Source of Truth'.
  3. Mark only one candidate replica as 'RW' mode; the rest of the replicas are marked as 'ERR' mode.

The actual implementation:

  1. Based on the 'volume-head-xxx.img' last modified time, get the latest one; any replica within 5 seconds of it can be put in the candidate list for now.
  2. Mark a random replica from that list as RW and the rest as ERR. This is a bug, because we forget to update largestSize in this for loop (illustrated below).
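A hypothetical reconstruction of the buggy loop (illustrative names, not the verbatim engine code):

    package salvage

    // pickSourceOfTruth is meant to return the candidate with the largest
    // head size, per step 2 of the intended LEP design.
    func pickSourceOfTruth(headSizes map[string]int64) string {
        var largestSize int64
        var sourceOfTruth string
        for addr, size := range headSizes {
            if size >= largestSize {
                sourceOfTruth = addr
                // BUG: the assignment below is missing, so largestSize stays
                // 0, every candidate passes the comparison, and Go's
                // randomized map iteration order makes the winner arbitrary.
                // largestSize = size // <- the fix
            }
        }
        return sourceOfTruth
    }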

@PhanLe1010 (Contributor)

I created a new ticket for the bug at #8659, because I think we should backport the fix to the older versions, while this ticket is not intended to be backported.

derekbit (Member Author) commented May 29, 2024

All replicas should have the same data content.
However, I just found that Longhorn doesn't implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16 in tgt (code).
Longhorn uses direct I/O, but that doesn't mean data is really persisted after a sync. We might need to implement the two commands.

I am not sure what we would have to do to implement SYNCHRONIZE_CACHE and SYNCHRONIZE_CACHE_16, because the engine doesn't hold any data in a cache. @derekbit Could you give more details about the idea?

We are using direct I/O, but not synchronous I/O. Without synchronous I/O, there is a higher risk of data loss even if the user has issued a sync.

O_DIRECT alone only promises that the kernel will avoid copying data from user space to kernel space, and will instead write it directly via DMA (direct memory access, if possible). Data does not go into caches, but there is no strict guarantee that the call returns only after all data has been transferred.

O_SYNC guarantees that the call will not return before all data has been transferred to the disk (as far as the OS can tell). This still does not guarantee that the data isn't sitting in the hard disk's write cache, but it is as much as the OS can guarantee.
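A minimal Linux-only Go sketch of that distinction (illustrative, not Longhorn code): combining the two flags gives both the cache bypass and the completion-on-return guarantee.

    //go:build linux

    package durability

    import (
        "os"
        "syscall"
    )

    // openDurable opens a file with O_DIRECT (DMA past the page cache; user
    // buffers must be properly aligned) plus O_SYNC (each write returns only
    // once the OS considers the data transferred to the device).
    func openDurable(path string) (*os.File, error) {
        return os.OpenFile(path, os.O_RDWR|syscall.O_DIRECT|syscall.O_SYNC, 0o644)
    }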

@PhanLe1010 (Contributor)

We are using direct I/O, but not synchronous I/O. Without synchronous I/O, there is a higher risk of data loss even if the user has issued a sync. [...]

Discussed with @derekbit; we will handle this at #8662.

PhanLe1010 (Contributor) commented May 29, 2024

Update:

Proposing this modified logic for the engine to select the replica candidate in the auto-salvage case: longhorn/longhorn-engine#1114 (comment)

longhorn-io-github-bot commented May 29, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [IMPROVEMENT] Disable revision counter by default #8563 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is: manually disable the revision counter via the UI and the Longhorn StorageClass.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at: Disable revision counter by default #8664
    The PR for the chart change is at: Disable revision counter by default #8664

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at

  • Disable revision counter by default longhorn-manager#2833

  • Fix bug the engine might choose a replica with a smaller head size to be the source of truth for auto-salvage longhorn-engine#1114

  • Which areas/issues this PR might have potential impacts on?
    Area: Data consistency, Performance
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at Disable revision counter by default website#918

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template) [TEST][IMPROVEMENT] Disable revision counter by default #8665

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

PhanLe1010 (Contributor) commented May 29, 2024

Test Plan:

Test the setting

Case 1: Longhorn UI

  1. Create a new volume using the UI
  2. Verify that the volume has the revision counter disabled by default

Case 2: Default Longhorn StorageClass

  1. Create a PVC using the default Longhorn SC
  2. Verify that the provisioned Longhorn volume has the revision counter disabled by default

Case 3: A StorageClass without the disableRevisionCounter parameter

  1. Create the SC:
        kind: StorageClass
        apiVersion: storage.k8s.io/v1
        metadata:
          name: test-sc
        provisioner: driver.longhorn.io
        allowVolumeExpansion: true
  2. Create a PVC using the above SC
  3. Verify that the provisioned Longhorn volume has the revision counter disabled by default

Test resilience

(This should be implemented by an e2e test. If you are testing it manually, you can try 5 times instead of 20 times to save time)

  1. Create a Longhorn volume testvol-1 with the RC enabled and 2 replicas
  2. Attach the volume to a node, mount the volume, and make a filesystem on the volume
  3. On one thread, create and delete files
  4. On another thread, crash the 2 replicas
  5. Wait for the volume to become faulted, auto-salvage, and become healthy again
  6. Remount the volume and check whether the filesystem is corrupted. If yes, increment fs_corruption_count_with_rc_enabled
  7. Unmount, detach, and delete the volume
  8. Repeat steps 1-7 20 times
  9. Create a Longhorn volume testvol-2 with the RC disabled and 2 replicas
  10. Attach the volume to a node, mount the volume, and make a filesystem on the volume
  11. On one thread, create and delete files
  12. On another thread, crash the 2 replicas
  13. Wait for the volume to become faulted, auto-salvage, and become healthy again
  14. Remount the volume and check whether the filesystem is corrupted. If yes, increment fs_corruption_count_with_rc_disabled
  15. Unmount, detach, and delete the volume
  16. Repeat steps 9-15 20 times
  17. Verify that fs_corruption_count_with_rc_disabled <= fs_corruption_count_with_rc_enabled
