[IMPROVEMENT] Make gRPC service timeout configurable #8590

Open
fab-sgnct opened this issue May 17, 2024 · 3 comments
Labels
  • area/volume-replica-rebuild: Volume replica rebuilding related
  • component/longhorn-instance-manager: Longhorn instance manager (interface between control and data plane)
  • kind/improvement: Request for improvement of existing function
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/doc: Require updating the longhorn.io documentation
  • require/manual-test-plan: Require adding/updating manual test cases if they can't be automated

Comments

fab-sgnct commented May 17, 2024

Is your improvement request related to a feature? Please describe (👍 if you like this request)

We are using Longhorn 1.5.x in various environments.
One of them has a PVC approaching 1 TB and a slower network than the others.
From time to time, network issues cause replica failures and lead Longhorn to salvage a volume. Longhorn then tries to rebuild a replica from the remaining healthy ones. With the slower network and that amount of data, the rebuild takes hours and eventually times out when it reaches 24h. That is frustrating when you have spent those hours watching the rebuild percentage climb slowly to 90+% before it drops back to 0%.
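To put the 24h limit in perspective, here is a rough back-of-the-envelope sketch. It assumes the rebuild has to transfer the full volume size (a worst case; the actual data may be smaller), a decimal terabyte (10^12 bytes), and a steady average throughput. The function names are illustrative only and are not part of Longhorn.

```python
ONE_TB = 10**12  # decimal terabyte, an assumption for this estimate


def min_throughput_mb_s(data_bytes: int, timeout_hours: float) -> float:
    """Average throughput in MB/s needed to copy data_bytes within timeout_hours."""
    return data_bytes / (timeout_hours * 3600) / 1e6


def rebuild_hours(data_bytes: int, throughput_mb_s: float) -> float:
    """Hours needed to copy data_bytes at a steady throughput in MB/s."""
    return data_bytes / (throughput_mb_s * 1e6) / 3600


# Throughput needed to finish 1 TB before a 24h timeout.
print(round(min_throughput_mb_s(ONE_TB, 24), 1))  # 11.6

# Time needed for 1 TB at a sustained 10 MB/s.
print(round(rebuild_hours(ONE_TB, 10), 2))  # 27.78
```

Under these assumptions, any sustained rate below roughly 11.6 MB/s cannot finish a 1 TB rebuild inside 24h, regardless of how far the progress bar has advanced.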

Describe the solution you'd like

Making the rebuild operation faster may be challenging, since it is limited by network speed versus data size. The alternative would be to give it more time, i.e., from my understanding, to make the gRPC service long timeout configurable.
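As a minimal sketch of what "configurable" could look like, the snippet below reads a timeout from a setting and falls back to the current 24h value when the setting is missing or invalid. The variable name `GRPC_LONG_TIMEOUT_HOURS` and the function `long_timeout` are hypothetical and chosen for illustration; they are not existing Longhorn settings or APIs.

```python
import math
import os
from datetime import timedelta

DEFAULT_TIMEOUT = timedelta(hours=24)  # the 24h limit reported in this issue
ENV_VAR = "GRPC_LONG_TIMEOUT_HOURS"    # hypothetical setting name


def long_timeout(env=None) -> timedelta:
    """Return the configured long timeout, or the 24h default if unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_TIMEOUT
    try:
        hours = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT  # not a number: keep the default
    if not math.isfinite(hours) or hours <= 0:
        return DEFAULT_TIMEOUT  # reject NaN, infinity, zero, and negatives
    return timedelta(hours=hours)


# Example: allow three days for very large rebuilds.
timeout = long_timeout({ENV_VAR: "72"})
print(timeout.total_seconds())  # 259200.0
```

Falling back to the default on bad input keeps existing behavior unchanged for clusters that never set the value, which matters if the change is backported.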

Describe alternatives you've considered

Alternatives:

  • Reduce the amount of data for the rebuild operation (not always possible)
    • move the data to another system
    • trim
    • rebuild the replica with the smaller data size
    • copy the data back
  • Recreate the PVC, move the data to the new one, and switch from the old one to the new one (blue/green, kind of)
fab-sgnct added the kind/improvement, require/backport, require/doc, and require/manual-test-plan labels May 17, 2024
fab-sgnct changed the title from "[IMPROVEMENT]" to "[IMPROVEMENT] Make gRPC service timeout configurable" May 17, 2024
PhanLe1010 (Contributor) commented May 17, 2024

Related to ticket #2765. We can investigate this one when working on that ticket.

PhanLe1010 (Contributor) commented

Btw, if it takes longer than 24h to rebuild the replica, that signals the current infrastructure is not well suited to a volume this large. The cluster would be busy rebuilding for a long time. Would it be better to:

  1. Reduce the size of the volume
  2. Increase the network bandwidth in this cluster

derekbit added the component/longhorn-instance-manager label May 20, 2024
derekbit (Member) commented

@fab-sgnct
Can you briefly describe how you use the volume, e.g. whether the volume data is overwritten very frequently? Do you have any snapshots of the big volume? v1.5.x introduced fast replica rebuilding, but it sounds like it is not working in your case.

derekbit added the area/volume-replica-rebuild label May 20, 2024