[IMPROVEMENT] Make gRPC service timeout configurable #8590

fab-sgnct · 2024-05-17T08:20:22Z

Is your improvement request related to a feature? Please describe (👍 if you like this request)

We are using Longhorn 1.5.x in various environments.
One of them has a PVC approaching 1 TB. It also has a slower network than the others.
From time to time we have network problems that cause replica failures and make Longhorn salvage a volume. Longhorn then tries to rebuild a replica from the remaining healthy ones. With the slower network and that amount of data, the operation takes hours and eventually times out when it reaches 24h. That is frustrating when you have spent those hours watching the rebuild percentage climb slowly to 90+% before dropping back to 0%.

Describe the solution you'd like

Since making the rebuild operation faster might be challenging, and is limited by network speed versus data size, the alternative would be to give it more time, i.e., from my understanding, to make the gRPC service long timeout configurable.

Describe alternatives you've considered

Alternatives:

- Reduce the amount of data for the rebuild operation (not always possible):
  - move data to another system
  - trim
  - rebuild the replica with the smaller data size
  - copy the data back
- Recreate the PVC, move the data to the new one, and switch from old to new (blue/green style)

PhanLe1010 · 2024-05-17T23:11:46Z

Related to ticket #2765. We can investigate this one when working on that ticket.

PhanLe1010 · 2024-05-17T23:18:18Z

By the way, if rebuilding the replica takes longer than 24h, that signals that the current infrastructure is not well suited to a volume this large. The cluster would be busy rebuilding for a long time. Would it be better to do one of the following?

- Reduce the size of the volume
- Increase the network bandwidth in this cluster

derekbit · 2024-05-20T06:24:10Z

@fab-sgnct
Can you briefly describe how you use the volume, e.g. whether the volume data is overwritten very frequently? Do you have any snapshots of the big volume? v1.5.x introduced fast replica rebuilding, but it sounds like it is not working in your case.

fab-sgnct changed the title ~~[IMPROVEMENT]~~ [IMPROVEMENT] Make gRPC service timeout configurable May 17, 2024

derekbit added the component/longhorn-instance-manager Longhorn instance manager (interface between control and data plane) label May 20, 2024

derekbit added the area/volume-replica-rebuild Volume replica rebuilding related label May 20, 2024
