feat(lep): auto-balance pressured disks #7576

`enhancements/20240108-replica-auto-balance-pressured-disks.md` (new file, 146 additions)

# Replica Auto-Balance Pressured Disks On Same Node

## Summary

This proposal aims to enhance Longhorn's replica auto-balance feature, specifically addressing the challenge of growing volumes putting pressure on available disk space. By introducing the `replica-auto-balance-disk-pressure-percentage` setting, users can allow Longhorn to automatically rebuild a replica to another disk on the same node with more available storage when disk pressure reaches the user-defined threshold. This enhancement is active when the `replica-auto-balance` setting is configured as `best-effort`.


### Related Issues

https://github.com/longhorn/longhorn/issues/4105

## Motivation

In typical scenarios, disk space tends to become increasingly occupied over time, especially when volume replicas are unevenly scheduled across disks. This enhancement rebuilds a volume replica to another disk on the same node when disk pressure reaches the user-defined threshold, improving space utilization and achieving a more balanced disk load.

### Goals

Introduce the `replica-auto-balance-disk-pressure-percentage` setting for automatic disk rebalancing when a disk reaches the user-defined pressure threshold. This feature is effective only when `replica-auto-balance` is set to `best-effort`.

Introduce file local sync to copy the replica files directly when there is another running replica on the same node, so the data transfer doesn't need to go through TCP.

### Non-goals [optional]

This enhancement does not encompass auto-balancing replicas across disks on different nodes.

## Proposal

### User Stories

#### Story 1

Before: Users manually balance replicas when disk space pressure arises due to volume growth.

After: Users can set a disk pressure threshold, enabling Longhorn to automatically rebuild replicas to another disk with more available storage on the same node.

### User Experience In Detail

1. Cluster nodes have multiple schedulable disks.
1. Set `replica-auto-balance` to `best-effort`.
1. Set `replica-soft-anti-affinity` to `enabled`.

   > Member: We only have node, zone and disk anti affinity. What's replica affinity here?

1. Define the threshold triggering automatic disk rebalance with `replica-auto-balance-disk-pressure-percentage`.
1. Create a workload with a Longhorn volume.
1. When a disk reaches the threshold percentage, observe replicas being rebuilt to other disks with more available storage on the same node.

### API changes

## Design

### Introduce `replica-auto-balance-disk-pressure-percentage` setting

```go
SettingDefinitionReplicaAutoBalanceDiskPressurePercentage = SettingDefinition{
    DisplayName: "Replica Auto Balance Disk Limit Percentage",
    Description: "The disk pressure percentage specifies the percentage of disk space used on a node that will trigger a replica auto-balance.",
    Category:    SettingCategoryScheduling,
    Type:        SettingTypeInt,
    Required:    true,
    ReadOnly:    false,
    Default:     "90",
}
```
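
A minimal sketch of how the setting value could be validated as an integer percentage. The helper name and the 0-100 bound are assumptions for illustration, not the existing longhorn-manager validation code.

```go
package main

import (
    "fmt"
    "strconv"
)

// validateDiskPressurePercentage is a hypothetical helper (not the actual
// longhorn-manager validation code) that checks the setting value is an
// integer percentage within [0, 100].
func validateDiskPressurePercentage(value string) error {
    p, err := strconv.Atoi(value)
    if err != nil {
        return fmt.Errorf("value %q is not an integer: %v", value, err)
    }
    if p < 0 || p > 100 {
        return fmt.Errorf("value %v must be between 0 and 100", p)
    }
    return nil
}

func main() {
    fmt.Println(validateDiskPressurePercentage("90"))  // <nil>
    fmt.Println(validateDiskPressurePercentage("150")) // out-of-range error
}
```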

#### Determine replica count for best-effort auto-balancing

- Balance replicas across nodes and zones before proceeding to node disk balancing.
- Check whether the disk is under pressure and whether another disk on the same node has more actual available storage; if so, return the adjustment count and include that node in the candidate node list (see the sketch below).
- Select the first node from the candidate node list for scheduling a new replica.
- The scheduler allocates the replica to the disk with the most actual available storage, following the current implementation.
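
Below is a minimal Go sketch of the pressure check and candidate-disk selection described above. The `Disk` struct, helper names, and the exact pressure formula (used space as a percentage of capacity, with actual available storage defined as available minus reserved) are simplified assumptions, not the longhorn-manager types.

```go
package main

import "fmt"

// Disk is a simplified stand-in for a node disk's status.
type Disk struct {
    Name             string
    StorageMaximum   int64 // total disk capacity
    StorageAvailable int64 // free space reported for the disk
    StorageReserved  int64 // space reserved via the storage-reserved setting
}

// actualAvailable is the usable free space: available minus reserved.
func actualAvailable(d Disk) int64 {
    return d.StorageAvailable - d.StorageReserved
}

// isUnderPressure returns true when used space exceeds the
// replica-auto-balance-disk-pressure-percentage threshold.
func isUnderPressure(d Disk, thresholdPercent int64) bool {
    used := d.StorageMaximum - d.StorageAvailable
    return used*100 >= d.StorageMaximum*thresholdPercent
}

// findLessPressuredDisk returns the disk on the same node with the most
// actual available storage, provided it has more than the pressured disk.
func findLessPressuredDisk(pressured Disk, nodeDisks []Disk) (Disk, bool) {
    var best Disk
    found := false
    for _, d := range nodeDisks {
        if d.Name == pressured.Name {
            continue
        }
        if actualAvailable(d) > actualAvailable(pressured) &&
            (!found || actualAvailable(d) > actualAvailable(best)) {
            best, found = d, true
        }
    }
    return best, found
}

func main() {
    disks := []Disk{
        {Name: "disk-1", StorageMaximum: 100, StorageAvailable: 8, StorageReserved: 2},
        {Name: "disk-2", StorageMaximum: 100, StorageAvailable: 60, StorageReserved: 10},
    }
    if isUnderPressure(disks[0], 90) {
        if d, ok := findLessPressuredDisk(disks[0], disks); ok {
            fmt.Printf("rebuild one replica from %s to %s\n", disks[0].Name, d.Name)
        }
    }
}
```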

#### Cleanup auto-balanced replicas

- Maintain the current implementation to identify preferred replica candidates for deletion.
- With the preferred replica candidates, sort them by the actual available storage of their disks; if multiple disks have the same actual available storage, further sort by name (see the sketch after this list). Actual available storage is calculated as:
  ```
  storage available - storage reserved
  ```
- Delete the first replica from the sorted list.
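
A minimal sketch of the candidate ordering described above, using simplified stand-in types. Sorting ascending so that the replica on the most pressured disk comes first is an assumption for illustration; the real implementation operates on replica and disk CRs in longhorn-manager.

```go
package main

import (
    "fmt"
    "sort"
)

// replicaCandidate is a simplified stand-in for a deletion candidate and
// the storage status of the disk it lives on.
type replicaCandidate struct {
    Name             string
    StorageAvailable int64
    StorageReserved  int64
}

func actualAvailable(c replicaCandidate) int64 {
    return c.StorageAvailable - c.StorageReserved
}

// sortCandidates orders candidates by actual available storage, breaking
// ties by replica name. Ascending order is assumed here so that the first
// entry is the preferred deletion candidate.
func sortCandidates(candidates []replicaCandidate) {
    sort.Slice(candidates, func(i, j int) bool {
        ai, aj := actualAvailable(candidates[i]), actualAvailable(candidates[j])
        if ai != aj {
            return ai < aj
        }
        return candidates[i].Name < candidates[j].Name
    })
}

func main() {
    candidates := []replicaCandidate{
        {Name: "r-b", StorageAvailable: 30, StorageReserved: 10},
        {Name: "r-a", StorageAvailable: 25, StorageReserved: 5},
    }
    sortCandidates(candidates)
    fmt.Println(candidates[0].Name) // tie on available storage, so "r-a" wins by name
}
```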

### Local disk replica migration

#### Longhorn manager

During replica rebuilding, check whether there is another running replica of the volume on the same node. If there is, record the source and target file directories in `rpc.EngineReplicaAddRequest.LocalSync`.
```go
localSync = &etypes.FileLocalSync{
    Source: filepath.Join(nodeReplica.Spec.DiskPath, "replicas", nodeReplica.Spec.DataDirectoryName),
    Target: filepath.Join(targetReplica.Spec.DiskPath, "replicas", targetReplica.Spec.DataDirectoryName),
}
```

#### Longhorn instance manager

1. Introduce `EngineReplicaLocalSync` to `EngineReplicaAddRequest` message in `proxy.proto`.
```proto
message EngineReplicaAddRequest {
    ProxyEngineRequest proxy_engine_request = 1;

    string replica_address = 2;
    bool restore = 3;
    int64 size = 4;
    int64 current_size = 5;
    bool fast_sync = 6;
    int32 file_sync_http_client_timeout = 7;
    string replica_name = 8;
    EngineReplicaLocalSync local_sync = 9;
}

message EngineReplicaLocalSync {
    string source = 1;
    string target = 2;
}
```
1. The proxy server is then responsible for proxying `EngineReplicaAddRequest.LocalSync` to the sync-agent's `FilesSyncRequest.LocalSync` (see the sketch below).
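
A rough sketch of that forwarding step. The plain Go structs stand in for the generated protobuf types, and the helper function is hypothetical; it only illustrates that the source and target directories are carried through unchanged.

```go
package main

import "fmt"

// Stand-ins for the generated protobuf types defined above.
type EngineReplicaLocalSync struct {
    Source string
    Target string
}

type FileLocalSync struct {
    Source string
    Target string
}

// toFileLocalSync carries the source/target directories through unchanged;
// a nil input means the replica should be synced over TCP as before.
func toFileLocalSync(in *EngineReplicaLocalSync) *FileLocalSync {
    if in == nil {
        return nil
    }
    return &FileLocalSync{Source: in.Source, Target: in.Target}
}

func main() {
    // Example directories for illustration only.
    localSync := &EngineReplicaLocalSync{
        Source: "/var/lib/longhorn/replicas/pvc-1-abc",
        Target: "/mnt/disk2/replicas/pvc-1-def",
    }
    fmt.Printf("%+v\n", toFileLocalSync(localSync))
}
```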

#### Longhorn engine

1. Introduce `FileLocalSync` to the `FilesSyncRequest` message in `syncagent.proto`.
```proto
message FilesSyncRequest {
    string from_address = 1;
    string to_host = 2;
    repeated SyncFileInfo sync_file_info_list = 3;
    bool fast_sync = 4;
    int32 file_sync_http_client_timeout = 5;
    FileLocalSync local_sync = 6;
}

message FileLocalSync {
    string source = 1;
    string target = 2;
}
```
1. The sync-agent server is then responsible for copying the files listed in `req.SyncFileInfoList` from the source to the target directory when `FilesSyncRequest.LocalSync` is provided.

   > Contributor: How exactly are the files copied? I have experienced some strange behavior when experimenting with copying replica sparse files before, even when using options meant to preserve sparseness. The checksums always matched (and IIRC matched even if sparseness was not preserved), but the actual space usage differed by a little bit between the source and target file. I think the exact behavior varied between copy utilities (e.g. rsync vs cp, etc.).

   > Member: Can we leverage https://github.com/longhorn/sparse-tools/blob/master/sparse/file.go or even improve it to support sparse file copy?

1. When the files fail to sync locally, the sync-agent server removes the files created in the target directory, then falls back to syncing the files via the TCP transfer (see the sketch below).
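
A rough sketch of this local-sync-with-fallback flow. The `localCopy` and `syncFiles` helpers and their signatures are hypothetical, not the actual sync-agent API, and a real implementation would need to preserve sparseness (for example via longhorn/sparse-tools, as suggested in the review comments above) rather than using a plain `io.Copy`.

```go
package main

import (
    "fmt"
    "io"
    "os"
    "path/filepath"
)

// fileLocalSync mirrors the FileLocalSync message for illustration.
type fileLocalSync struct {
    Source string
    Target string
}

// localCopy copies the listed files from the source to the target replica
// directory. A real implementation would preserve sparseness.
func localCopy(sync fileLocalSync, fileNames []string) error {
    for _, name := range fileNames {
        src, err := os.Open(filepath.Join(sync.Source, name))
        if err != nil {
            return err
        }
        dst, err := os.Create(filepath.Join(sync.Target, name))
        if err != nil {
            src.Close()
            return err
        }
        _, err = io.Copy(dst, src)
        src.Close()
        dst.Close()
        if err != nil {
            return err
        }
    }
    return nil
}

// syncFiles tries the local copy first; on failure it removes the partially
// copied files and falls back to the existing TCP-based sync path.
func syncFiles(sync *fileLocalSync, fileNames []string, remoteSync func() error) error {
    if sync != nil {
        if err := localCopy(*sync, fileNames); err == nil {
            return nil
        }
        for _, name := range fileNames {
            os.Remove(filepath.Join(sync.Target, name))
        }
    }
    return remoteSync()
}

func main() {
    err := syncFiles(nil, []string{"volume-head-000.img"}, func() error {
        fmt.Println("falling back to TCP sync")
        return nil
    })
    fmt.Println(err)
}
```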

### Test plan

- Validate that replicas are automatically rebalanced to alleviate disk pressure within the same node.
- Verify that disk auto-balancing does not occur when all disks on the replica nodes are under pressure.

### Upgrade strategy

`N/A`

## Note [optional]

`None`