[BUG] Disks become over provisioned when Storage Over Provisioning Percentage is set to 100 #8450

Open
hoo29 opened this issue Apr 26, 2024 · 9 comments
Labels
area/volume-replica-scheduling: Volume replica scheduling related
investigation-needed: Need to identify the case before estimating and starting the development
kind/bug
priority/0: Must be fixed in this release (managed by PO)
reproduce/rare: < 50% reproducible
require/backport: Require backport. Only used when the specific versions to backport have not been defined.
require/qa-review-coverage: Require QA to review coverage
Milestone
v1.7.0

Comments

@hoo29

hoo29 commented Apr 26, 2024

Describe the bug

With Harvester 1.3.0 and Longhorn 1.6.0, we have observed several disks becoming over provisioned despite Storage Over Provisioning Percentage being set to 100.

To Reproduce

Theoretical steps (we haven't reproduced)

  1. Provision enough VMs in harvester to nearly saturate all available disk space.
  2. Delete the VMs but don't delete their volumes (in terraform have auto_delete set to false)
  3. Create all the VMs again.

Expected behavior

Disks do not become over provisioned and VMs fail to schedule if there isn't enough storage.

Support bundle for troubleshooting

Please do not post any URLs or VM details from the bundle to this issue.

Sent to longhorn-support-bundle@suse.com.

Environment

  • Longhorn version: 1.6.0
  • Impacted volume (PV):
    harvester-node-2 disk /var/lib/harvester/extra-disks/0e75b3ff4813c3cae0f71a1e9f3ac893
    harvester-node-5 disk /var/lib/harvester/extra-disks/89ade731face5f52e750ade464ca09bc
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Harvester 1.3.0 ISO install
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: The RKE2 bundled with Harvester 1.3.0
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: SLE Micro 5.4
    • Kernel version: 5.14.21-150400.24.108-default
    • CPU per node: Xeon Gold 5320T, 20 CPU Cores
    • Memory per node: 384GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe SSD
    • Network bandwidth between the nodes (Gbps): 1 Gbps (test environment)
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 83 - After clean up of dangling volumes
@hoo29 hoo29 added the kind/bug, require/backport, and require/qa-review-coverage labels on Apr 26, 2024
@james-munson
Contributor

james-munson commented Apr 26, 2024

Looking. The first thing I notice is that in the Harvester settings, yamls/cluster/harvesterhci.io/v1beta1/settings.yaml, there is an overcommit-config with a storage value of 200:

- apiVersion: harvesterhci.io/v1beta1
  default: '{"cpu":1600,"memory":150,"storage":200}'
  kind: Setting
  metadata:
    ...
    name: overcommit-config
    resourceVersion: "13529643"
    uid: fa02c2bd-f32b-4872-b803-112aec13351d
  status: {}

I suspect that may be the cause of the observed behavior, so changing that setting seems like a reasonable workaround. I have not tried it myself, and I cannot say what would happen if the system is already over 100% when the config is changed.
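For anyone who wants to try it, a minimal sketch of the change is below. It assumes the Harvester Setting resource honors a value field that overrides the default shown above (please verify against the Harvester documentation); the cpu and memory numbers are simply carried over from that default:

apiVersion: harvesterhci.io/v1beta1
kind: Setting
metadata:
  name: overcommit-config
# Assumption: "value" overrides "default". Only the storage percentage is
# changed here, so that it matches Longhorn's 100% over-provisioning setting.
value: '{"cpu":1600,"memory":150,"storage":100}'

Something like kubectl edit settings.harvesterhci.io overcommit-config should reach the same object, but again, the effect on a system that is already over 100% has not been verified.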

@ejweber
Contributor

ejweber commented Apr 26, 2024

I discussed this with @james-munson offline and took a look at the support bundle. The bundle can't be posted here, but referring to objects only by their generic names:

For one of the overscheduled disks cbe58e05af80439c9320336f1dbb5dfc, there are 12 replicas scheduled. The CreationTimestamp for each replica is somewhat worthless for debugging, since most of them were created as clones of their original during a migration. However, we can see that 11 out of the 12 volumes with scheduled replicas were created during the exact same second (2024-04-17T15:40:44Z):

  • pvc-723ccacf-9491-4e7f-8809-5eee63cb0216
  • pvc-49617dbb-6077-4428-80e6-24000231a0a8
  • pvc-4e19d165-72c8-43d7-8570-7607ee206b23
  • pvc-16fdd3fd-cc3b-496b-b995-1f4458ac1d98
  • pvc-39b8415a-c91a-4a5b-872a-325058fb6815
  • pvc-99700ed4-00c4-4154-b9aa-c4c756649ad6
  • pvc-9943036e-8427-4b48-bb26-e788acfef2e1
  • pvc-b5d7b3c3-a57b-493b-a33d-d4ce528a6789
  • pvc-b4040b28-58fe-4b0c-9e43-b73dfe74c050
  • pvc-cba8733b-573a-4355-8e9c-4efae44116b2
  • pvc-d07ae34f-7bb7-42f8-ae59-dbc2d893b65e

@PhanLe1010 and I (and probably others) discussed this while I was working on #8043, but we need a followup ticket for it. That issue was more specific, but in general, we think Longhorn is vulnerable to accidental overscheduling if it is scheduling replicas for multiple volumes simultaneously.

The general flow of the replica scheduling is:

  • The volume controller that owns a volume decides to schedule a replica.
  • The volume controller looks at all the nodes to see where the replica can be scheduled.
  • The volume controller decides on a node to schedule the replica to.
  • The volume controller updates the replica with its decision.
  • The node controller that owns the node later looks at the replicas and realizes that a new replica has been scheduled.
  • The node controller updates the node to reflect it.
  • The next volume controller that schedules a replica sees the updated node information.

Now, imagine two different volume controllers are scheduling for two different volumes at the same time.

  • Both volume controllers see the same nodes.
  • Both volume controllers think a particular disk on a particular node is a good scheduling choice.
  • Both volume controllers decide to schedule a replica to that disk.
  • Both volume controllers update their own replicas. There is no conflict, because they are scheduling different replicas for different volumes.
  • Later, the node controller updates the node to reflect both replicas were scheduled.
  • The node controller reports an overscheduling, but it is just informational.

In summary, I believe this happened as a result of many volumes being created simultaneously. We need to improve Longhorn replica scheduling to ensure it cannot happen.
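To make the race concrete, here is a made-up sketch of the two replica objects it can produce. The names, sizes, and disk identifier are hypothetical, and the field names follow my reading of the longhorn.io Replica CRD, so treat this as an illustration rather than bundle output:

# Two replicas belonging to different volumes end up on the same disk because
# both volume controllers read the same stale disk usage from the node.
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  name: pvc-volume-a-r-11111111          # hypothetical replica of volume A
  namespace: longhorn-system
spec:
  nodeID: harvester-node-2
  diskID: hypothetical-disk-uuid         # same disk as below
  volumeSize: "107374182400"
---
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  name: pvc-volume-b-r-22222222          # hypothetical replica of volume B,
  namespace: longhorn-system             # scheduled in the same second
spec:
  nodeID: harvester-node-2
  diskID: hypothetical-disk-uuid         # same disk as above
  volumeSize: "107374182400"

Only after both updates land does the node controller recompute the disk's scheduled storage and notice that the 100% allowance is exceeded.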

@ejweber
Contributor

ejweber commented Apr 26, 2024

We can probably just use this ticket to track the need for an enhancement. When we discussed it previously, the vulnerability was somewhat theoretical. This appears to be a textbook example of it actually occurring.

@ejweber
Contributor

ejweber commented Apr 26, 2024

Workaround to avoid the issue

This issue seems to be quite rare. It is probably because:

  • Many volumes are rarely created simultaneously.
  • Even when they are, not all replicas will be scheduled to the same disk.
  • Even when they are, the replicas may fit anyway.

I am not sure why all the volumes were created simultaneously in this ticket. Perhaps there was some other factor at work.

If your workflow involves creating many volumes simultaneously, it may be best to intentionally slow it down a bit until a fix is implemented (e.g. create one volume, wait a second, then create the next).

Workaround if you have hit the issue

It should be possible to evict individual replicas from the overscheduled disks. Longhorn will find a different disk and move the data. This can be done while the workload is running.
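A rough sketch of what that looks like on the Longhorn Node resource for the affected node (the node and disk names are taken from the examples in this thread; substitute your own):

# e.g. kubectl -n longhorn-system edit nodes.longhorn.io harvester-node-2
spec:
  disks:
    cbe58e05af80439c9320336f1dbb5dfc:   # the over-scheduled disk
      allowScheduling: false            # stop new replicas from landing here
      evictionRequested: true           # ask Longhorn to rebuild the replicas elsewhere

The same toggles should also be available in the Longhorn UI on the node's edit-disk dialog.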

@james-munson
Contributor

The ticket for a general solution (picking a leader from all longhorn-managers to avoid races and conflicts) is #5571.

@hoo29
Author

hoo29 commented Apr 26, 2024

Thank you for the detailed response and workaround. We are creating all of our machines and disks in one go with the harvester terraform provider; we'll add a pause somewhere to help avoid this issue.

Marking the disk as unschedulable and using eviction allowed us to rebalance things.

I'm not clear on how the Harvester overcommit settings factor into this. In Slack, Connor Kuehl said:

Connor Kuehl (2 hours ago):
Yes, Harvester unconditionally overwrites Longhorn's overcommit with the value from Harvester's overcommit settings (storage)

Is it the case that our disks became over-provisioned due to the Harvester overcommit settings, the bug you have described, or both?

@ejweber
Contributor

ejweber commented Apr 26, 2024

This is a good question. My current belief is that:

  • Harvester is incorrectly ignoring its own over-commit setting and not propagating it to Longhorn.
  • Separately, Longhorn overscheduled the disks as a result of the bug I described.

This is because the support bundle clearly shows the over-commit setting with its default of 200 and the Longhorn storage-over-provisioning-percentage setting with a value of 100.

I will ask Connor to take a look at the first part and help decide if it is a Harvester bug.
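For anyone who wants to check their own cluster, these are roughly the two objects to compare (illustrative, trimmed-down output):

# Harvester: kubectl get settings.harvesterhci.io overcommit-config -o yaml
apiVersion: harvesterhci.io/v1beta1
kind: Setting
metadata:
  name: overcommit-config
default: '{"cpu":1600,"memory":150,"storage":200}'
---
# Longhorn: kubectl -n longhorn-system get settings.longhorn.io storage-over-provisioning-percentage -o yaml
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: storage-over-provisioning-percentage
  namespace: longhorn-system
value: "100"

If Harvester really did push its setting down, the Longhorn value would read "200" instead.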

@james-munson
Contributor

james-munson commented Apr 26, 2024

This bug allows Longhorn to over-schedule some disks even though other disks could clearly have been considered. If the Harvester value of 200% were used, Longhorn might still have scheduled the same way, but it would not have reported the disks as over-scheduled later on. Whether they actually are over-committed depends on how much data is written to the volumes over time.

I think we do still have work to do to ensure that the Longhorn and Harvester settings are in step.

@innobead innobead added this to the v1.7.0 milestone Apr 29, 2024
@innobead innobead added the reproduce/rare, priority/0, and area/volume-replica-scheduling labels on Apr 29, 2024
@innobead
Member

cc @derekbit

@innobead innobead added and then removed the investigation-needed label on Apr 29, 2024
@derekbit derekbit added the investigation-needed label on Apr 29, 2024