[BUG] Disks become over provisioned when Storage Over Provisioning Percentage is set to 100 #8450

Open
hoo29 opened this issue Apr 26, 2024 · 9 comments
Labels
area/volume-replica-scheduling: Volume replica scheduling related
investigation-needed: Need to identify the case before estimating and starting the development
kind/bug
priority/0: Must be fixed in this release (managed by PO)
reproduce/rare: < 50% reproducible
require/backport: Require backport. Only used when the specific versions to backport have not been defined.
require/qa-review-coverage: Require QA to review coverage
Milestone
v1.7.0

Comments

@hoo29

hoo29 commented Apr 26, 2024

Describe the bug

With Harvester 1.3.0 and Longhorn 1.6.0, we have observed several disks becoming over provisioned despite Storage Over Provisioning Percentage being set to 100.

To Reproduce

Theoretical steps (we haven't reproduced)

  1. Provision enough VMs in harvester to nearly saturate all available disk space.
  2. Delete the VMs but don't delete their volumes (in terraform have auto_delete set to false)
  3. Create all the VMs again.

Expected behavior

Disks do not become over provisioned and VMs fail to schedule if there isn't enough storage.

Support bundle for troubleshooting

Please do not post any URLs or VM details from the bundle to this issue.

Sent to longhorn-support-bundle@suse.com.

Environment

  • Longhorn version: 1.6.0
  • Impacted volume (PV):
    harvester-node-2 disk /var/lib/harvester/extra-disks/0e75b3ff4813c3cae0f71a1e9f3ac893
    harvester-node-5 disk /var/lib/harvester/extra-disks/89ade731face5f52e750ade464ca09bc
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Harvester 1.3.0 ISO install
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: The RKE2 bundled with Harvester 1.3.0
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: SLE Micro 5.4
    • Kernel version: 5.14.21-150400.24.108-default
    • CPU per node: Xeon Gold 5320T, 20 CPU Cores
    • Memory per node: 384GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe SSD
    • Network bandwidth between the nodes (Gbps): 1 Gbps (test environment)
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 83 - After clean up of dangling volumes
@hoo29 hoo29 added the kind/bug, require/backport, and require/qa-review-coverage labels on Apr 26, 2024
@james-munson
Contributor

james-munson commented Apr 26, 2024

Looking. The first thing I notice is that in the Harvester settings, yamls/cluster/harvesterhci.io/v1beta1/settings.yaml, there is an overcommit-config with a storage value of 200:

- apiVersion: harvesterhci.io/v1beta1
  default: '{"cpu":1600,"memory":150,"storage":200}'
  kind: Setting
  metadata:
    ...
    name: overcommit-config
    resourceVersion: "13529643"
    uid: fa02c2bd-f32b-4872-b803-112aec13351d
  status: {}

I suspect that may be the cause of the observed behavior, so changing that setting seems like a reasonable workaround. I have not tried it myself, and I cannot say what would happen if the system is already over 100% when the config is changed.
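For anyone who wants to try it, a minimal sketch of the change is below. It assumes the Harvester Setting resource honors a value field that overrides the default shown above (please verify against the Harvester documentation); the cpu and memory numbers are simply carried over from that default:

apiVersion: harvesterhci.io/v1beta1
kind: Setting
metadata:
  name: overcommit-config
# Assumption: "value" overrides "default". Only the storage percentage is
# changed here, so that it matches Longhorn's 100% over-provisioning setting.
value: '{"cpu":1600,"memory":150,"storage":100}'

Something like kubectl edit settings.harvesterhci.io overcommit-config should reach the same object, but again, the effect on a system that is already over 100% has not been verified.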

@ejweber
Contributor

ejweber commented Apr 26, 2024

I discussed this with @james-munson offline and took a look at the support bundle. The bundle can't be posted here, but referring to objects only by their generic names:

For one of the overscheduled disks cbe58e05af80439c9320336f1dbb5dfc, there are 12 replicas scheduled. The CreationTimestamp for each replica is somewhat worthless for debugging, since most of them were created as clones of their original during a migration. However, we can see that 11 out of the 12 volumes with scheduled replicas were created during the exact same second (2024-04-17T15:40:44Z):

  • pvc-723ccacf-9491-4e7f-8809-5eee63cb0216
  • pvc-49617dbb-6077-4428-80e6-24000231a0a8
  • pvc-4e19d165-72c8-43d7-8570-7607ee206b23
  • pvc-16fdd3fd-cc3b-496b-b995-1f4458ac1d98
  • pvc-39b8415a-c91a-4a5b-872a-325058fb6815
  • pvc-99700ed4-00c4-4154-b9aa-c4c756649ad6
  • pvc-9943036e-8427-4b48-bb26-e788acfef2e1
  • pvc-b5d7b3c3-a57b-493b-a33d-d4ce528a6789
  • pvc-b4040b28-58fe-4b0c-9e43-b73dfe74c050
  • pvc-cba8733b-573a-4355-8e9c-4efae44116b2
  • pvc-d07ae34f-7bb7-42f8-ae59-dbc2d893b65e

@PhanLe1010 and I (and probably others) discussed this while I was working on #8043, but we need a followup ticket for it. That issue was more specific, but in general, we think Longhorn is vulnerable to accidental overscheduling if it is scheduling replicas for multiple volumes simultaneously.

The general flow of the replica scheduling is:

  • The volume controller that owns a volume decides to schedule a replica.
  • The volume controller looks at all the nodes to see where the replica can be scheduled.
  • The volume controller decides on a node to schedule the replica to.
  • The volume controller updates the replica with its decision.
  • The node controller that owns the node later looks at the replicas and realizes that a new replica has been scheduled.
  • The node controller updates the node to reflect it.
  • The next volume controller that schedules a replica sees the updated node information.

Now, imagine two different volume controllers are scheduling for two different volumes at the same time.

  • Both volume controllers see the same nodes.
  • Both volume controllers think a particular disk on a particular node is a good scheduling choice.
  • Both volume controllers decide to schedule a replica to that disk.
  • Both volume controllers update their own replicas. There is no conflict, because they are scheduling different replicas for different volumes.
  • Later, the node controller updates the node to reflect both replicas were scheduled.
  • The node controller reports an overscheduling, but it is just informational.

In summary, I believe this happened as a result of many volumes being created simultaneously. We need to improve Longhorn replica scheduling to ensure it cannot happen.
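To make the race concrete, here is a made-up sketch of the two replica objects it can produce. The names, sizes, and disk identifier are hypothetical, and the field names follow my reading of the longhorn.io Replica CRD, so treat this as an illustration rather than bundle output:

# Two replicas belonging to different volumes end up on the same disk because
# both volume controllers read the same stale disk usage from the node.
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  name: pvc-volume-a-r-11111111          # hypothetical replica of volume A
  namespace: longhorn-system
spec:
  nodeID: harvester-node-2
  diskID: hypothetical-disk-uuid         # same disk as below
  volumeSize: "107374182400"
---
apiVersion: longhorn.io/v1beta2
kind: Replica
metadata:
  name: pvc-volume-b-r-22222222          # hypothetical replica of volume B,
  namespace: longhorn-system             # scheduled in the same second
spec:
  nodeID: harvester-node-2
  diskID: hypothetical-disk-uuid         # same disk as above
  volumeSize: "107374182400"

Only after both updates land does the node controller recompute the disk's scheduled storage and notice that the 100% allowance is exceeded.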

@ejweber
Contributor

ejweber commented Apr 26, 2024

We can probably just use this ticket to track the need for an enhancement. When we discussed it previously, the vulnerability was somewhat theoretical. This appears to be a textbook example of it actually occurring.

@ejweber
Contributor

ejweber commented Apr 26, 2024

Workaround to avoid the issue

This issue seems to be quite rare. It is probably because:

  • Many volumes are rarely created simultaneously.
  • Even when they are, not all replicas will be scheduled to the same disk.
  • Even when they are, the replicas may fit anyway.

I am not sure why all the volumes were created simultaneously in this ticket. Perhaps there was some other factor at work.

If your workflow involves creating many volumes simultaneously, it may be best to intentionally slow it down a bit until a fix is implemented (e.g. create one volume, wait a second, then create the next).

Workaround if you have hit the issue

It should be possible to evict individual replicas from the overscheduled disks. Longhorn will find a different disk and move the data. This can be done while the workload is running.
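A rough sketch of what that looks like on the Longhorn Node resource for the affected node (the node and disk names are taken from the examples in this thread; substitute your own):

# e.g. kubectl -n longhorn-system edit nodes.longhorn.io harvester-node-2
spec:
  disks:
    cbe58e05af80439c9320336f1dbb5dfc:   # the over-scheduled disk
      allowScheduling: false            # stop new replicas from landing here
      evictionRequested: true           # ask Longhorn to rebuild the replicas elsewhere

The same toggles should also be available in the Longhorn UI on the node's edit-disk dialog.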

@james-munson
Contributor

The ticket for a general solution (picking a leader from all longhorn-managers to avoid races and conflicts) is #5571.

@hoo29
Author

hoo29 commented Apr 26, 2024

Thank you for the detailed response and workaround. We are creating all of our machines and disks in one go with the harvester terraform provider; we'll add a pause somewhere to help avoid this issue.

Marking the disk as unschedulable and using eviction allowed us to rebalance things.

I'm not clear on how the Harvester overcommit settings factor into this. In Slack, Connor Kuehl said:

Connor Kuehl (2 hours ago):
Yes, Harvester unconditionally overwrites Longhorn's overcommit with the value from Harvester's overcommit settings (storage)

Is it the case that our disks became over-provisioned due to the Harvester overcommit settings, the bug you have described, or both?

@ejweber
Contributor

ejweber commented Apr 26, 2024

This is a good question. My current belief is that:

  • Harvester is incorrectly ignoring its own over-commit setting and not propagating it to Longhorn.
  • Separately, Longhorn overscheduled the disks as a result of the bug I described.

This is because the support bundle clearly shows the over-commit setting with its default of 200 and the Longhorn storage-over-provisioning-percentage setting with a value of 100.

I will ask Connor to take a look at the first part and help decide if it is a Harvester bug.
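For anyone who wants to check their own cluster, these are roughly the two objects to compare (illustrative, trimmed-down output):

# Harvester: kubectl get settings.harvesterhci.io overcommit-config -o yaml
apiVersion: harvesterhci.io/v1beta1
kind: Setting
metadata:
  name: overcommit-config
default: '{"cpu":1600,"memory":150,"storage":200}'
---
# Longhorn: kubectl -n longhorn-system get settings.longhorn.io storage-over-provisioning-percentage -o yaml
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: storage-over-provisioning-percentage
  namespace: longhorn-system
value: "100"

If Harvester really did push its setting down, the Longhorn value would read "200" instead.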

@james-munson
Contributor

james-munson commented Apr 26, 2024

This bug allows Longhorn to over-schedule some disks even though other disks could clearly have been considered. If the Harvester value of 200% were used, Longhorn might still have scheduled the same way, but it would not have reported the disks as over-scheduled later on. Whether they actually are over-committed depends on how much data is written to the volumes over time.

I think we do still have work to do to ensure that the Longhorn and Harvester settings are in step.

@innobead innobead added this to the v1.7.0 milestone Apr 29, 2024
@innobead innobead added the reproduce/rare, priority/0, and area/volume-replica-scheduling labels on Apr 29, 2024
@innobead
Member

cc @derekbit

@innobead innobead added and then removed the investigation-needed label on Apr 29, 2024
@derekbit derekbit added the investigation-needed label on Apr 29, 2024