What happened:
During operation of the Volcano Scheduler service, we discovered a bug that leads to a data race on the shared variable `var IgnoredDevicesList []string`. The race results in an unexpected application panic, and the resulting Scheduler termination forcibly kills running user Jobs.

The modification functions used in the informer handlers guard writes to `IgnoredDevicesList` with a mutex (`setNodeOthersResource`, `pkg/scheduler/api/node_info.go`), but there are no read locks, so reads can happen concurrently with modifications (`NewResource`, `pkg/scheduler/api/resource_info.go`; `GetInqueueResource`, `pkg/scheduler/plugins/util/util.go`).

Because of the data race, the Scheduler panics with `panic: runtime error: invalid memory address or nil pointer dereference` and a strange stack trace, leading to a Scheduler restart and forced termination of user Jobs. The stack trace ends in the standard library, in the string comparison function (`strings.Compare(...)`, `strings/compare.go:24`).
What you expected to happen:
There are no data races, and the system behaves deterministically and predictably at all times. An unexpected Scheduler termination does not lead to the termination of all user Jobs.
How to reproduce it:
E2E Reproduction:
1. Build vc-scheduler with the Data Race Detector (`-race`):
`CGO_ENABLED=1 go build -race -o ${BIN_DIR}/vc-scheduler ./cmd/scheduler`
2. Deploy Volcano v1.8.1 on a Kubernetes cluster with multiple nodes
3. In a cluster with a large number of nodes, the informers receive a large number of events as nodes change (e.g., Node Heartbeats). To emulate this, you can constantly update node labels using a script, for example:
Script example
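The original script is not included here; the following is an illustrative sketch of the idea: repeatedly overwrite a throwaway label on every node so the scheduler's node informer receives a constant stream of Update events. The label key `stress-test/heartbeat` and the `KUBECTL` override are my own choices, not taken from the report.

```shell
#!/usr/bin/env bash
# Illustrative node-churn script (assumed, not the original reproduction script).
# Each pass rewrites a throwaway label on every node, generating a steady
# stream of node Update events for the scheduler's informers.
# Override KUBECTL (e.g. KUBECTL=echo) to dry-run without a cluster.
set -euo pipefail

KUBECTL="${KUBECTL:-kubectl}"

# churn_labels <rounds>  -- rounds=0 means loop forever
churn_labels() {
  local rounds="${1:-0}" round=0
  while :; do
    for node in $($KUBECTL get nodes -o name); do
      # --overwrite forces an Update event even though the key already exists
      $KUBECTL label --overwrite "$node" "stress-test/heartbeat=$round"
    done
    round=$((round + 1))
    if [ "$rounds" -gt 0 ] && [ "$round" -ge "$rounds" ]; then
      break
    fi
  done
}
```

Run it as `churn_labels 0` against a test cluster; every pass touches every node, so even a modest cluster produces a high event rate.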
4. Run several instances of user Jobs:
Job example
k apply -f ./jobs/volcano/16
job.batch.volcano.sh/completed-job-0 created
job.batch.volcano.sh/completed-job-1 created
...
job.batch.volcano.sh/completed-job-15 created
5. Look for Data Race Detector warnings in the Scheduler logs:
`k logs -f deploy/volcano-scheduler | grep "WARNING: DATA RACE" -B 4 -A 40`
Output example
[Additionally] Leave the system running in a large cluster with constant rotation of user workloads; with some luck, after 5-30 days you can observe a Scheduler restart caused by the panic:
Localizing the behavior:
`SafeWrite` emulates updating node information in the informer handlers. `UnsafeRead` emulates an unsafe read of `var IgnoredDevicesList []string`. Using the code below, you can get a similar stack trace:
Output example
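The original snippet is not included above; the following minimal sketch reconstructs the access pattern from the description (the function names `SafeWrite`/`UnsafeRead` come from the report, the device names and iteration counts are made up). Building and running it with `go run -race` reports the race between the locked write and the lockless read:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// IgnoredDevicesList mirrors the shared package-level slice in the scheduler.
var IgnoredDevicesList []string

var ignoredDevicesMutex sync.Mutex

// SafeWrite emulates the informer handlers: it replaces the slice while
// holding a mutex, as setNodeOthersResource does.
func SafeWrite(iterations int) {
	for i := 0; i < iterations; i++ {
		ignoredDevicesMutex.Lock()
		IgnoredDevicesList = []string{"example.com/ignored-device-a", "example.com/ignored-device-b"}
		ignoredDevicesMutex.Unlock()
	}
}

// UnsafeRead emulates NewResource/GetInqueueResource: it iterates over the
// slice WITHOUT taking the lock, racing with SafeWrite.
func UnsafeRead(iterations int) int {
	matches := 0
	for i := 0; i < iterations; i++ {
		for _, name := range IgnoredDevicesList { // unsynchronized read
			if strings.Compare(name, "example.com/ignored-device-a") == 0 {
				matches++
			}
		}
	}
	return matches
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); SafeWrite(100000) }()
	go func() { defer wg.Done(); _ = UnsafeRead(100000) }()
	wg.Wait()
	fmt.Println("done")
}
```

Without `-race` the program usually completes quietly, which is exactly why the bug survived in production: the torn read only rarely manifests as a panic.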
Anything else we need to know?:
Additionally, when the Scheduler terminates with such a panic and stack trace, we found that all running user Jobs are forcibly terminated (Running -> Terminated). We do not yet understand why this happens, since ordinary restarts do not cause this behavior. Any ideas❓
I have prepared a PR that fixes the bugs: #3325
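For context, the general pattern for fixing this class of bug is to guard the read path as well as the write path, e.g. with a `sync.RWMutex`. The sketch below is only an illustration of that pattern; the accessor names are hypothetical and not taken from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	IgnoredDevicesList []string
	ignoredDevicesLock sync.RWMutex
)

// SetIgnoredDevices is the write path (informer handlers): exclusive lock.
func SetIgnoredDevices(devices []string) {
	ignoredDevicesLock.Lock()
	defer ignoredDevicesLock.Unlock()
	IgnoredDevicesList = devices
}

// RangeIgnoredDevices is the read path: callers iterate under a shared read
// lock instead of touching the slice directly, so a reader can no longer
// observe a half-written slice header while a writer replaces it.
func RangeIgnoredDevices(fn func(name string)) {
	ignoredDevicesLock.RLock()
	defer ignoredDevicesLock.RUnlock()
	for _, name := range IgnoredDevicesList {
		fn(name)
	}
}

func main() {
	SetIgnoredDevices([]string{"example.com/ignored-device"})
	RangeIgnoredDevices(func(name string) { fmt.Println(name) })
}
```

Multiple readers can hold the `RLock` concurrently, so the hot scheduling path pays almost nothing; only the rare informer update takes the exclusive lock.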