Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: DRA: fix cache integrity #124323

Merged
merged 7 commits into from
May 13, 2024

Conversation

bart0sh
Copy link
Contributor

@bart0sh bart0sh commented Apr 15, 2024

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

Changed ClaimInfo and claimInfoCache APIs to avoid updating cached claim infos when the cache is not locked.
This should fix possible data race conditions caused by changing cached claim infos.

Added new unit and e2e_node test cases to test cache integrity.

Which issue(s) this PR fixes:

Fixes #123474 #116411

Special notes for your reviewer:

This is a second attempt to fix #123474. See previous PR and this review comment for more details.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 15, 2024
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Apr 15 13:51:40 UTC 2024.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 15, 2024
@bart0sh
Copy link
Contributor Author

bart0sh commented Apr 15, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 15, 2024
@bart0sh
Copy link
Contributor Author

bart0sh commented Apr 15, 2024

/cc @pohly @klueska

pkg/kubelet/cm/dra/claiminfo.go Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
@bart0sh bart0sh force-pushed the PR142-dra-fix-cache-integrity branch from d815a31 to 0120ab2 Compare April 15, 2024 21:46
Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass

pkg/kubelet/cm/dra/claiminfo.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/claiminfo.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/claiminfo.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/manager.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/dra/state/state_checkpoint.go Outdated Show resolved Hide resolved
@bart0sh bart0sh force-pushed the PR142-dra-fix-cache-integrity branch from 0120ab2 to f046a6a Compare April 16, 2024 15:58
@klueska klueska force-pushed the PR142-dra-fix-cache-integrity branch from 6cee1b6 to f24134d Compare May 3, 2024 13:31
// Atomically perform some operations on the claimInfo cache.
err := m.cache.withLock(func() error {
// Delete all claimInfos from the cache that have just been unprepared.
for _, claimName := range claimNamesMap {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klueska Should we loop over response.Claims here?

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 6, 2024
@bart0sh
Copy link
Contributor Author

bart0sh commented May 6, 2024

@klueska I've just pushed e2e changes to this branch. All test cases passed locally:

sudo _output/local/go/bin/e2e_node.test --ginkgo.focus='DRA' --ginkgo.v --report-dir=/tmp/e2e_node.test/logs --node-name=devel --kubelet-flags='--cluster-domain=cluster.local --fail-swap-on=false --cgroup-driver=systemd' --prepull-images=false --container-runtime-endpoint=unix:///run/containerd/containerd.sock --kubelet-config-file=test/e2e_node/jenkins/default-kubelet-config.yaml --runtime-config=api/all=true --service-feature-gates=DynamicResourceAllocation=true --feature-gates=DynamicResourceAllocation=true
  W0506 13:15:08.834491  463773 test_context.go:538] Unable to find in-cluster config, using default host : https://127.0.0.1:6443
  I0506 13:15:08.834555 463773 test_context.go:561] The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: E2eNode Suite - /home/ed/go/src/k8s.io/kubernetes
================================================================
Random Seed: 1714990508 - will randomize all specs

Will run 16 of 585 specs
------------------------------
[SynchronizedBeforeSuite] 
...
[SynchronizedAfterSuite] PASSED [0.214 seconds]
------------------------------
[ReportAfterSuite] Kubernetes e2e JUnit report
/home/ed/go/src/k8s.io/kubernetes/test/e2e/framework/test_context.go:612
[ReportAfterSuite] PASSED [0.020 seconds]
------------------------------

Ran 16 of 585 Specs in 421.418 seconds
SUCCESS! -- 16 Passed | 0 Failed | 0 Pending | 569 Skipped
PASS

@bart0sh bart0sh force-pushed the PR142-dra-fix-cache-integrity branch from d73c603 to c57880e Compare May 6, 2024 13:57
@bart0sh
Copy link
Contributor Author

bart0sh commented May 6, 2024

/test pull-kubernetes-kind-dra
/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

@bart0sh
Copy link
Contributor Author

bart0sh commented May 6, 2024

/test pull-kubernetes-kind-dra
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

3 similar comments
@bart0sh
Copy link
Contributor Author

bart0sh commented May 6, 2024

/test pull-kubernetes-kind-dra
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

@bart0sh
Copy link
Contributor Author

bart0sh commented May 6, 2024

/test pull-kubernetes-kind-dra
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

@bart0sh
Copy link
Contributor Author

bart0sh commented May 6, 2024

/test pull-kubernetes-kind-dra
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

@bart0sh
Copy link
Contributor Author

bart0sh commented May 7, 2024

@klueska @pohly e2e seem to run without flakes. Can you review this PR again, please?

@bart0sh
Copy link
Contributor Author

bart0sh commented May 7, 2024

/assign @pohly

@bart0sh
Copy link
Contributor Author

bart0sh commented May 9, 2024

I don't see any comments, so assuming everything is good and going to lgtm this PR to get it merged in a couple of days.

@bart0sh
Copy link
Contributor Author

bart0sh commented May 11, 2024

/unhold
/lgtm

@k8s-ci-robot
Copy link
Contributor

@bart0sh: you cannot LGTM your own PR.

In response to this:

/unhold
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2024
@bart0sh
Copy link
Contributor Author

bart0sh commented May 11, 2024

@pohly @klueska can you lgtm this PR please?

@bart0sh bart0sh force-pushed the PR142-dra-fix-cache-integrity branch from c57880e to f24134d Compare May 13, 2024 13:11
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bart0sh
Copy link
Contributor Author

bart0sh commented May 13, 2024

@klueska Moved new e2e_node test cases back to #124617 for further review. Please lgtm this PR, thanks.

@klueska
Copy link
Contributor

klueska commented May 13, 2024

Thanks @bart0sh
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f2cf81dedebf7d9e09a8622fbb810aa36b80e412

@bart0sh
Copy link
Contributor Author

bart0sh commented May 13, 2024

/retest

@k8s-ci-robot k8s-ci-robot merged commit 8352c09 into kubernetes:master May 13, 2024
18 checks passed
SIG Node PR Triage automation moved this from Needs Reviewer to Done May 13, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/provider/gcp Issues or PRs related to gcp provider area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

DRA: kubelet dies with "concurrent map iteration and map write"
4 participants