DRA: add e2e_node tests #124608

Open
pohly opened this issue Apr 29, 2024 · 4 comments · May be fixed by #124617
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

pohly commented Apr 29, 2024

What would you like to be added?

#124323 reminded us that test coverage of checkpointing and the interaction between kubelet and DRA drivers could be improved. We can do error injection and/or add delays by extending

func (d *Driver) interceptor(nodename string, ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
    d.mutex.Lock()
    defer d.mutex.Unlock()

    // Count every gRPC call per node and method, and fail it if the test
    // asked for that method to fail.
    m := MethodInstance{nodename, info.FullMethod}
    d.callCounts[m]++
    if d.fail[m] {
        return nil, errors.New("injected error")
    }

    return handler(ctx, req)
}
such that a test can add callbacks. This needs to be connected to the NewDriver call in the E2E tests, perhaps by extending the configureResources call so that it configures both the resources and the driver.
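A minimal sketch of what such a hook could look like; the MethodCallback type, the SetCallback helper, and the callbacks field are made-up names for illustration, not the existing test driver API:

// MethodCallback runs before the real gRPC handler. Returning an error
// injects a failure, blocking inside the callback adds a delay, and
// returning nil lets the call proceed normally.
type MethodCallback func(ctx context.Context, req interface{}) error

// SetCallback registers a callback for one method on one node.
func (d *Driver) SetCallback(nodename, fullMethod string, cb MethodCallback) {
    d.mutex.Lock()
    defer d.mutex.Unlock()
    if d.callbacks == nil {
        d.callbacks = make(map[MethodInstance]MethodCallback)
    }
    d.callbacks[MethodInstance{nodename, fullMethod}] = cb
}

func (d *Driver) interceptor(nodename string, ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
    d.mutex.Lock()
    m := MethodInstance{nodename, info.FullMethod}
    d.callCounts[m]++
    cb := d.callbacks[m]
    failed := d.fail[m]
    d.mutex.Unlock()

    // The existing fail map keeps working; a callback gives tests more control.
    if failed {
        return nil, errors.New("injected error")
    }
    if cb != nil {
        if err := cb(ctx, req); err != nil {
            return nil, err
        }
    }
    return handler(ctx, req)
}

Releasing the mutex before invoking the callback and handler (instead of the current defer) matters once callbacks are allowed to block: otherwise a deliberately delayed call would serialize all other gRPC traffic through the driver.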

Here are some test scenarios that may be relevant; rough code sketches for two of them follow the respective lists below. In all cases the test must ensure that NodePrepareResources succeeds before running a pod and that NodeUnprepareResources succeeds for all claims which were previously prepared (not called out explicitly below). All pods must stop running.

Retry NodePrepareResources

  • single claim, single driver
  • NodePrepareResources fails until allowed to succeed
  • wait until NodePrepareResources has failed at least once
  • ensure that pod does not get started for a certain duration
  • allow NodePrepareResources to succeed
  • wait for pod to run
  • kill pod
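A rough Ginkgo sketch of this "Retry NodePrepareResources" scenario, placed inside the suite's existing Describe/Context block and assuming the hypothetical SetCallback hook from above; nodeName, createPodWithClaim, podIsRunning, and the exact gRPC method string are likewise placeholders for whatever the e2e_node suite ends up providing:

// Assumes the usual imports: context, errors, sync/atomic, time, ginkgo/v2, gomega.

// Exact name depends on the kubelet plugin API version in use.
const nodePrepareResourcesMethod = "/v1alpha3.Node/NodePrepareResources"

ginkgo.It("retries NodePrepareResources until it succeeds", func(ctx context.Context) {
    var calls atomic.Int32
    unblock := make(chan struct{})
    driver.SetCallback(nodeName, nodePrepareResourcesMethod,
        func(ctx context.Context, req interface{}) error {
            calls.Add(1)
            select {
            case <-unblock:
                return nil // now allowed to succeed
            default:
                return errors.New("injected error")
            }
        })

    pod := createPodWithClaim(ctx, f) // single claim, single driver

    // Wait until NodePrepareResources has failed at least once.
    gomega.Eventually(ctx, func() int32 { return calls.Load() }).
        WithTimeout(time.Minute).Should(gomega.BeNumerically(">=", 1))

    // The pod must not start while the call keeps failing.
    gomega.Consistently(ctx, func(ctx context.Context) bool {
        return podIsRunning(ctx, f, pod)
    }).WithTimeout(20 * time.Second).Should(gomega.BeFalse())

    // Allow NodePrepareResources to succeed, then the pod must run.
    close(unblock)
    gomega.Eventually(ctx, func(ctx context.Context) bool {
        return podIsRunning(ctx, f, pod)
    }).WithTimeout(5 * time.Minute).Should(gomega.BeTrue())

    // Killing the pod and verifying NodeUnprepareResources follows the same
    // pattern with the unprepare method instead.
})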

Retry NodeUnprepareResources

  • single claim, single driver
  • run pod
  • kill pod
  • NodeUnprepareResources fails until allowed to succeed
  • wait until NodeUnprepareResources has failed at least once
  • ensure that pod is not marked as completed for a certain duration
  • allow NodeUnprepareResources to succeed

Retry NodePrepareResources after restart

As above, but now also restart kubelet before allowing NodePrepareResources to succeed.

Retry NodeUnprepareResources after restart

Same for NodeUnprepareResources.

Partial success for NodePrepareResources

  • single claim, two drivers for that claim
  • NodePrepareResources for driver A succeeds, the one for driver B fails.
  • Rest of the test as for "Retry NodePrepareResources".

Partial success for NodeUnprepareResources

Same for NodeUnprepareResources.

Partial success for NodePrepareResources with restart

Restart kubelet while the NodePrepareResources call for driver B is running.

Partial success for NodeUnprepareResources with restart

Restart kubelet while the NodeUnprepareResources call for driver B is running.

Pod deletion during NodePrepareResources

  • single claim, single driver
  • stop kubelet while NodePrepareResources is running
  • (force-)delete pod
  • ensure that pod is truly gone
  • restart kubelet
  • ensure that NodePrepareResources does not get called again for a certain period of time
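The final check of this scenario can lean on the driver's per-method call counts; a small sketch, where restartKubelet and driver.CallCount are hypothetical helpers and nodePrepareResourcesMethod is the constant from the sketch above:

// Kubelet was stopped while NodePrepareResources was running and the pod has
// been force-deleted. Record the count before restarting kubelet, then verify
// that the checkpointed claim does not trigger another NodePrepareResources call.
before := driver.CallCount(nodeName, nodePrepareResourcesMethod)
restartKubelet(ctx)
gomega.Consistently(ctx, func() int64 {
    return driver.CallCount(nodeName, nodePrepareResourcesMethod)
}).WithTimeout(30 * time.Second).Should(gomega.Equal(before))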

Pod deletion during NodeUnprepareResources

  • single claim, single driver
  • stop kubelet while NodeUnprepareResources is running
  • (force-)delete pod
  • ensure that pod is truly gone
  • restart kubelet
  • ensure that NodeUnprepareResources gets called

/cc @klueska
/assign @bart0sh
/sig node

Why is this needed?

Better test coverage.

@pohly pohly added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 29, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
bart0sh commented Apr 29, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
@bart0sh bart0sh linked a pull request Apr 29, 2024 that will close this issue
bart0sh commented Apr 29, 2024

@pohly I decided to first create working test cases in the simplest possible way and then modify deploy.go as you've suggested. I looked into how to do that and found it complex enough to skip for now. Any help with this would be appreciated. PTAL

pohly commented Apr 30, 2024

I noticed that e2e_node doesn't actually use deploy.go, so it's a bit different. See bart0sh#20 for a potential solution.

bart0sh commented May 2, 2024

@pohly The simple approach I used here works for me so far and makes it possible to implement all of the test cases described in this issue.
