DRA: add e2e_node tests #124608

Open
pohly opened this issue Apr 29, 2024 · 4 comments · May be fixed by #124617
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

pohly commented Apr 29, 2024

What would you like to be added?

#124323 reminded us that test coverage of checkpointing and the interaction between kubelet and DRA drivers could be improved. We can do error injection and/or add delays by extending

func (d *Driver) interceptor(nodename string, ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
    d.mutex.Lock()
    defer d.mutex.Unlock()

    // Count every gRPC call per node and method, and fail it if the test
    // asked for that method to fail.
    m := MethodInstance{nodename, info.FullMethod}
    d.callCounts[m]++
    if d.fail[m] {
        return nil, errors.New("injected error")
    }

    return handler(ctx, req)
}
such that a test can add callbacks. This needs to be connected to the NewDriver call in the E2E tests, perhaps by extending the configureResources call so that it configures both the resources and the driver.
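A minimal sketch of what such a hook could look like; the MethodCallback type, the SetCallback helper, and the callbacks field are made-up names for illustration, not the existing test driver API:

// MethodCallback runs before the real gRPC handler. Returning an error
// injects a failure, blocking inside the callback adds a delay, and
// returning nil lets the call proceed normally.
type MethodCallback func(ctx context.Context, req interface{}) error

// SetCallback registers a callback for one method on one node.
func (d *Driver) SetCallback(nodename, fullMethod string, cb MethodCallback) {
    d.mutex.Lock()
    defer d.mutex.Unlock()
    if d.callbacks == nil {
        d.callbacks = make(map[MethodInstance]MethodCallback)
    }
    d.callbacks[MethodInstance{nodename, fullMethod}] = cb
}

func (d *Driver) interceptor(nodename string, ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
    d.mutex.Lock()
    m := MethodInstance{nodename, info.FullMethod}
    d.callCounts[m]++
    cb := d.callbacks[m]
    failed := d.fail[m]
    d.mutex.Unlock()

    // The existing fail map keeps working; a callback gives tests more control.
    if failed {
        return nil, errors.New("injected error")
    }
    if cb != nil {
        if err := cb(ctx, req); err != nil {
            return nil, err
        }
    }
    return handler(ctx, req)
}

Releasing the mutex before invoking the callback and handler (instead of the current defer) matters once callbacks are allowed to block: otherwise a deliberately delayed call would serialize all other gRPC traffic through the driver.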

Here are some test scenarios that may be relevant; rough code sketches for two of them follow the respective lists below. In all cases the test must ensure that NodePrepareResources succeeds before running a pod and that NodeUnprepareResources succeeds for all claims which were previously prepared (not called out explicitly below). All pods must stop running.

Retry NodePrepareResources

  • single claim, single driver
  • NodePrepareResources fails until allowed to succeed
  • wait until NodePrepareResources has failed at least once
  • ensure that pod does not get started for a certain duration
  • allow NodePrepareResources to succeed
  • wait for pod to run
  • kill pod
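A rough Ginkgo sketch of this "Retry NodePrepareResources" scenario, placed inside the suite's existing Describe/Context block and assuming the hypothetical SetCallback hook from above; nodeName, createPodWithClaim, podIsRunning, and the exact gRPC method string are likewise placeholders for whatever the e2e_node suite ends up providing:

// Assumes the usual imports: context, errors, sync/atomic, time, ginkgo/v2, gomega.

// Exact name depends on the kubelet plugin API version in use.
const nodePrepareResourcesMethod = "/v1alpha3.Node/NodePrepareResources"

ginkgo.It("retries NodePrepareResources until it succeeds", func(ctx context.Context) {
    var calls atomic.Int32
    unblock := make(chan struct{})
    driver.SetCallback(nodeName, nodePrepareResourcesMethod,
        func(ctx context.Context, req interface{}) error {
            calls.Add(1)
            select {
            case <-unblock:
                return nil // now allowed to succeed
            default:
                return errors.New("injected error")
            }
        })

    pod := createPodWithClaim(ctx, f) // single claim, single driver

    // Wait until NodePrepareResources has failed at least once.
    gomega.Eventually(ctx, func() int32 { return calls.Load() }).
        WithTimeout(time.Minute).Should(gomega.BeNumerically(">=", 1))

    // The pod must not start while the call keeps failing.
    gomega.Consistently(ctx, func(ctx context.Context) bool {
        return podIsRunning(ctx, f, pod)
    }).WithTimeout(20 * time.Second).Should(gomega.BeFalse())

    // Allow NodePrepareResources to succeed, then the pod must run.
    close(unblock)
    gomega.Eventually(ctx, func(ctx context.Context) bool {
        return podIsRunning(ctx, f, pod)
    }).WithTimeout(5 * time.Minute).Should(gomega.BeTrue())

    // Killing the pod and verifying NodeUnprepareResources follows the same
    // pattern with the unprepare method instead.
})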

Retry NodeUnprepareResources

  • single claim, single driver
  • run pod
  • kill pod
  • NodeUnprepareResources fails until allowed to succeed
  • wait until NodeUnprepareResources has failed at least once
  • ensure that pod is not marked as completed for a certain duration
  • allow NodeUnprepareResources to succeed

Retry NodePrepareResources after restart

As above, but now also restart kubelet before allowing NodePrepareResources to succeed.

Retry NodeUnprepareResources after restart

Same for NodeUnprepareResources.

Partial success for NodePrepareResources

  • single claim, two drivers for that claim
  • NodePrepareResources for driver A succeeds, the one for driver B fails.
  • Rest of the test as for "Retry NodePrepareResources".

Partial success for NodeUnprepareResources

Same for NodeUnprepareResources.

Partial success for NodePrepareResources with restart

Restart kubelet while the NodePrepareResources call for driver B is running.

Partial success for NodeUnprepareResources with restart

Restart kubelet while the NodeUnprepareResources call for driver B is running.

Pod deletion during NodePrepareResources

  • single claim, single driver
  • stop kubelet while NodePrepareResources is running
  • (force-)delete pod
  • ensure that pod is truly gone
  • restart kubelet
  • ensure that NodePrepareResources does not get called again for a certain period of time
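The final check of this scenario can lean on the driver's per-method call counts; a small sketch, where restartKubelet and driver.CallCount are hypothetical helpers and nodePrepareResourcesMethod is the constant from the sketch above:

// Kubelet was stopped while NodePrepareResources was running and the pod has
// been force-deleted. Record the count before restarting kubelet, then verify
// that the checkpointed claim does not trigger another NodePrepareResources call.
before := driver.CallCount(nodeName, nodePrepareResourcesMethod)
restartKubelet(ctx)
gomega.Consistently(ctx, func() int64 {
    return driver.CallCount(nodeName, nodePrepareResourcesMethod)
}).WithTimeout(30 * time.Second).Should(gomega.Equal(before))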

Pod deletion during NodeUnprepareResources

  • single claim, single driver
  • stop kubelet while NodeUnprepareResources is running
  • (force-)delete pod
  • ensure that pod is truly gone
  • restart kubelet
  • ensure that NodeUnprepareResources gets called

/cc @klueska
/assign @bart0sh
/sig node

Why is this needed?

Better test coverage.

@pohly pohly added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 29, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
bart0sh commented Apr 29, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
@bart0sh bart0sh linked a pull request Apr 29, 2024 that will close this issue
bart0sh commented Apr 29, 2024

@pohly I decided to first create working test cases in the simplest possible way and then modify deploy.go as you've suggested. I looked into how to do that and found it complex enough to skip for now. Any help with this would be appreciated. PTAL

pohly commented Apr 30, 2024

I noticed that e2e_node doesn't actually use deploy.go, so it's a bit different. See bart0sh#20 for a potential solution.

bart0sh commented May 2, 2024

@pohly The simple approach I used here works for me so far and makes it possible to implement all of the test cases described in this issue.
