Deflake watchcache tests #124610

Merged
merged 1 commit into from Apr 30, 2024

Conversation

wojtek-t
Member

Found when working on kubernetes/enhancements#4568

NONE

/kind flake
/priority important-longterm
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver labels Apr 29, 2024
- ResourceVersion: "0",
+ // Limit is ignored when ResourceVersion is set to 0.
+ // Set it to consistent read.
+ ResourceVersion: "",
Member Author

The failures from before this PR could be easily triggered by adding time.Sleep(time.Second) in appropriate tests (cacher_test.go) after creating watchcache but before starting the test.

Contributor

Why does adding a sleep cause the test to fail?
By changing the RV to consistent read, the list call will be delegated to the underlying storage. I assume this was what this test intended to do, right?

Member Author

Without the sleep, the list happens before the watchcache is initialized, so it is delegated to the underlying storage.
Once the watchcache is initialized, this test fails, because the watchcache ignores Limit for RV=0.

We know that for RV=0 the limit is ignored when the request is served by the watchcache. So yes, we want to test other RVs, where it doesn't matter whether the request is delegated or not; the result is what matters (but yes, it's delegated).
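
The behavior described above can be modeled in a few lines. This is only a toy sketch, not the real cacher: `toyCacher`, `listOptions`, and `list` are hypothetical names, and the model assumes just what the thread states (RV=0 is served from the cache once it is ready and then ignores Limit; everything else goes to storage, which honors Limit).

```go
package main

import "fmt"

// listOptions mirrors only the two fields discussed in this thread.
type listOptions struct {
	ResourceVersion string
	Limit           int
}

// toyCacher is a hypothetical stand-in for a watch cache in front of storage.
type toyCacher struct {
	ready bool     // true once the cache has finished initializing
	items []string // objects visible to both the cache and the backing store
}

// list returns the item names and whether the cache served the request.
func (c *toyCacher) list(opts listOptions) (names []string, fromCache bool) {
	if opts.ResourceVersion == "0" && c.ready {
		// Served from the cache: Limit is ignored and everything is returned.
		return append([]string(nil), c.items...), true
	}
	// Delegated to the backing store, which honors Limit.
	n := len(c.items)
	if opts.Limit > 0 && opts.Limit < n {
		n = opts.Limit
	}
	return append([]string(nil), c.items[:n]...), false
}

func main() {
	c := &toyCacher{items: []string{"a", "b", "c"}}
	rv0 := listOptions{ResourceVersion: "0", Limit: 1}
	consistent := listOptions{ResourceVersion: "", Limit: 1}

	// Flipping ready plays the role of the time.Sleep mentioned earlier.
	for _, ready := range []bool{false, true} {
		c.ready = ready
		got0, _ := c.list(rv0)
		gotC, _ := c.list(consistent)
		fmt.Printf("ready=%v: rv=0 returned %d items, rv=\"\" returned %d items\n",
			ready, len(got0), len(gotC))
	}
}
```

In this model the RV=0 result depends on whether the cache happened to be ready, which is the source of the flake, while the consistent read gives the same answer either way.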

Contributor

Do we have a test for RV=0 that checks whether the limit is ignored by the watchcache?

Member Author

I'm sure we had one but I can't find any now. We should add that as a follow-up (though I wouldn't block this PR on it as it's not regressing our coverage).

Contributor

okay, I can add the new test if you don't have time. Just let me know.

Member Author

So I found what I was looking for:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/testing/store_tests.go#L862-L866
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/testing/store_tests.go#L881-L885

We are now ignoring it; it might be better to slightly update them to check that we return everything in those cases.
If you have time to take it, that would be great.

Contributor

Sure, I will have a look.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 29, 2024
@@ -2407,7 +2415,7 @@ func RunTestGuaranteedUpdateWithSuggestionAndConflict(ctx context.Context, t *te
err := store.GuaranteedUpdate(ctx, key, updatedPod, false, nil,
storage.SimpleUpdate(func(obj runtime.Object) (runtime.Object, error) {
pod := obj.(*example.Pod)
- pod.Name = "foo-2"
+ pod.Generation = 2
Member Author

These tests were somewhat broken for two reasons:

  • changing name is unrealistic (forbidden) and watchcache is lost - switched to use Generation instead
  • suggestion can be ignored by implementation - relaxed the validation check below

Contributor

Was RunTestGuaranteedUpdateWithSuggestionAndConflict flaky, or did you change it to match reality?

Member Author

both:

  • due to the name changes, the watchcache wasn't behaving as it does in reality (it caches by the object's namespace/name, not by the given key, so the cached entry was changing in the meantime)
  • the last check was incorrect, and you can trigger the failure by adding a sleep
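
The first point can be illustrated with a toy model. This is a sketch under assumptions, not the actual watchcache code: `toyStore`, `toyCache`, `update`, and `inSync` are hypothetical, and the only behavior borrowed from the thread is that the cache indexes by the object's namespace/name while the store uses the key passed by the caller.

```go
package main

import "fmt"

// pod carries only the fields this sketch needs.
type pod struct {
	Namespace, Name string
	Generation      int64
}

// toyStore holds objects under the storage key passed by the caller.
type toyStore map[string]pod

// toyCache indexes objects by namespace/name read from the object itself.
type toyCache map[string]pod

// update writes p under the caller's storage key and mirrors it into the cache.
func update(s toyStore, c toyCache, key string, p pod) {
	s[key] = p
	c[p.Namespace+"/"+p.Name] = p
}

// inSync reports whether the cache entry for nsName equals the object stored under key.
func inSync(s toyStore, c toyCache, key, nsName string) bool {
	cached, ok := c[nsName]
	return ok && cached == s[key]
}

func main() {
	const key = "/pods/test-ns/foo"

	// Renaming under the same storage key leaves the test-ns/foo cache entry stale.
	s, c := toyStore{}, toyCache{}
	update(s, c, key, pod{Namespace: "test-ns", Name: "foo", Generation: 1})
	update(s, c, key, pod{Namespace: "test-ns", Name: "foo-2", Generation: 1})
	fmt.Println("after rename, in sync:", inSync(s, c, key, "test-ns/foo"))

	// Changing Generation instead keeps both views on the same object.
	s, c = toyStore{}, toyCache{}
	update(s, c, key, pod{Namespace: "test-ns", Name: "foo", Generation: 1})
	update(s, c, key, pod{Namespace: "test-ns", Name: "foo", Generation: 2})
	fmt.Println("after generation bump, in sync:", inSync(s, c, key, "test-ns/foo"))
}
```

Mutating a field other than the name keeps the storage key and the cache index in agreement, which is why the test switched to Generation.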

Contributor

Since we are changing the test to match reality, would it make sense to add a new annotation instead of messing with the Generation field, which is set by the system?

Contributor

actually, the Generation will be updated during update before we call this function so it is okay to mess with that field.

Member Author

Generation isn't handled automatically by the system; we're handling it manually at the strategy level for individual resources. So no, I think this is good.
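
As a rough illustration of what "handled at the strategy level" can look like, here is a minimal sketch. The `widget` type and `prepareForUpdate` function are hypothetical and only imitate the general shape of a per-resource update hook that bumps Generation when the spec changes; they are not taken from any specific Kubernetes strategy.

```go
package main

import "fmt"

// widgetSpec and widget are hypothetical types standing in for a real resource.
type widgetSpec struct{ Replicas int }

type widget struct {
	Generation int64
	Labels     map[string]string
	Spec       widgetSpec
}

// prepareForUpdate imitates a per-resource strategy hook: Generation is bumped
// only when the spec differs from the old object. Nothing below this layer
// touches the field, so a storage-level test can set it freely.
func prepareForUpdate(newObj, oldObj *widget) {
	if newObj.Spec != oldObj.Spec {
		newObj.Generation = oldObj.Generation + 1
	}
}

func main() {
	old := &widget{Generation: 1, Spec: widgetSpec{Replicas: 1}}

	// A metadata-only change leaves Generation alone.
	labelOnly := &widget{Generation: 1, Labels: map[string]string{"a": "b"}, Spec: widgetSpec{Replicas: 1}}
	prepareForUpdate(labelOnly, old)

	// A spec change bumps Generation.
	specChange := &widget{Generation: 1, Spec: widgetSpec{Replicas: 3}}
	prepareForUpdate(specChange, old)

	fmt.Println(labelOnly.Generation, specChange.Generation)
}
```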

@wojtek-t
Member Author

/assign @p0lyn0mial

@@ -1654,7 +1656,8 @@ func RunTestListContinuation(ctx context.Context, t *testing.T, store storage.In
// no limit, should get two items
out = &example.PodList{}
options = storage.ListOptions{
ResourceVersion: "0",
Contributor

Interesting: setting the continuation token and an RV doesn't yield an error. Is that correct?

Member Author

Contributor

ok, thx.

@p0lyn0mial
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 30, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 40c0154dc0f651e815ea8da46471bac6ad77c859

@k8s-ci-robot k8s-ci-robot merged commit 02365ec into kubernetes:master Apr 30, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Apr 30, 2024
@cici37
Contributor

cici37 commented Apr 30, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 30, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

5 participants