Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

condfetch: Skip revalidation of an invalidated stale_oc #4082

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dridi
Copy link
Member

@dridi dridi commented Mar 12, 2024

If we can catch a dying stale_oc before initiating a condfetch, we can fall back to a regular fetch instead. If we catch it too late, when we are ready to merge headers, we transition to vcl_backend_error and get a chance to retry.

This avoids a race between a condfetch and the invalidation of its stale objcore, covered with a purge in test cases.

@dridi
Copy link
Member Author

dridi commented Mar 13, 2024

@nigoroll @bsdphk: I think the first commit should be merged before the release.

@dridi dridi force-pushed the dying_304 branch 2 times, most recently from cc56c8b to 9257ac3 Compare March 13, 2024 11:11
@dridi
Copy link
Member Author

dridi commented Mar 15, 2024

@hermunn brought an interesting point regarding bans. We check the stale object against newer bans during lookup, but in the absence of another lookup or if the ban lurker doesn't get a chance to evaluate post-lookup bans (and the default ban_lurker_age almost guarantees that) the stale object will not be invalidated.

To get the discussion started I pushed what could look like evaluating bans before and after 304 revalidation. I will turn this pull request into a draft.

@dridi dridi marked this pull request as draft March 15, 2024 17:36
@nigoroll
Copy link
Member

nigoroll commented Mar 18, 2024

I understand that Dridi prepared these patches in a rush, so once the dust has settled, I would appreciate better explanations in the commit messages.

Regarding the actual issue here, I am not convinced yet that failing with a fetch error for the race is the best option (it might be, but I would like to think about it). There will probably be many sites for which the current behavior is just fine, and who would start to see errors unless they added the return(retry), so we should at least add this to the builtin vcl and docs.

If someone issues a PURGE, they expect that the respective object go away and not survive indirectly through a revalidation. The current suggestion to fail to v_b_e {} and then retry under VCL control has the advantage of replacing the busy object and thus not delivering the corrupted stale content. But we could also just accept the revalidated object and deliver it (once), also implicitly purging it. Or we could do the retry internally without VCL involvement.

Again, not sure yet.

@dridi
Copy link
Member Author

dridi commented Mar 18, 2024

bugwash:

  • go back to the original scope of the pull request
    • leave post-lookup ban checks out for now
  • add 304 retry in built-in vcl_backend_error
  • docs

A valid stale objcore may be selected during lookup, enabling a
conditional fetch from the backend. Depending on the backend, or
the resource being requested, this may leave a long window during
which the stale object may be invalidated.

In that case, a 304 response would defeat the invalidation, making
it possible for invalidated response header fields or bodies to be
perpetuated despite Varnish reporting a successful invalidation.

We should consider adding this to vcl_backend_error:

    if (beresp.was_304) {
            return (retry);
    }

This would mostly preserve the current behavior, as observed by a
client. In particular, when the ETag of the invalidated stale object
is still the same when a new copy is fetched, the client side INM
processing would remain. Varnish would simply no longer merge and
reuse invalidated stale objects internally.

From the point of view of an operator, there would of course be a
visible change in the presence of a retry.
When it is time to decide between making a bereq for a regular or
conditional fetch, we need to ensure that the stale object is still
valid. There is a window between the lookup and the begining of a
fetch task during which the object could have been invalidated. The
window is in theory much smaller than the one between the lookup and
the processing of a 304 response, but it can be significant if the
fetch task was queued or restarted.

Catching it early allows to proceed with a regular fetch.
Named after vcl_backend_refresh from varnishcache#3994.
@dridi
Copy link
Member Author

dridi commented Mar 19, 2024

Updated as per bugwash consensus.

Copy link
Member

@nigoroll nigoroll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good after cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants