Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1d: Log client write error #4042

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft

Conversation

dridi
Copy link
Member

@dridi dridi commented Jan 18, 2024

Some scenarios like hitting send_timeout because of a dripping write leaves no clue that an error occurred from the client request logs alone.

This introduces a client counterpart for:

FetchError backend write error: %d (%s) (%s)

Refs 49e44e3

@dridi dridi force-pushed the drip_log branch 2 times, most recently from c0f6e31 to a6d962b Compare January 19, 2024 08:06
@dridi dridi changed the title req_fsm: Log session error before Resp timestamp v1d: Log client write error Jan 19, 2024
@dridi
Copy link
Member Author

dridi commented Jan 22, 2024

Bugwash:

  • add SLT_DeliveryError
  • use it for "backend write error"
  • introduce "client write error"
  • collect feedback from production

@nigoroll
Copy link
Member

bugwash:

  • The error is in the session VSL, but due to the impracticality of -g session, it is not particularly useful
  • SLT_Error is not a particularly good tag for this, because it is currently used for internal errors which can usually be addressed with VCL or changing parameters
  • we have FetchError, should add a DeliveryError tag and then use the former consistently for VFPs, and the latter for VDPs

@dridi
Copy link
Member Author

dridi commented Jan 22, 2024

SLT_Error is not a particularly good tag for this, because it is currently used for internal errors which can usually be addressed with VCL or changing parameters

nitpick: this can be addressed with send_timeout for that case, but point taken, there is nothing we can do when the client disconnects.

This is not quite a counterpart of FetchError that is used in more
contexts than just a VFP.
This is not going through a VDP, but it eventually will.

Refs varnishcache#4035
@dridi
Copy link
Member Author

dridi commented Jan 29, 2024

I'd like to cover this during bugwash today, before we send this to production.

if (sc != SC_NULL)
if (sc != SC_NULL) {
VSLb(req->vsl, SLT_DeliveryError, "resp: \"%s\" %d \"%s\"",
sc->desc, errno, VAS_errtxt(errno));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use sc->name as with SessClose and add the long descriptions from include/tbl/sess_close.h to the documentation. Reason: name is shorter, less likely to change for editorial reasons and easier to parse (does not contain spaces). We should then also remove the quotes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 10f1e5c and #4047

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use sc->name as with SessClose and add the long descriptions from include/tbl/sess_close.h

I'm all for logging the names, but against including the descriptions in the manual. "See other section" ought to be enough.

We should then also remove the quotes.

Once we switch to the name, that goes without saying :)

See 10f1e5c

I think we should instead add a section dedicated to session/stream close reasons in the manual and add some codegen to generate the RST from the sess_close.h table. Then all the tags that print a name can refer to that section and we avoid manual sync.

If you don't, I will.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That if statement should be based on whether sc is an error instead of not null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with all your answers. In particular, moving the session close reasons to a separate docs section makes sense once they are no longer "owned" by SessClose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having second thoughts on the tag format and I'm wondering whether we should make it free form like FetchError so that any VDP could use DeliveryError to log errors?

That does not change my plans to address SessClose et al to use the name over the description and add a manual section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the prefix (resp here) can be replaced by a VDP name and then the format is custom?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing that, but I think I would rather keep the VDP prefix informal. The current "backend write error" prefix is very informal and unusual (and yet query-able) and should be improved, but generally it should be possible to issue DeliveryError logs outside of a VDP too, as long as it is related to delivery? Like FetchError today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bugwash: It seems we actually agree on the alternative format <vdpname>: <error message free form>

bin/varnishd/http1/cache_http1_fetch.c Show resolved Hide resolved
nigoroll added a commit that referenced this pull request Jan 29, 2024
nigoroll added a commit to nigoroll/varnish-cache that referenced this pull request Jan 29, 2024
this is for consistency, to simplify parsing and to reduce the amount of
data logged with VSL.

Motivated by varnishcache#4042
@dridi dridi marked this pull request as draft January 29, 2024 17:00
@daghf
Copy link
Member

daghf commented Feb 2, 2024

bionic failure,

cache/cache_expire.c: In function 'exp_inbox':
cache/cache_expire.c:282:61: error: format '%ld' expects argument of type 'long int', but argument 6 has type 'VCL_INT {aka long long int}' [-Werror=format=]
   VSLb(&ep->vsl, SLT_ExpKill, "EXP_Removed x=%ju t=%.0f h=%ld",
                                                           ~~^
                                                           %lld
cache/cache_expire.c:284:7:
       oc->hits);
       ~~~~~~~~                                               
cache/cache_expire.c: In function 'exp_expire':
cache/cache_expire.c:360:61: error: format '%ld' expects argument of type 'long int', but argument 6 has type 'VCL_INT {aka long long int}' [-Werror=format=]
   VSLb(&ep->vsl, SLT_ExpKill, "EXP_Expired x=%ju t=%.0f h=%ld",
                                                           ~~^
                                                           %lld
cache/cache_expire.c:362:7:
       oc->hits);
       ~~~~~~~~                                               
  CC       cache/varnishd-cache_fetch_proc.o
cache/cache_expire.c: At top level:
cc1: error: unrecognized command line option '-Wno-nullability-completeness' [-Werror]
cc1: error: unrecognized command line option '-Wno-missing-variable-declarations' [-Werror]
cc1: error: unrecognized command line option '-Wno-system-headers-Wno-thread-safety' [-Werror]
cc1: all warnings being treated as errors

dridi added a commit that referenced this pull request Feb 13, 2024
Instead of adding hoops to the documentation, in particular to keep it
in sync, improve the only location where we emit HTC status logs.

We could also consider replacing the HTC enum with a struct, similar to
what we did in other places. The struct symbols would be named after the
UPPER name from the table, have a name and description fields, possibly
replace the error number with a simple is_err field.

Capturing the long description like this is less intrusive.

Refs #4042
@bsdphk
Copy link
Contributor

bsdphk commented Feb 19, 2024

goes after march-15 release

@nigoroll
Copy link
Member

during bugwash, @bsdphk brought up the question if this could affect vsl clients, and while it does not look like it, git grep suggests that we should change some other places where FetchError is referred.

@bsdphk
Copy link
Contributor

bsdphk commented Feb 19, 2024

after march15 release

nigoroll added a commit that referenced this pull request Mar 1, 2024
this is for consistency, to simplify parsing and to reduce the amount of
data logged with VSL.

Motivated by #4042
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants