Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log lines have inconsistent log levels and formats #602

Open
shulin-sq opened this issue Feb 14, 2024 · 5 comments
Open

Log lines have inconsistent log levels and formats #602

shulin-sq opened this issue Feb 14, 2024 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@shulin-sq
Copy link
Contributor

shulin-sq commented Feb 14, 2024

error logs are logged with info level, making it difficult to search through log lines to see what failed vs what are just informational lines


{"level":"info","ts":"2024-02-14T21:49:12.387Z","logger":"controller.route","caller":"controllers/route_controller.go:155","msg":"reconcile er
ror","name":"traffic-exemplar","message":"error during service synthesis failed ServiceManager.Upsert REDACTED due to
 ValidationException: customDomainName is required when certificateArn is provided.\n{\n  RespMetadata: {\n    StatusCode: 400,\n    RequestID
: \"1e3f12ab-4038-4212-9d94-f2f186be5c1b\"\n  },\n  FieldList: [{\n      Message: \"customDomainName is required when certificateArn is provid
ed.\",\n      Name: \"certificateArn\"\n    }],\n  Message_: \"customDomainName is required when certificateArn is provided.\",\n  Reason: \"f
ieldValidationFailed\"\n}"}

using version 1.0.1 but this hasn't changed in main branch either: https://github.com/aws/aws-application-networking-k8s/blob/main/pkg/controllers/route_controller.go#L155

Can warn/error logs be used more consistently? eg warn for errors that are ignored and error for errors that actually cause a reconciliation failure?

Another idea is to use some consistent messaging in log lines to indicate that something errored. eg if "failed_reconcile" will always include the error message of why the reconciliation failed that would be easy to search.

@erikfuller
Copy link
Contributor

Thanks for highlighting this @shulin-sq. I think the challenge is that the log levels don't easily translate well to the eventually consistent/eventually reconcile nature of the controller.

Definitively categorizing errors can be challenging, for example API calls to Lattice would need to be subdivided into 400/500 errors, but sometimes a 4XX can happen for a ResourceNotFound which would be fixed after another reconcile attempt. Other 4XXs will never succeed, such as an invalid parameter (like a malformed ARN).

The rough approach now is that "everything" is an info unless it's known to be an unrecoverable error, in which case we use error. I think it's fine to challenge this, or advocate for specific log lines to be error vs info. If there are other controllers out there which have a more standardized approach you prefer, or if you have a more specific proposal on how to change these, I'm happy to consider it.

@erikfuller erikfuller added the discussion something to talk about label Feb 29, 2024
@shulin-sq
Copy link
Contributor Author

shulin-sq commented Mar 4, 2024

@erikfuller

I would suggest

  • trace/debug is generally helpful but for now my suggestion for this log level is just ensuring that this works. from my experience, I have not seen trace/debug lines despite setting the log level in the env variable
  • info for observability into what is happening, eg some event triggered, some objects were reconciled etc
  • warning for skippable issues, but resulted in an error (eg, unexpected behavior). These are helpful to figure out why an error occurred. Eg perhaps we continued on with the reconciliation, but couldn't finish some skippable event (eg like a getTag call returning 404 due to permissions issues)
  • error for recoverable issues at the end of an event. Eg 1 attempt to reconcile all objects upon a change in k8s service object state-- even if this will be retried later. The reason being, in most cases we ignore transient errors, but after a certain number of errors, particularly if they affect multiple apps, we will investigate. So we want to be able to search by log level and easily filter by error. These can also be prefixed with some predictable string eg [reconcile_result] to make the text easily searchable.
  • error for unrecoverable issues

I did look around to see if there are some other projects we can use as an example but am coming up short. These are projects I checked to see if we can follow some guidelines (but not very helpful). However they do mention the usage of contexted log messages, which related to my other opened issue.

@erikfuller
Copy link
Contributor

@shulin-sq that sounds reasonable enough. We may want to put it in even simpler terms that make it easier to determine log level correctness just looking at the code rather than fully knowing the context. For example:

  • Error at top-level of reconcile before re-raising, or at top-level for other operations (like target group cleanup)
    • Ideally for each logical operation (like a reconcile) we aim for a single Error line if it fails
  • Warn for errors we encounter but do not return
  • Info for mutations in state, whether local or remote, other important auditing lines, or interesting but expected errors
  • Debug for everything else

Otherwise, I've certainly seen DEBUG logs running the controller locally, I'll double check I can see them when running via a deployment. I wasn't using JSON formatting though, but I'll try it out.

@shulin-sq
Copy link
Contributor Author

@erikfuller That sounds reasonable. What are next steps?

@erikfuller
Copy link
Contributor

@shulin-sq
I think we have a few options. We can start by publishing a version this guidance in the contributing/ docs, then reference it for any new changes. We could start working our way through the codebase, even a file at a time would be better than no progress. Or we could start with just getting error right at the top level reconciliation.

If you have a priority order of lines you care about most that might be another way to tackle it.

atm I don't have a date when someone here would be able to pick this up. For now, let me go ahead and re-tag it.

@erikfuller erikfuller added enhancement New feature or request good first issue Good for newcomers and removed discussion something to talk about labels Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants