
Provide a mechanism to retry event processing without logging an error #488

flovouin opened this issue Nov 18, 2022 · 1 comment

@flovouin
Although Cloud Functions do allow a retry mechanism when processing events, this mechanism is currently not very flexible. Indeed, in order for the processing to be retried, the function must throw an error.

Because the functions framework does not know the nature of the thrown error (Is it an uncaught error in a dependency? Is it an error thrown on purpose?), it will always log it as such along with its stack trace. While this makes sense in the general case, it also has effects on other Google services:

  • The Error Reporting service will automatically pick up the error. If notifications have been enabled, this also sends a message to a different system (e.g. Slack, email, etc.).
  • The logged error also counts towards the logging.googleapis.com/log_entry_count metric under the ERROR severity. This metric could be used in dashboards and alert policies. If the error is thrown intentionally only to retry processing, this adds noise to the metric and can trigger false alerts.

There are many valid reasons why one would want to retry processing without explicitly logging an error. Actually, I would argue that the main use case detailed in the documentation (transient errors) should probably not log an error. If an error is transient (e.g. a timeout or a network error when calling an API) and is expected to occur from time to time, there's not much value in counting it as a generic error in Error Reporting and Cloud Logging / Monitoring. In fact, those transient errors can probably be reported more precisely using Cloud Logging and Monitoring directly (not relying on the functions-framework wrapper), along with other statistics about the API calls (e.g. latency, endpoint name, etc).
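As a sketch of reporting a transient failure directly rather than through an uncaught error: in Cloud Functions, Cloud Logging parses JSON lines written to stdout and honours their severity field. The field names other than severity and message below are illustrative, not a prescribed schema:

```javascript
// Report a transient API failure as a structured WARNING log entry.
// Cloud Logging parses JSON written to stdout and uses the `severity`
// field; `endpoint` and `latencyMs` are illustrative labels.
function logTransientFailure(endpoint, latencyMs, cause) {
  const entry = {
    severity: 'WARNING',
    message: `Transient failure calling ${endpoint}: ${cause}`,
    endpoint,
    latencyMs,
  };
  console.log(JSON.stringify(entry));
  return entry;
}
```

This keeps transient failures visible (and queryable, e.g. by endpoint) without inflating ERROR-severity metrics or Error Reporting.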

Conversely, I'd argue that we want an error to be logged and picked up as such by downstream GCP services when it is unexpected. Moreover, if the error is unexpected, it should probably not be retried, because it could be a bug in the code (as the GCP documentation indeed calls out).

✨ Ideal behaviour

Summing up the previous points, the ideal behaviour for me would be to:

  • Log all (uncaught) errors as errors, but never retry event processing (return status code 204). This corresponds to unforeseen errors that might be bugs.
  • Provide an explicit way to ask for a later retry of the processing. For example, transient errors can explicitly be caught in user code, which can then return or throw a special value to indicate the need to retry processing. (In this case, the function would return status code 500.)

This would reduce the risk of setting up a Cloud Function with the retry flag and ending up with endless retries. The solutions provided by the documentation involve custom logic and are in my opinion prone to errors and/or suboptimal.

🛠️ Current workaround

My current workaround is actually a wrapper around all my Cloud Functions which:

  • Catches all uncaught errors of "unknown" types and logs them as errors. Those errors are not rethrown by the wrapper, and the execution terminates successfully. This is completely transparent to the functions-framework, which thinks the processing ended successfully. Processing is not retried. Logged errors do appear in Error Reporting / Cloud Monitoring error metrics, as they need to be reviewed and possibly fixed by developers.
  • Catches errors of a special type (RetryableError) and rethrows them. Only an INFO-level log is written because the RetryableError marks an explicit need in user code to retry processing. Because a rejection is passed to the functions-framework, the error is logged and processing is retried.
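A minimal sketch of such a wrapper, assuming the RetryableError class and withRetrySemantics name are my own (neither is part of the functions-framework API):

```javascript
// Sentinel error type: user code throws this to explicitly request a retry.
class RetryableError extends Error {}

// Wraps an async CloudEvent handler with the retry semantics described above.
function withRetrySemantics(handler) {
  return async (event) => {
    try {
      await handler(event);
    } catch (err) {
      if (err instanceof RetryableError) {
        // Explicit retry request: log at INFO level only, then rethrow so
        // the framework sees a rejection and the event is redelivered.
        console.info(`Retrying event processing: ${err.message}`);
        throw err;
      }
      // Unexpected error: log it so Error Reporting picks it up, but do not
      // rethrow, so the execution "succeeds" and the event is not retried.
      console.error(err);
    }
  };
}
```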

Obviously, this workaround does not fully eliminate the problem, because retryable errors are still logged as errors by the functions-framework. However, it does help to:

  • Avoid retrying processing in case the error is an unexpected bug.
  • Make it explicit which errors should be retried, both for code readability and in tests.

💡 Breaking change-free suggestion

The ideal behaviour described above would be a huge breaking change, and the idea would probably be quickly dismissed. However, I think there is a smaller, non-breaking change that could bring more flexibility to developers: simply provide an explicit value or error that can be returned/thrown by the function, which does not involve logging an error but still causes the functions-framework to return a 500 status code. This would be a small change to sendResponse, checking whether result or err is of a specific type (or whatever check you feel most comfortable with).

This solution does solve the main problem, which is that processing cannot currently be retried without logging an error. Because uncaught errors would still be retried (and also logged as errors), it does not alleviate the risk of an event being endlessly retried in case of a bug in the function's code. However, this can be clearly documented and solved in user code (e.g. as in my current workaround).
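The suggested check could look roughly like the following. RetryRequested is a hypothetical sentinel name, and sendResponseSketch only approximates the shape of the framework's actual sendResponse; the maintainers may prefer a different check entirely:

```javascript
// Hypothetical sentinel: throwing/returning this requests a retry without
// having the framework log an ERROR-severity entry.
class RetryRequested extends Error {}

// Sketch of the response logic with the sentinel check added.
// `res` is assumed to be an Express-style response object.
function sendResponseSketch(err, res) {
  if (err instanceof RetryRequested) {
    // Explicit retry: fail the request so the event is redelivered,
    // without writing an error log entry.
    res.status(500).end();
  } else if (err) {
    // Current behaviour preserved: log the error, then fail the request.
    console.error(err);
    res.status(500).end();
  } else {
    // Success: acknowledge the event.
    res.status(204).end();
  }
}
```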

@josephlewis42 (Contributor)

I was able to look into this today. It seems like CloudEvent handlers also support a (little-documented) callback that accepts (error, response) arguments; unfortunately, the response body can't be used in an Express-friendly way to set the status code without sending a body, since FF also checks for integers (which would be interpreted as the HTTP response code) and casts them to strings first:

} else if (typeof result === 'number') {

Agreed that a custom sentinel error is probably the cleanest approach here; it would allow extension without breaking backward compatibility.
