Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict status codes raised by the library #595

Open
emcfarlane opened this issue Sep 20, 2023 · 9 comments
Open

Restrict status codes raised by the library #595

emcfarlane opened this issue Sep 20, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@emcfarlane
Copy link
Contributor

Connect-go raises error codes that fall outside what the gRPC spec recommends for a framework: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

The following status codes are never generated by the library:

  • INVALID_ARGUMENT
  • NOT_FOUND
  • ALREADY_EXISTS
  • FAILED_PRECONDITION
  • ABORTED
  • OUT_OF_RANGE
  • DATA_LOSS

INVALID_ARGUMENT and FAILED_PRECONDITION from a quick grep. I think most of these cases should be changed to Internal errors.

@emcfarlane emcfarlane added the bug Something isn't working label Sep 20, 2023
@akshayjshah
Copy link
Member

Connect returns INVALID_ARGUMENT in some cases. FAILED_PRECONDITION is only used in a test.

I'm very reluctantly okay with changing our use of INVALID_ARGUMENT to INTERNAL and UNAVAILABLE, matching gRPC. It does seem deeply odd to raise the equivalent of a 5xx when the client sends the server garbage data, though. (And this makes the Connect protocol return 404s if the client uses a compression mechanism that the server doesn't support, which is also quite odd.)

@emcfarlane
Copy link
Contributor Author

Clarifying the server errors codes should help users debug errors. Any INVALID_ARGUMENT will come from user code, ie. an invalid field. Mangled payloads I think should be server errors as this is a bug in the library or some other internal issue.

@akshayjshah
Copy link
Member

akshayjshah commented Oct 3, 2023

Mangled payloads can just as easily come from the client's RPC runtime. I've never seen a REST server treat a mangled JSON request payload as a 5xx - it seems pretty standard to me to assume that mangled payloads are the client's fault (or some middlebox's fault).

@timostamm
Copy link
Member

I've never seen a REST server treat a mangled JSON request payload as a 5xx

Anything but a 400 is unexpected in my book. The gRPC protocol does not use HTTP status codes and is rarely used with JSON. Are we sure we want to forego semantic HTTP status codes for the Connect implementation because of gRPC?

It would be really nice if there was a set of error codes reserved for users. It would help with debugging and it would let a user-facing application decide which error messages to show to the user verbatim. I don't think that restricting codes raised by the library gets us there though: It leaves just 7 codes that are guaranteed to originate from the server application, and it doesn't give any guarantees about the inverse (since a server application can raise any code).

So I wonder if what gRPC does is actually very helpful in practice.

@emcfarlane
Copy link
Contributor Author

Ideally it would be a 400 HTTP status but not INVALID_ARGUMENT error code as I do think that implies a user issue not a marshalling error.

If both client and server generate the request a 500 seems reasonable to me as this is a issue with the client/server lib, but yep this isn't RESTful.

@akshayjshah
Copy link
Member

@jhump and @mattrobenolt Curious about your thoughts on this one.

@mattrobenolt
Copy link
Contributor

To be explicit, if this is referring specifically to the codes we spit back for the gRPC protocol, and not the connect protocol, I think we should mirror the behavior of grpc-go and what the spec says.

I haven't gone through and audited each use of them and compared to what the spec says, but I think it'd make sense to try and be as compatible to expectations.

Granted, the only real downside I see is this might be considered an API breaking change? But I don't feel that strongly about it.

As a quick run through our/my uses, we don't do any checks that particularly compare the status codes. Anything not-OK is usually treated the same, and typically it's either pass/fail, with the code and status as just effectively metadata.

@jhump
Copy link
Member

jhump commented Oct 13, 2023

I agree with Timo, that anything other than a 400 is bad for the Connect protocol.

But, TBH, we should perhaps make an appeal to the gRPC team to change the spec, because using "internal" for gRPC error codes is also bad for the same reason: it makes observability and alerting a nightmare because you can no longer correctly distinguish between "errors caused by a bad client" and "errors caused by my server".

I've never worked at a place where dev ops didn't distinguish these cases for alerting and exception-reporting config. So having a client-induced error reported as "internal" is bad. With HTTP/REST, you'd generally distinguish by classifying status codes as 4xx vs 5xx. With gRPC, the best you can do is to statically classify each gRPC error code as either "client" or "server", and this breaks down when a code like "internal" can mean either.

@emcfarlane
Copy link
Contributor Author

Interestingly gRPC clients that receive a 400 response from a service map to INTERNAL as per: https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md
I'd argue UNKNOWN codes are more severe then INTERNAL in most cases.

OTEL recently adjusted the gRPC status codes to map closer to HTTP semantics: open-telemetry/opentelemetry-specification#3333
From the PR:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants