
Configurable Call Timeout #1040

Open
jbspeakr opened this issue Jan 14, 2021 · 7 comments

Comments

@jbspeakr
Member

I would like to be able to configure timeouts on a per request basis similar to what OkHttp calls Call Timeout (link).

Detailed Description

My current understanding is that the following request-related timeouts can be configured:

  • connections.connect-timeout for starting the TCP connection
  • connections.socket-timeout for data flow/packets
  • timeouts.global wrapping everything (retries, queuing, backup requests, fallbacks, etc.)

From my perspective, what is missing is a way to define a time limit for a complete HTTP call (only), including resolving DNS, connecting, writing the request body, server processing, as well as reading the response body. OkHttp calls this Call Timeout.
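
For comparison, the JDK's own HttpClient (Java 11+) exposes exactly this kind of whole-call deadline per request; a minimal sketch (the URL is just a placeholder):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class CallTimeoutExample {
    public static void main(String[] args) {
        // A deadline for the entire exchange, analogous to OkHttp's callTimeout:
        // not per packet, not per connection attempt, but per call.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.org/"))
                .timeout(Duration.ofMillis(50)) // whole call must finish in 50ms
                .GET()
                .build();
        System.out.println(request.timeout().orElseThrow()); // PT0.05S
    }
}
```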

Context

Currently it appears to me that there are only super low-level or super high-level timeout configuration possibilities, which makes it hard to configure SLO-based timeouts. As upstream services often define latency objectives on a per-request basis, I would also like to configure my request timeouts on a per request basis.

To be more specific: I am neither interested in a particular connections.connect-timeout nor a specific connections.socket-timeout as long as the entire HTTP request finishes within a given time.

Unfortunately, timeouts.global is too global on the other hand, because it spans multiple requests, which means that configuring something like this...

clients:
    example:
      connections:
        connect-timeout: 25 milliseconds
        socket-timeout: 25 milliseconds
      retry:
        enabled: true
        max-retries: 2
      timeouts:
        enabled: true
        global: 150 milliseconds

... could still result in the global timeout kicking in before any retry happens.
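
A minimal JDK-only sketch of that failure mode, using Thread.sleep as a hypothetical stand-in for a response that drips data slowly enough to keep resetting the socket timeout:

```java
import java.util.concurrent.*;

public class GlobalTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a request whose server drips one byte every 20ms:
        // each read stays under the 25ms socket timeout, so the call never
        // fails locally and can run far past the 150ms global budget.
        CompletableFuture<String> call = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(400); } catch (InterruptedException e) { }
            return "late response";
        }, pool);
        try {
            // The global timeout fires first; no retry ever gets a chance.
            call.orTimeout(150, TimeUnit.MILLISECONDS).join();
        } catch (CompletionException e) {
            System.out.println(e.getCause().getClass().getSimpleName()); // TimeoutException
        }
        pool.shutdownNow();
    }
}
```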

Possible Implementation

  • a per call timeout (maybe via a plugin)
  • policy composition autoconfig possibilities

Your Environment

  • Version used: 3.0.0-RC.15
  • pretty standard Spring Boot app (2.3.7.RELEASE)
@whiskeysierra
Collaborator

Currently it appears to me that there are only super low-level or super high-level timeout configuration possibilities, which makes it hard to configure SLO-based timeouts. As upstream services often define latency objectives on a per-request basis, I would also like to configure my request timeouts on a per request basis.

The way it was supposed to be used is to split your client up into multiple clients if there is a need to separate configuration within a client:

riptide.clients:
  example-login:
      connections:
        connect-timeout: 25 milliseconds
        socket-timeout: 25 milliseconds
      retry:
        enabled: true
        max-retries: 2
      timeouts:
        enabled: true
        global: 150 milliseconds
  example-download:
      connections:
        connect-timeout: 25 milliseconds
        socket-timeout: 500 milliseconds
      retry:
        enabled: true
        max-retries: 2
      timeouts:
        enabled: true
        global: 30 seconds

From my perspective, what is missing is a way to define a time limit for a complete HTTP call (only), including resolving DNS, connecting, writing the request body, server processing, as well as reading the response body.

That's what connect-timeout (resolving DNS, connecting) and socket-timeout (server processing, as well as reading the response body) are for. Writing the request body is not subject to either of those two timeouts, because it's the client's responsibility, not the server's.

Global timeouts allow you to manage your own SLAs (what you promise your own clients) while connect- and socket-timeouts (and to some degree backup-requests) allow you to deal with SLAs of your downstream dependencies.

@jbspeakr
Member Author

The way it was supposed to be used is to split your client up into multiple clients

That's understood and also what we do.

That's what connect-timeout (resolving DNS, connecting) and socket-timeout (server processing, as well as reading the response body) are for.

Please correct me if I am wrong, but setting connect-timeout: 25 milliseconds and socket-timeout: 25 milliseconds will not result in a "(call) timeout" of 50ms as the socket timeout (SO_TIMEOUT) is defined as the "maximum period inactivity between two consecutive data packets" (link). It's a common misconception that the socket timeout is the timeout to receive the full response. This IMO means that with Riptide in its current form you can't easily (if at all) "fail fast and retry", because your initial request might eat up your entire global timeout budget without any retry ever kicking in.
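
That inactivity semantic is easy to demonstrate with plain java.net sockets. In the hypothetical setup below, the server drips one byte every 60 ms, so a 100 ms SO_TIMEOUT never fires even though reading the full response takes around 300 ms:

```java
import java.io.*;
import java.net.*;

public class SocketTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);
        Thread t = new Thread(() -> {
            try (Socket s = server.accept(); OutputStream out = s.getOutputStream()) {
                for (int i = 0; i < 5; i++) {   // drip 5 bytes, 60ms apart
                    out.write('x');
                    out.flush();
                    Thread.sleep(60);
                }
            } catch (Exception e) { }
        });
        t.start();
        try (Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setSoTimeout(100);           // SO_TIMEOUT = 100ms per read
            InputStream in = client.getInputStream();
            int read = 0;
            while (in.read() != -1) read++;
            // Total wall time is ~300ms, yet no SocketTimeoutException:
            // the timeout only bounds inactivity between consecutive packets.
            System.out.println("read " + read + " bytes");
        }
        t.join();
        server.close();
    }
}
```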

What I want my application to do is to not wait longer than e.g. 50ms per request for a specific client and if this 50ms "call timeout" budget was eaten up, trigger one of my retries (which then has another 50ms guaranteed "call timeout" budget but not more)... Similar to what you would achieve with changing policy composition from the current default Timeout(RetryPolicy(CircuitBreaker(Supplier))) (simplified) to RetryPolicy(CircuitBreaker(Timeout(Supplier))) (simplified).
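
The desired RetryPolicy(...(Timeout(Supplier))) composition can be sketched with the JDK alone; the attempt delays below are hypothetical, and the per-call budget is the 50ms from above:

```java
import java.util.concurrent.*;

public class PerCallTimeoutRetry {
    // Hypothetical request stand-in: completes after delayMs,
    // but is abandoned once the 50ms per-call budget is spent.
    static String attempt(long delayMs) {
        CompletableFuture<String> f = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(delayMs); } catch (InterruptedException e) { }
            return "ok";
        });
        return f.orTimeout(50, TimeUnit.MILLISECONDS).join();
    }

    public static void main(String[] args) {
        long[] attemptDelays = {200, 200, 10}; // first two too slow, third fast
        for (long delay : attemptDelays) {     // retry wraps the per-call timeout
            try {
                System.out.println(attempt(delay));
                return;
            } catch (CompletionException e) {
                System.out.println("call timed out, retrying");
            }
        }
    }
}
```

Each failed attempt gives up after its own 50ms budget instead of dragging the shared global budget down with it.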

@whiskeysierra
Collaborator

Please correct me if I am wrong, but setting connect-timeout: 25 milliseconds and socket-timeout: 25 milliseconds will not result in a "(call) timeout" of 50ms as the socket timeout (SO_TIMEOUT) is defined as the "maximum period inactivity between two consecutive data packets" (link).

Yes, correct.

This IMO means that with Riptide in its current form you can't easily (if at all) "fail fast and retry", because your initial request might eat up your entire global timeout budget without any retry ever kicking in.

You can with a backup request. The following will issue a second, concurrent request if the first one took longer than 50 milliseconds. Backup requests are essentially latency-based retries. Sounds like what you want in this case (since your strategy for dealing with slow calls is to retry).

riptide.clients:
    example:
      connections:
        connect-timeout: 25 milliseconds
        socket-timeout: 25 milliseconds
      backup-request:
        enabled: true
        delay: 50 milliseconds
      timeouts:
        enabled: true
        global: 150 milliseconds
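
The backup-request semantics can be sketched with CompletableFuture (the delays are hypothetical):

```java
import java.util.concurrent.*;

public class BackupRequestSketch {
    // Hypothetical request stand-in: sleeps, then returns its name.
    static String simulate(String name, long delayMs) {
        try { Thread.sleep(delayMs); } catch (InterruptedException e) { }
        return name;
    }

    public static void main(String[] args) {
        // The original request is slow (300ms)...
        CompletableFuture<String> original =
                CompletableFuture.supplyAsync(() -> simulate("original", 300));
        // ...the backup starts after a 50ms delay and is fast (20ms).
        CompletableFuture<String> backup = CompletableFuture.supplyAsync(
                () -> simulate("backup", 20),
                CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS));
        // Whichever finishes first wins; the original still has a chance.
        System.out.println(CompletableFuture.anyOf(original, backup).join());
    }
}
```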

Similar to what you would achieve with changing policy composition from the current default Timeout(RetryPolicy(CircuitBreaker(Supplier))) (simplified) to RetryPolicy(CircuitBreaker(Timeout(Supplier))) (simplified).

Correct, that would give you what you want. There is actually no need to remove the outermost Timeout. Both serve a purpose.

So you'd like to have this, then?

riptide.clients:
    example:
      connections:
        connect-timeout: 25 milliseconds
        socket-timeout: 25 milliseconds
      retry:
        enabled: true
        max-retries: 2
      timeouts:
        enabled: true
        call: 50 milliseconds
        global: 160 milliseconds # 3 calls, each 50ms + some time to issue a new request

This would only really be useful for scenarios where there is no retry desired, i.e. really fail fast.

Personally, I don't see a big value in failing because something took too long only to immediately retry it. Your client will have to wait either way.

@jbspeakr
Member Author

jbspeakr commented Jan 14, 2021

Yes, we also just discussed the usage of backup-request, but that is limited to one additional request, right?

Correct, that would give you what you want. There is actually no need to remove the outermost Timeout. Both serve a purpose.

Correct, I simplified to make my point clear.

I don't see a big value in failing because something took to long only to immediately retry it. Your client will have to wait either way.

I think the assumption here is: if a call has already taken longer than the expected latency (e.g. p999 + 10% = 25ms), the chance that it will finish any time soon is rather low. So better stop waiting and issue a new request, reusing the open connection. But you are right, that sounds like a nice use case for a backup call.

Edit:
Being able to configure "local (call) timeouts" in addition to "global timeouts" seems like a good addition nevertheless.

@whiskeysierra
Collaborator

Yes, we also just discussed the usage of backup-request, but that is limited to one additional request, right?

Correct.

I think the assumption here is: if a call has already taken longer than the expected latency (e.g. p999 + 10% = 25ms), the chance that it will finish any time soon is rather low. So better stop waiting and issue a new request, reusing the open connection. But you are right, that sounds like a nice use case for a backup call.

Backup requests would run concurrently and give the original request a chance to still finish. The first request to finish will be used. Might as well be the original request.

Being able to configure "local (call) timeouts" in addition to "global timeouts" seems like a good addition nevertheless.

Are you interested in contributing this? The way I see it, it's essentially just an addition to the auto configuration, because all the building blocks are already there (FailsafePlugin + Timeout policy).

@jbspeakr
Member Author

@whiskeysierra Yes, I believe we have a couple of guys who would like to contribute.

jbspeakr added a commit to jbspeakr/riptide that referenced this issue Jan 18, 2021
@whiskeysierra
Collaborator

@jbspeakr I saw you made a commit to your fork. Care to open a PR?
