
Should global timeout override timeout in provided context? #1144

Closed
devchas opened this issue Jul 6, 2020 · 10 comments


devchas commented Jul 6, 2020

When calling ServerStreamingCallable.call(request, context) with a context set like so

ApiCallContext context = GrpcCallContext.createDefault();
context = context.withTimeout(Duration.of(10, ChronoUnit.HOURS));

the global timeout settings override the provided context. As a result, I cannot override global settings. Is this intended behavior, or is this a bug? If intended behavior, how can I override the global settings?
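
For readers less familiar with gax, the global timeout referred to here is typically the totalTimeout configured on the callable's RetrySettings in the client settings; a rough illustration (the one-hour value is only an example, and Duration is org.threeten.bp.Duration):

// Illustrative only: a totalTimeout like this, baked into the callable's settings,
// is the "global" value that ends up winning over the per-call context.
RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setTotalTimeout(Duration.ofHours(1))
        .build();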


devchas commented Jul 6, 2020

Some more context after digging into this further. I've also tried altering the deadline/timeout by directly updating the call options with

ApiCallContext context = GrpcCallContext.createDefault().withCallOptions(CallOptions.DEFAULT.withDeadline(
    Deadline.after(10, TimeUnit.HOURS)));

It looks like, after overriding the timeout as described in the previous comment, the deadline in the call options is replaced with one derived from the global timeout value whenever the deadline calculated from that global timeout comes before the explicitly provided deadline.

It seems like I could decrease the deadline using withCallOptions. I am wondering, is there any way to increase the deadline?
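
For anyone following along, a small sketch of the "earlier deadline wins" effect being described, in terms of plain gRPC deadlines (illustrative only, not the actual gax code; Deadline and TimeUnit are io.grpc.Deadline and java.util.concurrent.TimeUnit):

// Deadline derived from the global totalTimeout vs. the explicitly provided one.
Deadline fromGlobalTimeout = Deadline.after(1, TimeUnit.HOURS);
Deadline fromCallOptions = Deadline.after(10, TimeUnit.HOURS);

// minimum() returns whichever deadline expires first, so the one-hour deadline
// wins and the explicit ten-hour deadline is effectively ignored.
Deadline effective = fromGlobalTimeout.minimum(fromCallOptions);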

@miraleung miraleung self-assigned this Jul 6, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jul 7, 2020

devchas commented Jul 7, 2020

I shared some additional context offline, but for the benefit of anyone following this thread, this problem only seems to occur for a UnaryCallable of type RetryingCallable. When the UnaryCallable is of type ServerStreamingCallable, the timeout settings are not overridden by the global settings.

@miraleung miraleung added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed triage me I really want to be triaged. labels Jul 7, 2020
@noahdietz

@igorbernstein2, per our discussion, if you could weigh in on the code in question, we'd appreciate it.

@noahdietz

@nwbirnie so we can continue the discussion about this behavior as we debug the code.

@noahdietz noahdietz added the status: investigating The issue is under investigation, which is determined to be non-trivial. label Jul 15, 2020
@noahdietz

Ok so there are two parts to this issue, if I'm understanding things:

  1. A timeout set in an ApiCallContext provided by a user to the call(request, context) API is overwritten when a ServerStreamingAttemptCallable is in use (which is the case when a server streaming RPC has retries enabled). The code in question for that issue is in start() here.

  2. When using CallOptions to supply a deadline (presumably with a context that will be supplied to the ServerStreamingAttemptCallable.call(request, context) API), the timeout on the ApiCallContext (which I'm guessing from (1) has already been overwritten to the totalTimeout value in the RetrySettings) is compared to the deadline provided via the CallOptions, and the shorter of the two wins (in the specific use case with this API, the totalTimeout was 1H and the CallOptions contained a timeout of 10H). The code in question is here (and also in GrpcCallContext.withTimeout).

(1) can be resolved by changing the condition to respect whatever timeout is set on the context:

if (totalTimeout != null && context != null && context.getTimeout() == null) {
    context = context.withTimeout(totalTimeout);
}

However, allowing this then raises the question of (2). We'd need to add additional logic to check whether the provided context timeout would expire sooner or later than the totalTimeout from the RetrySettings. Or maybe we wouldn't need to.

(2) is essentially "should we allow users to increase a timeout via the ApiCallContext"

I think it is reasonable to take a more restrictive approach from the ApiCallContext perspective, given that there is retry/timeout machinery that mutates the context higher in the stack (as is evident from (1)). Also, the ApiCallContext is (I believe) representative of a single RPC. In the context of a non-retryable RPC, only one RPC is made, so respecting configuration in a given ApiCallContext makes sense. In the context of a retryable RPC, though, multiple RPCs will be made, which are controlled by machinery around the ApiCallContext, and respecting user-provided values could complicate things. However, given that there is currently no per-logical-call API for customizing CallSettings on any method in a Java GAPIC (server streaming or otherwise), do we want to add more flexibility to ApiCallContext?
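
If that more restrictive route were taken, the additional check might look roughly like this (purely illustrative, reusing the names from the snippet above):

// Illustrative only: apply the settings' totalTimeout unless the caller asked for
// something shorter, i.e. callers may shorten the timeout but not lengthen it.
if (totalTimeout != null && context != null
    && (context.getTimeout() == null || context.getTimeout().compareTo(totalTimeout) > 0)) {
    context = context.withTimeout(totalTimeout);
}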

Context for discussion: we are designing the addition of overloads to set per-logical call CallSettings at the GAPIC level.

@igorbernstein2 can you weigh in on the questions above?

@devchas @nwbirnie please correct me if my summary was off.

@igorbernstein2

This is a bit messy and I apologize for my contributions to the messiness.

So if I understand correctly there are a few timeouts at play here:

  • CallOptions.withDeadline
  • GrpcApiCallContext.withDeadline
  • GrpcApiCallContext.withStreamWaitTimeout
  • RetrySettings.getTotalTimeout
  • RetrySettings.get*RetryTimeout

and they overlap and play different roles depending on the context.

As we discussed, I think this should be cleaned up so that the user only controls a single overall timeout and the RPC deadlines are derived from it. I also think users should be able to override the default deadline per call.

So I agree with solution 1, where the user's override is respected. Though for the second problem, I think totalTimeout should be adjusted (including expanded) to match the user's wishes.

Ignoring backwards compatibility, I think the correct state would be:

  • ApiCallContext should add a new method #withTimeout() which should just proxy to CallOptions#withDeadline.
  • {Unary,ServerStreaming}CallableSettings should define a deadline configuration knob that will be stamped onto the ApiCallContext via withTimeout if the user hasn't provided their own value.
  • RetrySettings should only define if retries are enabled and what the exponential delay should be
  • RetryAlgorithm should be modified to use the incoming ApiCallContext/CallOptions deadline to know when it should stop scheduling retry attempts.
  • RetrySettings should drop all other fields
  • Add streamWaitTimeout to ServerStreamingCallableSettings

With backwards compatibility I think the transition can be:

  1. Add ApiCallContext#withOverallTimeout which proxies to CallOptions#withDeadline
  2. Use ApiCallContext#overallTimeout/CallOptions#deadline to override RetrySettings#totalTimeout in RetryAlgorithm
  3. Introduce *CallableSettings#timeout and have both RetrySettings and *CallableSettings write to that field. Deprecate RetrySettings#totalTimeout
  4. Make RetrySettings#rpcTimeouts no-ops
  5. Add ServerStreamingCallableSettings#streamWaitTimeout
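
A rough sketch of what step 1 could look like on GrpcCallContext (hypothetical; withOverallTimeout does not exist today and the exact shape is up for discussion; Deadline and TimeUnit are the io.grpc and java.util.concurrent types):

// Hypothetical: map an overall timeout straight onto the underlying gRPC deadline.
public GrpcCallContext withOverallTimeout(org.threeten.bp.Duration timeout) {
    Deadline deadline = Deadline.after(timeout.toMillis(), TimeUnit.MILLISECONDS);
    return withCallOptions(getCallOptions().withDeadline(deadline));
}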


noahdietz commented Jul 20, 2020

Ok, so exposing a new per-call Settings API is in the near-to-medium-term future. The ideal end states you've described here are valuable. There is generator work to do and a design to write up for this.

Immediately, can I make the change for (1) - respecting a user-provided timeout on an ApiCallContext in ServerStreamingAttemptCallable (and potentially elsewhere if discovered)? I'm trying to differentiate between what might be non-ideal behavior now that we can fix, and what we want to do to simplify/align things overall. If that is not possible, or not advisable, please let me know.

@noahdietz noahdietz added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. status: investigating The issue is under investigation, which is determined to be non-trivial. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Jul 24, 2020
@noahdietz

Ok, so #1155 fixed (1) - an ApiCallContext with a caller-provided timeout (i.e. context.withTimeout()) given to a ServerStreamingAttemptCallable (the concrete class used when calling a Server Streaming RPC that is configured to be retryable) will be retained, rather than overwritten.

I discussed with @igorbernstein2 offline if the same change should be made to AttemptCallable (used for retryable Unary RPCs). As mentioned in #1155 (comment), we are punting on that for now because the context timeout is handled slightly differently in that case. We want to address it with other work we are doing to simplify timeouts in gax-java, as well as the future plan of exposing GAPIC-level CallSettings method overloads.

As for (2) - with ServerStreamingAttemptCallable, a caller-provided context should be able to lengthen the timeout if provided via context.withTimeout, as the caller-provided timeout is respected without comparison to the settings.totalTimeout. I did not address gRPC CallOption resolution with the aforementioned PR. That is a more general change we will need to make separately if agreed upon. But the specific use case here is addressed by fixing (1).
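
Concretely, after #1155 something like the following should keep the longer, caller-provided timeout on a retryable server streaming RPC (the callable, request, and response types below are placeholders):

// Placeholders for any retryable server streaming RPC and its request/response types.
ApiCallContext context =
    GrpcCallContext.createDefault().withTimeout(Duration.of(10, ChronoUnit.HOURS));
ServerStream<SomeResponse> stream = someServerStreamingCallable.call(someRequest, context);
// The 10 hour timeout is retained instead of being overwritten by totalTimeout.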

I downgraded the ticket because we were able to work around it for the specific use case; I probably should've done that before.

I kept this open with status: investigating because I think there are some questions that came out of investigating the specific issue, but I'm not opposed to closing this in favor of a new issue that pertains to exactly what we want to address going forward. Thoughts?

@miraleung

Moving this to P3 until ndietz@ is back in Q1.

@miraleung miraleung added priority: p3 Desirable enhancement or fix. May not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Nov 3, 2020
@noahdietz noahdietz removed the status: investigating The issue is under investigation, which is determined to be non-trivial. label Mar 9, 2021
@noahdietz

#1324 will fix this specific issue by respecting a timeout set via withTimeout. I am still going to be looking at how we can improve the configurability of timeouts at the method level (i.e. logical timeout). Also, #1238 will improve the accessibility of the RetrySettings by making them configurable from the ApiCallContext. That said, the timeouts in RetrySettings are not trivial to use, especially when compared to withTimeout.
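
For a sense of what that could enable, the anticipated usage is roughly along these lines (hypothetical as of this comment; the exact method names depend on how #1238 lands):

// Hypothetical per-call override of retry/timeout behavior via the context.
RetrySettings perCallRetrySettings =
    RetrySettings.newBuilder()
        .setTotalTimeout(Duration.ofHours(10))
        .build();
ApiCallContext context = GrpcCallContext.createDefault().withRetrySettings(perCallRetrySettings);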
