Skip to content
This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

fix: truncate RPC timeouts to time remaining in totalTimeout #1191

Merged
merged 5 commits into from Sep 28, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -100,19 +100,37 @@ public TimedAttemptSettings createNextAttempt(TimedAttemptSettings prevSettings)
(long) (settings.getRetryDelayMultiplier() * prevSettings.getRetryDelay().toMillis());
newRetryDelay = Math.min(newRetryDelay, settings.getMaxRetryDelay().toMillis());
}
Duration randomDelay = Duration.ofMillis(nextRandomLong(newRetryDelay));

// The rpc timeout is determined as follows:
// attempt #0 - use the initialRpcTimeout;
// attempt #1+ - use the calculated value.
// attempt #1+ - use the calculated value, or the time remaining in totalTimeout if the
// calculated value would exceed the totalTimeout.
long newRpcTimeout =
(long) (settings.getRpcTimeoutMultiplier() * prevSettings.getRpcTimeout().toMillis());
newRpcTimeout = Math.min(newRpcTimeout, settings.getMaxRpcTimeout().toMillis());

// The totalTimeout could be zero if a callable is only using maxAttempts to limit retries.
// If set, calculate time remaining in the totalTimeout since the start, taking into account the
// next attempt's delay, in order to truncate the RPC timeout should it exceed the totalTimeout.
if (!settings.getTotalTimeout().isZero()) {
Duration timeElapsed =
Duration.ofNanos(clock.nanoTime())
.minus(Duration.ofNanos(prevSettings.getFirstAttemptStartTimeNanos()));
Duration timeLeft = globalSettings.getTotalTimeout().minus(timeElapsed).minus(randomDelay);

// If timeLeft at this point is < 0, the shouldRetry logic will prevent
// the attempt from being made as it would exceed the totalTimeout. A negative RPC timeout
// will result in a deadline in the past, which should will always fail prior to making a
// network call.
newRpcTimeout = Math.min(newRpcTimeout, timeLeft.toMillis());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if newRpcTimeout <= 0?

Copy link
Contributor Author

@noahdietz noahdietz Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If newRpcTimeout <= 0, the subsequent check by shouldRetry would prevent another attempt from being made, because they do a similar check to see if the current time + the proposed random delay would exceed the totalTimeout:

long totalTimeSpentNanos =
        clock.nanoTime()
            - nextAttemptSettings.getFirstAttemptStartTimeNanos()
            + nextAttemptSettings.getRandomizedRetryDelay().toNanos();

// If totalTimeout limit is defined, check that it hasn't been crossed
if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
   return false;
}

We could add some logic to handle if (timeLeft.isNegative() || timeLeft.isZero()), but I'm not sure what we'd set it to...maybe allow newRpcTimeout to remain unchanged by timeLeft?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our offline conversation, I've decided to just document the likely behavior. LMK if you feel strongly otherwise.

}

return TimedAttemptSettings.newBuilder()
.setGlobalSettings(prevSettings.getGlobalSettings())
.setRetryDelay(Duration.ofMillis(newRetryDelay))
.setRpcTimeout(Duration.ofMillis(newRpcTimeout))
.setRandomizedRetryDelay(Duration.ofMillis(nextRandomLong(newRetryDelay)))
.setRandomizedRetryDelay(randomDelay)
.setAttemptCount(prevSettings.getAttemptCount() + 1)
.setOverallAttemptCount(prevSettings.getOverallAttemptCount() + 1)
.setFirstAttemptStartTimeNanos(prevSettings.getFirstAttemptStartTimeNanos())
Expand Down Expand Up @@ -144,7 +162,16 @@ public boolean shouldRetry(TimedAttemptSettings nextAttemptSettings) {
- nextAttemptSettings.getFirstAttemptStartTimeNanos()
+ nextAttemptSettings.getRandomizedRetryDelay().toNanos();

// If totalTimeout limit is defined, check that it hasn't been crossed
// If totalTimeout limit is defined, check that it hasn't been crossed.
//
// Note: if the potential time spent is exactly equal to the totalTimeout,
// the attempt will still be allowed. This might not be desired, but if we
// enforce it, it could have potentially negative side effects on LRO polling.
// Specifically, if a polling retry attempt is denied, the LRO is canceled, and
// if a polling retry attempt is denied because its delay would *reach* the
// totalTimeout, the LRO would be canceled prematurely. The problem here is that
// totalTimeout doubles as the polling threshold and also the time limit for an
// operation to finish.
if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
return false;
}
Expand Down
Expand Up @@ -29,6 +29,7 @@
*/
package com.google.api.gax.retrying;

import static com.google.common.truth.Truth.assertThat;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
Expand Down Expand Up @@ -89,6 +90,25 @@ public void testCreateNextAttempt() {
assertEquals(Duration.ofMillis(4L), thirdAttempt.getRpcTimeout());
}

@Test
public void testTruncateToTotalTimeout() {
RetrySettings timeoutSettings =
retrySettings
.toBuilder()
.setInitialRpcTimeout(Duration.ofSeconds(4L))
.setMaxRpcTimeout(Duration.ofSeconds(4L))
.setTotalTimeout(Duration.ofSeconds(4L))
.build();
ExponentialRetryAlgorithm timeoutAlg = new ExponentialRetryAlgorithm(timeoutSettings, clock);

TimedAttemptSettings firstAttempt = timeoutAlg.createFirstAttempt();
TimedAttemptSettings secondAttempt = timeoutAlg.createNextAttempt(firstAttempt);
assertThat(firstAttempt.getRpcTimeout()).isGreaterThan(secondAttempt.getRpcTimeout());

TimedAttemptSettings thirdAttempt = timeoutAlg.createNextAttempt(secondAttempt);
assertThat(secondAttempt.getRpcTimeout()).isGreaterThan(thirdAttempt.getRpcTimeout());
}

@Test
public void testShouldRetryTrue() {
TimedAttemptSettings attempt = algorithm.createFirstAttempt();
Expand Down
Expand Up @@ -94,11 +94,14 @@ public class OperationCallableImplTest {
.setInitialRetryDelay(Duration.ofMillis(1L))
.setRetryDelayMultiplier(1)
.setMaxRetryDelay(Duration.ofMillis(1L))
.setInitialRpcTimeout(Duration.ZERO) // supposed to be ignored
.setInitialRpcTimeout(
Duration.ZERO) // supposed to be ignored, but are not actually, so we set to zero
.setMaxAttempts(0)
.setJittered(false)
.setRpcTimeoutMultiplier(1) // supposed to be ignored
.setMaxRpcTimeout(Duration.ZERO) // supposed to be ignored
.setRpcTimeoutMultiplier(
1) // supposed to be ignored, but are not actually, so we set to one
.setMaxRpcTimeout(
Duration.ZERO) // supposed to be ignored, but are not actually, so we set to zero
.setTotalTimeout(Duration.ofMillis(5L))
.build();

Expand Down Expand Up @@ -475,9 +478,16 @@ public void testFutureCallPollRPCTimeout() throws Exception {
OperationTimedPollAlgorithm.create(
FAST_RECHECKING_SETTINGS
.toBuilder()
// Update the polling algorithm to set per-RPC timeouts instead of the default zero.
//
// This is non-standard, as these fields have been documented as "should be ignored"
// for LRO polling. They are not actually ignored in code, so they changing them
// here has an actual affect. This test verifies that they work as such, but in
// practice generated clients set the RPC timeouts to 0 to be ignored.
.setInitialRpcTimeout(Duration.ofMillis(100))
.setMaxRpcTimeout(Duration.ofSeconds(1))
.setRpcTimeoutMultiplier(2)
.setTotalTimeout(Duration.ofSeconds(5L))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of totalTimeout in case of LRO? Note that exponential retry algorithm is shared between retries and LRO polling logic. Whatever we do for retries, we must ensure that it does not break LRO. Ideally we want that new logic be completelly disabled fro LRO. I.e. the need to modify LRO tests indicates that there is a breaking behavioral change for LRO, which we should avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought and I was confused by what this test was doing.

The test setup actually sets by default a totalTimeout to be 5 ms, but then in this test increases the initialRpcTimeout and maxRpcTimeout to values much greater than the existing totalTimeout of 5ms. Per comments in a generated client, the initialRpcTimeout and maxRpcTimeout should be ignored and set to 0 for LROs? Not really sure what the test is doing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW we had no GAPIC config for LRO polling initialRpcTimeout or maxRpcTimeout, so they would never be generated and only ever set by users...and they'd have the same behavior as non-LRO RPCs where a poll could run over the totalTimeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this test was testing for the existing incorrect behavior, where the RPC timeout didn't care if it exceeded the totalTimeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The totalTimeout in the context of LRO is the "total polling timeout" (gapic config). So, the duration a "synchronous" LRO call should poll before giving up.

This test was set up weirdly, as I described. PTAL.

.build(),
clock);
callSettings = callSettings.toBuilder().setPollingAlgorithm(pollingAlgorithm).build();
Expand Down