Skip to content
This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

fix: truncate RPC timeouts to time remaining in totalTimeout #1191

Merged
merged 5 commits into from Sep 28, 2020
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -100,19 +100,31 @@ public TimedAttemptSettings createNextAttempt(TimedAttemptSettings prevSettings)
(long) (settings.getRetryDelayMultiplier() * prevSettings.getRetryDelay().toMillis());
newRetryDelay = Math.min(newRetryDelay, settings.getMaxRetryDelay().toMillis());
}
Duration randomDelay = Duration.ofMillis(nextRandomLong(newRetryDelay));

// The rpc timeout is determined as follows:
// attempt #0 - use the initialRpcTimeout;
// attempt #1+ - use the calculated value.
// attempt #1+ - use the calculated value or the time remaining in totalTimeout.
noahdietz marked this conversation as resolved.
Show resolved Hide resolved
long newRpcTimeout =
(long) (settings.getRpcTimeoutMultiplier() * prevSettings.getRpcTimeout().toMillis());
newRpcTimeout = Math.min(newRpcTimeout, settings.getMaxRpcTimeout().toMillis());

// The totalTimeout could be zero if a callable is only using maxAttempts to limit retries.
// If set, calculate time remaining in the totalTimeout since the start, taking into account the
// next attempt's delay, in order to truncate the RPC timeout should it exceed the totalTimeout.
if (!settings.getTotalTimeout().isZero()) {
Duration timeElapsed =
Duration.ofNanos(clock.nanoTime())
.minus(Duration.ofNanos(prevSettings.getFirstAttemptStartTimeNanos()));
Duration timeLeft = globalSettings.getTotalTimeout().minus(timeElapsed).minus(randomDelay);
newRpcTimeout = Math.min(newRpcTimeout, timeLeft.toMillis());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if newRpcTimeout <= 0?

Copy link
Contributor Author

@noahdietz noahdietz Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If newRpcTimeout <= 0, the subsequent check by shouldRetry would prevent another attempt from being made, because they do a similar check to see if the current time + the proposed random delay would exceed the totalTimeout:

long totalTimeSpentNanos =
        clock.nanoTime()
            - nextAttemptSettings.getFirstAttemptStartTimeNanos()
            + nextAttemptSettings.getRandomizedRetryDelay().toNanos();

// If totalTimeout limit is defined, check that it hasn't been crossed
if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
   return false;
}

We could add some logic to handle if (timeLeft.isNegative() || timeLeft.isZero()), but I'm not sure what we'd set it to...maybe allow newRpcTimeout to remain unchanged by timeLeft?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our offline conversation, I've decided to just document the likely behavior. LMK if you feel strongly otherwise.

}

return TimedAttemptSettings.newBuilder()
.setGlobalSettings(prevSettings.getGlobalSettings())
.setRetryDelay(Duration.ofMillis(newRetryDelay))
.setRpcTimeout(Duration.ofMillis(newRpcTimeout))
.setRandomizedRetryDelay(Duration.ofMillis(nextRandomLong(newRetryDelay)))
.setRandomizedRetryDelay(randomDelay)
.setAttemptCount(prevSettings.getAttemptCount() + 1)
.setOverallAttemptCount(prevSettings.getOverallAttemptCount() + 1)
.setFirstAttemptStartTimeNanos(prevSettings.getFirstAttemptStartTimeNanos())
Expand Down
Expand Up @@ -89,6 +89,25 @@ public void testCreateNextAttempt() {
assertEquals(Duration.ofMillis(4L), thirdAttempt.getRpcTimeout());
}

@Test
public void testTruncateToTotalTimeout() {
RetrySettings timeoutSettings =
retrySettings
.toBuilder()
.setInitialRpcTimeout(Duration.ofSeconds(4L))
.setMaxRpcTimeout(Duration.ofSeconds(4L))
.setTotalTimeout(Duration.ofSeconds(4L))
.build();
ExponentialRetryAlgorithm timeoutAlg = new ExponentialRetryAlgorithm(timeoutSettings, clock);

TimedAttemptSettings firstAttempt = timeoutAlg.createFirstAttempt();
TimedAttemptSettings secondAttempt = timeoutAlg.createNextAttempt(firstAttempt);
assertTrue(firstAttempt.getRpcTimeout().compareTo(secondAttempt.getRpcTimeout()) > 0);
noahdietz marked this conversation as resolved.
Show resolved Hide resolved

TimedAttemptSettings thirdAttempt = timeoutAlg.createNextAttempt(secondAttempt);
assertTrue(secondAttempt.getRpcTimeout().compareTo(thirdAttempt.getRpcTimeout()) > 0);
}

@Test
public void testShouldRetryTrue() {
TimedAttemptSettings attempt = algorithm.createFirstAttempt();
Expand Down
Expand Up @@ -475,9 +475,11 @@ public void testFutureCallPollRPCTimeout() throws Exception {
OperationTimedPollAlgorithm.create(
FAST_RECHECKING_SETTINGS
.toBuilder()
// Update the polling algorithm to set per-RPC timeouts instead of the default zero.
.setInitialRpcTimeout(Duration.ofMillis(100))
.setMaxRpcTimeout(Duration.ofSeconds(1))
.setRpcTimeoutMultiplier(2)
.setTotalTimeout(Duration.ofSeconds(5L))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of totalTimeout in case of LRO? Note that exponential retry algorithm is shared between retries and LRO polling logic. Whatever we do for retries, we must ensure that it does not break LRO. Ideally we want that new logic be completelly disabled fro LRO. I.e. the need to modify LRO tests indicates that there is a breaking behavioral change for LRO, which we should avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought and I was confused by what this test was doing.

The test setup actually sets by default a totalTimeout to be 5 ms, but then in this test increases the initialRpcTimeout and maxRpcTimeout to values much greater than the existing totalTimeout of 5ms. Per comments in a generated client, the initialRpcTimeout and maxRpcTimeout should be ignored and set to 0 for LROs? Not really sure what the test is doing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW we had no GAPIC config for LRO polling initialRpcTimeout or maxRpcTimeout, so they would never be generated and only ever set by users...and they'd have the same behavior as non-LRO RPCs where a poll could run over the totalTimeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this test was testing for the existing incorrect behavior, where the RPC timeout didn't care if it exceeded the totalTimeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The totalTimeout in the context of LRO is the "total polling timeout" (gapic config). So, the duration a "synchronous" LRO call should poll before giving up.

This test was set up weirdly, as I described. PTAL.

.build(),
clock);
callSettings = callSettings.toBuilder().setPollingAlgorithm(pollingAlgorithm).build();
Expand Down