Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes. #748

sming · 2021-01-24T21:10:02Z

Implement Findings

Use Case Resolved: unknown Bigtable timeout & retry behaviour

I am a Heroic developer
who wants to make sure that our timeout & retry changes will work as expected
so that we don't have a disaster on our hands!

Design & Implementation Notes

this ticket was spun up off of peterk & adam's chat w.r.t. upgrading the Bigtable java client lib
this PR depends a lot on what’s discovered as part of “Discover how Heroic reacts to BigtableRetriesExhaustedException” (Issue Bigtable server-side and client-side & Heroic-side behaviour analysis #741), for which you should check out the Heroic API Request Timeout Settings document for the latest updates/findings
currently it’s looking like we want a 6s timeout across the board
but we don’t know why 65 data channels are created and why there are 72 timeouts logged before the 6s timeout occurs

Concrete changes to make

delete maxElapsedBackoffMs - it’s irrelevant for our Use Case (double check with Adam about his as it’s still in his doc)
upgrade the Bigtable java lib from 1.12.1 to 1.18.1 and check that we see 2x retries before the 6s timeout (see Slack thread with Adam / testing Doc). We know we need this as it resolves 2 or 3 retry-related bugs.
fine-tune all our *TimeoutMs and *TimeoutMillis settings

sming · 2021-02-11T16:04:26Z

we finally know how API clients will experience the BT Timeout :

they’ll get a 200. Which is … interesting. And misleading IMO.
they get no results. Well, at least for the query I was running.
they get the following error message in the response body in errors[0].error :
"Some fetches failed (100) or were cancelled (870), caused by Some fetches failed (100) or were cancelled (870)"

sming · 2021-02-11T16:31:42Z

we (Prism) need to decide if the above is acceptable. @malish8632 's argument is that this is the current behaviour and always has been, hence no changes are necessary.

My argument is that many more users who've never received a timeout before will now receive one, in the shape of a 200, which is super misleading cos it actually failed, in effect.

Hence the agreed-upon action is to phrase the 1, 2, 3 above as a "reminder" of how heroic returns requests that time out.

sming added the heroic stability label Jan 24, 2021

sming self-assigned this Jan 24, 2021

project-bot bot added this to In progress in Observability Kanban Jan 24, 2021

sming changed the title ~~Implement changes from @peterk & Adam’s analysis of Bigtable timeouts & retries~~ Implement changes from @peterk & Adam’s analysis of Bigtable timeouts & retries (e.g. bump java Bigtable lib to v1.18.2) Jan 24, 2021

sming changed the title ~~Implement changes from @peterk & Adam’s analysis of Bigtable timeouts & retries (e.g. bump java Bigtable lib to v1.18.2)~~ Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes. Feb 9, 2021

sming moved this from In progress to Done in Observability Kanban Feb 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes. #748

Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes. #748

sming commented Jan 24, 2021 •

edited

sming commented Feb 11, 2021

sming commented Feb 11, 2021

Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes. #748

Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes. #748

Comments

sming commented Jan 24, 2021 • edited

Implement Findings

Use Case Resolved: unknown Bigtable timeout & retry behaviour

Design & Implementation Notes

Concrete changes to make

sming commented Feb 11, 2021

sming commented Feb 11, 2021

sming commented Jan 24, 2021 •

edited