This repository has been archived by the owner on Mar 27, 2021. It is now read-only.

Bigtable server-side and client-side & Heroic-side behaviour analysis #741

Open
sming opened this issue Jan 11, 2021 · 1 comment

sming commented Jan 11, 2021

Bigtable server-side and client-side & Heroic-side behaviour analysis

  • I am a Heroic dev, implementing shorter timeouts
  • Who wants to know how Heroic will react to the resulting exceptions
  • So that I can be confident that Heroic will not be negatively impacted by rolling out the shorter timeouts & retries

Proposed Solution

  • Clone Adam’s fork of the java-bigtable client lib (see below) and use the integration test in this patch file to provoke a BigtableRetriesExhaustedException and observe how Heroic responds to it.
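For reference, a minimal sketch of the Heroic-side observation this is after, assuming bigtable-client-core (which defines BigtableRetriesExhaustedException) is on the classpath; readMetricRows() is a hypothetical stand-in for whatever Heroic code path issues the ReadRows call under test:

import com.google.cloud.bigtable.grpc.scanner.BigtableRetriesExhaustedException;

public final class RetriesExhaustedProbe {

    public static void main(final String[] args) {
        try {
            // Hypothetical stand-in for the Heroic query path that the
            // latency-injecting emulator should push past its retry budget.
            readMetricRows();
        } catch (final BigtableRetriesExhaustedException e) {
            // This is the behaviour to trace end-to-end: does Heroic surface
            // this as a query error, a partial result, or silent retries?
            System.err.println("Bigtable retries exhausted: " + e.getMessage());
        }
    }

    private static void readMetricRows() throws BigtableRetriesExhaustedException {
        // Placeholder only: wire this to a real query through Heroic's API.
        throw new UnsupportedOperationException("not wired to Heroic yet");
    }
}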

Design & Implementation Notes

  • note that the above test will need to be changed to better replicate a user query coming into the API, as we need to see the full impact of this exception rather than just an isolated test context
  • here are Adam’s instructions from Slack:
git clone https://github.com/AdamBSteele/google-cloud-go
cd google-cloud-go/bigtable/cmd/emulator 
go run . --inject-latency="ReadRows:p50:100ms" --inject-latency="ReadRows:p99:5s"
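To sanity-check the injected latency before involving Heroic at all, something like the following can be pointed at the emulator. This is only a sketch: it uses the newer google-cloud-bigtable data client for brevity (Heroic itself goes through bigtable-client-core), and the port, project/instance ids and table name are assumptions that must match whatever the emulator reports and contains:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.BigtableDataSettings;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;

public final class EmulatorLatencyProbe {

    public static void main(final String[] args) throws Exception {
        // Assumed port: adjust to whatever the emulator prints on startup.
        final BigtableDataSettings settings = BigtableDataSettings
            .newBuilderForEmulator(8086)
            .setProjectId("fake-project")     // arbitrary against the emulator
            .setInstanceId("fake-instance")
            .build();

        try (BigtableDataClient client = BigtableDataClient.create(settings)) {
            final long start = System.nanoTime();
            // ReadRows is the RPC the injected p50/p99 latency applies to;
            // the "metrics" table is assumed to have been created beforehand.
            for (final Row row : client.readRows(Query.create("metrics"))) {
                System.out.println(row.getKey().toStringUtf8());
            }
            final long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("ReadRows completed in " + elapsedMs + " ms");
        }
    }
}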
@sming sming created this issue from a note in Observability Kanban (To do) Jan 11, 2021
@sming sming self-assigned this Jan 11, 2021
@project-bot project-bot bot moved this from To do to In progress in Observability Kanban Jan 11, 2021
@sming sming changed the title Discover how Heroic reacts to BigtableRetriesExhaustedException Bigtable server-side and client-side & Heroic-side behaviour analysis Jan 21, 2021

sming commented Jan 21, 2021

Update Jan 21, 2021

  • this has initially turned into an analysis of Bigtable’s timeout behaviour itself
  • I discovered that there are 65 data channels and 72 retries total in the timeout period (6s). There should be 65 * 2 = 130 retries.
    • Adam says that 1.18.1 results in 65 * 2 retries and strongly recommends we upgrade to it from 1.12.1. It also has 2 other semi-critical bugfixes, so I think we should (see the version-check sketch after this list).
  • The analysis doc Heroic API Request Timeout Settings contains the up-to-date findings
  • I continue to work with Adam on Slack to fully understand what we’re seeing in the logs.
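A quick way to confirm which client version Heroic actually loads before and after the bump; this is a sketch assuming Heroic's Bigtable backend still goes through bigtable-client-core's BigtableSession and that the jar manifest carries an Implementation-Version:

import com.google.cloud.bigtable.grpc.BigtableSession;

public final class ClientVersionCheck {

    public static void main(final String[] args) {
        // Reads the Implementation-Version from the jar manifest of whatever
        // bigtable-client-core build is actually on the classpath; may print
        // null if the manifest does not carry the attribute.
        final String version = BigtableSession.class.getPackage().getImplementationVersion();
        System.out.println("bigtable-client-core on classpath: " + version);
    }
}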
