This repository has been archived by the owner on Mar 27, 2021. It is now read-only.

Bigtable server-side and client-side & Heroic-side behaviour analysis #741

Open
sming opened this issue Jan 11, 2021 · 1 comment

sming commented Jan 11, 2021

Bigtable server-side and client-side & Heroic-side behaviour analysis

  • I am a Heroic dev, implementing shorter timeouts
  • Who wants to know how Heroic will react to the resulting exceptions
  • So that I can be confident that Heroic will not be negatively impacted by rolling out the shorter timeouts & retries

Proposed Solution

  • Clone Adam’s fork of the java-bigtable client lib (see below) and use the integration test in this patch file to provoke a BigtableRetriesExhaustedException and observe how Heroic responds to it.
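For reference, a minimal sketch of the Heroic-side observation this is after, assuming bigtable-client-core (which defines BigtableRetriesExhaustedException) is on the classpath; readMetricRows() is a hypothetical stand-in for whatever Heroic code path issues the ReadRows call under test:

import com.google.cloud.bigtable.grpc.scanner.BigtableRetriesExhaustedException;

public final class RetriesExhaustedProbe {

    public static void main(final String[] args) {
        try {
            // Hypothetical stand-in for the Heroic query path that the
            // latency-injecting emulator should push past its retry budget.
            readMetricRows();
        } catch (final BigtableRetriesExhaustedException e) {
            // This is the behaviour to trace end-to-end: does Heroic surface
            // this as a query error, a partial result, or silent retries?
            System.err.println("Bigtable retries exhausted: " + e.getMessage());
        }
    }

    private static void readMetricRows() throws BigtableRetriesExhaustedException {
        // Placeholder only: wire this to a real query through Heroic's API.
        throw new UnsupportedOperationException("not wired to Heroic yet");
    }
}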

Design & Implementation Notes

  • note that the above test will need to be changed to better replicate a user query coming into the API, as we need to see the full impact of this exception rather than just an isolated test context
  • here are Adam’s instructions from Slack:
git clone https://github.com/AdamBSteele/google-cloud-go
cd google-cloud-go/bigtable/cmd/emulator 
go run . --inject-latency="ReadRows:p50:100ms" --inject-latency="ReadRows:p99:5s"
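To sanity-check the injected latency before involving Heroic at all, something like the following can be pointed at the emulator. This is only a sketch: it uses the newer google-cloud-bigtable data client for brevity (Heroic itself goes through bigtable-client-core), and the port, project/instance ids and table name are assumptions that must match whatever the emulator reports and contains:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.BigtableDataSettings;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;

public final class EmulatorLatencyProbe {

    public static void main(final String[] args) throws Exception {
        // Assumed port: adjust to whatever the emulator prints on startup.
        final BigtableDataSettings settings = BigtableDataSettings
            .newBuilderForEmulator(8086)
            .setProjectId("fake-project")     // arbitrary against the emulator
            .setInstanceId("fake-instance")
            .build();

        try (BigtableDataClient client = BigtableDataClient.create(settings)) {
            final long start = System.nanoTime();
            // ReadRows is the RPC the injected p50/p99 latency applies to;
            // the "metrics" table is assumed to have been created beforehand.
            for (final Row row : client.readRows(Query.create("metrics"))) {
                System.out.println(row.getKey().toStringUtf8());
            }
            final long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("ReadRows completed in " + elapsedMs + " ms");
        }
    }
}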
@sming sming created this issue from a note in Observability Kanban (To do) Jan 11, 2021
@sming sming self-assigned this Jan 11, 2021
@project-bot project-bot bot moved this from To do to In progress in Observability Kanban Jan 11, 2021
@sming sming changed the title Discover how Heroic reacts to BigtableRetriesExhaustedException Bigtable server-side and client-side & Heroic-side behaviour analysis Jan 21, 2021

sming commented Jan 21, 2021

Update Jan 21, 2021

  • this has initially turned into an analysis of Bigtable’s timeout behaviour itself
  • I discovered that there are 65 data channels and 72 retries total in the timeout period (6s). There should be 65 * 2 = 130 retries.
    • Adam says that 1.18.1 results in 65 * 2 retries and strongly recommends we upgrade to it from 1.12.1. It also has 2 other semi-critical bugfixes, so I think we should (see the version-check sketch after this list).
  • The analysis doc Heroic API Request Timeout Settings contains the up-to-date findings
  • I continue to work with Adam on Slack to fully understand what we’re seeing in the logs.
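A quick way to confirm which client version Heroic actually loads before and after the bump; this is a sketch assuming Heroic's Bigtable backend still goes through bigtable-client-core's BigtableSession and that the jar manifest carries an Implementation-Version:

import com.google.cloud.bigtable.grpc.BigtableSession;

public final class ClientVersionCheck {

    public static void main(final String[] args) {
        // Reads the Implementation-Version from the jar manifest of whatever
        // bigtable-client-core build is actually on the classpath; may print
        // null if the manifest does not carry the attribute.
        final String version = BigtableSession.class.getPackage().getImplementationVersion();
        System.out.println("bigtable-client-core on classpath: " + version);
    }
}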
