bulk: import jobs often fail with server sent GOAWAY and closed the connection #65926

Closed
adityamaru opened this issue Jun 1, 2021 · 4 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

@adityamaru
Contributor

adityamaru commented Jun 1, 2021

Time and again we have seen our roachtests fail with:
gs://cockroach-fixtures/tpce-csv/customers=2000000/746/NewsItem.txt?AUTH=implicit: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug="server_shutting_down"

This is an infrastructure flake, and the only fix today is to retry the import. We should probably retry internally instead so that the job does not fail. The retry could live at the job resumer level, or the error could be marked as retriable in our external storage resuming reader implementations. Either way, the focus of this issue should be to identify which error type bubbles up in these scenarios so that we can intercept it and treat it as retriable.

Epic: CRDB-2556

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery labels Jun 1, 2021
@adityamaru adityamaru added this to Triage in Disaster Recovery Backlog via automation Jun 1, 2021
@mwang1026 mwang1026 moved this from Triage to Bug in Disaster Recovery Backlog Jun 1, 2021
@ajwerner
Contributor

ajwerner commented Jun 1, 2021

cc @stevendanna, who has been thinking about retry policies for CDC sinks (one of which is GCS).

@adityamaru
Contributor Author

We're waiting on the Google SDK to retry internally when it sees a GOAWAY error (googleapis/google-cloud-go#4226). Once that fix is released, we should bump the SDK version to pick it up.

Disaster Recovery Backlog automation moved this from Bug to Done Jun 24, 2021
@knz knz reopened this Jul 28, 2021
Disaster Recovery Backlog automation moved this from Done to Triage Jul 28, 2021
@knz
Contributor

knz commented Jul 28, 2021

Reopening this as a tracking issue. The GOAWAY fix cannot be used yet, see #68158.

craig bot pushed a commit that referenced this issue Jul 28, 2021
68158: Revert "go.mod: bump cloud.google.com/go/storage to v1.16.0" r=tbg,erikgrinaker a=knz

Reverts #67806

Informs #68154
Informs #65926 

Fixes #68150
Fixes #68152
Fixes #68148
Fixes #68145
Fixes #68144
Fixes #68143

This reverts commit e52ddc8.

This is because that commit contains an upgrade to gRPC 1.39 which
contains a race condition.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@shermanCRL shermanCRL moved this from Triage to Import/Export in Disaster Recovery Backlog Aug 2, 2021
@adityamaru
Contributor Author

Should be closed by #68650.

Disaster Recovery Backlog automation moved this from Import/Export to Done Aug 17, 2021

3 participants