backupccl: nil pointer crash in storage.(*Writer).open during backup in 22.2.9 #103597

Closed
renatolabs opened this issue May 18, 2023 · 2 comments · Fixed by #104187
Assignees
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

renatolabs (Collaborator) commented May 18, 2023

During a roachtest (#103228), two nodes crashed while a backup was taken, both due to a panic within the GCS library:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x4127782]

goroutine 309871 [running]:
panic({0x4bb3320, 0x8fb3dd0})
        GOROOT/src/runtime/panic.go:987 +0x3ba fp=0xc014aeddf8 sp=0xc014aedd38 pc=0x48d77a
runtime.panicmem(...)
        GOROOT/src/runtime/panic.go:260
runtime.sigpanic()
        GOROOT/src/runtime/signal_unix.go:835 +0x2f6 fp=0xc014aede48 sp=0xc014aeddf8 pc=0x4a4636
cloud.google.com/go/storage.(*Writer).open.func1()
        cloud.google.com/go/storage/external/com_google_cloud_go_storage/writer.go:162 +0x1a2 fp=0xc014aedfe0 sp=0xc014aede48 pc=0x4127782
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc014aedfe8 sp=0xc014aedfe0 pc=0x4c2401
created by cloud.google.com/go/storage.(*Writer).open
        cloud.google.com/go/storage/external/com_google_cloud_go_storage/writer.go:152 +0x495

Every node in the 4-node cluster was running 22.2.9.

Stack traces for the two nodes that crashed (n3 and n4) are attached below. Note that a very similar crash had been reported before [1] and was deemed fixed by [2]; however, the issue does not appear to be fully resolved.

Reproduction

Running the backup-restore/mixed-version roachtest in #103228 with seed -4303022106448172299 seems to reproduce this with high probability (~1h30m after test start).

n3_stacks.txt
n4_stacks.txt

Roachtest artifacts: https://console.cloud.google.com/storage/browser/cockroach-tmp/103597/roachtest_artifacts;tab=objects?project=cockroach-shared&prefix=&forceOnObjectsSortingFiltering=false&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))

[1] googleapis/google-cloud-go#4167
[2] #65660

Jira issue: CRDB-28094

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-disaster-recovery labels May 18, 2023
blathers-crl bot commented May 18, 2023

cc @cockroachdb/disaster-recovery

dt (Member) commented May 30, 2023

Some progress: I patched 22.2.9 so that we keep track of all open writers until they are closed, and I can confirm this appears to be a bug in our usage, not in the upstream library:

panic: open writer (1) when closing client! "data/869673008170696708.sst"

goroutine 432607 [running]:
panic({0x4eead00, 0xc008194790})
        GOROOT/src/runtime/panic.go:987 +0x3ba fp=0xc00274fa38 sp=0xc00274f978 pc=0x48aa9a
github.com/cockroachdb/cockroach/pkg/cloud/gcp.(*gcsStorage).Close(0xc009534d70)
        github.com/cockroachdb/cockroach/pkg/cloud/gcp/gcs_storage.go:402 +0x2b3 fp=0xc00274fb70 sp=0xc00274fa38 pc=0x4143e73
github.com/cockroachdb/cockroach/pkg/cloud.(*esWrapper).Close(0xc004953a40?)
        <autogenerated>:1 +0x2a fp=0xc00274fb88 sp=0xc00274fb70 pc=0x17fe9ca
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.runBackupProcessor.func2.1()
        github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_processor.go:563 +0x73 fp=0xc00274fc38 sp=0xc00274fb88 pc=0x3a98dd3
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.runBackupProcessor.func2({0x62da980, 0xc004bc7700})
        github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_processor.go:575 +0x422 fp=0xc00274ff58 sp=0xc00274fc38 pc=0x3a98ce2

backupccl/backup_processor.go:563 is the storage.Close call here:

		defer func() {
			err := sink.Close()
			err = errors.CombineErrors(storage.Close(), err)
			// ...
		}()

So we close the sink, then close the storage, but somehow the sink's Close returned with one of its writers still unclosed. Going to add a bit more logging to see if I can figure out how.

craig bot pushed a commit that referenced this issue Jun 1, 2023
104187: backupccl: always track opened writer r=dt a=dt

Previously we would do things, such as wrapping the opened writer with an encryption shim, after opening the writer but before storing it in the sink. This meant that if the sink opened a writer but failed to save it and was then closed, it would fail to close that writer. This change saves the writer as soon as it is opened, and saves it again if it is later wrapped, so that we cannot lose track of any successfully opened writer.

Fixes: #103597.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig craig bot closed this as completed in 64dea46 Jun 1, 2023
Disaster Recovery Backlog automation moved this from Triage to Done Jun 1, 2023
blathers-crl bot pushed a commit that referenced this issue Jun 1, 2023