backupccl: nil pointer crash in storage.(*Writer).open during backup in 22.2.9 #103597

Closed
renatolabs opened this issue May 18, 2023 · 2 comments · Fixed by #104187
Assignees
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

renatolabs (Collaborator) commented May 18, 2023

During a roachtest (#103228), two nodes crashed while a backup was taken, both due to a panic within the GCS library:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x4127782]

goroutine 309871 [running]:
panic({0x4bb3320, 0x8fb3dd0})
        GOROOT/src/runtime/panic.go:987 +0x3ba fp=0xc014aeddf8 sp=0xc014aedd38 pc=0x48d77a
runtime.panicmem(...)
        GOROOT/src/runtime/panic.go:260
runtime.sigpanic()
        GOROOT/src/runtime/signal_unix.go:835 +0x2f6 fp=0xc014aede48 sp=0xc014aeddf8 pc=0x4a4636
cloud.google.com/go/storage.(*Writer).open.func1()
        cloud.google.com/go/storage/external/com_google_cloud_go_storage/writer.go:162 +0x1a2 fp=0xc014aedfe0 sp=0xc014aede48 pc=0x4127782
runtime.goexit()
        GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc014aedfe8 sp=0xc014aedfe0 pc=0x4c2401
created by cloud.google.com/go/storage.(*Writer).open
        cloud.google.com/go/storage/external/com_google_cloud_go_storage/writer.go:152 +0x495

Every node in the 4-node cluster was running 22.2.9.

Stack traces for the two nodes that crashed (n3 and n4) are attached below. Note that a very similar crash had been reported before [1] and was deemed fixed by [2]; however, the issue does not appear to be fully resolved.

Reproduction

Running the backup-restore/mixed-version roachtest in #103228 with seed -4303022106448172299 seems to reproduce this with high probability (~1h30m after test start).

n3_stacks.txt
n4_stacks.txt

Roachtest artifacts: https://console.cloud.google.com/storage/browser/cockroach-tmp/103597/roachtest_artifacts;tab=objects?project=cockroach-shared&prefix=&forceOnObjectsSortingFiltering=false&pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))

[1] googleapis/google-cloud-go#4167
[2] #65660

Jira issue: CRDB-28094

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-disaster-recovery labels May 18, 2023
blathers-crl bot commented May 18, 2023

cc @cockroachdb/disaster-recovery

dt (Member) commented May 30, 2023

Some progress: I patched 22.2.9 so that we keep track of all open writers until they are closed, and I can confirm this appears to be a bug in our usage, not in the upstream library:

panic: open writer (1) when closing client! "data/869673008170696708.sst"

goroutine 432607 [running]:
panic({0x4eead00, 0xc008194790})
        GOROOT/src/runtime/panic.go:987 +0x3ba fp=0xc00274fa38 sp=0xc00274f978 pc=0x48aa9a
github.com/cockroachdb/cockroach/pkg/cloud/gcp.(*gcsStorage).Close(0xc009534d70)
        github.com/cockroachdb/cockroach/pkg/cloud/gcp/gcs_storage.go:402 +0x2b3 fp=0xc00274fb70 sp=0xc00274fa38 pc=0x4143e73
github.com/cockroachdb/cockroach/pkg/cloud.(*esWrapper).Close(0xc004953a40?)
        <autogenerated>:1 +0x2a fp=0xc00274fb88 sp=0xc00274fb70 pc=0x17fe9ca
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.runBackupProcessor.func2.1()
        github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_processor.go:563 +0x73 fp=0xc00274fc38 sp=0xc00274fb88 pc=0x3a98dd3
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.runBackupProcessor.func2({0x62da980, 0xc004bc7700})
        github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_processor.go:575 +0x422 fp=0xc00274ff58 sp=0xc00274fc38 pc=0x3a98ce2

backupccl/backup_processor.go:563 is the storage.Close call here:

		defer func() {
			err := sink.Close()
			err = errors.CombineErrors(storage.Close(), err)
			// ...
		}()

So we close the sink, then close the storage, but somehow the sink's Close returned with one of its writers still unclosed. Going to add a bit more logging to see if I can figure out how.

craig bot pushed a commit that referenced this issue Jun 1, 2023
104187: backupccl: always track opened writer r=dt a=dt

Previously we would do things, such as wrapping the opened writer with an encryption shim, after opening the writer but before storing it in the sink. This meant that if the sink opened a writer but failed to save it and was then closed, it would fail to close that writer. This change saves the writer as soon as it is opened, and saves it again if it is later wrapped, so that we cannot lose track of any successfully opened writer.

Fixes: #103597.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig craig bot closed this as completed in 64dea46 Jun 1, 2023
Disaster Recovery Backlog automation moved this from Triage to Done Jun 1, 2023
blathers-crl bot pushed a commit that referenced this issue Jun 1, 2023