Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make a copy of the bytes array before passing to the channel, fixes #5470 #5893

Merged
merged 1 commit into from
Feb 27, 2024

Conversation

deviantintegral
Copy link
Collaborator

@deviantintegral deviantintegral commented Feb 23, 2024

The Issue

Docker Compose files can become corrupt. This is due to a race condition in the code that calls docker compose commands. I think this only comes up with ddev start because it's the only code that parses docker compose output - namely docker compose config.

How This PR Solves The Issue

#5340 added support for showing output while the containers are built. hanoii@3515174 switched from calling Text() to Bytes() to assist supporting real time output in debug commands. However, Bytes() has subtly different behaviour:

The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.

Unfortunately, this issue can take over 1000 instances to actually occur. That's why it hasn't come up yet on our Linux users locals, and also why retries in CI generally work. It's also why we originally thought it was a 1.22.4 issue, and it was! It was just hard to tell with such rare reoccurrence.

This PR fixes the issue by allocating and copying a new value to return through the channel. I did a search for any other calls to Bytes() and didn't find any.

Manual Testing Instructions

So far, I'm unable to reproduce this on colima / arm64. One time it only took ~20 loops, the last time it took nearly 1200. We can see an error we've seen before, and that in the case the compose file is completely empty.

Edit: I just reproduced the below on my mac. It took 614 iterations:

#!/bin/bash

echo "Generating first cases..."
ddev debug compose-config > /tmp/1.ddev.compose.yaml
md5sum /tmp/1.ddev.compose.yaml
ddev debug compose-config > /tmp/2.ddev.compose.yaml
md5sum /tmp/2.ddev.compose.yaml

echo "Looping..."
I=2
while diff -up /tmp/1.ddev.compose.yaml /tmp/2.ddev.compose.yaml; do
  ddev debug compose-config > /tmp/2.ddev.compose.yaml
  echo Try $I....
  ((I++))
  md5sum /tmp/2.ddev.compose.yaml
done
image

On macOS:

Try 614....
6ba83c77fda3524b8fdf2898c0c32cdd  /tmp/2.ddev.compose.yaml
--- /tmp/1.ddev.compose.yaml	2024-02-23 16:35:27
+++ /tmp/2.ddev.compose.yaml	2024-02-23 16:39:16
@@ -115,7 +115,7 @@ services:
               target: /mnt/ddev_config
               type: bind
             - bind:
-                create_host_path: true
+                cr-ate_host_path: true
               source: /Users/andrew/workspace/no-arq-backup/REDACTED/REDACTED/.ddev/redis
               target: /usr/local/etc/redis
               type: bind

I was also able to reproduce this over at Lullabot/drainpipe#453 (while simply testing the debug output!) and have run many retries with this change, and so far they've all passed.

Automated Testing Overview

This is hard to test! I'm not sure of a way to force the race condition to occur. Let's say it takes 5-10 minutes to run to reproduce this... do we really want that in all tests? Open to ideas here.

Related Issue Link(s)

https://github.com/orgs/ddev/discussions/5470

Release/Deployment Notes

None that I can think of.

@deviantintegral deviantintegral requested a review from a team as a code owner February 23, 2024 21:38
Copy link

github-actions bot commented Feb 23, 2024

@deviantintegral
Copy link
Collaborator Author

FYI I've seen it take even ~5000 iterations on my Mac before failing.

@rfay rfay force-pushed the fix-compose-stdin-yaml-race branch from 360ad56 to 9f5a0a1 Compare February 25, 2024 21:07
@rfay
Copy link
Member

rfay commented Feb 25, 2024

Rebased to get current set of tests.

@rfay
Copy link
Member

rfay commented Feb 25, 2024

Thanks so much for the thorough PR (and amazing troubleshooting and analysis). Especially thanks for showing the recreation scenario once you finally figured it out.

@rfay rfay changed the title fix: make a copy of the bytes array before passing to the channel fix: make a copy of the bytes array before passing to the channel, fixes #5470 Feb 25, 2024
@rfay rfay requested review from hanoii and stasadev February 25, 2024 21:17
@hanoii
Copy link
Collaborator

hanoii commented Feb 25, 2024

What and odd a and great find, I think this is fine!

Copy link
Collaborator

@hanoii hanoii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I'd be curious add is if this copying over technique is documented somewhere around or it was an occurrence based on the race condition.

@tyler36
Copy link
Collaborator

tyler36 commented Feb 26, 2024

@deviantintegral Great work!
What a journey that must have been.

@rfay
Copy link
Member

rfay commented Feb 26, 2024

Here's a slight adaptation of the wonderful demonstration script: https://gist.github.com/rfay/9affe8e5007596885cb93d8018d4952d

I'm at 32,000 112193 iterations on mac M1 without it failing yet :(

Edit: Oh, does this need a particular starting project type or setup? I'm running it in an empty directory. It looks like maybe you're in at Drupal project with redis add-on?

@deviantintegral
Copy link
Collaborator Author

Yeah, my results for reproducing were from a real-world project that I already had set up. But you could also setup a ddev project following the steps in the job at https://github.com/Lullabot/drainpipe/actions/runs/8023313989/job/21919539064 to set up the Drupal project (which was consumed and then failed in https://github.com/Lullabot/drainpipe/actions/runs/8023313989/job/21919652968).

@rfay
Copy link
Member

rfay commented Feb 26, 2024

Thanks @deviantintegral - I'm quite sure this is good but also want to make sure the but is reproducible so we can really understand it. I don't have much hope for an automated test!

@deviantintegral
Copy link
Collaborator Author

I was thinking about testing too. I think the only reliable way to reproduce this without uncertainty would be to slow down the ddev go process. That would allow the goroutine to scan for additional tokens before the other one has consumed one. But, that would require us to insert automated test code to insert waits and other conditions into the production code which doesn't seem like a good plan.

And then I thought... "go should have tools like C to try and detect races"... and TIL'ed it's built in!

https://go.dev/doc/articles/race_detector

I rebuilt locally with this minor hack to the Makefile.

diff --git a/Makefile b/Makefile
index e609ca7e6..44763e945 100644
--- a/Makefile
+++ b/Makefile
@@ -81,7 +81,7 @@ $(TARGETS): mkcert $(GOFILES)
 	export GOOS="$${TARGET%_*}" GOARCH="$${TARGET#*_}" CGO_ENABLED=0 GOPATH="$(PWD)/$(GOTMP)" GOCACHE="$(PWD)/$(GOTMP)/.cache" && \
 	mkdir -p $(GOTMP)/{.cache,pkg,src,bin/$$TARGET} && \
 	chmod 777 $(GOTMP)/{.cache,pkg,src,bin/$$TARGET} && \
-	go build -o $(GOTMP)/bin/$$TARGET -installsuffix static -ldflags " $(LDFLAGS) " $(SRC_AND_UNDER)
+	go build -race -o $(GOTMP)/bin/$$TARGET -installsuffix static -ldflags " $(LDFLAGS) " $(SRC_AND_UNDER)
 	$( shell if [ -d $(GOTMP) ]; then chmod -R u+w $(GOTMP); fi )
 	@echo $(VERSION) >VERSION.txt

That immediately identified the bad line of code. I wish I had known of this months ago, it would have turned hours of debugging into a minute or two.

Generating first cases...
==================
WARNING: DATA RACE
Write at 0x00c00010eff8 by goroutine 27:
  runtime.racewriterange()
      <autogenerated>:1 +0x14
  internal/poll.ignoringEINTRIO()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/internal/poll/fd_unix.go:736 +0x348
  internal/poll.(*FD).Read()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/internal/poll/fd_unix.go:160 +0x15c
  os.(*File).read()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/os/file_posix.go:29 +0xb0
  os.(*File).Read()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/os/file.go:118 +0x90
  bufio.(*Scanner).Scan()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/bufio/scan.go:219 +0xef4
  github.com/ddev/ddev/pkg/dockerutil.ComposeCmd.func1()
      /Users/andrew/workspace/no-arq-backup/ddev/pkg/dockerutil/dockerutils.go:666 +0x78

Previous read at 0x00c00010efff by main goroutine:
  runtime.slicecopy()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/runtime/slice.go:325 +0x0
  bytes.(*Buffer).Write()
      /opt/homebrew/Cellar/go/1.22.0/libexec/src/bytes/buffer.go:181 +0x104
  github.com/ddev/ddev/pkg/dockerutil.ComposeCmd()
      /Users/andrew/workspace/no-arq-backup/ddev/pkg/dockerutil/dockerutils.go:687 +0xcd8
  github.com/ddev/ddev/pkg/ddevapp.(*DdevApp).WriteDockerComposeYAML()
      /Users/andrew/workspace/no-arq-backup/ddev/pkg/ddevapp/compose_yaml.go:39 +0x234
  github.com/ddev/ddev/cmd/ddev/cmd.init.func10()
      /Users/andrew/workspace/no-arq-backup/ddev/cmd/ddev/cmd/debug-compose-config.go:36 +0xf8
  github.com/spf13/cobra.(*Command).execute()
      /Users/andrew/workspace/no-arq-backup/ddev/vendor/github.com/spf13/cobra/command.go:987 +0xc24
  github.com/spf13/cobra.(*Command).ExecuteC()
      /Users/andrew/workspace/no-arq-backup/ddev/vendor/github.com/spf13/cobra/command.go:1115 +0x4b8
  github.com/spf13/cobra.(*Command).Execute()
      /Users/andrew/workspace/no-arq-backup/ddev/vendor/github.com/spf13/cobra/command.go:1039 +0x34
  github.com/ddev/ddev/cmd/ddev/cmd.Execute()
      /Users/andrew/workspace/no-arq-backup/ddev/cmd/ddev/cmd/root.go:163 +0x20
  main.main()
      /Users/andrew/workspace/no-arq-backup/ddev/cmd/ddev/main.go:27 +0x3d0

Goroutine 27 (running) created at:
  github.com/ddev/ddev/pkg/dockerutil.ComposeCmd()
      /Users/andrew/workspace/no-arq-backup/ddev/pkg/dockerutil/dockerutils.go:665 +0x97c
  github.com/ddev/ddev/pkg/ddevapp.(*DdevApp).WriteDockerComposeYAML()
      /Users/andrew/workspace/no-arq-backup/ddev/pkg/ddevapp/compose_yaml.go:39 +0x234
  github.com/ddev/ddev/cmd/ddev/cmd.init.func10()
      /Users/andrew/workspace/no-arq-backup/ddev/cmd/ddev/cmd/debug-compose-config.go:36 +0xf8
  github.com/spf13/cobra.(*Command).execute()
      /Users/andrew/workspace/no-arq-backup/ddev/vendor/github.com/spf13/cobra/command.go:987 +0xc24
  github.com/spf13/cobra.(*Command).ExecuteC()
      /Users/andrew/workspace/no-arq-backup/ddev/vendor/github.com/spf13/cobra/command.go:1115 +0x4b8
  github.com/spf13/cobra.(*Command).Execute()
      /Users/andrew/workspace/no-arq-backup/ddev/vendor/github.com/spf13/cobra/command.go:1039 +0x34
  github.com/ddev/ddev/cmd/ddev/cmd.Execute()
      /Users/andrew/workspace/no-arq-backup/ddev/cmd/ddev/cmd/root.go:163 +0x20
  main.main()
      /Users/andrew/workspace/no-arq-backup/ddev/cmd/ddev/main.go:27 +0x3d0

Binaries built with -race are notably slower to build and run. In this case, if we'd had a test case around ddev debug compose-config we would have caught this. I may throw up a quick PR just adding -race to the build to see what we get.

@rfay
Copy link
Member

rfay commented Feb 26, 2024

It sounds like we can do an alternate build and run modest tests and verify that it doesn't catch races. Awesome!

@rfay
Copy link
Member

rfay commented Feb 27, 2024

Yay, got it to fail with the updated gist after 3463 loops

Try 3463....
f7ae1fc718d0e4fbff9198e55c581b0b  /tmp/2.ddev.compose.yaml
--- /tmp/1.ddev.compose.yaml	2024-02-26 19:19:19.406335456 -0700
+++ /tmp/2.ddev.compose.yaml	2024-02-26 20:05:45.006647712 -0700
@@ -164,7 +164,7 @@ services:
         networks:
             ddev_default: null
             default: null
-        shm_size: "2147483648"
+        shm_sizl: "2147483648"
         volumes:
             - consistency: cached
               source: ../

@rfay
Copy link
Member

rfay commented Feb 27, 2024

Same test ran 51800 loops without failing using this patch, double-yay!

Copy link
Member

@rfay rfay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason this goroutine code was added (b1810f3) was to be able to easily display the Docker build stage of a docker-compose up. That's it. But it ended up affecting every single docker-compose command (including here, docker-compose config.)

So the bottom line is that this function has ballooned out of control. The underlying problem is the unneeded complexity of the function.

I imagine the right thing to do is to accept this PR and then rewrite the function elsewhere, but perhaps it should be rewritten to solve this. It can be way simpler, or alternately, only use the goroutine on docker-compose up.

Regardless, moving to a simpler function needs to be done.

@rfay
Copy link
Member

rfay commented Feb 27, 2024

We're going to go with this and then rewrite the function. Thanks so much!

@rfay rfay merged commit 73abc7a into ddev:master Feb 27, 2024
24 of 26 checks passed
@deviantintegral deviantintegral deleted the fix-compose-stdin-yaml-race branch February 29, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants