-
-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: make a copy of the bytes array before passing to the channel, fixes #5470 #5893
Conversation
Download the artifacts for this pull request:
See Testing a PR |
FYI I've seen it take even ~5000 iterations on my Mac before failing. |
360ad56
to
9f5a0a1
Compare
Rebased to get current set of tests. |
Thanks so much for the thorough PR (and amazing troubleshooting and analysis). Especially thanks for showing the recreation scenario once you finally figured it out. |
What and odd a and great find, I think this is fine! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I'd be curious add is if this copying over technique is documented somewhere around or it was an occurrence based on the race condition.
@deviantintegral Great work! |
Here's a slight adaptation of the wonderful demonstration script: https://gist.github.com/rfay/9affe8e5007596885cb93d8018d4952d I'm at Edit: Oh, does this need a particular starting project type or setup? I'm running it in an empty directory. It looks like maybe you're in at Drupal project with redis add-on? |
Yeah, my results for reproducing were from a real-world project that I already had set up. But you could also setup a ddev project following the steps in the job at https://github.com/Lullabot/drainpipe/actions/runs/8023313989/job/21919539064 to set up the Drupal project (which was consumed and then failed in https://github.com/Lullabot/drainpipe/actions/runs/8023313989/job/21919652968). |
Thanks @deviantintegral - I'm quite sure this is good but also want to make sure the but is reproducible so we can really understand it. I don't have much hope for an automated test! |
I was thinking about testing too. I think the only reliable way to reproduce this without uncertainty would be to slow down the ddev go process. That would allow the goroutine to scan for additional tokens before the other one has consumed one. But, that would require us to insert automated test code to insert waits and other conditions into the production code which doesn't seem like a good plan. And then I thought... "go should have tools like C to try and detect races"... and TIL'ed it's built in! https://go.dev/doc/articles/race_detector I rebuilt locally with this minor hack to the Makefile.
That immediately identified the bad line of code. I wish I had known of this months ago, it would have turned hours of debugging into a minute or two.
Binaries built with |
It sounds like we can do an alternate build and run modest tests and verify that it doesn't catch races. Awesome! |
Yay, got it to fail with the updated gist after 3463 loops
|
Same test ran 51800 loops without failing using this patch, double-yay! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason this goroutine code was added (b1810f3) was to be able to easily display the Docker build stage of a docker-compose up
. That's it. But it ended up affecting every single docker-compose command (including here, docker-compose config
.)
So the bottom line is that this function has ballooned out of control. The underlying problem is the unneeded complexity of the function.
I imagine the right thing to do is to accept this PR and then rewrite the function elsewhere, but perhaps it should be rewritten to solve this. It can be way simpler, or alternately, only use the goroutine on docker-compose up
.
Regardless, moving to a simpler function needs to be done.
We're going to go with this and then rewrite the function. Thanks so much! |
The Issue
Docker Compose files can become corrupt. This is due to a race condition in the code that calls docker compose commands. I think this only comes up with ddev start because it's the only code that parses docker compose output - namely
docker compose config
.How This PR Solves The Issue
#5340 added support for showing output while the containers are built. hanoii@3515174 switched from calling
Text()
toBytes()
to assist supporting real time output in debug commands. However, Bytes() has subtly different behaviour:Unfortunately, this issue can take over 1000 instances to actually occur. That's why it hasn't come up yet on our Linux users locals, and also why retries in CI generally work. It's also why we originally thought it was a 1.22.4 issue, and it was! It was just hard to tell with such rare reoccurrence.
This PR fixes the issue by allocating and copying a new value to return through the channel. I did a search for any other calls to
Bytes()
and didn't find any.Manual Testing Instructions
So far, I'm unable to reproduce this on colima / arm64. One time it only took ~20 loops, the last time it took nearly 1200. We can see an error we've seen before, and that in the case the compose file is completely empty.Edit: I just reproduced the below on my mac. It took 614 iterations:
On macOS:
I was also able to reproduce this over at Lullabot/drainpipe#453 (while simply testing the debug output!) and have run many retries with this change, and so far they've all passed.
Automated Testing Overview
This is hard to test! I'm not sure of a way to force the race condition to occur. Let's say it takes 5-10 minutes to run to reproduce this... do we really want that in all tests? Open to ideas here.
Related Issue Link(s)
https://github.com/orgs/ddev/discussions/5470
Release/Deployment Notes
None that I can think of.