Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updating over tailscale (sometimes!) fails with #199

Open
1 task done
stapelberg opened this issue Apr 26, 2023 · 7 comments
Open
1 task done

updating over tailscale (sometimes!) fails with #199

stapelberg opened this issue Apr 26, 2023 · 7 comments
Labels

Comments

@stapelberg
Copy link
Contributor

Platform

I’m using:

  • gokrazy/raspizero2w

Observed behavior

Updating gokrazy on my Raspberry Pi Zero 2W over tailscale became flaky about 10 days ago. When it doesn’t work, it fails with this error message:

2023/04/25 07:16:51 reboot: Post "http://gokrazy:***@consrv/reboot": read tcp 100.101.67.78:46166->100.108.215.49:80: read: connection reset by peer

Not sure yet what the trigger is, but maybe we need to programmatically ignore this error in the updater for now.

@stapelberg stapelberg added the bug label Apr 26, 2023
@nhanb
Copy link

nhanb commented May 18, 2023

I consistently get this error when running gok update over tailscale (it works fine using direct LAN IP):

Transferred root file system (47 MiB) at 26.01 MiB/s (total: 2s)s
Transferred boot file system (16 MiB) at 26.20 MiB/s (total: 1s)s
Triggering reboot
[update boot file system] 0.00% of 16 MiB, uploading at 0 B/s                 2023/05/18 15:00:50 reboot: Post "https://gokrazy:***@junk/reboot": read tcp 100.95.42.78:45338->100.90.227.60:443: read: connection reset by peer

To be more specific, it got stuck for some time at [update boot file system] 0.00% of 16 MiB, uploading at 0 B/s, then the rest of the log followed. However, the appliance has actually been updated and rebooted correctly.

I'm using latest gok and tailscale 1.40.1 on both machines. My appliance is an amd64 mini pc, if that matters.

@stapelberg
Copy link
Contributor Author

Not sure yet what the trigger is, but maybe we need to programmatically ignore this error in the updater for now.

Given that you can reproduce it reliably, could you send a PR to ignore this error in the updater? Thanks!

@nhanb
Copy link

nhanb commented May 18, 2023

I'll try!

@nhanb
Copy link

nhanb commented May 18, 2023

I dug into the gokrazy/* repos and found this snippet which could be a race condition:

gokrazy/update.go

Lines 353 to 362 in a764954

log.Println("Rebooting")
w.Write([]byte("Rebooting...\n"))
rc.Flush()
go func() {
time.Sleep(time.Second) // give the http response some time to be sent
if err := reboot(); err != nil {
log.Println("could not reboot:", err)
}
}()
- the /reboot handler assumes the response always finishes successfully within 1 second before rebooting, which may not always be the case (maybe tailscale just adds enough latency for this bug to surface more easily). If we instead allow the http server to gracefully shut down before rebooting, we might solve this bug without any workaround.

On that note, how do I tell gokrazy/tools to use the gokrazy/gokrazy repo from my local disk instead of github.com/gokrazy/gokrazy? Apparently gokrazy/tools/go.mod does not list gokrazy/gokrazy, so I can't replace it. Consequently, I can't test my local changes to the /reboot handler inside the gokrazy/gokrazy repo.

@stapelberg
Copy link
Contributor Author

the /reboot handler assumes the response always finishes successfully within 1 second before rebooting, which may not always be the case (maybe tailscale just adds enough latency for this bug to surface more easily).

Hmm. Can you try bumping this to 10s to see if that works around the issue?

On that note, how do I tell gokrazy/tools to use the gokrazy/gokrazy repo from my local disk instead of github.com/gokrazy/gokrazy? Apparently gokrazy/tools/go.mod does not list gokrazy/gokrazy, so I can't replace it. Consequently, I can't test my local changes to the /reboot handler inside the gokrazy/gokrazy repo.

You can just gok add . in your gokrazy/gokrazy working copy, that should set up the replace directive in the corresponding build directory within your gokrazy instance. See also https://gokrazy.org/development/modules/#building-local-code-the-replace-directive

nhanb added a commit to nhanb/gokrazy-tools that referenced this issue May 18, 2023
This is not a real fix. I still haven't figured out why a `connection
reset by peer` happens in the first place. `target.Reboot()` will still
block for a bit before giving up. At least it doesn't falsely report a
fail now.

Also moved canc() up so the terminal isn't littered with `[update boot
file system] 0.00% of 16 MiB, uploading at 0 B/s` during the wait. I
have no idea why the reporting hasn't been cancelled already: boot file
system update apparently finished waaaay before that point.

Related: gokrazy/gokrazy#199
@nhanb
Copy link

nhanb commented May 18, 2023

Bumping the wait time to 10s doesn't seem to do anything. I wanted to try gracefully shutting down the http server too, but honestly couldn't figure out how to do that in the gokrazy codebase. I'll sleep on it for now :(

Meanwhile I've pushed gokrazy/tools#55 to simply ignore the error.

stapelberg pushed a commit to gokrazy/tools that referenced this issue May 19, 2023
This is not a real fix. I still haven't figured out why a `connection
reset by peer` happens in the first place. `target.Reboot()` will still
block for a bit before giving up. At least it doesn't falsely report a
fail now.

Also moved canc() up so the terminal isn't littered with `[update boot
file system] 0.00% of 16 MiB, uploading at 0 B/s` during the wait.

Related: gokrazy/gokrazy#199
@damdo
Copy link
Collaborator

damdo commented May 19, 2023

Yeah noticed this too.
Thanks for putting up a patch for this @nhanb!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants