Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guest agent sometimes becomes unavailable from the host #2227

Open
roman-kiselenko opened this issue Mar 2, 2024 · 4 comments
Open

Guest agent sometimes becomes unavailable from the host #2227

roman-kiselenko opened this issue Mar 2, 2024 · 4 comments

Comments

@roman-kiselenko
Copy link
Contributor

roman-kiselenko commented Mar 2, 2024

Description

I've found this abiosoft/colima/issues/994 and it looks exactly like mine.

After sometime the created VM started to loose connection to guestagent due to missing ga.sock.

journalctl -e -u lima-guestagent.service -f
-- Boot 3c14c00ca1b045588ca9d3a840437668 --
Mar 02 15:34:31 lima-cri-tools systemd[1]: Started lima-guestagent.
Mar 02 15:34:33 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:33+03:00" level=info msg="event tick: 3s"
Mar 02 15:34:33 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:33+03:00" level=info msg="serving the guest agent on \"/run/lima-guestagent.sock\""
Mar 02 15:34:36 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:36+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:36Z systime=2024-03-02T15:34:36+03:00 delta=-150.01032ms"
Mar 02 15:34:46 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:46+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:46Z systime=2024-03-02T15:34:46+03:00 delta=-152.18532ms"
Mar 02 15:34:56 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:56+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:56Z systime=2024-03-02T15:34:56+03:00 delta=-155.54332ms"
Mar 02 15:35:06 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:06+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:06Z systime=2024-03-02T15:35:06+03:00 delta=-151.786742ms"
Mar 02 15:35:16 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:16+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:16Z systime=2024-03-02T15:35:16+03:00 delta=-146.093647ms"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true"
Mar 02 15:35:26 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:26+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:26Z systime=2024-03-02T15:35:26+03:00 delta=-149.117589ms"
Mar 02 15:35:27 lima-cri-tools systemd[1]: Stopping lima-guestagent...
Mar 02 15:35:27 lima-cri-tools systemd[1]: lima-guestagent.service: Deactivated successfully.
Mar 02 15:35:27 lima-cri-tools systemd[1]: Stopped lima-guestagent.
Mar 02 15:35:27 lima-cri-tools systemd[1]: Started lima-guestagent.
Mar 02 15:35:28 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:28+03:00" level=info msg="event tick: 3s"
Mar 02 15:35:28 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:28+03:00" level=info msg="serving the guest agent on \"/run/lima-guestagent.sock\""
Mar 02 15:35:34 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:34+03:00" level=debug msg="LocalPorts(): worthCheckingIPTables=true"
Mar 02 15:35:34 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:34+03:00" level=debug msg="Get info %+v{[{0.0.0.0 22} {127.0.0.1 35967} {127.0.0.53 53} {:: 22}]}"

The logs above shows that the lima-guestagent.service has been restarted (🤔 not sure the reason). This leads to the missing connection from the host:

WARN[0110] [hostagent] connection to the guest agent was closed unexpectedly

Missing connection leads to ga.sock deleting.

I tried to comment ga.sock deleting, and it appears to be a fix.

diff
diff --git a/pkg/hostagent/hostagent.go b/pkg/hostagent/hostagent.go
index 92c5073..ee4c9a3 100644
--- a/pkg/hostagent/hostagent.go
+++ b/pkg/hostagent/hostagent.go
@@ -605,7 +605,7 @@ func (a *HostAgent) watchGuestAgentEvents(ctx context.Context) {
                                }
                        }
                } else {
-                       logrus.WithError(err).Warn("connection to the guest agent was closed unexpectedly", err)
+                       logrus.WithError(err).Warn("connection to the guest agent was closed unexpectedly\n", err)
                }
                select {
                case <-ctx.Done():
@@ -728,13 +728,14 @@ func forwardSSH(ctx context.Context, sshConfig *ssh.SSHConfig, port int, local,
                                }
                        } else {
                                logrus.Infof("Forwarding %q (guest) to %q (host)", remote, local)
-                               if err := os.RemoveAll(local); err != nil {
-                                       logrus.WithError(err).Warnf("Failed to clean up %q (host) before setting up forwarding", local)
-                               }
-                       }
-                       if err := os.MkdirAll(filepath.Dir(local), 0o750); err != nil {
-                               return fmt.Errorf("can't create directory for local socket %q: %w", local, err)
+                               logrus.Infof("Delete files %q (local)", local)
+                               // if err := os.RemoveAll(local); err != nil {
+                               //      logrus.WithError(err).Warnf("Failed to clean up %q (host) before setting up forwarding", local)
+                               // }
                        }

hostagent reconnect and everything is working again.

limactl start crio-tools
.....
INFO[0106] [hostagent] Waiting for the guest agent to be running
INFO[0106] [hostagent] Forwarding "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host)
INFO[0106] [hostagent] Delete files "/Users/user/.lima/cri-tools/ga.sock" (local)
WARN[0106] [hostagent] Failed to set up forward from "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host)
WARN[0106] [hostagent] connection to the guest agent was closed unexpectedly
dial unix /Users/user/.lima/cri-tools/ga.sock: connect: no such file or directory
INFO[0116] [hostagent] Forwarding "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host)
INFO[0116] [hostagent] Delete files "/Users/user/.lima/cri-tools/ga.sock" (local)
INFO[0116] [hostagent] Guest agent is running
INFO[0116] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
INFO[0116] [hostagent] Not forwarding TCP [::]:22
INFO[0116] [hostagent] Not forwarding TCP 127.0.0.53:53
INFO[0116] [hostagent] Forwarding TCP from 127.0.0.1:44569 to 127.0.0.1:44569
INFO[0116] [hostagent] Not forwarding TCP 0.0.0.0:22
INFO[0116] [hostagent] The final requirement 1 of 1 is satisfied
INFO[0116] READY. Run `limactl shell cri-tools` to open the shell.

I see two options here (the second one is better, I guess, but I can't find the reason):

  1. In case of missing connection to the guestagent trying to reconnect and don't delete ga.sock
  2. Find a reason why lima-guestagent.service is failing.
lima.yml
vmType: null
os: null
arch: x86_64
images:
  - location: "https://cloud-images.ubuntu.com/releases/22.04/release-20240125/ubuntu-22.04-server-cloudimg-amd64.img"
    arch: "x86_64"
    digest: "sha256:f23e5b91af7f52bdd4680d8e9d091093bfc6c4a46b0da7baa3cb450ee917b66b"
cpus: 6
memory: 8GiB
disk: null
mounts:
  - location: "~/Documents/dev/opensource/cri-tools"
    mountPoint: null
    writable: true
    sshfs:
      cache: null
      followSymlinks: null
      sftpDriver: null
    9p:
      securityModel: null
      protocolVersion: null
      msize: null
      cache: "mmap"
  - location: "/tmp/lima"
    writable: true
mountType: 9p
additionalDisks:
ssh:
  localPort: 0
  loadDotSSHPubKeys: null
  forwardAgent: null
  forwardX11: null
  forwardX11Trusted: null
caCerts:
  removeDefaults: null
  files:
  certs:
upgradePackages: true
containerd:
  system: false
  user: false
cpuType:
  aarch64: null
  armv7l: null
  x86_64: null
rosetta:
  enabled: null
  binfmt: null
timezone: null
firmware:
  legacyBIOS: null
audio:
  device: null
video:
  display: null
  vnc:
    display: null
networks:
propagateProxyEnv: null
hostResolver:
  enabled: null
  ipv6: null
  hosts:
guestInstallPrefix: null
plain: null
> limactl --version
limactl version 0.20.1
> qemu-system-x86_64 --version
QEMU emulator version 8.2.1
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
@AkihiroSuda
Copy link
Member

Is this issue specific to colima?

Do you see the issue with a plain Lima instance?

@roman-kiselenko
Copy link
Contributor Author

Is this issue specific to colima?

Do you see the issue with a plain Lima instance?

This behaviour is specific to lima.

@roman-kiselenko
Copy link
Contributor Author

Another related issue abiosoft/colima#999 (comment)

@roman-kiselenko
Copy link
Contributor Author

ref #2064

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants