Unable to communicate with Edge modules (in EFLOW) from the host OS #7220

Open
bhjertaas opened this issue Feb 20, 2024 · 7 comments

@bhjertaas

Expected Behavior

We should be able to communicate with Edge modules from the host OS.

Current Behavior

Sometimes, after a reboot of the PC, communication from the host OS to the Edge modules does not work. Connect-EflowVm, Get-EflowVmAddr, and similar cmdlets still work, but HTTP calls to the Edge modules inside EFLOW simply time out. Unfortunately we don't have many exceptions or logs to provide, but here is the output from a PowerShell session on Windows:

PS C:\Windows\System32> get-EflowVmAddr 
[01/17/2024 11:07:07] Querying IP and MAC addresses from virtual machine (DESKTOP-E4VKT5P-EFLOW)
  - Virtual machine MAC: 00:15:5d:43:7d:b6
 - Virtual machine IP : 172.18.92.180 retrieved directly from virtual machine
 00:15:5d:43:7d:b6
172.18.92.180 

PS C:\Windows\System32> curl http://172.18.92.180:9602/metrics
curl: (28) Failed to connect to 172.18.92.180 port 9602 after 21008 ms: Couldn't connect to server
PS C:\Windows\System32> curl http://172.18.92.180:1337/metadata
curl: (28) Failed to connect to 172.18.92.180 port 1337 after 21031 ms: Couldn't connect to server

The following image shows the problem over time. Whenever the PC is unable to make simple HTTP requests, the graph drops to zero. The test PCs are set up to restart every 3 hours.
image

Steps to Reproduce


  1. When this happens, run Get-EflowVmAddr and copy the IP address it returns.
  2. Try to make a request to a module, for example curl http://172.18.92.180:9602/metrics
    In our case this fetches metric data from edgeHub, because we have mapped its internal port 9600 to 9602. It doesn't matter what you request, though; no request works.
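
The check in step 2 can also be scripted. The following is only a minimal sketch, not part of the original report: the helper name `probe` and the 5-second default timeout are assumptions made for illustration. It opens a plain TCP connection to the mapped port and reports whether the attempt succeeded, was actively refused, or timed out, which helps separate "nothing is listening" from "packets are being dropped".

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt (hypothetical helper).

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection performs the TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "no route to host".
        return f"error: {exc}"
```

Calling `probe("172.18.92.180", 9602)` with the address returned by Get-EflowVmAddr would be the equivalent of the curl call above. A "timeout" result would be consistent with the curl (28) failures shown earlier, while "refused" would point at the module or its port mapping instead.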

Context (Environment)

Output of iotedge check


iotedge-user@DESKTOP-5SO2V9I-EFLOW [ ~ ]$ sudo iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.6 but 1.4.7 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
‼ host can connect to and perform TLS handshake with iothub AMQP port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
‼ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
‼ host can connect to and perform TLS handshake with iothub MQTT port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
√ host can connect to and perform TLS handshake with DPS endpoint - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.20 but 1.4.27 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
26 check(s) succeeded.
5 check(s) raised warnings. Re-run with --verbose for more details.
7 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]: 1.4.10.25103 on Windows 11
  • Architecture [e.g. amd64, arm32, arm64]: amd64
  • Container OS [e.g. Linux containers, Windows containers]: Mariner Linux (in Eflow)

Runtime Versions

  • aziot-edged [run iotedge version]: 1.4.20
  • Edge Agent [image tag (e.g. 1.0.0)]: mcr.microsoft.com/azureiotedge-agent:1.4
  • Edge Hub [image tag (e.g. 1.0.0)]: mcr.microsoft.com/azureiotedge-hub:1.4
  • Docker/Moby [run docker version]: 20.10.27 (iotedge_moby_engine version -> 24.0.6)

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Additional Information

When we do encounter this issue, it remains a problem until the PC is restarted.

  • When we go into the VM with Connect-EflowVm and run sudo iptables -P INPUT ACCEPT, we're able to ping the EFLOW VM from the host OS.
  • Running hcsdiag list to get the GUID and then hcsdiag console <guid>, we can confirm that from that shell we can both ping the IP of the VM and make curl requests to the Edge module endpoints that did not work from the host OS.

We have been in contact with the iotedge-eflow team, and one member of that team has had a session with us on one of the test PCs that did not work. He believes the problem lies in the IoT Edge part, which is why I am posting this issue here.

@nyanzebra
Contributor

@bhjertaas would you mind providing a support bundle? The command is iotedge support-bundle.

From what I understand, there is no issue retrieving the metrics via curl from the EFLOW VM side, but it does not work when trying from the host machine side. Is this correct?

Additionally, is this a recent issue, or has it been happening for a while?

@bhjertaas
Author

bhjertaas commented Feb 25, 2024

Sorry for the delay, here is a support bundle.

support_bundle_2024_02_25_14_47_17_UTC.zip

You asked: "From what I understand, there is no issue retrieving the metrics via curl from EFLOW vm side, but when trying from host machine side it does not work. Is this correct?" Yes, that is correct.

This has been happening for a while. As mentioned, we have 7 test PCs that are set to reboot, and we've been seeing this problem quite regularly. If you set up a PC with EFLOW that reboots every 2 hours, say, you should be able to reproduce this behaviour within two days, I estimate. (Of course, you need something on Windows that makes HTTP calls to a module endpoint to detect when it happens.)
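
For anyone trying to reproduce this, a detector of that kind could look roughly like the sketch below. It is an illustration only, not the health-reporting system we actually run: the function names `endpoint_reachable` and `watch`, the 60-second interval, and the output format are all assumptions. It polls one module URL from the Windows host and prints a timestamped line whenever reachability changes.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def endpoint_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint sends back any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered (e.g. with 404), so the network path works.
        return True
    except (urllib.error.URLError, OSError):
        # Timeouts, refused connections, unreachable hosts.
        return False


def watch(url, interval=60.0, rounds=None, sleep=time.sleep):
    """Poll url and print a line each time its reachability changes.

    rounds=None polls forever; a number limits the polls (useful for tests).
    """
    last = None
    done = 0
    while rounds is None or done < rounds:
        ok = endpoint_reachable(url)
        if ok != last:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} {url} {'reachable' if ok else 'UNREACHABLE'}")
            last = ok
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

Running `watch("http://<eflow-vm-ip>:9602/metrics")` on the host, with the address taken from Get-EflowVmAddr after each reboot, would log the moment the endpoint stops answering. Treating an HTTP error status as "reachable" is a deliberate choice here, since the problem described is a connection failure rather than a bad response.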

According to our health-reporting system, the communication problem on this particular PC (test-PC-6) started on Feb 25th at 12:15 CET.

The IP at the time was 172.18.164.255, as reported by Get-EflowVmAddr.

curl http://172.18.164.255:9602/metrics
curl: (28) Failed to connect to 172.18.164.255 port 9602 after 21047 ms: Couldn't connect to server

I've had a look at the bundle and can't find much. Unfortunately, the logs for the modules themselves are "too large"; see my other issue on that. But I doubt this networking problem is caused by module wrongdoing anyway.

@nyanzebra
Copy link
Contributor

@jagadishmurugan while the Edge team is investigating the support bundle, is there anything you would recommend @bhjertaas do to validate EFLOW-to-host networking? Just want to make sure that everything looks correct.

@konichi3

@jagadishmurugan @nyanzebra Can you follow up and share your findings?

@konichi3

@jagadishmurugan @nyanzebra Any updates on this?

@nyanzebra
Contributor

@bhjertaas sorry for the delay, this got lost among other issues I have been looking at. The support bundle provided seems to have failed to collect the logs: "Error grabbing logs: log message is too large (973735539 > 1000000)". Would it be possible to generate a support bundle for a smaller time window? For example, support-bundle --since 6h?

@bhjertaas
Author

I have tried to create a support bundle with a smaller time frame, but the "log message is too large" problem remains.

Until the underlying Moby bug is fixed, the recommended local logging driver is not a viable option. Which logging driver do you recommend instead on Edge devices? I'm just thinking that we may have to sort that out first before we can get clarity on this issue.
