Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EdgeAgent does not restart crashed module if busy downloading new manifest 1.4.10 #7259

Open
mathiash98 opened this issue Apr 8, 2024 · 3 comments
Assignees
Labels
bug Something isn't working customer-reported

Comments

@mathiash98
Copy link

mathiash98 commented Apr 8, 2024

Expected Behavior

EdgeAgent should ALWAYS keep all modules running regardless of edgeAgent operation.

Current Behavior

If EdgeAgent is busy downloading new version of modules it will not ensure existing modules are running.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Slow internet setup
  2. Deploy new version deployment manifest
  3. Ensure EdgeAgent struggles downloading new versions (100 second timeout) -> repeats for a long time
  4. Crash one of the existing modules with fatal exception or similar
  5. Observe that EdgeAgent does not restart the crashing module

Context (Environment)

Deployments operating on slow satellite internet connection with high ping.

Deployment manifest has:

  • "restartPolicy": "always",
  • "status": "running",
  • ModuleUpdateMode: WaitForAllPulls

Output of iotedge check

Click here
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
× config.yaml has correct hostname - Error
    config.yaml has hostname PC2FD48A but device reports hostname pc2fd48a.
    Hostname in config.yaml must either be identical to the device hostname or be a fully-qualified domain name that has the device hostname as the first component.
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
    The Edge device is using self-signed automatically-generated development certificates.
    They will expire in 204 days (at 2024-10-26 11:29:55 UTC) causing module-to-module and downstream device communication to fail on an active deployment.
    After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs.
    Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

17 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
1 check(s) raised errors. Re-run with --verbose for more details.

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]:
  • Architecture [e.g. amd64, arm32, arm64]:
  • Container OS [e.g. Linux containers, Windows containers]:

Runtime Versions

  • aziot-edged [run iotedge version]: 1.1.15
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.4.10
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.4.10
  • Docker/Moby [run docker version]: 20.10.20+azure-1

Logs

edge-agent logs

<6> 2024-04-08 07:39:46.633 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 36 and reported properties version 321.
<6> 2024-04-08 07:42:30.361 +00:00 [INF] - Starting compaction of stores
<6> 2024-04-08 07:42:30.361 +00:00 [INF] - Starting compaction of store Metrics
<6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store deploymentConfig
<6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store default
<6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store moduleState
<6> 2024-04-08 07:43:13.374 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping Metrics
<6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2024-04-08 07:43:13.385 +00:00 [INF] - Storing Metrics
<6> 2024-04-08 07:43:13.389 +00:00 [INF] - Scraped and Stored Metrics
<6> 2024-04-08 07:43:13.389 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<4> 2024-04-08 07:43:48.781 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 07:53:53.787 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:03:58.789 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:14:03.794 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:24:08.798 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<6> 2024-04-08 08:28:48.987 +00:00 [INF] - Starting periodic operation refresh twin config...
<6> 2024-04-08 08:28:51.738 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 36 and reported properties version 321.
<6> 2024-04-08 08:28:51.740 +00:00 [INF] - Successfully completed periodic operation refresh twin config
<4> 2024-04-08 08:34:13.801 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<6> 2024-04-08 08:43:13.393 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2024-04-08 08:43:13.393 +00:00 [INF] - Scraping Metrics
<6> 2024-04-08 08:43:13.393 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2024-04-08 08:43:13.394 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2024-04-08 08:43:13.408 +00:00 [INF] - Storing Metrics
<6> 2024-04-08 08:43:13.414 +00:00 [INF] - Scraped and Stored Metrics
<6> 2024-04-08 08:43:13.414 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<4> 2024-04-08 08:44:18.807 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:54:23.813 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 09:04:28.820 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 09:14:33.823 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271

@david-emakenemi
Copy link

@jollyaakash can you help take a look at this?

@vadim-kovalyov vadim-kovalyov self-assigned this Apr 10, 2024
@vadim-kovalyov
Copy link
Contributor

Hey @mathiash98 thanks for reporting this. It looks like an existing problem where the pull operation happens synchronously and prevents EdgeAgent reacting to other module lifecycle events. I'll bring it up with the team to see how we can fix that. Meanwhile I can suggest a workaround (if applicable/possible) is to ssh to the device and manually pre-pull (docker pull) the images before applying the deployment. Also you can set imagePullPolicy: Never for that.

Just a side note about ModuleUpdateMode: WaitForAllPulls. It means that all images will be downloaded first, and then containers (re)started, but EA still makes pull synchronously.

@vadim-kovalyov vadim-kovalyov added bug Something isn't working customer-reported labels Apr 10, 2024
@mathiash98
Copy link
Author

Hey @mathiash98 thanks for reporting this. It looks like an existing problem where the pull operation happens synchronously and prevents EdgeAgent reacting to other module lifecycle events. I'll bring it up with the team to see how we can fix that. Meanwhile I can suggest a workaround (if applicable/possible) is to ssh to the device and manually pre-pull (docker pull) the images before applying the deployment. Also you can set imagePullPolicy: Never for that.

Just a side note about ModuleUpdateMode: WaitForAllPulls. It means that all images will be downloaded first, and then containers (re)started, but EA still makes pull synchronously.

Thanks for the information, we will try to keep an eye out for the installation with poor internet connection. Manually polling on 100 devices is too cumbersome, but we will create some dashboard that will show the devices that IotEdge is reporting as failing to mitigate this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working customer-reported
Projects
None yet
Development

No branches or pull requests

3 participants