Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Direct Method of IoTEdge module takes 3 minutes to start responding #7251

Open
wvangeem opened this issue Mar 27, 2024 · 8 comments
Open

Direct Method of IoTEdge module takes 3 minutes to start responding #7251

wvangeem opened this issue Mar 27, 2024 · 8 comments
Assignees

Comments

@wvangeem
Copy link

Current Behaviour

I have 3 IoTEdge modules (C# SDK)

  • APIServer
  • StateAnalyzer
  • SensorController

Every second, the APIServer module calls a DirectMethod in both StateAnalyzer and SensorController

When I startup the device, the StateAnalyzer module almost immediately responds to the DirectMethod call.
The SensorController always timeouts exactly 3 minutes.
After these 3 minutes, it starts to respond to the DirectMethod call.

When I manually restart the SensorController module (iotedge restart SensorController), I have exactly the same behaviour.
It takes exactly 3 minutes.
When I manually restart the StateAnalyzer module, it takes only 6 seconds for the DirectMethod call to respond.

I have the full edgeHub and edgeAgent logs below, but here is an extract with my actions.
You will always see that the DirectMethods fail, that the SensorController connection is restarted twice by edgeHub, and that initially after 3 minutes it is started again after which it starts to work.

Upon startup

2024-03-27 12:38:34.427 +00:00 [INF] - Client 787052036B1B/SensorController in device scope authenticated locally.
2024-03-27 12:38:34.625 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController
...
2024-03-27 12:38:41.430 +00:00 [WRN] - Did not receive response for method invoke call from device/module 787052036B1B/SensorController for 787052036B1B/SensorController with correlation ID 6b7aa24a-202d-4ae4-8394-a7df05d04360
...
2024-03-27 12:41:34.888 +00:00 [INF] - Closing link Events for 787052036B1B/SensorController
2024-03-27 12:41:35.075 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController
2024-03-27 12:41:35.794 +00:00 [INF] - "Request starting HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131"
2024-03-27 12:41:36.054 +00:00 [INF] - "Request finished HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131 - 200 29 application/json;+charset=utf-8 259.4015ms"

Manual Restart StateAnalyzer

2024-03-27 12:55:08.858 +00:00 [INF] - Closing link Events for 787052036B1B/StateAnalyzer
> StateAnalyzer methods are failing
2024-03-27 12:55:13.230 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/StateAnalyzer
2024-03-27 12:55:13.904 +00:00 [INF] - "Request starting HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/StateAnalyzer/methods?api-version=2020-09-30 application/json;+charset=utf-8 329"
2024-03-27 12:55:14.777 +00:00 [INF] - "Request finished HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/StateAnalyzer/methods?api-version=2020-09-30 application/json;+charset=utf-8 329 - 200 470 application/json;+charset=utf-8 873.2651ms"

StateAnalyzer methods are working again after 6 seconds

Manual Restart SensorController

2024-03-27 12:55:44.606 +00:00 [INF] - Closing link Events for 787052036B1B/SensorController
> SensorController methods are failing
2024-03-27 12:55:48.331 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController
2024-03-27 12:55:54.165 +00:00 [WRN] - Did not receive response for method invoke call from device/module 787052036B1B/SensorController for 787052036B1B/SensorController with correlation ID ed867d78-ba62-4011-aefb-2aa6d275814a

Manual Restart StateAnalyzer in between - SensorController is still not working

2024-03-27 12:56:44.394 +00:00 [INF] - Closing link Events for 787052036B1B/StateAnalyzer
> StateAnalyzer methods are failing
2024-03-27 12:56:48.339 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/StateAnalyzer
2024-03-27 12:56:49.482 +00:00 [INF] - "Request starting HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/StateAnalyzer/methods?api-version=2020-09-30 application/json;+charset=utf-8 329"
2024-03-27 12:56:50.366 +00:00 [INF] - "Request finished HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/StateAnalyzer/methods?api-version=2020-09-30 application/json;+charset=utf-8 329 - 200 470 application/json;+charset=utf-8 883.6945ms"

StateAnalyzer methods are working again after 6 seconds
Still no response on the SensorController after 3 minutes

2024-03-27 12:57:46.845 +00:00 [WRN] - Did not receive response for method invoke call from device/module 787052036B1B/SensorController for 787052036B1B/SensorController with correlation ID 35072fcd-a4fc-418a-b708-2c77825a850f
> Automatic restart of SensorController by eventHub
2024-03-27 12:57:48.554 +00:00 [INF] - Closing link Events for 787052036B1B/SensorController
2024-03-27 12:57:48.792 +00:00 [INF] - Created cloud connection for client 787052036B1B/SensorController
> no "Opened link MethodReceiving"
2024-03-27 12:57:49.892 +00:00 [INF] - "Request starting HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131"
2024-03-27 12:57:49.893 +00:00 [WRN] - Unable to invoke method on client 787052036B1B/SensorController because no subscription for methods found.
2024-03-27 12:57:50.862 +00:00 [WRN] - Did not receive response for method invoke call from device/module 787052036B1B/SensorController for 787052036B1B/SensorController with correlation ID 685ef645-1862-4cd0-9dff-5c67ec11e9b7

After a while: Automatic restart of SensorController by eventHub

2024-03-27 12:58:48.613 +00:00 [INF] - Closing link Events for 787052036B1B/SensorController
2024-03-27 12:58:48.786 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController
2024-03-27 12:58:49.195 +00:00 [INF] - "Request starting HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131"
2024-03-27 12:58:49.446 +00:00 [INF] - "Request finished HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131 - 200 29 application/json;+charset=utf-8 251.5439ms"

At this point the DirectMethod is responding

Another test with manual restart SensorController - again 3 minutes

2024-03-27 12:49:12.907 +00:00 [INF] - Closing link Events for 787052036B1B/SensorController
2024-03-27 12:52:18.507 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController

Context (Environment)

Both modules are developped in C#, and have the same way of connecting (AMQP & MQTT)

Output of iotedge check

Click here


Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.4 but 1.4.8 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.10 but 1.4.33 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
31 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
root@CNH-787052036B1B:~# iotedge check --verbose

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.4 but 1.4.8 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
        caused by: Installed aziot-identity-service package has version 1.4.4 but 1.4.8 is the latest stable version available.
                   Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.10 but 1.4.33 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
        caused by: Installed IoT Edge daemon has version 1.4.10 but 1.4.33 is the latest stable version available.
                   Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
        caused by: The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
                   Data might be lost if the module is deleted or updated.
                   Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
        caused by: The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
                   Data might be lost if the module is deleted or updated.
                   Please see https://aka.ms/iotedge-storage-host for best practices.
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the default network can connect to upstream MQTT port - OK
    skipping because of not required in this configuration
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream MQTT port - OK
    skipping because of not required in this configuration
31 check(s) succeeded.
4 check(s) raised warnings.
2 check(s) were skipped due to errors from other checks.


Device Information

  • Host OS: Ubuntu 20.04
  • Architecture: amd64
  • Container OS: Linux containers

Runtime Versions

  • aziot-edged [run iotedge version]: 1.4.10
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.4
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.4
  • Docker/Moby [run docker version]: 20.10.25+azure-1

Logs

Links are added for the logs
edgeagent-2024032702.log
edgehub-2024032702.log

@wvangeem wvangeem changed the title Direct Methods of IoTEdge module takes 3 minutes to start responding Direct Method of IoTEdge module takes 3 minutes to start responding Mar 27, 2024
@nyanzebra nyanzebra self-assigned this Mar 27, 2024
@wvangeem
Copy link
Author

wvangeem commented Apr 2, 2024

@nyanzebra - Any ideas about this subject?

@nyanzebra
Copy link
Contributor

@wvangeem would you mind seeing if you can add a log message in your module to see if the message ever arrives in the initial timeout window? For example from edgeHub logs:

<4> 2024-03-27 12:38:41.430 +00:00 [WRN] - Did not receive response for method invoke call from device/module 787052036B1B/SensorController for 787052036B1B/SensorController with correlation ID 6b7aa24a-202d-4ae4-8394-a7df05d04360

There is a correlation ID, I need to double check if it is possible to print this in the module when handling direct method calls, but hoping it will be printed. Basically I want to make sure that the problem is either the message is not forwarded to module when it should be or the module received the message and was stuck or finally, edgeHub didn't receive the response for some reason.

@wvangeem
Copy link
Author

wvangeem commented Apr 4, 2024

Hi @nyanzebra
I gathered some more information and logs.
I also added some logging to the SensorController module (which was already the case).
At a customers device, they experience the same problem, but there it's the StateAnalyzer module that fails.
I tested by fixing the startupOrder of the modules by adding the SensorController or StateAnalyzer right after the edgeHub, but this has no result. In my case, the SensorController keeps failing.

I also fixed edgeAgent and edgeHub versions to 1.4.33

But now I cannot simulate it anymore when I reboot the device most of the times.
I now can only simulate it when I restart the module (iotedge restart SensorController)

I restarted the module on 14:29:43

This is the log of the APIServer, which calls both StateAnalyzer and SensorController.
You will see that the module stops responding at 14:29:43

2024/04/04-14:29:42 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:29:42 - Invoked SetSerialNumber on SensorController
2024/04/04-14:29:43 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:29:44 - Invoked SetSerialNumber on SensorController FAILED - module is not responding: Failure, error message is Device {"message":"Client 787052036B1B/SensorController not found"} not registered
2024/04/04-14:29:45 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:29:46 - Invoked SetSerialNumber on SensorController FAILED - module is not responding: Failure, error message is Device {"message":"Client 787052036B1B/SensorController not found"} not registered
2024/04/04-14:29:47 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:29:51 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:29:52 - Invoked SetSerialNumber on SensorController FAILED - module is not responding: Failure, error message is {"message":"Timed out waiting for device to respond to method request c04837eb-5333-4a1e-868d-ad6e59f5a871"}
2024/04/04-14:29:55 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:29:56 - Invoked SetSerialNumber on SensorController FAILED - module is not responding: Failure, error message is {"message":"Timed out waiting for device to respond to method request a1acbb6a-8da8-43b0-a697-b2362522bd8d"}

This is the log of the SensorController.
You will see it accepts requests until restart, and it starts accepting requests again at 14:32:48, exactly 3 minutes later.

2024/04/04-14:29:37 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:29:38 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:29:39 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:29:40 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:29:41 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:29:42 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
EDGERUNTIME - 2024/04/04-14:29:43 - Open a connection to the Edge runtime
EDGERUNTIME - 2024/04/04-14:29:44 - Opening Connection
EDGERUNTIME - 2024/04/04-14:29:46 - Connection to the Edge runtime success!
EDGERUNTIME - 2024/04/04-14:29:46 - Setting Property Callback
EDGERUNTIME - 2024/04/04-14:29:46 - Registering Direct Methods
EDGERUNTIME - 2024/04/04-14:29:46 - Registering Message Handlers
EDGERUNTIME - 2024/04/04-14:29:46 - Get Module Twin Properties
EDGERUNTIME - 2024/04/04-14:29:47 - Get Environment Variables
 Environment IOTEDGE_MODULEGENERATIONID = 638471393439578933
 Environment HOME = /home/moduleuser
 Environment UpstreamProtocol = Amqp
 Environment IOTEDGE_IOTHUBHOSTNAME =********.azure-devices.net
 Environment HOSTNAME = c103fd152b65
 Environment DOTNET_RUNNING_IN_CONTAINER = true
 Environment PATH = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 Environment IOTEDGE_MODULEID = SensorController
 Environment IOTEDGE_AUTHSCHEME = sasToken
 Environment IOTEDGE_GATEWAYHOSTNAME = CNH-787052036B1B
 Environment RuntimeLogLevel = Information
 Environment IOTEDGE_WORKLOADURI = unix:///var/run/iotedge/workload.sock
 Environment IOTEDGE_APIVERSION = 2022-08-03
 Environment ASPNETCORE_URLS = http://+:80
 Environment IOTEDGE_DEVICEID = 787052036B1B
EDGERUNTIME - 2024/04/04-14:29:47 - Edge Runtime Connection Initialized
CRON - 2024/04/04-14:29:47 - Enabling Cron Daemon
CRONJOBS - 2024/04/04-14:29:47 - Stopping CRON Daemon
CRONJOBS - 2024/04/04-14:29:47 - CRON Job - LogAllActiveVibrationSensors
CRONJOBS - 2024/04/04-14:29:47 - CRON Job - CheckRunningVibrationSensorJobs
CRONJOBS - 2024/04/04-14:29:47 - Starting CRON Daemon
CRON - 2024/04/04-14:29:47 - Finished Enabling Cron
2024/04/04-14:30:00 -  - Sending 0 Active Sensors to IoTHub
2024/04/04-14:31:00 -  - Sending 0 Active Sensors to IoTHub
2024/04/04-14:32:00 -  - Sending 0 Active Sensors to IoTHub
2024/04/04-14:32:48 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:32:49 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:32:50 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:32:51 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1
2024/04/04-14:32:52 - DIRECTMETHOD FOR SERIAL NUMBER ===== SIM_WELOTEC1

I also sent the Connection State Events to an Eventhub, and added some logging.
You see that the module connects again at 14:29:46

2024-04-04T14:27:25Z   [Information]   Executed 'Functions.IoTHubConnectionEventProcessor' (Succeeded, Id=04308bef-cc07-4450-85b2-9c4f99ed2482, Duration=7ms)
2024-04-04T14:30:25Z   [Information]   Executing 'Functions.IoTHubConnectionEventProcessor' (Reason='(null)', Id=88654866-cd66-442c-bc84-0975dce38446)
2024-04-04T14:30:25Z   [Information]   Trigger Details: PartionId: 1, Offset: 86536-86536, EnqueueTimeUtc: 2024-04-04T14:30:25.2100000+00:00-2024-04-04T14:30:25.2100000+00:00, SequenceNumber: 140-140, Count: 1
2024-04-04T14:30:25Z   [Information]   ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2024-04-04T14:30:25Z   [Information]   Event Body: {"sequenceNumber":"000000000000000101D941606A78164D000000780000000000000000000003A2"}
2024-04-04T14:30:25Z   [Information]   opType : deviceConnected
2024-04-04T14:30:25Z   [Information]   operationTimestamp : 2024-04-04T14:29:46.8020054Z
2024-04-04T14:30:25Z   [Information]   hubName : *******
2024-04-04T14:30:25Z   [Information]   deviceId : 787052036B1B
2024-04-04T14:30:25Z   [Information]   moduleId : SensorController
2024-04-04T14:30:25Z   [Information]   ----------------------------------------------------------------------
2024-04-04T14:30:25Z   [Information]   Executed 'Functions.IoTHubConnectionEventProcessor' (Succeeded, Id=88654866-cd66-442c-bc84-0975dce38446, Duration=3ms)

Back to the log of APIServer, you also see here that the SensorController starts accepting requests at 14:32:48, 3 minutes later

2024/04/04-14:32:43 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:32:44 - Invoked SetSerialNumber on SensorController FAILED - module is not responding: Failure, error message is Device {"message":"Client 787052036B1B/SensorController not found"} not registered
2024/04/04-14:32:46 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:32:47 - Invoked SetSerialNumber on SensorController FAILED - module is not responding: Failure, error message is Device {"message":"Client 787052036B1B/SensorController not found"} not registered
2024/04/04-14:32:48 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:32:48 - Invoked SetSerialNumber on SensorController
2024/04/04-14:32:49 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:32:49 - Invoked SetSerialNumber on SensorController
2024/04/04-14:32:50 - Invoked ProcessVehicleState on StateAnalyzer - Received 'SIM_WELOTEC1' in state 'NoAnalyzedState'
2024/04/04-14:32:50 - Invoked SetSerialNumber on SensorController

I also added edgeHub logs (complete log in link)
edgeHub-20240404.log

Stop of SensorController at 14:29:42

<4> 2024-04-04 14:29:42.749 +00:00 [WRN] - "Closing connection for device: 787052036B1B/SensorController, Microsoft.Azure.Devices.ProtocolGateway.ProtocolGatewayException: Channel closed., "
<6> 2024-04-04 14:29:42.752 +00:00 [INF] - Disposing MessagingServiceClient for device Id 787052036B1B/SensorController because of exception - Microsoft.Azure.Devices.ProtocolGateway.ProtocolGatewayException: Channel closed.
<6> 2024-04-04 14:29:42.756 +00:00 [INF] - Setting device proxy inactive for device Id 787052036B1B/SensorController
<6> 2024-04-04 14:29:42.761 +00:00 [INF] - Removing device connection for device 787052036B1B/SensorController with removeCloudConnection flag 'True'.
<6> 2024-04-04 14:29:42.906 +00:00 [INF] - Closing receiver in cloud proxy a4baf8f4-e7ae-4b95-babe-2d1c2c29edc8 for 787052036B1B/SensorController
<6> 2024-04-04 14:29:42.909 +00:00 [INF] - Closed cloud proxy a4baf8f4-e7ae-4b95-babe-2d1c2c29edc8 for 787052036B1B/SensorController
<6> 2024-04-04 14:29:42.911 +00:00 [INF] - Device connection removed for device 787052036B1B/SensorController
<6> 2024-04-04 14:29:42.923 +00:00 [INF] - Remove device connection for device 787052036B1B/SensorController

Start of SensorController at 14:29:46

<6> 2024-04-04 14:29:46.390 +00:00 [INF] - New token received on the Cbs link
<6> 2024-04-04 14:29:46.594 +00:00 [INF] - Client 787052036B1B/SensorController in device scope authenticated locally.
<6> 2024-04-04 14:29:46.617 +00:00 [INF] - New device connection for device 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.618 +00:00 [INF] - Client 787052036B1B/SensorController connected to edgeHub, processing existing subscriptions.
<6> 2024-04-04 14:29:46.620 +00:00 [INF] - Attempting to connect to IoT Hub for client 787052036B1B/SensorController via AMQP...
<6> 2024-04-04 14:29:46.620 +00:00 [INF] - Bind device proxy for device 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.628 +00:00 [INF] - Initialized device listener in the AMQP protocol head for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.648 +00:00 [INF] - Opened link Events for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.718 +00:00 [INF] - Cloud connection for 787052036B1B/SensorController is True
<6> 2024-04-04 14:29:46.719 +00:00 [INF] - Connection status for 787052036B1B/SensorController changed to ConnectionEstablished
<6> 2024-04-04 14:29:46.719 +00:00 [INF] - Client 787052036B1B/SensorController connected to cloud, processing existing subscriptions.
<6> 2024-04-04 14:29:46.719 +00:00 [INF] - Skipping 787052036B1B/SensorController for subscription processing, as it is currently being processed.
<6> 2024-04-04 14:29:46.719 +00:00 [INF] - Created cloud proxy for client 787052036B1B/SensorController via AMQP, with client operation timeout 20 seconds.
<6> 2024-04-04 14:29:46.719 +00:00 [INF] - Initialized cloud proxy d8b6dff7-6eb9-4fef-87a4-3dd81ead4976 for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.719 +00:00 [INF] - Created cloud connection for client 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.808 +00:00 [INF] - Opened link TwinReceiving for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.812 +00:00 [INF] - Processing pending subscriptions for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.814 +00:00 [INF] - Opened link TwinSending for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.888 +00:00 [INF] - Processing pending subscriptions for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.891 +00:00 [INF] - Opened link MethodSending for 787052036B1B/SensorController
<6> 2024-04-04 14:29:46.896 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController

Failing requests from this moment on:

<6> 2024-04-04 14:29:51.342 +00:00 [INF] - "Request starting HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131"
<6> 2024-04-04 14:29:51.342 +00:00 [INF] - Executing endpoint '"Microsoft.Azure.Devices.Edge.Hub.Http.Controllers.TwinsController.InvokeModuleMethodAsync (Microsoft.Azure.Devices.Edge.Hub.Http)"'
<6> 2024-04-04 14:29:51.342 +00:00 [INF] - Route matched with "{action = \"InvokeModuleMethod\", controller = \"Twins\"}". Executing controller action with signature "System.Threading.Tasks.Task InvokeModuleMethodAsync(System.String, System.String, Microsoft.Azure.Devices.Edge.Hub.Http.MethodRequest)" on controller "Microsoft.Azure.Devices.Edge.Hub.Http.Controllers.TwinsController" ("Microsoft.Azure.Devices.Edge.Hub.Http").
<6> 2024-04-04 14:29:51.343 +00:00 [INF] - Client 787052036B1B/APIServer in device scope authenticated locally.
<4> 2024-04-04 14:29:52.302 +00:00 [WRN] - Did not receive response for method invoke call from device/module 787052036B1B/SensorController for 787052036B1B/SensorController with correlation ID c04837eb-5333-4a1e-868d-ad6e59f5a871
<6> 2024-04-04 14:29:52.303 +00:00 [INF] - Executed action "Microsoft.Azure.Devices.Edge.Hub.Http.Controllers.TwinsController.InvokeModuleMethodAsync (Microsoft.Azure.Devices.Edge.Hub.Http)" in 5017.3878ms
<6> 2024-04-04 14:29:52.303 +00:00 [INF] - Executed endpoint '"Microsoft.Azure.Devices.Edge.Hub.Http.Controllers.TwinsController.InvokeModuleMethodAsync (Microsoft.Azure.Devices.Edge.Hub.Http)"'
<6> 2024-04-04 14:29:52.304 +00:00 [INF] - "Request finished HTTP/1.1 POST https://cnh-787052036b1b/twins/787052036B1B/modules/SensorController/methods?api-version=2020-09-30 application/json;+charset=utf-8 131 - 504 108 application/json;+charset=utf-8 5018.9389ms"

Re-initialization of module by edgeHub 3 minutes later, which causes it to work again.

<6> 2024-04-04 14:32:47.733 +00:00 [INF] - New token received on the Cbs link
<6> 2024-04-04 14:32:47.737 +00:00 [INF] - Client 787052036B1B/SensorController in device scope authenticated locally.
<6> 2024-04-04 14:32:47.738 +00:00 [INF] - New device connection for device 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.738 +00:00 [INF] - Client 787052036B1B/SensorController connected to edgeHub, processing existing subscriptions.
<6> 2024-04-04 14:32:47.739 +00:00 [INF] - Attempting to connect to IoT Hub for client 787052036B1B/SensorController via AMQP...
<6> 2024-04-04 14:32:47.739 +00:00 [INF] - Bind device proxy for device 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.746 +00:00 [INF] - Initialized device listener in the AMQP protocol head for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.746 +00:00 [INF] - Opened link Events for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.754 +00:00 [INF] - Opened link MethodReceiving for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.757 +00:00 [INF] - Opened link MethodSending for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.758 +00:00 [INF] - Processing pending subscriptions for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.758 +00:00 [INF] - Opened link TwinSending for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.760 +00:00 [INF] - Opened link TwinReceiving for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.819 +00:00 [INF] - Cloud connection for 787052036B1B/SensorController is True
<6> 2024-04-04 14:32:47.819 +00:00 [INF] - Connection status for 787052036B1B/SensorController changed to ConnectionEstablished
<6> 2024-04-04 14:32:47.819 +00:00 [INF] - Client 787052036B1B/SensorController connected to cloud, processing existing subscriptions.
<6> 2024-04-04 14:32:47.819 +00:00 [INF] - Skipping 787052036B1B/SensorController for subscription processing, as it is currently being processed.
<6> 2024-04-04 14:32:47.820 +00:00 [INF] - Created cloud proxy for client 787052036B1B/SensorController via AMQP, with client operation timeout 20 seconds.
<6> 2024-04-04 14:32:47.820 +00:00 [INF] - Initialized cloud proxy ebfaee2c-8b98-45e4-afee-38916d4377d0 for 787052036B1B/SensorController
<6> 2024-04-04 14:32:47.820 +00:00 [INF] - Created cloud connection for client 787052036B1B/SensorController

Best regards.
Wim.

@nyanzebra
Copy link
Contributor

@wvangeem just to update, I am still investigating the code and logs and will post when have an update.

@nyanzebra
Copy link
Contributor

Hi @wvangeem sorry for the delay, it is also strange there is another setup in which the situation of the modules is reversed... Would you mind setting the RuntimeLogLevel for edgeHub to 'debug'? What I am looking for is if edgeHub received the message and then didn't do the right thing or if it never received the message to begin with.

@wvangeem
Copy link
Author

Hi @nyanzebra
I have modified the modules so they don't communicate with DirectMethods anymore. I changed it to GRPC server/client communication.
What I noticed is that the networking between the modules is immediately ok. So the docker network is working perfectly. But there's still something wrong with the edgeHub. I added the RuntimeLogLevel as requested.
So what's going on now:
So every second, APIServer is calling a GRPC service on StateAnalyzer and SensorController. This is a success almost immediately, since this depends only on docker network.

But the StateAnalyzer service needs to send a message to IoTHub with following code:

                    if (serialNumberUpdated)
                        Console.WriteLine($"{LogHelpers.GenerateLogTime()} - {serialNumber} {serialNumberSource} - Sending Vehicle State to IoTHub - Serial Number updated");
                    else
                    {
                        string until = sendVehicleStateUntil.ToString("yyyy/MM/dd - HH:mm:ss");
                        Console.WriteLine($"{LogHelpers.GenerateLogTime()} - {serialNumber} {serialNumberSource} - Sending Vehicle State to IoTHub until {until}");
                    }
                    // serialNumberUpdated = false; MOET HIER STAAN OM GOED TE WERKEN zoals in module 0.1.5
                    DateTime debugtime = DateTime.UtcNow;
                    var eventMessage = new Message(vstate.ToByteArray());
                    eventMessage.Properties.Add("prototype", "VehicleState");
                    eventMessage.Properties.Add("serialnumber", serialNumber);
                    await moduleClient.SendEventAsync("IoTHub", eventMessage); //DEZE DUURT HIER LANG om komt er misschien niet door de eerste minuten na start
                    serialNumberUpdated = false;
                    Console.WriteLine($"{LogHelpers.GenerateLogTime()} - VEHICLESTATE SENT {debugtime.ToString("yyyy/MM/dd - HH:mm:ss")}");

The code has a variable "debugtime", which is the date the message was sent, and a message "VEHICLESTATE SENT" when the SentEventAsync function has finished.

I started the device at 06:01 - so all the attached logs are having this dates from at least that time

At 06:01:42 - APIServer is sending a message to IoTHub
APIServer log
2024/04/18-06:01:42 - RECEIVED SERIAL NUMBER = SIM_WELOTEC1 - Sending to IoTHub
edgeHub log
<7> 2024-04-18 06:01:42.440 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/APIServer

This same action also calls the GRPC Service on StateAnalyzer, which executes the code above:
2024/04/18-06:01:47 - Invoked GRPC Method ProcessVehicleState on StateAnalyzer FAILED - module state is 'DeadlineExceeded' - SENDING VEHICLESTATE AS BACKUP

But this seems to take a long time.
StateAnalyzer log
2024/04/18-06:01:43 - SIM_WELOTEC1 (Can) - Sending Vehicle State to IoTHub - Serial Number updated
You see that the "VEHICLESTATE SENT" message isn't displayed there.
So the code "halts" on the await moduleClient.SendEventAsync

But in the edgeHub log, you see that edgeHub received the message:
<7> 2024-04-18 06:01:43.658 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer

And this goes on for 2 minutes, and suddenly, after 2 minutes, this shows up in the StateAnalyzer log

2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:18
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:06
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:24
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:48
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:30
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:00
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:42
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:12
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:03:36
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:54
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:00
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:24
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:12
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:01:54
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:18
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:01:43
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:01:48
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:36
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:06
2024/04/18-06:03:41 - VEHICLESTATE SENT 2024/04/18 - 06:02:30

you see that all the await moduleClient.SendEventAsync calls finish, and all my threads finish too, displaying the last message in the code.

From that time on, all GRPC calls are executed immediately, so all SentEventAsync calls are executed immediately.

When you check the edgeHub logs at that time, you see that the StateAnalyzer module is "reconnected".
<6> 2024-04-18 06:03:41.649 +00:00 [INF] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.LinkHandler] - Closing link Events for E45F013E3620/StateAnalyzer
And all messages are processed by the EventsLinkHandler:

<7> 2024-04-18 06:03:41.825 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.826 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.827 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.831 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.832 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.833 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.834 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.835 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.837 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.838 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.839 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.840 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.841 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.842 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.843 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.844 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Core.Routing.RoutingEdgeHub] - Received 1 message(s) from E45F013E3620/StateAnalyzer
<7> 2024-04-18 06:03:41.848 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Hub.Amqp.LinkHandlers.EventsLinkHandler] - EventsLinkHandler processed 1 messages for E45F013E3620/StateAnalyzer

So I guess it's not directly related to Direct Methods, but the edgeHub has some strange behaviour.
Also interesting to know is that when I restart the StateAnalyzer module within these 2 minutes manually, the problem is solved immediately.

I attached the full logs here:
APIServer-20240418.log
edgeAgent-20240418.log
edgeHub-20240418.log
StateAnalyzer-20240418.log

Best regards,

Wim

@nyanzebra
Copy link
Contributor

@wvangeem would you mind doing a small test? https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-8.0 would see if setting the minimum thread pool to say 100 helps? Want to rule out starvation as a cause.

@wvangeem
Copy link
Author

Hi @nyanzebra,

I added following piece of code to every container:

            int newminWorker = 100;
            int minWorker, minIOC;
            ThreadPool.GetMinThreads(out minWorker, out minIOC); // Get the current settings.
            if (ThreadPool.SetMinThreads(newminWorker, minIOC))
            {
                Console.WriteLine($"EDGERUNTIME - {LogHelpers.GenerateLogTime()} - Changed minimum worker threads to {newminWorker}, asynchronous I/O threads are {minIOC}");
            }
            else
            {
                Console.WriteLine($"EDGERUNTIME - {LogHelpers.GenerateLogTime()} - Failed to change minimum worker threads to {newminWorker}, still {minWorker}, asynchronous I/O threads are {minIOC}");
            }

Every container reports this in the startup log
EDGERUNTIME - 2024/04/30-11:15:18 - Changed minimum worker threads to 100, asynchronous I/O threads are 4

I still have exactly the same behaviour afterwards. All threads hang in the StateAnalyzer module, and 2 minutes after startup (after reboot), all the threads are finished.

Best regards,
Wim.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants