Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of keep alive failed. after serval iteration. #63

Closed
NightSkySK opened this issue Jan 7, 2024 · 16 comments
Closed

Handling of keep alive failed. after serval iteration. #63

NightSkySK opened this issue Jan 7, 2024 · 16 comments

Comments

@NightSkySK
Copy link
Contributor

Describe the bug
I'm tring to understand nature of the failure and how to fix it for 2 weeks and I can't find the right solution.

After number of itteration I get the status MQTTKeepAliveTimeout

E (449803) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (449803) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (449803) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

Some improvement I get rising the priority up to 6 for CONFIG_GRI_MQTT_AGENT_TASK_PRIORITY

I've also played with some other settings without major change.
I've traced task stack_size to make sure I'm not exceed there, but it seems to be fine.

I've went trough Open and Closed issues in this repo but I've not found any hint that someone faced similar issue to mine.

System information

  • Hardware board: ESP32
  • IDE used: 4.3
  • Operating System: Windows

Expected behavior
Keep MQTT connected for few hours not only for few minutes

Steps to reproduce bug
Example:
1. "I am using project [ ... ], and have configured with [ ... ]"
2. "When run on [ ... ], I observed that [ ... ]"

Console trace

I (431533) sub_pub_unsub_demo: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 115
W (431913) ota_over_mqtt: Stack size uxOTATask: 888

I (431973) sub_pub_unsub_demo: Subscribe 115 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (431973) sub_pub_unsub_demo: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 116.   
I (431983) sub_pub_unsub_demo: Task "SubPub0" waiting for publish 116 to complete.
W (432913) ota_over_mqtt: Stack size uxOTATask: 888

I (432973) coreMQTT: Publishing message to /filter/SubPub0.

W (433913) ota_over_mqtt: Stack size uxOTATask: 888

I (434203) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (434203) coreMQTT: State record updated. New state=MQTTPublishDone.
I (434203) sub_pub_unsub_demo: Publish 116 succeeded for task "SubPub0".
I (434203) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (434223) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (434223) sub_pub_unsub_demo: Task "SubPub0" received: SubPub0
I (434233) sub_pub_unsub_demo: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 117
W (434913) ota_over_mqtt: Stack size uxOTATask: 888

W (435913) ota_over_mqtt: Stack size uxOTATask: 888

I (436353) sub_pub_unsub_demo: Unsubscribe 117 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (436353) sub_pub_unsub_demo: Task "SubPub0" completed a loop. Delaying before next loop.
W (436363) sub_pub_unsub_demo: Stack size uxSubPubTask: 1060

W (436913) ota_over_mqtt: Stack size uxOTATask: 888

W (437913) ota_over_mqtt: Stack size uxOTATask: 888

W (438913) ota_over_mqtt: Stack size uxOTATask: 888

W (439913) ota_over_mqtt: Stack size uxOTATask: 888

W (440913) ota_over_mqtt: Stack size uxOTATask: 888

I (441363) sub_pub_unsub_demo: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 118
I (441803) sub_pub_unsub_demo: Subscribe 118 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (441803) sub_pub_unsub_demo: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 119.   
I (441813) sub_pub_unsub_demo: Task "SubPub0" waiting for publish 119 to complete.
W (441913) ota_over_mqtt: Stack size uxOTATask: 888

I (442803) coreMQTT: Publishing message to /filter/SubPub0.

W (442913) ota_over_mqtt: Stack size uxOTATask: 888

W (443913) ota_over_mqtt: Stack size uxOTATask: 888

W (444913) ota_over_mqtt: Stack size uxOTATask: 888

W (445913) ota_over_mqtt: Stack size uxOTATask: 888

W (446913) ota_over_mqtt: Stack size uxOTATask: 888

W (447913) ota_over_mqtt: Stack size uxOTATask: 888

W (448913) ota_over_mqtt: Stack size uxOTATask: 888

E (449803) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (449803) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (449803) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (449813) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (449813) core_mqtt_agent_manager: TLS connection was disconnected.
I (449823) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (449833) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
W (449913) ota_over_mqtt: Stack size uxOTATask: 888

W (450913) ota_over_mqtt: Stack size uxOTATask: 888

W (451913) ota_over_mqtt: Stack size uxOTATask: 888

W (452243) esp-tls: Failed to open new connection in specified timeout
I (452563) core_mqtt_agent_manager: Retry attempt 1.
W (452913) ota_over_mqtt: Stack size uxOTATask: 888

W (453913) ota_over_mqtt: Stack size uxOTATask: 888

W (454913) ota_over_mqtt: Stack size uxOTATask: 888

E (455613) coreMQTT: CONNACK recv failed with status = MQTTNoDataAvailable.
E (455613) coreMQTT: MQTT connection failed with status = MQTTNoDataAvailable.
I (455613) core_mqtt_agent_manager: Session present: 0

E (455623) core_mqtt_agent_manager: MQTT_Status: MQTTNoDataAvailable
W (455913) ota_over_mqtt: Stack size uxOTATask: 888

I (455943) core_mqtt_agent_manager: Retry attempt 2.
W (456913) ota_over_mqtt: Stack size uxOTATask: 888

I (456913) ota_over_mqtt: OTA agent resumed by timer.
W (457913) ota_over_mqtt: Stack size uxOTATask: 888

I (458463) coreMQTT: MQTT connection established with the broker.
I (458463) core_mqtt_agent_manager: Session present: 1

E (458463) coreMQTT: Collision when adding PacketID=132 at index=0.
I (458473) sub_pub_unsub_demo: coreMQTT-Agent connected.
I (458483) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (458483) core_mqtt_agent_manager: coreMQTT-Agent connected.
W (458913) ota_over_mqtt: Stack size uxOTATask: 888

I (458913) AWS_OTA: otaPal_GetPlatformImageState
I (458913) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (458913) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (458923) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
I (458943) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (458943) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (458943) sub_pub_unsub_demo: Error or timed out waiting for ack for publish message 119. Re-attempting publish.
I (458953) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (458963) sub_pub_unsub_demo: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 119.   
I (458963) coreMQTT: State record updated. New state=MQTTPublishDone.
I (458983) sub_pub_unsub_demo: Task "SubPub0" waiting for publish 119 to complete.
I (458983) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (459003) sub_pub_unsub_demo: Publish 119 succeeded for task "SubPub0".
I (459003) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (459023) sub_pub_unsub_demo: Task "SubPub0" received: SubPub0
I (459023) sub_pub_unsub_demo: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 120
I (459313) ota_over_mqtt: Subscribed to topic $aws/things/GA913F0001/jobs/notify-next.


I (459313) AWS_OTA: Subscribed to MQTT topic: $aws/things/GA913F0001/jobs/notify-next
W (459923) ota_over_mqtt: Stack size uxOTATask: 888

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4000c250  PS      : 0x00060730  A0      : 0x8018ae5c  A1      : 0x3ffd75e0
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x00007869  A5      : 0x3f42b34c  
A6      : 0x00000008  A7      : 0xff000000  A8      : 0x00000000  A9      : 0x3ffd75b0
A10     : 0x00000000  A11     : 0x3f408be0  A12     : 0x3ffd7834  A13     : 0x3ffb6d1c
A14     : 0x00000000  A15     : 0x00000001  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000001  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xfffffffd

Backtrace:0x4000c24d:0x3ffd75e0 0x4018ae59:0x3ffd7600 0x4018c4f5:0x3ffd7910 0x4019bc85:0x3ffd7940 0x40096801:0x3ffd7970 0x401062e8:0x3ffd79c0 0x40105e20:0x3ffd79f0 0x40106068:0x3ffd7a40 0x400d94f2:0x3ffd7a70 0x400945b1:0x3ffd7a90
0x4018ae59: _vfprintf_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vfprintf.c:1521

0x4018c4f5: vprintf at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vprintf.c:34 (discriminator 
5)

0x4019bc85: esp_log_writev at C:/Espressif/frameworks/esp-idf-v4.3.4/components/log/log.c:189

0x40096801: esp_log_write at C:/Espressif/frameworks/esp-idf-v4.3.4/components/log/log.c:199

0x401062e8: MQTTAgentCommand_Publish at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent_command_functions.c:79 (discriminator 13)

0x40105e20: processCommand at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:584

0x40106068: MQTTAgent_CommandLoop at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:1059

0x400d94f2: prvMQTTAgentTask at C:\mSpaceLock\mSpaceGate-fmw\build/../main/networking/mqtt/core_mqtt_agent_manager.c:519

0x400945b1: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v4.3.4/components/freertos/port/xtensa/port.c:168



ELF file SHA256: ece89f91691638ee

Rebooting...
I (169237) sub_pub_unsub_demo: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 37
I (169677) sub_pub_unsub_demo: Subscribe 37 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (169687) sub_pub_unsub_demo: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 38.    
I (169697) sub_pub_unsub_demo: Task "SubPub0" waiting for publish 38 to complete.
W (169837) ota_over_mqtt: Stack size uxOTATask: 892

I (170677) coreMQTT: Publishing message to /filter/SubPub0.

W (170837) ota_over_mqtt: Stack size uxOTATask: 892

W (171837) ota_over_mqtt: Stack size uxOTATask: 892

W (172837) ota_over_mqtt: Stack size uxOTATask: 892

W (173837) ota_over_mqtt: Stack size uxOTATask: 892

W (174837) ota_over_mqtt: Stack size uxOTATask: 892

E (175687) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (175687) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (175687) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (17

Backtrace:0x400D9533:0x3FFD7F60 0x400945B1:0x3FFD7F90
0x400d9533: prvMQTTAgentTask at C:\mSpaceLock\mSpaceGate-fmw\build/../main/networking/mqtt/core_mqtt_agent_manager.c:536

0x400945b1: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v4.3.4/components/freertos/port/xtensa/port.c:168


5697) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (175707) core_mqtt_agent_manager: ../main/networking/mqtt/core_mqtt_agent_manager.c 537

I (175717) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (175707) core_mqtt_agent_manager: TLS connection was disconnected.
I (175727) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
W (175837) ota_over_mqtt: Stack size uxOTATask: 892

W (175837) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (175837) AWS_OTA: OTA Agent is suspended.
I (175837) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
W (177837) ota_over_mqtt: Stack size uxOTATask: 892

W (178437) esp-tls: Failed to open new connection in specified timeout
I (178837) core_mqtt_agent_manager: Retry attempt 1.
W (178837) ota_over_mqtt: Stack size uxOTATask: 892

W (179837) ota_over_mqtt: Stack size uxOTATask: 892

W (180837) ota_over_mqtt: Stack size uxOTATask: 892

W (181837) ota_over_mqtt: Stack size uxOTATask: 892

I (182787) coreMQTT: MQTT connection established with the broker.
I (182787) core_mqtt_agent_manager: Session present: 1

E (182787) coreMQTT: Collision when adding PacketID=43 at index=0.
I (182807) sub_pub_unsub_demo: coreMQTT-Agent connected.
I (182807) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (182807) core_mqtt_agent_manager: coreMQTT-Agent connected.
W (182837) ota_over_mqtt: Stack size uxOTATask: 892

I (182837) AWS_OTA: otaPal_GetPlatformImageState
I (182837) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (182837) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (182847) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
I (182867) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (182867) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (182867) sub_pub_unsub_demo: Error or timed out waiting for ack for publish message 38. Re-attempting publish.
I (182877) sub_pub_unsub_demo: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 38.
I (182897) sub_pub_unsub_demo: Task "SubPub0" waiting for publish 38 to complete.
I (182997) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (182997) coreMQTT: State record updated. New state=MQTTPublishDone.
I (182997) sub_pub_unsub_demo: Publish 38 succeeded for task "SubPub0".
I (182997) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (183017) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (183017) sub_pub_unsub_demo: Task "SubPub0" received: SubPub0
I (183027) sub_pub_unsub_demo: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 39
I (183197) ota_over_mqtt: Subscribed to topic $aws/things/GA913F0001/jobs/notify-next.


I (183197) AWS_OTA: Subscribed to MQTT topic: $aws/things/GA913F0001/jobs/notify-next
W (183847) ota_over_mqtt: Stack size uxOTATask: 892

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4000c250  PS      : 0x00060130  A0      : 0x8018ae84  A1      : 0x3ffd7ad0
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x00007869  A5      : 0x3f42b39c  
A6      : 0x00000008  A7      : 0xff000000  A8      : 0x00000000  A9      : 0x3ffd7aa0  
A10     : 0x00000000  A11     : 0x3f408c34  A12     : 0x800819de  A13     : 0x3ffbea50
A14     : 0x00000000  A15     : 0x00000001  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000001  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xfffffffd

Backtrace:0x4000c24d:0x3ffd7ad0 0x4018ae81:0x3ffd7af0 0x4018c51d:0x3ffd7e00 0x4019bcad:0x3ffd7e30 0x40096801:0x3ffd7e60 0x40106310:0x3ffd7eb0 0x40105e48:0x3ffd7ee0 0x40106090:0x3ffd7f30 0x400d94fa:0x3ffd7f60 0x400945b1:0x3ffd7f90
0x4018ae81: _vfprintf_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vfprintf.c:1521

0x4018c51d: vprintf at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vprintf.c:34 (discriminator 
5)

0x4019bcad: esp_log_writev at C:/Espressif/frameworks/esp-idf-v4.3.4/components/log/log.c:189

0x40096801: esp_log_write at C:/Espressif/frameworks/esp-idf-v4.3.4/components/log/log.c:199

0x40106310: MQTTAgentCommand_Publish at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent_command_functions.c:79 (discriminator 13)

0x40105e48: processCommand at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:584

0x40106090: MQTTAgent_CommandLoop at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:1059

0x400d94fa: prvMQTTAgentTask at C:\mSpaceLock\mSpaceGate-fmw\build/../main/networking/mqtt/core_mqtt_agent_manager.c:520

0x400945b1: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v4.3.4/components/freertos/port/xtensa/port.c:168



ELF file SHA256: 5316e8bbe4264236

Rebooting...

Additional context
As a side question I wonder why in this example we start demo tasks before starting WiFi and coreMQTT-Agent network manager
I found another example where it's done other way around like github.com/aws-samples/device-software-for-aws-connected-mobility-solution What are the pros and cons of those two approach?

@aggarg
Copy link
Member

aggarg commented Jan 8, 2024

It seems like after the MQTT keep alive times out, this log statement is causing the core to panic which is what we want to debug. Can you use CONFIG_ESP_SYSTEM_PANIC_GDBSTUB option described here to find out why the log is leading to crash?

Network failures once in a while are okay as long as the application recovers. You can try capturing the network data, if you want to find out the cause of MQTTKeepAliveTimeout.

@NightSkySK
Copy link
Contributor Author

Sorry for late reply, but I was away last two days. Any way as you suggested I've run the debug with CONFIG_ESP_SYSTEM_PANIC_GDBSTUB option enabled here is my log out come:

I (117478) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (117478) coreMQTT: State record updated. New state=MQTTPublishDone.
I (117478) sub_pub_unsub: Publish 14 succeeded for task "SubPub0".
I (117478) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (117498) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (117498) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (117508) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 15
I (125508) sub_pub_unsub: Unsubscribe 15 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (125508) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.
I (130518) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 16
I (130918) sub_pub_unsub: Subscribe 16 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (130918) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 17.
I (130928) sub_pub_unsub: Task "SubPub0" waiting for publish 17 to complete.
I (131918) coreMQTT: Publishing message to /filter/SubPub0.

E (136928) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (136928) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (136928) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (136938) sub_pub_unsub: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (136938) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 2040

I (136948) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (136938) core_mqtt_agent_manager: TLS connection was disconnected.
I (136958) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (140748) coreMQTT: MQTT connection established with the broker.
I (140748) core_mqtt_agent_manager: Session present: 1

E (140748) coreMQTT: Collision when adding PacketID=21 at index=0.
I (140768) sub_pub_unsub: coreMQTT-Agent connected.
I (140768) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (140768) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (141768) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (141768) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (141768) sub_pub_unsub: Error or timed out waiting for ack for publish message 17. Re-attempting publish.
I (141778) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (141778) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 17.
I (141788) coreMQTT: State record updated. New state=MQTTPublishDone.
I (141808) sub_pub_unsub: Task "SubPub0" waiting for publish 17 to complete.
I (141808) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (141828) sub_pub_unsub: Publish 17 succeeded for task "SubPub0".
I (141828) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (141838) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (141848) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 18
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x4000c250  PS      : 0x00060130  A0      : 0x8018b850  A1      : 0x3ffd8120
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x00007871  A5      : 0x3f42b46c  
A6      : 0x00000008  A7      : 0xff000000  A8      : 0x00000000  A9      : 0x3ffd80f0  
A10     : 0x00000000  A11     : 0x3f408c14  A12     : 0x3ffd8374  A13     : 0x3ffb6d1c
A14     : 0x00000000  A15     : 0x00000001  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000001  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xfffffffd

Backtrace:0x4000c24d:0x3ffd8120 0x4018b84d:0x3ffd8140 0x4018cee9:0x3ffd8450 0x4019c75d:0x3ffd8480 0x40096905:0x3ffd84b0 0x40106318:0x3ffd8500 0x40105e50:0x3ffd8530 0x40106098:0x3ffd8580 0x400d94f6:0x3ffd85b0 0x400946b5:0x3ffd85d0
0x4018b84d: _vfprintf_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vfprintf.c:1521

0x4018cee9: vprintf at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vprintf.c:34 (discriminator 5)

0x4019c75d: esp_log_writev at C:/Espressif/frameworks/esp-idf-v4.3.4/components/log/log.c:189

0x40096905: esp_log_write at C:/Espressif/frameworks/esp-idf-v4.3.4/components/log/log.c:199

0x40106318: MQTTAgentCommand_Publish at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent_command_functions.c:79 (discriminator 13)

0x40105e50: processCommand at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:584

0x40106098: MQTTAgent_CommandLoop at C:\mSpaceLock\mSpaceGate-fmw\build/../components/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:1059
0x400d94f6: prvMQTTAgentTask at C:\mSpaceLock\mSpaceGate-fmw\build/../main/networking/mqtt/core_mqtt_agent_manager.c:520

0x400946b5: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v4.3.4/components/freertos/port/xtensa/port.c:168



ELF file SHA256: 9c1256dcb6df26f6

Entering gdb stub now.
$T0b#e6C:\Espressif\tools\xtensa-esp32-elf\esp-2021r2-patch3-8.4.0\xtensa-esp32-elf\bin\xtensa-esp32-elf-gdb.exe: warning: Couldn't determine a path for the index cache directory.
GNU gdb (crosstool-NG esp-2021r2-patch3) 9.2.90.20200913-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-host_w64-mingw32 --target=xtensa-esp32-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from c:\mSpaceLock\mSpaceGate-fmw\build\mSpaceGate.elf...
Remote debugging using \\.\COM3
Ignoring packet error, continuing...
warning: unrecognized item "timeout" in "qSupported" response
Ignoring packet error, continuing...
Remote replied unexpectedly to 'vMustReplyEmpty': timeout

@NightSkySK
Copy link
Contributor Author

NightSkySK commented Jan 19, 2024

I finally was able to make my JTAG debugging to gbd work. (sorry this is my first time when I use it and I have no clue which information is useful and how to take more useful one, any advice is more than welcome)

ESP-IDF Monitor:

I (8529184) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (8529184) coreMQTT: State record updated. New state=MQTTPublishDone.
I (8529184) sub_pub_unsub: Publish 2591 succeeded for task "SubPub0".
I (8529184) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (8529204) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (8529204) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (8529214) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 2
I (8531444) sub_pub_unsub: Unsubscribe 2592 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (8531444) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.
I (8536454) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 259
I (8536894) sub_pub_unsub: Subscribe 2593 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (8536894) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub
I (8536904) sub_pub_unsub: Task "SubPub0" waiting for publish 2594 to complete.
I (8537894) coreMQTT: Publishing message to /filter/SubPub0.

I (8539134) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (8539134) coreMQTT: State record updated. New state=MQTTPublishDone.
I (8539134) sub_pub_unsub: Publish 2594 succeeded for task "SubPub0".
I (8539144) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (8539154) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (8539154) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (8539164) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 2
I (8541374) sub_pub_unsub: Unsubscribe 2595 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (8541374) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.
I (8544384) ota_over_mqtt: OTA agent suspend by timer.
W (8545384) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (8545384) AWS_OTA: OTA Agent is suspended.
acket: DeserializerResult=MQTTSuccess.
I (8548844) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (8548844) sub_pub_unsub: Error or timed out waiting for ack for publish message 2597. Re-attempting publish.
I (8548854) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (8548864) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 2597. with ID 2597.
I (8548864) coreMQTT: State record updated. New state=MQTTPublishDone.
I (8548884) sub_pub_unsub: Task "SubPub0" waiting for publish 2597 to complete.
I (8548894) sub_pub_unsub: Publish 2597 succeeded for task "SubPub0".
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Setting breakpoint at 0x4000c250 and returning...

The debug console output:

GNU gdb (crosstool-NG esp-2021r2-patch3) 9.2.90.20200913-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-host_w64-mingw32 --target=xtensa-esp32-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Warning: Debuggee TargetArchitecture not detected, assuming x86_64.
=cmd-param-changed,param="pagination",value="off"
keep_alive() was not invoked in the 1000 ms timelimit. GDB alive packet not sent! (1115 ms). Workaround: increase "set remotetimeout" in GDB
[esp32.cpu0] Target halted, PC=0x40197352, debug_reason=00000000
Set GDB target to 'esp32.cpu0'
[esp32.cpu1] Target halted, PC=0x4000C250, debug_reason=00000001
[New Thread 1073467380]
[New Thread 1073547452]
[New Thread 1073573652]
[New Thread 1073557628]
[New Thread 1073596624]
[New Thread 1073446436]
[New Thread 1073569576]
[New Thread 1073549852]
[New Thread 1073587112]
[New Thread 1073444376]
[New Thread 1073446016]
[New Thread 1073561688]
[New Thread 1073581788]

Thread 1 "coreMQTT-Agent" received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 1073578364]
0x4000c250 in ?? ()
Execute debugger commands using "-exec <command>", for example "-exec info registers" will list registers in use (when GDB is the debugger)

The debug console is only part of what you would like to see as there is much more information in the VSCode debugger...

The call stack is looking like this:
image

I'm not sure how to provide all useful information about variables and registers from the debugger (I should probably be able to list them out to the gdb console if I know the right commands 😉 For the time being I paste a few screenshots which in my gut filling might be useful to understand the root cause.

image

Am I right that issue might be caused by pTopicName where it is blank (and should be /filter/SubPub0) but topicNameLength is 40006 ? So in next step in log.c format contain some garbage value.

image

pPublishinfo --> pTopicName is coming from pArg where in all there is 0x3ffd5d34
core_mqtt_agent_commad_functions.c line:72 pPublishInfo = ( const MQTTPublishInfo_t * ) ( pPublishArg );

image

core_mqtt_agent.c line:570 pCommandArgs = pCommand->pArgs;

I've lost track in this point as I couldn't figure out how the pPayload and pTopicName should be correctly passed into processCommand or MQTTAgentCommand_Publish functions

BTW After more deep dive into this problem I feel that I should move this topic in to coreMQTT-Agent repo, isn't it?

@aggarg
Copy link
Member

aggarg commented Jan 27, 2024

Am I right that issue might be caused by pTopicName where it is blank (and should be /filter/SubPub0) but topicNameLength is 40006 ?

You are absolutely right.

It seems that the pArgs value is not valid. In the case of PUBLISH command, it is MQTTPublishInfo_t supplied to MQTTAgent_Publish. Is it possible that your application called MQTTAgent_Publish and MQTTPublishInfo_t went out of scope before the MQTT Agent processed the command (i.e. before the command complete callback was invoked)?

BTW After more deep dive into this problem I feel that I should move this topic in to coreMQTT-Agent repo, isn't it?

Not sure that would be helpful. Let us first figure out the root cause.

@NightSkySK
Copy link
Contributor Author

I will do my best to support this root cause investigation, I just hope that my skills will be enough to handle it. I'm still noob in embedded programming, so I need some more detailed guidelines on what should I do or what to check, if you don't mind.

This catched error happened after over 2h of running the program and it was ~2600 iterations of this subpub loop.
Could you please advise, on how effectively can I check or catch that MQTTPublishInfo_t went out of scope?

@Skptak
Copy link
Member

Skptak commented Jan 29, 2024

Hey @NightSkySK, this is a complicated issue to even find, and what you've done already is great, no need to say your skills won't be enough!

Why don't you start by putting an if statement check in before the publish that checks that topicNameLength is less than 256 bytes, or maybe is equal to a known length?

if( pcTopicNameLength > 256UL )
{
    configASSERT( pcTopicNameLength > 256UL );
}

I'm using configASSERT() in this example, the idea is that this allows you to place breakpoint on the configASSERT() line itself. This way you could run this for an extended period of time, but only hit the breakpoint when the condition fails. By placing the breakpoint inside of the function that is using it you can then inspect the variables that are there, as well as provide a thread backtrace by using thread apply all bt

I think knowing the backtrace, and the values of the other variables, could be useful for figuring out what is happening here.

@aggarg
Copy link
Member

aggarg commented Jan 30, 2024

Also, are you using one of the examples or are you writing your own application? If you are writing your own application, it would be good if you can share your code.

@NightSkySK
Copy link
Contributor Author

@aggarg I've invited you and @Skptak to my repo as collaborators, as the repo is not prepared to make it publicly available.

@NightSkySK
Copy link
Contributor Author

NightSkySK commented Jan 31, 2024

Let me share few more observations regarding this case. I've noticed that each time before the the LoadProhibited error occure there is a warning msg like this
W (8548844) sub_pub_unsub: Error or timed out waiting for ack for publish message 2597. Re-attempting publish.

It might be our one step closer to the root casuse.

So this is like the standard sub_pub_usub loop task:

I (4809837) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.
I (4814847) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1468
I (4815287) sub_pub_unsub: Subscribe 1468 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (4815287) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1469.
I (4815297) sub_pub_unsub: Task "SubPub0" waiting for publish 1469 to complete.
I (4816287) coreMQTT: Publishing message to /filter/SubPub0.

I (4817517) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (4817517) coreMQTT: State record updated. New state=MQTTPublishDone.
I (4817517) sub_pub_unsub: Publish 1469 succeeded for task "SubPub0".
I (4817527) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (4817537) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (4817537) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (4817547) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1470
I (4819667) sub_pub_unsub: Unsubscribe 1470 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (4819667) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.

and this is the loop after which LoadProhibited error occure:

I (4819667) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.
I (4824677) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1471
I (4825117) sub_pub_unsub: Subscribe 1471 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (4825117) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1472.
I (4825127) sub_pub_unsub: Task "SubPub0" waiting for publish 1472 to complete.
I (4826117) coreMQTT: Publishing message to /filter/SubPub0.

I (4827347) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (4827347) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (4827347) sub_pub_unsub: Error or timed out waiting for ack for publish message 1472. Re-attempting publish.
I (4827357) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (4827367) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1472.
I (4827367) coreMQTT: State record updated. New state=MQTTPublishDone.
I (4827387) sub_pub_unsub: Task "SubPub0" waiting for publish 1472 to complete.
I (4827397) sub_pub_unsub: Publish 1472 succeeded for task "SubPub0".
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

From AWS IoT I can see the Subscribe 1471 for topic filter /filter/SubPub0 registered to AWS IoT
image
15:18:35 Jan 31th 2024

Also the Published msg id 1472 is registered the same time:
image

How ever this time the logs after publishing is a bit different comparing to previouse iterations

I (4826117) coreMQTT: Publishing message to /filter/SubPub0.

I (4827347) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (4827347) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (4827347) sub_pub_unsub: Error or timed out waiting for ack for publish message 1472. Re-attempting publish.
I (4827357) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (4827367) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1472.
I (4827367) coreMQTT: State record updated. New state=MQTTPublishDone.

once before we've received

I (4816287) coreMQTT: Publishing message to /filter/SubPub0.

I (4817517) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (4817517) coreMQTT: State record updated. New state=MQTTPublishDone.
I (4817517) sub_pub_unsub: Publish 1469 succeeded for task "SubPub0".
I (4817527) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (4817537) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (4817537) sub_pub_unsub: Task "SubPub0" received: SubPub0

Loop with crash didn't get Ack packet deserialized with result: MQTTSuccess. and State record updated. New state=MQTTPublishDone. before Error or timed out waiting for ack for publish message 1472. Re-attempting publish. and task is tring to republish the message. How ever shortly after warning about time out fro Ack. Msg arrive I (4827357) coreMQTT: Ack packet deserialized with result: MQTTSuccess. and I (4827367) coreMQTT: State record updated. New state=MQTTPublishDone, in the same time sup_pub_unsub task tries to reply the request to send msg I (4827367) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1472.
As coreMQTT get confirmation that msg was successfuly published, my assumption is, that, memory for payload and topic was released ...

During the crash loop the gap between publish and received Ack is 1240 ms once I've calculated some random previous loops its between 1120 ~ 1230 ms

        if (xCommandAdded == MQTTSuccess)
        {
            /* For QoS 1 and 2, wait for the publish acknowledgment.  For QoS0,
             * wait for the publish to be sent. */
            ESP_LOGI(TAG,
                     "Task \"%s\" waiting for publish %" PRIu32 " to complete.",
                     pcTaskGetName(xCommandContext.xTaskToNotify),
                     ulPublishMessageId);

            xCommandAcknowledged = prvWaitForNotification(&ulNotifiedValue);
        }

so function prvWaitForNotification is passing xCommandAcknowledged == pdFALSE

and triggers:

        /* Check all ways the status was passed back just for demonstration
         * purposes. */
        if ((xCommandAcknowledged != pdTRUE) ||
            (xCommandContext.xReturnStatus != MQTTSuccess) ||
            (ulNotifiedValue != ulPublishMessageId))
        {
            ESP_LOGW(TAG,
                     "Error or timed out waiting for ack for publish message %" PRIu32 ". Re-attempting publish.",
                     ulPublishMessageId);
        }

I can't understand why prvWaitForNotification is waiting so short as the function definition is like:

static BaseType_t prvWaitForNotification(uint32_t *pulNotifiedValue)
{
    BaseType_t xReturn;

    /* Wait for this task to get notified, passing out the value it gets
     * notified with. */
    xReturn = xTaskNotifyWait(0,
                              0,
                              pulNotifiedValue,
                              portMAX_DELAY);
    return xReturn;
}

once portMAX_DELAY seems to be much longer...

typedef uint32_t TickType_t;
#define portMAX_DELAY ( TickType_t ) 0xffffffffUL

Does it make any sens?

@aggarg
Copy link
Member

aggarg commented Feb 1, 2024

I think there is race condition causing the MQTTPublishInfo_t to go out of scope:

  1. The prvPublishToTopic sends a PUBLISH request to MQTT Agent's command queue.
  2. The MQTT Agent picks the PUBLISH request and publishes the message.
  3. The PUBACK does not arrive in time and the MQTT Agent times out the request.
  4. prvPublishToTopic returns from prvWaitForNotification and decides to re-attempt publish. As a result, another PUBLISH request with the same publish ID is added to the MQTT Agent's command queue.
  5. PUBACK for the previous request arrives. The MQTT Agent matches it with the second request (because the publish ID is same) and unblocks prvPublishToTopic.
  6. prvPublishToTopic returns and MQTTPublishInfo_t goes out of scope.
  7. The MQTT Agent picks up the second request but the corresponding MQTTPublishInfo_t is out of scope causing the crash.

I think we should not use the same publish ID for re-attempt. Can you try to change the code for re-attempt in prvPublishToTopic to the following:

        if ((xCommandAcknowledged != pdTRUE) ||
            (xCommandContext.xReturnStatus != MQTTSuccess) ||
            (ulNotifiedValue != ulPublishMessageId))
        {
            ESP_LOGW(TAG,
                     "Error or timed out waiting for ack for publish message %" PRIu32 ".",
                     ulPublishMessageId);

            /* Generate a new publish ID so that we can differentiate late
             * acknowledgements. */
            xSemaphoreTake(xMessageIdSemaphore, portMAX_DELAY);
            {
                ++ulMessageId;
                ulPublishMessageId = ulMessageId;
            }
            xSemaphoreGive(xMessageIdSemaphore);

            ESP_LOGW(TAG,
                     "Re-attempting publish with new ID %" PRIu32 ".",
                     ulPublishMessageId);

            /* Update publish ID in the command context. */
            xCommandContext.ulNotificationValue = ulPublishMessageId;
        }

You can apply the following patch to your repo's main branch - new_pub_id.patch.

@NightSkySK
Copy link
Contributor Author

@aggarg Thank you for the patch, it moves us in the right direction. However the new error pop up

I (7822661) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 2389
I (7823101) sub_pub_unsub: Subscribe 2389 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (7823101) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 2390.
I (7823121) sub_pub_unsub: Task "SubPub0" waiting for publish 2390 to complete.
I (7824101) coreMQTT: Publishing message to /filter/SubPub0.

I (7825331) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (7825331) coreMQTT: State record updated. New state=MQTTPubAckSend.
W (7825341) sub_pub_unsub: Re-attempting publish with new ID 2391.
I (7825351) coreMQTT: State record updated. New state=MQTTPublishDone.
I (7825361) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 2391.
I (7825381) sub_pub_unsub: Task "SubPub0" waiting for publish 2391 to complete.                                                         th ID 2391.
W (7825381) sub_pub_unsub: Error or timed out waiting for ack for publish message 2391.
W (7825391) sub_pub_unsub: Re-attempting publish with new ID 2392.
I (7825401) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 2392.                                                                                                                             th ID 2392.
I (7825411) sub_pub_unsub: Task "SubPub0" waiting for publish 2392 to complete.
I (7826361) coreMQTT: Publishing message to /filter/SubPub0.

I (7826591) coreMQTT: Publishing message to /filter/SubPub0.

I (7826791) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (7826791) coreMQTT: State record updated. New state=MQTTPublishDone.
I (7826791) sub_pub_unsub: Publish 2392 succeeded for task "SubPub0".
I (7826801) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (7826811) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (7826821) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (7826821) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 2393  
I (7826991) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (7827001) coreMQTT: State record updated. New state=MQTTPublishDone.
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Setting breakpoint at 0x40093d2b and returning...
0x40093d2b: xTaskGenericNotify at C:/Espressif/frameworks/esp-idf-v4.3.4/components/freertos/tasks.c:5323

So this time after unsubscribe from topicfilter second PUBACK arrive from repeated publish msg, and this is causing LoadProhibited error again
image

Function prvPublishCommandCallback tries to xTaskNotify that publish command was completed, but task is no longer waiting for this confirmation as it was already registered that
I (7826791) sub_pub_unsub: Publish 2392 succeeded for task "SubPub0".

static void prvPublishCommandCallback(MQTTAgentCommandContext_t *pxCommandContext,
                                      MQTTAgentReturnInfo_t *pxReturnInfo)
{
    /* Store the result in the application defined context so the task that
     * initiated the publish can check the operation's status. */
    pxCommandContext->xReturnStatus = pxReturnInfo->returnCode;

    if (pxCommandContext->xTaskToNotify != NULL)
    {
        /* Send the context's ulNotificationValue as the notification value so
         * the receiving task can check the value it set in the context matches
         * the value it receives in the notification. */
        xTaskNotify(pxCommandContext->xTaskToNotify,
                    pxCommandContext->ulNotificationValue,
                    eSetValueWithOverwrite);
    }
}

@aggarg
Copy link
Member

aggarg commented Feb 5, 2024

Apologies but I think my previous hypothesis was incorrect. The agent never times out a waiting ACK and therefore, the step 3 in my previous assessment cannot happen.

Looking more at your code and logs, I think we are getting PUBLISH message before PUBACK and the current code incorrectly detects that as failed publish and attempts to re-publish. Would you please try the following patch - remove_id_check.patch.

Note that it is only for testing and not the perfect solution.

@NightSkySK
Copy link
Contributor Author

Please have a look on the below logs.
After line I (6403583) sub_pub_unsub: Publish 1958 succeeded for task "SubPub0". sub_pub_unsub task stop working.
Next Info line should be I (xxxxxxx) sub_pub_unsub: Task "SubPub0" received: SubPub0, but it never arives and task stucked:

I (6391293) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1954
I (6391663) sub_pub_unsub: Subscribe 1954 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (6391673) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1955.
I (6391683) sub_pub_unsub: Task "SubPub0" waiting for publish 1955 to complete.
I (6392673) coreMQTT: Publishing message to /filter/SubPub0.

I (6393863) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (6393863) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6393863) sub_pub_unsub: Publish 1955 succeeded for task "SubPub0".
I (6393863) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (6393883) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (6393883) sub_pub_unsub: Task "SubPub0" received: SubPub0
I (6393893) sub_pub_unsub: Task "SubPub0" sending unsubscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1956
I (6395963) sub_pub_unsub: Unsubscribe 1956 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (6395963) sub_pub_unsub: Task "SubPub0" completed a loop. Delaying before next loop.
I (6400973) sub_pub_unsub: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1957
I (6401453) sub_pub_unsub: Subscribe 1957 for topic filter /filter/SubPub0 succeeded for task "SubPub0".
I (6401463) sub_pub_unsub: Task "SubPub0" sending publish request to coreMQTT-Agent with message "SubPub0" on topic "/filter/SubPub0" with ID 1958.
I (6401473) sub_pub_unsub: Task "SubPub0" waiting for publish 1958 to complete.
I (6402453) coreMQTT: Publishing message to /filter/SubPub0.

I (6403583) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (6403583) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (6403583) sub_pub_unsub: Publish 1958 succeeded for task "SubPub0".
I (6403593) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (6403603) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6416273) ota_over_mqtt: OTA agent suspend by timer.
W (6417273) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (6417273) AWS_OTA: OTA Agent is suspended.
I (6417273) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
����������������������������������������������������������������: OTA agent resumed by timer.
I (6538273) AWS_OTA: otaPal_GetPlatformImageState
I (6538273) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (6538273) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (6538273) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
I (6538583) ota_over_mqtt: Subscribed to topic $aws/things/GA965F0001/jobs/notify-next.


I (6538583) AWS_OTA: Subscribed to MQTT topic: $aws/things/GA965F0001/jobs/notify-next
I (6539583) coreMQTT: Publishing message to $aws/things/GA965F0001/jobs/$next/get.

I (6540703) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (6540703) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6540703) ota_over_mqtt: Sent PUBLISH packet to broker $aws/things/GA965F0001/jobs/$next/get to broker.


I (6540713) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (6540723) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6540723) AWS_OTA: Current State=[WaitingForJob], Event=[RequestJobDocument], New state=[WaitingForJob]
I (6540733) ota_over_mqtt: Received job message callback, size 54.


I (6540753) AWS_OTA: No active job available in received job document: OtaJobParseErr_t=OtaJobParseErrNoActiveJobs
I (6540763) ota_over_mqtt: Received OtaJobEventNoActiveJob callback from OTA Agent.
I (6540763) AWS_OTA: otaPal_GetPlatformImageState
I (6540773) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (6540783) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (6540783) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (6540793) AWS_OTA: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]
I (6568283) ota_over_mqtt: OTA agent suspend by timer.
W (6569283) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (6569283) AWS_OTA: OTA Agent is suspended.
I (6569283) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
����������������������������������������������������������������: OTA agent resumed by timer.
I (6690283) AWS_OTA: otaPal_GetPlatformImageState
I (6690283) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (6690283) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (6690283) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
I (6690743) ota_over_mqtt: Subscribed to topic $aws/things/GA965F0001/jobs/notify-next.


I (6690753) AWS_OTA: Subscribed to MQTT topic: $aws/things/GA965F0001/jobs/notify-next
I (6691753) coreMQTT: Publishing message to $aws/things/GA965F0001/jobs/$next/get.

I (6692873) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (6692873) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6692873) ota_over_mqtt: Sent PUBLISH packet to broker $aws/things/GA965F0001/jobs/$next/get to broker.


I (6692873) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (6692883) AWS_OTA: Current State=[WaitingForJob], Event=[RequestJobDocument], New state=[WaitingForJob]
I (6692903) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6692913) ota_over_mqtt: Received job message callback, size 54.


I (6692923) AWS_OTA: No active job available in received job document: OtaJobParseErr_t=OtaJobParseErrNoActiveJobs
I (6692933) ota_over_mqtt: Received OtaJobEventNoActiveJob callback from OTA Agent.
I (6692933) AWS_OTA: otaPal_GetPlatformImageState
I (6692943) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (6692953) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (6692953) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (6692963) AWS_OTA: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]
I (6720293) ota_over_mqtt: OTA agent suspend by timer.
W (6721293) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (6721293) AWS_OTA: OTA Agent is suspended.
I (6721293) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
����������������������������������������������������������������: OTA agent resumed by timer.
I (6842293) AWS_OTA: otaPal_GetPlatformImageState
I (6842293) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (6842293) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (6842293) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
I (6842593) ota_over_mqtt: Subscribed to topic $aws/things/GA965F0001/jobs/notify-next.


I (6842593) AWS_OTA: Subscribed to MQTT topic: $aws/things/GA965F0001/jobs/notify-next
I (6843593) coreMQTT: Publishing message to $aws/things/GA965F0001/jobs/$next/get.

I (6844743) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (6844743) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6844743) ota_over_mqtt: Sent PUBLISH packet to broker $aws/things/GA965F0001/jobs/$next/get to broker.


I (6844743) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (6844753) AWS_OTA: Current State=[WaitingForJob], Event=[RequestJobDocument], New state=[WaitingForJob]
I (6844773) coreMQTT: State record updated. New state=MQTTPublishDone.
I (6844783) ota_over_mqtt: Received job message callback, size 54.


I (6844793) AWS_OTA: No active job available in received job document: OtaJobParseErr_t=OtaJobParseErrNoActiveJobs
I (6844803) ota_over_mqtt: Received OtaJobEventNoActiveJob callback from OTA Agent.
I (6844803) AWS_OTA: otaPal_GetPlatformImageState
I (6844813) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (6844823) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (6844823) ota_over_mqtt: OTA Event processing completed. Freeing the event buffer to pool.
I (6844833) AWS_OTA: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]
I (6872303) ota_over_mqtt: OTA agent suspend by timer.
W (6873303) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (6873303) AWS_OTA: OTA Agent is suspended.

@aggarg
Copy link
Member

aggarg commented Feb 14, 2024

Can you break the code in the debugger and see what this pub-sub task is doing? As it is taking longer to debug, would you like to have a debug session to debug this together? If yes, please share your email id.

@AniruddhaKanhere
Copy link
Member

@NightSkySK was your issue resolved? Can you respond to @aggarg's post above?

Also, can you stop the OTA task (or better yet not start it altogether)? That will reduce the noise in the logs. Just keep the MQTT sub_pub_unsub task running.

@NightSkySK
Copy link
Contributor Author

@AniruddhaKanhere I was investigating and debugging this issue with @aggarg. Thanks to Gaurav's support we've managed to identify the root cause and confirmed the effectiveness of the solution provided by Gaurav.
The last step is to prepare the PR, which @aggarg promised to prepare based on our studies and debugging session.

aggarg added a commit to aggarg/iot-reference-esp32c3 that referenced this issue Apr 24, 2024
The current implementation uses task notification to wait for
acknowledgements and incoming publishes and therefore, cannot
differentiate between acknowledgements and incoming publishes. This
causes a crash when an incoming PUBLISH is received before PUBACK for
the previous outgoing publish.

This commit uses an event group to wait for acknowledgements and
incoming publishes. This allows us to differentiate between
acknowledgements and incoming publishes by waiting for different bits.

This was reported here - FreeRTOS#63.

Signed-off-by: Gaurav Aggarwal <aggarg@amazon.com>
@aggarg aggarg closed this as completed May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants