otbr-agent crash after around a week of uptime #2085
Comments
Just had another crash after 3 days. Before the crash, we were sending a lot of large packets: 8 devices were receiving OTA updates via CoAP block-wise transfer with a chunk size of 1024 bytes.
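To put the block-wise transfer above into numbers, here is a minimal sketch of the Block option arithmetic from RFC 7959: the size exponent is SZX = log2(block size) - 4, and the option value packs the block number, the "more" flag, and SZX. The function names and the 262144-byte image size in the usage note are hypothetical, chosen only for illustration; they are not taken from this setup.

```cpp
#include <cassert>
#include <cstdint>

// Map a block size (16..1024, power of two) to its SZX exponent.
// RFC 7959 defines block size = 2^(SZX + 4).
uint8_t BlockSizeToSzx(uint32_t aBlockSize)
{
    assert(aBlockSize >= 16 && aBlockSize <= 1024 && (aBlockSize & (aBlockSize - 1)) == 0);

    uint8_t szx = 0;
    while ((16u << szx) < aBlockSize)
    {
        szx++;
    }
    return szx;
}

// Pack block number, "more blocks follow" flag, and SZX into a Block option value.
uint32_t EncodeBlockOption(uint32_t aNum, bool aMore, uint8_t aSzx)
{
    return (aNum << 4) | (aMore ? 0x08u : 0u) | aSzx;
}

// Number of blocks needed to carry an image of the given size (rounds up).
uint32_t BlockCount(uint32_t aImageSize, uint32_t aBlockSize)
{
    return (aImageSize + aBlockSize - 1) / aBlockSize;
}
```

For example, a chunk size of 1024 corresponds to SZX 6, and a hypothetical 262144-byte image would take `BlockCount(262144, 1024)` = 256 block exchanges per device.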
Yet another after 5 days:
And another one after 3 days. They all seem to be related to this line:
Please let me know if I can enable any additional debugging or logging options to figure out the root cause of this crash.
The error […] So the root cause is that the Thread host receives a malformed Spinel frame from the ESP32-C6. I think you can add code to HandleTransmitDone() that prints the raw data of the Spinel message, to better understand which field of the Spinel frame is wrong.
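A minimal sketch of such a dump helper, assuming the handler has the frame available as a byte pointer and a length. `HexDump` is a hypothetical name, not an existing OpenThread function, and the resulting string would still need to be passed to whatever logging call the build already uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a byte buffer as space-separated lowercase hex, e.g. "81 06 71".
// Hypothetical helper for illustration; not part of OpenThread.
std::string HexDump(const uint8_t *aBuffer, size_t aLength)
{
    std::string out;
    char        byteText[4];

    for (size_t i = 0; i < aLength; i++)
    {
        // Two hex digits per byte, zero-padded.
        std::snprintf(byteText, sizeof(byteText), "%02x", aBuffer[i]);
        if (i != 0)
        {
            out += ' ';
        }
        out += byteText;
    }
    return out;
}
```

Logging the string just before the frame is parsed means the bytes are captured even when parsing fails afterwards.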
I also encountered a similar problem. The platform I used was Silabs SoC + Raspberry Pi 4. Details are in the Silabs Community Discussion. The following is part of the log:
I'm seeing a similar problem with a Silabs based device. It's usually in the context of an MLE channel announcement. I added some message dumps to the spinel code (this function). 0001-st-add-logging-to-HandleTransmitDone.patch
The […] These failures are happening around 1-2 times a day on my setup, but I haven't been able to narrow down the exact failure in order to reproduce it. This is the corresponding code for the radio. @abtink, what data would you recommend logging on the radio? I tried printing the entire buffer, but it looks like it's 2048 bytes and I'm running into issues with RTT overwriting data.
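One possible way to keep a large dump from flooding a small log channel is to emit it as short, offset-prefixed lines and to cover only the bytes that are actually in use rather than the whole buffer. The sketch below shows the splitting logic only; `HexDumpLines` is a hypothetical name, and on the radio the same loop would more likely hand each line straight to the logger instead of collecting strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Split a buffer into lines of at most aBytesPerLine bytes, each prefixed
// with its starting offset in hex, e.g. "0010: 0a 0b 0c".
// Hypothetical helper for illustration; not part of any SDK.
std::vector<std::string> HexDumpLines(const uint8_t *aBuffer, size_t aLength, size_t aBytesPerLine)
{
    assert(aBytesPerLine > 0);

    std::vector<std::string> lines;
    char                     text[24];

    for (size_t offset = 0; offset < aLength; offset += aBytesPerLine)
    {
        std::snprintf(text, sizeof(text), "%04zx:", offset);
        std::string line(text);

        // Clamp the final line to the end of the valid data.
        size_t end = (offset + aBytesPerLine < aLength) ? offset + aBytesPerLine : aLength;

        for (size_t i = offset; i < end; i++)
        {
            std::snprintf(text, sizeof(text), " %02x", aBuffer[i]);
            line += text;
        }
        lines.push_back(line);
    }
    return lines;
}
```

Passing the frame's actual length instead of the buffer capacity keeps the output proportional to the frame, and the offsets make it possible to spot a line that the log channel dropped.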
Hi @jdswensen, my Silabs-based equipment also fails 1-2 times a day on average. With the default Silabs build, the device fails dozens of times a day, but after setting the log level to critical it fails only 1-2 times a day. In addition, the number of failures rises as the concurrency of the network increases.
@jdswensen, some suggestions/questions:
Some observations from the log snippet for things to be investigated: Here the Spinel frame requesting a "Tx" is sent to the RCP. It is sent with
Later we see this set of Spinel messages received from the RCP:
Describe the bug
The otbr-agent in the docker container crashes after around a week of uptime.
To Reproduce
openthread/otbr@sha256:78543c07c08650044a35e6e938ae787f0bf48b059dd8a82883a14e97ce5c86cd
Expected behavior
otbr-agent shouldn't crash ;) And even if it does, it should either restart itself or bring down the Docker container, so that Docker restarts the container automatically.
Console/log output
We have 4 Pis running the otbr Docker image, each with around 30 devices. Here are the logs, one from each Pi: