Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Matter never receives ack from Thread devices after operational discovery #33317

Open
fostergorman opened this issue May 6, 2024 · 5 comments
Labels
bug Something isn't working needs triage

Comments

@fostergorman
Copy link

fostergorman commented May 6, 2024

Reproduction steps

Hello, I am working on a PoC for a Matter hub and we want to have the hub also be a Thread Border Router. I have been working with this for a couple weeks now and have managed to get to operational discovery. Matter is able to connect to the device, but never receives an ack from the Thread devices.

This issue does not appear at all with WiFi devices.



Setup:

Raspberry Pi 4b with Ubuntu Server 23.10.
OTBR running in docker container
Commissioning with Matter (chip-tool)

chip-tool command executed:
chip-tool pairing code-thread 69 hex:0e08000000000001000035060004001fffe00708fd38de8fb4627cf00c0402a0f7f8051000112233445566778899aabbccddeeff030e4f70656e54687265616444656d6f0410445f2b5ca6f2a93a55ce570a70efeecb000300000f0208111111112222222201021234 05576556571 --paa-trust-store-path /opt/matter/certs



Debugging:

Currently able to get to the point where the Thread device is commissioned on the Thread network using Matter's chip-tool. Can confirm this by checking the child table within OTBR: ot-ctl child table.

I have verified that there are no errors on the OTBR side. The packets are received and sent just fine.

I see the operational discovery packets using avahi-browse -a.

+  otbr0 IPv6 063787A15C2B0395-0000000000000045             _matter._tcp         local
+  otbr0 IPv4 063787A15C2B0395-0000000000000045             _matter._tcp         local

chip-tool outputs an error, Timeout waiting for mDNS resolution, while on the FindOperational step:

[1714999513.967859][6179:6181] CHIP:CTL: Successfully finished commissioning step 'ThreadNetworkEnable'
[1714999513.967917][6179:6181] CHIP:CTL: Commissioning stage next step: 'ThreadNetworkEnable' -> 'FindOperational'
[1714999513.967978][6179:6181] CHIP:CTL: Performing next commissioning step 'FindOperational'
[1714999513.968027][6179:6181] CHIP:IN: Expiring all sessions for node <0000000000000045, 1>!!
[1714999513.968081][6179:6181] CHIP:CSM: FindOrEstablishSession: PeerId = [1:0000000000000045]
[1714999513.968130][6179:6181] CHIP:CSM: FindOrEstablishSession: No existing OperationalSessionSetup instance found
[1714999513.968199][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 1 --> 2
[1714999513.969669][6179:6181] CHIP:DMG: ICR moving to [AwaitingDe]
[1714999514.168146][6179:6181] CHIP:DIS: Checking node lookup status after 200 ms
[1714999515.439296][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.439771][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.441354][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.441658][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.441912][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.445875][6179:6181] CHIP:DIS: Lookup clearing interface for non LL address
[1714999515.446133][6179:6181] CHIP:DIS: UDP:[fd11:22::d012:4589:e76d:1f9e%otbr0]:5540: new best score: 3
[1714999515.446193][6179:6181] CHIP:DIS: Checking node lookup status after 1478 ms
[1714999515.446262][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: Updating device address to UDP:[fd11:22::d012:4589:e76d:1f9e]:5540 while in state 2
[1714999515.446317][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 2 --> 3
[1714999515.446930][6179:6181] CHIP:IN: SecureSession[0xffff8001a8f0]: Allocated Type:2 LSID:57955
[1714999515.447048][6179:6181] CHIP:SC: Initiating session on local FabricIndex 1 from 0x000000000001B669 -> 0x0000000000000045
[1714999515.449066][6179:6181] CHIP:EM: <<< [E:60042i S:0 M:121203353] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fd11:22::d012:4589:e76d:1f9e]:5540] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1714999515.449546][6179:6181] CHIP:SC: Sent Sigma1 msg
[1714999515.449635][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 3 --> 4
[1714999516.439920][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.440184][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.441701][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.441982][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.442226][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.545717][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999519.129858][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999519.130164][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999519.723194][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 1
[1714999521.713218][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999523.843283][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 2
[1714999524.200228][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999524.200499][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999524.326020][6179:6181] CHIP:DIS: Timeout waiting for mDNS resolution.
[1714999526.832197][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999529.318884][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999529.319227][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999529.976450][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 3
[1714999531.901893][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999534.437714][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999534.438058][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999537.020964][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999539.508301][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999539.508640][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999541.061654][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 4
[1714999542.090727][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999544.578227][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999544.578626][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999547.161008][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999549.696521][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999549.696854][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999552.279536][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999554.815223][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999554.815609][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999557.370905][6179:6181] CHIP:EM: Failed to Send CHIP MessageCounter:121203353 on exchange 60042i sendCount: 4 max retries: 4
[1714999557.447253][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999559.884884][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999559.885164][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999562.468218][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999563.960047][6179:6181] CHIP:SC: CASESession timed out while waiting for a response from the peer. Current state was 1
[1714999563.960149][6179:6181] CHIP:IN: SecureSession[0xffff8001a8f0]: Released - Type:2 LSID:57955
[1714999563.960250][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 4 --> 2
[1714999563.960324][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 2 --> 1
[1714999563.960402][6179:6181] CHIP:DIS: OperationalSessionSetup:attempts done: 1, attempts left: 2, retry delay 1s, status src/system/SystemLayerImplSelect.cpp:214: Success
[1714999563.960456][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 1 --> 6
[1714999563.960536][6179:6181] CHIP:CTL: Session establishment failed for <0000000000000045, 1>, error: src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout.  Next retry expected to get a response to Sigma1 or fail within 8 seconds

TLDR: Matter fails operational discovery of Thread devices even though operational discovery, _matter._tcp, is shown through avahi.

Bug prevalence

Commissioning any Thread device

GitHub hash of the SDK that was being used

f52860e

Platform

raspi

Platform Version(s)

1.2

Anything else?

No response

@fostergorman fostergorman added bug Something isn't working needs triage labels May 6, 2024
@fostergorman fostergorman changed the title [BUG] Failed operational discovery of Thread device [BUG] Failed operational discovery of Thread devices May 6, 2024
@fostergorman
Copy link
Author

As an update to this, I found something that is off. During operational discovery, the name used for lookup within Matter is not the name that is publishing the operational discovery seen in avahi.

Matter lookup: 66D5DE6615F10EFA-0000000000000045
Name from avahi: 063787A15C2B0395-0000000000000045

I would imagine that if Matter is doing a lookup on the wrong name, it will never find it, so this seems to be the crux of the issue. I did verify that the IPv6 address Matter is using is the correct one of the device on the Thread network.

@bzbarsky-apple
Copy link
Contributor

@fostergorman The log above shows the operational discovery succeeding just fine:

[1714999515.446262][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: Updating device address to UDP:[fd11:22::d012:4589:e76d:1f9e]:5540 while in state 2
[1714999515.446317][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 2 --> 3
[1714999515.446930][6179:6181] CHIP:IN: SecureSession[0xffff8001a8f0]: Allocated Type:2 LSID:57955
[1714999515.447048][6179:6181] CHIP:SC: Initiating session on local FabricIndex 1 from 0x000000000001B669 -> 0x0000000000000045
[1714999515.449066][6179:6181] CHIP:EM: <<< [E:60042i S:0 M:121203353] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fd11:22::d012:4589:e76d:1f9e]:5540] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1714999515.449546][6179:6181] CHIP:SC: Sent Sigma1 msg

But then we never get a response or an ack for that message:

[1714999519.723194][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 1
[1714999523.843283][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 2
[1714999529.976450][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 3
[1714999541.061654][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 4
[1714999557.370905][6179:6181] CHIP:EM: Failed to Send CHIP MessageCounter:121203353 on exchange 60042i sendCount: 4 max retries: 4
[1714999563.960047][6179:6181] CHIP:SC: CASESession timed out while waiting for a response from the peer. Current state was 1

So:

  1. Is that the right IP that's being discovered?
  2. If it is, where are the packets being lost?

@fostergorman
Copy link
Author

Thank you for your response.

  1. I have verified that the IP that Matter is using is the correct IP through OTBR.
  2. Within the OTBR logs, I'm seeing communication with the thread end device, and no errors indicating that communication has failed.

@bzbarsky-apple
Copy link
Contributor

Well, clearly the Matter device doing the commissioning never got the response to Sigma1. So you should figure out where that response packet is being dropped in your specific network setup. Is it being sent by the Thread end device? If not, start there. If it is, then what happened to it?

@bzbarsky-apple
Copy link
Contributor

(And please fix the PR summary, since operational discovery worked just fine.)

@fostergorman fostergorman changed the title [BUG] Failed operational discovery of Thread devices [BUG] Matter never receives ack from Thread devices after operational discovery May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage
Projects
None yet
Development

No branches or pull requests

2 participants