New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
samples: bluetooth: central_iso: fix synchronization of data submission #72622
base: main
Are you sure you want to change the base?
samples: bluetooth: central_iso: fix synchronization of data submission #72622
Conversation
fb95936
to
916630e
Compare
d269363
to
84989c5
Compare
84989c5
to
723196f
Compare
Update: This approach is too brittle. I tested it under different circumstances and it turns out that my initial assumption was wrong that the connected event would be a stable anchor for synchronization. I'll have to implement the timstamp-based approach right away. (Therefore converted back to Draft.) |
Introduces reversed xmas-tree format to local variables before making changes to them for better readability of subsequent commits. Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
The change replaces magic numbers in the ISO CIG configuration by preprocessor definitions. Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
Replaces static constant variables by preprocessor definitions for improved readability and (minor) RAM usage improvement. Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
To make the error more obvious we start with larger packets and count downwards. This makes it easier to identify the moment in time when transmission stops (e.g. using Sniffing, Tracing or Power Profiling). Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
This commit fixes the actual race condition: * The non-deterministic deferred work approach is replaced by a call to bt_iso_chan_get_tx_sync(), see Bluetooth Core Specification v5.4, Vol 4, Part E, 7.8.96 LE Read ISO TX Sync command. * All but the first SDU will now be sent through bt_iso_chan_send_ts(). * The timer is replaced by a sent() callback. As this callback already executes from the system work queue and the code is non-blocking, there is no need to schedule our own work. Sending SDUs can now be done synchronously which considerably simplifies the code. Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
Replace global variables by local variables to improve encapsulation. Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
723196f
to
12f1ef9
Compare
Done. Known issue (cf Discord), IMO to be fixed in a separate PR:
If you don't like the surrounding changes (refactoring, reversal of payload length, etc.) just let me know and I'll remove this unrelated stuff. |
struct bt_iso_tx_info iso_tx_info = {0}; | ||
int err; | ||
|
||
err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fgrandel , thanks for spending time on improving this sample!
I would recommend to not use timestamps when providing data in this sample.
The increase in sequence number gives the controller enough information about when it is to be scheduled on air. See https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrfxlib/softdevice_controller/doc/isochronous_channels.html#providing-data for more information on how the Nordic SoftDevice Controller processing incoming HCI ISO Data.
When using timestamps it is possible to provide the data to the controller right before it is going to be sent on air.
Unfortunately, this timing information cannot be obtained in a controller-agnostic way:
The timestamp returned from the HCI LE Read ISO TX Sync command is ambiguous, the returned value depends on how the controller implementation interprets the spec. Also, the command documentation refers to CIG/BIG reference anchor points which may not align with the SDU generation if ISO_Interval != SDU interval (up to the controller). These things make it hard to rely on this return value.
See the open spec errata https://bluetooth.atlassian.net/browse/ES-23138 for more details.
Another note: The timing of the iso_sent
is also implementation dependent. It is raised when the controller raises the HCI Number of Completed Packets event. This event may be raised when the controller is ready to receive more data. This may be the case when the payload is scheduled for transmission. That is, even before it is sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR; I see why the sent-callback + HCI LE Read ISO TX Sync + timestamp approach is lacking and depends on controller implementation details. But IMO it is still the only approach that works at all to reliably synchronize ISO channels for the moment being.
Arguments following:
Thanks for the link. Unfortunately I don't have permission to access it - guess that's only for SIG members.
The SDU vs. ISO interval argument is convincing - however as long as all our options are to some degree depending on the controller implementation, I'd still prefer the approach recommended by the core spec. By the time we'll implement more controllers, I guess the spec will have been updated and the errata removed in a somehow backwards compatible way, so we can move on from there.
In Zephyr we can and should decouple from implementation details by wrapping the controller specific parts behind more usable high-level APIs in the meantime. This would also improve developer experience by protecting them from the hacky and error-prone synchronization details. Plus it guarantees a migration path.
The dependence of iso_sent on the controller implementation is also a good argument. But we still need to provide developers with a working solution until the spec does so. That's AFAICS, why Nordic recommends the HCI LE Read ISO TX Sync + timestamp solution for the moment being and uses the send callback as a trigger. Also see the recommended approach in the link you provided:
This is the preferred way of providing data to the SoftDevice Controller and guarantees the highest degree of control.
Re sent-callback latency: Introducing a delay of one ISO interval w/o technical need is not a good way of interpreting the slack in the spec IMO. We should at least provide best-effort latency as far as technically possible - even more as it seems to be the only workable solution for deterministic synchronization for the time being. So the controller really should deliver some sufficient degree of latency guarantees beyond what is required by the spec for this specific callback/HCI command until the spec provides a workable solution itself.
The seq-number and "time-of-arrival" approaches don't work unless you admit at least two ISO intervals of extra latency to avoid systematic package loss (as demonstrated by the bug I'm fixing here). This I find inacceptable because it counters the concept of ISO channels with unframed PDUs and deterministic latency. Also see the latency promises of the CIG config API in Zephyr which will not be met. You need to add at least two sequence numbers all the time to ensure that:
- the initially (but non-deterministically) skipped event gets accounted for.
- source-to-CIG anchor point offset doesn't play a role even if you happen to synchronize too closely to the next anchor point so that the controller will pick up your SDUs too late.
In addition the "time-of-arrival" approach is not consistent with the spec AFAICS as flushing cannot be controlled deterministically by the application. It's the point of the ISO protocol that you can synchronize your source with the sink with deterministic latency and time-to-live of data packets.
On the peripheral both approaches don't work at all because neither the time-of-arrival nor the seq number approach are able to account for drift of the peripheral's clock. This cannot work by design if you want well-synchronized data across devices. This also explains why the (local) timer approach does systematically not work.
In our case for example we do decentralized measurements on peripherals in sync with the central's distributed clock across several devices and do explicitly want those measurements be sent out in the same CIG event with deterministic latency so we don't need timestamps and get cheap syntonization of all peripherals to the master clock w/o having to implement our own syntonization approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the peripheral both approaches don't work at all because neither the time-of-arrival nor the seq number approach are able to account for drift of the peripheral's clock. This cannot work by design if you want well-synchronized data across devices. This also explains why the (local) timer approach does systematically not work.
Fun fact: An ISO peripheral from BT Core spec 5.4 is not aware of the SDU interval and can actually never transmit correctly (reliably) 8)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment for the use of bt_iso_chan_get_tx_sync
: When we use it like this here, we can use the current response to compare with the previous response. IIRC (and please correct me @rugeGerritsen) the return values of bt_iso_chan_get_tx_sync
are only when an SDUs have been sent (or scheduled for transmission). In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.
e.g.
- App send SDU_1 with TS_1 and PSN_1
- App gets
sent
callback and gets TS_1 and PSN_1 frombt_iso_chan_get_tx_sync
- App sends SDU_2 wit TS_1 and PSN_2 (or TS_2 and PSN_1)
- Controller reject this as invalid (which not all controllers may do), but still triggers a Number of Completed Packets event
- App gets
sent
callback and gets TS_1 and PSN_1 frombt_iso_chan_get_tx_sync
(since it was not transmitted or scheduled for transmission) - App is now aware that the previous TX went wrong and could attempt to fix by increasing PSN or TS accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.
Only when the Controller has received a SDU from Host, the returned Sequence Number and Timestamp correspond to the last transmitted or scheduled SDU are calculated.
- If
bt_iso_chan_get_tx_sync
is called multiple times without a SDU enqueued, it will return same Sequence Number and Timestamp corresponding to the instant the previousbt_iso_chan_send
was called. - Every
bt_iso_chan_get_tx_sync
afterbt_iso_chan_send
call will return the Sequence Number and Timestamp of the last transmitted or scheduled SDU. Hence, for the consecutive call tobt_iso_chan_send
, the next Sequence Number (returned value plus one) and next Timestamp (returned value plus one SDU interval) be provided tobt_iso_chan_send_ts
. - If
bt_iso_chan_get_tx_sync
followed bybt_iso_chan_send_ts
is called in the sent-callback, then the returned Sequence Number and Timestamp could be stale depending on the SDU intervals elapsed from the last call ofbt_iso_chan_send_ts
to when the current sent-callback executes. If sufficient SDU buffers into the future has been already provided to the Controller, thenbt_iso_chan_get_tx_sync
will continue to return future Sequence Number and Timestamp ensuring new SDU enqueued to the Controller are not stale.
Zephyr Controller follows a similar design (implementation choice) to ACL Sending Data for Sending ISO data too, refer to Bluetooth Core Specification Version 5.4, Vol 6, Part D, Section 6.1 Sending Data. In this design, the Number of Completed Packets is generated when corresponding buffer space is freed when acknowledged and/or flushed (in the case of CIS in the Zephyr Controller). But in the implementation of ISO ack/flush, the Zephyr Controller implementation reuses the code (control path) that is determining stale PDUs hence the Number of Completed Packets is delayed by one ISO Interval. An optimization is good to have, but not necessary, as the Specification states (Bluetooth Core Specification Version 5.4, Vol 4, Part E, Section 7.7.19 Number Of Completed Packets event): Hence, an application should not rely on the rate at which the sent callback is called. Application will need to ensure there is enough buffers available/configured in the Controller to be able to provide the ISO data with corresponding sequence numbers for the future SDU intervals. Zephyr Controller also needs that flush timeout is considered when determining the buffer number, as the implementation does not generate Number of Complete Packets until they are flushed under bad transmissions and NACK conditions. i.e for FT=2, Zephyr Controller needs 5 HCI ISO Data buffers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As others have pointed out, proper synchronization of ISO data over HCI as a generic host (i.e. not knowing what controller we use) is pretty hard (actually impossible, but can be done "good enough").
The use and requirements of both sequence numbers and timestamps are pretty vague / implementation specific in the core spec as per 5.4 unfortunately. Hopefully this will be improved in the future.
Some comments that need to be addressed
*/ | ||
static void iso_timer_timeout(struct k_work *work) | ||
static void iso_send_sdu(uint32_t seq_num, int32_t next_sdu_ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static void iso_send_sdu(uint32_t seq_num, int32_t next_sdu_ts) | |
static void iso_send_sdu(uint32_t seq_num, uint32_t next_sdu_ts, bool is_first_sdu) |
I think this will make more sense. There's not really any reason to encode the seq_num and "is_first" into the same variable, and since -1
was only called one place this seems like a better solution to me.
Furthermore, the timestamp is a uint32_t
, so for this to have worked the variable should otherwise have been a int64_t
.
static uint8_t buf_data[CONFIG_BT_ISO_TX_MTU]; | ||
static bool data_initialized; | ||
static size_t len_to_send = ARRAY_SIZE(buf_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static size_t len_to_send = ARRAY_SIZE(buf_data); | |
static const size_t len_to_send = ARRAY_SIZE(buf_data); |
.connected = iso_connected, | ||
.sent = iso_sent, | ||
.disconnected = iso_disconnected, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.connected = iso_connected, | |
.sent = iso_sent, | |
.disconnected = iso_disconnected, | |
.connected = iso_connected, | |
.disconnected = iso_disconnected, | |
.sent = iso_sent, |
or
.connected = iso_connected, | |
.sent = iso_sent, | |
.disconnected = iso_disconnected, | |
.connected = iso_connected, | |
.disconnected = iso_disconnected, | |
.sent = iso_sent, |
Personal opinion, but it kind of stood out with sent
in the middle of the connected and disconnected callbacks, and also not using the same indentation scheme :D
Agreed. But there are good arguments why it should be improved regardless of what the spec allows (see my response to @rugeGerritsen above). This is required to cover up for weaknesses in the spec in other places to achieve a workable solution at all. Guess you agree that it is also not against the spec to improve latency on the sent-callback. Gaining one ISO interval should provide us with enough leeway so that the sent-callback latency is deterministic enough under all practical circumstances to synchronize ISO channels even if HCI works over reasonable remote links. Can you agree with that statement? |
struct bt_iso_tx_info iso_tx_info = {0}; | ||
int err; | ||
|
||
err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment for the use of bt_iso_chan_get_tx_sync
: When we use it like this here, we can use the current response to compare with the previous response. IIRC (and please correct me @rugeGerritsen) the return values of bt_iso_chan_get_tx_sync
are only when an SDUs have been sent (or scheduled for transmission). In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.
e.g.
- App send SDU_1 with TS_1 and PSN_1
- App gets
sent
callback and gets TS_1 and PSN_1 frombt_iso_chan_get_tx_sync
- App sends SDU_2 wit TS_1 and PSN_2 (or TS_2 and PSN_1)
- Controller reject this as invalid (which not all controllers may do), but still triggers a Number of Completed Packets event
- App gets
sent
callback and gets TS_1 and PSN_1 frombt_iso_chan_get_tx_sync
(since it was not transmitted or scheduled for transmission) - App is now aware that the previous TX went wrong and could attempt to fix by increasing PSN or TS accordingly
struct bt_iso_tx_info iso_tx_info = {0}; | ||
int err; | ||
|
||
err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.
Only when the Controller has received a SDU from Host, the returned Sequence Number and Timestamp correspond to the last transmitted or scheduled SDU are calculated.
- If
bt_iso_chan_get_tx_sync
is called multiple times without a SDU enqueued, it will return same Sequence Number and Timestamp corresponding to the instant the previousbt_iso_chan_send
was called. - Every
bt_iso_chan_get_tx_sync
afterbt_iso_chan_send
call will return the Sequence Number and Timestamp of the last transmitted or scheduled SDU. Hence, for the consecutive call tobt_iso_chan_send
, the next Sequence Number (returned value plus one) and next Timestamp (returned value plus one SDU interval) be provided tobt_iso_chan_send_ts
. - If
bt_iso_chan_get_tx_sync
followed bybt_iso_chan_send_ts
is called in the sent-callback, then the returned Sequence Number and Timestamp could be stale depending on the SDU intervals elapsed from the last call ofbt_iso_chan_send_ts
to when the current sent-callback executes. If sufficient SDU buffers into the future has been already provided to the Controller, thenbt_iso_chan_get_tx_sync
will continue to return future Sequence Number and Timestamp ensuring new SDU enqueued to the Controller are not stale.
} | ||
|
||
iso_send_sdu(iso_tx_info.seq_num + 1, iso_tx_info.ts + INTERVAL_US); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iso_send_sdu(iso_tx_info.seq_num + 1, iso_tx_info.ts + INTERVAL_US); | |
iso_send_sdu(iso_tx_info.seq_num + 1U, iso_tx_info.ts + INTERVAL_US); |
if (unlikely(is_first_sdu)) { | ||
ret = bt_iso_chan_send(&iso_chan, buf, seq_num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If is_first_sdu
, call bt_iso_chan_send
once then followed by call to bt_iso_chan_get_tx_sync
and bt_iso_chan_send_ts
consecutively ((iso_tx.rtn + 1U) * 2U)
times, so as to ensure the Controller has sufficient SDUs buffered before sent-callback under the case of retransmissions.
Give this a try under noisy conditions or when devices are far away. Let me know if the Sequence Number and Timestamp get back in sync when the radio condition get better, when devices move back close.
@rugeGerritsen @cvinayak a quick message to thank you for your reviews. I've just not been able to implement it yet, because I'm short on time. But will do. |
This PR makes the submission of data to the ISO channel synchronous to the BLE controller's clock.
This is an intermediate improvement.This fix uses timestamps, compare https://github.com/nrfconnect/sdk-nrf/blob/main/samples/bluetooth/iso_time_sync/src/iso_tx.c based on the "LE Read ISO TX Sync" HCI command, see Bluetooth Core Spec 5.4, Vol 4 Part E, 7.8.96).This will be done in a subsequent PR which could then also demo sending ISO SDUs from peripherals to the central. This PR is a good baseline for such a change as it already uses a timer in the correct way, just the actual timestamp calculation will have to be adapted.Update: It was too hard to get the phase right with the timer approach, therefore the HCI approach was chosen right away.
Fixes #72452