Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samples: bluetooth: central_iso: fix synchronization of data submission #72622

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

fgrandel
Copy link
Contributor

@fgrandel fgrandel commented May 11, 2024

This PR makes the submission of data to the ISO channel synchronous to the BLE controller's clock.

This is an intermediate improvement. This fix uses timestamps, compare https://github.com/nrfconnect/sdk-nrf/blob/main/samples/bluetooth/iso_time_sync/src/iso_tx.c based on the "LE Read ISO TX Sync" HCI command, see Bluetooth Core Spec 5.4, Vol 4 Part E, 7.8.96). This will be done in a subsequent PR which could then also demo sending ISO SDUs from peripherals to the central. This PR is a good baseline for such a change as it already uses a timer in the correct way, just the actual timestamp calculation will have to be adapted.

Update: It was too hard to get the phase right with the timer approach, therefore the HCI approach was chosen right away.

Fixes #72452

@fgrandel fgrandel added platform: nRF Nordic nRFx area: Bluetooth ISO Bluetooth LE Isochronous Channels labels May 11, 2024
@fgrandel fgrandel requested review from Thalley and cvinayak May 11, 2024 15:36
@fgrandel fgrandel self-assigned this May 11, 2024
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from fb95936 to 916630e Compare May 11, 2024 15:52
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch 2 times, most recently from d269363 to 84989c5 Compare May 11, 2024 16:35
@fgrandel fgrandel marked this pull request as ready for review May 11, 2024 16:36
@zephyrbot zephyrbot added the area: Samples Samples label May 11, 2024
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from 84989c5 to 723196f Compare May 11, 2024 16:43
@fgrandel fgrandel marked this pull request as draft May 12, 2024 08:08
@fgrandel
Copy link
Contributor Author

Update: This approach is too brittle. I tested it under different circumstances and it turns out that my initial assumption was wrong that the connected event would be a stable anchor for synchronization. I'll have to implement the timstamp-based approach right away. (Therefore converted back to Draft.)

Florian Grandel added 6 commits May 12, 2024 12:57
Introduces reversed xmas-tree format to local variables before making
changes to them for better readability of subsequent commits.

Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
The change replaces magic numbers in the ISO CIG configuration by
preprocessor definitions.

Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
Replaces static constant variables by preprocessor definitions for
improved readability and (minor) RAM usage improvement.

Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
To make the error more obvious we start with larger packets and count
downwards. This makes it easier to identify the moment in time when
transmission stops (e.g. using Sniffing, Tracing or Power Profiling).

Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
This commit fixes the actual race condition:

 * The non-deterministic deferred work approach is replaced by a call to
   bt_iso_chan_get_tx_sync(), see Bluetooth Core Specification v5.4,
   Vol 4, Part E, 7.8.96 LE Read ISO TX Sync command.

 * All but the first SDU will now be sent through bt_iso_chan_send_ts().

 * The timer is replaced by a sent() callback. As this callback already
   executes from the system work queue and the code is non-blocking,
   there is no need to schedule our own work. Sending SDUs can now be
   done synchronously which considerably simplifies the code.

Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
Replace global variables by local variables to improve encapsulation.

Signed-off-by: Florian Grandel <florian.grandel@code-for-humans.de>
@fgrandel fgrandel force-pushed the fix/72452-ble-iso-sample-sync branch from 723196f to 12f1ef9 Compare May 12, 2024 15:43
@fgrandel
Copy link
Contributor Author

fgrandel commented May 12, 2024

Done.

Known issue (cf Discord), IMO to be fixed in a separate PR:

Just wondering why the peripheral's ACK does not trigger the sent-callback immediately. In the example almost a full ISO interval (~10 ms) elapses before the callback is called. This leaves very little time to prepare and send the next SDU. IMHO it would be nicer if one had time to prepare the next SDU triggered by the sent-callback. Is this hard to fix? I saw you already have a TODO for early ACK in the code.

If you don't like the surrounding changes (refactoring, reversal of payload length, etc.) just let me know and I'll remove this unrelated stuff.

@fgrandel fgrandel marked this pull request as ready for review May 12, 2024 15:49
struct bt_iso_tx_info iso_tx_info = {0};
int err;

err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fgrandel , thanks for spending time on improving this sample!
I would recommend to not use timestamps when providing data in this sample.
The increase in sequence number gives the controller enough information about when it is to be scheduled on air. See https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrfxlib/softdevice_controller/doc/isochronous_channels.html#providing-data for more information on how the Nordic SoftDevice Controller processing incoming HCI ISO Data.

When using timestamps it is possible to provide the data to the controller right before it is going to be sent on air.
Unfortunately, this timing information cannot be obtained in a controller-agnostic way:
The timestamp returned from the HCI LE Read ISO TX Sync command is ambiguous, the returned value depends on how the controller implementation interprets the spec. Also, the command documentation refers to CIG/BIG reference anchor points which may not align with the SDU generation if ISO_Interval != SDU interval (up to the controller). These things make it hard to rely on this return value.
See the open spec errata https://bluetooth.atlassian.net/browse/ES-23138 for more details.

Another note: The timing of the iso_sent is also implementation dependent. It is raised when the controller raises the HCI Number of Completed Packets event. This event may be raised when the controller is ready to receive more data. This may be the case when the payload is scheduled for transmission. That is, even before it is sent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR; I see why the sent-callback + HCI LE Read ISO TX Sync + timestamp approach is lacking and depends on controller implementation details. But IMO it is still the only approach that works at all to reliably synchronize ISO channels for the moment being.

Arguments following:

Thanks for the link. Unfortunately I don't have permission to access it - guess that's only for SIG members.

The SDU vs. ISO interval argument is convincing - however as long as all our options are to some degree depending on the controller implementation, I'd still prefer the approach recommended by the core spec. By the time we'll implement more controllers, I guess the spec will have been updated and the errata removed in a somehow backwards compatible way, so we can move on from there.

In Zephyr we can and should decouple from implementation details by wrapping the controller specific parts behind more usable high-level APIs in the meantime. This would also improve developer experience by protecting them from the hacky and error-prone synchronization details. Plus it guarantees a migration path.

The dependence of iso_sent on the controller implementation is also a good argument. But we still need to provide developers with a working solution until the spec does so. That's AFAICS, why Nordic recommends the HCI LE Read ISO TX Sync + timestamp solution for the moment being and uses the send callback as a trigger. Also see the recommended approach in the link you provided:

This is the preferred way of providing data to the SoftDevice Controller and guarantees the highest degree of control.

Re sent-callback latency: Introducing a delay of one ISO interval w/o technical need is not a good way of interpreting the slack in the spec IMO. We should at least provide best-effort latency as far as technically possible - even more as it seems to be the only workable solution for deterministic synchronization for the time being. So the controller really should deliver some sufficient degree of latency guarantees beyond what is required by the spec for this specific callback/HCI command until the spec provides a workable solution itself.

The seq-number and "time-of-arrival" approaches don't work unless you admit at least two ISO intervals of extra latency to avoid systematic package loss (as demonstrated by the bug I'm fixing here). This I find inacceptable because it counters the concept of ISO channels with unframed PDUs and deterministic latency. Also see the latency promises of the CIG config API in Zephyr which will not be met. You need to add at least two sequence numbers all the time to ensure that:

  1. the initially (but non-deterministically) skipped event gets accounted for.
  2. source-to-CIG anchor point offset doesn't play a role even if you happen to synchronize too closely to the next anchor point so that the controller will pick up your SDUs too late.

In addition the "time-of-arrival" approach is not consistent with the spec AFAICS as flushing cannot be controlled deterministically by the application. It's the point of the ISO protocol that you can synchronize your source with the sink with deterministic latency and time-to-live of data packets.

On the peripheral both approaches don't work at all because neither the time-of-arrival nor the seq number approach are able to account for drift of the peripheral's clock. This cannot work by design if you want well-synchronized data across devices. This also explains why the (local) timer approach does systematically not work.

In our case for example we do decentralized measurements on peripherals in sync with the central's distributed clock across several devices and do explicitly want those measurements be sent out in the same CIG event with deterministic latency so we don't need timestamps and get cheap syntonization of all peripherals to the master clock w/o having to implement our own syntonization approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the peripheral both approaches don't work at all because neither the time-of-arrival nor the seq number approach are able to account for drift of the peripheral's clock. This cannot work by design if you want well-synchronized data across devices. This also explains why the (local) timer approach does systematically not work.

Fun fact: An ISO peripheral from BT Core spec 5.4 is not aware of the SDU interval and can actually never transmit correctly (reliably) 8)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment for the use of bt_iso_chan_get_tx_sync: When we use it like this here, we can use the current response to compare with the previous response. IIRC (and please correct me @rugeGerritsen) the return values of bt_iso_chan_get_tx_sync are only when an SDUs have been sent (or scheduled for transmission). In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

e.g.

  1. App send SDU_1 with TS_1 and PSN_1
  2. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync
  3. App sends SDU_2 wit TS_1 and PSN_2 (or TS_2 and PSN_1)
  4. Controller reject this as invalid (which not all controllers may do), but still triggers a Number of Completed Packets event
  5. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync (since it was not transmitted or scheduled for transmission)
  6. App is now aware that the previous TX went wrong and could attempt to fix by increasing PSN or TS accordingly

Copy link
Contributor

@cvinayak cvinayak May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

Only when the Controller has received a SDU from Host, the returned Sequence Number and Timestamp correspond to the last transmitted or scheduled SDU are calculated.

  1. If bt_iso_chan_get_tx_sync is called multiple times without a SDU enqueued, it will return same Sequence Number and Timestamp corresponding to the instant the previous bt_iso_chan_send was called.
  2. Every bt_iso_chan_get_tx_sync after bt_iso_chan_send call will return the Sequence Number and Timestamp of the last transmitted or scheduled SDU. Hence, for the consecutive call to bt_iso_chan_send, the next Sequence Number (returned value plus one) and next Timestamp (returned value plus one SDU interval) be provided to bt_iso_chan_send_ts.
  3. If bt_iso_chan_get_tx_sync followed by bt_iso_chan_send_ts is called in the sent-callback, then the returned Sequence Number and Timestamp could be stale depending on the SDU intervals elapsed from the last call of bt_iso_chan_send_ts to when the current sent-callback executes. If sufficient SDU buffers into the future has been already provided to the Controller, then bt_iso_chan_get_tx_sync will continue to return future Sequence Number and Timestamp ensuring new SDU enqueued to the Controller are not stale.

@cvinayak
Copy link
Contributor

cvinayak commented May 13, 2024

Done.

Known issue (cf Discord), IMO to be fixed in a separate PR:

Just wondering why the peripheral's ACK does not trigger the sent-callback immediately. In the example almost a full ISO interval (~10 ms) elapses before the callback is called. This leaves very little time to prepare and send the next SDU. IMHO it would be nicer if one had time to prepare the next SDU triggered by the sent-callback. Is this hard to fix? I saw you already have a TODO for early ACK in the code.

If you don't like the surrounding changes (refactoring, reversal of payload length, etc.) just let me know and I'll remove this unrelated stuff.

Zephyr Controller follows a similar design (implementation choice) to ACL Sending Data for Sending ISO data too, refer to Bluetooth Core Specification Version 5.4, Vol 6, Part D, Section 6.1 Sending Data. In this design, the Number of Completed Packets is generated when corresponding buffer space is freed when acknowledged and/or flushed (in the case of CIS in the Zephyr Controller). But in the implementation of ISO ack/flush, the Zephyr Controller implementation reuses the code (control path) that is determining stale PDUs hence the Number of Completed Packets is delayed by one ISO Interval. An optimization is good to have, but not necessary, as the Specification states (Bluetooth Core Specification Version 5.4, Vol 4, Part E, Section 7.7.19 Number Of Completed Packets event):
While the Controller has HCI Data packets or HCI ISO Data packets in its buffer, it shall keep sending the HCI_Number_Of_Completed_Packets event to the Host at least periodically, until it finally reports that all the pending packets have been completed. The rate with which this event is sent is manufacturer specific.

Hence, an application should not rely on the rate at which the sent callback is called. Application will need to ensure there is enough buffers available/configured in the Controller to be able to provide the ISO data with corresponding sequence numbers for the future SDU intervals.

Zephyr Controller also needs that flush timeout is considered when determining the buffer number, as the implementation does not generate Number of Complete Packets until they are flushed under bad transmissions and NACK conditions. i.e for FT=2, Zephyr Controller needs 5 HCI ISO Data buffers.

Copy link
Collaborator

@Thalley Thalley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As others have pointed out, proper synchronization of ISO data over HCI as a generic host (i.e. not knowing what controller we use) is pretty hard (actually impossible, but can be done "good enough").

The use and requirements of both sequence numbers and timestamps are pretty vague / implementation specific in the core spec as per 5.4 unfortunately. Hopefully this will be improved in the future.

Some comments that need to be addressed

*/
static void iso_timer_timeout(struct k_work *work)
static void iso_send_sdu(uint32_t seq_num, int32_t next_sdu_ts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static void iso_send_sdu(uint32_t seq_num, int32_t next_sdu_ts)
static void iso_send_sdu(uint32_t seq_num, uint32_t next_sdu_ts, bool is_first_sdu)

I think this will make more sense. There's not really any reason to encode the seq_num and "is_first" into the same variable, and since -1 was only called one place this seems like a better solution to me.

Furthermore, the timestamp is a uint32_t, so for this to have worked the variable should otherwise have been a int64_t.

static uint8_t buf_data[CONFIG_BT_ISO_TX_MTU];
static bool data_initialized;
static size_t len_to_send = ARRAY_SIZE(buf_data);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static size_t len_to_send = ARRAY_SIZE(buf_data);
static const size_t len_to_send = ARRAY_SIZE(buf_data);

samples/bluetooth/central_iso/src/main.c Show resolved Hide resolved
Comment on lines 156 to 158
.connected = iso_connected,
.sent = iso_sent,
.disconnected = iso_disconnected,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.connected = iso_connected,
.sent = iso_sent,
.disconnected = iso_disconnected,
.connected = iso_connected,
.disconnected = iso_disconnected,
.sent = iso_sent,

or

Suggested change
.connected = iso_connected,
.sent = iso_sent,
.disconnected = iso_disconnected,
.connected = iso_connected,
.disconnected = iso_disconnected,
.sent = iso_sent,

Personal opinion, but it kind of stood out with sent in the middle of the connected and disconnected callbacks, and also not using the same indentation scheme :D

@fgrandel
Copy link
Contributor Author

An optimization is good to have, but not necessary, as the Specification states (Bluetooth Core Specification Version 5.4, Vol 4, Part E, Section 7.7.19 Number Of Completed Packets event):

Agreed. But there are good arguments why it should be improved regardless of what the spec allows (see my response to @rugeGerritsen above). This is required to cover up for weaknesses in the spec in other places to achieve a workable solution at all. Guess you agree that it is also not against the spec to improve latency on the sent-callback. Gaining one ISO interval should provide us with enough leeway so that the sent-callback latency is deterministic enough under all practical circumstances to synchronize ISO channels even if HCI works over reasonable remote links. Can you agree with that statement?

samples/bluetooth/central_iso/src/main.c Show resolved Hide resolved
struct bt_iso_tx_info iso_tx_info = {0};
int err;

err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment for the use of bt_iso_chan_get_tx_sync: When we use it like this here, we can use the current response to compare with the previous response. IIRC (and please correct me @rugeGerritsen) the return values of bt_iso_chan_get_tx_sync are only when an SDUs have been sent (or scheduled for transmission). In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

e.g.

  1. App send SDU_1 with TS_1 and PSN_1
  2. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync
  3. App sends SDU_2 wit TS_1 and PSN_2 (or TS_2 and PSN_1)
  4. Controller reject this as invalid (which not all controllers may do), but still triggers a Number of Completed Packets event
  5. App gets sent callback and gets TS_1 and PSN_1 from bt_iso_chan_get_tx_sync (since it was not transmitted or scheduled for transmission)
  6. App is now aware that the previous TX went wrong and could attempt to fix by increasing PSN or TS accordingly

samples/bluetooth/central_iso/src/main.c Show resolved Hide resolved
struct bt_iso_tx_info iso_tx_info = {0};
int err;

err = bt_iso_chan_get_tx_sync(chan, &iso_tx_info);
Copy link
Contributor

@cvinayak cvinayak May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that the timestamp is too low or the sequence number is our of order, the return value of the current and previous should be the same.

Only when the Controller has received a SDU from Host, the returned Sequence Number and Timestamp correspond to the last transmitted or scheduled SDU are calculated.

  1. If bt_iso_chan_get_tx_sync is called multiple times without a SDU enqueued, it will return same Sequence Number and Timestamp corresponding to the instant the previous bt_iso_chan_send was called.
  2. Every bt_iso_chan_get_tx_sync after bt_iso_chan_send call will return the Sequence Number and Timestamp of the last transmitted or scheduled SDU. Hence, for the consecutive call to bt_iso_chan_send, the next Sequence Number (returned value plus one) and next Timestamp (returned value plus one SDU interval) be provided to bt_iso_chan_send_ts.
  3. If bt_iso_chan_get_tx_sync followed by bt_iso_chan_send_ts is called in the sent-callback, then the returned Sequence Number and Timestamp could be stale depending on the SDU intervals elapsed from the last call of bt_iso_chan_send_ts to when the current sent-callback executes. If sufficient SDU buffers into the future has been already provided to the Controller, then bt_iso_chan_get_tx_sync will continue to return future Sequence Number and Timestamp ensuring new SDU enqueued to the Controller are not stale.

}

iso_send_sdu(iso_tx_info.seq_num + 1, iso_tx_info.ts + INTERVAL_US);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
iso_send_sdu(iso_tx_info.seq_num + 1, iso_tx_info.ts + INTERVAL_US);
iso_send_sdu(iso_tx_info.seq_num + 1U, iso_tx_info.ts + INTERVAL_US);

Comment on lines +57 to +58
if (unlikely(is_first_sdu)) {
ret = bt_iso_chan_send(&iso_chan, buf, seq_num);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If is_first_sdu, call bt_iso_chan_send once then followed by call to bt_iso_chan_get_tx_sync and bt_iso_chan_send_ts consecutively ((iso_tx.rtn + 1U) * 2U) times, so as to ensure the Controller has sufficient SDUs buffered before sent-callback under the case of retransmissions.

Give this a try under noisy conditions or when devices are far away. Let me know if the Sequence Number and Timestamp get back in sync when the radio condition get better, when devices move back close.

@fgrandel
Copy link
Contributor Author

fgrandel commented May 22, 2024

@rugeGerritsen @cvinayak a quick message to thank you for your reviews. I've just not been able to implement it yet, because I'm short on time. But will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth ISO Bluetooth LE Isochronous Channels area: Bluetooth area: Samples Samples platform: nRF Nordic nRFx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

samples: Bluetooth: central_iso: Desynchronized submission of data
5 participants