Improve telemetry throughput #2095

qqqlab · 2023-02-22T16:53:56Z

This PR improves the telemetry throughput of extended header crsf packets. (A race condition was fixed in a separate PR)

Current Situation

9 slots are defined for holding incoming CRSF packets. 7 slots are used for specific standard header CRSF packets, two are used for extended CRSF packets. The RXhandleUARTin() method writes packets to these slots. A scheduler loops over the slots, if a slot with data is found, it is locked, transmitted OTA, and then unlocked to receive new CRSF data.

In worst case condition throughput is only 50% what is possible. This happens when CRSF packets arrive a fraction faster than they are sent OTA, now the slot is still locked and the CRSF message is dropped. The OTA transmitter sits idle until the next CRSF packet arrives.

Pull Request

Modify AppendTelemetryPackage() as follows:

Currently the slots are effectively 1 position LIFO buffers with overwrite. This is fine for standard CRSF telemetry packets, where we are interested in the latest packet only. However, for extended packets we are interested in all packets. Make the 2 extended slots act as a 2 position FIFO buffer without overwrite. By double buffering, we can receive the next packet in the buffer whilst the previous packet is transmitted OTA.
Refactor code: first handle non-OTA, then OTA for standard CRSF, then OTA for anything not handled yet.
Remove the return value of the method, it is never used.

Test Results

Tested by tormenting the ELRS receiver by firing 100 extended CRSF packages per second at it. Each CRSF packet is 20 bytes long.

Maximum possible throughput with Telemetry ratio 1:2

Mode crsf/s  byte/s 
500    62     1250
333F   83     1665
100F   25      500
 50    12      250

crsf/s = number of 20 byte crsf packets OTA per second
byte/s = number of OTA bytes per second

Results with ELRS 3.2.0:

Mode crsf/s  byte/s    telemetry lost
500    21     420      YES
333F   37     740      NO
100F    0       0      YES
 50     0       0      YES

Results after this PR:

Mode crsf/s  byte/s     telemetry lost
500     52    1040      NO
333F    60    1200      NO
100F    18     360      NO
 50      5     100      NO

Tools

The Python script was used to enulate a FC and send crsf packets to the ELRS receiver. The Lua script was used to analyse the received data on an EdgeTX radio.
https://github.com/qqqlab/ExpressLRS/blob/pr_attachments/CRSF-Telem.py
https://github.com/qqqlab/ExpressLRS/blob/pr_attachments/CRSF-Debug.lua

JyeSmith · 2023-02-22T23:39:40Z

Nice! Ill leave this for the big brains to review. My only comment is if you would class this as a bug fix and for the maintenance branch?

qqqlab · 2023-02-23T14:47:16Z

Hi Jye, thanks for your feedback. I'm not an expert on the git administrative side of ELRS, but would qualify this both as bug fix and maintenance branch.

My main goal is to get bi-directional telemetry working smoothly and with maximum throughput in both directions. The bug fix alone made some improvement, but adding the double buffer strategy really did the trick.

The bug fix alone would be: replace if (targetFound) with if (targetFound && !payloadTypes[targetIndex].locked) at line 301.

schugabe

The fix for the potential race condition is a good find and should be added.
But the remaining changes have some issues: The logic for the ardupilot response is needed - I wrote tests for all the cases and the tests are failing now.

The solution for better throughput would be to keep the slots for all known tlm frames and use a dynamic queue for everything else of the extended frames. Two slots is simply not enough for all the data that is possibly sent.

qqqlab · 2023-03-01T11:19:32Z

Sorry, I do not agree. The datalink has limited throughput, if the FC sends more info than can be handled by the link then data will be dropped regardless of how many slots or queue positions are available. Adding a longer queue might even work in the opposite direction as it increases latency. It would more sense to me to add a flow control mechanism instead, i.e. to inform the FC about the queue state.

In the current design no specific slots are assigned for specific extended CRSF messages. The two available slots are both used for multiple types of messages.

One could argue for adding a separate slot for "status text" in the new design. I did not do it, because I think this is sufficiently handled by the queue.

Yes, one test is failing now, but that is a direct consequence of the proposed design change. It is intended behavior.

schugabe · 2023-03-01T12:38:52Z

Sorry, I do not agree. The datalink has limited throughput, if the FC sends more info than can be handled by the link then data will be dropped regardless of how many slots or queue positions are available. Adding a longer queue might even work in the opposite direction as it increases latency. It would more sense to me to add a flow control mechanism instead, i.e. to inform the FC about the queue state.

We don't have a "slow down" crsf message type so it's not possible to implement that right now but we are trying to work out a way crsf can be extended: tbs-fpv/freedomtx#26

In the current design no specific slots are assigned for specific extended CRSF messages. The two available slots are both used for multiple types of messages.

One could argue for adding a separate slot for "status text" in the new design. I did not do it, because I think this is sufficiently handled by the queue.
Yes, one test is failing now, but that is a direct consequence of the proposed design change. It is intended behavior.

Merging code that breaks working features is not an option for me and if calling it a design descision does not change that.

My suggestion: create a new PR that fixes the race condition. The remaining changes need more discussion and/or new crsf frames.

wvarty

The physical code changes look ok to me, so I'll pre-approve, with the condition that @schugabe's design concerns are addressed, and leave the final approval to him.

qqqlab · 2023-03-02T03:51:43Z

Thanks for your feedback.

@schugabe Please, let's discuss this further.

My initial thought was to keep the existing code as-is, and just add a fifo buffer to improve throughput. But the existing code for extended crsf did not really make sense to me. For example, the code mentions "reserve slot for status text", but the same slot is also used for msp and any non explicitly handled crsf packets.

Also, I asked myself the rather philosophical question: should elrs keep an ever increasing list of crsf messages and dictate priorities for them, or should elrs be net-neutral, treat all messages equally, and just provide the best possible link to the communicating parties?

I like the second option better.

Based on this I submitted a more radical PR, making sure that the PR still satisfies the requirements for extended crsf as outlined in the existing code. .

If it is a hard requirement that all current test pass, I can add code back in to make that happen. But before I spend time on that I would like to get feedback from you if there is an interest in proceeding with this PR.

wvarty · 2023-03-04T05:09:41Z

@qqqlab just an FYI, we are aiming to cut a minor bugfix release towards the end of March, and hope to have all fixes merged to the 3.x.x-maintenance branch before the 9th March to leave time for running the test plan.
It might be worth splitting out the bugfix in this PR to a new PR that targets 3.x.x-maintenance, and leave the enhancements here for discussion once @schugabe gets a chance to respond?

JyeSmith · 2023-04-16T23:51:03Z

just giving this a little bump @qqqlab and @schugabe.

src/lib/Telemetry/telemetry.cpp

pkendall64

Ok, I see why the unit test is failing now.
The previous code would only use the 2 slots for FC originated messages, now it's for any of the non-singled out message types. So it's reducing the special casing that was hard-coded.

So either we are happy to have 2 slots for all types and change the unit test or we change the PR to use 2 slots only for FC originated messages and one slot otherwise.

Personally I'm in favour of simpler is better and change the unit test.

# Conflicts: # src/lib/Telemetry/telemetry.cpp

pkendall64 · 2023-04-17T03:27:29Z

src/lib/Telemetry/telemetry.cpp

-        for (int8_t i = 0; i < payloadTypesCount - 2; i++)
+        // first try slot payloadTypesCount - 2, so that the OTA packets are transmitted in the same order as they are received
+        targetIndex = payloadTypesCount - 2;
+        targetFound = !payloadTypes[targetIndex].updated;


If we decide to make it compatible to the current behaviour we can change this line to

targetFound = !payloadTypes[targetIndex].updated || !(header->orig_addr == CRSF_ADDRESS_FLIGHT_CONTROLLER || (header->type == CRSF_FRAMETYPE_ARDUPILOT_RESP && package[CRSF_TELEMETRY_TYPE_INDEX + 1] == CRSF_AP_CUSTOM_TELEM_STATUS_TEXT));

Thanks. I've added the fix, to see if the test passes now.

But I'm not 100% sure that this is the best way to go. It looks to me that after this fix we're back to single buffering most messages, and thus back to the old situation slowing things down artificially - the exact opposite of what this PR tries to achieve.

I'm working on the streaming PR now, but time permitting I'll run the throughput tests with this fix in place.

Good, did the throughput testing, and as expected, throughput drops back to old levels....

I've added a new commit and changed the line 265 to:

targetFound = !payloadTypes[targetIndex].updated || header->type == CRSF_FRAMETYPE_PARAMETER_READ;

in order to artificially satisfy test_function_store_unknown_type_two_slots. Now telemetry throughput is improved again.

What does this test test? It tests that for random messages only one slot is used, which is the exact opposite of what this PR tries to achieve.

So, I still think this test should be removed, or changed to expect 2 updated messages.

void test_function_store_unknown_type_two_slots(void) { telemetry.ResetState(); uint8_t unknownSequence[] = {0xEC,0x04,CRSF_FRAMETYPE_PARAMETER_READ,0x62,0x6c,85}; int length = sizeof(unknownSequence); int sentLength = sendData(unknownSequence, length); TEST_ASSERT_EQUAL(length, sentLength); uint8_t* data; uint8_t receivedLength; telemetry.GetNextPayload(&receivedLength, &data); sentLength = sendData(unknownSequence, length); TEST_ASSERT_EQUAL(length, sentLength); TEST_ASSERT_EQUAL(1, telemetry.UpdatedPayloadCount()); }

Thats why I gave the options 😄, and said that personally I'd go for the simple option i.e. fix the unit test, and make it check for 2 messages.

If we really want a special case for the ardupilot text message being only a single slot, then we should just do that and allow all others to be double slot.

@pkendall64 I missed your hint about changing the unit tests - still was under the impression that changing the unit tests was a no go :-)

Changed back things back to the original PR code, and changed the unit test to check for 2 messages.

If we really want a special case for the ardupilot text message being only a single slot, then we should just do that and allow all others to be double slot.

We need the ardupilot logic exactly as it is implemented otherwise the yapuu script stops working and I added the test to check all the cases we need to cover. So changing the tests would probably break compability with stuff that is working. So to really confirm this PR does not break anything it needs to be tested with yapuu/bf/kiss lua scripts + the parameter get/set stuff in our lua script.

I've added an additional slot for StatusText.

I've been using this PR code (without additional StatusText slot) with Ardupilot & Yaapu and status text worked. StatusText could be missed when flooding the receiver with other CRSF messages. However the original code has the same issue, as it stores StatusText in slot targetIndex = payloadTypesCount - 1 which is also used for other CRSF messages.

So, I've added a separate StatusText slot, to handle this case separately and always keep the last StatusText buffered to be sent OTA.

@schugabe Please let me know if the addition I made eases your concerns, or do you require further changes?

The overflow was possible but yapuu tuned the sending rate to prevent issues. Did you test this PR with the yapuu script at 50hz update rate + 1:2 tlm?

…lots

schugabe

I added some comments to the code.

Basically we are back to the old logic and need 64 bytes more buffer space only for ardupilot text messages.

For msp responses we did already use both slots so I guess the large speed gains only happen for all other traffic like ardupilot and in this case the logic for the text messages would limit the reachable speed even more.

If everybode else if fine with throwing more ram at tlm buffer space it's fine with me but I'd say we need a followup PR to use the third slot for everything else if it's not a ardupilot fc.

schugabe · 2023-04-24T19:26:05Z

src/lib/Telemetry/telemetry.cpp

    }
-    if (header->type == CRSF_FRAMETYPE_DEVICE_PING && package[CRSF_TELEMETRY_TYPE_INDEX + 1] == CRSF_ADDRESS_CRSF_RECEIVER)
+    if (header->type == CRSF_FRAMETYPE_DEVICE_PING && header->dest_addr == CRSF_ADDRESS_CRSF_RECEIVER)


We need the conditon (header->type == CRSF_FRAMETYPE_DEVICE_PING && (header->dest_addr == 0 || header->dest_addr == CRSF_ADDRESS_CRSF_RECEIVER)) since it's also possible to send a ping to all devices

schugabe · 2023-04-24T19:38:15Z

src/lib/Telemetry/telemetry.cpp

 {
-    const crsf_header_t *header = (crsf_header_t *) package;
+    const crsf_ext_header_t *header = (crsf_ext_header_t *) package;


Only messages in the frame range 0x28-0x96 have ext headers and the classic tlm frames are not ext frames. While nothing bad happens in this code with this cast it's not always correct. That is the reason why I had the if (header->type >= CRSF_FRAMETYPE_DEVICE_PING) before I did interpret it as ext frame.

schugabe · 2023-04-24T19:43:50Z

src/lib/Telemetry/telemetry.cpp

-        for (int8_t i = 0; i < payloadTypesCount - 2; i++)
+        // first try slot payloadTypesCount - 2, so that the OTA packets are transmitted in the same order as they are received
+        targetIndex = payloadTypesCount - 2;
+        targetFound = !payloadTypes[targetIndex].updated;


The overflow was possible but yapuu tuned the sending rate to prevent issues. Did you test this PR with the yapuu script at 50hz update rate + 1:2 tlm?

qqqlab · 2023-04-29T17:10:35Z

@schugabe

Basically we are back to the old logic and need 64 bytes more buffer space only for ardupilot text messages.

Thanks for your feedback. Apparently I fail to clearly explain what this PR is about. This is not about the special cases (standard CRSF header GPS, SPEED, etc., or extended header CRSF status text etc.) but about any other CRSF message.

With the special cases we're interested in the most recent message, so it is fine to use a single buffer, so that we receive the most recent message and potentially drop the ones that came before. But with the rest we're interested to get each and every message. By using a single buffer for these messages, only 50% of the OTA capacity is used in worse case. Using a double buffer fixes this deficiency.

See example below: a CRSF message arrives every 4 transmission slots, and takes 5 slots to transmit OTA. With single buffering only 3 out of 6 messages are received OTA. With double buffering 4.8 of 6 messages are received, because while a message is being transmitted OTA the next incoming message is buffered in the second buffer.

incoming:   1---2---3---4---5---6---  
OTA single: 11111---33333---55555---
OTA double: 111112222233333444446666
V3.2.0:     11112---33334---55556---

('-' is idle transmission slot, '1' to '6' is msg transmission slot)

With single buffering, when the incoming data rate increases from 100% to 101% of OTA link capacity, OTA data rate suddenly drops from 100% to 50% and 51% of data is lost. This makes things difficult to diagnose. With double buffering the situation is more predictable, the OTA rate stays fully loaded at 100% and only 1% of incoming data is lost.

Note that it is very easy to overload the link, incoming data flows at 400kbps, whereas the OTA capacity is less than 16kbps.

Regarding current use cases:

V3.2.1 includes PR #2103 which fixes a race condition. For illustrative purposes the situation with V3.2.0 is also depicted above, now 0 out of 6 messages are received because the messages get corrupted by the next incoming message which overwrites the single buffer as it is being transmitted OTA.

Prior to V3.2.1 all extended CRSF messages (including all Ardupilot messages) were suffering from this race condition. Apparently everybody (including myself) was happy with the situation, as apparently nobody complained about missing messages. I only noticed the deficiency when I tried to send more data over the link.

Maybe others also noticed this, but instead spending the time to investigate, they switched to another datalink solution.

So, it appears to me that in the current use cases the CRSF messages are sufficiently throttled by the FC not to trigger the race condition. This also means that switching to double buffering will have no negative impact on these use cases.

Regarding the StatusText special case:

If an additional 64 bytes of RAM is unacceptable for low-RAM targets, we can either reverse the last commit, or use it only for plenty-RAM ESP targets.

pkendall64 · 2023-04-30T20:38:43Z

I think we can live with the extra 64 bytes TBH. This PR just needs the conflicts fixed again and the PING mentioned above fixed.

# Conflicts: # src/lib/Telemetry/telemetry.cpp

All changes should be addressed

AlessandroAU · 2023-05-09T01:40:07Z

@qqqlab We have a very active private developer channel on our discord server, It would be great for you to join and discuss ideas in real-time. Come join and tag Jye, Wez, PK or myself and we will add you.

Fix race condition in Telemetry and improve throughput

3633eba

schugabe previously requested changes Feb 28, 2023

View reviewed changes

wvarty approved these changes Mar 2, 2023

View reviewed changes

qqqlab mentioned this pull request Mar 5, 2023

Fix race condition in Telemetry #2103

Merged

qqqlab changed the title ~~Fix race condition in Telemetry and improve throughput~~ Improve telemetry throughput Mar 7, 2023

qqqlab mentioned this pull request Mar 7, 2023

Add telemetryStatus bit to MSP+SYNC OTA packets #2088

Open

JyeSmith added the V3.3 🍕 label Apr 3, 2023

pkendall64 reviewed Apr 17, 2023

View reviewed changes

src/lib/Telemetry/telemetry.cpp Show resolved Hide resolved

pkendall64 reviewed Apr 17, 2023

View reviewed changes

Merge branch 'master' into pr/2095

12145aa

# Conflicts: # src/lib/Telemetry/telemetry.cpp

pkendall64 reviewed Apr 17, 2023

View reviewed changes

qqqlab added 5 commits April 17, 2023 16:28

add break to stop loop early

808fd98

fix "test_function_store_unknown_type_two_slots:FAIL: Expected 1 Was 2"

50d5fc1

Try only - to see if it passes test_function_store_unknown_type_two_s…

8c67751

…lots

unit test for 2 slots

0d09570

revert back to 2 slots for all cases

c9022a2

pkendall64 approved these changes Apr 18, 2023

View reviewed changes

qqqlab added 3 commits April 18, 2023 11:43

add slot for status text

8fe31ac

add slot for status text

1d4353f

Remove trailing whitespace

5cb259a

schugabe reviewed Apr 24, 2023

View reviewed changes

pkendall64 added 2 commits May 7, 2023 16:37

Merge branch 'master' into pr/2095

68e5082

# Conflicts: # src/lib/Telemetry/telemetry.cpp

Handle PING and broadcast PING

ee0ec77

pkendall64 requested a review from schugabe May 8, 2023 03:44

pkendall64 added the do not merge 🚫 label May 31, 2023

pkendall64 removed the V3.3 🍕 label Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve telemetry throughput #2095

Improve telemetry throughput #2095

qqqlab commented Feb 22, 2023 •

edited

JyeSmith commented Feb 22, 2023

qqqlab commented Feb 23, 2023

schugabe left a comment

qqqlab commented Mar 1, 2023

schugabe commented Mar 1, 2023

wvarty left a comment

qqqlab commented Mar 2, 2023

wvarty commented Mar 4, 2023

JyeSmith commented Apr 16, 2023

pkendall64 left a comment •

edited

pkendall64 Apr 17, 2023

qqqlab Apr 17, 2023

qqqlab Apr 17, 2023

pkendall64 Apr 17, 2023

pkendall64 Apr 17, 2023

qqqlab Apr 17, 2023

schugabe Apr 18, 2023

qqqlab Apr 18, 2023

qqqlab Apr 20, 2023

schugabe Apr 24, 2023

schugabe left a comment

schugabe Apr 24, 2023

schugabe Apr 24, 2023

schugabe Apr 24, 2023

qqqlab commented Apr 29, 2023

pkendall64 commented Apr 30, 2023

AlessandroAU commented May 9, 2023

Improve telemetry throughput #2095

Are you sure you want to change the base?

Improve telemetry throughput #2095

Conversation

qqqlab commented Feb 22, 2023 • edited

Current Situation

Pull Request

Test Results

Tools

JyeSmith commented Feb 22, 2023

qqqlab commented Feb 23, 2023

schugabe left a comment

Choose a reason for hiding this comment

qqqlab commented Mar 1, 2023

schugabe commented Mar 1, 2023

wvarty left a comment

Choose a reason for hiding this comment

qqqlab commented Mar 2, 2023

wvarty commented Mar 4, 2023

JyeSmith commented Apr 16, 2023

pkendall64 left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schugabe left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

qqqlab commented Apr 29, 2023

pkendall64 commented Apr 30, 2023

AlessandroAU commented May 9, 2023

qqqlab commented Feb 22, 2023 •

edited

pkendall64 left a comment •

edited