
Reconsider the IMedia design to support low-copy/zero-copy rx/tx operation #352

Open
pavel-kirienko opened this issue Apr 24, 2024 · 8 comments
Assignees
Labels
class-requirement Issue that can be traced back to a design requirement domain-production Pertains to the shippable code rather than any scaffolding priority-high status-help-wanted Issues that core maintainers do not have the time or expertise to work on.

Comments

@pavel-kirienko
Member

#343 (comment)

@pavel-kirienko pavel-kirienko added status-help-wanted Issues that core maintainers do not have the time or expertise to work on. domain-production Pertains to the shippable code rather than any scaffolding priority-high class-requirement Issue that can be traced back to a design requirement labels Apr 24, 2024
@pavel-kirienko pavel-kirienko self-assigned this Apr 24, 2024
@pavel-kirienko
Member Author

Thinking aloud.

At the moment, we're focusing only on transmission. For reception, see the LibCyphal design document, section "Subscriber".

In line with what we discussed on the forum a while back, we could make IMedia provide a memory resource for serving dynamic memory for message objects:

    virtual std::pmr::memory_resource& getMemoryResource() = 0;

Then, as also covered on the forum, we expose some easy-to-use handle at the top layer of the library that the client can use to allocate messages to be transmitted; that handle would (through indirection perhaps) eventually invoke the above pure virtual method implemented by the media layer. The simplest possible implementation would simply return something like the new_delete_resource, as covered by Scott on the forum; a more advanced one could leverage specific memory regions that are DMA-reachable or are otherwise advantageous for storing the data to be transmitted.
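A minimal sketch of what this could look like, assuming illustrative names (`LoopbackMedia`, `DmaMedia`) that are not part of LibCyphal's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Hypothetical sketch: the IMedia name comes from this thread; everything
// else here is an illustrative assumption, not LibCyphal's real interface.
class IMedia
{
public:
    virtual ~IMedia() = default;

    // Serves dynamic memory for message objects and TX buffers.
    virtual std::pmr::memory_resource& getMemoryResource() = 0;
};

// Simplest possible implementation: forward to the global heap resource.
class LoopbackMedia final : public IMedia
{
public:
    std::pmr::memory_resource& getMemoryResource() override
    {
        return *std::pmr::new_delete_resource();
    }
};

// A more advanced media could return a resource backed by a DMA-reachable
// region; a monotonic_buffer_resource over a fixed buffer stands in for
// that idea here.
class DmaMedia final : public IMedia
{
public:
    DmaMedia(void* dma_region, std::size_t size)
        : pool_(dma_region, size, std::pmr::null_memory_resource())
    {
    }

    std::pmr::memory_resource& getMemoryResource() override { return pool_; }

private:
    std::pmr::monotonic_buffer_resource pool_;
};
```

The key property is that the allocation policy is owned by the media layer, while upper layers see only the abstract `std::pmr::memory_resource`.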

Nunavut would then serialize the data such that the serialization buffer is allocated from the same memory resource in one or more fragments. If the message contains sections that already happen to be in a wire-compatible format (little-endian, correct padding, IEEE 754, etc.), such sections will not be serialized; instead, a view of them will be emitted as part of the serialization routine's output. When data requires conversion, it will be copied into a new temporary buffer from the same PMR.

At the output we get some variation of vector<buffer> that needs to be passed on into the lizard; the original message object must be kept alive. See OpenCyphal/libcanard#223 (this one is libcanard but it applies also to the other lizards).
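One way to picture that `vector<buffer>` is a vector of fragments, each either a non-owning view into the original message or an owned buffer of converted data. All names here (`Fragment`, `ScatteredBuffer`, `totalSize`) are assumptions for illustration, not Nunavut's actual output types:

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Each fragment is either a view into the message object (wire-compatible
// data, not owned) or a buffer allocated from the media PMR (converted
// data, owned and freed after transmission).
struct Fragment
{
    const std::byte* data;
    std::size_t      size;
    bool             owned;  // true: free via the PMR after TX
};

// The serializer's output: an ordered list of fragments to concatenate.
using ScatteredBuffer = std::pmr::vector<Fragment>;

// Total payload size across all fragments (what the lizard would see).
inline std::size_t totalSize(const ScatteredBuffer& buf)
{
    std::size_t n = 0;
    for (const auto& f : buf)
    {
        n += f.size;
    }
    return n;
}
```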

The lizard will then push into its TX queue a sequence of TX items, where each item is an object referencing an ordered list of memory views pointing into the original list of fragments supplied by Nunavut. The lizard can also allocate additional memory for the payload from the same memory resource in order to inject protocol headers into the output TX queue; for libudpard this would mean the UDP frame headers and also the transfer CRC; for libcanard this includes the transfer CRC, padding, and the tail bytes.

At this stage, we get an ordered list of TX queue items, where each item contains:

  • Header and/or tail fragment(s) allocated from the media PMR and owned by the TX item itself (meaning they must be freed after the item is transmitted);
  • A list of views pointing into the original vector<buffer> from Nunavut of an arbitrary length which are to be concatenated (preferably by the scatter-gather DMA controller, or by the software itself in the absence thereof) upon transmission.
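A rough sketch of such a TX item, with a software gather step standing in for the scatter-gather DMA fallback mentioned above (names are illustrative assumptions, not the lizards' real types):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A non-owning view into one of the serializer's fragments.
struct View
{
    const std::byte* data;
    std::size_t      size;
};

// One TX queue item: an owned protocol header (allocated from the media
// PMR in practice) plus an ordered list of views to concatenate.
struct TxItem
{
    std::vector<std::byte> header;
    std::vector<View>      views;
};

// Software fallback for the scatter-gather DMA path: concatenate the
// header and all views into one contiguous frame.
inline std::vector<std::byte> gather(const TxItem& item)
{
    std::vector<std::byte> frame(item.header.begin(), item.header.end());
    for (const auto& v : item.views)
    {
        frame.insert(frame.end(), v.data, v.data + v.size);
    }
    return frame;
}
```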

The transport implementation would then be responsible for deallocating the memory fragments as they are consumed by the media layer, finally deallocating the original message object when no references to it are left. This sounds convoluted but I think it should be possible to approach this sensibly by introducing some notion of reference counting on memory fragments; this is out of the scope right now.
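To make the reference-counting idea concrete (the real design is explicitly out of scope here), a minimal sketch could lean on `std::shared_ptr`'s array support so that the last holder of a fragment frees it:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Illustrative only: a payload fragment shared between TX items; the
// memory is released when the last reference drops.
struct SharedFragment
{
    std::shared_ptr<const std::byte[]> data;
    std::size_t                        size;
};

// Hypothetical factory; a real implementation would allocate from the
// media PMR with a custom deleter rather than from the global heap.
inline SharedFragment makeFragment(std::size_t size)
{
    return SharedFragment{std::shared_ptr<const std::byte[]>(new std::byte[size]()), size};
}
```

An embedded implementation would more likely use an intrusive counter than `std::shared_ptr`, but the ownership semantics would be the same.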

One major issue with this direct approach is that it doesn't easily allow splitting outgoing transfers across multiple network interfaces, as the allocator is tied to a specific media instance. We could consider two options:

  • Some platforms may be able to use the same memory region for all NICs. For example, it could be a memory area reachable by the DMA controllers attached to distinct network interfaces, or DTCM (without DMA), etc. In this case, attaching the PMR to a specific media instance ceases to make sense; it should instead be extracted into a new independent entity, a buffer memory manager, that can work with all media instances. Where this is not possible, it would degrade into an ordinary PMR without any superpowers. This PMR is to be accessed by all parties: the application (for message object allocation), Nunavut (for buffer fragment allocation), and the lizard (for frame metadata buffer allocation). This option requires careful reference counting if a message is sent over multiple network interfaces concurrently, especially if they are managed by different lizards; this can potentially get rather complex.

  • We could allow the lizards to perform full deep copying, but only once. This choice lets us keep dedicated memory managers per media instance while avoiding reference counting across distinct lizards. A desirable side effect is that the application will no longer need to transfer ownership of its message object upon transmission, allowing such objects to be allocated statically rather than from the heap (or whatever memory manager the media PMR is underpinned by); Nunavut can likewise operate using static memory buffers without requiring heap allocation during serialization. This is not incompatible with the concept of scattered buffers and low-copy serialization of large data blobs that do not require conversion (such as byte-order swapping).

At the moment, the second option appears more compelling, so it will be pursued moving forward.
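The second option boils down to one deep copy per media instance, performed by each lizard into a buffer from its own media's PMR. A sketch under those assumptions (`Span` and `copyOnce` are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// A view into one of the serializer's fragments.
struct Span
{
    const std::byte* data;
    std::size_t      size;
};

// Each lizard deep-copies the scattered payload exactly once into a
// contiguous buffer from its own media's PMR. Afterwards the application's
// message object and the serializer's fragments can be reused or freed
// immediately, with no cross-lizard reference counting.
inline std::pmr::vector<std::byte> copyOnce(const std::vector<Span>& fragments,
                                            std::pmr::memory_resource& media_pmr)
{
    std::pmr::vector<std::byte> out(&media_pmr);
    for (const auto& f : fragments)
    {
        out.insert(out.end(), f.data, f.data + f.size);
    }
    return out;
}
```

With N redundant interfaces this costs N copies of the payload, which is the trade-off discussed below.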

@thirtytwobits
Contributor

Before I can comment fully: do we intend to have a hierarchy of redundancy where the top layer is managed by the transport layer across multiple media devices (e.g., when using CAN and UDP as redundant channels), and the next layer is managed within each media-layer device (e.g., when using multiple CAN peripherals) but looks like a single device to the transport layer?

@pavel-kirienko
Member Author

No, per the design intention, one IMedia maps to one NIC. I mean, one could conceivably implement IMedia that manages more than one NIC, but this would count as off-label use, your mileage would vary, and the warranty would be void.

A lizard can manage an arbitrary number of redundant media instances. You may recall that we discussed in the past that there is some room for improvement regarding how it's currently done, but even when the improvements are in place, the fundamental model will stay the same: one IMedia = one NIC.

As any given lizard implements only a particular Cyphal transport protocol, heterogeneous redundancy requires a higher-level aggregation where the specifics of a given transport protocol are abstracted away; we will be approaching this like in PyCyphal -- with the help of the RedundantTransport protocol. See the diagram in the design doc, section "Homogeneous and heterogeneous redundancy".

@thirtytwobits
Contributor

thirtytwobits commented May 6, 2024

We could allow the lizards to perform full deep copying, but only once.

My concern is that, without serializing a message directly into an output buffer, the user is doomed to perform full deep copying N times, where N is the number of redundant interfaces. If we design a way for serialization to occur directly into an output buffer, this becomes N - 1 deep copies after serialization.

@thirtytwobits
Contributor

...need to transfer ownership of its message object upon transmission

If we are serializing into an output buffer, then the application doesn't need to transfer ownership of the object representation. The memory we obtain from IMedia for transmission would be distinct from the memory we obtained for reception, where we deserialized the data into object form*.

* This thread is about transmission but when we get to reception we should talk about lazy deserialization as a feature

@pavel-kirienko
Member Author

My concern is that, without serializing a message directly into an output buffer, the user is doomed to perform full deep copying N times, where N is the number of redundant interfaces. If we design a way for serialization to occur directly into an output buffer, this becomes N - 1 deep copies after serialization.

This is true, but observe how the reference counting across several lizards required by this approach can potentially turn into a formidable can of worms 🪱

If we are serializing into an output buffer then the application doesn't need to transfer ownership of the object-representation.

If we apply the low-copy approach across the stack, then the output of Nunavut-generated serialization routines may contain references to the original message object (remember the example with imagery data?), from which follows:

  • The message object must be allocated in a DMA-reachable (or otherwise usable for efficient transmission) memory region.
  • The message object must remain intact for an arbitrary amount of time until the transmission is completed.

One way to achieve both is to allocate the message object from that DMA-compatible region and then hand it over to the transport layer upon transmission to let it dispose of it when it is no longer needed.

I am not saying these are insurmountable issues. They are quite manageable. We just need to choose between the two options outlined above.

@thirtytwobits
Contributor

Okay. I suppose your preferred solution is adequate. It may be that libcyphal will always be a bit less optimized than we'd like for MCUs. I've been wondering if we should start a "cyphal-ard" project: a minimal application layer in C where such close-to-the-metal integrations would be more plausible?

@pavel-kirienko
Member Author

I've been wondering if we should start a "cyphal-ard" project: a minimal application layer in C where such close-to-the-metal integrations would be more plausible?

We could entertain that thought just to see if there are sensible solutions to the problem of zero-copy transmission over heterogeneously redundant interfaces, and then try to transfer those back to LibCyphal.

Suppose you started a new C library from scratch and wanted to send zero-copy messages over CAN and UDP. What would be different compared to where we are now?
