
Expose the MemPoolInfo of matching memory pools to typed endpoints #2092

Open
gpalmer-latai opened this issue Nov 14, 2023 · 50 comments
Labels
enhancement New feature

Comments

@gpalmer-latai
Contributor

Brief feature description

I would like the ability to somehow query the info of a particular memory pool that will be used for a comms endpoint's messages.

Detailed information

A full proof of concept is linked below:

master...gpalmer-latai:iceoryx:gpalmer/expose_mempool

The basic gist is that I've added an m_basePtr member to the MemPoolInfo struct to inform consumers about where the actual memory pool is located. Then I've exposed some APIs to lift that information all the way out to a typed publisher.

Alternatives Considered

Instead of lifting out the memory pool information through the layers of port info, it might be nicer/preferable if we could

  1. Have more control over which comms endpoints use which memory pools by e.g. adding service info constraints to the memory pool config entries in the RouDi config (only these publishers/subscribers use this memory pool), as well as "memory constraints" (the memory type of this pool is "pinned") - see the illustrative config sketch below
  2. Be able to query the runtime/RouDi directly for information about a particular memory pool. I know there is some infrastructure here for introspection, but I'm not sure how I could use that to find the particular memory pool of interest.
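
Purely as an illustration of what the service and memory constraints from alternative 1 could look like in the RouDi TOML config (the memory-type and allowed-services keys are made up; size and count are the only real mempool keys today):

[[segment]]

[[segment.mempool]]
size = 1024
count = 512
# hypothetical additions:
memory-type = "pinned-host"            # memory constraint for this pool
allowed-services = ["Camera.Front.*"]  # service constraint for this pool
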
@elBoberido
Member

@gpalmer-latai it seems you want to use this for GPU computation. I've no experience with GPUs and would need to ask e.g. @elfenpiff, but I'm wondering if your goals could also be achieved by extending the publisher and subscriber, e.g. by adding hooks like on_construction, on_destruction, on_pre_publish, on_post_publish, and so on. I have to think through whether this is easily feasible.

I think alternative 1 will be quite a lot of work. Funnily enough, this is exactly the approach we are taking with the Rust implementation: there is exactly one memory pool per service. Having something like that in iceoryx would also be possible, but it is definitely nothing that can be achieved in the short term.

Regarding option 2: with the introspection you might be able to find a memory pool for a specific service, but for that you would also need to know under which user the process with the publisher/subscriber was started, and you still wouldn't be able to acquire the base pointer. What exactly do you hope to achieve with the information from the introspection?

In the short term something like your proof of concept should be possible, maybe with a more constrained API, but that is more of a detail. My preferred solution would be the refactoring of the posh layer similar to the Rust implementation, where we experimented with these ideas and have had good experiences.

@gpalmer-latai
Contributor Author

Having on_construction instead of exposing getMemPoolInfo would work nicely for me yes. That only really avoids exposing the last layer of lifting I did (and of course one could always pass a lambda that lifts the info out themselves).

To give you a bit more context on my use case - I'm working on a platform with shared graphics memory. That means that the CPU and GPU can access the same memory. But this doesn't just work out of the box - you first have to call an NVIDIA API, specifically cudaHostRegister. This API takes a simple void*, a size, and some flags, and what it effectively does is "pin" the memory - page-locking it. I'm fuzzy on the details beyond that, but I believe this has implications for the TLB and associated memory hardware, and the reason I want to restrict the range of memory I call this on as much as possible is, quoted from the above documentation:

Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.

Furthermore, I cannot simply call this API every time I receive a message sample because it is a high latency operation, so ideally I want to make all the appropriate calls once early on when initializing the system.
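
For reference, a minimal sketch of what that pinning call looks like with the CUDA runtime API (the base pointer and size are placeholders; they would come from the mempool info):

#include <cuda_runtime_api.h>

#include <cstddef>
#include <iostream>

// Page-lock ("pin") an already-mapped shared-memory region so the GPU can
// access it directly. basePtr and poolSize are placeholders for the mempool info.
bool pinMemoryRange(void* basePtr, std::size_t poolSize)
{
    // This is the expensive, one-time call that should happen during initialization.
    const cudaError_t err = cudaHostRegister(basePtr, poolSize, cudaHostRegisterDefault);
    if (err != cudaSuccess)
    {
        std::cerr << "cudaHostRegister failed: " << cudaGetErrorString(err) << std::endl;
        return false;
    }
    return true;
}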

Anyway. My plan is to get a short term solution working for my needs and I would be very happy if we could work out a reasonable way to do this upstream here. But I also agree with you that the proper long term solution is what you have already described in the Rust implementation. I'd be interested in having a lengthier discussion sometime about the work involved in bringing that to iceoryx (or adopting the Rust implementation). I might be able to avoid needing it now, but I suspect there will be more use cases/requirements I stumble upon where I will be really sad not to have this level of control.

@elBoberido
Member

There should be a way forward to bring this upstream. I think I have to take counsel with my pillow and think a little bit more about the best short term approach, especially what implications it has for the untyped API. Would you contribute the patch once we have agreed on the API?

Once the announcement for the Rust implementation is out we can have a deep dive. Either as part of a developer meetup or a separate meeting. I guess @elfenpiff is also eager to participate.

@elBoberido added the enhancement New feature label Nov 15, 2023
@gpalmer-latai
Contributor Author

Yeah I should have some latitude to work on the patch.

@elBoberido
Member

Yeah I should have some latitude to work on the patch.

Hahaha, great pun :)

@elBoberido
Member

I had a few thoughts on this topic and currently I favor continuing with the idea we had when PortConfigInfo was introduced. All of this could be done in the publisher. Just for clarification, do you need to call cudaHostRegister on the subscriber side as well, or only on the publisher side?

@gpalmer-latai
Contributor Author

I'd need the ability to map memory pools to be able to process messages as both input and output of GPU-accelerated algorithms. So yes, subscriber too.

With the PortConfigInfo you could specify a different memory type for a port, but then how does this hook into supporting different hardware? I presume iceoryx isn't going to pull in NVIDIA dependencies, so you'd need the ability for the user to register behavior globally for different memory types.

@elBoberido
Member

I have to ask the author of that code what the exact idea behind PortConfigInfo was, but I thought about either having a compile-time switch to add the CUDA dependency, or maybe a mechanism similar to the platform layer where one can override the path to some hooks.

@gpalmer-latai
Contributor Author

I like the idea of a compile-time switch, possibly coupled with some Bazel select statements. Then just have a compile-time path that either calls the CUDA API or returns an error when a specific memory type is specified in the port config.
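
As a rough sketch of what that compile-time path could look like (the IOX_EXPERIMENTAL_CUDA define, the helper, and the error values are all made up for illustration):

#include <cstddef>

#ifdef IOX_EXPERIMENTAL_CUDA
#include <cuda_runtime_api.h>
#endif

// Hypothetical result type; a real implementation could return an iox::expected instead.
enum class PinResult { SUCCESS, UNSUPPORTED_MEMORY_TYPE, REGISTRATION_FAILED };

PinResult registerPinnedHostMemory(void* basePtr, std::size_t size)
{
#ifdef IOX_EXPERIMENTAL_CUDA
    // CUDA support compiled in: page-lock the requested range
    return (cudaHostRegister(basePtr, size, cudaHostRegisterDefault) == cudaSuccess)
        ? PinResult::SUCCESS
        : PinResult::REGISTRATION_FAILED;
#else
    // No CUDA support compiled in: requesting pinned host memory is an error
    static_cast<void>(basePtr);
    static_cast<void>(size);
    return PinResult::UNSUPPORTED_MEMORY_TYPE;
#endif
}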

@gpalmer-latai
Contributor Author

One thing to note though is that it appears the port config / memory_info will apply to the entire memory segment and it is currently not possible to configure multiple segments differing only by what type of memory they support.

We'd probably want to extend the RouDi config this way so that a publisher with a port config specifying "pinned host memory" would be told to use the corresponding segment.

@elBoberido
Member

That's basically the idea. The end user would still not have easy access to the mempool and therefore also cannot easily tamper with it but it gives other projects the option to extend the framework.

Yes, mid to long term the config should be extended. How to do it exactly might be something to discuss in a meetup. Maybe together with the discussion on how to bring the improvements of the Rust based implementation to the C++ implementation.

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 20, 2023

Yes, mid to long term the config should be extended

Actually I mean in the short/medium term. I'm really trying to avoid pinning the entire shared memory segment used for all communications just because one publisher wants to use GPU acceleration when publishing messages in one specific mempool.

If the effect of specifying memoryType as something signifying pinned_host_memory is that we decide to call cudaHostRegister when constructing the mepoo_segment somewhere here, then I'd be forced to do something like put all "GPU publishers/subscribers" in separate linux groups just to get them to use the isolated segment. And I'd like to avoid conflating the linux group / access controls use case with the memory allocation/access one.

Edit: What I linked is the codepath for Roudi setting up the shared memory in the first place. I meant to find the code path where the clients receive the info from Roudi.

Second Edit: The publisher (for example) gets this information here. It then has access through the port()->getMembers()->m_chunkSenderData.m_memoryInfo. However, I'm not sure if there currently exists any logic to do some sort of "post init" to react to that information.

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 20, 2023

What might be interesting to consider as well is adding an entry to both MePooConfig and MemPoolInfo.

This would allow configuration at the memory pool level, which fits my needs. Then we can see about adding some sort of "post_init" logic that is called after getMiddlewarePublisher to do setup on the client side. This boils down to the MemoryManager going through all of the memory pools and, for each one with a pinned_host MemoryType, calling cudaHostRegister on the memory region.

And this is something that could be added to the RouDi config TOML as well at some point, though it is also immediately available through manual RouDi configuration.
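
As a sketch of that post-init pass (m_memoryType and MemoryType::PINNED_HOST_MEMORY are hypothetical additions, m_basePtr is the member from the linked proof of concept, and pinMemoryRange is the helper sketched earlier):

// Walk all mempools of a segment and pin the ones whose (proposed) memoryType
// marks them as pinned host memory. Error handling omitted for brevity.
void pinConfiguredMemPools(iox::mepoo::MemoryManager& memoryManager)
{
    for (uint32_t i = 0U; i < memoryManager.getNumberOfMemPools(); ++i)
    {
        const auto info = memoryManager.getMemPoolInfo(i);
        if (info.m_memoryType == MemoryType::PINNED_HOST_MEMORY)
        {
            pinMemoryRange(info.m_basePtr,
                           static_cast<std::size_t>(info.m_chunkSize) * info.m_numChunks);
        }
    }
}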

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 20, 2023

There is still a bit of awkwardness to all of these approaches in that although a segment or a memory pool might be configured as supporting "pinned host memory", really it is ultimately up to the publishers or subscribers to decide if they want to use that feature. You have to consider that there are cases where the publisher is a normal CPU publisher or the subscriber is a normal CPU subscriber. In these processes you wouldn't really want to call cudaHostRegister at all.

So I guess, what you would do to handle this is to combine the MemPoolInfo (or segment info) with the PortConfigInfo to have some logic like:

// MemoryManager.getChunk()
for (auto& memPool : m_memPoolVector)
{
    uint32_t chunkSizeOfMemPool = memPool.getChunkSize();
    if (chunkSizeOfMemPool >= requiredChunkSize)
    {
        if (portConfigInfo.memoryInfo.memoryType == PINNED_HOST_MEMORY)
        {
            if (memPool.getInfo()->memoryType == PINNED_HOST_MEMORY)
            {
                chunk = memPool.getChunk();
                memPoolPointer = &memPool;
                aquiredChunkSize = chunkSizeOfMemPool;
                break;
            }
...

Note this code path is when you are retrieving a chunk which would happen much later than where you want to call cudaHostRegister on the appropriate segment/pool.

@elBoberido
Member

Maybe there is a misunderstanding. What I meant was to only register the mempool the publisher and subscriber are using and not the whole segment. As long as the memory is shared between CPU and GPU this should be sufficient and RouDi does not need to do any special handling. But I'm no GPU expert so this might be a bit naive.

So with that assumption I thought extending the config would not be required in the short term but only in the mid to long term.

We already have a concept of a shared memory provider, although it is only used and tested on CPU memory. This could be used to extend the memory types. The idea was to have this on segment level, not on mempool level. It has been quite a while since we discussed this topic, though, and some of our assumptions might not hold anymore.

@elBoberido
Member

Yes, the publisher and subscriber would opt in by specifying the desired memory type. I think that should be more or less straightforward. What might be more complicated is a mixed memory mode where some subscribers work on CPU memory and others work on GPU memory. To solve this in an efficient way we might need to do a whiteboard session.

@elBoberido
Member

Oh, and @elfenpiff already had some ideas on how to solve this in an efficient way about 4 years ago, but there was never time to implement them. Maybe the time has come now ... or at least in the not so distant future 😄

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 20, 2023

Okay. I think I follow now. I guess all one needs is to specify something in the portConfigInfo and have that consumed to make the cudaHostRegister call 🙂 You mentioned earlier one complication though - what about untyped publishers/subscribers?

Do we simply not support those initially with this feature, since we do not know in advance which memory pools they will request?

Related to the above - I've noticed that we appear to always do a linear search for the matching memory pool every time we get a chunk. Would it make sense to optimize this out for the typed endpoints in conjunction with the memory logic that is applied only to the corresponding memory pools? (Also in general we could consider doing a binary search)
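
As a sketch of the binary-search idea (illustrative only; it relies on the mempools being configured in ascending chunk-size order):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// With mempools sorted by ascending chunk size, the first pool that can hold
// requiredChunkSize is found with std::lower_bound instead of a linear scan.
// Returns sortedChunkSizes.size() if no mempool is large enough.
uint32_t findFittingMemPoolIndex(const std::vector<uint32_t>& sortedChunkSizes, uint32_t requiredChunkSize)
{
    const auto it = std::lower_bound(sortedChunkSizes.begin(), sortedChunkSizes.end(), requiredChunkSize);
    return static_cast<uint32_t>(std::distance(sortedChunkSizes.begin(), it));
}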

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 20, 2023

Note that there is an approach where you defer until getChunk to do something like cuda_pinned_host_registry.ensure_pinned(memPool) to make the call to cudaHostRegister the first time a chunk from a MemPool is requested. This would work for untyped endpoints too.

It's a bit problematic though because the first publish/subscribe for a particular message size will incur a lot of latency. Though you could mitigate this by "priming the pump" - calling loan_message for all the sizes you care about on system startup, dropping the samples, and then proceeding with business as usual.
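
A sketch of that "priming" idea with the existing untyped API (iceoryx v2 signatures; the sizes are application-specific and the lazy pinning inside getChunk is the hypothetical part):

#include "iceoryx_posh/popo/untyped_publisher.hpp"

#include <cstdint>
#include <vector>

// Loan and immediately release one chunk per expected payload size so that any
// lazy, first-use registration cost is paid at startup instead of on the hot path.
void primeMemPools(iox::popo::UntypedPublisher& publisher, const std::vector<uint32_t>& expectedPayloadSizes)
{
    for (const auto payloadSize : expectedPayloadSizes)
    {
        publisher.loan(payloadSize).and_then([&](auto& userPayload) { publisher.release(userPayload); });
    }
}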

@elBoberido
Member

For the untyped publisher and subscriber we have two options (three after I read your latest response)

  1. do not support it initially
  2. add an API call like likely_used_memory_sizes(iox::vector<uint64_t, ???> list_of_memory_sizes)
  3. like 2 but on loan/getChunk; might be a fallback for 2 with a log message warning about the new registration

@elBoberido
Member

Oh, the typed API would then just call likely_used_memory_sizes on the port level

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 20, 2023

That sounds great to me. So let me see if I can summarize:

  1. Add some memoryType constants somewhere that will be used to populate this field in the MemoryInfo struct member of PortConfigInfo.
  2. Expose in the publisher and subscriber APIs the ability to pass in a non-default PortConfigInfo.
  3. For untyped endpoints, also extend the api to take a likely_used_memory_sizes vector on construction.
  4. For typed endpoints, automatically supply that information based on the type.
  5. Add some compile-time flag indicating whether the cuda_runtime_api is available.
  6. Create a global repository similar to the pointer_repository to track registration calls (we may have multiple endpoints requesting the same memory region be pinned. CUDA will set an error code the second time you try to do this so it is better to wrap such calls and cache the information - also for latency reasons).
  7. In the MemoryManager, extend the API to consume the likely_used_memory_sizes. This might be used in general for some optimization in getChunk, but additionally, if the compile-time flag is active, it is used to call cudaHostRegister on all appropriate memory pools.
  8. Error out early if the port config info requests pinned host memory but the compile time flag is not active.
  9. When calling getChunk - if the requested size matches any of the likely_used_memory_sizes, we simply return the matching memory pool. Otherwise we do the normal linear search and additionally make the cudaHostRegister call if the PortConfigInfo requests it, and if the compile time flag is active.

@elBoberido
Member

elBoberido commented Nov 20, 2023

Here are some remarks

  1. We should probably reserve a range for a curated list of memory types, similar to what IANA does
  2. Ideally with this we could introduce a builder-like API which gives us the opportunity to return an iox::expected and leave it to the user to decide what happens when there are no resources available. The builder could also be tied to the runtime or at least have the option to ask the non-default runtime for a port. There is the idea to have multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need to have multiple runtimes. This is something in the far future and maybe we can get rid of RouDi in the C++ implementation but having an API which considers this option would be nice.
  3. 👍 we might also end up just having multiple calls to likely_used_memory_size. A detail which can be decided later on depending on what feels better to use
  4. 👍
  5. It would probably be similar to the mechanism we currently use to use a custom platform layer. By default only CPU memory would be supported and the user could choose multiple other options. CUDA might be an option which could be upstreamed
  6. Yes, this must be tracked for each process and cannot be stored in the mempool itself
  7. I'm not quite sure I understand this correctly. My idea was to call cudaHostRegister in the PublisherPortUser not in the MemoryManager. The MemoryManager needs the interface from your proof of concept to return the address and size of the mempool for a specific chunk size. The getChunk method could get an additional parameter with a hint for the index, e.g. index_of_last_used_mempool. If the size does not match the mempool at the index the current search could be performed or alternatively a binary search. The caching of the known indices could also be done in the PublisherPortUser since the MemoryManager resides in the shared memory and we would need to hold a structure for (IOX_MAX_PUBLISHER + IOX_MAX_SUBSCRIBERS) * 2 (aaaargh, I used IOX_MAX_PUBLISHER to define MAX_SERVERS ... facepalm 😆 )
  8. 👍
  9. I think this is also something for the PublisherPortUser since the repository for cudaHostRegister would live in process local memory and not in the shared memory ... but maybe I overlooked something

@gpalmer-latai
Contributor Author

I think you are right about the PublisherPortUser owning a lot of the logic 👍

@gpalmer-latai
Contributor Author

We should probably reserve a range for a curated list of memory types, similar to what IANA does

That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now so that users implementing their own logic downstream do so with a value out of that range.

@elBoberido
Member

That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now so that users implementing their own logic downstream do so with a value out of that range.

Exactly. I've thought about something similar to Qt::UserRole, which is the offset when defining own roles in Qt's model-view framework. We just need to find a good name and set it to e.g. 0x100. This should leave us a range which is big enough for whatever we want to use. Downstream can then request to register a specific value directly in iceoryx and after a two-week discussion on Oʻahu (downstream will take care of the hotel and traveling costs) the team will vote on the inclusion of the new value.

@elBoberido
Member

... aaargh, too much Qt coding 😆 ... we can just define an enum to distinguish between user defined values and the ones defined by iceoryx
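
A minimal sketch of that enum idea (names and values are made up for illustration):

#include <cstdint>

// iceoryx-curated memory types occupy the low range; everything from
// USER_DEFINED upwards is reserved for downstream projects.
enum class MemoryType : uint32_t
{
    CPU = 0,
    PINNED_HOST_MEMORY = 1, // hypothetical curated value
    USER_DEFINED = 0x100    // downstream projects start their own values here
};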

@gpalmer-latai
Contributor Author

Ideally with this we could introduce a builder-like API which gives us the opportunity to return an iox::expected and leave it to the user to decide what happens when there are no resources available. The builder could also be tied to the runtime or at least have the option to ask the non-default runtime for a port. There is the idea to have multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need to have multiple runtimes. This is something in the far future and maybe we can get rid of RouDi in the C++ implementation but having an API which considers this option would be nice.

So is the idea that the builder is passed as an argument like PortConfigInfoBuilder and the publisher calls create() on it? Or is this pattern more of just a convenient way to get an object with reasonable defaults but the ability to override some if necessary?

When you talk about tying the builder to the runtime, I'm guessing you mean that instead of

m_port(iox::runtime::PoshRuntime::getInstance().getMiddlewarePublisher(service, publisherOptions))

You'd have something like

m_port(portInfo.getRuntime().getMiddlewarePublisher(service, publisherOptions, portInfo))

@gpalmer-latai
Contributor Author

Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem.

@elBoberido
Member

Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem.

Exactly. We don't use exceptions, and when something goes horribly wrong in the ctor we currently terminate. It would be better to use the factory method pattern or the builder pattern and return an expected<Publisher>.
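
A toy illustration of that pattern (using std::expected from C++23 as a stand-in for iox::expected; the names are made up and this is not the iceoryx API):

#include <cstdint>
#include <expected>

enum class PortError { NO_RESOURCES_AVAILABLE };

class Port
{
  public:
    // Validation happens in a static factory that returns an expected instead
    // of terminating inside a constructor.
    static std::expected<Port, PortError> create(uint32_t requestedChunks) noexcept
    {
        constexpr uint32_t MAX_CHUNKS{1024U};
        if (requestedChunks > MAX_CHUNKS)
        {
            return std::unexpected(PortError::NO_RESOURCES_AVAILABLE);
        }
        return Port{requestedChunks};
    }

  private:
    explicit Port(uint32_t chunks) noexcept : m_chunks(chunks) {}
    uint32_t m_chunks{0};
};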

@gpalmer-latai
Contributor Author

Interesting. How deep are you thinking of going in this sort of refactor? It looks like the getMiddleware* functions all emit warnings and return a nullptr, which immediately gets passed to a constructor taking a not_null pointer...

So we could for example, also use the builder pattern for the port users. We can also refactor the getMiddleware* functions to be of expected type. And probably there are more rabbit holes like this to jump down.

@gpalmer-latai
Contributor Author

gpalmer-latai commented Nov 21, 2023

The getChunk method could get an additional parameter with a hint for the index, e.g. index_of_last_used_mempool

In order to do this we will have to expose the index somehow - maybe instead of a getMemPoolInfo with a ChunkSettings overload there should be something like a uint32_t getMemPoolHint(const ChunkSettings& chunkSettings) const;

This function can then both be used as a way to query MemPoolInfo via the existing getMemPoolInfo as well as the starting index of a binary search implementation of getChunk.

@elBoberido
Member

The getChunk method could get an additional parameter with a hint for the index, e.g. index_of_last_used_mempool

In order to do this we will have to expose the index somehow - maybe instead of a getMemPoolInfo with a ChunkSettings overload there should be something like a uint32_t getMemPoolHint(const ChunkSettings& chunkSettings) const;

This function can then both be used as a way to query MemPoolInfo via the existing getMemPoolInfo as well as the starting index of a binary search implementation of getChunk.

Sounds good. Since this is no user facing API we are free to do anything :)

@gpalmer-latai
Contributor Author

My idea was to call cudaHostRegister in the PublisherPortUser not in the MemoryManager

On second thought and further study of the current architecture, I think it might actually make sense for the ChunkSender to do this, since the ChunkSenderData is where the memory info gets stored, and the ChunkSender is responsible for allocating and freeing memory chunks.

So basically, to avoid errors in construction we use the builder type for the chunk sender also, or we alternatively add a tryAllocateHint method which would be used by the chunkSender to both get the index of the matching mempool and cache it, as well as do the cuda stuff if appropriate.

For simplicity it probably suffices to make the hint that the chunkSender stores be a single tuple of ChunkSettings to uint32_t which is just a lastUsedHint. If the requested chunk settings change, the sender gets a new hint, attempts to register with CUDA again, and uses the new hint to get a chunk.

The CUDA calls themselves will be cached by the global registry so even if we end up back at the same hint it will become a noop.
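
A sketch of such a process-local registry (all names are hypothetical; a mutex is included because of the thread-safety question below):

#include <cuda_runtime_api.h>

#include <cstddef>
#include <map>
#include <mutex>

// Makes repeated pin requests for the same mempool a no-op and serializes
// concurrent registrations from different threads.
class CudaPinnedMemoryRegistry
{
  public:
    static CudaPinnedMemoryRegistry& instance()
    {
        static CudaPinnedMemoryRegistry registry;
        return registry;
    }

    bool ensurePinned(void* basePtr, std::size_t size)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_registeredRanges.count(basePtr) != 0)
        {
            return true; // already pinned, nothing to do
        }
        if (cudaHostRegister(basePtr, size, cudaHostRegisterDefault) != cudaSuccess)
        {
            return false;
        }
        m_registeredRanges[basePtr] = size;
        return true;
    }

  private:
    std::mutex m_mutex;
    std::map<void*, std::size_t> m_registeredRanges;
};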

@gpalmer-latai
Contributor Author

Question about the pointer repository though - is this class thread-safe? I don't see any sort of synchronization around registering a new segment. I assume this is because there is no expectation that this would ever be called from different threads?

It's not practically a problem for my particular use case, but it would be problematic to introduce a code path that is not thread-safe yet can be called at any time a publisher or subscriber takes or receives a message.

@elBoberido
Member

Interesting. How deep are you thinking of going in this sort of refactor? It looks like the getMiddleware* functions all emit warnings and return a nullptr, which immediately gets passed to a constructor taking a not_null pointer...

So we could for example, also use the builder pattern for the port users. We can also refactor the getMiddleware* functions to be of expected type. And probably there are more rabbit holes like this to jump down.

The whole runtime could be refactored if there is time. The runtime belongs to a few remaining old constructs which have their roots in an R&D department. At that time the focus was more on trying out new things than on creating sophisticated APIs. In the meantime we also got new tools like expected and optional which make it easier to express specific ideas.

Ideally we could try to model the API in a way that makes it simpler to have the same API for the C++ bindings on the Rust-based next-gen implementation.

@elBoberido
Member

Question about the pointer repository though - is this class thread-safe? I don't see any sort of synchronization around registering a new segment. I assume this is because there is no expectation that this would ever be called from different threads?

It's not practically a problem for my particular use case, but it would be problematic to introduce a code path that is not thread-safe yet can be called at any time a publisher or subscriber takes or receives a message.

There is a related issue #1701. We need to check this.

@elBoberido
Member

My idea was to call cudaHostRegister in the PublisherPortUser not in the MemoryManager

On second thought and further study of the current architecture, I think it might actually make sense for the ChunkSender to do this, since the ChunkSenderData is where the memory info gets stored, and the ChunkSender is responsible for allocating and freeing memory chunks.

So basically, to avoid errors in construction we use the builder type for the chunk sender also, or we alternatively add a tryAllocateHint method which would be used by the chunkSender to both get the index of the matching mempool and cache it, as well as do the cuda stuff if appropriate.

For simplicity it probably suffices to make the hint that the chunkSender stores be a single tuple of ChunkSettings to uint32_t which is just a lastUsedHint. If the requested chunk settings change, the sender gets a new hint, attempts to register with CUDA again, and uses the new hint to get a chunk.

The CUDA calls themselves will be cached by the global registry so even if we end up back at the same hint it will become a noop.

I need to think more about this. Unfortunately I'm traveling at the end of the week till next week and won't have much time. Please ping me again for this if I forget to respond.

@gpalmer-latai
Contributor Author

That's okay. I'm also traveling until next Wednesday (it is a holiday weekend in the U.S.). Funnily enough - I am flying to Oahu tomorrow.

@gpalmer-latai
Contributor Author

There is a related issue #1701. We need to check this.

So I've thought about it and it seems possible but very difficult to create a situation where one calls e.g. RegisterPtr from multiple threads. You'd have to be creating shared memory pools in different threads and registering them concurrently, for some reason, without knowing in advance what id you want to assign them.

For this CUDA registry we can possibly get around this problem by registering with a combination of the memory pool id and the segment id.

As for the "duplicate registration" race condition: if this occurs, CUDA will set an error code along the lines of "this overlaps with some memory that is already registered". We could choose to ignore that one... though it hides other logic errors, like accidentally overlapping memory pool info.

@elBoberido
Member

That's okay. I'm also traveling until next Wednesday (it is a holiday weekend in the U.S.). Funnily enough - I am flying to Oahu tomorrow.

It's been a while since I've been there. Unfortunately it's a bit far away from my place. Have fun :)

@elBoberido
Member

@MatthiasKillat in case you are interested in the CUDA stuff :)

@gpalmer-latai
Contributor Author

So I've started prototyping this new solution and have encountered an obstacle I don't think we've thought through yet.

For subscribers - it appears as if the subscriber does not know anything about memory pools or chunks until a message has been delivered, or more precisely, pointer information to a shared chunk (through which it can figure out where the memPool is to call freeChunk eventually).

Do you have any thoughts on an API that would allow a subscriber to figure out which memory pool messages will come from before it starts receiving messages? I'd like to avoid in the happy path making calls to cudaHostRegister only after the first message is about to be processed, as this might blow latency budgets on the first iteration of any GPU processing work.

@gpalmer-latai
Contributor Author

One thing that comes to mind that might be somewhat odd but would at least be relatively "safe" would be adding a const RelativePointer<const mepoo::MemoryManager> m_memoryMgr; to the chunk_receiver. This would give the subscriber read-only access to the memPool information, which could let it pass some size information to get access to the MemPoolInfo using only const APIs of the memory manager.

@gpalmer-latai
Contributor Author

Ah... but this approach runs into one more problem. Because of access controls we don't actually know which segment a message is going to come from, and therefore cannot acquire the memoryManager information the way the publisher does

@gpalmer-latai
Contributor Author

What about this alternative approach - We leverage the fact that the PoshRuntime has access to the segment manager via the IpcRuntimeInterface. The segment manager in turn has a public API to return all the SegmentMappings. Interestingly here we also have the MemoryInfo struct that is key to supporting this feature.

My thought is - what if we extend this SegmentMapping struct to include a const pointer to the memory manager for the segment? This would expose the getMemPoolInfo and getMemPoolHint methods, thus allowing with my other proposed changes above for the runtime itself to have access to all of the memory managers.

Then, when creating a subscriber, the runtime can simply do an additional step to use either a template type size info or some size hints in the untyped API to find all memory pools the subscriber may read from and preemptively pin them.

In addition to this, we would still do a check when dequeuing a message in the subscriber that the owning memory pool has been pinned, but this check would likely be a noop unless the message we received is a different size than we expected.

@elBoberido
Member

I've been busy with other stuff the last days. Will have a look at your ideas tomorrow.

@gpalmer-latai
Contributor Author

Another idea along this line - if I expose the ability for the PoshRuntime itself to iterate over SegmentMappings, and the segment mappings in turn exposed info about individual memory pools, then perhaps we could allow configuring MemoryInfo at the MemPool level such that the following behavior happens:

  1. When initializing the runtime, we call cudaHostRegister on every memory pool with memoryType=CUDA_HOST_PINNED
  2. When a publisher loans a message, it must use a memory pool with a memory type matching that of its own PortConfigInfo. Following 1, this memory will have already been pinned.
  3. A subscriber can only receive messages from publishers with a matching memory type. Following 1. and 2. this would mean that the memory backing any of these messages must have already been pinned.

@elBoberido
Member

I'm currently making myself familiar with the code again. Regarding your last idea: if the number of pinned MemPools needs to be minimal, this would not work since we do not know which MemPool a publisher or subscriber will use. We would basically need to pin the whole shared memory segment. Maybe I misunderstood your idea.

@gpalmer-latai
Contributor Author

Ah no, you are right and I got a bit ahead of myself here.

We of course do not want to go around pinning every single memory pool that might "support it" in every single process because each given process probably only really cares about sending/receiving messages on one or a few memory pools.

So I'll backtrack again to the idea that, when creating a subscriber, the runtime goes through each data segment (if there are multiple) and pins the memory pools that match

  • The size of the templated type - for typed subscribers
  • The sizes provided in a vector of size hints - for untyped subscribers

And in order to support untyped subscribers, we would still need a code path to check, upon receipt of a message, whether it came from a new memory pool which has not yet been registered, in which case we simply eat the cost of pinning the memory at that time.
