NA OFI: support vector rma operations #346
For some more context, here is code in MPICH that is issuing vector libfabric operations when needed: https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L514 I don't see guards around the provider type there, though it's possible that I'm missing something.
Yes, I had looked at that in the past, but it was not supported. See ofiwg/libfabric#3062. We should have a look at it again and maybe add a query call so that we can fall back for providers that do not support it.
I didn't think to check the memory registration path; I just looked at what some of the codes are doing on the actual read/write RMA path. I guess we'll be testing both with this change :)
MPICH appears to be setting msg.desc to NULL: https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L480 I don't know the implications, just pointing that out.
I considered this a long time ago. Vector RMA should be supported by OFI, but providers set limits on the local iov_count and the remote rma_iov_count (fi_info::tx_attr::iov_limit / ::rma_iov_limit), and different providers can have different limits, which can be queried with fi_getinfo(). Just FYI.
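A minimal sketch of that fi_getinfo() query (the helper name and API version here are illustrative, not anything from Mercury):

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Print the I/O vector limits each provider reports, one line per
 * fi_info entry returned by fi_getinfo(). */
static void print_iov_limits(void)
{
    struct fi_info *info, *cur;

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, NULL, &info))
        return;

    for (cur = info; cur; cur = cur->next)
        printf("%s: iov_limit=%zu rma_iov_limit=%zu mr_iov_limit=%zu\n",
               cur->fabric_attr->prov_name,
               cur->tx_attr->iov_limit,
               cur->tx_attr->rma_iov_limit,
               cur->domain_attr->mr_iov_limit);

    fi_freeinfo(info);
}
```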
@carns All providers set mr_iov_limit and rma_iov_limit to 1. I am still doing some more testing, and we'll support it internally now, but I would keep hopes very low at this point :\
@carns I was looking in more detail at the MPICH code that you had sent. I think it's still limited though, since they also always set iov_count and rma_iov_count to 1.
Ah, OK. Thanks for the updates.
Supporting code was added as part of a6bbce8, but there are issues remaining with libfabric that will need to be further investigated.
What are the "issues remaining with libfabric"? Phil pointed me to this post https://lists.openfabrics.org/pipermail/libfabric-users/2021-June/000861.html documenting the [...], but I see no impact on the vectored transfer benchmark, before or after [...].
I would need to look at it again, but I had added support for it; you can turn it on by uncommenting this line: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L196
No change after uncommenting that define: [...]
a-ha! I stuck some debugging into [...] and can see that setting the env changes the [...]. Dang, I really thought I had something by changing that define in libfabric, recompiling, and trying again; however, mercury still tells me max_segments is 1.
Ah, OK, from what I see it looks like it's because of the rxm provider. When you're running fi_info, are you specifying "verbs;ofi_rxm", or only "verbs"?
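For reference, `fi_info -p "verbs;ofi_rxm"` selects the layered provider on the command line; in code, a sketch of the equivalent hint (the helper name is illustrative):

```c
#include <string.h>
#include <rdma/fabric.h>

/* Ask specifically for verbs layered under rxm rather than raw verbs;
 * the attributes returned then reflect the rxm layer's limits. */
static struct fi_info *get_rxm_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    if (!hints)
        return NULL;
    hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");

    fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    fi_freeinfo(hints);
    return info; /* NULL if no match; caller must fi_freeinfo() it */
}
```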
I'm not calling the [...]. The rxm angle is interesting, but I don't see "1" in there. I do see [...], so I'd expect a "sticky" 4, but I don't know where the 1 comes from.
I figured "what the heck" and hard-coded 30 in mercury's [...]. Now vector lengths of 1 are OK, and 32 and higher are OK, but vector lengths of 2-16 give me [...].
@roblatham00 I don't think I was implying that you were calling the [...].
Thanks for the libfabric information. You are 100% correct that I am confused by libfabric domains, providers, etc. Let me be more concrete: I'm using Phil's benchmark to measure this: https://github.com/mochi-hpc-experiments/mochi-tests/blob/main/perf-regression/margo-p2p-vector.c#L243
Right, that code is creating a single registration to represent a vector of memory regions. That entire registration (as a whole, including all vectors) is then transmitted with a bulk transfer operation. Going by @soumagne's explanation, that means this benchmark relies on the mr_iov_limit rather than the iov_limit (thanks for the explanation; I was wondering what the exact difference was). Mercury doesn't have an API to pick out pieces of a contiguous memory registration, or for that matter to construct a vector from different registrations. That could be a neat capability, though. MPICH, for example, uses this to stitch pre-registered header structs onto data payloads. You could also imagine it being used for two-phase style aggregation, where one party doesn't know anything except the aggregate size, while the other parties fill in specific portions using vector operations. Right now we couldn't use vectors for that.
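For readers following the thread, a sketch of what such a single vector registration looks like at the Mercury API level (buffer contents and error handling omitted; the helper name and sizes are illustrative):

```c
#include <stdlib.h>
#include <mercury_bulk.h>

/* Register a 4-segment vector as one bulk handle; whether this maps to a
 * single multi-segment MR underneath is exactly the mr_iov_limit question. */
static hg_return_t register_vector(hg_class_t *hg_class, hg_bulk_t *handle)
{
    void *bufs[4];
    hg_size_t sizes[4] = {4096, 4096, 4096, 4096};
    int i;

    for (i = 0; i < 4; i++)
        bufs[i] = malloc(sizes[i]);

    return HG_Bulk_create(hg_class, 4, bufs, sizes,
                          HG_BULK_READWRITE, handle);
}
```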
I think that, at the very least, the benchmark could still be relevant if the virtual selection made internally results in several segments being transferred at once (internally registered as separate regions). I think the first step is to enable that and add support for it within the NA layer, and then we can see about adding new vector-type APIs to HG Bulk itself.
Is your feature request related to a problem? Please describe.
The na_ofi.c code already uses libfabric API calls that support vectors for local and remote buffers (fi_writemsg() and fi_readmsg()), but it hard-codes the vector lengths to 1 (see mercury/src/na/na_ofi.c, line 4171 at 06bf87e).
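For context, fi_writemsg() takes a struct fi_msg_rma whose iov_count and rma_iov_count fields carry those vector lengths; a sketch of a genuinely vectored write (the helper name and parameters are illustrative, not Mercury's code):

```c
#include <sys/uio.h>
#include <rdma/fi_rma.h>

/* Issue one RMA write carrying several local and remote segments at once,
 * instead of looping with iov_count = rma_iov_count = 1. */
static ssize_t vectored_write(struct fid_ep *ep, fi_addr_t peer,
                              const struct iovec *iov, void **desc,
                              size_t count, const struct fi_rma_iov *rma_iov,
                              size_t rma_count, void *context)
{
    struct fi_msg_rma msg = {
        .msg_iov       = iov,
        .desc          = desc,
        .iov_count     = count,     /* must be <= tx_attr->iov_limit */
        .addr          = peer,
        .rma_iov       = rma_iov,
        .rma_iov_count = rma_count, /* must be <= tx_attr->rma_iov_limit */
        .context       = context,
        .data          = 0,
    };

    return fi_writemsg(ep, &msg, FI_COMPLETION);
}
```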
Describe the solution you'd like
It would be better if it translated vectors from the bulk handle into native vectors for libfabric, as is done in the na_sm.c code (see mercury/src/na/na_sm.c, line 3733 at 06bf87e).
We need to confirm whether the relevant libfabric providers actually support this capability; if not, we will need to retain a fallback path for providers that don't (and add that to the flags checked in NA_OFI_PROV_TYPES so we can quickly toggle it if offending providers gain support later).
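A sketch of that translate-and-fall-back idea (the segment type and names are illustrative, not Mercury's actual internals): pack at most the provider's limit per call so the caller can loop, which degrades to one segment per operation when the limit is 1.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Illustrative segment type standing in for NA's internal representation. */
struct segment {
    void  *base;
    size_t len;
};

/* Translate up to 'limit' segments into an iovec array; returns how many
 * were packed so the caller can loop, issuing one RMA call per batch. */
static size_t pack_iov(const struct segment *segs, size_t nsegs,
                       struct iovec *iov, size_t limit)
{
    size_t i, n = (nsegs < limit) ? nsegs : limit;

    for (i = 0; i < n; i++) {
        iov[i].iov_base = segs[i].base;
        iov[i].iov_len  = segs[i].len;
    }
    return n; /* n == 1 whenever the provider reports iov_limit == 1 */
}
```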
Additional context
From a quick scan of libfabric, it looks like gni and psm2 likely support this. Verbs may not, but we need to confirm. I see numerous FI_E2BIG return codes in the verbs provider, but I'm not sure which code paths are relevant here when using rxm.