NA OFI: support vector rma operations #346

Open
carns opened this issue Feb 26, 2020 · 22 comments

@carns
Contributor

carns commented Feb 26, 2020

Is your feature request related to a problem? Please describe.

na_ofi.c already uses libfabric API calls that support vectors for local and remote buffers (fi_writemsg() and fi_readmsg()), but it hard-codes the vector lengths to 1 (.iov_count = 1).

Describe the solution you'd like

It would be better if it translated vectors from the bulk handle into native vectors for libfabric, as is done in the na_sm.c code here:

/* Translate local offset, skip this step if not necessary */

We need to confirm whether the relevant libfabric providers actually support this capability; if not, we will need to retain a fallback path for providers that don't (and add that to the flags checked in NA_OFI_PROV_TYPES so we can quickly toggle it if offending providers gain support later).
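For illustration only, a minimal sketch of what the vectorized path could look like once the bulk handle's segments are translated into libfabric iovecs; the helper name and parameters are hypothetical, not the actual na_ofi.c code, and it assumes the provider's tx_attr->iov_limit and tx_attr->rma_iov_limit are large enough:

/* Hedged sketch only: hypothetical helper, not the existing na_ofi.c code. */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

static ssize_t
rma_write_vec(struct fid_ep *ep, fi_addr_t dest_addr,
              const struct iovec *local_iov, void **local_desc,
              const struct fi_rma_iov *remote_iov, size_t count,
              void *context)
{
    /* One local descriptor per iovec entry; each remote segment carries
     * its own (addr, len, key) triplet. */
    struct fi_msg_rma msg = {
        .msg_iov       = local_iov,
        .desc          = local_desc,
        .iov_count     = count,      /* > 1 once vectors are translated */
        .addr          = dest_addr,
        .rma_iov       = remote_iov,
        .rma_iov_count = count,
        .context       = context,
        .data          = 0,
    };

    return fi_writemsg(ep, &msg, FI_COMPLETION);
}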

Additional context

From a quick scan of libfabric, it looks like gni and psm2 likely support this. Verbs may not, but we need to confirm; there are numerous FI_E2BIG return codes in the verbs provider, and I'm not sure which code paths are important here when using rxm.

@carns
Contributor Author

carns commented Feb 26, 2020

For some more context, here is code in MPICH that is issuing vector libfabric operations when needed:

https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L514

I don't see guards around the provider type there, though it's possible that I'm missing something.

@soumagne
Member

Yes, I had looked at that in the past but it was not supported; see ofiwg/libfabric#3062. We should have a look at it again and maybe add a query call to fall back for providers that do not support it.

@carns
Contributor Author

carns commented Feb 27, 2020

I didn't think to check the memory registration path; I just looked at what some of the codes are doing on the actual read/write RMA path. I guess we'll be testing both with this change :)

@carns
Contributor Author

carns commented Feb 27, 2020

MPICH appears to be setting msg.desc to NULL:

https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L480

I don't know the implications, just pointing that out.

@liuxuezhao
Collaborator

liuxuezhao commented Mar 4, 2020

I considered this a long time ago. Vector RMA should be supported by OFI, but providers impose limits on the local iov_count and the remote rma_iov_count (fi_info::tx_attr::iov_limit / ::rma_iov_limit), and each provider's limits can be queried with fi_getinfo(). Just FYI.
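For reference, a minimal sketch of querying those limits with fi_getinfo(); the provider-name handling and API version here are arbitrary, not Mercury code:

#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

/* Sketch: print the IOV-related limits a given provider reports. */
static void
print_iov_limits(const char *prov_name)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;

    if (!hints)
        return;
    hints->fabric_attr->prov_name = strdup(prov_name);

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) == 0) {
        printf("%s: iov_limit=%zu rma_iov_limit=%zu mr_iov_limit=%zu\n",
               prov_name,
               info->tx_attr->iov_limit,         /* local segments per fi_writev/fi_readv */
               info->tx_attr->rma_iov_limit,     /* remote segments per RMA operation */
               info->domain_attr->mr_iov_limit); /* segments per memory registration */
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
}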

@soumagne soumagne changed the title support vector rma operations in na_ofi NA OFI: support vector rma operations Jun 22, 2020
@soumagne
Member

@carns All providers set mr_iov_limit and rma_iov_limit to 1. I am still doing some more testing and we'll support it internally now, but I would keep hopes very low at this point :\

@soumagne
Member

@carns I was looking in more detail at the MPICH code that you had sent. I think it's still limited though, since they also always set iov_count and rma_iov_count to 1.

@carns
Contributor Author

carns commented Aug 19, 2020

Ah, Ok. Thanks for the updates.

@soumagne soumagne added this to the mercury-2.0.0 milestone Sep 3, 2020
@soumagne soumagne added this to To do in NA OFI via automation Sep 3, 2020
@soumagne soumagne modified the milestones: mercury-2.0.0, mercury-2.1.0 Oct 9, 2020
@soumagne
Member

soumagne commented Oct 9, 2020

Supporting code was added as part of a6bbce8 but there are issues remaining with libfabric that will need to be further investigated.

@soumagne soumagne modified the milestones: mercury-2.1.0, future Mar 29, 2021
@roblatham00
Contributor

What are the "issues remaining with libfabric"? Phil pointed me to this post https://lists.openfabrics.org/pipermail/libfabric-users/2021-June/000861.html documenting the FI_VERBS_TX_IOV_LIMIT and FI_VERBS_RX_IOV_LIMIT environment variables, and the ibv_devinfo -v utility. On Summit, it looks like we should set those environment variables to 30.

But they had no impact on the vectored transfer benchmark:
Before:

<op>    <warmup_seconds>        <concurrency>   <threads>       <xfer_size>     <vector_count>  <total_bytes>   <seconds>       <MiB/s> <xfer_memory>   <align_buffer>
PULL    1       1       0       524288  1       209230757888    20.000072       9976.864320     0
PULL    1       1       0       524288  2       209364451328    20.000095       9983.227515     0
PULL    1       1       0       524288  4       210315509760    20.000066       10028.592124    0
PULL    1       1       0       524288  8       210331238400    20.000078       10029.335904    0
PULL    1       1       0       524288  16      200076689408    20.000072       9540.365654     0
PULL    1       1       0       524288  32      192866156544    20.000097       9196.530270     0
PULL    1       1       0       524288  64      107745378304    20.000236       5137.639306     0
PULL    1       1       0       524288  128     53734277120     20.000313       2562.209957     0
PULL    1       1       0       524288  256     27163885568     20.000662       1295.232153     0

After:

PULL    1       1       0       524288  2       203238670336    20.000096       9691.128674     0
PULL    1       1       0       524288  4       202780442624    20.000068       9669.291918     0
PULL    1       1       0       524288  8       202621583360    20.000085       9661.708882     0
PULL    1       1       0       524288  16      193659404288    20.000131       9234.339565     0
PULL    1       1       0       524288  32      185765724160    20.000126       8857.944351     0
PULL    1       1       0       524288  64      112774348800    20.000202       5377.445768     0
PULL    1       1       0       524288  128     56699125760     20.000352       2703.577398     0
PULL    1       1       0       524288  256     28120711168     20.000367       1340.875384     0

@soumagne
Member

I would need to look at it again, but I had added support for it; you can turn it on by uncommenting this line: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L196
If I recall correctly, there were some issues with OFI not transferring things correctly. I don't think I have tried those environment variables though.

@roblatham00
Contributor

No change after uncommenting that define:

PULL    1       1       0       524288  1       188446408704    20.000071       8985.793186     0
PULL    1       1       0       524288  2       189848879104    20.000109       9052.650574     0
PULL    1       1       0       524288  4       192025198592    20.000094       9156.432103     0
PULL    1       1       0       524288  8       192173572096    20.000107       9163.500843     0
PULL    1       1       0       524288  16      183590453248    20.000067       8754.245780     0
PULL    1       1       0       524288  32      178335514624    20.000155       8503.634210     0
PULL    1       1       0       524288  64      108015386624    20.000248       5150.511207     0
PULL    1       1       0       524288  128     54678519808     20.000225       2607.245722     0
PULL    1       1       0       524288  256     26795311104     20.000391       1277.675006     0

@roblatham00
Contributor

Here's a graph of the three cases, just to make it more clear:

[graph: margo-summit-vector-bw]

@roblatham00
Contributor

a-ha! I stuck some debugging into hg_bulk_create: it tells me
==== bulk: count: 64 max_segments: 1

@roblatham00
Contributor

fi_info -v gives me this:

        domain: 0x0
        name: mlx5_1
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY ]
        mr_key_size: 4
        cq_data_size: 0
        cq_cnt: 16777216
        ep_cnt: 262144
        tx_ctx_cnt: 1024
        rx_ctx_cnt: 1024
        max_ep_tx_ctx: 1024
        max_ep_rx_ctx: 1024
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 8388608
        cntr_cnt: 0
        mr_iov_limit: 1

That mr_iov_limit is what I need to adjust: Mercury consults that field:

priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;

Setting the env variable changes the iov_limit, but the mr_iov_limit is hard-coded to 1 (see fi_verbs.h, #define VERBS_MR_IOV_LIMIT).

Dang, I really thought I had something by changing that define in libfabric, recompiling, and trying again; however, Mercury still tells me max_segments is 1. fi_info tells me mr_iov_limit is now 30... so why does Mercury insist it is 1?

@soumagne
Member

Ah OK, from what I see it looks like it's because of the rxm provider. When you're running fi_info, are you specifying verbs;ofi_rxm, or only verbs?

@roblatham00
Contributor

I'm not calling the fi_info() routine. I'm calling the fi_info utility, which I think by default dumps all the providers it can find?

The rxm angle is interesting, but I don't see "1" in there. I do see

prov/rxm/src/rxm.h:#define RXM_IOV_LIMIT 4

so I'd expect a "sticky" 4, but I don't know where the 1 comes from.

@roblatham00
Contributor

I figured "what the heck" and hard-coded 30 in Mercury's na_ofi_initialize:

diff --git a/src/na/na_ofi.c b/src/na/na_ofi.c
index f3e6e47d..1152f26b 100644
--- a/src/na/na_ofi.c
+++ b/src/na/na_ofi.c
@@ -3928,7 +3928,9 @@ na_ofi_initialize(na_class_t *na_class, const struct na_info *na_info,
 #endif
 
     /* Cache IOV max */
-    priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
+    //XXX: giant hack.  set to 30 and see what happens
+    //priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
+    priv->iov_max = 30;
 
     /* Create endpoint */
     ret = na_ofi_endpoint_open(priv->domain, node_ptr, src_addr, src_addrlen,

Now vector lengths of 1 are OK, and 32 and higher are OK, but vector lengths of 2-16 give me HG_INVALID_ARG.

@soumagne
Member

soumagne commented Jul 15, 2021

@roblatham00 I don't think I was implying that you were calling the fi_info() routine :) there are options that you can pass to the fi_info utility and one of them is -p, which lets you specify which provider you're querying. I was suggesting that you pass verbs\;ofi_rxm since otherwise, yes, it would just dump everything.

You're also confusing the MR IOV limit and the regular IOV limit, I think. The MR IOV limit is the maximum number of buffers that we can pass to memory registration calls at once, i.e. the maximum number of buffers that may be associated with a single memory region. This is basically what happens when you make a call to HG_Bulk_create() with N buffers: we register these N buffers if possible so that they appear as a single virtual region. The regular IOV limit, on the other hand, defines the maximum number of buffers at once that are passed to fi_writev() or fi_readv() operations.

We currently do not have a mode in mercury that allows for passing multiple buffers at once when doing a bulk transfer if they are not part of the same virtual region. I believe I would need to add another query call then / possibly additional NA callbacks (current NA put/get take only a single MR region) and pass that array of buffers and an array of memory regions. Let me look into this.
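To make the distinction concrete, here is a rough sketch (the buffer count and sizes are arbitrary, not taken from the benchmark) of the HG_Bulk_create() pattern described above, where N separate buffers end up behind one bulk handle and are therefore bounded by mr_iov_limit rather than the regular iov_limit:

#include <stdlib.h>
#include <mercury_bulk.h>

/* Sketch: expose four separate buffers through a single bulk handle. */
static hg_return_t
create_vector_bulk(hg_class_t *hg_class, hg_bulk_t *handle)
{
    void *bufs[4];
    hg_size_t sizes[4];

    for (int i = 0; i < 4; i++) {
        sizes[i] = 4096;
        bufs[i] = malloc(sizes[i]);
    }

    /* Whether these four segments become one native registration depends
     * on the provider's mr_iov_limit. */
    return HG_Bulk_create(hg_class, 4, bufs, sizes, HG_BULK_READWRITE, handle);
}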

@roblatham00
Contributor

Thanks for the libfabric information. You are 100% correct that I am confused by libfabric domains, providers, etc.

Let me be more concrete:

I'm using Phil's benchmark to measure this: https://github.com/mochi-hpc-experiments/mochi-tests/blob/main/perf-regression/margo-p2p-vector.c#L243

margo_bulk_create immediately calls HG_Bulk_create(): This is a single virtual region, right? We are just pointing to different parts of it?

@carns
Contributor Author

carns commented Jul 16, 2021

Right, that code is creating a single registration to represent a vector of memory regions. That entire registration (as a whole, including all vectors) is then transmitted with a bulk transfer operation.

Going by @soumagne's explanation, that means that this benchmark is reliant on the mr_iov_limit rather than the iov_limit (thanks for the explanation on that, I was wondering what the exact difference was).

Mercury doesn't have an API to pick out pieces within a contiguous memory registration, or for that matter to construct a vector from different registrations. That could be a neat capability, though. MPICH, for example, uses this to stitch pre-registered header structs onto data payloads. You could also imagine it being used for two-phase style aggregation, where one party doesn't know anything except the aggregate size, while the other parties fill in specific portions using vector operations. Right now we couldn't use vectors for that.
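As an aside, at the libfabric level the header-plus-payload case would go through the regular (non-MR) IOV limit. A hedged sketch with two independently registered local buffers written to one contiguous remote region; the function name, parameters, and remote address/key are placeholders, not MPICH or Mercury code:

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Sketch: stitch a pre-registered header onto a payload in one operation. */
static ssize_t
write_header_and_payload(struct fid_ep *ep, fi_addr_t dest,
                         void *hdr, size_t hdr_len, struct fid_mr *hdr_mr,
                         void *payload, size_t payload_len, struct fid_mr *payload_mr,
                         uint64_t remote_addr, uint64_t remote_key, void *context)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,     .iov_len = hdr_len },
        { .iov_base = payload, .iov_len = payload_len },
    };
    void *desc[2] = { fi_mr_desc(hdr_mr), fi_mr_desc(payload_mr) };

    /* Two local segments from different registrations, one contiguous
     * remote region; bounded by tx_attr->iov_limit, not mr_iov_limit. */
    return fi_writev(ep, iov, desc, 2, dest, remote_addr, remote_key, context);
}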

@soumagne
Member

I think, at the very least, the benchmark could still be relevant if the virtual selection made internally results in several segments being transferred at once (internally registered as separate regions). I think the first step is to enable that and add support for it within the NA layer, and then we can see about adding new vector-type APIs to HG Bulk itself.
