
prov/efa: unable to register more than 95GB of memory #9739

Closed
jhh67 opened this issue Jan 18, 2024 · 14 comments


jhh67 commented Jan 18, 2024

Describe the bug
While developing the Chapel runtime for the EFA provider, we encountered an error in which a single process cannot register more than 95GB of memory. Registering 95GB succeeds; 96GB fails with the following error:

OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Cannot allocate memory

To Reproduce
We do not have a simple reproducer; we currently test using the full Chapel runtime. We observed the error on an AWS c7i.48xlarge instance, which has one EFA NIC and 384GB of memory.

Expected behavior
I expect to be able to register more than 25% of the physical memory of the machine.

Output
The output with FI_LOG_LEVEL=Debug contained:

libfabric:15409:1705531788::efa:mr:efa_mr_reg_impl():850<warn> Unable to register MR: Cannot allocate memory
libfabric:15409:1705531788::efa:mr:efa_mr_regattr():982<warn> Unable to register MR: Cannot allocate memory

Environment:
This is on an AWS c7i.48xlarge instance using libfabric 1.19 and the EFA provider, with FI_EFA_USE_DEVICE_RDMA=1 set in the environment.


@jhh67 added the bug label Jan 18, 2024

j-xiong commented Jan 20, 2024

What is the output of ulimit -l?


shijin-aws commented Jan 21, 2024

It's not a bug. The EFA device has a limit on the number of host pages that can be registered. If you are currently allocating your memory with regular 4KiB pages, using huge pages (2MiB on some platforms) reduces the page count and allows you to register more memory.
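
To put numbers on the page-count argument (simple arithmetic, not a published device limit): 96GB backed by 4KiB pages is 96 * 1024^3 / 4096 = 25,165,824 page entries, while the same buffer backed by 2MiB huge pages needs only 49,152, a 512x reduction in the number of pages the device must track.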

@shijin-aws removed the bug label Jan 21, 2024

jhh67 commented Jan 26, 2024

Thank you for your suggestions. We will try them and get back to you with the results.


jhh67 commented Feb 13, 2024

We haven't had any luck registering more than 95GB of memory using huge pages. Can you provide some guidance on how to make this work? ulimit -l reports unlimited, so that isn't the issue. We tried explicit huge pages via libhugetlbfs but encountered errors when registering the memory:

internal error: 0: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address
internal error: 1: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address

We also tried transparent 2MB huge pages and mmap with MAP_HUGETLB. With this method we are sometimes able to register up to 155GB of memory, but not always. Is there documentation on getting the EFA provider working with huge pages?


shijin-aws commented Feb 13, 2024

We also tried using transparent 2MB hugepages and mmap with MAP_HUGETLB.

I don't think EFA supports transparent huge pages. If you have the EFA installer installed on your instance, you should be able to see that huge pages are reserved:

(env) [ec2-user@ip-172-31-51-162 ~]$ cat /sys/kernel/mm/hugepages/**/nr_hugepages
0
14081

You can increase this count to allow larger huge page allocations.

Libfabric uses this call to allocate buffers from the huge page pool:

*memptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

And the EFA provider allocates its internal buffer pools from the huge page pool by default. Did you use the same mmap call in your application to allocate huge page memory?
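
For reference, a minimal standalone sketch of that allocation path (not the thread's exact code; the 96GB size is illustrative, and note that mmap reports failure as MAP_FAILED rather than NULL):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 96UL << 30;	/* 96 GiB, illustrative */

	/* Same flags as the libfabric snippet above: an anonymous
	 * mapping backed by huge pages. */
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {	/* mmap failure is MAP_FAILED, not NULL */
		fprintf(stderr, "mmap: %s\n", strerror(errno));
		/* ENOMEM here usually means nr_hugepages is too small. */
		return 1;
	}
	printf("mapped %zu bytes from the huge page pool\n", size);
	munmap(buf, size);
	return 0;
}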

jabraham17 commented

Yes, we used start = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);

This returned (void*)-1 (MAP_FAILED) and failed with "Cannot allocate memory". We also validated that cat /sys/kernel/mm/hugepages/**/nr_hugepages prints a non-zero value.

We also tried adding (21 << MAP_HUGE_SHIFT) to the mmap flags to request a specific huge page size; this made no difference.
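
For reference, (21 << MAP_HUGE_SHIFT) encodes log2(2MiB) in the flags, the same request linux/mman.h spells MAP_HUGE_2MB. A minimal sketch of that call (the helper name is hypothetical):

#define _GNU_SOURCE
#include <sys/mman.h>	/* MAP_HUGETLB, MAP_HUGE_SHIFT */

/* Hypothetical helper: map size bytes backed specifically by 2MiB pages. */
static void *map_huge_2mb(size_t size)
{
	return mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		    (21 << MAP_HUGE_SHIFT),	/* 2^21 bytes = 2MiB pages */
		    -1, 0);			/* fd should be -1 for anonymous maps */
}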

bradcray commented

@shijin-aws / @j-xiong : Any further suggestions here for how to make progress? Have you successfully been able to register 96+GB of memory in your work?


j-xiong commented Mar 22, 2024

Have you tried increasing the count at /sys/kernel/mm/hugepages/**/nr_hugepages as suggested by @shijin-aws?

shijin-aws commented

As I mentioned earlier, you need to increase /sys/kernel/mm/hugepages/**/nr_hugepages because the default value is only configured for the EFA provider's internal bounce buffer pool usage, which is far less than 96GB.
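
For example, to reserve roughly 120GB worth of 2MiB pages (61440 pages; the count is illustrative and must cover your registrations plus the provider's internal pools):

echo 61440 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages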


shijin-aws commented Mar 24, 2024

OK, I just made a quick test on c7i.48xlarge and made fabtests allocate a 100GB buffer backed by huge pages:

diff --git a/fabtests/common/shared.c b/fabtests/common/shared.c
index fc228f4d8..ae0c5301e 100644
--- a/fabtests/common/shared.c
+++ b/fabtests/common/shared.c
@@ -630,9 +636,20 @@ int ft_alloc_msgs(void)
                buf_size += alignment;
                ret = ft_hmem_alloc(opts.iface, opts.device, (void **) &buf,
                                    buf_size);
+
+               buf_size *= 100; // buf_size was 1 GB
+               buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
+                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+               if (buf == MAP_FAILED) {
+                       FT_PRINTERR("mmap", errno);
+                       ret = -FI_ENOMEM;
+                       return ret;
+               }
+               printf("allocated memory of size %lu \n", buf_size);
                if (ret)
                        return ret;
....
        if (!ft_mr_alloc_func && !ft_check_opts(FT_OPT_SKIP_REG_MR)) {
-               ret = ft_reg_mr(fi, rx_buf, rx_buf_size + tx_buf_size,
+               ret = ft_reg_mr(fi, rx_buf, buf_size,
                                ft_info_to_mr_access(fi),
                                FT_MR_KEY, opts.iface, opts.device, &mr,
                                &mr_desc);
                if (ret)
                        return ret;
+               printf("successfully register memory for rx buf\n");

I needed to increase nr_hugepages to 61121, which reserves 2MiB * 61121 ~ 120GB of memory for huge pages.

ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
61121

And finally the registration succeeded.

ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ FI_LOG_LEVEL=warn fi_rdm_tagged_pingpong -p efa
allocated memory of size 107374201600
successfully register memory for rx buf

Let me know if you still have questions, @bradcray.


jhh67 commented May 6, 2024

Resolved by chapel-lang/chapel#24971.

@jhh67 closed this as completed May 6, 2024

jabraham17 commented May 6, 2024

Just pointing out one of the issues we ran into while implementing chapel-lang/chapel#24971, due to some missing cleanup on our end. We had some missing fi_close calls when using the EFA provider, which seemed to cause subsequent runs with huge pages to fail. We would be able to run once, but a second run would fail with an unknown error during memory registration. This appeared to be something not being properly released, even after the process exited.

In summary, when the EFA teardown was not invoked, subsequent runs would fail until the compute nodes were restarted. Is this intentional?

shijin-aws commented

The EFA provider uses an MR cache for host memory by default, and all MR deregistrations are actually deferred: an MR is put into an LRU list if its use count is 0, or into the dead region list if the application frees the buffer. EFA domain close cleans up the MR cache by flushing all MRs in the LRU and dead region lists. If you don't close your MRs with fi_close, I'd expect them still to be flushed as long as you freed the buffers.
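
For context, a minimal sketch of the teardown order this implies, using the libfabric fi_close API (the function and variable names are illustrative):

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Close objects in reverse-open order so the EFA provider's MR cache is
 * flushed before the domain goes away. */
static void teardown(struct fid_mr *mr, struct fid_ep *ep,
		     struct fid_domain *domain, struct fid_fabric *fabric)
{
	fi_close(&mr->fid);	/* dereg; with the MR cache this may be deferred */
	fi_close(&ep->fid);
	fi_close(&domain->fid);	/* domain close flushes the LRU and dead-region lists */
	fi_close(&fabric->fid);
}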
