
prov/efa: unable to register more than 95GB of memory #9739

Closed
jhh67 opened this issue Jan 18, 2024 · 14 comments


jhh67 commented Jan 18, 2024

Describe the bug
While developing the Chapel runtime for the EFA provider, we encountered an error in which a single process cannot register more than 95GB of memory. Registering 95GB succeeds; 96GB fails with the following error:

OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Cannot allocate memory

To Reproduce
We do not have a simple reproducer; we currently test using the full Chapel runtime. We observed the error on an AWS c7i.48xlarge instance, which has one EFA NIC and 384GB of memory.

Expected behavior
I expect to be able to register more than 25% of the physical memory of the machine.

Output
The output with FI_LOG_LEVEL=Debug contained:

libfabric:15409:1705531788::efa:mr:efa_mr_reg_impl():850<warn> Unable to register MR: Cannot allocate memory
libfabric:15409:1705531788::efa:mr:efa_mr_regattr():982<warn> Unable to register MR: Cannot allocate memory

Environment:
This is on an AWS c7i.48xlarge instance using libfabric 1.19 and the EFA provider, with FI_EFA_USE_DEVICE_RDMA=1 set in the environment.


@jhh67 added the bug label Jan 18, 2024

j-xiong commented Jan 20, 2024

What is the output of ulimit -l?


shijin-aws commented Jan 21, 2024

It's not a bug. The EFA device has a limit on the number of host pages that can be registered. If you are currently allocating your memory with regular 4KiB pages, using huge pages (2MiB on some platforms) reduces the page count and allows you to register more memory.
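
To put numbers on the page-count argument (simple arithmetic, not a published device limit): 96GB backed by 4KiB pages is 96 * 1024^3 / 4096 = 25,165,824 page entries, while the same buffer backed by 2MiB huge pages needs only 49,152, a 512x reduction in the number of pages the device must track.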

@shijin-aws removed the bug label Jan 21, 2024

jhh67 commented Jan 26, 2024

Thank you for your suggestions. We will try them and get back to you with the results.


jhh67 commented Feb 13, 2024

We haven't had any luck registering more than 95GB of memory using huge pages. Can you provide some guidance on how to make this work? ulimit -l reports unlimited, so that isn't the issue. We tried explicit huge pages via libhugetlbfs but encountered errors when registering the memory:

internal error: 0: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address
internal error: 1: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address

We also tried transparent 2MB huge pages and mmap with MAP_HUGETLB. With this method we are sometimes able to register up to 155GB of memory, but not always. Is there documentation on getting the EFA provider working with huge pages?


shijin-aws commented Feb 13, 2024

We also tried using transparent 2MB hugepages and mmap with MAP_HUGETLB.

I don't think EFA supports transparent huge pages. If you have the EFA installer installed on your instance, you should be able to see that huge pages are reserved:

(env) [ec2-user@ip-172-31-51-162 ~]$ cat /sys/kernel/mm/hugepages/**/nr_hugepages
0
14081

You can increase this count to allow larger huge page allocations.

Libfabric uses this call to allocate buffers from the huge page pool:

*memptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

And the EFA provider allocates its internal buffer pools from the huge page pool by default. Did you use the same mmap call in your application to allocate huge page memory?
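
For reference, a minimal standalone sketch of that allocation path (not the thread's exact code; the 96GB size is illustrative, and note that mmap reports failure as MAP_FAILED rather than NULL):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 96UL << 30;	/* 96 GiB, illustrative */

	/* Same flags as the libfabric snippet above: an anonymous
	 * mapping backed by huge pages. */
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {	/* mmap failure is MAP_FAILED, not NULL */
		fprintf(stderr, "mmap: %s\n", strerror(errno));
		/* ENOMEM here usually means nr_hugepages is too small. */
		return 1;
	}
	printf("mapped %zu bytes from the huge page pool\n", size);
	munmap(buf, size);
	return 0;
}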

jabraham17 commented

Yes, we used start = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);

This returned (void*)-1 (MAP_FAILED) and failed with "Cannot allocate memory". We also validated that cat /sys/kernel/mm/hugepages/**/nr_hugepages prints a non-zero value.

We also tried adding (21 << MAP_HUGE_SHIFT) to the mmap flags to request a specific huge page size; this made no difference.
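
For reference, (21 << MAP_HUGE_SHIFT) encodes log2(2MiB) in the flags, the same request linux/mman.h spells MAP_HUGE_2MB. A minimal sketch of that call (the helper name is hypothetical):

#define _GNU_SOURCE
#include <sys/mman.h>	/* MAP_HUGETLB, MAP_HUGE_SHIFT */

/* Hypothetical helper: map size bytes backed specifically by 2MiB pages. */
static void *map_huge_2mb(size_t size)
{
	return mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		    (21 << MAP_HUGE_SHIFT),	/* 2^21 bytes = 2MiB pages */
		    -1, 0);			/* fd should be -1 for anonymous maps */
}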

bradcray commented

@shijin-aws / @j-xiong : Any further suggestions here for how to make progress? Have you successfully been able to register 96+GB of memory in your work?


j-xiong commented Mar 22, 2024

Have you tried increasing the count at /sys/kernel/mm/hugepages/**/nr_hugepages as suggested by @shijin-aws?

shijin-aws commented

As I mentioned earlier, you need to increase /sys/kernel/mm/hugepages/**/nr_hugepages because the default value is only configured for the EFA provider's internal bounce buffer pool usage, which is far less than 96GB.
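
For example, to reserve roughly 120GB worth of 2MiB pages (61440 pages; the count is illustrative and must cover your registrations plus the provider's internal pools):

echo 61440 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages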


shijin-aws commented Mar 24, 2024

OK, I just made a quick test on c7i.48xlarge and made fabtests allocate a 100GB buffer backed by huge pages:

diff --git a/fabtests/common/shared.c b/fabtests/common/shared.c
index fc228f4d8..ae0c5301e 100644
--- a/fabtests/common/shared.c
+++ b/fabtests/common/shared.c
@@ -630,9 +636,20 @@ int ft_alloc_msgs(void)
                buf_size += alignment;
                ret = ft_hmem_alloc(opts.iface, opts.device, (void **) &buf,
                                    buf_size);
+
+               buf_size *= 100; // buf_size was 1 GB
+               buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
+                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+               if (buf == MAP_FAILED) {
+                       FT_PRINTERR("mmap", errno);
+                       ret = -FI_ENOMEM;
+                       return ret;
+               }
+               printf("allocated memory of size %lu \n", buf_size);
                if (ret)
                        return ret;
....
        if (!ft_mr_alloc_func && !ft_check_opts(FT_OPT_SKIP_REG_MR)) {
-               ret = ft_reg_mr(fi, rx_buf, rx_buf_size + tx_buf_size,
+               ret = ft_reg_mr(fi, rx_buf, buf_size,
                                ft_info_to_mr_access(fi),
                                FT_MR_KEY, opts.iface, opts.device, &mr,
                                &mr_desc);
                if (ret)
                        return ret;
+               printf("successfully register memory for rx buf\n");

I needed to increase nr_hugepages to 61121, which reserves 2MiB * 61121 ~ 120GB of memory for huge pages.

ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
61121

And finally the registration succeeded.

ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ FI_LOG_LEVEL=warn fi_rdm_tagged_pingpong -p efa
allocated memory of size 107374201600
successfully register memory for rx buf

Let me know if you still have questions, @bradcray.


jhh67 commented May 6, 2024

Resolved by chapel-lang/chapel#24971.

@jhh67 closed this as completed May 6, 2024

jabraham17 commented May 6, 2024

Just pointing out one of the issues we ran into while implementing chapel-lang/chapel#24971, due to some missing cleanup on our end. We had some missing fi_close calls when using the EFA provider, which seemed to cause subsequent runs with huge pages to fail. We would be able to run once, but a second run would fail with an unknown error during memory registration. This appeared to be something not being properly released, even after the process exited.

In summary, when the EFA teardown was not invoked, subsequent runs would fail until the compute nodes were restarted. Is this intentional?

shijin-aws commented

The EFA provider uses an MR cache for host memory by default, and all MR deregistrations are actually deferred: an MR is put into an LRU list if its use count is 0, or into the dead region list if the application frees the buffer. EFA domain close cleans up the MR cache by flushing all MRs in the LRU and dead region lists. If you don't close your MRs with fi_close, I'd expect them still to be flushed as long as you freed the buffers.
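
For context, a minimal sketch of the teardown order this implies, using the libfabric fi_close API (the function and variable names are illustrative):

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Close objects in reverse-open order so the EFA provider's MR cache is
 * flushed before the domain goes away. */
static void teardown(struct fid_mr *mr, struct fid_ep *ep,
		     struct fid_domain *domain, struct fid_fabric *fabric)
{
	fi_close(&mr->fid);	/* dereg; with the MR cache this may be deferred */
	fi_close(&ep->fid);
	fi_close(&domain->fid);	/* domain close flushes the LRU and dead-region lists */
	fi_close(&fabric->fid);
}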
