
prov/efa: Using Shared Memory with EFA results in strange errors with libfabric 1.19 #9694

Open
Seth5141 opened this issue Jan 3, 2024 · 7 comments

Comments

@Seth5141

Seth5141 commented Jan 3, 2024

Describe the bug
In libfabric 1.19 (from the aws-efa installer v1.30), I am unable to communicate over EFA when the shared memory region involved is registered more than once, whether with the same EFA device or with different ones.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
Libfabric 1.18 and 1.19 should behave the same way in this case.

Output
(screenshot of the error output; reposted as text in a comment below)

Environment:
ubuntu on a P5DN instance

Additional context
(screenshot)


@shijin-aws
Contributor

Libfabric 1.18 and 1.19 differ significantly in how the EFA provider uses the SHM provider to offload intra-node traffic.

Regarding the specific error you reported, "Unexpected status received from remote": this error usually indicates that the remote peer is down for some reason. I don't have a theory for how this could relate to the shm provider, since it is a completion error from the EFA NIC. Can you run with FI_LOG_LEVEL=warn in your environment and see whether more logs are printed?

Another thing worth trying is disabling shm provider usage inside EFA via FI_EFA_ENABLE_SHM_TRANSFER=0, to see whether the behavior changes.
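For reference, both settings are ordinary environment variables that libfabric reads during initialization, so they can either be exported in the shell before launching the job or set programmatically before the first libfabric call. A minimal sketch of the programmatic route, assuming nothing about your application beyond standard libfabric initialization (the fi_getinfo setup here is placeholder code, not your actual init path):

```c
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    /* These must be set before the first libfabric call, because the
     * library reads its environment variables at initialization time. */
    setenv("FI_LOG_LEVEL", "warn", 1);            /* verbose provider warnings */
    setenv("FI_EFA_ENABLE_SHM_TRANSFER", "0", 1); /* disable shm offload in EFA */

    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->fabric_attr->prov_name = strdup("efa");

    int ret = fi_getinfo(FI_VERSION(1, 19), NULL, NULL, 0, hints, &info);
    /* ... normal fabric/domain/endpoint setup would follow ... */

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return ret;
}
```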

@shijin-aws shijin-aws changed the title [aws-efa-team] Using Shared Memory with EFA results in strange errors with libfabric 1.19 Using Shared Memory with EFA results in strange errors with libfabric 1.19 Jan 3, 2024
@shijin-aws shijin-aws changed the title Using Shared Memory with EFA results in strange errors with libfabric 1.19 prov/efa: Using Shared Memory with EFA results in strange errors with libfabric 1.19 Jan 3, 2024
@shijin-aws
Contributor

Could you also clarify how you communicate over EFA when using shared memory? I am not very familiar with NVSHMEM, and it's not clear to me how you do inter-node communication (shown in your log) with shared memory. Does that mean the intra-node traffic is handled by NVSHMEM and it only uses Libfabric for inter-node traffic?

@Seth5141
Author

Seth5141 commented Jan 5, 2024

Right, apologies, that didn't come across correctly (also, the formatting got really messed up; I'll post the error again). The connection between the communication and the shared memory is a bit more indirect than that.

We use shared memory only for intra-node communication.

The pointers we use as our symmetric heap (VA) point to the shared memory region.

We register our symmetric heap addresses with the NIC using fi_mr_regattr (see the sketch after this list).

There are two distinct cases:

  1. Rail Optimized - Each process on a given node registers the entire shared memory region with its respective EFA device.
  2. Non Rail Optimized - Each process on a given node registers only its own portion of the shared memory region with its respective EFA device.
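Roughly, the registration in case (1) looks like the sketch below; handle and variable names are illustrative rather than our exact code:

```c
#include <string.h>
#include <sys/uio.h>
#include <rdma/fi_domain.h>

/* Case (1) sketch: each process registers the entire mapped shared
 * memory region with its own EFA device. 'domain' is the fid_domain
 * opened on that device; shm_base/shm_len describe the mapped region. */
static int register_symmetric_heap(struct fid_domain *domain,
                                   void *shm_base, size_t shm_len,
                                   struct fid_mr **mr_out)
{
    struct iovec iov = { .iov_base = shm_base, .iov_len = shm_len };
    struct fi_mr_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.mr_iov    = &iov;
    attr.iov_count = 1;
    attr.access    = FI_REMOTE_READ | FI_REMOTE_WRITE; /* RMA target access */

    return fi_mr_regattr(domain, &attr, 0, mr_out);
}
```

Case (2) differs only in that the iovec covers just this process's slice of the region rather than the whole thing.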

Libfabric from installer 1.30 (1.19)
case (1) - When we try to issue a write operation from an address in the shared memory region, we see the error below:

WARN: Nonzero error count progressing EP 1 (1)
 
WARN: CQ 1 reported error (5): Input/output error
                Provider error: Unexpected status received from remote My EFA addr: fi_addr_efa://[fe80::5:c6ff:fe54:ee97]:1:1253370189 My host id: i-0b2d50ee0310af27b Peer EFA addr: fi_addr_efa://[fe80::c0:ebff:fe98:3939]:1:545311823 Peer host id: i-0a9a44173caf39741
                Supplemental error info: none

case (2) - When we try to issue a write operation from the same address in the shared memory region, we have no problems.

Libfabric from installer 1.22 (1.17)
Case (1) and Case (2) - no issues.

The only real difference between the two cases is the registration size (~1 GiB vs. ~2 GiB).
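For completeness, the failing operation in case (1) amounts to an RMA write sourced from the registered region, with the failure surfacing as an error completion on the TX CQ, as in the log above. A rough sketch of the pattern, with all handles as placeholders and assuming the CQ was opened with FI_CQ_FORMAT_CONTEXT:

```c
#include <stdio.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>

/* Sketch: issue an RMA write from the registered shared memory region,
 * then drain the TX completion queue, printing provider error detail
 * when an error completion arrives. */
static int write_and_check(struct fid_ep *ep, struct fid_cq *cq,
                           void *src, size_t len, struct fid_mr *mr,
                           fi_addr_t peer, uint64_t raddr, uint64_t rkey)
{
    ssize_t ret = fi_write(ep, src, len, fi_mr_desc(mr),
                           peer, raddr, rkey, NULL);
    if (ret)
        return (int)ret;

    struct fi_cq_entry comp;
    while ((ret = fi_cq_read(cq, &comp, 1)) == -FI_EAGAIN)
        ;   /* spin until a completion (or error) is available */

    if (ret == -FI_EAVAIL) {    /* an error completion is queued */
        struct fi_cq_err_entry err = { 0 };
        fi_cq_readerr(cq, &err, 0);
        fprintf(stderr, "CQ error (%d): %s\n", err.err,
                fi_cq_strerror(cq, err.prov_errno, err.err_data, NULL, 0));
        return -err.err;
    }
    return ret < 0 ? (int)ret : 0;
}
```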

@shijin-aws
Contributor

@Seth5141 Thanks for the clarification. I am still interested in seeing your results with FI_LOG_LEVEL=warn.

@Seth5141
Author

Seth5141 commented Jan 5, 2024

That output was already captured with FI_LOG_LEVEL=warn set.

Unfortunately, I don't have the entire logs available to me. I failed to grab them off the machine before the allocation I had access to expired. I don't have an immediate way to get back on a system to continue testing.

I can comment on this though:

FI_EFA_ENABLE_SHM_TRANSFER=0

I was playing with options to try and get to the bottom of this and did toggle that flag. It didn't have any effect on my results.

@shijin-aws
Contributor

@Seth5141 It will be hard to identify the issue without more warning logs to help us understand why the peer host (id: i-0a9a44173caf39741) is down, or without reproduction steps for us to try.

@Seth5141
Author

Agreed. I'll let you know more information as soon as I can get access to another P5D instance. Until then, I am as stuck as you.
Any chance you know where I could get quick access?
