
prov/efa: Using Shared Memory with EFA results in strange errors with libfabric 1.19 #9694

Open
Seth5141 opened this issue Jan 3, 2024 · 7 comments

Comments

@Seth5141

Seth5141 commented Jan 3, 2024

Describe the bug
In libfabric 1.19 (from the aws-efa installer v1.30), I am unable to communicate over EFA when the shared memory region involved is registered more than once, whether with the same EFA device or with different ones.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
Libfabric 1.18 and 1.19 should behave the same way in this case.

Output
(screenshot of the error output; reposted as text in a comment below)

Environment:
ubuntu on a P5DN instance

Additional context
(screenshot)


@shijin-aws
Contributor

Libfabric 1.18 and 1.19 differ significantly in how the EFA provider uses the SHM provider to offload intra-node traffic.

Regarding the specific error you reported, "Unexpected status received from remote": this error usually indicates that the remote peer is down for some reason. I don't have a theory for how this could relate to the shm provider, since it is a completion error from the EFA NIC. Can you run with FI_LOG_LEVEL=warn in your environment and see whether more logs are printed?

Another thing worth trying is disabling shm provider usage inside EFA via FI_EFA_ENABLE_SHM_TRANSFER=0, to see whether the behavior changes.
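For reference, both settings are ordinary environment variables that libfabric reads during initialization, so they can either be exported in the shell before launching the job or set programmatically before the first libfabric call. A minimal sketch of the programmatic route, assuming nothing about your application beyond standard libfabric initialization (the fi_getinfo setup here is placeholder code, not your actual init path):

```c
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    /* These must be set before the first libfabric call, because the
     * library reads its environment variables at initialization time. */
    setenv("FI_LOG_LEVEL", "warn", 1);            /* verbose provider warnings */
    setenv("FI_EFA_ENABLE_SHM_TRANSFER", "0", 1); /* disable shm offload in EFA */

    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->fabric_attr->prov_name = strdup("efa");

    int ret = fi_getinfo(FI_VERSION(1, 19), NULL, NULL, 0, hints, &info);
    /* ... normal fabric/domain/endpoint setup would follow ... */

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return ret;
}
```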

@shijin-aws shijin-aws changed the title [aws-efa-team] Using Shared Memory with EFA results in strange errors with libfabric 1.19 Using Shared Memory with EFA results in strange errors with libfabric 1.19 Jan 3, 2024
@shijin-aws shijin-aws changed the title Using Shared Memory with EFA results in strange errors with libfabric 1.19 prov/efa: Using Shared Memory with EFA results in strange errors with libfabric 1.19 Jan 3, 2024
@shijin-aws
Contributor

Could you also clarify how you communicate over EFA when using shared memory? I am not very familiar with NVSHMEM, and it's not clear to me how you do inter-node communication (shown in your log) with shared memory. Does that mean the intra-node traffic is handled by NVSHMEM and it only uses Libfabric for inter-node traffic?

@Seth5141
Author

Seth5141 commented Jan 5, 2024

Right, apologies, that didn't come across correctly (also, the formatting got really messed up; I'll post the error again). The connection between the communication and the shared memory is a bit more indirect than that.

We use shared memory only for intra-node communication.

The pointers we use as our symmetric heap (VA) point to the shared memory region.

We register our symmetric heap addresses with the NIC using fi_mr_regattr (see the sketch after this list).

There are two distinct cases:

  1. Rail Optimized - Each process on a given node registers the entire shared memory region with its respective EFA device.
  2. Non Rail Optimized - Each process on a given node registers only its own portion of the shared memory region with its respective EFA device.
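Roughly, the registration in case (1) looks like the sketch below; handle and variable names are illustrative rather than our exact code:

```c
#include <string.h>
#include <sys/uio.h>
#include <rdma/fi_domain.h>

/* Case (1) sketch: each process registers the entire mapped shared
 * memory region with its own EFA device. 'domain' is the fid_domain
 * opened on that device; shm_base/shm_len describe the mapped region. */
static int register_symmetric_heap(struct fid_domain *domain,
                                   void *shm_base, size_t shm_len,
                                   struct fid_mr **mr_out)
{
    struct iovec iov = { .iov_base = shm_base, .iov_len = shm_len };
    struct fi_mr_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.mr_iov    = &iov;
    attr.iov_count = 1;
    attr.access    = FI_REMOTE_READ | FI_REMOTE_WRITE; /* RMA target access */

    return fi_mr_regattr(domain, &attr, 0, mr_out);
}
```

Case (2) differs only in that the iovec covers just this process's slice of the region rather than the whole thing.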

Libfabric from installer 1.30 (1.19)
case (1) - When we try to issue a write operation from an address in the shared memory region, we see the error below:

WARN: Nonzero error count progressing EP 1 (1)
 
WARN: CQ 1 reported error (5): Input/output error
                Provider error: Unexpected status received from remote My EFA addr: fi_addr_efa://[fe80::5:c6ff:fe54:ee97]:1:1253370189 My host id: i-0b2d50ee0310af27b Peer EFA addr: fi_addr_efa://[fe80::c0:ebff:fe98:3939]:1:545311823 Peer host id: i-0a9a44173caf39741
                Supplemental error info: none

case (2) - When we try to issue a write operation from the same address in the shared memory region, we have no problems.

Libfabric from installer 1.22 (1.17)
Case (1) and Case (2) - no issues.

The only real difference between the two cases is the registration size (~1 GiB vs. ~2 GiB).
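For completeness, the failing operation in case (1) amounts to an RMA write sourced from the registered region, with the failure surfacing as an error completion on the TX CQ, as in the log above. A rough sketch of the pattern, with all handles as placeholders and assuming the CQ was opened with FI_CQ_FORMAT_CONTEXT:

```c
#include <stdio.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>

/* Sketch: issue an RMA write from the registered shared memory region,
 * then drain the TX completion queue, printing provider error detail
 * when an error completion arrives. */
static int write_and_check(struct fid_ep *ep, struct fid_cq *cq,
                           void *src, size_t len, struct fid_mr *mr,
                           fi_addr_t peer, uint64_t raddr, uint64_t rkey)
{
    ssize_t ret = fi_write(ep, src, len, fi_mr_desc(mr),
                           peer, raddr, rkey, NULL);
    if (ret)
        return (int)ret;

    struct fi_cq_entry comp;
    while ((ret = fi_cq_read(cq, &comp, 1)) == -FI_EAGAIN)
        ;   /* spin until a completion (or error) is available */

    if (ret == -FI_EAVAIL) {    /* an error completion is queued */
        struct fi_cq_err_entry err = { 0 };
        fi_cq_readerr(cq, &err, 0);
        fprintf(stderr, "CQ error (%d): %s\n", err.err,
                fi_cq_strerror(cq, err.prov_errno, err.err_data, NULL, 0));
        return -err.err;
    }
    return ret < 0 ? (int)ret : 0;
}
```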

@shijin-aws
Contributor

@Seth5141 Thanks for the clarification. I am still interested in seeing your results with FI_LOG_LEVEL=warn.

@Seth5141
Author

Seth5141 commented Jan 5, 2024

That output was already captured with FI_LOG_LEVEL=warn set.

Unfortunately, I don't have the entire logs available to me. I failed to grab them off the machine before the allocation I had access to expired. I don't have an immediate way to get back on a system to continue testing.

I can comment on this though:

FI_EFA_ENABLE_SHM_TRANSFER=0

I was playing with options to try and get to the bottom of this and did toggle that flag. It didn't have any effect on my results.

@shijin-aws
Contributor

@Seth5141 It will be hard to identify the issue without more warning logs to help us understand why the peer host (id: i-0a9a44173caf39741) is down, or without reproduction steps for us to try.

@Seth5141
Author

Agreed. I'll let you know more information as soon as I can get access to another P5D instance. Until then, I am as stuck as you.
Any chance you know where I could get quick access?
