
prov/efa: Various zero-copy receive bugfixes #10038

Closed
wants to merge 5 commits

Conversation

darrylabbate
Member

This includes the CQ data header size for the zero-copy receive use case.

@darrylabbate darrylabbate requested a review from a team May 15, 2024 23:47
@@ -148,8 +148,9 @@ uint32_t *efa_rdm_pke_get_req_connid_ptr(struct efa_rdm_pke *pkt_entry)
return &raw_addr->qkey;
}

if (base_hdr->flags & EFA_RDM_REQ_OPT_CQ_DATA_HDR)
if (base_hdr->flags & EFA_RDM_REQ_OPT_CQ_DATA_HDR || pkt_entry->ep->use_zcpy_rx) {
opt_hdr += sizeof(struct efa_rdm_req_opt_cq_data_hdr);
Contributor

libfabric convention is to not have { for single line if

@darrylabbate darrylabbate changed the title prov/efa: Fix optional header size calculations prov/efa: Various zero-copy receive bugfixes May 17, 2024
@darrylabbate
Member Author

darrylabbate commented May 17, 2024

Latest push doesn't yet include fixes for every known bug re: CQ data header or zero-copy receive path

@darrylabbate
Member Author

Latest push should fix hanging unit tests. Still testing changes related to missing logic when handling the receive completion - should be the last known bug re: zero-copy.

This also adds an assert() before dereferencing the struct efa_rdm_peer
pointer.

Signed-off-by: Darryl Abbate <drl@amazon.com>
This leverages compound literal syntax to initialize struct objects, ensuring all unspecified members are zeroed out. This avoids potential bugs due to fields not explicitly set to 0 or NULL.

Signed-off-by: Darryl Abbate <drl@amazon.com>
Alternatively, the utility functions could be rewritten to short-circuit
with an iov_count of 0.

Signed-off-by: Darryl Abbate <drl@amazon.com>
This includes the CQ data header size for the zero-copy receive use-case

Signed-off-by: Darryl Abbate <drl@amazon.com>
This treats zero-copy receive as a special case, deferring to minimal logic to handle the completion. This avoids the bloated pke dispatch.

Signed-off-by: Darryl Abbate <drl@amazon.com>
} else {
rxe->cq_entry.len = pkt_entry->pkt_size + sizeof(struct efa_rdm_pke);
efa_rdm_pke_rtm_handle_zcpy_recv_completion(pkt_entry, rxe);
Contributor

why do you still need to handle zcpy recv in this function?

Member Author

I figure the function is just poorly named. Is it safe to remove the logic completely from efa_rdm_pke_proc_matched_eager_rtm()?

Contributor

I mean you already made a branch in efa_rdm_pke_handle_recv_completion so that zcpy received pkts do not enter efa_rdm_pke_proc_received, which calls efa_rdm_pke_proc_matched_eager_rtm. Then we can remove this logic safely here, right?

But Brian is doubting if this shortcut is the best approach, we can discuss that.


rxe = pkt_entry->ope;
int err = 0;
struct efa_rdm_ope *rxe = pkt_entry->ope;

if (pkt_entry->alloc_type != EFA_RDM_PKE_FROM_USER_BUFFER) {
Contributor

We shouldn't expect a user buffer allocated pkt if we handle zcpy recv completion outside efa_rdm_pke_proc_received

Contributor

@bwbarrett bwbarrett left a comment

I'm a little worried that we're continuing down an approach where we try to thread the zero copy path through existing state machine in a really fragile way. I know this is trying to move quickly, but we really need to rethink the approach.

if (ep->base_ep.util_ep.tx_msg_flags == 0)
tx_op_flags &= ~FI_COMPLETION;

*txe = (struct efa_rdm_ope) {
Contributor

yikes, this is a huge structure that we're now initializing to zero. I'm not really sure I understand why the structure is so large, but it's concerning that every operation starts with initializing something that big. This is not reasonable for a zero-copy path in particular.

Member Author

Zero-initializing was the point of the refactor here. I find the alternative prone to bugs and less maintainable.

To be clear: I'm guessing the storage duration/stack utilization is the primary concern here, correct? I hadn't really considered that - likely explains the existing memset() usage.

Contributor

memset(0) on a big region is also a problem. The big problem is you have a large structure that you're carrying around per operation. In the zero copy case, you shouldn't need anything more than some way of getting back to the context to return to the user. You've basically made it so that the fast path is limited by the slow path.

Readability / maintainability is important, but performance is the real goal.

@@ -57,7 +57,7 @@ void efa_rdm_txe_construct(struct efa_rdm_ope *txe,

dlist_init(&txe->queued_pkts);

if (ep->user_info->mode & FI_MSG_PREFIX)
if (txe->iov_count > 0 && ep->user_info->mode & FI_MSG_PREFIX)
Contributor

I am worried that this is a hacky workaround. I would audit the code to figure out whether we need to use this function to construct the ope for the handshake packet. There is no msg involved for a handshake.

Member Author

Are you referring to the change that adds txe->iov_count > 0 to the condition or the condition as a whole? It's not currently valid to call ofi_consume_iov_desc() with an empty IOV - so that change is necessary.

Contributor

I don't want efa_rdm_post_handshake to call efa_rdm_txe_construct, so you don't have to make this change.

Contributor

efa_rdm_txe_construct should be called with a msg passed by the application. For the handshake, we don't have any msg. The current implementation creates a fake msg that I want to get rid of.

@@ -961,7 +961,13 @@ void efa_rdm_pke_handle_recv_completion(struct efa_rdm_pke *pkt_entry)
zcpy_rxe = pkt_entry->ope;
}

efa_rdm_pke_proc_received(pkt_entry);
if (ep->use_zcpy_rx) {
Contributor

This code block should be moved to the if (pkt_entry->alloc_type == EFA_RDM_PKE_FROM_USER_BUFFER) { above. When pkt entry is allocated from user buffer, we do the zcpy recv handling, otherwise we do the non-zcpy recv handling via efa_rdm_pke_proc_received.

But Brian is doubting if this shortcut is the best approach, we can discuss that

@shijin-aws
Contributor

shijin-aws commented May 21, 2024

I'm a little worried that we're continuing down an approach where we try to thread the zero copy path through existing
state machine in a really fragile way. I know this is trying to move quickly, but we really need to rethink the approach.

@bwbarrett understood. I was proposing this approach because the common code - the big state machine efa_rdm_pke_proc_received - has more than 10 layers in the stack handling the ordering, tag matching, and data copy, 99.9% of which are not relevant to the zcpy receive mode, where the data has already been delivered by the time we get a completion from rdma-core. We can harmonize these two use cases as part of our efa provider rewrite. But for the short term, and for performance's sake, I don't have a better idea right now than making an early shortcut here. WDYT?

@darrylabbate
Member Author

IIUC, an "early shortcut" would be the suggested short-term solution. Probably best to engineer a high-level fast path for zero-copy and worry about making the overall provider more clean and generic later - after performance is already obtained.

@darrylabbate
Member Author

FWIW, I didn't intend for this PR to complete our zero-copy work; just fix a few bugs that were discovered while testing zero-copy receive.

@darrylabbate darrylabbate marked this pull request as draft May 22, 2024 20:41
@darrylabbate darrylabbate deleted the fix/efa/opt-hdr-cq-data branch May 24, 2024 09:00
3 participants