
prov/efa: Various zero-copy receive bugfixes #10038

Closed
wants to merge 5 commits

Conversation

darrylabbate
Member

This includes the CQ data header size for the zero-copy receive use case.

@darrylabbate darrylabbate requested a review from a team May 15, 2024 23:47
@@ -148,8 +148,9 @@ uint32_t *efa_rdm_pke_get_req_connid_ptr(struct efa_rdm_pke *pkt_entry)
return &raw_addr->qkey;
}

if (base_hdr->flags & EFA_RDM_REQ_OPT_CQ_DATA_HDR)
if (base_hdr->flags & EFA_RDM_REQ_OPT_CQ_DATA_HDR || pkt_entry->ep->use_zcpy_rx) {
opt_hdr += sizeof(struct efa_rdm_req_opt_cq_data_hdr);
Contributor

libfabric convention is to not have { for single line if

@darrylabbate darrylabbate changed the title prov/efa: Fix optional header size calculations prov/efa: Various zero-copy receive bugfixes May 17, 2024
@darrylabbate
Member Author

darrylabbate commented May 17, 2024

Latest push doesn't yet include fixes for every known bug re: CQ data header or zero-copy receive path

@darrylabbate
Member Author

Latest push should fix hanging unit tests. Still testing changes related to missing logic when handling the receive completion - should be the last known bug re: zero-copy.

This also adds an assert() before dereferencing the struct efa_rdm_peer
pointer.

Signed-off-by: Darryl Abbate <drl@amazon.com>
This leverages compound literal syntax to initialize struct objects, ensuring all unspecified members are zeroed out. This avoids potential bugs due to fields not explicitly set to 0 or NULL.

Signed-off-by: Darryl Abbate <drl@amazon.com>
Alternatively, the utility functions could be rewritten to short-circuit
with an iov_count of 0.

Signed-off-by: Darryl Abbate <drl@amazon.com>
This includes the CQ data header size for the zero-copy receive use-case

Signed-off-by: Darryl Abbate <drl@amazon.com>
This treats zero-copy receive as a special case, deferring to minimal logic to handle the completion. This avoids the bloated pke dispatch.

Signed-off-by: Darryl Abbate <drl@amazon.com>
} else {
rxe->cq_entry.len = pkt_entry->pkt_size + sizeof(struct efa_rdm_pke);
efa_rdm_pke_rtm_handle_zcpy_recv_completion(pkt_entry, rxe);
Contributor

why do you still need to handle zcpy recv in this function?

Member Author

I figure the function is just poorly named. Is it safe to remove the logic completely from efa_rdm_pke_proc_matched_eager_rtm()?

Contributor

I mean you already made a branch in efa_rdm_pke_handle_recv_completion so that zcpy received pkts do not enter efa_rdm_pke_proc_received, which calls efa_rdm_pke_proc_matched_eager_rtm. Then we can remove this logic safely here, right?

But Brian is doubting if this shortcut is the best approach, we can discuss that.


rxe = pkt_entry->ope;
int err = 0;
struct efa_rdm_ope *rxe = pkt_entry->ope;

if (pkt_entry->alloc_type != EFA_RDM_PKE_FROM_USER_BUFFER) {
Contributor

We shouldn't expect a user buffer allocated pkt if we handle zcpy recv completion outside efa_rdm_pke_proc_received

Contributor

@bwbarrett bwbarrett left a comment

I'm a little worried that we're continuing down an approach where we try to thread the zero copy path through existing state machine in a really fragile way. I know this is trying to move quickly, but we really need to rethink the approach.

if (ep->base_ep.util_ep.tx_msg_flags == 0)
tx_op_flags &= ~FI_COMPLETION;

*txe = (struct efa_rdm_ope) {
Contributor

yikes, this is a huge structure that we're now initializing to zero. I'm not really sure I understand why the structure is so large, but it's concerning that every operation starts with initializing something that big. This is not reasonable for a zero-copy path in particular.

Member Author

Zero-initializing was the point of the refactor here. I find the alternative prone to bugs and less maintainable.

To be clear: I'm guessing the storage duration/stack utilization is the primary concern here, correct? I hadn't really considered that - likely explains the existing memset() usage.

Contributor

memset(0) on a big region is also a problem. The big problem is you have a large structure that you're carrying around per operation. In the zero copy case, you shouldn't need anything more than some way of getting back to the context to return to the user. You've basically made it so that the fast path is limited by the slow path.

Readability / maintainability is important, but performance is the real goal.

@@ -57,7 +57,7 @@ void efa_rdm_txe_construct(struct efa_rdm_ope *txe,

dlist_init(&txe->queued_pkts);

if (ep->user_info->mode & FI_MSG_PREFIX)
if (txe->iov_count > 0 && ep->user_info->mode & FI_MSG_PREFIX)
Contributor

I am worried that this is a hacky workaround. I would audit the code to figure out whether we need to use this function to construct the ope for the handshake packet. There is no msg involved for a handshake.

Member Author

Are you referring to the change that adds txe->iov_count > 0 to the condition or the condition as a whole? It's not currently valid to call ofi_consume_iov_desc() with an empty IOV - so that change is necessary.

Contributor

I don't want efa_rdm_post_handshake to call efa_rdm_txe_construct, so you don't have to make this change.

Contributor

efa_rdm_txe_construct should be called with a msg passed by the application. For the handshake, we don't have any msg. The current implementation creates a fake msg that I want to get rid of.

@@ -961,7 +961,13 @@ void efa_rdm_pke_handle_recv_completion(struct efa_rdm_pke *pkt_entry)
zcpy_rxe = pkt_entry->ope;
}

efa_rdm_pke_proc_received(pkt_entry);
if (ep->use_zcpy_rx) {
Contributor

This code block should be moved to the if (pkt_entry->alloc_type == EFA_RDM_PKE_FROM_USER_BUFFER) { above. When pkt entry is allocated from user buffer, we do the zcpy recv handling, otherwise we do the non-zcpy recv handling via efa_rdm_pke_proc_received.

But Brian is doubting if this shortcut is the best approach, we can discuss that

@shijin-aws
Contributor

shijin-aws commented May 21, 2024

I'm a little worried that we're continuing down an approach where we try to thread the zero copy path through existing
state machine in a really fragile way. I know this is trying to move quickly, but we really need to rethink the approach.

@bwbarrett understood. I was proposing this approach because the common code - the big state machine efa_rdm_pke_proc_received - has more than 10 layers in the stack handling the ordering, tag matching, and data copy, 99.9% of which are not relevant to the zcpy receive mode, where the data has already been delivered by the time we get a completion from rdma-core. We can harmonize these two use cases as part of our efa provider rewrite. But for the short term, and for performance's sake, I don't have a better idea right now than making an early shortcut here. WDYT?

@darrylabbate
Member Author

IIUC, an "early shortcut" would be the suggested short-term solution. Probably best to engineer a high-level fast path for zero-copy and worry about making the overall provider more clean and generic later - after performance is already obtained.

@darrylabbate
Member Author

FWIW, I didn't intend for this PR to complete our zero-copy work; just fix a few bugs that were discovered while testing zero-copy receive.

@darrylabbate darrylabbate marked this pull request as draft May 22, 2024 20:41
@darrylabbate darrylabbate deleted the fix/efa/opt-hdr-cq-data branch May 24, 2024 09:00
3 participants