-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/efa: Various zero-copy receive bugfixes #10038
Conversation
@@ -148,8 +148,9 @@ uint32_t *efa_rdm_pke_get_req_connid_ptr(struct efa_rdm_pke *pkt_entry) | |||
return &raw_addr->qkey; | |||
} | |||
|
|||
if (base_hdr->flags & EFA_RDM_REQ_OPT_CQ_DATA_HDR) | |||
if (base_hdr->flags & EFA_RDM_REQ_OPT_CQ_DATA_HDR || pkt_entry->ep->use_zcpy_rx) { | |||
opt_hdr += sizeof(struct efa_rdm_req_opt_cq_data_hdr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libfabric convention is to not have {
for single line if
1656a42
to
5920da9
Compare
5920da9
to
02aa85d
Compare
Latest push doesn't yet include fixes for every known bug re: CQ data header or zero-copy receive path |
02aa85d
to
368b5f2
Compare
Latest push should fix hanging unit tests. Still testing changes related to missing logic when handling the receive completion - should be the last known bug re: zero-copy. |
This also adds an assert() before dereferencing the struct efa_rdm_peer pointer. Signed-off-by: Darryl Abbate <drl@amazon.com>
This leverages compound literal syntax to initialize struct objects; ensuring all unspecified members are zeroed out. This avoids any potential bugs due to fields not explicitly set to 0 or NULL. Signed-off-by: Darryl Abbate <drl@amazon.com>
Alternatively, the utility functions could be rewritten to short-circuit with an iov_count of 0. Signed-off-by: Darryl Abbate <drl@amazon.com>
This includes the CQ data header size for the zero-copy receive use-case Signed-off-by: Darryl Abbate <drl@amazon.com>
This treats zero-copy receive as a special case; deferring to minimal logic in to handle the completion. This avoids the bloated pke dispatch. Signed-off-by: Darryl Abbate <drl@amazon.com>
368b5f2
to
3144b47
Compare
} else { | ||
rxe->cq_entry.len = pkt_entry->pkt_size + sizeof(struct efa_rdm_pke); | ||
efa_rdm_pke_rtm_handle_zcpy_recv_completion(pkt_entry, rxe); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you still need to handle zcpy recv in this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figure the function is just poorly named. Is it safe to remove the logic completely from efa_rdm_pke_proc_matched_eager_rtm()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean you already make a branch in efa_rdm_pke_handle_recv_completion
that the zcpy received pkts should not enter efa_rdm_proc_received
, who calls efa_rdm_pke_proc_matched_eager_rtm
. Then we can remove this logic safely here, right?
But Brian is doubting if this shortcut is the best approach, we can discuss that.
|
||
rxe = pkt_entry->ope; | ||
int err = 0; | ||
struct efa_rdm_ope *rxe = pkt_entry->ope; | ||
|
||
if (pkt_entry->alloc_type != EFA_RDM_PKE_FROM_USER_BUFFER) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't expect a user buffer allocated pkt if we handle zcpy recv completion outside efa_rdm_pke_proc_received
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried that we're continuing down an approach where we try to thread the zero copy path through existing state machine in a really fragile way. I know this is trying to move quickly, but we really need to rethink the approach.
if (ep->base_ep.util_ep.tx_msg_flags == 0) | ||
tx_op_flags &= ~FI_COMPLETION; | ||
|
||
*txe = (struct efa_rdm_ope) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yikes, this is a huge structure that we're now initializing to zero. I'm not really sure I understand why the structure is so large, but it's concerning that every operation starts with initializing something that big. This is not reasonable for a zero-copy path in particular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zero-initializing was the point of the refactor here. I find the alternative prone to bugs and less maintainable.
To be clear: I'm guessing the storage duration/stack utilization is the primary concern here, correct? I hadn't really considered that - likely explains the existing memset()
usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memset(0) on a big region is also a problem. The big problem is you have a large structure that you're carrying around per operation. In the zero copy case, you shouldn't need anything more than some way of getting back to the context to return to the user. You've basically made it so that the fast path is limited by the slow path.
Readability / maintainability is important, but performance is the real goal.
@@ -57,7 +57,7 @@ void efa_rdm_txe_construct(struct efa_rdm_ope *txe, | |||
|
|||
dlist_init(&txe->queued_pkts); | |||
|
|||
if (ep->user_info->mode & FI_MSG_PREFIX) | |||
if (txe->iov_count > 0 && ep->user_info->mode & FI_MSG_PREFIX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried that this is a hacky workaround. I would audit the code to figure out whether we need to use this function to construct ope for the handshake packet. There is no msg involved for handshake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the change that adds txe->iov_count > 0
to the condition or the condition as a whole? It's not currently valid to call ofi_consume_iov_desc()
with an empty IOV - so that change is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want handshake efa_rdm_post_handshake call efa_rdm_txe_construct
, so you don't have to make this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
efa_rdm_txe_construct should be called with a msg
passed by application. For handshake, we don't have any msg. The current implementation creates a fake msg
that I want to get rid of
@@ -961,7 +961,13 @@ void efa_rdm_pke_handle_recv_completion(struct efa_rdm_pke *pkt_entry) | |||
zcpy_rxe = pkt_entry->ope; | |||
} | |||
|
|||
efa_rdm_pke_proc_received(pkt_entry); | |||
if (ep->use_zcpy_rx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block should be moved to the if (pkt_entry->alloc_type == EFA_RDM_PKE_FROM_USER_BUFFER) {
above. When pkt entry is allocated from user buffer, we do the zcpy recv handling, otherwise we do the non-zcpy recv handling via efa_rdm_pke_proc_received
.
But Brian is doubting if this shortcut is the best approach, we can discuss that
} else { | ||
rxe->cq_entry.len = pkt_entry->pkt_size + sizeof(struct efa_rdm_pke); | ||
efa_rdm_pke_rtm_handle_zcpy_recv_completion(pkt_entry, rxe); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean you already make a branch in efa_rdm_pke_handle_recv_completion
that the zcpy received pkts should not enter efa_rdm_proc_received
, who calls efa_rdm_pke_proc_matched_eager_rtm
. Then we can remove this logic safely here, right?
But Brian is doubting if this shortcut is the best approach, we can discuss that.
@bwbarrett understood. I was proposing this approach because the common code - big state machine |
IIUC, an "early shortcut" would be the suggested short-term solution. Probably best to engineer a high-level fast path for zero-copy and worry about making the overall provider more clean and generic later - after performance is already obtained. |
FWIW, I didn't intend for this PR to complete our zero-copy work; just fix a few bugs that were discovered while testing zero-copy receive. |
This includes the CQ data header size for the zero-copy receive use-case