-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-15540 cart: Change proto query default timeout #14382
Conversation
Set it to 5 seconds so we can try a different target more quickly if a rank is down. Features: rebuild Required-githooks: true Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Ticket title is 'Engine unresponsive at scale with tcp and multi-recv enabled' |
Required-githooks: true Signed-off-by: Jeff Olivier <jeffolivier@google.com>
src/cart/crt_register.c
Outdated
@@ -629,7 +629,7 @@ crt_proto_query_int(crt_endpoint_t *tgt_ep, crt_opcode_t base_opc, uint32_t *ver | |||
proto_query->pq_user_arg = arg; | |||
proto_query->pq_coq->coq_base = base_opc; | |||
|
|||
rc = crt_req_set_timeout(rpc_req, 5); | |||
rc = crt_req_set_timeout(rpc_req, timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer here if timeout passed is 0 to not crt_req_set_timeout() at all. This way callers like crt_rpc.c can then call with 0 and use whatever defaults were already set via CRT_TIMEOUT instead of having to specify 60 seconds (for fi/ctl proto queries below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean timeout passed from the cart tests, but not the daos proto query from init? i agree in that case that passing 0 should just be a no-op and not do anything in this case.
src/cart/crt_register.c
Outdated
@@ -629,7 +629,7 @@ crt_proto_query_int(crt_endpoint_t *tgt_ep, crt_opcode_t base_opc, uint32_t *ver | |||
proto_query->pq_user_arg = arg; | |||
proto_query->pq_coq->coq_base = base_opc; | |||
|
|||
rc = crt_req_set_timeout(rpc_req, 5); | |||
rc = crt_req_set_timeout(rpc_req, timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean timeout passed from the cart tests, but not the daos proto query from init? i agree in that case that passing 0 should just be a no-op and not do anything in this case.
src/client/api/rpc.c
Outdated
@@ -114,8 +115,10 @@ query_cb(struct crt_proto_query_cb_info *cb_info) | |||
|
|||
if (daos_rpc_retryable_rc(cb_info->pq_rc)) { | |||
rproto->ep.ep_rank = (rproto->ep.ep_rank + 1) % rproto->nr_ranks; | |||
if (rproto->timeout < 30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be better here that instead of 30 to use the default timeout that cart was inited with. not sure if there is an api to get default timeout from cart, but should be easy to add one and use that instead of 30 and if current timeout exceeds the default to just use the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frostedcmos is there a way to get the current timeout setting from cart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope there isnt a single/simple way, as there are multiple timeouts involved (global, per context, per rpc).
src/cart/crt_rpc.c
Outdated
@@ -441,8 +441,7 @@ crt_register_proto_fi(crt_endpoint_t *ep) | |||
if (rc != 0) | |||
return -DER_MISC; | |||
|
|||
rc = crt_proto_query(ep, cpf.cpf_base, &cpf.cpf_ver, | |||
1, crt_pfi_cb, &pfi); | |||
rc = crt_proto_query(ep, cpf.cpf_base, &cpf.cpf_ver, 1, 60, crt_pfi_cb, &pfi); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be better to pass 0 here and in cart we just not set the timeout of the value passed is 0.
Required-githooks: true Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Required-githooks: true Signed-off-by: Jeff Olivier <jeffolivier@google.com>
sorry, @mchaarawi I noticed I didn't like the if statement after removing the API call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftest LGTM
src/cart/crt_register.c
Outdated
/** Should only fail if invalid parameter */ | ||
D_ASSERT(rc == 0); | ||
if (timeout != 0) { | ||
if (timeout < crt_gdata.cg_timeout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would remove this line, as crt_gdata.cg_timeout isnt necessarily the one being used either if per-rpc timeout isnt set. the order is -- global timeout (crt_gdata.cg_timeout), per-context timeout, per-rpc timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we should use the crt_req_get_timeout() instead to get the default?
Required-githooks: true Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Set it to 3 seconds initially and increase it as we try other targets so we can get going more quickly when a rank is down. Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Pick from all possible ranks for protocol query handling Backports the following patches, last one fixes the issue but prior two are benign and make the backport work without conflict DAOS-13717 client: use d_rand() to generate random number (#12418) DAOS-15120 crt: Misc d_rand-related fixes (#13738) DAOS-15476 client: do not use rsvc for proto query rank selection (#14131) DAOS-15540 cart: Change proto query default timeout (#14382) Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com> Signed-off-by: Wang Shilong <shilong.wang@intel.com> Signed-off-by: Li Wei <wei.g.li@intel.com> Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Set it to 5 seconds so we can try a different target more quickly if a rank is down.
Features: rebuild
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: