Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-15540 cart: Change proto query default timeout #14382

Merged
merged 7 commits into from
May 17, 2024
Merged

Conversation

jolivier23
Copy link
Contributor

Set it to 5 seconds so we can try a different target more quickly if a rank is down.

Features: rebuild

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Set it to 5 seconds so we can try a different target
more quickly if a rank is down.

Features: rebuild

Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 jolivier23 requested review from a team as code owners May 15, 2024 21:58
Copy link

github-actions bot commented May 15, 2024

Ticket title is 'Engine unresponsive at scale with tcp and multi-recv enabled'
Status is 'In Progress'
Labels: 'GCP,scrubbed,triaged'
https://daosio.atlassian.net/browse/DAOS-15540

Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 jolivier23 requested review from a team as code owners May 15, 2024 22:33
@jolivier23 jolivier23 changed the title DAOS-623 cart: Change proto query default timeout DAOS-15540 cart: Change proto query default timeout May 15, 2024
wangshilong
wangshilong previously approved these changes May 16, 2024
@@ -629,7 +629,7 @@ crt_proto_query_int(crt_endpoint_t *tgt_ep, crt_opcode_t base_opc, uint32_t *ver
proto_query->pq_user_arg = arg;
proto_query->pq_coq->coq_base = base_opc;

rc = crt_req_set_timeout(rpc_req, 5);
rc = crt_req_set_timeout(rpc_req, timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer here if timeout passed is 0 to not crt_req_set_timeout() at all. This way callers like crt_rpc.c can then call with 0 and use whatever defaults were already set via CRT_TIMEOUT instead of having to specify 60 seconds (for fi/ctl proto queries below).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean timeout passed from the cart tests, but not the daos proto query from init? i agree in that case that passing 0 should just be a no-op and not do anything in this case.

@@ -629,7 +629,7 @@ crt_proto_query_int(crt_endpoint_t *tgt_ep, crt_opcode_t base_opc, uint32_t *ver
proto_query->pq_user_arg = arg;
proto_query->pq_coq->coq_base = base_opc;

rc = crt_req_set_timeout(rpc_req, 5);
rc = crt_req_set_timeout(rpc_req, timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean timeout passed from the cart tests, but not the daos proto query from init? i agree in that case that passing 0 should just be a no-op and not do anything in this case.

@@ -114,8 +115,10 @@ query_cb(struct crt_proto_query_cb_info *cb_info)

if (daos_rpc_retryable_rc(cb_info->pq_rc)) {
rproto->ep.ep_rank = (rproto->ep.ep_rank + 1) % rproto->nr_ranks;
if (rproto->timeout < 30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better here that instead of 30 to use the default timeout that cart was inited with. not sure if there is an api to get default timeout from cart, but should be easy to add one and use that instead of 30 and if current timeout exceeds the default to just use the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frostedcmos is there a way to get the current timeout setting from cart?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope there isnt a single/simple way, as there are multiple timeouts involved (global, per context, per rpc).

@@ -441,8 +441,7 @@ crt_register_proto_fi(crt_endpoint_t *ep)
if (rc != 0)
return -DER_MISC;

rc = crt_proto_query(ep, cpf.cpf_base, &cpf.cpf_ver,
1, crt_pfi_cb, &pfi);
rc = crt_proto_query(ep, cpf.cpf_base, &cpf.cpf_ver, 1, 60, crt_pfi_cb, &pfi);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to pass 0 here and in cart we just not set the timeout of the value passed is 0.

Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
mchaarawi
mchaarawi previously approved these changes May 16, 2024
Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23
Copy link
Contributor Author

sorry, @mchaarawi I noticed I didn't like the if statement after removing the API call

mchaarawi
mchaarawi previously approved these changes May 16, 2024
Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftest LGTM

/** Should only fail if invalid parameter */
D_ASSERT(rc == 0);
if (timeout != 0) {
if (timeout < crt_gdata.cg_timeout) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would remove this line, as crt_gdata.cg_timeout isnt necessarily the one being used either if per-rpc timeout isnt set. the order is -- global timeout (crt_gdata.cg_timeout), per-context timeout, per-rpc timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should use the crt_req_get_timeout() instead to get the default?

Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 jolivier23 merged commit 3375a34 into master May 17, 2024
51 checks passed
@jolivier23 jolivier23 deleted the jvolivie/timeout branch May 17, 2024 18:16
jolivier23 added a commit that referenced this pull request May 17, 2024
Set it to 3 seconds initially and increase it as we try other targets so we
can get going more quickly when a rank is down.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
jolivier23 added a commit that referenced this pull request May 20, 2024
Pick from all possible ranks for protocol query handling

Backports the following patches, last one fixes the issue but prior two are benign and make the backport work without conflict

DAOS-13717 client: use d_rand() to generate random number (#12418)
DAOS-15120 crt: Misc d_rand-related fixes (#13738)
DAOS-15476 client: do not use rsvc for proto query rank selection (#14131)
DAOS-15540 cart: Change proto query default timeout (#14382)

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Signed-off-by: Wang Shilong <shilong.wang@intel.com>
Signed-off-by: Li Wei <wei.g.li@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants