Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-623 rebuild: uniform identifier in logs part 1 #14383

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

kccain
Copy link
Contributor

@kccain kccain commented May 15, 2024

To the extent possible in the rebuild code execution flow, when rebuild emits log messages, include a uniform rebuild operation identifier in those messages. This covers activities across all pool storage engines (including the pool service leader), system and per-target threads/xstreams, and dynamically spawned user-level threads.

The motivation is to enable some amount of automated searching through logfiles for all (or specific) rebuilds that occurred during execution, and speed up DAOS engineer analysis/interpretation of the logs.

The baseline format (defined in the DF_RB macro) is:
"rb=" DF_UUID "/%u/%u/%s"
and corresponds to:
<pool_uuid>/<rebuild_ver>/<rebuild_gen>/

A verbose format (defined in the DF_RBF macro) adds the following (for <leader_rank>/)
" ld=%u/" DF_U64

Various DP_RB_* and DP_RBF_* macros are defined to specify the arguments to go with the DF_RB and DF_RBF formats, given some common rebuild implementation structures such as:
struct rebuild_global_pool_tracker
struct rebuild_tgt_pool_tracker
struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
struct migrate_query_arg

This initial patch covers the pool service leader execution in functions (and those that they invoke) such as:
rebuild_ults()
rebuild_task_ult()
rebuild_leader_start()
rebuild_leader_status_check()
rebuild_leader_status_notify().

And this patch covers "scan side" execution in all pool storage engines (including the leader), in functions such as:
rebuild_tgt_scan_handler()
rebuild_tgt_status_check_ult()
ds_migrate_query_status()
migrate_check_one()
dss_rebuild_check_one()
rebuild_scan_leader()
rebuild_scanner()
rebuild_objects_send_ult()
rebuild_scan_done()

Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

github-actions bot commented May 15, 2024

Ticket title is 'Generic ticket for minor code cleanup and improvement'
Status is 'Resolved'
Labels: 'request_for_2.6,request_for_2.6.1'
https://daosio.atlassian.net/browse/DAOS-623

@kccain kccain force-pushed the kccain/rebuild_uniform_logging branch from 7424223 to cb61d6d Compare May 15, 2024 23:27
@daos-stack daos-stack deleted a comment from daosbuild1 May 15, 2024
@daos-stack daos-stack deleted a comment from daosbuild1 May 15, 2024
@daos-stack daos-stack deleted a comment from daosbuild1 May 15, 2024
@kccain
Copy link
Contributor Author

kccain commented May 15, 2024

@liuxuezhao @gnailzenh here is a draft patch for what I discussed today. Can you take a look and provide some early comments on the approach?

@kccain kccain requested a review from liuxuezhao May 15, 2024 23:33
@daosbuild1

This comment was marked as outdated.

@kccain kccain force-pushed the kccain/rebuild_uniform_logging branch 2 times, most recently from d6b5852 to 64a4330 Compare May 16, 2024 14:56
To the extent possible in the rebuild code execution flow,
when rebuild emits log messages, include a uniform rebuild
operation identifier in those messages. This covers activities
across all pool storage engines (including the pool service leader),
system and per-target threads/xstreams, and dynamically spawned
user-level threads.

The motivation is to enable some amount of automated searching through
logfiles for all (or specific) rebuilds that occurred during execution,
and speed up DAOS engineer analysis/interpretation of the logs.

The baseline format (defined in the DF_RB macro) is:
  "rb=" DF_UUID "/%u/" DF_U64 "/%u/%u/%s"
and corresponds to:
  <pool_uuid>/<rebuild_ver>/<ps_term>/<rebuild_gen>/<ps_rank>/<opc_str>

A verbose format (defined in the DF_RBF macro) adds the following
(for <engine_rank>:<tgt_idx>)
  r:t=%u:%d"

Various DP_RB_* and DP_RBF_* macros are defined to specify the
arguments to go with the DF_RB and DF_RBF formats, given some
common rebuild implementation structures such as:
  struct rebuild_global_pool_tracker
  struct rebuild_tgt_pool_tracker
  struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
  struct migrate_query_arg

This initial patch covers the pool service leader execution in
functions (and those that they invoke) such as:
  rebuild_ults()
  rebuild_task_ult()
  rebuild_leader_start()
  rebuild_leader_status_check()
  rebuild_leader_status_notify().

And this patch covers "scan side" execution in all pool storage
engines (including the leader), in functions such as:
  rebuild_tgt_scan_handler()
  rebuild_tgt_status_check_ult()
  ds_migrate_query_status()
  migrate_check_one()
  dss_rebuild_check_one()
  rebuild_scan_leader()
  rebuild_scanner()
  rebuild_objects_send_ult()
  rebuild_scan_done()

Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843

Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
@kccain kccain force-pushed the kccain/rebuild_uniform_logging branch from 64a4330 to 41005ed Compare May 17, 2024 13:24
src/include/daos_srv/daos_engine.h Outdated Show resolved Hide resolved
src/include/daos_srv/daos_engine.h Outdated Show resolved Hide resolved
src/include/daos_srv/rebuild.h Outdated Show resolved Hide resolved
src/include/daos_srv/rebuild.h Outdated Show resolved Hide resolved
src/include/daos_srv/rebuild.h Outdated Show resolved Hide resolved
src/include/daos_srv/daos_engine.h Outdated Show resolved Hide resolved
Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843

Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
Copy link
Contributor Author

@kccain kccain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed feedback in the latest push

src/include/daos_srv/daos_engine.h Outdated Show resolved Hide resolved
src/include/daos_srv/daos_engine.h Outdated Show resolved Hide resolved
src/include/daos_srv/rebuild.h Outdated Show resolved Hide resolved
src/include/daos_srv/rebuild.h Outdated Show resolved Hide resolved
src/include/daos_srv/daos_engine.h Outdated Show resolved Hide resolved
src/include/daos_srv/rebuild.h Outdated Show resolved Hide resolved
@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843

Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/651/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/667/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/505/log

@kccain
Copy link
Contributor Author

kccain commented May 29, 2024

all per-PR tests passed. Some full_regression (weekly regresison tests) failed and are known issues (listed at the end)

But this one from Functional HW Large testing that includes an engine assertion seems new.] I don't see a direct link between the changes in this PR and the aggregation engine assertion and stacktrace. @liuxuezhao do you think this patch could have caused it? Or possibly it is a rare issue not yet seen in the testing?

  • erasurecode/offline_rebuild_aggregation.py failure

ERROR: daos_engine:0 05/29-00:48:44.26 wolf-113 DAOS[185489/8/640] vos EMRG src/vos/vos_aggregate.c:640 recx2ext() Assertion 'recx->rx_nr > 0' failed

For now I've prepared a test-only PR 14474 just the same master commit base as this patch, to see if it reproduces independently.

Functional HW Large testing with known existing failures:

Functional HW Large MD on SSD testing with known existing failures:

  • erasurecode/online_rebuild.py cases 2 and 4 failures are also an instance of DAOS-15863

Functional HW Medium MD on SSD testing with known existing failures:

  • deployment/server_rank_failure.py cases 1 and 2 failures are instances of DAOS-15809

And change one debug log to error in rebuild_objects_send_ult()

Allow-unstable-test: true

Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

@daosbuild1

This comment was marked as outdated.

Copy link
Contributor

@liuxuezhao liuxuezhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @kccain

@kccain
Copy link
Contributor Author

kccain commented May 31, 2024

CI testing is blocked at the moment by https://daosio.atlassian.net/browse/SRE-2233
This PR has passed Features: rebuild testing in build 10 before, so I will move it from draft -> review-able status now and we can get a second reviewer in parallel to CI testing, once that's ready to resume.

@kccain kccain marked this pull request as ready for review May 31, 2024 13:44
@kccain kccain requested review from a team as code owners May 31, 2024 13:44
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/17/execution/node/1386/log

@kccain kccain requested a review from wangshilong June 3, 2024 01:19
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/17/execution/node/1524/log

@kccain
Copy link
Contributor Author

kccain commented Jun 3, 2024

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/17/execution/node/1386/log

The single test failure seen in this testing (scrubber/target_auto_eviction.py) is an instance of existing known issue https://daosio.atlassian.net/browse/DAOS-14585

@kccain
Copy link
Contributor Author

kccain commented Jun 3, 2024

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/17/execution/node/1524/log

The single test failure seen in this testing (daos_test/suite.py:DaosCoreTest.test_daos_single_rdg_tx) is an instance of existing known issue https://daosio.atlassian.net/browse/DAOS-14982

@kccain
Copy link
Contributor Author

kccain commented Jun 3, 2024

Features testing done/successful in build 10.
PR testing done/successful in build 17

Patch requires one more peer review

@kccain
Copy link
Contributor Author

kccain commented Jun 4, 2024

While waiting for reviews, I am continuing to investigate if the weekly regression test failure (recx2ext() engine assertion) seen in build 10 is always associated with this patch, or if it can be seen in master. That work is being documented in https://daosio.atlassian.net/issues/DAOS-15941

For now I have only seen it with this patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
6 participants