DAOS-623 rebuild: uniform identifier in logs part 1 #14383
base: master
Ticket title is 'Generic ticket for minor code cleanup and improvement'
7424223 to cb61d6d
@liuxuezhao @gnailzenh here is a draft patch for what I discussed today. Can you take a look and provide some early comments on the approach?
This comment was marked as outdated.
d6b5852 to 64a4330
To the extent possible in the rebuild code execution flow, when rebuild emits log messages, include a uniform rebuild operation identifier in those messages. This covers activities across all pool storage engines (including the pool service leader), system and per-target threads/xstreams, and dynamically spawned user-level threads.
The motivation is to enable some amount of automated searching through logfiles for all (or specific) rebuilds that occurred during execution, and speed up DAOS engineer analysis/interpretation of the logs.
The baseline format (defined in the DF_RB macro) is:
"rb=" DF_UUID "/%u/" DF_U64 "/%u/%u/%s"
and corresponds to:
<pool_uuid>/<rebuild_ver>/<ps_term>/<rebuild_gen>/<ps_rank>/<opc_str>
A verbose format (defined in the DF_RBF macro) adds the following (for <engine_rank>:<tgt_idx>)
"r:t=%u:%d"
Various DP_RB_* and DP_RBF_* macros are defined to specify the arguments to go with the DF_RB and DF_RBF formats, given some common rebuild implementation structures such as:
struct rebuild_global_pool_tracker
struct rebuild_tgt_pool_tracker
struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
struct migrate_query_arg
This initial patch covers the pool service leader execution in functions (and those that they invoke) such as:
rebuild_ults()
rebuild_task_ult()
rebuild_leader_start()
rebuild_leader_status_check()
rebuild_leader_status_notify().
And this patch covers "scan side" execution in all pool storage engines (including the leader), in functions such as:
rebuild_tgt_scan_handler()
rebuild_tgt_status_check_ult()
ds_migrate_query_status()
migrate_check_one()
dss_rebuild_check_one()
rebuild_scan_leader()
rebuild_scanner()
rebuild_objects_send_ult()
rebuild_scan_done()
Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843
Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
64a4330 to 41005ed
Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843
Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
addressed feedback in the latest push
Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843
Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/651/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/667/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/505/log
All per-PR tests passed. Some full_regression (weekly regression) tests failed and are known issues (listed at the end). But this one from Functional HW Large testing, which includes an engine assertion, seems new. I don't see a direct link between the changes in this PR and the aggregation engine assertion and stack trace. @liuxuezhao do you think this patch could have caused it? Or is it possibly a rare issue not yet seen in testing?
For now I've prepared a test-only PR 14474 on the same master commit base as this patch, to see if the failure reproduces independently.
Functional HW Large testing with known existing failures:
Functional HW Large MD on SSD testing with known existing failures:
Functional HW Medium MD on SSD testing with known existing failures:
LGTM, thanks @kccain
CI testing is blocked at the moment by https://daosio.atlassian.net/browse/SRE-2233
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/17/execution/node/1386/log
Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/17/execution/node/1524/log
The single test failure seen in this testing (scrubber/target_auto_eviction.py) is an instance of existing known issue https://daosio.atlassian.net/browse/DAOS-14585
The single test failure seen in this testing (daos_test/suite.py:DaosCoreTest.test_daos_single_rdg_tx) is an instance of existing known issue https://daosio.atlassian.net/browse/DAOS-14982
Features testing done/successful in build 10.
While waiting for reviews, I am continuing to investigate whether the weekly regression test failure (recx2ext() engine assertion) seen in build 10 is always associated with this patch, or whether it can also be seen in master. That work is being documented in https://daosio.atlassian.net/issues/DAOS-15941. At first it was seen while testing this patch in CI. In DAOS-15941 I have now seen it reproduce with the master branch in an SCM (tmpfs)-only configuration.
To the extent possible in the rebuild code execution flow, when rebuild emits log messages, include a uniform rebuild operation identifier in those messages. This covers activities across all pool storage engines (including the pool service leader), system and per-target threads/xstreams, and dynamically spawned user-level threads.
The motivation is to enable some amount of automated searching through logfiles for all (or specific) rebuilds that occurred during execution, and speed up DAOS engineer analysis/interpretation of the logs.
The baseline format (defined in the DF_RB macro) is:
"rb=" DF_UUID "/%u/%u/%s"
and corresponds to:
<pool_uuid>/<rebuild_ver>/<rebuild_gen>/
A verbose format (defined in the DF_RBF macro) adds the following (for <leader_rank>/)
" ld=%u/" DF_U64
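As an illustration of the automated log searching this identifier is meant to enable, here is a minimal sketch that pulls the baseline fields out of one log line. It assumes the UUID prints as a token with no '/' in it and that the final %s field has no embedded whitespace. The struct rb_id type, the rb_parse() helper, and the sample log line in the note are hypothetical and not part of this patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the baseline identifier fields. */
struct rb_id {
	char     pool[37];  /* printed pool UUID (assumed to contain no '/') */
	unsigned ver;       /* rebuild version */
	unsigned gen;       /* rebuild generation */
	char     op[32];    /* trailing %s field of the baseline format */
};

/*
 * Find the first "rb=" token in a log line and parse the four
 * baseline fields. Returns 1 on success, 0 if there is no
 * well-formed identifier on the line.
 */
static int rb_parse(const char *line, struct rb_id *out)
{
	const char *p = strstr(line, "rb=");

	if (p == NULL)
		return 0;

	/* Field widths keep sscanf within the buffers declared above. */
	return sscanf(p, "rb=%36[^/]/%u/%u/%31s",
		      out->pool, &out->ver, &out->gen, out->op) == 4;
}
```

For example, on a made-up line such as "INFO rb=12345678/5/2/Rebuild ld=3/7 scan done", rb_parse() would fill in pool "12345678", version 5, generation 2, and op "Rebuild". A real tool would likely also want to guard against "rb=" appearing inside some other token.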
Various DP_RB_* and DP_RBF_* macros are defined to specify the arguments to go with the DF_RB and DF_RBF formats, given some common rebuild implementation structures such as:
struct rebuild_global_pool_tracker
struct rebuild_tgt_pool_tracker
struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
struct migrate_query_arg
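The pairing of a DF_* format macro with a DP_* argument macro can be sketched roughly as follows. This is a standalone illustration, not the code in this patch: DF_UUID and DF_U64 are redefined here as simple stand-ins, and struct demo_tracker, its field names, DP_RB_DEMO, DP_RBF_DEMO, and demo_format() are all hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the DAOS DF_UUID / DF_U64 macros (assumed shapes only). */
#define DF_UUID "%s"
#define DF_U64  "%" PRIu64

/* Formats as quoted in the description above. */
#define DF_RB  "rb=" DF_UUID "/%u/%u/%s"
#define DF_RBF DF_RB " ld=%u/" DF_U64

/* Hypothetical tracker; the real rebuild structures differ. */
struct demo_tracker {
	const char *pool_uuid;   /* already-printed pool UUID */
	unsigned    rebuild_ver;
	unsigned    rebuild_gen;
	const char *op_str;
	unsigned    leader_rank;
	uint64_t    leader_term;
};

/* Argument macros expand to the values matching DF_RB / DF_RBF, in order. */
#define DP_RB_DEMO(t) \
	(t)->pool_uuid, (t)->rebuild_ver, (t)->rebuild_gen, (t)->op_str
#define DP_RBF_DEMO(t) \
	DP_RB_DEMO(t), (t)->leader_rank, (t)->leader_term

/* Format one message the way a log call site would. */
static int demo_format(char *buf, size_t len, const struct demo_tracker *t)
{
	return snprintf(buf, len, DF_RBF " scan done", DP_RBF_DEMO(t));
}
```

Keeping the format string and its argument list in matching macros means a call site only names the structure it has at hand, which makes it harder for individual log messages to drift away from the common identifier layout.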
This initial patch covers the pool service leader execution in functions (and those that they invoke) such as:
rebuild_ults()
rebuild_task_ult()
rebuild_leader_start()
rebuild_leader_status_check()
rebuild_leader_status_notify().
And this patch covers "scan side" execution in all pool storage engines (including the leader), in functions such as:
rebuild_tgt_scan_handler()
rebuild_tgt_status_check_ult()
ds_migrate_query_status()
migrate_check_one()
dss_rebuild_check_one()
rebuild_scan_leader()
rebuild_scanner()
rebuild_objects_send_ult()
rebuild_scan_done()
Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade:DAOS-15843
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: