Merge master into feature/fault domain #14368

kjacque · 2024-05-14T20:39:32Z

This is a clean merge of master into the branch.

Use deep stack size for IV ULT. Signed-off-by: Di Wang <di.wang@intel.com>

…13869) When a UNS link points to a container but that container is not accessible, return ENOLINK for the directory. Add a test and fix a crash when this occurs. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

The ds_pool.sp_checkpoint_props_changed bitfield is modified from target xstreams, causing a data race on all surrounding bitfields among the system xstream and all the target xstreams. It is this author's guess that such data races have likely led to the pool destroy timeouts caused by pool_fetch_hdls_ult_abort hangs reported in the Jira ticket.

Here is how the hang happened during one pool destroy timeout:

    31:54.85 pool_fetch_hdls_ult() b262bfcf: begin: fetch_hdls=1 stopping=0
    31:54.85 pool_fetch_hdls_ult() b262bfcf: waiting for map
    32:03.96 pool_fetch_hdls_ult() b262bfcf: fetching handles
    32:03.96 pool_fetch_hdls_ult() b262bfcf: signaling done
    32:03.96 pool_fetch_hdls_ult() b262bfcf: end
    38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: begin: fetch_hdls=1 stopping=1
    38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: signaled
    38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: waiting for ULT

The ULT had exited at 32:03.96, when it should have set the ds_pool.sp_fetch_hdls bitfield to 0. More than 6 minutes later, pool_fetch_hdls_ult_abort found ds_pool.sp_fetch_hdls to be 1 and started waiting for the ULT to exit!

The theory is that when the ULT was setting sp_fetch_hdls to 0 on the system xstream, a target xstream happened to be executing update_vos_prop_on_targets, which was setting sp_checkpoint_props_changed at the same time. The latter read sp_fetch_hdls == 1 before the ULT set the field to 0 and, after the ULT had set sp_fetch_hdls to 0, wrote sp_fetch_hdls == 1 back, causing the ULT's write to be lost.

This patch avoids the data race by replacing the ds_pool.sp_checkpoint_props_changed bitfield with a read-only collective parameter.

Signed-off-by: Li Wei <wei.g.li@intel.com>
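The lost update in this theory can be replayed deterministically. The following is only an illustrative sketch, not DAOS code: it models two one-bit flags packed into a single storage word, the way adjacent C bitfields typically are, and steps through the suspected interleaving by hand. The bit positions and the function name are made up for the illustration.

```python
# Two flags sharing one storage word, as adjacent C bitfields typically do.
# Bit positions are arbitrary and chosen only for this illustration.
FETCH_HDLS = 0x1                 # stands in for sp_fetch_hdls
CHECKPOINT_PROPS_CHANGED = 0x2   # stands in for sp_checkpoint_props_changed


def replay_lost_update():
    """Replay the suspected interleaving and return the final word."""
    word = FETCH_HDLS  # sp_fetch_hdls == 1, sp_checkpoint_props_changed == 0

    # Target xstream: first half of its read-modify-write (load the word).
    target_copy = word

    # System xstream: the ULT clears sp_fetch_hdls and stores the word.
    word = word & ~FETCH_HDLS

    # Target xstream: second half (set its bit in the stale copy, store).
    word = target_copy | CHECKPOINT_PROPS_CHANGED

    return word


final = replay_lost_update()
print(bool(final & FETCH_HDLS))  # True: the ULT's write was lost
```

Because a write to one bitfield is a read-modify-write of the whole shared word, the only safe options are to keep all writers on one xstream or to move the field out of the shared word, which is what the patch does.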

Signed-off-by: Justin Zhang <juszhan@google.com>

The Prometheus exporter is missing a few stats metrics that would make some things easier to graph:

* sum
* sample_size
* sum_of_squares

Fixes the Min/Max/Sum methods to return uint64, as this is the underlying data type. Callers should adjust as necessary.

Signed-off-by: Michael MacDonald <mjmac@google.com>
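As a hedged illustration of why these three values are useful to export: a consumer can derive the mean and standard deviation from them without access to the individual samples. The function name is hypothetical, and the choice of population (rather than sample) standard deviation is an assumption of this sketch.

```python
import math


def mean_and_stddev(total, sample_size, sum_of_squares):
    """Derive the mean and population standard deviation from running sums."""
    if sample_size == 0:
        return 0.0, 0.0
    mean = total / sample_size
    # Var(X) = E[X^2] - (E[X])^2; clamp at 0 to absorb rounding error.
    variance = max(sum_of_squares / sample_size - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Samples 2, 4, 4, 4, 5, 5, 7, 9: sum = 40, count = 8, sum of squares = 232.
print(mean_and_stddev(40, 8, 232))  # (5.0, 2.0)
```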

Bump version to 2.5.101 Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

Contain the fixes for CID: 1965549/1965550/1972512 Signed-off-by: Fan Yong <fan.yong@intel.com>

Backporting control plane related internal changes from the feature/multiprovider branch. These changes affect internals only, and not user interfaces.

- Updated protobuf structures to recognize secondary providers in a backward-compatible way.
- Updated libdaos network config logic to pass values from env variables to the agent, to allow better decision making when choosing a network interface.
- Add support for multiple providers to control plane internal structures, including config file structures. They are intended to be invisible to users until we enable the feature throughout the stack.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>

Meson "setup" sets up a package for building; "meson configure" sets a configuration option but does not do the setup. Previously our code would run setup and then configure, which would set configuration options but not apply them. Ninja has file age checking built in, so if the config file was older than the build file it would re-run setup to apply the correct config. This happened most of the time, so the build would work, but occasionally the file timestamps would be the same, so the check would not fail and the build would run without the configuration options applied. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

To work around SRE-471, increase the dfuse/mu_perms.py test timeout by 60 seconds. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

Include a minimum revision for golang Signed-off-by: Maureen Jean <maureen.jean@intel.com>

Since these are always going to be a single level of nested lists, we don't need the more complex _flatten() helper. Signed-off-by: Michael MacDonald <mjmac@google.com>
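A small sketch of the trade-off described here, under assumptions: the metric names below are invented, and `flatten` is a generic stand-in for a recursive helper, not the `_flatten()` function that was removed. For lists nested exactly one level deep, unpacking with the splat operator gives the same result as a recursive flatten.

```python
def flatten(items):
    """Recursive helper of the kind that arbitrary nesting would require."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# Hypothetical metric-name groups, nested exactly one level deep.
latency = ["latency_min", "latency_max"]
size = ["size_sum", "size_sample_size"]

with_helper = flatten([latency, size])
with_splat = [*latency, *size]  # unpack each inner list in place

print(with_splat == with_helper)  # True
```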

…13959) Signed-off-by: Lei Huang <lei.huang@intel.com> Co-authored-by: Ashley Pittman <ashley.m.pittman@intel.com>

The VOS API supports combining multiple VOS operations into a single WAL commit for efficiency. The primary use case is RDB (see DAOS-11406) Signed-off-by: Jan Michalski <jan.michalski@intel.com> Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com> Co-authored-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com> Co-authored-by: Oksana Salyk <oksana.salyk@intel.com>

Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>

Update to libfabric 1.19.1 Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>

Add pool collective function to skip DOWN and DOWNOUT targets. Signed-off-by: Di Wang <di.wang@intel.com> Co-authored-by: Niu Yawei <yawei.niu@intel.com>

As requested by the Jira ticket, add a new I/O forwarding mechanism, dss_chore, to avoid creating a ULT for every forwarding task.

- Forwarding of object I/O and DTX RPCs is converted to chores.
- Cancelation is not implemented, because the I/O forwarding tasks themselves do not support cancelation yet.
- In certain engine configurations, some xstreams do not need to initialize dx_chore_queue. This is left to future work.

Signed-off-by: Li Wei <wei.g.li@intel.com>

Add missing comment for exported go function. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

Add more mock functions for dtx_tests. Signed-off-by: Niu Yawei <yawei.niu@intel.com>

When the WAL SSD is faulty, WAL commit will always fail and the last committed tx ID won't be bumped anymore. The checkpoint ULT shouldn't wait on tx commit in such a case; otherwise, the checkpoint ULT will never be woken up, and pool_child_stop() will be blocked on stopping the checkpoint ULT. Signed-off-by: Niu Yawei <yawei.niu@intel.com>

Allow the suid and sgid bits to be stored in dfs_osetattr. Even if libdfs does not support those bits, it allows dfuse to support them via the kernel. The lack of sgid support causes spack to fail over dfuse, as reported in the Jira ticket. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

When using server target, daos_metrics wasn't built because it was buried under a check for client target. I really need to figure out a better way to specify targets but this will fix the immediate issue. Signed-off-by: Jeff Olivier <jeffolivier@google.com>

Bumps google.golang.org/protobuf from 1.30.0 to 1.33.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…red. (#14049) This avoids a long-standing but previously unknown issue where the build directory was in LD_LIBRARY_PATH when running gcc etc. Update site_scons to not set LD_LIBRARY_PATH for all commands launched during the build but rather only set it for the step where the dmg man pages are generated. This impacts the daos_build test, which has previously always needed to run the daos_build with --jobs=1; with this change the build can be run in parallel, which reduces the run-time of this test by a third. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>

Adding tests for WAL commit, reply, and checkpoint metrics. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

A single dkey migration may exceed mpt_inflight_max_size. In that case the original code could leave the dkey migrate ULT in a dead loop, so rebuild could not complete. Example log: "migrate_one_ult() mrone 0x7f3c91fe1ec0 wait start 0/33554432". In that case the ULT waits again after every wakeup until shutdown. Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
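The commit message does not spell out the fix, so the following is only one plausible way to avoid the endless wait, written as a sketch with hypothetical names: admit a request when nothing is in flight, even if the request alone is larger than the limit, since waiting cannot free up any more room.

```python
def must_wait(inflight_size, request_size, max_size):
    """Return True if a new dkey migration should wait for in-flight work.

    Waiting only makes sense when something is in flight to wait for; a
    request larger than max_size is admitted once the window is empty.
    """
    if inflight_size == 0:
        return False
    return inflight_size + request_size > max_size


MAX_SIZE = 33554432  # the limit shown in the example log line

print(must_wait(0, MAX_SIZE + 1, MAX_SIZE))  # False
print(must_wait(MAX_SIZE // 2, MAX_SIZE, MAX_SIZE))  # True
```

Without the empty-window check, the first call would return True forever, which matches the "wait start 0/33554432" symptom in the log.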

daosbuild1 · 2024-05-16T15:58:25Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/2/execution/node/1201/log

Add a GHA linting summary job to aggregate the status of linting checks. Allows branch protections to rely on just the summary. New jobs in the workflow do not require branch protections to be updated. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Add more unit tests for tags.py Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Update pylint from 3.1.1 -> 3.2.0 and resolve new warnings. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Create a dedicated event queue for each aio context Signed-off-by: Lei Huang <lei.huang@intel.com>

Signed-off-by: Maureen Jean <maureen.jean@intel.com>

rsvc_tests should use tenv instead of unit_env. Signed-off-by: Li Wei <wei.g.li@intel.com>

CID 2555629, 2555628, 2555602, 2555600 Signed-off-by: Lei Huang <lei.huang@intel.com>

…14370) Some DTX-related fields in the ds_cont_child structure are initialized via dtx_cont_register(), which may be skipped under check mode, so some subsequent logic may access uninitialized members. Move the initialization of such fields into cont_child_alloc_ref(). Skip DTX resync and rebuild under check mode. Signed-off-by: Fan Yong <fan.yong@intel.com>

…14342) Use CRT_TIMEOUT=10 when destroying containers since it is expected to time out. Signed-off-by: Padmanabhan <ravindran.padmanabhan@intel.com>

Skip the ftest tags.py githook when the python3 yaml module is not installed. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Set the proto query default timeout to 3 seconds initially and increase it as we try other targets so we can get going more quickly when a rank is down. Signed-off-by: Jeff Olivier <jeffolivier@google.com>
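The growing timeout can be pictured with a short sketch. Only the 3-second starting value comes from the commit message; the linear step, the cap, and the function name are assumptions made for illustration and may not match what cart actually does.

```python
def query_timeouts(num_targets, initial=3, step=3, cap=60):
    """Yield one timeout in seconds per target tried, growing up to cap."""
    timeout = initial
    for _ in range(num_targets):
        yield timeout
        timeout = min(timeout + step, cap)


print(list(query_timeouts(4)))  # [3, 6, 9, 12]
```

Starting small means a down rank costs only a few seconds before the next target is tried, while later attempts still get a more generous budget.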

The lack of an explicit link to libgurt in cgo was causing failures in the Ubuntu build. Signed-off-by: Kris Jacque <kris.jacque@intel.com>

daosbuild1 · 2024-05-17T21:28:33Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/3/execution/node/1201/log

Rather than blocking vos_obj_discard entirely when discard or aggregation are running, let's block it only when there is an actual conflict on the object being discarded.

* Fix log messages to specify EC or VOS aggregation
* Add metrics for conflicts

Signed-off-by: Jeff Olivier <jeffolivier@google.com>

Setting checkpoint properties on pool create was not actually setting the properties. Signed-off-by: Jeff Olivier <jeffolivier@google.com> Co-authored-by: Phil Henderson <phillip.henderson@intel.com>

* DAOS-15420 pool: Clean up ds_pool_svc_<op>

Convert

    ds_pool_svc_check_evict
    ds_pool_svc_query_target
    ds_pool_svc_get_prop
    ds_pool_svc_set_prop
    ds_pool_svc_target_update_state
    ds_pool_svc_update_acl
    ds_pool_svc_delete_acl
    ds_pool_svc_upgrade
    ds_pool_extend

to the dsc_pool_svc_call framework, so that they will

- time out, instead of hanging forever, if PSs are unavailable, and
- respond much faster in common cases thanks to exponential backoffs.

The req_time variable in dsc_pool_svc_call is part of the operation identifier, and should therefore retain its value across retries.

Signed-off-by: Li Wei <wei.g.li@intel.com>
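The two properties named above (a bounded retry loop with exponential backoff, and a req_time that stays fixed across retries) can be sketched as follows. This is not the dsc_pool_svc_call implementation; the function, its parameters, and the `send` callback are hypothetical and exist only to show the shape of the loop.

```python
import time


def call_with_retries(send, deadline_s=5.0, base_backoff_s=0.01,
                      max_backoff_s=1.0, clock=time.monotonic,
                      sleep=time.sleep):
    """Retry send(req_time) with exponential backoff until deadline_s."""
    req_time = clock()          # chosen once; identifies the operation
    deadline = req_time + deadline_s
    backoff = base_backoff_s
    while True:
        if send(req_time):
            return True
        if clock() + backoff > deadline:
            return False        # time out instead of hanging forever
        sleep(backoff)
        backoff = min(backoff * 2, max_backoff_s)


attempts = []


def flaky_send(req_time):
    """Pretend service call that succeeds on the third try."""
    attempts.append(req_time)
    return len(attempts) >= 3


ok = call_with_retries(flaky_send)
print(ok, len(set(attempts)))   # True 1
```

Every retry carries the same req_time, so a receiver that has already seen the operation can recognize the retry as a duplicate rather than a new request.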

- Ensure the requested user/group exists before setting it.
- Add a second API, daos_cont_set_owner_no_check(), for the case where the new owner/group can't be verified locally.
- Modify daos_test to verify both check and no_check cases.
- Add --no-check flag to daos cont set-owner.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>

…es (#14379) Some control-plane storage unit tests are inadvertently calling into test runner host OS filesystem. Fix by consolidating SystemProvider interface and applying storage subsystem provider stubs by default in the unit test framework. As a result coverage can be improved by exercising a greater number of code paths. Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

…ts (#14295) Some tests occasionally fail to start servers due to insufficient available memory in CI, caused by leftover DAOS mount points from a previous test. Add an option to launch.py to provide a filter which, if specified, will be used to umount and remove the directory for any mounted tmpfs filesystems matching the filter. When using --mode=ci the filter will be set to /mnt/daos. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
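To make the selection step concrete, here is a minimal sketch of picking tmpfs mount points that match a filter. It is not the launch.py implementation: the function name is hypothetical, the input is assumed to be text in /proc/mounts format, and treating the filter as a path prefix is an assumption, since the commit message does not say how matching is done. The sketch only selects candidates; it does not unmount or remove anything.

```python
def tmpfs_mounts_matching(mount_table, mount_filter):
    """Return tmpfs mount points whose path starts with mount_filter.

    mount_table is text in /proc/mounts format:
    "<device> <mount point> <fs type> <options> <dump> <pass>" per line.
    """
    matches = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        if fs_type == "tmpfs" and mount_point.startswith(mount_filter):
            matches.append(mount_point)
    return matches


SAMPLE = """\
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /mnt/daos tmpfs rw,relatime,size=16777216k 0 0
tmpfs /mnt/daos1 tmpfs rw,relatime,size=16777216k 0 0
/dev/sda1 /mnt/data ext4 rw,relatime 0 0
"""

print(tmpfs_mounts_matching(SAMPLE, "/mnt/daos"))  # ['/mnt/daos', '/mnt/daos1']
```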

Moving master to 2.7 test builds. Bump version to 2.7.100. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

daosbuild1 · 2024-05-22T00:37:08Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/5/execution/node/1201/log

daosbuild1 · 2024-05-22T15:43:21Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/6/execution/node/458/log

Use crt_timeout: 10 for rebuild/basic, to restore config prior to PR #13997. This reduces the test time drastically. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Update pylint to 3.2.2 Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Signed-off-by: Lei Huang <lei.huang@intel.com>

…4297) As a convenience, provide a "streamlined" version of the pool query that only performs the minimum amount of work to query the pool's health. Practically speaking, this means that it will query for disabled ranks and omit the space query, which is expensive. Signed-off-by: Michael MacDonald <mjmac@google.com>

daosbuild1 · 2024-05-22T23:39:42Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/7/execution/node/1201/log

wangdi1 and others added 30 commits March 18, 2024 11:58

DAOS-15395 pool: use deep stack size for IV ult (#13946)

f04ef9d

Use deep stack size for IV ULT. Signed-off-by: Di Wang <di.wang@intel.com>

DAOS-15323 common: fix explicit null dereferenced (#13987)

c394274

Signed-off-by: Justin Zhang <juszhan@google.com>

DAOS-15438 build: Create 2.6 TB1 (#14000)

313fa74

Bump version to 2.5.101 Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15316 object: fix some coverity issues (#13905)

adf05dc

Contain the fixes for CID: 1965549/1965550/1972512 Signed-off-by: Fan Yong <fan.yong@intel.com>

SRE-2105 cq: Add permissions metatdata for GHA files. (#14010)

278dcd0

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

DAOS-15399 test: Fix collecting errors in steps.log (#14024)

357b18d

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15402 test: Increase dfuse/mu_perms.py timeout (#14025)

7dcab1f

To work around SRE-471, increase the dfuse/mu_perms.py test timeout by 60 seconds. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15492 build: Add minimum go version (#14031)

262a296

Include a minimum revision for golang Signed-off-by: Maureen Jean <maureen.jean@intel.com>

DAOS-15425 telemetry: Use splat for stats metrics lists (#14013)

ca301c2

Since these are always going to be a single level of nested lists, we don't need the more complex _flatten() helper. Signed-off-by: Michael MacDonald <mjmac@google.com>

DAOS-14344 client: intercept telldir, seekdir, rewinddir, scandirat (#…

33a526f

…13959) Signed-off-by: Lei Huang <lei.huang@intel.com> Co-authored-by: Ashley Pittman <ashley.m.pittman@intel.com>

DAOS-14669 test: switch tcp;ofi_rxm testing to tcp (#13365)

e2083fb

Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>

DAOS-15433 build: update to libfabric 1.19.1 (#14018)

c607541

Update to libfabric 1.19.1 Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>

DAOS-15145 pool: add pool collective function (#13764)

8a32ae2

Add pool collective function to skip DOWN and DOWNOUT targets. Signed-off-by: Di Wang <di.wang@intel.com> Co-authored-by: Niu Yawei <yawei.niu@intel.com>

DAOS-7674 pool: add comment to exported go function (#14022)

bec41ab

Add missing comment for exported go function. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

DAOS-15145 test: add mock functions for dtx_tests (#14047)

afcc376

Add more mock functions for dtx_tests. Signed-off-by: Niu Yawei <yawei.niu@intel.com>

DAOS-15509 dfs: update POSIX compliance documentation (#14060)

590c0c7

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>

DAOS-11626 test: Adding MD on SSD metrics tests (#13661)

ac0902a

Adding tests for WAL commit, reply, and checkpoint metrics. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

daltonbohning and others added 12 commits May 16, 2024 10:24

DAOS-15228 test: add more unit tests for tags.py (#14309)

e93ab3d

Add more unit tests for tags.py Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15830 cq: update pylint to 3.2.0 (#14365)

2d14c91

Update pylint from 3.1.1 -> 3.2.0 and resolve new warnings. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15803 client: simplify aio code in pil4dfs (#14363)

2c639ee

Create a dedicated event queue for each aio context Signed-off-by: Lei Huang <lei.huang@intel.com>

DAOS-15848 test: Add module use to soak job scripts for aurora (#14376)

a2d0e91

Signed-off-by: Maureen Jean <maureen.jean@intel.com>

DAOS-15846 common: Use tenv for rsvc_tests (#14369)

a88f116

rsvc_tests should use tenv instead of unit_env. Signed-off-by: Li Wei <wei.g.li@intel.com>

DAOS-15831 client: fix a couple of coverity issues (#14364)

a395d83

CID 2555629, 2555628, 2555602, 2555600 Signed-off-by: Lei Huang <lei.huang@intel.com>

DAOS-15785 test: Fix erasurecode/online_rebuild.py test times out. (#…

94641ea

…14342) Use CRT_TIMEOUT=10 when destroying containers since it is expected to time out. Signed-off-by: Padmanabhan <ravindran.padmanabhan@intel.com>

DAOS-623 cq: skip ftest githook for missing deps (#14397)

d002f71

Skip the ftest tags.py githook when the python3 yaml module is not installed. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15540 cart: Change proto query default timeout (#14382)

3375a34

Set it to 3 seconds initially and increase it as we try other targets so we can get going more quickly when a rank is down. Signed-off-by: Jeff Olivier <jeffolivier@google.com>

DAOS-15779 control: Explicitly link to libgurt in cgo (#14388)

2d8ff91

Lack of it was causing failures in the Ubuntu build. Signed-off-by: Kris Jacque <kris.jacque@intel.com>

jolivier23 and others added 7 commits May 20, 2024 00:13

DAOS-15384 pool: Checkpoint properties not set (#14380)

39a2930

Setting checkpoint properties on pool create was not actually setting the properties. Signed-off-by: Jeff Olivier <jeffolivier@google.com> Co-authored-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15859 build: Move master to 2.7 test builds. (#14407)

ff4375a

Moving master to 2.7 test builds. Bump version to 2.7.100. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

daltonbohning and others added 4 commits May 22, 2024 11:52

DAOS-15856 test: speedup rebuild/basic (#14398)

0585783

Use crt_timeout: 10 for rebuild/basic, to restore config prior to PR #13997. This reduces the test time drastically. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15860 cq: update pylint to 3.2.2 (#14406)

be6a15d

Update pylint to 3.2.2 Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15834 client: disable interception before exec() (#14405)

6b006d4

Signed-off-by: Lei Huang <lei.huang@intel.com>

jolivier23 merged commit 1d77b96 into feature/fault_domain May 23, 2024
105 of 218 checks passed

jolivier23 deleted the kjacque/fault_domain/merge_20240514 branch May 23, 2024 14:31
