Niu/multiprovider merge #14373

NiuYawei · 2024-05-15T08:02:20Z

Before requesting gatekeeper:

- Two review approvals and any prior change requests have been resolved.
- Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force landed and the forced-landing tag is set.
- Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
- Commit messages follow the guidelines outlined here.
- Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

A change further up in the stack revealed that "ERROR" wasn't accepted as a log mask string at the engine level. Signed-off-by: Kris Jacque <kris.jacque@intel.com>

#14126) The test creates 50 containers for each of the 10 pools 20 times (10 x 50 x 20). Creating many containers serially takes a significant amount of time, so use threads to create the containers in parallel. Tested the speedup of run_test_create_delete() (just this method, not the entire test) with 3 x 50 x 3: it took 732 sec in serial, but only 261 sec in parallel. Also reduce the iterations to 2 and reduce the timeout. Signed-off-by: Makito Kano <makito.kano@intel.com>
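As a rough illustration of the serial-to-parallel change described above, here is a minimal sketch using Python's standard `concurrent.futures.ThreadPoolExecutor`. The `create_container` function is a hypothetical stand-in that only sleeps and returns a label; it is not the test harness API, and the pool and container counts are simply the 3 x 50 case mentioned in the commit message.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def create_container(pool_id, index):
    """Hypothetical stand-in for a container create call; sleeps to mimic latency."""
    time.sleep(0.01)
    return f"pool{pool_id}-cont{index}"


def create_containers_parallel(pool_ids, per_pool, max_workers=8):
    """Create per_pool containers in every pool using a pool of worker threads."""
    jobs = [(pool_id, index) for pool_id in pool_ids for index in range(per_pool)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in submission order and re-raises a worker's
        # exception when that worker's result is consumed.
        return list(executor.map(lambda job: create_container(*job), jobs))


containers = create_containers_parallel(pool_ids=range(3), per_pool=50)
```

Threads suit this case because each create call mostly waits on I/O, so the waits overlap instead of adding up.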

Create a separate pre-read buffer so that it is not tied to the kernel buffer size; use 4Mb as the size threshold for pre-read. Pre-allocate buffers at startup, not on first read. Change to on-by-default for fresh directories. Do not disable if a file is opened but not read from - this could be the kernel cache doing its job. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

Update actions/upload-artifact used in ossf-scorecard due to deprecation notice. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Signed-off-by: Joseph Moore <joseph.moore@intel.com>

to fix coverity issue 2555533 Signed-off-by: Lei Huang <lei.huang@intel.com>

…14190) Use TestPool.get_space_per_target instead of the pydaos.raw API call. Remove the no longer used pydaos.raw target_query and supporting code. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

1. Return the real error rather than ENOMEM, which is very confusing. 2. Skip pools that are not started when creating migrating pools. 3. Skip up targets when updating cont prop. Signed-off-by: Wang Shilong <shilong.wang@intel.com>

Remove references to wiki, jira and other links that are now on daos.io. Merge cloud content to installation section. Update Copyright to 2024. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

Fix path walk of the pil4dfs dentry cache. Fix the pil4dfs rename() function. Add an enable/disable feature for the pil4dfs dentry cache. Add a new functional test of the pil4dfs dentry cache. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com> Co-authored-by: Lei Huang <lei.huang@intel.com>

required_src was added to avoid conflicts on the file during feature development. It is not necessary any longer (and wrong since ddb has moved from src to src/utils now). Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

When setting these previously, I thought they only appeared in debugger output, so they have names which are only meaningful in that context. However, the thread names are also visible in ps and top, and having a process called "main" does not make sense here. Do not rename the main dfuse thread, and use a dfuse prefix for other thread names. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

That will drop partial modification, remove the pinned DTX entry, evict related stale cache. Signed-off-by: Fan Yong <fan.yong@intel.com>

Avoid using storage: auto on vm tests until DAOS-15233 can be addressed. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

Add registering calls for a container destroy for each TestContainer object created by the test. Using the register cleanup method will ensure proper order of operations when tearing down the test case. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
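The ordering benefit of registering cleanups can be sketched with Python's standard `contextlib.ExitStack`, which runs registered callbacks in reverse registration order. This is only an illustration of the idea: the event strings are hypothetical, and it is an assumption that the test harness's register cleanup method behaves in the same last-in, first-out way.

```python
from contextlib import ExitStack

events = []


def run_test():
    """Register a destroy step right after each create step."""
    with ExitStack() as cleanups:
        events.append("create pool")
        cleanups.callback(events.append, "destroy pool")
        events.append("create container")
        cleanups.callback(events.append, "destroy container")
        events.append("run test body")
    # Leaving the with-block runs the callbacks last-in, first-out,
    # so the container is destroyed before its pool.


run_test()
```

Registering the destroy step at creation time means the teardown order follows from the creation order, rather than being hand-maintained in tearDown.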

vos: Add version param to pool create

In a DAOS pool using the old pool global version, we need to create new VOS pools using the old DF version. See the Jira ticket for the details. This patch adds a version parameter to vos_pool_create and vos_pool_create_ex.

rsvc: Create rsvc with VOS DF version (#14156)

If a pool with an old layout version is served by a DAOS version with a new default layout version, for instance, a 2.4-layout pool served by DAOS 2.5, then any new VOS pools created for this DAOS pool must use the old layout, or downgrading back to the old DAOS version would become impossible. Signed-off-by: Li Wei <wei.g.li@intel.com>

- SEP is currently not supported by any active provider.
- Remove how we expose SEP, as its setting is based on sockets provider limitations.

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com> Co-authored-by: Kris Jacque <kris.jacque@intel.com>

…ce stats (#14168) NEW devices should be ignored rather than causing a failure; this situation occurs when the number of targets is less than the number of SSDs. Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

Add missing ':avocado: recursive' from test class docstrings. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

Support filenames with spaces when generating stack traces from core files detected after running tests. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
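A minimal sketch of the quoting idea, using Python's standard `shlex.quote`. The function name, the gdb invocation, and the example paths are assumptions for illustration; the command the test harness actually runs may differ.

```python
import shlex


def build_stacktrace_command(exe_path, core_path):
    """Return a shell command string with both paths safely quoted."""
    # Hypothetical gdb invocation; shlex.quote() wraps any path containing
    # spaces (or other shell-special characters) in single quotes.
    return (
        "gdb -batch -ex 'thread apply all bt' "
        f"{shlex.quote(exe_path)} {shlex.quote(core_path)}"
    )


command = build_stacktrace_command("/usr/bin/daos_engine", "/tmp/core file.1234")
```

Without the quoting, the shell would split `/tmp/core file.1234` into two arguments and the core file would not be found.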

- Fix mem leak for coverity 2555536 Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>

The coverity tool gets confused about the use of assert in the debug version of the logging macros and can think a lock is being unlocked twice, which it reports as an API usage error. Disable the complex macros for coverity to reduce the instances of false positives in the tool. Fixes coverity ID 1975167 and others. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

Inject more faults in the non-baseline workload loops (fault rates change from 10% / 20% to 33% / 50%), so there is more separation between the baseline loop timing and the fault-injection loop timing. Also, turn down engine logging during execution of the timed metadata workloads in co_op_dup_timing(). Restore the originally-configured setting after the timed operations. This is done with the addition of a new test dmg helper function, dmg_server_set_logmasks(), called from co_op_dup_timing(). Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>

Originally, the parameters "-g 11 -t 7 -o 3 -a 3 -d 3" for daos_gen_io_conf generate 437 cmd lines that include 54 exclude/add cmds, each of which triggers one rebuild. The total time of 2100 seconds is possibly not enough to run those cmds (most of the time is spent on the 54 rebuilds). This patch reduces the parameters to "-g 11 -t 4 -o 3 -a 2 -d 2", which generates 181 cmd lines including 24 exclude/rebuild cmds, to reduce testing time. Reduce the timeout value accordingly. Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

The recovery/container_list_consolidation.py test orphans a container so we need to indicate to the TestContainer object that we don't need to call a daos container destroy during tearDown. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

Fixes coverity ID 2555535 Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

"dmg system cleanup" will clean up the pools and containers, so skip teardown cleanup. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

CID: 2555531 Unchecked return value Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

…#14239) For certain situations a zero value NVMe namespace ID will be returned in dmg output, in this case it should be omitted from display output as valid values are non-zero. Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
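The display rule described above can be sketched as follows. This is an illustration only: the function name, the dictionary keys, and the output labels are hypothetical and do not reflect the actual dmg output code.

```python
def describe_device(device):
    """Return a one-line summary, omitting the namespace ID when it is zero."""
    fields = [f"UUID:{device['uuid']}"]
    # Valid namespace IDs are non-zero, so a zero value is treated as "not set"
    # and left out of the displayed line.
    if device.get("namespace_id", 0) != 0:
        fields.append(f"Namespace:{device['namespace_id']}")
    return " ".join(fields)


with_id = describe_device({"uuid": "abc", "namespace_id": 1})
without_id = describe_device({"uuid": "abc", "namespace_id": 0})
```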

In the SV overwrite case, btr_update_record() defers freeing the original record and allocates a new record to replace it; however, btr_node_tx_add() is mistakenly skipped in btr_update(), which leads to: 1. In md-on-ssd mode, tree node changes are missing from the WAL. 2. In pmem mode, the tree node snapshot is missing from the undo log. Signed-off-by: Niu Yawei <yawei.niu@intel.com>

The group argument was removed by #14201. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Adds a new /svc group under each pool which contains the following set of metrics:

* leader (gauge): Current pool service leader rank
* map_version (counter): Current pool map version
* open_pool_handles (gauge): Current count of open handles
* total_ranks (gauge): Number of ranks in pool map
* degraded_ranks (gauge): Number of ranks with disabled targets
* total_targets (gauge): Number of targets in pool map
* disabled_targets (gauge): Number of targets marked disabled
* draining_targets (gauge): Number of targets in draining state

For non-leader ranks, the service metrics will have zero values. Telemetry consumers may positively identify the current leader by checking the value of map_version, which will always be non-zero for the leader. Signed-off-by: Michael MacDonald <mjmac@google.com>
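A telemetry consumer could apply the map_version rule along these lines. This is a minimal sketch: the function name and the sample values are hypothetical, and it assumes the /svc metrics have already been collected into a dictionary keyed by rank.

```python
def find_leader(svc_metrics_by_rank):
    """Return the rank reporting a non-zero map_version, or None if unclear."""
    candidates = [
        rank
        for rank, metrics in svc_metrics_by_rank.items()
        if metrics.get("map_version", 0) != 0
    ]
    # Exactly one rank should report a non-zero map_version; anything else
    # (none, or several during a leadership change) is treated as unknown.
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical sample: rank 1 is the pool service leader.
sample = {
    0: {"map_version": 0, "open_pool_handles": 0},
    1: {"map_version": 7, "open_pool_handles": 3},
    2: {"map_version": 0, "open_pool_handles": 0},
}
leader_rank = find_leader(sample)
```

Checking map_version rather than the leader gauge avoids ambiguity, since a zero in the leader gauge could mean either "rank 0 is the leader" or "this rank is not the leader".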

…erge Required-githooks: true

github-actions · 2024-05-15T08:02:39Z

Errors are: component not formatted correctly, ticket number prefix incorrect, PR title is malformatted (see https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments), unable to load ticket data.
https://daosio.atlassian.net/browse/Niu/multiprovider

tanabarr

Most Go changes seem to be related to recent PRs having landed to master. No strange conflict-related issues noticed. We should be careful that no unintended reverts slip through when the feature branch is merged into master. Go changes LGTM.

daltonbohning

-1 just so we make a decision.
@jolivier23 This is a good candidate for "Create a merge commit" instead of "Squash and merge", right?
The benefit to a merge commit is feature/multiprovider will contain the exact same commit SHAs as master, so doing a diff between feature/multiprovider and master makes more sense.

kjacque · 2024-05-15T16:19:42Z

> -1 just so we make a decision.
> @jolivier23 This is a good candidate for "Create a merge commit" instead of "Squash and merge", right?
> The benefit to a merge commit is feature/multiprovider will contain the exact same commit SHAs as master, so doing a diff between feature/multiprovider and master makes more sense.

This is what I was planning to do. Glad to see that ability was added to our repo!

frostedcmos

cart changes are minimal and lgtm

kjacque and others added 30 commits April 23, 2024 14:06

DAOS-15686 gurt: Accept ERROR as a log mask string (#14211)

899cdd2

A change further up in the stack revealed that "ERROR" wasn't accepted as a log mask string at the engine level. Signed-off-by: Kris Jacque <kris.jacque@intel.com>

DAOS-623 cq: update actions/upload-artifact version (#14227)

28f68f6

Update actions/upload-artifact used in ossf-scorecard due to deprecation notice. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15490 test: Change ucx log level for offline drain test. (#14075)

b329ce1

Signed-off-by: Joseph Moore <joseph.moore@intel.com>

DAOS-15721 container: remove unneeded code (#14229)

5ace0a6

to fix coverity issue 2555533 Signed-off-by: Lei Huang <lei.huang@intel.com>

DAOS-15646 test: update target_query.py to use get_space_per_target (#…

4b441ff

…14190) Use TestPool.get_space_per_target instead of the pydaos.raw API call. Remove the no longer used pydaos.raw target_query and supporting code. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15352 rebuild: fix few bugs (#14097)

683d3c2

1. Return the real error rather than ENOMEM, which is very confusing. 2. Skip pools that are not started when creating migrating pools. 3. Skip up targets when updating cont prop. Signed-off-by: Wang Shilong <shilong.wang@intel.com>

DAOS-8781 doc: remove duplicated info with daos.io (#14063)

38198c3

Remove references to wiki, jira and other links that are now on daos.io. Merge cloud content to installation section. Update Copyright to 2024. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

DAOS-9576 test: remove path to ddb src in ut (#14238)

a7224af

required_src was added to avoid conflicts on the file during feature development. It is not necessary any longer (and wrong since ddb has moved from src to src/utils now). Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

DAOS-15499 dtx: cleanup DTX for failure (#14224)

ddaf6ce

That will drop partial modification, remove the pinned DTX entry, evict related stale cache. Signed-off-by: Fan Yong <fan.yong@intel.com>

DAOS-15648 test: Avoid failures with virtual NVMe (#14233)

d33dc69

Avoid using storage: auto on vm tests until DAOS-15233 can be addressed. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15654 control: Ignore NEW state NVMe devices when processing spa…

57cadb3

…ce stats (#14168) NEW devices should be ignored rather than causing a failure; this situation occurs when the number of targets is less than the number of SSDs. Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

DAOS-15750 test: Missing dfuse/mu_perms.py execution (#14249)

9149d77

Add missing ':avocado: recursive' from test class docstrings. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15747 test: Quote filenames when creating stack traces (#14246)

d8e09d2

Support filenames with spaces when generating stack traces from core files detected after running tests. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-15717 bug: Fix memory leak cid 2555536 (#14231)

1608fab

- Fix mem leak for coverity 2555536 Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>

DAOS-15718 dfuse: Fix invalid read in error path. (#14237)

1e9aaee

Fixes coverity ID 2555535 Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>

DAOS-15768 test: skip cont cleanup in dmg_system_cleanup (#14264)

f7fd7b6

"dmg system cleanup" will clean up the pools and containers, so skip teardown cleanup. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-15723 test: Fix coverity warning 2555531 (#14240)

79f1d62

CID: 2555531 Unchecked return value Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

daltonbohning and others added 3 commits May 14, 2024 11:00

DAOS-15835 test: remove invalid argument (#14362)

985ba72

The group argument was removed by #14201. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

Merge remote-tracking branch 'origin/master' into niu/multiprovider-m…

5cc67c4

…erge Required-githooks: true

NiuYawei requested review from a team as code owners May 15, 2024 08:02

tanabarr approved these changes May 15, 2024

daltonbohning requested changes May 15, 2024

frostedcmos reviewed May 15, 2024

kjacque merged commit 4d1db74 into feature/multiprovider May 16, 2024
61 of 69 checks passed

kjacque deleted the niu/multiprovider-merge branch May 16, 2024 00:13
