Merge master into feature/fault domain #14368
Use deep stack size for IV ULT. Signed-off-by: Di Wang <di.wang@intel.com>
…13869) When a UNS link points to a container but that container is not accessible, return ENOLINK for the directory. Add a test and fix a crash when this occurs. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
The ds_pool.sp_checkpoint_props_changed bitfield is modified from target xstreams, causing a data race on all surrounding bitfields among the system xstream and all the target xstreams. It is this author's guess that such data races have likely led to the pool destroy timeouts caused by pool_fetch_hdls_ult_abort hangs reported in the Jira ticket.

Here is how the hang happened during one pool destroy timeout:

    31:54.85 pool_fetch_hdls_ult() b262bfcf: begin: fetch_hdls=1 stopping=0
    31:54.85 pool_fetch_hdls_ult() b262bfcf: waiting for map
    32:03.96 pool_fetch_hdls_ult() b262bfcf: fetching handles
    32:03.96 pool_fetch_hdls_ult() b262bfcf: signaling done
    32:03.96 pool_fetch_hdls_ult() b262bfcf: end
    38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: begin: fetch_hdls=1 stopping=1
    38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: signaled
    38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: waiting for ULT

The ULT had exited at 32:03.96, when it should have set the ds_pool.sp_fetch_hdls bitfield to 0. More than 6 minutes later, pool_fetch_hdls_ult_abort found ds_pool.sp_fetch_hdls to be 1 and started waiting for the ULT to exit!

The theory is that when the ULT was setting sp_fetch_hdls to 0 on the system xstream, a target xstream happened to be executing update_vos_prop_on_targets, which was setting sp_checkpoint_props_changed at the same time. The latter read sp_fetch_hdls == 1 before the ULT set the field to 0 and, after the ULT had set sp_fetch_hdls to 0, wrote sp_fetch_hdls == 1 back, causing the ULT's write to be lost.

This patch avoids the data race by replacing the ds_pool.sp_checkpoint_props_changed bitfield with a read-only collective parameter.

Signed-off-by: Li Wei <wei.g.li@intel.com>
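The lost update in the theory above can be replayed deterministically. The sketch below is only a model: it treats the storage unit shared by the two bitfields as one integer and steps through the suspected interleaving by hand. The bit positions and the function name are assumptions made for illustration; the real layout of ds_pool is not shown in this commit message.

```python
# Hypothetical bit positions; the real ds_pool layout is not shown here.
FETCH_HDLS = 0x1                # stands in for sp_fetch_hdls
CHECKPOINT_PROPS_CHANGED = 0x2  # stands in for sp_checkpoint_props_changed


def replay_lost_update():
    """Replay the interleaving in which a stale store clobbers the ULT's write."""
    word = FETCH_HDLS  # shared storage unit, sp_fetch_hdls == 1

    # Target xstream: load the whole unit in order to set its own bit.
    target_copy = word

    # System xstream (the ULT): read-modify-write that clears sp_fetch_hdls.
    word = word & ~FETCH_HDLS

    # Target xstream: store its stale copy back with its own bit set.
    word = target_copy | CHECKPOINT_PROPS_CHANGED

    return word


final = replay_lost_update()
fetch_hdls_still_set = bool(final & FETCH_HDLS)
print(fetch_hdls_still_set)
```

Because updating one bitfield rewrites the whole storage unit, the ULT's cleared bit is overwritten by the target xstream's stale copy, which matches the fetch_hdls=1 value seen by pool_fetch_hdls_ult_abort in the log.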
Signed-off-by: Justin Zhang <juszhan@google.com>
The Prometheus exporter is missing a few stats metrics that would make some things easier to graph:
* sum
* sample_size
* sum_of_squares

Fixes the Min/Max/Sum methods to return uint64, as this is the underlying data type. Callers should adjust as necessary.

Signed-off-by: Michael MacDonald <mjmac@google.com>
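One reason these three values are handy for graphing is that a mean and a standard deviation can be derived from them without access to the individual samples. The sketch below shows that derivation. The function name is hypothetical, and the choice of population (rather than sample) standard deviation is an assumption, since the commit message does not say which one DAOS reports.

```python
import math


def mean_and_stddev(total, sample_size, sum_of_squares):
    """Derive the mean and population standard deviation from running sums."""
    if sample_size == 0:
        return 0.0, 0.0
    mean = total / sample_size
    # Var(X) = E[X^2] - E[X]^2; clamp tiny negatives caused by rounding.
    variance = sum_of_squares / sample_size - mean * mean
    return mean, math.sqrt(max(variance, 0.0))


samples = [2, 4, 4, 4, 5, 5, 7, 9]
mean, stddev = mean_and_stddev(
    sum(samples), len(samples), sum(x * x for x in samples)
)
print(mean, stddev)
```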
Bump version to 2.5.101 Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Contain the fixes for CID: 1965549/1965550/1972512 Signed-off-by: Fan Yong <fan.yong@intel.com>
Backporting control plane related internal changes from the feature/multiprovider branch. These changes affect internals only, and not user interfaces.
- Updated protobuf structures to recognize secondary providers in a backward-compatible way.
- Updated libdaos network config logic to pass values from env variables to the agent, to allow better decision making when choosing a network interface.
- Add support for multiple providers to control plane internal structures, including config file structures. They are intended to be invisible to users until we enable the feature throughout the stack.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
Meson "setup" sets up a package for building; "meson configure" sets a configuration option but does not do the setup. Previously our code would run setup, then configure, which would set configuration options but not apply them. Ninja has file age checking built in, so if the config file was older than the build file then it would re-run setup to apply the correct config. This happened most of the time, so the build would work, but occasionally the file timestamps would be the same, the check would not fail, and the build would run without the configuration options applied. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
To work around SRE-471, increase the dfuse/mu_perms.py test timeout by 60 seconds. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Include a minimum revision for golang Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Since these are always going to be a single level of nested lists, we don't need the more complex _flatten() helper. Signed-off-by: Michael MacDonald <mjmac@google.com>
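For exactly one level of nesting, a recursive helper is unnecessary. The sketch below shows one common way to do this in Python; the function name is hypothetical and this is not necessarily the replacement the commit used.

```python
from itertools import chain


def flatten_one_level(nested):
    """Flatten a list of lists by exactly one level, with no recursion."""
    return list(chain.from_iterable(nested))


flat = flatten_one_level([["a", "b"], [], ["c"]])
print(flat)
```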
…13959) Signed-off-by: Lei Huang <lei.huang@intel.com> Co-authored-by: Ashley Pittman <ashley.m.pittman@intel.com>
The VOS API supports combining multiple VOS operations into a single WAL commit for efficiency. The primary use case is RDB (see DAOS-11406) Signed-off-by: Jan Michalski <jan.michalski@intel.com> Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com> Co-authored-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com> Co-authored-by: Oksana Salyk <oksana.salyk@intel.com>
Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
Update to libfabric 1.19.1 Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
Add pool collective function to skip DOWN and DOWNOUT targets. Signed-off-by: Di Wang <di.wang@intel.com> Co-authored-by: Niu Yawei <yawei.niu@intel.com>
As requested by the Jira ticket, add a new I/O forwarding mechanism, dss_chore, to avoid creating a ULT for every forwarding task.
- Forwarding of object I/O and DTX RPCs is converted to chores.
- Cancelation is not implemented, because the I/O forwarding tasks themselves do not support cancelation yet.
- In certain engine configurations, some xstreams do not need to initialize dx_chore_queue. This is left to future work.

Signed-off-by: Li Wei <wei.g.li@intel.com>
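The general idea of replacing one execution context per task with a shared queue of resumable tasks can be sketched as follows. This is a generic cooperative queue written for illustration only; it is not the dss_chore API, and every name in it (forward_rpc, run_chore_queue) is hypothetical.

```python
from collections import deque


def forward_rpc(name, steps, log):
    """Hypothetical chore: a generator that yields whenever it would block."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        yield  # not finished yet; hand control back to the worker


def run_chore_queue(chores):
    """Drive every chore from one worker instead of one thread per task."""
    queue = deque(chores)
    while queue:
        chore = queue.popleft()
        try:
            next(chore)  # advance the chore by one step
        except StopIteration:
            continue  # chore completed; drop it from the queue
        queue.append(chore)  # still in progress; revisit it later


log = []
run_chore_queue([forward_rpc("obj", 2, log), forward_rpc("dtx", 1, log)])
print(log)
```

The worker interleaves the two chores, so neither needs a dedicated stack while it waits.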
Add missing comment for exported go function. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
Add more mock functions for dtx_tests. Signed-off-by: Niu Yawei <yawei.niu@intel.com>
When the WAL SSD is faulty, WAL commit will always fail and the last committed tx ID won't be bumped anymore. The checkpoint ULT shouldn't wait on tx commit in such a case; otherwise, the checkpoint ULT will never be woken up, and pool_child_stop() will be blocked on stopping the checkpoint ULT. Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Allow the suid and sgid bits to be stored in dfs_osetattr. Even if libdfs does not support those bits, it allows dfuse to support them via the kernel. The lack of sgid support causes spack to fail over dfuse, as reported in the Jira ticket. Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
When using server target, daos_metrics wasn't built because it was buried under a check for client target. I really need to figure out a better way to specify targets but this will fix the immediate issue. Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Bumps google.golang.org/protobuf from 1.30.0 to 1.33.0. --- updated-dependencies: - dependency-name: google.golang.org/protobuf dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…red. (#14049) This avoids a long-standing but previously unknown issue where the build directory was in LD_LIBRARY_PATH when running gcc etc. Update site_scons to not set LD_LIBRARY_PATH for all commands launched during the build, but rather only set it for the step where the dmg man pages are generated. This impacts the daos_build test, which has previously always needed to run the daos_build with --jobs=1. With this change the build can be run in parallel, which reduces the run-time of this test by a third. Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Adding tests for WAL commit, reply, and checkpoint metrics. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
A single dkey migration may exceed mpt_inflight_max_size. In that case the original code could leave the dkey migrate ULT in a dead loop, so that rebuild cannot complete. Example log: "migrate_one_ult() mrone 0x7f3c91fe1ec0 wait start 0/33554432". In that case the ULT waits again after every wakeup until shutdown. Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
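The example log shows a wait with nothing in flight (0 of 33554432), which is the situation where a size-capped wait can never be satisfied. The sketch below shows one plausible admission rule that avoids it. The commit message does not describe the actual fix, so the rule and the function name are assumptions; only the 33554432 value comes from the log.

```python
MAX_INFLIGHT_SIZE = 33554432  # limit shown in the example log


def must_wait(inflight_size, request_size, max_size=MAX_INFLIGHT_SIZE):
    """Return True if a migration request should wait for in-flight data to drain."""
    if inflight_size == 0:
        # Nothing is in flight, so waiting cannot free any budget. Admit the
        # request even if it alone exceeds the cap, or it would wait forever.
        return False
    return inflight_size + request_size > max_size


oversized = MAX_INFLIGHT_SIZE + 1
print(must_wait(0, oversized), must_wait(1, MAX_INFLIGHT_SIZE))
```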
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/2/execution/node/1201/log
Add a GHA linting summary job to aggregate the status of linting checks. Allows branch protections to rely on just the summary. New jobs in the workflow do not require branch protections to be updated. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Add more unit tests for tags.py Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Update pylint from 3.1.1 -> 3.2.0 and resolve new warnings. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Create a dedicated event queue for each aio context Signed-off-by: Lei Huang <lei.huang@intel.com>
Signed-off-by: Maureen Jean <maureen.jean@intel.com>
rsvc_tests should use tenv instead of unit_env. Signed-off-by: Li Wei <wei.g.li@intel.com>
CID 2555629, 2555628, 2555602, 2555600 Signed-off-by: Lei Huang <lei.huang@intel.com>
…14370) Some DTX related fields in the ds_cont_child structure are initialized via dtx_cont_register(), which may be skipped under check mode, so some subsequent logic may access uninitialized members. Let's move the initialization of such fields into cont_child_alloc_ref(). Skip DTX resync and rebuild under check mode. Signed-off-by: Fan Yong <fan.yong@intel.com>
…14342) Use CRT_TIMEOUT=10 when destroying containers since it is expected to time out. Signed-off-by: Padmanabhan <ravindran.padmanabhan@intel.com>
Skip the ftest tags.py githook when the python3 yaml module is not installed. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Set it to 3 seconds initially and increase it as we try other targets so we can get going more quickly when a rank is down. Signed-off-by: Jeff Olivier <jeffolivier@google.com>
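A timeout that starts small and grows with each new target can be sketched as a simple schedule. Only the initial value of 3 seconds comes from the commit message; which timeout is meant, the linear step, and the cap are all assumptions made for illustration.

```python
from itertools import islice


def attempt_timeouts(initial=3, step=3, cap=30):
    """Yield a per-attempt timeout in seconds that grows until it hits a cap.

    Only initial=3 comes from the commit message; step and cap are hypothetical.
    """
    timeout = initial
    while True:
        yield timeout
        timeout = min(timeout + step, cap)


first_four = list(islice(attempt_timeouts(), 4))
print(first_four)
```

Starting low means a down rank costs only a few seconds before the next target is tried, while later attempts still get enough time on a slow system.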
Lack of it was causing failures in the Ubuntu build. Signed-off-by: Kris Jacque <kris.jacque@intel.com>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/3/execution/node/1201/log
Rather than blocking vos_obj_discard entirely when discard or aggregation are running, let's block it only when there is an actual conflict on the object being discarded.
* Fix log messages to specify EC or VOS aggregation
* Add metrics for conflicts

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
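The difference between a global block and a per-object conflict check can be sketched with a small tracker. This is an illustrative model only, not the VOS implementation; the class, its methods, and the way conflicts are counted are all hypothetical.

```python
class ObjectActivity:
    """Track which objects currently have a discard or aggregation running."""

    def __init__(self):
        self._busy = {}     # object id -> name of the running operation
        self.conflicts = 0  # stands in for a conflict metric

    def try_begin(self, oid, op):
        """Start op on oid unless another operation already holds that object."""
        if oid in self._busy:
            self.conflicts += 1
            return False
        self._busy[oid] = op
        return True

    def end(self, oid):
        """Mark the operation on oid as finished."""
        self._busy.pop(oid, None)


activity = ObjectActivity()
started_agg = activity.try_begin("obj1", "aggregation")
blocked = activity.try_begin("obj1", "discard")    # same object: conflict
unrelated = activity.try_begin("obj2", "discard")  # other object: proceeds
print(started_agg, blocked, unrelated, activity.conflicts)
```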
Setting checkpoint properties on pool create was not actually setting the properties. Signed-off-by: Jeff Olivier <jeffolivier@google.com> Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
* DAOS-15420 pool: Clean up ds_pool_svc_<op>

Convert
    ds_pool_svc_check_evict
    ds_pool_svc_query_target
    ds_pool_svc_get_prop
    ds_pool_svc_set_prop
    ds_pool_svc_target_update_state
    ds_pool_svc_update_acl
    ds_pool_svc_delete_acl
    ds_pool_svc_upgrade
    ds_pool_extend
to the dsc_pool_svc_call framework, so that they will
- time out, instead of hanging forever, if PSs are unavailable, and
- respond much faster in common cases thanks to exponential backoffs.

The req_time variable in dsc_pool_svc_call is part of the operation identifier, and should therefore retain its value across retries.

Signed-off-by: Li Wei <wei.g.li@intel.com>
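The two properties named above, a bounded wait with exponential backoff and a request time that stays fixed across retries, can be sketched as follows. This is not the dsc_pool_svc_call implementation: the function, its parameters, and the backoff constants are hypothetical, and time is simulated with a counter so the example does not sleep.

```python
def call_with_backoff(send, req_time, deadline, base=1.0, cap=8.0):
    """Retry send(req_time) with exponential backoff until a simulated deadline.

    req_time is passed unchanged on every attempt because it identifies the
    operation. Returns (succeeded, attempts).
    """
    elapsed = 0.0
    delay = base
    attempts = 0
    while True:
        attempts += 1
        if send(req_time):
            return True, attempts
        if elapsed + delay > deadline:
            return False, attempts  # time out instead of retrying forever
        elapsed += delay            # stands in for sleeping `delay` seconds
        delay = min(delay * 2, cap)


seen = []


def flaky(req_time):
    """Hypothetical service call that succeeds on the third attempt."""
    seen.append(req_time)
    return len(seen) >= 3


ok, attempts = call_with_backoff(flaky, req_time=1000, deadline=10.0)
timed_out = call_with_backoff(lambda req_time: False, req_time=1, deadline=10.0)
print(ok, attempts, seen, timed_out)
```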
- Ensure the requested user/group exists before setting it.
- Add a second API, daos_cont_set_owner_no_check(), for the case where the new owner/group can't be verified locally.
- Modify daos_test to verify both check and no_check cases.
- Add --no-check flag to daos cont set-owner.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
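The check and no-check behaviours listed above can be modelled in a few lines. The sketch below is not the DAOS API: set_owner, its parameters, and the dictionary standing in for a container are hypothetical. The default lookup is the standard library's pwd.getpwnam, which raises KeyError for an unknown user; a fake lookup is injected so the example does not depend on the accounts of the local machine.

```python
import pwd


def set_owner(container, user, check=True, lookup=pwd.getpwnam):
    """Set container["owner"], verifying the user locally unless check is False."""
    if check:
        try:
            lookup(user)
        except KeyError:
            raise ValueError(f"user {user!r} does not exist locally") from None
    container["owner"] = user
    return container


def fake_lookup(name):
    """Hypothetical user database that only knows 'alice'."""
    if name != "alice":
        raise KeyError(name)
    return name


checked = set_owner({}, "alice", lookup=fake_lookup)
unchecked = set_owner({}, "remote-user", check=False, lookup=fake_lookup)
try:
    set_owner({}, "remote-user", lookup=fake_lookup)
    rejected = False
except ValueError:
    rejected = True
print(checked, unchecked, rejected)
```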
…es (#14379) Some control-plane storage unit tests are inadvertently calling into the test runner host's OS filesystem. Fix by consolidating the SystemProvider interface and applying storage subsystem provider stubs by default in the unit test framework. As a result, coverage can be improved by exercising a greater number of code paths. Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
…ts (#14295) Some tests occasionally fail to start servers due to insufficient available memory in CI, caused by leftover DAOS mount points from a previous test. Adding an option to launch.py to provide a filter which, if specified, will be used to umount and remove the directory for any mounted tmpfs filesystems matching the filter. When using --mode=ci the filter will be set to /mnt/daos. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
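Selecting the tmpfs mounts that match a filter can be sketched by parsing text in the /proc/mounts format (device, mount point, filesystem type, options). The function name is hypothetical, and treating the filter as a path prefix is an assumption, since the commit message does not say how launch.py matches it. The sketch only selects mount points and does not unmount or remove anything.

```python
def tmpfs_mounts_matching(mounts_text, mount_filter):
    """Return mount points of tmpfs filesystems whose path starts with the filter."""
    matches = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type = fields[:3]
        if fs_type == "tmpfs" and mount_point.startswith(mount_filter):
            matches.append(mount_point)
    return matches


# Sample data in /proc/mounts format, made up for this example.
SAMPLE_MOUNTS = """\
tmpfs /mnt/daos tmpfs rw,relatime,size=1024k 0 0
tmpfs /mnt/daos0 tmpfs rw 0 0
/dev/sda1 /mnt/data ext4 rw 0 0
tmpfs /run tmpfs rw 0 0
"""

found = tmpfs_mounts_matching(SAMPLE_MOUNTS, "/mnt/daos")
print(found)
```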
Moving master to 2.7 test builds. Bump version to 2.7.100. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/5/execution/node/1201/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/6/execution/node/458/log
Use crt_timeout: 10 for rebuild/basic, to restore config prior to PR #13997. This reduces the test time drastically. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Update pylint to 3.2.2 Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Signed-off-by: Lei Huang <lei.huang@intel.com>
…4297) As a convenience, provide a "streamlined" version of the pool query that only performs the minimum amount of work to query the pool's health. Practically speaking, this means that it will query for disabled ranks and omit the space query, which is expensive. Signed-off-by: Michael MacDonald <mjmac@google.com>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/7/execution/node/1201/log
This is a clean merge of master into the branch.