Merge master into feature/fault domain #14368

Merged
merged 3,522 commits into from
May 23, 2024


kjacque
Contributor

@kjacque kjacque commented May 14, 2024

This is a clean merge of master into the branch.

wangdi1 and others added 30 commits March 18, 2024 11:58
Use deep stack size for IV ULT.

Signed-off-by: Di Wang <di.wang@intel.com>
…13869)

When a UNS link points to a container but that container is not
accessible then return ENOLINK for the directory.
Add a test and fix a crash when this occurs.

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
The ds_pool.sp_checkpoint_props_changed bitfield is modified from target
xstreams, causing a data race on all surrounding bitfields among the
system xstream and all the target xstreams. It is this author's guess
that such data races have likely led to the pool destroy timeouts caused
by pool_fetch_hdls_ult_abort hangs reported in the Jira ticket.

Here is how the hang happened during one pool destroy timeout:

  31:54.85 pool_fetch_hdls_ult() b262bfcf: begin: fetch_hdls=1
    stopping=0
  31:54.85 pool_fetch_hdls_ult() b262bfcf: waiting for map
  32:03.96 pool_fetch_hdls_ult() b262bfcf: fetching handles
  32:03.96 pool_fetch_hdls_ult() b262bfcf: signaling done
  32:03.96 pool_fetch_hdls_ult() b262bfcf: end
  38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: begin: fetch_hdls=1
    stopping=1
  38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: signaled
  38:07.48 pool_fetch_hdls_ult_abort() b262bfcf: waiting for ULT

The ULT had exited at 32:03.96, when it should have set the
ds_pool.sp_fetch_hdls bitfield to 0. More than 6 minutes later,
pool_fetch_hdls_ult_abort found ds_pool.sp_fetch_hdls to be 1 and
started waiting for the ULT to exit! The theory is that when the ULT was
setting sp_fetch_hdls to 0 on the system xstream, a target xstream
happened to be executing update_vos_prop_on_targets, which was setting
sp_checkpoint_props_changed at the same time. The latter read
sp_fetch_hdls == 1 before the ULT set the field to 0 and, after the ULT
had set sp_fetch_hdls to 0, wrote sp_fetch_hdls = 1 back, causing the
ULT's write to be lost.

This patch avoids the data race by replacing the
ds_pool.sp_checkpoint_props_changed bitfield with a read-only collective
parameter.

Signed-off-by: Li Wei <wei.g.li@intel.com>
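
The lost update described in the commit above can be replayed deterministically in a small simulation. This is an illustrative sketch, not DAOS code: it assumes the two one-bit flags share one storage word and that each writer updates its flag with a read-modify-write of the whole word. The bit positions and the function name are hypothetical.

```python
# Illustrative simulation only; bit positions and names are hypothetical.
FETCH_HDLS = 0x1                 # stands in for sp_fetch_hdls
CHECKPOINT_PROPS_CHANGED = 0x2   # stands in for sp_checkpoint_props_changed


def simulate_lost_update():
    """Interleave two read-modify-write sequences on one shared word."""
    word = FETCH_HDLS  # sp_fetch_hdls == 1, the other flag is 0

    # Target xstream: load the whole word before the ULT's store lands.
    target_copy = word

    # System xstream (the ULT): load, clear sp_fetch_hdls, store.
    ult_copy = word
    ult_copy &= ~FETCH_HDLS
    word = ult_copy

    # Target xstream: set its own flag in the stale copy and store it,
    # which writes the old sp_fetch_hdls == 1 back as a side effect.
    target_copy |= CHECKPOINT_PROPS_CHANGED
    word = target_copy

    return word


final = simulate_lost_update()
fetch_hdls_still_set = bool(final & FETCH_HDLS)
print(fetch_hdls_still_set)
```

The final word has both flags set even though the ULT cleared its flag, which matches the state that pool_fetch_hdls_ult_abort observed.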
Signed-off-by: Justin Zhang <juszhan@google.com>
The Prometheus exporter is missing a few stats metrics
that would make some things easier to graph:
  * sum
  * sample_size
  * sum_of_squares

Fixes the Min/Max/Sum methods to return uint64, as this is
the underlying data type. Callers should adjust as necessary.

Signed-off-by: Michael MacDonald <mjmac@google.com>
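
To illustrate why the three aggregates in the exporter commit above are useful for graphing, the sketch below derives a mean and a standard deviation from them. It is not the exporter's code; the function name is hypothetical, and the choice of the sample (n - 1) estimator is an assumption, since the commit does not say which estimator is intended.

```python
import math


def mean_and_stddev(total, sample_size, sum_of_squares):
    """Return (mean, sample standard deviation) from the three aggregates."""
    if sample_size == 0:
        return (0.0, 0.0)
    mean = total / sample_size
    if sample_size < 2:
        return (mean, 0.0)
    # Sample variance: (sum(x^2) - (sum(x))^2 / n) / (n - 1)
    variance = (sum_of_squares - total * total / sample_size) / (sample_size - 1)
    # Clamp tiny negative values caused by floating-point rounding.
    return (mean, math.sqrt(max(variance, 0.0)))


# Samples 2, 4, 4, 4, 5, 5, 7, 9: sum = 40, n = 8, sum of squares = 232.
mean, stddev = mean_and_stddev(40, 8, 232)
```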
Bump version to 2.5.101

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Contain the fixes for CID: 1965549/1965550/1972512

Signed-off-by: Fan Yong <fan.yong@intel.com>
Backporting control plane related internal changes from the
feature/multiprovider branch. These changes affect internals
only, and not user interfaces.

- Updated protobuf structures to recognize secondary providers
  in a backward-compatible way.
- Updated libdaos network config logic to pass values from env
  variables to the agent, to allow better decision making when
  choosing a network interface.
- Add support for multiple providers to control plane internal
  structures, including config file structures. They are intended
  to be invisible to users until we enable the feature throughout
  the stack.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
Meson "setup" sets up a package for building; "meson configure"
sets a configuration option but does not do the setup.
Previously our code would run setup and then configure, which would
set the configuration options but not apply them.

Ninja has file age checking built in, so if the config file was
older than the build file it would re-run setup to apply
the correct config. This happened most of the time, so
the build would work, but occasionally the file timestamps would
be the same, so the check would not trigger and the build would run
without the configuration options applied.

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
To work around SRE-471, increase the dfuse/mu_perms.py test timeout by 60
seconds.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Include a minimum revision for golang

Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Since these are always going to be a single level of nested
lists, we don't need the more complex _flatten() helper.

Signed-off-by: Michael MacDonald <mjmac@google.com>
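
For a single level of nesting, as in the commit above, a standard-library one-liner is enough. This is a sketch of the general technique; the function name is hypothetical and the commit does not show what replaced _flatten().

```python
from itertools import chain


def flatten_one_level(nested):
    """Flatten exactly one level of nesting; inner items are left untouched."""
    return list(chain.from_iterable(nested))


flat = flatten_one_level([[1, 2], [3], []])
```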
…13959)

Signed-off-by: Lei Huang <lei.huang@intel.com>
Co-authored-by: Ashley Pittman <ashley.m.pittman@intel.com>
The VOS API supports combining multiple VOS operations into a single WAL
commit for efficiency. The primary use case is RDB (see DAOS-11406)

Signed-off-by: Jan Michalski <jan.michalski@intel.com>
Signed-off-by: Jeff Olivier <jeffrey.v.olivier@intel.com>
Co-authored-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Co-authored-by: Oksana Salyk <oksana.salyk@intel.com>
Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
Update to libfabric 1.19.1

Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
Add pool collective function to skip DOWN and DOWNOUT targets.

Signed-off-by: Di Wang <di.wang@intel.com>
Co-authored-by: Niu Yawei <yawei.niu@intel.com>
As requested by the Jira ticket, add a new I/O forwarding mechanism,
dss_chore, to avoid creating a ULT for every forwarding task.

  - Forwarding of object I/O and DTX RPCs is converted to chores.

  - Cancelation is not implemented, because the I/O forwarding tasks
    themselves do not support cancelation yet.

  - In certain engine configurations, some xstreams do not need to
    initialize dx_chore_queue. This is left to future work.

Signed-off-by: Li Wei <wei.g.li@intel.com>
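
The idea in the dss_chore commit above, queuing forwarding tasks for a long-lived worker rather than creating an execution context per task, can be sketched as follows. Python threads stand in for ULTs here, and the class and method names are hypothetical; the real mechanism lives in the engine and is not shown in this page.

```python
import queue
import threading


class ChoreQueue:
    """One long-lived worker drains queued chores (hypothetical sketch)."""

    def __init__(self):
        self._chores = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, func, *args):
        # Enqueue work instead of spawning a new thread for it.
        self._chores.put((func, args))

    def _run(self):
        while True:
            item = self._chores.get()
            if item is None:  # sentinel queued by stop()
                break
            func, args = item
            func(*args)

    def stop(self):
        # The sentinel is queued behind all pending chores, so they run first.
        self._chores.put(None)
        self._worker.join()


results = []
chores = ChoreQueue()
for rpc_id in range(5):
    chores.submit(results.append, rpc_id)
chores.stop()
```

Like the commit, this sketch has no cancelation: a submitted chore always runs before stop() returns.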
Add missing comment for exported go function.

Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
Add more mock functions for dtx_tests.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
When the WAL SSD is faulty, WAL commits always fail and the last
committed tx ID is never bumped again. The checkpoint ULT shouldn't
wait on tx commit in that case; otherwise it will never be woken up,
and pool_child_stop() will block on stopping the checkpoint ULT.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Allow the suid and sgid bits to be stored in dfs_osetattr.
Even if libdfs does not support those bits, it allows dfuse to
support them via the kernel.

The lack of sgid support causes spack to fail over dfuse, as
reported in the Jira ticket.

Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
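
As a rough illustration of the suid/sgid commit above, the sketch below shows a mode mask that either keeps or strips the set-ID bits. The mask names and the function are hypothetical; how dfs_osetattr actually filters mode bits is not shown in this page.

```python
import stat

# Hypothetical masks for illustration only.
PERM_BITS = stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO   # 0o777
SETID_BITS = stat.S_ISUID | stat.S_ISGID                  # 0o6000


def storable_mode(requested_mode, keep_setid=True):
    """Return the mode bits that would be stored for a setattr request."""
    allowed = PERM_BITS | (SETID_BITS if keep_setid else 0)
    return requested_mode & allowed


with_sgid = storable_mode(0o2775)
without_sgid = storable_mode(0o2775, keep_setid=False)
```

A directory created with mode 2775 keeps its sgid bit in the first case and silently loses it in the second, which is the kind of behavior a tool that relies on sgid directories would notice.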
When using server target, daos_metrics wasn't built
because it was buried under a check for client target.
I really need to figure out a better way to specify
targets but this will fix the immediate issue.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Bumps google.golang.org/protobuf from 1.30.0 to 1.33.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…red. (#14049)

This avoids a long-standing but previously unknown issue where
the build directory was in LD_LIBRARY_PATH when running gcc etc.

Update site_scons to not set LD_LIBRARY_PATH for all commands launched
during the build but rather only set it for the step where the dmg man
pages are generated.

This impacts the daos_build test, which has previously always needed
to run the build with --jobs=1. With this change the build can be run
in parallel, which reduces the run time of this test by a third.

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
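
The approach in the commit above, scoping LD_LIBRARY_PATH to the one step that needs it, can be sketched outside of SCons like this. The function and its parameters are hypothetical, and the paths in the usage lines are made up; the site_scons change itself is not shown in this page.

```python
import os


def command_env(base_env, needs_build_libs, build_lib_dir):
    """Return a per-command environment; only opted-in commands see the build libs."""
    env = dict(base_env)  # copy, so the shared environment is never modified
    if needs_build_libs:
        existing = env.get("LD_LIBRARY_PATH")
        env["LD_LIBRARY_PATH"] = (
            build_lib_dir if not existing else build_lib_dir + os.pathsep + existing
        )
    return env


base = {"PATH": "/usr/bin"}
compiler_env = command_env(base, False, "/tmp/build/lib")
manpage_env = command_env(base, True, "/tmp/build/lib")
```

The returned dictionary can be passed as the env argument of subprocess.run, so the compiler never sees the build directory on its library path.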
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Adding tests for WAL commit, replay, and checkpoint metrics.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
A single dkey migration can exceed mpt_inflight_max_size. In that
case the original code could leave the dkey migrate ULT in an endless
loop, so that rebuild could not complete.
Example log - "migrate_one_ult() mrone 0x7f3c91fe1ec0 wait start 0/33554432";
in that case the ULT waits again after every wakeup until shutdown.

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
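
The commit above does not say how the endless wait was removed. One common way to avoid this kind of starvation is to admit a request unconditionally when nothing else is in flight, sketched below. The function name is hypothetical, and reading the "0/33554432" in the log as "in-flight size / limit" is an assumption.

```python
def may_start_migrate(inflight_size, request_size, max_size):
    """Admission check that cannot starve a request larger than the limit."""
    if inflight_size == 0:
        # Nothing else is in flight: waiting would never help, so admit it.
        return True
    return inflight_size + request_size <= max_size


LIMIT = 33554432  # the limit shown in the example log line (32 MiB)
oversized_alone = may_start_migrate(0, 2 * LIMIT, LIMIT)
oversized_behind_others = may_start_migrate(1024, 2 * LIMIT, LIMIT)
```

Without the first branch, a request larger than the limit would fail the check even with zero bytes in flight and would go back to waiting after every wakeup.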
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/2/execution/node/1201/log

daltonbohning and others added 12 commits May 16, 2024 10:24
Add a GHA linting summary job to aggregate the status of linting checks.
Allows branch protections to rely on just the summary.
New jobs in the workflow do not require branch protections to be
updated.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Add more unit tests for tags.py

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Update pylint from 3.1.1 -> 3.2.0 and resolve new warnings.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Create a dedicated event queue for each aio context

Signed-off-by: Lei Huang <lei.huang@intel.com>
Signed-off-by: Maureen Jean <maureen.jean@intel.com>
rsvc_tests should use tenv instead of unit_env.

Signed-off-by: Li Wei <wei.g.li@intel.com>
CID 2555629, 2555628, 2555602, 2555600

Signed-off-by: Lei Huang <lei.huang@intel.com>
…14370)

Some DTX-related fields in the ds_cont_child structure are initialized
via dtx_cont_register(), which may be skipped under check mode, so
some subsequent logic may access uninitialized members. Let's move
the initialization of such fields into cont_child_alloc_ref().

Skip DTX resync and rebuild under check mode.

Signed-off-by: Fan Yong <fan.yong@intel.com>
…14342)

Use CRT_TIMEOUT=10 when destroying containers, since the destroy is expected to time out.

Signed-off-by: Padmanabhan <ravindran.padmanabhan@intel.com>
Skip the ftest tags.py githook when the python3 yaml module is not
installed.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Set it to 3 seconds initially and increase it as we try other targets so we
can get going more quickly when a rank is down.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Lack of it was causing failures in the Ubuntu build.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/3/execution/node/1201/log

jolivier23 and others added 7 commits May 20, 2024 00:13
Rather than blocking vos_obj_discard entirely when
discard or aggregation are running, let's block it
only when there is an actual conflict on the object
being discarded.

* Fix log messages to specify EC or VOS aggregation
* Add metrics for conflicts

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Setting checkpoint properties on pool create was
not actually setting the properties.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
* DAOS-15420 pool: Clean up ds_pool_svc_<op>

Convert

  ds_pool_svc_check_evict
  ds_pool_svc_query_target
  ds_pool_svc_get_prop
  ds_pool_svc_set_prop
  ds_pool_svc_target_update_state
  ds_pool_svc_update_acl
  ds_pool_svc_delete_acl
  ds_pool_svc_upgrade
  ds_pool_extend

to the dsc_pool_svc_call framework, so that they will

  - time out, instead of hanging forever, if PSs are unavailable, and
  - respond much faster in common cases thanks to exponential backoffs.

The req_time variable in dsc_pool_svc_call is part of the operation
identifier, and should therefore retain its value across retries.

Signed-off-by: Li Wei <wei.g.li@intel.com>
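
The two properties named in the commit above, timing out when the service is unavailable and retrying with exponential backoff while keeping req_time fixed, can be sketched as a generic retry loop. This is not the dsc_pool_svc_call implementation; the function, its parameters, and the default backoff values are hypothetical, and using a monotonic clock reading as req_time is an assumption.

```python
import time


def call_with_backoff(send, deadline_s, initial_backoff_s=0.001,
                      max_backoff_s=0.1, now=time.monotonic, sleep=time.sleep):
    """Retry send(req_time) with exponential backoff until success or deadline."""
    req_time = now()  # part of the operation identity: set once, reused on retries
    backoff = initial_backoff_s
    while True:
        ok, reply = send(req_time)
        if ok:
            return reply
        if (now() - req_time) + backoff > deadline_s:
            raise TimeoutError("service call timed out")
        sleep(backoff)
        backoff = min(backoff * 2, max_backoff_s)


# Usage: a fake service that succeeds on the third attempt.
seen_req_times = []


def flaky_service(req_time):
    seen_req_times.append(req_time)
    return (len(seen_req_times) >= 3, "done")


result = call_with_backoff(flaky_service, deadline_s=5.0, sleep=lambda s: None)
```

Because req_time is captured before the loop, every retry carries the same value, so a server could recognize the retries as the same operation.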
- Ensure the requested user/group exists before setting it.
- Add a second API, daos_cont_set_owner_no_check(), for the case
  where the new owner/group can't be verified locally.
- Modify daos_test to verify both check and no_check cases.
- Add --no-check flag to daos cont set-owner.

Signed-off-by: Kris Jacque <kris.jacque@intel.com>
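
The check/no-check split in the commit above can be sketched with the local user and group databases. This is an illustration in Python on a Unix host, not the libdaos implementation; the function names are hypothetical, and only the --no-check flag and daos_cont_set_owner_no_check() come from the commit.

```python
import grp
import pwd


def user_exists(name):
    """True if the local passwd database knows this user name."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        return False
    return True


def group_exists(name):
    """True if the local group database knows this group name."""
    try:
        grp.getgrnam(name)
    except KeyError:
        return False
    return True


def validate_new_owner(user=None, group=None, no_check=False):
    """Raise ValueError for an unknown user or group unless no_check is set."""
    if no_check:
        return  # caller asserts the names are valid somewhere else
    if user is not None and not user_exists(user):
        raise ValueError(f"unknown user: {user}")
    if group is not None and not group_exists(group):
        raise ValueError(f"unknown group: {group}")
```

The no_check path matters when the new owner exists only on other hosts, which is the case the second API is described as covering.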
…es (#14379)

Some control-plane storage unit tests are inadvertently calling into
test runner host OS filesystem. Fix by consolidating SystemProvider
interface and applying storage subsystem provider stubs by default
in the unit test framework. As a result coverage can be improved by
exercising a greater number of code paths.

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
…ts (#14295)

Some tests occasionally fail to start servers due to insufficient
available memory in CI caused by leftover DAOS mount points from a
previous test. Add an option to launch.py to provide a filter
which, if specified, will be used to umount and remove the directory for
any mounted tmpfs filesystems matching the filter. When using --mode=ci
the filter will be set to /mnt/daos.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
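
Selecting the tmpfs mounts to clean up, as described in the commit above, might look like the sketch below. It parses text in the /proc/mounts layout (device, mount point, filesystem type, ...). The function name is hypothetical, and treating the filter as a path prefix is an assumption; the commit does not say how launch.py matches it.

```python
def tmpfs_mounts_matching(mounts_text, mount_filter):
    """Return tmpfs mount points whose path starts with mount_filter."""
    matches = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type = fields[:3]
        if fs_type == "tmpfs" and mount_point.startswith(mount_filter):
            matches.append(mount_point)
    return matches


SAMPLE_MOUNTS = """\
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /mnt/daos0 tmpfs rw,relatime,size=16777216k 0 0
/dev/sda1 /mnt/daos_backup ext4 rw,relatime 0 0
tmpfs /mnt/daos1 tmpfs rw,relatime,size=16777216k 0 0
"""

stale = tmpfs_mounts_matching(SAMPLE_MOUNTS, "/mnt/daos")
```

Checking the filesystem type as well as the path keeps a non-tmpfs mount under a similar path, such as the ext4 entry in the sample, out of the cleanup list.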
Moving master to 2.7 test builds. Bump version to 2.7.100.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/5/execution/node/1201/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/6/execution/node/458/log

daltonbohning and others added 4 commits May 22, 2024 11:52
Use crt_timeout: 10 for rebuild/basic, to restore config prior to
PR #13997.
This reduces the test time drastically.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Update pylint to 3.2.2

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Signed-off-by: Lei Huang <lei.huang@intel.com>
…4297)

As a convenience, provide a "streamlined" version of the pool
query that only performs the minimum amount of work to query
the pool's health. Practically speaking, this means that it
will query for disabled ranks and omit the space query, which
is expensive.

Signed-off-by: Michael MacDonald <mjmac@google.com>
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14368/7/execution/node/1201/log

@jolivier23 jolivier23 merged commit 1d77b96 into feature/fault_domain May 23, 2024
105 of 218 checks passed
@jolivier23 jolivier23 deleted the kjacque/fault_domain/merge_20240514 branch May 23, 2024 14:31