DAOS-14679 pool: Report on stopping sp_stopping #14374

Merged: 9 commits into master from liw/ds_pool_stop on Jun 3, 2024

Conversation

@liw (Contributor) commented May 15, 2024

When destroying a pool on an engine (as part of a pool destroy command),
we may block in

ds_pool_stop
  pool_put_sync // putting the last reference
    pool_free_ref // called after the hash rec deletion
      pool_child_delete_one
        pool_child_stop

waiting for some ds_pool_child references. If the user retries the pool
destroy command, the new ds_pool_stop call on this engine can't find the
ds_pool object that is in the sp_stopping state, and the new pool
destroy command usually succeeds, even though the storage capacity
allocated to the pool hasn't been released yet.

This patch makes the following changes (a rough sketch of the resulting ds_pool_stop flow follows the list):

  • Move the pool_child_delete_one collective call from pool_free_ref to
    before the last ds_pool_put call in ds_pool_stop. This ensures that
    the ds_pool object remains in the LRU cache in the sp_stopping state
    while we wait for ds_pool_child references (or some other upper
    layer resource).

  • Wait in ds_pool_stop for all other ds_pool references before calling
    the pool_child_delete_one collective.

  • Remove the pool_put_sync trick, which was introduced because of the
    pool_child_delete_one collective call in pool_free_ref.

  • Return -DER_AGAIN from ds_pool_stop when the ds_pool object is in
    the sp_stopping state, so that the new pool destroy command in the
    aforementioned scenario will wait for the original pool destroy
    operation.

  • Register a reply aggregation callback for MGMT_TGT_DESTROY so that
    an error can reach the control plane.
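
For illustration, here is a rough sketch of the resulting ds_pool_stop flow. It is reconstructed from the snippets quoted in this conversation and is not the exact merged code; the lookup call, the zero return for a missing pool, and the placement of the final log message are simplified assumptions.

/*
 * Rough sketch of the revised ds_pool_stop flow (not the exact merged code).
 * Function and field names come from the diff hunks quoted in this PR; the
 * lookup signature and minor details are assumptions.
 */
int
ds_pool_stop(uuid_t uuid)
{
	struct ds_pool *pool;
	int             rc;

	ds_pool_lookup(uuid, &pool);	/* find the ds_pool in the LRU cache */
	if (pool == NULL)
		return 0;		/* nothing to stop on this engine */

	if (pool->sp_stopping) {
		/* A previous stop (e.g., an earlier pool destroy) is still in
		 * progress; report it instead of silently "succeeding". */
		rc = -DER_AGAIN;
		DL_WARN(rc, DF_UUID ": already stopping", DP_UUID(uuid));
		ds_pool_put(pool);
		return rc;
	}
	pool->sp_stopping = 1;

	ds_rebuild_abort(pool->sp_uuid, -1, -1, -1);
	ds_migrate_stop(pool, -1, -1);

	/* Wait until we hold the last ds_pool reference, so that no other
	 * holder sees the pool children disappear underneath it. */
	while (!daos_lru_is_last_user(&pool->sp_entry))
		dss_sleep(1000 /* ms */);

	/* Stop and delete all ds_pool_child objects while the ds_pool object
	 * is still in the LRU cache in the sp_stopping state. */
	pool_child_delete_all(pool);

	ds_pool_put(pool);		/* drop the last reference */
	D_INFO(DF_UUID ": pool stopped\n", DP_UUID(uuid));
	return 0;
}

Before the patch, the pool_child teardown happened inside pool_free_ref, after the hash record deletion, which is why a retried destroy could no longer find the ds_pool object.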

Features: pool

Before requesting gatekeeper:

  • Two review approvals have been obtained, and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force-landed and the forced-landing tag is set.
  • The Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask the submitter for a new summary.

github-actions bot commented May 15, 2024

Ticket title is 'LRZ: dmg pool destroy fails'
Status is 'Blocked'
Labels: 'LRZ,triaged'
https://daosio.atlassian.net/browse/DAOS-14679

When destroying a pool on an engine (as part of a pool destroy command),
we may block in

  ds_pool_stop
    pool_put_sync // putting the last reference
      pool_free_ref // called after the hash rec deletion
        pool_child_delete_one
          pool_child_stop

waiting for some ds_pool_child references. If the user retries the pool
destroy command, the new ds_pool_stop call on this engine can't find the
ds_pool object that is in the sp_stopping state, and the new pool
destroy command usually succeeds, even though the storage capacity
allocated to the pool hasn't been released yet.

This patch makes the following changes:

  - Move the pool_child_delete_one collective call from pool_free_ref to
    before the ds_pool_put call in ds_pool_stop. This makes sure that
    the ds_pool object remains in the LRU cache in the sp_stopping state
    while we wait for ds_pool_child references (or some other upper
    layer resource).

  - Remove the pool_put_sync trick, which was introduced because of the
    pool_child_delete_one collective call in pool_free_ref.

  - Return an error from ds_pool_stop when the ds_pool object is in the
    sp_stopping state, so that the new pool destroy command in the
    aforementioned scenario will return an explicit error.

  - Register a reply aggregation callback for MGMT_TGT_DESTROY so that
    an error can reach the control plane.
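
To illustrate the reply-aggregation point in the last bullet, below is a hedged sketch of what a CaRT collective-RPC reply aggregation callback generally looks like. The output struct and the names tgt_destroy_out, tdo_rc, and tgt_destroy_aggregator are illustrative placeholders, not the actual DAOS definitions for MGMT_TGT_DESTROY.

#include <cart/api.h>

/*
 * Hedged sketch of a collective-RPC reply aggregation callback in the spirit
 * of the MGMT_TGT_DESTROY change. The struct and identifiers below are
 * placeholders, not the real DAOS definitions.
 */
struct tgt_destroy_out {
	int tdo_rc;	/* placeholder per-engine return code */
};

static int
tgt_destroy_aggregator(crt_rpc_t *source, crt_rpc_t *result, void *arg)
{
	struct tgt_destroy_out *out_source = crt_reply_get(source);
	struct tgt_destroy_out *out_result = crt_reply_get(result);

	(void)arg;

	/* Keep the first non-zero return code, so that a failure (such as
	 * -DER_AGAIN from ds_pool_stop) on one engine is not masked by
	 * successes on the others and can reach the control plane. */
	if (out_result->tdo_rc == 0)
		out_result->tdo_rc = out_source->tdo_rc;
	return 0;
}

Such a callback would typically be hooked up through the RPC's crt_corpc_ops (co_aggregate); the exact wiring in the DAOS mgmt module is not shown here.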

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
@daosbuild1 (Collaborator) commented:

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14374/3/testReport/

@liw (Contributor Author) commented May 20, 2024

container/boundary: DAOS-15857

@liw marked this pull request as ready for review May 20, 2024 01:20
@liw requested review from a team as code owners May 20, 2024 01:20
@@ -1263,9 +1247,13 @@ ds_pool_stop(uuid_t uuid)

ds_rebuild_abort(pool->sp_uuid, -1, -1, -1);
ds_migrate_stop(pool, -1, -1);

pool_stop_children(pool);
Contributor:

Looks like it needs to be called in the error-cleanup path of ds_pool_start() (and in pool_alloc_ref() when the collective call partially fails?).
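
For illustration, a minimal sketch of the cleanup idea. The surrounding ds_pool_start() structure, the placeholder start_remaining_resources() step, and the error label are assumptions, not the merged code; only pool_stop_children comes from the hunk above.

/* Inside ds_pool_start(), after the pool children have been started
 * (sketch only; names other than pool_stop_children are placeholders). */
rc = start_remaining_resources(pool);	/* placeholder for later start steps */
if (rc != 0) {
	/* Undo the already-started children so that a failed start does not
	 * leak running ds_pool_child objects. */
	pool_stop_children(pool);
	goto err_pool;			/* placeholder error path */
}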

liw (Contributor Author):

Indeed... Will fix. Thanks, Niu.

liw added 2 commits May 20, 2024 14:35
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
Features: pool
Required-githooks: true

DL_WARN(rc, DF_UUID ": already stopping", DP_UUID(uuid));
ds_pool_put(pool);
return rc;
Contributor:

If the pool stop is triggered by a **pool destroy** command and it is retried for some reason, as you mentioned in the commit comment, then if we return -DER_BUSY, will it mislead the caller into regarding the command as failed, or does it need some enhancement on the control plane side to handle that?

liw (Contributor Author):

For future releases, yes, I think we can improve the behavior, for instance, by letting the retried pool destroy commands wait for the in-progress pool destroy operation on the same pool; for 2.6, I think it's acceptable to only address the main problem (i.e., the retried pool destroy command succeeds, even though the pool hasn't really been destroyed). This is a first step in the right direction, AFAICS, not a workaround that we would throw away immediately.

Contributor:

I am fine with resolving the main problem first. In a real user environment, the user may complain but can accept it, because he/she can re-execute the pool destroy or query the space when pool destroy hits -DER_BUSY. But I am afraid that some CI tests may fail with -DER_BUSY if they only check the pool destroy result without checking system space.

liw (Contributor Author):

OK. How about returning -DER_AGAIN instead, so that the control plane will retry the pool destroy operation after a backoff (observed to be around 10s)? Then we kind of get the "wait" we want, although it's not the cleanest way of "waiting". The UI will look like this (I injected a ds_pool_child leak):

[liw2@boro-66 ~]$ time dmg -o ~/daos_control.yml pool destroy -rf test-pool-0
^C
real    0m11.489s
user    0m0.069s
sys     0m0.022s

[liw2@boro-66 ~]$ time dmg -o ~/daos_control.yml pool destroy -rf test-pool-0
Pool-destroy command failed: client: code = 510 description = "the *control.PoolDestroyReq request timed out after 5m0s"
ERROR: dmg: client: code = 510 description = "the *control.PoolDestroyReq request timed out after 5m0s"
ERROR: dmg: client: code = 510 resolution = "retry the request or check server logs for more information"

real    5m0.065s
user    0m1.599s
sys     0m0.092s

Contributor:

Yes, DER_AGAIN seems better for that.

@@ -1263,9 +1261,13 @@ ds_pool_stop(uuid_t uuid)

ds_rebuild_abort(pool->sp_uuid, -1, -1, -1);
ds_migrate_stop(pool, -1, -1);

pool_child_delete_all(pool);
Contributor:

If the caller of ds_pool_stop() is not the last user of the ds_pool, then the other reference holders may still use the related pool_child objects. In such a case, is it possible to cause a race?

liw (Contributor Author):

Yes, I thought about this question too. I think that if there is no other bug, there should be no other user of this ds_pool object. Moreover, even if there were such a user, there is no ds_pool_child pointer in ds_pool, so although that user might encounter some errors, it should not hit serious problems like segfaults.

@Nasf-Fan (Contributor) commented May 21, 2024:

The race may happen. For example, during ds_pool_tgt_map_update(), someone may trigger ds_pool_stop(), so that some pool_child is released before the collective tasks for ds_pool_tgt_map_update() complete; dtx_resync, for instance, may then fail. That is risky because dtx_resync is a global operation: if it misses some pool_child, it will affect the DTX check result and therefore the decision. Furthermore, if the DTX leader wants to commit or abort some DTX but the related pool shard is closed, it may leave an orphan DTX entry on that pool shard. If the pool is going to be destroyed, that's OK; otherwise it will be trouble. I am not aware of any known segmentation fault issues in such a case, but I am really not sure about that.

liw (Contributor Author):

I think I can pull in one more change originally planned for future work: Wait for all other ds_pool references to be released. If we do that before calling pool_child_delete_all, the situation should become easier for other ds_pool reference holders.

The dtx_resync concern sounds like a different problem. Let's chat to understand where the problem is.

Contributor:

The root cause of the dtx_resync trouble is not related to this patch, but the changes in this patch will increase the risk of exposing it.

liw (Contributor Author) left a review: "@Nasf-Fan, thanks for the review. Here are my replies/explanations."

Features: pool
Required-githooks: true
@liw (Contributor Author) commented May 21, 2024

Merged with master to fix a minor conflict in daos_srv/pool.h.

liw (Contributor Author) left a review: "@Nasf-Fan, thanks for the details. Here are my replies."

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
@liw requested review from Nasf-Fan and NiuYawei May 22, 2024 06:22
NiuYawei previously approved these changes May 22, 2024
D_INFO(DF_UUID": pool stopped\n", DP_UUID(uuid));

while (!daos_lru_is_last_user(&pool->sp_entry))
dss_sleep(1000 /* ms */);
Contributor:

I don't see why we have to wait for the ds_pool references to be dropped here; it would be better to print some warning messages if it hangs here due to potential defects.
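
For illustration, one possible shape of the suggested diagnostic, sketched as a hypothetical helper. The function name and the 30-second interval are arbitrary; this is not what was merged, which keeps the plain wait loop plus a completion message, as discussed below.

/* Hypothetical helper: wait for the remaining ds_pool references and warn
 * periodically so a stuck reference holder shows up in the logs. */
static void
pool_wait_for_users(struct ds_pool *pool)
{
	int waited_ms = 0;

	while (!daos_lru_is_last_user(&pool->sp_entry)) {
		dss_sleep(1000 /* ms */);
		waited_ms += 1000;
		if (waited_ms % 30000 == 0)	/* arbitrary 30 s interval */
			D_WARN(DF_UUID ": still waiting for ds_pool references "
			       "(%d ms)\n", DP_UUID(pool->sp_uuid), waited_ms);
	}
}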

@liw (Contributor Author) commented May 22, 2024:

@NiuYawei, please see https://github.com/daos-stack/daos/pull/14374/files#r1609159549 (just below). There may be RPC handlers calling functions that have looked up the ds_pool object before we set sp_stopping in ds_pool_stop. I think Fan Yong's point is that such a function may assume that if it holds a ds_pool reference, it can expect the children to be present.

Contributor:

I see. I don't think any user currently expects that; a ds_pool_child could be stopped at any time due to a faulty target. Anyway, let's see if any potential ds_pool reference issue can be discovered by this new change.

liw (Contributor Author):

I have indeed told Fan Yong that nowadays ds_pool_child may be stopped individually.

liw (Contributor Author):

I'm reluctant to log repeated D_INFO messages here, so I only added a completion message so that we know whether we are blocked in the reference wait or in pool_child_delete_all.

Contributor:

The patch seems somewhat complex. Following your patch logic, the wait here is necessary, but the potential risk is whether the related reference holders can get the stopping signal and release their references in time. We may have to resolve issues with long-held references one by one as a next step.

On the other hand, as I understand it, the main issue to be resolved is that some pool destroy is in progress but times out (for some unknown reason), and the subsequent retried pool destroy returns success because it cannot find the pool in the pool cache. If that is true, then why not only return -DER_AGAIN, as your current patch does, without the other changes? That seems enough.

liw (Contributor Author):

@Nasf-Fan, active cancellation seems to be a large project. It is out of the scope of this PR, but has been in my plan for a long time.

"That" is not enough. Without this PR, when we wait for ds_pool_child references, we are in pool_free_ref, which is called after the ds_pool object has been deleted from the LRU cache. Hence a retried pool destroy operation will not find the ds_pool object. Please see the PR description. (I began this PR with only such a change, but testing proved it to be insufficient.) On the other hand, stopping activities when freeing the last reference is not a good model, so I plan to move that out anyway.

Contributor:

I have no better idea. Then let's resolve the potential issues when we really hit them in the future.

Features: pool
Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
@jolivier23 (Contributor) commented:

Perhaps you should add

Allow-unstable-test: true

So it at least runs all hardware tests if it's blocked by a CI bug on VM tests.

Features: pool
Allow-unstable-test: true
Required-githooks: true
Features: pool
Allow-unstable-test: true
Required-githooks: true
@liw (Contributor Author) commented May 28, 2024

Merged master again because the previous CI job was stuck in "Test RPMs".

@daosbuild1 (Collaborator) commented:

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14374/8/display/redirect

Features: pool
Allow-unstable-test: true
Required-githooks: true
@liw requested a review from a team June 2, 2024 23:31
@NiuYawei merged commit dd530c9 into master Jun 3, 2024
51 of 52 checks passed
@NiuYawei deleted the liw/ds_pool_stop branch June 3, 2024 04:22