
mds/quiesce: fix deadlocks of quiesce with fragmenting and renames #57332

Merged: 7 commits from wip-lusov-quiesce-lock-auth into main, May 21, 2024

Conversation

@leonid-s-usov (Contributor) commented May 7, 2024:

Fixes: https://tracker.ceph.com/issues/65716

Quiesce requires revocation of capabilities, which does not work for freezing/frozen nodes.
Since fragmenting is best effort, abort any ongoing fragmenting for the sake of a faster quiesce.

Fixes: https://tracker.ceph.com/issues/65802

  • The decision to take the shared quiesce lock must be made by the object's authority,
    regardless of where the request came from.
  • Remove "mustpin" and "skip_quiesce" from the Locker::acquire_locks interface.
    Callers who want to control whether and how the quiesce lock is taken should
    include the quiesce lock in the LOV; to bypass taking the quiesce lock,
    call add_nolock(quiescelock) on the lov (see the sketch below).
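
A minimal caller-side sketch of the intended LOV usage, assuming the usual Locker/MDCache types; the inode `in`, the request `mdr`, and the particular locks are illustrative and not taken from the patch:

MutationImpl::LockOpVec lov;

// Default path: list only the locks the operation needs; if this MDS is the
// object's authority, the shared quiesce lock is added for the caller.
lov.add_xlock(&in->filelock);

// Opt-out path: instead of a skip_quiesce flag, callers that must not take
// the quiesce lock mark it explicitly on the LOV.
lov.add_nolock(&in->quiescelock);

if (!mds->locker->acquire_locks(mdr, lov))
  return;  // blocked; the request is retried once the locks can be taken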

@leonid-s-usov leonid-s-usov marked this pull request as draft May 7, 2024 14:05
@github-actions github-actions bot added the cephfs Ceph File System label May 7, 2024
@leonid-s-usov leonid-s-usov marked this pull request as ready for review May 12, 2024 11:52
@leonid-s-usov leonid-s-usov force-pushed the wip-lusov-quiesce-lock-auth branch 3 times, most recently from ae4b9a0 to 1bcd75f on May 13, 2024 05:52
@leonid-s-usov (Contributor, Author) commented:

This was tested together with #57250 in https://pulpito.ceph.com/leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi/. The results are promising: the only two quiesce timeouts are instances of a different issue https://tracker.ceph.com/issues/65977

The EMEDIUMTYPE errors are timeouts from the teuthology command runner and are usually a sign of unrelated issues.

[leonidus@vossi04 leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi]$ grep "ERROR:tasks.quiescer" */teuthology.log
7704534/teuthology.log:2024-05-13T06:56:08.322 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout:
7704534/teuthology.log:2024-05-13T06:56:08.323 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704542/teuthology.log:2024-05-13T07:29:20.358 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:
7704542/teuthology.log:2024-05-13T07:29:20.358 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704543/teuthology.log:2024-05-13T07:34:41.685 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:
7704543/teuthology.log:2024-05-13T07:34:41.686 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704544/teuthology.log:2024-05-13T06:38:37.093 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:
7704544/teuthology.log:2024-05-13T06:38:37.093 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7704563/teuthology.log:2024-05-13T06:50:21.065 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout:
7704563/teuthology.log:2024-05-13T06:50:21.065 ERROR:tasks.quiescer.fs.[cephfs]:exception:

@leonid-s-usov (Contributor, Author) commented:

jenkins test api

@leonid-s-usov (Contributor, Author) commented:

This has passed tests in the batch:

#57454
#57332
#57274
#57250

No quiesce timeouts were detected in 36 jobs. There is a good chance that the detected errors are test issues rather than code issues.

[leonidus@vossi04 leonidus-2024-05-14_16:11:40-fs-wip-lusov-quiesce-distro-default-smithi]$ grep -m 1 "ERROR:tasks.quiescer" */teuthology.log
7706170/teuthology.log:2024-05-14T17:26:39.882 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set '541c443e' with rc: 1 (EPERM), stdout:
7706171/teuthology.log:2024-05-14T17:16:57.065 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set '77401ef3' with rc: 1 (EPERM), stdout:
7706175/teuthology.log:2024-05-14T17:09:13.747 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set 'e0c7677c' with rc: 1 (EPERM), stdout:
7706181/teuthology.log:2024-05-14T17:35:22.634 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't parse response with error Expecting value: line 1 column 1 (char 0); rc: 124 (EMEDIUMTYPE); stdout:

@leonid-s-usov leonid-s-usov requested a review from a team May 15, 2024 11:12
@leonid-s-usov (Contributor, Author) commented:

@batrick please approve for merge

@leonid-s-usov (Contributor, Author) commented:

jenkins test make check arm64

@leonid-s-usov leonid-s-usov changed the title mds/quiesce: automatic shared quiesce lock should be taken by the auth mds/quiesce: fix deadlocks of quiesce with freezing due to fragmenting and renaming May 16, 2024
@leonid-s-usov leonid-s-usov changed the title mds/quiesce: fix deadlocks of quiesce with freezing due to fragmenting and renaming mds/quiesce: fix deadlocks of quiesce with fragmenting and renames May 16, 2024
@batrick (Member) commented May 16, 2024:

jenkins test make check arm64

@leonid-s-usov (Contributor, Author) commented:

jenkins test api

@ceph ceph deleted a comment from batrick May 16, 2024
@batrick (Member) left a review comment:

This is outstanding work. Great job @leonid-s-usov!

Inline review threads on src/mds/Locker.cc (resolved).
@leonid-s-usov (Contributor, Author) commented:

I ran 2 new job sets with the latest version.

  1. https://pulpito.ceph.com/leonidus-2024-05-19_09:55:47-fs-wip-lusov-quiesce-distro-default-smithi/
[leonidus@vossi04 leonidus-2024-05-19_09:55:47-fs-wip-lusov-quiesce-distro-default-smithi]$ grep "ERROR:tasks.quiescer" */teuthology.log
7713401/teuthology.log:2024-05-19T10:26:54.953 ERROR:tasks.quiescer.fs.[cephfs]:exception:
7713402/teuthology.log:2024-05-19T10:38:05.029 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set '3d4f6a9f' with rc: 1 (EPERM), stdout:
7713407/teuthology.log:2024-05-19T10:47:54.333 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set 'fafe5684' with rc: 1 (EPERM), stdout:
7713434/teuthology.log:2024-05-19T10:54:52.973 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 1 (EPERM), stdout:

7713402 and 7713407 are out of scope for this PR; these denote script timing issues: the set had expired by the time we got to sending the release command.

7713401 is a case of an injected socket failure. The set was properly released, but the response couldn't be delivered.


teuthology.log:
2024-05-19T10:24:54.920 DEBUG:tasks.quiescer.fs.[cephfs]:Running ceph command: 'tell mds.4281 quiesce db --set-id 3c0708ef --release --await'
2024-05-19T10:24:55.057 INFO:teuthology.orchestra.run.smithi067.stderr:2024-05-19T10:24:55.049+0000 7f3046261740 10 client.4878 resolve_mds: validated gid 4281 aka daemon mds.b
2024-05-19T10:24:55.081 INFO:teuthology.orchestra.run.smithi067.stderr:2024-05-19T10:24:55.076+0000 7f3046261740  1 -- 172.21.15.67:0/2620733663 --> [v2:172.21.15.120:6834/3847279522,v1:172.21.15.120:6836/3847279522] -- command(tid 0: {"prefix": "quiesce db", "set_id": "3c0708ef", "release": true, "await": true}) -- 0x5601d4c86710 con 0x5601d4d4e640

mds.b:
2024-05-19T10:24:55.075+0000 7f580069f640 15 quiesce.mgr.4281 <leader_upkeep_set> [3c0708ef@7,file:/] updating member state to QS_RELEASED
2024-05-19T10:24:55.075+0000 7f580069f640 15 quiesce.mgr.4281 <leader_upkeep_set> [3c0708ef@8] updated set state to match member reports: QS_RELEASED
2024-05-19T10:24:55.076+0000 7f580c6b7640  0 -- [v2:172.21.15.120:6834/3847279522,v1:172.21.15.120:6836/3847279522] >> 172.21.15.67:0/2620733663 conn(0x55c1edb6e400 msgr2=0x55c1edbd0680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1)._try_send injecting socket failure
2024-05-19T10:24:55.076+0000 7f58096b1640  3 mds.b ms_handle_reset closing connection for session client.4884 172.21.15.67:0/2620733663

teuthology.log:
2024-05-19T10:26:54.729 INFO:tasks.workunit.client.0.smithi067.stdout:./196/.snap/snap-subdir-test
2024-05-19T10:26:54.952 DEBUG:teuthology.orchestra.run:got remote process result: 124

7713434 is a real quiesce timeout, but it's very different from what we've seen so far. There is a 150 MB ops dump on rank 0. There are hundreds of pending peer_requests, several times fewer quiesce requests, and just one client request. No renames, no fragmenting.

2024-05-19T10:54:39.331 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (b) for eecc9a9a-1: 'client_request(client.24764:70160 create owner_uid=1000, owner_gid=1316 #0x1000000bacb/netrom.h 2024-05-19T10:52:59.430168+0000 caller_uid=1000, caller_gid=1316{6,36,1000,1316,})'

I'll have to look into this one more, but as of now there is no evidence of any of the issues this PR was meant to fix.

@leonid-s-usov (Contributor, Author) commented:

  2. https://pulpito.ceph.com/leonidus-2024-05-19_10:35:10-fs-wip-lusov-quiesce-distro-default-smithi/
[leonidus@vossi04 leonidus-2024-05-19_10:35:10-fs-wip-lusov-quiesce-distro-default-smithi]$ grep "ERROR:tasks.quiescer" */teuthology.log
7713463/teuthology.log:2024-05-19T11:15:37.944 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout:
7713463/teuthology.log:2024-05-19T11:15:37.944 ERROR:tasks.quiescer.fs.[cephfs]:exception:

No outstanding quiesce ops

$ grep "Outstanding"  7713463/teuthology.log
2024-05-19T11:14:54.323 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (f) for 760ad67a-1: 'peer_request:mds.2:2'
2024-05-19T11:14:54.323 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (f) for 760ad67a-1: 'peer_request:mds.1:2'
2024-05-19T11:14:54.324 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (f) for 760ad67a-1: 'peer_request:mds.1:3'

This is another case of a lost ack:

2024-05-19T11:14:07.933+0000 7f8ce392f640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(15:1) roots:1/0]
2024-05-19T11:14:07.933+0000 7f8ce392f640 10 quiesce.mds.2 <operator()> sending ack q-map[v:(15:1) roots:1/0] to the leader 4306
2024-05-19T11:14:07.933+0000 7f8ce392f640  1 -- [v2:172.21.15.77:6838/1986506243,v1:172.21.15.77:6839/1986506243] send_to--> mds [v2:172.21.15.134:6836/2152123477,v1:172.21.15.134:6837/2152123477] --  -- ?+0 0x563eec343080
2024-05-19T11:14:07.933+0000 7f8ce392f640  1 -- [v2:172.21.15.77:6838/1986506243,v1:172.21.15.77:6839/1986506243] --> [v2:172.21.15.134:6836/2152123477,v1:172.21.15.134:6837/2152123477] --  -- 0x563eec343080 con 0x563eec3e8800
2024-05-19T11:14:07.933+0000 7f8cee144640  1 -- [v2:172.21.15.77:6838/1986506243,v1:172.21.15.77:6839/1986506243] >> [v2:172.21.15.134:6836/2152123477,v1:172.21.15.134:6837/2152123477] conn(0x563eec3e8800 msgr2=0x563eec3ec680 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38

Not sure if it's due to the failure injection, but either way it's the subject of the new ticket https://tracker.ceph.com/issues/66107

@leonid-s-usov (Contributor, Author) commented:

Out of the two runs above, there's only one new real quiesce timeout, and it's some interlock with export:

https://pulpito.ceph.com/leonidus-2024-05-19_09:55:47-fs-wip-lusov-quiesce-distro-default-smithi/7713434/

All of the pending quiesce_inode ops are failing to authpin:

"flag_point": "failed to authpin, subtree is being exported",

Given this new issue's low reproduction rate, I'd like to tackle it in a separate PR. I'm creating a ticket for that: https://tracker.ceph.com/issues/66123. I'm not yet 100% confident it's a quiesce issue.

@leonid-s-usov (Contributor, Author) commented:

jenkins test make check arm64

@leonid-s-usov (Contributor, Author) commented:

The arm64 failure is addressed by #57552

@leonid-s-usov (Contributor, Author) commented:

jenkins test make check arm64

@batrick (Member) left a review comment:

The commit "mds/quiesce: overdrive exporting that is still freezing" looks good.

Inline review thread on src/mds/Server.cc (resolved).
@batrick (Member) commented May 20, 2024:

This PR is under test in https://tracker.ceph.com/issues/66140.

Inline review thread on src/mds/Migrator.cc (resolved).
@batrick (Member) commented May 20, 2024:

ceph/src/mds/Migrator.cc, lines 2336 to 2338 in acdd6bc:

dout(5) << "blocking import during quiesce" << dendl;
import_reverse_discovering(df);
mds->send_message_mds(make_message<MExportDirDiscoverAck>(df, m->get_tid(), false), from);

also creates a problem because the exporter may cancel after we remove the import from import_state causing

ceph_abort_msg("got export_cancel in weird state");

I don't think that assert can be correct. The importer may remove an import, tell the exporter, and race with a cancel. It should simply ignore the export_cancel (ideally we would return an error, but not all of these messages permit one, I believe).
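
To illustrate, here is a rough sketch of tolerating that race in the importer, assuming the existing Migrator::handle_export_cancel entry point; this is only an illustration of the suggestion, not the code that landed in this PR:

void Migrator::handle_export_cancel(const cref_t<MExportDirCancel>& m)
{
  dirfrag_t df = m->get_dirfrag();
  auto it = import_state.find(df);
  if (it == import_state.end()) {
    // We already reversed and forgot this import (e.g. after refusing it
    // during quiesce), and the exporter's cancel raced with our reply:
    // ignore it rather than ceph_abort_msg("got export_cancel in weird state").
    dout(10) << "export_cancel for " << df << " with no import state, ignoring" << dendl;
    return;
  }
  // ... existing per-state handling of the cancel continues here ...
}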

Commit messages:

Repeatedly quiesce under a heavy balancer load

Fixes: https://tracker.ceph.com/issues/65716
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
This is a functional revert of a9964a7. A plain git revert was giving too many conflicts, as the code has changed too much since the original commit.

The bypass-freezing mechanism led us into several deadlocks, and when we found out that a freezing inode defers reclaiming client caps, we realized that we needed to try a different approach. This commit removes the bypass-freezing changes to clear the way for a different approach to resolving the conflict between quiesce and freezing.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Quiesce requires revocation of capabilities, which does not work for freezing/frozen nodes. Since fragmenting is best effort, abort any ongoing fragmenting for the sake of a faster quiesce.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/65716
* when the quiesce lock is taken by this op, don't consider the inode `quiesced`
* drop all locks taken during traversal
* drop all local authpins after the locks are taken
* add --await functionality that will block the command until locks are taken or an error is encountered
* return an RC that represents the operation result: 0 if the operation was scheduled and hasn't failed so far
* add authpin control flags (illustrated in the sketch after this commit message):
** --ap-freeze - to auth_pin_freeze the target inode
** --ap-dont-block - to pass auth_pin_nonblocking when acquiring the target inode locks

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
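
As an illustration of the authpin control flags above, here is one plausible mapping onto the Locker::acquire_locks parameters auth_pin_freeze and auth_pin_nonblocking; the local names ap_freeze, ap_dont_block and target_in are hypothetical, and the real wiring inside the quiesce op may differ:

// Sketch only: how the flags could translate to acquire_locks() arguments.
bool acquired = mds->locker->acquire_locks(
    mdr, lov,
    ap_freeze ? target_in : nullptr,  // --ap-freeze     -> auth_pin_freeze
    ap_dont_block);                   // --ap-dont-block -> auth_pin_nonblocking
if (!acquired)
  return;  // the op is re-queued; with --await the command only replies once
           // the locks are finally taken or the op fails
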
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
When a request is blocked on the quiesce lock, it should release all remote authpins, especially those that make an inode AUTHPIN_FROZEN.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
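
A loose sketch of that last commit's intent, not its actual code: the hook point inside the lock acquisition path and the helper drop_remote_authpins() are invented for this illustration.

if (blocked_lock == &in->quiescelock) {
  // Before parking the request on the quiesce lock, give up authpins held
  // on other ranks, including any that left the inode AUTHPIN_FROZEN.
  drop_remote_authpins(mdr);  // hypothetical helper
  blocked_lock->add_waiter(SimpleLock::WAIT_RD,
                           new C_MDS_RetryRequest(mdcache, mdr));
  return false;
}
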
@batrick (Member) commented May 20, 2024:

jenkins test make check arm64

@batrick (Member) commented May 21, 2024:

https://pulpito.ceph.com/leonidus-2024-05-20_23:06:03-fs:functional-wip-lusov-quiesce-distro-default-smithi/

The 9/10 failures are not legitimate; the tests actually finished successfully.

https://pulpito.ceph.com/leonidus-2024-05-20_23:06:58-fs-wip-lusov-quiesce-distro-default-smithi/

These failures were unrelated.

@batrick batrick merged commit 9d3f590 into main May 21, 2024
12 checks passed
@batrick batrick deleted the wip-lusov-quiesce-lock-auth branch May 21, 2024 02:39