mds/quiesce: fix deadlocks of quiesce with fragmenting and renames #57332
This was tested together with #57250 in https://pulpito.ceph.com/leonidus-2024-05-13_05:53:33-fs-wip-lusov-quiesce-distro-default-smithi/. The results are promising: the only two quiesce timeouts are instances of a different issue, https://tracker.ceph.com/issues/65977. The EMEDIUMTYPE errors are timeouts from the teuthology command runner and are usually signs of unrelated issues.
jenkins test api |
This has passed tests in the batch: No quiesce timeouts were detected in 36 jobs. There is a good chance that the detected errors are test issues rather than code issues.
@batrick please approve for merge |
jenkins test make check arm64 |
jenkins test make check arm64 |
jenkins test api |
This is outstanding work. Great job @leonid-s-usov !
I ran 2 new job sets with the latest version.
I'll have to look into this one more, but as of now, no evidence of any of the issues that should have been fixed. |
No outstanding quiesce ops
This is another case of a lost ack:
Not sure if it's due to the failure injection, but it's anyway a subject of the new ticket https://tracker.ceph.com/issues/66107 |
Out of the two runs above, there's only one new real quiesce timeout, and it's some interlock with export. All of the pending quiesce_inode ops are failing to authpin:
Given this new issue's low reproduction rate, I'd like to tackle it in a separate PR. I'm creating a ticket for that: https://tracker.ceph.com/issues/66123. I'm not yet 100% confident it's a quiesce issue |
jenkins test make check arm64 |
The arm64 failure is addressed by #57552 |
jenkins test make check arm64 |
mds/quiesce: overdrive exporting that is still freezing
looks good
This PR is under test in https://tracker.ceph.com/issues/66140. |
Lines 2336 to 2338 in acdd6bc
also creates a problem because the exporter may cancel after we remove the import from Line 2390 in acdd6bc
I don't think that assert can be correct. importer may remove an import, tell the exporter, and race with a cancel. It simply should ignore the export (ideally we return an error but not all of these messages permit an error, I believe). |
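The race described above can be illustrated with a minimal sketch. This is not the Ceph Migrator code: `Importer`, `ImportState`, `remove_import`, and `handle_export_cancel` are hypothetical names, and the dirfrag is reduced to an integer key. It only shows the suggested behavior of ignoring a cancel for an import that has already been removed, rather than asserting that the import exists.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified import states (not the real Migrator states).
enum class ImportState { Discovering, Prepping, Acked };

struct Importer {
  // Imports in progress, keyed by a hypothetical dirfrag id.
  std::unordered_map<std::uint64_t, ImportState> imports;

  // The importer drops an import on its own initiative.
  void remove_import(std::uint64_t dirfrag) { imports.erase(dirfrag); }

  // Handle a cancel from the exporter.
  // Returns true if a live import was cancelled, false if the cancel was
  // stale (the import is already gone) and was ignored.
  bool handle_export_cancel(std::uint64_t dirfrag) {
    auto it = imports.find(dirfrag);
    if (it == imports.end()) {
      // The cancel raced with our own removal: ignore it instead of asserting.
      return false;
    }
    imports.erase(it);
    return true;
  }
};
```

Returning a flag instead of asserting lets the caller log or count stale cancels; whether the real message types allow reporting an error back to the exporter is left open, as noted in the comment above.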
Repeatedly quiesce under a heavy balancer load Fixes: https://tracker.ceph.com/issues/65716 Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
This is a functional revert of a9964a7. git revert was giving too many conflicts, as the code has changed too much since the original commit. The bypass freezing mechanism led us into several deadlocks, and when we found out that a freezing inode defers reclaiming client caps, we realized that we needed to try a different approach. This commit removes the bypass freezing related changes to clear the way for a different approach to resolving the conflict between quiesce and freezing. Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Quiesce requires revocation of capabilities, which does not work for freezing/frozen nodes. Since fragmenting is best effort, abort an ongoing fragmenting for the sake of a faster quiesce. Signed-off-by: Leonid Usov <leonid.usov@ibm.com> Fixes: https://tracker.ceph.com/issues/65716
* when the quiesce lock is taken by this op, don't consider the inode `quiesced`
* drop all locks taken during traversal
* drop all local authpins after the locks are taken
* add --await functionality that will block the command until locks are taken or an error is encountered
* return the RC that represents the operation result. 0 if the operation was scheduled and hasn't failed so far
* add authpin control flags
** --ap-freeze - to auth_pin_freeze the target inode
** --ap-dont-block - to pass auth_pin_nonblocking when acquiring the target inode locks

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
When a request is blocked on the quiesce lock, it should release all remote authpins, especially those that make an inode AUTHPIN_FROZEN Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com> Fixes: https://tracker.ceph.com/issues/65802
jenkins test make check arm64 |
The 9/10 failures are not legitimate; the tests actually finished successfully: https://pulpito.ceph.com/leonidus-2024-05-20_23:06:58-fs-wip-lusov-quiesce-distro-default-smithi/. These failures were unrelated.
Fixes: https://tracker.ceph.com/issues/65716
Quiesce requires revocation of capabilities, which does not work for freezing/frozen nodes.
Since it is best effort, abort an ongoing fragmenting for the sake of a faster quiesce.
Fixes: https://tracker.ceph.com/issues/65802
regardless of where the request came from
`Locker::acquire_locks` interface. Callers who want to control how the quiesce lock is taken or not should include the quiesce lock in the LOV. To bypass taking the quiesce lock one should call `add_nolock(quiescelock)` on the lov.
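As a rough illustration of the LOV convention described above, here is a self-contained sketch. It is not the MDS implementation: `LockKind`, `LockOp`, `LockOpVec`, and this free-standing `acquire_locks` are simplified stand-ins, and the default behavior (taking the quiesce lock first when the LOV does not mention it) is an assumption inferred from the description. Only the name `add_nolock` comes from the text.

```cpp
#include <vector>

// Hypothetical lock kinds; the real MDS has many more lock types.
enum class LockKind { Quiesce, Auth, File };

struct LockOp {
  LockKind kind;
  bool nolock;  // true: the caller explicitly asks NOT to take this lock
};

// Simplified stand-in for the lock operation vector (LOV).
struct LockOpVec {
  std::vector<LockOp> ops;

  void add_lock(LockKind k) { ops.push_back({k, false}); }
  void add_nolock(LockKind k) { ops.push_back({k, true}); }

  bool mentions(LockKind k) const {
    for (const auto& op : ops) {
      if (op.kind == k) return true;
    }
    return false;
  }
};

// Returns the locks that would actually be taken, in order.
// Assumption: if the caller says nothing about the quiesce lock,
// it is taken implicitly before everything else.
std::vector<LockKind> acquire_locks(LockOpVec lov) {
  if (!lov.mentions(LockKind::Quiesce)) {
    lov.ops.insert(lov.ops.begin(), LockOp{LockKind::Quiesce, false});
  }
  std::vector<LockKind> taken;
  for (const auto& op : lov.ops) {
    if (!op.nolock) taken.push_back(op.kind);  // skip explicit bypasses
  }
  return taken;
}
```

With this shape, a caller that adds only `LockKind::Auth` would get the quiesce lock followed by the auth lock, while a caller that first calls `add_nolock(LockKind::Quiesce)` would get only the locks it listed.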