New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
massive Peer rework, based on FernyTransport #19668
base: main
Are you sure you want to change the base?
massive Peer rework, based on FernyTransport #19668
Conversation
allisonkarlitskaya
commented
Nov 29, 2023
•
edited by martinpitt
edited by martinpitt
- revert PR test: workaround an old (and very common) flake #19835 , this PR fixes it properly (hopefully)
bf525f8
to
3aec639
Compare
2965bb2
to
dc9cd0e
Compare
This comment was marked as resolved.
This comment was marked as resolved.
8efcb70
to
5eae275
Compare
The permission denied error in the superuser case is annoying: that's the sound of the kill() syscall failing to reap the bridge running as root... That's python/cpython#112800 but we're going to need a workaround... |
5eae275
to
5d20586
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
09a5c4d
to
dc4ed95
Compare
dc4ed95
to
6ab101e
Compare
This comment was marked as resolved.
This comment was marked as resolved.
Looking at the polkit code, and thinking about the error message, it's worth noting: self.subject = ('unix-session', {'session-id': Variant(os.environ['XDG_SESSION_ID'], 's')}) So clearly it's the not-a-real-session spawning of the bridge that's the issue here, which means that this problem is specific to this testcase. That being said, there's nothing wrong with robustifying the code a bit there. There may be other reasons that polkit isn't going to want us to become an agent... |
6698519
to
3d568bf
Compare
Okay, so most stuff is getting under control by now. A large part of the remaining problem is that the I could add yet another option for dealing with that, but I start to wonder if maybe it's time to draw a clearer line in the sand between "interactive" and "non-interactive" uses of cockpit-beiboot. The non-interactive case would immediately prompt for the password on startup using the I'll roll this over in my head a bit. Pending a reasonable solution there and resolution of the situation about how to run |
075c65b
to
e8d01ca
Compare
e8d01ca
to
65b2f8c
Compare
message['superuser'] = False | ||
self.ssh_peer.write_control(message) | ||
# Step 2: when the client replies, create the peer | ||
def do_authorize(self, message: JsonObject) -> None: |
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns Note
7414e8f
to
779a474
Compare
779a474
to
35ed01e
Compare
35ed01e
to
ebab696
Compare
ebab696
to
01ca58d
Compare
testBeibootWithBridge is rather shallow: There's just some impedance mismatch with the recent commit 22f38b3. Either we keep -- src/cockpit/beiboot.py
+++ src/cockpit/beiboot.py
@@ -252,7 +252,7 @@ class ForwarderPeer(Peer):
def do_exception(self, exc: Exception) -> None:
if isinstance(exc, (OSError, socket.gaierror)):
- raise CockpitProblem('no-host', error='no-host', message=str(exc)) from exc
+ raise CockpitProblem('unknown-host', error='unknown-host', message=str(exc)) from exc
elif isinstance(exc, ferny.SshHostKeyError):
hostkey_info = self.authorize_handler.hostkey_info or {} Or we need to change the tests to expect --- test/verify/check-client
+++ test/verify/check-client
@@ -206,7 +206,7 @@ Command = /usr/bin/env python3 -m cockpit.beiboot --interactive
# unreachable host
b.set_val("#server-field", "unknownhost")
b.click("#login-button")
- b.wait_in_text("#login-error-message", "Host is unknown")
+ b.wait_in_text("#login-error-message", "Unable to connect to that address")
b.wait_val("#server-field", "unknownhost")
# does not appear in recent hosts
b.wait_in_text("#recent-hosts-list", "10.111.113.2")
@@ -215,7 +215,7 @@ Command = /usr/bin/env python3 -m cockpit.beiboot --interactive
# wrong port
b.set_val("#server-field", "10.111.113.2:222")
b.click("#login-button")
- b.wait_in_text("#login-error-message", "Host is unknown")
+ b.wait_in_text("#login-error-message", "Unable to connect to that address")
# unencrypted SSH key login
self.m_client.execute("runuser -u admin -- ssh-keygen -t rsa -N '' -f ~admin/.ssh/id_rsa")
I actually prefer the latter -- I reviewed the two errors, and "unknown-host" feels more like "SSH does not know this" rather than "network/DNS failure". |
TestHostSwitching.testBasic is just an unexpected message:
It's a flake, too, and doesn't reproduce locally. However, the test already has:
but that string doesn't actually exist any more -- the bridge has done this for a long time already:
So the test just needs to adjust the expected error message (that should be a separate commit, as it would be a flake on main as well). I'm also happy to send that as a separate PR (but it by itself feels too small to run the whole CI machinery on..) |
testBeibootNoBridge is harder: The "switch to admin access" dialog never goes away, but it actually got quite far: on the "target" machine I see
i.e. the remote priv bridge did start. Before I debug that further, I'll coordinate with Lis. test/verify/check-superuser TestSuperuserDashboard is hopefully/likely the same root cause: no actual superusers session on a remote machine after logging in. And finally, TestSuperuserDashboardOldMachine.test is supposedly the hardest: CentOS 7 did not yet have any concept of a "remote privilege indicator". It tries to run |
Oh, and there's of course the too old Python crash for RHEL 9.4's py 3.9:
|
This message doesn't exist anymore — update to the new string written by cockpit-ws. Thanks Martin :)
This reverts commit a0f6bfe.
b29c305
to
35fc1d4
Compare
This is also the failing unit test... |
35fc1d4
to
ba48757
Compare
Okay. Fixed that, and it's getting a lot farther, but now it's failing for a new reason: the processes aren't getting cleaned up properly. |
self.allow_journal_messages(".*: failure while serving external channel: internal-error") | ||
self.allow_journal_messages(".*: external channel failed: internal-error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, it seems I was wrong -- The message still does exist:
src/ws/cockpitchannelresponse.c: g_debug ("%s: failure while serving external channel: %s", self->logname, problem);
So instead I suggest
self.allow_journal_messages(".*external channel.*: internal-error")
# TODO: this should only be 'access-denied' | ||
legit_results = [f'result: {err}' for err in ('access-denied', 'authentication-failed', 'terminated')] | ||
self.assertIn(b.text(".super-channel span"), legit_results) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's back 😢 so perhaps drop that commit for the time being?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was kinda hoping to slay this one, to be honest...
I'm not even sure if that's a new problem. We've had failures like this for ages, especially on Arch. It's also "just" the second affected retry on /devel, but everywhere else it failed on the first run, so at least it made this bug much easier to reproduce. It would of course be super fun if your old fixup proposal would help here at all 😉 |
if not os.access('/dev/kvm', os.W_OK): | ||
pytest.skip('kvm unavailable') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ATM, pytest only runs on github workflows, so this test doesn't run anywhere.
So AFAICS we have two choices here:
-
Treat this as unit test. Then this shouldn't involve VMs (or even containers) and network access. I get why we don't want to test against src/ssh/mock-sshd.c, as we are trying to get rid of that code and the libssh dependency. But ferny tests against python-asyncssh, and a mock SSH server looks quite reasonable?
-
Treat it as an integration test. Then this shouldn't reimplement all the bots/machine/hostkey/privkey handling here, but just use the bots API. You can drive this from Python in just a few lines of code if you insist on doing this with pytest fixtures (but honestly our standard
MachineCase
seems just fine for that, and it should be possible to wrap that into a fixture). The integration tests already assume that they run from the git checkout (e.g. as they need test/verify/files), so importing the beiboot module and running it on the host seems fine to me -- the integration tests don't have to involve a browser. E.g. most of test/verify/check-connection doesn't.
It makes sense to split the test. I.e. keep the "unit-y" bits here, like test_parse_destination()
or def test_conds()
(with some adjustments), and move the integration-y ones to test/verify/check-client or a new test/verify/check-beiboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is sort of what I was hoping you'd look into...
This is definitely some kind of a "third type" of test. fwiw, I've been running them locally and they're helping me tonnes (and they all currently pass).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary from our discussion:
- we keep the general shape, but move it into test/pytest-integration/
- drop that bots fixture and just import testvm statically; integration tests just assume bots/
- add a pytest -k mumble mumble select test/pytest-integration/* to test/run, so that it runs on our CI independently of run-tests; it doesn't parallelize, but it's also fast enough (preferably the /networking scenario, it's the fastest)
But also: We don't want to/have to block this PR on that. We can live with not running these for a while. The more useful work would be to add a proper beiboot scenario on an unprepared image. The covered functionality is mostly just a faster/smaller version of what the integration tests already do.
ba48757
to
fccc304
Compare