Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

massive Peer rework, based on FernyTransport #19668

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

allisonkarlitskaya
Copy link
Member

@allisonkarlitskaya allisonkarlitskaya commented Nov 29, 2023

src/cockpit/peer.py Fixed Show resolved Hide resolved
src/cockpit/remote.py Fixed Show resolved Hide resolved
src/cockpit/beiboot.py Fixed Show resolved Hide resolved
src/cockpit/beiboot.py Fixed Show resolved Hide resolved
@martinpitt

This comment was marked as resolved.

@allisonkarlitskaya allisonkarlitskaya force-pushed the ferny-transport branch 3 times, most recently from 8efcb70 to 5eae275 Compare December 6, 2023 12:52
src/cockpit/beiboot.py Fixed Show fixed Hide fixed
@allisonkarlitskaya
Copy link
Member Author

allisonkarlitskaya commented Dec 6, 2023

The permission denied error in the superuser case is annoying: that's the sound of the kill() syscall failing to reap the bridge running as root...

That's python/cpython#112800 but we're going to need a workaround...

@allisonkarlitskaya

This comment was marked as resolved.

@martinpitt

This comment was marked as resolved.

@allisonkarlitskaya allisonkarlitskaya force-pushed the ferny-transport branch 2 times, most recently from 09a5c4d to dc4ed95 Compare December 8, 2023 07:59
src/cockpit/beiboot.py Fixed Show fixed Hide fixed
@martinpitt

This comment was marked as resolved.

@allisonkarlitskaya
Copy link
Member Author

Looking at TestReverseProxy.testNginxNoTLS. It starts ws with --local-session=cockpit-bridge, but that bridge crashes right away:

Looking at the polkit code, and thinking about the error message, it's worth noting:

            self.subject = ('unix-session', {'session-id': Variant(os.environ['XDG_SESSION_ID'], 's')})

So clearly it's the not-a-real-session spawning of the bridge that's the issue here, which means that this problem is specific to this testcase.

That being said, there's nothing wrong with robustifying the code a bit there. There may be other reasons that polkit isn't going to want us to become an agent...

@allisonkarlitskaya allisonkarlitskaya force-pushed the ferny-transport branch 4 times, most recently from 6698519 to 3d568bf Compare December 11, 2023 14:53
@allisonkarlitskaya
Copy link
Member Author

Okay, so most stuff is getting under control by now.

A large part of the remaining problem is that the NumberOfPasswordPrompts SSH option doesn't do what its name suggests: it also controls the number of time that passphrases are prompted for, and setting it to zero effectively disables the use of locked keys.

I could add yet another option for dealing with that, but I start to wonder if maybe it's time to draw a clearer line in the sand between "interactive" and "non-interactive" uses of cockpit-beiboot.

The non-interactive case would immediately prompt for the password on startup using the * authorize message and never send authorize messages again after that. Any other place where an askpass prompt arrived would effectively be fatal. The interactive mode wouldn't prompt on startup, but would behave more or less the same as Cockpit Client expects.

I'll roll this over in my head a bit. Pending a reasonable solution there and resolution of the situation about how to run cockpit.beiboot from inside the beipack this is starting to look promising, though...

from pathlib import Path
from typing import Dict, Iterable, Optional, Sequence
from typing import Iterable, NamedTuple, Sequence

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'Sequence' is not used.
src/cockpit/peer.py Fixed Show fixed Hide fixed
src/cockpit/beiboot.py Fixed Show fixed Hide fixed
message['superuser'] = False
self.ssh_peer.write_control(message)
# Step 2: when the client replies, create the peer
def do_authorize(self, message: JsonObject) -> None:

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
@allisonkarlitskaya allisonkarlitskaya force-pushed the ferny-transport branch 3 times, most recently from 7414e8f to 779a474 Compare December 15, 2023 14:12
import sys
import typing

Check notice

Code scanning / CodeQL

Module is imported with 'import' and 'import from' Note

Module 'typing' is imported with both 'import' and 'import from'.
test/pytest/test_peer.py Fixed Show fixed Hide fixed
test/pytest/test_peer.py Fixed Show fixed Hide fixed
@martinpitt
Copy link
Member

testBeibootWithBridge is rather shallow: There's just some impedance mismatch with the recent commit 22f38b3. Either we keep unknown-host (as on current main), and that fixes it:

-- src/cockpit/beiboot.py
+++ src/cockpit/beiboot.py
@@ -252,7 +252,7 @@ class ForwarderPeer(Peer):
 
     def do_exception(self, exc: Exception) -> None:
         if isinstance(exc, (OSError, socket.gaierror)):
-            raise CockpitProblem('no-host', error='no-host', message=str(exc)) from exc
+            raise CockpitProblem('unknown-host', error='unknown-host', message=str(exc)) from exc
 
         elif isinstance(exc, ferny.SshHostKeyError):
             hostkey_info = self.authorize_handler.hostkey_info or {}

Or we need to change the tests to expect no-host instead:

--- test/verify/check-client
+++ test/verify/check-client
@@ -206,7 +206,7 @@ Command = /usr/bin/env python3 -m cockpit.beiboot --interactive
         # unreachable host
         b.set_val("#server-field", "unknownhost")
         b.click("#login-button")
-        b.wait_in_text("#login-error-message", "Host is unknown")
+        b.wait_in_text("#login-error-message", "Unable to connect to that address")
         b.wait_val("#server-field", "unknownhost")
         # does not appear in recent hosts
         b.wait_in_text("#recent-hosts-list", "10.111.113.2")
@@ -215,7 +215,7 @@ Command = /usr/bin/env python3 -m cockpit.beiboot --interactive
         # wrong port
         b.set_val("#server-field", "10.111.113.2:222")
         b.click("#login-button")
-        b.wait_in_text("#login-error-message", "Host is unknown")
+        b.wait_in_text("#login-error-message", "Unable to connect to that address")
 
         # unencrypted SSH key login
         self.m_client.execute("runuser -u admin -- ssh-keygen -t rsa -N '' -f ~admin/.ssh/id_rsa")

I actually prefer the latter -- I reviewed the two errors, and "unknown-host" feels more like "SSH does not know this" rather than "network/DNS failure".

@martinpitt
Copy link
Member

martinpitt commented Mar 22, 2024

TestHostSwitching.testBasic is just an unexpected message:

/manifests.js: external channel failed: internal-error

It's a flake, too, and doesn't reproduce locally. However, the test already has:

        self.allow_journal_messages(".*: failure while serving external channel: internal-error")

but that string doesn't actually exist any more -- the bridge has done this for a long time already:

src/ws/cockpitchannelresponse.c:          g_message ("%s: external channel failed: %s", self->logname, problem);

So the test just needs to adjust the expected error message (that should be a separate commit, as it would be a flake on main as well). I'm also happy to send that as a separate PR (but it by itself feels too small to run the whole CI machinery on..)

@martinpitt
Copy link
Member

martinpitt commented Mar 22, 2024

testBeibootNoBridge is harder: The "switch to admin access" dialog never goes away, but it actually got quite far: on the "target" machine I see

\_ sshd: admin [priv]
|   \_ sshd: admin@notty
|       \_ python3 -ic # cockpit-bridge
|           \_ sh -ec echo SSH_AUTH_SOCK=$SSH_AUTH_SOCK; read a
|           |   \_ ssh-agent sh -ec echo SSH_AUTH_SOCK=$SSH_AUTH_SOCK; read a
|           \_ sudo -k -A python3 -ic # cockpit-bridge --privileged
|               \_ python3 -ic # cockpit-bridge --privileged

i.e. the remote priv bridge did start. Before I debug that further, I'll coordinate with Lis.

test/verify/check-superuser TestSuperuserDashboard is hopefully/likely the same root cause: no actual superusers session on a remote machine after logging in.

And finally, TestSuperuserDashboardOldMachine.test is supposedly the hardest: CentOS 7 did not yet have any concept of a "remote privilege indicator". It tries to run pkexec cockpit-bridge --privileged, but that hangs forever, and that's presumably the reason why the "reboot" root channel (and thus that dialog) blocks. So let's sort out the other two above first.

@martinpitt
Copy link
Member

Oh, and there's of course the too old Python crash for RHEL 9.4's py 3.9:

   File "/usr/lib/python3.9/site-packages/cockpit/peer.py", line 144, in ConfiguredPeer
    helper: BridgeBeibootHelper | None = None
  TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

This message doesn't exist anymore — update to the new string written by
cockpit-ws.

Thanks Martin :)
@allisonkarlitskaya allisonkarlitskaya force-pushed the ferny-transport branch 2 times, most recently from b29c305 to 35fc1d4 Compare March 25, 2024 10:47
@allisonkarlitskaya
Copy link
Member Author

testBeibootNoBridge is harder: The "switch to admin access" dialog never goes away, but it actually got quite far: on the "target" machine I see

This is also the failing unit test...

@allisonkarlitskaya
Copy link
Member Author

testBeibootNoBridge is harder: The "switch to admin access" dialog never goes away, but it actually got quite far: on the "target" machine I see

This is also the failing unit test...

Okay. Fixed that, and it's getting a lot farther, but now it's failing for a new reason: the processes aren't getting cleaned up properly.

self.allow_journal_messages(".*: failure while serving external channel: internal-error")
self.allow_journal_messages(".*: external channel failed: internal-error")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, it seems I was wrong -- The message still does exist:

src/ws/cockpitchannelresponse.c:        g_debug ("%s: failure while serving external channel: %s", self->logname, problem);

So instead I suggest

 self.allow_journal_messages(".*external channel.*: internal-error")

Comment on lines -97 to -99
# TODO: this should only be 'access-denied'
legit_results = [f'result: {err}' for err in ('access-denied', 'authentication-failed', 'terminated')]
self.assertIn(b.text(".super-channel span"), legit_results)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it's back 😢 so perhaps drop that commit for the time being?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was kinda hoping to slay this one, to be honest...

@martinpitt
Copy link
Member

failing for a new reason: the processes aren't getting cleaned up properly.

I'm not even sure if that's a new problem. We've had failures like this for ages, especially on Arch. It's also "just" the second affected retry on /devel, but everywhere else it failed on the first run, so at least it made this bug much easier to reproduce. It would of course be super fun if your old fixup proposal would help here at all 😉

Comment on lines +40 to +41
if not os.access('/dev/kvm', os.W_OK):
pytest.skip('kvm unavailable')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM, pytest only runs on github workflows, so this test doesn't run anywhere.

So AFAICS we have two choices here:

  1. Treat this as unit test. Then this shouldn't involve VMs (or even containers) and network access. I get why we don't want to test against src/ssh/mock-sshd.c, as we are trying to get rid of that code and the libssh dependency. But ferny tests against python-asyncssh, and a mock SSH server looks quite reasonable?

  2. Treat it as an integration test. Then this shouldn't reimplement all the bots/machine/hostkey/privkey handling here, but just use the bots API. You can drive this from Python in just a few lines of code if you insist on doing this with pytest fixtures (but honestly our standard MachineCase seems just fine for that, and it should be possible to wrap that into a fixture). The integration tests already assume that they run from the git checkout (e.g. as they need test/verify/files), so importing the beiboot module and running it on the host seems fine to me -- the integration tests don't have to involve a browser. E.g. most of test/verify/check-connection doesn't.

It makes sense to split the test. I.e. keep the "unit-y" bits here, like test_parse_destination() or def test_conds() (with some adjustments), and move the integration-y ones to test/verify/check-client or a new test/verify/check-beiboot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sort of what I was hoping you'd look into...

This is definitely some kind of a "third type" of test. fwiw, I've been running them locally and they're helping me tonnes (and they all currently pass).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary from our discussion:

  • we keep the general shape, but move it into test/pytest-integration/
  • drop that bots fixture and just import testvm statically; integration tests just assume bots/
  • add a pytest -k mumble mumble select test/pytest-integration/* to test/run, so that it runs on our CI independently of run-tests; it doesn't parallelize, but it's also fast enough (preferably the /networking scenario, it's the fastest)

But also: We don't want to/have to block this PR on that. We can live with not running these for a while. The more useful work would be to add a proper beiboot scenario on an unprepared image. The covered functionality is mostly just a faster/smaller version of what the integration tests already do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants