
Add Secure Connections Standard #548

Open
wants to merge 24 commits into main

Conversation

markus-hentsch
Contributor

Closes #547

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
markus-hentsch added the SCS-VP10 (Related to tender lot SCS-VP10) label on Apr 4, 2024
markus-hentsch changed the title from "Add Secure Communication Standard" to "Add Secure Connections Standard" on Apr 4, 2024
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Contributor

@josephineSei left a comment

nice first draft - I have a few questions inline

## Conformance Tests

Conformance tests are limited to communication channels exposed to users, such as the public API interfaces.

Contributor

What would be necessary to test here besides the exposed channels?
RPC and DB? Maybe Libvirt? Can we audit this somehow, maybe with a tcpdump looking for such traffic?

However we decide on this, it should also be documented here, so CSPs will know how to deal with possible audits.

Contributor Author

Essentially everything that mentions a "MUST" or "MUST NOT" in this standard, except for the external API.

  • AMQP/RPC
  • SQL/DB
  • internal API
  • Libvirt port

Most of it could be determined based on a network traffic recording (e.g. tcpdump) by looking at the port numbers and traffic in conjunction with the target addresses/ports configured in the OpenStack *.confs. For the Libvirt port exposure, a netstat on the compute nodes might help, or simply grepping the Libvirt configuration for listen_addr.
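To illustrate the node-local part of such an audit, a rough sketch (the file locations and option names are assumptions based on common deployments and would need adjusting, e.g. for containerized setups):

```python
# Rough sketch of a node-local spot check as described above. The file paths
# and option names are assumptions (common defaults), not part of the standard.
import configparser
import re

LIBVIRTD_CONF = "/etc/libvirt/libvirtd.conf"  # assumed location
NOVA_CONF = "/etc/nova/nova.conf"             # assumed location


def libvirt_listen_addr(path=LIBVIRTD_CONF):
    """Return the configured listen_addr from libvirtd.conf, or None if unset."""
    with open(path) as conf:
        for line in conf:
            match = re.match(r'\s*listen_addr\s*=\s*"?([^"\s]+)"?', line)
            if match:
                return match.group(1)
    return None


def nova_connection_targets(path=NOVA_CONF):
    """Collect the AMQP and DB target URLs from nova.conf for manual review."""
    cfg = configparser.ConfigParser(strict=False, interpolation=None)
    cfg.read(path)
    return {
        "transport_url": cfg.get("DEFAULT", "transport_url", fallback="<not set>"),
        "database connection": cfg.get("database", "connection", fallback="<not set>"),
    }


if __name__ == "__main__":
    addr = libvirt_listen_addr()
    print(f"libvirt listen_addr: {addr}")
    if addr not in (None, "127.0.0.1", "::1"):
        print("WARNING: libvirt appears to listen on a non-loopback address")
    for name, target in nova_connection_targets().items():
        print(f"{name}: {target}")  # cross-check against the traffic capture
```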

Most of this is deep in the infrastructure though. A filtered network capture might work but I don't think CSPs would be especially happy about it.

Is there any existing standard where a non-external audit is proposed? I'm unsure how to tackle this in the standard.

Contributor

At least, the standard should mention that conformance tests only check external endpoints. Internal endpoints MUST be checked with manual audits. We will have manual audits in other conformance tests as well; e.g. in #555 we investigate how to check flavor implementation, and I guess manual audits are required in this context, too. In favor of this PR, I suggest opening a new issue to propose a general concept for manual audits.

Contributor

I agree with Anja, we need a central place for manual audits. And a document that includes all necessary checks, so any auditor can simply follow through.

https://crypto.stackexchange.com/a/95660

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
and implement all tests based on the current standard draft

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
markus-hentsch marked this pull request as ready for review on April 9, 2024, 12:25
bitkeks self-requested a review on April 10, 2024, 08:38
@martinmo
Member

@markus-hentsch I have a general remark: Because TLS configuration and security is a moving target, have you considered basing the recommended configuration on one of the profiles offered by Mozilla? For example, the "intermediate" profile, see https://ssl-config.mozilla.org/ and https://wiki.mozilla.org/Security/Server_Side_TLS. (AFAIK, these can be checked with sslyze.)

@artificial-intelligence
Contributor

@markus-hentsch I have a general remark: Because TLS configuration and security is a moving target, have you considered basing the recommended configuration on one of the profiles offered by Mozilla? For example, the "intermediate" profile, see https://ssl-config.mozilla.org/ and https://wiki.mozilla.org/Security/Server_Side_TLS. (AFAIK, these can be checked with sslyze.)

lol, I had the same idea and actually checked our haproxy TLS implementation; it seems there is some opportunity to do some hardening there:

COMPLIANCE AGAINST MOZILLA TLS CONFIGURATION
 --------------------------------------------

    Checking results against Mozilla's "MozillaTlsConfigurationEnum.INTERMEDIATE" configuration. See https://ssl-config.mozilla.org/ for more details.

    a.regiocloud.tech:443: FAILED - Not compliant.
        * ciphers: Cipher suites {'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA', 'TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384', 'TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA', 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256'} are supported, but should be rejected.
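For anyone who wants to reproduce this check, a minimal wrapper around the sslyze CLI would look roughly like the sketch below; the `--mozilla_config` option and the wording of the output should be verified against the installed sslyze version:

```python
# Minimal sketch: run sslyze's built-in Mozilla TLS profile check against a
# public endpoint. Assumes an sslyze CLI (5.x/6.x) on PATH; matching on the
# human-readable output is a heuristic, not a stable interface.
import subprocess
import sys


def check_mozilla_compliance(host, port=443, profile="intermediate"):
    """Return True if sslyze reports the endpoint as compliant with the profile."""
    result = subprocess.run(
        ["sslyze", f"--mozilla_config={profile}", f"{host}:{port}"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return "FAILED - Not compliant" not in result.stdout


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "a.regiocloud.tech"
    sys.exit(0 if check_mozilla_compliance(target) else 1)
```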

@artificial-intelligence
Contributor

working on a fix for upstream: https://bugs.launchpad.net/kolla-ansible/+bug/2060787

anjastrunk self-requested a review on April 15, 2024, 07:21
Contributor

@anjastrunk left a comment

Just some minor cosmetic changes:

  • I prefer to write Cloud Service Provider instead of CSP, as "CSP" is not an official abbreviation
  • I do not like writing "SCS proposes...", "SCS decides...". AFAIK, SCS stands for Sovereign Cloud Stack, which is a software stack and cannot decide anything. I prefer to write "SCS project" or "SCS community"

But again, this is just cosmetics.

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
@markus-hentsch
Contributor Author

Just some minor cosmetic changes:

* I prefer to write Cloud Service Provider instead of CSP, as "CSP" is not an official abbreviation

* I do not like writing "SCS proposes...", "SCS decides...". AFAIK, SCS stands for Sovereign Cloud Stack, which is a software stack and cannot decide anything. I prefer to write "SCS project" or "SCS community"

But again, this is just cosmetics.

I adjusted the SCS references. I left "CSP" as-is and added a glossary instead, like I did with some other standards. We seem to use CSP a lot in other standards so I'd like to stay consistent. The glossary at the top should introduce the abbreviation sufficiently now.

Contributor

@josephineSei left a comment

Overall I think this is a good starting point. We need something for the manual audits, but this may be something to discuss in the standards SIG or in the IAM and Security meeting.

## Conformance Tests

Conformance tests are limited to communication channels exposed to users, such as the public API interfaces.

Contributor

I agree with Anja, we need a central place for manual audits. And a document that includes all necessary checks, so any auditor can simply follow through.

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Contributor

@artificial-intelligence left a comment

Thanks for all the work done so far on this.

Unfortunately I think there's still some work left to do.

Thanks.

3. For the message queue, AMQP-based solutions such as RabbitMQ and QPid do offer TLS natively which is also supported by OpenStack. ZeroMQ does not and requires IPsec or CIPSO instead.
4. External API endpoints can be protected easily by using a TLS proxy. They can then be registered with their HTTPS endpoint in the Keystone service catalog. The certificates of external APIs usually need to be signed by a well-known CA in order to be accepted by arbitrary external clients.
5. Internal API endpoints can be treated and secured similarly to the external ones. Optionally, the TLS certificate provider and PKI can differ from the external ones. It is often sufficient for the CA of internal TLS endpoints to be accepted within the infrastructure; it doesn't need to be a common public CA.
6. For protecting the data transferred between compute nodes during live-migration of VMs, [Nova offers support for QEMU-native TLS](https://docs.openstack.org/nova/latest/admin/secure-live-migration-with-qemu-native-tls.html). As an alternative, SSH is also a channel that Nova can be configured to use between hosts for this but requires passwordless SSH keys with root access to all other compute nodes which in turn requires further hardening.
Contributor

This does not necessarily require root SSH access. Also note that certain migration scenarios require SSH-based transport instead. One scenario is the use of secure boot VMs that use the NVRAM file provided by Nova to store their secure boot keys. This file can currently only be transferred via SSH, as far as I know.

Also, if you have a volume-backed instance you need SSH, see the PTG notes:

https://etherpad.opendev.org/p/nova-dalmatian-ptg#L617

to quote it, as it's an editable etherpad, which might vanish in the future:

    (tobias-urdin, not attending thu-fri): live migration of a volume-backed instance requires SSH access between compute nodes
    the RemoteFS code is used to mkdir the volume directory on the destination, could we move that to be done by the nova-compute service over the RPC layer instead?

    same RemoteFS code is used to copy vtpm and config-drive from /var/lib/nova/instances directory when a volume backed instance is used, but I don't see any good way to replace that since we need a datapath and nova-compute <-> nova-compute API would probably be messy

    (mikal) This sort of came up for me the other day. The config-drive code is a bit weird at the moment. Its either on the hypervisor, or in your network block store as long as the network block store is exactly ceph. Is it weird that we can't store config-drives on cinder volumes? I think maybe we ended up this way because we didn't want to create lots of small cinder volumes, but maybe we should revisit that so that live migration doesn't need to copy a config-drive around at all? Should that also be true for UEFI NVRAM persistence? We could just pretend they were small disks too?

This is also why SSH-based migration was, in fact, undeprecated again (well, it's still in the process):

https://review.opendev.org/c/openstack/nova/+/915481

Still, QEMU-native TLS is good, if it works for the needed use case.

But it would certainly be possible to secure the SSH access fairly trivially so that it only supports certain locked-down commands; see the "command" option in the authorized_keys man page:

https://manpages.debian.org/bullseye/openssh-server/authorized_keys.5.en.html#AUTHORIZED_KEYS_FILE_FORMAT

So maybe we could extend this to make SSH-based live migration secure as well.
I have opened an issue for that on our side, as we need to implement it in kolla either way: osism/issues#1001


### Hypervisor and Live Migration Connections

- If using libvirt on compute nodes, the libvirt port (as per `listen_addr`) MUST NOT be exposed to the network in an unauthenticated and unprotected fashion. It SHOULD be bound to `127.0.0.1`.
Contributor

I don't think the part "It SHOULD be bound to 127.0.0.1" can really be done.
This address is used for libvirt live migration and must be reachable by other compute nodes, which obviously doesn't work via localhost.

See e.g. this old security notice for details: https://wiki.openstack.org/wiki/OSSN/OSSN-0007

Contributor Author

I'll need to look into this again. There was a way with just unix domain sockets in conjunction with SSH iirc: https://gitlab.com/yaook/operator/-/issues/164#note_553333437

Contributor

Thanks for that insightful source! Though the comment is very wrong about the SSH access; I'm talking about this part specifically:

[..] because it would result in a full host access instead of only libvirt[..]

It's certainly possible to lock down SSH access as you like; I assume the person commenting was just not familiar with how this is possible and was only referring to the upstream (honestly quite confusing) Nova docs around this:

https://docs.openstack.org/nova/latest/admin/configuring-migrations.html#general-configuration

specifically point 3, which reads:

Enable password-less SSH so that root on one compute host can log on to any other compute host without providing a password.

As I said already, this can of course be locked down; the Nova docs are just wrong here, you don't need to expose arbitrary root access here.

Contributor Author

As I said already, this can of course be locked down; the Nova docs are just wrong here, you don't need to expose arbitrary root access here.

Are you referring to your other comment which states:

But it would certainly be possible to secure the SSH access fairly trivially so that it only supports certain locked-down commands; see the "command" option in the authorized_keys man page

?
If so, have you looked at steps 8 and onward in the comment I linked [1]? Isn't that essentially what the comment author was proposing? Or am I missing something?
In this setup you'd have libvirt exposed only as a local UNIX domain socket, and migration connections from other compute hosts would use SSH as a tunnel with the local UNIX socket as the target. The SSH tunnel would be restricted to the commands known to be used by Nova for migration purposes.

Footnotes

  1. https://gitlab.com/yaook/operator/-/issues/164#note_553333437

Contributor

Ah I missed that step 8, thanks for bringing it to my attention.

So a local UNIX domain socket might work; I never tested that, but it seems like a clever idea.

I don't know the package used in that comment ("restricted-ssh-commands"), to be honest; I would like to review that. Usually you don't need an extra package for this, and you need to craft a list of allowed commands manually out of pure necessity either way, I think. So I would be sceptical about this package.

Reading the man-page: https://manpages.debian.org/bookworm/restricted-ssh-commands/restricted-ssh-commands.1.en.html

This seems to just allow for a regex, which I don't think is fine-grained enough.

The way this is usually done is that you either write a program in your favorite secure compiled language or just use a scripting language, if it can be made secure enough.

You encode all the allowed invariants directly into your program/script, maybe add some parameters which can be supplied via the SSH command, and place that with sufficient permissions inside the correct Nova containers.

Then you alter your authorized_keys file like this:

command="/path/to/necessary_nova_command $with_option1 $with_option2" ssh-ed25519 AAAAC[...]

This way the connection via this SSH key can only invoke this command, nothing else; you get no shell access and only need to take care of the usual environment sanitization to avoid env attacks like LD_PRELOAD et al. This is pretty robust and easy, no need for additional packages.
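A minimal sketch of what such a forced-command wrapper could look like (the allowlist below is purely hypothetical; the real one would have to be derived from the commands Nova actually issues during migration):

```python
#!/usr/bin/env python3
# Hypothetical forced-command wrapper for the "command=" entry shown above.
# sshd passes the client's requested command in SSH_ORIGINAL_COMMAND; anything
# not matching the allowlist is rejected. The allowlist is a placeholder only.
import os
import shlex
import subprocess
import sys

ALLOWED_PATH_PREFIX = "/var/lib/nova/instances/"  # placeholder restriction
ALLOWED = {
    "mkdir": ["-p"],  # executable -> fixed argument prefix (placeholder)
}


def main():
    argv = shlex.split(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    if not argv or argv[0] not in ALLOWED:
        print("command rejected", file=sys.stderr)
        return 1
    prefix = ALLOWED[argv[0]]
    paths = argv[1 + len(prefix):]
    if argv[1:1 + len(prefix)] != prefix or not paths or not all(
        p.startswith(ALLOWED_PATH_PREFIX) for p in paths
    ):
        print("arguments rejected", file=sys.stderr)
        return 1
    # run with a minimal, sanitized environment (no LD_PRELOAD and friends)
    return subprocess.run(argv, env={"PATH": "/usr/bin:/bin"}).returncode


if __name__ == "__main__":
    sys.exit(main())
```

It would then be referenced from authorized_keys via command="/usr/local/bin/nova-migration-wrapper" (path hypothetical), analogous to the example above.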

A regex is, on the one hand, not expressive enough, because Nova might need to call different commands or needs access to different folders etc., so you need to wrap that inside a script; on the other hand, a regex is too complex because you might end up allowing more than you originally wanted.

The man page even has a security warning which really just reiterates my point: a regex is not a good format for this kind of thing.

Also, this tool is written in bash, which can be fine, but I would do a security code review before using it; I have seen innocent bash proxy scripts lead to remote code execution before.

The code can be seen here:
https://github.com/bdrung/restricted-ssh-commands/blob/master/restricted-ssh-commands

But this is really an implementation detail, not relevant to the standard itself I would say.
Feel free to ping me if you or one of your colleagues wants to chat more about this topic. 🙂

Contributor Author

I put a bit more research into this today and, for now, added a whole new section on this to the design considerations.
It seems like there are two ways in total to address OSSN-0007, i.e. the problem of unauthorized access to the libvirt interface:

  1. Switching libvirt to TLS + using QEMU-native TLS for live migration + employing TLS client certificates (while restricting the libvirt port to known identities)
  2. Restricting the libvirt port to localhost (127.0.0.1) + enabling libvirt UNIX socket + using SSH for live migration + restricting SSH user permissions

In case 1, the access restriction would be implemented by distributing X.509 client certificates and allowlisting them; in case 2, it would be implemented using SSH key pairs in a similar fashion.

I still need to pursue this further and derive a better recommendation in the actual standard section (i.e. the line this comment thread was started on).


## Conformance Tests

Conformance tests are limited to communication channels exposed to users, such as the public API interfaces.
Contributor

Why? At least there should be some justification either here or in a decision record document, no?

Contributor

The automatic conformance tests, as I understand them, are executed on SCS infrastructure and run against remotely reachable cloud infrastructure. Is this correct? This simply would not allow looking into internal communication channels automatically. E.g. the connection from Nova to the DB can only be checked by inspecting the config files. The same goes for the message queue. This can only be done if you have access to the nodes and not just to the OpenStack infrastructure.

Contributor

While I understand that, I find it difficult to standardize something when the conformance tests for the standard cannot really guarantee that the standard is being correctly implemented in its full form. IMHO this leads to a false sense of security.

At least it must be made clear to end users what is not tested (e.g. live migration traffic), and/or maybe the CSP must somehow declare that they in fact follow the standard to the letter, but I find this a much weaker signal than the result of an automated test.

Maybe create an automated test that only the CSP can run internally and mandate that the results are published? You still have to trust the CSP not to tamper with the test, but at least the output can be parsed automatically by machines and made available on scs.community this way.

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
@markus-hentsch
Contributor Author

Updated standard and test script to use the Mozilla TLS "intermediate" preset now.
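Conceptually, the endpoint enumeration part of such a test could look like the sketch below (a simplified illustration, not the actual test script from this PR; the cloud name and service type list are assumptions), with each HTTPS endpoint then fed into the sslyze Mozilla profile check shown earlier:

```python
# Simplified illustration of enumerating a cloud's public API endpoints and
# asserting they are HTTPS. Not the test script from this PR; the cloud name
# and service type list are assumptions.
from urllib.parse import urlparse

import openstack

SERVICE_TYPES = ["identity", "compute", "image", "network", "volumev3", "object-store"]


def public_endpoints(cloud_name):
    """Yield (service_type, public URL) pairs from the service catalog."""
    conn = openstack.connect(cloud=cloud_name)
    for service_type in SERVICE_TYPES:
        try:
            url = conn.session.get_endpoint(service_type=service_type, interface="public")
        except Exception:
            continue  # service not present in this cloud's catalog
        if url:
            yield service_type, url


def check_cloud(cloud_name):
    ok = True
    for service_type, url in public_endpoints(cloud_name):
        if urlparse(url).scheme != "https":
            print(f"{service_type}: public endpoint {url} is not HTTPS")
            ok = False
    return ok


if __name__ == "__main__":
    raise SystemExit(0 if check_cloud("my-cloud") else 1)
```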

@@ -0,0 +1,2 @@
openstacksdk>=3.0.0
sslyze>=6.0.0
Contributor Author

TODO: we need proper pinning for this, preferably originating from Tests/scs-compatible-iaas.yaml as discussed with @mbuechse

We need to document the corresponding Mozilla JSON [1] tightly coupled with the latest SSLyze library version that ships that exact JSON [2] for testing.

Requires: #595

Footnotes

  1. https://ssl-config.mozilla.org/guidelines/5.7.json

  2. https://github.com/nabla-c0d3/sslyze/blob/761892bf6a613726f272eb4e19b33faee2272b6f/sslyze/mozilla_tls_profile/5.7.json

Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Signed-off-by: Markus Hentsch <markus.hentsch@cloudandheat.com>
Labels
SCS-VP10 Related to tender lot SCS-VP10
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

Secure communication standard for IaaS infrastructure
5 participants