Inter-Cluster Instance Transfer fail due to socat TLS verification #1681

rbott · 2022-08-21T16:34:29Z

We have been successfully using the move-instance script to move instances from older clusters (e.g. based on Debian Stretch) to newer clusters (based on Debian Bullseye / Ganeti 3.0.2). However, we cannot move instances between Debian Bullseye servers.

This happens because socat is configured to verify the TLS certificate presented by the destination node:

ganeti/lib/impexpd/__init__.py

Line 91 in da6aba3

SOCAT_OPENSSL_OPTS = ["verify=1", "cipher=%s" % constants.OPENSSL_CIPHERS]

However, with recent socat versions, verification also includes matching the hostname against the certificate's CN/SAN entries. For the connection, the destination node's IP address is used, but the cluster certificate always contains ganeti.example.com. This is hardcoded in the constants:

ganeti/src/Ganeti/Constants.hs

Line 605 in da6aba3

x509CertCn = "ganeti.example.com"

I see multiple solutions to this problem:

we can supply each node with a valid certificate (which contains the hostname and all primary/secondary IP addresses) and use that for the import socket server
we can make the verify switch configurable and leave it up to the user (with verify=1 being always broken)
set verify=0 to at least allow people to migrate instances again
something else

What would you suggest?


anarcat · 2023-03-13T21:05:44Z

we can supply each node with a valid certificate (which contains the hostname and all primary/secondary IP addresses) and use that for the import socket server

this would seem like the proper course of action, and i can confirm that such a junk certificate is also used in our cluster configuration here.

anarcat · 2023-03-14T20:10:25Z

ganeti/src/Ganeti/Constants.hs

Line 605 in da6aba3

x509CertCn = "ganeti.example.com"

just for the record, i couldn't find a trace of that variable anywhere in the code but eventually figured out the Haskell constants are kind of transpiled into Python code and it's X509_CERT_CN there.

grepping around for this, i found also this bit:

https://github.com/ganeti/ganeti/blob/114e59fcc9d4a7c82618569f5d6b7389a0f80123/lib/impexpd/__init__.py#L225-L229C64

which means this should already be working, thanks to 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20)....

what error message were you actually getting? right now I'm getting:

2023-03-14 19:54:03,139: DestForMove1 INFO [Tue Mar 14 19:54:03 2023] Disk 2 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.6247 s, 0.0 kB/s)

... but i'm not sure it's related?

anarcat · 2023-03-14T20:17:13Z

related?

https://bugs.launchpad.net/ubuntu/+source/socat/+bug/1936407

https://groups.google.com/g/ganeti/c/BV8GvyN93w0

rbott · 2023-03-14T22:36:00Z

just for the record, i couldn't find a trace of that variable anywhere in the code but eventually figured out the Haskell constants are kind of transpiled into Python code and it's X509_CERT_CN there.

Yeah, the Haskell and Python worlds share the same constants, but the Python constants file (_constants.py) is getting generated during build time from the Constants.hs haskell file. This makes debugging the source code sometimes harder, but helps in the long run :-)
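
As a grepping aid, the name correspondence described above can be sketched in a few lines of Python. This is only an illustration of the naming pattern seen in this thread (x509CertCn becoming X509_CERT_CN); `hs_to_py_constant_name` is a hypothetical helper, not the actual build tool, which may apply further rules.

```python
import re


def hs_to_py_constant_name(hs_name: str) -> str:
    """Map a camelCase Haskell constant name to UPPER_SNAKE_CASE.

    Hypothetical helper for illustration only; the real build step that
    generates _constants.py may handle more cases than this.
    """
    # Insert "_" at every (lowercase-or-digit -> uppercase) boundary,
    # then upper-case the whole name.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", hs_name).upper()


print(hs_to_py_constant_name("x509CertCn"))  # X509_CERT_CN
```

With this, a Haskell name found in Constants.hs can be turned into the string to grep for on the Python side.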

Anyways, instead of using X509_CERT_CN for a shared cluster-wide certificate, each node could/should have its own certificate (which holds the node name and also the cluster name as subject alternative names, along with the related IP addresses). OR one certificate which holds all names and IP addresses. In any case that would be a rather big change to the Ganeti configuration/data on disk.

grepping around for this, i found also this bit:

https://github.com/ganeti/ganeti/blob/114e59fcc9d4a7c82618569f5d6b7389a0f80123/lib/impexpd/__init__.py#L225-L229C64

which means this should already be working, thanks to 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20)....

The socat manpage states:

NOTE: Up to version 1.7.2.4 the server certificate was only checked for validity against the system certificate store or cafile or capath, but not for match with the server’s name or its IP address. Since version 1.7.3.0 socat checks the peer certificate for match with the <host> parameter or the value of the openssl-commonname option. Socat tries to match it against the certificates subject commonName, and the certificates extension subjectAltName DNS names. Wildcards in the certificate are supported.
Option groups: FD,SOCKET,IP4,IP6,TCP,OPENSSL,RETRY
Useful options: min-proto-version, cipher, verify, commonname, cafile, capath, certificate, key, compress, bind, pf, connect-timeout, sourceport, retry
See also: OPENSSL-LISTEN, TCP
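
The matching rule quoted from the manpage can be modelled roughly as follows. This is a simplified sketch, not socat's or OpenSSL's actual implementation: `name_matches` is a hypothetical function, and the wildcard handling (a leading `*.` covering exactly one label) is an assumption about the common case.

```python
def name_matches(expected: str, common_name: str, san_dns_names=()) -> bool:
    """Return True if `expected` matches the CN or any SAN DNS name.

    Simplified model of the check described in the socat manpage;
    a leading "*." is assumed to cover exactly one left-most label.
    """
    name = expected.lower()

    def matches(pattern: str) -> bool:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # Compare everything after the first label of the name.
            _first, _dot, rest = name.partition(".")
            return rest != "" and rest == pattern[2:]
        return pattern == name

    return any(matches(p) for p in (common_name, *san_dns_names) if p)


# The name the client expects vs. a certificate carrying a node name:
print(name_matches("ganeti.example.com", "chi-node-08.torproject.org"))  # False
print(name_matches("ganeti.example.com", "ganeti.example.com"))          # True
```

The first call mirrors the situation in this issue: the certificate is valid, but the name the client was told to expect is not in it.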

I think I should do some more socat testing to qualify this issue properly. You might be on to something here :-)

what error message were you actually getting? right now I'm getting:

2023-03-14 19:54:03,139: DestForMove1 INFO [Tue Mar 14 19:54:03 2023] Disk 2 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.6247 s, 0.0 kB/s)

The error we received was pretty clear:

Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

I think the Disk 2 failed to receive data[...] error must have its cause somewhere else. The problem only hit us with Ganeti 3.0 on Debian Bullseye. However, it is actually related to Debian shipping socat 1.7.4.1 with Bullseye, which handles verify=1 differently from older versions; it seems to have nothing to do with Ganeti itself.

anarcat · 2023-03-15T13:47:12Z

Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

oh, i did get that too, actually.

the above ganeti/launchpad bugs work around the issue by completely downgrading socat, interestingly.

I think I should do some more socat testing to qualify this issue properly. You might be on to something here :-)

i've been tearing my hair out trying to get the impexpd to show the actual damn socat command it's running, i'm down to doing hot patches on the live code right now to include traces, and seriously considering a BPF trace to just show executed programs cluster-wide as well. arghl.

did you manage to get a sample of what the daemon actually executes on your end?

anarcat · 2023-03-15T15:46:56Z

did you manage to get a sample of what the daemon actually executes on your end?

i managed to do an execsnoop and catch this:

socat            14118  14114    0 /usr/bin/socat -ls -d -d -b1048576 -u stdin OPENSSL:204.8.99.102:38547,connect-timeout=20,retry=10,intervall=1,keepalive,keepidle=60,keepintvl=10,keepcnt=5,verify=1,cipher

... so it seems execsnoop gets only a truncated version of the args, aarghl...

anarcat · 2023-03-15T17:09:28Z

i managed to extract the full commandline with bpftrace:

592586     68380 /usr/bin/socat -ls -d -d -b1048576 -u stdin OPENSSL:204.8.99.102:43419,connect-timeout=20,retry=10,intervall=1,keepalive,keepidle=60,keepintvl=10,keepcnt=5,verify=1,cipher=HIGH:-DES:-3DES:-EXPORT:-DH,compress=none,key=/var/run/ganeti/crypto/x509-2023-03-15_16_54_20-q4e8scoz/key,cert=/var/run/ganeti/crypto/x509-2023-03-15_16_54_20-q4e8scoz/cert,cafile=/var/run/ganeti/import-export/export-disk2-2023-03-15_17_03_50-3p2a6wa7/ca,pf=ipv4,openssl-commonname=ganeti.example.com
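
To see at a glance which verification options are in play, the socat address from a captured command line can be split into its parts. The sketch below is a hypothetical helper (`parse_socat_address` is not part of Ganeti or socat) and assumes an IPv4 `TYPE:host:port` target, as in the capture above. The key/cert/cafile options are left out of the example string for brevity.

```python
def parse_socat_address(address: str):
    """Split 'TYPE:host:port,opt=val,flag' into (type, host, port, options).

    Hypothetical helper; assumes an IPv4 target with exactly two colons.
    """
    target, *raw_opts = address.split(",")
    addr_type, host, port = target.split(":")
    options = {}
    for opt in raw_opts:
        key, sep, value = opt.partition("=")
        # Options without "=" (e.g. keepalive) are plain flags.
        options[key] = value if sep else True
    return addr_type, host, int(port), options


captured = (
    "OPENSSL:204.8.99.102:43419,connect-timeout=20,retry=10,intervall=1,"
    "keepalive,keepidle=60,keepintvl=10,keepcnt=5,verify=1,"
    "cipher=HIGH:-DES:-3DES:-EXPORT:-DH,compress=none,pf=ipv4,"
    "openssl-commonname=ganeti.example.com"
)
addr_type, host, port, options = parse_socat_address(captured)
print(host, options["verify"], options["openssl-commonname"])
```

This makes it visible that the client connects to an IP address while being told to expect ganeti.example.com in the peer certificate.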

anarcat · 2023-03-15T18:09:07Z

okay, so i did this test.

1. generate a self-signed certificate: openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1
2. start a socat server: socat -ls -d -d STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem
3. connect to it with a client: socat -ls -d -d STDIO OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem

.. and... that works! (Note that I've also tried with all the extra arguments on the client I found in the execsnoop above, it still works.)

So it certainly seems like we're actually creating a bad certificate that does not match ganeti.example.com.

now if i try commonname=ganeti.example.net (note the .net instead of .com) I get this error:

2023/03/15 14:07:57 socat[263499] E certificate is valid but its commonName does not match hostname "ganeti.example.net"

which is the error we're getting in this issue. So it seems like the certificate generated on the import side does not use the ganeti.example.com domain!

anarcat · 2023-03-15T18:26:31Z

well shit:

root@chi-node-08:~# certtool -i < /run/ganeti/crypto/x509-2023-03-15_18_19_00-scnvnih3/cert | grep Subject:
        Subject: CN=chi-node-08.torproject.org

that's doing a move-instance, while the backup is being exported, before the socat, and on the source side... but it sure looks like the cert is being generated with the node name and NOT the ganeti.example.com thing!

now the trick here is that only the IP address (!? WHY?) is passed down to the API call. I traced the import-export stuff all the way up to noded.NodeRequestHandler.perspective_export_start:

ganeti/lib/server/noded.py

Lines 1247 to 1260 in 114e59f

    
    def perspective_export_start(params):
      """Starts an export daemon.
      """
      (opts_s, host, port, instance, component, (source, source_args)) = params
      opts = objects.ImportExportOptions.FromDict(opts_s)
      return backend.StartImportExportDaemon(constants.IEM_EXPORT, opts,
                                             host, port,
                                             objects.Instance.FromDict(instance),
                                             component, source,
                                             _DecodeImportExportIO(source,
                                                                   source_args))

but worse than this, it seems we enforce the host to be an IP address in the import/export daemon:

ganeti/daemons/import-export

Lines 432 to 436 in 114e59f

    
    if options.host is not None and not netutils.IPAddress.IsValid(options.host):
      try:
        options.host = netutils.Hostname.GetNormalizedName(options.host)
      except errors.OpPrereqError as err:
        parser.error("Invalid hostname '%s': %s" % (options.host, err))

so it's going to be pretty hard to fix that without some magic hackery (e.g. doing a reverse DNS lookup, ouch?).

in any case, this is starting to look pretty promising...

In 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20), a hostname verification was introduced to fix socat's new (and proper) behavior of actually checking the remote hostname during OpenSSL-protected transfers. The problem, however, is that the hostname used was the default `X509_CERT_CN` constant (`x509CertCn` in Haskell) which is hardcoded to `ganeti.example.com`. In a real-world deployment, it seems like the remote CommonName (CN) of the certificate used by the export daemon is actually the target node name.

In my case, it meant I was getting the following error from socat during transfers:

    Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

At first I thought socat might be doing us some trouble, but no: socat works properly. An example is this:

1. generate a self-signed certificate: openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1
2. start a socat server: socat -ls -d -d STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem
3. connect to it with a client: socat -ls -d -d STDIO OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem

... which actually works, which means the `openssl-commonname` argument actually works, and works properly. If it's changed, for example, to `ganeti.example.net`, the above fails with the aforementioned error message.

We fix this by doing a reverse name resolution on the provided IP address. Now, we don't *assume* it's an IP address: this code kicks in only if the impexpd is passed an actual IP address, but in my experience it seems to always be the case (which is probably a separate problem to fix).

This is rather brittle and assumes DNS will not lie, which is quite a stretch. In our environment, however, we have end-to-end DNSSEC so we can trust the DNS. And this beats hardcoding verify=0, which is the other workaround that can be done to fix this issue.

Closes: ganeti#1681
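
The reverse-resolution idea described in the commit message could look roughly like the sketch below. It is not the code from the pull request: `expected_commonname`, its parameters, and the fallback to the default CN when the lookup fails are all assumptions made for illustration. The resolver is injectable so the logic can be tried without touching real DNS.

```python
import ipaddress
import socket


def is_ip_address(host: str) -> bool:
    """Return True if `host` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def expected_commonname(host, default_cn="ganeti.example.com",
                        reverse_lookup=socket.gethostbyaddr):
    """Pick the name to verify the peer certificate against.

    Hypothetical sketch: only an actual IP address triggers a reverse
    lookup; falling back to `default_cn` on failure is an assumption.
    """
    if host is None:
        return default_cn
    if not is_ip_address(host):
        # Already a hostname: use it as-is.
        return host
    try:
        name, _aliases, _addresses = reverse_lookup(host)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return default_cn
    return name


# Usage with a stand-in resolver instead of real DNS:
def fake_lookup(ip):
    return ("chi-node-08.torproject.org", [], [ip])


print(expected_commonname("204.8.99.102", reverse_lookup=fake_lookup))
```

As the commit message notes, the result is only as trustworthy as the DNS answering the reverse lookup.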

anarcat · 2023-03-15T19:20:34Z

okay, I got this to work! i had to do some pretty nasty stuff like resolving the IP address given to impexpd as I can't figure out why or where the IP is passed instead of the hostname. but it works for me, and i figured it was worth sharing. phew!

rbott · 2023-03-17T22:31:56Z

Wow, good work - thank you! I was actually on vacation during the last week and not able to do any testing myself (or respond earlier). I will take a look at your PR now!

//Edit: OK, I'll respond here, to not mess up the discussion flow :-)

The temporary certificate with the instance's primary node name in it seems to be requested here:

ganeti/lib/cmdlib/backup.py

Line 83 in 114e59f

result = self.rpc.call_x509_cert_create(self.instance.primary_node,

We could also try to work around this issue and have the certificate created for the IP address instead of the name. openssl-commonname also works perfectly with an IP address if the certificate's subject contains one :-) That way we would not need to rely on DNS reverse resolution. We could also have the code put both the IP address and the hostname in the certificate as SAN/subject alternative names to not break other usages (although I do not think there are any right now). However, this is a bit more complicated than I expected.

The "heavy lifting" (creating a certificate) is done by this function:

ganeti/lib/utils/x509.py

Lines 254 to 283 in 114e59f

    
    def GenerateSelfSignedX509Cert(common_name, validity, serial_no):
      """Generates a self-signed X509 certificate.
      @type common_name: string
      @param common_name: commonName value
      @type validity: int
      @param validity: Validity for certificate in seconds
      @return: a tuple of strings containing the PEM-encoded private key and
               certificate
      """
      # Create private and public key
      key = OpenSSL.crypto.PKey()
      key.generate_key(OpenSSL.crypto.TYPE_RSA, constants.RSA_KEY_BITS)
      # Create self-signed certificate
      cert = OpenSSL.crypto.X509()
      if common_name:
        cert.get_subject().CN = common_name
      cert.set_serial_number(serial_no)
      cert.gmtime_adj_notBefore(0)
      cert.gmtime_adj_notAfter(validity)
      cert.set_issuer(cert.get_subject())
      cert.set_pubkey(key)
      cert.sign(key, constants.X509_CERT_SIGN_DIGEST)
      key_pem = OpenSSL.crypto.dump_privatekey(OpenSSL.crypto.FILETYPE_PEM, key)
      cert_pem = OpenSSL.crypto.dump_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
      return (key_pem, cert_pem)

It is called by the following backend function, which in turn is part of the noded RPC running on each node. Basically, the Ganeti master asks the node to create a certificate with its name in it (it always passes netutils.Hostname.GetSysName() as the cert's common name).

ganeti/lib/backend.py

Lines 5102 to 5132 in 114e59f

    
    def CreateX509Certificate(validity, cryptodir=pathutils.CRYPTO_KEYS_DIR):
      """Creates a new X509 certificate for SSL/TLS.
      @type validity: int
      @param validity: Validity in seconds
      @rtype: tuple; (string, string)
      @return: Certificate name and public part
      """
      serial_no = int(time.time())
      (key_pem, cert_pem) = \
        utils.GenerateSelfSignedX509Cert(netutils.Hostname.GetSysName(),
                                         min(validity, _MAX_SSL_CERT_VALIDITY),
                                         serial_no)
      cert_dir = tempfile.mkdtemp(dir=cryptodir,
                                  prefix="x509-%s-" % utils.TimestampForFilename())
      try:
        name = os.path.basename(cert_dir)
        assert len(name) > 5
        (_, key_file, cert_file) = _GetX509Filenames(cryptodir, name)
        utils.WriteFile(key_file, mode=0o400, data=key_pem)
        utils.WriteFile(cert_file, mode=0o400, data=cert_pem)
        # Never return private key as it shouldn't leave the node
        return (name, cert_pem)
      except Exception:
        shutil.rmtree(cert_dir, ignore_errors=True)
        raise

Simply extending GenerateSelfSignedX509Cert() to accept a list of names/IPs (and add them as subject alternate names) is not possible, as the crypto library used does not support SAN certificates. It refers to the cryptography library as a higher level replacement that should be used instead of OpenSSL/crypto directly.

We could also modify CreateX509Certificate to use the node's IP address instead of netutils.Hostname.GetSysName() - this should be possible by using netutils.Hostname.GetIP(netutils.Hostname.GetSysName()) instead. However, netutils.Hostname.GetIP() also relies on DNS resolution eventually (although it is a forward lookup instead of a reverse lookup). All in all, this doesn't get us much further than your approach I guess.

So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage".

What would you say @anarcat?

anarcat · 2023-03-20T00:51:15Z

So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage".

i'm not sure i can follow you all the way down that rabbit hole, @rbott, but it seems to make sense. i think that, short term, the latter seems to be sane, but the former is probably a requirement in the long term anyway...

that said, i'd personally like to see the impexpd get the hostname instead of the IP address; i don't get why we pass it the IP address (or where!). it's what i identify as the root cause of the problem. i was just too tired to walk back up the stack (and i was confused by the API layer i couldn't walk up from) to figure out how to fix that....

but i think passing the hostname instead of the IP address would fix the cert issue neatly without having to change anything in the cert generation. i think we could even revert the openssl-commonname stuff since we'd be using the real hostname with a "real" cert...

is that something we could consider here?

also note that I'll be using my three patches in production tomorrow, i am not sure i will have much more time to fight this one problem, as we're already late in this project and i'm living a bit on borrowed time here. :) but if you have patches to test, that's something i could probably do...

thanks!

rbott · 2023-03-21T21:25:21Z

So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage".

Actually I was a bit wrong here. It is possible with the current OpenSSL implementation to create a certificate with additional SAN entries. I will look into that (having both the IP and name in the cert surely won't do any harm here).
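
As a sketch of what "both the IP and name in the cert" could mean in practice, the helper below builds a subjectAltName value in OpenSSL's `DNS:...,IP:...` syntax from a mixed list of names and addresses. `build_san_value` is a hypothetical function, not part of Ganeti. How the resulting string would be attached to the certificate (for example through pyOpenSSL's `X509Extension` and `X509.add_extensions`) is an assumption and is not shown here.

```python
import ipaddress


def build_san_value(names):
    """Build a subjectAltName value such as 'DNS:host,IP:192.0.2.1'.

    Hypothetical helper: entries that parse as IP addresses get the
    "IP:" prefix, everything else is treated as a DNS name.
    """
    entries = []
    for name in names:
        try:
            ipaddress.ip_address(name)
            entries.append("IP:%s" % name)
        except ValueError:
            entries.append("DNS:%s" % name)
    return ",".join(entries)


print(build_san_value(["chi-node-08.torproject.org", "204.8.99.102"]))
# DNS:chi-node-08.torproject.org,IP:204.8.99.102
```

Keeping the node name alongside the IP address means a client verifying against either one would find a match, without relying on DNS at verification time.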

that said, i'd personally like to see the impexpd get the hostname instead of the IP address; i don't get why we pass it the IP address (or where!).

You are right, I totally forgot to check that path as well. I will look into this and post my findings!

also note that I'll be using my three patches in production tomorrow, i am not sure i will have much more time to fight this one problem, as we're already late in this project and i'm living a bit on borrowed time here. :) but if you have patches to test, that's something i could probably do...

I am quite busy right now and will try to allocate some time for this issue soon. Along with updating the move-instance documentation :-)


rbott mentioned this issue Mar 12, 2023

move-instance difficult to use and ultimately fails #1696

Open

anarcat linked a pull request Mar 15, 2023 that will close this issue

impexpd: verify remote socket against actual host name #1699

Open
