Inter-Cluster Instance Transfer fail due to socat TLS verification #1681
Comments
this would seem like the proper course of action, and i can confirm that such a junk certificate is also used in our cluster configuration here. |
just for the record, i couldn't find a trace of that variable anywhere in the code but eventually figured out the Haskell constants are kind of transpiled into Python code and it's grepping around for this, i found also this bit: which means this should already be working, thanks to 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20).... what error message were you actually getting? right now I'm getting:
... but i'm not sure it's related? |
Yeah, the Haskell and Python worlds share the same constants, but the Python constants file ( Anyways, instead of using
The
I think I should do some more
The error we received was pretty clear:
I think the |
oh, i did get that too, actually. the above ganeti/launchpad bugs work around the issue by completely downgrading socat, interestingly.
i've been tearing my hair out trying to get the impexpd to show the actual damn socat command it's running, i'm down to doing hot patches on the live code right now to include traces, and seriously considering a BPF trace to just show executed programs cluster-wide as well. arghl. did you manage to get a sample of what the daemon actually executes on your end? |
i managed to do an execsnoop and catch this:
... so it seems execsnoops gets only a truncated version of the args, aarghl... |
i managed to extract the full commandline with bpftrace:
|
okay, so i did this test.
.. and... that works! (Note that I've also tried with all the extra arguments on the client I found in the execsnoop above, it still works.) So it certainly seems like we're actually creating a bad certificate that does not match now if i try
which is the error we're getting in this issue. So it seems like the certificate generated on the import side does not use the |
well shit:
that's doing a now the trick here is that only the IP address (!? WHY?) is passed down to the API call. I traced the import-export stuff all the way up to Lines 1247 to 1260 in 114e59f
but worse than this, it seems we enforce the host to be an IP address in the import/export daemon: Lines 432 to 436 in 114e59f
so it's going to be pretty hard to fix that without some magic hackery (e.g. doing a reverse DNS lookup, ouch?). in any case, this is starting to look pretty promising... |
In 7bb0351 (impexpd: fix certificate verification with new socat versions, 2017-12-20), a hostname verification was introduced to fix socat's new (and proper) behavior of actually checking the remote hostname during OpenSSL-protected transfers.

The problem, however, is that the hostname used was the default `X509_CERT_CN` constant (`x509CertCn` in Haskell) which is hardcoded to `ganeti.example.com`. In a real-world deployment, it seems like the remote CommonName (CN) of the certificate used by the export daemon is actually the target node name.

In my case, it meant I was getting the following error from socat during transfers:

    Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

At first I thought socat might be doing us some trouble, but no: socat works properly. An example is this:

1. generate a self-signed certificate:

       openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -nodes -subj '/CN=ganeti.example.com' -days 1

2. start a socat server:

       socat -ls -d -d STDIO OPENSSL-LISTEN:8443,reuseaddr,forever,key=key.pem,cert=cert.pem,cafile=cert.pem

3. connect to it with a client:

       socat -ls -d -d STDIO OPENSSL:localhost:8443,openssl-commonname=ganeti.example.com,verify=1,key=key.pem,cert=cert.pem,cafile=cert.pem

... which actually works, which means the `openssl-commonname` argument actually works, and works properly. If it's changed, for example, to `ganeti.example.net`, the above fails with the aforementioned error message.

We fix this by doing a reverse name resolution on the provided IP address. Now, we don't *assume* it's an IP address: this code kicks in only if the impexpd is passed an actual IP address, but in my experience it seems to always be the case (which is probably a separate problem to fix).

This is rather brittle and assumes DNS will not lie, which is quite a stretch. In our environment, however, we have end-to-end DNSSEC so we can trust the DNS. And this beats hardcoding verify=0, which is the other workaround that can be done to fix this issue.

Closes: ganeti#1681
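For readers who want to see the shape of the workaround described in the commit message, here is a minimal Python sketch of "reverse-resolve only when the host is an IP address". It is not the actual patch: the function names `is_ip_address` and `name_for_verification`, the injectable `resolver` parameter, and the choice to fall back to the unchanged host when the lookup fails are all assumptions made for illustration.

```python
import ipaddress
import socket


def is_ip_address(host):
    """Return True if host parses as an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def name_for_verification(host, resolver=socket.gethostbyaddr):
    """Pick the name to compare against the peer certificate.

    Hypothetical helper: hostnames pass through untouched, and IP
    addresses are reverse-resolved. If the lookup fails, the original
    value is returned (an assumption, not the behavior of the patch).
    """
    if not is_ip_address(host):
        return host
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        name, _aliases, _addresses = resolver(host)
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError
        return host
    return name
```

As the commit message says, this is only as trustworthy as the DNS answering the reverse lookup, so the result deserves the same caution as any other unauthenticated DNS data.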
okay, I got this to work! i had to do some pretty nasty stuff like resolving the IP address given to impexpd as I can't figure out why or where the IP is passed instead of the hostname. but it works for me, and i figured it was worth sharing. phew! |
Wow, good work - thank you! I was actually on vacation during the last week and not able to do any testing myself (or respond earlier). I will take a look at your PR now! //Edit: OK, I'll respond here, to not mess up the discussion flow :-) The temporary certificate with the instance's primary node name in it seems to be requested here: Line 83 in 114e59f
We could also try and work around this issue and have the certificate created for the IP address instead of the name. The "heavy lifting" (creating a certificate) is done by this function: Lines 254 to 283 in 114e59f
It is called by the following backend function, which in turn is part of the Lines 5102 to 5132 in 114e59f
Simply extending We could also modify So we are down to "replace the crypto/x509 library with possible side-effects" or "still rely on DNS, just at a different stage". What would you say @anarcat? |
i'm not sure i can follow you all the way down that rabbit hole, @rbott, but it seems to make sense. i think that, short term, the latter seems to be sane, but the former is probably a requirement in the long term anyway... that said, i'd personally like to see the impexpd get the hostname instead of the IP address; i don't get why we pass it the IP address (or where!). it's what i identify as the root cause of the problem. i was just too tired to walk back up the stack (and i was confused by the API layer i couldn't walk up from) to figure out how to fix that.... but i think passing the hostname instead of the IP address would fix the cert issue neatly without having to change anything in the cert generation. i think we could even revert the is that something we could consider here? also note that I'll be using my three patches in production tomorrow, i am not sure i will have much more time to fight this one problem, as we're already late in this project and i'm living a bit on borrowed time here. :) but if you have patches to test, that's something i could probably do... thanks! |
Actually I was a bit wrong here. It is possible with the current OpenSSL implementation to create a certificate with additional SAN entries. I will look into that (having both the IP and name in the cert surely won't do any harm here).
You are right, I totally forgot to check that path as well. I will look into this and post my findings!
I am quite busy right now and will try to allocate some time for this issue soon. Along with updating the move-instance documentation :-) |
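To make the "both the IP and the name in the cert" idea concrete, here is a small Python sketch that builds a `subjectAltName` value and the matching `openssl req` argument list, modelled on the test command used earlier in this thread. It is only an illustration: Ganeti generates its certificates in its own Python code rather than by calling the `openssl` CLI, the helper names `san_extension` and `openssl_req_args` are hypothetical, and the `-addext` option assumes OpenSSL 1.1.1 or newer.

```python
import ipaddress


def san_extension(names):
    """Build a subjectAltName value from a mix of hostnames and IPs."""
    entries = []
    for name in names:
        try:
            ipaddress.ip_address(name)
        except ValueError:
            # Not an IP literal, so treat it as a DNS name
            entries.append("DNS:" + name)
        else:
            entries.append("IP:" + name)
    return "subjectAltName=" + ",".join(entries)


def openssl_req_args(common_name, names, key_path="key.pem",
                     cert_path="cert.pem"):
    """Return an argv list for a short-lived self-signed certificate.

    Hypothetical helper mirroring the manual test in this thread, with
    one extra -addext argument carrying the SAN entries.
    """
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:4096",
        "-keyout", key_path, "-out", cert_path,
        "-sha256", "-nodes",
        "-subj", "/CN=" + common_name,
        "-days", "1",
        "-addext", san_extension(names),
    ]
```

The list can be handed to `subprocess.run` for a manual experiment; whether socat then accepts a connection made by IP address is something to verify against the socat version in use.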
We have been successfully using the move-instance Script to move instances from older clusters (e.g. based on Debian Stretch) to newer Clusters (based on Debian Bullseye / Ganeti 3.0.2). However, we can not move Instances between Debian Bullseye servers.

This happens because `socat` is configured to verify the TLS certificate presented by the destination node: ganeti/lib/impexpd/__init__.py, Line 91 in da6aba3

However, with recent `socat` versions verification also includes matching the hostname to the certificate CN/SAN entries. For the connection, the destination node's ip address is used, but the cluster certificate always contains `ganeti.example.com`. This is hardcoded in the constants: ganeti/src/Ganeti/Constants.hs, Line 605 in da6aba3
I see multiple solutions to this problem:

- `verify` switch configurable and leave it up to the user (with `verify=1` being always broken)
- `verify=0` to at least allow people to migrate instances again

What would you suggest?
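As a rough illustration of what a configurable `verify` switch could look like, the following Python sketch assembles a socat `OPENSSL:` client address from the same options used in the manual test in this thread (`verify`, `key`, `cert`, `cafile`, `openssl-commonname`). The function name `socat_openssl_address` and its parameters are hypothetical, not how impexpd builds its command line, and the sketch only handles hostnames and IPv4 addresses.

```python
def socat_openssl_address(host, port, common_name, verify=True,
                          key="key.pem", cert="cert.pem",
                          cafile="cert.pem"):
    """Build a socat OPENSSL client address string.

    Hypothetical helper: with verify=True the expected certificate
    name is passed via openssl-commonname, so connecting by IP address
    still checks the certificate against the node name.
    """
    options = [
        "verify=%d" % (1 if verify else 0),
        "key=" + key,
        "cert=" + cert,
        "cafile=" + cafile,
    ]
    if verify:
        # Only meaningful when the peer certificate is actually checked
        options.append("openssl-commonname=" + common_name)
    return "OPENSSL:%s:%d,%s" % (host, port, ",".join(options))
```

Calling it with `verify=False` corresponds to the `verify=0` escape hatch mentioned in the list above, which gets transfers working again at the cost of not authenticating the peer.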