Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for broken_promise exception observed on connection shutdown #106

Conversation

graphcareful
Copy link

@graphcareful graphcareful commented Apr 12, 2024

REVIEWERS: Please review #115 instead

A broken_promise exception was observed when calling close() due to the fact that the input stream was called while it was waiting on a pending operation to complete.

This occurred on call to close because of how do_shutdown() is implemented. do_shutdown() acquires the output semaphore, but not the input, then attempts to wait_for_input(). If there is already a waiter on the read path, then the broken_promise exception will be observed when do_shutdown() calls do_get().

The fix is to simply modify do_shutdown() to call wait_for_eof() instead of wait_for_input(),however this would be incomplete because there are other instances in the implementation where out of band operations on the read / write path come from opposite paths.

To resolve the above concern new methods have been developed (maybe_wait_for_data and maybe_send_data) which attempt to acquire the in/out semaphore before performing the desired operation. If the semaphore is already acquired, then the method just returns to the caller, which would lead to a retry until the operation can be performed.

Fixes: CORE-539

@graphcareful
Copy link
Author

Changes in force-push

  • Split formatting commit out

@graphcareful
Copy link
Author

Changes in force-push

  • Removed accidental duplicate call wrapping do_get() in context of semaphore
  • Slightly reworked logic in wait_for_eof() , previously this depended on waiting on _in_sem a deadlock was detected in a unit test. Its been since refactored to use wait_for_input() instead

src/net/ossl.cc Outdated Show resolved Hide resolved
src/net/ossl.cc Outdated Show resolved Hide resolved
@graphcareful
Copy link
Author

Changes in force-push

  • Modified implementation to not use 2 additional semaphores and instead retry if semaphore is already taken , otherwise take lease and perform work.
  • Ensure read lock is taken within wait_for_eof

Copy link

@michael-redpanda michael-redpanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one clean up suggestion

src/net/ossl.cc Outdated
Comment on lines 1211 to 1215
static const auto max_len = 4096;
char buffer[max_len];
memset(buffer, 0, max_len);
BIO_read(out.get(), buffer, max_len - 1);
return sstring(buffer);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static const auto max_len = 4096;
char buffer[max_len];
memset(buffer, 0, max_len);
BIO_read(out.get(), buffer, max_len - 1);
return sstring(buffer);
char * bio_ptr = nullptr;
auto len = BIO_get_mem_data(out.get(), &bio_ptr);
return sstring(bio_ptr, len);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

   char * bio_ptr = nullptr;
   auto len = BIO_get_mem_data(out.get(), &bio_ptr);
   return sstring(bio_ptr, len);

BIO_get_mem_data() returns the total number of bytes available on success, 0 if b is NULL, or a negative value in case of other errors.

Looks like the error condition should be checked?

src/net/ossl.cc Outdated
return ossl_str;
static std::optional<sstring> get_dn_string(X509_NAME* name) {
auto out = bio_ptr(BIO_new(BIO_s_mem()));
if (!X509_NAME_print_ex(out.get(), name, 0, ASN1_STRFLGS_RFC2253 | XN_FLAG_SEP_COMMA_PLUS |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X509_NAME_print_ex() and X509_NAME_print_ex_fp() return 1 on success or 0 on error if the XN_FLAG_COMPAT is set, which is the same as X509_NAME_print(). Otherwise, it returns -1 on error or other values on success.

Should the check be -1 == X509_NAME_print_ex?

src/net/ossl.cc Outdated
Comment on lines 1211 to 1215
static const auto max_len = 4096;
char buffer[max_len];
memset(buffer, 0, max_len);
BIO_read(out.get(), buffer, max_len - 1);
return sstring(buffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

   char * bio_ptr = nullptr;
   auto len = BIO_get_mem_data(out.get(), &bio_ptr);
   return sstring(bio_ptr, len);

BIO_get_mem_data() returns the total number of bytes available on success, 0 if b is NULL, or a negative value in case of other errors.

Looks like the error condition should be checked?

@@ -1444,6 +1444,10 @@ SEASTAR_THREAD_TEST_CASE(test_dn_name_handling) {
fout.get();

auto dn = fdn.get();
BOOST_REQUIRE(dn.has_value());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests for the failure modes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, i believe these unit tests could be enhanced a bit more

- Now that code has been moved around, the session class' methods need
to be known to consumers for example tls_connected_socket_impl.

- Solution is to create a new interface, called session_impl that
tls::session inherits from. session_ref now stores a pointer to this
interface instead of the concrete instance.

- This will be useful for defining other session types in future commits

(cherry picked from commit 0ffb0fc)
- Now that code has been moved around the private _impl within the
certificate credentials class cannot be accessed externally due to the
fact that impl is forward defined in tls.hh

- Solution is to create new method on the certificate_credentials class
that will forward the call to the underlying impl

(cherry picked from commit d44bf62)
- Also disables websockets

(cherry picked from commit eb2cfd5)
- The implementation of the struct isn't neccessary to get up and
running with a simple server. However the methods must have
implementations or the code won't compile.

(cherry picked from commit 173b1a6)
- Caching the x509 certs as they are in the process of being verified
is neccessary because openssl does not provide an API to get
certificates that have failed verification.

- Examining failed certificates for the reason they failed will be
implemented in future commits.

(cherry picked from commit 5e14b14)
- Differences in dn name handling will cause issues in projects that
expected a subject or issuer to be in a particular format. This is the
case for features like mtls principal mapping within redpanda.

- The method X509_NAME_oneline printed a nicely human readable version of
the distinguished name parameters however this differed from what GnuTLS
would print.

- The change swaps out the use of X509_NAME_oneline in favor of
X509_NAME_print_ex which accepts an integer for which can be used to
control how the distinguished name is formatted.

(cherry picked from commit c09cd25)
@graphcareful graphcareful changed the base branch from feature-ossl to rebase-feature-ossl May 6, 2024 14:19
@graphcareful graphcareful changed the base branch from rebase-feature-ossl to feature-ossl May 6, 2024 14:20
@graphcareful graphcareful changed the base branch from feature-ossl to rebase-feature-ossl May 6, 2024 14:20
@graphcareful
Copy link
Author

Note to reviewers, base has been modified to rebase-feature-ossl which is the feature-ossl branch with the seastar 24.2.x changes rebased onto it.

- x509_verify was never the correct function to verify the signature of
a private key, replacing with x509_check_private_key instead.

Fixes: CORE-2025
- Flush() aquires the _out_sem and so does do_shutdown, a deadlock was
observed due to this fact. Flush would wait for shutdown to complete,
which was being held by wait_for_eof. After a 10s timeout, eof would
manually be set and the stall would be resolved.

- The fix is to not call wait_for_eof within the context of the
_out_sem.
- With this behavior redpanda will not log disconnects as ERRORs during
shutdown
- This line will cause redpanda to crash as its currently set to pass
GnuTLS specific priority strings.
src/net/ossl.cc Outdated
@@ -824,7 +850,7 @@ class session : public enable_shared_from_this<session>, public session_impl {
f = pull_encrypted_and_send();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this by maybe_send_data()?

@@ -1204,7 +1227,8 @@ class session : public enable_shared_from_this<session>, public session_impl {
future<> server_handshake() {
return wait_for_input().then([this]{
if (eof()) {
return make_exception_future<>(std::runtime_error("EOF observed during handshake"));
_error = std::make_exception_ptr(std::system_error(ENOTCONN, std::system_category()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be an existing error that should be preferred/not overwritten?

}

future<> client_handshake() {
return do_handshake(SSL_connect, [this]{
return pull_encrypted_and_send().then([this]{
return wait_for_input().then([this]{
if (eof()) {
return make_exception_future<>(std::runtime_error("EOF observed during handshake"));
_error = std::make_exception_ptr(std::system_error(ENOTCONN, std::system_category()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be an existing error that should be preferred/not overwritten?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a test?

The commit message talks about Redpanda, but I think it would be more appropriate to discuss this in terms of compatibility with existing behaviour in GnuTLS.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message talks about Redpanda, but I think it would be more appropriate to discuss this in terms of compatibility with existing behaviour in GnuTLS.

In this situation, GnuTLS will return an error code which turns into a system_error which Redpanda does parse to see if it's a reconnect type error.

if (!ext) {
return {};
}
auto names = general_names_ptr((GENERAL_NAMES*)X509V3_EXT_d2i(ext.get()));
auto names = general_names_ptr((GENERAL_NAMES*)X509V3_EXT_d2i(ext));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer static_cast.

Comment on lines +134 to +135
seastar_set_dep_args (OpenSSL REQUIRED
VERSION 3.0.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this version correct? Is it an exact match? A minimum?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a minimum.

src/net/ossl.cc Outdated
Comment on lines 851 to 853
if (!_options.wait_for_eof_on_shutdown) {
return make_ready_future<>();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a note about why this is "sidestepped".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches original behavior in the GnuTLS implementation

@@ -315,11 +315,9 @@ class tls::certificate_credentials::impl {
if (!pkey) {
throw ossl_error("Error attempting to parse private key");
}
#if 0 // https://github.com/redpanda-data/core-internal/issues/1233
if (!X509_verify(x509_cert.get(), pkey.get())) {
if (!X509_check_private_key(x509_cert.get(), pkey.get())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a test be added?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seastar's TLS unit tests do verify the coherency of the key and cert

@@ -828,7 +828,7 @@ class session : public enable_shared_from_this<session>, public session_impl {
}

future<> do_shutdown() {
if(_error || !connected() || eof()) {
if(_error || !connected()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a test be added?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of shutdown unit tests in Seastar that verify the shutdown behavior

@michael-redpanda michael-redpanda force-pushed the rebase-feature-ossl branch 2 times, most recently from 58f6053 to 83fae91 Compare May 17, 2024 17:07
@michael-redpanda michael-redpanda marked this pull request as draft May 17, 2024 18:49
@michael-redpanda
Copy link

Leaving this in draft, but will close. Please review #115

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants