Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

Open
wants to merge 1 commit into
base: 3.x
Choose a base branch
from
Open

MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

wants to merge 1 commit into from

Conversation

sysprg
Copy link

@sysprg sysprg commented Oct 30, 2018

This patch fixes several bugs in Galera that lead to sporadic
failures when testing MariaDB server with the mtr due to false
error messages (or warnings) that are not related to the new
fixes.

The patch includes a some changes taken from the latest
versions of Galera from Codership, as well as some changes
taken from PXC version of Galera, as well as some corrections
made by me.

  1. Some mtr tests sometimes fails due to
    "[Warning] WSREP: gcs_caused() returned -1" warnings:

Currently gcs_.caused() function works only when the group
is primary, and fails if the group is non-primary or even if
the group in a transient state (during configuration changes).

Instead of failing immediately, this patch changes gcs_.caused()
to return EAGAIN error code when function was called while
group in a transient state. On receiving EAGAIN error code
ReplicatorSMM::causal_read() retries to obtain a new seqno
(by calling gcs_.caused() again).

  1. Some mtr tests sometimes fails due to
    "[Warning] WSREP: Failed to report last committed "
    warnings:

This is because when processing cluster configuration changes,
the GCS layer does not always timely update the group->last_applied
variable.

To correct this error, I added an additional call to the
group_redo_last_applied() function. In addition, to protect
against other similar situations, I added a cycle to re-call
gcs_.set_last_applied() in case of failure due to interruption
of internal operations in the current Galera implementation.

  1. Some mtr test sometimes fails when node is evicted from
    the cluster in middle of SST.

Even when node evicted, the SST script may completes normally.
After this, the node calls the gcs_join() function and tries
to join the cluster. However, this is impossible, because the
node is already evicted. Therefore, the _join() function
(which called from gcs_join) fails. Then node does IST
(which also fails), after/during which it is aborted.

To fix this, we should avoid joining the cluster through
gcs_join function if node is evicted. To do this, we should
check the current connection state in the gcs_join() function
to return from it immediately if the node's communication
channel was closed.

  1. If SST fails due to a network error, the node that acted
    as a donor sometimes does not return to its original state,
    which leads to failure due to the inability to continue
    the test execution (due to a timeout).

If sst_sent() fails node should restore itself back to joined
state. The sst_sent function can fail. commonly due to network
errors, where DONOR may lose connectivity to JOINER (or existing
cluster). But on re-join it should restore the original state
without waiting for transition to JOINER state. SST failure
on JOINER will gracefully shutdown the joiner.

https://jira.mariadb.org/browse/MDEV-17565

This patch fixes several bugs in Galera that lead to sporadic
failures when testing MariaDB server with the mtr due to false
error messages (or warnings) that are not related to the new
fixes.

The patch includes a some changes taken from the latest
versions of Galera from Codership, as well as some changes
taken from PXC version of Galera, as well as some corrections
made by me.

1) Some mtr tests sometimes fails due to
"[Warning] WSREP: gcs_caused() returned -1" warnings:

Currently gcs_.caused() function works only when the group
is primary, and fails if the group is non-primary or even if
the group in a transient state (during configuration changes).

Instead of failing immediately, this patch changes gcs_.caused()
to return EAGAIN error code when function was called while
group in a transient state. On receiving EAGAIN error code
ReplicatorSMM::causal_read() retries to obtain a new seqno
(by calling gcs_.caused() again).

2) Some mtr tests sometimes fails due to
"[Warning] WSREP: Failed to report last committed <number>"
warnings:

This is because when processing cluster configuration changes,
the GCS layer does not always timely update the group->last_applied
variable.

To correct this error, I added an additional call to the
group_redo_last_applied() function. In addition, to protect
against other similar situations, I added a cycle to re-call
gcs_.set_last_applied() in case of failure due to interruption
of internal operations in the current Galera implementation.

3) Some mtr test sometimes fails when node is evicted from
the cluster in middle of SST.

Even when node evicted, the SST script may completes normally.
After this, the node calls the gcs_join() function and tries
to join the cluster. However, this is impossible, because the
node is already evicted. Therefore, the _join() function
(which called from gcs_join) fails. Then node does IST
(which also fails), after/during which it is aborted.

To fix this, we should avoid joining the cluster through
gcs_join function if node is evicted. To do this, we should
check the current connection state in the gcs_join() function
to return from it immediately if the node's communication
channel was closed.

4) If SST fails due to a network error, the node that acted
as a donor sometimes does not return to its original state,
which leads to failure due to the inability to continue
the test execution (due to a timeout).

If sst_sent() fails node should restore itself back to joined
state. The sst_sent function can fail. commonly due to network
errors, where DONOR may lose connectivity to JOINER (or existing
cluster). But on re-join it should restore the original state
without waiting for transition to JOINER state. SST failure
on JOINER will gracefully shutdown the joiner.

https://jira.mariadb.org/browse/MDEV-17565
@janlindstrom janlindstrom self-assigned this Dec 4, 2019
Copy link

@janlindstrom janlindstrom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address my mostly trivial review commends and wait for review from Alexey.

act->type = GCS_ACT_COMMIT_CUT;

int const buf_len(sizeof(commit_cut));
void* const buf(malloc(buf_len));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me is this memory allocation freed automatically?

@@ -255,7 +255,7 @@ group_check_donor (gcs_group_t* group)
gu_warn ("Donor %s is no longer in the group. State transfer cannot "
"be completed, need to abort. Aborting...", donor_id);

gu_abort();
// gu_abort();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why leave commented code ?

@@ -311,7 +333,10 @@ namespace galera
return ret;
}

gcs_seqno_t caused() { return global_seqno_; }
void caused(gcs_seqno_t& seqno, gu::datetime::Date& wait_until)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope recent compiler does not produce a error on unused parameter, have you tried e.g. gcc 9 ?

@@ -1420,6 +1452,7 @@ galera::ReplicatorSMM::process_conf_change(void* recv_ctx,
assert(app_req_len <= 0);
log_fatal << "View callback failed. This is unrecoverable, "
<< "restart required.";
local_monitor_.leave(lo);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment why this is necessary.

@@ -1428,14 +1461,13 @@ galera::ReplicatorSMM::process_conf_change(void* recv_ctx,
log_fatal << "Local state UUID " << state_uuid_
<< " is different from group state UUID " << group_uuid
<< ", and SST request is null: restart required.";
local_monitor_.leave(lo);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also comment.

@@ -1528,6 +1560,7 @@ galera::ReplicatorSMM::process_conf_change(void* recv_ctx,
{
log_fatal << "Internal error: unexpected next state for "
<< "non-prim: " << next_state << ". Restart required.";
local_monitor_.leave(lo);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment also here.

libgcs_env.Append(CPPFLAGS = ' -Wno-missing-field-initializers')
libgcs_env.Append(CPPFLAGS = ' -Wno-effc++')
libgcs_env.Append(CCFLAGS = ' -Wno-missing-field-initializers')
libgcs_env.Append(CCFLAGS = ' -Wno-variadic-macros')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is these changes really needed and why ?

@@ -985,8 +999,6 @@ wsrep_status_t galera::ReplicatorSMM::causal_read(wsrep_gtid_t* gtid)
// at monitor drain and disallowing further waits until
// configuration change related operations (SST etc) have been
// finished.
gu::datetime::Date wait_until(gu::datetime::Date::calendar()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this as it is used below ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants