Fix: libpe_status: Don't fence a remote node due to failed migrate_from #3458

Draft: wants to merge 8 commits into base: main

nrwahl2 (Contributor) commented May 13, 2024

@clumens @kgaillot This is just a demo for the unnecessary fencing part of T214 / RHEL-23399. I'm not requesting to merge this until we figure out the rest of the failure response/cluster-recheck-interval behavior.

The result of this patch is as follows. Scenario:

  1. ocf:pacemaker:remote resource with reconnect_interval=30s, cluster-recheck-interval=2min, and fencing configured.
  2. Remote connection resource prefers to run on cluster node 2 (location constraint) and is running there.
  3. Put node 2 in standby; remote connection resource migrates to cluster node 1.
  4. Block 3121/tcp on node 2.
  5. Take node 2 out of standby.
  6. Remote connection resource tries to migrate back to node 2. This times out due to the firewall block.

Before patch:

  • The remote node is fenced.
  • The remote connection resource is stopped on node 1 and node 2 due to the multiple-active policy.
  • After fencing, Pacemaker does not attempt to start the remote connection resource until the cluster-recheck-interval expires (or until a new transition is initiated for another reason). Simply finishing fencing does not cause a new transition to run, which would cause the connection resource to try to start. (At least if reconnect_interval has passed.)

After patch:

  • The remote node is not fenced.
  • The remote connection resource immediately tries to recover on node 2 (where it just failed a migrate_from, since start-failure-is-fatal doesn't apply to migrate_from). This entails stopping on both nodes (due to multiple-active policy) and then trying to start on node 2. This will fail due to firewall block.
  • The resource recovers onto node 1 successfully.
  • After reconnect_interval expires, the resource tries to migrate back to node 2 again, which will fail due to the firewall block. This will continue happening every reconnect_interval until migration-threshold is reached.

Fixes T214

nrwahl2 (Contributor, Author) commented May 13, 2024

Marking ready for review. This might be sufficient to fix the cluster-recheck-interval behavior too (rather than just masking it)... Since we no longer set pcmk_on_fail_reset_remote, we also don't set the role-after-failure to stopped anymore. We can recover right away instead of waiting for a later transition.

enum rsc_role_e
pcmk__role_after_failure(const pcmk_resource_t *rsc, const char *action_name,
                         enum action_fail_response on_fail, GHashTable *meta)
{
    ...
    // Set default for role after failure specially in certain circumstances
    switch (on_fail) {
        ...
        case pcmk_on_fail_reset_remote:
            if (rsc->remote_reconnect_ms != 0) {
                role = pcmk_role_stopped;
            }
            break;

If this isn't a viable solution (or close to it) as-is, @clumens or anyone else can feel free to take it and run with it themselves. I took a crack at it since I've been talking to Chris about it a lot last week.


Okay, there is at least one big wrinkle in this... if a resource is running on the remote node when the connection resource migrate_from fails, then we still fence the remote node, which results in the remote connection resource being stopped due to node availability until the timer pops :(

May 12 23:07:45.302 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Fence (reboot) fastvm-fedora39-23 'dummy is thought to be active there'
May 12 23:07:45.303 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Recover    fastvm-fedora39-23     (                       fastvm-fedora39-24 )
May 12 23:07:45.303 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Move       dummy                  ( fastvm-fedora39-23 -> fastvm-fedora39-22 )
...
May 12 23:07:48.230 fastvm-fedora39-22 pacemaker-fenced    [8719] (finalize_op)     notice: Operation 'reboot' targeting fastvm-fedora39-23 by fastvm-fedora39-22 for pacemaker-controld.8723@fastvm-fedora39-22: OK (complete) | id=b8a4cae2
...
# Transition abort due to connection resource monitor failure,
# presumably due to remote node fenced
May 12 23:07:48.257 fastvm-fedora39-22 pacemaker-controld  [8723] (abort_transition_graph)  info: Transition 8 aborted by status-1-fail-count-fastvm-fedora39-23.monitor_60000 doing create fail-count-fastvm-fedora39-23#monitor_60000=1: Transient attribute change | cib=0.115.19 source=abort_unless_down:305 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'] complete=true
...
May 12 23:07:48.262 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Stop       fastvm-fedora39-23     (                       fastvm-fedora39-24 )  due to node availability
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Stop       fastvm-fedora39-23     (                       fastvm-fedora39-22 )  due to node availability
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Start      dummy                  (                       fastvm-fedora39-24 )
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (pcmk__log_transition_summary)    error: Calculated transition 9 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-26.bz2

This is because the remote node is considered offline after the migration failure.

(determine_remote_online_status)        trace: Remote node fastvm-fedora39-23 presumed ONLINE because connection resource is started
(determine_remote_online_status)        trace: Remote node fastvm-fedora39-23 OFFLINE because connection resource failed
(determine_remote_online_status)        trace: Remote node fastvm-fedora39-23 online=FALSE

I don't think we want to skip setting the pcmk_rsc_failed flag for the connection resource, so we'd need to somehow detect that the failure was for migrate_from and avoid marking the remote node offline in that case. Spitballing here, we could maybe overload partial_migration_{source,target} so that we have them even after a failed migration... Or add a new flag like failed_migrate_from. (migrate_to may not warrant this treatment.) Or maybe it's still more complicated.

A failed migrate_from is somewhere between a partial migration and a dangling migration. No stop has been run on the source. The migrate_from action completed (so not partial) but failed (so not dangling).
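
To make the distinction concrete, here is a minimal sketch that classifies a migration from the results of its three steps (migrate_to, migrate_from, stop on source). It is a toy model for discussion only: the types and the classify_migration() function are hypothetical and are not Pacemaker's data structures or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one step in the migration sequence */
enum step_result { STEP_NOT_RUN, STEP_OK, STEP_FAILED };

/* Hypothetical summary of a resource's migration history */
struct migration_history {
    enum step_result migrate_to;     /* runs on the source node */
    enum step_result migrate_from;   /* runs on the target node */
    enum step_result stop_on_source; /* final step of a live migration */
};

enum migration_state {
    MIGRATION_NONE,
    MIGRATION_FAILED_TO,
    MIGRATION_PARTIAL,      /* migrate_to done, migrate_from not yet run */
    MIGRATION_FAILED_FROM,  /* migrate_from completed but failed */
    MIGRATION_DANGLING,     /* migrate_from OK, source not yet stopped */
    MIGRATION_COMPLETE,
};

enum migration_state
classify_migration(const struct migration_history *h)
{
    assert(h != NULL);

    if (h->migrate_to == STEP_NOT_RUN) {
        return MIGRATION_NONE;
    }
    if (h->migrate_to == STEP_FAILED) {
        return MIGRATION_FAILED_TO;
    }
    switch (h->migrate_from) {
        case STEP_NOT_RUN:
            return MIGRATION_PARTIAL;
        case STEP_FAILED:
            return MIGRATION_FAILED_FROM;
        case STEP_OK:
            break;
    }
    /* In this toy model, anything short of a successful stop on the
     * source counts as dangling.
     */
    return (h->stop_on_source == STEP_OK)? MIGRATION_COMPLETE
                                         : MIGRATION_DANGLING;
}
```

In this model, a failed migrate_from gets its own state instead of being folded into either the partial or the dangling case, which is roughly what a dedicated failed_migrate_from flag would express.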

nrwahl2 marked this pull request as ready for review May 13, 2024 05:38
@@ -1120,6 +1120,8 @@ TESTS = [
"Make sure partial migrations are handled before ops on the remote node" ],
[ "remote-partial-migrate2",
"Make sure partial migration target is prefered for remote connection" ],
[ "remote-partial-migrate3",
nrwahl2 (Contributor, Author):

realizing this doesn't qualify as a partial migration, which is really more of an "in-progress migration"

          */
         if (rsc->is_remote_node
             && pcmk__is_remote_node(pcmk_find_node(rsc->cluster, rsc->id))
             && !pcmk_is_probe(action_name, interval_ms)
-            && !pcmk__str_eq(action_name, PCMK_ACTION_START, pcmk__str_none)) {
+            && !pcmk__str_any_of(action_name, PCMK_ACTION_START,
+                                 PCMK_ACTION_MIGRATE_FROM, NULL)) {
nrwahl2 (Contributor, Author) commented May 13, 2024

Maybe even more permissive like MIGRATE_TO, undecided right now

Edit: Yeah, thinking we should treat migrate_to and migrate_from the same for failure handling purposes for remote connection resources, and just call it a failed migration. It would be hard for migrate_to to fail at all unless something wider is seriously wrong.
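
A minimal sketch of that idea, assuming migrate_to is exempted alongside start and migrate_from. It also models the effect described earlier: without the reset-remote response, the role after failure is no longer forced to stopped. All enum and function names here are hypothetical stand-ins, not the real libpe_status symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the scheduler's enums */
enum fail_response { FAIL_DEFAULT, FAIL_RESET_REMOTE };
enum role_override { ROLE_DEFAULT, ROLE_STOPPED };

/* Choose the failure response for a failed action of a remote connection
 * resource. Probes, start, and both migration actions are exempt from
 * reset-remote (exempting migrate_to is the proposal, not the current diff).
 */
enum fail_response
connection_fail_response(const char *action, bool is_probe)
{
    static const char *const exempt[] = {
        "start", "migrate_to", "migrate_from",
    };

    assert(action != NULL);

    if (is_probe) {
        return FAIL_DEFAULT;
    }
    for (size_t i = 0; i < sizeof(exempt) / sizeof(exempt[0]); i++) {
        if (strcmp(action, exempt[i]) == 0) {
            return FAIL_DEFAULT;
        }
    }
    return FAIL_RESET_REMOTE;
}

/* Mirror of the quoted switch case: only reset-remote with a nonzero
 * reconnect interval forces the role after failure to stopped.
 */
enum role_override
role_after_failure(enum fail_response on_fail, unsigned int reconnect_ms)
{
    if ((on_fail == FAIL_RESET_REMOTE) && (reconnect_ms != 0)) {
        return ROLE_STOPPED;
    }
    return ROLE_DEFAULT;
}
```

With reconnect_interval=30s, a failed recurring monitor still yields ROLE_STOPPED in this model, while a failed migrate_from or migrate_to does not, so recovery is not deferred to a later transition.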

nrwahl2 marked this pull request as draft May 13, 2024 08:39
The test results are currently wrong.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

This also prevents the resource from remaining stopped until the
cluster-recheck-interval expires. That's because we no longer set
pcmk_on_fail_reset_remote after a migrate_from failure, so
pcmk__role_after_failure() no longer returns pcmk_role_stopped.

Ref T214

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Ref T214

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Deprecated, for internal use only.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

…grate

Previously, if the remote node's connection resource failed to migrate,
the remote node was considered offline. If a resource was running on the
remote node, the remote node would be fenced.

This doesn't work yet.

Fixes T214

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

This still has problems. The newly added tests work as expected, but
existing tests break.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

nrwahl2 (Contributor, Author) commented May 16, 2024

Still not working. Latest push rebases on main, adds another test, and adds some experimental commits on top.

nrwahl2 (Contributor, Author) commented May 17, 2024

I'm thinking the high-level behavior should be:

  • Only stop failures and recurring monitor failures for the connection resource should cause the remote node to be fenced.
  • Only recurring monitor failures of the connection resource should change the online state of the remote node (though it's fine if some others set it explicitly). Currently any failure sets it to offline.
    • start: offline -> offline
    • stop: online -> online
    • probe: X -> X. No info; the probe could be run on a node that can't connect to the remote node, while the resource is started or the probe succeeds on another node.
    • reload: X -> X (it's a no-op so doesn't really matter)
    • migrate_to/migrate_from: online -> online. (The migration sequence is migrate_to -> migrate_from -> stop on source, so we haven't stopped the resource yet at this point.)
    • recurring monitor: online -> offline

That would still leave some decisions about exactly how and where in the call chain to implement the online-state patch. The possibility of multiple failed actions in the history, occurring in various orders, gives me a headache.
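
As a rough illustration of the proposed rules (not of where they would live in the call chain), the sketch below encodes the per-action table and folds it over a failure history in order. The enum and function names are hypothetical; the real scheduler tracks this state very differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for failed connection-resource actions */
enum conn_action {
    ACT_START,
    ACT_STOP,
    ACT_PROBE,
    ACT_RELOAD,
    ACT_MIGRATE_TO,
    ACT_MIGRATE_FROM,
    ACT_RECURRING_MONITOR,
};

/* Proposed rule: only stop and recurring monitor failures fence the node */
bool
failure_requires_fencing(enum conn_action failed)
{
    return (failed == ACT_STOP) || (failed == ACT_RECURRING_MONITOR);
}

/* Proposed rule: only a recurring monitor failure changes the online state;
 * every other failure leaves it as it was.
 */
bool
online_after_failure(enum conn_action failed, bool online_before)
{
    return (failed == ACT_RECURRING_MONITOR)? false : online_before;
}

/* Apply a sequence of failures, oldest first, to an initial online state */
bool
online_after_history(const enum conn_action *failures, size_t n, bool online)
{
    assert((failures != NULL) || (n == 0));

    for (size_t i = 0; i < n; i++) {
        online = online_after_failure(failures[i], online);
    }
    return online;
}
```

Under these rules, a history of a failed migrate_from followed by a failed start leaves an online remote node online and triggers no fencing. This toy fold ignores successful actions in between, which is part of what makes the real ordering problem harder.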
