Fix: libpe_status: Don't fence a remote node due to failed migrate_from #3458

nrwahl2 · 2024-05-13T04:41:52Z

@clumens @kgaillot This is just a demo for the unnecessary fencing part of T214 / RHEL-23399. I'm not requesting to merge this until we figure out the rest of the failure response/cluster-recheck-interval behavior.

The result of this patch is as follows. Scenario:

ocf:pacemaker:remote resource with reconnect_interval=30s, cluster-recheck-interval=2min, and fencing configured.
Remote connection resource prefers to run on cluster node 2 (location constraint) and is running there.
Put node 2 in standby; remote connection resource migrates to cluster node 1.
Block 3121/tcp on node 2.
Take node 2 out of standby.
Remote connection resource tries to migrate back to node 2. This times out due to the firewall block.

Before patch:

The remote node is fenced.
The remote connection resource is stopped on node 1 and node 2 due to the multiple-active policy.
After fencing, Pacemaker does not attempt to start the remote connection resource until the cluster-recheck-interval expires (or until a new transition is initiated for another reason). Simply finishing fencing does not cause a new transition to run, which would cause the connection resource to try to start. (At least if reconnect_interval has passed.)

After patch:

The remote node is not fenced.
The remote connection resource immediately tries to recover on node 2 (where it just failed a migrate_from, since start-failure-is-fatal doesn't apply to migrate_from). This entails stopping on both nodes (due to multiple-active policy) and then trying to start on node 2. This will fail due to firewall block.
The resource recovers onto node 1 successfully.
After reconnect_interval expires, the resource tries to migrate back to node 2 again, which fails due to the firewall block. This repeats every reconnect_interval until migration-threshold is reached.

Fixes T214

nrwahl2 · 2024-05-13T05:38:56Z

Marking ready for review. This might be sufficient to fix the cluster-recheck-interval behavior too (rather than just masking it)... Since we no longer set pcmk_on_fail_reset_remote, we also don't set the role-after-failure to stopped anymore. We can recover right away instead of waiting for a later transition.

enum rsc_role_e
pcmk__role_after_failure(const pcmk_resource_t *rsc, const char *action_name,
                         enum action_fail_response on_fail, GHashTable *meta)
{
    ...
    // Set default for role after failure specially in certain circumstances
    switch (on_fail) {
        ...
        case pcmk_on_fail_reset_remote:
            if (rsc->remote_reconnect_ms != 0) {
                role = pcmk_role_stopped;
            }
            break;

If this isn't a viable solution (or close to it) as-is, @clumens or anyone else can feel free to take it and run with it themselves. I took a crack at it since I've been talking to Chris about it a lot last week.

Okay, there is at least one big wrinkle in this... if a resource is running on the remote node when the connection resource migrate_from fails, then we still fence the remote node, which results in the remote connection resource being stopped due to node availability until the timer pops :(

May 12 23:07:45.302 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Fence (reboot) fastvm-fedora39-23 'dummy is thought to be active there'
May 12 23:07:45.303 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Recover    fastvm-fedora39-23     (                       fastvm-fedora39-24 )
May 12 23:07:45.303 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Move       dummy                  ( fastvm-fedora39-23 -> fastvm-fedora39-22 )
...
May 12 23:07:48.230 fastvm-fedora39-22 pacemaker-fenced    [8719] (finalize_op)     notice: Operation 'reboot' targeting fastvm-fedora39-23 by fastvm-fedora39-22 for pacemaker-controld.8723@fastvm-fedora39-22: OK (complete) | id=b8a4cae2
...
# Transition abort due to connection resource monitor failure,
# presumably due to remote node fenced
May 12 23:07:48.257 fastvm-fedora39-22 pacemaker-controld  [8723] (abort_transition_graph)  info: Transition 8 aborted by status-1-fail-count-fastvm-fedora39-23.monitor_60000 doing create fail-count-fastvm-fedora39-23#monitor_60000=1: Transient attribute change | cib=0.115.19 source=abort_unless_down:305 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'] complete=true
...
May 12 23:07:48.262 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Stop       fastvm-fedora39-23     (                       fastvm-fedora39-24 )  due to node availability
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Stop       fastvm-fedora39-23     (                       fastvm-fedora39-22 )  due to node availability
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item)   notice: Actions: Start      dummy                  (                       fastvm-fedora39-24 )
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (pcmk__log_transition_summary)    error: Calculated transition 9 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-26.bz2

This is because the remote node is considered offline after the migration failure.

(determine_remote_online_status)        trace: Remote node fastvm-fedora39-23 presumed ONLINE because connection resource is started
(determine_remote_online_status)        trace: Remote node fastvm-fedora39-23 OFFLINE because connection resource failed
(determine_remote_online_status)        trace: Remote node fastvm-fedora39-23 online=FALSE

I don't think we want to skip setting the pcmk_rsc_failed flag for the connection resource, so we'd need to somehow detect that the failure was for migrate_from and avoid marking the remote node offline in that case. Spitballing here, we could maybe overload partial_migration_{source,target} so that we have them even after a failed migration... Or add a new flag like failed_migrate_from. (migrate_to may not warrant this treatment.) Or maybe it's still more complicated.

A failed migrate_from is somewhere between a partial migration and a dangling migration. No stop has been run on the source. The migrate_from action completed (so not partial) but failed (so not dangling).
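To make the distinction above concrete, here is a minimal sketch that classifies a migration by which of its three actions have been recorded. None of these names exist in Pacemaker; `op_result`, `migration_state`, and `classify_migration()` are hypothetical and only illustrate the terminology used in this comment.

```c
/* Hypothetical result of one action in the migration sequence
 * (migrate_to on source -> migrate_from on target -> stop on source)
 */
enum op_result { OP_NOT_RUN, OP_OK, OP_FAILED };

/* Hypothetical classification, for illustration only */
enum migration_state {
    MIGRATION_NONE,         // migrate_to has not run
    MIGRATION_FAILED_TO,    // migrate_to failed
    MIGRATION_PARTIAL,      // migrate_to succeeded, migrate_from not yet run
    MIGRATION_FAILED_FROM,  // migrate_from completed but failed
    MIGRATION_DANGLING,     // migrate_from succeeded, source not yet stopped
    MIGRATION_COMPLETE      // all three actions succeeded
};

static enum migration_state
classify_migration(enum op_result migrate_to, enum op_result migrate_from,
                   enum op_result stop_on_source)
{
    if (migrate_to == OP_NOT_RUN) {
        return MIGRATION_NONE;
    }
    if (migrate_to == OP_FAILED) {
        return MIGRATION_FAILED_TO;
    }
    if (migrate_from == OP_NOT_RUN) {
        return MIGRATION_PARTIAL;       // still in progress
    }
    if (migrate_from == OP_FAILED) {
        return MIGRATION_FAILED_FROM;   // the case discussed in this PR
    }
    if (stop_on_source != OP_OK) {
        return MIGRATION_DANGLING;
    }
    return MIGRATION_COMPLETE;
}
```

In the scenario from this PR, `classify_migration(OP_OK, OP_FAILED, OP_NOT_RUN)` lands in the `MIGRATION_FAILED_FROM` bucket, which neither the partial nor the dangling handling covers.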

nrwahl2 · 2024-05-13T08:02:38Z

cts/cts-scheduler.in

@@ -1120,6 +1120,8 @@ TESTS = [
          "Make sure partial migrations are handled before ops on the remote node" ],
        [ "remote-partial-migrate2",
          "Make sure partial migration target is prefered for remote connection" ],
+        [ "remote-partial-migrate3",


Realizing this doesn't qualify as a partial migration, which is really more of an "in-progress migration".

nrwahl2 · 2024-05-13T08:18:07Z

lib/pengine/pe_actions.c

     */
    if (rsc->is_remote_node
        && pcmk__is_remote_node(pcmk_find_node(rsc->cluster, rsc->id))
        && !pcmk_is_probe(action_name, interval_ms)
-        && !pcmk__str_eq(action_name, PCMK_ACTION_START, pcmk__str_none)) {
+        && !pcmk__str_any_of(action_name, PCMK_ACTION_START,
+                             PCMK_ACTION_MIGRATE_FROM, NULL)) {


Maybe even more permissive like MIGRATE_TO, undecided right now

Edit: Yeah, thinking we should treat migrate_to and migrate_from the same for failure handling purposes for remote connection resources, and just call it a failed migration. It would be hard for migrate_to to fail at all unless something wider is seriously wrong.
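As a standalone illustration of the condition after this change (including the migrate_to exemption from the edit above), here is a rough sketch using plain strings instead of the Pacemaker constants and helpers. `failure_resets_remote()` is a hypothetical name, not a libpe_status function, and it assumes the caller has already established that the resource is a remote connection resource.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not Pacemaker API: return true if a failed action on
 * a remote connection resource should get the "reset remote" failure
 * response (which leads to fencing the remote node).
 */
static bool
failure_resets_remote(const char *action_name, unsigned int interval_ms)
{
    // A probe is a monitor with interval 0; it is exempt
    if ((strcmp(action_name, "monitor") == 0) && (interval_ms == 0)) {
        return false;
    }

    // Start was already exempt; this PR adds the migration actions
    if ((strcmp(action_name, "start") == 0)
        || (strcmp(action_name, "migrate_to") == 0)
        || (strcmp(action_name, "migrate_from") == 0)) {
        return false;
    }

    // Everything else (for example stop or a recurring monitor) resets
    return true;
}
```

With this predicate, `failure_resets_remote("migrate_from", 0)` is false while `failure_resets_remote("monitor", 60000)` is still true.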

This also prevents the resource from remaining stopped until the cluster-recheck-interval expires. That's because we no longer set pcmk_on_fail_reset_remote after a migrate_from failure, so pcmk__role_after_failure() no longer returns pcmk_role_stopped. Ref T214 Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

…grate Previously, if the remote node's connection resource failed to migrate, the remote node was considered offline. If a resource was running on the remote node, the remote node would be fenced. This doesn't work yet. Fixes T214 Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

nrwahl2 · 2024-05-16T09:40:36Z

Still not working. Latest push rebases on main, adds another test, and adds some experimental commits on top.

nrwahl2 · 2024-05-17T22:33:32Z

I'm thinking the high-level behavior should be:

Only stop failures and recurring monitor failures for the connection resource should cause the remote node to be fenced.
Only recurring monitor failures of the connection resource should change the online state of the remote node (though it's fine if some others set it explicitly). Currently any failure sets it to offline.
- start: offline -> offline
- stop: online -> online
- probe: X -> X. No info; the probe could be run on a node that can't connect to the remote node, while the resource is started or the probe succeeds on another node.
- reload: X -> X (it's a no-op so doesn't really matter)
- migrate_to/migrate_from: online -> online. (The migration sequence is migrate_to -> migrate_from -> stop on source, so we haven't stopped the resource yet at this point.)
- recurring monitor: online -> offline
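The proposed behavior can be written down as a small lookup, which may help when deciding where in the call chain to apply it. This is only a sketch of the table above: `remote_state`, `failure_effect`, and `effect_of_failure()` are hypothetical names, and it handles a single failed action rather than a history of several.

```c
#include <stdbool.h>
#include <string.h>

enum remote_state { REMOTE_OFFLINE, REMOTE_ONLINE };

/* Hypothetical result type, for illustration only */
struct failure_effect {
    enum remote_state state;    // remote node's online state after the failure
    bool fence;                 // whether the failure should cause fencing
};

static struct failure_effect
effect_of_failure(const char *action_name, unsigned int interval_ms,
                  enum remote_state before)
{
    // Default: the failure carries no information about the online state
    struct failure_effect effect = { before, false };

    if (strcmp(action_name, "monitor") == 0) {
        if (interval_ms > 0) {
            // Recurring monitor: the only failure that marks the node offline
            effect.state = REMOTE_OFFLINE;
            effect.fence = true;
        }
        // interval 0 is a probe: X -> X, no fencing
    } else if (strcmp(action_name, "stop") == 0) {
        // Stop: state unchanged, but fencing is still warranted
        effect.fence = true;
    }
    // start, reload, migrate_to, migrate_from: unchanged, no fencing
    return effect;
}
```

For example, `effect_of_failure("migrate_from", 0, REMOTE_ONLINE)` leaves the node online with no fencing, while `effect_of_failure("monitor", 60000, REMOTE_ONLINE)` marks it offline and requests fencing.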

That would still leave some decisions about exactly how and where in the call chain to implement the online-state patch. The possibility of multiple failed actions in the history, occurring in various orders, gives me a headache.

nrwahl2 marked this pull request as ready for review May 13, 2024 05:38

nrwahl2 force-pushed the nrwahl2-T214 branch from e714f22 to 4468d0f Compare May 13, 2024 06:10

nrwahl2 marked this pull request as draft May 13, 2024 08:39

nrwahl2 added 8 commits May 16, 2024 02:39

Test: cts-scheduler: Test failed remote connection resource migrate_from

e40831a

The test results are currently wrong. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Test: cts-scheduler: Update test after remote migrate_from fix

11fbd77

Ref T214 Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Doc: Pacemaker Explained: Render footnote correctly

539591a

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Fix: libpe_status: Fence remote node only after stop/monitor failure

4d69753

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

API: libcrmcommon: New pcmk_rsc_remote_conn_lost pcmk_rsc_flags value

a935472

Deprecated, for internal use only. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

WIP: Test: cts-scheduler: Update tests for remote migrate_from fence fix

01e161b

This still has problems. The newly added tests work as expected, but existing tests break. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

nrwahl2 force-pushed the nrwahl2-T214 branch from 4468d0f to 01e161b Compare May 16, 2024 09:39
