[WIP] Ensure workers and miq_worker rows match #22771

agrare · 2023-11-02T15:25:18Z

It is possible for the workers and the miq_worker rows to get out of sync

agrare · 2023-11-02T15:26:49Z

app/models/miq_server/worker_management/kubernetes.rb

    # Update worker deployments with updated settings such as cpu/memory limits
    sync_deployment_settings


TODO I think this has to be after cleanup_orphaned_worker_rows, and doesn't make sense IMO to have sync_deployment_settings in sync_from_system since this is supposed to just be "find what is out there"

Fryguy · 2023-11-02T16:10:22Z

app/models/miq_server/worker_management/kubernetes.rb

@@ -50,6 +46,9 @@ def cleanup_orphaned_worker_rows
    end
  end

+  def cleanup_orphaned_workers
+  end


Can you add a comment in here just to describe why this is empty?

Same comment for the other "empty" methods.

It just isn't implemented yet, I'm testing with Process first

Fryguy · 2023-11-02T16:14:00Z

app/models/miq_server/worker_management/process.rb

  end

  def sync_starting_workers
    MiqWorker.find_all_starting.to_a
  end

+  def cleanup_orphaned_worker_rows
+    orphaned_rows = miq_workers.where.not(:pid => miq_pids)


We use system_uid column in kubernetes - is that column populated with pid in process mode? If so, may be worth using the same column for consistency.

It isn't set for process or systemd but we should do that for consistency. That's going to be another PR but we can probably get that in first and make use of it here.

kbrock · 2023-11-02T16:19:59Z

The first 10 failures are from awesome_spawn update.

Think this one is reasonable:

  11) MiqServer::WorkerManagement::Kubernetes#sync_from_system #ensure_kube_monitors_started podified, ensures pod monitor started and orphaned rows are removed
      Failure/Error: expect(server.worker_manager).to receive(:cleanup_orphaned_worker_rows)

        (#<MiqServer::WorkerManagement::Kubernetes:0x0000564498f5f478 @my_server=#<MiqServer id: 88000000002459, guid: "4ae48e49-aef7-4e8c-9c53-361dc1a662a5", status: "started", started_on: "2023-11-02 15:58:19.285129921 +0000", stopped_on: nil, pid: nil, build: nil, percent_memory: nil, percent_cpu: nil, cpu_time: nil, name: "miq_server_0000000002315", last_heartbeat: "2023-11-02 15:58:19.285110421 +0000", os_priority: nil, is_master: false, logo: nil, version: "9.9.9.9", zone_id: 88000000004083, memory_usage: nil, memory_size: nil, hostname: nil, ipaddress: nil, drb_uri: nil, mac_address: nil, vm_id: nil, has_active_userinterface: nil, has_active_webservices: nil, sql_spid: nil, log_file_depot_id: nil, proportional_set_size: nil, has_active_remote_console: nil, system_memory_free: nil, system_memory_used: nil, system_swap_free: nil, system_swap_used: nil, has_active_cockpit_ws: nil, unique_set_size: nil, has_vix_disk_lib: nil>, @workers_lock=#<Sync:0x0000564498f5f388 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5f1d0>>, @workers={}, @queue_messages_lock=#<Sync:0x0000564498f5f068 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5eed8>>, @queue_messages={}>).cleanup_orphaned_worker_rows(*(any args))
            expected: 1 time with any arguments
            received: 0 times with any arguments
      # ./spec/models/miq_server/worker_management/kubernetes_spec.rb:125:in `block (4 levels) in <top (required)>'

agrare · 2023-11-02T16:21:17Z

@kbrock yeah this isn't close to done yet I expect the tests to fail right now

Fryguy · 2023-11-02T16:37:35Z

Reran after merge of #22772

jrafanie · 2023-11-03T19:41:45Z

@agrare I don't remember the details why but I recall worker rows are created by the server in spawn/systemd but in the pod itself when it invokes run_single_worker.rb. In other words, for a moment we have a row without a process in the spawn/systemd case and a pod without a row in the podified case. I don't know if this change or my change I tested earlier this week would cause a problem with this existing assumption.

In my case, I was attempting to reload the worker row before each heartbeat and let the process exit if the row was removed.

diff --git a/app/models/miq_worker/runner.rb b/app/models/miq_worker/runner.rb
index 6b7ed69f3d..c8dd3ffd9c 100644
--- a/app/models/miq_worker/runner.rb
+++ b/app/models/miq_worker/runner.rb
@@ -319,6 +319,7 @@ class MiqWorker::Runner
     # Heartbeats can be expensive, so do them only when needed
     return if @last_hb.kind_of?(Time) && (@last_hb + worker_settings[:heartbeat_freq]) >= now

+    reload_worker_record
     systemd_worker? ? @worker.sd_notify_watchdog : heartbeat_to_file

     if config_out_of_date?

agrare · 2023-11-06T20:24:59Z

@jrafanie is a great point and something we need to be careful about, kubernetes creates the worker rows in run_single_worker because we can't know the GUID prior to starting the pod (all pods in a replica set have to have the same environment). Kubernetes was already deleting all worker rows that didn't match any running pods.

For the systemd/process model I believe we're okay based on where in the monitor loop we are, by the time we are checking for orphan rows the worker records will have been created and the process or systemd service will have been started.

For kubernetes there could be an issue with the reverse where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE I haven't implemented this one yet). I think if we include the "starting timeout' in the check based on the age of the pod then we should be able to cover that case, wdyt?

jrafanie · 2023-11-06T23:00:52Z

For the systemd/process model I believe we're okay based on where in the monitor loop we are, by the time we are checking for orphan rows the worker records will have been created and the process or systemd service will have been started.

For kubernetes there could be an issue with the reverse where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE I haven't implemented this one yet). I think if we include the "starting timeout' in the check based on the age of the pod then we should be able to cover that case, wdyt?

Yeah, that could work. We've seen it before where pods are really slow to start in our 🔥 environments and such as when CPU throttling is in play. Anything we do should account for possibly significant delays in both situations:

Yeah, think you're right for "creating" rows in systemd/spawn. We do allow for some additional time, using starting timeout... https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/settings.rb#L51 via https://github.com/ManageIQ/manageiq/blob/master/app/models/miq_server/worker_management/monitor/validation.rb#L5.
For podified, we liveness probes per pod but it's likely that's running independently from the creation of the worker row. Killing it come be hard because it could be in a reboot loop if it's continually getting throttled or delayed. We'll have to think about that.

miq-bot · 2024-02-19T00:00:27Z

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s).

miq-bot · 2024-03-18T16:35:37Z

Checked commit agrare@bcd5b98 with ruby 2.7.8, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
5 files checked, 2 offenses detected

app/models/miq_server/worker_management/kubernetes.rb

⚠️ - Line 58, Col 5 - Lint/EmptyBlock - Empty block detected.
❗ - Line 57, Col 7 - Style/CommentAnnotation - Annotation keywords like TODO should be all upper case, followed by a colon, and a space, then a note describing the problem.

agrare added the bug label Nov 2, 2023

agrare requested a review from jrafanie as a code owner November 2, 2023 15:25

agrare added the wip label Nov 2, 2023

agrare force-pushed the ensure_workers_and_worker_rows_match branch from 4f32ebc to f368402 Compare November 2, 2023 15:26

agrare commented Nov 2, 2023

View reviewed changes

agrare force-pushed the ensure_workers_and_worker_rows_match branch from f368402 to 0d79355 Compare November 2, 2023 15:27

miq-bot removed the wip label Nov 2, 2023

agrare force-pushed the ensure_workers_and_worker_rows_match branch 2 times, most recently from 2d55105 to e6e1c39 Compare November 2, 2023 15:40

Fryguy reviewed Nov 2, 2023

View reviewed changes

Fryguy assigned jrafanie Nov 2, 2023

Fryguy reviewed Nov 2, 2023

View reviewed changes

agrare changed the title ~~Ensure workers and miq_worker rows match~~ [WIP] Ensure workers and miq_worker rows match Nov 2, 2023

miq-bot added the wip label Nov 2, 2023

Fryguy closed this Nov 2, 2023

Fryguy reopened this Nov 2, 2023

agrare force-pushed the ensure_workers_and_worker_rows_match branch 3 times, most recently from b11a008 to 697d8b8 Compare November 2, 2023 20:03

agrare mentioned this pull request Nov 6, 2023

[BUG] Worker records stuck in creating are never deleted on appliance #22644

Open

miq-bot added the stale label Feb 19, 2024

miq-bot added the unmergeable label Mar 15, 2024

Ensure workers and miq_worker rows match

bcd5b98

agrare force-pushed the ensure_workers_and_worker_rows_match branch from 697d8b8 to bcd5b98 Compare March 18, 2024 16:33

agrare removed stale unmergeable labels Mar 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Ensure workers and miq_worker rows match #22771

[WIP] Ensure workers and miq_worker rows match #22771

agrare commented Nov 2, 2023 •

edited

agrare Nov 2, 2023

Fryguy Nov 2, 2023

Fryguy Nov 2, 2023

agrare Nov 2, 2023

Fryguy Nov 2, 2023

agrare Nov 2, 2023

kbrock commented Nov 2, 2023

agrare commented Nov 2, 2023

Fryguy commented Nov 2, 2023

jrafanie commented Nov 3, 2023

agrare commented Nov 6, 2023

jrafanie commented Nov 6, 2023

miq-bot commented Feb 19, 2024

miq-bot commented Mar 18, 2024

		# Update worker deployments with updated settings such as cpu/memory limits
		sync_deployment_settings

[WIP] Ensure workers and miq_worker rows match #22771

Are you sure you want to change the base?

[WIP] Ensure workers and miq_worker rows match #22771

Conversation

agrare commented Nov 2, 2023 • edited

agrare Nov 2, 2023

Choose a reason for hiding this comment

Fryguy Nov 2, 2023

Choose a reason for hiding this comment

Fryguy Nov 2, 2023

Choose a reason for hiding this comment

agrare Nov 2, 2023

Choose a reason for hiding this comment

Fryguy Nov 2, 2023

Choose a reason for hiding this comment

agrare Nov 2, 2023

Choose a reason for hiding this comment

kbrock commented Nov 2, 2023

agrare commented Nov 2, 2023

Fryguy commented Nov 2, 2023

jrafanie commented Nov 3, 2023

agrare commented Nov 6, 2023

jrafanie commented Nov 6, 2023

miq-bot commented Feb 19, 2024

miq-bot commented Mar 18, 2024

agrare commented Nov 2, 2023 •

edited