
Fake crash of external postgres leader - Can't launch new job and instance healthcheck hanging #1844

Open
sylvain-de-fuster opened this issue Apr 29, 2024 · 2 comments


@sylvain-de-fuster

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

Our environment:

  • 1 AWX master on k3s with 2 external workers (using Receptor)
  • Projects are hosted on enterprise GitHub
  • SAML auth
  • Control node EE untouched (only custom worker EEs for playbooks)
  • No limits added to pods

We are working on strengthening our dev platform.
The final goal is an AWX platform that is more responsive, available, and manageable in the face of growing usage, maintenance needs, and potential incidents.

• First step
To avoid hitting issues already fixed in newer versions, we started by updating our infrastructure from 23.3.0 to 24.2.0 and reinstalling our external workers with Receptor 1.4.5.

• Second step
Externalize our Postgres database. The externalization itself was pretty easy.
We are using a Patroni cluster (v3.2.2) of two Postgres instances (v16.1), one leader and one streaming replica, with a VIP on top.

Our issues appeared during behavior tests: we observed bad behavior during switchover and failover.

After checking ansible/awx#13505 and #1393, we applied the corresponding Postgres keepalive parameters.
That helped a lot, but some issues remain.
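For reference, the keepalive parameters discussed in those threads map onto ordinary TCP keepalive socket options. Below is a minimal Python sketch of what they do at the socket level; the timing values are illustrative assumptions, not the exact settings from the linked issues:

```python
import socket

# Hypothetical values mirroring typical Postgres keepalive settings, e.g.
# keepalives=1, keepalives_idle=5, keepalives_interval=5, keepalives_count=5.
KEEPIDLE, KEEPINTVL, KEEPCNT = 5, 5, 5

def apply_keepalive(sock: socket.socket) -> None:
    """Enable TCP keepalive so a dead peer is eventually detected
    instead of the connection hanging forever."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained constants are platform-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
apply_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # nonzero when enabled
sock.close()
```

With these options set, the kernel probes an idle connection and errors it out after the probes fail, which is what lets a client notice a leader that vanished without closing its connections.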

Our current tests (each sub-test was run starting from a healthy platform):

• Switchover

  • Without activity

    • AWX interface usable (UI or API)
    • Workers healthy
    • New jobs launching correctly
  • Simple activity in progress (one sleep task on localhost)

    • AWX interface usable (UI or API)
    • Workers healthy
    • New jobs launching correctly
    • Current job ended correctly
  • "Complex" activity in progress (one job with multiple tasks)

    • AWX interface usable (UI or API)
    • Workers healthy
    • New jobs launching correctly
    • Current job failed (I don't know if this is the expected behavior, but it is at least acceptable for us)

The connection is re-established automatically without any container restart.
This behavior is acceptable for us (if there are tweaks to fix the last point and make it perfect, we'll take them!).

• Crash simulation of the postgres leader

  • Without activity

    • AWX interface usable (UI or API)
    • Workers healthcheck hanging
    • New jobs can't launch correctly (waiting state)
  • Simple activity in progress (one sleep task on localhost)

    • AWX interface usable (UI or API)
    • Workers healthcheck hanging
    • New jobs can't launch correctly (waiting state)
    • Current job ended correctly
  • "Complex" activity in progress (one job with multiple tasks)

    • AWX interface usable (UI or API)
    • Workers healthcheck hanging
    • New jobs can't launch correctly (waiting state)
    • Current job failed (I don't know if this is the expected behavior, but it is at least acceptable for us)

Wrong behavior.

The connection doesn't seem to be re-established automatically, and there is no container restart.
A rollout restart of the task pod gives back the ability to execute new jobs.
Healthchecks launched before the crash are still hanging (health_check_pending: true). The only way I found to fix this hang is to reinstall the instance.
If no healthcheck was run on a worker beforehand, a new healthcheck on it works and ends correctly.

After checking the various logs, I can't find why new jobs can't be launched or why the healthcheck hangs indefinitely.
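One plausible explanation for the indefinite hang (my assumption, not something confirmed from the logs) is that connections opened before the crash are left half-open: the leader's VM disappeared without sending a TCP FIN/RST, so a blocking read on the old connection never returns unless TCP keepalive or an application timeout fires. A minimal Python sketch of that failure mode:

```python
import socket

# A connected socket pair stands in for a pre-crash database connection.
client, server = socket.socketpair()

# Simulate the peer "vanishing" ungracefully: nothing is ever sent, and no
# FIN/RST arrives because we never close the other end. Without a timeout
# (or TCP keepalive), recv() would block forever -- the same way a hung
# healthcheck waits on a dead connection.
client.settimeout(0.5)
try:
    client.recv(1024)
    hung = False
except socket.timeout:
    hung = True  # the read would have blocked indefinitely

client.close()
server.close()
print(hung)
```

This is only an illustration of the mechanism; whether AWX's healthcheck path is actually stuck on such a read would need confirmation from the task container's stack traces.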

AWX Operator version

2.7.0 and 2.15.0

AWX version

23.3.0 and 24.2.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

k3s v1.25.4+k3s1

Modifications

no

Steps to reproduce

  • Use an external Postgres cluster with one leader and one replica (streaming mode).
  • Configure standard Postgres keepalive parameters.
  • Hard-stop the leader (VM power off, for example).
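When running this reproduction, a plain TCP connect probe against the VIP helps tell when the database endpoint is answering again after the power-off. A hedged sketch; the host and port in the usage comment are placeholders, not values from this issue:

```python
import socket

def vip_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage (placeholder address -- substitute your VIP and Postgres port):
# import time
# while not vip_accepts_connections("10.0.0.100", 5432):
#     time.sleep(5)
```

Note that a successful TCP connect only shows the VIP has moved; it says nothing about whether AWX's already-open connections have recovered.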

Expected results

  • Current jobs may fail.
  • Pod restart or reconnection to the database
  • Some AWX interface latency during recovery
  • After recovery: new jobs can be executed
  • Instance healthchecks work

Actual results

Additional information

No response

Operator Logs

No response

@jessicamack
Member

@sylvain-de-fuster there was further work around Postgres done in recent AWX releases. Can you upgrade to the latest AWX release and see if that resolves the issue?

@sylvain-de-fuster
Author

sylvain-de-fuster commented May 3, 2024

Hello,

Thanks for your answer.

I did update to 24.2.0.
Regarding the work on Postgres: did you mean after this version? At first sight, I don't see any Postgres-related changelog entries for 24.3.0 or 24.3.1.

Anyway, I updated to the latest (24.3.1 at the time of writing); my checks are below:

• Recurrent error in the task container of the AWX task pod:
[...]
min_value in DecimalField should be Decimal type.
[...]

Only that line; no other information found about it. I don't really know where to look. The error appears even without activity on AWX.
The line is also present in the web container logs of the web pod, but not with the same regularity.
It is not specifically related to my checks, but I did notice it, so FYI.

• Switchover tests
AWX restarts the container under the web pod (it didn't on 24.2.0).
The "simple task playbook case" (sleep command on localhost) now fails (it didn't on 24.2.0).

See the awx_web_container logs.

awx_web_container.txt

• Fake crash
Same behavior as before.
