How to troubleshoot a double run? (with pics!) #38728
Replies: 2 comments
-
These don't appear to be duplicates. One run is for 30 March and the other is for 31 March (see the …
-
Ryan, thanks for responding! Since I posted this back in March, I have learned that the likely culprit was a split-brain condition in the database (the PostgreSQL cluster was running with two primaries). According to my PostgreSQL expert colleagues, when that condition was fixed, the "older" instance of the database took over. Ergo, the scheduler saw tasks that, from its point of view, had never been run. Could there be some way to catch that condition? Obviously, the logs captured the output for both runs. @RNHTTR
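One way to catch "the database went back in time" after a failover is a watermark check: persist the newest timestamp you have seen in the metadata DB somewhere outside the cluster, and alert if a later read returns something older. This is not an Airflow feature; the function and example below are a hypothetical sketch of the idea.

```python
# Hypothetical watermark check for detecting a stale primary after failover.
# Nothing here is Airflow API; "previous" would be a timestamp you persisted
# locally (or in object storage), "current" a fresh read from the database,
# e.g. the newest scheduler heartbeat or latest dag_run start time.
from datetime import datetime, timezone
from typing import Optional

def db_time_is_plausible(previous: Optional[datetime], current: datetime) -> bool:
    """Return True if the database's notion of 'latest' moved forward."""
    if previous is None:          # first observation: nothing to compare against
        return True
    return current >= previous    # going backwards suggests a stale primary

# Example: our saved watermark says the DB had data through 31 March, but after
# the failover the freshest row we can find is from 30 March -- split brain
# resolved in favor of the older primary, and re-runs may follow.
saved = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)
from_db = datetime(2024, 3, 30, 9, 0, tzinfo=timezone.utc)
assert not db_time_is_plausible(saved, from_db)   # stale primary detected
assert db_time_is_plausible(from_db, saved)       # forward motion is fine
```

A check like this could run as a sidecar or a monitoring probe; the key design point is that the watermark must live outside the PostgreSQL cluster it is guarding, otherwise it rolls back along with everything else.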
-
Edit: If you don't have any ideas for an answer on this one, check out my other question over at #38590.
Edit: Speaking of locks (and unlocks), is there any chance this discussion and issue #36920 are inversely related, i.e. opposite symptoms of the same underlying cause?
We are using Airflow with two schedulers (AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING=True) and a PostgreSQL database cluster with three nodes on the back end.
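For context on why that setting normally makes two schedulers safe: with row-level locking enabled, each scheduler claims task rows with `SELECT ... FOR UPDATE SKIP LOCKED`, so only one of them can win a given row. The snippet below is a minimal simulation of that claim pattern (a Python lock standing in for the PostgreSQL row lock; it is not Airflow's actual code). It also shows the limitation relevant here: the protection assumes both schedulers see the *same* primary, which split brain breaks.

```python
# Minimal simulation (NOT Airflow internals) of the row-claim pattern behind
# AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING: SELECT ... FOR UPDATE SKIP LOCKED.
# A threading.Lock plays the role of the PostgreSQL row lock; with split brain,
# each primary would hold its own independent "claimed" dict, and both
# schedulers could win the same task -- exactly the failure discussed above.
import threading

claimed = {}                     # task_id -> scheduler that claimed it
row_lock = threading.Lock()      # stands in for the database row lock

def try_claim(task_id: str, scheduler: str) -> bool:
    with row_lock:               # FOR UPDATE: acquire the row lock...
        if task_id in claimed:   # SKIP LOCKED: ...or skip an already-claimed row
            return False
        claimed[task_id] = scheduler
        return True

results = []
t1 = threading.Thread(target=lambda: results.append(try_claim("task_a", "sched-1")))
t2 = threading.Thread(target=lambda: results.append(try_claim("task_a", "sched-2")))
t1.start(); t2.start(); t1.join(); t2.join()
assert results.count(True) == 1  # exactly one scheduler wins the row
```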
This week, the primary node of the cluster was shut down for patching. When the shutdown of the database primary node occurred, about 70 tasks that had completed within the previous 48 hours were found by the scheduler to have "Dependencies all met" and were run again. Unfortunately, the DAGs were not designed to be run again.
How do I go about troubleshooting the root cause of this scenario and prevent it in the future?
[Screenshot of `airflow dags list-runs` output showing the last two run instances started about the time the database was patched]
[Screenshot of a task log showing the first successful run and the unexpected "bonus" second run]
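As a troubleshooting starting point, the metadata DB itself can be queried for the "bonus" runs: task instances that started long after their run's logical date are the signature of a post-failover replay. The snippet below is an illustrative sketch against a simplified, in-memory copy of a few `task_instance` columns (the real table has more columns and uniqueness constraints), not a query you can paste verbatim against a live Airflow database.

```python
# Illustrative sketch: hunt for post-failover "bonus" runs in a simplified,
# throwaway copy of a few Airflow task_instance columns. Column names mirror
# Airflow's metadata DB, but the schema here is deliberately reduced.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE task_instance (
    dag_id TEXT, task_id TEXT, run_id TEXT,
    logical_date TEXT, start_date TEXT, state TEXT)""")
rows = [
    # normal run, started minutes after its logical date:
    ("etl", "load", "sched__2024-03-30", "2024-03-30T00:00", "2024-03-30T00:05", "success"),
    # the same task re-run two days later, right after the primary was patched:
    ("etl", "load", "sched__2024-03-30", "2024-03-30T00:00", "2024-04-01T03:12", "success"),
]
con.executemany("INSERT INTO task_instance VALUES (?,?,?,?,?,?)", rows)

# Flag anything that started more than a day after its logical date.
suspicious = con.execute("""
    SELECT dag_id, task_id, run_id, start_date FROM task_instance
    WHERE julianday(start_date) - julianday(logical_date) > 1.0
""").fetchall()
assert len(suspicious) == 1   # only the late re-run is flagged
```

On a live system the same idea can be approximated from the CLI with `airflow dags list-runs` (as in the screenshot) plus the task logs, comparing start times against the run's logical date.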