How to troubleshoot a double run? (with pics!) #38728
Replies: 2 comments
-
These don't appear to be duplicates. One run is for 30 March and the other is for 31 March (see the …
-
Ryan, thanks for responding! Since I posted this back in March, I have learned that the likely culprit was a split-brain condition in the database (the PostgreSQL cluster was running with two primaries). According to my PostgreSQL expert colleagues, when that condition was fixed, the "older" instance of the database took over. Ergo, the scheduler saw tasks that, from its point of view, had never been run. Could there be some way to catch that condition? Obviously, the logs captured the output for both runs. @RNHTTR
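One way to catch "the database went back in time" after a failover is a watermark check: persist the newest timestamp you have seen in the metadata DB somewhere outside the cluster, and alert if a later read returns something older. This is not an Airflow feature; the function and example below are a hypothetical sketch of the idea.

```python
# Hypothetical watermark check for detecting a stale primary after failover.
# Nothing here is Airflow API; "previous" would be a timestamp you persisted
# locally (or in object storage), "current" a fresh read from the database,
# e.g. the newest scheduler heartbeat or latest dag_run start time.
from datetime import datetime, timezone
from typing import Optional

def db_time_is_plausible(previous: Optional[datetime], current: datetime) -> bool:
    """Return True if the database's notion of 'latest' moved forward."""
    if previous is None:          # first observation: nothing to compare against
        return True
    return current >= previous    # going backwards suggests a stale primary

# Example: our saved watermark says the DB had data through 31 March, but after
# the failover the freshest row we can find is from 30 March -- split brain
# resolved in favor of the older primary, and re-runs may follow.
saved = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)
from_db = datetime(2024, 3, 30, 9, 0, tzinfo=timezone.utc)
assert not db_time_is_plausible(saved, from_db)   # stale primary detected
assert db_time_is_plausible(from_db, saved)       # forward motion is fine
```

A check like this could run as a sidecar or a monitoring probe; the key design point is that the watermark must live outside the PostgreSQL cluster it is guarding, otherwise it rolls back along with everything else.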
-
Edit: If you don't have any ideas for an answer on this one, check out my other question over at #38590.
Edit: Speaking of locks (and unlocks), is there any chance this discussion and issue #36920 are inversely related, i.e. opposite symptoms of the same underlying cause?
We are using Airflow with two schedulers (AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING=True) and a PostgreSQL database cluster with three nodes on the back end.
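For context on why that setting normally makes two schedulers safe: with row-level locking enabled, each scheduler claims task rows with `SELECT ... FOR UPDATE SKIP LOCKED`, so only one of them can win a given row. The snippet below is a minimal simulation of that claim pattern (a Python lock standing in for the PostgreSQL row lock; it is not Airflow's actual code). It also shows the limitation relevant here: the protection assumes both schedulers see the *same* primary, which split brain breaks.

```python
# Minimal simulation (NOT Airflow internals) of the row-claim pattern behind
# AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING: SELECT ... FOR UPDATE SKIP LOCKED.
# A threading.Lock plays the role of the PostgreSQL row lock; with split brain,
# each primary would hold its own independent "claimed" dict, and both
# schedulers could win the same task -- exactly the failure discussed above.
import threading

claimed = {}                     # task_id -> scheduler that claimed it
row_lock = threading.Lock()      # stands in for the database row lock

def try_claim(task_id: str, scheduler: str) -> bool:
    with row_lock:               # FOR UPDATE: acquire the row lock...
        if task_id in claimed:   # SKIP LOCKED: ...or skip an already-claimed row
            return False
        claimed[task_id] = scheduler
        return True

results = []
t1 = threading.Thread(target=lambda: results.append(try_claim("task_a", "sched-1")))
t2 = threading.Thread(target=lambda: results.append(try_claim("task_a", "sched-2")))
t1.start(); t2.start(); t1.join(); t2.join()
assert results.count(True) == 1  # exactly one scheduler wins the row
```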
This week, the primary node of the cluster was shut down for patching. When the shutdown of the database primary node occurred, about 70 tasks that had completed within the previous 48 hours were found by the scheduler to have "Dependencies all met" and were run again. Unfortunately, the DAGs were not designed to be run again.
How do I go about troubleshooting the root cause of this scenario and prevent it in the future?
[Screenshot of `airflow dags list-runs` output showing the last two run instances started about the time the database was patched]
[Screenshot of a task log showing the first successful run and the unexpected "bonus" second run]
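As a troubleshooting starting point, the metadata DB itself can be queried for the "bonus" runs: task instances that started long after their run's logical date are the signature of a post-failover replay. The snippet below is an illustrative sketch against a simplified, in-memory copy of a few `task_instance` columns (the real table has more columns and uniqueness constraints), not a query you can paste verbatim against a live Airflow database.

```python
# Illustrative sketch: hunt for post-failover "bonus" runs in a simplified,
# throwaway copy of a few Airflow task_instance columns. Column names mirror
# Airflow's metadata DB, but the schema here is deliberately reduced.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE task_instance (
    dag_id TEXT, task_id TEXT, run_id TEXT,
    logical_date TEXT, start_date TEXT, state TEXT)""")
rows = [
    # normal run, started minutes after its logical date:
    ("etl", "load", "sched__2024-03-30", "2024-03-30T00:00", "2024-03-30T00:05", "success"),
    # the same task re-run two days later, right after the primary was patched:
    ("etl", "load", "sched__2024-03-30", "2024-03-30T00:00", "2024-04-01T03:12", "success"),
]
con.executemany("INSERT INTO task_instance VALUES (?,?,?,?,?,?)", rows)

# Flag anything that started more than a day after its logical date.
suspicious = con.execute("""
    SELECT dag_id, task_id, run_id, start_date FROM task_instance
    WHERE julianday(start_date) - julianday(logical_date) > 1.0
""").fetchall()
assert len(suspicious) == 1   # only the late re-run is flagged
```

On a live system the same idea can be approximated from the CLI with `airflow dags list-runs` (as in the screenshot) plus the task logs, comparing start times against the run's logical date.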