Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Cosmos stale temporary directories #958

Open
4 tasks
tatiana opened this issue May 14, 2024 · 1 comment
Open
4 tasks

[Bug] Cosmos stale temporary directories #958

tatiana opened this issue May 14, 2024 · 1 comment
Assignees
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc bug Something isn't working epic-assigned priority:high High priority issues are blocking or critical issues without a workaround and large impact
Milestone

Comments

@tatiana
Copy link
Collaborator

tatiana commented May 14, 2024

Context

It seems sometimes Cosmos is creating and not deleting temporary directories.

An example of a report, from 30 April 2024 in the #airflow-dbt Slack channel:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1714484749579599

Robert Jones: Good day I'm running cosmos 1.3.2 on MWAA 2.7.2. I'm using the load_method=LoadMode.DBT_MANIFEST in the RenderConfig. But the tmp is filling up over time with dirs that are mostly symlinks but then there is also a target directories that has the compiled code and graph.gpickle. Does cosmos come with a cleanup mechanism or do I need to build a DAG that cleans the tmp directory?

Here are some examples:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - /tmp/tmp0x486tay/target:
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - total 38M
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 4 airflow airflow 4.0K Apr 30 11:37 .
[2024-04-30T12:45:15.478+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 ..
[2024-04-30T12:45:15.479+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 compiled
[2024-04-30T12:45:15.481+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 6.4M Apr 30 11:36 graph.gpickle
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 843K Apr 30 11:36 graph_summary.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  17M Apr 30 11:37 manifest.json
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  14M Apr 30 11:36 partial_parse.msgpack
[2024-04-30T12:45:15.482+0000] {{subprocess.py:93}} INFO - drwxr-xr-x 3 airflow airflow 4.0K Apr 30 11:36 run
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow 3.9K Apr 30 11:37 run_results.json
[2024-04-30T12:45:15.483+0000] {{subprocess.py:93}} INFO - -rw-r--r-- 1 airflow airflow  234 Apr 30 11:37 semantic_manifest.json
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp0m2gkxru
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp0x486tay
[2024-04-30T12:45:15.304+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:45 tmp0zf1jeip
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp12fmwv6j
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp1i3_4i36
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmp20rc3co2
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmp24zhr1yr
[2024-04-30T12:45:15.305+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp35wco4yc
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmp3_ugz4uy
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp3nk2g9_w
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmp3pverouw
[2024-04-30T12:45:15.306+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp4md00dzk
[2024-04-30T12:45:15.312+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp59t5kcsa
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp5_yn3j90
[2024-04-30T12:45:15.313+0000] {{subprocess.py:93}} INFO - -rw------- 1 airflow airflow  13K Apr 30 11:31 tmp5c_o7q6w
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp5di13bsx
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmp5p8h7bk2
[2024-04-30T12:45:15.314+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp5zdqwdez
[2024-04-30T12:45:15.315+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmp62enlhb2
[2024-04-30T12:45:15.317+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp6gv98dj3
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp74qo5pm7
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp86bttjpg
[2024-04-30T12:45:15.318+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmp8_601uyf
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmp8p1if86h
[2024-04-30T12:45:15.324+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp99tdocou
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmp9l6ljh4p
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmp9tx575tl
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmp_0uymz3x
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 03:15 tmp_ph3232z
[2024-04-30T12:45:15.325+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmpbtvqwlhb
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpd1eb6y2y
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpe_gz5es8
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 02:15 tmpeaqsvtm4
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 12:44 tmpem8_pa52
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpemz78wrw
[2024-04-30T12:45:15.326+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpf3v5iz3_
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:39 tmpfb8rvq9v
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpho39phpd
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmphxru1hq8
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpiamgv8ox
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpj_k0_9h4
[2024-04-30T12:45:15.327+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpjg_xwyd9
[2024-04-30T12:45:15.328+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:48 tmpjpxu2ltp
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjr32_bt0
[2024-04-30T12:45:15.329+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:38 tmpjs2i5xmv
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpjsg9rdn6
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 02:15 tmpkcw52xq4
[2024-04-30T12:45:15.330+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpksxjhhda
[2024-04-30T12:45:15.332+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmplkcqmlm8
[2024-04-30T12:45:15.333+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpn57z5j0_
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:49 tmpna7v18jf
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpne2u4xqd
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 28 03:15 tmpnpbketqi
[2024-04-30T12:45:15.335+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpnsrts_b8
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:43 tmpp6qk2k5w
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmppf0pprft
[2024-04-30T12:45:15.338+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:37 tmppjrk253b
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 30 02:15 tmpq5_8ma32
[2024-04-30T12:45:15.342+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpq6blaw7z
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpr5zqdd56
[2024-04-30T12:45:15.344+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:00 tmpsrpybyij
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmptcyhp6is
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 26 07:20 tmpu591ld0j
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpuop7l2k4
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:36 tmpvox078h7
[2024-04-30T12:45:15.358+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpx6jq5_ct
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 02:15 tmpxi6odkmu
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 4 airflow airflow 4.0K Apr 30 11:46 tmpxpsqdy3a
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 14:15 tmpy3vhu055
[2024-04-30T12:45:15.359+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 03:15 tmpyi3sxcqm
[2024-04-30T12:45:15.362+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 27 03:15 tmpzcq8r5_a
[2024-04-30T12:45:15.365+0000] {{subprocess.py:93}} INFO - drwx------ 3 airflow airflow 4.0K Apr 29 13:30 tmpzw4pw0ve

There have been previous discussions in the core Airflow apache/airflow#22404 about tempfile.TemporaryDirectory doesn't necessarily behaving as expected.

I was not able to reproduce this issue yet, but one possibility is that there is some exception or error, and the context manager tempfile.TemporaryDirectory is not being able to clean things after those scenarios.

Acceptance criteria

  • Review source code to confirm we are always using the context manager.
  • Improve how we generate temporary directories so we can know if they are being generated by DAG parsing or task execution (in this particular report, it was created by task execution)
  • Reproduce the issue, if possible
  • Propose solutions
@tatiana tatiana added this to the 1.5.0 milestone May 14, 2024
@tatiana tatiana added priority:high High priority issues are blocking or critical issues without a workaround and large impact bug Something isn't working area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc triage-needed Items need to be reviewed / assigned to milestone and removed triage-needed Items need to be reviewed / assigned to milestone labels May 14, 2024
@tatiana tatiana self-assigned this May 17, 2024
@tatiana tatiana mentioned this issue May 17, 2024
@maslick
Copy link

maslick commented May 27, 2024

We are experiencing a similar issue in our MWAA environment:

  • astronomer-cosmos version: 1.4.1
  • MWAA/Airflow version: 2.6.3
  • MWAA configuration: mw1-large (4 vCPUs, 8 GB RAM, up to 1000 DAG capacity, 20 concurrent tasks)
  • Number of dbt DAGs: 300+
  • All MWAA workers are failing dbt DAGs/tasks due to OSError: [Errno 28] No space left on device.

I have done some investigation:

  • Worker nodes have exhausted 30 GB of disk space.
  • There are hundreds of files (manifest.json and partial_parse.msgpack) totalling ~26 MB in each /tmp/cosmos/<...>/target directory:
df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   28G   48M 100% /
sudo du -ah /tmp | sort -n -r | head -n 200
26M	/tmp/cosmos/wf_12__DBTGrp_1/target
26M	/tmp/cosmos/wf_13__DBTGrp_1/target
26M	/tmp/cosmos/wf_14__DBTGrp_1/target
26M	/tmp/cosmos/wf_15__DBTGrp_1/target
26M	/tmp/cosmos/wf_15__DBTGrp_1/target
...
...
ls -la /tmp/cosmos/wf_12__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow     4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow     4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13667800 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack
ls -la /tmp/cosmos/wf_13__DBTGrp_1/target
total 25952
drwxr-xr-x 3 airflow airflow     4096 May 24 11:36 ..
drwxr-xr-x 2 airflow airflow     4096 May 24 11:38 .
-rw-r--r-- 1 airflow airflow 13659451 May 25 09:25 manifest.json
-rw-r--r-- 1 airflow airflow 12896530 May 25 09:25 partial_parse.msgpack

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc bug Something isn't working epic-assigned priority:high High priority issues are blocking or critical issues without a workaround and large impact
Projects
None yet
Development

No branches or pull requests

2 participants