Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Cleanup of Temporary Directories for built-in Materializers #2257

Open
1 task
strickvl opened this issue Jan 10, 2024 · 8 comments · May be fixed by #2560
Open
1 task

Implement Cleanup of Temporary Directories for built-in Materializers #2257

strickvl opened this issue Jan 10, 2024 · 8 comments · May be fixed by #2560
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@strickvl
Copy link
Contributor

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

Contact Details [Optional]

support@zenml.io

What happened?

Some materializers in ZenML create local temporary directories to handle files from the artifact store. These directories are not being cleaned up automatically, leading to potential clutter and storage issues, especially when large or numerous temporary files are involved.

Task Description

Develop a solution within ZenML to ensure that temporary directories created by materializers are cleaned up efficiently after their use. This could involve direct cleanup mechanisms within materializers or a centralized approach, such as a dedicated temporary directory for ZenML step runs, which is automatically cleared post-execution.

Expected Outcome

  • Temporary directories used by materializers should be cleaned up automatically to avoid unnecessary storage use.
  • The solution should handle scenarios where temporary files need to persist for the duration of the Python object's lifecycle but should be cleared once no longer needed.
  • This enhancement will improve resource management and maintain cleanliness in the file system.

Steps to Implement

  • Analyze the current implementation of materializers and identify how and where temporary directories are being used.
  • Design and implement a cleanup mechanism that is triggered post usage of these temporary directories.
  • Consider the creation of a dedicated ZenML temporary directory for step runs, which is automatically cleared after a step is completed.
  • Test the implementation to ensure that temporary files are appropriately cleared without affecting the functionality of materializers.
  • Update documentation to inform users about this behavior and any potential impact on their existing workflows.

Additional Context

Ensuring efficient cleanup of temporary directories is crucial for sustainable resource management and to maintain optimal performance in data pipelines managed by ZenML.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@strickvl strickvl added enhancement New feature or request good first issue Good for newcomers labels Jan 10, 2024
@akesterson
Copy link

Can you assign this to me please

@akesterson
Copy link

Is there a documented reproduction case for this?

Analyze the current implementation of materializers and identify how and where temporary directories are being used.

I've examined all of the materializers in zenml/src/zenml/materializers, and here's what I've found:

  • 100% of their file operations are through the artifact store
  • None of the builtin materializers make temporary directories, either using the artifact store or otherwise

Do we have a list of the materializers that are engaging in the offensive behavior? Without knowing how they are creating the temporary directories, designing a solution is quite difficult.

@strickvl
Copy link
Contributor Author

Most of the 'real' materializers that people use are in src/zenml/integrations and then inside the various integrations. I'd check those out to see a variety of other behaviors.

@akesterson
Copy link

akesterson commented Mar 11, 2024

Looks like these are the integrations that create temporary objects in their materializer

zenml/src/zenml/integrations/bentoml
zenml/src/zenml/integrations/huggingface
zenml/src/zenml/integrations/lightgbm
zenml/src/zenml/integrations/llama_index
zenml/src/zenml/integrations/pillow
zenml/src/zenml/integrations/polars
zenml/src/zenml/integrations/pycaret
zenml/src/zenml/integrations/tensorflow
zenml/src/zenml/integrations/whylogs
zenml/src/zenml/integrations/xgboost

I'll take a look

@akesterson
Copy link

Most of those materializers are already using the tempfile module, they just aren't cleaning up after themselves. (Some are doing it, some are inconsistent within the same module, others have logic that could leave temporary files dangling depending on code path or if exceptions are encountered). Changing the implementations to use the tempfile classes as context handlers is sufficient for the average case here.

However there are a few corner cases:

  • huggingface_dataset_materializer intentionally does not clean up after itself (see [BUG]: HuggingFace dataset materializer problem #1135 ) because something about that integration requires the original path that materialized the dataset to be present as it moves through various processing stages. There is likely additional work to be done in this integration to resolve the way those references are created/updated/deleted but I don't know enough about it yet to suggest the right changes.
  • the llama_index materializer is 100% commented out. Is this worth fixing?
  • both of the lightgbm materializer save() methods use the tempfile context handler but explicitly disables delete-on-exit functionality. Looking at Fix Keras materializer to work with remote artifact stores #569 I'm not clear why it needs to be this way. Shouldn't it be safe to delete the original file as soon as it's been copied into the artifact store?

I have a patch that fixes the common cases, I'll work that into a PR and improve the tests where possible. Can anyone answer for these 3 special case integrations?

@strickvl
Copy link
Contributor Author

@akesterson sorry for the delay.

  • I think you can leave the HF dataset materializer to one side as a light exception, for the reasons you mention.
  • we're handling llama_index on our side, and as you say it's non-functional at the moment anyway
  • I also don't see a reason why the lightgbm shouldn't allow for delete on exit in the context handler, and I'm not sure why the keras materializer you mention handles it that way, but I think you should first try to do it within the context handler and if for some reason it doesn't work, then handle it explicitly. But yeah I don't see why we can't delete the file after it's been copied.

Thanks!

akesterson added a commit to akesterson/zenml that referenced this issue Mar 22, 2024
@akesterson
Copy link

akesterson commented Mar 22, 2024

Sorry for the long delay here. PR is up. There are some comments around the best way to test.

#2560

@akesterson akesterson linked a pull request Mar 22, 2024 that will close this issue
9 tasks
akesterson added a commit to akesterson/zenml that referenced this issue Mar 23, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 23, 2024
@akesterson
Copy link

I discovered another materializer that is a special case. The tensorflow dataset materializer intentionally does not clean up after itself in the load() stage, because the dataset is lazily loaded from the generated temporary directory. A decision should be made as to whether or not these files are actually temporary, or if there needs to be something else done in the artifact storage for this pattern.

@strickvl strickvl linked a pull request Mar 23, 2024 that will close this issue
9 tasks
akesterson added a commit to akesterson/zenml that referenced this issue Mar 23, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 23, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 24, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 24, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 24, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 25, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 26, 2024
akesterson added a commit to akesterson/zenml that referenced this issue Mar 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants