
Support for safetensors materializers #2539

Open
wants to merge 30 commits into base: develop

Conversation

Dev-Khant

@Dev-Khant Dev-Khant commented Mar 18, 2024

Describe changes

I implemented support for safetensors for model serialization. This addresses #2532.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Contributor

coderabbitai bot commented Mar 18, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@strickvl
Contributor

I think the thing to add now would be a section in that docs page on materializers that explains:

  • why you might want to use safetensors for materialization instead of the default
  • how to use it / set it up so you can use these custom materializers (i.e. give a code example showing a step and how you'd specify to use the safetensors materializer)

@strickvl strickvl marked this pull request as ready for review March 18, 2024 14:57
@Dev-Khant
Author

@strickvl Should I create a separate section for these new materializers, or add them alongside the existing ones?
I'll also prepare a separate section explaining why to use safetensors, with a code example.

@strickvl
Contributor

strickvl commented Mar 18, 2024 via email

@strickvl
Contributor

@Dev-Khant
Author

Dev-Khant commented Mar 19, 2024

Hi @strickvl, I have added the documentation, but I'm not sure if the code is correct for using materializers, as I could not find any docs on how to use integration-specific materializers. I have also made fixes for the failing tests.

So could you please check and let me know if that's correct or how it can be improved? Thanks.

@strickvl
Contributor

@Dev-Khant one thing you'll have to do is to make sure that your PR is made off the develop branch. See https://github.com/zenml-io/zenml/blob/develop/CONTRIBUTING.md#-pull-requests-rebase-your-branch-on-develop for more. At the moment this PR is listed as being based on the main branch (see at the top).

@Dev-Khant Dev-Khant changed the base branch from main to develop March 20, 2024 09:32
@Dev-Khant
Author

I'm fixing the failing test cases!

@strickvl strickvl added enhancement New feature or request good first issue Good for newcomers labels Mar 20, 2024
@strickvl strickvl changed the title Support for Safetensors Support for safetensors materializers Mar 20, 2024
@Dev-Khant
Author

@strickvl The issue is caused because here only step_output_type is passed, but safetensors' load_model expects a model as well as an input reference. This issue comes up for both the PyTorch and Hugging Face safetensors materializers.

So how do you think we should handle safetensors materializers in test cases?

@strickvl strickvl requested review from avishniakov and removed request for safoinme March 22, 2024 09:16
@Dev-Khant
Author

Dev-Khant commented Mar 22, 2024

Hi @strickvl @avishniakov @bcdurak, can you please guide me on how to fix this issue? Thanks.

@strickvl The issue is caused because here only step_output_type is passed, but safetensors' load_model expects a model as well as an input reference. This issue comes up for both the PyTorch and Hugging Face safetensors materializers.

So how do you think we should handle safetensors materializers in test cases?

@strickvl
Contributor

@Dev-Khant Not sure I understand the question. The test function takes a model in as you've currently defined it.

btw, this is currently also failing mypy linting (https://github.com/zenml-io/zenml/actions/runs/8389112411/job/22974663097?pr=2539)

@Dev-Khant
Author

@Dev-Khant Not sure I understand the question. The test function takes a model in as you've currently defined it.

btw, this is currently also failing mypy linting (https://github.com/zenml-io/zenml/actions/runs/8389112411/job/22974663097?pr=2539)

@strickvl I have fixed the lint issue and will push it in the next commit.
The issue here is that loaded_data = materializer.load(step_output_type) should also take step_output (which is a PyTorch or HF model in our case) as input.

Because when the safetensors materializer is called, it uses load() from safetensors, which requires a model and a filename.

So here, if we do materializer.load(step_output, step_output_type), then all the tests pass locally.

So what is the best way to handle materializers that also need a model to load, alongside materializers that only need a filename?
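To make the mismatch concrete, here is a minimal sketch. The class name, the artifact path, and the load_into helper are hypothetical illustrations; the base load(data_type) signature matches the stub quoted later in this thread, and safetensors.torch.load_model(model, filename) is the call that needs an existing model instance:

from typing import Any, Type

from safetensors.torch import load_model


class SketchSafetensorsMaterializer:
    """Illustrative only: shows why the two load signatures clash."""

    uri = "/tmp/artifact"  # placeholder for the artifact store location

    # Signature the ZenML orchestrator and test harness call:
    #     loaded_data = materializer.load(step_output_type)
    def load(self, data_type: Type[Any]) -> Any:
        raise NotImplementedError("No model instance is available here.")

    # Signature the current PR effectively needs, because
    # safetensors.torch.load_model(model, filename) fills an existing model:
    def load_into(self, obj: Any, data_type: Type[Any]) -> Any:
        load_model(obj, f"{self.uri}/model.safetensors")
        return obj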

@Dev-Khant
Author

I made fixes for the failing test cases and the lint issues.

@strickvl
Contributor

Linting still is failing, btw. @Dev-Khant

@Dev-Khant
Author

Dev-Khant commented Mar 28, 2024

Hey @Dev-Khant,

Thank you for your contribution :)

It seems like the implementation is off to a good start; however, there are a few things that need some modifications.

Before I talk about these modifications though, it is a good idea to take a look at the concept of materializers. In short, they constitute a mechanism that ZenML uses to manage the inputs and outputs of steps under the hood. Depending on the type, our orchestration logic selects the right materializer for the job and uses it during the execution of a step.

In the current example (from your docs page), the materializer acts as an external factor (dealing with the saving/loading outside of the orchestration) which goes against the main idea.

With that in mind, you can modify the example you wrote to something like this:

import logging

from torch.nn import Module

from zenml import step, pipeline
from zenml.integrations.pytorch.materializers import PyTorchModuleSTMaterializer


@step(enable_cache=False, output_materializers=PyTorchModuleSTMaterializer)
def my_first_step() -> Module:
    """Step that saves a Pytorch model"""
    from torchvision.models import resnet50

    pretrained_model = resnet50()

    return pretrained_model


@step(enable_cache=False)
def my_second_step(model: Module):
    """Step that loads the model."""
    logging.info("Input loaded correctly.")


@pipeline
def first_pipeline():
    model = my_first_step()
    my_second_step(model)


first_pipeline()

This is the recommended way of using a custom materializer in a ZenML pipeline.

However, if you run this example now, the first step will succeed but the second step will fail due to:

TypeError: BasePyTorchSTMaterializer.load() missing 1 required positional argument: 'obj'

This is because the new materializers use the abstraction of the save functionality correctly, but they introduce a load function with a different signature from the base materializers. The orchestration logic cannot handle this, which eventually leads to a failure. This is also the reason why the test you mentioned in your comment was failing.

I would propose modifying them to use the same abstraction as follows because the orchestrators do not know how to handle the input obj:

    def load(self, data_type: Type[Any]) -> Any:
        """Write logic here to load the data of an artifact.

        Args:
            data_type: The type of data that the artifact should be loaded as.

        Returns:
            The data of the artifact.
        """
        # read from a location inside self.uri
        # 
        # Example:
        # data_path = os.path.join(self.uri, "abc.json")
        # return yaml_utils.read_json(data_path)

I understand this is a challenging task because safetensors' load_model function requires you to pass a model to load onto; however, this needs to be solved in a different manner than passing the object to the load function.

I hope this explanation is helpful and feel free to reach out again if you have any additional questions.

I have also added a few more small comments as well.

@bcdurak Thanks for reviewing the code and helping me understand this much better.

I can think of a solution here that stores the model's architecture like this:

import torch
from safetensors.torch import load_file, save_file
from torchvision.models import resnet50

pretrained_model = resnet50()
arch = {"model": pretrained_model}
# Save the weights with safetensors and the architecture via torch.save (a pickle)
save_file(pretrained_model.state_dict(), "weights.safetensors")
torch.save(arch, "model_arch.json")

print("Model Saved!")

new_arch = torch.load("model_arch.json")
weights = load_file("weights.safetensors")
_model = new_arch["model"]

# load_state_dict fills the weights into the module in place
_model.load_state_dict(weights)
loaded_model = _model

print("Model Loaded!")

This way we can save the model's architecture and weights in the materializer's save method and then load both in its load method. Do you think this is a good approach?

@Dev-Khant Dev-Khant requested a review from bcdurak March 30, 2024 07:17
@Dev-Khant
Author

@bcdurak I have made the relevant changes. Please review them. Thanks.

Contributor

@bcdurak bcdurak left a comment

Thank you for your effort.

I really like the direction of the PR. I think you already solved some of the challenges you faced before. I added a few more comments for the required changes and modifications.

Additionally, I have two more questions:

  • I see in their docs that safetensors also features APIs for tensorflow and numpy. Is it possible to add these to the list of materializers which have a safetensors variant?
  • Can we also add a test for the pytorch_lightning materializer?

# Save model architecture
model_arch = {"model": model}
model_filename = os.path.join(self.uri, DEFAULT_MODEL_FILENAME)
torch.save(model_arch, model_filename)
Contributor

I think you are on the right path here; however, there is an issue:

This is a pattern that I see in all of the new materializers. AFAIK, if you do torch.save(...), it does not only save the model architecture but also the weights.

You can see this in play in the example we mentioned above. If you check your artifacts in your local artifact store manually, both entire_model.safetensors and model_architecture.json are present, and each is roughly 100 MB. Basically, it is saving the model twice in two different ways. We need to modify the torch.save and torch.load calls to only handle the architecture without the weights.

Author

@bcdurak I could not find a method to store just the architecture in PyTorch (there doesn't seem to be one). So what would you recommend here?

Contributor

This is a tough question. But in the current case, it is really inefficient.

It feels like we need to go back to the version where you used the save_model and load_model calls. And, we somehow need to figure out how to save the model type in the save method. If I can think of anything, I will share it here.
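One possible direction, sketched here rather than taken from the thread: persist only the model's import path alongside the safetensors weights, so the save method records the model type without pickling the weights a second time. This assumes the model class is importable and constructible without arguments (true for torchvision's resnet50, not for every model); save_model and load_model are the real safetensors.torch calls, everything else is illustrative:

import importlib

import torch
from safetensors.torch import load_model, save_model


def save_architecture_and_weights(model: torch.nn.Module, arch_path: str, weights_path: str) -> None:
    # Record only the import path of the model class -- a tiny file, no weights.
    arch = {"module": type(model).__module__, "class": type(model).__qualname__}
    torch.save(arch, arch_path)
    # The weights themselves go into the .safetensors file.
    save_model(model, weights_path)


def load_architecture_and_weights(arch_path: str, weights_path: str) -> torch.nn.Module:
    arch = torch.load(arch_path)
    model_cls = getattr(importlib.import_module(arch["module"]), arch["class"])
    model = model_cls()  # assumes a no-argument constructor
    load_model(model, weights_path)  # fills the weights in place
    return model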

Author

Sure @bcdurak. Let me know when I should switch back to the previous method.

model: The Torch Model to write.
"""
# Save model weights
obj_filename = os.path.join(self.uri, DEFAULT_FILENAME)
Contributor

If you check the regular variant of this materializer, you will see that it uses TemporaryDirectory() instances during the load and save calls. This is the case for many of our materializers, as they load/save from/to a temp directory and copy the contents to the proper destination.

This is mainly due to the remote artifact stores. While our fileio and io_utils can handle remote operations, unfortunately, this does not apply to most of the other libraries. Unless the safetensors.save_file and safetensors.load_file calls are compatible with all of the remote artifact stores that ZenML provides, I would suggest using the same paradigm.
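A minimal sketch of that paradigm for the safetensors case, assuming fileio.copy(src, dst) is the remote-aware copy helper referred to above (the function and file names here are illustrative, not the PR's actual code):

import os
import tempfile

from safetensors.torch import save_model
from zenml.io import fileio  # assumed remote-aware file utilities, as mentioned above


def save_to_artifact_store(model, artifact_uri: str) -> None:
    # safetensors only understands local paths, so write to a temp directory first...
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "model.safetensors")
        save_model(model, local_path)
        # ...then copy the file to the artifact store URI, which may be remote
        # (s3://, gs://, azure://, ...).
        fileio.copy(local_path, os.path.join(artifact_uri, "model.safetensors"))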

Author

I'll first check and see how it can be added.

Author

Here fileio doesn't work with safetensors because fileio does not return a PathLike str.
So I have changed the HF materializer only.

Contributor

This is also quite critical. I have just tested the example in the docs with a simple GCP artifact store and, as I predicted, it failed. If we leave it in this state, these materializers won't work in any non-local case.

Can we implement any conversion to make it work here?

Author

@Dev-Khant Dev-Khant Apr 12, 2024

Alright, if that is the case, I'll find a way to use it. But let's first decide on which method to use for saving and loading the model.

Author

@bcdurak Any update on this?

@bcdurak
Contributor

bcdurak commented Apr 2, 2024

By the way, there is a known issue with our testing suite that we are currently fixing. I will keep you updated as it goes on. For the time being, feel free to ignore the failing tests.

Dev-Khant and others added 3 commits April 3, 2024 08:07
…tom-data-types.md

Co-authored-by: Barış Can Durak <36421093+bcdurak@users.noreply.github.com>
…tom-data-types.md

Co-authored-by: Barış Can Durak <36421093+bcdurak@users.noreply.github.com>
…tom-data-types.md

Co-authored-by: Barış Can Durak <36421093+bcdurak@users.noreply.github.com>
@Dev-Khant
Author

@bcdurak Thanks for this detailed review; I have gone through your comments. Here are my answers:

  1. For tensorflow there is no method to directly save and load models using safetensors. And for numpy I'll add a safetensors materializer (sketched below).
  2. I'll add a test for pytorch_lightning as well.
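For the numpy case, a minimal sketch of what the safetensors-backed save/load could look like (safetensors.numpy.save_file and load_file are the real API; the key name and file name are arbitrary choices for illustration):

import numpy as np
from safetensors.numpy import load_file, save_file

# Save a numpy array with safetensors; save_file takes a dict of named arrays.
arr = np.random.rand(3, 4).astype(np.float32)
save_file({"data": arr}, "array.safetensors")

# Load it back; load_file returns a dict keyed the same way.
loaded = load_file("array.safetensors")["data"]
assert np.array_equal(arr, loaded)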

@Dev-Khant Dev-Khant requested a review from bcdurak April 3, 2024 06:09

socket-security bot commented Apr 8, 2024

New and removed dependencies detected. Learn more about Socket for GitHub ↗︎

Package | New capabilities | Transitives | Size | Publisher
pypi/safetensors@0.4.3 | filesystem, unsafe | 0 | 5.61 MB | McPotato, Nicolas.Patry, Wauplin, ...1 more

View full report↗︎

@Dev-Khant
Author

@bcdurak @avishniakov Can you please guide me on how I can change pyproject.toml so that safetensors is installed properly?

@avishniakov
Contributor

@bcdurak @avishniakov Can you please guide me on how I can change pyproject.toml so that safetensors is installed properly?

Hey @Dev-Khant , IMO, since you modified numpy materializers to rely on safetensors it is not an optional dependency anymore, but the base one, so it should fall under [tool.poetry.dependencies] section directly. According to their pyproject.toml there are no mandatory dependencies, which should be quite good for a base dependency here. @bcdurak WDYT, is it fine to push safetensors to default deps?

@Dev-Khant
Author

@bcdurak @avishniakov Can you please guide me on how I can change pyproject.toml so that safetensors is installed properly?

Hey @Dev-Khant , IMO, since you modified numpy materializers to rely on safetensors it is not an optional dependency anymore, but the base one, so it should fall under [tool.poetry.dependencies] section directly. According to their pyproject.toml there are no mandatory dependencies, which should be quite good for a base dependency here. @bcdurak WDYT, is it fine to push safetensors to default deps?

Understood, thanks @avishniakov. @bcdurak, let me know if I should make it a default dependency.

@bcdurak
Contributor

bcdurak commented Apr 11, 2024

@Dev-Khant let me discuss the dependency issue with the team internally; I will update this thread ASAP.


gitguardian bot commented Apr 18, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
- Username Password e423484 src/zenml/cli/init.py View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@Dev-Khant
Author

@bcdurak Any update on this?

@bcdurak
Contributor

bcdurak commented Apr 30, 2024

Hey @Dev-Khant, I can give you a quick update on the status. I think I can sum it up in three different roadblocks that still remain:

  1. The save and load methods are quite inefficient right now. As I mentioned above, in their current state, the materializers are saving/loading the models twice (for instance, once through the regular torch package and once through safetensors). I think the first version of the implementation was closer to an actual solution, where you used the save_model and load_model functions from safetensors. However, in this case, you would need to figure out how to store the model type when you call the save method, so you can access it during the load method and call the load_model function properly.

  2. In their current state, the materializers do not work with remote artifact stores, because the load_... and save_... calls of safetensors do not inherently work with remote storage systems. In general, ZenML handles this issue for other materializers by using our fileio functions around the save and load methods. You can see an example of it right here. I have seen that you already implemented it for some of the materializers; however, this needs to be applied to all of them.

  3. Lastly, there is the question of how to handle the safetensors dependency. We had a discussion within the team regarding this topic and we thought the best way to go forward here is to implement a safetensors integration instead of adding it to the main package or the respective integrations (like torch or huggingface). However, this is not the main roadblock right now, and before trying this I would recommend fixing the materializers themselves.

@Dev-Khant
Author

Alright @bcdurak. For the first point, I'll switch back to the previous method and see how to store the model type.
And for the second point, I'll try different approaches to see which one works for storing the file in the correct location.

Labels
enhancement New feature or request good first issue Good for newcomers run-slow-ci
4 participants