
Support for safetensors materializers #2539

Open
wants to merge 30 commits into base: develop

Conversation

Dev-Khant

@Dev-Khant Dev-Khant commented Mar 18, 2024

Describe changes

I implemented support for safetensors for model serialization. This addresses #2532.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Contributor

coderabbitai bot commented Mar 18, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@strickvl
Contributor

I think the thing to add now would be a section in that docs page on materializers that explains:

  • why you might want to use safetensors for materialization instead of the default
  • how to use it / set it up so you can use these custom materializers (i.e. give a code example showing a step and how you'd specify to use the safetensors materializer)

@strickvl strickvl marked this pull request as ready for review March 18, 2024 14:57
@Dev-Khant
Author

@strickvl Should I create a separate section for these new materializers, or add them alongside the existing ones?
I'll also prepare a separate section explaining why to use safetensors, with a code example.

@strickvl
Contributor

strickvl commented Mar 18, 2024 via email

@strickvl
Contributor

@Dev-Khant
Author

Dev-Khant commented Mar 19, 2024

Hi @strickvl, I have added the documentation, but I'm not sure if the code is correct for using materializers, as I could not find any docs on how to use integration-specific materializers. I have also made fixes for the failing tests.

So could you please check and let me know if that's correct or how it can be improved? Thanks.

@strickvl
Contributor

@Dev-Khant one thing you'll have to do is to make sure that your PR is made off the develop branch. See https://github.com/zenml-io/zenml/blob/develop/CONTRIBUTING.md#-pull-requests-rebase-your-branch-on-develop for more. At the moment this PR is listed as being based on the main branch (see at the top).

@Dev-Khant Dev-Khant changed the base branch from main to develop March 20, 2024 09:32
@Dev-Khant
Author

I'm fixing the failing test cases!

@strickvl strickvl added enhancement New feature or request good first issue Good for newcomers labels Mar 20, 2024
@strickvl strickvl changed the title Support for Safetensors Support for safetensors materializers Mar 20, 2024
@Dev-Khant
Author

@strickvl The issue is caused because here only step_output_type is passed, but safetensors' load_model expects a model as well as an input reference. This issue comes up for both the PyTorch and Hugging Face safetensors materializers.

So how do you think we should handle safetensors materializers in test cases?

@strickvl strickvl requested review from avishniakov and removed request for safoinme March 22, 2024 09:16
@Dev-Khant
Author

Dev-Khant commented Mar 22, 2024

Hi @strickvl @avishniakov @bcdurak, can you please guide me on how to fix this issue? Thanks.

@strickvl The issue is caused because here only step_output_type is passed, but safetensors' load_model expects a model as well as an input reference. This issue comes up for both the PyTorch and Hugging Face safetensors materializers.

So how do you think we should handle safetensors materializers in test cases?

@strickvl
Contributor

@Dev-Khant Not sure I understand the question. The test function takes a model in as you've currently defined it.

btw, this is currently also failing mypy linting (https://github.com/zenml-io/zenml/actions/runs/8389112411/job/22974663097?pr=2539)

@Dev-Khant
Author

@Dev-Khant Not sure I understand the question. The test function takes a model in as you've currently defined it.

btw, this is currently also failing mypy linting (https://github.com/zenml-io/zenml/actions/runs/8389112411/job/22974663097?pr=2539)

@strickvl I have fixed the lint issue and will push it in the next commit.
The issue here is that loaded_data = materializer.load(step_output_type) should also take step_output (which is a PyTorch or HF model in our case) as input.

Because when the safetensors materializer is called, it uses load() from safetensors, which requires a model and a filename.

So here, if we do materializer.load(step_output, step_output_type), then all the tests pass locally.

So what is the best way to handle materializers that also need a model to load, alongside materializers that only need a filename?
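To make the mismatch concrete, here is a minimal sketch. The class name, the artifact path, and the load_into helper are hypothetical illustrations; the base load(data_type) signature matches the stub quoted later in this thread, and safetensors.torch.load_model(model, filename) is the call that needs an existing model instance:

from typing import Any, Type

from safetensors.torch import load_model


class SketchSafetensorsMaterializer:
    """Illustrative only: shows why the two load signatures clash."""

    uri = "/tmp/artifact"  # placeholder for the artifact store location

    # Signature the ZenML orchestrator and test harness call:
    #     loaded_data = materializer.load(step_output_type)
    def load(self, data_type: Type[Any]) -> Any:
        raise NotImplementedError("No model instance is available here.")

    # Signature the current PR effectively needs, because
    # safetensors.torch.load_model(model, filename) fills an existing model:
    def load_into(self, obj: Any, data_type: Type[Any]) -> Any:
        load_model(obj, f"{self.uri}/model.safetensors")
        return obj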

@Dev-Khant
Author

I made fixes for the failing test cases and the lint issues.

@strickvl
Contributor

Linting still is failing, btw. @Dev-Khant

@Dev-Khant
Author

Dev-Khant commented Mar 28, 2024

Hey @Dev-Khant,

Thank you for your contribution :)

It seems like the implementation is off to a good start; however, there are a few things that need some modifications.

Before I talk about these modifications though, it is a good idea to take a look at the concept of materializers. In short, they constitute a mechanism that ZenML uses to manage the inputs and outputs of steps under the hood. Depending on the type, our orchestration logic selects the right materializer for the job and uses it during the execution of a step.

In the current example (from your docs page), the materializer acts as an external factor (dealing with the saving/loading outside of the orchestration) which goes against the main idea.

With that in mind, you can modify the example you wrote to something like this:

import logging

from torch.nn import Module

from zenml import step, pipeline
from zenml.integrations.pytorch.materializers import PyTorchModuleSTMaterializer


@step(enable_cache=False, output_materializers=PyTorchModuleSTMaterializer)
def my_first_step() -> Module:
    """Step that saves a Pytorch model"""
    from torchvision.models import resnet50

    pretrained_model = resnet50()

    return pretrained_model


@step(enable_cache=False)
def my_second_step(model: Module):
    """Step that loads the model."""
    logging.info("Input loaded correctly.")


@pipeline
def first_pipeline():
    model = my_first_step()
    my_second_step(model)


first_pipeline()

This is the recommended way of using a custom materializer in a ZenML pipeline.

However, if you run this example now, the first step will succeed but the second step will fail due to:

TypeError: BasePyTorchSTMaterializer.load() missing 1 required positional argument: 'obj'

This is because the new materializers use the abstraction of the save functionality correctly, but they introduce a load function with a different signature from the base materializers. The orchestration logic cannot handle this, which eventually leads to a failure. This is also the reason why the test you mentioned in your comment was failing.

I would propose modifying them to use the same abstraction as follows because the orchestrators do not know how to handle the input obj:

    def load(self, data_type: Type[Any]) -> Any:
        """Write logic here to load the data of an artifact.

        Args:
            data_type: The type of data that the artifact should be loaded as.

        Returns:
            The data of the artifact.
        """
        # read from a location inside self.uri
        # 
        # Example:
        # data_path = os.path.join(self.uri, "abc.json")
        # return yaml_utils.read_json(data_path)

I understand this is a challenging task because safetensors' load_model function requires you to pass a model to load onto; however, this needs to be solved in a different manner than passing the object to the load function.

I hope this explanation is helpful and feel free to reach out again if you have any additional questions.

I have also added a few more small comments as well.

@bcdurak Thanks for reviewing the code and helping me understand this much better.

I can think of a solution here that stores the model's architecture like this:

import torch
from safetensors.torch import load_file, save_file
from torchvision.models import resnet50

pretrained_model = resnet50()
arch = {"model": pretrained_model}
# Save the weights with safetensors and the architecture via torch.save (a pickle)
save_file(pretrained_model.state_dict(), "weights.safetensors")
torch.save(arch, "model_arch.json")

print("Model Saved!")

new_arch = torch.load("model_arch.json")
weights = load_file("weights.safetensors")
_model = new_arch["model"]

# load_state_dict fills the weights into the module in place
_model.load_state_dict(weights)
loaded_model = _model

print("Model Loaded!")

This way we can save the model's architecture and weights in the materializer's save method and then load both in its load method. Do you think this is a good approach?

@Dev-Khant Dev-Khant requested a review from bcdurak March 30, 2024 07:17
@Dev-Khant
Author

@bcdurak I have made the relevant changes. Please review them. Thanks.

Contributor

@bcdurak bcdurak left a comment

Thank you for your effort.

I really like the direction of the PR. I think you already solved some of the challenges you faced before. I added a few more comments for the required changes and modifications.

Additionally, I have two more questions:

  • I see in their docs that safetensors also features APIs for tensorflow and numpy. Is it possible to add these to the list of materializers which have a safetensors variant?
  • Can we also add a test for the pytorch_lightning materializer?

# Save model architecture
model_arch = {"model": model}
model_filename = os.path.join(self.uri, DEFAULT_MODEL_FILENAME)
torch.save(model_arch, model_filename)
Contributor

I think you are on the right path here; however, there is an issue:

This is a pattern that I see in all of the new materializers. AFAIK, if you do torch.save(...), it does not only save the model architecture but also the weights.

You can see this in play in the example we mentioned above. If you check your artifacts in your local artifact store manually, both entire_model.safetensors and model_architecture.json are present, and each is roughly 100 MB. Basically, it is saving the model twice in two different ways. We need to modify the torch.save and torch.load calls to only handle the architecture without the weights.

Author

@bcdurak I could not find a method to store just the architecture in PyTorch (there doesn't seem to be one). So what would you recommend here?

Contributor

This is a tough question. But in the current case, it is really inefficient.

It feels like we need to go back to the version where you used the save_model and load_model calls. And, we somehow need to figure out how to save the model type in the save method. If I can think of anything, I will share it here.
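One possible direction, sketched here rather than taken from the thread: persist only the model's import path alongside the safetensors weights, so the save method records the model type without pickling the weights a second time. This assumes the model class is importable and constructible without arguments (true for torchvision's resnet50, not for every model); save_model and load_model are the real safetensors.torch calls, everything else is illustrative:

import importlib

import torch
from safetensors.torch import load_model, save_model


def save_architecture_and_weights(model: torch.nn.Module, arch_path: str, weights_path: str) -> None:
    # Record only the import path of the model class -- a tiny file, no weights.
    arch = {"module": type(model).__module__, "class": type(model).__qualname__}
    torch.save(arch, arch_path)
    # The weights themselves go into the .safetensors file.
    save_model(model, weights_path)


def load_architecture_and_weights(arch_path: str, weights_path: str) -> torch.nn.Module:
    arch = torch.load(arch_path)
    model_cls = getattr(importlib.import_module(arch["module"]), arch["class"])
    model = model_cls()  # assumes a no-argument constructor
    load_model(model, weights_path)  # fills the weights in place
    return model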

Author

Sure @bcdurak. Let me know when I should switch back to the previous method.

model: The Torch Model to write.
"""
# Save model weights
obj_filename = os.path.join(self.uri, DEFAULT_FILENAME)
Contributor

If you check the regular variant of this materializer, you will see that it uses TemporaryDirectory() instances during the load and save calls. This is the case for many of our materializers, as they load/save from/to a temp directory and copy the contents to the proper destination.

This is mainly due to the remote artifact stores. While our fileio and io_utils can handle remote operations, unfortunately, this does not apply to most of the other libraries. Unless the safetensors.save_file and safetensors.load_file calls are compatible with all of the remote artifact stores that ZenML provides, I would suggest using the same paradigm.
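A minimal sketch of that paradigm for the safetensors case, assuming fileio.copy(src, dst) is the remote-aware copy helper referred to above (the function and file names here are illustrative, not the PR's actual code):

import os
import tempfile

from safetensors.torch import save_model
from zenml.io import fileio  # assumed remote-aware file utilities, as mentioned above


def save_to_artifact_store(model, artifact_uri: str) -> None:
    # safetensors only understands local paths, so write to a temp directory first...
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "model.safetensors")
        save_model(model, local_path)
        # ...then copy the file to the artifact store URI, which may be remote
        # (s3://, gs://, azure://, ...).
        fileio.copy(local_path, os.path.join(artifact_uri, "model.safetensors"))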

Author

I'll first check and see how it can be added.

Author

Here fileio doesn't work with safetensors because fileio does not return a PathLike str.
So I have changed the HF materializer only.

Contributor

This is also quite critical. I have just tested the example in the docs with a simple GCP artifact store and, as I predicted, it failed. If we leave it in this state, these materializers won't work in any non-local case.

Can we implement any conversion to make it work here?

Author

@Dev-Khant Dev-Khant Apr 12, 2024

Alright, if that is the case, I'll find a way to use it. But let's first decide on which method to use for saving and loading the model.

Author

@bcdurak Any update on this?

@bcdurak
Contributor

bcdurak commented Apr 2, 2024

By the way, there is a known issue with our testing suite that we are currently fixing. I will keep you updated as it goes on. For the time being, feel free to ignore the failing tests.

Dev-Khant and others added 3 commits April 3, 2024 08:07
…tom-data-types.md

Co-authored-by: Barış Can Durak <36421093+bcdurak@users.noreply.github.com>
…tom-data-types.md

Co-authored-by: Barış Can Durak <36421093+bcdurak@users.noreply.github.com>
…tom-data-types.md

Co-authored-by: Barış Can Durak <36421093+bcdurak@users.noreply.github.com>
@Dev-Khant
Author

@bcdurak Thanks for this detailed review; I have gone through your comments. Here are my answers:

  1. For tensorflow there is no method to directly save and load models using safetensors. And for numpy I'll add a safetensors materializer (sketched below).
  2. I'll add a test for pytorch_lightning as well.
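For the numpy case, a minimal sketch of what the safetensors-backed save/load could look like (safetensors.numpy.save_file and load_file are the real API; the key name and file name are arbitrary choices for illustration):

import numpy as np
from safetensors.numpy import load_file, save_file

# Save a numpy array with safetensors; save_file takes a dict of named arrays.
arr = np.random.rand(3, 4).astype(np.float32)
save_file({"data": arr}, "array.safetensors")

# Load it back; load_file returns a dict keyed the same way.
loaded = load_file("array.safetensors")["data"]
assert np.array_equal(arr, loaded)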

@Dev-Khant Dev-Khant requested a review from bcdurak April 3, 2024 06:09

socket-security bot commented Apr 8, 2024

New and removed dependencies detected. Learn more about Socket for GitHub ↗︎

Package | New capabilities | Transitives | Size | Publisher
pypi/safetensors@0.4.3 | filesystem, unsafe | 0 | 5.61 MB | McPotato, Nicolas.Patry, Wauplin, ...1 more

View full report↗︎

@Dev-Khant
Author

@bcdurak @avishniakov Can you please guide me on how I can change pyproject.toml so that safetensors is installed properly?

@avishniakov
Contributor

@bcdurak @avishniakov Can you please guide me on how I can change pyproject.toml so that safetensors is installed properly?

Hey @Dev-Khant , IMO, since you modified numpy materializers to rely on safetensors it is not an optional dependency anymore, but the base one, so it should fall under [tool.poetry.dependencies] section directly. According to their pyproject.toml there are no mandatory dependencies, which should be quite good for a base dependency here. @bcdurak WDYT, is it fine to push safetensors to default deps?

@Dev-Khant
Author

@bcdurak @avishniakov Can you please guide me on how I can change pyproject.toml so that safetensors is installed properly?

Hey @Dev-Khant , IMO, since you modified numpy materializers to rely on safetensors it is not an optional dependency anymore, but the base one, so it should fall under [tool.poetry.dependencies] section directly. According to their pyproject.toml there are no mandatory dependencies, which should be quite good for a base dependency here. @bcdurak WDYT, is it fine to push safetensors to default deps?

Understood, thanks @avishniakov. @bcdurak, let me know if I should make it a default dependency.

@bcdurak
Contributor

bcdurak commented Apr 11, 2024

@Dev-Khant let me discuss the dependency issue with the team internally; I will update this thread ASAP.


gitguardian bot commented Apr 18, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
- Username Password e423484 src/zenml/cli/init.py View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@Dev-Khant
Author

@bcdurak Any update on this?

@bcdurak
Contributor

bcdurak commented Apr 30, 2024

Hey @Dev-Khant, I can give you a quick update on the status. I think I can sum it up in three different roadblocks that still remain:

  1. The save and load methods are quite inefficient right now. As I mentioned above, in their current state, the materializers are saving/loading the models twice (for instance, once through the regular torch package and once through safetensors). I think the first version of the implementation was closer to an actual solution, where you used the save_model and load_model functions from safetensors. However, in this case, you would need to figure out how to store the model type when you call the save method, so you can access it during the load method and call the load_model function properly.

  2. In their current state, the materializers do not work with remote artifact stores, because the load_... and save_... calls of safetensors do not inherently work with remote storage systems. In general, ZenML handles this issue for other materializers by using our fileio functions around the save and load methods. You can see an example of it right here. I have seen that you already implemented it for some of the materializers; however, this needs to be applied to all of them.

  3. Lastly, there is the question of how to handle the safetensors dependency. We had a discussion within the team regarding this topic and we thought the best way to go forward here is to implement a safetensors integration instead of adding it to the main package or the respective integrations (like torch or huggingface). However, this is not the main roadblock right now, and before trying this I would recommend fixing the materializers themselves.

@Dev-Khant
Author

Alright @bcdurak. For the first point, I'll switch back to the previous method and see how to store the model type.
And for the second point, I'll try different approaches to see which one works for storing the file in the correct location.

Labels
enhancement New feature or request good first issue Good for newcomers run-slow-ci
4 participants