
How can I fine tune Llama 3 on my own data? #970

Open
agutell opened this issue May 13, 2024 · 6 comments

@agutell

agutell commented May 13, 2024

Hi,

I have been looking for documentation on how to add my own data set into your config, but without success. If it's possible, could anyone provide a small guide on what needs to be done in order to:

  1. Fine-tune on CSV data, or
  2. Fine-tune on an arbitrary Hugging Face dataset.

Much appreciated!

@joecummings
Contributor

Hi @agutell, great questions! I think both of these use-cases are covered by this tutorial: https://pytorch.org/torchtune/main/tutorials/chat.html.

If you have follow-up questions though, please let us know!
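
To give a rough flavor: the tutorial's dataset builders hand their source string to Hugging Face load_dataset under the hood, so both cases look roughly like this at the data-loading level (file and repo names below are placeholders):

from datasets import load_dataset  # Hugging Face datasets

# 1. A local CSV file: the generic "csv" loader reads it.
csv_ds = load_dataset("csv", data_files="my_data.csv", split="train")

# 2. An arbitrary dataset from the Hugging Face Hub, referenced by repo id.
hub_ds = load_dataset("some_org/some_dataset", split="train")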

@agutell
Author

agutell commented May 13, 2024

Thank you! I'll check it out. One thing though: that page does not seem to show up when you enter the documentation from https://pytorch.org/torchtune/stable/index.html.

@joecummings
Contributor

Thank you! I'll check it out. One thing though: that page does not seem to show up when you enter the documentation from pytorch.org/torchtune/stable/index.html.

So we have a stable version of the documentation that is tied to our code at the v0.1.1 release - this is the one you see by default if you navigate to the torchtune docs from the web. However, we also update our docs constantly as we contribute more code to the repository. Those docs live under the version dropdown in the corner of the site, under "main". You can also access them directly at pytorch.org/torchtune/main/index.html.

Because we're developing so quickly on torchtune, it's best to check the "main" documentation.

@agutell
Author

agutell commented May 15, 2024

Ok, thanks! That's very helpful :)

About the earlier message regarding the custom data: I have now implemented message_converter and custom_dataset. However, the config demands a "dotted path" to the custom_dataset. I am working in a notebook (Google Colab) and have not found a way to create a dotted path that the function "_get_component_from_path(path: str)" will accept. I keep getting the error:

raise InstantiationError(
    f"Error loading '{path}':\n{repr(exc_import)}"
    + f"\nAre you sure that module '{part0}' is installed?"
)

Is there an easier way to get around this problem, instead of having to define a module and then give the dotted path? It would be preferable if one could just specify the function name "custom_data" in the config when working from a notebook, with "custom_data" simply defined in a cell.
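
To illustrate: anything defined in a Colab cell only lives in the __main__ module, so there is no dotted path for the config to import it from. A rough sketch, using the existing torchtune.datasets.alpaca_dataset builder as the counter-example:

import importlib

# Defined in a notebook cell, the function's module is just "__main__",
# so there is no package path the config could import it from.
def custom_data(tokenizer):
    ...

print(custom_data.__module__)  # -> "__main__"

# A dotted path like "torchtune.datasets.alpaca_dataset" resolves fine,
# because its module part is an importable package.
module = importlib.import_module("torchtune.datasets")
component = getattr(module, "alpaca_dataset")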

Thanks for all your work! :)

@ebsmothers
Contributor

Is there an easier way to get around this problem, instead of having to define a module and then give the dotted path? It would be preferable if one could just specify the function name "custom_data" in the config when working from a notebook, with "custom_data" simply defined in a cell.

Hi @agutell, this is a good question. Making arbitrary functions/modules defined in notebook cells importable by torchtune is a bit tricky, as there is not really a consistent way to reference them in a globally unique fashion. We had discussed using a registry, but ultimately found it was harder to scale than the current approach based on _component_. For this particular case, can you use an editable install of torchtune in Colab and add your own dataset there?

E.g. for the install, run

!git clone https://github.com/pytorch/torchtune.git
!pip install -e torchtune

Then from the file tree you can navigate to torchtune/datasets and create a new file, e.g. my_custom_dataset.py, containing the function custom_data. You should then be able to override the dataset from the CLI via

!tune run <recipe> --config <config> dataset=torchtune.datasets.my_custom_dataset.custom_data ...

Or you can do the same thing by modifying whatever config file you're using, like this:

dataset:
  _component_: torchtune.datasets.my_custom_dataset.custom_data
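
To make the file itself concrete, my_custom_dataset.py would just hold whatever you already defined in your notebook, moved into an importable module. A rough sketch (the column names are placeholders, and the builder body should mirror whatever you already have, e.g. a ChatDataset built with your converter as in the chat tutorial):

# torchtune/datasets/my_custom_dataset.py  (inside the editable install)
from typing import Any, List, Mapping

from torchtune.data import Message


def message_converter(sample: Mapping[str, Any]) -> List[Message]:
    # Map one raw sample to torchtune Messages; "input"/"output" are
    # placeholder column names -- use whatever your data actually has.
    return [
        Message(role="user", content=sample["input"]),
        Message(role="assistant", content=sample["output"]),
    ]


def custom_data(tokenizer, **kwargs):
    # Your existing builder, moved out of the notebook cell so that
    # "torchtune.datasets.my_custom_dataset.custom_data" is an importable path.
    ...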

Also tagging @RdoubleA for any thoughts on this.

@RdoubleA
Contributor

Thanks for raising this @agutell, this is an excellent case we hadn't fully considered when designing the configs. It would be ideal to support both in-code custom datasets and ad-hoc notebook code that isn't directly importable.

What @ebsmothers suggested is a good approach. Actually, you don't even need to modify the torchtune internals directly (we should prefer user code as an entry point rather than requiring changes to torchtune internals): you can create a separate .py file in the directory of your notebook with your custom dataset and converters. As long as it is importable in the notebook, it should work as a module path in the config.
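
For example, with my_custom_dataset.py saved in the same directory as the notebook, the working directory is already on sys.path in a notebook environment, so a quick check like this tells you whether the module path will resolve (file and function names follow the earlier example and are placeholders):

# my_custom_dataset.py lives next to the notebook, so the working directory
# (already on sys.path in a notebook environment) makes it importable.
import my_custom_dataset

print(my_custom_dataset.custom_data.__module__)  # -> "my_custom_dataset"

If that import works, the config entry should just be _component_: my_custom_dataset.custom_data (or dataset=my_custom_dataset.custom_data on the tune CLI), with no changes to torchtune itself.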

I'll put up an issue for how to make this easier in a notebook environment.
