
How can I fine tune Llama 3 on my own data? #970

Open
agutell opened this issue May 13, 2024 · 6 comments

@agutell

agutell commented May 13, 2024

Hi,

I have been looking for documentation on how to add my own data set into your config, but without success. If it's possible, could anyone provide a small guide on what needs to be done in order to:

  1. Fine-tune on CSV data, or
  2. Fine-tune on an arbitrary Hugging Face dataset.

Much appreciated!

@joecummings
Contributor

Hi @agutell, great questions! I think both of these use-cases are covered by this tutorial: https://pytorch.org/torchtune/main/tutorials/chat.html.

If you have follow-up questions though, please let us know!
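
To give a rough flavor: the tutorial's dataset builders hand their source string to Hugging Face load_dataset under the hood, so both cases look roughly like this at the data-loading level (file and repo names below are placeholders):

from datasets import load_dataset  # Hugging Face datasets

# 1. A local CSV file: the generic "csv" loader reads it.
csv_ds = load_dataset("csv", data_files="my_data.csv", split="train")

# 2. An arbitrary dataset from the Hugging Face Hub, referenced by repo id.
hub_ds = load_dataset("some_org/some_dataset", split="train")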

@agutell
Author

agutell commented May 13, 2024

Thank you! I'll check it out. One thing though: that page does not seem to show up when you enter the documentation from https://pytorch.org/torchtune/stable/index.html.

@joecummings
Contributor

Thank you! I'll check it out. One thing though: that page does not seem to show up when you enter the documentation from pytorch.org/torchtune/stable/index.html.

So we have a stable version of the documentation that is tied to our code at the v0.1.1 release - this is the one you see by default if you navigate to the torchtune docs from the web. However, we also update our docs constantly as we contribute more code to the repository. Those docs live under the version dropdown in the corner of the site, under "main". You can also access them directly at pytorch.org/torchtune/main/index.html.

Because we're developing so quickly on torchtune, it's best to check the "main" documentation.

@agutell
Author

agutell commented May 15, 2024

Ok, thanks! That's very helpful :)

About the earlier message regarding the custom data: I have now implemented message_converter and custom_dataset. However, the config demands a "dotted path" to the custom_dataset. I am working in a notebook (Google Colab) and have not found a way to create a dotted path that the function "_get_component_from_path(path: str)" will accept. I keep getting the error:

raise InstantiationError(
    f"Error loading '{path}':\n{repr(exc_import)}"
    + f"\nAre you sure that module '{part0}' is installed?"
)

Is there an easier way to get around this problem, instead of having to define a module and then give the dotted path? It would be preferable if one could just specify the function name "custom_data" in the config when working from a notebook, with "custom_data" simply defined in a cell.
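
To illustrate: anything defined in a Colab cell only lives in the __main__ module, so there is no dotted path for the config to import it from. A rough sketch, using the existing torchtune.datasets.alpaca_dataset builder as the counter-example:

import importlib

# Defined in a notebook cell, the function's module is just "__main__",
# so there is no package path the config could import it from.
def custom_data(tokenizer):
    ...

print(custom_data.__module__)  # -> "__main__"

# A dotted path like "torchtune.datasets.alpaca_dataset" resolves fine,
# because its module part is an importable package.
module = importlib.import_module("torchtune.datasets")
component = getattr(module, "alpaca_dataset")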

Thanks for all your work! :)

@ebsmothers
Contributor

Is there an easier way to get around this problem, instead of having to define a module and then give the dotted path? It would be preferable if one could just specify the function name "custom_data" in the config when working from a notebook, with "custom_data" simply defined in a cell.

Hi @agutell, this is a good question. Making arbitrary functions/modules defined in notebook cells importable by torchtune is a bit tricky, as there is not really a consistent way to reference them in a globally unique fashion. We had discussed using a registry, but ultimately found it was harder to scale than the current approach based on _component_. For this particular case, can you use an editable install of torchtune in Colab and add your own dataset there?

E.g. for the install, run

!git clone https://github.com/pytorch/torchtune.git
!pip install -e torchtune

Then from the file tree you can navigate to torchtune/datasets and create a new file, e.g. my_custom_dataset.py, containing the function custom_data. You should then be able to override the dataset from the CLI via

!tune run <recipe> --config <config> dataset=torchtune.datasets.my_custom_dataset.custom_data ...

Or you can do the same thing by modifying whatever config file you're using, like this:

dataset:
  _component_: torchtune.datasets.my_custom_dataset.custom_data
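
To make the file itself concrete, my_custom_dataset.py would just hold whatever you already defined in your notebook, moved into an importable module. A rough sketch (the column names are placeholders, and the builder body should mirror whatever you already have, e.g. a ChatDataset built with your converter as in the chat tutorial):

# torchtune/datasets/my_custom_dataset.py  (inside the editable install)
from typing import Any, List, Mapping

from torchtune.data import Message


def message_converter(sample: Mapping[str, Any]) -> List[Message]:
    # Map one raw sample to torchtune Messages; "input"/"output" are
    # placeholder column names -- use whatever your data actually has.
    return [
        Message(role="user", content=sample["input"]),
        Message(role="assistant", content=sample["output"]),
    ]


def custom_data(tokenizer, **kwargs):
    # Your existing builder, moved out of the notebook cell so that
    # "torchtune.datasets.my_custom_dataset.custom_data" is an importable path.
    ...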

Also tagging @RdoubleA for any thoughts on this.

@RdoubleA
Contributor

Thanks for raising this @agutell, this is an excellent case we hadn't fully considered when designing the configs. It would be ideal to support both in-code custom datasets and ad-hoc notebook code that isn't directly importable.

What @ebsmothers suggested is a good approach. Actually, you don't even need to modify the torchtune internals directly (we should prefer user code as an entry point rather than requiring changes to torchtune internals): you can create a separate .py file in the directory of your notebook with your custom dataset and converters. As long as it is importable in the notebook, it should work as a module path in the config.
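
For example, with my_custom_dataset.py saved in the same directory as the notebook, the working directory is already on sys.path in a notebook environment, so a quick check like this tells you whether the module path will resolve (file and function names follow the earlier example and are placeholders):

# my_custom_dataset.py lives next to the notebook, so the working directory
# (already on sys.path in a notebook environment) makes it importable.
import my_custom_dataset

print(my_custom_dataset.custom_data.__module__)  # -> "my_custom_dataset"

If that import works, the config entry should just be _component_: my_custom_dataset.custom_data (or dataset=my_custom_dataset.custom_data on the tune CLI), with no changes to torchtune itself.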

I'll put up an issue for how to make this easier in a notebook environment.
