
Training: Add Fine-Tune API Docs #3718

Merged: 7 commits, May 20, 2024

Conversation

@andreyvelich (Member):

Related: kubeflow/training-operator#2013
This is a draft PR for our new Fine-Tune API in the Kubeflow Training Operator.
We will work on the page structure in this Google doc to finalise it: https://docs.google.com/document/d/18PuuaDRISj5mlrBn1GJrxwuB6Z5zTtXKpVbLUIeLx-8/edit?usp=sharing.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich marked this pull request as ready for review April 26, 2024 20:06
@andreyvelich (Member Author):

I added content from the Google doc and one tutorial.
Please let me know what you think.
/assign @StefanoFioravanzo @kubeflow/wg-training-leads @deepanker13 @kuizhiqing

@andreyvelich (Member Author):

/hold for review


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hbelmiro (Contributor) left a comment:

/lgtm

@@ -0,0 +1,172 @@
+++
title = "How to Fine-Tune LLM with Kubeflow"
Member:

Suggested change
title = "How to Fine-Tune LLM with Kubeflow"
title = "How to Fine-Tune LLMs with Kubeflow"


[Training Operator Python SDK](/docs/components/training/installation/#installing-training-python-sdk)
implements a [`train` Python API](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112)
that simplify ability to fine-tune LLMs with distributed PyTorchJob workers.
Member:

Suggested change
that simplify ability to fine-tune LLMs with distributed PyTorchJob workers.
that simplifies the ability to fine-tune LLMs with distributed PyTorchJob workers.

implements a [`train` Python API](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112)
that simplify ability to fine-tune LLMs with distributed PyTorchJob workers.

You need to provide the following parameters to use `train` API:
Member:

Suggested change
You need to provide the following parameters to use `train` API:
You need to provide the following parameters to use the `train` API:

)
```

After you execute `train` API, Training Operator will orchestrate appropriate PyTorchJob resources
Member:

Suggested change
After you execute `train` API, Training Operator will orchestrate appropriate PyTorchJob resources
After you execute `train`, Training Operator will orchestrate appropriate PyTorchJob resources

For example, you can use `train` API as follows to fine-tune BERT model using Yelp Review dataset
from HuggingFace Hub:

```python
Member:

If I copy paste this snippet into a notebook, does it run seamlessly? What are the required dependencies? Do we need to provide a pip install command to make sure that this snippet runs? Also, what is the expected output?

Member Author:

Let me add the prerequisites to run this API.
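To give readers of this thread a sense of what the snippet under review triggers, here is a minimal, hypothetical sketch of the kind of PyTorchJob fan-out the `train` API performs. The function name, field layout, and master/worker split below are illustrative assumptions made for this discussion, not the Training Operator's actual implementation (the real SDK also wires up init containers for model and dataset download):

```python
# Hypothetical sketch of how `train`-style parameters could map onto a
# simplified PyTorchJob spec. Illustration only; not the real SDK code.

def build_pytorchjob_spec(name: str, num_workers: int, num_procs_per_worker: int) -> dict:
    """Assemble a simplified PyTorchJob-shaped dict from train-style args."""
    # One master replica coordinates the distributed run; the remaining
    # replicas become workers.
    replica_specs = {"Master": {"replicas": 1}}
    if num_workers > 1:
        replica_specs["Worker"] = {"replicas": num_workers - 1}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Number of training processes launched per node.
            "nprocPerNode": str(num_procs_per_worker),
            "pytorchReplicaSpecs": replica_specs,
        },
    }

spec = build_pytorchjob_spec("fine-tune-bert", num_workers=4, num_procs_per_worker=2)
print(spec["spec"]["pytorchReplicaSpecs"]["Worker"]["replicas"])  # prints 3
```

This is only meant to show why the docs should state the prerequisites: the real call depends on the Training Operator SDK and a running cluster, neither of which the snippet alone provides.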

After you execute `train` API, Training Operator will orchestrate appropriate PyTorchJob resources
to fine-tune LLM.

## Architecture
Member:

This should go to a "Reference"

You can implement your own trainer for other ML use-cases such as image classification,
voice recognition, etc.

## User Value for this Feature
Member:

I think we can just fold this under "Why Training Operator Fine-Tune API Matter?" by stripping the title "User Value for this Feature".

image classification, or another ML domain, fine-tuning can drastically improve performance and
applicability of pre-existing models to new datasets and problems.

## Why Training Operator Fine-Tune API Matter ?
Member:

I feel like this is out of place here. The how-to guide provides a step-by-step, sequenced guide on how to achieve a very specific task. A how-to guide generally does not provide Reference or Explanation. It seems to me we are writing some paragraphs that would be more suited to an "Explanation" section. This is the fourth content type proposed by Diataxis - see here https://diataxis.fr/explanation/

I can very well see a page under "Explanation" titled "LLM Fine-Tune APIs in Kubeflow" where we discuss why we need it and how it fits into the ecosystem. Basically what you wrote already, plus a little bit of refactoring. WDYT?

Member Author:

That makes sense, but how will a user map one guide to another?
E.g. how will a user quickly understand which explanation relates to which user guide when looking at the website content?

Member:

That's a very good question. In the how-to guide, we can have something like: "If you want to learn more about how the fine-tune API fits in the Kubeflow ecosystem, head to <...>".

And in the explanation guide, we can say something like: "Head to for a quick start tutorial on using LLM Fine-tune APIs. Head to for a reference architecture on the control plane implementation"

And generally we can have links to how-tos in tutorials and reference guides. So in general, let's try to link related topics together when it makes sense for a user to follow that train of thought

Member Author:

Sure, what do you think about it @StefanoFioravanzo?
7d30f12

Member:

Looks great!

@google-oss-prow google-oss-prow bot removed the lgtm label May 6, 2024
@andreyvelich (Member Author):

I addressed your comments @StefanoFioravanzo.
Regarding this comment:

Also, what is the expected output?

How can we show the expected output? Our LLM trainer doesn't support any output yet: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/trainer/hf_llm_training.py#L178, so we need to work in the future to understand how a user should consume the fine-tuned model,
e.g. by exporting it to S3 or other storage.
cc @johnugeorge @deepanker13

@StefanoFioravanzo (Member):

so we need to work in the future to understand how a user should consume the fine-tuned model.

Issue + KF 1.10 tag? :)

@andreyvelich (Member Author):

@StefanoFioravanzo I believe I addressed all of your comments. Does it look good to you?
/assign @johnugeorge @deepanker13 @tenzen-y

@StefanoFioravanzo (Member):

@andreyvelich yes it does, thank you!

@tenzen-y (Member) left a comment:

Awesome documentation! Thank you!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label May 14, 2024
@deepanker13:

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

@andreyvelich (Member Author):

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

@deepanker13 Did you check these links via Website preview: https://deploy-preview-3718--competent-brattain-de2d6d.netlify.app/ ?

@deepanker13:

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

@deepanker13 Did you check these links via Website preview: https://deploy-preview-3718--competent-brattain-de2d6d.netlify.app/ ?

@andreyvelich it's working with the preview. Thanks for the awesome documentation!
/lgtm

@StefanoFioravanzo (Member):

@andreyvelich shall we merge this one?

@andreyvelich (Member Author):

Sure, let's merge it. Thanks everyone for the review!
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 36544ae into kubeflow:master May 20, 2024
6 checks passed
@andreyvelich andreyvelich deleted the fine-tune-architecture branch May 20, 2024 19:57