
Training: Add Fine-Tune API Docs #3718

Merged: 7 commits, May 20, 2024

Conversation

@andreyvelich (Member):

Related: kubeflow/training-operator#2013
This is a draft PR for our new Fine-Tune API in the Kubeflow Training Operator.
We will work on the page structure in this Google doc to finalise it: https://docs.google.com/document/d/18PuuaDRISj5mlrBn1GJrxwuB6Z5zTtXKpVbLUIeLx-8/edit?usp=sharing.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich marked this pull request as ready for review April 26, 2024 20:06
@andreyvelich (Member Author):

I added content from the Google doc and one tutorial.
Please let me know what you think.
/assign @StefanoFioravanzo @kubeflow/wg-training-leads @deepanker13 @kuizhiqing

@andreyvelich (Member Author):

/hold for review


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hbelmiro (Contributor) left a comment:

/lgtm

@@ -0,0 +1,172 @@
+++
title = "How to Fine-Tune LLM with Kubeflow"
Member:

Suggested change
title = "How to Fine-Tune LLM with Kubeflow"
title = "How to Fine-Tune LLMs with Kubeflow"


[Training Operator Python SDK](/docs/components/training/installation/#installing-training-python-sdk)
implements a [`train` Python API](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112)
that simplify ability to fine-tune LLMs with distributed PyTorchJob workers.
Member:

Suggested change
that simplify ability to fine-tune LLMs with distributed PyTorchJob workers.
that simplifies the ability to fine-tune LLMs with distributed PyTorchJob workers.

implements a [`train` Python API](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112)
that simplify ability to fine-tune LLMs with distributed PyTorchJob workers.

You need to provide the following parameters to use `train` API:
Member:

Suggested change
You need to provide the following parameters to use `train` API:
You need to provide the following parameters to use the `train` API:

)
```

After you execute `train` API, Training Operator will orchestrate appropriate PyTorchJob resources
Member:

Suggested change
After you execute `train` API, Training Operator will orchestrate appropriate PyTorchJob resources
After you execute `train`, Training Operator will orchestrate appropriate PyTorchJob resources

For example, you can use `train` API as follows to fine-tune BERT model using Yelp Review dataset
from HuggingFace Hub:

```python
Member:

If I copy paste this snippet into a notebook, does it run seamlessly? What are the required dependencies? Do we need to provide a pip install command to make sure that this snippet runs? Also, what is the expected output?

Member Author:

Let me add the prerequisites to run this API.
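To give readers of this thread a sense of what the snippet under review triggers, here is a minimal, hypothetical sketch of the kind of PyTorchJob fan-out the `train` API performs. The function name, field layout, and master/worker split below are illustrative assumptions made for this discussion, not the Training Operator's actual implementation (the real SDK also wires up init containers for model and dataset download):

```python
# Hypothetical sketch of how `train`-style parameters could map onto a
# simplified PyTorchJob spec. Illustration only; not the real SDK code.

def build_pytorchjob_spec(name: str, num_workers: int, num_procs_per_worker: int) -> dict:
    """Assemble a simplified PyTorchJob-shaped dict from train-style args."""
    # One master replica coordinates the distributed run; the remaining
    # replicas become workers.
    replica_specs = {"Master": {"replicas": 1}}
    if num_workers > 1:
        replica_specs["Worker"] = {"replicas": num_workers - 1}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Number of training processes launched per node.
            "nprocPerNode": str(num_procs_per_worker),
            "pytorchReplicaSpecs": replica_specs,
        },
    }

spec = build_pytorchjob_spec("fine-tune-bert", num_workers=4, num_procs_per_worker=2)
print(spec["spec"]["pytorchReplicaSpecs"]["Worker"]["replicas"])  # prints 3
```

This is only meant to show why the docs should state the prerequisites: the real call depends on the Training Operator SDK and a running cluster, neither of which the snippet alone provides.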

After you execute `train` API, Training Operator will orchestrate appropriate PyTorchJob resources
to fine-tune LLM.

## Architecture
Member:

This should go to a "Reference"

You can implement your own trainer for other ML use-cases such as image classification,
voice recognition, etc.

## User Value for this Feature
Member:

I think we can just fold this under "Why Training Operator Fine-Tune API Matter?" by stripping the title "User Value for this Feature".

image classification, or another ML domain, fine-tuning can drastically improve performance and
applicability of pre-existing models to new datasets and problems.

## Why Training Operator Fine-Tune API Matter ?
Member:

I feel like this is out of place here. The how-to guide provides a step-by-step, sequenced guide on how to achieve a very specific task. A how-to guide generally does not provide Reference or Explanation. It seems to me we are writing some paragraphs that would be more suited to an "Explanation" section. This is the fourth content type proposed by Diataxis - see here https://diataxis.fr/explanation/

I can very well see a page under "Explanation" titled "LLM Fine-Tune APIs in Kubeflow" where we discuss why we need it and how it fits into the ecosystem. Basically what you wrote already, plus a little bit of refactoring. WDYT?

Member Author:

That makes sense, but how will a user map one guide to another?
E.g. how will a user quickly understand which explanation relates to which user guide when looking at the website content?

Member:

That's a very good question. In the how-to guide, we can have something like: "If you want to learn more about how the fine-tune API fits in the Kubeflow ecosystem, head to <...>".

And in the explanation guide, we can say something like: "Head to for a quick start tutorial on using LLM Fine-tune APIs. Head to for a reference architecture on the control plane implementation"

And generally we can have links to how-tos in tutorials and reference guides. So in general, let's try to link related topics together when it makes sense for a user to follow that train of thought

Member Author:

Sure, what do you think about it @StefanoFioravanzo?
7d30f12

Member:

Looks great!

@google-oss-prow google-oss-prow bot removed the lgtm label May 6, 2024
@andreyvelich (Member Author):

I addressed your comments @StefanoFioravanzo.
Regarding this comment:

Also, what is the expected output?

How can we show the expected output? Our LLM trainer doesn't support any output yet: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/trainer/hf_llm_training.py#L178, so we need to work in the future to understand how a user should consume the fine-tuned model,
e.g. by exporting it to S3 or other storage.
cc @johnugeorge @deepanker13

@StefanoFioravanzo (Member):

so we need to work in the future to understand how a user should consume the fine-tuned model.

Issue + KF 1.10 tag? :)

@andreyvelich (Member Author):

@StefanoFioravanzo I believe I addressed all of your comments. Does it look good to you?
/assign @johnugeorge @deepanker13 @tenzen-y

@StefanoFioravanzo (Member):

@andreyvelich yes it does, thank you!

@tenzen-y (Member) left a comment:

Awesome documentation! Thank you!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label May 14, 2024
@deepanker13:

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

@andreyvelich (Member Author):

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

@deepanker13 Did you check these links via Website preview: https://deploy-preview-3718--competent-brattain-de2d6d.netlify.app/ ?

@deepanker13:

@andreyvelich the links in fine-tuning.md are giving 404 page not found. Am I missing something?

@deepanker13 Did you check these links via Website preview: https://deploy-preview-3718--competent-brattain-de2d6d.netlify.app/ ?

@andreyvelich it's working with the preview. Thanks for the awesome documentation!
/lgtm

@StefanoFioravanzo (Member):

@andreyvelich shall we merge this one?

@andreyvelich (Member Author):

Sure, let's merge it. Thanks everyone for the review!
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 36544ae into kubeflow:master May 20, 2024
6 checks passed
@andreyvelich andreyvelich deleted the fine-tune-architecture branch May 20, 2024 19:57