Git Based MLOps

This project shows how to realize MLOps in Git/GitHub. To achieve this, it heavily leverages tools such as DVC, DVC Studio, and DVCLive (all products built by iterative.ai), along with Google Drive, Jarvislabs.ai, and the HuggingFace Hub.

Instructions

Prior work

  1. Click the "Use this template" button to create your own repository
  2. Wait a few seconds, and an Initial Setup PR will be created automatically
  3. Merge the PR, and you are good to go

Basic setup

  1. Run pip install -r requirements.txt
  2. Run dvc init to enable DVC
  3. Add your data under the data directory
  4. Run git rm -r --cached 'data' && git commit -m "stop tracking data"
  5. Run dvc add [ADDED FILE OR DIRECTORY] to track your data with DVC
  6. Run dvc remote add -d gdrive_storage gdrive://[ID of specific folder in gdrive] to add Google Drive as the remote data storage
  7. Run dvc push; a URL for authentication will be printed. Copy and paste it into your browser and authenticate
  8. Copy the content of .dvc/tmp/gdrive-user-credentials.json and store it as a GitHub Secret named GDRIVE_CREDENTIAL
  9. Run git add . && git commit -m "initial commit" && git push origin main to keep the initial setup
  10. Write your own pipeline under the pipeline directory. Code for basic image classification in TensorFlow is provided initially.
  11. Run the following dvc stage add command for the train stage:
# if you want to use Iterative Studio / DVCLive for tracking training progress
$ dvc stage add -n train \
                -p train.train_size,train.batch_size,train.epoch,train.lr \
                -d pipeline/modeling.py -d pipeline/train.py -d data \
                --plots-no-cache dvclive/scalars/train/loss.tsv \
                --plots-no-cache dvclive/scalars/train/sparse_categorical_accuracy.tsv \
                --plots-no-cache dvclive/scalars/eval/loss.tsv \
                --plots-no-cache dvclive/scalars/eval/sparse_categorical_accuracy.tsv \
                -o outputs/model \
                python pipeline/train.py outputs/model

# if you want to use W&B for tracking training progress
$ dvc stage add -n train \
                -p train.train_size,train.batch_size,train.epoch,train.lr \
                -d pipeline/modeling.py -d pipeline/train_wandb.py -d data \
                -o outputs/model \
                python pipeline/train_wandb.py outputs/model
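
For reference, dvc stage add records the stage definition in dvc.yaml. The DVCLive variant above should produce an entry roughly like the following (a sketch; the exact ordering DVC writes may differ):

stages:
  train:
    cmd: python pipeline/train.py outputs/model
    deps:
    - data
    - pipeline/modeling.py
    - pipeline/train.py
    params:
    - train.train_size
    - train.batch_size
    - train.epoch
    - train.lr
    outs:
    - outputs/model
    plots:
    - dvclive/scalars/train/loss.tsv:
        cache: false
    - dvclive/scalars/train/sparse_categorical_accuracy.tsv:
        cache: false
    - dvclive/scalars/eval/loss.tsv:
        cache: false
    - dvclive/scalars/eval/sparse_categorical_accuracy.tsv:
        cache: false
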
  12. Run the following dvc stage add command for the evaluate stage:
# if you want to use Iterative Studio / DVCLive for tracking training progress
$ dvc stage add -n evaluate \
                -p evaluate.test,evaluate.batch_size \
                -d pipeline/evaluate.py -d data/test -d outputs/model \
                -M outputs/metrics.json \
                python pipeline/evaluate.py outputs/model

# if you want to use W&B for tracking training progress
$ dvc stage add -n evaluate \
                -p evaluate.test,evaluate.batch_size \
                -d pipeline/evaluate.py -d data/test -d outputs/model \
                python pipeline/evaluate.py outputs/model
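
Similarly, the DVCLive variant of the evaluate stage should appear in dvc.yaml roughly as follows (a sketch; the -M flag registers the file as a metrics output that is not cached by DVC):

stages:
  evaluate:
    cmd: python pipeline/evaluate.py outputs/model
    deps:
    - data/test
    - outputs/model
    - pipeline/evaluate.py
    params:
    - evaluate.test
    - evaluate.batch_size
    metrics:
    - outputs/metrics.json:
        cache: false
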
  13. Update params.yaml as you need (a params.yaml sketch follows this list)
  14. Run git add . && git commit -m "add initial pipeline setup" && git push origin main
  15. Run dvc repro to run the pipeline initially
  16. Run dvc add outputs/model.tar.gz to add the compressed version of the model
  17. Run dvc push outputs/model.tar.gz
  18. Run echo "/pipeline/__pycache__" >> .gitignore to ignore the unnecessary directory
  19. Run git add . && git commit -m "add initial pipeline run" && git push origin main
  20. Add the access token and user email of Jarvislabs.ai to GitHub Secrets as JARVISLABS_ACCESS_TOKEN and JARVISLABS_USER_EMAIL
  21. Add a GitHub access token to GitHub Secrets as GH_ACCESS_TOKEN
  22. Create a PR and write #train --with dvc in a comment (you have to be the owner of the repo)
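
The parameter names passed via the -p flags above must exist in params.yaml. A minimal sketch (all values here are placeholders, not the project's defaults):

train:
  train_size: 0.8
  batch_size: 64
  epoch: 10
  lr: 0.001

evaluate:
  test: data/test
  batch_size: 64
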

W&B Integration Setup

  1. Add your W&B project name to GitHub Secrets as WANDB_PROJECT
  2. Add your W&B API key to GitHub Secrets as WANDB_API_KEY
  3. Use #train --with wandb instead of #train --with dvc
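
If you prefer the GitHub CLI over the repository settings UI, the same secrets can be set from a terminal (assuming gh is installed and authenticated):

$ gh secret set WANDB_PROJECT --body "your-project-name"
$ gh secret set WANDB_API_KEY --body "your-api-key"
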

HuggingFace Integration Setup

  1. Add a HuggingFace access token to GitHub Secrets as HF_AT
  2. Add your HuggingFace username to GitHub Secrets as HF_USER_ID
  3. Write #deploy-hf in a comment of the PR you want to deploy to a HuggingFace Space
    • The GitHub Action assumes your model is archived as model.tar.gz under the outputs directory (see the tar sketch after this list)
    • The GitHub Action also assumes your HuggingFace Space app is written in Gradio under the hf-space directory. Change app_template.py as needed (you shouldn't remove any environment variables in the file).
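
If your pipeline does not already produce the archive, a tar invocation like the following creates the expected layout (a sketch, assuming the trained model is saved under outputs/model):

$ tar -czf outputs/model.tar.gz -C outputs model
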

TODO

Brief description of each tool

  • DVC (Data Version Control): Manages data somewhere else (e.g., cloud storage) while keeping the version and remote information in a metadata file in the Git repository (see the example after this list).
  • DVCLive: Provides callbacks for ML frameworks (e.g., TensorFlow, Keras) to record metrics during training in TSV format.
  • DVC Studio: Visualizes the metrics from files in the Git repository. What to visualize is recorded in dvc.yaml.
  • Google Drive: Used as the remote data storage. However, you can use others such as AWS S3, Google Cloud Storage, or your own file server.
  • Jarvislabs.ai: Used to provision cloud GPU VM instances to run each experiment.
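
To illustrate the DVC point above: dvc add replaces the tracked data with a small .dvc metadata file that is committed to Git in its place, e.g. (the hash, size, and file count below are placeholders):

$ dvc add data/train
$ cat data/train.dvc
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 123456789
  nfiles: 60000
  path: train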
