App-K8s-HF-WnB

This project aims to create an end-to-end ML app as a functional MVP. The app itself uses Hugging Face (HF) and Weights&Biases (WandB) to reduce initial complexity. The ML modules used should be interchangeable without interrupting the pipeline. The app can be deployed into a Python venv, a Docker image and Kubernetes to showcase the separation of concerns of the different pipeline components.

Status

[DRAFT] [WIP] ----> Not fully implemented yet

For version history have a look at CHANGELOG.md.

Quickstart

TODO

Usage

If inside the Poetry venv:

python -m app

or, if outside:

poetry run python -m app

Install

Python

From a venv with Poetry available

make install

or with conda

envname='App-K8s-HF-WnB'
conda create -ym -n $envname poetry
conda activate $envname
make install

Container

TODO

Kubernetes

TODO

Reason

TODO

Purpose

Showcase an end-to-end app with train and inference mode
Implement self-contained modular pipeline
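
As a rough illustration of what a modular pipeline with a train and an inference mode could look like, the sketch below chains interchangeable steps that each take and return a plain state dict. All names here (run_pipeline, load_components, prepare_data, train, infer, MODES) are hypothetical stand-ins and not the project's actual modules.

```python
from typing import Callable, Dict, List

# A step is any function that takes the pipeline state and returns a new state.
Step = Callable[[dict], dict]


def run_pipeline(steps: List[Step], state: dict) -> dict:
    """Apply each step in order, passing the state along."""
    for step in steps:
        state = step(state)
    return state


# Hypothetical stand-ins for real components; any function with the same
# signature could be swapped in without touching run_pipeline.
def load_components(state: dict) -> dict:
    return {**state, "model": "dummy-model"}


def prepare_data(state: dict) -> dict:
    return {**state, "data": [1, 2, 3]}


def train(state: dict) -> dict:
    summary = f"trained {state['model']} on {len(state['data'])} samples"
    return {**state, "result": summary}


def infer(state: dict) -> dict:
    summary = f"ran {state['model']} on {len(state['data'])} samples"
    return {**state, "result": summary}


# Each mode is just a different list of steps.
MODES: Dict[str, List[Step]] = {
    "train": [load_components, prepare_data, train],
    "infer": [load_components, prepare_data, infer],
}


def run(mode: str) -> dict:
    if mode not in MODES:
        raise ValueError(f"Unknown mode {mode!r}, expected one of {sorted(MODES)}")
    return run_pipeline(MODES[mode], {"mode": mode})
```

Because every step shares one signature, replacing a component means editing a single entry in a step list.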

Paradigms

TDD/BDD
Mostly functional
Time-to-value, time-to-market
Light-weight
Code should (Dave Farley)
- Work
- Be modular
- Be cohesive
- Be appropriately coupled
- Be separated by concerns
- Hide/abstract information

App Structure

Essential structure

/
├─ app/
│  ├─ config/
│  ├─ payload/
│  ├─ pipeline/
│  ├─ utils/
│  └─ app.py
├─ assets/
├─ container/
├─ kubernetes/
│  ├─ base/
│  └─ overlay/
├─ tests/
├─ CHANGELOG.md
├─ make.bat
├─ Makefile
├─ pyproject.toml
└─ README.md

Full structure

/
├─ .github/
│  ├─ workflows/
│  │  └─ links-fail-fast.yml
│  └─ dependabot.yml
├─ app/
│  ├─ config/
│  │  ├─ defaults.yml
│  │  ├─ huggingface.yml
│  │  ├─ logging.conf
│  │  ├─ parameters.dummy.json
│  │  ├─ sweep-wandb.yml
│  │  ├─ sweep.yml
│  │  ├─ task.yml
│  │  ├─ wandb.key.dummy.yml
│  │  └─ wandb.yml
│  ├─ payload/
│  │  ├─ handle_hf.py
│  │  ├─ handle_sweep.py
│  │  ├─ infer_model.py
│  │  └─ train_model.py
│  ├─ pipeline/
│  │  ├─ load_hf_components.py
│  │  ├─ prepare_pipe_data.py
│  │  └─ prepare_pipe_params.py
│  ├─ utils/
│  │  ├─ handle_logging.py
│  │  ├─ handle_paths.py
│  │  ├─ load_configs.py
│  │  ├─ log_system_info.py
│  │  ├─ parse_args.py
│  │  └─ toggle_features.py
│  ├─ __main__.py
│  ├─ __version__.py
│  ├─ _version.py
│  ├─ app.py
│  └─ py.typed
├─ assets/
│  ├─ tuna_importtime_dark.PNG
│  └─ tuna_importtime_light.PNG
├─ container/
│  └─ Dockerfile.PNG
├─ kubernetes/
│  ├─ base/
│  │  ├─ deployment.yml
│  │  ├─ kustomization.yml
│  │  ├─ pvc.yml
│  │  └─ service.yml
│  └─ overlay/
│     ├─ prod/
│     │  ├─ ingress.yml
│     │  ├─ kustomization.yml
│     │  └─ namespace.yml
│     └─ test/
│        ├─ ingress.yml
│        ├─ kustomization.yml
│        └─ namespace.yml
├─ tests/
│  ├─ behavior/
│  │  ├─ test_load_hf_components_behavior.py
│  │  └─ test_train_model_behavior.py
│  └─ functionality/
│     └─ test_load_hf_components_functionality.py
├─ .bumpversion.cfg
├─ .cirrus.yml
├─ .coveragerc
├─ .flake8
├─ .gitattributes
├─ .gitignore
├─ .gitmessage
├─ .markdownlint.yml
├─ .pre-commit-config.yaml
├─ .yamllint.yml
├─ CHANGELOG.md
├─ LICENSE
├─ make.bat
├─ Makefile
├─ pyproject.toml
└─ README.md

App Details

The app currently accepts only .yml files as config
config/wandb.key.dummy.yml showcases a key file to be used with the provider Weights&Biases (wandb)
config/parameters.dummy.json presents an example of the data model the pipeline uses
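
The .yml-only restriction mentioned above could be enforced with a small guard like the following. This is a minimal sketch using only the standard library. The function name ensure_yml_config and the constant ALLOWED_SUFFIXES are assumptions for illustration and do not come from the project's load_configs.py.

```python
from pathlib import Path

# Assumption: only the ".yml" suffix is accepted, mirroring the note above.
ALLOWED_SUFFIXES = {".yml"}


def ensure_yml_config(path: str) -> Path:
    """Return the path as a Path object, or raise if it is not a .yml file."""
    config_path = Path(path)
    if config_path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(
            f"Unsupported config format {config_path.suffix!r}: expected .yml"
        )
    return config_path
```

With this guard, app/config/defaults.yml passes, while app/config/parameters.dummy.json is rejected because its suffix is .json.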

Import performance

The import performance of the app can be measured with python -X importtime -m app and visualized with tuna. From the repository root, this flow can be invoked with:

make importtime

An example of how the visualized import time could look is provided in assets/ (tuna_importtime_dark.PNG and tuna_importtime_light.PNG).
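
For a quick look without a visualizer, the text that python -X importtime writes to stderr can also be parsed directly. The sketch below assumes the usual line layout "import time: self [us] | cumulative | imported package". The names ImportTiming, parse_importtime and slowest are hypothetical helpers and not part of the app, and the sample log is made up for illustration.

```python
import re
from typing import List, NamedTuple


class ImportTiming(NamedTuple):
    self_us: int
    cumulative_us: int
    module: str


# Matches data lines such as "import time:       300 |        420 | io".
# The header line contains no numbers in these columns and is skipped.
_LINE = re.compile(r"^import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(\S+)\s*$")


def parse_importtime(log: str) -> List[ImportTiming]:
    """Extract one ImportTiming per data line of an importtime log."""
    timings = []
    for line in log.splitlines():
        match = _LINE.match(line)
        if match:
            timings.append(
                ImportTiming(int(match[1]), int(match[2]), match[3])
            )
    return timings


def slowest(timings: List[ImportTiming], top: int = 5) -> List[ImportTiming]:
    """Return the entries with the highest cumulative import time."""
    return sorted(timings, key=lambda t: t.cumulative_us, reverse=True)[:top]


# Made-up sample log, only to show the expected input shape.
sample = """\
import time: self [us] | cumulative | imported package
import time:       120 |        120 |   _io
import time:       300 |        420 | io
import time:      1500 |       1920 | app
"""

for entry in slowest(parse_importtime(sample), top=2):
    print(f"{entry.module}: {entry.cumulative_us} us cumulative")
```

A real log could be captured with python -X importtime -m app 2> importtime.log, since the timings go to stderr.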

TODO

ML

Get WandB sweep config
- Implemented and functional
- May be extended to other providers, but sufficient for the MVP
Save models, datasets, tokenizer and metrics in local folder other than cache
Define the core of the app
- train
- infer

Coding

Dependency tracking and packaging

Explore use of pipenv with Pipfile & Pipfile.lock as a proposed replacement to requirements.txt
- Auto-creation of venv
- pipenv install -e for editable mode, i.e. 'dependency resolution can be performed with an up to date copy of the repository each time it is performed'
Use Poetry as replacement for pipenv
- Auto-creation of venv
- Build-tool for packaging
Experiment with pyproject.toml to build app wheel
- Used to pool information for build, package, tools etc into one file
- Some tools like flake8 do not support this approach
Create a package
- Required for tox and pdoc
- Experiment with the package as the single source of the app version, using setup.py and hatchling or setuptools
- Experiment with poetry

Project management

Use Makefile instead of self-implemented imperative setup.sh
- Implemented and functional
- Needs improvement for local venv install, because source cannot be run inside make
Adopt CHANGELOG.md
- 'A changelog is a file which contains a curated, chronologically ordered list of notable changes for each version of a project.'
- Seems to be reasonable
Adopt SemVer for semantic versioning
- Seems to be reasonable
Implement basic CI/CD-Skeleton
- Using bump2version, pre-commit, black etc
- Rationale:
  - Get fast feedback
  - Raise confidence in codebase
  - Always keep codebase in releasable state
Adopt TDD/BDD as described by Dave Farley in 'TDD Is The Best Design Technique' and 'TDD vs BDD'
- Goals
  - Think of specification first, then test
  - Confirm behavior instead of testing the code
- Structure
  - Specification (Test Suite) ==> Test (Scenario) ==> "Given, When, Then"
- Sequence
  - Red: Write test ==> Green: Write code passing test ==> Blue: Refactor code and test
  - Test: Arrange ==> Act ==> Assert ==> Clean
- Frameworks: Gherkin and Cucumber
Move from Makefile to Cirrus CLI
- Use --dirty for write-backs to files instead of rsync instance, e.g. for isort
Implement pydoc-action to auto-generate into gh-pages /docs, e.g. Sphinx Build Action for Sphinx
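
The "Given, When, Then" structure and the Arrange ==> Act ==> Assert sequence from the TDD/BDD item above can be sketched as plain Python test functions. The unit under test, parse_mode, is a hypothetical example and not taken from the app. The tests use bare asserts, so they can run under pytest or be called directly.

```python
def parse_mode(raw: str) -> str:
    """Hypothetical unit under test: normalize a user-supplied mode string."""
    mode = raw.strip().lower()
    if mode not in {"train", "infer"}:
        raise ValueError(f"Unknown mode: {mode}")
    return mode


def test_mode_is_normalized_when_user_passes_mixed_case():
    # Given (Arrange): a mode with stray whitespace and capital letters
    raw = "  Train "
    # When (Act): the mode is parsed
    mode = parse_mode(raw)
    # Then (Assert): the normalized mode is returned
    assert mode == "train"
    # Clean: nothing to tear down, no external resources were used


def test_unknown_mode_is_rejected():
    # Given (Arrange): a mode the app does not know
    raw = "deploy"
    # When (Act): parsing is attempted
    try:
        parse_mode(raw)
    except ValueError as error:
        message = str(error)
    else:
        message = ""
    # Then (Assert): the error names the offending mode
    assert "deploy" in message
```

Each test name states the expected behavior, so the suite reads as a specification rather than as a description of the implementation.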

Inspirations

Martin Fowler
Dave Farley
- TDD Is The Best Design Technique
- Test Driven Development vs Behavior Driven Development
Ian Cooper
- 🚀 TDD, Where Did It All Go Wrong (Ian Cooper)
Using Cirrus CLI instead of Makefiles for gRPC code generation

Resources

TODO