StagedML brings manageability to Deep Learning by applying Nix ideas of software deployment to the domain of ML model libraries. The project currently focuses on NLP models, which often require complex pre-processing and long training. StagedML uses a minimalistic immutable data management engine named Pylightnix.
StagedML formalizes the concepts of model configuration and dependency, and provides stable grounds for experimentation by tracking realizations. In contrast to regular package managers, StagedML takes the possible non-determinism of training into account.
*Stage configuration dependencies related to BERT fine-tuning (Source)*
- StagedML is a library of adopted ML models. We do not claim any remarkable accuracy or performance achievements, but we do provide infrastructure properties which, we hope, simplify the processes of development and experimentation.
- StagedML is powered by Pylightnix, an immutable data management library.
```python
>>> from stagedml.stages.all import ( all_convnn_mnist, realize,
...     instantiate, rref2path, shell, mklens )
```
- Models and datasets are defined on top of a linked graph of Pylightnix core objects called stages. A stage is a direct analogy of a package in a package manager.
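  For illustration, here is a minimal sketch of a stage written directly against the Pylightnix core API, in the spirit of the Pylightnix MNIST demo. The `hello_stage` below is our own toy example, not a part of StagedML:

```python
from os.path import join
from pylightnix import (Manager, DRef, Build, mkdrv, mkconfig, match_only,
                        build_wrapper, build_outpath, instantiate, realize)

def hello_stage(m:Manager)->DRef:
    # The configuration is an immutable JSON-like dict; it determines the
    # stage's identity (and thus its location in the Pylightnix storage).
    config = mkconfig({'name':'hello', 'greeting':'Hello, StagedML!'})
    # The realizer produces the stage's artifacts in a dedicated build folder.
    def _realize(b:Build)->None:
        with open(join(build_outpath(b), 'message.txt'), 'w') as f:
            f.write('Hello, StagedML!')
    # A stage = configuration + matcher + realizer, registered with mkdrv.
    return mkdrv(m, config, match_only(), build_wrapper(_realize))

rref = realize(instantiate(hello_stage))
```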
- Stage objects are defined by Python functions and could be created (realized) in just one line of Python code. Dependencies between stages are encoded by passing special handlers called derivation references. For example, here we realize an object representing a trained MNIST classifier:
```python
>>> rref = realize(instantiate(all_convnn_mnist))
>>> rref
'rref:2bf51e3ce37061ccff6168ccefac7221-3b9f88037f737f06af0fe82b6f6ac3c8-convnn-mnist'
```
- StagedML re-uses as many stage realizations as possible. If no realization matches the criteria, the user-defined building procedure is called. For ML models, this results in training new model instances. For datasets, this may launch pre-processing or downloading from the Internet.
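  For example, repeating the realization from the previous snippet should return the cached result instead of training a new instance (a sketch of the expected behavior):

```python
>>> rref2 = realize(instantiate(all_convnn_mnist))  # cache hit: no re-training
>>> rref2 == rref
True
```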
- For every stage, users could access its configuration fields and the configuration fields of any of its dependencies:
```python
>>> mklens(rref).learning_rate.val  # Learning rate of the model
0.001
>>> mklens(rref).mnist.url.val  # URL of the dataset used to train the model
'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'
```
- StagedML evaluates all the configurations before executing any of the builders. Thanks to this feature, equipped with Lenses and Promises, we could catch configuration-time errors like misspelled parameter names or incorrect paths before starting a long training.
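  To illustrate the kind of mistake this catches, accessing a misspelled field through a lens should fail at once rather than deep inside a training run (the exact exception type depends on Pylightnix):

```python
>>> mklens(rref).learning_rte.val   # typo: no such configuration field
# ==> fails immediately at lookup time, long before any training starts
```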
- StagedML offers facilities to re-define existing stages and compose new stages by using old ones as dependencies. Already existing stages can't be altered or lost in this process:
```python
>>> from pylightnix import redefine, mkconfig
>>> def _new_config(old_config):
...     old_config['learning_rate'] = 1e-5
...     return mkconfig(old_config)
>>> rref5 = realize(instantiate(redefine(all_convnn_mnist, new_config=_new_config)))
>>> rref5
'rref:1ece593a8e761fa28fdc0da0fed00eb8-dd084d4a8b75a787b7c230474549e5db-convnn-mnist'
>>> mklens(rref5).learning_rate.val
1e-05
```
- Thanks to the REPL API, it is possible to debug intermediate stages by instructing Pylightnix to pause at certain building procedures. Using this API is similar to what we experience during a `git rebase --continue` workflow. An example is the REPL demo of Pylightnix.
- StagedML supports non-deterministic build processes, which means that we
could train several instances of the model and pick the best one to use in subsequent stages. Selection criteria are up to the user. See the Matcher topic of the Pylightnix documentation.

```python
>>> rref2path(rref)
'/tmp/pylightnix/store-v0/3b9f88037f737f06af0fe82b6f6ac3c8-convnn-mnist/2bf51e3ce37061ccff6168ccefac7221'
#  ^^^ Storage root       ^^^ Stage configuration                       ^^^ Stage realization (one of)
```
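  For instance, if a stage's realizer records its final accuracy in a file, Pylightnix's `match_best` matcher can select the realization with the largest stored value. The file name and the `new_matcher` keyword below are our assumptions; consult the Matcher documentation for the exact interface:

```python
>>> from pylightnix import redefine, match_best
>>> # Assumption: the realizer writes its final accuracy into 'accuracy.txt'
>>> best = redefine(all_convnn_mnist, new_matcher=match_best('accuracy.txt'))
>>> rref_best = realize(instantiate(best))
```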
- The only way to remove data in StagedML is to use the garbage collector. GC removes unused stages but keeps the stages which are pointed to by at least one symlink originating from the special experiments directory.
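  Consequently, pinning a realization is just a matter of symlinking it from that directory; a plain-Python sketch (the experiments path below is hypothetical and depends on your setup):

```python
>>> import os
>>> # Hypothetical experiments directory; a symlink from it protects the
>>> # referenced stage from the garbage collector.
>>> os.symlink(rref2path(rref), '/workspace/experiments/convnn-mnist-best')
```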
- Currently, we include some NLP models from tensorflow-models; other libraries may be supported in the future. Often we pick only the BASE versions, which could be trained on a GPU. Check the full collection of adopted models and datasets.
- Deployment of trained models is not supported now but may be supported in the future. Thanks to the simplicity of the Pylightnix storage format, deployment could probably be done just by running `rsync` on the Pylightnix storage folders of the source and target machines.
- StagedML is not tested as thoroughly as we would wish. At the same time:
- To minimize the uncertainty, we specify the exact versions of dependency libraries (TensorFlow, TensorFlow Models, Pylightnix, etc.) by linking them as Git submodules.
- Considerable effort was made to test the underlying Pylightnix core library.
- We extensively use Mypy-compatible type annotations.
- A Linux system as the Docker host (other OSes may accidentally work too)
- A GPU suitable for machine learning acceleration. We use an NVidia 1080Ti.
- A considerable amount of hard drive space. Some tasks, like BERT pretraining, may require >=200 GB for datasets.
StagedML depends on slightly customized versions of TensorFlow and TensorFlow/models. While the TensorFlow changes are negligible (minor fixes in the build system), we do modify TensorFlow/models in a non-trivial way.
Currently, we provide StagedML in Docker containers of two kinds: 'User' and 'Dev':
Feature | stagedml/user | stagedml/dev |
---|---|---|
Where to get | Docker hub | docker build |
Cloned repo is required | No | Yes |
Pylightnix installed | System-wide | via PYTHONPATH |
StagedML installed | System-wide | via PYTHONPATH |
TensorFlow installed | System-wide | No(1) |
TF/Models installed | System-wide | via PYTHONPATH |
- (1) TensorFlow can't be populated by setting PYTHONPATH, so installation from source is required. We provide reference scripts for this task.
Depending on your needs, you could follow either a user or a developer installation track.
Latest docker images should be available at our Docker Hub page.
For reference: there is an update-docker-hub.sh script which automates building the 'user' docker image.
The 'User' docker container offers the latest StagedML and its dependencies, all installed system-wide. We recommend using our rundocker.sh script instead of calling `docker pull` directly. The script constructs the docker command line and enables the following important functionality:
- Bind-mounting the Host's current folder as the container's HOME folder
- Passing correct user and group IDs to the container
- Forwarding TCP ports for TensorBoard and Jupyter Notebooks
- Forwarding Host's X session into the container
As a result, you can use the docker shell as a development console almost transparently. In order to run the container, follow these steps:
- Get the rundocker.sh script by saving it manually or by using your favorite command-line downloader:

  ```sh
  $ wget https://github.com/stagedml/stagedml/raw/master/rundocker.sh
  $ chmod +x ./rundocker.sh
  ```
- Run the container by passing its name to the script:

  ```sh
  $ ./rundocker.sh stagedml/user:latest
  ```
- Proceed with Quick Start
The development docker container includes most of the Python dependencies (the notable exception is TensorFlow, which should be installed manually), but not the StagedML packages themselves: Pylightnix, StagedML and TensorFlow/Models are propagated via PYTHONPATH. The StagedML repository and all its submodules are required to be checked out locally. The detailed installation procedure follows:
- Clone the StagedML repo recursively:

  ```sh
  $ git clone --recursive https://github.com/stagedml/stagedml
  ```
- Cd to the project's root and run the docker script without arguments to build the development docker container from the Dockerfile:

  ```sh
  $ cd stagedml
  $ ./rundocker.sh
  ```

  The docker builder will download the deepo base image and additional dependencies. Among other actions, the script will bind-mount the host's project folder to the container's `/workspace`. Finally, it will open a Bash shell with `PYTHONPATH` pointing to the Python sources of the required libraries.
- Install TensorFlow. At the time of this writing, the default TF from the deepo Docker image was a bit old, so we provide our favorite version as the `./3rdparty/tensorflow` Git submodule. You have the following options:
  - (preferred) Build our favorite version of TensorFlow from source. We link it under the `./3rdparty/tensorflow` Git submodule folder.
    - Make sure that submodules are initialized:
      ```sh
      $ git submodule update --init --recursive
      ```
    - Run the `buildtf` shell function to configure and build the TensorFlow wheel:

      ```sh
      (docker) $ buildtf
      ```

      Typically, `buildtf` takes a long time to complete. It requires a considerable amount of RAM and HDD space, but we only need to run it once. The wheel appears in the `./_tf` folder.
    - Install the TensorFlow wheel. This last command should be re-run every time we start the development container:

      ```sh
      (docker) $ sudo -E make install_tf
      ```
  - Check the current version of TF shipped with the base docker image of deepo. StagedML wants it to be `>=2.1`; maybe this requirement is already satisfied by default (see the snippet after this list).
  - Install TensorFlow from custom Debian repositories. Typically one has to execute shell commands like `sudo -E pip3 install tensorflow-gpu` or `sudo apt-get install tensorflow-gpu`. Please consult the Internet.
- (Optional) StagedML supports `mypy`-based type checking:

  ```sh
  (docker) $ make typecheck
  ```
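Regarding the version check mentioned in the TensorFlow options above, a quick way to inspect the TF shipped with the base image is to run, inside the container:

```python
# Print the TensorFlow version shipped with the base image;
# StagedML expects it to be >= 2.1.
import tensorflow as tf
print(tf.version.VERSION)
```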
StagedML encourages presentation-driven development. An end-user report typically contains the results of a machine learning experiment, including task-specific utilities, plots and tables. This kind of content is often stored in Jupyter notebooks, but we prefer Markdown-based rendering using the codebraid toolset.
To build a report, consider following these steps:
- Run the latest "user" docker image as described in the "user" part of the install section.
- In the docker shell, enter the report directory: `cd run/bert_finetune`
- Run `make train` to download and train the models. This could take some time.
- Run `make html` to generate the HTML page of the report.
- View `./out_html/Report.html` with your favorite browser.
See the list of available reports here.
Top-level definitions are listed in a single all.py file. There, every `all_` function defines a stage, which is usually a model or a dataset. Every stage could be built (or realized) by calling `realize(instantiate(...))` on it. Stages depend on each other, and Pylightnix manages the dependencies automatically.
An example IPython session may look like the following:
```python
>>> from stagedml.stages.all import *     # Import the collection of top-level stages
>>> initialize()                          # Make sure that Pylightnix storage is initialized
>>> realize(instantiate(all_bert_finetune_glue, 'MRPC'))  # Train our model of choice
# During the realize, StagedML will:
# * Download the GLUE dataset
# * Download a pretrained BERT checkpoint
# * Convert the dataset into the TFRecord format
# * Fine-tune the BERT model on the MRPC classification task
#   (~15 min on an NVidia 1080Ti GPU)
# * Save the model's checkpoint and other data
# * Return the handle to this data
'rref:eedaa6f13fee251b9451283ef1932ca0-c32bccd3f671d6a3da075cc655ee0a09-bert'
```
Now we have a realization reference, so we could ask IPython to save it in a variable by typing `rref=_`. RRefs identify stages in the Pylightnix storage. They could be converted into system paths by calling the `pylightnix.rref2path` function:
```python
>>> print(rref2path(rref))
/var/run/pylightnix/store-v0/c32bccd3f671d6a3da075cc655ee0a09/eedaa6f13fee251b9451283ef1932ca0/
```
With the realization reference in hand, we could:
- Manually examine training logs and figures by accessing the training artifacts located in the storage folder, e.g. by running the `pylightnix.bashlike.shell` function on the RRef.
- Run TensorBoard by passing the RRef to `stagedml.utils.tf.runtb`. Assuming that we run StagedML in Docker as described in the Install section, we could run the `./runchrome.sh` script on the Host machine to connect a web client to it.
- Obtain a derivation reference with `pylightnix.rref2dref`. We pass derivation references to newly defined stages to make them depend on the current stage; StagedML tracks all the configurations and prevents us from messing up the data (see the sketch after this list).
- Tweak model parameters with `pylightnix.redefine` and re-train the model while keeping the results of previous trainings.
- Finally, run the garbage collector `stagedml.stages.all.gc` to remove outdated data.
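A sketch of the dependency-passing workflow mentioned above: convert the RRef into a DRef and record it in a new stage's configuration. The evaluation stage below is hypothetical, and we assume the Pylightnix core helpers (`mkdrv`, `mkconfig`, `match_only`, `build_wrapper`):

```python
from pylightnix import (Manager, DRef, Build, mkdrv, mkconfig, match_only,
                        build_wrapper, rref2dref, instantiate, realize)

def my_eval_stage(m:Manager)->DRef:
    # Recording the DRef in the configuration makes the dependency explicit,
    # so StagedML can track it and keep the data consistent.
    config = mkconfig({'name':'my-bert-eval', 'bert':rref2dref(rref)})
    def _realize(b:Build)->None:
        pass  # hypothetical: load the fine-tuned checkpoint, write metrics
    return mkdrv(m, config, match_only(), build_wrapper(_realize))

eval_rref = realize(instantiate(my_eval_stage))
```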
The ideas behind StagedML are described in the following presentation.
The core library of StagedML is called Pylightnix. In fact, StagedML is nothing more than a collection of Pylightnix stages. The following Pylightnix documentation and manuals do apply:
- MNIST demo shows the machine learning specifics of Pylightnix.
- REPL demo illustrates how to debug stages using Read-Eval-Print-friendly routines (wiki).
- Ultimatum tutorial is a note on organizing experiments.
- Pylightnix API Reference
Main sources are located in the src folder.
The most important module is stagedml.stages.all. It contains the top-level stage definitions. Most of the intermediate stages are defined in the stagedml.stages sub-modules.
Folder ./run contains end-user applications (reports). They don't depend on StagedML source artifacts and may be distributed separately from the other sources. In particular, they have their own Makefiles.
Machine learning models are mostly borrowed from the TensorFlow Official Models, but some parts of them were modified by us. We keep modified parts under the stagedml.models module.
Low-level utilities are defined in stagedml.utils.
We keep external dependencies in a separate module called stagedml.imports.
Important third-party dependencies are included in the form of Git submodules.
We link them under the ./3rdparty folder. Less important dependencies were installed with `pip` and became a part of the Docker image.
Overall repository structure:
```
.
├── 3rdparty/                  # Thirdparty dependencies in source form
│   ├── pylightnix/            # Pylightnix core library
│   ├── nl2bash_essence/
│   ├── tensorflow/
│   └── tensorflow_models/
├── docker/                    # Docker scripts and install rules
│   ├── devenv.sh              # Development shell-functions
│   ├── stagedml_ci.docker
│   └── stagedml_dev.docker
├── nix/
│   └── docker_inject.nix
├── run/                       # Experiments, have their own Makefiles
│   └── ...
├── src/                       # Python sources
│   └── stagedml/
│       ├── datasets/          # Dataset utilities
│       ├── imports/           # Imports from thirdparty Python packages
│       ├── models/            # Parts of ML models
│       ├── stages/            # Collection of stages
│       └── utils/             # Utilities
├── LICENSE
├── Makefile                   # Rules for building wheels, testing, etc.
├── README.md                  # <-- (You are here)
├── ipython.sh
├── localrc.vim
├── runchrome.sh*              # Chrome browser runner, TensorBoard ports are open
├── rundocker.sh*              # Docker container runner
└── setup.py
```