This Python script provides an easy and parameterizable way of defining typical DVC (sub-)stages for:
- data preprocessing
- data transformation
- data splitting
- data validation
To get a local copy up and running, first make sure the following requirements are met, then install DVC-Stage with pip:
pandas>=0.20.*
dvc>=2.12.*
pyyaml>=5
pip install git+https://github.com/MArpogaus/dvc-stage.git
DVC-Stage works on top of two files: dvc.yaml and params.yaml.
They are expected to be at the root of an initialized DVC project.
From there you can execute dvc-stage -h to see the available commands, or dvc-stage get-config STAGE to generate the DVC stage definition from the params.yaml file. The tool generates the respective YAML, which you can then manually paste into the dvc.yaml file. Existing stages can then be updated in place using dvc-stage update-stage STAGE.
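The same workflow can be scripted. The following is a minimal sketch, not part of DVC-Stage itself: the helper names `dvc_stage_command` and `run_dvc_stage` are hypothetical, only the subcommands named above are used, and it assumes that the generated YAML is written to standard output.

```python
import shutil
import subprocess


def dvc_stage_command(subcommand, stage=None):
    """Build the argument list for a dvc-stage call (hypothetical helper)."""
    command = ["dvc-stage", subcommand]
    if stage is not None:
        command.append(stage)
    return command


def run_dvc_stage(subcommand, stage=None):
    """Run dvc-stage in the current directory and return what it printed.

    Assumption: the command writes its result to standard output.
    """
    if shutil.which("dvc-stage") is None:
        raise RuntimeError("dvc-stage is not installed or not on PATH")
    result = subprocess.run(
        dvc_stage_command(subcommand, stage),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode the output to str
    )
    return result.stdout
```

Called from the root of the DVC project, `run_dvc_stage("get-config", "prepare")` would return the YAML for a stage named `prepare`, where `prepare` is a made-up stage name standing in for one defined in your params.yaml.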
Stages are defined inside params.yaml in the following schema:
STAGE_NAME:
  load: {}
  transformations: []
  validations: []
  write: {}
The load and write sections both require the YAML keys path and format to read and save data, respectively.
The transformations and validations sections require a sequence of functions to apply, where transformations return data and validations return a truth value (derived from data).
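The rules above can be checked before running a stage. The following is a minimal sketch of such a check, not DVC-Stage's own validation: the function `check_stage_definition` is hypothetical, and treating load and write as mandatory while transformations and validations may be omitted is an assumption. It operates on the stage definition after params.yaml has been parsed into a dictionary.

```python
REQUIRED_IO_KEYS = ("path", "format")


def check_stage_definition(stage):
    """Return a list of problems found in one stage definition (empty if none)."""
    problems = []
    # load and write must be mappings that contain both path and format.
    for section in ("load", "write"):
        mapping = stage.get(section)
        if not isinstance(mapping, dict):
            problems.append(f"{section}: expected a mapping")
            continue
        for key in REQUIRED_IO_KEYS:
            if key not in mapping:
                problems.append(f"{section}: missing key '{key}'")
    # transformations and validations are sequences of steps, each with an id.
    # Assumption: a missing section is treated like an empty sequence.
    for section in ("transformations", "validations"):
        steps = stage.get(section, [])
        if not isinstance(steps, list):
            problems.append(f"{section}: expected a sequence")
            continue
        for position, step in enumerate(steps):
            if not isinstance(step, dict) or "id" not in step:
                problems.append(f"{section}[{position}]: missing key 'id'")
    return problems


# Illustrative stage definition; the paths are made up.
example_stage = {
    "load": {"path": "data/raw.csv", "format": "csv"},
    "transformations": [{"id": "transpose"}],
    "validations": [],
    "write": {"path": "data/out.csv"},
}
print(check_stage_definition(example_stage))
# prints ["write: missing key 'format'"]
```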
Functions are defined by the key id and can be either:
- Methods defined on pandas DataFrames, e.g.
  transformations:
    - id: transpose
- Imported from any Python module, e.g.
  transformations:
    - id: custom
      description: duplicate rows
      import_from: demo.duplicate
- Predefined by DVC-Stage, e.g.
  validations:
    - id: validate_pandera_schema
      schema:
        import_from: demo.get_schema
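To illustrate how such a step could be turned into a callable, here is a minimal sketch covering the first two cases. It is not DVC-Stage's actual lookup code: the function `resolve_function` is hypothetical, predefined functions are left out, and the standard library's `math.sqrt` and a plain string stand in for a user module and a DataFrame so that the example runs without extra dependencies.

```python
import importlib


def resolve_function(step, data):
    """Return a callable for one transformation or validation step."""
    import_from = step.get("import_from")
    if import_from is not None:
        # "package.module.function" -> import the module, then fetch the attribute.
        module_name, _, attribute = import_from.rpartition(".")
        module = importlib.import_module(module_name)
        return getattr(module, attribute)
    # Otherwise treat the id as the name of a method on the data object.
    return getattr(data, step["id"])


# Imported function: math.sqrt stands in for something like demo.duplicate.
square_root = resolve_function({"id": "custom", "import_from": "math.sqrt"}, None)
print(square_root(9.0))  # prints 3.0

# Method lookup: a string stands in for a DataFrame, "upper" for "transpose".
to_upper = resolve_function({"id": "upper"}, "abc")
print(to_upper())  # prints ABC
```

An import_from value without a dot cannot be split into module and attribute, and `importlib.import_module` raises a `ValueError` for the resulting empty module name.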
When writing a custom function, you need to make sure the function gracefully handles data being None, which is required for type inference. Data is passed as the first argument. Further arguments can be provided as additional keys, as shown above for validate_pandera_schema, where schema is passed as the second argument to the function.
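A custom transformation following these rules might look like the sketch below. It is an illustration only and not the demo.duplicate function from the repository: the parameter `times` is hypothetical, and returning None when data is None is an assumption about what "gracefully" should mean here.

```python
def duplicate(data, times=2):
    """Repeat all rows of a DataFrame `times` times.

    `data` is the first argument; `times` is a hypothetical extra argument
    that would be supplied as an additional key next to `id` in params.yaml.
    """
    if data is None:
        # No data is available during type inference, so there is nothing
        # to transform. Assumption: passing None through is acceptable.
        return None
    # Select every row position `times` times, then renumber the index.
    positions = list(range(len(data))) * times
    return data.iloc[positions].reset_index(drop=True)
```

Selecting rows by position with `iloc` keeps the function independent of the index labels, so it also works when the incoming index contains duplicates.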
A working demonstration can be found at examples/.
Distributed under the GNU General Public License v3
Marcel Arpogaus - marcel.arpogaus@gmail.com
Project Link: https://github.com/MArpogaus/dvc-stage
Parts of this work have been funded by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety due to a decision of the German Federal Parliament (AI4Grids: 67KI2012A).