DVC-Stage

  1. About The Project
  2. Getting Started
    1. Prerequisites
    2. Installation
  3. Usage
  4. License
  5. Contact
  6. Acknowledgments

About The Project

This Python script provides an easy and parameterizable way of defining typical DVC (sub-)stages for:

  • data preprocessing
  • data transformation
  • data splitting
  • data validation

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

  • pandas>=0.20.*
  • dvc>=2.12.*
  • pyyaml>=5

Installation

pip install git+https://github.com/MArpogaus/dvc-stage.git

Usage

DVC-Stage works on top of two files: dvc.yaml and params.yaml. Both are expected at the root of an initialized DVC project. From there, run dvc-stage -h to see the available commands, or dvc-stage get-config STAGE to generate the DVC stage definition from the params.yaml file. The tool generates the corresponding YAML, which you then paste manually into the dvc.yaml file. Existing stages can afterwards be updated in place using dvc-stage update-stage STAGE.

Stages are defined inside params.yaml in the following schema:

STAGE_NAME:
  load: {}
  transformations: []
  validations: []
  write: {}

The load and write sections both require the YAML keys path and format, which are used to read and save the data, respectively.
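The requirement on load and write can be illustrated with a small check. This is a minimal sketch, not part of DVC-Stage: the helper missing_io_keys, the file paths, and the format value "csv" are all hypothetical, and the stage is written as the Python dictionary that the params.yaml entry would parse to.

```python
# Keys that the load and write sections must both provide (from the text above).
REQUIRED_IO_KEYS = ("path", "format")


def missing_io_keys(stage: dict) -> dict:
    """Return, per section, the required keys that are absent (hypothetical helper)."""
    missing = {}
    for section in ("load", "write"):
        # `or {}` treats an absent or empty section as having no keys at all.
        config = stage.get(section) or {}
        absent = [key for key in REQUIRED_IO_KEYS if key not in config]
        if absent:
            missing[section] = absent
    return missing


# Python equivalent of one params.yaml entry; paths and format are placeholders.
stage = {
    "load": {"path": "data/raw.csv", "format": "csv"},
    "transformations": [{"id": "transpose"}],
    "validations": [],
    "write": {"path": "data/out.csv"},  # "format" is deliberately left out
}

print(missing_io_keys(stage))  # {'write': ['format']}
```

An empty result means both sections carry the two required keys.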

The transformations and validations sections each require a sequence of functions to apply, where transformations return data and validations return a truth value (derived from the data). Functions are identified by the key id and can be either:

  • Methods defined on pandas DataFrames, e.g.
    transformations:
    - id: transpose
  • Imported from any python module, e.g.
    transformations:
    - id: custom
      description: duplicate rows
      import_from: demo.duplicate
  • Predefined by DVC-Stage, e.g.
    validations:
    - id: validate_pandera_schema
      schema:
        import_from: demo.get_schema
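A dotted import_from value such as demo.duplicate names a module and an attribute inside it. The following minimal sketch shows one common way to resolve such a path with the standard library; the helper resolve is hypothetical, and this is not necessarily how DVC-Stage performs the lookup. It uses math.sqrt as a stand-in because the demo module is not available here.

```python
import importlib


def resolve(dotted_path: str):
    """Import the module part of `dotted_path` and return the named attribute."""
    # "demo.duplicate" splits into module "demo" and attribute "duplicate".
    module_name, _, attribute = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


sqrt = resolve("math.sqrt")
print(sqrt(9.0))  # 3.0
```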

When writing a custom function, make sure it gracefully handles data being None, which is required for type inference. Data is passed as the first argument. Further arguments can be provided as additional keys, as shown above for validate_pandera_schema, where schema is passed as the second argument to the function.
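A custom transformation along these lines might look like the following minimal sketch. The name duplicate echoes the demo.duplicate reference above, but the body is hypothetical and not taken from the project. Returning None unchanged when data is None is an assumption about what handling that case "gracefully" means.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # pandas is only needed for the type hints
    import pandas as pd


def duplicate(data: Optional["pd.DataFrame"]) -> Optional["pd.DataFrame"]:
    """Hypothetical custom transformation: return every row twice."""
    if data is None:
        # Assumed behaviour: pass None through when no data is supplied.
        return None
    # Build row positions 0, 0, 1, 1, ... and select them by position.
    positions = [i for i in range(len(data)) for _ in range(2)]
    return data.iloc[positions].reset_index(drop=True)
```

Selecting by position with iloc keeps the sketch correct even when the index of the incoming DataFrame contains repeated labels.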

A working demonstration can be found at examples/.

License

Distributed under the GNU General Public License v3.

Contact

Marcel Arpogaus - marcel.arpogaus@gmail.com

Project Link: https://github.com/MArpogaus/dvc-stage

Acknowledgments

Parts of this work have been funded by the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety due to a decision of the German Federal Parliament (AI4Grids: 67KI2012A).