This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Basic documentation

Matt Casters edited this page Jan 13, 2019 · 9 revisions

Dataset plugin documentation

What is it?

This project provides not just one plugin but a whole series of plugins that together form a testing framework for Pentaho Data Integration (PDI).

What's the basic idea?

PDI transformations manipulate data coming from a variety of data sources: they read input and produce output. This project provides functionality to simulate inputs in the form of "input data sets" and to validate output against "golden data sets". A unit test is a combination of zero or more input data sets and golden data sets, along with a set of tweaks you can apply to the transformation prior to testing.

What are the use cases?

Protect your investment

Any PDI project of a certain age represents a serious investment of time and money. Unit tests allow us to safeguard this investment by making sure that old requirements are not forgotten.

Test driven development

Test driven development starts from the ETL requirements and allows a project manager or software architect to specify the needs clearly in the form of input and output rows. The ETL developer can then start from these requirements and know exactly when they are met.

Speed up development

By replacing regular input data with data sets, we can speed up development in these cases:

  • Transformations without design-time input: mappings, Map/Reduce, single threader, ...
  • When input data doesn't exist yet, is still in development, or when there is no direct access to the source system
  • When it takes a long time to get to the input data, e.g. long-running queries

Please note that you can flag a unit test to be opened and selected automatically when a transformation is loaded in Spoon.

What are the main components of a unit test?

  • Data set group: a grouping of data sets; links to the database where the data sets are stored.
  • Data set: a set of rows with a certain layout, stored in a database table. When used as input we call it an input data set; when used to validate a step's output we call it a golden data set.
  • Unit test tweak: the ability to remove or bypass a step during a test.
  • Unit test: the combination of input data sets, golden data sets, tweaks and a transformation.

You can have zero or more input or golden data sets defined in a unit test, and multiple unit tests defined per transformation.

How does it work at runtime?

When a transformation is executed in Spoon and a unit test is selected the following happens:

  • all steps marked with an input data set are replaced with an Injector step
  • all steps marked with a golden data set are replaced with a Dummy step (which does nothing)
  • all steps marked with a "Bypass" tweak are replaced with a Dummy step
  • all steps marked with a "Remove" tweak are removed

These operations take place on a copy of the transformation, in memory only, unless you specify a ktr file location in the unit test dialog.

After execution, step output is validated against the golden data and logged. If the test produces errors, a dialog pops up when running in Spoon.

[Screenshot: popup dialog example]

Why can we specify different types of unit tests?

Sometimes a developer is just developing something new or trying out a prototype; then it's convenient to be able to use input data sets and tweaks. Other times you want to secure your investment and flag the test as a real unit test. If you're developing new unit tests for a project and don't want to run all the previously defined tests, you can select another type and execute only those, skipping the "official" ones and speeding things up.

How can I automate execution of all unit tests?

There is a step called "Execute Unit Tests" which can execute all defined unit tests of a certain type. The output of the step can be stored in any format or location with regular PDI steps. Execute the transformation through Pan, in a job, or from a scheduler. You can execute Pan from Jenkins as well.
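As a sketch, a scheduled or Jenkins job could call Pan on a transformation containing the "Execute Unit Tests" step. The paths and the file name run-unit-tests.ktr below are hypothetical examples; substitute your own PDI location and transformation:

```shell
# Run a transformation containing the "Execute Unit Tests" step from the command line.
# /opt/data-integration and run-unit-tests.ktr are placeholder examples.
cd /opt/data-integration
./pan.sh -file=/projects/etl/run-unit-tests.ktr -level=Basic
```

Pan's exit code can then be checked by the scheduler to fail the build when tests fail.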

Setup

Compile this project by running "mvn". The resulting jar file will be in the target/ folder. Copy the jar file found in the releases or from the target/ folder to the plugins/ folder of your PDI distribution. This plugin has no additional dependencies and should run fine on any fairly recent version of PDI (7 or higher recommended).
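On a Unix-like system, the build-and-install steps above might look like the following sketch. PDI_HOME and the jar file name pattern are assumptions; adjust them to your own PDI install location and the actual artifact name in target/:

```shell
# Build the plugin jar; it lands in the target/ folder.
mvn clean install

# Copy the jar into the plugins/ folder of your PDI distribution.
# PDI_HOME is a placeholder for wherever you unzipped PDI.
PDI_HOME=~/data-integration
cp target/pentaho-pdi-dataset-*.jar "$PDI_HOME/plugins/"
```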

Usage

Data sets in CSV files

You can skip anything database-related mentioned here. Simply create a data set group with its type set to CSV. There you can specify the base folder where the data sets will be stored, or you can set the environment variable DATASETS_BASE_PATH.

Typically you also configure PENTAHO_METASTORE_FOLDER and UNIT_TESTS_BASE_PATH to point to the same git project so you can check everything into version control.
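A minimal sketch of such a shared setup, assuming the git project is checked out at ~/projects/my-etl (a hypothetical path) with metastore/ and datasets/ subfolders:

```shell
# Point the metastore, unit tests and data sets at the same git working copy
# so everything can be checked into version control together.
# ~/projects/my-etl and its subfolders are hypothetical examples.
PROJECT_HOME="$HOME/projects/my-etl"
export PENTAHO_METASTORE_FOLDER="$PROJECT_HOME/metastore"
export UNIT_TESTS_BASE_PATH="$PROJECT_HOME"
export DATASETS_BASE_PATH="$PROJECT_HOME/datasets"
```

Set these in the shell (or spoon.sh wrapper) before starting Spoon so the plugin picks them up.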

Data sets in a database

  • Create a new database connection to store data set groups in, and share it (right-click on the connection). Alternatively, create a new database connection in a repository (it is shared automatically).
  • Create a new data set group which uses the database connection.

Datasets

  • Create one or more data sets. Please note that the "Unit test" step right-click menu has options for automatic creation and population of data sets based on step output.
  • Load a transformation.
  • Create one or more unit tests, mark input and golden data sets and define tweaks, all from the step right-click menu.

Metastore

Data sets and their groups, as well as the unit tests, are stored in the metastore, in the folder ~/.pentaho/metastore, or in a repository if you're connected. Click in the left-hand side tree and hit CTRL-F5 to see a hidden ("easter egg") metastore browser. Set the variable PENTAHO_METASTORE_FOLDER to move the metastore root somewhere else (for example into your project folder).

If you want to store relative references to transformations in the Kettle Transformation Unit Test metastore elements, make sure to specify a base path in each individual unit test, or set the environment variable UNIT_TESTS_BASE_PATH before editing and executing any tests.