topher-lo/simple-data-workflow

Simple Data Workflow

MIT License Python 3.8 Test codecov

A simple end-to-end data workflow (preprocessing, modelling, visualization) orchestrated using Prefect tasks and flows.

🚀 Quickstart

The easiest way to get started is to clone the repo:

git clone git@github.com:topher-lo/simple-data-workflow.git

Then install its dependencies using pip:

pip install -r requirements.txt

✨ Quick Example

instance e2e_pipeline

An end-to-end data workflow that:

  1. Downloads data from a URL;
  2. Cleans data;
  3. Runs linear regression;
  4. Plots regression results as a box-and-whisker chart.

from src.flow import e2e_pipeline

# Flow parameters
kwargs = {
    'url': 'https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/fakeTSD.csv',
    'cat_cols': ['year'],  # List of categorical variables in dataset
    'na_strategy': 'mice',  # Method to deal with missing values
    'transf_cols': ['x1', 'x2'],  # Variables to apply transformation on
    'transf_func': 'arcsinh',  # Transformation function
    'endog': 'y',  # Endogenous (outcome) variable
    'exog': ['x1', 'x2']  # Exogenous (feature) variables
}

# Execute flow
state = e2e_pipeline.run(**kwargs)

# Check if flow run was successful
if state.is_successful():
    
    # Get task's reference ID from its name
    task_name = 'plot_confidence_intervals'
    task_ref = e2e_pipeline.get_tasks(name=task_name)[0]
    
    # Get altair chart
    conf_int_chart = state.result[task_ref].result

🎛 Tasks API

These are individual data tasks that make up each part (i.e. preprocessing, modelling, post-processing) of the end-to-end data flow.

Preprocessing:

function sanitize_col_names

Sanitizes each string in a list by: 1. stripping all whitespace at the start and end; 2. replacing any excess whitespace with an underscore; and 3. lower-casing all characters.

Parameters:

  • cols (List[str]): List of strings (e.g. column names) to sanitize.

Returns: Sanitized list of strings (e.g. column names).
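
The three steps above can be illustrated with a minimal sketch. The function name `sanitize_col_names_sketch` is hypothetical, and treating "excess whitespace" as "each run of whitespace becomes one underscore" is an assumption; the repository's own implementation may differ.

```python
import re
from typing import List


def sanitize_col_names_sketch(cols: List[str]) -> List[str]:
    """Hypothetical re-implementation of the three sanitizing steps."""
    sanitized = []
    for col in cols:
        col = col.strip()               # 1. strip leading/trailing whitespace
        col = re.sub(r"\s+", "_", col)  # 2. each whitespace run -> one underscore (assumption)
        sanitized.append(col.lower())   # 3. lower-case all characters
    return sanitized


print(sanitize_col_names_sketch(["  Total  Sales ", "Year"]))
# ['total_sales', 'year']
```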

function retrieve_data

Reads data (from url string) into a DataFrame.

Parameters:

  • url (str): URL to data. Data is a delimiter-separated text file.
  • sep (str): Delimiter to use.
  • nrows (int): Number of rows of the file to read.

Returns: The delimiter-separated text file as a Pandas DataFrame.
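
As a rough illustration of what the parameters control, the sketch below reads delimiter-separated text with `pandas.read_csv`. The name `retrieve_data_sketch` is hypothetical, and whether the actual task calls `read_csv` is an assumption. An in-memory buffer stands in for the URL so the example runs offline.

```python
import io

import pandas as pd


def retrieve_data_sketch(source, sep=",", nrows=None):
    """Hypothetical stand-in: read delimiter-separated text into a DataFrame."""
    # pd.read_csv accepts a URL string, a local path, or a file-like object.
    return pd.read_csv(source, sep=sep, nrows=nrows)


csv_text = "y;x1;x2\n1.0;0.5;3\n2.0;0.7;4\n3.0;0.9;5\n"
df = retrieve_data_sketch(io.StringIO(csv_text), sep=";", nrows=2)
print(df.shape)  # (2, 3): nrows limits the number of data rows read
```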

function _column_wrangler

Returns DataFrame with columns transformed into a consistent format (see sanitize_col_names).

Parameters:

  • data (pd.DataFrame): The data.

Returns: DataFrame with sanitized column names.

function _obj_wrangler

Converts columns with object dtype into StringDtype.

Parameters:

  • data (pd.DataFrame): The data.

Returns: A copy of the inputted Pandas DataFrame with any object dtype columns cast as StringDtype.
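
A minimal sketch of this conversion, assuming it amounts to `astype("string")` on each object column (the helper name `obj_to_string_sketch` is hypothetical):

```python
import pandas as pd


def obj_to_string_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: cast object-dtype columns to StringDtype on a copy."""
    out = data.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype("string")
    return out


df = pd.DataFrame({
    "name": pd.Series(["a", "b"], dtype=object),  # explicit object dtype
    "n": [1, 2],
})
converted = obj_to_string_sketch(df)
print(converted.dtypes)
```

The input frame is left untouched; only the returned copy carries the new dtype.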

function _factor_wrangler

Converts columns in cat_cols into CategoricalDtype.

Parameters:

  • data (pd.DataFrame): The data.
  • cat_cols (list of str): List of columns to convert to CategoricalDtype.
  • ordered_cols (list of str): List of categorical columns whose categories have an ordered relationship.
  • categories (dict of [str, int, float]): Dictionary with column names as keys and list of str, int, or float as values.
  • str_to_cat (bool): If True, converts all StringDtype columns to CategoricalDtype.
  • dummy_to_bool (bool): If True, converts all columns with integer [0, 1] values or float [0.0, 1.0] values into BooleanDtype.

Returns: A copy of the inputted Pandas DataFrame. Converts specified columns to CategoricalDtype, both ordered and unordered, and sets specified categorical columns' categories. All other columns' dtypes are unchanged.
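
To show how `cat_cols`, `ordered_cols`, and `categories` might interact, here is a sketch built on `pd.CategoricalDtype`. The function `factor_wrangler_sketch` is hypothetical, it omits `str_to_cat` and `dummy_to_bool`, and it should not be read as the repository's implementation.

```python
import pandas as pd


def factor_wrangler_sketch(data, cat_cols, ordered_cols=(), categories=None):
    """Hypothetical stand-in covering cat_cols, ordered_cols, and categories only."""
    categories = categories or {}
    out = data.copy()
    for col in cat_cols:
        dtype = pd.CategoricalDtype(
            categories=categories.get(col),  # None lets pandas infer the categories
            ordered=col in ordered_cols,
        )
        out[col] = out[col].astype(dtype)
    return out


df = pd.DataFrame({
    "size": ["small", "large", "medium"],
    "year": [2019, 2020, 2019],
    "x": [0.1, 0.2, 0.3],
})
out = factor_wrangler_sketch(
    df,
    cat_cols=["size", "year"],
    ordered_cols=["size"],
    categories={"size": ["small", "medium", "large"]},
)
print(out.dtypes)
```

Here `size` becomes an ordered categorical with the explicit category order, `year` becomes an unordered categorical with inferred categories, and `x` keeps its float dtype.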

function _check_model_assumptions

Empty function to be implemented.

function clean_data

Data preprocessing pipeline. Runs the following data wranglers on data:

  1. convert_dtypes
  2. _replace_na
  3. _column_wrangler
  4. _obj_wrangler
  5. _factor_wrangler
  6. _check_model_assumptions

Parameters:

  • data (pd.DataFrame): The data.
  • na_values (list of str, int, or float): List of values to replace with NA.
  • kwargs: Keyword arguments passed to _factor_wrangler.

Returns: The preprocessed data.

function encode_data

Transforms columns with unordered CategoricalDtype into dummy columns. Dummy columns are cast as BooleanDtype columns. Transforms columns with ordered CategoricalDtype into their category integer codes.

Parameters:

  • data (pd.DataFrame): The data.

Returns: The encoded data.
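
The two encoding rules can be sketched with `pd.get_dummies` and `Series.cat.codes`. The helper `encode_data_sketch` is hypothetical. Whether the real task drops a reference level or uses a different naming scheme for dummy columns is not stated here, so the sketch keeps every level and uses pandas' default `<column>_<category>` names.

```python
import pandas as pd


def encode_data_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: dummies for unordered, integer codes for ordered."""
    out = data.copy()
    for col in list(data.columns):
        dtype = data[col].dtype
        if not isinstance(dtype, pd.CategoricalDtype):
            continue  # non-categorical columns pass through unchanged
        if dtype.ordered:
            out[col] = data[col].cat.codes  # position of each value in the category order
        else:
            dummies = pd.get_dummies(data[col], prefix=col).astype("boolean")
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
    return out


df = pd.DataFrame({
    "colour": pd.Categorical(["red", "blue", "red"]),
    "size": pd.Categorical(
        ["small", "large", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    ),
    "x": [1.0, 2.0, 3.0],
})
encoded = encode_data_sketch(df)
print(encoded)
```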

function wrangle_na

Wrangles missing values. Five strategies are available: complete case ("cc"), fill-in ("fi"), fill-in with indicators ("fii"), grand model ("gm"), and MICE ("mice").

Parameters:

  • data (pd.DataFrame): The data.
  • strategy (str): Strategy to deal with missing values.
  • cols (list of str): columns to wrangle.

Returns: The data with missing data wrangled according to the specified strategy.
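
To make the simpler strategies concrete, the sketch below illustrates "cc", "fi", and "fii" only. It rests on several assumptions: that fill-in means replacing missing values with the column mean, that indicator columns are named `<column>_missing`, and that `cols` defaults to all columns. The function name `wrangle_na_sketch` is hypothetical, and the "gm" and "mice" strategies are not covered.

```python
import pandas as pd


def wrangle_na_sketch(data: pd.DataFrame, strategy: str, cols=None) -> pd.DataFrame:
    """Hypothetical stand-in for the 'cc', 'fi', and 'fii' strategies."""
    cols = list(cols) if cols is not None else list(data.columns)
    out = data.copy()
    if strategy == "cc":
        # Complete case: keep only rows with no missing values in cols.
        return out.dropna(subset=cols)
    if strategy in ("fi", "fii"):
        for col in cols:
            missing = out[col].isna()
            if strategy == "fii":
                out[col + "_missing"] = missing  # assumed indicator naming
            out[col] = out[col].fillna(out[col].mean())  # assumed mean fill-in
        return out
    raise ValueError(f"strategy {strategy!r} is not covered by this sketch")


df = pd.DataFrame({"y": [1.0, 2.0, 3.0], "x1": [1.0, None, 3.0]})
complete = wrangle_na_sketch(df, "cc")
filled = wrangle_na_sketch(df, "fii", cols=["x1"])
print(filled)
```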

function transform_data

Applies either a log or an inverse hyperbolic sine (arcsinh) transformation to data.

Parameters:

  • data (pd.DataFrame): The data.
  • cols (list of str): Columns to transform.
  • func (str): log transform ("log") or inverse hyperbolic sine transform ("arcsinh").

Returns: The data with transformation applied to specified columns.
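
The two options differ mainly in their domain: the natural log is defined only for positive values, while the inverse hyperbolic sine, arcsinh(x) = ln(x + sqrt(x^2 + 1)), is also defined at zero and for negative values. The sketch below shows this on plain Python lists using the standard library. The name `transform_sketch` is hypothetical, and the real task operates on DataFrame columns instead.

```python
import math


def transform_sketch(values, func):
    """Hypothetical stand-in: apply 'log' or 'arcsinh' element-wise."""
    funcs = {"log": math.log, "arcsinh": math.asinh}
    if func not in funcs:
        raise ValueError(f"unknown transformation: {func!r}")
    return [funcs[func](v) for v in values]


# arcsinh handles zero and negative inputs; math.log would raise on them.
print(transform_sketch([-1.0, 0.0, 1.0], "arcsinh"))
print(transform_sketch([1.0, 10.0], "log"))
```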

function gelman_standardize_data

Standardizes data by mean-centering each variable and dividing it by two standard deviations.

Parameters:

  • data (pd.DataFrame): The data.

Returns: The standardized data.
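
In formula terms, each value x becomes (x - mean) / (2 * sd). The sketch below applies this to a single list of numbers with the standard library. Using the sample standard deviation is an assumption, as is the hypothetical name `gelman_standardize_sketch`; how the real task treats binary or categorical columns is not specified here.

```python
import statistics


def gelman_standardize_sketch(values):
    """Hypothetical stand-in: centre on the mean, divide by two standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / (2 * sd) for v in values]


z = gelman_standardize_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
print(z)
```

After this rescaling the variable has mean 0 and standard deviation 0.5.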

Modelling:

function run_model

statsmodels linear regression implementation.

Parameters:

  • data (pd.DataFrame): The data.
  • y (str): Endogenous (outcome) variable.
  • X (list of str): Exogenous (feature) variables.

Returns: The fitted regression results (a statsmodels RegressionResultsWrapper).

Post-processing:

function plot_confidence_intervals

Given a fitted OLS model in statsmodels, returns a box and whisker regression coefficient plot.

Parameters:

  • res (RegressionResultsWrapper): regression results from statsmodels OLS.

Returns: An Altair chart containing a box-and-whisker plot of the regression coefficients' point estimates and confidence intervals.
