Python ML Deployment in PySpark Using Pandas UDFs

This repo includes a notebook that defines a versatile python function for deploying python ml in PySpark. Several examples demonstrate how python ml can be deployed in PySpark:

  • Deploying a RandomForestRegressor in PySpark
  • Deployment of ML Pipeline that scales numerical features
  • Deployment of ML Pipeline that is capable of preprocessing mixed feature types

Introducing the spark_predict function: a vessel for python ml deployment in PySpark

Making predictions in PySpark with sophisticated python ml is unlocked by the spark_predict function defined below.

spark_predict is a wrapper around a pandas_udf; the wrapper is what enables a fitted python ml model to be passed to the pandas_udf.

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql import Column
from pyspark.sql.types import DoubleType


def spark_predict(model, cols) -> Column:
    """Deploy python ml in PySpark using the `predict` method of `model`.

    Args:
        model: python ml model with an sklearn-style API (i.e. a `predict` method).
        cols (list-like): Features used for predictions, required to be present as
            columns in the spark DataFrame used to make predictions.
    """
    @sf.pandas_udf(returnType=DoubleType())
    def predict_pandas_udf(*cols):
        # cols will be a tuple of pandas.Series here, one per feature column.
        x = pd.concat(cols, axis=1)
        return pd.Series(model.predict(x))

    return predict_pandas_udf(*cols)
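
As a quick illustration of how spark_predict slots into a Spark workflow, here is a minimal sketch (not taken from the notebook) that fits a RandomForestRegressor locally on pandas data and then scores a Spark DataFrame; the DataFrame and the column names "x1", "x2" and "y" are hypothetical.

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.ensemble import RandomForestRegressor

spark = SparkSession.builder.getOrCreate()

# Fit locally on pandas data. `.values` is used so sklearn does not record feature
# names; the column labels seen inside the pandas_udf are not guaranteed to match
# the labels used at training time.
train_pdf = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 1.5, 2.5, 3.5],
    "y":  [1.1, 2.2, 2.9, 4.1],
})
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(train_pdf[["x1", "x2"]].values, train_pdf["y"])

# spark_predict returns a pyspark Column, so it slots straight into withColumn.
sdf = spark.createDataFrame(train_pdf.drop(columns="y"))
sdf = sdf.withColumn("prediction", spark_predict(model, ["x1", "x2"]))
sdf.show()

Under the hood the fitted model is pickled into the udf's closure and shipped to the executors, so any serialisable object exposing a predict method can be deployed this way.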

Python ML Deployment in Practice

The deploying-python-ml-in-pyspark notebook demonstrates how spark_predict can be used to deploy python ML in PySpark. It shows that spark_predict can handle simple ml models as well as more sophisticated pipelines.

I often use both categorical and numerical features in predictive models, so I have included an example built around an sklearn Pipeline designed to scale numerical data and encode categorical data. This particular pipeline combines two preprocessing pipelines with a random forest to create a full prediction pipeline that transforms categorical and numerical data and fits a model. And of course this pipeline is deployed in PySpark using the spark_predict function; a condensed sketch follows.
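
For reference, here is a condensed sketch of that kind of mixed-type pipeline; the column names ("age", "income", "city"), the toy training data and the hyperparameters are illustrative assumptions, not taken from the notebook.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

feature_cols = ["age", "income", "city"]

# Select columns by position so the pipeline does not depend on the column labels
# that the pandas.Series carry inside the pandas_udf.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), [0, 1]),                     # scale numerical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # one-hot encode categoricals
])

pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Hypothetical training data.
train_pdf = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [30000.0, 72000.0, 45000.0, 90000.0],
    "city": ["A", "B", "A", "C"],
    "target": [1.0, 3.2, 1.8, 4.5],
})
pipeline.fit(train_pdf[feature_cols].values, train_pdf["target"])

# Because the fitted pipeline exposes `predict`, it deploys exactly like a bare model:
# sdf = sdf.withColumn("prediction", spark_predict(pipeline, feature_cols))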

Requirements

See requirements.txt.

PySpark Installation

The code used in the deploying-python-ml-in-pyspark notebook requires a working PySpark installation. We leave the installation of PySpark to the user; one common route is noted below.
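
For completeness, one common route (an assumption on our part, not a requirement of this repo) is installing PySpark from PyPI, which also needs a compatible Java runtime available on the machine:

pip install pyspark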

Further Reading
