Python ML Deployment in PySpark Using Pandas UDFs

This repo includes a notebook that defines a versatile python function for deploying python ml in PySpark. Several examples demonstrate how python ml can be deployed in PySpark:

  • Deploying a RandomForestRegressor in PySpark
  • Deployment of ML Pipeline that scales numerical features
  • Deployment of ML Pipeline that is capable of preprocessing mixed feature types

Introducing the spark_predict function: a vessel for python ml deployment in PySpark

Making predictions in PySpark with sophisticated python ml is unlocked by the spark_predict function defined below.

spark_predict is a wrapper around a pandas_udf; the wrapper is what enables a fitted python ml model to be passed to the pandas_udf.

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql import Column
from pyspark.sql.types import DoubleType


def spark_predict(model, cols) -> Column:
    """Deploy python ml in PySpark using the `predict` method of `model`.

    Args:
        model: python ml model with an sklearn-style API (i.e. a `predict` method).
        cols (list-like): Features used for predictions, required to be present as
            columns in the spark DataFrame used to make predictions.
    """
    @sf.pandas_udf(returnType=DoubleType())
    def predict_pandas_udf(*cols):
        # cols will be a tuple of pandas.Series here, one per feature column.
        x = pd.concat(cols, axis=1)
        return pd.Series(model.predict(x))

    return predict_pandas_udf(*cols)
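
As a quick illustration of how spark_predict slots into a Spark workflow, here is a minimal sketch (not taken from the notebook) that fits a RandomForestRegressor locally on pandas data and then scores a Spark DataFrame; the DataFrame and the column names "x1", "x2" and "y" are hypothetical.

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.ensemble import RandomForestRegressor

spark = SparkSession.builder.getOrCreate()

# Fit locally on pandas data. `.values` is used so sklearn does not record feature
# names; the column labels seen inside the pandas_udf are not guaranteed to match
# the labels used at training time.
train_pdf = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 1.5, 2.5, 3.5],
    "y":  [1.1, 2.2, 2.9, 4.1],
})
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(train_pdf[["x1", "x2"]].values, train_pdf["y"])

# spark_predict returns a pyspark Column, so it slots straight into withColumn.
sdf = spark.createDataFrame(train_pdf.drop(columns="y"))
sdf = sdf.withColumn("prediction", spark_predict(model, ["x1", "x2"]))
sdf.show()

Under the hood the fitted model is pickled into the udf's closure and shipped to the executors, so any serialisable object exposing a predict method can be deployed this way.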

Python ML Deployment in Practice

The deploying-python-ml-in-pyspark notebook demonstrates how spark_predict can be used to deploy python ML in PySpark. It shows that spark_predict can handle simple ml models as well as more sophisticated pipelines.

I often use both categorical and numerical features in predictive models, so I have included an example built around an sklearn Pipeline designed to scale numerical data and encode categorical data. This particular pipeline combines two preprocessing pipelines with a random forest to create a full prediction pipeline that transforms categorical and numerical data and fits a model. And of course this pipeline is deployed in PySpark using the spark_predict function; a condensed sketch follows.
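
For reference, here is a condensed sketch of that kind of mixed-type pipeline; the column names ("age", "income", "city"), the toy training data and the hyperparameters are illustrative assumptions, not taken from the notebook.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

feature_cols = ["age", "income", "city"]

# Select columns by position so the pipeline does not depend on the column labels
# that the pandas.Series carry inside the pandas_udf.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), [0, 1]),                     # scale numerical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # one-hot encode categoricals
])

pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Hypothetical training data.
train_pdf = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [30000.0, 72000.0, 45000.0, 90000.0],
    "city": ["A", "B", "A", "C"],
    "target": [1.0, 3.2, 1.8, 4.5],
})
pipeline.fit(train_pdf[feature_cols].values, train_pdf["target"])

# Because the fitted pipeline exposes `predict`, it deploys exactly like a bare model:
# sdf = sdf.withColumn("prediction", spark_predict(pipeline, feature_cols))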

Requirements

See requirements.txt.

PySpark Installation

The code used in the deploying-python-ml-in-pyspark notebook requires a working PySpark installation. We leave the installation of PySpark to the user; one common route is noted below.
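
For completeness, one common route (an assumption on our part, not a requirement of this repo) is installing PySpark from PyPI, which also needs a compatible Java runtime available on the machine:

pip install pyspark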

Further Reading
