pyspark XGBoostSageMakerEstimator fails on .fit() #142

Open
torsjonas opened this issue Nov 23, 2021 · 1 comment
Comments


torsjonas commented Nov 23, 2021

Please fill out the form below.

System Information

  • Spark or PySpark: pyspark
  • SDK Version: latest (pip version 1.4.2), EMR 5.23.0
  • Spark Version: 2.4.0
  • Algorithm (e.g. KMeans): XGBoost

Describe the problem

Since version 1.4.2, the pyspark XGBoostSageMakerEstimator wrapper class no longer matches the corresponding Scala class, producing an error in the PySpark-to-JVM communication (during serialization of the Python class) when the PySpark fit function is called. Specifically, it looks like the property lambda was renamed to lambda_weights without a corresponding change in the Scala class.
https://github.com/aws/sagemaker-spark/pull/135/files#diff-ac899a7e58823fff725d351c8459435bb2f09a9687097cd47d3ec34741eb4156R179

It looks like the 1.4.2 release also bumps the Spark version from 2.2.0 to 2.4.0.

I can see a couple of workarounds. One is downgrading EMR to 5.10.1, the latest version that ships Spark 2.2.0, but I do not want to do this because EMR 5.10.1 does not support Jupyter notebooks (Jupyter support was not added until EMR 5.18.0) and I don't want to run Zeppelin notebooks. Another workaround is to sidestep PySpark completely and use the Scala Spark SageMaker integration instead of the PySpark variant.
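A simpler mitigation, assuming the rename only landed in 1.4.2 as described above, would be to pin the package to the last release before it in the cluster bootstrap step, for example via a requirements pin:

```
# requirements.txt — pin to the last sagemaker_pyspark release
# before the 1.4.2 lambda -> lambda_weights rename (assumption)
sagemaker_pyspark<1.4.2
```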

Minimal repro / logs

This fails with the following error:

Param Param(parent='Identifiable_66065fac1a12', name='lambda', doc='L2 regularization term on weights, increase this value will make model more conservative.') does not belong to Identifiable_66065fac1a12.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sagemaker_pyspark/SageMakerEstimator.py", line 256, in fit
    self._transfer_params_to_java()

Probably, the PySpark communication with Java fails because the pyspark XGBoostSageMakerEstimator class renamed a property previously named lambda to lambda_weights in a recent change, but the Scala class was not updated accordingly.
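To illustrate the mismatch: the error is raised when the Python wrapper tries to transfer a Param whose name the Scala side no longer recognizes. The sketch below models that transfer step with hypothetical stand-in classes (Param, JavaSideEstimator, transfer_params are not the real pyspark/sagemaker_pyspark types) and shows how skipping unrecognized params would avoid the failure:

```python
# Illustrative sketch only — stand-in classes model the Python/Scala
# param mismatch; all names here are hypothetical, not library API.

class Param:
    def __init__(self, name):
        self.name = name

class JavaSideEstimator:
    """Stand-in for the Scala estimator reached through the JVM gateway."""
    def __init__(self, param_names):
        self._param_names = set(param_names)

    def hasParam(self, name):
        return name in self._param_names

def transfer_params(python_params, java_estimator):
    """Transfer only params the Java side recognizes; a Python-side Param
    that was renamed out from under the Scala class gets skipped instead
    of raising 'does not belong to ...'."""
    transferred, skipped = [], []
    for p in python_params:
        if java_estimator.hasParam(p.name):
            transferred.append(p.name)
        else:
            skipped.append(p.name)
    return transferred, skipped

# Python wrapper defines 'lambda_weights'; Scala class still expects 'lambda'.
python_params = [Param("objective"), Param("num_round"), Param("lambda_weights")]
java_est = JavaSideEstimator({"objective", "num_round", "lambda"})
ok, skipped = transfer_params(python_params, java_est)
# ok == ['objective', 'num_round'], skipped == ['lambda_weights']
```

The actual library would need the rename applied consistently on both sides; this only demonstrates why the current transfer fails.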

  • Exact command to reproduce:
Start an EMR 5.23.0 cluster with a cluster bootstrap action to pip install sagemaker_pyspark. Attach an EMR Notebook (JupyterLab PySpark kernel) and execute the following notebook code:
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

region = "eu-west-1"
training_data = (
    spark.read.format("libsvm")
    .option("numFeatures", "784")
    .load("s3a://sagemaker-sample-data-{}/spark/mnist/train/".format(region))
)
model_role_arn = "SOME_ROLE_ARN"

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(model_role_arn))

xgboost_estimator.setObjective('multi:softmax')
xgboost_estimator.setNumRound(25)
xgboost_estimator.setNumClasses(10)

xgboost_model = xgboost_estimator.fit(training_data)
@torsjonas torsjonas changed the title pyspark XGBoostSageMakerEstimator.py does not match scala XGBoostSageMakerEstimator.scala pyspark XGBoostSageMakerEstimator fails on .fit() Nov 24, 2021
@Karrthik-Arya

I am also facing the same issue.
