Clarify job fails in spark mode #127

Open
lorenzwalthert opened this issue Aug 1, 2022 · 7 comments

@lorenzwalthert

lorenzwalthert commented Aug 1, 2022

Thanks for this project. For my project, I need to configure some elements of the Clarify processing, which would require the respective Dockerfiles to be available for modification. More concretely, I am facing timeouts in the endpoint calls due to a very high max batch size/max payload and a slow model, but only when the Apache Spark integration is used, i.e. instance_count > 1. In that case, the max payload is for some reason much higher than when Spark integration is disabled, leading to longer response times per batch. Choosing more instances, or a bigger or more powerful instance, behind the endpoint does not solve the problem.

Can you open-source the Dockerfiles? This would be very beneficial.

In addition, sagemaker.clarify.SageMakerClarifyProcessor() should accept an optional image_uri argument so I can supply my custom image, but that I can also solve myself by forking the sagemaker SDK and creating a PR.

@lorenzwalthert changed the title from "Open-source clarify docker files" to "Open-source clarify docker files for processing" on Aug 1, 2022
@keerthanvasist
Member

Thank you for your question!

For the batch size in endpoint calls, Clarify has a system for figuring out the optimal batch size. If all else fails, we end up with a single instance (record) per request. I am curious why you think your endpoint calls failed.

I would like to understand your concern. Is it that your Clarify jobs are failing, or that the job is slower than you expected? If they are failing, can you please share the error stack trace? If it is just warnings in the logs, you generally shouldn't have to worry about them.

Also, you can specify the image_uri this way:

clarify_processor.image_uri = <your_image_uri>
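
For context, a minimal sketch of what that override could look like; the role ARN, image URI, and instance settings below are placeholders, and it assumes the attribute set after construction is picked up by the subsequent run_* call, as suggested above:

from sagemaker.clarify import SageMakerClarifyProcessor

# Placeholder role and instance settings for illustration.
clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Override the default Clarify container image before calling run_explainability()
# or another run_* method (assumption: the attribute is honored at job launch).
clarify_processor.image_uri = "111122223333.dkr.ecr.eu-central-1.amazonaws.com/my-clarify:latest"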

@xgchena
Contributor

xgchena commented Aug 1, 2022

When you enable Spark integration, could you also increase the shadow endpoint instance count (the instance_count parameter of sagemaker.clarify.ModelConfig)? We recommend a 1:1 ratio between the processing instance count and the endpoint instance count. A minimal sketch of a matched configuration follows below.
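
For illustration only (the model name and instance type below are placeholders, not taken from your job):

from sagemaker.clarify import ModelConfig

processing_instance_count = 2  # instance_count passed to SageMakerClarifyProcessor

# Keep the shadow endpoint at a 1:1 ratio with the processing instances.
model_config = ModelConfig(
    model_name="my-model",                     # placeholder model name
    instance_type="ml.m5.xlarge",              # placeholder endpoint instance type
    instance_count=processing_instance_count,  # match the processing instance count
)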

This repo (amazon-sagemaker-clarify) is one of the core libraries used by the SageMaker Clarify processing container, and the SageMakerClarifyProcessor API is designed to launch that container, which is Amazon proprietary. If you want to launch your own processing container, the generic Processor API is a better choice; a rough sketch follows below.
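
As an illustration only (the role, image URI, bucket names, and paths are placeholders), launching a custom container with the generic Processor could look roughly like this:

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role
    image_uri="111122223333.dkr.ecr.eu-central-1.amazonaws.com/my-processing:latest",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    inputs=[ProcessingInput(source="s3://my-bucket/input",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/output")],
)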

@lorenzwalthert
Author

Thanks very much @keerthanvasist and @xgchena, I will follow up on your posts soon to provide more details.

@lorenzwalthert
Author

lorenzwalthert commented Aug 2, 2022

Thanks for the swift response (I wish it were like that in all AWS (SageMaker) repos). My Clarify job worked when I used an instance count of 1 for both the processing job and the endpoint. I got various warnings in the log saying the batch size was reduced:

2022-08-02 11:54:45,071 Prediction batch size is reduced to 1100 to fit max payload size limit.

The Clarify job completed but was slow (~17 h). Checking in CloudWatch, the model latency never exceeded one minute and was in the range of 250k microseconds.

Increasing the number of workers in the endpoint did not speed up the calculation but reduced the CPU load on each worker (judging from the CloudWatch metrics). This made me believe that the requests are blocking and that the single processing worker sends requests to the different endpoint instances one at a time.

Hence, the next natural step was to increase the number of workers in the processing job. According to the docs, this means that Apache Spark is now leveraged and it's recommended to use a 1:1 ratio of endpoint to processing instances (as @xgchena just pointed out above):

Specifically, we recommend that you use a one-to-one ratio of endpoint to processing instances.

I did that (e.g. 2 workers on both ends), and some variations of it, but I always get errors there too (even after the batch size is reduced to 1, the latency does not go down) and the job does not complete. I believe the relevant errors (from the logs below) are always of this form:


INFO:analyzer.predictor:Model endpoint delivered 5.94108 requests per second and a total of 2 requests over 0 seconds
INFO:analyzer.shap.spark_shap_analyzer:model output size is 3
INFO:analyzer.shap.spark_shap_analyzer:data frame has 64 partition(s).
12:10:04.388 [task-result-getter-1] WARN  o.a.spark.scheduler.TaskSetManager - Lost task 12.0 in stage 25.0 (TID 225) (algo-2 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1659441581501_0001/container_1659441581501_0001_01_000002/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1659441581501_0001/container_1659441581501_0001_01_000002/pyspark.zip/pyspark/worker.py", line 594, in process
    out_iter = func(split_index, iterator)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 418, in func
  File "/usr/local/lib/python3.9/site-packages/analyzer/shap/spark_shap_analyzer.py", line 108, in __call__
    explainer = KernelExplainer(
  File "/usr/local/lib/python3.9/site-packages/explainers/shap/kernel_shap.py", line 240, in __init__
    model_out = self.model(self.bg_dataset) if bg_dataset_model_out is None else bg_dataset_model_out
  File "/usr/local/lib/python3.9/site-packages/analyzer/shap/spark_shap_analyzer.py", line 78, in __predict
    prediction = predictor.predict_proba(data)
  File "/usr/local/lib/python3.9/site-packages/analyzer/predictor.py", line 490, in predict_proba
    predicted_labels = self.__predict(data, self.__extract_predicted_score)
  File "/usr/local/lib/python3.9/site-packages/analyzer/predictor.py", line 636, in __predict
    raise e
  File "/usr/local/lib/python3.9/site-packages/analyzer/predictor.py", line 589, in __predict
    prediction = self.__do_predict(
  File "/usr/local/lib/python3.9/site-packages/analyzer/predictor.py", line 677, in __do_predict
    inference = self.predictor.predict(data, self._initial_args, self._target_model)
  File "/usr/local/lib/python3.9/site-packages/sagemaker/predictor.py", line 161, in predict
    response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 391, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 719, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation (reached max retries: 0): Received server error (504) from primary with message "<html>#015
<head><title>504 Gateway Time-out</title></head>#015
<body bgcolor="white">#015
<center><h1>504 Gateway Time-out</h1></center>#015
<hr><center>nginx/1.14.0 (Ubuntu)</center>#015
</body>#015
</html>#015
". See https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#logEventViewer:group=/aws/sagemaker/Endpoints/sm-clarify-sagemaker-scikit-learn-2022-07-30-08-1659441630-2305 in account 982361546614 for more information.

The endpoints had many errors of this form:

2022/08/02 12:50:37 [error] 19#19: *1834 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock/ping", host: "169.254.180.2:8080"

I checked CloudWatch and I see that the average model latency is much higher with Spark enabled. The maximum latency quickly goes up to 60 s, which I believe is probably causing the timeout. Below I plot the average latency. I also noted that the warnings about the maximum payload don't occur in this case.

This led me to believe that I should choose a more powerful machine behind the endpoint to bring down the latency. I tried replacing my initial value ml.m5.xlarge with ml.m5.2xlarge and ml.m5.4xlarge, with no luck. My model is a custom scikit-learn model that combines two HistGradientBoostingClassifiers and hence is not very fast (but still reasonably fast, I believe). Since the latency did not fall even when the payload was reduced to one observation (the mechanism you mentioned above, @keerthanvasist), I am not sure what the problem is now. The graphs below extend over the whole runtime.

[Screenshots (2022-08-02): CloudWatch model latency graphs over the whole job runtime]

Logs

Here's my Python script (it won't run on your machine because of account- and machine-specific dependencies):
clarify-2-2.py.zip

Would be great if you could help me solve this.


This repo (amazon-sagemaker-clarify) is one of the core libraries used by SageMaker Clarify processing container, and SageMakerClarifyProcessor API is designed to launch the container which is Amazon proprietary. If you want to launch your own processing container, then the generic Processor API is a better choice.

Noted. But I think my use case is just the regular use case, so I prefer to understand why it fails instead of building my own solution with SageMaker Processing.

Also, you can specify the image_uri this way:

Thanks, but without open-sourcing the container, I don't think it makes sense to use this. Also, I think my use case is the regular use case, so it should be fixed upstream rather than me building my own container.

Is it that your clarify jobs are failing, or that the job is slower than you expected?

As I hope becomes clear from the description above, it fails (the SageMaker job does not complete).

@larroy
Contributor

larroy commented Aug 4, 2022

Thanks, Lorenz, for the great description. We are looking at this with the team and will get back to you shortly. Just to let you know, you can set ANALYZER_USE_SPARK=1 in the environment to use the Spark implementation even on a single instance if you want.
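
A minimal sketch of how that could be set, assuming the env argument that SageMakerClarifyProcessor inherits from the generic Processor is forwarded to the job's container environment (role and instance settings are placeholders):

from sagemaker.clarify import SageMakerClarifyProcessor

clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role
    instance_count=1,                                       # single instance
    instance_type="ml.m5.xlarge",
    env={"ANALYZER_USE_SPARK": "1"},  # opt into the Spark implementation anyway
)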

@lorenzwalthert changed the title from "Open-source clarify docker files for processing" to "Clarify job fails in spark mode" on Aug 9, 2022
@lorenzwalthert
Author

lorenzwalthert commented Aug 9, 2022

Thanks @larroy, appreciate it. I updated the title to reflect the discussion.

@xgchena
Contributor

xgchena commented Jan 10, 2023

Cross-reference https://tiny.amazon.com/8lwa4yrv
