Resource exhausted error #1740

I'm trying to send audio files, which are fairly large, to the server and am getting a resource exhausted error. Is there any way to configure the server to increase the maximum allowed message size?

Here's the stack trace:

Comments
@lminer this looks like a problem with gRPC. Looking at this, it appears that this limit can be configured. I raised the limit to 128 MB on a separate branch. To test the change from that branch, you only need to modify your API spec. For more context, this is the commit we made for the above solution. Needless to say, this is a temporary fix; if it works for you, we'll address it properly in one of the next releases. Please let us know how it goes.
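(For context, gRPC's default maximum message length is 4 MB. The snippet below is only an illustrative sketch of how that cap is typically raised on a plain Python gRPC channel via channel options; it is not the exact change made in Cortex, and the target address is a placeholder.)

```python
# Illustrative sketch only: raising the gRPC message-size cap on a plain Python
# gRPC channel via channel options. Not the exact Cortex change; the target
# address below is a placeholder for a TFS gRPC endpoint.
import grpc

MAX_MESSAGE_BYTES = 128 * 1024 * 1024  # 128 MB, matching the value mentioned above

channel = grpc.insecure_channel(
    "localhost:9000",
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_BYTES),
        ("grpc.max_receive_message_length", MAX_MESSAGE_BYTES),
    ],
)
```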
@RobertLucian thanks for this! Do you have a version with CUDA 11.0 and cuDNN 8? I'm testing this out for deployment on a GPU machine.
@lminer yes, we have a version with CUDA 11.0 and cuDNN 8, but it's for the Python Predictor image. For the TensorFlow Predictor, two images are required: one for the TFS server (which needs access to the GPU device) and one for the API server. At the moment, the TFS server is configured to use CUDA 10.1 (and presumably cuDNN 7). Here's its Dockerfile for the GPU version: cortex/images/tensorflow-serving-gpu/Dockerfile, lines 1 to 12, at c8da085.

Let us know if you want to follow this path, in which case we can help you along the way!
Up to now, I've been trying to use the TensorFlow Predictor path. Do you have any TFS and TensorFlow Predictor images for TensorFlow 2.4.0 (CUDA 11.0, cuDNN 8) that work with a higher limit? Here's my cortex.yaml:

```yaml
- name: foo
  kind: RealtimeAPI
  predictor:
    type: tensorflow
    path: serving/cortex_server.py
    models:
      path: ../foo/
      signature_key: serving_default
    image: quay.io/cortexlabs/python-predictor-gpu-slim:0.25.0-cuda11.0-cudnn8
    tensorflow_serving_image: quay.io/cortexlabs/tensorflow-serving-gpu:0.25.0
  compute:
    gpu: 1
```
@lminer one immediate problem I notice with your cortex.yaml is that for the TensorFlow Predictor, the `image` field must point to a TensorFlow Predictor image. In your case, it seems to be set to the Python Predictor image instead. Now, let's re-reference this Dockerfile: cortex/images/tensorflow-serving-gpu/Dockerfile, lines 1 to 12, at c8da085.
This is the Dockerfile for the tensorflow-serving-gpu image. The upgraded Dockerfile would look like this:

```dockerfile
FROM tensorflow/serving:2.4.0-gpu

RUN apt-get update -qq && apt-get install -y --no-install-recommends -q \
    libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0 \
    curl \
    && apt-get clean -qq && rm -rf /var/lib/apt/lists/*

COPY images/tensorflow-serving-gpu/run.sh /src/
RUN chmod +x /src/run.sh
ENTRYPOINT ["/src/run.sh"]
```

I already built this image for you; you can pull it from quay.io/robertlucian/cortex-tensorflow-serving-gpu-tf2.4:0.25.0.

Let me know if this worked for you!
@RobertLucian Thanks so much for this! What do I use for the `image` field?
@lminer you would not need a GPU-enabled image for the `image` field. That means that inferences will take place in the container represented by the `tensorflow_serving_image`. Could you also share your TensorFlow Predictor implementation? I might be able to give you a few pointers if needed.
Ah. I didn't quite understand the breakdown of responsibilities there. I guess the cuDNN-not-found warning I saw was a red herring, since the API container doesn't use the GPU anyway. That said, I'm still seeing the resource exhaustion error. The GPU does appear to be used before the error, so maybe the exhaustion occurs when the server returns large values. Here's my current cortex.yaml:

```yaml
- name: foo
  kind: RealtimeAPI
  predictor:
    type: tensorflow
    path: serving/cortex_server.py
    models:
      path: ../foo/
      signature_key: serving_default
    tensorflow_serving_image: quay.io/robertlucian/cortex-tensorflow-serving-gpu-tf2.4:0.25.0
  compute:
    gpu: 1
```

Here's my TensorFlow Predictor implementation:

```python
import numpy as np


class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        self.client = tensorflow_client
        self.config = config

    def predict(self, payload, query_params, headers):
        target, residual = self.client.predict(
            {"waveform": np.array(payload["audio"]).astype("float32")}
        )
        return {"target": target.numpy().tolist(), "residual": residual.numpy().tolist()}
```

And here's the error again:

```
2020-12-30 22:49:04.428932:cortex:pid-1558:INFO:500 Internal Server Error POST /
2020-12-30 22:49:04.429214:cortex:pid-1558:ERROR:Exception in ASGI application
Traceback (most recent call last):
File "/opt/conda/envs/env/lib/python3.6/site-packages/uvicorn/protocols/http/httptools_impl.py", line
390, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
return await self.app(scope, receive, send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/fastapi/applications.py", line 181, in __call__
await super().__call__(scope, receive, send) # pragma: no cover
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/applications.py", line 111, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/errors.py", line 181, in __call__
raise exc from None
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/errors.py", line 159, in __call__
await self.app(scope, receive, _send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/base.py", line 25, in __call__
response = await self.dispatch_func(request, self.call_next)
File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/serve/serve.py", line 187, in parse_payload
return await call_next(request)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/base.py", line 45, in call_next
task.result()
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/base.py", line 38, in coro
await self.app(scope, receive, send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/base.py", line 25, in __call__
response = await self.dispatch_func(request, self.call_next)
File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/serve/serve.py", line 134, in register_request
response = await call_next(request)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/base.py", line 45, in call_next
task.result()
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/middleware/base.py", line 38, in coro
await self.app(scope, receive, send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/exceptions.py", line 82, in __call__
raise exc from None
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/exceptions.py", line 71, in __call__
await self.app(scope, receive, sender)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/routing.py", line 566, in __call__
await route.handle(scope, receive, send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/routing.py", line 227, in handle
await self.app(scope, receive, send)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/routing.py", line 41, in app
response = await func(request)
File "/opt/conda/envs/env/lib/python3.6/site-packages/fastapi/routing.py", line 183, in app
dependant=dependant, values=values, is_coroutine=is_coroutine
File "/opt/conda/envs/env/lib/python3.6/site-packages/fastapi/routing.py", line 135, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/opt/conda/envs/env/lib/python3.6/site-packages/starlette/concurrency.py", line 34, in run_in_threadpool
return await loop.run_in_executor(None, func, *args)
File "/opt/conda/envs/env/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/serve/serve.py", line 200, in predict
prediction = predictor_impl.predict(**kwargs)
File "/mnt/project/serving/cortex_server.py", line 11, in predict
{"waveform": np.array(payload["audio"]).astype("float32")}
File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/lib/client/tensorflow.py", line
114, in predict
return self._run_inference(model_input, consts.SINGLE_MODEL_NAME, model_version)
File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/lib/client/tensorflow.py", line
164, in _run_inference
return self._client.predict(model_input, model_name, model_version)
File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/lib/model/tfs.py", line 376, in
predict
response_proto = self._pred.Predict(prediction_request, timeout=timeout)
File "/opt/conda/envs/env/lib/python3.6/site-packages/grpc/_channel.py", line 826, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/opt/conda/envs/env/lib/python3.6/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.RESOURCE_EXHAUSTED
details = "Received message larger than max (102484524 vs. 4194304)"
debug_error_string = "{"created":"@1609368544.425429771","description":"Received message larger than max (102484524 vs. 4194304)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":203,"grpc_status":8}"
>
```
@RobertLucian Any idea what's going on? Would love to get this working.
@lminer your cortex.yaml should look like this:

```yaml
- name: foo
  kind: RealtimeAPI
  predictor:
    type: tensorflow
    ...
    image: quay.io/robertlucian/tensorflow-predictor:0.25.0-tfs
    tensorflow_serving_image: quay.io/robertlucian/cortex-tensorflow-serving-gpu-tf2.4:0.25.0
  compute:
    gpu: 1
```

If you've already done this and it still doesn't work, then maybe you can also share the model with me (you can email me at robert@cortexlabs.com to keep this private) so I can test that as well.
I'm still getting the same error message when I add the `image` field.
@lminer I see. We'll look into this then, and I'll keep you posted. Thanks for bringing this up to us.
@lminer we have fixed this error in #1769. The fix is present in a customized version of 0.25 made specifically for you - the change is now available in the image below:

```yaml
- name: foo
  kind: RealtimeAPI
  predictor:
    type: tensorflow
    ...
    image: quay.io/robertlucian/tensorflow-predictor:0.25.0-tfs
```

This fix will get into 0.27, or if you need it in 0.26, we can make a patch release for you. For reference, I tested this on a sound recognition model with big inputs/outputs (up to 256 MB). Also, the gRPC limit has been increased to 256 MB. Do you think your payloads will get larger than that? We may also consider making this limit configurable in the API spec down the road, or sooner if it's really needed.
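(As a rough way to judge whether a given audio payload stays under that cap, assuming float32 tensors dominate the gRPC message size and ignoring protobuf framing overhead:)

```python
# Back-of-the-envelope size estimate for the gRPC response, assuming two
# float32 output tensors ("target" and "residual") of the same length as the
# input waveform; protobuf overhead is ignored.
import numpy as np


def approx_response_size_mb(num_samples: int, num_outputs: int = 2) -> float:
    bytes_per_value = np.dtype("float32").itemsize  # 4 bytes
    return num_samples * num_outputs * bytes_per_value / (1024 ** 2)


# e.g. ~10 minutes of 44.1 kHz mono audio returned as two tensors:
print(approx_response_size_mb(10 * 60 * 44_100))  # ~202 MB, still under 256 MB
```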
Thanks, it works! I don't anticipate going higher than 256 MB. Do you have a tentative timeline for when 0.27 is coming out?
@lminer that's great news! We plan to release 0.27 on the 19th of January (about 2 weeks from now).
@RobertLucian Have you looked into whether there's any slowdown incurred by passing such a large input to tensorflow-serving? I'm currently trying to figure out why inference is so slow. Right now inference takes 3 minutes, and for most of that time both the CPU and GPU are idle; the GPU is busy for at most 6 seconds. It makes me wonder whether that limit was put there for a reason.
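(One way to narrow this down, sketched under the assumption that the predictor looks like the one posted above, is to time the payload decoding separately from the TFS round trip:)

```python
# Sketch: same predictor as above, with timing around the two expensive steps
# (JSON list -> ndarray conversion vs. the gRPC/TFS round trip). Logging via
# print is arbitrary; adjust as needed.
import time

import numpy as np


class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        self.client = tensorflow_client
        self.config = config

    def predict(self, payload, query_params, headers):
        t0 = time.time()
        waveform = np.array(payload["audio"]).astype("float32")
        t1 = time.time()
        target, residual = self.client.predict({"waveform": waveform})
        t2 = time.time()
        print(f"decode: {t1 - t0:.2f}s, TFS round trip: {t2 - t1:.2f}s")
        return {"target": target.numpy().tolist(), "residual": residual.numpy().tolist()}
```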
@lminer I did not. But I remember trying with 140 MB payloads for something like #1770, and it would take a few seconds to run the inference - about 10 seconds from what I recall. Assuming there's something wrong with gRPC + TFS, I wonder if this is in any way related to tensorflow/serving#1725. I think the way to rule out a problem with gRPC + TFS is to run the model as a Python Predictor.
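(For reference, a minimal sketch of what the Python Predictor route could look like; the config key, model directory, and the assumption that the SavedModel's serving_default signature takes a single `waveform` input are placeholders based on the predictor shown earlier:)

```python
# Minimal sketch of a Python Predictor that runs the SavedModel in-process,
# bypassing gRPC/TFS entirely. Assumes config["model_dir"] points to a local
# copy of the SavedModel and that its serving_default signature takes a single
# "waveform" input, as in the TensorFlow Predictor above.
import numpy as np
import tensorflow as tf


class PythonPredictor:
    def __init__(self, config):
        loaded = tf.saved_model.load(config["model_dir"])
        self._infer = loaded.signatures["serving_default"]

    def predict(self, payload):
        waveform = np.array(payload["audio"]).astype("float32")
        outputs = self._infer(waveform=tf.constant(waveform))
        return {name: tensor.numpy().tolist() for name, tensor in outputs.items()}
```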
@RobertLucian I've been trying to get this working with the Python Predictor, but I'm running into an issue where the session information from when I load the model in the constructor is lost at predict time. The solution seems to be to convert everything to graph mode, but unfortunately that isn't so easy given my code. Do you know any other way to get around this error?
@lminer can you tell us if you're running multiple threads per process (the `threads_per_process` field)? If setting the threads to 1 doesn't help, it may also be worth trying an older TensorFlow version.
Everything is set to the default: 1. Do you know what version I would need to downgrade to? I'm currently on 2.4.0.
@RobertLucian so I managed to solve this by using the tensorflow-predictor, but reading from S3 directly in the model itself. Now the model is significantly faster, basically equivalent to local tests. So I think the problem is passing large payloads via gRPC.
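(A sketch of what that approach can look like, assuming a TensorFlow build with S3 filesystem support and an existing separation model that returns a target/residual pair; all names here are placeholders rather than the exact implementation:)

```python
# Sketch: export a serving signature that takes a short S3 URI instead of the
# raw waveform, so the gRPC request stays tiny and the audio is read inside the
# graph. Assumes the TF build has S3 filesystem support and `separation_model`
# returns (target, residual); every name here is a placeholder.
import tensorflow as tf


class S3AudioModel(tf.Module):
    def __init__(self, separation_model):
        super().__init__()
        self.model = separation_model

    @tf.function(input_signature=[tf.TensorSpec([], tf.string, name="s3_path")])
    def serve(self, s3_path):
        audio_bytes = tf.io.read_file(s3_path)          # supports s3:// URIs
        waveform, _ = tf.audio.decode_wav(audio_bytes)  # [samples, channels] float32
        target, residual = self.model(tf.squeeze(waveform, axis=-1))
        return {"target": target, "residual": residual}


# wrapped = S3AudioModel(separation_model)
# tf.saved_model.save(wrapped, "export/1", signatures={"serving_default": wrapped.serve})
```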
Yeah, it's 40 MB.
I'll go ahead and close this issue, since the gRPC resource exhausted error has been resolved and will be released in 0.27 (next week). #1774 will remain open as we investigate whether there is an avoidable issue causing the slowdown.