
Response caching GPU tensors #7140

Open
rahchuenmonroe opened this issue Apr 19, 2024 · 1 comment
Labels
question (Further information is requested)

Comments

@rahchuenmonroe

According to your docs, only input tensors located in CPU memory are hashable for accessing the cache, and only responses whose output tensors are all located in CPU memory are eligible for caching.

Does this mean that if a model runs on GPU, its responses cannot be cached, since their outputs are on GPU? If that's the case, I think it would be great if tensors located on GPU could also be cached, since many of the models running on Triton run on GPU.
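For reference, response caching is opt-in. A minimal sketch, assuming a local cache backend and a hypothetical model repository layout (names and sizes are placeholders, not from the docs quoted above):

```
# config.pbtxt (sketch): opt this model into response caching
response_cache {
  enable: true
}
```

```
# Allocate a 64 MB local response cache when starting the server
tritonserver --model-repository=/models --cache-config=local,size=67108864
```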

@rmccorm4
Collaborator

rmccorm4 commented Apr 30, 2024

Hi @rahchuenmonroe,

This applies to input/output tensors within Triton core, before and after the model execution in the backend. If you are communicating with Triton over the network (HTTP/GRPC), then all request and response tensors will be on CPU when going through Triton by default.

  • Using CUDA shared memory is a different story, but assumes client/server are co-located
  • Backends that execute the model on GPU will handle copying the data to/from CPU

So, long story short: if you're talking to Triton over the network without using shared memory (and therefore communicating tensors over CPU), you can likely cache the responses even if they come from a model running on GPU. This covers the large majority of use cases.
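To make that concrete, here is a minimal sketch of the common path, using the Python HTTP client (the model and tensor names are hypothetical). The tensors cross the network and sit in CPU memory inside Triton core, so the response stays cache-eligible even though the model itself executes on GPU:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server over HTTP; request/response tensors travel
# through CPU memory inside Triton core, so the response cache applies.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical model/tensor names -- substitute your own model config.
input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(model_name="my_gpu_model", inputs=[input0])
print(result.as_numpy("OUTPUT0"))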

If you are using Triton in-process or using CUDA shared memory and passing Triton tensors that are already on GPU, then caching of those tensors is not currently supported.
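For contrast, here is a sketch of the unsupported path: registering a CUDA shared-memory region for an input (names and sizes are hypothetical), which keeps the tensor on GPU all the way into Triton core and therefore bypasses the cache:

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input tensor: 1x16 FP32 = 64 bytes.
input_data = np.random.rand(1, 16).astype(np.float32)
byte_size = input_data.nbytes

# Create a CUDA shared-memory region on GPU 0 and copy the input into it.
shm_handle = cudashm.create_shared_memory_region("input0_data", byte_size, 0)
cudashm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with the server.
client.register_cuda_shared_memory(
    "input0_data", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

# Point the input at the GPU region instead of sending bytes over the wire;
# the tensor enters Triton core on GPU, so it is not hashable for the cache.
input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_shared_memory("input0_data", byte_size)

result = client.infer(model_name="my_gpu_model", inputs=[input0])
```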

rmccorm4 added the question (Further information is requested) label on Apr 30, 2024