
Dynamic batching that supports static batch size with padding #7124

Open
ShuaiShao93 opened this issue Apr 17, 2024 · 10 comments
Labels
enhancement New feature or request module: server Issues related to the server core and frontends

Comments

@ShuaiShao93

Is your feature request related to a problem? Please describe.
Since TensorRT has limited support for dynamic shapes, the dynamic batch sizes required by the dynamic batcher are not ideal.

Describe the solution you'd like
Support padding the batch up to the model's static batch size when there is not enough data to fill it.
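
A minimal numpy sketch of the requested behavior (illustrative only; `STATIC_BATCH_SIZE` and `engine_infer` are placeholder names, not Triton APIs): the dynamic batcher forms whatever batch it can, the batch is padded up to the engine's static size, inference runs once, and the padded rows are dropped from the outputs.

```python
import numpy as np

STATIC_BATCH_SIZE = 8  # placeholder: the batch size the engine was built with


def pad_to_static(batch: np.ndarray) -> tuple[np.ndarray, int]:
    """Pad the batch dimension up to the static batch size; also return the real size."""
    real = batch.shape[0]
    missing = STATIC_BATCH_SIZE - real
    if missing > 0:
        filler = np.zeros((missing,) + batch.shape[1:], dtype=batch.dtype)
        batch = np.concatenate([batch, filler], axis=0)
    return batch, real


# The dynamic batcher gathered 7 samples; pad to 8, run once, keep the first 7 outputs.
inputs = np.random.rand(7, 3, 224, 224).astype(np.float32)
padded, real = pad_to_static(inputs)
# outputs = engine_infer(padded)[:real]   # engine_infer stands in for the TRT engine call
```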

@SunnyGhj

Great minds think alike. I'm trying to manually implement padding on the request side.

@ShuaiShao93
Author

> Great minds think alike. I'm trying to manually implement padding on the request side.

Does this mean you disabled dynamic batching on Triton? That is not ideal, because dynamic batching is one of the most important reasons we use Triton.

@SunnyGhj

SunnyGhj commented Apr 17, 2024

> when there is not enough data to fill it.

Similarly, we have manually implemented batching of requests on the client and fixed the batch size to the static batch size. We are trying to pad the data when there is not enough of it.

@ShuaiShao93
Author

> when there is not enough data to fill it.
>
> Similarly, we have manually implemented batching of requests on the client and fixed the batch size to the static batch size. We are trying to pad the data when there is not enough of it.

OK, it sounds like you re-implemented the dynamic batcher in your own client, which is probably not the best investment of time. I hope Triton can support this natively. But thanks for sharing this!

@Tabrizian
Member

I think this enhancement makes sense. @GuanLuo / @nnshah1 any additional thoughts?

Tabrizian added the enhancement and module: server labels on Apr 19, 2024
@nnshah1
Contributor

nnshah1 commented Apr 19, 2024

@ShuaiShao93 If I understand correctly, the idea here is to have a static batch size defined in the engine, but then have the dynamic batcher pad the batches it sends in when they are smaller than that size?

Is that something to handle in the server or in the backend? It might be more efficient to pad right before sending it to the engine.
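
To picture "pad right before sending it to the engine", here is a sketch in the style of a Triton Python backend's `execute()` (illustrative only: the tensor names `INPUT`/`OUTPUT`, `STATIC_BATCH_SIZE`, and `self.run_engine()` are placeholders, and the change discussed in this issue would actually live in the TRT backend or Triton core). The core server's dynamic batcher still groups the requests; the backend concatenates them, pads up to the engine's static batch size, runs once, and slices the outputs back per request.

```python
import numpy as np
import triton_python_backend_utils as pb_utils  # available inside Triton's Python backend


class TritonPythonModel:
    STATIC_BATCH_SIZE = 8  # placeholder: the batch size the engine was built with

    def execute(self, requests):
        # The dynamic batcher may deliver several requests at once; concatenate them.
        arrays = [pb_utils.get_input_tensor_by_name(r, "INPUT").as_numpy() for r in requests]
        sizes = [a.shape[0] for a in arrays]
        batch = np.concatenate(arrays, axis=0)

        # Pad the combined batch up to the engine's static batch size.
        missing = self.STATIC_BATCH_SIZE - batch.shape[0]
        if missing > 0:
            filler = np.zeros((missing,) + batch.shape[1:], dtype=batch.dtype)
            batch = np.concatenate([batch, filler], axis=0)

        outputs = self.run_engine(batch)  # placeholder for the actual TRT engine call

        # Slice the outputs back into one response per original request.
        responses, offset = [], 0
        for n in sizes:
            out = pb_utils.Tensor("OUTPUT", outputs[offset:offset + n])
            offset += n
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```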

@ShuaiShao93
Author

ShuaiShao93 commented Apr 19, 2024

@nnshah1 How is this possible?

Let's say a model has a static batch size of 8. There are two clients: client A has a request of batch size 4, and client B has a request of batch size 3.

Ideally, if A and B call the Triton server at the same time, the dynamic batcher makes a batch of size 7 and then pads it to 8.

But if we pad at the client, A pads 4 to 8 and B pads 3 to 8, so we need to run inference twice, which doubles the cost.

@nnshah1
Contributor

nnshah1 commented Apr 19, 2024

> @nnshah1 How is this possible?
>
> Let's say a model has a static batch size of 8. There are two clients: client A has a request of batch size 4, and client B has a request of batch size 3.
>
> Ideally, if A and B call the Triton server at the same time, the dynamic batcher makes a batch of size 7 and then pads it to 8.
>
> But if we pad at the client, A pads 4 to 8 and B pads 3 to 8, so we need to run inference twice, which doubles the cost.

No, I get your point. I mean padding in the TRT backend versus the core server piece, not padding at the client.

@nnshah1
Contributor

nnshah1 commented Apr 19, 2024

As an example, for our Stable Diffusion tutorial I ended up padding / splitting on the model side and letting the dynamic batcher provide batches independently of that. (This is just an example and would need to be implemented in the TRT engine or Triton core.)

https://github.com/triton-inference-server/tutorials/blob/cb2ca257000cd14d59642a7aa86b56d054535d73/Popular_Models_Guide/StableDiffusion/backend/diffusion/model.py#L178
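
A condensed sketch of that pad/split pattern (this is not the tutorial's actual code; `ENGINE_BATCH_SIZE` and the `infer` callable are placeholders): split whatever batch arrives into engine-sized chunks, pad the last chunk, and stitch the real outputs back together.

```python
import numpy as np

ENGINE_BATCH_SIZE = 8  # placeholder: the static batch size the engine expects


def run_in_fixed_chunks(batch: np.ndarray, infer) -> np.ndarray:
    """Run an arbitrary-size batch through a fixed-batch-size engine by
    chunking, padding the final chunk, and dropping the padded rows."""
    outputs = []
    for start in range(0, batch.shape[0], ENGINE_BATCH_SIZE):
        chunk = batch[start:start + ENGINE_BATCH_SIZE]
        real = chunk.shape[0]
        if real < ENGINE_BATCH_SIZE:
            filler = np.zeros((ENGINE_BATCH_SIZE - real,) + chunk.shape[1:], dtype=chunk.dtype)
            chunk = np.concatenate([chunk, filler], axis=0)
        outputs.append(infer(chunk)[:real])  # keep only the real rows of each chunk
    return np.concatenate(outputs, axis=0)
```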

@ShuaiShao93
Author

@nnshah1 Ah, gotcha. Thanks! Either should work, but it sounds better to make this a general feature and expose it as a flag in the model config, in case other backends also want a static batch size.
