How inference efficiency is measured #9

Open
FC-Li opened this issue Mar 28, 2024 · 11 comments
Comments

@FC-Li

FC-Li commented Mar 28, 2024

The tech report described the methodology of the inference efficiency measurement, but not in detail. It compared Llama2-70B and DBRX, and we have great interest in that comparison, so we carried out our own tests in which we spawned different numbers of synchronous clients to stress the service at different QPS levels. The performance we measured differs from the tech report: DBRX is faster than Llama2-70B when the traffic is below 0.35 QPS, but the latency-vs-QPS curve flips after that. We used the same prompt length and output length as in the tech report.

So I wonder if you could give more details about how the performance was tested.

@dskhudia

@FengcunLi

The performance reported in the technical report was measured using TRT-LLM with the model support in NVIDIA/TensorRT-LLM#1363 as it behaves today. We use our own web server with TRT-LLM as the backend. More details about the methodology are available here. There are a few key things to note:

  • We spawn users with a delay of 1 s between each, and then each user sends a number of requests. A user sends its next request only after its current request is done (a sketch of this client pattern follows the list).

  • The benchmarking is done on an 8x-H100-80G system.

  • The benchmarking is done in an “online” setting which utilizes aggregated prefill and decode with continuous (aka inflight) batching, i.e., the same way the inference server gets used in production. In this setting (see remove_input_padding for more), TRT-LLM packs together the context (prompt processing) phase of some requests with the generation (i.e., decode) phase of other requests.

  • LLaMa-2-70B, being a dense model, reaches the compute-bound regime earlier; after that, doubling the concurrency simply doubles the latency, bringing per-user throughput down by ~2x. DBRX/Mixtral, being MoE models, reach the compute-bound regime at larger concurrency. How context processing and generation are mixed also affects the effective batch size the system sees.
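For concreteness, here is a minimal sketch of the staggered, closed-loop client pattern from the first bullet (users started 1 s apart, each sending its next request only after the previous one finishes). The endpoint URL, payload shape, request counts, and use of aiohttp are illustrative placeholders, not the actual benchmark harness:

```python
# Illustrative sketch only: users spawned 1 s apart, each closed-loop
# (next request only after the previous one completes).
import asyncio
import time

import aiohttp

ENDPOINT = "http://localhost:8000/generate"   # hypothetical server endpoint
REQUESTS_PER_USER = 8
NUM_USERS = 16
SPAWN_DELAY_S = 1.0                           # stagger between user start times


async def run_user(session: aiohttp.ClientSession, latencies: list) -> None:
    payload = {"prompt": "x " * 2048, "max_tokens": 256}  # 2048 in / 256 out, as in the report
    for _ in range(REQUESTS_PER_USER):
        start = time.perf_counter()
        async with session.post(ENDPOINT, json=payload) as resp:
            await resp.read()
        latencies.append(time.perf_counter() - start)     # end-to-end request latency


async def main() -> None:
    latencies: list = []
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(NUM_USERS):
            tasks.append(asyncio.create_task(run_user(session, latencies)))
            await asyncio.sleep(SPAWN_DELAY_S)             # spawn users 1 s apart
        await asyncio.gather(*tasks)
    print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.2f} s")


if __name__ == "__main__":
    asyncio.run(main())
```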

I am curious what your benchmark methodology and system are. Could you share some numbers as well?

@JadeRay

JadeRay commented Mar 29, 2024

@dskhudia

Thank you for your explanation. However, I still have a question about this description:

“LLaMa-2-70B, being a dense model, reaches the compute-bound regime earlier; after that, doubling the concurrency simply doubles the latency, bringing per-user throughput down by ~2x. DBRX/Mixtral, being MoE models, reach the compute-bound regime at larger concurrency. How context processing and generation are mixed also affects the effective batch size the system sees.”

For an H100-80G (NVL) system, the model theoretically reaches the compute-bound regime only after the batch size exceeds 507 (= PeakFlops / PeakBandwidth). (For H100-80G SXM, the batch size at the roofline inflection point is 253.) Considering the latency perceived by users, this is a batch size that is difficult to reach. So we can assume that the generation phase of both the dense model and the MoE model will be dominated by bandwidth. On the other hand, a 36B/132B MoE model has far more parameters to load than a 70B dense model. For DBRX (36B activated params, 132B total params, 16 experts, topk=4), once the batch size reaches 3 the total parameters touched are about 132B * (1 - (1 - 1/4)^3) ≈ 76B; from that point on, the IO-bound problem becomes much heavier than for a 70B dense model, so the generation phase becomes slower for DBRX than for llama2-70B. Am I right about this? I am also curious how you benchmarked llama2-70B: was it tested on TRT-LLM, the same as DBRX? Thank you for your time again.
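For reference, the two back-of-the-envelope quantities above can be written out in a small sketch. The hardware inputs are placeholders, and the second function reproduces the comment's own approximation (it treats the full parameter count as expert weights and ignores that attention/router weights are always loaded):

```python
# Sketch of the roofline inflection batch size and the expected-parameters
# estimate used in the comment above. Plug in your GPU's peak FLOPs, peak
# memory bandwidth, and weight byte-width.

def roofline_batch_size(peak_flops: float, peak_bw_bytes_per_s: float,
                        bytes_per_param: int = 2) -> float:
    """Batch size where a weight-dominated GEMM crosses from bandwidth-bound to
    compute-bound: 2*B flops per parameter vs. bytes_per_param bytes read,
    so B ≈ peak_flops * bytes_per_param / (2 * peak_bandwidth)."""
    return peak_flops * bytes_per_param / (2 * peak_bw_bytes_per_s)


def params_touched_approx(total_params_b: float, n_experts: int, top_k: int,
                          batch: int) -> float:
    """Probability that a given expert is hit by at least one of `batch` tokens,
    applied to the full parameter count (the comment's approximation)."""
    p_expert_hit = 1.0 - (1.0 - top_k / n_experts) ** batch
    return total_params_b * p_expert_hit


# DBRX-style numbers from the thread: 132B total params, 16 experts, top-4 routing.
for b in (1, 2, 3, 4, 8):
    print(f"batch={b}: ~{params_touched_approx(132, 16, 4, b):.0f}B params touched")
# batch=3 gives ~76B, matching the figure quoted above.
```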

@FC-Li

FC-Li commented Mar 29, 2024

@dskhudia
Thank you for your swift response.
Our test was done on an 8x-A100-80G system with our proprietary inference engine, which also has continuous batching and split-fuse. According to our tests, our inference engine is on par with TRT-LLM.

Our test results are shown below:
[image: latency vs. QPS benchmark results]

The QPS is measured on the client side, meaning how many requests are handled by the server per second.
The latency is the end-to-end time of a single request, meaning how many seconds the server takes to process the 2048 prompt tokens and generate the 256 output tokens.

The QPS range is produced by varying the number of concurrent clients over [1,2,4,6,8,10,12,14,16,18,20,24,28,32]. Each client is synchronous: it waits for the server's response before sending a new request.
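For concreteness, a minimal sketch of this kind of closed-loop harness: a fixed pool of synchronous clients, each waiting for its response before sending the next request. The endpoint URL, payload, and per-client request count are hypothetical placeholders, not the actual test code:

```python
# Closed-loop load test with a fixed number of synchronous clients.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/generate"           # hypothetical endpoint
PAYLOAD = {"prompt": "x " * 2048, "max_tokens": 256}  # 2048 prompt / 256 output tokens
REQUESTS_PER_CLIENT = 10


def client_loop(_client_id: int) -> list:
    latencies = []
    for _ in range(REQUESTS_PER_CLIENT):
        start = time.perf_counter()
        requests.post(ENDPOINT, json=PAYLOAD).raise_for_status()
        latencies.append(time.perf_counter() - start)  # end-to-end latency of one request
    return latencies


for num_clients in [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32]:
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        results = list(pool.map(client_loop, range(num_clients)))
    wall = time.perf_counter() - wall_start
    lats = [lat for per_client in results for lat in per_client]
    qps = len(lats) / wall                              # client-side QPS
    print(f"{num_clients} clients: qps={qps:.2f}, mean latency={sum(lats) / len(lats):.2f} s")
```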

@JadeRay

JadeRay commented Mar 29, 2024

(quoting @FC-Li's benchmark description and results above)

That's what I said! When the batch size is smaller than 3, the MoE model loads fewer parameters than llama2-70B, so DBRX performs better than llama2-70B. When the batch size exceeds 3, DBRX becomes slower than llama2-70B because of the larger parameter load. This benchmark makes sense.

@dskhudia

dskhudia commented Mar 29, 2024

For an H100-80G (NVL) system, the model theoretically reaches the compute-bound regime only after the batch size exceeds 507 (= PeakFlops / PeakBandwidth). (For H100-80G SXM, the batch size at the roofline inflection point is 253.)

@JadeRay: Arithmetic intensity for matrix multiplication is not directly the batch size, and transformer inference is dominated by matrix multiplications. See an explanation of arithmetic intensity for matrix multiplications here.

So we can assume that the generation phase of both the dense model and the MoE model will be dominated by bandwidth.

Not true. When overlapping context and generation, MoEs will be compute bound and will perform in proportion to their live param count.
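To make the arithmetic-intensity point concrete, here is a rough sketch of why it is not simply the number of concurrent users; fp16 storage is assumed, and the 8192 projection size is an illustrative value, not a DBRX layer shape:

```python
# Arithmetic intensity (flops per byte) of a (tokens x K) @ (K x N) GEMM with
# 2-byte weights/activations. For K, N >> tokens it is roughly the number of
# tokens in the step (decode tokens plus any packed prefill tokens), not the
# number of concurrent users.
def gemm_arithmetic_intensity(tokens: int, k: int, n: int,
                              bytes_per_elem: int = 2) -> float:
    flops = 2 * tokens * k * n
    bytes_moved = bytes_per_elem * (k * n + tokens * k + tokens * n)
    return flops / bytes_moved


K = N = 8192  # generic projection size, for illustration only
print(gemm_arithmetic_intensity(16, K, N))         # ~16: decode-only step with 16 sequences
print(gemm_arithmetic_intensity(16 + 2048, K, N))  # ~1370: one 2048-token prefill packed in
```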

I am also curious how you benchmarked llama2-70B: was it tested on TRT-LLM, the same as DBRX? Thank you for your time again.

LLaMa-2-70B was also benchmarked with TRT-LLM, using the same benchmarking setup and prompts.

@dskhudia

@JadeRay , @FC-Li

I dug a bit deeper into it. In the continuous batching setting, there is extra latency in the iteration where trt-llm removes a request and adds another (context processing for the new incoming request slows the other in-flight requests). For example, I see 40 ms/70 ms for the DBRX model and 80 ms/90 ms for the LLaMa-2-70B model. This increases the overall TPOT of the other in-flight requests more for LLaMa-2-70B than for DBRX.

@JadeRay: It's possible that in other controlled batching scenarios and with different input/output tokens you see different behavior. Overall your intuition for batch sizes 4-16 is correct.

For MoE performance in the bandwidth-bound and compute-bound regimes, see the excellent analysis by Dmytro here.
[image: MoE bandwidth-bound vs. compute-bound analysis]
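As a rough illustration of how per-iteration numbers like those quoted above could translate into average TPOT (reading each pair as normal-iteration vs. swap-iteration latency; the swap frequency below is an arbitrary illustrative value, not a measurement):

```python
# Rough TPOT model: most decode iterations take `steady_ms`, but every
# `swap_every` iterations a request is swapped in and the iteration takes
# `swap_ms` instead (context processing of the new request slows the
# in-flight requests). Only the 40/70 and 80/90 ms pairs come from the
# comment above; swap_every is made up for illustration.
def mean_tpot_ms(steady_ms: float, swap_ms: float, swap_every: int = 16) -> float:
    return ((swap_every - 1) * steady_ms + swap_ms) / swap_every


print("DBRX        :", mean_tpot_ms(40, 70))  # ~41.9 ms per output token
print("LLaMa-2-70B :", mean_tpot_ms(80, 90))  # ~80.6 ms per output token
```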

@JadeRay

JadeRay commented Apr 1, 2024

(quoting @dskhudia's reply above)

@dskhudia
Thank you for your response.
As Dmytro has analyzed, and as you have pointed out, the MoE model is much more heavily bound by memory bandwidth because it has to load all of its parameters for batched processing. So the TPOT of the MoE model should be much slower than that of the llama2-70B dense model, given that the generation phase is IO-bound at relatively small batch sizes (if you run DBRX or llama2-70B on a single H100 server node with 8 GPUs, the batch size will probably not be big enough to reach the compute-bound region in most situations). But since DBRX has announced throughput up to 2x faster than the llama2-70B dense model, can I assume the benefit comes only from the continuous batching setting for the MoE model?

@JadeRay

JadeRay commented Apr 1, 2024

For example, DBRX is both higher quality than LLaMA2-70B and - thanks to having about half as many active parameters - DBRX inference throughput is up to 2x faster (Figure 2).

In other words, will this conclusion hold for static batching? Without continuous batching, what would the theoretical performance of an MoE model vs. a dense model look like?

@dskhudia

dskhudia commented Apr 1, 2024

@JadeRay :

But since DBRX has announced throughput up to 2x faster than the llama2-70B dense model, can I assume the benefit comes only from the continuous batching setting for the MoE model?

The ratio of total params to live params in DBRX is about 3.6x (132/36). So in the compute-bound regime, that is what you should expect relative to an equivalent dense model with 132B params. Please note that MoEs have extra compute for the router and other inefficiencies from using GroupedGEMMs for the MoE layers, so you may not reach 3.6x.

@dskhudia

dskhudia commented Apr 1, 2024

@JadeRay :

In other words, will this conclusion hold for static batching? Without continuous batching, what would the theoretical performance of an MoE model vs. a dense model look like?

If your static batch is large enough, you should approach the theoretical compute bound limit of 3.6x wrt an equivalent 132B dense model.

@JadeRay

JadeRay commented Apr 2, 2024

@dskhudia

We have benchmarked DBRX and llama2-70B layer by layer, and we find that the TTFT benefit comes from the per-token FLOPs, while the TPOT benefit comes from communication, since DBRX has about half as many layers as llama2-70B. This is a wonderful model. Thanks for all the discussions.
