
Throughput (exposed as Queries per second (1/s)) needs to be measured and not derived from latency (1.0 / attrs["best_search_time"]) #423

Open
filipecosta90 opened this issue Jun 1, 2023 · 1 comment


@filipecosta90 (Contributor)

In theory, if a system has a latency of 1 millisecond per request, one might think it can handle 1000 (1/0.001) requests per second, implying a direct correlation between latency and throughput (i.e. treating throughput as the inverse of latency). However, in real-world scenarios this correlation does not hold, for several reasons:

  • Concurrency: Many systems can process multiple requests simultaneously. This means that while the system may take 1ms to respond to a single request, it can also process hundreds or thousands of other requests within that same 1ms window, assuming sufficient resources are available. Arguably this is less of a concern for this kind of single-client benchmark, but please read the points below.

  • Resource Saturation: A system may experience increasing latency as load grows due to saturation of system resources, such as CPU, memory, network bandwidth, or I/O capacity. Throughput would level off and could even decrease under extreme load, even while latency continues to rise. If we only focus on the best search time, as we do now, we're not measuring the impact of a long-running benchmark on the resulting throughput.

  • Queuing Delays: In many systems, requests are queued before processing. Even if a single request can be processed quickly, if many requests are lined up for processing, the overall latency will be higher. This can happen even if the system's throughput is high.

  • Network and System Behavior: Latency can also be affected by network factors such as propagation delays, the transmission medium, and routing, or by system-level factors such as hardware architecture and software efficiency. These factors might not affect throughput in the same proportion, creating disparities between the two. Consider cloud deployments that allow a burst of requests and then throttle the achievable number of requests per second: the reported throughput would not be representative of the common case for the entire benchmark run.

For these reasons, a comprehensive understanding of a system's performance requires measuring both latency and throughput. The current approach is too optimistic and not representative of a system's performance over time: we run long benchmarks and then deduce throughput from only the single best-performing slice. Only by considering both metrics can we understand how well a system responds to individual requests and how effectively it processes a large volume of requests over time.
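To make the gap concrete, here is a toy illustration (made-up numbers, not ann-benchmarks code) of how the inverse of the best-case latency can overstate the throughput actually sustained over a run:

```python
# Hypothetical per-query latencies (seconds) from a long run: a single 1 ms
# outlier, with the typical query taking 4 ms.
latencies = [0.001] + [0.004] * 9999

best_qps = 1.0 / min(latencies)                   # current approach: 1000 QPS
sustained_qps = len(latencies) / sum(latencies)   # wall-clock rate: ~250 QPS

print(f"best-case QPS: {best_qps:.0f}, sustained QPS: {sustained_qps:.0f}")
```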

With the above in mind, I would like to propose a PR for this tool that keeps track of throughput from start to end and uses the median/common-case value as the reported "Queries per second" value, along the lines of the sketch below. Agree?
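A minimal sketch of the kind of measurement I mean (the function name and signature are placeholders, not existing ann-benchmarks code): count completed queries in fixed wall-clock windows across the whole run and report the median window as the common-case QPS.

```python
import statistics
import time

def measure_median_qps(queries, search, window_s=1.0):
    """Placeholder sketch: run all queries through `search` (the algorithm's
    single-query call), record completed queries per fixed wall-clock window,
    and report the median window as the common-case QPS."""
    window_qps = []
    window_start = time.perf_counter()
    count = 0
    for q in queries:
        search(q)
        count += 1
        now = time.perf_counter()
        if now - window_start >= window_s:
            window_qps.append(count / (now - window_start))
            window_start, count = now, 0
    return statistics.median(window_qps) if window_qps else 0.0
```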

@erikbern (Owner) commented Jun 1, 2023

Given that the benchmarks are single-threaded and serve exactly one request at any time, I think throughput is exactly the inverse of latency?

I agree that this isn't exactly what people care about in a real-world setting, but I think it would be quite complex to extend the benchmarks. You would have to make assumptions about the arrival process – both the overall rate and the distribution. For instance, do you assume arrivals are exponential, or do you make some other assumption (bursty, like you mention)?

You would have to introduce a whole new axis in the benchmark and plot latency vs. arrival rate. That tradeoff would show the classic behavior where latency spikes towards infinity as the arrival rate asymptotically approaches the system's capacity.
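To illustrate (purely a toy, not something I'm suggesting we add): a single-server simulation with exponential inter-arrival times and a fixed 1 ms service time shows latency staying near 1 ms at low arrival rates and blowing up as the rate approaches the 1000 req/s capacity.

```python
import random

def mean_latency(arrival_rate, service_time=0.001, n=50_000, seed=0):
    """Toy open-loop simulation: Poisson arrivals, fixed service time,
    single server. Returns mean latency (queueing + service) in seconds."""
    rng = random.Random(seed)
    t = 0.0            # arrival clock
    server_free = 0.0  # time at which the server next becomes idle
    total = 0.0
    for _ in range(n):
        t += rng.expovariate(arrival_rate)   # exponential inter-arrival time
        start = max(t, server_free)
        server_free = start + service_time
        total += server_free - t
    return total / n

for rate in (100, 500, 900, 990):
    print(rate, round(mean_latency(rate) * 1000, 2), "ms")
```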

All of this is doable, but at the cost of

  1. Making the numbers a lot harder to parse
  2. Making the simulations a lot more expensive – you have a whole new axis to grid search
  3. Making the code more complex
  4. Introducing a whole new set of assumptions (arrival distributions etc.)

My feeling is that it's worth simplifying reality a bit, and keeping the benchmarks single-threaded.
