
perf(coord): Improve performance of result streaming #1549

Merged 4 commits into filodb:develop on Jun 13, 2023

Conversation

@vishramachandran (Member) commented on Apr 3, 2023

Pull Request checklist

  • The commit message(s) follow the contribution guidelines?
  • Tests for the changes have been added (for bug fixes / features)?
  • Docs have been added / updated (for bug fixes / features)?

Improve performance of Query Result Streaming by:

  1. Adding multiple (a configurable number of) RVs within one streaming StreamQueryResult message. This was needed because performance tests with one RV per Akka message created a bottleneck in Akka remoting (see the first sketch after this list): [2023-03-23 12:45:13,666] WARN filo-standalone-akka.actor.default-dispatcher-34 akka.remote.EndpointWriter [akka.tcp://filo-standalone@127.0.0.1:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ffilo-standalone%40127.0.0.1%3A63524-1/endpointWriter] - [79284] buffered messages in EndpointWriter for [akka.tcp://filo-standalone@127.0.0.1:63524]. You should probably implement flow control to avoid flooding the remote connection.
  2. Using one result actor to receive streaming query results from the callee, since heap dumps showed an overload of LocalActorRefs. This required (a) externalizing the QueryScheduler outside of QueryActor so it can also be used from the ResultActor, and (b) adding a new unique planId UUID to each execPlan so we can distinguish streamed query results across plans and route them to the right consumer query pipeline (see the second sketch below).
  3. Invoking child plans in parallel rather than sequentially.
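
To make items 1 and 3 concrete, here is a minimal Monix/Akka sketch of batching range vectors into one streaming message and running child plans concurrently. StreamQueryResult is named in this PR, but its shape here, along with ChildPlan, StreamQueryResultFooter, and rvsPerMessage, are illustrative assumptions rather than the actual FiloDB classes.

```scala
import java.util.UUID

import akka.actor.ActorRef
import monix.eval.Task
import monix.reactive.Observable

// Hypothetical stand-ins for FiloDB types; the real classes in this PR may differ.
trait RangeVector
final case class StreamQueryResult(planId: UUID, rvs: Seq[RangeVector])  // batched RVs per Akka message
final case class StreamQueryResultFooter(planId: UUID)                   // assumed end-of-stream marker

// A child plan identified by the new planId UUID; execute() yields its RV stream.
class ChildPlan(val planId: UUID) {
  def execute(): Observable[RangeVector] = Observable.empty  // placeholder
}

object StreamingDispatch {
  /** Streams results of all child plans to the shared result actor,
    * batching `rvsPerMessage` RVs into each StreamQueryResult message
    * and executing the children concurrently instead of one after another.
    */
  def streamChildren(children: Seq[ChildPlan],
                     resultActor: ActorRef,
                     rvsPerMessage: Int): Task[Unit] =
    Task.parSequenceUnordered(                  // item 3: children run in parallel
      children.map { child =>
        child.execute()
          .bufferTumbling(rvsPerMessage)        // item 1: group RVs, not one message per RV
          .foreachL(batch => resultActor ! StreamQueryResult(child.planId, batch))
          .map(_ => resultActor ! StreamQueryResultFooter(child.planId))
      }
    ).map(_ => ())
}
```

Batching this way keeps the number of Akka messages proportional to result size divided by the batch size, which is what avoids the EndpointWriter backlog shown in the warning above.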
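And a minimal sketch of item 2: a single shared result actor that uses the per-plan planId UUID to route incoming batches to the right consumer pipeline. The message and type names mirror the sketch above and are assumptions, not the PR's exact classes.

```scala
import java.util.UUID

import akka.actor.Actor

// Hypothetical sketch of the single shared result actor.
trait RangeVector
final case class StreamQueryResult(planId: UUID, rvs: Seq[RangeVector])
final case class StreamQueryResultFooter(planId: UUID)
final case class SubscribePlan(planId: UUID, onBatch: Seq[RangeVector] => Unit)

class ResultActor extends Actor {
  // planId -> callback of the query pipeline waiting for that plan's results
  private var routes = Map.empty[UUID, Seq[RangeVector] => Unit]

  def receive: Receive = {
    case SubscribePlan(planId, onBatch) =>
      routes += planId -> onBatch
    case StreamQueryResult(planId, rvs) =>
      // The unique planId carried by each plan lets one actor demultiplex
      // results from many concurrently running plans.
      routes.get(planId).foreach(cb => cb(rvs))
    case StreamQueryResultFooter(planId) =>
      routes -= planId                  // plan finished; drop its route
  }
}
```

Routing everything through one long-lived actor avoids allocating a short-lived LocalActorRef per query, which is what showed up as an overload in the heap dumps.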

With this change I was able to bring raw query throughput and latency on a one-node setup on par with the non-streaming solution. More iterations are needed to improve performance further. The bottleneck appears to be that the Prom HTTP API is non-streaming and requires accumulating large swaths of data in memory. In any case, FiloDB itself is relieved of this accumulation and should be more reliable than without streaming.

Briefly, here are the remaining TODOs:

  1. On a setup with one FiloDB node and one query facade service node, I saw equivalent latencies at 5 QPS. GC performance improved on FiloDB, but memory usage increased on the query facade process. Scaling the streaming setup to 6 QPS produced more timeouts than the non-streaming (fat) response did.
  2. There are some functional issues on Apple M1: the Monix query pipeline seems to be canceled abruptly, whereas the same setup works correctly on an Intel Mac.
  3. There is probably a lurking bug: I occasionally see more result RVs than expected.

alextheimer (Contributor) previously approved these changes on Apr 18, 2023:

Approved with a couple comments / questions 👍

amolnayak311 (Contributor) previously approved these changes on Jun 5, 2023:

Thanks @vishramachandran for the PR. If you can rebase, I will be happy to take a quick look again and approve.

@vishramachandran merged commit 08fb4e0 into filodb:develop on Jun 13, 2023
1 check passed