Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Converting the query result to Apache Arrow format in server side. #1198

Open
1 task done
JinHai-CN opened this issue May 10, 2024 · 3 comments
Open
1 task done
Assignees
Labels
feature request New feature or request

Comments

@JinHai-CN
Copy link
Contributor

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

Currently, query results are stored in memory in a columnar format. However, the client expects the results in Apache Arrow format. At the moment, the format conversion is executed on the Python client, but this worsens the performance, so we plan to convert the results to Apache Arrow format on the server side before sending them to the client.

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

@JinHai-CN JinHai-CN added the feature request New feature or request label May 10, 2024
@JinHai-CN JinHai-CN mentioned this issue May 10, 2024
33 tasks
@niebayes
Copy link

@JinHai-CN Hi, I did some investigation on the codebase, mainly focusing on the protocol layer and execution engine of Infinity. Regarding this issue, I have some questions:

  • Are we planning to use Arrow in-memory format to replace the current in-memory columnar format, or do we only need to support converting the in-memory columnar format to Arrow in-memory format while returning the query result?
  • Why does our client need to convert query results to Arrow format? Is it for interacting with Infinity using Python SDK in applications integrated with Arrow? I noticed that our examples and tests barely use the to_arrow method.
  • Why does converting to Arrow format on the client side degrade performance? Here is my understanding:
    • The conversion from Pandas DataFrame or Polar DataFrame to Arrow format requires serialization and deserialization of metadata and data.
    • Conversely, if we serialize query result to Thrift or HTTP response using Arrow IPC protocol on the server side, and then deserialize the result using Arrow IPC on the client side, we only pay the cost of serializing and deserializing the metadata since Arrow IPC protocol does not need to serialize the deserialize the data.
    • In fact, serialization and deserialization overhead for metadata in Arrow IPC protocol can be eliminated in certain scenarios. I have implemented this optimization for GreptimeDB before.
    • I am not sure if Thrift and HTTP protocols themselves perform additional serialization and deserialization. If we can support the Arrow Flight protocol, we do not need to worry about this overhead.
  • Is it mandatory for the server to convert query results to Arrow format? Is this conversion a default behavior, or is it optional? If optional, is it configured via server config, or controlled through options sent in client requests?
  • Do all protocols need to convert query results to Arrow? Does the Query Result in the Embedded API need conversion? Where is this conversion performed for the Embedded API?
  • What is the recommended way to implement such a conversion? How about adding a pluggable middleware in the server side?

@JinHai-CN
Copy link
Contributor Author

JinHai-CN commented May 13, 2024

@JinHai-CN Hi, I did some investigation on the codebase, mainly focusing on the protocol layer and execution engine of Infinity. Regarding this issue, I have some questions:

  • Are we planning to use Arrow in-memory format to replace the current in-memory columnar format, or do we only need to support converting the in-memory columnar format to Arrow in-memory format while returning the query result?

The plan has two steps:

  1. Uses arrow format as the query result and transfer it to the client.
  2. Replaces the in-memory columnar format with Arrow format.
  • Why does our client need to convert query results to Arrow format? Is it for interacting with Infinity using Python SDK in applications integrated with Arrow? I noticed that our examples and tests barely use the to_arrow method.

Yes, the examples are barely use to_arrow method. But arrow or data-frame are massively used in most production environments.

  • Why does converting to Arrow format on the client side degrade performance? Here is my understanding:

    • The conversion from Pandas DataFrame or Polar DataFrame to Arrow format requires serialization and deserialization of metadata and data.
    • Conversely, if we serialize query result to Thrift or HTTP response using Arrow IPC protocol on the server side, and then deserialize the result using Arrow IPC on the client side, we only pay the cost of serializing and deserializing the metadata since Arrow IPC protocol does not need to serialize the deserialize the data.
    • In fact, serialization and deserialization overhead for metadata in Arrow IPC protocol can be eliminated in certain scenarios. I have implemented this optimization for GreptimeDB before.
    • I am not sure if Thrift and HTTP protocols themselves perform additional serialization and deserialization. If we can support the Arrow Flight protocol, we do not need to worry about this overhead.

Yes, your understanding is correct. We plan to support Arrow flight protocol to eliminate the serialization and deserialization cost.

And the HTTP API didn't use thrift in the past, and won't use Arrow flight protocol in the future.

  • Is it mandatory for the server to convert query results to Arrow format? Is this conversion a default behavior, or is it optional? If optional, is it configured via server config, or controlled through options sent in client requests?

Mandatory.

  • Do all protocols need to convert query results to Arrow? Does the Query Result in the Embedded API need conversion? Where is this conversion performed for the Embedded API?

Excepts HTTP API, it shall works on all RPC protocols and Embedded API of python SDK.

  • What is the recommended way to implement such a conversion? How about adding a pluggable middleware in the server side?
  1. Adds an Arrow conversion layer in server side.
  2. Add Arrow flight protocol to infinity server and python client.
  3. If all function works, benchmark result is good and all test cases are passed, apache thrift related code can be removed.

@niebayes
Copy link

@JinHai-CN Please assign me. I will start by creating a PoC to add support for converting to the Arrow format in the server side and for integrating the Arrow Flight protocol in both sides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants