Improve timeout experience and add configuration #3800

Open
shuiyisong opened this issue Apr 25, 2024 · 0 comments
Labels
C-user-experience Category User Experience docs-required This change requires docs update.

shuiyisong commented Apr 25, 2024

What type of enhancement is this?

User experience

What does the enhancement do?

Here is an example error log showing that a timeout happened during a query.

2024-04-25T05:45:55.408746Z ERROR sql{protocol="http" request_type="sql"}: client::region: Failed to do Flight get, addr: greptimedb-datanode-0.greptimedb-datanode.greptimedb:4001, code: The operation was cancelled err=0: Timeout expired
2024-04-25T05:45:55.409104Z ERROR sql{protocol="http" request_type="sql"}: servers::http::error_result: Failed to handle HTTP request err=0: , at greptimedb/src/common/recordbatch/src/adapter.rs:254:55
1: External(0: External error, at greptimedb/src/query/src/dist_plan/merge_scan.rs:207:22
1: Region query error, at greptimedb/src/frontend/src/instance/region_query.rs:53:14
2: Failed to query, at greptimedb/src/frontend/src/instance/region_query.rs:74:14
3: External error, at greptimedb/src/client/src/region.rs:72:14
4: Failed to do Flight get, code: The operation was cancelled
5: Timeout expired)
2024-04-25T05:45:55.409893Z ERROR tower_http::trace::on_failure: response failed classification=Status code: 500 Internal Server Error latency=10008 ms

There are two major issues here:

  1. It returns 1003 as the error code, which is inaccurate and confusing. Perhaps the first External error hides the actual cause. If the error is a timeout, we want to tell the end user so clearly.
  2. The timeout threshold (10 seconds, consistent with the 10008 ms latency in the log) is not configurable yet. This is inappropriate because some large queries can easily take more than a minute.

Implementation challenges

We want to:

  1. Fix the case where the actual timeout reason is hidden. This may be related to our error propagation mechanism; please refer to this blog post (in preview) for some insights.
  2. Add a timeout configuration, ideally a separate one for each protocol: HTTP, MySQL, PostgreSQL and gRPC. This may be harder than it seems, because a timeout can also occur during the internal gRPC call from the frontend to the datanode.
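For the second point, here is a minimal sketch of one possible design, assuming a global default (10 seconds, today's fixed threshold) with optional per-protocol overrides, plus a per-query deadline fixed when the request reaches the frontend. The frontend-to-datanode call then receives only the remaining budget instead of its own independent timeout. Protocol, TimeoutConfig and QueryDeadline are hypothetical names; none of this is an existing GreptimeDB option.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Protocols named in the issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Protocol {
    Http,
    Mysql,
    Postgres,
    Grpc,
}

// Hypothetical config: a global default plus optional per-protocol overrides.
struct TimeoutConfig {
    default: Duration,
    per_protocol: HashMap<Protocol, Duration>,
}

impl TimeoutConfig {
    // Falls back to the global default when a protocol has no override.
    fn for_protocol(&self, protocol: Protocol) -> Duration {
        self.per_protocol.get(&protocol).copied().unwrap_or(self.default)
    }
}

// Hypothetical per-query deadline, fixed when the request reaches the frontend.
struct QueryDeadline {
    deadline: Instant,
}

impl QueryDeadline {
    fn start(now: Instant, budget: Duration) -> Self {
        QueryDeadline { deadline: now + budget }
    }

    // Budget left for a downstream call such as the frontend -> datanode gRPC
    // request. None means the deadline has already been reached.
    fn remaining(&self, now: Instant) -> Option<Duration> {
        self.deadline
            .checked_duration_since(now)
            .filter(|d| !d.is_zero())
    }
}

fn main() {
    let mut per_protocol = HashMap::new();
    // Example override: allow long analytical queries over HTTP.
    per_protocol.insert(Protocol::Http, Duration::from_secs(120));
    let config = TimeoutConfig {
        default: Duration::from_secs(10),
        per_protocol,
    };

    for p in [Protocol::Http, Protocol::Mysql, Protocol::Postgres, Protocol::Grpc] {
        println!("{:?}: {:?}", p, config.for_protocol(p));
    }

    // Simulate 3 seconds spent in the frontend before calling the datanode.
    let t0 = Instant::now();
    let deadline = QueryDeadline::start(t0, config.for_protocol(Protocol::Http));
    match deadline.remaining(t0 + Duration::from_secs(3)) {
        Some(left) => println!("datanode call budget: {:?}", left),
        None => println!("deadline exceeded, return a timeout error"),
    }
}
```

In an async server the remaining budget would typically be handed to something like tokio::time::timeout or used as the deadline of the internal gRPC request; when remaining() returns None, the frontend can return a timeout error immediately instead of issuing a call that is bound to be cancelled.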