New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky tests 2 (stateless, integration) #61869
Conversation
@@ -101,9 +101,11 @@ def test_query_cache_size_is_runtime_configurable(start_cluster): | |||
node.query("SELECT 2 SETTINGS use_query_cache = 1, query_cache_ttl = 1") | |||
node.query("SELECT 3 SETTINGS use_query_cache = 1, query_cache_ttl = 1") | |||
|
|||
time.sleep(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unreliable, it will remain flaky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually break this test here in the last assert, sleep is required as i understand
res = node.query_with_retry(
"SELECT value FROM system.asynchronous_metrics WHERE metric = 'QueryCacheEntries'",
check_callback=lambda result: result == "1\n",
)
> assert res == "1\n"
E AssertionError: assert '2\n' == '1\n'
E - 1
E + 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this test originally and found this open PR by luck.
There were three recorded test failures in the past 6 months or so:
- https://s3.amazonaws.com/clickhouse-test-reports/57933/ca3ca8ffe403a26a1ff5bc77adb8895c89b16970/integration_tests__tsan__[1_6].html
- https://s3.amazonaws.com/clickhouse-test-reports/60211/c5d3aef83e287b477f67bb5b0e97ed534b567270/integration_tests__asan__analyzer__[1_6].html
- https://s3.amazonaws.com/clickhouse-test-reports/60497/cb8d8594533de3773ad1f30f7a5792f1e7e1bb39/integration_tests__tsan__[1_6].html
which is super sporadic given how many CI tests run per day.
In all cases, we ran three queries with query cache TTL = 1 sec, then wait 2 sec, and find that the query cache contains only 1 entry whereas 2 entries are expected. I can honestly not think of a scenario where this can happen except one: the asynchronous metrics are updated in a weird time interval that they report only 1 entry. Instead of adding retries (l. 105, please revert all retries from this file which only hide problems), please add SYSTEM RELOAD ASYNCHRONOUS METRICS
before l. 106, see #53710. When I am back from vacation, I can add some documentation for SYSTEM RELOAD ASYNCHRONOUS METRICS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retries are removed and SYSTEM RELOAD ASYNCHRONOUS METRICS
after each wait. Lets pray it will be fixed
This is an automated comment for commit 0afd2e8 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
baf53fb
to
ef001b1
Compare
@tavplubix addition to original commits:
|
02f486a
to
fb4bf33
Compare
@@ -101,9 +101,11 @@ def test_query_cache_size_is_runtime_configurable(start_cluster): | |||
node.query("SELECT 2 SETTINGS use_query_cache = 1, query_cache_ttl = 1") | |||
node.query("SELECT 3 SETTINGS use_query_cache = 1, query_cache_ttl = 1") | |||
|
|||
time.sleep(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this test originally and found this open PR by luck.
There were three recorded test failures in the past 6 months or so:
- https://s3.amazonaws.com/clickhouse-test-reports/57933/ca3ca8ffe403a26a1ff5bc77adb8895c89b16970/integration_tests__tsan__[1_6].html
- https://s3.amazonaws.com/clickhouse-test-reports/60211/c5d3aef83e287b477f67bb5b0e97ed534b567270/integration_tests__asan__analyzer__[1_6].html
- https://s3.amazonaws.com/clickhouse-test-reports/60497/cb8d8594533de3773ad1f30f7a5792f1e7e1bb39/integration_tests__tsan__[1_6].html
which is super sporadic given how many CI tests run per day.
In all cases, we ran three queries with query cache TTL = 1 sec, then wait 2 sec, and find that the query cache contains only 1 entry whereas 2 entries are expected. I can honestly not think of a scenario where this can happen except one: the asynchronous metrics are updated in a weird time interval that they report only 1 entry. Instead of adding retries (l. 105, please revert all retries from this file which only hide problems), please add SYSTEM RELOAD ASYNCHRONOUS METRICS
before l. 106, see #53710. When I am back from vacation, I can add some documentation for SYSTEM RELOAD ASYNCHRONOUS METRICS.
…r frequently and less time
Run pytest --pdb. On failure or breakpoint, it will fall back to PDB. Commands written here will be saved to a local file and loaded at the start of the next test run.
…8b78e919e53c0a813dbc73dde9716 Cherry pick #61869 to 24.3: Fix flaky tests 2 (stateless, integration)
Backport #61869 to 24.3: Fix flaky tests 2 (stateless, integration)
…8b78e919e53c0a813dbc73dde9716 Cherry pick #61869 to 23.8: Fix flaky tests 2 (stateless, integration)
…8b78e919e53c0a813dbc73dde9716 Cherry pick #61869 to 24.1: Fix flaky tests 2 (stateless, integration)
…8b78e919e53c0a813dbc73dde9716 Cherry pick #61869 to 24.2: Fix flaky tests 2 (stateless, integration)
Changelog category (leave one):
Changes:
ClickHouseCluster.get_docker_handle()
wait docker frequently and less timeRun pytest --pdb. On failure or breakpoint, it will fall back to PDB. Commands written here will be saved to a local file and loaded at the start of the next test run.
List of tests:
Failed reports is here