New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix test_lock_time_tracing
flakiness
#7712
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
VladLazar
approved these changes
May 13, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. If this doesn't solve it, we can make the timeout configurable at start-up. It would also make the test faster.
VladLazar
added
the
approved-for-ci-run
Changes are safe to trigger CI for the PR
label
May 13, 2024
github-actions
bot
removed
the
approved-for-ci-run
Changes are safe to trigger CI for the PR
label
May 13, 2024
Closed
3060 tests run: 2927 passed, 0 failed, 133 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
0839c68 at 2024-05-13T09:45:34.935Z :recycle: |
Thank you Jure! |
a-masterov
pushed a commit
that referenced
this pull request
May 20, 2024
## Problem Closes [test_lock_time_tracing](#7691) ## Summary of changes Taking a look at the execution of the same test in logs, it can be concluded that the time we are holding the lock is sometimes not enough(must be above 30s) to cause the second log to be shown by the thread that is creating a timeline. In the [successful execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9021247520/index.html#testresult/a21bce8c702b37f0) it can be seen that the log `Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock` was on the edge of being logged, if it was below 30s it would not be shown. ``` 2024-05-09T18:02:32.552093Z WARN request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Lock on UpdatePolicy was held for 31.001892592s 2024-05-09T18:02:32.552109Z INFO request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Request handled, status: 200 OK 2024-05-09T18:02:32.552271Z WARN request{method=POST path=/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/timeline request_id=d3af756e-dbb3-476b-89bd-3594f19bbb67}: Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock ``` In the [failed execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9022743601/index.html#/testresult/deb90136aeae4fce): ``` 2024-05-09T20:14:33.526311Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af 2024-05-09T20:14:36.441165Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.441657Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.535227Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Handling request 2024-05-09T20:14:41.535269Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: handle_validate: 68194ffadb61ca11adcbb11cbeb4ec6e(gen 1): valid=true (latest Some(00000001)) 2024-05-09T20:14:41.535284Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Request handled, status: 200 OK 2024-05-09T20:14:46.441854Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:51.441151Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:56.441199Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:01.440971Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:03.516320Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: failpoint "tenant-update-policy-exclusive-lock": sleep done 2024-05-09T20:15:03.518474Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Updated scheduling policy to Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518512Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518540Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Lock on UpdatePolicy was held for 31.003712703s 2024-05-09T20:15:03.518570Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Request handled, status: 200 OK 2024-05-09T20:15:03.518804Z WARN request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518815Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline on shard 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af, attached to node 1 (localhost) ``` we can see that the difference between starting to create timeline `2024-05-09T20:14:33.526311Z` and creating timeline `2024-05-09T20:15:03.518815Z` is not above 30s and will not cause any logs to appear. The proposed solution is to prolong how long we will pause to ensure that the thread that creates the timeline waits above 30s.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
Closes test_lock_time_tracing
Summary of changes
Taking a look at the execution of the same test in logs, it can be concluded that the time we are holding the lock is sometimes not enough(must be above 30s) to cause the second log to be shown by the thread that is creating a timeline.
In the successful execution it can be seen that the log
Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock
was on the edge of being logged, if it was below 30s it would not be shown.In the failed execution:
we can see that the difference between starting to create timeline
2024-05-09T20:14:33.526311Z
and creating timeline2024-05-09T20:15:03.518815Z
is not above 30s and will not cause any logs to appear.The proposed solution is to prolong how long we will pause to ensure that the thread that creates the timeline waits above 30s.
Checklist before requesting a review
Checklist before merging