-
Notifications
You must be signed in to change notification settings - Fork 351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_lock_time_tracing
is flaky
#7691
Labels
Comments
jcsp
added
a/test
Area: related to testing
c/storage/controller
Component: Storage Controller
labels
May 10, 2024
@jbajic are you available to fix this? |
Yes I am. I will take a look at it |
jcsp
pushed a commit
that referenced
this issue
May 13, 2024
## Problem Closes [test_lock_time_tracing](#7691) ## Summary of changes Taking a look at the execution of the same test in logs, it can be concluded that the time we are holding the lock is sometimes not enough(must be above 30s) to cause the second log to be shown by the thread that is creating a timeline. In the [successful execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9021247520/index.html#testresult/a21bce8c702b37f0) it can be seen that the log `Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock` was on the edge of being logged, if it was below 30s it would not be shown. ``` 2024-05-09T18:02:32.552093Z WARN request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Lock on UpdatePolicy was held for 31.001892592s 2024-05-09T18:02:32.552109Z INFO request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Request handled, status: 200 OK 2024-05-09T18:02:32.552271Z WARN request{method=POST path=/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/timeline request_id=d3af756e-dbb3-476b-89bd-3594f19bbb67}: Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock ``` In the [failed execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9022743601/index.html#/testresult/deb90136aeae4fce): ``` 2024-05-09T20:14:33.526311Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af 2024-05-09T20:14:36.441165Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.441657Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.535227Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Handling request 2024-05-09T20:14:41.535269Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: handle_validate: 68194ffadb61ca11adcbb11cbeb4ec6e(gen 1): valid=true (latest Some(00000001)) 2024-05-09T20:14:41.535284Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Request handled, status: 200 OK 2024-05-09T20:14:46.441854Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:51.441151Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:56.441199Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:01.440971Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:03.516320Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: failpoint "tenant-update-policy-exclusive-lock": sleep done 2024-05-09T20:15:03.518474Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Updated scheduling policy to Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518512Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518540Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Lock on UpdatePolicy was held for 31.003712703s 2024-05-09T20:15:03.518570Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Request handled, status: 200 OK 2024-05-09T20:15:03.518804Z WARN request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518815Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline on shard 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af, attached to node 1 (localhost) ``` we can see that the difference between starting to create timeline `2024-05-09T20:14:33.526311Z` and creating timeline `2024-05-09T20:15:03.518815Z` is not above 30s and will not cause any logs to appear. The proposed solution is to prolong how long we will pause to ensure that the thread that creates the timeline waits above 30s.
a-masterov
pushed a commit
that referenced
this issue
May 20, 2024
## Problem Closes [test_lock_time_tracing](#7691) ## Summary of changes Taking a look at the execution of the same test in logs, it can be concluded that the time we are holding the lock is sometimes not enough(must be above 30s) to cause the second log to be shown by the thread that is creating a timeline. In the [successful execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9021247520/index.html#testresult/a21bce8c702b37f0) it can be seen that the log `Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock` was on the edge of being logged, if it was below 30s it would not be shown. ``` 2024-05-09T18:02:32.552093Z WARN request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Lock on UpdatePolicy was held for 31.001892592s 2024-05-09T18:02:32.552109Z INFO request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Request handled, status: 200 OK 2024-05-09T18:02:32.552271Z WARN request{method=POST path=/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/timeline request_id=d3af756e-dbb3-476b-89bd-3594f19bbb67}: Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock ``` In the [failed execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9022743601/index.html#/testresult/deb90136aeae4fce): ``` 2024-05-09T20:14:33.526311Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af 2024-05-09T20:14:36.441165Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.441657Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.535227Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Handling request 2024-05-09T20:14:41.535269Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: handle_validate: 68194ffadb61ca11adcbb11cbeb4ec6e(gen 1): valid=true (latest Some(00000001)) 2024-05-09T20:14:41.535284Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Request handled, status: 200 OK 2024-05-09T20:14:46.441854Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:51.441151Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:56.441199Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:01.440971Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:03.516320Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: failpoint "tenant-update-policy-exclusive-lock": sleep done 2024-05-09T20:15:03.518474Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Updated scheduling policy to Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518512Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518540Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Lock on UpdatePolicy was held for 31.003712703s 2024-05-09T20:15:03.518570Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Request handled, status: 200 OK 2024-05-09T20:15:03.518804Z WARN request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518815Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline on shard 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af, attached to node 1 (localhost) ``` we can see that the difference between starting to create timeline `2024-05-09T20:14:33.526311Z` and creating timeline `2024-05-09T20:15:03.518815Z` is not above 30s and will not cause any logs to appear. The proposed solution is to prolong how long we will pause to ensure that the thread that creates the timeline waits above 30s.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
This test was added recently, and is failing ~3% of the time.
An example failure:
The text was updated successfully, but these errors were encountered: