messaging_service: add streaming/maintenance tenant to statement tenants #18719
Comments
Currently only the user tenant (statement scheduling group) and the system tenant (default scheduling group) exist, as we used to have only user-initiated operations and system (internal) ones. Now there is a need to distinguish between two kinds of system operation: foreground and background. The former should use the system tenant, while the latter will use the new maintenance tenant (streaming scheduling group). Fixes: scylladb#18719
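The tenant selection described above can be modeled with a short sketch. This is a hypothetical Python illustration, not Scylla's actual C++ code: the dictionary keys, tenant names, and the fallback behavior are assumptions based on the description in this PR (an unrecognized scheduling group falling back to the default statement tenant is the bug being fixed).

```python
# Hypothetical model of how the messaging service maps the sender's
# scheduling group to a "statement tenant" for read/write verbs.

# Before this PR: only two statement tenants are registered.
TENANTS_BEFORE = {
    "statement": "statement",  # user-initiated operations
    "default": "system",       # foreground system (internal) operations
}

# After this PR: the streaming group maps to the new maintenance tenant.
TENANTS_AFTER = dict(TENANTS_BEFORE, streaming="maintenance")

def pick_tenant(tenants, scheduling_group):
    # An unrecognized scheduling group falls back to the default statement
    # tenant -- this is the priority escalation described in the issue:
    # background TTL work ends up competing with user requests.
    return tenants.get(scheduling_group, "statement")

print(pick_tenant(TENANTS_BEFORE, "streaming"))  # falls back: "statement" (the bug)
print(pick_tenant(TENANTS_AFTER, "streaming"))   # recognized: "maintenance" (fixed)
```

With the new entry registered, RPCs sent from the streaming scheduling group keep their low-priority classification on the remote node instead of escalating.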
Don't we already have an open issue on the Alternator TTL scanning losing its scheduling group? This looks like a dup.
I guess we didn't, so let me explain this now: As explained in #17609, when we implemented Alternator's TTL feature, which has a background thread scanning the database for expired rows, we did not implement any controller for how fast this scanning should proceed. Instead, we let the scanner run in the maintenance scheduling group, and the theory is that the scheduler will then ensure that this work cannot hurt the latency of the main workload (which runs in the query scheduling group), and also that its throughput reduction will be limited to the amount of shares configured for the maintenance scheduling group. However, experimentation showed that not all scanning work was done in the maintenance scheduling group. When node A runs the expiration scanner in the maintenance scheduling group and performs a read request to remote node B, that read request was supposed to also run in the maintenance scheduling group. But it turns out that it didn't, and this is the bug reported in this issue: node B ran the read request in the default request scheduling group, not the inherited one.
I added an assignment to myself to write a test for this issue (@denesb has already provided a fix, #18729, but we also need a test). I think this is an important issue to fix (we now believe that users who used TTL encountered it almost immediately, and some performance problems we saw in tests were also caused by this), so I want this fix to go in.
This patch adds a test for issue scylladb#18719: Although the Alternator TTL work is supposedly done in the "streaming" scheduling group, it turned out we had a bug where work sent on behalf of that code to other nodes failed to inherit the correct scheduling group, and was done in the normal ("statement") group. Because this problem only happens when more than one node is involved, the test is in the multi-node test framework test/topology_experimental_raft. The test uses the Alternator API. We already had in that framework a test using the Alternator API (a test for alternator+tablets), so in this patch we move the common Alternator utility functions to a common file, test_alternator.py, where I also put the new test. The test is based on metrics: We write expiring data, wait for it to expire, and then check the metrics on how much CPU work was done in the wrong scheduling group ("statement"). Before scylladb#18719 was fixed, a lot of work was done there (more than half of the work done in the right group). After it was fixed, the work in the wrong scheduling group went down to zero. The test relies *slightly* on timing: It needs the write of 100 rows to finish in 2 seconds, and their deletion to finish in 2 seconds. I believe that these durations will be enough even in very slow debug runs. Refs scylladb#18719. Signed-off-by: Nadav Har'El <nyh@scylladb.com>
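The metric-based check the commit message describes can be sketched as follows. This is a minimal illustration, not the actual test in test_alternator.py: the metric name scylla_scheduler_runtime_ms, its "group" label, and the sample values are assumptions used to show the shape of the comparison (CPU time per scheduling group, scraped from a node's Prometheus-format metrics endpoint).

```python
# Sketch of the test's core assertion: after TTL expiration work runs,
# the "statement" group should have done almost no work compared to the
# "streaming" (maintenance) group.
import re

# Stand-in for the text a node's /metrics endpoint would return (values invented).
SAMPLE_METRICS = """\
scylla_scheduler_runtime_ms{group="statement",shard="0"} 12
scylla_scheduler_runtime_ms{group="streaming",shard="0"} 950
scylla_scheduler_runtime_ms{group="statement",shard="1"} 8
scylla_scheduler_runtime_ms{group="streaming",shard="1"} 870
"""

def runtime_by_group(metrics_text):
    # Sum the per-shard runtime samples into one total per scheduling group.
    totals = {}
    pattern = re.compile(r'scylla_scheduler_runtime_ms\{group="([^"]+)"[^}]*\}\s+(\S+)')
    for m in pattern.finditer(metrics_text):
        group, value = m.group(1), float(m.group(2))
        totals[group] = totals.get(group, 0.0) + value
    return totals

totals = runtime_by_group(SAMPLE_METRICS)
# With the fix applied, only a small residue should land in "statement".
assert totals["statement"] < 0.1 * totals["streaming"]
print(totals)
```

Measuring a ratio between groups, rather than an absolute number, keeps the check robust across slow debug runs where everything takes longer.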
…heduling group' from ScyllaDB

Alternator has a custom TTL implementation. It is based on a loop which scans existing rows in the table, decides whether each row has reached its end-of-life, and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies. This was found to be caused by the reads and writes done on behalf of the Alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked.

The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and the system tenant (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With Alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background. The former should use the system tenant, while the latter will use the new maintenance tenant (streaming scheduling group).

This series adds a streaming tenant to the messaging service configuration, and it adds a test which confirms that with this change, Alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of Alternator TTL are running in the statement group, disturbing user workloads; this PR has to be backported to fix this.

(cherry picked from commit 5d3f7c1)
(cherry picked from commit 1fe8f22)

Refs #18729
Closes #19196

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config
The messaging service associates certain scheduling groups with certain groups of verbs. For the statement verbs, this is a bit different, because reads and writes can be initiated by users but also by the system (internal code). The two should use different scheduling groups: user operations use the statement scheduling group, while system operations use the default (system) scheduling group. This is implemented by so-called "statement tenants". On OSS there are two: the aforementioned statement and system tenants. It turns out that not all system operations are made equal. Some are background operations and use the streaming (maintenance) scheduling group. This scheduling group is not recognized as a statement tenant and falls back to the default statement tenant, which results in priority escalation on the remote end of the RPC, possibly competing with and hurting the latencies of any ongoing user load.