
messaging_service: add streaming/maintenance tenant to statement tenants #18719

Closed
denesb opened this issue May 17, 2024 · 4 comments · Fixed by #18729
Assignees
Labels
area/alternator (Alternator related Issues), area/internals (an issue which refers to some internal class or something which has little exposure to users), area/ttl, Backport candidate, backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed), backport/6.0
Milestone

Comments

@denesb
Contributor

denesb commented May 17, 2024

Messaging service associates certain scheduling groups with certain groups of verbs. For the statement verbs this is a bit different, because reads and writes can be initiated by users but also by the system (internal code). The two should use different scheduling groups: user operations use the statement scheduling group, while system operations use the default (system) scheduling group. This is implemented by so-called "statement tenants". On OSS there are two: the aforementioned statement and system tenants.

Turns out, not all system operations are made equal. Some are background operations and use the streaming (maintenance) scheduling group. This scheduling group is not recognized as a statement tenant, so it falls back to the default statement tenant. This results in priority escalation on the remote end of the RPC, possibly competing with and hurting the latencies of any ongoing user load.
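The fallback described above can be sketched as follows. This is an illustrative model, not Scylla's actual C++ code: the tenant names and lookup function are assumptions made for the example.

```python
# Sketch (not Scylla's actual code): statement-tenant lookup that falls
# back to the default user ("statement") tenant when an RPC arrives on
# an unrecognized scheduling group.

DEFAULT_TENANT = "statement"  # user tenant; statement scheduling group

def make_tenant_lookup(recognized):
    """Map a scheduling-group name to a statement tenant, falling back
    to the default user tenant for unknown groups."""
    def lookup(sched_group):
        return sched_group if sched_group in recognized else DEFAULT_TENANT
    return lookup

# Before the fix: only the user and system tenants are recognized.
before = make_tenant_lookup({"statement", "system"})
# A background operation running in the streaming (maintenance) group
# falls back to the user tenant -> priority escalation on the remote end.
print(before("streaming"))   # -> statement

# After the fix: a maintenance tenant for the streaming group is added.
after = make_tenant_lookup({"statement", "system", "streaming"})
print(after("streaming"))    # -> streaming
```

The fix in #18729 amounts to the second case: teaching the messaging service to recognize the streaming (maintenance) scheduling group as its own tenant instead of escalating it.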

@denesb denesb added the area/internals label May 17, 2024
@denesb denesb self-assigned this May 17, 2024
@denesb denesb added the backport/5.2, backport/5.4 and backport/6.0 labels May 17, 2024
denesb added a commit to denesb/scylla that referenced this issue May 17, 2024
Currently only the user tenant (statement scheduling group) and the
system tenant (default scheduling group) exist, as we used to have only
user-initiated operations and system (internal) ones. Now there is a need
to distinguish between two kinds of system operations: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).

Fixes: scylladb#18719
@nyh
Contributor

nyh commented May 17, 2024

Don't we already have an open issue on the Alternator TTL scanning losing its scheduling group? This looks like a dup.

@mykaul mykaul added this to the 6.0 milestone May 19, 2024
@nyh nyh added area/alternator Alternator related Issues area/ttl labels May 19, 2024
@nyh
Contributor

nyh commented May 19, 2024

> Don't we already have an open issue on the Alternator TTL scanning losing its scheduling group? This looks like a dup.

I guess we didn't, so let me explain this now:

As explained in #17609, when we implemented Alternator's TTL feature, which has a background thread scanning the database for expired rows, we did not implement any controller on how fast this scanning should proceed. Instead, we let the scanner run in the maintenance scheduling group, and the theory is that the scheduler will then ensure that this work cannot hurt the latency of the main workload (which runs in the query scheduling group), and also its throughput reduction will be limited to the amount of shares configured for the maintenance scheduling group.
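The "throughput reduction limited to the configured shares" point is simple proportional arithmetic: under full contention, each runnable scheduling group gets CPU in proportion to its shares. A minimal sketch, with made-up share values (not Scylla's actual configuration):

```python
# Proportional-shares model: with every group runnable, a group's CPU
# fraction is its shares divided by the sum of all groups' shares.

def cpu_fraction(shares, group):
    """CPU fraction a group receives when all groups are runnable."""
    return shares[group] / sum(shares.values())

# Assumed (illustrative) share values.
shares = {"statement": 1000, "maintenance": 200}
print(round(cpu_fraction(shares, "maintenance"), 3))  # -> 0.167
```

So as long as the scanner really stays in the maintenance group, it can take at most that bounded fraction of CPU away from the query workload; the bug below is precisely that part of its work escaped this bound.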

However, experimentation showed that not all scanning work was done in the maintenance scheduling group. When node A runs the expiration scanner in the maintenance scheduling group and performs a read request to remote node B, that read request is supposed to also run in the maintenance scheduling group. But it turns out that it didn't, and this is the bug reported in this issue: node B ran the read request in the default statement scheduling group, not the inherited one.

@nyh nyh self-assigned this May 19, 2024
@nyh
Contributor

nyh commented May 19, 2024

I added an assignment to myself to write a test for this issue (@denesb has already provided a fix, #18729, but we also need a test). I think this is an important issue to fix (we now believe that users who used TTL encountered it almost immediately, and some performance problems we saw in tests were also caused by this), so I want this fix to go in.

nyh added a commit to nyh/scylla that referenced this issue May 20, 2024
This patch adds a test for issue scylladb#18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.

Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.

The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.

The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before scylladb#18719 was fixed, a lot of work
was done there (more than half of the work in the right group). After it
was fixed, the work on the wrong scheduling group went down to zero.

The test relies *slightly* on timing: it needs the write of 100 rows to
finish in 2 seconds, and their deletion to finish in 2 seconds.
I believe that these durations will be enough even in very slow
debug runs.

Refs scylladb#18719.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
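The metric-based check the commit describes can be sketched like this. The metric dictionaries and group names here are illustrative, not the exact metric names the test samples:

```python
# Sketch of the test's core check: sample per-scheduling-group CPU time
# before and after the TTL scan, and require that the "wrong"
# (statement) group did essentially no work compared to the right
# (streaming/maintenance) group.

def group_work_deltas(before, after, wrong="statement", right="streaming"):
    """Return (wrong_delta, right_delta): CPU time each group accrued
    between the two metric samples."""
    return (after[wrong] - before[wrong], after[right] - before[right])

# Hypothetical samples taken around the expiration scan (with the fix):
before = {"statement": 5000, "streaming": 100}
after  = {"statement": 5001, "streaming": 900}

wrong, right = group_work_deltas(before, after)
assert wrong < 0.1 * right  # wrong-group work is negligible
```

Before the fix, the same comparison would show the wrong-group delta at a large fraction of the right-group one, which is how the test distinguishes the buggy and fixed behavior.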
@nyh
Contributor

nyh commented May 20, 2024

I wrote a test reproducing the bug and validating @denesb 's patch: #18757

denesb pushed a commit to denesb/scylla that referenced this issue May 20, 2024
(same test commit message as above)
nyh added a commit to nyh/scylla that referenced this issue May 22, 2024
(same test commit message as above)
@denesb denesb modified the milestones: 6.0, 6.1 May 23, 2024
nyh added a commit to nyh/scylla that referenced this issue May 27, 2024
(same test commit message as above)
nyh added a commit to nyh/scylla that referenced this issue May 27, 2024
(same test commit message as above)
denesb pushed a commit to denesb/scylla that referenced this issue May 28, 2024
(same test commit message as above)
denesb pushed a commit to denesb/scylla that referenced this issue May 28, 2024
(same test commit message as above)
mergify bot pushed a commit that referenced this issue Jun 10, 2024
(same test commit message as above)
(cherry picked from commit 1fe8f22)

# Conflicts:
#	test/topology_experimental_raft/test_mv_tablets.py
mergify bot pushed a commit that referenced this issue Jun 10, 2024
(same test commit message as above)
(cherry picked from commit 1fe8f22)
denesb added a commit that referenced this issue Jun 10, 2024
…heduling group' from ScyllaDB

Alternator has a custom TTL implementation. It is based on a loop which scans existing rows in the table, decides whether each row has reached its end-of-life, and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies.

This was found to be caused by the reads and writes done on behalf of Alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and the system tenant (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With Alternator TTL, there is now a need to distinguish between two kinds of system operations: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).

This series adds a streaming tenant to the messaging service configuration, and it adds a test which confirms that with this change, Alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of Alternator TTL were running in the statement group, disturbing user workloads; this PR has to be backported to fix this.

(cherry picked from commit 5d3f7c1)

(cherry picked from commit 1fe8f22)

 Refs #18729

Closes #19196

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config