
messaging_service: add streaming/maintenance tenant to statement tenants #18719

Closed
denesb opened this issue May 17, 2024 · 4 comments · Fixed by #18729
Assignees
Labels
area/alternator (Alternator related Issues), area/internals (an issue which refers to some internal class or something which has little exposure to users), area/ttl, Backport candidate, backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed), backport/6.0
Milestone

Comments

@denesb
Contributor

denesb commented May 17, 2024

Messaging service associates certain scheduling groups with certain groups of verbs. For the statement verbs this is a bit different, because reads and writes can be initiated by users but also by the system (internal code). The two should use different scheduling groups: user operations use the statement scheduling group, while system operations use the default (system) scheduling group. This is implemented by so-called "statement tenants". On OSS there are two: the aforementioned statement and system tenants.

Turns out, not all system operations are made equal. Some are background operations and use the streaming (maintenance) scheduling group. This scheduling group is not recognized as a statement tenant, so it falls back to the default statement tenant. This results in priority escalation on the remote end of the RPC, possibly competing with and hurting the latencies of any ongoing user load.
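The fallback described above can be sketched as follows. This is an illustrative model, not Scylla's actual C++ code: the tenant names and lookup function are assumptions made for the example.

```python
# Sketch (not Scylla's actual code): statement-tenant lookup that falls
# back to the default user ("statement") tenant when an RPC arrives on
# an unrecognized scheduling group.

DEFAULT_TENANT = "statement"  # user tenant; statement scheduling group

def make_tenant_lookup(recognized):
    """Map a scheduling-group name to a statement tenant, falling back
    to the default user tenant for unknown groups."""
    def lookup(sched_group):
        return sched_group if sched_group in recognized else DEFAULT_TENANT
    return lookup

# Before the fix: only the user and system tenants are recognized.
before = make_tenant_lookup({"statement", "system"})
# A background operation running in the streaming (maintenance) group
# falls back to the user tenant -> priority escalation on the remote end.
print(before("streaming"))   # -> statement

# After the fix: a maintenance tenant for the streaming group is added.
after = make_tenant_lookup({"statement", "system", "streaming"})
print(after("streaming"))    # -> streaming
```

The fix in #18729 amounts to the second case: teaching the messaging service to recognize the streaming (maintenance) scheduling group as its own tenant instead of escalating it.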

@denesb denesb added the area/internals label May 17, 2024
@denesb denesb self-assigned this May 17, 2024
@denesb denesb added the backport/5.2, backport/5.4 and backport/6.0 labels May 17, 2024
denesb added a commit to denesb/scylla that referenced this issue May 17, 2024
Currently only the user tenant (statement scheduling group) and the
system tenant (default scheduling group) exist, as we used to have only
user-initiated operations and system (internal) ones. Now there is a need
to distinguish between two kinds of system operations: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).

Fixes: scylladb#18719
@nyh
Contributor

nyh commented May 17, 2024

Don't we already have an open issue on the Alternator TTL scanning losing its scheduling group? This looks like a dup.

@mykaul mykaul added this to the 6.0 milestone May 19, 2024
@nyh nyh added area/alternator Alternator related Issues area/ttl labels May 19, 2024
@nyh
Contributor

nyh commented May 19, 2024

> Don't we already have an open issue on the Alternator TTL scanning losing its scheduling group? This looks like a dup.

I guess we didn't, so let me explain this now:

As explained in #17609, when we implemented Alternator's TTL feature, which has a background thread scanning the database for expired rows, we did not implement any controller on how fast this scanning should proceed. Instead, we let the scanner run in the maintenance scheduling group, and the theory is that the scheduler will then ensure that this work cannot hurt the latency of the main workload (which runs in the query scheduling group), and also its throughput reduction will be limited to the amount of shares configured for the maintenance scheduling group.
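The "throughput reduction limited to the configured shares" point is simple proportional arithmetic: under full contention, each runnable scheduling group gets CPU in proportion to its shares. A minimal sketch, with made-up share values (not Scylla's actual configuration):

```python
# Proportional-shares model: with every group runnable, a group's CPU
# fraction is its shares divided by the sum of all groups' shares.

def cpu_fraction(shares, group):
    """CPU fraction a group receives when all groups are runnable."""
    return shares[group] / sum(shares.values())

# Assumed (illustrative) share values.
shares = {"statement": 1000, "maintenance": 200}
print(round(cpu_fraction(shares, "maintenance"), 3))  # -> 0.167
```

So as long as the scanner really stays in the maintenance group, it can take at most that bounded fraction of CPU away from the query workload; the bug below is precisely that part of its work escaped this bound.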

However, experimentation showed that not all scanning work was done in the maintenance scheduling group. When node A runs the expiration scanner in the maintenance scheduling group and performs a read request to remote node B, that read request is supposed to also run in the maintenance scheduling group. But it turns out that it didn't, and this is the bug reported in this issue: node B ran the read request in the default statement scheduling group, not the inherited one.

@nyh nyh self-assigned this May 19, 2024
@nyh
Contributor

nyh commented May 19, 2024

I added an assignment to myself to write a test for this issue (@denesb has already provided a fix, #18729, but we also need a test). I think this is an important issue to fix (we now believe that users who used TTL encountered it almost immediately, and some performance problems we saw in tests were also caused by this), so I want this fix to go in.

nyh added a commit to nyh/scylla that referenced this issue May 20, 2024
This patch adds a test for issue scylladb#18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.

Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.

The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.

The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before scylladb#18719 was fixed, a lot of work
was done there (more than half of the work in the right group). After it
was fixed, the work on the wrong scheduling group went down to zero.

The test relies *slightly* on timing: it needs the write of 100 rows to
finish in 2 seconds, and their deletion to finish in 2 seconds.
I believe that these durations will be enough even in very slow
debug runs.

Refs scylladb#18719.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
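The metric-based check the commit describes can be sketched like this. The metric dictionaries and group names here are illustrative, not the exact metric names the test samples:

```python
# Sketch of the test's core check: sample per-scheduling-group CPU time
# before and after the TTL scan, and require that the "wrong"
# (statement) group did essentially no work compared to the right
# (streaming/maintenance) group.

def group_work_deltas(before, after, wrong="statement", right="streaming"):
    """Return (wrong_delta, right_delta): CPU time each group accrued
    between the two metric samples."""
    return (after[wrong] - before[wrong], after[right] - before[right])

# Hypothetical samples taken around the expiration scan (with the fix):
before = {"statement": 5000, "streaming": 100}
after  = {"statement": 5001, "streaming": 900}

wrong, right = group_work_deltas(before, after)
assert wrong < 0.1 * right  # wrong-group work is negligible
```

Before the fix, the same comparison would show the wrong-group delta at a large fraction of the right-group one, which is how the test distinguishes the buggy and fixed behavior.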
@nyh
Contributor

nyh commented May 20, 2024

I wrote a test reproducing the bug and validating @denesb 's patch: #18757

denesb pushed a commit to denesb/scylla that referenced this issue May 20, 2024
(same test commit message as above)
nyh added a commit to nyh/scylla that referenced this issue May 22, 2024
(same test commit message as above)
@denesb denesb modified the milestones: 6.0, 6.1 May 23, 2024
nyh added a commit to nyh/scylla that referenced this issue May 27, 2024
(same test commit message as above)
nyh added a commit to nyh/scylla that referenced this issue May 27, 2024
(same test commit message as above)
denesb pushed a commit to denesb/scylla that referenced this issue May 28, 2024
(same test commit message as above)
denesb pushed a commit to denesb/scylla that referenced this issue May 28, 2024
(same test commit message as above)
mergify bot pushed a commit that referenced this issue Jun 10, 2024
(same test commit message as above)
(cherry picked from commit 1fe8f22)

# Conflicts:
#	test/topology_experimental_raft/test_mv_tablets.py
mergify bot pushed a commit that referenced this issue Jun 10, 2024
(same test commit message as above)
(cherry picked from commit 1fe8f22)
denesb added a commit that referenced this issue Jun 10, 2024
…heduling group' from ScyllaDB

Alternator has a custom TTL implementation. It is based on a loop which scans existing rows in the table, decides whether each row has reached its end-of-life, and deletes it if it has. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies.

This was found to be caused by the reads and writes done on behalf of Alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and the system tenant (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With Alternator TTL, there is now a need to distinguish between two kinds of system operations: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).

This series adds a streaming tenant to the messaging service configuration, and it adds a test which confirms that with this change, Alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of Alternator TTL were running in the statement group, disturbing user workloads; this PR has to be backported to fix this.

(cherry picked from commit 5d3f7c1)

(cherry picked from commit 1fe8f22)

 Refs #18729

Closes #19196

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config