
balancerd: Handle canceling connections #24081

Closed

ParkMyCar opened this issue Dec 21, 2023 · 12 comments
Labels
C-bug Category: something is broken

Comments

@ParkMyCar
Member

What version of Materialize are you using?

v0.81.0-dev

What is the issue?

It appears that balancerd isn't properly canceling requests. If you start running a subscribe in psql and then try to cancel it with Ctrl+C, the subscribe will not stop. It also seems that when you close the terminal window that's running psql, the session does not get cleaned up. See this Slack thread for more details.

@ParkMyCar ParkMyCar added the C-bug Category: something is broken label Dec 21, 2023
@benesch
Member

benesch commented Dec 21, 2023

The code bears this out!

// Balancer ignores cancel requests.
//
// TODO: Can/should we return some error here so users are informed
// this won't ever work?

@benesch
Member

benesch commented Dec 21, 2023

Two ideas that I talked about with @mjibson way back when:

  1. If we have SNI, we're good: balancerd: use pgwire SNI if present #23907
  2. Without SNI, we could send the cancellation request to all environments. Feels bad, but the combination of (conn_id, secret_key) is probably unique enough, especially if there are only a few hundred environments behind each balancer.

@benesch
Member

benesch commented Jan 3, 2024

Talked about this in the TL sync today.


@def- suggested that we could have our PostgreSQL sessions cycle the cancellation keys every few minutes. This would introduce more randomness, to make guessing the cancellation key for a session harder. I pointed out that the protocol doesn't support this, but @def- pointed out on Slack that libpq nonetheless handles changes to the cancellation key mid-session. Weird! I'd still be nervous about relying on this, though, because other pgwire protocol implementations are likely not so kind.


Here's an idea I proposed. We have 64 bits to play with here: 32 bits of connection ID and 32 bits of secret key. We don't need 2^32 connection IDs! At the moment max_connections is constrained to 1000. Let's be conservative and assume one day we want to support 1 million connections. That means we need 20 bits for the actual connection ID. The remaining 12 bits can be used to represent the first 12 bits of the environment ID. That's not enough to identify a full UUID (128 bits), but at our current scale we'd have very few collisions, and even in the limit where we have hundreds or thousands of environments per Kubernetes cluster, we'd still expect each 12-bit UUID prefix to map to only a handful of environments.

We can also randomly assign connection IDs. Today we assign connection IDs sequentially, which leaves a fair bit of randomness on the table. If both connection IDs and secret keys are randomly generated, that gives us 52 bits of randomness (4 quadrillion possibilities) instead of 32 bits of randomness (4 billion possibilities).

On the Kubernetes side, we could still leverage DNS here. Have the environment controller create a service named after the first 12 bits of each environment's organization ID by using the first three hex digits of the UUID. An environment for org ID 2b826f1e-4691-40b5-b627-8f7e4fc6369f would use service name 2b8. A given service will map to multiple environments in the case of collision, and that's fine. balancerd, when it gets a cancellation request, will look up the service for the given org ID prefix, and then crucially forward the cancellation request on to all of the IPs that the DNS record resolves to.
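To make the broadcast step concrete, here's a minimal sketch assuming a tokio-based balancerd; the `broadcast_cancel` helper name, the service address, and the 6875 port are illustrative, while the CancelRequest wire format (Int32 length 16, Int32 code 80877102, Int32 connection ID, Int32 secret key) is fixed by the PostgreSQL protocol:

use std::io;

use tokio::io::AsyncWriteExt;
use tokio::net::{lookup_host, TcpStream};

// Sketch: forward a pgwire CancelRequest to every environmentd behind the
// org-ID-prefix service. Environments that don't own (conn_id, secret_key)
// simply ignore it.
async fn broadcast_cancel(service: &str, conn_id: u32, secret_key: u32) -> io::Result<()> {
    // CancelRequest: Int32 length (16), Int32 code (80877102),
    // Int32 connection ID, Int32 secret key, all big-endian.
    let mut msg = Vec::with_capacity(16);
    msg.extend_from_slice(&16i32.to_be_bytes());
    msg.extend_from_slice(&80877102i32.to_be_bytes());
    msg.extend_from_slice(&conn_id.to_be_bytes());
    msg.extend_from_slice(&secret_key.to_be_bytes());

    // A headless service resolves to one address per backing environmentd.
    for addr in lookup_host(service).await? {
        if let Ok(mut stream) = TcpStream::connect(addr).await {
            let _ = stream.write_all(&msg).await; // best effort
        }
    }
    Ok(())
}

So a cancellation for org prefix 2b8 would be something like broadcast_cancel("cancel-2b8:6875", conn_id, secret_key).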

tl;dr

  • High order 12 bits of connection IDs are the low order 12 bits of environment IDs
  • Low order 20 bits of connection IDs are a randomly generated connection ID
  • Secret key is a randomly generated 32-bit integer, as it is today
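As a minimal sketch of that layout (helper names are hypothetical, `env_bits` stands for whichever 12 environment-ID bits we settle on, and the rand crate is an assumed dependency):

use rand::Rng; // assumed dependency, for illustration

// Pack 12 bits of environment ID and a random 20-bit connection ID into a
// single 32-bit pgwire connection ID.
fn generate_conn_id(env_bits: u32) -> u32 {
    let random: u32 = rand::thread_rng().gen_range(0..1 << 20); // 20 random bits
    ((env_bits & 0xFFF) << 20) | random
}

// Recover the 12 routing bits from the connection ID in a cancel request.
fn env_bits_of(conn_id: u32) -> u32 {
    conn_id >> 20
}

// The secret key stays a fully random 32-bit integer, as it is today.
fn generate_secret_key() -> u32 {
    rand::thread_rng().gen()
}

Together the random 20-bit connection ID and the 32-bit secret key supply the 52 bits of randomness described above.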

cc @alex-hunt-materialize @mjibson

@ParkMyCar
Member Author

> The remaining 12 bits can be used to represent the first 12 bits of the environment ID.

As @mjibson has pointed out to me before, bits 6, 7, and 12 of a UUIDv4 are constant. What do you think about using the last 12 bits of the environment ID for more randomness?

@benesch
Member

benesch commented Jan 3, 2024

Oh, wow, good to know. That's too bad, because it's much harder to sort the environments by their last 12 bits than their first, but agreed that we should do it, because losing 3 of the 12 bits of UUID randomness would be a big hit.

@alex-hunt-materialize
Contributor

alex-hunt-materialize commented Jan 4, 2024

What should be responsible for maintaining the services? We could probably do this in the environment-controller, but we currently don't have anything there that operates on all environments, only on one environment at a time.

I guess on environment reconciliation we could query any existing service, but it's a bit clunky. We'd probably want to serialize interactions with these services, so that we don't race.

@benesch
Member

benesch commented Jan 5, 2024

I think it should be fine to handle the creation of the service during the normal environment reconciliation. It will need to do two things:

  1. Add a label to the environmentd pod like: cancel-suffix.materialize.cloud=<last twelve bits of org id>
  2. Ensure a service named cancel-<last twelve bits of org id> exists with a pod selector that matches on the label from (1).

It's true that environments that are reconciled in parallel will race on (2). But that should be fine, because they're both going to try to create identical services. IIUC, the service membership doesn't need to be managed, since that's handled by Kubernetes automatically, and so I think it all works out pretty smoothly.
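A rough sketch of step (2) with kube-rs server-side apply, which also makes the race benign, since identical concurrent applies converge on the same object (the "cancellation" namespace and field-manager name are assumptions):

use k8s_openapi::api::core::v1::Service;
use kube::api::{Api, Patch, PatchParams};
use kube::Client;
use serde_json::json;

// Idempotently ensure the cancel Service for a 12-bit org-ID suffix exists.
async fn ensure_cancel_service(client: Client, suffix: &str) -> Result<(), kube::Error> {
    let services: Api<Service> = Api::namespaced(client, "cancellation"); // assumed namespace
    let name = format!("cancel-{suffix}");
    let svc = json!({
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": { "name": name },
        "spec": {
            // Headless, so DNS returns one record per matching pod.
            "clusterIP": "None",
            "selector": { "cancel-suffix.materialize.cloud": suffix },
        }
    });
    // Server-side apply: parallel reconcilers applying the same manifest
    // don't conflict, so no extra serialization is needed.
    services
        .patch(&name, &PatchParams::apply("environment-controller"), &Patch::Apply(&svc))
        .await?;
    Ok(())
}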

@alex-hunt-materialize
Contributor

Ah, that's a nice way to handle it for creation. I'm not sure how to handle delete that way, though.

@benesch
Member

benesch commented Jan 8, 2024

It's plausibly fine just to leave them around? There are only going to be 4096 of them at most.

@pH14
Contributor

pH14 commented Jan 10, 2024

If the issue is the mechanics of the reconciler, it's not too uncommon to have a Kube controller that considers all CRs for reconciliation if any one of them changes. kube-rs supports this through reconcile_all_on, which I think we can use to solve the creation/deletion issues here? Any time an Environment changes, we scan through all cancellation Services and all Environments and ensure the right ones exist.
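Roughly, assuming a recent kube-rs with the runtime feature plus tokio-stream (the reconcile body and the Environment watch that feeds the trigger channel are elided assumptions):

use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::Service;
use kube::runtime::controller::{Action, Controller};
use kube::runtime::watcher;
use kube::{Api, Client};
use tokio_stream::wrappers::ReceiverStream;

async fn reconcile(_svc: Arc<Service>, _ctx: Arc<()>) -> Result<Action, kube::Error> {
    // Ensure the cancel Services that should exist do, and prune stale ones.
    Ok(Action::await_change())
}

fn error_policy(_svc: Arc<Service>, _err: &kube::Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(60))
}

#[tokio::main]
async fn main() -> Result<(), kube::Error> {
    let client = Client::try_default().await?;
    let services: Api<Service> = Api::namespaced(client, "cancellation"); // assumed namespace
    // An Environment watch (elided) sends () here on every change.
    let (_trigger_tx, trigger_rx) = tokio::sync::mpsc::channel::<()>(1);
    Controller::new(services, watcher::Config::default())
        // Each () re-queues *all* Services for reconciliation at once.
        .reconcile_all_on(ReceiverStream::new(trigger_rx))
        .run(reconcile, error_policy, Arc::new(()))
        .for_each(|_| async {})
        .await;
    Ok(())
}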

maddyblue added a commit to maddyblue/materialize that referenced this issue Jan 11, 2024
This will decrease their predictability, which is desirable for the balancer.

See MaterializeInc#24081 (comment)
maddyblue added a commit to maddyblue/materialize that referenced this issue Jan 13, 2024
This will decrease their predictability, which is desirable for the balancer.

See MaterializeInc#24081 (comment)

Bump connection id space to 20 bits as part of balancer cancellation.
maddyblue added a commit that referenced this issue Jan 14, 2024
This will decrease their predictability, which is desirable for the balancer.

See #24081 (comment)

mgreene pushed a commit to shepherdlyinc/materialize that referenced this issue Jan 15, 2024
maddyblue added a commit that referenced this issue Jan 17, 2024
Use the lower 12 bits of the environment id as the upper 12 bits of the connection id.

See #24081 (comment)

maddyblue added a commit that referenced this issue Jan 20, 2024
When balancer receives a cancel request, broadcast it to all envds with matching ids.

See #24081 (comment)

@maddyblue
Contributor

Adapter parts of this are complete. There's a new arg to balancerd (see #24503) for a cancel broadcast DNS lookup address.

@benesch
Member

benesch commented Feb 14, 2024

@alex-hunt-materialize pointed out today on Slack that our plan to have cancellation Services in a cancellation namespace doesn't quite work as previously described, because the environmentd pods the service wants to reference are scattered across many different namespaces.

I think we can still make this work by manually managing the endpoint slices with the "FQDN" address type:

Service:

apiVersion: v1
kind: Service
metadata:
  name: cancel-b3f
spec:
  clusterIP: "None"

EndpointSlice:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: cancel-b3f-1
  labels:
    kubernetes.io/service-name: cancel-b3f
addressType: FQDN
endpoints:
  - addresses:
      - "environmentd.environment-b3fxxxx.svc.cluster.local"
      - "environmentd.environment-b3fyyyy.svc.cluster.local"
