| Metadata | Value |
|---|---|
| Date | 2023-10-12 |
| Author | @Jarema |
| Status | Implemented |
| Tags | client, server, spec |

| Revision | Date | Author | Info |
|---|---|---|---|
| 1 | 2023-10-12 | @Jarema | Initial draft |
This document describes how clients connect to the NATS server or NATS cluster. That includes topics like:
- connection process
- reconnect
- tls
- discoverability of other nodes in a cluster
The goal is a consistent way for clients to establish and maintain a connection with the NATS server, so that behaviour is consistent and predictable across the ecosystem.
TODO Add WebSocket flow.
- Clients initiate a network connection to the Server.
- Server responds with INFO json.
- Client sends CONNECT json.
- Clients and Server start to exchange PING/PONG messages to detect if the connection is alive.
Note: If the client sets the `protocol` field in CONNECT to a value equal to or greater than 1, the Server can send subsequent INFO messages on an ongoing connection.
The client needs to handle them appropriately and update its server list and server info.
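
The INFO/CONNECT exchange above can be sketched as two small helpers. This is a minimal illustration, not a reference implementation: the function names `parse_info` and `build_connect` are hypothetical, the `lang` and `version` values are placeholders, and only a handful of CONNECT fields are shown.

```python
import json


def parse_info(line: bytes) -> dict:
    # An INFO protocol line has the form: INFO {json} followed by CRLF.
    op, _, payload = line.strip().partition(b" ")
    if op.upper() != b"INFO":
        raise ValueError(f"expected INFO, got {op!r}")
    return json.loads(payload)


def build_connect(name: str, tls_required: bool = False) -> bytes:
    # A CONNECT protocol line has the form: CONNECT {json} followed by CRLF.
    options = {
        "verbose": False,
        "pedantic": False,
        "tls_required": tls_required,
        "name": name,
        "lang": "python",    # placeholder value for this sketch
        "version": "0.0.0",  # placeholder client version
        "protocol": 1,       # >= 1 allows the Server to send later INFO updates
    }
    return b"CONNECT " + json.dumps(options).encode() + b"\r\n"
```

A client would call `parse_info` on the first line read from the socket and then write the result of `build_connect` back to the Server.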
TODO
There are two flows available in the Server that enable TLS.
This method is available in all NATS Server versions.
- Clients initiate a network connection to the Server.
- Server responds with INFO json.
- If Server INFO contains `tls_required` set to `true`, or the client has a TLS requirement set to `true`, the client performs a TLS upgrade.
- Client sends CONNECT json.
- Clients and Server start to exchange PING/PONG messages to detect if the connection is alive.
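
As a rough sketch of the standard flow listed above, the helper below returns the ordered steps a client would take. The function name and the step labels are hypothetical and exist only to show where the conditional TLS upgrade sits.

```python
from typing import List


def standard_flow_steps(info: dict, client_tls_required: bool = False) -> List[str]:
    # INFO arrives in plain text first; TLS is negotiated afterwards if needed.
    steps = ["open connection", "read INFO"]
    if info.get("tls_required", False) or client_tls_required:
        steps.append("upgrade to TLS")
    steps += ["send CONNECT", "start PING/PONG"]
    return steps
```

Either side can trigger the upgrade: the Server through `tls_required` in INFO, or the client through its own TLS requirement.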
This method has been available since NATS Server 2.11.
There are two prerequisites to use this method:
- Server config has enabled the `handshake_first` field in the `tls` block.
- The client has the `tls_first` option set to true.
`handshake_first` has the following possible values:
- `false`: handshake first is disabled. Default value.
- `true`: handshake first is enabled and enforced. Clients that do not use this flow will fail to connect.
- `duration` (e.g. `2s`): a hybrid mode that will wait a given time, allowing the client to follow the `tls_first` flow. After the duration has expired, `INFO` is sent, enabling the standard client TLS flow.
- `auto`: same as above, with a default duration. By default, the Server waits 50ms for the TLS upgrade before sending INFO.
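
To make the four value types concrete, here is a small model of how they could be interpreted. It is only a sketch: `HandshakeFirst`, `parse_duration`, and `interpret_handshake_first` are hypothetical names, the duration parser is deliberately simplified, and none of this mirrors the Server's actual configuration parser.

```python
from dataclasses import dataclass
from typing import Optional, Union

AUTO_WAIT_SECONDS = 0.05  # the 50ms default described above


@dataclass(frozen=True)
class HandshakeFirst:
    enabled: bool
    # Seconds to wait for a TLS handshake before falling back to sending INFO.
    # None means no fallback: the mode is either disabled or strictly enforced.
    fallback_after: Optional[float] = None


def parse_duration(text: str) -> float:
    # Simplified parser: supports only "ms", "s", and "m" suffixes.
    for suffix, factor in (("ms", 0.001), ("s", 1.0), ("m", 60.0)):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(f"unsupported duration: {text!r}")


def interpret_handshake_first(value: Union[bool, str]) -> HandshakeFirst:
    if value is False:
        return HandshakeFirst(enabled=False)
    if value is True:
        return HandshakeFirst(enabled=True)  # enforced, no fallback
    if value == "auto":
        return HandshakeFirst(enabled=True, fallback_after=AUTO_WAIT_SECONDS)
    return HandshakeFirst(enabled=True, fallback_after=parse_duration(value))
```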
The flow itself is flipped. TLS is established before the Server sends INFO:
- Client initiates a network connection to the Server.
- Client upgrades the connection to TLS.
- Server sends INFO json.
- Client sends CONNECT json.
- Client and Server start to exchange PING/PONG messages to detect if the connection is alive.
Note: Server will send back the info only
When the Server sends back INFO, it may contain additional URLs to which the client can make connection attempts. The client should store those URLs and use them in the Reconnection Strategy.
A client should have an option to turn off using advertised URLs. By default, those URLs are used.
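
A minimal sketch of how a client might fold advertised URLs into its server pool follows. It assumes the URLs arrive in the `connect_urls` field of INFO; the helper name `update_server_pool` and the `ignore_advertised` flag are hypothetical, and entries are compared as plain strings for simplicity.

```python
from typing import List


def update_server_pool(pool: List[str], info: dict, ignore_advertised: bool = False) -> List[str]:
    # Return a new pool; the caller's list is left untouched.
    merged = list(pool)
    if ignore_advertised:
        return merged
    for url in info.get("connect_urls", []):
        if url not in merged:
            merged.append(url)
    return merged
```

The same helper can be called for every INFO received on an ongoing connection, so the pool keeps tracking the advertised URLs.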
TODO: Add more in-depth explanation how topology discovery works.
Clients should provide a way for users to force the reconnection process. This can be useful for refreshing auth or rebalancing clients.
When triggered, the client drops the connection to the current server and performs the standard reconnection process. That means all subscriptions and consumers should be resubscribed and their work resumed after a successful reconnect, with all reconnect options respected.
For most clients, that means having a `reconnect` method on the Client/Connection handle.
There are two methods that clients should use to detect disconnections:
- Missing two consecutive PONGs from the Server (number of missing PONGs can be configured).
- Handling errors from network connection.
When the client detects disconnection, it starts reconnect attempts with the following rules:
- Immediate reconnect attempt
  - The client attempts to reconnect immediately after finding out it has been disconnected.
- Exponential backoff with jitter
  - When the first reconnect fails, the backoff process should kick in. Default jitter should also be included to avoid thundering herd problems.
- If the Server returned additional URLs, the client should try reconnecting in random order to each Server on the list, unless the randomization option is disabled in the client options.
- A successful reconnect resets the timers.
- Upon reconnection, clients should resubscribe to all created subscriptions.
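
The delay rules above (immediate first attempt, then exponential backoff with jitter) could look roughly like this. The base, cap, and jitter values are assumptions chosen for illustration, not defaults mandated by this document, and `reconnect_delay` is a hypothetical name.

```python
import random


def reconnect_delay(attempt: int, base: float = 0.1, cap: float = 8.0,
                    jitter: float = 0.1, rng=random) -> float:
    # attempt is 0-based: attempt 0 is the immediate reconnect.
    if attempt == 0:
        return 0.0
    # Double the delay on each failed attempt, up to the cap.
    backoff = min(cap, base * 2 ** (attempt - 1))
    # Random jitter spreads clients out to avoid a thundering herd.
    return backoff + rng.uniform(0, jitter)
```

After a successful reconnect the client would reset `attempt` to 0, which is what "resets the timers" amounts to in this sketch.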
If the connection state changes (connected/disconnected), the client should have some way of notifying the user about it. This can be a callback function or any other idiomatic mechanism in a given language for reporting asynchronous events.
Disconnect buffer
Most clients have a buffer that aggregates messages on the client side in case of disconnection. The buffer fills up while the client is disconnected, and pending messages are sent as soon as the connection is restored. If the buffer fills up before the connection is restored, publish attempts should return an error noting that fact.
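
A bounded disconnect buffer can be sketched as follows. The class and error names are hypothetical, and the limit is expressed as a message count for simplicity; a real client may bound the buffer differently.

```python
from collections import deque
from typing import Callable


class BufferFullError(Exception):
    """Raised when publishing while disconnected and the buffer is full."""


class DisconnectBuffer:
    def __init__(self, max_messages: int):
        # max_messages == 0 effectively disables buffering.
        self.max_messages = max_messages
        self._pending = deque()

    def publish(self, payload: bytes) -> None:
        if len(self._pending) >= self.max_messages:
            raise BufferFullError("disconnect buffer is full")
        self._pending.append(payload)

    def flush(self, send: Callable[[bytes], None]) -> int:
        # Called once the connection is restored; sends in publish order.
        sent = 0
        while self._pending:
            send(self._pending.popleft())
            sent += 1
        return sent
```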
Although clients should provide sensible defaults for handling the connection, in many cases some tweaking is required. The list below defines what can be changed, what it means, and what the defaults are.
default: 2 minutes
Because the client or server might not know that the connection is severed, NATS has a PING/PONG protocol. The client can set the interval at which it sends a PING to the server, expecting a PONG. If two consecutive PONGs are missed, the connection is marked as lost, triggering a reconnect attempt.
It's worth noting that shorter PING intervals can improve the responsiveness of the client to network issues, but they also increase the load on the whole NATS system and the network itself with each added client.
default: 2
Sets the number of allowed outstanding PONG responses for the client's PINGs before marking the client as disconnected and triggering a reconnect.
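
One possible way to count outstanding PINGs is shown below. `PingTracker` is a hypothetical helper, and the exact point at which the connection is declared lost is an assumption of this sketch rather than something this document prescribes.

```python
class PingTracker:
    def __init__(self, max_pings_out: int = 2):
        self.max_pings_out = max_pings_out
        self.outstanding = 0

    def on_interval(self) -> bool:
        # Called on every ping interval tick.
        # Returns True if a PING should be sent now, or False if too many
        # PINGs are already unanswered and the connection should be
        # treated as lost.
        self.outstanding += 1
        return self.outstanding <= self.max_pings_out

    def on_pong(self) -> None:
        # Any PONG from the Server clears the outstanding count.
        self.outstanding = 0
```

With the default of 2, the third tick without an intervening PONG reports the connection as lost.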
default: false
By default, if a client's initial connection attempt fails, `connect` returns an error.
In many scenarios, users might want to allow the first attempt to fail as long as the client continues its efforts and reports progress.
When this option is enabled, the client should start the initial connection process and return the standard NATS connection/client handle while connection attempts continue in the background.
The client should not wait for the first connection to succeed or fail, as in some network scenarios this can take a long time. If the first attempt fails, a standard [Reconnect process] should be performed.
default: 3 / none
Specifies the number of consecutive reconnect attempts the client will make before giving up.
This is useful for preventing zombie services
from endlessly reaching the servers, but it can also
be a footgun and surprise for users who do not expect that the client can give up entirely.
default: 5s
Specifies how long the client will wait for the network connection to be established. In some languages, this can hang eternally, and timeout mechanics might be necessary. In others, the network connection method might have a way to configure its timeout.
default: none
If fine-grained control over reconnect attempt intervals is needed, this option allows users to specify one.
The implementation should make sense in a given language. For example, it can be a callback `fn reconnect(attempt: int) -> Duration`.
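
In Python, such a callback could be a plain function from the attempt number to a `timedelta`. Both `fixed_then_slow` and `next_delay` are hypothetical names, and the 2 second fallback is an assumption made for this sketch only.

```python
from datetime import timedelta
from typing import Callable, Optional


def fixed_then_slow(attempt: int) -> timedelta:
    # Example user-supplied policy: retry quickly at first, then slow down.
    if attempt < 5:
        return timedelta(milliseconds=250)
    return timedelta(seconds=10)


def next_delay(attempt: int,
               custom_delay: Optional[Callable[[int], timedelta]] = None,
               default: timedelta = timedelta(seconds=2)) -> timedelta:
    # Use the user's callback when provided; otherwise fall back to a default.
    if custom_delay is not None:
        return custom_delay(attempt)
    return default
```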
If a given client supports storing messages during disconnect periods, this option allows users to tweak the number of stored messages. It should also allow disabling buffering entirely.
default: false
If set, the client enforces TLS, whether the Server also requires it or not.
If the `tls://` scheme is used in the connection string, this also enforces TLS.
default: false
When a client connects to the Server, the Server may send back a list of other servers in the cluster of which it is aware.
This can be very helpful for discoverability and removes the need for the client to pass all servers in `connect`, but it may also be unwanted if, for example, some server URLs are unreachable for a given client.
default: false
By default, if multiple server addresses are passed in the connect string or array, the client will try to connect to them in random order. This helps maintain a healthy connection distribution, but if in a specific case the list should be treated as a preference list, randomization may be turned off.
This option can be expressed as "enable retaining order" or "disable randomization", depending on what is more idiomatic in a given language.
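
A short sketch of the ordering choice follows, with a hypothetical `connection_order` helper and a `retain_order` flag standing in for whichever option name a client picks.

```python
import random
from typing import List, Sequence


def connection_order(servers: Sequence[str], retain_order: bool = False, rng=random) -> List[str]:
    # Work on a copy so the user's original list is not mutated.
    ordered = list(servers)
    if not retain_order:
        rng.shuffle(ordered)
    return ordered
```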
Sent by the Server before or after establishing TLS, depending on the flow used. It contains information about the Server, the nonce, and other server URLs to which the client can connect.
Sent by the client in response to INFO. Contains information about the client, including an optional signature, the client version, and connection options.
This is a mechanism to detect broken connections that may not be reported by the network connection in a given language.
If the Server sends `PING`, the client should answer with `PONG`.
If the Client sends `PING`, the Server should answer with `PONG`.
If two (configurable) consecutive `PONG`s are missed, the client should treat the connection as broken, and it should start reconnect attempts.
The default interval for PING is 2 minutes.
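
On the client side, answering a Server `PING` can be as small as the sketch below. `reply_for` is a hypothetical helper that handles only this one control line.

```python
from typing import Optional


def reply_for(line: bytes) -> Optional[bytes]:
    # Protocol lines are terminated with CRLF; strip it before comparing.
    if line.strip().upper() == b"PING":
        return b"PONG\r\n"
    # Anything else needs no automatic reply in this sketch.
    return None
```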
The Server can respond with `Authorization Error`.
Discuss any additional security considerations pertaining to the TLS implementation and connection handling.
Smart Reconnection could potentially be a big improvement.