
Chaos testing: break down Redis #1013

Open
quantranhong1999 opened this issue Apr 15, 2024 · 22 comments
@quantranhong1999
Member

Why

Expectation: TMail core services should not be significantly disrupted by a Redis outage.

How

Experiment on preprod to see what happens to the TMail deployment if Redis is down.

Some related Redis features:

  • Rate limiting: a rate-limiting failure should not cause the mail to be dropped IMO
  • Redis event bus: failing key dispatch is OK
  • Rspamd: see what happens to Rspamd and whether it disrupts the TMail service
  • Backend for our OIDC back-channel logout: a failure to store a revoked token should not fail the user's logout?

Identify issues and propose enhancements to help TMail deployment be more fault tolerant and resilient.

@quantranhong1999
Member Author

cc @chibenwa

@vttranlina vttranlina self-assigned this Apr 24, 2024
@vttranlina
Member

Local test result:

| Feature | Status |
| --- | --- |
| Rate limiting | ⚠️ |
| Rspamd | 💚 |
| SSO via Apisix | 🔴 |
| Jmap / Redis event bus | 🔴 |

Rate limiting and Rspamd mailet

It should be noted that we have already declared `<onMailetException>ignore</onMailetException>` (in the mailetcontainer.xml file), so exceptions within a mailet do not disrupt the entire mailet pipeline flow.
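For reference, this is roughly what that declaration looks like in mailetcontainer.xml (the match rule and mailet class here are just illustrative placeholders):

```xml
<processor state="transport">
  <!-- illustrative mailet declaration -->
  <mailet match="All" class="PerSenderRateLimit">
    <!-- swallow exceptions from this mailet instead of aborting the pipeline -->
    <onMailetException>ignore</onMailetException>
  </mailet>
</processor>
```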

Rate limiting

The PerRecipientRateLimit and PerSenderRateLimit mailets were waiting for a response from Redis. I have not yet looked up the default timeout in the ratelimitj library. I observed that they simply waited; after a few minutes I "un-paused" Redis, and the mailets then continued and executed successfully (without any exceptions logged).

I attempted to modify the source code to override the timeout with Reactor (e.g. set to 10 seconds). In this case the mailet threw an exception, but it was ignored and the next mailet in the pipeline continued execution. The recipient received the email from the sender successfully.

// A warning flag has been set for this feature because the timeout for this mailet should be configurable.
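As an illustration, a minimal sketch of that kind of Reactor override, assuming the mailet consumes the Redis-backed rate-limit check as a `Mono` (the helper name is hypothetical):

```java
import java.time.Duration;

import reactor.core.publisher.Mono;

public class RateLimitTimeoutSketch {
    // Hypothetical stand-in for the Redis-backed rate-limit call inside the mailet.
    static Mono<Boolean> checkRateLimitAgainstRedis() {
        return Mono.never(); // simulates Redis hanging while paused
    }

    public static void main(String[] args) {
        // With an explicit Reactor timeout, the call fails after 10 seconds with a
        // TimeoutException instead of blocking indefinitely. The exception then
        // bubbles up to the mailet container, where onMailetException=ignore lets
        // the rest of the pipeline continue.
        checkRateLimitAgainstRedis()
            .timeout(Duration.ofSeconds(10))
            .block();
    }
}
```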

Rspamd

The RspamdScanner mailet behaves slightly differently. After a few seconds (faster than the rate-limit mailets), RspamdScanner automatically finishes processing. I checked the Rspamd logs and found the message `cannot get ANNs list from redis: timeout while connecting the server`, which Rspamd logs as a normal message. There were no errors or disruptions.

SSO via Apisix

It is not possible to log in or log out via SSO. The tmail-apisix-plugin-runner plugin is responsible for this issue: the check of whether the token has been revoked (a Redis query) hangs and eventually times out.

Error occurred at:

```
com.linagora.apisix.plugin.RedisRevokedTokenRepository.exist(RedisRevokedTokenRepository.java:24)
```

Jmap/ Redis event bus

It is not possible to receive a response from the JMAP methods Email/send and Email/set + EmailSubmission/set (the methods that use the queue). The client waits for a response for several minutes (I do not know the exact wait before the server returns an error response; I waited for more than 5 minutes and it was still waiting, so I stopped).

Another related exception:

```
2024-04-24T10:03:58.401397480Z io.netty.handler.timeout.ReadTimeoutException: null
2024-04-24T10:03:58.401663224Z 10:03:58.401 [ERROR] o.a.j.e.GroupConsumerRetry - Exception happens when handling event after 1 retries
```

Last: we cannot start a new TMail instance (or restart one) while Redis is stopped.

@chibenwa
Member

Just to be sure, did you rely on a Redis cluster for those tests? Or did you work on a single container?

Since we use Redis as a pub/sub component for Apache James, some level of reliance on Redis is IMO acceptable; it would be OK to tolerate failures / a dependency on Redis for that pub/sub component.

Rate limiting and Rspamd mailet

> // A warning flag has been set for this feature because the timeout for this mailet should be configurable.

+1 for the timeout in the RateLimiting mailets configuration, none by default.

I do not understand why the Lettuce driver does not handle the timeout itself. It documents a default timeout of 60 seconds. We need to understand why that is not happening IMO.

We could also configure this mailet to ignore exceptions by default...
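Something like this in mailetcontainer.xml, where `redisTimeout` is a hypothetical parameter name for the proposal (it does not exist today):

```xml
<mailet match="All" class="PerSenderRateLimit">
  <!-- hypothetical parameter: bound the Redis call, no timeout by default -->
  <redisTimeout>10s</redisTimeout>
  <!-- and/or ship this mailet with exception-ignoring as the default -->
  <onMailetException>ignore</onMailetException>
</mailet>
```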

SSO via Apisix

Tested with a redis cluster?

We might want to add a parameter ignoreRedisErrors defaulting to false. If turned on, when Redis fails we ignore Redis errors on the logout flow, effectively preserving the service at the cost of a bit of security.
This seems like a valuable tool to have in the toolbox for bad-day situations...
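A rough sketch of what that could look like in the plugin's revoked-token check; only the class and method names come from the stack trace above, everything else (the flag, the facade, the timeout) is hypothetical:

```java
import java.time.Duration;

interface RedisClientFacade {
    // Hypothetical thin wrapper over the Redis driver used by the plugin.
    boolean existsWithTimeout(String key, Duration timeout);
}

public class RedisRevokedTokenRepository {
    // Hypothetical flag, defaulting to false: Redis failures keep failing the auth flow.
    private final boolean ignoreRedisErrors;
    private final RedisClientFacade redis;

    public RedisRevokedTokenRepository(RedisClientFacade redis, boolean ignoreRedisErrors) {
        this.redis = redis;
        this.ignoreRedisErrors = ignoreRedisErrors;
    }

    public boolean exist(String token) {
        try {
            // Bound the lookup so a dead Redis cannot hang the SSO flow indefinitely.
            return redis.existsWithTimeout(token, Duration.ofSeconds(5));
        } catch (RuntimeException e) {
            if (ignoreRedisErrors) {
                // Degraded mode: treat the token as not revoked, preserving the
                // service at the cost of a bit of security.
                return false;
            }
            throw e;
        }
    }
}
```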

Jmap/ Redis event bus

If operating on top of a lone Redis, then failure at the JMAP level seems OK to me at first glance.

However, failing to time out clearly IS an issue. If Redis is KO, those JMAP requests should fail fast, in the 5s range IMO.

> Last: we cannot start a new TMail instance (or restart one) while Redis is stopped.

That's indeed a problem: we shall be able to reboot TMail (when not using Redis for pub/sub).

If using Redis for pub/sub, then failing to start James would be acceptable if Redis is down...

Thoughts?

@vttranlina
Member

vttranlina commented Apr 25, 2024

I used a single Redis container for the test.

Is the Redis cluster (master-replica) on staging k8s enough for what we want?

@vttranlina
Member

I checked staging: the topology is 1 master + 2 replicas, and the TMail configuration uses the Redis master's endpoint.
=> Testing with it is no different from a single-node container.

@quantranhong1999
Member Author

quantranhong1999 commented Apr 25, 2024

> I checked staging: the topology is 1 master + 2 replicas, and the TMail configuration uses the Redis master's endpoint.

A bit more explanation on that.
Before the Redis event bus key work, we configured the Redis endpoint to be the Redis service endpoint (K8s can load-balance to either the master or a replica).
After the Redis event bus key work, some related PUB/SUB commands need to execute against a writable node such as the master, so I recently changed the Redis endpoint to point directly to the master node.

That does not sound good, actually. Alternatives I can think of:

  • Redis Cluster (multi-master)
  • Configure the replicas in the Redis master/replica setup to be writable (I have not yet researched whether that is a best practice).
  • Redis Sentinel?

@Arsnael
Member

Arsnael commented Apr 25, 2024

> Redis Cluster (multi-master)

Now that I think about it again, wasn't there an issue using Redis Cluster with one of our components? Maybe Apisix?

@chibenwa
Member

> Is the Redis cluster (master-replica) on staging k8s enough for what we want?

No, ideally Redis Cluster should be used for testing.

IMO redis topology shall be...

  • One redis cluster for all our use cases
  • Master-slave with AOF for the specific Rspamd use case.

@vttranlina
Member

> One redis cluster for all our use cases

How many masters in the Redis Cluster?

@chibenwa
Member

3 node cluster

@vttranlina
Member

Redis-cluster lab (local)

Docker-compose lab:


Testing with redis-cluster can lead to various scenarios, so before describing them, I'll make a few remarks:

  • We need to pay attention to the cluster-node-timeout parameter in the configuration file when starting redis-cluster (see the snippet below).
    For example, with cluster-node-timeout = 60000 (milliseconds), when a node in the cluster goes down it takes up to 60 seconds for the remaining nodes to confirm that the entire cluster is down. Within the first 1-60 seconds after node 1 goes down, the remaining nodes still report a normal status.
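For reference, the relevant redis.conf directives look like this (the value is in milliseconds):

```
# Enable cluster mode, and consider a node failing after it has been
# unreachable for 60 seconds; cluster-wide failure detection and any
# failover follow from this value.
cluster-enabled yes
cluster-node-timeout 60000
```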

1. Redis cluster: 3 master nodes, 0 replica nodes

(a requirement for building a cluster is to have a minimum of 3 master nodes)
Example, before any node goes down:

  • key1 is stored on node1
  • key2 -> node2
  • key3 -> node3

When node1 goes down:

  • Within the first 1-60 seconds after it goes down:
    • Clients cannot read or write key1 (the call just waits).
    • Clients can read and write key2 and key3 normally.
    • When a client writes a new key4:
      • If the client-side sharding algorithm calculates that key4 should be stored on node1, the call waits.
      • Otherwise, if it calculates that key4 should be stored on node2 or node3, the write succeeds (and reading it afterwards also succeeds).
  • After 60 seconds, nodes 2 and 3 confirm that the entire cluster is down. At this point, no data can be read or written.

2. Redis cluster: 3 master nodes, 3 replica nodes

Sample topology:

Node1 (master) - Node4 (replica)
Node2 (master) - Node5 (replica)
Node3 (master) - Node6 (replica)

  • When node1 goes down:
    • First 1-60 seconds: similar to the 3-master scenario above.
    • After 60 seconds: node4 automatically becomes the master, and reading and writing any data return to normal.

During this time, monitoring the Redis logs shows entries like:

```
Cluster state changed: fail
...
Cluster state changed: ok
```

Tmail-backend and Redis cluster

Rspamd

  • Rspamd's documentation does not mention Redis Cluster, and Rspamd accepts only one Redis endpoint.
    -> So for Rspamd it is similar to a single Redis node.

Rate limiting, Jmap/ Redis event bus

  • When one node goes down:

    • Within the first 1-60 seconds:
      • If client-side sharding routes the data to the down node => the call waits.
      • If it routes to an up node => the response is created normally.
  • For the JMAP methods (Email/send, EmailSubmission/set), once the cluster has confirmed the "fail" status, the response returns immediately:

```json
{
    "sessionState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
    "methodResponses": [
        [
            "Email/send",
            {
                "accountId": "b0d9e55c1a2682586469bc2a23dbb2c671e138ee61e0362972fd7c3d265ea9b2",
                "newState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
                "notCreated": {
                    "K87": {
                        "type": "serverFail",
                        "description": "CLUSTERDOWN The cluster is down"
                    }
                }
            },
            "c1"
        ]
    ]
}
```

Related error log regarding the Lettuce library:

```
io.lettuce.core.RedisCommandExecutionException: CLUSTERDOWN The cluster is down
```
  • In contrast to the lab 1 single-node Redis, when restarting TMail the instance does not crash and exit; instead, starting the server takes several minutes (8 minutes in my lab).

Warning log when starting TMail:

```
06:17:33.031 [WARN ] i.l.c.c.t.DefaultClusterTopologyRefresh - Unable to connect to [redis1/<unresolved>:6379]: Connection initialization timed out after 1 minute(s)
2024-04-26T06:17:33.031475228Z io.lettuce.core.RedisCommandTimeoutException: Connection initialization timed out after 1 minute(s)
```
  • A new error related to using Redis Cluster, even when the cluster is up and running normally, is a RedisHealthCheck failure:

```
PeriodicalHealthChecks - DEGRADED: Redis: Can not connect to Redis.
```

Another note:

  • When fixing up the source code, we should make sure the timeout on the TMail side is consistent with the cluster-node-timeout configured on the Redis side.
  • The current "," separator for Redis nodes in redis.properties does not work; for some reason parsing accepts only the first URL. I replaced it with ";" and updated RedisUris.from, and then it worked normally (see the sketch below).

@quantranhong1999
Member Author

Interesting experiment.

> Within the first 1-60 seconds:
> If client-side sharding routes the data to the down node => the call waits.
> If it routes to an up node => the response is created normally.

So can TMail recover and reconnect to the Redis Cluster after the cluster is back to normal?

> A new error related to using Redis Cluster, even when the cluster is up and running normally, is a RedisHealthCheck failure:

I do not understand this. RedisHealthCheck is supposed to create a new connection for every check, which should acknowledge that the Redis Cluster is healthy again.

The current "," character for separate redis nodes in redis.properties does not work. Don't know why parsing it accepts only the first URL, I tried to replace with ; and update the RedisUris.from, then it worked normally

Don't forget to fire a fix for it ^^

@Arsnael
Member

Arsnael commented Apr 26, 2024

So what I understand is that we can't use Redis Cluster with Rspamd, correct? I would guess the same for Sentinel then, if you can only point to one endpoint?

Or maybe the headless endpoint with k8s, which resolves to all Redis pod addresses, would do the trick?

@quantranhong1999
Member Author

> I would guess the same for Sentinel then, if you can only point to one endpoint?

Rspamd does support Redis Sentinel:
https://rspamd.com/doc/configuration/redis.html

I am unsure about Redis Cluster, as I do not see Rspamd mention it.

@chibenwa
Member

> I am unsure about Redis Cluster, as I do not see Rspamd mention it.

I remember it being unsupported, as it lacked some Redis commands.

@chibenwa
Member

Some summary:

  • A Redis cluster needs 3 masters and 3 replicas
  • Upon master unavailability, it falls back to the replica in 60 seconds
  • BEFORE the fallback we shall expect Redis-related features to be broken in James
  • And to recover after the fallback

@chibenwa
Member

Some questions:

  • What configuration parameter can be used to trigger the fallback? Shall we lower the trigger to, say, 10 seconds?

  • I really wonder if we shall not ignore failures upon key dispatch. It would not be that bad. We could make this configurable in the Redis conf? Because losing the ability to send email when Redis is down does not seem like a nice property to me!

@vttranlina
Member

> What configuration parameter can be used to trigger the fallback?

Open redis-cli, then run CLUSTER FAILOVER FORCE on a replica node.
ref: https://redis.io/docs/latest/commands/cluster-failover/
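For example, against one of the replicas (the hostname here is illustrative):

```
redis-cli -h redis-replica-1 -p 6379 CLUSTER FAILOVER FORCE
```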

> Shall we lower the trigger to, say, 10 seconds?

My opinion: lower is better.
The default configuration is 15 seconds.
Ref: https://raw.githubusercontent.com/redis/redis/7.2/redis.conf

> I really wonder if we shall not ignore failures upon key dispatch. It would not be that bad. We could make this configurable in the Redis conf? Because losing the ability to send email when Redis is down does not seem like a nice property to me!

+1

The key dispatch by Redis for the notification feature is not critical.

@chibenwa
Member

> Shall we lower the trigger to, say, 10 seconds?

What is the impact of a false positive, i.e. you fall back when there is nothing wrong?

@vttranlina
Member

> What is the impact of a false positive, i.e. you fall back when there is nothing wrong?

Whether the master node is down or not, if we run CLUSTER FAILOVER FORCE on a replica node, the replica will be "forced" to become the master immediately.
The old master becomes a replica.

@chibenwa
Member

> What is the impact of a false positive, i.e. you fall back when there is nothing wrong?

That was not the question.

Upon a master-slave failover...

... do we lose unreplicated data?

... how long does the failover take?

... are there other side effects?

Based on these answers we might want to pick a low value, or a defensive value to prevent too-frequent switches...

@vttranlina
Member

vttranlina commented May 2, 2024

> Upon a master-slave failover...
>
> ... do we lose unreplicated data?

Yes.

Example cases:

  1. Data loss due to asynchronous replication:
     • Client Z writes data1 to master A.
     • A "ACKs" the write to Z.
     • A has not yet replicated data1 to its replica node A1 (replication is asynchronous).
     • A goes down, replica node A1 is promoted to master, and the data (data1) that was never replicated to A1 is lost.
  2. Data loss due to a partition:
     • The cluster has a partition issue. Side 1: master A + client Z. Side 2: masters B, C and replicas A1, B1, C1.
     • During the partition (before node_timeout detection kicks in), client Z successfully writes data2 to master node A.
     • When the partition is resolved (A1 has been promoted to master on the majority side), the data (data2) written by client Z is lost.

// the Redis documentation states that Redis Cluster does not support strong consistency

> ... how long does the failover take?

The cluster-node-timeout config, plus a buffer of about 2 seconds.

The Redis documentation:

> the cluster becomes available again after NODE_TIMEOUT time plus a few more seconds required for a replica to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).

Related to:

> Shall we lower the trigger to, say, 10 seconds?

Updated answer:
A longer duration would be preferable in the event of a partition issue where the client is on the same side as the failed master node: it ensures that the data written during the partition is not lost when the issue is resolved.
For the "asynchronous replication" case above, a longer duration can also help avoid losing data that was not yet replicated, but the trade-off is longer downtime.
