Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate message_ids within the same channel #346

Open
sagium opened this issue Nov 3, 2023 · 1 comment
Open

Duplicate message_ids within the same channel #346

sagium opened this issue Nov 3, 2023 · 1 comment

Comments

@sagium
Copy link

sagium commented Nov 3, 2023

Description

We had investigated a weird issue happening on the channel level. On several channels we have identified that it is possible to have duplicate message ids.

The result of this is something like

# example
[
  {global_id: 1, message_id: 1, channel: "channel_name", data: ""},
  {global_id: 2, message_id: 2, channel: "channel_name", data: ""},
  {global_id: 3, message_id: 3, channel: "channel_name", data: ""},
  {global_id: 57, message_id: 3, channel: "channel_name", data: ""},
  {global_id: 4, message_id: 4, channel: "channel_name", data: ""},
  {global_id: 58, message_id: 4, channel: "channel_name", data: ""},
  {global_id: 59, message_id: 5, channel: "channel_name", data: ""}
]

To reach that state you have to reset the backlog_id_key (message_id) key without resetting the backlog_key (channel backlog) itself.

The only way message_id gets reset as far as I know is through expiration inside the transactional LUA script but in many cases we didn't expect this to expire at all since new messages where added daily to the channel (default expiration is 7.days).

The code that affects us the most right now is the following
https://github.com/discourse/message_bus/blob/main/lib/message_bus/backends/redis.rb#L228-L238

since it gets the first of the items having the same score without taking into account the global_id order. In the usual case you will always have 1 item there, but in our case it happens to be 2, so it's very possible that we can consume stale data.

Questions

  • Are there other ways to reset or expire just the backlog_id_key in message bus channel lifecycle? That could explain this inconsistent state of the channel?
  • Did you ever face anything similar? The LUA script is nearly identical through the gem versions.
  • Is it reasonable to handle this edge case and always return the last item for a message_id based on the global_id on the redis.zrangebyscore backlog_key, message_id, message_id with a PR?

Versions

redis: 6.x
redis gem: 4.8.1
message_bus: 3.3.8 (we are upgrading to 4.3.8 since having duplicates results in infinite loops on the old version 🙃 )

@sagium
Copy link
Author

sagium commented Jan 30, 2024

For anyone that might face a similar issue.

The eviction policy on Redis itself is the most important factor when it comes to invalidate keys. Even though message bus is using the same TTL for keys inside the LUA script, the global eviction policy in Redis can affect keys in random ways when memory pressure is present.

Using volatile-lru for example can invalidate backlogs without invalidating the backlog_id_key and vice versa.

In my opinion this could be part of the documentation for the Redis backend in the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant