
With active-active + cluster mode enabled, messages often get processed twice #151

Open
Dbasdeo opened this issue Mar 12, 2022 · 7 comments
Labels
limitation This is a system limitation

Comments

Dbasdeo commented Mar 12, 2022

We had originally implemented a scheduler using your library with Redis ElastiCache in AWS with cluster mode disabled. Everything seemed to work great. For context, our production service that implements the scheduler runs 3 instances. The scheduled task was always dispatched to a single instance, as expected.

We have now tried to switch to an enterprise version of Redis with cluster mode enabled and an active-active database. By that I mean we are running our service in two regions (3 instances per region), each pointing to a different Redis database (i.e. Redis in region A and Redis in region B) that are replicated between each other.

What are the application dependencies?

  • Rqueue Version: 2.8.0-RELEASE
  • Spring Boot Version: 2.4.4
  • Spring Data Redis Version: 2.4.6

How to Reproduce (optional)?

  • This happens maybe 1 in every 10 times we schedule a job.

We aren't running under any load, as we currently only run smoke tests. We schedule a job with a wait of x seconds; x seconds later the job is consumed and executed by two instances. The two instances are always in different regions. I'm guessing the library was just never built to handle this case?
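
For reference, this is roughly what our producer and consumer look like (a simplified sketch; the queue name and payload here are placeholders, not our real code):

```java
import com.github.sonus21.rqueue.annotation.RqueueListener;
import com.github.sonus21.rqueue.core.RqueueMessageEnqueuer;
import org.springframework.stereotype.Component;

@Component
public class JobQueue {
  private final RqueueMessageEnqueuer enqueuer;

  public JobQueue(RqueueMessageEnqueuer enqueuer) {
    this.enqueuer = enqueuer;
  }

  // Producer: enqueue a job to be delivered after waitMillis.
  public void schedule(String jobId, long waitMillis) {
    enqueuer.enqueueIn("job-queue", jobId, waitMillis);
  }

  // Consumer: in the single-region setup exactly one instance picked this up;
  // in the active-active setup we sometimes see it fire once in each region.
  @RqueueListener(value = "job-queue")
  public void onJob(String jobId) {
    // execute the scheduled task
  }
}
```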

sonus21 (Owner) commented Mar 14, 2022

Yeah, it's not built to handle such scenarios. One thing that can be done here is to turn off the Rqueue workers/listeners in the 2nd region.

Do you know what the Redis replication lag between these two setups is? Generally, if data is replicated immediately, duplicate messages should not be processed; each consumer should see unique messages.

Also, a more detailed description of your architecture would help me understand what's happening.

Dbasdeo (Author) commented Mar 14, 2022

There seems to be a bit of replication lag. When the same job gets dispatched to 2 instances in separate regions, both of them are actually able to update the request (we use Redis for persistence as well) without a CAS exception, maybe even 30ms apart. I'm currently trying to figure out what the expected lag is.
As for our architecture, we are basically running a service in 2 European regions, each pointing to its own Redis DB, with replication between them. We are constantly scheduling jobs in both regions that need to be executed in X minutes, and we have found that maybe 10% of the time they are processed twice (once in each region). The ideal scenario is that they could be processed in either region, but only once. I assume we are exposing some race condition in this active-active setup that didn't exist in a single region.

I think we will just try to handle it ourselves, because our holy grail is to be able to handle jobs in multiple regions so we are pretty fault tolerant. I just wanted to know if this had ever come up. We have been considering forking your code and trying to solve it ourselves, but I imagine that would be harder for us than implementing it in our own service :/.

Dbasdeo (Author) commented Mar 15, 2022

@sonus21 I've started to think about the problem in a different way, but I'm not sure the library allows for it.

  1. Each region creates its own queue, schedules its own messages, and handles its own results.
  2. Periodically, we do a health check on the opposite region. If it is returning 5xx, we start listening on the other region's queue. If we see the health check passing again, we stop consuming.

But I can't really see if there is an exposed way for an instance to start/stop consuming from a queue, i.e. not using the annotation, but manually polling or something. Does this exist?

sonus21 (Owner) commented Mar 16, 2022

We can stop/start listeners at runtime, but it does not support adding a new queue at runtime. Why do you need an active-active setup for consumers? Shouldn't you stop the consumers in the other region?
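
A minimal sketch of that failover loop, assuming the application holds some start/stop hook for consumption (the `ListenerToggle` interface, health URL, and queue name below are hypothetical placeholders, not Rqueue APIs):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical abstraction over whatever runtime start/stop hook is available.
interface ListenerToggle {
  void startConsuming(String queue);
  void stopConsuming(String queue);
}

// Requires @EnableScheduling on a configuration class.
@Component
public class RegionFailoverMonitor {
  private static final String OTHER_REGION_HEALTH = "https://region-b.example.com/health"; // placeholder
  private static final String OTHER_REGION_QUEUE = "job-queue-region-b"; // placeholder

  private final HttpClient http = HttpClient.newHttpClient();
  private final ListenerToggle toggle;
  private boolean consumingOtherRegion = false;

  public RegionFailoverMonitor(ListenerToggle toggle) {
    this.toggle = toggle;
  }

  // Poll the opposite region's health endpoint; take over its queue while it is down.
  @Scheduled(fixedDelay = 10_000)
  public void checkOtherRegion() {
    boolean healthy;
    try {
      HttpRequest req = HttpRequest.newBuilder(URI.create(OTHER_REGION_HEALTH)).build();
      healthy = http.send(req, HttpResponse.BodyHandlers.discarding()).statusCode() < 500;
    } catch (Exception e) {
      healthy = false;
    }
    if (!healthy && !consumingOtherRegion) {
      toggle.startConsuming(OTHER_REGION_QUEUE);
      consumingOtherRegion = true;
    } else if (healthy && consumingOtherRegion) {
      toggle.stopConsuming(OTHER_REGION_QUEUE);
      consumingOtherRegion = false;
    }
  }
}
```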

Dbasdeo (Author) commented Mar 16, 2022

Basically we want to be resilient to losing an entire region.
If all the consumers in region A are lost (either the instances go down or Redis goes down), we would like region B to start consuming A's job queue so it's not lost. We don't need to add a queue, just start a consumer of an existing queue at runtime. But it seems like all the consumer code is hidden in the library.

sonus21 (Owner) commented Mar 19, 2022

I think it's not correct to use a multi-region active-active setup for consumers. Even SQS is single-region; it's not replicated across regions.

https://stackoverflow.com/questions/66249605/does-aws-sqs-replicate-messages-across-regions

Can you refer me to some articles that suggest this solution for high availability?

Proposed solution:

  • Deploy listeners in all regions, but enable them in only one region using the task count
  • Once you identify that the current region has an issue, decrease the task count of all its listeners to zero and increase the task count in the other region. This can be automated, but it means the application might lose some messages, which should be acceptable given a region-level outage.

Expected delay in message processing: 5-15 minutes.
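
For the first bullet, one way to keep listeners deployed everywhere but active in a single region is plain Spring profile gating (a sketch; `queue-active` and `job-queue` are placeholder names, and this flips only at restart, so the runtime task-count change in the second bullet would still need a runtime hook like the one sketched above):

```java
import com.github.sonus21.rqueue.annotation.RqueueListener;
import org.springframework.context.annotation.Profile;
import org.springframework.stereotype.Component;

// Listener bean exists only when the "queue-active" profile is set, e.g. via
// spring.profiles.active=queue-active in the primary region's deployment.
// Scheduling (enqueue) can stay enabled in both regions; only consumption is gated.
@Component
@Profile("queue-active")
public class JobListener {

  @RqueueListener(value = "job-queue")
  public void onJob(String jobId) {
    // execute the scheduled task
  }
}
```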

Dbasdeo (Author) commented Mar 22, 2022

I think your proposed solution is what we are going to attempt; it sounds like the way I was already angling. I was trying to find something like the task count, so thank you for sharing!

I honestly was just dreaming a bit that this would be possible, but it sounds like my head was in the clouds.

sonus21 closed this as not planned Jun 12, 2022
sonus21 reopened this Jun 12, 2022
sonus21 added the limitation label Jun 12, 2022