*: make NSQ "cloud native" #1254

Open
thestephenstanton opened this issue May 12, 2020 · 17 comments
@thestephenstanton

When running nsqd on AWS's ECS using Fargate, we are running into a problem: when nsqd gets SIGTERM, it immediately closes all ports and writes everything it has to disk. Is there a reason nsqd doesn't let consumers finish reading off the channel before doing that flushing? Even more so for us, most of our messages are already in flight, so why not just let them finish up?

I would add this ability myself; I am just wondering whether there is a specific reason it works this way. It could be a flag that lets the user specify how much time consumers have to finish before nsqd flushes everything to disk.

@ploxiln
Member

ploxiln commented May 14, 2020

It is assumed that messages could continue to arrive, and that the normal reason to exit is that the settings or binary have been updated, so after a quick restart nsqd can continue serving queued messages and re-send un-acked in-flight messages. It is more server-oriented than cluster-oriented. Servers can come and go, but if a server is going away gracefully, you'd stop all other services that may publish (to localhost nsqd) first, and check that nsqd's queues are drained before terminating that server.

A similar way to manage it in a cluster environment would be to stop publishing to that nsqd instance first, and terminate it some time later.

Preventing new connections for publishing seems a bit awkward, because these ports are shared with other purposes: the HTTP port can be used for publishing or for admin/stats, and the TCP-protocol port can be used by consumers or by publishers. I guess you'd want to just close them all, kick existing TCP-protocol publishers, and only allow existing consumers to continue for a bit. Makes sense, I guess, but it's not an existing feature, as you noted ... it's just not a way nsqd is commonly managed.
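For reference, that drain check can be scripted against nsqd's HTTP /stats endpoint. This is only a minimal sketch, assuming the default 4151 HTTP port and the JSON field names exposed by nsqd 1.x (older releases wrap the payload differently), not an official tool:

// drain-check: poll nsqd's /stats endpoint until every topic and channel
// reports zero queued and in-flight messages, or a deadline passes.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type nsqdStats struct {
	Topics []struct {
		TopicName string `json:"topic_name"`
		Depth     int64  `json:"depth"`
		Channels  []struct {
			Depth         int64 `json:"depth"`
			InFlightCount int64 `json:"in_flight_count"`
		} `json:"channels"`
	} `json:"topics"`
}

func drained(statsURL string) (bool, error) {
	resp, err := http.Get(statsURL)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var s nsqdStats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return false, err
	}
	for _, t := range s.Topics {
		if t.Depth > 0 {
			return false, nil
		}
		for _, c := range t.Channels {
			if c.Depth > 0 || c.InFlightCount > 0 {
				return false, nil
			}
		}
	}
	return true, nil
}

func main() {
	const statsURL = "http://127.0.0.1:4151/stats?format=json"
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if ok, err := drained(statsURL); err == nil && ok {
			fmt.Println("nsqd queues drained")
			os.Exit(0)
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Println("timed out waiting for nsqd queues to drain")
	os.Exit(1)
}

You'd run something like this after stopping publishers and before letting the host (or orchestrator) send SIGTERM to nsqd.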

@thestephenstanton
Author

Then my question would be: how is it normally managed? In my mind, say I have service A that is running nsqd and the producer, and service B consuming from A. That seems to me like a standard way to use nsqd, but with something serverless like Fargate, which has ephemeral storage, whatever nsqd flushed to disk will be gone when the task spins back up.

So my main question (and maybe it's just me not understanding) is: why not just let consumers continue consuming and stop producing yourself (assuming we are running the producer on the same machine as nsqd)?

@ploxiln
Member

ploxiln commented May 14, 2020

Uh, well, that's my question for you: why not just let consumers continue consuming, and stop producing yourself? Since your service is the producer?

@ploxiln
Member

ploxiln commented May 14, 2020

When using a modern container orchestration system, that does make for some complicated problems for you to solve ... anyway, it's been suggested in the past that you manage nsqd in a container orchestration system similar to how you would manage a database.

@thestephenstanton
Author

Because when Fargate is updating tasks, it sends a SIGTERM to all of the containers in the task, and nsqd immediately shuts down and flushes to disk.

But I can see what you mean, especially since these orchestration systems were probably not as popular when nsqd was first created.

Anyway, if we were to wrap nsqd with another process that captures the SIGTERM and gives it n amount of time to let consumers finish consuming (assuming we shut down the producer), could you see any problems with that? And if not, could you see that as something useful to add as a flag to nsqd?
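Concretely, something like the following is what I have in mind for the wrapper. It's only a sketch, with a hard-coded placeholder drain window rather than anything smart, and it assumes nsqd is on the PATH and that the orchestrator's stop timeout is longer than the drain window:

// nsqd wrapper: start nsqd as a child process, intercept SIGTERM, wait a
// fixed drain window so consumers can finish in-flight messages, then
// forward the signal to nsqd.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Pass our arguments straight through to nsqd.
	cmd := exec.Command("nsqd", os.Args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("starting nsqd: %v", err)
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		log.Printf("nsqd exited on its own: %v", err)
	case sig := <-sigs:
		log.Printf("got %v, delaying shutdown so consumers can drain", sig)
		// Placeholder drain window; it must fit inside the orchestrator's
		// stop timeout or nsqd will be SIGKILLed before it can flush.
		time.Sleep(90 * time.Second)
		cmd.Process.Signal(syscall.SIGTERM)
		<-done
	}
}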

@ploxiln
Member

ploxiln commented May 14, 2020

A wrapper manager process makes sense to me, and a new flag/option to nsqd that is reasonably simple could also make sense.

(In my case, I do use Docker containers, but most of my nsqd instances currently have 3 months of uptime, and have existed on the same hosts with the same data-dir for 8 months.)

@jehiah
Member

jehiah commented May 14, 2020

@thestephenstanton just want to chime in to say thank you for this experience report; I've been thinking about how to make it easier to use nsq in a container environment recently and it's helpful to know the specific pain points folks are encountering.

I would love to know a little bit more about how you configure your producer to know where to connect to nsqd for publishing; so far that (producer configuration) is an aspect nsqd has not been opinionated about, but it might be the area we could most easily improve for a container environment.

@thestephenstanton
Author

@ploxiln ahhh gotcha, that makes sense the way you have it then.

@jehiah so for all our producers we simply give them our nsqd address, which is essentially localhost, like this:

nsq.NewProducer("localhost:4150", nsqConfig)

Nothing special in our nsqConfig.

So there is nothing really interesting about our producer. But essentially what we do is: when our producer container gets a SIGTERM, it will indirectly*** stop the producer. But then, like I mentioned, when nsqd gets the SIGTERM, it will immediately close connections and flush to disk.

There is a little fuzziness in my head about what happens when both the producer and nsqd get the SIGTERM. nsqd probably doesn't accept any new messages from the second it gets that SIGTERM until it finishes flushing messages, am I correct? In which case, I don't know what happens if the producer tries to publish a message before reacting to the SIGTERM it got; I can only assume it is an error.

*** I say indirectly because our producer relies on getting messages via our consumer, which consumes data off another topic in another service. So we stop our consumer on SIGTERM, which in turn stops our producer from producing new messages.
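To make that ordering concrete, a stripped-down sketch of our producer-side shutdown looks roughly like this (the topic name and the work channel are made up for the example, and error handling is trimmed):

// Simplified producer-side shutdown: stop accepting new work on SIGTERM,
// then stop the nsq producer before the process exits.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	"github.com/nsqio/go-nsq"
)

func main() {
	nsqConfig := nsq.NewConfig()
	producer, err := nsq.NewProducer("localhost:4150", nsqConfig)
	if err != nil {
		log.Fatal(err)
	}

	// In our real service this channel is fed by the consumer of another topic.
	work := make(chan []byte)

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	for {
		select {
		case body := <-work:
			// "some-topic" is a made-up name for the example.
			if err := producer.Publish("some-topic", body); err != nil {
				log.Printf("publish failed (nsqd may already be shutting down): %v", err)
			}
		case <-sigs:
			// Stop publishing first; nsqd receives its own SIGTERM separately.
			producer.Stop()
			return
		}
	}
}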

@thestephenstanton
Author

I know that was a lot; hope it all made sense.

mreiferson changed the title from "nsqd to allow consumers to finish consuming" to "nsqd: allow consumers to finish consuming" on Jun 11, 2020
@mreiferson
Member

Given this issue, and #1078 and #980, it feels like we should keep this one open to be the catch-all for discussion / tracking for making NSQ more "cloud native" (sorry), regardless of whether we actually decide to do anything.

cc @jehiah @ploxiln

mreiferson changed the title from "nsqd: allow consumers to finish consuming" to "*: make NSQ "cloud native"" on Jun 14, 2020
@thestephenstanton
Author

Just to add some more clarity on what I am trying to describe: the scenario in AWS's Fargate is this.

[diagram: ECS Fargate Service A with n tasks, each task running a Producer A container and a Topic T (nsqd) container]

For those who don't know, in ECS Fargate you define a task and spin up n of these tasks. Each task also has its own ephemeral storage, and in each task you can run any number of containers. So in this example, Producer A is a Go binary and Topic T is the nsqd that we run in that task.

In this scenario, when Service A gets a SIGTERM, each task and each container gets that SIGTERM. What happens is that nsqd immediately flushes Topic T to disk and then exits 0. Once that is done and all containers in the Fargate task have exited, the file that nsqd flushed to is lost.

A simple solution for this kind of scenario would be to have nsqd read a flag that gives it a delayed shutdown, so that when it receives SIGTERM, it stays around for x amount of time to allow consumers to finish consuming.

NOTE: this only solves it with this architecture because the SIGTERM also goes to the producer, and we make sure to stop the producer as soon as we get that SIGTERM.

If we get super backed up, this delayed shutdown of nsqd doesn't fully solve our problem, because we could still lose data if the consumer hasn't finished in time. So I think there is still room for this cloud-native-friendly NSQ to have some different kind of flushing mechanism.
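To be explicit about the proposal, the invocation could look something like this. The drain flag is purely hypothetical and does not exist in nsqd today; -data-path is an existing flag:

nsqd -data-path=/data -graceful-drain-timeout=2m   # hypothetical flag: keep serving existing consumers for up to 2m after SIGTERM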

@pepusz

pepusz commented Aug 10, 2020

I just started thinking about exactly the same situation. For one of our ECS Fargate based systems, I planned to implement a message queue, and NSQ looked really handy for this, but the scenario that you described worries me as well.

First of all, I don't think extra wait time would solve the issue, because in theory if the consumers are too slow then either the SIGTERM takes effect too late or data loss happens, and that could cause issues either way.

What I've just thought of is adding an EFS volume to Tasks A and B; since April it has been possible to attach persistent volumes with Fargate too. I've never tried it in practice, but I will definitely try it during the next couple of weeks.

@thestephenstanton
Author

@pepusz

First of all, I don't think extra wait time would solve the issue, because in theory if the consumers are too slow then either the SIGTERM takes effect too late or data loss happens, and that could cause issues either way.

Granted, the delay I propose only "kicks the can down the road"; it definitely isn't the best solution. But the consumers of the NSQ topic would only have to process the messages that are in flight, and our consumer is really quick, so we only need a little bit of time. Plus, ECS allows you up to 10 minutes before it SIGKILLs.

What I've just thought of is adding an EFS volume to Tasks A and B; since April it has been possible to attach persistent volumes with Fargate too. I've never tried it in practice, but I will definitely try it during the next couple of weeks.

So we tried EFS, but the problem is that you set up all that EFS configuration on the task definition. Let's say you have a directory on EFS called /data: once you have more than one task spun up, your multiple tasks with NSQ will all try to use that /data directory, and only one will succeed while the others fail.

Something I thought of that we could use EFS for: NSQ could be told to look at a given directory and make a subdirectory in it, named with some kind of uid, that that specific instance uses. When NSQ instances spin up next time, they would look for any of those uid folders and, if they exist, put a lock on them and consume from them. You'd still have problems with zombie directories, though. And I know that is much easier said than done.
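A rough sketch of that claim-a-directory idea, just to illustrate it. It assumes the EFS mount is at /data and uses an advisory flock; whether flock behaves well enough on EFS is something you'd have to verify, and the directory naming is made up:

// claim-dir: scan /data for existing per-instance subdirectories, try to
// flock one of them, and otherwise create a fresh one. The claimed path
// would then be handed to nsqd as its -data-path.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"syscall"
)

// tryLock takes an exclusive, non-blocking advisory lock on dir/.lock.
// The returned file must stay open for as long as the lock should be held.
func tryLock(dir string) *os.File {
	f, err := os.OpenFile(filepath.Join(dir, ".lock"), os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		return nil
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil // another instance already owns this directory
	}
	return f
}

func main() {
	const root = "/data" // assumed EFS mount point
	var lock *os.File

	// Prefer adopting an existing (possibly orphaned) data directory.
	entries, _ := os.ReadDir(root)
	for _, e := range entries {
		if e.IsDir() {
			dir := filepath.Join(root, e.Name())
			if lock = tryLock(dir); lock != nil {
				fmt.Println(dir) // hand this to nsqd as -data-path
				break
			}
		}
	}

	// Otherwise create a new uniquely named directory and claim it.
	if lock == nil {
		dir, err := os.MkdirTemp(root, "nsqd-")
		if err != nil {
			log.Fatal(err)
		}
		if lock = tryLock(dir); lock == nil {
			log.Fatal("could not lock freshly created directory")
		}
		fmt.Println(dir)
	}

	// Placeholder: in a real wrapper you'd start nsqd here (e.g. via os/exec)
	// with -data-path pointed at the claimed directory, keeping this process
	// alive so the flock is held and other instances skip the directory.
	select {}
}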

@pepusz

pepusz commented Aug 16, 2020

Thanks @thestephenstanton for checking out EFS, good findings. In a couple of weeks I'll try to play with NSQ and ECS, so if I find out something else, I will write here.

@yongzhang

Any news on this? 2024 now 😃

@pepusz

pepusz commented Feb 21, 2024

We've switched to NATS.

@thestephenstanton
Author

Never got around to writing it, and our team switched to Kafka (though I still like NSQ better; it's much simpler and easier to maintain).

If NSQ could accomplish this and have durability, it'd be the GOAT.
