
nsq: DRAINING mode #1302

Open
jehiah opened this issue Nov 23, 2020 · 5 comments · May be fixed by #1305

@jehiah
Member

jehiah commented Nov 23, 2020

To make it easier to run nsqd in environments where hosts are not long-running, and to simplify cluster management operations, a new "draining" mode will be introduced to nsqd.

An nsqd instance in "draining" mode will:

  • Not accept any new messages
  • Allow consumers to receive all remaining messages
  • Exit after all topics and channels are empty
  • Indicate draining status via the /info endpoint
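The four behaviors listed above can be modeled as a small one-way state machine. The following is only a minimal sketch under stated assumptions: `Node`, `ErrDraining`, and every method name are hypothetical and are not part of nsqd, and a single in-memory queue stands in for all topics and channels.

```go
package main

import (
	"errors"
	"sync"
)

// ErrDraining is a hypothetical error for publishes refused while draining.
var ErrDraining = errors.New("nsqd is draining")

// Node is a toy stand-in for an nsqd instance, not the real nsqd type.
type Node struct {
	mu       sync.Mutex
	draining bool
	exited   bool
	queue    []string // stands in for all topics and channels
}

// StartDrain switches the node into draining mode. The transition is one-way.
func (n *Node) StartDrain() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.draining = true
	n.exitIfEmptyLocked()
}

// Publish queues a message, or refuses it once draining has started.
func (n *Node) Publish(msg string) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.draining {
		return ErrDraining
	}
	n.queue = append(n.queue, msg)
	return nil
}

// Consume hands out the oldest remaining message, if there is one.
func (n *Node) Consume() (string, bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if len(n.queue) == 0 {
		return "", false
	}
	msg := n.queue[0]
	n.queue = n.queue[1:]
	n.exitIfEmptyLocked()
	return msg, true
}

// Info reports the state that an /info-style endpoint could expose.
func (n *Node) Info() (draining, exited bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.draining, n.exited
}

// exitIfEmptyLocked marks the node as exited once a drain has emptied it.
// The caller must hold n.mu.
func (n *Node) exitIfEmptyLocked() {
	if n.draining && len(n.queue) == 0 {
		n.exited = true
	}
}
```

In this sketch "exit" is just a flag; a real process would close listeners and return from its main loop at that point.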

Clients that take an HA approach of pooling multiple nsqds for publishing messages (i.e. nsqio/go-nsq#311) are expected to transparently tolerate a host in draining mode.
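As an illustration of what "transparently tolerate" could mean for a pooled publisher, here is a rough sketch that tries each host in turn and moves on when one refuses the publish. It does not use the go-nsq API: `publishFunc`, `publishAny`, and `errDraining` are hypothetical names introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDraining is a hypothetical stand-in for the error a client would see
// when a draining nsqd refuses a publish.
var errDraining = errors.New("E_PUB_FAILED")

// publishFunc abstracts "publish to one nsqd host" for this sketch.
type publishFunc func(topic string, body []byte) error

// publishAny tries each host in order and returns on the first success,
// so a draining host is skipped without surfacing an error to the caller.
func publishAny(pool []publishFunc, topic string, body []byte) error {
	if len(pool) == 0 {
		return errors.New("no nsqd hosts configured")
	}
	var lastErr error
	for _, publish := range pool {
		if lastErr = publish(topic, body); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("all %d nsqd hosts failed, last error: %w", len(pool), lastErr)
}
```

A production client would likely also remember which hosts are draining and stop trying them, rather than rediscovering that on every publish.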


Implementation Plan

A new --sigterm=drain CLI flag will enable this new behavior. Existing functionality will be preserved with the argument --sigterm=clean-shutdown.
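A sketch of how the proposed flag could be parsed and wired to SIGTERM, using only the Go standard library. The flag name and its two values come from the proposal above; the default of `clean-shutdown`, the function names, and the `drain`/`shutdown` callbacks are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"os/signal"
	"syscall"
)

// parseSigtermMode reads the proposed --sigterm flag and validates its value.
// Defaulting to "clean-shutdown" is an assumption, not something the RFC states.
func parseSigtermMode(args []string) (string, error) {
	fs := flag.NewFlagSet("nsqd-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the caller's output
	mode := fs.String("sigterm", "clean-shutdown", "action on SIGTERM: drain or clean-shutdown")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *mode {
	case "drain", "clean-shutdown":
		return *mode, nil
	}
	return "", fmt.Errorf("invalid --sigterm value %q", *mode)
}

// onSigterm blocks until SIGTERM arrives, then runs the callback selected
// by mode. Both callbacks are hypothetical hooks supplied by the caller.
func onSigterm(mode string, drain, shutdown func()) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGTERM)
	defer signal.Stop(ch)
	<-ch
	if mode == "drain" {
		drain()
		return
	}
	shutdown()
}
```

A caller would typically run `onSigterm` in its own goroutine after parsing `os.Args[1:]`.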

A new PUT /config/drain endpoint can also initiate a drain, and a PUT /config/shutdown would initiate a clean shutdown.

While in draining mode, new messages will be rejected with an error: E_PUB_FAILED for messages published over the TCP protocol, and HTTP 503 for messages published over HTTP.
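The endpoint and the rejection behavior described above could look roughly like this. It is a sketch, not nsqd's implementation: the `server` type and its methods are hypothetical, the handlers do not enqueue anything, and `/config/drain` is the path from the proposal (later comments suggest the name may change).

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// server is a hypothetical stand-in holding only the drain flag.
type server struct {
	draining atomic.Bool
}

// pubStatus returns the HTTP status a publish would receive right now.
func (s *server) pubStatus() int {
	if s.draining.Load() {
		return http.StatusServiceUnavailable // 503 while draining
	}
	return http.StatusOK
}

// tcpPubResponse returns the TCP protocol response a PUB would receive.
func (s *server) tcpPubResponse() string {
	if s.draining.Load() {
		return "E_PUB_FAILED"
	}
	return "OK"
}

// handleDrain implements the proposed PUT /config/drain.
func (s *server) handleDrain(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPut {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	s.draining.Store(true) // one-way: nothing in this sketch resets the flag
	w.WriteHeader(http.StatusOK)
}

// handlePub rejects publishes while draining; a real nsqd would enqueue
// the request body on the success path.
func (s *server) handlePub(w http.ResponseWriter, r *http.Request) {
	if status := s.pubStatus(); status != http.StatusOK {
		http.Error(w, "draining", status)
		return
	}
	_, _ = w.Write([]byte("OK"))
}

// routes wires both handlers into a mux.
func (s *server) routes() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/config/drain", s.handleDrain)
	mux.HandleFunc("/pub", s.handlePub)
	return mux
}
```

Keeping the decision in `pubStatus` and `tcpPubResponse` means both protocols read the same flag, so they cannot disagree about whether the node is draining.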

An attempt to create new topics and channels (via subscribe) will be rejected if nsqd is in drain mode.

Once initiated, a drain operation can only run to completion; it can't be canceled. TBD: PUT /config/shutdown may be able to override the drain, close all connections, and exit nsqd.

Open Questions

  • Should each topic (and/or channel) be closed as it is drained, or should they only be closed after all are drained? If this functionality is per-topic, should the HTTP API expose that same behavior, and is there a need to expand the lookupd protocol to initiate a tombstone before exiting to avoid race conditions w/ clients?
  • Should draining close a topic/channel, or should it still be configured on an nsqd instance after restart? (i.e. is this similar to POST /topic/delete)

cc: #1254
Closes #1022

@jehiah jehiah self-assigned this Nov 23, 2020
@mreiferson
Member

mreiferson commented Nov 25, 2020

SGTM, I suspect this one is going to be a bit tricky.

> Should each topic (and/or channel) be closed as it is drained, or should they only be closed after all are drained? If this functionality is per-topic, should the HTTP API expose that same behavior, and is there a need to expand the lookupd protocol to initiate a tombstone before exiting to avoid race conditions w/ clients?

My gut tells me that, as a first pass, trying an implementation that waits for all topics/channels to be empty and then exits will likely avoid the "premature client reconnect after close" problem.

> Should draining close a topic/channel, or should it still be configured on an nsqd instance after restart? (i.e. is this similar to POST /topic/delete)

A little confused by this. My understanding is that this proposal isn't intended to modify the existence of topics/channels, so my answer would be that topics/channels should remain present if an nsqd pointed at the same --data-path starts up again after draining.

> TBD: PUT /config/shutdown may be able to override the drain and close all connections and exit nsqd.

IMO yes, we must provide a mechanism to force a (clean) shutdown. Maybe even offer a timeout?
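The two suggestions in this comment (wait until everything is empty, and offer a timeout that forces a clean shutdown) can be combined in one loop. This is only a sketch: `drainOrForce` and the `depth` callback are hypothetical, with `depth` assumed to report the total number of messages remaining across all topics and channels.

```go
package main

import (
	"context"
	"time"
)

// drainOrForce polls depth until it reports zero or the timeout elapses.
// It returns true if the drain completed, and false if the caller should
// fall back to a forced (clean) shutdown.
func drainOrForce(depth func() int, poll, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(poll)
	defer ticker.Stop()

	for {
		if depth() == 0 {
			return true // every topic and channel is empty
		}
		select {
		case <-ctx.Done():
			return false // timed out with messages still queued
		case <-ticker.C:
			// poll again
		}
	}
}
```

Polling is used here only for brevity; an implementation inside nsqd could instead be notified when a channel becomes empty.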

Minor:

  • Feel like we can come up with something slightly better than --sigterm, how about --term-mode?
  • /config/{drain,shutdown} don't really feel like "configurations"

@mreiferson
Member

Also love that this and #1300 are labeled chore 😂

@jehiah
Member Author

jehiah commented Nov 25, 2020

I should comment that I don't yet have a perfectly clear idea of the implementation for this; it will be a chore!

> A little confused by this. My understanding is that this proposal isn't intended to modify the existence of topics/channels, so my answer would be that topics/channels should remain present if an nsqd pointed at the same --data-path starts up again after draining.

Perhaps there is a case for both? My intention is to target a use case where an nsqd is going away (i.e. removed from rotation), and by the time it's done draining there is nothing left on that nsqd instance. In that context I'm leaning towards "delete" functionality where topics disappear as they are drained.

If you are trying to remove an nsqd instance from rotation where that instance has 10 different topics, but just one or two with notable backlogs, it would be desirable to have the topics that drain quickly deleted. Deleting promotes better cluster hygiene: you avoid an nsqd instance that is no longer receiving messages on a topic still getting consumer connections, which causes RDY to be spread thin. (i.e. think of a topic that takes a day to drain in some odd circumstance.)
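The "delete topics as they drain" idea amounts to a periodic sweep over per-topic depths. A minimal sketch, assuming a hypothetical `reapDrained` helper and a plain map of topic name to remaining message count in place of nsqd's real topic structures:

```go
package main

import "sort"

// reapDrained deletes every topic whose depth has reached zero and returns
// the deleted names in sorted order. Topics with a backlog are left alone.
func reapDrained(depths map[string]int) []string {
	var deleted []string
	for topic, depth := range depths {
		if depth == 0 {
			delete(depths, topic) // deleting during range is safe in Go
			deleted = append(deleted, topic)
		}
	}
	sort.Strings(deleted)
	return deleted
}
```

With ten topics of which only one or two have a backlog, the first sweep would remove the rest, so consumers stop holding connections to topics that will never receive another message on this instance.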

I've used the word "drain" because I think it's best, but what I really mean is the process of removing an nsqd from rotation.

> Feel like we can come up with something slightly better than --sigterm, how about --term-mode?

I had --term-mode in a prototype of this feature but felt it wasn't obvious enough that this was about signal handling. --sigterm-mode?

> /config/{drain,shutdown} don't really feel like "configurations"

Agreed. Ideas? /state/{drain,shutdown} were some other naming ideas I had.

@mreiferson
Member

Got it. In that case, the simplest way might be to proactively send a tombstone to nsqlookupd (to avoid new clients discovering that node) but not close (which may force connected clients to reconnect)? However, one can imagine scenarios where you need clients to reconnect in order to fully drain 😜.

--sigterm-mode 👍

> Agreed. Ideas? /state/{drain,shutdown} were some other naming ideas I had.

🤷 might make sense at the top-level?

@jehiah
Member Author

jehiah commented Nov 25, 2020

> simplest way might be to proactively send a tombstone to nsqlookupd (to avoid new clients discovering that node) but not close (which may force connected clients to reconnect)? However, one can imagine scenarios where you need clients to reconnect in order to fully drain 😜.

I think we are on the same page; you wouldn't tombstone until the actual removal, so I don't think that affects clients draining. Currently the TCP protocol for lookupd doesn't support tombstone, but that would be easy to resolve if needed. It might also not be critical if nsqd rejects the creation of new topics when it's draining. That would inhibit new subscriptions after a topic is deleted.

> might make sense at the top-level?

👍

I think I have enough feedback here to start on an implementation; then we can move to discussing the tradeoffs of a concrete implementation.

@jehiah jehiah changed the title nsq: DRAINING mode [RFC] nsq: DRAINING mode Nov 25, 2020