
Failure to start worker crashes pool #122

Open · juhlig opened this issue Mar 18, 2019 · 8 comments

juhlig commented Mar 18, 2019

In the case that a worker cannot be started (for example, because a database's connection limit is reached), the entire pool crashes. This is especially bad with overflow workers: a request for an overflow worker can bring down the pool and all of the linked existing workers with it.

In my case, I'm working in a tight environment where other things running on the same machine may simply exhaust the local ports. A new connection to the database is then not possible, which is somewhat OK; I can always try again later. However, I won't know that unless starting an overflow worker is attempted, and then all the existing connections in the pool crash.

alex88 commented Jun 9, 2020

@juhlig did you find any way around this in the meantime? I'm having the same issue, and I'm trying to do what ecto does with its connections: it just logs errors, with some delay between failures, and doesn't crash.

juhlig commented Jun 9, 2020

@alex88 Well, yes and no =^^= That was why I submitted #124...

One option is to implement your workers in a way that keeps the crash-prone part out of the startup procedure. In other words, let them always start, and crash later. This, of course, is only feasible if you have control over the worker implementation.
At the time I was writing this ticket, my workers were mysql-otp clients, and back then connecting was invariably part of their startup routine. Since then, mysql-otp has evolved and can now be told to connect either as part of the startup routine ({connect_mode, synchronous}), outside of but immediately after the startup routine ({connect_mode, asynchronous}), or only when a connection is first needed, e.g. for a query ({connect_mode, lazy}).
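
For illustration, a minimal sketch of starting such a client in lazy mode (host, user and password values are placeholders):

```erlang
%% Sketch: a mysql-otp client that only connects when a connection
%% is first needed. The connection details are placeholders.
{ok, Pid} = mysql:start_link([{host, "localhost"},
                              {user, "app_user"},
                              {password, "secret"},
                              {connect_mode, lazy}]).
```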

What I really ended up with was a pool project of my own, hnc =^^= Well, two actually; I scrapped the first one since I had made some mistakes early on that were hard to iron out later, it quickly grew over my head, and I unwittingly reinvented most of pooler along the way ^^; You might want to give it a try, but be aware that it is still at 0.0.1 and as yet undocumented (though the API is very similar to poolboy's).

alex88 commented Jun 10, 2020

@juhlig thanks for the detailed answer. I've currently moved the crash-prone part to after init, with retries; the problem is that in my case (a gRPC connection) poolboy will check out the worker even if it isn't connected yet.

juhlig commented Jun 11, 2020

Hm, yes, but I would argue that this is not poolboy's responsibility. It just gives you a worker; it doesn't say anything about that worker being ready.

I don't know the intricacies of your use case, so the following is based on some guesswork. It looks like what you want is to establish a connection as part of the worker startup routine, i.e., when poolboy gives you a worker, it is guaranteed to be in a connected state. But if it can't connect... what do you expect to happen then?

Since you asked the question in this very issue, I guess you expect poolboy to give you "nothing" (that is, raise an exception of sorts in the process calling checkout) because the worker couldn't start. In a sense, this is just what poolboy does, only it crashes, taking all the other workers in the same pool along with it.
Whether or not the entire pool crashes, you would have to handle the exception in whatever process wants to use a worker, so why not instead handle being given a worker that may not be operational (yet)? You have to handle the worker malfunctioning (losing its connection or such) while you're using it anyway, so there is not much difference between now and later.

You also mentioned retries, so I take it your startup procedure is supposed to try to connect, maybe retry a few times, and either complete once it has connected or fail. Even if poolboy handled the failure gracefully, baking retries into the start procedure drags it out. Poolboy starts workers synchronously, and it becomes unresponsive until the startup has completed.
Imagine three checkout requests arriving at the same time, each requiring poolboy to start a new worker (as opposed to serving a free one), with each startup taking 5s on average. The first request is served after 5s, the second after 10s, the third after 15s. Now imagine further that, just after those three checkout requests, while poolboy is busy starting workers, one process checks a worker in and, immediately after, another process wants to check a worker out. You might expect that last checkout to be served instantly, or that the second request finds the freed worker and takes it. After all, a worker was checked in, so the checkout should find a free worker. But that is not the case; the requests are processed in order, one after another, as they arrived:

- checkout -> start worker, 5s -> served at 5s
- checkout -> start worker, 5s+5s -> served at 10s
- checkout -> start worker, 5s+5s+5s -> served at 15s
- checkin -> instant -> processed at 15s
- checkout -> instant, served with the just checked-in worker -> at 15s
Depending on circumstances and requirements, this alone could defeat the entire purpose of having a pool in the first place.
In a nutshell: you really want startup procedures to be fast.

I would suggest starting a worker and doing the connecting in a second step, as you already do. The worker should block and not accept any work until it has either reached an operational state or given up (and crashed), completing the second step one way or the other. This way the pool is always responsive; the individual users of workers may have to wait, but other users can still check workers out of the pool at the same time.
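
To illustrate, a minimal sketch of such a two-step worker, assuming a hypothetical my_db module standing in for the real client:

```erlang
-module(pool_worker).
-behaviour(gen_server).

-export([start_link/1, query/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(Args) ->
    gen_server:start_link(?MODULE, Args, []).

query(Worker, Query) ->
    gen_server:call(Worker, {query, Query}).

init(Args) ->
    %% Return immediately so the pool stays responsive. The connect
    %% message is in the mailbox before init returns, so it is
    %% guaranteed to be processed before any user's call.
    self() ! connect,
    {ok, #{args => Args, conn => undefined}}.

handle_info(connect, State = #{args := Args}) ->
    %% Second step: connect (my_db is a stand-in for the real client).
    %% On failure, only this worker crashes, not the pool.
    case my_db:connect(Args) of
        {ok, Conn} -> {noreply, State#{conn := Conn}};
        {error, Reason} -> {stop, Reason, State}
    end.

handle_call({query, Query}, _From, State = #{conn := Conn}) ->
    {reply, my_db:query(Conn, Query), State};
handle_call(_Request, _From, State) ->
    {reply, {error, badcall}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```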

alex88 commented Jun 16, 2020

@juhlig sorry for the delay.

I agree it's not poolboy's responsibility to know which worker is connected and which isn't. My idea of something that would work: if init fails, poolboy should just restart the worker (even better, with some backoff), and if no workers are available, the checkout call times out and the user handles the timeout.

> Whether or not the entire pool crashes, you would have to handle the exception in whatever process wants to use a worker, so why not instead handle being given a worker that may not be operational (yet)? You have to handle the worker malfunctioning (losing its connection or such) while you're using it anyway, so there is not much difference between now and later.

In my case, the gRPC client throws an error if the first connection fails, but automatically handles reconnections. So if it fails on init you get nothing, whereas if it fails later you can still check out a client, which will return a connection error but work again after the reconnection.
However, this is just one possibility, definitely not a generic solution.

> You also mentioned retries, so I take it your startup procedure is supposed to try to connect, maybe retry a few times, and either complete once it has connected or fail. Even if poolboy handled the failure gracefully, baking retries into the start procedure drags it out. Poolboy starts workers synchronously, and it becomes unresponsive until the startup has completed.

Yes, that was my first try; on top of that, not only did poolboy fail to start, but so did the whole app, because poolboy was included in the main app's supervision tree.

> I would suggest starting a worker and doing the connecting in a second step, as you already do. The worker should block and not accept any work until it has either reached an operational state or given up (and crashed), completing the second step one way or the other. This way the pool is always responsive; the individual users of workers may have to wait, but other users can still check workers out of the pool at the same time.

Thanks for the suggestion, maybe some better error handling can also be done on the client side to make the integration with poolboy easier and more reliable. Thank you!

JayDadhania commented Mar 16, 2021

@alex88 I was having a similar problem while maintaining a connection pool. I found a solution which is not perfect but does the job: you can initialize a worker without making the gRPC connection. When the first call is made to the worker, you make the connection, store it in the worker's state, and use that connection for the rest of the calls.
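
A sketch of that connect-on-first-call idea (my_db again being a hypothetical stand-in for the real client module):

```erlang
%% Sketch: no connection at init; connect on first use, keep it in state.
init(Args) ->
    {ok, #{args => Args, conn => undefined}}.

handle_call({query, Query}, _From, State = #{conn := undefined, args := Args}) ->
    %% First call: establish the connection now and store it.
    {ok, Conn} = my_db:connect(Args),
    {reply, my_db:query(Conn, Query), State#{conn := Conn}};
handle_call({query, Query}, _From, State = #{conn := Conn}) ->
    %% Subsequent calls reuse the stored connection.
    {reply, my_db:query(Conn, Query), State}.
```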

Another solution that avoids crashing the worker on init is to return {continue, Term} from init/1 and do the risky part in handle_continue/2 in your worker.
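
A minimal sketch of that variant (hypothetical my_db module again); handle_continue/2 runs right after init/1 returns, before any other message is processed:

```erlang
init(Args) ->
    %% Return immediately; the connection attempt happens in
    %% handle_continue/2, after init/1 but before any other message.
    {ok, #{args => Args, conn => undefined}, {continue, connect}}.

handle_continue(connect, State = #{args := Args}) ->
    case my_db:connect(Args) of
        {ok, Conn} -> {noreply, State#{conn := Conn}};
        {error, Reason} -> {stop, Reason, State}
    end.
```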

I don't know if this is even helpful since I'm replying after such a long time.

alex88 commented Mar 16, 2021

> @alex88 I was having a similar problem while maintaining a connection pool. I found a solution which is not perfect but does the job: you can initialize a worker without making the gRPC connection. When the first call is made to the worker, you make the connection, store it in the worker's state, and use that connection for the rest of the calls.
>
> Another solution that avoids crashing the worker on init is to return {continue, Term} from init/1 and do the risky part in handle_continue/2 in your worker.
>
> I don't know if this is even helpful since I'm replying after such a long time.

Thanks for checking in! We've ended up using a single connection instead of a pool, and that works fine for now. I'll definitely check out your solution if we decide to give it a try again.

juhlig commented Mar 17, 2021

> Another solution that avoids crashing the worker on init is to return {continue, Term} from init/1 and do the risky part in handle_continue/2 in your worker.

Or use proc_lib:start_link in your worker's start_link, then proc_lib:init_ack and gen_server:enter_loop (or gen_statem:enter_loop, whichever your worker is based on) in init, to similar effect.
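
Roughly like this (a sketch, with my_db once more standing in for the real client):

```erlang
start_link(Args) ->
    proc_lib:start_link(?MODULE, init, [Args]).

init(Args) ->
    %% Acknowledge startup first, so the caller (the pool) is not
    %% blocked by a slow or failing connect...
    proc_lib:init_ack({ok, self()}),
    %% ...then do the crash-prone part; a failure here kills only
    %% this worker, not the pool...
    {ok, Conn} = my_db:connect(Args),
    %% ...and finally become a regular gen_server.
    gen_server:enter_loop(?MODULE, [], #{conn => Conn}).
```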
