
Plug.Cowboy.Drainer shut down abnormally #92

Open · dantswain opened this issue Feb 22, 2023 · 12 comments

@dantswain

Hi!

My app has been shutting down seemingly at random. This only happens on my local machine (not on other devs' machines, and not in production environments), so I suspect it has something to do with my setup.

There are no logs when my app shuts down, and it doesn't seem to correlate with any particular behavior. It's been happening for at least a couple of weeks of regular daily development.

I finally got a clue by enabling SASL logging:

[error] Child Plug.Cowboy.Drainer of Supervisor MyApp.Endpoint shut down abnormally
** (exit) killed
Pid: #PID<0.1011.0>
Start Call: Plug.Cowboy.Drainer.start_link([refs: [MyAppWeb.Endpoint.HTTP]])
Restart: :permanent
Shutdown: 5000
Type: :worker

[notice] Application my_app exited: shutdown

Based on looking at what Plug.Cowboy.Drainer does, it seems like this error could be a red herring. Any ideas how I could get more diagnostic information? I strongly suspect this is a memory limit, exhausted file descriptors, or something else specific to my environment, but I can't seem to get any more clues.

I'm using plug_cowboy 2.6.0 with Phoenix 1.7.0-rc.2 on Elixir 1.14.3 / OTP 25.

@josevalim (Member)

Can you try running with config :logger, handle_sasl_reports: true? My suspicion is that another child in your supervision tree is crashing, causing the tree to restart. During the restart, the supervisor has to shut down the current children. Cowboy cannot drain every connection within the 5000ms shutdown window (probably because a browser is holding one open), so the drainer is killed after those 5 seconds.
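For reference, that flag goes in the Logger application's config; a minimal sketch, assuming the project keeps dev settings in config/dev.exs:

```elixir
# config/dev.exs (file location is an assumption)
import Config

# Forward OTP/SASL reports (supervisor reports, crash reports) to Logger,
# so abnormal child shutdowns like the drainer's show up in the logs.
config :logger, handle_sasl_reports: true
```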

@dantswain (Author)

This is with config :logger, handle_sasl_reports: true 😬

iex(4)> Application.get_env(:logger, :handle_sasl_reports)
true

In the most recent crash, the log message just before this was [info] Sent 200 in 36ms. Open to trying anything else!

@josevalim (Member)

Hrm, that's very weird then! What could be killing it? Maybe you can play with Erlang's dbg or tracer and trace the messages between processes?
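A minimal sketch of what that might look like with OTP's :dbg, run from IEx; finding the drainer via the endpoint supervisor named in the error log is an assumption about this app's tree:

```elixir
# Locate the drainer among the endpoint supervisor's children
# (the error log above names it as a child of the endpoint supervisor).
{Plug.Cowboy.Drainer, drainer_pid, :worker, _mods} =
  MyAppWeb.Endpoint
  |> Supervisor.which_children()
  |> Enum.find(&match?({Plug.Cowboy.Drainer, _, _, _}, &1))

:dbg.tracer()                               # default tracer prints to the shell
:dbg.p(drainer_pid, [:m, :running, :procs]) # messages, scheduling, process events
```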

@dantswain (Author)

I agree!

My hunch is that I'm running out of file descriptors or something at a system level? I don't know how else it would be so quiet...

I've attached a tracer to my drainer process, so maybe that will show something. I don't think I can get away with tracing everything, since it can be hours between crashes and I need to be able to work in the meantime 😓
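For a trace that has to run for hours, writing events to a file instead of the shell may help; a sketch using :dbg's file trace port (the filename is illustrative, and drainer_pid is the pid found in the earlier sketch):

```elixir
# Send trace events to a binary trace file rather than the shell;
# the filename is illustrative.
:dbg.tracer(:port, :dbg.trace_port(:file, ~c"drainer.trace"))
:dbg.p(drainer_pid, [:m, :running, :procs])

# After a crash, the file can be replayed into the shell:
# :dbg.trace_client(:file, ~c"drainer.trace")
```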

@dantswain (Author)

OK that got a little more info!

(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83170,904349})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83170,904389})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181956>, [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,83170,904398})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83170,904448})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83171,905187})
(<0.763.0>) << timeout (Timestamp: {1677,83171,905236})
(<0.763.0>) <0.661.0> ! {'$gen_call',{<0.763.0>,#Ref<0.23370919.4183031816.181957>},count_children} (Timestamp: {1677,83171,905324})
(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83171,905357})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83171,905403})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181957>, [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,83171,905417})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83171,905467})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83172,906176})
(<0.763.0>) << timeout (Timestamp: {1677,83172,906224})
(<0.763.0>) <0.661.0> ! {'$gen_call',{<0.763.0>,#Ref<0.23370919.4183031816.181958>},count_children} (Timestamp: {1677,83172,906282})
(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83172,906313})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83172,906356})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181958>, [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,83172,906368})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83172,906414})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83172,911083})
(<0.763.0>) exit killed (Timestamp: {1677,83172,911114})
(<0.763.0>) out_exited 0 (Timestamp: {1677,83172,911131})
[error] Child Plug.Cowboy.Drainer of Supervisor MyAppWeb.Endpoint shut down abnormally
** (exit) killed
Pid: #PID<0.763.0>
Start Call: Plug.Cowboy.Drainer.start_link([refs: [MyAppWeb.Endpoint.HTTP]])
Restart: :permanent
Shutdown: 5000
Type: :worker
[notice] Application my_app exited: shutdown

There's more before that, but the pattern repeats. I know pids aren't necessarily stable across restarts, but if I restart the application and the drainer gets the same pid (<0.763.0>), then the process on the other end of those calls, <0.661.0>, appears to be ranch_conns_sup.
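For anyone reproducing this, one way to check what a pid like <0.661.0> is from IEx (a sketch; pid/3 is an IEx helper, and proc_lib-started processes such as supervisors record their start module under :"$initial_call" in the process dictionary):

```elixir
# Build the pid from its printed parts and inspect it.
target = pid(0, 661, 0)

# :registered_name shows a name if the process has one; the process
# dictionary of proc_lib-based processes reveals the starting module.
Process.info(target, [:registered_name, :dictionary])
```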

@josevalim (Member)

Hrm, this is weird because ranch has no idea about this process unless we are shutting it down!

@dantswain (Author)

Am I reading the trace correctly that the drainer is sending {'$gen_call', {...}, count_children} to <0.661.0>, or is it the other way around? It looks like the drainer could be doing that here: https://github.com/elixir-plug/plug_cowboy/blob/master/lib/plug/cowboy/drainer.ex#L103-L105
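For context, the polling visible in the trace is consistent with a count-and-sleep loop like the one below; this is a paraphrase of the observed behavior, not the literal plug_cowboy/ranch source:

```elixir
defmodule DrainSketch do
  # Paraphrase of the polling behavior visible in the trace above.
  def wait_until_drained(conns_sup, check_interval) do
    # count_children is the '$gen_call' the trace shows going to <0.661.0>.
    %{active: active} = Supervisor.count_children(conns_sup)

    if active > 0 do
      # timer:sleep/1 is the out/in {timer,sleep,1} pair in the trace.
      Process.sleep(check_interval)
      wait_until_drained(conns_sup, check_interval)
    end
  end
end
```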

@josevalim (Member)

Right, but the point is that we should not be in that part of the code unless the drainer has started to terminate!

@dantswain (Author)

I'm trying to catch it in the act, but I think it might be getting triggered by or during the automatic rebuild cycle. Does that ring any bells?

Is the trace saying that the call is timing out? What might cause it to time out? That might hint at the root problem.

I'll keep digging here, but it's sort of hard because it's very distracting compared to the mild inconvenience of having to restart manually every so often 😂 Definitely open to ideas of things to try.

@dantswain (Author)

OK, I think I've actually tracked my problem down to another source: joken-elixir/joken_jwks#40

I'm wondering, though, if what I've seen here might hint at a problem in plug_cowboy? The traces make it look like a timeout is happening and then the drainer exits abnormally.

@josevalim (Member)

Right, that was my point from the beginning. This is definitely a secondary failure: something is making your app fail first, but I am not sure why you can't see that first failure! I'm also not sure there is a bug here; for all intents and purposes, the shutdown/kill is happening for correct reasons.

@einariii

I encountered this issue when a project didn't have the .env file used in the original dev environment.

NB: this happened immediately when running mix phx.server, unlike the OP, who seemed to see this after deployment.
