This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Server congestion #2109

Closed
seanlinsley opened this issue Mar 2, 2014 · 10 comments

Comments

@seanlinsley
Contributor

Lately we've been experiencing quite a few incidents where the server becomes so congested that it's completely unresponsive.

Today at 1 PM Central Time: (IRC)

[screenshot: 2014-03-02 at 1:48:38 PM]

Today at 5 PM Central Time: (IRC)

[screenshot: 2014-03-02 at 5:21:55 PM]

In both cases we had to restart the server for it to get back to normal. We need to get to the bottom of this. What's causing the slowdown? Is there any insight to be gained from the log of pages accessed?

We should have a system to automatically alert us when the site is having such troubles.

@chadwhitacre
Contributor

> We should have a system to automatically alert us when the site is having such troubles.

Pretty sure that's Pagerduty (#2072).

@seanlinsley
Contributor Author

Yeah, @patcon mentioned the details shortly after in IRC.

@chadwhitacre
Contributor

I've marked this DevX ★.

@patcon
Contributor

patcon commented Mar 3, 2014

"Site unavailable" notifications with Pagerduty are simple, as we're converting UptimeRobot alert emails into Pagerduty notifications. We just need to work with @whit537 and @zwn to get them properly set up and spread the on-call schedules fairly (i.e. #2072).

As for more advanced alerting on certain conditions or log messages, that requires new tooling (graylog2? some Heroku add-on for log monitoring? newrelic? etc.). We would still route any other services' monitoring through the consistent Pagerduty alerting system. Let's make this issue about that.

I've added a Todo section to the OP.

@patcon patcon self-assigned this Mar 3, 2014
@zwn
Contributor

zwn commented Mar 3, 2014

Not writing to the database on each and every request for signed-in users (even for static files, even for 304s) could help us move further away from the congestion issue (see #2041). The query that updates session_expires is the 9th most frequently called query; since the two most frequent are BEGIN and COMMIT, and the 8th is SET client_encoding TO 'UTF8', it's really more like 6th when counting only the meaningful queries.
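One way to cut that write traffic (a minimal sketch, not Gittip's actual code; the `db` interface, table, and column names here are assumptions) is to refresh session_expires lazily: only issue the UPDATE when the stored expiry has drifted by more than some interval, so a burst of requests from one user produces at most one write per interval.

```python
# Hypothetical sketch of a rate-limited session refresh. Assumes a
# `db.run(sql, params)` helper and a participants table with
# session_token / session_expires columns; neither is taken from
# the real codebase.
from datetime import datetime, timedelta, timezone

SESSION_TIMEOUT = timedelta(hours=6)   # how long a session stays valid
REFRESH_INTERVAL = timedelta(hours=1)  # minimum gap between DB writes

def maybe_refresh_session(db, session_token, session_expires):
    """Push session_expires forward, but only if the last write is stale.

    Returns True if a DB write was issued, False if it was skipped.
    """
    now = datetime.now(timezone.utc)
    # If the stored expiry was written within the last REFRESH_INTERVAL,
    # the remaining lifetime is still close to SESSION_TIMEOUT: skip.
    if session_expires - now > SESSION_TIMEOUT - REFRESH_INTERVAL:
        return False
    db.run(
        "UPDATE participants SET session_expires = %s WHERE session_token = %s",
        (now + SESSION_TIMEOUT, session_token),
    )
    return True
```

With a one-hour interval, a signed-in user hammering the site still causes at most one session write per hour instead of one per request, at the cost of sessions expiring up to an hour early relative to true last activity.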

@seanlinsley
Contributor Author

@zwn I assume you queried the production database to find the most common queries? Could you list them out somewhere?

@seanlinsley
Contributor Author

@patcon assigned this ticket to himself, apparently intending to set up proper alerting. I think that's already nicely encapsulated in #2072, so I'm un-assigning @patcon from this ticket and removing it from working, since no one has really started on the core issue here. (IRC)

@grampajoe

I strongly recommend NewRelic for diagnosing these issues. The free NewRelic Standard tier you get with Heroku gives you detailed error reports, alerting, event reports (downtime, high error rate, etc.), and transaction tracing.

@chadwhitacre
Contributor

IRC

@chadwhitacre chadwhitacre mentioned this issue Jul 7, 2014
@Changaco
Contributor

Changaco commented Jul 9, 2014

Closing. The specific issue of the server becoming completely unresponsive when the busy threads are maxed out has been fixed by #2384.

@Changaco Changaco closed this as completed Jul 9, 2014