This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Server congestion #2109

Closed
seanlinsley opened this issue Mar 2, 2014 · 10 comments

Comments

@seanlinsley
Contributor

Lately we've been experiencing quite a few incidents where the server becomes so congested that it's completely unresponsive.

Today at 1 PM Central Time: (IRC)

[screenshot: 2014-03-02 at 1:48:38 PM]

Today at 5 PM Central Time: (IRC)

[screenshot: 2014-03-02 at 5:21:55 PM]

In both cases we had to restart the server for it to get back to normal. We need to get to the bottom of this. What's causing the slowdown? Is there any insight to be gained from the log of pages accessed?

We should have a system to automatically alert us when the site is having such troubles.

@chadwhitacre
Contributor

> We should have a system to automatically alert us when the site is having such troubles.

Pretty sure that's Pagerduty (#2072).

@seanlinsley
Contributor Author

Yeah, @patcon mentioned the details shortly after in IRC.

@chadwhitacre
Contributor

I've marked this DevX ★.

@patcon
Contributor

patcon commented Mar 3, 2014

"Site unavailable" notifications with Pagerduty are simple, as we're converting UptimeRobot alert emails into Pagerduty notifications. We just need to work with @whit537 and @zwn to get them properly set up and spread the on-call schedules fairly (i.e. #2072).

As for more advanced alerting on certain conditions or log messages, that requires new tooling (graylog2? some Heroku add-on for log monitoring? newrelic? etc.). We would still route any other services' monitoring through the consistent Pagerduty alerting system. Let's make this issue about that.

I've added a Todo section to the OP.

@patcon patcon self-assigned this Mar 3, 2014
@zwn
Contributor

zwn commented Mar 3, 2014

Not writing to the database on each and every request for signed-in users (even for static files, even for 304s) could help us move further away from the congestion issue (see #2041). The query that updates session_expires is the 9th most frequently called query; since the two most frequent are BEGIN and COMMIT, and the 8th is SET client_encoding TO 'UTF8', it's really more like 6th when counting only the meaningful queries.
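One way to cut that write traffic (a minimal sketch, not Gittip's actual code; the `db` interface, table, and column names here are assumptions) is to refresh session_expires lazily: only issue the UPDATE when the stored expiry has drifted by more than some interval, so a burst of requests from one user produces at most one write per interval.

```python
# Hypothetical sketch of a rate-limited session refresh. Assumes a
# `db.run(sql, params)` helper and a participants table with
# session_token / session_expires columns; neither is taken from
# the real codebase.
from datetime import datetime, timedelta, timezone

SESSION_TIMEOUT = timedelta(hours=6)   # how long a session stays valid
REFRESH_INTERVAL = timedelta(hours=1)  # minimum gap between DB writes

def maybe_refresh_session(db, session_token, session_expires):
    """Push session_expires forward, but only if the last write is stale.

    Returns True if a DB write was issued, False if it was skipped.
    """
    now = datetime.now(timezone.utc)
    # If the stored expiry was written within the last REFRESH_INTERVAL,
    # the remaining lifetime is still close to SESSION_TIMEOUT: skip.
    if session_expires - now > SESSION_TIMEOUT - REFRESH_INTERVAL:
        return False
    db.run(
        "UPDATE participants SET session_expires = %s WHERE session_token = %s",
        (now + SESSION_TIMEOUT, session_token),
    )
    return True
```

With a one-hour interval, a signed-in user hammering the site still causes at most one session write per hour instead of one per request, at the cost of sessions expiring up to an hour early relative to true last activity.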

@seanlinsley
Contributor Author

@zwn I assume you queried the production database to find the most common queries? Could you list them out somewhere?

@seanlinsley
Contributor Author

@patcon assigned this ticket to himself, apparently intending to set up proper alerting. I think that's already nicely encapsulated in #2072, so I'm un-assigning @patcon from this ticket and removing it from working, since no one has really started on the core issue here. (IRC)

@grampajoe

I strongly recommend NewRelic for diagnosing these issues. The free NewRelic Standard tier you get with Heroku gives you detailed error reports, alerting, event reports (downtime, high error rate, etc.), and transaction tracing.

@chadwhitacre
Contributor

IRC

@chadwhitacre chadwhitacre mentioned this issue Jul 7, 2014
@Changaco
Contributor

Changaco commented Jul 9, 2014

Closing. The specific issue of the server becoming completely unresponsive when the busy threads are maxed out has been fixed by #2384.

@Changaco Changaco closed this as completed Jul 9, 2014