cpuUsageThrottle #1460
Changes from 15 commits
@@ -260,8 +260,14 @@ event, e.g., `server.on('after', plugins.metrics());`:

The module includes the following plugins to be used with restify's `pre` event:

* `inflightRequestThrottle(options)` - limits the max number of inflight requests
  * `options.limit` {Number} The maximum number of inflight requests the server will handle before returning an error
  * `options.err` {Error} A restify error used as a response when the inflight request limit is exceeded
  * `options.server` {Object} The restify server that this module will throttle
* `cpuUsageThrottle(options)` - rejects requests based on the server's current CPU usage
  * `options.limit` {Number} The point at which restify will begin rejecting a % of all requests at the front door
  * `options.max` {Number} The point at which restify will reject 100% of all requests at the front door
  * `options.interval` {Number} How frequently to recalculate the % of traffic to reject
  * `options.halfLife` {Number} How responsive the plugin is to spikes in CPU usage; for more details, see the cpuUsageThrottle section below
  * `options.err` {Error} A restify error used as a response when the CPU usage limit is exceeded

## QueryParser
@@ -525,6 +531,46 @@ requests. It defaults to `503 ServiceUnavailableError`.

This plugin should be registered as early as possible in the middleware stack
using `pre` to avoid performing unnecessary work.

## CPU Usage Throttling
```js
var restify = require('restify');
var errors = require('restify-errors');

var server = restify.createServer();

var options = {
    limit: 0.75,
    max: 1,
    interval: 250,
    halfLife: 500,
    // Note: the implementation in this PR asserts that opts.err is an error
    // constructor (it calls `new opts.err()`), not an error instance
    err: errors.InternalServerError
};

server.pre(restify.plugins.cpuUsageThrottle(options));
```
cpuUsageThrottle is a middleware that rejects a variable number of requests (between 0% and 100%) based on a historical view of the CPU utilization of a Node.js process. Essentially, this plugin allows you to define what constitutes a saturated Node.js process via CPU utilization, and it will handle dropping a % of requests based on that definition. This is useful when you would like to keep CPU-bound tasks from piling up and causing increased per-request latency.
The algorithm asks you for a maximum CPU utilization rate, which it uses to determine at what point it should be rejecting 100% of traffic. For a normal Node.js service this is 1, since Node is single threaded. It uses this, paired with a limit that you provide, to determine the total % of traffic it should be rejecting. For example, if you specify a limit of .5 and a max of 1, and the current EWMA (next paragraph) value reads .75, this plugin will reject approximately 50% of all requests.
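The proportion calculation in the example above can be sketched in a few lines of JavaScript. This is an illustrative snippet, not the plugin's actual source; the function name `rejectFraction` is hypothetical, and the clamping to [0, 1] is made explicit here for clarity:

```js
// Fraction of traffic to reject, given a limit, a max, and the current
// EWMA of CPU utilization: (ewma - limit) / (max - limit), clamped to
// [0, 1] so readings below the limit reject nothing.
function rejectFraction(limit, max, ewma) {
    var raw = (ewma - limit) / (max - limit);
    return Math.min(Math.max(raw, 0), 1);
}

// The example above: limit .5, max 1, current EWMA .75 => reject ~50%
console.log(rejectFraction(0.5, 1, 0.75)); // 0.5
```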
When looking at the process' CPU usage, this algorithm will take a load average over a user-specified interval. This is similar to Linux's top. For example, given an interval of 250ms, this plugin will attempt to record the average CPU utilization over 250ms intervals. Due to contention for resources, the duration of each average may be wider or narrower than 250ms. To compensate for this, we use an exponentially weighted moving average. The EWMA algorithm is provided by the ewma module. The parameter for configuring the EWMA is halfLife. This value controls how quickly each load average measurement decays to half its value when being represented in the current average. For example, if you have an interval of 250 and a halfLife of 250, you will take the previous EWMA value multiplied by .5 and add it to the new CPU utilization average measurement multiplied by .5; the previous value and the new measurement would each represent 50% of the new value. A good way of thinking about the halfLife is in terms of how responsive this plugin will be to spikes in CPU utilization. The higher the halfLife, the longer CPU utilization will have to remain above your defined limit before this plugin begins rejecting requests and, conversely, the longer it will have to drop below your limit before the plugin begins accepting requests again. This is a knob you will want to play with when trying to determine the ideal value for your use case.
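The halfLife arithmetic described above can be illustrated with a simplified update step. This sketch assumes evenly spaced samples (the real ewma module also compensates for irregular sample spacing):

```js
// One EWMA update: the old value decays by 2^(-elapsed / halfLife), and
// the new sample supplies the remaining weight. When elapsed === halfLife,
// the decay factor is 0.5, so old value and new sample each contribute 50%.
function ewmaUpdate(prev, sample, elapsedMs, halfLifeMs) {
    var decay = Math.pow(2, -elapsedMs / halfLifeMs);
    return decay * prev + (1 - decay) * sample;
}

// interval 250ms, halfLife 250ms: previous value .2, new sample .8
console.log(ewmaUpdate(0.2, 0.8, 250, 250)); // 0.5
```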
> **Review:** Love the new summary. Instead of saying
>
> **Reply:** It's actually not using linux top's formula. Used the word
> "similar" here since it isn't the same, but it gives people a point of
> reference when trying to understand the metric. The actual algorithm used is:
> https://stackoverflow.com/questions/16726779/how-do-i-get-the-total-cpu-usage-of-an-application-from-proc-pid-stat/16736599#16736599
> I could link to it, but if someone is diving that deep, I would suspect they
> would refer to the
For a better understanding of the EWMA algorithm, refer to the documentation for the ewma module.
Params:
* `limit` - The point at which restify will begin rejecting a % of all requests at the front door. This value is a percentage like you would see when running `top` on linux. For example 0.8 === 80%.
* `max` - The point at which restify will reject 100% of all requests at the front door. This is used in conjunction with limit to determine what % of traffic restify needs to reject when attempting to bring the average load back within tolerable thresholds. Since Node.js is single threaded, the default for this is 1. In some rare cases, a Node.js process can exceed 100% CPU usage and you will want to update this value.
* `interval` - How frequently to check if we should be rejecting/accepting connections. It's best to keep this value as low as possible without creating a significant impact on performance, in order to keep your server as responsive to change as possible.
* `halfLife` - When we sample the CPU usage on an interval, we create a series of data points. We take these points and calculate a moving average. The halfLife indicates how quickly a point "decays" to half its value in the moving average. The lower the halfLife, the more impact newer data points have on the average. If you want to be extremely responsive to spikes in CPU usage, set this to a lower value. If you want your process to put more emphasis on recent historical CPU usage when determining whether it should shed load, set this to a higher value. The unit is in ms. The default is to set this to the same value as interval.
* `err` - A restify error used as a response when the CPU usage limit is exceeded
You can also update the plugin during runtime using the `.update()` function. This function accepts the same `opts` object as the constructor.
```js
var plugin = restify.plugins.cpuUsageThrottle(options);
server.pre(plugin);

plugin.update({ limit: 0.4, halfLife: 5000 });
```
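The returned middleware also exposes a `close()` method (see the implementation below) that clears the plugin's polling timer; without it, the continually re-armed `setTimeout` keeps the event loop alive during graceful shutdown. A minimal standalone sketch of that pattern, with a hypothetical `makePoller` standing in for the plugin:

```js
// A poller that re-arms a timer on every tick, like cpuUsageThrottle's
// updateReject() loop, plus a close() that lets the process exit.
function makePoller(intervalMs) {
    var state = { timeout: null, ticks: 0 };
    function tick() {
        state.ticks += 1;
        state.timeout = setTimeout(tick, intervalMs);
    }
    tick(); // kick off polling immediately
    return {
        state: state,
        close: function () { clearTimeout(state.timeout); }
    };
}

var poller = makePoller(10);
setTimeout(function () {
    poller.close(); // after this no timer remains, and Node can exit
}, 50);
```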
## Conditional Request Handler
@@ -0,0 +1,235 @@
'use strict';

var assert = require('assert-plus');
var pidusage = require('pidusage');
var errors = require('restify-errors');
var defaultResponse = errors.ServiceUnavailableError;
var EWMA = require('ewma');

/**
 * cpuUsageThrottle
 *
 * cpuUsageThrottle is a middleware that rejects a variable number of requests
 * (between 0% and 100%) based on a historical view of CPU utilization of a
 * Node.js process. Essentially, this plugin allows you to define what
 * constitutes a saturated Node.js process via CPU utilization and it will
 * handle dropping a % of requests based on that definition. This is useful
 * when you would like to keep CPU bound tasks from piling up and causing
 * increased per-request latency.
 *
 * The algorithm asks you for a maximum CPU utilization rate, which it uses to
 * determine at what point it should be rejecting 100% of traffic. For a
 * normal Node.js service, this is 1 since Node is single threaded. It uses
 * this, paired with a limit that you provide, to determine the total % of
 * traffic it should be rejecting. For example, if you specify a limit of .5
 * and a max of 1, and the current EWMA (next paragraph) value reads .75, this
 * plugin will reject approximately 50% of all requests.
 *
 * When looking at the process' CPU usage, this algorithm will take a load
 * average over a user-specified interval. This is similar to Linux's top. For
 * example, if given an interval of 250ms, this plugin will attempt to record
 * the average CPU utilization over 250ms intervals. Due to contention for
 * resources, the duration of each average may be wider or narrower than
 * 250ms. To compensate for this, we use an exponentially weighted moving
 * average. The EWMA algorithm is provided by the ewma module. The parameter
 * for configuring the EWMA is halfLife. This value controls how quickly each
 * load average measurement decays to half its value when being represented in
 * the current average. For example, if you have an interval of 250, and a
 * halfLife of 250, you will take the previous ewma value multiplied by .5 and
 * add it to the new CPU utilization average measurement multiplied by .5. The
 * previous value and the new measurement would each represent 50% of the new
 * value. A good way of thinking about the halfLife is in terms of how
 * responsive this plugin will be to spikes in CPU utilization. The higher the
 * halfLife, the longer CPU utilization will have to remain above your defined
 * limit before this plugin begins rejecting requests and, conversely, the
 * longer it will have to drop below your limit before the plugin begins
 * accepting requests again. This is a knob you will want to play with when
 * trying to determine the ideal value for your use case.
 *
 * For a better understanding of the EWMA algorithm, refer to the
 * documentation for the ewma module.
 *
 * @param {Object} opts Configure this plugin.
 * @param {Number} opts.limit The point at which restify will begin rejecting a
> **Review:** s/.8/0.8. I think you mean cpulimit? Have we tested that the cpu
> utilization stays constant if we add additional cores to an instance?
 * % of all requests at the front door. This value is a percentage like you
> **Review:** I think here it would be less confusing and more precise to just
> use the term
>
> **Reply:** Agreed this is confusing. I removed the reference to linux top
> for the reasons detailed in #1460 (comment)
 * would see when running `top` on linux. For example 0.8 === 80%.
 * @param {Number} [opts.max] The point at which restify will reject 100% of
 *    all requests at the front door. This is used in conjunction with limit
 *    to determine what % of traffic restify needs to reject when attempting
 *    to bring the average load back to the user requested values. Since
 *    Node.js is single threaded, the default for this is 1. In some rare
 *    cases, a Node.js process can exceed 100% CPU usage and you will want to
 *    update this value.
 * @param {Number} [opts.interval] How frequently to check if we should be
 *    rejecting/accepting connections. It's best to keep this value as low as
 *    possible without creating a significant impact on performance in order
 *    to keep your server as responsive to change as possible. Defaults to 250.
 * @param {Number} [opts.halfLife] When we sample the CPU usage on an interval,
 *    we create a series of data points. We take these points and calculate a
 *    moving average. The halfLife indicates how quickly a point "decays" to
 *    half its value in the moving average. The lower the halfLife, the more
 *    impact newer data points have on the average. If you want to be
 *    extremely responsive to spikes in CPU usage, set this to a lower value.
 *    If you want your process to put more emphasis on recent historical CPU
 *    usage when determining whether it should shed load, set this to a higher
 *    value. The unit is in ms. The default is to set this to the same value
 *    as interval.
 * @param {Error} [opts.err] A restify error used as a response when the
 *    cpu usage limit is exceeded, this should be created with
 *    require('restify-errors').makeConstructor.
 * @returns {Function} middleware to be registered on server.pre
 */
function cpuUsageThrottle (opts) {

    // Scrub input and populate our configuration
    assert.object(opts, 'opts');
    assert.number(opts.limit, 'opts.limit');
> **Review:** Thoughts around making all of these configuration parameters
> optional? That is, have a set of sane defaults so folks can just plug and
> play?
>
> **Reply:** I'm not confident we can have a sane default here. A server that
> can handle 2RPS binding large data sets to large templates may want a very
> different limit than a server that does 8k RPS validating auth for login. I
> may be surprised.
>
> **Reply:** After talking it through, we settled on a default of .75 for
> right now. We will open a PR later if we find a better default.
    assert.optionalNumber(opts.max, 'opts.max');
    assert.optionalNumber(opts.interval, 'opts.interval');
    assert.optionalNumber(opts.halfLife, 'opts.halfLife');

    if (opts.err !== undefined && opts.err !== null) {
        assert.ok((new opts.err()) instanceof errors.HttpError,
            'opts.err must be an error');
        assert.optionalNumber(opts.err.statusCode, 'opts.err.statusCode');
    }

    var self = {};
    self._err = opts.err || defaultResponse;
    self._limit = opts.limit;
    self._max = opts.max || 1;
    self._interval = opts.interval || 250;
    self._halfLife = (typeof opts.halfLife === 'number') ?
        opts.halfLife : opts.interval;
    self._timeout = null;
    assert.ok(self._max > self._limit, 'limit must be less than max');

    self._ewma = new EWMA(self._halfLife);

> **Review:** 0_0 why am I not using this?

    // self._reject represents the % of traffic that we should reject at the
    // current point in time based on how much over our limit we are. This is
    // updated on an interval by updateReject().
    self._reject = 0;

    // updateReject should be called on an interval, it checks the average CPU
    // usage between two invocations of updateReject.
    function updateReject() {
        pidusage.stat(process.pid, function (e, stat) {
            // If we fail to stat for whatever reason, err on the side of
> **Review:** Should we at least log/emit something to state that getting the
> CPU utilization failed and we've stopped throttling?
            // accepting traffic, while failure probably indicates something
            // horribly wrong with the server, identifying this error case and
            // appropriately responding is way outside the scope of this plugin
> **Review:** Is it worth emitting an error in this case? Might be a way to
> hook up metrics for when the plugin is misbehaving for whatever reason.
>
> **Reply:** The plugin reads the metrics from either stdout or a file
> descriptor, the error we would get here would be the same as a read from the
> filesystem failing. Something horribly wrong is happening, and more than
> likely is showing up somewhere else that is in a better position to handle
> it. I'd be open to logging this, though I don't believe we have access to
> the log object at this point. We would need to either accept it in the
> constructor or cache it on the first request.
>
> **Reply:** After a conversation with @DonutEspresso, we settled on exposing
> this module's state through
> https://github.com/restify/node-restify/pull/1460/files#diff-01a30fd8ce166416dd691d3498c746c7R144
> This allows for tooling outside of this plugin/repo to check and
> log/monitor/etc. the behavior of this plugin. When it fails, we can check
            if (e) {
                // When cpu is NaN, we should never reject
                stat = { cpu: NaN };
            }

            // Divide by 100 to match linux's top format
            self._ewma.insert(stat.cpu / 100);
> **Review:** What happens when you insert NaN?
            self._cpu = self._ewma.value();
            // Update reject with the % of traffic we should be rejecting.
            // This is safe since max > limit so the denominator can never be
            // 0. If the current cpu usage is less than the limit, _reject
            // will be negative and we will never shed load
            self._reject =
                (self._cpu - self._limit) / (self._max - self._limit);
            self._timeout = setTimeout(updateReject, self._interval);
> **Review:** Under heavy load
>
> **Reply:** Yup, EWMA accounts for this :-)
>
> **Reply:** It is the reason for the crazy magic maths here:
> https://github.com/ReactiveSocket/ewma/blob/master/index.js#L55
>
> **Review:** Even tho ewma accounts for this in the algorithm, the interval
> is a parameter configured by the user, and we should strive to stick to said
> interval. I think it still makes sense to keep track of the delta here,
> since changes to the interval will affect the throttling algorithm.
>
> **Reply:** Agreed. That is what I'm trying to accomplish with this code,
> though there may be a better way. Since
>
> **Review:** What about just exposing the delta for now then (through EE or
> similar) and we can keep track of it as we first deploy? As we gather data
> it should inform us just how much of an impact there will be.
>
> **Review:** In other words, no need to consume it in a meaningful way now,
> but at least we have some visibility into what's going on. Although, now
> that I think about it - would something like this already be captured in the
> event loop metrics?
>
> **Reply:** Sorry, I think there's been some confusion. I am suggesting we
> keep track of how much time has elapsed so that we can update
>
> **Reply:** There could be latency introduced by the file descriptor read on
> Linux or spinning up the subprocess on windows (inside or
>
> **Reply:** Also adding it to the error context when we shed load
        });
    }

    // Kick off updating our _reject value
    updateReject();

    function onRequest (req, res, next) {
        // Check to see if this request gets rejected. Since, in updateReject,
        // we calculate a percentage of traffic we are planning to reject, we
        // can use Math.random() (which picks from a uniform distribution in
        // [0,1)) to give us a `self._reject`% chance of dropping any given
        // request. This is a stateless way to drop approximately
        // `self._reject`% of traffic.
        var draw = Math.random();
        if (draw >= self._reject) {
            return next(); // Don't reject this request
        }

        var err = new self._err({
            context: {
                plugin: 'cpuUsageThrottle',
                cpuUsage: self._cpu,
                limit: self._limit,
                max: self._max,
                reject: self._reject,
                halfLife: self._halfLife,
                interval: self._interval,
                draw: draw
            }
        });

        return next(err);
    }

    // Allow the app to clear the timeout for this plugin if necessary, without
    // this we would never be able to clear the event loop when letting Node
    // shut down gracefully
    function close () {
        clearTimeout(self._timeout);
    }
    onRequest.close = close;

    // Expose internal plugin state for introspection
    onRequest.state = self;
> **Review:** Are any of the exposed fields actually public? They all look
> like _ properties

    /**
     * cpuUsageThrottle.update
     *
     * Allow the plugin's configuration to be updated during runtime.
     *
     * @param {Object} opts The opts object for reconfiguring this plugin, it
     *    follows the same format as the constructor for this plugin.
     */
    onRequest.update = function update(newOpts) {
        assert.object(newOpts, 'newOpts');
> **Review:** For this function you might actually want to log/emit that we've
> updated the parameters, to ease debugging later on.
>
> **Reply:** Don't have access to the logger here :-\ I can start caching the
> logger on the first request though.
        assert.optionalNumber(newOpts.limit, 'newOpts.limit');
        assert.optionalNumber(newOpts.max, 'newOpts.max');
        assert.optionalNumber(newOpts.interval, 'newOpts.interval');
        assert.optionalNumber(newOpts.halfLife, 'newOpts.halfLife');

        if (newOpts.err !== undefined && newOpts.err !== null) {
            assert.ok((new newOpts.err()) instanceof errors.HttpError,
                'newOpts.err must be an error');
            assert.optionalNumber(newOpts.err.statusCode,
                'newOpts.err.statusCode');
            self._err = newOpts.err;
        }

        if (newOpts.limit !== undefined) {
            self._limit = newOpts.limit;
        }

        if (newOpts.max !== undefined) {
            self._max = newOpts.max;
        }

        if (newOpts.interval !== undefined) {
            self._interval = newOpts.interval;
        }

        if (newOpts.halfLife !== undefined) {
            self._halfLife = newOpts.halfLife;
            // update our ewma with the new halfLife, we use the previous known
            // state as the initial state for our new halfLife in lieu of
> **Review:** s/liue/lieu
            // having access to true historical data.
            self._ewma = new EWMA(self._halfLife, self._cpu);
        }

        // Ensure new values are still valid
        assert.ok(self._max > self._limit, 'limit must be less than max');

        // Update _reject with the new settings
        self._reject =
            (self._cpu - self._limit) / (self._max - self._limit);
    };

    return onRequest;
}

module.exports = cpuUsageThrottle;
> **Review:** For bonus points might want to link to the ewma paper, but if
> you're already doing that in the ewma module, then maybe not. I leave that up
> to you :)
>
> **Reply:** Done in the EWMA module 😄 the 🐰 🕳️ is deep 😉