
cpuUsageThrottle #1460

Merged 22 commits into master from the cpuUsageThrottle branch on Sep 8, 2017

Conversation

retrohacker (Member)

Pre-Submission Checklist

  • Opened an issue discussing these changes before opening the PR
  • Ran the linter and tests via make prepush
  • Included comprehensive and convincing tests for changes

Changes

Adds a CPU usage throttle which begins dropping requests at the front door when a process begins using more than the defined limit. This helps prevent unacceptable latencies from requests backing up.

@retrohacker (Member Author)

Failing tests are unrelated to this PR: https://github.com/nodesecurity/nsp/issues/185

server.pre(restify.plugins.cpuUsageThrottle(options));
```

Have restify reject a variable number of requests at the front door based on the process' current CPU usage. This plugin works by keeping a running average of the CPU usage over time and rejecting a percentage of requests based on the user specified CPU limit, maximum CPU limit, and the current value. This module will reject between 0% and 100% of all incomming requests when attempting to bring CPU usage back to a reasonable level. The method we use for calculating the CPU usage will work across platforms. This is an effective module for preventing high latencies caused by an over saturated restify server. This module is most effective when paired with an auto-scale policy at an orechestration layer.
Member

s/orechestration/orchestration

* we use for calculating the CPU usage will work across platforms. This is
* an effective module for preventing high latencies caused by an over saturated
* restify server. This module is most effective when paired with an auto-scale
* policy at an orechestration layer.
Member

s/orechestration/orchestration

assert.object(opts, 'opts');
assert.number(opts.limit, 'opts.limit');
assert.optionalNumber(opts.max, 'opts.max');
assert.number(opts.interval, 'opts.interval');
Member

Is it possible to have a sane default for this? Just wondering how we propose folks figure out what this value should be.

Member Author

Probably right... I keep coming back to 500ms, so recheck every 1/2 second. Though I'm not sure if that is aggressive enough.

Member Author

Since this is a super low overhead function and results in reading from a single file descriptor, I'm going to have it default to checking every 1/4 of a second.

* @param {Number} opts.limit The point at which restify will begin rejecting a
* % of all requests at the front door. This value is a percentage like you
* would see when running `top` on linux. For example .8 === 80%.
* @param {Number} opts.max The point at which restify will reject 100% of all
Member

Use [] to indicate optional options in your JSDoc
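For reference, a minimal sketch of the bracket notation being suggested, using the parameter names from this diff (the defaults shown are the values discussed elsewhere in this thread, not necessarily what the final code ships):

```
/**
 * @param {Object} opts Configure this plugin.
 * @param {Number} opts.limit CPU utilization at which we begin rejecting requests.
 * @param {Number} [opts.max=1] Optional: CPU utilization at which we reject 100% of requests.
 * @param {Number} [opts.interval=250] Optional: how often, in ms, to re-sample CPU usage.
 */
```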

// If we fail to stat for whatever reason, err on the side of
// accepting traffic, while failure probably indicates something
// horribly wrong with the server, identifying this error case and
// appropriately responding is way outside the scope of this plugin
Member

Is it worth emitting an error in this case? Might be a way to hook up metrics for when the plugin is misbehaving for whatever reason.

Member Author

The plugin reads the metrics from either stdout or a file descriptor, so the error we would get here would be the same as a filesystem read failing. Something horribly wrong is happening, and more than likely it is showing up somewhere else that is in a better position to handle it. I'd be open to logging this, though I don't believe we have access to the log object at this point. We would need to either accept it in the constructor or cache it on the first request.

Member Author

After a conversation with @DonutEspresso, we settled on exposing this module's state through https://github.com/restify/node-restify/pull/1460/files#diff-01a30fd8ce166416dd691d3498c746c7R144

This allows tooling outside of this plugin/repo to check and log/monitor/etc. the behavior of this plugin. When it fails, we can check cpu === NaN in that module and log it out.

// negative and we will never shed load
self._reject =
(self._cpu - self._limit) / (self._max - self._limit);
self._timeout = setTimeout(updateReject, self._interval);
Member

Under heavy load updateReject() might execute later than you'd like it to - is it valuable to incorporate that delta into the algo?

Member Author

Yup, EWMA accounts for this :-)

Member Author

It is the reason for the crazy magic maths here: https://github.com/ReactiveSocket/ewma/blob/master/index.js#L55

Member

Even tho ewma accounts for this in the algorithm, the interval is a parameter configured by the user, and we should strive to stick to said interval. I think it still makes sense to keep track of the delta here, since changes to the interval will affect the throttling algorithm.

Member Author

Even tho ewma accounts for this in the algorithm, the interval is a parameter configured by the user, and we should strive to stick to said interval.

Agreed. That is what I'm trying to accomplish with this code, though there may be a better way. Since pid.stat is async and its performance is not necessarily bound to process saturation, it's possible the first call to it could take a few minutes to return. With setInterval we have the potential to queue a whole bunch of pid.stat calls up while that first one blocks. Then it ends up being a race to burn through that queue. This tries to ensure we call pid.stat exactly once at a time and that the time between it returning back to us and us calling it again is as close to the interval the user requested as possible.
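A minimal sketch of that pattern, assuming the pidusage and ewma modules used in this PR (`self` stands for the plugin's internal state as in the diff): the next check is armed only after the current pidusage.stat call returns, so at most one call is ever in flight.

```
function updateReject() {
    pidusage.stat(process.pid, function (err, stat) {
        if (!err) {
            self._ewma.insert(stat.cpu / 100);
            self._cpu = self._ewma.value();
            self._reject =
                (self._cpu - self._limit) / (self._max - self._limit);
        }
        // Re-arm only once this sample has completed, so a slow read can
        // never queue additional pidusage.stat calls behind it.
        self._timeout = setTimeout(updateReject, self._interval);
    });
}
```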

Member

What about just exposing the delta for now then (through EE or similar) and we can keep track of it as we first deploy? As we gather data it should inform us just how much of an impact there will be.

Member

In other words, no need to consume it in a meaningful way now, but at least we have some visibility into what's going on. Although, now that I think about it - would something like this already be captured in the event loop metrics?

Member

Sorry, I think there's been some confusion. I am suggesting we keep track of how much time has elapsed so that we can update interval appropriately. e.g. if the interval is 500ms, but it took us 300ms to go through this current interval, we should then set timeout to 200ms, so that we're ensuring we're firing the interval every 500ms. Otherwise in this example we would be running the next interval 800ms since the last interval.
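A sketch of the compensation being suggested here: measure how long the sample took and subtract it from the configured interval (floored at zero) before re-arming the timer, so the checks keep firing roughly every interval. Names follow the diff; the snippet is illustrative, not the PR's actual code.

```
function updateReject() {
    var start = Date.now();
    pidusage.stat(process.pid, function (err, stat) {
        // ... update self._cpu and self._reject as above ...
        var elapsed = Date.now() - start;
        // e.g. interval 500ms, sample took 300ms -> wait only 200ms,
        // so the next check still lands ~500ms after the last one.
        var wait = Math.max(self._interval - elapsed, 0);
        self._timeout = setTimeout(updateReject, wait);
    });
}
```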

Member Author

There could be latency introduced by the file descriptor read on Linux or spinning up the subprocess on Windows (inside of pidusage.stat). I'll export it as a value off of state.

Member Author

Also adding it to the error context when we shed load


function onRequest (req, res, next) {
// Check to see if this request gets rejected
if (self._reject > Math.random()) {
Member

Some comments around Math.random() and why it exists would be helpful
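A sketch of the kind of comment being asked for: `self._reject` is the fraction of traffic to shed, and comparing it against a uniform draw in [0, 1) drops roughly that fraction of requests without keeping any per-request state (the `return next(self._err)` form follows the suggestion later in this thread).

```
function onRequest(req, res, next) {
    // self._reject is the fraction of traffic we want to shed (0..1).
    // Math.random() is uniform on [0, 1), so this branch is taken for
    // roughly (self._reject * 100)% of requests: stateless,
    // probabilistic load shedding.
    if (self._reject > Math.random()) {
        return next(self._err);
    }
    return next();
}
```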

cpuUsage: self._cpu,
limit: self._limit
}, 'maximum cpu usage exceeded, rejecting request');
res.send(self._err);
Member

When working with errors, let's use `return next(self._err)`

Member

This is so the err events emitted by server will work correctly, but should also make your tests a little easier so you don't have to mock res.send

Member Author

Oh cool, didn't know next handled sending when an error was passed in! https://github.com/restify/node-restify/blob/master/lib/server.js#L925-L926

var MR = Math.random;
describe('cpuUsageThrottle', function () {

it('Setup: stub math.random', function (done) {
Member

This could be a before()

});
});

it('Teardown: Reset Math.random', function (done) {
Member

after()
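A sketch of the before/after shape being suggested (mocha; `MR` caches the original Math.random as in this test file, and the stubbed return value is illustrative):

```
var MR = Math.random;

describe('cpuUsageThrottle', function () {

    before(function () {
        // Make the probabilistic drop deterministic for the suite.
        Math.random = function () {
            return 0.5;
        };
    });

    after(function () {
        // Restore the real Math.random once the suite is done.
        Math.random = MR;
    });

    // ... tests ...
});
```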

@yunong yunong (Member) left a comment

Great work, super excited to have advanced throttling in restify. I have some comments around documentation, specifically making it easier for other folks to use and work on this plugin. :)

/**
* cpuUsageThrottle
*
* Place an upper limit on the CPU usage for this process. Once this limit is
Member

I'd recommend really fleshing out this summary a bit more. Think of this as an executive summary which specifies the throttle algorithm and which metrics and algorithms are used. Specifically it helps to either link to documentation or be more crisp about some of the terms.

e.g. instead of using the method specify that it's EWMA, with a link to docs around EWMA, and why EWMA is helpful.

"The tighter the interval, the more lenient you should be with your upper limit": can you be more explicit and provide some examples? This reads a bit ambiguously: does tightening the interval mean increasing the sampling rate of CPU?

* with your upper limit.
*
* @param {Object} opts Configure this plugin.
* @param {Number} opts.limit The point at which restify will begin rejecting a
Member

s/.8/0.8. I think you mean cpulimit? Have we tested that the cpu utilization stays constant if we add additional cores to an instance?

* @param {Number} opts.limit The point at which restify will begin rejecting a
* % of all requests at the front door. This value is a percentage like you
* would see when running `top` on linux. For example .8 === 80%.
* @param {Number} opts.max The point at which restify will reject 100% of all
Member

It isn't clear what "point" refers to. Might consider changing the copy to read
"the maximum CPU utilization at which restify will reject all requests", or something where you explicitly point out it's about CPU utilization.

* bring the average load back within tolerable thresholds. Since Node.js is
* single threaded, the default for this is 1. In some rare cases, a Node.js
* process can exceed 100% CPU usage and you will want to update this value.
* @param {Number} opts.interval How frequently to check if we should be
Member

aside, do we know how expensive this check is? Might be worthwhile quantifying the cost to consumers as part of the docs.


// Scrub input and populate our configuration
assert.object(opts, 'opts');
assert.number(opts.limit, 'opts.limit');
Member

Thoughts around making all of these configuration parameters optional? That is, have a set of sane defaults so folks can just plug and play?

Member Author

I'm not confident we can have a sane default here. A server that can handle 2RPS binding large data sets to large templates may want a very different limit than a server that does 8k RPS validating auth for login. I may be surprised.

Member Author

After talking it through, we settled on a default of .75 for right now. We will open a PR later if we find a better default.

* average of the CPU usage over time and rejecting a percentage of requests
* based on the user specified CPU limit, maximum CPU limit, and the current
* value. This module will reject between 0% and 100% of all incomming requests
* when attempting to bring CPU usage back to a reasonable level. The method
Member

s/reasonable level/user defined limit

*
* Have restify reject a variable number of requests at the front door based on
* the process' current CPU usage. This plugin works by keeping a running
* average of the CPU usage over time and rejecting a percentage of requests
Member

Is it a running average? I thought it was ewma?

/**
* cpuUsageThrottle
*
* Have restify reject a variable number of requests at the front door based on
Member

Think of this as an executive summary. You may want to state the throttling algorithm, the metrics, and any additional algorithms this plugin is using. It's also helpful to talk about what the expected behaviour is, which is that the plugin strives to let the server process the maximum number of requests it can handle while purposely dropping all other requests. Otherwise we end up handling requests that have long since timed out and are dead, and never handling any real requests.

Might also be helpful to talk about the design and decisions that went into this plugin to help users understand and maintainers maintain the module going forward.

i.e. we don't talk about ewma here in the docs, but I think we absolutely should and also should quantify why ewma is used, and how sampling rate and threshold affect ewma and throttling.

// negative and we will never shed load
self._reject =
(self._cpu - self._limit) / (self._max - self._limit);
self._timeout = setTimeout(updateReject, self._interval);
Member

Even tho ewma accounts for this in the algorithm, the interval is a parameter configured by the user, and we should strive to stick to said interval. I think it still makes sense to keep track of the delta here, since changes to the interval will affect the throttling algorithm.

}

// If we are rejecting the request, log some metadata and return error
req.log.trace({
Member

Instead of logging trace here, you should just put these parameters in a context object on the error, and that way the audit log will automatically emit it. This has the additional benefit of consumers being able to listen to the 'after' event and emit a metric based on the fact that this got throttled.

Member Author

👍

Member Author

Actually, this module is misusing errors. I should be creating a new error on every rejected request instead of returning the same error. Will have to change that if I want to pass a context through 😊
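A sketch of what that could look like, assuming restify-errors is used for the error class (the class name and the context fields here are illustrative, not necessarily what the PR ships):

```
var errors = require('restify-errors');

function onRequest(req, res, next) {
    if (self._reject > Math.random()) {
        // A fresh error per rejected request, so each one carries its
        // own context for the audit log / 'after' listeners.
        var err = new errors.ServiceUnavailableError('load shed');
        err.context = {
            plugin: 'cpuUsageThrottle',
            cpuUsage: self._cpu,
            limit: self._limit,
            max: self._max,
            reject: self._reject
        };
        return next(err);
    }
    return next();
}
```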

@retrohacker (Member Author)

@yunong @DonutEspresso one more time?

self._timeout = null;
assert.ok(self._max > self._limit, 'limit must be less than max');

self._ewma = new EWMA(self._halfLife);
Member Author

0_0 why am I not using this?

@retrohacker (Member Author)

Rebased against master

}

// Divide by 100 to match linux's top format
self._ewma.insert(stat.cpu / 100);
Member

What happens when you insert NaN?
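One way to handle that, as a sketch: skip the insert when the sample is not a finite number, so a bad read cannot poison the moving average (once NaN is inserted, every subsequent EWMA value is NaN as well). This is illustrative, not the PR's resolution.

```
pidusage.stat(process.pid, function (err, stat) {
    if (err || !stat || !isFinite(stat.cpu)) {
        // Bad sample: leave the EWMA untouched rather than inserting NaN.
        self._timeout = setTimeout(updateReject, self._interval);
        return;
    }
    // Divide by 100 to match linux's top format
    self._ewma.insert(stat.cpu / 100);
    // ...
});
```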

onRequest.close = close;

// Expose internal plugin state for introspection
onRequest.state = self;
Member

Are any of the exposed fields actually public? They all look like _ properties
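For context, a sketch of the external introspection this enables (as described earlier in the thread); the underscore-prefixed fields are internal, which is the point of the question above:

```
var plugin = restify.plugins.cpuUsageThrottle(options);
server.pre(plugin);

// Periodically sample the plugin's state for metrics/monitoring.
setInterval(function () {
    console.log('cpu=%d reject=%d', plugin.state._cpu, plugin.state._reject);
}, 5000);
```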

@yunong yunong (Member) left a comment

Almost there 👍 Just a couple of points of feedback on documentation :)


The algorithm asks you for a maximum CPU utilization rate, which it uses to determine at what point it should be rejecting 100% of traffic. For a normal Node.js service, this is 1 since Node is single threaded. It uses this, paired with a limit that you provide to determine the total % of traffic it should be rejecting. For example, if you specify a limit of .5 and a max of 1, and the current EWMA (next paragraph) value reads .75, this plugin will reject approximately 50% of all requests.

When looking at the process' CPU usage, this algorithm will take a load average over a user specified interval. This is similar to Linux's top. For example, if given an interval of 250ms, this plugin will attempt to record the average CPU utilization over 250ms intervals. Due to contention for resources, the duration of each average may be wider or narrower than 250ms. To compensate for this, we use an exponentially weighted moving average. The EWMA algorithm is provided by the ewma module. The parameter for configuring the EWMA is halfLife. This value controls how quickly each load average measurement decays to half its value when being represented in the current average. For example, if you have an interval of 250 and a halfLife of 250, you will take the previous ewma value multiplied by .5 and add it to the new CPU utilization average measurement multiplied by .5. The previous value and the new measurement would each represent 50% of the new value. A good way of thinking about the halfLife is in terms of how responsive this plugin will be to spikes in CPU utilization. The higher the halfLife, the longer CPU utilization will have to remain above your defined limit before this plugin begins rejecting requests and, conversely, the longer it will have to drop below your limit before the plugin begins accepting requests again. This is a knob you will want to play with when trying to determine the ideal value for your use case.
Member

Love the new summary. Instead of saying linux's top, we could be more precise and link load average to this: http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Member Author

It's actually not using linux's top's formula. Used the word "similar" here since it isn't the same, but it gives people a point of reference when trying to understand the metric. The actual algorithm used is: https://stackoverflow.com/questions/16726779/how-do-i-get-the-total-cpu-usage-of-an-application-from-proc-pid-stat/16736599#16736599

I could link to it, but if someone is diving that deep, I would suspect they would refer to the pidusage documentation which does define the algorithm.

For a better understanding of the EWMA algorithm, refer to the documentation for the ewma module.
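To make the arithmetic in the summary above concrete, a small worked sketch using the example numbers from the doc text (limit 0.5, max 1, current EWMA 0.75, and halfLife equal to interval):

```
var limit = 0.5;
var max = 1;
var cpu = 0.75; // current EWMA of CPU utilization

// Fraction of requests to reject, clamped to [0, 1]:
var reject = Math.min(Math.max((cpu - limit) / (max - limit), 0), 1);
console.log(reject); // 0.5 -> roughly 50% of requests are dropped

// With halfLife === interval, each update weights old and new equally:
var previous = 0.6;
var sample = 0.9;
var next = 0.5 * previous + 0.5 * sample; // 0.75
```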

Params:
* `max` - The point at which restify will reject 100% of all requests at the front door. This is used in conjunction with limit to determine what % of traffic restify needs to reject when attempting to bring the average load back within tolerable thresholds. Since Node.js is single threaded, the default for this is 1. In some rare cases, a Node.js process can exceed 100% CPU usage and you will want to update this value.
Member

Where is limit defined?

Member Author

😊


Params:
* `max` - The point at which restify will reject 100% of all requests at the front door. This is used in conjunction with limit to determine what % of traffic restify needs to reject when attempting to bring the average load back within tolerable thresholds. Since Node.js is single threaded, the default for this is 1. In some rare cases, a Node.js process can exceed 100% CPU usage and you will want to update this value.
* `interval` - How frequently to check if we should be rejecting/accepting connections. It's best to keep this value as low as possible without creating a significant impact on performance in order to keep your server as responsive to change as possible.
Member

This might be clearer if it described how frequently we update the load average, which drives whether we should be throttling connections. Updating the load average is not cheap, and keeping this value as low as possible reduces the performance impact.

*
* @param {Object} opts Configure this plugin.
* @param {Number} opts.limit The point at which restify will begin rejecting a
* % of all requests at the front door. This value is a percentage like you
Member

I think here it would be less confusing and more precise to just use the term CPU utilization, and link to documentation: http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

Member Author

Agreed this is confusing. I removed the reference to linux top for the reasons detailed in #1460 (comment)

// negative and we will never shed load
self._reject =
(self._cpu - self._limit) / (self._max - self._limit);
self._timeout = setTimeout(updateReject, self._interval);
Member

Sorry, I think there's been some confusion. I am suggesting we keep track of how much time has elapsed so that we can update interval appropriately. e.g. if the interval is 500ms, but it took us 300ms to go through this current interval, we should then set timeout to 200ms, so that we're ensuring we're firing the interval every 500ms. Otherwise in this example we would be running the next interval 800ms since the last interval.

// [0,1)) to give us a `self._reject`% chance of dropping any given
// request. This is a stateless way to drop approximately `self._reject`%
// of traffic.
var draw = Math.random();
Member

maybe probabilityDraw?

* follows the same format as the constructor for this plugin.
*/
onRequest.update = function update(newOpts) {
assert.object(newOpts, 'newOpts');
Member

For this function you might actually want to log/emit that we've updated the parameters, to ease debugging later on.

Member Author

Don't have access to the logger here :-\

I can start caching the logger on the first request though.
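A sketch of that caching idea: grab req.log the first time a request passes through, then use it from update() when parameters change (purely illustrative):

```
var log = null;

function onRequest(req, res, next) {
    if (log === null && req.log) {
        // Cache the request logger the first time we see one so code
        // paths without a req (like update()) can still log.
        log = req.log;
    }
    // ... throttling logic ...
    return next();
}

onRequest.update = function update(newOpts) {
    // ... apply newOpts to the plugin state ...
    if (log) {
        log.info({ newOpts: newOpts }, 'cpuUsageThrottle: parameters updated');
    }
};
```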

server.pre(restify.plugins.cpuUsageThrottle(options));
```

cpuUsageThrottle is a middleware that rejects a variable number of requests (between 0% and 100%) based on a historical view of the CPU utilization of a Node.js process. Essentially, this plugin allows you to define what constitutes a saturated Node.js process via CPU utilization, and it will handle dropping a % of requests based on that definition. This is useful when you would like to keep CPU-bound tasks from piling up and causing increased per-request latency.
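For context, a sketch of what the `options` object referenced above might look like, using the parameter names and defaults discussed in this PR (the values are illustrative, not recommendations):

```
var options = {
    limit: 0.75,   // start shedding load at 75% CPU utilization
    max: 1,        // reject 100% of traffic at 100% CPU utilization
    interval: 250, // re-sample CPU usage every 250ms
    halfLife: 250  // decay rate for the EWMA of CPU utilization
};

server.pre(restify.plugins.cpuUsageThrottle(options));
```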
Member

For bonus points might want to link to the ewma paper, but if you're already doing that in the ewma module, then maybe not. I leave that up to you :)

Member Author

Done in the EWMA module 😄 the 🐰 🕳️ is deep 😉

@retrohacker (Member Author)

@DonutEspresso @yunong one more time?

if (newOpts.halfLife !== undefined) {
self._halfLife = newOpts.halfLife;
// update our ewma with the new halfLife, we use the previous known
// state as the initial state for our new halfLife in liue of
Member

s/liue/lieu

@DonutEspresso (Member)

Chatted offline re: removing the err option and just relying on restify's built-in server.on('restifyError'... capability to do error customization.
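A sketch of that built-in hook: a single restifyError listener on the server can observe or customize the errors this plugin emits, so the plugin itself does not need an err option (the error class checked here is an assumption):

```
server.on('restifyError', function (req, res, err, callback) {
    if (err.name === 'ServiceUnavailableError') {
        // Central place to customize/observe load-shedding errors.
        req.log.warn({ err: err }, 'request shed by cpuUsageThrottle');
    }
    return callback();
});
```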

@yunong yunong (Member) left a comment

🎉

@retrohacker retrohacker merged commit 84be679 into master Sep 8, 2017
@sean3z sean3z deleted the cpuUsageThrottle branch September 12, 2017 13:09