Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

AdmissionControl RequestSize AutoTuning #597

Merged
merged 3 commits into from
Jul 16, 2021
Merged

AdmissionControl RequestSize AutoTuning #597

merged 3 commits into from
Jul 16, 2021

Conversation

mitalawachat
Copy link
Contributor

@mitalawachat mitalawachat commented May 13, 2021

Fixes #: #579

Description of changes:
AutoTune AdmissionControl RequestSizeController based on heap occupancy (percent).
AdmissionControl currently has default values of of 85% JVMMemoryPressure threshold and 10% RequestSize threshold. As part of this change we are tuning RequestSize threshold based on how much heap is currently occupied. If we have more heap available then we expand the threshold for RequestSize controller and if less heap is available then we contract the RequestSize controller threshold. This tuning happens when heap occupancy (percent) range change is detected.

Tests:

  • Unit Test
  • Created cluster, ran load, verified that when heap range changes the flow-units are passed through chain (Rca -> ClusterRca -> Decider -> Publisher)

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

Copy link
Contributor

@khushbr khushbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level comments.
1/ Add as much details around the RCA, Decider and Action logic as possible. It is tricky for the reviewer to read the code if no overview is provided in PR and no javadoc in the code.
2/ Create an issue and add the scoping and requirement details there. It helps us keep track of changes going in the new release.
3/ Mention the reasoning for picking specific input metrics for the RCA.
4/ Add details on testing.
5/ Add detail on if this RCA, Decider and Action will be enabled or disabled by default.

@sruti1312
Copy link
Contributor

Can we also add Gauntlet test to test this from end to end?
https://github.com/opendistro-for-elasticsearch/performance-analyzer-rca/blob/main/docs/gauntlet.md

Comment on lines +74 to +75
"lower-bound": 0,
"upper-bound": 75,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the user changes this such that we have gap's in range and no threshold declared for that? Are we catching that?
Can we add a test case for this scenario?

One option might be is to have the thresholds as p75_threshold instead of range? But this makes the range not configurable on the fly then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add test around it. No action should be taken if there is gap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead if there is a gap here, can we detect it and fall back to the defaults in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gaps could be left intentionally. For example, we can configure autotune only when jvm reaches above 85%, and want to work with default admissioncontrol behavior for below usage.

admissionControllerRca.addAllUpstreams(Arrays.asList(acCurrent, acThreshold, acRejectionCount));
private AdmissionControlDecider buildAdmissionControlDecider(Metric heapUsed, Metric heapMax) {
AdmissionControlRca admissionControlRca = new AdmissionControlRca(RCA_PERIOD, heapUsed, heapMax);
admissionControlRca.addTag(TAG_LOCUS, LOCUS_DATA_MASTER_NODE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like admissionControlRca exists on both data and master node. Can we add the rca.conf changes to all the rca*.conf files in the pa_config directory? We have conf specific to master and idle_master

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. admissionControlRca should act only on data nodes. Will fix this.

// We mark increase/decrease pressure based on desired value
final ImpactVector impactVector = new ImpactVector();
if (desiredValue > currentValue) {
impactVector.increasesPressure(ADMISSION_CONTROL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we see indirect impact on HEAP because of this tuning. If yes, @khushbr Do you think we need to add them here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AdmissionControl would increase/decrease threshold based on available jvm, and will kick in only when node is under duress. For example, if JVM reaches 95% then the customer will see more 429 compared to when JVM was at 85%, eventually reducing node drops and 5xx.
Increase in heap pressure will be driven from customer workload.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you have a cool-off period when heap decreases to avoid flip-flops?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we have 5 mins cool-off period
protected static final long DEFAULT_COOL_OFF_PERIOD_IN_MILLIS = TimeUnit.MINUTES.toMillis(5);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree wtih Sruti. how are we going to define the impact of ADMISSION_CONTROL ? and what happen if let's say you have a different type of admission control action in the future and if they both have impact on ADMISSION_CONTROL, then one will be blocked by the other ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each controller of admissioncontrol works in its own perview. When we add new controllers in admissioncontrol they would work on their own thresholds. Ofcourse these should be validated when we add more admissioncontrol AutoTune

Comment on lines +332 to +335
private AdmissionControlDecider buildAdmissionControlDecider(Metric heapUsed, Metric heapMax) {
AdmissionControlRca admissionControlRca = new AdmissionControlRca(RCA_PERIOD, heapUsed, heapMax);
admissionControlRca.addTag(TAG_LOCUS, LOCUS_DATA_NODE);
admissionControlRca.addAllUpstreams(Arrays.asList(heapUsed, heapMax));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand why we changed the input metrics to AdmissionControlRca.
Previously we were using Admission Control current value, threshold and rejection count. Now, we are using only heap metrics.

What happens if the heap usage is spiky ? Even with coolOffPeriod, we can end up suggesting rejection threshold change frequently and drastically, where we jump from a lower heap-range bucket to a higher bucket. This will impact the heap stability and the wider customer experience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous metrics were unused. Will add new metrics as required.
Suggestion on avoiding large threshold diff on drastic heap range change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For frequency, we can extend the cool-off time period to a larger value and for flip-flop we check for delta change suggested ?

Again, I still have my doubts on the efficiency of the RCA to suggest Admission Control Threshold change just based off raw JVM metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We can change default cool-off period to 15 mins instead of 5 and later tune it as required based on case studies.
  • AdmissionControl decides it's default limits based on % of jvm, that's why only jvm is considered as initial input to rca. More metrics could be added as inputs to rca when we find such suitable metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed DEFAULT_COOL_OFF_PERIOD_IN_MILLIS to 15 mins in AdmissionControlAction

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature New feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants