initial changes for invoker cluster backpressure #3

tysonnorris · 2019-03-01T01:05:13Z

Description

Related issue and scope

I opened an issue to propose and discuss this change (#????)

My changes affect the following components

Types of changes

Bug fix (generally a non-breaking change which closes an issue).
Enhancement or new feature (adds new functionality).
Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

I signed an Apache CLA.
I reviewed the style guides and followed the recommendations (Travis CI will check :).
I added tests to cover my changes.
My changes require further changes to the documentation.
I updated the documentation where necessary.

chetanmeh

This would really address the dynamic capacity addition aspect! Do we have the change in MesosContainerFactory somewhere that triggers the NodeStatsUpdateEvent? That would help to see the complete picture.

It may be better to extract a ResourcePolicy abstraction to keep the policy aspect separate from the scheduling logic.

chetanmeh · 2019-03-01T09:56:44Z

core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/ContainerPool.scala

+  //if cluster managed resources, subscribe to events
+  if (poolConfig.clusterManagedResources) {
+    logging.info(this, "subscribing to NodeStats updates")
+    Events.subscribe(self, NodeStatsUpdateEvent)


Should it unsubscribe itself? Given it's kind of a singleton it's fine, but maybe for test cases we should unsubscribe.

tysonnorris · 2019-03-01

Looks like EventStream will handle unsubscribe on actor termination per the docs, but I will change this if it is a problem for tests.

chetanmeh · 2019-03-01T10:18:09Z

common/scala/src/main/scala/org/apache/openwhisk/utils/Events.scala

+ * Publishes the payload of the MsgEnvelope when the topic of the
+ * MsgEnvelope equals the OWEvent specified when subscribing.
+ */
+object Events extends EventBus with LookupClassification {


We could possibly use the EventStream of the ActorSystem itself. That should meet our requirement.

tysonnorris · 2019-03-01

Sounds good! Will look into making this change.
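For reference, a minimal sketch of what subscribing through the ActorSystem's built-in EventStream could look like, using the classic Akka `eventStream.subscribe` / `unsubscribe` / `publish` calls. The `NodeStatsUpdate` case class and the `NodeStatsListener` actor are hypothetical stand-ins, not the types from this PR, and the `onUpdate` callback exists only to make the effect observable.

```scala
import akka.actor.Actor

// Hypothetical stand-in for the PR's NodeStatsUpdateEvent; the real type may differ.
final case class NodeStatsUpdate(freeMemoryMB: Long)

// Hypothetical listener actor; `onUpdate` is an assumption used for illustration.
class NodeStatsListener(onUpdate: Long => Unit) extends Actor {

  // Subscribe by event class: every published NodeStatsUpdate is delivered to this actor.
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[NodeStatsUpdate])

  // Explicit unsubscribe on stop; useful when tests share one ActorSystem.
  override def postStop(): Unit =
    context.system.eventStream.unsubscribe(self)

  def receive: Receive = {
    case NodeStatsUpdate(freeMemoryMB) => onUpdate(freeMemoryMB)
  }
}
```

A publisher would then call `system.eventStream.publish(NodeStatsUpdate(2048))`. Events published before the listener has started are not delivered to it, so the publish must happen after the actor is up.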

tysonnorris · 2019-03-01T15:51:12Z

I'm interested to know what you have in mind for the ResourcePolicy abstraction. It may help once I get the Mesos PR set up; I will do that today.

codecov-io · 2019-03-01T18:06:43Z

Codecov Report

Merging #3 into master will decrease coverage by 36.62%.
The diff coverage is 35.21%.

@@             Coverage Diff             @@
##           master       #3       +/-   ##
===========================================
- Coverage   79.42%   42.79%   -36.63%     
===========================================
  Files         170      173        +3     
  Lines        7940     8256      +316     
  Branches      532      582       +50     
===========================================
- Hits         6306     3533     -2773     
- Misses       1634     4723     +3089

Impacted Files	Coverage Δ
...pache/openwhisk/core/containerpool/Container.scala	`69.62% <ø> (-13.93%)`	⬇️
...cala/org/apache/openwhisk/core/yarn/YARNTask.scala	`0% <ø> (-70%)`	⬇️
...sk/core/containerpool/docker/DockerContainer.scala	`78.04% <ø> (-9.76%)`	⬇️
...penwhisk/core/containerpool/ContainerFactory.scala	`85.71% <ø> (ø)`	⬆️
...la/org/apache/openwhisk/core/mesos/MesosTask.scala	`0% <ø> (-73.81%)`	⬇️
...inerpool/AkkaClusterContainerResourceManager.scala	`0% <0%> (ø)`
...in/scala/org/apache/openwhisk/common/Logging.scala	`58.88% <100%> (-10.87%)`	⬇️
.../containerpool/LocalContainerResourceManager.scala	`100% <100%> (ø)`
.../core/containerpool/ContainerResourceManager.scala	`100% <100%> (ø)`
.../scala/org/apache/openwhisk/core/entity/Size.scala	`59.7% <16.66%> (-29.78%)`	⬇️
... and 133 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7a304c1...a2a1844.

chetanmeh · 2019-03-04T14:46:00Z

I was thinking of a trait that abstracts the hasPoolSpaceFor method (like here). However, the current logic closely matches the reservation, and I am getting a bit confused between its role and that of the free and busy pools.

The idea would be to move your current changes into such a policy and then use that in the ContainerPool decision logic. One aspect that makes it tricky is that the policy needs to be refreshed via events. So we may need some volatile state that gets updated after post-processing of cluster state changes and is then consumed by the hasPoolSpaceFor logic.
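A rough sketch of how such a policy could look, assuming memory in MB is the only resource tracked. Every name here except `hasPoolSpaceFor` is hypothetical, and the signature is simplified compared to the one in ContainerPool; this is an illustration of the idea, not the PR's implementation.

```scala
// Hypothetical abstraction: only the method name hasPoolSpaceFor comes from the discussion.
trait ResourcePolicy {
  /** True when `requiredMB` more memory fits on top of `inUseMB` already reserved. */
  def hasPoolSpaceFor(inUseMB: Long, requiredMB: Long): Boolean
}

// Fixed capacity, as for a pool bound to a single host.
final class LocalResourcePolicy(capacityMB: Long) extends ResourcePolicy {
  def hasPoolSpaceFor(inUseMB: Long, requiredMB: Long): Boolean =
    inUseMB + requiredMB <= capacityMB
}

// Capacity is refreshed from cluster events and read by the scheduling decision.
final class ClusterResourcePolicy(initialCapacityMB: Long) extends ResourcePolicy {
  // Volatile so an update from the event-handling side is visible to readers.
  @volatile private var capacityMB: Long = initialCapacityMB

  // Assumed to be called after cluster state changes have been post-processed.
  def onClusterStateChanged(availableMB: Long): Unit =
    capacityMB = availableMB

  def hasPoolSpaceFor(inUseMB: Long, requiredMB: Long): Boolean =
    inUseMB + requiredMB <= capacityMB
}
```

With this shape, ContainerPool would only hold a `ResourcePolicy` and ask it for a decision, while the event subscription would feed `onClusterStateChanged`. If all access happens inside a single actor, a plain `var` would be enough and the `@volatile` could be dropped.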

* Update docker version in the controller and guides
* Update travis configuration to install docker=18.06.3
* Separate docker installation from the travis setup script
* Add docker setup script
* Add CoordinatedShutdown to cleanup runtime containers
* add a configuration for root runc dir
* Disable runc use in Jenkins environment
* Add comments which explain the correlation among the version docker and runc and the type of user
* Reenable Docker remote API again

* Switch to consistent indexing policy
* Remove reference to Range index
* Tweak indexing policy comparison to only check for included and excluded path
  Do not check for Index type as now there is only one which is Range for our cases
* Use implicit logger
* Excluding root path should be using `/*` instead of `/`

Updates to Alpakka S3 Connector v1.0.1 release. This commit also includes some fixes related to apache#4484 which were causing compatibility issues with existing setup. For existing setup indexes were still using `Hash` indexing and that caused failure while creating `IndexingPolicy` upon collection read. So added back support for `Hash` index but not using them to create new indexes now

Record collection usage stats for CosmosDB so as to enable tracking the growth of collection in terms of storage size, document count and index size over the period of time. It also enables tracking any indexing progress if any change is done in Index configuration. Note that Count stats are currently not exposed via Azure Portal.

Further this commit also enables emitting verbose trace for query when in debug mode. This would simplify any query performance analysis.

Fixes apache#4489

* Add explicit return type on a few implicit values.
  Implicit values without an explicit return types are not guaranteed to work. They may or may not compile, depending on compilation order and they are going to be disallowed in future versions of Scala. This confuses IntelliJ as well.
* Make Exec public since it's being used from tests.
  The call to Exec.isBinaryCode in ActionContainer.scala is from a different package and normally not visible at the call site. Due to scala/bug#11554 this may compile sometimes.

chetanmeh reviewed Mar 1, 2019

tysonnorris force-pushed the invoker-backpressure branch from 9d7a1a0 to 43f9028 Compare March 5, 2019 23:46

tysonnorris changed the base branch from ow-invoker-backpressure to master March 5, 2019 23:47

tysonnorris force-pushed the invoker-backpressure branch from 2c92309 to ef31370 Compare March 28, 2019 14:41

tysonnorris force-pushed the invoker-backpressure branch from 6a5e9cc to 9889d68 Compare May 6, 2019 15:54

chetanmeh and others added 22 commits May 13, 2019 13:54

Update version for Akka dependencies and Scala (apache#4316)

e7ad4a2

Additional cleanup and simplifications for the ActionLoop docs (apache#4475)

4740d40

* Additional cleanup and simplifications for the ActionLoop docs
* Additional cleanup and simplifications for the ActionLoop docs

Updating runtimes to include new Node.js v12 image. (apache#4472)

4df3d09

* Updating runtimes to include new Node.js v12 image.
  - Updated runtimes manifest
  - Added API documentation
  - Minor updates to docs
  - Added automated test case
* Fixing accidental default flag for v12

Update LeanBalancer for invoke SPI change (apache#4478)

ae9ff6a

Limit the minimum to the cpu-share value (apache#4477)

8650f36

Adding Lean Openwhisk test to Jenkins (apache#4480)

541709d

Fixing Jenkins Lean Openwhisk test (apache#4482)

55d975f

Invoker agent failed to pull docker images if using docker registry name (apache#4488)

fd5d06b

Add `pip install docker` to Dockerfile for ow-utils to fix problem pulling docker images

add quotes to avoid build brakes (apache#4491)

1bf6e43

Apache OpenWhisk logos as svg (apache#4493)

72c08a6

Use linesIterator instead of String.lines (apache#4495)

6fdc7ce

`lines` method is now defined as part of java.lang.String in JDK 11. So need to use `linesIterator` for right method to be picked

Track Kafka client side metrics via Kamon (apache#4481)

658516e

Adds a configurable MetricsReporter to route Kafka metrics to Kamon once enabled. Set of metrics names which need to be captured needs to be explicitly configured

Update rest-assured to v4.0.0 (apache#4500)

aeabc35

* Update to restassured v4.0 which is compatible with jdk11
* Need to initialize both truststore and keystore for ssl cert to be validated

Use Instant with milli second precision (apache#4497)

9d37fad

Remove default port on api host. (apache#4504)

d0f6ba5

Consolidated action annotations to a new singleton (apache#4499)

e88257e

Settle throttle before running the API GW Tests (apache#4496)

da61ad2

* simplify throttle code
* revert to original throttle algorithm
* Make the waiting time to calm down the thottle explicit

tysonnorris added 29 commits June 27, 2019 16:58

ContainerPool must evict idles when certain portion of hosts are no longer available

19208b8

refactoring to isolate cluster/local communications to ContainerResourceManager impls; handle ClusterResourceError in case of reaching cluster capacity

184bd55

add GB to size

85b0ae1

cleanup

ded9882

cleanup

1e4b41a

cleanup; increase replicated data sharing to once per second

4ba7c0d

added resent list of activationIds to avoid resending duplicates

43e91b4

added metrics for resource error, rescheduled activation, number nodes, total memory, max memory

810daa9

review feedback

f5a2689

use histogram for node count

bc673f2

don't emit metrics from empty nodestats

7767607

tolerate empty nodestats

b64ae47

shuffle dependencies

550a620

shuffle dependencies

5d7728d

cleanup

89455f9

cleanup; adding metrics

eb38426

typo in conf

1885198

added tests; removed some methods defined in ContainerResourceManager

cb9615a

cleanup

a184417

use gauge metrics; update unused refs when containers are purged after idle timeout

ea4a08f

change container pool metrics to gauge type, add metric for runbuffer size

e1e6cd9

fixing idle metrics

1ce2d35

add counter metric for system errors

d5a1964

add idleGrace duration to prevent aggressive collection of recently idle containers; only remove idle containers that match the actual host/port of selected container

ba580d3

respect idle grace for local remove() as well as remote triggered remove()

23e9cf0

keep track of Container.lastUsed value properly to prevent early cleanup of containers

f0faa87

review feedback

6d17424

akka version updates

91c3eb8

fixing tests

a2a1844

tysonnorris force-pushed the invoker-backpressure branch from dec81c5 to a2a1844 Compare July 5, 2019 20:09
