Commit 949a286

Fixed issue kubernetes-sigs#7112. Created new API Server vars that replace defunct Controller Manager one (kubernetes-sigs#7114)

Signed-off-by: Brendan Holmes <5072156+holmesb@users.noreply.github.com>
holmesb authored and LuckySB committed Jan 17, 2021
1 parent d1e193b commit 949a286
Showing 6 changed files with 27 additions and 14 deletions.
25 changes: 15 additions & 10 deletions docs/kubernetes-reliability.md
@@ -43,8 +43,10 @@ attempts to set a status of node.
 
 At the same time Kubernetes controller manager will try to check
 `nodeStatusUpdateRetry` times every `--node-monitor-period` of time. After
-`--node-monitor-grace-period` it will consider node unhealthy. It will remove
-its pods based on `--pod-eviction-timeout`
+`--node-monitor-grace-period` it will consider node unhealthy. Pods will then be rescheduled based on the
+[Taint Based Eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions)
+timers that you set on them individually, or the API Server's global timers: `--default-not-ready-toleration-seconds` &
+`--default-unreachable-toleration-seconds`.
 
 Kube proxy has a watcher over API. Once pods are evicted, Kube proxy will
 notice and will update iptables of the node. It will remove endpoints from
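
For reference, the per-pod timers mentioned above are `tolerations` with `tolerationSeconds` in the pod spec. A minimal sketch (pod name, container name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: fast-failover-pod        # placeholder
spec:
  containers:
    - name: app                  # placeholder
      image: nginx
  tolerations:
    # Evict this pod 30s after its node is tainted not-ready,
    # overriding the API server's global default of 300s.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
    # Same for an unreachable node.
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30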
@@ -57,12 +59,14 @@ services so pods from failed node won't be accessible anymore.
 If `--node-status-update-frequency` is set to **4s** (10s is default).
 `--node-monitor-period` to **2s** (5s is default).
 `--node-monitor-grace-period` to **20s** (40s is default).
-`--pod-eviction-timeout` is set to **30s** (5m is default)
+`--default-not-ready-toleration-seconds` and `--default-unreachable-toleration-seconds` are set to **30**
+(300 seconds is default). Note these two values should be integers representing the number of seconds ("s" or "m" for
+seconds/minutes are not specified).
 
 In such scenario, pods will be evicted in **50s** because the node will be
-considered as down after **20s**, and `--pod-eviction-timeout` occurs after
-**30s** more. However, this scenario creates an overhead on etcd as every node
-will try to update its status every 2 seconds.
+considered as down after **20s**, and `--default-not-ready-toleration-seconds` or
+`--default-unreachable-toleration-seconds` occur after **30s** more. However, this scenario creates an overhead on
+etcd as every node will try to update its status every 2 seconds.
 
 If the environment has 1000 nodes, there will be 15000 node updates per
 minute which may require large etcd containers or even dedicated nodes for etcd.
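
Tying the scenario to this commit's variables, the fast-reaction profile could be expressed in a Kubespray inventory roughly as follows (a sketch: the file path is illustrative, and `kubelet_status_update_frequency` is the kubelet-side counterpart of `--node-status-update-frequency` per the repo's docs; verify against your version):

# inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml (illustrative path)
kubelet_status_update_frequency: 4s             # --node-status-update-frequency
kube_controller_node_monitor_period: 2s         # --node-monitor-period
kube_controller_node_monitor_grace_period: 20s  # --node-monitor-grace-period
kube_apiserver_pod_eviction_not_ready_timeout_seconds: "30"
kube_apiserver_pod_eviction_unreachable_timeout_seconds: "30"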
@@ -75,7 +79,8 @@ minute which may require large etcd containers or even dedicated nodes for etcd.
 ## Medium Update and Average Reaction
 
 Let's set `--node-status-update-frequency` to **20s**
-`--node-monitor-grace-period` to **2m** and `--pod-eviction-timeout` to **1m**.
+`--node-monitor-grace-period` to **2m** and `--default-not-ready-toleration-seconds` and
+`--default-unreachable-toleration-seconds` to **60**.
 In that case, Kubelet will try to update status every 20s. So, it will be 6 * 5
 = 30 attempts before Kubernetes controller manager will consider unhealthy
 status of node. After 1m it will evict all pods. The total time will be 3m
@@ -90,9 +95,9 @@ etcd updates per minute.
 ## Low Update and Slow reaction
 
 Let's set `--node-status-update-frequency` to **1m**.
-`--node-monitor-grace-period` will set to **5m** and `--pod-eviction-timeout`
-to **1m**. In this scenario, every kubelet will try to update the status every
-minute. There will be 5 * 5 = 25 attempts before unhealthy status. After 5m,
+`--node-monitor-grace-period` will be set to **5m** and `--default-not-ready-toleration-seconds` and
+`--default-unreachable-toleration-seconds` to **60**. In this scenario, every kubelet will try to update the status
+every minute. There will be 5 * 5 = 25 attempts before unhealthy status. After 5m,
 Kubernetes controller manager will set unhealthy status. This means that pods
 will be evicted after 1m after being marked unhealthy. (6m in total).
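
All three scenarios follow the same pattern, stated once explicitly as a comment-only sketch using the document's own numbers:

# Approximate time from node failure to pod eviction:
#   eviction_time ≈ node-monitor-grace-period + toleration seconds
#
#   Fast:   20s + 30s = 50s
#   Medium: 2m  + 60s = 3m
#   Low:    5m  + 60s = 6m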

3 changes: 2 additions & 1 deletion docs/large-deployments.md
@@ -30,7 +30,8 @@ For large scaled deployments, consider the following configuration changes:
 * Tune ``kubelet_status_update_frequency`` to increase reliability of kubelet.
   ``kube_controller_node_monitor_grace_period``,
   ``kube_controller_node_monitor_period``,
-  ``kube_controller_pod_eviction_timeout`` for better Kubernetes reliability.
+  ``kube_apiserver_pod_eviction_not_ready_timeout_seconds`` &
+  ``kube_apiserver_pod_eviction_unreachable_timeout_seconds`` for better Kubernetes reliability.
   Check out [Kubernetes Reliability](kubernetes-reliability.md)
 
 * Tune network prefix sizes. Those are ``kube_network_node_prefix``,
3 changes: 2 additions & 1 deletion roles/kubernetes/master/defaults/main/main.yml
@@ -95,7 +95,6 @@ kube_controller_memory_requests: 100M
 kube_controller_cpu_requests: 100m
 kube_controller_node_monitor_grace_period: 40s
 kube_controller_node_monitor_period: 5s
-kube_controller_pod_eviction_timeout: 5m0s
 kube_controller_terminated_pod_gc_threshold: 12500
 kube_scheduler_memory_limit: 512M
 kube_scheduler_cpu_limit: 250m

@@ -106,6 +105,8 @@ kube_apiserver_cpu_limit: 800m
 kube_apiserver_memory_requests: 256M
 kube_apiserver_cpu_requests: 100m
 kube_apiserver_request_timeout: "1m0s"
+kube_apiserver_pod_eviction_not_ready_timeout_seconds: "300"
+kube_apiserver_pod_eviction_unreachable_timeout_seconds: "300"
 
 # 1.9 and below Admission control plug-ins
 kube_apiserver_admission_control:
@@ -100,6 +100,12 @@ certificatesDir: {{ kube_cert_dir }}
 imageRepository: {{ kube_image_repo }}
 apiServer:
   extraArgs:
+{% if kube_apiserver_pod_eviction_not_ready_timeout_seconds is defined %}
+    default-not-ready-toleration-seconds: "{{ kube_apiserver_pod_eviction_not_ready_timeout_seconds }}"
+{% endif %}
+{% if kube_apiserver_pod_eviction_unreachable_timeout_seconds is defined %}
+    default-unreachable-toleration-seconds: "{{ kube_apiserver_pod_eviction_unreachable_timeout_seconds }}"
+{% endif %}
 {% if kube_api_anonymous_auth is defined %}
     anonymous-auth: "{{ kube_api_anonymous_auth }}"
 {% endif %}
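
With the role defaults above left at "300", this template fragment would render roughly as follows (a sketch of the resulting kubeadm configuration, not output captured from this commit):

apiServer:
  extraArgs:
    default-not-ready-toleration-seconds: "300"
    default-unreachable-toleration-seconds: "300"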
@@ -256,7 +262,6 @@ controllerManager:
   extraArgs:
     node-monitor-grace-period: {{ kube_controller_node_monitor_grace_period }}
     node-monitor-period: {{ kube_controller_node_monitor_period }}
-    pod-eviction-timeout: {{ kube_controller_pod_eviction_timeout }}
     node-cidr-mask-size: "{{ kube_network_node_prefix }}"
     profiling: "{{ kube_profiling }}"
     terminated-pod-gc-threshold: "{{ kube_controller_terminated_pod_gc_threshold }}"
@@ -151,6 +151,8 @@ spec:
     - --proxy-client-cert-file={{ kube_cert_dir }}/apiserver.pem
     - --proxy-client-key-file={{ kube_cert_dir }}/apiserver-key.pem
 {% endif %}
+    - --default-not-ready-toleration-seconds={{ kube_apiserver_pod_eviction_not_ready_timeout_seconds }}
+    - --default-unreachable-toleration-seconds={{ kube_apiserver_pod_eviction_unreachable_timeout_seconds }}
 {% if apiserver_custom_flags is string %}
     - {{ apiserver_custom_flags }}
 {% else %}
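
Rendered with the defaults, the kube-apiserver static-pod manifest would gain flags like the following (sketch). Note that, unlike the kubeadm template above, these lines are not wrapped in `is defined` guards, so the two variables must stay defined (as they are in the role defaults) for this template to render:

    - --default-not-ready-toleration-seconds=300
    - --default-unreachable-toleration-seconds=300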
@@ -36,7 +36,6 @@ spec:
     - --enable-hostpath-provisioner={{ kube_hostpath_dynamic_provisioner }}
     - --node-monitor-grace-period={{ kube_controller_node_monitor_grace_period }}
     - --node-monitor-period={{ kube_controller_node_monitor_period }}
-    - --pod-eviction-timeout={{ kube_controller_pod_eviction_timeout }}
     - --profiling={{ kube_profiling }}
     - --terminated-pod-gc-threshold={{ kube_controller_terminated_pod_gc_threshold }}
     - --v={{ kube_log_level }}
