Reserve compute resources

vishh
derekwaynecarr
dashpole

Reserving Compute Resources for System Daemons

By default, pods can be scheduled up to a node's full Capacity and can consume all the available capacity on that node. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources, which leads to resource starvation on the node.

The kubelet exposes a feature named Node Allocatable that helps to reserve compute resources for system daemons. Cluster administrators are advised to configure Node Allocatable based on the workload density on each node.

Node Allocatable

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

Allocatable on a Kubernetes node is defined as the amount of compute resources that are available for pods. The scheduler does not over-subscribe Allocatable. CPU and memory are supported as of now. Support for storage is expected to be added in the future.
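As a rough illustration of what "does not over-subscribe" means, the sketch below checks whether a new pod's requests still fit within a node's Allocatable. The function name, the dictionary layout, and the sample numbers are hypothetical and chosen only for illustration; the real scheduler takes many more factors into account.

```python
def fits_on_node(allocatable, scheduled_requests, new_request):
    """Return True if new_request fits without over-subscribing allocatable.

    Every argument maps a resource name to an integer amount
    (here: CPU in millicores and memory in bytes).
    """
    for resource, limit in allocatable.items():
        already_requested = sum(r.get(resource, 0) for r in scheduled_requests)
        if already_requested + new_request.get(resource, 0) > limit:
            return False
    return True


# Hypothetical node with 14.5 CPUs and 28 GiB allocatable.
allocatable = {"cpu": 14500, "memory": 28 * 2**30}
running = [
    {"cpu": 8000, "memory": 16 * 2**30},
    {"cpu": 4000, "memory": 8 * 2**30},
]

print(fits_on_node(allocatable, running, {"cpu": 2000, "memory": 2 * 2**30}))  # True
print(fits_on_node(allocatable, running, {"cpu": 3000, "memory": 2 * 2**30}))  # False
```

The second request is rejected because the CPU requests would total 15000 millicores, which is more than the 14500 millicores that are allocatable.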

Node Allocatable is exposed as part of the v1.Node object in the API and in the output of kubectl describe node in the CLI.

Resources can be reserved for two categories of system daemons in the kubelet: Kubernetes system daemons (kube-reserved) and OS system daemons (system-reserved).
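To compare Capacity and Allocatable programmatically, one option is to read the status.capacity and status.allocatable fields from the JSON form of a Node object (for example, the output of kubectl get node with -o json). The sketch below parses a hand-written sample document; the node name and the values in it are made up for illustration.

```python
import json

# Hand-written sample; only the fields used below are included.
SAMPLE_NODE_JSON = """
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {"name": "example-node"},
  "status": {
    "capacity": {"cpu": "16", "memory": "33554432Ki"},
    "allocatable": {"cpu": "14500m", "memory": "29896704Ki"}
  }
}
"""


def capacity_and_allocatable(node_json):
    """Return the (capacity, allocatable) resource maps of a Node document."""
    status = json.loads(node_json)["status"]
    return status["capacity"], status["allocatable"]


capacity, allocatable = capacity_and_allocatable(SAMPLE_NODE_JSON)
for resource in sorted(capacity):
    print(f"{resource}: capacity={capacity[resource]} allocatable={allocatable[resource]}")
```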

Enabling QoS and Pod level cgroups

To properly enforce node allocatable constraints on the node, you must enable the new cgroup hierarchy via the --cgroups-per-qos flag. This flag is enabled by default. When enabled, the kubelet will parent all end-user pods under a cgroup hierarchy managed by the kubelet.

Configuring a cgroup driver

The kubelet supports manipulation of the cgroup hierarchy on the host using a cgroup driver. The driver is configured via the --cgroup-driver flag.

The supported values are the following:

cgroupfs is the default driver that performs direct manipulation of the cgroup filesystem on the host in order to manage cgroup sandboxes.
systemd is an alternative driver that manages cgroup sandboxes using transient slices for resources that are supported by that init system.

Depending on the configuration of the associated container runtime, operators may have to choose a particular cgroup driver to ensure proper system behavior. For example, if operators use the systemd cgroup driver provided by the docker runtime, the kubelet must be configured to use the systemd cgroup driver.
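A deployment script could guard against a driver mismatch before starting the kubelet. The following is a minimal sketch of such a check; the function name is hypothetical, and how the runtime's driver is discovered is left out because it depends on the runtime in use.

```python
SUPPORTED_CGROUP_DRIVERS = ("cgroupfs", "systemd")


def check_cgroup_drivers(kubelet_driver, runtime_driver):
    """Return a list of problems; an empty list means the drivers agree."""
    problems = []
    for owner, driver in (("kubelet", kubelet_driver), ("runtime", runtime_driver)):
        if driver not in SUPPORTED_CGROUP_DRIVERS:
            problems.append(f"{owner}: unsupported cgroup driver {driver!r}")
    if not problems and kubelet_driver != runtime_driver:
        problems.append(
            f"kubelet uses {kubelet_driver!r} but the runtime uses {runtime_driver!r}"
        )
    return problems


print(check_cgroup_drivers("systemd", "systemd"))   # []
print(check_cgroup_drivers("cgroupfs", "systemd"))  # one mismatch message
```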

Kube Reserved

Kubelet Flag: --kube-reserved=[cpu=100m][,][memory=100Mi]
Kubelet Flag: --kube-reserved-cgroup=

kube-reserved is meant to capture resource reservation for Kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that are run as pods. kube-reserved is typically a function of pod density on the nodes. This performance dashboard exposes CPU and memory usage profiles of the kubelet and Docker engine at multiple levels of pod density. This blog post explains how the dashboard can be interpreted to come up with a suitable kube-reserved reservation.

To optionally enforce kube-reserved on system daemons, specify the parent control group for kube daemons as the value of the --kube-reserved-cgroup kubelet flag.

It is recommended that the Kubernetes system daemons are placed under a top-level control group (runtime.slice on systemd machines, for example). Each system daemon should ideally run within its own child control group. Refer to this doc for more details on the recommended control group hierarchy.

Note that the kubelet does not create the cgroup named by --kube-reserved-cgroup if it doesn't exist. The kubelet will fail if an invalid cgroup is specified.
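The value of a reservation flag such as --kube-reserved is a comma-separated list of resource=quantity pairs. The sketch below shows one way to turn such a value into numbers for further calculation. It is not the kubelet's parser: the helper names are hypothetical, and it handles only the "m" suffix for CPU and the Ki, Mi, Gi, and Ti suffixes for memory.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity):
    """Return CPU in millicores, e.g. '100m' -> 100 and '1.5' -> 1500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def parse_memory(quantity):
    """Return memory in bytes, e.g. '100Mi' -> 104857600."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def parse_reserved(flag_value):
    """Parse a value such as 'cpu=100m,memory=100Mi' into a dict."""
    parsers = {"cpu": parse_cpu, "memory": parse_memory}
    reserved = {}
    for pair in flag_value.split(","):
        if not pair.strip():
            continue  # tolerate an empty value or a trailing comma
        name, _, quantity = pair.partition("=")
        name = name.strip()
        if name not in parsers:
            raise ValueError(f"unsupported resource {name!r}")
        reserved[name] = parsers[name](quantity.strip())
    return reserved


print(parse_reserved("cpu=100m,memory=100Mi"))  # {'cpu': 100, 'memory': 104857600}
```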

System Reserved

Kubelet Flag: --system-reserved=[cpu=100m][,][memory=100Mi]
Kubelet Flag: --system-reserved-cgroup=

system-reserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too, since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice on systemd machines).

To optionally enforce system-reserved on system daemons, specify the parent control group for OS system daemons as the value of the --system-reserved-cgroup kubelet flag.

It is recommended that the OS system daemons are placed under a top-level control group (system.slice on systemd machines, for example).

Note that the kubelet does not create the cgroup named by --system-reserved-cgroup if it doesn't exist. The kubelet will fail if an invalid cgroup is specified.
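Because the kubelet does not create the cgroups named by --kube-reserved-cgroup or --system-reserved-cgroup, it can be useful to verify that they exist before starting it. The sketch below assumes a cgroup v1 layout in which each controller has its own directory under /sys/fs/cgroup; that layout, the controller list, and the function name are assumptions, so adapt them to the host. The demonstration runs against a temporary directory rather than the real cgroup filesystem.

```python
import os
import tempfile


def missing_cgroup_dirs(cgroup_name, controllers=("cpu", "memory"), root="/sys/fs/cgroup"):
    """Return the controller directories in which cgroup_name does not exist."""
    relative = cgroup_name.strip("/")
    missing = []
    for controller in controllers:
        path = os.path.join(root, controller, relative)
        if not os.path.isdir(path):
            missing.append(path)
    return missing


# Demonstrate against a throwaway tree instead of the real /sys/fs/cgroup.
with tempfile.TemporaryDirectory() as fake_root:
    os.makedirs(os.path.join(fake_root, "cpu", "system.slice"))
    os.makedirs(os.path.join(fake_root, "memory", "system.slice"))
    print(missing_cgroup_dirs("/system.slice", root=fake_root))        # []
    print(len(missing_cgroup_dirs("/runtime.slice", root=fake_root)))  # 2
```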

Eviction Thresholds

Kubelet Flag: --eviction-hard=[memory.available<500Mi]

Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it. Nodes can go offline temporarily until memory has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet provides Out of Resource management. Evictions are supported for memory and storage only. When some memory is reserved via the --eviction-hard flag, the kubelet attempts to evict pods whenever memory availability on the node drops below the reserved value. Hypothetically, even if system daemons did not exist on a node, pods could not use more than capacity - eviction-hard. For this reason, resources reserved for evictions are not available for pods.
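The hard eviction condition described above boils down to a comparison between available memory and the configured threshold. The sketch below parses an expression such as memory.available<500Mi and evaluates it against a given amount of available memory. The helper names are hypothetical, and only the "<" operator with Ki, Mi, and Gi suffixes is handled; how the kubelet measures available memory is out of scope here.

```python
MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity):
    """Return memory in bytes, e.g. '500Mi' -> 524288000."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def parse_hard_threshold(expression):
    """Split 'memory.available<500Mi' into (signal, threshold_in_bytes)."""
    signal, operator, quantity = expression.partition("<")
    if not (signal.strip() and operator and quantity.strip()):
        raise ValueError(f"malformed threshold {expression!r}")
    return signal.strip(), parse_memory(quantity.strip())


def below_threshold(available_bytes, threshold_bytes):
    """True when available memory has dropped below the reserved value."""
    return available_bytes < threshold_bytes


signal, threshold = parse_hard_threshold("memory.available<500Mi")
print(signal, threshold)                         # memory.available 524288000
print(below_threshold(400 * 2**20, threshold))   # True
print(below_threshold(2 * 2**30, threshold))     # False
```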

Enforcing Node Allocatable

Kubelet Flag: --enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]

The scheduler treats Allocatable as the available capacity for pods.

The kubelet enforces Allocatable across pods by default. Enforcement is performed by evicting pods whenever the overall usage across all pods exceeds Allocatable. More details on eviction policy can be found here. This enforcement is controlled by specifying the pods value in the kubelet flag --enforce-node-allocatable.

Optionally, the kubelet can be made to enforce kube-reserved and system-reserved by specifying the kube-reserved and system-reserved values in the same flag. Note that to enforce kube-reserved or system-reserved, --kube-reserved-cgroup or --system-reserved-cgroup must be specified, respectively.
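The dependency between the enforcement keys and the cgroup flags can be checked before the kubelet is started. The sketch below is one possible pre-flight check; the function name is hypothetical, and it recognizes only the three values listed for --enforce-node-allocatable on this page.

```python
ENFORCEMENT_KEYS = {"pods", "kube-reserved", "system-reserved"}


def validate_enforcement(enforce, kube_reserved_cgroup="", system_reserved_cgroup=""):
    """Return a list of problems with an --enforce-node-allocatable value."""
    keys = {key.strip() for key in enforce.split(",") if key.strip()}
    problems = [
        f"unknown enforcement key {key!r}" for key in sorted(keys - ENFORCEMENT_KEYS)
    ]
    if "kube-reserved" in keys and not kube_reserved_cgroup:
        problems.append("kube-reserved enforcement requires --kube-reserved-cgroup")
    if "system-reserved" in keys and not system_reserved_cgroup:
        problems.append("system-reserved enforcement requires --system-reserved-cgroup")
    return problems


print(validate_enforcement("pods"))                # []
print(validate_enforcement("pods,kube-reserved"))  # one problem: missing cgroup flag
```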

General Guidelines

System daemons are expected to be treated similarly to Guaranteed pods. System daemons can burst within their bounding control groups, and this behavior needs to be managed as part of Kubernetes deployments. For example, the kubelet should have its own control group and share kube-reserved resources with the container runtime. However, the kubelet cannot burst and use up all available node resources if kube-reserved is enforced.

Be extra careful while enforcing the system-reserved reservation, since it can lead to critical system services being CPU starved or OOM killed on the node. The recommendation is to enforce system-reserved only if a user has profiled their nodes exhaustively to come up with precise estimates and is confident in their ability to recover if any process in that group is OOM killed.

To begin with, enforce Allocatable on pods.
Once adequate monitoring and alerting is in place to track kube system daemons, attempt to enforce kube-reserved based on usage heuristics.
If absolutely necessary, enforce system-reserved over time.
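The three stages above can be expressed as progressively larger sets of kubelet flags. The sketch below generates them; the function name is hypothetical, and the cgroup names /runtime.slice and /system.slice are placeholders that must match cgroups that actually exist on the node.

```python
def enforcement_flags(stage, kube_cgroup="/runtime.slice", system_cgroup="/system.slice"):
    """Return kubelet flags for rollout stage 1 (pods), 2 (+kube), or 3 (+system)."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    keys = ["pods"]
    extra_flags = []
    if stage >= 2:
        keys.append("kube-reserved")
        extra_flags.append(f"--kube-reserved-cgroup={kube_cgroup}")
    if stage >= 3:
        keys.append("system-reserved")
        extra_flags.append(f"--system-reserved-cgroup={system_cgroup}")
    return [f"--enforce-node-allocatable={','.join(keys)}"] + extra_flags


for stage in (1, 2, 3):
    print(stage, " ".join(enforcement_flags(stage)))
```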

The resource requirements of kube system daemons may grow over time as more features are added. The Kubernetes project will attempt to bring down the utilization of node system daemons over time, but that is not a priority as of now. So expect a drop in Allocatable capacity in future releases.

Example Scenario

Here is an example to illustrate Node Allocatable computation:

Node has 32Gi of memory and 16 CPUs
--kube-reserved is set to cpu=1,memory=2Gi
--system-reserved is set to cpu=500m,memory=1Gi
--eviction-hard is set to memory.available<500Mi

Under this scenario, Allocatable will be 14.5 CPUs and 28.5Gi of memory. The scheduler ensures that the total memory requests across all pods on this node do not exceed 28.5Gi. The kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi. If all processes on the node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.

If kube-reserved and/or system-reserved are not enforced and system daemons exceed their reservation, the kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi.
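The arithmetic behind this scenario can be reproduced in a few lines. The sketch below works in millicores and Mi to avoid rounding; the variable names are illustrative. Because 500Mi is slightly less than 0.5Gi, the exact memory figures are about 28.51Gi and 31.51Gi, which the text rounds to 28.5Gi and 31.5Gi.

```python
MI_PER_GI = 1024

# Values from the scenario, in CPU millicores and memory Mi.
capacity = {"cpu_m": 16 * 1000, "memory_mi": 32 * MI_PER_GI}
kube_reserved = {"cpu_m": 1000, "memory_mi": 2 * MI_PER_GI}
system_reserved = {"cpu_m": 500, "memory_mi": 1 * MI_PER_GI}
eviction_hard = {"cpu_m": 0, "memory_mi": 500}  # eviction applies to memory only here

# Allocatable = Capacity - kube-reserved - system-reserved - eviction-threshold
allocatable = {
    resource: capacity[resource]
    - kube_reserved[resource]
    - system_reserved[resource]
    - eviction_hard[resource]
    for resource in capacity
}

# Node-level usage above which the hard eviction threshold is crossed.
node_eviction_point_mi = capacity["memory_mi"] - eviction_hard["memory_mi"]

print(allocatable["cpu_m"] / 1000)             # 14.5
print(allocatable["memory_mi"] / MI_PER_GI)    # 28.51171875
print(node_eviction_point_mi / MI_PER_GI)      # 31.51171875
```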

Feature Availability

As of Kubernetes version 1.2, it has been possible to optionally specify kube-reserved and system-reserved reservations. The scheduler switched to using Allocatable instead of Capacity when available in the same release.

As of Kubernetes version 1.6, eviction thresholds are taken into account when computing Allocatable. To revert to the old behavior, set the --experimental-allocatable-ignore-eviction kubelet flag to true.

As of Kubernetes version 1.6, the kubelet enforces Allocatable on pods using control groups. To revert to the old behavior, unset the --enforce-node-allocatable kubelet flag. Note that unless the --kube-reserved, --system-reserved, or --eviction-hard flags have non-default values, Allocatable enforcement does not affect existing deployments.

As of Kubernetes version 1.6, kubelet launches pods in their own cgroup sandbox in a dedicated part of the cgroup hierarchy it manages. Operators are required to drain their nodes prior to upgrade of the kubelet from prior versions in order to ensure pods and their associated containers are launched in the proper part of the cgroup hierarchy.