Sysctl cluster

sttts title: Using Sysctls in a Kubernetes Cluster

TOC {:toc}

This document describes how sysctls are used within a Kubernetes cluster.

What is a Sysctl?

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems such as:

kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)
More subsystems are described in Kernel docs.

To get a list of all parameters, you can run

$ sudo sysctl -a

Namespaced vs. Node-Level Sysctls

A number of sysctls are namespaced in today's Linux kernels. This means that they can be set independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.

The following sysctls are known to be namespaced:

kernel.shm*,
kernel.msg*,
kernel.sem,
fs.mqueue.*,
net.*.

Sysctls which are not namespaced are called node-level and must be set manually by the cluster admin, either by means of the underlying Linux distribution of the nodes (e.g. via /etc/sysctls.conf) or using a DaemonSet with privileged containers.

Note: it is good practice to consider nodes with special sysctl settings as tainted within a cluster, and only schedule pods onto them which need those sysctl settings. It is suggested to use the Kubernetes taints and toleration feature to implement this.

Safe vs. Unsafe Sysctls

Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod

must not have any influence on any other pod on the node
must not allow to harm the node's health
must not allow to gain CPU or memory resources outside of the resource limits of a pod.

By far, most of the namespaced sysctls are not necessarily considered safe.

For Kubernetes 1.4, the following sysctls are supported in the safe set:

kernel.shm_rmid_forced,
net.ipv4.ip_local_port_range,
net.ipv4.tcp_syncookies.

This list will be extended in future Kubernetes versions when the kubelet supports better isolation mechanisms.

All safe sysctls are enabled by default.

All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.

Warning: Due to their nature of being unsafe, the use of unsafe sysctls is at-your-own-risk and can lead to severe problems like wrong behavior of containers, resource shortage or complete breakage of a node.

Enabling Unsafe Sysctls

With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations like e.g. high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet, e.g.:

$ kubelet --experimental-allowed-unsafe-sysctls 'kernel.msg*,net.ipv4.route.min_pmtu' ...

Only namespaced sysctls can be enabled this way.

Setting Sysctls for a Pod

The sysctl feature is an alpha API in Kubernetes 1.4. Therefore, sysctls are set using annotations on pods. They apply to all containers in the same pod.

Here is an example, with different annotations for safe and unsafe sysctls:

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
  annotations:
    security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
    security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:
  ...

Note: a pod with the unsafe sysctls specified above will fail to launch on any node which has not enabled those two unsafe sysctls explicitly. As with node-level sysctls it is recommended to use taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.