Kubernetes Contributors – Kubernetes Enhancement Proposals (KEPs)

Resources: Add an nftables-based kube-proxy backend

KEP-3866: Add an nftables-based kube-proxy backend

Summary
Motivation
Proposal
- Notes/Constraints/Caveats
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Summary

The default kube-proxy implementation on Linux is currently based on iptables. IPTables was the preferred packet filtering and processing system in the Linux kernel for many years (starting with the 2.4 kernel in 2001). However, problems with iptables led to the development of a successor, nftables, first made available in the 3.13 kernel in 2014, and growing increasingly featureful and usable as a replacement for iptables since then. Development on iptables has mostly stopped, with new features and performance improvements primarily going into nftables instead.

This KEP proposes the creation of a new official/supported nftables backend for kube-proxy. While it is hoped that this backend will eventually replace both the iptables and ipvs backends and become the default kube-proxy mode on Linux, that replacement/deprecation would be handled in a separate future KEP.

Motivation

There are currently two officially supported kube-proxy backends for Linux: iptables and ipvs. (The original userspace backend was deprecated several releases ago and removed from the tree in 1.26.)

The iptables mode of kube-proxy is currently the default, and it is generally considered “good enough” for most use cases. Nonetheless, there are good arguments for replacing it with a new nftables mode.

The iptables kernel subsystem has unfixable performance problems

Although much work has been done to improve the performance of the kube-proxy iptables backend, there are fundamental performance-related problems with the implementation of iptables in the kernel, both on the “control plane” side and on the “data plane” side:

The control plane is problematic because the iptables API does not support making incremental changes to the ruleset. If you want to add a single iptables rule, the iptables binary must acquire a lock, download the entire ruleset from the kernel, find the appropriate place in the ruleset to add the new rule, add it, re-upload the entire ruleset to the kernel, and release the lock. This becomes slower and slower as the ruleset increases in size (i.e., as the number of Kubernetes Services grows). If you want to replace a large number of rules (as kube-proxy does frequently), then just the time that /sbin/iptables-restore takes to parse all of the rules becomes substantial.
The data plane is problematic because (for the most part), the number of iptables rules used to implement a set of Kubernetes Services is directly proportional to the number of Services. And every packet going through the system then needs to pass through all of these rules, slowing down the traffic.

IPTables is the bottleneck in kube-proxy performance, and it always will be until we stop using it.

Upstream development has moved on from iptables to nftables

In large part due to its unfixable problems, development on iptables in the kernel has slowed down and mostly stopped. New features are not being added to iptables, because nftables is supposed to do everything iptables does, but better.

Although there is no plan to remove iptables from the upstream kernel, that does not guarantee that iptables will remain supported by distributions forever. In particular, Red Hat has declared that iptables is deprecated in RHEL 9 and is likely to be removed entirely in RHEL 10, a few years from now. Other distributions have made smaller steps in the same direction; for instance, Debian removed iptables from the set of “required” packages in Debian 11 (Bullseye).

The RHEL deprecation in particular impacts Kubernetes in two ways:

Many Kubernetes users run RHEL or one of its downstreams, so in a few years when RHEL 10 is released, they will be unable to use kube-proxy in iptables mode (or, for that matter, in ipvs or userspace mode, since those modes also make heavy use of the iptables API).
Several upstream iptables bugs and performance problems that affect Kubernetes have been fixed by Red Hat developers over the past several years. With Red Hat no longer making any effort to maintain iptables, it is less likely that upstream iptables bugs that affect Kubernetes in the future would be fixed promptly, if at all.

The `ipvs` mode of kube-proxy will not save us

Because of the problems with iptables, some developers added an ipvs mode to kube-proxy in 2017. It was generally hoped that this could eventually solve all of the problems with the iptables mode and become its replacement, but this never really happened. It’s not entirely clear why, but kubeadm #817, “Track when we can enable the ipvs mode for the kube-proxy by default”, is perhaps a good snapshot of the initial excitement followed by growing disillusionment with the ipvs mode:

“a few issues … re: the version of iptables/ipset shipped in the kube-proxy container image”
“clearly not ready for defaulting”
“complications … with IPVS kernel modules missing or disabled on user nodes”
“we are still lacking tests”
“still does not completely align with what [we] support in iptables mode”
“iptables works and people are familiar with it”
“not sure that it was ever intended for IPVS to be the default”

Additionally, the kernel IPVS APIs alone do not provide enough functionality to fully implement Kubernetes services, and so the ipvs backend also makes heavy use of the iptables API. Thus, if we are worried about iptables deprecation, then in order to switch to using ipvs as the default mode, we would have to port the iptables parts of it to use nftables anyway. But at that point, there would be little excuse for using IPVS for the core load-balancing part, particularly given that IPVS, like iptables, is no longer an actively-developed technology.

The `nf_tables` mode of `/sbin/iptables` will not save us

In 2018, with the 1.8.0 release of the iptables client binaries, a new mode was added to the binaries, to allow them to use the nftables API in the kernel rather than the legacy iptables API, while still preserving the “API” of the original iptables binaries. As of 2022, most Linux distributions now use this mode, so the legacy iptables kernel API is mostly dead.

However, this new mode does not add any new syntax, and so it is not possible to use any of the new nftables features (like maps) that are not present in iptables.

Furthermore, the compatibility constraints imposed by the user-facing API of the iptables binaries themselves prevent them from being able to take advantage of many of the performance improvements associated with nftables.

(Additionally, the RHEL deprecation of iptables includes iptables-nft as well.)

The `iptables` mode of kube-proxy has grown crufty

Because iptables is the default kube-proxy mode, it is subject to strong backward-compatibility constraints which mean that certain “features” that are now considered to be bad ideas cannot be removed because they might break some existing users. A few examples:

It allows NodePort services to be accessed on localhost, which requires it to set a sysctl to a value that may introduce security holes on the system. More generally, it defaults to having NodePort services be accessible on all node IPs, when most users would probably prefer them to be more restricted.
It implements the LoadBalancerSourceRanges feature for traffic addressed directly to LoadBalancer IPs, but not for traffic redirected to a NodePort by an external LoadBalancer.
Some new functionality only works correctly if the administrator passes certain command-line options to kube-proxy (eg, --cluster-cidr), but we cannot make those options be mandatory, since that would break old clusters that aren’t passing them.

A new kube-proxy mode, which existing users would have to explicitly opt into, could revisit these and other decisions. (Though if we expect it to eventually become the default, then we might decide to avoid such changes anyway.)

We will hopefully be able to trade 2 supported backends for 1

Right now SIG Network is supporting both the iptables and ipvs backends of kube-proxy, and does not feel like it can ditch ipvs because of perceived performance issues with iptables. If we create a new backend which is as functional and non-buggy as iptables but as performant as ipvs, then we could (eventually) deprecate both of the existing backends and only have one Linux backend to support in the future.

Writing a new kube-proxy mode will help to focus our cleanup/refactoring efforts

There is a desire to provide a “kube-proxy library” that third parties could use as a base for external service proxy implementations (KEP-3786). The existing “core kube-proxy” code, while functional, is not very well designed and is not something we would want to support other people using in its current form.

Writing a new proxy backend will force us to look over all of this shared code again, and perhaps give us new ideas on how it can be cleaned up, rationalized, and optimized.

Goals

Design and implement an nftables mode for kube-proxy.
- Consider various fixes to legacy iptables mode behavior.
  - Do not enable the route_localnet sysctl.
  - Add a more restrictive startup mode to kube-proxy, which will error out if the configuration is invalid (e.g., “--detect-local-mode ClusterCIDR” without specifying “--cluster-cidr”) or incomplete (e.g., partially-dual-stack but not fully-dual-stack).
  - (Possibly other changes discussed in this KEP.)
  - Ensure that any such changes are clearly documented for users.
  - To the extent possible, provide metrics to allow iptables users to easily determine if they are using features that would behave differently in nftables mode.
- Document specific details of the nftables implementation that we want to consider as “API”. In particular, document the high-level behavior that authors of network plugins can rely on. We may also document ways that third parties or administrators can integrate with kube-proxy’s rules at a lower level.
Allow switching from the iptables (or ipvs) mode to nftables, or vice versa, without needing to manually clean up rules in between.
Document the minimum kernel/distro requirements for the new backend.
Document incompatible changes between iptables mode and nftables mode (e.g. localhost NodePorts, firewall handling, etc).
Do performance testing comparing the iptables, ipvs, and nftables backends in small, medium, and large clusters, comparing both the “control plane” aspects (time/CPU usage spent reprogramming rules) and “data plane” aspects (latency and throughput of packets to service IPs).
Help with the clean-up and refactoring of the kube-proxy “library” code.
Although this KEP does not include anything post-GA (e.g., making nftables the default backend, or changing the status of the iptables and/or ipvs backends), we should have at least the start of a plan for the future by the time this KEP goes GA, to ensure that we don’t just end up permanently maintaining 3 backends instead of 2.

Non-Goals

Falling into the same traps as the ipvs backend, to the extent that we can identify what those traps were.
Removing the iptables KUBE-IPTABLES-HINT chain from kubelet; that chain exists for the benefit of any component on the node that wants to use iptables, and so should continue to exist even if no part of the Kubernetes core uses iptables itself. (And there is no need to add anything similar for nftables, since there are no bits of host filesystem configuration related to nftables that containerized nftables users need to worry about.)

And some Non-Goals relative to earlier discussions in this KEP:

Changing the session affinity behavior; the nftables backend will implement the same behavior as iptables does (which is different from ipvs and some third-party proxy implementations). If we decide to revisit session affinity in the future, it will be easy to add or change the nftables backend’s behavior, because it is implemented “manually”.
Implementing LoadBalancerSourceRanges filtering for NodePort (or ExternalIPs) traffic. The kube-proxy implementation of that feature mostly only exists for the pod-to-load balancer short circuit case anyway. Users who want more consistent filtering behavior can use the Gateway API.
Support for running multiple instances of the proxy (along with the service.kubernetes.io/service-proxy-name label). There is now a proof-of-concept of this idea (kubernetes #122814), so we know that the design supports it and it could be implemented in the future.
Explicit support for “debug”/“admin-override” rules. The nftables backend will retain the iptables backend’s behavior of “you can change our rules but your changes will eventually get overwritten”. We may still some day add support for explicit overrides, as discussed below, but this will not be part of the initial release.

Proposal

Notes/Constraints/Caveats

At least three nftables-based kube-proxy implementations already exist, but none of them seems suitable either to adopt directly or to use as a starting point:

kube-nftlb: This is built on top of a separate nftables-based load balancer project called nftlb, which means that rather than translating Kubernetes Services directly into nftables rules, it translates them into nftlb load balancer objects, which then get translated into nftables rules. Besides making the code more confusing for users who aren’t already familiar with nftlb, this also means that in many cases, new Service features would need to have features added to the nftlb core first before kube-nftlb could consume them. (Also, it has not been updated since November 2020.)
nfproxy: Its README notes that “nfproxy is not a 1:1 copy of kube-proxy (iptables) in terms of features. nfproxy is not going to cover all corner cases and special features addressed by kube-proxy”. (Also, it has not been updated since January 2021.)
kpng’s nft backend: This was written as a proof of concept and is mostly a straightforward translation of the iptables rules to nftables, and doesn’t make good use of nftables features that would let it reduce the total number of rules. It also makes heavy use of kpng’s APIs, like “DiffStore”, which there is no consensus about adopting upstream.

Risks and Mitigations

Functionality

The primary risk of the proposal is feature or stability regressions, which will be addressed by testing, and by a slow, optional, rollout of the new proxy mode.

The most important mitigation for this risk is ensuring that rollback from nftables mode back to iptables/ipvs mode works reliably.

Compatibility

Many Kubernetes networking implementations use kube-proxy as their service proxy implementation. Given that few low-level details of kube-proxy’s behavior are explicitly specified, using it as part of a larger networking implementation (and in particular, writing a NetworkPolicy implementation that interoperates with it correctly) necessarily requires making assumptions about (currently-)undocumented aspects of its behavior (such as exactly when and how packets get rewritten).

While the nftables mode is likely to look very similar to the iptables mode from the outside, some CNI plugins, NetworkPolicy implementations, etc, may need updates in order to work with it. (This may further limit the amount of testing the new mode can get during the Alpha phase, if it is not yet compatible with popular network plugins at that point.) There is not much we can do here, other than avoiding gratuitous behavioral differences.

Security

The nftables mode should not pose any new security issues relative to the iptables mode.

Design Details

High level design

At a high level, the new mode should have the same architecture as the existing modes; it will use the service/endpoint-tracking code in k8s.io/kubernetes/pkg/proxy to watch for changes, and update rules in the kernel accordingly.

Low level design

Some details will be figured out as we implement it. We may start with an implementation that is architecturally closer to the iptables mode, and then rewrite it to take advantage of additional nftables features over time.

Tables

Unlike iptables, nftables does not have any reserved/default tables or chains (eg, nat, PREROUTING). Instead, each nftables user is expected to create and work with its own table(s), and to ignore the tables created by other components (for example, when firewalld is running in nftables mode, restarting it only flushes the rules in the firewalld table, unlike when it is running in iptables mode, where restarting it causes it to flush all rules).

Within each table, “base chains” can be connected to “hooks” that give them behavior similar to the built-in iptables chains. (For example, a chain with the properties type nat and hook prerouting would work like the PREROUTING chain in the iptables nat table.) The “priority” of a base chain controls when it runs relative to other chains connected to the same hook in the same or other tables.

An nftables table can only contain rules for a single “family” (ip (v4), ip6, inet (both IPv4 and IPv6), arp, bridge, or netdev). We will create a single kube-proxy table in the ip family, and another in the ip6 family. All of our chains, sets, maps, etc, will go into those tables.

(In theory, instead of creating one table each in the ip and ip6 families, we could create a single table in the inet family and put both IPv4 and IPv6 chains/rules there. However, this wouldn’t really result in much simplification, because we would still need separate sets/maps to match IPv4 addresses and IPv6 addresses. (There is no data type that can store/match either an IPv4 address or an IPv6 address.) Furthermore, because of how Kubernetes Services evolved in parallel with the existing kube-proxy implementation, we have ended up with a dual-stack Service semantics that is most easily implemented by handling IPv4 and IPv6 completely separately anyway.)

Communicating with the kernel nftables subsystem

We will use the nft command-line tool to read and write rules, much like how we use command-line tools in the iptables and ipvs backends.

However, the nft tool is mostly just a thin wrapper around libnftables, so any golang API that wraps the nft command line could easily be rewritten to use libnftables directly (via a cgo wrapper) in the future if that seemed like a better idea. (In theory we could also use netlink directly, without needing cgo or external libraries, but this would probably be a bad idea; libnftables implements quite a bit of functionality on top of the raw netlink API.)

The nftables command-line tool allows either a single command per invocation (as with /sbin/iptables):

$ nft add table ip kube-proxy '{ comment "Kubernetes service proxying rules"; }'
$ nft add chain ip kube-proxy services
$ nft add rule ip kube-proxy services ip daddr . ip protocol . th dport vmap @service_ips

or multiple commands to be executed in a single atomic transaction (as with /sbin/iptables-restore, but more flexible):

$ nft -f - <<EOF
add table ip kube-proxy { comment "Kubernetes service proxying rules"; }
add chain ip kube-proxy services
add rule ip kube-proxy services ip daddr . ip protocol . th dport vmap @service_ips
EOF

The syntax for the two modes is the same, other than the need to escape shell metacharacters in the former case.
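
As a rough illustration of the transaction form, here is a minimal Python sketch (kube-proxy itself is written in Go; the helper names `build_transaction` and `run_nft` are hypothetical and not part of any existing API). It joins individual commands into one script and feeds it to `nft -f -` on stdin, so no shell quoting is involved and the commands are applied as a single transaction.

```python
import subprocess


def build_transaction(commands):
    """Join nft commands into a single script, one command per line."""
    return "".join(cmd.strip() + "\n" for cmd in commands)


def run_nft(script, nft_path="nft"):
    """Apply the script as one atomic transaction via `nft -f -`.

    Needs root privileges and the nft binary. Raises
    subprocess.CalledProcessError if nft rejects the script.
    """
    subprocess.run([nft_path, "-f", "-"], input=script, text=True, check=True)


script = build_transaction([
    'add table ip kube-proxy { comment "Kubernetes service proxying rules"; }',
    "add chain ip kube-proxy services",
    "add rule ip kube-proxy services ip daddr . ip protocol . th dport vmap @service_ips",
])
print(script, end="")
```

The sketch only prints the script. Calling `run_nft(script)` would additionally require the `service_ips` map to exist (or to be created earlier in the same transaction).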

When reading data from the kernel (nft list ...), nft outputs the data in a nested “object” form:

$ nft list table ip kube-proxy
table ip kube-proxy {
  comment "Kubernetes service proxying rules";
  chain services {
    ip daddr . ip protocol . th dport vmap @service_ips
  }
}

(It is possible to pass data to nft -f in this form as well, but this wouldn’t be useful for us, since we would have to pass the entire contents of table ip kube-proxy rather than just adding, removing, and updating the particular rules, sets, etc, that we wanted to change.)

nft also has a JSON API, which would theoretically be a better option for programmatic use than the “plain text” API. Unfortunately, the representation of rules in this mode is vastly different from the representation of rules in “plain text” mode:

$ nft --json list table ip kube-proxy | jq .
...
{
  "rule": {
    "family": "ip",
    "table": "kube-proxy",
    "chain": "services",
    "handle": 19,
    "expr": [
      {
        "vmap": {
          "key": {
            "concat": [
              {
                "payload": {
                  "protocol": "ip",
                  "field": "daddr"
                }
              },
              {
                "payload": {
                  "protocol": "ip",
                  "field": "protocol"
                }
              },
              {
                "payload": {
                  "protocol": "th",
                  "field": "dport"
                }
              }
            ]
          },
          "data": "@service_ips"
        }
      }
    ]
  }
}
...

While it’s clear how this particular rule would be converted back and forth between the two forms, there is no way to be able to map all rules back and forth without having separate code for every rule type. Furthermore, the JSON syntax of individual rules is poorly documented, and essentially all examples on the web (including the nftables wiki, random blog posts, etc) use the non-JSON syntax. So if we used the JSON syntax in kube-proxy, it would make the code harder to understand and to maintain.

As a result, the plan is that for our internal nftables API:

When passing data to nft, we will use the “plain text” API. In particular, this means that all add rule ... commands will use the well-documented plain text rule form.
When reading data back from nft, we will use the JSON API, to ensure that the results are unambiguously parseable (rather than having to make assumptions about the exact whitespace, punctuation, etc, that nft will output in particular cases in the plain text mode).

This means that our internal nftables API would not be able to support reading back rules in a “legible” form. However, this is not expected to be a problem, given that our internal iptables API (pkg/util/iptables) also does not explicitly support this, and it’s not a problem for the iptables backend.
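
To make the "read back via JSON" half of this plan concrete, here is a minimal Python sketch. The sample document is abbreviated and hand-written for the example: the top-level `nftables` array and the `metainfo` entry follow the shape nft normally emits, while the handle values and the helper name `list_objects` are assumptions. It extracts object names and handles without interpreting the rule expressions at all.

```python
import json

# Abbreviated, hand-written sample in the shape of
# `nft --json list table ip kube-proxy` output (handles are made up).
SAMPLE = """
{"nftables": [
  {"metainfo": {"json_schema_version": 1}},
  {"table": {"family": "ip", "name": "kube-proxy", "handle": 1}},
  {"chain": {"family": "ip", "table": "kube-proxy", "name": "services", "handle": 2}},
  {"rule": {"family": "ip", "table": "kube-proxy", "chain": "services",
            "handle": 19, "expr": []}}
]}
"""


def list_objects(nft_json, kind):
    """Return every object of the given kind ("table", "chain", "rule", ...)."""
    doc = json.loads(nft_json)
    return [item[kind] for item in doc["nftables"] if kind in item]


chains = [c["name"] for c in list_objects(SAMPLE, "chain")]
rule_handles = [(r["chain"], r["handle"]) for r in list_objects(SAMPLE, "rule")]
print(chains)
print(rule_handles)
```

Knowing which chains exist and which handle each rule has is enough to delete or replace objects by handle, which is why not being able to read rules back in a "legible" form is tolerable.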

Notes on the sample rules in this KEP

The examples below all show data in the plain text “object” form, but this is just for reader convenience, and does not correspond to either the form we would be writing the data in (the multi-command transaction form) or the form we would be reading it back in (JSON). (Likewise, note that the #-prefixed comments would be ignored by nft and are only there for the benefit of the KEP reader, whereas the comment "..." comments are actual object metadata that would be stored in nftables, as with iptables --comment "...". Every table, chain, set, map, rule, and set/map element can have its own comment, so there is a lot of opportunity for us to make the ruleset self-documenting, if we want to.)

The examples below are also all IPv4-specific, for simplicity. When actually writing out rules for nft, we will need to switch between, e.g., “ip daddr” and “ip6 daddr” appropriately, to match an IPv4 or IPv6 destination address. This will actually be fairly simple because the nft command lets you create “variables” (really constants) and substitute their values into the rules. Thus, we can just always have the rule-generating code write “$IP daddr”, and then pass either “-D IP=ip” or “-D IP=ip6” to nft to fix it up.
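
The idea of writing family-neutral rule text once and specializing it per family can be sketched as follows. Note that this Python sketch swaps in plain string substitution on the generating side (via `string.Template`) rather than nft's own variable handling; the `render` helper is hypothetical, and `meta l4proto` is used instead of `ip protocol` only because it is valid in both families.

```python
from string import Template

# Family-neutral rule text; $IP is filled in per address family.
RULE = Template(
    "add rule $IP kube-proxy services "
    "$IP daddr . meta l4proto . th dport vmap @service_ips"
)


def render(rule, family):
    """Specialize a family-neutral rule for "ip" (IPv4) or "ip6" (IPv6)."""
    if family not in ("ip", "ip6"):
        raise ValueError(f"unsupported family: {family!r}")
    return rule.substitute(IP=family)


for family in ("ip", "ip6"):
    print(render(RULE, family))
```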

The per-service/per-endpoint chain names below use hashed strings to shorten the names, as in the iptables backend (e.g., “svc_4SW47YFZTEDKD3PK”, where that hash was copied out of the existing iptables unit tests and happens to represent “ns4/svc4:p80tcp”). However, it turns out that nftables chain names can be much longer than iptables chain names (256 characters rather than 30), so we ought to be able to create more recognizable chain names in the nftables backend.

The multi-word names in the examples are also inconsistent about the use of underscores vs hyphens; underscores are standard in most nftables documentation, but hyphens are more iptables-kube-proxy-like. We should eventually settle on one or the other.

(Also, most of the examples below have not actually been tested and may have syntax errors. Caveat lector.)

Versioning and compatibility

Since nftables is subject to much more development than iptables has been recently, we will need to pay more attention to kernel and tool versions.

The nft command has a --check option which can be used to check if a command could be run successfully; it parses the input, and then (assuming success), uploads the data to the kernel and asks the kernel to check it (but not actually act on it) as well. Thus, with a few nft --check runs at startup we should be able to confirm what features are known to both the tooling and the kernel.
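
A startup probe along these lines might look like the following Python sketch. The function name `nft_supports`, the injectable `run` parameter, and the probe script are all assumptions made for illustration; only the `--check` and `-f -` options come from nft itself.

```python
import subprocess

# Hypothetical probe: a throwaway table containing a set with a timeout.
# With --check, nothing is actually committed to the kernel.
PROBE = """\
add table ip kube-proxy-probe
add set ip kube-proxy-probe probe_set { type ipv4_addr . inet_service ; flags timeout ; }
"""


def nft_supports(probe_script, run=subprocess.run, nft_path="nft"):
    """Return True if `nft --check` accepts probe_script.

    `run` is injectable so the logic can be exercised without nft installed.
    """
    try:
        result = run(
            [nft_path, "--check", "-f", "-"],
            input=probe_script,
            text=True,
            capture_output=True,
        )
    except FileNotFoundError:
        # No nft binary at all: treat every feature as unsupported.
        return False
    return result.returncode == 0
```

A real implementation would run one probe per optional feature and fall back to simpler rules (or refuse to start) when a probe fails.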

It is not yet clear what the minimum kernel or nft command-line versions needed by the nftables backend will be. The newest feature used in the examples below was added in Linux 5.6, released in March 2020 (though they could be rewritten to not need that feature).

It is possible some users will not be able to upgrade from the iptables and ipvs backends to nftables. (Certainly the nftables backend will not support RHEL 7, which some people are still using Kubernetes with.)

NAT rules

General Service dispatch

For ClusterIP and external IP services, we will use an nftables “verdict map” to store the logic about where to dispatch traffic, based on destination IP, protocol, and port. We will then need only a single actual rule to apply the verdict map to all inbound traffic. (Or it may end up making more sense to have separate verdict maps for ClusterIP, ExternalIP, and LoadBalancer IP?) Either way, service dispatch will be roughly O(1) rather than O(n) as in the iptables backend.

Likewise, for NodePort traffic, we will use a verdict map matching only on destination protocol / port, with the rules set up to only check the nodeports map for packets addressed to a local IP.

map service_ips {
  comment "ClusterIP, ExternalIP and LoadBalancer IP traffic";
  # The "type" clause defines the map's datatype; the key type is to
  # the left of the ":" and the value type to the right. The map key
  # in this case is a concatenation (".") of three values; an IPv4
  # address, a protocol (tcp/udp/sctp), and a port (aka
  # "inet_service"). The map value is a "verdict", which is one of a
  # limited set of nftables actions. In this case, the verdicts are
  # all "goto" statements.
  type ipv4_addr . inet_proto . inet_service : verdict;
  elements = {
    172.30.0.44 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK,
    192.168.99.33 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK,
    ...
  }
}
map service_nodeports {
  comment "NodePort traffic";
  type inet_proto . inet_service : verdict;
  elements = {
    tcp . 3001 : goto svc_4SW47YFZTEDKD3PK,
    ...
  }
}
chain prerouting {
  jump services
  jump nodeports
}
chain services {
  # Construct a key from the destination address, protocol, and port,
  # then look that key up in the `service_ips` vmap and take the
  # associated action if it is found.
  ip daddr . ip protocol . th dport vmap @service_ips
}
chain nodeports {
  # Return if the destination IP is non-local, or if it's localhost.
  fib daddr type != local return
  ip daddr == 127.0.0.1 return
  # If --nodeport-addresses was in use then the above would instead be
  # something like:
  # ip daddr != { 192.168.1.5, 192.168.3.10 } return
  # dispatch on the service_nodeports vmap
  ip protocol . th dport vmap @service_nodeports
}
# Example per-service chain
chain svc_4SW47YFZTEDKD3PK {
  # Send to random endpoint chain using an inline vmap
  numgen random mod 2 vmap {
    0 : goto sep_UKSFD7AGPMPPLUHC,
    1 : goto sep_C6EBXVWJJZMIWKLZ
  }
}
# Example per-endpoint chain
chain sep_UKSFD7AGPMPPLUHC {
  # masquerade hairpin traffic
  ip saddr 10.180.0.4 jump mark_for_masquerade
  # send to selected endpoint
  dnat to 10.180.0.4:8000
}
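
The dispatch logic in these sample rules can be modeled in a few lines of Python, purely as an illustration of why the lookup cost does not grow with the number of Services. This is a userspace model, not how the kernel implements verdict maps; the second endpoint's address and port (10.180.0.5:8000) are assumed for the example.

```python
import random

# Model of the service_ips verdict map: (daddr, proto, dport) -> service chain.
SERVICE_IPS = {
    ("172.30.0.44", "tcp", 80): "svc_4SW47YFZTEDKD3PK",
    ("192.168.99.33", "tcp", 80): "svc_4SW47YFZTEDKD3PK",
}
# Each service chain picks one of its endpoint chains at random.
SERVICE_CHAINS = {
    "svc_4SW47YFZTEDKD3PK": ["sep_UKSFD7AGPMPPLUHC", "sep_C6EBXVWJJZMIWKLZ"],
}
# Each endpoint chain ends in "dnat to <ip>:<port>".
ENDPOINTS = {
    "sep_UKSFD7AGPMPPLUHC": ("10.180.0.4", 8000),
    "sep_C6EBXVWJJZMIWKLZ": ("10.180.0.5", 8000),  # assumed for the example
}


def dispatch(daddr, proto, dport, rng=random):
    """Return the (ip, port) DNAT target, or None if no Service matches."""
    svc = SERVICE_IPS.get((daddr, proto, dport))  # one hashed lookup, like the vmap
    if svc is None:
        return None
    seps = SERVICE_CHAINS[svc]
    sep = seps[rng.randrange(len(seps))]  # models "numgen random mod N"
    return ENDPOINTS[sep]


print(dispatch("172.30.0.44", "tcp", 80))
print(dispatch("10.0.0.1", "tcp", 80))
```

Adding more Services only adds entries to `SERVICE_IPS`; the per-packet work stays one lookup plus one random choice, in contrast to walking a list of per-Service rules.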

Masquerading

The example rules above include

 ip saddr 10.180.0.4 jump mark_for_masquerade

to masquerade hairpin traffic, as in the iptables proxier. This assumes the existence of a mark_for_masquerade chain, not shown.

nftables has the same constraints on DNAT and masquerading as iptables does; you can only DNAT from the “prerouting” (or “output”) stage and you can only masquerade from the “postrouting” stage. Thus, as with iptables, the nftables proxy will have to handle DNAT and masquerading at separate times. One possibility would be to simply copy the existing logic from the iptables proxy, using the packet mark to communicate from the prerouting chains to the postrouting ones.

However, it should be possible to do this in nftables without using the mark or any other externally-visible state; we can just create an nftables set, and use that to communicate information between the chains. Something like:

# Set of 5-tuples of connections that need masquerading
set need_masquerade {
  type ipv4_addr . inet_service . ipv4_addr . inet_service . inet_proto;
  flags timeout ; timeout 5s ;
}
chain mark_for_masquerade {
  update @need_masquerade { ip saddr . th sport . ip daddr . th dport . ip protocol }
}
chain postrouting_do_masquerade {
  # We use "ct original ip daddr" and "ct original proto-dst" here
  # since the packet may have been DNATted by this point.
  ip saddr . th sport . ct original ip daddr . ct original proto-dst . ip protocol @need_masquerade masquerade
}

This is not yet tested, but some kernel nftables developers have confirmed that it ought to work. We should test to make sure that having a potentially-high-churn need_masquerade set will not be a performance problem.
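
The hand-off between the two stages can be modeled in Python as a set whose entries expire, which may help when reasoning about the churn concern. This is only a userspace model of the intended semantics; the `TimeoutSet` class and the two helper functions are hypothetical names, and the clock is injectable so expiry can be exercised without sleeping.

```python
import time


class TimeoutSet:
    """Set whose elements expire `timeout` seconds after their last update."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._expiry = {}

    def update(self, key):
        # Like nft "update": add the element or refresh its timeout.
        self._expiry[key] = self._clock() + self._timeout

    def __contains__(self, key):
        expires = self._expiry.get(key)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expiry[key]  # lazily drop expired entries
            return False
        return True


need_masquerade = TimeoutSet(timeout=5)


def mark_for_masquerade(saddr, sport, daddr, dport, proto):
    """Prerouting side: remember the pre-DNAT 5-tuple."""
    need_masquerade.update((saddr, sport, daddr, dport, proto))


def should_masquerade(saddr, sport, orig_daddr, orig_dport, proto):
    """Postrouting side: match on the original (pre-DNAT) destination."""
    return (saddr, sport, orig_daddr, orig_dport, proto) in need_masquerade
```

Because entries are keyed on the full 5-tuple and live only a few seconds, the set's size tracks the rate of new hairpin connections rather than the number of Services.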

Session affinity

Session affinity can be done in roughly the same way as in the iptables proxy, just using the more general nftables “set” framework rather than the affinity-specific version of sets provided by the iptables recent module. In fact, since nftables allows arbitrary set keys, we can optimize relative to iptables, and only have a single affinity set per service, rather than one per endpoint. (And we also have the flexibility to change the affinity key in the future if we want to, eg to key on source IP+port rather than just source IP.)

set affinity_4SW47YFZTEDKD3PK {
  # Source IP . Destination IP . Destination Port
  type ipv4_addr . ipv4_addr . inet_service;
  flags timeout; timeout 3h;
}
chain svc_4SW47YFZTEDKD3PK {
  # Check for existing session affinity against each endpoint
  ip saddr . 10.180.0.4 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_UKSFD7AGPMPPLUHC
  ip saddr . 10.180.0.5 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_C6EBXVWJJZMIWKLZ
  # Send to random endpoint chain
  numgen random mod 2 vmap {
    0 : goto sep_UKSFD7AGPMPPLUHC,
    1 : goto sep_C6EBXVWJJZMIWKLZ
  }
}
chain sep_UKSFD7AGPMPPLUHC {
  # Mark the source as having affinity for this endpoint
  update @affinity_4SW47YFZTEDKD3PK { ip saddr . 10.180.0.4 . 80 }
  ip saddr 10.180.0.4 jump mark_for_masquerade
  dnat to 10.180.0.4:8000
}
# likewise for other endpoint(s)...
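
To show how the per-service chain above would scale with the number of endpoints, here is a hedged Python sketch that generates its rules as `add rule` commands. The function name `affinity_service_rules` and the tuple layout are assumptions, and the output mirrors the untested sample syntax above rather than verified nft input.

```python
def affinity_service_rules(svc_hash, endpoints):
    """Generate the rules for one service chain with session affinity.

    endpoints is a list of (endpoint_chain, endpoint_ip, port) tuples.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    chain = f"svc_{svc_hash}"
    affinity_set = f"@affinity_{svc_hash}"
    # One affinity check per endpoint, all against the same per-service set.
    rules = [
        f"add rule ip kube-proxy {chain} "
        f"ip saddr . {ip} . {port} {affinity_set} goto {sep}"
        for sep, ip, port in endpoints
    ]
    # Fallback: pick an endpoint chain at random.
    vmap = ", ".join(
        f"{i} : goto {sep}" for i, (sep, _ip, _port) in enumerate(endpoints)
    )
    rules.append(
        f"add rule ip kube-proxy {chain} "
        f"numgen random mod {len(endpoints)} vmap {{ {vmap} }}"
    )
    return rules


for rule in affinity_service_rules(
    "4SW47YFZTEDKD3PK",
    [
        ("sep_UKSFD7AGPMPPLUHC", "10.180.0.4", 80),
        ("sep_C6EBXVWJJZMIWKLZ", "10.180.0.5", 80),
    ],
):
    print(rule)
```

A service with N endpoints yields N + 1 rules and a single affinity set, compared with one `recent` list per endpoint in the iptables backend.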

Filter rules

The iptables mode uses the filter table for three kinds of rules:

Dropping or rejecting packets for services with no endpoints

As with service dispatch, this is easily handled with a verdict map:

map no_endpoint_services {
  type ipv4_addr . inet_proto . inet_service : verdict;
  elements = {
    192.168.99.22 . tcp . 80 : drop,
    172.30.0.46 . tcp . 80 : goto reject_chain,
    1.2.3.4 . tcp . 80 : drop
  }
}
chain filter {
  ...
  ip daddr . ip protocol . th dport vmap @no_endpoint_services
  ...
}
# helper chain needed because "reject" is not a "verdict" and so can't
# be used directly in a verdict map
chain reject_chain {
  reject
}

Dropping traffic rejected by `LoadBalancerSourceRanges`

The implementation of LoadBalancer source ranges will be similar to the ipset-based implementation in the ipvs kube-proxy: we use one set to recognize “traffic that is subject to source ranges”, and then another to recognize “traffic that is accepted by its service’s source ranges”. Traffic which matches the first set but not the second gets dropped:

set firewall {
  comment "destinations that are subject to LoadBalancerSourceRanges";
  type ipv4_addr . inet_proto . inet_service;
}
set firewall_allow {
  comment "destination+sources that are allowed by LoadBalancerSourceRanges";
  type ipv4_addr . inet_proto . inet_service . ipv4_addr;
  # "interval" is needed so that elements can contain CIDR ranges
  flags interval;
}
chain filter {
  ...
  ip daddr . ip protocol . th dport @firewall jump firewall_check
  ...
}
chain firewall_check {
  ip daddr . ip protocol . th dport . ip saddr @firewall_allow return
  drop
}

Where, eg, adding a Service with LoadBalancer IP 10.1.2.3, port 80, and source ranges ["192.168.0.3/32", "192.168.1.0/24"] would result in:

add element ip kube-proxy firewall { 10.1.2.3 . tcp . 80 }
add element ip kube-proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.0.3/32 }
add element ip kube-proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.1.0/24 }
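
The mapping from a Service to these set elements is mechanical, and could be sketched roughly as follows. The function name `firewallElements` and its parameters are hypothetical and chosen only for illustration; the real backend would build a transaction through its nftables library rather than formatting command strings.

```go
package main

import (
	"fmt"
	"strings"
)

// firewallElements returns the "add element" commands for one load balancer
// IP/port: one entry in the "firewall" set for the destination, plus one
// entry in "firewall_allow" per permitted source range.
func firewallElements(family, table, lbIP, proto string, port int, sourceRanges []string) []string {
	dest := fmt.Sprintf("%s . %s . %d", lbIP, proto, port)
	cmds := []string{
		fmt.Sprintf("add element %s %s firewall { %s }", family, table, dest),
	}
	for _, cidr := range sourceRanges {
		cmds = append(cmds,
			fmt.Sprintf("add element %s %s firewall_allow { %s . %s }", family, table, dest, cidr))
	}
	return cmds
}

func main() {
	cmds := firewallElements("ip", "kube-proxy", "10.1.2.3", "tcp", 80,
		[]string{"192.168.0.3/32", "192.168.1.0/24"})
	fmt.Println(strings.Join(cmds, "\n"))
}
```

Run as-is, this prints the same three commands as the example above.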

Forcing traffic on `HealthCheckNodePort`s to be accepted

The iptables mode adds rules to ensure that traffic to NodePort services’ health check ports is allowed through the firewall. eg:

-A KUBE-NODEPORTS -m comment --comment "ns2/svc2:p80 health check node port" -m tcp -p tcp --dport 30000 -j ACCEPT

(There are also rules to accept any traffic that has already been tagged by conntrack.)

This cannot be done reliably in nftables; the semantics of accept (or -j ACCEPT in iptables) is to end processing of the current table. In iptables, this effectively guarantees that the packet is accepted (since -j ACCEPT is mostly only used in the filter table), but in nftables, it is still possible that someone would later call drop on the packet from another table, causing it to be dropped. There is no way to reliably “sneak behind the firewall’s back” like you can in iptables; if an nftables-based firewall is dropping kube-proxy’s packets, then you need to actually configure that firewall to accept them instead.

However, this firewall-bypassing behavior is somewhat legacy anyway; the iptables proxy is able to bypass a local firewall, but has no ability to bypass a firewall implemented at the cloud network layer, which is perhaps a more common configuration these days anyway. Administrators using non-local firewalls are already required to configure those firewalls correctly to allow Kubernetes traffic through, and it is reasonable for us to just extend that requirement to administrators using local firewalls as well.

Thus, the nftables backend will not attempt to replicate these iptables-backend rules.

Future improvements

Further improvements are likely possible.

For example, it would be nice to not need a separate “hairpin” check for every endpoint. There is no way to ask directly “does this packet have the same source and destination IP?”, but the proof-of-concept kpng nftables backend does this instead:

set hairpin {
    type ipv4_addr . ipv4_addr;
    elements = {
        10.180.0.4 . 10.180.0.4,
        10.180.0.5 . 10.180.0.5,
        ...
    }
}
chain ... {
    ...
    ip saddr . ip daddr @hairpin jump mark_for_masquerade
}

More efficiently, if nftables eventually got the ability to call eBPF programs as part of rule processing (like iptables’s -m ebpf) then we could write a trivial eBPF program to check “source IP equals destination IP” and then call that rather than needing the giant set of redundant IPs.

If we do this, then we don’t need the per-endpoint hairpin check rules. If we could also get rid of the per-endpoint affinity-updating rules, then we could get rid of the per-endpoint chains entirely, since dnat to ... is an allowed vmap verdict:

chain svc_4SW47YFZTEDKD3PK {
    # FIXME handle affinity somehow
    # Send to random endpoint
    numgen random mod 2 vmap {
        0 : dnat to 10.180.0.4:8000,
        1 : dnat to 10.180.0.5:8000
    }
}

With the current set of nftables functionality, it does not seem possible to do this (in the case where affinity is in use), but future features may make it possible.

It is not yet clear what the tradeoffs of such rewrites are, either in terms of runtime performance, or of admin/developer-comprehensibility of the ruleset.

Changes from the iptables kube-proxy backend

Switching to a new backend which people will have to opt into gives us the chance to break backward-compatibility in various places where we don’t like the current iptables kube-proxy behavior.

However, if we intend to eventually make the nftables mode the default, then differences from iptables mode will be more of a problem, so we should limit these changes to cases where the benefit outweighs the cost.

Localhost NodePorts

Kube-proxy in iptables mode supports NodePorts on 127.0.0.1 (for IPv4 services) by default. (Kube-proxy in ipvs mode does not support this, and neither mode supports localhost NodePorts for IPv6 services, although userspace mode did, in single-stack IPv6 clusters.)

Localhost NodePort traffic does not work cleanly with a DNAT-based approach to NodePorts, because moving a localhost packet to a network interface other than lo causes the kernel to consider it “martian” and refuse to route it. There are various ways around this problem:

The userspace approach: Proxy packets in userspace rather than redirecting them with DNAT. (The userspace proxy did this for all IPs; the fact that localhost NodePorts worked with the userspace proxy was a coincidence, not an explicitly-intended feature).
The iptables approach: Enable the route_localnet sysctl, which tells the kernel to never consider IPv4 loopback addresses to be “martian”, so that DNAT works. This only works for IPv4; there is no corresponding sysctl for IPv6. Unfortunately, enabling this sysctl opens security holes (CVE-2020-8558), which kube-proxy then needs to try to close. It does so by creating iptables rules to block all the packets that route_localnet would have blocked, except for the ones we want. (This assumes that the administrator didn’t also change certain other sysctls that might have been safe to change had we not set route_localnet, and according to some reports it may block legitimate traffic in some configurations.)
The Cilium approach: Intercept the connect(2) call with eBPF and rewrite the destination IP there, so that the network stack never actually sees a packet with destination 127.0.0.1 / ::1. (As in the userspace kube-proxy case, this is not a special-case for localhost, it’s just how Cilium does service proxying.)
If you control the client, you can explicitly bind the socket to 127.0.0.1 / ::1 before connecting. (I’m not sure why this works since the packet still eventually gets routed off lo.) It doesn’t seem to be possible to “spoof” this after the socket is created, though as with the previous case, you could do this by intercepting syscalls with eBPF.

In discussions about this feature, only one real use case has been presented: it allows you to run a docker registry in a pod and then have nodes use a NodePort service via 127.0.0.1 to access that registry. Docker treats 127.0.0.1 as an “insecure registry” by default (though containerd and cri-o do not) and so does not require TLS authentication in this case; using any other IP would require setting up TLS certificates, making the deployment more complicated. (In other words, this is basically an intentional exploitation of the security hole that CVE-2020-8558 warns about: enabling route_localnet may allow someone to access a service that doesn’t require authentication because it assumed it was only accessible to localhost.)

In all other cases, it is generally possible (though not always convenient) to just rewrite things to use the node IP rather than localhost (or to use a ClusterIP rather than a NodePort). Indeed, since localhost NodePorts do not work with ipvs mode or with IPv6, many places that used to use NodePorts on 127.0.0.1 have already been rewritten to not do so (eg contiv/vpp#1434).

So:

There is no way to make IPv6 localhost NodePorts work with a NAT-based solution.
The way to make IPv4 localhost NodePorts work with NAT introduces a security hole, and we don’t necessarily have a fully-generic way to mitigate it.
The only commonly-argued-for use case for the feature involves deploying a service in a configuration which its own documentation describes as insecure and “only appropriate for testing”.
- The use case in question works by default against cri-dockerd but not against containerd or cri-o with their default configurations.
- cri-dockerd, containerd, and cri-o all allow additional “insecure registry” IPs/CIDRs to be configured, so an administrator could configure them to allow non-TLS image pulling against a ClusterIP.

Given this, I think we should not try to support localhost NodePorts in the nftables backend.

NodePort Addresses

In addition to the localhost issue, iptables kube-proxy defaults to accepting NodePort connections on all local IPs, which has effects varying from intended-but-unexpected (“why can people connect to NodePort services from the management network?”) to clearly-just-wrong (“why can people connect to NodePort services on LoadBalancer IPs?”)

The nftables proxy should default to only opening NodePorts on a single interface, probably the interface with the default route. (Ideally, you really want it to accept NodePorts on the interface that holds the route to the cloud load balancers, but we don’t necessarily know what that is ahead of time.) Admins can use --nodeport-addresses to override this.

Behavior of service IPs

Traffic to invalid ports on active cluster IPs will be rejected by the nftables proxy. If the MultiServiceCIDRAllocator feature gate is enabled, it will additionally drop traffic to unassigned cluster IPs.

Defining an API for integration with admin/debug/third-party rules

Administrators sometimes want to add rules to log or drop certain packets. Kube-proxy makes this difficult because it is constantly rewriting its rules, making it likely that admin-added rules will be deleted shortly after being added.

Likewise, external components (eg, NetworkPolicy implementations) may want to write rules that integrate with kube-proxy’s rules in well-defined ways.

The existing kube-proxy modes do not provide any explicit “API” for integrating with them, although certain implementation details of the iptables backend in particular (e.g. the fact that service IPs in packets are rewritten to endpoint IPs during iptables’s PREROUTING phase, and that masquerading will not happen before POSTROUTING) are effectively API, in that we know that changing them would result in significant ecosystem breakage.

We should provide a stronger definition of these larger-scale “black box” guarantees in the nftables backend. NFTables makes this easier than iptables in some ways, because each application is expected to create their own table, and not interfere with anyone else’s tables. If we document the priority values we use to connect to each nftables hook, then admins and third party developers should be able to reliably process packets before or after kube-proxy, without needing to modify kube-proxy’s chains/rules. (As of 1.33, this is now documented.)

In cases where administrators want to insert rules into the middle of particular service or endpoint chains, we would have the same problem that the iptables backend has, which is that it would be difficult for us to avoid accidentally overwriting them when we update rules. Additionally, we want to preserve our ability to redesign the rules later to take better advantage of nftables features, which would be impossible to do if we were officially allowing users to modify the existing rules.

One possibility would be to add “admin override” vmaps that are normally empty but which admins could add jump/goto rules to for specific services to augment/bypass the normal service processing. It probably makes sense to leave these out initially and see if people actually do need them, or if creating rules in another table is sufficient.

Rule monitoring

Given the constraints of the iptables API, it would be extremely inefficient to do a controller loop in the “standard” style:

for {
    desired := getDesiredState()
    current := getCurrentState()
    makeChanges(desired, current)
}

(In particular, the combination of “getCurrentState” and “makeChanges” is slower than just skipping the “getCurrentState” and rewriting everything from scratch every time.)

In the past, the iptables backend did rewrite everything from scratch every time:

for {
    desired := getDesiredState()
    makeChanges(desired, nil)
}

but KEP-3453 “Minimizing iptables-restore input size” changed this, to improve performance:

for {
    desired := getDesiredState()
    predicted := getPredictedState()
    if err := makeChanges(desired, predicted); err != nil {
        makeChanges(desired, nil)
    }
}

That is, it makes incremental updates under the assumption that the current state is correct, but if an update fails (e.g. because it assumes the existence of a chain that didn’t exist), kube-proxy falls back to doing a full rewrite. (It also eventually falls back to a full update after enough time passes.)
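
A self-contained sketch of one iteration of that last loop, run against a fake in-memory backend, might look like the following. Everything here is hypothetical and for illustration only: the `ruleset` and `kernel` types and the `syncOnce` function are not kube-proxy APIs, and the fake backend validates the whole update before applying it to mimic an atomic restore.

```go
package main

import "fmt"

// ruleset maps a chain name to its rules; a simplified stand-in for real state.
type ruleset map[string]string

// kernel is a fake backend holding the rules that are actually installed.
type kernel struct{ installed ruleset }

// applyIncremental sends only the differences between predicted and desired.
// It fails, changing nothing, if it is asked to delete a chain that no
// longer exists.
func (k *kernel) applyIncremental(desired, predicted ruleset) error {
	for name := range predicted {
		if _, wanted := desired[name]; wanted {
			continue
		}
		if _, exists := k.installed[name]; !exists {
			return fmt.Errorf("cannot delete chain %q: it does not exist", name)
		}
	}
	for name := range predicted {
		if _, wanted := desired[name]; !wanted {
			delete(k.installed, name)
		}
	}
	for name, rules := range desired {
		if old, ok := predicted[name]; !ok || old != rules {
			k.installed[name] = rules
		}
	}
	return nil
}

// applyFull rewrites everything from scratch.
func (k *kernel) applyFull(desired ruleset) {
	k.installed = ruleset{}
	for name, rules := range desired {
		k.installed[name] = rules
	}
}

// syncOnce tries an incremental update first and falls back to a full
// rewrite on failure. It reports whether the full rewrite was needed.
func syncOnce(k *kernel, desired, predicted ruleset) bool {
	if predicted != nil {
		if err := k.applyIncremental(desired, predicted); err == nil {
			return false
		}
	}
	k.applyFull(desired)
	return true
}

func main() {
	k := &kernel{installed: ruleset{"svc-a": "rules-1", "svc-b": "rules-2"}}
	predicted := ruleset{"svc-a": "rules-1", "svc-b": "rules-2"}
	delete(k.installed, "svc-b") // something else removed the chain behind our back
	desired := ruleset{"svc-a": "rules-1"}
	fmt.Println("full rewrite needed:", syncOnce(k, desired, predicted))
	fmt.Println("installed chains:", len(k.installed))
}
```

The point of the pattern is that the common case never reads the current state back from the kernel; the cost of a wrong prediction is one failed update followed by one full rewrite.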

Proxies based on iptables have also historically had the problem that system processes (particularly firewall implementations) would sometimes flush all iptables rules and restart with a clean state, thus completely breaking kube-proxy. The initial solution for this problem was to just recreate all iptables rules every 30 seconds even if no services/endpoints had changed. Later this was changed to create a single “canary” chain, and check every 30 seconds that the canary had not been deleted, and only recreate everything from scratch if the canary disappears.
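
The canary approach can be sketched as follows, using an in-memory stand-in for the iptables state. The `fakeTables` type and `checkCanary` function are hypothetical, and the chain names are used here purely for illustration; a real implementation would query iptables and call the check from a 30-second timer.

```go
package main

import "fmt"

// canaryChain is an otherwise-unused chain whose only job is to disappear
// if someone flushes the ruleset.
const canaryChain = "KUBE-PROXY-CANARY"

// fakeTables is an in-memory stand-in for the node's iptables state.
type fakeTables struct{ chains map[string]bool }

func (f *fakeTables) chainExists(name string) bool { return f.chains[name] }

// flush models an external component wiping all rules.
func (f *fakeTables) flush() { f.chains = map[string]bool{} }

// fullResync recreates the canary and all service chains from scratch.
func (f *fakeTables) fullResync(serviceChains []string) {
	f.chains[canaryChain] = true
	for _, c := range serviceChains {
		f.chains[c] = true
	}
}

// checkCanary performs one poll. It does nothing if the canary is still
// present, and otherwise recreates everything. It reports whether a full
// resync was needed.
func checkCanary(f *fakeTables, serviceChains []string) bool {
	if f.chainExists(canaryChain) {
		return false
	}
	f.fullResync(serviceChains)
	return true
}

func main() {
	tables := &fakeTables{chains: map[string]bool{}}
	services := []string{"KUBE-SVC-EXAMPLE"}
	fmt.Println("initial sync:", checkCanary(tables, services))
	fmt.Println("nothing changed:", checkCanary(tables, services))
	tables.flush()
	fmt.Println("after external flush:", checkCanary(tables, services))
}
```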

NFTables provides a way to monitor for changes without doing polling; you can keep a netlink socket open to the kernel (or a pipe open to an nft monitor process) and receive notifications when particular kinds of nftables objects are created or destroyed.

However, the “everyone uses their own table” design of nftables means that this should not be necessary. IPTables-based firewall implementations flush all iptables rules because everyone’s iptables rules are all mixed together and it’s hard to do otherwise. But in nftables, a firewall ought to only flush its own table when restarting, and leave everyone else’s tables untouched. In particular, firewalld works this way when using nftables. We will need to see what other firewall implementations do.

Switching between kube-proxy modes

In the past, kube-proxy attempted to allow users to switch between the userspace and iptables modes (and later the ipvs mode) by just restarting kube-proxy with the new arguments. Each mode would attempt to clean up the iptables rules used by the other modes on startup.

Unfortunately, this didn’t work well because the three modes all used some of the same iptables chains. For example, when kube-proxy started up in iptables mode, it would try to delete the userspace rules, but this would end up deleting rules that had been created by iptables mode too. This meant that any time you restarted kube-proxy, it would immediately delete some of its rules and be in a broken state until it managed to re-sync from the apiserver. So this code was removed with KEP-2448.

However, the same problem would not apply when switching between an iptables-based mode and an nftables-based mode; it should be safe to delete all iptables and ipvs rules when starting kube-proxy in nftables mode, and to delete all nftables rules when starting kube-proxy in iptables or ipvs mode. This will make it easier for users to switch between modes.

Since rollback from nftables mode is most important when the nftables mode is not actually working correctly, we should do our best to make sure that the cleanup code that runs when rolling back to iptables/ipvs mode is likely to work correctly even if the rest of the nftables code is broken. To that end, we can have it simply run nft directly, bypassing the abstractions used by the rest of the code. Since our rules will be isolated to our own tables, all we need to do to clean up all of our rules is:

nft delete table ip kube-proxy
nft delete table ip6 kube-proxy
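
A cleanup path that shells out to nft directly could be as small as the sketch below. It assumes the nft binary is on the PATH and that the process has sufficient privileges; the helper name `cleanupCommands` is hypothetical. A table that is already absent also makes nft exit non-zero, so the sketch reports failures without treating them as fatal.

```go
package main

import (
	"fmt"
	"os/exec"
)

// cleanupCommands returns the nft invocations needed to remove kube-proxy's
// tables for both IP families.
func cleanupCommands() [][]string {
	var cmds [][]string
	for _, family := range []string{"ip", "ip6"} {
		cmds = append(cmds, []string{"nft", "delete", "table", family, "kube-proxy"})
	}
	return cmds
}

func main() {
	for _, argv := range cleanupCommands() {
		out, err := exec.Command(argv[0], argv[1:]...).CombinedOutput()
		if err != nil {
			// Reported but not fatal: the table may simply not exist.
			fmt.Printf("%v failed: %v: %s\n", argv, err, out)
		}
	}
}
```

Keeping this path free of the abstractions used elsewhere means it has almost nothing that can break when the rest of the nftables code does.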

In fact, this is simple enough that we could document it explicitly as something administrators could do if they run into problems while rolling back.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None. (We had considered refactoring the iptables unit tests to make it possible to share the same tests between the two backends, but we ended up just copying them instead.)

Unit tests

We will add unit tests for the nftables mode that are equivalent to the ones for the iptables mode. In particular, we will port over the tests that feed Services and EndpointSlices into the proxy engine, dump the generated ruleset, and then mock running packets through the ruleset to determine how they would behave.

Since virtually all of the new code will be in a new directory, there should not be any large changes either way to the test coverage percentages in any existing directories.

As of 2023-09-22, pkg/proxy/iptables has 70.6% code coverage in its unit tests. For Alpha, we will have comparable coverage for nftables. However, since the nftables implementation is new, and more likely to have bugs than the older, widely-used iptables implementation, we will also add additional unit tests before Beta.

k8s.io/kubernetes/pkg/proxy/nftables: 2024-05-24 - 74.7%

for comparison:

k8s.io/kubernetes/pkg/proxy/iptables: 2024-05-24 - 68.4%
k8s.io/kubernetes/pkg/proxy/ipvs: 2024-05-24 - 60.9%

Integration tests

Kube-proxy does not have integration tests.

e2e tests

Most of the e2e testing of kube-proxy is backend-agnostic. Initially, we will need a separate e2e job to test the nftables mode (like we do with ipvs). Eventually, if nftables becomes the default, then this would be flipped around to having a legacy “iptables” job.

The test “[It should recreate its iptables rules if they are deleted]” tests (a) that kubelet recreates KUBE-IPTABLES-HINT if it is deleted, and (b) that deleting all KUBE-* iptables rules does not cause services to be broken forever. The latter part is obviously a no-op under nftables kube-proxy, but we can run it anyway. (We are currently assuming that we will not need an nftables version of this test, since the problem of one component deleting another component’s rules should not exist with nftables.)

(Though not directly related to kube-proxy, there are also other e2e tests that use iptables which should eventually be ported to nftables; notably, the ones using TestUnderTemporaryNetworkFailure.)

For the most part, we should not need to add any nftables-specific e2e tests; the nftables backend’s job is just to implement the Service proxy API to the same specifications as the other backends do, so the existing e2e tests already cover everything relevant. The only exception to this is in cases where we change default behavior from the iptables backend, in which case we may need new tests for the different behavior.

We will eventually need e2e tests for switching between iptables and nftables mode in an existing cluster.

Scalability & Performance tests

We have an nftables scalability job. Initial performance is fine; we have not done a lot of further testing/improvement yet.

Graduation Criteria

Alpha

kube-proxy --proxy-mode nftables available behind a feature gate
nftables mode has unit test parity with iptables
An nftables-mode e2e job exists, and passes
Documentation describes any changes in behavior between the iptables and ipvs modes and the nftables mode.
Documentation explains how to manually clean up nftables rules in case things go very wrong.

Beta

At least two releases since Alpha.
The nftables mode has seen at least a bit of real-world usage. (Yes; we’ve gotten bug reports and PRs from users experimenting with it.)
No major outstanding bugs.
nftables mode has better unit test coverage than iptables mode (currently) has. (It is possible that we will end up adding equivalent unit tests to the iptables backend in the process.)
A “kube-proxy mode-switching” e2e job exists, to confirm that you can redeploy kube-proxy in a different mode in an existing cluster. Rollback is confirmed to be reliable.
An nftables e2e periodic perf/scale job exists, and shows performance as good as iptables and ipvs.
Documentation describes any changes in behavior between the iptables and ipvs modes and the nftables mode. Any warnings that we have decided to add for iptables users using functionality that behaves differently in nftables have been added.

GA

At least two releases since Beta.
The nftables mode has seen non-trivial real-world usage.
The nftables mode has no bugs / regressions that would make us hesitate to recommend it.
We have at least the start of a plan for the next steps (changing the default mode, deprecating the old backends, etc).
No UNRESOLVED sections in the KEP. (In particular, we have figured out what sort of “API” we will offer for integrating third-party nftables rules.)

Upgrade / Downgrade Strategy

The new mode should not introduce any upgrade/downgrade problems, excepting that you can’t downgrade or feature-disable a cluster using the new kube-proxy mode without switching it back to iptables or ipvs first. (The older kube-proxy would refuse to start if given --proxy-mode nftables, and wouldn’t know how to clean up stale nftables service rules if any were present.)

When rolling out or rolling back the feature, it should be safe to enable the feature gate and change the configuration at the same time, since nothing cares about the feature gate except for kube-proxy itself. Likewise, it is expected to be safe to roll out the feature in a live cluster, even though this will result in different proxy modes running on different nodes, because Kubernetes service proxying is defined in such a way that no node needs to be aware of the implementation details of the service proxy implementation on any other node.

Version Skew Strategy

The feature is isolated to kube-proxy and does not introduce any API changes, so the versions of other components do not matter.

Kube-proxy has no problems skewing with different versions of itself across different nodes, because Kubernetes service proxying is defined in such a way that no node needs to be aware of the implementation details of the service proxy implementation on any other node.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

The administrator must enable the feature gate to make the feature available, and then must run kube-proxy with the --proxy-mode=nftables flag.

Feature gate (also fill in values in kep.yaml)
- Feature gate name: NFTablesProxyMode
- Components depending on the feature gate:
  - kube-proxy
Other
- Describe the mechanism:
  - kube-proxy must be restarted with the new --proxy-mode.
- Will enabling / disabling the feature require downtime of the control plane?
  - No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
  - No

Does enabling the feature change any default behavior?

Enabling the feature gate does not change any behavior; it just makes the --proxy-mode=nftables option available.

Switching from --proxy-mode=iptables or --proxy-mode=ipvs to --proxy-mode=nftables will likely change some behavior, depending on what we decide to do about certain un-loved kube-proxy features like localhost nodeports. Whatever differences in behavior exist will be explained clearly by the documentation; this is no different from users switching from iptables to ipvs, which initially did not have feature parity with iptables.

(Assuming we eventually make nftables the default, then differences in behavior from iptables will be more important, but making it the default is not part of this KEP.)

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, though it is necessary to clean up the nftables rules that were created, or they will continue to intercept service traffic. In any normal case, this should happen automatically when restarting kube-proxy in iptables or ipvs mode. However, that assumes the user is rolling back to a version of kube-proxy that has at least the Alpha nftables code (1.29+). If the user wants to roll back the cluster to a version of Kubernetes that doesn’t have the nftables kube-proxy code (i.e., rolling back from Alpha to Pre-Alpha), or if they are rolling back to an external service proxy implementation (e.g., kpng), then they would need to make sure that the nftables rules got cleaned up before they rolled back, or else clean them up manually. (We document how to do this.)

(By the time we are considering making the nftables backend the default in the future, the feature will have existed and been GA for several releases, so at that point, rollback (to another version of kube-proxy) would always be to a version that still supports nftables and can properly clean up from it.)

What happens if we reenable the feature if it was previously rolled back?

It should just work.

Are there any tests for feature enablement/disablement?

The actual feature gate enablement/disablement itself is not interesting, since it only controls whether --proxy-mode nftables can be selected.

We will need an e2e test of switching a node from iptables (or ipvs) mode to nftables, and vice versa. The Graduation Criteria currently list this e2e test as being a criterion for Beta, not Alpha, since we don’t really expect people to be switching their existing clusters over to an Alpha version of kube-proxy anyway.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Simply enabling the feature (or upgrading to the release where it is Beta) has no effect. Admins must explicitly choose to switch to the new backend.

Switching to the new backend can’t really “fail”, other than in the case of bugs, which could have results ranging from “almost unnoticeable” to “utterly catastrophic”. Such a failure would almost certainly impact already running workloads. However, each node must be switched over to the new backend independently, so any especially bad failure would likely be noticed after switching over the first node, and could be rolled back at that point.

Rollback should not be able to fail unless there are bugs in the nftables cleanup code, which is very very simple.

What specific metrics should inform a rollback?

If sync_proxy_rules_nftables_sync_failures_total is growing, that indicates that something is going wrong, and kube-proxy logs may provide more information.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Tested by hand:

Start kube-proxy in iptables mode.
Confirm (via iptables-save) that iptables rules exist for Services.
Kill kube-proxy.
Start kube-proxy in nftables mode.
Confirm (via iptables-save) that iptables rules for Services no longer exist. (There will still be a handful of iptables chains left over, but nothing that actually affects the behavior of services.)
Confirm (via nft list ruleset) that nftables rules for Services exist.
Kill kube-proxy.
Start kube-proxy in iptables mode again.
Confirm (via iptables-save) that iptables rules exist for Services.
Confirm (via nft list ruleset) that the kube-proxy table (or tables, if dual-stack) has been deleted.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

The new backend is not 100% compatible with the iptables backend. This will be documented, and there are new metrics in the iptables backend that can help users figure out if they are depending on features that aren’t implemented or that work differently in nftables.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The operator is the one who would enable the feature, and they would know it is in use by looking at the kube-proxy configuration.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details: If Services still work then the feature is working

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

For Beta, the goal is for the network programming latency to be equivalent to the old, pre-KEP-3453 iptables performance (because the current code is not yet heavily optimized).

For GA, the goal was for it to be at least as good as the current iptables performance.

In fact, we never got entirely clear measurements of this, because the iptables-based 1000 node perf/scale test still uses minSyncPeriod: 10s, while the nftables-based one does not. However, the nftables performance is quite satisfactory (and the fact that it is able to have satisfactory performance without using minSyncPeriod is also a major win).

Meanwhile, nftables data plane performance is substantially better than iptables.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

sync_proxy_rules_nftables_sync_failures_total indicates the number of failed syncs; if this number is growing, it indicates the backend is failing in some way.

The various generic kube-proxy metrics like network_programming_duration_seconds and sync_proxy_rules_duration_seconds also exist, and can be used to check that changes are being processed promptly, and that individual syncs are taking a reasonable amount of time, respectively.

It’s not clear yet what sort of nftables-specific metrics will be interesting. For example, in the iptables backend we have sync_proxy_rules_iptables_total, which tells you the total number of iptables rules kube-proxy has programmed. But the equivalent metric in the nftables backend is not going to be as interesting, because many of the things that are done with rules in the iptables backend will be done with maps and sets in the nftables backend. Likewise, just tallying “total number of rules and set/map elements” is not likely to be useful, because the entire point of sets and maps is that they have more-or-less O(1) behavior, so knowing the number of elements is not going to give you much information about how well the system is likely to be performing.

(Update while going to GA: it’s still not clear. We have not found ourselves wanting any additional metrics, nor have we received any requests for additional metrics.)

Metrics
- Metric names:
  - network_programming_duration_seconds (already exists)
  - sync_proxy_rules_last_queued_timestamp_seconds (already exists)
  - sync_proxy_rules_last_timestamp_seconds (already exists)
  - sync_proxy_rules_duration_seconds (already exists)
  - sync_proxy_rules_nftables_sync_failures_total
  - sync_proxy_rules_nftables_cleanup_failures_total
- Components exposing the metric:
  - kube-proxy

Are there any missing metrics that would be useful to have to improve observability of this feature?

We have now added some metrics to the iptables mode (e.g., kubeproxy_iptables_ct_state_invalid_dropped_packets_total), allowing users to be aware of whether they are depending on features that work differently in the nftables backend, to help users decide whether they can migrate to nftables, and whether they need any non-standard configuration in order to do so.

Dependencies

Does this feature depend on any specific services running in the cluster?

It may require a newer kernel than some current users have. It does not depend on anything else in the cluster.

Scalability

Will enabling / using this feature result in any new API calls?

No. kube-proxy is still using the same Service/EndpointSlice-monitoring code, it is just doing different things locally with the results.

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

It is not expected to…

We do not currently have any apples-to-apples comparisons; the nftables perf job uses more CPU than the corresponding iptables job, but this is because it doesn’t run with minSyncPeriod: 10s like the iptables job does, and so it syncs rule changes more often. (However, the fact that it is able to do so without the cluster falling over is a strong indication that it is more efficient.)

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The same way that kube-proxy currently does; updates stop being processed until the apiserver is available again.

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Initial proposal: 2023-02-01
Merged: 2023-10-06
Updates for beta: 2024-05-24
Updates for GA: 2025-01-15

Drawbacks

Adding a new officially-supported kube-proxy implementation implies more work for SIG Network (especially if we are not able to deprecate either of the existing backends soon).

Replacing the default kube-proxy implementation will affect many users.

However, doing nothing would result in a situation where, eventually, many users would be unable to use the default proxy implementation.

Alternatives

Continue to improve the `iptables` mode

We have made many improvements to the iptables mode, and could make more. In particular, we could make the iptables mode use IP sets like the ipvs mode does.

However, even if we could solve literally all of the performance problems with the iptables mode, there is still the looming deprecation issue.

Fix up the `ipvs` mode

Rather than implementing an entirely new nftables kube-proxy mode, we could try to fix up the existing ipvs mode.

However, the ipvs mode makes extensive use of the iptables API in addition to the IPVS API. So while it solves the performance problems with the iptables mode, it does not address the deprecation issue. So we would at least have to rewrite it to be IPVS+nftables rather than IPVS+iptables.

Use an existing nftables-based kube-proxy implementation

Discussed in Notes/Constraints/Caveats.

Create an eBPF-based proxy implementation

Another possibility would be to try to replace the iptables and ipvs modes with an eBPF-based proxy backend, instead of an nftables one. eBPF is very trendy, but it is also notoriously difficult to work with.

One problem with this approach is that the APIs to access conntrack information from eBPF programs only exist in the very newest kernels. In particular, the API for NATting a connection from eBPF was only added in the recently-released 6.1 kernel. It will be a long time before a majority of Kubernetes users have a kernel new enough that we can depend on that API.

Thus, an eBPF-based kube-proxy implementation would initially need a number of workarounds for missing functionality, adding to its complexity (and potentially forcing architectural choices that would not otherwise be necessary, to support the workarounds).

One interesting eBPF-based approach for service proxying is to use eBPF to intercept the connect() call in pods, and rewrite the destination IP before the packets are even sent. In this case, eBPF conntrack support is not needed (though it would still be needed for non-local service connections, such as connections via NodePorts). One nice feature of this approach is that it integrates well with possible future “multi-network Service” ideas, in which a pod might connect to a service IP that resolves to an IP on a secondary network which is only reachable by certain pods. In the case of a “normal” service proxy that does destination IP rewriting in the host network namespace, this would result in a packet that was undeliverable (because the host network namespace has no route to the isolated secondary pod network), but a service proxy that does connect()-time rewriting would rewrite the connection before it ever left the pod network namespace, allowing the connection to proceed.

The multi-network effort is still in the very early stages, and it is not clear that it will actually adopt a model of multi-network Services that works this way. (It is also possible to make such a model work with a mostly-host-network-based proxy implementation; it’s just more complicated.)

Resources: Add AppArmor Support

Add AppArmor Support

Release Signoff Checklist
Summary
Motivation
Proposal
- API
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Syncing fields & annotations on workload resources

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
(R) Graduation criteria is in place
(R) Production readiness review completed
Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This is a proposal to add AppArmor support to the Kubernetes API.

For GA graduation, this proposal aims to do the bare minimum to clean up the feature from its beta release, without blocking future enhancements.

Motivation

AppArmor can enable users to run a more secure deployment, and/or provide better auditing and monitoring of their systems. AppArmor should be supported to provide users an alternative to SELinux, and provide an interface for users that are already maintaining a set of AppArmor profiles.

Background

Kubernetes AppArmor support predates most of our current feature lifecycle practices, including the KEP process. This KEP is backfilling for current AppArmor support. For the original AppArmor proposal, see https://github.com/kubernetes/design-proposals-archive/blob/main/auth/apparmor.md .

This KEP is proposing a minimal path to GA, per the no perma-Beta requirement. This feature graduation closely parallels that of Seccomp. The notable exception is that the AppArmor annotations are immutable on pods, which simplifies the migration. AppArmor is also feature gated, via the AppArmor gate.

Goals

Allow running Pods with AppArmor confinement

Non-Goals

This KEP proposes the absolute minimum to provide generally available AppArmor confinement for Pods and their containers. Further functional enhancements are out of scope, including:

Defining any standard “Kubernetes branded” AppArmor profiles
Formally specifying the AppArmor profile format in Kubernetes
Providing mechanisms for defining custom profiles using the Kubernetes API, or for loading profiles from outside of the node.
Windows support

Proposal

Add a new field to the Pod API that allows defining the AppArmor profile. The new field should be part of the security context.

API

Pods and PodTemplate will include an appArmorProfile field that you can set either for a Pod’s security context or for an individual container. If AppArmor options are defined at both the pod and container level, the container-level options override the pod options.

Pod Annotations (beta API)

The beta API was defined through annotations on pods.

The container.apparmor.security.beta.kubernetes.io/<container_name> annotation will be used to configure the AppArmor profile that the container named <container_name> is run with. The annotation is immutable on Pods.

Possible annotation values are:

runtime/default - This explicitly selects the default profile configured by the container runtime. Absent this annotation, containerd and CRI-O will run non-privileged containers with this profile by default on AppArmor-enabled (LSM loaded) hosts.
unconfined - Run without any AppArmor profile. This is the default for privileged pods.
localhost/<profile_name> - Run the container using the <profile_name> AppArmor profile. The profile must be pre-loaded into the kernel (typically via apparmor_parser utility), otherwise the container will not be started.
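The annotation format above can be illustrated with a minimal Go sketch. The `Profile` type and the `annotationKey` and `parseAnnotationValue` helpers are hypothetical names introduced only for this example; they are not the API server's actual code. Treating an empty value as "unset" is likewise an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Profile is a hypothetical parsed form of a beta AppArmor annotation value.
type Profile struct {
	Type      string // "RuntimeDefault", "Unconfined", or "Localhost"
	Localhost string // profile name; only meaningful for Localhost
}

// annotationKey builds the per-container annotation key.
func annotationKey(containerName string) string {
	return "container.apparmor.security.beta.kubernetes.io/" + containerName
}

// parseAnnotationValue maps an annotation value onto a Profile.
// Assumption: an empty value means "not set" and yields a nil Profile.
func parseAnnotationValue(value string) (*Profile, error) {
	switch {
	case value == "":
		return nil, nil
	case value == "runtime/default":
		return &Profile{Type: "RuntimeDefault"}, nil
	case value == "unconfined":
		return &Profile{Type: "Unconfined"}, nil
	case strings.HasPrefix(value, "localhost/"):
		name := strings.TrimPrefix(value, "localhost/")
		return &Profile{Type: "Localhost", Localhost: name}, nil
	default:
		return nil, fmt.Errorf("invalid value %q: must be a valid AppArmor profile", value)
	}
}

func main() {
	fmt.Println(annotationKey("nginx"))
	for _, v := range []string{"runtime/default", "unconfined", "localhost/my-profile", "runtime/other"} {
		p, err := parseAnnotationValue(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s => %+v\n", v, *p)
	}
}
```

Any value outside the three documented forms is rejected, which mirrors the rule that only `runtime/default` is accepted as a runtime profile.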

Pod API

The Pod AppArmor API is generally immutable, except in PodTemplates.

type PodSecurityContext struct {
 ...
 // The AppArmor options to use by the containers in this pod.
 // Note that this field cannot be set when spec.os.name is windows.
 // +optional
 AppArmorProfile *AppArmorProfile
 ...
}

type SecurityContext struct {
 ...
 // The AppArmor options to use by this container. If AppArmor options are
 // provided at both the pod & container level, the container options
 // override the pod options.
 // Note that this field cannot be set when spec.os.name is windows.
 // +optional
 AppArmorProfile *AppArmorProfile
 ...
}

// AppArmorProfile defines a pod or container's AppArmor settings.
// Only one profile source may be set.
// +union
type AppArmorProfile struct {
 // type indicates which kind of AppArmor profile will be applied.
 // Valid options are:
 // Localhost - a profile pre-loaded on the node.
 // RuntimeDefault - the container runtime's default profile.
 // Unconfined - no AppArmor enforcement.
 // +unionDiscriminator
 Type AppArmorProfileType

 // LocalhostProfile indicates a loaded profile on the node that should be used.
 // The profile must be preconfigured on the node to work.
 // Must match the loaded name of the profile.
 // Must only be set if type is "Localhost".
 // +optional
 LocalhostProfile *string
}

type AppArmorProfileType string

const (
 AppArmorProfileTypeUnconfined AppArmorProfileType = "Unconfined"
 AppArmorProfileTypeRuntimeDefault AppArmorProfileType = "RuntimeDefault"
 AppArmorProfileTypeLocalhost AppArmorProfileType = "Localhost"
)

This API makes the options more explicit and leaves room for new profile sources to be added in the future (e.g. Kubernetes predefined profiles or ConfigMap profiles) and for future extensions, such as defining the behavior when a profile cannot be set.
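As a rough illustration of how the pod-level and container-level fields interact, the following sketch redeclares the types above locally and resolves the profile that applies to one container. The `effectiveProfile` and `describe` helpers are hypothetical and exist only for this example.

```go
package main

import "fmt"

type AppArmorProfileType string

const (
	AppArmorProfileTypeUnconfined     AppArmorProfileType = "Unconfined"
	AppArmorProfileTypeRuntimeDefault AppArmorProfileType = "RuntimeDefault"
	AppArmorProfileTypeLocalhost      AppArmorProfileType = "Localhost"
)

type AppArmorProfile struct {
	Type             AppArmorProfileType
	LocalhostProfile *string
}

// effectiveProfile is a hypothetical helper: a container-level profile,
// when present, overrides the pod-level one. nil means neither is set.
func effectiveProfile(podLevel, containerLevel *AppArmorProfile) *AppArmorProfile {
	if containerLevel != nil {
		return containerLevel
	}
	return podLevel
}

// describe renders a profile for display purposes only.
func describe(p *AppArmorProfile) string {
	if p == nil {
		return "unset"
	}
	if p.Type == AppArmorProfileTypeLocalhost && p.LocalhostProfile != nil {
		return "Localhost/" + *p.LocalhostProfile
	}
	return string(p.Type)
}

func main() {
	name := "k8s-apparmor-example-deny-write"
	pod := &AppArmorProfile{Type: AppArmorProfileTypeRuntimeDefault}
	ctr := &AppArmorProfile{Type: AppArmorProfileTypeLocalhost, LocalhostProfile: &name}

	fmt.Println(describe(effectiveProfile(pod, ctr))) // container-level wins
	fmt.Println(describe(effectiveProfile(pod, nil))) // falls back to pod-level
	fmt.Println(describe(effectiveProfile(nil, nil))) // nothing set
}
```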

RuntimeDefault Profile

We propose maintaining support for a single runtime profile, which will be selected using AppArmorProfileTypeRuntimeDefault, for the following reasons:

No changes to the current behavior. Users are currently not allowed to specify other runtime profiles. The existing API server rejects runtime profile names that are different than runtime/default.
Most runtimes only support the default profile, although the CRI is flexible enough to allow the kubelet to send other profile names.
Multiple runtime profiles have never been requested as a feature.

If built-in support for multiple runtime profiles is needed in the future, a new KEP will be created to cover its details.

Localhost Profile

This KEP proposes LocalhostProfile as the only source of user-defined profiles. User-defined profiles are essential for users to realize the full benefits of AppArmor, allowing them to decrease their attack surface based on their own workloads.

Updating localhost AppArmor profiles

AppArmor profiles are applied at container creation time. The underlying container runtime only references already-loaded profiles by name. Therefore, updating a profile's content requires a manual reload (typically via apparmor_parser).

Note that changing profiles is not recommended and may cause containers to fail on their next restart if the new profile is more restrictive or invalid, or if the profile file is no longer available on the host.

Currently, users have no way to tell whether their physical profiles have been deleted or modified. This KEP proposes no changes to the existing functionality.

The recommended approach for rolling out changes to AppArmor profiles is to always create new profiles instead of updating existing ones. Create and deploy a new version of the existing Pod Template, changing the profile name to the newly created profile. Redeploy and, once the new version is working, delete the former Pod Template. This avoids disruption to in-flight workloads.

The current behavior lacks features to facilitate the maintenance of AppArmor profiles across the cluster. Two examples are: 1) the lack of profile synchronization across nodes and 2) the difficulty of identifying that profiles have been changed on disk or in memory after pods started using them. However, Kubernetes-managed profiles are out of scope for this KEP. Out-of-tree enhancements like the security-profiles-operator can provide such enhanced functionality on top.

Profiles managed by the cluster admins

The current support relies on profiles being loaded on all cluster nodes where the pods using them may be scheduled. It is also the cluster admin’s responsibility to ensure the profiles are correctly saved and synchronized across all nodes. Existing mechanisms like node labels and nodeSelectors can be used to ensure that pods are scheduled on nodes supporting their desired profiles.

Validation

The following validations were applied to the AppArmor annotations on pods:

Pod annotations are immutable (cannot be added, modified, or removed on pod update)
Annotation value must have a localhost/ prefix, or be one of: "", runtime/default, unconfined.

The annotation validations will be carried over to the field API, and the following additional validations are proposed:

Fields must match the corresponding annotations when both are present, except for ephemeral containers.
AppArmor profile must be unset on Windows pods (spec.os.name == "windows"). Only enforced on fields.
Localhost profile must not be empty, and must not be padded with whitespace. Only enforced on creation. This was previously enforced by the Kubelet.

Note on localhost profile validation: AppArmor profile naming is flexible, but both of the leading CRI implementations (containerd & cri-o) require a profile with a matching name to be loaded. This prevents the special unconfined profile, or various wildcard and variable profile names from being used in practice. This validation is deferred to the runtime, rather than being enforced by the API for backwards compatibility.
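The field-level rules listed above can be sketched as a single validation function. This is a simplified illustration under stated assumptions, not the API server's validation code: the `validateProfile` name, its error messages, and the plain `osName` string parameter are hypothetical, and annotation/field matching is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// AppArmorProfile is a local, simplified copy of the proposed API type.
type AppArmorProfile struct {
	Type             string
	LocalhostProfile *string
}

// validateProfile applies the proposed field validations to one profile.
// osName stands in for spec.os.name; an empty string means "not specified".
func validateProfile(p *AppArmorProfile, osName string) error {
	if p == nil {
		return nil // nothing set, nothing to validate
	}
	if osName == "windows" {
		return errors.New("appArmorProfile must be unset on Windows pods")
	}
	switch p.Type {
	case "Localhost":
		if p.LocalhostProfile == nil || *p.LocalhostProfile == "" {
			return errors.New("localhostProfile must be set and non-empty when type is Localhost")
		}
		if strings.TrimSpace(*p.LocalhostProfile) != *p.LocalhostProfile {
			return errors.New("localhostProfile must not be padded with whitespace")
		}
	case "RuntimeDefault", "Unconfined":
		if p.LocalhostProfile != nil {
			return errors.New("localhostProfile may only be set when type is Localhost")
		}
	default:
		return fmt.Errorf("unsupported profile type %q", p.Type)
	}
	return nil
}

func main() {
	padded := " my-profile "
	fmt.Println(validateProfile(&AppArmorProfile{Type: "Localhost", LocalhostProfile: &padded}, "linux"))
	fmt.Println(validateProfile(&AppArmorProfile{Type: "RuntimeDefault"}, "windows"))
	fmt.Println(validateProfile(&AppArmorProfile{Type: "Unconfined"}, "linux"))
}
```

Whether a loaded profile with the given name actually exists is deliberately not checked here, since that check is deferred to the container runtime.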

Node Status

The Kubelet SHOULD NOT append the AppArmor status to the node ready condition message.

The ready condition is certainly not the right place for this message, but more generally the kubelet does not broadcast the status of every optional feature. (A beta implementation of this feature, added before the Kubernetes enhancement process was formalized, did customize the node ready condition message).

Design Details

When an AppArmor profile is set on a container (or pod), the kubelet will pass the option on to the container runtime, which is responsible for running the container with the desired profile. Profiles must be loaded into the kernel before the container is started (profile loading is out of scope for this KEP). For more details, see https://kubernetes.io/docs/tutorials/security/apparmor/ .

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None

Unit tests

New tests will be added covering the annotation/field conflict cases described under Version Skew Strategy .

Integration tests

Pod Security tests: https://github.com/kubernetes/kubernetes/blob/1ded677b2a77a764a0a0adfa58180c3705242c49/test/integration/auth/podsecurity_test.go

e2e tests

AppArmor node E2E: https://github.com/kubernetes/kubernetes/blob/2f6c4f5eab85d3f15cd80d21f4a0c353a8ceb10b/test/e2e_node/apparmor_test.go

These tests are guarded by the [Feature:AppArmor] tag and run as part of the containerd E2E features test suite.

The E2E tests will be migrated to the field-based API.

Failure and Fallback Strategy

There are different scenarios in which applying an AppArmor profile may fail. The ones we identified, and their outcomes once this KEP is implemented, are listed below:

Scenario	API Server Result	Kubelet Result
1) Using localhost or explicit `runtime/default` profile when container runtime does not support AppArmor.	Pod created	The outcome is container runtime dependent. In this scenario containers may 1) fail to start or 2) run normally without having its policies enforced.
2) Using custom or `runtime/default` profile that restricts actions a container is trying to make.	Pod created	The outcome is workload and AppArmor dependent. In this scenario containers may 1) fail to start, 2) misbehave or 3) log violations.
3) Using a localhost profile that does not exist on the node.	Pod created	Container runtime dependent: containers fail to start. Retry respecting RestartPolicy and back-off delay. Error message in event.
4) Using an unsupported runtime profile (i.e. `runtime/default-audit`).	Fails validation: pod not created.	N/A
5) Using localhost or explicit `runtime/default` profile when AppArmor is disabled by the host or build	Pod created.	Kubelet puts Pod in blocked state.
6) Using implicit (default) `runtime/default` profile when AppArmor is disabled by the host or build.	Pod created	Container created without AppArmor enforcement.
7) Using localhost profile with invalid (empty) name	Fails validation: pod not created.	N/A

Scenario 2 is the expected behavior of using AppArmor and it is included here for completeness.

Scenario 7 represents the case of failing the existing validation, which is defined at Pod API .

Version Skew Strategy

All API skew is resolved in the API server.

Pod Creation

If no AppArmor annotations or fields are specified, no action is necessary.

If the AppArmor feature is disabled per feature gate, then the annotations and fields are cleared (current behavior).

If the pod’s OS is windows, fields are forbidden to be set and annotations are not copied to the corresponding fields.

If only AppArmor fields are specified, add the corresponding annotations. If these are specified at the Pod level, copy the annotations to each container that does not have annotations already specified. This ensures that the fields are enforced even if the node version trails the API version (see Version Skew Strategy ).

If only AppArmor annotations are specified, copy the values into the corresponding fields. This ensures that existing applications continue to enforce AppArmor, and prevents the kubelet from needing to resolve annotations & fields. If the annotation is empty, then the runtime/default profile will be used by the CRI container runtime. If a localhost profile is specified, then container runtimes will strip the localhost/ prefix, too. This will be covered by e2e tests during the GA promotion.

If both AppArmor annotations and fields are specified, the values MUST match. This will be enforced in API validation.

Container-level AppArmor profiles override anything set at the pod-level.
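The creation-time resolution rules can be sketched for a single container as follows. This is a simplified model, not the API server implementation: `Profile`, `toAnnotation`, `fromAnnotation`, and `syncContainer` are hypothetical names, an empty annotation string is treated as "unset", and pod-level propagation, the feature gate, and the Windows case are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Profile is a simplified stand-in for the AppArmorProfile field.
type Profile struct {
	Type      string // "RuntimeDefault", "Unconfined", or "Localhost"
	Localhost string
}

// toAnnotation renders a field value in the beta annotation format.
func toAnnotation(p Profile) string {
	switch p.Type {
	case "RuntimeDefault":
		return "runtime/default"
	case "Unconfined":
		return "unconfined"
	default:
		return "localhost/" + p.Localhost
	}
}

// fromAnnotation converts an annotation value into a field value.
func fromAnnotation(v string) (Profile, error) {
	switch {
	case v == "runtime/default":
		return Profile{Type: "RuntimeDefault"}, nil
	case v == "unconfined":
		return Profile{Type: "Unconfined"}, nil
	case strings.HasPrefix(v, "localhost/"):
		return Profile{Type: "Localhost", Localhost: strings.TrimPrefix(v, "localhost/")}, nil
	}
	return Profile{}, fmt.Errorf("invalid value %q: must be a valid AppArmor profile", v)
}

// syncContainer returns the annotation and field that should both be
// present after pod creation, or an error if the two inputs disagree.
func syncContainer(annotation string, field *Profile) (string, *Profile, error) {
	switch {
	case annotation == "" && field == nil:
		return "", nil, nil // nothing specified: no action
	case annotation == "":
		return toAnnotation(*field), field, nil // field only: add annotation
	case field == nil:
		p, err := fromAnnotation(annotation) // annotation only: fill field
		if err != nil {
			return "", nil, err
		}
		return annotation, &p, nil
	case toAnnotation(*field) != annotation:
		return "", nil, fmt.Errorf("annotation %q does not match field", annotation)
	}
	return annotation, field, nil
}

func main() {
	a, f, err := syncContainer("", &Profile{Type: "Localhost", Localhost: "p1"})
	fmt.Println(a, *f, err)
	a, f, err = syncContainer("unconfined", nil)
	fmt.Println(a, *f, err)
	_, _, err = syncContainer("unconfined", &Profile{Type: "RuntimeDefault"})
	fmt.Println(err)
}
```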

Pod Security Admission

The Pod Security admission plugin will be updated to evaluate AppArmorProfile fields in addition to annotations.

The policy for the baseline Pod security standard forbids setting an Unconfined profile, but allows unset, RuntimeDefault and Localhost profiles. In the case of localhost profiles, this can include OS profiles intended for other system daemons, so additional profile restrictions are encouraged (e.g. via ValidatingAdmissionPolicy).
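The baseline rule reduces to a small predicate. The sketch below is illustrative only and is not the Pod Security admission plugin's code; `baselineAllows` and the simplified `AppArmorProfile` type are hypothetical.

```go
package main

import "fmt"

// AppArmorProfile is a simplified stand-in for the API type.
type AppArmorProfile struct {
	Type string
}

// baselineAllows reports whether a profile passes the baseline rule:
// unset (nil), RuntimeDefault, and Localhost are allowed, while
// Unconfined (and anything unrecognized) is not.
func baselineAllows(p *AppArmorProfile) bool {
	return p == nil || p.Type == "RuntimeDefault" || p.Type == "Localhost"
}

func main() {
	fmt.Println(baselineAllows(nil))
	fmt.Println(baselineAllows(&AppArmorProfile{Type: "Localhost"}))
	fmt.Println(baselineAllows(&AppArmorProfile{Type: "Unconfined"}))
}
```

Because any Localhost profile passes this check, restricting which localhost profile names are acceptable would need an additional policy layer.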

Pod Update

The AppArmor fields on a pod are immutable, which also applies to the annotation.

When an Ephemeral Container is added, it will follow the same rules for using or overriding the pod’s AppArmor profile. Ephemeral containers will never sync with an AppArmor annotation.

PodTemplates

PodTemplates (and their embeddings within e.g. ReplicaSets, Deployments, StatefulSets, etc.) will be ignored. The field/annotation resolution will happen on template instantiation.

Warnings

To raise awareness of workloads using the beta AppArmor annotations that need to be migrated, a warning will be emitted when only AppArmor annotations are set (no fields) on pod creation, or pod template (including workload resources with an embedded pod template) create & update.

Kubelet fallback

Since Kubelet versions must not be ahead of API versions, Kubelets can defer annotation/field resolution to the API server, and only consider the AppArmor fields.

The exception to this is static pods. In this case, the Kubelet will copy annotation values to fields in the applyDefaults function and log a warning.

Runtime Profiles

The API Server will continue to reject annotations with runtime profiles different than runtime/default, to maintain the existing behavior.

Violations would lead to the error message:

Invalid value: "runtime/profile-name": must be a valid AppArmor profile

Upgrade / Downgrade Strategy

Nodes do not currently support in-place upgrades, so pods will be recreated on node upgrade and downgrade. No special handling or consideration is needed to support this.

On the API server side, we’ve already taken version skew in HA clusters into account. The same precautions make upgrade & downgrade handling a non-issue.

Since we support up to 2 minor releases of version skew between the master and node, annotations must continue to be supported and backfilled for at least 2 versions past the initial implementation. Specifically, fields will no longer be copied to annotations for older kubelet versions. However, annotations submitted to the API server will continue to be copied to fields at the kubelet indefinitely, as was done with Seccomp.

Kubelet Backwards compatibility

Since we don’t support running newer Kubelets than API server, new Kubelets only need to handle AppArmor fields. All the version skew resolution happens within the API server.

Removing annotation support

(Assuming field support merges in 1.30, otherwise adjust all versions a constant amount)

Phase 1 (v1.30): AppArmor field support merged

Sync annotations & fields on Pod create (version skew strategy described above)
Warn on annotation use, if field isn’t set
Kubelet copies static pod annotations to fields

Phase 2 (v1.34):

API server stops copying fields to annotations
Warn on annotation use if there is no corresponding container field (including on workload resources)
Risk: policy controllers that don’t consider field values

Phase 3 (v1.36): End state

API server stops copying annotations to fields
Kubelet stops copying annotations to fields for static pods
Validation that annotations & fields match persists indefinitely
Risk: workloads that haven’t migrated

Graduation Criteria

General Availability:

Field-based API

Production Readiness Review Questionnaire

Feature enablement and rollback

How can this feature be enabled / disabled in a live cluster?

AppArmor is controlled by the AppArmor feature gate (already beta by the time this KEP was formally opened).

Feature gate
- Feature gate name: AppArmor
- Components depending on the feature gate:
  - kube-apiserver
  - kubelet

Does enabling the feature change any default behavior?

No - AppArmor has been enabled by default since Kubernetes v1.4.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Containers already running with AppArmor enforcement will continue to do so, but on restart will fall back to the container runtime default. Pods created with AppArmor disabled will have their fields & annotations stripped.

What happens if we reenable the feature if it was previously rolled back?

Newly started or restarted containers in pods that still have the AppArmor field/annotations will have the specified AppArmor profile applied, rather than the runtime default.

Are there any tests for feature enablement/disablement?

No.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The Version Skew Strategy section covers this point. Running workloads should not be impacted, as the Kubelet will support either the existing annotations or the new fields introduced by this KEP.

Disabling the AppArmor feature will cause the container runtimes to apply the runtime default profile (except for privileged pods). In cases where a user was expecting to apply a custom profile (or explicitly unconfined), this could break the workload.

What specific metrics should inform a rollback?

An increase in pod validation errors can indicate issues with the field translation. These would show up as code=400 (Bad Request) errors in apiserver_request_total.

The following errors could indicate problems with how kubelets are interpreting AppArmor profiles.

started_containers_errors_total
started_pods_errors_total

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Automated tests will cover the scenarios with and without the changes proposed in this KEP. As defined under Version Skew Strategy, we are assuming the cluster may have kubelets with older versions (without this KEP’s changes); therefore this will be covered as part of the new tests.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

The promotion of AppArmor to GA would deprecate the beta annotations as described in the Version Skew Strategy .

Monitoring requirements

How can an operator determine if the feature is in use by workloads?

The feature is built into the kubelet and API server components. No metric is planned at this moment. The way to determine usage is by checking whether the pods/containers have an AppArmorProfile set.

How can someone using this feature know that it is working for their instance?

The AppArmor enforcement status is not directly surfaced by Kubernetes, but is visible through the Linux proc API. For example, you can check what profile a container is running with by execing into it:

$ kubectl exec -n $NAMESPACE $POD_NAME -- cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
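If this check needs to be automated, the output line can be split into a profile name and a mode. The sketch below assumes the `name (mode)` shape shown above, and assumes that a process without a profile reports a bare `unconfined`; `parseAttrCurrent` is a hypothetical helper, not part of any Kubernetes tooling.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAttrCurrent splits the contents of /proc/<pid>/attr/current into
// the profile name and the enforcement mode. When no "(mode)" suffix is
// present, the whole string is returned as the name and mode is empty.
func parseAttrCurrent(s string) (name, mode string) {
	s = strings.TrimSpace(s)
	if i := strings.LastIndex(s, " ("); i >= 0 && strings.HasSuffix(s, ")") {
		return s[:i], s[i+2 : len(s)-1]
	}
	return s, ""
}

func main() {
	name, mode := parseAttrCurrent("k8s-apparmor-example-deny-write (enforce)\n")
	fmt.Printf("profile=%q mode=%q\n", name, mode)

	name, mode = parseAttrCurrent("unconfined\n")
	fmt.Printf("profile=%q mode=%q\n", name, mode)
}
```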

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Negligible increase in Pod object size, and any objects embedding a PodSpec.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. AppArmor profiles are managed outside of Kubernetes, and without this feature enabled the runtime default AppArmor profile is still enforced on non-privileged containers (for AppArmor enabled hosts).

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No impact to running workloads.

What are other known failure modes?

No impact on running workloads is foreseen, based on the nature of the changes brought by this KEP.

Some general errors and failures are described under Failure and Fallback Strategy.

What steps should be taken if SLOs are not being met to determine the problem?

N/A

Implementation History

2016-07-25: AppArmor design proposal
2016-09-26: AppArmor beta release with v1.4
2020-01-10: Initial (retrospective) KEP

Drawbacks

Custom AppArmor profiles are not fully managed by Kubernetes
AppArmor support adds a dimension to the feature compatibility matrix, as support is not guaranteed on Linux

Alternatives

Syncing fields & annotations on workload resources

AppArmor fields & annotations on Pods are immutable, which means that syncing fields & annotations is a one-time operation. This is not true for workload resources (ReplicaSets, Deployments, etc).

In order to support syncing fields on workload resources, we need to account for clients that only pay attention to one of the field/annotation settings. When combined with the validation requirement that fields & annotations match, getting this right in both the patch & update cases adds significant complexity.

Resources: Add CDI devices to device plugin API

KEP-4009: Add CDI devices to device plugin API

Release Signoff Checklist
Summary
Motivation
- Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes extending the Device Plugin API, adding a field to specify Container Device Interface (CDI) device IDs in the AllocateResponse. This supplements the existing fields such as annotations and allows device plugin implementations to uniquely specify devices using their fully-qualified CDI device names.

The recent addition of CDI device IDs to the CRI structures in #3731 allows these IDs to be forwarded to the CRI runtimes in a secure manner. Although these changes were motivated by KEP-3063, adding support for these fields to the existing device plugin API allows this mechanism to also be used for devices supported by these plugins.

Motivation

The Container Device Interface (CDI) provides a standard mechanism for device vendors to describe what is required to provide access to a specific resource such as a GPU. These resources can be uniquely identified using a fully-qualified CDI device name.

The changes proposed in #3731 extend the CRI to provide a well-defined mechanism for forwarding such requests to CRI runtimes such as containerd and CRI-O. These have already been extended to accept CDI device requests, and to use the associated CDI specifications to ensure that the required modifications are made to the OCI runtime specification for a container being launched.

The addition of an explicit field for specifying CDI device names to the Device Plugin API allows this CRI field to be used to indicate which devices should be injected. This removes the need to use workarounds such as container annotations to pass this information to the runtimes and allows Device Plugin authors to adopt CDI to inject devices without requiring that users move to a Dynamic Resource Allocation (DRA) based implementation.

Goals

Allow Device Plugin authors to forward device requests to CRI runtimes as a CRI field.
Allow Device Plugin authors to use CDI to define the modifications required for containerised environments.

Proposal

We propose a mechanism for device plugin authors to specify devices using Container Device Interface (CDI) names. The names of the requested devices are passed down as CRI fields to CRI runtimes which are ultimately responsible for making the requested devices accessible from a container.

Design Details

This adds a repeated CDIDevice field to the existing ContainerAllocateResponse returned as part of the AllocateResponse in the Device Plugin API. This matches the modifications made to the Dynamic Resource Allocation API in #3731 .

The values contained in this field are then used to populate the corresponding field in the CRI which is passed to the container runtimes. In addition, annotations with a cdi.k8s.io prefix will be added to the CRI to allow for consumption in container runtimes that do not yet support the CRI field directly, but do support device requests through annotations.

// CDIDevice specifies CDI device information.
message CDIDevice {
 // Fully qualified CDI device name
 // for example: vendor.com/gpu=gpudevice1
 // see more details in the CDI specification:
 // https://github.com/container-orchestrated-devices/container-device-interface/blob/main/SPEC.md
 string name = 1;
}

message ContainerAllocateResponse {
 // List of environment variables to be set in the container to access one or more devices.
 map<string, string> envs = 1;
 // Mounts for the container.
 repeated Mount mounts = 2;
 // Devices for the container.
 repeated DeviceSpec devices = 3;
 // Container annotations to pass to the container runtime
 map<string, string> annotations = 4;
 // CDI devices for the container.
 repeated CDIDevice cdi_devices = 5;
}
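
As an illustration of how a plugin might populate the new field, the following Go sketch builds fully-qualified CDI device names and places them in a response. The struct types are local stand-ins that mirror the messages above, and qualifiedName and buildResponse are hypothetical helpers, not part of the device plugin API; the name check is a simplification of the rules in the CDI specification.

```go
package main

import (
	"fmt"
	"strings"
)

// CDIDevice and ContainerAllocateResponse are local stand-ins mirroring the
// protobuf messages above. A real plugin would use the generated API types.
type CDIDevice struct {
	Name string
}

type ContainerAllocateResponse struct {
	Envs        map[string]string
	Annotations map[string]string
	CDIDevices  []CDIDevice
}

// qualifiedName (hypothetical helper) builds a name of the form
// vendor/class=device, for example vendor.com/gpu=gpudevice1.
// The validation here is deliberately loose; the CDI spec is stricter.
func qualifiedName(vendor, class, device string) (string, error) {
	if vendor == "" || class == "" || device == "" {
		return "", fmt.Errorf("vendor, class and device must all be non-empty")
	}
	if strings.ContainsAny(vendor+class+device, "/=") {
		return "", fmt.Errorf("components must not contain '/' or '='")
	}
	return vendor + "/" + class + "=" + device, nil
}

// buildResponse (hypothetical helper) fills only the new CDI devices field.
func buildResponse(vendor, class string, deviceIDs []string) (*ContainerAllocateResponse, error) {
	resp := &ContainerAllocateResponse{}
	for _, id := range deviceIDs {
		name, err := qualifiedName(vendor, class, id)
		if err != nil {
			return nil, err
		}
		resp.CDIDevices = append(resp.CDIDevices, CDIDevice{Name: name})
	}
	return resp, nil
}

func main() {
	resp, err := buildResponse("vendor.com", "gpu", []string{"gpudevice1", "gpudevice2"})
	if err != nil {
		panic(err)
	}
	for _, d := range resp.CDIDevices {
		fmt.Println(d.Name)
	}
}
```

A plugin that has not opted in simply leaves the field empty, which matches the opt-in behavior described in this KEP.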

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

devicemanager: 2023-06-15 - 85.1%

Integration tests

There are currently no integration tests for device plugins. We do not plan to add any for this feature.

However, these cases will be added in the existing integration tests:

Feature gate enable/disable tests

e2e tests

This test case has been added to the existing e2e_node tests:

DevicePlugin can make a CDI device accessible in a container

Links to test grid:

https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cdi-device-plugins

Links to k8s-triage for tests:

https://storage.googleapis.com/k8s-triage/index.html?job=ci-crio-cdi-device-plugins

Graduation Criteria

Alpha

Add the CDIDevices field to the device plugin API
Implement the logic to pass the CDIDevices into the CRI
Add proper e2e_node tests

Alpha to Beta Graduation

No major bugs reported in the previous cycle

Beta to GA Graduation

Gather feedback from at least 2 device plugin vendors that CDI support works for them

Upgrade / Downgrade Strategy

We expect no impact on upgrades. On downgrades, we expect no impact to Kubernetes and minimal impact to device plugin developers.

We are not bumping the device plugin API version, but simply adding a field to its protobuf. On upgrades this means that older device plugins will simply continue to work as they always have, since they will need to opt-in to using this new field.

For downgrades, if a plugin has not opted to use the new field, there will be no impact since a downgraded kubelet won’t support it anyway. If a device plugin has opted in to use the new field, a downgraded kubelet will silently ignore it. This would have no impact on Kubernetes itself, but plugin developers should be aware of this behavior, since it explains why their CDI support would stop working after a downgrade.

Version Skew Strategy

The kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate names:
  - DevicePluginCDIDevices
- Components depending on the feature gate: kubelet
Pass CDI devices to the kubelet over the new field in the device plugin API
- Will enabling / disabling the feature require downtime of the control plane? No.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No.

Does enabling the feature change any default behavior?

No. Device Plugins need to be updated to make use of the new field.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, disabling the DevicePluginCDIDevices feature gate shuts down the feature completely.
Yes, by not sending CDI devices over the device plugin API (and falling back to the old way of passing device info).

What happens if we reenable the feature if it was previously rolled back?

Nothing bad will happen; new containers will simply be able to start with CDI devices again.

Are there any tests for feature enablement/disablement?

There will be e2e tests demonstrating that CDI devices are attached as expected when the feature is enabled, and silently ignored if the feature is disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The failure of the kubelet would mean that fields from new device allocations will not be processed.

However, CDI devices themselves are only interpreted at container start. Existing containers that were started with support for CDI devices will not be impacted if the feature gate is enabled or disabled during the lifetime of a running container. Only new containers will be impacted by the presence or absence of the feature gate.

What specific metrics should inform a rollback?

N/A

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

This depends on Device Plugin vendor implementations making use of the required field and cannot be directly determined.

How can someone using this feature know that it is working for their instance?

End-users are not aware that this feature exists. Device plugin developers can ensure that this feature is working by passing CDI devices to workloads requesting them, and ensuring that the workloads come up successfully with access to the devices they asked for.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

The container runtime (e.g. containerd, CRI-O, etc.) must support CDI.
A Device Plugin must be implemented to use the field.

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No. The additional field will replace existing usages where used.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A

What are other known failure modes?

The change to Kubernetes to support this feature is very minimal. The CDI device list passed from the plugin to the kubelet is opaquely forwarded to the underlying container runtime without affecting the overall logic of the kubelet in any significant way. As such, the only known failure scenarios result from plugins themselves doing something incorrectly (not the kubelet), for example, sending back a list of CDI devices that are not included in any CDI spec visible to the underlying container runtime. However, such failure scenarios do not affect the proper functioning of Kubernetes itself, and are therefore out of scope for this KEP. We recommend checking the device plugin and container runtime logs instead.

What steps should be taken if SLOs are not being met to determine the problem?

N/A

Implementation History

2023-05-15: KEP created
2023-09-25: KEP updated to mark transition to beta
2024-01-24: KEP updated to mark transition to stable

Drawbacks

There is no reason this KEP should not be implemented. CDI is the new standard for device support in containerized environments, and this enhancement now makes this possible through a simple addition to the device plugin API.

Alternatives

None

Resources: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing

KEP-4540: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories (Optional)
  - Story 1
  - Story 2
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Starting with Kubernetes 1.22, a new CPUManager flag has enabled CPUManager policy options (#2625), which let users customize the behavior of a policy based on workload requirements without introducing an entirely new policy. These policy options work together to ensure that an optimized CPU set is allocated for workloads running on a cluster. The existing policy options are full-pcpus-only (#2625), distribute-cpus-across-numa (#2902), align-by-socket (#3327), and distribute-cpus-across-cores (#4176). This KEP introduces a new CPUManager policy option, strict-cpu-reservation, which ensures that reservedSystemCPUs are strictly reserved for system daemons or interrupt processing and are not used by burstable and best-effort pods.

Motivation

The static policy is used to reduce latency or improve performance. If you want to move system daemons or interrupt processing to dedicated cores, the obvious way is to use the reservedSystemCPUs option. In the current implementation, however, this isolation applies only to guaranteed pods with integer CPU requests, not to burstable and best-effort pods (or guaranteed pods with fractional CPU requests). Admission only compares the CPU requests against the allocatable CPUs. Because CPU limits can be higher than requests, burstable and best-effort pods can use up the capacity of reservedSystemCPUs and starve host OS services in real-life deployments.

Goals

Align scheduler and node view for Node Allocatable (total - reserved).
Ensure reservedSystemCPUs is only used by system daemons or interrupt processing, not by workloads.
Ensure no breaking changes for the static policy of CPUManager.

Non-Goals

Change interface between node and scheduler.

Proposal

We propose to add a new CPUManager policy option called strict-cpu-reservation to the static policy of CPUManager. When this policy option is enabled, we remove the reserved cores from the list of all available cores when calculating the DefaultCPUSet. As a result, burstable and best-effort containers are launched with a cpuset from which the reserved cores are excluded.

User Stories (Optional)

Story 1

To protect workload latency, systemd daemons, including the irqbalance daemon, are commonly constrained to the reserved CPUs. Burstable and best-effort pods (and guaranteed pods with fractional CPU requests) running on the reserved CPUs cause CPU throttling for infrastructure services, which results in poor system response time and in turn hurts workload response time. This issue is particularly bad in all-in-one deployments where workloads are placed on combined master+worker+storage nodes.

Story 2

Silently allowing workloads to run on the reserved CPUs makes benchmarks of both infrastructure and workloads inaccurate.

Design Details

In Kubelet, when strict-cpu-reservation is enabled as a policy option, we remove the reserved cores from the shared pool when calculating the DefaultCPUSet.

The impact of the feature can be illustrated as follows:

With the following Kubelet configuration:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cpuManagerPolicy: static
cpuManagerPolicyOptions:
 strict-cpu-reservation: "true"
reservedSystemCPUs: "0,32,1,33,16,48"
...

When strict-cpu-reservation is disabled:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-63","checksum":1058907510}

When strict-cpu-reservation is enabled:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}
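
The two states above can be reproduced with a minimal Go sketch that removes the reserved CPUs from the set of all CPUs. This is an illustration only, not the kubelet implementation; parseCPUList, formatCPUList, and sharedPool are hypothetical helpers, and the 64-CPU node size is taken from the example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseCPUList parses a plain comma-separated list of CPU IDs such as
// "0,32,1,33,16,48". Ranges like "0-3" are not handled in this sketch.
func parseCPUList(s string) (map[int]bool, error) {
	set := map[int]bool{}
	for _, f := range strings.Split(s, ",") {
		id, err := strconv.Atoi(strings.TrimSpace(f))
		if err != nil {
			return nil, err
		}
		set[id] = true
	}
	return set, nil
}

// formatCPUList renders CPU IDs as a compact range list, e.g. "2-15,17-31".
func formatCPUList(ids []int) string {
	sort.Ints(ids)
	var parts []string
	for i := 0; i < len(ids); {
		j := i
		for j+1 < len(ids) && ids[j+1] == ids[j]+1 {
			j++
		}
		if i == j {
			parts = append(parts, strconv.Itoa(ids[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", ids[i], ids[j]))
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

// sharedPool returns the default CPU set for a node with numCPUs CPUs.
// With strict set, the reserved CPUs are excluded from the pool.
func sharedPool(numCPUs int, reserved map[int]bool, strict bool) string {
	var ids []int
	for cpu := 0; cpu < numCPUs; cpu++ {
		if strict && reserved[cpu] {
			continue
		}
		ids = append(ids, cpu)
	}
	return formatCPUList(ids)
}

func main() {
	reserved, err := parseCPUList("0,32,1,33,16,48")
	if err != nil {
		panic(err)
	}
	fmt.Println(sharedPool(64, reserved, false)) // 0-63
	fmt.Println(sharedPool(64, reserved, true))  // 2-15,17-31,34-47,49-63
}
```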

Risks and Mitigations

The feature is isolated to a specific policy option strict-cpu-reservation under cpuManagerPolicyOptions.

A concern has been raised about the feature's impact on best-effort workloads, that is, workloads that do not have resource requests.

Kube-scheduler schedules pods on node allocatable (total - reserved). For best-effort pods, kube-scheduler uses default request values when scoring the nodes (see https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/util/pod_resources.go#L32 and https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/noderesources/resource_allocation.go#L123 ), but it does not use the default request values when fitting the nodes, i.e. best-effort pods are always admitted.

The concern is that when the feature graduates to Stable, it will be enabled by default, and best-effort workloads could be starved when the node runs out of CPU cores.

However, this is exactly the intent of the feature: best-effort workloads have no KPI requirement and are meant to consume whatever CPU resources are left on the node, which includes starving from time to time. Best-effort workloads are not scheduled against the reservedSystemCPUs, so they should not run on the reservedSystemCPUs and destabilize the whole node.

Nevertheless, risk mitigation has been discussed in detail (see archived options below), and we agree to start with the following node metrics of CPU pool sizes in the Alpha and Beta stages to assess the actual impact in real deployments. The plan is to move the current implementation to the Stable stage if no field issues are observed for one year.

https://github.com/kubernetes/kubernetes/pull/127506

cpu_manager_shared_pool_size_millicores: report shared pool size, in millicores (e.g. 13500m), expected to be non-zero otherwise best-effort pods will starve
cpu_manager_exclusive_cpu_allocation_count: report exclusively allocated cores, counting full cores (e.g. 16)

Archived Risk Mitigation (Option 1)

This option adds numMinSharedCPUs to the strict-cpu-reservation option as the minimum number of CPU cores not available for exclusive allocation, and exposes it to kube-scheduler for enforcement.

In Kubelet, when strict-cpu-reservation is enabled as a policy option, we remove the reserved cores from the shared pool when calculating the DefaultCPUSet, and remove the MinSharedCPUs from the list of available cores for exclusive allocation.

When strict-cpu-reservation is disabled:

Total CPU cores: 64
ReservedSystemCPUs: 6
defaultCPUSet = Reserved (6) + 58 (available for exclusive allocation)

When strict-cpu-reservation is enabled:

Total CPU cores: 64
ReservedSystemCPUs: 6
MinSharedCPUs: 4
defaultCPUSet = MinSharedCPUs (4) + 54 (available for exclusive allocation)
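
The arithmetic behind these numbers can be sketched in Go as follows. poolSizes is a hypothetical helper written for this illustration, not code from the prototype PR; it only shows how the allocatable and exclusively allocatable core counts follow from the three inputs.

```go
package main

import "fmt"

// poolSizes (hypothetical helper) derives the pool sizes used in this option:
// allocatable = total - reserved, exclusive = allocatable - minShared.
func poolSizes(total, reserved, minShared int) (allocatable, exclusive int, err error) {
	if total <= 0 || reserved < 0 || minShared < 0 || reserved+minShared > total {
		return 0, 0, fmt.Errorf("invalid sizes: total=%d reserved=%d minShared=%d",
			total, reserved, minShared)
	}
	allocatable = total - reserved
	exclusive = allocatable - minShared
	return allocatable, exclusive, nil
}

func main() {
	allocatable, exclusive, err := poolSizes(64, 6, 4)
	if err != nil {
		panic(err)
	}
	// Prints: allocatable=58 exclusive=54
	fmt.Printf("allocatable=%d exclusive=%d\n", allocatable, exclusive)
}
```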

A prototype PR for this option has been created: https://github.com/kubernetes/kubernetes/pull/123979/commits

Add numMinSharedCPUs as part of strict-cpu-reservation option in Kubelet configuration:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
 ...
 CPUManagerPolicyAlphaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
 strict-cpu-reservation: { "enable": "true", "numMinSharedCPUs": 4 }
reservedSystemCPUs: "0,32,1,33,16,48"
...

In Node API, we add exclusive-cpu in Node Allocatable for Kube-scheduler to consume.

 "status": {
"capacity": {
"cpu": "64",
"exclusive-cpu": "64",
"ephemeral-storage": "832821572Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "196146004Ki",
"pods": "110"
},
"allocatable": {
"cpu": "58",
"exclusive-cpu": "54",
"ephemeral-storage": "767528359485",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "186067796Ki",
"pods": "110"
},
...

In kube-scheduler, ExclusiveMilliCPU is added to the scheduler’s Resource structure and the NodeResourcesFit plugin is extended to filter out nodes that cannot meet a pod’s exclusive CPU request.

A new item ExclusiveMilliCPU is added in the scheduler Resource structure:

// Resource is a collection of compute resource.
type Resource struct {
	MilliCPU          int64
	ExclusiveMilliCPU int64 // added
	Memory            int64
	EphemeralStorage  int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
}

A new node fitting failure ‘Insufficient exclusive cpu’ is added in the NodeResourcesFit plugin:

if podRequest.MilliCPU > 0 && podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU-nodeInfo.Requested.MilliCPU) {
	insufficientResources = append(insufficientResources, InsufficientResource{
		ResourceName: v1.ResourceCPU,
		Reason:       "Insufficient cpu",
		Requested:    podRequest.MilliCPU,
		Used:         nodeInfo.Requested.MilliCPU,
		Capacity:     nodeInfo.Allocatable.MilliCPU,
	})
}
if nodeInfo.Allocatable.ExclusiveMilliCPU > 0 { // added
	if podRequest.ExclusiveMilliCPU > 0 && podRequest.ExclusiveMilliCPU > (nodeInfo.Allocatable.ExclusiveMilliCPU-nodeInfo.Requested.ExclusiveMilliCPU) {
		insufficientResources = append(insufficientResources, InsufficientResource{
			ResourceName: v1.ResourceExclusiveCPU,
			Reason:       "Insufficient exclusive cpu",
			Requested:    podRequest.ExclusiveMilliCPU,
			Used:         nodeInfo.Requested.ExclusiveMilliCPU,
			Capacity:     nodeInfo.Allocatable.ExclusiveMilliCPU,
		})
	}
}

Archived Risk Mitigation (Option 2)

The problem with MinSharedCPUs is that it creates another complication similar to memory and hugepages (new resources versus overlapping resources), because exclusive-cpus is a subset of cpu.

Currently the noderesources scheduler plugin does not filter out the best-effort pods in the case there’s no available CPU.

Another option is to force the CPU requests for best-effort pods to 1 MilliCPU in kubelet for the purpose of resource availability checks (or, equivalently, check there’s at least 1 MilliCPU allocatable). This option is meant to be simpler than Option 1, but it can create runaway pods similar to those in https://github.com/kubernetes/kubernetes/issues/84869 .

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/policy_static.go: 03-18-2024 - 91.1

Integration tests

No new integration tests for kubelet are planned.

e2e tests

The e2e tests are implemented in https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_test.go , marked with the Ginkgo “strict-cpu-reservation” label.

Feature functionality tests:

running with strict CPU reservation: should let the container access all the online CPUs without a reserved CPUs set
running with strict CPU reservation: should let the container access all the online CPUs minus the reserved CPUs set when enabled
running with strict CPU reservation: should let the container access all the online non-exclusively-allocated CPUs minus the reserved CPUs set when enabled

CPU Manager options compatibility tests:

SMT Alignment and strict CPU reservation: should reject workload asking non-SMT-multiple of cpus
SMT Alignment and strict CPU reservation: should admit workload asking SMT-multiple of cpus
Strict CPU Reservation and Uncore Cache Alignment: should assign CPUs aligned to uncore caches with prefer-align-cpus-by-uncore-cache and avoid reserved cpus

Testgrid:

kubelet-serial-gce-e2e-cpu-manager : Green
kubelet-gce-e2e-arm64-ubuntu-serial : Green
pull-e2e-serial-ec2 : Green
node-kubelet-containerd-resource-managers : Green

Graduation Criteria

Alpha

Implement the new policy option.
Ensure proper unit tests are in place.

Beta

Gather feedback from consumers of the new policy option.
Verify no major bugs reported in the previous cycle.
Ensure proper e2e tests are in place.

GA

Allow time for feedback (two releases).
Make sure all risks have been addressed.

Upgrade / Downgrade Strategy

The new policy option is opt-in and orthogonal to the existing ones.

Version Skew Strategy

No changes needed.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

The /var/lib/kubelet/cpu_manager_state needs to be removed when enabling or disabling the feature.

How can this feature be enabled / disabled in a live cluster?

Change the kubelet configuration to set a CPUManager policy of static and a CPUManager policy option of strict-cpu-reservation
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No – removing /var/lib/kubelet/cpu_manager_state and restarting kubelet are enough.

Does enabling the feature change any default behavior?

Yes. Reserved CPU cores will be used strictly for system daemons and interrupt processing and will no longer be available for workloads.

The feature is only enabled when all of the following conditions are met:

The static CPUManager policy is selected
The strict-cpu-reservation policy option is selected
The reservedSystemCPUs is not empty

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled by the following steps:

Remove strict-cpu-reservation from the list of CPUManager policy options
Remove /var/lib/kubelet/cpu_manager_state and restart kubelet

What happens if we reenable the feature if it was previously rolled back?

The feature will be enabled regardless of whether it is being enabled for the first time.

Are there any tests for feature enablement/disablement?

A specific e2e test will demonstrate that the default behaviour is preserved when the feature is not used (2 separate tests)

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If the feature rollout fails, burstable and best-effort pods continue to run on the reserved CPU cores. If the feature rollback fails, burstable and best-effort pods continue not to run on the reserved CPU cores. In either case, existing workloads will not be affected.

When enabling or disabling the feature, make sure /var/lib/kubelet/cpu_manager_state is removed before restarting kubelet; otherwise the kubelet restart could fail.

What specific metrics should inform a rollback?

Best-effort workloads are starved for a prolonged time. This indicates that you lack the hardware to use the feature, or that you should review the number of CPU cores reserved.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

If you have this feature enabled in v1.32 under CPUManagerPolicyAlphaOptions (defaults to false), you will automatically continue to have the feature enabled in v1.33 under CPUManagerPolicyBetaOptions (defaults to true), i.e. no extra action is needed. To enable or disable this feature in v1.33, follow the feature activation and deactivation procedures described above.

Manual upgrade->downgrade->upgrade testing from v1.32 to v1.33 is as follows:

With the following Kubelet configuration and cpu_manager_state in v1.32:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
 CPUManagerPolicyAlphaOptions: true
 ...
cpuManagerPolicy: static
cpuManagerPolicyOptions:
 strict-cpu-reservation: "true"
reservedSystemCPUs: "0,32,1,33,16,48"
...

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}

The same Kubelet cpu_manager_state will be seen after upgrading to v1.33:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}

We recommend removing the CPUManagerPolicyAlphaOptions feature gate after upgrading to v1.33 for operational integrity, but it is not mandatory.

If you want to disable the feature in v1.33, you can either disable the CPUManagerPolicyBetaOptions feature gate, or remove the strict-cpu-reservation policy option. Remember to remove the /var/lib/kubelet/cpu_manager_state file before restarting kubelet.

The following cpu_manager_state will be seen after the feature is disabled:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-63","checksum":1058907510}

If you want to enable the feature in v1.33, you need to make sure the CPUManagerPolicyBetaOptions feature gate is not disabled and add the strict-cpu-reservation policy option. Remember to remove the /var/lib/kubelet/cpu_manager_state file before restarting kubelet.

The following cpu_manager_state will be seen after the feature is enabled:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Inspect the defaultCpuSet in /var/lib/kubelet/cpu_manager_state:

When the feature is disabled, the reserved CPU cores are included in the defaultCpuSet.
When the feature is enabled, the reserved CPU cores are not included in the defaultCpuSet.

How can someone using this feature know that it is working for their instance?

Inspect the pod's /proc status file and check that the reserved cores are not in its allowed CPU list.

Below is an example:

# kubectl exec cnf1-58446568f4-dr986 -n cnf1-ns -- grep Cpus_allowed /proc/self/status
Cpus_allowed: fffefffc,fffefffc
Cpus_allowed_list: 2-15,17-31,34-47,49-63
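
The Cpus_allowed hex mask can also be checked programmatically. The Go sketch below decodes the mask into CPU IDs and reports any overlap with the reserved CPUs from the example configuration. cpusFromMask and overlap are hypothetical helpers; the decoding assumes the usual Linux layout of comma-separated 32-bit hex words with the most significant word first.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpusFromMask decodes a Cpus_allowed value into a list of CPU IDs.
// Assumption: 32-bit hex words, comma-separated, most significant word first.
func cpusFromMask(mask string) ([]int, error) {
	words := strings.Split(strings.TrimSpace(mask), ",")
	var cpus []int
	// Walk from the last word (CPUs 0-31) towards the first.
	for i := len(words) - 1; i >= 0; i-- {
		w, err := strconv.ParseUint(words[i], 16, 32)
		if err != nil {
			return nil, err
		}
		base := (len(words) - 1 - i) * 32
		for bit := 0; bit < 32; bit++ {
			if w&(1<<uint(bit)) != 0 {
				cpus = append(cpus, base+bit)
			}
		}
	}
	return cpus, nil
}

// overlap returns the reserved CPUs that appear in the allowed list.
func overlap(allowed, reserved []int) []int {
	isReserved := map[int]bool{}
	for _, r := range reserved {
		isReserved[r] = true
	}
	var hit []int
	for _, c := range allowed {
		if isReserved[c] {
			hit = append(hit, c)
		}
	}
	return hit
}

func main() {
	allowed, err := cpusFromMask("fffefffc,fffefffc")
	if err != nil {
		panic(err)
	}
	reserved := []int{0, 32, 1, 33, 16, 48}
	// Prints: 58 [] (58 allowed CPUs, none of them reserved)
	fmt.Println(len(allowed), overlap(allowed, reserved))
}
```

An empty overlap means the container cannot run on any reserved core, which is the expected result when the feature is active.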

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature allows users to protect infrastructure services from bursty workloads.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Monitor the following kubelet counters:

cpu_manager_shared_pool_size_millicores: report shared pool size, in millicores (e.g. 13500m), expected to be non-zero otherwise best-effort pods will starve
cpu_manager_exclusive_cpu_allocation_count: report exclusively allocated cores, counting full cores (e.g. 16)
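
As a rough sketch of how these counters might be checked, the Go program below scans Prometheus text-format output for the shared pool metric and flags an empty pool. metricValue is a hypothetical helper; it matches on the metric name suffix because the exact exported name (for example, any component prefix) is an assumption here, and it handles only simple samples whose labels contain no spaces. The sample input is made up for illustration and reuses the example values quoted above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// metricValue returns the value of the first sample whose metric name ends
// with suffix. Comment lines (# HELP, # TYPE) and malformed lines are skipped.
func metricValue(exposition, suffix string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		name := fields[0]
		if i := strings.IndexByte(name, '{'); i >= 0 {
			name = name[:i] // drop the label set, if any
		}
		if !strings.HasSuffix(name, suffix) {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Made-up sample input for illustration only.
	sample := "cpu_manager_shared_pool_size_millicores 13500\n" +
		"cpu_manager_exclusive_cpu_allocation_count 16\n"

	shared, ok := metricValue(sample, "cpu_manager_shared_pool_size_millicores")
	switch {
	case !ok:
		fmt.Println("shared pool metric not found")
	case shared == 0:
		fmt.Println("shared pool is empty: best-effort pods will starve")
	default:
		fmt.Printf("shared pool size: %.0fm\n", shared)
	}
}
```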

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

No.

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

Increase kubelet log level and check kubelet log for errors.

The following command shows the kubelet log when kubelet runs as a systemd service:

journalctl _SYSTEMD_INVOCATION_ID=`systemctl show -p InvocationID --value kubelet.service`

How does this feature react if the API server and/or etcd is unavailable?

There is no known impact.

What are other known failure modes?

There is no known failure mode.

What steps should be taken if SLOs are not being met to determine the problem?

You can safely disable the feature.

Implementation History

2024-03-08: Initial KEP created
2024-10-07: KEP gets LGTM and Approval
2025-02-03: KEP updated with Beta criteria
2025-09-30: KEP updated with GA criteria

Drawbacks

Alternatives

Infrastructure Needed (Optional)

Resources: Add generic control plane staging repository(ies)

KEP-4080: Add generic control plane staging repository(ies)

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes factoring kube-apiserver and kube-controller-manager to build on one or multiple new staging repositories that consume k/apiserver but have a bigger, carefully chosen subset of the functionality of kube-apiserver and kube-controller-manager such that it is reusable.

The factoring will be progressive: we will start with new repo(s) adding nothing to k/apiserver, refactor in-place and then progressively move generic functionality from kube-apiserver and kube-controller-manager to the new repositories.

The suggested naming of the new repository(ies) is k/generic-controlplane (-apiserver/-controllers; for simplicity we drop these suffixes in this document until the names are finalized). Choosing the exact name(s) and split of packages will be part of the process of implementation.

Motivation

A working kube-based control plane is more than just an apiserver component built on k/apiserver. It includes standard resources (depending on context: namespaces, CRDs, RBAC, secrets, configmaps) and standard controllers (garbage collection, namespace deletion, etc.). Today, kube-apiserver bundles those resources with container orchestration, and kube-controller-manager does the same for the corresponding controllers.

Separating the generic parts from container orchestration will allow new use cases that build upon k/apimachinery and k/apiserver while keeping a unified codebase and ecosystem. It will also improve the factoring of kube-apiserver: clear layering reduces complexity and makes maintenance easier.

Goals

As always: every PR transforms a working system into a working system, and PRs are of manageable size.
Improve factoring of kube-apiserver through layering on-top of k/generic-controlplane, reducing complexity through more explicit structure and reduction of code in k/k.
k/generic-controlplane will provide
- a sample-generic-controlplane binary
- a modular, further customizable (in code) library suitable to build a working kube-based control plane without vendoring k/k.
k/generic-controlplane will (optionally) include the ability to define resources by CustomResourceDefinition objects.
k/generic-controlplane will be able to (optionally) delegate handling of some kinds of objects to another server, as directed by APIService objects.
k/generic-controlplane will allow customization (in code) of which generic (native) resources like secrets, configmaps, admission webhooks, RBAC, etc. are served.
k/generic-controlplane will not include the definitions of the resources in Kubernetes for the management of containerized workloads. For example, the excluded resources include: nodes, pods, daemonsets, ingresses, services, persistentvolumes.
k/generic-controlplane as a library will be agnostic to being used in separate binaries or in an all-in-one binary, both in a hyperkube-like subcommand way, and in an all-in-one k3s like way.
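
As a purely illustrative sketch of what "customization (in code) of which generic resources are served" could look like, the following Go snippet filters a requested resource set against the workload resources that the goals above exclude. The names ResourceConfig, NewResourceConfig, and Serves are hypothetical; the KEP does not define this API and leaves package structure to the implementation.

```go
package main

import "fmt"

// workloadResources lists the container-orchestration resources that the
// goals above name as excluded from k/generic-controlplane.
var workloadResources = map[string]bool{
	"nodes": true, "pods": true, "daemonsets": true,
	"ingresses": true, "services": true, "persistentvolumes": true,
}

// ResourceConfig is a hypothetical, illustration-only type. It is not an API
// defined by the KEP.
type ResourceConfig struct {
	enabled map[string]bool
}

// NewResourceConfig enables the requested generic resources and rejects any
// workload resource.
func NewResourceConfig(resources ...string) (*ResourceConfig, error) {
	cfg := &ResourceConfig{enabled: map[string]bool{}}
	for _, r := range resources {
		if workloadResources[r] {
			return nil, fmt.Errorf("resource %q is a workload resource and is not part of the generic control plane", r)
		}
		cfg.enabled[r] = true
	}
	return cfg, nil
}

// Serves reports whether the given resource was enabled.
func (c *ResourceConfig) Serves(resource string) bool { return c.enabled[resource] }

func main() {
	cfg, err := NewResourceConfig("secrets", "configmaps", "namespaces")
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Serves("secrets"), cfg.Serves("pods"))

	_, err = NewResourceConfig("pods")
	fmt.Println(err != nil)
}
```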

Non-Goals

provide and ship a de-facto standard, full-featured generic-controlplane binary:

i.e. this is clearly a library approach and consumer projects will define a feature set of a control plane. There is no new deliverable beyond a staging repository with a library and a sample binary only, with clear limited scope of demonstrating plumbing.
change anything noticeable to the user for existing binaries.
change compatibility guarantees of (server-side) staging repositories
create k/kube-apiserver or anything similar, although this work can lead the path by defining package structures suitable for k/kube-apiserver.

Proposal

The desired outcome is a useful new library, while of course keeping everything working during iterative development. Success will be measured by community members saying that the new library is useful to them.

User Stories (Optional)

Story 1

Project kube-hyper-mini wants to maintain a main program that bundles

1. the subset of kube-apiserver that is not concerned with the management of containerized workloads,
2. a single-member etcd cluster, and
3. the subset of kube-controller-manager that is not concerned with the management of containerized workloads.

This makes a convenient platform for hosting kube-style APIs defined by CRDs and/or resources served by their own extension apiserver. They use k/generic-controlplane to get part (1).

Story 2

Project kube-core recognizes that the three parts of kube-hyper-mini scale out differently, and wants instead to maintain a main program that is just the desired subset of kube-apiserver. Their main program is very little more than a use of k/generic-controlplane.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

This KEP is about code refactoring that introduces another layer into the staging repositories that kube-apiserver and kube-controller-manager are built from. With every code refactoring there is a risk of bugs. The mitigation is small, easily reviewable, “obvious” PRs that iteratively move from the old to the new structure.

Design Details

First steps are about splitting existing kube-apiserver and kube-controller-manager packages in-place, aka inside of k/k. This includes:

cmd/kube-apiserver
pkg/kubeapiserver
pkg/controlplane.

Early sketch of an end-state. By far, most changes towards this goal will be code moves only:

k/generic-controlplane
- pkg/apis
- pkg/apiserver/options
- pkg/apiserver/server
- pkg/apiserver/registry
- pkg/apiserver/admission
- pkg/controllers/garbagecollection
- pkg/controllers/namespacedeletion
- cmd/sample-generic-controlplane

Potentially, we will split the apiserver and controller parts into two separate repositories. This will be decided after the initial in-place steps are done and it has become clearer how best to structure and host the new packages.

Test Plan

[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

<package>: <date> - <test coverage>

Integration tests

e2e tests

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

There will be no changes to system behavior. The typical alpha/beta/GA stages and requirements do not apply, as this KEP proposes code moves of existing code without changing its alpha/beta/GA status.

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name:
- Components depending on the feature gate:
Other
- Describe the mechanism: Does not apply. This is a code move of existing code without functional changes.

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

What happens if we reenable the feature if it was previously rolled back?

Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (Optional)

Resources: Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe

KEP-2727: Add GRPC Probe

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
- Alpha
- Beta
- GA
Drawbacks
Alternatives
References
Infrastructure Needed (Optional)

Release Signoff Checklist

Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
KEP approvers have approved the KEP status as implementable
Design details are appropriately documented
Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
Graduation criteria is in place
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe.

Motivation

gRPC is a widespread RPC framework. Existing ways to add probes to gRPC apps, such as exposing an additional HTTP endpoint for health checks or packaging an external gRPC client into the image and using exec probes, have many limitations and add overhead.

Many load balancers support gRPC natively so adding it to Kubernetes aligns well with the industry.

Finally, the Kubernetes project actively uses gRPC, so adding built-in support for gRPC endpoints does not introduce any new dependencies to the project.

Goals

Enable gRPC probe natively from Kubelet without requiring users to package a gRPC healthcheck binary with their container.

Non-Goals

Add gRPC support in other areas of K8s (e.g. Services).

Proposal

Add the following configuration to the LivenessProbe, ReadinessProbe and StartupProbe. Example:

 readinessProbe:
   grpc: #+
     port: 9090 #+
     service: my-service #+
   initialDelaySeconds: 5
   periodSeconds: 10

This will result in the use of gRPC (using HTTP/2 over TLS) to use the standard healthcheck service (Check method) to determine the health of the container. Using Watch method of the healthcheck service is not supported, but may be considered in future iterations. As spec’d, the kubelet probe will not allow use of client certificates nor verify the certificate on the container. We do not support other protocols for the time being (unencrypted HTTP/2, QUIC).
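
The decision logic described above can be sketched without a real gRPC connection. In this minimal Go sketch, ServingStatus mirrors the status enum of the gRPC health checking protocol, while checkFunc, probe, and the fake checkers are hypothetical stand-ins for the actual Check call made by the kubelet. Treating every error as a failure is a simplifying assumption, not a statement about the kubelet implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ServingStatus mirrors the status enum of the gRPC health checking protocol.
type ServingStatus int

const (
	Unknown ServingStatus = iota
	Serving
	NotServing
	ServiceUnknown
)

// ProbeResult is a simplified probe outcome for this sketch.
type ProbeResult string

const (
	Success ProbeResult = "success"
	Failure ProbeResult = "failure"
)

// checkFunc stands in for the Check method of the health service (hypothetical).
type checkFunc func(ctx context.Context, service string) (ServingStatus, error)

// probe succeeds only when Check answers SERVING before the timeout expires.
func probe(check checkFunc, service string, timeout time.Duration) ProbeResult {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	status, err := check(ctx, service)
	if err != nil || status != Serving {
		return Failure
	}
	return Success
}

// demoTimeout keeps the demonstration fast; real probes use timeoutSeconds.
const demoTimeout = 20 * time.Millisecond

func alwaysServing(ctx context.Context, service string) (ServingStatus, error) {
	return Serving, nil
}

func notServing(ctx context.Context, service string) (ServingStatus, error) {
	return NotServing, nil
}

// neverResponds blocks until the probe's deadline expires.
func neverResponds(ctx context.Context, service string) (ServingStatus, error) {
	<-ctx.Done()
	return Unknown, ctx.Err()
}

func main() {
	fmt.Println(probe(alwaysServing, "my-service", demoTimeout)) // success
	fmt.Println(probe(notServing, "my-service", demoTimeout))    // failure
	fmt.Println(probe(neverResponds, "my-service", demoTimeout)) // failure
}
```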

The healthcheck request will be identified with the following gRPC User-Agent metadata. This user agent will be statically defined (not configurable by the user):

User-Agent: kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER

Example:

User-Agent: kube-probe/1.23
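
A minimal Go sketch of how such a statically defined value could be assembled from the version components. The function name probeUserAgent is hypothetical and only illustrates the format shown above.

```go
package main

import "fmt"

// probeUserAgent builds the User-Agent value in the
// kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER format (hypothetical helper).
func probeUserAgent(major, minor int) string {
	return fmt.Sprintf("kube-probe/%d.%d", major, minor)
}

func main() {
	fmt.Println(probeUserAgent(1, 23)) // prints "kube-probe/1.23"
}
```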

Risks and Mitigations

Adds more code to Kubelet and surface area to Pod.Spec. Response: we expect that this will be generally useful given broad gRPC adoption in the industry.

Design Details

// core/v1/types.go

type Handler struct {
 // ...
 TCPSocket *TCPSocketAction `json...`

 // GRPC specifies an action involving a TCP port. //+
 // +optional //+
 GRPC *GRPCAction `json...` //+

 // ...
}

type GRPCAction struct { //+
 // Port number of the gRPC service. Number must be in the range 1 to 65535. //+
 Port int32 `json:"port" protobuf:"bytes,1,opt,name=port"` //+
 //+
 // Service is the name of the service to place in the gRPC HealthCheckRequest //+
 // (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md). //+
 // //+
 // The service name can be the empty string (i.e. ""). //+
 Service string `json:"service" protobuf:"bytes,2,opt,name=service"` //+
 //+
 // Host is the host name to connect to, defaults to the Pod's IP. //+
 Host string `json:"host,omitempty" protobuf:"bytes,3,opt,name=host"` //+
} //+

Note that GRPCAction.Port is an int32, which is inconsistent with the other existing probe definitions. This is on purpose – we want to move users away from using the (portNum, portName) union type.
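
The port constraint stated in the type comment (1 to 65535) could be checked along these lines. This is a sketch only: the local GRPCAction copy omits struct tags, and validateGRPCAction is a hypothetical helper, not the actual API validation code.

```go
package main

import "fmt"

// GRPCAction is a local copy of the fields shown above, without struct tags.
type GRPCAction struct {
	Port    int32
	Service string
	Host    string
}

// validateGRPCAction is a hypothetical validator: the port must be in the
// range 1 to 65535, and the service name may be the empty string.
func validateGRPCAction(a GRPCAction) error {
	if a.Port < 1 || a.Port > 65535 {
		return fmt.Errorf("port %d is out of range: must be between 1 and 65535", a.Port)
	}
	return nil
}

func main() {
	fmt.Println(validateGRPCAction(GRPCAction{Port: 9090, Service: "my-service"}))
	fmt.Println(validateGRPCAction(GRPCAction{Port: 0}))
}
```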

Alternative Considerations

Note that readinessProbe.grpc.service may be confusing. Some alternatives considered:

serviceName
healthCheckServiceName
grpcService
grpcServiceName

There was no feedback indicating that the selected name is confusing in the context of a probe definition.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/probe/grpc: 2023/02/06 - 78.1%

Integration tests

N/A, only unit tests and e2e coverage.

e2e tests

Tests in test/e2e/common/node/container_probe.go:

should not be restarted with a GRPC liveness probe: results
should be restarted with a GRPC liveness probe: results

TODO: stress test to validate the scale (see GA requirements).

Graduation Criteria

Alpha

Implement the feature.
Add unit and e2e tests for the feature.

Beta

Solicit feedback from the Alpha.
Ensure tests are stable and passing.

Depending on skew strategy:

kubelet version skew ensures all (kubelet ver, cluster ver) support the feature.

GA

Address feedback from beta usage
Validate that API is appropriate for users. There are some potential tunables:
- User-Agent
- connect timeout
- protocol (HTTP, QUIC)
Close on any remaining open issues & bugs
Promote tests to conformance
Implement a stress test

Upgrade / Downgrade Strategy

Upgrade: N/A

Downgrade: gRPC probes will not be supported in a downgrade from Alpha.

Version Skew Strategy

We may not be able to graduate this widely until all kubelets within the supported version skew support the probe type.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

Feature enablement will be guarded by a feature gate flag.

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: GRPCContainerProbe
- Components depending on the feature gate: kubelet (probing), API server (API changes).

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. This would require restarting kubelet, and so probes for existing Pods would no longer run.

What happens if we reenable the feature if it was previously rolled back?

It becomes enabled again after the kubelet restart.

Are there any tests for feature enablement/disablement?

Yes, unit tests for the feature when enabled and disabled will be implemented in both kubelet and api server.

Rollout, Upgrade and Rollback Planning

We passed the version skew problem for the new API. No planning is required.

How can a rollout or rollback fail? Can it impact already running workloads?

We passed the version skew problem - the API will be available on any supported version skew. So no issues are expected with rollout and rollback.

What specific metrics should inform a rollback?

Rollback wouldn’t address issues. Pods will need to stop using the new probe type.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

When a gRPC probe is configured and the Pod has been scheduled, the probe_total metric can be observed to see the result of probe execution.

How can someone using this feature know that it is working for their instance?

When a gRPC probe is configured and the Pod has been scheduled, the probe_total metric can be observed to see the result of probe execution.

An event will be emitted for a failed probe, and logs are available in kubelet.log to troubleshoot failing probes.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The probe must succeed whenever the service returns the correct response within the defined timeout, and fail otherwise.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

The metric probe_total can be used to check for the probe result. Event and kubelet.log log entries can be observed to troubleshoot issues.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Creation of a probe duration metric is tracked in this issue: https://github.com/kubernetes/kubernetes/issues/101035 and out of scope for this KEP.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Adds < 200 bytes to Pod.Spec, which is consistent with other probe types.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

The overhead of executing probes is consistent with other probe types.

We expect a decrease in disk, RAM, and CPU use in many scenarios where https://github.com/grpc-ecosystem/grpc-health-probe was used to probe gRPC endpoints.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Yes, gRPC probes use node resources to establish connections. This may lead to issues like kubernetes/kubernetes#89898 .

The node resources for gRPC probes can be exhausted by a Pod with HostPort making many connections to different destinations or any other process on a node. This problem cannot be addressed generically.

However, the design where node resources are used for gRPC probes works for most setups. The default maximum number of pods per node is 110. There are currently no limits on the number of containers; it is bounded only by the amount of resources requested by these containers. With the fix limiting the TIME_WAIT for the socket to 1 second, this calculation demonstrates it will be hard to reach the limits on sockets.
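
A back-of-the-envelope version of such an estimate can be written as a small Go sketch. The inputs other than the 110-pod default and the 1-second TIME_WAIT are assumptions chosen for illustration (2 containers per pod, 3 probes per container, a 1-second probe period), and the function name socketsInTimeWait is hypothetical. Treating the default Linux ephemeral port range (32768 to 60999) as one shared pool is a deliberate simplification.

```go
package main

import "fmt"

// defaultEphemeralPorts is the size of the default Linux ephemeral port range
// (net.ipv4.ip_local_port_range = 32768..60999).
const defaultEphemeralPorts = 60999 - 32768 + 1

// socketsInTimeWait estimates how many probe sockets linger in TIME_WAIT at
// steady state: probes started per second multiplied by the TIME_WAIT duration.
func socketsInTimeWait(pods, containersPerPod, probesPerContainer, periodSeconds, timeWaitSeconds int) int {
	probesPerPeriod := pods * containersPerPod * probesPerContainer
	// Round up so that the estimate stays pessimistic.
	probesPerSecond := (probesPerPeriod + periodSeconds - 1) / periodSeconds
	return probesPerSecond * timeWaitSeconds
}

func main() {
	// Assumed workload: 110 pods, 2 containers each, 3 probes each, 1s period.
	withFix := socketsInTimeWait(110, 2, 3, 1, 1)
	fmt.Println(withFix, defaultEphemeralPorts) // prints "660 28232"

	// For comparison, the same workload with the Linux default TIME_WAIT of 60s.
	fmt.Println(socketsInTimeWait(110, 2, 3, 1, 60)) // prints "39600"
}
```

Under these assumptions, the 1-second TIME_WAIT keeps the estimate far below the size of the port range, while a 60-second TIME_WAIT would exceed it.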

Troubleshooting

Logs and Pod events can be used to troubleshoot probe failures.

How does this feature react if the API server and/or etcd is unavailable?

No dependency on etcd availability.

What are other known failure modes?

None

What steps should be taken if SLOs are not being met to determine the problem?

Make sure the feature gate is set.
Make sure the configuration is correct and the gRPC service is reachable by the kubelet. This may be different when migrating off https://github.com/grpc-ecosystem/grpc-health-probe and is covered in the feature documentation.
Analyze kubelet.log to understand why there is a mismatch between the service response and the status reported by the probe.

Implementation History

Original PR for k8 Prober: https://github.com/kubernetes/kubernetes/pull/89832
2020-04-04: MR for k8 Prober
2021-05-12: Cloned to this KEP to move the probe forward.
2021-05-13: Updates.

Alpha

Alpha feature was implemented in 1.23.

Beta

Feature is promoted to beta in 1.24.

GA

Feature is promoted to GA in 1.27.

Drawbacks

See Motivation on why gRPC was picked as another RPC framework to support natively.

Adding gRPC is a small increment to k8s functionality with very few side effects, yet it provides a lot of “quality of life improvements” to gRPC apps.

Alternatives

3rd party solutions like https://github.com/grpc-ecosystem/grpc-health-probe

References

GRPC healthchecking: https://github.com/grpc/grpc/blob/master/doc/health-checking.md

Infrastructure Needed (Optional)

Resources: Add Informer Metrics

KEP-4346: Add Informer Metrics

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Informers are a base component of most Kubernetes controllers, so it is important to have a way to check whether they are healthy. This enhancement proposal adds metrics to the client-go informer. It will expose internal reflector/queue/eventHandler metrics to Prometheus. These metrics are useful for developers and reliability engineers, who can rely on them to monitor informers.

Motivation

A Kubernetes controller watches objects for the desired state and the actual state, then sends instructions to bring the actual state closer to the desired state. Most controllers use an informer to watch for object changes and then send work items that require reconciliation to the workqueue.

The workqueue already exposes metrics such as queueLatency and workDuration, which are useful for finding issues in the reconcile routine. When many objects need to be reconciled but no new work items reach the workqueue, the informer is most likely blocked. An informer is composed of a reflector, a queue, and eventHandlers; to find the root cause today, users have to add debug logging and change the log level.

If the informer exposes reflector/queue/eventHandler metrics, it becomes easy to find out why the informer is blocked. For example, the metrics will show how long, in seconds, an eventHandler takes to process an item.

The reflector metrics were previously removed by https://github.com/kubernetes/kubernetes/pull/74636 . Fixing the memory leak issue behind that removal is essential.

Goals

Add metrics for informer
Expose informer reflector/queue/eventHandler metrics

Non-Goals

It does not introduce breaking changes for controllers which use informer.
It does not modify core Kubernetes components which use informer.
It does not list all possible informer metrics; more can be added as needed.

Proposal

Introduce the informer metrics struct informerMetrics, which contains queue/eventHandler metrics
Introduce the informer metrics provider interface informerMetricsProvider, implemented in k8s.io/component-base/metrics
Restore the deleted reflectorMetrics
Add a feature gate InformerMetrics to enable informer/reflector metrics

User Stories (Optional)

Story 1

client-go informers create a RingGrowing buffer, pendingNotifications, for every eventHandler. This RingGrowing will grow but never shrink. An informer can have several eventHandlers, and it is hard to tell which pendingNotifications buffer holds a lot of objects. The pendingNotifications metric will help developers identify the slow eventHandler.
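
The grow-but-never-shrink behavior described in this story can be illustrated with a simplified ring buffer. The ring type below is a stand-in written for this sketch, not the client-go implementation; its pending and capacity methods correspond to the two per-eventHandler values the proposal wants to expose.

```go
package main

import "fmt"

// ring is a simplified stand-in for a growing ring buffer.
type ring struct {
	data     []string
	beg      int // index of the oldest unread item
	readable int // number of unread items
}

func newRing(initialSize int) *ring {
	if initialSize < 1 {
		initialSize = 1
	}
	return &ring{data: make([]string, initialSize)}
}

// write appends an item, doubling the capacity when the buffer is full.
func (r *ring) write(item string) {
	if r.readable == len(r.data) {
		grown := make([]string, len(r.data)*2)
		for i := 0; i < r.readable; i++ {
			grown[i] = r.data[(r.beg+i)%len(r.data)]
		}
		r.data, r.beg = grown, 0
	}
	r.data[(r.beg+r.readable)%len(r.data)] = item
	r.readable++
}

// read removes and returns the oldest item; the capacity is left unchanged.
func (r *ring) read() (string, bool) {
	if r.readable == 0 {
		return "", false
	}
	item := r.data[r.beg]
	r.data[r.beg] = ""
	r.beg = (r.beg + 1) % len(r.data)
	r.readable--
	return item, true
}

func (r *ring) pending() int  { return r.readable }
func (r *ring) capacity() int { return len(r.data) }

func main() {
	r := newRing(2)
	for _, item := range []string{"a", "b", "c", "d", "e"} {
		r.write(item)
	}
	fmt.Println(r.pending(), r.capacity()) // prints "5 8"
	for r.pending() > 0 {
		r.read()
	}
	fmt.Println(r.pending(), r.capacity()) // prints "0 8"
}
```

After the backlog drains, the pending count returns to zero while the capacity stays at its high-water mark, which is why the two values are worth tracking separately.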

Story 2

Users want to know how often the reflector performs a LIST.

Story 3

It is hard to know how many items are in the informer queue/store. Adding metrics for the queue/store will help developers find the number of pending deltas.

Notes/Constraints/Caveats (Optional)

N/A

Risks and Mitigations

Informer metrics are disabled by default. When informer metrics are enabled, the newly added metrics will increase CPU and memory usage.

If the metrics cause a memory leak, users can disable the informer metrics.

Design Details

Add a feature gate InformerMetrics in client-go. It is disabled by default while in the Alpha state.

Informer metrics

Introduce the informer metrics structs informerMetrics and eventHandlerMetrics. They are similar to the existing workqueue metrics.

type informerMetrics struct {
	clock clock.Clock
	// total number of items in the store
	numberOfStoredItem GaugeMetric
	// total number of items in the queue
	numberOfQueuedItem GaugeMetric
	// metrics for each eventHandler
	eventHandlerMetrics map[string]eventHandlerMetrics
}

type eventHandlerMetrics struct {
	// number of pending notifications
	numberOfPendingNotifications GaugeMetric
	// size of the RingGrowing data
	sizeOfRingGrowing GaugeMetric
	// how long it takes to process an item from the informer reflector
	processDuration HistogramMetric
}

// MetricsProvider generates various metrics used by the informer.
type MetricsProvider interface {
	// name is the informer name
	NewStoredItemMetric(name string) GaugeMetric
	NewQueuedItemMetric(name string) GaugeMetric
	// name is the eventHandler name
	NewPendingNotificationsMetric(name string) GaugeMetric
	NewRingGrowingMetric(name string) GaugeMetric
	NewProcessDurationMetric(name string) HistogramMetric
}
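
To show how a provider of this shape might be consumed, here is a minimal in-memory sketch covering only the pending-notifications gauge. GaugeMetric is declared locally with Inc and Dec, mirroring the workqueue gauge interface; memGauge, memProvider, and handler are hypothetical names introduced for illustration and are not part of the proposal.

```go
package main

import (
	"fmt"
	"sync"
)

// GaugeMetric is declared locally, mirroring the workqueue gauge interface.
type GaugeMetric interface {
	Inc()
	Dec()
}

// memGauge is a hypothetical in-memory gauge.
type memGauge struct {
	mu    sync.Mutex
	value int
}

func (g *memGauge) Inc() { g.mu.Lock(); g.value++; g.mu.Unlock() }
func (g *memGauge) Dec() { g.mu.Lock(); g.value--; g.mu.Unlock() }

// memProvider hands out one pending-notifications gauge per eventHandler name.
type memProvider struct {
	mu      sync.Mutex
	pending map[string]*memGauge
}

func (p *memProvider) NewPendingNotificationsMetric(name string) GaugeMetric {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.pending == nil {
		p.pending = map[string]*memGauge{}
	}
	g := &memGauge{}
	p.pending[name] = g
	return g
}

// pendingValue reads the current gauge value for an eventHandler.
func (p *memProvider) pendingValue(name string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	g := p.pending[name]
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.value
}

// handler simulates an eventHandler that buffers notifications.
type handler struct {
	pending GaugeMetric
	queue   []string
}

func (h *handler) add(item string) { h.queue = append(h.queue, item); h.pending.Inc() }

func (h *handler) pop() bool {
	if len(h.queue) == 0 {
		return false
	}
	h.queue = h.queue[1:]
	h.pending.Dec()
	return true
}

func main() {
	p := &memProvider{}
	fast := &handler{pending: p.NewPendingNotificationsMetric("fast")}
	slow := &handler{pending: p.NewPendingNotificationsMetric("slow")}
	for _, item := range []string{"a", "b", "c"} {
		fast.add(item)
		slow.add(item)
	}
	for fast.pop() {
	}
	slow.pop()
	fmt.Println(p.pendingValue("fast"), p.pendingValue("slow")) // prints "0 2"
}
```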

Add the following Prometheus metrics under the informer subsystem:

name	labels	description
store_item_total	informer name	Total number of items in the store
queued_item_total	informer name	Total number of items in the queue
pending_notifications_total	eventHandler name	Total number of pending notifications in the eventHandler RingGrowing
ring_growing_capacity	eventHandler name	Capacity of the eventHandler RingGrowing
event_process_duration	eventHandler name	How long, in seconds, the eventHandler takes to process an item from RingGrowing

Reflector metrics

This change https://github.com/kubernetes/kubernetes/pull/74636 will be reverted.

Each set of reflector metrics contains 3 counters, 4 histograms (summaries in the removed implementation), and 1 gauge.

type reflectorMetrics struct {
	numberOfLists        CounterMetric
	listDuration         HistogramMetric
	numberOfItemsInList  HistogramMetric
	numberOfWatches      CounterMetric
	numberOfShortWatches CounterMetric
	watchDuration        HistogramMetric
	numberOfItemsInWatch HistogramMetric
	lastResourceVersion  GaugeMetric
}

According to kubernetes/kubernetes#73587, the memory leak is caused by summaries. It would be better to use histograms instead: histogram metrics are aggregatable and will reduce memory usage.
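
The aggregation property can be illustrated with a toy bucketed histogram. This is a sketch written for this document, not the Prometheus or component-base implementation, and the names histogram, observe, and merge are hypothetical. Each histogram keeps a fixed number of counters regardless of how many values it observes, and two histograms with identical bucket bounds combine by adding counts bucket by bucket.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// histogram is a toy bucketed histogram with fixed memory use.
type histogram struct {
	bounds []float64 // ascending upper bounds
	counts []uint64  // one count per bound, plus a final +Inf bucket
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe increments the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
}

// merge adds two histograms bucket by bucket; the bounds must be identical.
func merge(a, b *histogram) (*histogram, error) {
	if len(a.bounds) != len(b.bounds) {
		return nil, errors.New("bucket bounds differ")
	}
	for i := range a.bounds {
		if a.bounds[i] != b.bounds[i] {
			return nil, errors.New("bucket bounds differ")
		}
	}
	out := newHistogram(a.bounds...)
	for i := range out.counts {
		out.counts[i] = a.counts[i] + b.counts[i]
	}
	return out, nil
}

func main() {
	a := newHistogram(0.1, 1)
	a.observe(0.05)
	a.observe(0.5)

	b := newHistogram(0.1, 1)
	b.observe(0.5)
	b.observe(5)

	merged, err := merge(a, b)
	if err != nil {
		panic(err)
	}
	fmt.Println(merged.counts) // prints "[1 2 1]"
}
```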

Remove Metrics

When informers and reflectors are stopped, the corresponding metrics will be removed.

Kube component-base metrics support deleting metrics by matching labels.
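
Deletion by label match can be sketched with a toy in-memory series store. The types and method names below (gaugeVec, add, deletePartialMatch) are hypothetical and written only for illustration; they are not the component-base API. The label keys "name" and "handler" are likewise assumptions.

```go
package main

import "fmt"

// labeledGauge is one series: a label set and its value.
type labeledGauge struct {
	labels map[string]string
	value  float64
}

// gaugeVec is a toy collection of labeled series.
type gaugeVec struct {
	series []labeledGauge
}

func (v *gaugeVec) add(labels map[string]string, value float64) {
	v.series = append(v.series, labeledGauge{labels: labels, value: value})
}

// deletePartialMatch removes every series whose labels include all of the
// given key/value pairs and returns how many were removed.
func (v *gaugeVec) deletePartialMatch(match map[string]string) int {
	kept := v.series[:0]
	deleted := 0
	for _, s := range v.series {
		if matches(s.labels, match) {
			deleted++
			continue
		}
		kept = append(kept, s)
	}
	v.series = kept
	return deleted
}

func matches(labels, match map[string]string) bool {
	for key, want := range match {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	vec := &gaugeVec{}
	vec.add(map[string]string{"name": "podInformer", "handler": "a"}, 3)
	vec.add(map[string]string{"name": "podInformer", "handler": "b"}, 7)
	vec.add(map[string]string{"name": "nodeInformer", "handler": "a"}, 1)

	// Simulate stopping podInformer: drop all of its series.
	fmt.Println(vec.deletePartialMatch(map[string]string{"name": "podInformer"})) // prints "2"
	fmt.Println(len(vec.series))                                                  // prints "1"
}
```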

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

<package>: <date> - <test coverage>
Unit tests to ensure that the metrics output meets expectations.
Unit tests to ensure that the metrics deletion is functioning properly.

Integration tests

We will have extensive integration testing of the union code in the test/integration/metrics package.

When the InformerMetrics feature gate is enabled, ensure the metrics are exposed and that the metrics subsystem/label/granularity is correct.
When the informers and reflectors are stopped, ensure the corresponding metrics are removed.

e2e tests

Graduation Criteria

Alpha

Feature implemented behind a feature gate flag
Add related integration and unit tests to ensure functionality and make sure there is no memory leak in existing behavior

Beta

Gather feedback from developers and surveys
Work on feedback and add additional tests as needed

GA

Decision on GA will be made based on beta feedback

Upgrade / Downgrade Strategy

N/A

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: InformerMetrics
- Components depending on the feature gate:
  - components via client-go library

Does enabling the feature change any default behavior?

No. It does not change any default behavior. When this feature is enabled, it will increase memory usage in client-go.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling the InformerMetrics feature gate for components that use the client-go library. In this case, informers will no longer expose metrics.

What happens if we reenable the feature if it was previously rolled back?

The expected behavior of the feature will be restored.

Are there any tests for feature enablement/disablement?

For now, there are no tests for feature enablement/disablement. Unit and integration tests will be added.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Feature has no impact on rollout/rollback, and no impact on running workloads.

What specific metrics should inform a rollback?

Memory used by these metrics that keeps growing and reaches a significant amount.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet. In the alpha releases, we could test this.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

This feature does not deprecate or remove any features/APIs/fields/flags/etc.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The informer/reflector metrics (e.g., lists_total, watches_total) are populated.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details:
  - The following metrics are available when InformerMetrics is enabled:
    - lists_total
    - watches_total
    - last_resource_version
    - etc.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The feature gate will increase memory usage. The memory usage should not continuously grow. The informerMetrics / eventHandlerMetrics / reflectorMetrics memory consumption is in a stable state.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: Memory usage
- [Optional] Aggregation method:
- Components exposing the metric: Operating System/golang pprof

Are there any missing metrics that would be useful to have to improve observability of this feature?

Not at the moment.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Yes. The informer metrics will increase CPU/RAM usage.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Yes. When informer metrics are enabled, the kubelet will only see increased CPU/RAM usage.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

2023-11-29: Initial draft KEP

Drawbacks

N/A

Alternatives

N/A

Infrastructure Needed (Optional)

Resources: Add job creation timestamp to job annotations

KEP-4026: Add job creation timestamp to job annotations

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently, there is no supported way to get the original/expected initial scheduled timestamp for the job created from a cronjob. This KEP proposes to set the original scheduled time as an annotation in the job metadata.

Motivation

Goals

Set job scheduled timestamp as an annotation on the job.
Adding the annotation should not be disruptive to existing workloads.

Non-Goals

Proposal

At a high level, the proposal is to modify the CronJob controller to set the job scheduled timestamp as a job annotation. The details of this are outlined in the Design Details section below.

Job scheduled timestamp annotation: batch.kubernetes.io/cronjob-scheduled-timestamp

User Stories (Optional)

Story 1

As a user, I would like to get the timestamp at which a job was originally scheduled to run.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

CronJobs always work under the assumption that changes apply only to jobs created after the change. Therefore, the annotation will be injected only into Jobs that are newly created from CronJobs while the feature is on. This plays well with downgrades and doesn’t introduce unnecessary complexity.

Design Details

The CronJob controller will only need a minor update to the getJobFromTemplate2 function, to add the job scheduled timestamp as the job annotation batch.kubernetes.io/cronjob-scheduled-timestamp. The scheduled timestamp is represented in RFC3339.

For the scheduled timestamp’s timezone, the initial thought was to use UTC, since it is the primary timezone and causes less confusion. However, since the CronJob object has a spec.timeZone, it is better to use the same timezone within the same object. If spec.timeZone is not set or nil, the annotation will use the UTC timezone as a default.

Test Plan

I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/controller/cronjob: 09/24/2023 - 71.2%

Integration tests

No integration tests are planned for this feature.

e2e tests

CronJob should set the cronjob-scheduled-timestamp annotation: test coverage

Graduation Criteria

The feature will be released directly in Beta. There is no benefit in having an alpha release because we are simply adding a new annotation, so there is very little risk.

Beta

Feature implemented behind the CronJobsScheduledAnnotation feature gate.
Unit and e2e tests passing.

GA

Fix any potentially reported bugs.

Upgrade / Downgrade Strategy

No changes required to existing cluster to use this feature.

Version Skew Strategy

N/A. This feature doesn’t require coordination between control plane components; the changes to each controller are self-contained.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: CronJobsScheduledAnnotation
- Components depending on the feature gate: kube-controller-manager
Other
- Describe the mechanism: N/A.
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or re-provisioning of a node? No

Does enabling the feature change any default behavior?

Jobs newly created by the CronJob controller will contain a new annotation, batch.kubernetes.io/cronjob-scheduled-timestamp.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. If the feature gate is disabled, the CronJob controller will not add the scheduled timestamp as an annotation.

What happens if we reenable the feature if it was previously rolled back?

The CronJob controller will begin adding the scheduled timestamp as an annotation to jobs created while the feature is enabled, and existing jobs will be unaffected.

Are there any tests for feature enablement/disablement?

Given the feature results in adding an annotation only to newly created objects, those tests won’t really be different from the actual feature tests.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This change cannot cause a rollout or rollback to fail. It also will not impact already running workloads.

What specific metrics should inform a rollback?

Users can monitor the CronJob metrics job_creation_skew_duration_seconds, cronjob_controller_rate_limiter_use, and cronjob_job_creation_skew.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

The following manual upgrade->downgrade->upgrade scenario was performed:

1. Create a v1.27 cluster, where the feature is not yet available.
2. Create a CronJob and wait for jobs to be created. Verify that a newly created job does NOT have the batch.kubernetes.io/cronjob-scheduled-timestamp annotation.
3. Upgrade the cluster to v1.28, where the feature is available as beta, i.e. on by default. Verify that a newly created job from the CronJob created in step 2 has the batch.kubernetes.io/cronjob-scheduled-timestamp annotation with the planned time at which the job was to be created.
4. Downgrade the cluster to v1.27, where the feature is NOT available. Verify that a newly created job from the CronJob created in step 2 does NOT have the batch.kubernetes.io/cronjob-scheduled-timestamp annotation.

During the tests no problems were identified with cronjobs or jobs.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Spot-checking Jobs created by CronJobs for the batch.kubernetes.io/cronjob-scheduled-timestamp annotation is sufficient. For monitoring purposes, we can rely on pre-existing metrics that track both the cronjob queue and the job creation skew, which should provide sufficient signal that the controller is working as expected. For small clusters, checking the annotation is enough to determine that the feature is in use.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .metadata
- Condition name:
- Other field:
  - .metadata.annotations['batch.kubernetes.io/cronjob-scheduled-timestamp']

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The 99th percentile over a day for Job syncs is <= 15s with a client-side limit of 50 QPS.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: cronjob_job_creation_skew
- Components exposing the metric: kube-controller-manager
- Metric name: job_creation_skew_duration_seconds
- Components exposing the metric: kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, each job created by the CronJob controller will have an additional annotation containing an RFC3339 timestamp, which together with the annotation name results in ~70B per job object.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change compared to existing failure modes.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

The new annotation shouldn’t cause any unforeseen issues with the cronjob controller. In the event of issues with meeting SLOs, cluster admins are advised to consult the troubleshooting overview document.

Implementation History

2023-06-06: KEP published
2024-09-24: KEP updated for stable promotion

Drawbacks

Alternatives

Add label instead of annotation
- Labels are unnecessary because the data being passed is not meant to be used in searches or to satisfy selection conditions.
Add a status field
- The object already has the CreationTimestamp field, but it will get overridden with the time the CronJob will start. The point of the new annotation is to pass the original/expected scheduled timestamp information.

Infrastructure Needed (Optional)

N/A

Resources: Add kubelet instance configuration to configure CRI socket for each node

Resources: Add NonPreempting Option For PriorityClasses

Add NonPreempting Option For PriorityClasses

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories
- Risks and Mitigations
Design Details
- Testing Plan
- Graduation Criteria
Production Readiness Review Questionnaire
Implementation History

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
(R) Graduation criteria is in place
(R) Production readiness review completed
Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

PriorityClasses are a GA feature as of 1.14, and they impact the scheduling and eviction of pods. Pods are scheduled in descending order of priority. If a pod cannot be scheduled due to insufficient resources, lower-priority pods will be preempted to make room.

This proposal makes the preempting behavior optional for a PriorityClass, by adding a new field to PriorityClasses, which in turn populates PodSpec. If a pod is waiting to be scheduled, and it does not have preemption enabled, it will not trigger preemption of other pods.

Motivation

Allowing PriorityClasses to be non-preempting is important for running batch workloads.

Batch workloads typically have a backlog of work, with unscheduled pods. Higher-priority workloads can be assigned a higher priority via a PriorityClass, but this may result in pods with partially-completed work being preempted. Adding the non-preempting option allows users to prioritize the scheduling queue, without discarding incomplete work.

Goals

Add a boolean flag to PriorityClasses, to enable or disable preemption for pods of that PriorityClass.

Non-Goals

Protecting pods from preemption. PodDisruptionBudget should be used.

Proposal

Add a Preempting field to both PodSpec and PriorityClass. This field will default to true, for backwards compatibility.

If Preempting is true for a pod, the scheduler will preempt lower priority pods to schedule this pod, as is current behavior.

If Preempting is false, a pod of that priority will not preempt other pods.

Setting the Preempting field in PriorityClass provides a straightforward interface, and allows ResourceQuotas to restrict preemption.

PriorityClass type example:

type PriorityClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	Value         int32
	GlobalDefault bool
	Description   string
	Preempting    *bool // New option
}

The Preempting field in PodSpec will be populated during pod admission, similarly to how the PriorityClass Value is populated. Storing the Preempting field in the pod spec has several benefits:

The scheduler does not need to be aware of PriorityClasses, as all relevant information is in the pod.
Mutating PriorityClass objects does not impact existing pods.
Kubelets can set Preempting on static pods.

PodSpec type example:

type PodSpec struct {
	...
	Preempting *bool
	...
}
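The admission-time population described above can be sketched as follows. The types are trimmed-down stand-ins for the API types, and admitPod and mayPreempt are hypothetical helpers; the default of true for an unset Preempting field is taken from the proposal.

```go
package main

import "fmt"

// Trimmed-down, hypothetical stand-ins for the API types in the proposal.
type PriorityClass struct {
	Value      int32
	Preempting *bool
}

type PodSpec struct {
	Priority   *int32
	Preempting *bool
}

// admitPod copies the priority value and the preempting flag from the
// PriorityClass into the pod spec, defaulting Preempting to true.
func admitPod(spec *PodSpec, pc PriorityClass) {
	value := pc.Value
	spec.Priority = &value

	preempting := true
	if pc.Preempting != nil {
		preempting = *pc.Preempting
	}
	spec.Preempting = &preempting
}

// mayPreempt is the check a scheduler could make using only the pod spec.
func mayPreempt(spec PodSpec) bool {
	return spec.Preempting == nil || *spec.Preempting
}

func main() {
	no := false
	batch := PriorityClass{Value: 1000, Preempting: &no}
	legacy := PriorityClass{Value: 1000} // Preempting unset

	var a, b PodSpec
	admitPod(&a, batch)
	admitPod(&b, legacy)
	fmt.Println(mayPreempt(a), mayPreempt(b)) // false true
}
```

Because the flag is copied into the pod at admission, later edits to the PriorityClass object leave existing pods unchanged, which matches the second benefit listed above.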

This feature should be gated in alpha, provisionally under the gate NonPreemptingPriority.

Documentation must be updated to reflect the new feature, and changes to PriorityClass/PodSpec fields.

User Stories

A user is running batch workloads on a cluster. The user has a high-priority job, that they wish to schedule before other workloads in the queue. As the user does not want to preempt running batch workloads and discard work, the user creates the new workload with a high-priority, non-preempting PriorityClass. The new workload’s pods are scheduled ahead of the queue, without disrupting running workloads.

Users are able to run preempting and non-preempting workloads in a stable manner, and are not requesting additional changes.
The feature has been stable and reliable in at least 2 releases.
Adequate documentation exists for preemption and the optional field.
Test coverage includes non-preempting use cases.
Conformance requirements for non-preempting PriorityClasses are agreed upon.

Risks and Mitigations

The new feature may malfunction, or existing preemption functionality may be impaired. New tests (covering both non-preempting workloads and mixed workloads) and the existing preempting PriorityClass tests should be used to prove stability.

Design Details

Testing Plan

Add detailed unit and integration tests for nonpreempting workloads.

Add basic e2e tests, to ensure all components are working together.

Ensure existing tests (for preempting PriorityClasses) do not break.

Graduation Criteria

Alpha (v1.15):

Support NonPreemptingPriority in PriorityClasses

Beta (v1.19):

Add integration test for NonPreemptingPriority.
Graduate NonPreemptingPriority to Beta.
Update documents to reflect the changes.

Stable (v1.24):

No negative feedback.
Enhance the message of the existing event for scheduling failed to include details about preemption.
Graduate NonPreemptingPriority to GA.
Update documents to reflect the changes.

Production Readiness Review Questionnaire

Feature enablement and rollback

How can this feature be enabled / disabled in a live cluster?
- Feature gate
  - Feature gate name: NonPreemptingPriority
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-scheduler
Does enabling the feature change any default behavior? No
Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)? Yes. This feature can be disabled by restarting kube-apiserver and kube-scheduler with feature-gate turned off.
What happens if we reenable the feature if it was previously rolled back? If we reenable the feature, the Pod with high priority and NonPreemptionPolicy will be eligible to preempt other pods with low priority when cluster resources are tight.
Are there any tests for feature enablement/disablement? No

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads? If a rollout fails, kube-scheduler will keep crashing. Running workloads won’t be affected by kube-scheduler.
What specific metrics should inform a rollback? Check the following indicators to determine if there are any exceptions:
- pod_preemption_victims
- total_preemption_attempts
- scheduling_algorithm_preemption_evaluation_seconds
Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested? Manually tested successfully. The test environment version is v1.23. We tested enabling and disabling this feature. After each change to the feature gate, 3 separate PriorityClasses were recreated (one high-priority class with preemptionPolicy set to Never, another high-priority class with preemptionPolicy not set, and one low-priority class with preemptionPolicy not set). Multiple pods were then created with these 3 PriorityClasses to verify that the preemption results were as expected.
Is the rollout accompanied by any deprecations and/or removals of features? N/A.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The operator can determine if the workload is using the feature by checking if the priorityclass’s preemptionPolicy is set to “Never”.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason: There is an event sent by kube-scheduler if the pod preempts other pods. If the feature is working and the pod’s PriorityClass has preemptionPolicy set to Never, there won’t be a preemption-related event for this pod.
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details: Check whether pods with preemptionPolicy set to Never can preempt other low-priority pods when cluster resources are insufficient.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: preemption_victims
- [Optional] Aggregation method:
- Components exposing the metric: kube-scheduler
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

We currently only have events that describe a pod being preempted by another pod. But we don’t have an event that describes why sometimes the preemption is not successful. We can enhance the message of the existing event for scheduling failed to include details about preemption. This will help us to improve observability for this feature and other scenarios.

In addition to events, we can add metrics about how many pods have stopped preempting other pods because of this no-preemption option. However, since the probability of this metric being used is likely to be small, it was not added.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls? No
Will enabling / using this feature result in introducing new API types? No
Will enabling / using this feature result in any new calls to cloud provider? No
Will enabling / using this feature result in increasing size or count of the existing API objects? No
Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]? No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components? No

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable? N/A.
What are other known failure modes? N/A.
What steps should be taken if SLOs are not being met to determine the problem?

Errors from the preemption process are visible in logs.
Check the metrics below to determine whether there is an exception:

pod_preemption_victims
total_preemption_attempts
scheduling_algorithm_preemption_evaluation_seconds

Implementation History

2019-03-17: Initial KEP
2020-05-19: Graduate the feature to Beta
2022-01-15: Graduate the feature to GA

Resources: Add pod-startup liveness-probe holdoff for slow-starting pods

Add pod-startup liveness-probe holdoff for slow-starting pods

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Implementation History

Release Signoff Checklist

kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
KEP approvers have set the KEP status to implementable
Design details are appropriately documented
Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
Graduation criteria is in place
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Slow starting containers are difficult to address with the current status of health probes: they are either killed before being up, or could be left deadlocked for a very long time before being killed.

This proposal adds a new probe called startupProbe that holds off all the other probes until the pod has finished its startup. In the case of a slow-starting pod, it could poll on a relatively short period with a high failureThreshold. Once it is satisfied, the other probes can start.

Motivation

Slow starting containers here refer to containers that require a significant amount of time (one to several minutes) to start. There can be various reasons for this slow startup:

long data initialization: only the first startup takes a lot of time
heavy workload: every startup takes a lot of time
underpowered/overloaded node: startup times depend on external factors (however, solving node related issues is not a goal of this proposal)

The main problem with this kind of container is that it should be given enough time to start before livenessProbe fails failureThreshold times, which triggers a kill by the kubelet before the container has a chance to be up.

There are various strategies to handle this situation with the current API:

Delay the initial livenessProbe sufficiently to permit the container to start up (set initialDelaySeconds greater than startup time). While this ensures no livenessProbe will run and fail during the startup period (triggering a kill), it also delays deadlock detection if the container starts faster than initialDelaySeconds. Also, since the livenessProbe isn’t run at all during startup, there is no feedback loop on the actual startup time of the container.
Increase the allowed number of livenessProbe failures until kubelet kills the container (set failureThreshold so that failureThreshold times periodSeconds is greater than startup time). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.

However, none of these strategies provides a timely answer to slow starting containers stuck in a deadlock, which is the primary reason for setting up a livenessProbe.

Goals

Allow slow starting containers to run safely during startup with health probes enabled.
Improve documentation of the Probe structure in core types’ API.
Improve kubernetes.io/docs section about Pod lifecycle:
- Clearly state that PostStart handlers do not delay probe executions.
- Introduce and explain this new probe.
- Document appropriate use cases for this new probe.

Non-Goals

This proposal does not address the issue of pod load affecting startup (or any other probe that may be delayed due to load). It is acting strictly at the pod level, not the node level.
This proposal will only update the official Kubernetes documentation, excluding A Pod’s Life and other well referenced pages explaining probes.

Proposal

Implementation Details

The proposed solution is to add a new probe named startupProbe in the container spec of a pod which will determine whether it has finished starting up.

It also requires keeping the state of the container (has the startupProbe ever succeeded?) using a boolean Started inside the ContainerStatus struct.

Depending on Started the probing mechanism in worker.go might be altered:

Started == true: the kubelet worker works the same way as today
Started == false: the kubelet worker only probes the startupProbe

If startupProbe fails more than failureThreshold times, the result is the same as today when livenessProbe fails: the container is killed and might be restarted depending on restartPolicy.

If no startupProbe is defined, Started is initialized with true.
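The gating on Started can be sketched as a small state machine. The worker type below is a hypothetical stand-in, not the kubelet's worker.go. It also assumes that the container is killed once consecutive startupProbe failures reach failureThreshold, which is one reading of "fails more than failureThreshold times".

```go
package main

import "fmt"

// worker is a hypothetical, trimmed-down stand-in for the kubelet probe worker.
type worker struct {
	failureThreshold int  // failureThreshold of the startupProbe
	started          bool // mirrors ContainerStatus.Started
	failures         int  // consecutive startupProbe failures
}

// newWorker initializes Started with true when no startupProbe is defined.
func newWorker(hasStartupProbe bool, failureThreshold int) *worker {
	return &worker{failureThreshold: failureThreshold, started: !hasStartupProbe}
}

// activeProbes lists which probes run in the current state.
func (w *worker) activeProbes() []string {
	if !w.started {
		return []string{"startupProbe"}
	}
	return []string{"livenessProbe", "readinessProbe"}
}

// observeStartup records one startupProbe result and reports whether the
// container should be killed (assumed: failures >= failureThreshold).
func (w *worker) observeStartup(success bool) (kill bool) {
	if success {
		w.started = true
		w.failures = 0
		return false
	}
	w.failures++
	return w.failures >= w.failureThreshold
}

func main() {
	w := newWorker(true, 3)
	fmt.Println(w.activeProbes()) // [startupProbe]

	w.observeStartup(false) // still starting
	w.observeStartup(true)  // startup finished
	fmt.Println(w.activeProbes()) // [livenessProbe readinessProbe]
}
```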

Why a new probe instead of initializationFailureThreshold

While trying to merge PR #1014 in time for code-freeze, @thockin made the following points, which I agree with:

I feel pretty strongly that something like a startupProbe would be net simpler to comprehend than a new field on liveness.

In issuecomment-437208330 we looked at a different take on this API - it is more precise in its meaning and rather than add yet another behavior modifier to probe, it can reuse the probe structure directly.

Here is the excerpt of issuecomment-437208330 talking about the design:

An idea that I toyed with but never pursued was a StartupProbe - all the other probes would wait on it at pod startup. It could poll on a relatively short period with a long FailureThreshold. Once it is satisfied, the other probes can start.

I also think the third probe gives more flexibility if we find other good reasons to inhibit livenessProbe or readinessProbe before something occurs during container startup.

Configuration example

This example shows how startupProbe can be used to emulate the functionality of initializationFailureThreshold as it was proposed before:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30  # = initializationFailureThreshold
  periodSeconds: 10

Design Details

Test Plan

Unit tests will be implemented with newTestWorker and will check the following:

proper initialization of Started to false
Started becomes true as soon as startupProbe succeeds
livenessProbe and readinessProbe are disabled until Started is true
startupProbe is disabled after Started becomes true
failureThreshold exceeded for startupProbe kills the container

E2e tests will also cover the main use-case for this probe:

startupProbe disables livenessProbe long enough to simulate a slow starting container, using a high failureThreshold

Feature Gate

Expected feature gate key: StartupProbeEnabled
Expected default value: false

Graduation Criteria

Alpha: Initial support for startupProbe added. Disabled by default.
Beta: startupProbe enabled with no default configuration.
Stable: startupProbe enabled with no default configuration.

Implementation History

2018-11-27: prototype implemented in PR #71449 under review
2019-03-05: present KEP to sig-node
2019-04-11: open issue in enhancements #950
2019-05-01: redesign to additional probe after @thockin proposal
2019-05-02: add test plan

Version 1.16

Implement startupProbe as Alpha #77807
Cherry pick of #82747 #83607

Version 1.17

Fix startup_probe_test.go failing test #82747
Add startupProbe result handling to kuberuntime #84279
Clarify startupProbe e2e tests #84291

Version 1.18

Graduate startupProbe to Beta #83437
Cherry pick of #92196 #92477

Version 1.19

Pods which have not “started” can not be “ready” #92196

Version 1.20

Graduate startupProbe to GA #94160

Resources: Add ProcMount option

KEP-4265: add ProcMount option

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

For Linux containers, the Kubelet instructs container runtimes to mask and set as read-only certain paths in /proc. This is to prevent data from being exposed into a container that should not be. However, there are certain use-cases where it is necessary to turn this off.

This KEP proposes adding a field to the Pod security context to allow bypassing the usual restrictions.

In 1.12, this was introduced as the ProcMountType feature gate, and it has languished in alpha ever since. This KEP is a successor to (and heavily based on) https://github.com/kubernetes/community/pull/1934 , updated for the modern era.

Motivation

Some end users would like to run unprivileged containers nested inside a Kubernetes container using user namespaces. The outer container is started by the CRI implementation. Kubernetes defaults to masking the /proc mount of a container, setting some paths as read only. To run a nested container within an unprivileged Pod, a user would need a way to override that default masking behavior.

Please see the following filed issues for more information:

Goals

Allow users to opt out of the CRI masking /proc for Linux containers.

Non-Goals

Proposal

Add a new string named procMount to the securityContext definition for choosing from a set of proc mount isolation mode options.

The default for procMount is Default, which instructs the container runtime to mask the aforementioned paths.

This will look like the following in the spec:

type ProcMountType string

const (
 // DefaultProcMount uses the container runtime default ProcType. Most 
 // container runtimes mask certain paths in /proc to avoid accidental security
 // exposure of special devices or information.
 DefaultProcMount ProcMountType = "Default"

 // UnmaskedProcMount bypasses the default masking behavior of the container
 // runtime and ensures the newly created /proc of the container stays intact
 // with no modifications.
 UnmaskedProcMount ProcMountType = "Unmasked"
)

procMount *ProcMountType

where nil is default, and is interpreted as “Default” ProcMountType.

When the kubelet is presented with a pod whose ProcMountType is Unmasked, it replaces the default list of masked paths with an empty list and passes that empty list down to the CRI in the CRI request.
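
The mapping from the procMount field to the masked-path list can be sketched as follows. This is a minimal illustration rather than the kubelet's actual code: the helper names effectiveProcMount and maskedPathsFor are hypothetical, and defaultMaskedPaths is copied from the paths listed later in this KEP, so the authoritative list may differ.

```go
package main

import "fmt"

// ProcMountType mirrors the type proposed in this KEP.
type ProcMountType string

const (
	DefaultProcMount  ProcMountType = "Default"
	UnmaskedProcMount ProcMountType = "Unmasked"
)

// defaultMaskedPaths is an illustrative copy of the paths listed in this KEP;
// the authoritative list lives in the kubelet and may differ.
var defaultMaskedPaths = []string{
	"/proc/asound", "/proc/acpi", "/proc/kcore", "/proc/keys",
	"/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats",
	"/proc/sched_debug", "/proc/scsi", "/sys/firmware",
}

// effectiveProcMount (hypothetical helper) treats a nil pointer as Default.
func effectiveProcMount(pm *ProcMountType) ProcMountType {
	if pm == nil {
		return DefaultProcMount
	}
	return *pm
}

// maskedPathsFor (hypothetical helper) returns the masked paths that would be
// handed to the runtime: the default list, or an empty list for Unmasked.
func maskedPathsFor(pm *ProcMountType) []string {
	if effectiveProcMount(pm) == UnmaskedProcMount {
		return []string{}
	}
	// Return a copy so callers cannot mutate the shared default slice.
	return append([]string(nil), defaultMaskedPaths...)
}

func main() {
	unmasked := UnmaskedProcMount
	fmt.Println("default:", maskedPathsFor(nil))
	fmt.Println("unmasked:", maskedPathsFor(&unmasked))
}
```

Returning an empty (rather than nil) slice for Unmasked makes the "explicitly no masking" case distinguishable from "field not set" in this sketch.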

This requires changes to the CRI runtime integrations so that kubelet will add the specific unmasked option. This was done after alpha:

CRI-O has support in v1.25.0 after https://github.com/cri-o/cri-o/pull/6025/commits/4102586132214263c5d0ae93ec257432653ab82b
containerd has support in 1.6. See https://github.com/containerd/containerd/pull/5070/commits/07f1df4541d6a81c205d194f4f6ea3a6a95c3e29

The main use case for unmasking paths in /proc is a user nesting unprivileged containers within a container. However, having an Unmasked ProcMountType is a privileged operation, and thus is part of the privileged Pod Security Admission (PSA) policy. Since a user must be in the privileged policy, they are also trusted to choose the correct user ID and run a workload that won’t interfere with the host.

A container running as the root user on the host with an unmasked /proc may be able to write to the host /proc, and thus this privileged designation is appropriate.

User Stories (Optional)

Story 1

As a cluster admin, I would like a way to nest containers within containers. To do so, the top-level containers need an unmasked /proc.

Story 2

As a kubernetes user, I may want to build containers from within a kubernetes container. See this article for more information .

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

A user turning this on without user namespaces enabled
- Admission should deny a pod that tries to use ProcMountType: Unmasked with HostUsers: true
More trust in user namespacing/the kernel instead of container runtime
- This is probably the correct direction to head in.
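
The first mitigation above amounts to a simple validation rule. The sketch below uses trimmed-down, hypothetical stand-ins (podSpec, container, validateProcMount) for the real API types and validation code, and assumes that an unset hostUsers means the pod shares the host user namespace.

```go
package main

import (
	"errors"
	"fmt"
)

type ProcMountType string

const (
	DefaultProcMount  ProcMountType = "Default"
	UnmaskedProcMount ProcMountType = "Unmasked"
)

// container and podSpec are hypothetical, reduced versions of the API types;
// only the fields needed for this check are modelled.
type container struct {
	Name      string
	ProcMount *ProcMountType
}

type podSpec struct {
	HostUsers  *bool // assumption: nil is treated the same as true
	Containers []container
}

var errUnmaskedNeedsUserNS = errors.New("procMount Unmasked requires hostUsers: false")

// validateProcMount rejects any container that requests Unmasked while the
// pod still uses the host user namespace.
func validateProcMount(spec podSpec) error {
	usesHostUsers := spec.HostUsers == nil || *spec.HostUsers
	if !usesHostUsers {
		return nil
	}
	for _, c := range spec.Containers {
		if c.ProcMount != nil && *c.ProcMount == UnmaskedProcMount {
			return fmt.Errorf("container %q: %w", c.Name, errUnmaskedNeedsUserNS)
		}
	}
	return nil
}

func main() {
	unmasked := UnmaskedProcMount
	userNS := false
	spec := podSpec{Containers: []container{{Name: "builder", ProcMount: &unmasked}}}
	fmt.Println(validateProcMount(spec)) // rejected: hostUsers is unset
	spec.HostUsers = &userNS
	fmt.Println(validateProcMount(spec)) // accepted: pod has its own user namespace
}
```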

Design Details

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/securitycontext: 10-05-2023 - 70.04

Integration tests

N/A (Kubelet barely defines integration tests today, focusing on e2e_node tests instead)

e2e tests

test/e2e_node
additional tests should be added to e2e_node suite to test the adherence of the ProcMount field
- Test default behavior actually masks /proc paths.
- Test Unmasked behavior is not masking /proc paths.
- Test PSA integration (if possible to test in e2e)
- Test that a Windows pod cannot be scheduled when a value for ProcMount is specified

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Add e2e tests for the feature (must be done before beta)
- Including ones for enabling/disabling the feature

Beta

Explicitly require hostUsers option to be false if this option is enabled.
- Otherwise, this option effectively becomes another “privileged” field

GA

Allowing time for feedback

Upgrade / Downgrade Strategy

Turn off the feature gate to turn off the feature.

Version Skew Strategy

The feature gate is only processed by the API server; the Kubelet has no awareness of it. The API server will scrub the ProcMount field from the request if it doesn’t support the feature gate. Since all supported Kubelet versions support the ProcMountType field, there’s no version skew worry. The API server can have the feature gate toggled without worrying about doing the same for Kubelets.
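
The scrubbing step can be pictured with the sketch below. It is an assumption-laden simplification: securityContext and dropProcMount are hypothetical names, the gate is passed in as a plain boolean, and any special handling for objects that already carry the field is left out.

```go
package main

import "fmt"

type ProcMountType string

const UnmaskedProcMount ProcMountType = "Unmasked"

// securityContext is a hypothetical, reduced version of the API type.
type securityContext struct {
	ProcMount *ProcMountType
}

// dropProcMount (hypothetical helper) clears the field when the gate is off,
// so the kubelet only ever sees nil, which it interprets as Default.
func dropProcMount(sc *securityContext, procMountTypeEnabled bool) {
	if sc == nil || procMountTypeEnabled {
		return
	}
	sc.ProcMount = nil
}

func main() {
	unmasked := UnmaskedProcMount

	gateOff := &securityContext{ProcMount: &unmasked}
	dropProcMount(gateOff, false)
	fmt.Println("gate off, field cleared:", gateOff.ProcMount == nil)

	gateOn := &securityContext{ProcMount: &unmasked}
	dropProcMount(gateOn, true)
	fmt.Println("gate on, field kept:", gateOn.ProcMount != nil)
}
```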

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: ProcMountType
- Components depending on the feature gate: kube-apiserver (kube-apiserver filters procMount field if it’s not enabled).

Does enabling the feature change any default behavior?

No, only gives a user access to the Unmasked ProcMountType

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. This can be done by removing the feature gate from all kube-apiservers. To fully roll back, the nodes will need to be drained or rebooted, as the Kubelet will not remove the procMount of an already running container.

What happens if we reenable the feature if it was previously rolled back?

Nothing special. The pod’s procMount field depends on where in the enablement process the kube-apiserver was when it was created. The container has to be restarted to be up to date with the kube-apiserver.

Are there any tests for feature enablement/disablement?

Yes. I have manually tested feature enablement and disablement on kube-apiserver, and verified that pods are not recreated without a drain. There will be an e2e test to verify this as well.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

It cannot. Either the kube-apiserver has the feature gate on or not. If it has it on, then workloads with the feature enabled will get an Unmasked ProcMountType if they request it. If it’s off, then the kube-apiserver will force it to default, and the container’s creation will move forward without an Unmasked ProcMountType.

Already running workloads aren’t stopped and restarted on a feature revert, so an admin would need to reboot or drain to impact running workloads.

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

The behavior of this feature has been consistent for more than 10 minor releases, so these tests are less relevant now. Put differently: there is no upgrade->downgrade->upgrade path between supported versions of kubernetes that support this feature.

Manual testing has been done between versions that do support it, toggling the feature on and off. In these cases, the feature works as described.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

kubectl get pods --all-namespaces -o jsonpath="{range .items[*]}{.metadata.name}{' '}{.spec.containers[*].securityContext.procMount}{'\n'}{end}" | grep -i unmasked

This prints the name of every pod that has an Unmasked ProcMountType, along with the ProcMountType value.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details: Containers created with the Unmasked ProcMountType have the following paths writable, not read-only:
  - “/proc/asound”,
  - “/proc/acpi”,
  - “/proc/kcore”,
  - “/proc/keys”,
  - “/proc/latency_stats”,
  - “/proc/timer_list”,
  - “/proc/timer_stats”,
  - “/proc/sched_debug”,
  - “/proc/scsi”,
  - “/sys/firmware”,
- Another option is to run kubectl exec $podname -- mount | grep /proc.
  - If there’s just one mount, and it looks like proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) this is an unmasked /proc
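
The mount-based check can be automated with a small helper. The sketch below is a heuristic under stated assumptions: looksUnmasked is a hypothetical function, and the sample mount lines in main are made-up inputs for illustration, not captured output from a real container.

```go
package main

import (
	"fmt"
	"strings"
)

// looksUnmasked (hypothetical helper) applies the heuristic described above to
// the text printed by `mount`: exactly one line mentions /proc, and that line
// is the proc filesystem itself mounted on /proc.
func looksUnmasked(mountOutput string) bool {
	var procLines []string
	for _, line := range strings.Split(mountOutput, "\n") {
		if strings.Contains(line, "/proc") {
			procLines = append(procLines, line)
		}
	}
	return len(procLines) == 1 &&
		strings.HasPrefix(procLines[0], "proc on /proc type proc")
}

func main() {
	// Illustrative inputs only.
	unmasked := "overlay on / type overlay (rw,relatime)\n" +
		"proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)\n"
	masked := unmasked +
		"tmpfs on /proc/acpi type tmpfs (ro,relatime)\n" +
		"proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)\n"

	fmt.Println("unmasked sample:", looksUnmasked(unmasked))
	fmt.Println("masked sample:", looksUnmasked(masked))
}
```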

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

No noticeable change in pod start times when this feature is enabled.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: kubelet_pod_start_sli_duration_seconds
- [Optional] Aggregation method:
- Components exposing the metric: kubelet

I don’t think any would be useful.

Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

I don’t think any would be useful.

Dependencies

Does this feature depend on any specific services running in the cluster?

A CRI implementation that supports this feature
- All supported versions currently do.

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

ProcMountType in the pod spec

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

There is one additional field in the pod API: procMount. It is an enum with two values: Default and Unmasked. The Kubelet also passes the MaskedPaths to the CRI, which involves a single slice of strings. When the value Default is chosen, the slice is the default list of masked paths defined in the Kubelet. If Unmasked, the slice is empty. Both of these are size changes on the order of bytes and can be considered negligible.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Potentially, a malicious user given access and running a root container in the host context could interfere with host processes. PSA has already been configured to mitigate this by requiring a user to be in a privileged namespace to get access to the field.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No effect

What are other known failure modes?

Malicious user gaining access to the host /proc with a rootful container

admission should be updated to deny unmasked ProcMountType without user namespaces (hostUsers: true)

What steps should be taken if SLOs are not being met to determine the problem?

The field can be unset in a pod spec (or the feature gate turned off) to see if SLOs are met after the feature is disabled for pods.

Implementation History

2018-05-07: k/community update opened
2018-05-27: k/kubernetes PR merged with support.
2023-10-02: KEP opened and retargeted at Alpha
2024-02-26: Update Unmasked ProcMountType to fail validation without a pod level user namespace.
2024-05-31: Added e2e tests
2024-05-31: KEP updated to Beta
2025-01-31: KEP updated to on by default Beta
2026-01-29: KEP updated to GA

Alternatives

--oci-worker-no-process-sandbox like in BuildKit
- Not broadly supported with other container runtimes/builders.
Update the kernel to allow mounting a new procfs with masks.
- Proposed, but denied in the kernel
Adopt a similar approach to LXD where /proc and /sys are mounted to different locations within the container, instead of masked.
Give all pods with hostUsers: false (pod level user namespace) access to these mounts by default
- Even though it is potentially safe, it opens an argument that user namespaced pods are less secure than non user namespaced pods. The weakening of these boundaries should be opt-in.
Ditch this option
- Most use cases don’t really need this. However, if a pod wants to be able to, for instance, set its own sysctls, it would need this option.

Infrastructure Needed (Optional)

Resources: Add Recreate Update Strategy to StatefulSet

KEP-3541: Add Recreate Update Strategy to StatefulSet

Release Signoff Checklist
Summary
Motivation
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
- Downtime Requirement
- Limited Rollback Options
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

StatefulSets currently offer two update strategies: OnDelete (manual) and RollingUpdate (automatic, default). When using RollingUpdate with the default podManagementPolicy: OrderedReady, StatefulSets follow sequential ordering where each individual pod must be Running and Ready before the controller proceeds to update the next pod. Even with the maxUnavailable option (which allows multiple pods to be updated simultaneously), the controller still requires each pod to reach Ready state before moving forward, so a stuck pod halts the entire update process. While podManagementPolicy: Parallel allows pods to be updated simultaneously without waiting for Ready state, stuck pods remain and are not automatically replaced. This design ensures data safety for stateful workloads but creates a critical operational problem.

When a StatefulSet update results in pods that fail to reach Ready state (due to configuration errors, resource constraints, etc.), the rolling update process becomes permanently stuck. Even after applying a corrected configuration, the controller will not automatically replace the broken pods, requiring manual intervention to delete stuck pods.

This behavior has generated significant user frustration across multiple GitHub issues (#67250, #60164, #109597), with users reporting:

Broken CI/CD pipelines requiring manual intervention
Inability to automatically recover from configuration mistakes
Operational burden in managing stateful applications

This KEP proposes adding a new Recreate update strategy to StatefulSets, mirroring the behavior of Deployments’ Recreate strategy. This strategy deletes all pods, waits for full termination, then creates new pods according to podManagementPolicy. This provides a simple, predictable way to handle stuck pods and enables automated recovery for workloads that can tolerate downtime (CI/CD environments, stateless applications using StatefulSet for pod identity, applications with external data storage, and use cases like LeaderWorkerSet). The Recreate strategy offers a clean parallel with existing Kubernetes patterns, simplifies controller logic, and provides users with explicit control over update behavior.

Motivation

Current Behavior and Problems

StatefulSets with RollingUpdate strategy follow this algorithm:

if podManagementPolicy: OrderedReady (default)
1. Update pods in reverse ordinal order (N-1, N-2, …, 0)
2. For each pod, wait until it becomes Running and Ready before proceeding to the next
3. If any pod fails to become Ready, the entire update process halts
4. Even when a corrected configuration is applied, stuck pods are never automatically replaced
if podManagementPolicy: Parallel
1. Update all pods simultaneously (or up to maxUnavailable at a time if specified)
2. Pods are created/deleted without waiting for Ready state
3. Stuck pods do not block other pods from being updated
4. Even when a corrected configuration is applied, stuck pods are never automatically replaced

The current approach was designed for stateful workloads where data persistence is critical, pod identity and storage are tightly coupled, or automatic pod deletion could cause data loss.

This behavior has significant impact across multiple scenarios:

CI/CD Pipeline Failures: Teams report broken deployments that require manual intervention, breaking automation:

# Example: A typo in image name breaks the entire update
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0-typo  # ImagePullBackOff
        # Update gets stuck, requires manual pod deletion

Operational Overhead: Platform teams must either build custom controllers or intervene manually to handle stuck updates.

Why Existing Solutions Are Insufficient

MaxUnavailable doesn’t address the core issue. The maxUnavailable option in RollingUpdate strategy allows multiple pods to be updated simultaneously, but its behavior depends on podManagementPolicy.
```
spec:
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2 # Can update 2 pods at once
```
With podManagementPolicy: Parallel + maxUnavailable: 2, multiple pods can be updated simultaneously, but if any pod fails to reach Ready state, it remains stuck and requires manual cleanup. Stuck pods don’t block other pods from updating, but they are never automatically replaced (see section 2 below).

With podManagementPolicy: OrderedReady, updates happen one pod at a time in reverse ordinal order. If any pod fails to reach Ready state, the entire rolling update process halts completely, even with maxUnavailable configured. The controller waits indefinitely for stuck pods to become Ready.

Example Scenario with podManagementPolicy: OrderedReady:
- StatefulSet with 5 replicas
- Update pod app-4 first
- app-4 gets stuck in ImagePullBackOff
- Even after fixing the image name, app-4 remains stuck
- Update process cannot proceed to app-3, app-2, app-1, or app-0
- Manual intervention still required: kubectl delete pod app-4

Custom Controllers: Some teams have built custom controllers to delete stuck pods, but this:
- Duplicates StatefulSet controller logic
- Creates maintenance burden
- May conflict with StatefulSet controller behavior
- Lacks integration with StatefulSet status and events

Proposed Solution Benefits

Adding Recreate update strategy to StatefulSets addresses these issues by:

Clearing and replacing stuck pods during updates without manual intervention
Keeping the algorithm clean, with no complexity around timeout tracking or transient failure detection
Staying consistent with an existing Kubernetes pattern, the Deployment Recreate strategy
Handling all stuck scenarios, regardless of whether pods are stuck in ImagePullBackOff, Pending, CrashLoopBackOff, or any other state

Goals

Add a new Recreate update strategy type to StatefulSet, providing a third option alongside OnDelete and RollingUpdate
Align StatefulSet update strategies with Deployment patterns for API consistency
Enable automated recovery from stuck pod states without manual intervention
Provide a simple, predictable update behavior for workloads that can tolerate downtime
Support use cases like CI/CD environments, stateless applications, external storage applications, and LeaderWorkerSet patterns
Add Progressing state condition to StatefulSet status for all strategies

Non-Goals

Change default behavior of StatefulSet updates (opt-in via explicit type: Recreate configuration)
Add timeout-based progressive failure detection (use Recreate for simplicity)
Change Recreate deletion semantics (all pods are always deleted simultaneously but recreate ordering follows podManagementPolicy)
Replace Deployment-style revision management (StatefulSets continue to directly manage Pods)

Proposal

User Stories

Story 1: CI/CD Platform Team

Context: A platform team manages hundreds of StatefulSet deployments across development and staging environments. Their CI/CD system requires end-to-end automation, but StatefulSet rolling updates break automation when pods get stuck. The team either has to implement custom “garbage collection” logic or accept that automated deployments will fail and require manual intervention. Since these are non-production environments, downtime during updates is acceptable.

Solution: With updateStrategy: type: Recreate configured, when an update with incorrect configuration is applied, all pods are deleted and new pods are created. If they fail, the deployment fails quickly and clearly. When a corrected configuration is applied, the Recreate strategy deletes all broken pods and creates fresh ones, allowing the CI/CD pipeline to complete without manual intervention. The downtime is acceptable in CI/CD environments where fast, automated recovery is more important than uptime.

Story 2: Stateless Web Application

Context: A web application uses StatefulSet for predictable pod naming but doesn’t store critical data locally. When resource limit typos cause pods to get stuck in Pending state, the entire update halts even though pod replacement is safe. The application can tolerate brief downtime during updates.

Solution: With updateStrategy: type: Recreate configured, when an update encounters issues, all pods are deleted and recreated cleanly. This eliminates the need for manual pod deletion since stuck pods are automatically cleared. The brief downtime is acceptable for this stateless application that primarily uses StatefulSet for pod identity rather than stateful semantics.

Story 3: Development/Experiment Environment

Context: Developers using StatefulSet for experiments face constant frustration - every time a rolling update breaks due to configuration errors, they must manually delete stuck pods after applying fixes. This manual intervention disrupts the development workflow. Uptime is not a concern in development environments.

Solution: With updateStrategy: type: Recreate configured, developers get fast, clean resets - when an update fails, applying a fix automatically deletes all broken pods and creates fresh ones. This enables a smoother development experience without requiring cluster operator intervention or manual pod cleanup. The Recreate strategy’s simplicity makes it ideal for rapid iteration in development.

Story 4: External Data Storage

Context: A database application stores all persistent data on network-attached storage (not local pod storage). Pod replacement is completely safe since no local data would be lost, but the StatefulSet controller treats it as a traditional stateful workload and requires manual intervention. The application can tolerate brief downtime for clean updates.

Solution: With updateStrategy: type: Recreate configured, the controller automatically deletes and recreates all pods during updates, which is safe for this architecture since all data persists externally. The Recreate strategy provides clean, predictable updates without concerns about stuck pods, and the brief downtime is acceptable given the data safety guarantees from external storage.

Story 5: LeaderWorkerSet (LWS) Use Case

Context: Developers use StatefulSet as the high-level controller workload for LWS . However, it behaves more like a Deployment - there’s no ordering dependency between different replicas. They only need the ordinal index for pod identification. When a replica fails during updates, the entire StatefulSet update gets stuck, even though there’s no actual ordering requirement between replicas. The LeaderWorkerSet pattern can tolerate brief downtime for updates.

Solution: With updateStrategy: type: Recreate configured, all replicas are cleanly deleted and recreated during updates, eliminating stuck pod scenarios entirely. This aligns perfectly with the deployment-like nature of LWS workloads, providing simple and predictable updates for applications that use StatefulSet primarily for pod identity rather than traditional stateful semantics. The Recreate strategy’s “all or nothing” approach matches the LWS pattern where all workers restart together.

Notes/Constraints/Caveats

Strategy Type Change Does Not Trigger Rollout: changing only .spec.updateStrategy.type from RollingUpdate to Recreate (or vice versa) does not trigger a new rollout. This is consistent with Deployment behavior. The StatefulSet controller uses the controller-revision-hash label to identify pod revisions, which is computed from .spec.template content only.

The Recreate behavior will only be triggered when users either:

Make a change to .spec.template
Force a rollout using kubectl rollout restart
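
The reason a strategy change alone does not trigger a rollout can be illustrated with a toy hash. The sketch below is not the controller's real revision hashing: statefulSetSpec, podTemplate, and revisionHash are hypothetical, and FNV over the JSON-encoded template simply stands in for "a hash computed from .spec.template only".

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// podTemplate and statefulSetSpec are hypothetical, reduced types.
type podTemplate struct {
	Image string `json:"image"`
}

type statefulSetSpec struct {
	UpdateStrategyType string      `json:"updateStrategyType"`
	Template           podTemplate `json:"template"`
}

// revisionHash (hypothetical) hashes only the template, so fields outside the
// template, such as the update strategy type, cannot influence the result.
func revisionHash(spec statefulSetSpec) (string, error) {
	raw, err := json.Marshal(spec.Template)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(raw)
	return fmt.Sprintf("%x", h.Sum32()), nil
}

func main() {
	rolling := statefulSetSpec{UpdateStrategyType: "RollingUpdate", Template: podTemplate{Image: "myapp:v1"}}
	recreate := rolling
	recreate.UpdateStrategyType = "Recreate"
	newImage := recreate
	newImage.Template.Image = "myapp:v2"

	h1, _ := revisionHash(rolling)
	h2, _ := revisionHash(recreate)
	h3, _ := revisionHash(newImage)
	fmt.Println("strategy change alters revision:", h1 != h2)
	fmt.Println("template change alters revision:", h2 != h3)
}
```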

Risks and Mitigations

Risk: Unintended Data Loss

Risk Description: If Recreate strategy is used on StatefulSets with local persistent data and PersistentVolumeClaims, the downtime could affect applications expecting sequential updates. However, data on PVCs is preserved since Recreate only deletes pods, not volumes.

Mitigation Strategies:

Documentation: Clear guidance on when to use Recreate strategy - suitable for workloads that can tolerate downtime
No Default Change: Opt-in behavior - existing workloads continue using safe RollingUpdate (current behavior unchanged)
Explicit Strategy Selection: Users must explicitly set type: Recreate, preventing accidental usage
Clear Events: Events emitted during the recreate process to show deletion and recreation phases
Status Conditions: StatefulSet status clearly reflects the recreate process state
PVC Preservation: PersistentVolumeClaims are not deleted, so data on volumes persists across recreate operations

Design Details

Detailed Algorithm Specification

Current RollingUpdate Algorithm

FOR i = replicas-1 DOWN TO 0 DO
    IF pod[i] needs update THEN
        wait_for_predecessors_ready(i+1 to replicas-1)
        IF !pod[i].Running OR !pod[i].Ready THEN
            return // STUCK - wait for manual intervention
        ENDIF
        update_pod(i)
        wait_until_ready(pod[i])
    ENDIF
ENDFOR

The algorithm halts when pod[i] is not Running or Ready, even if a fix is applied.

Proposed Recreate Strategy Algorithm

// Recreate Strategy Algorithm
// Uses controller-revision-hash label to identify pod revision (same as RollingUpdate)
// updateRevision = hash of current spec.template (computed by controller)
current_phase = determine_phase()

IF current_phase == "NeedsDeletion" THEN
    // Phase 1: Delete all pods with old revision
    emit_event("RecreateStarted", "Deleting all pods for Recreate update")
    set_condition("Progressing", status="True", reason="RecreateInProgress")
    // Delete ALL pods owned by this StatefulSet that have old revision
    // This handles orphaned pods with ordinals >= replicas
    FOR each pod in pods:
        IF pod.Labels["controller-revision-hash"] != updateRevision THEN
            IF pod.DeletionTimestamp == nil THEN
                delete_pod(pod)
            ENDIF
        ENDIF
    ENDFOR
    return // Reconcile again after deletions are issued
ENDIF

IF current_phase == "WaitingTermination" THEN
    // Phase 2: Wait for all old-revision pods to be fully removed from etcd
    // Controller watches pods and will reconcile when deletions complete
    // Note: Only emit event on first entry to this phase (tracked via condition)
    return
ENDIF

IF current_phase == "ReadyForCreation" THEN
    // Phase 3: Create pods with new revision according to podManagementPolicy
    IF podManagementPolicy == OrderedReady THEN
        // Create in ascending ordinal order; only create the next ordinal when predecessor is Running and Ready
        i = lowest ordinal in [0, replicas-1] such that pod i does not exist
        IF i is defined THEN
            IF i == 0 OR (pod i-1 exists AND is Running and Ready) THEN
                create_pod(i, updateRevision)
            ENDIF
        ENDIF
    ELSE
        // Parallel: create all missing pods at once
        FOR i = 0 TO replicas-1:
            IF pod with ordinal i does not exist THEN
                create_pod(i, updateRevision)
            ENDIF
        ENDFOR
    ENDIF
    return // Reconcile again to check creation progress
ENDIF

IF current_phase == "Complete" THEN
    // All replicas exist with current revision
    set_condition("Progressing", status="True", reason="RecreateComplete")
    return
ENDIF

// Helper: Determine current phase based on pod states
FUNCTION determine_phase():
    pods = get_all_pods_for_statefulset() // All pods owned by this StatefulSet
    old_revision_pods_active = 0      // Old revision, not yet deleted
    old_revision_pods_terminating = 0 // Old revision, has DeletionTimestamp
    new_revision_pods = 0             // Current revision (not terminating)
    FOR each pod in pods:
        IF pod.Labels["controller-revision-hash"] != updateRevision THEN
            // Pod has old revision
            IF pod.DeletionTimestamp == nil THEN
                old_revision_pods_active++
            ELSE
                old_revision_pods_terminating++
            ENDIF
        ELSE
            // Pod has current revision
            IF pod.DeletionTimestamp == nil THEN
                new_revision_pods++
            ENDIF
            // Note: new revision pods with DeletionTimestamp are ignored
            // (could happen if user manually deleted, will be recreated)
        ENDIF
    ENDFOR
    // Phase 1: Any old-revision pods that haven't been deleted yet
    IF old_revision_pods_active > 0 THEN
        return "NeedsDeletion"
    ENDIF
    // Phase 2: Old pods are terminating, wait for full removal
    IF old_revision_pods_terminating > 0 THEN
        return "WaitingTermination"
    ENDIF
    // Phase 3: No old pods remain, but we don't have enough new pods yet
    IF new_revision_pods < replicas THEN
        return "ReadyForCreation"
    ENDIF
    // Phase 4: All replicas exist with current revision
    return "Complete"
END FUNCTION
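
The determine_phase helper from the pseudocode above translates fairly directly into Go. The following is a sketch, not the proposed controller code: the pod struct is a hypothetical reduction of a real Pod to the two properties the phase logic looks at, and determinePhase takes its inputs as plain arguments instead of reading them from a lister.

```go
package main

import "fmt"

// pod is a hypothetical, reduced view of a Pod.
type pod struct {
	Revision    string // value of the controller-revision-hash label
	Terminating bool   // true when DeletionTimestamp is set
}

type phase string

const (
	needsDeletion      phase = "NeedsDeletion"
	waitingTermination phase = "WaitingTermination"
	readyForCreation   phase = "ReadyForCreation"
	complete           phase = "Complete"
)

// determinePhase derives the current Recreate phase purely from pod state, so
// the result is the same after a controller restart.
func determinePhase(pods []pod, updateRevision string, replicas int) phase {
	var oldActive, oldTerminating, newActive int
	for _, p := range pods {
		switch {
		case p.Revision != updateRevision && !p.Terminating:
			oldActive++
		case p.Revision != updateRevision:
			oldTerminating++
		case !p.Terminating:
			newActive++
		}
		// New-revision pods that are terminating are not counted.
	}
	switch {
	case oldActive > 0:
		return needsDeletion
	case oldTerminating > 0:
		return waitingTermination
	case newActive < replicas:
		return readyForCreation
	default:
		return complete
	}
}

func main() {
	pods := []pod{{Revision: "old"}, {Revision: "old", Terminating: true}}
	fmt.Println(determinePhase(pods, "new", 2))
}
```

Because the phase is recomputed from observed pods on every reconcile, no extra bookkeeping field is needed to resume an interrupted recreate in this sketch.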

Key Characteristics:

Uses controller-revision-hash label (same as RollingUpdate) to identify old vs new pods
All old-revision pods are fully terminated before any new pods are created
Guarantees old and new pods never run simultaneously
Deletes all old-revision pods including orphans with ordinals >= replicas
Since all pods are forcibly deleted, updates cannot become permanently blocked
Explicit downtime: Users opt-in knowing there will be unavailability between deletion and creation phases
Safe to retry deletions and creations on controller restart
Recreation phase respects podManagementPolicy
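
How the recreation phase respects podManagementPolicy can be sketched as a function that decides which ordinals to create in one reconcile pass. This follows Phase 3 of the pseudocode; podState and nextOrdinalsToCreate are hypothetical names, and the slice is assumed to be indexed by ordinal with one entry per desired replica.

```go
package main

import "fmt"

// podState is a hypothetical summary of the pod at a given ordinal.
type podState struct {
	Exists bool
	Ready  bool // Running and Ready
}

// nextOrdinalsToCreate returns the ordinals to create in this reconcile pass.
// With OrderedReady only the lowest missing ordinal is considered, and only
// when its predecessor is Running and Ready; with Parallel every missing
// ordinal is returned at once.
func nextOrdinalsToCreate(pods []podState, orderedReady bool) []int {
	var out []int
	for i, p := range pods {
		if p.Exists {
			continue
		}
		if !orderedReady {
			out = append(out, i)
			continue
		}
		if i == 0 || (pods[i-1].Exists && pods[i-1].Ready) {
			out = append(out, i)
		}
		// OrderedReady never looks past the lowest missing ordinal.
		return out
	}
	return out
}

func main() {
	pods := []podState{{Exists: true, Ready: true}, {}, {}}
	fmt.Println("OrderedReady:", nextOrdinalsToCreate(pods, true))
	fmt.Println("Parallel:", nextOrdinalsToCreate(pods, false))
}
```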

API Changes

Spec Changes

// StatefulSetUpdateStrategyType is a string enumeration type that represents the update strategy type for StatefulSets
type StatefulSetUpdateStrategyType string

const (
 // RollingUpdateStatefulSetStrategyType indicates that pods in a StatefulSet will be updated in reverse ordinal order
 RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"
 // OnDeleteStatefulSetStrategyType indicates that pods in a StatefulSet will only be updated when manually deleted
 OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = "OnDelete"
 // RecreateStatefulSetStrategyType indicates that all pods will be fully terminated before new ones are created
 RecreateStatefulSetStrategyType StatefulSetUpdateStrategyType = "Recreate"
)

Example Usage:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 10
  updateStrategy:
    type: Recreate
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2

Behavior:

When update is triggered (e.g., template change):
1. All pods (web-0 through web-9) are deleted simultaneously
2. Controller waits for all pods to fully terminate
3. All new pods (web-0 through web-9) are created according to their .spec.podManagementPolicy
Downtime occurs between deletion and recreation phases
No stuck pod scenarios - all pods are forcibly deleted

Status Changes

// StatefulSetConditionType describes the condition types
type StatefulSetConditionType string

const (
 // StatefulSetProgressing means the StatefulSet is progressing. Progress is counted when a new pod is created, deleted, or becomes ready.
 StatefulSetProgressing StatefulSetConditionType = "Progressing"

 StatefulSetAvailable StatefulSetConditionType = "Available"
)

Implementation Changes

The implementation requires changes to the StatefulSet controller in pkg/controller/statefulset/stateful_set_control.go:

Strategy Type Handling:
- Add new case for RecreateStatefulSetStrategyType in update strategy switch statement
- Implement separate update path for Recreate strategy alongside existing RollingUpdate and OnDelete paths
Recreate Update Logic:
- Phase 1 - Deletion: Iterate through all pods and delete them (similar to scale-down operation)
- Phase 2 - Wait for Termination: Check all pods for deletionTimestamp; reconcile periodically until all pods are fully terminated
- Phase 3 - Recreation: Create all new pods according to spec.podManagementPolicy
Status Condition Management:
- Add Progressing condition to StatefulSet status
Validation:
- API validation in pkg/apis/apps/validation/validation.go
- Validate type: Recreate can be set on StatefulSet
- No additional fields required for Recreate strategy (unlike RollingUpdate which has partition, maxUnavailable)
Respect Ordering Semantics:
- The Recreate strategy recreates pods according to the podManagementPolicy setting
- All pods deleted at once and then re-created according to podManagementPolicy settings
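
As a rough illustration of the "Strategy Type Handling" item above, the new case in the update strategy switch might look like the sketch below. The type and constant names come from the Spec Changes section; the updatePath function and its return strings are hypothetical and stand in for the real update paths in stateful_set_control.go.

```go
package main

import (
	"errors"
	"fmt"
)

// StatefulSetUpdateStrategyType and its constants are local copies of the
// definitions shown under "Spec Changes".
type StatefulSetUpdateStrategyType string

const (
	RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"
	OnDeleteStatefulSetStrategyType      StatefulSetUpdateStrategyType = "OnDelete"
	RecreateStatefulSetStrategyType      StatefulSetUpdateStrategyType = "Recreate"
)

// updatePath is a hypothetical helper that names the update path taken for a
// strategy type. A real controller would call the corresponding logic instead.
func updatePath(t StatefulSetUpdateStrategyType) (string, error) {
	switch t {
	case RollingUpdateStatefulSetStrategyType:
		return "update pods in reverse ordinal order", nil
	case OnDeleteStatefulSetStrategyType:
		return "update pods only when manually deleted", nil
	case RecreateStatefulSetStrategyType:
		return "delete all pods, wait for termination, recreate", nil
	default:
		return "", errors.New("unknown update strategy: " + string(t))
	}
}

func main() {
	for _, t := range []StatefulSetUpdateStrategyType{
		RollingUpdateStatefulSetStrategyType,
		OnDeleteStatefulSetStrategyType,
		RecreateStatefulSetStrategyType,
	} {
		path, err := updatePath(t)
		fmt.Println(t, "=>", path, err)
	}
}
```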

Comparison with Existing Solutions

Solution	Sequential Ordering	Automatic Recovery	Downtime	Behavior When Pod Stuck	Use Case
`RollingUpdate` (default)	Yes	No	No	Halts completely, waits forever	Traditional stateful apps
`RollingUpdate` + `maxUnavailable`	Yes (batched)	No	No	Still halts completely	Faster updates, but same stuck problem
`OnDelete`	Yes (manual)	No	No	Fully manual control	Maximum safety/control
`Recreate` (proposed)	No	Yes	Yes	All pods deleted and recreated	CI/CD, stateless apps, external storage, LWS

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/apis/apps/validation/validation.go: 2025-10-13 - 92.8%
pkg/controller/statefulset/stateful_set_control.go: 2025-10-13 - 91.5%
pkg/controller/statefulset/stateful_pod_control.go: 2025-10-13 - 89.6%
pkg/registry/apps/statefulset/strategy.go: 2025-10-13 - 83.9%

Integration tests

We should cover below scenarios:

Without type: Recreate: Existing StatefulSets with RollingUpdate and OnDelete continue to work unchanged (backward compatibility)
With type: Recreate configured:
- All pods are deleted when update is triggered (template spec change)
- Controller waits for all pods to fully terminate (no pods with deletionTimestamp remain)
- All new pods are created after termination complete
- Status condition Progressing=True
- Status condition Progressing=True with reason=RecreateComplete after pods created
- Recreate strategy respects podManagementPolicy
PVC preservation: PersistentVolumeClaims are not deleted during Recreate (only pods are deleted)
Stuck pod handling: Pods stuck in any state are forcibly deleted (ImagePullBackOff, Pending, CrashLoopBackOff, etc.)
Validation: API validation accepts type: Recreate on StatefulSet
For alpha, add a test to verify that we cannot switch strategies from Recreate to RollingUpdate or OnDelete. For beta, we will need to add a test to verify that we can switch strategies.

e2e tests

The following e2e tests will be added to test/e2e/apps/statefulset.go:

StatefulSet with type: Recreate successfully deletes and recreates all pods during update
Recreate works with stuck pods (ImagePullBackOff scenario - pods are deleted and new ones created)
Recreate waits for full termination before creating new pods (no mixed old/new state)
Recreate preserves PersistentVolumeClaims (data persists across recreation)
Recreate respects podManagementPolicy during recreation
StatefulSets without type: Recreate maintain current RollingUpdate/OnDelete behavior (backward compatibility)
Controller restart during Recreate resumes correctly from last phase

Graduation Criteria

Alpha

Feature implemented behind a feature flag.
Unit and integration tests pass as described in the Test Plan.

Beta

Feature is enabled by default
Address reviews and bug reports from Alpha users
Users are able to switch strategies from Recreate to RollingUpdate or OnDelete
e2e tests:
- Add links to testgrid results
- Verify zero flakes over 2+ weeks

GA

No negative feedback from developers.
Consider conformance test if feature becomes widely adopted and part of core contract
Ensure existing conformance tests for basic RollingUpdate continue to pass

Upgrade / Downgrade Strategy

Upgrade

This feature is protected by the feature-gate StatefulSetRecreateStrategy, which must be enabled on both kube-apiserver and kube-controller-manager.

Component Dependencies:

kube-apiserver: Validates and persists the type: Recreate strategy in the StatefulSet spec
kube-controller-manager: Implements the Recreate strategy logic (delete all, wait for termination, create all)

Upgrade Sequence

Enable feature gate on kube-apiserver first
Enable feature gate on kube-controller-manager
Create/update StatefulSets with updateStrategy.type: Recreate

Partial Upgrade Behavior

If apiserver has feature enabled but kube-controller-manager does not:
- API server accepts type: Recreate strategy
- Strategy type is persisted in etcd
- Kube-controller-manager ignores Recreate type and falls back to default RollingUpdate behavior
- No errors, but Recreate behavior is not active
If apiserver does NOT have feature enabled but kube-controller-manager does:
- API server rejects create/update requests that set type: Recreate with a validation error
- Users cannot create or switch to Recreate until the apiserver has the feature enabled.
- Kube-controller-manager cannot process Recreate in this skew because no StatefulSet with type: Recreate can be stored.

Enable the feature gate on kube-apiserver first, then kube-controller-manager to ensure smooth transition.
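
The partial-upgrade cases above can be summarized as a small decision table. The Go sketch below only restates the behavior described in this section; the skewOutcome type and recreateOutcome function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// skewOutcome describes what happens to a request for type: Recreate under a
// given combination of feature gate states, as described in the text.
type skewOutcome struct {
	Accepted   bool   // whether kube-apiserver admits type: Recreate
	Controller string // update behavior applied by kube-controller-manager
}

// recreateOutcome is a hypothetical helper encoding the partial upgrade rules.
func recreateOutcome(apiserverGate, controllerGate bool) skewOutcome {
	if !apiserverGate {
		// The request fails validation, so the controller never sees
		// a StatefulSet with type: Recreate.
		return skewOutcome{Accepted: false, Controller: "n/a"}
	}
	if !controllerGate {
		// Accepted and persisted, but the controller ignores Recreate.
		return skewOutcome{Accepted: true, Controller: "RollingUpdate"}
	}
	return skewOutcome{Accepted: true, Controller: "Recreate"}
}

func main() {
	for _, api := range []bool{false, true} {
		for _, kcm := range []bool{false, true} {
			fmt.Printf("apiserver=%v controller-manager=%v -> %+v\n",
				api, kcm, recreateOutcome(api, kcm))
		}
	}
}
```

The table makes the recommended ordering visible: only the combination with the gate enabled on kube-apiserver but not on kube-controller-manager silently accepts Recreate without applying it.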

Downgrade

The older apiserver does not recognize type: Recreate and will reject create/update requests that set it.
StatefulSets that already have type: Recreate stored in etcd remain stored, but any update that touches the spec may be rejected unless the strategy is changed back to RollingUpdate/OnDelete first
The controller in the older version ignores Recreate and behaves as RollingUpdate for those existing objects

Version Skew Strategy

This feature has dependencies between control plane components.

kube-apiserver v1.xx+1 (feature enabled) and kube-controller-manager v1.xx (no feature)
- API accepts type: Recreate, controller ignores it
- StatefulSets fall back to default RollingUpdate behavior
- StatefulSets are functional, just without Recreate strategy feature
- No errors or warnings
kube-apiserver v1.xx (no feature) and kube-controller-manager v1.xx+1 (feature enabled)
- API server rejects create/update requests that set type: Recreate with a validation error
- Users cannot create or update StatefulSets to use Recreate until apiserver is upgraded and the feature is enabled
- Enable the feature on kube-apiserver first, then on kube-controller-manager

Mixed control plane during rolling upgrade
- During control plane upgrade, apiservers and controller-managers may have different versions, and the feature may be enabled or disabled. The behavior depends on the leader’s version:
  - If leader has feature enabled: Recreate strategy is processed correctly
  - If leader has feature disabled: Recreate strategy is ignored, falls back to RollingUpdate behavior
  - Leader may change during upgrade, causing behavior to switch between Recreate and RollingUpdate

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: StatefulSetRecreateStrategy
- Components depending on the feature gate:
  - kube-apiserver
  - kube-controller-manager

Does enabling the feature change any default behavior?

No. Enabling the StatefulSetRecreateStrategy feature gate does not change any default behavior.

The type: Recreate strategy is opt-in. When not explicitly set:

StatefulSets behave exactly as they do today (default RollingUpdate behavior)
All existing StatefulSet update strategies continue to work unchanged

The feature only activates when users explicitly configure spec.updateStrategy.type: Recreate in their StatefulSet spec.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled.

What happens if we reenable the feature if it was previously rolled back?

The feature works normally again. StatefulSets with type: Recreate in their spec will immediately start using Recreate behavior for the next update.

Are there any tests for feature enablement/disablement?

Not yet. Unit and integration tests will be added to cover feature gate enablement/disablement scenarios.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout Failures:

If apiserver and controller-manager have different feature gate states, type: Recreate may be accepted but ignored (falls back to RollingUpdate)
API validation accepts type: Recreate as valid strategy type (no complex validation needed)

Rollback Failures:

If the strategy type was not changed back, StatefulSets with type: Recreate will fall back to RollingUpdate behavior and Recreate behavior will be ignored.

Impact on Running Workloads:

No impact on StatefulSets without type: Recreate
StatefulSets with type: Recreate will experience downtime during updates (i.e. all pods are deleted before new ones are created)

What specific metrics should inform a rollback?

statefulset_unavailable_replicas shows how many StatefulSet replicas are unavailable
workqueue_depth{name="statefulset"} shows the current depth of the StatefulSet controller queue
workqueue_queue_duration_seconds{name="statefulset"} shows how long items wait in queue before processing
workqueue_retries_total{name="statefulset"} shows retry counts which may indicate processing failures

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet. Tests will be added to cover upgrade and rollback scenarios.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No. This feature adds a new strategy type Recreate to spec.updateStrategy.type. No deprecations of existing fields or APIs nor removals of existing functionality.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By querying StatefulSets using kubectl:

kubectl get statefulsets -A -o json | \
 jq '.items[] | select(.spec.updateStrategy.type == "Recreate") |
 {namespace: .metadata.namespace, name: .metadata.name, strategy: .spec.updateStrategy.type}'

By checking StatefulSet status conditions:

kubectl get statefulsets -A -o json | \
 jq '.items[] | select(.status.conditions[]? | select(.type=="Progressing"))'
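
The same filter as the first kubectl/jq query can be expressed in Go over the JSON printed by `kubectl get statefulsets -A -o json`. This is a sketch using only the standard library; the statefulSetList struct decodes just the fields the filter needs, and recreateStatefulSets is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statefulSetList decodes only the fields needed for the filter.
type statefulSetList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			UpdateStrategy struct {
				Type string `json:"type"`
			} `json:"updateStrategy"`
		} `json:"spec"`
	} `json:"items"`
}

// recreateStatefulSets returns "namespace/name" for every StatefulSet in the
// list whose spec.updateStrategy.type is Recreate.
func recreateStatefulSets(data []byte) ([]string, error) {
	var list statefulSetList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var out []string
	for _, it := range list.Items {
		if it.Spec.UpdateStrategy.Type == "Recreate" {
			out = append(out, it.Metadata.Namespace+"/"+it.Metadata.Name)
		}
	}
	return out, nil
}

func main() {
	// Illustrative input; in practice this would be the kubectl output.
	sample := []byte(`{"items":[
	 {"metadata":{"namespace":"default","name":"web"},
	  "spec":{"updateStrategy":{"type":"Recreate"}}},
	 {"metadata":{"namespace":"default","name":"db"},
	  "spec":{"updateStrategy":{"type":"RollingUpdate"}}}]}`)
	names, err := recreateStatefulSets(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```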

How can someone using this feature know that it is working for their instance?

[] Events
API .status
- Condition name: Progressing
Metrics (existing kube-state-metrics metrics)
- kube_statefulset_replicas
- kube_statefulset_status_replicas_ready
- kube_statefulset_status_replicas_current

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

100% of StatefulSets without type: Recreate behave identically to pre-feature behavior
99% of Recreate updates complete within (pod termination time + pod startup time + 30s)
0% of pods are left in mixed old/new spec states after Recreate update

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics (existing kube-state-metrics metrics)
- Metric(s) name:
  - kube_statefulset_status_replicas_available
  - kube_statefulset_status_replicas_ready
  - kube_statefulset_status_replicas_current
  - Components exposing the metric: kube-state-metrics
- Metric name:
  - statefulset_unavailable_replicas
  - Components exposing the metric: kube-controller-manager
- These metrics reflect the StatefulSet .status (availableReplicas, readyReplicas, currentReplicas). They have labels statefulset and namespace, so operators can filter by StatefulSet to monitor a specific StatefulSet during Recreate
- During Recreate updates, the values show the transition from all pods deleted (0 available) to all new pods created and ready

Are there any missing metrics that would be useful to have to improve observability of this feature?

No. The existing StatefulSet metrics provide sufficient observability for the Recreate strategy.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No new types of API calls. If the feature gate is enabled but no StatefulSet uses type: Recreate, then no additional API calls occur.

When Recreate strategy is used during an update, the following existing API call types are made:

Pod Deletion (DELETE /api/v1/namespaces/{ns}/pods/{name})
Pod Creation (POST /api/v1/namespaces/{ns}/pods)
StatefulSet Status Update (PUT /apis/apps/v1/namespaces/{ns}/statefulsets/{name}/status)
Event Creation (POST /api/v1/namespaces/{ns}/events)

Will enabling / using this feature result in introducing new API types?

No. A new strategy type Recreate is added to the existing StatefulSetUpdateStrategyType enum, but no new API types are introduced.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, minor increases in size when type: Recreate is used.

Per StatefulSet using Recreate strategy:

Spec: ~8 bytes (strategy type enum value: “Recreate”)
Status: ~150-200 bytes when Progressing condition is active
Total: ~160-210 bytes per StatefulSet

For a cluster with 1000 StatefulSets using Recreate strategy:

Total increase: ~160-210 KB
Impact: Negligible compared to typical etcd usage (multi-GB scale)
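
The estimate above is straightforward arithmetic, reproduced here as a short Go sketch. The byte counts are the approximate figures from this section, and totalBytes is a hypothetical helper. The exact sums (158 to 208 bytes per object) are what the text rounds to roughly 160-210 bytes.

```go
package main

import "fmt"

// totalBytes multiplies the per-object overhead (spec plus status) by the
// number of StatefulSets that use the Recreate strategy.
func totalBytes(statefulSets, specBytes, statusBytes int) int {
	return statefulSets * (specBytes + statusBytes)
}

func main() {
	const specBytes = 8 // the string "Recreate"
	low := totalBytes(1000, specBytes, 150)
	high := totalBytes(1000, specBytes, 200)
	fmt.Printf("%d-%d bytes (about %d-%d KB)\n", low, high, low/1000, high/1000)
}
```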

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

API Server Operations:
- GET/LIST StatefulSets: No impact (strategy type is standard enum field, standard deserialization)
- CREATE/UPDATE StatefulSets: Minimal impact (~10-20μs for validating strategy type enum).
StatefulSet Controller Reconciliation:
- With feature enabled but strategy not set to Recreate: No additional overhead.
- With Recreate strategy: Same overhead as manual pod deletion + creation operations.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Etcd Operations:
- Minimal increase in object size when using Recreate strategy (~8 bytes for strategy type enum value + ~150-200 bytes for status conditions when active).
Memory/CPU:
- Memory (per StatefulSet): ~8 bytes for strategy type enum value.
- CPU: Strategy type comparison on each reconciliation: ~1-2μs (simple string comparison).
Network I/O:
- An additional ~8 bytes per StatefulSet spec when Recreate strategy is set, and ~150-200 bytes per status update when Progressing condition is active.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No, the feature does not introduce new node resource exhaustion risks beyond those of existing mechanisms.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature behaves similarly to existing controllers that depend on API server and etcd availability.

API Server Unavailable: StatefulSet controller cannot read/write StatefulSet or Pod objects, so all updates halt.
etcd Unavailable: Similar to API server unavailability, no state changes can be persisted.

No special handling is required as this feature only changes the update progression logic, not the fundamental dependency on API server/etcd availability.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Examine Metrics (kube_statefulset_status_replicas_available, kube_statefulset_status_replicas_ready)
- If kube_statefulset_status_replicas_available is stuck at 0 for extended period → pods may be stuck in termination
- If kube_statefulset_status_replicas_current is increasing but kube_statefulset_status_replicas_ready is not → pods may be failing to start
Check if pods are stuck in termination (long grace periods, finalizers blocking deletion)
Verify pod startup time is reasonable (image pull, initialization containers, readiness probes)

Implementation History

2022-09-26: Initial KEP Created
2025-07-29: Updated the KEP after changing the ownership
2025-10-13: Pivoted KEP from EnforcedRollingUpdate strategy to podProgressTimeoutSeconds field based on sig-apps feedback. This approach better handles transient vs permanent failures and aligns with Deployment semantics.
2025-12-01: Pivoted KEP from podProgressTimeoutSeconds to Recreate strategy based on sig-apps meeting (meeting recording). Key feedback:
- Progress deadline seconds in Deployments do not terminate pods, but podProgressTimeoutSeconds proposal would terminate pods
- Deleting/terminating pods based on readiness signals is problematic and disruptive
- Group consensus favored Recreate for simplicity and consistency with existing Kubernetes APIs
2025-04-06: Updated KEP to change the milestone to 1.37

Drawbacks

Downtime Requirement

The Recreate strategy causes downtime during updates since all pods are deleted before new ones are created:

Service Interruption: Application is completely unavailable during the deletion/recreation window
Not Suitable for All Workloads: Traditional stateful applications requiring high availability cannot use this strategy
User Expectation Management: Users must understand and accept downtime implications

Mitigation:

Clear documentation emphasizing downtime implications
Explicit opt-in via type: Recreate (no accidental usage)
Recommendation to use for appropriate workloads (CI/CD, stateless apps, development environments)

Limited Rollback Options

During a Recreate update, there’s no gradual rollback:

If new version has issues, all pods are affected (no gradual detection)
Cannot compare old vs new pods side-by-side during rollout
Must wait for full recreation cycle to attempt fixes

Mitigation:

Clear events and status conditions during Recreate process
Users can choose RollingUpdate for gradual rollouts where needed
Quick feedback loop due to fast recreation (all pods start together)

Alternatives

Alternative 1: PodProgressTimeoutSeconds Field in RollingUpdate Strategy

Extend the existing RollingUpdate strategy with a podProgressTimeoutSeconds field (similar to Deployment’s progressDeadlineSeconds) that allows timeout-based detection of stuck pods.

API Example:

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      podProgressTimeoutSeconds: 600 # Wait 10 minutes per pod
      maxUnavailable: 1

Algorithm: For each pod in reverse ordinal order, delete and create new pod, wait for Ready state with timeout. If pod doesn’t become Ready within podProgressTimeoutSeconds, delete and recreate it.

Pros:

Maintains sequential ordering guarantees
Distinguishes transient failures (slow image pulls) from permanent failures (misconfig)
Works with existing maxUnavailable and partition fields
Allows fine-grained control over timeout per workload

Cons:

Complexity: Requires tracking per-pod creation timestamps and deadline state across reconciliation loops
Timeout Configuration Burden: Users must choose appropriate timeout values (too short = unnecessary churn, too long = slow recovery)
Doesn’t Solve All Scenarios: Still blocks on transient issues until timeout expires
Controller Complexity: Adds significant complexity to StatefulSet controller logic

Why Not Chosen as Primary Solution: Based on sig-apps meeting feedback (meeting link), the group favored the simpler Recreate strategy approach. Key concerns raised:

Progress deadline in Deployments does not terminate pods when deadline is reached, but this proposal would
Using readiness signals to terminate pods is problematic and disruptive
The timeout-based approach adds complexity that may not be necessary for the primary use cases (CI/CD, stateless apps, external storage)
Recreate strategy is “pretty bare” and has direct parallel with Deployment patterns, making it easier to implement and understand

Alternative 2: EnforcedRollingUpdate Strategy

Add a new update strategy type EnforcedRollingUpdate that immediately deletes and replaces stuck pods without timeout during rolling updates.

API Example:

spec:
  updateStrategy:
    type: EnforcedRollingUpdate
    enforcedRollingUpdate:
      maxUnavailable: 1

Algorithm: When pod[i] needs update, delete it immediately regardless of current state, create new pod, wait for Ready.

Pros:

Simpler than timeout-based approach (no deadline tracking)
Maintains some ordering through sequential updates
Immediate action on stuck pods

Cons:

Cannot distinguish transient from permanent failures (network delays, CI/CD pipeline delays, slow image pulls)
Still maintains sequential ordering, which adds complexity
Doesn’t solve initial deployment failure, only works when spec changes

Why Not Chosen: Similar concerns as Alternative 1, but Recreate is even simpler by removing ordering requirements entirely.

Alternative 3 (Now Primary Solution): Recreate Strategy

NOTE: This alternative was chosen as the primary solution for this KEP based on sig-apps meeting feedback.

Add a Recreate update strategy (matching Deployment’s Recreate strategy) that deletes all pods before creating new ones.

API Example:

spec:
  updateStrategy:
    type: Recreate

Algorithm: Delete all pods, wait for termination, create all new pods according to spec.podManagementPolicy.

Pros:

No complexity around stuck pods or timeout tracking
All pods deleted before new ones created, guaranteeing clean state
Simple, predictable behavior aligned with Deployment patterns
Can quickly replace all pods regardless of their current state
No need to configure timeouts or tune parameters

Cons:

No ordering during deletion (all at once). Ordering during creation applies only when podManagementPolicy is OrderedReady
Not suitable for traditional stateful workloads requiring zero-downtime updates

Why Chosen as Primary Solution: Based on sig-apps meeting discussion, this approach is:

Simpler to implement and understand (matches existing Deployment Recreate pattern)
Addresses the primary use cases (CI/CD, stateless apps, external storage, LeaderWorkerSet)
Avoids concerns about terminating pods based on readiness/timeout signals
Provides explicit opt-in behavior where users accept downtime for automated recovery

Alternative 4: Add Force Flag to RollingUpdate

Add a boolean field like spec.updateStrategy.rollingUpdate.forceUpdate: true.

Pros:

Minimal API change

Cons:

Same issue as Alternative 1; cannot distinguish transient from permanent failures
Less discoverable than dedicated field
Boolean flag doesn’t allow tuning timeout per workload

Why Not Chosen: Recreate strategy is clearer about behavior and simpler to implement.

Alternative 5: Enhance Parallel Policy

Extend podManagementPolicy: Parallel to automatically replace stuck pods during updates.

Pros:

Reuses existing field
Already has parallel semantics

Cons:

Loses sequential ordering guarantees
Confuses semantics of podManagementPolicy (affects both scaling and updates) vs updateStrategy (updates only)
Less explicit than dedicated strategy type
Doesn’t automatically delete all pods for clean state

Why Not Chosen: Recreate strategy as a dedicated update strategy type is clearer and more explicit. It also aligns better with Deployment patterns.

Infrastructure Needed (Optional)

N/A

Resources: Add Resource Health Status to the Pod Status for Device Plugin and DRA

KEP-4680: Add Resource Health Status to the Pod Status for Device Plugin and DRA

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP will fix this by exposing device health via Pod Status. This KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices).

Motivation

Device Plugin and DRA do not have a well-defined failure handling strategy. With the proliferation of workloads using devices (such as GPUs), the variable quality of devices, and data centers overcommitted on power, devices can fail temporarily or permanently, and Kubernetes needs to handle this natively.

Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing.

Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device, and be able to respond to this properly.

Goals

Expose device health information (served by Device Plugin or DRA) in Pod Status and events.

Non-Goals

Expose any other device information beyond the health.
Expose CPU assignment of the pod by CPU manager or any other resources assignment by other managers.

Proposal

PodStatus.AllocatedResourcesStatus

As part of the InPlacePodVerticalScaling KEP, two new fields were introduced in Pod Status to reflect the currently allocated resources for the Pod:

type ContainerStatus struct {
 ...

 // AllocatedResources represents the compute resources allocated for this container by the
 // node. Kubelet sets this value to Container.Resources.Requests upon successful pod admission
 // and after successfully admitting desired pod resize.
 // +featureGate=InPlacePodVerticalScaling
 // +optional
 AllocatedResources ResourceList `json:"allocatedResources,omitempty" protobuf:"bytes,10,rep,name=allocatedResources,casttype=ResourceList,castkey=ResourceName"`

 // Resources represents the compute resource requests and limits that have been successfully
 // enacted on the running container after it has been started or has been successfully resized.
 // +featureGate=InPlacePodVerticalScaling
 // +optional
 Resources *ResourceRequirements `json:"resources,omitempty" protobuf:"bytes,11,opt,name=resources"`

 ...
}

One field reflects the resource requests and limits, and the other reflects the actual allocated resources.

This structure will contain standard resources as well as extended resources. As noted in the comment: https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713 , it is only logical to also include the status of those allocated resources.

The proposal is to keep this structure as-is to simplify parsing of the well-known ResourceList data type by various consumers. A typical scenario would be to check whether the AllocatedResources match the desired state.

The proposal is to introduce an additional field:

type ContainerStatus struct {
 ...

 // AllocatedResourcesStatus represents the status of various resources
 // allocated for this Container. In case of DRA, the same resource health
 // can be reported multiple times if it is associated with multiple containers.
 // +featureGate=ResourceHealthStatus
 // +optional
 // +patchMergeKey=name
 // +patchStrategy=merge
 // +listType=map
 // +listMapKey=name
 AllocatedResourcesStatus []ResourceStatus `json:"allocatedResourcesStatus,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,14,rep,name=allocatedResourcesStatus"`

 ...
}

The ResourceStatus is defined as:

type ResourceStatus struct {
 // Name of the resource. Must be unique within the pod and in case of non-DRA resource, match one of the resources from the pod spec.
 // For DRA resources, the value must be claim:claim_name/request.
 // The claim_name must match one of the claims from resourceClaims field in the podSpec.
 // +required
 Name ResourceName `json:"name" protobuf:"bytes,1,opt,name=name"`
 // List of unique resource health entries. Each element in the list contains a unique resource ID and resource health.
 // At a minimum, ResourceID must uniquely identify the Resource
 // allocated to the Pod on the Node for the lifetime of a Pod.
 // See ResourceID type for its definition.
 // +listType=map
 // +listMapKey=resourceID
 Resources []ResourceHealth `json:"resources,omitempty" protobuf:"bytes,2,rep,name=resources"`
}

type ResourceHealthStatus string

const (
 ResourceHealthStatusHealthy ResourceHealthStatus = "Healthy"
 ResourceHealthStatusUnhealthy ResourceHealthStatus = "Unhealthy"
 ResourceHealthStatusUnknown ResourceHealthStatus = "Unknown"
)

// ResourceID is calculated based on the source of this resource health information.
// For DevicePlugin:
//
// DeviceID, where DeviceID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73
//
// DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node.
// For DRA:
//
// <driver name>/<pool name>/<device name>: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster.
type ResourceID string

// ResourceHealth represents the health of a resource. It has the latest device health information.
// This is a part of KEP https://kep.k8s.io/4680. Historical health changes are planned to be added in future iterations of the KEP.
type ResourceHealth struct {
 // ResourceID is the unique identifier of the resource. See the ResourceID type for more information.
 ResourceID ResourceID `json:"resourceID" protobuf:"bytes,1,opt,name=resourceID"`
 // Health of the resource.
 // can be one of:
 // - Healthy: operates as normal
 // - Unhealthy: reported unhealthy. We consider this a temporary health issue
 // since we do not have a mechanism today to distinguish
 // temporary and permanent issues.
 // - Unknown: The status cannot be determined.
 // For example, Device Plugin got unregistered and hasn't been re-registered since.
 //
 // In future we may want to introduce the PermanentlyUnhealthy Status.
 Health ResourceHealthStatus `json:"health,omitempty" protobuf:"bytes,2,name=health"`
 // Message provides additional human-readable context about the health status.
 // This can include error details, failure reasons, or other diagnostic information.
 // This field is optional and may be empty for healthy resources.
 // +optional
 Message string `json:"message,omitempty" protobuf:"bytes,3,opt,name=message"`
}
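
As an illustration of how a client might consume these types, the following is a minimal sketch rather than the actual Kubernetes code. It uses trimmed local copies of the types above; the helper notHealthy, the resource name example.com/gpu, the device IDs, and the message text are all hypothetical and chosen only for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the types defined above (protobuf tags omitted).
type ResourceHealthStatus string

const (
	ResourceHealthStatusHealthy   ResourceHealthStatus = "Healthy"
	ResourceHealthStatusUnhealthy ResourceHealthStatus = "Unhealthy"
	ResourceHealthStatusUnknown   ResourceHealthStatus = "Unknown"
)

type ResourceHealth struct {
	ResourceID string               `json:"resourceID"`
	Health     ResourceHealthStatus `json:"health,omitempty"`
	Message    string               `json:"message,omitempty"`
}

type ResourceStatus struct {
	Name      string           `json:"name"`
	Resources []ResourceHealth `json:"resources,omitempty"`
}

// notHealthy is a hypothetical helper: it returns the IDs of all resources
// whose reported health is anything other than Healthy.
func notHealthy(statuses []ResourceStatus) []string {
	var ids []string
	for _, s := range statuses {
		for _, r := range s.Resources {
			if r.Health != ResourceHealthStatusHealthy {
				ids = append(ids, r.ResourceID)
			}
		}
	}
	return ids
}

func main() {
	// Illustrative values only; real names and IDs come from the plugin.
	statuses := []ResourceStatus{{
		Name: "example.com/gpu",
		Resources: []ResourceHealth{
			{ResourceID: "gpu-0", Health: ResourceHealthStatusHealthy},
			{ResourceID: "gpu-1", Health: ResourceHealthStatusUnhealthy, Message: "illustrative failure reason"},
		},
	}}
	out, err := json.MarshalIndent(statuses, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println(notHealthy(statuses))
}
```

Treating Unknown the same as Unhealthy in notHealthy is a choice made for this sketch; a real controller may want to handle the two differently.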

In alpha2 in order to support pod level DRA resources, the following field will be added to the PodStatus:

// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system.
type PodStatus struct {

 ...

 // Status of resource claims.
 // +featureGate=DynamicResourceAllocation
 // +optional
 ResourceClaimStatuses []PodResourceClaimStatus

 ...

 // AllocatedResourcesStatus represents the status of various resources
 // allocated for this Pod, but not associated with any of containers.
 // +featureGate=ResourceHealthStatus
 // +optional
 // +patchMergeKey=name
 // +patchStrategy=merge
 // +listType=map
 // +listMapKey=name
 AllocatedResourcesStatus []ResourceStatus `json:"allocatedResourcesStatus,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,14,rep,name=allocatedResourcesStatus"`
}

Is there any guarantee that the AllocatedResourcesStatus will be updated before the Container crashes and is unscheduled?

No, there is no guarantee that the Device Plugin or DRA driver will detect a device going unhealthy before the Pod is affected. Once a device becomes unhealthy, the container may crash and already be marked as Failed (if restartPolicy=Never; in other cases the Pod will enter crash loop backoff).

Note: Updating Pod Status with device health after the pod has been marked as Failed is not supported due to a race condition in the Kubelet’s DRA manager cleanup. See the Known Limitations section for details.

Do we need the CheckDeviceHealth call introduced to the Device Plugin to work around the limitation above?

We may consider this as a future improvement.

Should we introduce a permanent failure status?

We may consider this as a future improvement.

User Stories (Optional)

Story 1

User schedules a Pod that uses a GPU device
When the GPU device fails, the user sees that the Pod is in crash loop backoff
User checks the Pod Status using kubectl describe pod
User sees the pod status indicating that the GPU device is not healthy
User or some (custom for now) controller deletes the Pod, and the ReplicaSet reschedules it on another available GPU

Notes/Constraints/Caveats (Optional)

DRA Device Health Timeout Configuration: The timeout for marking a DRA device’s health as “Unknown” when no updates are received can be configured per device through the health_check_timeout_seconds field in the DeviceHealth message. This allows different hardware types (e.g., GPUs, FPGAs, TPUs, storage devices) to specify appropriate timeout values based on their health-reporting characteristics. If not specified, Kubelet will use a default timeout of 30 seconds. This addresses Issue #133118 and the discussion in PR #130606.
Failure Message Field: The ResourceHealth struct includes an optional Message field that provides additional human-readable context about device health status. This field enables Device Plugins and DRA drivers to report detailed error information, failure reasons, and diagnostic information beyond the basic health status. This enhancement improves troubleshooting capabilities for device-related failures. See Issue #133202 and PR #134506 for implementation details.
Known Limitation - Device Health for Terminated Pods: Device health status is not updated in PodStatus after a Pod has terminated (e.g., in Failed state). Due to a race condition between pod termination and health status updates, the Kubelet’s DRA manager cleans up the ClaimInfo from its cache before health updates can be applied. The complexity required to fix this (tombstoning terminated ClaimInfo entries) was deemed not worth the benefit for this edge case. The core value for long running services (RestartPolicy: Always) is unaffected. See Issue #132978 for details on why this was closed without implementation.

Risks and Mitigations

This KEP carries few risks. The biggest risk is that Device Plugins will not be able to detect device health reliably and quickly enough to assign this status to Pods with restartPolicy=Never. End users will expect this field to be populated, and missing health information will confuse them.

Design Details

Device Plugin implementation details

Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods.

One improvement is needed: distinguishing devices that were explicitly marked unhealthy from devices whose device plugin was unregistered.

NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”.

We should consider introducing another field to the Status that will be a free form error information as a future improvement.

DRA implementation details

Today DRA does not return the health of the device back to kubelet. The proposal is to extend the type BasicDevice (from staging/src/k8s.io/dynamic-resource-allocation/api/types.go) to include a Health field, the same way it is done in the Device Plugin, as well as a device ID.

The following design outlines how Kubelet will obtain health information from DRA plugins and use it to update the PodStatus. This design focuses on an optional, proactive health reporting mechanism from DRA plugins.

High-Level Architectural Approach for DRA Health

Optional gRPC Stream: A new, optional gRPC service for health monitoring will be defined. DRA plugins can implement this service to proactively send health updates for their managed devices to Kubelet. It will expose a server-streaming RPC that allows the plugin to send a complete list of device health states whenever a change occurs. If a plugin does not implement this service, the health of its devices will be reported as “Unknown”.
Health Information Cache: Kubelet’s DRA Manager will maintain a persistent cache of device health information. This cache will store the latest known health status (e.g., Healthy, Unhealthy, Unknown) and a timestamp for each device, keyed by driver and device identifiers. The cache will be responsible for reconciling the state reported by the plugin, handling timeouts for stale data (marking devices as “Unknown” if not updated within a certain period), and persisting this information across Kubelet restarts.

Note: The timeout for marking a device’s health as “Unknown” can be configured per device via the health_check_timeout_seconds field in the DeviceHealth message. If not specified, Kubelet will use a default timeout of 30 seconds. This addresses Issue #133118, allowing different hardware types (e.g., GPUs, FPGAs, TPUs, storage) to specify appropriate timeout values based on their health-reporting characteristics.
Kubelet Integration: The DRA Manager in Kubelet will act as the gRPC client. Upon plugin registration, it will attempt to initiate the health monitoring stream. If successful, it will consume the health updates, update its internal health cache, and identify which Pods are affected by any reported health changes. For seamless plugin upgrades, where multiple instances of a plugin might run concurrently, the Kubelet will always watch the most recently registered plugin for health updates.
PodStatus Update: When health changes for a device are detected, the DRA manager will trigger an update for the affected Pods. Kubelet’s main pod synchronization logic will then read the current health status for the Pod’s allocated DRA devices from the health cache and populate the AllocatedResourcesStatus field in the PodStatus with the correct health information.

Note: Kubelet will only use this health information to update the Pod Status. The DRA plugin remains responsible for other actions, such as tainting ResourceSlices to prevent scheduling on unhealthy resources.
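
The staleness rule of the health cache can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions, not the Kubelet implementation: the type names healthCache and cachedHealth, the method statusAt, and the key format are hypothetical, and persistence across restarts is left out. Only the 30-second default and the per-device timeout override come from the text above.

```go
package main

import "fmt"

// defaultTimeoutSeconds mirrors the 30-second default stated in the text.
const defaultTimeoutSeconds int64 = 30

// cachedHealth is a hypothetical cache entry; field names are illustrative.
type cachedHealth struct {
	status         string // "Healthy", "Unhealthy" or "Unknown"
	lastUpdated    int64  // Unix seconds, as reported by the plugin
	timeoutSeconds int64  // zero means "use the default"
}

// healthCache is a hypothetical in-memory stand-in for the Kubelet cache.
// Keys are "<driver>/<pool>/<device>" strings (an assumption of this sketch).
type healthCache map[string]cachedHealth

// statusAt returns the health to report at time now (Unix seconds).
// Missing or stale entries are reported as "Unknown".
func (c healthCache) statusAt(key string, now int64) string {
	e, ok := c[key]
	if !ok {
		return "Unknown"
	}
	timeout := e.timeoutSeconds
	if timeout <= 0 {
		timeout = defaultTimeoutSeconds
	}
	if now-e.lastUpdated > timeout {
		return "Unknown"
	}
	return e.status
}

func main() {
	cache := healthCache{
		"gpu.example.com/pool-a/gpu-0": {status: "Healthy", lastUpdated: 100},
		"gpu.example.com/pool-a/gpu-1": {status: "Unhealthy", lastUpdated: 100, timeoutSeconds: 120},
	}
	// Query both devices at increasing points in time.
	for _, now := range []int64{110, 140, 230} {
		fmt.Println(now,
			cache.statusAt("gpu.example.com/pool-a/gpu-0", now),
			cache.statusAt("gpu.example.com/pool-a/gpu-1", now))
	}
}
```

Computing staleness at read time, as done here, avoids a background timer; a real implementation may instead sweep the cache periodically so that affected Pods are notified when a device times out.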

gRPC API for DRA Device Health

A new gRPC service, NodeHealth, will be introduced in a new API group (e.g., dra-health/v1alpha1) to keep it separate from the core DRA API and signify its optionality.

The service will define a WatchResources RPC:

service NodeHealth {
 // WatchResources allows a DRA plugin to stream health updates for its devices to Kubelet.
 // Kubelet calls this method, and the plugin streams responses.
 // This method is optional; if not implemented by a plugin, Kubelet will assume
 // devices managed by that plugin have an "Unknown" health status.
 rpc WatchResources(WatchResourcesRequest) returns (stream WatchResourcesResponse) {}
}

message WatchResourcesRequest {
 // Reserved for future use, e.g., filtering or options.
}

message WatchResourcesResponse {
 // A list of all devices managed by the plugin for which health is being reported.
 // This should be a complete list for the driver; Kubelet will reconcile this state.
 repeated DeviceHealth devices = 1;
}

message DeviceHealth {
 // The name of the resource pool this device belongs to.
 // Required.
 string pool_name = 1;
 // The unique name of the device within the pool.
 // Required.
 string device_name = 2;
 // Health status of the device.
 // Expected values: "Healthy", "Unhealthy", "Unknown".
 // Required.
 string health_status = 3;
 // Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
 // Required.
 int64 last_updated_timestamp = 4;
 // Health check timeout duration in seconds for this device.
 // If not specified or zero, Kubelet will use a default timeout.
 // Optional.
 int64 health_check_timeout_seconds = 5;
}
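
To connect this message to the Pod-facing API, the sketch below derives the ResourceID described earlier (driver name, pool name, and device name joined by slashes) from a DeviceHealth message. It is an illustration under assumptions: the Go struct is a hand-written stand-in for the generated protobuf type, the helper names resourceID and normalizeHealth are hypothetical, and mapping unexpected health_status values to "Unknown" is an assumption of this sketch, not documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceHealth is a hand-written stand-in for the generated protobuf type.
type DeviceHealth struct {
	PoolName                  string
	DeviceName                string
	HealthStatus              string
	LastUpdatedTimestamp      int64
	HealthCheckTimeoutSeconds int64
}

// resourceID builds "<driver name>/<pool name>/<device name>".
// It rejects messages that lack the fields marked Required above.
func resourceID(driver string, d DeviceHealth) (string, error) {
	if driver == "" || d.PoolName == "" || d.DeviceName == "" {
		return "", errors.New("driver name, pool_name and device_name are required")
	}
	return driver + "/" + d.PoolName + "/" + d.DeviceName, nil
}

// normalizeHealth keeps the three expected values and maps anything else
// to "Unknown" (an assumption made for this sketch).
func normalizeHealth(s string) string {
	switch s {
	case "Healthy", "Unhealthy", "Unknown":
		return s
	default:
		return "Unknown"
	}
}

func main() {
	// Illustrative values only.
	d := DeviceHealth{PoolName: "pool-a", DeviceName: "gpu-0", HealthStatus: "Unhealthy", LastUpdatedTimestamp: 1700000000}
	id, err := resourceID("gpu.example.com", d)
	if err != nil {
		panic(err)
	}
	fmt.Println(id, normalizeHealth(d.HealthStatus))
}
```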

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

The existing test coverage for Device Manager and DRA will be used as a baseline. New code introduced by this KEP will include thorough unit tests to maintain or improve coverage.

Unit tests

Current coverage for the relevant packages (as of June 2025):

k8s.io/kubernetes/pkg/kubelet/cm/devicemanager: 84.8%
k8s.io/kubernetes/pkg/kubelet/cm/dra: 79.8%
k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin: 84.0%
k8s.io/kubernetes/pkg/kubelet/cm/dra/state: 46.2%

The new DRA health monitoring logic will have thorough unit test coverage, including:

Health Information Cache Logic:
- Cache initialization from scratch and from a checkpoint file.
- State reconciliation of device health based on plugin reports.
- Correct handling of LastUpdated timestamps.
- Marking devices as “Unknown” after a timeout period.
- Correctly identifying which devices have changed health status.
- Accurate retrieval of health status for existing, timed-out, and non-existent devices.
- Proper cleanup of a driver’s health data upon its deregistration.
- Persistence logic for saving to and loading from the checkpoint file.
Plugin Registration and gRPC Stream Handling:
- Verification of successful health stream startup and background processing.
- Graceful handling of plugins that do not implement the health monitoring service (Unimplemented error).
- Correct cancellation of the health stream when a plugin is replaced or deregistered.
- Error handling during stream initiation and message reception.
DRA Manager Logic:
- Correct processing of health update messages from the gRPC stream.
- Accurate identification of Pods affected by a health change.
- Properly sending update notifications for affected Pods.
- Correct population of the AllocatedResourcesStatus field in the Pod’s status object.

Integration tests

N/A

e2e tests

Planned tests will cover the user-visible behavior of the feature:

Basic Health Reporting:
- Verify that when a DRA plugin reports a device as unhealthy, the PodStatus is updated to reflect this.
- Verify that when the device becomes healthy again, the PodStatus is correctly updated.
State Transitions:
- Test rapid health state changes (e.g., unhealthy to healthy and back) to ensure the final PodStatus reflects the latest state.
Failure Scenarios:
- Verify that a Pod in a CrashLoopBackOff state due to an unhealthy device correctly shows the device’s unhealthy status.
Feature Gate Behavior (for Alpha):
- When the feature gate is disabled, verify that the AllocatedResourcesStatus field is not populated by the DRA manager.
- When the feature gate is disabled on an existing cluster, verify that existing health information is gracefully ignored or removed on the next Pod update.
- When the feature gate is re-enabled, verify that health reporting resumes correctly.

Graduation Criteria

Alpha

New field is introduced in Pod Status
Feature implemented in Device Manager behind a feature flag
Initial e2e tests completed and enabled

Alpha2

Feature implemented in DRA behind a feature flag
e2e tests completed and enabled for DRA

Beta

The following requirements must be met for Beta graduation:

Complete e2e tests coverage
Configurable Device Health Check Timeout (Issue #133118, PR #133752): Verify that the configurable device health check timeout implementation (via health_check_timeout_seconds field) works correctly across different plugin vendors and hardware types (e.g., GPUs, FPGAs, TPUs, storage devices).
Failure Message Field (Issue #133202, PR #134506): Support for a message field in device health reporting to provide additional context about health status and failures, enabling better troubleshooting capabilities.

GA

Feedback is collected on usability of the field
Example of real-world usage with one of the device plugins, for example the NVIDIA Device Plugin

Upgrade / Downgrade Strategy

The feature exposes a new field based on information the Device Plugin already exposes. There is no dependency on upgrade/downgrade ordering; the feature either works or it does not.

The DRA implementation requires changes to DRA interfaces. DRA is in alpha and in active development. The feature will follow the DRA upgrade/downgrade strategy.

Version Skew Strategy

There is no issue with version skew. A kubelet that exposes this field will always be at the same version as, or behind, the API server that introduced the new field.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

Simple change of a feature gate will either enable or disable this feature.

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: ResourceHealthStatus
- Components depending on the feature gate: kubelet and kube-apiserver

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, with no side effect other than the new field being absent from pod status. When the feature is disabled, the values of the AllocatedResourcesStatus fields will be dropped when serving the API even if they are written to storage. This prevents clients from acting on potentially stale data when the feature is off. Values written while the feature was enabled may be wiped on the next update request. Re-enabling the feature does not guarantee that values written before the feature was disabled are kept.
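
The "drop when the gate is off" behavior can be pictured with the following minimal sketch. It is not the API server code: the types are trimmed local stand-ins, and the function name dropDisabledHealthFields and its boolean parameter are hypothetical.

```go
package main

import "fmt"

// Trimmed, hypothetical stand-ins for the API types.
type ResourceStatus struct{ Name string }

type ContainerStatus struct {
	Name                     string
	AllocatedResourcesStatus []ResourceStatus
}

type PodStatus struct {
	ContainerStatuses        []ContainerStatus
	AllocatedResourcesStatus []ResourceStatus
}

// dropDisabledHealthFields clears the health fields when the
// ResourceHealthStatus gate is off, so that clients never see stale data.
func dropDisabledHealthFields(status *PodStatus, gateEnabled bool) {
	if gateEnabled || status == nil {
		return
	}
	status.AllocatedResourcesStatus = nil
	for i := range status.ContainerStatuses {
		status.ContainerStatuses[i].AllocatedResourcesStatus = nil
	}
}

func main() {
	status := &PodStatus{
		ContainerStatuses: []ContainerStatus{{
			Name:                     "app",
			AllocatedResourcesStatus: []ResourceStatus{{Name: "example.com/gpu"}},
		}},
	}
	dropDisabledHealthFields(status, false)
	fmt.Println(len(status.ContainerStatuses[0].AllocatedResourcesStatus))
}
```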

What happens if we reenable the feature if it was previously rolled back?

The pod status will be updated again. When the feature is re-enabled, there may be a brief period where stale values from storage reappear in the API before kubelet and controllers actuate and update the values with current device health information. This period should be kept as short as possible through normal kubelet reconciliation. Consistency will not be guaranteed for fields written before the last enablement.

Are there any tests for feature enablement/disablement?

Yes, see the e2e tests section.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

An increase in the API server error rate, visible in apiserver_request_total filtered to non-2xx codes. An API validation error is the most likely indication of a problem.

Potential errors on kubelet would likely be exposed as error logs and events on Pods.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Will be tested, but we do not expect any issues.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check the Pod Status.

How can someone using this feature know that it is working for their instance?

API pod.status

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

There are a few error modes for this feature:

API issues accepting the new field - for example kubelet is writing the field in a format not acceptable by the API server
kubelet fails while populating this field

The first error mode can be observed with the metric apiserver_request_total filtered to non-2xx codes.

There is no good metric for the second error mode because it will not be clear what part of processing may fail. The most likely indication of an error would be the increased number of error events on the Pod.

Dependencies

DRA implementation.

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Pod Status size will increase insignificantly.

Will enabling / using this feature result in introducing new API types?

New field on Pod Status.

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Pod Status size will increase insignificantly.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Not significantly. The relevant data is already kept in memory; the feature only needs to connect it.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A

What are other known failure modes?

Not applicable.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

v1.31: KEP is in alpha and implemented for Device Plugin

Drawbacks

No post mortem health status for terminated pods: For batch jobs using RestartPolicy: Never, device health status will not be updated after the pod terminates. This means “post mortem” troubleshooting for batch jobs cannot rely on this field. The race condition between pod termination and health updates would require significant complexity to fix (tombstoning ClaimInfo entries in the DRA manager), which was deemed not worth the benefit. See Issue #132978 .

Alternatives

There are a few alternatives to this proposal.

First, an API similar to the Pod Resources API can be exposed by kubelet to query via kubectl or directly through some node-exposed port. The problems with this approach are:

It opens up a new API surface
It will be impossible to get status for Pods that have completed already

Second, exposing the status for DRA via claims - this approach leads to a debate on how to ensure security so kubelet is limited to which statuses it can set. With this approach, there are mechanisms in place to ensure that kubelet updates status for Pods scheduled on that node.

Infrastructure Needed (Optional)

We may need to update sample device plugin. No special infra is needed as emulating real GPU failures or failures in other devices is not practical.

Resources: Add support for a kubelet drop-in configuration directory

KEP-3983: Add support for a drop-in kubelet configuration directory

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories
  - Story 1
  - Story 2
  - Story 3
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Add support for a drop-in configuration directory for the Kubelet. This directory can be specified via a --config-dir flag, and configuration files will be processed in alphanumeric order. The flag will be empty by default and if not specified, drop-in support will not be enabled. During the alpha phase, we introduced an environment variable called KUBELET_CONFIG_DROPIN_DIR_ALPHA to control the drop-in configuration directory for testing purposes. In the beta phase, we plan to leave the --config-dir flag unset by default, which aligns with the behavior of the --config flag. Users are encouraged to opt in by specifying their desired configuration directory. Additionally, we will enhance the feature with E2E testing and streamline the configuration process. As part of this optimization, we will remove the KUBELET_CONFIG_DROPIN_DIR_ALPHA environment variable, simplifying configuration management. The feature will be enabled by default during the beta phase, and we will evaluate its status in future releases.

Motivation

A common pattern for software configuration in Linux is support for a drop-in configuration directory. The location of this directory is often based on a corresponding configuration file. For instance, /etc/security/limits.conf can be overridden by files in /etc/security/limits.d. This pattern is useful for a number of reasons, though a major motivation here is to allow each file to have a single owner. If multiple processes are vying to change the same file, they could stamp over each other’s changes and possibly race against each other, creating TOCTOU problems.

Components in Kubernetes can similarly be configured by multiple entities and preventing races between them is cumbersome. There has been past work in the Kubelet to have a Dynamic Configuration, but resolving between multiple entities and a last known good state was also complicated. Since the Kubelet is the node agent, and is often distributed as a package on the host operating system along with the container runtime, configuring it similarly to other host processes seems clear. This paves the path for continuing the pattern of drop-in configuration for the Kubelet.

Goals

Add support for a --config-dir flag to the kubelet to allow users to specify a drop-in directory, which will override the configuration for the Kubelet located at /etc/kubernetes/kubelet.conf
Extend kubelet configuration parsing code to handle files in the drop-in directory.
Define Kubernetes best practices for configuration definitions, similarly to API conventions. This is intended for other maintainers who wish to set up a configuration object that works well with drop-in directories.
Add ability to easily view the effective configuration that is being used by kubelet.

Non-Goals

Add support for drop-in configuration for Kubernetes components other than the Kubelet.
Dynamically reconfiguring running kubelets if drop-in contents change.

Proposal

This proposal aims to add support for a drop-in configuration directory for the kubelet via specifying a --config-dir flag (for example, /etc/kubernetes/kubelet.conf.d). Users are able to specify individually configurable kubelet config snippets in files, formatted in the same way as the existing kubelet.conf file. The kubelet will process the configuration provided in the drop-in directory in alphanumeric order:

If no other configuration for the subfield(s) exists, append to the base configuration
If the subfield(s) exist in the base configuration at /etc/kubernetes/kubelet.conf or in another file in the drop-in directory with lesser alphanumeric ordering, overwrite them
- If the subfield(s) exist as a list, overwrite instead of attempting to merge. This makes it easier to delete items from lists defined in the base kubelet.conf or other drop-ins without having to modify other files. See example below

If there are any issues with the drop-ins (e.g. formatting errors), the error will be reported in the same way as a misconfigured kubelet.conf file. Only files with a .conf extension will be parsed. All other files found will be skipped and logged.
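
The file selection rule can be sketched as follows. This is an illustration, not the kubelet code: the function name dropInOrder and the file names are hypothetical, and plain lexical string sorting is assumed as the meaning of "alphanumeric order".

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// dropInOrder splits directory entries into files to parse (those with a
// .conf extension, sorted lexically) and files to skip and log.
func dropInOrder(names []string) (parse, skipped []string) {
	for _, n := range names {
		if filepath.Ext(n) == ".conf" {
			parse = append(parse, n)
		} else {
			skipped = append(skipped, n)
		}
	}
	sort.Strings(parse)
	return parse, skipped
}

func main() {
	// Hypothetical directory listing.
	parse, skipped := dropInOrder([]string{"20-dns.conf", "README.md", "10-auth.conf", "notes.txt"})
	fmt.Println("parse in order:", parse)
	fmt.Println("skipped:", skipped)
}
```

Numeric prefixes such as 10- and 20- are a common convention for making the processing order explicit; they are not required by the proposal.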

This drop-in directory is purely optional and if empty, the base configuration is used and no behavior changes will be introduced. The --config-dir flag, along with the KUBELET_CONFIG_DROPIN_DIR_ALPHA environment variable, allows users to specify a drop-in configuration directory for the Kubelet. This directory is empty by default, ensuring that drop-in support is not enabled unless explicitly configured. This aims to align with --config flag defaults.

Example:

Base configuration:

authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
clusterDNS:
- 1.2.3.4
- 1.2.3.5

Drop-in 1:

authentication:
  x509:
    clientCAFile: /some/new/location

Drop-in 2:

clusterDNS:
- 1.2.3.6

Final result:

authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /some/new/location
clusterDNS:
- 1.2.3.6
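
The merge semantics shown in this example can be sketched with generic maps. This is a simplified illustration rather than the kubelet implementation, which operates on typed configuration objects; the function name merge is hypothetical. Nested maps are merged key by key, while scalars and lists from a later file replace earlier values wholesale.

```go
package main

import "fmt"

// merge returns a new map in which overlay values take precedence over base.
// Nested maps are merged recursively; scalars and lists are replaced.
func merge(base, overlay map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range overlay {
		baseMap, baseIsMap := out[k].(map[string]any)
		overlayMap, overlayIsMap := v.(map[string]any)
		if baseIsMap && overlayIsMap {
			out[k] = merge(baseMap, overlayMap)
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	base := map[string]any{
		"authentication": map[string]any{
			"anonymous": map[string]any{"enabled": false},
			"webhook":   map[string]any{"enabled": true},
			"x509":      map[string]any{"clientCAFile": "/etc/kubernetes/pki/ca.crt"},
		},
		"clusterDNS": []any{"1.2.3.4", "1.2.3.5"},
	}
	dropIn1 := map[string]any{
		"authentication": map[string]any{
			"x509": map[string]any{"clientCAFile": "/some/new/location"},
		},
	}
	dropIn2 := map[string]any{"clusterDNS": []any{"1.2.3.6"}}

	// Apply drop-ins in order, as the kubelet would after sorting file names.
	fmt.Println(merge(merge(base, dropIn1), dropIn2))
}
```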

User Stories

Story 1

As a cluster admin, I would like to be able to easily customize the Kubelet configuration for different node types, while still sharing a base configuration. For instance, I would like to have customized system reserved allocations for the control plane and workers.

Story 2

As a Kubernetes distribution author, I would like to enable users to customize fields on the Kubelet while leaving a sensible and secure default in an easy way.

Story 3

As a cluster admin, I would like to have cgroup management and log size management in different files, so I can automate per-node management of those configurations performed via different components without cross-interference.

Risks and Mitigations

Handling of zeroed fields
- It’s possible the configuration of the Kubelet does not handle unspecified fields well. Special testing will need to be done for different types to define and ensure conformance of that behavior.
Handling of lists
- During the beta phase, we will conduct additional testing to address risks and refine the feature.

Design Details

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

cmd/kubelet/app: 07-17-2023 27.6

Integration tests

e2e tests

A test should confirm that the kubelet.conf.d directory is correctly processed, and its contents are accurately reported in the configz endpoint.

Graduation Criteria

Alpha

Add ability to support drop-in configuration directory.

Beta

Add ability to augment the feature’s capabilities with a focus on robustness and testing, which includes:

Ensure the correct kubelet configuration is displayed when queried using the kubectl get --raw "/api/v1/nodes/{node-name}/proxy/configz" command, particularly verifying the contents of the kubelet.conf.d directory.
Remove the environment variable KUBELET_CONFIG_DROPIN_DIR_ALPHA, introduced during the Alpha phase, to streamline the user experience by simplifying configuration management.
Leave the --config-dir flag empty by default. Users can configure it by specifying a path, with /etc/kubernetes/kubelet.conf.d as the recommended directory.
Add a version compatibility check for drop-in files to ensure alignment with the expected Kubelet configuration API version and catch discrepancies when future versions are introduced.
Provide official guidance on the Kubernetes website for merging lists and structures in the kubelet configuration file, including documentation for the /configz endpoint.

GA

Collect user feedback and gather information about real-world use cases for this feature.

Upgrade / Downgrade Strategy

Upgrades and downgrades are safe as far as Kubelet stability is concerned. It’s possible a vendor may ship vital pieces of configuration within a drop-in directory. If the Kubelet is downgraded to a version that doesn’t support reading the drop-in directory, the kubelet will not recognize the --config-dir flag and risks failing. However, assuming the vendor left the original /etc/kubelet/kubelet.conf in a valid state and the flag isn’t specified, there should be no risk to the system. Any configuration that exists in a drop-in dir won’t be applied, but that would not affect kubelet stability.

Version Skew Strategy

All behavior change is encapsulated within the Kubelet, so there is no version skew possible within core Kubernetes. It is possible third party tools may attempt to utilize the Kubelet’s drop-in directory before the Kubelet is upgraded to support it, which would cause silent failures. It is the responsibility of these third party tools to ensure the Kubelet is new enough to support this.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

[] Feature gate N/A

In addition to configuring the KUBELET_CONFIG_DROPIN_DIR_ALPHA environment variable, administrators must explicitly set the --config-dir flag in the kubelet’s command-line interface (CLI) to enable this feature. Starting from the beta phase, specifying the --config-dir flag is the only way to enable this feature. The default value for --config-dir is an empty string, which means the feature is disabled by default.

The decision to use an environment variable (KUBELET_CONFIG_DROPIN_DIR_ALPHA) over a feature gate was made to avoid potential conflicts in configuration settings. With the current configuration flow, feature gates could lead to unexpected behavior when CLI settings conflict with the kubelet.conf.d directory. The potential issue arises when the CLI initially sets the feature gate to “off,” but the kubelet configuration specifies it as “on.” In this scenario, the kubelet would start with the feature gate “off,” switch it to “on” during configuration rendering, and then have conflicting settings when reading the kubelet.conf.d directory, leading to unexpected behavior. By using an environment variable during the alpha phase, we provided a simpler and more predictable way to control the drop-in configuration directory for testing. In the beta phase, we are removing this environment variable to streamline configuration management and enhance the user experience.
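The enablement rules above can be sketched in a few lines of Go. This is only an illustration and not the kubelet's actual code: it uses the standard library `flag` package, the helper name `dropInDir` and the directory path in `main` are hypothetical, and returning an error when the flag is set without the alpha environment variable is an assumption of the sketch.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// alphaEnvVar is the environment variable named in the text above.
const alphaEnvVar = "KUBELET_CONFIG_DROPIN_DIR_ALPHA"

// dropInDir parses args and returns the drop-in directory to read, or ""
// when the feature is disabled. When alpha is true, the environment variable
// must also be set; treating its absence as an error is an assumption of
// this sketch.
func dropInDir(args []string, getenv func(string) string, alpha bool) (string, error) {
	fs := flag.NewFlagSet("kubelet-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	dir := fs.String("config-dir", "", "drop-in configuration directory")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *dir == "" {
		return "", nil // default value is the empty string: feature disabled
	}
	if alpha && getenv(alphaEnvVar) == "" {
		return "", fmt.Errorf("--config-dir requires %s during alpha", alphaEnvVar)
	}
	return *dir, nil
}

func main() {
	noEnv := func(string) string { return "" }
	// Beta and later: the flag alone enables the feature.
	dir, err := dropInDir([]string{"--config-dir=/etc/kubernetes/kubelet.conf.d"}, noEnv, false)
	fmt.Println(dir, err)
}
```

Passing the environment lookup as a function keeps the sketch easy to exercise without touching the real process environment.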

Does enabling the feature change any default behavior?

No, upgrading to a version of the Kubelet with this feature will not enable the Kubelet to be configured with the drop-in directory if no flag is specified.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. To disable the feature, roll back by removing the --config-dir flag from the kubelet’s CLI.

What happens if we reenable the feature if it was previously rolled back?

This feature will be re-enabled via adding back the --config-dir flag to the CLI.

Are there any tests for feature enablement/disablement?

A test will be added to assemble a single, functional kubelet configuration object from various individual drop-in config files.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The feature can cause the Kubelet to fail if the configuration in the drop-in directory is invalid. A rollback could fail if the original configuration is also invalid. This situation would cause workloads to not appear on that node. Neither of these cases is expected.

What specific metrics should inform a rollback?

The Kubelet not starting, which will cause nodes to be NotReady.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

The feature does not persist any data and so the upgrade->downgrade->upgrade path is not special.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Workloads do not directly consume this feature; it is for cluster admins during kubelet configuration. To check if the feature is enabled, users can query the merged configuration. One way to do this is by hitting the configz endpoint using kubectl or a standalone kubelet mode.
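As a rough illustration of checking the merged configuration, the Go sketch below decodes a configz response and looks up one field. It assumes the payload wraps the configuration under a top-level "kubeletconfig" key, and that the body was fetched separately (for example with `kubectl get --raw "/api/v1/nodes/<node>/proxy/configz"`). The helper name `configzField` and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configzField decodes a configz response body and returns the value of one
// kubelet configuration field, plus whether the field was present.
func configzField(body []byte, field string) (any, bool, error) {
	var payload struct {
		KubeletConfig map[string]any `json:"kubeletconfig"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return nil, false, err
	}
	v, ok := payload.KubeletConfig[field]
	return v, ok, nil
}

func main() {
	// Hypothetical payload; a real response contains many more fields.
	sample := []byte(`{"kubeletconfig":{"maxPods":250,"clusterDomain":"cluster.local"}}`)
	v, ok, err := configzField(sample, "maxPods")
	fmt.Println(v, ok, err)
}
```

Comparing the returned value with what a drop-in file sets is one way to confirm that the drop-in took effect.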

How can someone using this feature know that it is working for their instance?

In alpha, the user can query their active kubeletconfiguration to see if their drop-ins have taken effect. In beta and onwards, the user will be able to read this off logs or the API, to be determined as described above.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The node bootstrap time should be minimal so kubelet doesn’t take too long to reconcile the configuration.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

No noticeable increase in the kubelet startup time.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

No, though there may be changes to the Kubelet configuration required.

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

No, though metadata on the fields may need to be changed.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

It will take slightly longer for the Kubelet to start, but it should not be noticeable unless there are very many (hundreds?) of configurations.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Likely negligible amounts of CPU.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Not likely.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

This feature is enabled in Kubelet alone.

What are other known failure modes?

Invalid configuration, including issues like incorrect file permissions or misconfigured settings for the drop-in directory and files, is a known failure mode, just as it is today with /etc/kubernetes/kubelet.conf.

What steps should be taken if SLOs are not being met to determine the problem?

Fix the invalid configuration, or remove configurations.

Implementation History

2023-05-04: KEP initialized.
2023-07-17: Alpha is implemented in 1.28
2023-09-25: KEP retargeted to Alpha in 1.29
2024-01-19: Added an e2e test and set KEP target to Beta in 1.30
2024-09-30: Update Beta requirements
2025-10-02: Update to stable

Drawbacks

Alternatives

Reinstate the now deprecated Dynamic Kubelet Configuration

Continue to rely on CLI flags or systemd drop-in files.

Resources: Add support for AdminNetworkPolicy resources

KEP-2091: Add support for AdminNetworkPolicy resources

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
(R) Graduation criteria is in place
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Introduce a new set of APIs to express an administrator’s intent in securing their K8s cluster. This doc proposes the AdminNetworkPolicy API to complement the developer-focused NetworkPolicy API in Kubernetes.

Motivation

Kubernetes provides the NetworkPolicy resource to control traffic within a cluster. NetworkPolicy focuses on expressing a developer’s intent to secure their applications. However, it was not intended to be used for cluster scoped administrative traffic control, which is reflected by its design:

NetworkPolicy uses an “implicit isolation” model, which means that once a policy is applied to certain workloads, they are automatically isolated (in the direction specified by the policy) and anything allowed needs to be explicitly called out.
It has no concept of explicit “deny” rules, because the application deployer can simply refrain from allowing the things they want to deny.
The commutative nature of NetworkPolicy can make certain filtering intents difficult to express.

Thus, in order to satisfy the needs of a cluster admin, we propose to introduce a new API that captures the administrator’s intent.

Goals

The goals for this KEP are to satisfy the following key user stories:

As a cluster administrator, I want to enforce irrevocable in-cluster guardrails that all workloads must adhere to in order to guarantee the safety of my clusters. In particular I want to enforce certain network level access controls that are cluster scoped and cannot be overridden or bypassed by namespace scoped NetworkPolicies.

Example: I would like to explicitly allow all pods in my cluster to reach kubeDNS.

As a cluster administrator, I want to have the option to enforce in-cluster network level access controls that facilitate network multi-tenancy and strict network level isolation between multiple teams and tenants sharing a cluster via use of namespaces or groupings of namespaces per tenant.

Example: I would like to define two tenants in my cluster, one composed of the pods in foo-ns-1 and foo-ns-2 and the other with pods in bar-ns-1, where inter-tenant traffic is denied.

As a cluster administrator, I want to optionally also deploy an additional default set of policies to all in-cluster workloads that may be overridden by the developers if needed.

Example: I would like to explicitly delegate the restriction of traffic destined for cluster monitoring pods to the developer, allowing them to set up network policy to deny or allow the traffic from/to their application.

There are several unique properties that we need to add in order to accomplish the user stories above.

Deny rules and, therefore, hierarchical enforcement of policy
Semantics for a cluster-scoped policy object that may include namespaces/workloads that have not been created yet.
Interoperability with existing Kubernetes Network Policy API

Non-Goals

Our mission is to solve the most common use cases that cluster admins have. That is, we don’t want to solve for every possible policy permutation a user can think of. Instead, we want to design an API that addresses 90-95% of use cases while keeping the mental model easy to understand and use. The focus of this KEP is on cluster scoped controls for east-west traffic within a cluster, meaning that an AdminNetworkPolicyPeer is always defined as a set of in-cluster objects. Cluster scoped controls for north-south traffic may be addressed via future versions of the API resources introduced in this or other future KEPs. For the time being, the AdminNetworkPolicy resource introduced by this KEP will never affect north-south traffic, and thus will also not override or bypass NetworkPolicies with ipBlock rules that select external traffic.

Proposal

In order to achieve the three primary broad use cases for a cluster admin to secure K8s clusters, we propose to introduce the following resources under policy.networking.k8s.io API group:

AdminNetworkPolicy
BaselineAdminNetworkPolicy

The AdminNetworkPolicy(ANP) and BaselineAdminNetworkPolicy(BANP) resources will help the administrators:

Set strict security rules for the cluster, i.e. a developer CANNOT override these rules by creating NetworkPolicies that apply to the same workloads as the AdminNetworkPolicy does.
Set baseline security rules that describe default connectivity for cluster workloads, which CAN be overridden by developer NetworkPolicies if needed.

AdminNetworkPolicy resource

An AdminNetworkPolicy (ANP) resource will help the administrators set strict security rules for the cluster, i.e. a developer CANNOT override these rules by creating NetworkPolicies that apply to the same workloads as the AdminNetworkPolicy does.

Actions

Unlike the NetworkPolicy resource, in which each rule represents allowed traffic, AdminNetworkPolicy will enable administrators to set Pass, Deny or Allow as the action of each rule. AdminNetworkPolicy rules should be read as-is, i.e. there will not be any implicit isolation effects for the Pods selected by the AdminNetworkPolicy, as opposed to what NetworkPolicy rules imply.

Pass: Traffic that matches a Pass rule will skip all further rules from all lower-precedence ANPs, and instead be enforced by the K8s NetworkPolicies. If there is no K8s NetworkPolicy rule match, and no BaselineAdminNetworkPolicy rule match (more on this in the priority section), traffic will be governed by the implementation. For most implementations, this means “allow”, but there may be implementations which have their own policies outside of the standard Kubernetes APIs.
Deny: Traffic that matches a Deny rule will be dropped.
Allow: Traffic that matches an Allow rule will be allowed.

AdminNetworkPolicy Pass rules allow an admin to delegate security posture for certain traffic to the Namespace owners by overriding any lower-precedence Allow or Deny rules. For example, intra-tenant traffic management can be delegated to tenant admins explicitly with the use of Pass rules.

AdminNetworkPolicy Deny rules are useful for administrators to explicitly block traffic with malicious in-cluster clients, or workloads that pose security risks. Those traffic restrictions can only be lifted once the Deny rules are deleted, modified by the admin, or overridden by a higher priority rule.

On the other hand, the Allow rules can be used to call out traffic in the cluster that needs to be allowed for certain components to work as expected (egress to CoreDNS for example). Such traffic should not be blocked when developers apply NetworkPolicies that isolate the workloads in their Namespaces.
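The way the three actions interact with NetworkPolicy and BaselineAdminNetworkPolicy can be summarized in a small Go sketch. It is not part of the proposed API: the `Verdict` type, the `NoMatch` value and the `resolve` function are hypothetical names, and the final fallback of "allow" reflects only the common implementation default mentioned above.

```go
package main

import "fmt"

// Verdict is the outcome of evaluating one tier of policy for a connection.
type Verdict string

const (
	Allow   Verdict = "Allow"
	Deny    Verdict = "Deny"
	Pass    Verdict = "Pass"
	NoMatch Verdict = "NoMatch" // hypothetical: nothing in the tier applied
)

// resolve combines the per-tier verdicts for one connection.
//   - anp:  result of the AdminNetworkPolicy tier (Allow, Deny, Pass or NoMatch)
//   - np:   result of namespaced NetworkPolicies (NoMatch when no policy
//     selects the pod, otherwise Allow or Deny via implicit isolation)
//   - banp: result of the BaselineAdminNetworkPolicy (Allow, Deny or NoMatch)
func resolve(anp, np, banp Verdict) Verdict {
	// An ANP Allow or Deny is final and cannot be overridden by developers.
	if anp == Allow || anp == Deny {
		return anp
	}
	// Pass or NoMatch: the decision is delegated to NetworkPolicies.
	if np == Allow || np == Deny {
		return np
	}
	// No NetworkPolicy selects the pod: fall through to the baseline.
	if banp == Allow || banp == Deny {
		return banp
	}
	// Assumed implementation default when nothing matched.
	return Allow
}

func main() {
	// An ANP Pass rule delegates to a NetworkPolicy that isolates the pod.
	fmt.Println(resolve(Pass, Deny, Allow))
}
```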

Priority

The policy instances will be ordered based on the numeric priority assigned to each ANP. Priority is a 32 bit integer value, where a smaller number corresponds to a higher precedence. The lowest numeric priority value is “0”, which corresponds to the highest precedence. Larger numbers have lower precedence. All ANPs will have higher precedence over NetworkPolicies in the cluster. If traffic matches both an ANP rule and a NetworkPolicy rule, the only case where the NetworkPolicy rule will be evaluated is when there is a third higher-precedence ANP Pass rule that allows it to bypass any lower-precedence ANP rules.

The relative precedence of the rules within a single ANP object (all of which share a priority) will be determined by the order in which the rule is written. Thus, a rule that appears at the top of the ingress/egress rules would take the highest precedence.
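The ordering described above (lower numeric priority first, then rule order within a policy) can be sketched as a first-match search. This is a simplified model rather than an implementation: the `Policy` and `Rule` types are hypothetical, and peers are reduced to a single source namespace name, with "*" standing for any namespace.

```go
package main

import (
	"fmt"
	"sort"
)

// Rule is a simplified ingress rule: it matches on the source namespace only.
type Rule struct {
	Name   string
	Action string
	FromNS string // "*" matches any source namespace in this sketch
}

// Policy is a simplified stand-in for an AdminNetworkPolicy.
type Policy struct {
	Name     string
	Priority int32
	Ingress  []Rule
}

// firstMatch returns the policy, rule and action that decide traffic from
// srcNS: policies are visited in ascending numeric priority, and rules in
// the order they are written. Empty strings mean no ANP rule matched.
func firstMatch(policies []Policy, srcNS string) (policy, rule, action string) {
	ordered := append([]Policy(nil), policies...) // do not reorder the input
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Priority < ordered[j].Priority
	})
	for _, p := range ordered {
		for _, r := range p.Ingress {
			if r.FromNS == "*" || r.FromNS == srcNS {
				return p.Name, r.Name, r.Action
			}
		}
	}
	return "", "", ""
}

func main() {
	policies := []Policy{
		{Name: "deny-all", Priority: 20, Ingress: []Rule{{"deny-any", "Deny", "*"}}},
		{Name: "monitoring", Priority: 10, Ingress: []Rule{{"allow-mon", "Allow", "monitoring-ns"}}},
	}
	fmt.Println(firstMatch(policies, "monitoring-ns"))
	fmt.Println(firstMatch(policies, "other-ns"))
}
```

The stable sort leaves policies with equal priority in input order, which is an arbitrary choice here; the proposal leaves that case to the implementation.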

For alpha, this API defines “1000” as the maximum numeric value for priority, but this may be revisited as the proposal advances. For future-safety, clients may assume that higher values will eventually be allowed, and simply treat it as an int32. Also for alpha, each ANP is limited to 100 ingress rules and 100 egress rules, which is subject to change (to a greater number) in the future as well.
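A minimal Go sketch of checking these alpha limits follows. The function name `checkAlphaLimits` and the constants are hypothetical; in the proposal itself the limits are expressed as validation markers on the API types.

```go
package main

import "fmt"

const (
	maxPriority = 1000 // alpha upper bound on the numeric priority
	maxRules    = 100  // alpha limit per direction (ingress or egress)
)

// checkAlphaLimits returns a human-readable problem for every alpha limit
// that is exceeded, or an empty slice when the values are acceptable.
func checkAlphaLimits(priority int32, ingressRules, egressRules int) []string {
	var problems []string
	if priority < 0 || priority > maxPriority {
		problems = append(problems, fmt.Sprintf("priority %d is outside [0, %d]", priority, maxPriority))
	}
	if ingressRules > maxRules {
		problems = append(problems, fmt.Sprintf("%d ingress rules exceed the limit of %d", ingressRules, maxRules))
	}
	if egressRules > maxRules {
		problems = append(problems, fmt.Sprintf("%d egress rules exceed the limit of %d", egressRules, maxRules))
	}
	return problems
}

func main() {
	fmt.Println(checkAlphaLimits(1001, 3, 120))
}
```

Keeping the field typed as int32 while validating the narrower range matches the advice that clients should not hard-code the current maximum.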

Conflict resolution: Two policies are considered to be conflicting if they are assigned the same priority and apply to the same resources or a union of resources. In order to avoid such conflicts, we propose to include tooling for ANP resources to help alert the admin to potentially ambiguous ANP priority scenarios (more details in risks and mitigation). However, ultimately it will be the job of the network policy implementation to decide how to handle overlapping priority situations.
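The kind of check such tooling might perform can be sketched in Go as below. This is an assumption about what the tooling could look like, not a description of any existing tool: `sharedPriorities` is a hypothetical helper, and it only flags equal priorities without checking whether the policies' subjects actually overlap.

```go
package main

import (
	"fmt"
	"sort"
)

// sharedPriorities groups ANP names by priority and keeps only the
// priorities used by more than one policy, which are potentially ambiguous.
func sharedPriorities(priorityByName map[string]int32) map[int32][]string {
	byPriority := make(map[int32][]string)
	for name, p := range priorityByName {
		byPriority[p] = append(byPriority[p], name)
	}
	for p, names := range byPriority {
		if len(names) < 2 {
			delete(byPriority, p) // a unique priority cannot conflict
			continue
		}
		sort.Strings(names) // deterministic output for reporting
	}
	return byPriority
}

func main() {
	// Hypothetical policy names and priorities.
	anps := map[string]int32{"tenant-foo": 10, "tenant-bar": 10, "cluster-dns": 5}
	fmt.Println(sharedPriorities(anps))
}
```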

Rule Names

In order to help future proof the ANP API, a built in mechanism to identify each allow/deny/pass rule is required. Such a mechanism will help administrators organize and identify individual rules within an AdminNetworkPolicy resource. We propose to introduce a new string field, called name, in each AdminNetworkPolicy ingress/egress rule. Currently the name of a rule is optional and is most useful if it is unique within an ANP instance. The max length for the rule name string is restricted to 100 characters, which provides flexibility for long generated names.
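As an illustration of the naming constraints, the Go sketch below reports rule names that exceed the length limit and names that repeat within one policy. The helper `checkRuleNames` is hypothetical, counting the length in Unicode characters is an assumption of the sketch, and duplicates are only reported because uniqueness is described as useful rather than required.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const maxRuleNameLen = 100 // maximum rule name length stated above

// checkRuleNames inspects the rule names of a single policy. Empty names are
// skipped because the field is optional.
func checkRuleNames(names []string) (tooLong, duplicates []string) {
	seen := make(map[string]bool)
	reported := make(map[string]bool)
	for _, n := range names {
		if n == "" {
			continue
		}
		if utf8.RuneCountInString(n) > maxRuleNameLen {
			tooLong = append(tooLong, n)
		}
		if seen[n] && !reported[n] {
			duplicates = append(duplicates, n)
			reported[n] = true
		}
		seen[n] = true
	}
	return tooLong, duplicates
}

func main() {
	tooLong, dups := checkRuleNames([]string{"allow-dns", "", "deny-foo", "allow-dns"})
	fmt.Println(len(tooLong), dups)
}
```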

BaselineAdminNetworkPolicy resource

A BaselineAdminNetworkPolicy (BANP) resource will help the administrators set baseline security rules that describe default connectivity for cluster workloads, which CAN be overridden by developer owned NetworkPolicies if needed.

The BaselineAdminNetworkPolicy spec will look almost identical to that of an ANP, except for two important differences:

There is no Priority field associated with BaselineAdminNetworkPolicy. Note that in writing a BaselineAdminNetworkPolicy, admins can create different priorities in rules by placing them before or after one another. However, the authors of this KEP did not find a valid use case for creating multiple BaselineAdminNetworkPolicies in a cluster with distinct policy-level priorities. BANPs are intended for setting cluster default security postures, and in most cases the subject of such policy should be the entire cluster.
There is no Pass action for BaselineAdminNetworkPolicy rules.
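The difference in permitted actions between the two resources can be expressed as a small check. The function `validAction` below is a hypothetical helper written for illustration, and the kind strings are simply the resource names used in this proposal.

```go
package main

import "fmt"

// validAction reports whether action may be used in a rule of the given
// resource kind: ANP rules accept Allow, Deny and Pass, while BANP rules
// accept only Allow and Deny.
func validAction(kind, action string) bool {
	switch action {
	case "Allow", "Deny":
		return kind == "AdminNetworkPolicy" || kind == "BaselineAdminNetworkPolicy"
	case "Pass":
		return kind == "AdminNetworkPolicy"
	}
	return false
}

func main() {
	fmt.Println(validAction("AdminNetworkPolicy", "Pass"))
	fmt.Println(validAction("BaselineAdminNetworkPolicy", "Pass"))
}
```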

User Stories

Note: This KEP focuses on East-West (cluster-internal) user stories and does not address North-South (cluster-external) use cases, which will be solved in a follow-up proposal.

Story 1: Deny traffic at a cluster level

As a cluster admin, I want to apply non-overridable deny rules to certain pod(s) and/or Namespace(s) that isolate the selected resources from all other cluster internal traffic.

For Example: In this diagram there is an AdminNetworkPolicy applied to the sensitive-ns denying ingress from all other in-cluster resources for all ports and protocols.

Story 2: Allow traffic at a cluster level

As a cluster admin, I want to apply non-overridable allow rules to certain pod(s) and/or Namespace(s) that enable the selected resources to communicate with all other cluster internal entities.

For Example: In this diagram there is an AdminNetworkPolicy applied to every namespace in the cluster allowing egress traffic to kube-dns pods, and ingress traffic from pods in monitoring-ns for all ports and protocols.

Story 3: Explicitly Delegate traffic to existing K8s Network Policy

As a cluster admin, I want to explicitly delegate traffic so that it skips any remaining cluster network policies and is handled by standard namespace scoped network policies.

For Example: In the diagram below egress traffic destined for the service svc-pub in namespace bar-ns-1 on TCP port 8080 is delegated to the k8s network policies implemented in foo-ns-1 and foo-ns-2. If no k8s network policies touch the delegated traffic the traffic will be allowed.

Story 4: Create and Isolate multiple tenants in a cluster

As a cluster admin, I want to build tenants in my cluster that are isolated from each other by default. Tenancy may be modeled as 1:1, where 1 tenant is mapped to a single Namespace, or 1:n, where a single tenant may own more than 1 Namespace.

For Example: In the diagram below two tenants (Foo and Bar) are defined such that all ingress traffic is denied to either tenant.
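The intended effect of this story can be modelled with a short Go sketch that maps namespaces to tenants and denies traffic crossing a tenant boundary. It is a conceptual model only: `interTenantDenied` is a hypothetical helper, the namespace names are borrowed from the Goals section, and a real implementation would derive tenancy from namespace labels selected by AdminNetworkPolicy rules.

```go
package main

import "fmt"

// interTenantDenied reports whether traffic from srcNS to dstNS crosses a
// tenant boundary. Namespaces that belong to no tenant are left to other
// policies in this sketch, so they are never denied here.
func interTenantDenied(tenantOf map[string]string, srcNS, dstNS string) bool {
	src, srcKnown := tenantOf[srcNS]
	dst, dstKnown := tenantOf[dstNS]
	return srcKnown && dstKnown && src != dst
}

func main() {
	// 1:n tenancy: tenant foo owns two namespaces, tenant bar owns one.
	tenantOf := map[string]string{
		"foo-ns-1": "foo",
		"foo-ns-2": "foo",
		"bar-ns-1": "bar",
	}
	fmt.Println(interTenantDenied(tenantOf, "foo-ns-1", "foo-ns-2"))
	fmt.Println(interTenantDenied(tenantOf, "foo-ns-1", "bar-ns-1"))
}
```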

Story 5: Cluster Wide Default Guardrails

As a cluster admin, I want to change the default security model for my cluster, so that all intra-cluster traffic (except for certain essential traffic) is blocked by default. Namespace owners will need to use NetworkPolicies to explicitly allow known traffic. This follows a whitelist model which is familiar to many security administrators, and similar to how kubernetes suggests network policy be used.

For Example: In the following diagram all Ingress traffic to every cluster resource is denied by a baseline deny rule.

RBAC

AdminNetworkPolicy resources are meant for cluster administrators. Thus, access to manage these resources must be granted to subjects which have the authority to outline the security policies for the cluster. Therefore, by default, the cluster-admin ClusterRole will be granted the permissions to edit the AdminNetworkPolicy resources.

Key differences between AdminNetworkPolicies and NetworkPolicies

	AdminNetworkPolicy	K8s NetworkPolicies
Target persona	Cluster administrator or equivalent	Developers within Namespaces
Scope	Cluster	Namespaced
Drop traffic	Supported with a `Deny` rule action	Supported via implicit isolation of target Pods
Skip enforcement	Supported with a `Pass` rule action	Not needed
Allow traffic	Supported with an `Allow` rule action	Default action for all rules is to allow
Implicit isolation	No implicit isolation	All rules have an implicit isolation of target Pods
Rule precedence	Depends on the order in which they appear within an ANP	Rules are additive
Policy precedence	Depends on `priority` field among ANPs. Always enforced before K8s NetworkPolicies	Enforced after AdminNetworkPolicies, before BaselineAdminNetworkPolicy
Matching pod selection	Can apply different rules to multiple groups of Pods	Applies rules to a single group of Pods
Rule identifiers	Name per rule in string format. Unique within an ANP	Not supported
Cluster external traffic	Not supported	Partially supported via IPBlock
Namespace selectors	Supports advanced selection of Namespaces with the use of `namespaceSet`	Supports label based Namespace selection with the use of `namespaceSelector` field

Note that AdminNetworkPolicy can also apply to Pods in Namespaces that don’t exist yet, and will automatically apply to a new Namespace as long as the new Namespace’s labels match the AdminNetworkPolicy rule’s appliedTo selection criteria. NetworkPolicies, on the contrary, only apply to Pods in the Namespace they are created in.
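Label-based selection is what lets a policy cover Namespaces created later. The Go sketch below shows exact-match label selection in its simplest form; `selectsNamespace` is a hypothetical helper, and it ignores the richer match expressions that Kubernetes label selectors also support.

```go
package main

import "fmt"

// selectsNamespace reports whether every key/value pair in selector is
// present in nsLabels. An empty selector selects every namespace.
func selectsNamespace(selector, nsLabels map[string]string) bool {
	for k, want := range selector {
		got, ok := nsLabels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"tenant": "foo"} // hypothetical label
	// A namespace created after the policy is covered as soon as it carries
	// the matching label; no change to the policy is needed.
	newNamespace := map[string]string{"tenant": "foo", "env": "dev"}
	fmt.Println(selectsNamespace(selector, newNamespace))
}
```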

Notes/Constraints/Caveats

It is important to note that the controller implementation for cluster-scoped policy APIs will not be provided as part of this KEP. Such controllers which realize the intent of these APIs will be provided by individual network policy providers, as is the case with the NetworkPolicy API.

Risks and Mitigation

To understand why traffic between a pair of Pods is allowed or denied, a list of NetworkPolicy resources in both Pods’ Namespaces used to be sufficient (assuming no other CRDs in the cluster try to alter traffic behavior). With the introduction of AdminNetworkPolicy this is no longer the case, and users could face difficulty in determining why NetworkPolicies did not take effect.

For example, in the case where a positive priority (non-zero) AdminNetworkPolicy rule, NetworkPolicy rule and “0” priority AdminNetworkPolicy rule apply to an overlapping set of Pods, users will need to refer to the priority associated with the rule to determine which rule would take effect. Figuring out how stacked policies affect traffic between workloads might not be very straightforward.

To mitigate this risk and improve usability, some additional in-tree tooling for both the Admin and Developer will need to be created. For the Admin, it is safe to assume they will have the correct RBAC roles to list all the NetworkPolicies and AdminNetworkPolicies in a cluster. Therefore, the Admin oriented tooling should be able to both alert the Admin to any overriding of NetworkPolicies that may occur if a new AdminNetworkPolicy is to be created and provide a warning if there is another ANP with the same priority. For the Developer, who usually can only list the NetworkPolicies in a given namespace, the tooling should simply alert if a given NetworkPolicy would be overridden by any of the ANPs in a cluster. The aforementioned tooling will not be a primary development goal during the alpha version of this API, and will most likely be completed during the beta development cycle.

Future Work

Although the scope of the AdminNetworkPolicies is extensive, the above proposal intends to only solve the documented use cases. However, we would also like to consider the following set of proposals as future work items:

Audit Logging: Very often cluster administrators want to log every connection that is either denied or allowed by a firewall rule and send the details to an IDS or any custom tool for further processing of that information. With the introduction of deny rules, it may make sense to incorporate the cluster-scoped policy resources with a new field, say auditPolicy, to determine whether a connection matching a particular rule/policy must be logged or not.

Design Details

AdminNetworkPolicy API Design

The following new AdminNetworkPolicy API will be added to the policy.networking.k8s.io API group:


// AdminNetworkPolicy describes cluster-level network traffic control rules
type AdminNetworkPolicy struct {
 metav1.TypeMeta `json:",inline"`
 metav1.ObjectMeta `json:"metadata"`

 // Specification of the desired behavior of AdminNetworkPolicy.
 Spec AdminNetworkPolicySpec `json:"spec"`

 // Status is the status to be reported by the implementation, this is not
 // standardized in alpha and consumers should report what they see fit in
 // relation to their AdminNetworkPolicy implementation.
 // +optional
 Status AdminNetworkPolicyStatus `json:"status,omitempty"`
}

// AdminNetworkPolicyStatus defines the observed state of AdminNetworkPolicy.
type AdminNetworkPolicyStatus struct {
 Conditions []metav1.Condition `json:"conditions"`
}

// AdminNetworkPolicySpec defines the desired state of AdminNetworkPolicy.
type AdminNetworkPolicySpec struct {
 // Priority is a value from 0 to 1000. Rules with lower priority values have
 // higher precedence, and are checked before rules with higher priority values.
 // All AdminNetworkPolicy rules have higher precedence than NetworkPolicy or
 // BaselineAdminNetworkPolicy rules.
 // The relative precedence of the rules within a single ANP object (all of
 // which share the priority) will be determined by the order in which the rule
 // is written. Thus, a rule that appears at the top of the ingress/egress rules
 // would take the highest precedence. If ingress rules are defined before egress
 // rules in the same ANP object then ingress would take precedence and vice versa.
 // The behavior is undefined if two ANP objects have same priority.
 // +kubebuilder:validation:Minimum=0
 // +kubebuilder:validation:Maximum=1000
 Priority int32 `json:"priority"`

 // Subject defines the pods to which this AdminNetworkPolicy applies.
 Subject AdminNetworkPolicySubject `json:"subject"`

 // Ingress is the list of Ingress rules to be applied to the selected pods.
 // A total of 100 rules will be allowed in each ANP instance. ANPs with no
 // ingress rules do not affect ingress traffic.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Ingress []AdminNetworkPolicyIngressRule `json:"ingress,omitempty"`

 // Egress is the list of Egress rules to be applied to the selected pods.
 // A total of 100 rules will be allowed in each ANP instance. ANPs with no
 // egress rules do not affect egress traffic.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Egress []AdminNetworkPolicyEgressRule `json:"egress,omitempty"`
}

// AdminNetworkPolicyIngressRule describes an action to take on a particular
// set of traffic destined for pods selected by an AdminNetworkPolicy's
// Subject field.
type AdminNetworkPolicyIngressRule struct {
 // Name is an identifier for this rule, that may be no more than 100 characters
 // in length. This field should be used by the implementation to help
 // improve observability, readability and error-reporting for any applied
 // AdminNetworkPolicies.
 // +optional
 // +kubebuilder:validation:MaxLength=100
 Name string `json:"name,omitempty"`

 // Action specifies the effect this rule will have on matching traffic.
 // Currently the following actions are supported:
 // Allow: allows the selected traffic (even if it would otherwise have been denied by NetworkPolicy)
 // Deny: denies the selected traffic
 // Pass: instructs the selected traffic to skip any remaining ANP rules, and
 // then pass execution to any NetworkPolicies that select the pod.
 // If the pod is not selected by any NetworkPolicies then execution
 // is passed to any BaselineAdminNetworkPolicies that select the pod.
 Action AdminNetworkPolicyRuleAction `json:"action"`

 // From is the list of sources whose traffic this rule applies to.
 // If any AdminNetworkPolicyPeer matches the source of incoming
 // traffic then the specified action is applied.
 // This field must be defined and contain at least one item.
 // +kubebuilder:validation:MinItems=1
 // +kubebuilder:validation:MaxItems=100
 From []AdminNetworkPolicyPeer `json:"from"`

 // Ports allows for matching traffic based on port and protocols.
 // If Ports is not set then the rule does not filter traffic via port.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Ports *[]AdminNetworkPolicyPort `json:"ports,omitempty"`
}

// AdminNetworkPolicyEgressRule describes an action to take on a particular
 // set of traffic originating from pods selected by an AdminNetworkPolicy's
// Subject field.
type AdminNetworkPolicyEgressRule struct {
 // Name is an identifier for this rule, that may be no more than 100 characters
 // in length. This field should be used by the implementation to help
 // improve observability, readability and error-reporting for any applied
 // AdminNetworkPolicies.
 // +optional
 // +kubebuilder:validation:MaxLength=100
 Name string `json:"name,omitempty"`

 // Action specifies the effect this rule will have on matching traffic.
 // Currently the following actions are supported:
 // Allow: allows the selected traffic (even if it would otherwise have been denied by NetworkPolicy)
 // Deny: denies the selected traffic
 // Pass: instructs the selected traffic to skip any remaining ANP rules, and
 // then pass execution to any NetworkPolicies that select the pod.
 // If the pod is not selected by any NetworkPolicies then execution
 // is passed to any BaselineAdminNetworkPolicies that select the pod.
 Action AdminNetworkPolicyRuleAction `json:"action"`

 // To is the list of destinations whose traffic this rule applies to.
 // If any AdminNetworkPolicyPeer matches the destination of outgoing
 // traffic then the specified action is applied.
 // This field must be defined and contain at least one item.
 // +kubebuilder:validation:MinItems=1
 // +kubebuilder:validation:MaxItems=100
 To []AdminNetworkPolicyPeer `json:"to"`

 // Ports allows for matching traffic based on port and protocols.
 // If Ports is not set then the rule does not filter traffic via port.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Ports *[]AdminNetworkPolicyPort `json:"ports,omitempty"`
}


const (
 // AdminNetworkPolicyRuleActionAllow indicates that matching traffic will be
 // allowed regardless of NetworkPolicy and BaselineAdminNetworkPolicy
 // rules. Users cannot block traffic which has been matched by an "Allow"
 // rule in an AdminNetworkPolicy.
 AdminNetworkPolicyRuleActionAllow AdminNetworkPolicyRuleAction = "Allow"
 // AdminNetworkPolicyRuleActionDeny indicates that matching traffic will be
 // denied before being checked against NetworkPolicy or
 // BaselineAdminNetworkPolicy rules. Pods will never receive traffic which
 // has been matched by a "Deny" rule in an AdminNetworkPolicy.
 AdminNetworkPolicyRuleActionDeny AdminNetworkPolicyRuleAction = "Deny"
 // AdminNetworkPolicyRuleActionPass indicates that matching traffic will
 // bypass further AdminNetworkPolicy processing (ignoring rules with lower
 // precedence) and be allowed or denied based on NetworkPolicy and
 // BaselineAdminNetworkPolicy rules.
 AdminNetworkPolicyRuleActionPass AdminNetworkPolicyRuleAction = "Pass"
)

The following new BaselineAdminNetworkPolicy API will also be added to the policy.networking.k8s.io API group:


type BaselineAdminNetworkPolicy struct {
 metav1.TypeMeta `json:",inline"`
 metav1.ObjectMeta `json:"metadata"`

 // Specification of the desired behavior of BaselineAdminNetworkPolicy.
 Spec BaselineAdminNetworkPolicySpec `json:"spec"`

 // Status is the status to be reported by the implementation.
 // +optional
 Status BaselineAdminNetworkPolicyStatus `json:"status,omitempty"`
}

// BaselineAdminNetworkPolicyStatus defines the observed state of
// BaselineAdminNetworkPolicy.
type BaselineAdminNetworkPolicyStatus struct {
 Conditions []metav1.Condition `json:"conditions"`
}

// BaselineAdminNetworkPolicySpec defines the desired state of
// BaselineAdminNetworkPolicy.
type BaselineAdminNetworkPolicySpec struct {
 // Subject defines the pods to which this BaselineAdminNetworkPolicy applies.
 Subject AdminNetworkPolicySubject `json:"subject"`

 // Ingress is the list of Ingress rules to be applied to the selected pods
 // if they are not matched by any AdminNetworkPolicy or NetworkPolicy rules.
 // A total of 100 Ingress rules will be allowed in each BANP instance.
 // BANPs with no ingress rules do not affect ingress traffic.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Ingress []BaselineAdminNetworkPolicyIngressRule `json:"ingress,omitempty"`

 // Egress is the list of Egress rules to be applied to the selected pods if
 // they are not matched by any AdminNetworkPolicy or NetworkPolicy rules.
 // A total of 100 Egress rules will be allowed in each BANP instance. BANPs
 // with no egress rules do not affect egress traffic.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Egress []BaselineAdminNetworkPolicyEgressRule `json:"egress,omitempty"`
}

// BaselineAdminNetworkPolicyIngressRule describes an action to take on a particular
// set of traffic destined for pods selected by a BaselineAdminNetworkPolicy's
// Subject field.
type BaselineAdminNetworkPolicyIngressRule struct {
 // Name is an identifier for this rule, that may be no more than 100 characters
 // in length. This field should be used by the implementation to help
 // improve observability, readability and error-reporting for any applied
 // BaselineAdminNetworkPolicies.
 // +optional
 // +kubebuilder:validation:MaxLength=100
 Name string `json:"name,omitempty"`

 // Action specifies the effect this rule will have on matching traffic.
 // Currently the following actions are supported:
 // Allow: allows the selected traffic
 // Deny: denies the selected traffic
 Action BaselineAdminNetworkPolicyRuleAction `json:"action"`

 // From is the list of sources whose traffic this rule applies to.
 // If any AdminNetworkPolicyPeer matches the source of incoming
 // traffic then the specified action is applied.
 // This field must be defined and contain at least one item.
 // +kubebuilder:validation:MinItems=1
 From []AdminNetworkPolicyPeer `json:"from"`

 // Ports allows for matching traffic based on port and protocols.
 // If Ports is not set then the rule does not filter traffic via port.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Ports *[]AdminNetworkPolicyPort `json:"ports,omitempty"`
}

// BaselineAdminNetworkPolicyEgressRule describes an action to take on a particular
// set of traffic originating from pods selected by a BaselineAdminNetworkPolicy's
// Subject field.
type BaselineAdminNetworkPolicyEgressRule struct {
 // Name is an identifier for this rule, that may be no more than 100 characters
 // in length. This field should be used by the implementation to help
 // improve observability, readability and error-reporting for any applied
 // BaselineAdminNetworkPolicies.
 // +optional
 // +kubebuilder:validation:MaxLength=100
 Name string `json:"name,omitempty"`

 // Action specifies the effect this rule will have on matching traffic.
 // Currently the following actions are supported:
 // Allow: allows the selected traffic
 // Deny: denies the selected traffic
 Action BaselineAdminNetworkPolicyRuleAction `json:"action"`

 // To is the list of destinations whose traffic this rule applies to.
 // If any AdminNetworkPolicyPeer matches the destination of outgoing
 // traffic then the specified action is applied.
 // This field must be defined and contain at least one item.
 // +kubebuilder:validation:MinItems=1
 To []AdminNetworkPolicyPeer `json:"to"`

 // Ports allows for matching traffic based on port and protocols.
 // If Ports is not set then the rule does not filter traffic via port.
 // +optional
 // +kubebuilder:validation:MaxItems=100
 Ports *[]AdminNetworkPolicyPort `json:"ports,omitempty"`
}

// BaselineAdminNetworkPolicyRuleAction string describes the BaselineAdminNetworkPolicy
// action type.
// +enum
type BaselineAdminNetworkPolicyRuleAction string

const (
 // BaselineAdminNetworkPolicyRuleActionDeny enables admins to deny traffic.
 BaselineAdminNetworkPolicyRuleActionDeny BaselineAdminNetworkPolicyRuleAction = "Deny"
 // BaselineAdminNetworkPolicyRuleActionAllow enables admins to allow certain traffic.
 BaselineAdminNetworkPolicyRuleActionAllow BaselineAdminNetworkPolicyRuleAction = "Allow"
)

The following types are common to the AdminNetworkPolicy and BaselineAdminNetworkPolicy resources:


// AdminNetworkPolicySubject defines what resources the policy applies to.
// Exactly one field must be set.
// +kubebuilder:validation:MaxProperties=1
// +kubebuilder:validation:MinProperties=1
type AdminNetworkPolicySubject struct {
 // Namespaces is used to select pods via namespace selectors.
 // +optional
 Namespaces *metav1.LabelSelector `json:"namespaces,omitempty"`
 // Pods is used to select pods via namespace AND pod selectors.
 // +optional
 Pods *NamespacedPodSubject `json:"pods,omitempty"`
}

// NamespacedPodSubject allows the user to select a given set of pod(s) in
// selected namespace(s)
type NamespacedPodSubject struct {
 // This field follows standard label selector semantics; if empty,
 // it selects all Namespaces. 
 NamespaceSelector metav1.LabelSelector `json:"namespaceSelector"`

 // Used to explicitly select pods within a namespace; if empty,
 // it selects all Pods. 
 PodSelector metav1.LabelSelector `json:"podSelector"`
}

// AdminNetworkPolicyPort describes how to select network ports on pod(s).
// Exactly one field must be set.
// +kubebuilder:validation:MaxProperties=1
// +kubebuilder:validation:MinProperties=1
type AdminNetworkPolicyPort struct {
 // PortNumber selects a port on a pod(s) based on number.
 // +optional
 PortNumber *Port `json:"portNumber,omitempty"`

 // NamedPort selects a port on a pod(s) based on name.
 // +optional
 NamedPort *string `json:"namedPort,omitempty"`

 // PortRange selects a port range on a pod(s) based on provided start and end
 // values.
 // +optional
 PortRange *PortRange `json:"portRange,omitempty"`
}

type Port struct {
 // Protocol is the network protocol (TCP, UDP, or SCTP) which traffic must
 // match. If not specified, this field defaults to TCP.
 Protocol v1.Protocol `json:"protocol"`

 // Port defines a network port value.
 Port int32 `json:"port"`
}

// PortRange defines an inclusive range of ports from the assigned Start value
// to End value.
type PortRange struct {
 // Protocol is the network protocol (TCP, UDP, or SCTP) which traffic must
 // match. If not specified, this field defaults to TCP.
 Protocol v1.Protocol `json:"protocol,omitempty"`

 // Start defines a network port that is the start of a port range, the Start
 // value must be less than End.
 Start int32 `json:"start"`

 // End defines a network port that is the end of a port range, the End value
 // must be greater than Start.
 End int32 `json:"end"`
}

// AdminNetworkPolicyPeer defines an in-cluster peer to allow traffic to/from.
// Exactly one of the selector pointers should be set for a given peer.
type AdminNetworkPolicyPeer struct {
 // Namespaces defines a way to select a set of Namespaces.
 // +optional
 Namespaces *NamespacedPeer `json:"namespaces,omitempty"`
 // Pods defines a way to select a set of pods in
 // a set of namespaces.
 // +optional
 Pods *NamespacedPodPeer `json:"pods,omitempty"`
}

type NamespaceRelation string

const (
 NamespaceSelf NamespaceRelation = "Self"
 NamespaceNotSelf NamespaceRelation = "NotSelf"
)

// NamespacedPeer defines a flexible way to select Namespaces in a cluster.
// Exactly one of the selectors must be set. If a consumer observes none of
// its fields are set, they must assume an unknown option has been specified
// and fail closed.
// +kubebuilder:validation:MaxProperties=1
// +kubebuilder:validation:MinProperties=1
type NamespacedPeer struct {
 // Related provides a mechanism for selecting namespaces relative to the
 // subject pod. A value of "Self" matches the subject pod's namespace,
 // while a value of "NotSelf" matches namespaces other than the subject
 // pod's namespace.
 // +optional
 Related *NamespaceRelation `json:"related,omitempty"`

 // NamespaceSelector is a labelSelector used to select Namespaces. This field
 // follows standard label selector semantics; if present but empty, it selects
 // all Namespaces.
 // +optional
 NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"`

 // SameLabels is used to select a set of Namespaces that share the same values
 // for a set of labels.
 // To be selected a Namespace must have all of the labels defined in SameLabels,
 // and they must all have the same value as the subject of this policy.
 // If SameLabels is empty then nothing is selected.
 // +optional
 SameLabels []string `json:"sameLabels,omitempty"`

 // NotSameLabels is used to select a set of Namespaces that do not have a set
 // of label(s). To be selected a Namespace must have none of the labels defined
 // in NotSameLabels. If NotSameLabels is empty then nothing is selected.
 // +optional
 NotSameLabels []string `json:"notSameLabels,omitempty"`
}

// NamespacedPodPeer defines a flexible way to select Namespaces and pods in a
// cluster. The `Namespaces` and `PodSelector` fields are required.
type NamespacedPodPeer struct {
 // Namespaces is used to select a set of Namespaces.
 Namespaces NamespacedPeer `json:"namespaces"`

 // PodSelector is a labelSelector used to select Pods. This field is NOT optional
 // and follows standard label selector semantics; if present but empty, it selects
 // all Pods.
 PodSelector metav1.LabelSelector `json:"podSelector"`
}
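
The "exactly one field must be set" rule and the PortRange constraint above can be checked with a small helper. This is a sketch under stated assumptions rather than the validation logic of this KEP: the types are simplified local copies (Protocol is omitted), and `validatePort` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified local copies of the types above (Protocol omitted for brevity).
type Port struct{ Port int32 }

type PortRange struct{ Start, End int32 }

type AdminNetworkPolicyPort struct {
	PortNumber *Port
	NamedPort  *string
	PortRange  *PortRange
}

// validatePort is a hypothetical helper: it enforces "exactly one field must
// be set" and, for ranges, that Start is less than End.
func validatePort(p AdminNetworkPolicyPort) error {
	set := 0
	if p.PortNumber != nil {
		set++
	}
	if p.NamedPort != nil {
		set++
	}
	if p.PortRange != nil {
		set++
	}
	if set != 1 {
		return fmt.Errorf("exactly one of portNumber, namedPort, portRange must be set, got %d", set)
	}
	if r := p.PortRange; r != nil && r.Start >= r.End {
		return errors.New("portRange.start must be less than portRange.end")
	}
	return nil
}

func main() {
	// A valid range passes; an empty port selector is rejected.
	fmt.Println(validatePort(AdminNetworkPolicyPort{PortRange: &PortRange{Start: 8000, End: 8080}}))
	fmt.Println(validatePort(AdminNetworkPolicyPort{}))
}
```

The same counting pattern would apply to AdminNetworkPolicySubject and NamespacedPeer, which carry the same MinProperties=1/MaxProperties=1 markers.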

General Notes on the AdminNetworkPolicy API

Much of the proposed behavior is intentionally not aligned with the K8s NetworkPolicy resource, especially with regard to the behavior of empty fields. Specifically, this API is designed to be verbose and explicit. Please pay attention to the comments above each field for more information.
For the AdminNetworkPolicy ingress/egress rule, the Action field dictates whether traffic should be allowed/denied/passed from/to the AdminNetworkPolicyPeer. This will be a required field.
The AdminNetworkPolicySubject and AdminNetworkPolicyPeer types are explicitly designed to allow for future extensibility, with a focus on the addition of new types of selectors. Specifically, the design allows for failing closed in the event that an implementation does not implement a defined selector. For example, if a new type (ServiceAccounts) were added to the AdminNetworkPolicyPeer struct, and an implementation had not yet implemented support for such a selector, an ANP using the new selector would have no effect, since the implementation would simply see an empty AdminNetworkPolicyPeer object.
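
The fail-closed behavior described above can be sketched as follows. This is an illustration only: the `peer` type is a simplified, hypothetical stand-in for AdminNetworkPolicyPeer (its fields hold plain names instead of selectors), and `matches` is not an API defined by this KEP.

```go
package main

import "fmt"

// peer is a simplified stand-in for AdminNetworkPolicyPeer as decoded by an
// older implementation. Selector kinds the implementation does not know about
// are dropped during decoding, so they never appear here.
type peer struct {
	Namespaces *string // hypothetical: name of a selected namespace
	Pods       *string // hypothetical: name of a selected pod
}

// matches reports whether the peer selects the given namespace/pod.
// A peer with no known selector set selects nothing (fail closed).
func (p peer) matches(namespace, pod string) bool {
	switch {
	case p.Namespaces != nil:
		return *p.Namespaces == namespace
	case p.Pods != nil:
		return *p.Pods == pod
	default:
		return false
	}
}

func main() {
	ns := "monitoring-ns"
	fmt.Println(peer{Namespaces: &ns}.matches("monitoring-ns", "prom-0"))
	// An empty peer, e.g. one that only used an unknown selector, matches nothing.
	fmt.Println(peer{}.matches("monitoring-ns", "prom-0"))
}
```

Because the empty peer matches no traffic, a rule that relies on an unsupported selector has no effect instead of unintentionally matching everything.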

Further examples utilizing the self field for `NamespacedPeer` objects

Self: This is a special strategy to indicate that the rule applies only to the Namespace for which the ingress/egress rule is currently being evaluated. Since the Pods selected by the AdminNetworkPolicy subject could be from multiple Namespaces, the scope of ingress/egress rules whose namespaces.related=Self will be the Pod’s own Namespace for each selected Pod. Consider the following example:

Pods [a1, b1], with labels app=a and app=b respectively, exist in Namespace x.
Pods [a2, b2], with labels app=a and app=b respectively, exist in Namespace y.

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
spec:
  priority: 10
  subject:
    namespaces: {}
  ingress:
  - action: Allow
    from:
    - pods:
        namespaces:
          related: Self
        podSelector:
          matchLabels:
            app: b

The above AdminNetworkPolicy should be interpreted as: for each Namespace in the cluster, all Pods in that Namespace should strictly allow traffic, on all ports, from Pods in the same Namespace that have the label app=b. Hence, the policy above allows x/b1 -> x/a1 and y/b2 -> y/a2, but does not allow y/b2 -> x/a1 or x/b1 -> y/a2.
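
The four cases in this example can be checked with a short sketch. The `pod` type and `allowedBySelfRule` function are hypothetical and hard-code the sample policy (source label app=b, same Namespace as the destination); they are not part of the proposed API.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Pod: namespace, name, and its app label.
type pod struct {
	Namespace, Name, App string
}

// allowedBySelfRule reports whether the sample rule matches src -> dst:
// the source must carry app=b and live in the destination Pod's own Namespace.
func allowedBySelfRule(src, dst pod) bool {
	return src.App == "b" && src.Namespace == dst.Namespace
}

func main() {
	a1 := pod{"x", "a1", "a"}
	b1 := pod{"x", "b1", "b"}
	a2 := pod{"y", "a2", "a"}
	b2 := pod{"y", "b2", "b"}

	fmt.Println(allowedBySelfRule(b1, a1)) // true:  x/b1 -> x/a1
	fmt.Println(allowedBySelfRule(b2, a2)) // true:  y/b2 -> y/a2
	fmt.Println(allowedBySelfRule(b2, a1)) // false: y/b2 -> x/a1
	fmt.Println(allowedBySelfRule(b1, a2)) // false: x/b1 -> y/a2
}
```

A `false` result here only means that this Allow rule does not match; such traffic remains subject to other ANP, NetworkPolicy, and BANP rules.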

SameLabels: This is a special strategy to indicate that the rule applies only to the Namespaces which share the same label value. Since the Pods selected by the AdminNetworkPolicy subject could be from multiple Namespaces, the scope of ingress/egress rules whose namespaces.sameLabels=tenant will be all the Pods from the Namespaces that have the same label value for the “tenant” key. Consider the following example:

Pods [a1, b1] exist in Namespace t1-ns1, which has label tenant=t1.
Pods [a2, b2] exist in Namespace t1-ns2, which has label tenant=t1.
Pods [a3, b3] exist in Namespace t2-ns1, which has label tenant=t2.
Pods [a4, b4] exist in Namespace t2-ns2, which has label tenant=t2.

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
spec:
  priority: 20
  subject:
    namespaces:
      matchExpressions:
      - {key: tenant, operator: Exists}
  ingress:
  - action: Pass
    from:
    - namespaces:
        sameLabels:
        - tenant

The above AdminNetworkPolicy should be interpreted as: for each Namespace in the cluster that has a label with the key “tenant”, traffic to all Pods in that Namespace from all Pods in the Namespaces that have the same value for the tenant label is delegated to the Namespace admins, i.e. such traffic will not be subject to any ANP (priority > 50) rules and will be evaluated by K8s NetworkPolicies. Hence, the policy above delegates traffic between all Pods in Namespaces labeled tenant=t1 (i.e. t1-ns1 and t1-ns2) to K8s NetworkPolicies. Similarly, traffic between all Pods in Namespaces labeled tenant=t2 (i.e. t2-ns1 and t2-ns2) is delegated to K8s NetworkPolicies as well. However, the policy does not delegate traffic from any Pod in t1-ns1 or t1-ns2 to Pods in t2-ns1 or t2-ns2; such traffic is still subject to ANP rules.
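
The SameLabels matching used in this example can be sketched as a comparison of Namespace label maps. The `sameLabels` function below is a hypothetical helper, and it assumes that a subject Namespace lacking one of the listed keys selects no peers.

```go
package main

import "fmt"

// sameLabels reports whether a peer Namespace is selected by SameLabels for a
// given subject Namespace: the peer must carry every listed label key with the
// same value as the subject. An empty key list selects nothing.
// Assumption: a key missing on the subject also selects nothing.
func sameLabels(subject, peer map[string]string, keys []string) bool {
	if len(keys) == 0 {
		return false
	}
	for _, k := range keys {
		sv, sok := subject[k]
		pv, pok := peer[k]
		if !sok || !pok || sv != pv {
			return false
		}
	}
	return true
}

func main() {
	t1ns1 := map[string]string{"tenant": "t1"}
	t1ns2 := map[string]string{"tenant": "t1"}
	t2ns1 := map[string]string{"tenant": "t2"}
	keys := []string{"tenant"}

	fmt.Println(sameLabels(t1ns1, t1ns2, keys)) // true:  both tenant=t1
	fmt.Println(sameLabels(t1ns1, t2ns1, keys)) // false: t1 vs t2
}
```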

Sample Specs for User Stories

Sample spec for Story 1: Deny traffic at a cluster level

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: cluster-wide-deny-example
spec:
  priority: 10
  subject:
    namespaces:
      matchLabels:
        kubernetes.io/metadata.name: sensitive-ns
  ingress:
  - action: Deny
    from:
    - namespaces:
        namespaceSelector: {}

Sample spec for Story 2: Allow traffic at a cluster level

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: cluster-wide-allow-example
spec:
  priority: 30
  subject:
    namespaces: {}
  ingress:
  - action: Allow
    from:
    - namespaces:
        namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: monitoring-ns
  egress:
  - action: Allow
    to:
    - pods:
        namespaces:
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
        podSelector:
          matchLabels:
            app: kube-dns

Sample spec for Story 3: Explicitly Delegate traffic to existing K8s Network Policy

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: pub-svc-delegate-example
spec:
  priority: 20
  subject:
    namespaces: {}
  egress:
  - action: Pass
    to:
    - pods:
        namespaces:
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: bar-ns-1
        podSelector:
          matchLabels:
            app: svc-pub
    ports:
    - portNumber:
        protocol: TCP
        port: 8080

Sample spec for Story 4: Create and Isolate multiple tenants in a cluster

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: tenant-creation-example
spec:
  priority: 50
  subject:
    namespaces:
      matchExpressions:
      - {key: tenant, operator: Exists}
  ingress:
  - action: Deny
    from:
    - namespaces:
        notSameLabels:
        - tenant

Note: the above AdminNetworkPolicy can also be written in the following fashion:

apiVersion: policy.networking.k8s.io/v1alpha1
kind: AdminNetworkPolicy
metadata:
  name: tenant-creation-example
spec:
  priority: 50
  subject:
    namespaces:
      matchExpressions:
      - {key: tenant, operator: Exists}
  ingress:
  - action: Pass
    from:
    - namespaces:
        sameLabels:
        - tenant
  - action: Deny  # Deny everything else other than same tenant traffic
    from:
    - namespaces:
        namespaceSelector: {}

The difference is that in the first case, traffic within tenant Namespaces will fall through and be evaluated against lower-priority AdminNetworkPolicies, and then NetworkPolicies. In the second case, the matching packet will skip all AdminNetworkPolicy evaluation (except for AdminNetworkPolicy priority=0) and only match against NetworkPolicy rules in the cluster. In other words, the second AdminNetworkPolicy specifies that intra-tenant traffic must be delegated to the tenant Namespace owners.

Sample spec for Story 5: Cluster Wide Default Guardrails

apiVersion: policy.networking.k8s.io/v1alpha1
kind: BaselineAdminNetworkPolicy
metadata:
  name: baseline-rule-example
spec:
  subject:
    namespaces: {}
  ingress:
  - action: Deny  # zero-trust cluster default security posture
    from:
    - namespaces:
        namespaceSelector: {}
  egress:
  - action: Deny
    to:
    - namespaces:
        namespaceSelector: {}

Test Plan

Add e2e tests for AdminNetworkPolicy resource
- Ensure Pass rules are delegated and are not subject to ANP rules.
- Ensure Deny rules drop traffic.
- Ensure Allow rules allow traffic.
- Ensure that in stacked AdminNetworkPolicies/K8s NetworkPolicies, precedence is maintained as per the priority set in ANP.
e2e test cases must cover ingress and egress rules.
e2e test cases must cover port-ranges, named ports, integer ports etc.
e2e test cases must cover various combinations of namespaceSet*s in each ingress/egress rule.
Ensure that namespace matching strategies work as expected.
Add unit tests to test the validation logic which shall be introduced for cluster-scoped policy resources.
- Ensure exactly one selector has to be set in a Subject section.
- Ensure exactly one selector has to be set in an AdminNetworkPolicyPeer section.
- Test cases for fields which are shared with NetworkPolicy, like endPort etc.
Ensure that only administrators or assigned roles can create/update/delete cluster-scoped policy resources.
Ensure smooth integration with existing Kubernetes NetworkPolicy.
- Ensure all positive priority (non-zero) ANP rules are evaluated before any NetworkPolicy rules.
- Ensure ANP rules with priority=“0” are evaluated after any NetworkPolicy rules.

Graduation Criteria

Alpha to Beta Graduation

Gather feedback from developers and surveys
At least 2 implementors must provide a functional and scalable implementation for the complete set of alpha features.
- Specifically, ensure that only selecting E/W cluster traffic is plausible at scale.
Evaluate the need for multiple Subjects per ANP.
Evaluate “future work” items based on feedback from community and challenges faced by implementors.
Ensure extensibility when adding new fields, i.e. adding new fields does not “fail-open” traffic for older clients.
Revisit the topic of whether this API should cover north-south traffic.

Beta to GA Graduation

At least 4 implementors must provide a scalable implementation for the complete set of beta features
More rigorous forms of testing — e.g., downgrade tests and scalability tests
Allowing time for feedback
Completion of all accepted “future work” items

Upgrade / Downgrade Strategy

Upgrade considerations

As such, the cluster-scoped policy resources are new and shall not exist prior to upgrading to a new version. Thus, there is no direct impact on upgrades.

Downgrade considerations

Downgrading to a version which no longer supports cluster-scoped policy APIs must ensure that appropriate security rules are created to mimic the cluster-scoped policy rules by other means, such that no unintended traffic is allowed, and all intended traffic is allowed.

Version Skew Strategy

n/a

Production Readiness Review Questionnaire

Feature Enablement and Rollback

N/A for alpha release.

NOTE: for alpha this resource will be implemented as a CRD, following the precedent set by the Gateway API.

How can this feature be enabled / disabled in a live cluster?

N/A for alpha release.

NOTE: for alpha this resource will be implemented as a CRD, following the precedent set by the Gateway API.

Does enabling the feature change any default behavior?

Creating an AdminNetworkPolicy does have an effect on the cluster. However, such policies must be specifically created, which means the administrator is aware of the impact.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

For alpha there will be no feature gate so this is N/A.

What happens if we reenable the feature if it was previously rolled back?

For alpha there will be no feature gate so this is N/A.

Are there any tests for feature enablement/disablement?

Not in-tree. Generally, the implementations should have unit tests covering this scenario.

Rollout, Upgrade and Rollback Planning

N/A for alpha.

How can a rollout or rollback fail? Can it impact already running workloads?

N/A for alpha.

What specific metrics should inform a rollback?

The AdminNetworkPolicy API has a Status field which should be used by the implementation to report whether or not the rules were correctly programmed.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This will be tested once implementations of the API have been completed.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Since the controller for this feature will not be implemented in-tree, it will be the responsibility of the implementations to report metrics as they see fit.

How can someone using this feature know that it is working for their instance?

API .status
- Condition name: The Condition name will not be standardized in alpha. However, implementations are given the status field to report what they see fit.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Specific SLOs will be determined by the implementations.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Other (treat as last resort)
- Details: N/A since the indicators will vary based on the implementation.

Are there any missing metrics that would be useful to have to improve observability of this feature?

A metric describing the time it takes for the implementation to program the rules defined in an AdminNetworkPolicy could be useful. However, some implementations may struggle to report such a metric.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

API Type: AdminNetworkPolicy
Supported number of objects per cluster: The total number of AdminNetworkPolicies will not be limited. However, it is important to remember that the only users creating ANPs will be Cluster-Admins, of which there should only be a handful. This will help limit the total number of ANPs being deployed at any given time.

Will enabling / using this feature result in any new calls to the cloud provider?

This depends on the implementation, specifically based on the API used to program the AdminNetworkPolicy rules into the data-plane.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Not in any in-tree components. Resource efficiency will need to be monitored by the implementation.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A for alpha release.

What are other known failure modes?

N/A for alpha release.

What steps should be taken if SLOs are not being met to determine the problem?

N/A for alpha release.

Implementation History

2021-02-18 - Created initial PR for the KEP

Drawbacks

Securing cluster traffic for administrator use cases can get complex. This leads to the introduction of a more complex set of APIs, which could confuse users.

Alternatives

The following alternative approaches were considered as this KEP was iterated upon:

NetworkPolicy v2

A new version of NetworkPolicy, v2, was evaluated to address the features and use cases documented in this KEP. Since the NetworkPolicy resource already exists, it would be a low barrier to entry and could be extended to incorporate admin use cases. However, this idea was rejected because the NetworkPolicy resource was introduced solely to satisfy a developer’s intent. Thus, adding new use cases for a cluster admin would be contradictory. In addition, the administrator use cases are mainly scoped to the cluster, as opposed to the NetworkPolicy resource, which is namespaced.

Empower, Deny, Allow action based CRD

Alternatively, AdminNetworkPolicy can have Empower (as opposed to Pass), Deny or Allow as the action of each rule.

In terms of precedence, the aggregated Empower rules (all AdminNetworkPolicy rules with action Empower in the cluster combined) should be evaluated before aggregated AdminNetworkPolicy Deny rules, followed by aggregated AdminNetworkPolicy Allow rules, followed by NetworkPolicy rules in all Namespaces. As such, the Empower rules have the highest precedence, and shall only be used to provide exceptions to deny rules. The Empower rules do not guarantee that the traffic will not be dropped: they simply denote that the packets matching those rules can bypass the AdminNetworkPolicy Deny rule evaluation. This idea lost out to the Pass action during sig-networkpolicy meetings, as most members found the Empower keyword confusing, and using an ‘action’ to provide exceptions to certain rules feels counter-intuitive.

ClusterDefaultNetworkPolicy resource

Instead of using the Baseline action to set cluster default rules, the authors of this KEP also considered using an entirely separate resource named ClusterDefaultNetworkPolicy. A ClusterDefaultNetworkPolicy resource would help administrators set baseline security rules for the cluster, i.e. a developer CAN override these rules by creating NetworkPolicies that apply to the same workloads as the ClusterDefaultNetworkPolicy does.

ClusterDefaultNetworkPolicy works just like NetworkPolicy except that it is cluster-scoped. When workloads are selected by a ClusterDefaultNetworkPolicy, they are isolated except for the ingress/egress rules specified. ClusterDefaultNetworkPolicy rules will not have actions associated – each rule will be an ‘allow’ rule.

Aggregated NetworkPolicy rules will be evaluated before aggregated ClusterDefaultNetworkPolicy rules. If a Pod is selected by both, a ClusterDefaultNetworkPolicy and a NetworkPolicy, then the ClusterDefaultNetworkPolicy’s effect on that Pod becomes obsolete. In this case, the traffic allowed will be solely determined by the NetworkPolicy.

This idea was eventually abandoned due to several reasons:

Two separate resources make it harder to reason about effect of aggregated rules.
It is confusing that one cluster level resource has implicit isolation and the other does not.

Single CRD with DefaultRules field

This alternate proposal was a hybrid approach, wherein the AdminNetworkPolicy resource (introduced in the proposal) would include additional fields called defaultIngress and defaultEgress. These defaultIngress/defaultEgress fields would be similar in structure to the ingress/egress fields, except that the default rules would not have an action field. All default rules would be “allow” rules only, similar to K8s NetworkPolicy. The presence of at least one defaultIngress rule would isolate the appliedTo workloads from accepting any traffic other than that specified by the policy. Similarly, the presence of at least one defaultEgress rule would isolate the appliedTo workloads from accessing any workloads other than those specified by the policy. In addition, the rules specified by the defaultIngress and defaultEgress fields would be enforced after the K8s NetworkPolicy rules, so such default rules could be overridden by a developer-written K8s NetworkPolicy.

Single CRD with IsOverridable field

Another alternative for separating non-overridable guardrail rules and overridable baseline rules is to introduce an IsOverridable field in ANP ingress/egress rules:

type AdminNetworkPolicyIngress/EgressRule struct {
 Action RuleAction
 IsOverridable bool
 Ports []networkingv1.NetworkPolicyPort
 From/To []networkingv1.AdminNetworkPolicyPeer
}

If IsOverridable is set to false, the rules will take higher precedence than the Kubernetes NetworkPolicy rules. Otherwise, the rules will take lower precedence. Note that both overridable and non-overridable cluster network policy rules have explicit allow/deny rules. The precedence order of the rules is as follows:

AdminNetworkPolicy Deny (IsOverridable=false) > AdminNetworkPolicy Allow (IsOverridable=false) > K8s NetworkPolicy > AdminNetworkPolicy Allow (IsOverridable=true) > AdminNetworkPolicy Deny (IsOverridable=true)
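
The precedence chain above can be expressed as a ranking function. This is only a sketch of the rejected alternative: the `rule` type and `rank` function are hypothetical and exist purely to illustrate the ordering.

```go
package main

import (
	"fmt"
	"sort"
)

// rule is a hypothetical flattened rule in the IsOverridable design.
// Kind is "ANP" or "NetworkPolicy"; Action and IsOverridable apply to ANP only.
type rule struct {
	Kind          string
	Action        string
	IsOverridable bool
}

// rank maps a rule to its position in the precedence order (0 is evaluated first).
func rank(r rule) int {
	switch {
	case r.Kind == "ANP" && !r.IsOverridable && r.Action == "Deny":
		return 0
	case r.Kind == "ANP" && !r.IsOverridable && r.Action == "Allow":
		return 1
	case r.Kind == "NetworkPolicy":
		return 2
	case r.Kind == "ANP" && r.IsOverridable && r.Action == "Allow":
		return 3
	default: // overridable ANP Deny (and anything unrecognized)
		return 4
	}
}

func main() {
	rules := []rule{
		{"ANP", "Deny", true},
		{"NetworkPolicy", "", false},
		{"ANP", "Allow", false},
	}
	// Sort into evaluation order: non-overridable Allow, NetworkPolicy, overridable Deny.
	sort.SliceStable(rules, func(i, j int) bool { return rank(rules[i]) < rank(rules[j]) })
	fmt.Println(rules)
}
```

Flipping IsOverridable on a Deny rule moves it from the front of the order to the very end, which illustrates how a single field change reorders the rule relative to everything else.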

As the semantics for overridable Cluster NetworkPolicies are different from K8s NetworkPolicies, cluster administrators who have worked on K8s NetworkPolicies will have a hard time writing similar policies for the cluster. Also, modifying a single field (IsOverridable) of a rule will change the priority in a non-intuitive manner, which may cause some confusion. For these reasons, we decided not to go with this proposal.

Single CRD with BaselineAllow as Action

We evaluated another single CRD approach with an additional RuleAction to cover the use cases of both AdminNetworkPolicy and ClusterDefaultNetworkPolicy.

In this approach, we introduce a BaselineAllow rule action.

type AdminNetworkPolicyIngress/EgressRule struct {
 Action RuleAction
 Ports []networkingv1.NetworkPolicyPort
 From/To []networkingv1.AdminNetworkPolicyPeer
}
const (
 RuleActionDeny RuleAction = "Deny"
 RuleActionAllow RuleAction = "Allow"
 RuleActionBaselineAllow RuleAction = "BaselineAllow"
)

RuleActionDeny and RuleActionAllow are used to specify rules that take higher precedence than Kubernetes NetworkPolicies, whereas RuleActionBaselineAllow is used to specify rules that take lower precedence than Kubernetes NetworkPolicies. The RuleActionBaselineAllow rules have the same semantics as Kubernetes NetworkPolicy rules but are defined at the cluster level.

One of the reasons we did not go with this approach is the ambiguity of the term BaselineAllow. Also, the semantics around RuleActionBaselineAllow are slightly different, as it involves implicit isolation compared to the explicit Allow/Deny rules of the other RuleActions.

Resources: Add webhook hosting to CCM.

KEP-2699: Add webhook hosting capability to CCM framework

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP will detail enhancing the CCM framework to support cloud provider specific webhooks. The intent is to make it easy to either generate a binary or enhance the existing CCM binary to host such webhooks. We also intend to allow for easily linking in “standard” webhooks needed by other SIGs which need to be customized for particular cloud providers.

Motivation

The Cloud Controller Manager (CCM) is the binary into which the Cloud Provider places all the controllers needed to make a Kubernetes cluster work correctly on their Cloud. There are also occasions when it makes sense for a Cloud Provider to want these customizations to be applied synchronously, during API server request handling, rather than asynchronously after a change has already been applied.

Our initial example of this comes from SIG Storage, which would like the functionality from the PersistentVolumeLabel (PVL) admission controller (https://github.com/kubernetes/kubernetes/issues/52617). This needs to be done for cloud provider extraction to complete. Several Cloud Providers have indicated that this should be done in-line, especially as the existing deprecated solution is implemented in an API server admission plugin, which is in-line in the request path.

Goals

Our immediate goal is to allow in-tree Cloud Providers to stop using the existing PVL admission controller and to provide the same functionality using the framework. However, we want to build a framework which will be usable for solutions to similar problems. This KEP is about the framework needed to support the PVL webhook and not the webhook itself. The framework's default listener for the webhook should use existing Kubernetes mechanisms (secure serving, authz, authn) to secure itself and validate the client. It should then be possible to easily change that configuration to any other Kubernetes-supported option for a webhook.

Non-Goals

We do not intend to create a general admission webhook solution. This is just intended to host Cloud Provider specific webhooks as part of the Control Plane.

Proposal

We will start by adding extension hooks which can be registered in cmd/cloud-controller-manager/main.go. This would be similar to the mechanism we already use to register new controllers. The existing sample shows this by registering the nodeipamcontroller, which is not a normally installed controller in the cloud controller manager. In a similar way, we will have a sample of integrating a PVL mutating webhook into the sample CCM. We will also have the system automatically detect whether both controllers and webhooks are registered in the binary. If both are registered, it will automatically add command line flags allowing webhooks and/or controllers to be disabled. There will be two separate flags, the controller flag and the webhook flag. The controller flag will default to the controllers being enabled. The webhook flag will default to the webhooks being disabled. We would also like to provide a builder pattern for registering both the controller and webhook extensions.

Another issue to consider is how the mutating/admission webhook configuration is written into the cluster. This may depend on whether the Cloud Provider intends to run the webhook server on the Control Plane or on the Cluster. We would recommend running the webhook server on the Control Plane. However, for some Cloud Providers that can lead to special issues with the configuration. As such, we will provide a flag which enables the service to automatically register the webhooks as part of startup. That functionality can be disabled, allowing the Cloud Provider to do their own custom registration as part of cluster setup.

There are a few parts to this issue. If the webhook server is run on the control plane, it may be possible to assume it will be collocated with the KAS, allowing the webhook server to be referenced via localhost. It also means that the webhook server could be instantiated from a static manifest. If the webhook server is run in the cluster, then resources such as the Node, Pod, PVs, and AdmissionController, which are needed to start the webhook server, must all be creatable prior to the webhook server coming up. In addition, the template code for the webhook server should be written such that having all the webhook servers crash will not wedge the cluster and prevent the webhook server from being started again.

User Stories (Optional)

The users of this KEP are Cloud Providers and feature developers whose code impacts Cloud Providers. The intent is to make it easy for them both to develop features and to maintain the CCM controllers and webhooks across multiple versions. At the same time, we are attempting to make it easy for the SIGs to write controllers or webhooks which do what they know needs to be done and which can be integrated into Cloud Provider specific processes. We would like to do that in a way which makes merging upgrades relatively painless.

Story 1 - Full control, separation of concerns

Some Cloud Providers would prefer to keep controllers and webhooks in different processes. They have concerns about attempting to run batch controllers in the same process as webhooks which are “in-line” and time sensitive. For these users it is easy to either build two different binaries or have the same binary act as two different binaries based on command line flags.

Story 2 - Fast and simple

For Cloud Providers who would like to keep things simple, it is easy to create a single process which handles both controllers and webhooks. While this KEP does not cover deployment, this is a simple deployment, as it involves fewer processes. It does not stop the Cloud Provider from converting to Story 1 later. This system should make our part of that conversion simple. Obviously the Cloud Provider would have to change their deployment setup.

Story 3 - Immediate Cloud Provider Extraction effort

PVL use case. Cloud Providers want to allow customers to migrate an existing workload to Kubernetes. That workload uses an existing persistent volume. To migrate that workload, the end user needs to be able to link the existing PV into the cluster. However, this requires an association which, for certain kinds of storage, requires calls out to the cloud provider. Ideally the lookup and labeling of the PV for that pre-existing storage happens in-line when the PV is written. That ensures the right volume is attached to the Node/Pod when it is scheduled and that there are no race conditions.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

There could be potential problems running webhooks and controllers in the same process. Irrespective of the failure mode on the webhook configuration, timeouts will always cause a webhook call to fail. As such, we are making it easy to turn the CCM into two processes to mitigate this. It will be up to the Cloud Provider to determine if they want the webhook policy to be FAIL or IGNORE. We will have the sample set the configuration to IGNORE as it is the safe option. Incorrectly setting FAIL can quickly lead to a non-functional cluster. Having a FAIL policy on Pods, for example, can prevent the system from allocating the webhook service, which prevents the webhook from ever passing.

Webhooks are configured by a runtime resource. As a consequence, this configuration can be modified or deleted at runtime. That means that an admin on the cluster can disable or alter the functionality. This potentially makes it harder for a cloud provider to enforce that this logic is being applied. It also means that there needs to be a deployment mechanism for the webhook. It is left to the Cloud Provider to determine if the need for an in-line request is sufficient to override these concerns. The Cloud Provider can alternatively use a standalone controller, which is not in-line, or use an admission controller built into the API server.

The change outlined in this KEP affects the framework which generates the CCM and not the CCM itself. The Cloud Provider may wish to run webhooks separately from the controllers in the CCM, so the framework will support that use case. In this mode, the CCM will just have controllers in it. A “Cloud Webhook Manager” can be run separately and host the webhooks. That is being left as homework for the Cloud Provider. However, the sample CCM which demonstrates how this will be done will have both in the same binary to keep the sample simple.

It is noteworthy that the CCM derives from the KCM. The KCM (and the CCM) predate efforts like controller runtime. The controller runtime is a good reference, as it demonstrates that operators and webhooks can be successfully run inside the same binary. It further demonstrates that this is a pattern which is understood and followed by a significant portion of the Kubernetes community. Having said that, we consider it more important to unify the KCM and CCM code bases than to build on top of controller runtime. We are not saying not to use anything from controller runtime. We are saying that if we need to choose between unifying the KCM & CCM code and building with controller runtime, we will choose unifying the KCM & CCM code base.

Design Details

A sample of how the Builder pattern might look is:

cmOptions, err := options.NewCloudManagerOptions()
if err != nil {
	klog.Fatalf("unable to initialize command options: %v", err)
}
fss := cliflag.NamedFlagSets{}
cloudManagerBuilder := app.NewCloudManagerBuilder("name")
cloudManagerBuilder.setOptions(cmOptions)
cloudManagerBuilder.setFlags(fss)
// Each webhook is registered together with the list of GVKs it handles.
cloudManagerBuilder.registerWebhook(gvkList, handler)
cloudManagerBuilder.registerWebhook(gvkSecondList, secondHandler)
// NOTE: "build" and "start" are illustrative method names; the API is a sketch.
manager, err := cloudManagerBuilder.build(wait.NeverStop)
if err != nil {
	klog.Fatalf("unable to construct cloud manager: %v", err)
}
if err := manager.start(); err != nil {
	klog.Fatalf("unable to start cloud manager: %v", err)
}

This will not alter the existing extension hooks in the controller manager framework, as they are critical for backward compatibility. The builders are meant to be an abstraction layer on top to make the extensions easier to use. So for the existing controller manager code you might see changes like:

cloudControllerManagerBuilder.registerController("nodeipamcontroller", handler)
cloudControllerManagerBuilder.deregisterController("servicecontroller")

The handler in this case is likely to be of type ControllerInitFuncConstructor.
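To make the registration flow concrete, here is a self-contained sketch of such a builder. Every type and method in it (cloudManagerBuilder, controllerInit, webhookHandler, build) is a hypothetical stand-in written for this example; it does not reproduce the k8s.io/cloud-provider API, and webhooks are keyed by a plain name rather than a GVK list to keep it short.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// controllerInit and webhookHandler are hypothetical stand-ins for the
// framework's real constructor and handler types.
type controllerInit func() error
type webhookHandler func(request []byte) (response []byte, err error)

// cloudManagerBuilder collects registrations before anything is started.
type cloudManagerBuilder struct {
	name        string
	controllers map[string]controllerInit
	webhooks    map[string]webhookHandler
}

func newCloudManagerBuilder(name string) *cloudManagerBuilder {
	return &cloudManagerBuilder{
		name:        name,
		controllers: map[string]controllerInit{},
		webhooks:    map[string]webhookHandler{},
	}
}

func (b *cloudManagerBuilder) registerController(name string, init controllerInit) {
	b.controllers[name] = init
}

func (b *cloudManagerBuilder) deregisterController(name string) {
	delete(b.controllers, name)
}

func (b *cloudManagerBuilder) registerWebhook(name string, h webhookHandler) {
	b.webhooks[name] = h
}

// build checks that something was registered and returns the sorted names of
// the controllers and webhooks the resulting binary would host.
func (b *cloudManagerBuilder) build() (controllers, webhooks []string, err error) {
	if len(b.controllers) == 0 && len(b.webhooks) == 0 {
		return nil, nil, errors.New(b.name + ": nothing registered")
	}
	for name := range b.controllers {
		controllers = append(controllers, name)
	}
	for name := range b.webhooks {
		webhooks = append(webhooks, name)
	}
	sort.Strings(controllers)
	sort.Strings(webhooks)
	return controllers, webhooks, nil
}

func main() {
	noop := func() error { return nil }
	b := newCloudManagerBuilder("sample-ccm")
	b.registerController("servicecontroller", noop)
	b.registerController("nodeipamcontroller", noop)
	b.deregisterController("servicecontroller")
	b.registerWebhook("persistentvolumelabel", func(in []byte) ([]byte, error) { return in, nil })
	controllers, webhooks, err := b.build()
	fmt.Println(controllers, webhooks, err)
}
```

Because the builder knows at build time which kinds of extension were registered, it is a natural place to decide whether the enable/disable flags need to be added.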

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/vendor/k8s.io/cloud-provider/options: 2022-10-12 - 34.2
k8s.io/kubernetes/vendor/k8s.io/cloud-provider/config/v1alpha1: 2022-10-12 - 38.5
k8s.io/kubernetes/staging/src/k8s.io/cloud-provider/app/: There is currently no published coverage for this because it is not vendored by Kubernetes itself, and for some reason staging does not seem to be included in the metrics.

Integration tests

Integration test for the builder pattern, exercising the case of building a CCM with a webhook.

e2e tests

None, this feature is consumed by cloud provider repositories for the final binary so it will not be used in e2e tests in K/K.

Graduation Criteria

Alpha

Reference implementation of the PVL mutating webhook served from the sample CCM.
Implementation of the PVL mutating webhook for at least 1 Cloud Provider.

Beta

GA

Note: Generally we also wait at least two releases between beta and GA/stable, because there’s no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests.

Deprecation

Not deprecated

Upgrade / Downgrade Strategy

Upgrade is not believed to be an issue at this point.
Currently we are leaving upgrade as an issue for the Cloud Provider to handle.

Version Skew Strategy

We are currently assuming that this will be deployed as part of the control plane. We assume it will be upgraded with the KAS, KCM and CCM.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

This will be built into the CCM by the Cloud Provider. Code must be written specifically by the Cloud Provider to enable this feature.

There will be a feature gate which will be used to track the stage of the feature. Principally this is to make users aware of the support level of the feature. It will control whether the listener can be started. Please note, however, that this is a library and we expect users to vendor it into their own code base. As such, we cannot control whether they remove the check rather than setting the flag.

Feature gate name: CloudWebhookServer
Components depending on the feature gate: cloud-controller-manager
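A minimal sketch of the gate check guarding the listener follows. It assumes nothing about the real feature gate machinery: a plain map stands in for it, and startListener and errGateDisabled are hypothetical names. Only the gate name CloudWebhookServer comes from this KEP.

```go
package main

import (
	"errors"
	"fmt"
)

// cloudWebhookServer is the feature gate name given in this KEP.
const cloudWebhookServer = "CloudWebhookServer"

var errGateDisabled = errors.New("CloudWebhookServer feature gate is disabled")

// startListener refuses to start the webhook listener unless the gate is on.
// The map is a stand-in for the real feature gate lookup.
func startListener(gates map[string]bool, serve func() error) error {
	if !gates[cloudWebhookServer] {
		return errGateDisabled
	}
	return serve()
}

func main() {
	serve := func() error {
		fmt.Println("webhook listener started")
		return nil
	}
	fmt.Println("gate off:", startListener(map[string]bool{}, serve))
	fmt.Println("gate on:", startListener(map[string]bool{cloudWebhookServer: true}, serve))
}
```

Since the check lives in vendored library code, a Cloud Provider can remove it, which is the limitation noted above.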

Does enabling the feature change any default behavior?

This cannot just be “enabled”.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If you build using our framework, then you will be able to disable using a command line flag. It can also be disabled by changing the admission webhook configuration.

What happens if we reenable the feature if it was previously rolled back?

For new update requests it will work. However it will not change any persisted resources, unless they are rewritten.

Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By examining the admission webhook configuration.

How can someone using this feature know that it is working for their instance?

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

It depends on mutating/validating admission webhooks.

Scalability

The webhooks have an advantage that they can be more easily scaled than controllers.

Will enabling / using this feature result in any new API calls?

It requires a new admission webhook call.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

Depends on the Cloud Provider's implementation.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, in the same way that any additional admission webhook call does. It is worth noting that the Cloud Provider has the option of instead using a controller, at least for the PVL case. However, that is not the preferred mechanism. This is an optional extension mechanism.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

This is an admission webhook server. Such servers already exist, and their troubleshooting mechanisms should apply here as well.

How does this feature react if the API server and/or etcd is unavailable?

This feature does not apply unless the API server is functional.

What are other known failure modes?

Timeouts on webhooks act as failures, so any resource sent to the CCM will fail if it times out.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

The primary alternative is to use controllers to solve all the problems. This has an issue for things which need to be done in-line. If it is not OK for state to be missing from a resource between creation and usage, then controllers are a problem.

Initializers solve the problem between creation and usage, however this solution has been deprecated.

Infrastructure Needed (Optional)

Resources: Adding AppProtocol to Services and Endpoints

Adding AppProtocol to Services and Endpoints

Summary
Motivation
- Goals
Proposal
Alpha -> Beta
Beta -> GA
- Test plan
- Production Readiness Review Questionnaire

Summary

Kubernetes does not have a standardized way of representing application protocols. When a protocol is specified, it must be one of TCP, UDP, or SCTP. With the EndpointSlice beta release in 1.17, a concept of AppProtocol was added that would allow application protocols to be specified for each port. This KEP proposes adding support for that same attribute to Services and Endpoints.

Motivation

The lack of direct support for specifying application protocols for ports has led to widespread use of annotations, providing a poor user experience and general frustration (https://github.com/kubernetes/kubernetes/issues/40244). Unfortunately, annotations are cloud specific and simply can’t provide the ease of use of a built-in attribute like AppProtocol. Since application protocols are specific to each port specified on a Service or Endpoints resource, it makes sense to have a way to specify them at that level.

Goals

Add AppProtocol field to Ports in Services and Endpoints.

Proposal

In both Endpoints and Services, a new AppProtocol field would be added. In both cases, constraints validation would directly mirror what already exists with EndpointSlices.

Services:

// ServicePort represents the port on which the service is exposed
type ServicePort struct {
 ...
 // The application protocol for this port.
 // This field follows standard Kubernetes label syntax.
 // Un-prefixed names are reserved for IANA standard service names (as per
 // RFC-6335 and http://www.iana.org/assignments/service-names).
 // Non-standard protocols should use prefixed names such as
 // mycompany.com/my-custom-protocol.
 // +optional
 AppProtocol *string
}

Endpoints:

// EndpointPort is a tuple that describes a single port.
type EndpointPort struct {
 ...
 // The application protocol for this port.
 // This field follows standard Kubernetes label syntax.
 // Un-prefixed names are reserved for IANA standard service names (as per
 // RFC-6335 and http://www.iana.org/assignments/service-names).
 // Non-standard protocols should use prefixed names such as
 // mycompany.com/my-custom-protocol.
 // +optional
 AppProtocol *string
}
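The naming rule in the field comments (label syntax, with un-prefixed names reserved for IANA service names and other protocols using a domain prefix) can be sketched as a small validator. This is a simplified illustration and not the Kubernetes validation code: validateAppProtocol and both regular expressions are written for this example, and it does not check whether an un-prefixed name is actually registered with IANA.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// nameRE approximates the name part of a label key (at most 63 characters).
	nameRE = regexp.MustCompile(`^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$`)
	// prefixRE approximates a DNS subdomain prefix (at most 253 characters).
	prefixRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// validateAppProtocol accepts nil (the field is optional), an un-prefixed name
// such as "http", or a prefixed name such as "mycompany.com/my-custom-protocol".
func validateAppProtocol(p *string) error {
	if p == nil {
		return nil
	}
	name := *p
	if prefix, rest, found := strings.Cut(*p, "/"); found {
		if len(prefix) > 253 || !prefixRE.MatchString(prefix) {
			return fmt.Errorf("invalid prefix %q", prefix)
		}
		name = rest
	}
	if len(name) > 63 || !nameRE.MatchString(name) {
		return fmt.Errorf("invalid name %q", name)
	}
	return nil
}

func main() {
	for _, s := range []string{"http", "mycompany.com/my-custom-protocol", "not valid"} {
		s := s
		fmt.Printf("%q -> %v\n", s, validateAppProtocol(&s))
	}
}
```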

Risks and Mitigations

It may take some time for cloud providers and other consumers of these APIs to support this attribute. To help with this, we will work to communicate this change well in advance of release so it can be well supported initially.

Proposed Roadmap

Kubernetes 1.18: New field is added but gated behind new alpha ServiceAppProtocol feature gate.
Kubernetes 1.19: ServiceAppProtocol feature gate graduates to beta and is enabled by default.
Kubernetes 1.20: ServiceAppProtocol feature gate graduates to GA.
Kubernetes 1.21: ServiceAppProtocol feature gate is removed.

Graduation Criteria

This adds a new optional attribute to 2 existing stable APIs. This will follow the traditional approach for adding new fields initially guarded by a feature gate.

Alpha -> Beta

ServiceAppProtocol has been supported for at least 1 minor release.
ServiceAppProtocol feature gate is enabled by default.

Beta -> GA

ServiceAppProtocol has been enabled by default for at least 1 minor release.

Test plan

This will replicate the existing validation tests for the AppProtocol field that already exists on EndpointSlice. Additionally, it will add tests that ensure that both the Endpoints and EndpointSlice controllers appropriately set the AppProtocol field on Endpoints and EndpointSlices when it is set on the corresponding Service.
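The controller behaviour these tests cover can be sketched as a copy from the Service port to the derived endpoint port. The types below are trimmed local stand-ins, not the real API structs, and endpointPortFor is a hypothetical helper rather than the controllers' actual function.

```go
package main

import "fmt"

// servicePort and endpointPort are trimmed stand-ins for the API types.
type servicePort struct {
	Name        string
	Port        int32
	AppProtocol *string
}

type endpointPort struct {
	Name        string
	Port        int32
	AppProtocol *string
}

// endpointPortFor derives an endpoint port from a service port and carries the
// AppProtocol across when it is set.
func endpointPortFor(sp servicePort, targetPort int32) endpointPort {
	ep := endpointPort{Name: sp.Name, Port: targetPort}
	if sp.AppProtocol != nil {
		v := *sp.AppProtocol // copy the value so the objects do not share a pointer
		ep.AppProtocol = &v
	}
	return ep
}

func main() {
	proto := "http"
	ep := endpointPortFor(servicePort{Name: "web", Port: 80, AppProtocol: &proto}, 8080)
	fmt.Println(ep.Name, ep.Port, *ep.AppProtocol)
}
```

A test for the real controllers would assert the same two cases: the value is propagated when set on the Service, and the field stays unset otherwise.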

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster? This was previously enabled with the ServiceAppProtocol feature gate. That will be removed in Kubernetes 1.21.
Does enabling the feature change any default behavior? No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Not anymore.
What happens if we reenable the feature if it was previously rolled back? N/A.
Are there any tests for feature enablement/disablement? N/A.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads? If the ServiceAppProtocol gate is manually enabled on Kubernetes components it will no longer be recognized in Kubernetes 1.21. Users should stop using this feature gate.
What specific metrics should inform a rollback? N/A.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? N/A.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? The v1.21 rollout will include the removal of the ServiceAppProtocol feature gate.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads? If this field is set on any Services, it may be used by applications that consume those Services. No core Kubernetes components consume this field.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? N/A.
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? N/A.
Are there any missing metrics that would be useful to have to improve observability of this feature? No.

Dependencies

Does this feature depend on any specific services running in the cluster? No.

Scalability

Will enabling / using this feature result in any new API calls? No.
Will enabling / using this feature result in introducing new API types? No.
Will enabling / using this feature result in any new calls to the cloud provider? No.
Will enabling / using this feature result in increasing size or count of the existing API objects? Describe them, providing:
- API type(s): Service
- Estimated increase in size: 10B
- Estimated amount of new objects: This field could be specified on each port in each Service in a cluster although that is unlikely.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components? No

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

How does this feature react if the API server and/or etcd is unavailable? N/A
What are other known failure modes? N/A
What steps should be taken if SLOs are not being met to determine the problem? N/A

Resources: Admission Webhook Match Conditions

KEP-3716: Admission Webhook Match Conditions

Release Signoff Checklist
Summary
Motivation
Proposal
- API
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Future Work
- Cross-webhook match conditions
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes adding “match conditions” to admission webhooks, as an extension to the existing rules to define the scope of a webhook. A matchCondition is a CEL expression that must evaluate to true for the admission request to be sent to the webhook. If a matchCondition evaluates to false, the webhook is skipped for that request (implicitly allowed).

Motivation

Reliability: Admission webhooks continue to be an operational sore spot for many Kubernetes users. Webhooks that target cluster critical resources put the admission controller backing the webhook in the critical path of cluster stability. Even if tools like namespace scoping are used to avoid circular-dependencies and exclude critical system resources, a webhook outage can still have a major impact on cluster availability. This proposal aims to mitigate (but not eliminate) these issues by allowing webhooks to be more narrowly scoped and targeted.

Performance: Admission webhooks sit in the critical request path for write-requests. Validating webhooks can be run in parallel, but Mutating webhooks must be run in serial (up to 2 times!). This makes webhooks extremely latency sensitive, and even a webhook that doesn’t do any work still needs to pay the network round-trip cost.

Supportability: For hosted or managed Kubernetes distributions, webhooks can be a problem when they interfere with requests by managed components. The existing criteria for filtering out requests are insufficient for many use cases, and aren’t easily appended with provider rules.

What about CEL for Admission Control? ValidatingAdmissionPolicy is an exciting new feature which we hope will greatly reduce the need for admission webhooks, but it is intentionally not attempting to cover every possible use case. This proposal aims to improve the situation for those webhooks that cannot be migrated.

User Stories

Exclude resources from a wildcard rule

I want to enforce metadata policy through an admission webhook without adding latency & risk to high QPS system requests.

Currently, if a webhook uses wildcard match rules, there is no way to filter out a subset of resources or requests from matching the wildcard. If the webhook instead enumerates every resource that should match, it must be kept up-to-date with every CRD that’s added.

With CEL match conditions, the webhook could specify wildcard match rules, and add match conditions to filter out the desired resources:

rules:
  # Match CREATE & UPDATE on all resources:
  - operations:
      - CREATE
      - UPDATE
    apiGroups: ['*']
    apiVersions: ['*']
    resources: ['*']
matchConditions:
  - name: 'exclude-leases'
    expression: '!(request.resource.group == "coordination.k8s.io" && request.resource.resource == "leases")'

Exempt system users from security policy

As a managed cluster provider, I want to prevent user webhooks from intercepting critical system requests.

System resources can currently be exempted through a namespace or label selector, but requests by system components against non-system resources cannot be. For example, update pod status requests by Kubelets cannot be excluded from user webhooks intercepting all pod requests.

With matchConditions, a managed cluster could append system-exclusion rules to each webhook. For example:

matchConditions:
  - name: 'exclude-kubelet-requests'
    expression: '!("system:nodes" in request.userInfo.groups)'

Since the expression will be evaluated using a common Kubernetes CEL library, these expressions should also get automatic access to the secondary authorization check mechanism described in KEP-3488: CEL for Admission Control. In practice, this means that RBAC bindings can be used to opt privileged users out of security policy:

matchConditions:
  # Requests by users without breakglass should be included.
  - name: 'breakglass'
    expression: '!authorizer.group("admissionregistration.k8s.io").resource("validatingwebhookconfigurations").name("security-policy").check("breakglass").allowed()'

Scope an NFS access management webhook to Pods mounting NFS volumes

I want to narrowly scope my webhook to only the relevant requests, in order to reduce load on the webhook and reduce latency in irrelevant requests.

Concrete example:

An NFS deployment uses a third-party access management system. I have an admission webhook that performs an access check against the external system for pods that mount NFS volumes. Only pods with NFS volumes need to be checked.

Currently, there is no way to achieve this. Many webhook implementations today start by checking that the request is within scope, and return early if it’s not. This adds latency and an additional failure point to irrelevant requests. This example requires an external integration, and thus is not a candidate for migration to CEL ValidatingAdmissionPolicy.

With match conditions, the expressions can check whether the request object is in-scope for the webhook:

rules:
  - operations: ['CREATE']
    apiGroups: [''] # core
    apiVersions: ['*']
    resources: ['pods']
matchConditions:
  # Only include pods with an NFS volume.
  - name: 'nfs-volume-present'
    expression: 'object.spec.volumes.exists(v, has(v.nfs))'
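For comparison, the scope check that webhook backends perform today before returning early might look like the following sketch. The pod, volume, and nfsSource types are trimmed stand-ins for the core API types, and review is a hypothetical handler; the point is only that the request has already paid the network round trip by the time this code runs.

```go
package main

import "fmt"

// Trimmed stand-ins for the core API types used by the example.
type nfsSource struct{ Server, Path string }

type volume struct {
	Name string
	NFS  *nfsSource
}

type pod struct{ Volumes []volume }

// hasNFSVolume is the in-webhook equivalent of the nfs-volume-present condition.
func hasNFSVolume(p pod) bool {
	for _, v := range p.Volumes {
		if v.NFS != nil {
			return true
		}
	}
	return false
}

// review returns whether the pod is allowed and whether the external access
// check was consulted. Out-of-scope pods are allowed without consulting it.
func review(p pod, accessCheck func(pod) bool) (allowed, checked bool) {
	if !hasNFSVolume(p) {
		return true, false
	}
	return accessCheck(p), true
}

func main() {
	deny := func(pod) bool { return false }
	plain := pod{Volumes: []volume{{Name: "scratch"}}}
	nfs := pod{Volumes: []volume{{Name: "data", NFS: &nfsSource{Server: "nfs.example", Path: "/export"}}}}
	fmt.Println(review(plain, deny))
	fmt.Println(review(nfs, deny))
}
```

With a match condition, the out-of-scope branch never reaches the webhook at all.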

Goals

Provide a filtering mechanism for excluding requests from an admission webhook
Maintain consistency with ValidatingAdmissionPolicy

Non-Goals

Provide a mechanism to exclude requests from all webhooks.

Proposal

API

Both ValidatingWebhook and MutatingWebhook (in admissionregistration.k8s.io) will be updated with a new MatchConditions field:


type ValidatingWebhook struct {
 // ...

 // MatchConditions is a list of conditions on the AdmissionRequest ('request') that must be met
 // for a request to be sent to this webhook. All conditions in the list must evaluate to TRUE for
 // the request to be matched.
 // +optional
 // +patchMergeKey=name
 // +patchStrategy=merge
 MatchConditions []MatchCondition `json:"matchConditions,omitempty"`
}

type MutatingWebhook struct {
 // ...
 MatchConditions []MatchCondition `json:"matchConditions,omitempty"`
}

// MatchCondition represents a condition which must be fulfilled for a request to be sent to a webhook.
type MatchCondition struct {
 // Name is an identifier for this match condition, used for strategic merging of MatchConditions,
 // as well as providing an identifier for logging purposes. A good name should be descriptive of
 // the associated expression.
 // Name must be a valid RFC 1123 DNS subdomain, and unique in a set of MatchConditions.
 // Required.
 Name string `json:"name"`
 // NOTE: Placeholder documentation, to be replaced by https://github.com/kubernetes/website/issues/39089.
 //
 // Expression represents the expression which will be evaluated by CEL.
 // ref: https://github.com/google/cel-spec
 // CEL expressions have access to the contents of the AdmissionRequest, organized into CEL variables:
 //
 // 'object' - The object from the incoming request. The value is null for DELETE requests.
 // 'oldObject' - The existing object. The value is null for CREATE requests.
 // 'request' - Attributes of the admission request([ref](https://raw.githubusercontent.com/kubernetes/enhancements/master/pkg/apis/admission/types.go#AdmissionRequest)).
 //
 // Required.
 Expression string `json:"expression"`
}

The match condition expression is evaluated by the same libraries as those used for CEL ValidatingAdmissionPolicy. The only difference is that the params variable is not available in match condition expressions. Checks that require information outside the AdmissionRequest must be performed in the webhook itself, and are out of scope for this proposal.
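The "all conditions must evaluate to TRUE" rule can be sketched in Go as follows. This is a minimal illustration, not the kube-apiserver implementation: plain Go predicates stand in for compiled CEL expressions, and the request, condition, and matches names are hypothetical.

```go
package main

import "fmt"

// request is a hypothetical, heavily trimmed stand-in for an AdmissionRequest.
type request struct {
	Operation string
	Volumes   []string // volume source types on the pod, e.g. "nfs"
}

// condition pairs a name with a predicate. In the real feature the predicate
// is a CEL expression; a Go func is used here only for illustration.
type condition struct {
	Name string
	Eval func(request) bool
}

// matches reports whether every condition evaluates to true. When the request
// is excluded, it also returns the name of the first condition that failed,
// which is the kind of identifier a verbose log line could carry.
func matches(conds []condition, req request) (bool, string) {
	for _, c := range conds {
		if !c.Eval(req) {
			return false, c.Name
		}
	}
	return true, ""
}

// hasNFS mirrors the intent of the 'nfs-volume-present' example.
func hasNFS(r request) bool {
	for _, v := range r.Volumes {
		if v == "nfs" {
			return true
		}
	}
	return false
}

func main() {
	conds := []condition{{Name: "nfs-volume-present", Eval: hasNFS}}
	ok, failed := matches(conds, request{Operation: "CREATE", Volumes: []string{"emptyDir"}})
	fmt.Println(ok, failed) // false nfs-volume-present
}
```

An empty condition list matches every request, which corresponds to the unchanged default behavior when matchConditions is unset.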

Risks and Mitigations

Security

Risk: Attacker adds or changes a match condition to weaken an admission policy.

This does not represent a new threat, as doing so would require update access to the admission registration object, and with that permission an attacker could already disable the policy through manipulating match rules, namespace selector, or object selector (or reroute the webhook entirely).

Risk: Logic error in match condition expression.

Currently the match conditions must be encoded in the webhook backend itself. Moving the logic into a CEL expression adds a potential failure point. This can be mitigated by testing, but the CEL ecosystem currently lacks some of the tools that would make this easier.

Of particular significance are match conditions tied to non-functional properties of an object, such as using labels to decide whether to opt an object out of a policy. Without additional admission controls on who can set those non-functional aspects, exempting the policy based on that could be a security vulnerability. In contrast, the NFS example usecase exempts the policy on a functional aspect - whether an NFS volume is mounted, and thus whether the policy is relevant.

These risks are inherent to the feature being proposed and cannot be mitigated through technical means, but should be highlighted in the documentation.

Debuggability

We do not normally log, audit, or emit an event when a webhook is out-of-scope for a request, and the same will mostly be true for match conditions.

At log level V(5), we will emit a log when a request that would otherwise be in-scope for a webhook is excluded for a non-matching match condition.

Short of increasing log verbosity, the recommended debug strategy is to capture or reproduce a relevant AdmissionRequest (for example, in a non-prod cluster disable all match conditions and log the requests from a webhook). Then, manually test the match conditions against the request, and iterate as necessary.

Performance

The CEL expression evaluation will leverage the same Resource Constraints used by CEL CRD Validation & CEL Admission Control. The runtime cost budgets are defined in CEL Runtime Cost.

The per-call limit is shared with Validating Admission Policy CEL expressions and is set at roughly 0.1 seconds for each expression evaluation call. The total budget per object (i.e. per ValidatingWebhook) for CEL match conditions is roughly 0.25 seconds, which is 1/4 of the Validating Admission Policy budget.

Design Details

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Integration tests

Test cases to add:

Feature gate enablement / disablement is a no-op when no matchConditions are set (until graduation to GA as feature gate will go away)
Feature gate enablement / disablement works as expected when matchConditions are set (until graduation to GA as feature gate will go away)
Single match condition:
- Request out of scope without matchConditions
- Request in scope without matchConditions, but not matching
- Request in scope without matchConditions, and also matching
Multiple match conditions, covering the same cases as the single-condition case

e2e tests

We will test the edge cases mostly in integration tests and unit tests.

Once the feature is default enabled in beta, a single E2E test covering the single-match-condition cases outlined above will be added.

Graduation Criteria

Alpha

Feature implemented behind AdmissionWebhookMatchConditions feature flag
Integration tests implemented

Beta

Add E2E test coverage
Resolve resource constraints validation
Smart reload/recompile of Webhook Accessors, see issue
ValidatingAdmissionPolicy is promoted to Beta.

GA

Promote appropriate E2E tests to conformance
- https://github.com/kubernetes/kubernetes/blob/master/test/e2e/apimachinery/webhook.go
- “should be able to create and update validating webhook configurations with match conditions”
- “should be able to create and update mutating webhook configurations with match conditions”
- “should reject validating webhook configurations with invalid match conditions”
- “should reject mutating webhook configurations with invalid match conditions”
- “should mutate everything except ‘skip-me’ configmaps”
Cover any missing test coverage

Upgrade / Downgrade Strategy

Downgrading in a way that disables match conditions after they are already in use can increase the scope of requests evaluated by a webhook. See Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? for more details.

Version Skew Strategy

The new field is only evaluated by the apiserver, so only HA apiserver version skew is relevant. In this case, if the feature is enabled in one apiserver and not another, a request could non-deterministically be sent to a webhook. Enabling match conditions without setting matchConditions on any webhooks is a no-op, so the version skew non-determinism is best avoided by waiting until the feature has been enabled in all apiservers before starting to use the new field.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: AdmissionWebhookMatchConditions
- Components depending on the feature gate: kube-apiserver

Does enabling the feature change any default behavior?

No. If the feature is enabled, but the matchConditions field is unset, the default behavior remains unchanged.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Disabling the feature gate will ignore any matchConditions set, and return to the default behavior. Disabling AdmissionWebhookMatchConditions could increase the traffic to the webhook, and potentially increase the error rate if the webhook fails to process the additional requests correctly.

What happens if we reenable the feature if it was previously rolled back?

Any matchConditions that were already stored on existing webhooks will be enforced.

Note: enabling matchConditions can only reduce the number of requests being sent to a webhook (or remain unchanged). Enabling it will never increase the number of requests.

Are there any tests for feature enablement/disablement?

We will add tests that verify the functionality is turned off when feature gate is toggled off and turned on when toggled on.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

In general, rollout / rollback should not fail since the feature is not enabled by default. However, there are risks on rollback if webhook preconditions were enabled and then unexpectedly disabled on rollback.

What specific metrics should inform a rollback?

webhook_admission_match_condition_evaluation_errors_total is high
webhook_admission_match_condition_exclusions_total is too high or too low
webhook_admission_match_condition_evaluation_seconds is high

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet, but manual testing should be completed and documented prior to beta.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

A new per-webhook metric will measure the number of requests excluded by match conditions:

Metric name: webhook_admission_match_condition_exclusions_total
Labels:

name: webhook name
type: validate or admit
kind: match condition on a webhook or policy
operation: the admission operation

Metric name: webhook_admission_match_condition_evaluation_errors_total
Labels:

name: webhook name
type: validate or admit
kind: match condition on a webhook or policy
operation: the admission operation

Metric name: webhook_admission_match_condition_evaluation_seconds
Labels:

name: webhook name
type: validate or admit
kind: match condition on a webhook or policy
operation: the admission operation

How can an operator determine if the feature is in use by workloads?

The metric webhook_admission_match_condition_evaluation_seconds should indicate if the match conditions are being used and being evaluated for invoking webhooks.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details:
  - Check the matchConditions field in the webhook object and check the webhook_admission_match_condition_exclusions_total metric for exclusions
  - Check webhook_admission_match_condition_evaluation_errors_total

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Only negligible impact to admission latency due to evaluation of CEL rules
CEL evaluation time (webhook_admission_match_condition_evaluation_seconds)
CEL evaluation errors (webhook_admission_match_condition_evaluation_errors_total)

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - webhook_admission_match_condition_evaluation_seconds
  - webhook_admission_match_condition_evaluation_errors_total
- [Optional] Aggregation method:
- Components exposing the metric: kube-apiserver

Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes, the following metrics will be considered for Beta which will improve observability of this feature:

webhook_admission_match_condition_evaluation_seconds
webhook_admission_match_condition_exclusions_total

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

No, this feature only adds new fields to existing webhook APIs

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, it can increase the size of webhook configuration objects. There is a limit in place for the number of match conditions a webhook can have; however, webhook objects can still increase in size significantly if large expressions are used.

API type(s): ValidatingWebhookConfiguration, MutatingWebhookConfiguration
Estimated increase in size: depends on size of CEL expressions, but should be negligible in most cases
Estimated amount of new objects: none

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, it can impact latency SLI/SLO if evaluating CEL expressions adds significant latency.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Yes, it has the potential to increase CPU usage in kube-apiserver if there is a webhook intercepting many requests with many match conditions.

Troubleshooting

See Debuggability.

How does this feature react if the API server and/or etcd is unavailable?

N/A – since the feature is part of kube-apiserver.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

The feature can be disabled per webhook by removing its matchConditions, or for all webhooks by disabling the feature gate in kube-apiserver.

Implementation History

Drawbacks

Future Work

Cross-webhook match conditions

In the future, we should explore ways to apply common match conditions across multiple webhooks.

Example use cases:

Apply a break-glass exemption across many (or all) webhooks.
Managed cluster provider wants to exempt provider-managed resources from user-managed webhooks.

Considerations:

Access by managed cluster provider vs. cluster admin
Side effects & mutations

Alternatives

Exclusion Expressions

The matchCondition expression could be inverted, so that requests that match are excluded rather than included. In this case, we would probably also want to change from requiring all expressions to match, to excluding the request if any match.

Although this approach would simplify some use cases, such as excluding resources from a wildcard rule or exempting system users from a security policy, it means other expressions would become double-negatives, which generally goes against API design best practices.

Resource Exclusions

KEP-3693 proposes an alternative approach using a more structured format for expressing resource exclusions. This approach may be more approachable to users who are not comfortable writing CEL expressions, but it is significantly less powerful. This would address Exclude resources from a wildcard rule, and could be extended with subject exclusions to address Exempt system users from security policy, but would not be sufficient to address Scope an NFS access management webhook to Pods mounting NFS volumes.

These two approaches are not mutually exclusive.

CEL Admission Control

KEP-3488: CEL for Admission Control adds the ability for admission webhooks to be replaced entirely by CEL expressions, but this is not intended to cover 100% of webhook use cases. For example, the user story described in Scope an NFS access management webhook to Pods mounting NFS volumes requires integrating with a third-party system, and is not implementable through a CEL ValidatingAdmissionPolicy.

With a mutating CEL admission policy (not yet implemented), a combination of mutating & validating policies could ensure that objects have a designated scoping label applied, which could be filtered using the ObjectSelector on the webhook. However, such an approach adds a lot of overhead and complexity beyond this proposal.

Resources: Aggregated Discovery

KEP-3352: Aggregated Discovery

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Notes/Constraints/Caveats (Optional)
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The operations that a Kubernetes API server supports are reported through a collection of small documents partitioned by group-version. All clients of Kubernetes APIs must send a request to every group-version in order to “discover” the available APIs. This causes a storm of requests for clusters and is a source of latency and throttling. When new types are added to the API, the types need to be fetched again, which adds another storm of requests. This KEP proposes centralizing the “discovery” mechanism into two aggregated documents so clients do not need to send a storm of requests to the API server to retrieve all the operations available.

Motivation

All clients and users of Kubernetes APIs usually first need to “discover” what the available APIs are and how they can be used. These APIs are described through a mechanism called “Discovery” which is typically queried to then build the requests to correct APIs. Unfortunately, the “Discovery” API is made of lots of small objects that need to be queried individually, causing possibly a lot of delay due to the latency of each individual request (up to 80 requests, with most objects being less than 1,024 bytes). The more numerous the APIs provided by the Kubernetes cluster, the more requests need to be performed.

The most well-known Kubernetes client that uses the discovery mechanism is kubectl, and more specifically the CachedDiscoveryClient in client-go. To mitigate some of this latency, kubectl has implemented a 6-hour timer during which the discovery API is not refreshed. The drawback of this approach is that the freshness of the cache is doubtful and the entire discovery API needs to be refreshed after 6 hours, even if it hasn’t changed. Other clients such as the OpenShift UI have slow loading times due to the browser limit on the number of parallel requests that can be made.

This primarily concerns clients that need a discovery cache and need to frequently poll the apiserver for the latest discovery information. Clients include kubectl, web interfaces, controllers, etc.

Goals

Fix the discovery storm issue that clients face when first loading the discovery document
On an update to the discovery document, efficiently allow clients to detect new types for appropriate decisions to be made
Aggregate the discovery documents for all Kubernetes types

Non-Goals

Since the current discovery separated by group-version is already GA, removal of the endpoint will not be attempted. There are still use cases for publishing the discovery document per group-version and this KEP will solely focus on introducing the new aggregated endpoint.

Watchable discovery is also outside the scope of this KEP. Polling with ETag support is sufficient for most users.

Proposal

We are proposing augmenting the current discovery endpoints at /api and /apis with a new content negotiation accept type. These endpoints will serve an aggregated discovery document that contains the resources for all group versions. ETag support will be provided so clients who already have the latest version of the aggregated discovery can avoid redownloading the document.

We will add a new controller responsible for aggregating the discovery documents when a resource on the cluster changes. There will be no conflicts when aggregating since each discovery document is self-contained.

Notes/Constraints/Caveats (Optional)

This is an important design note around selecting the group version for the new discovery types to be apidiscovery/v2beta1. Link to the full comment

Discovery is a non-resource API class
As a non-resource API class, once the feature gate is “on-by-default” the API is required to be stable (only additive features)
Non-resource APIs that are “off-by-default” do not promise stability
A non-resource API that has to change before promotion to “on-by-default” must represent incompatible changes somehow to clients (if the version is “v1” and then we find a bug, we would have to rev to “v2” before “on-by-default”, which means “v1” might not ever be exposed to end users)
Unversioned net new endpoints (/healthz) are effectively v1 even if they are “off-by-default”
We don’t want to have multiple endpoints for discovery because it’s confusing for users and defeats the purpose of making discovery more efficient, and we have a way to do that with negotiation
We think there is value in a new API type (APIGroupDiscovery) which simplifies client logic, but it comes with a small risk of not being correct
We have a good idea of what the API looks like due to a previous v1, so we are evolving an existing API and are not “completely flying blind” (i.e. implying this is really an alpha api)
While we aren’t exactly like an unversioned new endpoint (v1 from start), we want to deliver the feature (improves clients) without giving the perception that the API is perfect

Risks and Mitigations

Design Details

The current discovery endpoints /api and /apis will accept a new content negotiation type APIGroupDiscoveryList, representing an aggregated discovery document.

Clients requesting the aggregated document will send a request with as (kind), v (version), and g (group) set as part of the Accept header. For example, a client requesting the v2beta1 version will send Accept: application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io.

Clients should send an Accept header with all the acceptable responses in preferred order. This avoids sending additional requests to the same endpoint if the initial preferred version is unavailable. The default accept type will not be changed, and omitting the content negotiation type will default to the unaggregated APIGroupList type. Requests should have application/json or application/vnd.kubernetes.protobuf as a fallback option in case the server does not support the aggregated type (e.g. different version, feature disabled, etc.). For instance, Accept: application/json;as=APIGroupDiscoveryList;v=v2;g=apidiscovery.k8s.io,application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json requests the aggregated discovery v2 type, the aggregated discovery v2beta1 type, and the unaggregated v1 type, in that order. The server will return the first option that is supported.
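The negotiation described above can be sketched in Go with only the standard library. This is an illustrative sketch rather than the apiserver's negotiation code: it assumes the first clause of the example header is meant to name the v2 type, it splits the header naively on commas, and the buildAccept and negotiate names and the "unaggregated" return value are hypothetical.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// Media types taken from the example header; listed in preferred order.
const (
	acceptV2      = "application/json;as=APIGroupDiscoveryList;v=v2;g=apidiscovery.k8s.io"
	acceptV2Beta1 = "application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io"
	acceptLegacy  = "application/json"
)

// buildAccept joins the acceptable media types in preferred order so that a
// single request covers every server version the client can handle.
func buildAccept() string {
	return strings.Join([]string{acceptV2, acceptV2Beta1, acceptLegacy}, ",")
}

// negotiate walks the Accept clauses in order and returns the first one the
// server can satisfy: an aggregated version present in supported, or
// "unaggregated" for plain application/json. Unparseable or unsupported
// clauses are skipped; with no usable clause the default (unaggregated) wins.
func negotiate(accept string, supported map[string]bool) string {
	for _, clause := range strings.Split(accept, ",") {
		mediaType, params, err := mime.ParseMediaType(strings.TrimSpace(clause))
		if err != nil || mediaType != "application/json" {
			continue
		}
		if params["as"] == "" {
			return "unaggregated"
		}
		if params["as"] == "APIGroupDiscoveryList" &&
			params["g"] == "apidiscovery.k8s.io" && supported[params["v"]] {
			return params["v"]
		}
	}
	return "unaggregated"
}

func main() {
	accept := buildAccept()
	fmt.Println(negotiate(accept, map[string]bool{"v2": true, "v2beta1": true}))
	fmt.Println(negotiate(accept, map[string]bool{"v2beta1": true}))
	fmt.Println(negotiate(accept, map[string]bool{}))
}
```

Because the fallback clause comes last, a server with the feature disabled still answers the same request with the unaggregated document, and no second round trip is needed.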

Refer to the Version Skew Strategy section for more information on how backwards compatibility is maintained by both the client and server when the types are promoted from v2beta1 to v2.

API

The contents of this endpoint will be an APIGroupDiscoveryList, containing a list of APIGroupDiscovery, with each group including a list of versions (APIVersionDiscovery). Each APIVersionDiscovery will include a list of APIResourceDiscovery. There are a couple of minor changes in APIResourceDiscovery compared to the current APIResource object, but all states expressible with the current API will be representable in the new API.

The endpoint will also publish an ETag calculated based on a hash of the data for clients.
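A minimal Go sketch of the ETag exchange follows. It is an illustration under stated assumptions, not the apiserver handler: SHA-256 is chosen arbitrarily (the text only says the ETag is a hash of the data), and etagFor, discoveryHandler, and get are hypothetical helpers. Requests are served in memory through net/http/httptest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// etagFor derives a quoted ETag from the serialized document.
// The hash function is an assumption made for this sketch.
func etagFor(doc []byte) string {
	sum := sha256.Sum256(doc)
	return `"` + hex.EncodeToString(sum[:]) + `"`
}

// discoveryHandler serves doc with its ETag and answers 304 Not Modified
// when the client's If-None-Match header already carries the current ETag.
func discoveryHandler(doc []byte) http.Handler {
	etag := etagFor(doc)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(doc)
	})
}

// get issues one in-memory request against h, sending cachedETag (if any)
// as If-None-Match, and returns the status code and the ETag header.
func get(h http.Handler, cachedETag string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/apis", nil)
	if cachedETag != "" {
		req.Header.Set("If-None-Match", cachedETag)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("ETag")
}

func main() {
	h := discoveryHandler([]byte(`{"kind":"APIGroupDiscoveryList","items":[]}`))
	status, etag := get(h, "")
	fmt.Println(status) // 200: first fetch downloads the document
	status, _ = get(h, etag)
	fmt.Println(status) // 304: cached copy is still current
}
```

A client that stores the ETag alongside its cached document can poll cheaply: an unchanged document costs only a header exchange.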

These types will live in the apidiscovery/v2 group version.

This is what the new API will look like.

// APIGroupDiscoveryList is a resource containing a list of APIGroupDiscovery.
// This is what is returned from the /api and /apis endpoints when the aggregated type is negotiated, and is used to discover
// the list of API resources (built-ins, Custom Resource Definitions, resources from aggregated servers)
// that a cluster supports.
type APIGroupDiscoveryList struct {
 TypeMeta `json:",inline"`
 // ResourceVersion will not be set, because this does not have a replayable ordering among multiple apiservers.
 // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
 // +optional
 ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

 // items is the list of groups for discovery.
 Items []APIGroupDiscovery `json:"items" protobuf:"bytes,2,rep,name=items"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// APIGroupDiscovery holds information about which resources are being served for all versions of the API Group.
// It contains a list of APIVersionDiscovery that holds a list of APIResourceDiscovery types served for a version.
// Versions are in descending order of preference, with the first version being the preferred entry.
type APIGroupDiscovery struct {
 TypeMeta `json:",inline"`
 // Standard object's metadata.
 // The only field completed will be name. For instance, resourceVersion will be empty.
 // name is the name of the API group whose discovery information is presented here.
 // name is allowed to be "" to represent the legacy, ungroupified resources.
 // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
 // +optional
 ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
 // versions are the versions supported in this group. They are sorted in descending order of preference,
 // with the preferred version being the first entry.
 // +listType=map
 // +listMapKey=version
 Versions []APIVersionDiscovery `json:"versions,omitempty" protobuf:"bytes,2,rep,name=versions"`
}

// APIVersionDiscovery holds a list of APIResourceDiscovery types that are served for a particular version within an API Group.
type APIVersionDiscovery struct {
 // version is the name of the version within a group version.
 Version string `json:"version" protobuf:"bytes,1,opt,name=version"`
 // resources is a list of APIResourceDiscovery objects for the corresponding group version.
 // +listType=map
 // +listMapKey=resource
 Resources []APIResourceDiscovery `json:"resources,omitempty" protobuf:"bytes,2,rep,name=resources"`
 // freshness marks whether a group version's discovery document is up to date.
 // "Current" indicates no problems when fetching the discovery document. "Stale" indicates
 // that there was an error fetching the discovery document, and the current version may not
 // be up to date.
 Freshness DiscoveryFreshness `json:"freshness,omitempty" protobuf:"bytes,3,opt,name=freshness"`
}

// APIResourceDiscovery provides information about an API resource for discovery.
type APIResourceDiscovery struct {
 // resource is the plural name of the resource. This is used in the URL path and is the unique identifier
 // for this resource across all versions in the API group.
 // resources with non-"" groups are located at /apis/<APIGroupDiscovery.objectMeta.name>/<APIVersionDiscovery.version>/<APIResourceDiscovery.Resource>
 // resource with "" groups are located at /api/v1/<APIResourceDiscovery.Resource>
 Resource string `json:"resource" protobuf:"bytes,1,opt,name=resource"`
 // responseKind describes the type of serialization that will typically be returned from this endpoint.
 // APIs may return other objects types at their discretion, such as error conditions, requests for alternate representations, or other operation specific behavior.
 ResponseKind GroupVersionKind `json:"responseKind" protobuf:"bytes,2,opt,name=responseKind"`
 // scope indicates the scope of a resource, either Cluster or Namespaced
 Scope ResourceScope `json:"scope" protobuf:"bytes,3,opt,name=scope"`
 // singularResource is the singular name of the resource. This allows clients to handle plural and singular opaquely.
 // For many clients the singular form of the resource will be more understandable to users reading messages and should be used when integrating the name of the resource into a sentence.
 // The command line tool kubectl, for example, allows use of the singular resource name in place of plurals.
 // The singular form of a resource should always be an optional element - when in doubt use the canonical resource name.
 SingularResource string `json:"singularResource" protobuf:"bytes,4,opt,name=singularResource"`
 // verbs is a list of supported API operation types (this includes
 // but is not limited to get, list, watch, create, update, patch,
 // delete, deletecollection, and proxy)
 Verbs Verbs `json:"verbs" protobuf:"bytes,5,opt,name=verbs"`
 // shortNames is a list of suggested short names of the resource.
 // +listType=set
 ShortNames []string `json:"shortNames,omitempty" protobuf:"bytes,6,rep,name=shortNames"`
 // categories is a list of the grouped resources this resource belongs to (e.g. 'all').
 // Clients may use this to simplify acting on multiple resource types at once.
 // +listType=set
 Categories []string `json:"categories,omitempty" protobuf:"bytes,7,rep,name=categories"`
 // subresources is a list of subresources provided by this resource. Subresources are located at /apis/<APIGroupDiscovery.objectMeta.name>/<APIVersionDiscovery.version>/<APIResourceDiscovery.Resource>/name-of-instance/<APIResourceDiscovery.subresources[i].subresource>
 // +listType=map
 // +listMapKey=subresource
 Subresources []APISubresourceDiscovery `json:"subresources,omitempty" protobuf:"bytes,8,rep,name=subresources"`
}

// ResourceScope is an enum defining the different scopes available to a resource.
type ResourceScope string

const (
 ScopeCluster ResourceScope = "Cluster"
 ScopeNamespace ResourceScope = "Namespaced"
)

// DiscoveryFreshness is an enum defining whether the Discovery document published by an apiservice is up to date (fresh).
type DiscoveryFreshness string

const (
 DiscoveryFreshnessCurrent DiscoveryFreshness = "Current"
 DiscoveryFreshnessStale DiscoveryFreshness = "Stale"
)

// APISubresourceDiscovery provides information about an API subresource for discovery.
type APISubresourceDiscovery struct {
 // subresource is the name of the subresource. This is used in the URL path and is the unique identifier
 // for this resource across all versions.
 Subresource string `json:"subresource" protobuf:"bytes,1,opt,name=subresource"`
 // responseKind describes the type of serialization that will be returned from this endpoint.
 // Some subresources do not return normal resources, these will have nil return types.
 ResponseKind *GroupVersionKind `json:"responseKind,omitempty" protobuf:"bytes,2,opt,name=responseKind"`
 // acceptedTypes describes the kinds that this endpoint accepts. It is possible for a subresource to accept multiple kinds.
 // It is also possible for an endpoint to accept no standard types. Those will have a zero length list.
 // +listType=set
 AcceptedTypes []GroupVersionKind `json:"acceptedTypes,omitempty" protobuf:"bytes,3,rep,name=acceptedTypes"`
 // verbs is a list of supported kube verbs: get, list, watch, create,
 // update, patch, delete
 Verbs Verbs `json:"verbs" protobuf:"bytes,4,opt,name=verbs"`
}
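As a usage sketch, a client could walk the aggregated document to derive resource URL paths and each group's preferred version, following the rules stated in the field comments above. The types below are trimmed, hypothetical copies that keep only the fields the sketch needs (Name stands in for ObjectMeta.Name), and resourcePaths and preferredVersion are illustrative helpers, not client-go functions.

```go
package main

import "fmt"

// Trimmed stand-ins for the discovery types; only the fields used here.
type APIResourceDiscovery struct {
	Resource string
}

type APIVersionDiscovery struct {
	Version   string
	Resources []APIResourceDiscovery
}

type APIGroupDiscovery struct {
	Name     string // "" represents the legacy core group
	Versions []APIVersionDiscovery
}

// resourcePaths lists the URL path of every resource: the "" group is served
// under /api/<version>, every other group under /apis/<group>/<version>.
func resourcePaths(groups []APIGroupDiscovery) []string {
	var paths []string
	for _, g := range groups {
		for _, v := range g.Versions {
			prefix := "/apis/" + g.Name + "/" + v.Version
			if g.Name == "" {
				prefix = "/api/" + v.Version
			}
			for _, r := range v.Resources {
				paths = append(paths, prefix+"/"+r.Resource)
			}
		}
	}
	return paths
}

// preferredVersion returns the first version entry, since versions are
// sorted in descending order of preference. It returns "" for an empty group.
func preferredVersion(g APIGroupDiscovery) string {
	if len(g.Versions) == 0 {
		return ""
	}
	return g.Versions[0].Version
}

func main() {
	groups := []APIGroupDiscovery{
		{Name: "", Versions: []APIVersionDiscovery{
			{Version: "v1", Resources: []APIResourceDiscovery{{Resource: "pods"}}},
		}},
		{Name: "apps", Versions: []APIVersionDiscovery{
			{Version: "v1", Resources: []APIResourceDiscovery{{Resource: "deployments"}}},
		}},
	}
	for _, p := range resourcePaths(groups) {
		fmt.Println(p)
	}
	fmt.Println(preferredVersion(groups[1]))
}
```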

Aggregation

For the aggregation layer on the server, a new controller will be created to aggregate discovery for built-in types, apiextensions types (CRDs), and types from aggregated api servers.

A post start hook will be added and the kube-apiserver health check should only pass if the discovery document is ready. Since aggregated api servers may take longer to respond and we do not want to delay cluster startup, the health check will only block on the local api servers (built-ins and CRDs) to have their discovery ready. For api servers that have not been aggregated, their group-versions will be published with an empty resource list and a Freshness of Stale to indicate that they have not synced yet.
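The handling of unsynced aggregated api servers can be sketched as below. This is a simplified illustration, not the controller's actual code: versionEntry is a hypothetical flattened record, and publish assumes the controller keeps a map of the resource lists it has fetched so far, keyed by group-version.

```go
package main

import "fmt"

type freshness string

const (
	current freshness = "Current"
	stale   freshness = "Stale"
)

// versionEntry is a hypothetical flattened view of one published group-version.
type versionEntry struct {
	GroupVersion string
	Resources    []string
	Freshness    freshness
}

// publish builds one entry per registered group-version. Group-versions whose
// discovery has not been fetched yet are still published, with an empty
// resource list and Stale freshness, so startup does not wait for them.
func publish(groupVersions []string, fetched map[string][]string) []versionEntry {
	entries := make([]versionEntry, 0, len(groupVersions))
	for _, gv := range groupVersions {
		resources, ok := fetched[gv]
		if !ok {
			entries = append(entries, versionEntry{GroupVersion: gv, Resources: []string{}, Freshness: stale})
			continue
		}
		entries = append(entries, versionEntry{GroupVersion: gv, Resources: resources, Freshness: current})
	}
	return entries
}

func main() {
	registered := []string{"apps/v1", "metrics.k8s.io/v1beta1"}
	fetched := map[string][]string{"apps/v1": {"deployments", "statefulsets"}}
	for _, e := range publish(registered, fetched) {
		fmt.Printf("%s %s %d\n", e.GroupVersion, e.Freshness, len(e.Resources))
	}
}
```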

Client

The client-go interface will be modified to add a new method to retrieve the aggregated discovery document and kubectl will be the initial candidate. As a starting point, kubectl api-resources should use the aggregated discovery document instead of sending a storm of requests.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/apiserver/pkg/endpoints/discovery/aggregated: 77.4
- Note that the fake.go file has no unit test coverage as it is a utility designed to be used by integration tests. The rest of the files in the package have 90+ coverage.
k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go: 82.2
k8s.io/client-go/discovery/aggregated_discovery.go: 96.8

Integration tests

test/integration/apiserver/discovery/discovery_test.go

e2e tests

test/e2e/apimachinery/aggregated_discovery.go

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial e2e tests completed and enabled
At least one client (kubectl) has an implementation to use the aggregated discovery feature

We want all clients to benefit from this feature, but for alpha our main focus will be on kubectl and golang clients.

Beta

kubectl uses the aggregated discovery feature by default
Metrics are added

GA

Existing bugs are fixed:
- AggregatedDiscovery controller does not purge old APIServices from cache (Issue )
- Aggregated Discovery doesn’t show aggregated apiservices as Stale before initial health check (Issue )
New API type apidiscovery.k8s.io/v2 is introduced
e2e and conformance tests

Note: Generally we also wait at least two releases between beta and GA/stable, because there’s no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests.

Deprecation

Once Aggregated Discovery v2 types are GA, v2beta1 types will be deprecated and removed after 3 releases.

Upgrade / Downgrade Strategy

Aggregated discovery will be behind a feature gate. It is an in-memory feature and upgrade/downgrade is not a problem.

Version Skew Strategy

When moving from beta to GA, we will introduce a new API group version apidiscovery.k8s.io/v2.

All clients from v1.26 to v1.29 will only request the beta API group version apidiscovery.k8s.io/v2beta1.

To accommodate skew between the client and server (older client and newer server), the server will serve both v2 and v2beta1 versions based on the client request headers. The API server will continue to support v2beta1 until its removal in Kubernetes v1.33.

To accommodate skew between an older server and a newer client, starting in v1.30, client-go will request both v2 and v2beta1 by sending a list of the group versions it accepts (in the order v2, v2beta1, unaggregated), and the server will return the first group version that matches. Concretely, this is done using Accept headers with a single request.

Accept: application/json;as=APIGroupDiscoveryList;v=v2;g=apidiscovery.k8s.io,application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json

In the case of older servers, the server will only be able to match v2beta1. The client will support both v2 and v2beta1. This allows a newer client to communicate with an older server that supports only the beta version. Other clients should follow the same convention to support version skew, though a client that is only capable of processing v2 is sufficient if it only communicates with v1.30+ servers. Otherwise, the client will need to be ready to tolerate a 406 Not Acceptable response and handle the error appropriately.

If there is no skew and both server and client are v1.30+, clients will still request v2 and v2beta1, and the server will match the first group version and return v2.
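
The "first match wins" negotiation described in this section can be sketched as below. This is an illustrative approximation, not the apiserver's content negotiation code: the function pickDiscoveryVersion and the supported set are assumptions made for the example, and only the Go standard library is used.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// pickDiscoveryVersion walks the Accept header in order and returns the first
// aggregated discovery version the server supports. It returns "" with ok=true
// when the client should be served the unaggregated format, and ok=false when
// nothing in the header can be satisfied (the 406 Not Acceptable case).
// The function name and the "supported" set are hypothetical.
func pickDiscoveryVersion(accept string, supported map[string]bool) (version string, ok bool) {
	for _, clause := range strings.Split(accept, ",") {
		mediaType, params, err := mime.ParseMediaType(strings.TrimSpace(clause))
		if err != nil || mediaType != "application/json" {
			continue
		}
		if params["as"] == "" {
			// Plain application/json: fall back to unaggregated discovery.
			return "", true
		}
		if params["as"] == "APIGroupDiscoveryList" &&
			params["g"] == "apidiscovery.k8s.io" && supported[params["v"]] {
			return params["v"], true
		}
	}
	return "", false
}

func main() {
	const accept = "application/json;as=APIGroupDiscoveryList;v=v2;g=apidiscovery.k8s.io," +
		"application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io," +
		"application/json"

	newServer := map[string]bool{"v2": true, "v2beta1": true}
	oldServer := map[string]bool{"v2beta1": true}

	fmt.Println(pickDiscoveryVersion(accept, newServer))
	fmt.Println(pickDiscoveryVersion(accept, oldServer))
}
```

A client that sends only the v2 clause with no plain application/json fallback would hit the ok=false branch against an older server, which corresponds to the 406 case mentioned above.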

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: AggregatedDiscoveryEndpoint
- Components depending on the feature gate: kube-apiserver

Does enabling the feature change any default behavior?

Clients using client-go version 1.26 and up will use the aggregated discovery endpoint rather than the unaggregated discovery endpoint. This is handled automatically in client-go, and clients should see fewer requests to the api server when fetching discovery information. Client versions older than 1.26 will continue to use the old unaggregated discovery endpoint without any changes.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature may be disabled on the apiserver by reverting the feature flag. This will disable aggregated discovery for all clients. If there is a golang specific client side bug, the feature may also be turned off in client-go via the WithLegacy() toggle and this will require a recompile of the application.

What happens if we reenable the feature if it was previously rolled back?

The feature does not depend on state, and can be disabled/enabled at will.

Are there any tests for feature enablement/disablement?

A test will be added to ensure that the RESTMapper functionality works properly both when the feature is enabled and disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

During a rollout, some apiservers may support aggregated discovery and some may not. It is recommended that clients request the aggregated discovery document with a fallback to the unaggregated discovery format. This can be achieved by setting the Accept header to include a fallback to the default GVK of the /apis and /api endpoints. For example, to request the aggregated discovery type and fall back to unaggregated discovery, the following header can be sent: Accept: application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json

This kind of fallback is already implemented in client-go and this note is intended for non-golang clients.

What specific metrics should inform a rollback?

High latency or failures reported by the metrics of the newly added discovery aggregation controller. If the /api and /apis endpoints return an error or are unreachable with the APIGroupDiscoveryList accept type, that could be a sign that a rollback is needed.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

n/a. The API introduced does not store data and state is recalculated on the upgrade, downgrade, upgrade cycle. No state is preserved between versions.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

By enabling aggregated discovery as the default, the new API is slightly different from the unaggregated version. The StorageVersionHash field is removed from resources in the aggregated discovery API. The storage version migrator will have an additional flag when initializing the discovery client to continue using the unaggregated API.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Operators can check whether an aggregated discovery request can be made by sending a request to /apis with application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json as the Accept header and looking at the Content-Type response header. A response header of Content-Type: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList indicates that aggregated discovery is supported, and a Content-Type: application/json header indicates that aggregated discovery is not supported. They can also check for the presence of aggregated discovery related metrics: aggregator_discovery_aggregation_count
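
The Content-Type check described above can be automated along these lines. This is a sketch under stated assumptions: aggregatedVersion is a hypothetical helper, and it only inspects a header value that the operator has already obtained from a response.

```go
package main

import (
	"fmt"
	"mime"
)

// aggregatedVersion inspects a Content-Type response header and reports
// whether it denotes an aggregated discovery document and, if so, which
// version. The helper name is an assumption made for this example.
func aggregatedVersion(contentType string) (version string, aggregated bool) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", false
	}
	if params["as"] == "APIGroupDiscoveryList" && params["g"] == "apidiscovery.k8s.io" {
		return params["v"], true
	}
	return "", false
}

func main() {
	for _, ct := range []string{
		"application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList",
		"application/json",
	} {
		v, ok := aggregatedVersion(ct)
		fmt.Printf("%q aggregated=%t version=%q\n", ct, ok, v)
	}
}
```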

How can someone using this feature know that it is working for their instance?

/api and /apis endpoints are populated with discovery information when the aggregated content negotiation type accept header is passed, and all expected group-versions are present.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Aggregated Discovery falls under a non-streaming read-only API call, which is covered by the Kubernetes API call latency SLI/SLO. The numbers in the SLO are a good bound for Aggregated Discovery (p99 < 1s).

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: aggregator_discovery_aggregation_duration
- Components exposing the metric: kube-apiserver
- This is a metric for exposing the time it took to aggregate all the api resources.
- Metric name: aggregator_discovery_aggregation_count
- Components exposing the metric: kube-apiserver
- This is a metric for the number of times that the discovery document has been aggregated.

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No, but if aggregated apiservers are present, the feature will attempt to contact and aggregate the data published from the aggregated apiserver on a set interval. If there is high error rate, stale data may be returned because the latest data was not able to be fetched from the aggregated apiserver.

Scalability

Will enabling / using this feature result in any new API calls?

No. Enabling this feature should reduce the total number of API calls for client discovery. Instead of clients sending a discovery request to all group versions (/apis/<group>/<version>), they will only need to send a request to the aggregated endpoint to obtain all resources that the cluster supports.

Will enabling / using this feature result in introducing new API types?

Yes, but these API types are not persisted.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature is built into the API server, and will not work if the API server is unavailable.

What are other known failure modes?

Aggregated API Server is unavailable:
Detection: An Aggregated API Server that is unavailable will return Stale as the DiscoveryFreshness. A prolonged period of staleness could indicate that the aggregated apiserver is unavailable.
Mitigations: If the aggregated apiserver is not reachable, it will not be part of the resources available. Restarting the pod or checking for any misconfigurations could be a valid next step.
Diagnostics: Missing the (v3) log line: DiscoveryManager: successfully downloaded discovery/legacy discovery for <apiservice>
Testing: We test for unreachable aggregated apiservers returning Stale, but an aggregated apiserver could be unavailable for a wide variety of reasons that would require further diagnosis.

What steps should be taken if SLOs are not being met to determine the problem?

The feature can be rolled back by setting the AggregatedDiscoveryEndpoint feature flag to false.

Implementation History

v1.26: Aggregated Discovery KEP is merged and moves to alpha
v1.27: Aggregated Discovery moves to beta
v1.30: Aggregated Discovery moves to stable

Drawbacks

With aggregation, the size of the aggregated discovery document could be an issue in the future since clients will need to download the entire document on any resource update. At the moment, even with 3000 CRDs (already very unlikely), the total size is still smaller than 1MB.

Alternatives

An alternative that was considered is Discovery Cache Busting. Cache-busting allows clients to know if the files need to be downloaded at all, and in most cases the download can be skipped entirely. This typically works by including a hash of the resource content in its name, while marking the objects as never-expiring using cache control headers. Clients can then recognize if the names have changed or stayed the same, and re-use files that have kept the same name without downloading them again.

Aggregated Discovery was selected because of the number of requests that are saved both on startup and on changes involving multiple group versions. For a full comparison between Discovery Cache Busting and Aggregated Discovery, please refer to the Google Doc.

An additional alternative that we considered is watchable discovery. After diving into the use cases, polling with ETag support is sufficient for most clients and adding support for watch drastically changes the scope of this proposal.

Finally, another alternative that was explored was creating a new URL endpoint /discovery/<version>. The addition of a new URL endpoint per serialization version creates a burden for clients as the API evolves, as they may need to check multiple endpoints to determine the state of the feature.

Resources: Allow a Network Policy to contemplate a set of ports in a single rule

Summary

Today the ports field in ingress and egress network policies is an array in which every single port to be covered must be declared. This KEP proposes adding a new field that allows a port range to be declared, simplifying the creation of rules with multiple ports.

Motivation

The NetworkPolicy object is a complex object that allows a developer to specify the traffic behavior expected of the application and to allow or deny traffic accordingly.

There are a number of user issues, such as kubernetes #67526 and kubernetes #93111, where users describe the need for a policy that allows a range of ports except for some specific ports, or a policy that allows egress to another cluster's NodePort range (e.g. 32000-32768). In such cases, the rule has to be created by specifying each port separately, as:

spec:
  egress:
  - ports:
    - protocol: TCP
      port: 32000
    - protocol: TCP
      port: 32001
    - protocol: TCP
      port: 32002
    - protocol: TCP
      port: 32003
    [...]
    - protocol: TCP
      port: 32768

So for the user:

To allow a range of ports, each of them must be declared as an item of the ports array
To make an exception, every port other than the excepted one must be declared

Adding a new endPort field inside ports will make it simpler for users to create NetworkPolicies.

Goals

Add an endPort field in NetworkPolicyPort

Non-Goals

Support specific Exception field.
Support endPort when the starting port is a named port.

Proposal

In the NetworkPolicy specification, add a new endPort field inside NetworkPolicyPort. The field holds a numeric port; when set, it indicates that the entry is a range and defines where that range ends.

User Stories

Story 1 - Opening communication to NodePorts of another cluster

I have an application that communicates with NodePorts of a different cluster, and I want to allow egress traffic only to the NodePort range (e.g. 30000-32767), as I don’t know which port is going to be allocated on the other side but don’t want to create a rule for each of them.

Story 2 - Blocking the egress for not allowed insecure ports

As a developer, I need to create an application that scrapes information from multiple sources, such as databases running on random ports, web applications and other sources. But my company's security policy requires me to block communication with well-known ports, like 111 and 445, so I need to create a network policy that allows me to communicate with any port except those two in order to be compliant with the company’s policy.

Story 3 - Containerized Passive FTP Server

As a Kubernetes user, I’ve received a request from my boss to run our FTP server in an existing Kubernetes cluster to support some of my legacy applications. This FTP server must be accessible from inside and outside the cluster, but I still need to comply with my company's basic security policies, which require a default deny rule for all workloads and allow only specific ports.

Because this FTP Server runs in PASV mode, I need to open the Network Policy to ports 21 and also to the range 49152-65535 without allowing any other ports.

Notes/Constraints/Caveats

The technology used by the CNI provider might not support port ranges in a trivial way, as described in the Drawbacks section.

Risks and Mitigations

CNIs will need to support the new field in their controllers. To address this, we’ll try to communicate broadly with the main CNI providers so that they are aware of the new field.

Design Details

API changes to NetworkPolicy:

Add a new field called EndPort inside NetworkPolicyPort as follows:

// NetworkPolicyPort describes a port to allow traffic on
type NetworkPolicyPort struct {
 // The protocol (TCP, UDP, or SCTP) which traffic must match. If not specified, this
 // field defaults to TCP.
 // +optional
 Protocol *v1.Protocol `json:"protocol,omitempty" protobuf:"bytes,1,opt,name=protocol,casttype=k8s.io/api/core/v1.Protocol"`

 // The port on the given protocol. This can either be a numerical or named 
 // port on a pod. If this field is not provided, this matches all port names and
 // numbers, whether an endPort is defined or not.
 // +optional
 Port *intstr.IntOrString `json:"port,omitempty" protobuf:"bytes,2,opt,name=port"`

 // EndPort defines the last port included in the port range.
 // Example:
 // endPort: 12345
 // +optional
 EndPort int32 `json:"endPort,omitempty" protobuf:"varint,3,opt,name=endPort"`
}

Validations

The NetworkPolicyPort will need to be validated, with the following scenarios:

If an EndPort is specified a Port must also be specified
If Port is a string (named port) EndPort cannot be specified
EndPort must be equal to or greater than Port
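
The three validation scenarios above can be sketched as follows. This is an illustrative approximation rather than the real API validation code: portValue is a simplified stand-in for intstr.IntOrString, policyPort and validatePolicyPort are hypothetical names, and EndPort is modelled as a pointer here purely so that "not specified" can be distinguished from a value.

```go
package main

import (
	"errors"
	"fmt"
)

// portValue is a simplified, hypothetical stand-in for intstr.IntOrString:
// it holds either a numeric port or a named port.
type portValue struct {
	IsNamed bool
	Num     int32
	Name    string
}

// policyPort models only the two fields relevant to the validation rules.
// EndPort is a pointer in this sketch so that "unset" is representable.
type policyPort struct {
	Port    *portValue
	EndPort *int32
}

// validatePolicyPort applies the three rules listed in the text.
func validatePolicyPort(p policyPort) error {
	if p.EndPort == nil {
		return nil // no range requested, nothing to check
	}
	if p.Port == nil {
		return errors.New("endPort requires port to be specified")
	}
	if p.Port.IsNamed {
		return errors.New("endPort cannot be used with a named port")
	}
	if *p.EndPort < p.Port.Num {
		return errors.New("endPort must be equal to or greater than port")
	}
	return nil
}

func main() {
	end := int32(65535)
	ok := policyPort{Port: &portValue{Num: 49152}, EndPort: &end}
	named := policyPort{Port: &portValue{IsNamed: true, Name: "ftp-data"}, EndPort: &end}

	fmt.Println(validatePolicyPort(ok))
	fmt.Println(validatePolicyPort(named))
}
```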

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests:

test API validation logic
test API strategy to ensure disabled fields

E2E tests:

Add e2e tests exercising only the API operations for port ranges. Data-path validation should be done by CNIs.

Unit tests

pkg/apis/networking/validation/validation: 14/Jun/2022 - 92.5%
pkg/registry/networking/networkpolicy/strategy: 14/Jun/2022 - 75.9%

e2e tests

Feature:NetworkPolicyEndPort: https://storage.googleapis.com/k8s-triage/index.html?text=EndPort#eaa4b8cdb7b461dccfa9

The flakes shown here are not related to this feature, per the test logs.

Graduation Criteria

Alpha

Add a feature gated new field to NetworkPolicy
Communicate the new field to CNI providers
Add validation tests in API

Beta

EndPort has been supported for at least 1 minor release
Three commonly used NetworkPolicy providers (or CNI providers) implement the new field, with generally positive feedback on its usage.
Feature gate is enabled by default.

GA

At least four NetworkPolicy providers (or CNI providers) support the EndPort field
EndPort has been enabled by default for at least 1 minor release

The following are the CNIs that implement this feature:

Calico
Antrea
OpenShift SDN
kube-router

Upgrade / Downgrade Strategy

If upgraded, no impact is expected, as this is a new field.

If downgraded, the CNI won’t be able to read the new field, as it no longer exists, and network policies using this field will stop working as intended. This is a fail-closed failure, so it is acceptable.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: NetworkPolicyEndPort
- Components depending on the feature gate: Kubernetes API Server

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. One caveat is that NetworkPolicies created with the EndPort field set while the feature was enabled will continue to have that field set after the feature is disabled, unless the user removes it from the object.

If the value is dropped while the feature gate is disabled, the field can only be re-inserted once the feature gate is enabled again.

Rolling back to a Kubernetes API Server version that does not have this field will cause the field to no longer be returned on GET operations, so CNIs relying on the new field won’t see it anymore.

If this happens, CNIs will interpret the policy as a single port instead of a port range. This may break users, which is unavoidable, but it satisfies the fail-closed requirement.

What happens if we reenable the feature if it was previously rolled back?

Nothing.

Are there any tests for feature enablement/disablement?

Yes and they can be found here

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Unlikely, but there is still a risk that a bug causes validation to fail or a conversion function to crash.

What specific metrics should inform a rollback?

An increase in the 5xx HTTP error count on the NetworkPolicy endpoint

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, with unit tests. Manual tests were also executed as follows:

Created a KinD cluster in v1.24 and Calico as a CNI
Created a Network Policy with endPort field to allow a Pod egress to ports from 70 to 90
Did a test against a target in port 80 - Worked
Disabled the Feature Gate
The Network Policy still worked fine
Changed the Network Policy so the range is 70 to 79, and the change was applied successfully
Traffic to port 80 started to be blocked, but port 78 could still be reached as it is within the range
Removed the endPort field, and was unable to re-add it as the feature gate was disabled
Re-enabled feature gate
Re-added endPort field with value of 90
Traffic started to flow/be accepted again

Per the manual tests, all worked as desired.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

None

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Operators can determine whether EndPort is honored by creating a NetworkPolicy object that specifies a range and validating that traffic is allowed within the specified range.

In addition, the NetworkPolicy object now supports (as alpha) status/condition fields, so Network Policy providers (NPPs) can give the user feedback on whether the policy was processed correctly. Providing this feedback is optional and depends on the implementation of each NPP.

How can someone using this feature know that it is working for their instance?

Other
Details: The API field must be present when a NetworkPolicy is created with that field. Whether the feature works correctly depends on the CNI implementation, so the operator can look at CNI metrics to check whether the rules are being applied correctly. Calico, for example, provides metrics such as felix_iptables_restore_errors that can be used to verify whether the number of restore errors rose after the feature was applied. For NetworkPolicy providers that do not support this feature, a new status field was added to the NetworkPolicy object, allowing providers to give feedback to users through conditions. Any NPP that does not support this feature should add a condition to the NetworkPolicy object.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Operators can use metrics provided by the CNI as SLIs, such as felix_iptables_restore_errors from Calico, to verify whether the error rate has risen.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

per-day percentage of API calls finishing with 5XX errors <= 1% is a reasonable SLO

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

Yes, a CNI supporting the new feature

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

API type(s): NetworkPolicyPorts
Estimated increase in size: 2 bytes for each new EndPort value specified + the field name/number in its serialized format
Estimated amount of new objects: N/A

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

N/A

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

There may be some increase in resource usage by the CNI while parsing the new field.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

As this feature is mainly used by CNI providers, the reaction with API server and/or etcd being unavailable will be the same as before.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Remove the EndPort field and check whether the number of errors decreases, although this might lead to an undesired Network Policy that blocks previously working rules.

Implementation History

2022-06-14 Propose GA graduation
2021-05-11 Propose Beta graduation and add more Performance Review data
2020-10-08 Initial KEP PR

Drawbacks

The technology used by the CNI provider might not support port ranges in a trivial way. As an example, OpenFlow did not support specifying a port range for a while, as commented in kubernetes #67526. While this changed in Open vSwitch v1.6, it might still be a caveat for other CNIs; eBPF-based CNIs, for example, will need to populate their maps in a different way.

In these cases, CNIs will have to iterate through the port range and populate their packet filtering tables with each port.
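
A minimal sketch of that fallback, assuming a data plane that can only match single ports, is shown below. The function expandPortRange is a hypothetical helper introduced for illustration; it is not taken from any CNI implementation.

```go
package main

import "fmt"

// expandPortRange turns an inclusive port..endPort range into the list of
// individual ports, for data planes that cannot match a range natively.
// The helper name and error text are assumptions made for this example.
func expandPortRange(port, endPort int32) ([]int32, error) {
	if port < 1 || endPort > 65535 || endPort < port {
		return nil, fmt.Errorf("invalid port range %d-%d", port, endPort)
	}
	ports := make([]int32, 0, endPort-port+1)
	for p := port; p <= endPort; p++ {
		ports = append(ports, p)
	}
	return ports, nil
}

func main() {
	// The passive FTP range from Story 3.
	ports, err := expandPortRange(49152, 65535)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("entries to program:", len(ports))
}
```

The number of entries grows linearly with the width of the range, which is the cost a CNI without native range support would have to accept.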

Alternatives

During the development of this KEP, an alternative implementation was considered that added a NetworkPolicyPortRange field inside NetworkPolicyPort, as follows:

// NetworkPolicyPort describes a port or a range of ports to allow traffic on
type NetworkPolicyPort struct {
 // The protocol (TCP, UDP, or SCTP) which traffic must match. If not specified, this
 // field defaults to TCP.
 // +optional
 Protocol *api.Protocol
 // The port on the given protocol. This can either be a numerical or named
 // port on a pod. If this field is not provided but a Range is
 // provided, this field is ignored. Otherwise this matches all port names and
 // numbers.
 // +optional
 Port *intstr.IntOrString
 // A range of ports on a given protocol and the exceptions. If this field
 // is not provided, this doesn't match anything
 // +optional
 Range *NetworkPolicyPortRange
}

But the main design suggested in this KEP seems clearer, so this alternative has been discarded.

It has also been proposed that the implementation contain an Except array and a new struct to be used in Ingress/Egress rules, but because this would bring much more complexity than desired, the proposal has been dropped for now:

// NetworkPolicyPortRange describes the range of ports to be used in a
// NetworkPolicyPort struct
type NetworkPolicyPortRange struct {
 // From defines the start of the port range
 From uint16
 // To defines the end of the port range, being the end included within the
 // range
 To uint16
 // Except defines all the exceptions in the port range
 // +optional
 Except []uint16
}

Resources: Allow DaemonSets to surge during update like Deployments

Allow DaemonSets to surge during update like Deployments

Summary
Motivation
- Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History

Summary

DaemonSets allow two update strategies: OnDelete, which only replaces pods when they are deleted, and RollingUpdate, which supports MaxUnavailable like Deployments but not Surge. DaemonSets should support Surge in order to minimize DaemonSet downtime on nodes. This will allow DaemonSet workloads to implement zero-downtime upgrades.

Motivation

DaemonSets are a key enabler of Kubernetes system-level integrations like CNI, CSI, or per-node functionality. These integrations may have availability impacts on workloads during daemonset updates for a number of reasons, including image pull time or setup. While increasing availability of these daemonsets often requires development investment to manage the handoff between the old instance and the new instance, without the ability to have two pods on the same node these handoffs are complex to implement and typically require higher level orchestration (such as running two daemonsets and round robining updates, or using the OnDelete strategy and orchestrating pod deletes when nodes will be rebooted).

It should be possible for a node level integration to offer zero-downtime upgrades via a DaemonSet without resorting to a higher level orchestration.

Goals

Add support for Surge to the DaemonSet rolling update strategy

Proposal

Implementation Details/Notes/Constraints

The design of Deployment rolling updates introduced the surge concept, and the initial design for DaemonSet updates considered the implications of adding the Surge strategy later (https://github.com/kubernetes/design-proposals-archive/blob/master/apps/daemonset-update.md#future-plans). StatefulSets may also surge in a workload specific fashion, so this design should be as consistent as possible with existing concepts but clearly denote where the workload concept differs from other controllers.

We would add MaxSurge *intstr.IntOrString to the RollingUpdate daemonset upgrade strategy. It would have a default value of 0, preserving current behavior. We would allow MaxUnavailable to be 0 when MaxSurge is set.

// Spec to control the desired behavior of daemon set rolling update.
type RollingUpdateDaemonSet struct {
// The maximum number of DaemonSet pods that can be unavailable during the
// update. Value can be an absolute number (ex: 5) or a percentage of total
// number of DaemonSet pods at the start of the update (ex: 10%). Absolute
// number is calculated from percentage by rounding up.
// This cannot be 0 if MaxSurge is 0
// Default value is 1.
// Example: when this is set to 30%, at most 30% of the total number of nodes
// that should be running the daemon pod (i.e. status.desiredNumberScheduled)
// can have their pods stopped for an update at any given time. The update
// starts by stopping at most 30% of those DaemonSet pods and then brings
// up new DaemonSet pods in their place. Once the new pods are available,
// it then proceeds onto other DaemonSet pods, thus ensuring that at least
// 70% of original number of DaemonSet pods are available at all times during
// the update.
// +optional
MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,1,opt,name=maxUnavailable"`
// The maximum number of nodes with an existing available DaemonSet pod that
// can have an updated DaemonSet pod during during an update.
// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
// This can not be 0 if MaxUnavailable is 0.
// Absolute number is calculated from percentage by rounding up to a minimum of 1.
// Default value is 0.
// Example: when this is set to 30%, at most 30% of the total number of nodes
// that should be running the daemon pod (i.e. status.desiredNumberScheduled)
// can have their a new pod created before the old pod is marked as deleted.
// The update starts by launching new pods on 30% of nodes. Once an updated
// pod is available (Ready for at least minReadySeconds) the old DaemonSet pod
// on that node is marked deleted. If the old pod becomes unavailable for any
// reason (Ready transitions to false, is evicted, or is drained) an updated
// pod is immediately created on that node without considering surge limits.
// Allowing surge implies the possibility that the resources consumed by the
// daemonset on any given node can double if the readiness check fails, and
// so resource intensive daemonsets should take into account that they may
// cause evictions during disruption.
// +optional
MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge"`
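
The rounding rule in the field comments above can be illustrated with a small, standard-library-only sketch. The helper name resolveIntOrPercent and its string-based input are assumptions made for illustration; this is not the upstream intstr API, and it does not validate negative values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveIntOrPercent is a hypothetical helper (not the upstream API). It turns
// a value such as "5" or "30%" into an absolute pod count. Percentages are
// taken relative to the desired number of pods and rounded up.
func resolveIntOrPercent(value string, desired int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", value, err)
		}
		// Integer ceiling of pct*desired/100.
		return (pct*desired + 99) / 100, nil
	}
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("invalid count %q: %w", value, err)
	}
	return n, nil
}

func main() {
	for _, v := range []string{"30%", "10%", "5"} {
		n, err := resolveIntOrPercent(v, 7)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s with 7 desired pods resolves to %d\n", v, n)
	}
}
```

Because of the round-up, any non-zero percentage resolves to at least one pod whenever at least one pod is desired.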

Unlike Deployments, MaxSurge only considers nodes that have an available old pod and will instantly launch updated pods if no old available pod is detected on a node. An available pod is defined the same way as Deployments - the pod is not terminating, pod is Ready, and pod has been Ready for MinReadySeconds.
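
A minimal sketch of the availability rule described above, assuming a simplified pod model. The podState type and both function names are hypothetical and exist only for illustration; the real controller works on full Pod objects.

```go
package main

import "fmt"

// podState is a hypothetical, simplified view of a pod on one node.
type podState struct {
	Terminating     bool
	Ready           bool
	ReadyForSeconds int // how long the pod has been continuously Ready
}

// isAvailable mirrors the definition in the text: the pod is not terminating,
// is Ready, and has been Ready for at least minReadySeconds.
func isAvailable(p podState, minReadySeconds int) bool {
	if p.Terminating || !p.Ready {
		return false
	}
	return p.ReadyForSeconds >= minReadySeconds
}

// countsAgainstSurge reports whether creating an updated pod on a node must
// fit within the surge budget. Nodes without an available old pod get their
// updated pod immediately, so they do not consume the budget.
func countsAgainstSurge(oldPod *podState, minReadySeconds int) bool {
	return oldPod != nil && isAvailable(*oldPod, minReadySeconds)
}

func main() {
	old := podState{Ready: true, ReadyForSeconds: 30}
	fmt.Println("old pod available:", isAvailable(old, 10))
	fmt.Println("node with available old pod uses surge budget:", countsAgainstSurge(&old, 10))
	fmt.Println("node without old pod uses surge budget:", countsAgainstSurge(nil, 10))
}
```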

In the event a rollout cannot proceed because the MaxSurge limit has been reached (for any reason, such as scheduling failures or new pods not becoming ready), the controller should pause creating new pods until conditions change.

DaemonSet pods are slightly more constrained than Deployments when it comes to scheduling issues since each pod is tied to a single node, so it is worth describing exactly how surge pods that violate same node constraints would be handled consistent with Deployments. The most common conflict is use of HostPort within the pod spec across two versions, which would prevent the second pod from landing and the rollout from proceeding. An identical failure would occur with a Deployment of scale 4 on a 3 node cluster - the rollout would be prohibited because the fourth pod could not be scheduled, and so should be handled identically by this controller. It is user error to specify impossible scheduling constraints, and the correct way to convey that is via status conditions on the DaemonSet (which is a separate proposal).

In order to reduce confusion for new users, we will start by rejecting HostPort use in daemonset when MaxSurge is non-zero. A user will not be able to update a daemonset to MaxSurge != 0 if HostPort is set, or update a HostPort if MaxSurge is set, without receiving a validation error. If the MaxSurge feature gate is off, the validation rule is bypassed, and a user who turns off the gate, sets both fields, and then enables the gate will have failing pods but will be able to update their daemonset to either remove surge or remove the host port safely.
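
The validation rule described above could look roughly like the following sketch. The daemonSetSketch type, the error value, and validateSurge are hypothetical names; the real rule would live in the apps validation package and operate on the full DaemonSet object.

```go
package main

import (
	"errors"
	"fmt"
)

// daemonSetSketch is a hypothetical, flattened view of the fields that matter
// for this rule.
type daemonSetSketch struct {
	MaxSurge  int   // surge already resolved to an absolute number
	HostPorts []int // host ports declared by any container in the pod template
}

var errSurgeWithHostPort = errors.New("maxSurge must be 0 when the pod template declares a hostPort")

// validateSurge rejects the combination of a non-zero MaxSurge and any
// HostPort. When the feature gate is off the rule is bypassed, as the text
// describes.
func validateSurge(ds daemonSetSketch, gateEnabled bool) error {
	if !gateEnabled {
		return nil
	}
	if ds.MaxSurge != 0 && len(ds.HostPorts) > 0 {
		return errSurgeWithHostPort
	}
	return nil
}

func main() {
	ds := daemonSetSketch{MaxSurge: 1, HostPorts: []int{8080}}
	fmt.Println("gate on: ", validateSurge(ds, true))
	fmt.Println("gate off:", validateSurge(ds, false))
}
```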

A user who uses HostNetwork but does not declare HostPorts and attempts to use MaxSurge with processes that listen on the host network should see errors from the network stack when their process attempts to bind a port (such as cannot bind to address: port in use) and the new pod will crash and go into a crashloop. Users should expect to see these failures as they would any other “my application does not start on Kubernetes” error via pod status, daemonset status conditions, and pod logs.

Building a daemonset that hands off between two host level processes with any degree of coordination is an advanced topic and is up to the workload author. The simplest daemonsets may use pod network without any host level sharing and will benefit significantly from maxSurge during updates by reducing downtime at the cost of extra resources. As more complex sharing (host network, disk resources, unix domain sockets, configuration) is needed, the author is expected to leverage custom readiness probes, process start conditions, and process coordination mechanisms (like disks, networking, or shared memory) across pods. Debugging those interactions will be in the domain of the workload author.

Workload Implications

There are three main workload types that seek to minimize disruption:

Infrastructure that should be quickly replaced during update (CNI plugins, CSI plugins).
Infrastructure that wishes to hand off a node resource during an upgrade (socket, namespace, process)
Infrastructure that must remain 100% available to support workloads (networking components, proxies).

In general, all of these benefit from minimizing the time between old pod shutting down and new pod starting up. MaxSurge allows components to arbitrarily approach zero disruption by careful tuning of their launch scripts and access to shared resources, such as sockets or shared disk.

Infrastructure invoked by Kubernetes components (CRI, CNI, CSI) can usually fall within the first category and may require some coordination from the invoking process to minimize downtime. For instance, the Kubelet may retry certain types of CSI errors transparently to mitigate brief disruption to a CSI plugin. Or the container runtime may retry certain CNI errors if the plugin is not available.

The second category of workload requires some coordination between the old and new container - for instance, reusing a host volume and checking for file locking on shared resources, or using the SO_REUSEPORT option to start listening on an interface and share old and new traffic. In general the workload author is assumed to understand how to minimize disruption and Kubernetes is only giving them an overlapping window of execution before beginning the termination of the old process. The readiness probe should be used by the workload author to manage this transition as in other workload flows.

The last category is the most difficult to achieve and generally combines categories 1 and 2 along with careful tuning. Networking plugins that provide pod network capability may have one or more daemon processes that are desirable to deliver containerized, but any disruption to those critical pods may impact other workloads. In most cases, the capability to overlap execution provided by the MaxSurge is sufficient to allow those components to adapt to zero-downtime updates.

In the future, service topology will have implications for services implemented as daemonsets across all nodes. The update strategy for surge or drain will need to take into account topology, although the full details of that are outside the scope of this design. In general, service owners using daemonset surge will wish to maximize availability and minimize the risk of disruption during update.

Risks and Mitigations

The primary risk is a bug in the implementation of the controller that causes excessive pod creations or deletions, as we have experienced during previous enhancements to workload controllers. The best mitigation for that scenario is unit testing to ensure the update strategy is stable and general purpose stress e2e testing of the controller.

Because we are widening validation for MaxUnavailable, we must ensure that during an upgrade old apiservers can still handle that field. The alpha release of this field would have special logic that, if MaxSurge is set and dropped, a value of MaxUnavailable 0 would be set to 1 (the minimum allowed unavailable). The alpha controller would also special case this check when the gate was off. When a cluster was upgraded to beta with the gate on by default, the old controller and apiservers would treat MaxSurge != 0, MaxUnavailable == 0 as MaxSurge == 0, MaxUnavailable == 1 until they themselves were upgraded.
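
The fallback described above can be sketched as a small function. The name effectiveLimits and its integer parameters are assumptions for illustration; they stand in for values already resolved from the IntOrString fields.

```go
package main

import "fmt"

// effectiveLimits is a hypothetical helper showing how a component with the
// gate off would interpret the two fields: surge is ignored, and a
// MaxUnavailable of 0 is raised to 1, the minimum allowed without surge.
func effectiveLimits(gateEnabled bool, maxSurge, maxUnavailable int) (surge, unavailable int) {
	if !gateEnabled {
		maxSurge = 0
	}
	if maxSurge == 0 && maxUnavailable == 0 {
		maxUnavailable = 1
	}
	return maxSurge, maxUnavailable
}

func main() {
	s, u := effectiveLimits(false, 2, 0)
	fmt.Printf("gate off: surge=%d unavailable=%d\n", s, u)
	s, u = effectiveLimits(true, 2, 0)
	fmt.Printf("gate on:  surge=%d unavailable=%d\n", s, u)
}
```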

Design Details

Implications to drain

DaemonSets currently ignore unschedulable, but triggering a drain of a node and choosing to delete daemonsets would ensure that, if the old pod can be deleted, the daemonset controller immediately schedules a new pod onto that node when MaxSurge is in play (because of the invariant that there must be at least one pod). If the old pod delays deletion, then the new pod has a chance to accept handoff from the old pod exactly like a normal rolling surge update.

Test Plan

Unit tests covering the daemonset controller behavior in all major edge cases
E2E test for surge strategy that verifies expected recovery behavior and that the controller settles
- Testing should set up conflicting rules like HostPort and verify that surge fails and the correct daemonset condition is set and events are generated.
- A test should cover a pod going unready during rollout and verifying it is immediately replaced.

Prerequisite testing updates

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Unit tests

`k8s.io/kubernetes/pkg/apis/apps/validation`: `06/06/2022`: `90.6% of statements` `The tests added for the current feature in this package touch the daemonSet Spec field. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:387`: `06/06/2022`: `100.0% of statements`
`k8s.io/kubernetes/pkg/controller/daemon`: `06/06/2022`: `70.7% of statements` `The tests added for the current feature in this package touch the daemonSet update strategies. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/registry/apps/daemonset`: `06/06/2022`: `31.1% of statements` `The tests added for the current feature in this package make sure that Kubernetes version upgrades/downgrades won't have any impact on the new daemonSet API field when persisting to etcd. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/registry/apps/daemonset/strategy.go:129`: `06/06/2022`: `100.0% of statements`

Integration tests

A new integration test that exercises maxSurge when RollingUpdate is used as the update strategy will be added to the DaemonSet integration test suite.

e2e tests

An e2e test which exercises maxSurge when RollingUpdate is used as update strategy is added for daemonsets.

should surge pods onto nodes when spec was updated and update strategy is RollingUpdate: test grid

Graduation Criteria

This will be added as an alpha field enhancement to DaemonSets with a backward compatible default. After sufficient exposure this field would be promoted to beta, and then to GA in successive releases. The feature gate for this field will be DaemonSetUpdateSurge.

Alpha

Complete feature behind a featuregate
Have proper unit and e2e tests

Alpha -> Beta

Gather feedback from the community

Beta -> GA

At least one example of a user benefiting from this feature:

OpenShift has a few critical DaemonSets where maxSurge is beneficial

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: DaemonSetUpdateSurge
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

No
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, when the feature gate is disabled the field is ignored and can be cleared by an end user. A workload using this alpha feature would no longer be able to surge and would fall back to the default MaxUnavailable value (which is minimum 1).
What happens if we reenable the feature if it was previously rolled back?

The field would become active and whatever new values were present would cause the surge feature to become active. If the field name were changed old values would be lost and the controller would default to using maxUnavailable 1.

To clear the field from etcd, disable the gate and perform a no-op PUT on every daemonset.
Are there any tests for feature enablement/disablement?

A unit test will verify disablement ignores surge and behaves as MaxUnavailable=1

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

How can a rollout fail? Can it impact already running workloads? It shouldn’t impact already running workloads. This is an opt-in feature since users need to explicitly set the MaxSurge parameter in the DaemonSet spec’s RollingUpdate, i.e. the .spec.updateStrategy.rollingUpdate.maxSurge field. If the feature is disabled, the field is preserved if it was already set in the persisted DaemonSet object; otherwise it is silently dropped.
What specific metrics should inform a rollback? MaxSurge in DaemonSet doesn’t get respected and additional surge pods won’t be created. We consider the feature to be failing if enabling the featuregate and giving appropriate value to MaxSurge doesn’t cause additional surge pods to be created.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Manually tested. No issues were found when we enabled the feature gate -> disabled it -> re-enabled the feature gate. Upgrade -> downgrade -> upgrade scenario was tested manually.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? None

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

How can an operator determine if the feature is in use by workloads? By checking the DaemonSet’s .spec.updateStrategy.rollingUpdate.maxSurge field. The additional workload pods created should respect the value specified in the maxSurge field.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details: The number of pods that are created above the desired number of pods during an update when this feature is enabled can be compared to the maxSurge value in the DaemonSet definition. This can be used to determine the health of this feature. Existing metrics like kube_daemonset_status_number_available and kube_daemonset_status_number_unavailable can be used to track the additional pods created.
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? All the surge pods created should be within the value (% or number) of the maxSurge field provided 99.99% of the time. The additional pods created should ensure that the workload service is available 99.99% of the time during updates.
Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons why they weren’t added (e.g., cost, implementation difficulties, etc.).

Dependencies

This section must be completed when targeting beta graduation to a release.

Does this feature depend on any specific services running in the cluster? None. It is part of kube-controller-manager.

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

Will enabling / using this feature result in any new API calls?

No, the controller will perform roughly the same order of magnitude calls as for the normal strategy.
Will enabling / using this feature result in introducing new API types?

No.
Will enabling / using this feature result in any new calls to the cloud provider?

No.
Will enabling / using this feature result in increasing size or count of the existing API objects?

No, except for the explicit user chosen field on the daemonset.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs ?

No, only broken Daemonsets in surge configurations would fail to roll out. In both strategies, the readiness check gates the SLO of rollout.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No, the calculations for this controller change are of the same magnitude as the existing flow.

Troubleshooting

This section must be completed when targeting beta graduation to a release.

How does this feature react if the API server and/or etcd is unavailable? This feature will not work if the API server or etcd is unavailable, as the controller-manager won’t even be able to get events or updates for DaemonSets. If the API server and/or etcd becomes unavailable mid-rollout, the featuregate would not be enabled and the controller-manager wouldn’t start since it cannot communicate with the API server.
What are other known failure modes?
- MaxSurge not respected and too many pods are created
  - Detection: Looking at kube_daemonset_status_number_available and kube_daemonset_status_number_unavailable metrics.
  - Mitigations: Disable the DaemonSetUpdateSurge feature flag
  - Diagnostics: Controller-manager when starting at log-level 4 and above
  - Testing: Yes, e2e tests are already in place
- MaxSurge not respected and very few pods are created. This causes the workloads to not be available 99.99% of the time
  - Detection: Looking at kube_daemonset_status_number_available and kube_daemonset_status_number_unavailable metrics.
  - Mitigations: Disable the DaemonSetUpdateSurge feature flag
  - Diagnostics: Controller-manager when starting at log-level 4 and above
  - Testing: Yes, e2e tests are already in place
- maxUnavailable should be set to 0 even when maxSurge is configured
  - Detection: Looking at the .spec.updateStrategy.rollingUpdate.maxSurge and .spec.updateStrategy.rollingUpdate.maxUnavailable fields
  - Mitigations: Setting maxUnavailable to appropriate value
  - Diagnostics: Controller-manager when starting at log-level 4 and above
  - Testing: Yes, e2e tests are already in place
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

2021-02-09: Initial KEP merged
2021-03-05: Initial implementation merged
2021-04-30: Graduate the feature to Beta proposed
2022-05-10: Graduate the feature to stable proposed

Resources: Allow HostNetwork Pods to Use User Namespaces

KEP-5607: Allow HostNetwork Pods to Use User Namespaces

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes introducing a new feature gate to allow Pods to have both hostNetwork enabled and user namespaces enabled (by setting hostUsers: false).

Motivation

The primary motivation is to enhance the security of Kubernetes control plane components. Many control plane components, such as the kube-apiserver and kube-controller-manager, often run as static Pods and are configured with hostNetwork: true to bind to node ports or interact directly with the host’s network stack.

Currently, a validation rule in the kube-apiserver prevents the combination of hostNetwork: true and hostUsers: false. This KEP aims to remove that barrier.

Goals

Introduce a new, separate alpha feature gate: UserNamespacesHostNetworkSupport.
When this feature gate is enabled, modify the Pod validation logic to allow Pod specs where spec.hostNetwork is true and spec.hostUsers is false.

Non-Goals

Including this functionality as part of the UserNamespacesSupport feature gate. As UserNamespacesSupport is nearing GA, it would be unwise to add a new, unstable feature with external dependencies.

Proposal

We propose the introduction of a new feature gate named UserNamespacesHostNetworkSupport.

When this feature gate is disabled (the default state), the kube-apiserver will maintain the current validation behavior, rejecting any Pod spec that includes both spec.hostNetwork: true and spec.hostUsers: false.

When the UserNamespacesHostNetworkSupport feature gate is enabled, we will relax this validation check.

User Stories (Optional)

Story 1

As a cluster administrator, I want to enable user namespaces for my control plane static Pods (e.g., kube-apiserver, kube-controller-manager) to follow the principle of least privilege and reduce the attack surface. These Pods need to use hostNetwork to interact correctly with the cluster network. By enabling the new feature gate, I can add a critical layer of security isolation to these vital components without changing their networking model.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

If either the high-level container runtime or the underlying low-level container runtime does not support this feature, the container will fail to be created. To mitigate this issue, we will keep this feature in the alpha stage until both mainstream high-level container runtimes (containerd/CRI-O) and mainstream low-level container runtimes (runc/crun) support it, and only then promote it to beta.

Users might upgrade the container runtime to a newer version on some nodes first, but pods could still be scheduled onto nodes that do not support this feature. In such cases, users can leverage Node Declared Features to avoid this problem. Specifically, the new UserNamespacesHostNetwork field in CRI-API’s RuntimeFeatures will allow the kubelet to report whether the node supports this combination, enabling the scheduler to make informed placement decisions.

Design Details

The UserNamespacesHostNetworkSupport feature integrates with the NodeDeclaredFeatures framework to ensure that Pods requiring the combination of hostNetwork: true and hostUsers: false are only scheduled onto nodes that explicitly declare support for this feature. The feature relies on the UserNamespacesHostNetwork field in CRI-API’s RuntimeFeatures to determine whether the container runtime supports this combination.

Node Feature Declaration:

The kubelet will check the UserNamespacesHostNetwork field in CRI-API’s RuntimeFeatures to determine if the container runtime supports the UserNamespacesHostNetwork feature. If supported, the kubelet will declare the UserNamespacesHostNetwork feature in the node.status.declaredFeatures field. This ensures that the scheduler and other control plane components are aware of the node’s capabilities.

Pod Validation:

A parameter will be added to PodValidationOptions so that, if the UserNamespacesHostNetworkSupport feature gate is disabled but the Pod already uses the combination of hostNetwork: true and hostUsers: false, updates to that Pod are still allowed.
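
A minimal sketch of the gated validation described above, including the allowance for Pods that already use the combination. The podSpecSketch type and the function names are hypothetical; the real check would sit in the core validation package and be driven by PodValidationOptions.

```go
package main

import (
	"errors"
	"fmt"
)

// podSpecSketch is a hypothetical, reduced view of the two relevant fields.
type podSpecSketch struct {
	HostNetwork bool
	HostUsers   *bool // nil means the field is unset
}

// usesHostNetWithUserNS reports whether the spec combines hostNetwork: true
// with hostUsers: false.
func usesHostNetWithUserNS(s podSpecSketch) bool {
	return s.HostNetwork && s.HostUsers != nil && !*s.HostUsers
}

// validatePodSpec allows the combination when the feature gate is enabled, or
// when the existing object (oldSpec, nil on create) already used it.
func validatePodSpec(newSpec podSpecSketch, oldSpec *podSpecSketch, gateEnabled bool) error {
	if !usesHostNetWithUserNS(newSpec) {
		return nil
	}
	if gateEnabled || (oldSpec != nil && usesHostNetWithUserNS(*oldSpec)) {
		return nil
	}
	return errors.New("hostUsers: false is not allowed together with hostNetwork: true")
}

func main() {
	hostUsers := false
	spec := podSpecSketch{HostNetwork: true, HostUsers: &hostUsers}
	fmt.Println("create, gate off:", validatePodSpec(spec, nil, false))
	fmt.Println("create, gate on: ", validatePodSpec(spec, nil, true))
	fmt.Println("update, gate off:", validatePodSpec(spec, &spec, false))
}
```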

Scheduling:

The NodeDeclaredFeatures scheduler plugin will ensure that Pods requiring the UserNamespacesHostNetwork feature are only scheduled onto nodes that declare support for it. This is achieved by matching the Pod’s feature requirements against the node’s node.status.declaredFeatures.
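
The declaration and matching steps above can be sketched as follows. All function names and the map-based node model are assumptions for illustration; they are not the NodeDeclaredFeatures plugin API.

```go
package main

import (
	"fmt"
	"sort"
)

const featureName = "UserNamespacesHostNetwork"

// declaredFeatures models the kubelet side: the feature is declared only when
// the runtime reports support for it.
func declaredFeatures(runtimeSupports bool) []string {
	if runtimeSupports {
		return []string{featureName}
	}
	return nil
}

// nodeSupports reports whether every required feature is declared by the node.
func nodeSupports(required, declared []string) bool {
	have := make(map[string]struct{}, len(declared))
	for _, f := range declared {
		have[f] = struct{}{}
	}
	for _, f := range required {
		if _, ok := have[f]; !ok {
			return false
		}
	}
	return true
}

// filterNodes returns the names of nodes that can host a pod with the given
// feature requirements, sorted for stable output.
func filterNodes(required []string, nodes map[string][]string) []string {
	var fits []string
	for name, declared := range nodes {
		if nodeSupports(required, declared) {
			fits = append(fits, name)
		}
	}
	sort.Strings(fits)
	return fits
}

func main() {
	nodes := map[string][]string{
		"node-a": declaredFeatures(true),
		"node-b": declaredFeatures(false),
	}
	fmt.Println(filterNodes([]string{featureName}, nodes))
}
```

A pod with no feature requirements passes the filter on every node, so existing workloads are unaffected in this model.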

CRI Implementation

When using hostNetwork: true and hostUsers: false together, the container runtime needs to mount /sys using a bind mount instead of directly mounting sysfs. This is because directly mounting sysfs in this configuration will fail with insufficient permissions (EPERM).

The following mount options will be used to ensure security and proper functionality:

nosuid: Prevents privilege escalation through SUID binaries.
nodev: Prevents unauthorized access to hardware through device files.
noexec: Prevents execution of binary programs from the mounted filesystem.
rbind: Ensures that the directory is mounted along with all its sub-mount points.
rro: Ensures that the entire directory tree, including sub-mount points, is mounted as read-only.
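
As an illustration, the options above could be combined into a single mount entry like the one below. The mountSketch type is a hypothetical stand-in whose fields are modeled on the mount entry of the OCI runtime specification; how each container runtime actually builds this entry is an assumption here.

```go
package main

import "fmt"

// mountSketch is a hypothetical type with fields modeled on an OCI runtime
// spec mount entry.
type mountSketch struct {
	Destination string
	Type        string
	Source      string
	Options     []string
}

// sysBindMount returns a recursive, read-only bind mount of the host's /sys
// with the options listed in the text.
func sysBindMount() mountSketch {
	return mountSketch{
		Destination: "/sys",
		Type:        "bind",
		Source:      "/sys",
		Options:     []string{"rbind", "nosuid", "nodev", "noexec", "rro"},
	}
}

func main() {
	m := sysBindMount()
	fmt.Printf("%s -> %s (%s) options=%v\n", m.Source, m.Destination, m.Type, m.Options)
}
```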

Test Plan

[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/apis/core/validation: 2025-10-03 - 85.1%

Integration tests

e2e tests

Add e2e tests to ensure that pods with the combination of hostNetwork: true and hostUsers: false can run properly.

Graduation Criteria

Alpha

The UserNamespacesHostNetworkSupport feature gate is implemented and disabled by default.
Add an implementation that integrates with the NodeDeclaredFeatures feature gate.

Beta

Mainstream container runtimes and low-level container runtimes (e.g., containerd/CRI-O, runc/crun) have released generally available versions that support the concurrent use of hostNetwork and user namespaces.
Add e2e tests to ensure feature availability.
Document the limitations of combining user namespaces and hostNetwork (e.g., CAP_NET_RAW, CAP_NET_ADMIN, CAP_NET_BIND_SERVICE remain restricted).

GA

The feature has been stable in Beta for at least 2 Kubernetes releases.
Multiple major container runtimes support the feature.

Upgrade / Downgrade Strategy

Upgrade: After upgrading to a version that supports this KEP, the UserNamespacesHostNetworkSupport feature gate can be enabled at any time.

Downgrade: If downgraded to a version that does not support this KEP, kube-apiserver will revert to strict validation. Pods that were already running in this configuration will continue to run with this configuration. If the feature needs to be fully disabled, all pods using that configuration should be manually purged.

Version Skew Strategy

When the NodeDeclaredFeatures feature gate is enabled on the control plane but not on an older Kubelet:

If the control plane is upgraded to a version that supports the UserNamespacesHostNetworkSupport feature, it will correctly identify older nodes as incompatible. The scheduler will filter these nodes, causing Pods with the feature requirement to remain in the Pending state until compatible nodes are available.
For API validation, operations will be rejected if the target Pod resides on an older node that lacks the necessary feature.
This strict filtering is reliable because the NodeDeclaredFeatures framework is scoped to new features only. This prevents ambiguous situations where a feature might be present on a node but is not being reported because the node is too old. The absence of a declared feature is a defini

When the NodeDeclaredFeatures feature gate is disabled on the control plane but enabled on the Kubelet:

A newer kube-apiserver with the UserNamespacesHostNetworkSupport feature enabled will accept a Pod with hostNetwork: true and hostUsers: false.
An older kubelet will still get the Pod definition from the kube-apiserver. It will attempt to create the Pod. If the container runtime version is too old and doesn’t support this combination, the Pod will be stuck in the ContainerCreating state.
To mitigate scheduling issues in mixed-version clusters, the kubelet will use the UserNamespacesHostNetwork field from CRI-API’s RuntimeFeatures to report node capabilities via Node Declared Features. This allows the scheduler to avoid placing Pods requiring this combination on nodes that do not support it, even in version-skew scenarios.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: UserNamespacesHostNetworkSupport
- Components depending on the feature gate: kube-apiserver
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

No. The behavior only changes when a user explicitly sets both hostNetwork: true and hostUsers: false in a Pod spec. The behavior of all existing Pods is unaffected.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. It can be disabled by setting the feature gate to false and restarting the kube-apiserver. This restores the old validation logic. When disabled, Pods that were running in this mode have to be manually purged. Otherwise, they will continue to run in that mode (hostNetwork: true, hostUsers: false) even though it’s technically disabled.

What happens if we reenable the feature if it was previously rolled back?

The kube-apiserver will once again begin to accept the combination of hostNetwork: true and hostUsers: false. This is a stateless change, and reenabling is safe.

Are there any tests for feature enablement/disablement?

During the alpha stage, unit tests for enabling and disabling the toggle functionality will be added to the validation code. Manual testing will also be conducted during the beta stage, and the testing process will be documented here.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The Version Skew Strategy section covers this point.

What specific metrics should inform a rollback?

If a pod is stuck in the ContainerCreating state and returns events similar to the following, it indicates that the container runtime does not yet support this combination, and we should roll back this feature:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "0db019a96c2a28eaacb0d8a795bbbc48c8a3823d9b8e5099948f1d99e826238d": failed to generate sandbox container spec: failed to pin user namespace: failed to open netns(): open : no such file or directory

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This will be validated via manual testing.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No impact to the running workloads

What are other known failure modes?

Detection: Please refer to the content in the “What specific metrics should inform a rollback?” section.
Mitigations: Users should roll back this feature and discontinue using the combination of hostNetwork: true and hostUsers: false.
Diagnostics: If the following log appears in kubelet, it indicates an abnormality with this feature, and users need to roll back. The default log level is sufficient to obtain the following log:

 E1014 20:30:39.550653 2823108 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"data-writer-pod_default(7607f6d7-91e1-4dbd-b957-c0d7b101de2e)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"data-writer-pod_default(7607f6d7-91e1-4dbd-b957-c0d7b101de2e)\\\": rpc error: code = Unknown desc = failed to start sandbox \\\"f47b48e3c415105d25fb316cf224c0e57b146b340c09d6847b2dfcf3b49c923c\\\": failed to generate sandbox container spec: failed to pin user namespace: failed to open netns(): open : no such file or directory\"" pod="default/data-writer-pod" podUID="7607f6d7-91e1-4dbd-b957-c0d7b101de2e"

Testing: Failure mode tests have been run locally. We cannot add this test to the e2e test suite because once container runtime support is introduced, it will exit the failure mode, causing the test to fail.

➜ opt kubectl get pods
NAME READY STATUS RESTARTS AGE
data-writer-pod 0/1 ContainerCreating 0 8m4s
➜ opt kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
8m37s Normal Starting node/127.0.0.1
8m37s Normal RegisteredNode node/127.0.0.1 Node 127.0.0.1 event: Registered Node 127.0.0.1 in Controller
8m7s Normal Scheduled pod/data-writer-pod Successfully assigned default/data-writer-pod to 127.0.0.1
8m7s Warning FailedCreatePodSandBox pod/data-writer-pod Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox "f47b48e3c415105d25fb316cf224c0e57b146b340c09d6847b2dfcf3b49c923c": failed to generate sandbox container spec: failed to pin user namespace: failed to open netns(): open : no such file or directory

What steps should be taken if SLOs are not being met to determine the problem?

N/A

Implementation History

2025-10-03: Initial proposal
2025-12-18: Add implementation content for v1.36

Drawbacks

There are no known drawbacks at this time.

Alternatives

Add this feature to the existing UserNamespacesSupport feature gate:

This was ruled out because the UserNamespacesSupport feature is approaching GA, and its functionality should be stable. Adding a new, externally-dependent, and immature behavior to a nearly-GA feature would introduce unnecessary risk and delays. Keeping the two feature gates separate is cleaner and safer.

Do not implement this feature:

Users can use hostPort as an alternative to hostNetwork, but this may cause some disruption to the existing user environment, as certain privileged containers require direct interaction with the host network stack. Moreover, hostPort requires pre-configured CNI; otherwise, the pod will fail to start. This limitation is precisely why Kubernetes control plane components continue to rely on hostNetwork.

Infrastructure Needed (Optional)

No new infrastructure needed.

Resources: Allow informers for getting a stream of data instead of chunking

Mon, 01 Jan 0001 00:00:00 +0000

KEP-3157: allow informers for getting a stream of data instead of chunking.

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Appendix
- Sources of LIST request
- Steps followed by informers
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The kube-apiserver is vulnerable to memory explosion. The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption. Uncontrolled and unbounded memory consumption of the servers does not only affect clusters that operate in an HA mode but also other programs that share the same machine. In this KEP we propose a solution to this issue.

Motivation

Today informers are the primary source of LIST requests. The LIST is used to get a consistent snapshot of data to build up a client-side in-memory cache. The primary issue with LIST requests is unpredictable memory consumption. The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects. See the Appendix section for more details on potential sources of LIST requests and their impact on memory.

In extreme cases, the server can allocate hundreds of megabytes per request. To better visualize the issue let’s consider the above graph. It shows the memory usage of an API server during a test (see manual test section for more details). We can see that increasing the number of informers drastically increases the memory consumption of the server. Moreover, around 16:40 we lost the server after running 16 informers.

During an investigation, we realized that the server allocates a lot of memory for handling LIST requests. In short, it needs to bring data from the database, unmarshal it, do some conversions and prepare the final response for the client. The bottom line is around O(5*the_response_from_etcd) of temporary memory consumption. Neither priority and fairness nor Golang garbage collection is able to protect the system from exhausting memory.

A situation like that is dangerous in two ways. First, as we saw, it could slow down, if not fully stop, an API server that has received the requests. Second, a sudden and uncontrolled spike in memory consumption will likely put pressure on the node itself. This might lead to thrashing, starving, and finally losing other processes running on the same node, including kubelet. Stopping kubelet has serious consequences, as it leads to workload disruption and a much bigger blast radius. Note that in that scenario even clusters in an HA setup are affected.

Worse, in rare cases (see the Appendix section for more), recovery of a large cluster, which has many kubelets and hence many informers for pods, secrets, and configmaps, can lead to a very expensive storm of LISTs.

Goals

protect kube-apiserver and its node against list-based OOM attacks
considerably reduce (temporary) memory footprint of LISTs, down from O(watchers*page-size*object-size*5) to O(watchers*constant), constant around 2 MB.

Example:

512 watches of 400mb data: 512*500*2MB*5=2.5TB ↘ 2 GB

racing with Golang GC to free this temporary memory before being OOM’ed.

reduce etcd load by serving from watch cache
get a replacement for paginated lists from watch-cache, which is not feasible without major investment
enforce consistency in the sense of freshness of the returned list
be backward compatible with new client -> old server
fix the long-standing “stale reads from the cache” issue, https://github.com/kubernetes/kubernetes/issues/59848

Non-Goals

get rid of list or list pagination
rewrite the list storage stack to allow streaming, but rather use the existing streaming infrastructure (watches).

Proposal

In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use streaming from the watch-cache instead of paging from etcd. Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see Appendix section for more details on how informers operate today). The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs. This would allow us to keep memory allocations constant. The server is bounded by the maximum allowed size of an object of 1.5 MB in etcd (note that the same object in memory can be much bigger, even by an order of magnitude) plus a few additional allocations, that will be explained later in this document. The rough idea/plan is as follows:

step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
step 2: upon receiving the request from an informer, compute the RV at which the result should be returned (possibly contacting etcd if consistent read was requested). It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will meet the consistency requirements of the request.
step 2a: wait until watch catches up with the computed RV
step 2b: send all objects currently stored in memory for the given resource type.
step 2c: send a bookmark event to the informer with the given RV.
step 3: listen for further events using the request from step 1.

Note: kube-apiserver already follows the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) for RV=“0” watches. The mode is not used in informers today but is supported by every kube-apiserver for legacy compatibility reasons. A watch started with RV=“0” may return stale data. It is possible for the watch to start at a much older resource version than the client has previously observed, particularly in high availability configurations, due to partitions or stale caches.

Note 2: informers need consistent lists to avoid time travel when initializing after a restart, for example when switching to another HA instance of kube-apiserver with an outdated/lagging watch cache. See the following issue for more details.

Risks and Mitigations

Design Details

Required changes for a WATCH request with the SendInitialEvents=true

The following sequence diagram depicts the steps needed to complete the proposed feature. A high-level overview of each step is provided in the table that immediately follows the diagram, and a detailed description of each required step is given further down in this section.

Step	Description
1.	The reflector establishes a WATCH request with the watch cache.
2.	If needed, the watch cache contacts etcd for the most up-to-date ResourceVersion.
2a.	The watch cache waits until it has observed the requested ResourceVersion.
2b.	The watch cache streams all the contents from its in-memory store.
2c.	After sending all the objects it sends a bookmark event with the given RV to the reflector.
3.	The reflector replaces its internal store with collected items, updates its internal resourceVersion to the one obtained from the bookmark event.
3a.	The reflector uses the WATCH request from step 1 for further progress notifications.

Step 1: On initialization the reflector gets a snapshot of data from the server by passing RV=”” (= unset value) to ensure freshness and setting resourceVersionMatch=NotOlderThan and sendInitialEvents=true. We do that only during the initial ListAndWatch call. Each event (ADD, UPDATE, DELETE) except the BOOKMARK event received from the server is collected. Passing resourceVersion="" tells the cacher it has to guarantee that the cache is at least as up to date as a LIST executed at the same time.

Note: This ensures that returned data is consistent, served from etcd via a quorum read and prevents “going back in time”.

Note 2: Watch cache currently doesn’t have the feature of supporting resourceVersion="" and thus is vulnerable to stale reads, see https://github.com/kubernetes/kubernetes/issues/59848 for more details.

Step 2: Right after receiving a request from the reflector, the cacher gets the current resourceVersion (aka bookmarkAfterResourceVersion) directly from etcd. It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data. There are ways to do that cheaply, e.g. we could issue a count request against the datastore. Next, the cacher creates a new cacheWatcher (implements watch.Interface), passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache. After sending initial data the cacheWatcher starts listening on an input channel for new events, including a bookmark event. At some point, the cacher will receive an event with a resourceVersion equal to or greater than the bookmarkAfterResourceVersion. It will be propagated to the cacheWatcher and then back to the reflector as a BOOKMARK event.

Step 2a: Where does the initial data come from?

During construction, the cacher creates the reflector and the watchCache. Since the watchCache implements the Store interface it is used by the reflector to store all data it has received from etcd.

Step 2b: What happens when new events are received while the cacheWatcher is sending initial data?

The cacher maintains a list of all current watchers (cacheWatcher) and a separate goroutine (dispatchEvents) for delivering new events to the watchers. New events are added via the cacheWatcher.nonblockingAdd method that adds an event to the cacheWatcher.input channel. The cacheWatcher.input is a buffered channel and has a different size for different Resources (10 or 1000). Since the cacheWatcher starts processing the cacheWatcher.input channel only after sending all initial events it might block once its buffered channel tips over. In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event. All watchers that have failed to deliver the event will be closed.
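The delivery scheme described above can be sketched in a few lines of Go. This is a minimal illustration and not the cacher's actual code: the `event` and `watcher` types, the `addWithTimeout` helper, and the `secondChance` parameter are hypothetical names introduced here. Only the non-blocking first attempt, the second chance for blocked watchers, and the closing of watchers that still cannot accept the event come from the text.

```go
package main

import (
	"fmt"
	"time"
)

// event is a hypothetical stand-in for a watch event; only the RV matters here.
type event struct{ rv uint64 }

// watcher is a simplified, hypothetical model of a cacheWatcher.
type watcher struct {
	input  chan event // buffered channel, like cacheWatcher.input
	closed bool
}

// nonblockingAdd tries to enqueue without blocking and reports success.
func (w *watcher) nonblockingAdd(e event) bool {
	select {
	case w.input <- e:
		return true
	default:
		return false
	}
}

// addWithTimeout gives a blocked watcher one more chance, bounded by d.
func (w *watcher) addWithTimeout(e event, d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case w.input <- e:
		return true
	case <-timer.C:
		return false
	}
}

// dispatch delivers e to all watchers: first without blocking, then with a
// bounded second chance for the blocked ones. Watchers that still cannot
// accept the event are closed.
func dispatch(watchers []*watcher, e event, secondChance time.Duration) {
	var blocked []*watcher
	for _, w := range watchers {
		if w.closed {
			continue
		}
		if !w.nonblockingAdd(e) {
			blocked = append(blocked, w)
		}
	}
	for _, w := range blocked {
		if !w.addWithTimeout(e, secondChance) {
			w.closed = true
			close(w.input)
		}
	}
}

func main() {
	fast := &watcher{input: make(chan event, 10)}
	slow := &watcher{input: make(chan event, 1)}
	ws := []*watcher{fast, slow}
	dispatch(ws, event{rv: 1}, 10*time.Millisecond)
	dispatch(ws, event{rv: 2}, 10*time.Millisecond) // slow's buffer is already full
	fmt.Println(fast.closed, slow.closed)           // false true
}
```

In the sketch nobody drains the channels, which models a watcher that is still busy sending its initial events.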

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client. We expect this to behave well even in heavily loaded clusters. To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV, what’s the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we have further options if they turn out to be necessary:

comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on and waiting until the difference between the RVs is < 1000 (the buffer size). We could do that even before sending the initial events. If the difference is greater than that, there seems to be no need to go on, since the buffer could fill up before we receive an event with the expected RV (assuming all updates were for the resource the watch request was opened for, which seems unlikely). If the watchCache is unable to catch up to the bookmarkAfterResourceVersion within some timeout, hard-close the current connection (end it by tearing down the TCP connection with the client) so that the client reconnects to a different API server with the most up-to-date cache. Taking into account the baseline etcd performance numbers, waiting for 10 seconds will allow us to receive ~5K events, assuming ~500 QPS throughput (see https://etcd.io/docs/v3.4/op-guide/performance/ ). Once we are past this step (we know the difference is smaller) and the buffer fills up we:
- case-1: won’t close the connection immediately if the bookmark event with the expected RV exists in the buffer. In that case, we will deliver the initial events, any other events we have received whose RVs are <= bookmarkAfterResourceVersion, and finally the bookmark event, and only then will we soft-close the current connection (simply end it without tearing down the TCP connection). An informer will reconnect with the RV from the bookmark event. Note that any new event received was ignored since the buffer was full.
- case-2: soft-close the connection if the bookmark event with the expected RV for some reason doesn’t exist in the buffer. An informer will reconnect arriving at the step that compares the RVs first.
make the buffer dynamic, especially when the difference between RVs is greater than 1000
inject new events directly into the initial list, i.e. have the initial list loop consume the channel directly and avoid waiting for the whole initial list to be processed first
cap the size (cannot allocate more than X MB of memory) of the buffer
maybe even apply some compression techniques to the buffer (for example by only storing a low-memory shallow reference and taking the actual objects for the event from the store)
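The first beta option is essentially a small decision procedure, which the following Go sketch tries to make explicit. It is only one possible reading of the text: the `action` values and the `admit` and `onBufferFull` functions are hypothetical names, and the real implementation may structure the checks differently.

```go
package main

import "fmt"

// action names are hypothetical labels for the outcomes described in the text.
type action string

const (
	proceed            action = "proceed"
	wait               action = "wait"
	hardClose          action = "hard-close"
	softClose          action = "soft-close"
	drainThenSoftClose action = "drain-then-soft-close"
)

// admit decides what to do before sending initial events. bufferSize is the
// capacity of the watcher's input channel (1000 in the text above).
func admit(bookmarkAfterRV, cacheRV, bufferSize uint64, timedOut bool) action {
	if cacheRV >= bookmarkAfterRV || bookmarkAfterRV-cacheRV < bufferSize {
		return proceed
	}
	if timedOut {
		// The cache could not catch up in time: tear down the TCP connection
		// so that the client reconnects to a different API server.
		return hardClose
	}
	return wait
}

// onBufferFull covers case-1 and case-2: if the expected bookmark is already
// buffered, everything up to it is delivered before the soft-close.
func onBufferFull(bookmarkInBuffer bool) action {
	if bookmarkInBuffer {
		return drainThenSoftClose
	}
	return softClose
}

func main() {
	fmt.Println(admit(5000, 4500, 1000, false)) // proceed
	fmt.Println(admit(5000, 1000, 1000, false)) // wait
	fmt.Println(admit(5000, 1000, 1000, true))  // hard-close
	fmt.Println(onBufferFull(true))             // drain-then-soft-close
}
```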

Note: The RV is effectively a global counter that is incremented every time an object is updated. This imposes a global order of events. It is equivalent to a LIST followed by a WATCH request.

Note 2: Currently, there is a timeout for LIST requests of 60s. That means a slow reflector might fail synchronization as well and would have to re-establish the connection.

Step 2c: How are bookmarks delivered to the cacheWatcher?

First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously even without regular events happening. There are two sources of resourceVersions. The first one is regular events that contain RVs besides objects. The second one is a special type of etcd event called progressNotification delivering the most up-to-date revision with the given interval only to the kube-apiserver. As already mentioned in 2a the watchCache is driven by the reflector. Every event will be eventually propagated from the watchCache to the cacher.processEvent method. For simplicity, we can assume that the processEvent method will simply update the resourceVersion maintained by the cacher.

At regular intervals, the cacher checks expired watchers and tries to deliver a bookmark event. As of today, the interval is set to 1 second. The bookmark event contains an empty object and the current resourceVersion. By default, a cacheWatcher expires roughly every 1 minute.

The expiry interval initially will be decreased to 1 second in this feature’s code-path. This helps us deliver a bookmark event that is >= bookmarkAfterResourceVersion much faster. After that, the interval will be put back to the previous value.

Note: Since we get a notification every 5 seconds from etcd and we try to deliver a bookmark every 1 second, the maximum delay a reflector will have to wait after receiving initial data seems to be 6 seconds (assuming a small dataset). This might be unlikely in practice since we might get bookmarkAfterResourceVersion even before handling initial data. Sending the data itself also takes some time.

Step 3: After receiving a BOOKMARK event the reflector is considered to be synchronized. It replaces its internal store with the collected items (syncWith) and reuses the current connection for getting further events.
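From the client's point of view, Steps 1 to 3 amount to collecting events until the end-of-initial-events bookmark arrives. The Go sketch below illustrates that loop under simplifying assumptions: `object`, `watchEvent`, and `watchList` are hypothetical stand-ins rather than client-go types, and the annotation key is the one named in the API changes section of this document.

```go
package main

import "fmt"

// The types below are simplified, hypothetical stand-ins for watch events
// and objects; they are not the client-go types.
type object struct {
	name            string
	resourceVersion string
	annotations     map[string]string
}

type watchEvent struct {
	typ string // "ADDED", "MODIFIED", "DELETED" or "BOOKMARK"
	obj object
}

const initialEventsEnd = "k8s.io/initial-events-end"

// watchList consumes events until the bookmark that ends the initial stream
// and returns the collected snapshot plus the RV to continue watching from.
func watchList(events <-chan watchEvent) (map[string]object, string, error) {
	snapshot := map[string]object{}
	for ev := range events {
		switch ev.typ {
		case "ADDED", "MODIFIED":
			snapshot[ev.obj.name] = ev.obj
		case "DELETED":
			delete(snapshot, ev.obj.name)
		case "BOOKMARK":
			if ev.obj.annotations[initialEventsEnd] == "true" {
				return snapshot, ev.obj.resourceVersion, nil
			}
		}
	}
	return nil, "", fmt.Errorf("stream closed before the initial-events-end bookmark")
}

func main() {
	ch := make(chan watchEvent, 3)
	ch <- watchEvent{"ADDED", object{name: "foo", resourceVersion: "8467"}}
	ch <- watchEvent{"ADDED", object{name: "bar", resourceVersion: "5726"}}
	ch <- watchEvent{"BOOKMARK", object{resourceVersion: "13519",
		annotations: map[string]string{initialEventsEnd: "true"}}}
	close(ch)
	items, rv, err := watchList(ch)
	fmt.Println(len(items), rv, err) // 2 13519 <nil>
}
```

A real reflector would keep reading from the same stream after the bookmark instead of returning, which is what allows the connection from Step 1 to be reused.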

API changes

Extend the ListOptions struct with the following field:

type ListOptions struct {
	...
	// `sendInitialEvents=true` may be set together with `watch=true`.
	// In that case, the watch stream will begin with synthetic events to
	// produce the current state of objects in the collection. Once all such
	// events have been sent, a synthetic "Bookmark" event will be sent.
	// The bookmark will report the ResourceVersion (RV) corresponding to the
	// set of objects, and be marked with `"k8s.io/initial-events-end": "true"` annotation.
	// Afterwards, the watch stream will proceed as usual, sending watch events
	// corresponding to changes (subsequent to the RV) to objects watched.
	//
	// When `sendInitialEvents` option is set, we require `resourceVersionMatch`
	// option to also be set. The semantic of the watch request is as following:
	// - `resourceVersionMatch` = NotOlderThan
	//   is interpreted as "data at least as new as the provided `resourceVersion`"
	//   and the bookmark event is sent when the state is synced
	//   to a `resourceVersion` at least as fresh as the one provided by the ListOptions.
	//   If `resourceVersion` is unset, this is interpreted as "consistent read" and the
	//   bookmark event is sent when the state is synced at least to the moment
	//   when request started being processed.
	// - `resourceVersionMatch` set to any other value or unset
	//   Invalid error is returned.
	//
	// Defaults to true if `resourceVersion=""` or `resourceVersion="0"` (for backward
	// compatibility reasons) and to false otherwise.
	SendInitialEvents *bool
}

The watch bookmark marking the end of initial events stream will have a dedicated annotation:

"k8s.io/initial-events-end": "true"

(the exact name is subject to change during API review). It will allow clients to precisely figure out when the initial stream of events is finished.

It’s worth noting that explicitly setting SendInitialEvents to false with ResourceVersion=“0” will result in not sending initial events, which makes the option work exactly the same across every potential resource version passed as a parameter.
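The defaulting and validation rules stated in the field comment can be expressed as two small functions. The Go sketch below is an illustration of one reading of those rules, not the apiserver's validation code: `listOptions`, `effectiveSendInitialEvents`, and `validate` are hypothetical names, and the `watch=true` requirement is inferred from the first sentence of the comment.

```go
package main

import (
	"errors"
	"fmt"
)

// listOptions is a trimmed, hypothetical copy of the fields discussed above.
type listOptions struct {
	Watch                bool
	ResourceVersion      string
	ResourceVersionMatch string
	SendInitialEvents    *bool
}

// effectiveSendInitialEvents applies the documented default: true for
// resourceVersion "" or "0", false otherwise, unless set explicitly.
func effectiveSendInitialEvents(o listOptions) bool {
	if o.SendInitialEvents != nil {
		return *o.SendInitialEvents
	}
	return o.ResourceVersion == "" || o.ResourceVersion == "0"
}

// validate rejects the combinations the field comment describes as invalid.
func validate(o listOptions) error {
	if o.SendInitialEvents == nil {
		return nil
	}
	if !o.Watch {
		return errors.New("sendInitialEvents requires watch=true")
	}
	if o.ResourceVersionMatch != "NotOlderThan" {
		return errors.New("sendInitialEvents requires resourceVersionMatch=NotOlderThan")
	}
	return nil
}

func main() {
	yes, no := true, false
	fmt.Println(effectiveSendInitialEvents(listOptions{ResourceVersion: "0"}))   // true
	fmt.Println(effectiveSendInitialEvents(listOptions{ResourceVersion: "123"})) // false
	// An explicit false wins over the RV="0" default.
	fmt.Println(effectiveSendInitialEvents(listOptions{ResourceVersion: "0", SendInitialEvents: &no})) // false
	fmt.Println(validate(listOptions{Watch: true, SendInitialEvents: &yes}) != nil)                    // true
}
```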

Important optimisations

Avoid DeepCopying of initial data

The watchCache has an important optimization of wrapping objects into a cachingObject. Given that objects aren’t usually modified (since selfLink has been disabled) and that there might be multiple watchers interested in receiving an event, wrapping allows us to serialize an object only once. The watchCache maintains two internal data structures. The first one is called the store and is driven by the reflector. It essentially mirrors the content stored in etcd. It is used to serve LIST requests. The second one is called the cache, which represents a sliding window of recent events received from the reflector. It is effectively used to serve WATCH requests from a given RV.

By design cachingObjects are stored only in the cache. As described in Step 2, the cacheWatcher gets initial data from the watchCache. The latter, in turn, gets data straight from the store. That means initial data is not wrapped into cachingObject and hence not subject to this existing optimization.

Before sending objects any further the cacheWatcher does a DeepCopy of every object that has not been wrapped into the cachingObject. Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.

Reduce the number of allocations in the WatchServer

The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients. It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image. Two of them deserve special attention, namely allocations 1 and 3, because they won’t reuse memory and rely on the GC for cleanup. In other words, the more events we need to send, the more (temporary) memory will be used. In contrast, the other two allocations are already optimized as they reuse memory instead of creating new buffers for every single event. For better utilization, a similar technique of reusing memory could be used to save precious RAM and scale the system even further.
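One common way to reuse memory across events in Go is a `sync.Pool` of buffers. The sketch below only illustrates that general technique: the source does not say which mechanism the WatchServer uses or should use, and `bufPool`, `writeEvent`, and `encodeAll` are hypothetical names. JSON encoding is used purely for brevity.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

// bufPool hands out reusable buffers so that encoding an event does not
// need a fresh buffer each time.
var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

// writeEvent encodes ev into a pooled buffer and copies the result to w.
// The buffer goes back to the pool when the function returns.
func writeEvent(w io.Writer, ev interface{}) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	if err := json.NewEncoder(buf).Encode(ev); err != nil {
		return err
	}
	_, err := w.Write(buf.Bytes())
	return err
}

// encodeAll is a small helper for the usage example below.
func encodeAll(events ...interface{}) (string, error) {
	var out bytes.Buffer
	for _, ev := range events {
		if err := writeEvent(&out, ev); err != nil {
			return "", err
		}
	}
	return out.String(), nil
}

func main() {
	s, err := encodeAll(
		map[string]string{"type": "ADDED"},
		map[string]string{"type": "BOOKMARK"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Print(s)
}
```

The encoder itself is still created per call in this sketch; a fuller version would reuse it as well.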

Manual testing without the changes in place

For the past few years, we have seen many clusters suffering from the issue. Sadly, our only possible recommendation was to ask customers to reduce the size of the cluster, since adding more memory would in most cases not fix the issue. Recall from the motivation section that just a few requests can allocate gigabytes of data in a fraction of a second.

In order to reproduce the issue, we executed the following manual test, which is the simplest and cheapest way of putting yourself into customers’ shoes. The reproducer creates a namespace with 400 secrets, each containing 1 MB of data. Next, it uses informers to get all secrets in the cluster. The rough estimate is that a single informer will have to bring at least 400MB from the datastore to get all secrets.

The result: 16 informers were able to take down the test cluster.

Results with WATCH-LIST

We have prepared the following PR https://github.com/kubernetes/kubernetes/pull/106477 which is almost identical to the proposed solution. It just differs in a few details. The following image depicts the results we obtained after running the synthetic test described in 4. First of all, it is worth mentioning that the PR was deployed onto the same cluster so that we could ensure an identical setup (CPU, Memory) between the tests. The graph tells us a few things.

Firstly, the proposed solution is at least 100 times better than the current state. Around 12:05 we started the test with 1024 informers, all eventually synced without any errors. Moreover during that time the server was stable and responsive. That particular test ended around 12:30. That means it needed ~25 minutes to bring ~400 GB of data across the network! Impressive achievement.

Secondly, it tells us that memory allocation is not proportional yet to the number of informers! Given the size of individual objects of 1MB and the actual number of informers, we should allocate roughly around 2GB of RAM. We managed to get and analyze the memory profile that showed a few additional allocations inside the watch server. At this point, it is worth mentioning that the results were achieved with only the first optimization applied. We expect the system will scale even better with the second optimization as it will put significantly less pressure on the GC.
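The figures quoted in this section can be checked with a little arithmetic. The Go sketch below takes the numbers from the text (1024 informers, a 400 MB dataset, roughly 25 minutes) and assumes the per-watcher constant of about 2 MB mentioned in the Goals section; the function names are made up for this example.

```go
package main

import "fmt"

const (
	informers    = 1024    // informers started in the test
	datasetMB    = 400     // 400 secrets of 1 MB each
	perWatcherMB = 2       // assumed per-watcher constant (Goals section)
	testSeconds  = 25 * 60 // the test ran for roughly 25 minutes
)

// transferredGB is the total amount of data sent to all informers.
func transferredGB() int { return informers * datasetMB / 1024 }

// expectedRAMGB is the memory we would expect with O(watchers*constant).
func expectedRAMGB() int { return informers * perWatcherMB / 1024 }

// throughputMBps is the average rate over the duration of the test.
func throughputMBps() int { return informers * datasetMB / testSeconds }

func main() {
	fmt.Println(transferredGB(), expectedRAMGB(), throughputMBps()) // 400 2 273
}
```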

Required changes for a WATCH request with the RV set to the last observed value (RV > 0)

In that case, no additional changes are required. We stick to existing semantics. That is we start a watch at an exact resource version. The watch events are for all changes after the provided resource version. This is safe because the client is assumed to already have the initial state at the starting resource version since the client provided the resource version.

Provide a fix for the long-standing issue https://github.com/kubernetes/kubernetes/issues/59848

The issue is still open mainly because informers default to resourceVersion=“0” for their initial LIST requests. This is problematic because the initial LIST requests served from the watch cache might return data that are arbitrarily delayed. This in turn could make clients connected to that server read old data and undo recent work that has been done.

To make consistent reads from cache for LIST requests and thus prevent “going back in time” we propose to use the same technique for ensuring the cache is not stale as described in the previous section.

In that case we are going to change informers to pass “resourceVersion=“0” and resourceVersionMatch=MostRecent” for their initial LIST requests. Then on the server side we:

get the current revision from etcd.
use the existing waitUntilFreshAndBlock function to wait for the watch to catch up to the revision requested in the previous step.
reject the request if waitUntilFreshAndBlock times out, thus forcing informers to retry.
otherwise, construct the final list and send back to a client.
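The four server-side steps can be sketched as follows. This is a hypothetical model, not the cacher's code: the `cache` type and its methods are invented for illustration, `waitUntilFresh` merely stands in for the existing waitUntilFreshAndBlock function, and the etcd revision is passed in as a parameter instead of being fetched.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// cache is a hypothetical, minimal model of a watch cache.
type cache struct {
	mu      sync.Mutex
	rv      uint64
	items   []string
	changed chan struct{} // closed and replaced whenever the cache advances
}

func newCache() *cache { return &cache{changed: make(chan struct{})} }

// apply records an object at the given RV and wakes up any waiters.
func (c *cache) apply(rv uint64, item string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = append(c.items, item)
	c.rv = rv
	close(c.changed)
	c.changed = make(chan struct{})
}

// waitUntilFresh blocks until the cache has observed at least rv or ctx ends.
func (c *cache) waitUntilFresh(ctx context.Context, rv uint64) error {
	for {
		c.mu.Lock()
		cur, ch := c.rv, c.changed
		c.mu.Unlock()
		if cur >= rv {
			return nil
		}
		select {
		case <-ch:
		case <-ctx.Done():
			return errors.New("cache is not fresh enough, retry the request")
		}
	}
}

// consistentList waits for the cache to reach etcdRV, rejects the request on
// timeout, and otherwise returns a copy of the cached items.
func (c *cache) consistentList(etcdRV uint64, timeout time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := c.waitUntilFresh(ctx, etcdRV); err != nil {
		return nil, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.items...), nil
}

func main() {
	c := newCache()
	c.apply(1, "a")
	go func() {
		time.Sleep(10 * time.Millisecond)
		c.apply(2, "b")
	}()
	items, err := c.consistentList(2, time.Second)
	fmt.Println(items, err) // [a b] <nil>
}
```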

Replacing standard List request with WatchList mechanism for client-go’s List method.

Replacing the underlying implementation of the List method for client-go based clients (like typed or dynamic client) with the WatchList mechanism requires ensuring that the data returned by both the standard List request and the new WatchList mechanism remains identical. The challenge is that WatchList no longer retrieves the entire list from the server at once but only receives individual items, which forces us to “manually” reconstruct the list object on the client side.

To correctly construct the list object on the client side, we need ListKind information. However, simply reconstructing the list object based on these data is not enough. In the case of a standard List request, the server’s response (a versioned list) is processed through a chain of decoders, which can potentially modify the resulting list object. A good example is the WithoutVersionDecoder, which removes the GVK information from the list object. Thus the “manually” constructed list object may not be consistent with the transformations applied by the decoders, leading to differences.

To ensure full compatibility, the server must provide a versioned empty list in the format requested by the client (e.g., protobuf representation). We don’t know how the client’s decoder behaves for different encodings, i.e., whether the decoder actually supports the encoding we intend to use for reconstruction. Therefore, to ensure maximal compatibility, we will ensure that the encoding used for the reconstruction of the list matches the format that the client originally requested. This guarantees that the returned list object can be correctly decoded by the client, preserving the actual encoding format as intended.

The proposed solution is to add a new annotation (k8s.io/initial-events-list-blueprint) to the object returned in the bookmark event (The bookmark event is sent when the state is synced and marks the end of WatchList stream). This annotation will store an empty, versioned list encoded as a Base64 string. This annotation will be added to the same object/place the k8s.io/initial-events-end annotation is added.

When the client receives such a bookmark, it will base64 decode the empty list and pass it to the decoder chain. Only after a successful response from the decoders will the list be populated with data received from subsequent watch events and returned.

For example:

GET /api/v1/namespaces/test/pods?watch=1&sendInitialEvents=true&allowWatchBookmarks=true&resourceVersion=&resourceVersionMatch=NotOlderThan
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json
{
"type": "ADDED",
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "8467", "name": "foo"}, ...}
}
{
"type": "ADDED",
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "5726", "name": "bar"}, ...}
}
{
"type":"BOOKMARK",
"object":{"kind":"Pod","apiVersion":"v1","metadata":{"resourceVersion":"13519","annotations":{"k8s.io/initial-events-end":"true","k8s.io/initial-events-list-blueprint":"eyJraW5kIjoiUG9kTGlzdCIsImFwaVZlcnNpb24iOiJ2MSIsIm1ldGFkYXRhIjp7fSwiaXRlbXMiOm51bGx9Cg=="}} ...}
}
...
<followed by regular watch stream starting>
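The client-side reconstruction can be illustrated with the Base64 value from the bookmark in the example above, which decodes to an empty JSON-encoded PodList. The Go sketch below is a simplification: `listBlueprint` and `buildList` are hypothetical names, plain `encoding/json` stands in for the client's decoder chain, and the items are kept as raw JSON.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// exampleBlueprint is the annotation value from the bookmark event above.
const exampleBlueprint = "eyJraW5kIjoiUG9kTGlzdCIsImFwaVZlcnNpb24iOiJ2MSIsIm1ldGFkYXRhIjp7fSwiaXRlbXMiOm51bGx9Cg=="

// listBlueprint models only the fields needed here; a real client would pass
// the decoded bytes through its decoder chain instead.
type listBlueprint struct {
	Kind       string            `json:"kind"`
	APIVersion string            `json:"apiVersion"`
	Items      []json.RawMessage `json:"items"`
}

// buildList decodes the Base64 blueprint and fills it with collected items.
func buildList(blueprint string, items []json.RawMessage) (*listBlueprint, error) {
	raw, err := base64.StdEncoding.DecodeString(blueprint)
	if err != nil {
		return nil, fmt.Errorf("decoding blueprint: %w", err)
	}
	var list listBlueprint
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("unmarshalling blueprint: %w", err)
	}
	list.Items = items
	return &list, nil
}

func main() {
	items := []json.RawMessage{json.RawMessage(`{"metadata":{"name":"foo"}}`)}
	list, err := buildList(exampleBlueprint, items)
	if err != nil {
		panic(err)
	}
	fmt.Println(list.Kind, list.APIVersion, len(list.Items)) // PodList v1 1
}
```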

Alternatives

We could modify the type of the object passed in the last bookmark event to include the list. This approach would require changes to the reflector, as it would need to recognize the new object type in the bookmark event. However, this could potentially break other clients that are not expecting a different object in the bookmark event.

Another option would be to issue an empty list request to the API server to receive a list response from the client. This approach would involve modifying client-go and implementing some form of caching mechanism, possibly with invalidation policies. Non-client-go clients that want to use this new feature would need to rebuild this mechanism as well.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/apiserver/pkg/storage/cacher: 02/02/2023 - 74,7%
k8s.io/client-go/tools/cache/reflector: 02/02/2023 - 88,6%

Integration tests

For alpha, tests asserting fallback mechanism for reflector will be added.

e2e tests

For alpha, tests exercising this feature will be added.

Graduation Criteria

Alpha

The Feature is implemented behind WatchList feature flag
Initial e2e tests completed and enabled
Scalability/Performance tests confirm gains of this feature
Add support for watchlist to APF

Beta

Metrics are added to the kube-apiserver (see the monitoring-requirements section for more details)
Implement SendInitialEvents for watch requests in the etcd storage implementation
The feature is enabled for kube-apiserver and kube-controller-manager.
The generic feature gate mechanism is implemented in client-go. It will be used to enable a new functionality for reflectors/informers.
Implement a consistency check detector that will compare data received through a new watchlist request with data obtained through a standard list request. The detector will be added to the reflector and activated when an environment variable is set. The environment variable will be set for all jobs run in the Kube CI.
Update the client-go generated List function to watchList data when the feature gate has been enabled and the ListOptions are satisfied. This change must be applied to the typed, dynamic and metadata clients.
Implement a mechanism for automatically detecting the etcd configuration: whether it is safe to use the RequestWatchProgress API call, or whether the experimental-watch-progress-notify-interval flag has been set. Knowledge of the etcd configuration will be used to automatically disable the streaming feature.
Use WatchProgressRequester to request progress notifications directly from etcd. This mechanism was developed in Consistent Reads from Cache KEP and will reduce the overall latency for watchlist requests.
The watchlist call, which serves as a drop-in replacement for list calls in client libraries, must properly set the kind and apiVersion fields. These fields are important for the correct decoding of the objects. See also: https://github.com/kubernetes/kubernetes/pull/126191
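
The consistency check detector listed above can be pictured with the following minimal sketch. It assumes a hypothetical item type reduced to a name and a resource version, and it only shows the comparison step; activation via an environment variable and the actual reflector wiring are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a hypothetical, minimal stand-in for an API object.
type item struct {
	Name            string
	ResourceVersion string
}

// keys returns sorted "name@resourceVersion" strings so two result sets
// can be compared independently of the order items arrived in.
func keys(items []item) []string {
	out := make([]string, 0, len(items))
	for _, it := range items {
		out = append(out, it.Name+"@"+it.ResourceVersion)
	}
	sort.Strings(out)
	return out
}

// consistent reports whether the watchlist result and the standard list
// result contain the same objects at the same resource versions.
func consistent(fromWatchList, fromList []item) bool {
	a, b := keys(fromWatchList), keys(fromList)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	watchList := []item{{"foo", "8467"}, {"bar", "5726"}}
	list := []item{{"bar", "5726"}, {"foo", "8467"}}
	fmt.Println(consistent(watchList, list))
}
```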

Beta2

The feature is enabled for kubelet.
Extend the existing performance tests with a case that adds a large number of small objects. The current perf test adds a small number of large objects. The new variant will help catch potential regressions such as https://github.com/kubernetes/kubernetes/issues/129467

Beta3

With new concerns raised during the 1.33 release timeline, we revised the approach for the feature. The discussion happened in this document and resulted in the following updated criteria:

Revert the client-go changes that use watchList to implement List. This includes removing the API annotation that made this possible, because it doesn’t serve another purpose.
Ensure we don’t break the “latestRV” use case for StorageVersionMigrator reusing the informer cache from the kube-controller-manager.
Ensure that the feature is usable by external projects by validating it works with controller-runtime out-of-the-box via simple enablement
Add support for AcceptContentType header with the value application/json;as=Table and application/json;as=PartialObjectMetadata
Switch the storage/cacher to use streaming directly from etcd (This will also allow us to remove the reflector.UseWatchList field).
Enable the feature by-default for kube-controller-manager.

Beta4

Disable watchlist support in the fake client so that informers do not use watchlists. This ensures that unit tests relying on non-standard fake client behavior will continue to work.
Currently, the ListWatcher used by the WatchCache does not pass the RV from the reflectors. As a result the consistency detector used by the reflectors fails for the WatchCache. This issue needs to be resolved.
Enable the WatchListClient feature gate by default. The FG is defined in client-go. Enabling it will turn on the feature for all clients.

Beta5

Enable gzip compression for the WatchList request. Gated behind a new server-side feature gate WatchListCompression (Beta, enabled by default). Regular watch requests are not affected.

Backward compatibility

No client-side changes are required. Go’s http.Transport automatically sends Accept-Encoding: gzip when DisableCompression is false (the default in rest.Config), the same mechanism already used for LIST compression. Older clients will transparently receive compressed WatchList responses. The pull-kubernetes-gce-master-scale-performance-5000 test with the POC validates this: all in-cluster clients (kubelet, kube-controller-manager, scheduler) used the standard client-go transport with no modifications.
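
The transparent behavior of Go's net/http transport can be seen in the small sketch below. It starts a throwaway test server (not a kube-apiserver) that compresses its response when the client advertises gzip, and reads it with an http.Transport whose DisableCompression field is left at its zero value. The handler, the payload constant and the X-Seen-Accept-Encoding header are made up for this illustration.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const payload = `{"type":"BOOKMARK"}`

// fetch returns the Accept-Encoding value the server saw and the body the
// client read after the transport transparently decompressed it.
func fetch() (seen, body string, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ae := r.Header.Get("Accept-Encoding")
		// Echo back what the client advertised so the caller can inspect it.
		w.Header().Set("X-Seen-Accept-Encoding", ae)
		if !strings.Contains(ae, "gzip") {
			io.WriteString(w, payload)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		zw := gzip.NewWriter(w)
		defer zw.Close()
		io.WriteString(zw, payload)
	}))
	defer srv.Close()

	// DisableCompression is left at its zero value (false), so the
	// transport adds "Accept-Encoding: gzip" and decodes the response.
	client := &http.Client{Transport: &http.Transport{}}
	defer client.CloseIdleConnections()

	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", "", err
	}
	return resp.Header.Get("X-Seen-Accept-Encoding"), string(data), nil
}

func main() {
	seen, body, err := fetch()
	if err != nil {
		panic(err)
	}
	fmt.Println(seen, body)
}
```

The caller never touches gzip itself, which is why older clients need no changes.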

Scale test results (5000 nodes, ~165K pods)

We ran the pull-kubernetes-gce-master-scale-performance-5000 test with WatchList compression enabled and compared against a baseline ci-kubernetes-e2e-gce-scale-performance-5000 run without it. The WatchList P99 latency for pods dropped from 60s (capped at the histogram bucket ceiling, actual latency was much worse) to ~30s:

Metric	Baseline	With compression
Count	486	547
P50	1,950ms	969ms
P90	15,208ms	17,150ms
P99	60,000ms (capped)	29,673ms

For more details see https://github.com/kubernetes/kubernetes/issues/138670 .

GA

No user issues reported

Post-GA

Make list calls expensive in APF. Once all supported releases have the streaming list enabled by default (client-go, control plane components) and the feature itself is locked to its default value, we can increase the cost of regular list requests in APF. This ensures that the fallback mechanism, which switches back to the standard list when streaming has issues, will not be affected.

Upgrade / Downgrade Strategy

Version Skew Strategy

Our immediate idea to ensure backward compatibility between new clients and old servers would be to return a 401 response in old Kubernetes releases (via backports). This approach, however, would limit the maximum supported version skew to just a few previous releases, and would also force customers to update to the latest minor versions.

Therefore we propose to make use of the already existing “resourceVersionMatch” LIST option. WATCH requests with that option set will be immediately rejected with a 403 (Forbidden) response by previous servers. In that case, new clients will fall back to the previous mode (ListAndWatch). New servers will allow for WATCH requests to have “resourceVersionMatch=MostRecent” set.

Existing clients will be forward and backward compatible and won’t require any changes since the server will preserve the old behavior (ListAndWatch).

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: WatchList
- Components depending on the feature gate:
  - kube-apiserver
- Feature gate name: WatchListClient
- Components depending on the feature gate:
  - kube-controller-manager via client-go library
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

No, because users must enable the feature on the client side (client-go).

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling WatchList FeatureGate for kube-apiserver. In this case kube-apiserver will reject WATCH requests with the new query parameter forcing informers to fall back to the previous mode.

Yes, by disabling WatchListClient FeatureGate for kube-controller-manager. In this case informers will follow standard LIST/WATCH semantics.

Note that for safety reasons, reflectors/informers will always fall back to a regular LIST operation regardless of the error that occurred.
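
The fallback rule can be sketched as follows. The lister interface and the oldServer type are hypothetical and exist only for this example; they are not client-go types, and the real reflector logic is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// lister is a hypothetical abstraction over the two ways of obtaining the
// initial state; it is not a client-go interface.
type lister interface {
	WatchList() ([]string, error) // streaming: WATCH with sendInitialEvents=true
	List() ([]string, error)      // standard LIST
}

// initialState tries the streaming path first and falls back to a regular
// LIST on any error, whatever the error is.
func initialState(l lister) (items []string, usedFallback bool, err error) {
	items, err = l.WatchList()
	if err == nil {
		return items, false, nil
	}
	items, err = l.List()
	return items, true, err
}

// oldServer simulates a server that rejects the streaming request.
type oldServer struct{}

func (oldServer) WatchList() ([]string, error) {
	return nil, errors.New("watch-list not supported")
}

func (oldServer) List() ([]string, error) {
	return []string{"foo", "bar"}, nil
}

func main() {
	items, usedFallback, err := initialState(oldServer{})
	fmt.Println(items, usedFallback, err)
}
```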

What happens if we reenable the feature if it was previously rolled back?

The expected behavior of the feature will be restored.

Are there any tests for feature enablement/disablement?

Yes. There is an integration test that verifies the fallback mechanism of the reflector when interacting with servers that have the WatchList feature enabled/disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The feature does not have a direct impact on rollout/rollback.

However, faulty behavior of a feature can result in incorrect functioning of components that rely on that feature. For the Beta version, we plan to enable it exclusively for kube-controller-manager. The main issues can arise during the initial informer synchronization, which may result in controller failures.

Furthermore, if data consistency issues arise, such as missing data, the controllers simply do not consider the missing data.

What specific metrics should inform a rollback?

apiserver_terminated_watchers_total - a large number of terminated watchers might indicate synchronization issues. For example, we have some client-side error where we’re not getting data from the server. Or we have a server-side error, and the buffer is getting cluttered.

apiserver_request_duration_seconds_bucket - in general, a large number of “short” watch requests can indicate synchronization issues.

apiserver_watch_list_duration_seconds - the absence of this metric may indicate that the client did not receive a special bookmark. The issue here could be that the server never sent it due to an error or didn’t even receive it from the database.

apiserver_watch_list_duration_seconds - long synchronization times may indicate that the server is lagging behind etcd. For example, it may not be receiving progress notifications from the database frequently enough.

apiserver_watch_cache_lag - tells how far behind the server is compared to the database. Significant discrepancies affect the times for full data synchronization.

The number of kube-controller-manager restarts can also be a good metric, as restarts may indicate issues with informer synchronization.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Upgrade->downgrade->upgrade testing was done manually using the following steps:

Build and run Kubernetes from the master branch using Kind.

kind build node-image --arch "arm64"
kind create cluster --image kindest/node:latest
kubectl get no
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 26s v1.29.0-alpha.1.47+f8571dabf79717

Check if the kube-apiserver (aka kas) has recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group="",resource="configmaps",scope="cluster",version="v1",le="6"} 1

Disable the WatchList feature gate for the kas by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml

and pass - --feature-gates=WatchList=false to the kas container.

Check if the kas has not recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"

Check if kube-controller-manager (aka kcm) is running.

kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 1 (44s ago) 3m28s

Check if informers used by the kcm fell back to standard LIST/WATCH semantics.

kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e "watch-list"
W1002 09:11:40.656641 1 reflector.go:340] The watch-list feature is not supported by the server, falling back to the previous LIST/WATCH semantics
…

Disable the WatchListClient feature gate for the kcm by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml

and pass - --feature-gates=WatchListClient=false to the kcm container.

Check if kcm is running.

kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 0 12s

Check if the kas has not recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"

Check if there are no traces of informers for kcm falling back to standard LIST/WATCH semantics.

kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e "watch-list"

Enable the WatchList feature gate for the kas by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml

and remove - --feature-gates=WatchList=false from the kas container.

Check if kcm is running.

kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 1 (22s ago) 86s

Check if the kas has not recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"

Enable the WatchListClient feature gate for the kcm by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml

and remove - --feature-gates=WatchListClient=false from the kcm container.

Check if kcm is running.

kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 0 13s

Check if the kas has recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group="",resource="configmaps",scope="cluster",version="v1",le="6"} 1

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

If apiserver_watch_list_duration_seconds metric has some data then this feature is in use.

How can someone using this feature know that it is working for their instance?

Assuming that historical data is available then comparing the number of LIST and WATCH requests to the server will tell whether the feature was enabled. When this feature is enabled, the number of LIST requests will be smaller. The difference primarily arises from switching informers to a new mode of operation.

Checking whether WatchListClient FeatureGate has been set for the given component.

Knowing the username for a component, the audit logs could be examined to see whether sendInitialEvents=true in the requestURI has been set for that user.

Scanning the component’s logs for the phrase Reflector WatchList. For requests lasting more than 10 seconds, traces will be reported.
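
For the audit-log check, a minimal sketch of the test applied to each requestURI could look like the following. The usesWatchList helper is a hypothetical name, and treating watch=1 and watch=true as equivalent is an assumption of this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// usesWatchList reports whether a request URI, as recorded in an audit log
// entry, is a watch request that asked for initial events.
func usesWatchList(requestURI string) (bool, error) {
	u, err := url.ParseRequestURI(requestURI)
	if err != nil {
		return false, err
	}
	q := u.Query()
	watching := q.Get("watch") == "1" || q.Get("watch") == "true"
	return watching && q.Get("sendInitialEvents") == "true", nil
}

func main() {
	uris := []string{
		"/api/v1/namespaces/test/pods?watch=1&sendInitialEvents=true&allowWatchBookmarks=true&resourceVersion=&resourceVersionMatch=NotOlderThan",
		"/api/v1/namespaces/test/pods?limit=500&resourceVersion=0",
	}
	for _, uri := range uris {
		ok, err := usesWatchList(uri)
		fmt.Println(ok, err)
	}
}
```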

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

None have been defined yet.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)
- Metric name: apiserver_watch_list_duration_seconds (histogram, measures latency of watch-list requests)
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No. On the contrary, the number of requests originating from informers will be reduced by half, from 2 (LIST/WATCH) to just 1 (WATCH).

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

On the contrary. It will decrease the memory usage of kube-apiservers needed to handle “list” requests.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

On the contrary. It will decrease the memory usage required for master nodes.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

When the kube-apiserver is unavailable then this feature will also be unavailable.

When etcd is unavailable, requests attempting to retrieve the most recent state of the cluster will fail.

What are other known failure modes?

kube-controller-manager is unable to start.
- Detection: How can it be detected via metrics? Examine the prometheus up time series or examine the pod status or the number of restarts.
- Mitigations: What can be done to stop the bleeding, especially for already running user workloads? Disable the feature. Pass WatchListClient=false to feature-gates command line flag.
- Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? N/A
- Testing: Are there any tests for failure mode? If not, describe why. Yes, if kube-controller-manager is unable to start then a lot of existing e2e tests will fail.

What steps should be taken if SLOs are not being met to determine the problem?

No SLOs have been defined for this feature yet.

Implementation History

The KEP was proposed on 2022-01-14

Drawbacks

N/A

Alternatives

We could tune the cost function used by the priority and fairness feature. There are at least a few issues with this approach. The first is that we would have to come up with a cost estimation function that can approximate the temporary memory consumption. This might be challenging since we don’t know the true cost of the entire list upfront as object sizes can vastly differ throughout the keyspace (imagine some namespaces with giant secrets, some with small secrets). The second issue is that, even if we could estimate it, we would have to throttle the server to handle just a few requests at a given time, as the estimate would likely be uniform over resource type or other coarse dimensions.
We could attempt to define a function that would prevent the server from allocating more memory than a given threshold. A function like that would require measuring memory usage in real-time. Things we evaluated:
- runtime.ReadMemStats gives us accurate measurement but at the same time is very expensive. It requires STW (stop-the-world) which is equivalent to stopping all running goroutines. Running with 100ms frequency would block the runtime 10 times per second.
- reading from proc would probably increase the CPU usage (polling) and would add some delay (propagation time from the kernel about current memory usage). Since the spike might be very sudden (milliseconds) it doesn’t seem to be a viable option.
- there seems to be no other API provided by golang runtime that would allow for gathering memory stats in real-time other than runtime.ReadMemStats
- using cgroup notification API is efficient (epoll) and near real-time but it seems to be limited in functionality. We could be notified about crossing previously defined memory thresholds but we would still need to calculate available(free) memory on a node.
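
To make the runtime.ReadMemStats option concrete, here is a minimal polling sketch. The sampleHeap helper is a hypothetical name used for illustration; each call to runtime.ReadMemStats stops the world briefly, which is the cost described in the list above.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sampleHeap polls runtime.ReadMemStats n times at the given interval and
// returns the HeapAlloc readings in bytes. Every call stops the world.
func sampleHeap(n int, interval time.Duration) []uint64 {
	samples := make([]uint64, 0, n)
	var m runtime.MemStats
	for i := 0; i < n; i++ {
		runtime.ReadMemStats(&m)
		samples = append(samples, m.HeapAlloc)
		time.Sleep(interval)
	}
	return samples
}

func main() {
	// A 100ms interval means roughly 10 stop-the-world pauses per second.
	for _, heap := range sampleHeap(3, 100*time.Millisecond) {
		fmt.Printf("heap in use: %d bytes\n", heap)
	}
}
```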
We could allow for paginated LIST requests to be served directly from the watch cache. This approach has a few advantages. Primarily it doesn’t require changing informers, no version skew issues. At the same time, it also presents a few challenges. The most concerning is that it would actually not solve the issue. It seems it would also lead to (temporary) memory consumption because we would need to allocate space for the entire response (LIST), keep it in memory until the whole response has been sent to the client (which can be up to 60s) and this could be O({2,3}*the-size-of-the-page).

Appendix

Sources of LIST request

A LIST request can be satisfied from two places, largely depending on used query parameters:

by default directly from etcd. In such cases, the memory demand might be extensive, exceeding the full response size from the data store many times.
from the watch cache if explicitly requested by setting the ResourceVersion param of the list (e.g. ResourceVersion=“0”). This is how most client-go-based controllers prime their caches, for performance reasons. The memory usage will be much lower than in the first case. However, it is not perfect as we still need space to store serialized objects and to hold the full response until it is sent.

Steps followed by informers

The following steps depict a flow of how client-go-based informers work today.

on startup: informers issue a LIST RV=“0” request with pagination, which due to performance reasons translates to a full (pagination is ignored) LIST from the watch cache.
repeated until ResourceExpired 410: establish a WATCH request with an RV from the previous step. Each received event updates the last-known RV. On disconnect, this step is repeated until an “IsResourceExpired” (410) error is returned.
on resumption: establish a new LIST request to the watch cache with RV=“last-known-from-step2” (step1) and then another WATCH request.
after compaction (410): we set RV=”” and get a snapshot via quorum read from etcd in chunks and go back to step2

In rare cases, an informer might connect to an API server whose watch cache hasn’t been fully synchronized (after kube-apiserver restart). In that case its flow will be slightly different.

on startup: informers issue a LIST RV=“0” request with pagination, which effectively equals a paginated LIST RV="", i.e. it gets a consistent snapshot of data directly from etcd (quorum read) in chunks (pagination).
repeated until ResourceExpired 410: establish a WATCH request with an RV from the previous step. Each received event updates the last-known RV. On disconnect, this step is repeated until an “IsResourceExpired” (410) error is returned.
on resumption: establish a paginated LIST RV=“last-known-from-step2” request (step1) and then another WATCH request.
after compaction (410): we set RV=”” and get a snapshot via quorum read from etcd in chunks and go back to step2

Infrastructure Needed (Optional)

N/A

Resources: Allow Replacement of Pods in a Job when fully terminating

KEP-3939: Allow replacement of Pods in a Job when fully terminated

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently, Jobs start replacement Pods as soon as previously created Pods are terminating (have a deletionTimestamp) or fail (phase=Failed). Terminating pods are currently counted as failed in the Job status. However, terminating pods are actually in a transitory state where they are neither active nor really fully terminated.
This KEP proposes a new field for the Job API that allows for users to specify if they want replacement Pods as soon as the previous Pods are terminating (existing behavior) or only once the existing pods are fully terminated (new behavior).

Motivation

Existing Issues:

Many common machine learning frameworks, such as Tensorflow and JAX, require unique pods per Index. Currently, if a pod enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created and immediately fails to start.

Having a replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods can take a long time to find resources and they may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, the replacement Pods might produce undesired scale ups.

On the other hand, if a replacement Pod is not immediately created, the Job status would show that the number of active pods doesn’t match the desired parallelism. To provide better visibility, the job status can have a new field to track the number of Pods currently terminating.

This new field can also be used by queueing controllers, such as Kueue, to track the number of terminating pods to calculate quotas.

Goals

Job controller should allow for flexibility in waiting for pods to be fully terminated before creating replacement Pods
Job controller will have a new status field where we include the number of terminating pods.

Non-Goals

Other workload APIs are not included in this proposal.

Proposal

The Job controller gets a list of active pods. Active pods are pods that don’t have a terminal phase (Succeeded or Failed) and are not terminating (that is, they have no deletionTimestamp). In this KEP, we will consider terminating pods to be separate from active and failed.
As an opt-in behavior, the job controller can use the active and terminating pods to determine whether replacement Pods are needed.
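
A simplified sketch of this classification and of the opt-in replacement decision is shown below. The pod type, the classify and replacementsNeeded helpers, and the waitForTermination flag are hypothetical names for this example; the real job controller also accounts for completions, backoff and other state that is ignored here.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for the Pod API object.
type pod struct {
	Phase       string // "Pending", "Running", "Succeeded" or "Failed"
	Terminating bool   // true once the pod has a deletionTimestamp
}

// classify returns "succeeded" or "failed" for pods in a terminal phase,
// "terminating" for non-terminal pods marked for deletion, and "active"
// otherwise.
func classify(p pod) string {
	switch {
	case p.Phase == "Succeeded":
		return "succeeded"
	case p.Phase == "Failed":
		return "failed"
	case p.Terminating:
		return "terminating"
	default:
		return "active"
	}
}

// replacementsNeeded returns how many pods to create for the given
// parallelism. When waitForTermination is true (the opt-in behavior),
// terminating pods still occupy a slot.
func replacementsNeeded(parallelism int, pods []pod, waitForTermination bool) int {
	occupied := 0
	for _, p := range pods {
		switch classify(p) {
		case "active":
			occupied++
		case "terminating":
			if waitForTermination {
				occupied++
			}
		}
	}
	if occupied >= parallelism {
		return 0
	}
	return parallelism - occupied
}

func main() {
	pods := []pod{
		{Phase: "Running"},
		{Phase: "Running", Terminating: true},
		{Phase: "Failed"},
	}
	fmt.Println(replacementsNeeded(3, pods, false))
	fmt.Println(replacementsNeeded(3, pods, true))
}
```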

We propose two new API fields:

A field in Spec that allows for opt-in behavior of whether to wait for terminating pods to finish before creating replacement pods.
A new field in Status for tracking the number of terminating pods.

User Stories (Optional)

Story 1

As a machine learning user, I use ML frameworks that schedule multiple pods.
The Job controller does not typically wait for terminating pods to be marked as failed.
Tensorflow and other ML frameworks may require that Pods are started only once the other pods are fully terminated.

This case was added due to a bug discovered with running IndexedJobs with Tensorflow.
See Jobs create replacement Pods as soon as a Pod is marked for deletion for more details.

Story 2

As a cloud user, I want to guarantee that the number of running pods is exactly the number that I specify.
Terminating pods do not relinquish resources, so scarce compute resources are still allocated to those pods. Waiting for them to terminate means that replacement pods do not produce unnecessary scale ups.

Story 3

As a Job-level quota controller, I want to track the number of terminating pods, in addition to the active pods.

See Kueue: Account for terminating pods when doing preemption for an example of this.

Notes/Constraints/Caveats (Optional)

The default job controller behavior

Based on the proposed API below, the behavior of the job controller prior to this KEP is equivalent to podReplacementPolicy: TerminatingOrFailed.

This behavior has the following semantic problems:

A terminating Pod might gracefully terminate as Succeeded, but it counts towards .status.failed as soon as it’s terminating and it’s not reclassified upon termination.
When using podFailurePolicy, the controller might create a replacement Pod before being able to evaluate the terminal state of the Pod. The replacement Pod might be terminated due to the policy.

In a Job v2 API, we should consider having the default behavior equivalent to podReplacementPolicy: Failed, given the above problems. We could even consider removing the proposed field podReplacementPolicy.

But for backwards compatibility, in v1, we have to introduce a change of behavior as opt-in.

When Pods enter a terminating state

Pods can be marked for termination by several controllers, which we typically refer to as disruptions, such as: kubelet eviction, scheduler preemption, API eviction, etc.

The job controller itself can delete running Pods, in the following scenarios:

A job is over the activeDeadlineSeconds.
When the number of Pod failures reaches the backoffLimit.
With PodFailurePolicy active and FailJob is set as the action.

In all these situations, the Pod initially gets a deletionTimestamp and we interpret the pod as “terminating”. Once the pod terminates, it gets a terminal phase (Succeeded or Failed).

Exponential Backoff for Pod Failures

The job controller implements backoff delays to prevent fast recreation of continuously failing Pods.

This behavior is internal (not configurable through the API) and it’s orthogonal to this KEP. The behavior will be preserved as follows:

When podReplacementPolicy: TerminatingOrFailed, the backoff period counts from the time the Pod is terminating or Failed.
When podReplacementPolicy: Failed, the backoff period counts from the time the Pod is Failed.
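
The two rules can be sketched as a small function. The backoffStart helper and the example timestamps are hypothetical, and for TerminatingOrFailed the sketch assumes that the backoff counts from whichever of the two events happens first.

```go
package main

import (
	"fmt"
	"time"
)

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// Example timestamps used by main; they are made up for this sketch.
var (
	exampleTerminatingAt = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	exampleFailedAt      = exampleTerminatingAt.Add(30 * time.Second)
	notYet               time.Time // zero value: the event has not happened
)

// backoffStart returns the instant from which the backoff period counts.
// terminatingAt is when the pod got its deletionTimestamp and failedAt is
// when it reached phase Failed; a zero time means "not yet".
// ok is false when the backoff period has not started.
func backoffStart(policy PodReplacementPolicy, terminatingAt, failedAt time.Time) (start time.Time, ok bool) {
	if policy == TerminatingOrFailed && !terminatingAt.IsZero() {
		if failedAt.IsZero() || terminatingAt.Before(failedAt) {
			return terminatingAt, true
		}
	}
	if !failedAt.IsZero() {
		return failedAt, true
	}
	return time.Time{}, false
}

func main() {
	for _, policy := range []PodReplacementPolicy{TerminatingOrFailed, Failed} {
		start, ok := backoffStart(policy, exampleTerminatingAt, exampleFailedAt)
		fmt.Println(policy, start.Format(time.RFC3339), ok)
	}
}
```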

Risks and Mitigations

Pods are not guaranteed to transition to a terminal phase

One area of contention is how this KEP will work with 3329-retriable-and-non-retriable-failures .

In 3329, there was a decision to make kubelet transition pods to failed before deleting them. This is guarded by the PodDisruptionConditions feature gate, which, in addition to setting the phase to Failed, adds a DisruptionTarget condition. This means that when this feature is turned on, the job controller is able to count pods as failed only when they are fully terminated, as it is guaranteed that all pods will reach a terminal state (Failed or Succeeded). Note that a terminating pod is not considered active either. If PodDisruptionConditions is turned off, then the job controller considers the pod as failed as soon as it is terminating (has a deletion timestamp), because there is no guarantee that the pod will transition to phase=Failed.

Another issue is described here . If PodDisruptionConditions is disabled, a pod bound to a no-longer-existing node may be stuck in the Running phase. As a consequence, it will never be replaced, so the whole job will be stuck from making progress. When PodDisruptionConditions is enabled, the PodGC transitions the Pod to phase Failed in this scenario.

Due to the above issues, we propose the following mitigation:

If PodDisruptionConditions OR JobPodReplacementPolicy are enabled, set phase=Failed in kubelet and podGC before deleting a Pod.
If JobPodReplacementPolicy is enabled, but PodDisruptionConditions is disabled, the kubelet and podGC only set the phase, but do not add a DisruptionTarget condition.
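
The mitigation reduces to two boolean rules, sketched below. The gates type and the beforeDelete helper are hypothetical names for illustration and do not correspond to actual kubelet or PodGC code.

```go
package main

import "fmt"

// gates holds the two feature gates named in the mitigation.
type gates struct {
	PodDisruptionConditions bool
	JobPodReplacementPolicy bool
}

// beforeDelete returns what the kubelet and PodGC do before deleting a Pod:
// whether to set phase=Failed and whether to add a DisruptionTarget
// condition.
func beforeDelete(g gates) (setPhaseFailed, addDisruptionTarget bool) {
	setPhaseFailed = g.PodDisruptionConditions || g.JobPodReplacementPolicy
	addDisruptionTarget = g.PodDisruptionConditions
	return setPhaseFailed, addDisruptionTarget
}

func main() {
	for _, g := range []gates{
		{false, false}, {false, true}, {true, false}, {true, true},
	} {
		phase, cond := beforeDelete(g)
		fmt.Printf("%+v -> setPhaseFailed=%t addDisruptionTarget=%t\n", g, phase, cond)
	}
}
```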

Design Details

Job API Definition

At the JobSpec level, we are adding a new enum field:

// PodReplacementPolicy controls when replacement pods are created.
// The default is TerminatingOrFailed (recreate pods when they are terminating
// or failed), unless podFailurePolicy is in use.
// +enum
type PodReplacementPolicy string
const (
 // TerminatingOrFailed is a policy that creates replacement pods when they are
 // marked as terminating (have a deletion timestamp) or reach the terminal
 // phase `Failed`.
 // Terminating pods count towards `.status.failed`, even if they later reach
 // the terminal phase `Succeeded`.
 TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
 // Failed is a policy that creates replacement Pods only when the previously
 // created Pods reach the terminal phase `Failed`.
 Failed PodReplacementPolicy = "Failed"
)

type JobSpec struct{
 ...
 // podReplacementPolicy specifies when to create replacement Pods. Possible values are:
 // - TerminatingOrFailed means to create a replacement Pod when the previously
 // created Pod is terminating or failed.
 // - Failed means to wait until a previously created Pod is fully terminated
 // before creating a replacement Pod.
 //
 // When using podFailurePolicy, the default value is Failed and this is the
 // only allowed policy.
 // When not using podFailurePolicy, the default value is TerminatingOrFailed.
 // +optional
 PodReplacementPolicy *PodReplacementPolicy
}

In order to offer visibility of the number of terminating pods, we include a new field in the JobStatus.

type JobStatus struct {
 ...
 // Number of terminating pods
 // +optional
 Terminating *int32
}

Defaulting and validation

Defaulting of podReplacementPolicy will depend on whether podFailurePolicy is in use:

when podFailurePolicy is in use, the default value is Failed.
when podFailurePolicy is not in use, the default value is TerminatingOrFailed.

When podFailurePolicy is in use, the only allowed value for podReplacementPolicy is Failed.
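
The defaulting and validation rules above can be sketched as follows. This is an illustration under stated assumptions, not the actual API machinery code: `jobSpec` is a simplified stand-in for JobSpec, `HasPodFailurePolicy` replaces the real podFailurePolicy field, and `setDefault` and `validate` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// jobSpec is a simplified stand-in for JobSpec (assumption for this sketch).
type jobSpec struct {
	HasPodFailurePolicy  bool
	PodReplacementPolicy *PodReplacementPolicy
}

// setDefault fills in podReplacementPolicy when the user left it unset.
func setDefault(spec *jobSpec) {
	if spec.PodReplacementPolicy != nil {
		return
	}
	p := TerminatingOrFailed
	if spec.HasPodFailurePolicy {
		p = Failed
	}
	spec.PodReplacementPolicy = &p
}

// validate rejects unknown values and any value other than Failed
// when podFailurePolicy is in use.
func validate(spec jobSpec) error {
	if spec.PodReplacementPolicy == nil {
		return nil
	}
	switch *spec.PodReplacementPolicy {
	case TerminatingOrFailed, Failed:
	default:
		return fmt.Errorf("unsupported podReplacementPolicy %q", *spec.PodReplacementPolicy)
	}
	if spec.HasPodFailurePolicy && *spec.PodReplacementPolicy != Failed {
		return errors.New("podReplacementPolicy must be Failed when podFailurePolicy is in use")
	}
	return nil
}

func main() {
	spec := jobSpec{HasPodFailurePolicy: true}
	setDefault(&spec)
	fmt.Println(*spec.PodReplacementPolicy, validate(spec))
}
```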

Tracking the terminating pods

To support the quota management for Job-level controllers (story 3), we introduced the .status.terminating field, which tracks the number of terminating pods. However, in the initial Beta implementation the field stops tracking the number of terminating pods as soon as the Job is marked as Failed with the Failed condition (see issue #123775, https://github.com/kubernetes/kubernetes/issues/123775). The remaining pods may be occupying resources for an arbitrary amount of time.

In 1.31 we are going to fix this issue by delaying the addition of the Failed or Complete conditions until all pods are fully terminated. To indicate as soon as possible that a Job is doomed to fail or succeed, we extend the scope of the pre-existing conditions FailureTarget and SuccessCriteriaMet, respectively. See more details in Job API managed-by mechanism.

Implementation

As part of this KEP, we need to track pods that are terminating (deletionTimestamp != nil and phase is Pending or Running).

The following algorithm could be used:

Count the number of pods that are active and not terminating.
Count the number of terminating pods.
In manageJob we will count expected pods as:

when podReplacementPolicy: Failed then expectedPods = active + terminating.
when podReplacementPolicy: TerminatingOrFailed then expectedPods = active.

Use the expected number of pods to decide whether to recreate.

In Indexed completion mode, the tracking of pods is per index.

The controller updates the field Status.terminating with the number of terminating pods. For backwards compatibility, when podReplacementPolicy: TerminatingOrFailed, the number of failed pods includes the terminating pods.

The controller updates the terminating field in the same API call where it updates other counters, so it should not require any extra API calls.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

controller_utils: April 3rd 2023 - 56.6
- Adding tests to help determine if pods are terminating.
job: April 3rd 2023 - 90.4
- Verify that terminating pods are in fact counted in the status.
- Recreate pods only once pod is fully terminated (i.e. Failed).
- Verify existing behavior with TerminatingOrFailed.
- If the feature is off, verify existing behavior.
- Count terminating pods even if a terminating Pod is considered failed when JobPodReplacementPolicy is disabled.
- Count terminating pods even if a terminating Pod is not considered failed when JobPodReplacementPolicy is enabled.
gc_controller.go: April 3rd 2023 - 82.4
- Set PodPhase to failed when JobPodReplacementPolicy is true but PodDisruptionConditions is false.

The following scenarios related to tracking the terminating pods are covered:

Failed or Complete conditions are not added while there are still terminating pods
FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
SuccessCriteriaMet is added when the completions are satisfied

Integration tests

We will add the following integration test for the Job controller:

Case with JobPodReplacementPolicy on and podReplacementPolicy: Failed

Job starts pods that take a while to terminate
Delete pods
Verify that terminating is tracked
Verify that pod creation only occurs once pod is fully terminated.

Case with JobPodReplacementPolicy on and podReplacementPolicy: TerminatingOrFailed

Job starts pods that take a while to terminate
Delete pods
Verify that terminating is tracked
Verify that pod creation only occurs once deletion happens.

Case with JobPodReplacementPolicy off

Job starts pods that take a while to terminate
Delete pods
Verify that terminating is not tracked
Verify that pod creation only occurs once deletion happens.

Case for disable and reenable JobPodReplacementPolicy

Create Job with podReplacementPolicy: Failed
Job starts pods that take a while to terminate
Restart controller and disable JobPodReplacementPolicy
Delete some pods
Verify that terminating pods count as failed and pods are recreated.
Restart controller and reenable JobPodReplacementPolicy
Terminate pods with phase Succeeded.
Verify that pods still count as failed.
Delete remaining Pods.
Verify that terminating is tracked.
Verify that pod creation only occurs once pod is fully terminated.
Verify that pod creation only occurs once deletion happens.

To cover cases with PodDisruptionConditions, we really only need to worry about tracking the terminating field. Tests will verify the counting of terminating pods regardless of PodDisruptionConditions being on or off.

The following scenarios related to tracking the terminating pods are covered:

Failed or Complete conditions are not added while there are still terminating pods
FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
SuccessCriteriaMet is added when the completions are satisfied

The integration tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/integration/job/job_test.go . The most relevant test is TestJobPodReplacementPolicy.

e2e tests

Generally, the only e2e tests that are useful for this feature are those with podReplacementPolicy: Failed.
The test should create a Job whose Pods catch the SIGTERM signal and terminate gracefully, so that when we delete a Pod
we can first assert that no replacement is created while the Pod is terminating and then that a new Pod is created once it terminates.

We can use the default busybox image which is generally used in e2e tests and override the command field with something like:

_term(){
 sleep 5
 exit 143
}
trap _term SIGTERM
while true; do
 sleep 1
done

An e2e test can verify that deletion will not trigger a new pod creation until the exiting pod is fully deleted.

If podReplacementPolicy: TerminatingOrFailed is specified we would test that pod creation happens closely after deletion.

The e2e tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/e2e/apps/job.go .

Test grid:

gce

Kubernetes e2e suite.[It] [sig-apps] Job should recreate pods only after they have failed if pod replacement policy is set to Failed

Graduation Criteria

Alpha

Job controller can consider terminating pods as active
Job controller counts terminating pods in JobStatus.
Unit Tests
Integration tests

Beta

Address reviews and bug reports from Alpha users
E2e tests are in Testgrid and linked in KEP
The feature flag enabled by default
job_pods_creation_total metric is added.

GA

Address reviews and bug reports from Beta users
Allow Job API clients tracking the number of the terminating pods until all the resources are released (see tracking the terminating pods ). Also, provide links for the relevant integration tests in the KEP.
Lock the JobPodReplacementPolicy feature-gate to true
Restore the .status.terminating assertion for JobSuccessPolicy Conformance Tests in the following:

Deprecation

Remove JobPodReplacementPolicy feature-gate in GA+3.

Upgrade / Downgrade Strategy

Upgrade

Set JobPodReplacementPolicy to true in apiserver and controller manager.

There are no other components required.

Jobs that want to replace pods once they are fully terminal can use PodReplacementPolicy: Failed.

If a Job is not using PodFailurePolicy, one can change PodReplacementPolicy to TerminatingOrFailed. This reverts the Job to the behavior that exists with the feature off.

If one is using PodFailurePolicy, one will not be able to set the value to TerminatingOrFailed, as Failed is the only allowed value. In this case, the recommendation would be to also disable the PodFailurePolicy feature.

Downgrade

Set JobPodReplacementPolicy to false in apiserver and controller manager.

With downgrading, you will no longer see any side-effects of PodReplacementPolicy.

Version Skew Strategy

This feature is limited to control plane.

Note that kube-apiserver can be in the N+1 skew version relative to the kube-controller-manager (see here). In that case, the Job controller operates on the version of the Job object that already supports the new Job API.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: JobPodReplacementPolicy
- Components depending on the feature gate:
  - kube-apiserver (for field control)
  - kube-controller-manager (for main functionality)
  - kubelet (for supporting functionality: transition to phase=Failed)

Does enabling the feature change any default behavior?

Yes:

- Count the number of terminating pods and populate it in JobStatus.
- Set phase=Failed in kubelet and pod-GC before deleting a Pod object (behavior also present when the related PodDisruptionConditions is enabled).
- As part of the closely related KEP-3329, we will default podReplacementPolicy to Failed if podFailurePolicy is set, which, as described above, will change the way of handling terminating pods.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

When the feature is disabled:

the apiserver:
- Discards the value of podReplacementPolicy for new objects.
- Preserves the value of podReplacementPolicy for existing objects.
the job controller:
- processes the Job as podReplacementPolicy: TerminatingOrFailed (the existing behavior)
- stops tracking terminating pods, sets the value of .status.terminating to nil in the next Job sync.
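
The apiserver behavior described above (discard for new objects, preserve for existing ones) follows a common pattern for gated fields. A minimal sketch in Go, assuming a simplified `jobSpec` type and a hypothetical `dropDisabledFields` helper rather than the actual registry code:

```go
package main

import "fmt"

// jobSpec is a simplified stand-in for JobSpec (assumption for this sketch).
type jobSpec struct {
	PodReplacementPolicy *string
}

// dropDisabledFields clears podReplacementPolicy when the feature gate is
// disabled, unless the stored (old) object already uses the field.
// oldSpec is nil on create.
func dropDisabledFields(gateEnabled bool, newSpec, oldSpec *jobSpec) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.PodReplacementPolicy != nil {
		// The field is already in use: preserve it on update.
		return
	}
	newSpec.PodReplacementPolicy = nil
}

func main() {
	failed := "Failed"

	created := &jobSpec{PodReplacementPolicy: &failed}
	dropDisabledFields(false, created, nil)
	fmt.Println("create with gate off keeps field:", created.PodReplacementPolicy != nil)

	updated := &jobSpec{PodReplacementPolicy: &failed}
	dropDisabledFields(false, updated, &jobSpec{PodReplacementPolicy: &failed})
	fmt.Println("update of existing object keeps field:", updated.PodReplacementPolicy != nil)
}
```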

What happens if we reenable the feature if it was previously rolled back?

The job controller will respect the value of podReplacementPolicy for new events (new Pods becoming terminating or failed).

If podReplacementPolicy: Failed and there are currently terminating Pod(s) that were already considered Failed before reenabling the feature, they won’t be re-evaluated.

Are there any tests for feature enablement/disablement?

No, but we will add unit and integration tests for feature enablement and disablement.

An integration test verifies disable and reenable. See integration tests for details.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout or rollback will not fail, as rolling out this feature only entails turning on JobPodReplacementPolicy. Enabling this feature does not change the failure rates of Jobs. Pods will be marked as failed later (as we wait for the pods to be fully terminal).

This feature is opt-in for functional changes. We track terminating pods for observability reasons but we only use this data in the case of Failed.

If a user has set PodReplacementPolicy: Failed or has PodFailurePolicy set, then rolling back this feature would mean that terminating Pods will be recreated once they are deleted.

If a user rolls out this feature with PodFailurePolicy or PodReplacementPolicy set to Failed, then pods will only be recreated once they are fully terminal.
This will not impact failure counts as in both cases, they will get marked as failed eventually.

If a user rolls out this feature without PodFailurePolicy or PodReplacementPolicy set, then there will be no impact to existing workloads.

What specific metrics should inform a rollback?

job_syncs_total, exposed by kube-controller-manager
- If the number of syncs increases it could mean that we have an increased number of failures.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

In beta, we are working on adding an integration test for these cases.

In terms of a manual test for upgrade and rollback, we can use 1.28.

The Upgrade->downgrade->upgrade testing was done manually using the alpha version in 1.28 with the following steps:

Start the cluster with the JobPodReplacementPolicy enabled:

Create a KIND cluster with 1.28 and use the config below to turn this feature on.

using config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
 "JobPodReplacementPolicy": true
nodes:
- role: control-plane
- role: worker

Then, create the job using .spec.podReplacementPolicy=Failed:

kubectl create -f job.yaml

using job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
 name: job-prp
spec:
 completions: 1
 parallelism: 1
 backoffLimit: 2
 podReplacementPolicy: Failed
 template:
  spec:
   restartPolicy: Never
   containers:
   - name: sleep
     image: gcr.io/k8s-staging-perf-tests/sleep
     args: ["-termination-grace-period", "1m", "60s"]

Wait for the pods to be running and delete a pod:

kubectl delete pods -l job-name=job-prp

With the feature on and PodReplacementPolicy set to Failed, the replacement pod is created only once the deleted pod is fully terminated. While the pod is terminating, the Job status also reports a terminating pod.

kubectl get jobs -ljob-name=job-prp -oyaml

status:
 terminating: 1

Simulate downgrade by creating a new Kind cluster with the feature turned off.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
 "JobPodReplacementPolicy": false
nodes:
- role: control-plane
- role: worker

Then, delete the pods of the job.

kubectl delete pods -l job-name=job-prp

There should also be no terminating pod status and a pod will be created before the other pod terminates. If you use the above case, you should see a terminating pod and a new pod created.

Simulate upgrade by creating a new Kind cluster with the feature turned on.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
 "JobPodReplacementPolicy": true
nodes:
- role: control-plane
- role: worker

Deleting the pod will create a replacement pod once the pod is fully terminated. The status field will also state that the pod is terminating.

This demonstrates that the feature is working again for the job.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

During pod terminations, an operator can see that the terminating field is being set.

We will use a new metric:

job_pods_creation_total (new): the reason label indicates what triggered the creation (new, recreate_terminating_or_failed, recreate_failed),
and the status label indicates the outcome of the pod creation (succeeded, failed).
This can be used to get the number of pods that are being recreated with reason recreate_failed. Otherwise, we would expect to see new or recreate_terminating_or_failed as the normal values.
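
As a rough illustration of how the two labels could be derived, the sketch below uses a plain map instead of a real metrics library. The helper names and the mapping from policy to reason value are assumptions made for this example, not the controller's actual logic.

```go
package main

import (
	"errors"
	"fmt"
)

// creationLabels mirrors the two labels of job_pods_creation_total.
type creationLabels struct {
	Reason string
	Status string
}

// creationReason picks the reason label value (assumed mapping).
func creationReason(policy string, isReplacement bool) string {
	if !isReplacement {
		return "new"
	}
	if policy == "Failed" {
		return "recreate_failed"
	}
	return "recreate_terminating_or_failed"
}

// creationStatus picks the status label value from the creation result.
func creationStatus(err error) string {
	if err != nil {
		return "failed"
	}
	return "succeeded"
}

func main() {
	// A map stands in for the counter vector of a metrics library.
	counter := map[creationLabels]int{}
	record := func(policy string, isReplacement bool, err error) {
		l := creationLabels{Reason: creationReason(policy, isReplacement), Status: creationStatus(err)}
		counter[l]++
	}

	record("Failed", false, nil)
	record("Failed", true, nil)
	record("TerminatingOrFailed", true, errors.New("creation rejected"))

	for l, n := range counter {
		fmt.Printf("job_pods_creation_total{reason=%q,status=%q} %d\n", l.Reason, l.Status, n)
	}
}
```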

How can someone using this feature know that it is working for their instance?

If a user terminates pods that are controlled by a job, then we should wait until the existing pods are terminated before starting new ones.

When feature is turned on, we will also include a terminating field in the Job Status if there are any terminating pods.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

We did not propose any SLO/SLI for this feature.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - job_syncs_total (existing): can be used to see how much the feature enablement causes the number of syncs to increase.
- Components exposing the metric: kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

In beta, we will add a new metric job_pods_creation_total.

Dependencies

In Risks and Mitigations we discuss the interaction with 3329-retriable-and-non-retriable-failures .
We will have to guard against cases where PodFailurePolicy is off while this feature is on.
PodFailurePolicy is stable and locked to true by default, but we should still guard against cases where PodDisruptionConditions is turned off.

Does this feature depend on any specific services running in the cluster?

Scalability

Generally, enabling this will slow down pod creation if pods take a long time to terminate. We would wait to create new pods until the existing ones are terminated.

Will enabling / using this feature result in any new API calls?

In the job controller, we only update the Job.Status if any field in the Job.Status changes. With this feature on, we will track terminating pods in this status. It could be possible to see an increase in updating the status field of Jobs if a lot of the pods are being terminated. However, if pods are being terminated, we would also expect other fields to be getting updated also (active, failed, etc) so there should not be a large increase of API calls for patching.

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

For the Job API, we are adding an enum field named PodReplacementPolicy, which takes either TerminatingOrFailed or Failed.

API type(s): enum
Estimated increase in size: 8B

We also added a status field for tracking terminating pods.

API type(s): int32
Estimated increase in size: 4B

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No, the existing SLIs/SLOs do not include the time taken to create new pods when existing ones are terminated.
There is an existing one on pod creation, but this feature will not impact it.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

N/A

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

N/A

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from existing behavior of the Job controller.

What are other known failure modes?

There are no other failure modes.

What steps should be taken if SLOs are not being met to determine the problem?

If one wants to keep the feature on, one can suspend the Jobs that are using this feature. Setting suspend: true in a Job’s spec halts the execution of that Job.

Implementation History

2023-04-03: Created KEP
2023-05-19: KEP Merged.
2023-07-16: Alpha PRs merged.
2023-09-29: KEP marked for beta promotion.
2023-10-24: Merged bugfix Fix tracking of terminating Pods when nothing else changes
2023-10-24: Merged adding a metric required for beta promotion feat: add job_pods_creation_total metric
2023-10-27: Merged Switch feature flag to beta for pod replacement policy and add e2e test #121491
2024-06-11: [v1.31] Merged Count terminating pods when deleting active pods for failed jobs #125175
2024-07-12: [v1.31] Merged Delay setting terminal Job conditions until all pods are terminal #125510

This feature was promoted to beta in v1.29, but important updates were implemented in v1.31. For additional info, check the PRs linked above with the tag [v1.31].

Drawbacks

Enabling this feature may make rollouts slower.

Alternatives

We discussed having this under the PodFailurePolicy but this is a more general idea than the PodFailurePolicy.

Infrastructure Needed (Optional)

Resources: Allow special characters environment variable

KEP-4369: Allow special characters in environment variables

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories (Optional)
  - Story 1
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Allow all printable ASCII characters except “=” to be used in environment variable names. The printable ASCII characters are those in the range 32-126.

Motivation

Kubernetes should not restrict which environment variable names can be used, because it has no way of knowing what the application may need, and people can’t always choose their own variable names, which may limit the adoption of Kubernetes.

Goals

Allow users to use all ASCII characters with codes in the range 32-126, except “=”, in environment variable names.

Non-Goals

Proposal

Implement relaxed validation at the top level validation method when validating API create requests, so that all ASCII characters in the range 32-126 except “=” are accepted.
Allow users to set ConfigMap keys and Secret keys outside the C_IDENTIFIER scope as environment variables using EnvFrom.
Document rules for setting environment variables.

User Stories (Optional)

Story 1

I am a .NET Core developer. .NET Core applications use “:” as the separator in keys of application settings loaded from the appsettings.json file. When running a .NET Core app in a container, these settings are typically overridden by specifying an environment variable. For example, given: "Logging": { "IncludeScopes": false, "LogLevel": { "Default": "Warning" } }
the setting is overridden like this: -e Logging:LogLevel:Default=Debug

Risks and Mitigations

Relaxed validation can break upgrade and rollback scenarios, but our use of feature gate to control whether it’s enabled or not will make it a manageable risk, with the user having the autonomy to choose whether or not to enable it.

Design Details

A feature gate named RelaxedEnvironmentVariableValidation controls the loosening of the envvar name validation, initially in alpha state and defaulting to false
Two sets of validation logic for envvar names:
- Strict validation
  - Strict validation follows the current design, which only allows envvar names that match the regular expression [-._a-zA-Z][-._a-zA-Z0-9]*.
- Relaxed validation
  - Relaxed validation allows all ASCII characters in the range 32-126 except = in envvar names. Its regular expression is ^[ -<>-~]+$, which matches a string of length at least 1 consisting of ASCII characters from space to < and from > to ~, thereby excluding =.
Everywhere we validate envvar names in API objects, plumb a parameter indicating whether we want the strict or relaxed validation
- At the top level validation method when validating API create requests, use the strict validation if the feature gate is off
- At the top level validation method when validating API update requests, use the strict validation if the feature gate is off and the old object passes strict envvar name validation
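
The two validation modes and the create/update rule above can be sketched in Go. The regular expressions are the ones quoted in this section, anchored on both ends; the function names (`useRelaxed`, `validEnvName`, `allStrict`) are hypothetical and not taken from the Kubernetes source.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Strict: the current rule for envvar names.
	strictEnvName = regexp.MustCompile(`^[-._a-zA-Z][-._a-zA-Z0-9]*$`)
	// Relaxed: printable ASCII (32-126) except '=', at least one character.
	relaxedEnvName = regexp.MustCompile(`^[ -<>-~]+$`)
)

// allStrict reports whether every name passes strict validation.
func allStrict(names []string) bool {
	for _, n := range names {
		if !strictEnvName.MatchString(n) {
			return false
		}
	}
	return true
}

// useRelaxed decides which mode applies. On create, oldNames is nil, so the
// result depends only on the feature gate. On update, relaxed validation is
// also used when the old object already fails strict validation.
func useRelaxed(gateEnabled bool, oldNames []string) bool {
	return gateEnabled || !allStrict(oldNames)
}

// validEnvName validates a single envvar name in the selected mode.
func validEnvName(name string, relaxed bool) bool {
	if relaxed {
		return relaxedEnvName.MatchString(name)
	}
	return strictEnvName.MatchString(name)
}

func main() {
	for _, name := range []string{"FOO_BAR", "Logging:LogLevel:Default", "A=B", ""} {
		fmt.Printf("%q strict=%t relaxed=%t\n", name, validEnvName(name, false), validEnvName(name, true))
	}
}
```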

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Current coverage:

pkg/apis/core/validation/validation_test.go: 2023-12-21 - 83.9%
pkg/kubelet/kubelet_pods_test.go: 2023-12-21 - 67.2%
staging/src/k8s.io/apimachinery/pkg/util/validation/validation_test.go: 2023-12-21 - 94.8%

These tests will be added:

New tests will be added to ensure environment variable fields can be correctly validated pkg/apis/core/validation/validation_test.go
Add a new test that sets special character environment variables for pods in a given namespace pkg/kubelet/kubelet_pods_test.go
A new test will be added to ensure that the environment variable name field is valid staging/src/k8s.io/apimachinery/pkg/util/validation/validation_test.go

Integration tests

e2e tests

Add a test to test/e2e/common/node/configmap.go to test that the special characters in configmap are consumed by the environment variable.
Add a test to test/e2e/common/node/secret.go to test that the special characters in secret are consumed by the environment variable.
Add a test to test/e2e/common/node/expansion to test environment variable can contain special characters.

We have also added presubmit and periodic test jobs in CI for these e2e tests. Job names:

pull-kubernetes-e2e-relaxed-environment-variable-validation
ci-kubernetes-e2e-relaxed-environment-variable-validation

Graduation Criteria

Alpha

Create the feature gate and implement the feature, disabled by default.
Add unit and e2e tests for the feature.

Beta

Solicit feedback from the Alpha.
Ensure tests are stable and passing.

GA

Ensure that the time range from Alpha to GA version can cover the version skew of all components.
Add troubleshooting details on how to deal with incompatible kubelet/CRI implementations based on issues found in beta releases.

Upgrade / Downgrade Strategy

Upgrade

Environment variables previously set by the user will not change. To use this enhancement, users need to enable the feature gate.

Downgrade

After downgrade, environment variables containing special characters will continue to work as expected, but any writes to resources to add or change environment variables must set the environment variable names to only use normal characters.

Version Skew Strategy

The feature gate must be enabled on kube-apiserver to use this feature.

If the feature gate is not enabled on kube-apiserver, strict validation is used.

If the feature gate is disabled and the existing object passes strict validation, strict validation on update will be used.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: RelaxedEnvironmentVariableValidation
- Components depending on the feature gate: kube-apiserver

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If the feature gate is disabled, already running workloads are not affected in any way, but new workloads that use special characters in environment variable names cannot be created.

What happens if we reenable the feature if it was previously rolled back?

The feature should continue to work just fine.

Are there any tests for feature enablement/disablement?

Yes.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

When the feature gate is disabled, workloads that are already running will not be affected. However, if users update the workloads, the workloads may fail to recreate pods or ReplicaSets because they fail the API server’s validation, which could cause the workloads to fail.

What specific metrics should inform a rollback?

N/A

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Yes, operators can use the Kubernetes API to determine this. They need to get all pods in the cluster and check whether any pod has an environment variable name that does not match [-._a-zA-Z][-._a-zA-Z0-9]*. For example, we can find the namespaces and names of pods using this feature and their environment variable names using the following command:

kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.spec.containers[].env[]?.name | test("^[-._a-zA-Z][-._a-zA-Z0-9]*$") | not) | [.metadata.namespace, .metadata.name, .spec.containers[].env[]?.name] | @tsv'

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

According to the test results in https://github.com/HirazawaUi/verfiy-container-env , the container runtime is very lenient with using special characters as environment variables, and almost no failures will occur.

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Dependencies

N/A

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

- 2023-12-21: Initial draft KEP

- 2024-02-06: KEP promoted to implementable.

- 2024-08-26: Promote to beta

- 2024-08-27: Fixed some errors in the beta phase

- 2025-06-03: Promote to GA

Drawbacks

If the envvar name character set is extended, everything that currently consumes envvar names from the API is affected and may break or become unsafe.

For example:

If a third party uses an envvar name as a filename and assumes that it is currently safe, then if it contains characters that cannot be used as a filename (like :) or characters that break the assumptions of a flat directory structure (like /), then unexpected results will occur.
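
A third-party consumer could guard against this case with a check along the following lines. This is only a sketch under the assumptions in the example above: the helper name safeAsFileName is hypothetical, and the set of rejected characters covers just the two characters mentioned (":" and "/") plus the special directory names, not every character a given filesystem may reject.

```go
package main

import (
	"fmt"
	"strings"
)

// safeAsFileName (hypothetical helper) reports whether an envvar name can
// be used as a single file name in a flat directory under the assumptions
// stated in the lead-in.
func safeAsFileName(name string) bool {
	if name == "" || name == "." || name == ".." {
		return false
	}
	// Reject the path separator and the colon mentioned in the example.
	return !strings.ContainsAny(name, "/:")
}

func main() {
	for _, n := range []string{"PATH", "a/b", "a:b", ".."} {
		fmt.Printf("%q safe=%v\n", n, safeAsFileName(n))
	}
}
```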

Alternatives

do nothing (leave it as-is)
relax the rule, but with a long beta period where the existing rule remains the default. Ensure that the beta period doesn’t end until ValidatingAdmissionPolicy is GA and has been for 2 minor releases. Clearly document how to use a ValidatingAdmissionPolicy to get behavior equivalent to the legacy checking, and signpost people to these docs when graduating the looser validation to be the Kubernetes default.
define a label or annotation for each namespace that controls how Pod environment variables are validated in that namespace
[more complex!] add an API kind to specify the validation rules for Pods

Create a new API kind, eg PodValidationRule. It’s namespaced. Within the .spec of each object, define:
- a Pod selector
- an optional CEL validation rule for environment variable keys
- an optional CEL validation rule for environment variable values
If any of the selected validation rules don’t pass for a Pod, reject it at admission time. Set up a defaulting mechanism to Also, define how Pod templates interact with this new API (eg: you get a Warning: when you create a Deployment where the PodTemplate inside the Deployment wouldn’t pass validation)

Infrastructure Needed (Optional)

Resources: Allow zero value for Sleep Action of PreStop Hook

KEP-4818: Allow zero value for Sleep Action of PreStop Hook

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The sleep action for the PreStop container lifecycle hook was introduced in KEP 3960. However, it doesn’t accept zero as a valid value for the sleep duration in seconds. This KEP aims to add support for setting a value of zero with the sleep action of the PreStop hook.

Motivation

Currently, trying to create a container with a PreStop lifecycle hook with sleep of 0 seconds will throw a validation error like so:

Invalid value: 0: must be greater than 0 and less than terminationGracePeriodSeconds (30)

The Sleep action is implemented with the time package from Go’s standard library. The time.After() which is used to implement the sleep permits a zero sleep duration. A negative or a zero sleep duration will cause the function to return immediately and function like a no-op.

The implementation in KEP 3960 supports only non-zero values for the sleep duration. It is semantically correct to support a zero value for this field since time.After() also supports zero and negative durations. Negative values and zero have the same effect with time.After(): both return immediately. We don’t need to support negative values since they have the same effect as setting the duration to zero.

A potential use case for this behaviour is when you need a PreStop hook to be defined for the validation of your resource, but don’t really need to sleep as part of the PreStop hook. An example of this is described by a user here in the parent KEP. They add a PreStop sleep hook via an admission webhook by default if the PreStop hook is not specified by the user. In order to opt out from this, a no-op PreStop hook with a duration of zero seconds can be used.

Goals

Update the validation for the Sleep action to allow zero as a valid sleep duration.
Allow users to set a zero value for the sleep action in PreStop hooks to do a no-op.

Non-Goals

This KEP does not support adding negative values for the sleep duration.
This KEP does not aim to provide a way to pause or delay pod termination indefinitely.

Proposal

Introduce a PodLifecycleSleepActionAllowZero feature gate which is disabled by default. When the feature gate is enabled, the validateSleepAction method would allow values greater than or equal to zero as a valid sleep duration.

Since this update to the validation allows previously invalid values, care must be taken to support cluster downgrades safely. To accomplish this, the validation will distinguish between new resources and updates to existing resources:

When the feature gate is disabled:
- (a) New resources cannot set zero as the sleep duration seconds for the PreStop hook (no change to current validation).
- (b) Existing resources cannot be updated to have a sleep duration of zero seconds.
- (c) Existing resources with a PreStop sleep duration set to zero will continue to run and use a sleep duration of zero seconds. These can be updated and the zero sleep duration would continue to work.
When the feature gate is enabled:
- (d) New resources allow zero as a valid sleep duration.
- (e) Updates to existing resources will allow zero as a valid sleep duration.
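
The create/update rules above amount to a single decision: accept zero if the feature gate is on, or if the stored object already uses a zero sleep. A minimal sketch of that decision follows. The type sleepHook and the function allowZeroSleep are hypothetical stand-ins and not the kube-apiserver implementation.

```go
package main

import "fmt"

// sleepHook is a pared-down, hypothetical stand-in for a PreStop sleep action.
type sleepHook struct{ Seconds int64 }

// allowZeroSleep decides whether validation should accept a zero sleep.
// gateEnabled stands for the PodLifecycleSleepActionAllowZero feature gate.
// oldHooks holds the sleep hooks of the stored object on update and is nil
// on create.
func allowZeroSleep(gateEnabled bool, oldHooks []sleepHook) bool {
	if gateEnabled {
		return true
	}
	for _, h := range oldHooks {
		if h.Seconds == 0 {
			// The stored object already uses zero, so keep it updatable.
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("gate off, create:", allowZeroSleep(false, nil))
	fmt.Println("gate off, update of non-zero:", allowZeroSleep(false, []sleepHook{{Seconds: 5}}))
	fmt.Println("gate off, update of zero:", allowZeroSleep(false, []sleepHook{{Seconds: 0}}))
	fmt.Println("gate on, create:", allowZeroSleep(true, nil))
}
```

Keeping the "already in use" check separate from the gate is what lets objects created while the gate was on remain updatable after the gate is turned off.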

The proposed change adds another layer to the validateSleepAction function to allow zero as a valid sleep duration setting, as shown:

-func validateSleepAction(sleep *core.SleepAction, gracePeriod *int64, fldPath *field.Path) field.ErrorList {
+func validateSleepAction(sleep *core.SleepAction, gracePeriod *int64, fldPath *field.Path, opts PodValidationOptions) field.ErrorList {
 	allErrors := field.ErrorList{}
 	// We allow gracePeriod to be nil here because the pod in which this SleepAction
 	// is defined might have an invalid grace period defined, and we don't want to
 	// flag another error here when the real problem will already be flagged.
-	if gracePeriod != nil && (sleep.Seconds <= 0 || sleep.Seconds > *gracePeriod) {
-		invalidStr := fmt.Sprintf("must be greater than 0 and less than terminationGracePeriodSeconds (%d)", *gracePeriod)
-		allErrors = append(allErrors, field.Invalid(fldPath, sleep.Seconds, invalidStr))
+	if opts.AllowPodLifecycleSleepActionZeroValue {
+		if gracePeriod != nil && (sleep.Seconds < 0 || sleep.Seconds > *gracePeriod) {
+			invalidStr := fmt.Sprintf("must be non-negative and less than terminationGracePeriodSeconds (%d)", *gracePeriod)
+			allErrors = append(allErrors, field.Invalid(fldPath, sleep.Seconds, invalidStr))
+		}
+	} else {
+		if gracePeriod != nil && (sleep.Seconds <= 0 || sleep.Seconds > *gracePeriod) {
+			invalidStr := fmt.Sprintf("must be greater than 0 and less than terminationGracePeriodSeconds (%d). Please enable PodLifecycleSleepActionAllowZero feature gate if you need a sleep of zero duration.", *gracePeriod)
+			allErrors = append(allErrors, field.Invalid(fldPath, sleep.Seconds, invalidStr))
+		}
 	}
 	return allErrors
 }
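
For readers who want to experiment with the range check outside the Kubernetes tree, the logic can be reduced to a standalone function. This is a simplified sketch: validateSleepSeconds is a hypothetical name, it returns a plain error instead of a field.ErrorList, and the error text is shortened.

```go
package main

import "fmt"

// validateSleepSeconds is a simplified, hypothetical stand-in for
// validateSleepAction. A nil gracePeriod skips the check, matching the
// code comment in the diff above.
func validateSleepSeconds(seconds int64, gracePeriod *int64, allowZero bool) error {
	if gracePeriod == nil {
		return nil
	}
	if allowZero {
		if seconds < 0 || seconds > *gracePeriod {
			return fmt.Errorf("must be non-negative and less than terminationGracePeriodSeconds (%d)", *gracePeriod)
		}
		return nil
	}
	if seconds <= 0 || seconds > *gracePeriod {
		return fmt.Errorf("must be greater than 0 and less than terminationGracePeriodSeconds (%d)", *gracePeriod)
	}
	return nil
}

func main() {
	grace := int64(30)
	fmt.Println("zero, gate on:", validateSleepSeconds(0, &grace, true))
	fmt.Println("zero, gate off:", validateSleepSeconds(0, &grace, false))
	fmt.Println("negative, gate on:", validateSleepSeconds(-1, &grace, true))
	fmt.Println("nil grace period:", validateSleepSeconds(0, nil, false))
}
```

Checking gracePeriod for nil before any dereference is the important detail: the comparison against *gracePeriod must never run when the pointer is nil.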

Currently, the kubelet accepts 0 as a valid duration. There is no validation done at the kubelet level. All the validation for the duration itself is done at the kube-apiserver. The runSleepHandler in the kubelet uses the time.After() function from the time package, which supports a 0 duration input. time.After() also accepts negative values, for which it returns immediately, just as it does for zero. However, we don’t support negative values.

See the entire code changes in the WIP PR: https://github.com/kubernetes/kubernetes/pull/127094

User Stories (Optional)

Story 1

As a Kubernetes user, I want to be able to have a PreStop hook defined in my spec without needing to sleep during the execution of the PreStop hook. This no-op behaviour can be used for validation purposes with admission webhooks (Reference).

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

The change is opt-in, since it requires configuring a PreStop hook with sleep action of 0 second duration. So there is no risk beyond the upgrade/downgrade risks which are addressed in the Proposal section.

Design Details

Refer to the Proposal section.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Alpha:

Test that the runSleepHandler function returns immediately when given a duration of zero.
Test that the validation succeeds when given a zero duration with the feature gate enabled.
Test that the validation fails when given a zero duration with the feature gate disabled.
Test that the validation returns the appropriate error messages when given an invalid duration value (e.g., a negative value) with the feature gate disabled and enabled.
Unit tests for testing the disabling of the feature gate after it was enabled and the feature was used.
Unit tests for pod with zero grace period duration and zero sleep duration with zero value enabled.
Unit test for pod with nil grace period with zero value disabled
Unit test for pod with nil grace period with zero value enabled

Current coverages:

k8s.io/kubernetes/pkg/apis/core/validation : 2024-09-20 - 84.3
k8s.io/kubernetes/pkg/kubelet/lifecycle/handlers : 2024-09-20 - 86.4

Integration tests

N/A

e2e tests

Basic functionality

Create a simple pod with a container that runs a long-running process.
Add a preStop hook to the container configuration, using the new sleepAction with a sleep duration of 0.
Delete the pod and observe the time it takes for the container to terminate.
Verify that the container terminates immediately without sleeping.

Additional e2e tests for beta:

Test that pods with sleep value of 0 in PreStop hook can be created
Test that pods with sleep value of 0 in PostStart hook can be created
Test that pods with sleep value of 0 in PreStop hook can be updated
Test that pods with sleep value of 0 in PostStart hook can be updated

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial unit/e2e tests completed and enabled

Beta

Gather feedback from developers and surveys
Additional e2e tests are completed
No trouble reports from alpha release

GA

No trouble reports with the beta release, plus some anecdotal evidence of it being used successfully.

Upgrade / Downgrade Strategy

Upgrade

The previous PreStop Sleep Action behavior will not be broken. Users can continue to use their hooks as they are. To use this enhancement, users need to enable the feature gate, and set the sleep duration to zero in their PreStop hook’s sleep action.

Downgrade

If the kube-apiserver is downgraded to a version where the feature gate is not supported (<v1.32), no new resources can be created with a PreStop sleep duration of zero seconds. Existing resources created with a sleep duration of zero will continue to function.

If the feature gate is turned off after being enabled, no new resources can be created with PreStop sleep duration of zero seconds. Existing resources will continue to run and use a sleep duration of zero seconds. These resources can be updated and the zero sleep duration would continue to work.

Version Skew Strategy

Only the kube-apiserver will need to enable the feature gate for the full featureset to be present. This is because the implementation is already handled in the parent KEP #3960 . The change introduced in this KEP is only to how the validation is done. If the feature gate is disabled, the feature will not be available. The feature gate does not apply to the kubelet logic since the time.After function used by the original KEP already supports zero as a valid duration.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: PodLifecycleSleepActionAllowZero
- Components depending on the feature gate: kube-apiserver
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver with the feature-gate off. In terms of Stable versions, users can choose to opt-out by not setting the sleep field.

What happens if we reenable the feature if it was previously rolled back?

New pods with a PreStop sleep action duration of zero seconds can be created.

Are there any tests for feature enablement/disablement?

For the parent KEP, unit tests for the switch of the feature gate were added in pkg/registry/core/pod/strategy_test. We can add similar tests for the new feature gate as well.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The change is opt-in; it doesn’t impact already running workloads.

What specific metrics should inform a rollback?

I believe we don’t need a metric here since the parent KEP already has a metric to inform rollbacks. This KEP only updates the validation to allow zero value.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This is an opt-in feature, and it does not change any default behavior. I manually tested enabling and disabling this feature by changing the kube-apiserver config and restarting it in a kind cluster. The details of the expected behavior are described in the Proposal and Upgrade/Downgrade sections.

The manual test steps are as follows:

1. Create a local 1.32 k8s cluster with kind, and create a test-pod in that cluster.
2. Enable the PodLifecycleSleepActionAllowZero feature in the kube-apiserver and restart it.
3. Add a PreStop hook with a sleep action duration of zero seconds to the test-pod and delete it. Observe the time cost.
4. Create another pod with a sleep action duration of zero seconds.
5. Disable the PodLifecycleSleepActionAllowZero feature in the kube-apiserver and restart it.
6. Delete the pod created in step 4, and observe the time cost.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Inspect the preStop hook configuration and also the feature gates

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details: Check the logs of the container during termination, check the termination duration.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details: Check the logs of the container during termination, check the termination duration.

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A. This is a change to validation within the API server.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Disable PodLifecycleSleepActionAllowZero feature gate, and restart the kube-apiserver.

Implementation History

2024-09-16: Alpha KEP PR opened for v1.32
2024-10-03: Summary, Motivation and Proposal sections merged
2024-09-03: Alpha code implementation PR opened
2024-11-01: Alpha code PR merged
2024-12-11: Kubernetes v1.32 release with PodLifecycleSleepActionAllowZero in alpha stage
2025-02-06: KEP updated targeting to beta in v1.33
2025-06-11: KEP updated targeting to stable in v1.34
2025-07-20: Code implementation for GA graduation merged into k/k
2025-10-20: k/enhancements PR opened updating KEP status as implemented

Drawbacks

N/A

Alternatives

Another way to run zero duration sleep in a container is to use the exec command in preStop hook like so ["/bin/sh","-c","sleep 0"]. This requires a sleep binary in the image. Since the sleep action already exists as a PreStop hook, it is easier to allow a duration of zero seconds for the sleep action.

Infrastructure Needed (Optional)

N/A

Resources: Allows setting arbitrary FQDN as the pod's hostname

KEP-4762: Allows setting arbitrary FQDN as the pod’s hostname

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This proposal allows users to set an arbitrary Fully Qualified Domain Name (FQDN) as the hostname of a pod. It introduces a new podSpec field, hostnameOverride. Once the API is GA, the kubelet will always respect this field when it is set and will ignore the hostname and subdomain values; when it is not set, the kubelet falls back to the legacy behavior.

Motivation

This feature will allow some traditional applications to move to Kubernetes more easily. Some older services use the hostname to determine permissions or service operations. When migrating such services to k8s, the hostname restrictions of the pod complicate the migration path: when we try to give the pod a Fully Qualified Domain Name (FQDN) hostname, it always carries the cluster suffix, so it can never match the name expected by services that use DNS to match the hostname.

Goals

Allow users to set any arbitrary FQDN as pod hostname.
Write the FQDN set by the user to /etc/hosts in the pod.

Non-Goals

Add DNS records for the FQDN set by the user.

Proposal

We add a new field called hostnameOverride to podSpec, of type string. When hostnameOverride is not an empty string, its value becomes the hostname of the pod, overriding the setHostnameAsFQDN, subdomain, and hostname fields in podSpec. In that case, setHostnameAsFQDN must be nil.

User Stories (Optional)

Story 1

As a Kubernetes administrator, I want the Kerberos replication daemon (kpropd) to accurately handle hostname resolution for authentication.

In a Kubernetes environment, kpropd on the receiving end uses the hostname to determine the appropriate service credential for authentication purposes (e.g., foo-0.default.pod.cluster-local). However, on the sending side, kpropd uses the hostname it is connecting to (e.g., kdc1.example.com) to generate the cryptographic secret for secure communication. These hostnames must match to ensure that the cryptographic process can generate consistent data on both ends. Any discrepancy between these hostnames can result in authentication failure due to mismatched cryptographic data.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

The Linux kernel limits the hostname field to 64 bytes (see sethostname(2) ). If a hostname reaches this 64 byte kernel hostname limit, Kubernetes will fail to create the Pod Sandbox, causing the Pod to remain indefinitely in the ContainerCreating state.

To mitigate this issue, we will implement a validation during resource creation to check whether the value of hostnameOverride exceeds 64 bytes. Creation requests exceeding this limit will be denied.

After enabling this feature, if users use it to create a group of Pods via a Deployment or StatefulSet, multiple Pods with identical hostnames may end up on a single node. This could lead to unintended consequences, though we haven’t identified specific potential issues at this time.

Design Details

We are introducing a new feature gate called HostnameOverride. When this feature gate is enabled, users can add the hostnameOverride field in the podSpec.

The hostnameOverride field has a length limitation of 64 characters and must adhere to the DNS subdomain names standard defined in RFC 1123 .
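
The two constraints above can be sketched as a standalone check. This is an illustration under stated assumptions, not the kube-apiserver validation code: validateHostnameOverride is a hypothetical name, the regular expression is a common way to express lowercase RFC 1123 subdomains, and the per-label 63-character limit is not enforced here.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxHostnameLen is the 64-character limit quoted above.
const maxHostnameLen = 64

// dns1123Subdomain matches lowercase RFC 1123 labels joined by dots.
// Assumption: per-label length limits are left out of this sketch.
var dns1123Subdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validateHostnameOverride (hypothetical helper) checks a non-empty
// hostnameOverride value. Callers would skip it when the field is unset.
func validateHostnameOverride(v string) error {
	if len(v) > maxHostnameLen {
		return fmt.Errorf("hostnameOverride must be no more than %d bytes, got %d", maxHostnameLen, len(v))
	}
	if !dns1123Subdomain.MatchString(v) {
		return fmt.Errorf("hostnameOverride %q is not a valid RFC 1123 subdomain", v)
	}
	return nil
}

func main() {
	for _, v := range []string{"kdc1.example.com", "Bad_Name", "-leading.dash"} {
		fmt.Printf("%q -> %v\n", v, validateHostnameOverride(v))
	}
}
```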

Additionally, in the generatePodSandboxConfig method of kubelet, the pod’s hostname will always be overridden with the value of hostnameOverride, and it will be written in the pod’s /etc/hosts.

For Windows containers, we only set the container’s hostname and do not create an /etc/hosts file for it (as we have previously made it clear that we do not create an /etc/hosts file for Windows containers).

If both setHostnameAsFQDN and hostnameOverride fields are set, or if both hostNetwork and hostnameOverride fields are set, we will reject the creation of the resource and return an error indicating that these fields are mutually exclusive with the hostnameOverride field.

Based on the above design, after the KEP is implemented, we can achieve the following results.

#	`.hostname`	`.subdomain`	`.setHostnameAsFQDN`	`.hostnameOverride`	`.hostNetwork`	`$(hostname)`	`$(hostname -f)`	DNS (assuming service exists)
0						`<pod-name>`	`<pod-name>`
1	`aa`					`aa`	`aa`
2		`bb`				`<pod-name>`	`<pod-name>.bb.<ns>.svc.<zone>`	`<pod-name>.bb.<ns>.svc.<zone>`
3	`aa`	`bb`				`aa`	`aa.bb.<ns>.svc.<zone>`	`aa.bb.<ns>.svc.<zone>`
4			true			`<pod-name>`	`<pod-name>`
5	`aa`		true			`aa`	`aa`
6		`bb`	true			`<pod-name>.bb.<ns>.svc.<zone>`	`<pod-name>.bb.<ns>.svc.<zone>`	`<pod-name>.bb.<ns>.svc.<zone>`
7	`aa`	`bb`	true			`aa.bb.<ns>.svc.<zone>`	`aa.bb.<ns>.svc.<zone>`	`aa.bb.<ns>.svc.<zone>`
8				`xx.yy.zz`		`xx.yy.zz`	`xx.yy.zz`
9	`aa`			`xx.yy.zz`		`xx.yy.zz`	`xx.yy.zz`
10		`bb`		`xx.yy.zz`		`xx.yy.zz`	`xx.yy.zz`	`<pod-name>.bb.<ns>.svc.<zone>`
11	`aa`	`bb`		`xx.yy.zz`		`xx.yy.zz`	`xx.yy.zz`	`aa.bb.<ns>.svc.<zone>`
12			true	`xx.yy.zz`		INVALID	INVALID	INVALID
13	`aa`		true	`xx.yy.zz`		INVALID	INVALID	INVALID
14		`bb`	true	`xx.yy.zz`		INVALID	INVALID	INVALID
15	`aa`	`bb`	true	`xx.yy.zz`		INVALID	INVALID	INVALID
16					true	`<same-as-node>`	`<same-as-node>`
17	`aa`				true	`<same-as-node>`	`<same-as-node>`
18		`bb`			true	`<same-as-node>`	`<same-as-node>`	`<pod-name>.bb.<ns>.svc.<zone>`
19	`aa`	`bb`			true	`<same-as-node>`	`<same-as-node>`	`aa.bb.<ns>.svc.<zone>`
20			true		true	`<same-as-node>`	`<same-as-node>`
21	`aa`		true		true	`<same-as-node>`	`<same-as-node>`
22		`bb`	true		true	`<same-as-node>`	`<same-as-node>`	`<pod-name>.bb.<ns>.svc.<zone>`
23	`aa`	`bb`	true		true	`<same-as-node>`	`<same-as-node>`	`aa.bb.<ns>.svc.<zone>`
24				`xx.yy.zz`	true	INVALID	INVALID	INVALID
25	`aa`			`xx.yy.zz`	true	INVALID	INVALID	INVALID
26		`bb`		`xx.yy.zz`	true	INVALID	INVALID	INVALID
27	`aa`	`bb`		`xx.yy.zz`	true	INVALID	INVALID	INVALID
28			true	`xx.yy.zz`	true	INVALID	INVALID	INVALID
29	`aa`		true	`xx.yy.zz`	true	INVALID	INVALID	INVALID
30		`bb`	true	`xx.yy.zz`	true	INVALID	INVALID	INVALID
31	`aa`	`bb`	true	`xx.yy.zz`	true	INVALID	INVALID	INVALID

As shown in the table, setting hostnameOverride will only change the hostname inside the pod and will not modify the DNS records in Kubernetes.
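
The hostname columns of the table can be reproduced with a short function, which may help when reasoning about a specific row. This is a sketch derived only from the table: the type podNames, the function resolve, and the sample names in main are hypothetical, and DNS records are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// podNames is a pared-down, hypothetical view of the podSpec fields in the table.
type podNames struct {
	PodName, Namespace, Zone string
	Hostname, Subdomain      string
	SetHostnameAsFQDN        *bool
	HostnameOverride         string
	HostNetwork              bool
}

var errInvalid = errors.New("hostnameOverride is mutually exclusive with setHostnameAsFQDN and hostNetwork")

// resolve returns the expected output of `hostname` and `hostname -f`.
func resolve(p podNames, nodeName string) (short, fqdn string, err error) {
	if p.HostnameOverride != "" {
		if p.SetHostnameAsFQDN != nil || p.HostNetwork {
			return "", "", errInvalid // rows marked INVALID
		}
		return p.HostnameOverride, p.HostnameOverride, nil
	}
	if p.HostNetwork {
		return nodeName, nodeName, nil
	}
	base := p.Hostname
	if base == "" {
		base = p.PodName
	}
	if p.Subdomain == "" {
		return base, base, nil
	}
	fqdn = fmt.Sprintf("%s.%s.%s.svc.%s", base, p.Subdomain, p.Namespace, p.Zone)
	if p.SetHostnameAsFQDN != nil && *p.SetHostnameAsFQDN {
		return fqdn, fqdn, nil
	}
	return base, fqdn, nil
}

func main() {
	// Row 11 of the table: hostname aa, subdomain bb, hostnameOverride xx.yy.zz.
	p := podNames{PodName: "web-0", Namespace: "default", Zone: "cluster.local",
		Hostname: "aa", Subdomain: "bb", HostnameOverride: "xx.yy.zz"}
	short, fqdn, err := resolve(p, "node-1")
	fmt.Println(short, fqdn, err)
}
```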

Test Plan

Prerequisite testing updates

Unit tests

Add kubelet unit tests to verify that container hostnames are correctly generated: k8s.io/kubernetes/pkg/kubelet/kuberuntime: 2025-06-06 - 69.0%
Add API validation unit tests to ensure all field combinations yield correct results: k8s.io/kubernetes/pkg/apis/core/validation : 2025-06-06 - 84.7%

Integration tests

e2e tests

Add a conformance test to test/e2e that verifies our implementation conforms to the expectations defined in the table in the Design Details section.

Graduation Criteria

Alpha

Use the HostnameOverride feature gate to implement this feature.
Initial e2e tests completed and enabled.
- The link to the added e2e test: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/pod_hostnameoverride.go
Add documentation for feature gates.
Add a detailed table to the docs illustrating the mappings between pod hostnames and DNS records under different configurations.

Beta

Enable the feature gate by default.
Update the feature gate documentation.

GA

No issues reported during two releases.

Upgrade / Downgrade Strategy

API server should be upgraded before Kubelets. Kubelets should be downgraded before the API server.

Version Skew Strategy

The core implementation resides in kubelet.

Older kubelet versions will ignore the pod’s hostnameOverride field:
- Newly created Pods will retain previous behavior

Older apiserver versions will similarly ignore the hostnameOverride field:
- The apiserver doesn’t populate the hostnameOverride value, so newer kubelet versions will maintain legacy behavior

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: HostnameOverride
- Components depending on the feature gate: kubelet, kube-apiserver
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Using the feature gate is the only way to enable/disable this feature.

What happens if we reenable the feature if it was previously rolled back?

There will be no impact on running Pods in the cluster. This change solely affects newly created Pods. Once enabled, you can set pod hostnames by configuring the podSpec.hostnameOverride field.

Are there any tests for feature enablement/disablement?

We have added unit tests for enabling and disabling the feature gate in: pkg/kubelet/kubelet_pods_test.go#TestGeneratePodHostNameAndDomain

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

No known failure modes.

What specific metrics should inform a rollback?

The kubelet_started_pods_total metric helps determine whether enabling/disabling this feature causes abnormal pod restarts in the cluster.

The kubelet_started_pods_errors_total metric tracks whether toggling the feature results in pod startup failures.

The kubelet_restarted_pods_total metric monitors whether enabling/disabling the feature triggers restarts of Static Pods.

The run_podsandbox_errors_total metric helps detect whether enabling the feature gate and using this functionality causes sandbox container creation failures.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes. The upgrade, downgrade, and upgrade path was manually tested with a local cluster by restarting the kube-apiserver and kubelet with the HostnameOverride feature gate enabled, then disabled, then enabled again. The test verified that:

A Pod created while HostnameOverride=true uses spec.hostnameOverride as its runtime hostname.
The existing Pod keeps running and keeps its hostname after the feature gate is disabled.
A new Pod created while HostnameOverride=false ignores spec.hostnameOverride and uses the default Pod hostname.
The Pod created while the feature gate was disabled keeps running and keeps the default hostname after the feature gate is re-enabled.

The script used for the manual local-up test was:

#!/usr/bin/env bash
set -euo pipefail

export GOPATH="${GOPATH:-/root/go}"
KUBE_ROOT="$GOPATH/src/k8s.io/kubernetes"
KUBECTL="$KUBE_ROOT/_output/bin/kubectl"
POD_NAME="test-pod"
OVERRIDE_HOSTNAME="test-hostname"
DEFAULT_HOSTNAME="$POD_NAME"
FEATURE_GATES_ON="HostnameOverride=true"
FEATURE_GATES_OFF="HostnameOverride=false"

cd "$KUBE_ROOT"
export PATH="$PATH:$GOPATH/bin:$KUBE_ROOT/third_party/etcd"
export KUBECONFIG="/var/run/kubernetes/admin.kubeconfig"

kill_local_up_components() {
 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kube-apiserver" >/dev/null 2>&1 || true
 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kube-controller-manager" >/dev/null 2>&1 || true
 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kube-scheduler" >/dev/null 2>&1 || true
 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kubelet" >/dev/null 2>&1 || true
 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kube-proxy" >/dev/null 2>&1 || true
 pkill -9 -f "etcd --advertise-client-urls http://127.0.0.1:2379" >/dev/null 2>&1 || true
 pkill -9 -f "bash hack/local-up-cluster.sh" >/dev/null 2>&1 || true
 sleep 2
}

component_pid() {
 pgrep -f "^$1 " | head -n 1
}

read_cmdline() {
 local pid="$1"
 local -n out="$2"
 out=()
 while IFS= read -r -d '' arg; do
 out+=("$arg")
 done <"/proc/$pid/cmdline"
}

set_feature_gates() {
 local -n cmd="$1"
 local feature_gates="$2"
 for i in "${!cmd[@]}"; do
 if [[ "${cmd[$i]}" == --feature-gates=* ]]; then
 cmd[$i]="--feature-gates=$feature_gates"
 return
 fi
 if [[ "${cmd[$i]}" == "--feature-gates" ]]; then
 cmd[$((i + 1))]="$feature_gates"
 return
 fi
 done
 cmd+=("--feature-gates=$feature_gates")
}

start_cluster() {
 local log_file="$KUBE_ROOT/_output/hostname-override-local-up.log"
 kill_local_up_components
 rm -f "$log_file"
 FEATURE_GATES="$FEATURE_GATES_ON" LOG_LEVEL=3 hack/local-up-cluster.sh >"$log_file" 2>&1 &

 until grep -q "Local Kubernetes cluster is running" "$log_file"; do
 sleep 2
 done
}

restart_apiserver_and_kubelet() {
 local feature_gates="$1"
 local apiserver_pid kubelet_pid
 local apiserver_cmd kubelet_cmd

 apiserver_pid="$(component_pid "$KUBE_ROOT/_output/local/bin/linux/arm64/kube-apiserver")"
 kubelet_pid="$(component_pid "$KUBE_ROOT/_output/local/bin/linux/arm64/kubelet")"
 read_cmdline "$apiserver_pid" apiserver_cmd
 read_cmdline "$kubelet_pid" kubelet_cmd
 set_feature_gates apiserver_cmd "$feature_gates"
 set_feature_gates kubelet_cmd "$feature_gates"

 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kubelet" >/dev/null 2>&1 || true
 pkill -9 -f "$KUBE_ROOT/_output/local/bin/linux/arm64/kube-apiserver" >/dev/null 2>&1 || true
 sleep 2

 "${apiserver_cmd[@]}" >"/tmp/kube-apiserver-hostname-override.log" 2>&1 &
 disown "$!"
 until "$KUBECTL" version >/dev/null 2>&1; do
 sleep 1
 done

 "${kubelet_cmd[@]}" >"/tmp/kubelet-hostname-override.log" 2>&1 &
 disown "$!"
 "$KUBECTL" wait --for=condition=Ready node/127.0.0.1 --timeout=180s >/dev/null
}

apply_pod() {
 cat <<EOF | "$KUBECTL" apply -f - >/dev/null
apiVersion: v1
kind: Pod
metadata:
 name: $POD_NAME
spec:
 hostnameOverride: $OVERRIDE_HOSTNAME
 containers:
 - name: writer-container
 image: busybox
 command: ["/bin/sh", "-c", "sleep 3600"]
EOF
}

wait_for_pod() {
 "$KUBECTL" wait --for=condition=Ready "pod/$POD_NAME" --timeout=180s >/dev/null
}

expect_hostname() {
 local message="$1"
 local expected="$2"
 local actual=""
 for _ in $(seq 1 60); do
 actual="$("$KUBECTL" exec "$POD_NAME" -- hostname 2>/dev/null || true)"
 if [[ -n "$actual" ]]; then
 break
 fi
 sleep 2
 done
 if [[ "$actual" != "$expected" ]]; then
 echo "FAIL: $message: expected hostname $expected, got $actual"
 exit 1
 fi
 echo "PASS: $message: hostname=$actual"
}

print_version() {
 echo "Cluster version:"
 "$KUBECTL" version
}

start_cluster
restart_apiserver_and_kubelet "$FEATURE_GATES_ON"
print_version
"$KUBECTL" delete "pod/$POD_NAME" --ignore-not-found --wait=true >/dev/null
apply_pod
wait_for_pod
expect_hostname "HostnameOverride=true overrides pod hostname" "$OVERRIDE_HOSTNAME"

restart_apiserver_and_kubelet "$FEATURE_GATES_OFF"
wait_for_pod
expect_hostname "existing pod keeps running after HostnameOverride=false" "$OVERRIDE_HOSTNAME"
"$KUBECTL" delete "pod/$POD_NAME" --wait=true >/dev/null
apply_pod
wait_for_pod
expect_hostname "new pod does not use hostnameOverride when HostnameOverride=false" "$DEFAULT_HOSTNAME"

restart_apiserver_and_kubelet "$FEATURE_GATES_ON"
wait_for_pod
expect_hostname "pod created while HostnameOverride=false keeps running after HostnameOverride=true" "$DEFAULT_HOSTNAME"

The test result was:

Cluster version:
Client Version: v1.36.0-1349+643e407efef84a
Kustomize Version: v5.8.1
Server Version: v1.36.0-1349+643e407efef84a
PASS: HostnameOverride=true overrides pod hostname: hostname=test-hostname
PASS: existing pod keeps running after HostnameOverride=false: hostname=test-hostname
PASS: new pod does not use hostnameOverride when HostnameOverride=false: hostname=test-pod
PASS: pod created while HostnameOverride=false keeps running after HostnameOverride=true: hostname=test-pod

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Users can check which workloads are utilizing this feature with the following command:

kubectl get pods -A -o json | jq -r '.items[] | select(.spec.hostnameOverride != null) | "\(.metadata.namespace) \(.metadata.name) \(.spec.hostnameOverride)"'

How can someone using this feature know that it is working for their instance?

Users can use the following command to identify which workloads are using this feature and verify whether it is functioning as expected.

kubectl get pods -A -o json | jq -r '.items[] | select(.spec.hostnameOverride != null) | "\(.metadata.namespace) \(.metadata.name) \(.spec.hostnameOverride)"' | while IFS=' ' read -r ns pod ho; do actual=$(kubectl exec -n "$ns" "$pod" -- hostname 2>/dev/null); [ "$actual" = "$ho" ] && echo "$ns $pod $actual $ho"; done

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

If the kubelet_started_pods_errors_total metric in a cluster consistently stays at 0 before this feature is introduced, it should continue to stay at 0 after the feature is introduced.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: run_podsandbox_errors_total, kubelet_started_pods_total, kubelet_started_pods_errors_total, kubelet_restarted_pods_total
- [Optional] Aggregation method: A sharp increase in these metric values would indicate abnormal pod restarts or creation errors in the cluster caused by toggling the feature gate.
- Components exposing the metric: Kubelet
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Implementing this feature requires adding a new field to the Pod object, which will increase its size. However, we’ll limit the new field’s length to 64 bytes.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No impact on running workloads.

What are other known failure modes?

No known failure modes.

What steps should be taken if SLOs are not being met to determine the problem?

If the SLO is not being met, operators should:

Check whether the affected Pods use spec.hostnameOverride:

kubectl get pods -A -o json | jq -r '.items[] | select(.spec.hostnameOverride != null) | "\(.metadata.namespace) \(.metadata.name) \(.spec.hostnameOverride)"'

Confirm the HostnameOverride feature gate state on the kube-apiserver and kubelet. The field is accepted and used only when the feature gate is enabled in the relevant components.
Inspect kubelet metrics for the affected nodes, especially kubelet_started_pods_errors_total, run_podsandbox_errors_total, kubelet_started_pods_total, and kubelet_restarted_pods_total, and compare them with the same metrics before the feature gate was enabled or before Pods using spec.hostnameOverride were created.
Inspect affected Pod status, events, and kubelet logs to determine whether the failures are during Pod admission, sandbox creation, or container start:
```
kubectl describe pod -n <namespace> <pod>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>
```
Verify the runtime hostname for Pods that are Ready but suspected to be misconfigured:
```
kubectl exec -n <namespace> <pod> -- hostname
```
If failures correlate with enabling this feature or with Pods using spec.hostnameOverride, roll back by disabling the HostnameOverride feature gate. Existing Pods are not expected to be disrupted; newly created Pods will stop using spec.hostnameOverride while the feature gate is disabled.

Implementation History

2024-07-18: Initial draft KEP
2025-08-13: Align KEPs with implemented PRs and documentation.
2025-10-10: Promote to beta stage
2026-05-23: Promote to stable (GA) stage

Drawbacks

This is not a standard Kubernetes use case. An overridden hostname can conflict with the DNS records that the pod would otherwise have, and using it is likely to confuse users. Moreover, we are not sure how much it will help traditional services that could benefit from being migrated to Kubernetes.

Alternatives

Configure hostnameOverride via kube-apiserver:
- If the hostnameOverride field is set, the kubelet will always respect it (otherwise it will revert to the old behavior). In the defaulting or REST logic, if hostnameOverride is not set, we check the hostname, setHostnameAsFQDN, and the cluster-suffix, and write the result into hostnameOverride. If the user sets it themselves, we retain it and treat it as an override. This can ultimately simplify the kubelet, because it can remove legacy behavior, but it means teaching the kube-apiserver about the cluster-suffix. It is challenging to find an existing or graceful way to pass the kube-apiserver’s configuration options into the REST or defaulting logic.
Migrate Legacy Projects:
- Repair the traditional projects that cannot be migrated to Kubernetes, or find alternatives.
Relax hostname Validation:
- Do not add new fields. Instead, relax the validation of the hostname field in podSpec so that it accepts strings in FQDN format. When the hostname is set to an FQDN, the subdomain and setHostnameAsFQDN fields are unconditionally ignored. Alternatively, keep the current hostname and allow the default.svc.cluster.local part to be overridden or omitted. However, doing so will cause us to lose the DNS resolution records for the pod.
Custom setHostnameAsFQDN:
- Do not add new fields. Instead, allow setHostnameAsFQDN to be set to Custom so that the pod’s hostname can still meet our expectations. However, since setHostnameAsFQDN is currently a boolean, changing its type would be disruptive to the existing API.
Init Container Hostname:
- We can start an init container with privileged mode and run the command hostname mypod.fqdn.com within the init container to set the Pod’s hostname to mypod.fqdn.com. This can achieve the same goal.

Infrastructure Needed (Optional)

Resources: Anago to Krel Migration

Anago to Krel Migration

Objectives
Milestones
- First Milestone: Complete the Migration Effort
  - Open Issues
  - Acceptance Criteria
- Second Milestone: Introduce krel stage/release
  - Open Issues
Risks
Quality/Test Plan

Moving away from running bash in production in k/release

Objectives

This roadmap defines a strategy for achieving two primary goals: migrating exchangeable bits of bash code within anago to krel and creating a Golang native replacement for anago.

Milestones

Complete the code migration
Have a minimum working krel stage
Have a minimum working krel release
Remove/swap out Anago in a simple way, after completing the preceding steps

The scope and implementation details of Milestones 2-4 will become clearer as work on Milestone 1 proceeds.

Creating new features for krel is out of scope.

First Milestone: Complete the Migration Effort

Anago is still the main bash script running in GCB, and it currently calls out to krel where necessary. Many parts of the bash-based source code in k/release have already been transferred to krel (golang), and we remove the bash-based parts from the repository after each refactoring iteration.

This milestone focuses on reducing technical debt in k/release by migrating the remaining bash code into refactored golang-based implementations. This effort will lead to higher quality and provide a stable foundation for future feature developments. By “stable,” we mean that making changes will not break the entire system.

This migration will not interrupt our ability to cut releases.

Open Issues

The list of currently outlined issues, with assignees (release managers) where established:

Add krel anago subcommand to retrieve the build candidate
https://github.com/kubernetes/release/issues/1536 (TBD)

Introduce krel anago subcommand to update GitHub release
https://github.com/kubernetes/release/issues/1534 (@xmudrii)

Finish-up krel push
https://github.com/kubernetes/release/issues/1459 (@saschagrunert)

Introduce krel subcommand for pushing git objects
https://github.com/kubernetes/release/issues/1446 (TBD)

All four issues can be worked on in parallel. This is not a comprehensive list: There are still parts in Anago that can be ported from bash and that are not part of any issue yet.

Acceptance Criteria

All issues currently open will be resolved (#1534, #1536, #1446, #1459)
New code is unit-tested and code-reviewed (logical paths, not line coverage)
Direct use of the new Golang source code in production

Second Milestone: Introduce krel stage/release

In parallel to the ongoing migration (first milestone) we will introduce new krel stage and krel release subcommands. The plan is to re-evaluate the current functionality within anago and build a declarative approach to cutting releases. We can reuse the already migrated parts and use the existing logic in anago as guidance for the necessary feature set of krel stage/release.

Open Issues

The list of currently outlined issues, with assignees (release managers) where established:

Evaluate possible krel stage/release subcommands

https://github.com/kubernetes/release/issues/1551

Risks

The highest risk during the migration is that we end up breaking the current functionality, which would mean that we can no longer build releases. Immediate fixing and incremental testing between the releases should minimize this risk.

Quality/Test Plan

Merge changes to the main branch from user fork/branch as per normal community PR process. Feature branches will not be used.

Anago-replacement features must be behind a feature gate, initially ensuring they are only run in ‘mock’ mode.

Merged features can be tested in production at any time so long as they are only triggered from a mock stage or mock release or mock notify.

Non-mock testing will occur only during a release cycle’s alpha period. This gives initial test ability for non-mock paths in Sep/Oct 2020 and again in Jan/Feb 2021. Beyond Feb 2021, we will need to re-evaluate testing based on future circumstances.

During alpha periods we can A/B test, e.g. build alpha.1 with Anago, build alpha.2 with krel immediately afterwards, and compare the results.

Resources: API gzip compression support

Graduate API gzip compression to GA

Summary
Motivation
- Goals
- Non-Goals
Proposal
Graduation Criteria
Implementation History

Summary

Kubernetes sometimes returns extremely large responses to clients outside of its local network, resulting in long delays for components that integrate with the cluster in the list/watch controller pattern. Kubernetes should properly support transparent gzip response encoding, while ensuring that the performance of the cluster does not regress for small requests.

Motivation

In large Kubernetes clusters the size of protobuf or JSON responses may exceed hundreds of megabytes, and clients that are not on fast local networks or colocated with the master may experience bandwidth and/or latency issues attempting to synchronize their state with the server (in the case of custom controllers). Many HTTP servers and clients support transparent compression by use of the Accept-Encoding header, and support for gzip can reduce total bandwidth requirements for integrating with Kubernetes clusters for JSON by up to 10x and for protobuf up to 8x.

Goals

Allow standard HTTP transparent Accept-Encoding: gzip behavior to work for large Kubernetes API requests, without impacting existing Go language clients (which are already sending that header) or causing a performance regression on the Kubernetes apiservers due to the additional CPU necessary to compress small requests.

Non-Goals

Support other compression formats like Snappy due to limited client support
Compress non-API responses
Compress watch responses

Proposal

1.16

Update the existing incomplete alpha API compression to:
- Only occur on API requests
- Only occur on very large responses (>128KB)
Promote to beta and enable by default since this is a standard feature of HTTP servers
- Test at large scale to mitigate risk of regression, tune as necessary

1.17

Promote to GA

Implementation Details

Kubernetes has had an alpha implementation of transparent gzip encoding since 1.7. However, this implementation was never graduated because it caused client misbehavior and the issues were not resolved.

A review of the code showed that the prior implementation attempted to provide transparent compression globally, as an HTTP middleware component, at a much higher level than was necessary. The bugs that prevented enablement involved double compression of nested responses and failures to correctly handle flushing of lower level primitives. We do not need to GZIP compress all HTTP endpoints served by the Kubernetes API server (such as watch requests, exec requests, OpenAPI endpoints which provide their own compression). Our implementation may satisfy its goals of reducing latency for large requests if we narrowly scope compression to only those endpoints that need compression.

A further complexity is that the standard Go client library (which Kubernetes has leveraged since 1.0) always requests compression. Performance testing showed that enabling compression for all suitable API responses (objects returned via GET, LIST, UPDATE, PATCH) caused a significant performance regression in both CPU usage (2x) and tail latency (2-5x) on the Kubernetes apiservers. This is due to the additional CPU costs for performing compression, which impacts tail latency of small requests due to increased apiserver load. Since forcing all clients in the ecosystem to disable transparent compression by default is impractical and cannot be done in a gradual manner, we need to apply a more suitable heuristic than “did the client request transparent compression”. According to the HTTP spec, a server may ignore an Accept-Encoding header for any reason, which means we decide when we want to compress, not just whether we compress.

The preferred approach is to only compress responses returned by the API server when encoding objects that are large enough for compression to benefit the client but not unduly burden the server. In general, the target of this optimization is extremely large LIST responses which are usually multiple megabytes in size. These requests are infrequent (<1% of all reads) and when network bandwidth is lower than typical datacenter speeds (1 GBps) the benefit in reduced latency for clients outweighs the slightly higher CPU cost for compression.

We experimentally determined a size cut-off of 128KB for compression. This cut-off caused no regression on the Kubernetes density and load tests in either 99th percentile latency or kube-apiserver CPU usage, and it is roughly the size of 50 average pods (2.2KB each, taken from a large Kubernetes cluster with a diverse workload). Because of how Kubernetes encodes and manages responses, this implementation applies the heuristic at the place in the Kubernetes code path where we encode the body of a response from a single input []byte buffer, which removes the side-effects and unanticipated complexity in the prior implementation.

Given that this is standard HTTP server behavior and can easily be tested with unit, integration, and our complete end-to-end test suite (due to all of our clients already requesting gzip compression), there is minimal risk in rolling this out directly to beta. We suggest preserving the feature gate so that an operator can disable this behavior if they experience a regression in highly-tuned large-scale deployments.

Risks and Mitigations

The primary risk is that an operator running Kubernetes very close to the latency and tolerance limits on a very large and overloaded Kubernetes apiserver, with an unusually high percentage of large LIST queries on high bandwidth networks, would experience higher CPU use that causes them to hit a CPU limit. In practice, the cost of gzip grows sublinearly relative to the memory and CPU costs of Go memory allocation on very large serialization and deserialization, so we judge this unlikely. However, to give administrators an opportunity to react, we would preserve the feature gate and allow it to be disabled until 1.17.

Some clients may be requesting gzip and not be correctly handling gzipped responses. An effort should be made to educate client authors that this change is coming, but in general we do not consider incorrect client implementations to block implementation of standard HTTP features. The easy mitigation for many clients is to disable sending Accept-Encoding (Go is unusual in providing automatic transparent compression in the client ecosystem - many client libraries still require opt-in behavior).

Graduation Criteria

Transparent compression must be implemented in the more focused fashion described in this KEP. SIG Scalability must sign off that the chosen limit (128KB) does not cause a regression in 5000 node clusters; a regression may cause us to revise the limit up.

Implementation History

1.7 Kubernetes added alpha implementation behind disabled flag
Updated proposal with more scoped implementation for Beta in 1.16 that addresses prior issues

Resources: API Server Network Proxy

Summary

We will build an extensible system which controls network traffic from the Kube API Server. We will add a traffic egress or network proxy system. The KAS can be configured to send traffic (or not) to one or more of the proxies. Users can drop in custom proxies if the default behavior is insufficient.

Motivation

Historically, Kubernetes used SSH tunnels, but they only functioned on GCE; they were deprecated in 1.9 and removed in 1.22.

In retrospect, having an explicit level of indirection that separates user-initiated network traffic from API server-initiated traffic is a useful concept. Cloud providers want to control how API server to pod, node and service network traffic is implemented. Cloud providers may choose to run their API server (control network) and the cluster nodes (cluster network) on isolated networks. The control and cluster networks may have overlapping IP addresses. Therefore they require a non-IP routing proxy layer (SSH tunnels are an example). Adding this layer enables metadata audit logging. It allows validation of outgoing API server connections. Structuring the API server in this way is a forcing function for keeping architectural layering violations out of apiserver. In combination with a firewall, this separation of networks protects against security concerns such as Security Impact of Kubernetes API server external IP address proxying.

Goals

Delete the SSH Tunnel/Node Dialer code from Kube APIServer
Enable admins to mitigate https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ
Allow isolation of the Control network from the Cluster network

Non-Goals

Build a general purpose Proxy which does everything. (Users should build their own custom proxies with the desired behavior, based on the provided proxy.)
Handle requests from the Cluster to the Control Plane. (The proxy can be extended to do this. However that is left to the User if they want that behavior.)

Definitions

Control Plane Network: An IP reachable network space which contains the control plane components, such as Kubernetes API Server, Connectivity Proxy and ETCD server.
Node Network: An IP reachable network space which contains all the cluster’s Nodes, for alpha. Note that the Node Network may be completely disjoint from the Control Plane Network. It may have IP addresses that overlap with the Control Plane Network, or other means of network isolation. Direct IP routability between cluster and control plane networks should not be assumed. Later versions may relax the all-nodes requirement to some nodes.
KAS: Kubernetes API Server, responsible for serving the Kubernetes API to clients.
KMS: Key Management Service, plugins for secrets encryption key management.
Egress Selector: A component built into the KAS which provides a golang dialer for outgoing connection requests. The dialer provided depends on NetworkContext information.
Konnectivity Server: The proxy server which runs in the control plane network. It has a secure channel established to the cluster network. It could work on either a gRPC or HTTP Connect mechanism. If the former, it would expose a gRPC interface to the KAS to provide connectivity service. If the latter, it would use standard HTTP Connect. Formerly known as the Network Proxy Server.
Konnectivity Agent: A proxy agent which runs in the node network for establishing the tunnel. Formerly known as the Network Proxy Agent.
Flat Network: A network where traffic can be successfully routed using IP. Implies no overlapping (i.e. shared) IPs on the network.

Proposal

We will run a connectivity server inside the control plane network. It could work on either an HTTP Connect mechanism or gRPC. For the alpha version we will attempt to get this working with HTTP Connect. We will evaluate HTTP Connect for scalability, error handling and traffic types. For scalability we will be looking at the number of required open connections. Increasing usage of webhooks means we need better than 1 request per connection (multiplexing). We also need the tunnel to be tolerant of errors in the requests it is transporting. HTTP Connect only supports HTTP requests and not things like DNS requests. We assume that for HTTP URL requests, it will be the proxy which does the DNS lookup. However, this means that we cannot have the KAS perform a DNS request and then make a follow-on request. If no issues are found with HTTP Connect in these areas we will proceed with it. If an issue is found then we will update the KEP and switch the client to the gRPC solution. This should be as simple as switching the connection mode of the client code.

It may be desirable to allow out of band data (metadata) to be transmitted from the KAS to the Proxy Server. We expect to handle metadata in the HTTP Connect case using http ‘X’ headers on the Connect request. This means that the metadata can only be sent when establishing a KAS to Proxy tunnel. For the gRPC case we just update the interface to the KAS. In this case the metadata can be sent even during tunnel usage.

Each connectivity proxy allows secure connections to one or more cluster networks. Any network addressed by a connectivity proxy must be flat. Currently the only mechanism for handling overlapping IP ranges in Kubernetes is the Proxy. Routing non-IP-routable traffic past the proxy would require a non-Kubernetes mechanism.

Running the connectivity proxy in a separate process has a few advantages.

The connectivity proxy can be extended without recompiling the KAS. Administrators can run their own variants of the connectivity proxy.
Traffic can be audited or forwarded (eg. via a proprietary VPN) using a custom connectivity proxy.
The separation removes control plane <-> cluster connectivity concerns from the KAS.
The code and responsibility separation lowers the complexity of the KAS code base.
The separation reduces the impact on the KAS of issues in the connectivity proxy, such as crashes. Connectivity issues will not stop the KAS from serving API requests. This is important as serving API requests may be necessary in order to fix the crashes. A problem with a node, set of nodes or load-balancer configuration may be fixed with API requests.

The diagram shows API Server’s outgoing traffic flow. The user (in blue box), control plane network (in purple cloud) and a cluster network (in green cloud) are represented.

The user (blue) initiates communication to the KAS. The KAS then initiates connections to other components. It could be node/pod/service in cluster networks (red dotted arrow to green cloud), or etcd for storage in the same control plane network (blue arrow) or mutate the request based on an admission web-hook (red dotted arrow to purple cloud). The KAS handles these cases based on NetworkContext based traffic routing. The connectivity proxy should be able to do routing solely based on IP. The proxy should not require the NetworkContext. This means the service CIDR, node CIDR and pod CIDR of each cluster network cannot overlap.

Network Context

The minimal NetworkContext looks like the following struct in golang:

type EgressType int

const (
 // ControlPlane is the EgressType for traffic intended to go to the control plane.
 ControlPlane EgressType = iota
 // Etcd is the EgressType for traffic intended to go to Kubernetes persistence store.
 Etcd
 // Cluster is the EgressType for traffic intended to go to the system being managed by Kubernetes.
 Cluster
)

// NetworkContext is the struct used by Kubernetes API Server to indicate where it intends traffic to be sent.
type NetworkContext struct {
 // EgressSelectionName is the unique name of the
 // EgressSelectorConfiguration which determines
 // the network we route the traffic to.
 EgressSelectionName EgressType
}

EgressSelectionName specifies the network to route traffic to. The KAS starts with a list of registered konnectivity service names. These correspond to the networks we route traffic to. If the requested name is registered, the KAS knows where to proxy the traffic; otherwise it returns an “Unknown network” error.

The KAS starts with a proxy configuration like the example below. The example specifies 3 networks. “direct” specifies the KAS talking directly on the local network (no proxy). “controlplane” specifies the KAS talks to a proxy listening at 1.2.3.4:5678. “cluster” specifies the KAS talks to a proxy listening at 1.2.3.5:5679. While these are represented as resources they are not intended to be loaded dynamically. The names are not case sensitive. The KAS loads this resource list as a configuration at start time.

apiVersion: apiserver.k8s.io/v1alpha1
kind: EgressSelectorConfiguration
egressSelections:
- name: direct
  connection:
    type: direct
- name: controlplane
  connection:
    type: grpc
    url: grpc://1.2.3.4:5678
    caBundle: file1.pem
    clientKeyFile: proxy-client1.key
    clientCertFile: proxy-client1.crt
- name: cluster
  connection:
    type: grpc
    url: grpc://1.2.3.5:5679
    caBundle: file2.pem
    clientKeyFile: proxy-client2.key
    clientCertFile: proxy-client2.crt

NetworkContext could be extended to contain more contextual information. This would allow smarter routing based on the k8s object KAS is processing or which user/tenant tries to initiate the request, etc.

Proxy gRPC definition

In order to serve a proxy request, one gRPC bidirectional stream on proxy server is created to serve it. It’s a 1:1 mapping from TCP connection to a gRPC stream, so the state of TCP connection is exactly the same as the gRPC stream state.

syntax = "proto3";

service ProxyService {
  // Proxy a TCP connection to a remote address defined by ConnectParam.
  // The ConnectParam is defined in metadata under key "x-kube-net-proxy".
  // metadata["x-kube-net-proxy"] = base64.Encode(proto.Marshal(connectOptions))
  rpc Proxy(stream Payload) returns (stream Payload) {}
}

// ConnectOptions defines the remote TCP endpoint to connect
message ConnectOptions {
  string remote_addr = 1; // remote address to connect to. e.g. 8.8.8.8:53
}

// Payload defines a TCP payload.
message Payload {
  bytes data = 1;
}

Konnectivity Server

The Konnectivity Server (connectivity proxy(s)) can run in the same container as the KAS. It should run on the same machine and must run in the same flat network as the KAS. It listens on a port for gRPC connections from the KAS. This port would be for forwarding traffic to the appropriate cluster. It should have an administrative port speaking https. The administrative port serves metrics and (optional) debug/pprof handlers. It should have a health check port, serving liveness and readiness probes. The liveness probe prevents a partially broken cluster where the KAS cannot connect to the cluster. The readiness probe indicates that at least one Konnectivity Agent is connected.

Direct Connection

This connection type uses the default dialer. This allows use of the connectivity service without the connectivity proxy. This is a quick way to run the system in a “legacy” or fallback mode. Simple clusters (not needing network segregation) run this way to avoid the overhead (in latency or configuration) of the connectivity proxy.

Kubernetes API Server Outbound Requests

The majority of the KAS communication originates from incoming requests. Here we cover the outgoing requests. This is our understanding of those requests and some details as to how they fit in this model. For the alpha release we support ‘controlplane’, ‘etcd’ and ‘cluster’ connectivity service names.

ETCD: It is possible to make etcd talk via the proxy. The etcd client takes a transport (https://github.com/etcd-io/etcd/blob/main/client/internal/v2/client.go#L101 ). We will add configuration as to which proxy an etcd client should use (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/config.go ). This will add an extra process hop to our main scaling axis. We will scale test the impact and publish the results. As a precaution we will add an extra network configuration ‘etcd’ separate from ‘controlplane’, so that etcd requests can be configured separately from the rest of ‘controlplane’.
Pods/Exec, Pods/Proxy, Pods/Portforward, Pods/Attach, Pods/Log: Pod requests (and pod sub-resource requests) are meant for the cluster and will be routed based on the ‘cluster’ NetworkContext.
Nodes/Proxy: Node requests (and node sub-resource requests) are meant for the cluster and will be routed based on the ‘cluster’ NetworkContext.
Services/Proxy: Service requests (and service sub-resource requests) are meant for the cluster and will be routed based on the ‘cluster’ NetworkContext.
Admission Webhooks: Admission webhooks can be destined for a service or a URL. If destined for a service then the service rules apply (send to ‘cluster’). If destined for a URL then we will use the ‘controlplane’ NetworkContext.
Aggregated API Server (and OpenAPI requests for aggregated resources): Aggregated API Servers can be destined for a service. If destined for a service then the service rules apply.
Authentication, Authorization and Audit Webhooks: These webhooks use a kube config file to determine the destination. Given that, we use the ‘controlplane’ NetworkContext.
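
The routing rules in the list above can be summarized as a small lookup. The following is a minimal sketch, not the KAS implementation: the type names, constants, and the selectNetworkContext function are hypothetical. The text does not say where an aggregated API server that is not behind a service should be routed, so the sketch reports that case as unspecified rather than guessing.

```go
package main

import "fmt"

// networkContext names one of the alpha connectivity service names.
type networkContext string

const (
	controlPlane networkContext = "controlplane"
	etcdNetwork  networkContext = "etcd"
	clusterNet   networkContext = "cluster"
)

// requestKind enumerates the outbound request categories listed in the text.
// These identifiers are illustrative only.
type requestKind int

const (
	etcdRequest requestKind = iota
	podSubresource
	nodeProxy
	serviceProxy
	admissionWebhook
	aggregatedAPIServer
	authWebhook // authentication, authorization and audit webhooks
)

// selectNetworkContext returns the NetworkContext for an outbound request.
// targetsService only matters for webhooks and aggregated API servers.
// The boolean result is false when the text does not specify a route.
func selectNetworkContext(kind requestKind, targetsService bool) (networkContext, bool) {
	switch kind {
	case etcdRequest:
		return etcdNetwork, true
	case podSubresource, nodeProxy, serviceProxy:
		return clusterNet, true
	case admissionWebhook:
		if targetsService {
			return clusterNet, true
		}
		return controlPlane, true
	case aggregatedAPIServer:
		if targetsService {
			return clusterNet, true
		}
		return "", false // not covered by the rules above
	case authWebhook:
		return controlPlane, true
	}
	return "", false
}

func main() {
	ctx, ok := selectNetworkContext(admissionWebhook, false)
	fmt.Println(ctx, ok)
}
```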

Note: KMS is also an egress endpoint but will not be covered as egress since it only supports a Dialer using unix domain sockets (UDS). This is used for communicating between processes running on the same host. In the future, we may consider adding egressSelector support if KMS accepts other protocols.

Testing the Solution

We will test using a network namespace to partition the KAS from the test nodes. It is then impossible to connect directly from the KAS to the test nodes. This ensures that the proxy must be used for logs, exec, port forward, aggregation and webhooks. We run with this configuration and a direct configuration for these specific features. This ensures that the solution works and will continue to work.

Security

One motivation for network proxy is providing a mechanism to secure https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ . This, in conjunction with a firewall or other network isolation, fixes the security concern.

Implementation Details/Notes/Constraints

You may want to check the original design doc for alternatives and futures considered. https://goo.gl/qiARUK . Please make sure you are a member of kubernetes-dev@googlegroups.com to view the doc. It is also worth looking at https://github.com/kubernetes-sigs/apiserver-network-proxy as it contains the reference implementation of the API Server Network Proxy.

User Stories

Combined Control Plane and Node Network

Customers can run a cluster which combines the control plane and cluster networks. They set all of their connectivity configuration to direct. This bypasses the proxy and optimizes performance. For a customer with no security concerns about a combined network, this is a fairly simple, straightforward configuration.

Control Plane and Untrusted Node Network

A customer may want to isolate their control plane from their cluster network. This may be a simple separation of concerns or due to something like running untrusted workloads on the cluster network. Placing a firewall between the control plane and cluster networks accomplishes this. A few ports for the KAS public port and Proxy public port are opened between these networks. Separation of concerns minimizes the accidental interactions between the control plane and cluster networks. It minimizes bandwidth consumption on the cluster network negatively impacting the control plane. The combination of firewall and proxy minimizes the interaction between the networks to a set which can be more easily reasoned about, checked and monitored.

Control Plane and Node Networks which are not IP Routable

If control plane and cluster network CIDRs are not controlled by the same entity, then they can end up having conflicting IP CIDRs. Traffic cannot be routed between them based strictly on IP address. The connection proxy solves this issue. It also solves connectivity using a VPN tunnel. The proxy offloads the work of sending traffic to the cluster network from the KAS. The proxy gives us extensibility.

Better Monitoring

Instrumenting the network proxy requests with out-of-band data (e.g. requester identity/tradition context) enables a proxy to provide increased monitoring of requests originating from the control plane.

Design Details

Risks and Mitigations

The primary risk of this solution would seem to be some portion of the proxy or agent failing. For existing clusters which do not depend on SSH Tunnels or any of the new functionality, the mitigation would be to set all networks to direct. This should bypass the proxy and allow the system to work as it does today. For anyone using SSH Tunnels we are planning to support both for several releases.

Test Plan

The primary test plan is to set up a network namespace with a firewall dividing the control plane and cluster networks. We then run the existing tests for logs, proxy and portforward to ensure the routing works correctly. It should work with the correct configuration and fail correctly with a direct configuration. Normal tests would be run with the direct configuration to ensure the mitigation is working correctly.

Graduation Criteria

Alpha:

Feature is turned off in the KAS by default. Enabled by adding ConnectivityServiceConfiguration.
Kubernetes will not ship with a network proxy. The feature will work with the sample network proxy in https://github.com/kubernetes-sigs/apiserver-network-proxy
Demonstrate that the API Server Network Proxy eliminates the need for the SSH Tunnels.

Beta:

All Kube API Server egress points have been implemented to use the EgressSelector.
Have official releases of the Konnectivity Server and Agent reference implementations.
Have at least one OSS kube-up implementation where the feature can be turned on and demonstrated.
Have run a basic load test with egresses enabled through the Konnectivity Server to demonstrate that concurrent requests work with Admission Webhooks.
Tests for EgressSelector.
e2e test with a functioning cluster with the EgressSelector configured to use a KonnectivityService.
Add metrics and trace around the Egress Lookup/Dial code. Make sure we know how many egresses of each type are returned. Make sure we know how long we are spending dialing out.
Ensure we have metrics on each existing egress use case.
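
The metrics criteria above (how many egresses of each type are returned, and how long dialing takes) could be approached by wrapping the dial function. The sketch below is an assumption-laden illustration: dialFunc, dialStats, and instrument are hypothetical names, and a real implementation would record to the component's metrics library rather than to in-memory maps.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// dialFunc matches the common DialContext signature.
type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// dialStats keeps per-egress-type counters. Hypothetical stand-in for metrics.
type dialStats struct {
	mu       sync.Mutex
	dials    map[string]int
	failures map[string]int
	elapsed  map[string]time.Duration
}

func newDialStats() *dialStats {
	return &dialStats{
		dials:    map[string]int{},
		failures: map[string]int{},
		elapsed:  map[string]time.Duration{},
	}
}

// instrument wraps next so every dial is counted and timed under egress.
func (s *dialStats) instrument(egress string, next dialFunc) dialFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		start := time.Now()
		conn, err := next(ctx, network, addr)
		took := time.Since(start)

		s.mu.Lock()
		defer s.mu.Unlock()
		s.dials[egress]++
		s.elapsed[egress] += took
		if err != nil {
			s.failures[egress]++
		}
		return conn, err
	}
}

// runDemo exercises the wrapper with fake dialers (no real network traffic).
func runDemo() *dialStats {
	stats := newDialStats()
	ok := func(_ context.Context, _, _ string) (net.Conn, error) {
		client, server := net.Pipe()
		server.Close()
		return client, nil
	}
	refused := func(_ context.Context, _, _ string) (net.Conn, error) {
		return nil, errors.New("connection refused")
	}
	ctx := context.Background()
	if conn, err := stats.instrument("cluster", ok)(ctx, "tcp", "10.0.0.1:10250"); err == nil {
		conn.Close()
	}
	stats.instrument("cluster", refused)(ctx, "tcp", "10.0.0.2:10250")
	if conn, err := stats.instrument("etcd", ok)(ctx, "tcp", "10.0.0.3:2379"); err == nil {
		conn.Close()
	}
	return stats
}

func main() {
	s := runDemo()
	fmt.Println(s.dials, s.failures)
}
```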

Implementation History

Feature went Alpha in 1.16 with limited functionality, covering the log sub resource and communication to the etcd server.
Feature went Beta in 1.18.

Alternatives [optional]

Leave SSH Tunnels (deprecated) in the KAS. Prevents us from making the KAS cloud provider agnostic. Blocks out of tree effort.
Build equivalent functionality into the KAS. Is not extensible. Essentially has the same issues as SSH Tunnels.
Use a socks5 proxy. No standard mTLS mechanism for securing traffic. Does not actually act as a standard. More complicated implementation.

Infrastructure Needed [optional]

Anyone wishing to use this feature will need to create network proxy images/pods on the control plane and set up the EgressSelectorConfiguration. The network proxy provided is meant as a reference implementation. Users are expected to extend it for their needs.

Resources: APIServer Tracing

KEP-647: APIServer Tracing

Release Signoff Checklist
Summary
Motivation
Proposal
- User Stories
  - Steady-State trace collection
  - On-Demand trace collection
- Risks and Mitigations
Design Details
Graduation requirements
- Upgrade / Downgrade Strategy
- Version Skew Strategy
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives considered
- Introducing a new EgressSelector type
- Other OpenTelemetry Exporters

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This Kubernetes Enhancement Proposal (KEP) proposes enhancing the API Server to allow tracing requests. For this, it proposes using OpenTelemetry libraries and exporting spans in the OpenTelemetry format.

Motivation

Along with metrics and logs, traces are a useful form of telemetry to aid with debugging incoming requests. The API Server currently uses a poor-man’s form of tracing (see github.com/kubernetes/utils/trace), but we can make use of distributed tracing to improve the ease of use and enable easier analysis of trace data. Trace data is structured, providing the detail necessary to debug requests, and context propagation allows plugins, such as admission webhooks, to add to API Server requests.

Definitions

Span: The smallest unit of a trace. It has a start and end time, and is attached to a single trace.
Trace: A collection of Spans which represents a single process.
Trace Context: A reference to a Trace that is designed to be propagated across component boundaries. Sometimes referred to as the “Span Context”. It can be thought of as a pointer to a parent span that child spans can be attached to.

Goals

The API Server generates and exports spans for incoming and outgoing requests.
The API Server propagates context from incoming requests to outgoing requests.

Non-Goals

Tracing in kubernetes controllers
Replace existing logging, metrics, or the events API
Trace operations from all Kubernetes resource types in a generic manner (i.e. without manual instrumentation)
Change metrics or logging (e.g. to support trace-metric correlation)
Access control to tracing backends
Add tracing to components outside kubernetes (e.g. etcd client library).

Proposal

User Stories

Since this feature is for diagnosing problems with the Kube-API Server, it is targeted at Cluster Operators and Cloud Vendors that manage kubernetes control-planes.

For the following use-cases, I can deploy an OpenTelemetry collector as a sidecar to the API Server. I can use the API Server’s --opentelemetry-config-file flag with the default URL to make the API Server send its spans to the sidecar collector. Alternatively, I can point the API Server at an OpenTelemetry collector listening on a different port or URL if I need to.

Steady-State trace collection

As a cluster operator or cloud provider, I would like to collect traces for API requests to the API Server to help debug a variety of control-plane problems. I can set the SamplingRatePerMillion in the configuration file to a non-zero number to have spans collected for a small fraction of requests. Depending on the symptoms I need to debug, I can search span metadata to find a trace which displays the symptoms I am looking to debug. Even for issues which occur non-deterministically, a low sampling rate is generally still enough to surface a representative trace over time.

On-Demand trace collection

As a cluster operator or cloud provider, I would like to collect a trace for a specific request to the API Server. This will often happen when debugging a live problem. In such cases, I don’t want to change the SamplingRatePerMillion to collect a high percentage of requests, which would be expensive and collect many things I don’t care about. I also don’t want to restart the API Server, which may fix the problem I am trying to debug. Instead, I can make sure the incoming request to the API Server is sampled. The tooling to do this easily doesn’t exist today, but could be added in the future.

For example, to trace a request to list nodes, with traceid=4bf92f3577b34da6a3ce929d0e0e4737, no parent span, and sampled=true:

kubectl proxy --port=8080 &
curl http://localhost:8080/api/v1/nodes -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4737-0000000000000000-01"
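
The traceparent header in the curl command follows the W3C Trace Context layout of version, trace ID, parent span ID, and flags, joined by hyphens. As a hedged sketch of how tooling might build or inspect such a header, the helpers below (formatTraceparent and parseTraceparent, both hypothetical names) handle only version 00 and check field lengths, not the full set of validity rules in the specification.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// formatTraceparent builds a version-00 traceparent header value.
// traceID is 32 hex characters and parentID is 16 hex characters.
func formatTraceparent(traceID, parentID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, parentID, flags)
}

// parseTraceparent splits a header value and reports whether the sampled
// bit (the lowest bit of the flags byte) is set.
func parseTraceparent(h string) (traceID, parentID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", false, errors.New("malformed traceparent")
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return "", "", false, err
	}
	return parts[1], parts[2], flags&1 == 1, nil
}

func main() {
	h := formatTraceparent("4bf92f3577b34da6a3ce929d0e0e4737", "0000000000000000", true)
	fmt.Println(h)
	fmt.Println(parseTraceparent(h))
}
```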

Risks and Mitigations

The primary risk associated with distributed tracing is DDOS. A user that can send a large number of sampled requests can cause the server to generate a large number of spans. This is mitigated by respecting the incoming trace context only for privileged users (system:master and system:monitoring groups) and by setting SamplingRatePerMillion to a low value.

There is also a risk of memory usage incurred by storing spans prior to export. This is mitigated by limiting the number of spans that can be queued for export, and dropping spans if necessary to stay under that limit.
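
The memory mitigation described above, a bounded export queue that drops spans when full, can be illustrated with a buffered channel and a non-blocking send. This is a minimal sketch of the general technique, not the OpenTelemetry SDK's implementation; the span and spanQueue types are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// span is a placeholder for a finished span awaiting export.
type span struct{ name string }

// spanQueue holds at most cap(ch) spans; extra spans are dropped, not blocked on.
type spanQueue struct {
	ch      chan span
	dropped int64
}

func newSpanQueue(limit int) *spanQueue {
	return &spanQueue{ch: make(chan span, limit)}
}

// enqueue never blocks the request path. It returns false if the span was
// dropped because the queue is already at its limit.
func (q *spanQueue) enqueue(s span) bool {
	select {
	case q.ch <- s:
		return true
	default:
		atomic.AddInt64(&q.dropped, 1)
		return false
	}
}

// drain removes and returns everything currently queued, as an exporter
// would do before sending a batch.
func (q *spanQueue) drain() []span {
	var batch []span
	for {
		select {
		case s := <-q.ch:
			batch = append(batch, s)
		default:
			return batch
		}
	}
}

func main() {
	q := newSpanQueue(2)
	for _, name := range []string{"list-nodes", "get-pod", "watch-pods"} {
		fmt.Println(name, q.enqueue(span{name: name}))
	}
	fmt.Println(len(q.drain()), atomic.LoadInt64(&q.dropped))
}
```

The trade-off is that memory stays bounded at the cost of losing telemetry under load, which matches the mitigation stated above.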

Design Details

Tracing API Requests

We will wrap the API Server’s http server and http clients with otelhttp to get spans for incoming and outgoing http requests. This generates spans for all sampled incoming requests and propagates context with all client requests. For incoming requests, this would go below WithRequestInfo in the filter stack, as it must be after authentication and authorization, before the panic filter, and is closest in function to the WithRequestInfo filter.

Note that some clients of the API Server, such as webhooks, may make reentrant calls to the API Server. To gain the full benefit of tracing, such clients should propagate context with requests back to the API Server. One way to do this is to wrap the webhook’s http server using otelhttp, and use the request’s context when making requests to the API Server.

Webhook Example

Wrapping the http server, which ensures context is propagated from http headers to the request’s context:

mux := http.NewServeMux()
handler := otelhttp.NewHandler(mux, "HandleAdmissionRequest")

Use the context from the request in reentrant requests:

ctx := req.Context()
client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})

Note: Even though the admission controller uses the otelhttp handler wrapper, that does not mean it will emit spans. OpenTelemetry has a concept of an SDK, which manages the exporting of telemetry. If no SDK is registered, the NoOp SDK is used, which only propagates context, and does not export spans. In the webhook case in which no SDK is registered, the reentrant API call would appear to be a direct child of the original API call. If the webhook registers an SDK and exports spans, there would be an additional span from the webhook between the original and reentrant API Server call.

Note: OpenTelemetry has a concept of “Baggage”, which is akin to annotations for propagated context. If there is any additional metadata we would like to attach, and propagate along with a request, we can do that using Baggage.
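
To make the propagation path in the webhook example concrete, the following standard-library-only sketch mimics the two steps: a handler wrapper copies the incoming traceparent header into the request context, and the reentrant request copies it back out. It is not how otelhttp is implemented, and withTraceContext, newReentrantRequest, and the apiserver.invalid URL are all hypothetical; a real webhook would use otelhttp and a configured propagator instead.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ctxKey is the private context key under which the trace context is stored.
type ctxKey struct{}

// withTraceContext moves the traceparent header into the request context.
func withTraceContext(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := context.WithValue(r.Context(), ctxKey{}, r.Header.Get("traceparent"))
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// newReentrantRequest builds an outgoing request that carries ctx's trace context.
func newReentrantRequest(ctx context.Context, url string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if tp, _ := ctx.Value(ctxKey{}).(string); tp != "" {
		req.Header.Set("traceparent", tp)
	}
	return req, nil
}

// roundTripFunc lets a function act as a fake transport (no network needed).
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demo sends one request through the wrapped webhook handler and returns the
// traceparent value the fake API server saw on the reentrant call.
func demo(traceparent string) string {
	var seen string
	client := &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("traceparent")
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Header: http.Header{}, Request: r}, nil
	})}
	webhook := withTraceContext(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		req, err := newReentrantRequest(r.Context(), "http://apiserver.invalid/api/v1/pods")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		resp.Body.Close()
	}))
	in := httptest.NewRequest(http.MethodPost, "/admit", nil)
	if traceparent != "" {
		in.Header.Set("traceparent", traceparent)
	}
	webhook.ServeHTTP(httptest.NewRecorder(), in)
	return seen
}

func main() {
	fmt.Println(demo("00-4bf92f3577b34da6a3ce929d0e0e4737-0000000000000000-01"))
}
```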

Exporting Spans

This KEP proposes the use of the OpenTelemetry tracing framework to create and export spans to configured backends.

The API Server will use the OpenTelemetry exporter format, and the OTLP exporter which can export traces. This format is easy to use with the OpenTelemetry Collector, which allows importing and configuring exporters for trace storage backends to be done out-of-tree in addition to other useful features.

Running the OpenTelemetry Collector

The OpenTelemetry Collector can be run as a sidecar, a daemonset, a deployment, or a combination in which the daemonset buffers telemetry and forwards to the deployment for aggregation (e.g. tail-based sampling) and routing to a telemetry backend. To support these various setups, the API Server should be able to send traffic either to a local (on the control plane network) collector, or to a cluster service (on the cluster network).

APIServer Configuration and EgressSelectors

The API Server controls where traffic is sent using an EgressSelector, and has separate controls for ControlPlane, Cluster, and Etcd traffic. As described above, we would like to support sending telemetry to a url using the ControlPlane egress. To accomplish this, we will introduce a flag, --opentelemetry-config-file, that will point to the file that defines the opentelemetry exporter configuration. That file will have the following format:

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// TracingConfiguration provides versioned configuration for tracing clients.
type TracingConfiguration struct {
 metav1.TypeMeta `json:",inline"`

 // +optional
 // URL of the collector that's running on the control-plane node.
 // the APIServer uses the egressType ControlPlane when sending data to the collector.
 // Defaults to localhost:4317
 URL *string `json:"url,omitempty" protobuf:"bytes,1,opt,name=url"`

 // +optional
 // SamplingRatePerMillion is the number of samples to collect per million spans.
 // Defaults to 0.
 SamplingRatePerMillion *int32 `json:"samplingRatePerMillion,omitempty" protobuf:"varint,2,opt,name=samplingRatePerMillion"`
}

If --opentelemetry-config-file is not specified, the API Server will not send any spans, even if incoming requests ask for sampling.
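
The defaults documented in the struct comments (collector URL localhost:4317, sampling rate 0) can be sketched as follows. This is an illustration only: tracingConfig is a trimmed, hypothetical copy of TracingConfiguration without the metav1 fields, and the 0 to 1,000,000 range check is an assumption of this sketch rather than validation stated in the text.

```go
package main

import (
	"errors"
	"fmt"
)

// tracingConfig mirrors the two optional fields of TracingConfiguration.
type tracingConfig struct {
	URL                    *string
	SamplingRatePerMillion *int32
}

// defaultCollectorURL is the default documented in the struct comment.
const defaultCollectorURL = "localhost:4317"

// effectiveURL returns the configured collector URL or the documented default.
func effectiveURL(c tracingConfig) string {
	if c.URL == nil {
		return defaultCollectorURL
	}
	return *c.URL
}

// samplingFraction converts SamplingRatePerMillion to a fraction in [0, 1].
// An unset rate means 0, so no spans are sampled by the API Server itself.
func samplingFraction(c tracingConfig) (float64, error) {
	if c.SamplingRatePerMillion == nil {
		return 0, nil
	}
	rate := *c.SamplingRatePerMillion
	if rate < 0 || rate > 1000000 {
		// Assumed validation; the source does not state the accepted range.
		return 0, errors.New("samplingRatePerMillion must be between 0 and 1000000")
	}
	return float64(rate) / 1000000, nil
}

func main() {
	rate := int32(100)
	cfg := tracingConfig{SamplingRatePerMillion: &rate}
	fraction, err := samplingFraction(cfg)
	fmt.Println(effectiveURL(cfg), fraction, err)
}
```

With a rate of 100, roughly one request in ten thousand would be sampled, which is the kind of low steady-state rate the user stories describe.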

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

We will test tracing added by this feature with an integration test. The integration test will verify that spans exported by the apiserver match what is expected from the request.

Prerequisite testing updates

None.

Unit tests

staging/src/k8s.io/apiserver/pkg/server/options/tracing_test.go: 10/10/2021 42.6%
staging/src/k8s.io/component-base/tracing/api/v1/config_test.go: 10/10/2021 59.0%

Integration tests

test/integration/apiserver/tracing/tracing_test.go
- TestAPIServerTracingWithKMSv2: https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=integration&test=TestAPIServerTracingWithKMSv2
- TestAPIServerTracingWithEgressSelector: https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=integration&test=TestAPIServerTracingWithEgressSelector
- TestAPIServerTracing: https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=integration&test=TestAPIServerTracing

e2e tests

Not Required.

Graduation requirements

Alpha

Implement tracing of incoming and outgoing http/grpc requests in the kube-apiserver
Integration testing of tracing

Beta

Tracing 100% of requests does not break scalability tests (this does not necessarily mean trace backends can handle all the data).
- Verified in a manual run: https://github.com/kubernetes/kubernetes/pull/113695#issuecomment-1307665358 . This is not part of periodic tests, although it may be useful for debugging with a low sampling rate in the future.
OpenTelemetry reaches GA
Publish examples of how to use the OT Collector with kubernetes
Allow time for feedback
Revisit the format used to export spans.
Parity with the old text-based Traces

Publish guidelines for kubernetes components on when and how to add tracing to a component.
Graduate the TracingConfiguration component config to v1.
Define and document stability guarantees for trace instrumentation.
Add support for On-Demand trace collection as described above.

Upgrade / Downgrade Strategy

This feature is upgraded or downgraded with the API Server. It is not otherwise impacted.

Version Skew Strategy

This feature is not impacted by version skew. API Servers of different versions can each produce traces to provide observability signals independently.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: APIServerTracing
- Components depending on the feature gate: kube-apiserver
Other
- Describe the mechanism: Specify a file using the --opentelemetry-config-file API Server flag.
- Will enabling / disabling the feature require downtime of the control plane? Yes, it will require restarting the API Server.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No.

Does enabling the feature change any default behavior?

No. The feature is disabled unless both the feature gate and --opentelemetry-config-file flag are set. When the feature is enabled, it doesn’t change behavior from the users’ perspective; it only adds tracing telemetry based on API Server requests.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

What happens if we reenable the feature if it was previously rolled back?

It will start sending traces again. This will happen regardless of whether it was disabled by removing the --opentelemetry-config-file flag, or by disabling via feature gate.

Are there any tests for feature enablement/disablement?

Unit tests exist which enable the feature gate.

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

How can a rollout fail? Can it impact already running workloads?

Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?

If APIServer tracing is rolled out with a high sampling rate, it is possible for it to have a performance impact on the api server, which can have a variety of impacts on the cluster.

What specific metrics should inform a rollback?

API Server SLOs are the signals that should guide a rollback. In particular, the apiserver_request_duration_seconds and apiserver_request_slo_duration_seconds metrics would surface issues resulting in slower API Server responses.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Manually enabled the feature-gate and tracing, verified the apiserver in my cluster was reachable, and disabled the feature-gate and tracing in a dev cluster.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

How can an operator determine if the feature is in use by workloads?

This is an operator-facing feature. Look for traces to see if tracing is enabled.

How can someone using this feature know that it is working for their instance?

Look for spans. If you see them, then it is working.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes, those are being added in OpenTelemetry, and we will use them once they are present: https://github.com/open-telemetry/opentelemetry-go/issues/2547

Dependencies

This section must be completed when targeting beta graduation to a release.

Does this feature depend on any specific services running in the cluster?

The feature itself (tracing in the API Server) does not depend on services running in the cluster. However, like with other signals (metrics, logs), collecting traces from the API Server requires a trace collection pipeline, which will differ depending on the cluster. The following is an example, and other OTLP-compatible collection mechanisms may be substituted for it. The impact of outages is likely to be the same, regardless of collection pipeline.

[OpenTelemetry Collector (optional)]
- Usage description: Deploy the collector as a sidecar container to the API Server, and route traces to your backend of choice.
  - Impact of its outage on the feature: Spans will continue to be collected by the kube-apiserver, but may be lost before they reach the trace backend.
  - Impact of its degraded performance or high-error rates on the feature: Spans may be lost before they reach the trace backend.

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

Will enabling / using this feature result in any new API calls?

This will not add any additional API calls.

Will enabling / using this feature result in introducing new API types?

This will introduce an API type for the configuration. This is only for loading configuration; users cannot create these objects.

Will enabling / using this feature result in any new calls to the cloud provider?

Not directly. Cloud providers could choose to send traces to their managed trace backends, but this requires them to set up a telemetry pipeline as described above.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs ?

It will increase API Server request latency by a negligible amount (<1 microsecond) for encoding and decoding the trace context from headers, and recording spans in memory. Exporting spans is not in the critical path.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 MB of memory.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. Collecting and exporting spans does not use additional node resources even when it is failing to connect to the backend.

Troubleshooting

This section must be completed when targeting beta graduation to a release.

How does this feature react if the API server and/or etcd is unavailable?

This feature does not have a dependency on the API Server or etcd (it is built into the API Server).

What are other known failure modes?

[Trace endpoint misconfigured, or unavailable]
- Detection: No traces processed by trace ingestion pipeline
- Mitigations: None
- Diagnostics: API Server logs containing: “traces exporter is disconnected from the server”
- Testing: The feature will simply not work if misconfigured. It doesn’t seem worth verifying.

What steps should be taken if SLOs are not being met to determine the problem?

This feature will likely be useful for determining why scalability SLOs are not being met, as tracing can provide detailed latency information as described above. If tracing is suspected as the reason for SLOs not being met, it can be disabled without impacting other functionality by not setting the --opentelemetry-config-file flag.

Implementation History

Mutating admission webhook which injects trace context
Instrumentation of Kubernetes components
Instrumentation of Kubernetes components for 1/24/2019 community demo
KEP merged as provisional on 1/8/2020, including controller tracing
KEP scoped down to only API Server traces on 5/1/2020
Updated PRR section 2/8/2021

Drawbacks

Depending on the chosen sampling rate, tracing can increase CPU and memory usage by a small amount, and can also add a negligible amount of latency to API Server requests, when enabled.

Alternatives considered

Introducing a new EgressSelector type

Instead of a configuration file to choose between a url on the ControlPlane network, or a service on the Cluster network, we considered introducing a new OpenTelemetry egress type, which could be configured separately. However, we aren’t actually introducing a new destination for traffic, so it is more conventional to make use of existing egress types. We will also likely want to add additional configuration for the OpenTelemetry client in the future.

Other OpenTelemetry Exporters

This KEP suggests that we utilize the OpenTelemetry exporter format in all components. Alternative options include:

Add configuration for many exporters in-tree by vendoring multiple “supported” exporters. These exporters are the only compatible backends for tracing in kubernetes.
- This places the kubernetes community in the position of curating supported tracing backends
Support both a curated set of in-tree exporters, and the collector exporter

Resources: Apply

Apply

Summary
Motivation
- Goals
- Non-Goals
Proposal
- Implementation Details/Notes/Constraints [optional]
Production Readiness Review Questionnaire
Graduation Criteria
- Upgrade / Downgrade Strategy
Implementation History
Drawbacks
Alternatives

Summary

kubectl apply is a core part of the Kubernetes config workflow, but it is buggy and hard to fix. This functionality will be regularized and moved to the control plane.

Motivation

Example problems today:

User does POST, then changes something and applies: surprise!
User does an apply, then kubectl edit, then applies again: surprise!
User does GET, edits locally, then apply: surprise!
User tweaks some annotations, then applies: surprise!
Alice applies something, then Bob applies something: surprise!

Why can’t a smaller change fix the problems? Why hasn’t it already been fixed?

Too many components need to change to deliver a fix
Organic evolution and lack of systematic approach
- It is hard to make fixes that cohere instead of interfere without a clear model of the feature
Lack of API support meant client-side implementation
- The client sends a PATCH to the server, which necessitated strategic merge patch–as no patch format conveniently captures the data type that is actually needed.
- Tactical errors: SMP was not easy to version, fixing anything required client and server changes and a 2 release deprecation period.
The implications of our schema were not understood, leading to bugs.
- e.g., non-positional lists, sets, undiscriminated unions, implicit context
- Complex and confusing defaulting behavior (e.g., Always pull policy from :latest)
- Non-declarative-friendly API behavior (e.g., selector updates)

Goals

“Apply” is intended to allow users and systems to cooperatively determine the desired state of an object. The resulting system should:

Be robust to changes made by other users, systems, defaulters (including mutating admission control webhooks), and object schema evolution.
Be agnostic about prior steps in a CI/CD system (and not require such a system).
Have low cognitive burden:
- For integrators: a single API concept supports all object types; integrators have to learn one thing total, not one thing per operation per API object. Client-side logic should be kept to a minimum; curl should be sufficient to use the apply feature.
- For users: looking at a config change, it should be intuitive what the system will do. The “magic” is easy to understand and invoke.
- Error messages should, to the extent possible, tell users why they had a conflict, not just what the conflict was.
- Error messages should be delivered at the earliest possible point of intervention.

Goal: The control plane delivers a comprehensive solution.

Goal: Apply can be called by non-Go languages and non-kubectl clients (e.g., via curl).

Non-Goals

Multi-object apply will not be changed: it remains client side for now
Some sources of user confusion will not be addressed:
- Changing the name field makes a new object rather than renaming an existing object
- Changing fields that can’t really be changed (e.g., Service type).

Proposal

(Please note that when this KEP was started, the KEP process was much less well defined and we have been treating this as a requirements / mission statement document; KEPs have evolved into more than that.)

A brief list of the changes:

Apply will be moved to the control plane.
- The original design is in a Google Doc; joining the kubernetes-dev or kubernetes-announce list will grant permission to see it. The implementation has changed since then, so the document is mainly useful for historical understanding. The test cases and examples there are still valid.
- The original design for structured diff and merge is readable in the same way; in practice we found a better mechanism for our needs (tracking field managers), but the formalization of our schema from that document is still correct.
Apply is invoked by sending a certain Content-Type with the verb PATCH.
Instead of using a kubectl.kubernetes.io/last-applied-configuration annotation, the control plane will track a “manager” for every field.
Apply is for users and/or CI/CD systems. We modify the POST, PUT (and non-apply PATCH) verbs so that when controllers or other systems make changes to an object, they are made “managers” of the fields they change.
The things our “Go IDL” describes are formalized: structured merge and diff
Existing Go IDL files will be fixed (e.g., by fixing the directives)
Dry-run will be implemented on control plane verbs (POST, PUT, PATCH).
- Admission webhooks will have their API appended accordingly.
An upgrade path will be implemented so that version skew between kubectl and the control plane will not have disastrous results.

The linked documents should be read for a more complete picture.
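
As a rough illustration of the "certain Content-Type with the verb PATCH" item above, the following sketch builds (but does not send) such a request with Go's standard library. The media type is the one listed in the OpenAPI excerpt later in this KEP; the host, the resource path, the manager name, and the use of a fieldManager query parameter to identify the manager are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// applyContentType is the media type this KEP's OpenAPI excerpt lists for apply.
const applyContentType = "application/apply-patch+yaml"

// newApplyRequest builds, without sending, a PATCH request asking the control
// plane to apply the given manifest. baseURL, resourcePath and manager are
// illustrative inputs supplied by the caller.
func newApplyRequest(baseURL, resourcePath, manager, manifest string) (*http.Request, error) {
	q := url.Values{}
	// Assumption: the manager identity travels as a query parameter.
	q.Set("fieldManager", manager)
	target := strings.TrimRight(baseURL, "/") + resourcePath + "?" + q.Encode()

	req, err := http.NewRequest(http.MethodPatch, target, strings.NewReader(manifest))
	if err != nil {
		return nil, err
	}
	// The Content-Type is what distinguishes apply from other PATCH flavors.
	req.Header.Set("Content-Type", applyContentType)
	return req, nil
}

func main() {
	manifest := "apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: demo\ndata:\n  key: value\n"
	req, err := newApplyRequest(
		"https://kube.example.invalid", // hypothetical API server address
		"/api/v1/namespaces/default/configmaps/demo",
		"demo-manager", // hypothetical manager name
		manifest,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("Content-Type:", req.Header.Get("Content-Type"))
}
```

Because the request is plain HTTP, the same call can be reproduced from any language or from curl, which is the point of the goal stated earlier.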

Implementation Details/Notes/Constraints [optional]

(TODO: update this section with current design)

API Topology

Server-side apply has to understand the topology of the objects in order to make valid merging decisions. In order to reach that goal, some new Go markers, as well as OpenAPI extensions have been created:

Lists

Lists can behave in three main ways, depending on their semantics. New annotations allow API authors to define this behavior.

Atomic lists: The list is owned by only one person and can only be entirely replaced. This is the default for lists. It is defined either in Go IDL by prefixing the list with // +listType=atomic, or in the OpenAPI with "x-kubernetes-list-type": "atomic".
Sets: The list is a set (it has to be of a scalar type). Items in the list must appear at most once. Individual actors of the API can own individual items. It is defined either in Go IDL by prefixing the list with // +listType=set, or in the OpenAPI with "x-kubernetes-list-type": "set".
Associative lists: Kubernetes has a pattern of using lists as dictionaries, with “name” being a very common key. People can now reproduce this pattern by using // +listType=map along with // +listMapKey=name, or in the OpenAPI with "x-kubernetes-list-type": "map" along with "x-kubernetes-list-map-keys": ["name"]. Items of an associative list are owned by the person who applied the item to the list.

For compatibility with the existing markers, the patchStrategy and patchMergeKey markers are automatically used and converted to the corresponding listType and listMapKey if missing.

Maps and structs

Maps and structures can behave in two ways:

Each item in the map or field in the structure is independent of the others, and they can be changed by different actors. This is the default behavior, but can be explicitly specified with // +mapType=granular or // +structType=granular respectively. They map to the same OpenAPI extension: "x-kubernetes-map-type": "granular".
All the fields of the struct or items of the map are treated as one unit; we say the map/struct is atomic. That can be specified with // +mapType=atomic or // +structType=atomic respectively. They map to the same OpenAPI extension: "x-kubernetes-map-type": "atomic".
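
The markers above might be placed as in the following sketch. The type and field names are hypothetical and chosen only for illustration; the markers themselves are the ones named in the text.

```go
package main

import "fmt"

// Selector is treated as a single unit: one manager owns all of its fields.
//
// +structType=atomic
type Selector struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// DemoSpec is a hypothetical API type.
type DemoSpec struct {
	// Labels entries can each be owned by a different manager.
	// This is the default, spelled out explicitly here.
	// +mapType=granular
	Labels map[string]string `json:"labels,omitempty"`

	// NodeSelector is owned and replaced as one unit.
	// +mapType=atomic
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// Selector inherits atomic behavior from its type declaration.
	Selector Selector `json:"selector"`
}

func main() {
	spec := DemoSpec{
		Labels:       map[string]string{"team": "a"},
		NodeSelector: map[string]string{"disk": "ssd"},
		Selector:     Selector{Key: "app", Value: "web"},
	}
	fmt.Printf("%+v\n", spec)
}
```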

Kubectl

Server-side Apply

Since server-side apply is currently in the Alpha phase, it is not enabled by default on kubectl. To use server-side apply on servers with the feature, run the command kubectl apply --experimental-server-side ....

If the feature is not available or enabled on the server, the command will fail rather than fall back to client-side apply, because of the significant semantic differences between the two.

As the feature graduates to the Beta phase, the flag will be renamed to --server-side.

The long-term plan is for this feature to be the default apply on all Kubernetes clusters. The semantic differences between server-side apply and client-side apply will make a smooth roll-out difficult, so the best way to achieve this has not been decided yet.

Status Wiping

Current Behavior

Right before being persisted to etcd, resources in the apiserver undergo a preparation mechanism that is custom for every resource kind. It takes care of things like incrementing object generation and status wiping. This happens through PrepareForUpdate and PrepareForCreate.

The problem with status wiping at this level is that when a user applies a field that gets wiped later on, the field is still recorded as owned by that user. The apply mechanism (FieldManager) cannot know which fields get wiped for which resource and therefore cannot ignore them.

Additionally, ignoring status as a whole is not enough, as it should be possible to own status (and other fields) on some occasions. More conversation on this can be found in the GitHub issue where the problem was reported.

Proposed Change

Add an interface that resource strategies can implement, to provide field sets affected by status wiping.

// staging/src/k8s.io/apiserver/pkg/registry/rest/rest.go
// ResetFieldsProvider is an optional interface that a strategy can implement
// to expose a set of fields that get reset before persisting the object.
type ResetFieldsProvider interface {
	// ResetFieldsFor returns a set of fields for the provided version that get reset before persisting the object.
	// If no fieldset is defined for a version, nil is returned.
	ResetFieldsFor(version string) *fieldpath.Set
}

Additionally, this interface is implemented by registry.Store which forwards it to the corresponding strategy (if applicable). If registry.Store can not provide a field set, it returns nil.

An example implementation for the interface inside the pod strategy could be:

// pkg/registry/core/pod/strategy.go
// ResetFieldsFor returns a set of fields for the provided version that get reset before persisting the object.
// If no fieldset is defined for a version, nil is returned.
func (podStrategy) ResetFieldsFor(version string) *fieldpath.Set {
	set, ok := resetFieldsByVersion[version]
	if !ok {
		return nil
	}
	return set
}

// resetFieldsByVersion lists, per API version, the fields wiped before persisting.
var resetFieldsByVersion = map[string]*fieldpath.Set{
	"v1": fieldpath.NewSet(
		fieldpath.MakePathOrDie("status"),
	),
}

When creating the handlers in installer.go, the current rest.Storage is checked for whether it implements the ResetFieldsProvider interface, and the result is passed to the FieldManager.

// staging/src/k8s.io/apiserver/pkg/endpoints/installer.go
var resetFields *fieldpath.Set
if resetFieldsProvider, isResetFieldsProvider := storage.(rest.ResetFieldsProvider); isResetFieldsProvider {
	resetFields = resetFieldsProvider.ResetFieldsFor(a.group.GroupVersion.Version)
}

When provided with a field set, the FieldManager strips all resetFields from incoming update and apply requests. This causes the user/manager to not own those fields.

...
if f.resetFields != nil {
	patchObjTyped = patchObjTyped.Remove(f.resetFields)
}
...

Alternatives

We looked at a way to get the fields affected by status wiping without defining them separately, mainly by pulling the reset logic from the strategies’ PrepareForCreate and PrepareForUpdate methods into a new method ResetFields implementing an ObjectResetter interface.

This approach did not work as expected, because the strategy works on internal types while the FieldManager handles external API types. Converting between the two and creating the diff was complex and would have caused a notable number of allocations.

Implementation History

12/2019: #86083 implements a PoC for the described approach

API Audit

The ManagedFields fields of an object in the API audit log may not be very useful. We want to provide a mechanism that lets the cluster operator opt in to omitting managed fields from the audit log.

We propose the following changes to the audit.k8s.io/Policy API, which give the cluster operator a more granular way to control the omission of managed fields in the audit log:

type Policy struct {
	// +optional
	OmitManagedFields bool `json:"omitManagedFields,omitempty"`
}

type PolicyRule struct {
	// +optional
	OmitManagedFields *bool `json:"omitManagedFields,omitempty"`
}

The above API changes will be introduced in the v1, v1beta1, and v1alpha1 versions of audit.k8s.io.

A new field OmitManagedFields is added to both Policy and PolicyRule making the following possible:

Policy.OmitManagedFields sets the default policy for omitting managed fields globally.
- the default value is false, managed fields are not omitted, this retains the current behavior.
- a value of true will omit managed fields from being written to the API audit log unless PolicyRule overrides.
PolicyRule.OmitManagedFields can be used to override the global default for a particular set of request(s). It has three possible values:
- nil (default value): the cluster operator did not specify any value, the global default specified in Policy.OmitManagedFields is in effect.
- true: the cluster operator opted in to omit managed fields for a given set of request(s), and it overrides the global default.
- false: the cluster operator opted in to not omit managed fields for a given set of request(s), and it overrides the global default.
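
The override rules listed above reduce to a small resolution function. The sketch below uses trimmed-down copies of the two types and a hypothetical helper name, effectiveOmitManagedFields; it only illustrates the nil/true/false semantics and is not the audit implementation.

```go
package main

import "fmt"

// Policy and PolicyRule are reduced to the single field discussed above.
type Policy struct {
	OmitManagedFields bool
}

type PolicyRule struct {
	OmitManagedFields *bool
}

// effectiveOmitManagedFields resolves whether managed fields are omitted for
// a request matched by rule: an explicit rule value wins, and a nil rule
// value falls back to the policy-wide default.
func effectiveOmitManagedFields(policy Policy, rule PolicyRule) bool {
	if rule.OmitManagedFields != nil {
		return *rule.OmitManagedFields
	}
	return policy.OmitManagedFields
}

func main() {
	keep := false
	policy := Policy{OmitManagedFields: true}

	// Rule without an explicit value: inherits the global default.
	fmt.Println(effectiveOmitManagedFields(policy, PolicyRule{}))
	// Rule that opts back in to logging managed fields.
	fmt.Println(effectiveOmitManagedFields(policy, PolicyRule{OmitManagedFields: &keep}))
}
```

Using a pointer for the rule-level field is what lets "not specified" be distinguished from an explicit false.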

This ensures the following:

with an existing Policy object, the new version of the apiserver will maintain current behavior which is to include managed fields in audit log
the cluster operator must opt in to enable omission of managed fields

Let’s look at a few examples:

# omit managed fields for all request and all response bodies
apiVersion: audit.k8s.io/v1
kind: Policy
omitManagedFields: true
rules:
- level: RequestResponse

# omit managed fields for all request and all response bodies
# except for Pod for which we want to include managed fields in audit log
apiVersion: audit.k8s.io/v1
kind: Policy
omitManagedFields: true
rules:
- level: RequestResponse
  omitManagedFields: false
  resources: ["pods"]

- level: RequestResponse

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This section must be completed when targeting alpha to a release.

How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: ServerSideApply
  - Components depending on the feature gate: kube-apiserver
Does enabling the feature change any default behavior?

While this changes how objects are modified and then stored in the database, all the changes should be strictly backward compatible, and shouldn’t break existing automation or users. The increase in size can possibly have adverse, surprising consequences including increased memory usage for controllers, increased bandwidth usage when fetching objects, bigger objects when displaying for users (kubectl get -o yaml). We’re trying to mitigate all of these with the addition of a new header.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Also set disable-supported to true or false in kep.yaml. Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?).

Yes. The consequence is that managed fields will be reset for server-side applied objects (requiring a read/write cycle on the impacted resources).
What happens if we reenable the feature if it was previously rolled back?

The feature will be restored. Server-side applied objects will have lost their “set” which may cause some surprising behavior (fields might not be removed as expected).
Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified.

Tests are in place for upgrading from client side to server side apply and vice versa.

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?

There is no specific way that the rollout can fail. The rollout can’t impact existing workloads.
What specific metrics should inform a rollback?

The feature shouldn’t affect any existing behavior. A surprisingly high number of modification rejections could be a sign that something is not working properly.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Because the feature doesn’t affect existing behavior, rollback and upgrades haven’t been specifically tested. The feature is being used by the cluster role aggregator though. Upgrading/downgrading/upgrading, which could result in the managedFields being removed, wouldn’t cause any problems since the Rules field filled by the controller is atomic, and thus doesn’t depend on the current state of the managedFields.

The new managedFields field is cleared when it is incorrect. That protects us from having invalid data inserted by a potential bad upgrade.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g., checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.

Any existing metric split by request verb will record the APPLY verb if the feature is in use.

Additionally, the OpenAPI spec exposes the available media types for each individual endpoint. The presence of the apply type for the PATCH verb of an endpoint indicates whether the feature is enabled for that specific resource, e.g.

...
"patch": {
  "consumes": [
    "application/json-patch+json",
    "application/merge-patch+json",
    "application/strategic-merge-patch+json",
    "application/apply-patch+yaml"
  ],
  ...
}
...
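
A client could automate this check along the following lines. This is a minimal sketch that assumes the caller has already extracted the JSON of a single path item from the OpenAPI document; supportsApply and the two small structs are hypothetical helpers that decode only the fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyMediaType is the media type shown in the excerpt above.
const applyMediaType = "application/apply-patch+yaml"

// operation holds the only part of an OpenAPI operation this check needs.
type operation struct {
	Consumes []string `json:"consumes"`
}

// pathItem holds the PATCH operation of one endpoint, if it has one.
type pathItem struct {
	Patch *operation `json:"patch"`
}

// supportsApply reports whether the PATCH operation of the given path item
// lists the apply media type among the types it consumes.
func supportsApply(pathItemJSON []byte) (bool, error) {
	var item pathItem
	if err := json.Unmarshal(pathItemJSON, &item); err != nil {
		return false, err
	}
	if item.Patch == nil {
		return false, nil
	}
	for _, mediaType := range item.Patch.Consumes {
		if mediaType == applyMediaType {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	doc := []byte(`{"patch": {"consumes": [
		"application/json-patch+json",
		"application/merge-patch+json",
		"application/strategic-merge-patch+json",
		"application/apply-patch+yaml"
	]}}`)
	ok, err := supportsApply(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println("apply supported:", ok)
}
```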

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

There is no specific metric attached to server-side apply. All PATCH requests that utilize SSA will use the verb APPLY when logging metrics. API server metrics that are split by verb automatically include this. They include apiserver_request_total, apiserver_longrunning_gauge, apiserver_response_sizes, apiserver_request_terminations_total, and apiserver_selfrequest_total.
- Components exposing the metric: kube-apiserver
Apply requests (PATCH with application/apply-patch+yaml mime type) have the same level of SLIs as other types of requests.
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? n/a Apply requests (PATCH with application/apply-patch+yaml mime type) have the same level of SLOs as other types of requests.
Are there any missing metrics that would be useful to have to improve observability of this feature? n/a

Dependencies

Does this feature depend on any specific services running in the cluster? No

Scalability

Will enabling / using this feature result in any new API calls? No
Will enabling / using this feature result in introducing new API types? Describe them, providing: No
Will enabling / using this feature result in any new calls to the cloud provider? No
Will enabling / using this feature result in increasing size or count of the existing API objects? Objects applied using server side apply will have their managed fields metadata populated. managedFields metadata fields can represent up to 60% of the total size of an object, increasing the size of objects.
Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]? No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components? Since objects are larger with the new managedFields, caches as well as network bandwidth requirement will increase.

Troubleshooting

This section must be completed when targeting beta graduation to a release.

How does this feature react if the API server and/or etcd is unavailable?

The feature is part of the API server and will not function without it.
What are other known failure modes? For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
  - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node? Apply requests (PATCH with application/apply-patch+yaml mime type) have the same level of SLIs as other types of requests.
  - Mitigations: What can be done to stop the bleeding, especially for already running user workloads? This shouldn’t affect running workloads, and this feature shouldn’t alter the behavior of previously existing mechanisms like PATCH and PUT.
  - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? The feature uses very little logging, and errors should be returned directly to the user. Not required until feature graduated to beta.
  - Testing: Are there any tests for failure mode? Failure modes are tested exhaustively both as unit-tests and as integration tests.
What steps should be taken if SLOs are not being met to determine the problem? n/a

Risks and Mitigations

We used a feature branch to ensure that no partial state of this feature would be in master. We developed the new “business logic” in a separate repo for velocity and reusability.

Testing Plan

The specific logic of apply will be tested by extensive unit tests in the structured merge and diff repo. The integration between that repo and kubernetes/kubernetes will mainly be tested by integration tests in test/integration/apiserver/apply and test/cmd, as well as unit tests where applicable. The feature will also be enabled in the alpha-features e2e test suite, which runs every hour and every time someone types /test pull-kubernetes-e2e-gce-alpha-features on a PR. This will ensure that the cluster can still start up and the other endpoints will function normally when the feature is enabled.

Unit Tests in structured merge and diff repo for:

Merge typed objects of the same type with a schema.
Merge deduced typed objects without a schema (for CRDs).
Convert a typed value to a field set.
Diff two typed values.
Validate a typed value against its schema.
Get correct conflicts when applying.
Apply works for deduced typed objects.
Apply works for leaf fields with scalar values.
Apply works for items in associative lists of scalars.
Apply works for items in associative lists with keys.
Apply works for nested schemas, including recursive schemas.
Apply works for multiple appliers.
Apply works when the object conversion changes value of map keys.
Apply works when unknown/obsolete versions are present in managedFields (for when APIs are deprecated).

Unit Tests for:

Apply strips certain fields (like name and namespace) from managers.
ManagedFields API can be round tripped through the structured-merge-diff format.
Manager identifiers passed to structured-merge-diff are encoded as json.
Managers will be sorted by operation, then timestamp, then manager name.
Conflicts will be returned as readable status errors.
Fields API can be round tripped through the structured-merge-diff format.
Fields API conversion to and from the structured-merge-diff format catches errors.
Path elements can be round tripped through the structured-merge-diff format.
Path element conversion will ignore unknown qualifiers.
Path element conversion will fail if a known qualifier’s value is invalid.
Can convert both built-in objects and CRDs to structured-merge-diff typed objects.
Can convert structured-merge-diff typed objects between API versions.

Integration tests for:

Creating an object with apply works with default and custom storage implementations.
Create is blocked on apply if uid is provided.
Apply has conflicts when changing fields set by Update, and is able to force.
There are no changes to the managedFields API.
ManagedFields has no entries for managers who manage no fields.
Apply works with custom resources.
Run kubectl apply tests with server-side flag enabled.

E2E and Conformance tests will be added for GA.

Graduation Criteria

An alpha version of this is targeted for 1.14.

This can be promoted to beta when it is a drop-in replacement for the existing kubectl apply, and has no regressions (which aren’t bug fixes). This KEP will be updated when we know the concrete things changing for beta.

A GA version of this is targeted for 1.22.

E2E tests are created and graduate to conformance
Apply for client-go’s typed client is implemented and at least one kube-controller-manager uses that client
Outstanding bugs around status wiping and scale subresource are fixed

Upgrade / Downgrade Strategy

Upgrade from kubectl Client-Side to Server-Side Apply

With client-side kubectl apply, the annotation kubectl.kubernetes.io/last-applied-configuration tracks ownership for a single shared field manager. With server-side kubectl apply --server-side, the .metadata.managedFields field tracks ownership for multiple field managers.

Users who wish to start using server-side apply for objects managed with client-side apply would encounter a field manager conflict: the field set that the user now wants to manage with server-side apply will be owned by the client-side apply field manager.

If we don’t specifically handle this case, then users would need to force conflicts with kubectl apply --server-side --force-conflicts. This extra step is not desirable for users who wish to onboard to server-side apply.

However we know that users’ intent is to take ownership of client-side apply fields when upgrading, which we can do for them while avoiding the conflict.

Avoiding Conflicts from Client-Side Apply to Server-Side Apply

We’ll use the kubectl user-agent and the client-side apply last-applied-configuration annotation to identify when to do the upgrade.

When server-side apply is run with kubectl apply --server-side on an object with a last-applied-configuration annotation for client-side apply, then the annotation will be upgraded to the managed fields server-side apply notation.

To upgrade the last-applied-configuration annotation, the following procedure will be used.

Identify if the server-side apply is from the kubectl user-agent
Identify if the server-side apply would result in a conflict
Create a fieldset from the last-applied-configuration annotation.
Remove all fields from the last-applied-configuration annotation that are added, missing, or different from the corresponding field of the live object. Because the fields have changed, client-side apply does not own them.
Compare the “last-applied” fieldset to the conflict fieldset. Take the difference as the new conflict fieldset. If the conflict fieldset is empty, then the conflicts are allowed and we force the server-side apply. If the conflict fieldset is not empty, then return the conflict fieldset.

Downgrade from kubectl Server-Side to Client-Side Apply

Client-side kubectl apply users can incrementally upgrade to a version of kubectl that can send a server-side apply.

We can sync the intent between server-side and client-side apply by keeping the last-applied-configuration annotation up-to-date with the .managedFields field.

Client-side apply will continue to work.

Downgrade the API Server

When the API server is downgraded to a version with server-side apply disabled, the .metadata.managedFields field will be cleared, since the API server doesn’t know about this field. A server-side apply will fail with a content-type unknown error.

A client-side apply would succeed because the last-applied-configuration annotation is preserved and up-to-date as described in the downgrade above.

Implementation History

Early 2018: @lavalamp begins thinking about apply and writing design docs
2018Q3: Design shift from merge + diff to tracking field managers.
2019Q1: Alpha.
2019Q3: Beta.

(For more details, one can view the apply-wg recordings, or join the mailing list and view the meeting notes. TODO: links)

Drawbacks

Why should this KEP not be implemented: many bugs in kubectl apply will go away. Users might be depending on the bugs.

Alternatives

It’s our belief that all routes to fixing the user pain involve centralizing this functionality in the control plane.

Resources: Appropriate use of node-role labels

Appropriate use of node-role labels

Release Signoff Checklist
Summary
Motivation
- Goals
Proposal
- Use of node-role.kubernetes.io/* labels
- Current users of node-role.kubernetes.io/* within the project that must change
Design Details
Production Readiness Review Questionnaire
Implementation History
Future work
Reference

Release Signoff Checklist

ACTION REQUIRED: In order to merge code into a release, there must be an issue in kubernetes/enhancements referencing this KEP and targeting a release milestone before Enhancement Freeze of the targeted release.

These checklist items must be updated for the enhancement to be released.

kubernetes/enhancements issue in release milestone, which links to KEP: https://github.com/kubernetes/enhancements/issues/1143
KEP approvers have set the KEP status to implementable
Design details are appropriately documented
Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
Graduation criteria is in place
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Note: Any PRs to move a KEP to implementable or significant changes once it is marked implementable should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

Note: This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

Summary

Clarify that the node-role.kubernetes.io/* label is for use only by users and external projects and may not be used to vary Kubernetes behavior. Define migration process for all internal consumers of these labels.

Motivation

The node-role.kubernetes.io/master and the broader node-role.kubernetes.io namespace for labels were introduced to provide a simple organizational and grouping convention for cluster users. The labels were reserved solely for organizing nodes via a convention that tools could recognize to display information to end users, and for use by opinionated external tooling that wished to simplify topology concepts. Use of the label by components within the Kubernetes project (those projects subject to API review) was restricted. Specifically, no project could mandate the use of those labels in a conformant distribution, since we anticipated that many deployments of Kubernetes would have more nuanced control-plane topologies than simply “a control plane node”.

Over time, several changes to Kubernetes core and related projects were introduced that depended on the node-role.kubernetes.io/master label to vary their behavior in contravention to the guidance the label was approved under. This was unintentional and due to unclear reviewer guidelines that have since been more strictly enforced. Likewise, the complexity of Kubernetes deployments has increased and the simplistic mapping of control plane concepts to a node has proven to limit the ability of conformant Kubernetes distributions to self-host, as anticipated. The lack of clarity in how to use node-role and the disjoint mechanisms within the code has been a point of confusion for contributors that we wish to remove.

Finally, we wish to clarify that external components may use node-role tolerations and labels as they wish as long as they are cognizant that not all conformant distributions will expose or allow those tolerations or labels to be set.

Goals

This KEP:

Clarifies that the use of the node-role.kubernetes.io/* label namespace is reserved solely for end-user and external Kubernetes consumers, and:
- Must not be used to vary behavior within Kubernetes projects that are subject to API review (kubernetes/kubernetes and all components that expose APIs under the *.k8s.io namespace)
- Must not be required to be present for a cluster to be conformant
Describes the locations within Kubernetes that must be changed to use an alternative mechanism for behavior
- Suggests approaches for each location to migrate
Describes the timeframe and migration process for Kubernetes distributions and deployments to update labels

Proposal

Use of node-role.kubernetes.io/* labels

Kubernetes components MUST NOT set or alter behavior on any label within the node-role.kubernetes.io/* namespace.
Kubernetes components (such as kubectl) MAY simplify the display of node-role.kubernetes.io/* labels to convey the node roles of a node
Kubernetes examples and documentation MUST NOT leverage the node-role labels for node placement
External users, administrators, conformant Kubernetes distributions, and extensions MAY use node-role.kubernetes.io/* without reservation
- Extensions are recommended not to vary behavior based on node-role, but MAY do so as they wish
First party components like kubeadm MAY use node-roles to simplify their own deployment mechanisms.
Conformance tests MUST NOT depend on the node-role labels in any fashion
Ecosystem controllers that desire to be placed on the masters MAY tolerate the node-role master taint or set nodeSelector to the master nodes in order to be placed, but SHOULD recognize that some deployment models will not have these node-roles, or may prohibit deployments that attempt to schedule to masters as unprivileged users. In general we recommend limiting this sort of placement rule to examples, docs, or simple deployment configurations rather than embedding the logic in code.

Current users of `node-role.kubernetes.io/*` within the project that must change

The following components vary behavior based on the presence of the node-role labels:

Service load-balancer

The service load balancer implementation previously implemented a heuristic where node-role.kubernetes.io/master is used to exclude masters from the candidate nodes for a service. This is an implementation detail of the cluster and is not allowed. Since there is value in excluding nodes from service load balancer candidacy in some deployments, an alpha feature gated label alpha.service-controller.kubernetes.io/exclude-balancer was added in Kubernetes 1.9.

This label should be moved to beta in Kube 1.19 at its final name node.kubernetes.io/exclude-from-external-load-balancers, its feature gate ServiceNodeExclusion should default on in 1.19, the gate ServiceNodeExclusion should be declared GA in 1.21, and the gate will be removed in 1.22. The old alpha label should be honored in 1.21 and removed in 1.22.

Starting in 1.16 the legacy code block should be gated on LegacyNodeRoleBehavior=true

Node controller excludes master nodes from consideration for eviction

The k8s.io/kubernetes/pkg/util/system/IsMasterNode(nodeName) function is used by the NodeLifecycleController to exclude nodes with a node name that ends in master or starts with master- when considering whether to mark nodes as disrupted. A recent PR attempted to change this to use node-roles and was blocked. Instead, the controller should be updated to use a label node.kubernetes.io/exclude-disruption to decide whether to exclude nodes from being considered for disruption handling.

Kubernetes e2e tests

The e2e tests use a number of heuristics including the IsMasterNode(nodeName) function and the node-roles labels to select nodes. In order for conformant Kubernetes clusters to run the tests, the e2e suite must change to use individual user-provided label selectors to identify nodes to test, nodes that have special rules for testing unusual cases, and for other selection behaviors. The label selectors may be defaulted by the test code to their current values, as long as a conformant cluster operator can execute the e2e suite against an arbitrary cluster.

The IsMasterNode() method will be moved to be test specific, identified as deprecated, and will be removed as soon as possible.

QUESTION: Is a single label selector sufficient to identify nodes to test?

Preventing accidental reintroduction

In order to prevent reviewers from accidentally allowing code changes that leverage this functionality, we should clarify the Godoc of the constant to limit its use. A lint process could be run as part of verify that requires approval of a small list to modify exclusions (currently only cmd/kubeadm will be allowed to use that constant, with all test functions being abstracted). The review doc should call out that labels must be scoped to a particular feature enablement vs being broad.

Some components like the external cloud provider controllers (considered to fall within these rules due to implementing k8s.io APIs) may be vulnerable to accidental assumptions about topology - code review and e2e tests are our primary mechanism to prevent regression.

Design Details

Migrating existing deployments

The proposed fixes will all require deployment-level changes. That must be staged across several releases, and it should be possible for deployers to move early and “fix” the issues that may be caused by their topology.

Therefore, for each change we recommend the following process to adopt the new labels in successive releases:

Release 1 (1.16):
- Introduce a feature gate for disabling node-role being honored. The gate defaults to on. LegacyNodeRoleBehavior=true
- Define the new node label with an associated feature gate for each feature area. The gate defaults to off. ServiceNodeExclusion=false and NodeDisruptionExclusion=false
- Behavior for each functional area is defined as (LegacyNodeRoleBehavior == on && node_has_role) || (FeatureGate == on && node_has_label)
- No new components may leverage node-roles within Kubernetes projects.
- Early adopters may label their nodes to opt in to the features, even in the absence of the gate.
Release 2 (1.17):
- The legacy alpha label alpha.service-controller.kubernetes.io/exclude-balancer is marked as deprecated
- Deprecation of node role behavior in tree is announced for 1.21, with a detailed plan for cluster administrators and deployers
- Gates are officially alpha
Release 3 (1.19):
- The old label alpha.service-controller.kubernetes.io/exclude-balancer is removed
- For both labels, usage is reviewed and as appropriate the label is declared beta/GA and the feature gate is set on
- All Kubernetes deployments should be updated to add node labels as appropriate: kubectl label nodes -l node-role.kubernetes.io/master LABEL_A=VALUE_A
- Documentation will be provided on making the transition
- Deployments may set LegacyNodeRoleBehavior=false after they have set the appropriate labels.
Release 4 (1.21):
- Default the legacy gate LegacyNodeRoleBehavior to off. Admins whose deployments still use the old labels may set LegacyNodeRoleBehavior=true during 1.19 to get the legacy behavior.
- Deployments should stop setting LegacyNodeRoleBehavior=false if they opted out early.
Release 5 (1.22):
- The LegacyNodeRoleBehavior gate and all feature-level gates are removed; components that attempt to set these gates will fail to start.
- Code that references node-roles within Kubernetes will be removed.

In Release 5 (which could be as early as 1.21) this KEP will be considered complete.
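
The Release 1 rule, (LegacyNodeRoleBehavior == on && node_has_role) || (FeatureGate == on && node_has_label), can be sketched in Go for the service load balancer case. This is a minimal illustration, not the in-tree code: the gates struct and the excludedFromLoadBalancer function are hypothetical names, and treating a label as set whenever its key is present (regardless of value) is an assumption of the sketch.

```go
package main

import "fmt"

// Label keys taken from the text of this KEP.
const (
	legacyMasterRoleLabel = "node-role.kubernetes.io/master"
	excludeLBLabel        = "node.kubernetes.io/exclude-from-external-load-balancers"
)

// gates is a hypothetical stand-in for the real feature-gate lookup.
type gates struct {
	LegacyNodeRoleBehavior bool
	ServiceNodeExclusion   bool
}

// excludedFromLoadBalancer applies the transition rule:
// (legacy gate on && node has role) || (feature gate on && node has label).
// Assumption: only the presence of the label key matters, not its value.
func excludedFromLoadBalancer(g gates, nodeLabels map[string]string) bool {
	_, hasRole := nodeLabels[legacyMasterRoleLabel]
	_, hasLabel := nodeLabels[excludeLBLabel]
	return (g.LegacyNodeRoleBehavior && hasRole) || (g.ServiceNodeExclusion && hasLabel)
}

func main() {
	master := map[string]string{legacyMasterRoleLabel: ""}
	release1 := gates{LegacyNodeRoleBehavior: true, ServiceNodeExclusion: false}
	release4 := gates{LegacyNodeRoleBehavior: false, ServiceNodeExclusion: true}

	// Legacy role is still honored while the legacy gate is on.
	fmt.Println(excludedFromLoadBalancer(release1, master))
	// Legacy gate off and no new label yet: the node is no longer excluded.
	fmt.Println(excludedFromLoadBalancer(release4, master))
	// After the administrator adds the new label, exclusion is restored.
	master[excludeLBLabel] = "true"
	fmt.Println(excludedFromLoadBalancer(release4, master))
}
```

The middle case shows why nodes should be labeled before the legacy gate defaults to off: a master that carries only the node-role label silently becomes a load balancer candidate.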

Instructions for deployers

The current behavior of the node-role.kubernetes.io/master label on nodes preventing them from being part of service load balancers or from being disrupted when NotReady is deprecated and will be fully removed in Kubernetes 1.20. Administrators and Kubernetes deployers should follow these steps.

If you are using the alpha.service-controller.kubernetes.io/exclude-balancer label in your deployments to exclude specific nodes from your deployment, the label has been replaced in 1.17 with node.kubernetes.io/exclude-from-external-load-balancers. All administrators should run the following command before upgrading to Kubernetes 1.18 and set the feature gate ServiceNodeExclusion=true:

kubectl label nodes --selector=alpha.service-controller.kubernetes.io/exclude-balancer \
node.kubernetes.io/exclude-from-external-load-balancers=true

Cluster deployers that rely on the existing behavior where master nodes are not part of the service load balancer and master workloads will not be evicted if the master is NotReady for longer than the grace period should run the following command after upgrading to Kubernetes 1.18:

kubectl label nodes --selector=node-role.kubernetes.io/master \
node.kubernetes.io/exclude-from-external-load-balancers=true \
node.kubernetes.io/exclude-disruption=true

After setting these labels in 1.18, administrators will need to take no further action.

Cluster deployers that wish to manage this migration during the 1.17 to 1.18 upgrade should label nodes and set feature gates before upgrading to 1.18. If LegacyNodeRoleBehavior=false is set, it must be removed prior to the 1.21 to 1.22 upgrade.

Test Plan

Unit tests to verify selection using feature gates

Graduation Criteria

New labels and feature flags become beta after one release, become GA and defaulted on after two, and are removed two releases after they are defaulted on (so 4 releases from when this is first implemented).
Documentation for migrating to the new labels is available in 1.18.

Upgrade / Downgrade Strategy

As described in the migration process, deployers and administrators have 2 releases to migrate their clusters.

Version Skew Strategy

Controllers are updated after the control plane, so consumers must update the labels on their nodes before they update controller processes in 1.21.

Production Readiness Review Questionnaire

Feature enablement and rollback

How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: LegacyNodeRoleBehavior, ServiceNodeExclusion
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager, cloud controller managers
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?

Yes

What happens if we reenable the feature if it was previously rolled back?

The old behavior is present.

Are there any tests for feature enablement/disablement?

Yes

Rollout, Upgrade and Rollback Planning

Covered in migration strategy.

Monitoring requirements

How can an operator determine if the feature is in use by workloads?

Not applicable to workloads

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Not applicable

What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

Not applicable

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Not applicable

What are other known failure modes?

Not applicable

What steps should be taken if SLOs are not being met to determine the problem?

Not applicable

Implementation History

2019-07-16: Created
2020-04-15: Labels promoted to beta in 1.19 in https://github.com/kubernetes/kubernetes/pull/90126
2020-06-01: Updated for 1.19 with details of production readiness
2021-01-06: GA in 1.21 and marked to be removed in 1.22

Future work

This proposal touches on the important topic of scheduling policy - the ability of clusters to restrict where arbitrary workloads may run - by noting that some conformant clusters may reject attempts to schedule onto masters. This is out of scope of this KEP except to indicate that node-role use by ecosystem components may conflict with future enhancements in this area.

Reference

Resources: Artifact Distribution Policy


KEP 3000: Image Promotion and Distribution Policy

Summary
Why a new domain?
How can we help?
Proposal
What exactly are you doing?
Alternatives / Background
- How much is this going to save us?

Summary

For a few years now, we have been using k8s.gcr.io in all our repositories as the default registry from which to download images.

Distributing Kubernetes comes at great cost, nearing $150k USD/month in donations (mostly egress).

Additionally, some of our community members are unable to access the official release container images due to country-level firewalls that do not let them connect to Google services.

Ideally we can dramatically reduce cost and allow everyone in the world to download the container images released by our community.

We are now used to using the image promoter process to promote images to the official Kubernetes container registry using the infrastructure (GCR staging repos, etc.) provided by sig-k8s-infra.

Why a new domain?

So far we (the whole Kubernetes project) have been using GCP as our default infrastructure provider for everything: GCS, GCR, GKE-based prow clusters, etc. Google has graciously sponsored a lot of our infrastructure costs as well. However, for about a year our costs have been skyrocketing because community usage of this infrastructure has been coming from other cloud providers such as AWS and Azure. So, in conjunction with CNCF staff, we are putting together a plan to host copies of images and binaries nearer to where they are used rather than incur cross-cloud costs.

One part of this plan is to set up a redirecting web service that can identify where the traffic is coming from and redirect to the nearest image layer/repository. This is why we are setting up a new service using what we call an oci-proxy for everyone to use. This redirector will identify traffic coming from, for example, a certain AWS region, then will set up an HTTP redirect to a source in that AWS region. If we get traffic from GKE/GCP or we don’t know where the traffic is coming from, it will still redirect to the current infrastructure (k8s.gcr.io).

How can we help?

When Kubernetes master opens up for v1.25 development, we need to update all default urls in our code and test harness to the new registry url. As a team sig-k8s-infra is signing up to ensure that this oci-proxy based registry.k8s.io will be as robust and available as the current setup. As a backup, we will continue to run the current k8s.gcr.io as well. So do not worry about that going away. Turning on traffic to the new url will help us monitor and fix things if/when they break and we will be able to tune traffic and lower our costs of operation.

Goals

A policy and procedure for use by SIG Release to promote container images to multiple registries and mirrors.

A solution to allow redirection to appropriate mirrors to lower cost and allow access from any cloud or country globally.

Non-Goals

Anything related to creation of artifacts, bom, staging buckets.

What is not in scope

Currently we focus on AWS only. We are getting a lot of help from AWS in terms of technical details as well as targeted infrastructure costs for standing up and running this infrastructure

What are good goals to shoot for

In terms of cost reduction, monitor GCP infrastructure and get to the point where we fully avoid serving large binary image layers from GCR/GCS
We can add other AWS regions and clouds as needed in a well-known, documented way
Seamless transition for the community from the old k8s.gcr.io to registry.k8s.io with same rock solid stability as we now have with k8s.gcr.io

Proposal

There are two intertwined concepts that are part of this proposal.

First, the policy and procedures to promote/upload our container images to multiple providers. Our existing processes upload only to GCS buckets. Ideally we extend the existing software/promotion process to push directly to multiple providers. Alternatively we use a second process to synchronize container images from our existing production buckets to similar constructs at other providers.

Additionally we require a registry and artifact url-redirection solution to the local cloud provider or country.

What exactly are you doing?

We are setting up an AWS account with an IAM role and s3 buckets in AWS regions where we see a large percentage of source image pull traffic
We will iterate on a sandbox url (registry-sandbox.k8s.io) for our experiments and ONLY promote things to (registry.k8s.io) when we have complete confidence
Both registry and registry-sandbox are serving traffic using oci-proxy on Google Cloud Run
oci-proxy will be updated to identify incoming traffic from AWS regions based on IP ranges so we can route traffic to s3 buckets in that region. If a specific AWS region does not currently host s3 buckets, we will redirect to the nearest region which does have s3 buckets (tradeoff between storage and network costs)
We will bulk sync existing image layers to these s3 layers as a starting point (from GCS/GCR)
We will update image-promoter to push to these s3 buckets as well in addition to the current setup
We will set up monitoring/reporting to check on new costs we incur on the AWS infrastructure and update what we do in GCP infrastructure as well to include the new components
We will have a plan in place on how we could add additional AWS regions in the future
We will have CI jobs that will run against registry-sandbox.k8s.io as well to monitor stability before we promote code to registry
We will automate the deployment/monitoring and testing of code landing in the oci-proxy repository

registry.k8s.io request handling

Requests to registry.k8s.io follow this flow:

If it’s a request for /: redirect to our wiki page about the project
If it’s not a request for / and does not start with /v2/: 404 error
For registry API requests, all of which start with /v2/:

If it’s not a blob request: redirect to Upstream Registry
If it’s not a known AWS IP: redirect to Upstream Registry
If it’s a known AWS IP AND HEAD request for the layer succeeds in S3: redirect to S3
If it’s a known AWS IP AND HEAD fails: redirect to Upstream Registry

Currently the Upstream Registry is https://k8s.gcr.io .
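
The request flow above can be sketched as a single routing function in Go. This is an illustration under stated assumptions, not the oci-proxy source: the deps struct, the route and isBlobRequest functions, the two placeholder URLs, the choice of a 307 redirect, and the idea that S3 objects mirror the request path are all hypothetical. Only the upstream registry URL comes from the text.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const (
	upstreamRegistry = "https://k8s.gcr.io"                 // from the text
	infoPageURL      = "https://example.invalid/about"      // placeholder, hypothetical
	s3BaseURL        = "https://example-bucket.s3.invalid"  // placeholder, hypothetical
)

// deps holds the lookups the proxy needs; both are assumptions of this sketch.
type deps struct {
	isKnownAWSIP func(ip string) bool
	blobInS3     func(path string) bool // stands in for the HEAD request against S3
}

// isBlobRequest reports whether path has the form /v2/<name>/blobs/<digest>.
func isBlobRequest(path string) bool {
	parts := strings.Split(path, "/")
	return len(parts) >= 5 && parts[1] == "v2" && parts[len(parts)-2] == "blobs"
}

// route returns the HTTP status and redirect target (empty for a 404).
func route(d deps, path, clientIP string) (int, string) {
	switch {
	case path == "/":
		return http.StatusTemporaryRedirect, infoPageURL
	case !strings.HasPrefix(path, "/v2/"):
		return http.StatusNotFound, ""
	case !isBlobRequest(path):
		return http.StatusTemporaryRedirect, upstreamRegistry + path
	case !d.isKnownAWSIP(clientIP):
		return http.StatusTemporaryRedirect, upstreamRegistry + path
	case d.blobInS3(path):
		return http.StatusTemporaryRedirect, s3BaseURL + path
	default: // known AWS IP, but the layer is missing from S3
		return http.StatusTemporaryRedirect, upstreamRegistry + path
	}
}

func main() {
	d := deps{
		isKnownAWSIP: func(ip string) bool { return ip == "198.51.100.7" },
		blobInS3:     func(p string) bool { return strings.HasSuffix(p, "sha256:aaa") },
	}
	for _, req := range [][2]string{
		{"/healthz", "198.51.100.7"},
		{"/v2/pause/manifests/3.9", "198.51.100.7"},
		{"/v2/pause/blobs/sha256:aaa", "198.51.100.7"},
		{"/v2/pause/blobs/sha256:bbb", "198.51.100.7"},
		{"/v2/pause/blobs/sha256:aaa", "203.0.113.9"},
	} {
		code, target := route(d, req[0], req[1])
		fmt.Println(req[0], req[1], "->", code, target)
	}
}
```

Every branch except the S3 hit falls back to the upstream registry, which matches the intent that unknown or unmirrored traffic keeps working exactly as before.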

Notes/Constraints/Caveats

The primary purpose of the KEP is getting consensus on the agreed policy and procedure to unblock our community and move forward together.

There has been a lot of activity around the technology and tooling for both goals, but we need shared agreement on policy and procedure first.

Risks and Mitigations

This is the primary pipeline for delivering Kubernetes worldwide. Ensuring the appropriate SLAs and support as well as artifact integrity is crucial.

Alternatives / Background

Original KEP
- https://github.com/kubernetes/enhancements/tree/master/keps/sig-release/1734-k8s-image-promoter
Oras
- https://github.com/oras-project/oras
KubeCon Talk
- https://www.youtube.com/watch?v=F2IFjz7sr9Q
Apache has a widespread mirror network
- @dims has experience here
- http://ws.apache.org/mirrors.cgi
- https://infra.apache.org/mirrors.html
Umbrella issue: k8s.gcr.io => registry.k8s.io solution k/k8s.io#1834
ii/registry.k8s.io Implementation proposals
ii.nz/blog :: Building a data pipeline for displaying Kubernetes public artifact traffic

How much is this going to save us?

Cost of K8s Artifact hosting - Data Studio Graphs

Analysis has been done on usage patterns related to providers. AWS participated in this process and has a keen interest in helping drive down cost by providing artifacts directly to its clients consuming resources from the public registry.

Resources: Artifact Generation


Resources: Asynchronous API calls during scheduling


KEP-5229: Asynchronous API calls during scheduling

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes making all API calls during scheduling asynchronous, by introducing a new kube-scheduler-wide way of handling such calls.

Motivation

Scheduling performance is crucial. One of the bottlenecks is the API calls done during the scheduling cycle. The binding cycle is already asynchronous, but it would still be beneficial to re-evaluate whether the current model of busy-waiting goroutines is good long-term.

Making one universal approach for handling API calls in the kube-scheduler could allow these calls to be consistent and better control the number of dispatched goroutines. Already asynchronous calls could also be migrated to this approach.

Goals

P0: Make the scheduling cycle free of blocking API calls, i.e., make all API calls asynchronous.
P0: Make the solution extendable for custom/future use cases.
P1: Skip some types of updates if they are soon made irrelevant by subsequent updates.

Non-Goals

Prioritize high-importance updates (like binding) over low-importance ones if updates to the kube-apiserver get throttled.
Change how the already asynchronous procedures, such as the binding cycle or asynchronous preemption goroutines, actually work. They should remain asynchronous and continue to wait for the API calls to finish before proceeding. Any further refinements to these stages could be added in future revisions of this KEP or in separate ones.

Proposal

There are a few ways to make API calls asynchronous. They are introduced below to facilitate discussion and identify the most suitable solution.

These questions have to be answered:

How to handle Pod rescheduling while waiting for the API call to complete
What component should handle the API calls

Also, races (collisions) between multiple API calls for a single object should be mitigated by the design.

Note that this KEP focuses on making individual API calls asynchronous. Some procedures, such as the binding cycle or asynchronous preemption, will still be separate goroutines with the ability to wait for the (async) API calls to finish. This way, dependencies between calls that rely on each other won’t need to be implemented.

API calls categorization

Before selecting the best approach, the kube-scheduler’s API calls have to be analyzed against the goals. The following operations involve API calls during the main scheduling cycle and have to be made asynchronous (1st goal):

1. Updating a Pod status in handleSchedulingFailure when a Pod is unschedulable.
2. [Feature proposal: #130668] Updating the status of a Pod that is rejected by the PreEnqueue plugins in the scheduling queue.

These API calls are already asynchronous in their own ways:

3. [Feature proposal: KEP-5278] Set nominatedNodeName in delayed binding scenarios.
4. Preemption - ClearNominatedNodeName and Pod eviction (made asynchronous by KEP-4832).
5. Pod binding - is in the asynchronous binding phase.

All three of the above API calls could be migrated to the new mechanism.

In-tree plugins’ operations that involve non-Pod API calls during scheduling and could be made asynchronous (but don’t have to be supported from the very beginning):

6. Volume binding - is in the PreBind phase, hence asynchronous.
7. DRA ResourceClaim deallocating in PostFilter.
8. DRA removing ReservedFor in Unreserve.
9. DRA ResourceClaims binding - is in the PreBind phase, hence asynchronous.
10. [Feature proposal: KEP-5004] Extended resource feature will add ResourceClaim creation API call to the PreBind phase.
11. Other potential DRA features.

API calls relevance order in which they could cancel less relevant calls for the same Pod (3rd goal):

Pod deletion caused by preemption (4) should cancel all Pod-based API calls for such a Pod.
Pod binding (5) should cancel Pod status update API calls (1 - 3), because they are no longer relevant.
Updating Pod status (1, 2) and setting nominatedNodeName (3) should cancel previous such updates. Both are calls to the status subresource of a Pod, so they should overwrite (merge) the previous calls properly when the newest status is stored in-memory.
API calls for non-Pod resources (6 - 11) should be further analyzed as they are not likely to consider the Pod-based API calls, hence implementing those shouldn’t block making (1 - 2) calls asynchronous.

There is no need to send two API calls for one Pod, because more relevant calls should override less relevant ones, and status updates can be combined into one call. There is no scenario in which two API calls for different Pods, or indeed any two API calls that do not involve the same object, should be canceled or merged, so the relevance order between them does not need to be analyzed.
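
The relevance order for Pod-based calls can be sketched as a small decision function that compares a pending call with an incoming one for the same Pod. This is an illustration only: the callKind and decision types and the resolve function are hypothetical names, and the handling of two equal non-status calls (newest wins) is an assumption, since the text does not cover that case.

```go
package main

import "fmt"

// callKind orders Pod-based API calls by relevance; a higher value wins.
type callKind int

const (
	statusUpdate callKind = iota // calls (1)-(3): Pod status subresource updates
	binding                      // call (5): Pod binding
	deletion                     // call (4): Pod deletion caused by preemption
)

// decision describes what to do with a pending call when a new one arrives.
type decision string

const (
	keepPending  decision = "keep pending, skip incoming"
	mergeUpdates decision = "merge into one status update"
	replaceOld   decision = "cancel pending, enqueue incoming"
)

// resolve compares two calls that target the same Pod.
func resolve(pending, incoming callKind) decision {
	switch {
	case incoming > pending:
		return replaceOld // a more relevant call cancels the pending one
	case incoming < pending:
		return keepPending // e.g. skip nominatedNodeName if binding is pending
	case incoming == statusUpdate:
		return mergeUpdates // both touch the status subresource
	default:
		return replaceOld // assumption: for equal kinds the newest call wins
	}
}

func main() {
	fmt.Println(resolve(statusUpdate, binding))      // binding cancels a status update
	fmt.Println(resolve(binding, statusUpdate))      // status update is skipped
	fmt.Println(resolve(statusUpdate, statusUpdate)) // two status updates are merged
	fmt.Println(resolve(binding, deletion))          // deletion cancels everything
}
```

Because the comparison only ever involves calls for the same object, the function needs no knowledge of other Pods, which keeps the per-object bookkeeping independent.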

In terms of API call priority, the order might be different (non-goal, but considered):

Pod binding (5) should have the highest priority as this is the main purpose of the kube-scheduler.
Pod deletion caused by preemption (4) should also be important to free up space for high-priority Pods.
Updating Pod status (1, 2) could be less important and called if there is space for it. It’s worth considering if setting nominatedNodeName (3) should have the same priority or higher, because the higher delay might affect other components like Cluster Autoscaler or Karpenter.
API calls for non-Pod resources (6 - 11) could be analyzed case by case, but are likely equally important to (5) or (4).

1: How to handle Pod rescheduling while waiting for the API call to complete

There are multiple possible ways to handle such API calls, especially for Pod status updates. Other (potential) use cases should also be considered when choosing the solution. Three ways were analyzed, but the non-blocking approach, presented below, was selected.

Use advanced queue and don’t block the Pod from being scheduled in the meantime

This approach allows the Pod to enter the scheduling queue and be scheduled again even before the status update API call completes, without blocking it. This requires implementing advanced logic for queueing API calls in the kube-scheduler and migrating all Pod-based API calls done during scheduling to this method, including the binding API call. The new component should be able to resolve any conflicts in the incoming API calls and cluster status updates as well as parallelize them properly, e.g., don’t parallelize two updates of the same Pod. This requires implementing one of the approaches presented below: making the API calls queued in a separate component, or sending API calls through a kube-scheduler’s cache.

All Pod-based scenarios (1 - 5) could and should be implemented when choosing this approach. Still, a single error reporting path for Pod condition updates could be considered but wouldn’t be required.

Pros:

Allows the Pod to be scheduled again even before the API call completes, which could reduce end-to-end Pod startup latency.
Simplifies introducing new API calls to the kube-scheduler assuming the collision handling logic is implemented correctly.

Cons:

Requires implementing complex, advanced queueing logic.
Necessitates migrating all Pod-based API calls to this method, but introduces unification, which could be desirable.
Implementing collision resolution (e.g., for same-Pod updates) is complex, but could allow optimizing the number of API calls overall.

2: What component should handle the API calls

Another thing worth considering is how to indeed make the API calls asynchronous and which component should be responsible for this. Two alternatives were considered. Ultimately, both contributed to the design of the final architecture, which consists of both queueing and caching approaches.

2.1: Make the API calls queued in a separate component

To make asynchronous dispatching more advanced, queueing the calls in a separate component could be explored. The new component might understand what the API calls are intended to do and delay, skip, or merge them accordingly, e.g., don’t set nominatedNodeName when Pod binding is enqueued. Initially, it could be a framework, which might be extended in the future, e.g., by introducing the possibility of setting delays.

If two update API calls for the same Pod are enqueued, a merging mechanism should handle that case. See API calls categorization for more details.

Pros:

Allows for advanced goroutine dispatching logic.
Can potentially delay, skip, or merge API calls based on type (e.g., skip nominatedNodeName if binding is pending).
All collisions could be resolved at the new component level, not relying on higher-level mechanisms.
Allows supporting all scenarios without additional structures.
Provides a framework that can be extended in the future.

Cons:

Requires complex logic to handle potential conflicts between different update types for the same Pod.
Needs a clear strategy for how to update the in-memory Pod object during scheduling.
Requires extra steps to cache the updated objects.

2.2: Send API calls through a kube-scheduler’s cache

A second approach could be to have a consistent Pod state in the kube-scheduler itself first and then change it through the API. This means that all API calls would have to go through the kube-scheduler’s cache, change the Pod there, and after that, execute. However, Pod updates might come from outside the kube-scheduler, e.g., a user changes the spec or another component changes the status. This extended cache would have to merge the internal state of the Pod with the external state, including the Pod update made by the kube-scheduler that will come as an event as well. Currently, the Pod object stored in the cache is based only on events that come to the kube-scheduler.

Another thing to think of is that the cache stores only the bound Pods. The rest of the Pods are stored in the scheduling queue, so once again, API calls might need to go through the scheduling queue itself.

The cache proposal would still need to reuse some ideas of the first approach to achieve merging or skipping API calls.

Pros:

Aims for a consistent internal state of the Pod within the kube-scheduler before calling the API, possibly simplifying conflict resolution.
Allows for advanced goroutine dispatching logic.
All collisions could be resolved at the cache, not relying on higher-level mechanisms.
Can potentially delay, skip, or merge API calls based on type (e.g., skip nominatedNodeName if binding is pending), but merging would be possible if it stores additional data (what fields should be updated, etc.).

Cons:

Requires the cache to handle and merge updates coming from both the kube-scheduler’s internal actions and external API events.
The cache currently only stores bound Pods, requiring integration with the scheduling queue for pending Pods.
Complex logic is needed to handle external updates arriving while an internal update is pending or in progress.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Asynchronous API call failure

When an asynchronous API call fails, the caller should be able to handle this. This can be done by using an OnFailure channel that passes the error, allowing callers to react accordingly. For current API calls, this will be enough: Pod status update failures are already left unhandled (only logged), so this KEP won’t make that situation worse. However, more graceful handling, such as retries, could be added in the future.
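
A minimal Go sketch of the failure channel idea follows. The apiCall type, its field names, the dispatch function, and the errRejected value are hypothetical; the KEP only names the concept of an OnFailure channel, so the exact shape here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errRejected is a hypothetical error used to simulate a failed API call.
var errRejected = errors.New("status update rejected")

// apiCall pairs the work to run with a channel that reports failure.
// Assumption: the channel is buffered (size 1) and is closed when the call
// finishes, so a receive never blocks forever and never blocks the sender.
type apiCall struct {
	execute   func() error
	onFailure chan error
}

// dispatch runs the call in its own goroutine and reports an error, if any.
func dispatch(c apiCall) {
	go func() {
		defer close(c.onFailure) // runs after the optional send below
		if err := c.execute(); err != nil {
			c.onFailure <- err
		}
	}()
}

func main() {
	failing := apiCall{
		execute:   func() error { return errRejected },
		onFailure: make(chan error, 1),
	}
	dispatch(failing)

	// A caller that cares about the outcome waits on the channel; a caller
	// that only logs failures could read it from a separate goroutine.
	if err, failed := <-failing.onFailure; failed {
		fmt.Println("call failed:", err)
	}

	succeeding := apiCall{
		execute:   func() error { return nil },
		onFailure: make(chan error, 1),
	}
	dispatch(succeeding)
	if _, failed := <-succeeding.onFailure; !failed {
		fmt.Println("call succeeded")
	}
}
```

Closing the channel on completion lets procedures such as the binding cycle wait for the result, while the scheduling cycle itself can simply ignore the channel.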

It could be riskier when previous calls were skipped or overwritten and a subsequent call fails. This results in losing previous decisions (outside the kube-scheduler) as well as the last change not being applied externally. This should still be handled correctly, as a failed binding will result in applying a failed Pod status anyway, and a Pod with binding canceled because of deletion (preemption) could still be retried. Nevertheless, this risk should be taken into consideration when extending feature usage in the future and should be properly documented in the code.

Another aspect is caching: applying a change to a cache should be reversible. This could be done by storing two versions of an object (like in AssumeCache) and restoring the older version in case of a failure. However, for basic usage, this won’t be required for Pod-based API calls.
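The two-version idea can be sketched as follows. This is only an illustration: `reversibleCache`, `entry`, and the reduced `pod` type are hypothetical names invented for the example and do not correspond to the actual AssumeCache API.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, trimmed-down stand-in for a cached object.
type pod struct {
	UID               string
	NominatedNodeName string
}

// entry keeps the last confirmed version next to the version that reflects
// a pending (not yet confirmed) API call.
type entry struct {
	confirmed pod
	assumed   *pod // nil when no change is pending
}

// reversibleCache is an illustrative type, not an existing scheduler structure.
type reversibleCache struct {
	entries map[string]*entry
}

func newReversibleCache() *reversibleCache {
	return &reversibleCache{entries: map[string]*entry{}}
}

// Add stores a confirmed object.
func (c *reversibleCache) Add(p pod) { c.entries[p.UID] = &entry{confirmed: p} }

// Assume records the expected result of a pending API call.
func (c *reversibleCache) Assume(p pod) {
	if e, ok := c.entries[p.UID]; ok {
		e.assumed = &p
	}
}

// Get returns the assumed version if one exists, otherwise the confirmed one.
func (c *reversibleCache) Get(uid string) (pod, bool) {
	e, ok := c.entries[uid]
	if !ok {
		return pod{}, false
	}
	if e.assumed != nil {
		return *e.assumed, true
	}
	return e.confirmed, true
}

// Finish takes the API call result: on success the assumed version becomes
// the confirmed one, on failure it is dropped, which reverts the change.
func (c *reversibleCache) Finish(uid string, err error) {
	e, ok := c.entries[uid]
	if !ok || e.assumed == nil {
		return
	}
	if err == nil {
		e.confirmed = *e.assumed
	}
	e.assumed = nil
}

var errUpdateFailed = errors.New("update failed")

func main() {
	c := newReversibleCache()
	c.Add(pod{UID: "a"})
	c.Assume(pod{UID: "a", NominatedNodeName: "node-1"})
	c.Finish("a", errUpdateFailed)
	p, _ := c.Get("a")
	fmt.Printf("nominated node after failed call: %q\n", p.NominatedNodeName)
}
```

Keeping the confirmed version untouched until the call succeeds makes the revert a simple pointer reset, at the cost of storing two copies while a call is pending.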

Object updated by an external component causing a race with the scheduler

If a single field can be updated by both the scheduler and another component, making the update API call asynchronous might extend the race window. One such case is the NominatedNodeName use case, extended by KEP-5278.

However, in this KEP, we assume that the default kube-scheduler should have precedence when applying updates to objects (Pods), and any custom logic could be implemented by changing the default if needed.

API calls added at a higher rate than execution rate leading to memory explosion

As pending API calls will be stored in the scheduler, slower processing of these calls, while maintaining a high frequency of additions, might result in significant memory usage. This can already occur, for example, when many Pods are waiting to be bound simultaneously. However, if it turns out to be a real problem, a timeout could be added to the API call that will limit the time the call might spend in the queue, discarding it afterward.
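A minimal sketch of such a timeout is shown below. It assumes a hypothetical `expiringQueue` that records the enqueue time of each call and discards entries that waited longer than a configured limit; the current time is passed in explicitly only to keep the example deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// queuedCall is a hypothetical queue entry that remembers when it was added.
type queuedCall struct {
	uid      string
	enqueued time.Time
}

// expiringQueue is an illustrative FIFO that drops entries that waited too long.
type expiringQueue struct {
	maxWait time.Duration
	calls   []queuedCall
}

func newExpiringQueue(maxWaitSeconds int) *expiringQueue {
	return &expiringQueue{maxWait: time.Duration(maxWaitSeconds) * time.Second}
}

// Add appends a call for the given object UID.
func (q *expiringQueue) Add(uid string, now time.Time) {
	q.calls = append(q.calls, queuedCall{uid: uid, enqueued: now})
}

// Pop returns the UID of the oldest call that has not exceeded maxWait.
// Expired calls are discarded and reported, so the caller can notify
// whoever is waiting for their results.
func (q *expiringQueue) Pop(now time.Time) (next string, expired []string, ok bool) {
	for len(q.calls) > 0 {
		head := q.calls[0]
		q.calls = q.calls[1:]
		if now.Sub(head.enqueued) > q.maxWait {
			expired = append(expired, head.uid)
			continue
		}
		return head.uid, expired, true
	}
	return "", expired, false
}

// at returns a fixed instant shifted by the given number of seconds.
func at(seconds int) time.Time { return time.Unix(int64(seconds), 0) }

func main() {
	q := newExpiringQueue(30)
	q.Add("pod-a", at(0))
	q.Add("pod-b", at(20))
	next, expired, ok := q.Pop(at(40))
	fmt.Println(next, expired, ok)
}
```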

Pod is retried based on an old object

Since a Pod won’t be blocked from retrying scheduling while a status update API call for that Pod is being executed, it might enter the next scheduling cycle before the call completes. However, the PodScheduled condition is not used during scheduling, and NominatedNodeName is reflected in the nominator, so having an outdated Pod object won’t cause any harm. Still, any future use cases might introduce issues here, so caching the updates could be considered to fully mitigate this risk.

Out-of-tree plugins start using asynchronous API calls framework

The framework should be designed to handle such custom use cases, but it should be explicitly documented what capabilities are allowed (and supported) for out-of-tree plugins. For example, adding a new Pod-based API call might require changes in the original implementations. Not all use cases might be covered by the first release of this feature, but eventually, they should be fully supported and documented accordingly.

Design Details

This section describes the most important design details. Three proposals based on the above ideas that combine queueing, caching, and a separate component for managing API calls were considered. Ultimately, proposal C was selected, and the details of proposals A and B can be found in the alternative design proposals section at the end of the KEP. Specifically, see proposal A for the proposed APIQueue structure.

Proposal C: Create a separate component managing API calls, but treat the cache as a middleware

This proposal combines the strengths of proposals A and B by making a cache a middleware between scheduling/binding cycles, plugins, and event handlers. This way, we could achieve the cache advantages of proposal B, while also allowing multiple caches to coexist. Direct API queue operations would still be possible (e.g., for some out-of-tree plugins that don’t need to cache any object).

The APIQueue design from proposal A could be largely reused in this approach. If an object needs to be modified, it would first go through the cache, then be added to the API queue, and, based on the result, properly stored in the cache. This decoupled approach would allow adding a StatusUpdateCall through the scheduler’s cache, but for example, a ResourceClaimUpdate could go through the DRA manager, simplifying the adaptation of this KEP.

This proposal could be implemented as a second step extension of proposal A.

Summary of API call management

Below is a summary of the steps in API call management that would be introduced by the proposals above.

Enqueueing a new API call

Having a separate component (APIQueue in proposal A and partially C) would make the API calls explicit to the caller by directly calling Add() on the APIQueue. This means it will be visible from the scheduler or plugins that an API call will be sent, and various options could be easily passed.

Using a cache (proposals B and C), the API call will be hidden and executed implicitly when needed, based on the cache’s internal logic. It’s unclear how options such as an OnFinish channel or additional metadata would be passed to the API call. Error handling might also be less verbose for the caller.

Updating a cache with API call details would be similar across all proposals. Given the details, it would be possible to know precisely which fields will be updated by the API call. Some Update() method could then apply these changes to an object, and the result could be stored in the cache. If any future update appears, it will be routed similarly.
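As an illustration of such an Update() method, the sketch below applies the details of a pending status update to an object. The `podStatus` and `statusUpdateDetails` types are hypothetical simplifications, not the real Pod API types.

```go
package main

import "fmt"

// podStatus is a hypothetical, reduced view of the Pod fields a status update touches.
type podStatus struct {
	ResourceVersion   string
	NominatedNodeName string
	PodScheduled      string // e.g. "True" or "False"
}

// statusUpdateDetails lists only the fields the pending call will change.
// A nil pointer means "leave this field as it is".
type statusUpdateDetails struct {
	NominatedNodeName *string
	PodScheduled      *string
}

// Update returns a copy of obj with the pending changes applied, so that the
// cache can store the object as it is expected to look after the call executes.
func (d statusUpdateDetails) Update(obj podStatus) podStatus {
	if d.NominatedNodeName != nil {
		obj.NominatedNodeName = *d.NominatedNodeName
	}
	if d.PodScheduled != nil {
		obj.PodScheduled = *d.PodScheduled
	}
	return obj
}

func strPtr(s string) *string { return &s }

func main() {
	pending := statusUpdateDetails{NominatedNodeName: strPtr("node-1")}
	// The same details can be re-applied when a newer object version arrives.
	newer := podStatus{ResourceVersion: "43", PodScheduled: "False"}
	fmt.Printf("%+v\n", pending.Update(newer))
}
```

Because the details name the touched fields explicitly, untouched fields of a newer object version are preserved when the details are applied again.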

In all proposals, if there isn’t any API call already enqueued for a given object, its UID will be added to the queue that will later be consumed by the API calls runner. In other scenarios, more advanced logic will be required. See the section below for more details.
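The basic enqueueing rule can be sketched with a map of call details keyed by UID plus a FIFO of UIDs. All names here (`uidQueue`, `call`) are hypothetical, merging is reduced to a plain replacement, and in-flight tracking is left out for brevity.

```go
package main

import "fmt"

// call is a hypothetical placeholder for the details of one API call.
type call struct {
	uid     string
	details string
}

// uidQueue keeps call details in a map keyed by object UID and a FIFO of
// UIDs that wait for execution.
type uidQueue struct {
	calls map[string]call
	order []string
}

func newUIDQueue() *uidQueue {
	return &uidQueue{calls: map[string]call{}}
}

// Add enqueues the UID only if no call for the object is queued yet.
// Otherwise only the stored details are replaced, so the object keeps its
// position in the queue. A real implementation would merge instead.
func (q *uidQueue) Add(c call) {
	_, queued := q.calls[c.uid]
	q.calls[c.uid] = c
	if !queued {
		q.order = append(q.order, c.uid)
	}
}

// Pop returns the call for the oldest queued UID, as the runner would.
func (q *uidQueue) Pop() (call, bool) {
	if len(q.order) == 0 {
		return call{}, false
	}
	uid := q.order[0]
	q.order = q.order[1:]
	c := q.calls[uid]
	delete(q.calls, uid)
	return c, true
}

func main() {
	q := newUIDQueue()
	q.Add(call{uid: "pod-a", details: "v1"})
	q.Add(call{uid: "pod-b", details: "v1"})
	q.Add(call{uid: "pod-a", details: "v2"}) // replaces v1, keeps position
	for c, ok := q.Pop(); ok; c, ok = q.Pop() {
		fmt.Println(c.uid, c.details)
	}
}
```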

Enqueueing another API call for the same object

Another API call for the same object could be enqueued, while the previous one is still waiting to be executed. Based on API calls categorization, some updates might need to be merged. This logic has to be implemented and could be achieved similarly for all three proposals. In general, given the API calls categorization, the calls could be simply merged by overwriting the details with the new ones, if applicable. For StatusUpdateCall, merging will check if the NominatedNodeName or Pod condition changed and then overwrite these fields accordingly.
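The merge described for StatusUpdateCall might look like the sketch below. The `statusUpdate` type and its fields are assumptions made for the example; nil fields stand for "not changed by this call".

```go
package main

import "fmt"

// statusUpdate is a hypothetical representation of a queued StatusUpdateCall.
// Nil fields are not touched by the call.
type statusUpdate struct {
	nominatedNodeName *string
	podScheduled      *string
}

// merge folds an older, not yet executed call into the newer one:
// fields set by the newer call win, fields it leaves unset are inherited.
func (newer statusUpdate) merge(older statusUpdate) statusUpdate {
	merged := newer
	if merged.nominatedNodeName == nil {
		merged.nominatedNodeName = older.nominatedNodeName
	}
	if merged.podScheduled == nil {
		merged.podScheduled = older.podScheduled
	}
	return merged
}

func ptr(s string) *string { return &s }

func main() {
	older := statusUpdate{nominatedNodeName: ptr("node-1"), podScheduled: ptr("False")}
	newer := statusUpdate{nominatedNodeName: ptr("node-2")}
	m := newer.merge(older)
	fmt.Println(*m.nominatedNodeName, *m.podScheduled)
}
```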

Skipping or overwriting less or more important API calls could be done by configuring an importance value for each CallType and then making a decision based on comparison while adding a new API call. Not all API calls would need to implement their merging strategy. Merging should also allow deciding if the API call should be removed from the queue when the update reverts a previous one that wasn’t executed yet.
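A sketch of the importance comparison follows. The concrete importance values and the `decide` helper are assumptions for illustration; the KEP does not prescribe them.

```go
package main

import "fmt"

type callType string

const (
	statusUpdateCall callType = "status_update"
	bindingCall      callType = "binding"
)

// importance values are assumptions for this sketch; a higher value wins.
var importance = map[callType]int{
	statusUpdateCall: 1,
	bindingCall:      2,
}

type decision string

const (
	skipNew      decision = "skip new call"
	overwriteOld decision = "overwrite old call"
	mergeCalls   decision = "merge calls"
)

// decide compares the already queued call type with the incoming one.
func decide(queued, incoming callType) decision {
	switch {
	case importance[incoming] < importance[queued]:
		return skipNew
	case importance[incoming] > importance[queued]:
		return overwriteOld
	default:
		return mergeCalls
	}
}

func main() {
	fmt.Println(decide(bindingCall, statusUpdateCall))
	fmt.Println(decide(statusUpdateCall, bindingCall))
	fmt.Println(decide(statusUpdateCall, statusUpdateCall))
}
```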

In proposals A and C, the merging strategy (Merge() method in APICall) would implement this merging logic. In proposal B, some other configurable method would need to be designed to implement this.

Merging, overwriting, or skipping a call could get more complicated if the previous API call is already in flight. See the enqueueing an API call while a previous one is in-flight section for more details. In proposal B, setting the merging strategy might be more complicated and could require providing custom logic through some interfaces.

Receiving object update through event handlers

An object might get updated or deleted externally in the meantime, while some API call is enqueued for the same object. One such scenario might be setting NominatedNodeName by an external component (see KEP-5278). For Pod status updates themselves, making an update based on the old object wouldn’t cause trouble, because of the strategic merge patch used – it will just overwrite the Pod condition or NominatedNodeName if needed. It is assumed that the scheduler should overwrite all such updates according to the actual needs, and if it’s not expected, custom logic could always be added using an APICall interface.

However, to support other potential use cases and have the newest object possible in the cache (proposals B and C, and optionally A), merging the object received by event handlers with API call details should also be added. It would work similarly to updating a cache in the section above.

It also should be defined how to handle such external updates if the API call is completed and the scheduler is waiting for the update to come in event handlers. The ResourceVersion of the object could be used to distinguish it, i.e., apply the API call details as long as the ResourceVersion of the received object is older than the version returned by the update API call.
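The ResourceVersion check could be sketched as below. Note the strong assumption: the example parses resource versions as unsigned integers, which clients should not generally rely on because the API treats them as opaque strings. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// olderThan reports whether resource version a precedes b.
// ASSUMPTION: versions parse as unsigned integers. This holds only for
// this sketch and is not something clients should generally rely on.
func olderThan(a, b string) (bool, error) {
	x, err := strconv.ParseUint(a, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parsing %q: %w", a, err)
	}
	y, err := strconv.ParseUint(b, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parsing %q: %w", b, err)
	}
	return x < y, nil
}

// shouldReapply decides whether the details of a completed API call still
// need to be applied on top of an object received from an event handler.
// If the versions cannot be compared, the received object is trusted as is.
func shouldReapply(receivedRV, callResultRV string) bool {
	older, err := olderThan(receivedRV, callResultRV)
	return err == nil && older
}

func main() {
	fmt.Println(shouldReapply("41", "42")) // stale event, re-apply call details
	fmt.Println(shouldReapply("42", "42")) // event already reflects the call
}
```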

Executing the API call

In all three proposals, the API call could be executed by a dedicated goroutine (the API calls runner) that checks whether a goroutine is available in the pool (whose size could be configurable) and, if so, fetches the first resource ID from the queue. The API call for this resource is then executed in a new goroutine, which is returned to the pool for the next call once it completes.
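A minimal sketch of such a runner is shown below, assuming a channel as the queue of resource IDs and a buffered channel as the pool limiter. `runCalls` and its parameters are illustrative names only.

```go
package main

import (
	"fmt"
	"sync"
)

// runCalls sketches the API calls runner: it reads object IDs from the queue
// and executes each call in its own goroutine, with at most `parallelism`
// calls running at once. It returns after the queue is closed and all
// started calls have finished.
func runCalls(queue <-chan string, parallelism int, execute func(uid string)) {
	slots := make(chan struct{}, parallelism) // one token per running goroutine
	var wg sync.WaitGroup
	for uid := range queue {
		slots <- struct{}{} // blocks while the pool is fully used
		wg.Add(1)
		go func(uid string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot for the next call
			execute(uid)
		}(uid)
	}
	wg.Wait()
}

func main() {
	queue := make(chan string, 5)
	for i := 0; i < 5; i++ {
		queue <- fmt.Sprintf("pod-%d", i)
	}
	close(queue)

	done := make(chan string, 5)
	runCalls(queue, 2, func(uid string) { done <- uid })
	fmt.Println("executed calls:", len(done))
}
```

Acquiring the slot before starting the goroutine keeps the runner from fetching more IDs than it can execute, so calls that are still queued remain available for merging.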

Enqueueing an API call while a previous one is in-flight

One other possible scenario occurs when an API call is executing (is in-flight) and a new API call for the same object is added. If both calls have the same type, standard merging logic could be applied. This involves adjusting the new API call with the in-flight call’s details to reflect the changes that are already in-flight and avoid repeating them in the next call (note that the in-flight call should still be stored in the map with details, but not in the queue with resource IDs).

Waiting for the API call to finish

In some use cases, the caller would like to wait for the asynchronous API call to finish. This could be achieved by passing an OnFinish channel along with the call that will receive the API call result (nil or error). This way, calls that already run asynchronously, such as binding, can be migrated to the new mechanism simply by blocking on the call completion. This channel could be easily used with proposal A, but proposals B and C would require passing it through cache methods, which could be less readable.
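Waiting on an OnFinish channel could look like the sketch below. The `queuedCall` shape loosely mirrors QueuedAPICall from proposal A; `run`, `bind`, and `dispatch` are hypothetical helpers introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// queuedCall pairs a call with an optional channel that receives its result.
type queuedCall struct {
	execute  func() error
	onFinish chan<- error
}

// run executes the call and reports the result without blocking the runner.
func run(c queuedCall) {
	err := c.execute()
	if c.onFinish == nil {
		return // nobody waits; the error would only be logged
	}
	select {
	case c.onFinish <- err:
	default: // no buffer space and no receiver ready; drop the result
	}
}

// bind shows a caller that blocks until the asynchronous call completes.
// The channel is buffered so that the result is never dropped by run.
func bind(dispatch func(queuedCall), execute func() error) error {
	result := make(chan error, 1)
	dispatch(queuedCall{execute: execute, onFinish: result})
	return <-result
}

var errConflict = errors.New("conflict")

func main() {
	dispatch := func(c queuedCall) { go run(c) }
	fmt.Println(bind(dispatch, func() error { return nil }))
	fmt.Println(bind(dispatch, func() error { return errConflict }))
}
```

The non-blocking send protects the runner from callers that stopped listening, which is why callers in this sketch are expected to pass a buffered channel.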

Retrying API calls

As API calls are getting overwritten or skipped, failure of one call might end up in losing multiple operations. That’s why, for retryable errors, it should be possible to re-enqueue the API call and try it again soon. Such logic could be explored, but having an OnFinish channel and handling errors by the caller should be enough for the actual use cases.

For example, if a binding API call fails, the binding cycle procedure for that Pod will be notified via the OnFinish channel. It will then invoke a failure handler that re-adds the Pod to the scheduling queue to retry.

If no procedure tracks the OnFinish handler of a call (e.g., for a status update), the error will be unhandled (only logged). This aligns with the current implementation, and status updates aren’t critical enough to implement more advanced retry logic. Update conflicts also won’t be an issue for status updates, as a strategic merge patch is used, and the update will overwrite a condition and NominatedNodeName if a conflict occurs.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/scheduler: 2025-06-09 - 69.6%
pkg/scheduler/backend/cache: 2025-06-09 - 85.7%

Integration tests

k8s.io/kubernetes/test/integration/schedule
- Modify and add test cases covering the feature (with feature flag enabled and disabled), including handling unschedulable pods, preemption and binding.
scheduler_perf
- Add test cases measuring performance of scenarios that use asynchronous API calls (with feature flag enabled and disabled).
- Performance improvement should be visible for Unschedulable test case.

e2e tests

The feature is scoped within the kube-scheduler internally, so there is no interaction with other components. The whole feature should already be covered by integration tests.

Graduation Criteria

Alpha

N/A

Beta

Implement a feature behind a feature flag and enable it by default.
Migrate all Pod-based API calls done during scheduling and binding to the asynchronous version.
Implement all tests from the Test Plan.

GA

Gather feedback from users and fix reported bugs.

Upgrade / Downgrade Strategy

Upgrade

During the beta period, the feature gate SchedulerAsyncAPICalls is enabled by default, so users don’t need to opt in. This is a purely in-memory feature for the kube-scheduler, so no special actions are required outside the scheduler.

Downgrade

Users need to disable the feature gate.

Version Skew Strategy

This is a purely in-memory feature for the kube-scheduler, and hence there is no version skew strategy.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: SchedulerAsyncAPICalls
- Components depending on the feature gate: kube-scheduler

Does enabling the feature change any default behavior?

Pod scheduling might be retried even if the API call hasn’t yet been executed. For instance, a Pod might be retried before its PodScheduled condition is set to false (indicating it’s unschedulable). Consequently, external components that would rely on a strict ordering of applying a condition -> retrying a Pod might be less informed.

Moreover, some API calls might be canceled. In such cases, if the Pod is bound shortly after, the PodScheduled condition might not be set to false at all, as the binding takes precedence.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The feature can be disabled in Beta by restarting the kube-scheduler with the feature gate turned off.

What happens if we reenable the feature if it was previously rolled back?

The kube-scheduler again starts to run API calls asynchronously.

Are there any tests for feature enablement/disablement?

Given it’s a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag), having feature tests is enough.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A partial rollout failure is not possible because the kube-scheduler is the only component that rolls out this feature. However, if upgrading the kube-scheduler itself fails, new Pods won’t be scheduled anymore, while Pods that are already scheduled won’t be affected.

What specific metrics should inform a rollback?

pending_async_api_calls metric is large or growing abnormally
async_api_call_execution_total value with result indicating error is large even if there are no issues with kube-apiserver
async_api_call_duration_seconds visibly increased even if there are no issues with kube-apiserver
event_handling_duration_seconds visibly increased
scheduling_attempt_duration_seconds visibly increased

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No. This is an in-memory feature of the scheduler, so calculations start from the beginning every time the scheduler is restarted. Consequently, a plain upgrade and the upgrade->downgrade->upgrade path behave the same.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check async_api_call_execution_total, async_api_call_duration_seconds and pending_async_api_calls metrics, and if their values are changing with each processed pod.

How can someone using this feature know that it is working for their instance?

N/A

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

In the default scheduler, we should see the throughput around 100-150 pods/s (ref ), and this feature shouldn’t bring any regression there.

Based on that, schedule_attempts_total shouldn’t grow by less than 100 per second, and scheduling_algorithm_duration_seconds shouldn’t exceed 10 ms on average, provided there is a sufficient number of pending pods in the cluster.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - async_api_call_execution_total
  - async_api_call_duration_seconds
  - pending_async_api_calls
  - scheduling_attempt_duration_seconds
  - event_handling_duration_seconds
  - pod_scheduling_sli_duration_seconds
- Components exposing the metric: kube-scheduler

Are there any missing metrics that would be useful to have to improve observability of this feature?

async_api_call_execution_total with call_type and result labels to indicate how many async API calls with specific call_type completed with that result.
async_api_call_duration_seconds with call_type and result labels to indicate how long it took for async API calls with specific call_type to complete with that result.
pending_async_api_calls with call_type label to indicate how many async API calls are enqueued for specific call_type.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Not visibly end-to-end - binding API call could be slightly delayed by routing through the API queue.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Memory usage within the kube-scheduler might increase due to the queue storing pending API calls. Memory increase is expected to be linear with the number of pending API calls.

The number of goroutines will also increase to dispatch API calls, which could affect the CPU usage of the kube-scheduler.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server is unavailable, the API calls will fail. The scheduler already handles such cases (by retrying scheduling in most of them), and this feature should not change that. See the retrying API calls section for more details.

What are other known failure modes?

Unknown

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

8th Apr 2025: The initial KEP is submitted.

Drawbacks

Alternatives

There were other alternatives considered in two topics:

Where and how to handle API calls during queueing and scheduling.
How to make the API calls asynchronous.

1.1: Handle API calls in the scheduling queue

One possible approach is to send the API calls through a scheduling queue. This allows delaying putting the pod into unschedulablePods after updating the pod. This prevents race conditions from parallel updates of a single pod because, during the API call, the pod is in-flight and thus not eligible for rescheduling.

A new method could be added to the PriorityQueue, which will take the function to be called asynchronously. It should also make sure the pod is stored in inFlightPods to register the cluster events that will happen during the asynchronous part. Calling AddUnschedulableIfNotPresent at the end ensures there won’t be any race with the asynchronous pod update. Because the pod would need to be in inFlightPods during the API call, the size of inFlightEvents might increase, but as long as the API call executes quickly, there won’t be a significant memory pressure.

An example solution could look like:

// Author: @sanposhiho
func (p *PriorityQueue) AddUnschedulableAsync(pInfo *framework.QueuedPodInfo, fn func() error) {
	// Make sure the Pod is in inFlightPods before starting the goroutine.

	go func() { // Or another way of dispatching.
		// Run fn first.
		if err := fn(); err != nil { ... }

		// Push the pod back to the unschedQ after completing fn().
		p.AddUnschedulableIfNotPresent(...)
	}()
}

This way, we could cover pod status updates during the failure handler (1) and pod status updates for PreEnqueue plugins (2). Asynchronous preemption (4) could be migrated to this approach by adding a possibility to return a function from PostFilter plugins in PostFilterResult and calling this function probably in the failure handler together with the status update.

However, this method cannot be used for setting the nominatedNodeName scenario (3) because this operation occurs in the successful scheduling as well. Therefore, additional effort would have to be made to specifically ensure that the nominatedNodeName doesn’t collide with a potential status update. Probably, before this status update in the failure handler, the code should try to cancel the set nominatedNodeName API call or wait until it finishes. After that, it should proceed with setting the unschedulable status via the API. The binding call might similarly need to wait.

Another aspect to consider is how to dispatch the goroutines, as discussed in how to make the API calls asynchronous section.

Pros:

Allows delaying putting unschedulable pods back to the queue until the API update completes.
Prevents race conditions for parallel updates of a single pod by delaying the AddUnschedulableIfNotPresent call.
Can easily cover status updates for both scheduling failures and PreEnqueue failures.
Asynchronous preemption could be migrated to this approach, increasing consistency.

Cons:

Handling of failures might not be consistent, requiring AddUnschedulableAsync to be called in two places.
Delaying the AddUnschedulableAsync call increases pod queuing latency because the initial backoff timestamp is set there.
Cannot be used for the nominatedNodeName scenario, requiring additional effort and separate handling.
Might visibly increase the size of inFlightEvents if API calls are slow or if there are many calls.

1.2: Handle API calls in the handleSchedulingFailure

Another approach could be to make all unschedulable status update API calls within handleSchedulingFailure. This would make this handler the only error reporting path. Synchronous API calls within this handler could be made asynchronous, but additional effort would be needed to prevent race conditions. This could be achieved by blocking the retries of the pod using PreEnqueue (similar to asynchronous preemption) or by implementing advanced queueing logic.

This way, again, we could cover pod status updates during the failure handler (1), but pod status updates for PreEnqueue plugins (2) will require more refactoring by either:

Running a simplified scheduling cycle for pods that were rejected by the PreEnqueue to update the pod condition. This might negatively impact scheduling performance because a portion of the scheduling cycles will be spent on pods that are ultimately unschedulable. Moreover, PreEnqueue plugins might also need to be called within this simplified scheduling cycle, or alternatively, PreFilter plugins could implement the necessary PreEnqueue logic, duplicating it.
Calling handleSchedulingFailure directly from the scheduling queue when a pod is rejected by the PreEnqueue. This might be feasible, although it would create a circular dependency between the scheduling queue and the handler; however, it wouldn’t have the same performance implications as the solution above.

Asynchronous preemption could also be migrated to this approach by exposing a function, provided that the blocking behavior in PreEnqueue is consistent with the actual preemption blocking mechanism.

Again, for setting the nominatedNodeName scenario (3), this method cannot be used because this operation occurs in the successful scheduling as well. Therefore, additional effort would have to be made to specifically ensure that the nominatedNodeName doesn’t collide with a potential status update.

Pros:

Makes the failure handler the single path of reporting unschedulable status errors.
Asynchronous preemption could potentially be migrated to this approach, increasing consistency.
Pod would be immediately put into the scheduling queue, starting the backoff timer right away.

Cons:

Requires additional effort to prevent race conditions for updates.
Handling PreEnqueue rejections requires significant refactoring (implementing a simplified scheduling cycle or direct handleSchedulingFailure call).
- Simplified scheduling cycle for PreEnqueue rejections could impact performance and duplicate PreEnqueue logic.
- Direct handleSchedulingFailure call would introduce circular dependency.
Cannot be used for the nominatedNodeName scenario, requiring additional effort and separate handling.

2.1: Just dispatch goroutines

With appropriate handling of races during updates, we could just dispatch goroutines with API calls. A potential drawback is that we won’t limit the number of these goroutines and won’t be able to, e.g., delay the calls. Limiting goroutines could still be easily achieved by having some group with a limited number of goroutines and a simple queue that will store pending calls. Some delay might potentially appear due to side effects, especially when there will be problems with the kube-apiserver, so some higher-level mechanism such as (1.1) or (1.2) would need to prevent pod update races.

Pros:

Simple to implement if the appropriate race handling is chosen.
Can easily be extended with a simple queue and worker pool to limit number of goroutines.

Cons:

Does not inherently support delaying calls.
Higher-level mechanisms (like 1.1 or 1.2) would be needed to prevent pod update races.
nominatedNodeName scenario support would require more effort in (1.1) or (1.2).
Prevents further optimizations, e.g., two API calls can’t be merged.

Alternative design proposals

Three design proposals were considered, and proposal C was selected for implementation. The other two proposals are presented below for comparison.

Proposal A: Create a separate component managing API calls

An API queue could be implemented by adding a new component to the scheduler that would have to understand the API calls’ details as well as be (potentially) able to modify the cache (see dotted lines in the diagram). This approach would provide an extensible interface and understand the precedence of API calls. Having a new component on its own would cause the cache to be less informed, i.e., not updated with API calls’ details, providing the scheduler with outdated data. This could be prevented by making the API queue a middleware between the event handler and a cache (dotted lines). This wouldn’t have to be fully implemented from the start (only a subset of use cases could be supported initially), but it would allow handling the multiple cached storages that are currently in the scheduler, i.e., scheduler cache, nominator, DRA manager (claimTracker), and volume binding AssumeCache.

The interface for the new component could look like the following:

type APICallType string

const (
 StatusUpdateCall APICallType = "status_update"
 BindingCall APICallType = "binding"
 PreemptionCall APICallType = "preemption"
 // PVCBinding etc.
)

// APICall describes the API call to be made and stores all data required to make the call,
// e.g. fields that should be updated or object to be added/removed.
type APICall interface {
 // CallType returns an API call type. This should be unique across all APICall implementations that could be in the queue at one moment.
 CallType() APICallType
 // UID returns UID of an object that this call is related to
 UID() types.UID
 // Execute makes the actual API call
 Execute(client clientset.Interface) error
 // Merge merges two API calls with the same APICallType into one
 Merge(oldObj APICall) (bool, error)

 // Not required from the very beginning:

 // Update updates the obj using APICall details and returns the new version
 Update(obj any) (any, error)
}

type QueuedAPICall struct {
 APICall
 // OnFinish is a channel where the API call result is sent.
 // It allows synchronizing on the call's completion, e.g., in binding,
 // and handling its result.
 OnFinish chan<- error
}

type APIQueue struct {
 ...
}

func (aq *APIQueue) Add(apiCall QueuedAPICall) error {
 // If API call for specific UID is already enqueued,
 // check the callType and skip, replace or merge the call depending on precedence.
 ...
}

func (aq *APIQueue) Update(obj any) (any, error) {
 // Update the object using API call details if any is enqueued for its UID.
 ...
}

func (aq *APIQueue) Run() {
 // Dispatch limited number of goroutines if queue is non empty.
 ...
}

APIQueue would provide an Add() method that would be used to enqueue an API call that has to be executed. APICall would provide all methods required to handle it, especially Execute() for running the call and Merge() for merging it with a call of the same type (e.g., StatusUpdateCall) that is already enqueued. There should be only one APICall implementation with the same CallType at any given moment (enforced by APIQueue), but extending this behavior could be considered in the future. Supporting a cache would require adding an Update() method that takes the object and updates it with the API call details (e.g., sets NominatedNodeName in a Pod that will soon be updated by the call). The updated object could then be stored in the cache, and having the call details would make it possible to know which fields need to be changed if any future update occurs before the API call is executed.

Proposal B: Make the scheduler’s cache manage API calls

This approach differs from the previous one. Instead of creating a separate component, this would reuse the scheduler’s cache to handle API calls. Its advantage would be keeping a consistent state of the updated object in the scheduler and invisibly dispatching API calls if needed. The largest caveat could be the need to refactor the scheduler’s cache if non-Pod API calls had to be supported: the cache is currently split into multiple, more specialized caches, i.e., scheduler cache, nominator, DRA manager (claimTracker), and volume binding AssumeCache. This means that the scheduler’s cache might need to be extended to cover these use cases or be able to support those custom storage options through some interfaces. Having a cache would still require storing additional metadata (details), similar to proposal A, needed to make the API calls and to handle incoming updates from the event handler properly (i.e., to store information about what the API call will change and be able to apply those changes to an updated object).

It would also require adding specialized methods to the cache to consume details needed to merge the calls and objects properly; for instance, the default UpdatePod method might not be useful, because it would be too generic for our use cases. Supporting out-of-tree plugins might also be harder, as it would require making the cache extensible to store some custom objects and somehow add new methods.

Infrastructure Needed (Optional)

Resources: Asynchronous Preemption

KEP-4832: Asynchronous Preemption

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
  - When kube-apiserver is unstable
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Introduce a new extension point

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes decoupling the preemption API calls from the scheduling cycle, to improve scheduling throughput in scenarios where scheduling fails and preemption is triggered.

Motivation

A cluster typically runs only one scheduler, and hence scheduling throughput is the crucial metric for the scheduler.

The scheduler schedules Pods one by one within the scheduling cycle, and we try to reduce API calls there as much as possible to improve the throughput of the scheduling cycle.

The binding cycle is an example of this approach:

The scheduling cycle decides which Node the Pod should go to.
At the end of the scheduling cycle, the scheduler reserves the Node within the scheduler’s cache so that the next scheduling cycle will take the current Pod into consideration.
The scheduling cycle ends and the binding cycle starts; the binding cycle runs asynchronously, and the scheduler starts the next scheduling cycle.

This flow allows us to decouple the API call to assign Pod to the Node from the scheduling cycle so that the API call doesn’t block the scheduling throughput.
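
The flow above can be illustrated with a minimal Go sketch. It is not the scheduler's actual code: the cache type, scheduleAll, and the pick and bind callbacks are hypothetical stand-ins for the scheduler cache, the scheduling loop, node selection, and the binding API call.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for the scheduler's in-memory cache.
type cache struct {
	mu      sync.Mutex
	assumed map[string]string // pod name -> node reserved for it
}

// assume records the reservation before the API call is made.
func (c *cache) assume(pod, node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.assumed[pod] = node
}

// nodeOf returns the node reserved for pod, or "" if there is none.
func (c *cache) nodeOf(pod string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.assumed[pod]
}

// scheduleAll runs the scheduling cycles one by one and starts each binding
// (the API call, represented by bind) in its own goroutine.
func scheduleAll(pods []string, pick func(pod string) string, bind func(pod, node string)) *cache {
	c := &cache{assumed: map[string]string{}}
	var wg sync.WaitGroup
	for _, pod := range pods {
		node := pick(pod)   // scheduling cycle: decide where the Pod goes
		c.assume(pod, node) // reserve the Node in the cache
		wg.Add(1)
		go func(pod, node string) { // binding cycle: runs asynchronously
			defer wg.Done()
			bind(pod, node)
		}(pod, node)
		// The loop continues with the next Pod without waiting for bind.
	}
	wg.Wait() // only so that this example exits cleanly
	return c
}

func main() {
	c := scheduleAll(
		[]string{"pod1", "pod2"},
		func(string) string { return "node1" },
		func(pod, node string) {}, // a slow API call would go here
	)
	fmt.Println(c.nodeOf("pod1"), c.nodeOf("pod2"))
}
```

The point of the sketch is the ordering: the reservation lands in the cache before the loop moves on, so the next cycle sees it even though the API call has not finished.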

However, preemption has a similar problem: it runs at the PostFilter extension point, which is part of the scheduling cycle. Preemption has to make API calls to update Pods’ conditions and delete Pods, which can block the scheduling cycle and reduce throughput.

scheduler-perf shows that the preemption scenario currently takes much longer than other scenarios.

Goals

Improve scheduling throughput when pods require issuing preemptions by making API calls asynchronous

Non-Goals

Making the same enhancement for DRA is not a goal of this KEP because DRA is still under construction.
- If DRA maintainers want to, they can technically build on this KEP. But this KEP does not discuss how.

Proposal

The preemption plugin makes the preemption API calls asynchronously after the PostFilter extension point so that the scheduler can continue scheduling other Pods while those API calls are in flight. After the preemption goroutine is done, scheduling is retried for the Pod that triggered the preemption.

Risks and Mitigations

When kube-apiserver is unstable

When kube-apiserver is unstable and API calls in the preemption goroutine fail frequently, the scheduler could make non-optimal scheduling decisions: it nominates Pods at PostFilter, but those Pods won’t be scheduled on the nominated Nodes because the preemption API calls fail.

Let’s say many mid-priority Pods are making preemption API calls. With this proposal, while their preemption goroutines are running, the scheduler assumes via .Status.NominatedNodeName that they’ll eventually be scheduled on the Nodes that the preemptions are targeting. So, the scheduling of other mid-priority or lower-priority Pods takes those preemptor Pods into consideration. That is correct if the preemption goroutine eventually succeeds, but it leads to non-optimal scheduling results otherwise. (Higher-priority Pods won’t be affected; they can take the place reserved for lower-priority Pods via .Status.NominatedNodeName.)

That said, when kube-apiserver is unstable, the scheduler doesn’t behave well in the first place because it relies on a lot of communication with kube-apiserver. Even if the scheduler makes the best scheduling decision, the binding API call might still fail.

So, we don’t have to pay special attention to this issue.

Design Details

To achieve asynchronous preemption, we will change the preemption plugin’s implementation as follows:

The preemption PostFilter plugin calculates the preemption target and nominates the Pod for the Node. (We’ll use the AddNominatedPod API exposed from the scheduling framework to plugins.)
The preemption PostFilter plugin starts a goroutine to make the API calls and returns a success status (i.e., it does not wait for the goroutine to finish).
The preemption plugin blocks the Pod while the preemption goroutine is in progress, using the PreEnqueue extension point, so that the target Pod won’t be retried during this time.

Afterwards, the preemption goroutine makes the actual API calls to delete the victim Pods and set Pod.Status.NominatedNodeName. If the preemption goroutine fails at some point, it reverts the nomination via AddNominatedPod with clearNominatedNode.

When the preemption goroutine completes, the preemption plugin ungates the Pod; the Pod is moved back to the queue by the Pod/delete event, and (hopefully) scheduled on the nominated node in the next scheduling cycle.

Consideration to race condition

Thanks to the nomination at PostFilter, this new asynchronous preemption shouldn’t introduce any race condition between scheduling cycles.

This section walks through each scenario to confirm that there is nothing to worry about.

Let’s say pod1 is in the middle of the preemption process (targeting node1) in the preemption goroutine, and the next scheduling cycle is scheduling pod2.

pod2’s scheduling is successful (pod2 has equal or lower priority than pod1)

As described above, pod1’s PostFilter nominates pod1 for node1.

During the scheduling cycle, the scheduler takes into consideration nominated Pods whose priority is equal to or higher than that of the Pod being scheduled (pod2). pod1 is such a Pod, meaning pod2 won’t rob pod1 of its place on node1.

pod2’s scheduling is successful (pod2 has higher priority than pod1)

Even though pod1 is nominated for the node, the scheduler allows pod2 to take the space on node1 that pod1’s preemption freed up.

Then, when pod1 comes back to the scheduling cycle, it may not be able to land on node1 because pod2 is scheduled there now. This happens with both the current scheduler and the scheduler proposed in this KEP, so there is no new issue here.

pod2’s scheduling fails and starts the preemption (pod2 has equal or lower priority than pod1)

The preemption also takes nominated Pods into consideration when calculating the preemption target.

Therefore, if the two preemptions for pod1 and pod2 coincidentally select the same Node, the preemption for pod2 should decide to make space for both pod1 and pod2.

So, we don’t have to worry about two preemptions targeting the same Node causing any issue.

pod2’s scheduling fails and starts the preemption (pod2 has higher priority than pod1)

pod2’s preemption ignores pod1’s nomination for node1.

If the two preemptions for pod1 and pod2 coincidentally select the same Node, the preemption for pod2 may just select the same preemption targets as pod1, and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.

But this isn’t an issue because the final result is exactly the same as with the current scheduler: pod1 preempts some Pods on node1, then pod2’s scheduling starts, pod2 takes node1, and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.
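
The priority rule that all four scenarios rely on can be summarized in a small Go sketch. It is an illustration only: the pod type and nominatedPodsToConsider are hypothetical names, and the function models just the comparison (nominated Pods with equal or higher priority are treated as if they already occupied the Node).

```go
package main

import "fmt"

// pod is a hypothetical, minimal representation of a Pod.
type pod struct {
	name     string
	priority int
}

// nominatedPodsToConsider returns the nominated pods that are treated as if
// they were already running on the node while "incoming" is evaluated:
// those with equal or higher priority than the incoming pod.
func nominatedPodsToConsider(incoming pod, nominated []pod) []pod {
	var considered []pod
	for _, n := range nominated {
		if n.name != incoming.name && n.priority >= incoming.priority {
			considered = append(considered, n)
		}
	}
	return considered
}

func main() {
	pod1 := pod{name: "pod1", priority: 100} // nominated for node1
	for _, pod2 := range []pod{
		{name: "pod2", priority: 100}, // equal priority: pod1 is respected
		{name: "pod2", priority: 200}, // higher priority: pod1 is ignored
	} {
		considered := nominatedPodsToConsider(pod2, []pod{pod1})
		fmt.Printf("pod2 priority %d: %d nominated pod(s) considered\n",
			pod2.priority, len(considered))
	}
}
```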

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go: 2024-09-07 - 85.4
/pkg/scheduler/framework/preemption/preemption.go: 2024-09-07 - 27.2

Because the coverage for preemption.go is pretty low, we have to improve the testing there before the change for this KEP.

Integration tests

We have to add integration tests to make sure the asynchronous preemption is performed appropriately, especially in the scenarios listed in Consideration to race condition.

e2e tests

We’ll add test cases in which multiple Pods trigger preemption.

Graduation Criteria

Alpha

Feature implemented behind a feature flag
All tests mentioned in Test Plan are implemented.

Beta

Gather feedback from users and fix reported bugs.
Change the feature flag to be enabled by default.

GA

Gather feedback from users and fix reported bugs.

Upgrade / Downgrade Strategy

Upgrade

During the alpha period, users have to enable the feature gate SchedulerAsyncPreemption to opt in to this feature. This is a purely in-memory feature for kube-scheduler, so no other special actions are required outside the scheduler.

Downgrade

Users need to disable the feature gate.

Version Skew Strategy

This is a purely in-memory feature for kube-scheduler, and hence no version skew strategy is needed.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: SchedulerAsyncPreemption
- Components depending on the feature gate: kube-scheduler
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

No. The feature is a performance optimization that affects every Pod that needs preemption, but there are no functional changes: the result of the preemption is the same. However, as mentioned in When kube-apiserver is unstable, scheduling results could be different.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The feature can be disabled in Alpha and Beta versions by restarting kube-scheduler with the feature-gate off.

What happens if we reenable the feature if it was previously rolled back?

The scheduler again starts to run PostFilter asynchronously.

Are there any tests for feature enablement/disablement?

Given that this is a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag), having feature tests is enough.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A partial rollout failure cannot happen because the scheduler is the only component that rolls out this feature. However, if upgrading the scheduler itself fails somehow, new Pods won’t be scheduled anymore. If this enhancement introduces a bug in the preemption and downgrading the scheduler also fails somehow, running Pods could be affected, for example, by being deleted by mistake (depending on the bug).

What specific metrics should inform a rollback?

Something may be wrong with the preemption if goroutines_duration_seconds{operation=preemption} shows that the goroutines take too long. Also, if preemption_attempts_total increases too much, that might also indicate bugs around the preemption.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No. This feature is an in-memory feature of the scheduler, so a plain upgrade and the upgrade->downgrade->upgrade path behave the same.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

This feature is used during every Pod’s preemption if the feature gate is enabled. You can see if the scheduler triggers any preemptions via the preemption_attempts_total metric.

You can find Pods that have triggered the preemption by referring to .Status.NominatedNodeName, and Pods that have been preempted by referring to their condition with type: DisruptionTarget and reason: PreemptionByScheduler.

How can someone using this feature know that it is working for their instance?

API .status
- Other field: If .Status.NominatedNodeName of Pods is non-empty, they have experienced the preemption running asynchronously.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The failure rate of the preemption goroutine (goroutines_execution_total{result=error, operation=preemption}/goroutines_execution_total{operation=preemption}) should be < 0.01.
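
As a worked example of this ratio, the following Go sketch computes the failure rate from two counter values and compares it with the 0.01 threshold. The function names and the sample numbers are hypothetical; in practice the values would come from the goroutines_execution_total metric.

```go
package main

import "fmt"

// maxFailureRate is the SLO threshold stated above.
const maxFailureRate = 0.01

// failureRate returns errorCount/totalCount, or 0 when nothing has run yet.
func failureRate(errorCount, totalCount uint64) float64 {
	if totalCount == 0 {
		return 0
	}
	return float64(errorCount) / float64(totalCount)
}

// meetsSLO reports whether the failure rate is strictly below the threshold.
func meetsSLO(errorCount, totalCount uint64) bool {
	return failureRate(errorCount, totalCount) < maxFailureRate
}

func main() {
	// Hypothetical sample: 5 failed preemption goroutines out of 1000.
	fmt.Printf("rate=%.4f ok=%t\n", failureRate(5, 1000), meetsSLO(5, 1000))
	// Hypothetical sample: 20 failed preemption goroutines out of 1000.
	fmt.Printf("rate=%.4f ok=%t\n", failureRate(20, 1000), meetsSLO(20, 1000))
}
```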

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: goroutines_execution_total{result=error, operation=preemption}
- Components exposing the metric: kube-scheduler

Are there any missing metrics that would be useful to have to improve observability of this feature?

goroutines_duration_seconds (w/ label: operation): to observe how long each preemption goroutine takes to complete.
goroutines_execution_total (w/ labels: operation, result): to observe how many preemption goroutines have failed.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No. Just move the existing API calls from PostFilter into goroutines.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

The scheduler starts to run more goroutines in the preemption plugin, so CPU usage may go up.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

In such cases, API calls for the preemption fail in the preemption goroutines. However, the scheduler then cannot do essentially anything, not just preemption, because it cannot get objects, bind Pods to Nodes, etc.

What are other known failure modes?

Nothing.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Sep 07, 2024: The initial KEP is submitted.
Nov 08, 2024: The implementation PR is merged.
Feb 03, 2025: The PR to promote it to beta is submitted.

Drawbacks

Alternatives

Introduce a new extension point

To make this kind of scenario easier to implement for other plugins, we can implement a new extension point AsyncPostFilter. We calculate the preemption target and nominate the Pod for the Node at PostFilter, and then AsyncPostFilter starts asynchronously, in which the preemption plugin makes API calls for the preemption.

The Pod won’t be queued back to the queue until AsyncPostFilter is done.

We don’t go with this idea because we can implement the async preemption without introducing a new extension point. Adding a new extension point unnecessarily may be something we regret in the future, and we can still add it later if it’s really necessary.

Resources: Authorize with Selectors

Mon, 01 Jan 0001 00:00:00 +0000

KEP-4601: Authorize with Selectors

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The authorization attributes will be extended to include field selectors and label selectors from List, Watch, and DeleteCollection. This will allow authorizers to use these selectors when making an authorization decision.

Motivation

Security for per-node workloads could be improved by exposing field and label selectors to authorizers. Adding them as authorization attributes allows the development of new kinds of authorizers that leverage this information to provide security. In particular, it enables out-of-tree authorizers to experiment with ways to express restrictions based on field and label selectors.

Goals

Add field and label selectors to authorization attributes for List, Watch, and DeleteCollection verbs.
Add field and label selectors to webhook authorization types.
Add field and label selectors to SelfSubjectAccessReview (SSAR), SubjectAccessReview (SAR), and LocalSubjectAccessReview.
Update node authorizer to restrict on nodeName field selector.
Add field and label selectors to CEL authorizer implementation.

Non-Goals

Create a generic in-tree authorizer that manages field or label selectors.
Expand the audit surface area, since requestURI is already included
Expand the admission surface area (admission.Attributes, AdmissionReview, available to admission) since admission verbs don’t support field/label selectors

Proposal

List, Watch, and DeleteCollection requests directly have field and label selector options. A single-item List or Watch request is still a list as normal (including selectors), but also includes a name.

Authorization Attributes changes

The authorization attributes have easy access to the query parameter field and label selectors. To avoid confusion, field and label selectors will not be included in authorization attributes for kube-apiserver requests with verbs where the field selector has no semantic meaning. In practice this means that (for now), only List, Watch, and DeleteCollection have field and label selectors.

SubjectAccessReviews submitted to the kube-apiserver with verbs that do not honor the selectors will NOT modify the field and label selector attributes. The client is trusted to be sending only combinations that will be honored.

Any authorizer that gets an error from GetFieldSelector or GetLabelSelector may attempt to authorize without field or label selectors since that will authorize using a wider permission (field and label selectors can only reduce access).

type Attributes interface {
 // ... existing methods omitted ...

 // GetFieldSelector is lazy, thread-safe, and stores the parsed result and error.
 // It can return an error if the field selector cannot be parsed.
 // Remember that field selector formats vary based on the version of the API being used!
 GetFieldSelector() (fields.Requirements, error)

 // GetLabelSelector is lazy, thread-safe, and stores the parsed result and error.
 // It can return an error if the label selector cannot be parsed.
 GetLabelSelector() (labels.Requirements, error)
}

Webhook authors: remember that the list of verbs accepting field and label selectors may change over time. If the kube-apiserver sends the FieldSelector or LabelSelector to a webhook, the kube-apiserver intends to honor the selector attributes.

Future-proofing your authorization webhook for future verbs

As of 1.31, the only verbs with field and label selectors are List, Watch, and DeleteCollection. In the future, the kube-apiserver may add field and label selectors to Get, Create, Update, Patch, and Delete.

For Get, this means the field and label selector of the retrieved object must match.
For Create, this means that the resource after all mutation is complete (finalObject) must match the field and label selector.
For Update/Patch, this means that the finalNewObject and oldObject must match the field and label selector.
For Delete, this means that the oldObject must match the field and label selector.
For subresources, if the storage layer cannot verify the parent object matches the selector (both old and new), the request must be rejected.

We do not allow field and label selectors for Get, because if a client is specifying a selector, they can add a .metadata.name field selector and use a List to get equivalent functionality.

SubjectAccessReview Changes

SubjectAccessReview is used for two purposes:

Authorization webhook calls from the kube-apiserver to a webhook. This usage likely benefits from a serialization with []Requirement.
Authorization checks from a client (often a server process using in-cluster authorization like kube-rbac-proxy). This usage likely benefits from a serialization that matches the query parameter.

Their needs are best met with two different serializations (see user stories).


type SubjectAccessReviewSpec struct {
 ResourceAttributes *ResourceAttributes
}

type ResourceAttributes struct {
 FieldSelector *FieldSelectorAttributes

 LabelSelector *LabelSelectorAttributes
}

// FieldSelectorAttributes indicates a field limited access.
// For webhooks:
// The kube-apiserver will never send a request with rawSelector set, but we cannot control what other clients directly send.
// * If rawSelector is empty and requirements are empty, the request is not limited.
// * If rawSelector is present and requirements are empty, the request is not limited.
// * If rawSelector is empty and requirements are present, the requirements should be honored
// * If rawSelector is present and requirements are present, the request is invalid.
// Webhook authors are encouraged to
// * ensure rawSelector and requirements are not both set
// * consider the requirements field if set
// * not try to parse or consider the rawSelector field if set.
// This is to avoid another CVE-2022-2880 (i.e. getting different systems to agree on how exactly to parse
// a query is not something we want), see https://www.oxeye.io/resources/golang-parameter-smuggling-attack for more details.
// For the kube-apiserver:
// * If rawSelector is empty and requirements are empty, the request is not limited.
// * If rawSelector is present and requirements are empty, the rawSelector will be parsed and limited if the parsing succeeds.
// * If rawSelector is empty and requirements are present, the requirements should be honored
// * If rawSelector is present and requirements are present, the request is invalid.
type FieldSelectorAttributes struct {
 // rawSelector is the serialization of a field selector that would be included in a query parameter.
 // Webhook implementations are encouraged to ignore rawSelector.
 // The kube-apiserver's SubjectAccessReview will parse the rawSelector. 
 RawSelector string

 // requirements is the parsed interpretation of a field selector.
 // All requirements must be met for a resource instance to match the selector.
 // Webhook implementations should handle requirements, but how to handle them is up to the webhook.
 // Since requirements can only limit the request, it is safe to authorize as unlimited request if the requirements
 // are not understood.
 Requirements []FieldSelectorRequirement
}

// LabelSelectorAttributes indicates a label limited access.
// For webhooks:
// The kube-apiserver will never send a request with rawSelector set, but we cannot control what other clients directly send.
// * If rawSelector is empty and requirements are empty, the request is not limited.
// * If rawSelector is present and requirements are empty, the request is not limited.
// * If rawSelector is empty and requirements are present, the requirements should be honored
// * If rawSelector is present and requirements are present, the request is invalid.
// Webhook authors are encouraged to
// * ensure rawSelector and requirements are not both set
// * consider the requirements field if set
// * not try to parse or consider the rawSelector field if set.
// This is to avoid another CVE-2022-2880 (i.e. getting different systems to agree on how exactly to parse
// a query is not something we want), see https://www.oxeye.io/resources/golang-parameter-smuggling-attack for more details.
// For the kube-apiserver:
// * If rawSelector is empty and requirements are empty, the request is not limited.
// * If rawSelector is present and requirements are empty, the rawSelector will be parsed and limited if the parsing succeeds.
// * If rawSelector is empty and requirements are present, the requirements should be honored
// * If rawSelector is present and requirements are present, the request is invalid.
type LabelSelectorAttributes struct {
 // rawSelector is the serialization of a label selector that would be included in a query parameter.
 // Webhook implementations are encouraged to ignore rawSelector.
 // The kube-apiserver's SubjectAccessReview will parse the rawSelector. 
 RawSelector string

 // requirements is the parsed interpretation of a label selector.
 // All requirements must be met for a resource instance to match the selector.
 // Webhook implementations should handle requirements, but how to handle them is up to the webhook.
 // Since requirements can only limit the request, it is safe to authorize as unlimited request if the requirements
 // are not understood.
 Requirements []metav1.LabelSelectorRequirement
}

type FieldSelectorRequirement struct {
 // key is the field selector key that the requirement applies to.
 Key string `json:"key" protobuf:"bytes,1,opt,name=key"`
 // operator represents a key's relationship to a set of values.
 // Valid operators are In, NotIn, Exists, DoesNotExist
 // The list of operators may grow in the future.
 // Webhook authors are encouraged to ignore unrecognized operators and assume they don't limit the request.
 // The semantics of "all requirements are AND'd" will not change, so other requirements can continue to be enforced.
 Operator LabelSelectorOperator `json:"operator" protobuf:"bytes,2,opt,name=operator,casttype=LabelSelectorOperator"`
 // values is an array of string values. If the operator is In or NotIn,
 // the values array must be non-empty. If the operator is Exists or DoesNotExist,
 // the values array must be empty.
 // +optional
 // +listType=atomic
 Values []string `json:"values,omitempty" protobuf:"bytes,3,rep,name=values"`
}
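
The "For webhooks" rules in the comments above can be expressed as a small decision function. The following Go sketch is an illustration under stated assumptions: requirement and selectorAttributes are hypothetical, trimmed-down stand-ins for the types above, and webhookRequirements is not part of any Kubernetes library.

```go
package main

import (
	"errors"
	"fmt"
)

// requirement is a hypothetical stand-in for FieldSelectorRequirement.
type requirement struct {
	Key      string
	Operator string
	Values   []string
}

// selectorAttributes mirrors the two fields shared by FieldSelectorAttributes
// and LabelSelectorAttributes.
type selectorAttributes struct {
	RawSelector  string
	Requirements []requirement
}

var errBothSet = errors.New("rawSelector and requirements are both set")

// webhookRequirements returns the requirements a webhook should enforce.
// A nil result with a nil error means the request is not limited.
func webhookRequirements(attrs *selectorAttributes) ([]requirement, error) {
	if attrs == nil {
		return nil, nil
	}
	hasRaw := attrs.RawSelector != ""
	hasReqs := len(attrs.Requirements) > 0
	switch {
	case hasRaw && hasReqs:
		return nil, errBothSet // the request is invalid
	case hasReqs:
		return attrs.Requirements, nil // honor the parsed requirements
	default:
		// rawSelector alone is deliberately not parsed by the webhook;
		// the request is treated as not limited.
		return nil, nil
	}
}

func main() {
	reqs, err := webhookRequirements(&selectorAttributes{
		Requirements: []requirement{{Key: "spec.nodeName", Operator: "In", Values: []string{"node-a"}}},
	})
	fmt.Println(len(reqs), err)

	_, err = webhookRequirements(&selectorAttributes{RawSelector: "a=b", Requirements: reqs})
	fmt.Println(err)
}
```

Treating a lone rawSelector as "not limited" is safe here because selectors can only narrow a request, so ignoring one means the webhook authorizes against the broader permission.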

Importantly, if old webhook authorizers do not honor these new fields, they will assume the broadest possible access and fail closed. If old in-cluster authorization does not include field and label selectors, the kube-apiserver will assume the broadest possible access and fail closed.

Node Authorizer Changes

The node authorizer will be modified to only authorize node clients to list and watch pods with fieldSelectors containing spec.nodeName=$nodeName. The node authorizer will be modified to authorize pod get requests based on the graph.
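
A minimal Go sketch of the list/watch check described above follows. It is only an illustration: the requirement type, the "=" operator string, and allowsPodListWatch are hypothetical and do not reproduce the node authorizer's real code or types.

```go
package main

import "fmt"

// requirement is a hypothetical, simplified parsed field selector term.
type requirement struct {
	Field    string
	Operator string // "=" is assumed to mean exact match in this sketch
	Value    string
}

// allowsPodListWatch reports whether the parsed field selector pins
// spec.nodeName to the requesting node's own name.
func allowsPodListWatch(nodeName string, reqs []requirement) bool {
	for _, r := range reqs {
		if r.Field == "spec.nodeName" && r.Operator == "=" && r.Value == nodeName {
			return true
		}
	}
	return false
}

func main() {
	own := []requirement{{Field: "spec.nodeName", Operator: "=", Value: "node-a"}}
	other := []requirement{{Field: "spec.nodeName", Operator: "=", Value: "node-b"}}
	fmt.Println(allowsPodListWatch("node-a", own))   // selector names this node
	fmt.Println(allowsPodListWatch("node-a", other)) // selector names another node
	fmt.Println(allowsPodListWatch("node-a", nil))   // no selector at all
}
```

Because all requirements are AND'd, additional terms in the selector can only narrow the result further, so the sketch accepts the request as soon as the node name term is present.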

CEL Authorizer Changes

While admission isn’t supported on List, Watch, or DeleteCollection, it is reasonable to expect that secondary authorization checks may desire to use those verbs and leverage the field and label selector capabilities. To support this we will add two congruent options similar to

 "fieldSelector": {
 cel.MemberOverload("resourcecheck_fieldselector", []*cel.Type{ResourceCheckType, cel.StringType}, ResourceCheckType,
 cel.BinaryBinding(resourceCheckName))},
 }

This will allow usage like authorizer.group('').resource('pods').fieldSelector('spec.nodeName=foo').check('list').allowed(). The parsing will happen during the call to allowed where we track errors and have means of handling them already. Field and label selectors that fail to parse will be ignored. No checking of valid verb,selector pairs is made.

User Stories (Optional)

As a SAR client, I want to check a request with a field or label selector

This type of usage probably finds the stringified serialization format used in the query parameters the most convenient format to build their request with. Providing the query parameter serialization format avoids the need for a client to grow a decently complex lexer/parser.

As an authorization webhook author, I want to easily consume the field and label selectors

This type of usage probably finds a serialized []Requirement to be the most convenient way to consume the field and label selector. Providing the parsed value avoids the need for every consumer to grow a decently complex lexer/parser.

Notes/Constraints/Caveats (Optional)

Remember to update these places in existing code:

authorization webhook matchConditions, which evaluates the v1 SubjectAccessReview that would be sent to the webhook: ref .
v1 / v1beta1 SAR translation function ref
v1 SubjectAccessReview construction function ref
cache size decision ref

Risks and Mitigations

client provides field or label selector to kube-apiserver that does not parse

The kube-apiserver may still authorize the request without considering the selectors (system:masters for instance). It will be up to the REST handler to accept or reject requests for bad selectors. This approach also allows an aggregated API server to have extended field and label selector syntax, though we strongly discourage doing so. The kube-apiserver will attempt to authorize without the selector information.

If the client is authorized without the selector, then Allow since they have broader permission.
If the client is not authorized without the selector then either NoOpinion or Fail depending on intent.
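
The fallback described above can be sketched in Go. Everything here is hypothetical: the toy parseSelector accepts only a single key=value pair, the two callbacks stand in for an authorizer's unrestricted and selector-restricted rules, and the sketch picks NoOpinion (rather than Fail) for the unparseable case.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type decision string

const (
	decisionAllow     decision = "Allow"
	decisionNoOpinion decision = "NoOpinion"
)

// request is a hypothetical, simplified view of an incoming request.
type request struct {
	user        string
	rawSelector string
}

// parseSelector is a toy parser that accepts only "key=value".
func parseSelector(raw string) (key, value string, err error) {
	key, value, ok := strings.Cut(raw, "=")
	if !ok || key == "" || value == "" {
		return "", "", errors.New("selector does not parse")
	}
	return key, value, nil
}

// authorize first checks the broader, selector-free permission. Only if that
// is not granted does the selector matter, and an unparseable selector then
// yields NoOpinion.
func authorize(r request, unrestricted func(user string) bool, restricted func(user, key, value string) bool) decision {
	if unrestricted(r.user) {
		return decisionAllow // broader permission covers any selector
	}
	key, value, err := parseSelector(r.rawSelector)
	if err != nil {
		return decisionNoOpinion
	}
	if restricted(r.user, key, value) {
		return decisionAllow
	}
	return decisionNoOpinion
}

func main() {
	isAdmin := func(user string) bool { return user == "admin" }
	ownNode := func(user, key, value string) bool {
		return key == "spec.nodeName" && value == user
	}
	fmt.Println(authorize(request{"admin", "%%%"}, isAdmin, ownNode))
	fmt.Println(authorize(request{"node-a", "%%%"}, isAdmin, ownNode))
	fmt.Println(authorize(request{"node-a", "spec.nodeName=node-a"}, isAdmin, ownNode))
}
```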

client provides field or label selector to kube-apiserver with improper verb

Consider a client that sends an Update request with a field selector on it. The metav1.UpdateOptions type doesn’t allow this, but imagine a devious user with an alternative library. The ResolveRequestInfo method will not add field and label selectors to the requestInfo, so they will not appear in the authorization.Attributes, so the spurious selectors are not passed to the authorizer. This keeps authorization behavior exactly as it was previously.
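
A minimal Go sketch of this filtering, assuming a simplified requestInfo: the type, the verbsWithSelectors set, and resolveRequestInfo below are hypothetical illustrations of the behavior described, not the apiserver's actual ResolveRequestInfo code.

```go
package main

import "fmt"

// verbsWithSelectors lists the verbs for which selectors are meaningful
// (List, Watch, and DeleteCollection, per the proposal).
var verbsWithSelectors = map[string]bool{
	"list":             true,
	"watch":            true,
	"deletecollection": true,
}

// requestInfo is a hypothetical, trimmed-down request description.
type requestInfo struct {
	Verb          string
	FieldSelector string
	LabelSelector string
}

// resolveRequestInfo copies the selectors only for verbs that honor them, so
// spurious selectors on other verbs never reach the authorizer.
func resolveRequestInfo(verb, rawFieldSelector, rawLabelSelector string) requestInfo {
	info := requestInfo{Verb: verb}
	if verbsWithSelectors[verb] {
		info.FieldSelector = rawFieldSelector
		info.LabelSelector = rawLabelSelector
	}
	return info
}

func main() {
	fmt.Printf("%+v\n", resolveRequestInfo("list", "spec.nodeName=node-a", ""))
	fmt.Printf("%+v\n", resolveRequestInfo("update", "spec.nodeName=node-a", ""))
}
```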

SubjectAccessReviews are not modified prior to calling the kube-apiserver authorizer. This allows skew in support between the kube-apiserver and other apiservers.

client provides SAR where field rawSelector does not match field requirements.

The request is rejected. Only one of rawSelector and requirements can be specified.

Design Details

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/registry/authorization/subjectaccessreview: 61.9% of statements
k8s.io/kubernetes/pkg/registry/authorization/util: 82.6% of statements
k8s.io/kubernetes/plugin/pkg/auth/authorizer/node: 77.0% of statements
k8s.io/kubernetes/pkg/apis/admissionregistration/validation: 87.6% of statements
k8s.io/kubernetes/pkg/apis/authorization/validation: 97.0% of statements
k8s.io/apiserver/pkg/admission/plugin/cel: 83.6% of statements
k8s.io/apiserver/pkg/authorization/cel: 53.9% of statements
k8s.io/apiserver/pkg/endpoints/filters: 77.2% of statements
k8s.io/apiserver/pkg/endpoints/request: 65.4% of statements
k8s.io/apiserver/plugin/pkg/authorizer/webhook: 86.6% of statements

Unit tests exercise node authorization, CEL compilation for authorization webhook and admission matchConditions, and CEL compilation for authorizer use with and without the feature enabled:

https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/plugin/pkg/auth/authorizer/node/node_authorizer_test.go#L75-L81

https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/pkg/authorization/cel/compile_test.go#L34

https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook_v1_test.go#L806

https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/pkg/admission/plugin/cel/filter_test.go#L503-L620

Integration tests

test/integration/apiserver/cel/authorizerselector/... - triage history
- Fully exercise the new CEL authorizer functions with the feature enabled and disabled
test/integration/auth TestMultiWebhookAuthzConfig - triage history
- Positive and negative match tests for a webhook matchCondition using selector matching, on actual API requests using selectors and on SubjectAccessReview requests

Test history

e2e tests

This feature is fully tested with unit and integration tests

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Unit tests demonstrating wiring and fallback
Integration test demonstrating field selector wiring
- must include fallback on parsing error as well

Beta

Determine if additional tests are necessary
Ensure reliability of existing tests

GA

All bugs resolved and no new bugs requiring code change since the previous shipped release

Upgrade / Downgrade Strategy

On upgrade to a version that enables the feature, no configuration changes are required to maintain previous behavior of CEL expressions and authorization webhooks. All existing CEL expressions and authorization webhook responses behave identically.

On upgrade to a version that enables the feature, to make use of the new feature:

authorization webhooks can inspect incoming SubjectAccessReview requests for field and label selector information
authorization webhook configuration files can include matchConditions that inspect field and label selector information
admission webhook API matchConditions can use authorizer fieldSelector / labelSelector functions
SubjectAccessReview API requests can specify fieldSelector / labelSelector fields

On downgrade to a version that does not enable the feature by default, or if the feature is disabled:

field and label selector information will no longer be sent to authorization webhooks
authorization webhook configuration files can no longer include matchConditions that inspect field and label selector information
admission webhook API matchConditions that use authorizer fieldSelector / labelSelector functions will not error, but will no-op
SubjectAccessReview API requests that specify fieldSelector / labelSelector fields will drop those fields

Version Skew Strategy

New kube-apiserver, old webhook authorizer

The new kube-apiserver will include the field and label selectors, but the old webhook authorizer will ignore them. The old authorizer will assume the broadest possible action and authorize accordingly. Because the old authorizer will only allow the action if the user has permission to act on the entire collection, this fails safely. There may be more rejections than expected, but this behavior matches previous behavior.

Old kube-apiserver, new in-cluster authorizer (or any SAR client)

The new client will include the field and label selectors, but the kube-apiserver will ignore them. The kube-apiserver will assume the broadest possible action and authorize accordingly. Because the kube-apiserver will only allow the action if the user has permission to act on the entire collection, this fails safely. There may be more rejections than expected, but this behavior matches previous behavior.
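The fail-safe property in both skew cases can be sketched as follows. The policy model here is entirely hypothetical (a set of users allowed to read everything, plus one narrower selector grant per user) and only illustrates that a component ignoring selectors never allows more than a selector-aware one.

```go
package main

import "fmt"

// request is a hypothetical, simplified authorization request.
type request struct {
	User          string
	FieldSelector string // empty means "the entire collection"
}

// policy is a hypothetical rule set: users who may list everything, and
// per-user field selectors that a selector-aware authorizer also accepts.
type policy struct {
	ListAll     map[string]bool
	AllowedOnly map[string]string
}

// authorizeIgnoringSelectors models the older component: it never looks at
// the selector, so it treats every request as covering the whole collection.
func authorizeIgnoringSelectors(p policy, r request) bool {
	return p.ListAll[r.User]
}

// authorizeWithSelectors models the newer component: it additionally allows
// a request whose selector matches the narrower grant for that user.
func authorizeWithSelectors(p policy, r request) bool {
	if p.ListAll[r.User] {
		return true
	}
	want, ok := p.AllowedOnly[r.User]
	return ok && r.FieldSelector == want
}

func main() {
	p := policy{
		ListAll:     map[string]bool{"admin": true},
		AllowedOnly: map[string]string{"node-1": "spec.nodeName=node-1"},
	}
	for _, r := range []request{
		{User: "admin"},
		{User: "node-1", FieldSelector: "spec.nodeName=node-1"},
	} {
		fmt.Println(r.User, authorizeIgnoringSelectors(p, r), authorizeWithSelectors(p, r))
	}
}
```

The selector-unaware path can only produce extra denials, never extra approvals, which is the sense in which the skew fails safely.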

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: AuthorizeWithSelectors
- Components depending on the feature gate:
  - kube-apiserver
- Feature gate name: AuthorizeNodeWithSelectors
- Components depending on the feature gate:
  - kube-apiserver

Does enabling the feature change any default behavior?

Yes. The kube-apiserver will send field and label selector information to authorization webhooks. The node authorizer will start preventing kubelets from listing pods that are not on their node.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Set the FeatureGate to false and restart the kube-apiserver. The kube-apiserver will stop sending field and label selector information to authorization webhooks. Persisted CEL expressions using fieldSelector and labelSelector authorization functions will still function.

What happens if we reenable the feature if it was previously rolled back?

The kube-apiserver will send field and label selector information to authorization webhooks.

Are there any tests for feature enablement/disablement?

Yes. Integration tests exercise behavior of CEL expressions with the feature enabled and disabled.

https://github.com/kubernetes/kubernetes/tree/0b1d123fd040359da11dc772947a7908ee907910/test/integration/apiserver/cel/authorizerselector

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Non-kubelet clients using kubelet credentials to make API requests could be forbidden if they are listing/watching pods without filtering to pods scheduled to the node, or if they are listing/watching nodes other than their own node.

What specific metrics should inform a rollback?

Use of kubelet credentials to make API requests the kubelet is not authorized to make is unexpected, but could be detected in the authorization_attempts_total{result=denied} metric increasing and audit events showing requests from a user in the system:nodes group with an authorization.k8s.io/decision=forbid audit annotation.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Handling of persisted CEL expressions using selector features was tested with the feature disabled, and with a compatibility version of 1.30, to ensure that a previous version API server would not have to handle CEL expressions it did not understand.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

None

How can an operator determine if the feature is in use by workloads?

Workloads do not use this feature directly.

Audit events of SubjectAccessReview API requests would show if selector information was being provided.

Authorization webhooks would be able to observe selector information provided in requests.

How can someone using this feature know that it is working for their instance?

Most of the uses are internal to cluster administrators:

authorization webhooks configured with matchConditions using fieldSelector/labelSelector pass validation and only route requests passing those conditions to the webhook (apiserver_authorization_match_condition_exclusions_total metric will increment if match conditions skip)
authorization webhooks can inspect the SubjectAccessReview requests sent to them to observe selector information
admission webhooks and validating admission policies can use fieldSelector and labelSelector authorizer methods and pass API validation.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Use of this feature should not change existing API SLOs.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Use of this feature should not change existing API SLIs.

Are there any missing metrics that would be useful to have to improve observability of this feature?

There are already metrics for the layers this feature is adding to:

authorization latency
authorization success
webhook authorizer match condition latency
webhook authorizer match condition success
webhook admission match condition latency
webhook admission match condition success
validating admission policy match condition latency
validating admission policy match condition success

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Existing API fields containing CEL expressions support additional CEL functions.

SubjectAccessReview types (which are not persisted) add new fields for fieldSelector and labelSelector data.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Enabling the feature adds negligible size to authorization webhook payloads.

Using the authorization selector functions in CEL expressions in authorization webhook matchConditions, admission webhook matchConditions, and validating admission policies can take additional time, though this is no different from increasing the complexity or number of CEL expressions generally. CEL expressions that can be set via REST APIs are subject to cost estimation to limit the complexity and size of the input data used for selectors.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No, this feature does not touch nodes.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

This feature is fully contained within the API server.

What are other known failure modes?

Non-kubelet clients using kubelet credentials are forbidden
- Detection: logs of non-kubelet client, authorization_attempts_total{result=denied} metric increasing, audit events showing requests from a user in the system:nodes group with an authorization.k8s.io/decision=forbid audit annotation
- Mitigations:
  - change the non-kubelet client to use its own credential (preferred)
  - adjust the non-kubelet client to use field selectors on pods and nodes
  - temporarily disable the AuthorizeNodeWithSelectors feature gate in kube-apiserver
- Diagnostics: the node authorizer logs the following messages at verbosity level 2 when a client attempts to use kubelet credentials to read nodes or pods without using the expected field selector:
  - node '...' cannot read all nodes, only its own Node object
  - node '...' cannot read '...', only its own Node object
  - can only list/watch pods with spec.nodeName field selector
- Testing: There are tests ensuring the node authorizer forbids these overly broad read requests. Use of kubelet credentials by non-kubelet clients to make API requests the kubelet is not authorized to make is unexpected and unwanted.
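To make the failure mode concrete, here is a minimal sketch of the kind of check that produces the last diagnostic message above. It assumes the selector is compared as a plain string, which is a simplification; the function name is hypothetical and this is not the node authorizer's code.

```go
package main

import "fmt"

// authorizePodListWatch decides whether a kubelet identity for nodeName may
// list or watch pods with the given field selector.
// Simplification (assumption): the selector is compared as a plain string
// instead of being parsed into requirements.
func authorizePodListWatch(nodeName, fieldSelector string) (bool, string) {
	if fieldSelector != "spec.nodeName="+nodeName {
		return false, "can only list/watch pods with spec.nodeName field selector"
	}
	return true, ""
}

func main() {
	for _, sel := range []string{"", "spec.nodeName=node-2", "spec.nodeName=node-1"} {
		ok, reason := authorizePodListWatch("node-1", sel)
		fmt.Printf("selector=%q allowed=%t reason=%q\n", sel, ok, reason)
	}
}
```

A non-kubelet client that reuses kubelet credentials and lists pods without this selector would hit the denied branch, which matches the second mitigation listed above.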

What steps should be taken if SLOs are not being met to determine the problem?

Determine if webhook latency or matchCondition latency of matchConditions using these selector functions is the primary contributor, and if that change correlates with enablement of this feature. Test if eliminating use of the CEL selector functions in the offending CEL expression resolves the issue.

Implementation History

v1.31: Alpha release
v1.32: Beta release
v1.34: Stable release

Drawbacks

None considered

Alternatives

None considered

Infrastructure Needed (Optional)

None

Resources: Auto delete PVCs created by StatefulSet

Mon, 01 Jan 0001 00:00:00 +0000

KEP-1847: Auto delete PVCs created by StatefulSet

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The proposal is to add a feature to autodelete the PVCs created by StatefulSet.

Motivation

Currently, the PVCs created automatically by the StatefulSet are not deleted when the StatefulSet is deleted. As the discussion in issue 55045 shows, there are several use cases where the automatically created PVCs should be deleted as well. In many StatefulSet use cases, PVCs have a different lifecycle than the pods of the StatefulSet and should not be deleted at the same time. Because of this, PVC deletion will be opt-in for users.

Goals

Provide a feature to auto delete the PVCs created by StatefulSet when the volumes are no longer in use, to ease management of StatefulSets that don’t live indefinitely. As application state should survive StatefulSet maintenance, the feature ensures that pod restarts due to non-scale-down events, such as a rolling update or node drain, do not delete the PVC.

Non-Goals

This proposal does not plan to address how the underlying PVs are treated on PVC deletion. That functionality will continue to be governed by the reclaim policy of the storage class.

Proposal

Background

The garbagecollector controller is responsible for ensuring that when a StatefulSet is deleted, the corresponding pods spawned from the StatefulSet are deleted as well. The garbagecollector uses an OwnerReference added to the Pod by the StatefulSet controller to delete the Pod. This proposal leverages a similar mechanism to automatically delete the PVCs created by the controller from the StatefulSet’s VolumeClaimTemplate.

Changes required

The following changes are required:

Add persistentVolumeClaimRetentionPolicy to the StatefulSet spec with the following fields.
- whenDeleted - specifies if the VolumeClaimTemplate PVCs are deleted when their StatefulSet is deleted.
- whenScaled - specifies if VolumeClaimTemplate PVCs are deleted when their corresponding pod is deleted on a StatefulSet scale-down, that is, when the number of pods in a StatefulSet is reduced via the Replicas field.
These fields may be set to the following values.
- Retain - the default policy, which is also used when no policy is specified. This specifies the existing behavior: when a StatefulSet is deleted or scaled down, no action is taken with respect to the PVCs created by the StatefulSet.
- Delete - specifies that the appropriate PVCs as described above will be deleted in the corresponding scenario, either on StatefulSet deletion or scale-down.
Add patch to the statefulset controller rbac cluster role for persistentvolumeclaims.
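The shape of the new field can be sketched as below. The Go names are shortened, hypothetical versions of the API types, and the defaulting helper only illustrates the "Retain when unspecified" rule described above.

```go
package main

import "fmt"

// RetentionPolicyType is a hypothetical, shortened name for the policy value type.
type RetentionPolicyType string

const (
	Retain RetentionPolicyType = "Retain"
	Delete RetentionPolicyType = "Delete"
)

// RetentionPolicy stands in for persistentVolumeClaimRetentionPolicy.
type RetentionPolicy struct {
	WhenDeleted RetentionPolicyType
	WhenScaled  RetentionPolicyType
}

// effectivePolicy fills in Retain for a missing policy or for unset fields,
// so that existing StatefulSets keep their current behavior.
func effectivePolicy(p *RetentionPolicy) RetentionPolicy {
	out := RetentionPolicy{WhenDeleted: Retain, WhenScaled: Retain}
	if p == nil {
		return out
	}
	if p.WhenDeleted != "" {
		out.WhenDeleted = p.WhenDeleted
	}
	if p.WhenScaled != "" {
		out.WhenScaled = p.WhenScaled
	}
	return out
}

func main() {
	fmt.Println(effectivePolicy(nil))
	fmt.Println(effectivePolicy(&RetentionPolicy{WhenDeleted: Delete}))
}
```

Treating an absent policy as Retain for both fields is what makes the feature opt-in.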

User Stories

Story 0

The user is happy with the legacy behavior of a StatefulSet. They leave all fields of PersistentVolumeClaimRetentionPolicy set to Retain. Traditional StatefulSet behavior changes neither on set deletion nor on scale-down.

Story 1

The user is running a StatefulSet as part of an application with a finite lifetime. During the application’s existence the StatefulSet maintains per-pod state, even across scale-up and scale-down. In order to maximize performance, volumes are retained during scale-down so that scale-up can leverage the existing volumes. When the application is finished, the volumes created by the StatefulSet are no longer needed and can be automatically reclaimed.

The user would set persistentVolumeClaimRetentionPolicy.whenDeleted to Delete, which would ensure that the PVCs created automatically during the StatefulSet activation are deleted once the StatefulSet is deleted.

Story 2

The user is cost conscious, and can sustain slower scale-up speeds even after a scale-down, because scaling events are rare, and volume data can be reconstructed, albeit slowly, during a scale up. However, it is necessary to bring down the StatefulSet temporarily by deleting it, and then bring it back up by reusing the volumes. This is accomplished by setting persistentVolumeClaimRetentionPolicy.whenScaled to Delete, and leaving persistentVolumeClaimRetentionPolicy.whenDeleted at Retain.

Story 3

User is very cost conscious, and can sustain slower scale-up speeds even after a scale-down. The user does not want to pay for volumes that are not in use in any circumstance, and so wants them to be reclaimed as soon as possible. On scale-up a new volume will be provisioned and the new pod will have to re-initialize. However, for short-lived interruptions when a pod is killed & recreated, like a rolling update or node disruptions, the data on volumes is persisted. This is a key property that ephemeral storage, like emptyDir, cannot provide.

User would set the persistentVolumeClaimRetentionPolicy.whenScaled as well as persistentVolumeClaimRetentionPolicy.whenDeleted to Delete, ensuring PVCs are deleted when corresponding Pods are deleted. New Pods created during scale-up followed by a scale-down will wait for freshly created PVCs. PVCs are deleted as well when the set is deleted, reclaiming volumes as quickly as possible and minimizing expense.

Notes/Constraints/Caveats (optional)

This feature applies to PVCs which are defined by the volumeClaimTemplate of a StatefulSet. Any PVC and PV provisioned from this mechanism will function with this feature. These PVCs are identified by the static naming scheme used by StatefulSets. Auto-provisioned and pre-provisioned PVCs will be treated identically, so that if a user pre-provisions a PVC matching those of a VolumeClaimTemplate it will be deleted according to the deletion policy.
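For illustration, the static naming scheme can be sketched as below, assuming the usual StatefulSet convention of template name, set name, and pod ordinal joined by hyphens. The helper name is hypothetical.

```go
package main

import "fmt"

// claimName builds the PVC name for one replica, assuming the convention
// <volumeClaimTemplate name>-<StatefulSet name>-<ordinal>.
func claimName(templateName, setName string, ordinal int) string {
	return fmt.Sprintf("%s-%s-%d", templateName, setName, ordinal)
}

func main() {
	// A StatefulSet "web" with a volumeClaimTemplate "data" and 3 replicas.
	for ordinal := 0; ordinal < 3; ordinal++ {
		fmt.Println(claimName("data", "web", ordinal))
	}
}
```

Because matching is done purely by name, a pre-provisioned PVC that happens to use one of these names is indistinguishable from an auto-provisioned one, which is why both are treated identically.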

Risks and Mitigations

Currently the PVCs created by StatefulSet are not deleted automatically. Using whenScaled or whenDeleted set to Delete would delete the PVCs automatically. Since this involves persistent data being deleted, users should take appropriate care using this feature. Having the Retain behavior as default will ensure that the PVCs remain intact by default and only a conscious choice made by user will involve any persistent data being deleted.

This proposed API causes the PVCs associated with the StatefulSet to have behavior close to, but not the same as, ephemeral volumes, such as emptyDir or generic ephemeral volumes. This may cause user confusion. PVCs under this policy will be more durable than ephemeral volumes would be, as they are only deleted on scale-down or StatefulSet deletion, and not on other pod deletion and recreation events such as eviction or the death of their node.

User documentation will emphasize the race conditions associated with changing policy or rolling back the feature concurrently with StatefulSet deletion or scale-down. See below in Design Details for more information.

Design Details

Objects Associated with the StatefulSet

When a StatefulSet spec has a VolumeClaimTemplate, PVCs are dynamically created using a static naming scheme, and each Pod is created with a claim to the corresponding PVC. These are the precise PVCs meant when referring to the volume or PVC for Pod below, and these are the only PVCs modified with an ownerRef. Other PVCs referenced by the StatefulSet Pod template are not affected by this behavior.

OwnerReferences are used to manage PVC deletion. All such references used for this feature will set the controller field to the StatefulSet or Pod as appropriate. This will be used to distinguish references added by the controller from, for example, user-created owner references. When ownerRefs are removed, it is understood that only those ownerRefs whose controller field matches the StatefulSet or Pod in question are affected.

The controller flag will be set for these references. If there is already a different (non-StatefulSet) controller set for a PVC, an ownerRef will not be added. This will mean that the autodelete functionality will not be operative. An event will be created to reflect this.

To summarize,

If the StatefulSet is a controller owner,

the PVC lifecycle will be fully managed by the StatefulSet controller
old owner references with controller=false will be updated to controller=true (see Upgrade / Downgrade Strategy, below).
the StatefulSet is removed as the owner and controller when the Retain policy is specified in the StatefulSet.

If someone else is the controller,

the PVC lifecycle will not be touched by the StatefulSet controller. The PVC will stay when the delete policy is specified in the StatefulSet.
old StatefulSet owner references will be removed.

Volume delete policy for the StatefulSet created PVCs

A new field named PersistentVolumeClaimRetentionPolicy of the type StatefulSetPersistentVolumeClaimRetentionPolicy will be added to the StatefulSet. This will represent the user indication for which circumstances the associated PVCs can be automatically deleted or not, as described above. The default policy would be to retain PVCs in all cases.

The PersistentVolumeClaimRetentionPolicy object will be mutable. The deletion mechanism will be based on reconciliation, so as long as the field is changed far from StatefulSet deletion or scale-down, the policy will work as expected. Mutability does introduce race conditions if it is changed while a StatefulSet is being deleted or scaled down and may result in PVCs not being deleted as expected when the policy is being changed from Retain, and PVCs being deleted unexpectedly when the policy is being changed to Retain. PVCs will be reconciled before a scale-down or deletion to reduce this race as much as possible, although it will still occur. The former case can be mitigated by manually deleting PVCs. The latter case will result in lost data, but only in PVCs that were originally declared to have been deleted. Life does not always have an undo button.

`whenScaled` policy of `Delete`.

If persistentVolumeClaimRetentionPolicy.whenScaled is set to Delete, the Pod will be set as the owner of the PVCs created from the VolumeClaimTemplates just before the scale-down is performed by the StatefulSet controller. When a Pod is deleted, the PVC owned by the Pod is also deleted.

The current StatefulSet controller implementation ensures that the manually deleted pods are restored before the scale-down logic is run. This combined with the fact that the owner references are set only before the scale-down will ensure that manual deletions do not automatically delete the PVCs in question.

During scale-up, if a PVC has an OwnerRef that does not match the Pod, it indicates that the PVC was referred to by the deleted Pod and is in the process of getting deleted. The controller will skip the reconcile loop until PVC deletion finishes, avoiding a race condition.

`whenDeleted` policy of `Delete`.

When persistentVolumeClaimRetentionPolicy.whenDeleted is set to Delete, an owner reference pointing to the StatefulSet is added to each VolumeClaimTemplate PVC when it is created. When a scale-up or scale-down occurs, the PVC is unchanged. PVCs previously in use before scale-down will be used again when the scale-up occurs.

In the existing StatefulSet reconcile loop, the associated VolumeClaimTemplate PVCs will be checked to see if the ownerRef is correct according to the persistentVolumeClaimRetentionPolicy and updated accordingly. This includes PVCs that have been manually provisioned. It will be most consistent and easy to reason about if all VolumeClaimTemplate PVCs are treated uniformly rather than trying to guess at their provenance.

When the StatefulSet is deleted, these PVCs will also be deleted, but only after the Pod gets deleted. Since the Pod StatefulSet ownership has blockOwnerDeletion set to true, pods will get deleted before the StatefulSet is deleted. The blockOwnerDeletion for PVCs will be set to false which ensures that PVC deletion happens only after the StatefulSet is deleted. This is necessary because of PVC protection which does not allow PVC deletion until all pods referencing it are deleted.

The deletion policies may be combined in order to get the delete behavior both on set deletion as well as scale-down.
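The two policies and their combination can be summarized with a small decision function. This is a sketch under stated assumptions: ordinals start at 0, a Pod counts as being scaled down when its ordinal is at or above the desired replica count, and the Pod reference takes precedence when both policies are Delete. The names are hypothetical.

```go
package main

import "fmt"

// policy holds the two retention settings as plain strings ("Retain" or "Delete").
type policy struct {
	WhenDeleted string
	WhenScaled  string
}

// desiredOwner returns which object, if any, should own the PVC that belongs
// to the Pod with the given ordinal: "Pod", "StatefulSet", or "" for none.
func desiredOwner(p policy, ordinal, replicas int) string {
	// Assumption: ordinals start at 0, so Pods at or above the replica
	// count are the ones a scale-down is about to remove.
	condemned := ordinal >= replicas
	if p.WhenScaled == "Delete" && condemned {
		return "Pod"
	}
	if p.WhenDeleted == "Delete" {
		return "StatefulSet"
	}
	return ""
}

func main() {
	p := policy{WhenDeleted: "Delete", WhenScaled: "Delete"}
	replicas := 2 // scaling down from 3 to 2: ordinal 2 is being removed
	for ordinal := 0; ordinal < 3; ordinal++ {
		fmt.Printf("ordinal %d -> owner %q\n", ordinal, desiredOwner(p, ordinal, replicas))
	}
}
```

Computing the desired owner from the current policy on every reconcile is also what allows the policy to be mutated later: the reference is simply added, swapped, or removed to match.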

Non-Cascading Deletion

When a StatefulSet is deleted without cascading, e.g. with kubectl delete --cascade=false, the existing behavior is retained and no PVC will be deleted. Only the StatefulSet resource will be affected.

Mutating `PersistentVolumeClaimRetentionPolicy`

Recall that as defined above, the PVCs associated with a StatefulSet are found by the StatefulSet volumeClaimTemplate static naming scheme. The Pods associated with the StatefulSet can be found by their controllerRef.

From a deletion policy to Retain

When mutating any delete policy to retain, the PVC ownerRefs to the StatefulSet are removed. If a scale-down is in progress, each remaining PVC ownerRef to its pod is removed, by matching the index of the PVC to the Pod index.

From Retain to a deletion policy

When mutating from the Retain policy to a deletion policy, the StatefulSet PVCs are updated with an ownerRef to the StatefulSet. If a scale-down is in process, remaining PVCs are given an ownerRef to their Pod (by index, as above).

Cluster role change for statefulset controller

In order to update the PVC ownerReference, buildControllerRoles will be updated to grant the patch verb on the PVC resource.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Unit tests

From https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&include-filter-by-regex=statefulset

k8s.io/kubernetes/pkg/controller/statefulset: 2024-10-07: 86.5%
k8s.io/kubernetes/pkg/registry/apps/statefulset: 2022-10-07: 62.7%
k8s.io/kubernetes/pkg/registry/apps/statefulset/storage: 2022-10-07: 64%

Integration tests

test/integration/statefulset: 2024-10-07: No failures

Added TestAutodeleteOwnerRefs to k8s.io/kubernetes/test/integration/statefulset.

E2E tests

[gci-gce-statefulset](https://testgrid.k8s.io/google-gce#gci-gce-statefulset): 2024-10-07: 0 Failures
triage : 2024-10-09:
- Flaky failures in ci-kubernetes-kind-e2e-parallel, ci-kubernetes-kind-e2e-parallel-1-30 and ci-kubernetes-kind-ipv6-e2e-parallel-1-31 also had failures in many other tests, so this appears to be a general infrastructure flake.
- [sig-apps] StatefulSet Non-retain StatefulSetPersistentVolumeClaimPolicy should delete PVCs after adopting pod (WhenScaled) seems to have real flakes which will be investigated.

Added Feature:StatefulSetAutoDeletePVC tests to k8s.io/kubernetes/test/e2e/apps/.

Upgrade/downgrade & feature enabled/disable tests

The following scenarios were manually tested.

1. Create a StatefulSet in a previous version and upgrade to the version supporting this feature. The PVCs should remain intact.
2. Downgrade to an earlier version and check that the PVCs with Retain remain intact and that the others, with policies set before upgrade, get deleted depending on whether the references were already set.

Since rancher.io/local-path now provides a default storage class, StatefulSets can be tested with kind with the following procedure.

Create a kind cluster with the following config.yaml.

apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
featureGates:
  StatefulSetAutoDeletePVC: false
nodes:
- role: control-plane
  image: kindest/node:v1.31.0

This is done with kind create cluster --config config.yaml

The configuration adds the feature gate to all control plane services. In a kind cluster, these are stored in the /etc/kubernetes/manifests directory of the kind docker container serving as the control plane node. The manifests are reconciled to the control plane, so the StatefulSet retention policy feature can be enabled (upgrade) or disabled (downgrade) with a bash script like the following.
```
for c in kube-apiserver kube-controller-manager kube-scheduler; do
  # Flip the feature gate in the static pod manifest of each component.
  docker exec kind-control-plane \
    sed -i -r "s|(StatefulSetAutoDeletePVC)=false|\1=true|" \
    "/etc/kubernetes/manifests/$c.yaml"
  echo "$c updated"
done
```
To downgrade, swap false for true in the above. Note that the kind control plane will be unreachable for a minute or so while the reconciliation occurs.

For the upgrade scenario, a StatefulSet was created in a cluster with the feature gate disabled. The feature gate was enabled, the StatefulSet was scaled down or deleted, and it was confirmed that no PVCs were deleted.

In the downgrade scenario, four StatefulSets were created with all possibilities of WhenScaled and WhenDeleted policies. After the downgrade, it was confirmed that (1) no PVCs are deleted when the StatefulSet is scaled down, and (2) PVCs are deleted when the WhenDeleted policy is Delete, and the StatefulSet is deleted.

Graduation Criteria

Alpha release

(Done) Complete adding the items in the ‘Changes required’ section.
(Done) Add unit, functional, upgrade and downgrade tests to automated k8s test.

Beta release

(Done) Enable feature gate for e2e pipelines

GA release

(Done) Validate with customer workloads. There has been no customer feedback aside from some unrelated issues on GKE which showed that customers were using delete strategies, and analysis of owner references on PVCs that motivated #122400 .

Upgrade / Downgrade Strategy

This feature adds a new field to the StatefulSet. The default value for the new field maintains the existing behavior of StatefulSets.

On a downgrade, the PersistentVolumeClaimRetentionPolicy field will be hidden on any StatefulSets. The behavior in this case will be identical to mutating the policy field to Retain, as described above, including the edge cases introduced if this is done during a scale-down or StatefulSet deletion.

The initial beta version did not set the controller flag in the owner reference. This was fixed in later versions, so that the controller flag is set. The behavior is then as follows.

If the StatefulSet or one of its Pods is a controller, owner references are updated as specified by the retention policy.
If the StatefulSet or one of its Pods is an owner but not a controller,
- if there is no other controller, the StatefulSet or Pod owners will be set as controller (i.e., they are assumed to be from the initial beta version and updated).
- The controller will be updated as specified by the retention policy.
If there is another resource that is a controller,
- any (non-controller) StatefulSet or Pod owner reference is removed.
- The retention policy is ignored.
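The three cases above can be sketched as a single normalization pass over a PVC's owner references. This is an illustrative sketch, not the controller's code: it identifies StatefulSet and Pod references by kind only and assumes at most one such reference is present on the PVC. All names are hypothetical.

```go
package main

import "fmt"

// ownerRef is a hypothetical, trimmed-down owner reference.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// ours reports whether a reference points at the StatefulSet or one of its
// Pods. Assumption: matching by kind only; real code would compare identity.
func ours(r ownerRef) bool {
	return r.Kind == "StatefulSet" || r.Kind == "Pod"
}

// normalize returns the cleaned-up references and whether the retention
// policy should be applied afterwards.
func normalize(refs []ownerRef) ([]ownerRef, bool) {
	foreignController := false
	for _, r := range refs {
		if r.Controller && !ours(r) {
			foreignController = true
		}
	}
	out := make([]ownerRef, 0, len(refs))
	for _, r := range refs {
		switch {
		case !ours(r):
			out = append(out, r) // references added by others are never touched
		case foreignController:
			// Another resource controls the PVC: drop our reference.
		default:
			r.Controller = true // adopt references left by the initial beta version
			out = append(out, r)
		}
	}
	return out, !foreignController
}

func main() {
	legacy := []ownerRef{{Kind: "StatefulSet", Name: "web"}}
	fmt.Println(normalize(legacy))

	contested := []ownerRef{
		{Kind: "StatefulSet", Name: "web"},
		{Kind: "Operator", Name: "db", Controller: true},
	}
	fmt.Println(normalize(contested))
}
```

When the second return value is false, the retention policy is ignored for that PVC, matching the last case in the list above.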

Version Skew Strategy

There are only apiserver and kube-controller-manager changes involved. Node components are not involved so there is no version skew between nodes and the control plane. Since the api changes are backwards compatible, as long as the apiserver version which originally added the new StatefulSet fields is rolled out before the kube-controller-manager, behavior will be correct. Since the alpha API has been out since 1.23 and there have been no incompatible changes to the API, the order of any modern apiserver & kube-controller-manager rollout should not matter anyway.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: StatefulSetAutoDeletePVC
- Components depending on the feature gate
  - kube-controller-manager, which orchestrates the volume deletion.
  - kube-apiserver, to manage the new policy field in the StatefulSet resource (eg dropDisabledFields).

Does enabling the feature change any default behavior?

No. What happens during StatefulSet deletion differs from current behavior only when the user explicitly specifies the PersistentVolumeClaimRetentionPolicy. Hence there is no change in user-visible behavior by default.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Disabling the feature gate will cause the new field to be ignored. If the feature gate is re-enabled, the new behavior will start working.

When the PersistentVolumeClaimRetentionPolicy has WhenDeleted set to Delete, then VolumeClaimTemplate PVCs ownerRefs must be removed.

There are new corner cases here. For example, if a StatefulSet deletion is in process when the feature is disabled or enabled, the appropriate ownerRefs will not have been added and PVCs may not be deleted. The exact behavior will be discovered during feature testing. In any case the mitigation will be to manually delete any PVCs.

What happens if we reenable the feature if it was previously rolled back?

In the simple case of reenabling the feature without concurrent StatefulSet deletion or scale-down, nothing needs to be done when the deletion policy has whenScaled set to Delete. When the policy has whenDeleted set to Delete, the VolumeClaimTemplate PVC ownerRefs must be set to the StatefulSet.

As above, if there is a concurrent scale-down or StatefulSet deletion, more care needs to be taken. This will be detailed further during feature testing.

Are there any tests for feature enablement/disablement?

Feature enablement and disablement tests will be added, including for StatefulSet behavior during transitions in conjunction with scale-down or deletion.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?

If there is a control plane update which disables the feature while a stateful set is in the process of being deleted or scaled down, it is undefined which PVCs will be deleted. Before the update, PVCs will be marked for deletion; until the updated controller has a chance to reconcile some PVCs may be garbage collected before the controller has a chance to remove any owner references. We do not think this is a true failure, as it should be clear to an operator that there is an essential race condition when a cluster update happens during a stateful set scale down or delete.

What specific metrics should inform a rollback?

The operator can monitor kube_persistent_volume_* metrics from kube-state-metrics to watch for large numbers of undeleted PersistentVolumes. If consistent behavior is required, the operator can wait for those metrics to stabilize.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes. The race condition wasn’t exposed, but we confirmed the PVCs were updated correctly.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

Metrics are provided by kube-state-metrics unless otherwise noted.

How can an operator determine if the feature is in use by workloads?

kube_statefulset_persistent_volume_claim_retention_policy will have nonzero counts for the delete policy fields.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metric name: kube_statefulset_status_replicas_current should be near kube_statefulset_status_replicas_ready.
- [Optional] Aggregation method: gauge
- Components exposing the metric: kube-state-metrics

What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current should be near 1.0. However, because unhealthy replicas are often an application error rather than a problem with the stateful set controller, this will need to be tuned by an operator on a per-cluster basis.

Are there any missing metrics that would be useful to have to improve observability of this feature?

kube-state-metrics have filled a gap in the traditional lack of metrics from core Kubernetes controllers.

Dependencies

Does this feature depend on any specific services running in the cluster?

No, outside of depending on the scheduler, the garbage collector and volume management (provisioning, attaching, etc) as does almost anything in Kubernetes. This feature does not add any new dependencies that did not already exist with the stateful set controller.

Scalability

Will enabling / using this feature result in any new API calls?

Yes and no. This feature will result in additional resource deletion calls, which will scale like the number of pods in the stateful set (ie, one PVC per pod and possibly one PV per PVC depending on the reclaim policy). There will not be additional watches, because the existing pod watches will be used. There will be additional patches to set PVC ownerRefs, scaling like the number of pods in the StatefulSet.

However, anyone who uses this feature would have made those resource deletions anyway: those PVs cost money. Aside from the additional patches for ownerRefs, there shouldn’t be much overall increase beyond the second-order effect of this feature allowing more automation.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

PVC deletion may cause PV deletion, depending on reclaim policy, which will result in cloud provider calls through the volume API. However, as noted above, these calls would have been happening anyway, manually.

Will enabling / using this feature result in increasing size or count of the existing API objects?

PVC, new ownerRef; ~64 bytes
StatefulSet, new field; ~8 bytes (holds a string enumeration, either “Delete” or “Retain”)

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No. (There are currently no StatefulSet SLOs?)

Note that scale-up may be slower when volumes were deleted by scale-down. This is by design of the feature.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

PVC deletion will be paused. If the control plane went unavailable in the middle of a stateful set being deleted or scaled down, there may be deleted Pods whose PVCs have not yet been deleted. Deletion will continue normally after the control plane returns.

What are other known failure modes?

PVCs from a stateful set not being deleted as expected.
- Detection: This can be detected by higher than expected counts of kube_persistentvolumeclaim_status_phase{phase=Bound}, lower than expected counts of kube_persistentvolume_status_phase{phase=Released}, and by an operator listing and examining PVCs.
- Mitigations: We expect this to happen only if there are other, operator-installed, controllers that are also managing owner refs on PVCs. Any such PVCs can be deleted manually. The conflicting controllers will have to be manually discovered.
- Diagnostics: Logs from kube-controller-manager and stateful set controller.
- Testing: Tests are in place for confirming owner refs are added by the StatefulSet controller, but Kubernetes does not test against external custom controllers.

What steps should be taken if SLOs are not being met to determine the problem?

Stateful set SLOs are new with this feature and are in process of being evaluated. If they are not being met, the kube-controller-manager (where the stateful set controller lives) should be examined and/or restarted.

Implementation History

1.21, KEP created.
1.23, alpha implementation.
1.27, graduation to beta.
1.31, fix controller references.
1.32, graduation to GA.

Drawbacks

The StatefulSet field update is required.

Alternatives

Users can delete the PVC manually. The friction associated with that is the motivation of the KEP.

Resources: Auto-refreshing official CVE feed

KEP-3203: Auto-Refreshing Official CVE Feed

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
- User Stories (Optional)
  - Story 1
  - Story 2
  - Story 3
  - Story 4
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Storage of CVE feed blob
  - 1. Only use Google Cloud Bucket
  - 2. Only use Git Repository
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently it is not possible to filter for issues or PRs that are related to CVEs announced by Kubernetes. This KEP addresses this concern by using automation to label these issues or PRs with the new label official-cve-feed. The in-scope issues are closed issues that have a CVE ID and that SRC has officially announced as a Kubernetes CVE in the past.

Motivation

With the growing number of eyes on Kubernetes, the number of CVEs related to Kubernetes has increased. Although most CVEs that directly, indirectly, or transitively impact Kubernetes are regularly fixed, there is no single place for the end users of Kubernetes to programmatically subscribe to or pull the data of fixed CVEs. Current options are either broken or incomplete.

An auto-refreshing CVE feed will allow end users to programmatically fetch the list of CVEs and get the latest information from the Kubernetes community.

Goals

Create a periodically auto-refreshing, machine-readable list of official Kubernetes CVEs

Non-Goals

Triage and vulnerability disclosure: This will continue to be done by SRC
Listing CVEs that are identified in build time dependencies and container images. Only official CVEs announced by the Kubernetes SRC will be published in the feed.
Integration with CVEProject may happen at a future stage but currently is not planned or scoped.

User Stories (Optional)

Story 1

As a K8s end user, I want a list of CVEs with relevant information that I can fetch programmatically, so I can track when new CVEs are announced.

Story 2

As a K8s End User, I want to use my browser to get a list of fixed CVEs, from the official K8s website so that I can trust it as an authoritative source of data through implicit trust offered by website certificate and domain name.

Story 3

As a K8s maintainer, I want to create a process that auto-updates the CVE feed when SRC announces new CVEs, so that I do not have to do extra work to maintain this feed manually.

Story 4

As a K8s platform provider, I want to automatically know if my Kubernetes clusters are vulnerable to any of the CVEs SRC have announced. I want to have a programmatically available API to parse this kind of data so I can easily provide it to users of my platform.

Proposal

Pre-requisites

Add official-cve-label https://github.com/kubernetes/test-infra/pull/23428
Search and Identify closed issues that have a CVE ID e.g. CVE-1001-12345 in the issue description or summary (This search filter is giving the most accurate data so far)
Label those issues with official-cve-feed using https://docs.github.com/en/rest/reference/issues REST API
Add official-cve-feed label as part of SRC playbook: https://github.com/kubernetes/committee-security-response/pull/133
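
The search step in the pre-requisites can be illustrated with a short filter. This is a sketch only: the select_cve_issues helper is hypothetical, and it assumes each issue is a dictionary with the state, title, and body fields that the GitHub REST API returns for issues. The pattern follows the public CVE ID syntax (CVE, a four-digit year, then four or more digits).

```python
import re

# CVE IDs look like CVE-YYYY-NNNN, with four or more digits in the last part.
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def select_cve_issues(issues):
    """Return closed issues whose title or body mentions a CVE ID."""
    selected = []
    for issue in issues:
        if issue.get("state") != "closed":
            continue
        # The body may be missing or null for issues without a description.
        text = (issue.get("title") or "") + "\n" + (issue.get("body") or "")
        if CVE_ID.search(text):
            selected.append(issue)
    return selected
```

Issues selected this way are the candidates for the official-cve-feed label; applying the label itself remains restricted as described under Risks and Mitigations.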

Overview

Generate a JSON blob using the results from the filtered label on k/k repo.
Create a Prow job to periodically generate this JSON blob.
Push this JSON blob when needed (e.g. when a new CVE is announced) to GCS (Google Cloud Bucket)
Using Hugo and other tooling (such as Netlify), publish the list from this JSON blob on the official k8s website during k/website build
Generate an RSS feed (Atom format) with Hugo templates using the generated JSON blob

Risks and Mitigations

JSON blob construction will fail

If the generation of the JSON blob listing known CVEs were to fail, downstream jobs would also fail. If blob construction fails, the failure will alert the owners of this feature and we will take action as needed. If the failure cannot be fixed in a reasonable amount of time, the CVE feed will be stale until it is fixed. In case of an urgent need from the community to update the vulnerabilities feed, the JSON blob will be updated manually via the gsutil command.

Misuse of Auto-Refresh feature

Without proper filtering and control over who can label GitHub issues, the list of CVEs can become a list with poor signal-to-noise ratio making the list unusable.

For this purpose, the filtering is applied such that only issues that are marked as closed will be part of the list. Additionally, the official-cve-feed label is a restricted label that can only be applied by SRC and SIG Security Tooling Leads.

Large JSON blob could lead to slower read/write and resource consumption

Blobs will only be rewritten if the generated blob differs from the existing blob. A hash file is created and stored alongside the generated blob. Before every push, this stored hash is compared with the hash of the newly generated file. If the hashes match, writing to the bucket is skipped; if they differ, the write to the bucket is triggered.
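
A minimal sketch of this hash check follows. The KEP does not name a hash algorithm or a function for this step, so SHA-256 and the needs_upload helper are assumptions made for illustration.

```python
import hashlib
from typing import Optional, Tuple


def needs_upload(new_blob: bytes, stored_hash: Optional[str]) -> Tuple[bool, str]:
    """Compare the hash of a freshly generated blob with the stored hash.

    Returns (upload_required, new_hash). stored_hash is None when no hash
    file exists yet, in which case an upload is always required.
    """
    new_hash = hashlib.sha256(new_blob).hexdigest()  # algorithm is an assumption
    return new_hash != stored_hash, new_hash
```

The caller would write both the blob and the new hash file only when the first element of the result is true.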

Design Details

The steps to implement this design will involve a prow job that:

Queries the GitHub API for fixed official CVEs
Generates a JSON blob based on the query results
Writes the JSON blob to gcs-bucket if it is different than existing blob
Triggers the k/website build using netlify build-hook. Secret token to trigger build is added as External Secret. See example for snyk-token
k/website build pulls the JSON blob from gcs-bucket during website rebuild and publishes it at something like https://kubernetes.io/security/official-cve-feed.json
k/website renders the JSON blob as an HTML table for viewing the list of fixed CVEs from a browser at this location: https://kubernetes.io/docs/reference/issues-security/official-cve-feed and linked from this page: https://kubernetes.io/docs/reference/issues-security/security/

Notes:

A GCS bucket needs to be created. Example PR for this looks like this
Additional custom fields need to be added to make JSON feed compliant with https://validator.jsonfeed.org/
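
As an illustration of the blob generation step, the sketch below builds a minimal document in the shape required by the JSON Feed specification (a version URL, a title, and items that each carry an id and content). The input keys cve_id, html_url, and title, the feed title, and both function names are hypothetical; the real feed's fields are not specified in this section.

```python
import json


def build_feed(issues):
    """Build a minimal JSON Feed style document from labeled issues."""
    items = [
        {
            "id": issue["cve_id"],          # hypothetical input key
            "url": issue["html_url"],
            "content_text": issue["title"],
        }
        # Sort so the same set of issues always yields the same order.
        for issue in sorted(issues, key=lambda i: i["cve_id"])
    ]
    return {
        "version": "https://jsonfeed.org/version/1.1",
        "title": "Official Kubernetes CVE Feed",  # placeholder title
        "items": items,
    }


def serialize(feed) -> bytes:
    """Serialize deterministically so unchanged input gives identical bytes."""
    return json.dumps(feed, sort_keys=True, indent=2).encode("utf-8")
```

Deterministic ordering and serialization matter here because the blob is only rewritten when its content differs from the existing one.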

Test Plan

This is a process KEP implemented using a periodic Prow job; it does not implement any functional use case of Kubernetes. Therefore no e2e, unit, or integration tests are applicable. Going forward, the test plan will mostly cover monitoring the Prow job for failures as and when needed.

Graduation Criteria

Alpha

Feature implemented with working JSON feed and tabular list
Initial e2e testing completed and alerting setup for detecting failures

Beta

Gather feedback from developers and end users
Make JSON feed compliant with jsonfeed spec
Add RSS feed for the CVE list
Add fields that signal freshness of the data

Upgrade / Downgrade Strategy

Not applicable

Version Skew Strategy

Not applicable

Production Readiness Review Questionnaire

Not applicable as per this comment

Implementation History

Drawbacks

Alternatives

Storage of CVE feed blob

There are two options to store the CVE feed JSON blob:

1. Only use Google Cloud Bucket

A new Google Cloud bucket can be created where the CVE feed is written using the gsutil tool and read via a curl call.

Advantages:
- Transparent updates to JSON blob where the prow job run will be identical every time.
- Access control for writing to bucket is least privilege i.e. managed via a service account and Google group membership
Disadvantages:
- CVE list has an unofficial looking URL which would be hard for an end user to decipher for its authenticity and provenance.

2. Only use Git Repository

Store it as a version controlled artifact in one of the kubernetes GitHub Org repositories.

Advantages:
- When a Git repository, especially k/website, hosts the JSON blob, the domain name in the URL would be k8s.io/static/security/official-cve-feed.json, which is much more recognizable, intuitive in terms of trust, TLS enabled, and unlikely to be spoofed.
Disadvantages:
- This might get delayed by PR review and approval process. However, this can be prevented through use of skip-review label.
- A fork would need to be maintained under a GitHub robot account for k/website or k/sig-security, which will add overhead for the GitHub Admins who manage the robot accounts and their forks.

In summary, because both the approaches have pros and cons, the finalized approach combined the good parts from both the alternatives by storing the blob in Google Cloud bucket but rendering it via kubernetes/website GitHub Repository.

Infrastructure Needed (Optional)

A GCS bucket to store JSON blob is needed, with its corresponding service account.

Resources: Azure Availability Zones

Resources: Backoff Limits Per Index For Indexed Jobs

KEP-3850: Backoff Limits Per Index For Indexed Jobs

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP extends the Job API to support indexed jobs where the backoff limit is per index, and the Job can continue execution despite some of its indexes failing.

Motivation

Currently, the indexes of an indexed job share a single backoff limit. When the job reaches this shared backoff limit, the job controller marks the entire job as failed, and the resources are cleaned up, including indexes that have yet to run to completion.

As a result, the current implementation does not cover the situation where the workload is truly embarrassingly parallel and each index is independent of other indexes.

For instance, if indexed jobs were used as the basis for a suite of long-running integration tests, then each test run would only be able to find a single test failure.

Other popular batch services like AWS Batch use a separate backoff limit for each index, showing that this is a common use case that should be supported by Kubernetes.

Goals

allow to count failures towards the backoffLimit independently for all indexes,
allow to continue Job execution despite some of its indexes failing,
allow to fail an index (stop recreating pods for the index) using pod failure policy.

Non-Goals

allow to control the number of retries per index when pod’s restartPolicy=OnFailure (see Support backoffLimitPerIndex when restartPolicy=OnFailure ).

Proposal

We propose a new policy for running Indexed Jobs in which the backoff limit controls the number of retries per index. When the new policy is used all indexes execute until their success or failure. We also propose a new API field to control the number of failed indexes.

Additionally, we propose a new action in PodFailurePolicy, called FailIndex, to short-circuit failing of the index before the backoff limit per index is reached.

User Stories (Optional)

Story 1

As a CI/CD platform administrator, I want to use Indexed Jobs to run suites of integration tests, one suite per index. A failure of one suite should not interrupt running of other suites. Additionally, I would like to be able to control the maximal number of retries per index.

The following Job configuration could satisfy my use case:

apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  # Each index may fail once before it is marked as failed.
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./tests-runner"]

In this case, we run 10 indexes representing the test suites. We allow for one failure per index.

Story 2

As the CI/CD platform administrator from Story 1, I want to be able to control the failures with the pod failure policy. In particular, I want to be able to use pod failure policy to avoid restarts of some indexes, based on exit codes.

The following Job configuration could satisfy my use case:

apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./tests-runner"]
  # Fail the index immediately, without retries, on exit code 42.
  podFailurePolicy:
    rules:
    - action: FailIndex
      onExitCodes:
        operator: In
        values: [42]

Story 3

As the CI/CD platform administrator from Story 1, I want to be able to fail the entire Job if the number of failed indexes exceeds 50%. I want to do this in order to cut down the costs of running the tests in case of compilation issues that would result in all tests failing.

The following Job configuration could satisfy my use case:

apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  # Fail the whole Job once more than 5 of the 10 indexes have failed.
  maxFailedIndexes: 5
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./tests-runner"]

Notes/Constraints/Caveats (Optional)

Performance benchmark

We assess the performance of the Beta implementation in comparison to Indexed Jobs with a regular backoffLimit using the two integration tests (BenchmarkLargeIndexedJob and BenchmarkLargeFailureHandling) in PR #121393.

In the BenchmarkLargeIndexedJob test, the measured part creates N pods and marks them as Succeeded, then waits for the Job status to be updated accordingly. This is a sanity test for backoffLimitPerIndex, to demonstrate that the new branches of code don’t have a significant performance impact.

Here are the results (lines re-ordered from smallest to the largest N):

go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeIndexedJob" k8s.io/kubernetes/test/integration/job | grep "^Benchmark"
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10-48 1 3034342185 ns/op 14391160 B/op 164352 allocs/op
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=100-48 1 3050613253 ns/op 111100464 B/op 1324757 allocs/op
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=1000-48 1 19382609963 ns/op 1133953568 B/op 13079710 allocs/op
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10_000-48 1 222696805443 ns/op 11610639800 B/op 131946944 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10-48 1 3025650312 ns/op 14757368 B/op 166282 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=100-48 1 3045479158 ns/op 114324072 B/op 1345524 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=1000-48 1 19384632203 ns/op 1161105080 B/op 13216319 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10_000-48 1 223635439324 ns/op 11911685592 B/op 133325939 allocs/op

In the BenchmarkLargeFailureHandling test, the measured part of the test marks N running pods as Failed and waits for the job status to be updated accordingly. In order to make the test comparable for regular indexed jobs and jobs with backoffLimitPerIndex, we set the max backoff delay due to pod failures to 10ms. Here are the results (lines re-ordered from smallest to the largest N):

go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeFailureHandling" k8s.io/kubernetes/test/integration/job | grep "^Benchmark"
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10-48 1 2021272442 ns/op 13813736 B/op 165760 allocs/op
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=100-48 1 3036166978 ns/op 109866704 B/op 1310651 allocs/op
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=1000-48 1 21049273834 ns/op 1074301144 B/op 12832549 allocs/op
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10_000-48 1 202327947010 ns/op 10926201704 B/op 131423197 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10-48 1 3016501067 ns/op 14676224 B/op 175301 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=100-48 1 3038839798 ns/op 112090728 B/op 1323948 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=1000-48 1 21057643253 ns/op 1096364096 B/op 13008669 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10_000-48 1 202373728278 ns/op 11185209520 B/op 132578325 allocs/op

The above results show that the jobs using .spec.backoffLimitPerIndex are about 1% slower than regular indexed jobs. In practice the difference is expected to be covered by the exponential backoff delay due to pod failures.

Risks and Mitigations

The Job object too big

With the new field .status.failedIndexes the Job object can be significantly larger as every failed index is recorded in the field.

Note that a similar risk is also present for Indexed Jobs, regarding the already existing .status.completedIndexes field (see Indexed Jobs can break with high number of parallelism or completions).

In order to mitigate this risk we first constrain the .spec.maxFailedIndexes to 10^5, which is the same limit as for .spec.parallelism currently.

Second, we validate if the fields are inside of the scalability limits:

1. .spec.completions<=10^5, .spec.parallelism<=10^5, .spec.maxFailedIndexes<=10^5
2. .spec.completions unlimited (<= max int32 ~2*10^9), .spec.parallelism<=10^4, .spec.maxFailedIndexes<=10^4

In (1.), in the worst case scenario, every index is either present in completedIndexes or failedIndexes, but not in both. Thus the total sum of both fields is limited by (5+1)*10^5=0.572Mi, where:

5 is the maximal number of digits in the indexes,
1 is for separation character,
10^5 is the total number of listed indexes.

In (2.) the worst case scenario for the completedIndexes field is when every third index is not in the field, because it corresponds to either a failed or a hanging index, so it is a “gap”. Then, between every two gaps we have two indexes listed. Thus, the size of the completedIndexes field is limited by: (10+1)*2*(10^4+10^4)=0.42Mi, where:

10 is the maximal number of digits in the indexes
1 is for the separation character
2*(10^4+10^4) is the number of indexes explicitly listed in the field - two indexes per gap.

The size of the failedIndexes field is limited by: (10+1)*10^4=0.105Mi, where:

10 is the maximal number of digits in the indexes,
1 is for the separation character
10^4 is the maximal number of indexes present in the field.

Thus, the size of both fields is capped at 0.572Mi for the limits in (1.) and 0.525Mi for the limits in (2.).

For comparison, before the introduction of .status.failedIndexes, the max size of .status.completedIndexes was limited by (5+1)*10^5*2/3=0.381Mi in the (1.) case, and (10+1)*2*10^4=0.21Mi in the (2.) case. This means an increase of about 0.19Mi in the (1.) case and about 0.31Mi in the (2.) case.
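
The bounds above can be re-derived with a few lines of arithmetic. The sketch assumes one byte per character and 1Mi = 2^20 bytes; the variable names are only for illustration.

```python
MI = 2 ** 20  # bytes per Mi, assuming one byte per character


def mib(num_bytes):
    """Convert a byte count to Mi."""
    return num_bytes / MI


# Case (1.): up to 10^5 indexes of at most 5 digits plus one separator each.
case1_total = (5 + 1) * 10 ** 5
# Case (2.): two listed indexes per gap, 10^4 + 10^4 gaps, 10 digits + separator.
case2_completed = (10 + 1) * 2 * (10 ** 4 + 10 ** 4)
case2_failed = (10 + 1) * 10 ** 4
case2_total = case2_completed + case2_failed

# Bounds for completedIndexes alone, before failedIndexes was introduced.
before1 = (5 + 1) * 10 ** 5 * 2 / 3
before2 = (10 + 1) * 2 * 10 ** 4

print(f"(1.) total:     {mib(case1_total):.3f}Mi")
print(f"(2.) completed: {mib(case2_completed):.3f}Mi")
print(f"(2.) failed:    {mib(case2_failed):.3f}Mi")
print(f"(2.) total:     {mib(case2_total):.3f}Mi")
print(f"increase (1.):  {mib(case1_total - before1):.2f}Mi")
print(f"increase (2.):  {mib(case2_total - before2):.2f}Mi")
```

Both totals stay well below 1Mi, which is the point of constraining the fields to these limits.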

The values of the limits are aligned with the values for the soft limits proposed as a fix for regular indexed jobs (see here). However, when backoffLimitPerIndex is used, we propose that these limits be hard.

We believe that the scalability limits should be enough for most of Job use-cases. For workloads requiring larger jobs users should be able to create multiple Jobs, orchestrated by the JobSet .

Exponential backoff delay issue

Currently, a pod is recreated by the Job controller with exponential backoff delay (10s, 20s, 40s …), counted from the last failure time.
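
The delay sequence mentioned above (10s, 20s, 40s …) is a plain doubling. A minimal sketch, assuming a 10s base as in the sequence; the backoff_delay helper and its optional cap parameter are hypothetical, and this section does not state the controller's actual upper bound.

```python
def backoff_delay(failures, base=10.0, cap=None):
    """Seconds to wait before recreating a pod after `failures` failures.

    The delay doubles with each failure: base, 2*base, 4*base, ...
    `cap` is an optional upper bound (an assumption, unset by default).
    """
    if failures <= 0:
        return 0.0
    delay = base * 2 ** (failures - 1)
    return delay if cap is None else min(delay, cap)
```

The delay is counted from the last failure time, which is why the choice of that timestamp matters in the discussion that follows.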

One complication is that the last failure time for failed pods may increase with time, as it falls back to now in some cases (see in code). Thus, there is a risk that, due to the presence of pods hitting the fallback, the last failure time is continuously bumped, thus shifting the time to recreate the pod.

This risk is present both when computing the exponential backoff delay globally (as for regular indexed Jobs) and when computing it per index as proposed in this KEP (see Exponential backoff delay per index).

In order to mitigate this risk currently the time of last failure is recorded in-memory (globally for all pods within a Job). And a new failed pod may bump it only until it is added to the uncountedTerminatedPods structure.

However, tracking the last failure time per index might be costly for memory consumption (see Exponential backoff delay with in-memory tracking ).

Thus, in order to mitigate this risk we propose to compute the finish time for a pod as the first available value of the following (avoiding the ever-increasing fallback to now):

1. max finishedAt of all containers, if specified for all containers
2. LastTransitionTime for the Ready=False condition
3. deletionTimestamp - deletionGracePeriodSeconds if deletionTimestamp is set

Here (3.) is used to mark the moment of deletion, which approximates the current behavior. (2.) is used when the Kubelet loses track of one of its containers; the Ready=False condition is set by the Kubelet when transitioning a pod to the Failed phase: https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/kubelet/status/status_manager.go#L1060-L1068 . When none of the above values is available to compute the finish time, we fall back to the pod’s creation time.
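
The selection order described above can be sketched as follows. This is an illustration rather than the Job controller's code: the pod_finish_time function and its flat parameters are hypothetical stand-ins for the corresponding pod status fields.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def pod_finish_time(
    container_finished_at: Sequence[Optional[datetime]],
    ready_false_transition: Optional[datetime],
    deletion_timestamp: Optional[datetime],
    deletion_grace_period_seconds: Optional[int],
    creation_time: datetime,
) -> datetime:
    """Pick the first available finish time, never falling back to 'now'."""
    # (1.) latest container finish time, only if every container has one
    if container_finished_at and all(t is not None for t in container_finished_at):
        return max(container_finished_at)
    # (2.) transition time of the Ready=False condition
    if ready_false_transition is not None:
        return ready_false_transition
    # (3.) moment the deletion was requested
    if deletion_timestamp is not None:
        grace = timedelta(seconds=deletion_grace_period_seconds or 0)
        return deletion_timestamp - grace
    # Last resort: the pod's creation time
    return creation_time
```

Because every branch returns a timestamp that is fixed once set, repeated syncs of the same failed pod cannot push the last failure time forward.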

This fix can be considered a preparatory PR before the KEP, as to some extent it solves the preexisting issue.

Too fast Job status updates

In this KEP the Job controller needs to keep updating the new status field .status.failedIndexes to reflect the current status of the Job. This can raise concerns of overwhelming the API server with status updates.

First, observe that the new field does not entail additional Job status updates. When a pod terminates (either by failure or success), it triggers a Job status update to increment the .status.failed or .status.succeeded counter fields. These updates are also used to update the pre-existing .status.completedIndexes field and the new .status.failedIndexes field.

Second, in order to mitigate this risk there is already a mechanism present in the Job controller, to bulk Job status updates per Job.

The way the mechanism works is that Job controller maintains a queue of syncJob invocations per job (see in code ). New items are added to the queue with a delay (1s for pod events, such as: delete, add, update). The delay allows for deduplication of the sync per Job.

One place where a new item is added to the queue, specific to this KEP, is when the exponential backoff delay hasn’t elapsed for any index (which would allow pod recreation); in that case we requeue the next Job status update. The delay is computed as the minimum of all delays computed for the indexes requiring pod recreation, but not less than 1s.

Design Details

We introduce a new Job API field, called .spec.backoffLimitPerIndex. When set it limits the number of retries, counted independently for all indexes.

Additionally, we propose the .spec.maxFailedIndexes to control the maximal number of failed indexes. Once the number is exceeded the entire Job is marked Failed and its execution is terminated.

We also propose to extend the PodFailurePolicy with a new action, called FailIndex to allow an index to fail fast before reaching the backoff limit per index.

Job API


// PodFailurePolicyAction specifies how a Pod failure is handled.
// +enum
type PodFailurePolicyAction string

const (
 // This is an action which might be taken on a pod failure - mark the
 // Job's index as failed to avoid restarts within this index. This action
 // can only be used when backoffLimitPerIndex is set.
 PodFailurePolicyActionFailIndex PodFailurePolicyAction = "FailIndex"
 ...
)
...

// JobSpec describes how the job execution will look like.
type JobSpec struct {
 ...
 // Specifies the limit for the number of retries within an
 // index before marking this index as failed. When enabled the number of
 // failures per index is kept in the pod's
 // batch.kubernetes.io/job-index-failure-count annotation. It can only
 // be set when Job's completionMode=Indexed, and the Pod's restart
 // policy is Never. The field is immutable.
 // +optional
 BackoffLimitPerIndex *int32

 // Specifies the maximal number of failed indexes before marking the Job as
 // failed, when backoffLimitPerIndex is set. Once the number of failed
 // indexes exceeds this number the entire Job is marked as Failed and its
 // execution is terminated. When left as null the job continues execution of
 // all of its indexes and is marked with the `Complete` Job condition.
 // It can only be specified when backoffLimitPerIndex is set.
 // It can be null or up to completions. It is required and must be
 // less than or equal to 10^4 when completions is greater than 10^5.
 // +optional
 MaxFailedIndexes *int32
 ...
}

type JobStatus struct {
 ...

 // FailedIndexes holds the failed indexes when backoffLimitPerIndex is set.
 // The indexes are represented in the text format analogous as for the
 // `completedIndexes` field, ie. they are kept as decimal integers
 // separated by commas. The numbers are listed in increasing order. Three or
 // more consecutive numbers are compressed and represented by the first and
 // last element of the series, separated by a hyphen.
 // For example, if the failed indexes are 1, 3, 4, 5 and 7, they are
 // represented as "1,3-5,7".
 // +optional
 FailedIndexes *string
}

Note that, the PodFailurePolicyAction type is already defined in master with three possible enum values: Ignore, FailJob and Count (see here ).

We allow specifying both .spec.backoffLimit and .spec.backoffLimitPerIndex. This allows for a controlled downgrade. Also, when .spec.backoffLimitPerIndex is specified and .spec.backoffLimit is not, we default .spec.backoffLimit to the max int32 value. This way we ensure old clients of the API wouldn’t break when reading or trying to modify a .spec.backoffLimit that has a nil value.
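The defaulting rule can be sketched as below. The jobSpec type and the setBackoffLimitDefault function are simplified, hypothetical stand-ins for the API defaulting code; the fallback value of 6 is the regular default mentioned in the Alternatives section of this KEP.

```go
package main

import (
	"fmt"
	"math"
)

// jobSpec is a hypothetical, reduced version of the Job spec that carries
// only the two fields relevant to defaulting.
type jobSpec struct {
	BackoffLimit         *int32
	BackoffLimitPerIndex *int32
}

// setBackoffLimitDefault fills in BackoffLimit when the user left it unset.
// An explicitly set value is never overwritten, which is what makes a
// controlled downgrade possible.
func setBackoffLimitDefault(spec *jobSpec) {
	if spec.BackoffLimit != nil {
		return
	}
	limit := int32(6)
	if spec.BackoffLimitPerIndex != nil {
		limit = math.MaxInt32
	}
	spec.BackoffLimit = &limit
}

func main() {
	perIndex := int32(1)
	spec := jobSpec{BackoffLimitPerIndex: &perIndex}
	setBackoffLimitDefault(&spec)
	fmt.Println(*spec.BackoffLimit) // 2147483647
}
```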

Tracking the number of failures per index

In order to determine if the backoff limit per index is exceeded we keep track of the number of failures per index. For this purpose we use the Pod annotation, batch.kubernetes.io/job-index-failure-count, which holds the value of the number of pod failures for a given index. It is set to 0 for the first pod created for a given index.

When the Job controller sees a failed pod corresponding to a given index, and the value of the annotation batch.kubernetes.io/job-index-failure-count is greater than or equal to the configured backoff limit per index, then the index is marked as failed and added to .status.failedIndexes.

When Job controller creates replacement pods for failed pods for a given index it checks if the index isn’t finished yet (it is not in .status.failedIndexes nor .status.completedIndexes). Then, if x is the highest batch.kubernetes.io/job-index-failure-count for the index, the newly created pod will have the annotation set to x+1. An exception is when the newly failed pod matches the Ignore action in pod failure policy. In this case the replacement pod does not increment the value in the annotation.
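The bookkeeping described above can be sketched as follows. The helper names (failureCount, indexFailed, replacementCount) are hypothetical and the pod is reduced to its annotations map; only the annotation key is taken from this KEP.

```go
package main

import (
	"fmt"
	"strconv"
)

const failureCountAnnotation = "batch.kubernetes.io/job-index-failure-count"

// failureCount reads the per-index failure count from a pod's annotations.
func failureCount(annotations map[string]string) (int, error) {
	return strconv.Atoi(annotations[failureCountAnnotation])
}

// indexFailed reports whether a failed pod carrying the given annotation
// value exhausts the backoff limit per index.
func indexFailed(count, backoffLimitPerIndex int) bool {
	return count >= backoffLimitPerIndex
}

// replacementCount returns the annotation value for the replacement pod.
// A failure matching the Ignore action does not increment the counter.
func replacementCount(highest int, ignored bool) int {
	if ignored {
		return highest
	}
	return highest + 1
}

func main() {
	limit := 1
	pod := map[string]string{failureCountAnnotation: "0"}
	for {
		// Each iteration simulates one pod failure for the same index.
		count, err := failureCount(pod)
		if err != nil {
			panic(err)
		}
		if indexFailed(count, limit) {
			fmt.Println("index failed after annotation value", count)
			return
		}
		next := replacementCount(count, false)
		pod = map[string]string{failureCountAnnotation: strconv.Itoa(next)}
	}
}
```

With a limit of 1, the first pod (annotation 0) is replaced once, and the failure of the replacement (annotation 1) marks the index as failed.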

In order to keep track of the number of failures per index, the Job controller removes finalizers of a failed pod for a given index, only once the replacement pod (with incremented value of batch.kubernetes.io/job-index-failure-count) is created, or the index is marked as failed in .status.failedIndexes. This means that these are the main steps when handling a failed pod to prepare it for deletion:

1. Pod is recognized as failed
2. pod UID is recorded in Job status (.status.uncountedTerminatedPods)
3. the replacement Pod is created
4. Pod’s finalizer is removed

Here, the new feature adds a dependency between steps (3.) and (4.), as previously these steps could be performed in any order. Note that typically, when a pod is deleted or fails, the replacement pod is created with a backoff delay, starting from 10s. This means that after the proposed change the pod finalizer removal will be paused for at least 10s, until the backoff elapses and the replacement pod is created. While this may result in pods hanging around before garbage collection, it does not directly affect the rate of pod recreation.

Note that, the first step (1.) will also be impacted by KEP-3939: Consider Terminating pods as active pods in Jobs.

Failed indexes format

The format of the .status.failedIndexes field is analogous to the one used for successful indexes (represented by the completedIndexes field), which is a text format grouping consecutive integers into ranges. In a special case, when the indexes are non-consecutive, they are represented by comma-separated numbers. In the worst-case scenario this is a string of comma-separated even values. In order to constrain the size of the field we cap the number of completions (see The Job object too big for more details).
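A minimal sketch of this text format follows. The function name formatIndexes is hypothetical, and the input is assumed to be sorted in increasing order without duplicates; runs of three or more consecutive indexes are compressed, as stated in the API comment for FailedIndexes.

```go
package main

import (
	"fmt"
	"strings"
)

// formatIndexes renders sorted, unique indexes as comma-separated decimal
// integers, compressing runs of three or more consecutive numbers into
// "first-last".
func formatIndexes(sorted []int) string {
	var b strings.Builder
	for i := 0; i < len(sorted); {
		// Extend j to the end of the run of consecutive numbers starting at i.
		j := i
		for j+1 < len(sorted) && sorted[j+1] == sorted[j]+1 {
			j++
		}
		if b.Len() > 0 {
			b.WriteByte(',')
		}
		switch {
		case j-i >= 2:
			fmt.Fprintf(&b, "%d-%d", sorted[i], sorted[j])
		case j-i == 1:
			fmt.Fprintf(&b, "%d,%d", sorted[i], sorted[j])
		default:
			fmt.Fprintf(&b, "%d", sorted[i])
		}
		i = j + 1
	}
	return b.String()
}

func main() {
	fmt.Println(formatIndexes([]int{1, 3, 4, 5, 7})) // 1,3-5,7
	fmt.Println(formatIndexes([]int{0, 2, 4, 6}))    // 0,2,4,6
}
```

The second call shows the worst case mentioned above: when only even indexes fail, no run can be compressed and every index is written out.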

Job completion

When backoff limit per index is used, then we execute indexes until all of them are completed (either failed or succeeded), or the number of failed indexes exceeds the specified .spec.maxFailedIndexes.

Then, the Job is marked as completed (the Complete Job condition type) when all indexes have succeeded. The Job is marked as failed (the Failed Job condition) when at least one index has failed. The Failed condition is added once all indexes have completed their execution (either failed or succeeded), or when the number of failed indexes exceeds the specified .spec.maxFailedIndexes.
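The completion rule can be sketched as a small decision function. The name jobCondition and its integer parameters are hypothetical simplifications; the real controller derives these counts from the Job status.

```go
package main

import "fmt"

// jobCondition returns "Complete", "Failed", or "" when the Job is still
// running. maxFailedIndexes is nil when .spec.maxFailedIndexes is unset.
func jobCondition(completions, succeeded, failed int, maxFailedIndexes *int) string {
	// Exceeding maxFailedIndexes terminates the Job early.
	if maxFailedIndexes != nil && failed > *maxFailedIndexes {
		return "Failed"
	}
	// Otherwise wait until every index has either succeeded or failed.
	if succeeded+failed < completions {
		return ""
	}
	if failed > 0 {
		return "Failed"
	}
	return "Complete"
}

func main() {
	limit := 1
	fmt.Println(jobCondition(10, 3, 2, &limit)) // Failed
	fmt.Println(jobCondition(10, 10, 0, nil))   // Complete
}
```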

FailIndex action

In order to allow early termination of indexes with the FailIndex action we add the corresponding index to the set of failed indexes represented by .status.failedIndexes. This action can only be used if backoff limit per index is used.

Exponential backoff delay per index

First, we solve the issue of increasing failure time for deleted pods when the finalizer removal is delayed, by modifying the definition of the pod finish time, to avoid fallback to now (see also Exponential backoff delay issue ).

Second, we compute the backoff delay within each index independently. The number of consecutive failures per-index can be derived from the batch.kubernetes.io/job-index-failure-count annotation of the last failed pod, plus one. This is because any successful pod marks the index as successful and stops retries. Note that, using the annotation value means that failed pods matching the Ignore rule are skipped in the calculation, but this behavior is consistent with handling ignored pod failures for regular backoff limit.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Unit tests will be added along with any new code introduced. In particular, the following scenarios will be covered with unit tests:

handling or ignoring of .spec.backoffLimitPerIndex by the Job controller when the feature gate is enabled or disabled, respectively,
handling or ignoring of the pod failure policy rule with FailIndex action when the JobBackoffLimitPerIndex feature gate is enabled or disabled, respectively,
validation of a job configuration with respect to .spec.backoffLimitPerIndex by kube-apiserver (including limits for .spec.maxFailedIndexes, .spec.parallelism and .spec.completions), when the feature gate is enabled or disabled,
marking of the Job as Complete only once all indexes are completed,
termination of Job execution and marking it as failed when .spec.maxFailedIndexes is exceeded.
calculation of the exponential backoff delay per index when backoffLimitPerIndex is used.
a fuzzer roundtrip test for API when backoffLimit is set to max int32.

The core packages (with their unit test coverage) which are going to be modified during the implementation:

k8s.io/kubernetes/pkg/controller/job: 27 Apr 2023 - 90.4%
k8s.io/kubernetes/pkg/apis/batch/validation: 27 Apr 2023 - 98.5%

Integration tests

The following scenarios will be covered with integration tests:

enabling, disabling and re-enabling of the JobBackoffLimitPerIndex feature gate (code )
handling of the .spec.backoffLimitPerIndex when the FailIndex action is used (code ),
handling of the .spec.backoffLimitPerIndex when .spec.maxFailedIndexes isn’t set (code ),
handling of the .spec.backoffLimitPerIndex when .spec.maxFailedIndexes is set (code ),
handling of the .spec.backoffLimit when .spec.backoffLimitPerIndex is set (code ),
handling of the exponential backoff delay per index when .spec.backoffLimitPerIndex is set (code ).

The [k8s-triage] page for the BackoffLimitPerIndex integration tests .

More integration tests might be added to ensure good code coverage based on the actual implementation.

e2e tests

The following scenario is covered with e2e tests for Beta:

sig-apps#gce :
- Job should execute all indexes despite some failing when using backoffLimitPerIndex (code )
- Job should terminate job execution when the number of failed indexes exceeds maxFailedIndexes (code )
- Job should mark indexes as failed when the FailIndex action is matched in podFailurePolicy (code )

The [k8s-triage] page for the BackoffLimitPerIndex e2e tests .

Graduation Criteria

Alpha

the feature implemented behind the JobBackoffLimitPerIndex feature flag
change the logic of computing the exponential backoff delay (see here )
user-facing documentation, including the warning for setting completions > 10^5
The JobBackoffLimitPerIndex feature flag disabled by default
Tests: unit and integration

Beta

Address reviews and bug reports from Alpha users
Implement the job_finished_indexes_total metric
E2e tests are in Testgrid and linked in KEP
Move the new reason declarations from Job controller to the API package
Evaluate performance of Job controller for jobs using backoff limit per index with benchmarks at the integration or e2e level (discussion pointers from Alpha review: thread1 and thread2 )
The feature flag enabled by default

GA

Address reviews and bug reports from Beta users
Write a blog post about the feature
Revisit extending the hands-on guide for Pod failure policy to use FailIndex
Graduate e2e tests as conformance tests
Lock the JobBackoffLimitPerIndex feature gate

Upgrade / Downgrade Strategy

Upgrade

An upgrade to a version which supports this feature should not require any additional configuration changes. In order to use this feature after an upgrade users will need to configure their Jobs by specifying .spec.backoffLimitPerIndex. There is no difference in behavior of Jobs if .spec.backoffLimitPerIndex is not set.

Downgrade

A downgrade to a version which does not support this feature should not require any additional configuration changes. Jobs which specified .spec.backoffLimitPerIndex (to make use of this feature) will be handled in the default way, i.e. using the .spec.backoffLimit. However, since the .spec.backoffLimit defaults to the max int32 value (see here ), it might be necessary to set .spec.backoffLimit manually to ensure failed pods are not retried indefinitely.

Version Skew Strategy

This feature is limited to control plane.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: JobBackoffLimitPerIndex
- Components depending on the feature gate: kube-apiserver, kube-controller-manager
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Using the feature gate is the recommended way. When the feature is disabled the Job controller manager handles pod failures in the default way, even if .spec.backoffLimitPerIndex is set.

What happens if we reenable the feature if it was previously rolled back?

The Job controller starts to handle pod failures according to the specified .spec.backoffLimitPerIndex or .spec.maxFailedIndexes fields.

Are there any tests for feature enablement/disablement?

Yes, there is an integration test which tests the following path: enablement -> disablement -> re-enablement.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This change does not impact how a rollout or rollback can fail.

The change is opt-in, thus a rollout doesn’t impact already running pods.

The rollback might affect how pod failures are handled, since they will be counted only against .spec.backoffLimit, which is defaulted to the max int32 value when using .spec.backoffLimitPerIndex (see here ). Thus, similarly to the case of a downgrade (see here ), it might be necessary to set .spec.backoffLimit manually to ensure failed pods are not retried indefinitely.

What specific metrics should inform a rollback?

A substantial increase in the job_sync_duration_seconds.

Also, a substantial increase in the total number of pods, as it may take additional time to get the finalizers removed.

Additionally, a substantial increase in the difference of terminated_pods_tracking_finalizer_total for the add and delete labels may indicate that it takes too long to delete the finalizers.

The feature is opt-in so in case of issues it is enough not to use the backoffLimitPerIndex API field.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

The Upgrade->downgrade->upgrade testing was done manually using the alpha version in 1.28 with the following steps:

Start the cluster with the JobBackoffLimitPerIndex enabled:

kind create cluster --name per-index --image kindest/node:v1.28.0 --config config.yaml

using config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
 "JobBackoffLimitPerIndex": true
nodes:
- role: control-plane
- role: worker

Then, create the job using .spec.backoffLimitPerIndex=1:

kubectl create -f job.yaml

using job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-longrun
spec:
  parallelism: 3
  completions: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: busybox:1.36.1
        command: ["sleep"]
        args: ["1800"] # 30min
        imagePullPolicy: IfNotPresent

Wait for the pods to be running and delete the 0-indexed pod:

kubectl delete pods -l job-name=job-longrun,batch.kubernetes.io/job-completion-index=0 --grace-period=1

Wait for the replacement pod to be created and repeat the deletion.

Check job status and confirm .status.failedIndexes="0"

kubectl get jobs -ljob-name=job-longrun -oyaml

Also, notice that .status.active=2, because the pod for a failed index is not re-created.

Simulate downgrade by disabling the feature for api server and control-plane.

Then, verify that 3 pods are running again, and the .status.failedIndexes is gone by:

kubectl get jobs -ljob-name=job-longrun -oyaml

this will produce output similar to:

 ...
 status:
   active: 3
   failed: 2
   ready: 2

Simulate upgrade by re-enabling the feature for api server and control-plane.

Then, delete the 1-indexed pod:

kubectl delete pods -l job-name=job-longrun,batch.kubernetes.io/job-completion-index=1 --grace-period=1

Wait for the replacement pod to be created and repeat the deletion. Check job status and confirm .status.failedIndexes="1"

kubectl get jobs -ljob-name=job-longrun -oyaml

Also, notice that .status.active=2, because the pod for a failed index is not re-created.

This demonstrates that the feature is working again for the job.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By the presence of the .spec.backoffLimitPerIndex field in the jobs.

For Beta we are also considering introducing the job_finished_indexes_total metric (see also here ).

How can someone using this feature know that it is working for their instance?

Job API .status
- field: failedIndexes will not be empty as indexes fail
Pod API
- annotation: batch.kubernetes.io/job-index-failure-count is present for pods created by Jobs with this feature enabled

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature does not propose SLOs.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - job_sync_duration_seconds (existing): can be used to see how much the feature enablement increases the time spent in the sync job
  - job_finished_indexes_total (new): can be used to determine if the indexes are marked failed,
- Components exposing the metric: kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

For Beta we will introduce a new metric job_finished_indexes_total with labels status=(failed|succeeded), and backoffLimit=(perIndex|global). It will count the number of failed and succeeded indexes across jobs using backoffLimitPerIndex, or regular Indexed Jobs (using only .spec.backoffLimit). It might be useful to determine the global ratio of failed vs. succeeded indexes when backoffLimitPerIndex is used.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, but only when the .spec.backoffLimitPerIndex field is set.

API type(s): Job
Estimated increase in size:
- New .status.failedIndexes field in Status and .status.completedIndexes pre-existing field are impacted. When the scalability limits are respected, then the maximal increase of the total size of both fields can be estimated as 190Ki (see The Job object too big for more details),
- New .spec.backoffLimitPerIndex field of *int32 is 12 bytes.
API type(s): Pod
Estimated increase in size: the new annotation batch.kubernetes.io/job-index-failure-count, which keeps the current number of retries per index, is around 50 bytes.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

We don’t expect this increase to be captured by existing SLO/SLIs .

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

The added dependency of removing finalizers only after pod recreation (see Tracking the number of failures per index ) may keep pods around longer (around 10s, which is the backoff for pod recreation) before actual deletion (requested or by PodGC).

This can increase the RAM consumption, but only for a short period of time. Also, it is only affecting the failing pods.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. This feature does not introduce any resource exhaustive operations.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from existing behavior of the Job controller.

What are other known failure modes?

None.

What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Implementation History

2023-01-23: Initial version of the KEP PR Backoff Limit Per Job #3774
2023-04-26: The KEP PR Backoff limit per Job Index #3967 takes over from #3774
2023-05-08: The KEP PR ready for review
2023-06-07: The KEP PR merged
2023-07-13: The implementation PR Support BackoffLimitPerIndex in Jobs #118009 under review
2023-07-18: Merge the API PR Extend the Job API for BackoffLimitPerIndex
2023-07-18: Merge the Job Controller PR Support BackoffLimitPerIndex in Jobs
2023-08-04: Merge user-facing docs PR Docs update for Job’s backoff limit per index (alpha in 1.28)
2023-08-06: Merge KEP update reflecting decisions during the implementation phase Update for KEP3850 “Backoff Limit Per Index”
2023-10-02: Update KEP-3850 “Backoff Limit Per Index” for Beta
2023-10-20: Introduce the job_finished_indexes_total metric
2023-10-23: Graduate BackoffLimitPerIndex to Beta
2023-10-24: Indicate Job Backoff Limit Per Index reason consts are beta
2023-10-25: Backoff limit per index e2e test
2023-11-02: Add remaining e2e tests for Job BackoffLimitPerIndex based on KEP
2023-11-02: Benchmark job with backoff limit per index
2023-11-02: Update KEP3850 “BackoffLimitPerIndex for Indexed Jobs”
2025-02-07: KEP3850: graduate Backoff Limit Per Index for Job to stable
2025-02-25: Add Job e2e for tracking failure count per index
2025-03-01: Graduate Backoff Limit Per Index as stable

Drawbacks

Alternatives

backoffLimitPerIndex inside new runPolicy

We could nest the new fields (maxFailedIndexes and backoffLimitPerIndex) inside another field. Proposed alternative names for the field:

1. runPolicy
2. completionPolicy
3. failurePolicy

For example:

apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimit: 4
  runPolicy:
    backoffLimitPerIndex: true
    maxFailedIndexes: 1
  ...

Option (3.) suggests that the fields are about declaring the Job as failed. However, the backoffLimitPerIndex field not only counts failures towards the backoff limit per index, but also allows all indexes to execute despite failures. Thus, more generic names, like (1.) and (2.), are preferred.

Also the options (1.) and (2.) may be reused in the context of success policy which is subject of Job success/completion policy . It might be beneficial for the API to consider the conditions for the Job success or failure under the same field.

Reasons for deferring / rejecting

It is not clear what is the best name going forward. Also, it seems that the backoffLimitPerIndex should be next to backoffLimit. It was discussed and the consensus is that “top-level” is fine (see here ).

Mark Job Complete if some indexes failed

The alternative to the proposed Job completion strategy.

Allow execution of all indexes, up to .spec.maxFailedIndexes of failed indexes. Then, mark the Job Complete even if some indexes failed. The Job is marked Failed only if the number of failed indexes exceeds the specified .spec.maxFailedIndexes limit, in that case, the reason field could be FailedIndexes, and the message field would list the failed indexes up to a couple of them.

Reasons for deferring / rejecting

This approach is less intuitive to the end-users of the API, compared to the proposal. In particular, in some cases it would require custom logic in the user’s controller to determine if the Job is failed.

Support backoffLimitPerIndex when restartPolicy=OnFailure

We’ve considered supporting the backoffLimitPerIndex when pod’s restartPolicy=OnFailure.

Reasons for deferring / rejecting

When restartPolicy=OnFailure it is the Kubelet’s responsibility to restart the pod. If the maximal number of restarts were instead enforced by the Job controller, race conditions would be possible. For example, in between the checks by the Job controller, the Kubelet could execute more restarts than the specified .spec.backoffLimit. The problematic counting of failures with restartPolicy=OnFailure is tracked in the issue When restartPolicy=OnFailure the calculation for number of retries is not accurate .

We believe that this feature can be supported well by using the pod-level API, started in this KEP: Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure .

Once the pod-level API is done, it could be considered to support .spec.backoffLimitPerIndex when restartPolicy=OnFailure in the pod’s spec. In this case we could set the pod-level maxRestartTimes field based on the Job-level .spec.backoffLimit, leaving the responsibility of enforcing the limit to the Kubelet.

We will re-assess the decision if the Pod-level API graduates to GA in the KEP: Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure . For example, when maxRestartTimes is specified for restartPolicy=OnFailure, then we could support maxFailedIndexes, which would allow controlling the number of failed indexes (those that exceeded the maxRestartTimes and are marked failed).

Mutually exclusive backoffLimit and backoffLimitPerIndex

We’ve also considered making the backoffLimit and backoffLimitPerIndex fields mutually exclusive.

Reasons for deferring / rejecting

There is no way to control downgrade, as the value of backoffLimit would always default to 6. Also, old API clients may error trying to read or modify Job objects with backoffLimit=nil.

Use bool field

We’ve considered using a bool backoffLimitPerIndex field. Here is an example:

apiVersion: batch/v1
kind: Job
spec:
 parallelism: 10
 completions: 10
 completionMode: Indexed
 backoffLimit: 1
 backoffLimitPerIndex: true
 ...

Reasons for deferring / rejecting

It does not allow specifying both .spec.backoffLimit and .spec.backoffLimitPerIndex in the same config. While setting both fields can be confusing in regular use, it can be helpful to support the use case of a controlled downgrade.

Use enum field

We’ve considered using an enum backoffLimitTarget: Job|Index field (another name for this concept could be backoffLimitGranularity), to specify that the failures should be tracked per-index. Here, the default would be Job. Here is an example:

apiVersion: batch/v1
kind: Job
spec:
 parallelism: 10
 completions: 10
 completionMode: Indexed
 backoffLimit: 1
 backoffLimitTarget: Index
 ...

Reasons for deferring / rejecting

No targets other than Job and Index will be added in the foreseeable future. Thus, it seems like an unnecessary complication. The dedicated name backoffLimitPerIndex also seems to better reflect the user’s intention.

Similarly to the bool field case (see Use bool field ), it does not allow setting both .spec.backoffLimit and .spec.backoffLimitPerIndex to control the downgrade.

Global exponential backoff delay

We could also consider leaving the exponential backoff delay as global, with per-index delays enabled by a dedicated API field in a future KEP, say backoffDelayPerIndex.

Reasons for deferring / rejecting

The idea of using backoffLimitPerIndex is to make the indexes independent. Thus, failures or successes in one index should not influence backoff delays for another index. We are leaving the decision to community feedback and discussions, though.

Exponential backoff delay with in-memory tracking

Instead of modifying the definition of pod’s finish time (see Exponential backoff delay issue ) we could keep track of the “failure time” for failed pods in-memory.

Reasons for deferring / rejecting

As the number of failed indexes is capped at 10^5, keeping track of failure times for all pods would take at least 8B per failed pod, which is around 1Mi per Job in the worst-case scenario. This is a non-negligible memory increase.
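As a rough check of this estimate, assuming exactly 8 bytes per recorded timestamp and ignoring any map or pointer overhead that a real in-memory structure would add:

```latex
10^{5} \times 8\,\mathrm{B} = 8 \times 10^{5}\,\mathrm{B} \approx 0.76\,\mathrm{MiB}
```

With per-entry overhead included, the figure approaches or exceeds the 1Mi quoted above.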

The extra tracking information is not needed once counting pods as terminated is done in KEP-3939: Consider terminating pods in job controller . In this case we can assume that the failure time of each pod does not change after its phase is terminal.

Alternative ways to support high number of completions

In the current proposal the high number of completions (like 10^6) is supported by specifying the .spec.maxFailedIndexes field. This way the size of the failedIndexes field is controlled.

See below for alternative approaches proposed.

Keep failedIndexes field as a bitmap

In order to fit more failed indexes into the field we could use a bitmap.

Reasons for deferring / rejecting

it is not human readable, and readability is useful for manual inspection
it is harder to parse by user-provided controllers
it introduces a format different from the one used for keeping the succeeded indexes in .status.completedIndexes

Keep the list of failed indexes in a dedicated API object

The idea is to keep the heavy fields outside of the Job API object itself. It could be a new API object, for example JobFailedIndexes.

Reasons for deferring / rejecting

This approach significantly increases the complexity of the Job controller that needs to register and manage another API object. This may also have performance impact as the Job controller needs to query the object. Finally, it is also a complication to the end users who want to fetch the list of failed indexes.

Implicit limit on the number of failed indexes

An alternative is to have an implicit limit on the number of failed indexes, for example, by limiting the size of the .status.failedIndexes field to 300KB. This would allow running a job with completions at the level of 10^6, without an explicit limit on the maximal number of failed indexes.

Reasons for deferring / rejecting

It may behave unpredictably, impacting the user experience. For example, when a user sets maxFailedIndexes to 10^6 the Job may complete if the indexes are consecutive, but the Job may also fail if the size of the object exceeds the limits due to non-consecutive indexes failing.

Skip uncountedTerminatedPods when backoffLimitPerIndex is used

It’s been proposed (see link) that when backoffLimitPerIndex is used, then we could skip the interim step of recording terminated pods in .status.uncountedTerminatedPods.

Reasons for deferring / rejecting

First, if we stop using .status.uncountedTerminatedPods, then .status.failed can no longer track the number of failed pods. Thus, it would require a change of semantics to denote just the number of failed indexes. This has downsides:

two different semantics of the field, depending on the used feature
lost information about some failed pods within an index (some users may care to investigate succeeded indexes with at least one failed pod)

Second, it would only optimize the unhappy path, where there are failures. Also, the saving is only 1 request per 500 failed pods, which does not seem essential.

Infrastructure Needed (Optional)

Resources: Beta APIs Are Off by Default

KEP-3136: Beta APIs Are Off by Default

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

From the Kubernetes release where this change is introduced onwards, new beta APIs will not be enabled in clusters by default. Existing beta APIs and new versions of existing beta APIs will continue to be enabled by default: if v1beta.some.group is currently enabled by default and we create v1beta2.some.group, v1beta2.some.group will still be enabled by default.

Motivation

Beta APIs are not considered stable, and reliance upon APIs in this state leads to exposure to bugs, guaranteed migration pain for users when the APIs move to stable, and the risk that dependencies will grow around unfinished APIs. Enabling beta APIs by default exacerbates these problems by turning them on in nearly every cluster. We observed these problems as we removed long-standing beta APIs, and the PRR survey tells us that over 90% of cluster-admins leave production clusters with these APIs enabled. Unsuitability for production use is documented at https://kubernetes.io/docs/reference/using-api/#api-versioning (“The software is not recommended for production uses”), but defaulting on means they are present in nearly every production cluster. With beta APIs disabled by default, a cluster-admin can opt in to specific APIs without having every incomplete API present in the cluster. This is now practical to do since conformance no longer relies on non-stable APIs.

Goals

Disable new beta APIs by default.
Continue enabling existing beta APIs and new versions of existing beta APIs by default: if v1beta.some.group is currently enabled by default and we create v1beta2.some.group, v1beta2.some.group will still be enabled by default.
Allow enabling specific resources in beta. For example, enable coolnewjobtype.v1beta1.batch.k8s.io without enabling other-neat-job.v1beta1.batch.k8s.io.

Non-Goals

Change feature gate defaults. Feature gates control new features (not just new APIs) and they are on by default for beta features. This KEP is not changing the lifecycle flow for feature gates. It is currently alpha=off-by-default, beta=on-by-default, stable=locked-to-on.

Proposal

New beta APIs will be placed into the DisableVersions stanza instead of the EnableVersions stanza (see DefaultAPIResourceConfigSource ). The --runtime-config flag will be extended to allow group/version/resource=true, to enable specific resources. To enable a beta API, a cluster-admin will have to add the appropriate --runtime-config flags.

User Stories (Optional)

Story 1

As a cluster-admin I want to enable the coolnewjobtype.v1beta1.batch.k8s.io API in my cluster.

To do this I call kube-apiserver --runtime-config=batch.k8s.io/v1beta1/coolnewjobtype.

Story 2

As a cluster-admin I want to enable all beta APIs as in past releases.

To do this I call kube-apiserver --runtime-config=api/beta=true. This already exists and will continue to function.
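The flag forms used in the stories above can be modelled with a short sketch. It is a toy model, not kube-apiserver code: the helper names parse_runtime_config and is_served are hypothetical, and both the "bare key means true" rule and the "most specific key wins" precedence are assumptions made for illustration.

```python
def parse_runtime_config(flag_value):
    """Parse a value like "a/v1beta1/res=true,api/beta" into {key: bool}."""
    config = {}
    for item in flag_value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Assumption: a bare key with no "=value" part means "true".
        config[key.strip()] = True if not sep else value.strip().lower() == "true"
    return config


def is_served(config, group, version, resource, default_enabled):
    """Assumed precedence: resource key, then version key, then api/beta, then default."""
    for key in (f"{group}/{version}/{resource}", f"{group}/{version}"):
        if key in config:
            return config[key]
    if "beta" in version and "api/beta" in config:
        return config["api/beta"]
    return default_enabled


# Story 1: enable one new beta resource and leave its sibling off.
cfg = parse_runtime_config("batch.k8s.io/v1beta1/coolnewjobtype")
print(is_served(cfg, "batch.k8s.io", "v1beta1", "coolnewjobtype", default_enabled=False))  # True
print(is_served(cfg, "batch.k8s.io", "v1beta1", "other-neat-job", default_enabled=False))  # False

# Story 2: turn on all beta APIs, as in past releases.
cfg_all = parse_runtime_config("api/beta=true")
print(is_served(cfg_all, "batch.k8s.io", "v1beta1", "other-neat-job", default_enabled=False))  # True
```

Here default_enabled=False stands for a new beta API placed in the DisableVersions stanza; an existing beta API would be modelled with default_enabled=True.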

Notes/Constraints/Caveats (Optional)

Installers, utilities, controllers, etc. that need to know if a certain beta API is present can continue to use the existing discovery mechanisms (example: kubectl’s api-resources subcommand or the /apis/apps/v1 REST API).

Risks and Mitigations

Adoption of beta features will slow. Given how Kubernetes is now treated, this is a good thing, not a bad thing. Users who want to move quickly and get new features can do so by enabling all beta features or just those that are important for their workload. The PRR survey shows that over 30% of cluster-admins have enabled alpha features on at least some production clusters, so cluster-admins are willing and able to enable features that are not on by default when they are desired.

If two or more APIs are tightly coupled together, it will now be possible to enable them independently. This can lead to unanticipated failure modes, but should only impact beta APIs with beta dependencies. While this is a risk, it is not very common and components should fail safe as a general principle.

If beta APIs are off by default, it’s possible that fewer clients will use them and provide feedback. This is a risk, but early adopters are able to enable these features and have a history of enabling alpha features. When moving from beta to GA, it will be important for SIGs to explicitly seek feedback. We will address this by extending the PRR questionnaire to include a GA-targeted question to validate that the feature was reasonably validated in production use-cases.

If beta APIs are off by default, it is possible that SIGs don’t treat taking an API to beta as an indication of a “mostly-baked” API. If this happens, then more transformation may be required. Keeping our beta API rules consistent and continuing to enforce easy-to-use APIs seems to be the best option.

Design Details

Test Plan

Integration tests will be written to ensure that no new beta APIs are enabled in the kube-apiserver by default. Unit tests will be written to ensure that the new flag functionality works as expected.

Graduation Criteria

This KEP is a policy KEP, not a feature KEP. It will start as GA.

GA

Integration and unit tests from above.
updating the enablement docs for beta
- https://kubernetes.io/docs/reference/using-api/#api-versioning
- https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#using-a-feature (even though that page is talking about feature gates, it is likely worth calling out there that new beta REST APIs are no longer enabled by default)
email to dev@kubernetes.io to explain the new policy
blog post explaining change in time for 1.24 release
CI configuration updated to have a testing mode that enables beta APIs, likely set using kube-apiserver --runtime-config=api/beta=true
extend the PRR questionnaire to include a GA-targeted question to validate that the feature was reasonably validated in production use-cases.

Upgrade / Downgrade Strategy

The additional command line flag format for --runtime-config will not be recognized by older versions of Kubernetes. This means that when downgrading, cluster-admins will have to adjust their CLI arguments if they opted into a new beta API. This is consistent with flag handling for new features today. Because this only impacts new beta APIs, there is no behavior change for existing APIs on upgrade.

Version Skew Strategy

Because this only impacts new beta APIs, there is no novel skew risk.

Production Readiness Review Questionnaire

Not applicable because this is a policy KEP.

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (Optional)

Resources: Beta Feature Gate Promotion Requirements

KEP-5241: Beta Feature Gate Promotion Requirements

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
  - What if I need to add capability to my feature?
  - Who will make sure that new KEPs follow the promotion rules?
- Graduation Criteria
Drawbacks
- This may slow the rate that new features are promoted.
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Feature gates must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified prior to being enabled by default. The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.

Motivation

Feature gates that are enabled by default are enabled in every production Kubernetes cluster in the world. We must avoid turning every production cluster into an unstable or incomplete feature testing cluster. Even feature gates that make flags accessible, but require a secondary configuration to use, must be stable, because it is unrealistic to expect everyone to understand the graduation stages of various flags for each release: the only stages that really matter are “takes enabling an explicit alpha feature gate” and “my production cluster accepts this as valid by default”.

Goals

Feature gates must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified prior to being enabled by default.
The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.

Non-Goals

Changing beta APIs off by default rules.
Change the imperfect mechanisms we have for API evolution.

Proposal

Kubernetes feature gates have four levels: GA (locked-on), GA (disable-able), Beta, and Alpha.

GA (locked-on) means that a feature gate is unconditionally enabled in all production kubernetes clusters and that feature cannot be disabled.
GA (disable-able) is only for feature gates that include a new API serialization that cannot be enabled by default until the API reaches stable. This means that the first time the API is enabled in production, the feature will be GA, but can also be disabled. This is a less common state and does not apply to most features.
Beta means that a feature gate is usually enabled in all production Kubernetes clusters by default and that feature can be disabled. Exceptions exist for entirely new APIs and some node features, but this is broadly the case.
Alpha means that a feature gate is disabled in all production Kubernetes clusters by default and can be optionally enabled by setting a --feature-gates command line argument.
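The levels above can be summarised as a small lookup table. This is only an illustrative model: the names GateLevel, LEVELS, and effective_state are hypothetical, and the Beta row records the usual case, ignoring the exceptions mentioned for entirely new APIs and some node features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateLevel:
    enabled_by_default: bool
    can_be_disabled: bool  # whether a cluster-admin may turn the feature off


LEVELS = {
    "Alpha": GateLevel(enabled_by_default=False, can_be_disabled=True),
    "Beta": GateLevel(enabled_by_default=True, can_be_disabled=True),  # usual case
    "GA (disable-able)": GateLevel(enabled_by_default=True, can_be_disabled=True),
    "GA (locked-on)": GateLevel(enabled_by_default=True, can_be_disabled=False),
}


def effective_state(level_name, requested=None):
    """Return whether the feature is on, given an optional explicit setting."""
    level = LEVELS[level_name]
    if requested is None:
        return level.enabled_by_default
    if requested is False and not level.can_be_disabled:
        raise ValueError(f"{level_name} features cannot be disabled")
    return requested


print(effective_state("Alpha"))                   # False: off unless explicitly enabled
print(effective_state("Alpha", requested=True))   # True
print(effective_state("Beta", requested=False))   # False: the escape hatch still exists
```

The table makes the argument of this section visible: Beta is the first level that is on by default while still offering a way to switch the feature off.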

Making the jump to GA (cannot be disabled) without actual field experience is irresponsible. The first time we enable a feature gate by default in production Kubernetes clusters, we must have a way to disable the feature in case of unexpected stability, performance, or security issues.

Enabling incomplete features in production Kubernetes clusters by default is irresponsible. Features that are known to be incomplete naturally bring with them additional stability, performance, and security issues. Once a feature has been enabled in a production Kubernetes cluster by default, adding to it carries greater risk to upgrading clusters and the ecosystem. The feature can easily have become relied upon by workloads and other platform extensions. If adding those capabilities causes a stability, performance, or security incident, the cost of disabling the feature in a cluster becomes significantly greater, and doing so breaks existing clusters, workloads, and use cases. This posture makes upgrades higher risk than necessary.

To balance these concerns, we are changing how we evaluate Beta and GA stability criteria. The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”. Promotion from Beta to GA must have no significant change for the release. This means that Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified prior to beta.

Phasing in larger features over time can be done by bringing separate feature gates through alpha, beta, and GA. Each feature gate needs to meet the beta and GA criteria for completeness, functionality, security, monitoring, and testing. After meeting the criteria for being enabled by default, and at the SIG’s discretion, the new feature gate could be enabled by default in the release in which it is introduced. Importantly, the features need to behave in a way that allows old and new clients to interoperate, and new additions to larger features need to be independently disablable, with their own path to GA.

Risks and Mitigations

What if I need to add capability to my feature?

To handle this situation, we described above how to add a second feature gate for the new behavior. This provides a mechanism for adding needed capability while ensuring that cluster-admins never end up stuck after an upgrade because they rely on v1.Y-1 behavior that a new capability in v1.Y broke under the same feature gate.

Who will make sure that new KEPs follow the promotion rules?

We’ll adjust the KEP template to indicate the allowed criteria, so authors should notice. SIG approvers should enforce those standards. PRR approvers can be a final backstop.

Graduation Criteria

Once merged, this document is our new position until it is superseded by another position statement.

Drawbacks

This may slow the rate that new features are promoted.

For this to be true, we would have had to previously enable feature gates in production that were knowingly incomplete with respect to functionality, security, monitoring, or testing, or that had known bugs. We hope this was not the common case, but if it was common enough to have an impact, we’re pleased that the result is preventing incomplete feature gates from being enabled in production clusters.

Alternatives

None proposed so far.

Resources: Block ExternalIPs via Admission Control

KEP-2200: Deny use of ExternalIPs via admission control

Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories (Optional)
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Summary

This proposal is in response to CVE-2020-8554: “Man in the middle using LoadBalancer or ExternalIPs”.

Fundamentally the Service.spec.externalIPs[] feature is bad. It predates Service.spec.type=LoadBalancer and, now that we have that, has very few use-cases. In short an unprivileged user can hijack an IP address via a Service spec. In contrast, type=LoadBalancer uses Service status, which most normal users should not be allowed to write.

This KEP proposes to block the use of ExternalIPs via a built-in admission controller. The justification for this, as opposed to a webhook, is that 99% of users will never use this feature, and making them ALL run a webhook seems terrible.

Motivation

https://github.com/kubernetes/kubernetes/issues/97110

Goals

Make it possible to disable an insecure feature for the vast majority of users very quickly.

Non-Goals

Make this the default (breaking change)
Make the feature safe to use.

Proposal

This KEP proposes to add a built-in admission controller “DenyServiceExternalIPs”, which rejects any CREATE or UPDATE operation which adds a new value to Service.spec.externalIPs. Existing values will be tolerated and may be removed.
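The rule above amounts to a subset check: an operation is admitted only if every value in the new externalIPs list was already present on the existing object. A minimal sketch of that check follows, assuming a hypothetical helper admit_external_ips and treating a CREATE as an UPDATE from an empty list; it is not the admission plugin's actual code.

```python
def admit_external_ips(new_ips, old_ips=None):
    """Return True if the operation adds no new externalIPs value.

    new_ips: externalIPs on the incoming Service.
    old_ips: externalIPs on the stored Service, or None for a CREATE.
    """
    existing = set(old_ips or [])
    added = [ip for ip in new_ips if ip not in existing]
    return not added


stored = ["192.0.2.10", "192.0.2.11"]

print(admit_external_ips(["203.0.113.5"]))                        # False: CREATE with externalIPs
print(admit_external_ips(stored, stored))                         # True: unrelated UPDATE
print(admit_external_ips(["192.0.2.10", "192.0.2.99"], stored))   # False: one value changed
print(admit_external_ips(["192.0.2.11"], stored))                 # True: a value was removed
print(admit_external_ips(stored, ["192.0.2.11"]))                 # False: removed value added back
```

Because the check compares against the stored object, a value that has been removed cannot be restored while the controller is enabled.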

The number of rejected operations will be exposed by the standard admission metrics (apiserver_admission_controller_admission_duration_seconds_bucket{name="DenyServiceExternalIPs",rejected="true", ...}).

User Stories (Optional)

Alice the admin does not want her users using this insecure feature. She enables this admission controller and knows that no user can newly start using it. She can then audit existing users and make them stop.

Risks and Mitigations

Some installations may want to use this feature in a more controlled way. They can use a custom webhook admission controller or a policy controller to enforce their own rules.

This is a precedent we should not set lightly. In this case the VAST majority of users do not need this feature and this proposal is very surgical in nature. As far as we know, there are few other unprivileged fields with this much power anywhere in our API, and most of those already have some form of controls on them.

Design Details

One simple admission controller should be enough to disable this misfeature. Unfortunately it cannot be on by default (that would be breaking).

This means that platform-providers may need to expose an option to control this. While we generally try to avoid mixing knobs that cluster-users would set with knobs that cluster-providers own, it seems reasonable to close this as soon as possible and consider better answers when we have more cases to generalize from. See “Alternatives” below for more.

See “Proposal” above.

Test Plan

Unit tests to ensure CREATE and UPDATE operations are rejected when adding new externalIPs.
Unit tests to ensure UPDATE operations allow existing externalIPs.

Graduation Criteria

This feature will debut as “GA”, bypassing alpha and beta. It’s already opt-in and very small scope.

Upgrade / Downgrade Strategy

Cluster upgrades/downgrades should not be an issue.

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
- Other flag
  - Flag name: --enable-admission-plugins (existing)
Does enabling the feature change any default behavior? Yes. The externalIPs field will not be allowed to mutate, except to remove existing values.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Yes.
What happens if we reenable the feature if it was previously rolled back? No problem.
Are there any tests for feature enablement/disablement? Unit tests should suffice.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads? It could start disallowing all Service operations, if the controller was buggy.
What specific metrics should inform a rollback? apiserver_admission_controller_admission_duration_seconds_bucket{name="DenyServiceExternalIPs",rejected="true", ...}
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Manual testing:
- Create a service “extip” with 2 externalIPs values
- Upgrade to new apiserver and enable new admission controller
- Try to create a new service using externalIPs -> fail
- Try to change the “extip” service in an unrelated way -> OK
- Try to change the value of one externalIPs value in extip -> fail
- Try to remove the [0] value of externalIPs -> OK
- Try to add the removed value back -> fail
- Remove the last externalIPs value -> OK
- Try to add the removed value back -> fail
- Revert to “standard” apiserver
- Try to add the removed value back -> OK
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads? There are two possible facets of this: 1) Is the admission control enabled? and 2) Are any users using externalIPs?

To point 1, admins can look at their admission control config (--enable-admission-plugins) and look for DenyServiceExternalIPs in that list.

To point 2, admins can look at all services in the cluster for use of the externalIPs field. Via kubectl:
```
kubectl get svc --all-namespaces -o go-template='
{{- range .items -}}
{{if .spec.externalIPs -}}
{{.metadata.namespace}}/{{.metadata.name}}: {{.spec.externalIPs}}{{"\n"}}
{{- end}}
{{- end -}}
'
```
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? N/A
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? N/A
Are there any missing metrics that would be useful to have to improve observability of this feature? This proposes to use the existing apiserver_admission_controller_admission_duration_seconds_bucket{name="DenyServiceExternalIPs", ...} metrics.

Dependencies

Does this feature depend on any specific services running in the cluster? No.

Scalability

Will enabling / using this feature result in any new API calls? No.
Will enabling / using this feature result in introducing new API types? No.
Will enabling / using this feature result in any new calls to the cloud provider? No.
Will enabling / using this feature result in increasing size or count of the existing API objects? No.
Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]? No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components? No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable? It is part of apiserver REST path.
What are other known failure modes? None.
What steps should be taken if SLOs are not being met to determine the problem? N/A

Implementation History

2020-12-07: First draft
2021-01-04: Edits to PRR section.
2021-01-15: Edits from feedback.

Drawbacks

It is a slippery-slope to other ad hoc policies. Counter: this is very surgical and overwhelmingly not a useful feature.

Users who REALLY need this feature can enable it and apply whatever bespoke admission policies they need (or not).

Alternatives

Force users to use policy controllers as webhooks. Forever.
Make a breaking API change and disable or rip-out the feature.
Add a new flag telling validation logic to disallow this field.
Make a more complex API to define which namespaces can use this feature and/or which IPs they can use.
Make a new API that allows cluster-users to enable this sort of field-block without changing admission-control flags on apiserver.

Resources: bound service account token improvements

KEP-4193: bound service account token improvements

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Token projection and alternative audiences on JWTs issued by the apiserver enable an external entity to validate the identity and certain properties (e.g. associated ServiceAccount or Pod) of the caller.

When attempting to verify a token associated with a Pod, it is not possible to verify that the Pod is associated with a specific Node without getting the relevant Pod object (embedded as a private claim in the JWT) and cross-referencing the named spec.nodeName.

To allow for a robust chain of identity verification from the requester all the way through to the projected token, it would be beneficial if the Node object reference associated with the requesting Pod were embedded into the signed JWT.

This is especially useful in cases where the external software wants to avoid replay attacks with projected service account tokens. The external software can cross-reference the identity of the caller with the Node reference embedded in the JWT, which allows this verification to be rooted in the same root of trust that the kubelet/requesting entity uses.

By embedding the identity of the Node the Pod is running on, we can cross-reference this information with an identity passed along to the external service, thus removing the ability for a malicious actor to ‘replay’ a projected token from another Node.

This will be implemented as an additional node entry in the private claims embedded into each JWT returned by the TokenRequest API, in a similar manner to how the ServiceAccount, Pod or Secret is referenced.

Additionally, to provide a robust means of tracking token usage within the audit log, we can embed a unique identifier for each token, which can then also be recorded in future audit entries made by this token.

As we are adding support for node metadata associated with Pods, we will also add the ability to bind a token/JWT to a Node object directly, similar to how a token can be bound to a Pod or Secret resource today.

Motivation

Goals

Embedding information about the Node that a pod is running on into signed JWTs.
Make it easier to track the actions a single token has taken, and cross-reference that back to the origin of the token (via audit log inspection).
Provide a means of checking whether a Pod’s token is associated with the same Node as it was associated with when the initial TokenRequest was made (via an extra field that can be observed from the TokenReview API).

Non-Goals

Embedding requester information. This is discussed further in the alternatives considered section, and a future KEP may revisit this.
Embedding information beyond the immutable Node name and UID into the token. We aim to mimic what is done with the ref fields for secret, pod and serviceaccount (not introduce any additional properties).
Changing default behaviour of the SA authenticator to enforce that the referenced Node object still exists.

Proposal

Embedding Pod’s bound Node information in tokens

The kube-apiserver will be extended to automatically embed the name and uid of the Node a Pod is associated with (via spec.nodeName) in generated tokens when a TokenRequest create call is serviced.

As the ‘pod’ is already available in this area of code, which contains the nodeName, we will just need to plumb through a Getter for Node objects into the TokenRequest storage layer so the node’s UID can be fetched, similar to what is done for pod & secret objects.

Allowing ServiceAccount tokens to be bound to a Node object

Similar to how a token can be bound to a Pod or Secret object, we will also extend the TokenRequest API to allow binding directly to Node objects (without needing to bind to a Pod as well).

This allows users to obtain a token that is tied specifically to the Node object's lifecycle, i.e. when the Node object is deleted, the token will be invalidated.

Extending TokenReview to verify tokens bound to Node objects

The SA authenticator will be extended to check whether a token that is bound to a Node object is still valid, by first checking whether the Node object with the name given in the JWT still exists, and if it does, validating whether the UID of that Node is equal to the UID embedded in the token.

Tokens bound to Pod objects will continue to only validate the referenced pod. This avoids changing the previous behaviour for validation of tokens issued for pods. Deletion of a node triggers deletion of the pods associated with that node after a period of time, which ultimately invalidates those tokens.

Tokens that are directly bound to Node objects will always validate the name and UID, as binding tokens to Node objects is a new option and therefore enforcing this validation check from day 1 is non-breaking.
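The check described for Node-bound tokens has two steps: look up the Node by the name in the token, then compare UIDs. The sketch below illustrates only that control flow; the function name validate_node_bound_token, the dict-shaped Node reference, and the in-memory lookup are assumptions standing in for the authenticator and its Node getter.

```python
def validate_node_bound_token(node_ref, get_node):
    """Check a token's embedded Node reference against the live Node object.

    node_ref: {"name": ..., "uid": ...} taken from the token's claims.
    get_node: callable returning {"uid": ...} for a Node name, or None if absent.
    """
    node = get_node(node_ref["name"])
    if node is None:
        return False, "node no longer exists"
    if node["uid"] != node_ref["uid"]:
        return False, "node UID mismatch"
    return True, "ok"


# In-memory stand-in for the Node getter: node-a was deleted and
# recreated, so it now carries a new UID.
nodes = {"node-a": {"uid": "uid-2"}}

print(validate_node_bound_token({"name": "node-a", "uid": "uid-2"}, nodes.get))
print(validate_node_bound_token({"name": "node-a", "uid": "uid-1"}, nodes.get))
print(validate_node_bound_token({"name": "node-b", "uid": "uid-9"}, nodes.get))
```

Comparing the UID as well as the name is what invalidates a token once a Node has been deleted and recreated under the same name.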

Including a UUID (JTI) on each issued JWT

When a TokenRequest is being issued/fulfilled, we will modify the issuing code to also generate and embed a UUID which can be later used to trace the requests that a specific issued token has made to the apiserver via the audit log.

This will require changing the JWT issuing code to actually generate this UUID, as well as extending the code around the audit log to have it record this information into audit entries when a token is issued (via the authentication.kubernetes.io/issued-credential-id audit annotation).

As this UUID will be embedded as part of a user’s ExtraInfo, it’ll automatically be persisted into audit events for all requests made using a token that embeds a credential identifier (as authentication.kubernetes.io/credential-id).

User Stories (Optional)

Story 1

Alice hosts a service that verifies host identity using an out-of-band mechanism and also submits a bound token that contains a node assertion.

The node assertion can be checked to ensure the host identity matches the node assertion of the token.

Story 2

Bob is an administrator of a cluster and has noticed some strange request patterns from an unknown service account.

Bob would like to understand who initially issued/authorised this token to be issued. To do so, Bob looks up the JTI of the token making the suspicious requests by looking inside the audit log entries at user’s ExtraInfo for these suspect requests.

This JTI is then used for a further audit log lookup - namely, looking for the TokenRequest create call which contains the audit annotation with key authentication.kubernetes.io/issued-credential-id and the value set to that of the suspect token.

This allows Bob to determine precisely who made the original request for this token, and (depending on the ‘chain’ above this token), allows Bob to recursively perform this lookup to find all involved parties that led to this token being issued.
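Bob's lookup can be sketched as a walk over audit events. The example assumes events are already loaded as dicts with user.username, user.extra, and annotations fields, uses the annotation key named in this story, and assumes the requester's own identifier is stored under a matching credential-id key in user.extra; the function names and sample events are hypothetical.

```python
ISSUED_KEY = "authentication.kubernetes.io/issued-credential-id"
CREDENTIAL_KEY = "authentication.kubernetes.io/credential-id"  # assumed extra-info key


def credential_id(event):
    """Return the credential identifier of the token that made this request, if any."""
    values = event.get("user", {}).get("extra", {}).get(CREDENTIAL_KEY, [])
    return values[0] if values else None


def issuing_event(events, cred_id):
    """Find the TokenRequest create event that issued the given credential."""
    for event in events:
        if event.get("annotations", {}).get(ISSUED_KEY) == cred_id:
            return event
    return None


def issuance_chain(events, cred_id):
    """Follow issuer after issuer, returning usernames from nearest to furthest."""
    chain, seen = [], set()
    while cred_id is not None and cred_id not in seen:
        seen.add(cred_id)
        event = issuing_event(events, cred_id)
        if event is None:
            break
        chain.append(event["user"]["username"])
        cred_id = credential_id(event)  # the issuer may itself hold an issued token
    return chain


events = [
    {"user": {"username": "system:node:node-a"},
     "annotations": {ISSUED_KEY: "id-1"}},
    {"user": {"username": "system:serviceaccount:ci:deployer",
              "extra": {CREDENTIAL_KEY: ["id-1"]}},
     "annotations": {ISSUED_KEY: "id-2"}},
    {"user": {"username": "system:serviceaccount:ci:unknown",
              "extra": {CREDENTIAL_KEY: ["id-2"]}}},  # the suspicious request
]

suspect = credential_id(events[2])
print(issuance_chain(events, suspect))
```

The seen set guards against looping forever if the log ever contains a cycle of identifiers.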

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Adding additional cross-referencing validation checks into the TokenReview API may break some user workflows that involve deleting Node objects and restarting kubelets to allow them to be recreated. As a result, the TokenReview API will NOT be modified to tighten this validation behaviour. Instead, we will rely on the existing protections & mechanisms for invalidating a Node<>Pod binding (i.e. auto-deletion of the Pod a fixed time period after the Node object is deleted).

Design Details

The pkg/serviceaccount/claims.go file’s Claims function will be modified to accept a core.Node. This will be made available in the call-site for this function (pkg/registry/core/serviceaccount/storage/token.go) by passing through a Getter for Node objects, similar to how secret objects are fetched.

The associated Validator used to validate and parse service account tokens will also be extended to extract this new information from tokens if it is available.

In pkg/registry/core/serviceaccount/storage/token.go, the Create function will also be extended to add an audit annotation including the generated service account token’s JTI, to make it possible to map a future request which used this token back to the initial point at which the token was generated (i.e. to allow deeper inspection of who the requester is).

In the file staging/src/k8s.io/apiserver/pkg/authentication/serviceaccount/util.go, the ServiceAccountInfo.UserInfo method will be modified to also return this information in the returned user.Info struct.

These proposed changes can also be reviewed in the draft pull request.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/registry/core/serviceaccount/storage:

Coverage before (release-1.28): k8s.io/kubernetes/pkg/registry/core/serviceaccount/storage 8.354s coverage: 10.7% of statements
Coverage after: k8s.io/kubernetes/pkg/registry/core/serviceaccount/storage 8.394s coverage: 8.7% of statements
Test ensuring audit annotations are added to audit events for the serviceaccounts/<name>/token subresource.
Tests verifying it’s possible to bind a token to a Node object.
Tests ensuring tokens bound to pod objects also embed associated node metadata.
NOTE: the majority of this file is not covered by unit tests (integration tests are used instead); see #121515.

staging/src/k8s.io/apiserver/pkg/authentication/serviceaccount:

Coverage before (release-1.28): k8s.io/apiserver/pkg/authentication/serviceaccount 0.567s coverage: 60.8% of statements
Coverage after: k8s.io/apiserver/pkg/authentication/serviceaccount 0.569s coverage: 70.1% of statements
Test ensuring that service account info (JTI, node name and UID) is correctly extracted from a presented JWT.
Tests to ensure the information is NOT extracted when the feature gate is disabled.

pkg/serviceaccount:

Coverage before (release-1.28): k8s.io/kubernetes/pkg/serviceaccount 0.755s coverage: 72.4% of statements
Coverage after: k8s.io/kubernetes/pkg/serviceaccount 0.786s coverage: 72.7% of statements
Extending tests to ensure Node info is embedded into extended claims (name and uid)
Tests to ensure ID/JTI field is always set to a random UUID.
Tests to ensure the info embedded on a JWT is extracted from the token and into the ServiceAccountInfo when a token is validated.
Tests to ensure the information is NOT embedded or extracted when the feature gate is disabled.

staging/src/k8s.io/kubectl/pkg/cmd/create:

Coverage before (release-1.28): k8s.io/kubectl/pkg/cmd/create 0.995s coverage: 55.1% of statements
Coverage after: k8s.io/kubectl/pkg/cmd/create 0.949s coverage: 55.2% of statements
Add tests ensuring it’s possible to request a token that is bound to a Node object (gated by environment variable during alpha)

Integration tests

Test that calls the TokenRequest API to obtain a token that is bound to a Pod. It should assert that the token embeds a reference to the Pod object, as well as to the Node object that the Pod is assigned to.
Test that calls the TokenRequest API to obtain a token that is bound to a Node. It should assert that the token embeds a reference to the Node object.
Test that calls the TokenReview API with a token that is bound to a Node object that no longer exists. It should assert that the token does not validate once the Node has been deleted.

k8s.io/test/integration/sig-auth/svcacct_test.go

e2e tests

Extend existing TokenRequest e2e tests to check for embedded scheduled node name & UID + generated JTI is present.

Graduation Criteria

Alpha

JTI feature implemented behind a feature flag ServiceAccountTokenJTI.
Embedding Pod’s assigned Node name/uid feature implemented behind a feature flag ServiceAccountTokenPodNodeInfo.
Support verifying JWTs bound to Node objects with feature flag ServiceAccountTokenNodeBindingValidation.
Allowing tokens bound to Node objects to be issued with feature flag ServiceAccountTokenNodeBinding.
Initial e2e tests completed and enabled

Beta

Decide what the default of the new flag should be
- Decision: this flag was not added during alpha, and MAY be added post-beta, but will definitely default to off.
- This does not need to block promotion of ServiceAccountTokenPodNodeInfo feature as a result.
Decide if using an audit annotation is the correct approach
- Decision: audit annotation is the correct approach as this is only for serviceaccounts/<name>/token requests, not all
- Renaming audit annotation to authentication.kubernetes.io/issued-credential-id to disambiguate from authentication.kubernetes.io/credential-id in user’s ExtraInfo
Docs around the SA JWT schema (this does not exist today)

GA

Allowing time for feedback and any other user-experience reports.
Conformance tests
Consolidate the existing service account docs to be more coherent and avoid duplication, especially in regards to consuming service account tokens outside of Kubernetes:

Upgrade / Downgrade Strategy

Version Skew Strategy

Embedding a Pod’s assigned Node name into a JWT does not require any coordination between clients and the apiserver, as no components require this information to be embedded. This is purely additive, and the only rollback concerns would be around third party software that consumes this information. Such software should always check whether a node claim is embedded in a token before using it, and provide a fall-back behaviour (i.e. a GET to the apiserver to fetch the Pod & Node objects) if it needs to maintain compatibility with older apiservers.

Binding a token to a Node introduces a new validation mechanism, and therefore we must allow one release cycle after introducing the ability to validate tokens, before we can begin permitting issuance of these tokens. This is a critical step from a security standpoint, as otherwise the following sequence could occur:

1. An administrator upgrades their apiserver/control plane.
2. A user requests a token bound to a Node, expecting it to be invalidated when the Node is deleted.
3. The administrator rolls back the apiserver to an older version.
4. The Node object is deleted.
5. The token issued in (2) continues to be accepted/validated, despite the Node object no longer existing.

By graduating validation a release earlier than issuance, we can ensure any tokens that are bound to a Node object will be correctly validated even after a rollback.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

ServiceAccountTokenJTI feature flag will toggle including JTI information in tokens, as well as recording JTIs in the audit log / the SA user info.
ServiceAccountTokenPodNodeInfo feature flag will toggle including node info associated with pods in tokens.
ServiceAccountTokenNodeBindingValidation feature flag will toggle the apiserver validating Node claims in node bound service account tokens.
ServiceAccountTokenNodeBinding feature flag will toggle allowing service account tokens to be bound to Node objects.

The ServiceAccountTokenNodeBindingValidation feature will graduate to beta in version v1.30, a release earlier than ServiceAccountTokenNodeBinding to ensure a safe rollback from version v1.31 to v1.30 (more info below in rollback considerations section).

The ServiceAccountTokenNodeBinding feature gate must only be enabled once the ServiceAccountTokenNodeBindingValidation feature has been enabled. Disabling the ServiceAccountTokenNodeBindingValidation feature whilst keeping ServiceAccountTokenNodeBinding would allow tokens that are expected to be bound to the lifetime of a particular Node to validate even if that Node no longer exists. The rollout & rollback section below goes into further detail.

All other feature flags can be disabled without any unexpected adverse effects or coordination required.

How can this feature be enabled / disabled in a live cluster?

Feature gate
- Feature gate name: ServiceAccountTokenJTI
- Components depending on the feature gate: kube-apiserver
Feature gate
- Feature gate name: ServiceAccountTokenPodNodeInfo
- Components depending on the feature gate: kube-apiserver
Feature gate
- Feature gate name: ServiceAccountTokenNodeBinding
- Components depending on the feature gate: kube-apiserver
Feature gate
- Feature gate name: ServiceAccountTokenNodeBindingValidation
- Components depending on the feature gate: kube-apiserver

Does enabling the feature change any default behavior?

Enabling the ServiceAccountTokenPodNodeInfo and/or ServiceAccountTokenJTI feature gate will cause additional information to be stored/persisted into service account JWTs, as well as new audit annotations being recorded in the audit log. This is all purely additive, so no changes to existing features, schemas or fields are expected.

Enabling the ServiceAccountTokenNodeBinding feature gate will permit binding tokens to Node objects, which is a change in behaviour (albeit not to an existing feature, so it is not problematic).

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Future tokens will then not embed this information. Any existing issued tokens will still have this information embedded, however.

If these fields are deemed to be problematic for other systems interpreting these tokens, users will need to re-issue these tokens before presenting them elsewhere.

Once the feature(s) have graduated to GA, it will not be possible to disable this behaviour.

What happens if we reenable the feature if it was previously rolled back?

Future tokens will once again include this information/no adverse effects.

Are there any tests for feature enablement/disablement?

Yes (as noted above in the test plan)

Rollout, Upgrade and Rollback Planning

Rolling this out will be done by enabling the feature flag on all control plane hosts.

The ServiceAccountTokenNodeBindingValidation feature gate should be enabled and complete rollout before the ServiceAccountTokenNodeBinding gate is enabled, so all active servers will correctly validate tokens issued by any server.

The ServiceAccountTokenNodeBindingValidation will be defaulted to on one release before ServiceAccountTokenNodeBinding to account for this. Concretely, ServiceAccountTokenNodeBindingValidation will be enabled by default in v1.30 and ServiceAccountTokenNodeBinding will be enabled by default in v1.31.

This should not cause any issues or adverse effects during upgrades. Rollback is done by removing/disabling the feature gate(s).

How can a rollout or rollback fail? Can it impact already running workloads?

During a rollback, there is a concern that tokens that were issued prior to the rollback that are bound directly to a Node object (i.e. not bound to a Pod that also embeds node info, which is informational) could be accepted by an older apiserver even if the bound Node object no longer exists (as it would not know to verify the new node claim).

To help avoid this, the feature will be graduated in two phases:

First, graduating the acceptance/validation of explicitly node-scoped tokens in one release
Secondly, graduating the issuance of explicitly Node bound tokens

This allows for a safe rollback in which the same security expectations are enforced once a token has been issued.

If a user explicitly disables ServiceAccountTokenNodeBindingValidation but keeps ServiceAccountTokenNodeBinding enabled, the node claims in the issued tokens will not be properly validated. This configuration will be explicitly denied by the kube-apiserver and will cause it to exit on startup.
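The startup check described above amounts to rejecting one of the four possible gate combinations. A minimal sketch, with a hypothetical gates struct and validateGates function standing in for the real feature gate plumbing:

```go
package main

import (
	"errors"
	"fmt"
)

// gates is a hypothetical view of the two related feature gates.
type gates struct {
	NodeBinding           bool // ServiceAccountTokenNodeBinding
	NodeBindingValidation bool // ServiceAccountTokenNodeBindingValidation
}

// validateGates rejects the only unsafe combination: issuing node-bound
// tokens while their node claims would not be validated.
func validateGates(g gates) error {
	if g.NodeBinding && !g.NodeBindingValidation {
		return errors.New("ServiceAccountTokenNodeBinding requires ServiceAccountTokenNodeBindingValidation to be enabled")
	}
	return nil
}

func main() {
	for _, g := range []gates{
		{false, false},
		{false, true},
		{true, true},
		{true, false},
	} {
		fmt.Printf("%+v -> %v\n", g, validateGates(g))
	}
}
```

A component performing this check at startup would exit when the function returns an error.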

What specific metrics should inform a rollback?

authentication_attempts
authorization_attempts_total
serviceaccount_valid_tokens_total

New metrics that can be used to identify if the feature is in use:

serviceaccount_authentication_pod_node_ref_verified_total
serviceaccount_authentication_bound_object_verified_total{bound_object_kind="Node"}
serviceaccount_bound_tokens_issued_pod_with_node_tokens_total
serviceaccount_bound_tokens_issued_total{bound_object_kind="Node"}
serviceaccount_bound_tokens_issued_with_identifier_total

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

For ServiceAccountTokenJTI feature (alpha v1.29, beta v1.30, GA v1.32):

Without the feature gate enabled, issued service account tokens will not have their jti field set to a random UUID, and the audit log will not persist the issued credential identifier when issuing a token.

With the feature gate enabled, issued service account tokens will have their jti field set to a random UUID. Additionally, the audit event recorded when issuing a new token will have a new annotation added (authentication.kubernetes.io/issued-credential-id). As a service account token’s JTI field is used to infer the credential identifier, which forms part of a user’s ExtraInfo, audit events generated using this newly issued token will also include this JTI (persisted as authentication.kubernetes.io/credential-id).

If the feature is disabled and a token is presented that includes a credential identifier, it will still be persisted into the audit log as part of the UserInfo in the audit event.

As none of these fields are used when validating a token, enabling & disabling the feature does not cause any adverse side effects.

For ServiceAccountTokenNodeBinding (alpha v1.29, beta v1.31, GA v1.33) and ServiceAccountTokenNodeBindingValidation (alpha v1.29, beta v1.30, GA v1.32) features:

Without the feature gate enabled, service account tokens that have been bound to Node objects will not have their node reference claims validated (to ensure the referenced node exists).

With the feature gate enabled, if a token has a node claim contained within it, it’ll be validated to ensure the corresponding Node object actually exists.

Disabling this feature will therefore relax the security posture of the cluster in an unexpected way, as tokens that may have been previously invalid (because their corresponding Node does not exist) may become valid again.

Node bound tokens may only be issued if the ServiceAccountTokenNodeBinding feature is enabled, and it is not possible to enable ServiceAccountTokenNodeBinding without ServiceAccountTokenNodeBindingValidation being enabled too.

This is further mitigated by graduating the ServiceAccountTokenNodeBindingValidation feature one release earlier than ServiceAccountTokenNodeBinding.

Tokens that are bound to objects other than Nodes are unaffected.

For ServiceAccountTokenPodNodeInfo feature (alpha v1.29, beta v1.30, GA v1.32):

Without the feature gate enabled, tokens that are bound to Pod objects will not include information about the Node that the pod is scheduled/assigned to.

With the feature enabled, newly minted tokens that are bound to Pod objects will include metadata about the Node, namely the Node’s name and UID.

These fields are not validated and therefore disabling the feature after enabling it will not cause any adverse side-effects.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

New metrics:

serviceaccount_authentication_pod_node_ref_verified_total - new metric that is incremented when a token bound to a Pod has its Node reference verified
serviceaccount_authentication_bound_object_verified_total{bound_object_kind="Node"} - new metric that is incremented when a token bound to a Node has its reference verified
serviceaccount_bound_tokens_issued_pod_with_node_tokens_total - new metric that is incremented when a node ref is embedded into a bound Pod token (aka implicitly added)
serviceaccount_bound_tokens_issued_total{bound_object_kind="Node"} - new metric that is incremented whenever a bound token is issued that references a Node (explicitly added)
serviceaccount_bound_tokens_issued_with_identifier_total - new metric that is incremented whenever a token that contains an identifier/JTI is issued

How can an operator determine if the feature is in use by workloads?

The metrics detailed above provide a clear signal as to whether these features are being used.

How can someone using this feature know that it is working for their instance?

For the node info part, using the TokenRequest API and inspecting the contents of the issued JWTs for a token bound to a Pod. For JTIs, using the TokenRequest API and then inspecting the contents of the issued JWT for any ServiceAccount token.

For the validation/verification, the user can use the SelfSubjectAccessReview API to check whether the token is still valid. To do so, they’d need to obtain a token that is bound to a Pod, delete the corresponding Node object that the Pod is scheduled on, and observe that the token is no longer valid via the SelfSubjectAccessReview API.

A similar process could be used for tokens bound to Node objects directly.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

None

Does this feature depend on any specific services running in the cluster?

Scalability

N/A

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Additional audit log annotation keys, as well as extending the JWT claims we embed into service account tokens.

The maximum size of a UUID is 36 bytes. The maximum size of a Node object’s name is 253 bytes. The maximum size of a Node object’s UID is 36 bytes.

This additional data will be recorded into issued JWTs as well as audit log events.
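The limits above give a simple upper bound on the extra claim values per token. The sketch below adds them up and also shows how that grows once base64url-encoded in the JWT payload; it deliberately ignores JSON key names and punctuation, so the real overhead is somewhat larger. The constant and function names are hypothetical.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

const (
	maxJTILen      = 36  // UUID
	maxNodeNameLen = 253 // Node object name
	maxNodeUIDLen  = 36  // Node object UID
)

// worstCaseValueBytes sums the maximum sizes of the newly embedded values.
func worstCaseValueBytes() int {
	return maxJTILen + maxNodeNameLen + maxNodeUIDLen
}

// encodedBytes reports the size of n raw bytes after unpadded base64url
// encoding, which is how a JWT payload is serialized.
func encodedBytes(n int) int {
	return base64.RawURLEncoding.EncodedLen(n)
}

func main() {
	raw := worstCaseValueBytes()
	fmt.Println(raw, encodedBytes(raw)) // prints "325 434"
}
```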

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Fractionally increase the time spent issuing service account JWTs (UUID generation mainly). This is expected to be negligible.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Not applicable. This change is solely within the apiserver, and does not touch etcd.

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

After observing an issue (e.g. uptick in denied authentication requests or a significant shift in any metrics added for this KEP), kube-apiserver logs from the authenticator may be used to debug.

Additionally, manually attempting to exercise the affected codepaths would surface information that’d aid debugging. For example, attempting to issue a node bound token, or attempting to authenticate to the apiserver using a node bound token.

Implementation History

KEP marked implementable and merged for the v1.29 release
KEP implemented in an alpha state for v1.29
Renamed audit annotation used for the serviceaccounts/<name>/token endpoint to be clearer: https://github.com/kubernetes/kubernetes/pull/123098
Added restrictions to disallow enabling ServiceAccountTokenNodeBinding without ServiceAccountTokenNodeBindingValidation: https://github.com/kubernetes/kubernetes/pull/123135
ServiceAccountTokenJTI, ServiceAccountTokenNodeBindingValidation and ServiceAccountTokenPodNodeInfo promoted to beta for v1.30 release
Promoted ServiceAccountTokenNodeBinding to beta for v1.31 release
Promoted ServiceAccountTokenJTI, ServiceAccountTokenPodNodeInfo, ServiceAccountTokenNodeBindingValidation to stable for v1.32 release
Promoted ServiceAccountTokenNodeBinding to stable for v1.33 release

Drawbacks

Alternatives

Infrastructure Needed (Optional)

N/A

Resources: Bound Service Account Tokens


Bound Service Account Tokens

Summary
Background
Motivation
Design Details
Production Readiness Review Questionnaire

Summary

This KEP describes an API that would allow workloads running on Kubernetes to request JSON Web Tokens that are audience, time and eventually key bound. In addition, this KEP introduces a new mechanism of distribution with support for bound service account tokens and explores how to migrate from the existing mechanism backwards compatibly.

Background

Kubernetes already provisions JWTs to workloads. This functionality is on by default and thus widely deployed. The current workload JWT system has serious issues:

1. Security: JWTs are not audience bound. Any recipient of a JWT can masquerade as the presenter to anyone else.
2. Security: The current model of storing the service account token in a Secret and delivering it to nodes results in a broad attack surface for the Kubernetes control plane when powerful components are run - giving a service account a permission means that any component that can see that service account’s secrets is at least as powerful as the component.
3. Security: JWTs are not time bound. A JWT compromised via 1 or 2 is valid for as long as the service account exists. This may be mitigated with service account signing key rotation but is not supported by client-go and not automated by the control plane and thus is not widely deployed.
4. Scalability: JWTs require a Kubernetes secret per service account.

Motivation

We would like to introduce a new mechanism for provisioning Kubernetes service account tokens that is compatible with our current security and scalability requirements.

Design Details

TokenRequest

Infrastructure to support on demand token requests will be implemented in the core apiserver. Once this API exists, a client of the apiserver will request an attenuated token for its own use. The API will enforce required attenuations, e.g. audience and time binding.

Token Attenuations

Audience binding

Tokens issued from this API will be audience bound. Audience of requested tokens will be bound by the aud claim. The aud claim is an array of strings (usually URLs) that correspond to the intended audience of the token. A recipient of a token is responsible for verifying that it identifies as one of the values in the audience claim, and should otherwise reject the token. The TokenReview API will support this validation.
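From the recipient's side, the audience rule is a membership test: accept the token only if one of the recipient's own identifiers appears in aud. A minimal sketch with a hypothetical audienceAccepted helper:

```go
package main

import "fmt"

// audienceAccepted reports whether any identifier the recipient answers to
// appears in the token's aud claim.
func audienceAccepted(tokenAud, selfIDs []string) bool {
	for _, a := range tokenAud {
		for _, s := range selfIDs {
			if a == s {
				return true
			}
		}
	}
	return false
}

func main() {
	aud := []string{"https://kubernetes.default.svc"}
	fmt.Println(audienceAccepted(aud, []string{"https://kubernetes.default.svc"}))
	fmt.Println(audienceAccepted(aud, []string{"https://vault.example.com"}))
}
```

An empty aud claim matches nothing here, so such a token is rejected.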

Time Binding

Tokens issued from this API will be time bound. Time validity of these tokens will be claimed in the following fields:

exp: expiration time
nbf: not before
iat: issued at

A recipient of a token should verify that the token is valid at the time that the token is presented, and should otherwise reject the token. The TokenReview API will support this validation.

Cluster administrators will be able to configure the maximum validity duration for expiring tokens. During the migration off of the old service account tokens, clients of this API may request tokens that are valid for many years. These tokens will be drop in replacements for the current service account tokens.

Object Binding

Tokens issued from this API may be bound to a Kubernetes object in the same namespace as the service account. The name, group, version, kind and uid of the object will be embedded as claims in the issued token. A token bound to an object will only be valid for as long as that object exists.

Only a subset of object kinds will support object binding. Initially the only kinds that will be supported are:

v1/Pod
v1/Secret

The TokenRequest API will validate this binding.
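Conceptually, validating an object binding means confirming that an object with the bound name still exists and that its UID matches the one embedded in the token, so a deleted and recreated object does not revive old tokens. The following sketch uses a hypothetical lookupFunc in place of a real object getter.

```go
package main

import "fmt"

// boundObjectRef mirrors the binding information embedded in the token.
type boundObjectRef struct {
	Kind string
	Name string
	UID  string
}

// lookupFunc is a hypothetical stand-in for fetching an object's UID.
type lookupFunc func(kind, name string) (uid string, found bool)

// validateBinding checks that the bound object still exists with the same UID.
func validateBinding(ref *boundObjectRef, lookup lookupFunc) error {
	if ref == nil {
		return nil // token is not bound to any object
	}
	uid, found := lookup(ref.Kind, ref.Name)
	if !found {
		return fmt.Errorf("%s %q no longer exists", ref.Kind, ref.Name)
	}
	if uid != ref.UID {
		return fmt.Errorf("%s %q has UID %q, token was bound to %q", ref.Kind, ref.Name, uid, ref.UID)
	}
	return nil
}

func main() {
	store := map[string]string{"Pod/pod-foo-346acf": "a4bb8aa4-0168-11e8-92e5-42010af00002"}
	lookup := func(kind, name string) (string, bool) {
		uid, ok := store[kind+"/"+name]
		return uid, ok
	}
	ref := &boundObjectRef{Kind: "Pod", Name: "pod-foo-346acf", UID: "a4bb8aa4-0168-11e8-92e5-42010af00002"}
	fmt.Println(validateBinding(ref, lookup))
	delete(store, "Pod/pod-foo-346acf")
	fmt.Println(validateBinding(ref, lookup))
}
```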

API Changes

Add `tokenrequests.authentication.k8s.io`

We will add an imperative API (a la TokenReview) to the authentication.k8s.io API group:

type TokenRequest struct {
 Spec TokenRequestSpec
 Status TokenRequestStatus
}

type TokenRequestSpec struct {
 // Audiences are the intended audiences of the token. A token issued
 // for multiple audiences may be used to authenticate against any of
 // the audiences listed. This implies a high degree of trust between
 // the target audiences.
 Audiences []string

 // ValidityDuration is the requested duration of validity of the request. The
 // token issuer may return a token with a different validity duration so a
 // client needs to check the 'expiration' field in a response.
 ValidityDuration metav1.Duration

 // BoundObjectRef is a reference to an object that the token will be bound to.
 // The token will only be valid for as long as the bound object exists.
 BoundObjectRef *BoundObjectReference
}

type BoundObjectReference struct {
 // Kind of the referent. Valid kinds are 'Pod' and 'Secret'.
 Kind string
 // API version of the referent.
 APIVersion string

 // Name of the referent.
 Name string
 // UID of the referent.
 UID types.UID
}

type TokenRequestStatus struct {
 // Token is the token data
 Token string

 // Expiration is the time of expiration of the returned token. Empty means the
 // token does not expire.
 Expiration metav1.Time
}

This API will be exposed as a subresource under a serviceaccount object. A requestor for a token for a specific service account will POST a TokenRequest to the /token subresource of that serviceaccount object.

Modify `tokenreviews.authentication.k8s.io`

The TokenReview API will be extended to support passing an additional audience field which the service account authenticator will validate.

type TokenReviewSpec struct {
 // Token is the opaque bearer token.
 Token string
 // Audiences is the identifier that the client identifies as.
 Audiences []string
}

Example Flow

> POST /api/v1/namespaces/default/serviceaccounts/default/token
> {
> "kind": "TokenRequest",
> "apiVersion": "authentication.k8s.io/v1",
> "spec": {
> "audience": [
> "https://kubernetes.default.svc"
> ],
> "validityDuration": "99999h",
> "boundObjectRef": {
> "kind": "Pod",
> "apiVersion": "v1",
> "name": "pod-foo-346acf"
> }
> }
> }
{
"kind": "TokenRequest",
"apiVersion": "authentication.k8s.io/v1",
"spec": {
"audience": [
"https://kubernetes.default.svc"
],
"validityDuration": "99999h",
"boundObjectRef": {
"kind": "Pod",
"apiVersion": "v1",
"name": "pod-foo-346acf"
}
},
"status": {
"token":
"eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJz[payload omitted].EkN-[signature omitted]",
"expiration": "Jan 24 16:36:00 PST 3018"
}
}

The token payload will be:

{
  "iss": "https://example.com/some/path",
  "sub": "system:serviceaccount:default:default",
  "aud": [
    "https://kubernetes.default.svc"
  ],
  "exp": 24412841114,
  "iat": 1516841043,
  "nbf": 1516841043,
  "kubernetes.io": {
    "serviceAccountUID": "c0c98eab-0168-11e8-92e5-42010af00002",
    "boundObjectRef": {
      "kind": "Pod",
      "apiVersion": "v1",
      "uid": "a4bb8aa4-0168-11e8-92e5-42010af00002",
      "name": "pod-foo-346acf"
    }
  }
}

Service Account Authenticator Modification

The service account token authenticator will be extended to support validation of time and audience binding claims.

ACLs for TokenRequest

The NodeAuthorizer will allow the kubelet to use its credentials to request a service account token on behalf of pods running on that node. The NodeRestriction admission controller will require that these tokens are pod bound.

TokenRequestProjection

A ServiceAccountToken volume projection that maintains a service account token requested by the node from the TokenRequest API.

API Change

A new volume projection will be implemented with an API that closely matches the TokenRequest API.

type ProjectedVolumeSource struct {
 Sources []VolumeProjection
 DefaultMode *int32
}

type VolumeProjection struct {
 Secret *SecretProjection
 DownwardAPI *DownwardAPIProjection
 ConfigMap *ConfigMapProjection
 ServiceAccountToken *ServiceAccountTokenProjection
}

// ServiceAccountTokenProjection represents a projected service account token
// volume. This projection can be used to insert a service account token into
// the pod's runtime filesystem for use against APIs (Kubernetes API Server or
// otherwise).
type ServiceAccountTokenProjection struct {
 // Audience is the intended audience of the token. A recipient of a token
 // must identify itself with an identifier specified in the audience of the
 // token, and otherwise should reject the token. The audience defaults to the
 // identifier of the apiserver.
 Audience string
 // ExpirationSeconds is the requested duration of validity of the service
 // account token. As the token approaches expiration, the kubelet volume
 // plugin will proactively rotate the service account token. The kubelet will
 // start trying to rotate the token if the token is older than 80 percent of
 // its time to live or if the token is older than 24 hours. Defaults to 1 hour
 // and must be at least 10 minutes.
 ExpirationSeconds int64
 // Path is the relative path of the file to project the token into.
 Path string
}

A volume plugin implemented in the kubelet will project a service account token sourced from the TokenRequest API into volumes created from ProjectedVolumeSources. As the token approaches expiration, the kubelet volume plugin will proactively rotate the service account token. The kubelet will start trying to rotate the token if the token is older than 80 percent of its time to live or if the token is older than 24 hours.

To replace the current service account token secrets, we also need to inject the cluster’s CA certificate bundle. We will deploy it as a configmap per-namespace and reference it using a ConfigMapProjection.

A projected volume source that is equivalent to the current service account secret:

- name: kube-api-access-xxxxx
  projected:
    defaultMode: 420 # 0644
    sources:
    - serviceAccountToken:
        expirationSeconds: 3600
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace

File Permission

The secret projections are currently written with world readable (0644, effectively 444) file permissions. Given that file permissions are one of the oldest and most hardened isolation mechanisms on unix, this is not ideal. We would like to opportunistically restrict permissions for projected service account tokens as long as we can show that they won’t break users if we are to migrate away from secrets to distribute service account credentials.

Proposed Heuristics

Case 1: The pod has an fsGroup set. We can set the file permission on the token file to 0600 and let the fsGroup mechanism work as designed. It will set the permissions to 0640, chown the token file to the fsGroup and start the containers with a supplemental group that grants them access to the token file. This works today.
Case 2: The pod’s containers declare the same runAsUser for all containers (ephemeral containers are excluded) in the pod. We chown the token file to the pod’s runAsUser to grant the containers access to the token. All containers must have UID either specified in container security context or inherited from pod security context. Preferred UIDs in container images are ignored.
Fallback: We set the file permissions to world readable (0644) to match the behavior of secrets.
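
The three cases above can be sketched as a single decision function. This is a minimal sketch, not the kubelet implementation: the type and function names are hypothetical, and the 0600 mode used for Case 2 is an assumption, since the text only states that the file is chowned to the shared runAsUser.

```go
package main

import (
	"fmt"
	"os"
)

// tokenFileAttrs describes how the projected token file would be written.
// The type is hypothetical and exists only for this sketch.
type tokenFileAttrs struct {
	Mode  os.FileMode
	Owner *int64 // nil means the default owner is kept
}

// sharedUID returns the UID common to all entries, if there is one.
// A nil entry means the container has no effective runAsUser.
func sharedUID(uids []*int64) (int64, bool) {
	if len(uids) == 0 {
		return 0, false
	}
	for _, u := range uids {
		if u == nil || *u != *uids[0] {
			return 0, false
		}
	}
	return *uids[0], true
}

// chooseTokenFileAttrs applies the heuristics in order. fsGroup is nil when
// the pod sets none. runAsUsers holds one effective UID per non-ephemeral
// container (from the container or pod security context).
func chooseTokenFileAttrs(fsGroup *int64, runAsUsers []*int64) tokenFileAttrs {
	// Case 1: write 0600 and let the fsGroup mechanism widen it to 0640
	// and chown the file to the group.
	if fsGroup != nil {
		return tokenFileAttrs{Mode: 0600}
	}
	// Case 2: all containers share one UID, so chown the file to it.
	// The 0600 mode here is an assumption of this sketch.
	if uid, ok := sharedUID(runAsUsers); ok {
		return tokenFileAttrs{Mode: 0600, Owner: &uid}
	}
	// Fallback: world readable, matching the behavior of secrets.
	return tokenFileAttrs{Mode: 0644}
}

func main() {
	uid := int64(1000)
	attrs := chooseTokenFileAttrs(nil, []*int64{&uid, &uid})
	fmt.Println(attrs.Mode, *attrs.Owner)
}
```

Image-declared UIDs are deliberately not consulted, which mirrors the statement that preferred UIDs in container images are ignored.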

This gives users that run as non-root greater isolation between users without breaking existing applications. We also may consider adding more cases in the future as long as we can ensure that they won’t break users.

Alternatives Considered

We can create a volume for each UserID and set the owner to be that UserID with mode 0400. If the user doesn’t specify runAsUser, fetching the UserID from the image requires a redesign of the kubelet’s handling of volume mounts and image pulling. This has significant implementation complexity because:
- We would have to reorder container creation to introspect images (that might declare USER or GROUP directives) to pass this information to the projected volume mounter.
- Further, images are mutable so these directives may change over the lifetime of the pod.
- Volumes are shared between all pods that mount them today. Mapping a single logical volume in a pod spec to distinct mount points is likely a significant architectural change.
We pick a random group and set fsGroup on all pods in the service account admission controller. It’s unclear how we would do this without conflicting with usage of groups and potentially compromising security.
We set token files to be world readable always. Problems with this are discussed above.

ServiceAccount Admission Controller Migration

Prerequisites

Before migrating to a version with BoundServiceAccountTokenVolume=true, cluster operators should:

Set feature gate TokenRequest=true. (defaults to true since 1.12)
- This feature requires the following flags to the API server:
  - --service-account-issuer
  - --service-account-signing-key-file
  - --service-account-key-file
  - --api-audiences (defaults to --service-account-issuer)
Set feature gate TokenRequestProjection=true. (defaults to true since 1.12)
Update all workloads to a newer version of the officially supported Kubernetes client libraries so that they reload tokens:
- Go: >= v0.15.7
- Python: >= v12.0.0
- Java: >= v9.0.0
- Javascript: >= v0.10.3
- Ruby: master branch
- Haskell: v0.3.0.0
For community-maintained client libraries, feel free to contribute to them if the reloading logic is missing.

Note: If it is difficult to find every workload that uses in-cluster config, cluster operators can pass the flag --service-account-extend-token-expiration=true to kube-apiserver to temporarily give tokens a longer expiration during the migration. Any usage of a legacy token will be recorded in both metrics and audit logs. After fixing all the potentially broken workloads, turn off the flag so that the original expiration settings are honored. Note that the --service-account-extend-token-expiration mitigation defaults to true; cluster administrators can set --service-account-extend-token-expiration=false to turn off the mitigation if desired.
- Metrics: serviceaccount_stale_tokens_total
- Audit: look for the authentication.k8s.io/stale-token annotation
See next section for the details of how to discover the workloads that will suffer from expired tokens.

If anything goes wrong, please file a bug and CC @kubernetes/sig-auth-bugs. More contact information here .

Safe Rollout of Time-bound Token

Legacy service account tokens distributed via secrets are not time-bound. Many client libraries have come to depend on this behavior. Once time-bound service account tokens are in use, if in-cluster clients do not periodically reload the token from the projected volume, their requests will be rejected once the initial token expires.

In order to allow gradual adoption of time-bound tokens, we would:

Pick a constant period D between one and two hours. The value of D would be static across Kubernetes deployments, while avoiding collision with common durations.
Modify service account admission control to inject a token valid for D when the BoundServiceAccountTokenVolume feature is enabled.
Modify the kube-apiserver TokenRequest API. When it receives a TokenRequest with requested valid period D, extend the token lifetime to one year. At the same time, save the original requested D to the kubernetes.io/warnafter field in the minted token.
In the TokenRequest status, tell clients that the token would be valid only for D, encouraging clients to reload the token as if it were valid for D.

This modification could be optionally enabled by providing a command line flag to kube apiserver.

These extended tokens would continue to be accepted for one year. At the same time, the authentication side could monitor whether clients are properly reloading tokens by:

Comparing the kubernetes.io/warnafter field with the current time. If the current time is after the kubernetes.io/warnafter field, the calling client is not reloading its token regularly.
Exposing metrics to monitor the number of legacy and stale tokens used.
Adding an annotation to audit events for legacy and stale tokens, including the information necessary to locate the problematic client.

Test Plan

TokenRequest/TokenRequestProjection

Unit tests
E2E tests
- Projected jwt tokens are correctly mounted. (conformance test)
- The owner and mode of projected tokens are correctly set
- In-cluster clients work with Token rotation

RootCAConfigMap

Unit tests
E2E tests
- Every namespace has configmap kube-root-ca.crt

BoundServiceAccountTokenVolume

Unit tests
An upgrade test

Create pod A with the feature disabled; pod A works and a secret volume is mounted
Enable the feature; pod A continues working
Create pod B; it works and projected volumes are mounted

Graduation Criteria

TokenRequest/TokenRequestProjection

Alpha	Beta	GA
1.10	1.12	1.20

Beta->GA

RootCAConfigMap

Alpha	Beta	GA
1.13	1.20	1.21

Beta->GA

In use by multiple distributions
Approved by PRR and scalability
Any known bugs fixed

BoundServiceAccountTokenVolume

Alpha	Beta	GA
1.13	1.21	1.22

Alpha->Beta

Any known bugs fixed
- PodSecurityPolicies that allow secrets but not projected volumes will prevent the use of token volumes.
  - Fixed in https://github.com/kubernetes/kubernetes/pull/92006
- In-cluster clients that don’t reload service account tokens will start failing an hour after deployment.
  - Mitigation added in https://github.com/kubernetes/kubernetes/issues/68164
- Pods running as non root may not access the service account token.
  - Fixed in https://github.com/kubernetes/kubernetes/pull/89193
- Dynamic clientbuilder does not invalidate token.
  - Fixed in https://github.com/kubernetes/kubernetes/pull/99324

Tests passing
- Upgrade test sig-auth-serviceaccount-admission-controller-migration
TokenRequest/TokenRequestProjection GA
RootCAConfigMap GA

Beta -> GA Graduation

Allow kube-apiserver to recognize multiple issuers to enable non disruptive issuer change.
- Fixed in https://github.com/kubernetes/kubernetes/pull/101155
New ServiceAccount admission controller works as intended in Beta for >= 1 minor release without significant issues.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
- Feature gate name: BoundServiceAccountTokenVolume
- Components depending on the feature gate: kube-apiserver and kube-controller-manager
- Will enabling / disabling the feature require downtime of the control plane? yes, need to restart kube-apiserver and kube-controller-manager.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? no.
Does enabling the feature change any default behavior? yes, pods' service account tokens will expire after 1 year by default and are not stored as Secrets any more.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? yes.
What happens if we reenable the feature if it was previously rolled back? the same as the first enablement.
Are there any tests for feature enablement/disablement?
- unit test: plugin/pkg/admission/serviceaccount/admission_test.go
- upgrade test: test/e2e/upgrades/serviceaccount_admission_controller_migration.go

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?
1. creation of CA configmap can fail due to permission / quota / admission errors.
2. newly issued tokens could fail to be recognized by skewed API servers not configured with the bound token signing key/issuer.
What specific metrics should inform a rollback?
1. creation of CA configmap,
  - root_ca_cert_publisher_rate_limiter_use
2. authentication errors in (n-1) API servers,
  - authentication_attempts
  - authentication_duration_seconds

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? for upgrade, we have set up e2e test running here: https://testgrid.k8s.io/sig-auth-gce#upgrade-tests&width=5

for downgrade, we have manually tested where a workload continues to authenticate successfully.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? no

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check TokenRequest in use:
- serviceaccount_valid_tokens_total: cumulative valid projected service account tokens used
- serviceaccount_stale_tokens_total: cumulative stale projected service account tokens used
- apiserver_request_total: with labels group="",version="v1",resource="serviceaccounts",subresource="token"
- apiserver_request_duration_seconds: with labels group="",version="v1",resource="serviceaccounts",subresource="token"
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: apiserver_request_total
  - Aggregation method: group="",version="v1",resource="serviceaccounts",subresource="token"
  - Components exposing the metric: kube-apiserver
What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
- per-day percentage of API calls finishing with 5XX errors <= 1%
Are there any missing metrics that would be useful to have to improve observability of this feature?
- add granularity to storage_operation_duration_seconds to distinguish projected volume sources (configmap, secret, token, etc.), or add new metrics so that we can know the usage of projected tokens.

Dependencies

Does this feature depend on any specific services running in the cluster? There are no new components required, but specific versions of kubelet and kube-controller-manager are required

TokenRequest depends on kubelets >= 1.12

BoundServiceAccountTokenVolume depends on kubelets >= 1.12 with TokenRequest enabled (default since 1.12) and kube-controller-manager >= 1.12 with RootCAConfigMap feature enabled (default since 1.20)

Scalability

Will enabling / using this feature result in any new API calls?
- API call type: TokenRequest
- estimated throughput: 1/pod every ~48 minutes.
- originating component: kubelet
- components listing and/or watching resources they didn’t before: N/A.
- API calls that may be triggered by changes of some Kubernetes resources: N/A.
- periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.): 1 call per pod every ~48 minutes.
Will enabling / using this feature result in introducing new API types? no.
Will enabling / using this feature result in any new calls to the cloud provider? no.
Will enabling / using this feature result in increasing size or count of the existing API objects? controller creates one additional configmap per namespace.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs ? no.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components? it adds a token minting operation in the API server every ~48 minutes for every pod.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
- TokenRequest API is unavailable
- configmap containing API server CA bundle cannot be created or fetched

What are other known failure modes?
- failure to issue token via token subresource
  - Detection: check apiserver_request_total with labels group="",version="v1",resource="serviceaccounts",subresource="token"
  - Mitigations: disable the BoundServiceAccountTokenVolume feature gate in the kube-apiserver and recreate pods.
  - Diagnostics: “failed to generate token” in kube-apiserver log.
  - Testing: e2e test
- failure to create root CA config map
  - Detection: check root_ca_cert_publisher_sync_total from kube-controller-manager. (available in 1.21+)
  - Mitigations: disable the BoundServiceAccountTokenVolume feature gate in the kube-apiserver and recreate pods.
  - Diagnostics: “syncing [namespace]/[configmap name] failed” in kube-controller-manager log.
  - Testing: e2e test
- kubelet fails to renew token
  - Detection: check apiserver_request_total with labels group="",version="v1",resource="serviceaccounts",subresource="token" to see if failed in requesting a new token; check kubelet log.
  - Mitigations: disable the BoundServiceAccountTokenVolume feature gate in the kube-apiserver and recreate pods.
  - Diagnostics: “token [namespace]/[token name] expired and refresh failed” in kubelet log.
  - Testing: e2e test
- workload fails to refresh token from disk
  - Detection: serviceaccount_stale_tokens_total emitted by kube-apiserver
  - Mitigations: update client library to newer version.
  - Diagnostics: look for authentication.k8s.io/stale-token in audit log if --service-account-extend-token-expiration=true, or check authentication error in kube-apiserver log.
  - Testing: covered in all client libraries’ unittests.
What steps should be taken if SLOs are not being met to determine the problem? Check kube-apiserver, kube-controller-manager and kubelet logs.

Resources: Bounding Self-Labeling Kubelets

Bounding Self-Labeling Kubelets

Motivation
- Capturing Dedicated Workloads
Proposal
Implementation Timeline
Alternatives Considered

Motivation

Today the node client has total authority over its own Node labels. This ability is incredibly useful for the node auto-registration flow. The kubelet reports a set of well-known labels, as well as additional labels specified on the command line with --node-labels.

While this distributed method of registration is convenient and expedient, it has two problems that a centralized approach would not have. The minor problem is that it makes management difficult: instead of configuring labels in a centralized place, we must configure N kubelet command lines. More significantly, the approach greatly compromises security. Below are two straightforward escalations on an initially compromised node that exhibit the attack vector.

Capturing Dedicated Workloads

Suppose company foo needs to run an application that deals with PII on dedicated nodes to comply with government regulation. A common mechanism for implementing dedicated nodes in Kubernetes today is to set a label or taint (e.g. foo/dedicated=customer-info-app) on the node and to select these dedicated nodes in the workload controller running customer-info-app.

Since nodes self-report labels upon registration, an intruder can easily register a compromised node with the label foo/dedicated=customer-info-app. The scheduler will then bind customer-info-app to the compromised node, potentially giving the intruder easy access to the PII.

This attack also extends to secrets. Suppose company foo runs their outward facing nginx on dedicated nodes to reduce exposure to the company’s publicly trusted server certificates. They use the secret mechanism to distribute the serving certificate key. An intruder captures the dedicated nginx workload in the same way and can now use the node certificate to read the company’s serving certificate key.

Proposal

Modify the NodeRestriction admission plugin to prevent Kubelets from self-setting labels within the k8s.io and kubernetes.io namespaces except for these specifically allowed labels/prefixes:

kubernetes.io/hostname
kubernetes.io/instance-type
kubernetes.io/os
kubernetes.io/arch
beta.kubernetes.io/instance-type
beta.kubernetes.io/os
beta.kubernetes.io/arch
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
failure-domain.kubernetes.io/zone
failure-domain.kubernetes.io/region
[*.]kubelet.kubernetes.io/*
[*.]node.kubernetes.io/*

Reserve and document the node-restriction.kubernetes.io/* label prefix for cluster administrators that want to label their Node objects centrally for isolation purposes.

The node-restriction.kubernetes.io/* label prefix is reserved for cluster administrators to isolate nodes. These labels cannot be self-set by kubelets when the NodeRestriction admission plugin is enabled.

This accomplishes the following goals:

continues allowing people to use arbitrary labels under their own namespaces any way they wish
supports legacy labels kubelets are already adding
provides a place under the kubernetes.io label namespace for node isolation labeling
provides a place under the kubernetes.io label namespace for kubelets to self-label with kubelet and node-specific labels
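
The label rules in this proposal can be sketched as a single check on the label key. This is an illustration only, not the NodeRestriction plugin code: the function names are hypothetical, and the sketch ignores the create-versus-update distinction that the implementation timeline introduces.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedLabels are the specific kubernetes.io labels kubelets may self-set.
var allowedLabels = map[string]bool{
	"kubernetes.io/hostname":                   true,
	"kubernetes.io/instance-type":              true,
	"kubernetes.io/os":                         true,
	"kubernetes.io/arch":                       true,
	"beta.kubernetes.io/instance-type":         true,
	"beta.kubernetes.io/os":                    true,
	"beta.kubernetes.io/arch":                  true,
	"failure-domain.beta.kubernetes.io/zone":   true,
	"failure-domain.beta.kubernetes.io/region": true,
	"failure-domain.kubernetes.io/zone":        true,
	"failure-domain.kubernetes.io/region":      true,
}

// allowedDomains are the prefixes written as [*.]domain/* in the proposal.
var allowedDomains = []string{"kubelet.kubernetes.io", "node.kubernetes.io"}

// inDomain reports whether domain equals parent or is a subdomain of it.
func inDomain(domain, parent string) bool {
	return domain == parent || strings.HasSuffix(domain, "."+parent)
}

// kubeletMaySet reports whether a kubelet may self-set the given label key.
func kubeletMaySet(label string) bool {
	domain := ""
	if i := strings.Index(label, "/"); i >= 0 {
		domain = label[:i]
	}
	// Labels outside the kubernetes.io and k8s.io namespaces are unrestricted.
	if !inDomain(domain, "kubernetes.io") && !inDomain(domain, "k8s.io") {
		return true
	}
	if allowedLabels[label] {
		return true
	}
	for _, d := range allowedDomains {
		if inDomain(domain, d) {
			return true
		}
	}
	// Everything else, including node-restriction.kubernetes.io/*, is denied.
	return false
}

func main() {
	for _, l := range []string{
		"kubernetes.io/hostname",
		"example.com/team",
		"node-restriction.kubernetes.io/dedicated",
	} {
		fmt.Printf("%s: %v\n", l, kubeletMaySet(l))
	}
}
```

Matching on the full domain rather than on a string prefix keeps node-restriction.kubernetes.io from being mistaken for a subdomain of node.kubernetes.io.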

Implementation Timeline

v1.13:

Kubelet deprecates setting kubernetes.io or k8s.io labels via --node-labels, other than the specifically allowed labels/prefixes described above, and warns when invoked with kubernetes.io or k8s.io labels outside that set.
NodeRestriction admission prevents kubelets from adding/removing/modifying [*.]node-restriction.kubernetes.io/* labels on Node create and update
NodeRestriction admission prevents kubelets from adding/removing/modifying kubernetes.io or k8s.io labels other than the specifically allowed labels/prefixes described above on Node update only

v1.14:

Begin migration/removal of in-tree --node-labels use outside of the allowed set by addons:
- beta.kubernetes.io/fluentd-ds-ready
  - addon: remove from the nodeSelector
  - kube-up: remove from the default --node-labels flag
- beta.kubernetes.io/metadata-proxy-ready
  - addon: announce the nodeSelector will switch to cloud.google.com/metadata-proxy-ready in 1.15
  - kube-up: add cloud.google.com/metadata-proxy-ready=true along with the existing label to --node-labels
  - kube-up: add cloud.google.com/metadata-proxy-ready=true to existing nodes with the beta.kubernetes.io/metadata-proxy-ready=true label
- beta.kubernetes.io/kube-proxy-ds-ready
  - addon: announce the nodeSelector will switch to node.kubernetes.io/kube-proxy-ds-ready in 1.15
  - kube-up: add node.kubernetes.io/kube-proxy-ds-ready=true along with the existing label to --node-labels
  - kube-up: add node.kubernetes.io/kube-proxy-ds-ready=true to existing nodes with the beta.kubernetes.io/kube-proxy-ds-ready=true label
- beta.kubernetes.io/masq-agent-ds-ready
  - addon: announce the nodeSelector will switch to node.kubernetes.io/masq-agent-ds-ready in 1.16
  - kube-up: add node.kubernetes.io/masq-agent-ds-ready=true to existing nodes with the beta.kubernetes.io/masq-agent-ds-ready=true label

v1.16:

Complete migration/removal of in-tree --node-labels use outside of the allowed set by addons:
- beta.kubernetes.io/metadata-proxy-ready
  - addon: change the nodeSelector to cloud.google.com/metadata-proxy-ready
  - kube-up: stop setting beta.kubernetes.io/metadata-proxy-ready
- beta.kubernetes.io/kube-proxy-ds-ready
  - addon: change the nodeSelector to node.kubernetes.io/kube-proxy-ds-ready
  - kube-up: stop setting beta.kubernetes.io/kube-proxy-ds-ready
- beta.kubernetes.io/masq-agent-ds-ready
  - addon: change the nodeSelector to node.kubernetes.io/masq-agent-ds-ready
Kubelet removes the ability to set kubernetes.io or k8s.io labels via --node-labels other than the specifically allowed labels/prefixes described above (deprecation period of 6 months for CLI elements of admin-facing components is complete)

v1.19:

NodeRestriction admission prevents kubelets from adding/removing/modifying kubernetes.io or k8s.io labels other than the specifically allowed labels/prefixes described above on Node update and create (oldest supported kubelet running against a v1.19 apiserver is v1.17)

Alternatives Considered

File or flag-based configuration of the apiserver to allow specifying allowed labels

A fixed set of labels and label prefixes is simpler to reason about, and makes every cluster behave consistently
File-based config isn’t easily inspectable to be able to verify enforced labels
File-based config isn’t easily kept in sync in HA apiserver setups

API-based configuration of the apiserver to allow specifying allowed labels

A fixed set of labels and label prefixes is simpler to reason about, and makes every cluster behave consistently
An API object that controls the allowed labels is a potential escalation path for a compromised node

Allow kubelets to add any labels they wish, and add NoSchedule taints if disallowed labels are added

To be robust, this approach would also likely involve a controller to automatically inspect labels and remove the NoSchedule taint. This seemed overly complex. Additionally, it was difficult to come up with a tainting scheme that preserved information about which labels were the cause.

Forbid all labels regardless of namespace except for a specifically allowed set

This was much more disruptive to existing usage of --node-labels.
This was much more difficult to integrate with other systems allowing arbitrary topology labels like CSI.
This placed restrictions on how labels outside the kubernetes.io and k8s.io label namespaces could be used, which didn’t seem proper.

Resources: Breaking apart the Kubernetes test tarball

Breaking apart the kubernetes test tarball

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
References
Implementation History

Release Signoff Checklist

k/enhancements issue in release milestone and linked to KEP
KEP approvers have set the KEP status to implementable
Design details are appropriately documented
Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
Graduation criteria is in place
“Implementation History” section is up-to-date for milestone
Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The Kubernetes release artifacts include a “mondo” test tarball which includes both “portable” test sources (such as shell scripts and image manifests) as well as platform-specific test binaries for all supported client, node, and server platforms.

This KEP proposes replacing the monolith test tarball with platform-specific tarballs, matching the existing pattern used for the client, node, and server tarballs.

Motivation

As the number of supported client, server, and node platforms has increased, the size of the test tarball has increased as well. As of Kubernetes v1.13.2, the official kubernetes-test.tar.gz is approximately 1.2GB; previous releases ranged from 1.3GB to 1.5GB. Several years ago, there were complaints that the “full” kubernetes.tar.gz release tarball was too big at 1.4GB. Many of the motivations for breaking up that tarball carry over to this proposal.

The Bazel effort is another driving motivation. It’s possible to build all release artifacts solely using Bazel, and there is progress being made on supporting cross-compilation of binary artifacts, but combining multiple target platforms in one Bazel call is currently not well-supported. Separating this tarball would make it easier to ensure that we can use Bazel to produce artifacts identical to those of the non-Bazel build.

Goals

The Kubernetes test artifacts are broken apart such that users only need to download binaries for the platforms they’re using.
Be largely invisible to most developers: everything should just keep working.

Non-Goals

Changing the underlying build system. Both the make/shell-based build system and the Bazel-based build system will be affected, but users can continue to use their existing workflow.
Removing cruft from the test tarballs. Likely, there are binaries and portable sources no longer being used anywhere, but we won’t prune them with this effort.
Changing what is released independent of the test tarball; i.e. whether we should make test binaries able to be downloaded directly from GCS.

Proposal

Instead of building and distributing a single kubernetes-test.tar.gz with all portable sources and compiled binaries for all platforms, produce several platform-specific tarballs, one for each platform defined in KUBE_TEST_PLATFORMS:

kubernetes-test-linux-amd64.tar.gz
kubernetes-test-linux-arm.tar.gz
kubernetes-test-linux-arm64.tar.gz
kubernetes-test-linux-s390x.tar.gz
kubernetes-test-linux-ppc64le.tar.gz
kubernetes-test-darwin-amd64.tar.gz
kubernetes-test-windows-amd64.tar.gz
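
The naming pattern in the list above can be sketched as a small helper. This is an illustration under an assumption: it takes each platform as an "os/arch" string, and the function name testTarballName is hypothetical and does not come from the Kubernetes build scripts.

```go
package main

import (
	"fmt"
	"strings"
)

// testTarballName maps a platform written as "os/arch" (an assumed input
// format) to the name of its platform-specific test tarball.
func testTarballName(platform string) (string, error) {
	parts := strings.Split(platform, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", fmt.Errorf("platform %q is not of the form os/arch", platform)
	}
	return fmt.Sprintf("kubernetes-test-%s-%s.tar.gz", parts[0], parts[1]), nil
}

func main() {
	platforms := strings.Fields("linux/amd64 linux/arm64 darwin/amd64 windows/amd64")
	for _, p := range platforms {
		name, err := testTarballName(p)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(name)
	}
}
```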

Internal structure of the test tarball

At present, the Kubernetes test tarball has several components, all rooted under a kubernetes/ top-level directory.

Binary artifacts

The test binary artifacts are currently organized into directories divided by platform:

platforms/
- darwin/, linux/, windows/
  - amd64/, arm/, arm64/, ppc64le/, s390x/

For comparison, the existing platform-specific tarballs (such as kubernetes-client-linux-amd64.tar.gz) place all binaries under a constant path with no platform information: kubernetes/client/bin/kubectl.

Scripts (such as cluster/get-kube-binaries.sh) extract these tarballs back into platform-specific directories to support downloading multiple platforms into a single workspace.

The test tarball should follow the lead of the other platform-specific tarballs and place the binaries under test/bin. We can then reuse the existing functionality already implemented for the other tarballs.

Portable sources

Portable sources are basically copied directly from the source tree:

test/e2e/testing-manifests/
test/images/
test/kubemark/
hack/ (partially)

We have two options for these:

Continue to distribute as a separate tarball, either kubernetes-test.tar.gz, or possibly something like kubernetes-test-portable.tar.gz.

This makes the distinction very clear vs. the binary artifacts
There’s already some precedent, such as the kubernetes-manifest.tar.gz tarball
It slightly complicates downloading of test dependencies

Duplicate these sources into each binary-specific tarball.

Simplifies test dependency distribution - may only need to download one tarball if client and server are same platform
Portable sources are small (as of v1.13.2, approximately 2.7MB uncompressed or about 186KB compressed) so duplication isn’t a huge worry
Complicates extraction of tarballs with existing scripts, since they assume everything is platform-specific

We propose the first option as slightly preferable given the tradeoffs. Since we intend to continue distributing the mondo test tarball over a deprecation period, we’ll use the name kubernetes-test-portable.tar.gz for the portable sources.

Updating dependencies on `kubernetes-test.tar.gz`

Currently the CI workflows and kubetest use the cluster/get-kube.sh and cluster/get-kube-binaries.sh scripts to download all artifacts, and conveniently get-kube-binaries.sh is versioned with the release artifacts in kubernetes.tar.gz, so simply making get-kube-binaries.sh aware of the new tarballs should be sufficient for most CI and developer needs.

Because the test tarball includes binaries used both on the host running tests (such as the e2e binary), as well as binaries which may run the nodes (e2e.node), we would need to make sure to download binary test artifacts targeting the host platform, node platform, and possibly server platform.

A quick search reveals a few other uses of kubernetes-test.tar.gz, mostly in cluster/. We can update these to use the platform-specific tarballs, possibly with a fallback to the mondo-tarball if worried about versioning.

Dependencies outside the Kubernetes organization

Searching GitHub for references to kubernetes-test.tar.gz largely returns forks of the main kubernetes repository (including some very old forks, identifiable by the script e2e-from-release.sh). Since these forks are not likely to depend on upstream release artifacts, we can ignore them.

The Samsung SDS CNCT kraken-lib repository has a reference to kubernetes-test.tar.gz in its conformance test script, but this repo is also marked deprecated and read-only, and there have been no changes since July 2018.

In vmware/simple-k8s-test-env, the sk8.sh file uses kubernetes-test.tar.gz, and this repo seems actively maintained, so we should make sure this continues working.

The reference to kubernetes-test.tar.gz in knative/test-infra is hilarious.

There may be other uses that are not easily identifiable, so we will follow a deprecation process of the mondo test tarball as described in the next section.

Risks and Mitigations

It’s hard to tell who uses these test tarballs outside the core project or without tools like kubetest. We’ll need to broadcast this change widely so that any downstream users are aware of the incompatible changes.

As this is an inherently breaking change, we must decide when to cause the break. Assuming this effort is targeted for the 1.14 release:

1. We can continue to produce a mondo-tarball for 3 releases, along with new split tarballs; i.e., both 1.14 and 1.15 would contain both split and mondo test tarballs, while 1.16 would only use a split test tarball. This way one could continue to use the mondo tarball through the 1.15 release cycle, and then switch to using split test tarballs for 1.16, as all supported releases would then be producing split test tarballs.
2. We can make a clean break for 1.14, not producing any mondo test tarballs. Downstream users would need to account for the break immediately, and would also need to special-case for older releases that still use the mondo test tarball.
3. A somewhat hybrid approach, mixing 1 and 2 backwards in time:
  a. Produce both mondo test tarballs and split test tarballs on master for a few weeks.
  b. Backport split tarballs to older releases still supported (1.11 through 1.13), but continue to produce mondo test tarballs. We would never remove the mondo test tarballs from these releases, instead continuing to produce both.
  c. Update all test infrastructure to use split test tarballs.
  d. Remove the mondo test tarball from 1.14 before the first beta release.

Given the Kubernetes deprecation policy, we should go for option 1 and continue to distribute mondo and split test tarballs for the 1.14 release, and possibly for several releases thereafter. (It’s not entirely clear exactly which deprecation policy applies to this change, however.)

We’ll mark the mondo test tarball as deprecated in the 1.14 release, both through announcements in the release notes, as well as a DEPRECATION notice in the mondo test tarball.

Test Plan

We’ll start by building both the mondo test tarball and split test tarballs in CI, followed by updating test infrastructure to use the new split tarballs. We will monitor TestGrid jobs to ensure that nothing is noticeably broken by the change, and our primary sources will be those on the sig-release-master-blocking, sig-release-master-informing, and sig-release-master-upgrade dashboards.

We’ll also reach out to community members testing on non-amd64 architectures, since they’re most likely to be impacted by this change.

We’ll work with any downstream consumers we can find to update them to use the split tarballs ahead of the 1.14.0 release, but will continue to support the mondo test tarball through at least 1.14’s complete lifecycle.

Graduation Criteria

To consider this effort complete, we should no longer be distributing a mondo-tarball of test artifacts, and all TestGrid dashboards should show a similar level of greenness.

Ideally we’d make a clean break, removing the mondo-tarball at the same time as we create the platform-specific test tarballs. To ensure a smoother rollout, however, we will distribute both the split and mondo test tarballs for a while, and this effort will not be deemed complete until the mondo test tarball is gone.

References

Similar discussion and work on the other release tarballs:

Implementation History

2019-01-18: proposal on Slack and creation of the KEP
2019-01-28: KEP announced on sig-testing and sig-release mailing lists
2019-01-29: discussion at sig-testing weekly meeting
2019-02-14: implementation https://github.com/kubernetes/kubernetes/pull/74065 created, deprecation notice included in mondo test tarball
2019-02-22: implementation https://github.com/kubernetes/kubernetes/pull/74065 merged
2019-09-24: Stop building kubernetes-test.tar.gz: https://github.com/kubernetes/kubernetes/pull/83093
2021-08-16: Retroactive stable declaration

KEP-3866: Add an nftables-based kube-proxy backend

Summary

Motivation

The iptables kernel subsystem has unfixable performance problems

Upstream development has moved on from iptables to nftables

The ipvs mode of kube-proxy will not save us

The nf_tables mode of /sbin/iptables will not save us

The iptables mode of kube-proxy has grown crufty

We will hopefully be able to trade 2 supported backends for 1

Writing a new kube-proxy mode will help to focus our cleanup/refactoring efforts

Goals

Non-Goals

Proposal

Notes/Constraints/Caveats

Risks and Mitigations

Functionality

Compatibility

Security

Design Details

High level design

Low level design

Tables

Communicating with the kernel nftables subsystem

Notes on the sample rules in this KEP

Versioning and compatibility

NAT rules

General Service dispatch

Masquerading

Session affinity

Filter rules

Dropping or rejecting packets for services with no endpoints

Dropping traffic rejected by LoadBalancerSourceRanges

Forcing traffic on HealthCheckNodePorts to be accepted

Future improvements

Changes from the iptables kube-proxy backend

Localhost NodePorts

NodePort Addresses

Behavior of service IPs

Defining an API for integration with admin/debug/third-party rules

Rule monitoring

Switching between kube-proxy modes

Test Plan

Prerequisite testing updates

Unit tests

Integration tests

e2e tests

Scalability & Performance tests

Graduation Criteria

Alpha

Beta

GA

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

What happens if we reenable the feature if it was previously rolled back?

Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Continue to improve the `iptables` mode

Fix up the `ipvs` mode