<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kubernetes Contributors – Kubernetes Enhancement Proposals (KEPs)</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/</link><description>Recent content in Kubernetes Enhancement Proposals (KEPs) on Kubernetes Contributors</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/index.xml" rel="self" type="application/rss+xml"/><item><title>Resources: Add an nftables-based kube-proxy backend</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3866/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3866/</guid><description>
&lt;h1 id="kep-3866-add-an-nftables-based-kube-proxy-backend">KEP-3866: Add an nftables-based kube-proxy backend&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#the-iptables-kernel-subsystem-has-unfixable-performance-problems"
>The iptables kernel subsystem has unfixable performance problems&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#upstream-development-has-moved-on-from-iptables-to-nftables"
>Upstream development has moved on from iptables to nftables&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#the-ipvs-mode-of-kube-proxy-will-not-save-us"
>The &lt;code>ipvs&lt;/code> mode of kube-proxy will not save us&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#the-nf_tables-mode-of-sbiniptables-will-not-save-us"
>The &lt;code>nf_tables&lt;/code> mode of &lt;code>/sbin/iptables&lt;/code> will not save us&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#the-iptables-mode-of-kube-proxy-has-grown-crufty"
>The &lt;code>iptables&lt;/code> mode of kube-proxy has grown crufty&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#we-will-hopefully-be-able-to-trade-2-supported-backends-for-1"
>We will hopefully be able to trade 2 supported backends for 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#writing-a-new-kube-proxy-mode-will-help-to-focus-our-cleanuprefactoring-efforts"
>Writing a new kube-proxy mode will help to focus our cleanup/refactoring efforts&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#notesconstraintscaveats"
>Notes/Constraints/Caveats&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#functionality"
>Functionality&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#compatibility"
>Compatibility&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#security"
>Security&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#high-level-design"
>High level design&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#low-level-design"
>Low level design&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#tables"
>Tables&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#communicating-with-the-kernel-nftables-subsystem"
>Communicating with the kernel nftables subsystem&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#notes-on-the-sample-rules-in-this-kep"
>Notes on the sample rules in this KEP&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#versioning-and-compatibility"
>Versioning and compatibility&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#nat-rules"
>NAT rules&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#general-service-dispatch"
>General Service dispatch&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#masquerading"
>Masquerading&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#session-affinity"
>Session affinity&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#filter-rules"
>Filter rules&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#dropping-or-rejecting-packets-for-services-with-no-endpoints"
>Dropping or rejecting packets for services with no endpoints&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dropping-traffic-rejected-by-loadbalancersourceranges"
>Dropping traffic rejected by &lt;code>LoadBalancerSourceRanges&lt;/code>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#forcing-traffic-on-healthchecknodeports-to-be-accepted"
>Forcing traffic on &lt;code>HealthCheckNodePort&lt;/code>s to be accepted&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#future-improvements"
>Future improvements&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#changes-from-the-iptables-kube-proxy-backend"
>Changes from the iptables kube-proxy backend&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#localhost-nodeports"
>Localhost NodePorts&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#nodeport-addresses"
>NodePort Addresses&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#behavior-of-service-ips"
>Behavior of service IPs&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#defining-an-api-for-integration-with-admindebugthird-party-rules"
>Defining an API for integration with admin/debug/third-party rules&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rule-monitoring"
>Rule monitoring&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#switching-between-kube-proxy-modes"
>Switching between kube-proxy modes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability--performance-tests"
>Scalability &amp;amp; Performance tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#continue-to-improve-the-iptables-mode"
>Continue to improve the &lt;code>iptables&lt;/code> mode&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#fix-up-the-ipvs-mode"
>Fix up the &lt;code>ipvs&lt;/code> mode&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#use-an-existing-nftables-based-kube-proxy-implementation"
>Use an existing nftables-based kube-proxy implementation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#create-an-ebpf-based-proxy-implementation"
>Create an eBPF-based proxy implementation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The default kube-proxy implementation on Linux is currently based on
iptables. The iptables subsystem was the preferred packet filtering and
processing system in the Linux kernel for many years (starting with the 2.4
kernel in 2001). However, problems with iptables led to the
development of a successor, nftables, first made available in the 3.13
kernel in 2014, and growing increasingly featureful and usable as a
replacement for iptables since then. Development on iptables has
mostly stopped, with new features and performance improvements
primarily going into nftables instead.&lt;/p>
&lt;p>This KEP proposes the creation of a new official/supported nftables
backend for kube-proxy. While it is hoped that this backend will
eventually replace both the &lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code> backends and become
the default kube-proxy mode on Linux, that replacement/deprecation
would be handled in a separate future KEP.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>There are currently two officially supported kube-proxy backends for
Linux: &lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code>. (The original &lt;code>userspace&lt;/code> backend was
deprecated several releases ago and removed from the tree in 1.26.)&lt;/p>
&lt;p>The &lt;code>iptables&lt;/code> mode of kube-proxy is currently the default, and it is
generally considered &amp;ldquo;good enough&amp;rdquo; for most use cases. Nonetheless,
there are good arguments for replacing it with a new &lt;code>nftables&lt;/code> mode.&lt;/p>
&lt;h3 id="the-iptables-kernel-subsystem-has-unfixable-performance-problems">The iptables kernel subsystem has unfixable performance problems&lt;/h3>
&lt;p>Although much work has been done to improve the performance of the
kube-proxy &lt;code>iptables&lt;/code> backend, there are fundamental
performance-related problems with the implementation of iptables in
the kernel, both on the &amp;ldquo;control plane&amp;rdquo; side and on the &amp;ldquo;data plane&amp;rdquo;
side:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The control plane is problematic because the iptables API does not
support making incremental changes to the ruleset. If you want to
add a single iptables rule, the iptables binary must acquire a lock,
download the entire ruleset from the kernel, find the appropriate
place in the ruleset to add the new rule, add it, re-upload the
entire ruleset to the kernel, and release the lock. This becomes
slower and slower as the ruleset increases in size (i.e., as the
number of Kubernetes Services grows). If you want to replace a large
number of rules (as kube-proxy does frequently), then merely the
time that &lt;code>/sbin/iptables-restore&lt;/code> takes to parse all of the
rules becomes substantial.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The data plane is problematic because, for the most part, the
number of iptables rules used to implement a set of Kubernetes
Services is directly proportional to the number of Services, and
every packet going through the system then needs to pass through
all of these rules, slowing down the traffic.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The iptables subsystem is the bottleneck in kube-proxy performance, and it always
will be until we stop using it.&lt;/p>
&lt;h3 id="upstream-development-has-moved-on-from-iptables-to-nftables">Upstream development has moved on from iptables to nftables&lt;/h3>
&lt;p>In large part due to its unfixable problems, development on iptables
in the kernel has slowed down and mostly stopped. New features are not
being added to iptables, because nftables is supposed to do everything
iptables does, but better.&lt;/p>
&lt;p>Although there is no plan to remove iptables from the upstream kernel,
that does not guarantee that iptables will remain supported by
&lt;em>distributions&lt;/em> forever. In particular, Red Hat has declared that
&lt;a href="https://access.redhat.com/solutions/6739041"
target="_blank" rel="noopener">iptables is deprecated in RHEL 9&lt;/a>
and is likely to be removed
entirely in RHEL 10, a few years from now. Other distributions have
made smaller steps in the same direction; for instance, &lt;a href="https://salsa.debian.org/pkg-netfilter-team/pkg-iptables/-/commit/c59797aab9"
target="_blank" rel="noopener">Debian
removed &lt;code>iptables&lt;/code> from the set of &amp;ldquo;required&amp;rdquo; packages&lt;/a>
in Debian 11
(Bullseye).&lt;/p>
&lt;p>The RHEL deprecation in particular impacts Kubernetes in two ways:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Many Kubernetes users run RHEL or one of its downstreams, so in a
few years when RHEL 10 is released, they will be unable to use
kube-proxy in &lt;code>iptables&lt;/code> mode (or, for that matter, in &lt;code>ipvs&lt;/code> or
&lt;code>userspace&lt;/code> mode, since those modes also make heavy use of the
iptables API).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Several upstream iptables bugs and performance problems that
affect Kubernetes have been fixed by Red Hat developers over the
past several years. With Red Hat no longer making any effort to
maintain iptables, it is less likely that upstream iptables bugs
that affect Kubernetes in the future would be fixed promptly, if
at all.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="the-ipvs-mode-of-kube-proxy-will-not-save-us">The &lt;code>ipvs&lt;/code> mode of kube-proxy will not save us&lt;/h3>
&lt;p>Because of the problems with iptables, some developers added an &lt;code>ipvs&lt;/code>
mode to kube-proxy in 2017. It was generally hoped that this could
eventually solve all of the problems with the &lt;code>iptables&lt;/code> mode and
become its replacement, but this never really happened. It&amp;rsquo;s not
entirely clear why&amp;hellip; &lt;a href="https://github.com/kubernetes/kubeadm/issues/817"
target="_blank" rel="noopener">kubeadm #817&lt;/a>, &amp;ldquo;Track when we can enable the
ipvs mode for the kube-proxy by default&amp;rdquo;, is perhaps a good snapshot of
the initial excitement followed by growing disillusionment with the
&lt;code>ipvs&lt;/code> mode:&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;a few issues &amp;hellip; re: the version of iptables/ipset shipped in the
kube-proxy container image&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;clearly not ready for defaulting&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;complications &amp;hellip; with IPVS kernel modules missing or disabled on
user nodes&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;we are still lacking tests&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;still does not completely align with what [we] support in
iptables mode&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;iptables works and people are familiar with it&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;&lt;a href="https://en.wikipedia.org/wiki/The_Fox_and_the_Grapes"
target="_blank" rel="noopener">not sure that it was ever intended for IPVS to be the default&lt;/a>&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;p>Additionally, the kernel IPVS APIs alone do not provide enough
functionality to fully implement Kubernetes services, and so the
&lt;code>ipvs&lt;/code> backend also makes heavy use of the iptables API. Thus, if we
are worried about iptables deprecation, then in order to switch to
using &lt;code>ipvs&lt;/code> as the default mode, we would have to port the iptables
parts of it to use nftables anyway. But at that point, there would be
little excuse for using IPVS for the core load-balancing part,
particularly given that IPVS, like iptables, is no longer an
actively-developed technology.&lt;/p>
&lt;h3 id="the-nf_tables-mode-of-sbiniptables-will-not-save-us">The &lt;code>nf_tables&lt;/code> mode of &lt;code>/sbin/iptables&lt;/code> will not save us&lt;/h3>
&lt;p>In 2018, with the 1.8.0 release of the iptables client binaries, a new
mode was added to the binaries, to allow them to use the nftables API
in the kernel rather than the legacy iptables API, while still
preserving the &amp;ldquo;API&amp;rdquo; of the original iptables binaries. As of 2022,
most Linux distributions now use this mode, so the legacy iptables
kernel API is mostly dead.&lt;/p>
&lt;p>However, this new mode does not add any new &lt;em>syntax&lt;/em>, and so it is not
possible to use any of the new nftables features (like maps) that are
not present in iptables.&lt;/p>
&lt;p>Furthermore, the compatibility constraints imposed by the user-facing
API of the iptables binaries themselves prevent them from being able
to take advantage of many of the performance improvements associated
with nftables.&lt;/p>
&lt;p>(Additionally, the RHEL deprecation of iptables includes
&lt;code>iptables-nft&lt;/code> as well.)&lt;/p>
&lt;h3 id="the-iptables-mode-of-kube-proxy-has-grown-crufty">The &lt;code>iptables&lt;/code> mode of kube-proxy has grown crufty&lt;/h3>
&lt;p>Because &lt;code>iptables&lt;/code> is the default kube-proxy mode, it is subject to
strong backward-compatibility constraints which mean that certain
&amp;ldquo;features&amp;rdquo; that are now considered to be bad ideas cannot be removed
because they might break some existing users. A few examples:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>It allows NodePort services to be accessed on &lt;code>localhost&lt;/code>, which
requires it to set a sysctl to a value that may introduce security
holes on the system. More generally, it defaults to having
NodePort services be accessible on &lt;em>all&lt;/em> node IPs, when most users
would probably prefer them to be more restricted.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It implements the &lt;code>LoadBalancerSourceRanges&lt;/code> feature for traffic
addressed directly to LoadBalancer IPs, but not for traffic
redirected to a NodePort by an external LoadBalancer.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some new functionality only works correctly if the administrator
passes certain command-line options to kube-proxy (e.g.,
&lt;code>--cluster-cidr&lt;/code>), but we cannot make those options mandatory,
since that would break old clusters that aren&amp;rsquo;t passing them.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>A new kube-proxy mode, which existing users would have to explicitly opt
into, could revisit these and other decisions. (Though if we expect it
to eventually become the default, then we might decide to avoid such
changes anyway.)&lt;/p>
&lt;h3 id="we-will-hopefully-be-able-to-trade-2-supported-backends-for-1">We will hopefully be able to trade 2 supported backends for 1&lt;/h3>
&lt;p>Right now SIG Network is supporting both the &lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code>
backends of kube-proxy, and does not feel like it can ditch &lt;code>ipvs&lt;/code>
because of perceived performance issues with &lt;code>iptables&lt;/code>. If we create a new
backend which is as functional and non-buggy as &lt;code>iptables&lt;/code> but as
performant as &lt;code>ipvs&lt;/code>, then we could (eventually) deprecate both of the
existing backends and only have one Linux backend to support in the future.&lt;/p>
&lt;h3 id="writing-a-new-kube-proxy-mode-will-help-to-focus-our-cleanuprefactoring-efforts">Writing a new kube-proxy mode will help to focus our cleanup/refactoring efforts&lt;/h3>
&lt;p>There is a desire to provide a &amp;ldquo;kube-proxy library&amp;rdquo; that third parties
could use as a base for external service proxy implementations
(&lt;a href="https://github.com/kubernetes/enhancements/issues/3786"
target="_blank" rel="noopener">KEP-3786&lt;/a>). The existing &amp;ldquo;core kube-proxy&amp;rdquo; code, while functional,
is not very well designed and is not something we would want to
support other people using in its current form.&lt;/p>
&lt;p>Writing a new proxy backend will force us to look over all of this
shared code again, and perhaps give us new ideas on how it can be
cleaned up, rationalized, and optimized.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Design and implement an &lt;code>nftables&lt;/code> mode for kube-proxy.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Consider various fixes to legacy &lt;code>iptables&lt;/code> mode behavior.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Do not enable the &lt;code>route_localnet&lt;/code> sysctl.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add a more restrictive startup mode to kube-proxy, which
will error out if the configuration is invalid (e.g.,
&amp;ldquo;&lt;code>--detect-local-mode ClusterCIDR&lt;/code>&amp;rdquo; without specifying
&amp;ldquo;&lt;code>--cluster-cidr&lt;/code>&amp;rdquo;) or incomplete (e.g.,
partially-dual-stack but not fully-dual-stack).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>(Possibly other changes discussed in this KEP.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Ensure that any such changes are clearly documented for
users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To the extent possible, provide metrics to allow &lt;code>iptables&lt;/code>
users to easily determine if they are using features that
would behave differently in &lt;code>nftables&lt;/code> mode.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Document specific details of the nftables implementation that we
want to consider as &amp;ldquo;API&amp;rdquo;. In particular, document the
high-level behavior that authors of network plugins can rely
on. We may also document ways that third parties or
administrators can integrate with kube-proxy&amp;rsquo;s rules at a lower
level.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Allow switching from the &lt;code>iptables&lt;/code> (or &lt;code>ipvs&lt;/code>) mode to
&lt;code>nftables&lt;/code>, or vice versa, without needing to manually clean up
rules in between.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Document the minimum kernel/distro requirements for the new backend.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Document incompatible changes between &lt;code>iptables&lt;/code> mode and &lt;code>nftables&lt;/code>
mode (e.g., localhost NodePorts, firewall handling, etc.).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Do performance testing comparing the &lt;code>iptables&lt;/code>,
&lt;code>ipvs&lt;/code>, and &lt;code>nftables&lt;/code> backends in small, medium, and large
clusters, comparing both the &amp;ldquo;control plane&amp;rdquo; aspects (time/CPU usage
spent reprogramming rules) and &amp;ldquo;data plane&amp;rdquo; aspects (latency and
throughput of packets to service IPs).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Help with the clean-up and refactoring of the kube-proxy &amp;ldquo;library&amp;rdquo;
code.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Although this KEP does not include anything post-GA (e.g., making
&lt;code>nftables&lt;/code> the default backend, or changing the status of the
&lt;code>iptables&lt;/code> and/or &lt;code>ipvs&lt;/code> backends), we should have at least the
start of a plan for the future by the time this KEP goes GA, to
ensure that we don&amp;rsquo;t just end up permanently maintaining 3 backends
instead of 2.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Falling into the same traps as the &lt;code>ipvs&lt;/code> backend, to the extent
that we can identify what those traps were.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Removing the iptables &lt;code>KUBE-IPTABLES-HINT&lt;/code> chain from kubelet; that
chain exists for the benefit of any component on the node that wants
to use iptables, and so should continue to exist even if no part of
the kubernetes core uses iptables itself. (And there is no need to
add anything similar for nftables, since there are no bits of host
filesystem configuration related to nftables that containerized
nftables users need to worry about.)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>And some Non-Goals relative to earlier discussions in this KEP:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Changing the session affinity behavior; the &lt;code>nftables&lt;/code> backend will
implement the same behavior as &lt;code>iptables&lt;/code> does (which is different
from &lt;code>ipvs&lt;/code> and some third-party proxy implementations). If we
decide to revisit session affinity in the future, it will be easy to
add or change the &lt;code>nftables&lt;/code> backend&amp;rsquo;s behavior, because it is
implemented &amp;ldquo;manually&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Implementing &lt;code>LoadBalancerSourceRanges&lt;/code> filtering for &lt;code>NodePort&lt;/code> (or
&lt;code>ExternalIPs&lt;/code>) traffic. The kube-proxy implementation of that
feature exists mostly for the pod-to-load-balancer short-circuit
case anyway. Users who want more consistent filtering
behavior can use the Gateway API.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Support for running multiple instances of the proxy (along with the
&lt;code>service.kubernetes.io/service-proxy-name&lt;/code> label). There is now a
proof-of-concept of this idea (&lt;a href="https://github.com/kubernetes/kubernetes/pull/122814"
target="_blank" rel="noopener">kubernetes #122814&lt;/a>), so we know
that the design supports it and it could be implemented in the
future.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Explicit support for &amp;ldquo;debug&amp;rdquo;/&amp;ldquo;admin-override&amp;rdquo; rules. The nftables
backend will retain the iptables backend&amp;rsquo;s behavior of &amp;ldquo;you can
change our rules but your changes will eventually get overwritten&amp;rdquo;.
We may still some day add support for explicit overrides, as discussed
&lt;a href="#defining-an-api-for-integration-with-admindebugthird-party-rules"
>below&lt;/a>,
but this will not be part of the initial release.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="notesconstraintscaveats">Notes/Constraints/Caveats&lt;/h3>
&lt;p>At least three nftables-based kube-proxy implementations already
exist, but none of them seems suitable either to adopt directly or to
use as a starting point:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/zevenet/kube-nftlb"
target="_blank" rel="noopener">kube-nftlb&lt;/a>: This is built on top of a separate nftables-based load
balancer project called &lt;a href="https://github.com/zevenet/nftlb"
target="_blank" rel="noopener">nftlb&lt;/a>, which means that rather than
translating Kubernetes Services directly into nftables rules, it
translates them into nftlb load balancer objects, which then get
translated into nftables rules. Besides making the code more
confusing for users who aren&amp;rsquo;t already familiar with nftlb, this
also means that in many cases, new Service features would need to
have features added to the nftlb core first before kube-nftlb could
consume them. (Also, it has not been updated since November 2020.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/sbezverk/nfproxy"
target="_blank" rel="noopener">nfproxy&lt;/a>: Its README notes that &amp;ldquo;nfproxy is not a 1:1 copy of
kube-proxy (iptables) in terms of features. nfproxy is not going to
cover all corner cases and special features addressed by
kube-proxy&amp;rdquo;. (Also, it has not been updated since January 2021.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft"
target="_blank" rel="noopener">kpng&amp;rsquo;s nft backend&lt;/a>: This was written as a proof of concept and is
mostly a straightforward translation of the iptables rules to
nftables, and doesn&amp;rsquo;t make good use of nftables features that would
let it reduce the total number of rules. It also makes heavy use of
kpng&amp;rsquo;s APIs, like &amp;ldquo;DiffStore&amp;rdquo;, which there is no consensus on
adopting upstream.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="functionality">Functionality&lt;/h4>
&lt;p>The primary risk of the proposal is feature or stability regressions,
which will be addressed by testing and by a slow, optional rollout
of the new proxy mode.&lt;/p>
&lt;p>The most important mitigation for this risk is ensuring that rollback
from &lt;code>nftables&lt;/code> mode back to &lt;code>iptables&lt;/code>/&lt;code>ipvs&lt;/code> mode works reliably.&lt;/p>
&lt;h4 id="compatibility">Compatibility&lt;/h4>
&lt;p>Many Kubernetes networking implementations use kube-proxy as their
service proxy implementation. Given that few low-level details of
kube-proxy&amp;rsquo;s behavior are explicitly specified, using it as part of a
larger networking implementation (and in particular, writing a
NetworkPolicy implementation that interoperates with it correctly)
necessarily requires making assumptions about (currently-)undocumented
aspects of its behavior (such as exactly when and how packets get
rewritten).&lt;/p>
&lt;p>While the &lt;code>nftables&lt;/code> mode is likely to look very similar to the
&lt;code>iptables&lt;/code> mode from the outside, some CNI plugins, NetworkPolicy
implementations, etc., may need updates in order to work with it. (This
may further limit the amount of testing the new mode can get during
the Alpha phase, if it is not yet compatible with popular network
plugins at that point.) There is not much we can do here, other than
avoiding &lt;em>gratuitous&lt;/em> behavioral differences.&lt;/p>
&lt;h4 id="security">Security&lt;/h4>
&lt;p>The &lt;code>nftables&lt;/code> mode should not pose any new security issues relative
to the &lt;code>iptables&lt;/code> mode.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="high-level-design">High level design&lt;/h3>
&lt;p>At a high level, the new mode should have the same architecture as the
existing modes; it will use the service/endpoint-tracking code in
&lt;code>k8s.io/kubernetes/pkg/proxy&lt;/code> to watch for changes, and update rules
in the kernel accordingly.&lt;/p>
&lt;h3 id="low-level-design">Low level design&lt;/h3>
&lt;p>Some details will be figured out as we implement it. We may start with
an implementation that is architecturally closer to the &lt;code>iptables&lt;/code>
mode, and then rewrite it to take advantage of additional nftables
features over time.&lt;/p>
&lt;h4 id="tables">Tables&lt;/h4>
&lt;p>Unlike iptables, nftables does not have any reserved/default tables or
chains (e.g., &lt;code>nat&lt;/code>, &lt;code>PREROUTING&lt;/code>). Instead, each nftables user is
expected to create and work with its own table(s), and to ignore the
tables created by other components (for example, when firewalld is
running in nftables mode, restarting it only flushes the rules in the
&lt;code>firewalld&lt;/code> table, unlike when it is running in iptables mode, where
restarting it causes it to flush &lt;em>all&lt;/em> rules).&lt;/p>
&lt;p>Within each table, &amp;ldquo;base chains&amp;rdquo; can be connected to &amp;ldquo;hooks&amp;rdquo; that give
them behavior similar to the built-in iptables chains. (For example, a
chain with the properties &lt;code>type nat&lt;/code> and &lt;code>hook prerouting&lt;/code> would work
like the &lt;code>PREROUTING&lt;/code> chain in the iptables &lt;code>nat&lt;/code> table.) The
&amp;ldquo;priority&amp;rdquo; of a base chain controls when it runs relative to other
chains connected to the same hook in the same or other tables.&lt;/p>
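&lt;p>As a minimal sketch of the paragraph above (the chain name
&lt;code>nat-prerouting&lt;/code> and the numeric priority are illustrative
assumptions, not names fixed by this KEP), a base chain that behaves
roughly like the &lt;code>PREROUTING&lt;/code> chain of the iptables &lt;code>nat&lt;/code>
table could be declared like this:&lt;/p>
&lt;pre tabindex="0">&lt;code>$ nft -f - &amp;lt;&amp;lt;EOF
# Hypothetical example: the chain name is for illustration only.
add table ip kube-proxy
# The type and hook properties attach this chain to the prerouting NAT hook;
# priority -100 is the conventional destination-NAT priority.
add chain ip kube-proxy nat-prerouting { type nat hook prerouting priority -100 ; }
EOF
&lt;/code>&lt;/pre>&lt;p>A chain declared without &lt;code>type&lt;/code>, &lt;code>hook&lt;/code>, and
&lt;code>priority&lt;/code> is a regular chain, which only runs when another chain
jumps to it.&lt;/p>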
&lt;p>An nftables table can only contain rules for a single &amp;ldquo;family&amp;rdquo; (&lt;code>ip&lt;/code>
(v4), &lt;code>ip6&lt;/code>, &lt;code>inet&lt;/code> (both IPv4 and IPv6), &lt;code>arp&lt;/code>, &lt;code>bridge&lt;/code>, or
&lt;code>netdev&lt;/code>). We will create a single &lt;code>kube-proxy&lt;/code> table in the &lt;code>ip&lt;/code>
family, and another in the &lt;code>ip6&lt;/code> family. All of our chains, sets,
maps, etc., will go into those tables.&lt;/p>
&lt;p>(In theory, instead of creating one table each in the &lt;code>ip&lt;/code> and &lt;code>ip6&lt;/code>
families, we could create a single table in the &lt;code>inet&lt;/code> family and put
both IPv4 and IPv6 chains/rules there. However, this wouldn&amp;rsquo;t really
result in much simplification, because we would still need separate
sets/maps to match IPv4 addresses and IPv6 addresses. (There is no
data type that can store/match either an IPv4 address or an IPv6
address.) Furthermore, because of how Kubernetes Services evolved in
parallel with the existing kube-proxy implementation, we have ended up
with dual-stack Service semantics that are most easily implemented by
handling IPv4 and IPv6 completely separately anyway.)&lt;/p>
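&lt;p>To illustrate the point about address types, here is a sketch only
(the set name &lt;code>example-ips&lt;/code> and the documentation-range
addresses are hypothetical): each family&amp;rsquo;s table gets its own set,
declared with that family&amp;rsquo;s address type.&lt;/p>
&lt;pre tabindex="0">&lt;code>$ nft -f - &amp;lt;&amp;lt;EOF
# Hypothetical example: one kube-proxy table per IP family.
add table ip kube-proxy
add table ip6 kube-proxy
# A set has a single element type, so IPv4 and IPv6 addresses
# need separate sets.
add set ip kube-proxy example-ips { type ipv4_addr ; }
add set ip6 kube-proxy example-ips { type ipv6_addr ; }
add element ip kube-proxy example-ips { 192.0.2.10 }
add element ip6 kube-proxy example-ips { 2001:db8::10 }
EOF
&lt;/code>&lt;/pre>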
&lt;h4 id="communicating-with-the-kernel-nftables-subsystem">Communicating with the kernel nftables subsystem&lt;/h4>
&lt;p>We will use the &lt;code>nft&lt;/code> command-line tool to read and write rules, much
like how we use command-line tools in the &lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code>
backends.&lt;/p>
&lt;p>However, the &lt;code>nft&lt;/code> tool is mostly just a thin wrapper around
&lt;code>libnftables&lt;/code>, so any Go API that wraps the &lt;code>nft&lt;/code> command-line tool
could easily be rewritten to use &lt;code>libnftables&lt;/code> directly (via a cgo
wrapper) in the future if that seemed like a better idea. (In theory
we could also use netlink directly, without needing cgo or external
libraries, but this would probably be a bad idea; &lt;code>libnftables&lt;/code>
implements quite a bit of functionality on top of the raw netlink
API.)&lt;/p>
&lt;p>The nftables command-line tool allows either a single command per
invocation (as with &lt;code>/sbin/iptables&lt;/code>):&lt;/p>
&lt;pre tabindex="0">&lt;code>$ nft add table ip kube-proxy &amp;#39;{ comment &amp;#34;Kubernetes service proxying rules&amp;#34;; }&amp;#39;
$ nft add chain ip kube-proxy services
$ nft add rule ip kube-proxy services ip daddr . ip protocol . th dport vmap @service_ips
&lt;/code>&lt;/pre>&lt;p>or multiple commands to be executed in a single atomic transaction (as
with &lt;code>/sbin/iptables-restore&lt;/code>, but more flexible):&lt;/p>
&lt;pre tabindex="0">&lt;code>$ nft -f - &amp;lt;&amp;lt;EOF
add table ip kube-proxy { comment &amp;#34;Kubernetes service proxying rules&amp;#34;; }
add chain ip kube-proxy services
add rule ip kube-proxy services ip daddr . ip protocol . th dport vmap @service_ips
EOF
&lt;/code>&lt;/pre>&lt;p>The syntax for the two modes is the same, other than the need to
escape shell metacharacters in the former case.&lt;/p>
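&lt;p>As a rough sketch of how a Go wrapper might assemble such a transaction before handing it to &lt;code>nft -f -&lt;/code>, consider the following. The &lt;code>Transaction&lt;/code> type and the &lt;code>exampleTransaction&lt;/code> function are hypothetical names used only for illustration; they are not taken from kube-proxy.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// Transaction accumulates nft commands that should be applied atomically.
// (Hypothetical helper type; not part of any existing kube-proxy package.)
type Transaction struct {
	lines []string
}

// Add appends one command, formatted like fmt.Sprintf.
func (tx *Transaction) Add(format string, args ...interface{}) {
	tx.lines = append(tx.lines, fmt.Sprintf(format, args...))
}

// String renders the transaction in the multi-command form read by
// "nft -f -": one command per line, with no shell quoting needed.
func (tx *Transaction) String() string {
	return strings.Join(tx.lines, "\n") + "\n"
}

// exampleTransaction reproduces the three commands shown above.
func exampleTransaction() string {
	tx := new(Transaction)
	tx.Add("add table %s %s { comment %q; }", "ip", "kube-proxy", "Kubernetes service proxying rules")
	tx.Add("add chain %s %s %s", "ip", "kube-proxy", "services")
	tx.Add("add rule %s %s %s ip daddr . ip protocol . th dport vmap @service_ips", "ip", "kube-proxy", "services")
	return tx.String()
}
```

&lt;p>The resulting text could then be supplied on the standard input of &lt;code>nft -f -&lt;/code> (for instance via &lt;code>os/exec&lt;/code>), so that all three commands are applied atomically or not at all.&lt;/p>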
&lt;p>When reading data from the kernel (&lt;code>nft list ...&lt;/code>), &lt;code>nft&lt;/code> outputs the
data in a nested &amp;ldquo;object&amp;rdquo; form:&lt;/p>
&lt;pre tabindex="0">&lt;code>$ nft list table ip kube-proxy
table ip kube-proxy {
  comment &amp;#34;Kubernetes service proxying rules&amp;#34;;
  chain services {
    ip daddr . ip protocol . th dport vmap @service_ips
  }
}
&lt;/code>&lt;/pre>&lt;p>(It is possible to pass data to &lt;code>nft -f&lt;/code> in this form as well, but
this wouldn&amp;rsquo;t be useful for us, since we would have to pass the entire
contents of &lt;code>table ip kube-proxy&lt;/code> rather than just adding, removing,
and updating the particular rules, sets, etc, that we wanted to
change.)&lt;/p>
&lt;p>&lt;code>nft&lt;/code> also has a JSON API, which would theoretically be a better
option for programmatic use than the &amp;ldquo;plain text&amp;rdquo; API. Unfortunately,
the representation of rules in this mode is vastly different from the
representation of rules in &amp;ldquo;plain text&amp;rdquo; mode:&lt;/p>
&lt;pre tabindex="0">&lt;code>$ nft --json list table ip kube-proxy | jq .
...
{
  &amp;#34;rule&amp;#34;: {
    &amp;#34;family&amp;#34;: &amp;#34;ip&amp;#34;,
    &amp;#34;table&amp;#34;: &amp;#34;kube-proxy&amp;#34;,
    &amp;#34;chain&amp;#34;: &amp;#34;services&amp;#34;,
    &amp;#34;handle&amp;#34;: 19,
    &amp;#34;expr&amp;#34;: [
      {
        &amp;#34;vmap&amp;#34;: {
          &amp;#34;key&amp;#34;: {
            &amp;#34;concat&amp;#34;: [
              {
                &amp;#34;payload&amp;#34;: {
                  &amp;#34;protocol&amp;#34;: &amp;#34;ip&amp;#34;,
                  &amp;#34;field&amp;#34;: &amp;#34;daddr&amp;#34;
                }
              },
              {
                &amp;#34;payload&amp;#34;: {
                  &amp;#34;protocol&amp;#34;: &amp;#34;ip&amp;#34;,
                  &amp;#34;field&amp;#34;: &amp;#34;protocol&amp;#34;
                }
              },
              {
                &amp;#34;payload&amp;#34;: {
                  &amp;#34;protocol&amp;#34;: &amp;#34;th&amp;#34;,
                  &amp;#34;field&amp;#34;: &amp;#34;dport&amp;#34;
                }
              }
            ]
          },
          &amp;#34;data&amp;#34;: &amp;#34;@service_ips&amp;#34;
        }
      }
    ]
  }
...
&lt;/code>&lt;/pre>&lt;p>While it&amp;rsquo;s clear how this &lt;em>particular&lt;/em> rule would be converted back
and forth between the two forms, there is no general way to map
&lt;em>all&lt;/em> rules back and forth without having separate code for every rule
type. Furthermore, the JSON syntax of individual rules is poorly
documented, and essentially all examples on the web (including the
nftables wiki, random blog posts, etc) use the non-JSON syntax. So if
we used the JSON syntax in kube-proxy, it would make the code harder
to understand and to maintain.&lt;/p>
&lt;p>As a result, the plan is that for our internal nftables API:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>When passing data &lt;em>to&lt;/em> &lt;code>nft&lt;/code>, we will use the &amp;ldquo;plain text&amp;rdquo; API. In
particular, this means that all &lt;code>add rule ...&lt;/code> commands will use the
well-documented plain text rule form.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When reading data back &lt;em>from&lt;/em> &lt;code>nft&lt;/code>, we will use the JSON API, to
ensure that the results are unambiguously parseable (rather than
having to make assumptions about the exact whitespace, punctuation,
etc, that &lt;code>nft&lt;/code> will output in particular cases in the plain text
mode).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>This means that our internal nftables API would not be able to support
reading back rules in a &amp;ldquo;legible&amp;rdquo; form. However, this is not expected
to be a problem, given that our internal iptables API
(&lt;code>pkg/util/iptables&lt;/code>) also does not explicitly support this, and it&amp;rsquo;s
not a problem for the iptables backend.&lt;/p>
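&lt;p>A minimal sketch of the read-back side might look like the following. It assumes the top-level layout described in &lt;code>libnftables-json(5)&lt;/code>, where the output is an object with an &lt;code>nftables&lt;/code> array of single-key items; the &lt;code>ruleInfo&lt;/code>, &lt;code>listOutput&lt;/code>, and &lt;code>parseRules&lt;/code> names are hypothetical. It extracts only the identifying fields of each rule and leaves the &lt;code>expr&lt;/code> array unparsed, which matches the expectation that rules will not be read back in a legible form.&lt;/p>

```go
package main

import "encoding/json"

// ruleInfo holds the identifying rule fields that appear in the JSON
// excerpt above. The "expr" array is deliberately left unparsed.
type ruleInfo struct {
	Family string `json:"family"`
	Table  string `json:"table"`
	Chain  string `json:"chain"`
	Handle int    `json:"handle"`
}

// listOutput models the assumed top-level object printed by
// "nft --json list ...": an "nftables" array whose items are single-key
// objects such as {"table": ...}, {"chain": ...}, or {"rule": ...}.
type listOutput struct {
	Nftables []map[string]json.RawMessage `json:"nftables"`
}

// parseRules returns the rules found in data and ignores every other
// kind of object in the listing.
func parseRules(data []byte) ([]ruleInfo, error) {
	out := new(listOutput)
	if err := json.Unmarshal(data, out); err != nil {
		return nil, err
	}
	var rules []ruleInfo
	for _, item := range out.Nftables {
		raw, ok := item["rule"]
		if !ok {
			continue
		}
		r := new(ruleInfo)
		if err := json.Unmarshal(raw, r); err != nil {
			return nil, err
		}
		rules = append(rules, *r)
	}
	return rules, nil
}
```

&lt;p>With this shape, the caller learns which rules exist and what their handles are without depending on the whitespace or punctuation of the plain text output.&lt;/p>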
&lt;h4 id="notes-on-the-sample-rules-in-this-kep">Notes on the sample rules in this KEP&lt;/h4>
&lt;p>The examples below all show data in the plain text &amp;ldquo;object&amp;rdquo; form, but
this is just for reader convenience, and does not correspond to either
the form we would be writing the data in (the multi-command
transaction form) or the form we would be reading it back in (JSON).
(Likewise, note that the &lt;code>#&lt;/code>-prefixed comments would be ignored by
&lt;code>nft&lt;/code> and are only there for the benefit of the KEP reader, whereas
the &lt;code>comment &amp;quot;...&amp;quot;&lt;/code> comments are actual object metadata that would be
stored in nftables, as with iptables &lt;code>--comment &amp;quot;...&amp;quot;&lt;/code>. Every table,
chain, set, map, rule, and set/map element can have its own comment,
so there is a lot of opportunity for us to make the ruleset
self-documenting, if we want to.)&lt;/p>
&lt;p>The examples below are also all IPv4-specific, for simplicity. When
actually writing out rules for nft, we will need to switch between,
e.g., &amp;ldquo;&lt;code>ip daddr&lt;/code>&amp;rdquo; and &amp;ldquo;&lt;code>ip6 daddr&lt;/code>&amp;rdquo; appropriately, to match an IPv4
or IPv6 destination address. This will actually be fairly simple
because the &lt;code>nft&lt;/code> command lets you create &amp;ldquo;variables&amp;rdquo; (really
constants) and substitute their values into the rules. Thus, we can
just always have the rule-generating code write &amp;ldquo;&lt;code>$IP daddr&lt;/code>&amp;rdquo;, and
then pass either &amp;ldquo;&lt;code>-D IP=ip&lt;/code>&amp;rdquo; or &amp;ldquo;&lt;code>-D IP=ip6&lt;/code>&amp;rdquo; to &lt;code>nft&lt;/code> to fix it up.&lt;/p>
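&lt;p>A small sketch of how the per-family invocation might be assembled is shown here. The &lt;code>nftArgs&lt;/code> helper is hypothetical; it only builds the argument list and assumes the ruleset itself will be supplied on standard input.&lt;/p>

```go
package main

import "fmt"

// nftArgs returns the nft arguments for applying a ruleset read from
// stdin, defining $IP so that rules written as "$IP daddr ..." expand to
// the right family. Hypothetical helper; family must be "ip" or "ip6".
func nftArgs(family string) ([]string, error) {
	switch family {
	case "ip", "ip6":
		return []string{"-D", "IP=" + family, "-f", "-"}, nil
	default:
		return nil, fmt.Errorf("unsupported family %q", family)
	}
}
```

&lt;p>The same rule-generating code could then be run twice, once with each family, without any per-family branching in the rule text.&lt;/p>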
&lt;p>The per-service/per-endpoint chain names below use hashed strings to
shorten the names, as in the &lt;code>iptables&lt;/code> backend (e.g.,
&amp;ldquo;&lt;code>svc_4SW47YFZTEDKD3PK&lt;/code>&amp;rdquo;, where that hash was copied out of the
existing &lt;code>iptables&lt;/code> unit tests and happens to represent
&amp;ldquo;&lt;code>ns4/svc4:p80tcp&lt;/code>&amp;rdquo;). However, it turns out that nftables chain names
can be much longer than iptables chain names (256 characters rather
than 30), so we ought to be able to create more recognizable chain
names in the &lt;code>nftables&lt;/code> backend.&lt;/p>
&lt;p>The multi-word names in the examples are also inconsistent about the
use of underscores vs hyphens; underscores are standard in most
nftables documentation, but hyphens are more
&lt;code>iptables&lt;/code>-kube-proxy-like. We should eventually settle on one or the
other.&lt;/p>
&lt;p>(Also, most of the examples below have not actually been tested and
may have syntax errors. Caveat lector.)&lt;/p>
&lt;h4 id="versioning-and-compatibility">Versioning and compatibility&lt;/h4>
&lt;p>Since nftables is subject to much more development than iptables has
been recently, we will need to pay more attention to kernel and tool
versions.&lt;/p>
&lt;p>The &lt;code>nft&lt;/code> command has a &lt;code>--check&lt;/code> option which can be used to check if
a command could be run successfully; it parses the input, and then
(assuming success), uploads the data to the kernel and asks the kernel
to check it (but not actually act on it) as well. Thus, with a few
&lt;code>nft --check&lt;/code> runs at startup we should be able to confirm what
features are known to both the tooling and the kernel.&lt;/p>
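&lt;p>One way such startup probing could be structured is sketched below. The &lt;code>checkFunc&lt;/code> type and &lt;code>probeFeatures&lt;/code> function are hypothetical; a real &lt;code>checkFunc&lt;/code> would presumably run &lt;code>nft --check -f -&lt;/code> with the given ruleset on standard input and report whether it exited successfully, but it is injected here so that the logic can be exercised without root privileges.&lt;/p>

```go
package main

// checkFunc reports whether a candidate ruleset is accepted by both the
// nft tool and the kernel. (Hypothetical signature; a real version would
// wrap an "nft --check -f -" invocation.)
type checkFunc func(ruleset string) bool

// probeFeatures runs each named probe ruleset through check and records
// which ones were accepted.
func probeFeatures(check checkFunc, probes map[string]string) map[string]bool {
	supported := make(map[string]bool, len(probes))
	for name, ruleset := range probes {
		supported[name] = check(ruleset)
	}
	return supported
}
```

&lt;p>The backend could then refuse to start, or fall back to an alternative rule form, when a required probe is not accepted.&lt;/p>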
&lt;p>It is not yet clear what the minimum kernel or &lt;code>nft&lt;/code> command-line
versions needed by the &lt;code>nftables&lt;/code> backend will be. The newest feature
used in the examples below was added in Linux 5.6, released in March
2020 (though they could be rewritten to not need that feature).&lt;/p>
&lt;p>It is possible some users will not be able to upgrade from the
&lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code> backends to &lt;code>nftables&lt;/code>. (Certainly the
&lt;code>nftables&lt;/code> backend will not support RHEL 7, which some people are
still using Kubernetes with.)&lt;/p>
&lt;h4 id="nat-rules">NAT rules&lt;/h4>
&lt;h5 id="general-service-dispatch">General Service dispatch&lt;/h5>
&lt;p>For ClusterIP and external IP services, we will use an nftables
&amp;ldquo;verdict map&amp;rdquo; to store the logic about where to dispatch traffic,
based on destination IP, protocol, and port. We will then need only a
single actual rule to apply the verdict map to all inbound traffic.
(Or it may end up making more sense to have separate verdict maps for
ClusterIP, ExternalIP, and LoadBalancer IP?) Either way, service
dispatch will be roughly &lt;strong>O(1)&lt;/strong> rather than &lt;strong>O(n)&lt;/strong> as in the
&lt;code>iptables&lt;/code> backend.&lt;/p>
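&lt;p>As an illustration of how entries for such a verdict map might be generated, the sketch below renders one &lt;code>add element&lt;/code> command per service IP/protocol/port. It assumes a map named &lt;code>service_ips&lt;/code> in the &lt;code>ip&lt;/code> family &lt;code>kube-proxy&lt;/code> table, as referenced by the rules shown earlier; the &lt;code>servicePort&lt;/code> type and &lt;code>addServiceElement&lt;/code> function are hypothetical.&lt;/p>

```go
package main

import "fmt"

// servicePort identifies one (IP, protocol, port) triple that should be
// dispatched to a per-service chain. Hypothetical type for illustration.
type servicePort struct {
	IP       string
	Protocol string
	Port     int
	Chain    string
}

// addServiceElement renders the nft command that maps the triple to a
// "goto" verdict in the assumed service_ips verdict map.
func addServiceElement(sp servicePort) string {
	return fmt.Sprintf("add element ip kube-proxy service_ips { %s . %s . %d : goto %s }",
		sp.IP, sp.Protocol, sp.Port, sp.Chain)
}
```

&lt;p>Adding or removing a Service then only touches map elements; the single rule that applies the verdict map does not change.&lt;/p>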
&lt;p>Likewise, for NodePort traffic, we will use a verdict map matching
only on destination protocol / port, with the rules set up to only
check the &lt;code>service_nodeports&lt;/code> map for packets addressed to a local IP.&lt;/p>
&lt;pre tabindex="0">&lt;code>map service_ips {
  comment &amp;#34;ClusterIP, ExternalIP and LoadBalancer IP traffic&amp;#34;;
  # The &amp;#34;type&amp;#34; clause defines the map&amp;#39;s datatype; the key type is to
  # the left of the &amp;#34;:&amp;#34; and the value type to the right. The map key
  # in this case is a concatenation (&amp;#34;.&amp;#34;) of three values: an IPv4
  # address, a protocol (tcp/udp/sctp), and a port (aka
  # &amp;#34;inet_service&amp;#34;). The map value is a &amp;#34;verdict&amp;#34;, which is one of a
  # limited set of nftables actions. In this case, the verdicts are
  # all &amp;#34;goto&amp;#34; statements.
  type ipv4_addr . inet_proto . inet_service : verdict;
  elements = {
    172.30.0.44 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK,
    192.168.99.33 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK,
    ...
  }
}
map service_nodeports {
  comment &amp;#34;NodePort traffic&amp;#34;;
  type inet_proto . inet_service : verdict;
  elements = {
    tcp . 3001 : goto svc_4SW47YFZTEDKD3PK,
    ...
  }
}
chain prerouting {
  jump services
  jump nodeports
}
chain services {
  # Construct a key from the destination address, protocol, and port,
  # then look that key up in the `service_ips` vmap and take the
  # associated action if it is found.
  ip daddr . ip protocol . th dport vmap @service_ips
}
chain nodeports {
  # Return if the destination IP is non-local, or if it&amp;#39;s localhost.
  fib daddr type != local return
  ip daddr == 127.0.0.1 return
  # If --nodeport-addresses was in use then the above would instead be
  # something like:
  # ip daddr != { 192.168.1.5, 192.168.3.10 } return
  # dispatch on the service_nodeports vmap
  ip protocol . th dport vmap @service_nodeports
}
# Example per-service chain
chain svc_4SW47YFZTEDKD3PK {
  # Send to random endpoint chain using an inline vmap
  numgen random mod 2 vmap {
    0 : goto sep_UKSFD7AGPMPPLUHC,
    1 : goto sep_C6EBXVWJJZMIWKLZ
  }
}
# Example per-endpoint chain
chain sep_UKSFD7AGPMPPLUHC {
  # masquerade hairpin traffic
  ip saddr 10.180.0.4 jump mark_for_masquerade
  # send to selected endpoint
  dnat to 10.180.0.4:8000
}
&lt;/code>&lt;/pre>&lt;h5 id="masquerading">Masquerading&lt;/h5>
&lt;p>The example rules above include&lt;/p>
&lt;pre tabindex="0">&lt;code> ip saddr 10.180.0.4 jump mark_for_masquerade
&lt;/code>&lt;/pre>&lt;p>to masquerade hairpin traffic, as in the &lt;code>iptables&lt;/code> proxier. This
assumes the existence of a &lt;code>mark_for_masquerade&lt;/code> chain, not shown.&lt;/p>
&lt;p>nftables has the same constraints on DNAT and masquerading as iptables
does; you can only DNAT from the &amp;ldquo;prerouting&amp;rdquo; stage and you can only
masquerade from the &amp;ldquo;postrouting&amp;rdquo; stage. Thus, as with &lt;code>iptables&lt;/code>, the
&lt;code>nftables&lt;/code> proxy will have to handle DNAT and masquerading at separate
times. One possibility would be to simply copy the existing logic from
the &lt;code>iptables&lt;/code> proxy, using the packet mark to communicate from the
prerouting chains to the postrouting ones.&lt;/p>
&lt;p>However, it should be possible to do this in nftables without using
the mark or any other externally-visible state; we can just create an
nftables &lt;code>set&lt;/code>, and use that to communicate information between the
chains. Something like:&lt;/p>
&lt;pre tabindex="0">&lt;code># Set of 5-tuples of connections that need masquerading
set need_masquerade {
  type ipv4_addr . inet_service . ipv4_addr . inet_service . inet_proto;
  flags timeout ; timeout 5s ;
}
chain mark_for_masquerade {
  update @need_masquerade { ip saddr . th sport . ip daddr . th dport . ip protocol }
}
chain postrouting_do_masquerade {
  # We use &amp;#34;ct original ip daddr&amp;#34; and &amp;#34;ct original proto-dst&amp;#34; here
  # since the packet may have been DNATted by this point.
  ip saddr . th sport . ct original ip daddr . ct original proto-dst . ip protocol @need_masquerade masquerade
}
&lt;/code>&lt;/pre>&lt;p>This is not yet tested, but some kernel nftables developers have
confirmed that it ought to work. We should test to make sure that
having a potentially-high-churn &lt;code>need_masquerade&lt;/code> set will not be a
performance problem.&lt;/p>
&lt;h5 id="session-affinity">Session affinity&lt;/h5>
&lt;p>Session affinity can be done in roughly the same way as in the
&lt;code>iptables&lt;/code> proxy, just using the more general nftables &amp;ldquo;set&amp;rdquo; framework
rather than the affinity-specific version of sets provided by the
iptables &lt;code>recent&lt;/code> module. In fact, since nftables allows arbitrary set
keys, we can optimize relative to &lt;code>iptables&lt;/code>, and only have a single
affinity set per service, rather than one per endpoint. (And we also
have the flexibility to change the affinity key in the future if we
want to, eg to key on source IP+port rather than just source IP.)&lt;/p>
&lt;pre tabindex="0">&lt;code>set affinity_4SW47YFZTEDKD3PK {
  # Source IP . Destination IP . Destination Port
  type ipv4_addr . ipv4_addr . inet_service;
  flags timeout; timeout 3h;
}
chain svc_4SW47YFZTEDKD3PK {
  # Check for existing session affinity against each endpoint
  ip saddr . 10.180.0.4 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_UKSFD7AGPMPPLUHC
  ip saddr . 10.180.0.5 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_C6EBXVWJJZMIWKLZ
  # Send to random endpoint chain
  numgen random mod 2 vmap {
    0 : goto sep_UKSFD7AGPMPPLUHC,
    1 : goto sep_C6EBXVWJJZMIWKLZ
  }
}
chain sep_UKSFD7AGPMPPLUHC {
  # Mark the source as having affinity for this endpoint
  update @affinity_4SW47YFZTEDKD3PK { ip saddr . 10.180.0.4 . 80 }
  ip saddr 10.180.0.4 jump mark_for_masquerade
  dnat to 10.180.0.4:8000
}
# likewise for other endpoint(s)...
&lt;/code>&lt;/pre>&lt;h4 id="filter-rules">Filter rules&lt;/h4>
&lt;p>The &lt;code>iptables&lt;/code> mode uses the &lt;code>filter&lt;/code> table for three kinds of rules:&lt;/p>
&lt;h5 id="dropping-or-rejecting-packets-for-services-with-no-endpoints">Dropping or rejecting packets for services with no endpoints&lt;/h5>
&lt;p>As with service dispatch, this is easily handled with a verdict map:&lt;/p>
&lt;pre tabindex="0">&lt;code>map no_endpoint_services {
  type ipv4_addr . inet_proto . inet_service : verdict
  elements = {
    192.168.99.22 . tcp . 80 : drop,
    172.30.0.46 . tcp . 80 : goto reject_chain,
    1.2.3.4 . tcp . 80 : drop
  }
}
chain filter {
  ...
  ip daddr . ip protocol . th dport vmap @no_endpoint_services
  ...
}
# helper chain needed because &amp;#34;reject&amp;#34; is not a &amp;#34;verdict&amp;#34; and so can&amp;#39;t
# be used directly in a verdict map
chain reject_chain {
  reject
}
&lt;/code>&lt;/pre>&lt;h5 id="dropping-traffic-rejected-by-loadbalancersourceranges">Dropping traffic rejected by &lt;code>LoadBalancerSourceRanges&lt;/code>&lt;/h5>
&lt;p>The implementation of LoadBalancer source ranges will be similar to
the ipset-based implementation in the &lt;code>ipvs&lt;/code> kube-proxy: we use one
set to recognize &amp;ldquo;traffic that is subject to source ranges&amp;rdquo;, and then
another to recognize &amp;ldquo;traffic that is &lt;em>accepted&lt;/em> by its service&amp;rsquo;s
source ranges&amp;rdquo;. Traffic which matches the first set but not the second
gets dropped:&lt;/p>
&lt;pre tabindex="0">&lt;code>set firewall {
  comment &amp;#34;destinations that are subject to LoadBalancerSourceRanges&amp;#34;;
  type ipv4_addr . inet_proto . inet_service
}
set firewall_allow {
  comment &amp;#34;destination+sources that are allowed by LoadBalancerSourceRanges&amp;#34;;
  type ipv4_addr . inet_proto . inet_service . ipv4_addr
}
chain filter {
  ...
  ip daddr . ip protocol . th dport @firewall jump firewall_check
  ...
}
chain firewall_check {
  ip daddr . ip protocol . th dport . ip saddr @firewall_allow return
  drop
}
&lt;/code>&lt;/pre>&lt;p>Where, eg, adding a Service with LoadBalancer IP &lt;code>10.1.2.3&lt;/code>, port
&lt;code>80&lt;/code>, and source ranges &lt;code>[&amp;quot;192.168.0.3/32&amp;quot;, &amp;quot;192.168.1.0/24&amp;quot;]&lt;/code> would
result in:&lt;/p>
&lt;pre tabindex="0">&lt;code>add element ip kube-proxy firewall { 10.1.2.3 . tcp . 80 }
add element ip kube-proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.0.3/32 }
add element ip kube-proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.1.0/24 }
&lt;/code>&lt;/pre>&lt;h5 id="forcing-traffic-on-healthchecknodeports-to-be-accepted">Forcing traffic on &lt;code>HealthCheckNodePort&lt;/code>s to be accepted&lt;/h5>
&lt;p>The &lt;code>iptables&lt;/code> mode adds rules to ensure that traffic to NodePort
services&amp;rsquo; health check ports is allowed through the firewall. For example:&lt;/p>
&lt;pre tabindex="0">&lt;code>-A KUBE-NODEPORTS -m comment --comment &amp;#34;ns2/svc2:p80 health check node port&amp;#34; -m tcp -p tcp --dport 30000 -j ACCEPT
&lt;/code>&lt;/pre>&lt;p>(There are also rules to accept any traffic that has already been
tagged by conntrack.)&lt;/p>
&lt;p>This cannot be done reliably in nftables; the semantics of &lt;code>accept&lt;/code>
(or &lt;code>-j ACCEPT&lt;/code> in iptables) is to end processing &lt;em>of the current
table&lt;/em>. In iptables, this effectively guarantees that the packet is
accepted (since &lt;code>-j ACCEPT&lt;/code> is mostly only used in the &lt;code>filter&lt;/code>
table), but in nftables, it is still possible that someone would later
call &lt;code>drop&lt;/code> on the packet from another table, causing it to be
dropped. There is no way to reliably &amp;ldquo;sneak behind the firewall&amp;rsquo;s
back&amp;rdquo; like you can in iptables; if an nftables-based firewall is
dropping kube-proxy&amp;rsquo;s packets, then you need to actually configure
&lt;em>that firewall&lt;/em> to accept them instead.&lt;/p>
&lt;p>However, this firewall-bypassing behavior is somewhat legacy anyway;
the &lt;code>iptables&lt;/code> proxy is able to bypass a &lt;em>local&lt;/em> firewall, but has no
ability to bypass a firewall implemented at the cloud network layer,
which is perhaps a more common configuration these days anyway.
Administrators using non-local firewalls are already required to
configure those firewalls correctly to allow Kubernetes traffic
through, and it is reasonable for us to just extend that requirement
to administrators using local firewalls as well.&lt;/p>
&lt;p>Thus, the &lt;code>nftables&lt;/code> backend will not attempt to replicate these
&lt;code>iptables&lt;/code>-backend rules.&lt;/p>
&lt;h4 id="future-improvements">Future improvements&lt;/h4>
&lt;p>Further improvements are likely possible.&lt;/p>
&lt;p>For example, it would be nice to not need a separate &amp;ldquo;hairpin&amp;rdquo; check for
every endpoint. There is no way to ask directly &amp;ldquo;does this packet have
the same source and destination IP?&amp;rdquo;, but the proof-of-concept &lt;a href="https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft"
target="_blank" rel="noopener">kpng
nftables backend&lt;/a>
does this instead:&lt;/p>
&lt;pre tabindex="0">&lt;code>set hairpin {
  type ipv4_addr . ipv4_addr;
  elements = {
    10.180.0.4 . 10.180.0.4,
    10.180.0.5 . 10.180.0.5,
    ...
  }
}
chain ... {
  ...
  ip saddr . ip daddr @hairpin jump mark_for_masquerade
}
&lt;/code>&lt;/pre>&lt;p>More efficiently, if nftables eventually got the ability to call eBPF
programs as part of rule processing (like iptables&amp;rsquo;s &lt;code>-m ebpf&lt;/code>) then
we could write a trivial eBPF program to check &amp;ldquo;source IP equals
destination IP&amp;rdquo; and then call that rather than needing the giant set
of redundant IPs.&lt;/p>
&lt;p>If we do this, then we don&amp;rsquo;t need the per-endpoint hairpin check
rules. If we could also get rid of the per-endpoint affinity-updating
rules, then we could get rid of the per-endpoint chains entirely,
since &lt;code>dnat to ...&lt;/code> is an allowed vmap verdict:&lt;/p>
&lt;pre tabindex="0">&lt;code>chain svc_4SW47YFZTEDKD3PK {
  # FIXME handle affinity somehow
  # Send to random endpoint
  numgen random mod 2 vmap {
    0 : dnat to 10.180.0.4:8000,
    1 : dnat to 10.180.0.5:8000
  }
}
&lt;/code>&lt;/pre>&lt;p>With the current set of nftables functionality, it does not seem
possible to do this (in the case where affinity is in use), but future
features may make it possible.&lt;/p>
&lt;p>It is not yet clear what the tradeoffs of such rewrites are, either in
terms of runtime performance, or of admin/developer-comprehensibility
of the ruleset.&lt;/p>
&lt;h3 id="changes-from-the-iptables-kube-proxy-backend">Changes from the iptables kube-proxy backend&lt;/h3>
&lt;p>Switching to a new backend which people will have to opt into gives us
the chance to break backward-compatibility in various places where we
don&amp;rsquo;t like the current iptables kube-proxy behavior.&lt;/p>
&lt;p>However, if we intend to eventually make the &lt;code>nftables&lt;/code> mode the
default, then differences from &lt;code>iptables&lt;/code> mode will be more of a
problem, so we should limit these changes to cases where the benefit
outweighs the cost.&lt;/p>
&lt;h4 id="localhost-nodeports">Localhost NodePorts&lt;/h4>
&lt;p>Kube-proxy in &lt;code>iptables&lt;/code> mode supports NodePorts on &lt;code>127.0.0.1&lt;/code> (for
IPv4 services) by default. (Kube-proxy in &lt;code>ipvs&lt;/code> mode does not support
this, and neither mode supports localhost NodePorts for IPv6 services,
although &lt;code>userspace&lt;/code> mode did, in single-stack IPv6 clusters.)&lt;/p>
&lt;p>Localhost NodePort traffic does not work cleanly with a DNAT-based
approach to NodePorts, because moving a localhost packet to a network
interface other than &lt;code>lo&lt;/code> causes the kernel to consider it &amp;ldquo;martian&amp;rdquo;
and refuse to route it. There are various ways around this problem:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>The &lt;code>userspace&lt;/code> approach: Proxy packets in userspace rather than
redirecting them with DNAT. (The &lt;code>userspace&lt;/code> proxy did this for
all IPs; the fact that localhost NodePorts worked with the
&lt;code>userspace&lt;/code> proxy was a coincidence, not an explicitly-intended
feature).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The &lt;code>iptables&lt;/code> approach: Enable the &lt;code>route_localnet&lt;/code> sysctl,
which tells the kernel to never consider IPv4 loopback addresses
to be &amp;ldquo;martian&amp;rdquo;, so that DNAT works. This only works for IPv4;
there is no corresponding sysctl for IPv6. Unfortunately, enabling
this sysctl opens security holes (&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2020-8558"
target="_blank" rel="noopener">CVE-2020-8558&lt;/a>
), which
kube-proxy then needs to try to close, which it does by creating
iptables rules to block all the packets that &lt;code>route_localnet&lt;/code>
would have blocked &lt;em>except&lt;/em> for the ones we want (which assumes
that the administrator &lt;a href="https://github.com/kubernetes/kubernetes/pull/91666#issuecomment-640733664"
target="_blank" rel="noopener">didn&amp;rsquo;t also change certain other sysctls&lt;/a>
that might have been safe to change had we not set
&lt;code>route_localnet&lt;/code>, and which according to some reports &lt;a href="https://github.com/kubernetes/kubernetes/pull/91666#issuecomment-763549921"
target="_blank" rel="noopener">may block
legitimate traffic&lt;/a>
in some configurations).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Cilium approach: Intercept the connect(2) call with eBPF and
rewrite the destination IP there, so that the network stack never
actually sees a packet with destination &lt;code>127.0.0.1&lt;/code> / &lt;code>::1&lt;/code>. (As
in the &lt;code>userspace&lt;/code> kube-proxy case, this is not a special-case
for localhost, it&amp;rsquo;s just how Cilium does service proxying.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you control the client, you can explicitly bind the socket to
&lt;code>127.0.0.1&lt;/code> / &lt;code>::1&lt;/code> before connecting. (I&amp;rsquo;m not sure why this
works since the packet still eventually gets routed off &lt;code>lo&lt;/code>.) It
doesn&amp;rsquo;t seem to be possible to &amp;ldquo;spoof&amp;rdquo; this after the socket is
created, though as with the previous case, you could do this by
intercepting syscalls with eBPF.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>In discussions about this feature, only one real use case has been
presented: it allows you to run a docker registry in a pod and then
have nodes use a NodePort service via &lt;code>127.0.0.1&lt;/code> to access that
registry. Docker treats &lt;code>127.0.0.1&lt;/code> as an &amp;ldquo;insecure registry&amp;rdquo; by
default (though containerd and cri-o do not) and so does not require
TLS authentication in this case; using any other IP would require
setting up TLS certificates, making the deployment more complicated.
(In other words, this is basically an intentional exploitation of the
security hole that CVE-2020-8558 warns about: enabling
&lt;code>route_localnet&lt;/code> may allow someone to access a service that doesn&amp;rsquo;t
require authentication because it assumed it was only accessible to
localhost.)&lt;/p>
&lt;p>In all other cases, it is generally possible (though not always
convenient) to just rewrite things to use the node IP rather than
localhost (or to use a ClusterIP rather than a NodePort). Indeed,
since localhost NodePorts do not work with &lt;code>ipvs&lt;/code> mode or with IPv6,
many places that used to use NodePorts on &lt;code>127.0.0.1&lt;/code> have already
been rewritten to not do so (eg &lt;a href="https://github.com/contiv/vpp/pull/1434"
target="_blank" rel="noopener">contiv/vpp#1434&lt;/a>
).&lt;/p>
&lt;p>So:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>There is no way to make IPv6 localhost NodePorts work with a
NAT-based solution.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The way to make IPv4 localhost NodePorts work with NAT introduces
a security hole, and we don&amp;rsquo;t necessarily have a fully-generic way
to mitigate it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The only commonly-argued-for use case for the feature involves
deploying a service in a configuration which its own documentation
describes as insecure and &amp;ldquo;only appropriate for testing&amp;rdquo;.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The use case in question works by default against cri-dockerd
but not against containerd or cri-o with their default
configurations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>cri-dockerd, containerd, and cri-o all allow additional
&amp;ldquo;insecure registry&amp;rdquo; IPs/CIDRs to be configured, so an
administrator could configure them to allow non-TLS image
pulling against a ClusterIP.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Given this, I think we should not try to support localhost NodePorts
in the &lt;code>nftables&lt;/code> backend.&lt;/p>
&lt;h4 id="nodeport-addresses">NodePort Addresses&lt;/h4>
&lt;p>In addition to the localhost issue, iptables kube-proxy defaults to
accepting NodePort connections on all local IPs, which has effects
varying from intended-but-unexpected (&amp;ldquo;why can people connect to
NodePort services from the management network?&amp;rdquo;) to clearly-just-wrong
(&amp;ldquo;why can people connect to NodePort services on LoadBalancer IPs?&amp;rdquo;).&lt;/p>
&lt;p>The nftables proxy should default to only opening NodePorts on a
single interface, probably the interface with the default route.
(Ideally, you really want it to accept NodePorts on the
interface that holds the route to the cloud load balancers, but we
don&amp;rsquo;t necessarily know what that is ahead of time.) Admins can use
&lt;code>--nodeport-addresses&lt;/code> to override this.&lt;/p>
&lt;h4 id="behavior-of-service-ips">Behavior of service IPs&lt;/h4>
&lt;p>Traffic to invalid ports on active cluster IPs will be rejected by the
nftables proxy. If the &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1880-multiple-service-cidrs"
target="_blank" rel="noopener">MultiServiceCIDRAllocator&lt;/a>
feature gate is
enabled, it will additionally drop traffic to unassigned cluster IPs.&lt;/p>
&lt;h4 id="defining-an-api-for-integration-with-admindebugthird-party-rules">Defining an API for integration with admin/debug/third-party rules&lt;/h4>
&lt;p>Administrators sometimes want to add rules to log or drop certain
packets. Kube-proxy makes this difficult because it is constantly
rewriting its rules, making it likely that admin-added rules will be
deleted shortly after being added.&lt;/p>
&lt;p>Likewise, external components (eg, NetworkPolicy implementations) may
want to write rules that integrate with kube-proxy&amp;rsquo;s rules in
well-defined ways.&lt;/p>
&lt;p>The existing kube-proxy modes do not provide any explicit &amp;ldquo;API&amp;rdquo; for
integrating with them, although certain implementation details of the
&lt;code>iptables&lt;/code> backend in particular (e.g. the fact that service IPs in
packets are rewritten to endpoint IPs during iptables&amp;rsquo;s &lt;code>PREROUTING&lt;/code>
phase, and that masquerading will not happen before &lt;code>POSTROUTING&lt;/code>) are
effectively API, in that we know that changing them would result in
significant ecosystem breakage.&lt;/p>
&lt;p>We should provide a stronger definition of these larger-scale &amp;ldquo;black
box&amp;rdquo; guarantees in the &lt;code>nftables&lt;/code> backend. nftables makes this easier
than iptables in some ways, because each application is expected to
create its own table, and not interfere with anyone else&amp;rsquo;s tables.
If we document the &lt;code>priority&lt;/code> values we use to connect to each
nftables hook, then admins and third party developers should be able
to reliably process packets before or after kube-proxy, without
needing to modify kube-proxy&amp;rsquo;s chains/rules. (As of 1.33, this is now
documented.)&lt;/p>
&lt;p>In cases where administrators want to insert rules into the middle of
particular service or endpoint chains, we would have the same problem
that the &lt;code>iptables&lt;/code> backend has, which is that it would be difficult
for us to avoid accidentally overwriting them when we update rules.
Additionally, we want to preserve our ability to redesign the rules
later to take better advantage of nftables features, which would be
impossible to do if we were officially allowing users to modify the
existing rules.&lt;/p>
&lt;p>One possibility would be to add &amp;ldquo;admin override&amp;rdquo; vmaps that are
normally empty, but to which admins could add &lt;code>jump&lt;/code>/&lt;code>goto&lt;/code> rules for
specific services to augment or bypass the normal service processing. It
probably makes sense to leave these out initially and see if people
actually do need them, or if creating rules in another table is
sufficient.&lt;/p>
&lt;h4 id="rule-monitoring">Rule monitoring&lt;/h4>
&lt;p>Given the constraints of the iptables API, it would be extremely
inefficient to do &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md"
target="_blank" rel="noopener">a controller loop in the &amp;ldquo;standard&amp;rdquo; style&lt;/a>
:&lt;/p>
&lt;pre tabindex="0">&lt;code>for {
    desired := getDesiredState()
    current := getCurrentState()
    makeChanges(desired, current)
}
&lt;/code>&lt;/pre>&lt;p>(In particular, the combination of &amp;ldquo;&lt;code>getCurrentState&lt;/code>&amp;rdquo; and
&amp;ldquo;&lt;code>makeChanges&lt;/code>&amp;rdquo; is slower than just skipping the &amp;ldquo;&lt;code>getCurrentState&lt;/code>&amp;rdquo;
and rewriting everything from scratch every time.)&lt;/p>
&lt;p>In the past, the &lt;code>iptables&lt;/code> backend &lt;em>did&lt;/em> rewrite everything from
scratch every time:&lt;/p>
&lt;pre tabindex="0">&lt;code>for {
    desired := getDesiredState()
    makeChanges(desired, nil)
}
&lt;/code>&lt;/pre>&lt;p>but &lt;a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3453-minimize-iptables-restore/README.md"
target="_blank" rel="noopener">KEP-3453&lt;/a>
&amp;ldquo;Minimizing iptables-restore input size&amp;rdquo; changed this,
to improve performance:&lt;/p>
&lt;pre tabindex="0">&lt;code>for {
    desired := getDesiredState()
    predicted := getPredictedState()
    if err := makeChanges(desired, predicted); err != nil {
        makeChanges(desired, nil)
    }
}
&lt;/code>&lt;/pre>&lt;p>That is, it makes incremental updates under the assumption that the
current state is correct, but if an update fails (e.g. because it
assumes the existence of a chain that didn&amp;rsquo;t exist), kube-proxy falls
back to doing a full rewrite. (It also eventually falls back to a full
update after enough time passes.)&lt;/p>
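&lt;p>The incremental-update-with-fallback loop described above can be sketched as a small, self-contained Go program. This is only an illustration under stated assumptions: the &lt;code>ruleset&lt;/code> type, &lt;code>applyIncremental&lt;/code>, &lt;code>applyFull&lt;/code>, and &lt;code>syncOnce&lt;/code> are hypothetical names that model the kernel state as an in-memory map holding only the proxy&amp;rsquo;s own chains, and they are not kube-proxy&amp;rsquo;s actual code.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// ruleset is a toy stand-in for the proxy's own chains: chain name -> rules.
// (Hypothetical type, for illustration only.)
type ruleset map[string]string

// errMissingChain models an incremental update that referenced a chain which
// no longer exists in the kernel.
var errMissingChain = errors.New("chain does not exist")

// applyIncremental writes only the chains that differ from what the proxy
// predicts is already installed. It fails, without changing anything, if a
// chain the proxy expected to be present is missing from the kernel.
func applyIncremental(kernel, desired, predicted ruleset) error {
	for chain := range predicted {
		if _, ok := kernel[chain]; !ok {
			return fmt.Errorf("%w: %s", errMissingChain, chain)
		}
	}
	for chain := range predicted {
		if _, ok := desired[chain]; !ok {
			delete(kernel, chain)
		}
	}
	for chain, rules := range desired {
		if predicted[chain] != rules {
			kernel[chain] = rules
		}
	}
	return nil
}

// applyFull discards whatever is installed and rewrites every desired chain.
func applyFull(kernel, desired ruleset) {
	for chain := range kernel {
		delete(kernel, chain)
	}
	for chain, rules := range desired {
		kernel[chain] = rules
	}
}

// syncOnce tries an incremental update first and falls back to a full rewrite
// if that fails, or if forceFull is set (for example by a periodic timer).
// It reports whether a full rewrite was performed.
func syncOnce(kernel, desired, predicted ruleset, forceFull bool) bool {
	if !forceFull {
		if err := applyIncremental(kernel, desired, predicted); err == nil {
			return false
		}
	}
	applyFull(kernel, desired)
	return true
}

func main() {
	desired := ruleset{"svc-a": "dnat to 10.0.0.1", "svc-b": "dnat to 10.0.0.2"}
	predicted := ruleset{"svc-a": "dnat to 10.0.0.1"}

	// The kernel matches the prediction, so only svc-b needs to be written.
	kernel := ruleset{"svc-a": "dnat to 10.0.0.1"}
	fmt.Println("full rewrite:", syncOnce(kernel, desired, predicted, false), "chains:", len(kernel))

	// Something flushed the kernel state behind our back, so we fall back.
	kernel = ruleset{}
	fmt.Println("full rewrite:", syncOnce(kernel, desired, predicted, false), "chains:", len(kernel))
}
```

&lt;p>The &lt;code>forceFull&lt;/code> parameter stands in for the time-based fallback mentioned above. The sketch checks its assumptions before writing anything, so a failed incremental attempt leaves the state untouched for the full rewrite.&lt;/p>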
&lt;p>Proxies based on iptables have also historically had the problem that
system processes (particularly firewall implementations) would
sometimes flush all iptables rules and restart with a clean state,
thus completely breaking kube-proxy. The initial solution for this
problem was to just recreate all iptables rules every 30 seconds even
if no services/endpoints had changed. Later this was changed to create
a single &amp;ldquo;canary&amp;rdquo; chain, and check every 30 seconds that the canary
had not been deleted, and only recreate everything from scratch if the
canary disappears.&lt;/p>
&lt;p>NFTables provides a way to monitor for changes without doing polling;
you can keep a netlink socket open to the kernel (or a pipe open to an
&lt;code>nft monitor&lt;/code> process) and receive notifications when particular kinds
of nftables objects are created or destroyed.&lt;/p>
&lt;p>However, the &amp;ldquo;everyone uses their own table&amp;rdquo; design of nftables means
that this should not be necessary. IPTables-based firewall
implementations flush all iptables rules because everyone&amp;rsquo;s iptables
rules are all mixed together and it&amp;rsquo;s hard to do otherwise. But in
nftables, a firewall ought to only flush &lt;em>its own&lt;/em> table when
restarting, and leave everyone else&amp;rsquo;s tables untouched. In particular,
firewalld works this way when using nftables. We will need to see what
other firewall implementations do.&lt;/p>
&lt;h3 id="switching-between-kube-proxy-modes">Switching between kube-proxy modes&lt;/h3>
&lt;p>In the past, kube-proxy attempted to allow users to switch between the
&lt;code>userspace&lt;/code> and &lt;code>iptables&lt;/code> modes (and later the &lt;code>ipvs&lt;/code> mode) by just
restarting kube-proxy with the new arguments. Each mode would attempt
to clean up the iptables rules used by the other modes on startup.&lt;/p>
&lt;p>Unfortunately, this didn&amp;rsquo;t work well because the three modes all used
some of the same iptables chains, so, e.g., when kube-proxy started up
in &lt;code>iptables&lt;/code> mode, it would try to delete the &lt;code>userspace&lt;/code> rules, but
this would end up deleting rules that had been created by &lt;code>iptables&lt;/code>
mode too, which meant that any time you restarted kube-proxy, it would
immediately delete some of its rules and be in a broken state until it
managed to re-sync from the apiserver. So this code was removed with
&lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2448-Remove-kube-proxy-automatic-clean-up-logic"
target="_blank" rel="noopener">KEP-2448&lt;/a>
.&lt;/p>
&lt;p>However, the same problem would not apply when switching between an
iptables-based mode and an nftables-based mode; it should be safe to
delete all &lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code> rules when starting kube-proxy in
&lt;code>nftables&lt;/code> mode, and to delete all &lt;code>nftables&lt;/code> rules when starting
kube-proxy in &lt;code>iptables&lt;/code> or &lt;code>ipvs&lt;/code> mode. This will make it easier for
users to switch between modes.&lt;/p>
&lt;p>Since rollback from &lt;code>nftables&lt;/code> mode is most important when the
&lt;code>nftables&lt;/code> mode is not actually working correctly, we should do our
best to make sure that the cleanup code that runs when rolling back to
&lt;code>iptables&lt;/code>/&lt;code>ipvs&lt;/code> mode is likely to work correctly even if the rest of
the &lt;code>nftables&lt;/code> code is broken. To that end, we can have it simply run
&lt;code>nft&lt;/code> directly, bypassing the abstractions used by the rest of the
code. Since our rules will be isolated to our own tables, all we need
to do to clean up all of our rules is:&lt;/p>
&lt;pre tabindex="0">&lt;code>nft delete table ip kube-proxy
nft delete table ip6 kube-proxy
&lt;/code>&lt;/pre>&lt;p>In fact, this is simple enough that we could document it explicitly as
something administrators could do if they run into problems while
rolling back.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;p>None. (We had considered refactoring the &lt;code>iptables&lt;/code> unit tests to make
it possible to share the same tests between the two backends, but we
ended up just copying them instead.)&lt;/p>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;p>We will add unit tests for the &lt;code>nftables&lt;/code> mode that are equivalent to
the ones for the &lt;code>iptables&lt;/code> mode. In particular, we will port over the
tests that feed Services and EndpointSlices into the proxy engine,
dump the generated ruleset, and then mock running packets through the
ruleset to determine how they would behave.&lt;/p>
&lt;p>Since virtually all of the new code will be in a new directory, there
should not be any large changes either way to the test coverage
percentages in any existing directories.&lt;/p>
&lt;p>As of 2023-09-22, &lt;code>pkg/proxy/iptables&lt;/code> has 70.6% code coverage in its
unit tests. For Alpha, we will have comparable coverage for
&lt;code>nftables&lt;/code>. However, since the &lt;code>nftables&lt;/code> implementation is new, and
more likely to have bugs than the older, widely-used &lt;code>iptables&lt;/code>
implementation, we will also add additional unit tests before Beta.&lt;/p>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/proxy/nftables&lt;/code>: &lt;code>2024-05-24&lt;/code> - &lt;code>74.7%&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>for comparison:&lt;/p>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/proxy/iptables&lt;/code>: &lt;code>2024-05-24&lt;/code> - &lt;code>68.4%&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/proxy/ipvs&lt;/code>: &lt;code>2024-05-24&lt;/code> - &lt;code>60.9%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>Kube-proxy does not have integration tests.&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>Most of the e2e testing of kube-proxy is backend-agnostic. Initially,
we will need a separate e2e job to test the nftables mode (like we do
with ipvs). Eventually, if nftables becomes the default, then this
would be flipped around to having a legacy &amp;ldquo;iptables&amp;rdquo; job.&lt;/p>
&lt;p>The test &amp;ldquo;&lt;code>[It should recreate its iptables rules if they are deleted]&lt;/code>&amp;rdquo; tests (a) that kubelet recreates &lt;code>KUBE-IPTABLES-HINT&lt;/code> if it
is deleted, and (b) that deleting all &lt;code>KUBE-*&lt;/code> iptables rules does not
cause services to be broken forever. The latter part is obviously a
no-op under &lt;code>nftables&lt;/code> kube-proxy, but we can run it anyway. (We are
currently assuming that we will not need an nftables version of this
test, since the problem of one component deleting another component&amp;rsquo;s
rules should not exist with nftables.)&lt;/p>
&lt;p>(Though not directly related to kube-proxy, there are also other e2e
tests that use iptables which should eventually be ported to nftables;
notably, the ones using &lt;a href="https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.2/test/e2e/framework/network/utils.go#L1078"
target="_blank" rel="noopener">&lt;code>TestUnderTemporaryNetworkFailure&lt;/code>&lt;/a>
.)&lt;/p>
&lt;p>For the most part, we should not need to add any nftables-specific e2e
tests; the &lt;code>nftables&lt;/code> backend&amp;rsquo;s job is just to implement the Service
proxy API to the same specifications as the other backends do, so the
existing e2e tests already cover everything relevant. The only
exception to this is in cases where we change default behavior from
the &lt;code>iptables&lt;/code> backend, in which case we may need new tests for the
different behavior.&lt;/p>
&lt;p>We will eventually need e2e tests for switching between &lt;code>iptables&lt;/code> and
&lt;code>nftables&lt;/code> mode in an existing cluster.&lt;/p>
&lt;h4 id="scalability--performance-tests">Scalability &amp;amp; Performance tests&lt;/h4>
&lt;p>We have an &lt;a href="https://testgrid.k8s.io/sig-scalability-experiments#nftables-100"
target="_blank" rel="noopener">nftables scalability job&lt;/a>
. Initial performance is fine; we
have not done a lot of further testing/improvement yet.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;code>kube-proxy --proxy-mode nftables&lt;/code> available behind a feature gate&lt;/p>
&lt;/li>
&lt;li>
&lt;p>nftables mode has unit test parity with iptables&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An nftables-mode e2e job exists, and passes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Documentation describes any changes in behavior between the
&lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code> modes and the &lt;code>nftables&lt;/code> mode.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Documentation explains how to manually clean up nftables rules in
case things go very wrong.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>At least two releases since Alpha.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The nftables mode has seen at least a bit of real-world usage. (Yes;
we&amp;rsquo;ve gotten bug reports and PRs from users experimenting with it.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No major outstanding bugs.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>nftables mode has better unit test coverage than iptables mode
(currently) has. (It is possible that we will end up adding
equivalent unit tests to the iptables backend in the process.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A &amp;ldquo;kube-proxy mode-switching&amp;rdquo; e2e job exists, to confirm that you
can redeploy kube-proxy in a different mode in an existing cluster.
Rollback is confirmed to be reliable.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An nftables e2e periodic perf/scale job exists, and shows
performance as good as iptables and ipvs.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Documentation describes any changes in behavior between the
&lt;code>iptables&lt;/code> and &lt;code>ipvs&lt;/code> modes and the &lt;code>nftables&lt;/code> mode. Any warnings
that we have decided to add for &lt;code>iptables&lt;/code> users using functionality
that behaves differently in &lt;code>nftables&lt;/code> have been added.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>At least two releases since Beta.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The nftables mode has seen non-trivial real-world usage.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The nftables mode has no bugs / regressions that would make us
hesitate to recommend it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We have at least the start of a plan for the next steps (changing
the default mode, deprecating the old backends, etc).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>No UNRESOLVED sections in the KEP. (In particular, we have figured
out what sort of &amp;ldquo;API&amp;rdquo; we will offer for integrating third-party
nftables rules.)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>The new mode should not introduce any upgrade/downgrade problems,
except that you can&amp;rsquo;t downgrade or feature-disable a cluster using
the new kube-proxy mode without switching it back to &lt;code>iptables&lt;/code> or
&lt;code>ipvs&lt;/code> first. (The older kube-proxy would refuse to start if given
&lt;code>--proxy-mode nftables&lt;/code>, and wouldn&amp;rsquo;t know how to clean up stale
nftables service rules if any were present.)&lt;/p>
&lt;p>When rolling out or rolling back the feature, it should be safe to
enable the feature gate and change the configuration at the same time,
since nothing cares about the feature gate except for kube-proxy
itself. Likewise, it is expected to be safe to roll out the feature in
a live cluster, even though this will result in different proxy modes
running on different nodes, because Kubernetes service proxying is
defined in such a way that no node needs to be aware of the
implementation details of the service proxy implementation on any
other node.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>The feature is isolated to kube-proxy and does not introduce any API
changes, so the versions of other components do not matter.&lt;/p>
&lt;p>Kube-proxy has no problems skewing with different versions of itself
across different nodes, because Kubernetes service proxying is defined
in such a way that no node needs to be aware of the implementation
details of the service proxy implementation on any other node.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;p>The administrator must enable the feature gate to make the feature
available, and then must run kube-proxy with the
&lt;code>--proxy-mode=nftables&lt;/code> flag.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: NFTablesProxyMode&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-proxy&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:
&lt;ul>
&lt;li>kube-proxy must be restarted with the new &lt;code>--proxy-mode&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?
&lt;ul>
&lt;li>No&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume &lt;code>Dynamic Kubelet Config&lt;/code> feature is enabled).
&lt;ul>
&lt;li>No&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Enabling the feature gate does not change any behavior; it just makes
the &lt;code>--proxy-mode=nftables&lt;/code> option available.&lt;/p>
&lt;p>Switching from &lt;code>--proxy-mode=iptables&lt;/code> or &lt;code>--proxy-mode=ipvs&lt;/code> to
&lt;code>--proxy-mode=nftables&lt;/code> will likely change some behavior, depending
on what we decide to do about certain un-loved kube-proxy features
like localhost nodeports. Whatever differences in behavior exist will
be explained clearly by the documentation; this is no different from
users switching from &lt;code>iptables&lt;/code> to &lt;code>ipvs&lt;/code>, which initially did not
have feature parity with &lt;code>iptables&lt;/code>.&lt;/p>
&lt;p>(Assuming we eventually make &lt;code>nftables&lt;/code> the default, then differences
in behavior from &lt;code>iptables&lt;/code> will be more important, but making it the
default is not part of &lt;em>this&lt;/em> KEP.)&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes, though it is necessary to clean up the nftables rules that were
created, or they will continue to intercept service traffic. In any
normal case, this should happen automatically when restarting
kube-proxy in &lt;code>iptables&lt;/code> or &lt;code>ipvs&lt;/code> mode; however, that assumes the
user is rolling back to a version of kube-proxy that has at least the
Alpha nftables code (1.29+). If the user wants to roll back the
cluster to a version of Kubernetes that doesn&amp;rsquo;t have the nftables
kube-proxy code (i.e., rolling back from Alpha to Pre-Alpha), or if
they are rolling back to an external service proxy implementation
(e.g., kpng), then they would need to make sure that the nftables
rules got cleaned up &lt;em>before&lt;/em> they rolled back, or else clean them up
manually. (We document how to do this.)&lt;/p>
&lt;p>(By the time we are considering making the &lt;code>nftables&lt;/code> backend the
default in the future, the feature will have existed and been GA for
several releases, so at that point, rollback (to another version of
kube-proxy) would always be to a version that still supports
&lt;code>nftables&lt;/code> and can properly clean up from it.)&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>It should just work.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>The actual feature gate enablement/disablement itself is not
interesting, since it only controls whether &lt;code>--proxy-mode nftables&lt;/code>
can be selected.&lt;/p>
&lt;p>We will need an e2e test of switching a node from &lt;code>iptables&lt;/code> (or
&lt;code>ipvs&lt;/code>) mode to &lt;code>nftables&lt;/code>, and vice versa. The Graduation Criteria
currently list this e2e test as being a criterion for Beta, not Alpha,
since we don&amp;rsquo;t really expect people to be switching their existing
clusters over to an Alpha version of kube-proxy anyway.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>Simply enabling the feature (or upgrading to the release where it is
Beta) has no effect. Admins must explicitly choose to switch to the
new backend.&lt;/p>
&lt;p>Switching to the new backend can&amp;rsquo;t really &amp;ldquo;fail&amp;rdquo;, other than in the
case of bugs, which could have results ranging from &amp;ldquo;almost
unnoticeable&amp;rdquo; to &amp;ldquo;utterly catastrophic&amp;rdquo;. Such a failure would almost
certainly impact already running workloads. However, each node must be
switched over to the new backend independently, so any especially bad
failure would likely be noticed after switching over the first node,
and could be rolled back at that point.&lt;/p>
&lt;p>Rollback should not be able to fail unless there are bugs in the
nftables cleanup code, which is very simple.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>If &lt;code>sync_proxy_rules_nftables_sync_failures_total&lt;/code> is growing, that
indicates that &lt;em>something&lt;/em> is going wrong, and kube-proxy logs may
provide more information.&lt;/p>
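&lt;p>One way to check whether the counter is &amp;ldquo;growing&amp;rdquo; is to compare two scrapes of kube-proxy&amp;rsquo;s metrics endpoint. The sketch below parses Prometheus text-format output held in strings. The helper names are hypothetical, the sample lines carry an assumed &lt;code>kubeproxy_&lt;/code> prefix, and the parser assumes samples have no trailing timestamp.&lt;/p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const failuresMetric = "sync_proxy_rules_nftables_sync_failures_total"

// counterTotal sums every sample of the named counter in a Prometheus
// text-format scrape. Names are matched by suffix because the exported name
// may carry a component prefix (an assumption of this sketch), and the last
// field of each line is taken as the value (assumes no timestamps).
func counterTotal(scrape, metric string) float64 {
	total := 0.0
	for _, line := range strings.Split(scrape, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		name := fields[0]
		if i := strings.IndexByte(name, '{'); i >= 0 {
			name = name[:i]
		}
		if !strings.HasSuffix(name, metric) {
			continue
		}
		value, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err == nil {
			total += value
		}
	}
	return total
}

// growing reports whether the failure counter increased between two scrapes.
func growing(before, after string) bool {
	return counterTotal(after, failuresMetric) > counterTotal(before, failuresMetric)
}

func main() {
	// Hypothetical scrapes taken some time apart.
	before := "# TYPE kubeproxy_sync_proxy_rules_nftables_sync_failures_total counter\n" +
		"kubeproxy_sync_proxy_rules_nftables_sync_failures_total 2\n"
	after := "kubeproxy_sync_proxy_rules_nftables_sync_failures_total 5\n"
	fmt.Println("sync failures growing:", growing(before, after))
}
```

&lt;p>Because counters reset when kube-proxy restarts, a lower value in the second scrape is treated here as &amp;ldquo;not growing&amp;rdquo; rather than as an error.&lt;/p>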
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Tested by hand:&lt;/p>
&lt;ol>
&lt;li>Start kube-proxy in &lt;code>iptables&lt;/code> mode.&lt;/li>
&lt;li>Confirm (via &lt;code>iptables-save&lt;/code>) that iptables rules exist for Services.&lt;/li>
&lt;li>Kill kube-proxy.&lt;/li>
&lt;li>Start kube-proxy in &lt;code>nftables&lt;/code> mode.&lt;/li>
&lt;li>Confirm (via &lt;code>iptables-save&lt;/code>) that iptables rules for Services no
longer exist. (There will still be a handful of iptables chains
left over, but nothing that actually affects the behavior of
services.)&lt;/li>
&lt;li>Confirm (via &lt;code>nft list ruleset&lt;/code>) that nftables rules for Services
exist.&lt;/li>
&lt;li>Kill kube-proxy.&lt;/li>
&lt;li>Start kube-proxy in &lt;code>iptables&lt;/code> mode again.&lt;/li>
&lt;li>Confirm (via &lt;code>iptables-save&lt;/code>) that iptables rules exist for Services.&lt;/li>
&lt;li>Confirm (via &lt;code>nft list ruleset&lt;/code>) that the &lt;code>kube-proxy&lt;/code> table (or
tables, if dual-stack) has been deleted.&lt;/li>
&lt;/ol>
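&lt;p>The final check in the manual procedure above could be automated along the following lines. This sketch scans the text output of &lt;code>nft list ruleset&lt;/code> (or &lt;code>nft list tables&lt;/code>) for a leftover &lt;code>kube-proxy&lt;/code> table. The &lt;code>leftoverTables&lt;/code> helper is hypothetical, and the listing is supplied as a string so the example runs without nft installed.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// leftoverTables returns the families ("ip", "ip6") for which a kube-proxy
// table still appears in the given nft listing. It accepts table header
// lines both with and without a trailing "{".
func leftoverTables(listing string) []string {
	var families []string
	for _, line := range strings.Split(listing, "\n") {
		normalized := strings.Join(strings.Fields(line), " ")
		for _, family := range []string{"ip", "ip6"} {
			header := "table " + family + " kube-proxy"
			if normalized == header || normalized == header+" {" {
				families = append(families, family)
			}
		}
	}
	return families
}

func main() {
	// Sample listing after rolling back to iptables mode: only an unrelated
	// firewall table remains.
	after := "table inet firewalld {\n\tchain filter_INPUT {\n\t}\n}\n"
	fmt.Println("leftover kube-proxy tables:", len(leftoverTables(after)))
}
```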
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>The new backend is not 100% compatible with the &lt;code>iptables&lt;/code> backend.
This will be documented, and there are new metrics in the &lt;code>iptables&lt;/code>
backend that can help users figure out if they are depending on
features that aren&amp;rsquo;t implemented or that work differently in
&lt;code>nftables&lt;/code>.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>The operator is the one who would enable the feature, and they would
know it is in use by looking at the kube-proxy configuration.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: If Services still work then the feature is working&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>For Beta, the goal is for the &lt;a href="https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md"
target="_blank" rel="noopener">network programming latency&lt;/a>
to
be equivalent to the &lt;em>old&lt;/em>, pre-&lt;a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3453-minimize-iptables-restore/README.md"
target="_blank" rel="noopener">KEP-3453&lt;/a>
iptables performance
(because the current code is not yet heavily optimized).&lt;/p>
&lt;p>For GA, the goal was for it to be at least as good as the current
iptables performance.&lt;/p>
&lt;p>In fact, we never got entirely clear measurements of this, because the
iptables-based 1000 node perf/scale test still uses &lt;code>minSyncPeriod: 10s&lt;/code>, while the nftables-based one does not. However, the nftables
performance is quite satisfactory (and the fact that it is able to
have satisfactory performance without using &lt;code>minSyncPeriod&lt;/code> is also a
major win).&lt;/p>
&lt;p>Meanwhile, nftables data plane performance is &lt;em>substantially&lt;/em> better
than iptables:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-network/3866-nftables-proxy/iptables-vs-nftables.svg" alt="iptables-vs-nftables kube-proxy data plane performance">&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>&lt;code>sync_proxy_rules_nftables_sync_failures_total&lt;/code> indicates the number
of failed syncs; if this number is growing, it indicates the backend
is failing in some way.&lt;/p>
&lt;p>The various generic kube-proxy metrics like
&lt;code>network_programming_duration_seconds&lt;/code> and
&lt;code>sync_proxy_rules_duration_seconds&lt;/code> also exist, and can be used to
check that changes are being processed promptly, and that individual
syncs are taking a reasonable amount of time, respectively.&lt;/p>
&lt;p>It&amp;rsquo;s not clear yet what sort of nftables-specific metrics will be
interesting. For example, in the &lt;code>iptables&lt;/code> backend we have
&lt;code>sync_proxy_rules_iptables_total&lt;/code>, which tells you the total number of
iptables rules kube-proxy has programmed. But the equivalent metric in
the &lt;code>nftables&lt;/code> backend is not going to be as interesting, because many
of the things that are done with rules in the &lt;code>iptables&lt;/code> backend will
be done with maps and sets in the &lt;code>nftables&lt;/code> backend. Likewise, just
tallying &amp;ldquo;total number of rules and set/map elements&amp;rdquo; is not likely to
be useful, because the entire point of sets and maps is that they have
more-or-less &lt;strong>O(1)&lt;/strong> behavior, so knowing the number of elements is
not going to give you much information about how well the system is
likely to be performing.&lt;/p>
&lt;p>(Update while going to GA: it&amp;rsquo;s still not clear. We have not found
ourselves wanting any additional metrics, nor have we received any
requests for additional metrics.)&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric names:
&lt;ul>
&lt;li>&lt;code>network_programming_duration_seconds&lt;/code> (already exists)&lt;/li>
&lt;li>&lt;code>sync_proxy_rules_last_queued_timestamp_seconds&lt;/code> (already exists)&lt;/li>
&lt;li>&lt;code>sync_proxy_rules_last_timestamp_seconds&lt;/code> (already exists)&lt;/li>
&lt;li>&lt;code>sync_proxy_rules_duration_seconds&lt;/code> (already exists)&lt;/li>
&lt;li>&lt;code>sync_proxy_rules_nftables_sync_failures_total&lt;/code>&lt;/li>
&lt;li>&lt;code>sync_proxy_rules_nftables_cleanup_failures_total&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Components exposing the metric:
&lt;ul>
&lt;li>kube-proxy&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>We have now added some metrics to the &lt;code>iptables&lt;/code> mode (e.g.,
&lt;code>kubeproxy_iptables_ct_state_invalid_dropped_packets_total&lt;/code>), allowing
users to be aware of whether they are depending on features that work
differently in the &lt;code>nftables&lt;/code> backend, to help users decide whether
they can migrate to &lt;code>nftables&lt;/code>, and whether they need any non-standard
configuration in order to do so.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>It may require a newer kernel than some current users have. It does
not depend on anything else in the cluster.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No. kube-proxy is still using the same
Service/EndpointSlice-monitoring code, it is just doing different
things locally with the results.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>It is not expected to&amp;hellip;&lt;/p>
&lt;p>We do not currently have any apples-to-apples comparisons; the
&lt;code>nftables&lt;/code> perf job uses more CPU than the corresponding &lt;code>iptables&lt;/code>
job, but this is because it doesn&amp;rsquo;t run with &lt;code>minSyncPeriod: 10s&lt;/code> like
the &lt;code>iptables&lt;/code> job does, and so it syncs rule changes more often.
(However, the fact that it&amp;rsquo;s able to do so without the cluster falling over
is a strong indication that it &lt;em>is&lt;/em> more efficient.)&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>The same way that kube-proxy currently does; updates stop being
processed until the apiserver is available again.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>Initial proposal: 2023-02-01&lt;/li>
&lt;li>Merged: 2023-10-06&lt;/li>
&lt;li>Updates for beta: 2024-05-24&lt;/li>
&lt;li>Updates for GA: 2025-01-15&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>Adding a new officially-supported kube-proxy implementation implies
more work for SIG Network (especially if we are not able to deprecate
either of the existing backends soon).&lt;/p>
&lt;p>Replacing the default kube-proxy implementation will affect many
users.&lt;/p>
&lt;p>However, doing nothing would result in a situation where, eventually,
many users would be unable to use the default proxy implementation.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="continue-to-improve-the-iptables-mode">Continue to improve the &lt;code>iptables&lt;/code> mode&lt;/h3>
&lt;p>We have made many improvements to the &lt;code>iptables&lt;/code> mode, and could make
more. In particular, we could make the &lt;code>iptables&lt;/code> mode use IP sets
like the &lt;code>ipvs&lt;/code> mode does.&lt;/p>
&lt;p>However, even if we could solve literally all of the performance
problems with the &lt;code>iptables&lt;/code> mode, there is still the looming
deprecation issue.&lt;/p>
&lt;p>(See also &amp;ldquo;&lt;a href="#the-iptables-kernel-subsystem-has-unfixable-performance-problems"
>The iptables kernel subsystem has unfixable performance
problems&lt;/a>
&amp;rdquo;.)&lt;/p>
&lt;h3 id="fix-up-the-ipvs-mode">Fix up the &lt;code>ipvs&lt;/code> mode&lt;/h3>
&lt;p>Rather than implementing an entirely new &lt;code>nftables&lt;/code> kube-proxy mode,
we could try to fix up the existing &lt;code>ipvs&lt;/code> mode.&lt;/p>
&lt;p>However, the &lt;code>ipvs&lt;/code> mode makes extensive use of the iptables API in
addition to the IPVS API. So while it solves the performance problems
with the &lt;code>iptables&lt;/code> mode, it does not address the deprecation issue.
So we would at least have to rewrite it to be IPVS+nftables rather
than IPVS+iptables.&lt;/p>
&lt;p>(See also &amp;ldquo;&lt;a href="#the-ipvs-mode-of-kube-proxy-will-not-save-us"
>The &lt;code>ipvs&lt;/code> mode of kube-proxy will not save
us&lt;/a>
&amp;rdquo;.)&lt;/p>
&lt;h3 id="use-an-existing-nftables-based-kube-proxy-implementation">Use an existing nftables-based kube-proxy implementation&lt;/h3>
&lt;p>Discussed in &lt;a href="#notesconstraintscaveats"
>Notes/Constraints/Caveats&lt;/a>
.&lt;/p>
&lt;h3 id="create-an-ebpf-based-proxy-implementation">Create an eBPF-based proxy implementation&lt;/h3>
&lt;p>Another possibility would be to try to replace the &lt;code>iptables&lt;/code> and
&lt;code>ipvs&lt;/code> modes with an eBPF-based proxy backend, instead of an
nftables one. eBPF is very trendy, but it is also notoriously
difficult to work with.&lt;/p>
&lt;p>One problem with this approach is that the APIs to access conntrack
information from eBPF programs only exist in the very newest kernels.
In particular, the API for NATting a connection from eBPF was only
added in the recently-released 6.1 kernel. It will be a long time
before a majority of Kubernetes users have a kernel new enough that we
can depend on that API.&lt;/p>
&lt;p>Thus, an eBPF-based kube-proxy implementation would initially need a
number of workarounds for missing functionality, adding to its
complexity (and potentially forcing architectural choices that would
not otherwise be necessary, to support the workarounds).&lt;/p>
&lt;p>One interesting eBPF-based approach for service proxying is to use
eBPF to intercept the &lt;code>connect()&lt;/code> call in pods, and rewrite the
destination IP before the packets are even sent. In this case, eBPF
conntrack support is not needed (though it would still be needed for
non-local service connections, such as connections via NodePorts). One
nice feature of this approach is that it integrates well with possible
future &amp;ldquo;multi-network Service&amp;rdquo; ideas, in which a pod might connect to
a service IP that resolves to an IP on a secondary network which is
only reachable by certain pods. In the case of a &amp;ldquo;normal&amp;rdquo; service
proxy that does destination IP rewriting in the host network
namespace, this would result in a packet that was undeliverable
(because the host network namespace has no route to the isolated
secondary pod network), but a service proxy that does &lt;code>connect()&lt;/code>-time
rewriting would rewrite the connection before it ever left the pod
network namespace, allowing the connection to proceed.&lt;/p>
&lt;p>The multi-network effort is still in the very early stages, and it is
not clear that it will actually adopt a model of multi-network
Services that works this way. (It is also &lt;em>possible&lt;/em> to make such a
model work with a mostly-host-network-based proxy implementation; it&amp;rsquo;s
just more complicated.)&lt;/p></description></item><item><title>Resources: Add AppArmor Support</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/24/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/24/</guid><description>
&lt;h1 id="add-apparmor-support">Add AppArmor Support&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#background"
>Background&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api"
>API&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#pod-annotations-beta-api"
>Pod Annotations (beta API)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#pod-api"
>Pod API&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#runtimedefault-profile"
>RuntimeDefault Profile&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#localhost-profile"
>Localhost Profile&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#validation"
>Validation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#node-status"
>Node Status&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#failure-and-fallback-strategy"
>Failure and Fallback Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#pod-creation"
>Pod Creation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#pod-security-admission"
>Pod Security Admission&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#pod-update"
>Pod Update&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#podtemplates"
>PodTemplates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#warnings"
>Warnings&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#kubelet-fallback"
>Kubelet fallback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#runtime-profiles"
>Runtime Profiles&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#kubelet-backwards-compatibility"
>Kubelet Backwards compatibility&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#removing-annotation-support"
>Removing annotation support&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature enablement and rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#syncing-fields--annotations-on-workload-resources"
>Syncing fields &amp;amp; annotations on workload resources&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in
&lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and
SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for
publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation e.g., additional design documents, links to
mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This is a proposal to add AppArmor support to the Kubernetes API.&lt;/p>
&lt;p>For GA graduation, this proposal aims to do the &lt;em>bare minimum&lt;/em> to clean up the feature from its beta
release, without blocking future enhancements.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>AppArmor can enable users to run a more secure deployment, and/or provide better auditing and
monitoring of their systems. AppArmor should be supported to provide users an alternative to
SELinux, and provide an interface for users that are already maintaining a set of AppArmor profiles.&lt;/p>
&lt;h3 id="background">Background&lt;/h3>
&lt;p>Kubernetes AppArmor support predates most of our current feature lifecycle practices, including the
KEP process. This KEP is backfilling for current AppArmor support. For the original AppArmor
proposal, see &lt;a href="https://github.com/kubernetes/design-proposals-archive/blob/main/auth/apparmor.md"
target="_blank" rel="noopener">https://github.com/kubernetes/design-proposals-archive/blob/main/auth/apparmor.md&lt;/a>
.&lt;/p>
&lt;p>This KEP is proposing a minimal path to GA, per the
&lt;a href="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-architecture/1635-prevent-permabeta/README.md"
target="_blank" rel="noopener">no perma-Beta requirement&lt;/a>
.
This feature graduation closely parallels that of &lt;a href="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-node/135-seccomp/README.md"
target="_blank" rel="noopener">Seccomp&lt;/a>
.
The notable exceptions are that the AppArmor annotations are immutable on pods, which simplifies the
migration. AppArmor is also feature gated, via the &lt;code>AppArmor&lt;/code> gate.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Allow running Pods with AppArmor confinement&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>This KEP proposes the absolute minimum to provide generally available AppArmor
confinement for Pods and their containers. Further functional enhancements are out of scope,
including:&lt;/p>
&lt;ul>
&lt;li>Defining any standard &amp;ldquo;Kubernetes branded&amp;rdquo; AppArmor profiles&lt;/li>
&lt;li>Formally specifying the AppArmor profile format in Kubernetes&lt;/li>
&lt;li>Providing mechanisms for defining custom profiles using the Kubernetes API, or for
loading profiles from outside of the node.&lt;/li>
&lt;li>Windows support&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Add a new field to the Pod API that allows defining the AppArmor profile. The new field should be
part of the security context.&lt;/p>
&lt;h3 id="api">API&lt;/h3>
&lt;p>Pods and PodTemplate will include an &lt;code>appArmorProfile&lt;/code> field that you can set either for a Pod&amp;rsquo;s
security context or for an individual container. If AppArmor options are defined at both the pod and
container level, the container-level options override the pod options.&lt;/p>
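&lt;p>The override rule can be illustrated with a minimal Go sketch. The types below are simplified stand-ins for the API types shown under &amp;ldquo;Pod API&amp;rdquo;, and &lt;code>effectiveProfile&lt;/code> and &lt;code>newProfile&lt;/code> are hypothetical helpers introduced only for illustration; they are not part of Kubernetes.&lt;/p>

```go
package main

import "fmt"

// Simplified stand-ins for the API types shown under "Pod API".
type AppArmorProfileType string

const (
	AppArmorProfileTypeUnconfined     AppArmorProfileType = "Unconfined"
	AppArmorProfileTypeRuntimeDefault AppArmorProfileType = "RuntimeDefault"
	AppArmorProfileTypeLocalhost      AppArmorProfileType = "Localhost"
)

type AppArmorProfile struct {
	Type             AppArmorProfileType
	LocalhostProfile *string
}

// effectiveProfile is a hypothetical helper (not a Kubernetes function).
// It returns the container-level profile when one is set and otherwise
// falls back to the pod-level profile. A nil result means that no profile
// was requested at either level.
func effectiveProfile(podLevel, containerLevel *AppArmorProfile) *AppArmorProfile {
	if containerLevel != nil {
		return containerLevel
	}
	return podLevel
}

// newProfile is a small illustrative constructor.
func newProfile(t AppArmorProfileType) *AppArmorProfile {
	p := new(AppArmorProfile)
	p.Type = t
	return p
}

func main() {
	podLevel := newProfile(AppArmorProfileTypeRuntimeDefault)
	containerLevel := newProfile(AppArmorProfileTypeUnconfined)

	// The container-level profile wins when both levels are set.
	fmt.Println(effectiveProfile(podLevel, containerLevel).Type)
	// Without a container-level profile, the pod-level one applies.
	fmt.Println(effectiveProfile(podLevel, nil).Type)
}
```

&lt;p>The sketch treats the container-level value as a wholesale replacement rather than merging individual fields, which is how the override is described above.&lt;/p>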
&lt;h4 id="pod-annotations-beta-api">Pod Annotations (beta API)&lt;/h4>
&lt;p>The beta API was defined through annotations on pods.&lt;/p>
&lt;p>The &lt;code>container.apparmor.security.beta.kubernetes.io/&amp;lt;container_name&amp;gt;&lt;/code> annotation will be used to
configure the AppArmor profile that the container named &lt;code>&amp;lt;container_name&amp;gt;&lt;/code> is run with. The
annotation is immutable on Pods.&lt;/p>
&lt;p>Possible annotation values are:&lt;/p>
&lt;ol>
&lt;li>&lt;code>runtime/default&lt;/code> - This explicitly selects the default profile configured by the container
runtime. Absent this annotation, containerd and CRI-O will run non-privileged containers with
this profile by default on AppArmor-enabled (LSM loaded) hosts.&lt;/li>
&lt;li>&lt;code>unconfined&lt;/code> - Run without any AppArmor profile. This is the default for privileged pods.&lt;/li>
&lt;li>&lt;code>localhost/&amp;lt;profile_name&amp;gt;&lt;/code> - Run the container using the &lt;code>&amp;lt;profile_name&amp;gt;&lt;/code> AppArmor profile. The
profile must be pre-loaded into the kernel (typically via &lt;code>apparmor_parser&lt;/code> utility), otherwise
the container will not be started.&lt;/li>
&lt;/ol>
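&lt;p>As a rough illustration of the annotation format, the following Go sketch builds the annotation key for a container and classifies the three values listed above. The function names and the &lt;code>k8s-nginx&lt;/code> profile name are hypothetical, and the sketch is not the validation code used by Kubernetes.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Annotation key prefix and localhost value prefix, as described above.
	annotationKeyPrefix = "container.apparmor.security.beta.kubernetes.io/"
	localhostPrefix     = "localhost/"
)

// annotationKey returns the beta annotation key for the named container.
func annotationKey(containerName string) string {
	return annotationKeyPrefix + containerName
}

// parseAnnotationValue is a hypothetical helper that classifies an
// annotation value as RuntimeDefault, Unconfined, or Localhost. For
// Localhost it also returns the profile name. The empty value, which the
// API validation also accepts, is left out of this sketch.
func parseAnnotationValue(value string) (kind, profile string, err error) {
	switch {
	case value == "runtime/default":
		return "RuntimeDefault", "", nil
	case value == "unconfined":
		return "Unconfined", "", nil
	case strings.HasPrefix(value, localhostPrefix):
		name := strings.TrimPrefix(value, localhostPrefix)
		if name == "" {
			return "", "", fmt.Errorf("localhost profile name must not be empty")
		}
		return "Localhost", name, nil
	default:
		return "", "", fmt.Errorf("unsupported AppArmor annotation value %q", value)
	}
}

func main() {
	fmt.Println(annotationKey("nginx"))
	// "k8s-nginx" is a made-up profile name used only for illustration.
	values := []string{"runtime/default", "unconfined", "localhost/k8s-nginx", "bogus"}
	for _, v := range values {
		kind, profile, err := parseAnnotationValue(v)
		fmt.Println(kind, profile, err)
	}
}
```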
&lt;h4 id="pod-api">Pod API&lt;/h4>
&lt;p>The Pod AppArmor API is generally immutable, except in &lt;code>PodTemplates&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> PodSecurityContext &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The AppArmor options to use by the containers in this pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Note that this field cannot be set when spec.os.name is windows.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppArmorProfile &lt;span style="color:#666">*&lt;/span>AppArmorProfile
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> SecurityContext &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The AppArmor options to use by this container. If AppArmor options are&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// provided at both the pod &amp;amp; container level, the container options&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// override the pod options.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Note that this field cannot be set when spec.os.name is windows.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppArmorProfile &lt;span style="color:#666">*&lt;/span>AppArmorProfile
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AppArmorProfile defines a pod or container&amp;#39;s AppArmor settings.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Only one profile source may be set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +union&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AppArmorProfile &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// type indicates which kind of AppArmor profile will be applied.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Valid options are:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Localhost - a profile pre-loaded on the node.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// RuntimeDefault - the container runtime&amp;#39;s default profile.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Unconfined - no AppArmor enforcement.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +unionDiscriminator&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Type AppArmorProfileType
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// LocalhostProfile indicates a loaded profile on the node that should be used.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The profile must be preconfigured on the node to work.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Must match the loaded name of the profile.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Must only be set if type is &amp;#34;Localhost&amp;#34;.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> LocalhostProfile &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AppArmorProfileType &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppArmorProfileTypeUnconfined AppArmorProfileType = &lt;span style="color:#b44">&amp;#34;Unconfined&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppArmorProfileTypeRuntimeDefault AppArmorProfileType = &lt;span style="color:#b44">&amp;#34;RuntimeDefault&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppArmorProfileTypeLocalhost AppArmorProfileType = &lt;span style="color:#b44">&amp;#34;Localhost&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This API makes the options more explicit and leaves room for new profile sources
to be added in the future (e.g. Kubernetes predefined profiles or ConfigMap
profiles) and for future extensions, such as defining the behavior when a
profile cannot be set.&lt;/p>
&lt;h5 id="runtimedefault-profile">RuntimeDefault Profile&lt;/h5>
&lt;p>We propose maintaining support for a single runtime profile, which will be
defined by using &lt;code>AppArmorProfileTypeRuntimeDefault&lt;/code>. The reasons are:&lt;/p>
&lt;ul>
&lt;li>No changes to the current behavior. Users are currently not allowed to specify
other runtime profiles. The existing API server rejects runtime profile names
that are different than &lt;code>runtime/default&lt;/code>.&lt;/li>
&lt;li>Most runtimes only support the default profile, although the CRI is flexible
enough to allow the kubelet to send other profile names.&lt;/li>
&lt;li>Support for multiple runtime profiles has never been requested as a feature.&lt;/li>
&lt;/ul>
&lt;p>If built-in support for multiple runtime profiles is needed in the future, a new
KEP will be created to cover its details.&lt;/p>
&lt;h5 id="localhost-profile">Localhost Profile&lt;/h5>
&lt;p>This KEP proposes LocalhostProfile as the only source of user-defined
profiles. User-defined profiles are essential for users to realize
the full benefits of AppArmor, allowing them to decrease their attack
surface based on their own workloads.&lt;/p>
&lt;h6 id="updating-localhost-apparmor-profiles">Updating localhost AppArmor profiles&lt;/h6>
&lt;p>AppArmor profiles are applied at container creation time. The underlying
container runtime only references already-loaded profiles by name.
Therefore, updating a profile&amp;rsquo;s content requires a manual reload (typically via
&lt;code>apparmor_parser&lt;/code>).&lt;/p>
&lt;p>Note that changing profiles is not recommended and may cause containers to fail
on their next restart if the new profile is more restrictive or invalid, or if
the profile file is no longer available on the host.&lt;/p>
&lt;p>Currently, users have no way to tell whether their physical profiles have been
deleted or modified. This KEP proposes no changes to the existing functionality.&lt;/p>
&lt;p>The recommended approach for rolling out changes to AppArmor profiles is to
always create &lt;em>new profiles&lt;/em> instead of updating existing ones. Create and
deploy a new version of the existing Pod Template, changing the profile name to
the newly created profile. Once the redeployed workload is working, delete the
former Pod Template. This avoids disrupting in-flight workloads.&lt;/p>
&lt;p>The current behavior lacks features to facilitate the maintenance of AppArmor
profiles across the cluster. Two examples are 1) the lack of profile
synchronization across nodes and 2) the difficulty of detecting that a profile
has been changed on disk or in memory after pods started using it. However,
Kubernetes-managed profiles are out of scope for this KEP.
Out-of-tree enhancements like the
&lt;a href="https://github.com/kubernetes-sigs/security-profiles-operator"
target="_blank" rel="noopener">security-profiles-operator&lt;/a>
can
provide such enhanced functionality on top.&lt;/p>
&lt;h6 id="profiles-managed-by-the-cluster-admins">Profiles managed by the cluster admins&lt;/h6>
&lt;p>The current support relies on profiles being loaded on all cluster nodes
where the pods using them may be scheduled. It is also the cluster admin&amp;rsquo;s
responsibility to ensure the profiles are correctly saved and synchronized
across all nodes. Existing mechanisms like node &lt;code>labels&lt;/code> and &lt;code>nodeSelectors&lt;/code>
can be used to ensure that pods are scheduled on nodes supporting their desired
profiles.&lt;/p>
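&lt;p>To illustrate how labels and node selectors can steer pods towards nodes that have a profile loaded, here is a minimal Go sketch of node selector matching. The &lt;code>matchesNodeSelector&lt;/code> helper and the &lt;code>example.com/apparmor-k8s-nginx&lt;/code> label key are hypothetical; a cluster admin would choose their own labelling scheme, and the real matching is done by the scheduler.&lt;/p>

```go
package main

import "fmt"

// matchesNodeSelector reports whether every key/value pair in selector is
// present in nodeLabels. An empty selector matches every node.
func matchesNodeSelector(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical label that an admin applies once the profile is loaded.
	selector := map[string]string{"example.com/apparmor-k8s-nginx": "loaded"}

	nodeWithProfile := map[string]string{
		"kubernetes.io/os":               "linux",
		"example.com/apparmor-k8s-nginx": "loaded",
	}
	nodeWithoutProfile := map[string]string{
		"kubernetes.io/os": "linux",
	}

	fmt.Println(matchesNodeSelector(nodeWithProfile, selector))
	fmt.Println(matchesNodeSelector(nodeWithoutProfile, selector))
}
```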
&lt;h4 id="validation">Validation&lt;/h4>
&lt;p>The following validations were applied to the AppArmor annotations on pods:&lt;/p>
&lt;ul>
&lt;li>Pod annotations are immutable (cannot be added, modified, or removed on pod update)&lt;/li>
&lt;li>Annotation value must have a &lt;code>localhost/&lt;/code> prefix, or be one of: &lt;code>&amp;quot;&amp;quot;&lt;/code>, &lt;code>runtime/default&lt;/code>, &lt;code>unconfined&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>The annotation validations will be carried over to the field API, and the following additional
validations are proposed:&lt;/p>
&lt;ol>
&lt;li>Fields must match the corresponding annotations when both are present, except for ephemeral containers.&lt;/li>
&lt;li>AppArmor profile must be unset on Windows pods (&lt;code>spec.os.name == &amp;quot;windows&amp;quot;&lt;/code>). Only enforced on fields.&lt;/li>
&lt;li>Localhost profile must not be empty, and must not be padded with whitespace. Only enforced on creation.
This was previously enforced by the &lt;a href="https://github.com/kubernetes/kubernetes/blob/2624e93d55375a9642977d4d5795841ab7463b1d/pkg/security/apparmor/validate.go#L70-L77"
target="_blank" rel="noopener">Kubelet&lt;/a>
.&lt;/li>
&lt;/ol>
&lt;p>&lt;em>Note on localhost profile validation:&lt;/em> AppArmor profile naming is flexible, but both of the leading
CRI implementations (containerd &amp;amp; cri-o) require a profile with a matching name to be loaded. This
prevents the special &lt;code>unconfined&lt;/code> profile, or various wildcard and variable profile names from being
used in practice. This validation is deferred to the runtime, rather than being enforced by the API
for backwards compatibility.&lt;/p>
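&lt;p>The field-level rules above can be sketched in Go as follows. This is an illustration under simplifying assumptions, not the API server&amp;rsquo;s validation code: the types are reduced copies of the ones under &amp;ldquo;Pod API&amp;rdquo;, &lt;code>validateProfile&lt;/code> and &lt;code>strPtr&lt;/code> are hypothetical helpers, and the annotation/field consistency check and the creation-only scoping are omitted.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Reduced copies of the API types shown under "Pod API".
type AppArmorProfileType string

const (
	Unconfined     AppArmorProfileType = "Unconfined"
	RuntimeDefault AppArmorProfileType = "RuntimeDefault"
	Localhost      AppArmorProfileType = "Localhost"
)

type AppArmorProfile struct {
	Type             AppArmorProfileType
	LocalhostProfile *string
}

// validateProfile is a hypothetical sketch of the field validations. It is
// called only when a profile is set, so any profile on a Windows pod fails.
func validateProfile(p AppArmorProfile, osName string) error {
	if osName == "windows" {
		return errors.New("AppArmor profile must be unset on Windows pods")
	}
	switch p.Type {
	case Unconfined, RuntimeDefault:
		if p.LocalhostProfile != nil {
			return errors.New("localhostProfile must only be set if type is Localhost")
		}
	case Localhost:
		if p.LocalhostProfile == nil || *p.LocalhostProfile == "" {
			return errors.New("localhostProfile must not be empty")
		}
		if strings.TrimSpace(*p.LocalhostProfile) != *p.LocalhostProfile {
			return errors.New("localhostProfile must not be padded with whitespace")
		}
	default:
		return fmt.Errorf("unsupported AppArmor profile type %q", p.Type)
	}
	return nil
}

// strPtr returns a pointer to a copy of s.
func strPtr(s string) *string {
	p := new(string)
	*p = s
	return p
}

func main() {
	// "k8s-nginx" is a made-up profile name used only for illustration.
	ok := AppArmorProfile{Type: Localhost, LocalhostProfile: strPtr("k8s-nginx")}
	padded := AppArmorProfile{Type: Localhost, LocalhostProfile: strPtr(" k8s-nginx")}

	fmt.Println(validateProfile(ok, "linux"))
	fmt.Println(validateProfile(padded, "linux"))
	fmt.Println(validateProfile(ok, "windows"))
}
```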
&lt;h4 id="node-status">Node Status&lt;/h4>
&lt;p>The Kubelet SHOULD NOT append the AppArmor status to the node ready condition message.&lt;/p>
&lt;p>The ready condition is certainly not the right place for this message, but more generally the
kubelet does not broadcast the status of every optional feature. (A beta implementation of
this feature, added before the Kubernetes enhancement process was formalized, did customize
the node ready condition message).&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>When an AppArmor profile is set on a container (or pod), the kubelet will pass the option on to the
container runtime, which is responsible for running the container with the desired profile. Profiles
must be loaded into the kernel before the container is started (profile loading is out of scope for
this KEP). For more details, see &lt;a href="https://kubernetes.io/docs/tutorials/security/apparmor/"
target="_blank" rel="noopener">https://kubernetes.io/docs/tutorials/security/apparmor/&lt;/a>
.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;p>None&lt;/p>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/0ff6a00fafee08467d946ab18c7839d9704d27d5/pkg/api/pod/util_test.go#L706"
target="_blank" rel="noopener">&lt;code>TestDropAppArmor&lt;/code>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/validation/validation_test.go"
target="_blank" rel="noopener">Pod validation tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/0ff6a00fafee08467d946ab18c7839d9704d27d5/pkg/kubelet/nodestatus/setters_test.go#L1480"
target="_blank" rel="noopener">&lt;code>TestReadyCondition&lt;/code>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/security/apparmor/validate_test.go"
target="_blank" rel="noopener">Host validation tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/pod-security-admission/policy/check_appArmorProfile_test.go"
target="_blank" rel="noopener">Pod Security Admission policy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>New tests will be added covering the annotation/field conflict cases described
under &lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
.&lt;/p>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>Pod Security tests: &lt;a href="https://github.com/kubernetes/kubernetes/blob/1ded677b2a77a764a0a0adfa58180c3705242c49/test/integration/auth/podsecurity_test.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/1ded677b2a77a764a0a0adfa58180c3705242c49/test/integration/auth/podsecurity_test.go&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/2f6c4f5eab85d3f15cd80d21f4a0c353a8ceb10b/test/e2e_node/apparmor_test.go"
target="_blank" rel="noopener">AppArmor node E2E&lt;/a>
&lt;/p>
&lt;ul>
&lt;li>These tests are guarded by the &lt;code>[Feature:AppArmor]&lt;/code> tag and run as part of the
&lt;a href="https://testgrid.k8s.io/sig-node-containerd#node-e2e-features"
target="_blank" rel="noopener">containerd E2E features&lt;/a>
test suite.&lt;/li>
&lt;/ul>
&lt;p>The E2E tests will be migrated to the field-based API.&lt;/p>
&lt;h3 id="failure-and-fallback-strategy">Failure and Fallback Strategy&lt;/h3>
&lt;p>There are different scenarios in which applying an AppArmor profile may fail.
Below are the ones we identified and their outcomes once this KEP is implemented:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Scenario&lt;/th>
&lt;th>API Server Result&lt;/th>
&lt;th>Kubelet Result&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1) Using localhost or explicit &lt;code>runtime/default&lt;/code> profile when container runtime does not support AppArmor.&lt;/td>
&lt;td>Pod created&lt;/td>
&lt;td>The outcome is container runtime dependent. In this scenario containers may 1) fail to start or 2) run normally without having their policies enforced.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2) Using custom or &lt;code>runtime/default&lt;/code> profile that restricts actions a container is trying to make.&lt;/td>
&lt;td>Pod created&lt;/td>
&lt;td>The outcome is workload and AppArmor dependent. In this scenario containers may 1) fail to start, 2) misbehave or 3) log violations.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3) Using a localhost profile that does not exist on the node.&lt;/td>
&lt;td>Pod created&lt;/td>
&lt;td>Container runtime dependent: containers fail to start. Retry respecting RestartPolicy and back-off delay. Error message in event.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4) Using an unsupported runtime profile (e.g. &lt;code>runtime/default-audit&lt;/code>).&lt;/td>
&lt;td>Fails validation: pod &lt;strong>not&lt;/strong> created.&lt;/td>
&lt;td>N/A&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5) Using localhost or explicit &lt;code>runtime/default&lt;/code> profile when AppArmor is disabled by the host or build&lt;/td>
&lt;td>Pod created.&lt;/td>
&lt;td>Kubelet puts Pod in blocked state.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6) Using implicit (default) &lt;code>runtime/default&lt;/code> profile when AppArmor is disabled by the host or build.&lt;/td>
&lt;td>Pod created&lt;/td>
&lt;td>Container created without AppArmor enforcement.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7) Using localhost profile with invalid (empty) name&lt;/td>
&lt;td>Fails validation: pod &lt;strong>not&lt;/strong> created.&lt;/td>
&lt;td>N/A&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Scenario 2 is the expected behavior of using AppArmor and it is included here
for completeness.&lt;/p>
&lt;p>Scenario 7 represents the case of failing the existing validation, which is
defined at &lt;a href="#pod-api"
>Pod API&lt;/a>
.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>All API skew is resolved in the API server.&lt;/p>
&lt;h4 id="pod-creation">Pod Creation&lt;/h4>
&lt;p>If no AppArmor annotations or fields are specified, no action is necessary.&lt;/p>
&lt;p>If the &lt;code>AppArmor&lt;/code> feature is disabled per feature gate, then the annotations and
fields are cleared (&lt;a href="https://github.com/kubernetes/kubernetes/blob/f58f70bd5730658505042cd9baa80f72d3b6e31e/pkg/api/pod/util.go#L526-L532"
target="_blank" rel="noopener">current behavior&lt;/a>
).&lt;/p>
&lt;p>If the pod&amp;rsquo;s OS is &lt;code>windows&lt;/code>, fields are forbidden to be set and annotations
are not copied to the corresponding fields.&lt;/p>
&lt;p>If &lt;em>only&lt;/em> AppArmor fields are specified, add the corresponding annotations. If these
are specified at the Pod level, copy the annotations to each container that does
not have annotations already specified. This ensures that the fields are enforced
even if the node version trails the API version (see &lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
).&lt;/p>
&lt;p>If &lt;em>only&lt;/em> AppArmor annotations are specified, copy the values into the
corresponding fields. This ensures that existing applications continue to
enforce AppArmor, and prevents the kubelet from needing to resolve annotations &amp;amp;
fields. If the annotation is empty, then the &lt;code>runtime/default&lt;/code> profile will be
used by the CRI container runtime. If a localhost profile is specified, then
container runtimes will strip the &lt;code>localhost/&lt;/code> prefix, too. This will be covered
by e2e tests during the GA promotion.&lt;/p>
&lt;p>If both AppArmor annotations &lt;em>and&lt;/em> fields are specified, the values MUST match.
This will be enforced in API validation.&lt;/p>
&lt;p>Container-level AppArmor profiles override anything set at the pod-level.&lt;/p>
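&lt;p>The creation-time rules above can be sketched roughly as follows. This is a minimal
illustration in Python, not the API server&amp;rsquo;s implementation: the function names and the
&lt;code>(type, localhostProfile)&lt;/code> tuple representation are hypothetical, the annotation key
prefix is assumed to be the existing beta AppArmor annotation, and the empty-annotation case is
left out.&lt;/p>

```python
from typing import Optional

# Assumed beta annotation key prefix; the container name is appended to it.
ANNOTATION_PREFIX = "container.apparmor.security.beta.kubernetes.io/"


def field_to_annotation(profile_type: str, localhost_profile: Optional[str] = None) -> str:
    """Translate a field-style profile into the annotation value format."""
    if profile_type == "RuntimeDefault":
        return "runtime/default"
    if profile_type == "Unconfined":
        return "unconfined"
    if profile_type == "Localhost":
        if not localhost_profile:
            raise ValueError("localhost profile name must not be empty")
        return "localhost/" + localhost_profile
    raise ValueError("unknown profile type: " + profile_type)


def annotation_to_field(value: str) -> tuple:
    """Translate an annotation value into a (type, localhostProfile) tuple."""
    if value == "runtime/default":
        return ("RuntimeDefault", None)
    if value == "unconfined":
        return ("Unconfined", None)
    if value.startswith("localhost/"):
        name = value[len("localhost/"):]
        if not name:
            raise ValueError("localhost profile name must not be empty")
        return ("Localhost", name)
    # Any other value, including other runtime/ profiles, is rejected.
    raise ValueError("must be a valid AppArmor profile: " + value)


def effective_field(pod_field: Optional[tuple], container_field: Optional[tuple]) -> Optional[tuple]:
    """Container-level profiles override anything set at the pod level."""
    return container_field if container_field is not None else pod_field


def reconcile(container: str, annotations: dict, field: Optional[tuple]):
    """Return (annotations, field) after syncing one container on pod create."""
    key = ANNOTATION_PREFIX + container
    annotation = annotations.get(key)
    synced = dict(annotations)
    if annotation is None and field is None:
        return synced, None  # nothing specified: no action
    if annotation is None:
        synced[key] = field_to_annotation(*field)  # only field: add annotation
        return synced, field
    if field is None:
        return synced, annotation_to_field(annotation)  # only annotation: fill field
    if field_to_annotation(*field) != annotation:
        raise ValueError("AppArmor annotation and field must match")
    return synced, field
```

&lt;p>For example, under these assumptions
&lt;code>reconcile("app", {}, ("Localhost", "my-profile"))&lt;/code> would add a
&lt;code>localhost/my-profile&lt;/code> annotation for the &lt;code>app&lt;/code> container, while a
mismatched annotation and field would raise a validation error.&lt;/p>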
&lt;h4 id="pod-security-admission">Pod Security Admission&lt;/h4>
&lt;p>The Pod Security admission plugin will be updated to evaluate AppArmorProfile fields in addition to
annotations.&lt;/p>
&lt;p>The &lt;a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#baseline"
target="_blank" rel="noopener">policy for the &lt;strong>baseline&lt;/strong> Pod security standard&lt;/a>
forbids setting an &lt;code>Unconfined&lt;/code> profile, but allows unset, &lt;code>RuntimeDefault&lt;/code> and &lt;code>Localhost&lt;/code>
profiles. In the case of localhost profiles, this can include OS profiles intended for other system
daemons, so additional profile restrictions are encouraged (e.g. via
&lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/"
target="_blank" rel="noopener">ValidatingAdmissionPolicy&lt;/a>
).&lt;/p>
&lt;h4 id="pod-update">Pod Update&lt;/h4>
&lt;p>The AppArmor fields on a pod are immutable, which also applies to the
&lt;a href="https://github.com/kubernetes/kubernetes/blob/b46612a74224b0871a97dae819f5fb3a1763d0b9/pkg/apis/core/validation/validation.go#L177-L182"
target="_blank" rel="noopener">annotation&lt;/a>
.&lt;/p>
&lt;p>When an &lt;a href="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-node/277-ephemeral-containers/README.md"
target="_blank" rel="noopener">Ephemeral Container&lt;/a>
is added, it
will follow the same rules for using or overriding the pod&amp;rsquo;s AppArmor profile.
Ephemeral containers will never sync with an AppArmor annotation.&lt;/p>
&lt;h4 id="podtemplates">PodTemplates&lt;/h4>
&lt;p>PodTemplates (and their embeddings within e.g. ReplicaSets, Deployments, StatefulSets, etc.) will be
ignored. The field/annotation resolution will happen on template instantiation.&lt;/p>
&lt;h4 id="warnings">Warnings&lt;/h4>
&lt;p>To raise awareness of workloads using the beta AppArmor annotations that need to be migrated, a
warning will be emitted when only AppArmor annotations are set (no fields) on pod creation, or pod
template (including workload resources with an embedded pod template) create &amp;amp; update.&lt;/p>
&lt;h4 id="kubelet-fallback">Kubelet fallback&lt;/h4>
&lt;p>Since Kubelet versions must not be ahead of API versions, Kubelets can defer annotation/field
resolution to the API server, and only consider the AppArmor fields.&lt;/p>
&lt;p>The exception to this is static pods. In this case, Kubelet will copy annotation values to fields in the
&lt;a href="https://github.com/kubernetes/kubernetes/blob/2363cdcc399cbf428210efb2c51575ddcad2b84a/pkg/kubelet/config/common.go#L57C6-L57C19"
target="_blank" rel="noopener">&lt;code>applyDefaults&lt;/code>&lt;/a>
function. In this case, Kubelet will also log a warning.&lt;/p>
&lt;h4 id="runtime-profiles">Runtime Profiles&lt;/h4>
&lt;p>The API Server will continue to reject annotations with runtime profiles
other than &lt;code>runtime/default&lt;/code>, to maintain the existing behavior.&lt;/p>
&lt;p>Violations would lead to the error message:&lt;/p>
&lt;pre tabindex="0">&lt;code>Invalid value: &amp;#34;runtime/profile-name&amp;#34;: must be a valid AppArmor profile
&lt;/code>&lt;/pre>&lt;h4 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h4>
&lt;p>Nodes do not currently support in-place upgrades, so pods will be recreated on
node upgrade and downgrade. No special handling or consideration is needed to
support this.&lt;/p>
&lt;p>On the API server side, we&amp;rsquo;ve already taken version skew in HA clusters into
account. The same precautions make upgrade &amp;amp; downgrade handling a non-issue.&lt;/p>
&lt;p>Since
&lt;a href="https://kubernetes.io/docs/setup/release/version-skew-policy/"
target="_blank" rel="noopener">we support&lt;/a>
up
to 2 minor releases of version skew between the master and node, annotations
must continue to be supported and backfilled for at least 2 versions past the
initial implementation. Specifically, fields will no longer be copied to annotations for older kubelet
versions. However, annotations submitted to the API server will continue to be copied to fields at the
kubelet indefinitely, as was done with Seccomp.&lt;/p>
&lt;h5 id="kubelet-backwards-compatibility">Kubelet Backwards compatibility&lt;/h5>
&lt;p>Since we don&amp;rsquo;t support running newer Kubelets than API server, new Kubelets only need to handle
AppArmor fields. All the version skew resolution happens within the API server.&lt;/p>
&lt;h4 id="removing-annotation-support">Removing annotation support&lt;/h4>
&lt;p>&lt;em>(Assuming field support merges in 1.30, otherwise adjust all versions a constant amount)&lt;/em>&lt;/p>
&lt;p>Phase 1 (v1.30): AppArmor field support merged&lt;/p>
&lt;ul>
&lt;li>Sync annotations &amp;amp; fields on Pod create (version skew strategy described above)&lt;/li>
&lt;li>Warn on annotation use, if field isn&amp;rsquo;t set&lt;/li>
&lt;li>Kubelet copies static pod annotations to fields&lt;/li>
&lt;/ul>
&lt;p>Phase 2 (v1.34):&lt;/p>
&lt;ul>
&lt;li>API server stops copying fields to annotations&lt;/li>
&lt;li>Warn on annotation use if there is no corresponding &lt;em>container&lt;/em> field (including on workload resources)&lt;/li>
&lt;li>&lt;strong>Risk:&lt;/strong> policy controllers that don&amp;rsquo;t consider field values&lt;/li>
&lt;/ul>
&lt;p>Phase 3 (v1.36): End state&lt;/p>
&lt;ul>
&lt;li>API server stops copying annotations to fields&lt;/li>
&lt;li>Kubelet stops copying annotations to fields for static pods&lt;/li>
&lt;li>Validation that annotations &amp;amp; fields match persists indefinitely&lt;/li>
&lt;li>&lt;strong>Risk:&lt;/strong> workloads that haven&amp;rsquo;t migrated&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>&lt;strong>General Availability:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Field-based API&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature enablement and rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;p>AppArmor is controlled by the &lt;code>AppArmor&lt;/code> feature gate (already beta by the time this KEP was
formally opened).&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate
&lt;ul>
&lt;li>Feature gate name: &lt;code>AppArmor&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver&lt;/li>
&lt;li>kubelet&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No - AppArmor has been enabled by default since Kubernetes v1.4.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. Containers already running with AppArmor enforcement will continue to do so, but on restart
will fall back to the container runtime default. Pods created with AppArmor disabled will have their
fields &amp;amp; annotations stripped.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>Newly started or restarted containers in pods that still have the AppArmor field/annotations will
have the specified AppArmor profile applied, rather than the runtime default.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The &lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
section covers this point.
Running workloads should not be affected, as the Kubelet will support either the
existing annotations or the new fields introduced by this KEP.&lt;/p>
&lt;p>Disabling the AppArmor feature will cause the container runtimes to apply the runtime default
profile (except for privileged pods). In cases where a user was expecting to apply a custom profile
(or explicitly unconfined), this could break the workload.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>An increase in pod validation errors can indicate issues with the field translation. These would
show up as &lt;code>code=400&lt;/code> (Bad Request) errors in &lt;code>apiserver_request_total&lt;/code>.&lt;/p>
&lt;p>The following errors could indicate problems with how kubelets are interpreting AppArmor profiles.&lt;/p>
&lt;ul>
&lt;li>&lt;code>started_containers_errors_total&lt;/code>&lt;/li>
&lt;li>&lt;code>started_pods_errors_total&lt;/code>&lt;/li>
&lt;/ul>
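&lt;p>As a rough illustration of the first signal, the share of &lt;code>code=400&lt;/code> responses for
pod requests can be computed from a scrape of the API server metrics. The sketch below is an
assumption-laden example, not part of this KEP: the function name is hypothetical, it assumes the
Prometheus text exposition format without timestamps, and it assumes the &lt;code>code&lt;/code> and
&lt;code>resource&lt;/code> labels on &lt;code>apiserver_request_total&lt;/code>.&lt;/p>

```python
import re

# One sample line of the text exposition format: name{label="value",...} 123
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def bad_request_ratio(metrics_text: str, resource: str = "pods") -> float:
    """Fraction of apiserver_request_total samples for `resource` with code=400."""
    total = 0.0
    bad = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        # Skip comments, blank lines, and other metrics.
        if match is None or match.group(1) != "apiserver_request_total":
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        if labels.get("resource") != resource:
            continue
        value = float(match.group(3))
        total += value
        if labels.get("code") == "400":
            bad += value
    return bad / total if total else 0.0
```

&lt;p>The input could be the output of &lt;code>kubectl get --raw /metrics&lt;/code>. Because the counters
are cumulative, comparing the ratio between two scrapes taken before and after a rollout is more
informative than a single value.&lt;/p>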
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Automated tests will cover the scenarios with and without the changes proposed
in this KEP. As defined under &lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
,
we assume the cluster may have kubelets with older versions (without
this KEP&amp;rsquo;s changes), so this will be covered as part of the new tests.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>The promotion of AppArmor to GA would deprecate the beta annotations as described in the
&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>The feature is built into the kubelet and api server components. No metric is
planned at this moment. The way to determine usage is by checking whether the
pods/containers have an AppArmorProfile set.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>The AppArmor enforcement status is not directly surfaced by Kubernetes, but is visible through the
Linux proc API. For example, you can check what profile a container is running with by execing into it:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>$ kubectl &lt;span style="color:#a2f">exec&lt;/span> -n &lt;span style="color:#b8860b">$NAMESPACE&lt;/span> &lt;span style="color:#b8860b">$POD_NAME&lt;/span> -- cat /proc/1/attr/current
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>k8s-apparmor-example-deny-write &lt;span style="color:#666">(&lt;/span>enforce&lt;span style="color:#666">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Negligible increase in Pod object size, and any objects embedding a PodSpec.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No. AppArmor profiles are managed outside of Kubernetes, and without this feature enabled the
runtime default AppArmor profile is still enforced on non-privileged containers (for AppArmor
enabled hosts).&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No impact to running workloads.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>No impact on running workloads is foreseen, given the nature of the
changes brought by this KEP.&lt;/p>
&lt;p>General errors and failures are described in &lt;a href="#failure-and-fallback-strategy"
>Failure and Fallback
Strategy&lt;/a>
.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2016-07-25: &lt;a href="https://github.com/kubernetes/design-proposals-archive/blob/main/auth/apparmor.md"
target="_blank" rel="noopener">AppArmor design proposal&lt;/a>
&lt;/li>
&lt;li>2016-09-26: AppArmor beta release with v1.4&lt;/li>
&lt;li>2020-01-10: Initial (retrospective) KEP&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;ul>
&lt;li>Custom AppArmor profiles are not fully managed by Kubernetes&lt;/li>
&lt;li>AppArmor support adds a dimension to the feature compatibility matrix, as support is not
guaranteed on Linux&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="syncing-fields--annotations-on-workload-resources">Syncing fields &amp;amp; annotations on workload resources&lt;/h3>
&lt;p>AppArmor fields &amp;amp; annotations on Pods are immutable, which means that syncing fields &amp;amp; annotations
is a one-time operation. This is not true for workload resources (ReplicaSets, Deployments, etc).&lt;/p>
&lt;p>In order to support syncing fields on workload resources, we need to account for clients that only
pay attention to one of the field/annotation settings. When combined with the validation requirement
that fields &amp;amp; annotations match, getting this right in both the patch &amp;amp; update cases adds
significant complexity.&lt;/p></description></item><item><title>Resources: Add CDI devices to device plugin API</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4009/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4009/</guid><description>
&lt;h1 id="kep-4009-add-cdi-devices-to-device-plugin-api">KEP-4009: Add CDI devices to device plugin API&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alpha-to-beta-graduation"
>Alpha to Beta Graduation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta-to-ga-graduation"
>Beta to G.A Graduation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP proposes extending the Device Plugin API, adding a field to specify
Container Device Interface (CDI) device IDs in the &lt;code>AllocateResponse&lt;/code>. This
supplements the existing fields such as annotations and allows device plugin
implementations to uniquely specify devices using their fully-qualified CDI
device names.&lt;/p>
&lt;p>The recent addition of CDI device IDs to the CRI structures in &lt;a href="https://github.com/kubernetes/enhancements/pull/3731"
target="_blank" rel="noopener">#3731&lt;/a>
allows these IDs to be forwarded to the CRI runtimes in a secure manner. Although
these changes were motivated by &lt;a href="https://github.com/kubernetes/enhancements/issues/3063"
target="_blank" rel="noopener">KEP-3063&lt;/a>
, adding support for these fields to the
existing device plugin API allows this mechanism to also be used for devices
supported by these plugins.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>The Container Device Interface (CDI) provides a standard mechanism for device
vendors to describe what is required to provide access to a specific resource
such as a GPU. These resources can be uniquely identified using a
fully-qualified CDI device name.&lt;/p>
&lt;p>The changes proposed in &lt;a href="https://github.com/kubernetes/enhancements/pull/3731"
target="_blank" rel="noopener">#3731&lt;/a>
extend the CRI to provide a well-defined mechanism for forwarding such
requests to CRI runtimes such as containerd and CRI-O. These have already
been extended to accept CDI device requests, and to use the associated CDI
specifications to ensure that the required
modifications are made to the OCI runtime specification for a container being
launched.&lt;/p>
&lt;p>The addition of an explicit field for specifying CDI device names to the Device
Plugin API allows this CRI field to be used to indicate which devices should be
injected. This removes the need to use workarounds such as container annotations
to pass this information to the runtimes and allows Device Plugin authors to
adopt CDI to inject devices without requiring that users move to a Dynamic
Resource Allocation (DRA) based implementation.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Allow Device Plugin authors to forward device requests to CRI runtimes as a CRI field.&lt;/li>
&lt;li>Allow Device Plugin authors to use CDI to define the modifications required for containerised environments.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We propose a mechanism for device plugin authors to specify devices using Container Device Interface (CDI) names. The names of the requested devices are passed down as CRI fields to CRI runtimes which are ultimately responsible for making the requested devices accessible from a container.&lt;/p>
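&lt;p>To make the notion of a CDI name concrete, the following minimal sketch splits a
fully-qualified name of the assumed form &lt;code>vendor/class=device&lt;/code> (for example
&lt;code>vendor.com/gpu=gpudevice1&lt;/code>) into its parts. The &lt;code>CDIName&lt;/code> type and
&lt;code>parse_cdi_name&lt;/code> helper are hypothetical and do not check the character-set rules of
the CDI specification; real device plugins would rely on the CDI libraries instead.&lt;/p>

```python
from typing import NamedTuple


class CDIName(NamedTuple):
    """Parts of a fully-qualified CDI device name (hypothetical helper type)."""
    vendor: str
    device_class: str
    device: str


def parse_cdi_name(name: str) -> CDIName:
    """Split 'vendor/class=device' into its three components."""
    kind, equals, device = name.partition("=")
    if not equals or not device:
        raise ValueError("missing device part in CDI name: " + name)
    vendor, slash, device_class = kind.partition("/")
    if not slash or not vendor or not device_class:
        raise ValueError("missing vendor or class in CDI name: " + name)
    return CDIName(vendor, device_class, device)
```

&lt;p>A device plugin would place such fully-qualified names, unparsed, in its allocation response;
splitting them as above is only useful for validation or logging.&lt;/p>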
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>This adds a repeated &lt;code>CDIDevice&lt;/code> field to the existing &lt;code>ContainerAllocateResponse&lt;/code> returned as part of the
&lt;code>AllocateResponse&lt;/code> in the Device Plugin API. This matches the modifications made to the Dynamic Resource Allocation API in &lt;a href="https://github.com/kubernetes/enhancements/pull/3731"
target="_blank" rel="noopener">#3731&lt;/a>
.&lt;/p>
&lt;p>The values contained in this field are then used to populate the corresponding field in the CRI
which is passed to the container runtimes. In addition, annotations with a &lt;code>cdi.k8s.io&lt;/code> prefix will be
added to the CRI to allow for consumption in container runtimes that do not yet support the
CRI field directly, but do support device requests through annotations.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-protobuf" data-lang="protobuf">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// CDIDevice specifies a CDI device information.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span>&lt;span style="color:#a2f;font-weight:bold">message&lt;/span> &lt;span style="color:#00f">CDIDevice&lt;/span> {&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Fully qualified CDI device name
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// for example: vendor.com/gpu=gpudevice1
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// see more details in the CDI specification:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// https://github.com/container-orchestrated-devices/container-device-interface/blob/main/SPEC.md
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> name &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">1&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>&lt;span style="color:#a2f;font-weight:bold">message&lt;/span> &lt;span style="color:#00f">ContainerAllocateResponse&lt;/span> {&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// List of environment variable to be set in the container to access one of more devices.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> map&amp;lt;&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>, &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>&amp;gt; envs &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">1&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Mounts for the container.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#a2f;font-weight:bold">repeated&lt;/span> Mount mounts &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">2&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Devices for the container.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#a2f;font-weight:bold">repeated&lt;/span> DeviceSpec devices &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">3&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Container annotations to pass to the container runtime
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> map&amp;lt;&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>, &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>&amp;gt; annotations &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">4&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// CDI devices for the container.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#a2f;font-weight:bold">repeated&lt;/span> CDIDevice cdi_devices &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">5&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>devicemanager&lt;/code>: &lt;code>2023-06-15&lt;/code> - &lt;code>85.1%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>There are currently no integration tests for device plugins.
We do not plan to add any for this feature.&lt;/p>
&lt;p>However, these cases will be added in the existing integration tests:&lt;/p>
&lt;ul>
&lt;li>Feature gate enable/disable tests&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>This test case has been added to the existing &lt;code>e2e_node&lt;/code> tests:&lt;/p>
&lt;ul>
&lt;li>DevicePlugin can make a CDI device accessible in a container&lt;/li>
&lt;/ul>
&lt;p>Links to test grid:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cdi-device-plugins"
target="_blank" rel="noopener">https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cdi-device-plugins&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>Links to k8s-triage for tests:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://storage.googleapis.com/k8s-triage/index.html?job=ci-crio-cdi-device-plugins"
target="_blank" rel="noopener">https://storage.googleapis.com/k8s-triage/index.html?job=ci-crio-cdi-device-plugins&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Add the CDIDevices field to the device plugin API&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Implement the logic to pass the CDIDevices into the CRI&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Add proper &lt;code>e2e_node&lt;/code> tests&lt;/li>
&lt;/ul>
&lt;h4 id="alpha-to-beta-graduation">Alpha to Beta Graduation&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> No major bugs reported in the previous cycle&lt;/li>
&lt;/ul>
&lt;h4 id="beta-to-ga-graduation">Beta to GA Graduation&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Gather feedback from at least 2 device plugin vendors that CDI support works for them&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>We expect no impact on upgrades.
On downgrades, we expect no impact on Kubernetes and minimal impact on device
plugin developers.&lt;/p>
&lt;p>We are not bumping the device plugin API version, but simply adding a field to
its protobuf. On upgrades this means that older device plugins will simply
continue to work as they always have, since they will need to opt-in to using
this new field.&lt;/p>
&lt;p>For downgrades, if a plugin has not opted to use the new field, there will be
no impact since a downgraded kubelet won&amp;rsquo;t support it anyway. If a device
plugin has opted in to the new field, a downgraded kubelet will
silently ignore it. This has no impact on Kubernetes itself, but
plugin developers need to be aware of it in case their CDI support
suddenly stops working.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>The kubelet will always be backwards compatible, so going forward existing
plugins are not expected to break.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate names:
&lt;ul>
&lt;li>&lt;code>DevicePluginCDIDevices&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Components depending on the feature gate: kubelet&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Pass CDI devices to the kubelet over the new field in the device plugin API
&lt;ul>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?
No.&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?
No.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
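&lt;p>As a minimal sketch of the feature-gate route above, the gate could be set in the kubelet configuration file as follows. This assumes a kubelet version in which &lt;code>DevicePluginCDIDevices&lt;/code> is still a configurable gate; the surrounding header fields are the standard &lt;code>KubeletConfiguration&lt;/code> ones and are not taken from this KEP.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Illustrative kubelet configuration fragment.
# Assumption: the gate has not yet been locked to its default in this kubelet version.
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  # Set to false to have the kubelet ignore the cdi_devices field again.
  DevicePluginCDIDevices: true
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>The kubelet needs to be restarted to pick up a changed gate value. As described above, device plugins still have to populate the new field before anything changes for workloads.&lt;/p>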
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. Device Plugins need to be updated to make use of the new field.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;ul>
&lt;li>Yes, disabling the &lt;code>DevicePluginCDIDevices&lt;/code> feature gate shuts down the feature completely.&lt;/li>
&lt;li>Yes, by not sending CDI devices over the device plugin API (and falling back to the old way of passing device info).&lt;/li>
&lt;/ul>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>Nothing bad will happen; new containers can simply be started with
CDI devices again.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>There will be e2e tests demonstrating that CDI devices are attached as expected
when the feature is enabled, and silently ignored if the feature is disabled.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The failure of the kubelet would mean that fields from new device allocations
will not be processed.&lt;/p>
&lt;p>However, CDI devices themselves are only interpreted at container start.
Existing containers that were started with support for CDI devices will not be
impacted if the feature gate is enabled or disabled during the lifetime of a
running container. Only new containers will be impacted by the presence or
absence of the feature gate.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>This depends on Device Plugin vendor implementations making use of the required
field and cannot be directly determined.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>End-users are not aware that this feature exists. Device plugin developers can
ensure that this feature is working by passing CDI devices to workloads
requesting them, and ensuring that the workloads come up successfully with
access to the devices they asked for.&lt;/p>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;ul>
&lt;li>The container runtime (e.g. containerd, CRI-O, etc.) must support CDI.&lt;/li>
&lt;li>A Device Plugin must be implemented to use the field.&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No. The additional field will replace existing usages where used.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>The change to Kubernetes to support this feature is minimal. The CDI
device list passed from the plugin to the kubelet is opaquely forwarded to the
underlying container runtime without affecting the overall logic of the kubelet
in any significant way. As such, the only known failure scenarios result from
plugins themselves (not the kubelet) doing something incorrectly, for example
sending back a list of CDI devices that are not included in any CDI spec
visible to the underlying container runtime. Such failure scenarios do
not affect the proper functioning of Kubernetes itself and are therefore out
of scope for this KEP. We recommend checking the device plugin and container
runtime logs instead.&lt;/p>
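&lt;p>To make that failure mode concrete, here is a sketch of a CDI spec file using assumed names. The vendor &lt;code>vendor.example&lt;/code>, the class &lt;code>gpu&lt;/code>, the device name &lt;code>gpu0&lt;/code>, and the device node path are all hypothetical; the field layout follows the CDI specification linked from the API comments.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Hypothetical CDI spec file, placed in a directory the container runtime scans for CDI specs.
cdiVersion: "0.5.0"
kind: vendor.example/gpu        # hypothetical vendor/class pair
devices:
  - name: gpu0                  # fully qualified name: vendor.example/gpu=gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/gpu0       # hypothetical device node exposed to the container
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>With only this spec visible to the runtime, a plugin that returns &lt;code>vendor.example/gpu=gpu0&lt;/code> names a device the runtime can resolve, while &lt;code>vendor.example/gpu=gpu1&lt;/code> is an instance of the failure described above: the name is forwarded by the kubelet unchanged, and the runtime is the component that would report the problem.&lt;/p>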
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2023-05-15: KEP created&lt;/li>
&lt;li>2023-09-25: KEP updated to mark transition to beta&lt;/li>
&lt;li>2024-01-24: KEP updated to mark transition to stable&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>There is no reason this KEP should not be implemented. CDI is the new standard
for device support in containerized environments, and this enhancement now
makes this possible through a simple addition to the device plugin API.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>None&lt;/p></description></item><item><title>Resources: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4540/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4540/</guid><description>
&lt;h1 id="kep-4540-add-cpumanager-policy-option-to-restrict-reservedsystemcpus-to-system-daemons-and-interrupt-processing">KEP-4540: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#archived-risk-mitigation-option-1"
>Archived Risk Mitigation (Option 1)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#archived-risk-mitigation-option-2"
>Archived Risk Mitigation (Option 2)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Since Kubernetes 1.22, a &lt;code>CPUManager&lt;/code> flag has allowed the use of &lt;code>CPUManager&lt;/code> policy options (#2625), which let users customize the behavior of a policy based on workload requirements without having to introduce an entirely new policy.
These policy options work together to ensure that an optimized CPU set is allocated for workloads running on a cluster.
The existing policy options are &lt;code>full-pcpus-only&lt;/code> (#2625), &lt;code>distribute-cpus-across-numa&lt;/code> (#2902), &lt;code>align-by-socket&lt;/code> (#3327), and &lt;code>distribute-cpus-across-cores&lt;/code> (#4176).
This KEP introduces a new &lt;code>CPUManager&lt;/code> policy option, &lt;code>strict-cpu-reservation&lt;/code>, which ensures that &lt;code>reservedSystemCPUs&lt;/code> are strictly reserved for system daemons or interrupt processing and are not used by burstable and best-effort pods.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>The static policy is used to reduce latency or improve performance. If you want to move system daemons or interrupt processing to dedicated cores, the obvious way is to use the &lt;code>reservedSystemCPUs&lt;/code> option. In the current implementation, however, this isolation applies only to guaranteed pods with integer CPU requests, not to burstable and best-effort pods (or guaranteed pods with fractional CPU requests).
Admission only compares the CPU requests against the allocatable CPUs. Because CPU limits can be higher than requests, burstable and best-effort pods can use up the capacity of &lt;code>reservedSystemCPUs&lt;/code> and starve host OS services in real-life deployments.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Align scheduler and node view for Node Allocatable (total - reserved).&lt;/li>
&lt;li>Ensure &lt;code>reservedSystemCPUs&lt;/code> is used only by system daemons or interrupt processing, not by workloads.&lt;/li>
&lt;li>Ensure no breaking changes for the &lt;code>static&lt;/code> policy of &lt;code>CPUManager&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Change interface between node and scheduler.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We propose to add a new &lt;code>CPUManager&lt;/code> policy option called &lt;code>strict-cpu-reservation&lt;/code> to the &lt;code>static&lt;/code> policy of &lt;code>CPUManager&lt;/code>.
When this policy option is enabled, we remove the reserved cores from the list of all available cores when calculating the DefaultCPUSet. As a result, burstable and best-effort containers are launched with a cpuset from which the reserved cores are excluded.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>To protect workload latency, systemd daemons, including the irqbalance daemon, are commonly constrained to the reserved CPUs.
Burstable and best-effort pods (and guaranteed pods with fractional CPU requests) running on the reserved CPUs cause CPU throttling for infrastructure services. This degrades system response time, which in turn hurts workload response time.
This issue is particularly severe in all-in-one deployments where workloads are placed on combined master+worker+storage nodes.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>Silently allowing workloads to run on the reserved CPUs makes benchmarks of both infrastructure and workloads inaccurate.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>In the kubelet, when &lt;code>strict-cpu-reservation&lt;/code> is enabled as a policy option, we remove the reserved cores from the shared pool when calculating the DefaultCPUSet.&lt;/p>
&lt;p>The impact of the feature can be illustrated as follows:&lt;/p>
&lt;p>With the following Kubelet configuration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>KubeletConfiguration&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kubelet.config.k8s.io/v1beta1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">cpuManagerPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>static&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">cpuManagerPolicyOptions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">strict-cpu-reservation&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;true&amp;#34;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">reservedSystemCPUs&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;0,32,1,33,16,48&amp;#34;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#00f;font-weight:bold">...&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When &lt;code>strict-cpu-reservation&lt;/code> is disabled:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> cat /var/lib/kubelet/cpu_manager_state
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;0-63&amp;#34;,&amp;#34;checksum&amp;#34;:1058907510}
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When &lt;code>strict-cpu-reservation&lt;/code> is enabled:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> cat /var/lib/kubelet/cpu_manager_state
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;2-15,17-31,34-47,49-63&amp;#34;,&amp;#34;checksum&amp;#34;:4141502832}
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The feature is isolated to a specific policy option &lt;code>strict-cpu-reservation&lt;/code> under &lt;code>cpuManagerPolicyOptions&lt;/code>.&lt;/p>
&lt;p>A concern has been raised about the impact of the feature on best-effort workloads, that is, workloads that do not have resource requests.&lt;/p>
&lt;p>Kube-scheduler schedules pods on node allocatable (total - reserved). For best-effort pods, kube-scheduler uses default request values when scoring the nodes, see &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/util/pod_resources.go#L32"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/util/pod_resources.go#L32&lt;/a>
and &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/noderesources/resource_allocation.go#L123"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/noderesources/resource_allocation.go#L123&lt;/a>
, but the scheduler does not use the default request values when fitting the nodes, i.e. best-effort pods are always admitted.&lt;/p>
&lt;p>The concern is that when the feature graduates to &lt;code>Stable&lt;/code> it will be enabled by default, and best-effort workloads could be starved when the node runs out of CPU cores.&lt;/p>
&lt;p>However, this is exactly the intent of the feature. Best-effort workloads have no KPI requirement; they are meant to consume whatever CPU resources are left on the node, which includes being starved from time to time. Best-effort workloads are not scheduled against the &lt;code>reservedSystemCPUs&lt;/code>, so they should not run on the &lt;code>reservedSystemCPUs&lt;/code> and destabilize the whole node.&lt;/p>
&lt;p>Nevertheless, risk mitigation has been discussed in detail (see the archived options below), and we agreed to start with the following node metrics for CPU pool sizes in the Alpha and Beta stages to assess the actual impact in real deployments. The plan is to move the current implementation to the Stable stage if no field issues are observed for one year.&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/pull/127506"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/127506&lt;/a>
&lt;/p>
&lt;ul>
&lt;li>&lt;code>cpu_manager_shared_pool_size_millicores&lt;/code>: reports the shared pool size in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve&lt;/li>
&lt;li>&lt;code>cpu_manager_exclusive_cpu_allocation_count&lt;/code>: reports the exclusively allocated cores, counting full cores (e.g. 16)&lt;/li>
&lt;/ul>
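&lt;p>One way an operator might watch the first metric is with a Prometheus alerting rule; the following is a sketch under stated assumptions, not part of this KEP. The &lt;code>kubelet_&lt;/code> prefix on the metric name, the rule and group names, and the five-minute window are all assumptions, and the rule presumes that kubelet metrics are already being scraped.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Hypothetical Prometheus rule file; names and thresholds are illustrative only.
groups:
  - name: cpu-manager-pools
    rules:
      - alert: CPUManagerSharedPoolEmpty
        # Assumption: the kubelet exposes the metric with a "kubelet_" prefix.
        expr: kubelet_cpu_manager_shared_pool_size_millicores == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Shared CPU pool is empty; best-effort pods on this node may starve.
&lt;/code>&lt;/pre>&lt;/div>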
&lt;h4 id="archived-risk-mitigation-option-1">Archived Risk Mitigation (Option 1)&lt;/h4>
&lt;p>This option adds &lt;code>numMinSharedCPUs&lt;/code> to the &lt;code>strict-cpu-reservation&lt;/code> option as the minimum number of CPU cores not available for exclusive allocation, and exposes it to kube-scheduler for enforcement.&lt;/p>
&lt;p>In the kubelet, when &lt;code>strict-cpu-reservation&lt;/code> is enabled as a policy option, we remove the reserved cores from the shared pool when calculating the DefaultCPUSet and remove the &lt;code>MinSharedCPUs&lt;/code> from the list of cores available for exclusive allocation.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-node/4540-strict-cpu-reservation/./strict-cpu-allocation.png" alt="MinSharedCPUs">&lt;/p>
&lt;p>When &lt;code>strict-cpu-reservation&lt;/code> is disabled:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Total CPU cores: 64
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">ReservedSystemCPUs: 6
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">defaultCPUSet = Reserved (6) + 58 (available for exclusive allocation)
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When &lt;code>strict-cpu-reservation&lt;/code> is enabled:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Total CPU cores: 64
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">ReservedSystemCPUs: 6
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">MinSharedCPUs: 4
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">defaultCPUSet = MinSharedCPUs (4) + 54 (available for exclusive allocation)
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A prototype PR for this option has been created:
&lt;a href="https://github.com/kubernetes/kubernetes/pull/123979/commits"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/123979/commits&lt;/a>
&lt;/p>
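&lt;p>The pool arithmetic in the two listings above can be sketched as follows. This is a minimal illustration, not kubelet code: &lt;code>poolSizes&lt;/code> and its parameters are hypothetical names, and the function only tracks set sizes rather than actual CPU IDs.&lt;/p>

```go
package main

import "fmt"

// poolSizes is a hypothetical helper (not part of kubelet). It returns the size
// of the default (shared) CPU set and the number of CPUs that remain available
// for exclusive allocation, following the two listings in this section.
func poolSizes(total, reserved, minShared int, strict bool) (defaultSet, exclusive int) {
	if !strict {
		// Reserved CPUs stay in the default set; all other CPUs may be
		// allocated exclusively.
		return total, total - reserved
	}
	// Reserved CPUs are removed from the default set, and minShared CPUs
	// are never handed out exclusively.
	return total - reserved, total - reserved - minShared
}

func main() {
	d, e := poolSizes(64, 6, 0, false)
	fmt.Printf("disabled: defaultCPUSet=%d exclusive=%d\n", d, e)

	d, e = poolSizes(64, 6, 4, true)
	fmt.Printf("enabled: defaultCPUSet=%d exclusive=%d\n", d, e)
}
```

&lt;p>With the numbers used in this section (64 cores, 6 reserved, 4 minimum shared), the sketch yields 64 and 58 when the option is disabled, and 58 and 54 when it is enabled.&lt;/p>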
&lt;p>Add &lt;code>numMinSharedCPUs&lt;/code> as part of the &lt;code>strict-cpu-reservation&lt;/code> option in the kubelet configuration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>KubeletConfiguration&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kubelet.config.k8s.io/v1beta1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">featureGates&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>...&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">CPUManagerPolicyAlphaOptions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">cpuManagerPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>static&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">cpuManagerPolicyOptions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">strict-cpu-reservation&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">&amp;#34;enable&amp;#34;: &amp;#34;true&amp;#34;, &amp;#34;numMinSharedCPUs&amp;#34;: &lt;/span>&lt;span style="color:#666">4&lt;/span>&lt;span style="color:#bbb"> &lt;/span>}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">reservedSystemCPUs&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;0,32,1,33,16,48&amp;#34;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#00f;font-weight:bold">...&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In Node API, we add &lt;code>exclusive-cpu&lt;/code> in Node Allocatable for Kube-scheduler to consume.&lt;/p>
&lt;pre tabindex="0">&lt;code> &amp;#34;status&amp;#34;: {
&amp;#34;capacity&amp;#34;: {
&amp;#34;cpu&amp;#34;: &amp;#34;64&amp;#34;,
&amp;#34;exclusive-cpu&amp;#34;: &amp;#34;64&amp;#34;,
&amp;#34;ephemeral-storage&amp;#34;: &amp;#34;832821572Ki&amp;#34;,
&amp;#34;hugepages-1Gi&amp;#34;: &amp;#34;0&amp;#34;,
&amp;#34;hugepages-2Mi&amp;#34;: &amp;#34;0&amp;#34;,
&amp;#34;memory&amp;#34;: &amp;#34;196146004Ki&amp;#34;,
&amp;#34;pods&amp;#34;: &amp;#34;110&amp;#34;
},
&amp;#34;allocatable&amp;#34;: {
&amp;#34;cpu&amp;#34;: &amp;#34;58&amp;#34;,
&amp;#34;exclusive-cpu&amp;#34;: &amp;#34;54&amp;#34;,
&amp;#34;ephemeral-storage&amp;#34;: &amp;#34;767528359485&amp;#34;,
&amp;#34;hugepages-1Gi&amp;#34;: &amp;#34;0&amp;#34;,
&amp;#34;hugepages-2Mi&amp;#34;: &amp;#34;0&amp;#34;,
&amp;#34;memory&amp;#34;: &amp;#34;186067796Ki&amp;#34;,
&amp;#34;pods&amp;#34;: &amp;#34;110&amp;#34;
},
...
&lt;/code>&lt;/pre>&lt;p>In kube-scheduler, &lt;code>ExclusiveMilliCPU&lt;/code> is added to the scheduler&amp;rsquo;s &lt;code>Resource&lt;/code> structure, and the &lt;code>NodeResourcesFit&lt;/code> plugin is extended to filter out nodes that cannot meet a pod&amp;rsquo;s exclusive CPU request.&lt;/p>
&lt;p>A new item &lt;code>ExclusiveMilliCPU&lt;/code> is added in the scheduler &lt;code>Resource&lt;/code> structure:&lt;/p>
&lt;pre tabindex="0">&lt;code>// Resource is a collection of compute resource.
type Resource struct {
    MilliCPU int64
    ExclusiveMilliCPU int64 // added
    Memory int64
    EphemeralStorage int64
    // We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
    // explicitly as int, to avoid conversions and improve performance.
    AllowedPodNumber int
    // ScalarResources
    ScalarResources map[v1.ResourceName]int64
}
&lt;/code>&lt;/pre>&lt;p>A new node fitting failure &amp;lsquo;Insufficient exclusive cpu&amp;rsquo; is added in the &lt;code>NodeResourcesFit&lt;/code> plugin:&lt;/p>
&lt;pre tabindex="0">&lt;code>if podRequest.MilliCPU &amp;gt; 0 &amp;amp;&amp;amp; podRequest.MilliCPU &amp;gt; (nodeInfo.Allocatable.MilliCPU-nodeInfo.Requested.MilliCPU) {
    insufficientResources = append(insufficientResources, InsufficientResource{
        ResourceName: v1.ResourceCPU,
        Reason: &amp;#34;Insufficient cpu&amp;#34;,
        Requested: podRequest.MilliCPU,
        Used: nodeInfo.Requested.MilliCPU,
        Capacity: nodeInfo.Allocatable.MilliCPU,
    })
}
if nodeInfo.Allocatable.ExclusiveMilliCPU &amp;gt; 0 { // added
    if podRequest.ExclusiveMilliCPU &amp;gt; 0 &amp;amp;&amp;amp; podRequest.ExclusiveMilliCPU &amp;gt; (nodeInfo.Allocatable.ExclusiveMilliCPU-nodeInfo.Requested.ExclusiveMilliCPU) {
        insufficientResources = append(insufficientResources, InsufficientResource{
            ResourceName: v1.ResourceExclusiveCPU,
            Reason: &amp;#34;Insufficient exclusive cpu&amp;#34;,
            Requested: podRequest.ExclusiveMilliCPU,
            Used: nodeInfo.Requested.ExclusiveMilliCPU,
            Capacity: nodeInfo.Allocatable.ExclusiveMilliCPU,
        })
    }
}
&lt;/code>&lt;/pre>&lt;h4 id="archived-risk-mitigation-option-2">Archived Risk Mitigation (Option 2)&lt;/h4>
&lt;p>The problem with &lt;code>MinSharedCPUs&lt;/code> is that it creates another complication, as with memory and hugepages: new resources versus overlapping resources, since exclusive-cpus is a subset of cpu.&lt;/p>
&lt;p>Currently the noderesources scheduler plugin does not filter out the best-effort pods in the case there&amp;rsquo;s no available CPU.&lt;/p>
&lt;p>Another option is to force the cpu requests for best effort pods to 1 MilliCPU in kubelet for the purpose of resource availability checks (or, equivalently, check there&amp;rsquo;s at least 1 MilliCPU allocatable). This option is meant to be simpler than option-1, but it can create runaway pods similar to that in &lt;a href="https://github.com/kubernetes/kubernetes/issues/84869"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/84869&lt;/a>
.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/policy_static.go&lt;/code>: &lt;code>03-18-2024&lt;/code> - &lt;code>91.1&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>No new integration tests for kubelet are planned.&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>The e2e tests are implemented in &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_test.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_test.go&lt;/a>
, marked with the Ginkgo &amp;ldquo;strict-cpu-reservation&amp;rdquo; label.&lt;/p>
&lt;p>Feature functionality tests:&lt;/p>
&lt;ul>
&lt;li>running with strict CPU reservation: should let the container access all the online CPUs without a reserved CPUs set&lt;/li>
&lt;li>running with strict CPU reservation: should let the container access all the online CPUs minus the reserved CPUs set when enabled&lt;/li>
&lt;li>running with strict CPU reservation: should let the container access all the online non-exclusively-allocated CPUs minus the reserved CPUs set when enabled&lt;/li>
&lt;/ul>
&lt;p>CPU Manager options compatibility tests:&lt;/p>
&lt;ul>
&lt;li>SMT Alignment and strict CPU reservation: should reject workload asking non-SMT-multiple of cpus&lt;/li>
&lt;li>SMT Alignment and strict CPU reservation: should admit workload asking SMT-multiple of cpus&lt;/li>
&lt;li>Strict CPU Reservation and Uncore Cache Alignment: should assign CPUs aligned to uncore caches with prefer-align-cpus-by-uncore-cache and avoid reserved cpus&lt;/li>
&lt;/ul>
&lt;p>Testgrid:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-node-kubelet#kubelet-serial-gce-e2e-cpu-manager"
target="_blank" rel="noopener">kubelet-serial-gce-e2e-cpu-manager&lt;/a>
: Green&lt;/li>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial"
target="_blank" rel="noopener">kubelet-gce-e2e-arm64-ubuntu-serial&lt;/a>
: Green&lt;/li>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-node-containerd#pull-e2e-serial-ec2"
target="_blank" rel="noopener">pull-e2e-serial-ec2&lt;/a>
: Green&lt;/li>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-resource-managers"
target="_blank" rel="noopener">node-kubelet-containerd-resource-managers&lt;/a>
: Green&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Implement the new policy option.&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Ensure proper unit tests are in place.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Gather feedback from consumers of the new policy option.&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Verify no major bugs reported in the previous cycle.&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Ensure proper e2e tests are in place.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Allow time for feedback (two releases).&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Make sure all risks have been addressed.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>The new policy option is opt-in and orthogonal to the existing ones.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>No changes needed.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;p>The &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code> needs to be removed when enabling or disabling the feature.&lt;/p>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Change the kubelet configuration to set a &lt;code>CPUManager&lt;/code> policy of &lt;code>static&lt;/code> and a &lt;code>CPUManager&lt;/code> policy option of &lt;code>strict-cpu-reservation&lt;/code>
&lt;ul>
&lt;li>Will enabling / disabling the feature require downtime of the control plane? No&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning of a node? No &amp;ndash; removing &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code> and restarting kubelet are enough.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Yes. Reserved CPU cores will be used strictly for system daemons and interrupt processing, and will no longer be available for workloads.&lt;/p>
&lt;p>The feature is only enabled when all of the following conditions are met:&lt;/p>
&lt;ol>
&lt;li>The &lt;code>static&lt;/code> &lt;code>CPUManager&lt;/code> policy is selected&lt;/li>
&lt;li>The &lt;code>strict-cpu-reservation&lt;/code> policy option is selected&lt;/li>
&lt;li>The &lt;code>reservedSystemCPUs&lt;/code> is not empty&lt;/li>
&lt;/ol>
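&lt;p>The three conditions above can be expressed as a single predicate. The following is a minimal sketch under stated assumptions: &lt;code>strictReservationActive&lt;/code> is a hypothetical helper, not a kubelet function, and it assumes the policy options are available as a string map like &lt;code>cpuManagerPolicyOptions&lt;/code> in the kubelet configuration.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// strictReservationActive is a hypothetical helper (not kubelet code) that
// mirrors the three conditions listed above.
func strictReservationActive(policy string, options map[string]string, reservedSystemCPUs string) bool {
	// 1. The static CPUManager policy is selected.
	if policy != "static" {
		return false
	}
	// 2. The strict-cpu-reservation policy option is selected.
	if options["strict-cpu-reservation"] != "true" {
		return false
	}
	// 3. reservedSystemCPUs is not empty.
	return strings.TrimSpace(reservedSystemCPUs) != ""
}

func main() {
	opts := map[string]string{"strict-cpu-reservation": "true"}

	fmt.Println(strictReservationActive("static", opts, "0,32,1,33,16,48"))
	fmt.Println(strictReservationActive("static", opts, ""))
	fmt.Println(strictReservationActive("none", opts, "0,32,1,33,16,48"))
}
```

&lt;p>Only the first call reports the feature as active; the other two fail the reserved-CPUs and policy conditions respectively.&lt;/p>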
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes, the feature can be disabled by the following steps:&lt;/p>
&lt;ol>
&lt;li>Remove &lt;code>strict-cpu-reservation&lt;/code> from the list of &lt;code>CPUManager&lt;/code> policy options&lt;/li>
&lt;li>Remove &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code> and restart kubelet&lt;/li>
&lt;/ol>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The feature is enabled in the same way regardless of whether it is being enabled for the first time.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;ul>
&lt;li>A specific e2e test will demonstrate that the default behaviour is preserved when the feature is not used (2 separate tests)&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>If the feature rollout fails, burstable and best-effort pods continue to run on the reserved CPU cores.
If the feature rollback fails, burstable and best-effort pods continue not to run on the reserved CPU cores.
In either case, existing workloads will not be affected.&lt;/p>
&lt;p>When enabling or disabling the feature, make sure &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code> is removed before restarting kubelet; otherwise the kubelet restart could fail.&lt;/p>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;p>Best-effort workloads are starved for a prolonged time. This indicates that the node lacks the hardware needed to use the feature, or that you should review the number of CPU cores reserved.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;p>If you have this feature enabled in v1.32 under &lt;code>CPUManagerPolicyAlphaOptions&lt;/code> (defaults to false), you will continue to have it enabled in v1.33 under &lt;code>CPUManagerPolicyBetaOptions&lt;/code> (defaults to true) automatically, i.e. no extra action is needed.
To enable or disable this feature in v1.33, follow the feature activation and de-activation procedures described above.&lt;/p>
&lt;p>Manual upgrade-&amp;gt;downgrade-&amp;gt;upgrade testing from v1.32 to v1.33 is as follows:&lt;/p>
&lt;p>With the following kubelet configuration and &lt;code>cpu_manager_state&lt;/code> in v1.32:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>KubeletConfiguration&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kubelet.config.k8s.io/v1beta1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">featureGates&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">CPUManagerPolicyAlphaOptions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>...&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">cpuManagerPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>static&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">cpuManagerPolicyOptions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">strict-cpu-reservation&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;true&amp;#34;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">reservedSystemCPUs&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;0,32,1,33,16,48&amp;#34;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#00f;font-weight:bold">...&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> cat /var/lib/kubelet/cpu_manager_state
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;2-15,17-31,34-47,49-63&amp;#34;,&amp;#34;checksum&amp;#34;:4141502832}
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The same Kubelet &lt;code>cpu_manager_state&lt;/code> will be seen after upgrading to v1.33:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> cat /var/lib/kubelet/cpu_manager_state
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;2-15,17-31,34-47,49-63&amp;#34;,&amp;#34;checksum&amp;#34;:4141502832}
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We recommend removing the &lt;code>CPUManagerPolicyAlphaOptions&lt;/code> feature gate after upgrading to v1.33 for operational integrity, although this is not mandatory.&lt;/p>
&lt;p>If you want to disable the feature in v1.33, you can either disable the &lt;code>CPUManagerPolicyBetaOptions&lt;/code> feature gate, or remove the &lt;code>strict-cpu-reservation&lt;/code> policy option. Remember to remove the &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code> file before restarting kubelet.&lt;/p>
&lt;p>The following &lt;code>cpu_manager_state&lt;/code> will be seen after the feature is disabled:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> cat /var/lib/kubelet/cpu_manager_state
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;0-63&amp;#34;,&amp;#34;checksum&amp;#34;:1058907510}
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you want to enable the feature in v1.33, you need to make sure the &lt;code>CPUManagerPolicyBetaOptions&lt;/code> feature gate is not disabled and add the &lt;code>strict-cpu-reservation&lt;/code> policy option. Remember to remove the &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code> file before restarting kubelet.&lt;/p>
&lt;p>The following &lt;code>cpu_manager_state&lt;/code> will be seen after the feature is enabled:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> cat /var/lib/kubelet/cpu_manager_state
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;2-15,17-31,34-47,49-63&amp;#34;,&amp;#34;checksum&amp;#34;:4141502832}
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Inspect the &lt;code>defaultCpuSet&lt;/code> in &lt;code>/var/lib/kubelet/cpu_manager_state&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>When the feature is disabled, the reserved CPU cores are included in the &lt;code>defaultCpuSet&lt;/code>.&lt;/li>
&lt;li>When the feature is enabled, the reserved CPU cores are not included in the &lt;code>defaultCpuSet&lt;/code>.&lt;/li>
&lt;/ul>
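&lt;p>This check can be automated. The sketch below is illustrative only: &lt;code>parseCPUList&lt;/code> and &lt;code>reservedInDefaultSet&lt;/code> are hypothetical helpers, the state file is assumed to carry the &lt;code>policyName&lt;/code> and &lt;code>defaultCpuSet&lt;/code> fields shown in the examples in this document, and the state content is passed in as bytes rather than read from disk.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// cpuManagerState holds only the fields of cpu_manager_state used here.
type cpuManagerState struct {
	PolicyName    string `json:"policyName"`
	DefaultCPUSet string `json:"defaultCpuSet"`
}

// parseCPUList expands a list such as "0-3,8" into a set of CPU IDs.
func parseCPUList(list string) (map[int]bool, error) {
	cpus := map[int]bool{}
	for _, part := range strings.Split(list, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		hi := lo
		if len(bounds) == 2 {
			hi, err = strconv.Atoi(bounds[1])
			if err != nil {
				return nil, err
			}
		}
		for c := lo; hi >= c; c++ {
			cpus[c] = true
		}
	}
	return cpus, nil
}

// reservedInDefaultSet returns the reserved CPUs that still appear in
// defaultCpuSet, sorted. An empty result is consistent with the feature
// being enabled.
func reservedInDefaultSet(stateJSON []byte, reserved string) ([]int, error) {
	state := new(cpuManagerState)
	if err := json.Unmarshal(stateJSON, state); err != nil {
		return nil, err
	}
	shared, err := parseCPUList(state.DefaultCPUSet)
	if err != nil {
		return nil, err
	}
	reservedSet, err := parseCPUList(reserved)
	if err != nil {
		return nil, err
	}
	var overlap []int
	for c := range reservedSet {
		if shared[c] {
			overlap = append(overlap, c)
		}
	}
	sort.Ints(overlap)
	return overlap, nil
}

func main() {
	state := []byte(`{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}`)
	overlap, err := reservedInDefaultSet(state, "0,32,1,33,16,48")
	if err != nil {
		panic(err)
	}
	fmt.Println("reserved CPUs still in defaultCpuSet:", len(overlap))
}
```

&lt;p>For the enabled state shown in this document the overlap is empty; for the disabled state (&lt;code>0-63&lt;/code>) all six reserved CPUs are reported.&lt;/p>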
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>Inspect the status file of a process running in the pod (&lt;code>/proc/self/status&lt;/code>) and check that the reserved cores are not in its allowed CPU list.&lt;/p>
&lt;p>Below is an example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#000080;font-weight:bold">#&lt;/span> kubectl &lt;span style="color:#a2f">exec&lt;/span> cnf1-58446568f4-dr986 -n cnf1-ns -- grep Cpus_allowed /proc/self/status
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Cpus_allowed: fffefffc,fffefffc
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Cpus_allowed_list: 2-15,17-31,34-47,49-63
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>This feature allows users to protect infrastructure services from bursty workloads.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>Monitor the following kubelet counters:&lt;/p>
&lt;ul>
&lt;li>&lt;code>cpu_manager_shared_pool_size_millicores&lt;/code>: reports the shared pool size, in millicores (e.g. 13500m); expected to be non-zero, otherwise best-effort pods will starve&lt;/li>
&lt;li>&lt;code>cpu_manager_exclusive_cpu_allocation_count&lt;/code>: report exclusively allocated cores, counting full cores (e.g. 16)&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>No.&lt;/p>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>Increase kubelet log level and check kubelet log for errors.&lt;/p>
&lt;p>Below is how to check kubelet log when it runs as a systemd service:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">journalctl _SYSTEMD_INVOCATION_ID=`systemctl show -p InvocationID --value kubelet.service`
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>There is no known impact.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>There is no known failure mode.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>You can safely disable the feature.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2024-03-08: Initial KEP created&lt;/li>
&lt;li>2024-10-07: KEP gets LGTM and Approval&lt;/li>
&lt;li>2025-02-03: KEP updated with Beta criteria&lt;/li>
&lt;li>2025-09-30: KEP updated with GA criteria&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2></description></item><item><title>Resources: Add generic control plane staging repository(ies)</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4080/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4080/</guid><description>
&lt;h1 id="kep-4080-add-generic-control-plane-staging-repositoryies">KEP-4080: Add generic control plane staging repository(ies)&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP proposes refactoring &lt;em>kube-apiserver&lt;/em> and &lt;em>kube-controller-manager&lt;/em> so that
they build on one or more new staging repositories. These repositories consume
&lt;code>k/apiserver&lt;/code> but provide a larger, carefully chosen subset of the functionality of
&lt;em>kube-apiserver&lt;/em> and &lt;em>kube-controller-manager&lt;/em> in a reusable form.&lt;/p>
&lt;p>The factoring will be progressive: we will start with new repo(s) adding
nothing to &lt;code>k/apiserver&lt;/code>, refactor in-place and then progressively move generic
functionality from &lt;em>kube-apiserver&lt;/em> and &lt;em>kube-controller-manager&lt;/em> to the new
repositories.&lt;/p>
&lt;p>The suggested naming of the new repository(ies) is &lt;code>k/generic-controlplane&lt;/code>
(&lt;code>-apiserver/-controllers&lt;/code>; for simplicity we drop these suffixes in this document
until the names are finalized). Choosing the exact name(s) and split of
packages will be part of the process of implementation.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>A working kube-based control plane is more than just an apiserver component
built on &lt;code>k/apiserver&lt;/code>. It includes standard resources (depending on context:
namespaces, CRDs, RBAC, secrets, configmaps) and standard controllers (for
example garbage collection and namespace deletion). Today, &lt;em>kube-apiserver&lt;/em>
bundles those resources with container orchestration, and &lt;em>kube-controller-manager&lt;/em>
does the same for the corresponding controllers.&lt;/p>
&lt;p>Separating the generic parts from container orchestration will allow new
use cases that build upon &lt;code>k/apimachinery&lt;/code> and &lt;code>k/apiserver&lt;/code> while keeping a
unified codebase and ecosystem. It will also improve the factoring of
&lt;em>kube-apiserver&lt;/em>: clear layering reduces complexity and makes maintenance easier.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>As always: every PR transforms a working system into a working system, and
PRs are of manageable size.&lt;/li>
&lt;li>Improve factoring of &lt;em>kube-apiserver&lt;/em> through layering on top of
&lt;code>k/generic-controlplane&lt;/code>, reducing complexity through more explicit structure
and reduction of code in &lt;code>k/k&lt;/code>.&lt;/li>
&lt;li>&lt;code>k/generic-controlplane&lt;/code> will provide
&lt;ul>
&lt;li>a &lt;code>sample-generic-controlplane&lt;/code> binary&lt;/li>
&lt;li>a modular library, further customizable in code, suitable for building a
working kube-based control plane without vendoring k/k.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>k/generic-controlplane&lt;/code> will (optionally) include the ability to define
resources by CustomResourceDefinition objects.&lt;/li>
&lt;li>&lt;code>k/generic-controlplane&lt;/code> will be able to (optionally) delegate handling of
some kinds of objects to another server, as directed by APIService objects.&lt;/li>
&lt;li>&lt;code>k/generic-controlplane&lt;/code> will allow customization (in code) of which generic
(native) resources like secrets, configmaps, admission webhooks, RBAC, etc.
are served.&lt;/li>
&lt;li>&lt;code>k/generic-controlplane&lt;/code> will not include the definitions of the resources in
Kubernetes for the management of containerized workloads. For example, the
excluded resources include: nodes, pods, daemonsets, ingresses, services,
persistentvolumes.&lt;/li>
&lt;li>&lt;code>k/generic-controlplane&lt;/code> as a library will be agnostic to whether it is used in
separate binaries or in an all-in-one binary, be it in a &lt;em>hyperkube&lt;/em>-like
subcommand way or in an all-in-one &lt;em>k3s&lt;/em>-like way.&lt;/li>
&lt;/ul>
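&lt;p>To make the goal of customizing (in code) which resources are served more concrete, here is a minimal sketch. It is not the API of &lt;code>k/generic-controlplane&lt;/code>, which does not exist yet: the &lt;code>ResourceSet&lt;/code> type, its methods, and the default selection are hypothetical names chosen for illustration. Only the list of excluded container orchestration resources is taken from the goals above.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceSet is a hypothetical type (not part of any real package) naming
// the resources a control plane is configured to serve.
type ResourceSet map[string]bool

// excluded lists the container orchestration resources that the goals say
// the generic control plane will not define.
var excluded = []string{
	"nodes", "pods", "daemonsets", "ingresses", "services", "persistentvolumes",
}

// GenericDefaults returns an illustrative default selection of generic
// resources. The exact set is an assumption made for this sketch.
func GenericDefaults() ResourceSet {
	return ResourceSet{
		"namespaces":                true,
		"secrets":                   true,
		"configmaps":                true,
		"customresourcedefinitions": true,
		"apiservices":               true,
	}
}

// Without returns a copy of s with the named resources removed.
func (s ResourceSet) Without(names ...string) ResourceSet {
	out := ResourceSet{}
	for name, on := range s {
		out[name] = on
	}
	for _, name := range names {
		delete(out, name)
	}
	return out
}

// With returns a copy of s with the named resources added.
func (s ResourceSet) With(names ...string) ResourceSet {
	out := s.Without()
	for _, name := range names {
		out[name] = true
	}
	return out
}

// Validate returns an error if s contains a container orchestration resource.
func (s ResourceSet) Validate() error {
	for _, name := range excluded {
		if s[name] {
			return fmt.Errorf("resource %q belongs to container orchestration", name)
		}
	}
	return nil
}

// Names returns the enabled resource names in sorted order.
func (s ResourceSet) Names() []string {
	names := make([]string, 0, len(s))
	for name, on := range s {
		if on {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// A consumer that does not want to serve secrets.
	served := GenericDefaults().Without("secrets")
	fmt.Println(served.Names(), served.Validate())

	// Adding an orchestration resource is rejected by the sketch.
	fmt.Println(served.With("pods").Validate())
}
```

&lt;p>In this sketch the selection is a plain value that a consumer adjusts before starting a server, which mirrors the library approach described above. A real implementation would more likely be expressed in terms of API groups and storage providers.&lt;/p>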
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>provide and ship a de-facto standard, full-featured generic-controlplane binary:&lt;/p>
&lt;p>That is, this is clearly a library approach, and consumer projects will define
the feature set of their control plane. The only new deliverable is a staging
repository with a library and a sample binary, whose scope is clearly limited
to demonstrating the plumbing.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change anything noticeable to the user for existing binaries.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change compatibility guarantees of (server-side) staging repositories&lt;/p>
&lt;/li>
&lt;li>
&lt;p>create &lt;code>k/kube-apiserver&lt;/code> or anything similar, although this work can pave the
way by defining package structures suitable for &lt;code>k/kube-apiserver&lt;/code>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>The desired outcome is a useful new library, while of course keeping everything
working during iterative development. Success will be measured by community
members saying that the new library is useful to them.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;!--
Detail the things that people will be able to do if this KEP is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>Project &lt;em>kube-hyper-mini&lt;/em> wants to maintain a main program that bundles&lt;/p>
&lt;ol>
&lt;li>the subset of &lt;em>kube-apiserver&lt;/em> that is not concerned with the management of
containerized workloads,&lt;/li>
&lt;li>a single-member etcd cluster, and&lt;/li>
&lt;li>the subset of &lt;em>kube-controller-manager&lt;/em> that is not concerned with the
management of containerized workloads.&lt;/li>
&lt;/ol>
&lt;p>This makes a convenient platform for hosting kube-style APIs defined by CRDs
and/or resources served by their own extension apiserver. They use
&lt;code>k/generic-controlplane&lt;/code> to get part (1).&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>Project &lt;em>kube-core&lt;/em> recognizes that the three parts of kube-hyper-mini scale out
differently, and instead wants to maintain a main program that is just the
desired subset of &lt;em>kube-apiserver&lt;/em>. Their main program does little more than
call into &lt;code>k/generic-controlplane&lt;/code>.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;ul>
&lt;li>This KEP is about code refactoring that introduces another layer into the
staging repositories that kube-apiserver and kube-controller-manager are built
from. Every code refactoring carries a risk of bugs. The mitigation is small,
easily reviewable, “obvious” PRs that iteratively move from the old to the new
structure.&lt;/li>
&lt;/ul>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/4080-generic-controlplane/./staging-repos.svg" alt="Staging repository dependencies">&lt;/p>
&lt;p>The first steps split the existing &lt;em>kube-apiserver&lt;/em> and &lt;em>kube-controller-manager&lt;/em>
packages in place, that is, inside &lt;code>k/k&lt;/code>. This includes:&lt;/p>
&lt;ul>
&lt;li>&lt;code>cmd/kube-apiserver&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/kubeapiserver&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/controlplane&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>An early sketch of the end state follows. By far most changes towards this
goal will be code moves only:&lt;/p>
&lt;ul>
&lt;li>&lt;code>k/generic-controlplane&lt;/code>
&lt;ul>
&lt;li>&lt;code>pkg/apis&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/apiserver/options&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/apiserver/server&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/apiserver/registry&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/apiserver/admission&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/controllers/garbagecollection&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/controllers/namespacedeletion&lt;/code>&lt;/li>
&lt;li>&lt;code>cmd/sample-generic-controlplane&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Potentially, we will split the apiserver and controller parts into two separate
repositories. This will be decided after the initial in-place steps have been done
and it has become clearer how best to structure and host the new packages.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers' feedback, describe what additional tests need to be added prior
to implementing this enhancement to ensure the enhancement also has solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->
&lt;!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- &lt;package>: &lt;date> - &lt;current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->
&lt;ul>
&lt;li>&lt;code>&amp;lt;package&amp;gt;&lt;/code>: &lt;code>&amp;lt;date&amp;gt;&lt;/code> - &lt;code>&amp;lt;test coverage&amp;gt;&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
&lt;ul>
&lt;li>&lt;code>&amp;lt;test&amp;gt;&lt;/code>: &lt;code>&amp;lt;link to test coverage&amp;gt;&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->
&lt;ul>
&lt;li>&lt;code>&amp;lt;test&amp;gt;&lt;/code>: &lt;code>&amp;lt;link to test coverage&amp;gt;&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
Define graduation milestones.
These may be defined in terms of API maturity, [feature gate] graduations, or as
something else. The KEP should keep this high-level with a focus on what
signals will be looked at to determine graduation.
Consider the following in developing the graduation criteria for this enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]
Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
or by redefining what graduation means.
In general we try to use the same stages (alpha, beta, GA), regardless of how the
functionality is accessed.
[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
#### Alpha
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
#### Beta
- Gather feedback from developers and surveys
- Complete features A, B, C
- Additional tests are in Testgrid and linked in KEP
#### GA
- N examples of real-world usage
- N installs
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
- Allowing time for feedback
**Note:** Generally we also wait at least two releases between beta and
GA/stable, because there's no opportunity for user feedback, or even bug reports,
in back-to-back releases.
**For non-optional features moving to GA, the graduation criteria must include
[conformance tests].**
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
#### Deprecation
- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag
-->
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.
Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
CRI or CNI may require updating that component before the kubelet.
-->
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;p>There will be no changes to system behavior. The typical alpha/beta/GA stages
and requirements do not apply, as this KEP proposes moving existing code
without changing its alpha/beta/GA status.&lt;/p>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;!--
Pick one of these and delete the rest.
Documentation is available on [feature gate lifecycle] and expectations, as
well as the [existing list] of feature gates.
[feature gate lifecycle]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name:&lt;/li>
&lt;li>Components depending on the feature gate:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism: Does not apply. This is a code move of existing code
without functional changes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
Feature gates are typically disabled by setting the flag to `false` and
restarting the component. No other changes should be necessary to disable the
feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
Additionally, for features that are introducing a new API field, unit tests that
are exercising the `switch` of feature gate itself (what happens if I disable a
feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one or more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one or more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2727/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2727/</guid><description>
&lt;h1 id="kep-2727-add-grpc-probe">KEP-2727: Add GRPC Probe&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alternative-considerations"
>Alternative Considerations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha-1"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta-1"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga-1"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#references"
>References&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Enhancement issue in release milestone, which links to KEP dir in
&lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Test plan is in place, giving consideration to SIG Architecture
and SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in
&lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation e.g., additional design documents,
links to mailing list discussions/SIG meetings, relevant PRs/issues,
release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Add gRPC probe to Pod.Spec.Container.{Liveness,Readiness,Startup}Probe.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>gRPC is a widespread RPC framework. Existing ways to add
probes to gRPC apps, such as exposing an additional HTTP endpoint
for health checks or packaging an external gRPC client in the
image and using exec probes, have many limitations and add overhead.&lt;/p>
&lt;p>Many load balancers support gRPC natively so adding it to
Kubernetes aligns well with the industry.&lt;/p>
&lt;p>Finally, the Kubernetes project actively uses gRPC, so adding built-in
support for gRPC endpoints does not introduce any new dependencies
to the project.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Enable gRPC probe natively from Kubelet without requiring users to package a
gRPC healthcheck binary with their container.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/grpc-ecosystem/grpc-health-probe"
target="_blank" rel="noopener">https://github.com/grpc-ecosystem/grpc-health-probe&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md"
target="_blank" rel="noopener">https://github.com/grpc/grpc/blob/master/doc/health-checking.md&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Add gRPC support in other areas of K8s (e.g. Services).&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Add the following configuration to the &lt;code>LivenessProbe&lt;/code>, &lt;code>ReadinessProbe&lt;/code>
and &lt;code>StartupProbe&lt;/code>. Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">readinessProbe&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">grpc&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic">#+&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">port&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">9090&lt;/span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic">#+&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">service&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>my-service &lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic">#+&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">initialDelaySeconds&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">5&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">periodSeconds&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This will result in the use of gRPC (using HTTP/2 over TLS) to use the
standard healthcheck service (&lt;code>Check&lt;/code> method) to determine the health of the
container. Using &lt;code>Watch&lt;/code> method of the healthcheck service is not supported,
but may be considered in future iterations.
As spec&amp;rsquo;d, the &lt;code>kubelet&lt;/code> probe will not allow use of client
certificates nor verify the certificate on the container. We do not
support other protocols for the time being (unencrypted HTTP/2, QUIC).&lt;/p>
&lt;p>The healthcheck request will be identified with the following gRPC
&lt;code>User-Agent&lt;/code> metadata. This user agent will be statically defined (not
configurable by the user):&lt;/p>
&lt;pre tabindex="0">&lt;code>User-Agent: kube-probe/K8S_MAJOR_VER.K8S_MINOR_VER
&lt;/code>&lt;/pre>&lt;p>Example:&lt;/p>
&lt;pre tabindex="0">&lt;code>User-Agent: kube-probe/1.23
&lt;/code>&lt;/pre>&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;ol>
&lt;li>Adds more code to Kubelet and surface area to Pod.Spec. &lt;em>Response&lt;/em>: we
expect that this will be generally useful given broad gRPC adoption in the
industry.&lt;/li>
&lt;/ol>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// core/v1/types.go&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> Handler &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TCPSocket &lt;span style="color:#666">*&lt;/span>TCPSocketAction &lt;span style="color:#b44">`json...`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// GRPC specifies an action involving a TCP port. //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> GRPC &lt;span style="color:#666">*&lt;/span>GRPCAction &lt;span style="color:#b44">`json...`&lt;/span> &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> GRPCAction &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> { &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Port number of the gRPC service. Number must be in the range 1 to 65535. //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Port &lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;port&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=port&amp;#34;`&lt;/span> &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Service is the name of the service to place in the gRPC HealthCheckRequest //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md). //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The service name can be the empty string (i.e. &amp;#34;&amp;#34;). //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Service &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;service&amp;#34; protobuf:&amp;#34;bytes,2,opt,name=service&amp;#34;`&lt;/span> &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Host is the host name to connect to, defaults to the Pod&amp;#39;s IP. //+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Host &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;host,omitempty&amp;#34; protobuf:&amp;#34;bytes,3,opt,name=host&amp;#34;`&lt;/span> &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>} &lt;span style="color:#080;font-style:italic">//+&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note that &lt;code>GRPCAction.Port&lt;/code> is an int32, which is inconsistent with
the other existing probe definitions. This is on purpose &amp;ndash; we want to
move users away from using the (portNum, portName) union type.&lt;/p>
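&lt;p>Because &lt;code>Port&lt;/code> is a plain integer, the range stated in the field comment (1 to 65535) can be checked directly. The following is a minimal sketch under that assumption; &lt;code>validateGRPCAction&lt;/code> is a hypothetical helper, and the local &lt;code>GRPCAction&lt;/code> copy carries only the fields needed here.&lt;/p>

```go
package main

import "fmt"

// GRPCAction is a local copy of the fields relevant to this sketch.
type GRPCAction struct {
	Port    int32
	Service string
}

const (
	minPort = 1
	maxPort = 65535
)

// validateGRPCAction is a hypothetical helper that rejects ports outside
// the documented 1 to 65535 range. An empty Service is allowed.
func validateGRPCAction(a GRPCAction) error {
	if a.Port > maxPort || minPort > a.Port {
		return fmt.Errorf("port %d is outside the range %d-%d", a.Port, minPort, maxPort)
	}
	return nil
}

func main() {
	actions := []GRPCAction{
		{Port: 9090, Service: "my-service"},
		{Port: 0},
		{Port: 70000},
	}
	for _, a := range actions {
		fmt.Printf("port=%d err=%v\n", a.Port, validateGRPCAction(a))
	}
}
```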
&lt;h3 id="alternative-considerations">Alternative Considerations&lt;/h3>
&lt;p>Note that &lt;code>readinessProbe.grpc.service&lt;/code> may be confusing. Some
alternatives considered:&lt;/p>
&lt;ul>
&lt;li>&lt;code>serviceName&lt;/code>&lt;/li>
&lt;li>&lt;code>healthCheckServiceName&lt;/code>&lt;/li>
&lt;li>&lt;code>grpcService&lt;/code>&lt;/li>
&lt;li>&lt;code>grpcServiceName&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>There was no feedback that the selected name is confusing in the context of a probe definition.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers' feedback, describe what additional tests need to be added prior to
implementing this enhancement to ensure the enhancements also have solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/probe/grpc&lt;/code>: &lt;code>2023/02/06&lt;/code> - &lt;code>78.1%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>N/A, only unit tests and e2e coverage.&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>Tests in &lt;code>test/e2e/common/node/container_probe.go&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>should &lt;em>not&lt;/em> be restarted with a GRPC liveness probe: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?test=Probing%20container%20should%20%5C*not%5C*%20be%20restarted%20with%20a%20GRPC%20liveness%20probe"
target="_blank" rel="noopener">results&lt;/a>
&lt;/li>
&lt;li>should be restarted with a GRPC liveness probe: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?test=should%20be%20restarted%20with%20a%20GRPC%20liveness%20probe"
target="_blank" rel="noopener">results&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>TODO: stress test to validate the scale (see GA requirements).&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Implement the feature.&lt;/li>
&lt;li>Add unit and e2e tests for the feature.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Solicit feedback from the Alpha.&lt;/li>
&lt;li>Ensure tests are stable and passing.&lt;/li>
&lt;/ul>
&lt;p>Depending on skew strategy:&lt;/p>
&lt;ul>
&lt;li>kubelet version skew ensures all (kubelet ver, cluster ver) support
the feature.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Address feedback from beta usage&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Validate that API is appropriate for users. There are some potential tunables:
&lt;ul>
&lt;li>&lt;code>User-Agent&lt;/code>&lt;/li>
&lt;li>connect timeout&lt;/li>
&lt;li>protocol (HTTP, QUIC)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Close on any remaining open issues &amp;amp; bugs&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Promote tests to conformance&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Implement a stress test&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Upgrade: N/A&lt;/p>
&lt;p>Downgrade: gRPC probes will not be supported in a downgrade from Alpha.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;ul>
&lt;li>We may not be able to graduate this widely until all kubelet version
skew supports the probe type.&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;p>Feature enablement will be guarded by a feature gate flag.&lt;/p>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>GRPCContainerProbe&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kubelet&lt;/code> (probing), API
server (API changes).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
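&lt;p>Feature gates are passed to components as comma-separated &lt;code>Name=bool&lt;/code> pairs (for example &lt;code>--feature-gates=GRPCContainerProbe=true&lt;/code>). The sketch below parses such a value to show the format. &lt;code>parseFeatureGates&lt;/code> is a hypothetical helper written for illustration, not the parser Kubernetes itself uses.&lt;/p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates is a hypothetical helper that parses a comma-separated
// list of Name=bool pairs into a map.
func parseFeatureGates(s string) (map[string]bool, error) {
	gates := make(map[string]bool)
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate empty segments such as a trailing comma
		}
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		enabled, err := strconv.ParseBool(value)
		if err != nil {
			return nil, fmt.Errorf("invalid value for %s: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseFeatureGates("GRPCContainerProbe=true")
	fmt.Println(gates["GRPCContainerProbe"], err)
}
```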
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. This would require restarting kubelet, and so probes for existing
Pods would no longer run.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>It becomes enabled again after the &lt;code>kubelet&lt;/code> restart.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes, unit tests for the feature when enabled and disabled will be
implemented in both kubelet and api server.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;p>The version skew problem for the new API is behind us, so no planning is required.&lt;/p>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The version skew problem is behind us: the API is available across every supported
version skew, so no issues are expected with rollout or rollback.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>Rollback wouldn&amp;rsquo;t address issues. Pods will need to stop using the new probe
type.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Once a gRPC probe is configured and the Pod is scheduled, the metric
&lt;code>probe_total&lt;/code> can be observed to see the result of probe execution.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>Once a gRPC probe is configured and the Pod is scheduled, the metric
&lt;code>probe_total&lt;/code> can be observed to see the result of probe execution.&lt;/p>
&lt;p>An event is emitted for each failed probe, and logs are available in &lt;code>kubelet.log&lt;/code>
to troubleshoot the failing probes.&lt;/p>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>The probe must succeed whenever the service returns the correct response
within the defined timeout, and fail otherwise.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>The metric &lt;code>probe_total&lt;/code> can be used to check for the probe result. Event and
&lt;code>kubelet.log&lt;/code> log entries can be observed to troubleshoot issues.&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>Creation of a probe duration metric is tracked in this issue:
&lt;a href="https://github.com/kubernetes/kubernetes/issues/101035"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/101035&lt;/a>
and is out of scope for this
KEP.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Adds &amp;lt; 200 bytes to Pod.Spec, which is consistent with other probe types.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>The overhead of executing probes is consistent with other probe types.&lt;/p>
&lt;p>We expect a decrease in disk, RAM, and CPU use in the many scenarios where &lt;a href="https://github.com/grpc-ecosystem/grpc-health-probe"
target="_blank" rel="noopener">https://github.com/grpc-ecosystem/grpc-health-probe&lt;/a>
was used to probe gRPC endpoints.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>Yes, gRPC probes use node resources to establish connections.
This may lead to issues like &lt;a href="https://github.com/kubernetes/kubernetes/issues/89898"
target="_blank" rel="noopener">kubernetes/kubernetes#89898&lt;/a>
.&lt;/p>
&lt;p>The node resources used by gRPC probes can be exhausted by a Pod with HostPort
making many connections to different destinations, or by any other process on the node.
This problem cannot be addressed generically.&lt;/p>
&lt;p>However, the design where node resources are used for gRPC probes works
for most setups. The default maximum number of pods per node is &lt;code>110&lt;/code>. There is currently
no limit on the number of containers; it is bounded only by the
amount of resources requested by these containers. With the fix limiting
the &lt;code>TIME_WAIT&lt;/code> for the socket to 1 second,
&lt;a href="https://github.com/kubernetes/kubernetes/issues/89898#issuecomment-1383207322"
target="_blank" rel="noopener">this calculation&lt;/a>
demonstrates it will be hard to reach the limits on sockets.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>Logs and Pod events can be used to troubleshoot probe failures.&lt;/p>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No dependency on etcd availability.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>None&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;ul>
&lt;li>Make sure the feature gate is set.&lt;/li>
&lt;li>Make sure the configuration is correct and the gRPC service is reachable by the kubelet.
This may be different when migrating off &lt;a href="https://github.com/grpc-ecosystem/grpc-health-probe"
target="_blank" rel="noopener">https://github.com/grpc-ecosystem/grpc-health-probe&lt;/a>
and is covered in the feature documentation.&lt;/li>
&lt;li>Analyze &lt;code>kubelet.log&lt;/code> to understand why there is a mismatch between
the service response and the status reported by the probe.&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>Original PR for the k8s prober: &lt;a href="https://github.com/kubernetes/kubernetes/pull/89832"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/89832&lt;/a>
&lt;/li>
&lt;li>2020-04-04: PR for the k8s prober&lt;/li>
&lt;li>2021-05-12: Cloned to this KEP to move the probe forward.&lt;/li>
&lt;li>2021-05-13: Updates.&lt;/li>
&lt;/ul>
&lt;h3 id="alpha-1">Alpha&lt;/h3>
&lt;p>Alpha feature was implemented in 1.23.&lt;/p>
&lt;h3 id="beta-1">Beta&lt;/h3>
&lt;p>Feature is promoted to beta in 1.24.&lt;/p>
&lt;h3 id="ga-1">GA&lt;/h3>
&lt;p>Feature is promoted to GA in 1.27.&lt;/p>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>See &lt;a href="#motivation"
>Motivation&lt;/a>
on why gRPC was picked as another RPC framework
to support natively.&lt;/p>
&lt;p>Adding gRPC is a small increment to k8s functionality with very little side
effects. But providing a lot of &amp;ldquo;quaity of life improvements&amp;rdquo; to gRPC apps.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ul>
&lt;li>3rd party solutions like &lt;a href="https://github.com/grpc-ecosystem/grpc-health-probe"
target="_blank" rel="noopener">https://github.com/grpc-ecosystem/grpc-health-probe&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h2 id="references">References&lt;/h2>
&lt;ul>
&lt;li>gRPC health checking: &lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md"
target="_blank" rel="noopener">https://github.com/grpc/grpc/blob/master/doc/health-checking.md&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Add Informer Metrics</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4346/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4346/</guid><description>
&lt;h1 id="kep-4346-add-informer-metrics">KEP-4346: Add Informer Metrics&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3"
>Story 3&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#informer-metrics"
>Informer metrics&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#reflector-metrics"
>Reflector metrics&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#remove-metrics"
>Remove Metrics&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The informer is a base component of most Kubernetes controllers, so it is important to have a way to check whether it is healthy.
This enhancement proposal adds metrics to the client-go informer. It exposes internal reflector/queue/eventHandler metrics to Prometheus. These metrics are useful for developers and reliability engineers, who can rely on them to monitor informers.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>A Kubernetes controller watches objects for their desired state and actual state, then sends instructions to bring the actual state closer to the desired state. Most controllers use an informer to watch for object changes, then send work items that require reconciliation to the &lt;code>workqueue&lt;/code>.&lt;/p>
&lt;p>The workqueue already exposes metrics such as queueLatency and workDuration, which are useful for finding issues in the reconcile routine. When many objects need to be reconciled but no new work items are sent to the &lt;code>workqueue&lt;/code>, the informer is most likely blocked. An informer is composed of a reflector, a queue, and eventHandlers; today, to find the root cause, users have to add debug logging and change the log level.&lt;/p>
&lt;p>If the informer exposes reflector/queue/eventHandler metrics, it becomes easy to find out why it is blocked. For example, the metrics will show how long, in seconds, an eventHandler takes to process an item.&lt;/p>
&lt;p>The reflector metrics were previously removed by &lt;a href="https://github.com/kubernetes/kubernetes/pull/74636"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/74636&lt;/a>.
Bringing them back makes it essential to fix the memory leak issue.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Add metrics for informer&lt;/li>
&lt;li>Expose informer reflector/queue/eventHandler metrics&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>It does not introduce breaking changes for controllers which use informer.&lt;/li>
&lt;li>It does not modify core Kubernetes components which use informer.&lt;/li>
&lt;li>It does not enumerate every possible informer metric; more can be added as needed.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;ul>
&lt;li>Introduce the informer metrics struct &lt;code>informerMetrics&lt;/code>, which contains queue/eventHandler metrics&lt;/li>
&lt;li>Introduce the informer metrics provider interface &lt;code>informerMetricsProvider&lt;/code>, implemented in &lt;code>k8s.io/component-base/metrics&lt;/code>&lt;/li>
&lt;li>Restore the previously deleted &lt;code>reflectorMetrics&lt;/code>&lt;/li>
&lt;li>Add a feature gate &lt;code>InformerMetrics&lt;/code> to enable informer/reflector metrics&lt;/li>
&lt;/ul>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>The client-go informer creates a RingGrowing buffer, &lt;code>pendingNotifications&lt;/code>, for every eventHandler. This RingGrowing buffer grows but never shrinks. An informer can have several eventHandlers, and it is hard to tell which &lt;code>pendingNotifications&lt;/code> buffer holds a large number of objects. The &lt;code>pendingNotifications&lt;/code> metric will help developers identify the slow eventHandler.&lt;/p>
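&lt;p>To make the grow-but-never-shrink behavior concrete, here is a minimal Go sketch of such a buffer. The type &lt;code>growingRing&lt;/code> and its methods are hypothetical stand-ins written for this illustration, not the client-go implementation. It shows how a burst of notifications can leave the capacity high after the backlog has drained, which is one reason to report both the pending count and the capacity per eventHandler.&lt;/p>

```go
package main

import "fmt"

// growingRing is a simplified, hypothetical stand-in for a RingGrowing-style
// buffer: it doubles its capacity when full and never shrinks.
type growingRing struct {
	data     []int
	readable int // number of items written but not yet read
	beg      int // index of the next item to read
}

// newGrowingRing creates a ring; capacity must be positive.
func newGrowingRing(capacity int) *growingRing {
	r := new(growingRing)
	r.data = make([]int, capacity)
	return r
}

// write appends an item, doubling the backing slice when it is full.
func (r *growingRing) write(v int) {
	n := len(r.data)
	if r.readable == n {
		grown := make([]int, 2*n)
		for i := 0; i != r.readable; i++ {
			grown[i] = r.data[(r.beg+i)%n]
		}
		r.data = grown
		r.beg = 0
		n = len(r.data)
	}
	r.data[(r.beg+r.readable)%n] = v
	r.readable++
}

// read removes the oldest item; ok is false when the ring is empty.
func (r *growingRing) read() (v int, ok bool) {
	if r.readable == 0 {
		return 0, false
	}
	v = r.data[r.beg]
	r.beg = (r.beg + 1) % len(r.data)
	r.readable--
	return v, true
}

func (r *growingRing) pending() int  { return r.readable }
func (r *growingRing) capacity() int { return len(r.data) }

func main() {
	r := newGrowingRing(2)
	// A slow handler lets 100 notifications pile up.
	for i := 0; i != 100; i++ {
		r.write(i)
	}
	// The handler eventually catches up and drains the backlog.
	for r.pending() > 0 {
		r.read()
	}
	// Pending is back to zero, but the capacity stays at its high-water mark.
	fmt.Println(r.pending(), r.capacity())
}
```

&lt;p>In this sketch the pending count alone would look healthy after the drain; only the capacity value reveals that the handler fell behind at some point.&lt;/p>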
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>Users want to know how often the reflector performs a &lt;code>LIST&lt;/code>.&lt;/p>
&lt;h4 id="story-3">Story 3&lt;/h4>
&lt;p>It is hard to know how many items are in the informer queue/store. Adding metrics for the queue and store will help developers find the number of pending deltas.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;p>N/A&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The informer metrics are disabled by default. When they are enabled, the newly added metrics will increase CPU and memory usage.&lt;/p>
&lt;p>If the metrics cause a memory leak, users can disable the informer metrics.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>Add a feature gate &lt;code>InformerMetrics&lt;/code> in client-go. It is disabled by default while in the Alpha state.&lt;/p>
&lt;h3 id="informer-metrics">Informer metrics&lt;/h3>
&lt;p>Introduce the informer metrics structs &lt;code>informerMetrics&lt;/code> and &lt;code>eventHandlerMetrics&lt;/code>. They are similar to the existing &lt;code>workqueue&lt;/code> metrics.&lt;/p>
&lt;pre tabindex="0">&lt;code>type informerMetrics struct {
clock clock.Clock
// total number of item in store
numbernOfStoredItem GaugeMetric
// total number of item in queue
numberOfQueuedItem GaugeMetric
// each eventHandler metrics
eventHandlerMetrics map[string]eventHandlerMetrics
}
type eventHandlerMetrics struct {
// number of pending data
numberOfPendingNotifications GaugeMetric
// size of RingGrowring data
sizeOfRingGrowing GaugeMetric
// how long processing an item from informer reflector
prcoessDuration HistogramMetric
}
// MetricsProvider generates various metrics used by the queue.
type MetricsProvider interface {
// the informer name
NewStoredItemMetric(name string) GaugeMetric
NewQueuedItemMetric(name string) GaugeMetric
// the eventHandler name
NewPendingNotificationsMetric(name string) GaugeMetric
NewRingGrowingMetric(name string) GaugeMetric
NewPrcoessDurationMetric(name string) HistogramMetric
}
&lt;/code>&lt;/pre>&lt;p>Add prometheus metrics item in subsystem &lt;code>informer&lt;/code>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>name&lt;/th>
&lt;th>labels&lt;/th>
&lt;th>description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>store_item_total&lt;/td>
&lt;td>informer name&lt;/td>
&lt;td>Total number of items in the store&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>queued_item_total&lt;/td>
&lt;td>informer name&lt;/td>
&lt;td>Total number of items in the queue&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>pending_notifications_total&lt;/td>
&lt;td>eventHandler name&lt;/td>
&lt;td>Total number of pending notifications in eventHandler RingGrowing&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ring_growing_capacity&lt;/td>
&lt;td>eventHandler name&lt;/td>
&lt;td>Capacity of eventHandler RingGrowing&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>event_process_duration&lt;/td>
&lt;td>eventHandler name&lt;/td>
&lt;td>How long, in seconds, the eventHandler takes to process an item from RingGrowing&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
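&lt;p>As a rough illustration of how the per-eventHandler metrics could be recorded around a handler call, here is a minimal Go sketch. The method sets of &lt;code>GaugeMetric&lt;/code> and &lt;code>HistogramMetric&lt;/code>, the in-memory types &lt;code>memGauge&lt;/code> and &lt;code>memHistogram&lt;/code>, and the helper &lt;code>handle&lt;/code> are all assumptions made for this example; a real provider would be backed by Prometheus collectors.&lt;/p>

```go
package main

import (
	"fmt"
	"time"
)

// GaugeMetric and HistogramMetric are assumed shapes for this sketch; the
// proposal text does not spell out their method sets.
type GaugeMetric interface{ Set(float64) }
type HistogramMetric interface{ Observe(float64) }

// memGauge and memHistogram are hypothetical in-memory implementations.
type memGauge struct{ value float64 }

func (g *memGauge) Set(v float64) { g.value = v }

type memHistogram struct{ samples []float64 }

func (h *memHistogram) Observe(v float64) { h.samples = append(h.samples, v) }

// eventHandlerMetrics mirrors the per-handler struct proposed above.
type eventHandlerMetrics struct {
	numberOfPendingNotifications GaugeMetric
	sizeOfRingGrowing            GaugeMetric
	processDuration              HistogramMetric
}

// handle runs one notification through fn and records the three metrics:
// current backlog, current buffer capacity, and processing time in seconds.
func handle(m eventHandlerMetrics, pending, capacity int, fn func()) {
	m.numberOfPendingNotifications.Set(float64(pending))
	m.sizeOfRingGrowing.Set(float64(capacity))
	start := time.Now()
	fn()
	m.processDuration.Observe(time.Since(start).Seconds())
}

func main() {
	pending, size, dur := new(memGauge), new(memGauge), new(memHistogram)
	m := eventHandlerMetrics{pending, size, dur}
	handle(m, 3, 8, func() { time.Sleep(time.Millisecond) })
	fmt.Println(pending.value, size.value, len(dur.samples))
}
```

&lt;p>Keeping the metric types behind small interfaces, as the workqueue metrics do, lets client-go stay independent of any particular metrics backend.&lt;/p>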
&lt;h3 id="reflector-metrics">Reflector metrics&lt;/h3>
&lt;p>The change in &lt;a href="https://github.com/kubernetes/kubernetes/pull/74636"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/74636&lt;/a>
will be reverted.&lt;/p>
&lt;p>Each set of reflector metrics contains 3 counters, 4 histograms (summaries in the original implementation), and 1 gauge.&lt;/p>
&lt;pre tabindex="0">&lt;code>type reflectorMetrics struct {
numberOfLists CounterMetric
listDuration HistogramMetric
numberOfItemsInList HistogramMetric
numberOfWatches CounterMetric
numberOfShortWatches CounterMetric
watchDuration HistogramMetric
numberOfItemsInWatch HistogramMetric
lastResourceVersion GaugeMetric
}
&lt;/code>&lt;/pre>&lt;p>According to kubernetes/kubernetes#73587, the memory leak is caused by summary. It&amp;rsquo;d be better to use histograms instead. HistogramMetrics are aggregatable and it will reduce memory usage.&lt;/p>
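&lt;p>The aggregation property can be illustrated with a small Go sketch. The &lt;code>histogram&lt;/code> type below is a hypothetical fixed-bucket histogram written for this example, not the component-base implementation. It keeps one counter per bucket, so its memory use does not depend on the number of observations, and two histograms with the same bucket bounds can be combined by adding their counters.&lt;/p>

```go
package main

import "fmt"

// histogram is a hypothetical fixed-bucket histogram. counts[i] holds the
// observations that fall into bucket i (at most bounds[i] and above the
// previous bound); the final slot is the overflow (+Inf) bucket.
type histogram struct {
	bounds []float64 // upper bounds, sorted in ascending order
	counts []uint64
}

func newHistogram(bounds []float64) *histogram {
	h := new(histogram)
	h.bounds = bounds
	h.counts = make([]uint64, len(bounds)+1)
	return h
}

// observe increments exactly one bucket counter.
func (h *histogram) observe(v float64) {
	idx := len(h.bounds) // overflow bucket unless a bound fits
	for i, b := range h.bounds {
		if b >= v {
			idx = i
			break
		}
	}
	h.counts[idx]++
}

// merge adds another histogram with identical bounds, bucket by bucket.
func (h *histogram) merge(other *histogram) {
	for i, c := range other.counts {
		h.counts[i] += c
	}
}

func main() {
	bounds := []float64{0.1, 1, 10}
	a, b := newHistogram(bounds), newHistogram(bounds)
	a.observe(0.05)
	a.observe(0.5)
	b.observe(0.5)
	b.observe(30)
	a.merge(b)
	fmt.Println(a.counts) // prints [1 2 0 1]
}
```

&lt;p>Precomputed summary quantiles, by contrast, cannot be combined across instances in this way, which is the sense in which histograms are described as aggregatable.&lt;/p>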
&lt;h3 id="remove-metrics">Remove Metrics&lt;/h3>
&lt;p>When informers and reflectors are stopped, their associated metrics are removed.&lt;/p>
&lt;p>The Kubernetes component-base metrics library supports deleting metrics by matching labels.&lt;/p>
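&lt;p>The idea of deleting by label match can be sketched in Go as follows. The &lt;code>registry&lt;/code> type, its methods, and the label keys &lt;code>name&lt;/code> and &lt;code>handler&lt;/code> are hypothetical and exist only for this illustration; they are not the component-base API. The sketch removes every series belonging to a stopped informer in one call, so series for stopped informers do not accumulate.&lt;/p>

```go
package main

import "fmt"

// series is one labelled time series in a hypothetical in-memory registry.
type series struct {
	labels map[string]string
	value  float64
}

// registry is a stand-in for a metric vector; it is not the component-base API.
type registry struct{ all []series }

// add records a new series with the given labels and value.
func (r *registry) add(labels map[string]string, v float64) {
	r.all = append(r.all, series{labels, v})
}

// deleteMatching drops every series whose labels contain all pairs in match
// and returns how many series were removed.
func (r *registry) deleteMatching(match map[string]string) int {
	kept := r.all[:0]
	removed := 0
	for _, s := range r.all {
		matches := true
		for k, v := range match {
			if s.labels[k] != v {
				matches = false
			}
		}
		if matches {
			removed++
		} else {
			kept = append(kept, s)
		}
	}
	r.all = kept
	return removed
}

func main() {
	r := new(registry)
	r.add(map[string]string{"name": "pod-informer", "handler": "a"}, 4)
	r.add(map[string]string{"name": "pod-informer", "handler": "b"}, 9)
	r.add(map[string]string{"name": "node-informer", "handler": "a"}, 1)
	// The pod informer stopped: remove all of its series in one call.
	removed := r.deleteMatching(map[string]string{"name": "pod-informer"})
	fmt.Println(removed, len(r.all)) // prints 2 1
}
```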
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>Unit tests to ensure that the metrics output meets expectations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Unit tests to ensure that the metrics deletion is functioning properly.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>We will have extensive integration testing of the informer metrics code in the
&lt;code>test/integration/metrics&lt;/code> package.&lt;/p>
&lt;ul>
&lt;li>When the &lt;code>InformerMetrics&lt;/code> feature gate is enabled, ensure the metrics are exposed and that their subsystem, labels, and granularity are correct.&lt;/li>
&lt;li>When the informers and reflectors are stopped, ensure the associated metrics are removed.&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>&lt;test>: &lt;link to test coverage>&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature gate&lt;/li>
&lt;li>Add related integration and unit tests to ensure functionality and make sure there is no memory leak in
existing behavior&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from developers and surveys&lt;/li>
&lt;li>Work on feedback and add additional tests as needed&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Decision on GA will be made based on beta feedback&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>N/A&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>N/A&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!--
This section must be completed when targeting alpha to a release.
-->
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;!--
Pick one of these and delete the rest.
Documentation is available on [feature gate lifecycle] and expectations, as
well as the [existing list] of feature gates.
[feature gate lifecycle]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: InformerMetrics&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>components that use the client-go library&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
&lt;p>No, it does not change any default behavior. Enabling it will, however, increase memory usage in client-go.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
Feature gates are typically disabled by setting the flag to `false` and
restarting the component. No other changes should be necessary to disable the
feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;p>Yes, by disabling the &lt;code>InformerMetrics&lt;/code> feature gate in the components that use the client-go library.
Informers will then no longer expose these metrics.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The expected behavior of the feature will be restored.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
Additionally, for features that are introducing a new API field, unit tests that
are exercising the `switch` of feature gate itself (what happens if I disable a
feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
&lt;p>For now, there are no tests for feature enablement/disablement. Unit and integration tests will be added.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;p>Feature has no impact on rollout/rollback, and no impact on running workloads.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;p>A rollback should be considered if the memory used by these metrics continues to grow and consumes a significant amount of memory.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;p>Not yet. This could be tested during the alpha releases.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;p>This feature does not deprecate or remove any features/APIs/fields/flags/etc.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Informer / Reflector (e.g., &lt;code>lists_total&lt;/code>, &lt;code>watches_total&lt;/code>) metrics returned by the operator are populated&lt;/li>
&lt;/ul>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:
&lt;ul>
&lt;li>The following metrics are available when &lt;code>InformerMetrics&lt;/code> is enabled:
&lt;ul>
&lt;li>lists_total&lt;/li>
&lt;li>watches_total&lt;/li>
&lt;li>last_resource_version&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;p>Enabling the feature gate will increase memory usage, but memory usage should not grow continuously.
The memory consumption of informerMetrics / eventHandlerMetrics / reflectorMetrics should remain stable.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: Memory usage&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric: Operating System/golang pprof&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;p>Not at the moment.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;p>Yes. The informer metrics will increase CPU/RAM usage.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
&lt;p>Yes. When informer metrics are enabled, the only effect on the kubelet is increased CPU/RAM usage.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;p>N/A&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;ul>
&lt;li>2023-11-29: Initial draft KEP&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;p>N/A&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
&lt;p>N/A&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Add job creation timestamp to job annotations</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4026/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4026/</guid><description>
&lt;h1 id="kep-4026-add-job-creation-timestamp-to-job-annotations">KEP-4026: Add job creation timestamp to job annotations&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Currently, there is no supported way to get the originally scheduled (expected) timestamp for a Job created from a CronJob. This KEP proposes to set the original scheduled time as an annotation in the Job metadata.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Set the job&amp;rsquo;s scheduled timestamp as an annotation on the job.&lt;/li>
&lt;li>Adding the annotation should not be disruptive to existing workloads.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>At a high level, the proposal is to modify the CronJob controller to set the job scheduled timestamp as a job annotation. The details of this are outlined in the Design Details section below.&lt;/p>
&lt;p>Job scheduled timestamp annotation: &lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code>&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a user, I would like to get the scheduled timestamp at which a job was expected to run.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The CronJob controller always works under the assumption that changes apply only to Jobs created after the change. Therefore, the annotation will be injected only into Jobs newly created from CronJobs while the feature is on. This plays nicely with downgrades and doesn&amp;rsquo;t introduce unnecessary complexity.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>The CronJob controller will only need a minor update to the &lt;a href="https://github.com/kubernetes/kubernetes/blob/7024beeeeb1f2e4cde93805a137cd7ad92fec466/pkg/controller/cronjob/utils.go#L188"
target="_blank" rel="noopener">getJobFromTemplate2&lt;/a>
function to add the job&amp;rsquo;s scheduled timestamp as the job annotation &lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code>. The scheduled timestamp is formatted as &lt;code>RFC3339&lt;/code>.&lt;/p>
&lt;p>For the scheduled timestamp&amp;rsquo;s timezone, the initial thought was to use &lt;code>UTC&lt;/code>, since it is the primary timezone and causes less confusion. However, since the &lt;code>CronJob&lt;/code> object has a &lt;code>spec.timeZone&lt;/code>, it was better to use that same timezone. If the CronJob&amp;rsquo;s &lt;code>spec.timeZone&lt;/code> is not set or &lt;code>nil&lt;/code>, the annotation will use the &lt;code>UTC&lt;/code> timezone as a default.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/li>
&lt;/ul>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/controller/cronjob&lt;/code>: &lt;code>09/24/2023&lt;/code> - &lt;code>71.2%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>No integration tests are planned for this feature.&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/4aeaf1e99e82da8334c0d6dddd848a194cd44b4f/test/e2e/apps/cronjob.go#L264-L287"
target="_blank" rel="noopener">CronJob should set the cronjob-scheduled-timestamp annotation&lt;/a>
: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?test=.*CronJob%20should%20set%20the%20cronjob-scheduled-timestamp%20annotation.*"
target="_blank" rel="noopener">test coverage&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>The feature will be released directly in Beta, since it simply adds a new annotation and carries very little risk; there is no benefit in having an alpha release.&lt;/p>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind the &lt;code>CronJobsScheduledAnnotation&lt;/code> feature gate.&lt;/li>
&lt;li>Unit and e2e tests passing.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;p>Fix any potentially reported bugs.&lt;/p>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>No changes are required to an existing cluster to use this feature.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>N/A. This feature doesn&amp;rsquo;t require coordination between control plane components;
the changes to each controller are self-contained.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>CronJobsScheduledAnnotation&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kube-controller-manager&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism: N/A.&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane? No&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or re-provisioning of a node? No&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Jobs newly created by the CronJob controller will carry a new annotation, &lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code>.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. If the feature gate is disabled, the CronJob controller will not add the
scheduled timestamp as an annotation.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The CronJob controller will begin adding the scheduled timestamp as an annotation to jobs created while the feature is enabled, and existing jobs will be unaffected.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Given the feature results in adding an annotation only to newly created objects, those tests won&amp;rsquo;t really be different from the actual feature tests.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>This change cannot cause a rollout or rollback to fail. It also will not impact already running workloads.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>Users can monitor the CronJob metrics &lt;code>job_creation_skew_duration_seconds&lt;/code>, &lt;code>cronjob_controller_rate_limiter_use&lt;/code>, and &lt;code>cronjob_job_creation_skew&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>The following manual upgrade-&amp;gt;downgrade-&amp;gt;upgrade scenario was performed:&lt;/p>
&lt;ol>
&lt;li>Create a v1.27 cluster, where the feature is not yet available.&lt;/li>
&lt;li>Create a CronJob and wait for jobs to be created. Verify the newly created job
does NOT have the &lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code> annotation.&lt;/li>
&lt;li>Upgrade the cluster to v1.28, where the feature is available as beta, i.e.
on by default. Verify the newly created job from the CronJob created in step 2
has the &lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code> annotation, set to
the planned time at which the job was to be created.&lt;/li>
&lt;li>Downgrade the cluster to v1.27, where the feature is NOT available. Verify the
newly created job from the CronJob created in step 2 does NOT have the
&lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code> annotation.&lt;/li>
&lt;/ol>
&lt;p>During the tests no problems were identified with cronjobs or jobs.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Spot-checking Jobs created by CronJobs for the &lt;code>batch.kubernetes.io/cronjob-scheduled-timestamp&lt;/code> annotation is sufficient; on small clusters this check alone shows whether the feature is in use. For monitoring purposes, we can rely on pre-existing metrics that track both the cronjob queue and the job creation skew, which should provide sufficient signal that the controller is working as expected.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> API .metadata
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:
&lt;ul>
&lt;li>&lt;code>.metadata.annotations['batch.kubernetes.io/cronjob-scheduled-timestamp']&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;ul>
&lt;li>The 99th percentile per day of Job sync duration is &amp;lt;= 15s for a client-side 50 QPS limit.&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: cronjob_job_creation_skew&lt;/li>
&lt;li>Components exposing the metric: kube-controller-manager&lt;/li>
&lt;li>Metric name: job_creation_skew_duration_seconds&lt;/li>
&lt;li>Components exposing the metric: kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Yes, each job created by the cronjob-controller will have an additional annotation containing an &lt;code>RFC3339&lt;/code> timestamp, which together with the annotation name adds ~70B per job object.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No change compared to existing failure modes.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>The new annotation shouldn&amp;rsquo;t cause any unforeseen issues with the cronjob controller.
If SLOs are not being met, cluster admins are advised to consult the
&lt;a href="https://kubernetes.io/docs/tasks/debug/"
target="_blank" rel="noopener">troubleshooting overview document&lt;/a>.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2023-06-06: KEP published&lt;/li>
&lt;li>2024-09-24: KEP updated for stable promotion&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Add label instead of annotation&lt;/p>
&lt;ul>
&lt;li>Labels are unnecessary because the data being passed will not be used for searching or for selecting objects that satisfy certain conditions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Add a status field&lt;/p>
&lt;ul>
&lt;li>The object already has the &lt;code>CreationTimestamp&lt;/code> field, but it will get overridden with the time the CronJob will start. The point of the new annotation is to pass the original/expected scheduled timestamp information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>N/A&lt;/p></description></item><item><title>Resources: Add kubelet instance configuration to configure CRI socket for each node</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4656/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4656/</guid><description/></item><item><title>Resources: Add NonPreempting Option For PriorityClasses</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/902/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/902/</guid><description>
&lt;h1 id="add-nonpreempting-option-for-priorityclasses">Add NonPreempting Option For PriorityClasses&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#testing-plan"
>Testing Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha-v115"
>Alpha (v1.15):&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta-v119"
>Beta (v1.19):&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#stable-v124"
>Stable (v1.24):&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature enablement and rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>&lt;a href="https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/"
target="_blank" rel="noopener">PriorityClasses&lt;/a>
are a GA feature as of 1.14,
which impact the scheduling and eviction of pods.
Pods are scheduled in descending order of priority.
If a pod cannot be scheduled due to insufficient resources,
lower-priority pods will be preempted to make room.&lt;/p>
&lt;p>This proposal makes the preempting behavior optional for a PriorityClass,
by adding a new field to PriorityClasses,
which in turn populates PodSpec.
If a pod is waiting to be scheduled,
and it does not have preemption enabled,
it will not trigger preemption of other pods.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Allowing PriorityClasses to be non-preempting is important for running batch workloads.&lt;/p>
&lt;p>Batch workloads typically have a backlog of work,
with unscheduled pods.
Higher-priority workloads can be assigned a higher priority via a PriorityClass,
but this may result in pods with partially-completed work being preempted.
Adding the non-preempting option allows users to prioritize the scheduling queue,
without discarding incomplete work.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Add a boolean flag to PriorityClasses,
to enable or disable preemption for pods of that PriorityClass.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Protecting pods from preemption. PodDisruptionBudget should be used for that purpose.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Add a Preempting field to both PodSpec and PriorityClass.
This field will default to true,
for backwards compatibility.&lt;/p>
&lt;p>If Preempting is true for a pod,
the scheduler will preempt lower priority pods to schedule this pod,
as is current behavior.&lt;/p>
&lt;p>If Preempting is false,
a pod of that priority will not preempt other pods.&lt;/p>
&lt;p>Setting the Preempting field in PriorityClass provides a straightforward interface,
and allows ResourceQuotas to restrict preemption.&lt;/p>
&lt;p>PriorityClass type example:&lt;/p>
&lt;pre tabindex="0">&lt;code>type PriorityClass struct {
    metav1.TypeMeta
    metav1.ObjectMeta
    Value         int32
    GlobalDefault bool
    Description   string
    Preempting    *bool // New option
}
&lt;/code>&lt;/pre>&lt;p>The Preempting field in PodSpec will be populated during pod admission,
similarly to how the PriorityClass Value is populated.
Storing the Preempting field in the pod spec has several benefits:&lt;/p>
&lt;ul>
&lt;li>The scheduler does not need to be aware of PriorityClasses,
as all relevant information is in the pod.&lt;/li>
&lt;li>Mutating PriorityClass objects does not impact existing pods.&lt;/li>
&lt;li>Kubelets can set Preempting on static pods.&lt;/li>
&lt;/ul>
&lt;p>PodSpec type example:&lt;/p>
&lt;pre tabindex="0">&lt;code>type PodSpec struct {
    ...
    Preempting *bool
    ...
}
&lt;/code>&lt;/pre>&lt;p>This feature should be gated in alpha, provisionally under the gate &lt;code>NonPreemptingPriority&lt;/code>.&lt;/p>
&lt;p>Documentation must be updated to reflect the new feature,
and changes to PriorityClass/PodSpec fields.&lt;/p>
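&lt;p>The admission-time population and the scheduler-side check described in this proposal can be sketched as follows. This is a simplified illustration, not the actual admission plugin or scheduler code: the trimmed-down types and the functions &lt;code>admit&lt;/code> and &lt;code>mayPreempt&lt;/code> are hypothetical.&lt;/p>

```go
package main

import "fmt"

// PriorityClass and PodSpec are trimmed-down, hypothetical stand-ins for the
// types shown above; only the fields needed for this sketch are kept.
type PriorityClass struct {
	Value      int32
	Preempting *bool
}

type PodSpec struct {
	Priority   *int32
	Preempting *bool
}

// admit copies Value and Preempting from the PriorityClass into the pod spec,
// defaulting Preempting to true when the class leaves it unset.
func admit(pc PriorityClass, spec *PodSpec) {
	priority := new(int32)
	*priority = pc.Value
	spec.Priority = priority

	preempting := new(bool)
	*preempting = true // default, for backwards compatibility
	if pc.Preempting != nil {
		*preempting = *pc.Preempting
	}
	spec.Preempting = preempting
}

// mayPreempt reports whether the scheduler may evict lower-priority pods to
// place this pod. A nil field is treated as true. The decision uses only the
// pod spec, so no PriorityClass lookup is needed.
func mayPreempt(spec PodSpec) bool {
	return spec.Preempting == nil || *spec.Preempting
}

func main() {
	never := new(bool) // zero value is false: a non-preempting class
	batch := PriorityClass{Value: 1000, Preempting: never}
	legacy := PriorityClass{Value: 1000} // Preempting left unset

	batchPod, legacyPod := new(PodSpec), new(PodSpec)
	admit(batch, batchPod)
	admit(legacy, legacyPod)

	fmt.Println("batch pod may preempt:", mayPreempt(*batchPod))
	fmt.Println("legacy pod may preempt:", mayPreempt(*legacyPod))
}
```

&lt;p>Because the value is copied at admission, later edits to the PriorityClass leave the copies in existing pod specs unchanged, which matches the second benefit listed above.&lt;/p>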
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;p>A user is running batch workloads on a cluster.
The user has a high-priority job,
that they wish to schedule before other workloads in the queue.
As the user does not want to preempt running batch workloads and discard work,
the user creates the new workload with a high-priority,
non-preempting PriorityClass.
The new workload&amp;rsquo;s pods are scheduled ahead of the queue,
without disrupting running workloads.&lt;/p>
&lt;ul>
&lt;li>Users are able to run preempting and non-preempting workloads in a stable manner,
and are not requesting additional changes.&lt;/li>
&lt;li>The feature has been stable and reliable in at least 2 releases.&lt;/li>
&lt;li>Adequate documentation exists for preemption and the optional field.&lt;/li>
&lt;li>Test coverage includes non-preempting use cases.&lt;/li>
&lt;li>Conformance requirements for non-preempting PriorityClasses are agreed upon.&lt;/li>
&lt;/ul>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The new feature may malfunction,
or existing preemption functionality may be impaired.
New tests (covering both non-preempting workloads and mixed workloads),
and the existing preempting PriorityClass tests should be used to prove stability.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="testing-plan">Testing Plan&lt;/h3>
&lt;p>Add detailed unit and integration tests for nonpreempting workloads.&lt;/p>
&lt;p>Add basic e2e tests, to ensure all components are working together.&lt;/p>
&lt;p>Ensure existing tests (for preempting PriorityClasses) do not break.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha-v115">Alpha (v1.15):&lt;/h4>
&lt;ul>
&lt;li>Support NonPreemptingPriority in PriorityClasses&lt;/li>
&lt;/ul>
&lt;h4 id="beta-v119">Beta (v1.19):&lt;/h4>
&lt;ul>
&lt;li>Add integration test for NonPreemptingPriority.&lt;/li>
&lt;li>Graduate NonPreemptingPriority to Beta.&lt;/li>
&lt;li>Update documents to reflect the changes.&lt;/li>
&lt;/ul>
&lt;h4 id="stable-v124">Stable (v1.24):&lt;/h4>
&lt;ul>
&lt;li>No negative feedback.&lt;/li>
&lt;li>Enhance the message of the existing scheduling-failure event to include details about preemption.&lt;/li>
&lt;li>Graduate NonPreemptingPriority to GA.&lt;/li>
&lt;li>Update documents to reflect the changes.&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature enablement and rollback&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate
&lt;ul>
&lt;li>Feature gate name: NonPreemptingPriority&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver&lt;/li>
&lt;li>kube-scheduler&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Does enabling the feature change any default behavior?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we rollback
the enablement)?&lt;/strong>
Yes. This feature can be disabled by restarting kube-apiserver and kube-scheduler with the feature gate turned off.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What happens if we reenable the feature if it was previously rolled back?&lt;/strong>
If we reenable the feature, the Pod with high priority and NonPreemptionPolicy will be eligible to preempt other pods with low priority when cluster resources are tight.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can a rollout fail? Can it impact already running workloads?&lt;/strong>
If a rollout fails, kube-scheduler will keep crashing. Running workloads won&amp;rsquo;t be affected by kube-scheduler.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What specific metrics should inform a rollback?&lt;/strong>
Check the following indicators to determine if there are any exceptions:&lt;/p>
&lt;ul>
&lt;li>pod_preemption_victims&lt;/li>
&lt;li>total_preemption_attempts&lt;/li>
&lt;li>scheduling_algorithm_preemption_evaluation_seconds&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Were upgrade and rollback tested? Was upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/strong>
Manually tested successfully. The test environment version was v1.23. We tested enabling and disabling this
feature. After each change to the feature gate, three PriorityClasses were recreated (one
high-priority class with preemptionPolicy set to Never, another high-priority class with preemptionPolicy
unset, and one low-priority class with preemptionPolicy unset). Multiple pods were then created with these three
PriorityClasses to verify that the preemption results were as expected.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Is the rollout accompanied by any deprecations and/or removals of features?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>The operator can determine if the workload is using the feature by checking if the priorityclass&amp;rsquo;s preemptionPolicy is set to &amp;ldquo;Never&amp;rdquo;.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason: kube-scheduler sends an event if the pod preempts other pods. If the feature is working and the pod&amp;rsquo;s PriorityClass has preemptionPolicy set to Never, there will be no preemption-related event for this pod.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: Check whether pods with preemptionPolicy set to Never preempt lower-priority pods when the cluster cannot satisfy their resource requests.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: preemption_victims&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric: kube-scheduler&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>We currently only have events that describe a pod being preempted by another pod, but we don&amp;rsquo;t
have an event that describes why preemption sometimes does not succeed. We can enhance the
message of the existing scheduling-failure event to include details about preemption. This
will improve observability for this feature and other scenarios.&lt;/p>
&lt;p>In addition to events, we can add metrics about how many pods have stopped preempting other pods because of this no-preemption option. However, since the probability of this metric being used is likely to be small, it was not added.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new calls to cloud
provider?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing size or count
of the existing API objects?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing time taken by any
operations covered by existing SLIs/SLOs?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How does this feature react if the API server and/or etcd is unavailable?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are other known failure modes?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What steps should be taken if SLOs are not being met to determine the problem?&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;ol>
&lt;li>Check the logs, where errors from the preemption process are visible.&lt;/li>
&lt;li>Check the metrics below to determine whether there is an exception.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>pod_preemption_victims&lt;/li>
&lt;li>total_preemption_attempts&lt;/li>
&lt;li>scheduling_algorithm_preemption_evaluation_seconds&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2019-03-17: Initial KEP&lt;/li>
&lt;li>2020-05-19: Graduate the feature to Beta&lt;/li>
&lt;li>2022-01-15: Graduate the feature to GA&lt;/li>
&lt;/ul></description></item><item><title>Resources: Add pod-startup liveness-probe holdoff for slow-starting pods</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/950/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/950/</guid><description>
&lt;h1 id="add-pod-startup-liveness-probe-holdoff-for-slow-starting-pods">Add pod-startup liveness-probe holdoff for slow-starting pods&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#implementation-details"
>Implementation Details&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#why-a-new-probe-instead-of-initializationfailurethreshold"
>Why a new probe instead of initializationFailureThreshold&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#configuration-example"
>Configuration example&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#feature-gate"
>Feature Gate&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#version-116"
>Version 1.16&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-117"
>Version 1.17&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-118"
>Version 1.18&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-119"
>Version 1.19&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-120"
>Version 1.20&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> KEP approvers have set the KEP status to &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Test plan is in place, giving consideration to SIG Architecture and SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://github.com/kubernetes/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Slow starting containers are difficult to handle with health probes as they currently exist: they are either killed before they are up, or may be left deadlocked for a very long time before being killed.&lt;/p>
&lt;p>This proposal adds a new probe called &lt;code>startupProbe&lt;/code> that holds off all the other probes until the pod has finished its startup. In the case of a slow-starting pod, it could poll on a relatively short period with a high &lt;code>failureThreshold&lt;/code>. Once it is satisfied, the other probes can start.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Slow starting containers here refer to containers that require a significant amount of time (one to several minutes) to start. There can be various reasons for this slow startup:&lt;/p>
&lt;ul>
&lt;li>long data initialization: only the first startup takes a lot of time&lt;/li>
&lt;li>heavy workload: every startup takes a lot of time&lt;/li>
&lt;li>underpowered/overloaded node: startup times depend on external factors (however, solving node related issues is not a goal of this proposal)&lt;/li>
&lt;/ul>
&lt;p>The main problem with this kind of container is that it should be given enough time to start before &lt;code>livenessProbe&lt;/code> fails &lt;code>failureThreshold&lt;/code> times, which triggers a kill by the &lt;code>kubelet&lt;/code> before the container has a chance to be up.&lt;/p>
&lt;p>There are various strategies to handle this situation with the current API:&lt;/p>
&lt;ul>
&lt;li>Delay the initial &lt;code>livenessProbe&lt;/code> sufficiently to permit the container to start up (set &lt;code>initialDelaySeconds&lt;/code> greater than &lt;strong>startup time&lt;/strong>). While this ensures no &lt;code>livenessProbe&lt;/code> will run and fail during the startup period (triggering a kill), it also delays deadlock detection if the container starts faster than &lt;code>initialDelaySeconds&lt;/code>. Also, since the &lt;code>livenessProbe&lt;/code> isn&amp;rsquo;t run at all during startup, there is no feedback loop on the actual startup time of the container.&lt;/li>
&lt;li>Increase the allowed number of &lt;code>livenessProbe&lt;/code> failures until &lt;code>kubelet&lt;/code> kills the container (set &lt;code>failureThreshold&lt;/code> so that &lt;code>failureThreshold&lt;/code> times &lt;code>periodSeconds&lt;/code> is greater than &lt;strong>startup time&lt;/strong>). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.&lt;/li>
&lt;/ul>
&lt;p>However, neither of these strategies provides a timely answer to slow starting containers stuck in a deadlock, which is the primary reason for setting up a &lt;code>livenessProbe&lt;/code>.&lt;/p>
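&lt;p>The trade-off between the two strategies can be made concrete with some rough arithmetic. The sketch below is illustrative only: the &lt;code>probeConfig&lt;/code> type, the helper names, and the 300-second startup time are assumptions, and the timings ignore &lt;code>timeoutSeconds&lt;/code> and any scheduling jitter in the kubelet.&lt;/p>

```go
package main

import "fmt"

// probeConfig is a hypothetical holder for the three livenessProbe fields
// discussed above. It is not a Kubernetes API type.
type probeConfig struct {
	initialDelaySeconds int
	periodSeconds       int
	failureThreshold    int
}

// startupBudget approximates how long a container may take to start before
// the livenessProbe has failed failureThreshold times.
func startupBudget(c probeConfig) int {
	return c.initialDelaySeconds + c.failureThreshold*c.periodSeconds
}

// blindWindow is the period after container start during which no
// livenessProbe runs at all, so a deadlock cannot be noticed.
func blindWindow(c probeConfig) int {
	return c.initialDelaySeconds
}

// hangDetection approximates how long a hang goes unnoticed once probing is
// under way: failureThreshold consecutive failed periods.
func hangDetection(c probeConfig) int {
	return c.failureThreshold * c.periodSeconds
}

func main() {
	// Assumed example: a container that needs up to about 300 s to start.
	delayed := probeConfig{initialDelaySeconds: 300, periodSeconds: 10, failureThreshold: 1}
	tolerant := probeConfig{initialDelaySeconds: 0, periodSeconds: 10, failureThreshold: 30}

	fmt.Println("delayed: ", startupBudget(delayed), blindWindow(delayed), hangDetection(delayed))
	fmt.Println("tolerant:", startupBudget(tolerant), blindWindow(tolerant), hangDetection(tolerant))
}
```

&lt;p>Under these assumptions, the first configuration leaves a window of about 300 seconds with no feedback at all, while the second needs about 300 seconds to notice a hang at any point in the container's life.&lt;/p>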
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Allow slow starting containers to run safely during startup with health probes enabled.&lt;/li>
&lt;li>Improve documentation of the &lt;code>Probe&lt;/code> structure in core types&amp;rsquo; API.&lt;/li>
&lt;li>Improve &lt;code>kubernetes.io/docs&lt;/code> section about Pod lifecycle:
&lt;ul>
&lt;li>Clearly state that PostStart handlers do not delay probe executions.&lt;/li>
&lt;li>Introduce and explain this new probe.&lt;/li>
&lt;li>Document appropriate use cases for this new probe.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>This proposal does not address the issue of pod load affecting startup (or any other probe that may be delayed due to load). It is acting strictly at the pod level, not the node level.&lt;/li>
&lt;li>This proposal will only update the official Kubernetes documentation, excluding &lt;a href="https://blog.openshift.com/kubernetes-pods-life/"
target="_blank" rel="noopener">A Pod&amp;rsquo;s Life&lt;/a>
and other well referenced pages explaining probes.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="implementation-details">Implementation Details&lt;/h3>
&lt;p>The proposed solution is to add a new probe named &lt;code>startupProbe&lt;/code> in the container spec of a pod which will determine whether it has finished starting up.&lt;/p>
&lt;p>It also requires keeping the state of the container (has the &lt;code>startupProbe&lt;/code> ever succeeded?) using a boolean &lt;code>Started&lt;/code> inside the ContainerStatus struct.&lt;/p>
&lt;p>Depending on the value of &lt;code>Started&lt;/code>, the probing mechanism in &lt;code>worker.go&lt;/code> changes as follows:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Started == true&lt;/code>: the kubelet worker works the same way as today&lt;/li>
&lt;li>&lt;code>Started == false&lt;/code>: the kubelet worker only probes the &lt;code>startupProbe&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>If &lt;code>startupProbe&lt;/code> fails more than &lt;code>failureThreshold&lt;/code> times, the result is the same as today when &lt;code>livenessProbe&lt;/code> fails: the container is killed and might be restarted depending on &lt;code>restartPolicy&lt;/code>.&lt;/p>
&lt;p>If no &lt;code>startupProbe&lt;/code> is defined, &lt;code>Started&lt;/code> is initialized with &lt;code>true&lt;/code>.&lt;/p>
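&lt;p>The gating described above can be sketched as a small state machine. This is a simplified illustration, not the kubelet code: the &lt;code>worker&lt;/code> type, &lt;code>newWorker&lt;/code>, &lt;code>tick&lt;/code>, and the returned strings are all hypothetical, and the sketch assumes that reaching &lt;code>failureThreshold&lt;/code> failures is what triggers the kill.&lt;/p>

```go
package main

import "fmt"

// worker is a hypothetical, simplified stand-in for the kubelet probe
// worker. It is not the real type from worker.go.
type worker struct {
	started          bool
	startupFailures  int
	failureThreshold int
}

// newWorker initializes started to true when no startupProbe is defined,
// so the other probes run exactly as they do today.
func newWorker(hasStartupProbe bool, failureThreshold int) worker {
	return worker{started: !hasStartupProbe, failureThreshold: failureThreshold}
}

// tick handles one probe period and reports what the worker does.
// startupOK is the result the startupProbe would return in this period.
func (w *worker) tick(startupOK bool) string {
	if w.started {
		return "run liveness and readiness probes"
	}
	if startupOK {
		w.started = true
		return "startup succeeded"
	}
	w.startupFailures++
	// Assumption: reaching the threshold counts as exceeding it.
	if w.startupFailures >= w.failureThreshold {
		return "kill container"
	}
	return "keep waiting"
}

func main() {
	w := newWorker(true, 3)
	for _, ok := range []bool{false, false, true, false} {
		fmt.Println(w.tick(ok))
	}
}
```

&lt;p>Once &lt;code>started&lt;/code> is true, the &lt;code>startupProbe&lt;/code> result is no longer consulted, which matches the test plan item that the probe is disabled after startup.&lt;/p>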
&lt;h3 id="why-a-new-probe-instead-of-initializationfailurethreshold">Why a new probe instead of initializationFailureThreshold&lt;/h3>
&lt;p>While trying to merge PR &lt;a href="https://github.com/kubernetes/enhancements/pull/1014"
target="_blank" rel="noopener">#1014&lt;/a>
in time for code-freeze, @thockin has made the following points which I agree with:&lt;/p>
&lt;blockquote>
&lt;p>I feel pretty strongly that something like a startupProbe would be net simpler to comprehend than a new field on liveness.&lt;/p>
&lt;p>In &lt;a href="https://github.com/kubernetes/kubernetes/issues/27114#issuecomment-437208330"
target="_blank" rel="noopener">issuecomment-437208330&lt;/a>
we looked at a different take on this API - it is more precise in its meaning and rather than add yet another behavior modifier to probe, it can reuse the probe structure directly.&lt;/p>&lt;/blockquote>
&lt;p>Here is the excerpt of &lt;a href="https://github.com/kubernetes/kubernetes/issues/27114#issuecomment-437208330"
target="_blank" rel="noopener">issuecomment-437208330&lt;/a>
talking about the design:&lt;/p>
&lt;blockquote>
&lt;p>An idea that I toyed with but never pursued was a StartupProbe - all the other probes would wait on it at pod startup. It could poll on a relatively short period with a long FailureThreshold. Once it is satisfied, the other probes can start.&lt;/p>&lt;/blockquote>
&lt;p>I also think the third probe gives more flexibility if we find other good reasons to inhibit &lt;code>livenessProbe&lt;/code> or &lt;code>readinessProbe&lt;/code> before something occurs during container startup.&lt;/p>
&lt;h3 id="configuration-example">Configuration example&lt;/h3>
&lt;p>This example shows how startupProbe can be used to emulate the functionality of &lt;code>initializationFailureThreshold&lt;/code> as it was proposed before:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">ports&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>liveness-port&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">containerPort&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">8080&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">hostPort&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">8080&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">livenessProbe&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">httpGet&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">path&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>/healthz&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">port&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>liveness-port&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">failureThreshold&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">periodSeconds&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">startupProbe&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">httpGet&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">path&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>/healthz&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">port&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>liveness-port&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">failureThreshold&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">30&lt;/span>&lt;span style="color:#bbb"> &lt;/span>(=initializationFailureThreshold)&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">periodSeconds&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>Unit tests will be implemented with &lt;code>newTestWorker&lt;/code> and will check the following:&lt;/p>
&lt;ul>
&lt;li>proper initialization of &lt;code>Started&lt;/code> to false&lt;/li>
&lt;li>&lt;code>Started&lt;/code> becomes true as soon as &lt;code>startupProbe&lt;/code> succeeds&lt;/li>
&lt;li>&lt;code>livenessProbe&lt;/code> and &lt;code>readinessProbe&lt;/code> are disabled until &lt;code>Started&lt;/code> is true&lt;/li>
&lt;li>&lt;code>startupProbe&lt;/code> is disabled after &lt;code>Started&lt;/code> becomes true&lt;/li>
&lt;li>&lt;code>failureThreshold&lt;/code> exceeded for &lt;code>startupProbe&lt;/code> kills the container&lt;/li>
&lt;/ul>
&lt;p>E2e tests will also cover the main use-case for this probe:&lt;/p>
&lt;ul>
&lt;li>&lt;code>startupProbe&lt;/code> disables &lt;code>livenessProbe&lt;/code> long enough to simulate a slow starting container, using a high &lt;code>failureThreshold&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="feature-gate">Feature Gate&lt;/h3>
&lt;ul>
&lt;li>Expected feature gate key: &lt;code>StartupProbeEnabled&lt;/code>&lt;/li>
&lt;li>Expected default value: &lt;code>false&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;ul>
&lt;li>Alpha: Initial support for &lt;code>startupProbe&lt;/code> added. Disabled by default.&lt;/li>
&lt;li>Beta: &lt;code>startupProbe&lt;/code> enabled with no default configuration.&lt;/li>
&lt;li>Stable: &lt;code>startupProbe&lt;/code> enabled with no default configuration.&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2018-11-27: prototype implemented in PR &lt;a href="https://github.com/kubernetes/kubernetes/pull/71449"
target="_blank" rel="noopener">#71449&lt;/a>
under review&lt;/li>
&lt;li>2019-03-05: present KEP to sig-node&lt;/li>
&lt;li>2019-04-11: open issue in enhancements &lt;a href="https://github.com/kubernetes/enhancements/issues/950"
target="_blank" rel="noopener">#950&lt;/a>
&lt;/li>
&lt;li>2019-05-01: redesign to additional probe after @thockin &lt;a href="https://github.com/kubernetes/kubernetes/issues/27114#issuecomment-437208330"
target="_blank" rel="noopener">proposal&lt;/a>
&lt;/li>
&lt;li>2019-05-02: add test plan&lt;/li>
&lt;/ul>
&lt;h3 id="version-116">Version 1.16&lt;/h3>
&lt;ul>
&lt;li>Implement &lt;code>startupProbe&lt;/code> as Alpha &lt;a href="https://github.com/kubernetes/kubernetes/pull/77807"
target="_blank" rel="noopener">#77807&lt;/a>
&lt;/li>
&lt;li>Cherry pick of #82747 &lt;a href="https://github.com/kubernetes/kubernetes/pull/83607"
target="_blank" rel="noopener">#83607&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="version-117">Version 1.17&lt;/h3>
&lt;ul>
&lt;li>Fix &lt;code>startup_probe_test.go&lt;/code> failing test &lt;a href="https://github.com/kubernetes/kubernetes/issues/82747"
target="_blank" rel="noopener">#82747&lt;/a>
&lt;/li>
&lt;li>Add &lt;code>startupProbe&lt;/code> result handling to kuberuntime &lt;a href="https://github.com/kubernetes/kubernetes/pull/84279"
target="_blank" rel="noopener">#84279&lt;/a>
&lt;/li>
&lt;li>Clarify startupProbe e2e tests &lt;a href="https://github.com/kubernetes/kubernetes/pull/84291"
target="_blank" rel="noopener">#84291&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="version-118">Version 1.18&lt;/h3>
&lt;ul>
&lt;li>Graduate &lt;code>startupProbe&lt;/code> to Beta &lt;a href="https://github.com/kubernetes/kubernetes/pull/83437"
target="_blank" rel="noopener">#83437&lt;/a>
&lt;/li>
&lt;li>Cherry pick of #92196 &lt;a href="https://github.com/kubernetes/kubernetes/pull/92477"
target="_blank" rel="noopener">#92477&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="version-119">Version 1.19&lt;/h3>
&lt;ul>
&lt;li>Pods which have not &amp;ldquo;started&amp;rdquo; can not be &amp;ldquo;ready&amp;rdquo; &lt;a href="https://github.com/kubernetes/kubernetes/pull/92196"
target="_blank" rel="noopener">#92196&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="version-120">Version 1.20&lt;/h3>
&lt;ul>
&lt;li>Graduate &lt;code>startupProbe&lt;/code> to GA &lt;a href="https://github.com/kubernetes/kubernetes/pull/94160"
target="_blank" rel="noopener">#94160&lt;/a>
&lt;/li>
&lt;/ul></description></item><item><title>Resources: Add ProcMount option</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4265/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4265/</guid><description>
&lt;h1 id="kep-4265-add-procmount-option">KEP-4265: add ProcMount option&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>For Linux containers, the Kubelet instructs container runtimes to mask and set as read-only certain paths in &lt;code>/proc&lt;/code>.
This prevents data that should not be visible inside a container from being exposed to it.
However, there are certain use-cases where it is necessary to turn this off.&lt;/p>
&lt;p>This KEP proposes adding a field to the Pod security context to allow bypassing the usual restrictions.&lt;/p>
&lt;p>In 1.12, this was introduced as the ProcMountType feature gate, and it has languished in alpha ever since. This KEP is
a successor to (and heavily based on) &lt;a href="https://github.com/kubernetes/community/pull/1934"
target="_blank" rel="noopener">https://github.com/kubernetes/community/pull/1934&lt;/a>
, updated for the modern era.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Some end users would like to run unprivileged containers &lt;em>nested inside&lt;/em> a Kubernetes container using user namespaces. The outer container is started by the CRI implementation.
Kubernetes defaults to masking the &lt;code>/proc&lt;/code> mount of a container, setting some paths as read only. To run a nested container within an unprivileged Pod, a user would need a way to
override that default masking behavior.&lt;/p>
&lt;p>Please see the following filed issues for more information:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/opencontainers/runc/issues/1658#issuecomment-373122073"
target="_blank" rel="noopener">opencontainers/runc#1658&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/moby/moby/issues/36597"
target="_blank" rel="noopener">moby/moby#36597&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/moby/moby/pull/36644"
target="_blank" rel="noopener">moby/moby#36644&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Allow users to opt out of the CRI masking &lt;code>/proc&lt;/code> for Linux containers.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Add a new &lt;code>string&lt;/code> named &lt;code>procMount&lt;/code> to the &lt;code>securityContext&lt;/code> definition for choosing from a set of proc mount isolation mode options.&lt;/p>
&lt;p>The default for &lt;code>procMount&lt;/code> is &lt;code>Default&lt;/code>, which instructs the container runtime to mask the aforementioned paths.&lt;/p>
&lt;p>This will look like the following in the spec:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ProcMountType &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// DefaultProcMount uses the container runtime default ProcType. Most &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// container runtimes mask certain paths in /proc to avoid accidental security&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// exposure of special devices or information.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DefaultProcMount ProcMountType = &lt;span style="color:#b44">&amp;#34;Default&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// UnmaskedProcMount bypasses the default masking behavior of the container&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// runtime and ensures the newly created /proc the container stays intact with&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// no modifications. &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> UnmaskedProcMount ProcMountType = &lt;span style="color:#b44">&amp;#34;Unmasked&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>procMount &lt;span style="color:#666">*&lt;/span>ProcMountType
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>where nil is default, and is interpreted as &amp;ldquo;Default&amp;rdquo; ProcMountType.&lt;/p>
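&lt;p>The nil handling can be illustrated with a short sketch. The type and constants are repeated from the snippet above so that the example is self-contained; &lt;code>effectiveProcMount&lt;/code> is a hypothetical helper name and is not taken from the Kubernetes source.&lt;/p>

```go
package main

import "fmt"

type ProcMountType string

const (
	DefaultProcMount  ProcMountType = "Default"
	UnmaskedProcMount ProcMountType = "Unmasked"
)

// effectiveProcMount is a hypothetical helper: a nil pointer means the field
// was left unset and is interpreted as DefaultProcMount.
func effectiveProcMount(p *ProcMountType) ProcMountType {
	if p == nil {
		return DefaultProcMount
	}
	return *p
}

func main() {
	// An explicitly set field, as a pod author requesting Unmasked would have.
	unmasked := new(ProcMountType)
	*unmasked = UnmaskedProcMount

	fmt.Println(effectiveProcMount(nil))      // field not set
	fmt.Println(effectiveProcMount(unmasked)) // field set to Unmasked
}
```

&lt;p>Using a pointer keeps "not set" distinguishable from an explicit value, so existing pods that never mention &lt;code>procMount&lt;/code> keep the masked behavior.&lt;/p>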
&lt;p>When the kubelet is presented with a pod whose ProcMountType is Unmasked, it sets the list of
masked paths it passes down to the CRI to be &lt;a href="https://github.com/kubernetes/kubernetes/blob/964529b/pkg/securitycontext/util.go#L216"
target="_blank" rel="noopener">empty&lt;/a>
, and sends that list in the &lt;a href="https://github.com/kubernetes/kubernetes/blob/964529b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L889-L891"
target="_blank" rel="noopener">CRI request&lt;/a>
.&lt;/p>
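&lt;p>A minimal sketch of that selection follows. The helper name &lt;code>maskedPathsFor&lt;/code> is hypothetical, and the three paths are an assumed, illustrative subset rather than the authoritative list, which lives in the kubelet source linked above.&lt;/p>

```go
package main

import "fmt"

type ProcMountType string

const (
	DefaultProcMount  ProcMountType = "Default"
	UnmaskedProcMount ProcMountType = "Unmasked"
)

// illustrativeMaskedPaths is an assumed subset used only for this example.
var illustrativeMaskedPaths = []string{"/proc/kcore", "/proc/keys", "/proc/timer_list"}

// maskedPathsFor is a hypothetical helper returning the masked paths the
// kubelet would hand to the container runtime for a given ProcMountType.
func maskedPathsFor(pm ProcMountType) []string {
	if pm == UnmaskedProcMount {
		// Unmasked: pass an empty list so that nothing in /proc is masked.
		return []string{}
	}
	return illustrativeMaskedPaths
}

func main() {
	fmt.Println(len(maskedPathsFor(DefaultProcMount)), "paths masked by default")
	fmt.Println(len(maskedPathsFor(UnmaskedProcMount)), "paths masked when Unmasked")
}
```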
&lt;p>This requires changes to the CRI runtime integrations so that kubelet will add the specific &lt;code>unmasked&lt;/code> option.
This was done after alpha:&lt;/p>
&lt;ul>
&lt;li>CRI-O has support in v1.25.0 after &lt;a href="https://github.com/cri-o/cri-o/pull/6025/commits/4102586132214263c5d0ae93ec257432653ab82b"
target="_blank" rel="noopener">https://github.com/cri-o/cri-o/pull/6025/commits/4102586132214263c5d0ae93ec257432653ab82b&lt;/a>
&lt;/li>
&lt;li>containerd has support in 1.6. See &lt;a href="https://github.com/containerd/containerd/pull/5070/commits/07f1df4541d6a81c205d194f4f6ea3a6a95c3e29"
target="_blank" rel="noopener">https://github.com/containerd/containerd/pull/5070/commits/07f1df4541d6a81c205d194f4f6ea3a6a95c3e29&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>The main use case for unmasking paths in &lt;code>/proc&lt;/code> is a user nesting unprivileged containers within a container. However, having an Unmasked ProcMountType
is a privileged operation, and thus is part of the &lt;a href="https://k8s.io/docs/concepts/security/pod-security-standards/#privileged"
target="_blank" rel="noopener">privileged&lt;/a>
Pod Security Admission (PSA). Since a user must be in
the privileged policy, they are also trusted to choose the correct user ID and run a workload that won&amp;rsquo;t interfere with the host.&lt;/p>
&lt;p>A container running as the root user on the host with an unmasked &lt;code>/proc&lt;/code> could write to the host &lt;code>/proc&lt;/code>, so this privileged designation is appropriate.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a cluster admin, I would like a way to nest containers within containers. To do so, the top-level containers need an unmasked /proc.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>As a Kubernetes user, I may want to build containers from within a Kubernetes container.
See &lt;a href="https://github.com/jessfraz/blog/blob/master/content/post/building-container-images-securely-on-kubernetes.md"
target="_blank" rel="noopener">this article for more information&lt;/a>
.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;ul>
&lt;li>A user turning this on without user namespaces enabled
&lt;ul>
&lt;li>Admission should deny a pod that tries to use &lt;code>ProcMountType: Unmasked&lt;/code> with &lt;code>HostUsers: true&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>More trust in user namespacing/the kernel instead of container runtime
&lt;ul>
&lt;li>This is probably the correct direction to head in.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
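&lt;p>The first mitigation amounts to a simple validation rule, sketched below. The &lt;code>podSecurity&lt;/code> struct, its flattened fields, and the &lt;code>validateProcMount&lt;/code> name are hypothetical and only stand in for the real pod spec and admission code; in particular, the real &lt;code>hostUsers&lt;/code> field is optional, which this sketch does not model.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// podSecurity is a hypothetical, flattened view of the two fields involved.
type podSecurity struct {
	hostUsers bool   // true means the pod shares the host user namespace
	procMount string // "Default" or "Unmasked"
}

// validateProcMount rejects an Unmasked /proc unless the pod runs in its own
// user namespace (hostUsers set to false).
func validateProcMount(p podSecurity) error {
	if p.procMount == "Unmasked" {
		if p.hostUsers {
			return errors.New("procMount Unmasked requires hostUsers to be false")
		}
	}
	return nil
}

func main() {
	fmt.Println(validateProcMount(podSecurity{hostUsers: true, procMount: "Unmasked"}))
	fmt.Println(validateProcMount(podSecurity{hostUsers: false, procMount: "Unmasked"}))
	fmt.Println(validateProcMount(podSecurity{hostUsers: true, procMount: "Default"}))
}
```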
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>pkg/securitycontext&lt;/code>: &lt;code>10-05-2023&lt;/code> - &lt;code>70.04&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>N/A (Kubelet barely defines integration tests today, focusing on e2e_node tests instead)&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>test/e2e_node&lt;/p>
&lt;/li>
&lt;li>
&lt;p>additional tests should be added to e2e_node suite to test the adherence of the ProcMount field&lt;/p>
&lt;ul>
&lt;li>Test default behavior actually masks /proc paths.&lt;/li>
&lt;li>Test Unmasked behavior is not masking /proc paths.&lt;/li>
&lt;li>Test PSA integration (if possible to test in e2e)&lt;/li>
&lt;li>Test that a Windows pod cannot be scheduled when a value for ProcMount is specified&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature flag&lt;/li>
&lt;li>Add e2e tests for the feature (must be done before beta)
&lt;ul>
&lt;li>Including ones for enabling/disabling the feature&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Explicitly require hostUsers option to be &lt;code>false&lt;/code> if this option is enabled.
&lt;ul>
&lt;li>Otherwise, this option effectively becomes another &amp;ldquo;privileged&amp;rdquo; field&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Allowing time for feedback&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Turn off the feature gate to turn off the feature.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>The feature gate is only processed by the API server; the Kubelet has no awareness of it. The API server will scrub the ProcMount field from the request
if it doesn&amp;rsquo;t support the feature gate. Since all supported Kubelet versions support the ProcMountType field, version skew is not a concern.
The API server can have the feature gate toggled without doing the same for Kubelets.&lt;/p>
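&lt;p>As a rough illustration of the scrubbing described above, the following Python sketch drops &lt;code>procMount&lt;/code> from a pod spec when the gate is off. It is not the kube-apiserver implementation: the function name &lt;code>drop_disabled_proc_mount&lt;/code> and the dict-based pod spec are assumptions made for this example, and it ignores the update path, where a field already in use on an existing object may need to be preserved.&lt;/p>

```python
import copy

# Hypothetical helper: pod_spec is a plain dict shaped like a Pod's .spec.
CONTAINER_LISTS = ("containers", "initContainers", "ephemeralContainers")


def drop_disabled_proc_mount(pod_spec, feature_enabled):
    """Return pod_spec unchanged if the gate is on, else a scrubbed copy."""
    if feature_enabled:
        return pod_spec
    cleaned = copy.deepcopy(pod_spec)
    for list_name in CONTAINER_LISTS:
        for container in cleaned.get(list_name, []):
            security_context = container.get("securityContext")
            if security_context:
                # Remove the field so the Kubelet falls back to default masking.
                security_context.pop("procMount", None)
    return cleaned
```

&lt;p>Because the Kubelet never sees the gate in this model, a scrubbed pod simply runs with the default masked &lt;code>/proc&lt;/code>.&lt;/p>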
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: ProcMountType&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver (kube-apiserver filters &lt;code>procMount&lt;/code> field if it&amp;rsquo;s not enabled).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. It only gives users access to the Unmasked ProcMountType.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. This can be done by removing the feature gate from all kube-apiservers. To fully roll back, the nodes will need to be drained or rebooted,
as the Kubelet will not remove the &lt;code>procMount&lt;/code> of an already running container.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>Nothing special. The pod&amp;rsquo;s &lt;code>procMount&lt;/code> field depends on where in the enablement process the kube-apiserver was when it was created.
The container has to be restarted to be up to date with the kube-apiserver.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes. I have manually tested feature enablement and disablement on kube-apiserver, and verified that pods are not recreated without
a drain. There will be an e2e test to verify this as well.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>It cannot. Either the kube-apiserver has the feature gate on or not. If it has it on, then workloads with the feature enabled will get an Unmasked
ProcMountType if they request it. If it&amp;rsquo;s off, then the kube-apiserver will force it to default, and the container&amp;rsquo;s creation will move forward
without an Unmasked ProcMountType.&lt;/p>
&lt;p>Already running workloads aren&amp;rsquo;t stopped and restarted on a feature revert, so an admin would need to reboot or drain to impact running workloads.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>The behavior of this feature has been consistent for more than 10 minor releases, so these tests are less relevant now.
Put differently: there is no upgrade-&amp;gt;downgrade-&amp;gt;upgrade path between supported versions of kubernetes that support this feature.&lt;/p>
&lt;p>Manual testing has been done between versions that do support it, toggling the feature on and off. In these cases, the feature works as described.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>&lt;code>kubectl get pods --all-namespaces -o jsonpath=&amp;quot;{range .items[*]}{.metadata.name}{' '}{.spec.containers[*].securityContext.procMount}{'\n'}{end}&amp;quot; | grep -i unmasked&lt;/code>
This prints every pod that has a container with an Unmasked ProcMountType, along with the pod name.&lt;/p>
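&lt;p>The same check can be run against the JSON form of the pod list, for example the output of &lt;code>kubectl get pods --all-namespaces -o json&lt;/code> after parsing it with &lt;code>json.load&lt;/code>. The sketch below is only an illustration: &lt;code>pods_with_unmasked_proc&lt;/code> is a hypothetical helper name, and unlike the jsonpath query above it also inspects &lt;code>initContainers&lt;/code>.&lt;/p>

```python
def pods_with_unmasked_proc(pod_list):
    """Return namespace/name for each pod with an Unmasked procMount container."""
    matches = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for container in containers:
            security_context = container.get("securityContext") or {}
            if security_context.get("procMount") == "Unmasked":
                meta = pod.get("metadata", {})
                matches.append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
                break  # one match per pod is enough
    return matches
```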
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: Containers created with the Unmasked ProcMountType have the paths listed &lt;a href="https://github.com/kubernetes/kubernetes/blob/964529b/pkg/securitycontext/util.go#L193"
target="_blank" rel="noopener">here&lt;/a>
as writable, not read-only:
&lt;ul>
&lt;li>&amp;ldquo;/proc/asound&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/acpi&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/kcore&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/keys&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/latency_stats&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/timer_list&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/timer_stats&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/sched_debug&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/proc/scsi&amp;rdquo;,&lt;/li>
&lt;li>&amp;ldquo;/sys/firmware&amp;rdquo;,&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Another option is to run &lt;code>kubectl exec $podname -- mount | grep /proc&lt;/code>.
&lt;ul>
&lt;li>If there&amp;rsquo;s just one mount, and it looks like &lt;code>proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)&lt;/code>, this is an unmasked &lt;code>/proc&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
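&lt;p>The mount-based check above can be automated. The following Python sketch takes the text printed by &lt;code>mount&lt;/code> inside the container and applies the rule stated above: exactly one mount at or under &lt;code>/proc&lt;/code>, of type &lt;code>proc&lt;/code>, mounted read-write. The helper name &lt;code>proc_looks_unmasked&lt;/code> is hypothetical, and the parser assumes the usual &lt;code>SOURCE on TARGET type FSTYPE (OPTIONS)&lt;/code> line shape; lines in any other shape are ignored.&lt;/p>

```python
def proc_looks_unmasked(mount_output):
    """Return True if the mount listing shows a single read-write /proc mount."""
    proc_mounts = []
    for line in mount_output.splitlines():
        fields = line.split()
        # Assumed shape: SOURCE on TARGET type FSTYPE (OPTIONS)
        if len(fields) == 6 and fields[1] == "on" and fields[3] == "type":
            target = fields[2]
            if target == "/proc" or target.startswith("/proc/"):
                proc_mounts.append(fields)
    if len(proc_mounts) != 1:
        # Masked containers carry extra mounts layered over /proc paths.
        return False
    _source, _on, target, _type, fstype, options = proc_mounts[0]
    option_list = options.strip("()").split(",")
    return target == "/proc" and fstype == "proc" and "rw" in option_list
```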
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>No noticeable change in pod start times when this feature is enabled.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: &lt;code>kubelet_pod_start_sli_duration_seconds&lt;/code>&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric: kubelet&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>I don&amp;rsquo;t think any would be useful.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>I don&amp;rsquo;t think any would be useful.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;ul>
&lt;li>A CRI implementation that supports this feature
&lt;ul>
&lt;li>All supported versions currently do.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>ProcMountType in the pod spec&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>There is one additional field in the pod API: &lt;code>procMount&lt;/code>. It is an enum with two values: &lt;code>Default&lt;/code> and &lt;code>Unmasked&lt;/code>.
The Kubelet also passes the MaskedPaths to the CRI as a single slice of strings.
When the value is &lt;code>Default&lt;/code>, the slice is the one defined &lt;a href="https://github.com/kubernetes/kubernetes/blob/964529b/pkg/securitycontext/util.go#L193"
target="_blank" rel="noopener">here&lt;/a>.
When the value is &lt;code>Unmasked&lt;/code>, the slice is empty.
Both of these are size changes on the order of bytes and can be considered negligible.&lt;/p>
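&lt;p>To make the size argument concrete, here is a small Python sketch of the mapping from &lt;code>procMount&lt;/code> to the masked-path slice. The path list is the one enumerated earlier in this KEP; the names &lt;code>DEFAULT_MASKED_PATHS&lt;/code> and &lt;code>masked_paths_for&lt;/code> are invented for this example and do not mirror the Kubelet source.&lt;/p>

```python
# Paths as listed in the Monitoring Requirements section of this KEP.
DEFAULT_MASKED_PATHS = [
    "/proc/asound",
    "/proc/acpi",
    "/proc/kcore",
    "/proc/keys",
    "/proc/latency_stats",
    "/proc/timer_list",
    "/proc/timer_stats",
    "/proc/sched_debug",
    "/proc/scsi",
    "/sys/firmware",
]


def masked_paths_for(proc_mount):
    """Return the masked paths for a procMount value (None means unset)."""
    if proc_mount == "Unmasked":
        return []
    # Unset and "Default" are treated the same way in this sketch.
    return list(DEFAULT_MASKED_PATHS)
```

&lt;p>Either way the payload is at most ten short strings per container.&lt;/p>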
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>Potentially. A malicious user who is given access and runs a root container in the host context could interfere with host processes.
PSA has already been configured to mitigate this by requiring that a user be in a privileged namespace to get access to the field.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No effect&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>Malicious user gaining access to the host &lt;code>/proc&lt;/code> with a rootful container&lt;/p>
&lt;ul>
&lt;li>admission should be updated to deny unmasked ProcMountType without user namespaces (hostUsers: true)&lt;/li>
&lt;/ul>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>The field can be unset in a pod spec (or the feature gate turned off) to see whether SLOs are met once the feature is disabled for pods.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2018-05-07: k/community update opened&lt;/li>
&lt;li>2018-05-27: k/kubernetes PR merged with support.&lt;/li>
&lt;li>2023-10-02: KEP opened and retargeted at Alpha&lt;/li>
&lt;li>2024-02-26: &lt;a href="https://github.com/kubernetes/kubernetes/pull/123520"
target="_blank" rel="noopener">Update&lt;/a>
Unmasked ProcMountType to fail validation without a pod level user namespace.&lt;/li>
&lt;li>2024-05-31: Added e2e &lt;a href="https://github.com/kubernetes/kubernetes/pull/123303"
target="_blank" rel="noopener">tests&lt;/a>
&lt;/li>
&lt;li>2024-05-31: KEP updated to Beta&lt;/li>
&lt;li>2025-01-31: KEP updated to on by default Beta&lt;/li>
&lt;li>2026-01-29: KEP updated to GA&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ul>
&lt;li>&lt;code>--oci-worker-no-process-sandbox&lt;/code> like in &lt;a href="https://github.com/moby/buildkit/blob/v0.12.2/examples/kubernetes/job.rootless.yaml#L31"
target="_blank" rel="noopener">BuildKit&lt;/a>
&lt;ul>
&lt;li>Not broadly supported with other container runtimes/builders.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Update the kernel to allow mounting a new procfs with masks.
&lt;ul>
&lt;li>Proposed, but &lt;a href="https://patchwork.kernel.org/project/linux-fsdevel/patch/20180404115311.725-1-alban@kinvolk.io/"
target="_blank" rel="noopener">denied&lt;/a>
in the kernel&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Adopt a similar approach to LXD where &lt;code>/proc&lt;/code> and &lt;code>/sys&lt;/code> are mounted to different locations within the container, instead of masked.&lt;/li>
&lt;li>Give all pods with &lt;code>hostUsers: false&lt;/code> (pod level user namespace) access to these mounts by default
&lt;ul>
&lt;li>Even though it is potentially safe, it opens an argument that user namespaced pods are less secure than non user namespaced pods. The weakening of these boundaries should be opt-in.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ditch this option
&lt;ul>
&lt;li>Most use cases don&amp;rsquo;t really need this. However, if a pod wants to be able to, for instance, set its own sysctls, it would need this option.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2></description></item><item><title>Resources: Add Recreate Update Strategy to StatefulSet</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3541/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3541/</guid><description>
&lt;h1 id="kep-3541-add-recreate-update-strategy-to-statefulset">KEP-3541: Add Recreate Update Strategy to StatefulSet&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#current-behavior-and-problems"
>Current Behavior and Problems&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#why-existing-solutions-are-insufficient"
>Why Existing Solutions Are Insufficient&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proposed-solution-benefits"
>Proposed Solution Benefits&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1-cicd-platform-team"
>Story 1: CI/CD Platform Team&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2-stateless-web-application"
>Story 2: Stateless Web Application&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3-developmentexperiment-environment"
>Story 3: Development/Experiment Environment&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-4-external-data-storage"
>Story 4: External Data Storage&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-5-leaderworkerset-lws-use-case"
>Story 5: LeaderWorkerSet (LWS) Use Case&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats"
>Notes/Constraints/Caveats&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risk-unintended-data-loss"
>Risk: Unintended Data Loss&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#detailed-algorithm-specification"
>Detailed Algorithm Specification&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#current-rollingupdate-algorithm"
>Current RollingUpdate Algorithm&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proposed-recreate-strategy-algorithm"
>Proposed Recreate Strategy Algorithm&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#api-changes"
>API Changes&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#spec-changes"
>Spec Changes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#status-changes"
>Status Changes&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-changes"
>Implementation Changes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#comparison-with-existing-solutions"
>Comparison with Existing Solutions&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade"
>Upgrade&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade"
>Downgrade&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#downtime-requirement"
>Downtime Requirement&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#limited-rollback-options"
>Limited Rollback Options&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alternative-1-podprogresstimeoutseconds-field-in-rollingupdate-strategy"
>Alternative 1: PodProgressTimeoutSeconds Field in RollingUpdate Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternative-2-enforcedrollingupdate-strategy"
>Alternative 2: EnforcedRollingUpdate Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternative-3-now-primary-solution-recreate-strategy"
>Alternative 3: (Now Primary Solution): Recreate Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternative-4-add-force-flag-to-rollingupdate"
>Alternative 4: Add Force Flag to RollingUpdate&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternative-5-enhance-parallel-policy"
>Alternative 5: Enhance Parallel Policy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>StatefulSets currently offer two update strategies: &lt;code>OnDelete&lt;/code> (manual) and &lt;code>RollingUpdate&lt;/code> (automatic, default). When using &lt;code>RollingUpdate&lt;/code> with the default &lt;code>podManagementPolicy: OrderedReady&lt;/code>, StatefulSets follow sequential ordering where each individual pod must be Running and Ready before the controller proceeds to update the next pod. Even with the &lt;code>maxUnavailable&lt;/code> option (which allows multiple pods to be updated simultaneously), the controller still requires each pod to reach Ready state before moving forward, so stuck pods halt the entire update process. While &lt;code>podManagementPolicy: Parallel&lt;/code> allows pods to be updated simultaneously without waiting for Ready state, stuck pods remain and are not automatically replaced. This design ensures data safety for stateful workloads but creates a critical operational problem.&lt;/p>
&lt;p>When a StatefulSet update results in pods that fail to reach Ready state (due to configuration errors, resource constraints, etc.), the rolling update process becomes permanently stuck. Even after applying a corrected configuration, the controller will not automatically replace the broken pods, requiring manual intervention to delete stuck pods.&lt;/p>
&lt;p>This behavior has generated significant user frustration across multiple GitHub issues (&lt;a href="https://github.com/kubernetes/kubernetes/issues/67250"
target="_blank" rel="noopener">#67250&lt;/a>
, &lt;a href="https://github.com/kubernetes/kubernetes/issues/60164"
target="_blank" rel="noopener">#60164&lt;/a>
, &lt;a href="https://github.com/kubernetes/kubernetes/issues/109597"
target="_blank" rel="noopener">#109597&lt;/a>
) with users reporting:&lt;/p>
&lt;ul>
&lt;li>Broken CI/CD pipelines requiring manual intervention&lt;/li>
&lt;li>Inability to automatically recover from configuration mistakes&lt;/li>
&lt;li>Operational burden in managing stateful applications&lt;/li>
&lt;/ul>
&lt;p>This KEP proposes adding a new &lt;code>Recreate&lt;/code> update strategy to StatefulSets, mirroring the behavior of Deployments&amp;rsquo;
Recreate strategy. This strategy deletes all pods, waits for full termination, then creates new pods according
to &lt;code>podManagementPolicy&lt;/code>. This provides a simple, predictable way to handle stuck pods and enables automated recovery for workloads that can tolerate downtime (CI/CD environments, stateless applications using StatefulSet
for pod identity, applications with external data storage, and use cases like LeaderWorkerSet). The &lt;code>Recreate&lt;/code>
strategy offers a clean parallel with existing Kubernetes patterns, simplifies controller logic, and provides users with explicit control over update behavior.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="current-behavior-and-problems">Current Behavior and Problems&lt;/h3>
&lt;p>StatefulSets with &lt;code>RollingUpdate&lt;/code> strategy follow this algorithm:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>If &lt;code>podManagementPolicy: OrderedReady&lt;/code> (default):&lt;/p>
&lt;ol>
&lt;li>Update pods in reverse ordinal order (N-1, N-2, &amp;hellip;, 0)&lt;/li>
&lt;li>For each pod, wait until it becomes Running and Ready before proceeding to the next&lt;/li>
&lt;li>If any pod fails to become Ready, the entire update process halts&lt;/li>
&lt;li>Even when a corrected configuration is applied, stuck pods are never automatically replaced&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>If &lt;code>podManagementPolicy: Parallel&lt;/code>:&lt;/p>
&lt;ol>
&lt;li>Update all pods simultaneously (or up to &lt;code>maxUnavailable&lt;/code> at a time if specified)&lt;/li>
&lt;li>Pods are created/deleted without waiting for Ready state&lt;/li>
&lt;li>Stuck pods do not block other pods from being updated&lt;/li>
&lt;li>Even when a corrected configuration is applied, stuck pods are never automatically replaced&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ol>
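&lt;p>The &lt;code>OrderedReady&lt;/code> case above can be sketched as a tiny simulation. This is a toy model, not the StatefulSet controller: &lt;code>ordered_ready_rollout&lt;/code> and the &lt;code>becomes_ready&lt;/code> callback are hypothetical names, and readiness is reduced to a yes/no answer per ordinal.&lt;/p>

```python
def ordered_ready_rollout(replicas, becomes_ready):
    """Return the ordinals touched by the rollout before it finishes or halts."""
    touched = []
    # Pods are updated in reverse ordinal order: N-1, N-2, ..., 0.
    for ordinal in reversed(range(replicas)):
        touched.append(ordinal)
        if not becomes_ready(ordinal):
            # The pod never becomes Ready, so lower ordinals are never updated.
            break
    return touched


# A bad image means no updated pod ever becomes Ready: only the
# highest ordinal is touched and the rollout stays stuck there.
stuck = ordered_ready_rollout(3, lambda ordinal: False)
```

&lt;p>In this model the stuck pod stays in place until something outside the loop deletes it, which is the manual intervention this KEP aims to remove.&lt;/p>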
&lt;p>The current approach was designed for stateful workloads where data persistence is critical, pod identity and storage are tightly coupled, or automatic pod deletion could cause data loss.&lt;/p>
&lt;p>This behavior has significant impact across multiple scenarios:&lt;/p>
&lt;p>&lt;strong>CI/CD Pipeline Failures&lt;/strong>: Teams report broken deployments that require manual intervention, breaking automation:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic"># Example: A typo in image name breaks the entire update&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>apps/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>StatefulSet&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>app&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">        &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>myapp:v2.0.0-typo &lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># ImagePullBackOff&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">        &lt;/span>&lt;span style="color:#080;font-style:italic"># Update gets stuck, requires manual pod deletion&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Operational Overhead&lt;/strong>: Platform teams must either build custom controllers or delete stuck pods by hand to recover from stuck updates.&lt;/p>
&lt;h3 id="why-existing-solutions-are-insufficient">Why Existing Solutions Are Insufficient&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;a href="https://github.com/kubernetes/enhancements/issues/961"
target="_blank" rel="noopener">MaxUnavailable&lt;/a>
doesn&amp;rsquo;t address the core issue.
The &lt;code>maxUnavailable&lt;/code> option in &lt;code>RollingUpdate&lt;/code> strategy allows multiple pods to be updated simultaneously, but its behavior depends on &lt;code>podManagementPolicy&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">podManagementPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Parallel&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">updateStrategy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>RollingUpdate&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">rollingUpdate&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>&lt;span style="color:#008000;font-weight:bold">maxUnavailable&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">2&lt;/span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># Can update 2 pods at once&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>With &lt;code>podManagementPolicy: Parallel&lt;/code> + &lt;code>maxUnavailable: 2&lt;/code>, multiple pods can be updated simultaneously, but if any pod fails to reach Ready state, it remains stuck and requires manual cleanup. Stuck pods don&amp;rsquo;t block other pods from updating, but they are never automatically replaced (see section 2 below).&lt;/p>
&lt;p>With &lt;code>podManagementPolicy: OrderedReady&lt;/code>, updates happen one pod at a time in reverse ordinal order. If any pod fails to reach Ready state, the entire rolling update process halts completely, even with &lt;code>maxUnavailable&lt;/code> configured. The controller waits indefinitely for stuck pods to become Ready.&lt;/p>
&lt;p>Example scenario with &lt;code>podManagementPolicy: OrderedReady&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>StatefulSet with 5 replicas&lt;/li>
&lt;li>The controller updates pod &lt;code>app-4&lt;/code> first&lt;/li>
&lt;li>&lt;code>app-4&lt;/code> gets stuck in &lt;code>ImagePullBackOff&lt;/code>&lt;/li>
&lt;li>Even after fixing the image name, &lt;code>app-4&lt;/code> remains stuck&lt;/li>
&lt;li>Update process cannot proceed to &lt;code>app-3&lt;/code>, &lt;code>app-2&lt;/code>, &lt;code>app-1&lt;/code>, or &lt;code>app-0&lt;/code>&lt;/li>
&lt;li>Manual intervention is still required: &lt;code>kubectl delete pod app-4&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Custom controllers are not a sustainable workaround.
Some teams have built custom controllers to delete stuck pods, but this approach:&lt;/p>
&lt;ul>
&lt;li>Duplicates StatefulSet controller logic&lt;/li>
&lt;li>Creates maintenance burden&lt;/li>
&lt;li>May conflict with StatefulSet controller behavior&lt;/li>
&lt;li>Lacks integration with StatefulSet status and events&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="proposed-solution-benefits">Proposed Solution Benefits&lt;/h3>
&lt;p>Adding a &lt;code>Recreate&lt;/code> update strategy to StatefulSets addresses these issues in the following ways:&lt;/p>
&lt;ol>
&lt;li>Stuck pods are cleared and replaced during updates without manual intervention.&lt;/li>
&lt;li>The algorithm stays simple, with no timeout tracking or transient failure detection.&lt;/li>
&lt;li>The behavior is consistent with the existing Recreate strategy of Deployments.&lt;/li>
&lt;li>All stuck scenarios are handled, regardless of whether pods are stuck in ImagePullBackOff, Pending, CrashLoopBackOff, or any other state.&lt;/li>
&lt;/ol>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ol>
&lt;li>Add a new &lt;code>Recreate&lt;/code> update strategy type to StatefulSet, providing a third option alongside &lt;code>OnDelete&lt;/code> and &lt;code>RollingUpdate&lt;/code>&lt;/li>
&lt;li>Align StatefulSet update strategies with Deployment patterns for API consistency&lt;/li>
&lt;li>Enable automated recovery from stuck pod states without manual intervention&lt;/li>
&lt;li>Provide a simple, predictable update behavior for workloads that can tolerate downtime&lt;/li>
&lt;li>Support use cases like CI/CD environments, stateless applications, external storage applications, and LeaderWorkerSet patterns&lt;/li>
&lt;li>Add &lt;code>Progressing&lt;/code> state condition to &lt;code>StatefulSet&lt;/code> status for all strategies&lt;/li>
&lt;/ol>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ol>
&lt;li>Change default behavior of StatefulSet updates (opt-in via explicit &lt;code>type: Recreate&lt;/code> configuration)&lt;/li>
&lt;li>Add timeout-based progressive failure detection (use Recreate for simplicity)&lt;/li>
&lt;li>Change Recreate deletion semantics (all pods are always deleted simultaneously but recreate ordering follows &lt;code>podManagementPolicy&lt;/code>)&lt;/li>
&lt;li>Replace Deployment-style revision management (StatefulSets continue to directly manage Pods)&lt;/li>
&lt;/ol>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;h4 id="story-1-cicd-platform-team">Story 1: CI/CD Platform Team&lt;/h4>
&lt;p>&lt;strong>Context&lt;/strong>: A platform team manages hundreds of StatefulSet deployments across development and staging environments. Their CI/CD system requires end-to-end automation, but StatefulSet rolling updates break automation when pods get stuck. The team either has to implement custom &amp;ldquo;garbage collection&amp;rdquo; logic or accept that automated deployments will fail and require manual intervention. Since these are non-production environments, downtime during updates is acceptable.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: With &lt;code>updateStrategy: type: Recreate&lt;/code> configured, when an update with incorrect configuration is applied, all pods are deleted and new pods are created. If they fail, the deployment fails quickly and clearly. When a corrected configuration is applied, the Recreate strategy deletes all broken pods and creates fresh ones, allowing the CI/CD pipeline to complete without manual intervention. The downtime is acceptable in CI/CD environments where fast, automated recovery is more important than uptime.&lt;/p>
&lt;h4 id="story-2-stateless-web-application">Story 2: Stateless Web Application&lt;/h4>
&lt;p>&lt;strong>Context&lt;/strong>: A web application uses StatefulSet for predictable pod naming but doesn&amp;rsquo;t store critical data locally. When resource limit typos cause pods to get stuck in Pending state, the entire update halts even though pod replacement is safe. The application can tolerate brief downtime during updates.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: With &lt;code>updateStrategy: type: Recreate&lt;/code> configured, when an update encounters issues, all pods are deleted and recreated cleanly. This eliminates the need for manual pod deletion since stuck pods are automatically cleared. The brief downtime is acceptable for this stateless application that primarily uses StatefulSet for pod identity rather than stateful semantics.&lt;/p>
&lt;h4 id="story-3-developmentexperiment-environment">Story 3: Development/Experiment Environment&lt;/h4>
&lt;p>&lt;strong>Context&lt;/strong>: Developers using StatefulSet for experiments face constant frustration - every time a rolling update breaks due to configuration errors, they must manually delete stuck pods after applying fixes. This manual intervention disrupts the development workflow. Uptime is not a concern in development environments.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: With &lt;code>updateStrategy: type: Recreate&lt;/code> configured, developers get fast, clean resets - when an update fails, applying a fix automatically deletes all broken pods and creates fresh ones. This enables a smoother development experience without requiring cluster operator intervention or manual pod cleanup. The Recreate strategy&amp;rsquo;s simplicity makes it ideal for rapid iteration in development.&lt;/p>
&lt;h4 id="story-4-external-data-storage">Story 4: External Data Storage&lt;/h4>
&lt;p>&lt;strong>Context&lt;/strong>: A database application stores all persistent data on network-attached storage (not local pod storage). Pod replacement is completely safe since no local data would be lost, but the StatefulSet controller treats it as a traditional stateful workload and requires manual intervention. The application can tolerate brief downtime for clean updates.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: With &lt;code>updateStrategy: type: Recreate&lt;/code> configured, the controller automatically deletes and recreates all pods during updates, which is safe for this architecture since all data persists externally. The Recreate strategy provides clean, predictable updates without concerns about stuck pods, and the brief downtime is acceptable given the data safety guarantees from external storage.&lt;/p>
&lt;h4 id="story-5-leaderworkerset-lws-use-case">Story 5: LeaderWorkerSet (LWS) Use Case&lt;/h4>
&lt;p>&lt;strong>Context&lt;/strong>: Developers use StatefulSet as the high-level controller workload for &lt;a href="https://github.com/kubernetes-sigs/lws"
target="_blank" rel="noopener">LWS&lt;/a>
. However, it behaves more like a Deployment - there&amp;rsquo;s no ordering dependency between different replicas. They only need the ordinal index for pod identification. When a replica fails during updates, the entire StatefulSet update gets stuck, even though there&amp;rsquo;s no actual ordering requirement between replicas. The LeaderWorkerSet pattern can tolerate brief downtime for updates.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: With &lt;code>updateStrategy: type: Recreate&lt;/code> configured, all replicas are cleanly deleted and recreated during updates, eliminating stuck pod scenarios entirely. This aligns perfectly with the deployment-like nature of LWS workloads, providing simple and predictable updates for applications that use StatefulSet primarily for pod identity rather than traditional stateful semantics. The Recreate strategy&amp;rsquo;s &amp;ldquo;all or nothing&amp;rdquo; approach matches the LWS pattern where all workers restart together.&lt;/p>
&lt;h3 id="notesconstraintscaveats">Notes/Constraints/Caveats&lt;/h3>
&lt;ul>
&lt;li>Strategy Type Change Does Not Trigger Rollout: changing only &lt;code>.spec.updateStrategy.type&lt;/code> from &lt;code>RollingUpdate&lt;/code> to &lt;code>Recreate&lt;/code> (or vice versa) does not trigger a new rollout. This is consistent with Deployment behavior. The StatefulSet controller uses the &lt;code>controller-revision-hash&lt;/code> label to identify pod revisions, which is computed from &lt;code>.spec.template&lt;/code> content only.&lt;/li>
&lt;/ul>
&lt;p>The Recreate behavior will only be triggered when users either:&lt;/p>
&lt;ol>
&lt;li>Make a change to &lt;code>.spec.template&lt;/code>&lt;/li>
&lt;li>Force a rollout using &lt;code>kubectl rollout restart&lt;/code>&lt;/li>
&lt;/ol>
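&lt;p>The following Go sketch illustrates why a strategy-only change produces no new revision. It is a simplified illustration, not the controller&amp;rsquo;s code: the &lt;code>PodTemplate&lt;/code> and &lt;code>StatefulSetSpec&lt;/code> types, the &lt;code>revisionHash&lt;/code> helper, and the choice of FNV over JSON are all assumptions made for this example. The only property taken from the note above is that the hash input is the pod template alone.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// PodTemplate and StatefulSetSpec are hypothetical, heavily reduced
// stand-ins for the real API types.
type PodTemplate struct {
	Image string `json:"image"`
}

type StatefulSetSpec struct {
	UpdateStrategyType string
	Template           PodTemplate
}

// revisionHash hashes the pod template only. The update strategy is
// deliberately left out of the input. FNV over JSON is an assumption of
// this sketch, not the hashing scheme used by the real controller.
func revisionHash(spec StatefulSetSpec) string {
	data, err := json.Marshal(spec.Template)
	if err != nil {
		panic(err) // not reachable for this plain struct
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%x", h.Sum32())
}

func main() {
	base := StatefulSetSpec{UpdateStrategyType: "RollingUpdate", Template: PodTemplate{Image: "myapp:v1"}}

	// Only the strategy type changes: the revision stays the same.
	strategyOnly := base
	strategyOnly.UpdateStrategyType = "Recreate"

	// The template changes: a new revision is produced.
	templateChanged := strategyOnly
	templateChanged.Template.Image = "myapp:v2"

	fmt.Println(revisionHash(base) == revisionHash(strategyOnly))    // true
	fmt.Println(revisionHash(base) == revisionHash(templateChanged)) // false
}
```

&lt;p>Because the pods already carry the label for the unchanged revision, the controller sees nothing to update until the template itself changes or a restart is forced.&lt;/p>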
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="risk-unintended-data-loss">Risk: Unintended Data Loss&lt;/h4>
&lt;p>&lt;strong>Risk Description&lt;/strong>: If the &lt;code>Recreate&lt;/code> strategy is used on StatefulSets that keep local persistent data or use PersistentVolumeClaims, the resulting downtime could affect applications that expect sequential updates. However, data on PVCs is preserved because Recreate deletes only pods, not volumes.&lt;/p>
&lt;p>&lt;strong>Mitigation Strategies&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>Documentation: Clear guidance on when to use &lt;code>Recreate&lt;/code> strategy - suitable for workloads that can tolerate downtime&lt;/li>
&lt;li>No Default Change: Opt-in behavior - existing workloads continue using safe &lt;code>RollingUpdate&lt;/code> (current behavior unchanged)&lt;/li>
&lt;li>Explicit Strategy Selection: Users must explicitly set &lt;code>type: Recreate&lt;/code>, preventing accidental usage&lt;/li>
&lt;li>Clear Events: Events emitted during the recreate process to show deletion and recreation phases&lt;/li>
&lt;li>Status Conditions: StatefulSet status clearly reflects the recreate process state&lt;/li>
&lt;li>PVC Preservation: PersistentVolumeClaims are not deleted, so data on volumes persists across recreate operations&lt;/li>
&lt;/ol>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="detailed-algorithm-specification">Detailed Algorithm Specification&lt;/h3>
&lt;h4 id="current-rollingupdate-algorithm">Current RollingUpdate Algorithm&lt;/h4>
&lt;pre tabindex="0">&lt;code>FOR i = replicas-1 DOWN TO 0 DO
    IF pod[i] needs update THEN
        wait_for_predecessors_ready(i+1 to replicas-1)
        IF !pod[i].Running OR !pod[i].Ready THEN
            return // STUCK - wait for manual intervention
        ENDIF
        update_pod(i)
        wait_until_ready(pod[i])
    ENDIF
ENDFOR
&lt;/code>&lt;/pre>&lt;p>The algorithm halts when &lt;code>pod[i]&lt;/code> is not Running or not Ready, and it stays halted even after a fix is applied.&lt;/p>
&lt;h4 id="proposed-recreate-strategy-algorithm">Proposed Recreate Strategy Algorithm&lt;/h4>
&lt;pre tabindex="0">&lt;code>// Recreate Strategy Algorithm
// Uses controller-revision-hash label to identify pod revision (same as RollingUpdate)
// updateRevision = hash of current spec.template (computed by controller)
current_phase = determine_phase()

IF current_phase == &amp;#34;NeedsDeletion&amp;#34; THEN
    // Phase 1: Delete all pods with old revision
    emit_event(&amp;#34;RecreateStarted&amp;#34;, &amp;#34;Deleting all pods for Recreate update&amp;#34;)
    set_condition(&amp;#34;Progressing&amp;#34;, status=&amp;#34;True&amp;#34;, reason=&amp;#34;RecreateInProgress&amp;#34;)
    // Delete ALL pods owned by this StatefulSet that have old revision
    // This handles orphaned pods with ordinals &amp;gt;= replicas
    FOR each pod in pods:
        IF pod.Labels[&amp;#34;controller-revision-hash&amp;#34;] != updateRevision THEN
            IF pod.DeletionTimestamp == nil THEN
                delete_pod(pod)
            ENDIF
        ENDIF
    ENDFOR
    return // Reconcile again after deletions are issued
ENDIF

IF current_phase == &amp;#34;WaitingTermination&amp;#34; THEN
    // Phase 2: Wait for all old-revision pods to be fully removed from etcd
    // Controller watches pods and will reconcile when deletions complete
    // Note: Only emit event on first entry to this phase (tracked via condition)
    return
ENDIF

IF current_phase == &amp;#34;ReadyForCreation&amp;#34; THEN
    // Phase 3: Create pods with new revision according to podManagementPolicy
    IF podManagementPolicy == OrderedReady THEN
        // Create in ascending ordinal order; only create the next ordinal when predecessor is Running and Ready
        i = lowest ordinal in [0, replicas-1] such that pod i does not exist
        IF i is defined THEN
            IF i == 0 OR (pod i-1 exists AND is Running and Ready) THEN
                create_pod(i, updateRevision)
            ENDIF
        ENDIF
    ELSE
        // Parallel: create all missing pods at once
        FOR i = 0 TO replicas-1:
            IF pod with ordinal i does not exist THEN
                create_pod(i, updateRevision)
            ENDIF
        ENDFOR
    ENDIF
    return // Reconcile again to check creation progress
ENDIF

IF current_phase == &amp;#34;Complete&amp;#34; THEN
    // All replicas exist with current revision
    set_condition(&amp;#34;Progressing&amp;#34;, status=&amp;#34;True&amp;#34;, reason=&amp;#34;RecreateComplete&amp;#34;)
    return
ENDIF

// Helper: Determine current phase based on pod states
FUNCTION determine_phase():
    pods = get_all_pods_for_statefulset() // All pods owned by this StatefulSet
    old_revision_pods_active = 0 // Old revision, not yet deleted
    old_revision_pods_terminating = 0 // Old revision, has DeletionTimestamp
    new_revision_pods = 0 // Current revision (not terminating)
    FOR each pod in pods:
        IF pod.Labels[&amp;#34;controller-revision-hash&amp;#34;] != updateRevision THEN
            // Pod has old revision
            IF pod.DeletionTimestamp == nil THEN
                old_revision_pods_active++
            ELSE
                old_revision_pods_terminating++
            ENDIF
        ELSE
            // Pod has current revision
            IF pod.DeletionTimestamp == nil THEN
                new_revision_pods++
            ENDIF
            // Note: new revision pods with DeletionTimestamp are ignored
            // (could happen if user manually deleted, will be recreated)
        ENDIF
    ENDFOR
    // Phase 1: Any old-revision pods that haven&amp;#39;t been deleted yet
    IF old_revision_pods_active &amp;gt; 0 THEN
        return &amp;#34;NeedsDeletion&amp;#34;
    ENDIF
    // Phase 2: Old pods are terminating, wait for full removal
    IF old_revision_pods_terminating &amp;gt; 0 THEN
        return &amp;#34;WaitingTermination&amp;#34;
    ENDIF
    // Phase 3: No old pods remain, but we don&amp;#39;t have enough new pods yet
    IF new_revision_pods &amp;lt; replicas THEN
        return &amp;#34;ReadyForCreation&amp;#34;
    ENDIF
    // Phase 4: All replicas exist with current revision
    return &amp;#34;Complete&amp;#34;
END FUNCTION
&lt;/code>&lt;/pre>&lt;p>Key Characteristics:&lt;/p>
&lt;ol>
&lt;li>Uses &lt;code>controller-revision-hash&lt;/code> label (same as RollingUpdate) to identify old vs new pods&lt;/li>
&lt;li>All old-revision pods are fully terminated before any new pods are created&lt;/li>
&lt;li>Guarantees old and new pods never run simultaneously&lt;/li>
&lt;li>Deletes all old-revision pods including orphans with ordinals &amp;gt;= replicas&lt;/li>
&lt;li>Since all pods are forcibly deleted, updates cannot become permanently blocked&lt;/li>
&lt;li>Explicit downtime: Users opt-in knowing there will be unavailability between deletion and creation phases&lt;/li>
&lt;li>Safe to retry deletions and creations on controller restart&lt;/li>
&lt;li>Recreation phase respects &lt;code>podManagementPolicy&lt;/code>&lt;/li>
&lt;/ol>
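&lt;p>The phase detection in the pseudocode above is a pure function of the observed pods, which is what makes it safe to re-run after a controller restart. A minimal Go sketch of that helper follows. The &lt;code>Pod&lt;/code> struct and the &lt;code>determinePhase&lt;/code> signature are assumptions made for this example and do not correspond to existing controller code.&lt;/p>

```go
package main

import "fmt"

// Pod is a hypothetical, reduced pod record for this sketch.
type Pod struct {
	Revision    string // value of the controller-revision-hash label
	Terminating bool   // true when DeletionTimestamp is set
}

// determinePhase mirrors the determine_phase helper from the pseudocode:
// it derives the current phase from pod state alone, so no extra
// bookkeeping has to survive a controller restart.
func determinePhase(pods []Pod, updateRevision string, replicas int) string {
	oldActive, oldTerminating, newActive := 0, 0, 0
	for _, p := range pods {
		switch {
		case p.Revision != updateRevision:
			if p.Terminating {
				oldTerminating++
			} else {
				oldActive++
			}
		case !p.Terminating:
			// Current revision and not being deleted.
			newActive++
		}
	}
	switch {
	case oldActive > 0:
		return "NeedsDeletion"
	case oldTerminating > 0:
		return "WaitingTermination"
	case replicas > newActive:
		return "ReadyForCreation"
	default:
		return "Complete"
	}
}

func main() {
	const rev = "v2"
	steps := [][]Pod{
		{{Revision: "v1"}, {Revision: "v1"}},                                       // old pods still active
		{{Revision: "v1", Terminating: true}, {Revision: "v1", Terminating: true}}, // deletions issued
		{},                                   // all old pods are gone
		{{Revision: "v2"}, {Revision: "v2"}}, // all replicas recreated
	}
	for _, pods := range steps {
		fmt.Println(determinePhase(pods, rev, 2))
	}
}
```

&lt;p>Run as written, the program walks through the four phases in order: NeedsDeletion, WaitingTermination, ReadyForCreation, Complete.&lt;/p>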
&lt;h3 id="api-changes">API Changes&lt;/h3>
&lt;h4 id="spec-changes">Spec Changes&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// StatefulSetUpdateStrategyType is a string enumeration type that represents the update strategy type for StatefulSets&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> StatefulSetUpdateStrategyType &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// RollingUpdateStatefulSetStrategyType indicates that pods in a StatefulSet will be updated in reverse ordinal order&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = &lt;span style="color:#b44">&amp;#34;RollingUpdate&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// OnDeleteStatefulSetStrategyType indicates that pods in a StatefulSet will only be updated when manually deleted&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = &lt;span style="color:#b44">&amp;#34;OnDelete&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// RecreateStatefulSetStrategyType indicates that all pods will be fully terminated before new ones are created&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RecreateStatefulSetStrategyType StatefulSetUpdateStrategyType = &lt;span style="color:#b44">&amp;#34;Recreate&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Example Usage:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>apps/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>StatefulSet&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>web&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">replicas&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">updateStrategy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Recreate&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>nginx&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">        &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>nginx:1.14.2&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Behavior&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>When update is triggered (e.g., template change):
&lt;ol>
&lt;li>All pods (web-0 through web-9) are deleted simultaneously&lt;/li>
&lt;li>Controller waits for all pods to fully terminate&lt;/li>
&lt;li>All new pods (web-0 through web-9) are created according to the StatefulSet&amp;rsquo;s &lt;code>.spec.podManagementPolicy&lt;/code>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>Downtime occurs between deletion and recreation phases&lt;/li>
&lt;li>No stuck pod scenarios - all pods are forcibly deleted&lt;/li>
&lt;/ul>
&lt;h4 id="status-changes">Status Changes&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// StatefulSetConditionType describes the condition types&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> StatefulSetConditionType &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Progress for a StatefulSet is considered when a new pod is created, deleted, or becomes ready.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> StatefulSetProgressing StatefulSetConditionType = &lt;span style="color:#b44">&amp;#34;Progressing&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> StatefulSetAvailable StatefulSetConditionType = &lt;span style="color:#b44">&amp;#34;Available&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="implementation-changes">Implementation Changes&lt;/h3>
&lt;p>The implementation requires changes to the StatefulSet controller in &lt;code>pkg/controller/statefulset/stateful_set_control.go&lt;/code>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Strategy Type Handling:&lt;/p>
&lt;ul>
&lt;li>Add new case for &lt;code>RecreateStatefulSetStrategyType&lt;/code> in update strategy switch statement&lt;/li>
&lt;li>Implement separate update path for Recreate strategy alongside existing RollingUpdate and OnDelete paths&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Recreate Update Logic:&lt;/p>
&lt;ul>
&lt;li>Phase 1 - Deletion: Iterate through all pods and delete them (similar to scale-down operation)&lt;/li>
&lt;li>Phase 2 - Wait for Termination: Check all pods for &lt;code>deletionTimestamp&lt;/code>; reconcile periodically until all pods are fully terminated&lt;/li>
&lt;li>Phase 3 - Recreation: Create all new pods according to &lt;code>spec.podManagementPolicy&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Status Condition Management:&lt;/p>
&lt;ul>
&lt;li>Add &lt;code>Progressing&lt;/code> condition to StatefulSet status&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Validation:&lt;/p>
&lt;ul>
&lt;li>API validation in &lt;code>pkg/apis/apps/validation/validation.go&lt;/code>&lt;/li>
&lt;li>Validate &lt;code>type: Recreate&lt;/code> can be set on StatefulSet&lt;/li>
&lt;li>No additional fields required for Recreate strategy (unlike RollingUpdate which has partition, maxUnavailable)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Respect Ordering Semantics:&lt;/p>
&lt;ul>
&lt;li>Deletion ignores ordering: all pods are deleted at once&lt;/li>
&lt;li>Re-creation follows the &lt;code>podManagementPolicy&lt;/code> setting&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="comparison-with-existing-solutions">Comparison with Existing Solutions&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Solution&lt;/th>
&lt;th>Sequential Ordering&lt;/th>
&lt;th>Automatic Recovery&lt;/th>
&lt;th>Downtime&lt;/th>
&lt;th>Behavior When Pod Stuck&lt;/th>
&lt;th>Use Case&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>RollingUpdate&lt;/code> (default)&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Halts completely, waits forever&lt;/td>
&lt;td>Traditional stateful apps&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>RollingUpdate&lt;/code> + &lt;code>maxUnavailable&lt;/code>&lt;/td>
&lt;td>Yes (batched)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
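&lt;td>&lt;strong>Still halts completely&lt;/strong> with &lt;code>OrderedReady&lt;/code>; with &lt;code>Parallel&lt;/code>, stuck pods are never replaced&lt;/td>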
&lt;td>Faster updates, but same stuck problem&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>OnDelete&lt;/code>&lt;/td>
&lt;td>Yes (manual)&lt;/td>
&lt;td>No&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Fully manual control&lt;/td>
&lt;td>Maximum safety/control&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>&lt;code>Recreate&lt;/code> (proposed)&lt;/strong>&lt;/td>
&lt;td>No&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>Yes&lt;/td>
&lt;td>All pods deleted and recreated&lt;/td>
&lt;td>CI/CD, stateless apps, external storage, LWS&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>pkg/apis/apps/validation/validation.go&lt;/code>: &lt;code>2025-10-13&lt;/code> - &lt;code>92.8%&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/controller/statefulset/stateful_set_control.go&lt;/code>: &lt;code>2025-10-13&lt;/code> - &lt;code>91.5%&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/controller/statefulset/stateful_pod_control.go&lt;/code>: &lt;code>2025-10-13&lt;/code> - &lt;code>89.6%&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/registry/apps/statefulset/strategy.go&lt;/code>: &lt;code>2025-10-13&lt;/code> - &lt;code>83.9%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>Integration tests should cover the following scenarios:&lt;/p>
&lt;ul>
&lt;li>Without &lt;code>type: Recreate&lt;/code>: Existing StatefulSets with &lt;code>RollingUpdate&lt;/code> and &lt;code>OnDelete&lt;/code> continue to work unchanged (backward compatibility)&lt;/li>
&lt;li>With &lt;code>type: Recreate&lt;/code> configured:
&lt;ul>
&lt;li>All pods are deleted when update is triggered (template spec change)&lt;/li>
&lt;li>Controller waits for all pods to fully terminate (no pods with deletionTimestamp remain)&lt;/li>
&lt;li>All new pods are created after termination complete&lt;/li>
&lt;li>Status condition &lt;code>Progressing=True&lt;/code> with &lt;code>reason=RecreateInProgress&lt;/code> while the update is in progress&lt;/li>
&lt;li>Status condition &lt;code>Progressing=True&lt;/code> with &lt;code>reason=RecreateComplete&lt;/code> after pods created&lt;/li>
&lt;li>Recreate strategy respects &lt;code>podManagementPolicy&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>PVC preservation: PersistentVolumeClaims are not deleted during Recreate (only pods are deleted)&lt;/li>
&lt;li>Stuck pod handling: Pods stuck in any state (ImagePullBackOff, Pending, CrashLoopBackOff, etc.) are forcibly deleted&lt;/li>
&lt;li>Validation: API validation accepts &lt;code>type: Recreate&lt;/code> on StatefulSet&lt;/li>
&lt;li>For alpha, add a test to verify that users cannot switch strategies from Recreate to RollingUpdate or OnDelete. For beta, add a test to verify that users can switch strategies.&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>The following e2e tests will be added to &lt;code>test/e2e/apps/statefulset.go&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>StatefulSet with &lt;code>type: Recreate&lt;/code> successfully deletes and recreates all pods during update&lt;/li>
&lt;li>Recreate works with stuck pods (ImagePullBackOff scenario - pods are deleted and new ones created)&lt;/li>
&lt;li>Recreate waits for full termination before creating new pods (no mixed old/new state)&lt;/li>
&lt;li>Recreate preserves PersistentVolumeClaims (data persists across recreation)&lt;/li>
&lt;li>Recreate respects &lt;code>podManagementPolicy&lt;/code> during recreation&lt;/li>
&lt;li>StatefulSets without &lt;code>type: Recreate&lt;/code> maintain current RollingUpdate/OnDelete behavior (backward compatibility)&lt;/li>
&lt;li>Controller restart during Recreate resumes correctly from last phase&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature flag.&lt;/li>
&lt;li>Unit and integration tests pass as described in the &lt;a href="#test-plan"
>Test Plan&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Feature is enabled by default&lt;/li>
&lt;li>Address reviews and bug reports from Alpha users&lt;/li>
&lt;li>Users are able to switch strategies from Recreate to RollingUpdate or OnDelete&lt;/li>
&lt;li>e2e tests:
&lt;ul>
&lt;li>Add links to testgrid results&lt;/li>
&lt;li>Verify zero flakes over 2+ weeks&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>No negative feedback from developers.&lt;/li>
&lt;li>Consider adding a conformance test if the feature becomes widely adopted and part of the core contract&lt;/li>
&lt;li>Ensure existing conformance tests for basic RollingUpdate continue to pass&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;h4 id="upgrade">Upgrade&lt;/h4>
&lt;p>This feature is protected by the feature gate &lt;code>StatefulSetRecreateStrategy&lt;/code>, which must be enabled on both &lt;code>kube-apiserver&lt;/code> and &lt;code>kube-controller-manager&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Component Dependencies&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>kube-apiserver: Validates and persists the &lt;code>type: Recreate&lt;/code> strategy in the StatefulSet spec&lt;/li>
&lt;li>kube-controller-manager: Implements the Recreate strategy logic (delete all, wait for termination, create all)&lt;/li>
&lt;/ul>
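&lt;p>The three phases named above (delete all, wait for termination, create all) can be sketched as a small state machine. This is a minimal Python simulation under stated assumptions, not the controller code itself: &lt;code>Pod&lt;/code>, &lt;code>reconcile_recreate&lt;/code>, the &lt;code>revision&lt;/code> and &lt;code>terminating&lt;/code> fields, and the phase names are hypothetical, and the sketch creates all pods in one pass rather than modelling &lt;code>podManagementPolicy&lt;/code> or scaling.&lt;/p>

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    revision: str              # hypothetical stand-in for the pod template revision
    terminating: bool = False  # models a pod that has deletionTimestamp set


def reconcile_recreate(pods, replicas, target_revision, set_name="web"):
    """Run one simulated reconcile pass and return (pods, phase)."""
    stale = [p for p in pods if p.revision != target_revision]
    if any(not p.terminating for p in stale):
        # Phase 1: mark every old-revision pod for deletion in a single pass.
        for p in stale:
            p.terminating = True
        return pods, "deleting"
    if stale:
        # Phase 2: old pods are still terminating, so nothing is created yet.
        return pods, "waiting"
    have = {p.name for p in pods}
    missing = [f"{set_name}-{i}" for i in range(replicas)
               if f"{set_name}-{i}" not in have]
    if missing:
        # Phase 3: no old pods remain, so create the pods at the new revision.
        pods.extend(Pod(name, target_revision) for name in missing)
        return pods, "creating"
    return pods, "complete"
```

&lt;p>Calling &lt;code>reconcile_recreate&lt;/code> repeatedly, and dropping terminating pods between calls to imitate completed termination, walks through the phases in order. No new pod appears while any old pod still exists, which is the "no mixed old/new state" property the tests above check for.&lt;/p>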
&lt;p>&lt;strong>Upgrade Sequence&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Enable feature gate on kube-apiserver first&lt;/li>
&lt;li>Enable feature gate on kube-controller-manager&lt;/li>
&lt;li>Create/update StatefulSets with &lt;code>updateStrategy.type: Recreate&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Partial Upgrade Behavior&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If apiserver has feature enabled but kube-controller-manager does not:&lt;/p>
&lt;ul>
&lt;li>API server accepts &lt;code>type: Recreate&lt;/code> strategy&lt;/li>
&lt;li>Strategy type is persisted in etcd&lt;/li>
&lt;li>Kube-controller-manager ignores Recreate type and falls back to default RollingUpdate behavior&lt;/li>
&lt;li>No errors, but Recreate behavior is not active&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>If apiserver does NOT have feature enabled but kube-controller-manager does:&lt;/p>
&lt;ul>
&lt;li>API server rejects create/update requests that set &lt;code>type: Recreate&lt;/code> with a validation error&lt;/li>
&lt;li>Users cannot create or switch to Recreate until the apiserver has the feature enabled.&lt;/li>
&lt;li>Kube-controller-manager cannot process Recreate in this skew because no &lt;code>StatefulSet&lt;/code> with &lt;code>type: Recreate&lt;/code> can be stored.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>Enable the feature gate on &lt;code>kube-apiserver&lt;/code> first, then on &lt;code>kube-controller-manager&lt;/code>, to ensure a smooth transition.&lt;/p>&lt;/blockquote>
&lt;h4 id="downgrade">Downgrade&lt;/h4>
&lt;ul>
&lt;li>The older apiserver does not recognize &lt;code>type: Recreate&lt;/code> and will reject create/update requests that set it.&lt;/li>
&lt;li>StatefulSets that already have &lt;code>type: Recreate&lt;/code> stored in etcd remain stored, but any update that touches the spec may be rejected unless the strategy is changed back to RollingUpdate/OnDelete first&lt;/li>
&lt;li>The controller in the older version ignores Recreate and behaves as RollingUpdate for those existing objects&lt;/li>
&lt;/ul>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>This feature has dependencies between control plane components.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>kube-apiserver v1.xx+1 (feature enabled) and kube-controller-manager v1.xx (no feature)&lt;/p>
&lt;ul>
&lt;li>API accepts &lt;code>type: Recreate&lt;/code>, controller ignores it&lt;/li>
&lt;li>StatefulSets fall back to default RollingUpdate behavior&lt;/li>
&lt;li>StatefulSets are functional, just without Recreate strategy feature&lt;/li>
&lt;li>No errors or warnings&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>kube-apiserver v1.xx (no feature) and kube-controller-manager v1.xx+1 (feature enabled)&lt;/p>
&lt;ul>
&lt;li>API server rejects create/update requests that set &lt;code>type: Recreate&lt;/code> with a validation error&lt;/li>
&lt;li>Users cannot create or update StatefulSets to use Recreate until apiserver is upgraded and the feature is enabled&lt;/li>
&lt;li>Enable the feature on kube-apiserver first, then on kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Mixed control plane during rolling upgrade&lt;/p>
&lt;ul>
&lt;li>During a control plane upgrade, apiservers and controller-managers may run different versions, and the feature may be enabled on some instances and disabled on others. The behavior depends on the version of the active kube-controller-manager leader:
&lt;ul>
&lt;li>If leader has feature enabled: Recreate strategy is processed correctly&lt;/li>
&lt;li>If leader has feature disabled: Recreate strategy is ignored, falls back to RollingUpdate behavior&lt;/li>
&lt;li>Leader may change during upgrade, causing behavior to switch between Recreate and RollingUpdate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
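&lt;p>The skew cases above reduce to a small decision table. The following Python sketch encodes it for a StatefulSet that sets &lt;code>type: Recreate&lt;/code>. The function name &lt;code>recreate_outcome&lt;/code> and its return strings are hypothetical; in a mixed control plane, &lt;code>controller_gate&lt;/code> stands for the gate state of the current kube-controller-manager leader.&lt;/p>

```python
def recreate_outcome(apiserver_gate, controller_gate):
    """Describe what happens to a new write that sets type: Recreate."""
    if not apiserver_gate:
        # Validation fails, so the object is never stored and the
        # controller never sees it, whatever its own gate state is.
        return "rejected by API validation"
    if controller_gate:
        return "processed as Recreate"
    # Accepted and persisted, but the controller ignores the strategy type.
    return "falls back to RollingUpdate"


# Print every combination of gate states as a table.
for api in (True, False):
    for ctrl in (True, False):
        print(f"apiserver={api!s:5} controller={ctrl!s:5} {recreate_outcome(api, ctrl)}")
```

&lt;p>Only the combination with both gates enabled yields Recreate behavior, which is why the recommended order is kube-apiserver first and kube-controller-manager second.&lt;/p>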
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: StatefulSetRecreateStrategy&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver&lt;/li>
&lt;li>kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. Enabling the &lt;code>StatefulSetRecreateStrategy&lt;/code> feature gate does not change any default behavior.&lt;/p>
&lt;p>The &lt;code>type: Recreate&lt;/code> strategy is &lt;strong>opt-in&lt;/strong>. When not explicitly set:&lt;/p>
&lt;ul>
&lt;li>StatefulSets behave exactly as they do today (default &lt;code>RollingUpdate&lt;/code> behavior)&lt;/li>
&lt;li>All existing StatefulSet update strategies continue to work unchanged&lt;/li>
&lt;/ul>
&lt;p>The feature only activates when users explicitly configure &lt;code>spec.updateStrategy.type: Recreate&lt;/code> in their StatefulSet spec.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes, the feature can be disabled.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The feature works normally again. StatefulSets with &lt;code>type: Recreate&lt;/code> in their spec will immediately start using Recreate behavior for the next update.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Not yet. Unit and integration tests will be added to cover feature gate enablement/disablement scenarios.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>&lt;strong>Rollout Failures:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>If apiserver and controller-manager have different feature gate states, &lt;code>type: Recreate&lt;/code> may be accepted but ignored (falls back to RollingUpdate)&lt;/li>
&lt;li>API validation accepts &lt;code>type: Recreate&lt;/code> as a valid strategy type (no complex validation needed)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Rollback Failures:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>If the strategy type was not changed back, StatefulSets with &lt;code>type: Recreate&lt;/code> will fall back to RollingUpdate behavior and Recreate behavior will be ignored.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Impact on Running Workloads:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>No impact on StatefulSets without &lt;code>type: Recreate&lt;/code>&lt;/li>
&lt;li>StatefulSets with &lt;code>type: Recreate&lt;/code> will experience downtime during updates (i.e. all pods are deleted before new ones are created)&lt;/li>
&lt;/ul>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>&lt;code>statefulset_unavailable_replicas&lt;/code> shows how many StatefulSet replicas are unavailable&lt;/li>
&lt;li>&lt;code>workqueue_depth{name=&amp;quot;statefulset&amp;quot;}&lt;/code> shows the current depth of the StatefulSet controller queue&lt;/li>
&lt;li>&lt;code>workqueue_queue_duration_seconds{name=&amp;quot;statefulset&amp;quot;}&lt;/code> shows how long items wait in queue before processing&lt;/li>
&lt;li>&lt;code>workqueue_retries_total{name=&amp;quot;statefulset&amp;quot;}&lt;/code> shows retry counts which may indicate processing failures&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Not yet. Tests will be added to cover upgrade and rollback scenarios.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No. This feature adds a new strategy type &lt;code>Recreate&lt;/code> to &lt;code>spec.updateStrategy.type&lt;/code>. It does not deprecate any existing fields or APIs, and it does not remove any existing functionality.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;ul>
&lt;li>By querying StatefulSets using kubectl:&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get statefulsets -A -o json | &lt;span style="color:#b62;font-weight:bold">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b62;font-weight:bold">&lt;/span> jq &lt;span style="color:#b44">&amp;#39;.items[] | select(.spec.updateStrategy.type == &amp;#34;Recreate&amp;#34;) |
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> {namespace: .metadata.namespace, name: .metadata.name, strategy: .spec.updateStrategy.type}&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>By checking StatefulSet status conditions:&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get statefulsets -A -o json | &lt;span style="color:#b62;font-weight:bold">\
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b62;font-weight:bold">&lt;/span> jq &lt;span style="color:#b44">&amp;#39;.items[] | select(.status.conditions[]? | select(.type==&amp;#34;Progressing&amp;#34;))&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name: &lt;code>Progressing&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics (existing metrics &lt;a href="https://github.com/kubernetes/kube-state-metrics/blob/release-2.18/docs/metrics/workload/statefulset-metrics.md?plain=1"
target="_blank" rel="noopener">kube-state-metrics&lt;/a>
)
&lt;ul>
&lt;li>&lt;code>kube_statefulset_replicas&lt;/code>&lt;/li>
&lt;li>&lt;code>kube_statefulset_status_replicas_ready&lt;/code>&lt;/li>
&lt;li>&lt;code>kube_statefulset_status_replicas_current&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;ul>
&lt;li>100% of StatefulSets without &lt;code>type: Recreate&lt;/code> behave identically to pre-feature behavior&lt;/li>
&lt;li>99% of Recreate updates complete within (pod termination time + pod startup time + 30s)&lt;/li>
&lt;li>0% of pods are left in mixed old/new spec states after Recreate update&lt;/li>
&lt;/ul>
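&lt;p>The second SLO above can be checked against observed update durations. The Python sketch below is one way to do that. The function &lt;code>recreate_slo_met&lt;/code> and its parameters are hypothetical, and it assumes a single termination time and startup time apply to every update in the sample, which real workloads may not satisfy.&lt;/p>

```python
def recreate_slo_met(durations_s, termination_s, startup_s,
                     slack_s=30.0, target=0.99):
    """Return True if at least `target` of updates finished within the bound.

    The bound is termination time + startup time + 30s of slack, taken
    from the SLO wording above. All values are in seconds.
    """
    if not durations_s:
        return True  # no updates observed, so nothing violated the SLO
    bound = termination_s + startup_s + slack_s
    within = sum(1 for d in durations_s if bound >= d)
    return within / len(durations_s) >= target
```

&lt;p>For example, with a 30s termination time and a 40s startup time the bound is 100s, so a sample of 100 updates may contain at most one update slower than that and still meet the 99% target.&lt;/p>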
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics (existing metrics &lt;a href="https://github.com/kubernetes/kube-state-metrics/blob/release-2.18/docs/metrics/workload/statefulset-metrics.md?plain=1"
target="_blank" rel="noopener">kube-state-metrics&lt;/a>
)
&lt;ul>
&lt;li>Metric(s) name:
&lt;ul>
&lt;li>&lt;code>kube_statefulset_status_replicas_available&lt;/code>&lt;/li>
&lt;li>&lt;code>kube_statefulset_status_replicas_ready&lt;/code>&lt;/li>
&lt;li>&lt;code>kube_statefulset_status_replicas_current&lt;/code>&lt;/li>
&lt;li>Components exposing the metric: kube-state-metrics&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Metric name:
&lt;ul>
&lt;li>&lt;code>statefulset_unavailable_replicas&lt;/code>&lt;/li>
&lt;li>Components exposing the metric: kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>These metrics reflect the StatefulSet &lt;code>.status&lt;/code> (availableReplicas, readyReplicas, currentReplicas). They have labels &lt;code>statefulset&lt;/code> and &lt;code>namespace&lt;/code>, so operators can filter by StatefulSet to monitor a specific StatefulSet during Recreate&lt;/li>
&lt;li>During Recreate updates, the values show the transition from all pods deleted (0 available) to all new pods created and ready&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No. The existing StatefulSet metrics provide sufficient observability for the Recreate strategy.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No new types of API calls. If the feature gate is enabled but no StatefulSet uses &lt;code>type: Recreate&lt;/code>, then no additional API calls occur.&lt;/p>
&lt;p>When Recreate strategy is used during an update, the following existing API call types are made:&lt;/p>
&lt;ul>
&lt;li>Pod Deletion (DELETE /api/v1/namespaces/{ns}/pods/{name})&lt;/li>
&lt;li>Pod Creation (POST /api/v1/namespaces/{ns}/pods)&lt;/li>
&lt;li>StatefulSet Status Update (PUT /apis/apps/v1/namespaces/{ns}/statefulsets/{name}/status)&lt;/li>
&lt;li>Event Creation (POST /api/v1/namespaces/{ns}/events)&lt;/li>
&lt;/ul>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No. A new strategy type &lt;code>Recreate&lt;/code> is added to the existing &lt;code>StatefulSetUpdateStrategyType&lt;/code> enum, but no new API types are introduced.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Yes, minor increases in size when &lt;code>type: Recreate&lt;/code> is used.&lt;/p>
&lt;p>Per StatefulSet using Recreate strategy:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Spec&lt;/strong>: ~8 bytes (strategy type enum value: &amp;ldquo;Recreate&amp;rdquo;)&lt;/li>
&lt;li>&lt;strong>Status&lt;/strong>: ~150-200 bytes when Progressing condition is active&lt;/li>
&lt;li>&lt;strong>Total&lt;/strong>: ~160-210 bytes per StatefulSet&lt;/li>
&lt;/ul>
&lt;p>For a cluster with 1000 StatefulSets using Recreate strategy:&lt;/p>
&lt;ul>
&lt;li>Total increase: ~160-210 KB&lt;/li>
&lt;li>Impact: Negligible compared to typical etcd usage (multi-GB scale)&lt;/li>
&lt;/ul>
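&lt;p>The totals above follow from simple arithmetic on the per-object estimates. The short Python sketch below reproduces them. The constant and function names are illustrative only, and the byte counts are the estimates stated in this section, not measurements.&lt;/p>

```python
SPEC_BYTES = 8             # the "Recreate" enum value, per the estimate above
STATUS_BYTES = (150, 200)  # Progressing condition, low and high estimate


def added_bytes(statefulsets):
    """Return the (low, high) estimate of extra bytes for that many objects."""
    low, high = STATUS_BYTES
    return (statefulsets * (SPEC_BYTES + low),
            statefulsets * (SPEC_BYTES + high))


per_object = added_bytes(1)   # (158, 208), rounded above to ~160-210 bytes
cluster = added_bytes(1000)   # (158000, 208000), roughly 160-210 KB
```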
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>API Server Operations:&lt;/p>
&lt;ul>
&lt;li>GET/LIST StatefulSets: No impact (strategy type is standard enum field, standard deserialization)&lt;/li>
&lt;li>CREATE/UPDATE StatefulSets: Minimal impact (~10-20μs for validating strategy type enum).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>StatefulSet Controller Reconciliation:&lt;/p>
&lt;ul>
&lt;li>With feature enabled but strategy not set to Recreate: No additional overhead.&lt;/li>
&lt;li>With Recreate strategy: Same overhead as manual pod deletion + creation operations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;ul>
&lt;li>Etcd Operations:
&lt;ul>
&lt;li>Minimal increase in object size when using Recreate strategy (~8 bytes for strategy type enum value + ~150-200 bytes for status conditions when active).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Memory/CPU:
&lt;ul>
&lt;li>Memory (per StatefulSet): ~8 bytes for strategy type enum value.&lt;/li>
&lt;li>CPU: Strategy type comparison on each reconciliation: ~1-2μs (simple string comparison).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Network I/O:
&lt;ul>
&lt;li>An additional ~8 bytes per StatefulSet spec when Recreate strategy is set, and ~150-200 bytes per status update when Progressing condition is active.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No. The feature does not introduce new node resource exhaustion risks beyond those of existing mechanisms.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>The feature behaves similarly to existing controllers, which depend on API server and etcd availability.&lt;/p>
&lt;ul>
&lt;li>API Server Unavailable: StatefulSet controller cannot read/write StatefulSet or Pod objects, so all updates halt.&lt;/li>
&lt;li>etcd Unavailable: Similar to API server unavailability, no state changes can be persisted.&lt;/li>
&lt;/ul>
&lt;p>No special handling is required as this feature only changes the update progression logic, not the fundamental dependency on API server/etcd availability.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;ol>
&lt;li>Examine Metrics (&lt;code>kube_statefulset_status_replicas_available&lt;/code>, &lt;code>kube_statefulset_status_replicas_ready&lt;/code>)
&lt;ul>
&lt;li>If &lt;code>kube_statefulset_status_replicas_available&lt;/code> is stuck at 0 for an extended period → pods may be stuck in termination&lt;/li>
&lt;li>If &lt;code>kube_statefulset_status_replicas_current&lt;/code> is increasing but &lt;code>kube_statefulset_status_replicas_ready&lt;/code> is not → pods may be failing to start&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Check if pods are stuck in termination (long grace periods, finalizers blocking deletion)&lt;/li>
&lt;li>Verify pod startup time is reasonable (image pull, init containers, readiness probes)&lt;/li>
&lt;/ol>
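&lt;p>The metric checks in step 1 can be expressed as a small classifier over sampled values. This Python sketch is illustrative only: the function &lt;code>diagnose&lt;/code> is hypothetical, each argument is a list of samples (oldest first) of the corresponding kube-state-metrics series, and checking the "failing to start" symptom first is an assumption made here, on the reasoning that a rising current-replica count means termination has already finished.&lt;/p>

```python
def diagnose(available, current, ready):
    """Classify a stalled Recreate update from three metric series."""
    if not (available and current and ready):
        return "no data"
    if current[-1] > current[0] and ready[0] >= ready[-1]:
        # New pods are being created but none of them becomes Ready.
        return "pods may be failing to start"
    if max(available) == 0:
        # Nothing has been available for the whole sampling window.
        return "pods may be stuck in termination"
    return "no known symptom matched"
```

&lt;p>Either result is only a hint. Steps 2 and 3 above remain the way to confirm the cause on the affected pods.&lt;/p>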
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2022-09-26: Initial KEP Created&lt;/li>
&lt;li>2025-07-29: Updated the KEP after changing the ownership&lt;/li>
&lt;li>2025-10-13: Pivoted KEP from &lt;code>EnforcedRollingUpdate&lt;/code> strategy to &lt;code>podProgressTimeoutSeconds&lt;/code> field based on sig-apps feedback. This approach better handles transient vs permanent failures and aligns with Deployment semantics.&lt;/li>
&lt;li>2025-12-01: Pivoted KEP from &lt;code>podProgressTimeoutSeconds&lt;/code> to &lt;code>Recreate&lt;/code> strategy based on sig-apps meeting (&lt;a href="https://www.youtube.com/watch?v=W7VuKDvAtjg&amp;amp;list=PL69nYSiGNLP2LMq7vznITnpd2Fk1YIZF3"
target="_blank" rel="noopener">meeting recording&lt;/a>
). Key feedback:
&lt;ul>
&lt;li>Progress deadline seconds in Deployments do not terminate pods, but podProgressTimeoutSeconds proposal would terminate pods&lt;/li>
&lt;li>Deleting/terminating pods based on readiness signals is problematic and disruptive&lt;/li>
&lt;li>Group consensus favored Recreate for simplicity and consistency with existing Kubernetes APIs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>2025-04-06: Updated KEP to change the milestone to 1.37&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h3 id="downtime-requirement">Downtime Requirement&lt;/h3>
&lt;p>The Recreate strategy causes downtime during updates since all pods are deleted before new ones are created:&lt;/p>
&lt;ul>
&lt;li>Service Interruption: Application is completely unavailable during the deletion/recreation window&lt;/li>
&lt;li>Not Suitable for All Workloads: Traditional stateful applications requiring high availability cannot use this strategy&lt;/li>
&lt;li>User Expectation Management: Users must understand and accept downtime implications&lt;/li>
&lt;/ul>
&lt;p>Mitigation:&lt;/p>
&lt;ul>
&lt;li>Clear documentation emphasizing downtime implications&lt;/li>
&lt;li>Explicit opt-in via &lt;code>type: Recreate&lt;/code> (no accidental usage)&lt;/li>
&lt;li>Recommendation to use for appropriate workloads (CI/CD, stateless apps, development environments)&lt;/li>
&lt;/ul>
&lt;h3 id="limited-rollback-options">Limited Rollback Options&lt;/h3>
&lt;p>During a Recreate update, there&amp;rsquo;s no gradual rollback:&lt;/p>
&lt;ul>
&lt;li>If new version has issues, all pods are affected (no gradual detection)&lt;/li>
&lt;li>Cannot compare old vs new pods side-by-side during rollout&lt;/li>
&lt;li>Must wait for full recreation cycle to attempt fixes&lt;/li>
&lt;/ul>
&lt;p>Mitigation:&lt;/p>
&lt;ul>
&lt;li>Clear events and status conditions during Recreate process&lt;/li>
&lt;li>Users can choose RollingUpdate for gradual rollouts where needed&lt;/li>
&lt;li>Quick feedback loop due to fast recreation (all pods start together)&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="alternative-1-podprogresstimeoutseconds-field-in-rollingupdate-strategy">Alternative 1: PodProgressTimeoutSeconds Field in RollingUpdate Strategy&lt;/h3>
&lt;p>Extend the existing &lt;code>RollingUpdate&lt;/code> strategy with a &lt;code>podProgressTimeoutSeconds&lt;/code> field (similar to Deployment&amp;rsquo;s &lt;code>progressDeadlineSeconds&lt;/code>) that allows timeout-based detection of stuck pods.&lt;/p>
&lt;p>API Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">updateStrategy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>RollingUpdate&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">rollingUpdate&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">podProgressTimeoutSeconds&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">600&lt;/span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># Wait 10 minutes per pod&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">maxUnavailable&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Algorithm: For each pod in reverse ordinal order, delete and create new pod, wait for Ready state with timeout. If pod doesn&amp;rsquo;t become Ready within &lt;code>podProgressTimeoutSeconds&lt;/code>, delete and recreate it.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Maintains sequential ordering guarantees&lt;/li>
&lt;li>Distinguishes transient failures (slow image pulls) from permanent failures (misconfig)&lt;/li>
&lt;li>Works with existing &lt;code>maxUnavailable&lt;/code> and &lt;code>partition&lt;/code> fields&lt;/li>
&lt;li>Allows fine-grained control over timeout per workload&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Complexity: Requires tracking per-pod creation timestamps and deadline state across reconciliation loops&lt;/li>
&lt;li>Timeout Configuration Burden: Users must choose appropriate timeout values (too short = unnecessary churn, too long = slow recovery)&lt;/li>
&lt;li>Doesn&amp;rsquo;t Solve All Scenarios: Still blocks on transient issues until timeout expires&lt;/li>
&lt;li>Controller Complexity: Adds significant complexity to StatefulSet controller logic&lt;/li>
&lt;/ul>
&lt;p>Why Not Chosen as Primary Solution: Based on sig-apps meeting feedback (&lt;a href="https://www.youtube.com/watch?v=W7VuKDvAtjg&amp;amp;list=PL69nYSiGNLP2LMq7vznITnpd2Fk1YIZF3"
target="_blank" rel="noopener">meeting link&lt;/a>
), the group favored the simpler Recreate strategy approach. Key concerns raised:&lt;/p>
&lt;ul>
&lt;li>Progress deadline in Deployments does &lt;strong>not&lt;/strong> terminate pods when deadline is reached, but this proposal would&lt;/li>
&lt;li>Using readiness signals to terminate pods is problematic and disruptive&lt;/li>
&lt;li>The timeout-based approach adds complexity that may not be necessary for the primary use cases (CI/CD, stateless apps, external storage)&lt;/li>
&lt;li>Recreate strategy is &amp;ldquo;pretty bare&amp;rdquo; and has direct parallel with Deployment patterns, making it easier to implement and understand&lt;/li>
&lt;/ul>
&lt;h3 id="alternative-2-enforcedrollingupdate-strategy">Alternative 2: EnforcedRollingUpdate Strategy&lt;/h3>
&lt;p>Add a new update strategy type &lt;code>EnforcedRollingUpdate&lt;/code> that immediately deletes and replaces stuck pods without timeout during rolling updates.&lt;/p>
&lt;p>API Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">updateStrategy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>EnforcedRollingUpdate&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">enforcedRollingUpdate&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">maxUnavailable&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Algorithm: When &lt;code>pod[i]&lt;/code> needs update, delete it immediately regardless of current state, create new pod, wait for Ready.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Simpler than timeout-based approach (no deadline tracking)&lt;/li>
&lt;li>Maintains some ordering through sequential updates&lt;/li>
&lt;li>Immediate action on stuck pods&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Cannot distinguish transient from permanent failures (network delays, CI/CD pipeline delays, slow image pulls)&lt;/li>
&lt;li>Still maintains sequential ordering, which adds complexity&lt;/li>
&lt;li>Doesn&amp;rsquo;t solve initial deployment failure, only works when spec changes&lt;/li>
&lt;/ul>
&lt;p>Why Not Chosen: Similar concerns to Alternative 1, but Recreate is even simpler because it removes ordering requirements entirely.&lt;/p>
&lt;h3 id="alternative-3-now-primary-solution-recreate-strategy">Alternative 3 (Now Primary Solution): Recreate Strategy&lt;/h3>
&lt;p>NOTE: This alternative was chosen as the primary solution for this KEP based on sig-apps meeting feedback.&lt;/p>
&lt;p>Add a &lt;code>Recreate&lt;/code> update strategy (matching Deployment&amp;rsquo;s Recreate strategy) that deletes all pods before creating new ones.&lt;/p>
&lt;p>API Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">updateStrategy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Recreate&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Algorithm: Delete all pods, wait for termination, create all new pods according to &lt;code>spec.podManagementPolicy&lt;/code>.&lt;/p>
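&lt;p>The algorithm above can be illustrated with a minimal Go sketch. It is only a simulation under stated assumptions: the &lt;code>Pod&lt;/code> type and the &lt;code>recreate&lt;/code> function are hypothetical names introduced here, not part of the StatefulSet controller, and the waiting steps are represented by log entries rather than real API calls.&lt;/p>

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a StatefulSet pod.
type Pod struct {
	Name     string
	Revision string
}

// recreate sketches the Recreate flow: every existing pod is deleted,
// regardless of its current state, and only then are replacements created.
// It returns the new pods and a log of the actions taken.
func recreate(setName string, replicas int, old []Pod, newRevision string) ([]Pod, []string) {
	var actions []string

	// Step 1: delete all pods at once; no ordering is applied.
	for _, p := range old {
		actions = append(actions, "delete "+p.Name)
	}

	// Step 2: a real controller would block here until all pods have terminated.
	actions = append(actions, "wait for termination")

	// Step 3: create replacements by ordinal. How creation is sequenced would
	// depend on spec.podManagementPolicy; this sketch simply walks the ordinals.
	created := make([]Pod, 0, replicas)
	for i := 0; i != replicas; i++ {
		name := fmt.Sprintf("%s-%d", setName, i)
		created = append(created, Pod{Name: name, Revision: newRevision})
		actions = append(actions, "create "+name)
	}
	return created, actions
}

func main() {
	old := []Pod{{Name: "web-0", Revision: "v1"}, {Name: "web-1", Revision: "v1"}}
	_, actions := recreate("web", 2, old, "v2")
	for _, a := range actions {
		fmt.Println(a)
	}
}
```

&lt;p>The point of the sketch is the ordering of phases: no pod of the new revision exists until every pod of the old revision has been deleted, which is why the strategy implies downtime.&lt;/p>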
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>No complexity around stuck pods or timeout tracking&lt;/li>
&lt;li>All pods deleted before new ones created, guaranteeing clean state&lt;/li>
&lt;li>Simple, predictable behavior aligned with Deployment patterns&lt;/li>
&lt;li>Can quickly replace all pods regardless of their current state&lt;/li>
&lt;li>No need to configure timeouts or tune parameters&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>No ordering during deletion (all pods are deleted at once); ordering during creation depends on &lt;code>podManagementPolicy&lt;/code>&lt;/li>
&lt;li>Not suitable for traditional stateful workloads requiring zero-downtime updates&lt;/li>
&lt;/ul>
&lt;p>Why Chosen as Primary Solution: Based on sig-apps meeting discussion, this approach is:&lt;/p>
&lt;ul>
&lt;li>Simpler to implement and understand (matches existing Deployment Recreate pattern)&lt;/li>
&lt;li>Addresses the primary use cases (CI/CD, stateless apps, external storage, LeaderWorkerSet)&lt;/li>
&lt;li>Avoids concerns about terminating pods based on readiness/timeout signals&lt;/li>
&lt;li>Provides explicit opt-in behavior where users accept downtime for automated recovery&lt;/li>
&lt;/ul>
&lt;h3 id="alternative-4-add-force-flag-to-rollingupdate">Alternative 4: Add Force Flag to RollingUpdate&lt;/h3>
&lt;p>Add a boolean field like &lt;code>spec.updateStrategy.rollingUpdate.forceUpdate: true&lt;/code>.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Minimal API change&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Same issue as Alternative 1; cannot distinguish transient from permanent failures&lt;/li>
&lt;li>Less discoverable than dedicated field&lt;/li>
&lt;li>Boolean flag doesn&amp;rsquo;t allow tuning timeout per workload&lt;/li>
&lt;/ul>
&lt;p>Why Not Chosen: Recreate strategy is clearer about behavior and simpler to implement.&lt;/p>
&lt;h3 id="alternative-5-enhance-parallel-policy">Alternative 5: Enhance Parallel Policy&lt;/h3>
&lt;p>Extend &lt;code>podManagementPolicy: Parallel&lt;/code> to automatically replace stuck pods during updates.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Reuses existing field&lt;/li>
&lt;li>Already has parallel semantics&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Loses sequential ordering guarantees&lt;/li>
&lt;li>Confuses semantics of &lt;code>podManagementPolicy&lt;/code> (affects both scaling and updates) vs &lt;code>updateStrategy&lt;/code> (updates only)&lt;/li>
&lt;li>Less explicit than dedicated strategy type&lt;/li>
&lt;li>Doesn&amp;rsquo;t automatically delete all pods for clean state&lt;/li>
&lt;/ul>
&lt;p>Why Not Chosen: Recreate strategy as a dedicated update strategy type is clearer and more explicit. It also aligns better with Deployment patterns.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>N/A&lt;/p></description></item><item><title>Resources: Add Resource Health Status to the Pod Status for Device Plugin and DRA</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4680/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4680/</guid><description>
&lt;h1 id="kep-4680-add-resource-health-status-to-the-pod-status-for-device-plugin-and-dra">KEP-4680: Add Resource Health Status to the Pod Status for Device Plugin and DRA&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#podstatusallocatedresourcesstatus"
>PodStatus.AllocatedResourcesStatus&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#device-plugin-implementation-details"
>Device Plugin implementation details&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dra-implementation-details"
>DRA implementation details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#high-level-architectural-approach-for-dra-health"
>High-Level Architectural Approach for DRA Health&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#grpc-api-for-dra-device-health"
>gRPC API for DRA Device Health&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alpha2"
>Alpha2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting Pod crashes hard or impossible. This KEP addresses the problem by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information for troubleshooting Pod device placement issues (for example, validating that related Pods are allocated on connected devices).&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Device Plugin and DRA do not have a well-defined failure handling strategy. With the proliferation of workloads using devices (such as GPUs), the variable quality of devices, and data centers overcommitting on power, devices can fail temporarily or permanently, and Kubernetes needs to handle this natively.&lt;/p>
&lt;p>Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing.&lt;/p>
&lt;p>Exposing unhealthy devices in Pod Status will provide a generic way to understand that a failure is related to an unhealthy device, and to respond to it properly.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Expose device health information (served by Device Plugin or DRA) in Pod Status and events.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Expose any other device information beyond the health.&lt;/li>
&lt;li>Expose CPU assignment of the pod by CPU manager or any other resources assignment by other managers.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="podstatusallocatedresourcesstatus">PodStatus.AllocatedResourcesStatus&lt;/h3>
&lt;p>As part of the InPlacePodVerticalScaling KEP, two new fields were introduced in Pod Status to reflect the resources currently allocated to the Pod:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ContainerStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AllocatedResources represents the compute resources allocated for this container by the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// node. Kubelet sets this value to Container.Resources.Requests upon successful pod admission&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// and after successfully admitting desired pod resize.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +featureGate=InPlacePodVerticalScaling&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AllocatedResources ResourceList &lt;span style="color:#b44">`json:&amp;#34;allocatedResources,omitempty&amp;#34; protobuf:&amp;#34;bytes,10,rep,name=allocatedResources,casttype=ResourceList,castkey=ResourceName&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Resources represents the compute resource requests and limits that have been successfully&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// enacted on the running container after it has been started or has been successfully resized.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +featureGate=InPlacePodVerticalScaling&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Resources &lt;span style="color:#666">*&lt;/span>ResourceRequirements &lt;span style="color:#b44">`json:&amp;#34;resources,omitempty&amp;#34; protobuf:&amp;#34;bytes,11,opt,name=resources&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>One field reflects the resource requests and limits; the other reflects the resources actually allocated.&lt;/p>
&lt;p>This structure will contain standard resources as well as extended resources. As noted in the comment: &lt;a href="https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713&lt;/a>
, it is only logical to also include the status of those allocated resources.&lt;/p>
&lt;p>The proposal is to keep this structure as-is to simplify parsing of the well-known ResourceList data type by various consumers. A typical scenario is checking whether &lt;code>AllocatedResources&lt;/code> matches the desired state.&lt;/p>
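&lt;p>As a rough illustration of that comparison, the following Go sketch checks an allocated resource list against a desired one. It is self-contained and makes simplifying assumptions: &lt;code>ResourceList&lt;/code> here is a plain map of strings rather than the real Kubernetes type, quantities are compared as literal strings, and &lt;code>mismatched&lt;/code> is a hypothetical helper name.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceList is a simplified stand-in for the Kubernetes ResourceList type.
// Quantities are kept as plain strings, so "1" and "1000m" would not compare
// as equal here; a real consumer would compare parsed quantities instead.
type ResourceList map[string]string

// mismatched returns the sorted names of resources whose allocated quantity
// is missing or differs from the desired quantity.
func mismatched(desired, allocated ResourceList) []string {
	var names []string
	for name, want := range desired {
		got, ok := allocated[name]
		if !ok || got != want {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	desired := ResourceList{"cpu": "2", "example.com/gpu": "1"}
	allocated := ResourceList{"cpu": "2"}
	fmt.Println(mismatched(desired, allocated))
}
```

&lt;p>Because the list keeps its existing shape, a check like this needs no knowledge of device health; health is carried separately.&lt;/p>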
&lt;p>The proposal is to introduce an additional field:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ContainerStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AllocatedResourcesStatus represents the status of various resources&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// allocated for this Container. In case of DRA, the same resource health&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// can be reported multiple times if it is associated with multiple containers.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +featureGate=ResourceHealthStatus&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +patchMergeKey=name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +patchStrategy=merge&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=map&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listMapKey=name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AllocatedResourcesStatus []ResourceStatus &lt;span style="color:#b44">`json:&amp;#34;allocatedResourcesStatus,omitempty&amp;#34; patchStrategy:&amp;#34;merge&amp;#34; patchMergeKey:&amp;#34;name&amp;#34; protobuf:&amp;#34;bytes,14,rep,name=allocatedResourcesStatus&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>ResourceStatus&lt;/code> is defined as:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResourceStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name of the resource. Must be unique within the pod and in case of non-DRA resource, match one of the resources from the pod spec.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// For DRA resources, the value must be claim:claim_name/request.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The claim_name must match one of the claims from resourceClaims field in the podSpec.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +required&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name ResourceName &lt;span style="color:#b44">`json:&amp;#34;name&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=name&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// List of unique Resources health. Each element in the list contains a unique resource ID and resource health.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// At a minimum, ResourceID must uniquely identify the Resource&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// allocated to the Pod on the Node for the lifetime of a Pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// See ResourceID type for its definition.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=map&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listMapKey=resourceID&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Resources []ResourceHealth &lt;span style="color:#b44">`json:&amp;#34;resources,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,rep,name=resources&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResourceHealthStatus &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResourceHealthStatusHealthy ResourceHealthStatus = &lt;span style="color:#b44">&amp;#34;Healthy&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResourceHealthStatusUnhealthy ResourceHealthStatus = &lt;span style="color:#b44">&amp;#34;Unhealthy&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResourceHealthStatusUnknown ResourceHealthStatus = &lt;span style="color:#b44">&amp;#34;Unknown&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ResourceID is calculated based on the source of this resource health information.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// For DevicePlugin:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// DeviceID, where DeviceID is from the Device structure of DevicePlugin&amp;#39;s ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// For DRA:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// &amp;lt;driver name&amp;gt;/&amp;lt;pool name&amp;gt;/&amp;lt;device name&amp;gt;: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResourceID &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ResourceHealth represents the health of a resource. It has the latest device health information.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// This is a part of KEP https://kep.k8s.io/4680 and historical health changes are planned to be added in future iterations of a KEP.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResourceHealth &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ResourceID is the unique identifier of the resource. See the ResourceID type for more information.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResourceID ResourceID &lt;span style="color:#b44">`json:&amp;#34;resourceID&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=resourceID&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Health of the resource.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// can be one of:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// - Healthy: operates as normal&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// - Unhealthy: reported unhealthy. We consider this a temporary health issue&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// since we do not have a mechanism today to distinguish&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// temporary and permanent issues.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// - Unknown: The status cannot be determined.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// For example, Device Plugin got unregistered and hasn&amp;#39;t been re-registered since.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// In future we may want to introduce the PermanentlyUnhealthy Status.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Health ResourceHealthStatus &lt;span style="color:#b44">`json:&amp;#34;health,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,name=health&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Message provides additional human-readable context about the health status.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This can include error details, failure reasons, or other diagnostic information.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field is optional and may be empty for healthy resources.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Message &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;message,omitempty&amp;#34; protobuf:&amp;#34;bytes,3,opt,name=message&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In alpha2, to support pod-level DRA resources, the following field will be added to PodStatus:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// PodStatus represents information about the status of a pod. Status may trail the actual&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// state of a system.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> PodStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Status of resource claims.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +featureGate=DynamicResourceAllocation&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResourceClaimStatuses []PodResourceClaimStatus
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AllocatedResourcesStatus represents the status of various resources&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// allocated for this Pod, but not associated with any container.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +featureGate=ResourceHealthStatus&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +patchMergeKey=name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +patchStrategy=merge&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=map&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listMapKey=name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AllocatedResourcesStatus []ResourceStatus &lt;span style="color:#b44">`json:&amp;#34;allocatedResourcesStatus,omitempty&amp;#34; patchStrategy:&amp;#34;merge&amp;#34; patchMergeKey:&amp;#34;name&amp;#34; protobuf:&amp;#34;bytes,14,rep,name=allocatedResourcesStatus&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;em>&lt;strong>Is there any guarantee that the AllocatedResourcesStatus will be updated before the container crashes and the Pod is unscheduled?&lt;/strong>&lt;/em>&lt;/p>
&lt;p>No, there is no guarantee that the Device Plugin or DRA driver will detect a device going unhealthy before the Pod is affected. Once a device becomes unhealthy, the container may crash and the Pod may already be marked as Failed (if &lt;code>restartPolicy=Never&lt;/code>; in other cases the Pod will enter crash loop backoff).&lt;/p>
&lt;p>Note: Updating Pod Status with device health after the pod has been marked as Failed is &lt;strong>not supported&lt;/strong> due to a
race condition in the Kubelet&amp;rsquo;s DRA manager cleanup. See the Known Limitations section for details.&lt;/p>
&lt;p>&lt;em>&lt;strong>Do we need the CheckDeviceHealth call introduced to the Device Plugin to work around the limitation above?&lt;/strong>&lt;/em>&lt;/p>
&lt;p>We may consider this as a future improvement.&lt;/p>
&lt;p>&lt;em>&lt;strong>Should we introduce a permanent failure status?&lt;/strong>&lt;/em>&lt;/p>
&lt;p>We may consider this as a future improvement.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;ul>
&lt;li>A user schedules a Pod that uses a GPU device&lt;/li>
&lt;li>When the GPU device fails, the user sees that the Pod is in crash loop backoff&lt;/li>
&lt;li>The user checks the Pod Status using &lt;code>kubectl describe pod&lt;/code>&lt;/li>
&lt;li>The user sees the Pod status indicating that the GPU device is not healthy&lt;/li>
&lt;li>The user or a controller (custom for now) deletes the Pod, and the ReplicaSet reschedules it onto another available GPU&lt;/li>
&lt;/ul>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>DRA Device Health Timeout Configuration:&lt;/strong> The timeout for marking a DRA device&amp;rsquo;s health as &amp;ldquo;Unknown&amp;rdquo;
when no updates are received can be configured per device through the &lt;code>health_check_timeout_seconds&lt;/code> field
in the &lt;code>DeviceHealth&lt;/code> message. This allows different hardware types (e.g., GPUs, FPGAs, TPUs, storage devices)
to specify appropriate timeout values based on their health-reporting characteristics. If not specified,
Kubelet will use a default timeout of 30 seconds. This addresses
&lt;a href="https://github.com/kubernetes/kubernetes/issues/133118"
target="_blank" rel="noopener">Issue #133118&lt;/a>
and the discussion in
&lt;a href="https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511"
target="_blank" rel="noopener">PR #130606&lt;/a>
.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Failure Message Field:&lt;/strong> The &lt;code>ResourceHealth&lt;/code> struct includes an optional &lt;code>Message&lt;/code> field that provides
additional human-readable context about device health status. This field enables Device Plugins and DRA drivers
to report detailed error information, failure reasons, and diagnostic information beyond the basic health status.
This enhancement improves troubleshooting capabilities for device-related failures. See
&lt;a href="https://github.com/kubernetes/kubernetes/issues/133202"
target="_blank" rel="noopener">Issue #133202&lt;/a>
and
&lt;a href="https://github.com/kubernetes/kubernetes/pull/134506"
target="_blank" rel="noopener">PR #134506&lt;/a>
for implementation details.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Known Limitation - Device Health for Terminated Pods:&lt;/strong> Device health status is &lt;strong>not&lt;/strong> updated in PodStatus
after a Pod has terminated (e.g., in Failed state). Due to a race condition between pod termination and
health status updates, the Kubelet&amp;rsquo;s DRA manager cleans up the ClaimInfo from its cache before health updates
can be applied. The complexity required to fix this (tombstoning terminated ClaimInfo entries) was deemed
not worth the benefit for this edge case. The core value for long-running services (&lt;code>RestartPolicy: Always&lt;/code>)
is unaffected. See &lt;a href="https://github.com/kubernetes/kubernetes/issues/132978"
target="_blank" rel="noopener">Issue #132978&lt;/a>
for details on why
this was closed without implementation.&lt;/p>
&lt;/li>
&lt;/ul>
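&lt;p>As a rough illustration of the timeout note above, the following Go sketch shows one way the effective timeout could be resolved. The helper names &lt;code>effectiveTimeout&lt;/code> and &lt;code>isStale&lt;/code> are hypothetical and are not taken from the Kubelet source. Treating negative values the same as an unset field is also an assumption; the KEP only states that an unspecified or zero value falls back to the default.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">package main

import (
	"fmt"
	"time"
)

// defaultHealthTimeout mirrors the 30-second default described in the note above.
const defaultHealthTimeout = 30 * time.Second

// effectiveTimeout is a hypothetical helper. It returns the per-device timeout
// when health_check_timeout_seconds is positive and the default otherwise.
// Assumption: negative values are treated like an unset field.
func effectiveTimeout(healthCheckTimeoutSeconds int64) time.Duration {
	if healthCheckTimeoutSeconds > 0 {
		return time.Duration(healthCheckTimeoutSeconds) * time.Second
	}
	return defaultHealthTimeout
}

// isStale reports whether the last update is older than the effective timeout,
// in which case the device health would be reported as "Unknown".
func isStale(lastUpdated, now time.Time, healthCheckTimeoutSeconds int64) bool {
	return now.Sub(lastUpdated) > effectiveTimeout(healthCheckTimeoutSeconds)
}

func main() {
	now := time.Now()
	last := now.Add(-45 * time.Second)

	fmt.Println(effectiveTimeout(0))     // 30s
	fmt.Println(effectiveTimeout(120))   // 2m0s
	fmt.Println(isStale(last, now, 0))   // true: 45s exceeds the 30s default
	fmt.Println(isStale(last, now, 120)) // false: 45s is within the 120s override
}
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>A device that reports infrequently, such as a storage device with slow health probes, could set a larger value so that it is not marked &amp;ldquo;Unknown&amp;rdquo; between reports.&lt;/p>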
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>There are not many risks to this KEP. The biggest risk is that Device Plugins will not be
able to detect device health reliably and quickly enough to assign this status to
Pods with &lt;code>restartPolicy=Never&lt;/code>. End users will expect this field to be populated, and
missing health information will confuse them.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="device-plugin-implementation-details">Device Plugin implementation details&lt;/h3>
&lt;p>Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods.&lt;/p>
&lt;p>One needed improvement is to distinguish devices that were explicitly marked unhealthy from devices whose device plugin was unregistered.&lt;/p>
&lt;p>The NVIDIA device plugin has a checkHealth implementation: &lt;a href="https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39"
target="_blank" rel="noopener">https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39&lt;/a>
that has more information than a simple “Unhealthy”.&lt;/p>
&lt;p>As a future improvement, we should consider adding another field to the Status that carries free-form error information.&lt;/p>
&lt;h3 id="dra-implementation-details">DRA implementation details&lt;/h3>
&lt;p>Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
type &lt;code>BasicDevice&lt;/code> (from &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/dynamic-resource-allocation/api/types.go#L58"
target="_blank" rel="noopener">staging/src/k8s.io/dynamic-resource-allocation/api/types.go&lt;/a>
) to include the Health field the same way it is done in the Device Plugin as well as a device ID.&lt;/p>
&lt;p>The following design outlines how Kubelet will obtain health information
from DRA plugins and use it to update the PodStatus. This design focuses on an
optional, proactive health reporting mechanism from DRA plugins.&lt;/p>
&lt;h4 id="high-level-architectural-approach-for-dra-health">High-Level Architectural Approach for DRA Health&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Optional gRPC Stream:&lt;/strong> A new, optional gRPC service for health monitoring
will be defined. DRA plugins can implement this service to proactively send
health updates for their managed devices to Kubelet. It will expose a
server-streaming RPC that allows the plugin to send a complete list of
device health states whenever a change occurs. If a plugin does not
implement this service, the health of its devices will be reported as &amp;ldquo;Unknown&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Health Information Cache:&lt;/strong> Kubelet&amp;rsquo;s DRA Manager will maintain a
persistent cache of device health information. This cache will store the
latest known health status (e.g., Healthy, Unhealthy, Unknown) and a
timestamp for each device, keyed by driver and device identifiers. The cache
will be responsible for reconciling the state reported by the plugin, handling
timeouts for stale data (marking devices as &amp;ldquo;Unknown&amp;rdquo; if not updated
within a certain period), and persisting this information across Kubelet restarts.&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> The timeout for marking a device&amp;rsquo;s health as &amp;ldquo;Unknown&amp;rdquo; can be
configured per device via the &lt;code>health_check_timeout_seconds&lt;/code> field in the
&lt;code>DeviceHealth&lt;/code> message. If not specified, Kubelet will use a default timeout
of 30 seconds. This addresses &lt;a href="https://github.com/kubernetes/kubernetes/issues/133118"
target="_blank" rel="noopener">Issue #133118&lt;/a>
,
allowing different hardware types (e.g., GPUs, FPGAs, TPUs, storage) to specify
appropriate timeout values based on their health-reporting characteristics.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Kubelet Integration:&lt;/strong> The DRA Manager in Kubelet will act as the gRPC client.
Upon plugin registration, it will attempt to initiate the health monitoring
stream. If successful, it will consume the health updates, update its
internal health cache, and identify which Pods are affected by any
reported health changes. For seamless plugin upgrades, where multiple
instances of a plugin might run concurrently, the Kubelet will always
watch the most recently registered plugin for health updates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>PodStatus Update:&lt;/strong> When health changes for a device are detected, the DRA manager
will trigger an update for the affected Pods. Kubelet&amp;rsquo;s main pod synchronization
logic will then read the current health status for the Pod&amp;rsquo;s allocated DRA devices
from the health cache and populate the &lt;code>AllocatedResourcesStatus&lt;/code> field in the
PodStatus with the correct health information.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;em>Note: Kubelet will only use this health information to update the Pod
Status. The DRA plugin remains responsible for other actions, such as tainting
ResourceSlices to prevent scheduling on unhealthy resources.&lt;/em>&lt;/p>
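&lt;p>The cache behavior described in the steps above can be sketched in Go as follows. This is a minimal in-memory illustration, not the Kubelet implementation: the types &lt;code>deviceKey&lt;/code>, &lt;code>deviceReport&lt;/code>, &lt;code>healthEntry&lt;/code> and &lt;code>healthCache&lt;/code>, the driver name &lt;code>gpu.example.com&lt;/code>, and the single cache-wide timeout are all assumptions made for the example. Checkpointing across Kubelet restarts is left out, and devices missing from a report are simply left to age out, which is one possible reconciliation policy among several.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">package main

import (
	"fmt"
	"time"
)

// deviceKey identifies a device by driver, pool, and device name (hypothetical type).
type deviceKey struct {
	Driver, Pool, Device string
}

// deviceReport is one entry of a plugin's health report (hypothetical type).
type deviceReport struct {
	Key    deviceKey
	Status string // "Healthy", "Unhealthy", or "Unknown"
}

// healthEntry stores the latest known status and when it was recorded.
type healthEntry struct {
	Status      string
	LastUpdated time.Time
}

// healthCache is an in-memory stand-in for the DRA manager's health cache.
type healthCache struct {
	entries map[deviceKey]healthEntry
	timeout time.Duration
}

func newHealthCache(timeout time.Duration) healthCache {
	return healthCache{entries: map[deviceKey]healthEntry{}, timeout: timeout}
}

// reconcile stores the reported states and returns the devices whose status
// changed, including devices seen for the first time.
func (c healthCache) reconcile(reports []deviceReport, now time.Time) []deviceKey {
	var changed []deviceKey
	for _, r := range reports {
		old, seen := c.entries[r.Key]
		if !seen || old.Status != r.Status {
			changed = append(changed, r.Key)
		}
		c.entries[r.Key] = healthEntry{Status: r.Status, LastUpdated: now}
	}
	return changed
}

// get returns "Unknown" for devices that were never reported or whose last
// report is older than the timeout, and the stored status otherwise.
func (c healthCache) get(key deviceKey, now time.Time) string {
	entry, ok := c.entries[key]
	if !ok || now.Sub(entry.LastUpdated) > c.timeout {
		return "Unknown"
	}
	return entry.Status
}

func main() {
	cache := newHealthCache(30 * time.Second)
	gpu0 := deviceKey{Driver: "gpu.example.com", Pool: "node-a", Device: "gpu-0"}
	start := time.Now()

	changed := cache.reconcile([]deviceReport{{Key: gpu0, Status: "Healthy"}}, start)
	fmt.Println(len(changed), cache.get(gpu0, start)) // 1 Healthy

	changed = cache.reconcile([]deviceReport{{Key: gpu0, Status: "Healthy"}}, start.Add(5*time.Second))
	fmt.Println(len(changed)) // 0

	fmt.Println(cache.get(gpu0, start.Add(time.Minute))) // Unknown
}
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Returning the list of changed devices from &lt;code>reconcile&lt;/code> lets the caller trigger Pod updates only when something actually changed, rather than on every message received from the stream.&lt;/p>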
&lt;h4 id="grpc-api-for-dra-device-health">gRPC API for DRA Device Health&lt;/h4>
&lt;p>A new gRPC service, &lt;code>NodeHealth&lt;/code>, will be introduced in a new API group (e.g., &lt;code>dra-health/v1alpha1&lt;/code>) to keep it separate from the core DRA API and signify its optionality.&lt;/p>
&lt;p>The service will define a &lt;code>WatchResources&lt;/code> RPC:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-proto" data-lang="proto">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">service&lt;/span> NodeHealth {&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// WatchResources allows a DRA plugin to stream health updates for its devices to Kubelet.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Kubelet calls this method, and the plugin streams responses.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// This method is optional; if not implemented by a plugin, Kubelet will assume
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// devices managed by that plugin have an &amp;#34;Unknown&amp;#34; health status.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#a2f;font-weight:bold">rpc&lt;/span> WatchResources(WatchResourcesRequest) &lt;span style="color:#a2f;font-weight:bold">returns&lt;/span> (stream WatchResourcesResponse) {}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>&lt;span style="color:#a2f;font-weight:bold">message&lt;/span> &lt;span style="color:#00f">WatchResourcesRequest&lt;/span> {&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Reserved for future use, e.g., filtering or options.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span>}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>&lt;span style="color:#a2f;font-weight:bold">message&lt;/span> &lt;span style="color:#00f">WatchResourcesResponse&lt;/span> {&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// A list of all devices managed by the plugin for which health is being reported.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// This should be a complete list for the driver; Kubelet will reconcile this state.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#a2f;font-weight:bold">repeated&lt;/span> DeviceHealth devices &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">1&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>&lt;span style="color:#a2f;font-weight:bold">message&lt;/span> &lt;span style="color:#00f">DeviceHealth&lt;/span> {&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// The name of the resource pool this device belongs to.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Required.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> pool_name &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">1&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// The unique name of the device within the pool.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Required.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> device_name &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">2&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Health status of the device.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Expected values: &amp;#34;Healthy&amp;#34;, &amp;#34;Unhealthy&amp;#34;, &amp;#34;Unknown&amp;#34;.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Required.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> health_status &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">3&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Required.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#0b0;font-weight:bold">int64&lt;/span> last_updated_timestamp &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">4&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span> &lt;span style="color:#080;font-style:italic">// Health check timeout duration in seconds for this device.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// If not specified or zero, Kubelet will use a default timeout.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#080;font-style:italic">// Optional.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">&lt;/span> &lt;span style="color:#0b0;font-weight:bold">int64&lt;/span> health_check_timeout_seconds &lt;span style="color:#666">=&lt;/span> &lt;span style="color:#666">5&lt;/span>;&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">&lt;/span>}&lt;span style="">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;p>The existing test coverage for Device Manager and DRA will be used as a baseline. New code introduced by this KEP will include thorough unit tests to maintain or improve coverage.&lt;/p>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;p>Current coverage for the relevant packages (as of June 2025):&lt;/p>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/kubelet/cm/devicemanager&lt;/code>: &lt;code>84.8%&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/kubelet/cm/dra&lt;/code>: &lt;code>79.8%&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin&lt;/code>: &lt;code>84.0%&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/kubelet/cm/dra/state&lt;/code>: &lt;code>46.2%&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>The new DRA health monitoring logic will have thorough unit test coverage, including:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Health Information Cache Logic:&lt;/strong>
&lt;ul>
&lt;li>Cache initialization from scratch and from a checkpoint file.&lt;/li>
&lt;li>State reconciliation of device health based on plugin reports.&lt;/li>
&lt;li>Correct handling of &lt;code>LastUpdated&lt;/code> timestamps.&lt;/li>
&lt;li>Marking devices as &amp;ldquo;Unknown&amp;rdquo; after a timeout period.&lt;/li>
&lt;li>Correctly identifying which devices have changed health status.&lt;/li>
&lt;li>Accurate retrieval of health status for existing, timed-out, and non-existent devices.&lt;/li>
&lt;li>Proper cleanup of a driver&amp;rsquo;s health data upon its deregistration.&lt;/li>
&lt;li>Persistence logic for saving to and loading from the checkpoint file.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Plugin Registration and gRPC Stream Handling:&lt;/strong>
&lt;ul>
&lt;li>Verification of successful health stream startup and background processing.&lt;/li>
&lt;li>Graceful handling of plugins that do not implement the health monitoring service (&lt;code>Unimplemented&lt;/code> error).&lt;/li>
&lt;li>Correct cancellation of the health stream when a plugin is replaced or deregistered.&lt;/li>
&lt;li>Error handling during stream initiation and message reception.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>DRA Manager Logic:&lt;/strong>
&lt;ul>
&lt;li>Correct processing of health update messages from the gRPC stream.&lt;/li>
&lt;li>Accurate identification of Pods affected by a health change.&lt;/li>
&lt;li>Properly sending update notifications for affected Pods.&lt;/li>
&lt;li>Correct population of the &lt;code>AllocatedResourcesStatus&lt;/code> field in the Pod&amp;rsquo;s status object.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
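&lt;p>One of the unit-test targets listed above, identifying the Pods affected by a health change, can be illustrated with a small Go sketch. The function &lt;code>affectedPods&lt;/code>, the allocation map, and the plain string device identifiers are hypothetical simplifications of the DRA manager&amp;rsquo;s claim bookkeeping, chosen only to show the shape of the logic a test would exercise.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">package main

import (
	"fmt"
	"sort"
)

// affectedPods returns the sorted UIDs of Pods that hold at least one device
// whose health changed. The allocation map (Pod UID to device IDs) is a
// hypothetical stand-in for the DRA manager's claim information.
func affectedPods(allocations map[string][]string, changed []string) []string {
	changedSet := make(map[string]bool, len(changed))
	for _, d := range changed {
		changedSet[d] = true
	}

	var pods []string
	for podUID, devices := range allocations {
		for _, d := range devices {
			if changedSet[d] {
				pods = append(pods, podUID)
				break // one changed device is enough to mark the Pod
			}
		}
	}
	sort.Strings(pods) // map iteration order is random, so sort for stable results
	return pods
}

func main() {
	allocations := map[string][]string{
		"pod-a": {"gpu-0", "gpu-1"},
		"pod-b": {"gpu-2"},
		"pod-c": {"gpu-1"},
	}
	fmt.Println(affectedPods(allocations, []string{"gpu-1"})) // [pod-a pod-c]
}
&lt;/code>&lt;/pre>&lt;/div>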
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>N/A&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>Planned tests will cover the user-visible behavior of the feature:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Basic Health Reporting:&lt;/strong>
&lt;ul>
&lt;li>Verify that when a DRA plugin reports a device as unhealthy, the PodStatus is updated to reflect this.&lt;/li>
&lt;li>Verify that when the device becomes healthy again, the PodStatus is correctly updated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>State Transitions:&lt;/strong>
&lt;ul>
&lt;li>Test rapid health state changes (e.g., unhealthy to healthy and back) to ensure the final PodStatus reflects the latest state.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Failure Scenarios:&lt;/strong>
&lt;ul>
&lt;li>Verify that a Pod in a &lt;code>CrashLoopBackOff&lt;/code> state due to an unhealthy device correctly shows the device&amp;rsquo;s unhealthy status.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Feature Gate Behavior (for Alpha):&lt;/strong>
&lt;ul>
&lt;li>When the feature gate is disabled, verify that the &lt;code>AllocatedResourcesStatus&lt;/code> field is not populated by the DRA manager.&lt;/li>
&lt;li>When the feature gate is disabled on an existing cluster, verify that existing health information is gracefully ignored or removed on the next Pod update.&lt;/li>
&lt;li>When the feature gate is re-enabled, verify that health reporting resumes correctly.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>New field is introduced in Pod Status&lt;/li>
&lt;li>Feature implemented in Device Manager behind a feature flag&lt;/li>
&lt;li>Initial e2e tests completed and enabled&lt;/li>
&lt;/ul>
&lt;h4 id="alpha2">Alpha2&lt;/h4>
&lt;ul>
&lt;li>Feature implemented in DRA behind a feature flag&lt;/li>
&lt;li>e2e tests completed and enabled for DRA&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;p>The following requirements must be met for Beta graduation:&lt;/p>
&lt;ul>
&lt;li>Complete e2e test coverage&lt;/li>
&lt;li>&lt;strong>Configurable Device Health Check Timeout&lt;/strong> (&lt;a href="https://github.com/kubernetes/kubernetes/issues/133118"
target="_blank" rel="noopener">Issue #133118&lt;/a>
, &lt;a href="https://github.com/kubernetes/kubernetes/pull/133752"
target="_blank" rel="noopener">PR #133752&lt;/a>
):
Verify that the configurable device health check timeout implementation (via &lt;code>health_check_timeout_seconds&lt;/code> field)
works correctly across different plugin vendors and hardware types (e.g., GPUs, FPGAs, TPUs, storage devices).&lt;/li>
&lt;li>&lt;strong>Failure Message Field&lt;/strong> (&lt;a href="https://github.com/kubernetes/kubernetes/issues/133202"
target="_blank" rel="noopener">Issue #133202&lt;/a>
, &lt;a href="https://github.com/kubernetes/kubernetes/pull/134506"
target="_blank" rel="noopener">PR #134506&lt;/a>
):
Support for a message field in device health reporting to provide additional context about health status and failures,
enabling better troubleshooting capabilities.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Feedback is collected on usability of the field&lt;/li>
&lt;li>An example of real-world usage with at least one device plugin, for example the NVIDIA Device Plugin&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>The feature exposes a new field based on information the Device Plugin already exposes. There is no dependency on upgrade/downgrade; the feature will either work or not.&lt;/p>
&lt;p>The DRA implementation requires changes to the DRA interfaces. DRA is in alpha and in active development. The feature will follow the DRA upgrade/downgrade strategy.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>There is no issue with version skew. A kubelet that populates this field will
always be at the same version as, or behind, the API server that introduced the new field.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;p>Toggling the feature gate enables or disables this feature.&lt;/p>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>ResourceHealthStatus&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kubelet&lt;/code> and &lt;code>kube-apiserver&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes, with no side effect other than the new field being absent from the pod status. When the feature is disabled,
the values of the &lt;code>AllocatedResourcesStatus&lt;/code> fields will be dropped when serving the API even if they
are written to storage. This prevents clients from acting on potentially stale data when the feature
is off. Values written while the feature was enabled may be wiped on the next update request.
Re-enabling the feature does not guarantee that values written before the
feature was disabled are kept.&lt;/p>
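&lt;p>The drop-on-disable behavior can be pictured with the following simplified Go sketch. The types &lt;code>resourceStatus&lt;/code> and &lt;code>podStatus&lt;/code> and the function &lt;code>dropDisabledStatusFields&lt;/code> are hypothetical reductions written for this example; they are not the API server&amp;rsquo;s actual types or field-dropping code, and the feature gate is modeled as a plain boolean.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">package main

import "fmt"

// resourceStatus and podStatus are pared-down, hypothetical versions of the
// API types, keeping only what this sketch needs.
type resourceStatus struct {
	Name string
}

type podStatus struct {
	Phase                    string
	AllocatedResourcesStatus []resourceStatus
}

// dropDisabledStatusFields returns a copy of the status with the gated field
// cleared when the ResourceHealthStatus gate (modeled as a bool) is off.
func dropDisabledStatusFields(status podStatus, gateEnabled bool) podStatus {
	if !gateEnabled {
		status.AllocatedResourcesStatus = nil
	}
	return status
}

func main() {
	stored := podStatus{
		Phase:                    "Running",
		AllocatedResourcesStatus: []resourceStatus{{Name: "example.com/gpu"}},
	}

	served := dropDisabledStatusFields(stored, false)
	fmt.Println(len(served.AllocatedResourcesStatus)) // 0: hidden while the gate is off

	served = dropDisabledStatusFields(stored, true)
	fmt.Println(len(served.AllocatedResourcesStatus)) // 1: visible again once re-enabled
}
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Because the stored object is untouched in this sketch until the next write, it also shows why stale values can reappear briefly after the gate is turned back on.&lt;/p>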
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The pod status will be updated again. When the feature is re-enabled, there may be a brief period
where stale values from storage reappear in the API before kubelet and controllers actuate and update
the values with current device health information. This period should be kept as short as possible
through normal kubelet reconciliation. Consistency will not be guaranteed for fields written
before the last enablement.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes, see the e2e tests section.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>An increase in the API server error rate: &lt;code>apiserver_request_total&lt;/code> filtered by &lt;code>code&lt;/code> for non-&lt;code>2xx&lt;/code> values.
API validation errors are the most likely indication of a problem.&lt;/p>
&lt;p>Potential errors on kubelet would likely be exposed as error logs and events on Pods.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Will be tested, but we do not expect any issues.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Check the Pod Status.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> API pod.status&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>There are a few error modes for this feature:&lt;/p>
&lt;ol>
&lt;li>API issues accepting the new field - for example kubelet is writing the field in a format not acceptable by the API server&lt;/li>
&lt;li>kubelet fails while populating this field&lt;/li>
&lt;/ol>
&lt;p>The first error mode can be observed with the metric &lt;code>apiserver_request_total&lt;/code> filtered by &lt;code>code&lt;/code> for non-&lt;code>2xx&lt;/code> values.&lt;/p>
&lt;p>There is no good metric for the second error mode because it will not be clear what part of processing may fail.
The most likely indication of an error would be the increased number of error events on the Pod.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>DRA implementation.&lt;/p>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>Pod Status size will increase insignificantly.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>New field on Pod Status.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Pod Status size will increase insignificantly.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>Not significantly. We already keep all of the collected data in memory and only need to connect the dots.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>Not applicable.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>&lt;code>v1.31&lt;/code>: KEP is in alpha and implemented for Device Plugin&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>No post mortem health status for terminated pods:&lt;/strong> For batch jobs using &lt;code>RestartPolicy: Never&lt;/code>,
device health status will not be updated after the pod terminates. This means &amp;ldquo;post mortem&amp;rdquo;
troubleshooting for batch jobs cannot rely on this field. The race condition between pod termination
and health updates would require significant complexity to fix (tombstoning ClaimInfo entries in the
DRA manager), which was deemed not worth the benefit. See &lt;a href="https://github.com/kubernetes/kubernetes/issues/132978"
target="_blank" rel="noopener">Issue #132978&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>There are a few alternatives to this proposal.&lt;/p>
&lt;p>&lt;strong>First&lt;/strong>, an API similar to the Pod Resources API can be exposed by kubelet to query via kubectl or directly through some node-exposed port. The problems with this approach are:&lt;/p>
&lt;ul>
&lt;li>it opens up a new API surface&lt;/li>
&lt;li>it will be impossible to get status for Pods that have already completed&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Second&lt;/strong>, the status for DRA could be exposed via claims. That approach leads to a debate on how to ensure security, so that kubelet is limited in which statuses it can set. With the Pod Status approach proposed here, there are mechanisms in place to ensure that kubelet updates status only for Pods scheduled on that node.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>We may need to update the sample device plugin. No special infrastructure is needed, as emulating real GPU failures or failures in other devices is not practical.&lt;/p></description></item><item><title>Resources: Add support for a kubelet drop-in configuration directory</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3983/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3983/</guid><description>
&lt;h1 id="kep-3983-add-support-for-a-drop-in-kubelet-configuration-directory">KEP-3983: Add support for a drop-in kubelet configuration directory&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3"
>Story 3&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Add support for a drop-in configuration directory for the Kubelet. This directory can be specified via a &lt;code>--config-dir&lt;/code> flag, and configuration files will be processed in alphanumeric order. The flag will be empty by default and if not specified, drop-in support will not be enabled. During the alpha phase, we introduced an environment variable called &lt;code>KUBELET_CONFIG_DROPIN_DIR_ALPHA&lt;/code> to control the drop-in configuration directory for testing purposes. In the beta phase, we plan to leave the &lt;code>--config-dir&lt;/code> flag unset by default, which aligns with the behavior of the &lt;code>--config&lt;/code> flag. Users are encouraged to opt in by specifying their desired configuration directory. Additionally, we will enhance the feature with E2E testing and streamline the configuration process. As part of this optimization, we will remove the &lt;code>KUBELET_CONFIG_DROPIN_DIR_ALPHA&lt;/code> environment variable, simplifying configuration management. The feature will be enabled by default during the beta phase, and we will evaluate its status in future releases.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>A common pattern for software configuration on Linux is support for a drop-in configuration directory. The location of this directory is often based on a corresponding configuration file. For instance, &lt;code>/etc/security/limits.conf&lt;/code> can be overridden by files in &lt;code>/etc/security/limits.d&lt;/code>. This pattern is useful for a number of reasons, though a major motivation here is to allow each file to have a single owner. If multiple processes compete to change the same file, they can overwrite each other&amp;rsquo;s changes and race against each other, creating TOCTOU (time-of-check to time-of-use) problems.&lt;/p>
&lt;p>Components in Kubernetes can similarly be configured by multiple entities, and preventing races between them is cumbersome. There has been past work in the Kubelet to support Dynamic Configuration, but resolving between multiple entities and a last known good state was also complicated. Since the Kubelet is the node agent, and is often distributed as a package on the host operating system along with the container runtime, configuring it in the same way as other host processes is a natural fit. This paves the way for extending the drop-in configuration pattern to the Kubelet.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Add support for a &lt;code>--config-dir&lt;/code> flag to the kubelet to allow users to specify a drop-in directory, which will override the configuration for the Kubelet located at &lt;code>/etc/kubernetes/kubelet.conf&lt;/code>&lt;/li>
&lt;li>Extend kubelet configuration parsing code to handle files in the drop-in directory.&lt;/li>
&lt;li>Define Kubernetes best-practices for configuration definitions, similarly to &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md"
target="_blank" rel="noopener">API conventions&lt;/a>
. This is intended for other maintainers who wish to set up a configuration object that works well with drop-in directories.&lt;/li>
&lt;li>Add ability to easily view the effective configuration that is being used by kubelet.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Add support for drop-in configuration for Kubernetes components other than the Kubelet.&lt;/li>
&lt;li>Dynamically reconfiguring running kubelets if drop-in contents change.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>This proposal aims to add support for a drop-in configuration directory for the kubelet via specifying a &lt;code>--config-dir&lt;/code> flag (for example, &lt;code>/etc/kubernetes/kubelet.conf.d&lt;/code>). Users are able to specify individually configurable kubelet config snippets in files, formatted in the same way as the existing kubelet.conf file. The kubelet will process the configuration provided in the drop-in directory in alphanumeric order:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>If no other configuration for the subfield(s) exists, append to the base configuration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If the subfield(s) already exist in the base configuration file at &lt;code>/etc/kubernetes/kubelet.conf&lt;/code>, or in another file in the drop-in directory that sorts earlier in alphanumeric order, overwrite them&lt;/p>
&lt;ul>
&lt;li>If the subfield(s) exist as a list, overwrite instead of attempting to merge. This makes it easier to delete items from lists defined in the base kubelet.conf or other drop-ins without having to modify other files. See example below&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
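&lt;p>The two rules above can be sketched as a small recursive merge. This is a minimal sketch and not the kubelet&amp;rsquo;s implementation: the function name &lt;code>merge_dropin&lt;/code> is hypothetical, parsed configuration files are assumed to be plain nested dictionaries, and the sample values mirror the example in this section.&lt;/p>

```python
import copy


def merge_dropin(base, dropin):
    """Return base overlaid with dropin.

    Nested mappings are merged key by key. Any other value, including a
    list, replaces the earlier value wholesale instead of being merged.
    """
    merged = copy.deepcopy(base)
    for key, value in dropin.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dropin(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "authentication": {
        "anonymous": {"enabled": False},
        "webhook": {"enabled": True},
        "x509": {"clientCAFile": "/etc/kubernetes/pki/ca.crt"},
    },
    "clusterDNS": ["1.2.3.4", "1.2.3.5"],
}
dropins = [
    {"authentication": {"x509": {"clientCAFile": "/some/new/location"}}},
    {"clusterDNS": ["1.2.3.6"]},
]

effective = base
for dropin in dropins:  # assumed to be in alphanumeric file order already
    effective = merge_dropin(effective, dropin)

print(effective["clusterDNS"])  # prints ['1.2.3.6']
```

&lt;p>Replacing lists wholesale is what lets a later drop-in remove &lt;code>1.2.3.4&lt;/code> and &lt;code>1.2.3.5&lt;/code> without editing the file that defined them.&lt;/p>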
&lt;p>If there are any issues with the drop-ins (e.g. formatting errors), the error will be reported in the same way as a misconfigured kubelet.conf file. Only files with a &lt;code>.conf&lt;/code> extension will be parsed. All other files found will be skipped and logged.&lt;/p>
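&lt;p>File selection and ordering could look roughly like the following. The helper name &lt;code>list_dropins&lt;/code> and the file names are hypothetical. The sketch also assumes that &amp;ldquo;alphanumeric order&amp;rdquo; means a plain lexical sort of file names and that subdirectories are skipped; neither detail is specified here.&lt;/p>

```python
from pathlib import Path
import tempfile


def list_dropins(config_dir):
    """Return (.conf file names in sorted order, names that were skipped)."""
    selected, skipped = [], []
    for entry in sorted(Path(config_dir).iterdir(), key=lambda p: p.name):
        if entry.is_file() and entry.suffix == ".conf":
            selected.append(entry.name)
        else:
            skipped.append(entry.name)  # a real implementation would log these
    return selected, skipped


with tempfile.TemporaryDirectory() as tmp:
    for name in ("20-dns.conf", "10-auth.conf", "README.md", "30-logs.conf.bak"):
        (Path(tmp) / name).write_text("")
    selected, skipped = list_dropins(tmp)

print(selected)  # prints ['10-auth.conf', '20-dns.conf']
print(skipped)   # prints ['30-logs.conf.bak', 'README.md']
```

&lt;p>Numeric prefixes such as &lt;code>10-&lt;/code> and &lt;code>20-&lt;/code> are a common way to make the intended processing order explicit.&lt;/p>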
&lt;p>This drop-in directory is purely optional; if it is empty, the base configuration is used and no behavior changes are introduced. The &lt;code>--config-dir&lt;/code> flag (together with the &lt;code>KUBELET_CONFIG_DROPIN_DIR_ALPHA&lt;/code> environment variable during the alpha phase) allows users to specify a drop-in configuration directory for the Kubelet. The flag is empty by default, so drop-in support is not enabled unless explicitly configured. This aligns with the default of the &lt;code>--config&lt;/code> flag.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;p>Base configuration:&lt;/p>
&lt;pre tabindex="0">&lt;code>authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
clusterDNS:
  - 1.2.3.4
  - 1.2.3.5
&lt;/code>&lt;/pre>&lt;p>Drop-in 1:&lt;/p>
&lt;pre tabindex="0">&lt;code>authentication:
  x509:
    clientCAFile: /some/new/location
&lt;/code>&lt;/pre>&lt;p>Drop-in 2:&lt;/p>
&lt;pre tabindex="0">&lt;code>clusterDNS:
  - 1.2.3.6
&lt;/code>&lt;/pre>&lt;p>Final result:&lt;/p>
&lt;pre tabindex="0">&lt;code>authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /some/new/location
clusterDNS:
  - 1.2.3.6
&lt;/code>&lt;/pre>&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a cluster admin, I would like to be able to easily customize the Kubelet configuration for different node types, while still sharing a base configuration. For instance, I would like to have customized system reserved allocations for the control plane and workers.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>As a Kubernetes distribution author, I would like to enable users to customize fields on the Kubelet while leaving a sensible and secure default in an easy way.&lt;/p>
&lt;h4 id="story-3">Story 3&lt;/h4>
&lt;p>As a cluster admin, I would like to have cgroup management and log size management in different files, so I can automate per-node management of those configurations performed via different components without cross-interference.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;ul>
&lt;li>Handling of zeroed fields
&lt;ul>
&lt;li>It’s possible that the Kubelet configuration does not handle unspecified fields well. Special testing will need to be done for different field types to define that behavior and ensure conformance to it.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Handling of lists
&lt;ul>
&lt;li>During the beta phase, we will conduct additional testing to address risks and refine the feature.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>cmd/kubelet/app: 07-17-2023 27.6&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>A &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/kubelet_config_dir_test.go"
target="_blank" rel="noopener">test&lt;/a>
should confirm that the kubelet.conf.d directory is correctly processed, and its contents are accurately reported in the configz endpoint.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;p>Add ability to support drop-in configuration directory.&lt;/p>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;p>Add ability to augment the feature&amp;rsquo;s capabilities with a focus on robustness and testing, which includes:&lt;/p>
&lt;ul>
&lt;li>Ensure the correct kubelet configuration is displayed when queried using the &lt;code>kubectl get --raw &amp;quot;/api/v1/nodes/{node-name}/proxy/configz&amp;quot;&lt;/code> command, particularly verifying the contents of the kubelet.conf.d directory.&lt;/li>
&lt;li>Remove the environment variable &lt;code>KUBELET_CONFIG_DROPIN_DIR_ALPHA&lt;/code>, introduced during the Alpha phase, to streamline the user experience by simplifying configuration management.&lt;/li>
&lt;li>Leave the &lt;code>--config-dir&lt;/code> flag empty by default. Users can configure it by specifying a path, with &lt;code>/etc/kubernetes/kubelet.conf.d&lt;/code> as the recommended directory.&lt;/li>
&lt;li>Add a version compatibility check for drop-in files to ensure alignment with the expected Kubelet configuration API version and catch discrepancies when future versions are introduced.&lt;/li>
&lt;li>Provide official guidance on the Kubernetes website for merging lists and structures in the kubelet configuration file, including documentation for the &lt;code>/configz&lt;/code> endpoint.&lt;/li>
&lt;/ul>
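&lt;p>One possible shape for the version compatibility check listed above is sketched here. &lt;code>kubelet.config.k8s.io/v1beta1&lt;/code> and &lt;code>KubeletConfiguration&lt;/code> are the group/version and kind used by kubelet configuration files today; the helper name &lt;code>check_dropin_header&lt;/code>, the file names, and the choice to collect problems in a list are assumptions made for illustration only.&lt;/p>

```python
EXPECTED_API_VERSION = "kubelet.config.k8s.io/v1beta1"
EXPECTED_KIND = "KubeletConfiguration"


def check_dropin_header(name, document):
    """Return a list of problems with a parsed drop-in's apiVersion and kind."""
    problems = []
    api_version = document.get("apiVersion")
    kind = document.get("kind")
    if api_version != EXPECTED_API_VERSION:
        problems.append(
            f"{name}: apiVersion is {api_version!r}, expected {EXPECTED_API_VERSION!r}"
        )
    if kind != EXPECTED_KIND:
        problems.append(f"{name}: kind is {kind!r}, expected {EXPECTED_KIND!r}")
    return problems


good = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    "clusterDNS": ["1.2.3.6"],
}
stale = {"apiVersion": "kubelet.config.k8s.io/v1alpha1", "kind": "KubeletConfiguration"}

print(check_dropin_header("10-dns.conf", good))        # prints []
print(len(check_dropin_header("20-old.conf", stale)))  # prints 1
```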
&lt;h4 id="ga">GA&lt;/h4>
&lt;p>Collect user feedback and gather information about real-world use cases for this feature.&lt;/p>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Upgrades and downgrades are safe as far as Kubelet stability is concerned. It’s possible a vendor may ship vital pieces of configuration within a drop-in directory. If the Kubelet is downgraded to a version that doesn’t support reading the drop-in directory, it will not recognize the &lt;code>--config-dir&lt;/code> flag and risks failing to start. However, assuming the vendor left the original &lt;code>/etc/kubernetes/kubelet.conf&lt;/code> in a valid state and the flag isn&amp;rsquo;t specified, there should be no risk to the system. Any configuration that exists in a drop-in directory won&amp;rsquo;t be applied, but that would not affect kubelet stability.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>All behavior change is encapsulated within the Kubelet, so there is no version skew possible within core Kubernetes. It is possible third party tools may attempt to utilize the Kubelet’s drop-in directory before the Kubelet is upgraded to support it, which would cause silent failures. It is the responsibility of these third party tools to ensure the Kubelet is new enough to support this.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>[] Feature gate
N/A&lt;/li>
&lt;/ul>
&lt;p>During the alpha phase, in addition to configuring the &lt;code>KUBELET_CONFIG_DROPIN_DIR_ALPHA&lt;/code> environment variable, administrators must explicitly set the &lt;code>--config-dir&lt;/code> flag on the kubelet&amp;rsquo;s command line to enable this feature. Starting from the beta phase, specifying the &lt;code>--config-dir&lt;/code> flag is the only way to enable this feature. The default value for &lt;code>--config-dir&lt;/code> is an empty string, which means the feature is disabled by default.&lt;/p>
&lt;p>The decision to use an environment variable (KUBELET_CONFIG_DROPIN_DIR_ALPHA) over a feature gate was made to avoid potential conflicts in configuration settings. With the current configuration flow, feature gates could lead to unexpected behavior when CLI settings conflict with the kubelet.conf.d directory. The potential issue arises when the CLI initially sets the feature gate to &amp;ldquo;off,&amp;rdquo; but the kubelet configuration specifies it as &amp;ldquo;on.&amp;rdquo; In this scenario, the kubelet would start with the feature gate &amp;ldquo;off,&amp;rdquo; switch it to &amp;ldquo;on&amp;rdquo; during configuration rendering, and then have conflicting settings when reading the kubelet.conf.d directory, leading to unexpected behavior. By using an environment variable during the alpha phase, we provided a simpler and more predictable way to control the drop-in configuration directory for testing. In the beta phase, we are removing this environment variable to streamline configuration management and enhance the user experience.&lt;/p>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. Upgrading to a version of the Kubelet with this feature does not enable drop-in configuration unless the &lt;code>--config-dir&lt;/code> flag is specified.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. To disable the feature, roll back by removing the &lt;code>--config-dir&lt;/code> flag from the kubelet&amp;rsquo;s CLI.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The feature is re-enabled by adding the &lt;code>--config-dir&lt;/code> flag back to the kubelet&amp;rsquo;s CLI.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>A test will be added to assemble a single, functional kubelet configuration object from various individual drop-in config files.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The feature can cause the Kubelet to fail if the configuration in the drop-in directory is invalid. A rollback could fail if the original configuration is also invalid. Either situation would prevent workloads from being scheduled on that node. Neither of these cases is expected.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>The Kubelet not starting, which will cause nodes to be NotReady.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>The feature does not persist any data and so the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path is not special.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Workloads do not directly consume this feature; it is for cluster admins to use during kubelet configuration.
To check if the feature is enabled, users can query the merged configuration. One way to do this is by hitting the configz endpoint using kubectl or a kubelet running in standalone mode.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>In alpha, the user can query their active kubelet configuration to see if their drop-ins have taken effect.
In beta and onwards, the user will be able to read this from logs or the API, to be determined as described above.&lt;/p>
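&lt;p>A check along these lines could be scripted against the body returned by the configz endpoint. The sketch assumes the response is a JSON object whose &lt;code>kubeletconfig&lt;/code> key holds the effective configuration; the helper name &lt;code>unapplied_settings&lt;/code> and the sample values are hypothetical, and only top-level fields are compared.&lt;/p>

```python
import json


def unapplied_settings(configz_body, expected):
    """Return {field: (expected, actual)} for top-level fields that differ."""
    effective = json.loads(configz_body)["kubeletconfig"]
    return {
        field: (want, effective.get(field))
        for field, want in expected.items()
        if effective.get(field) != want
    }


# Invented response body, for illustration only.
body = '{"kubeletconfig": {"clusterDNS": ["1.2.3.6"], "maxPods": 110}}'

print(unapplied_settings(body, {"clusterDNS": ["1.2.3.6"]}))  # prints {}
print(unapplied_settings(body, {"maxPods": 250}))  # prints {'maxPods': (250, 110)}
```

&lt;p>An empty result means every expected drop-in value is present in the configuration the kubelet reports.&lt;/p>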
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>The node bootstrap time should be minimal so kubelet doesn&amp;rsquo;t take too long to reconcile the configuration.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>No noticeable increase in the kubelet startup time.&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No, though there may be changes to the Kubelet configuration required.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No, though metadata on the fields may need to be changed.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>It will take slightly longer for the Kubelet to start, but the difference should not be noticeable unless there is a very large number (hundreds?) of configuration files.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>Likely negligible amounts of CPU.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>Not likely&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>This feature is enabled in Kubelet alone.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>Invalid configuration, including incorrect file permissions or misconfigured settings for the drop-in directory and its files, is a known failure mode, the same as exists today with &lt;code>/etc/kubernetes/kubelet.conf&lt;/code>.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>Fix the invalid configuration, or remove configurations.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2023-05-04: KEP initialized.&lt;/li>
&lt;li>2023-07-17: Alpha is implemented in 1.28&lt;/li>
&lt;li>2023-09-25: KEP retargeted to Alpha in 1.29&lt;/li>
&lt;li>2024-01-19: Added an &lt;a href="https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd&amp;amp;include-filter-by-regex=KubeletConfigDropInDir"
target="_blank" rel="noopener">e2e&lt;/a>
test and set KEP target to Beta in 1.30&lt;/li>
&lt;li>2024-09-30: Update Beta requirements&lt;/li>
&lt;li>2025-10-02: Update to stable&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>Reinstate the now-deprecated Dynamic Kubelet Configuration.&lt;/p>
&lt;p>Continue to rely on CLI flags or systemd drop-in files.&lt;/p></description></item><item><title>Resources: Add support for AdminNetworkPolicy resources</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2091/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2091/</guid><description>
&lt;h1 id="kep-2091-add-support-for-adminnetworkpolicy-resources">KEP-2091: Add support for AdminNetworkPolicy resources&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#adminnetworkpolicy-resource"
>AdminNetworkPolicy resource&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#actions"
>Actions&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#priority"
>Priority&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rule-names"
>Rule Names&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#baselineadminnetworkpolicy-resource"
>BaselineAdminNetworkPolicy resource&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1-deny-traffic-at-a-cluster-level"
>Story 1: Deny traffic at a cluster level&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2-allow-traffic-at-a-cluster-level"
>Story 2: Allow traffic at a cluster level&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3-explicitly-delegate-traffic-to-existing-k8s-network-policy"
>Story 3: Explicitly Delegate traffic to existing K8s Network Policy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-4-create-and-isolate-multiple-tenants-in-a-cluster"
>Story 4: Create and Isolate multiple tenants in a cluster&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-5-cluster-wide-default-guardrails"
>Story 5: Cluster Wide Default Guardrails&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#rbac"
>RBAC&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#key-differences-between-adminnetworkpolicies-and-networkpolicies"
>Key differences between AdminNetworkPolicies and NetworkPolicies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats"
>Notes/Constraints/Caveats&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigation"
>Risks and Mitigation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#future-work"
>Future Work&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#adminnetworkpolicy-api-design"
>AdminNetworkPolicy API Design&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#general-notes-on-the-adminnetworkpolicy-api"
>General Notes on the AdminNetworkPolicy API&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#further-examples-utilizing-the-self-field-for-namespacedpeer-objects"
>Further examples utilizing the self field for &lt;code>NamespacedPeer&lt;/code> objects&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#sample-specs-for-user-stories"
>Sample Specs for User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#sample-spec-for-story-1-deny-traffic-at-a-cluster-level"
>Sample spec for Story 1: Deny traffic at a cluster level&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#sample-spec-for-story-2-allow-traffic-at-a-cluster-level"
>Sample spec for Story 2: Allow traffic at a cluster level&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#sample-spec-for-story-3-explicitly-delegate-traffic-to-existing-k8s-network-policy"
>Sample spec for Story 3: Explicitly Delegate traffic to existing K8s Network Policy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#sample-spec-for-story-4-create-and-isolate-multiple-tenants-in-a-cluster"
>Sample spec for Story 4: Create and Isolate multiple tenants in a cluster&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#sample-spec-for-story-5-cluster-wide-default-guardrails"
>Sample spec for Story 5: Cluster Wide Default Guardrails&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha-to-beta-graduation"
>Alpha to Beta Graduation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta-to-ga-graduation"
>Beta to GA Graduation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade-considerations"
>Upgrade considerations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade-considerations"
>Downgrade considerations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#networkpolicy-v2"
>NetworkPolicy v2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#empower-deny-allow-action-based-crd"
>Empower, Deny, Allow action based CRD&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#clusterdefaultnetworkpolicy-resource"
>ClusterDefaultNetworkPolicy resource&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#single-crd-with-defaultrules-field"
>Single CRD with DefaultRules field&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#single-crd-with-isoverrideable-field"
>Single CRD with IsOverrideable field&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#single-crd-with-baselineallow-as-action"
>Single CRD with BaselineAllow as Action&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Introduce a new set of APIs to express an administrator&amp;rsquo;s intent in securing
their K8s cluster. This doc proposes the AdminNetworkPolicy API to complement
the developer-focused NetworkPolicy API in Kubernetes.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Kubernetes provides the NetworkPolicy resource to control traffic within a
cluster. NetworkPolicy focuses on expressing a developer&amp;rsquo;s intent to secure
their applications. However, it was not intended to be used for cluster scoped
administrative traffic control, which is reflected by its design:&lt;/p>
&lt;ul>
&lt;li>NetworkPolicy uses an &amp;ldquo;implicit isolation&amp;rdquo; model, which means that once a policy
is applied to certain workloads, they are automatically isolated (in the direction
specified by the policy) and anything allowed needs to be explicitly called out.&lt;/li>
&lt;li>It has no concept of explicit &amp;ldquo;deny&amp;rdquo; rules, because the application deployer can
simply refrain from allowing the things they want to deny.&lt;/li>
&lt;li>The commutative nature of NetworkPolicy can make certain filtering intents difficult
to express.&lt;/li>
&lt;/ul>
&lt;p>Thus, in order to satisfy the needs of a cluster admin, we propose to introduce
a new API that captures the administrator&amp;rsquo;s intent.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>The goals for this KEP are to satisfy the following key user stories:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>As a cluster administrator, I want to enforce irrevocable in-cluster guardrails
that all workloads must adhere to in order to guarantee the safety of my clusters.
In particular I want to enforce certain network level access controls that are
cluster scoped and cannot be overridden or bypassed by namespace scoped
NetworkPolicies.&lt;/p>
&lt;p>Example: I would like to explicitly allow all pods in my cluster to reach
kubeDNS.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>As a cluster administrator, I want to have the option to enforce in-cluster network level
access controls that facilitate network multi-tenancy and strict network level
isolation between multiple teams and tenants sharing a cluster via use of namespaces
or groupings of namespaces per tenant.&lt;/p>
&lt;p>Example: I would like to define two tenants in my cluster, one composed of the pods
in &lt;code>foo-ns-1&lt;/code> and &lt;code>foo-ns-2&lt;/code> and the other with pods in &lt;code>bar-ns-1&lt;/code>, where inter-tenant
traffic is denied.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>As a cluster administrator, I want to optionally also deploy an additional default
set of policies to all in-cluster workloads that may be overridden by the developers
if needed.&lt;/p>
&lt;p>Example: I would like to explicitly delegate the restriction of traffic destined
for cluster monitoring pods to the developer, allowing them to set up NetworkPolicies
to deny or allow the traffic from/to their application.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>There are several unique properties that we need to add in order to accomplish the
user stories above.&lt;/p>
&lt;ol>
&lt;li>Deny rules and, therefore, hierarchical enforcement of policy&lt;/li>
&lt;li>Semantics for a cluster-scoped policy object that may include
namespaces/workloads that have not been created yet.&lt;/li>
&lt;li>Interoperability with existing Kubernetes Network Policy API&lt;/li>
&lt;/ol>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>Our mission is to solve the most common use cases that cluster admins have.
That is, we don&amp;rsquo;t want to solve for every possible policy permutation a user
can think of. Instead, we want to design an API that addresses 90-95% of use cases
while keeping the mental model easy to understand and use.
The focus of this KEP is on cluster-scoped controls for east-west traffic within
a cluster, meaning that an AdminNetworkPolicyPeer is &lt;em>always&lt;/em> defined as a set of
in-cluster objects. Cluster-scoped controls for north-south traffic may be addressed via
future versions of the API resources introduced in this or other future KEPs.
For the time being, the AdminNetworkPolicy resource introduced by this KEP will
never affect north-south traffic, and thus also does not override or bypass NetworkPolicies
with ipBlock rules that select external traffic.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>In order to address the three primary use cases a cluster admin has when
securing K8s clusters, we propose to introduce the following resources
under the &lt;code>policy.networking.k8s.io&lt;/code> API group:&lt;/p>
&lt;ul>
&lt;li>AdminNetworkPolicy&lt;/li>
&lt;li>BaselineAdminNetworkPolicy&lt;/li>
&lt;/ul>
&lt;p>The AdminNetworkPolicy(ANP) and BaselineAdminNetworkPolicy(BANP) resources will
help the administrators:&lt;/p>
&lt;ol>
&lt;li>Set strict security rules for the cluster, i.e. a developer CANNOT override
these rules by creating NetworkPolicies that apply to the same workloads as the
AdminNetworkPolicy does.&lt;/li>
&lt;li>Set baseline security rules that describe default connectivity for cluster
workloads, which CAN be overridden by developer NetworkPolicies if needed.&lt;/li>
&lt;/ol>
&lt;h3 id="adminnetworkpolicy-resource">AdminNetworkPolicy resource&lt;/h3>
&lt;p>An AdminNetworkPolicy (ANP) resource will help the administrators set strict security
rules for the cluster, i.e. a developer CANNOT override these rules by creating
NetworkPolicies that apply to the same workloads as the AdminNetworkPolicy does.&lt;/p>
&lt;h4 id="actions">Actions&lt;/h4>
&lt;p>Unlike the NetworkPolicy resource, in which each rule represents allowed
traffic, AdminNetworkPolicy will enable administrators to set &lt;code>Pass&lt;/code>,
&lt;code>Deny&lt;/code> or &lt;code>Allow&lt;/code> as the action of each rule. AdminNetworkPolicy rules should
be read as-is, i.e. there will not be any implicit isolation effects for the Pods
selected by the AdminNetworkPolicy, as opposed to what NetworkPolicy rules imply.&lt;/p>
&lt;ul>
&lt;li>Pass: Traffic that matches a &lt;code>Pass&lt;/code> rule will skip all further rules from all
lower-precedence ANPs, and instead be enforced by the K8s NetworkPolicies.
If there is no K8s NetworkPolicy rule match, and no BaselineAdminNetworkPolicy
rule match (more on this in the &lt;a href="#priority"
>priority section&lt;/a>
), traffic will be
governed by the implementation. For most implementations, this means &amp;ldquo;allow&amp;rdquo;,
but there may be implementations which have their own policies outside of the
standard Kubernetes APIs.&lt;/li>
&lt;li>Deny: Traffic that matches a &lt;code>Deny&lt;/code> rule will be dropped.&lt;/li>
&lt;li>Allow: Traffic that matches an &lt;code>Allow&lt;/code> rule will be allowed.&lt;/li>
&lt;/ul>
&lt;p>AdminNetworkPolicy &lt;code>Pass&lt;/code> rules allow an admin to delegate security posture for
certain traffic to the Namespace owners by overriding any lower-precedence Allow
or Deny rules. For example, intra-tenant traffic management can be delegated to
tenant admins explicitly with the use of &lt;code>Pass&lt;/code> rules.&lt;/p>
&lt;p>AdminNetworkPolicy &lt;code>Deny&lt;/code> rules are useful for administrators to explicitly
block traffic with malicious in-cluster clients, or workloads that pose security
risks. Those traffic restrictions can only be lifted once the &lt;code>Deny&lt;/code> rules are
deleted, modified by the admin, or overridden by a higher priority rule.&lt;/p>
&lt;p>On the other hand, the &lt;code>Allow&lt;/code> rules can be used to call out traffic in the cluster
that needs to be allowed for certain components to work as expected (egress to
CoreDNS for example). Such traffic should not be blocked when developers apply
NetworkPolicies to their Namespaces that isolate the workloads.&lt;/p>
&lt;h4 id="priority">Priority&lt;/h4>
&lt;p>The policy instances will be ordered based on the numeric priority assigned to each
ANP. &lt;code>Priority&lt;/code> is a 32-bit integer value, where a smaller number corresponds to
a higher precedence. The lowest numeric priority value is &amp;ldquo;0&amp;rdquo;, which corresponds
to the highest precedence. Larger numbers have lower precedence.
All ANPs will have higher precedence over NetworkPolicies in the cluster. If
traffic matches both an ANP rule and a NetworkPolicy rule, the only case where the
NetworkPolicy rule will be evaluated is when there is a third higher-precedence
ANP &lt;code>Pass&lt;/code> rule that allows it to bypass any lower-precedence ANP rules.&lt;/p>
&lt;p>The relative precedence of the rules within a single ANP object (all of which
share a priority) will be determined by the order in which the rules are written.
Thus, a rule that appears at the top of the ingress/egress rules would take the
highest precedence.&lt;/p>
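&lt;p>The ordering described above can be illustrated with a minimal sketch. It is not the API or any implementation's code: &lt;code>Rule&lt;/code>, &lt;code>Policy&lt;/code>, &lt;code>Evaluate&lt;/code> and the string used to stand in for traffic are all hypothetical simplifications, and real peer and port matching is reduced to a callback. The sketch sorts policies by ascending numeric priority, walks each policy's rules in written order, and returns the action of the first rule that matches.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// Action mirrors the three ANP rule actions described in this section.
type Action string

const (
	Allow Action = "Allow"
	Deny  Action = "Deny"
	Pass  Action = "Pass"
)

// Rule is a hypothetical, simplified rule. Matches stands in for the real
// peer and port matching logic, which is out of scope for this sketch.
type Rule struct {
	Name    string
	Action  Action
	Matches func(traffic string) bool
}

// Policy is a hypothetical, simplified ANP with only a priority and rules.
type Policy struct {
	Name     string
	Priority int32
	Rules    []Rule
}

// Evaluate returns the action of the first matching rule and true, or an
// empty action and false when no ANP rule matches the traffic.
func Evaluate(policies []Policy, traffic string) (Action, bool) {
	ordered := make([]Policy, len(policies))
	copy(ordered, policies)
	// A smaller numeric priority means higher precedence, so sort ascending.
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[j].Priority > ordered[i].Priority
	})
	for _, p := range ordered {
		// Within one policy, rules are evaluated in the order written.
		for _, r := range p.Rules {
			if r.Matches(traffic) {
				return r.Action, true
			}
		}
	}
	return "", false
}

func main() {
	toDNS := func(t string) bool { return t == "egress-to-dns" }
	matchAll := func(string) bool { return true }
	policies := []Policy{
		{Name: "deny-all", Priority: 50, Rules: []Rule{{Name: "deny", Action: Deny, Matches: matchAll}}},
		{Name: "allow-dns", Priority: 10, Rules: []Rule{{Name: "dns", Action: Allow, Matches: toDNS}}},
	}
	action, matched := Evaluate(policies, "egress-to-dns")
	fmt.Println(action, matched) // prints "Allow true"
}
```

&lt;p>In this example the priority 10 policy is consulted before the priority 50 policy, so DNS egress is allowed even though a broader deny rule also matches it.&lt;/p>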
&lt;p>For alpha, this API defines &amp;ldquo;1000&amp;rdquo; as the maximum numeric value for priority, but
this may be revisited as the proposal advances. For future-safety, clients may assume
that higher values will eventually be allowed, and simply treat it as an int32.
Also for alpha, each ANP is limited to 100 ingress rules and 100 egress rules,
which is subject to change (to a greater number) in the future as well.&lt;/p>
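&lt;p>As an illustration of the alpha limits above, the following sketch checks a priority and the two rule counts against them. &lt;code>ValidateLimits&lt;/code> and the constant names are hypothetical; the KEP does not prescribe where or how this validation is performed, and the limits themselves may be raised later.&lt;/p>

```go
package main

import "fmt"

const (
	// Alpha-stage limits quoted from the text above; both may change.
	maxPriority     = 1000
	maxRulesPerList = 100
)

// ValidateLimits is a hypothetical helper that reports the first alpha limit
// violated by the given priority and rule counts, or nil if none is.
func ValidateLimits(priority int32, ingressRules, egressRules int) error {
	if 0 > priority || priority > maxPriority {
		return fmt.Errorf("priority %d is outside the range 0..%d", priority, maxPriority)
	}
	if ingressRules > maxRulesPerList {
		return fmt.Errorf("%d ingress rules exceed the limit of %d", ingressRules, maxRulesPerList)
	}
	if egressRules > maxRulesPerList {
		return fmt.Errorf("%d egress rules exceed the limit of %d", egressRules, maxRulesPerList)
	}
	return nil
}

func main() {
	fmt.Println(ValidateLimits(10, 3, 2))
	fmt.Println(ValidateLimits(1001, 3, 2))
}
```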
&lt;p>Conflict resolution: Two policies are considered to be conflicting if they are assigned
the same &lt;code>priority&lt;/code> and apply to the same resources or a union of resources. In order
to avoid such conflicts, we propose to include tooling for ANP resources to help alert
the admin to potentially ambiguous ANP priority scenarios, more details in &lt;a href="#risks-and-mitigation"
>risks and mitigation&lt;/a>
.
However, ultimately it will be the job of the network policy implementation to decide
how to handle overlapping priority situations.&lt;/p>
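&lt;p>The kind of check such tooling might perform can be sketched as follows. This is only an assumption about one possible approach: &lt;code>SharedPriorities&lt;/code> is a hypothetical helper that takes policy names mapped to their priorities and reports every priority used by more than one policy. It does not check whether the policies select overlapping resources, so its output marks potentially ambiguous cases rather than confirmed conflicts.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// SharedPriorities is a hypothetical helper. It maps each priority that is
// used by more than one policy to the sorted names of those policies.
func SharedPriorities(priorities map[string]int32) map[int32][]string {
	byPriority := map[int32][]string{}
	for name, p := range priorities {
		byPriority[p] = append(byPriority[p], name)
	}
	shared := map[int32][]string{}
	for p, names := range byPriority {
		if len(names) > 1 {
			sort.Strings(names) // stable output regardless of map order
			shared[p] = names
		}
	}
	return shared
}

func main() {
	shared := SharedPriorities(map[string]int32{
		"tenant-isolation": 20,
		"allow-dns":        10,
		"deny-sensitive":   20,
	})
	fmt.Println(shared)
}
```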
&lt;h4 id="rule-names">Rule Names&lt;/h4>
&lt;p>In order to help future-proof the ANP API, a built-in mechanism to identify each
allow/deny/pass rule is required. Such a mechanism will help administrators organize
and identify individual rules within an AdminNetworkPolicy resource.
We propose to introduce a new string field, called &lt;code>name&lt;/code>, in each &lt;code>AdminNetworkPolicy&lt;/code>
ingress/egress rule. Currently the &lt;code>name&lt;/code> of a rule is optional and is most useful
if it is unique within an ANP instance. The maximum length of the rule name
string is restricted to 100 characters, which provides flexibility for long generated
names.&lt;/p>
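&lt;p>A minimal sketch of checking rule names against these constraints follows. &lt;code>CheckRuleNames&lt;/code> is hypothetical, and two choices in it are assumptions rather than part of the proposal: length is counted in Unicode code points, and a repeated name is reported as a problem even though the text only says uniqueness is most useful, not required.&lt;/p>

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxRuleNameLength is the 100-character limit quoted in the text above.
const maxRuleNameLength = 100

// CheckRuleNames is a hypothetical helper that returns one message for each
// rule name that is too long or repeats an earlier name in the same policy.
func CheckRuleNames(names []string) []string {
	var problems []string
	seen := map[string]bool{}
	for i, name := range names {
		if name == "" {
			continue // the name is optional, so an empty name is skipped
		}
		// Assumption: "characters" is interpreted as Unicode code points.
		if utf8.RuneCountInString(name) > maxRuleNameLength {
			problems = append(problems, fmt.Sprintf("rule %d: name exceeds %d characters", i, maxRuleNameLength))
		}
		if seen[name] {
			problems = append(problems, fmt.Sprintf("rule %d: name %q is not unique", i, name))
		}
		seen[name] = true
	}
	return problems
}

func main() {
	for _, p := range CheckRuleNames([]string{"allow-dns", "", "allow-dns"}) {
		fmt.Println(p)
	}
}
```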
&lt;h3 id="baselineadminnetworkpolicy-resource">BaselineAdminNetworkPolicy resource&lt;/h3>
&lt;p>A BaselineAdminNetworkPolicy (BANP) resource will help the administrators set
baseline security rules that describe default connectivity for cluster workloads,
which CAN be overridden by developer-owned NetworkPolicies if needed.&lt;/p>
&lt;p>The BaselineAdminNetworkPolicy spec will look almost identical to that of an ANP,
except for two important differences:&lt;/p>
&lt;ol>
&lt;li>There is no &lt;code>Priority&lt;/code> field associated with BaselineAdminNetworkPolicy.
Note that in writing a BaselineAdminNetworkPolicy, admins can create different
priorities in rules by placing them before or after one another. However, the
authors of this KEP did not find a valid use case for creating multiple
BaselineAdminNetworkPolicies in a cluster with distinct policy-level priorities.
BANPs are intended for setting cluster default security postures, and in most
cases the subject of such policy should be the entire cluster.&lt;/li>
&lt;li>There is no &lt;code>Pass&lt;/code> action for BaselineAdminNetworkPolicy rules.&lt;/li>
&lt;/ol>
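&lt;p>Taken together, ANPs, NetworkPolicies and the BANP form three tiers. The sketch below is one reading of how the tiers combine, under stated assumptions: &lt;code>Decide&lt;/code> and its parameters are hypothetical, each tier's own matching is assumed to have been computed already, traffic with no ANP match is assumed to fall through in the same way as a &lt;code>Pass&lt;/code>, and the final fallback is assumed to be "allow", which the text says holds for most but not all implementations.&lt;/p>

```go
package main

import "fmt"

// Action is a hypothetical result type for one tier of policy evaluation.
type Action string

const (
	NoMatch Action = ""
	Allow   Action = "Allow"
	Deny    Action = "Deny"
	Pass    Action = "Pass"
)

// Decide combines the three tiers into a final verdict.
//   - anp: result of the highest-precedence matching ANP rule, or NoMatch.
//   - npSelected: whether any NetworkPolicy isolates the Pod in this direction.
//   - npAllows: whether some NetworkPolicy rule allows the traffic.
//   - banp: result of the first matching BANP rule, or NoMatch.
func Decide(anp Action, npSelected, npAllows bool, banp Action) Action {
	// An ANP Allow or Deny is final; Pass and NoMatch fall through.
	switch anp {
	case Allow, Deny:
		return anp
	}
	// NetworkPolicy uses implicit isolation: a selected Pod only receives
	// traffic that some rule explicitly allows.
	if npSelected {
		if npAllows {
			return Allow
		}
		return Deny
	}
	// The BANP is consulted only when no NetworkPolicy applies.
	switch banp {
	case Allow, Deny:
		return banp
	}
	// Assumption: the implementation default is to allow.
	return Allow
}

func main() {
	fmt.Println(Decide(Pass, false, false, Deny)) // prints "Deny"
	fmt.Println(Decide(Allow, true, false, Deny)) // prints "Allow"
}
```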
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;p>Note: This KEP focuses on user stories for East-West (cluster-internal) traffic
and does not address North-South (cluster-external) use cases, which will be
solved in a follow-up proposal.&lt;/p>
&lt;h4 id="story-1-deny-traffic-at-a-cluster-level">Story 1: Deny traffic at a cluster level&lt;/h4>
&lt;p>As a cluster admin, I want to apply non-overridable deny rules
to certain Pod(s) and/or Namespace(s) that isolate the selected
resources from all other cluster-internal traffic.&lt;/p>
&lt;p>For example: In this diagram there is an AdminNetworkPolicy applied to the
&lt;code>sensitive-ns&lt;/code> Namespace, denying ingress from all other in-cluster resources for all
ports and protocols.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-network/2091-admin-network-policy/explicit_deny.png?raw=true" alt="Alt text" title="Explicit Deny">&lt;/p>
&lt;h4 id="story-2-allow-traffic-at-a-cluster-level">Story 2: Allow traffic at a cluster level&lt;/h4>
&lt;p>As a cluster admin, I want to apply non-overridable allow rules to
certain Pod(s) and/or Namespace(s) that enable the selected resources
to communicate with all other cluster-internal entities.&lt;/p>
&lt;p>For example: In this diagram there is an AdminNetworkPolicy applied to every
namespace in the cluster allowing egress traffic to &lt;code>kube-dns&lt;/code> pods, and ingress
traffic from pods in &lt;code>monitoring-ns&lt;/code> for all ports and protocols.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-network/2091-admin-network-policy/explicit_allow.png?raw=true" alt="Alt text" title="Explicit Allow">&lt;/p>
&lt;h4 id="story-3-explicitly-delegate-traffic-to-existing-k8s-network-policy">Story 3: Explicitly Delegate traffic to existing K8s Network Policy&lt;/h4>
&lt;p>As a cluster admin, I want to explicitly delegate traffic so that it
skips any remaining cluster network policies and is handled by standard
namespace scoped network policies.&lt;/p>
&lt;p>For example: In the diagram below, egress traffic destined for the service svc-pub
in namespace bar-ns-1 on TCP port 8080 is delegated to the K8s NetworkPolicies
implemented in foo-ns-1 and foo-ns-2. If no K8s NetworkPolicies touch the
delegated traffic, the traffic will be allowed.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-network/2091-admin-network-policy/delegation.png?raw=true" alt="Alt text" title="Delegate">&lt;/p>
&lt;h4 id="story-4-create-and-isolate-multiple-tenants-in-a-cluster">Story 4: Create and Isolate multiple tenants in a cluster&lt;/h4>
&lt;p>As a cluster admin, I want to build tenants in my cluster that are isolated from
each other by default. Tenancy may be modeled as 1:1, where 1 tenant is mapped
to a single Namespace, or 1:n, where a single tenant may own more than 1 Namespace.&lt;/p>
&lt;p>For Example: In the diagram below two tenants (Foo and Bar) are defined such that
all ingress traffic is denied to either tenant.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-network/2091-admin-network-policy/tenants.png?raw=true" alt="Alt text" title="Tenants">&lt;/p>
&lt;h4 id="story-5-cluster-wide-default-guardrails">Story 5: Cluster Wide Default Guardrails&lt;/h4>
&lt;p>As a cluster admin, I want to change the default security model for my cluster,
so that all intra-cluster traffic (except for certain essential traffic) is
blocked by default. Namespace owners will need to use NetworkPolicies to
explicitly allow known traffic. This follows a whitelist model which is
familiar to many security administrators, and similar
to how &lt;a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/#default-policies"
target="_blank" rel="noopener">Kubernetes suggests network policy be used&lt;/a>
.&lt;/p>
&lt;p>For Example: In the following diagram all Ingress traffic to every cluster
resource is denied by a baseline deny rule.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-network/2091-admin-network-policy/baseline.png?raw=true" alt="Alt text" title="Default Rules">&lt;/p>
&lt;h3 id="rbac">RBAC&lt;/h3>
&lt;p>AdminNetworkPolicy resources are meant for cluster administrators.
Thus, access to manage these resources must be granted to subjects which have
the authority to outline the security policies for the cluster. Therefore, by
default, the &lt;code>cluster-admin&lt;/code> ClusterRole will be granted the permissions
to edit the AdminNetworkPolicy resources.&lt;/p>
&lt;h3 id="key-differences-between-adminnetworkpolicies-and-networkpolicies">Key differences between AdminNetworkPolicies and NetworkPolicies&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>AdminNetworkPolicy&lt;/th>
&lt;th>K8s NetworkPolicies&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Target persona&lt;/td>
&lt;td>Cluster administrator or equivalent&lt;/td>
&lt;td>Developers within Namespaces&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Scope&lt;/td>
&lt;td>Cluster&lt;/td>
&lt;td>Namespaced&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Drop traffic&lt;/td>
&lt;td>Supported with a &lt;code>Deny&lt;/code> rule action&lt;/td>
&lt;td>Supported via implicit isolation of target Pods&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skip enforcement&lt;/td>
&lt;td>Supported with a &lt;code>Pass&lt;/code> rule action&lt;/td>
&lt;td>Not needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Allow traffic&lt;/td>
&lt;td>Supported with an &lt;code>Allow&lt;/code> rule action&lt;/td>
&lt;td>Default action for all rules is to allow&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Implicit isolation&lt;/td>
&lt;td>No implicit isolation&lt;/td>
&lt;td>All rules have an implicit isolation of target Pods&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Rule precedence&lt;/td>
&lt;td>Depends on the order in which they appear within an ANP&lt;/td>
&lt;td>Rules are additive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Policy precedence&lt;/td>
&lt;td>Depends on the &lt;code>priority&lt;/code> field among ANPs. All ANPs are enforced before K8s NetworkPolicies&lt;/td>
&lt;td>Enforced after AdminNetworkPolicies and before the BaselineAdminNetworkPolicy&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Matching pod selection&lt;/td>
&lt;td>Can apply different rules to multiple groups of Pods&lt;/td>
&lt;td>Applies rules to a single group of Pods&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Rule identifiers&lt;/td>
&lt;td>Name per rule in string format. Unique within an ANP&lt;/td>
&lt;td>Not supported&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Cluster external traffic&lt;/td>
&lt;td>Not supported&lt;/td>
&lt;td>Partially supported via IPBlock&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Namespace selectors&lt;/td>
&lt;td>Supports advanced selection of Namespaces with the use of &lt;code>namespaceSet&lt;/code>&lt;/td>
&lt;td>Supports label based Namespace selection with the use of &lt;code>namespaceSelector&lt;/code> field&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Note that AdminNetworkPolicy can also apply to Pods in Namespaces that don&amp;rsquo;t
exist yet, and will automatically apply to a new Namespace as long as the new
Namespace&amp;rsquo;s labels match the AdminNetworkPolicy rule&amp;rsquo;s appliedTo selection
criteria. NetworkPolicies, on the contrary, only apply to Pods in the Namespace
they are created in.&lt;/p>
&lt;h3 id="notesconstraintscaveats">Notes/Constraints/Caveats&lt;/h3>
&lt;p>It is important to note that the controller implementation for cluster-scoped
policy APIs will not be provided as part of this KEP. Such controllers which
realize the intent of these APIs will be provided by individual network policy
providers, as is the case with the NetworkPolicy API.&lt;/p>
&lt;h3 id="risks-and-mitigation">Risks and Mitigation&lt;/h3>
&lt;p>To understand why traffic between a pair of Pods is allowed or denied, a list of
NetworkPolicy resources in both Pods&amp;rsquo; Namespaces used to be sufficient (assuming
no other CRDs in the cluster try to alter traffic behavior). With the introduction
of AdminNetworkPolicy this is no longer the case, and users could face difficulty
in determining why NetworkPolicies did not take effect.&lt;/p>
&lt;p>For example, in the case where a positive priority (non-zero) AdminNetworkPolicy rule,
NetworkPolicy rule and &amp;ldquo;0&amp;rdquo; priority AdminNetworkPolicy rule apply to an overlapping
set of Pods, users will need to refer to the priority associated with the
rule to determine which rule would take effect. Figuring out how stacked policies
affect traffic between workloads might not be very straightforward.&lt;/p>
&lt;p>To mitigate this risk and improve usability, some additional in-tree tooling
for both the Admin and Developer will need to be created. For the Admin, it is
safe to assume they will have the correct RBAC roles to list all the NetworkPolicies
and AdminNetworkPolicies in a cluster. Therefore, the Admin oriented tooling should
be able to both alert the Admin to any overriding of NetworkPolicies that may occur if a
new AdminNetworkPolicy is to be created and provide a warning if there is another ANP
with the same priority. For the Developer, who usually can only list the NetworkPolicies
in a given namespace, the tooling should simply alert if a given NetworkPolicy would
be overridden by any of the ANPs in a cluster. The aforementioned tooling will not
be a primary development goal during the alpha version of this API, and will most
likely be completed during the beta development cycle.&lt;/p>
&lt;h3 id="future-work">Future Work&lt;/h3>
&lt;p>Although the scope of the AdminNetworkPolicies is extensive, the above proposal
intends to only solve the documented use cases. However, we would
also like to consider the following set of proposals as future work items:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Audit Logging&lt;/strong>: Very often cluster administrators want to log every connection
that is either denied or allowed by a firewall rule and send the details to
an IDS or any custom tool for further processing of that information.
With the introduction of &lt;code>deny&lt;/code> rules, it may make sense to incorporate the
cluster-scoped policy resources with a new field, say &lt;code>auditPolicy&lt;/code>, to
determine whether a connection matching a particular rule/policy must be
logged or not.&lt;/li>
&lt;/ul>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="adminnetworkpolicy-api-design">AdminNetworkPolicy API Design&lt;/h3>
&lt;p>The following new &lt;code>AdminNetworkPolicy&lt;/code> API will be added to the &lt;code>policy.networking.k8s.io&lt;/code>
API group:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicy describes cluster-level network traffic control rules&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicy &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> metav1.TypeMeta &lt;span style="color:#b44">`json:&amp;#34;,inline&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> metav1.ObjectMeta &lt;span style="color:#b44">`json:&amp;#34;metadata&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Specification of the desired behavior of AdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Spec AdminNetworkPolicySpec &lt;span style="color:#b44">`json:&amp;#34;spec&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Status is the status to be reported by the implementation, this is not&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// standardized in alpha and consumers should report what they see fit in&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// relation to their AdminNetworkPolicy implementation.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Status AdminNetworkPolicyStatus &lt;span style="color:#b44">`json:&amp;#34;status,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Conditions []metav1.Condition &lt;span style="color:#b44">`json:&amp;#34;conditions&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicySpec defines the desired state of AdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicySpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Priority is a value from 0 to 1000. Rules with lower priority values have&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// higher precedence, and are checked before rules with higher priority values.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// All AdminNetworkPolicy rules have higher precedence than NetworkPolicy or&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicy rules.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The relative precedence of the rules within a single ANP object (all of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// which share the priority) will be determined by the order in which the rule&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// is written. Thus, a rule that appears at the top of the ingress/egress rules&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// would take the highest precedence. If ingress rules are defined before egress&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// rules in the same ANP object then ingress would take precedence and vice versa.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The behavior is undefined if two ANP objects have the same priority.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:Minimum=0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:Maximum=1000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Priority &lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;priority&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Subject defines the pods to which this AdminNetworkPolicy applies.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Subject AdminNetworkPolicySubject &lt;span style="color:#b44">`json:&amp;#34;subject&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Ingress is the list of Ingress rules to be applied to the selected pods.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// A total of 100 rules will be allowed in each ANP instance. ANPs with no&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ingress rules do not affect ingress traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ingress []AdminNetworkPolicyIngressRule &lt;span style="color:#b44">`json:&amp;#34;ingress,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Egress is the list of Egress rules to be applied to the selected pods.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// A total of 100 rules will be allowed in each ANP instance. ANPs with no&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// egress rules do not affect egress traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Egress []AdminNetworkPolicyEgressRule &lt;span style="color:#b44">`json:&amp;#34;egress,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyIngressRule describes an action to take on a particular&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// set of traffic destined for pods selected by an AdminNetworkPolicy&amp;#39;s&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Subject field.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyIngressRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name is an identifier for this rule, that may be no more than 100 characters&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// in length. This field should be used by the implementation to help&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// improve observability, readability and error-reporting for any applied&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AdminNetworkPolicies.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxLength=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;name,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Action specifies the effect this rule will have on matching traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Currently the following actions are supported:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Allow: allows the selected traffic (even if it would otherwise have been denied by NetworkPolicy)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Deny: denies the selected traffic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Pass: instructs the selected traffic to skip any remaining ANP rules, and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// then pass execution to any NetworkPolicies that select the pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If the pod is not selected by any NetworkPolicies then execution&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// is passed to any BaselineAdminNetworkPolicies that select the pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Action AdminNetworkPolicyRuleAction &lt;span style="color:#b44">`json:&amp;#34;action&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// From is the list of sources whose traffic this rule applies to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If any AdminNetworkPolicyPeer matches the source of incoming&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// traffic then the specified action is applied.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field must be defined and contain at least one item.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinItems=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> From []AdminNetworkPolicyPeer &lt;span style="color:#b44">`json:&amp;#34;from&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Ports allows for matching traffic based on port and protocols.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If Ports is not set then the rule does not filter traffic via port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ports &lt;span style="color:#666">*&lt;/span>[]AdminNetworkPolicyPort &lt;span style="color:#b44">`json:&amp;#34;ports,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyEgressRule describes an action to take on a particular&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// set of traffic originating from pods selected by an AdminNetworkPolicy&amp;#39;s&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Subject field.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyEgressRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name is an identifier for this rule, that may be no more than 100 characters&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// in length. This field should be used by the implementation to help&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// improve observability, readability and error-reporting for any applied&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AdminNetworkPolicies.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxLength=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;name,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Action specifies the effect this rule will have on matching traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Currently the following actions are supported:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Allow: allows the selected traffic (even if it would otherwise have been denied by NetworkPolicy)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Deny: denies the selected traffic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Pass: instructs the selected traffic to skip any remaining ANP rules, and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// then pass execution to any NetworkPolicies that select the pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If the pod is not selected by any NetworkPolicies then execution&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// is passed to any BaselineAdminNetworkPolicies that select the pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Action AdminNetworkPolicyRuleAction &lt;span style="color:#b44">`json:&amp;#34;action&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// To is the list of destinations whose traffic this rule applies to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If any AdminNetworkPolicyPeer matches the destination of outgoing&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// traffic then the specified action is applied.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field must be defined and contain at least one item.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinItems=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> To []AdminNetworkPolicyPeer &lt;span style="color:#b44">`json:&amp;#34;to&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Ports allows for matching traffic based on port and protocols.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If Ports is not set then the rule does not filter traffic via port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ports &lt;span style="color:#666">*&lt;/span>[]AdminNetworkPolicyPort &lt;span style="color:#b44">`json:&amp;#34;ports,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyRuleActionAllow indicates that matching traffic will be&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// allowed regardless of NetworkPolicy and BaselineAdminNetworkPolicy&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// rules. Users cannot block traffic which has been matched by an &amp;#34;Allow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// rule in an AdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AdminNetworkPolicyRuleActionAllow AdminNetworkPolicyRuleAction = &lt;span style="color:#b44">&amp;#34;Allow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyRuleActionDeny indicates that matching traffic will be&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// denied before being checked against NetworkPolicy or&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicy rules. Pods will never receive traffic which&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// has been matched by a &amp;#34;Deny&amp;#34; rule in an AdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AdminNetworkPolicyRuleActionDeny AdminNetworkPolicyRuleAction = &lt;span style="color:#b44">&amp;#34;Deny&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyRuleActionPass indicates that matching traffic will&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// bypass further AdminNetworkPolicy processing (ignoring rules with lower&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// precedence) and be allowed or denied based on NetworkPolicy and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicy rules.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AdminNetworkPolicyRuleActionPass AdminNetworkPolicyRuleAction = &lt;span style="color:#b44">&amp;#34;Pass&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The following new &lt;code>BaselineAdminNetworkPolicy&lt;/code> API will also be added to the &lt;code>policy.networking.k8s.io&lt;/code>
API group:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BaselineAdminNetworkPolicy &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> metav1.TypeMeta &lt;span style="color:#b44">`json:&amp;#34;,inline&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> metav1.ObjectMeta &lt;span style="color:#b44">`json:&amp;#34;metadata&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Specification of the desired behavior of BaselineAdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Spec BaselineAdminNetworkPolicySpec &lt;span style="color:#b44">`json:&amp;#34;spec&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Status is the status to be reported by the implementation.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Status BaselineAdminNetworkPolicyStatus &lt;span style="color:#b44">`json:&amp;#34;status,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicyStatus defines the observed state of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BaselineAdminNetworkPolicyStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Conditions []metav1.Condition &lt;span style="color:#b44">`json:&amp;#34;conditions&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicySpec defines the desired state of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BaselineAdminNetworkPolicySpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Subject defines the pods to which this BaselineAdminNetworkPolicy applies.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Subject AdminNetworkPolicySubject &lt;span style="color:#b44">`json:&amp;#34;subject&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Ingress is the list of Ingress rules to be applied to the selected pods&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// if they are not matched by any AdminNetworkPolicy or NetworkPolicy rules.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// A total of 100 Ingress rules will be allowed in each BANP instance.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BANPs with no ingress rules do not affect ingress traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ingress []BaselineAdminNetworkPolicyIngressRule &lt;span style="color:#b44">`json:&amp;#34;ingress,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Egress is the list of Egress rules to be applied to the selected pods if&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// they are not matched by any AdminNetworkPolicy or NetworkPolicy rules.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// A total of 100 Egress rules will be allowed in each BANP instance. BANPs&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// with no egress rules do not affect egress traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Egress []BaselineAdminNetworkPolicyEgressRule &lt;span style="color:#b44">`json:&amp;#34;egress,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicyIngressRule describes an action to take on a particular&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// set of traffic destined for pods selected by a BaselineAdminNetworkPolicy&amp;#39;s&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Subject field.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BaselineAdminNetworkPolicyIngressRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name is an identifier for this rule, that may be no more than 100 characters&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// in length. This field should be used by the implementation to help&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// improve observability, readability and error-reporting for any applied&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicies.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxLength=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;name,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Action specifies the effect this rule will have on matching traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Currently the following actions are supported:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Allow: allows the selected traffic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Deny: denies the selected traffic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Action BaselineAdminNetworkPolicyRuleAction &lt;span style="color:#b44">`json:&amp;#34;action&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// From is the list of sources whose traffic this rule applies to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If any AdminNetworkPolicyPeer matches the source of incoming&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// traffic then the specified action is applied.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field must be defined and contain at least one item.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinItems=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> From []AdminNetworkPolicyPeer &lt;span style="color:#b44">`json:&amp;#34;from&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Ports allows for matching traffic based on port and protocols.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If Ports is not set then the rule does not filter traffic via port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ports &lt;span style="color:#666">*&lt;/span>[]AdminNetworkPolicyPort &lt;span style="color:#b44">`json:&amp;#34;ports,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicyEgressRule describes an action to take on a particular&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// set of traffic originating from pods selected by a BaselineAdminNetworkPolicy&amp;#39;s&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Subject field.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BaselineAdminNetworkPolicyEgressRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name is an identifier for this rule, that may be no more than 100 characters&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// in length. This field should be used by the implementation to help&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// improve observability, readability and error-reporting for any applied&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicies.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxLength=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;name,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Action specifies the effect this rule will have on matching traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Currently the following actions are supported:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Allow: allows the selected traffic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Deny: denies the selected traffic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Action BaselineAdminNetworkPolicyRuleAction &lt;span style="color:#b44">`json:&amp;#34;action&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// To is the list of destinations whose traffic this rule applies to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If any AdminNetworkPolicyPeer matches the destination of outgoing&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// traffic then the specified action is applied.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field must be defined and contain at least one item.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinItems=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> To []AdminNetworkPolicyPeer &lt;span style="color:#b44">`json:&amp;#34;to&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Ports allows for matching traffic based on port and protocols.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If Ports is not set then the rule does not filter traffic via port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxItems=100&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ports &lt;span style="color:#666">*&lt;/span>[]AdminNetworkPolicyPort &lt;span style="color:#b44">`json:&amp;#34;ports,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicyRuleAction string describes the BaselineAdminNetworkPolicy&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// action type.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +enum&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BaselineAdminNetworkPolicyRuleAction &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicyRuleActionDeny enables admins to deny traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> BaselineAdminNetworkPolicyRuleActionDeny BaselineAdminNetworkPolicyRuleAction = &lt;span style="color:#b44">&amp;#34;Deny&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BaselineAdminNetworkPolicyRuleActionAllow enables admins to allow certain traffic.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> BaselineAdminNetworkPolicyRuleActionAllow BaselineAdminNetworkPolicyRuleAction = &lt;span style="color:#b44">&amp;#34;Allow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The following types are common to the &lt;code>AdminNetworkPolicy&lt;/code> and &lt;code>BaselineAdminNetworkPolicy&lt;/code>
resources:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicySubject defines what resources the policy applies to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Exactly one field must be set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxProperties=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinProperties=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicySubject &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Namespaces is used to select pods via namespace selectors.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Namespaces &lt;span style="color:#666">*&lt;/span>metav1.LabelSelector &lt;span style="color:#b44">`json:&amp;#34;namespaces,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Pods is used to select pods via namespace AND pod selectors.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Pods &lt;span style="color:#666">*&lt;/span>NamespacedPodSubject &lt;span style="color:#b44">`json:&amp;#34;pods,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// NamespacedPodSubject allows the user to select a given set of pod(s) in&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// selected namespace(s)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> NamespacedPodSubject &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field follows standard label selector semantics; if empty,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// it selects all Namespaces. &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> NamespaceSelector metav1.LabelSelector &lt;span style="color:#b44">`json:&amp;#34;namespaceSelector&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Used to explicitly select pods within a namespace; if empty,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// it selects all Pods. &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PodSelector metav1.LabelSelector &lt;span style="color:#b44">`json:&amp;#34;podSelector&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyPort describes how to select network ports on pod(s).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Exactly one field must be set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxProperties=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinProperties=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyPort &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// PortNumber selects a port on pod(s) based on number.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PortNumber &lt;span style="color:#666">*&lt;/span>Port &lt;span style="color:#b44">`json:&amp;#34;portNumber,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// NamedPort selects a port on a pod(s) based on name.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> NamedPort &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;namedPort,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// PortRange selects a port range on a pod(s) based on provided start and end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// values.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PortRange &lt;span style="color:#666">*&lt;/span>PortRange &lt;span style="color:#b44">`json:&amp;#34;portRange,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
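&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Port selects a single network port by protocol and number.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> Port &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {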
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Protocol is the network protocol (TCP, UDP, or SCTP) which traffic must&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// match. If not specified, this field defaults to TCP.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Protocol v1.Protocol &lt;span style="color:#b44">`json:&amp;#34;protocol&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Port defines a network port value.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Port &lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;port&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// PortRange defines an inclusive range of ports from the assigned Start value&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// to End value.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> PortRange &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Protocol is the network protocol (TCP, UDP, or SCTP) which traffic must&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// match. If not specified, this field defaults to TCP.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Protocol v1.Protocol &lt;span style="color:#b44">`json:&amp;#34;protocol,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Start defines a network port that is the start of a port range, the Start&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// value must be less than End.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Start &lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;start&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// End defines a network port that is the end of a port range, the End value&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// must be greater than Start.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> End &lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;end&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// AdminNetworkPolicyPeer defines an in-cluster peer to allow traffic to/from.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Exactly one of the selector pointers should be set for a given peer.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyPeer &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Namespaces defines a way to select a set of Namespaces.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Namespaces &lt;span style="color:#666">*&lt;/span>NamespacedPeer &lt;span style="color:#b44">`json:&amp;#34;namespaces,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Pods defines a way to select a set of pods in&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// a set of namespaces.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Pods &lt;span style="color:#666">*&lt;/span>NamespacedPodPeer &lt;span style="color:#b44">`json:&amp;#34;pods,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> NamespaceRelation &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> NamespaceSelf NamespaceRelation = &lt;span style="color:#b44">&amp;#34;Self&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> NamespaceNotSelf NamespaceRelation = &lt;span style="color:#b44">&amp;#34;NotSelf&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// NamespacedPeer defines a flexible way to select Namespaces in a cluster.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Exactly one of the selectors must be set. If a consumer observes none of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// its fields are set, they must assume an unknown option has been specified&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// and fail closed.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MaxProperties=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +kubebuilder:validation:MinProperties=1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> NamespacedPeer &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Related provides a mechanism for selecting namespaces relative to the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// subject pod. A value of &amp;#34;Self&amp;#34; matches the subject pod&amp;#39;s namespace,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// while a value of &amp;#34;NotSelf&amp;#34; matches namespaces other than the subject&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// pod&amp;#39;s namespace.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Related &lt;span style="color:#666">*&lt;/span>NamespaceRelation &lt;span style="color:#b44">`json:&amp;#34;related,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// NamespaceSelector is a labelSelector used to select Namespaces. This field&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// follows standard label selector semantics; if present but empty, it selects&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// all Namespaces.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> NamespaceSelector &lt;span style="color:#666">*&lt;/span>metav1.LabelSelector &lt;span style="color:#b44">`json:&amp;#34;namespaceSelector,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// SameLabels is used to select a set of Namespaces that share the same values&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// for a set of labels.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// To be selected a Namespace must have all of the labels defined in SameLabels,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// and they must all have the same value as the subject of this policy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If SameLabels is empty then nothing is selected.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> SameLabels []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;sameLabels,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// NotSameLabels is used to select a set of Namespaces that do not have a set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// of label(s). To be selected a Namespace must have none of the labels defined&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// in NotSameLabels. If NotSameLabels is empty then nothing is selected.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> NotSameLabels []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;notSameLabels,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// NamespacedPodPeer defines a flexible way to select Namespaces and pods in a&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// cluster. The `Namespaces` and `PodSelector` fields are required.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> NamespacedPodPeer &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Namespaces is used to select a set of Namespaces.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Namespaces NamespacedPeer &lt;span style="color:#b44">`json:&amp;#34;namespaces&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// PodSelector is a labelSelector used to select Pods. This field is NOT optional and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// follows standard label selector semantics; if present but empty, it selects&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// all Pods.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PodSelector metav1.LabelSelector &lt;span style="color:#b44">`json:&amp;#34;podSelector&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="general-notes-on-the-adminnetworkpolicy-api">General Notes on the AdminNetworkPolicy API&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Much of the proposed behavior is intentionally not aligned with the
Kubernetes NetworkPolicy resource, especially with regard to the behavior of empty fields.
Specifically, this API is designed to be verbose and explicit. Please pay attention
to the comments above each field for more information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In each AdminNetworkPolicy ingress/egress rule, the &lt;code>Action&lt;/code> field dictates whether
traffic from/to the matching AdminNetworkPolicyPeer should be allowed, denied, or passed. This will be a required field.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The &lt;code>AdminNetworkPolicySubject&lt;/code> and &lt;code>AdminNetworkPolicyPeer&lt;/code> types are explicitly
designed to allow for future extensibility, with a focus on the addition of new types
of selectors. Specifically, the design allows an implementation to fail closed when it
does not implement a defined selector. For example, if a new type (&lt;code>ServiceAccounts&lt;/code>)
were added to the &lt;code>AdminNetworkPolicyPeer&lt;/code> struct, and an implementation had not
yet implemented support for that selector, an ANP using the new selector would
have no effect, since the implementation would simply see an empty &lt;code>AdminNetworkPolicyPeer&lt;/code>
object.&lt;/p>
&lt;/li>
&lt;/ul>
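&lt;p>The fail-closed note above can be illustrated with a minimal Go sketch. It assumes an
implementation that decodes peers with &lt;code>encoding/json&lt;/code>, which drops unknown fields by default.
The &lt;code>Selector&lt;/code> and &lt;code>Peer&lt;/code> types and the &lt;code>decodePeer&lt;/code> and &lt;code>selectsNothing&lt;/code> helpers are
hypothetical simplifications for illustration and are not part of the proposed API.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Selector is a hypothetical, simplified stand-in for the selector types
// in this KEP; it is not the real API type.
type Selector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

// Peer mirrors the shape of AdminNetworkPolicyPeer as an older
// implementation would know it: only the namespaces and pods selectors.
type Peer struct {
	Namespaces *Selector `json:"namespaces,omitempty"`
	Pods       *Selector `json:"pods,omitempty"`
}

// decodePeer parses a peer the way encoding/json does by default:
// fields the struct does not declare are silently dropped.
func decodePeer(raw string) (*Peer, error) {
	p := new(Peer)
	if err := json.Unmarshal([]byte(raw), p); err != nil {
		return nil, err
	}
	return p, nil
}

// selectsNothing reports whether no known selector is set. A consumer that
// fails closed treats such a peer as matching no traffic.
func selectsNothing(p *Peer) bool {
	if p.Namespaces != nil {
		return false
	}
	return p.Pods == nil
}

func main() {
	inputs := []string{
		`{"namespaces": {}}`, // a selector this implementation knows
		`{"serviceAccounts": {"matchLabels": {"team": "a"}}}`, // assumed future selector
	}
	for _, raw := range inputs {
		p, err := decodePeer(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s selectsNothing=%v\n", raw, selectsNothing(p))
	}
}
```

&lt;p>In this sketch the peer that only carries the unrecognized &lt;code>serviceAccounts&lt;/code> key decodes to an
empty struct, so a rule built from it matches no traffic instead of matching everything.&lt;/p>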
&lt;h4 id="further-examples-utilizing-the-self-field-for-namespacedpeer-objects">Further examples utilizing the self field for &lt;code>NamespacedPeer&lt;/code> objects&lt;/h4>
&lt;p>&lt;strong>Self:&lt;/strong>
This is a special strategy indicating that the rule applies only to the Namespace of
the Pod for which the ingress/egress rule is currently being evaluated. Since the Pods
selected by the AdminNetworkPolicy &lt;code>subject&lt;/code> could be from multiple Namespaces,
the scope of ingress/egress rules with &lt;code>namespaces.related=self&lt;/code> will be the Pod&amp;rsquo;s
own Namespace for each selected Pod.
Consider the following example:&lt;/p>
&lt;ul>
&lt;li>Pods [a1, b1], with labels &lt;code>app=a&lt;/code> and &lt;code>app=b&lt;/code> respectively, exist in Namespace x.&lt;/li>
&lt;li>Pods [a2, b2], with labels &lt;code>app=a&lt;/code> and &lt;code>app=b&lt;/code> respectively, exist in Namespace y.&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Allow&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">pods&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">related&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>self&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">podSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">app&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>b&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above AdminNetworkPolicy should be interpreted as: for each Namespace in
the cluster, all Pods in that Namespace strictly allow traffic, on all ports, from Pods in
the &lt;em>same Namespace&lt;/em> that have the label app=b. Hence, the policy above allows
x/b1 -&amp;gt; x/a1 and y/b2 -&amp;gt; y/a2, but does not allow y/b2 -&amp;gt; x/a1 or x/b1 -&amp;gt; y/a2.&lt;/p>
&lt;p>&lt;strong>SameLabels:&lt;/strong>
This is a special strategy to indicate that the rule applies only to Namespaces
that share the same label value. Since the Pods selected by the AdminNetworkPolicy &lt;code>subject&lt;/code>
could be from multiple Namespaces, the scope of ingress/egress rules whose &lt;code>namespaces.sameLabels=tenant&lt;/code>
is all the Pods in the Namespaces that have the same label value for the &amp;ldquo;tenant&amp;rdquo; key.
Consider the following example:&lt;/p>
&lt;ul>
&lt;li>Pods [a1, b1] exist in Namespace t1-ns1, which has label &lt;code>tenant=t1&lt;/code>.&lt;/li>
&lt;li>Pods [a2, b2] exist in Namespace t1-ns2, which has label &lt;code>tenant=t1&lt;/code>.&lt;/li>
&lt;li>Pods [a3, b3] exist in Namespace t2-ns1, which has label &lt;code>tenant=t2&lt;/code>.&lt;/li>
&lt;li>Pods [a4, b4] exist in Namespace t2-ns2, which has label &lt;code>tenant=t2&lt;/code>.&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">20&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchExpressions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[{&lt;span style="color:#008000;font-weight:bold">key&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;tenant&amp;#34;&lt;/span>,&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">operator&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Exists}]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Pass&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">sameLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- tenant&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above AdminNetworkPolicy should be interpreted as: for each Namespace in
the cluster that has the label key &amp;ldquo;tenant&amp;rdquo; set, traffic to all Pods in that Namespace
from all Pods in Namespaces that have the same value for the label key &lt;code>tenant&lt;/code> is delegated to the Namespace
admins, i.e. such traffic is not subject to any lower-precedence ANP rules and is instead evaluated by K8s NetworkPolicies.
Hence, the policy above delegates traffic between all Pods in Namespaces labeled &lt;code>tenant=t1&lt;/code>, i.e. t1-ns1 and t1-ns2,
to K8s NetworkPolicies. Similarly, traffic between all Pods in Namespaces labeled &lt;code>tenant=t2&lt;/code>,
i.e. t2-ns1 and t2-ns2, is delegated to K8s NetworkPolicies as well. However, the policy does not
delegate traffic from any Pod in t1-ns1 or t1-ns2 to Pods in t2-ns1 or t2-ns2; such traffic is still subject
to ANP rules.&lt;/p>
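&lt;p>As a concrete reference for the example above, the Namespaces could be declared as in the following sketch. It assumes nothing beyond the names and &lt;code>tenant&lt;/code> labels listed earlier; t1-ns2 and t2-ns2 would be declared the same way under their own names.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Namespace&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>t1-ns1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">labels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">tenant&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>t1&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># same value as t1-ns2&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>---&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Namespace&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>t2-ns1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">labels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">tenant&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>t2&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># different value from the t1 Namespaces&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>With these labels, a &lt;code>sameLabels&lt;/code> rule keyed on &lt;code>tenant&lt;/code> puts t1-ns1 in scope for t1-ns2, but not for t2-ns1.&lt;/p>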
&lt;h3 id="sample-specs-for-user-stories">Sample Specs for User Stories&lt;/h3>
&lt;h4 id="sample-spec-for-story-1-deny-traffic-at-a-cluster-level">Sample spec for Story 1: Deny traffic at a cluster level&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>cluster-wide-deny-example&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">kubernetes.io/metadata.name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>sensitive-ns&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Deny&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="sample-spec-for-story-2-allow-traffic-at-a-cluster-level">Sample spec for Story 2: Allow traffic at a cluster level&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>cluster-wide-allow-example&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">30&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Allow&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">kubernetes.io/metadata.name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>monitoring-ns&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">egress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Allow&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">to&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">pods&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">kubernetes.io/metadata.name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kube-system&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">podSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">app&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kube-dns&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="sample-spec-for-story-3-explicitly-delegate-traffic-to-existing-k8s-network-policy">Sample spec for Story 3: Explicitly Delegate traffic to existing K8s Network Policy&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>pub-svc-delegate-example&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">20&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">egress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Pass&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">to&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">pods&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">kubernetes.io/metadata.name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>bar-ns-1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">podSelector&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">app&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>svc-pub&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ports&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">port&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">protocol&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>TCP&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">number&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">8080&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="sample-spec-for-story-4-create-and-isolate-multiple-tenants-in-a-cluster">Sample spec for Story 4: Create and Isolate multiple tenants in a cluster&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>tenant-creation-example&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">50&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchExpressions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[{&lt;span style="color:#008000;font-weight:bold">key&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;tenant&amp;#34;&lt;/span>,&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">operator&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Exists}]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Deny&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">notSameLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- tenant&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note: the above AdminNetworkPolicy can also be written in the following fashion:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>AdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>tenant-creation-example&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">priority&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">50&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">matchExpressions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[{&lt;span style="color:#008000;font-weight:bold">key&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;tenant&amp;#34;&lt;/span>,&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">operator&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Exists}]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Pass&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">sameLabels&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- tenant&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Deny &lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># Deny everything else other than same tenant traffic&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The difference is that in the first case, traffic within tenant Namespaces will fall
through, and be evaluated against lower-priority AdminNetworkPolicies, and then
NetworkPolicies. In the second case, the matching packet will skip all AdminNetworkPolicy
evaluation (except for AdminNetworkPolicy priority=0), and only match
against NetworkPolicy rules in the cluster. In other words, the second AdminNetworkPolicy
specifies that intra-tenant traffic must be delegated to the tenant Namespace owners.&lt;/p>
&lt;h4 id="sample-spec-for-story-5-cluster-wide-default-guardrails">Sample spec for Story 5: Cluster Wide Default Guardrails&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>policy.networking.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>BaselineAdminNetworkPolicy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>baseline-rule-example&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">subject&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ingress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Deny &lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># zero-trust cluster default security posture&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">from&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">egress&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Deny&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">to&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">namespaces&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">namespaceSelector&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>{}&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;ul>
&lt;li>Add e2e tests for AdminNetworkPolicy resource
&lt;ul>
&lt;li>Ensure &lt;code>Pass&lt;/code> rules are delegated and are not subject to ANP rules.&lt;/li>
&lt;li>Ensure &lt;code>Deny&lt;/code> rules drop traffic.&lt;/li>
&lt;li>Ensure &lt;code>Allow&lt;/code> rules allow traffic.&lt;/li>
&lt;li>Ensure that in stacked ClusterNetworkPolicies/K8s NetworkPolicies, precedence is maintained
as per the &lt;code>priority&lt;/code> set in ANP.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>e2e test cases must cover ingress and egress rules.&lt;/li>
&lt;li>e2e test cases must cover port-ranges, named ports, integer ports etc.&lt;/li>
&lt;li>e2e test cases must cover various combinations of &lt;code>namespaceSet*s&lt;/code> in each ingress/egress rule.&lt;/li>
&lt;li>Ensure that namespace matching strategies work as expected.&lt;/li>
&lt;li>Add unit tests to test the validation logic which shall be introduced for cluster-scoped policy resources.
&lt;ul>
&lt;li>Ensure exactly one selector has to be set in a &lt;code>Subject&lt;/code> section.&lt;/li>
&lt;li>Ensure exactly one selector has to be set in an &lt;code>AdminNetworkPolicyPeer&lt;/code> section.&lt;/li>
&lt;li>Test cases for fields which are shared with NetworkPolicy, like &lt;code>endPort&lt;/code> etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ensure that only administrators or assigned roles can create/update/delete cluster-scoped policy resources.&lt;/li>
&lt;li>Ensure smooth integration with existing Kubernetes NetworkPolicy.
&lt;ul>
&lt;li>Ensure all positive priority (non-zero) ANP rules are evaluated before any NetworkPolicy rules.&lt;/li>
&lt;li>Ensure ANP rules with priority=&amp;ldquo;0&amp;rdquo; are evaluated after any NetworkPolicy rules.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha-to-beta-graduation">Alpha to Beta Graduation&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from developers and surveys&lt;/li>
&lt;li>At least 2 implementors must provide a functional and scalable implementation
for the complete set of alpha features.
&lt;ul>
&lt;li>Specifically, ensure that only selecting E/W cluster traffic is plausible
at scale.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Evaluate the need for multiple &lt;code>Subject&lt;/code>s per ANP.&lt;/li>
&lt;li>Evaluate &amp;ldquo;future work&amp;rdquo; items based on feedback from community and challenges
faced by implementors.&lt;/li>
&lt;li>Ensure extensibility when adding new fields, i.e. adding new fields does not &amp;ldquo;fail-open&amp;rdquo;
traffic for older clients.&lt;/li>
&lt;li>Revisit the topic of whether this API should cover north-south traffic.&lt;/li>
&lt;/ul>
&lt;h4 id="beta-to-ga-graduation">Beta to GA Graduation&lt;/h4>
&lt;ul>
&lt;li>At least 4 implementors must provide a scalable implementation for the
complete set of beta features&lt;/li>
&lt;li>More rigorous forms of testing
— e.g., downgrade tests and scalability tests&lt;/li>
&lt;li>Allowing time for feedback&lt;/li>
&lt;li>Completion of all accepted &amp;ldquo;future work&amp;rdquo; items&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
&lt;h4 id="upgrade-considerations">Upgrade considerations&lt;/h4>
&lt;p>As such, the cluster-scoped policy resources are new and shall not exist prior
to upgrading to a new version. Thus, there is no direct impact on upgrades.&lt;/p>
&lt;h4 id="downgrade-considerations">Downgrade considerations&lt;/h4>
&lt;p>Downgrading to a version which no longer supports cluster-scoped policy APIs
must ensure that appropriate security rules are created to mimic the cluster-scoped
policy rules by other means, such that no unintended traffic is allowed, and all
intended traffic is allowed.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>n/a&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!--
This section must be completed when targeting alpha to a release.
-->
&lt;p>N/A for &lt;code>alpha&lt;/code> release.&lt;/p>
&lt;p>NOTE: for &lt;code>alpha&lt;/code> this resource will be implemented as a CRD, following the
precedent set by the Gateway API.&lt;/p>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;!--
Pick one of these and delete the rest.
-->
&lt;p>N/A for &lt;code>alpha&lt;/code> release.&lt;/p>
&lt;p>NOTE: for &lt;code>alpha&lt;/code> this resource will be implemented as a CRD, following the
precedent set by the Gateway API.&lt;/p>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
&lt;p>Creating an AdminNetworkPolicy does have an effect on the cluster; however, such policies
must be explicitly created, which means the administrator is aware of the impact.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;p>For &lt;code>alpha&lt;/code> there will be no feature gate so this is N/A.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>For &lt;code>alpha&lt;/code> there will be no feature gate so this is N/A.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
-->
&lt;p>Not in-tree; generally, the implementations should have unit tests covering this
scenario.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;p>N/A for &lt;code>alpha&lt;/code>.&lt;/p>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;p>N/A for &lt;code>alpha&lt;/code>.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;p>The AdminNetworkPolicy API has a &lt;code>Status&lt;/code> field which should be used by the
implementation to report whether or not the rules were correctly programmed.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;p>This will be tested once implementations of the API have been completed.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;p>Since the controller for this feature will not be implemented in-tree, it will be
the responsibility of the implementations to report metrics as they see fit.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name: The Condition name will not be standardized in &lt;code>alpha&lt;/code>; however,
implementations are given the &lt;code>status&lt;/code> field to report what they see fit.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;p>Specific SLOs will be determined by the implementations.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: N/A since the indicators will vary based on the implementation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;p>A metric describing the time it takes for the implementation to program the rules
defined in an AdminNetworkPolicy could be useful. However, some implementations
may struggle to report such a metric.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;ul>
&lt;li>API Type: AdminNetworkPolicy&lt;/li>
&lt;li>Supported number of objects per cluster: The total number of AdminNetworkPolicies
will not be limited. However, it is important to remember that the only users
creating ANPs will be Cluster-Admins, of which there should only be a handful. This
will help limit the total number of ANPs being deployed at any given time.&lt;/li>
&lt;/ul>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;p>This depends on the implementation, specifically based on the API used to program
the AdminNetworkPolicy rules into the data-plane.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;p>Not in any in-tree components; resource efficiency will need to be monitored by the
implementation.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>N/A for &lt;code>alpha&lt;/code> release.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;p>N/A for &lt;code>alpha&lt;/code> release.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>N/A for &lt;code>alpha&lt;/code> release.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2021-02-18 - Created initial PR for the KEP&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>Securing cluster traffic for administrator use cases can get complex.
This leads to the introduction of a more complex set of APIs, which could confuse
users.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>The following alternative approaches were considered as this KEP was iterated upon:&lt;/p>
&lt;h3 id="networkpolicy-v2">NetworkPolicy v2&lt;/h3>
&lt;p>A new version for NetworkPolicy, v2, was evaluated to address features and use cases
documented in this KEP. Since the NetworkPolicy resource already exists, it would be
a low barrier to entry and can be extended to incorporate admin use cases.
However, this idea was rejected because the NetworkPolicy resource was introduced
solely to satisfy a developer&amp;rsquo;s intent. Thus, adding new use cases for a cluster admin
would be contradictory. In addition to that, the administrator use cases are mainly
scoped to the cluster as opposed to the NetworkPolicy resource, which is &lt;code>namespaced&lt;/code>.&lt;/p>
&lt;h3 id="empower-deny-allow-action-based-crd">Empower, Deny, Allow action based CRD&lt;/h3>
&lt;p>Alternatively, AdminNetworkPolicy can have &lt;code>Empower&lt;/code> (as opposed to &lt;code>Pass&lt;/code>),
&lt;code>Deny&lt;/code> or &lt;code>Allow&lt;/code> as the action of each rule.&lt;/p>
&lt;p>In terms of precedence, the aggregated &lt;code>Empower&lt;/code> rules (all AdminNetworkPolicy
rules with action &lt;code>Empower&lt;/code> in the cluster combined) should be evaluated before
aggregated AdminNetworkPolicy &lt;code>Deny&lt;/code> rules, followed by aggregated AdminNetworkPolicy
&lt;code>Allow&lt;/code> rules, followed by NetworkPolicy rules in all Namespaces. As such, the
&lt;code>Empower&lt;/code> rules have the highest precedence, which shall only be used to provide
exceptions to deny rules. The &lt;code>Empower&lt;/code> rules do not guarantee that the traffic
will not be dropped: it simply denotes that the packets matching those rules can
bypass the AdminNetworkPolicy &lt;code>Deny&lt;/code> rule evaluation. This idea was outvoted
by the &lt;code>Pass&lt;/code> action during sig-networkpolicy meetings, as most members found the
&lt;code>Empower&lt;/code> keyword confusing, and using an &amp;lsquo;action&amp;rsquo; to provide exceptions to certain
rules feels counter-intuitive.&lt;/p>
&lt;h4 id="clusterdefaultnetworkpolicy-resource">ClusterDefaultNetworkPolicy resource&lt;/h4>
&lt;p>Instead of using the &lt;code>Baseline&lt;/code> action to set cluster default rules, the authors
of this KEP also considered using an entirely separate resource named
ClusterDefaultNetworkPolicy. A ClusterDefaultNetworkPolicy resource will help the
administrators set baseline security rules for the cluster, i.e. a developer CAN
override these rules by creating NetworkPolicies that apply to the same workloads
as the ClusterDefaultNetworkPolicy does.&lt;/p>
&lt;p>ClusterDefaultNetworkPolicy works just like NetworkPolicy except that it is cluster-scoped.
When workloads are selected by a ClusterDefaultNetworkPolicy, they are isolated except
for the ingress/egress rules specified. ClusterDefaultNetworkPolicy rules will not have
actions associated &amp;ndash; each rule will be an &amp;lsquo;allow&amp;rsquo; rule.&lt;/p>
&lt;p>Aggregated NetworkPolicy rules will be evaluated before aggregated ClusterDefaultNetworkPolicy
rules. If a Pod is selected by both, a ClusterDefaultNetworkPolicy and a NetworkPolicy,
then the ClusterDefaultNetworkPolicy&amp;rsquo;s effect on that Pod becomes obsolete.
In this case, the traffic allowed will be solely determined by the NetworkPolicy.&lt;/p>
&lt;p>This idea was eventually abandoned due to several reasons:&lt;/p>
&lt;ol>
&lt;li>Two separate resources make it harder to reason about the effect of aggregated rules.&lt;/li>
&lt;li>It is confusing that one cluster level resource has implicit isolation and
the other does not.&lt;/li>
&lt;/ol>
&lt;h3 id="single-crd-with-defaultrules-field">Single CRD with DefaultRules field&lt;/h3>
&lt;p>This alternate proposal was a hybrid approach, wherein the AdminNetworkPolicy
resource (introduced in the proposal) would include additional fields called &lt;code>defaultIngress&lt;/code>
and &lt;code>defaultEgress&lt;/code>. These defaultIngress/defaultEgress fields would be similar in structure
to the ingress/egress fields, except that the default rules will not have an &lt;code>action&lt;/code> field.
All default rules will be &amp;ldquo;allow&amp;rdquo; rules only, similar to K8s NetworkPolicy. Presence of
at least one &lt;code>defaultIngress&lt;/code> rule will isolate the &lt;code>appliedTo&lt;/code> workloads from accepting
any traffic other than that specified by the policy. Similarly, the presence of at least
one &lt;code>defaultEgress&lt;/code> rule will isolate the &lt;code>appliedTo&lt;/code> workloads from accessing any other
workloads other than those specified by the policy. In addition to that, the rules specified
by &lt;code>defaultIngress&lt;/code> and &lt;code>defaultEgress&lt;/code> fields will be evaluated to be enforced after the
K8s NetworkPolicy rules, thus such default rules can be overridden by a developer written
K8s NetworkPolicy.&lt;/p>
&lt;h3 id="single-crd-with-isoverrideable-field">Single CRD with IsOverrideable field&lt;/h3>
&lt;p>Another alternative for separating non-overridable guardrail rules and overridable
baseline rules is to introduce an &lt;code>IsOverridable&lt;/code> field in ANP ingress/egress rules:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyIngress&lt;span style="color:#666">/&lt;/span>EgressRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Action RuleAction
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> IsOverridable &lt;span style="color:#0b0;font-weight:bold">bool&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ports []networkingv1.NetworkPolicyPort
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> From&lt;span style="color:#666">/&lt;/span>To []networkingv1.AdminNetworkPolicyPeer
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If &lt;code>IsOverridable&lt;/code> is set to false, the rules will take higher precedence than the
Kubernetes Network Policy rules. Otherwise, the rules will take lower precedence.
Note that both overridable and non-overridable cluster network policy rules have explicit
allow/deny rules. The precedence order of the rules is as follows:&lt;/p>
&lt;p>&lt;code>AdminNetworkPolicy&lt;/code> Deny (&lt;code>IsOverridable&lt;/code>=false) &amp;gt; &lt;code>AdminNetworkPolicy&lt;/code> Allow (&lt;code>IsOverridable&lt;/code>=false) &amp;gt; K8s &lt;code>NetworkPolicy&lt;/code> &amp;gt; &lt;code>AdminNetworkPolicy&lt;/code> Allow (&lt;code>IsOverridable&lt;/code>=true) &amp;gt; &lt;code>AdminNetworkPolicy&lt;/code> Deny (&lt;code>IsOverridable&lt;/code>=true)&lt;/p>
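&lt;p>The precedence chain above can be read as &amp;ldquo;the first tier with a matching rule decides&amp;rdquo;. A minimal sketch of that reading follows. The tier names, the &lt;code>decide&lt;/code> helper, and the way a K8s NetworkPolicy is collapsed into a single verdict are simplifying assumptions made for illustration, not part of the proposed API.&lt;/p>

```go
package main

import "fmt"

// Verdict is the final decision for a packet.
type Verdict string

const (
	Allow Verdict = "Allow"
	Deny  Verdict = "Deny"
)

// precedence lists the tiers in the order given in the text, highest first.
// The tier names are hypothetical labels used only by this sketch.
var precedence = []string{
	"ANP Deny (IsOverridable=false)",
	"ANP Allow (IsOverridable=false)",
	"K8s NetworkPolicy",
	"ANP Allow (IsOverridable=true)",
	"ANP Deny (IsOverridable=true)",
}

// decide returns the first tier, in precedence order, that has a verdict for
// the packet. matched maps a tier name to the verdict of the rule in that tier
// which matches the packet; tiers with no matching rule are simply absent.
func decide(matched map[string]Verdict, fallback Verdict) (string, Verdict) {
	for _, tier := range precedence {
		if v, ok := matched[tier]; ok {
			return tier, v
		}
	}
	return "no match", fallback
}

func main() {
	// A packet matched both by a developer-written NetworkPolicy (allow)
	// and by an overridable AdminNetworkPolicy Deny rule.
	matched := map[string]Verdict{
		"K8s NetworkPolicy":             Allow,
		"ANP Deny (IsOverridable=true)": Deny,
	}
	tier, verdict := decide(matched, Allow)
	fmt.Printf("%s decides: %s\n", tier, verdict)
}
```

&lt;p>In this sketch, flipping the Deny rule to &lt;code>IsOverridable&lt;/code>=false moves it from the last tier to the first, so the same rule now wins over the NetworkPolicy.&lt;/p>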
&lt;p>As the semantics for overridable Cluster NetworkPolicies are different from
K8s Network Policies, cluster administrators who worked on K8s NetworkPolicies
will have a hard time writing similar policies for the cluster. Also, modifying
a single field (&lt;code>IsOverridable&lt;/code>) of a rule will change the priority in a
non-intuitive manner, which may cause some confusion. For these reasons, we
decided not to go with this proposal.&lt;/p>
&lt;h3 id="single-crd-with-baselineallow-as-action">Single CRD with BaselineAllow as Action&lt;/h3>
&lt;p>We evaluated another single-CRD approach with an additional &lt;code>RuleAction&lt;/code> to cover
the use cases of both &lt;code>AdminNetworkPolicy&lt;/code> and &lt;code>ClusterDefaultNetworkPolicy&lt;/code>.&lt;/p>
&lt;p>In this approach, we introduce a &lt;code>BaselineAllow&lt;/code> rule action.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> AdminNetworkPolicyIngress&lt;span style="color:#666">/&lt;/span>EgressRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Action RuleAction
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Ports []networkingv1.NetworkPolicyPort
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> From&lt;span style="color:#666">/&lt;/span>To []networkingv1.AdminNetworkPolicyPeer
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RuleActionDeny RuleAction = &lt;span style="color:#b44">&amp;#34;Deny&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RuleActionAllow RuleAction = &lt;span style="color:#b44">&amp;#34;Allow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RuleActionBaselineAllow RuleAction = &lt;span style="color:#b44">&amp;#34;BaselineAllow&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>RuleActionDeny and RuleActionAllow are used to specify rules that take higher
precedence than Kubernetes NetworkPolicies, whereas RuleActionBaselineAllow is
used to specify rules that take lower precedence than Kubernetes NetworkPolicies.
RuleActionBaselineAllow rules have the same semantics as Kubernetes NetworkPolicy
rules but are defined at the cluster level.&lt;/p>
&lt;p>One of the reasons we did not go with this approach is the ambiguity of the term
&lt;code>BaselineAllow&lt;/code>. Also, the semantics of &lt;code>RuleActionBaselineAllow&lt;/code> are
slightly different, as it involves implicit isolation, in contrast to the explicit
Allow/ Deny rules with other &lt;code>RuleActions&lt;/code>.&lt;/p></description></item><item><title>Resources: Add webhook hosting to CCM.</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2699/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2699/</guid><description>
&lt;!--
**Note:** When your KEP is complete, all of these comment blocks should be removed.
To get started with this template:
- [x] **Pick a hosting SIG.** (SIG Cloud Provider)
Make sure that the problem space is something the SIG is interested in taking
up. KEPs should not be checked in without a sponsoring SIG.
- [x] **Create an issue in kubernetes/enhancements** (https://github.com/kubernetes/enhancements/issues/2699)
When filing an enhancement tracking issue, please make sure to complete all
fields in that template. One of the fields asks for a link to the KEP. You
can leave that blank until this KEP is filed, and then go back to the
enhancement and add the link.
- [x] **Make a copy of this template directory.** (Done)
Copy this template into the owning SIG's directory and name it
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) assigned to your enhancement above.
- [ ] **Fill out as much of the kep.yaml file as you can.**
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
"Status", and date-related fields.
- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary" and "Motivation" sections.
These should be easy if you've preflighted the idea of the KEP with the
appropriate SIG(s).
- [ ] **Create a PR for this KEP.**
Assign it to people in the SIG who are sponsoring this process.
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the KEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
Just because a KEP is merged does not mean it is complete or approved. Any KEP
marked as `provisional` is a working document and subject to change. You can
denote sections that are under active debate as follows:
```
&lt;&lt;[UNRESOLVED optional short context or usernames ]>>
Stuff that is being argued.
&lt;&lt;[/UNRESOLVED]>>
```
When editing KEPS, aim for tightly-scoped, single-topic PRs to keep discussions
focused. If you disagree with what is already in a document, open a new PR
with suggested changes.
One KEP corresponds to one "feature" or "enhancement" for its whole lifecycle.
You do not need a new KEP to move from beta to GA, for example. If
new details emerge that belong in the KEP, edit the KEP. Once a feature has become
"implemented", major changes should get new KEPs.
The canonical place for the latest set of instructions (and the likely source
of this file) is [here](https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/NNNN-kep-template/README.md).
**Note:** Any PRs to move a KEP to `implementable`, or significant changes once
it is marked `implementable`, must be approved by each of the KEP approvers.
If none of those approvers are still appropriate, then changes to that list
should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
&lt;h1 id="kep-2699-add-webhook-hosting-capability-to-ccm-framework">KEP-2699: Add webhook hosting capability to CCM framework&lt;/h1>
&lt;!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.
Ensure the TOC is wrapped with
&lt;code>&amp;lt;!-- toc --&amp;gt;&amp;lt;!-- /toc --&amp;gt;&lt;/code>
tags, and then generate with `hack/update-toc.sh`.
-->
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1---full-control-separation-of-concerns"
>Story 1 - Full control, separation of concerns&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2---fast-and-simple"
>Story 2 - Fast and simple&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3---immediate-cloud-provider-extraction-effort"
>Story 3 - Immediate Cloud Provider Extraction effort&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#deprecation"
>Deprecation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP describes enhancements to the CCM framework to support cloud-provider-specific
webhooks. The intent is to make it easy to either generate a new binary
or extend the existing CCM binary to host such webhooks. We also intend to
make it easy to link in &amp;ldquo;standard&amp;rdquo; webhooks that other SIGs need and that
must be customized for particular cloud providers.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>The Cloud Controller Manager (CCM) is the binary into which the Cloud Provider
places all the controllers needed to make a Kubernetes cluster work correctly
on their Cloud. There are also occasions when it makes sense for a Cloud
Provider to want these customizations to be applied synchronously, during API
server request handling, rather than asynchronously after a change has already
been applied.&lt;/p>
&lt;p>Our initial example of this comes from SIG Storage, which would like the
functionality of the PersistentVolumeLabel (PVL) admission controller
(&lt;a href="https://github.com/kubernetes/kubernetes/issues/52617"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/52617&lt;/a>).
This work needs to be
finished before cloud provider extraction can be completed. Several Cloud Providers
have indicated that this should be done in-line, especially as the existing
deprecated solution is implemented in an API server admission plugin, which is
in-line in the request path.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Our immediate goal is to allow in-tree Cloud Providers to stop using the
existing PVL admission controller and to provide the same functionality using the framework. However, we
want to build a framework which will be usable for solutions to similar problems.
This KEP is about the framework needed to support the PVL webhook and not the
webhook itself. The framework&amp;rsquo;s default listener for the webhook should use
existing Kubernetes mechanisms (secure serving, authz, authn) to secure itself
and validate the client. It should then be possible to easily change that
configuration to any other Kubernetes-supported option for a webhook.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>We are not intending to create a &lt;a href="https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html"
target="_blank" rel="noopener">general admission webhook
solution&lt;/a>
.
This is just intended to host Cloud Provider specific webhooks as part of the
Control Plane.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We will start by adding extension hooks which can be registered in the
cmd/cloud-controller-manager/main.go. This would be similar to the mechanism we
already use to register new controllers. The existing sample shows this with a
sample of registering the nodeipamcontroller which is not a normally installed
controller in the cloud controller manager. In a similar way we will have a
sample of integrating a PVL mutating webhook into the sample CCM. We will also
have the system automatically detect if there are both controllers and webhooks
registered in the binary. If both are registered it will automatically add
command line flags allowing webhooks and/or controllers to be disabled. There
will be two separate flags, the controller flag and the webhook flag. The
controller flag will default to the controllers being enabled. The webhook flag
will default to the webhooks being disabled. We would also like to provide a
builder pattern for registering both the controller and webhook extensions.&lt;/p>
&lt;p>Another issue to consider is how the mutating/admission webhook configuration
is written into the cluster. This may be somewhat dependent on if the Cloud
Provider intends to run the webhook server on the Control Plane or on the
Cluster. We would recommend running the webhook server on the Control Plane.
However for some Cloud Providers that can lead to special issues with the
configuration. As such we will provide a flag which enables the service to
automatically register the webhooks as part of startup. However that
functionality can be disabled, allowing the Cloud Provider to do their own
custom registration, as part of cluster setup.&lt;/p>
&lt;p>There are a few parts to this issue. If the webhook server is run on the
control plane, it may be possible to assume it will be collocated
with the KAS, allowing the webhook server to be referenced via localhost.
It also means that the webhook server could be instantiated from a static
manifest. If the webhook server is run in the cluster, then resources such as
the Node, Pod, PVs, and AdmissionController which are needed to start the
webhook server must all be creatable before the webhook server comes up. In
addition, the template code for the webhook server should be written such
that having all the webhook servers crash will not wedge the cluster and prevent
the webhook server from being started again.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;p>The users of this KEP are Cloud Providers and feature developers whose code
impacts Cloud Providers. The intent is to make it easy for them both to develop
features and to maintain the CCM controllers and webhooks across multiple
versions. At the same time we are attempting to make it easy for the SIGs to
make controllers or webhooks which can do what they know needs to be done and
be integrated into Cloud Provider specific processes. We would like to do that in
a way which makes merging upgrades relatively painless.&lt;/p>
&lt;h4 id="story-1---full-control-separation-of-concerns">Story 1 - Full control, separation of concerns&lt;/h4>
&lt;p>Some Cloud Providers would prefer to keep controllers and webhooks in different
processes. They have concerns about attempting to run batch controllers in the
same process as webhooks which are &amp;ldquo;in-line&amp;rdquo; and time sensitive. For these
users it is easy to either build two different binaries or have the same binary
act as two different binaries based on command line flags.&lt;/p>
&lt;h4 id="story-2---fast-and-simple">Story 2 - Fast and simple&lt;/h4>
&lt;p>For Cloud Providers who would like to keep things simple, it is easy to create
a single process which handles both controllers and webhooks. While this KEP
does not cover deployment, this is a simple deployment, with fewer processes. It
does not stop the Cloud Provider from converting to Story 1 later. This system
should make our part of this simple. Obviously the Cloud Provider would have to
change their deployment setup.&lt;/p>
&lt;h4 id="story-3---immediate-cloud-provider-extraction-effort">Story 3 - Immediate Cloud Provider Extraction effort&lt;/h4>
&lt;p>PVL use case. Cloud Providers want to allow customers to migrate an existing
workload to Kubernetes. That workload uses an existing persistent volume. To
get that workload migrated, the end user needs to be able to link the existing
PV into the cluster. However, this requires an association which, for certain
kinds of storage, requires calls out to the cloud provider. Ideally, the lookup and
labeling of the PV against that pre-existing storage happens in-line when the PV is
written. That ensures the right volume is attached to the Node/Pod when it is
scheduled and there are no race conditions.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>There could be potential problems running webhooks and controllers in the same
process. Irrespective of the failure mode in the webhook configuration, timeouts
will always cause a webhook call to fail. As such, we are making it easy to
split the CCM into two processes to mitigate this. It will be up to the Cloud
Provider to determine if they want the webhook policy to be FAIL or IGNORE. We
will have the sample set the configuration to IGNORE as it&amp;rsquo;s the safe option.
Incorrectly setting FAIL can quickly lead to a non-functional cluster. Having a
FAIL policy on Pods, for example, can prevent the system from allocating the
webhook service, which prevents the webhook from ever passing.&lt;/p>
&lt;p>Webhooks are configured by a runtime resource. As a consequence this
configuration can be modified or deleted at runtime. That means that an admin
on the cluster can disable or alter the functionality. This potentially makes
it harder for a cloud provider to enforce that this logic is being applied. It
also means that there needs to be a deployment mechanism for the webhook. It is
left to the Cloud Provider to determine if the need for an in-line request is
sufficient to override these concerns. The Cloud Provider can alternatively use
a standalone controller which is not in-line or use &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/"
target="_blank" rel="noopener">an admission
controller&lt;/a>
,
built into the APIServer.&lt;/p>
&lt;p>The change outlined in this KEP affects the framework which generates the CCM and not the CCM
itself. The Cloud Provider may wish to run webhooks separately from the
controllers in the CCM, therefore the framework will support that use case. In
this mode, the CCM will just have controllers in it. A &amp;ldquo;Cloud Webhook Manager&amp;rdquo;
can be run separately and host the webhooks. That is being left as homework for
the Cloud Provider. However the sample CCM which demonstrates how this will be
done will have both in the same sample to make it easy.&lt;/p>
&lt;p>It is noteworthy that the CCM derives from the KCM. The KCM (and the CCM)
predate efforts like controller runtime. The controller runtime is a good
reference, as it demonstrates that operators and webhooks can be successfully
run inside the same binary. It further demonstrates that this is a pattern
which is understood and followed by a significant portion of the Kubernetes
community. Having said that, we consider it more important to unify the KCM and
CCM code bases than to build on top of controller runtime. We are not saying not
to use anything from controller runtime. We are saying that if we need to choose
between unifying the KCM &amp;amp; CCM code and building with controller runtime, we
will choose unifying the KCM &amp;amp; CCM code base.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>A sample of how the Builder pattern might look is:&lt;/p>
&lt;pre tabindex="0">&lt;code>cmOptions, err := options.NewCloudManagerOptions()
if err != nil {
    klog.Fatalf(&amp;#34;unable to initialize command options: %v&amp;#34;, err)
}
fss := cliflag.NamedFlagSets{}
cloudManagerBuilder := app.NewCloudManagerBuilder(&amp;#34;name&amp;#34;)
cloudManagerBuilder.setOptions(cmOptions)
cloudManagerBuilder.setFlags(fss)
cloudManagerBuilder.registerWebhook(gvkList, handler)
cloudManagerBuilder.registerWebhook(gvkSecondList, secondHandler)
// build is an assumed method name; the original sketch invoked the builder directly.
manager, err := cloudManagerBuilder.build(wait.NeverStop)
if err != nil {
    klog.Fatalf(&amp;#34;unable to construct cloud manager: %v&amp;#34;, err)
}
// Assumes the constructed manager exposes start(); err is reused rather than redeclared.
err = manager.start()
&lt;/code>&lt;/pre>&lt;p>This will not alter the existing extension hooks in the controller manager
framework, as they are critical for backward compatibility. The builders are
meant to be an abstraction layer on top to make the extensions easier to use.
So for the existing controller manager code you might see changes like:&lt;/p>
&lt;pre tabindex="0">&lt;code>cloudControllerManagerBuilder.registerController(&amp;#34;nodeipamcontroller&amp;#34;, handler)
cloudControllerManagerBuilder.deregisterController(&amp;#34;servicecontroller&amp;#34;)
&lt;/code>&lt;/pre>&lt;p>The handler in this case is likely to be of type ControllerInitFuncConstructor.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->
&lt;!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- &lt;package>: &lt;date> - &lt;current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/vendor/k8s.io/cloud-provider/options&lt;/code>: &lt;code>2022-10-12&lt;/code> -
&lt;code>34.2&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/vendor/k8s.io/cloud-provider/config/v1alpha1&lt;/code>:
&lt;code>2022-10-12&lt;/code> - &lt;code>38.5&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/staging/src/k8s.io/cloud-provider/app/&lt;/code>: There is
currently no published coverage for this package because it&amp;rsquo;s not vendored by
Kubernetes itself and, for some reason, staging does not seem to be included in
the metrics.&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
&lt;ul>
&lt;li>Integration test for builder pattern exercising the case of building a CCM with a webhook: &lt;link to test coverage>&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->
&lt;ul>
&lt;li>None, this feature is consumed by cloud provider repositories for the final
binary so it will not be used in e2e tests in K/K.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Reference implementation of the PVL mutating webhook served from the sample CCM.&lt;/li>
&lt;li>Implementation of the PVL mutating webhook for at least 1 Cloud Provider.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note:&lt;/strong> Generally we also wait at least two releases between beta and
GA/stable, because there&amp;rsquo;s no opportunity for user feedback, or even bug reports,
in back-to-back releases.&lt;/p>
&lt;p>&lt;strong>For non-optional features moving to GA, the graduation criteria must include
&lt;a href="https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">conformance tests&lt;/a>
.&lt;/strong>&lt;/p>
&lt;h4 id="deprecation">Deprecation&lt;/h4>
&lt;ul>
&lt;li>Not deprecated&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;ul>
&lt;li>Upgrade is not believed to be an issue at this point.&lt;/li>
&lt;li>Currently we are leaving upgrade as an issue for the Cloud Provider&lt;/li>
&lt;/ul>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;ul>
&lt;li>We are currently assuming that this will be deployed as part of the control
plane. We assume it will be upgraded with the KAS, KCM and CCM.&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;p>This will be built into the CCM by the Cloud Provider. Code must be written
specifically by the Cloud Provider to enable this feature.&lt;/p>
&lt;p>There will be a feature gate which will be used to track the stage of the feature.
Principally this is to make users aware of the support level of the feature.
It will control if the listener can be started.
Please note however this is a library and we expect users to vendor this into their own code base.
As such we cannot control if they will remove the check rather than setting the flag.&lt;/p>
&lt;ul>
&lt;li>Feature gate name: CloudWebhookServer&lt;/li>
&lt;li>Components depending on the feature gate: cloud-controller-manager&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>This cannot just be &amp;ldquo;enabled&amp;rdquo;.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>If you build using our framework, then you will be able to disable using a command line flag.
It can also be disabled by changing the admission webhook configuration.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>For new update requests it will work. However it will not change any persisted
resources, unless they are rewritten.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>By examining the admission webhook configuration.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;ul>
&lt;li>It relies on mutating/validating admission webhooks.&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>Webhooks have the advantage that they can be scaled more easily than controllers.&lt;/li>
&lt;/ul>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>It requires a new admission webhook call.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>Depends on the Cloud Provider&amp;rsquo;s implementation.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>Yes, in the same way that any additional admission webhook call does.
It is worth noting that the Cloud Provider has the option of instead
using a controller, at least for the PVL case. However that is not
the preferred mechanism. This is an optional extension mechanism.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>This is an admission webhook server. Such servers already exist, and their troubleshooting
mechanisms should apply here as well.&lt;/li>
&lt;/ul>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;ul>
&lt;li>This feature does not apply unless the API server is functional.&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>Timeouts on webhooks act as failures, so any request sent to the CCM webhook will fail
if the call times out.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>The primary alternative is to use controllers to solve all the problems.
This is an issue for things which need to be done in-line. If it is not
acceptable for state to be missing from a resource between creation and usage,
then controllers are a problem.&lt;/p>
&lt;p>Initializers solve the problem between creation and usage; however, this
solution has been deprecated.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;ul>
&lt;li>TBD&lt;/li>
&lt;/ul></description></item><item><title>Resources: Adding AppProtocol to Services and Endpoints</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1507/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1507/</guid><description>
&lt;h1 id="adding-appprotocol-to-services-and-endpoints">Adding AppProtocol to Services and Endpoints&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#services"
>Services:&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#endpoints"
>Endpoints:&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proposed-roadmap"
>Proposed Roadmap&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#alpha---beta"
>Alpha -&amp;gt; Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta---ga"
>Beta -&amp;gt; GA&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Kubernetes does not have a standardized way of representing application
protocols. When a protocol is specified, it must be one of TCP, UDP, or SCTP.
With the EndpointSlice beta release in 1.17, a concept of AppProtocol was added
that would allow application protocols to be specified for each port. This KEP
proposes adding support for that same attribute to Services and Endpoints.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>The lack of direct support for specifying application protocols for ports has
led to widespread use of annotations, providing a poor user experience and
general frustration (&lt;a href="https://github.com/kubernetes/kubernetes/issues/40244"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/40244&lt;/a>).
Unfortunately, annotations are cloud-specific and simply can&amp;rsquo;t provide the ease
of use of a built-in attribute like &lt;code>AppProtocol&lt;/code>. Since application protocols
are specific to each port specified on a Service or Endpoints resource, it makes
sense to have a way to specify it at that level.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Add AppProtocol field to Ports in Services and Endpoints.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>In both Endpoints and Services, a new &lt;code>AppProtocol&lt;/code> field would be added. In
both cases, validation constraints would directly mirror what already exists
for EndpointSlices.&lt;/p>
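&lt;p>To give a feel for that validation, the Go sketch below approximates a qualified-name check of the form &lt;code>name&lt;/code> or &lt;code>prefix/name&lt;/code>, which is the shape the field comments in the following type definitions describe. It uses only the standard library; the regular expressions, the length limits, and the &lt;code>validAppProtocol&lt;/code> helper are assumptions made for illustration and are not copied from the Kubernetes validation code.&lt;/p>

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Approximations of the qualified-name rules (assumed for this sketch):
// an optional lowercase DNS-subdomain prefix, a slash, and a short name.
var (
	namePattern   = regexp.MustCompile(`^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$`)
	prefixPattern = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// validAppProtocol is a hypothetical helper that reports whether value looks
// like "name" or "prefix/name".
func validAppProtocol(value string) bool {
	parts := strings.Split(value, "/")
	name := parts[len(parts)-1]
	switch len(parts) {
	case 1:
		// Un-prefixed: reserved for IANA standard service names.
	case 2:
		// Prefixed: the prefix must look like a DNS subdomain.
		prefix := parts[0]
		if len(prefix) > 253 || !prefixPattern.MatchString(prefix) {
			return false
		}
	default:
		// More than one slash is never accepted.
		return false
	}
	if len(name) > 63 {
		return false
	}
	return namePattern.MatchString(name)
}

func main() {
	samples := []string{"http", "mycompany.com/my-custom-protocol", "not valid", "a/b/c"}
	for _, v := range samples {
		fmt.Printf("%q valid=%v\n", v, validAppProtocol(v))
	}
}
```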
&lt;h4 id="services">Services:&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ServicePort represents the port on which the service is exposed&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ServicePort &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The application protocol for this port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field follows standard Kubernetes label syntax.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Un-prefixed names are reserved for IANA standard service names (as per&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// RFC-6335 and http://www.iana.org/assignments/service-names).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Non-standard protocols should use prefixed names such as&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// mycompany.com/my-custom-protocol.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppProtocol &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="endpoints">Endpoints:&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// EndpointPort is a tuple that describes a single port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> EndpointPort &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The application protocol for this port.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This field follows standard Kubernetes label syntax.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Un-prefixed names are reserved for IANA standard service names (as per&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// RFC-6335 and http://www.iana.org/assignments/service-names).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Non-standard protocols should use prefixed names such as&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// mycompany.com/my-custom-protocol.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AppProtocol &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>It may take some time for cloud providers and other consumers of these APIs to
support this attribute. To help with this, we will work to communicate this
change well in advance of release so it can be well supported initially.&lt;/p>
&lt;h3 id="proposed-roadmap">Proposed Roadmap&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Kubernetes 1.18&lt;/strong>: New field is added but gated behind new alpha
&lt;code>ServiceAppProtocol&lt;/code> feature gate.&lt;/li>
&lt;li>&lt;strong>Kubernetes 1.19&lt;/strong>: &lt;code>ServiceAppProtocol&lt;/code> feature gate graduates to beta and is
enabled by default.&lt;/li>
&lt;li>&lt;strong>Kubernetes 1.20&lt;/strong>: &lt;code>ServiceAppProtocol&lt;/code> feature gate graduates to GA.&lt;/li>
&lt;li>&lt;strong>Kubernetes 1.21&lt;/strong>: &lt;code>ServiceAppProtocol&lt;/code> feature gate is removed.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>This adds a new optional attribute to 2 existing stable APIs. This will follow
the traditional approach for adding new fields initially guarded by a feature
gate.&lt;/p>
&lt;h1 id="alpha---beta">Alpha -&amp;gt; Beta&lt;/h1>
&lt;ul>
&lt;li>&lt;code>ServiceAppProtocol&lt;/code> has been supported for at least 1 minor release.&lt;/li>
&lt;li>&lt;code>ServiceAppProtocol&lt;/code> feature gate is enabled by default.&lt;/li>
&lt;/ul>
&lt;h1 id="beta---ga">Beta -&amp;gt; GA&lt;/h1>
&lt;ul>
&lt;li>&lt;code>ServiceAppProtocol&lt;/code> has been enabled by default for at least 1 minor release.&lt;/li>
&lt;/ul>
&lt;h3 id="test-plan">Test plan&lt;/h3>
&lt;p>This will replicate the existing validation tests for the AppProtocol field that
already exists on EndpointSlice. Additionally, it will add tests that ensure
that both the Endpoints and EndpointSlice controllers appropriately set the
AppProtocol field on Endpoints and EndpointSlices when it is set on the
corresponding Service.&lt;/p>
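&lt;p>The controller behavior those tests would exercise can be sketched roughly as follows. The types are trimmed-down stand-ins for the &lt;code>ServicePort&lt;/code> and &lt;code>EndpointPort&lt;/code> definitions shown earlier, and &lt;code>endpointPortFor&lt;/code> is a hypothetical helper, not the actual controller code.&lt;/p>

```go
package main

import "fmt"

// ServicePort is a trimmed-down stand-in with only the fields this sketch needs.
type ServicePort struct {
	Name        string
	AppProtocol *string
}

// EndpointPort is a trimmed-down stand-in with only the fields this sketch needs.
type EndpointPort struct {
	Name        string
	AppProtocol *string
}

// endpointPortFor is a hypothetical helper: it builds the EndpointPort for a
// ServicePort, copying AppProtocol by value so the two objects never share a pointer.
func endpointPortFor(sp ServicePort) EndpointPort {
	ep := EndpointPort{Name: sp.Name}
	if sp.AppProtocol != nil {
		ep.AppProtocol = new(string)
		*ep.AppProtocol = *sp.AppProtocol
	}
	return ep
}

func main() {
	proto := new(string)
	*proto = "http"

	ep := endpointPortFor(ServicePort{Name: "web", AppProtocol: proto})
	fmt.Println(ep.Name, *ep.AppProtocol)

	// An unset field on the Service port stays unset on the Endpoint port.
	fmt.Println(endpointPortFor(ServicePort{Name: "dns"}).AppProtocol == nil)
}
```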
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>
This was previously enabled with the &lt;code>ServiceAppProtocol&lt;/code> feature gate. That
will be removed in Kubernetes 1.21.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Does enabling the feature change any default behavior?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?&lt;/strong>
Not anymore.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What happens if we reenable the feature if it was previously rolled back?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can a rollout fail? Can it impact already running workloads?&lt;/strong>
If the &lt;code>ServiceAppProtocol&lt;/code> gate is manually enabled on Kubernetes components
it will no longer be recognized in Kubernetes 1.21. Users should stop using
this feature gate.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What specific metrics should inform a rollback?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?&lt;/strong>
The v1.21 rollout will include the removal of the &lt;code>ServiceAppProtocol&lt;/code> feature
gate.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can an operator determine if the feature is in use by workloads?&lt;/strong>
If this field is set on any Services, it may be used by applications that
consume those Services. No core Kubernetes components consume this field.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/strong>
N/A.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any missing metrics that would be useful to have to improve
observability of this feature?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Does this feature depend on any specific services running in the cluster?&lt;/strong>
No.&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new calls to the cloud
provider?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing size or count of the
existing API objects?&lt;/strong>
Yes.&lt;/p>
&lt;ul>
&lt;li>API type(s): Service&lt;/li>
&lt;li>Estimated increase in size: 10B&lt;/li>
&lt;li>Estimated amount of new objects: This field could be specified on each port
in each Service in a cluster although that is unlikely.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing time taken by any
operations covered by existing SLIs/SLOs?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>The Troubleshooting section currently serves the &lt;code>Playbook&lt;/code> role. We may consider
splitting it into a dedicated &lt;code>Playbook&lt;/code> document (potentially with some monitoring
details). For now, we leave it here.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How does this feature react if the API server and/or etcd is unavailable?&lt;/strong>
N/A&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are other known failure modes?&lt;/strong>
N/A&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What steps should be taken if SLOs are not being met to determine the problem?&lt;/strong>
N/A&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Resources: Admission Webhook Match Conditions</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3716/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3716/</guid><description>
&lt;h1 id="kep-3716-admission-webhook-match-conditions">KEP-3716: Admission Webhook Match Conditions&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#exclude-resources-from-a-wildcard-rule"
>Exclude resources from a wildcard rule&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#exempt-system-users-from-security-policy"
>Exempt system users from security policy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scope-an-nfs-access-management-webhook-to-pods-mounting-nfs-volumes"
>Scope an NFS access management webhook to Pods mounting NFS volumes&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api"
>API&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#security"
>Security&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#debuggability"
>Debuggability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#performance"
>Performance&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#future-work"
>Future Work&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#cross-webhook-match-conditions"
>Cross-webhook match conditions&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#exclusion-expressions"
>Exclusion Expressions&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#resource-exclusions"
>Resource Exclusions&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#cel-admission-control"
>CEL Admission Control&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP proposes adding &amp;ldquo;match conditions&amp;rdquo; to admission webhooks, as an extension to the
existing &lt;code>rules&lt;/code> to define the scope of a webhook. A &lt;code>matchCondition&lt;/code> is a
&lt;a href="https://github.com/google/cel-spec"
target="_blank" rel="noopener">CEL&lt;/a>
expression that must evaluate to true for the admission
request to be sent to the webhook. If a &lt;code>matchCondition&lt;/code> evaluates to false, the webhook is skipped for
that request (implicitly allowed).&lt;/p>
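&lt;p>The skip-or-call decision can be sketched in Go as below. Plain Go predicates stand in for CEL expressions, the &lt;code>request&lt;/code> and &lt;code>matchCondition&lt;/code> types and the &lt;code>shouldCallWebhook&lt;/code> function are hypothetical, and the sketch assumes that a webhook may list several conditions which must all evaluate to true. The &lt;code>excludeLeases&lt;/code> predicate mirrors the &lt;code>exclude-leases&lt;/code> example from the user stories.&lt;/p>

```go
package main

import "fmt"

// request is a hypothetical, heavily trimmed view of an admission request.
type request struct {
	Group    string
	Resource string
}

// matchCondition pairs a name with a predicate. In the proposal the predicate
// would be a CEL expression; a plain Go function stands in for it here.
type matchCondition struct {
	Name       string
	Expression func(request) bool
}

// shouldCallWebhook reports whether every condition evaluates to true.
// If any condition is false, the webhook is skipped for that request.
func shouldCallWebhook(conds []matchCondition, req request) bool {
	for _, c := range conds {
		if !c.Expression(req) {
			return false
		}
	}
	return true
}

// excludeLeases is true for everything except Lease objects.
func excludeLeases(r request) bool {
	return r.Group != "coordination.k8s.io" || r.Resource != "leases"
}

func main() {
	conds := []matchCondition{{Name: "exclude-leases", Expression: excludeLeases}}

	lease := request{Group: "coordination.k8s.io", Resource: "leases"}
	pod := request{Group: "", Resource: "pods"}

	fmt.Println("call webhook for lease:", shouldCallWebhook(conds, lease))
	fmt.Println("call webhook for pod:", shouldCallWebhook(conds, pod))
}
```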
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>&lt;strong>Reliability:&lt;/strong> Admission webhooks continue to be an operational sore spot for many Kubernetes
users. Webhooks that target cluster critical resources put the admission controller backing the
webhook in the critical path of cluster stability. Even if tools like namespace scoping are used to
avoid circular-dependencies and exclude critical system resources, a webhook outage can still have a
major impact on cluster availability. This proposal aims to mitigate (but not eliminate) these
issues by allowing webhooks to be more narrowly scoped and targeted.&lt;/p>
&lt;p>&lt;strong>Performance:&lt;/strong> Admission webhooks sit in the critical request path for write-requests. Validating
webhooks can be run in parallel, but Mutating webhooks must be run in serial (up to 2 times!). This
makes webhooks extremely latency sensitive, and even a webhook that doesn&amp;rsquo;t do any work still needs
to pay the network round-trip cost.&lt;/p>
&lt;p>&lt;strong>Supportability:&lt;/strong> For hosted or managed Kubernetes distributions, webhooks can be a problem when
they interfere with requests by managed components. The existing criteria for filtering out requests
are insufficient for many use cases, and cannot easily be extended with provider rules.&lt;/p>
&lt;p>&lt;em>What about &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/3488-cel-admission-control"
target="_blank" rel="noopener">CEL for Admission Control&lt;/a>
?&lt;/em>
&lt;code>ValidatingAdmissionPolicy&lt;/code> is an exciting new feature which we hope will greatly reduce the need
for admission webhooks, but it is intentionally not attempting to cover every possible use case.
This proposal aims to improve the situation for those webhooks that cannot be migrated.&lt;/p>
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;h4 id="exclude-resources-from-a-wildcard-rule">Exclude resources from a wildcard rule&lt;/h4>
&lt;blockquote>
&lt;p>I want to enforce metadata policy through an admission webhook without adding latency &amp;amp; risk to
high QPS system requests.&lt;/p>&lt;/blockquote>
&lt;p>Currently, if a webhook uses wildcard match rules, there is no way to filter out a subset of
resources or requests from matching the wildcard. If the webhook instead enumerates every resource
that should match, it must be kept up-to-date with every CRD that&amp;rsquo;s added.&lt;/p>
&lt;p>With CEL match conditions, the webhook could specify wildcard match rules, and add match conditions
to filter out the desired resources:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">rules&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># Match CREATE &amp;amp; UPDATE on all resources:&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">operations&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- CREATE&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- UPDATE&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiGroups&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;*&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;*&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">resources&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;*&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">matchConditions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;exclude-leases&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">expression&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;!(request.resource.group == &amp;#34;coordination.k8s.io&amp;#34; &amp;amp;&amp;amp; request.resource.resource == &amp;#34;leases&amp;#34;)&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="exempt-system-users-from-security-policy">Exempt system users from security policy&lt;/h4>
&lt;blockquote>
&lt;p>As a managed cluster provider, I want to prevent user webhooks from intercepting critical system
requests.&lt;/p>&lt;/blockquote>
&lt;p>System &lt;em>resources&lt;/em> can currently be exempted through a namespace or label selector, but requests by
system components against non-system resources cannot be. For example, update pod status requests by
Kubelets cannot be excluded from user webhooks intercepting all pod requests.&lt;/p>
&lt;p>With &lt;code>matchConditions&lt;/code>, a managed cluster could append system-exclusion rules to each webhook. For example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">matchConditions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;exclude-kubelet-requests&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">expression&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;!(&amp;#34;system:nodes&amp;#34; in request.userInfo.groups)&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Since the expression will be evaluated using a common Kubernetes CEL library, these expressions
should also get automatic access to the secondary authorization check mechanism described in
&lt;a href="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3488-cel-admission-control#secondary-authz"
target="_blank" rel="noopener">KEP-3488: CEL for Admission Control&lt;/a>
.
In practice, this means that RBAC bindings can be used to opt privileged users out of security policy:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">matchConditions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># Requests by users without breakglass should be included.&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;breakglass&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">expression&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;authorizer.resource(&amp;#34;admissionregistration.k8s.io&amp;#34;, &amp;#34;validatingwebhookconfigurations&amp;#34;, &amp;#34;*&amp;#34;).name(&amp;#34;security-policy&amp;#34;).check(&amp;#34;breakglass&amp;#34;).denied()&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="scope-an-nfs-access-management-webhook-to-pods-mounting-nfs-volumes">Scope an NFS access management webhook to Pods mounting NFS volumes&lt;/h4>
&lt;blockquote>
&lt;p>I want to narrowly scope my webhook to only the relevant requests, in order to reduce load on the
webhook and reduce latency in irrelevant requests.&lt;/p>&lt;/blockquote>
&lt;p>Concrete example:&lt;/p>
&lt;blockquote>
&lt;p>An NFS deployment uses a third-party access management system. I have an admission webhook that
performs an access check against the external system for pods that mount NFS volumes. Only
pods with NFS volumes need to be checked.&lt;/p>&lt;/blockquote>
&lt;p>Currently, there is no way to achieve this. Many webhook implementations today start by checking
that the request is within scope, and return early if it&amp;rsquo;s not. This adds latency and an additional
failure point to irrelevant requests. This example requires an external integration, and thus is not
a candidate for migration to CEL &lt;code>ValidatingAdmissionPolicy&lt;/code>.&lt;/p>
&lt;p>With match conditions, the expressions can check whether the request object is in-scope for the
webhook:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">rules&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">operations&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#39;CREATE&amp;#39;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiGroups&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;&amp;#39;&lt;/span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># core&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;*&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">resources&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;pods&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">matchConditions&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># Only include pods with an NFS volume.&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;nfs-volume-present&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">expression&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#39;object.spec.volumes.exists(v, has(v.nfs))&amp;#39;&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="goals">Goals&lt;/h3>
&lt;ol>
&lt;li>Provide a filtering mechanism for excluding requests from an admission webhook&lt;/li>
&lt;li>Maintain consistency with &lt;code>ValidatingAdmissionPolicy&lt;/code>&lt;/li>
&lt;/ol>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Provide a mechanism to exclude requests from all webhooks.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="api">API&lt;/h3>
&lt;p>Both &lt;code>ValidatingWebhook&lt;/code> and &lt;code>MutatingWebhook&lt;/code> (in &lt;code>admissionregistration.k8s.io&lt;/code>) will be updated
with a new &lt;code>MatchConditions&lt;/code> field:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ValidatingWebhook &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// MatchConditions is a list of conditions on the AdmissionRequest (&amp;#39;request&amp;#39;) that must be met&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// for a request to be sent to this webhook. All conditions in the list must evaluate to TRUE for&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the request to be matched.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +patchMergeKey=name&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +patchStrategy=merge&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> MatchConditions []MatchCondition &lt;span style="color:#b44">`json:&amp;#34;matchConditions,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> MutatingWebhook &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> MatchConditions []MatchCondition &lt;span style="color:#b44">`json:&amp;#34;matchConditions,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// MatchCondition represents a condition which must be fulfilled for a request to be sent to a webhook.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> MatchCondition &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name is an identifier for this match condition, used for strategic merging of MatchConditions,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// as well as providing an identifier for logging purposes. A good name should be descriptive of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the associated expression.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name must be a valid RFC 1123 DNS subdomain, and unique in a set of MatchConditions.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Required.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;name&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// NOTE: Placeholder documentation, to be replaced by https://github.com/kubernetes/website/issues/39089.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Expression represents the expression which will be evaluated by CEL.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ref: https://github.com/google/cel-spec&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// CEL expressions have access to the contents of the AdmissionRequest, organized into CEL variables:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// &amp;#39;object&amp;#39; - The object from the incoming request. The value is null for DELETE requests.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// &amp;#39;oldObject&amp;#39; - The existing object. The value is null for CREATE requests.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// &amp;#39;request&amp;#39; - Attributes of the admission request([ref](https://raw.githubusercontent.com/kubernetes/enhancements/master/pkg/apis/admission/types.go#AdmissionRequest)).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Required.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Expression &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;expression&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The match condition expression is evaluated by the same libraries as those used for CEL
ValidatingAdmissionPolicy. The only difference is that the &lt;code>params&lt;/code>
variable is not available in match conditions. Checks that require information outside the
AdmissionRequest must be performed in the webhook itself, and are out of scope for this proposal.&lt;/p>
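&lt;p>As a rough illustration of where the new field sits, the sketch below shows a complete webhook
configuration carrying one match condition. It assumes the field is served under
&lt;code>admissionregistration.k8s.io/v1&lt;/code>; the object name, webhook name, and service reference are
hypothetical placeholders and not part of this proposal:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy              # hypothetical
webhooks:
  - name: policy.example.com        # hypothetical
    admissionReviewVersions: ['v1']
    sideEffects: None
    clientConfig:
      service:
        namespace: example-system   # hypothetical
        name: example-webhook       # hypothetical
        path: /validate
    rules:
      - operations: ['CREATE', 'UPDATE']
        apiGroups: ['*']
        apiVersions: ['*']
        resources: ['*']
    matchConditions:
      # Every condition must evaluate to true for the request to be sent.
      - name: 'exclude-leases'
        expression: '!(request.resource.group == "coordination.k8s.io" &amp;amp;&amp;amp; request.resource.resource == "leases")'
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Because the &lt;code>ValidatingWebhook&lt;/code> field is declared with &lt;code>patchMergeKey=name&lt;/code>, a strategic
merge patch that adds a second, differently named condition should be merged alongside
&lt;code>exclude-leases&lt;/code> rather than replacing the list.&lt;/p>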
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="security">Security&lt;/h4>
&lt;p>&lt;strong>Risk: Attacker adds or changes a match condition to weaken an admission policy.&lt;/strong>&lt;/p>
&lt;p>This does not represent a new threat, as doing so would require update access to the admission
registration object, and with that permission an attacker could already disable the policy through
manipulating match rules, namespace selector, or object selector (or reroute the webhook entirely).&lt;/p>
&lt;p>&lt;strong>Risk: Logic error in match condition expression.&lt;/strong>&lt;/p>
&lt;p>Currently the match conditions must be encoded in the webhook backend itself. Moving the logic into
a CEL expression adds a potential failure point. This can be mitigated by testing, but the CEL
ecosystem currently lacks some of the tools that would make this easier.&lt;/p>
&lt;p>Of particular significance are match conditions tied to non-functional properties of an object, such
as using labels to decide whether to opt an object out of a policy. Without additional admission
controls on who can set those non-functional properties, exempting objects from the policy based on
them could be a security vulnerability. In contrast, the
&lt;a href="#scope-an-nfs-access-management-webhook-to-pods-mounting-nfs-volumes"
>NFS example use case&lt;/a>
exempts
objects from the policy based on a &lt;em>functional&lt;/em> aspect: whether an NFS volume is mounted, and thus
whether the policy is relevant.&lt;/p>
&lt;p>These risks are inherent to the feature being proposed and cannot be mitigated through technical
means, but should be highlighted in the documentation.&lt;/p>
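&lt;p>To make the distinction concrete, the hypothetical condition below opts objects out of a policy
based on a label. The label key &lt;code>policy.example.com/exempt&lt;/code> and the condition name are invented
for illustration, and the sketch assumes the webhook rules only match CREATE and UPDATE, so that
&lt;code>object&lt;/code> is never null:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">matchConditions:
  # Risky: the exemption is keyed on a non-functional property (a label).
  - name: 'skip-exempt-labelled-objects'
    expression: '!(has(object.metadata.labels) &amp;amp;&amp;amp; "policy.example.com/exempt" in object.metadata.labels)'
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Anyone who can set that label on an object can bypass the webhook for it, so a condition like this
is only reasonable when changes to the label are themselves restricted.&lt;/p>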
&lt;h4 id="debuggability">Debuggability&lt;/h4>
&lt;p>We do not normally log, audit, or emit an event when a webhook is out-of-scope for a request, and
the same will &lt;em>mostly&lt;/em> be true for match conditions.&lt;/p>
&lt;p>At &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#what-method-to-use"
target="_blank" rel="noopener">log level V(5)&lt;/a>
,
we will emit a log when a request that would otherwise be in-scope for a webhook is excluded by a
non-matching match condition.&lt;/p>
&lt;p>Short of increasing log verbosity, the recommended debug strategy is to capture or reproduce a
relevant AdmissionRequest (for example, in a non-prod cluster disable all match conditions and log
the requests from a webhook). Then, manually test the match conditions against the request, and
iterate as necessary.&lt;/p>
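&lt;p>For orientation when testing by hand, a captured request might look roughly like the abbreviated
sketch below, shown as YAML for readability. The UID, node name, and username are placeholders;
the field layout follows the &lt;code>admission.k8s.io/v1&lt;/code> &lt;code>AdmissionReview&lt;/code> type:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Abbreviated AdmissionReview; the object and oldObject payloads are omitted.
apiVersion: admission.k8s.io/v1
kind: AdmissionReview
request:
  uid: 00000000-0000-0000-0000-000000000000   # placeholder
  operation: UPDATE
  resource:
    group: coordination.k8s.io
    version: v1
    resource: leases
  namespace: kube-node-lease
  name: node-a                                # placeholder
  userInfo:
    username: system:node:node-a              # placeholder
    groups: ['system:nodes', 'system:authenticated']
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Against this request, &lt;code>request.resource.group&lt;/code> is &lt;code>coordination.k8s.io&lt;/code> and
&lt;code>request.resource.resource&lt;/code> is &lt;code>leases&lt;/code>, and a condition such as
&lt;code>!(&amp;quot;system:nodes&amp;quot; in request.userInfo.groups)&lt;/code> should evaluate to false, so the request would
not be sent to the webhook.&lt;/p>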
&lt;h4 id="performance">Performance&lt;/h4>
&lt;p>The CEL expression evaluation will leverage the same &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2876-crd-validation-expression-language#resource-constraints"
target="_blank" rel="noopener">Resource Constraints&lt;/a>
used by CEL CRD Validation &amp;amp; CEL Admission Control. The runtime cost budgets are defined in &lt;a href="https://github.com/kubernetes/kubernetes/blob/445869a59bdbd1c587b72b52c5da94c1d1c316a1/staging/src/k8s.io/apiserver/pkg/apis/cel/config.go#L22"
target="_blank" rel="noopener">CEL Runtime Cost&lt;/a>
.&lt;/p>
&lt;p>The per-call limit is shared with Validating Admission Policy CEL expressions and is set at roughly 0.1 seconds for each expression evaluation call. The total budget per object (i.e. per ValidatingWebhook) for CEL match conditions is roughly 0.25 seconds, which is 1/4 of the Validating Admission Policy budget.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>Test cases to add:&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate enablement / disablement is a no-op when no &lt;code>matchConditions&lt;/code> are set (until graduation to GA, when the feature gate will go away)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate enablement / disablement works as expected when &lt;code>matchConditions&lt;/code> are set
(until graduation to GA, when the feature gate will go away)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Single match condition:
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Request out of scope without &lt;code>matchConditions&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Request in scope without &lt;code>matchConditions&lt;/code>, but not matching&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Request in scope without &lt;code>matchConditions&lt;/code>, and also matching&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Multiple match conditions, covering the same cases as the single-condition case&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>We will test the edge cases mostly in integration tests and unit tests.&lt;/p>
&lt;p>Once the feature is default enabled in beta, a single E2E test covering the single-match-condition
cases outlined above will be added.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind &lt;code>AdmissionWebhookMatchConditions&lt;/code> feature flag&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
implemented&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Add E2E test coverage&lt;/li>
&lt;li>Resolve resource constraints validation&lt;/li>
&lt;li>Smart reload/recompile of Webhook Accessors, see &lt;a href="https://github.com/kubernetes/kubernetes/issues/116588"
target="_blank" rel="noopener">issue&lt;/a>
&lt;/li>
&lt;li>ValidatingAdmissionPolicy is promoted to Beta.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Promote appropriate E2E tests to conformance&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/e2e/apimachinery/webhook.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/master/test/e2e/apimachinery/webhook.go&lt;/a>
&lt;/li>
&lt;li>&amp;ldquo;should be able to create and update validating webhook configurations with match conditions&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;should be able to create and update mutating webhook configurations with match conditions&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;should reject validating webhook configurations with invalid match conditions&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;should reject mutating webhook configurations with invalid match conditions&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;should mutate everything except &amp;lsquo;skip-me&amp;rsquo; configmaps&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Cover any missing test coverage&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Downgrading in a way that disables match conditions after they are already in use can increase the
scope of requests evaluated by a webhook. See
&lt;a href="#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement"
>Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/a>
for more details.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>The new field is only evaluated by the apiserver, so only HA apiserver version skew is relevant. In
this case, if the feature is enabled in one apiserver and not another, a request could
non-deterministically be sent to a webhook. Enabling match conditions without setting
&lt;code>matchConditions&lt;/code> on any webhooks is a no-op, so the version skew non-determinism is best avoided by
waiting until it has been enabled in all apiservers before starting to use the new field.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!--
This section must be completed when targeting alpha to a release.
-->
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>AdmissionWebhookMatchConditions&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kube-apiserver&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
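&lt;p>One way to set the gate is sketched below, under the assumption that &lt;code>kube-apiserver&lt;/code> runs as a
static Pod (as in kubeadm-style clusters). The manifest layout varies by distribution, and all
other flags are omitted:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Fragment of a kube-apiserver static Pod manifest (assumed layout).
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # Set to false, or drop the entry, to disable the feature again.
        - --feature-gates=AdmissionWebhookMatchConditions=true
&lt;/code>&lt;/pre>&lt;/div>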
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. If the feature is enabled, but the &lt;code>matchConditions&lt;/code> field is unset, the default behavior
remains unchanged.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. Disabling the feature gate will ignore any &lt;code>matchConditions&lt;/code> set, and return to the default
behavior. Disabling &lt;code>AdmissionWebhookMatchConditions&lt;/code> could increase the traffic to the webhook, and
potentially increase the error rate if the webhook fails to process the additional requests
correctly.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>Any &lt;code>matchConditions&lt;/code> that were already stored on existing webhooks will be enforced.&lt;/p>
&lt;p>Note: enabling &lt;code>matchConditions&lt;/code> can only reduce the number of requests being sent to a webhook (or
remain unchanged). Enabling it will never increase the number of requests.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
Additionally, for features that are introducing a new API field, unit tests that
are exercising the `switch` of feature gate itself (what happens if I disable a
feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
&lt;p>We will add tests that verify the functionality is turned off when the feature gate is toggled off and turned on when it is toggled on.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;p>In general, rollout / rollback should not fail since the feature is not enabled by default.
However, there are risks if match conditions were in use and are then unexpectedly
disabled by a rollback, since the webhook would start receiving requests that were previously excluded.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>&lt;code>webhook_admission_match_condition_evaluation_errors_total&lt;/code> is high&lt;/li>
&lt;li>&lt;code>webhook_admission_match_condition_exclusions_total&lt;/code> is too high or too low&lt;/li>
&lt;li>&lt;code>webhook_admission_match_condition_evaluation_seconds&lt;/code> is high&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;p>Not yet, but manual testing should be completed and documented prior to beta.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;p>A new per-webhook metric will measure the number of requests excluded by match conditions:&lt;/p>
&lt;p>Metric name: &lt;code>webhook_admission_match_condition_exclusions_total&lt;/code>
Labels:&lt;/p>
&lt;ul>
&lt;li>&lt;code>name&lt;/code>: webhook name&lt;/li>
&lt;li>&lt;code>type&lt;/code>: &lt;code>validate&lt;/code> or &lt;code>admit&lt;/code>&lt;/li>
&lt;li>&lt;code>kind&lt;/code>: match condition on a &lt;code>webhook&lt;/code> or &lt;code>policy&lt;/code>&lt;/li>
&lt;li>&lt;code>operation&lt;/code>: the admission operation&lt;/li>
&lt;/ul>
&lt;p>Metric name: &lt;code>webhook_admission_match_condition_evaluation_errors_total&lt;/code>
Labels:&lt;/p>
&lt;ul>
&lt;li>&lt;code>name&lt;/code>: webhook name&lt;/li>
&lt;li>&lt;code>type&lt;/code>: &lt;code>validate&lt;/code> or &lt;code>admit&lt;/code>&lt;/li>
&lt;li>&lt;code>kind&lt;/code>: match condition on a &lt;code>webhook&lt;/code> or &lt;code>policy&lt;/code>&lt;/li>
&lt;li>&lt;code>operation&lt;/code>: the admission operation&lt;/li>
&lt;/ul>
&lt;p>Metric name: &lt;code>webhook_admission_match_condition_evaluation_seconds&lt;/code>
Labels:&lt;/p>
&lt;ul>
&lt;li>&lt;code>name&lt;/code>: webhook name&lt;/li>
&lt;li>&lt;code>type&lt;/code>: &lt;code>validate&lt;/code> or &lt;code>admit&lt;/code>&lt;/li>
&lt;li>&lt;code>kind&lt;/code>: match condition on a &lt;code>webhook&lt;/code> or &lt;code>policy&lt;/code>&lt;/li>
&lt;li>&lt;code>operation&lt;/code>: the admission operation&lt;/li>
&lt;/ul>
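&lt;p>As a rough illustration of how these metrics might be consumed, the sketch below sums one of the counters per webhook &lt;code>name&lt;/code> label from Prometheus text-format output. The sample lines, webhook names and values are invented for illustration, and the helper &lt;code>totals_by_webhook&lt;/code> is hypothetical rather than part of any Kubernetes tooling.&lt;/p>

```python
import re
from collections import defaultdict

# Hypothetical excerpt of kube-apiserver metrics output; the webhook names,
# operations and counts are invented for illustration.
SAMPLE = '''\
webhook_admission_match_condition_exclusions_total{kind="webhook",name="a.example.com",operation="CREATE",type="validate"} 12
webhook_admission_match_condition_exclusions_total{kind="webhook",name="a.example.com",operation="UPDATE",type="validate"} 3
webhook_admission_match_condition_evaluation_errors_total{kind="webhook",name="b.example.com",operation="CREATE",type="admit"} 1
'''

# One sample line looks like: metric_name{label="value",...} number
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def totals_by_webhook(text, metric):
    """Sum all samples of `metric`, grouped by the `name` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        m = LINE.match(line)
        if m is None or m.group(1) != metric:
            continue  # comment line, blank line, or a different metric
        labels = dict(LABEL.findall(m.group(2)))
        totals[labels.get("name", "")] += float(m.group(3))
    return dict(totals)
```

&lt;p>A per-webhook total like this could then be compared between two scrapes to judge whether exclusions or evaluation errors are unexpectedly high.&lt;/p>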
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;p>The metric &lt;code>webhook_admission_match_condition_evaluation_seconds&lt;/code> should indicate whether match conditions
are in use and are being evaluated to decide whether to invoke webhooks.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:
&lt;ul>
&lt;li>Check the preconditions field in the webhook object and check the &lt;code>webhook_admission_match_condition_exclusions_total&lt;/code> metric for exclusions&lt;/li>
&lt;li>Check &lt;code>webhook_admission_match_condition_evaluation_errors_total&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;ul>
&lt;li>Only negligible impact to admission latency due to evaluation of CEL rules&lt;/li>
&lt;li>CEL evaluation time (&lt;code>webhook_admission_match_condition_evaluation_seconds&lt;/code>)&lt;/li>
&lt;li>CEL evaluation errors (&lt;code>webhook_admission_match_condition_evaluation_errors_total&lt;/code>)&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:
&lt;ul>
&lt;li>&lt;code>webhook_admission_match_condition_evaluation_seconds&lt;/code>&lt;/li>
&lt;li>&lt;code>webhook_admission_match_condition_evaluation_errors_total&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;p>Yes. The following metrics, which would improve observability of this feature, will be considered for Beta:&lt;/p>
&lt;ul>
&lt;li>&lt;code>webhook_admission_match_condition_evaluation_seconds&lt;/code>&lt;/li>
&lt;li>&lt;code>webhook_admission_match_condition_exclusions_total&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;p>No, this feature only adds new fields to existing webhook APIs&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;p>Yes, it can increase the size of webhook configuration objects. There is a limit on the number of preconditions
a webhook can have; however, webhook objects can still grow significantly if large expressions are used.&lt;/p>
&lt;ul>
&lt;li>API type(s): ValidatingWebhookConfiguration, MutatingWebhookConfiguration&lt;/li>
&lt;li>Estimated increase in size: depends on the size of the CEL expressions, but should be negligible in most cases&lt;/li>
&lt;li>Estimated amount of new objects: none&lt;/li>
&lt;/ul>
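&lt;p>To make the size estimate concrete, the sketch below measures how many bytes a single condition adds to a webhook entry when encoded as compact JSON. The field names &lt;code>matchConditions&lt;/code>, &lt;code>name&lt;/code> and &lt;code>expression&lt;/code>, the webhook entry and the CEL expression are all assumptions made for illustration; the stored encoding in a real cluster may differ.&lt;/p>

```python
import json


def encoded_size(obj):
    """Return the size in bytes of obj encoded as compact UTF-8 JSON."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def added_bytes(before, after):
    """Return how many bytes `after` is larger than `before` when encoded."""
    return encoded_size(after) - encoded_size(before)


# Hypothetical webhook entry; the names and values are examples only.
webhook = {
    "name": "policy.example.com",
    "admissionReviewVersions": ["v1"],
    "sideEffects": "None",
}

# The same entry with one condition attached (field names are assumptions).
with_conditions = dict(webhook, matchConditions=[
    {"name": "exclude-leases",
     "expression": 'request.resource.resource != "leases"'},
])

print(added_bytes(webhook, with_conditions))
```

&lt;p>The growth is dominated by the length of the expression string, which is why a cap on the number of conditions alone does not bound the object size.&lt;/p>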
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;p>Yes, it can impact the latency SLI/SLO if evaluating CEL expressions adds significant latency.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
This through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;p>Yes, it has the potential to increase CPU usage in kube-apiserver if a webhook with many precondition rules intercepts many requests.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>See &lt;a href="#debuggability"
>Debuggability&lt;/a>
.&lt;/p>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>N/A &amp;ndash; since the feature is part of kube-apiserver.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;p>N/A&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>The feature can be disabled per webhook by removing its preconditions, or for all webhooks by disabling the feature gate in kube-apiserver.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;h3 id="cross-webhook-match-conditions">Cross-webhook match conditions&lt;/h3>
&lt;p>In the future, we should explore ways to apply common match conditions across multiple webhooks.&lt;/p>
&lt;p>Example use cases:&lt;/p>
&lt;ul>
&lt;li>Apply a &lt;a href="#exempt-system-users-from-security-policy"
>break-glass exemption&lt;/a>
across many (or all) webhooks.&lt;/li>
&lt;li>Managed cluster provider wants to exempt provider-managed resources from user-managed webhooks.&lt;/li>
&lt;/ul>
&lt;p>Considerations:&lt;/p>
&lt;ul>
&lt;li>Access by managed cluster provider vs. cluster admin&lt;/li>
&lt;li>Side effects &amp;amp; mutations&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="exclusion-expressions">Exclusion Expressions&lt;/h3>
&lt;p>The &lt;code>matchCondition&lt;/code> expression could be inverted, so that requests that match are excluded rather than
included. In this case, we would probably also want to change from requiring all expressions to
match, to excluding the request if any match.&lt;/p>
&lt;p>Although this approach would simplify some use cases, such as
&lt;a href="#exclude-resources-from-a-wildcard-rule"
>excluding resources from a wildcard rule&lt;/a>
or
&lt;a href="#exempt-system-users-from-security-policy"
>exempting system users from a security policy&lt;/a>
, it
means other expressions would become double-negatives, which generally goes against API design
best-practices.&lt;/p>
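&lt;p>The difference between the two semantics can be sketched with plain Python predicates standing in for CEL expressions. The request shape and the predicates &lt;code>is_lease&lt;/code> and &lt;code>is_system_user&lt;/code> are hypothetical, chosen only to mirror the use cases named above.&lt;/p>

```python
def included_by_match_conditions(request, conditions):
    """Proposed semantics: call the webhook only if every condition matches."""
    return all(cond(request) for cond in conditions)


def included_by_exclusion_expressions(request, exclusions):
    """Inverted alternative: skip the webhook if any exclusion matches."""
    return not any(excl(request) for excl in exclusions)


# Hypothetical predicates standing in for CEL expressions.
def is_lease(request):
    return request["resource"] == "leases"


def is_system_user(request):
    return request["user"].startswith("system:")


def negate(pred):
    """Wrap a predicate so that it matches exactly when `pred` does not."""
    return lambda request: not pred(request)


request = {"resource": "pods", "user": "alice"}
print(included_by_match_conditions(request, [negate(is_lease), negate(is_system_user)]))
print(included_by_exclusion_expressions(request, [is_lease, is_system_user]))
```

&lt;p>Both calls agree for every request: an exclusion list is equivalent to a match-condition list of negated predicates. The cost shifts to whichever kind of rule is less common, since a positive condition expressed as an exclusion has to be written as "exclude unless", which is the double negative mentioned above.&lt;/p>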
&lt;h3 id="resource-exclusions">Resource Exclusions&lt;/h3>
&lt;p>&lt;a href="https://github.com/kubernetes/enhancements/issues/3693"
target="_blank" rel="noopener">KEP-3693&lt;/a>
proposes an alternative approach
using a more structured format for expressing resource exclusions. This approach may be more
approachable to users who are not comfortable writing CEL expressions, but it is significantly less
powerful. This would address
&lt;a href="#exclude-resources-from-a-wildcard-rule"
>Exclude resources from a wildcard rule&lt;/a>
,
and could be extended with subject exclusions to
address &lt;a href="#exempt-system-users-from-security-policy"
>Exempt system users from security policy&lt;/a>
, but
would not be sufficient to address
&lt;a href="#scope-an-nfs-access-management-webhook-to-pods-mounting-nfs-volumes"
>Scope an NFS access management webhook to Pods mounting NFS volumes&lt;/a>
.&lt;/p>
&lt;p>These two approaches are not mutually exclusive.&lt;/p>
&lt;h3 id="cel-admission-control">CEL Admission Control&lt;/h3>
&lt;p>&lt;a href="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3488-cel-admission-control"
target="_blank" rel="noopener">KEP-3488: CEL for Admission Control&lt;/a>
adds the
ability for admission webhooks to be replaced entirely by CEL expressions, but this is not intended
to cover 100% of webhook use cases. For example, the user story described in
&lt;a href="#scope-an-nfs-access-management-webhook-to-pods-mounting-nfs-volumes"
>Scope an NFS access management webhook to Pods mounting NFS volumes&lt;/a>
requires integrating with a third-party system, and is not implementable through a CEL
ValidatingAdmissionPolicy.&lt;/p>
&lt;p>With a mutating CEL admission policy (not yet implemented), a combination of mutating &amp;amp; validating
policies could ensure that objects have a designated scoping label applied, which could be filtered
using the &lt;code>ObjectSelector&lt;/code> on the webhook. However, such an approach adds a lot of overhead and
complexity beyond this proposal.&lt;/p></description></item><item><title>Resources: Aggregated Discovery</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3352/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3352/</guid><description>
&lt;h1 id="kep-3352-aggregated-discovery">KEP-3352: Aggregated Discovery&lt;/h1>
&lt;!-- This is the title of your KEP. Keep it short, simple, and
descriptive. A good title can help communicate what the KEP is and
should be considered as part of any review. -->
&lt;!-- A table of contents is helpful for quickly jumping to sections of
a KEP and for highlighting any additional information provided beyond
the standard KEP template.
Ensure the TOC is wrapped with &lt;code>&amp;lt;!-- toc --&amp;gt;&amp;lt;!-- /toc
--&amp;gt;&lt;/code> tags, and then generate with `hack/update-toc.sh`. -->
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api"
>API&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#aggregation"
>Aggregation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#client"
>Client&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#deprecation"
>Deprecation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!-- **ACTION REQUIRED:** In order to merge code into a release, there
must be an issue in [kubernetes/enhancements] referencing this KEP and
targeting a release milestone **before the [Enhancement
Freeze](https://git.k8s.io/sig-release/releases) of the targeted
release**.
For enhancements that make changes to code or processes/procedures in
core Kubernetes—i.e., [kubernetes/kubernetes], we require the
following Release Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track.
These checklist items _must_ be updated for the enhancement to be
released. -->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone
/ release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP
dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as
&lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG
Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance
Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake
free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA
Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance
Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in
&lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents,
links to mailing list discussions/SIG meetings, relevant
PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!-- **Note:** This checklist is iterative and should be reviewed and
updated every time this enhancement is being considered for a
milestone. -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;!-- This section is incredibly important for producing high-quality,
user-focused documentation such as release notes or a development
roadmap. It should be possible to collect this information before
implementation begins, in order to avoid requiring implementors to
split their attention between writing release notes and implementing
the feature itself. KEP editors and SIG Docs should help to ensure
that the tone and content of the `Summary` section is useful for a
wide audience.
A good summary is probably at least a paragraph in length.
Both in this section and below, follow the guidelines of the
[documentation style guide]. In particular, wrap lines to a reasonable
length, to make it easier for reviewers to cite specific portions, and
to minimize diff churn on updates.
-->
&lt;p>The operations that a Kubernetes API server supports are reported
through a collection of small documents partitioned by group-version.
All clients of Kubernetes APIs must send a request to every
group-version in order to &amp;ldquo;discover&amp;rdquo; the available APIs. This causes a
storm of requests for clusters and is a source of latency and
throttling. When new types are added to the API, clients must fetch the
types again, which causes an additional storm of requests. This KEP
proposes centralizing the &amp;ldquo;discovery&amp;rdquo; mechanism into two aggregated
documents so clients do not need to send a storm of requests to the
API server to retrieve all the operations available.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;!-- This section is for explicitly listing the motivation, goals, and
non-goals of this KEP. Describe why the change is important and the
benefits to users. The motivation section can optionally provide links
to [experience reports] to demonstrate the interest in a KEP within
the wider Kubernetes community.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports -->
&lt;p>All clients and users of Kubernetes APIs usually first need to
“discover” what the available APIs are and how they can be used. These
APIs are described through a mechanism called “Discovery” which is
typically queried to then build the requests to correct APIs.
Unfortunately, the “Discovery” API is made up of many small objects
that need to be queried individually, which can cause considerable delay
due to the latency of each individual request (up to 80 requests, with
most objects being less than 1,024 bytes). The more numerous the APIs
provided by the Kubernetes cluster, the more requests need to be
performed.&lt;/p>
&lt;p>The best-known Kubernetes client that uses the discovery
mechanism is &lt;code>kubectl&lt;/code>, and more specifically the
&lt;code>CachedDiscoveryClient&lt;/code> in &lt;code>client-go&lt;/code>. To mitigate some of this
latency, kubectl has implemented a 6-hour timer during which the
discovery API is not refreshed. The drawback of this approach is that
the freshness of the cache is doubtful and the entire discovery API
needs to be refreshed after 6 hours, even if it hasn’t expired. Other
clients such as the OpenShift UI have slow loading times due to the
browser limit on the number of parallel requests that can be made.&lt;/p>
&lt;p>This primarily concerns clients that need a discovery cache and need
to frequently poll the apiserver for the latest discovery information.
Clients include kubectl, web interfaces, controllers, etc.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Fix the discovery storm issue that clients face when first loading the discovery document&lt;/li>
&lt;li>On an update to the discovery document, efficiently allow clients to detect new types for appropriate decisions to be made&lt;/li>
&lt;li>Aggregate the discovery documents for all Kubernetes types&lt;/li>
&lt;/ul>
&lt;!-- List the specific goals of the KEP. What is it trying to achieve?
How will we know that this has succeeded? -->
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;!-- What is out of scope for this KEP? Listing non-goals helps to
focus discussion and make progress. -->
&lt;p>Since the current discovery separated by group-version is already GA,
removal of the endpoint will not be attempted. There are still use
cases for publishing the discovery document per group-version and this
KEP will solely focus on introducing the new aggregated endpoint.&lt;/p>
&lt;p>Watchable discovery is also outside the scope of this KEP. Polling
with ETag support is sufficient for most users.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We propose augmenting the current discovery endpoints at &lt;code>/api&lt;/code>
and &lt;code>/apis&lt;/code> with a new content negotiation accept type. When a
client requests this type, the endpoints will serve an aggregated discovery
document that contains the resources for all group versions. ETag support
will be provided so that clients that already have the latest version of the
aggregated discovery document can avoid redownloading it.&lt;/p>
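&lt;p>The ETag behaviour can be sketched as a conditional GET: the server derives a tag from the document, and a client that presents the same tag receives a 304 with no body. The hashing scheme (SHA-256 over canonical JSON) and the function names &lt;code>etag_for&lt;/code> and &lt;code>serve_discovery&lt;/code> are assumptions made for illustration; this text does not specify how the API server computes its ETag.&lt;/p>

```python
import hashlib
import json


def etag_for(document):
    """Derive a quoted ETag from the canonical JSON encoding of the document.

    Assumption: a content hash is used. Any scheme works as long as the tag
    changes whenever the document changes.
    """
    body = json.dumps(document, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def serve_discovery(document, if_none_match=None):
    """Return (status, headers, body) the way a conditional GET would."""
    etag = etag_for(document)
    if if_none_match == etag:
        # The client already holds this exact version: send no body.
        return 304, {"ETag": etag}, None
    return 200, {"ETag": etag}, document


# First request downloads the document; the repeat request sends the tag back.
doc = {"groups": ["apps"]}
status, headers, body = serve_discovery(doc)
repeat_status, _, repeat_body = serve_discovery(doc, if_none_match=headers["ETag"])
```

&lt;p>A polling client would store the &lt;code>ETag&lt;/code> value from the last 200 response and send it back as &lt;code>If-None-Match&lt;/code> on every subsequent poll.&lt;/p>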
&lt;p>We will add a new controller responsible for aggregating the discovery
documents when a resource on the cluster changes. There will be no
conflicts when aggregating since each discovery document is
self-contained.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;p>The following design notes explain why the group version for the new discovery types is &lt;code>apidiscovery/v2beta1&lt;/code>. &lt;a href="https://github.com/kubernetes/kubernetes/pull/111978#discussion_r979015557"
target="_blank" rel="noopener">Link to the full comment&lt;/a>
&lt;/p>
&lt;ol>
&lt;li>Discovery is a non-resource API class&lt;/li>
&lt;li>As a non-resource API class, once the feature gate is
&amp;ldquo;on-by-default&amp;rdquo; the API is required to be stable (only additive
features)&lt;/li>
&lt;li>Non-resource APIs that are &amp;ldquo;off-by-default&amp;rdquo; do not promise
stability&lt;/li>
&lt;li>A non-resource API that has to change before promotion to
&amp;ldquo;on-by-default&amp;rdquo; must represent incompatible changes somehow to
clients (if the version is &amp;ldquo;v1&amp;rdquo; and then we find a bug, we would
have to rev to &amp;ldquo;v2&amp;rdquo; before &amp;ldquo;on-by-default&amp;rdquo;, which means &amp;ldquo;v1&amp;rdquo; might
not ever be exposed to end users)&lt;/li>
&lt;li>Unversioned net new endpoints (/healthz) are effectively v1 even if
they are &amp;ldquo;off-by-default&amp;rdquo;&lt;/li>
&lt;li>We don&amp;rsquo;t want to have multiple endpoints for discovery because it&amp;rsquo;s
confusing for users and defeats the purpose of making discovery
more efficient, and we have a way to do that with negotiation&lt;/li>
&lt;li>We think there is value in a new API type (APIGroupDiscovery) which
simplifies client logic, but it comes with a small risk of not
being correct&lt;/li>
&lt;li>We have a good idea of what the API looks like due to a previous
v1, so we are evolving an existing API and are not &amp;ldquo;completely
flying blind&amp;rdquo; (i.e. implying this is really an alpha api)&lt;/li>
&lt;li>While we aren&amp;rsquo;t exactly like an unversioned new endpoint (v1 from
start), we want to deliver the feature (improves clients) without
giving the perception that the API is perfect&lt;/li>
&lt;/ol>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>The current discovery endpoints &lt;code>/api&lt;/code> and &lt;code>/apis&lt;/code> will accept a new
content negotiation type &lt;code>APIGroupDiscoveryList&lt;/code>, representing an
aggregated discovery document.&lt;/p>
&lt;p>Clients requesting the aggregated document will send a request with
&lt;code>as&lt;/code> (kind), &lt;code>v&lt;/code> (version), and &lt;code>g&lt;/code> (group) set as part of the
&lt;code>Accept&lt;/code> header. For example, a client requesting the &lt;code>v2beta1&lt;/code>
version will send &lt;code>Accept: application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io&lt;/code>.&lt;/p>
&lt;p>Clients should send an &lt;code>Accept&lt;/code> header listing all acceptable responses
in order of preference. This avoids sending additional requests to the same endpoint if the preferred version is unavailable. The default accept type will not change:
omitting the content negotiation type will default to the unaggregated
&lt;code>APIGroupList&lt;/code> type. Requests should include &lt;code>application/json&lt;/code> or
&lt;code>application/vnd.kubernetes.protobuf&lt;/code> as a fallback option in case the
server does not support the aggregated type (e.g. different version,
feature disabled). For instance, &lt;code>Accept: application/json;as=APIGroupDiscoveryList;v=v2;g=apidiscovery.k8s.io,application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json&lt;/code>
requests the aggregated discovery v2 type, the aggregated
discovery v2beta1 type, and the unaggregated v1 type, in that order. The
server will return the first option that it supports.&lt;/p>
&lt;p>Refer to the Version Skew Strategy section for more information on how backwards compatibility
is maintained by both the client and server when the types are promoted from v2beta1 to v2.&lt;/p>
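&lt;p>The ordered &lt;code>Accept&lt;/code> header and the rule that the server returns the first supported option can be illustrated with a minimal Go sketch. The &lt;code>acceptOption&lt;/code> type and the &lt;code>buildAcceptHeader&lt;/code> and &lt;code>firstSupported&lt;/code> helpers are hypothetical names invented for this example and are not part of client-go or the apiserver. The exact string match used on the server side is also an assumption; a real server parses media types properly.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// acceptOption is a hypothetical helper type describing one Accept entry.
type acceptOption struct {
	MediaType string // for example "application/json"
	Kind      string // value of "as"; empty means a plain media type
	Version   string // value of "v"
	Group     string // value of "g"
}

// String renders the option in the form used by the examples in this KEP.
func (o acceptOption) String() string {
	if o.Kind == "" {
		return o.MediaType
	}
	return fmt.Sprintf("%s;as=%s;v=%s;g=%s", o.MediaType, o.Kind, o.Version, o.Group)
}

// buildAcceptHeader joins the options in the client's order of preference.
func buildAcceptHeader(opts []acceptOption) string {
	parts := make([]string, 0, len(opts))
	for _, o := range opts {
		parts = append(parts, o.String())
	}
	return strings.Join(parts, ",")
}

// firstSupported mimics the server side: it walks the header in order and
// returns the first entry found in the supported set (exact match, assumed).
func firstSupported(header string, supported map[string]bool) (string, bool) {
	for _, candidate := range strings.Split(header, ",") {
		candidate = strings.TrimSpace(candidate)
		if supported[candidate] {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	const kind, group = "APIGroupDiscoveryList", "apidiscovery.k8s.io"
	header := buildAcceptHeader([]acceptOption{
		{MediaType: "application/json", Kind: kind, Version: "v2", Group: group},
		{MediaType: "application/json", Kind: kind, Version: "v2beta1", Group: group},
		{MediaType: "application/json"}, // unaggregated fallback
	})
	fmt.Println("Accept:", header)

	// A server that knows only v2beta1 and the unaggregated type.
	supported := map[string]bool{
		"application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io": true,
		"application/json": true,
	}
	chosen, ok := firstSupported(header, supported)
	fmt.Println("server picks:", chosen, ok)
}
```

&lt;p>Under these assumptions, a server that supports only v2beta1 skips the v2 entry and answers with the v2beta1 aggregated type, so the client does not need a second request.&lt;/p>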
&lt;h3 id="api">API&lt;/h3>
&lt;p>The contents of this endpoint will be an &lt;code>APIGroupDiscoveryList&lt;/code>,
containing a list of &lt;code>APIGroupDiscovery&lt;/code> objects, with each group including a
list of versions (&lt;code>APIVersionDiscovery&lt;/code>). Each &lt;code>APIVersionDiscovery&lt;/code>
will include a list of &lt;code>APIResourceDiscovery&lt;/code> objects. There are a couple of
minor changes in &lt;code>APIResourceDiscovery&lt;/code> compared to the
current &lt;code>APIResource&lt;/code> object, but all states expressible with the
current API will be representable in the new API.&lt;/p>
&lt;p>The endpoint will also publish an ETag, calculated from a hash of
the data, for clients to use in conditional requests.&lt;/p>
&lt;p>These types will live in the &lt;code>apidiscovery/v2&lt;/code> group version.&lt;/p>
&lt;p>This is what the new API will look like.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// APIGroupDiscoveryList is a resource containing a list of APIGroupDiscovery.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// This is what is returned from the /api and /apis endpoints when the aggregated type is negotiated, and is used to discover&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// the list of API resources (built-ins, Custom Resource Definitions, resources from aggregated servers)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// that a cluster supports.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APIGroupDiscoveryList &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TypeMeta &lt;span style="color:#b44">`json:&amp;#34;,inline&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ResourceVersion will not be set, because this does not have a replayable ordering among multiple apiservers.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ListMeta &lt;span style="color:#b44">`json:&amp;#34;metadata,omitempty&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=metadata&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// items is the list of groups for discovery.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Items []APIGroupDiscovery &lt;span style="color:#b44">`json:&amp;#34;items&amp;#34; protobuf:&amp;#34;bytes,2,rep,name=items&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// APIGroupDiscovery holds information about which resources are being served for all versions of the API Group.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// It contains a list of APIVersionDiscovery that holds a list of APIResourceDiscovery types served for a version.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Versions are in descending order of preference, with the first version being the preferred entry.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APIGroupDiscovery &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TypeMeta &lt;span style="color:#b44">`json:&amp;#34;,inline&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Standard object&amp;#39;s metadata.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The only field completed will be name. For instance, resourceVersion will be empty.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// name is the name of the API group whose discovery information is presented here.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// name is allowed to be &amp;#34;&amp;#34; to represent the legacy, ungroupified resources.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ObjectMeta &lt;span style="color:#b44">`json:&amp;#34;metadata,omitempty&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=metadata&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// versions are the versions supported in this group. They are sorted in descending order of preference,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// with the preferred version being the first entry.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=map&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listMapKey=version&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Versions []APIVersionDiscovery &lt;span style="color:#b44">`json:&amp;#34;versions,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,rep,name=versions&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// APIVersionDiscovery holds a list of APIResourceDiscovery types that are served for a particular version within an API Group.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APIVersionDiscovery &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// version is the name of the version within a group version.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Version &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;version&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=version&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// resources is a list of APIResourceDiscovery objects for the corresponding group version.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=map&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listMapKey=resource&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Resources []APIResourceDiscovery &lt;span style="color:#b44">`json:&amp;#34;resources,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,rep,name=resources&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// freshness marks whether a group version&amp;#39;s discovery document is up to date.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// &amp;#34;Current&amp;#34; indicates no problems when fetching the discovery document. &amp;#34;Stale&amp;#34; indicates&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// that there was an error fetching the discovery document, and the current version may not&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// be up to date.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Freshness DiscoveryFreshness &lt;span style="color:#b44">`json:&amp;#34;freshness,omitempty&amp;#34; protobuf:&amp;#34;bytes,3,opt,name=freshness&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// APIResourceDiscovery provides information about an API resource for discovery.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APIResourceDiscovery &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// resource is the plural name of the resource. This is used in the URL path and is the unique identifier&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// for this resource across all versions in the API group.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// resources with non-&amp;#34;&amp;#34; groups are located at /apis/&amp;lt;APIGroupDiscovery.objectMeta.name&amp;gt;/&amp;lt;APIVersionDiscovery.version&amp;gt;/&amp;lt;APIResourceDiscovery.Resource&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// resources with &amp;#34;&amp;#34; groups are located at /api/v1/&amp;lt;APIResourceDiscovery.Resource&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Resource &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;resource&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=resource&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// responseKind describes the type of serialization that will typically be returned from this endpoint.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// APIs may return other objects types at their discretion, such as error conditions, requests for alternate representations, or other operation specific behavior.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResponseKind GroupVersionKind &lt;span style="color:#b44">`json:&amp;#34;responseKind&amp;#34; protobuf:&amp;#34;bytes,2,opt,name=responseKind&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// scope indicates the scope of a resource, either Cluster or Namespaced&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Scope ResourceScope &lt;span style="color:#b44">`json:&amp;#34;scope&amp;#34; protobuf:&amp;#34;bytes,3,opt,name=scope&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// singularResource is the singular name of the resource. This allows clients to handle plural and singular opaquely.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// For many clients the singular form of the resource will be more understandable to users reading messages and should be used when integrating the name of the resource into a sentence.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The command line tool kubectl, for example, allows use of the singular resource name in place of plurals.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The singular form of a resource should always be an optional element - when in doubt use the canonical resource name.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> SingularResource &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;singularResource&amp;#34; protobuf:&amp;#34;bytes,4,opt,name=singularResource&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// verbs is a list of supported API operation types (this includes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// but is not limited to get, list, watch, create, update, patch,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// delete, deletecollection, and proxy)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Verbs Verbs &lt;span style="color:#b44">`json:&amp;#34;verbs&amp;#34; protobuf:&amp;#34;bytes,5,opt,name=verbs&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// shortNames is a list of suggested short names of the resource.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ShortNames []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;shortNames,omitempty&amp;#34; protobuf:&amp;#34;bytes,6,rep,name=shortNames&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// categories is a list of the grouped resources this resource belongs to (e.g. &amp;#39;all&amp;#39;).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Clients may use this to simplify acting on multiple resource types at once.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Categories []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;categories,omitempty&amp;#34; protobuf:&amp;#34;bytes,7,rep,name=categories&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// subresources is a list of subresources provided by this resource. Subresources are located at /apis/&amp;lt;APIGroupDiscovery.objectMeta.name&amp;gt;/&amp;lt;APIVersionDiscovery.version&amp;gt;/&amp;lt;APIResourceDiscovery.Resource&amp;gt;/name-of-instance/&amp;lt;APIResourceDiscovery.subresources[i].subresource&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=map&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listMapKey=subresource&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Subresources []APISubresourceDiscovery &lt;span style="color:#b44">`json:&amp;#34;subresources,omitempty&amp;#34; protobuf:&amp;#34;bytes,8,rep,name=subresources&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ResourceScope is an enum defining the different scopes available to a resource.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResourceScope &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ScopeCluster ResourceScope = &lt;span style="color:#b44">&amp;#34;Cluster&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ScopeNamespace ResourceScope = &lt;span style="color:#b44">&amp;#34;Namespaced&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// DiscoveryFreshness is an enum defining whether the Discovery document published by an apiservice is up to date (fresh).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> DiscoveryFreshness &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DiscoveryFreshnessCurrent DiscoveryFreshness = &lt;span style="color:#b44">&amp;#34;Current&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DiscoveryFreshnessStale DiscoveryFreshness = &lt;span style="color:#b44">&amp;#34;Stale&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// APISubresourceDiscovery provides information about an API subresource for discovery.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APISubresourceDiscovery &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// subresource is the name of the subresource. This is used in the URL path and is the unique identifier&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// for this resource across all versions.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Subresource &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;subresource&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=subresource&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// responseKind describes the type of serialization that will be returned from this endpoint.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Some subresources do not return normal resources, these will have nil return types.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResponseKind &lt;span style="color:#666">*&lt;/span>GroupVersionKind &lt;span style="color:#b44">`json:&amp;#34;responseKind,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,opt,name=responseKind&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// acceptedTypes describes the kinds that this endpoint accepts. It is possible for a subresource to accept multiple kinds.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// It is also possible for an endpoint to accept no standard types. Those will have a zero length list.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> AcceptedTypes []GroupVersionKind &lt;span style="color:#b44">`json:&amp;#34;acceptedTypes,omitempty&amp;#34; protobuf:&amp;#34;bytes,3,rep,name=acceptedTypes&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// verbs is a list of supported kube verbs: get, list, watch, create,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// update, patch, delete&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Verbs Verbs &lt;span style="color:#b44">`json:&amp;#34;verbs&amp;#34; protobuf:&amp;#34;bytes,4,opt,name=verbs&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="aggregation">Aggregation&lt;/h3>
&lt;p>For the aggregation layer on the server, a new controller will be
created to aggregate discovery for built-in types, apiextensions types
(CRDs), and types from aggregated api servers.&lt;/p>
&lt;p>A post start hook will be added and the kube-apiserver health check
should only pass if the discovery document is ready. Since aggregated
api servers may take longer to respond and we do not want to delay
cluster startup, the health check will only block on the local api
servers (built-ins and CRDs) to have their discovery ready. For aggregated api
servers whose discovery has not yet been fetched, their group-versions will be
published with an empty resource list and a &lt;code>Freshness&lt;/code> of
&lt;code>Stale&lt;/code> to indicate that they have not synced yet.&lt;/p>
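&lt;p>The aggregation rule above can be sketched as follows. The types are simplified stand-ins for the API types in this KEP, and &lt;code>groupVersion&lt;/code>, &lt;code>aggregate&lt;/code>, and the alphabetical ordering of groups are assumptions made for this example rather than the controller's actual design.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the KEP types; only the fields needed here.
type versionDiscovery struct {
	Version   string
	Resources []string
	Freshness string // "Current" or "Stale"
}

type groupDiscovery struct {
	Name     string
	Versions []versionDiscovery
}

// groupVersion is a hypothetical key identifying one group-version pair.
type groupVersion struct{ Group, Version string }

// aggregate builds the merged document. known lists every registered
// group-version; synced holds the resources of those already fetched.
// Anything not yet synced is published empty and marked "Stale".
func aggregate(known []groupVersion, synced map[groupVersion][]string) []groupDiscovery {
	byGroup := map[string][]versionDiscovery{}
	for _, gv := range known {
		v := versionDiscovery{Version: gv.Version, Freshness: "Stale"}
		if res, ok := synced[gv]; ok {
			v.Resources = res
			v.Freshness = "Current"
		}
		byGroup[gv.Group] = append(byGroup[gv.Group], v)
	}

	// Sort group names so the output is deterministic (an assumption).
	names := make([]string, 0, len(byGroup))
	for name := range byGroup {
		names = append(names, name)
	}
	sort.Strings(names)

	out := make([]groupDiscovery, 0, len(names))
	for _, name := range names {
		out = append(out, groupDiscovery{Name: name, Versions: byGroup[name]})
	}
	return out
}

func main() {
	known := []groupVersion{{"apps", "v1"}, {"metrics.k8s.io", "v1beta1"}}
	// Only the local group has been synced; the aggregated server has not.
	synced := map[groupVersion][]string{
		{"apps", "v1"}: {"deployments", "statefulsets"},
	}
	for _, g := range aggregate(known, synced) {
		for _, v := range g.Versions {
			fmt.Println(g.Name, v.Version, v.Freshness, v.Resources)
		}
	}
}
```

&lt;p>Each group-version is self-contained, so the merge is a plain grouping step with no conflict resolution, which matches the claim in the Proposal section.&lt;/p>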
&lt;h3 id="client">Client&lt;/h3>
&lt;p>The &lt;code>client-go&lt;/code> interface will be modified to add a new method to
retrieve the aggregated discovery document, and &lt;code>kubectl&lt;/code> will be the
first client to adopt it. As a starting point, &lt;code>kubectl api-resources&lt;/code> should
use the aggregated discovery document instead of sending a storm of
requests.&lt;/p>
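&lt;p>To show what a client gains from a single aggregated response, the sketch below decodes one document and derives the URL path of every resource, following the path rules in the &lt;code>APIResourceDiscovery&lt;/code> comments. It deliberately does not use client-go: the struct definitions are minimal mirrors of the API types, and &lt;code>resourcePaths&lt;/code> is a hypothetical helper written for this example.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal mirrors of the KEP types, limited to the fields read below.
type resource struct {
	Resource string `json:"resource"`
}

type version struct {
	Version   string     `json:"version"`
	Resources []resource `json:"resources"`
}

type group struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Versions []version `json:"versions"`
}

type groupList struct {
	Items []group `json:"items"`
}

// resourcePaths returns the URL path of every resource in the document.
// Resources in the legacy "" group live under /api, all others under /apis.
func resourcePaths(doc []byte) ([]string, error) {
	list := new(groupList)
	if err := json.Unmarshal(doc, list); err != nil {
		return nil, err
	}
	var paths []string
	for _, g := range list.Items {
		for _, v := range g.Versions {
			prefix := "/apis/" + g.Metadata.Name + "/" + v.Version
			if g.Metadata.Name == "" {
				prefix = "/api/" + v.Version
			}
			for _, r := range v.Resources {
				paths = append(paths, prefix+"/"+r.Resource)
			}
		}
	}
	return paths, nil
}

func main() {
	// A hand-written sample document, not real server output.
	doc := []byte(`{"items":[
	  {"metadata":{"name":""},"versions":[
	    {"version":"v1","resources":[{"resource":"pods"}]}]},
	  {"metadata":{"name":"apps"},"versions":[
	    {"version":"v1","resources":[{"resource":"deployments"}]}]}
	]}`)
	paths, err := resourcePaths(doc)
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

&lt;p>One decoded document covers every group-version here, whereas the unaggregated flow would need one request per group-version to collect the same information.&lt;/p>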
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require
updates to existing tests to make this code solid enough prior to
committing the changes necessary to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>k8s.io/apiserver/pkg/endpoints/discovery/aggregated: 77.4
&lt;ul>
&lt;li>Note that the &lt;code>fake.go&lt;/code> file has no unit test coverage as it is a utility designed to be used by integration tests. The rest of the files in the package have 90+ coverage.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go: 82.2&lt;/li>
&lt;li>k8s.io/client-go/discovery/aggregated_discovery.go: 96.8&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>Integration tests&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-release-master-blocking#integration-master&amp;amp;width=5&amp;amp;include-filter-by-regex=discovery"
target="_blank" rel="noopener">test/integration/apiserver/discovery/discovery_test.go&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>e2e tests&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&amp;amp;width=5&amp;amp;include-filter-by-regex=discovery"
target="_blank" rel="noopener">test/e2e/apimachinery/aggregated_discovery.go&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;!-- **Note:** *Not required until targeted at a release.*
Define graduation milestones.
These may be defined in terms of API maturity, [feature gate]
graduations, or as something else. The KEP should keep this high-level
with a focus on what signals will be looked at to determine
graduation.
Consider the following in developing the graduation criteria for this
enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]
Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
or by redefining what graduation means.
In general we try to use the same stages (alpha, beta, GA), regardless
of how the functionality is accessed.
[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
Below are some examples to consider, in addition to the aforementioned
[maturity levels][maturity-levels]. -->
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature flag&lt;/li>
&lt;li>Initial e2e tests completed and enabled&lt;/li>
&lt;li>At least one client (kubectl) has an implementation to use the
aggregated discovery feature&lt;/li>
&lt;/ul>
&lt;p>We want all clients to benefit from this feature, but for alpha our
main focus will be on kubectl and golang clients.&lt;/p>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>kubectl uses the aggregated discovery feature by default&lt;/li>
&lt;li>Metrics are added&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Existing bugs are fixed:
&lt;ul>
&lt;li>AggregatedDiscovery controller does not purge old APIServices from cache (&lt;a href="https://github.com/kubernetes/kubernetes/issues/115301"
target="_blank" rel="noopener">Issue&lt;/a>
)&lt;/li>
&lt;li>Aggregated Discovery doesn&amp;rsquo;t show aggregated apiservices as Stale before initial health check (&lt;a href="https://github.com/kubernetes/kubernetes/issues/115303"
target="_blank" rel="noopener">Issue&lt;/a>
)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>New API type &lt;code>apidiscovery.k8s.io/v2&lt;/code> is introduced&lt;/li>
&lt;li>e2e and conformance tests&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note:&lt;/strong> Generally we also wait at least two releases between beta
and GA/stable, because there&amp;rsquo;s no opportunity for user feedback, or
even bug reports, in back-to-back releases.&lt;/p>
&lt;p>&lt;strong>For non-optional features moving to GA, the graduation criteria must
include &lt;a href="https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">conformance tests&lt;/a>
.&lt;/strong>&lt;/p>
&lt;h4 id="deprecation">Deprecation&lt;/h4>
&lt;p>Once Aggregated Discovery v2 types are GA, v2beta1 types will be deprecated and removed after 3 releases.&lt;/p>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Aggregated discovery will be behind a feature gate. It is an in-memory
feature and upgrade/downgrade is not a problem.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>When moving from beta to GA, we will introduce a new API group version &lt;code>apidiscovery.k8s.io/v2&lt;/code>.&lt;/p>
&lt;p>All clients v1.26 to v1.29 will only request the beta API group version &lt;code>apidiscovery.k8s.io/v2beta1&lt;/code>.&lt;/p>
&lt;p>To accommodate skew between the client and server (older client and newer server), the server will serve both v2 and v2beta1 versions based on the client request headers. The API server will continue to support v2beta1 until its removal in Kubernetes v1.33.&lt;/p>
&lt;p>To accommodate skew between an older server and a newer client, starting in v1.30
client-go will request both v2 and v2beta1 by sending a list of the group versions
it accepts (in order: v2, v2beta1, unaggregated), and the server will return the
first group version that matches. Concretely, this is done using &lt;code>Accept&lt;/code> headers with a single request.&lt;/p>
&lt;pre tabindex="0">&lt;code>Accept: application/json;as=APIGroupDiscoveryList;v=v2;g=apidiscovery.k8s.io,application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json
&lt;/code>&lt;/pre>&lt;p>An older server can match only v2beta1, while the client supports both v2 and v2beta1.
This allows a newer client to communicate with an older server that supports only the beta version.
Other clients should follow the same convention to support version skew. A client
that can process only v2 is sufficient if it communicates exclusively with v1.30+ servers;
otherwise, the client must be prepared to receive a 406 Not Acceptable response and handle
the error appropriately.&lt;/p>
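&lt;p>As an illustration of the convention above, the following Python sketch builds the &lt;code>Accept&lt;/code> header in preference order and simulates a server choosing the first entry it can serve. The helper names (&lt;code>build_accept_header&lt;/code>, &lt;code>negotiate&lt;/code>) are hypothetical and are not part of client-go or the apiserver, and the matching logic is a simplification that assumes exact, ordered matching without quality values.&lt;/p>

```python
# Hypothetical helpers for illustration; only the media type strings come from the text above.
AGGREGATED = "application/json;as=APIGroupDiscoveryList;v={version};g=apidiscovery.k8s.io"
UNAGGREGATED = "application/json"


def build_accept_header(versions=("v2", "v2beta1")):
    """Join the aggregated media types in preference order, ending with the unaggregated fallback."""
    parts = [AGGREGATED.format(version=v) for v in versions]
    parts.append(UNAGGREGATED)
    return ",".join(parts)


def negotiate(accept_header, server_versions):
    """Return the first requested media type a simulated server can serve, or None (the 406 case)."""
    for media_type in accept_header.split(","):
        # Everything after the first ";" is a key=value parameter.
        params = dict(p.split("=", 1) for p in media_type.split(";")[1:])
        version = params.get("v")
        # Assumption: a media type without a version is the unaggregated format, always servable.
        if version is None or version in server_versions:
            return media_type
    return None
```

&lt;p>Under these assumptions, a simulated server that serves only v2beta1 selects the second entry of the default header, and a client that sends only the v2 media type to such a server gets no match, which corresponds to the 406 case described above.&lt;/p>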
&lt;p>If there is no skew and both server and client are v1.30+, clients will still request v2 and v2beta1, and the server will match the first group version and return v2.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features
merging into Kubernetes are observable, scalable and supportable; can
be safely operated in production environments, and can be disabled or
rolled back in the event they cause increased failures in production.
See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and
approved for the KEP to move to `implementable` status and be included
in the release.
In some cases, the questions below should also have answers in
`kep.yaml`. This is to enable automation to verify the presence of the
review, and to reduce review burden and latency.
The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74)
channel if you need any help or guidance. -->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!-- This section must be completed when targeting alpha to a release.
-->
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: AggregatedDiscoveryEndpoint&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Clients using client-go version 1.26 and later will use the aggregated
discovery endpoint rather than the unaggregated discovery endpoint.
This is handled automatically in client-go, and clients should see fewer
requests to the API server when fetching discovery information. Client
versions older than 1.26 will continue to use the old unaggregated
discovery endpoint without any changes.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;!-- Describe the consequences on existing workloads (e.g., if this is
a runtime feature, can it break the existing applications?).
Feature gates are typically disabled by setting the flag to `false`
and restarting the component. No other changes should be necessary to
disable the feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;p>Yes, the feature may be disabled on the apiserver by reverting the
feature flag. This will disable aggregated discovery for all clients. If there is a Go-specific client-side bug, the feature may also be
turned off in client-go via the
&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L80"
target="_blank" rel="noopener">WithLegacy()&lt;/a>
toggle, which requires recompiling the application.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The feature does not depend on state, and can be disabled/enabled at
will.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;!-- The e2e framework does not currently support enabling or
disabling feature gates. However, unit tests in each component dealing
with managing data, created with and without the feature, are
necessary. At the very least, think about conversion tests if API
types are being modified.
Additionally, for features that are introducing a new API field, unit
tests that are exercising the `switch` of feature gate itself (what
happens if I disable a feature gate after having objects written with
the new field) are also critical. You can take a look at one potential
example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
&lt;p>A test will be added to ensure that the RESTMapper functionality works
properly both when the feature is enabled and disabled.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!-- This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;!-- Try to be as paranoid as possible - e.g., what if some components
will restart mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others
during the rollout. Similarly, consider large clusters and how
enablement/disablement will rollout across nodes. -->
&lt;p>During a rollout, some apiservers may support aggregated discovery and
some may not. It is recommended that clients request the
aggregated discovery document with a fallback to the unaggregated
discovery format. This can be achieved by setting the Accept header so
that it falls back to the default GVK of the &lt;code>/apis&lt;/code> and &lt;code>/api&lt;/code> endpoints.
For example, to request the aggregated discovery type and fall back to
the unaggregated discovery, the following header can be sent: &lt;code>Accept: application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json&lt;/code>&lt;/p>
&lt;p>This kind of fallback is already implemented in client-go and this
note is intended for non-golang clients.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;!-- What signals should users be paying attention to when the feature
is young that might indicate a serious problem? -->
&lt;p>High latency or failures reported by the metrics of the newly added discovery
aggregation controller. If the &lt;code>/api&lt;/code> and &lt;code>/apis&lt;/code> endpoints return an
error or are unreachable with the &lt;code>APIGroupDiscoveryList&lt;/code> accept type,
that could be a signal to roll back.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!-- Describe manual testing that was done and the outcomes. Longer
term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;p>n/a. The API introduced does not store data and state is recalculated on the upgrade, downgrade, upgrade cycle. No state is preserved between versions.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!-- Even if applying deprecation policies, they may still surprise
some users. -->
&lt;p>By enabling aggregated discovery as the default, the new API is
slightly different from the unaggregated version. The
StorageVersionHash field is removed from resources in the aggregated
discovery API. The storage version migrator will have an additional
flag when initializing the discovery client to continue using the
unaggregated API.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!-- This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm
the previous answers based on experience in the field. -->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!-- Ideally, this should be a metric. Operations against the
Kubernetes API (e.g., checking if there are objects with field X set)
may be a last resort. Avoid logs or events for this purpose. -->
&lt;p>Operators can check whether an aggregated discovery request can be
made by sending a request to &lt;code>/apis&lt;/code> with
&lt;code>application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json&lt;/code>
as the Accept header and looking at the &lt;code>Content-Type&lt;/code> response
header. A response header of &lt;code>Content-Type: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList&lt;/code>
indicates that aggregated discovery is supported, and a &lt;code>Content-Type: application/json&lt;/code> header indicates that it is not
supported. They can also check for the presence of aggregated
discovery related metrics, such as &lt;code>aggregator_discovery_aggregation_count&lt;/code>.&lt;/p>
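&lt;p>A script or non-Go client could apply the same check programmatically. The sketch below is a minimal Python illustration; the function names are hypothetical, and it assumes the &lt;code>Content-Type&lt;/code> value has the semicolon-separated &lt;code>key=value&lt;/code> shape shown above.&lt;/p>

```python
def parse_media_type(value):
    """Split a media type into its base type and a dict of its parameters."""
    base, *params = [part.strip() for part in value.split(";")]
    return base, dict(p.split("=", 1) for p in params if "=" in p)


def is_aggregated_response(content_type):
    """True when the value names the APIGroupDiscoveryList kind in apidiscovery.k8s.io."""
    base, params = parse_media_type(content_type)
    return (
        base == "application/json"
        and params.get("as") == "APIGroupDiscoveryList"
        and params.get("g") == "apidiscovery.k8s.io"
    )
```

&lt;p>The check deliberately ignores the order of the parameters and the &lt;code>v&lt;/code> value, so it treats any served version of the aggregated type as aggregated discovery.&lt;/p>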
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!-- For instance, if this is a pod-related feature, it should be
possible to determine if the feature is functioning properly for each
individual pod. Pick one more of these and delete the rest. Please
describe all items visible to end users below with sufficient detail
so that they can verify correct enablement and operation of this
feature. Recall that end users cannot usually observe component logs
or access metrics. -->
&lt;p>&lt;code>/api&lt;/code> and &lt;code>/apis&lt;/code> endpoints are populated with discovery information
when the aggregated content negotiation type accept header is passed,
and all expected group-versions are present.&lt;/p>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!-- This is your opportunity to define what "normal" quality of
service looks like for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time
minus expected job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in
the next question. -->
&lt;p>Aggregated Discovery falls under a non-streaming read-only API call which is defined under the Kubernetes API call latency
&lt;a href="https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md"
target="_blank" rel="noopener">SLI/SLO&lt;/a>
.
The numbers in the SLO are a good bound for Aggregated Discovery (p99 &amp;lt; 1s).&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!-- Pick one more of these and delete the rest. -->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>
&lt;p>Metric name: &lt;code>aggregator_discovery_aggregation_duration&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Components exposing the metric: &lt;code>kube-apiserver&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This is a metric for exposing the time it took to aggregate all the api resources.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Metric name: &lt;code>aggregator_discovery_aggregation_count&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Components exposing the metric: &lt;code>kube-apiserver&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This is a metric for the number of times that the discovery document has been aggregated.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!-- Describe the metrics themselves and the reasons why they weren't
added (e.g., cost, implementation difficulties, etc.). -->
&lt;p>No.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!-- This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;!-- Think about both cluster-level services (e.g. metrics-server) as
well as node-level agents (e.g. specific version of CRI). Focus on
external or optional services that are needed. For example, if this
feature depends on a cloud provider API, or upon an external
software-defined storage or network control plane.
For each of these, fill in the following—thinking about running
existing user workloads and creating new ones, as well as about
cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the
feature: -->
&lt;p>No, but if aggregated apiservers are present, the feature will attempt
to contact each aggregated apiserver on a set interval and aggregate the
data it publishes. If the error rate is high, stale data
may be returned because the latest data could not be fetched
from the aggregated apiserver.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!-- For alpha, this section is encouraged: reviewers should consider
these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these
questions.
For GA, this section is required: approvers should be able to confirm
the previous answers based on experience in the field. -->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;!-- Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes
resources (e.g. update of object X triggers new updates of object
Y)
- periodic API calls to reconcile state (e.g. periodic fetching
state, heartbeats, leader election, etc.) -->
&lt;p>No. Enabling this feature should reduce the total number of API calls
for client discovery. Instead of clients sending a discovery request
to all group versions (&lt;code>/apis/&amp;lt;group&amp;gt;/&amp;lt;version&amp;gt;&lt;/code>), they will only need
to send a request to the aggregated endpoint to obtain all resources
that the cluster supports.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;!-- Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped
objects) -->
&lt;p>Yes, but these API types are not persisted.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;!-- Describe them, providing:
- Which API(s):
- Estimated increase: -->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;!-- Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every
existing Pod) -->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;!-- Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the
details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos -->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;!-- Things to keep in mind include: additional in-memory state,
additional non-trivial computations, excessive access to disks
(including increased log volume), significant amount of data sent
and/or received over network, etc. This through this both in small and
large cases, again with respect to the [supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md -->
&lt;p>No.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!-- This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm
the previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We
may consider splitting it into a dedicated `Playbook` document
(potentially with some monitoring details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>The feature is built into the API server, and will not work if the API server is unavailable.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!-- For each of them, fill in the following information by copying
the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another
way: how can an operator troubleshoot without logging into a
master or worker node?
- Mitigations: What can be done to stop the bleeding, especially
for already running user workloads?
- Diagnostics: What are the useful log messages and their required
logging levels that could help debug the issue? Not required
until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe
why. -->
&lt;ul>
&lt;li>Aggregated API Server is unavailable:&lt;/li>
&lt;li>Detection: An Aggregated API Server that is unavailable will return Stale as the DiscoveryFreshness.
A prolonged period of staleness could indicate that the aggregated apiserver is unavailable.&lt;/li>
&lt;li>Mitigations: If the aggregated apiserver is not reachable, its resources will not be part of the resources available.
Restarting the pod or checking for any misconfigurations could be a valid next step.&lt;/li>
&lt;li>Diagnostics: Missing the (v3) log line: &lt;code>DiscoveryManager: successfully downloaded discovery/legacy discovery for &amp;lt;apiservice&amp;gt;&lt;/code>&lt;/li>
&lt;li>Testing: We test for unreachable aggregated apiservers returning Stale, but an aggregated apiserver could
be unavailable for a wide variety of reasons that would require further diagnosis.&lt;/li>
&lt;/ul>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>The feature can be rolled back by setting the AggregatedDiscoveryEndpoint feature flag to false.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!-- Major milestones in the lifecycle of a KEP should be tracked in
this section. Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG
acceptance
- the `Proposal` section being merged, signaling agreement on a
proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was
available
- the version of Kubernetes where the KEP graduated to general
availability
- when the KEP was retired or superseded -->
&lt;ul>
&lt;li>v1.26: Aggregated Discovery KEP is merged and moves to alpha&lt;/li>
&lt;li>v1.27: Aggregated Discovery moves to beta&lt;/li>
&lt;li>v1.30: Aggregated Discovery moves to stable&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>With aggregation, the size of the aggregated discovery document could
be an issue in the future since clients will need to download the
entire document on any resource update. At the moment, even with 3000
CRDs (already very unlikely), the total size is still smaller than
1MB.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;!-- What other approaches did you consider, and why did you rule them
out? These do not need to be as detailed as the proposal, but should
include enough information to express the idea and why it was not
acceptable. -->
&lt;p>An alternative that was considered is &lt;a href="https://docs.google.com/document/d/1AulBtUYjVcc4s809YSQq4bdRdDO3byY7ew9za4Ortj4"
target="_blank" rel="noopener">Discovery Cache
Busting&lt;/a>
.
Cache-busting allows clients to know if the files need to be
downloaded at all, and in most cases the download can be skipped
entirely. This typically works by including a hash of the resource
content in its name, while marking the objects as never-expiring using
cache control headers. Clients can then recognize if the names have
changed or stayed the same, and re-use files that have kept the same
name without downloading them again.&lt;/p>
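&lt;p>To make the cache-busting idea concrete, here is a minimal Python sketch of a client cache keyed by content-hashed names. The naming scheme (a truncated SHA-256 digest placed before a &lt;code>.json&lt;/code> suffix) and the helper names are assumptions made for illustration; they are not taken from the linked proposal.&lt;/p>

```python
import hashlib


def busted_name(base_name, content):
    """Embed a short SHA-256 digest of the content in the resource name (hypothetical scheme)."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"{base_name}.{digest}.json"


def fetch_if_changed(name, cache, download):
    """Reuse the cached body when the hashed name is already known; otherwise download it once."""
    if name not in cache:
        cache[name] = download(name)
    return cache[name]
```

&lt;p>Because the name changes whenever the content changes, an unchanged name is enough for the client to skip the download entirely.&lt;/p>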
&lt;p>Aggregated Discovery was selected because of the number of requests
that it saves both on startup and on changes involving multiple group
versions. For a full comparison between Discovery Cache Busting and
Aggregated Discovery, please refer to the &lt;a href="https://docs.google.com/document/d/1sdf8nz5iTi86ErQy9OVxvQh_0RWfeU3Vyu0nlA10LNM"
target="_blank" rel="noopener">Google
Doc&lt;/a>
.&lt;/p>
&lt;p>An additional alternative that we considered is watchable discovery.
After diving into the use cases, polling with ETag support is
sufficient for most clients and adding support for watch drastically
changes the scope of this proposal.&lt;/p>
&lt;p>Finally, another alternative that was explored was creating a new URL
endpoint &lt;code>/discovery/&amp;lt;version&amp;gt;&lt;/code>. The addition of a new URL endpoint
per serialization version creates a burden for clients as the API
evolves, as they may need to check multiple endpoints to determine the
state of the feature.&lt;/p></description></item><item><title>Resources: Allow a Network Policy to contemplate a set of ports in a single rule</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2079/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2079/</guid><description>
&lt;h1 id="kep-2079-network-policy-to-support-port-ranges">KEP-2079: Network Policy to support Port Ranges&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1---opening-communication-to-nodeports-of-other-cluster"
>Story 1 - Opening communication to NodePorts of other cluster&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2---blocking-the-egress-for-not-allowed-insecure-ports"
>Story 2 - Blocking the egress for not allowed insecure ports&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3---containerized-passive-ftp-server"
>Story 3 - Containerized Passive FTP Server&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats"
>Notes/Constraints/Caveats&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#validations"
>Validations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Today the &lt;code>ports&lt;/code> field in ingress and egress network policies is an array
in which every port covered by a rule must be declared individually. This KEP
proposes to add a new field that allows a port range to be declared,
simplifying the creation of rules with multiple ports.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>NetworkPolicy is a complex object that allows a developer to specify
the traffic behavior expected of the application and to allow or deny
traffic accordingly.&lt;/p>
&lt;p>There are a number of user issues like &lt;a href="https://github.com/kubernetes/kubernetes/issues/67526"
target="_blank" rel="noopener">kubernetes #67526&lt;/a>
and &lt;a href="https://github.com/kubernetes/kubernetes/issues/93111"
target="_blank" rel="noopener">kubernetes #93111&lt;/a>
where users describe the need to create a policy that allows a range of ports except for some
specific ports, or cases where a user wants to create a policy that allows
egress to the NodePort range of another cluster (e.g. 32000-32768). Today,
such a rule must be created by specifying each port separately:&lt;/p>
&lt;pre tabindex="0">&lt;code>spec:
egress:
- ports:
- protocol: TCP
port: 32000
- protocol: TCP
port: 32001
- protocol: TCP
port: 32002
- protocol: TCP
port: 32003
[...]
- protocol: TCP
port: 32768
&lt;/code>&lt;/pre>&lt;p>So for the user:&lt;/p>
&lt;ul>
&lt;li>To allow a range of ports, each port must be declared as an item of the
&lt;code>ports&lt;/code> array&lt;/li>
&lt;li>To make an exception, every port except the excluded one must be declared&lt;/li>
&lt;/ul>
&lt;p>Adding a new &lt;code>endPort&lt;/code> field inside &lt;code>ports&lt;/code> will make it simpler
for users to create NetworkPolicies.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Add an &lt;code>endPort&lt;/code> field to &lt;code>NetworkPolicyPort&lt;/code>.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Support a specific &lt;code>Exception&lt;/code> field.&lt;/li>
&lt;li>Support &lt;code>endPort&lt;/code> when the starting &lt;code>port&lt;/code> is a named port.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>In the NetworkPolicy specification, add a new &lt;code>endPort&lt;/code> field to
&lt;code>NetworkPolicyPort&lt;/code>. The field holds a numbered port that, when set, marks the rule as a range
and defines where that range ends.&lt;/p>
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;h4 id="story-1---opening-communication-to-nodeports-of-other-cluster">Story 1 - Opening communication to NodePorts of other cluster&lt;/h4>
&lt;p>I have an application that communicates with NodePorts of a different cluster
and I want to allow egress traffic only to the NodePort range
(e.g. 30000-32767), as I don&amp;rsquo;t know which port is going to be allocated on the
other side but don&amp;rsquo;t want to create a rule for each of them.&lt;/p>
&lt;h4 id="story-2---blocking-the-egress-for-not-allowed-insecure-ports">Story 2 - Blocking the egress for not allowed insecure ports&lt;/h4>
&lt;p>As a developer, I need to create an application that scrapes information from
multiple sources, such as databases running on random ports, web
applications, and others. The security policy of my company requires me
to block communication with certain well-known ports, like 111 and 445, so I need to create
a network policy that allows me to communicate with any port except those two
in order to comply with the company&amp;rsquo;s policy.&lt;/p>
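&lt;p>As a rough illustration of this story, the sketch below splits the full port space into inclusive ranges that skip the blocked ports, so that each range could become one &lt;code>port&lt;/code>/&lt;code>endPort&lt;/code> pair. The &lt;code>portRange&lt;/code> type and the &lt;code>rangesExcluding&lt;/code> helper are hypothetical names used only for this example and are not part of the Kubernetes API.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// portRange is a hypothetical stand-in for one port/endPort pair.
// Both ends are inclusive.
type portRange struct {
	Port, EndPort int32
}

// rangesExcluding splits 1-65535 into inclusive ranges that skip every
// port listed in excluded.
func rangesExcluding(excluded ...int32) []portRange {
	// Sort ascending so the ranges can be built in a single pass.
	sort.Slice(excluded, func(i, j int) bool { return excluded[j] > excluded[i] })
	var out []portRange
	start := int32(1)
	for _, e := range excluded {
		if e > start {
			out = append(out, portRange{start, e - 1})
		}
		if e >= start {
			start = e + 1
		}
	}
	if 65535 >= start {
		out = append(out, portRange{start, 65535})
	}
	return out
}

func main() {
	// The two ports named in the story above.
	for _, r := range rangesExcluding(111, 445) {
		fmt.Printf("port: %d, endPort: %d\n", r.Port, r.EndPort)
	}
}
```

&lt;p>With the two ports from the story, this yields three ranges, so three &lt;code>ports&lt;/code> entries would replace tens of thousands of single-port entries.&lt;/p>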
&lt;h4 id="story-3---containerized-passive-ftp-server">Story 3 - Containerized Passive FTP Server&lt;/h4>
&lt;p>As a Kubernetes user, I&amp;rsquo;ve been asked by my boss to run our FTP server in an
existing Kubernetes cluster, to support some of my legacy applications.
This FTP server must be accessible from inside and outside the cluster,
but I still need to follow my company&amp;rsquo;s basic security policies, which require
a default deny rule for all workloads and allow only specific ports.&lt;/p>
&lt;p>Because this FTP server runs in PASV mode, I need to open the Network Policy to port 21
and to the range 49152-65535 without allowing any other ports.&lt;/p>
&lt;h3 id="notesconstraintscaveats">Notes/Constraints/Caveats&lt;/h3>
&lt;ul>
&lt;li>The technology used by the CNI provider might not support port ranges in a
trivial way, as described in &lt;a href="#drawbacks">Drawbacks&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>CNIs will need to support the new field in their controllers. To mitigate this,
we&amp;rsquo;ll communicate broadly with the main CNIs so that they are aware
of the new field.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>API changes to NetworkPolicy:&lt;/p>
&lt;ul>
&lt;li>Add a new field called &lt;code>EndPort&lt;/code> inside &lt;code>NetworkPolicyPort&lt;/code>, as follows:&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// NetworkPolicyPort describes a port to allow traffic on&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> NetworkPolicyPort &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The protocol (TCP, UDP, or SCTP) which traffic must match. If not specified, this&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// field defaults to TCP.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Protocol &lt;span style="color:#666">*&lt;/span>v1.Protocol &lt;span style="color:#b44">`json:&amp;#34;protocol,omitempty&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=protocol,casttype=k8s.io/api/core/v1.Protocol&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The port on the given protocol. This can either be a numerical or named &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// port on a pod. If this field is not provided, this matches all port names and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// numbers, whether an endPort is defined or not.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Port &lt;span style="color:#666">*&lt;/span>intstr.IntOrString &lt;span style="color:#b44">`json:&amp;#34;port,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,opt,name=port&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// EndPort defines the last port included in the port range.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Example:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// endPort: 12345&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> EndPort &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;endPort,omitempty&amp;#34; protobuf:&amp;#34;varint,3,opt,name=endPort&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="validations">Validations&lt;/h3>
&lt;p>The &lt;code>NetworkPolicyPort&lt;/code> will need to be validated, with the following scenarios:&lt;/p>
&lt;ul>
&lt;li>If an &lt;code>EndPort&lt;/code> is specified, a &lt;code>Port&lt;/code> must also be specified&lt;/li>
&lt;li>If &lt;code>Port&lt;/code> is a string (named port), &lt;code>EndPort&lt;/code> cannot be specified&lt;/li>
&lt;li>&lt;code>EndPort&lt;/code> must be equal to or greater than &lt;code>Port&lt;/code>&lt;/li>
&lt;/ul>
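&lt;p>The three rules above can be sketched as a single validation function. This is a minimal illustration and not the actual code in &lt;code>pkg/apis/networking/validation&lt;/code>: the &lt;code>policyPort&lt;/code> type, the &lt;code>ptr&lt;/code> helper, and the error messages are assumptions made for this example.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// policyPort is a simplified, hypothetical stand-in for NetworkPolicyPort.
type policyPort struct {
	Port      *int32 // numeric starting port, nil if unset
	NamedPort string // named starting port, empty if unset
	EndPort   *int32 // last port of the range, nil if unset
}

// ptr returns a pointer to a copy of v.
func ptr(v int32) *int32 {
	p := new(int32)
	*p = v
	return p
}

// validateEndPort applies the three rules listed above.
func validateEndPort(p policyPort) error {
	if p.EndPort == nil {
		return nil // no range requested, nothing to check
	}
	if p.NamedPort != "" {
		return errors.New("endPort cannot be combined with a named port")
	}
	if p.Port == nil {
		return errors.New("endPort requires port to be set")
	}
	if *p.Port > *p.EndPort {
		return errors.New("endPort must be equal to or greater than port")
	}
	return nil
}

func main() {
	valid := policyPort{Port: ptr(32000), EndPort: ptr(32768)}
	inverted := policyPort{Port: ptr(90), EndPort: ptr(70)}
	fmt.Println("valid range:", validateEndPort(valid))
	fmt.Println("inverted range:", validateEndPort(inverted))
}
```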
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;p>Unit tests:&lt;/p>
&lt;ul>
&lt;li>test API validation logic&lt;/li>
&lt;li>test API strategy to ensure the field is dropped when the feature gate is disabled&lt;/li>
&lt;/ul>
&lt;p>E2E tests:&lt;/p>
&lt;ul>
&lt;li>Add e2e tests exercising only the API operations for port ranges. Data-path
validation should be done by CNIs.&lt;/li>
&lt;/ul>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>pkg/apis/networking/validation/validation&lt;/code>: &lt;code>14/Jun/2022&lt;/code> - &lt;code>92.5%&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/registry/networking/networkpolicy/strategy&lt;/code>: &lt;code>14/Jun/2022&lt;/code> - &lt;code>75.9%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>Feature:NetworkPolicyEndPort: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?text=EndPort#eaa4b8cdb7b461dccfa9"
target="_blank" rel="noopener">https://storage.googleapis.com/k8s-triage/index.html?text=EndPort#eaa4b8cdb7b461dccfa9&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>The flakes shown here are not related to this feature, per the test logs.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Add a new feature-gated field to NetworkPolicy&lt;/li>
&lt;li>Communicate the new field to CNI providers&lt;/li>
&lt;li>Add validation tests in API&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>&lt;code>EndPort&lt;/code> has been supported for at least 1 minor release&lt;/li>
&lt;li>Three commonly used NetworkPolicy providers (or CNI providers) implement the new field,
with generally positive feedback on its usage.&lt;/li>
&lt;li>Feature Gate is enabled by Default.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>At least &lt;strong>four&lt;/strong> NetworkPolicy providers (or CNI providers) support the &lt;code>EndPort&lt;/code> field&lt;/li>
&lt;li>&lt;code>EndPort&lt;/code> has been enabled by default for at least 1 minor release&lt;/li>
&lt;/ul>
&lt;p>The following are the CNIs that implement this feature:&lt;/p>
&lt;ul>
&lt;li>Calico&lt;/li>
&lt;li>Antrea&lt;/li>
&lt;li>OpenShift SDN&lt;/li>
&lt;li>kube-router&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>On upgrade, no impact is expected, as this is a new field.&lt;/p>
&lt;p>On downgrade, the CNI won&amp;rsquo;t be able to read the new field, as it no longer
exists, and network policies using this field will stop working as intended:
each affected rule is treated as a single port instead of a range. This is a fail-closed failure, so it is acceptable.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: NetworkPolicyEndPort&lt;/li>
&lt;li>Components depending on the feature gate: Kubernetes API Server&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. One caveat is that NetworkPolicies created with the EndPort field set
while the feature was enabled will continue to have that field set when the
feature is disabled, unless the user removes it from the object.&lt;/p>
&lt;p>If the value is dropped while the feature gate is disabled, the field can only
be re-inserted once the feature gate is enabled again.&lt;/p>
&lt;p>Rolling back to a Kubernetes API Server version that does not have this field
will cause the field to no longer be returned on GET operations,
so CNIs relying on the new field won&amp;rsquo;t see it anymore.&lt;/p>
&lt;p>If this happens, CNIs will treat the policy as a single port instead of a
port range. This may break users; it is unavoidable, but it satisfies the
fail-closed requirement.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>Nothing.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes and they can be found &lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.21/pkg/registry/networking/networkpolicy/strategy_test.go#L284"
target="_blank" rel="noopener">here&lt;/a>
&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>Unlikely, but there&amp;rsquo;s still the risk of a bug that causes validation to fail
or a conversion function to crash.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>An increase in the 5xx HTTP error count on the NetworkPolicies endpoint.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Yes, with unit tests.
Manual tests were also executed as follows:&lt;/p>
&lt;ul>
&lt;li>Created a KinD cluster running v1.24 with Calico as the CNI&lt;/li>
&lt;li>Created a Network Policy with the &lt;code>endPort&lt;/code> field to allow a Pod egress to ports 70 through 90&lt;/li>
&lt;li>Tested against a target on port 80: worked&lt;/li>
&lt;li>Disabled the feature gate&lt;/li>
&lt;li>The Network Policy still worked fine&lt;/li>
&lt;li>Changed the Network Policy so the range is 70 to 79; the change was applied without errors&lt;/li>
&lt;li>Traffic to port 80 started to be blocked, but port 78 could still be reached as it is still within the range&lt;/li>
&lt;li>Removed the &lt;code>endPort&lt;/code> field, and wasn&amp;rsquo;t able to re-add it as the feature gate was disabled&lt;/li>
&lt;li>Re-enabled the feature gate&lt;/li>
&lt;li>Re-added the &lt;code>endPort&lt;/code> field with a value of 90&lt;/li>
&lt;li>Traffic started to flow/be accepted again&lt;/li>
&lt;/ul>
&lt;p>Per the manual tests, all worked as desired.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>None&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Operators can determine whether NetworkPolicies are making use of EndPort by creating
an object that specifies the range and validating that traffic is allowed within
the specified range.&lt;/p>
&lt;p>The Network Policy object also now supports (as alpha) status/condition fields, so
Network Policy providers (NPPs) can give feedback to the user on whether the policy was processed
correctly. Providing this feedback is optional and depends on the implementation
of each NPP.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other&lt;/li>
&lt;li>Details:
The API field must be present when a NetworkPolicy is created with that field.
Whether the feature works correctly depends on the CNI implementation, so the operator can
look at CNI metrics to check whether the rules are being applied correctly. Calico, for example,
provides metrics like &lt;code>felix_iptables_restore_errors&lt;/code> that can be used to
verify whether the number of restore errors rose after the feature was applied.
For NetworkPolicy providers that don&amp;rsquo;t support this feature, a new status field was added
to the Network Policy object, allowing providers to give feedback to users through conditions.
Any NPP that does not support this feature should add a condition to the Network Policy
object.&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>Operators can use metrics provided by the CNI as SLIs, such as
&lt;code>felix_iptables_restore_errors&lt;/code> from Calico, to verify whether the error rate
has risen.&lt;/p>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;ul>
&lt;li>per-day percentage of API calls finishing with 5XX errors &amp;lt;= 1% is a reasonable SLO&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Are there any missing metrics that would be useful to have to improve observability
of this feature?&lt;/strong>
N/A&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>Yes, a CNI supporting the new feature&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;ul>
&lt;li>API type(s): NetworkPolicyPorts&lt;/li>
&lt;li>Estimated increase in size: 2 bytes for each new &lt;code>EndPort&lt;/code> value specified + the field name/number in its serialized format&lt;/li>
&lt;li>Estimated amount of new objects: N/A&lt;/li>
&lt;/ul>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>The CNI might use slightly more resources while parsing the
new field.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>As this feature is mainly used by CNI providers, the reaction with API server
and/or etcd being unavailable will be the same as before.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>Remove the EndPort field and check whether the number of errors drops, although this might
lead to an undesired Network Policy that blocks previously working rules.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2022-06-14 Propose GA graduation&lt;/li>
&lt;li>2021-05-11 Propose Beta graduation and add more Performance Review data&lt;/li>
&lt;li>2020-10-08 Initial &lt;a href="https://github.com/kubernetes/enhancements/pull/2079"
target="_blank" rel="noopener">KEP PR&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;ul>
&lt;li>The technology used by the CNI provider might not support port ranges in a
trivial way. As an example, OpenFlow did not support specifying port ranges
for a while, as commented in &lt;a href="https://github.com/kubernetes/kubernetes/issues/67526#issuecomment-415170435"
target="_blank" rel="noopener">kubernetes #67526&lt;/a>
.
While this has changed in Open vSwitch v1.6, it might still be a caveat
for other CNIs; eBPF-based CNIs, for example, will need to populate their maps in a
different way.&lt;/li>
&lt;/ul>
&lt;p>In these cases, CNIs will have to iterate through the port range and
populate their packet filtering tables with each port.&lt;/p>
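&lt;p>A minimal sketch of that fallback, assuming a hypothetical &lt;code>expandPortRange&lt;/code> helper inside a CNI controller: it turns one &lt;code>port&lt;/code>/&lt;code>endPort&lt;/code> pair into the list of individual ports that would each need an entry. Treating a zero &lt;code>endPort&lt;/code> as &amp;ldquo;no range&amp;rdquo; is an assumption of this example, not behavior defined by the KEP.&lt;/p>

```go
package main

import "fmt"

// expandPortRange lists every port from port to endPort, both inclusive.
// In this sketch a zero endPort means that no range was requested.
func expandPortRange(port, endPort int32) []int32 {
	if endPort == 0 {
		endPort = port
	}
	if port > endPort {
		return nil // invalid range; API validation should already reject it
	}
	ports := make([]int32, 0, int(endPort-port)+1)
	for p := port; endPort >= p; p++ {
		ports = append(ports, p)
	}
	return ports
}

func main() {
	// A policy allowing ports 70 through 90 expands to one entry per port.
	ports := expandPortRange(70, 90)
	fmt.Println(len(ports), "entries, first", ports[0], "last", ports[len(ports)-1])
}
```

&lt;p>The cost of this approach grows with the width of the range, which is why native range support in the underlying technology is preferable where it exists.&lt;/p>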
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>During the development of this KEP, an alternative implementation was considered that placed
a &lt;code>NetworkPolicyPortRange&lt;/code> field inside &lt;code>NetworkPolicyPort&lt;/code>, as follows:&lt;/p>
&lt;pre tabindex="0">&lt;code>// NetworkPolicyPort describes a port or a range of ports to allow traffic on
type NetworkPolicyPort struct {
// The protocol (TCP, UDP, or SCTP) which traffic must match. If not specified, this
// field defaults to TCP.
// +optional
Protocol *api.Protocol
// The port on the given protocol. This can either be a numerical or named
// port on a pod. If this field is not provided but a Range is
// provided, this field is ignored. Otherwise this matches all port names and
// numbers.
// +optional
Port *intstr.IntOrString
// A range of ports on a given protocol and the exceptions. If this field
// is not provided, this doesn&amp;#39;t match anything
// +optional
Range *NetworkPolicyPortRange
}
&lt;/code>&lt;/pre>&lt;p>But the main design suggested in this KEP seems clearer, so this alternative
has been discarded.&lt;/p>
&lt;p>It was also proposed that the implementation contain an &lt;code>Except&lt;/code> array and a new
struct to be used in Ingress/Egress rules, but because it would bring much more complexity
than desired, the proposal has been dropped for now:&lt;/p>
&lt;pre tabindex="0">&lt;code>// NetworkPolicyPortRange describes the range of ports to be used in a
// NetworkPolicyPort struct
type NetworkPolicyPortRange struct {
// From defines the start of the port range
From uint16
// To defines the end of the port range, being the end included within the
// range
To uint16
// Except defines all the exceptions in the port range
// +optional
Except []uint16
}
&lt;/code>&lt;/pre></description></item><item><title>Resources: Allow DaemonSets to surge during update like Deployments</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1591/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1591/</guid><description>
&lt;h1 id="allow-daemonsets-to-surge-during-update-like-deployments">Allow DaemonSets to surge during update like Deployments&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#implementation-detailsnotesconstraints"
>Implementation Details/Notes/Constraints&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#workload-implications"
>Workload Implications&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#implications-to-drain"
>Implications to drain&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alpha---beta"
>Alpha -&amp;gt; Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta---ga"
>Beta -&amp;gt; GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>DaemonSets allow two update strategies: OnDelete, which only replaces pods when they are deleted, and RollingUpdate, which supports MaxUnavailable like Deployments but not Surge. DaemonSets should support Surge in order to minimize DaemonSet downtime on nodes. This will allow DaemonSet workloads to implement zero-downtime upgrades.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>DaemonSets are a key enabler of Kubernetes system-level integrations like CNI, CSI, or per-node functionality. These integrations may have availability impacts on workloads during daemonset updates for a number of reasons, including image pull time or setup. While increasing availability of these daemonsets often requires development investment to manage the handoff between the old instance and the new instance, without the ability to have two pods on the same node these handoffs are complex to implement and typically require higher level orchestration (such as running two daemonsets and round robining updates, or using the OnDelete strategy and orchestrating pod deletes when nodes will be rebooted).&lt;/p>
&lt;p>It should be possible for a node level integration to offer zero-downtime upgrades via a DaemonSet without resorting to a higher level orchestration.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Add support for Surge to the DaemonSet rolling update strategy&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="implementation-detailsnotesconstraints">Implementation Details/Notes/Constraints&lt;/h3>
&lt;p>The design of Deployment rolling updates introduced the surge concept, and the initial design for DaemonSet updates considered the implications of adding the Surge strategy later (&lt;a href="https://github.com/kubernetes/design-proposals-archive/blob/master/apps/daemonset-update.md#future-plans"
target="_blank" rel="noopener">https://github.com/kubernetes/design-proposals-archive/blob/master/apps/daemonset-update.md#future-plans&lt;/a>
). &lt;a href="https://github.com/kubernetes/enhancements/pull/1863"
target="_blank" rel="noopener">StatefulSets may also surge in a workload specific fashion&lt;/a>
, so this design should be as consistent as possible with existing concepts but clearly denote where the workload concept differs from other controllers.&lt;/p>
&lt;p>We would add &lt;code>MaxSurge *intstr.IntOrString&lt;/code> to the RollingUpdate daemonset upgrade strategy. It would have a default value of 0, preserving current behavior. We would allow MaxUnavailable to be 0 when MaxSurge is set.&lt;/p>
&lt;pre tabindex="0">&lt;code>// Spec to control the desired behavior of daemon set rolling update.
type RollingUpdateDaemonSet struct {
// The maximum number of DaemonSet pods that can be unavailable during the
// update. Value can be an absolute number (ex: 5) or a percentage of total
// number of DaemonSet pods at the start of the update (ex: 10%). Absolute
// number is calculated from percentage by rounding up.
// This cannot be 0 if MaxSurge is 0
// Default value is 1.
// Example: when this is set to 30%, at most 30% of the total number of nodes
// that should be running the daemon pod (i.e. status.desiredNumberScheduled)
// can have their pods stopped for an update at any given time. The update
// starts by stopping at most 30% of those DaemonSet pods and then brings
// up new DaemonSet pods in their place. Once the new pods are available,
// it then proceeds onto other DaemonSet pods, thus ensuring that at least
// 70% of original number of DaemonSet pods are available at all times during
// the update.
// +optional
MaxUnavailable *intstr.IntOrString `json:&amp;#34;maxUnavailable,omitempty&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=maxUnavailable&amp;#34;`
// The maximum number of nodes with an existing available DaemonSet pod that
// can have an updated DaemonSet pod during an update.
// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
// This cannot be 0 if MaxUnavailable is 0.
// Absolute number is calculated from percentage by rounding up to a minimum of 1.
// Default value is 0.
// Example: when this is set to 30%, at most 30% of the total number of nodes
// that should be running the daemon pod (i.e. status.desiredNumberScheduled)
// can have a new pod created before the old pod is marked as deleted.
// The update starts by launching new pods on 30% of nodes. Once an updated
// pod is available (Ready for at least minReadySeconds) the old DaemonSet pod
// on that node is marked deleted. If the old pod becomes unavailable for any
// reason (Ready transitions to false, is evicted, or is drained) an updated
// pod is immediately created on that node without considering surge limits.
// Allowing surge implies the possibility that the resources consumed by the
// daemonset on any given node can double if the readiness check fails, and
// so resource intensive daemonsets should take into account that they may
// cause evictions during disruption.
// +optional
MaxSurge *intstr.IntOrString `json:&amp;#34;maxSurge,omitempty&amp;#34; protobuf:&amp;#34;bytes,2,opt,name=maxSurge&amp;#34;`
}
&lt;/code>&lt;/pre>&lt;p>Unlike Deployments, MaxSurge only considers nodes that have an available old pod and will instantly launch updated pods if no old available pod is detected on a node. An available pod is defined the same way as for Deployments: the pod is not terminating, is Ready, and has been Ready for MinReadySeconds.&lt;/p>
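&lt;p>The rounding rule stated in the &lt;code>MaxSurge&lt;/code> field comment can be sketched as follows. This is an illustration only: &lt;code>resolveSurge&lt;/code> is a hypothetical helper that takes the value as a plain string and assumes non-negative input, whereas the real field is an &lt;code>intstr.IntOrString&lt;/code> resolved by the controller.&lt;/p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveSurge converts a MaxSurge value such as "5" or "30%" into an
// absolute number of nodes, rounding percentages up.
func resolveSurge(maxSurge string, desiredNumberScheduled int) (int, error) {
	if strings.HasSuffix(maxSurge, "%") {
		percent, err := strconv.Atoi(strings.TrimSuffix(maxSurge, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", maxSurge, err)
		}
		// Integer ceiling of percent*desired/100.
		return (percent*desiredNumberScheduled + 99) / 100, nil
	}
	n, err := strconv.Atoi(maxSurge)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q: %w", maxSurge, err)
	}
	return n, nil
}

func main() {
	// Seven nodes are expected to run the daemon pod in this example.
	for _, v := range []string{"0", "5", "30%"} {
		n, err := resolveSurge(v, 7)
		fmt.Println(v, "resolves to", n, "error:", err)
	}
}
```

&lt;p>Rounding up means that any non-zero percentage resolves to at least one node whenever at least one pod is desired.&lt;/p>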
&lt;p>In the event a rollout cannot proceed due to hitting the MaxSurge limit (due to any condition, whether scheduling, new pods not going ready) the controller should pause creating new pods until conditions change.&lt;/p>
&lt;p>DaemonSet pods are slightly more constrained than Deployments when it comes to scheduling issues since each pod is tied to a single node, so it is worth describing exactly how surge pods that violate same node constraints would be handled consistent with Deployments. The most common conflict is use of HostPort within the pod spec across two versions, which would prevent the second pod from landing and the rollout from proceeding. An identical failure would occur with a Deployment of scale 4 on a 3 node cluster - the rollout would be prohibited because the fourth pod could not be scheduled, and so should be handled identically by this controller. It is user error to specify impossible scheduling constraints, and the correct way to convey that is via status conditions on the DaemonSet (which is a separate proposal).&lt;/p>
&lt;p>In order to reduce confusion for new users, we will start by rejecting HostPort use in daemonset when MaxSurge is non-zero. A user will not be able to update a daemonset to MaxSurge != 0 if HostPort is set, or update a HostPort if MaxSurge is set, without receiving a validation error. If the MaxSurge feature gate is off, the validation rule is bypassed, and a user who turns off the gate, sets both fields, and then enables the gate will have failing pods but will be able to update their daemonset to either remove surge or remove the host port safely.&lt;/p>
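&lt;p>The validation rule above can be sketched as follows. The type &lt;code>daemonSetSpec&lt;/code> and the function &lt;code>validateSurge&lt;/code> are hypothetical and heavily reduced; the real API types and validation code are structured differently. The sketch only shows the decision: with the gate on, a non-zero surge combined with any declared host port is rejected, and with the gate off the check is skipped.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// daemonSetSpec is a hypothetical, reduced stand-in for the real API type.
type daemonSetSpec struct {
	maxSurge  int   // resolved absolute surge value; 0 means surge is off
	hostPorts []int // host ports declared by containers in the pod template
}

var errSurgeWithHostPort = errors.New("maxSurge must be 0 when the pod template declares a hostPort")

// validateSurge rejects the combination of surge and host ports, but only
// while the feature gate is enabled.
func validateSurge(spec daemonSetSpec, gateEnabled bool) error {
	if !gateEnabled {
		return nil // rule bypassed while the gate is off
	}
	if spec.maxSurge == 0 {
		return nil
	}
	if len(spec.hostPorts) == 0 {
		return nil
	}
	return errSurgeWithHostPort
}

func main() {
	spec := daemonSetSpec{maxSurge: 1, hostPorts: []int{9100}}
	fmt.Println("gate on: ", validateSurge(spec, true))
	fmt.Println("gate off:", validateSurge(spec, false))
}
```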
&lt;p>A user who uses HostNetwork but does not declare HostPorts and attempts to use MaxSurge with processes that listen on the host network should see errors from the network stack when their process attempts to bind a port (such as &lt;code>cannot bind to address: port in use&lt;/code>) and the new pod will crash and go into a crashloop. Users should expect to see these failures as they would any other &amp;ldquo;my application does not start on Kubernetes&amp;rdquo; error via pod status, daemonset status conditions, and pod logs.&lt;/p>
&lt;p>Building a daemonset that hands off between two host level processes with any degree of coordination is an advanced topic and is up to the workload author. The simplest daemonsets may use pod network without any host level sharing and will benefit significantly from maxSurge during updates by reducing downtime at the cost of extra resources. As more complex sharing (host network, disk resources, unix domain sockets, configuration) is needed, the author is expected to leverage custom readiness probes, process start conditions, and process coordination mechanisms (like disks, networking, or shared memory) across pods. Debugging those interactions will be in the domain of the workload author.&lt;/p>
&lt;h3 id="workload-implications">Workload Implications&lt;/h3>
&lt;p>There are three main workload types that seek to minimize disruption:&lt;/p>
&lt;ol>
&lt;li>Infrastructure that should be quickly replaced during update (CNI plugins, CSI plugins).&lt;/li>
&lt;li>Infrastructure that wishes to hand off a node resource during an upgrade (socket, namespace, process).&lt;/li>
&lt;li>Infrastructure that must remain 100% available to support workloads (networking components, proxies).&lt;/li>
&lt;/ol>
&lt;p>In general, all of these benefit from minimizing the time between old pod shutting down and new pod starting up. MaxSurge allows components to arbitrarily approach zero disruption by careful tuning of their launch scripts and access to shared resources, such as sockets or shared disk.&lt;/p>
&lt;p>Infrastructure invoked by Kubernetes components (CRI, CNI, CSI) can usually fall within the first category and may require some coordination from the invoking process to minimize downtime. For instance, the Kubelet may retry certain types of CSI errors transparently to mitigate brief disruption to a CSI plugin. Or the container runtime may retry certain CNI errors if the plugin is not available.&lt;/p>
&lt;p>The second category of workload requires some coordination between the old and new container - for instance, reusing a host volume and checking for file locking on shared resources, or using the SO_REUSEPORT option to start listening on an interface and share old and new traffic. In general the workload author is assumed to understand how to minimize disruption and Kubernetes is only giving them an overlapping window of execution before beginning the termination of the old process. The readiness probe should be used by the workload author to manage this transition as in other workload flows.&lt;/p>
&lt;p>The last category is the most difficult to achieve and generally combines categories 1 and 2 along with careful tuning. Networking plugins that provide pod network capability may have one or more daemon processes that are desirable to deliver containerized, but any disruption to those critical pods may impact other workloads. In most cases, the capability to overlap execution provided by the MaxSurge is sufficient to allow those components to adapt to zero-downtime updates.&lt;/p>
&lt;p>In the future, &lt;a href="https://github.com/kubernetes/enhancements/issues/2004"
target="_blank" rel="noopener">service topology&lt;/a>
will have implications for services implemented as daemonsets across all nodes. The update strategy for surge or drain will need to take into account topology, although the full details of that are outside the scope of this design. In general, service owners using daemonset surge will wish to maximize availability and minimize the risk of disruption during update.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The primary risk is a bug in the implementation of the controller that causes excessive pod creations or deletions, as we have experienced during previous enhancements to workload controllers. The best mitigation for that scenario is unit testing to ensure the update strategy is stable and general purpose stress e2e testing of the controller.&lt;/p>
&lt;p>Because we are widening validation for MaxUnavailable, we must ensure that during an upgrade old apiservers can still handle that field. The alpha release of this field would have special logic so that, if MaxSurge is set and then dropped, a MaxUnavailable value of 0 would be set to 1 (the minimum allowed unavailable). The alpha controller would also special case this check when the gate was off. When a cluster was upgraded to beta with the gate on by default, the old controller and apiservers would treat &lt;code>MaxSurge != 0, MaxUnavailable == 0&lt;/code> as &lt;code>MaxSurge == 0, MaxUnavailable == 1&lt;/code> until they themselves were upgraded.&lt;/p>
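&lt;p>The fallback an old apiserver or controller applies can be sketched as a small normalization step. The names &lt;code>rollingUpdate&lt;/code> and &lt;code>downgradeView&lt;/code> are hypothetical, and the fields are shown as already resolved integers rather than &lt;code>IntOrString&lt;/code> values.&lt;/p>

```go
package main

import "fmt"

// rollingUpdate is a hypothetical, reduced view of the rolling update settings.
type rollingUpdate struct {
	maxSurge       int
	maxUnavailable int
}

// downgradeView returns the settings as a component without surge support
// would treat them: surge is dropped, and a MaxUnavailable of 0 becomes 1 so
// that the update can still make progress.
func downgradeView(ru rollingUpdate) rollingUpdate {
	if ru.maxSurge == 0 {
		return ru // nothing to drop, settings are already valid
	}
	ru.maxSurge = 0
	if ru.maxUnavailable == 0 {
		ru.maxUnavailable = 1
	}
	return ru
}

func main() {
	fmt.Println(downgradeView(rollingUpdate{maxSurge: 3, maxUnavailable: 0})) // prints: {0 1}
}
```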
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="implications-to-drain">Implications to drain&lt;/h3>
&lt;p>DaemonSets currently ignore unschedulable, but triggering a drain of a node and choosing to delete daemonsets would ensure that if the old pod can be deleted
the daemonset controller immediately schedules a new pod onto that node when
MaxSurge is in play (because of the invariant that there must be at least one
pod). If the old pod delays deletion, then the new pod has a chance to accept handoff from the old pod exactly like a normal rolling surge update.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;ul>
&lt;li>Unit tests covering the daemonset controller behavior in all major edge cases&lt;/li>
&lt;li>E2E test for surge strategy that verifies expected recovery behavior and that the controller settles
&lt;ul>
&lt;li>Testing should set up conflicting rules like HostPort and verify that surge fails and the correct daemonset condition is set and events are generated.
&lt;ul>
&lt;li>A test should cover a pod going unready during rollout and verifying it is immediately replaced.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h4>
&lt;p>[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.&lt;/p>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;pre tabindex="0">&lt;code>`k8s.io/kubernetes/pkg/apis/apps/validation` `06/06/2022`: `90.6% of statements` `The tests added for the current feature in this package touch the daemonSet Spec field. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/apis/apps/validation/validation.go:387`: `06/06/2022`: `100.0% of statements`
`k8s.io/kubernetes/pkg/controller/daemon`: `06/06/2022`: `70.7% of statements` `The tests added for the current feature in this package touch the daemonSet update strategies. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/registry/apps/daemonset`: `06/06/2022`: `31.1% of statements` `The tests added for the current feature in this package make sure that Kubernetes version upgrades/downgrades won&amp;#39;t have any impact on the new field in the daemonSet API when persisting to etcd. No new tests are needed for promotion to GA`
`k8s.io/kubernetes/pkg/registry/apps/daemonset/strategy.go:129`: `06/06/2022`: `100.0% of statements`
&lt;/code>&lt;/pre>&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>A new integration test that exercises maxSurge when &lt;code>RollingUpdate&lt;/code> is used as the update strategy will be added to the &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/integration/daemonset/daemonset_test.go"
target="_blank" rel="noopener">DS integration test suite&lt;/a>
&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>An e2e test that exercises maxSurge when &lt;code>RollingUpdate&lt;/code> is used as the update strategy has been added for daemonsets.&lt;/p>
&lt;ul>
&lt;li>should surge pods onto nodes when spec was updated and update strategy is RollingUpdate: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?test=should%20surge%20pods"
target="_blank" rel="noopener">test grid&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>This will be added as an alpha field enhancement to DaemonSets with a backward compatible default. After sufficient exposure this field would be promoted to beta, and then to GA in successive releases. The feature gate for this field will be &lt;code>DaemonSetUpdateSurge&lt;/code>.&lt;/p>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Complete feature behind a featuregate&lt;/li>
&lt;li>Have proper unit and e2e tests&lt;/li>
&lt;/ul>
&lt;h4 id="alpha---beta">Alpha -&amp;gt; Beta&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from the community&lt;/li>
&lt;/ul>
&lt;h4 id="beta---ga">Beta -&amp;gt; GA&lt;/h4>
&lt;p>At least one example of a user benefiting from this feature:&lt;/p>
&lt;ul>
&lt;li>OpenShift has a few critical &lt;a href="https://github.com/openshift/cluster-dns-operator/blob/d87dd223e67c476220451d254d878209c50324a7/pkg/operator/controller/controller_dns_node_resolver_daemonset.go#L80"
target="_blank" rel="noopener">DS&lt;/a>
where maxSurge is beneficial&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md.
The production readiness review questionnaire must be completed for features in
v1.19 or later, but is non-blocking at this time. That is, approval is not
required in order to be in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>DaemonSetUpdateSurge&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kube-apiserver&lt;/code>, &lt;code>kube-controller-manager&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Does enabling the feature change any default behavior?&lt;/strong>&lt;/p>
&lt;p>No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?&lt;/strong>&lt;/p>
&lt;p>Yes, when the feature gate is disabled the field is ignored and can be cleared by
an end user. A workload using this alpha feature would no longer be able to surge
and would fall back to the default MaxUnavailable value (which is minimum 1).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What happens if we reenable the feature if it was previously rolled back?&lt;/strong>&lt;/p>
&lt;p>The field would become active and whatever new values were present would cause
the surge feature to become active. If the field name were changed old values
would be lost and the controller would default to using maxUnavailable 1.&lt;/p>
&lt;p>To clear the field from etcd, disable the gate and perform a no-op PUT on every
daemonset.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>&lt;/p>
&lt;p>A unit test will verify disablement ignores surge and behaves as MaxUnavailable=1&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can a rollout fail? Can it impact already running workloads?&lt;/strong>
It shouldn&amp;rsquo;t impact already running workloads. This is an opt-in feature since users need to explicitly set the MaxSurge parameter in the DaemonSet spec&amp;rsquo;s RollingUpdate, i.e. the &lt;code>.spec.updateStrategy.rollingUpdate.maxSurge&lt;/code> field.
If the feature is disabled, the field is preserved if it was already set in the persisted DaemonSet object; otherwise it is silently dropped.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What specific metrics should inform a rollback?&lt;/strong>
MaxSurge in DaemonSet doesn&amp;rsquo;t get respected and additional surge pods won&amp;rsquo;t be
created. We consider the feature to be failing if enabling the featuregate and giving
appropriate value to MaxSurge doesn&amp;rsquo;t cause additional surge pods to be created.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/strong>
Manually tested. No issues were found when we enabled the feature gate -&amp;gt; disabled it -&amp;gt;
re-enabled the feature gate. Upgrade -&amp;gt; downgrade -&amp;gt; upgrade scenario was tested manually.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?&lt;/strong>
None&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can an operator determine if the feature is in use by workloads?&lt;/strong>
By checking the DaemonSet&amp;rsquo;s &lt;code>.spec.updateStrategy.rollingUpdate.maxSurge&lt;/code> field. The additional workload pods created should respect the value specified in the
maxSurge field.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: The number of pods created above the desired number of pods during an update when this feature is enabled can be compared to the maxSurge value in
the DaemonSet definition. This can be used to determine the health of this feature.
The existing metrics like &lt;code>kube_daemonset_status_number_available&lt;/code> and &lt;code>kube_daemonset_status_number_unavailable&lt;/code> can be used to track additional pods created.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/strong>
All the surge pods created should be within the value (percentage or absolute number) of the maxSurge field 99.99% of the time. The additional pods created should ensure that the workload
service is available 99.99% of the time during updates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any missing metrics that would be useful to have to improve observability
of this feature?&lt;/strong>
Describe the metrics themselves and the reasons why they weren&amp;rsquo;t added (e.g., cost,
implementation difficulties, etc.).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Does this feature depend on any specific services running in the cluster?&lt;/strong>
None. It is part of kube-controller-manager.&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;p>&lt;em>For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.&lt;/em>&lt;/p>
&lt;p>&lt;em>For beta, this section is required: reviewers must answer these questions.&lt;/em>&lt;/p>
&lt;p>&lt;em>For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong>&lt;/p>
&lt;p>No, the controller will perform roughly the same order of magnitude calls as
for the normal strategy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>&lt;/p>
&lt;p>No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new calls to the cloud
provider?&lt;/strong>&lt;/p>
&lt;p>No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing size or count of
the existing API objects?&lt;/strong>&lt;/p>
&lt;p>No, except for the explicit user chosen field on the daemonset.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing time taken by any
operations covered by &lt;a href="https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos"
target="_blank" rel="noopener">existing SLIs/SLOs&lt;/a>
?&lt;/strong>&lt;/p>
&lt;p>No, only broken Daemonsets in surge configurations would fail to roll out.
In both strategies, the readiness check gates the SLO of rollout.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong>&lt;/p>
&lt;p>No, the calculations for this controller change are of the same magnitude as
the existing flow.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>The Troubleshooting section currently serves the &lt;code>Playbook&lt;/code> role. We may consider
splitting it into a dedicated &lt;code>Playbook&lt;/code> document (potentially with some monitoring
details). For now, we leave it here.&lt;/p>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How does this feature react if the API server and/or etcd is unavailable?&lt;/strong>
This feature will not work if the API server or etcd is unavailable, as the controller-manager won&amp;rsquo;t even be able to get events or updates for DaemonSets. If the API server and/or etcd is unavailable mid-rollout, the featuregate would not be enabled and the controller-manager wouldn&amp;rsquo;t start since it cannot communicate with the API server.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are other known failure modes?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>MaxSurge not respected and too many pods are created
&lt;ul>
&lt;li>Detection: Looking at &lt;code>kube_daemonset_status_number_available&lt;/code> and &lt;code>kube_daemonset_status_number_unavailable&lt;/code> metrics.&lt;/li>
&lt;li>Mitigations: Disable the &lt;code>DaemonSetUpdateSurge&lt;/code> feature flag&lt;/li>
&lt;li>Diagnostics: Controller-manager when starting at log-level 4 and above&lt;/li>
&lt;li>Testing: Yes, e2e tests are already in place&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>MaxSurge not respected and very few pods are created. This causes the workloads to fall short of 99.99% availability
&lt;ul>
&lt;li>Detection: Looking at &lt;code>kube_daemonset_status_number_available&lt;/code> and &lt;code>kube_daemonset_status_number_unavailable&lt;/code> metrics.&lt;/li>
&lt;li>Mitigations: Disable the &lt;code>DaemonSetUpdateSurge&lt;/code> feature flag&lt;/li>
&lt;li>Diagnostics: Controller-manager when starting at log-level 4 and above&lt;/li>
&lt;li>Testing: Yes, e2e tests are already in place&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>maxUnavailable should be set to 0 even when maxSurge is configured
&lt;ul>
&lt;li>Detection: Looking at the &lt;code>.spec.updateStrategy.rollingUpdate.maxSurge&lt;/code> and &lt;code>.spec.updateStrategy.rollingUpdate.maxUnavailable&lt;/code> fields&lt;/li>
&lt;li>Mitigations: Setting maxUnavailable to appropriate value&lt;/li>
&lt;li>Diagnostics: Controller-manager when starting at log-level 4 and above&lt;/li>
&lt;li>Testing: Yes, e2e tests are already in place&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What steps should be taken if SLOs are not being met to determine the problem?&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2021-02-09: Initial KEP merged&lt;/li>
&lt;li>2021-03-05: Initial implementation merged&lt;/li>
&lt;li>2021-04-30: Graduate the feature to Beta proposed&lt;/li>
&lt;li>2022-05-10: Graduate the feature to stable proposed&lt;/li>
&lt;/ul></description></item><item><title>Resources: Allow HostNetwork Pods to Use User Namespaces</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/5607/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/5607/</guid><description>
&lt;h1 id="kep-5607-allow-hostnetwork-pods-to-use-user-namespaces">KEP-5607: Allow HostNetwork Pods to Use User Namespaces&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
within one minor version of promotion to GA&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP proposes introducing a new feature gate to allow Pods to have both &lt;code>hostNetwork&lt;/code> enabled and user namespaces enabled (by setting &lt;code>hostUsers: false&lt;/code>).&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>The primary motivation is to enhance the security of Kubernetes control plane components. Many control plane components, such as the &lt;code>kube-apiserver&lt;/code> and &lt;code>kube-controller-manager&lt;/code>, often run as static Pods and are configured with &lt;code>hostNetwork: true&lt;/code> to bind to node ports or interact directly with the host&amp;rsquo;s network stack.&lt;/p>
&lt;p>Currently, a validation rule in the kube-apiserver prevents the combination of &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code>. This KEP aims to remove that barrier.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Introduce a new, separate alpha feature gate: &lt;code>UserNamespacesHostNetworkSupport&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When this feature gate is enabled, modify the Pod validation logic to allow Pod specs where &lt;code>spec.hostNetwork&lt;/code> is true and &lt;code>spec.hostUsers&lt;/code> is false.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>Including this functionality as part of the &lt;code>UserNamespacesSupport&lt;/code> feature gate. As &lt;code>UserNamespacesSupport&lt;/code> is nearing GA, it would be unwise to add a new, unstable feature with external dependencies.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We propose the introduction of a new feature gate named &lt;code>UserNamespacesHostNetworkSupport&lt;/code>.&lt;/p>
&lt;p>When this feature gate is disabled (the default state), the kube-apiserver will maintain the current validation behavior, rejecting any Pod spec that includes both &lt;code>spec.hostNetwork: true&lt;/code> and &lt;code>spec.hostUsers: false&lt;/code>.&lt;/p>
&lt;p>When the &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature gate is enabled, we will relax this validation check.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a cluster administrator, I want to enable user namespaces for my control plane static Pods (e.g., kube-apiserver, kube-controller-manager) to follow the principle of least privilege and reduce the attack surface. These Pods need to use hostNetwork to interact correctly with the cluster network. By enabling the new feature gate, I can add a critical layer of security isolation to these vital components without changing their networking model.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>If either the high-level container runtime or the underlying low-level container runtime does not support this feature, the container will fail to be created. To mitigate this, we will keep the feature in alpha until both mainstream container runtimes (containerd/CRI-O) and mainstream low-level container runtimes (runc/crun) support it, and only then promote it to beta.&lt;/p>
&lt;p>Users might upgrade the container runtime to a newer version on some nodes first, but pods could still be scheduled onto nodes that do not support this feature. In such cases, users can leverage &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5328-node-declared-features"
target="_blank" rel="noopener">Node Declared Features&lt;/a>
to avoid this problem. Specifically, the new &lt;code>UserNamespacesHostNetwork&lt;/code> field in CRI-API&amp;rsquo;s &lt;code>RuntimeFeatures&lt;/code> will allow the kubelet to report whether the node supports this combination, enabling the scheduler to make informed placement decisions.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>The &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature integrates with the NodeDeclaredFeatures framework to ensure that Pods requiring the combination of &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code> are only scheduled onto nodes that explicitly declare support for this feature. The feature relies on the &lt;code>UserNamespacesHostNetwork&lt;/code> field in CRI-API&amp;rsquo;s &lt;code>RuntimeFeatures&lt;/code> to determine whether the container runtime supports this combination.&lt;/p>
&lt;p>&lt;strong>Node Feature Declaration:&lt;/strong>&lt;/p>
&lt;p>The kubelet will check the &lt;code>UserNamespacesHostNetwork&lt;/code> field in CRI-API&amp;rsquo;s &lt;code>RuntimeFeatures&lt;/code> to determine whether the container runtime supports the &lt;code>UserNamespacesHostNetwork&lt;/code> feature.
If supported, the kubelet will declare the &lt;code>UserNamespacesHostNetwork&lt;/code> feature in the &lt;code>node.status.declaredFeatures&lt;/code> field. This ensures that the scheduler and other control plane components are aware of the node&amp;rsquo;s capabilities.&lt;/p>
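&lt;p>The declaration step above can be sketched roughly as follows. This is a minimal illustration, not the kubelet&amp;rsquo;s actual code: the &lt;code>RuntimeFeatures&lt;/code> struct is a simplified stand-in for the CRI-API message, and &lt;code>declaredFeatures&lt;/code> is a hypothetical helper name.&lt;/p>

```go
package main

import "fmt"

// RuntimeFeatures is a simplified, hypothetical stand-in for the CRI-API
// RuntimeFeatures message; only the field discussed in this KEP is modeled.
type RuntimeFeatures struct {
	UserNamespacesHostNetwork bool
}

// featureName is the name the node would declare (taken from the KEP text).
const featureName = "UserNamespacesHostNetwork"

// declaredFeatures returns the feature names a node should advertise in
// node.status.declaredFeatures. A nil argument models an older runtime
// that does not report runtime features at all.
func declaredFeatures(rf *RuntimeFeatures) []string {
	features := []string{}
	if rf == nil {
		return features
	}
	if rf.UserNamespacesHostNetwork {
		features = append(features, featureName)
	}
	return features
}

func main() {
	supported := new(RuntimeFeatures)
	supported.UserNamespacesHostNetwork = true

	// Runtime reports support: the feature is declared.
	fmt.Println(declaredFeatures(supported))
	// Runtime reports nothing: no feature is declared.
	fmt.Println(declaredFeatures(nil))
}
```

&lt;p>Treating a missing report the same as "unsupported" keeps the declaration conservative: a node only advertises the feature when the runtime explicitly says it is available.&lt;/p>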
&lt;p>&lt;strong>Pod Validation:&lt;/strong>&lt;/p>
&lt;p>A parameter will be added to &lt;code>PodValidationOptions&lt;/code> so that, even when the &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature gate is disabled, updates to a Pod that already uses the combination of &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code> are still allowed.&lt;/p>
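&lt;p>A rough sketch of this validation behavior is shown below. The types are simplified and hypothetical: &lt;code>PodSpec&lt;/code> models only the two relevant fields, and the option name &lt;code>AllowUserNamespacesWithHostNetwork&lt;/code> is an assumption made for illustration, not a confirmed field of &lt;code>PodValidationOptions&lt;/code>.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// PodSpec models only the two fields relevant here (simplified, hypothetical).
type PodSpec struct {
	HostNetwork bool
	HostUsers   *bool // nil means "not set", which is treated like true
}

// ValidationOptions mirrors the idea of PodValidationOptions.
// The field name is an assumption for this sketch.
type ValidationOptions struct {
	AllowUserNamespacesWithHostNetwork bool
}

// usesCombination reports whether the spec sets hostNetwork: true
// together with hostUsers: false.
func usesCombination(spec PodSpec) bool {
	if !spec.HostNetwork || spec.HostUsers == nil {
		return false
	}
	return !*spec.HostUsers
}

// optionsFor allows the combination when the feature gate is enabled, or
// when the existing object (update case) already uses the combination.
func optionsFor(gateEnabled bool, oldSpec *PodSpec) ValidationOptions {
	allow := gateEnabled
	if oldSpec != nil {
		allow = allow || usesCombination(*oldSpec)
	}
	return ValidationOptions{AllowUserNamespacesWithHostNetwork: allow}
}

// validate rejects the combination unless the options permit it.
func validate(spec PodSpec, opts ValidationOptions) error {
	if usesCombination(spec) {
		if !opts.AllowUserNamespacesWithHostNetwork {
			return errors.New("hostNetwork: true cannot be combined with hostUsers: false")
		}
	}
	return nil
}

func main() {
	// new(bool) yields a pointer to false, that is, hostUsers: false.
	pod := PodSpec{HostNetwork: true, HostUsers: new(bool)}

	// Create with the gate disabled: rejected.
	fmt.Println(validate(pod, optionsFor(false, nil)))
	// Create with the gate enabled: accepted.
	fmt.Println(validate(pod, optionsFor(true, nil)))

	// Update of a Pod that already used the combination, gate disabled: accepted.
	old := new(PodSpec)
	*old = pod
	fmt.Println(validate(pod, optionsFor(false, old)))
}
```

&lt;p>Deriving the option from the old object means that disabling the gate does not make existing Pods impossible to update, while new Pods with the combination are still rejected.&lt;/p>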
&lt;p>&lt;strong>Scheduling:&lt;/strong>&lt;/p>
&lt;p>The &lt;code>NodeDeclaredFeatures&lt;/code> scheduler plugin will ensure that Pods requiring the &lt;code>UserNamespacesHostNetwork&lt;/code> feature are only scheduled onto nodes that declare support for it. This is achieved by matching the Pod&amp;rsquo;s feature requirements against the node&amp;rsquo;s &lt;code>node.status.declaredFeatures&lt;/code>.&lt;/p>
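&lt;p>The matching described above could look roughly like the following filter. This is only a sketch under simplifying assumptions: &lt;code>Pod&lt;/code>, &lt;code>Node&lt;/code>, &lt;code>requiredFeatures&lt;/code>, and &lt;code>fits&lt;/code> are hypothetical names and do not reflect the real scheduler plugin interface.&lt;/p>

```go
package main

import "fmt"

// featureName is the declared feature name (taken from the KEP text).
const featureName = "UserNamespacesHostNetwork"

// Pod and Node are simplified, hypothetical types for this sketch.
type Pod struct {
	HostNetwork bool
	HostUsers   *bool // nil means "not set", which is treated like true
}

type Node struct {
	Name             string
	DeclaredFeatures []string // models node.status.declaredFeatures
}

// requiredFeatures infers which declared features a Pod needs.
func requiredFeatures(pod Pod) []string {
	if !pod.HostNetwork || pod.HostUsers == nil || *pod.HostUsers {
		return nil
	}
	return []string{featureName}
}

// fits reports whether the node declares every feature the Pod requires.
func fits(pod Pod, node Node) bool {
	declared := map[string]bool{}
	for _, f := range node.DeclaredFeatures {
		declared[f] = true
	}
	for _, f := range requiredFeatures(pod) {
		if !declared[f] {
			return false
		}
	}
	return true
}

func main() {
	// new(bool) yields a pointer to false, that is, hostUsers: false.
	pod := Pod{HostNetwork: true, HostUsers: new(bool)}
	nodes := []Node{
		{Name: "node-new", DeclaredFeatures: []string{featureName}},
		{Name: "node-old"},
	}
	for _, n := range nodes {
		fmt.Println(n.Name, fits(pod, n))
	}
}
```

&lt;p>Pods that do not use the combination have no feature requirements in this sketch, so they fit on every node regardless of what it declares.&lt;/p>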
&lt;p>&lt;strong>CRI Implementation:&lt;/strong>&lt;/p>
&lt;p>When &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code> are used together, the container runtime needs to mount &lt;code>/sys&lt;/code> using a bind mount instead of mounting sysfs directly, because mounting sysfs directly in this configuration fails with insufficient permissions (EPERM).&lt;/p>
&lt;p>The following mount options will be used to ensure security and proper functionality:&lt;/p>
&lt;ul>
&lt;li>&lt;code>nosuid&lt;/code>: Prevents privilege escalation through SUID binaries.&lt;/li>
&lt;li>&lt;code>nodev&lt;/code>: Prevents unauthorized access to hardware through device files.&lt;/li>
&lt;li>&lt;code>noexec&lt;/code>: Prevents execution of binary programs from the mounted filesystem.&lt;/li>
&lt;li>&lt;code>rbind&lt;/code>: Ensures that the directory is mounted along with all its sub-mount points.&lt;/li>
&lt;li>&lt;code>rro&lt;/code>: Ensures that the entire directory tree, including sub-mount points, is mounted as read-only.&lt;/li>
&lt;/ul>
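&lt;p>As an illustration, the mount entry a runtime might generate for &lt;code>/sys&lt;/code> could be built as in the sketch below. The &lt;code>Mount&lt;/code> struct is a simplified stand-in that follows the shape of an OCI runtime-spec mount entry, &lt;code>sysMount&lt;/code> is a hypothetical helper, and the options in the default (direct sysfs) branch are an assumption rather than something this KEP specifies.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// Mount is a simplified stand-in shaped like an OCI runtime-spec mount entry.
type Mount struct {
	Destination string
	Type        string
	Source      string
	Options     []string
}

// sysMount picks how /sys is mounted for a sandbox. With hostNetwork: true
// and hostUsers: false it returns a recursive, read-only bind mount of the
// host's /sys using the options listed in the KEP.
func sysMount(hostNetwork, hostUsers bool) Mount {
	if !hostNetwork || hostUsers {
		// Assumed default: mount sysfs directly, read-only.
		return Mount{
			Destination: "/sys",
			Type:        "sysfs",
			Source:      "sysfs",
			Options:     []string{"nosuid", "noexec", "nodev", "ro"},
		}
	}
	return Mount{
		Destination: "/sys",
		Type:        "bind",
		Source:      "/sys",
		Options:     []string{"rbind", "nosuid", "nodev", "noexec", "rro"},
	}
}

func main() {
	m := sysMount(true, false)
	fmt.Println(m.Type, m.Source, strings.Join(m.Options, ","))
}
```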
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>pkg/apis/core/validation&lt;/code>: &lt;code>2025-10-03&lt;/code> - &lt;code>85.1%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>Add e2e tests to ensure that pods with the combination of &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code> can run properly.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>The &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature gate is implemented and disabled by default.&lt;/li>
&lt;li>Add an implementation that integrates with the NodeDeclaredFeatures feature gate.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Mainstream container runtimes and low-level container runtimes (e.g., containerd/CRI-O, runc/crun) have released generally available versions that support the concurrent use of &lt;code>hostNetwork&lt;/code> and user namespaces.&lt;/li>
&lt;li>Add e2e tests to ensure feature availability.&lt;/li>
&lt;li>Document the limitations of combining user namespaces and &lt;code>hostNetwork&lt;/code> (e.g., CAP_NET_RAW, CAP_NET_ADMIN, CAP_NET_BIND_SERVICE remain restricted).&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>The feature has been stable in Beta for at least 2 Kubernetes releases.&lt;/li>
&lt;li>Multiple major container runtimes support the feature.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Upgrade: After upgrading to a version that supports this KEP, the &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature gate can be enabled at any time.&lt;/p>
&lt;p>Downgrade: After a downgrade to a version that does not support this KEP, kube-apiserver will revert to strict validation. Pods that are already running with this configuration will continue to run with it.
If the feature is to be fully disabled, all Pods using this configuration must be deleted manually.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>&lt;strong>When the NodeDeclaredFeatures feature gate is enabled on the control plane but not on an older Kubelet:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>If the control plane is upgraded to a version that supports the &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature, it will correctly identify older nodes as incompatible. The scheduler will filter these nodes, causing Pods with the feature requirement to remain in the Pending state until compatible nodes are available.&lt;/li>
&lt;li>For API validation, operations will be rejected if the target Pod resides on an older node that lacks the necessary feature.&lt;/li>
&lt;li>This strict filtering is reliable because the &lt;code>NodeDeclaredFeatures&lt;/code> framework is scoped to new features only. This prevents ambiguous situations where a feature might be present on a node but is not being reported because the node is too old. The absence of a declared feature is a defini&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>When the NodeDeclaredFeatures feature gate is disabled on the control plane but enabled on the Kubelet:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A newer kube-apiserver with the &lt;code>UserNamespacesHostNetworkSupport&lt;/code> feature enabled will accept a Pod with &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code>.&lt;/li>
&lt;li>An older kubelet will still get the Pod definition from the kube-apiserver. It will attempt to create the Pod. If the container runtime version is too old and doesn&amp;rsquo;t support this combination, the Pod will be stuck in the ContainerCreating state.&lt;/li>
&lt;li>To mitigate scheduling issues in mixed-version clusters, the kubelet will use the &lt;code>UserNamespacesHostNetwork&lt;/code> field from CRI-API&amp;rsquo;s &lt;code>RuntimeFeatures&lt;/code> to report node capabilities via Node Declared Features. This allows the scheduler to avoid placing Pods requiring this combination on nodes that do not support it, even in version-skew scenarios.&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>UserNamespacesHostNetworkSupport&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kube-apiserver&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. The behavior only changes when a user explicitly sets both &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code> in a Pod spec.
The behavior of all existing Pods is unaffected.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. It can be disabled by setting the feature gate to false and restarting the kube-apiserver.
This restores the old validation logic.
Pods already running in this mode (&lt;code>hostNetwork: true&lt;/code>, &lt;code>hostUsers: false&lt;/code>) are not stopped automatically. They must be deleted manually.
Otherwise, they will continue to run in that mode even though the feature is disabled.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The kube-apiserver will once again begin to accept the combination of &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code>.
This is a stateless change, and reenabling is safe.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>During the alpha stage, unit tests covering the feature gate in both the enabled and disabled states will be added to the validation code. Manual testing will also be conducted during the beta stage, and the testing process will be documented here.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The &lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
section covers this point.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>If a pod is stuck in the ContainerCreating state and returns events similar to the following, it indicates that the container runtime does not yet support this combination, and we should roll back this feature:&lt;/p>
&lt;pre tabindex="0">&lt;code>Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox &amp;#34;0db019a96c2a28eaacb0d8a795bbbc48c8a3823d9b8e5099948f1d99e826238d&amp;#34;: failed to generate sandbox container spec: failed to pin user namespace: failed to open netns(): open : no such file or directory
&lt;/code>&lt;/pre>&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>This will be validated via manual testing.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No impact on running workloads.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;ul>
&lt;li>Detection: Please refer to the content in the &amp;ldquo;What specific metrics should inform a rollback?&amp;rdquo; section.&lt;/li>
&lt;li>Mitigations: Users should roll back this feature and discontinue using the combination of &lt;code>hostNetwork: true&lt;/code> and &lt;code>hostUsers: false&lt;/code>.&lt;/li>
&lt;li>Diagnostics: If the following log appears in kubelet, it indicates an abnormality with this feature, and users need to roll back. The default log level is sufficient to obtain the following log:&lt;/li>
&lt;/ul>
&lt;pre tabindex="0">&lt;code> E1014 20:30:39.550653 2823108 pod_workers.go:1324] &amp;#34;Error syncing pod, skipping&amp;#34; err=&amp;#34;failed to \&amp;#34;CreatePodSandbox\&amp;#34; for \&amp;#34;data-writer-pod_default(7607f6d7-91e1-4dbd-b957-c0d7b101de2e)\&amp;#34; with CreatePodSandboxError: \&amp;#34;Failed to create sandbox for pod \\\&amp;#34;data-writer-pod_default(7607f6d7-91e1-4dbd-b957-c0d7b101de2e)\\\&amp;#34;: rpc error: code = Unknown desc = failed to start sandbox \\\&amp;#34;f47b48e3c415105d25fb316cf224c0e57b146b340c09d6847b2dfcf3b49c923c\\\&amp;#34;: failed to generate sandbox container spec: failed to pin user namespace: failed to open netns(): open : no such file or directory\&amp;#34;&amp;#34; pod=&amp;#34;default/data-writer-pod&amp;#34; podUID=&amp;#34;7607f6d7-91e1-4dbd-b957-c0d7b101de2e&amp;#34;
&lt;/code>&lt;/pre>&lt;ul>
&lt;li>Testing: Failure mode tests have been run locally. We cannot add this test to the e2e test suite because once container runtime support is introduced, the failure no longer occurs and the test would fail.&lt;/li>
&lt;/ul>
&lt;pre tabindex="0">&lt;code>$ kubectl get pods
NAME READY STATUS RESTARTS AGE
data-writer-pod 0/1 ContainerCreating 0 8m4s
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
8m37s Normal Starting node/127.0.0.1
8m37s Normal RegisteredNode node/127.0.0.1 Node 127.0.0.1 event: Registered Node 127.0.0.1 in Controller
8m7s Normal Scheduled pod/data-writer-pod Successfully assigned default/data-writer-pod to 127.0.0.1
8m7s Warning FailedCreatePodSandBox pod/data-writer-pod Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox &amp;#34;f47b48e3c415105d25fb316cf224c0e57b146b340c09d6847b2dfcf3b49c923c&amp;#34;: failed to generate sandbox container spec: failed to pin user namespace: failed to open netns(): open : no such file or directory
&lt;/code>&lt;/pre>&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2025-10-03: Initial proposal&lt;/li>
&lt;li>2025-12-18: Add implementation content for v1.36&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>There are no known drawbacks at this time.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>Add this feature to the existing &lt;code>UserNamespacesSupport&lt;/code> feature gate:&lt;/p>
&lt;ul>
&lt;li>This was ruled out because the &lt;code>UserNamespacesSupport&lt;/code> feature is approaching GA, and its functionality should be stable.
Adding a new, externally-dependent, and immature behavior to a nearly-GA feature would introduce unnecessary risk and delays. Keeping the two feature gates separate is cleaner and safer.&lt;/li>
&lt;/ul>
&lt;p>Do not implement this feature:&lt;/p>
&lt;ul>
&lt;li>Users can use &lt;code>hostPort&lt;/code> as an alternative to &lt;code>hostNetwork&lt;/code>, but this may cause some disruption to the existing user environment, as certain privileged containers require direct interaction with the host network stack. Moreover, &lt;code>hostPort&lt;/code> requires pre-configured CNI; otherwise, the pod will fail to start. This limitation is precisely why Kubernetes control plane components continue to rely on &lt;code>hostNetwork&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>No new infrastructure needed.&lt;/p></description></item><item><title>Resources: Allow informers for getting a stream of data instead of chunking</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3157/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3157/</guid><description>
&lt;h1 id="kep-3157-allow-informers-for-getting-a-stream-of-data-instead-of-chunking">KEP-3157: allow informers for getting a stream of data instead of chunking.&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#required-changes-for-a-watch-request-with-the-sendinitialeventstrue"
>Required changes for a WATCH request with the SendInitialEvents=true&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api-changes"
>API changes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#important-optimisations"
>Important optimisations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#manual-testing-without-the-changes-in-place"
>Manual testing without the changes in place&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#results-with-watch-list"
>Results with WATCH-LIST&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#required-changes-for-a-watch-request-with-the-rv-set-to-the-last-observed-value-rv--0"
>Required changes for a WATCH request with the RV set to the last observed value (RV &amp;gt; 0)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#provide-a-fix-for-the-long-standing-issue-httpsgithubcomkuberneteskubernetesissues59848"
>Provide a fix for the long-standing issue &lt;a href="https://github.com/kubernetes/kubernetes/issues/59848">https://github.com/kubernetes/kubernetes/issues/59848&lt;/a>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#replacing-standard-list-request-with-watchlist-mechanism-for-client-gos-list-method"
>Replacing standard List request with WatchList mechanism for client-go&amp;rsquo;s List method.&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta2"
>Beta2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta3"
>Beta3&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta4"
>Beta4&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta5"
>Beta5&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#backward-compatibility"
>Backward compatibility&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scale-test-results-5000-nodes-165k-pods"
>Scale test results (5000 nodes, ~165K pods)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#post-ga"
>Post-GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#appendix"
>Appendix&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#sources-of-list-request"
>Sources of LIST request&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#steps-followed-by-informers"
>Steps followed by informers&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The kube-apiserver is vulnerable to memory explosion.
The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption.
Uncontrolled and unbounded memory consumption of the servers affects not only clusters that operate in an
HA mode but also other programs that share the same machine.
In this KEP we propose a solution to this issue.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Today informers are the primary source of LIST requests.
The LIST is used to get a consistent snapshot of data to build up a client-side in-memory cache.
The primary issue with LIST requests is unpredictable memory consumption.
The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects.
See the &lt;a href="#appendix"
>Appendix&lt;/a>
section for more details on potential sources of LIST requests and their impact on memory.
&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3157-watch-list/./memory_usage_before.png" alt="memory consumption during the synthetic test" title="memory consumption during the synthetic test">
In extreme cases, the server can allocate hundreds of megabytes per request.
To better visualize the issue let&amp;rsquo;s consider the above graph.
It shows the memory usage of an API server during a test (see &lt;a href="#manual-testing-without-the-changes-in-place"
>manual test&lt;/a>
section for more details).
We can see that increasing the number of informers drastically increases the memory consumption of the server.
Moreover, around 16:40 we lost the server after running 16 informers. During an investigation, we realized that the server allocates a lot of memory for handling LIST requests.
In short, it needs to bring data from the database, unmarshal it, do some conversions and prepare the final response for the client.
The bottom line is around O(5*the_response_from_etcd) of temporary memory consumption.
Neither priority and fairness nor Golang garbage collection is able to protect the system from exhausting memory.&lt;/p>
&lt;p>A situation like that is doubly dangerous.
First, as we saw, it could slow down, if not fully stop, an API server that has received the requests.
Secondly, a sudden and uncontrolled spike in memory consumption will likely put pressure on the node itself.
This might lead to thrashing, starving, and finally losing other processes running on the same node, including kubelet.
Stopping kubelet has serious consequences, as it leads to workload disruption and a much bigger blast radius.
Note that in that scenario even clusters in an HA setup are affected.&lt;/p>
&lt;p>Worse, in rare cases (see the &lt;a href="#appendix"
>Appendix&lt;/a>
section for more) recovery of large clusters, which have many kubelets and hence many informers for pods, secrets, and configmaps, can lead to a very expensive storm of LISTs.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>protect kube-apiserver and its node against list-based OOM attacks&lt;/li>
&lt;li>considerably reduce (temporary) memory footprint of LISTs, down from O(watchers*page-size*object-size*5) to O(watchers*constant), constant around 2 MB.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>Example:&lt;/p>
&lt;p>512 watches of 400 MB data: 512*500*2MB*5=2.5TB ↘ 2 GB&lt;/p>
&lt;p>racing with Golang GC to free this temporary memory before being OOM&amp;rsquo;ed.&lt;/p>&lt;/blockquote>
&lt;ul>
&lt;li>reduce etcd load by serving from watch cache&lt;/li>
&lt;li>get a replacement for paginated lists from watch-cache, which is not feasible without major investment&lt;/li>
&lt;li>enforce consistency in the sense of freshness of the returned list&lt;/li>
&lt;li>be backward compatible with new client -&amp;gt; old server&lt;/li>
&lt;li>fix the long-standing &amp;ldquo;stale reads from the cache&amp;rdquo; issue, &lt;a href="https://github.com/kubernetes/kubernetes/issues/59848"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/59848&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>get rid of list or list pagination&lt;/li>
&lt;li>rewrite the list storage stack to allow streaming; instead, we use the existing streaming infrastructure (watches).&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use streaming from the watch-cache instead of paging from etcd.
Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see &lt;a href="#appendix"
>Appendix&lt;/a>
section for more details on how informers operate today).
The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs.
This would allow us to keep memory allocations constant.
The server is bounded by the maximum allowed size of an object of 1.5 MB in etcd (note that the same object in memory can be much bigger, even by an order of magnitude)
plus a few additional allocations, that will be explained later in this document.
The rough idea/plan is as follows:&lt;/p>
&lt;ul>
&lt;li>step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.&lt;/li>
&lt;li>step 2: upon receiving the request from an informer, compute the RV at which the result should be returned (possibly contacting etcd if consistent read was requested). It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will meet the consistency requirements of the request.&lt;/li>
&lt;li>step 2a: wait until the watch cache catches up with the computed RV.&lt;/li>
&lt;li>step 2b: send all objects currently stored in memory for the given resource type.&lt;/li>
&lt;li>step 2c: send a bookmark event to the informer with the given RV.&lt;/li>
&lt;li>step 3: listen for further events using the request from step 1.&lt;/li>
&lt;/ul>
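&lt;p>The server-side part of the plan (steps 2 to 2c) can be sketched as follows. This is a toy model under stated assumptions, not the kube-apiserver implementation: &lt;code>Event&lt;/code>, &lt;code>toyCache&lt;/code> and &lt;code>serveWatchList&lt;/code> are hypothetical names, resource versions are plain integers, and instead of waiting for the cache to catch up the sketch simply returns an error when the cache is behind.&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a simplified stand-in for a watch event (hypothetical type).
type Event struct {
	Type string // "ADDED" or "BOOKMARK"
	Name string // object name, empty for bookmarks
	RV   uint64 // resource version carried by the event
}

// toyCache is a hypothetical, heavily simplified watch cache.
type toyCache struct {
	rv      uint64            // latest resource version the cache has observed
	objects map[string]uint64 // object name -> resource version of its last change
}

// serveWatchList mirrors steps 2a to 2c: check freshness, stream the
// current state, then emit a bookmark carrying the cache's RV.
func serveWatchList(c toyCache, requiredRV uint64, send func(Event)) error {
	// Step 2a: a real server would wait for the cache to catch up;
	// this sketch only reports that the cache is behind.
	if requiredRV > c.rv {
		return fmt.Errorf("cache at RV %d is behind required RV %d", c.rv, requiredRV)
	}
	// Step 2b: send every object currently held in memory, one event each.
	names := make([]string, 0, len(c.objects))
	for name := range c.objects {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for the example only
	for _, name := range names {
		send(Event{Type: "ADDED", Name: name, RV: c.objects[name]})
	}
	// Step 2c: tell the client that the initial state is complete.
	send(Event{Type: "BOOKMARK", RV: c.rv})
	return nil
}

func main() {
	cache := toyCache{rv: 12, objects: map[string]uint64{"a": 7, "b": 12}}
	err := serveWatchList(cache, 10, func(e Event) {
		fmt.Printf("%s %q rv=%d\n", e.Type, e.Name, e.RV)
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

&lt;p>Because objects are handed to &lt;code>send&lt;/code> one at a time, the sketch never builds a full list response in memory, which is the property the proposal relies on.&lt;/p>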
&lt;p>Note: kube-apiserver already follows the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) for RV=&amp;ldquo;0&amp;rdquo; watches.
The mode is not used in informers today but is supported by every kube-apiserver for legacy compatibility reasons.
A watch started with RV=&amp;ldquo;0&amp;rdquo; may return stale data. It is possible for the watch to start at a much older resource version than the client has previously observed, particularly in high availability configurations, due to partitions or stale caches.&lt;/p>
&lt;p>Note 2: informers need consistent lists when initializing after a restart, to avoid time travel in case of switching to another HA instance of kube-apiserver with an outdated/lagging watch cache.
See the following &lt;a href="https://github.com/kubernetes/kubernetes/issues/59848"
target="_blank" rel="noopener">issue&lt;/a>
for more details.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="required-changes-for-a-watch-request-with-the-sendinitialeventstrue">Required changes for a WATCH request with the SendInitialEvents=true&lt;/h3>
&lt;p>The following sequence diagram depicts steps that are needed to complete the proposed feature.
A high-level overview of each step is provided in the table that immediately follows the diagram.
Further down in this section we provide a detailed description of each required step.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3157-watch-list/./sequence_diagram.png" alt="flow between an informer and the watch cache" title="flow between an informer and the watch cache">&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Step&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>1.&lt;/th>
&lt;th>The reflector establishes a WATCH request with the watch cache.&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>2.&lt;/th>
&lt;th>If needed, the watch cache contacts etcd for the most up-to-date ResourceVersion.&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>2a.&lt;/th>
&lt;th>The watch cache waits until it has observed the requested ResourceVersion.&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>2b.&lt;/th>
&lt;th>The watch cache streams all the contents from its in-memory store.&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>2c.&lt;/th>
&lt;th>After sending all the objects it sends a bookmark event with the given RV to the reflector.&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>3.&lt;/th>
&lt;th>The reflector replaces its internal store with collected items, updates its internal resourceVersion to the one obtained from the bookmark event.&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>3a.&lt;/th>
&lt;th>The reflector uses the WATCH request from step 1 for further progress notifications.&lt;/th>
&lt;/tr>
&lt;/table>
&lt;p>Step 1: On initialization the reflector gets a snapshot of data from the server by passing RV=”” (= unset value) to ensure freshness and setting resourceVersionMatch=NotOlderThan and sendInitialEvents=true.
We do that only during the initial ListAndWatch call.
Each event (ADD, UPDATE, DELETE) except the BOOKMARK event received from the server is collected.
Passing resourceVersion=&amp;quot;&amp;quot; tells the cacher it has to guarantee that the cache is at least as up to date as a LIST executed at the same time.&lt;/p>
&lt;p>Note: This ensures that returned data is consistent, served from etcd via a quorum read and prevents &amp;ldquo;going back in time&amp;rdquo;.&lt;/p>
&lt;p>Note 2: The watch cache currently doesn&amp;rsquo;t support resourceVersion=&amp;quot;&amp;quot; and is thus vulnerable to stale reads, see &lt;a href="https://github.com/kubernetes/kubernetes/issues/59848"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/59848&lt;/a>
for more details.&lt;/p>
&lt;p>Step 2: Right after receiving a request from the reflector, the cacher gets the current resourceVersion (aka bookmarkAfterResourceVersion) directly from etcd.
It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data.
There are ways to do that cheaply, e.g. we could issue a count request against the datastore.
Next, the cacher creates a new cacheWatcher (implements watch.Interface) passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache.
After sending initial data the cacheWatcher starts listening on an input channel for new events, including a bookmark event.
At some point, the cacher will receive an event with a resourceVersion equal to or greater than the bookmarkAfterResourceVersion.
It will be propagated to the cacheWatcher and then back to the reflector as a BOOKMARK event.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3157-watch-list/./cacher.png" alt="watch cache with the proposed changes" title="watch cache with the proposed changes">&lt;/p>
&lt;p>Step 2a: Where does the initial data come from?&lt;/p>
&lt;p>During construction, the cacher creates the reflector and the watchCache.
Since the watchCache implements the Store interface it is used by the reflector to store all data it has received from etcd.&lt;/p>
&lt;p>Step 2b: What happens when new events are received while the cacheWatcher is sending initial data?&lt;/p>
&lt;p>The cacher maintains a list of all current watchers (cacheWatcher) and a separate goroutine (dispatchEvents) for delivering new events to the watchers.
New events are added via the cacheWatcher.nonblockingAdd method that adds an event to the cacheWatcher.input channel.
The cacheWatcher.input is a buffered channel and has a different size for different Resources (10 or 1000).
Since the cacheWatcher starts processing the cacheWatcher.input channel only after sending all initial events it might block once its buffered channel tips over.
In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event.
All watchers that have failed to deliver the event will be closed.&lt;/p>
&lt;p>Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.&lt;/p>
&lt;p>For an alpha version, we will delay closing the watch request until all data is sent to the client.
We expect this to behave well even in heavily loaded clusters.
To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV,
what&amp;rsquo;s the average buffer size, and a counter for closed watch requests due to an overfull buffer.&lt;/p>
&lt;p>For a beta version, we have further options if they turn out to be necessary:&lt;/p>
&lt;ol>
&lt;li>comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on
and waiting until the difference between the RVs is &amp;lt; 1000 (the buffer size). We could do that even before sending the initial events.
If the difference is greater than that, there seems to be no need to go on, since the buffer could fill up before we receive an event with the expected RV
(assuming all updates are for the resource the watch request was opened for, which seems unlikely).
If the watchCache is unable to catch up to the bookmarkAfterResourceVersion within some timeout, we hard-close the current connection (tearing down the TCP connection with the client) so that the client reconnects to a different API server with the most up-to-date cache.
Taking into account the baseline etcd performance numbers, waiting for 10 seconds will allow us to receive ~5K events, assuming ~500 QPS throughput (see &lt;a href="https://etcd.io/docs/v3.4/op-guide/performance/"
target="_blank" rel="noopener">https://etcd.io/docs/v3.4/op-guide/performance/&lt;/a>
)
Once we are past this step (we know the difference is smaller) and the buffer fills up we:
&lt;ul>
&lt;li>
&lt;p>case-1: won’t close the connection immediately if the bookmark event with the expected RV exists in the buffer.
In that case, we will deliver the initial events, any other events we have received whose RVs are &amp;lt;= bookmarkAfterResourceVersion, and finally the bookmark event, and only then will we soft-close the current connection (that is, end it without tearing down the TCP connection).
An informer will reconnect with the RV from the bookmark event.
Note that any new event received was ignored since the buffer was full.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>case-2: soft-close the connection if the bookmark event with the expected RV for some reason doesn&amp;rsquo;t exist in the buffer.
An informer will reconnect arriving at the step that compares the RVs first.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>make the buffer dynamic - especially when the difference between RVs is &amp;gt; 1000&lt;/li>
&lt;li>inject new events directly to the initial list, i.e. to have the initial list loop consume the channel directly and avoid waiting for the whole initial list to be processed first&lt;/li>
&lt;li>cap the size (cannot allocate more than X MB of memory) of the buffer&lt;/li>
&lt;li>maybe even apply some compression techniques to the buffer (for example by only storing a low-memory shallow reference and taking the actual objects for the event from the store)&lt;/li>
&lt;/ol>
&lt;p>Note: The RV is effectively a global counter that is incremented every time an object is updated.
This imposes a global order of events. It is equivalent to a LIST followed by a WATCH request.&lt;/p>
&lt;p>Note 2: Currently, there is a timeout for LIST requests of 60s.
That means a slow reflector might fail synchronization as well and would have to re-establish the connection.&lt;/p>
&lt;p>Step 2c: How are bookmarks delivered to the cacheWatcher?&lt;/p>
&lt;p>First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously even without regular events happening.
There are two sources of resourceVersions.
The first one is regular events that contain RVs besides objects.
The second one is a special type of etcd event called progressNotification delivering the most up-to-date revision with the given interval only to the kube-apiserver.
As already mentioned in 2a the watchCache is driven by the reflector.
Every event will be eventually propagated from the watchCache to the cacher.processEvent method.
For simplicity, we can assume that the processEvent method will simply update the resourceVersion maintained by the cacher.&lt;/p>
&lt;p>At regular intervals, the cacher checks expired watchers and tries to deliver a bookmark event.
As of today, the interval is set to 1 second.
The bookmark event contains an empty object and the current resourceVersion.
By default, a cacheWatcher expires roughly every 1 minute.&lt;/p>
&lt;p>The expiry interval initially will be decreased to 1 second in this feature&amp;rsquo;s code-path.
This helps us deliver a bookmark event that is &amp;gt;= bookmarkAfterResourceVersion much faster.
After that, the interval will be put back to the previous value.&lt;/p>
&lt;p>Note: Since we get a notification every 5 seconds from etcd and we try to deliver a bookmark every 1 second,
the maximum delay a reflector will have to wait after receiving initial data seems to be 6 seconds (assuming a small dataset).
This is unlikely in practice since we might get the bookmarkAfterResourceVersion even before handling initial data.
Sending the data itself also takes some time.&lt;/p>
&lt;p>Step 3: After receiving a BOOKMARK event the reflector is considered to be synchronized.
It replaces its internal store with the collected items (syncWith) and reuses the current connection for getting further events.&lt;/p>
&lt;h4 id="api-changes">API changes&lt;/h4>
&lt;p>Extend the &lt;code>ListOptions&lt;/code> struct with the following field:&lt;/p>
&lt;pre tabindex="0">&lt;code>type ListOptions struct {
...
// `sendInitialEvents=true` may be set together with `watch=true`.
// In that case, the watch stream will begin with synthetic events to
// produce the current state of objects in the collection. Once all such
// events have been sent, a synthetic &amp;#34;Bookmark&amp;#34; event will be sent.
// The bookmark will report the ResourceVersion (RV) corresponding to the
// set of objects, and be marked with `&amp;#34;k8s.io/initial-events-end&amp;#34;: &amp;#34;true&amp;#34;` annotation.
// Afterwards, the watch stream will proceed as usual, sending watch events
// corresponding to changes (subsequent to the RV) to objects watched.
//
// When `sendInitialEvents` option is set, we require `resourceVersionMatch`
// option to also be set. The semantics of the watch request are as follows:
// - `resourceVersionMatch` = NotOlderThan
// is interpreted as &amp;#34;data at least as new as the provided `resourceVersion`&amp;#34;
// and the bookmark event is sent when the state is synced
// to a `resourceVersion` at least as fresh as the one provided by the ListOptions.
// If `resourceVersion` is unset, this is interpreted as &amp;#34;consistent read&amp;#34; and the
// bookmark event is sent when the state is synced at least to the moment
// when the request started being processed.
// - `resourceVersionMatch` set to any other value or unset
// Invalid error is returned.
//
// Defaults to true if `resourceVersion=&amp;#34;&amp;#34;` or `resourceVersion=&amp;#34;0&amp;#34;` (for backward
// compatibility reasons) and to false otherwise.
SendInitialEvents *bool
}
&lt;/code>&lt;/pre>&lt;p>The watch bookmark marking the end of initial events stream will have a dedicated
annotation:&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;#34;k8s.io/initial-events-end&amp;#34;: &amp;#34;true&amp;#34;
&lt;/code>&lt;/pre>&lt;p>(the exact name is subject to change during API review). It will allow clients to
precisely figure out when the initial stream of events is finished.&lt;/p>
&lt;p>It&amp;rsquo;s worth noting that explicitly setting SendInitialEvents to false with ResourceVersion=&amp;ldquo;0&amp;rdquo;
will result in not sending initial events, which makes the option work exactly the same
across every potential resource version passed as a parameter.&lt;/p>
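&lt;p>On the client side, the annotation gives a simple stopping rule for the initial phase. The following is a minimal sketch, assuming events have already been decoded into a simplified &lt;code>WatchEvent&lt;/code> struct; that struct and &lt;code>collectInitialState&lt;/code> are hypothetical and are not part of client-go. For brevity it only handles ADDED events, whereas a real reflector would also apply updates and deletions.&lt;/p>

```go
package main

import "fmt"

// initialEventsEnd is the annotation key quoted in the text above
// (the KEP notes that the exact name is subject to change).
const initialEventsEnd = "k8s.io/initial-events-end"

// WatchEvent is a simplified stand-in for a decoded watch event (hypothetical type).
type WatchEvent struct {
	Type            string
	Name            string
	ResourceVersion string
	Annotations     map[string]string
}

// collectInitialState gathers object names until the marked bookmark arrives.
// It returns the collected names, the RV to continue from, and whether the
// end-of-initial-events bookmark was seen.
func collectInitialState(events []WatchEvent) ([]string, string, bool) {
	var names []string
	for _, e := range events {
		if e.Type == "BOOKMARK" {
			if e.Annotations[initialEventsEnd] == "true" {
				return names, e.ResourceVersion, true
			}
			continue // an ordinary bookmark: keep collecting
		}
		if e.Type == "ADDED" {
			names = append(names, e.Name)
		}
	}
	return names, "", false
}

func main() {
	stream := []WatchEvent{
		{Type: "ADDED", Name: "secret-1", ResourceVersion: "5"},
		{Type: "ADDED", Name: "secret-2", ResourceVersion: "8"},
		{Type: "BOOKMARK", ResourceVersion: "9",
			Annotations: map[string]string{initialEventsEnd: "true"}},
	}
	names, rv, synced := collectInitialState(stream)
	fmt.Println(names, rv, synced)
}
```

&lt;p>Checking the annotation rather than the event type alone matters because ordinary bookmarks can appear on any watch; only the marked one signals that the initial state is complete.&lt;/p>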
&lt;h4 id="important-optimisations">Important optimisations&lt;/h4>
&lt;ol>
&lt;li>
&lt;p>Avoid DeepCopying of initial data&lt;br>&lt;br>
The watchCache has an important optimization of wrapping objects into a cachingObject.
Given that objects aren&amp;rsquo;t usually modified (since selfLink has been disabled) and that there might be multiple watchers interested in receiving an event,
wrapping allows us to serialize an object only once.
&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3157-watch-list/./cacher_deep_copy.png" alt="removing deep copying from the watch cache" title="removing deep copying of object from the watch cache">
The watchCache maintains two internal data structures.
The first one is called the store and is driven by the reflector.
It essentially mirrors the content stored in etcd.
It is used to serve LIST requests.
The second one is called the cache, which represents a sliding window of recent events received from the reflector.
It is effectively used to serve WATCH requests from a given RV.&lt;br>&lt;br>
By design cachingObjects are stored only in the cache.
As described in Step 2, the cacheWatcher gets initial data from the watchCache.
The latter, in turn, gets data straight from the store.
That means initial data is not wrapped into cachingObject and hence not subject to this existing optimization.&lt;br>&lt;br>
Before sending objects any further the cacheWatcher does a DeepCopy of every object that has not been wrapped into the cachingObject.
Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.&lt;br>&lt;br>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Reduce the number of allocations in the WatchServer&lt;br>&lt;br>
The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients.
It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image.
Two of them, allocations 1 and 3, deserve special attention because they don&amp;rsquo;t reuse memory and rely on the GC for cleanup.
In other words, the more events we need to send, the more (temporary) memory will be used.
In contrast, the other two allocations are already optimized as they reuse memory instead of creating new buffers for every single event.
For better utilization, a similar technique of reusing memory could be used to save precious RAM and scale the system even further.
&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3157-watch-list/./watch_server_allocs.png" alt="memory allocations in the watch server per object" title="memory allocations in the watch server per object">&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h4 id="manual-testing-without-the-changes-in-place">Manual testing without the changes in place&lt;/h4>
&lt;p>For the past few years, we have seen many clusters suffering from the issue.
Sadly, our only possible recommendation was to ask customers to reduce the cluster in size,
since adding more memory would in most cases not fix the issue.
Recall from the motivation section that just a few requests can allocate gigabytes of data in a fraction of a second.&lt;br>&lt;br>
To reproduce the issue, we executed the following manual test, which is the simplest and cheapest way of putting yourself into customers&amp;rsquo; shoes. The reproducer creates a namespace with 400 secrets, each containing 1 MB of data.
Next, it uses informers to get all secrets in the cluster.
The rough estimate is that a single informer will have to bring at least 400MB from the datastore to get all secrets.&lt;br>&lt;br>
&lt;strong>The result&lt;/strong>: 16 informers were able to take down the test cluster.&lt;/p>
&lt;h4 id="results-with-watch-list">Results with WATCH-LIST&lt;/h4>
&lt;p>We have prepared the following PR &lt;a href="https://github.com/kubernetes/kubernetes/pull/106477"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/106477&lt;/a>
which is almost identical to the proposed solution.
It just differs in a few details.
The following image depicts the results we obtained after running the synthetic test described in the previous section.
&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/3157-watch-list/./memory_usage_after.png" alt="memory consumption during the synthetic test with the proposed changes" title="memory consumption during the synthetic test with the proposed changes">
First of all, it is worth mentioning that the PR was deployed onto the same cluster so that we could ensure an identical setup (CPU, Memory) between the tests.
The graph tells us a few things.&lt;br>&lt;br>
Firstly, the proposed solution is at least &lt;strong>100&lt;/strong> times better than the current state.
Around 12:05 we started the test with 1024 informers, all eventually synced without any errors.
Moreover during that time the server was stable and responsive.
That particular test ended around 12:30. That means it needed ~25 minutes to bring ~400 GB of data across the network!
Impressive achievement.&lt;br>&lt;br>
Secondly, it tells us that memory allocation is not yet proportional to the number of informers! Given the size of individual objects of 1MB and the actual number of informers, we should allocate roughly 2GB of RAM.
We managed to get and analyze the memory profile that showed a few additional allocations inside the watch server.
At this point, it is worth mentioning that the results were achieved with only the first optimization applied.
We expect the system will scale even better with the second optimization as it will put significantly less pressure on the GC.&lt;/p>
&lt;h3 id="required-changes-for-a-watch-request-with-the-rv-set-to-the-last-observed-value-rv--0">Required changes for a WATCH request with the RV set to the last observed value (RV &amp;gt; 0)&lt;/h3>
&lt;p>In that case, no additional changes are required.
We stick to existing semantics.
That is, we start a watch at an exact resource version.
The watch events are for all changes after the provided resource version.
This is safe because the client is assumed to already have the initial state at the starting resource version since the client provided the resource version.&lt;/p>
&lt;h3 id="provide-a-fix-for-the-long-standing-issue">Provide a fix for the long-standing issue &lt;a href="https://github.com/kubernetes/kubernetes/issues/59848"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/59848&lt;/a>
&lt;/h3>
&lt;p>The issue is still open mainly because informers default to resourceVersion=&amp;ldquo;0&amp;rdquo; for their initial LIST requests.
This is problematic because the initial LIST requests served from the watch cache might return data that are arbitrarily delayed.
This in turn could make clients connected to that server read old data and undo recent work that has been done.&lt;/p>
&lt;p>To make consistent reads from cache for LIST requests and thus prevent &amp;ldquo;going back in time&amp;rdquo; we propose to use the same technique for ensuring the cache is not stale as described in the previous section.&lt;/p>
&lt;p>In that case we are going to change informers to pass &lt;code>resourceVersion=&amp;#34;0&amp;#34;&lt;/code> and &lt;code>resourceVersionMatch=MostRecent&lt;/code> for their initial LIST requests.
Then on the server side we:&lt;/p>
&lt;ol>
&lt;li>get the current revision from etcd.&lt;/li>
&lt;li>use the existing waitUntilFreshAndBlock function to wait for the watch cache to catch up to the revision obtained in the previous step.&lt;/li>
&lt;li>reject the request if waitUntilFreshAndBlock times out, thus forcing informers to retry.&lt;/li>
&lt;li>otherwise, construct the final list and send it back to the client.&lt;/li>
&lt;/ol>
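&lt;p>The four server-side steps above can be sketched as a small, self-contained Go program.
This is only an illustration under stated assumptions: the &lt;code>cache&lt;/code> type, its &lt;code>lagPolls&lt;/code> field, the &lt;code>waitUntilFresh&lt;/code> helper and the &lt;code>errCacheNotFresh&lt;/code> error are hypothetical stand-ins, not the kube-apiserver implementation, and the etcd revision is passed in as a plain number instead of being fetched from a real datastore.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// errCacheNotFresh is a hypothetical error telling the client to retry.
var errCacheNotFresh = errors.New("watch cache did not catch up in time, retry")

// cache is a toy stand-in for the watch cache. All names are illustrative.
type cache struct {
	revision int64    // latest revision the cache has observed
	lagPolls int      // simulated number of polls until the cache catches up
	items    []string // objects currently held by the cache
}

// waitUntilFresh is a simplified stand-in for waitUntilFreshAndBlock: it polls
// until the cache has observed at least the wanted revision or gives up.
func (c *cache) waitUntilFresh(want int64, maxPolls int) bool {
	for poll := 0; poll != maxPolls; poll++ {
		if c.revision >= want {
			return true
		}
		// Simulate watch events arriving from the datastore.
		if c.lagPolls > 0 {
			c.lagPolls--
		}
		if c.lagPolls == 0 {
			c.revision = want
		}
	}
	return c.revision >= want
}

// listFromCache follows the four steps: the caller supplies the current etcd
// revision (step 1), the cache waits to catch up (step 2), the request is
// rejected on timeout (step 3), otherwise the list is built (step 4).
func (c *cache) listFromCache(etcdRevision int64, maxPolls int) ([]string, int64, error) {
	if !c.waitUntilFresh(etcdRevision, maxPolls) {
		return nil, 0, errCacheNotFresh
	}
	return append([]string(nil), c.items...), c.revision, nil
}

func main() {
	fresh := cache{revision: 5, lagPolls: 2, items: []string{"secret-a", "secret-b"}}
	items, rv, err := fresh.listFromCache(10, 5)
	fmt.Println(items, rv, err)

	stale := cache{revision: 5, lagPolls: 100}
	_, _, err = stale.listFromCache(10, 3)
	fmt.Println(err)
}
```

&lt;p>In this sketch a cache that cannot catch up within the allowed number of polls returns an error instead of stale data, which mirrors the idea of forcing informers to retry.&lt;/p>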
&lt;h3 id="replacing-standard-list-request-with-watchlist-mechanism-for-client-gos-list-method">Replacing standard List request with WatchList mechanism for client-go&amp;rsquo;s List method.&lt;/h3>
&lt;p>Replacing the underlying implementation of the List method for client-go based clients (like typed or dynamic client)
with the WatchList mechanism requires ensuring that the data returned by both the standard List request and
the new WatchList mechanism remains identical. The challenge is that WatchList no longer retrieves the entire
list from the server at once but only receives individual items, which forces us to &amp;ldquo;manually&amp;rdquo; reconstruct
the list object on the client side.&lt;/p>
&lt;p>To correctly construct the list object on the client side, we need ListKind information.
However, simply reconstructing the list object based on these data is not enough.
In the case of a standard List request, the server&amp;rsquo;s response (a versioned list) is processed through a chain of decoders,
which can potentially modify the resulting list object.
A good example is the WithoutVersionDecoder, which removes the GVK information from the list object.
Thus the &amp;ldquo;manually&amp;rdquo; constructed list object may not be consistent
with the transformations applied by the decoders, leading to differences.&lt;/p>
&lt;p>To ensure full compatibility, the server must provide a versioned empty list in the format requested by the client (e.g., protobuf representation).
We don&amp;rsquo;t know how the client&amp;rsquo;s decoder behaves for different encodings, i.e., whether the decoder actually supports
the encoding we intend to use for reconstruction. Therefore, to ensure maximal compatibility, we will ensure that
the encoding used for the reconstruction of the list matches the format that the client originally requested.
This guarantees that the returned list object can be correctly decoded by the client,
preserving the actual encoding format as intended.&lt;/p>
&lt;p>The proposed solution is to add a new annotation (&lt;code>k8s.io/initial-events-list-blueprint&lt;/code>) to the object returned
in the bookmark event (the bookmark event is sent when the state is synced and marks the end of the WatchList stream).
This annotation will store an empty, versioned list encoded as a Base64 string.
This annotation will be added to the same object/place the &lt;code>k8s.io/initial-events-end&lt;/code> annotation is added.&lt;/p>
&lt;p>When the client receives such a bookmark, it will Base64-decode the empty list and pass it to the decoder chain.
Only after the decoders succeed will the list be populated with data received from subsequent
watch events and returned.&lt;/p>
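&lt;p>A minimal Go sketch of this client-side reconstruction is shown here, assuming a JSON-encoded blueprint.
The &lt;code>podList&lt;/code> type and the &lt;code>rebuildList&lt;/code> function are hypothetical names used only for illustration, and &lt;code>encoding/json&lt;/code> stands in for the real client-go decoder chain.
The Base64 value is the one carried by the bookmark event in the example response in this section.&lt;/p>

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// exampleBlueprint is the Base64 value from the bookmark event in the example
// response shown in this section.
const exampleBlueprint = "eyJraW5kIjoiUG9kTGlzdCIsImFwaVZlcnNpb24iOiJ2MSIsIm1ldGFkYXRhIjp7fSwiaXRlbXMiOm51bGx9Cg=="

// podList is a minimal, hypothetical stand-in for a versioned list type.
type podList struct {
	Kind       string            `json:"kind"`
	APIVersion string            `json:"apiVersion"`
	Items      []json.RawMessage `json:"items"`
}

// rebuildList decodes the empty list carried in the bookmark annotation and,
// only if decoding succeeds, fills it with the objects collected from the
// watch events. encoding/json is a stand-in for the client's decoder chain.
func rebuildList(blueprint string, objects []string) (*podList, error) {
	raw, err := base64.StdEncoding.DecodeString(blueprint)
	if err != nil {
		return nil, fmt.Errorf("decoding blueprint: %w", err)
	}
	list := new(podList)
	if err := json.Unmarshal(raw, list); err != nil {
		return nil, fmt.Errorf("unmarshalling blueprint: %w", err)
	}
	for _, obj := range objects {
		list.Items = append(list.Items, json.RawMessage(obj))
	}
	return list, nil
}

func main() {
	objects := []string{`{"metadata":{"name":"foo"}}`, `{"metadata":{"name":"bar"}}`}
	list, err := rebuildList(exampleBlueprint, objects)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(list.Kind, list.APIVersion, len(list.Items))
}
```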
&lt;p>For example:&lt;/p>
&lt;pre tabindex="0">&lt;code>GET /api/v1/namespaces/test/pods?watch=1&amp;amp;sendInitialEvents=true&amp;amp;allowWatchBookmarks=true&amp;amp;resourceVersion=&amp;amp;resourceVersionMatch=NotOlderThan
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json
{
&amp;#34;type&amp;#34;: &amp;#34;ADDED&amp;#34;,
&amp;#34;object&amp;#34;: {&amp;#34;kind&amp;#34;: &amp;#34;Pod&amp;#34;, &amp;#34;apiVersion&amp;#34;: &amp;#34;v1&amp;#34;, &amp;#34;metadata&amp;#34;: {&amp;#34;resourceVersion&amp;#34;: &amp;#34;8467&amp;#34;, &amp;#34;name&amp;#34;: &amp;#34;foo&amp;#34;}, ...}
}
{
&amp;#34;type&amp;#34;: &amp;#34;ADDED&amp;#34;,
&amp;#34;object&amp;#34;: {&amp;#34;kind&amp;#34;: &amp;#34;Pod&amp;#34;, &amp;#34;apiVersion&amp;#34;: &amp;#34;v1&amp;#34;, &amp;#34;metadata&amp;#34;: {&amp;#34;resourceVersion&amp;#34;: &amp;#34;5726&amp;#34;, &amp;#34;name&amp;#34;: &amp;#34;bar&amp;#34;}, ...}
}
{
&amp;#34;type&amp;#34;:&amp;#34;BOOKMARK&amp;#34;,
&amp;#34;object&amp;#34;:{&amp;#34;kind&amp;#34;:&amp;#34;Pod&amp;#34;,&amp;#34;apiVersion&amp;#34;:&amp;#34;v1&amp;#34;,&amp;#34;metadata&amp;#34;:{&amp;#34;resourceVersion&amp;#34;:&amp;#34;13519&amp;#34;,&amp;#34;annotations&amp;#34;:{&amp;#34;k8s.io/initial-events-end&amp;#34;:&amp;#34;true&amp;#34;,&amp;#34;k8s.io/initial-events-list-blueprint&amp;#34;:&amp;#34;eyJraW5kIjoiUG9kTGlzdCIsImFwaVZlcnNpb24iOiJ2MSIsIm1ldGFkYXRhIjp7fSwiaXRlbXMiOm51bGx9Cg==&amp;#34;}} ...}
}
...
&amp;lt;followed by regular watch stream starting&amp;gt;
&lt;/code>&lt;/pre>&lt;p>&lt;strong>Alternatives&lt;/strong>&lt;/p>
&lt;p>We could modify the type of the object passed in the last bookmark event to include the list.
This approach would require changes to the reflector, as it would need to recognize the new object type in the bookmark event.
However, this could potentially break other clients that are not expecting a different object in the bookmark event.&lt;/p>
&lt;p>Another option would be for the client to issue an empty list request to the API server and receive a list response from the server.
This approach would involve modifying client-go and implementing some form of caching mechanism,
possibly with invalidation policies.
Non-client-go clients that want to use this new feature would need to rebuild this mechanism as well.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->
&lt;!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- &lt;package>: &lt;date> - &lt;current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->
&lt;ul>
&lt;li>k8s.io/apiserver/pkg/storage/cacher: 02/02/2023 - 74.7%&lt;/li>
&lt;li>k8s.io/client-go/tools/cache/reflector: 02/02/2023 - 88.6%&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
&lt;ul>
&lt;li>For alpha, tests asserting the fallback mechanism for the reflector will be added.&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->
&lt;ul>
&lt;li>For alpha, tests exercising this feature will be added.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>The feature is implemented behind the &lt;code>WatchList&lt;/code> feature gate&lt;/li>
&lt;li>Initial e2e tests completed and enabled&lt;/li>
&lt;li>Scalability/Performance tests confirm gains of this feature&lt;/li>
&lt;li>Add support for watchlist to APF&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Metrics are added to the kube-apiserver (see the &lt;a href="#monitoring-requirements"
>monitoring-requirements&lt;/a>
section for more details)&lt;/li>
&lt;li>Implement &lt;code>SendInitialEvents&lt;/code> for &lt;code>watch&lt;/code> requests in the etcd storage implementation&lt;/li>
&lt;li>The feature is enabled for kube-apiserver and kube-controller-manager.&lt;/li>
&lt;li>The generic feature gate mechanism is implemented in client-go.
It will be used to enable a new functionality for reflectors/informers.&lt;/li>
&lt;li>Implement a consistency check detector that will compare data received through a new watchlist request
with data obtained through a standard list request. The detector will be added to the reflector
and activated when an environment variable is set. The environment variable will be set for all jobs run in the Kube CI.&lt;/li>
&lt;li>Update the client-go generated List function to watchList data when the feature gate has been enabled and the ListOptions are satisfied.
This change must be applied to the typed, dynamic and metadata clients.&lt;/li>
&lt;li>Implement a mechanism for automatically detecting the etcd configuration:
whether it is safe to use the RequestWatchProgress API call
or whether the experimental-watch-progress-notify-interval flag has been set.
Knowledge of the etcd configuration will be used to automatically disable the streaming feature.&lt;/li>
&lt;li>Use WatchProgressRequester to request progress notifications directly from etcd.
This mechanism was developed in &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache#use-requestprogress-to-enable-automatic-watch-updates"
target="_blank" rel="noopener">Consistent Reads from Cache KEP&lt;/a>
and will reduce the overall latency for watchlist requests.&lt;/li>
&lt;li>The watchlist call, which serves as a drop-in replacement for list calls in client libraries,
must properly set the kind and apiVersion fields.
These fields are important for the correct decoding of the objects.
See also: &lt;a href="https://github.com/kubernetes/kubernetes/pull/126191"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/126191&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h4 id="beta2">Beta2&lt;/h4>
&lt;ul>
&lt;li>The feature is enabled for kubelet.&lt;/li>
&lt;li>Extend the existing performance tests with a case that adds a large number of small objects.
The current perf test adds a small number of large objects.
The new variant will help catch potential regressions such as &lt;a href="https://github.com/kubernetes/kubernetes/issues/129467"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/129467&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h4 id="beta3">Beta3&lt;/h4>
&lt;p>In response to new concerns raised during the 1.33 release cycle, we revised the approach for the feature.
The discussion happened in &lt;a href="https://docs.google.com/document/d/1x30DjXSSF5krpyoTCwManJg6vphpnGI37xfCRaiHAbs"
target="_blank" rel="noopener">this document&lt;/a>
and resulted in the following update for the criteria:&lt;/p>
&lt;ul>
&lt;li>Revert the client-go changes that use watchList to implement List.
This includes removing the API annotation that made this possible, because it doesn&amp;rsquo;t
serve another purpose.&lt;/li>
&lt;li>Ensure we don&amp;rsquo;t break the &amp;ldquo;latestRV&amp;rdquo; use case for StorageVersionMigrator reusing the
informer cache from the kube-controller-manager&lt;/li>
&lt;li>Ensure that the feature is usable by external projects by validating it works with
controller-runtime out-of-the-box via simple enablement&lt;/li>
&lt;li>Add support for AcceptContentType header with the value application/json;as=Table and
application/json;as=PartialObjectMetadata&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/a07b1aaa5b39b351ec8586de800baa5715304a3f/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L416"
target="_blank" rel="noopener">Switch&lt;/a>
the &lt;code>storage/cacher&lt;/code> to use streaming directly from etcd
(This will also allow us to &lt;a href="https://github.com/kubernetes/kubernetes/blob/a07b1aaa5b39b351ec8586de800baa5715304a3f/staging/src/k8s.io/client-go/tools/cache/reflector.go#L110"
target="_blank" rel="noopener">remove&lt;/a>
the &lt;code>reflector.UseWatchList&lt;/code> field).&lt;/li>
&lt;li>Enable the feature by default for kube-controller-manager.&lt;/li>
&lt;/ul>
&lt;h4 id="beta4">Beta4&lt;/h4>
&lt;ul>
&lt;li>Disable watchlist support in the fake client so that informers do not use watchlists.
This ensures that unit tests relying on non-standard fake client behavior will continue to work.&lt;/li>
&lt;li>Currently, the &lt;code>ListWatcher&lt;/code> used by the &lt;code>WatchCache&lt;/code> does not pass the RV from the reflectors.
As a result the consistency detector used by the reflectors fails for the &lt;code>WatchCache&lt;/code>.
This issue needs to be resolved.&lt;/li>
&lt;li>Enable the &lt;code>WatchListClient&lt;/code> feature gate by default.
The FG is defined in client-go.
Enabling it will turn on the feature for all clients.&lt;/li>
&lt;/ul>
&lt;h4 id="beta5">Beta5&lt;/h4>
&lt;ul>
&lt;li>Enable gzip compression for the WatchList request. Gated behind a new server-side
feature gate &lt;code>WatchListCompression&lt;/code> (Beta, enabled by default).
Regular watch requests are not affected.&lt;/li>
&lt;/ul>
&lt;h5 id="backward-compatibility">Backward compatibility&lt;/h5>
&lt;p>No client-side changes are required. Go&amp;rsquo;s &lt;code>http.Transport&lt;/code> automatically sends
&lt;code>Accept-Encoding: gzip&lt;/code> when &lt;code>DisableCompression&lt;/code> is &lt;code>false&lt;/code> (the default in
&lt;code>rest.Config&lt;/code>), the same mechanism already used for LIST compression. Older
clients will transparently receive compressed WatchList responses.
The &lt;code>pull-kubernetes-gce-master-scale-performance-5000&lt;/code> test with the
&lt;a href="https://github.com/kubernetes/kubernetes/pull/138327"
target="_blank" rel="noopener">POC&lt;/a>
validates this: all
in-cluster clients (kubelet, kube-controller-manager, scheduler) used the standard
client-go transport with no modifications.&lt;/p>
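&lt;p>The transport behaviour described above can be observed with a short Go program.
This is a sketch that exercises &lt;code>net/http&lt;/code> directly, not client-go or &lt;code>rest.Config&lt;/code>.
The &lt;code>acceptEncodingSeenByServer&lt;/code> helper is a hypothetical name, and the local test server simply echoes back the &lt;code>Accept-Encoding&lt;/code> header it received.&lt;/p>

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// acceptEncodingSeenByServer starts a throwaway local server, sends one GET
// through a fresh http.Transport, and returns the Accept-Encoding header value
// that the server observed on the request.
func acceptEncodingSeenByServer(disableCompression bool) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Echo the request header back so the caller can inspect it.
		fmt.Fprint(w, r.Header.Get("Accept-Encoding"))
	}))
	defer srv.Close()

	transport := new(http.Transport)
	transport.DisableCompression = disableCompression
	defer transport.CloseIdleConnections()

	client := new(http.Client)
	client.Transport = transport

	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	withCompression, err := acceptEncodingSeenByServer(false)
	fmt.Printf("DisableCompression=false: %q (err: %v)\n", withCompression, err)

	withoutCompression, err := acceptEncodingSeenByServer(true)
	fmt.Printf("DisableCompression=true:  %q (err: %v)\n", withoutCompression, err)
}
```

&lt;p>With the default setting the server sees &lt;code>gzip&lt;/code>, and with &lt;code>DisableCompression&lt;/code> set to &lt;code>true&lt;/code> the header is not sent at all.&lt;/p>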
&lt;h5 id="scale-test-results-5000-nodes-165k-pods">Scale test results (5000 nodes, ~165K pods)&lt;/h5>
&lt;p>We ran the &lt;code>pull-kubernetes-gce-master-scale-performance-5000&lt;/code> test with WatchList
compression enabled and compared against a &lt;a href="https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-gce-scale-performance-5000/2056057601027739648/artifacts/APIResponsivenessPrometheus_simple_load_2026-05-17T19:11:35Z.json"
target="_blank" rel="noopener">baseline&lt;/a>
&lt;code>ci-kubernetes-e2e-gce-scale-performance-5000&lt;/code> run without it.
The WatchList P99 latency for pods dropped from 60s (capped at the histogram bucket
ceiling, actual latency was much worse) to ~30s:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;a href="https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-gce-scale-performance-5000/2056057601027739648/artifacts/APIResponsivenessPrometheus_simple_load_2026-05-17T19:11:35Z.json"
target="_blank" rel="noopener">Baseline&lt;/a>
&lt;/th>
&lt;th>&lt;a href="https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/138327/pull-kubernetes-gce-master-scale-performance-5000/2055240498586587136/artifacts/APIResponsivenessPrometheus_simple_load_2026-05-15T13:31:38Z.json"
target="_blank" rel="noopener">With compression&lt;/a>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Count&lt;/td>
&lt;td>486&lt;/td>
&lt;td>547&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>P50&lt;/td>
&lt;td>1,950ms&lt;/td>
&lt;td>969ms&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>P90&lt;/td>
&lt;td>15,208ms&lt;/td>
&lt;td>17,150ms&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>P99&lt;/td>
&lt;td>&lt;strong>60,000ms&lt;/strong> (capped)&lt;/td>
&lt;td>&lt;strong>29,673ms&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>For more details see &lt;a href="https://github.com/kubernetes/kubernetes/issues/138670"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/138670&lt;/a>
.&lt;/p>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>No user issues reported&lt;/li>
&lt;/ul>
&lt;h4 id="post-ga">Post-GA&lt;/h4>
&lt;ul>
&lt;li>Make &lt;strong>list&lt;/strong> calls expensive in APF.
Once all supported releases have the streaming list enabled by default (client-go, control plane components)
and the feature itself is locked to its default value, we can increase the cost of regular list requests in APF.
This ensures that the fallback mechanism, which switches back to the standard list when streaming has issues, will not be affected.&lt;/li>
&lt;/ul>
&lt;!--
**Note:** *Not required until targeted at a release.*
Define graduation milestones.
These may be defined in terms of API maturity, or as something else. The KEP
should keep this high-level with a focus on what signals will be looked at to
determine graduation.
Consider the following in developing the graduation criteria for this enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Deprecation policy][deprecation-policy]
Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
or by redefining what graduation means.
In general we try to use the same stages (alpha, beta, GA), regardless of how the
functionality is accessed.
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
#### Alpha
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
#### Beta
- Gather feedback from developers and surveys
- Complete features A, B, C
- Additional tests are in Testgrid and linked in KEP
#### GA
- N examples of real-world usage
- N installs
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
- Allowing time for feedback
**Note:** Generally we also wait at least two releases between beta and
GA/stable, because there's no opportunity for user feedback, or even bug reports,
in back-to-back releases.
**For non-optional features moving to GA, the graduation criteria must include
[conformance tests].**
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
#### Deprecation
- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag
-->
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.
Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and nodes?
- How does an n-3 kubelet or kube-proxy without this feature available behave when this feature is used?
- How does an n-1 kube-controller-manager or kube-scheduler without this feature available behave when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
CRI or CNI may require updating that component before the kubelet.
-->
&lt;p>Our immediate idea to ensure backward compatibility between new clients and old servers would be to return a 401 response in old Kubernetes releases (via backports).
This approach, however, would limit the maximum supported version skew to just a few previous releases, and would also force customers to update to the latest minor versions.&lt;br>&lt;br>
Therefore we propose to make use of the already existing &amp;ldquo;resourceVersionMatch&amp;rdquo; LIST option.
WATCH requests with that option set will be immediately rejected with a 403 (Forbidden) response by previous servers.
In that case, new clients will fall back to the previous mode (ListAndWatch).
New servers will allow for WATCH requests to have &amp;ldquo;resourceVersionMatch=MostRecent&amp;rdquo; set.&lt;br>&lt;br>
Existing clients will be forward and backward compatible and won&amp;rsquo;t require any changes since the server will preserve the old behavior (ListAndWatch).&lt;/p>
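&lt;p>The client-side fallback described above can be illustrated with a minimal Go sketch.
Everything in it is an assumption made for illustration: the &lt;code>syncer&lt;/code> type, its two function hooks and the &lt;code>errRejected&lt;/code> error are hypothetical and are not client-go APIs.
It only shows the control flow of trying the streaming request first and falling back to ListAndWatch when the server rejects it.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// errRejected is a hypothetical stand-in for the error a client sees when an
// older server rejects the new WATCH request.
var errRejected = errors.New("watch request rejected by server")

// syncer bundles two ways of obtaining the initial state. Both fields are
// illustrative function hooks, not client-go APIs.
type syncer struct {
	watchList    func() ([]string, error) // initial state streamed via WATCH
	listAndWatch func() ([]string, error) // classic LIST followed by WATCH
}

// initialSync tries the streaming path first and falls back to the classic
// path on any error, reporting which mode produced the result.
func (s syncer) initialSync() (items []string, mode string, err error) {
	items, err = s.watchList()
	if err == nil {
		return items, "watchlist", nil
	}
	items, err = s.listAndWatch()
	return items, "list-and-watch", err
}

func main() {
	classic := func() ([]string, error) { return []string{"pod-a", "pod-b"}, nil }

	// A new client talking to an old server: the streaming request is rejected.
	oldServer := syncer{
		watchList:    func() ([]string, error) { return nil, errRejected },
		listAndWatch: classic,
	}
	items, mode, err := oldServer.initialSync()
	fmt.Println(mode, items, err)

	// A new client talking to a new server: the streaming request succeeds.
	newServer := syncer{
		watchList:    func() ([]string, error) { return []string{"pod-a", "pod-b"}, nil },
		listAndWatch: classic,
	}
	items, mode, err = newServer.initialSync()
	fmt.Println(mode, items, err)
}
```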
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!--
This section must be completed when targeting alpha to a release.
-->
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;!--
Pick one of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: WatchList&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Feature gate name: WatchListClient&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-controller-manager via client-go library&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No, because users must enable the feature on the client side (client-go).&lt;/p>
&lt;!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes, by disabling &lt;code>WatchList&lt;/code> FeatureGate for &lt;code>kube-apiserver&lt;/code>.
In this case &lt;code>kube-apiserver&lt;/code> will reject WATCH requests with the new query parameter forcing informers to fall back to the previous mode.&lt;/p>
&lt;p>Yes, by disabling &lt;code>WatchListClient&lt;/code> FeatureGate for &lt;code>kube-controller-manager&lt;/code>.
In this case informers will follow standard LIST/WATCH semantics.&lt;/p>
&lt;p>Note that for safety reasons, reflectors/informers will always fall back to a regular LIST operation regardless of the error that occurred.&lt;/p>
&lt;!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The expected behavior of the feature will be restored.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes. There is &lt;a href="https://github.com/kubernetes/kubernetes/pull/120971"
target="_blank" rel="noopener">an integration test&lt;/a>
that verifies the fallback mechanism
of the reflector when interacting with servers that have the &lt;code>WatchList&lt;/code> feature enabled/disabled.&lt;/p>
&lt;!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
-->
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The feature does not have a direct impact on rollout/rollback.&lt;/p>
&lt;p>However, faulty behavior of a feature can result in incorrect functioning
of components that rely on that feature. For the Beta version, we plan to enable it exclusively for kube-controller-manager.
The main issues can arise during the initial informer synchronization, which may result in controller failures.&lt;/p>
&lt;p>Furthermore, if data consistency issues arise, such as missing data, the controllers will simply not process the missing objects.&lt;/p>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>&lt;code>apiserver_terminated_watchers_total&lt;/code> - a large number of terminated watchers might indicate synchronization issues.
For example, a client-side error may prevent the client from reading data from the server, or a server-side error may cause the buffer to fill up.&lt;/p>
&lt;p>&lt;code>apiserver_request_duration_seconds_bucket&lt;/code> - in general, a large number of &amp;ldquo;short&amp;rdquo; watch requests can indicate synchronization issues.&lt;/p>
&lt;p>&lt;code>apiserver_watch_list_duration_seconds&lt;/code> - the absence of this metric may indicate that the client did not receive a special bookmark.
The issue here could be that the server never sent it due to an error or didn&amp;rsquo;t even receive it from the database.&lt;/p>
&lt;p>&lt;code>apiserver_watch_list_duration_seconds&lt;/code> - long synchronization times may indicate that the server is lagging behind etcd.
For example, the server may not be receiving progress notifications from the database frequently enough.&lt;/p>
&lt;p>&lt;code>apiserver_watch_cache_lag&lt;/code> - tells how far behind the server is compared to the database.
Significant discrepancies affect the times for full data synchronization.&lt;/p>
&lt;p>The number of kube-controller-manager restarts can also be a good signal,
as it may indicate issues with informer synchronization.&lt;/p>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Upgrade-&amp;gt;downgrade-&amp;gt;upgrade testing was done manually using the following steps:&lt;/p>
&lt;p>Build and run Kubernetes from the master branch using Kind.&lt;/p>
&lt;pre tabindex="0">&lt;code>kind build node-image --arch &amp;#34;arm64&amp;#34;
kind create cluster --image kindest/node:latest
kubectl get no
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 26s v1.29.0-alpha.1.47+f8571dabf79717
&lt;/code>&lt;/pre>&lt;p>Check if the &lt;code>kube-apiserver&lt;/code>(aka &lt;code>kas&lt;/code>) has recorded the watchlist latency metric.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get --raw &amp;#39;/metrics&amp;#39; | grep &amp;#34;apiserver_watch_list_duration_seconds&amp;#34;
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group=&amp;#34;&amp;#34;,resource=&amp;#34;configmaps&amp;#34;,scope=&amp;#34;cluster&amp;#34;,version=&amp;#34;v1&amp;#34;,le=&amp;#34;6&amp;#34;} 1
&lt;/code>&lt;/pre>&lt;p>Disable the &lt;code>WatchList&lt;/code> feature gate for the &lt;code>kas&lt;/code> by editing the static pod manifest directly.&lt;/p>
&lt;pre tabindex="0">&lt;code>docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml
&lt;/code>&lt;/pre>&lt;p>and pass &lt;code>- --feature-gates=WatchList=false&lt;/code> to the &lt;code>kas&lt;/code> container.&lt;/p>
&lt;p>Check if the &lt;code>kas&lt;/code> has not recorded the watchlist latency metric.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get --raw &amp;#39;/metrics&amp;#39; | grep &amp;#34;apiserver_watch_list_duration_seconds&amp;#34;
&lt;/code>&lt;/pre>&lt;p>Check if &lt;code>kube-controller-manager&lt;/code> (aka &lt;code>kcm&lt;/code>) is running.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 1 (44s ago) 3m28s
&lt;/code>&lt;/pre>&lt;p>Check if informers used by the &lt;code>kcm&lt;/code> fell back to standard LIST/WATCH semantics.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e &amp;#34;watch-list&amp;#34;
W1002 09:11:40.656641 1 reflector.go:340] The watch-list feature is not supported by the server, falling back to the previous LIST/WATCH semantics
…
&lt;/code>&lt;/pre>&lt;p>Disable the &lt;code>WatchListClient&lt;/code> feature gate for the &lt;code>kcm&lt;/code> by editing the static pod manifest directly.&lt;/p>
&lt;pre tabindex="0">&lt;code>docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
&lt;/code>&lt;/pre>&lt;p>and pass &lt;code>- --feature-gates=WatchListClient=false&lt;/code> to the &lt;code>kcm&lt;/code> container.&lt;/p>
&lt;p>Check if &lt;code>kcm&lt;/code> is running.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 0 12s
&lt;/code>&lt;/pre>&lt;p>Check if the &lt;code>kas&lt;/code> has not recorded the watchlist latency metric.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get --raw &amp;#39;/metrics&amp;#39; | grep &amp;#34;apiserver_watch_list_duration_seconds&amp;#34;
&lt;/code>&lt;/pre>&lt;p>Check if there are no traces of informers for &lt;code>kcm&lt;/code> falling back to standard LIST/WATCH semantics.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e &amp;#34;watch-list&amp;#34;
&lt;/code>&lt;/pre>&lt;p>Enable the &lt;code>WatchList&lt;/code> feature gate for the &lt;code>kas&lt;/code> by editing the static pod manifest directly.&lt;/p>
&lt;pre tabindex="0">&lt;code>docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml
&lt;/code>&lt;/pre>&lt;p>and remove &lt;code>- --feature-gates=WatchList=false&lt;/code> from the &lt;code>kas&lt;/code> container.&lt;/p>
&lt;p>Check if &lt;code>kcm&lt;/code> is running.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 1 (22s ago) 86s
&lt;/code>&lt;/pre>&lt;p>Check if the &lt;code>kas&lt;/code> has not recorded the watchlist latency metric.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get --raw &amp;#39;/metrics&amp;#39; | grep &amp;#34;apiserver_watch_list_duration_seconds&amp;#34;
&lt;/code>&lt;/pre>&lt;p>Enable the &lt;code>WatchListClient&lt;/code> feature gate for the &lt;code>kcm&lt;/code> by editing the static pod manifest directly.&lt;/p>
&lt;pre tabindex="0">&lt;code>docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
&lt;/code>&lt;/pre>&lt;p>and remove &lt;code>- --feature-gates=WatchListClient=false&lt;/code> from the &lt;code>kcm&lt;/code> container.&lt;/p>
&lt;p>Check if &lt;code>kcm&lt;/code> is running.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 0 13s
&lt;/code>&lt;/pre>&lt;p>Check if the &lt;code>kas&lt;/code> has recorded the watchlist latency metric.&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get --raw &amp;#39;/metrics&amp;#39; | grep &amp;#34;apiserver_watch_list_duration_seconds&amp;#34;
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group=&amp;#34;&amp;#34;,resource=&amp;#34;configmaps&amp;#34;,scope=&amp;#34;cluster&amp;#34;,version=&amp;#34;v1&amp;#34;,le=&amp;#34;6&amp;#34;} 1
&lt;/code>&lt;/pre>&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
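&lt;p>The log check used in the steps above can be wrapped in a small helper. This is only a sketch: the function name &lt;code>count_watchlist_fallbacks&lt;/code> is hypothetical, and it matches the warning text exactly as it appears in the log output shown above.&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical helper: count reflector fallback warnings in logs read from stdin.
# Prints the number of matching lines (0 when the informers did not fall back).
count_watchlist_fallbacks() {
  grep -c 'falling back to the previous LIST/WATCH semantics'
}

# Possible usage:
#   kubectl logs -n kube-system kube-controller-manager-kind-control-plane | count_watchlist_fallbacks
&lt;/code>&lt;/pre>
&lt;p>Note that &lt;code>grep -c&lt;/code> exits with a non-zero status when nothing matches, so a caller running under &lt;code>set -e&lt;/code> would need to account for that.&lt;/p>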
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>If the &lt;code>apiserver_watch_list_duration_seconds&lt;/code> metric has recorded any data, then this feature is in use.&lt;/p>
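&lt;p>One way to check this, sketched under the assumption that the histogram exposes the usual Prometheus &lt;code>_count&lt;/code> series, is to sum those samples. The helper name &lt;code>watchlist_request_count&lt;/code> is hypothetical.&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical helper: sum the _count samples of the watch-list latency
# histogram from Prometheus text-format metrics read on stdin.
# Prints 0 when the metric is absent, i.e. the feature is not in use.
watchlist_request_count() {
  awk '/^apiserver_watch_list_duration_seconds_count/ { total += $NF } END { print total + 0 }'
}

# Possible usage:
#   kubectl get --raw '/metrics' | watchlist_request_count
&lt;/code>&lt;/pre>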
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>Assuming that historical data is available, comparing the number of LIST and WATCH requests to the server will tell whether the feature was enabled.
When this feature is enabled, the number of LIST requests will be smaller.
The difference primarily arises from switching informers to a new mode of operation.&lt;/p>
&lt;p>Check whether the &lt;code>WatchListClient&lt;/code> feature gate has been set for the given component.&lt;/p>
&lt;p>Knowing the &lt;code>username&lt;/code> for a component, examine the audit logs to see whether &lt;code>sendInitialEvents=true&lt;/code> has been set in the &lt;code>requestURI&lt;/code> for that user.&lt;/p>
&lt;p>Scan the component&amp;rsquo;s logs for the phrase &lt;code>Reflector WatchList&lt;/code>. Traces are reported for requests lasting more than 10 seconds.&lt;/p>
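&lt;p>The audit log check could look roughly like the sketch below. It assumes the audit log is written as one compact JSON event per line, with the user recorded as &lt;code>"username":"..."&lt;/code>; the helper name &lt;code>watchlist_audit_hits&lt;/code> and the example username are hypothetical.&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical helper: count audit events on stdin that belong to the given
# user and carry sendInitialEvents=true in the request URI.
# $1: username of the component (assumed example: system:kube-controller-manager)
watchlist_audit_hits() {
  grep -F 'sendInitialEvents=true' | grep -F -c "\"username\":\"$1\""
}

# Possible usage (the log path depends on how auditing is configured):
#   watchlist_audit_hits system:kube-controller-manager
&lt;/code>&lt;/pre>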
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>None have been defined yet.&lt;/p>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)&lt;/li>
&lt;li>Metric name: apiserver_watch_list_duration_seconds (histogram, measures latency of watch-list requests)&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No. On the contrary, the number of requests originating from informers will be halved, from 2 (LIST and WATCH) to just 1 (WATCH).&lt;/p>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>On the contrary. It will decrease the memory usage of kube-apiservers needed to handle &amp;ldquo;list&amp;rdquo; requests.&lt;/p>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
&lt;p>On the contrary. It will decrease the memory usage required for master nodes.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>When the kube-apiserver is unavailable, this feature is also unavailable.&lt;/p>
&lt;p>When etcd is unavailable, requests attempting to retrieve the most recent state of the cluster will fail.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;ul>
&lt;li>kube-controller-manager is unable to start.
&lt;ul>
&lt;li>Detection: Examine the Prometheus &lt;code>up&lt;/code> time series, the pod status, or the number of restarts.&lt;/li>
&lt;li>Mitigations: Disable the feature by passing &lt;code>WatchListClient=false&lt;/code> to the &lt;code>feature-gates&lt;/code> command line flag.&lt;/li>
&lt;li>Diagnostics: N/A&lt;/li>
&lt;li>Testing: Yes. If kube-controller-manager is unable to start, a lot of existing e2e tests will fail.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>No SLOs have been defined for this feature yet.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;p>The KEP was proposed on 2022-01-14.&lt;/p>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>N/A&lt;/p>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ol>
&lt;li>We could tune the cost function used by the priority and fairness feature.
There are at least a few issues with this approach.
The first is that we would have to come up with a cost estimation function that can approximate the temporary memory consumption.
This might be challenging since we don&amp;rsquo;t know the true cost of the entire list upfront as object sizes can vastly differ throughout the keyspace (imagine some namespaces with giant secrets, some with small secrets).
The second issue is that, even if we could estimate it, we would have to throttle the server to handle just a few requests at a time, as the estimate would likely be uniform over resource type or other coarse dimensions.&lt;/li>
&lt;li>We could attempt to define a function that would prevent the server from allocating more memory than a given threshold.
A function like that would require measuring memory usage in real-time. Things we evaluated:
&lt;ul>
&lt;li>runtime.ReadMemStats gives us accurate measurement but at the same time is very expensive. It requires STW (stop-the-world) which is equivalent to stopping all running goroutines. Running with 100ms frequency would block the runtime 10 times per second.&lt;/li>
&lt;li>reading from proc would probably increase the CPU usage (polling) and would add some delay (propagation time from the kernel about current memory usage). Since the spike might be very sudden (milliseconds) it doesn’t seem to be a viable option.&lt;/li>
&lt;li>there seems to be no other API provided by the Go runtime that would allow for gathering memory stats in real-time other than runtime.ReadMemStats.&lt;/li>
&lt;li>using cgroup notification API is efficient (epoll) and near real-time but it seems to be limited in functionality. We could be notified about crossing previously defined memory thresholds but we would still need to calculate available(free) memory on a node.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>We could allow for paginated LIST requests to be served directly from the watch cache.
This approach has a few advantages.
Primarily, it doesn&amp;rsquo;t require changing informers, so there are no version skew issues.
At the same time, it also presents a few challenges.
The most concerning is that it would actually not solve the issue.
It seems it would also lead to (temporary) memory consumption because we would need to allocate space for the entire response (LIST), keep it in memory until the whole response has been sent to the client (which can be up to 60s) and this could be O({2,3}*the-size-of-the-page).&lt;/li>
&lt;/ol>
&lt;!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
&lt;h2 id="appendix">Appendix&lt;/h2>
&lt;h3 id="sources-of-list-request">Sources of LIST request&lt;/h3>
&lt;p>A LIST request can be satisfied from two places, largely depending on used query parameters:&lt;/p>
&lt;ol>
&lt;li>by default directly from etcd. In such cases, the memory demand might be extensive, exceeding the full response size from the data store many times.&lt;/li>
&lt;li>from the watch cache if explicitly requested by setting ResourceVersion param of the list (e.g. ResourceVersion=&amp;ldquo;0&amp;rdquo;).
This is how most client-go-based controllers prime their caches, for performance reasons.
The memory usage will be much lower than in the first case.
However, it is not perfect as we still need space to store serialized objects and to hold the full response until it is sent.&lt;/li>
&lt;/ol>
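&lt;p>The two sources differ only in the query string of the request. A minimal sketch follows; the helper name &lt;code>list_path&lt;/code> and the &lt;code>cache&lt;/code> keyword are hypothetical, while &lt;code>resourceVersion&lt;/code> is the query parameter referred to above.&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical helper: print the request path for a LIST of a resource.
# $1: resource path, for example /api/v1/configmaps
# $2: 'cache' to ask for the watch cache (resourceVersion=0); any other value
#     leaves resourceVersion unset, which by default is served from etcd.
list_path() {
  if [ "$2" = cache ]; then
    printf '%s?resourceVersion=0\n' "$1"
  else
    printf '%s\n' "$1"
  fi
}

# Possible usage:
#   kubectl get --raw "$(list_path /api/v1/configmaps cache)"
&lt;/code>&lt;/pre>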
&lt;h3 id="steps-followed-by-informers">Steps followed by informers&lt;/h3>
&lt;p>The following steps depict a flow of how client-go-based informers work today.&lt;/p>
&lt;ol>
&lt;li>&lt;strong>on startup&lt;/strong>: informers issue a LIST RV=&amp;ldquo;0&amp;rdquo; request with pagination, which due to performance reasons translates to a full (pagination is ignored) LIST from the watch cache.&lt;/li>
&lt;li>&lt;strong>repeated until ResourceExpired 410&lt;/strong>: establish a WATCH request with an RV from the previous step. Each received event updates the last-known RV. On disconnect, this step repeats until an “IsResourceExpired” (410) error is returned.&lt;/li>
&lt;li>&lt;strong>on resumption&lt;/strong>: establish a new LIST request to the watch cache with RV=&amp;ldquo;last-known-from-step2&amp;rdquo; (step1) and then another WATCH request.&lt;/li>
&lt;li>&lt;strong>after compaction (410)&lt;/strong>: we set RV=&amp;quot;&amp;quot;, get a snapshot via quorum read from etcd in chunks, and go back to step2.&lt;/li>
&lt;/ol>
&lt;p>In rare cases, an informer might connect to an API server whose watch cache hasn&amp;rsquo;t been fully synchronized (after kube-apiserver restart). In that case its flow will be slightly different.&lt;/p>
&lt;ol>
&lt;li>&lt;strong>on startup&lt;/strong>: informers issue a LIST RV=&amp;ldquo;0&amp;rdquo; request with pagination, which effectively equals a paginated LIST RV=&amp;quot;&amp;quot;, i.e. it gets a consistent snapshot of data directly from etcd (quorum read) in chunks (pagination).&lt;/li>
&lt;li>&lt;strong>repeated until ResourceExpired 410&lt;/strong>: establish a WATCH request with an RV from the previous step. Each received event updates the last-known RV. On disconnect, this step repeats until an “IsResourceExpired” (410) error is returned.&lt;/li>
&lt;li>&lt;strong>on resumption&lt;/strong>: establish a paginated LIST RV=&amp;ldquo;last-known-from-step2&amp;rdquo; request (step1) and then another WATCH request.&lt;/li>
&lt;li>&lt;strong>after compaction (410)&lt;/strong>: we set RV=&amp;quot;&amp;quot;, get a snapshot via quorum read from etcd in chunks, and go back to step2.&lt;/li>
&lt;/ol>
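&lt;p>The first flow above (watch cache fully synchronized) can be summarized as a mapping from what just happened to the next request an informer issues. This is a simplified illustration, not client-go code; the function name &lt;code>informer_next_request&lt;/code> and the event keywords are hypothetical.&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical sketch: print the next request an informer issues.
# $1: startup | disconnect | resumption | 410
informer_next_request() {
  case "$1" in
    startup)    echo 'LIST RV="0" (watch cache)' ;;
    disconnect) echo 'WATCH RV=last-known' ;;
    resumption) echo 'LIST RV=last-known (watch cache)' ;;
    410)        echo 'LIST RV="" (quorum read from etcd, in chunks)' ;;
    *)          echo 'unknown event'; return 1 ;;
  esac
}

# Every LIST is followed by a WATCH that starts from the RV the LIST returned.
&lt;/code>&lt;/pre>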
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>N/A&lt;/p>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Allow Replacement of Pods in a Job when fully terminating</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3939/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3939/</guid><description>
&lt;h1 id="kep-3939-allow-replacement-of-pods-in-a-job-when-fully-terminated">KEP-3939: Allow replacement of Pods in a Job when fully terminated&lt;/h1>
&lt;!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->
&lt;!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.
Ensure the TOC is wrapped with
&lt;code>&amp;lt;!-- toc --&amp;gt;&amp;lt;!-- /toc --&amp;gt;&lt;/code>
tags, and then generate with `hack/update-toc.sh`.
-->
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3"
>Story 3&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#the-default-job-controller-behavior"
>The default job controller behavior&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#when-pods-enter-a-terminating-state"
>When Pods enter a terminating state&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#exponential-backoff-for-pod-failures"
>Exponential Backoff for Pod Failures&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#pods-are-not-guaranteed-to-transition-to-a-terminal-phase"
>Pods are not guaranteed to transition to a terminal phase&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#job-api-definition"
>Job API Definition&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#defaulting-and-validation"
>Defaulting and validation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#tracking-the-terminating-pods"
>Tracking the terminating pods&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation"
>Implementation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#deprecation"
>Deprecation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade"
>Upgrade&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade"
>Downgrade&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-can-this-feature-be-enabled--disabled-in-a-live-cluster"
>How can this feature be enabled / disabled in a live cluster?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads"
>How can a rollout or rollback fail? Can it impact already running workloads?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-specific-metrics-should-inform-a-rollback"
>What specific metrics should inform a rollback?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested"
>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc"
>Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads"
>How can an operator determine if the feature is in use by workloads?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance"
>How can someone using this feature know that it is working for their instance?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement"
>What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service"
>What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature"
>Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#does-this-feature-depend-on-any-specific-services-running-in-the-cluster"
>Does this feature depend on any specific services running in the cluster?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-any-new-api-calls"
>Will enabling / using this feature result in any new API calls?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-introducing-new-api-types"
>Will enabling / using this feature result in introducing new API types?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider"
>Will enabling / using this feature result in any new calls to the cloud provider?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects"
>Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos"
>Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components"
>Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc"
>Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable"
>How does this feature react if the API server and/or etcd is unavailable?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-other-known-failure-modes"
>What are other known failure modes?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem"
>What steps should be taken if SLOs are not being met to determine the problem?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Currently, Jobs start replacement Pods as soon as previously created Pods are terminating (have a &lt;code>deletionTimestamp&lt;/code>) or fail (&lt;code>phase=Failed&lt;/code>).
Terminating pods are currently counted as failed in the Job status.
However, terminating pods are actually in a transitory state where they are neither active nor really fully terminated.&lt;br>
This KEP proposes a new field for the Job API that lets users specify whether replacement Pods are created as soon as
the previous Pods are terminating (existing behavior) or only once the previous Pods are fully terminated (new behavior).&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing Issues:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/issues/115844"
target="_blank" rel="noopener">Job Creates Replacement Pods as soon as Pod is marked for deletion&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes-sigs/kueue/issues/510"
target="_blank" rel="noopener">Kueue: Account for terminating pods when doing preemption&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>Many common machine learning frameworks, such as TensorFlow and JAX, require unique pods per index.
Currently, if a pod enters a terminating state (due to preemption, eviction or other external factors),
a replacement pod is created and immediately fails to start.&lt;/p>
&lt;p>Having a replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or with tight budgets.
Such resources can be difficult to obtain, so replacement pods can take a long time to schedule and may only find nodes once the existing pods have terminated.
If cluster autoscaler is enabled, the replacement Pods might produce undesired
scale ups.&lt;/p>
&lt;p>On the other hand, if a replacement Pod is not immediately created, the Job
status would show that the number of active pods doesn&amp;rsquo;t match the desired
parallelism. To provide better visibility, the job status can have a new field
to track the number of Pods currently terminating.&lt;/p>
&lt;p>This new field can also be used by queueing controllers, such as Kueue,
to track the number of terminating pods to calculate quotas.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Job controller should allow for flexibility in waiting for pods to be fully terminated before
creating replacement Pods&lt;/li>
&lt;li>Job controller will have a new status field where we include the number of terminating pods.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Other workload APIs are not included in this proposal.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>The Job controller gets a list of active pods. Active pods are pods that don&amp;rsquo;t
have a terminal phase (&lt;code>Succeeded&lt;/code> or &lt;code>Failed&lt;/code>) and are not terminating
(terminating pods have a &lt;code>deletionTimestamp&lt;/code>).
In this KEP, we will consider terminating pods to be separate from active and failed.&lt;br>
As an opt-in behavior, the job controller can use the active and terminating
pods to determine whether replacement Pods are needed.&lt;/p>
&lt;p>We propose two new API fields:&lt;/p>
&lt;ol>
&lt;li>A field in Spec that allows for opt-in behavior of whether to wait for
terminating pods to finish before creating replacement pods.&lt;/li>
&lt;li>A new field in Status for tracking the number of terminating pods.&lt;/li>
&lt;/ol>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a machine learning user, I run ML frameworks that schedule multiple pods.&lt;br>
The Job controller does not typically wait for terminating pods to be marked as failed before creating replacements.&lt;br>
TensorFlow and other ML frameworks may require that replacement Pods start only once the previous pods are fully terminated.&lt;/p>
&lt;p>This case was added due to a bug discovered when running Indexed Jobs with TensorFlow.&lt;br>
See &lt;a href="https://github.com/kubernetes/kubernetes/issues/115844"
target="_blank" rel="noopener">Jobs create replacement Pods as soon as a Pod is marked for deletion&lt;/a>
for more details.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>As a cloud user, I want to guarantee that the number of running pods is exactly the number I specify.&lt;br>
Terminating pods do not relinquish their resources until they are fully terminated, so scarce compute resources remain allocated to those pods.
Replacement pods should not trigger unnecessary scale-ups.&lt;/p>
&lt;h4 id="story-3">Story 3&lt;/h4>
&lt;p>As a Job-level quota controller, I want to track the number of terminating pods,
in addition to the active pods.&lt;/p>
&lt;p>See &lt;a href="https://github.com/kubernetes-sigs/kueue/issues/510"
target="_blank" rel="noopener">Kueue: Account for terminating pods when doing preemption&lt;/a>
for an example of this.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h4 id="the-default-job-controller-behavior">The default job controller behavior&lt;/h4>
&lt;p>Based on the &lt;a href="#job-api-definition"
>proposed API&lt;/a>
below, the behavior of the
job controller prior to this KEP is equivalent to
&lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code>.&lt;/p>
&lt;p>This behavior has the following semantic problems:&lt;/p>
&lt;ul>
&lt;li>A terminating Pod might gracefully terminate as Succeeded, but it counts
towards &lt;code>.status.failed&lt;/code> as soon as it&amp;rsquo;s terminating and it&amp;rsquo;s not reclassified
upon termination.&lt;/li>
&lt;li>When using podFailurePolicy, the controller might create a replacement Pod
before being able to evaluate the terminal state of the Pod. The replacement
Pod might be terminated due to the policy.&lt;/li>
&lt;/ul>
&lt;p>In a Job v2 API, we should consider having the default behavior equivalent to
&lt;code>podReplacementPolicy: Failed&lt;/code>, given the above problems.
We could even consider removing the proposed field &lt;code>podReplacementPolicy&lt;/code>.&lt;/p>
&lt;p>But for backwards compatibility, in v1, we have to introduce a change of
behavior as opt-in.&lt;/p>
&lt;h4 id="when-pods-enter-a-terminating-state">When Pods enter a terminating state&lt;/h4>
&lt;p>Pods can be marked for termination by several controllers, which we typically
refer to as disruptions, such as: kubelet eviction, scheduler preemption, API eviction, etc.&lt;/p>
&lt;p>The job controller itself can delete running Pods, in the following scenarios:&lt;/p>
&lt;ol>
&lt;li>The job exceeds its &lt;code>activeDeadlineSeconds&lt;/code>.&lt;/li>
&lt;li>The number of Pod failures reaches the &lt;code>backoffLimit&lt;/code>.&lt;/li>
&lt;li>A &lt;code>PodFailurePolicy&lt;/code> is in use and a rule with the &lt;code>FailJob&lt;/code> action matches.&lt;/li>
&lt;/ol>
&lt;p>In all these situations, the Pod initially gets a &lt;code>deletionTimestamp&lt;/code>
and we interpret the pod as &amp;ldquo;terminating&amp;rdquo;. Once the pod terminates, it gets
a terminal &lt;code>phase&lt;/code> (&lt;code>Succeeded&lt;/code> or &lt;code>Failed&lt;/code>).&lt;/p>
&lt;h3 id="exponential-backoff-for-pod-failures">Exponential Backoff for Pod Failures&lt;/h3>
&lt;p>The job controller implements backoff delays to prevent fast recreation of
continuously failing Pods.&lt;/p>
&lt;p>This behavior is internal (not configurable through the API) and it&amp;rsquo;s orthogonal
to this KEP. The behavior will be preserved as follows:&lt;/p>
&lt;ul>
&lt;li>When &lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code>, the backoff period counts from
the time the Pod is terminating or Failed.&lt;/li>
&lt;li>When &lt;code>podReplacementPolicy: Failed&lt;/code>, the backoff period counts from the time the
Pod is Failed.&lt;/li>
&lt;/ul>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="pods-are-not-guaranteed-to-transition-to-a-terminal-phase">Pods are not guaranteed to transition to a terminal phase&lt;/h4>
&lt;p>One area of contention is how this KEP will work with &lt;a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md"
target="_blank" rel="noopener">3329-retriable-and-non-retriable-failures&lt;/a>
.&lt;/p>
&lt;p>In 3329, there was a decision to make kubelet transition pods to failed before deleting them.
This is guarded by the &lt;code>PodDisruptionConditions&lt;/code> feature gate, which, in addition to
setting the phase to Failed, adds a &lt;code>DisruptionTarget&lt;/code> condition.
This means that when this feature is turned on, the job controller is able to count pods as failed only when they are fully terminated, as it is guaranteed that all pods will reach a terminal state (Failed or Succeeded).
Note that a terminating pod is not considered active either.
If &lt;code>PodDisruptionConditions&lt;/code> is turned off, then the job controller considers the pod as failed as soon as it is terminating (has a deletion timestamp), because there is no guarantee that the pod will transition to phase=Failed.&lt;/p>
&lt;p>Another issue is described &lt;a href="https://github.com/kubernetes/enhancements/pull/3940#discussion_r1180777509"
target="_blank" rel="noopener">here&lt;/a>
.
If PodDisruptionConditions is disabled, a pod bound to a no-longer-existing node may be stuck in the Running phase.
As a consequence, it will never be replaced, so the whole job will be stuck from making progress.
When PodDisruptionConditions is enabled, the PodGC transitions the Pod to phase Failed in this scenario.&lt;/p>
&lt;p>Due to the above issues, we propose the following mitigation:&lt;/p>
&lt;ul>
&lt;li>If &lt;code>PodDisruptionConditions&lt;/code> OR &lt;code>JobPodReplacementPolicy&lt;/code> is enabled, set
phase=Failed in kubelet and podGC before deleting a Pod.&lt;/li>
&lt;li>If &lt;code>JobPodReplacementPolicy&lt;/code> is enabled, but &lt;code>PodDisruptionConditions&lt;/code> is
disabled, the kubelet and podGC only set the phase, but do not add a
&lt;code>DisruptionTarget&lt;/code> condition.&lt;/li>
&lt;/ul>
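&lt;p>The mitigation above reduces to two boolean decisions. The following Go sketch is illustrative only: the &lt;code>gates&lt;/code> struct and both helper functions are hypothetical names, not code from kubelet or PodGC.&lt;/p>

```go
package main

import "fmt"

// gates is a hypothetical holder for the two feature gates discussed above.
type gates struct {
	PodDisruptionConditions bool
	JobPodReplacementPolicy bool
}

// setFailedPhaseBeforeDelete reports whether kubelet and PodGC should set
// phase=Failed before deleting a Pod: either gate being on is enough.
func setFailedPhaseBeforeDelete(g gates) bool {
	return g.PodDisruptionConditions || g.JobPodReplacementPolicy
}

// addDisruptionTarget reports whether the DisruptionTarget condition is added
// together with the phase change: only PodDisruptionConditions controls it.
func addDisruptionTarget(g gates) bool {
	return g.PodDisruptionConditions
}

func main() {
	combos := []gates{
		{PodDisruptionConditions: false, JobPodReplacementPolicy: false},
		{PodDisruptionConditions: false, JobPodReplacementPolicy: true},
		{PodDisruptionConditions: true, JobPodReplacementPolicy: false},
		{PodDisruptionConditions: true, JobPodReplacementPolicy: true},
	}
	for _, g := range combos {
		fmt.Printf("PodDisruptionConditions=%t JobPodReplacementPolicy=%t setFailed=%t addCondition=%t\n",
			g.PodDisruptionConditions, g.JobPodReplacementPolicy,
			setFailedPhaseBeforeDelete(g), addDisruptionTarget(g))
	}
}
```

&lt;p>With only &lt;code>JobPodReplacementPolicy&lt;/code> enabled, the sketch sets the phase but does not add the condition, which matches the second bullet of the mitigation.&lt;/p>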
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="job-api-definition">Job API Definition&lt;/h3>
&lt;p>At the JobSpec level, we are adding a new enum field:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// This field controls when we recreate pods&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Default will be TerminatingOrFailed, i.e. recreate pods when they are terminating or failed&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +enum &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> PodReplacementPolicy &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// TerminatingOrFailed is a policy that creates replacement pods when they are&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// marked as terminating (have a deletion timestamp) or reach the terminal&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// phase `Failed`.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Terminating pods count towards `.status.failed`, even if they later reach&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the terminal phase `Succeeded`.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> TerminatingOrFailed PodReplacementPolicy = &lt;span style="color:#b44">&amp;#34;TerminatingOrFailed&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Failed is a policy that creates replacement Pods only when the previously&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// created Pods reach the terminal phase `Failed`.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Failed PodReplacementPolicy = &lt;span style="color:#b44">&amp;#34;Failed&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> JobSpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// podReplacementPolicy specifies when to create replacement Pods. Possible values are:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// - TerminatingOrFailed means to create a replacement Pod when the previously&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// created Pod is terminating or failed.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// - Failed means to wait until a previously created Pod is fully terminated&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// before creating a replacement Pod.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">//&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// When using podFailurePolicy, the default value is Failed and this is the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// only allowed policy.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// When not using podFailurePolicy, the default value is TerminatingOrFailed.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PodReplacementPolicy &lt;span style="color:#666">*&lt;/span>PodReplacementPolicy
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In order to offer visibility of the number of terminating pods, we include a new
field in the JobStatus.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> JobStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Number of terminating pods&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> terminating &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">int32&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="defaulting-and-validation">Defaulting and validation&lt;/h4>
&lt;p>Defaulting of &lt;code>podReplacementPolicy&lt;/code> will depend on whether &lt;code>podFailurePolicy&lt;/code>
is in use:&lt;/p>
&lt;ul>
&lt;li>when &lt;code>podFailurePolicy&lt;/code> is in use, the default value is &lt;code>Failed&lt;/code>.&lt;/li>
&lt;li>when &lt;code>podFailurePolicy&lt;/code> is not in use, the default value is &lt;code>TerminatingOrFailed&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>When &lt;code>podFailurePolicy&lt;/code> is in use, the only allowed value for &lt;code>podReplacementPolicy&lt;/code>
is &lt;code>Failed&lt;/code>.&lt;/p>
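&lt;p>The defaulting and validation rules above can be sketched in Go as follows. This is a minimal illustration, not the API server code: &lt;code>jobSpec&lt;/code>, &lt;code>applyDefault&lt;/code> and &lt;code>validate&lt;/code> are hypothetical names, and an empty string stands in for an unset field.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// jobSpec is a simplified, hypothetical stand-in for the real JobSpec.
// An empty PodReplacementPolicy means the user left the field unset.
type jobSpec struct {
	UsesPodFailurePolicy bool
	PodReplacementPolicy PodReplacementPolicy
}

// applyDefault fills in the policy only when the user left it unset.
func applyDefault(spec jobSpec) jobSpec {
	if spec.PodReplacementPolicy != "" {
		return spec
	}
	if spec.UsesPodFailurePolicy {
		spec.PodReplacementPolicy = Failed
	} else {
		spec.PodReplacementPolicy = TerminatingOrFailed
	}
	return spec
}

// validate rejects unknown values and rejects TerminatingOrFailed when
// podFailurePolicy is in use.
func validate(spec jobSpec) error {
	switch spec.PodReplacementPolicy {
	case Failed:
		return nil
	case TerminatingOrFailed:
		if spec.UsesPodFailurePolicy {
			return errors.New("podReplacementPolicy must be Failed when podFailurePolicy is in use")
		}
		return nil
	default:
		return fmt.Errorf("unsupported podReplacementPolicy %q", spec.PodReplacementPolicy)
	}
}

func main() {
	defaulted := applyDefault(jobSpec{UsesPodFailurePolicy: true})
	fmt.Println("defaulted policy:", defaulted.PodReplacementPolicy)

	rejected := jobSpec{UsesPodFailurePolicy: true, PodReplacementPolicy: TerminatingOrFailed}
	fmt.Println("validation error:", validate(rejected))
}
```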
&lt;h4 id="tracking-the-terminating-pods">Tracking the terminating pods&lt;/h4>
&lt;p>To support quota management by Job-level controllers (see &lt;a href="#story-3"
>story 3&lt;/a>
), we introduced the &lt;code>.status.terminating&lt;/code> field, which tracks the number of
terminating pods. However, in the initial Beta implementation the field stops
tracking the number of terminating pods as soon as the Job is marked as Failed
with the &lt;code>Failed&lt;/code> condition (see &lt;a href="https://github.com/kubernetes/kubernetes/issues/123775"
target="_blank" rel="noopener">issue #123775&lt;/a>
).
The remaining pods may be occupying resources for an arbitrary amount of time.&lt;/p>
&lt;p>In 1.31 we are going to fix this issue by delaying the
addition of the &lt;code>Failed&lt;/code> or &lt;code>Complete&lt;/code> conditions until all pods are fully
terminated. To indicate as soon as possible that a Job is doomed to fail or succeed,
we extend the scope of the pre-existing conditions &lt;code>FailureTarget&lt;/code> and
&lt;code>SuccessCriteriaMet&lt;/code>, respectively. See more details in
&lt;a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/4368-support-managed-by-for-batch-jobs/README.md"
target="_blank" rel="noopener">Job API managed-by mechanism&lt;/a>
.&lt;/p>
&lt;h3 id="implementation">Implementation&lt;/h3>
&lt;p>As part of this KEP, we need to track pods that are terminating (&lt;code>deletionTimestamp != nil&lt;/code> and &lt;code>phase&lt;/code> is &lt;code>Pending&lt;/code> or &lt;code>Running&lt;/code>).&lt;/p>
&lt;p>The following algorithm could be used:&lt;/p>
&lt;ol>
&lt;li>Count the number of pods that are active and not terminating.&lt;/li>
&lt;li>Count the number of terminating pods.&lt;/li>
&lt;li>In &lt;code>manageJob&lt;/code> we will count expected pods as:&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>when &lt;code>podReplacementPolicy: Failed&lt;/code> then &lt;code>expectedPods = active + terminating&lt;/code>.&lt;/li>
&lt;li>when &lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code> then &lt;code>expectedPods = active&lt;/code>.&lt;/li>
&lt;/ul>
&lt;ol start="4">
&lt;li>Use the expected number of pods to decide whether to recreate.&lt;/li>
&lt;/ol>
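&lt;p>The counting steps above can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the controller&amp;rsquo;s &lt;code>manageJob&lt;/code> code: the &lt;code>pod&lt;/code> struct, &lt;code>countPods&lt;/code>, &lt;code>expectedPods&lt;/code> and &lt;code>podsToCreate&lt;/code> are hypothetical names, and the replacement decision compares only against parallelism, ignoring completions and backoff.&lt;/p>

```go
package main

import "fmt"

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// pod is a hypothetical, trimmed-down view of a Pod.
type pod struct {
	Phase    string // "Pending", "Running", "Succeeded" or "Failed"
	Deleting bool   // true when the Pod has a deletionTimestamp
}

// countPods implements steps 1 and 2: it counts pods that are active and not
// terminating, and pods that are terminating.
func countPods(pods []pod) (active, terminating int) {
	for _, p := range pods {
		if p.Phase == "Succeeded" || p.Phase == "Failed" {
			continue // pods in a terminal phase are neither active nor terminating
		}
		if p.Deleting {
			terminating++
		} else {
			active++
		}
	}
	return active, terminating
}

// expectedPods implements step 3: terminating pods still occupy a slot only
// under the Failed policy.
func expectedPods(policy PodReplacementPolicy, active, terminating int) int {
	if policy == Failed {
		return active + terminating
	}
	return active
}

// podsToCreate implements a simplified step 4 (assumption: only parallelism
// is considered).
func podsToCreate(parallelism, expected int) int {
	if parallelism > expected {
		return parallelism - expected
	}
	return 0
}

func main() {
	pods := []pod{
		{Phase: "Running"},
		{Phase: "Running"},
		{Phase: "Running", Deleting: true},
	}
	active, terminating := countPods(pods)
	for _, policy := range []PodReplacementPolicy{Failed, TerminatingOrFailed} {
		expected := expectedPods(policy, active, terminating)
		fmt.Printf("%s: expected=%d create=%d\n", policy, expected, podsToCreate(3, expected))
	}
}
```

&lt;p>With a parallelism of 3, two running pods and one terminating pod, the sketch creates no replacement under &lt;code>Failed&lt;/code> and one replacement under &lt;code>TerminatingOrFailed&lt;/code>.&lt;/p>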
&lt;p>In Indexed completion mode, the tracking of pods is per index.&lt;/p>
&lt;p>The controller updates the field &lt;code>Status.terminating&lt;/code> with the number of terminating pods.
For backwards compatibility, when &lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code>,
the number of failed pods includes the terminating pods.&lt;/p>
&lt;p>The controller updates the terminating field in the same API call where it
updates other counters, so it should not require any extra API calls.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h4 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h4>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>controller_utils&lt;/code>: &lt;code>April 3rd 2023&lt;/code> - &lt;code>56.6&lt;/code>
&lt;ul>
&lt;li>Adding tests to help determine if pods are terminating.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>job&lt;/code>: &lt;code>April 3rd 2023&lt;/code> - &lt;code>90.4&lt;/code>
&lt;ul>
&lt;li>Verify that terminating pods are in fact counted in the status.&lt;/li>
&lt;li>Recreate pods only once the pod is fully terminated (i.e. &lt;code>Failed&lt;/code>).&lt;/li>
&lt;li>Verify existing behavior with &lt;code>TerminatingOrFailed&lt;/code>.&lt;/li>
&lt;li>If the feature is off, verify existing behavior.&lt;/li>
&lt;li>Count terminating pods even if a terminating Pod is considered failed when &lt;code>JobPodReplacementPolicy&lt;/code> is disabled.&lt;/li>
&lt;li>Count terminating pods even if a terminating Pod is not considered failed when &lt;code>JobPodReplacementPolicy&lt;/code> is enabled.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>gc_controller.go&lt;/code>: &lt;code>April 3rd 2023&lt;/code> - &lt;code>82.4&lt;/code>
&lt;ul>
&lt;li>Set &lt;code>PodPhase&lt;/code> to &lt;code>Failed&lt;/code> when &lt;code>JobPodReplacementPolicy&lt;/code> is enabled but &lt;code>PodDisruptionConditions&lt;/code> is disabled.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>The following scenarios related to &lt;a href="#tracking-the-terminating-pods"
>tracking the terminating pods&lt;/a>
are covered:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Failed&lt;/code> or &lt;code>Complete&lt;/code> conditions are not added while there are still terminating pods&lt;/li>
&lt;li>&lt;code>FailureTarget&lt;/code> is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded&lt;/li>
&lt;li>&lt;code>SuccessCriteriaMet&lt;/code> is added when the &lt;code>completions&lt;/code> are satisfied&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
- &lt;test>: &lt;link to test coverage>
-->
&lt;p>We will add the following integration test for the Job controller:&lt;/p>
&lt;p>Case with &lt;code>JobPodReplacementPolicy&lt;/code> on and &lt;code>podReplacementPolicy: Failed&lt;/code>&lt;/p>
&lt;ol>
&lt;li>Job starts pods that take a while to terminate&lt;/li>
&lt;li>Delete pods&lt;/li>
&lt;li>Verify that &lt;code>terminating&lt;/code> is tracked&lt;/li>
&lt;li>Verify that pod creation only occurs once pod is fully terminated.&lt;/li>
&lt;/ol>
&lt;p>Case with &lt;code>JobPodReplacementPolicy&lt;/code> on and &lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code>&lt;/p>
&lt;ol>
&lt;li>Job starts pods that take a while to terminate&lt;/li>
&lt;li>Delete pods&lt;/li>
&lt;li>Verify that &lt;code>terminating&lt;/code> is tracked&lt;/li>
&lt;li>Verify that pod creation only occurs once deletion happens.&lt;/li>
&lt;/ol>
&lt;p>Case with &lt;code>JobPodReplacementPolicy&lt;/code> off&lt;/p>
&lt;ol>
&lt;li>Job starts pods that take a while to terminate&lt;/li>
&lt;li>Delete pods&lt;/li>
&lt;li>Verify that &lt;code>terminating&lt;/code> is not tracked&lt;/li>
&lt;li>Verify that pod creation only occurs once deletion happens.&lt;/li>
&lt;/ol>
&lt;p>Case for disabling and re-enabling &lt;code>JobPodReplacementPolicy&lt;/code>&lt;/p>
&lt;ol>
&lt;li>Create Job with &lt;code>podReplacementPolicy: Failed&lt;/code>&lt;/li>
&lt;li>Job starts pods that take a while to terminate&lt;/li>
&lt;li>Restart controller and disable &lt;code>JobPodReplacementPolicy&lt;/code>&lt;/li>
&lt;li>Delete some pods&lt;/li>
&lt;li>Verify that terminating pods count as failed and pods are recreated.&lt;/li>
&lt;li>Restart controller and reenable &lt;code>JobPodReplacementPolicy&lt;/code>&lt;/li>
&lt;li>Terminate pods with phase Succeeded.&lt;/li>
&lt;li>Verify that pods still count as failed.&lt;/li>
&lt;li>Delete remaining Pods.&lt;/li>
&lt;li>Verify that &lt;code>terminating&lt;/code> is tracked.&lt;/li>
&lt;li>Verify that pod creation only occurs once pod is fully terminated.&lt;/li>
&lt;li>Verify that pod creation only occurs once deletion happens.&lt;/li>
&lt;/ol>
&lt;p>To cover cases with &lt;code>PodDisruptionConditions&lt;/code>, we only need to verify that the &lt;code>terminating&lt;/code> field is tracked.
Tests will verify the count of terminating pods regardless of whether &lt;code>PodDisruptionConditions&lt;/code> is on or off.&lt;/p>
&lt;p>The following scenarios related to &lt;a href="#tracking-the-terminating-pods"
>tracking the terminating pods&lt;/a>
are covered:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Failed&lt;/code> or &lt;code>Complete&lt;/code> conditions are not added while there are still terminating pods&lt;/li>
&lt;li>&lt;code>FailureTarget&lt;/code> is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded&lt;/li>
&lt;li>&lt;code>SuccessCriteriaMet&lt;/code> is added when the &lt;code>completions&lt;/code> are satisfied&lt;/li>
&lt;/ul>
&lt;p>The &lt;code>integration&lt;/code> tests are implemented in &lt;a href="https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/integration/job/job_test.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/integration/job/job_test.go&lt;/a>
.
The most relevant test is &lt;code>TestJobPodReplacementPolicy&lt;/code>.&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>Generally, the only e2e tests that are useful for this feature are those with &lt;code>PodReplacementPolicy: Failed&lt;/code>.&lt;br>
The test should create a Job whose Pod catches the SIGTERM signal and terminates gracefully, so that when we delete the Pod&lt;br>
we can first assert that no new Pods are created while the Pod is terminating and then, once it has terminated, that a new Pod is created.&lt;/p>
&lt;p>We can use the default &lt;code>busybox&lt;/code> image which is generally used in e2e tests and override the command field with something like:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>_term&lt;span style="color:#666">(){&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">5&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">exit&lt;/span> &lt;span style="color:#666">143&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f">trap&lt;/span> _term SIGTERM
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">while&lt;/span> true; &lt;span style="color:#a2f;font-weight:bold">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">done&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>An e2e test can verify that deletion will not trigger a new pod creation until the exiting pod is fully deleted.&lt;/p>
&lt;p>If &lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code> is specified, we would test that pod creation happens shortly after deletion.&lt;/p>
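&lt;p>As a rough sketch of how the script above could be embedded in a test Job, the manifest below overrides the &lt;code>command&lt;/code> of a &lt;code>busybox&lt;/code> container. The Job name, container name, and &lt;code>terminationGracePeriodSeconds&lt;/code> value are illustrative assumptions and are not taken from the actual e2e test:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: prp-graceful-example            # hypothetical name
spec:
  completions: 1
  parallelism: 1
  podReplacementPolicy: Failed          # replace a Pod only once it is fully terminated
  template:
    spec:
      restartPolicy: Never
      terminationGracePeriodSeconds: 30 # assumed value; longer than the 5s sleep in _term
      containers:
      - name: main                      # hypothetical name
        image: busybox
        command:
        - sh
        - -c
        - |
          _term(){
            sleep 5
            exit 143
          }
          trap _term SIGTERM
          while true; do
            sleep 1
          done
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Changing &lt;code>podReplacementPolicy&lt;/code> to &lt;code>TerminatingOrFailed&lt;/code> in the same manifest would give the variant in which a replacement Pod is expected shortly after deletion.&lt;/p>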
&lt;p>The &lt;code>e2e&lt;/code> tests are implemented in &lt;a href="https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/e2e/apps/job.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/e2e/apps/job.go&lt;/a>
.&lt;/p>
&lt;p>Test grid:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-apps#gce"
target="_blank" rel="noopener">&lt;code>gce&lt;/code>&lt;/a>
&lt;/li>
&lt;/ul>
&lt;pre tabindex="0">&lt;code>Kubernetes e2e suite.[It] [sig-apps] Job should recreate pods only after they have failed if pod replacement policy is set to Failed
&lt;/code>&lt;/pre>&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
- &lt;test>: &lt;link to test coverage>
-->
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Job controller can consider terminating pods as active&lt;/li>
&lt;li>Job controller counts terminating pods in &lt;code>JobStatus&lt;/code>.&lt;/li>
&lt;li>Unit Tests&lt;/li>
&lt;li>Integration tests&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Address reviews and bug reports from Alpha users&lt;/li>
&lt;li>E2e tests are in Testgrid and linked in KEP&lt;/li>
&lt;li>The feature gate is enabled by default&lt;/li>
&lt;li>&lt;code>job_pods_creation_total&lt;/code> metric is added.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Address reviews and bug reports from Beta users&lt;/li>
&lt;li>Allow Job API clients to track the number of terminating pods until all
the resources are released (see &lt;a href="#tracking-the-terminating-pods"
>tracking the terminating pods&lt;/a>
).
Also, provide links for the relevant integration tests in the KEP.&lt;/li>
&lt;li>Lock the &lt;code>JobPodReplacementPolicy&lt;/code> feature-gate to true&lt;/li>
&lt;li>Restore the &lt;code>.status.terminating&lt;/code> assertion for JobSuccessPolicy Conformance Tests in the following:
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L514-L515"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L514-L515&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L556-L557"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L556-L557&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L597-L598"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L597-L598&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="deprecation">Deprecation&lt;/h4>
&lt;ul>
&lt;li>Remove &lt;code>JobPodReplacementPolicy&lt;/code> feature-gate in GA+3.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
&lt;h4 id="upgrade">Upgrade&lt;/h4>
&lt;p>Set &lt;code>JobPodReplacementPolicy&lt;/code> to true in apiserver and controller manager.&lt;/p>
&lt;p>There are no other components required.&lt;/p>
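&lt;p>How the gate is set depends on how the control plane is deployed. As a minimal sketch, assuming a kubeadm-style cluster in which control plane components run as static Pods, the relevant part of the kube-apiserver manifest might look like the fragment below. Only the &lt;code>--feature-gates&lt;/code> flag matters here; the surrounding layout is an assumption, and the same flag would be added to kube-controller-manager:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Fragment of a kube-apiserver static Pod manifest (assumed kubeadm-style layout).
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --feature-gates=JobPodReplacementPolicy=true
    # ...remaining flags unchanged
&lt;/code>&lt;/pre>&lt;/div>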
&lt;p>Jobs that want to replace pods once they are fully terminal can use &lt;code>PodReplacementPolicy&lt;/code>: &lt;code>Failed&lt;/code>.&lt;/p>
&lt;p>If a Job is not using &lt;code>PodFailurePolicy&lt;/code>, one can change &lt;code>PodReplacementPolicy&lt;/code> to &lt;code>TerminatingOrFailed&lt;/code>. This reverts the Job to the existing behavior with the feature off.&lt;/p>
&lt;p>If one is using &lt;code>PodFailurePolicy&lt;/code>, one cannot set the value to &lt;code>TerminatingOrFailed&lt;/code>, as &lt;code>Failed&lt;/code> is the only allowed value.
In this case, the recommendation would be to also disable the &lt;code>PodFailurePolicy&lt;/code> feature.&lt;/p>
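&lt;p>To illustrate the constraint for Jobs that use &lt;code>PodFailurePolicy&lt;/code>, the Job spec fragment below combines the two fields. The specific rule (ignoring Pods with the &lt;code>DisruptionTarget&lt;/code> condition) is only an example chosen for illustration, and the container definition is omitted:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Job spec fragment; the podFailurePolicy rule is an illustrative example.
spec:
  podReplacementPolicy: Failed   # the only value allowed together with podFailurePolicy
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never       # required when podFailurePolicy is used
      # containers omitted for brevity
&lt;/code>&lt;/pre>&lt;/div>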
&lt;h4 id="downgrade">Downgrade&lt;/h4>
&lt;p>Set &lt;code>JobPodReplacementPolicy&lt;/code> to false in apiserver and controller manager.&lt;/p>
&lt;p>With downgrading, you will no longer see any side-effects of &lt;code>PodReplacementPolicy&lt;/code>.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>This feature is limited to control plane.&lt;/p>
&lt;p>Note that kube-apiserver can be at version N+1 relative to the
kube-controller-manager (see &lt;a href="https://kubernetes.io/releases/version-skew-policy/#kube-controller-manager-kube-scheduler-and-cloud-controller-manager"
target="_blank" rel="noopener">here&lt;/a>
).
In that case, the Job controller operates on the version of the Job object that
already supports the new Job API.&lt;/p>
&lt;!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.
Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
CRI or CNI may require updating that component before the kubelet.
-->
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!--
This section must be completed when targeting alpha to a release.
-->
&lt;h4 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: JobPodReplacementPolicy&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver (for field control)&lt;/li>
&lt;li>kube-controller-manager (for main functionality)&lt;/li>
&lt;li>kubelet (for supporting functionality: transition to phase=Failed)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Yes:&lt;/p>
&lt;ul>
&lt;li>The number of terminating pods is counted and populated in &lt;code>JobStatus&lt;/code>.&lt;/li>
&lt;li>The kubelet and pod GC set phase=Failed before deleting a Pod object
(this behavior is also present when the related &lt;code>PodDisruptionConditions&lt;/code> feature is enabled).&lt;/li>
&lt;li>As part of the closely related KEP-3329, &lt;code>podReplacementPolicy&lt;/code> defaults
to &lt;code>Failed&lt;/code> if &lt;code>podFailurePolicy&lt;/code> is set which, as described above, changes
the way terminating pods are handled.&lt;/li>
&lt;/ul>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes.&lt;/p>
&lt;p>When the feature is disabled:&lt;/p>
&lt;ul>
&lt;li>the apiserver:
&lt;ul>
&lt;li>Discards the value of &lt;code>podReplacementPolicy&lt;/code> for new objects.&lt;/li>
&lt;li>Preserves the value of &lt;code>podReplacementPolicy&lt;/code> for existing objects.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>the job controller:
&lt;ul>
&lt;li>processes the Job as &lt;code>podReplacementPolicy: TerminatingOrFailed&lt;/code> (the existing behavior)&lt;/li>
&lt;li>stops tracking terminating pods, sets the value of &lt;code>.status.terminating&lt;/code> to
&lt;code>nil&lt;/code> in the next Job sync.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
Feature gates are typically disabled by setting the flag to `false` and
restarting the component. No other changes should be necessary to disable the
feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The job controller will respect the value of &lt;code>podReplacementPolicy&lt;/code> for new
events (new Pods becoming terminating or failed).&lt;/p>
&lt;p>If &lt;code>podReplacementPolicy: Failed&lt;/code> and there are currently terminating Pod(s) that
were already considered Failed before reenabling the feature, they won&amp;rsquo;t be
re-evaluated.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>No, but we will add unit and integration tests for feature enablement and disablement.&lt;/p>
&lt;p>An integration test verifies disabling and re-enabling the feature.
See &lt;a href="#integration-tests"
>integration tests&lt;/a>
for details.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h4 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h4>
&lt;p>A rollout or rollback will not fail, as rolling out this feature entails turning on &lt;code>JobPodReplacementPolicy&lt;/code>.
Failure rates of Jobs will not increase or decrease with this feature. Pods will be marked as failed later (as we wait for the pods to be fully terminal).&lt;/p>
&lt;p>This feature is opt-in for functional changes. We track terminating pods for observability reasons, but we only use this data in the case of &lt;code>Failed&lt;/code>.&lt;/p>
&lt;p>If a user has set &lt;code>PodReplacementPolicy: Failed&lt;/code> or has PodFailurePolicy set, then
rolling back this feature means that terminating Pods will be recreated once they are deleted.&lt;/p>
&lt;p>If a user rolls out this feature with &lt;code>PodFailurePolicy&lt;/code> set or &lt;code>PodReplacementPolicy&lt;/code> set to &lt;code>Failed&lt;/code>,
then pods will only be recreated once they are fully terminal.&lt;br>
This will not impact failure counts because, in both cases, the pods are eventually marked as failed.&lt;/p>
&lt;p>If a user rolls out this feature without &lt;code>PodFailurePolicy&lt;/code> or &lt;code>PodReplacementPolicy&lt;/code> set, then there will be no impact on existing workloads.&lt;/p>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;h4 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h4>
&lt;ul>
&lt;li>job_syncs_total, exposed by kube-controller-manager
&lt;ul>
&lt;li>If the number of syncs increases it could mean that we have an increased number of failures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;h4 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h4>
&lt;p>In beta, we are working on adding an &lt;a href="https://github.com/kubernetes/kubernetes/pull/119912"
target="_blank" rel="noopener">integration test&lt;/a>
for these cases.&lt;/p>
&lt;p>In terms of a manual test for upgrade and rollback, we can use 1.28.&lt;/p>
&lt;p>The upgrade-&amp;gt;downgrade-&amp;gt;upgrade testing was done manually using the &lt;code>alpha&lt;/code>
version in 1.28 with the following steps:&lt;/p>
&lt;ol>
&lt;li>Start the cluster with the &lt;code>JobPodReplacementPolicy&lt;/code> enabled:&lt;/li>
&lt;/ol>
&lt;p>Create a KIND cluster with 1.28 and use the config below to turn this feature on.&lt;/p>
&lt;p>using &lt;code>config.yaml&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Cluster&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kind.x-k8s.io/v1alpha4&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">featureGates&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">&amp;#34;JobPodReplacementPolicy&amp;#34;: &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">nodes&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>control-plane&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>worker&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then, create the job using &lt;code>.spec.podReplacementPolicy=Failed&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl create -f job.yaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>using &lt;code>job.yaml&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-prp&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimit&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">2&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">podReplacementPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Failed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">restartPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Never&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>sleep&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>gcr.io/k8s-staging-perf-tests/sleep&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">args&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;-termination-grace-period&amp;#34;&lt;/span>,&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;1m&amp;#34;&lt;/span>,&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#b44">&amp;#34;60s&amp;#34;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Wait for the pods to be running, then delete a pod:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl delete pods -l job-name&lt;span style="color:#666">=&lt;/span>job-prp
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>With the feature on and &lt;code>PodReplacementPolicy&lt;/code> set to &lt;code>Failed&lt;/code>, the replacement pod is created only once the original pod is fully terminated.
While the pod is terminating, the Job status also reports a terminating pod.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get &lt;span style="color:#a2f">jobs&lt;/span> -ljob-name&lt;span style="color:#666">=&lt;/span>job-prp -oyaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">status&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">terminating&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="2">
&lt;li>Simulate downgrade by creating a new &lt;code>Kind&lt;/code> cluster with the feature turned off.&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Cluster&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kind.x-k8s.io/v1alpha4&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">featureGates&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">&amp;#34;JobPodReplacementPolicy&amp;#34;: &lt;/span>&lt;span style="color:#a2f;font-weight:bold">false&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">nodes&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>control-plane&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>worker&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then, delete the pods of the job:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl delete pods -l job-name&lt;span style="color:#666">=&lt;/span>job-prp
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The Job status should no longer report terminating pods, and a replacement pod is created before the deleted pod finishes terminating. With the Job above, you should see one pod still terminating and a new pod already created.&lt;/p>
&lt;ol start="3">
&lt;li>Simulate upgrade by creating a new &lt;code>Kind&lt;/code> cluster with the feature turned on.&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Cluster&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kind.x-k8s.io/v1alpha4&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">featureGates&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">&amp;#34;JobPodReplacementPolicy&amp;#34;: &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">nodes&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>control-plane&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>worker&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Deleting the pod will create a replacement pod once the pod is fully terminated.
The status field will also state that the pod is terminating.&lt;/p>
&lt;p>This demonstrates that the feature is working again for the job.&lt;/p>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;h4 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h4>
&lt;p>No.&lt;/p>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h4 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h4>
&lt;p>During pod terminations, an operator can see that the terminating field is being set.&lt;/p>
&lt;p>We will use a new metric:&lt;/p>
&lt;ul>
&lt;li>&lt;code>job_pods_creation_total&lt;/code> (new): the &lt;code>reason&lt;/code> label records what triggered the creation (&lt;code>new&lt;/code>, &lt;code>recreate_terminating_or_failed&lt;/code>, &lt;code>recreate_failed&lt;/code>),&lt;br>
and the &lt;code>status&lt;/code> label records the outcome of the pod creation (&lt;code>succeeded&lt;/code>, &lt;code>failed&lt;/code>).&lt;br>
This can be used to get the number of pods that are being recreated due to &lt;code>recreate_failed&lt;/code>. Otherwise, we would expect to see &lt;code>new&lt;/code> or &lt;code>recreate_terminating_or_failed&lt;/code> as the normal values.&lt;/li>
&lt;/ul>
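&lt;p>As a rough illustration of reading this metric, the sketch below sums &lt;code>job_pods_creation_total&lt;/code> samples by their &lt;code>reason&lt;/code> label from Prometheus text-format output. The helper name &lt;code>creations_by_reason&lt;/code> and the sample values are hypothetical, and the metric may be exposed under a prefixed name in a real cluster, so the name is passed in as a parameter.&lt;/p>

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def creations_by_reason(metrics_text, metric_name="job_pods_creation_total"):
    """Sum samples of metric_name per value of the reason label."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group(1) != metric_name:
            continue  # comment line, blank line, or a different metric
        labels = dict(LABEL_RE.findall(match.group(2)))
        totals[labels.get("reason", "")] += float(match.group(3))
    return dict(totals)


# Hypothetical scrape output, for illustration only.
SAMPLE = """
# TYPE job_pods_creation_total counter
job_pods_creation_total{reason="new",status="succeeded"} 10
job_pods_creation_total{reason="recreate_failed",status="succeeded"} 3
job_pods_creation_total{reason="recreate_failed",status="failed"} 1
"""

print(creations_by_reason(SAMPLE))
```

&lt;p>A non-zero total for &lt;code>recreate_failed&lt;/code> would suggest that some Jobs are waiting for pods to terminate fully before replacing them.&lt;/p>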
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;h4 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h4>
&lt;p>If a user terminates pods that are controlled by a Job, the Job controller should wait
until the existing pods are terminated before starting new ones.&lt;/p>
&lt;p>When the feature is turned on, the Job status will also include a &lt;code>terminating&lt;/code> field if there are any terminating pods.&lt;/p>
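&lt;p>The waiting behavior can be pictured with a deliberately simplified model of how many replacement pods may be created at a given moment. The function &lt;code>pods_to_create&lt;/code> and its parameters are hypothetical and do not mirror the Job controller code; the only assumption taken from this KEP is that, under the &lt;code>Failed&lt;/code> policy, terminating pods still occupy their slots until they are fully terminated.&lt;/p>

```python
from enum import Enum


class PodReplacementPolicy(Enum):
    TERMINATING_OR_FAILED = "TerminatingOrFailed"
    FAILED = "Failed"


def pods_to_create(parallelism, active, terminating, policy):
    """Return how many pods may be started now in this simplified model.

    active counts pods that are not terminating; terminating counts pods
    that have been deleted but have not finished shutting down.
    """
    occupied = active
    if policy is PodReplacementPolicy.FAILED:
        # Terminating pods keep their slot until they are fully terminated.
        occupied += terminating
    return max(0, parallelism - occupied)


# Three slots, one running pod, two pods still terminating.
print(pods_to_create(3, 1, 2, PodReplacementPolicy.TERMINATING_OR_FAILED))
print(pods_to_create(3, 1, 2, PodReplacementPolicy.FAILED))
```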
&lt;h4 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h4>
&lt;p>We did not propose any SLO/SLI for this feature.&lt;/p>
&lt;h4 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h4>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:
&lt;ul>
&lt;li>&lt;code>job_syncs_total&lt;/code> (existing): can be used to see how much the
feature enablement causes the number of syncs to increase.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Components exposing the metric: kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h4>
&lt;p>In beta, we will add a new metric &lt;code>job_pods_creation_total&lt;/code>.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>In &lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
we discuss the interaction with &lt;a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md"
target="_blank" rel="noopener">3329-retriable-and-non-retriable-failures&lt;/a>
.&lt;br>
We will have to guard against cases where &lt;code>PodFailurePolicy&lt;/code> is off while this feature is on.&lt;br>
&lt;code>PodFailurePolicy&lt;/code> is stable and locked to &lt;code>true&lt;/code> by default, but we should still guard against cases where &lt;code>PodDisruptionCondition&lt;/code> is turned off.&lt;/p>
&lt;h4 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h4>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;p>Generally, enabling this will slow down pod creation if pods take a long time to terminate. We would wait
to create new pods until the existing ones are terminated.&lt;/p>
&lt;h4 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h4>
&lt;p>The job controller only updates &lt;code>Job.Status&lt;/code> when one of its fields changes. With this feature on, we will track &lt;code>terminating&lt;/code> pods in this status.
Status updates for Jobs could therefore become more frequent if many pods are being terminated.
However, when pods are being terminated, we would expect other fields (active, failed, etc.) to be updated as well, so there should not be a large increase in API calls for patching.&lt;/p>
&lt;h4 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h4>
&lt;p>No&lt;/p>
&lt;h4 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h4>
&lt;p>No&lt;/p>
&lt;h4 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h4>
&lt;p>For the Job API, we are adding an enum field named &lt;code>PodReplacementPolicy&lt;/code>, which takes
either &lt;code>TerminatingOrFailed&lt;/code> or &lt;code>Failed&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>API type(s): enum&lt;/li>
&lt;li>Estimated increase in size: 8B&lt;/li>
&lt;/ul>
&lt;p>We are also adding a status field for tracking terminating pods.&lt;/p>
&lt;ul>
&lt;li>API type(s): int32&lt;/li>
&lt;li>Estimated increase in size: 4B&lt;/li>
&lt;/ul>
&lt;h4 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h4>
&lt;p>No, the existing SLIs/SLOs do not include the time taken to create new pods when existing ones are terminated.&lt;br>
There is an existing SLI/SLO on pod creation, but this feature will not impact it.&lt;/p>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;h4 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h4>
&lt;p>N/A&lt;/p>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;h4 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h4>
&lt;p>N/A&lt;/p>
&lt;!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h4 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h4>
&lt;p>No change from existing behavior of the Job controller.&lt;/p>
&lt;h4 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h4>
&lt;p>There are no other failure modes.&lt;/p>
&lt;h4 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h4>
&lt;p>To keep the feature on, one can instead suspend the Jobs that are using it.
Setting &lt;code>Suspend:True&lt;/code> in a JobSpec halts the execution of that Job.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2023-04-03: Created KEP&lt;/li>
&lt;li>2023-05-19: KEP Merged.&lt;/li>
&lt;li>2023-07-16: Alpha PRs merged.&lt;/li>
&lt;li>2023-09-29: KEP marked for beta promotion.&lt;/li>
&lt;li>2023-10-24: Merged bugfix &lt;a href="https://github.com/kubernetes/kubernetes/pull/121342"
target="_blank" rel="noopener">Fix tracking of terminating Pods when nothing else changes&lt;/a>
&lt;/li>
&lt;li>2023-10-24: Merged adding a metric required for beta promotion &lt;a href="https://github.com/kubernetes/kubernetes/pull/121481"
target="_blank" rel="noopener">feat: add job_pods_creation_total metric&lt;/a>
&lt;/li>
&lt;li>2023-10-27: Merged &lt;a href="https://github.com/kubernetes/kubernetes/pull/121491"
target="_blank" rel="noopener">Switch feature flag to beta for pod replacement policy and add e2e test #121491&lt;/a>
&lt;/li>
&lt;li>2024-06-11: [v1.31] Merged &lt;a href="https://github.com/kubernetes/kubernetes/pull/125175"
target="_blank" rel="noopener">Count terminating pods when deleting active pods for failed jobs #125175&lt;/a>
&lt;/li>
&lt;li>2024-07-12: [v1.31] Merged &lt;a href="https://github.com/kubernetes/kubernetes/pull/125510"
target="_blank" rel="noopener">Delay setting terminal Job conditions until all pods are terminal #125510&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>This feature was promoted to beta in v1.29, but important updates were implemented in v1.31.
For additional info, check the PRs linked above with the tag &lt;code>[v1.31]&lt;/code>.&lt;/p>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>Enabling this feature may make rollouts slower.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>We discussed having this under the PodFailurePolicy but this is a more general idea than the PodFailurePolicy.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>NA&lt;/p></description></item><item><title>Resources: Allow special characters environment variable</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4369/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4369/</guid><description>
&lt;h1 id="kep-4369-allow-special-characters-in-environment-variables">KEP-4369: Allow special characters in environment variables&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade"
>Upgrade&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade"
>Downgrade&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Allow all printable ASCII characters (codes 32-126) except &amp;ldquo;=&amp;rdquo; to be used in environment variable names.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Kubernetes should not restrict which environment variable names can be used, because it has no way of knowing what the application may need, and people can&amp;rsquo;t always choose their own variable names, which may limit the adoption of Kubernetes.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Allow users to use any ASCII character with a code in the range 32-126, except &amp;ldquo;=&amp;rdquo;, in environment variable names.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;ul>
&lt;li>Implement relaxed validation in the top-level validation method for API create requests, so that names made of ASCII characters in the range 32-126, except &amp;ldquo;=&amp;rdquo;, pass validation.&lt;/li>
&lt;li>Allow users to set &lt;code>ConfigMap&lt;/code> keys and Secret keys outside the &lt;code>C_IDENTIFIER&lt;/code> scope as environment variables using EnvFrom.&lt;/li>
&lt;li>Document rules for setting environment variables.&lt;/li>
&lt;/ul>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>I am a .NET Core developer. .NET Core applications use &amp;ldquo;:&amp;rdquo; in the keys of application settings loaded from an appsettings.json file. When a .NET Core app runs in a container, these settings are typically overridden by specifying environment variables.
For example, given:
&lt;code>&amp;quot;Logging&amp;quot;: { &amp;quot;IncludeScopes&amp;quot;: false, &amp;quot;LogLevel&amp;quot;: { &amp;quot;Default&amp;quot;: &amp;quot;Warning&amp;quot; } }&lt;/code> &lt;br>
the default log level is overridden like this: &lt;code>-e Logging:LogLevel:Default=Debug&lt;/code>&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>Relaxed validation can break upgrade and rollback scenarios, but controlling it with a feature gate makes the risk manageable, and users retain the autonomy to choose whether or not to enable it.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>A feature gate named &lt;code>RelaxedEnvironmentVariableValidation&lt;/code> controls the loosening of the envvar name validation. It is initially in alpha state and defaults to false.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Two sets of validation logic for envvar names:&lt;/p>
&lt;ul>
&lt;li>Strict validation
&lt;ul>
&lt;li>Strict validation follows the current design, which only allows envvar names that match the regular expression &lt;code>[-._a-zA-Z][-._a-zA-Z0-9]*&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Relaxed validation
&lt;ul>
&lt;li>Relaxed validation allows any ASCII character in the range 32-126 except &lt;code>=&lt;/code> in an envvar name. Its regular expression, &lt;code>^[ -&amp;lt;&amp;gt;-~]+$&lt;/code>, matches a string of length at least 1 that consists of ASCII characters from &lt;code>space&lt;/code> to &lt;code>&amp;lt;&lt;/code> and from &lt;code>&amp;gt;&lt;/code> to &lt;code>~&lt;/code>, which excludes &lt;code>=&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Everywhere we validate envvar names in API objects, plumb a parameter that selects strict or relaxed validation:&lt;/p>
&lt;ul>
&lt;li>At the top level validation method when validating API create requests, use the strict validation if the feature gate is off&lt;/li>
&lt;li>At the top level validation method when validating API update requests, use the strict validation if the feature gate is off and the old object passes strict envvar name validation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
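&lt;p>The two rule sets can be compared with a small sketch. The patterns are the ones quoted above; the function names are hypothetical, and the less-than and greater-than signs of the relaxed pattern are written as the hex escapes &lt;code>\x3c&lt;/code> and &lt;code>\x3e&lt;/code>, which describe the same character ranges.&lt;/p>

```python
import re

# Strict rule: the current envvar name pattern.
STRICT_RE = re.compile(r"[-._a-zA-Z][-._a-zA-Z0-9]*")
# Relaxed rule: printable ASCII 32-126 without "=" (code 61).
# \x3c and \x3e are the less-than and greater-than signs.
RELAXED_RE = re.compile(r"[ -\x3c\x3e-~]+")


def is_strict_name(name):
    """True if the whole name matches the strict pattern."""
    return STRICT_RE.fullmatch(name) is not None


def is_relaxed_name(name):
    """True if the whole name matches the relaxed pattern."""
    return RELAXED_RE.fullmatch(name) is not None


for name in ["MY_VAR", "Logging:LogLevel:Default", "A=B", ""]:
    print(repr(name), is_strict_name(name), is_relaxed_name(name))
```

&lt;p>Using &lt;code>fullmatch&lt;/code> plays the role of the &lt;code>^&lt;/code> and &lt;code>$&lt;/code> anchors, so a name is accepted only if every character belongs to the allowed set.&lt;/p>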
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;p>Current coverage:&lt;/p>
&lt;ul>
&lt;li>pkg/apis/core/validation/validation_test.go: &lt;code>2023-12-21&lt;/code> - &lt;code>83.9%&lt;/code>&lt;/li>
&lt;li>pkg/kubelet/kubelet_pods_test.go: &lt;code>2023-12-21&lt;/code> - &lt;code>67.2%&lt;/code>&lt;/li>
&lt;li>staging/src/k8s.io/apimachinery/pkg/util/validation/validation_test.go: &lt;code>2023-12-21&lt;/code> - &lt;code>94.8%&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>These tests will be added:&lt;/p>
&lt;ul>
&lt;li>New tests will be added to &lt;code>pkg/apis/core/validation/validation_test.go&lt;/code> to ensure that environment variable fields are validated correctly.&lt;/li>
&lt;li>A new test will be added to &lt;code>pkg/kubelet/kubelet_pods_test.go&lt;/code> that sets environment variables with special characters for pods in a given namespace.&lt;/li>
&lt;li>A new test will be added to &lt;code>staging/src/k8s.io/apimachinery/pkg/util/validation/validation_test.go&lt;/code> to ensure that the environment variable name field is valid.&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>Add a test to &lt;code>test/e2e/common/node/configmap.go&lt;/code> to verify that ConfigMap keys containing special characters can be consumed as environment variables.&lt;/li>
&lt;li>Add a test to &lt;code>test/e2e/common/node/secret.go&lt;/code> to verify that Secret keys containing special characters can be consumed as environment variables.&lt;/li>
&lt;li>Add a test to &lt;code>test/e2e/common/node/expansion&lt;/code> to verify that environment variable names can contain special characters.&lt;/li>
&lt;/ul>
&lt;p>We have also added presubmit and periodic test jobs in CI for these e2e tests.
Job names:&lt;/p>
&lt;ul>
&lt;li>&lt;code>pull-kubernetes-e2e-relaxed-environment-variable-validation&lt;/code>&lt;/li>
&lt;li>&lt;code>ci-kubernetes-e2e-relaxed-environment-variable-validation&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Create the feature gate and implement the feature, disabled by default.&lt;/li>
&lt;li>Add unit and e2e tests for the feature.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Solicit feedback from the Alpha.&lt;/li>
&lt;li>Ensure tests are stable and passing.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Ensure that the time range from Alpha to GA version can cover the version skew of all components.&lt;/li>
&lt;li>Add troubleshooting details on how to deal with incompatible kubelet/CRI implementations based on issues found in beta releases.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;h4 id="upgrade">Upgrade&lt;/h4>
&lt;p>Environment variables previously set by the user will not change. To use this enhancement, users need to enable the feature gate.&lt;/p>
&lt;h4 id="downgrade">Downgrade&lt;/h4>
&lt;p>After a downgrade, environment variables containing special characters will continue to work as expected, but any write to a resource that adds or changes environment variables must use only names that pass strict validation.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>The feature gate must be enabled on kube-apiserver to use this feature.&lt;/p>
&lt;p>If the feature gate is not enabled on kube-apiserver, strict validation is used.&lt;/p>
&lt;p>If the feature gate is disabled and the existing object passes strict validation, strict validation is used on update.&lt;/p>
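&lt;p>The choice between the two modes can be summarized in a short sketch. The function &lt;code>use_relaxed_validation&lt;/code> is hypothetical and only restates the rules of this section: relaxed validation applies when the feature gate is enabled, or on update when the stored object already carries a name that fails strict validation.&lt;/p>

```python
import re

# Strict envvar name pattern quoted in the Design Details section.
STRICT_RE = re.compile(r"[-._a-zA-Z][-._a-zA-Z0-9]*")


def use_relaxed_validation(gate_enabled, old_names=None):
    """Decide the validation mode for a create (old_names is None) or an update."""
    if gate_enabled:
        return True
    if old_names is None:
        return False  # create request with the gate off: strict
    # Update with the gate off: stay strict only if the old object is strict-valid.
    return any(STRICT_RE.fullmatch(name) is None for name in old_names)


print(use_relaxed_validation(False))                     # create, gate off
print(use_relaxed_validation(False, ["MY_VAR"]))         # update, old object strict-valid
print(use_relaxed_validation(False, ["Logging:Level"]))  # update, old object already relaxed
```

&lt;p>Keeping relaxed validation for objects that already contain relaxed names means such objects can still be updated after the gate is turned off.&lt;/p>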
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: RelaxedEnvironmentVariableValidation&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>If the feature gate is disabled, already running workloads will not be affected in any way,
but new workloads that use special characters in environment variable names cannot be created.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The feature should continue to work just fine.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>When the feature gate is disabled, workloads that are already running will not be affected. However, if a user updates the workloads, they may fail to recreate pods or ReplicaSets because the new objects fail the API server&amp;rsquo;s validation logic, which could cause the workloads to fail.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Yes, operators can use the Kubernetes API to achieve this. They need to list all pods in the cluster and check whether any pod has an environment variable name that does not match &lt;code>[-._a-zA-Z][-._a-zA-Z0-9]*&lt;/code>. For example, the following command finds the namespaces and names of pods using this feature, together with their environment variable names:&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get pods --all-namespaces -o json | jq -r &amp;#39;.items[] | select(.spec.containers[].env[]?.name | test(&amp;#34;^[-._a-zA-Z][-._a-zA-Z0-9]*$&amp;#34;) | not) | [.metadata.namespace, .metadata.name, .spec.containers[].env[]?.name] | @tsv&amp;#39;
&lt;/code>&lt;/pre>&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>According to the test results in &lt;a href="https://github.com/HirazawaUi/verfiy-container-env"
target="_blank" rel="noopener">https://github.com/HirazawaUi/verfiy-container-env&lt;/a>
, the container runtime is very lenient with using special characters as environment variables, and almost no failures will occur.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>N/A&lt;/p>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2023-12-21: Initial draft KEP&lt;/li>
&lt;li>2024-02-06: KEP promoted to implementable.&lt;/li>
&lt;li>2024-08-26: Promote to beta&lt;/li>
&lt;li>2024-08-27: Fixed some errors in the beta phase&lt;/li>
&lt;li>2025-06-03: Promote to GA&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>If the envvar name character set is extended, everything that currently consumes envvar names from the API is affected and may break or become unsafe.&lt;/p>
&lt;p>For example:&lt;/p>
&lt;ul>
&lt;li>If a third party uses an envvar name as a filename and assumes that this is safe, a name containing characters that cannot be used in a filename (like &lt;code>:&lt;/code>) or characters that break the assumption of a flat directory structure (like &lt;code>/&lt;/code>) will produce unexpected results.&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>do nothing (leave it as-is)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>relax the rule, but with a long beta period where the existing rule remains the default.
Ensure that the beta period doesn&amp;rsquo;t end until ValidatingAdmissionPolicy is GA and has been for 2 minor releases.
&lt;em>Clearly&lt;/em> document how to use a ValidatingAdmissionPolicy to get behavior equivalent to the legacy checking,
and signpost people to these docs when graduating the looser validation to be the Kubernetes default.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>define a label or annotation for each namespace that controls how Pod environment variables are validated in that namespace&lt;/p>
&lt;/li>
&lt;li>
&lt;p>[more complex!]
add an API kind to specify the validation rules for Pods&lt;/p>
&lt;p>Create a new API kind, eg PodValidationRule. It&amp;rsquo;s &lt;strong>namespaced&lt;/strong>. Within the &lt;code>.spec&lt;/code> of each object, define:&lt;/p>
&lt;ul>
&lt;li>a Pod selector&lt;/li>
&lt;li>an optional CEL validation rule for environment variable keys&lt;/li>
&lt;li>an optional CEL validation rule for environment variable values&lt;/li>
&lt;/ul>
&lt;p>If any of the selected validation rules don&amp;rsquo;t pass for a Pod, reject it at admission time. Set up a defaulting
mechanism to
Also, define how Pod templates interact with this new API (eg: you get a &lt;code>Warning:&lt;/code> when you create
a Deployment where the PodTemplate inside the Deployment wouldn&amp;rsquo;t pass validation)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2></description></item><item><title>Resources: Allow zero value for Sleep Action of PreStop Hook</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4818/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4818/</guid><description>
&lt;h1 id="kep-4818-allow-zero-value-for-sleep-action-of-prestop-hook">KEP-4818: Allow zero value for Sleep Action of PreStop Hook&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade"
>Upgrade&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade"
>Downgrade&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The sleep action for the PreStop container lifecycle hook was introduced in KEP 3960. However, it does not accept zero as a valid value for the sleep duration in seconds. This KEP adds support for setting a value of zero with the sleep action of the PreStop hook.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Currently, trying to create a container with a PreStop lifecycle hook with sleep of 0 seconds will throw a validation error like so:&lt;/p>
&lt;pre tabindex="0">&lt;code>Invalid value: 0: must be greater than 0 and less than terminationGracePeriodSeconds (30)
&lt;/code>&lt;/pre>&lt;p>The Sleep action is implemented with the time package from Go’s standard library. The &lt;code>time.After()&lt;/code> function, which is used to implement the sleep, permits a zero sleep duration. A negative or zero sleep duration causes the timer to fire immediately, so the sleep behaves like a no-op.&lt;/p>
&lt;p>The implementation in KEP 3960 supports only positive values for the sleep duration. It is semantically correct to support a zero value for this field since &lt;code>time.After()&lt;/code> also supports zero and negative durations. Negative values and zero have the same effect with &lt;code>time.After()&lt;/code>: in both cases the timer fires immediately. We don’t need to support negative values since they have the same effect as setting the duration to zero.&lt;/p>
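&lt;p>The zero and negative cases can be checked with a small Go program. This is a minimal sketch and not the kubelet code: &lt;code>fireDelay&lt;/code> is a hypothetical helper introduced only for illustration. It measures how long &lt;code>time.After&lt;/code> takes to deliver its value for a few durations.&lt;/p>

```go
package main

import (
	"fmt"
	"time"
)

// fireDelay (hypothetical helper) reports how long time.After(d) takes to
// deliver its value. Ranging over the channel and breaking after the first
// value is equivalent to a single receive.
func fireDelay(d time.Duration) time.Duration {
	start := time.Now()
	for range time.After(d) {
		break
	}
	return time.Since(start)
}

func main() {
	// Zero and negative durations should fire right away; the positive
	// duration is included only for contrast.
	for _, d := range []time.Duration{0, -5 * time.Second, 50 * time.Millisecond} {
		elapsed := fireDelay(d).Round(time.Millisecond)
		fmt.Printf("time.After(%v) fired after about %v\n", d, elapsed)
	}
}
```

&lt;p>On a typical machine the first two lines should report roughly zero elapsed time, while the third reports at least 50ms.&lt;/p>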
&lt;p>A potential use case for this behaviour is when you need a PreStop hook to be defined for the validation of your resource, but don&amp;rsquo;t really need to sleep as part of the PreStop hook. An example of this is described by a user &lt;a href="https://github.com/kubernetes/enhancements/issues/3960#issuecomment-2208556397"
target="_blank" rel="noopener">here&lt;/a>
in the parent KEP. They add a PreStop sleep hook by default via an admission webhook if the PreStop hook is not specified by the user. To opt out of this, a no-op PreStop hook with a duration of zero seconds can be used.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Update the validation for the Sleep action to allow zero as a valid sleep duration.&lt;/li>
&lt;li>Allow users to set a zero value for the sleep action in PreStop hooks to do a no-op.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>This KEP does not support adding negative values for the sleep duration.&lt;/li>
&lt;li>This KEP does not aim to provide a way to pause or delay pod termination indefinitely.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Introduce a &lt;code>PodLifecycleSleepActionAllowZero&lt;/code> feature gate which is disabled by default. When the feature gate is enabled, the &lt;code>validateSleepAction&lt;/code> method would allow values greater than or equal to zero as a valid sleep duration.&lt;/p>
&lt;p>Since this update to the validation allows previously invalid values, care must be taken to support cluster downgrades safely. To accomplish this, the validation will distinguish between new resources and updates to existing resources:&lt;/p>
&lt;ul>
&lt;li>When the feature gate is disabled:
&lt;ul>
&lt;li>(a) New resources cannot set zero as the sleep duration for the PreStop hook (no change to current validation).&lt;/li>
&lt;li>(b) Existing resources cannot be updated to have a sleep duration of zero seconds&lt;/li>
&lt;li>(c) Existing resources with a PreStop sleep duration set to zero will continue to run and use a sleep duration of zero seconds. These can be updated and the zero sleep duration would continue to work.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>When the feature gate is enabled:
&lt;ul>
&lt;li>(d) New resources allow zero as a valid sleep duration.&lt;/li>
&lt;li>(e) Updates to existing resources will allow zero as a valid sleep duration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
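&lt;p>The create/update distinction above can be sketched as a small decision function. This is an illustrative assumption about the shape of the logic, not the kube-apiserver implementation: the &lt;code>container&lt;/code> and &lt;code>podSpec&lt;/code> types and the &lt;code>hasZeroSleep&lt;/code>, &lt;code>allowZeroSleep&lt;/code> and &lt;code>seconds&lt;/code> helpers are hypothetical simplifications of the real API types and validation options.&lt;/p>

```go
package main

import "fmt"

// Simplified stand-ins for the real API types. All names in this sketch are
// hypothetical and exist only for illustration.
type container struct {
	name         string
	preStopSleep *int64 // nil means no PreStop sleep action is set
}

type podSpec struct {
	containers []container
}

// hasZeroSleep reports whether any container in the spec already uses a
// zero-second PreStop sleep. A nil spec means there is no stored object.
func hasZeroSleep(spec *podSpec) bool {
	if spec == nil {
		return false
	}
	for _, c := range spec.containers {
		if c.preStopSleep != nil {
			if *c.preStopSleep == 0 {
				return true
			}
		}
	}
	return false
}

// allowZeroSleep decides whether validation should accept a zero duration:
// always when the gate is enabled, and on updates when the stored object
// already contains a zero sleep, so such objects stay updatable after the
// gate is turned off.
func allowZeroSleep(gateEnabled bool, oldSpec *podSpec) bool {
	return gateEnabled || hasZeroSleep(oldSpec)
}

// seconds returns a pointer to v.
func seconds(v int64) *int64 {
	p := new(int64)
	*p = v
	return p
}

func main() {
	withZero := new(podSpec)
	withZero.containers = []container{{name: "app", preStopSleep: seconds(0)}}
	withFive := new(podSpec)
	withFive.containers = []container{{name: "app", preStopSleep: seconds(5)}}

	fmt.Println("gate off, create:", allowZeroSleep(false, nil))
	fmt.Println("gate off, update of pod without zero sleep:", allowZeroSleep(false, withFive))
	fmt.Println("gate off, update of pod with zero sleep:", allowZeroSleep(false, withZero))
	fmt.Println("gate on, create:", allowZeroSleep(true, nil))
}
```

&lt;p>With the gate off, only the update of an object that already stores a zero sleep is allowed to keep it, which matches cases (a) to (c) in the list.&lt;/p>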
&lt;p>The proposed change adds another layer to the &lt;code>validateSleepAction&lt;/code> function to allow zero as a valid sleep duration, as shown:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-diff" data-lang="diff">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a00000">-func validateSleepAction(sleep *core.SleepAction, gracePeriod *int64, fldPath *field.Path) field.ErrorList {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a00000">&lt;/span>&lt;span style="color:#00a000">+func validateSleepAction(sleep *core.SleepAction, gracePeriod *int64, fldPath *field.Path, opts PodValidationOptions) field.ErrorList {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">&lt;/span> allErrors := field.ErrorList{}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> // We allow gracePeriod to be nil here because the pod in which this SleepAction
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> // is defined might have an invalid grace period defined, and we don&amp;#39;t want to
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> // flag another error here when the real problem will already be flagged.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a00000">- if gracePeriod != nil &amp;amp;&amp;amp; sleep.Seconds &amp;lt;= 0 || sleep.Seconds &amp;gt; *gracePeriod {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a00000">- invalidStr := fmt.Sprintf(&amp;#34;must be greater than 0 and less than terminationGracePeriodSeconds (%d)&amp;#34;, *gracePeriod)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a00000">- allErrors = append(allErrors, field.Invalid(fldPath, sleep.Seconds, invalidStr))
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a00000">&lt;/span>&lt;span style="color:#00a000">+ if opts.AllowPodLifecycleSleepActionZeroValue {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ if gracePeriod != nil &amp;amp;&amp;amp; (sleep.Seconds &amp;lt; 0 || sleep.Seconds &amp;gt; *gracePeriod) {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ invalidStr := fmt.Sprintf(&amp;#34;must be non-negative and less than terminationGracePeriodSeconds (%d)&amp;#34;, *gracePeriod)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ allErrors = append(allErrors, field.Invalid(fldPath, sleep.Seconds, invalidStr))
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ }
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ } else {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ if gracePeriod != nil &amp;amp;&amp;amp; (sleep.Seconds &amp;lt;= 0 || sleep.Seconds &amp;gt; *gracePeriod) {
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ invalidStr := fmt.Sprintf(&amp;#34;must be greater than 0 and less than terminationGracePeriodSeconds (%d). Please enable PodLifecycleSleepActionAllowZero feature gate if you need a sleep of zero duration.&amp;#34;, *gracePeriod)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ allErrors = append(allErrors, field.Invalid(fldPath, sleep.Seconds, invalidStr))
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">+ }
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#00a000">&lt;/span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> return allErrors
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Currently, the kubelet accepts &lt;code>0&lt;/code> as a valid duration. There is no validation done at the kubelet level. All the validation for the duration itself is done at the kube-apiserver. The &lt;a href="https://github.com/AxeZhan/kubernetes/blob/3a96afdfefdf329c637623ae31a61d20dbdb0393/pkg/kubelet/lifecycle/handlers.go#L129-L141"
target="_blank" rel="noopener">runSleepHandler&lt;/a>
in the kubelet uses the &lt;code>time.After()&lt;/code> function from the &lt;a href="https://pkg.go.dev/time"
target="_blank" rel="noopener">time&lt;/a>
package, which supports a &lt;code>0&lt;/code> duration input. &lt;code>time.After&lt;/code> also accepts negative values, which fire immediately just like zero. However, this KEP does not support negative values.&lt;/p>
&lt;p>See the full set of code changes in the WIP PR: &lt;a href="https://github.com/kubernetes/kubernetes/pull/127094"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/127094&lt;/a>
&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a Kubernetes user, I want to be able to define a PreStop hook in my spec without needing to sleep during the execution of the PreStop hook. This no-op behaviour can be used for validation purposes with admission webhooks (&lt;a href="https://github.com/kubernetes/enhancements/issues/3960#issuecomment-2208556397"
target="_blank" rel="noopener">Reference&lt;/a>
).&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The change is opt-in, since it requires configuring a PreStop hook with a sleep action of zero seconds. There is therefore no risk beyond the upgrade/downgrade risks, which are addressed in the Proposal section.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>Refer to the Proposal section.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;p>Alpha:&lt;/p>
&lt;ul>
&lt;li>Test that the runSleepHandler function returns immediately when given a duration of zero.&lt;/li>
&lt;li>Test that the validation succeeds when given a zero duration with the feature gate enabled.&lt;/li>
&lt;li>Test that the validation fails when given a zero duration with the feature gate disabled.&lt;/li>
&lt;li>Test that the validation returns the appropriate error messages when given an invalid duration value (e.g., a negative value) with the feature gate disabled and enabled.&lt;/li>
&lt;li>Unit tests for testing the disabling of the feature gate after it was enabled and the feature was used.&lt;/li>
&lt;li>Unit tests for pod with zero grace period duration and zero sleep duration with zero value enabled.&lt;/li>
&lt;li>Unit test for pod with nil grace period with zero value disabled&lt;/li>
&lt;li>Unit test for pod with nil grace period with zero value enabled&lt;/li>
&lt;/ul>
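&lt;p>The validation cases listed above lend themselves to a table-driven check. The sketch below is an assumption-laden illustration: &lt;code>validateSleep&lt;/code> is a hypothetical, simplified stand-in for &lt;code>validateSleepAction&lt;/code> that takes plain values instead of the real API types, and &lt;code>int64Ptr&lt;/code> is a helper invented for this example.&lt;/p>

```go
package main

import "fmt"

// validateSleep is a simplified, hypothetical stand-in for
// validateSleepAction. It returns an error when the duration is rejected.
func validateSleep(seconds int64, gracePeriod *int64, allowZero bool) error {
	// A nil grace period is flagged elsewhere, so no error is raised here.
	if gracePeriod == nil {
		return nil
	}
	tooSmall := 0 >= seconds // gate disabled: zero and negatives are rejected
	if allowZero {
		tooSmall = 0 > seconds // gate enabled: only negatives are rejected
	}
	if tooSmall || seconds > *gracePeriod {
		return fmt.Errorf("invalid sleep duration %d (grace period %d)", seconds, *gracePeriod)
	}
	return nil
}

// int64Ptr returns a pointer to v.
func int64Ptr(v int64) *int64 {
	p := new(int64)
	*p = v
	return p
}

func main() {
	cases := []struct {
		name      string
		seconds   int64
		grace     *int64
		allowZero bool
		wantErr   bool
	}{
		{"zero, gate enabled", 0, int64Ptr(30), true, false},
		{"zero, gate disabled", 0, int64Ptr(30), false, true},
		{"negative, gate enabled", -1, int64Ptr(30), true, true},
		{"negative, gate disabled", -1, int64Ptr(30), false, true},
		{"zero sleep and zero grace period, gate enabled", 0, int64Ptr(0), true, false},
		{"nil grace period, gate disabled", 0, nil, false, false},
		{"nil grace period, gate enabled", 0, nil, true, false},
	}
	for _, tc := range cases {
		err := validateSleep(tc.seconds, tc.grace, tc.allowZero)
		status := "ok"
		if (err != nil) != tc.wantErr {
			status = "MISMATCH"
		}
		fmt.Printf("%-50s %s\n", tc.name, status)
	}
}
```

&lt;p>In the real test suite the same table would presumably drive the actual validation function with the feature gate toggled per case.&lt;/p>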
&lt;p>Current coverages:&lt;/p>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/apis/core/validation&lt;/code> : 2024-09-20 - 84.3&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/kubelet/lifecycle/handlers&lt;/code> : 2024-09-20 - 86.4&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
&lt;p>N/A&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->
&lt;p>Basic functionality&lt;/p>
&lt;ul>
&lt;li>Create a simple pod with a container that runs a long-running process.&lt;/li>
&lt;li>Add a preStop hook to the container configuration, using the new sleepAction with a sleep duration of &lt;code>0&lt;/code>.&lt;/li>
&lt;li>Delete the pod and observe the time it takes for the container to terminate.&lt;/li>
&lt;li>Verify that the container terminates immediately without sleeping.&lt;/li>
&lt;/ul>
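&lt;p>The basic e2e flow above needs a Pod whose container declares a zero-second PreStop sleep. The following sketch builds such a manifest as JSON. The Pod name, container name, image and command are placeholder assumptions, and &lt;code>zeroSleepPod&lt;/code> is a hypothetical helper, not part of the e2e framework.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// obj is a shorthand for a generic JSON object.
type obj = map[string]interface{}

// zeroSleepPod (hypothetical helper) builds a Pod manifest whose single
// container runs a long-lived process and declares a PreStop sleep action
// with the given duration. Image and command are placeholders.
func zeroSleepPod(name string, sleepSeconds int64) obj {
	return obj{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   obj{"name": name},
		"spec": obj{
			"containers": []obj{{
				"name":    "main",
				"image":   "busybox",
				"command": []string{"sleep", "3600"},
				"lifecycle": obj{
					"preStop": obj{"sleep": obj{"seconds": sleepSeconds}},
				},
			}},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(zeroSleepPod("prestop-zero-sleep", 0), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

&lt;p>The printed manifest can be piped to &lt;code>kubectl apply -f -&lt;/code>. With the feature gate disabled, the API server is expected to reject it with the validation error shown in the Motivation section.&lt;/p>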
&lt;p>Additional e2e tests for beta:&lt;/p>
&lt;ul>
&lt;li>Test that pods with sleep value of 0 in PreStop hook can be created&lt;/li>
&lt;li>Test that pods with sleep value of 0 in PostStart hook can be created&lt;/li>
&lt;li>Test that pods with sleep value of 0 in PreStop hook can be updated&lt;/li>
&lt;li>Test that pods with sleep value of 0 in PostStart hook can be updated&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature flag&lt;/li>
&lt;li>Initial unit/e2e tests completed and enabled&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from developers and surveys&lt;/li>
&lt;li>Additional e2e tests are completed&lt;/li>
&lt;li>No trouble reports from alpha release&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>No trouble reports with the beta release, plus some anecdotal evidence of it being used successfully.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;h4 id="upgrade">Upgrade&lt;/h4>
&lt;p>The previous PreStop Sleep Action behavior will not be broken. Users can continue to use their hooks as they are. To use this enhancement, users need to enable the feature gate and set the sleep duration to zero in their PreStop hook’s sleep action.&lt;/p>
&lt;h4 id="downgrade">Downgrade&lt;/h4>
&lt;p>If the kube-apiserver is downgraded to a version where the feature gate is not supported (&amp;lt;v1.32), no new resources can be created with a PreStop sleep duration of zero seconds. Existing resources created with a sleep duration of zero will continue to function.&lt;/p>
&lt;p>If the feature gate is turned off after being enabled, no new resources can be created with PreStop sleep duration of zero seconds. Existing resources will continue to run and use a sleep duration of zero seconds. These resources can be updated and the zero sleep duration would continue to work.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>Only the kube-apiserver will need to enable the feature gate for the full featureset to be present. This is because the implementation is already handled in the parent &lt;a href="https://github.com/kubernetes/enhancements/issues/3960"
target="_blank" rel="noopener">KEP #3960&lt;/a>
. The change introduced in this KEP is only to how the validation is done. If the feature gate is disabled, the feature will not be available. The feature gate does not apply to the kubelet logic since the time.After function used by the original KEP already supports zero as a valid duration.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: PodLifecycleSleepActionAllowZero&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver with the feature-gate off. In terms of Stable versions, users can choose to opt-out by not setting the sleep field.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>New pods with a PreStop sleep action duration of zero seconds can be created again.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>For the parent KEP, unit tests for the &lt;code>switch&lt;/code> of the feature gate were added in &lt;code>pkg/registry/core/pod/strategy_test&lt;/code>. We can add similar tests for the new feature gate as well.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>The change is opt-in, so it doesn&amp;rsquo;t impact already running workloads.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>I believe we don&amp;rsquo;t need a metric here since the &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3960-pod-lifecycle-sleep-action#what-specific-metrics-should-inform-a-rollback"
target="_blank" rel="noopener">parent KEP already has a metric&lt;/a>
to inform rollbacks. This KEP only updates the validation to allow zero value.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>This is an opt-in feature, and it does not change any default behavior. I manually tested enabling and disabling this feature by changing the kube-apiserver configuration and restarting it in a kind cluster. The details of the expected behavior are described in the Proposal and Upgrade/Downgrade sections.&lt;/p>
&lt;p>The manual test steps are as follows:&lt;/p>
&lt;ol>
&lt;li>Create a local 1.32 k8s cluster with kind, and create a test-pod in that cluster.&lt;/li>
&lt;li>Enable PodLifecycleSleepActionAllowZero feature in the kube-apiserver and restart it.&lt;/li>
&lt;li>Add a prestop hook with sleep action with duration of zero seconds to the test-pod and delete it. Observe the time cost.&lt;/li>
&lt;li>Create another pod with sleep action duration of zero seconds.&lt;/li>
&lt;li>Disable PodLifecycleSleepActionAllowZero feature in the kube-apiserver and restart it.&lt;/li>
&lt;li>Delete the pod created in step 4, and observe the time cost.&lt;/li>
&lt;/ol>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Inspect the preStop hook configuration of the workloads and the feature gates set on the kube-apiserver.&lt;/p>
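&lt;p>One way to inspect the preStop configuration across a cluster is to scan a JSON pod listing for zero-second sleeps. The sketch below is an assumption, not tooling shipped with Kubernetes: &lt;code>podList&lt;/code> models only the fields it needs, &lt;code>zeroSleepContainers&lt;/code> is a hypothetical helper, and the embedded sample stands in for real output of &lt;code>kubectl get pods --all-namespaces -o json&lt;/code>. Init and ephemeral containers are not covered.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList models only the parts of a pod listing that this sketch reads.
type podList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Containers []struct {
				Name      string `json:"name"`
				Lifecycle *struct {
					PreStop *struct {
						Sleep *struct {
							Seconds int64 `json:"seconds"`
						} `json:"sleep"`
					} `json:"preStop"`
				} `json:"lifecycle"`
			} `json:"containers"`
		} `json:"spec"`
	} `json:"items"`
}

// zeroSleepContainers (hypothetical helper) returns "namespace/pod/container"
// for every container that declares a PreStop sleep of zero seconds.
func zeroSleepContainers(data []byte) ([]string, error) {
	list := new(podList)
	if err := json.Unmarshal(data, list); err != nil {
		return nil, err
	}
	var found []string
	for _, pod := range list.Items {
		for _, c := range pod.Spec.Containers {
			if c.Lifecycle == nil || c.Lifecycle.PreStop == nil || c.Lifecycle.PreStop.Sleep == nil {
				continue // no PreStop sleep action on this container
			}
			if c.Lifecycle.PreStop.Sleep.Seconds == 0 {
				found = append(found, pod.Metadata.Namespace+"/"+pod.Metadata.Name+"/"+c.Name)
			}
		}
	}
	return found, nil
}

func main() {
	// Made-up sample shaped like a pod listing in JSON form.
	sample := `{"items":[{"metadata":{"namespace":"default","name":"web"},
	  "spec":{"containers":[
	    {"name":"app","lifecycle":{"preStop":{"sleep":{"seconds":0}}}},
	    {"name":"sidecar","lifecycle":{"preStop":{"sleep":{"seconds":5}}}},
	    {"name":"plain"}]}}]}`
	found, err := zeroSleepContainers([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, f := range found {
		fmt.Println(f)
	}
}
```

&lt;p>For the embedded sample this prints &lt;code>default/web/app&lt;/code>. An empty result on a real cluster suggests that no running container currently relies on the zero-value sleep.&lt;/p>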
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Events
&lt;ul>
&lt;li>Event Reason:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Condition name:&lt;/li>
&lt;li>Other field:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: Check the logs of the container during termination, check the termination duration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:&lt;/li>
&lt;li>[Optional] Aggregation method:&lt;/li>
&lt;li>Components exposing the metric:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details: Check the logs of the container during termination, check the termination duration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>N/A. This is a change to validation within the API server.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>Disable &lt;code>PodLifecycleSleepActionAllowZero&lt;/code> feature gate, and restart the kube-apiserver.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;ul>
&lt;li>2024-09-16: Alpha KEP PR opened for v1.32&lt;/li>
&lt;li>2024-10-03: Summary, Motivation and Proposal sections merged&lt;/li>
&lt;li>2024-09-03: &lt;a href="https://github.com/kubernetes/kubernetes/pull/127094"
target="_blank" rel="noopener">Alpha code implementation PR&lt;/a>
opened&lt;/li>
&lt;li>2024-11-01: Alpha code PR merged&lt;/li>
&lt;li>2024-12-11: Kubernetes v1.32 release with PodLifecycleSleepActionAllowZero in alpha stage&lt;/li>
&lt;li>2025-02-06: KEP updated targeting to beta in v1.33&lt;/li>
&lt;li>2025-06-11: KEP updated targeting to stable in v1.34&lt;/li>
&lt;li>2025-07-20: &lt;a href="https://github.com/kubernetes/kubernetes/pull/132595"
target="_blank" rel="noopener">Code implementation for GA graduation&lt;/a>
merged into k/k&lt;/li>
&lt;li>2025-10-20: k/enhancements PR opened updating KEP status as implemented&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>N/A&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>Another way to run a zero-duration sleep in a container is to use an exec command in the preStop hook, for example &lt;code>[&amp;quot;/bin/sh&amp;quot;,&amp;quot;-c&amp;quot;,&amp;quot;sleep 0&amp;quot;]&lt;/code>. This requires a &lt;code>sleep&lt;/code> binary in the image. Since the sleep action already exists as a preStop hook, it is simpler to allow a duration of zero seconds for the sleep action.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>N/A&lt;/p></description></item><item><title>Resources: Allows setting arbitrary FQDN as the pod's hostname</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4762/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4762/</guid><description>
&lt;h1 id="kep-4762-allows-setting-arbitrary-fqdn-as-the-pods-hostname">KEP-4762: Allows setting arbitrary FQDN as the pod&amp;rsquo;s hostname&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This proposal allows users to set an arbitrary Fully Qualified Domain Name (FQDN) as the hostname of a pod. It introduces a new field, &lt;code>hostnameOverride&lt;/code>, in the podSpec. If the field is set, then once the API is GA the kubelet will always respect it and will ignore the &lt;code>hostname&lt;/code> and &lt;code>subdomain&lt;/code> values; otherwise the kubelet falls back to the legacy behavior.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>This feature will make it easier to move some traditional applications to Kubernetes. Some older services use the hostname to determine permissions or to drive service operations. When migrating such services to Kubernetes, the hostname restrictions of the pod complicate the migration path: a Fully Qualified Domain Name (FQDN) hostname assigned to a pod always carries the &lt;code>cluster-suffix&lt;/code>, so it can never work for services that expect the hostname to match their DNS name.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Allow users to set any arbitrary FQDN as pod hostname.&lt;/li>
&lt;li>Write the FQDN set by the user to &lt;code>/etc/hosts&lt;/code> in the pod.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Add DNS records for the FQDN set by the user.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We add a new field called &lt;code>hostnameOverride&lt;/code> to &lt;code>podSpec&lt;/code>, of type string. When the value of the &lt;code>hostnameOverride&lt;/code> field is not an empty string, it becomes the hostname of the pod, overriding whatever the &lt;code>setHostnameAsFQDN&lt;/code>, &lt;code>subdomain&lt;/code>, and &lt;code>hostname&lt;/code> fields in &lt;code>podSpec&lt;/code> would otherwise produce. In that case, &lt;code>setHostnameAsFQDN&lt;/code> is only allowed to be nil.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a Kubernetes administrator, I want the Kerberos replication daemon (kpropd) to accurately handle hostname resolution for authentication.&lt;/p>
&lt;p>In a Kubernetes environment, kpropd on the receiving end uses the hostname to determine the appropriate service credential for authentication purposes (e.g., foo-0.default.pod.cluster-local). However, on the sending side, kpropd uses the hostname it is connecting to (e.g., kdc1.example.com) to generate the cryptographic secret for secure communication. These hostnames must match to ensure that the cryptographic process can generate consistent data on both ends. Any discrepancy between these hostnames can result in authentication failure due to mismatched cryptographic data.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The Linux kernel limits the hostname field to 64 bytes (see &lt;a href="http://man7.org/linux/man-pages/man2/sethostname.2.html"
target="_blank" rel="noopener">sethostname(2)&lt;/a>
). If a hostname exceeds this 64-byte limit, Kubernetes will fail to create the Pod Sandbox, causing the Pod to remain indefinitely in the &lt;code>ContainerCreating&lt;/code> state.&lt;/p>
&lt;p>To mitigate this issue, we will validate during resource creation that the value of &lt;code>hostnameOverride&lt;/code> does not exceed 64 bytes. Creation requests exceeding this limit will be rejected.&lt;/p>
&lt;p>After enabling this feature, if users utilize it to create a group of Pods via a Deployment or StatefulSet, multiple Pods with identical hostnames may end up on a single node. This could lead to unintended consequences, though we haven&amp;rsquo;t identified specific potential issues at this time.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>We are introducing a new feature gate called &lt;code>HostnameOverride&lt;/code>. When this feature gate is enabled, users can add the &lt;code>hostnameOverride&lt;/code> field in the podSpec.&lt;/p>
&lt;p>The &lt;code>hostnameOverride&lt;/code> field has a length limitation of 64 characters and must adhere to the DNS subdomain names standard defined in &lt;a href="https://datatracker.ietf.org/doc/html/rfc1123"
target="_blank" rel="noopener">RFC 1123&lt;/a>
.&lt;/p>
&lt;p>Additionally, in the &lt;code>generatePodSandboxConfig&lt;/code> method of kubelet, the pod&amp;rsquo;s hostname will always be overridden with the value of &lt;code>hostnameOverride&lt;/code>, and it will be written to the pod&amp;rsquo;s &lt;code>/etc/hosts&lt;/code>.&lt;/p>
&lt;p>For Windows containers, we only set the container&amp;rsquo;s hostname and do not create an &lt;code>/etc/hosts&lt;/code> file for it (as we have previously made it clear that we do not create an &lt;code>/etc/hosts&lt;/code> file for Windows containers).&lt;/p>
&lt;p>If both &lt;code>setHostnameAsFQDN&lt;/code> and &lt;code>hostnameOverride&lt;/code> fields are set, or if both &lt;code>hostNetwork&lt;/code> and &lt;code>hostnameOverride&lt;/code> fields are set, we will reject the creation of the resource and return an error indicating that these fields are mutually exclusive with the &lt;code>hostnameOverride&lt;/code> field.&lt;/p>
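&lt;p>The validation rules described above can be sketched roughly as follows. This is a minimal illustration, not the actual API server code: the &lt;code>podSpec&lt;/code> struct, the &lt;code>validateHostnameOverride&lt;/code> function, and the regular expression are hypothetical stand-ins, and the real implementation may order or word its checks differently.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// maxHostnameLen mirrors the 64-byte kernel hostname limit cited in this KEP.
const maxHostnameLen = 64

// dns1123Subdomain approximates the RFC 1123 DNS subdomain format
// (lowercase alphanumerics and '-', labels separated by '.').
var dns1123Subdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// podSpec is a hypothetical, trimmed-down stand-in for the real PodSpec type.
type podSpec struct {
	HostnameOverride  string
	SetHostnameAsFQDN *bool
	HostNetwork       bool
}

// validateHostnameOverride returns an error when the spec violates
// one of the rules listed in this section.
func validateHostnameOverride(spec podSpec) error {
	name := spec.HostnameOverride
	if name == "" {
		// Field not set: legacy hostname behavior applies, nothing to check.
		return nil
	}
	if spec.SetHostnameAsFQDN != nil {
		return errors.New("setHostnameAsFQDN and hostnameOverride are mutually exclusive")
	}
	if spec.HostNetwork {
		return errors.New("hostNetwork and hostnameOverride are mutually exclusive")
	}
	if len(name) > maxHostnameLen {
		return fmt.Errorf("hostnameOverride must be no more than %d bytes", maxHostnameLen)
	}
	if !dns1123Subdomain.MatchString(name) {
		return fmt.Errorf("hostnameOverride %q is not a valid DNS subdomain", name)
	}
	return nil
}
```

&lt;p>In this sketch an empty &lt;code>hostnameOverride&lt;/code> is treated as unset, matching the wording of the Proposal section.&lt;/p>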
&lt;p>Based on the above design, after the KEP is implemented, we can achieve the following results.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>#&lt;/th>
&lt;th>&lt;code>.hostname&lt;/code>&lt;/th>
&lt;th>&lt;code>.subdomain&lt;/code>&lt;/th>
&lt;th>&lt;code>.setHostnameAsFQDN&lt;/code>&lt;/th>
&lt;th>&lt;code>.hostnameOverride&lt;/code>&lt;/th>
&lt;th>&lt;code>.hostNetwork&lt;/code>&lt;/th>
&lt;th>&lt;code>$(hostname)&lt;/code>&lt;/th>
&lt;th>&lt;code>$(hostname -f)&lt;/code>&lt;/th>
&lt;th>DNS (assuming service exists)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>8&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>9&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>11&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>12&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>13&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>14&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>15&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>16&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>17&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>18&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>19&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>20&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>21&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>22&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;pod-name&amp;gt;.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>23&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>&amp;lt;same-as-node&amp;gt;&lt;/code>&lt;/td>
&lt;td>&lt;code>aa.bb.&amp;lt;ns&amp;gt;.svc.&amp;lt;zone&amp;gt;&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>24&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>25&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>26&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>27&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>28&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>29&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>30&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>31&lt;/td>
&lt;td>&lt;code>aa&lt;/code>&lt;/td>
&lt;td>&lt;code>bb&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>&lt;code>xx.yy.zz&lt;/code>&lt;/td>
&lt;td>true&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;td>INVALID&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>As shown in the table, setting &lt;code>hostnameOverride&lt;/code> will only change the hostname inside the pod and will not modify the DNS records in Kubernetes.&lt;/p>
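&lt;p>The hostname columns of the table can be reproduced with a small decision function. The sketch below is only an illustration of the table, not kubelet code: &lt;code>podFields&lt;/code>, &lt;code>resolveHostnames&lt;/code>, and the &lt;code>nodeName&lt;/code> and &lt;code>zone&lt;/code> parameters are hypothetical names, and &lt;code>setHostnameAsFQDN&lt;/code> is simplified to a plain boolean.&lt;/p>

```go
package main

import (
	"errors"
	"strings"
)

// podFields is a hypothetical stand-in for the PodSpec fields
// that appear as columns in the table.
type podFields struct {
	PodName           string
	Namespace         string
	Hostname          string
	Subdomain         string
	SetHostnameAsFQDN bool
	HostnameOverride  string
	HostNetwork       bool
}

// resolveHostnames returns the expected values of $(hostname) and
// $(hostname -f) inside the pod, or an error for the INVALID rows.
func resolveHostnames(p podFields, nodeName, zone string) (short, fqdn string, err error) {
	if p.HostnameOverride != "" {
		// Rows 12-15 and 24-31: the override cannot be combined with these fields.
		if p.SetHostnameAsFQDN || p.HostNetwork {
			return "", "", errors.New("hostnameOverride is mutually exclusive with setHostnameAsFQDN and hostNetwork")
		}
		// Rows 8-11: the override wins over hostname and subdomain.
		return p.HostnameOverride, p.HostnameOverride, nil
	}
	if p.HostNetwork {
		// Rows 16-23: the pod shares the hostname of its node.
		return nodeName, nodeName, nil
	}
	// Rows 0-7: legacy behavior.
	short = p.Hostname
	if short == "" {
		short = p.PodName
	}
	fqdn = short
	if p.Subdomain != "" {
		fqdn = strings.Join([]string{short, p.Subdomain, p.Namespace, "svc", zone}, ".")
	}
	if p.SetHostnameAsFQDN {
		short = fqdn
	}
	return short, fqdn, nil
}
```

&lt;p>The DNS column is deliberately left out of the sketch: as the table shows, DNS records depend only on &lt;code>hostname&lt;/code> and &lt;code>subdomain&lt;/code> and are unaffected by &lt;code>hostnameOverride&lt;/code>.&lt;/p>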
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>Add kubelet unit tests to verify that container hostnames are correctly generated: &lt;code>k8s.io/kubernetes/pkg/kubelet/kuberuntime&lt;/code>: &lt;code>2025-06-06&lt;/code> - &lt;code>69.0%&lt;/code>&lt;/li>
&lt;li>Add API validation unit tests to ensure all field combinations yield correct results: &lt;code>k8s.io/kubernetes/pkg/apis/core/validation&lt;/code> : &lt;code>2025-06-06&lt;/code> - &lt;code>84.7%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>N/A&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>Add a conformance test to &lt;code>test/e2e&lt;/code> that verifies our implementation conforms to the expectations defined in the table in the Design Details section.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Use the &lt;code>HostnameOverride&lt;/code> feature gate to implement this feature.&lt;/li>
&lt;li>Initial e2e tests completed and enabled.
&lt;ul>
&lt;li>The link to the added e2e test: &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/pod_hostnameoverride.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/pod_hostnameoverride.go&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Add documentation for feature gates.&lt;/li>
&lt;li>Add a detailed table to the docs illustrating the mappings between pod hostnames and DNS records under different configurations.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Enable the feature gate by default.&lt;/li>
&lt;li>Update the feature gate documentation.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>No issues reported during two releases.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>API server should be upgraded before Kubelets. Kubelets should be downgraded before the API server.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>The core implementation resides in kubelet.&lt;/p>
&lt;p>Older kubelet versions will ignore the pod&amp;rsquo;s hostnameOverride field:&lt;/p>
&lt;ul>
&lt;li>Newly created Pods will retain the previous behavior.&lt;/li>
&lt;/ul>
&lt;p>Older apiserver versions will similarly ignore the hostnameOverride field:&lt;/p>
&lt;ul>
&lt;li>The apiserver doesn&amp;rsquo;t populate the hostnameOverride value, so newer kubelet versions will maintain the legacy behavior.&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: HostnameOverride&lt;/li>
&lt;li>Components depending on the feature gate: kubelet, kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. Using the feature gate is the only way to enable/disable this feature.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>There will be no impact on running Pods in the cluster. This change solely affects newly created Pods. Once enabled, you can set pod hostnames by configuring the &lt;code>podSpec.hostnameOverride&lt;/code> field.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>We have added unit tests for enabling and disabling the feature gate in: &lt;code>pkg/kubelet/kubelet_pods_test.go#TestGeneratePodHostNameAndDomain&lt;/code>&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>No known failure modes.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>The &lt;code>kubelet_started_pods_total&lt;/code> metric helps determine whether enabling or disabling this feature causes abnormal pod restarts in the cluster.&lt;/p>
&lt;p>The &lt;code>kubelet_started_pods_errors_total&lt;/code> metric tracks whether toggling the feature results in pod startup failures.&lt;/p>
&lt;p>The &lt;code>kubelet_restarted_pods_total&lt;/code> metric monitors whether enabling or disabling the feature triggers restarts of static Pods.&lt;/p>
&lt;p>The &lt;code>run_podsandbox_errors_total&lt;/code> metric helps detect whether enabling the feature gate and using this functionality causes sandbox container creation failures.&lt;/p>
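&lt;p>One way to compare these counters before and after toggling the gate is to sum a metric from the kubelet metrics output. The following is a sketch under stated assumptions: the helper name &lt;code>sum_metric&lt;/code> is hypothetical, the input is Prometheus text exposition format on stdin, and sample lines are assumed to carry no trailing timestamp.&lt;/p>

```shell
# Hypothetical helper (bash + awk): sums every sample of one metric,
# across all label sets, from Prometheus text format read on stdin.
# Prints "absent" when the metric does not appear at all.
sum_metric() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    {
      i = index($1, "{")                # strip the label set, if present
      metric = (i ? substr($1, 1, i - 1) : $1)
      if (metric == name) { total += $NF; found = 1 }
    }
    END {
      if (found) printf "%d\n", total
      else print "absent"
    }
  '
}
```

&lt;p>Assuming the caller is allowed to reach the node proxy endpoint, usage could look like &lt;code>kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics" | sum_metric kubelet_started_pods_errors_total&lt;/code>, where &lt;code>NODE_NAME&lt;/code> is a placeholder. A value that grows after the gate is flipped would be a signal to investigate before continuing the rollout.&lt;/p>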
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Yes. The upgrade, downgrade, and upgrade path was manually tested with a local
cluster by restarting the kube-apiserver and kubelet with the
&lt;code>HostnameOverride&lt;/code> feature gate enabled, then disabled, then enabled again. The
test verified that:&lt;/p>
&lt;ul>
&lt;li>A Pod created while &lt;code>HostnameOverride=true&lt;/code> uses &lt;code>spec.hostnameOverride&lt;/code> as
its runtime hostname.&lt;/li>
&lt;li>The existing Pod keeps running and keeps its hostname after the feature gate
is disabled.&lt;/li>
&lt;li>A new Pod created while &lt;code>HostnameOverride=false&lt;/code> ignores
&lt;code>spec.hostnameOverride&lt;/code> and uses the default Pod hostname.&lt;/li>
&lt;li>The Pod created while the feature gate was disabled keeps running and keeps
the default hostname after the feature gate is re-enabled.&lt;/li>
&lt;/ul>
&lt;p>The script used for the manual local-up test was:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080">#!/usr/bin/env bash
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080">&lt;/span>&lt;span style="color:#a2f">set&lt;/span> -euo pipefail
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f">export&lt;/span> &lt;span style="color:#b8860b">GOPATH&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b68;font-weight:bold">${&lt;/span>&lt;span style="color:#b8860b">GOPATH&lt;/span>&lt;span style="color:#a2f;font-weight:bold">:-&lt;/span>/root/go&lt;span style="color:#b68;font-weight:bold">}&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">KUBE_ROOT&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$GOPATH&lt;/span>&lt;span style="color:#b44">/src/k8s.io/kubernetes&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">KUBECTL&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/bin/kubectl&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">POD_NAME&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;test-pod&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">OVERRIDE_HOSTNAME&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;test-hostname&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">DEFAULT_HOSTNAME&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$POD_NAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">FEATURE_GATES_ON&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;HostnameOverride=true&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b8860b">FEATURE_GATES_OFF&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;HostnameOverride=false&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f">cd&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f">export&lt;/span> &lt;span style="color:#b8860b">PATH&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$PATH&lt;/span>&lt;span style="color:#b44">:&lt;/span>&lt;span style="color:#b8860b">$GOPATH&lt;/span>&lt;span style="color:#b44">/bin:&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/third_party/etcd&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f">export&lt;/span> &lt;span style="color:#b8860b">KUBECONFIG&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;/var/run/kubernetes/admin.kubeconfig&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>kill_local_up_components&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kube-apiserver&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kube-controller-manager&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kube-scheduler&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kubelet&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kube-proxy&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;etcd --advertise-client-urls http://127.0.0.1:2379&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;bash hack/local-up-cluster.sh&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>component_pid&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pgrep -f &lt;span style="color:#b44">&amp;#34;^&lt;/span>&lt;span style="color:#b8860b">$1&lt;/span>&lt;span style="color:#b44"> &amp;#34;&lt;/span> | head -n &lt;span style="color:#666">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>read_cmdline&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">pid&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$1&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> -n &lt;span style="color:#b8860b">out&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$2&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">out&lt;/span>&lt;span style="color:#666">=()&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">while&lt;/span> &lt;span style="color:#b8860b">IFS&lt;/span>&lt;span style="color:#666">=&lt;/span> &lt;span style="color:#a2f">read&lt;/span> -r -d &lt;span style="color:#b44">&amp;#39;&amp;#39;&lt;/span> arg; &lt;span style="color:#a2f;font-weight:bold">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">out&lt;/span>&lt;span style="color:#666">+=(&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$arg&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#666">)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">done&lt;/span> &amp;lt;&lt;span style="color:#b44">&amp;#34;/proc/&lt;/span>&lt;span style="color:#b8860b">$pid&lt;/span>&lt;span style="color:#b44">/cmdline&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set_feature_gates&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> -n &lt;span style="color:#b8860b">cmd&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$1&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">feature_gates&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$2&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">for&lt;/span> i in &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b68;font-weight:bold">${&lt;/span>!cmd[@]&lt;span style="color:#b68;font-weight:bold">}&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">if&lt;/span> &lt;span style="color:#666">[[&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b68;font-weight:bold">${&lt;/span>&lt;span style="color:#b8860b">cmd&lt;/span>[&lt;span style="color:#b8860b">$i&lt;/span>]&lt;span style="color:#b68;font-weight:bold">}&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#666">==&lt;/span> --feature-gates&lt;span style="color:#666">=&lt;/span>* &lt;span style="color:#666">]]&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">then&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> cmd&lt;span style="color:#666">[&lt;/span>&lt;span style="color:#b8860b">$i&lt;/span>&lt;span style="color:#666">]=&lt;/span>&lt;span style="color:#b44">&amp;#34;--feature-gates=&lt;/span>&lt;span style="color:#b8860b">$feature_gates&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">return&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">fi&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">if&lt;/span> &lt;span style="color:#666">[[&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b68;font-weight:bold">${&lt;/span>&lt;span style="color:#b8860b">cmd&lt;/span>[&lt;span style="color:#b8860b">$i&lt;/span>]&lt;span style="color:#b68;font-weight:bold">}&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#666">==&lt;/span> &lt;span style="color:#b44">&amp;#34;--feature-gates&amp;#34;&lt;/span> &lt;span style="color:#666">]]&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">then&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> cmd&lt;span style="color:#666">[&lt;/span>&lt;span style="color:#a2f;font-weight:bold">$((&lt;/span>i &lt;span style="color:#666">+&lt;/span> &lt;span style="color:#666">1&lt;/span>&lt;span style="color:#a2f;font-weight:bold">))&lt;/span>&lt;span style="color:#666">]=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$feature_gates&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">return&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">fi&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">done&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">cmd&lt;/span>&lt;span style="color:#666">+=(&lt;/span>&lt;span style="color:#b44">&amp;#34;--feature-gates=&lt;/span>&lt;span style="color:#b8860b">$feature_gates&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#666">)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>start_cluster&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">log_file&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/hostname-override-local-up.log&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> kill_local_up_components
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> rm -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$log_file&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">FEATURE_GATES&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$FEATURE_GATES_ON&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#b8860b">LOG_LEVEL&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#666">3&lt;/span> hack/local-up-cluster.sh &amp;gt;&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$log_file&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &amp;amp;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">until&lt;/span> grep -q &lt;span style="color:#b44">&amp;#34;Local Kubernetes cluster is running&amp;#34;&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$log_file&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">done&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>restart_apiserver_and_kubelet&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">feature_gates&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$1&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> apiserver_pid kubelet_pid
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> apiserver_cmd kubelet_cmd
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">apiserver_pid&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#a2f;font-weight:bold">$(&lt;/span>component_pid &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kube-apiserver&amp;#34;&lt;/span>&lt;span style="color:#a2f;font-weight:bold">)&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">kubelet_pid&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#a2f;font-weight:bold">$(&lt;/span>component_pid &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kubelet&amp;#34;&lt;/span>&lt;span style="color:#a2f;font-weight:bold">)&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> read_cmdline &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$apiserver_pid&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> apiserver_cmd
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> read_cmdline &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$kubelet_pid&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> kubelet_cmd
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> set_feature_gates apiserver_cmd &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$feature_gates&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> set_feature_gates kubelet_cmd &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$feature_gates&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kubelet&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pkill -9 -f &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBE_ROOT&lt;/span>&lt;span style="color:#b44">/_output/local/bin/linux/arm64/kube-apiserver&amp;#34;&lt;/span> &amp;gt;/dev/null 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b68;font-weight:bold">${&lt;/span>&lt;span style="color:#b8860b">apiserver_cmd&lt;/span>[@]&lt;span style="color:#b68;font-weight:bold">}&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &amp;gt;&lt;span style="color:#b44">&amp;#34;/tmp/kube-apiserver-hostname-override.log&amp;#34;&lt;/span> 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &amp;amp;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">disown&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$!&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">until&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> version &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; &lt;span style="color:#a2f;font-weight:bold">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">done&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b68;font-weight:bold">${&lt;/span>&lt;span style="color:#b8860b">kubelet_cmd&lt;/span>[@]&lt;span style="color:#b68;font-weight:bold">}&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &amp;gt;&lt;span style="color:#b44">&amp;#34;/tmp/kubelet-hostname-override.log&amp;#34;&lt;/span> 2&amp;gt;&amp;amp;&lt;span style="color:#666">1&lt;/span> &amp;amp;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">disown&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$!&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#a2f">wait&lt;/span> --for&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b8860b">condition&lt;/span>&lt;span style="color:#666">=&lt;/span>Ready node/127.0.0.1 --timeout&lt;span style="color:#666">=&lt;/span>180s &amp;gt;/dev/null
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>apply_pod&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> cat &lt;span style="color:#b44">&amp;lt;&amp;lt;EOF | &amp;#34;$KUBECTL&amp;#34; apply -f - &amp;gt;/dev/null
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">apiVersion: v1
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">kind: Pod
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">metadata:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> name: $POD_NAME
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">spec:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> hostnameOverride: $OVERRIDE_HOSTNAME
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> containers:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> - name: writer-container
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> image: busybox
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44"> command: [&amp;#34;/bin/sh&amp;#34;, &amp;#34;-c&amp;#34;, &amp;#34;sleep 3600&amp;#34;]
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">EOF&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wait_for_pod&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#a2f">wait&lt;/span> --for&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b8860b">condition&lt;/span>&lt;span style="color:#666">=&lt;/span>Ready &lt;span style="color:#b44">&amp;#34;pod/&lt;/span>&lt;span style="color:#b8860b">$POD_NAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> --timeout&lt;span style="color:#666">=&lt;/span>180s &amp;gt;/dev/null
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>expect_hostname&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">message&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$1&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">expected&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$2&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">local&lt;/span> &lt;span style="color:#b8860b">actual&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">for&lt;/span> _ in &lt;span style="color:#a2f;font-weight:bold">$(&lt;/span>seq &lt;span style="color:#666">1&lt;/span> 60&lt;span style="color:#a2f;font-weight:bold">)&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">do&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b8860b">actual&lt;/span>&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#a2f;font-weight:bold">$(&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#a2f">exec&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$POD_NAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> -- hostname 2&amp;gt;/dev/null &lt;span style="color:#666">||&lt;/span> &lt;span style="color:#a2f">true&lt;/span>&lt;span style="color:#a2f;font-weight:bold">)&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">if&lt;/span> &lt;span style="color:#666">[[&lt;/span> -n &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$actual&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#666">]]&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">then&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">break&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">fi&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sleep &lt;span style="color:#666">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">done&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">if&lt;/span> &lt;span style="color:#666">[[&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$actual&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> !&lt;span style="color:#666">=&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$expected&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> &lt;span style="color:#666">]]&lt;/span>; &lt;span style="color:#a2f;font-weight:bold">then&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">echo&lt;/span> &lt;span style="color:#b44">&amp;#34;FAIL: &lt;/span>&lt;span style="color:#b8860b">$message&lt;/span>&lt;span style="color:#b44">: expected hostname &lt;/span>&lt;span style="color:#b8860b">$expected&lt;/span>&lt;span style="color:#b44">, got &lt;/span>&lt;span style="color:#b8860b">$actual&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">exit&lt;/span> &lt;span style="color:#666">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">fi&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">echo&lt;/span> &lt;span style="color:#b44">&amp;#34;PASS: &lt;/span>&lt;span style="color:#b8860b">$message&lt;/span>&lt;span style="color:#b44">: hostname=&lt;/span>&lt;span style="color:#b8860b">$actual&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print_version&lt;span style="color:#666">()&lt;/span> &lt;span style="color:#666">{&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f">echo&lt;/span> &lt;span style="color:#b44">&amp;#34;Cluster version:&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> version
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">}&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>start_cluster
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>restart_apiserver_and_kubelet &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$FEATURE_GATES_ON&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print_version
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> delete &lt;span style="color:#b44">&amp;#34;pod/&lt;/span>&lt;span style="color:#b8860b">$POD_NAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> --ignore-not-found --wait&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#a2f">true&lt;/span> &amp;gt;/dev/null
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>apply_pod
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wait_for_pod
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>expect_hostname &lt;span style="color:#b44">&amp;#34;HostnameOverride=true overrides pod hostname&amp;#34;&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$OVERRIDE_HOSTNAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>restart_apiserver_and_kubelet &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$FEATURE_GATES_OFF&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wait_for_pod
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>expect_hostname &lt;span style="color:#b44">&amp;#34;existing pod keeps running after HostnameOverride=false&amp;#34;&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$OVERRIDE_HOSTNAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$KUBECTL&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> delete &lt;span style="color:#b44">&amp;#34;pod/&lt;/span>&lt;span style="color:#b8860b">$POD_NAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span> --wait&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#a2f">true&lt;/span> &amp;gt;/dev/null
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>apply_pod
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wait_for_pod
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>expect_hostname &lt;span style="color:#b44">&amp;#34;new pod does not use hostnameOverride when HostnameOverride=false&amp;#34;&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$DEFAULT_HOSTNAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>restart_apiserver_and_kubelet &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$FEATURE_GATES_ON&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wait_for_pod
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>expect_hostname &lt;span style="color:#b44">&amp;#34;pod created while HostnameOverride=false keeps running after HostnameOverride=true&amp;#34;&lt;/span> &lt;span style="color:#b44">&amp;#34;&lt;/span>&lt;span style="color:#b8860b">$DEFAULT_HOSTNAME&lt;/span>&lt;span style="color:#b44">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The test result was:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-console" data-lang="console">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Cluster version:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Client Version: v1.36.0-1349+643e407efef84a
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Kustomize Version: v5.8.1
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">Server Version: v1.36.0-1349+643e407efef84a
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">PASS: HostnameOverride=true overrides pod hostname: hostname=test-hostname
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">PASS: existing pod keeps running after HostnameOverride=false: hostname=test-hostname
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">PASS: new pod does not use hostnameOverride when HostnameOverride=false: hostname=test-pod
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#888">PASS: pod created while HostnameOverride=false keeps running after HostnameOverride=true: hostname=test-pod
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Users can check which workloads are utilizing this feature with the following command:&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get pods -A -o json | jq -r &amp;#39;.items[] | select(.spec.hostnameOverride != null) | &amp;#34;\(.metadata.namespace) \(.metadata.name) \(.spec.hostnameOverride)&amp;#34;&amp;#39;
&lt;/code>&lt;/pre>&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
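&lt;p>The following command lists the Pods that set &lt;code>spec.hostnameOverride&lt;/code> and prints only those whose runtime hostname matches the requested value. A Pod that sets the field but is missing from the output should be investigated.&lt;/p>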
&lt;pre tabindex="0">&lt;code>kubectl get pods -A -o json | jq -r &amp;#39;.items[] | select(.spec.hostnameOverride != null) | &amp;#34;\(.metadata.namespace) \(.metadata.name) \(.spec.hostnameOverride)&amp;#34;&amp;#39; | while IFS=&amp;#39; &amp;#39; read -r ns pod ho; do actual=$(kubectl exec -n &amp;#34;$ns&amp;#34; &amp;#34;$pod&amp;#34; -- hostname 2&amp;gt;/dev/null); [ &amp;#34;$actual&amp;#34; = &amp;#34;$ho&amp;#34; ] &amp;amp;&amp;amp; echo &amp;#34;$ns $pod $actual $ho&amp;#34;; done
&lt;/code>&lt;/pre>&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>If &lt;code>kubelet_started_pods_errors_total&lt;/code> stays consistently at 0 in a cluster before this feature is introduced, it should also stay at 0 after the feature is enabled.&lt;/p>
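&lt;p>As a rough way to spot-check this SLO on a single node, the sketch below sums the samples of one metric from Prometheus text output. The &lt;code>sum_metric&lt;/code> helper and the &lt;code>NODE&lt;/code> variable are hypothetical names introduced for illustration, and the usage line assumes the kubelet metrics endpoint is reachable through the API server node proxy.&lt;/p>
&lt;pre tabindex="0">&lt;code># Hypothetical helper: sum every sample of one metric read from stdin.
# Matches both the bare metric name and the labeled form name{...}.
sum_metric() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { total += $NF }
    END { printf "%d\n", total }
  '
}

# Assumed usage (NODE is a placeholder for a real node name):
# kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics" | sum_metric kubelet_started_pods_errors_total
&lt;/code>&lt;/pre>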
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: &lt;code>run_podsandbox_errors_total&lt;/code>, &lt;code>kubelet_started_pods_total&lt;/code>, &lt;code>kubelet_started_pods_errors_total&lt;/code>, &lt;code>kubelet_restarted_pods_total&lt;/code>&lt;/li>
&lt;li>[Optional] Aggregation method: A sharp increase in these metric values would indicate abnormal pod restarts or creation errors in the cluster caused by toggling the feature gate.&lt;/li>
&lt;li>Components exposing the metric: Kubelet&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other (treat as last resort)
&lt;ul>
&lt;li>Details:&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Implementing this feature requires adding a new field to the Pod object, which will increase its size. However, we&amp;rsquo;ll limit the new field&amp;rsquo;s length to 64 bytes.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No impact on running workloads.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>No known failure modes.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>If the SLO is not being met, operators should:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Check whether the affected Pods use &lt;code>spec.hostnameOverride&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get pods -A -o json | jq -r &lt;span style="color:#b44">&amp;#39;.items[] | select(.spec.hostnameOverride != null) | &amp;#34;\(.metadata.namespace) \(.metadata.name) \(.spec.hostnameOverride)&amp;#34;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Confirm the &lt;code>HostnameOverride&lt;/code> feature gate state on the kube-apiserver and
kubelet. The field is accepted and used only when the feature gate is
enabled in the relevant components.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Inspect kubelet metrics for the affected nodes, especially
&lt;code>kubelet_started_pods_errors_total&lt;/code>, &lt;code>run_podsandbox_errors_total&lt;/code>,
&lt;code>kubelet_started_pods_total&lt;/code>, and &lt;code>kubelet_restarted_pods_total&lt;/code>, and compare
them with the same metrics before the feature gate was enabled or before
Pods using &lt;code>spec.hostnameOverride&lt;/code> were created.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Inspect affected Pod status, events, and kubelet logs to determine whether
the failures are during Pod admission, sandbox creation, or container start:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl describe pod -n &amp;lt;namespace&amp;gt; &amp;lt;pod&amp;gt;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>kubectl get events -n &amp;lt;namespace&amp;gt; --field-selector involvedObject.name&lt;span style="color:#666">=&lt;/span>&amp;lt;pod&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Verify the runtime hostname for Pods that are Ready but suspected to be
misconfigured:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl &lt;span style="color:#a2f">exec&lt;/span> -n &amp;lt;namespace&amp;gt; &amp;lt;pod&amp;gt; -- hostname
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>If failures correlate with enabling this feature or with Pods using
&lt;code>spec.hostnameOverride&lt;/code>, roll back by disabling the &lt;code>HostnameOverride&lt;/code>
feature gate. Existing Pods are not expected to be disrupted; newly created
Pods will stop using &lt;code>spec.hostnameOverride&lt;/code> while the feature gate is
disabled.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2024-07-18: Initial draft KEP&lt;/li>
&lt;li>2025-08-13: Align KEPs with implemented PRs and documentation.&lt;/li>
&lt;li>2025-10-10: Promote to beta stage&lt;/li>
&lt;li>2026-05-23: Promote to stable (GA) stage&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>This is not a standard Kubernetes use case. The overridden hostname conflicts with the DNS records the Pod would otherwise have, and using it is likely to confuse users. Moreover, it is unclear how much it helps the traditional services that would benefit from being migrated to Kubernetes.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ul>
&lt;li>Configure hostnameOverride via kube-apiserver:
&lt;ul>
&lt;li>If the &lt;code>hostnameOverride&lt;/code> field is set, the kubelet always respects it (otherwise it falls back to the old behavior). In the defaulting or REST logic, if &lt;code>hostnameOverride&lt;/code> is not set, we check &lt;code>hostname&lt;/code>, &lt;code>setHostnameAsFQDN&lt;/code>, and the &lt;code>cluster-suffix&lt;/code>, and write the result into &lt;code>hostnameOverride&lt;/code>. If the user sets the field themselves, we retain it and treat it as an override. This could ultimately simplify the kubelet by removing legacy behavior, but it means teaching the &lt;code>kube-apiserver&lt;/code> about the &lt;code>cluster-suffix&lt;/code>, and it is difficult to find an existing or graceful way to pass the &lt;code>kube-apiserver&lt;/code>’s configuration options into the REST or defaulting logic.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Migrate Legacy Projects:
&lt;ul>
&lt;li>Repair the traditional projects that cannot be migrated to Kubernetes, or find alternatives.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Relax hostname Validation:
&lt;ul>
&lt;li>Do not add new fields. Instead, relax the validation of the &lt;code>hostname&lt;/code> field in &lt;code>podSpec&lt;/code> so that it accepts strings in FQDN format; when &lt;code>hostname&lt;/code> is set to an FQDN, unconditionally ignore the &lt;code>subdomain&lt;/code> and &lt;code>setHostnameAsFQDN&lt;/code> fields. Alternatively, keep the current &lt;code>hostname&lt;/code> and allow the &lt;code>default.svc.cluster.local&lt;/code> part to be overridden or omitted. However, doing so would lose the DNS resolution records for the pod.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Custom setHostnameAsFQDN:
&lt;ul>
&lt;li>Do not add new fields. Instead, allow &lt;code>setHostnameAsFQDN&lt;/code> to be set to &lt;code>Custom&lt;/code> so that the pod&amp;rsquo;s hostname can still meet our expectations. However, since &lt;code>setHostnameAsFQDN&lt;/code> is currently a boolean, changing its type would break the existing API.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Init Container Hostname:
&lt;ul>
&lt;li>We can start a privileged init container and run the command &lt;code>hostname mypod.fqdn.com&lt;/code> in it to set the Pod&amp;rsquo;s hostname to &lt;code>mypod.fqdn.com&lt;/code>. This can achieve the same goal.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Anago to Krel Migration</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/0000/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/0000/</guid><description>
&lt;h1 id="anago-to-krel-migration">Anago to Krel Migration&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#objectives"
>Objectives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#milestones"
>Milestones&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#first-milestone-complete-the-migration-effort"
>First Milestone: Complete the Migration Effort&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#open-issues"
>Open Issues&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#acceptance-criteria"
>Acceptance Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#second-milestone-introduce-krel-stagerelease"
>Second Milestone: Introduce krel stage/release&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#open-issues-1"
>Open Issues&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#risks"
>Risks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#qualitytest-plan"
>Quality/Test Plan&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;p>&lt;em>Moving away from running bash in production in k/release&lt;/em>&lt;/p>
&lt;h2 id="objectives">Objectives&lt;/h2>
&lt;p>This roadmap defines a strategy for achieving two primary goals: migrating
exchangeable bits of bash code within anago to krel and creating a Golang native
replacement for anago.&lt;/p>
&lt;h2 id="milestones">Milestones&lt;/h2>
&lt;ol>
&lt;li>Complete the code migration&lt;/li>
&lt;li>Have a minimum working krel stage&lt;/li>
&lt;li>Have a minimum working krel release&lt;/li>
&lt;li>Remove/swap out Anago in a simple way, after completing the preceding steps&lt;/li>
&lt;/ol>
&lt;p>The scope and implementation details of Milestones 2-4 will become clearer as
work on Milestone 1 proceeds.&lt;/p>
&lt;p>Creating new features for krel is out of scope.&lt;/p>
&lt;h3 id="first-milestone-complete-the-migration-effort">First Milestone: Complete the Migration Effort&lt;/h3>
&lt;p>Anago is still the main bash script running in GCB, and it currently calls out to
krel where necessary. Many parts of the bash-based source code in k/release have
already been ported to krel (Golang), and we remove the bash-based parts from
the repository after each refactoring iteration.&lt;/p>
&lt;p>This milestone focuses on reducing technical debt in k/release by migrating the
remaining bash code into refactored golang-based implementations. This effort
will lead to higher quality and provide a stable foundation for future feature
developments. By “stable,” we mean that making changes will not break the entire
system.&lt;/p>
&lt;p>This migration will not interrupt our ability to cut releases.&lt;/p>
&lt;h4 id="open-issues">Open Issues&lt;/h4>
&lt;p>The list of currently outlined issues, with assignees (release managers) where
established:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Add krel anago subcommand to retrieve the build candidate (TBD)&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/release/issues/1536"
target="_blank" rel="noopener">https://github.com/kubernetes/release/issues/1536&lt;/a>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Introduce krel anago subcommand to update GitHub release&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/release/issues/1534"
target="_blank" rel="noopener">https://github.com/kubernetes/release/issues/1534&lt;/a>
(@xmudrii)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finish-up krel push&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/release/issues/1459"
target="_blank" rel="noopener">https://github.com/kubernetes/release/issues/1459&lt;/a>
(@saschagrunert)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Introduce krel subcommand for pushing git objects&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/release/issues/1446"
target="_blank" rel="noopener">https://github.com/kubernetes/release/issues/1446&lt;/a>
(TBD)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>All four issues can be worked on in parallel. This is not a comprehensive list:
there are still parts of Anago that can be ported from bash and that are not yet
part of any issue.&lt;/p>
&lt;h4 id="acceptance-criteria">Acceptance Criteria&lt;/h4>
&lt;ul>
&lt;li>All issues currently open will be resolved
(&lt;a href="https://github.com/kubernetes/release/issues/1534"
target="_blank" rel="noopener">#1534&lt;/a>
,
&lt;a href="https://github.com/kubernetes/release/issues/1536"
target="_blank" rel="noopener">#1536&lt;/a>
,
&lt;a href="https://github.com/kubernetes/release/issues/1446"
target="_blank" rel="noopener">#1446&lt;/a>
,
&lt;a href="https://github.com/kubernetes/release/issues/1459"
target="_blank" rel="noopener">#1459&lt;/a>
)&lt;/li>
&lt;li>New code is unit-tested and code-reviewed (logical paths, not line coverage)&lt;/li>
&lt;li>Direct use of the new Golang source code in production&lt;/li>
&lt;/ul>
&lt;h3 id="second-milestone-introduce-krel-stagerelease">Second Milestone: Introduce krel stage/release&lt;/h3>
&lt;p>In parallel with the ongoing migration (first milestone), we will introduce new
krel stage and krel release subcommands. The plan is to re-evaluate the current
functionality within anago and build a declarative approach to cutting releases.
We can reuse the already migrated parts and use the existing logic in
anago as guidance for the necessary feature set of krel stage/release.&lt;/p>
&lt;h4 id="open-issues-1">Open Issues&lt;/h4>
&lt;p>The list of currently outlined issues, with assignees (release managers) where
established:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Evaluate possible krel stage/release subcommands&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/release/issues/1551"
target="_blank" rel="noopener">https://github.com/kubernetes/release/issues/1551&lt;/a>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="risks">Risks&lt;/h2>
&lt;p>The highest risk during the migration is that we end up in a state where we
break the current functionality, which would mean that we can no longer build
releases. Fixing breakages immediately and testing incrementally between
releases should minimize this risk.&lt;/p>
&lt;h2 id="qualitytest-plan">Quality/Test Plan&lt;/h2>
&lt;p>Merge changes to the main branch from user fork/branch as per normal community
PR process. Feature branches will not be used.&lt;/p>
&lt;p>Anago-replacement features must be behind a feature gate, initially ensuring
they are only run in ‘mock’ mode.&lt;/p>
&lt;p>Merged features can be tested in production at any time so long as they are only
triggered from a mock stage or mock release or mock notify.&lt;/p>
&lt;p>Non-mock testing will occur only during a release cycle’s alpha period. This
provides the first opportunities to test non-mock paths in Sep/Oct 2020 and again in
Jan/Feb 2021. Beyond Feb 2021, we will need to re-evaluate testing based on
future circumstances.&lt;/p>
&lt;p>During alpha periods we can A/B test, e.g., build alpha.1 with Anago and
immediately after build alpha.2 with krel. Compare the results.&lt;/p></description></item><item><title>Resources: API gzip compression support</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2338/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2338/</guid><description>
&lt;h1 id="graduate-api-gzip-compression-to-ga">Graduate API gzip compression to GA&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#116"
>1.16&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#117"
>1.17&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-details"
>Implementation Details&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Kubernetes sometimes returns extremely large responses to clients outside of its local network, resulting in long delays for components that integrate with the cluster in the list/watch controller pattern. Kubernetes should properly support transparent gzip response encoding, while ensuring that the performance of the cluster does not regress for small requests.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>In large Kubernetes clusters the size of protobuf or JSON responses may exceed hundreds of megabytes, and clients that are not on fast local networks or colocated with the master may experience bandwidth and/or latency issues attempting to synchronize their state with the server (in the case of custom controllers). Many HTTP servers and clients support transparent compression by use of the &lt;code>Accept-Encoding&lt;/code> header, and support for gzip can reduce total bandwidth requirements for integrating with Kubernetes clusters for JSON by up to 10x and for protobuf up to 8x.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Allow standard HTTP transparent &lt;code>Accept-Encoding: gzip&lt;/code> behavior to work for large Kubernetes API requests, without impacting existing Go language clients (which are already sending that header) or causing a performance regression on the Kubernetes apiservers due to the additional CPU necessary to compress small requests.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Support other compression formats like Snappy due to limited client support&lt;/li>
&lt;li>Compress non-API responses&lt;/li>
&lt;li>Compress watch responses&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="116">1.16&lt;/h3>
&lt;ul>
&lt;li>Update the existing incomplete alpha API compression to:
&lt;ul>
&lt;li>Only occur on API requests&lt;/li>
&lt;li>Only occur on very large responses (&amp;gt;128KB)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Promote to beta and enable by default since this is a standard feature of HTTP servers
&lt;ul>
&lt;li>Test at large scale to mitigate risk of regression, tune as necessary&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="117">1.17&lt;/h3>
&lt;ul>
&lt;li>Promote to GA&lt;/li>
&lt;/ul>
&lt;h3 id="implementation-details">Implementation Details&lt;/h3>
&lt;p>Kubernetes has had an alpha implementation of transparent gzip encoding since 1.7. However, this
implementation was never graduated because it caused client misbehavior and the issues were not resolved.&lt;/p>
&lt;p>A review of the code showed that the prior implementation attempted to globally
provide transparent compression as an HTTP middleware component at a much higher level than was necessary.
The bugs that prevented enablement involved double compression of nested responses and failures to
correctly handle flushing of lower level primitives. We do not need to gzip-compress all HTTP endpoints
served by the Kubernetes API server (such as watch requests, exec requests, and OpenAPI endpoints, which provide
their own compression). Our implementation can satisfy its goal of reducing latency for large requests if
we narrowly scope compression to only those endpoints that need it.&lt;/p>
&lt;p>A further complexity is that the standard Go client library (which Kubernetes has leveraged since 1.0)
always requests compression. Performance testing showed that enabling compression for all suitable
API responses (objects returned via GET, LIST, UPDATE, PATCH) caused a significant performance regression
in both CPU usage (2x) and tail latency (2-5x) on the Kubernetes apiservers. This is due to the additional
CPU costs for performing compression, which impacts tail latency of small requests due to increased
apiserver load. Since forcing all clients in the ecosystem to disable transparent compression by default
is impractical and cannot be done in a gradual manner, we need to apply a more suitable heuristic than
&amp;ldquo;did the client request transparent compression&amp;rdquo;. According to the HTTP spec, a server may ignore an
&lt;code>Accept-Encoding&lt;/code> header for any reason, which means we decide &lt;em>when&lt;/em> we want to compress, not just
whether we compress.&lt;/p>
&lt;p>The preferred approach is to only compress responses returned by the API server when encoding objects
that are large enough for compression to benefit the client but not unduly burden the server. In general,
the target of this optimization is extremely large LIST responses which are usually multiple megabytes
in size. These requests are infrequent (&amp;lt;1% of all reads) and when network bandwidth is lower than typical
datacenter speeds (1 GBps) the benefit in reduced latency for clients outweighs the slightly higher CPU
cost for compression.&lt;/p>
&lt;p>We experimentally determined that a compression size cut-off of 128KB caused no regression on the Kubernetes
density and load tests in either 99th percentile latency or kube-apiserver CPU usage. 128KB is
roughly the size of 50 average pods (2.2KB each, measured on a large Kubernetes cluster with a diverse workload). This
implementation applies this specific heuristic to the place in the Kubernetes code path where we encode
the body of a response from a single input &lt;code>[]byte&lt;/code> buffer due to how Kubernetes encodes and manages
responses, which removes the side-effects and unanticipated complexity in the prior implementation.&lt;/p>
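&lt;p>The heuristic above can be sketched in Go as follows. This is a minimal illustration, not the apiserver code: the names &lt;code>minCompressSize&lt;/code>, &lt;code>acceptsGzip&lt;/code>, &lt;code>shouldCompress&lt;/code> and &lt;code>maybeCompress&lt;/code> are hypothetical, q-values in &lt;code>Accept-Encoding&lt;/code> are ignored, and treating the cut-off as a strict "greater than" is an assumption.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-go" data-lang="go">package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// minCompressSize is the 128KB cut-off described above.
const minCompressSize = 128 * 1024

// acceptsGzip reports whether an Accept-Encoding header value lists gzip.
// Simplification (assumption): q-values such as "gzip;q=0" are ignored.
func acceptsGzip(acceptEncoding string) bool {
	for _, enc := range strings.Split(acceptEncoding, ",") {
		name := strings.TrimSpace(strings.SplitN(enc, ";", 2)[0])
		if strings.EqualFold(name, "gzip") {
			return true
		}
	}
	return false
}

// shouldCompress applies both conditions: the client asked for gzip and
// the already-encoded body is larger than the cut-off.
func shouldCompress(size int, acceptEncoding string) bool {
	if !acceptsGzip(acceptEncoding) {
		return false
	}
	return size > minCompressSize
}

// maybeCompress takes the fully encoded response body as a single []byte
// and returns the bytes to send plus whether they were gzipped.
func maybeCompress(body []byte, acceptEncoding string) ([]byte, bool, error) {
	if !shouldCompress(len(body), acceptEncoding) {
		return body, false, nil
	}
	buf := new(bytes.Buffer)
	zw := gzip.NewWriter(buf)
	if _, err := zw.Write(body); err != nil {
		return nil, false, err
	}
	if err := zw.Close(); err != nil {
		return nil, false, err
	}
	return buf.Bytes(), true, nil
}

func main() {
	small := []byte(`{"kind":"Pod"}`)
	large := bytes.Repeat(small, 20000) // 280,000 bytes, above the cut-off
	for _, body := range [][]byte{small, large} {
		out, compressed, err := maybeCompress(body, "gzip, deflate")
		if err != nil {
			panic(err)
		}
		fmt.Printf("in=%d out=%d compressed=%t\n", len(body), len(out), compressed)
	}
}
&lt;/code>&lt;/pre>
&lt;p>Because the decision is made on one complete buffer, there is no streaming or flushing state to manage, which is the property the paragraph above relies on.&lt;/p>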
&lt;p>Given that this is standard HTTP server behavior and can easily be tested with unit, integration, and
our complete end-to-end test suite (due to all of our clients already requesting gzip compression),
there is minimal risk in rolling this out directly to GA. We suggest preserving the feature gate so that
an operator can disable this behavior if they experience a regression in highly-tuned large-scale deployments.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The primary risk is that an operator running Kubernetes very close to the latency and tolerance limits
on a very large and overloaded Kubernetes apiserver who runs an unusually high percentage of large
LIST queries on high bandwidth networks would experience higher CPU use that causes them to hit a CPU
limit. In practice, the cost of gzip grows sublinearly relative to the memory and CPU costs of Go memory
allocation on very large serialization and deserialization, so we judge this unlikely. However,
to give administrators an opportunity to react, we would preserve the feature gate and allow it to be
disabled until 1.17.&lt;/p>
&lt;p>Some clients may be requesting gzip and not be correctly handling gzipped responses. An effort should
be made to educate client authors that this change is coming, but in general we do not consider
incorrect client implementations to block implementation of standard HTTP features. The easy mitigation
for many clients is to disable sending &lt;code>Accept-Encoding&lt;/code> (Go is unusual in providing automatic
transparent compression in the client ecosystem - many client libraries still require opt-in behavior).&lt;/p>
&lt;h2 id="graduation-criteria">Graduation Criteria&lt;/h2>
&lt;p>Transparent compression must be implemented in the more focused fashion described in this KEP. The
scalability sig must sign off that the chosen limit (128KB) does not cause a regression in 5000 node
clusters, which may cause us to revise the limit up.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>1.7 Kubernetes added alpha implementation behind disabled flag&lt;/li>
&lt;li>Updated proposal with more scoped implementation for Beta in 1.16 that addresses prior issues&lt;/li>
&lt;/ul></description></item><item><title>Resources: API Server Network Proxy</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1281/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1281/</guid><description>
&lt;h1 id="api-server-network-proxy">API Server Network Proxy&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#definitions"
>Definitions&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#network-context"
>Network Context&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proxy-grpc-definition"
>Proxy gRPC definition&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#konnectivity-server"
>Konnectivity Server&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#direct-connection"
>Direct Connection&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#kubernetes-api-server-outbound-requests"
>Kubernetes API Server Outbound Requests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#testing-the-solution"
>Testing the Solution&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#security"
>Security&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-detailsnotesconstraints"
>Implementation Details/Notes/Constraints&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#combined-control-plane-and-node-network"
>Combined Control Plane and Node Network&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#control-plane-and-untrusted-node-network"
>Control Plane and Untrusted Node Network&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#control-plane-and-node-networks-which-are-not-ip-routable"
>Control Plane and Node Networks which are not IP Routable&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#better-monitoring"
>Better Monitoring&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives-optional"
>Alternatives [optional]&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed [optional]&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>We will build an extensible system which controls network traffic from the Kube API Server.
We will add a traffic egress or network proxy system. The KAS can be configured to send traffic
(or not) to one or more of the proxies. Users can drop in custom proxies if the
default behavior is insufficient.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Historically, Kubernetes used &lt;a href="https://github.com/kubernetes/kubernetes/issues/54076"
target="_blank" rel="noopener">SSH tunnels&lt;/a>
, but they only
functioned on GCE; they were deprecated in 1.9 and &lt;a href="https://github.com/kubernetes/kubernetes/pull/102297"
target="_blank" rel="noopener">removed in
1.22&lt;/a>
.&lt;/p>
&lt;p>In retrospect, having an explicit level of indirection that separates user-initiated network traffic from API
server-initiated traffic is a useful concept.
Cloud providers want to control how API server to pod, node and service network traffic is implemented.
Cloud providers may choose to run their API server (control network) and the cluster nodes (cluster network)
on isolated networks. The control and cluster networks may have overlapping IP addresses.
Therefore they require a non-IP routing proxy layer (SSH tunnels are an example).
Adding this layer enables metadata audit logging. It allows validation of outgoing API server connections.
Structuring the API server in this way is a forcing function for keeping architectural layering violations out of apiserver.
In combination with a firewall, this separation of networks protects against security concerns such as
&lt;a href="https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ"
target="_blank" rel="noopener">Security Impact of Kubernetes API server external IP address proxying&lt;/a>
.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Delete the SSH Tunnel/Node Dialer code from Kube APIServer.
Enable admins to mitigate &lt;a href="https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ"
target="_blank" rel="noopener">https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ&lt;/a>
.
Allow isolation of the Control network from the Cluster network.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>Build a general purpose Proxy which does everything. (Users should build their own
custom proxies with the desired behavior, based on the provided proxy.)
Handle requests from the Cluster to the Control Plane. (The proxy can be extended to
do this. However that is left to the User if they want that behavior.)&lt;/p>
&lt;h2 id="definitions">Definitions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Control Plane Network&lt;/strong> An IP reachable network space which contains the control plane components, such as Kubernetes API Server,
Connectivity Proxy and ETCD server.&lt;/li>
&lt;li>&lt;strong>Node Network&lt;/strong> An IP reachable network space which contains all the cluster's Nodes, for alpha.
Worth noting that the Node Network may be completely disjoint from the Control Plane network.
It may have overlapping IP addresses to the Control Plane Network or other means of network isolation.
Direct IP routability between cluster and control plane networks should not be assumed.
Later versions may relax the all-node requirement to only some nodes.&lt;/li>
&lt;li>&lt;strong>KAS&lt;/strong> Kubernetes API Server, responsible for serving the Kubernetes API to clients.&lt;/li>
&lt;li>&lt;strong>KMS&lt;/strong> Key Management Service, plugins for secrets encryption key management&lt;/li>
&lt;li>&lt;strong>Egress Selector&lt;/strong> A component built into the KAS which provides a golang dialer for outgoing connection requests.
The dialer provided depends on NetworkContext information.&lt;/li>
&lt;li>&lt;strong>Konnectivity Server&lt;/strong> The proxy server which runs in the control plane network.
It has a secure channel established to the cluster network.
It could work on either a gRPC or HTTP Connect mechanism.
If the former, it would expose a gRPC interface to the KAS to provide connectivity service.
If the latter, it would use standard HTTP Connect.
Formerly known as the Network Proxy Server.&lt;/li>
&lt;li>&lt;strong>Konnectivity Agent&lt;/strong> A proxy agent which runs in the node network for
establishing the tunnel.
Formerly known as the Network Proxy Agent.&lt;/li>
&lt;li>&lt;strong>Flat Network&lt;/strong> A network where traffic can be successfully routed using IP.
Implies no overlapping (i.e. shared) IPs on the network.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We will run a connectivity server inside the control plane network.
It could work on either a HTTP Connect mechanism or gRPC.
For the alpha version we will attempt to get this working with HTTP Connect.
We will evaluate HTTP Connect for scalability, error handling and traffic types.
For scalability we will be looking at the number of required open connections.
Increasing usage of webhooks means we need better than 1 request per connection (multiplexing).
We also need the tunnel to be tolerant of errors in the requests it is transporting.
HTTP Connect only supports HTTP requests and not things like DNS requests.
We assume that for HTTP URL requests, it will be the proxy which does the DNS lookup.
However, this means that we cannot have the KAS perform a DNS request and then make a follow-on request.
If no issues are found with HTTP Connect in these areas we will proceed with it.
If an issue is found then we will update the KEP and switch the client to the gRPC solution.
This should be as simple as switching the connection mode of the client code.&lt;/p>
&lt;p>It may be desirable to allow out of band data (metadata) to be transmitted from the KAS to the Proxy Server.
We expect to handle metadata in the HTTP Connect case using http &amp;lsquo;X&amp;rsquo; headers on the Connect request.
This means that the metadata can only be sent when establishing a KAS to Proxy tunnel.
For the gRPC case we just update the interface to the KAS.
In this case the metadata can be sent even during tunnel usage.&lt;/p>
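&lt;p>As a rough sketch of the HTTP Connect case, the Go snippet below renders a CONNECT request whose metadata travels as extra request headers. The function &lt;code>buildConnectRequest&lt;/code>, the header name &lt;code>X-Example-Requester&lt;/code> and the target address are hypothetical; the KEP does not define specific header names.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-go" data-lang="go">package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildConnectRequest renders an HTTP/1.1 CONNECT request for target
// (host:port). Metadata is attached as request headers, so it can only be
// sent once, when the tunnel is being established.
func buildConnectRequest(target string, metadata map[string]string) string {
	b := new(strings.Builder)
	fmt.Fprintf(b, "CONNECT %s HTTP/1.1\r\n", target)
	fmt.Fprintf(b, "Host: %s\r\n", target)

	// Sort header names so the output is deterministic.
	keys := make([]string, 0, len(metadata))
	for k := range metadata {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(b, "%s: %s\r\n", k, metadata[k])
	}
	b.WriteString("\r\n") // blank line ends the request head
	return b.String()
}

func main() {
	req := buildConnectRequest("10.0.0.7:10250", map[string]string{
		"X-Example-Requester": "kube-apiserver", // hypothetical metadata header
	})
	fmt.Printf("%q\n", req)
}
&lt;/code>&lt;/pre>
&lt;p>After the proxy answers the CONNECT request, the connection carries opaque bytes, which is why further metadata cannot be added later in this mode.&lt;/p>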
&lt;p>Each connectivity proxy allows secure connections to one or more cluster networks.
Any network addressed by a connectivity proxy must be flat.
Currently the only mechanism for handling overlapping IP ranges in Kubernetes is the Proxy.
Routing non-IP-routable traffic past the proxy would require a non-Kubernetes mechanism.&lt;/p>
&lt;p>Running the connectivity proxy in a separate process has a few advantages.&lt;/p>
&lt;ul>
&lt;li>The connectivity proxy can be extended without recompiling the KAS.
Administrators can run their own variants of the connectivity proxy.&lt;/li>
&lt;li>Traffic can be audited or forwarded (e.g. via a proprietary VPN) using a custom connectivity proxy.&lt;/li>
&lt;li>The separation removes control plane &amp;lt;-&amp;gt; cluster connectivity concerns from the KAS.&lt;/li>
&lt;li>The code and responsibility separation lowers the complexity of the KAS code base.&lt;/li>
&lt;li>The separation reduces the impact on the KAS of issues such as crashes in the connectivity proxy.
Connectivity issues will not stop the KAS from serving API requests.
This is important as serving API requests may be necessary in order to fix the crashes.
A problem with a node, set of nodes or load-balancers configuration, may be fixed with API requests.&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-api-machinery/1281-network-proxy/NetworkProxySimpleCluster.png" alt="API Server Network Proxy Simple Cluster">
The diagram shows API Server’s outgoing traffic flow.
The user (in blue box), control plane network (in purple cloud) and
a cluster network (in green cloud) are represented.&lt;/p>
&lt;p>The user (blue) initiates communication to the KAS.
The KAS then initiates connections to other components.
It could be node/pod/service in cluster networks (red dotted arrow to green cloud),
or etcd for storage in the same control plane network (blue arrow) or mutate the request
based on an admission web-hook (red dotted arrow to purple cloud).
The KAS handles these cases based on NetworkContext based traffic routing.
The connectivity proxy should be able to do routing solely based on IP.
The proxy should not require the NetworkContext. This means the service CIDR,
node CIDR and pod CIDR of each cluster network cannot overlap.&lt;/p>
&lt;h3 id="network-context">Network Context&lt;/h3>
&lt;p>The minimal NetworkContext looks like the following struct in golang:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> EgressType &lt;span style="color:#0b0;font-weight:bold">int&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ControlPlane is the EgressType for traffic intended to go to the control plane.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ControlPlane EgressType = &lt;span style="color:#a2f;font-weight:bold">iota&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Etcd is the EgressType for traffic intended to go to Kubernetes persistence store.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Etcd
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Cluster is the EgressType for traffic intended to go to the system being managed by Kubernetes.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Cluster
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// NetworkContext is the struct used by Kubernetes API Server to indicate where it intends traffic to be sent.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> NetworkContext &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// EgressSelectionName is the unique name of the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// EgressSelectorConfiguration which determines&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the network we route the traffic to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> EgressSelectionName EgressType
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>EgressSelectionName specifies the network to route traffic to.
The KAS starts with a list of registered konnectivity service names. These
correspond to networks we route traffic to. So the KAS knows where to
proxy the traffic to, otherwise it returns an “Unknown network” error.&lt;/p>
&lt;p>The KAS starts with a proxy configuration like the below example.
The example specifies 3 networks. &amp;ldquo;direct&amp;rdquo; specifies the KAS talking directly
on the local network (no proxy). &amp;ldquo;controlplane&amp;rdquo; specifies the KAS talks to a proxy
listening at 1.2.3.4:5678. &amp;ldquo;cluster&amp;rdquo; specifies the KAS talks to a proxy
listening at 1.2.3.5:5679. While these are represented as resources
they are not intended to be loaded dynamically. The names are not case
sensitive. The KAS loads this resource list as a configuration at start time.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>apiserver.k8s.io/v1alpha1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>EgressSelectorConfiguration&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">egressSelections&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>direct&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">connection&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>direct&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>controlplane&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">connection&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>grpc&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">url&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>grpc://1.2.3.4:5678&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">caBundle&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>file1.pem&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">clientKeyFile&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>proxy-client1.key&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">clientCertFile&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>proxy-client1.crt&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>cluster&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">connection&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">type&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>grpc&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">url&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>grpc://1.2.3.5:5679&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">caBundle&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>file2.pem&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">clientKeyFile&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>proxy-client2.key&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">clientCertFile&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>proxy-client2.crt&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>NetworkContext could be extended to contain more contextual information.
This would allow smarter routing based on the k8s object KAS is processing
or which user/tenant tries to initiate the request, etc.&lt;/p>
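&lt;p>The name-based lookup described in this section can be sketched in Go as below. It assumes a hypothetical &lt;code>egressSelector&lt;/code> type that maps lower-cased names to dial functions, uses the default &lt;code>net.Dialer&lt;/code> for &amp;ldquo;direct&amp;rdquo;, and stubs out the proxied dialer; none of these names come from the KAS code base.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-go" data-lang="go">package main

import (
	"context"
	"fmt"
	"net"
	"strings"
)

// dialFunc matches the signature of (*net.Dialer).DialContext.
type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// egressSelector maps egress selection names to dialers.
// Names are stored lower-cased because they are not case sensitive.
type egressSelector struct {
	dialers map[string]dialFunc
}

func newEgressSelector() egressSelector {
	return egressSelector{dialers: map[string]dialFunc{}}
}

// register records the dialer to use for a named network.
func (s egressSelector) register(name string, d dialFunc) {
	s.dialers[strings.ToLower(name)] = d
}

// lookup returns the dialer for a name, or an "Unknown network" error.
func (s egressSelector) lookup(name string) (dialFunc, error) {
	d, ok := s.dialers[strings.ToLower(name)]
	if !ok {
		return nil, fmt.Errorf("Unknown network %q", name)
	}
	return d, nil
}

func main() {
	s := newEgressSelector()

	// "direct" uses the default dialer, with no proxy in between.
	direct := new(net.Dialer)
	s.register("direct", direct.DialContext)

	// A proxied entry would tunnel through the konnectivity server.
	// Hypothetical stub: it always fails in this sketch.
	s.register("cluster", func(ctx context.Context, network, addr string) (net.Conn, error) {
		return nil, fmt.Errorf("proxy dialing is not implemented in this sketch")
	})

	for _, name := range []string{"Direct", "cluster", "tenant-a"} {
		_, err := s.lookup(name)
		fmt.Printf("%s: err=%v\n", name, err)
	}
}
&lt;/code>&lt;/pre>
&lt;p>Since the configuration is loaded once at start time, a plain map without locking is sufficient for this illustration.&lt;/p>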
&lt;h3 id="proxy-grpc-definition">Proxy gRPC definition&lt;/h3>
&lt;p>To serve a proxy request, one gRPC bidirectional stream is created on the proxy
server. Each TCP connection maps 1:1 to a gRPC stream, so the state of the TCP
connection is exactly the same as the gRPC stream state.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-grpc" data-lang="grpc">syntax = &amp;#34;proto3&amp;#34;;
service ProxyService {
// Proxy a TCP connection to a remote address defined by ConnectParam.
// The ConnectParam is defined in metadata under key &amp;#34;x-kube-net-proxy&amp;#34;.
// metadata[&amp;#34;x-kube-net-proxy&amp;#34;] = base64.Encode(proto.Marshal(connectOptions))
rpc Proxy(stream Payload) returns (stream Payload) {}
}
// ConnectOptions defines the remote TCP endpoint to connect
message ConnectOptions {
string remote_addr = 1; // remote address to connect to. e.g. 8.8.8.8:53
}
// Payload defines a TCP payload.
message Payload {
bytes data = 1;
}
&lt;/code>&lt;/pre>&lt;h3 id="konnectivity-server">Konnectivity Server&lt;/h3>
&lt;p>The Konnectivity Server (connectivity proxy(s)) can run in the same container as the KAS.
It should run on the same machine and must run in the same flat network as the KAS.
It listens on a port for gRPC connections from the KAS.
This port would be for forwarding traffic to the appropriate cluster.
It should have an administrative port speaking https.
The administrative port serves metrics and (optional) debug/pprof handlers.
It should have a health check port, serving liveness and readiness probes.
The liveness probe prevents a partially broken cluster
where the KAS cannot connect to the cluster.
The readiness probe indicates that at least one Konnectivity Agent is connected.&lt;/p>
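&lt;p>A minimal Go sketch of the two probes follows. The endpoint paths &lt;code>/healthz&lt;/code> and &lt;code>/readyz&lt;/code>, the &lt;code>agentCount&lt;/code> callback and the function names are assumptions made for illustration; only the rule that readiness requires at least one connected Konnectivity Agent comes from the text above.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-go" data-lang="go">package main

import (
	"fmt"
	"net/http"
)

// readinessStatus returns the HTTP status for the readiness probe:
// ready only when at least one Konnectivity Agent is connected.
func readinessStatus(connectedAgents int) int {
	if connectedAgents > 0 {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// newHealthMux wires the liveness and readiness endpoints. agentCount is a
// hypothetical callback that reports the number of connected agents.
func newHealthMux(agentCount func() int) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Liveness: the process is up and able to answer requests.
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(readinessStatus(agentCount()))
	})
	return mux
}

func main() {
	// No agents connected versus two agents connected.
	fmt.Println(readinessStatus(0), readinessStatus(2)) // prints "503 200"

	// The mux would be served on the health check port in a real server.
	_ = newHealthMux(func() int { return 0 })
}
&lt;/code>&lt;/pre>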
&lt;h3 id="direct-connection">Direct Connection&lt;/h3>
&lt;p>This connection type uses the default dialer.
This allows use of the connectivity service without the connectivity proxy.
This is a quick way to run the system in a “legacy” or fallback mode.
Simple clusters (not needing network segregation) run this way to avoid the overhead
(in latency or configuration) of the connectivity proxy.&lt;/p>
&lt;h3 id="kubernetes-api-server-outbound-requests">Kubernetes API Server Outbound Requests&lt;/h3>
&lt;p>The majority of the KAS communication originates from incoming requests.
Here we cover the outgoing requests. This is our understanding of those requests
and some details as to how they fit in this model. For the alpha release we
support &amp;lsquo;controlplane&amp;rsquo;, &amp;rsquo;etcd&amp;rsquo; and &amp;lsquo;cluster&amp;rsquo; connectivity service names.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>ETCD&lt;/strong> It is possible to make etcd talk via the proxy.
The etcd client takes a transport.
(&lt;a href="https://github.com/etcd-io/etcd/blob/main/client/internal/v2/client.go#L101"
target="_blank" rel="noopener">https://github.com/etcd-io/etcd/blob/main/client/internal/v2/client.go#L101&lt;/a>
)
We will add configuration as to which proxy an etcd client should use.
(&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/config.go"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/storagebackend/config.go&lt;/a>
)
This will add an extra process hop to our main scaling axis.
We will scale test the impact and publish the results. As a precaution
we will add an extra network configuration &amp;rsquo;etcd&amp;rsquo; separate from ‘controlplane’.
Etcd requests can be configured separately from the rest of &amp;lsquo;controlplane&amp;rsquo;.&lt;/li>
&lt;li>&lt;strong>Pods/Exec&lt;/strong>, &lt;strong>Pods/Proxy&lt;/strong>, &lt;strong>Pods/Portforward&lt;/strong>, &lt;strong>Pods/Attach&lt;/strong>, &lt;strong>Pods/Log&lt;/strong>
Pod requests (and pod sub-resource requests) are meant for the cluster
and will be routed based on the ‘cluster’ NetworkContext.&lt;/li>
&lt;li>&lt;strong>Nodes/Proxy&lt;/strong>
Node requests (and node sub-resource requests) are meant for the cluster
and will be routed based on the ‘cluster’ NetworkContext.&lt;/li>
&lt;li>&lt;strong>Services/Proxy&lt;/strong>
Service requests (and service sub-resource requests) are meant for the cluster
and will be routed based on the ‘cluster’ NetworkContext.&lt;/li>
&lt;li>&lt;strong>Admission Webhooks&lt;/strong>
Admission webhooks can be destined for a service or a URL.
If destined for a service then the service rules apply (send to &amp;lsquo;cluster&amp;rsquo;).
If destined for a URL then we will use the ‘controlplane’ NetworkContext.&lt;/li>
&lt;li>&lt;strong>Aggregated API Server (and OpenAPI requests for aggregated resources)&lt;/strong>
Aggregated API Servers can be destined for a service.
If destined for a service then the service rules apply.&lt;/li>
&lt;li>&lt;strong>Authentication, Authorization and Audit Webhooks&lt;/strong>
These Webhooks use a kube config file to determine destination.
Given that, we use the ‘controlplane’ NetworkContext.&lt;/li>
&lt;/ul>
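&lt;p>The routing rules in the list above can be summarized as a small Go function. This is only a sketch: the &lt;code>kind&lt;/code> labels and the &lt;code>egressFor&lt;/code> function are hypothetical, and the mapping of aggregated API servers to &amp;lsquo;cluster&amp;rsquo; assumes they are reached through a service.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-go" data-lang="go">package main

import "fmt"

// egressFor returns the egress selection name for an outbound KAS request.
// kind is a hypothetical label for the request category. viaService only
// matters for admission webhooks, which may target a service or a URL.
func egressFor(kind string, viaService bool) (string, error) {
	switch kind {
	case "etcd":
		return "etcd", nil
	case "pods", "nodes", "services":
		// Includes sub-resources such as exec, proxy, portforward, attach, log.
		return "cluster", nil
	case "aggregated-api":
		// Assumption: destined for a service, so the service rules apply.
		return "cluster", nil
	case "admission-webhook":
		if viaService {
			return "cluster", nil
		}
		return "controlplane", nil
	case "authn-webhook", "authz-webhook", "audit-webhook":
		return "controlplane", nil
	}
	return "", fmt.Errorf("no egress rule for %q", kind)
}

func main() {
	kinds := []string{"etcd", "pods", "admission-webhook", "audit-webhook"}
	for _, kind := range kinds {
		name, err := egressFor(kind, false)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", kind, name)
	}
}
&lt;/code>&lt;/pre>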
&lt;p>&lt;strong>Note&lt;/strong>: KMS is also an egress endpoint but will not be covered as egress since it only supports a &lt;a href="https://github.com/kubernetes/kubernetes/blob/e8bc121341807f9e33a076f6725b1b1a18d75ba0/staging/src/k8s.io/apiserver/pkg/storage/value/encrypt/envelope/grpc_service.go#L74"
target="_blank" rel="noopener">Dialer&lt;/a>
using unix domain sockets (UDS). This is used for communicating between processes running on the same host. In the future, we may consider adding egressSelector support if KMS accepts other protocols.&lt;/p>
&lt;h3 id="testing-the-solution">Testing the Solution&lt;/h3>
&lt;p>We will test using a network namespace to partition the KAS from the test nodes.
It is then impossible to connect directly from the KAS to the test nodes.
This ensures that the proxy must be used for logs, exec, port forward, aggregation and webhooks.
We run with this configuration and a direct configuration for these specific features.
This ensures that the solution works and will continue to work.&lt;/p>
&lt;h3 id="security">Security&lt;/h3>
&lt;p>One motivation for network proxy is providing a mechanism to secure
&lt;a href="https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ"
target="_blank" rel="noopener">https://groups.google.com/d/msg/kubernetes-security-announce/tyd-MVR-tY4/tyREP9-qAwAJ&lt;/a>
.
This, in conjunction with a firewall or other network isolation, fixes the security concern.&lt;/p>
&lt;h3 id="implementation-detailsnotesconstraints">Implementation Details/Notes/Constraints&lt;/h3>
&lt;p>You may want to check the original design doc for alternatives and futures considered. &lt;a href="https://goo.gl/qiARUK"
target="_blank" rel="noopener">https://goo.gl/qiARUK&lt;/a>
.
Please make sure you are a member of &lt;a href="mailto:kubernetes-dev@googlegroups.com"
>kubernetes-dev@googlegroups.com&lt;/a>
to view the doc.
It is also worth looking at &lt;a href="https://github.com/kubernetes-sigs/apiserver-network-proxy"
target="_blank" rel="noopener">https://github.com/kubernetes-sigs/apiserver-network-proxy&lt;/a>
as it contains the reference
implementation of the API Server Network Proxy.&lt;/p>
&lt;h2 id="user-stories">User Stories&lt;/h2>
&lt;h4 id="combined-control-plane-and-node-network">Combined Control Plane and Node Network&lt;/h4>
&lt;p>Customers can run a cluster which combines the control plane and cluster networks.
They configure all their connectivity configuration to direct.
This bypasses the proxy and optimizes the performance. For a customer with no
security concerns about a combined network, this is a simple, straightforward configuration.&lt;/p>
&lt;h4 id="control-plane-and-untrusted-node-network">Control Plane and Untrusted Node Network&lt;/h4>
&lt;p>A customer may want to isolate their control plane from their cluster network. This may be a
simple separation of concerns or due to something like running untrusted workloads on
the cluster network. Placing a firewall between the control plane and
cluster networks accomplishes this. A few ports for the KAS public port and Proxy public port
are opened between these networks. Separation of concerns minimizes the
accidental interactions between the control plane and cluster networks. It also keeps bandwidth
consumption on the cluster network from negatively impacting the control plane. The
combination of firewall and proxy minimizes the interaction between the networks to
a set which can be more easily reasoned about, checked and monitored.&lt;/p>
&lt;h4 id="control-plane-and-node-networks-which-are-not-ip-routable">Control Plane and Node Networks which are not IP Routable&lt;/h4>
&lt;p>If control plane and cluster network CIDRs are not controlled by the same entity, then they
can end up having conflicting IP CIDRs. Traffic cannot be routed between
them based strictly on IP address. The connection proxy solves this issue.
It also solves connectivity using a VPN tunnel. The proxy offloads the work of sending traffic
to the cluster network from the KAS. The proxy gives us extensibility.&lt;/p>
&lt;h4 id="better-monitoring">Better Monitoring&lt;/h4>
&lt;p>Instrumenting the network proxy requests with out of band data
(e.g. requester identity/tradition context) enables a Proxy to
provide increased monitoring of Control Plane originated requests.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The primary risk of this solution would seem to be some portion of the proxy or agent failing.
For existing clusters which do not depend on SSH Tunnels or any of the new functionality, the
mitigation would be to set all networks to direct. This should bypass the proxy and allow
the system to work as it does today. For anyone using SSH Tunnels we are planning to support
both for several releases.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>The primary test plan is to set up a network namespace with a firewall dividing the control plane and cluster
networks. We then run the existing tests for logs, proxy and portforward to ensure the
routing works correctly. It should work with the correct configuration and fail correctly
with a direct configuration. Normal tests would be run with the direct
configuration to ensure the mitigation is working correctly.&lt;/p>
&lt;p>Please adhere to the &lt;a href="https://git.k8s.io/community/contributors/devel/sig-testing/testing.md"
target="_blank" rel="noopener">Kubernetes testing guidelines&lt;/a>
when drafting this test plan.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>Alpha:&lt;/p>
&lt;ul>
&lt;li>Feature is turned off in the KAS by default. Enabled by adding ConnectivityServiceConfiguration.&lt;/li>
&lt;li>Kubernetes will not ship with a network proxy. The feature will work with the sample network proxy in &lt;a href="https://github.com/kubernetes-sigs/apiserver-network-proxy"
target="_blank" rel="noopener">https://github.com/kubernetes-sigs/apiserver-network-proxy&lt;/a>
&lt;/li>
&lt;li>Demonstrate that the API Server Network Proxy eliminates the need for the SSH Tunnels.&lt;/li>
&lt;/ul>
&lt;p>Beta:&lt;/p>
&lt;ul>
&lt;li>All &lt;a href="#kubernetes-api-server-outbound-requests"
>Kube API Server egress points&lt;/a>
have been implemented to use the
EgressSelector.&lt;/li>
&lt;li>Have official releases of the &lt;a href="https://github.com/kubernetes-sigs/apiserver-network-proxy"
target="_blank" rel="noopener">Konnectivity Server and Agent&lt;/a>
reference implementations.&lt;/li>
&lt;li>Have at least one OSS kube-up implementation where the feature can be turned on and
demonstrated.&lt;/li>
&lt;li>Have run a basic load test with egresses enabled through the Konnectivity
Server to demonstrate that concurrent requests work with Admission Webhooks.&lt;/li>
&lt;li>Tests for EgressSelector.&lt;/li>
&lt;li>e2e test with a functioning cluster with the EgressSelector configured to use
a KonnectivityService.&lt;/li>
&lt;li>Add metrics and trace around the Egress Lookup/Dial code. Make sure we know
how many egresses of each type are returned. Make sure we know how long we
are spending dialing out.&lt;/li>
&lt;li>Ensure we have metrics on each existing egress use case.&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Feature went Alpha in 1.16 with limited functionality. It will cover the log
sub resource and communication to the etcd server.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Feature went Beta in 1.18.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives-optional">Alternatives [optional]&lt;/h2>
&lt;ul>
&lt;li>Leave SSH Tunnels (deprecated) in the KAS. Prevents us from making the KAS cloud provider agnostic. Blocks out of tree effort.&lt;/li>
&lt;li>Build equivalent functionality into the KAS. Is not extensible. Essentially has the same issues as SSH Tunnels.&lt;/li>
&lt;li>Use a socks5 proxy. No standard mTLS mechanism for securing traffic. Does not actually act as a standard. More complicated implementation.&lt;/li>
&lt;/ul>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed [optional]&lt;/h2>
&lt;p>Anyone wishing to use this feature will need to create network proxy images/pods on the control plane and set up the EgressSelectorConfiguration.
The network proxy provided is meant as a reference implementation. Users are expected to extend it for their needs.&lt;/p></description></item><item><title>Resources: APIServer Tracing</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/647/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/647/</guid><description>
&lt;h1 id="kep-647-apiserver-tracing">KEP-647: APIServer Tracing&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#definitions"
>Definitions&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#steady-state-trace-collection"
>Steady-State trace collection&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#on-demand-trace-collection"
>On-Demand trace collection&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#tracing-api-requests"
>Tracing API Requests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#exporting-spans"
>Exporting Spans&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#running-the-opentelemetry-collector"
>Running the OpenTelemetry Collector&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#apiserver-configuration-and-egressselectors"
>APIServer Configuration and EgressSelectors&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-requirements"
>Graduation requirements&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives-considered"
>Alternatives considered&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#introducing-a-new-egressselector-type"
>Introducing a new EgressSelector type&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#other-opentelemetry-exporters"
>Other OpenTelemetry Exporters&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This Kubernetes Enhancement Proposal (KEP) proposes enhancing the API Server to allow tracing requests. For this, it proposes using OpenTelemetry libraries and exporting spans in the OpenTelemetry format.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Along with metrics and logs, traces are a useful form of telemetry to aid with debugging incoming requests. The API Server currently uses a poor-man&amp;rsquo;s form of tracing (see &lt;a href="https://github.com/kubernetes/utils/tree/master/trace"
target="_blank" rel="noopener">github.com/kubernetes/utils/trace&lt;/a>
), but we can make use of distributed tracing to improve the ease of use and enable easier analysis of trace data. Trace data is structured, providing the detail necessary to debug requests, and context propagation allows plugins, such as admission webhooks, to add to API Server requests.&lt;/p>
&lt;h3 id="definitions">Definitions&lt;/h3>
&lt;p>&lt;strong>Span&lt;/strong>: The smallest unit of a trace. It has a start and end time, and is attached to a single trace.
&lt;strong>Trace&lt;/strong>: A collection of Spans which represents a single process.
&lt;strong>Trace Context&lt;/strong>: A reference to a Trace that is designed to be propagated across component boundaries. Sometimes referred to as the &amp;ldquo;Span Context&amp;rdquo;. It can be thought of as a pointer to a parent span that child spans can be attached to.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>The API Server generates and exports spans for incoming and outgoing requests.&lt;/li>
&lt;li>The API Server propagates context from incoming requests to outgoing requests.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Tracing in kubernetes controllers&lt;/li>
&lt;li>Replace existing logging, metrics, or the events API&lt;/li>
&lt;li>Trace operations from all Kubernetes resource types in a generic manner (i.e. without manual instrumentation)&lt;/li>
&lt;li>Change metrics or logging (e.g. to support trace-metric correlation)&lt;/li>
&lt;li>Access control to tracing backends&lt;/li>
&lt;li>Add tracing to components outside kubernetes (e.g. etcd client library).&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;p>Since this feature is for diagnosing problems with the Kube-API Server, it is targeted at Cluster Operators and Cloud Vendors that manage kubernetes control-planes.&lt;/p>
&lt;p>For the following use-cases, I can deploy an OpenTelemetry collector as a sidecar to the API Server. I can use the API Server&amp;rsquo;s &lt;code>--opentelemetry-config-file&lt;/code> flag with the default URL to make the API Server send its spans to the sidecar collector. Alternatively, I can point the API Server at an OpenTelemetry collector listening on a different port or URL if I need to.&lt;/p>
&lt;h4 id="steady-state-trace-collection">Steady-State trace collection&lt;/h4>
&lt;p>As a cluster operator or cloud provider, I would like to collect traces for API requests to the API Server to help debug a variety of control-plane problems. I can set the &lt;code>SamplingRatePerMillion&lt;/code> in the configuration file to a non-zero number to have spans collected for a small fraction of requests. Depending on the symptoms I need to debug, I can search span metadata to find a trace which displays the symptoms I am looking to debug. Even for issues which occur non-deterministically, a low sampling rate is generally still enough to surface a representative trace over time.&lt;/p>
&lt;h4 id="on-demand-trace-collection">On-Demand trace collection&lt;/h4>
&lt;p>As a cluster operator or cloud provider, I would like to collect a trace for a specific request to the API Server. This will often happen when debugging a live problem. In such cases, I don&amp;rsquo;t want to change the &lt;code>SamplingRatePerMillion&lt;/code> to collect a high percentage of requests, which would be expensive and collect many things I don&amp;rsquo;t care about. I also don&amp;rsquo;t want to restart the API Server, which may fix the problem I am trying to debug. Instead, I can make sure the incoming request to the API Server is sampled. The tooling to do this easily doesn&amp;rsquo;t exist today, but could be added in the future.&lt;/p>
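&lt;p>Marking a request as sampled comes down to the last field of the W3C &lt;code>traceparent&lt;/code> header, which has the form version, trace ID, parent span ID and flags, joined by dashes. The following is a minimal sketch of building and inspecting such a header; the helper names &lt;code>traceparent&lt;/code> and &lt;code>isSampled&lt;/code> are hypothetical and do not validate the IDs.&lt;/p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// traceparent builds a version-00 W3C Trace Context header value.
// The caller supplies hex-encoded IDs; this sketch does not validate them.
func traceparent(traceID, parentID string, sampled bool) string {
	flags := 0
	if sampled {
		flags = 1 // the lowest bit of the flags byte is the "sampled" flag
	}
	return fmt.Sprintf("00-%s-%s-%02x", traceID, parentID, flags)
}

// isSampled reports whether the sampled bit is set in a traceparent value.
func isSampled(header string) (bool, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 {
		return false, fmt.Errorf("expected 4 fields, got %d", len(parts))
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return false, err
	}
	return flags%2 == 1, nil
}

func main() {
	h := traceparent("4bf92f3577b34da6a3ce929d0e0e4737", "0000000000000000", true)
	sampled, err := isSampled(h)
	fmt.Println(h, sampled, err)
}
```

&lt;p>A client-side tool for on-demand tracing would only need to attach a header of this shape to the request it wants traced.&lt;/p>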
&lt;p>For example, to trace a request to list nodes, with traceid=4bf92f3577b34da6a3ce929d0e0e4737, no parent span, and sampled=true:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>kubectl proxy --port&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#666">8080&lt;/span> &amp;amp;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>curl http://localhost:8080/api/v1/nodes -H &lt;span style="color:#b44">&amp;#34;traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4737-0000000000000000-01&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>The primary risk associated with distributed tracing is DDOS. A user that can send a large number of sampled requests can cause the server to generate a large number of spans. This is mitigated by respecting the incoming trace context only for privileged users (the &lt;code>system:master&lt;/code> and &lt;code>system:monitoring&lt;/code> groups) and by setting &lt;code>SamplingRatePerMillion&lt;/code> to a low value.&lt;/p>
&lt;p>There is also a risk of memory usage incurred by storing spans prior to export. This is mitigated by limiting the number of spans that can be queued for export, and dropping spans if necessary to stay under that limit.&lt;/p>
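&lt;p>The queue-limit mitigation can be pictured with a small sketch. This is not the OpenTelemetry SDK&amp;rsquo;s implementation; &lt;code>boundedQueue&lt;/code> and &lt;code>span&lt;/code> are hypothetical types that only show the idea of refusing new spans once a fixed limit is reached.&lt;/p>

```go
package main

import "fmt"

// span is a hypothetical stand-in for a finished span awaiting export.
type span struct{ name string }

// boundedQueue holds at most limit spans and counts the ones it drops.
type boundedQueue struct {
	limit   int
	spans   []span
	dropped int
}

// enqueue stores s if there is room and reports whether it was kept.
func (q *boundedQueue) enqueue(s span) bool {
	if len(q.spans) >= q.limit {
		q.dropped++ // over the limit: drop instead of growing memory
		return false
	}
	q.spans = append(q.spans, s)
	return true
}

// drain hands the queued spans to the caller (the export step) and empties the queue.
func (q *boundedQueue) drain() []span {
	out := q.spans
	q.spans = nil
	return out
}

func main() {
	q := boundedQueue{limit: 2}
	for _, n := range []string{"a", "b", "c"} {
		q.enqueue(span{name: n})
	}
	fmt.Println("exported:", len(q.drain()), "dropped:", q.dropped)
}
```

&lt;p>Memory use is then bounded by the limit, at the cost of losing spans when the exporter cannot keep up.&lt;/p>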
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="tracing-api-requests">Tracing API Requests&lt;/h3>
&lt;p>We will wrap the API Server&amp;rsquo;s http server and http clients with &lt;a href="https://github.com/open-telemetry/opentelemetry-go-contrib/tree/master/instrumentation/net/http/otelhttp"
target="_blank" rel="noopener">otelhttp&lt;/a>
to get spans for incoming and outgoing http requests. This generates spans for all sampled incoming requests and propagates context with all client requests. For incoming requests, this would go below &lt;a href="https://github.com/kubernetes/kubernetes/blob/9eb097c4b07ea59c674a69e19c1519f0d10f2fa8/staging/src/k8s.io/apiserver/pkg/server/config.go#L676"
target="_blank" rel="noopener">WithRequestInfo&lt;/a>
in the filter stack, as it must be after authentication and authorization, before the panic filter, and is closest in function to the WithRequestInfo filter.&lt;/p>
&lt;p>Note that some clients of the API Server, such as webhooks, may make reentrant calls to the API Server. To gain the full benefit of tracing, such clients should propagate context with requests back to the API Server. One way to do this is to wrap the webhook&amp;rsquo;s http server using otelhttp, and use the request&amp;rsquo;s context when making requests to the API Server.&lt;/p>
&lt;p>&lt;strong>Webhook Example&lt;/strong>&lt;/p>
&lt;p>Wrapping the http server, which ensures context is propagated from http headers to the request&amp;rsquo;s context:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>mux &lt;span style="color:#666">:=&lt;/span> http.&lt;span style="color:#00a000">NewServeMux&lt;/span>()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>handler &lt;span style="color:#666">:=&lt;/span> otelhttp.&lt;span style="color:#00a000">NewHandler&lt;/span>(mux, &lt;span style="color:#b44">&amp;#34;HandleAdmissionRequest&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Use the context from the request in reentrant requests:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>ctx &lt;span style="color:#666">:=&lt;/span> req.&lt;span style="color:#00a000">Context&lt;/span>()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>client.&lt;span style="color:#00a000">CoreV1&lt;/span>().&lt;span style="color:#00a000">Pods&lt;/span>(&lt;span style="color:#b44">&amp;#34;&amp;#34;&lt;/span>).&lt;span style="color:#00a000">List&lt;/span>(ctx, metav1.ListOptions{})
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note: Even though the admission controller uses the otelhttp handler wrapper, that does &lt;em>not&lt;/em> mean it will emit spans. OpenTelemetry has a concept of an SDK, which manages the exporting of telemetry. If no SDK is registered, the NoOp SDK is used, which only propagates context, and does not export spans. In the webhook case in which no SDK is registered, the reentrant API call would appear to be a direct child of the original API call. If the webhook registers an SDK and exports spans, there would be an additional span from the webhook between the original and reentrant API Server call.&lt;/p>
&lt;p>Note: OpenTelemetry has a concept of &lt;a href="https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md#baggage-api"
target="_blank" rel="noopener">&amp;ldquo;Baggage&amp;rdquo;&lt;/a>
, which is akin to annotations for propagated context. If there is any additional metadata we would like to attach, and propagate along with a request, we can do that using Baggage.&lt;/p>
&lt;h3 id="exporting-spans">Exporting Spans&lt;/h3>
&lt;p>This KEP proposes the use of the &lt;a href="https://opentelemetry.io/"
target="_blank" rel="noopener">OpenTelemetry tracing framework&lt;/a>
to create and export spans to configured backends.&lt;/p>
&lt;p>The API Server will use the &lt;a href="https://github.com/open-telemetry/opentelemetry-proto"
target="_blank" rel="noopener">OpenTelemetry exporter format&lt;/a>
, and the &lt;a href="https://github.com/open-telemetry/opentelemetry-go/tree/master/exporters/otlp#opentelemetry-collector-go-exporter"
target="_blank" rel="noopener">OTLP exporter&lt;/a>
which can export traces. This format is easy to use with the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector"
target="_blank" rel="noopener">OpenTelemetry Collector&lt;/a>
, which allows importing and configuring exporters for trace storage backends to be done out-of-tree in addition to other useful features.&lt;/p>
&lt;h3 id="running-the-opentelemetry-collector">Running the OpenTelemetry Collector&lt;/h3>
&lt;p>The &lt;a href="https://github.com/open-telemetry/opentelemetry-collector"
target="_blank" rel="noopener">OpenTelemetry Collector&lt;/a>
can be run as a sidecar, a daemonset, a deployment, or a combination in which the daemonset buffers telemetry and forwards to the deployment for aggregation (e.g. tail-based sampling) and routing to a telemetry backend. To support these various setups, the API Server should be able to send traffic either to a local (on the control plane network) collector, or to a cluster service (on the cluster network).&lt;/p>
&lt;h3 id="apiserver-configuration-and-egressselectors">APIServer Configuration and EgressSelectors&lt;/h3>
&lt;p>The API Server controls where traffic is sent using an &lt;a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190226-network-proxy.md"
target="_blank" rel="noopener">EgressSelector&lt;/a>
, and has separate controls for &lt;code>ControlPlane&lt;/code>, &lt;code>Cluster&lt;/code>, and &lt;code>Etcd&lt;/code> traffic. As described above, we would like to support sending telemetry to a url using the &lt;code>ControlPlane&lt;/code> egress. To accomplish this, we will introduce a flag, &lt;code>--opentelemetry-config-file&lt;/code>, that will point to the file that defines the opentelemetry exporter configuration. That file will have the following format:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// TracingConfiguration provides versioned configuration for tracing clients.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> TracingConfiguration &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> metav1.TypeMeta &lt;span style="color:#b44">`json:&amp;#34;,inline&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// URL of the collector that&amp;#39;s running on the control-plane node.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the APIServer uses the egressType ControlPlane when sending data to the collector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Defaults to localhost:4317&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> URL &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;url,omitempty&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=url&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// SamplingRatePerMillion is the number of samples to collect per million spans.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Defaults to 0.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> SamplingRatePerMillion &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">int32&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;samplingRatePerMillion,omitempty&amp;#34; protobuf:&amp;#34;varint,2,opt,name=samplingRatePerMillion&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If &lt;code>--opentelemetry-config-file&lt;/code> is not specified, the API Server will not send any spans, even if incoming requests ask for sampling.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;p>We will test tracing added by this feature with an integration test. The
integration test will verify that spans exported by the apiserver match what is
expected from the request.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;p>None.&lt;/p>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>staging/src/k8s.io/apiserver/pkg/server/options/tracing_test.go&lt;/code>: &lt;code>10/10/2021&lt;/code> 42.6%&lt;/li>
&lt;li>&lt;code>staging/src/k8s.io/component-base/tracing/api/v1/config_test.go&lt;/code>: &lt;code>10/10/2021&lt;/code> 59.0%&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>test/integration/apiserver/tracing/tracing_test.go&lt;/code>
&lt;ul>
&lt;li>TestAPIServerTracingWithKMSv2: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?pr=1&amp;amp;job=integration&amp;amp;test=TestAPIServerTracingWithKMSv2"
target="_blank" rel="noopener">https://storage.googleapis.com/k8s-triage/index.html?pr=1&amp;job=integration&amp;test=TestAPIServerTracingWithKMSv2&lt;/a>
&lt;/li>
&lt;li>TestAPIServerTracingWithEgressSelector: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?pr=1&amp;amp;job=integration&amp;amp;test=TestAPIServerTracingWithEgressSelector"
target="_blank" rel="noopener">https://storage.googleapis.com/k8s-triage/index.html?pr=1&amp;job=integration&amp;test=TestAPIServerTracingWithEgressSelector&lt;/a>
&lt;/li>
&lt;li>TestAPIServerTracing: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?pr=1&amp;amp;job=integration&amp;amp;test=TestAPIServerTracing"
target="_blank" rel="noopener">https://storage.googleapis.com/k8s-triage/index.html?pr=1&amp;job=integration&amp;test=TestAPIServerTracing&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>Not Required.&lt;/p>
&lt;h2 id="graduation-requirements">Graduation requirements&lt;/h2>
&lt;p>Alpha&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Implement tracing of incoming and outgoing http/grpc requests in the kube-apiserver&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Integration testing of tracing&lt;/li>
&lt;/ul>
&lt;p>Beta&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Tracing 100% of requests does not break scalability tests (this does not necessarily mean trace backends can handle all the data).
&lt;ul>
&lt;li>Verified in a manual run: &lt;a href="https://github.com/kubernetes/kubernetes/pull/113695#issuecomment-1307665358"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/113695#issuecomment-1307665358&lt;/a>
. This is not part of periodic tests, although it may be useful for debugging with a low sampling rate in the future.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> OpenTelemetry reaches GA&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Publish examples of how to use the OT Collector with kubernetes&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Allow time for feedback&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Revisit the format used to export spans.&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Parity with the old text-based Traces&lt;/li>
&lt;/ul>
&lt;p>GA&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> Publish guidelines for kubernetes components on when and how to add tracing to a component.&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Graduate the TracingConfiguration component config to v1.&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Define and document stability guarantees for trace instrumentation.&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Add support for On-Demand trace collection as described above.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>This feature is upgraded or downgraded with the API Server. It is not otherwise impacted.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>This feature is not impacted by version skew. API Servers of different versions can each produce traces to provide observability signals independently.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: APIServerTracing&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism: Specify a file using the &lt;code>--opentelemetry-config-file&lt;/code> API Server flag.&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane? Yes, it will require restarting the API Server.&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node? No.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. The feature is disabled unless both the feature gate and &lt;code>--opentelemetry-config-file&lt;/code> flag are set. When the feature is enabled, it doesn&amp;rsquo;t change behavior from the users&amp;rsquo; perspective; it only adds tracing telemetry based on API Server requests.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>It will start sending traces again. This will happen regardless of whether it was disabled by removing the &lt;code>--opentelemetry-config-file&lt;/code> flag, or by disabling via feature gate.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/5426da8f69c1d5fa99814526c1878aeb99b2456e/test/integration/apiserver/tracing/tracing_test.go"
target="_blank" rel="noopener">Integration tests&lt;/a>
exist which enable the feature gate.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;h6 id="how-can-a-rollout-fail-can-it-impact-already-running-workloads">How can a rollout fail? Can it impact already running workloads?&lt;/h6>
&lt;p>Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?&lt;/p>
&lt;ul>
&lt;li>If API Server tracing is rolled out with a high sampling rate, it can degrade API Server performance, which in turn can affect the rest of the cluster.&lt;/li>
&lt;/ul>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>API Server &lt;a href="https://github.com/kubernetes/community/tree/master/sig-scalability/slos"
target="_blank" rel="noopener">SLOs&lt;/a>
are the signals that should guide a rollback. In particular, the &lt;a href="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-instrumentation/647-apiserver-tracing/apiserver_request_slo_duration_seconds"
target="_blank" rel="noopener">&lt;code>apiserver_request_duration_seconds&lt;/code> and &lt;code>apiserver_request_slo_duration_seconds&lt;/code>&lt;/a>
metrics would surface issues resulting in slower API Server responses.&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>Manually enabled the feature-gate and tracing, verified the apiserver in my cluster was reachable, and disabled the feature-gate and tracing in a dev cluster.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>This is an operator-facing feature. Look for traces to see if tracing is enabled.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>Look for spans. If you see them, then it is working.&lt;/p>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>Yes, those are being added in OpenTelemetry, and we will use them once they are present: &lt;a href="https://github.com/open-telemetry/opentelemetry-go/issues/2547"
target="_blank" rel="noopener">https://github.com/open-telemetry/opentelemetry-go/issues/2547&lt;/a>
&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>The feature itself (tracing in the API Server) does not depend on services running in the cluster. However, as with other signals (metrics, logs), collecting traces from the API Server requires a trace collection pipeline, which will differ depending on the cluster. The following is an example, and other OTLP-compatible collection mechanisms may be substituted for it. The impact of an outage is likely to be the same, regardless of collection pipeline.&lt;/p>
&lt;ul>
&lt;li>[OpenTelemetry Collector (optional)]
&lt;ul>
&lt;li>Usage description: Deploy the collector as a sidecar container to the API Server, and route traces to your backend of choice.
&lt;ul>
&lt;li>Impact of its outage on the feature: Spans will continue to be collected by the kube-apiserver, but may be lost before they reach the trace backend.&lt;/li>
&lt;li>Impact of its degraded performance or high-error rates on the feature: Spans may be lost before they reach the trace backend.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;p>&lt;em>For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.&lt;/em>&lt;/p>
&lt;p>&lt;em>For beta, this section is required: reviewers must answer these questions.&lt;/em>&lt;/p>
&lt;p>&lt;em>For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.&lt;/em>&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>This will not add any additional API calls.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>This will introduce an API type for the configuration. This is only for
loading configuration; users cannot create these objects.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>Not directly. Cloud providers could choose to send traces to their managed
trace backends, but this requires them to set up a telemetry pipeline as
described above.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by &lt;a href="https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos"
target="_blank" rel="noopener">existing SLIs/SLOs&lt;/a>
?&lt;/h6>
&lt;p>It will increase API Server request latency by a negligible amount (&amp;lt;1 microsecond)
for encoding and decoding the trace context from headers, and recording spans
in memory. Exporting spans is not in the critical path.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 MB of memory.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No. Collecting and exporting spans does not use additional node resources even when it is failing to connect to the backend.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>The Troubleshooting section currently serves the &lt;code>Playbook&lt;/code> role. We may consider
splitting it into a dedicated &lt;code>Playbook&lt;/code> document (potentially with some monitoring
details). For now, we leave it here.&lt;/p>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>This feature does not have a dependency on the API Server or etcd (it is built into the API Server).&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;ul>
&lt;li>[Trace endpoint misconfigured, or unavailable]
&lt;ul>
&lt;li>Detection: No traces processed by trace ingestion pipeline&lt;/li>
&lt;li>Mitigations: None&lt;/li>
&lt;li>Diagnostics: API Server logs containing: &amp;ldquo;traces exporter is disconnected from the server&amp;rdquo;&lt;/li>
&lt;li>Testing: The feature will simply not work if misconfigured. It doesn&amp;rsquo;t seem worth verifying.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>This feature will likely be useful for determining why scalability SLOs are not being met, as tracing can
provide detailed latency information as described above. If tracing itself is suspected as the reason SLOs are
not being met, it can be disabled without impacting other functionality by not setting the
&lt;code>--opentelemetry-config-file&lt;/code> flag.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://github.com/Monkeyanator/mutating-trace-admission-controller"
target="_blank" rel="noopener">Mutating admission webhook which injects trace context&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/Monkeyanator/kubernetes/pull/15"
target="_blank" rel="noopener">Instrumentation of Kubernetes components&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/compare/master...dashpole:tracing"
target="_blank" rel="noopener">Instrumentation of Kubernetes components for 1/24/2019 community demo&lt;/a>
&lt;/li>
&lt;li>KEP merged as provisional on 1/8/2020, including controller tracing&lt;/li>
&lt;li>KEP scoped down to only API Server traces on 5/1/2020&lt;/li>
&lt;li>Updated PRR section 2/8/2021&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>Depending on the chosen sampling rate, tracing can increase CPU and memory usage by a small amount, and can also add a negligible amount of latency to API Server requests, when enabled.&lt;/p>
&lt;h2 id="alternatives-considered">Alternatives considered&lt;/h2>
&lt;h3 id="introducing-a-new-egressselector-type">Introducing a new EgressSelector type&lt;/h3>
&lt;p>Instead of a configuration file to choose between a URL on the &lt;code>ControlPlane&lt;/code> network, or a service on the &lt;code>Cluster&lt;/code> network, we considered introducing a new &lt;code>OpenTelemetry&lt;/code> egress type, which could be configured separately. However, we aren&amp;rsquo;t actually introducing a new destination for traffic, so it is more conventional to make use of existing egress types. We will also likely want to add additional configuration for the OpenTelemetry client in the future.&lt;/p>
&lt;h3 id="other-opentelemetry-exporters">Other OpenTelemetry Exporters&lt;/h3>
&lt;p>This KEP suggests that we utilize the OpenTelemetry exporter format in all components. Alternative options include:&lt;/p>
&lt;ol>
&lt;li>Add configuration for many exporters in-tree by vendoring multiple &amp;ldquo;supported&amp;rdquo; exporters. These exporters are the only compatible backends for tracing in kubernetes.
a. This places the kubernetes community in the position of curating supported tracing backends&lt;/li>
&lt;li>Support &lt;em>both&lt;/em> a curated set of in-tree exporters, and the collector exporter&lt;/li>
&lt;/ol></description></item><item><title>Resources: Apply</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/555/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/555/</guid><description>
&lt;h1 id="apply">Apply&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#implementation-detailsnotesconstraints-optional"
>Implementation Details/Notes/Constraints [optional]&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api-topology"
>API Topology&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#lists"
>Lists&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#maps-and-structs"
>Maps and structs&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#kubectl"
>Kubectl&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#server-side-apply"
>Server-side Apply&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#status-wiping"
>Status Wiping&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#current-behavior"
>Current Behavior&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proposed-change"
>Proposed Change&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#api-audit"
>API Audit&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#testing-plan"
>Testing Plan&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade-from-kubectl-client-side-to-server-side-apply"
>Upgrade from kubectl Client-Side to Server-Side Apply&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#avoiding-conflicts-from-client-side-apply-to-server-side-apply"
>Avoiding Conflicts from Client-Side Apply to Server-Side Apply&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#downgrade-from-kubectl-server-side-to-client-side-apply"
>Downgrade from kubectl Server-Side to Client-Side Apply&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade-the-api-server"
>Downgrade the API Server&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history-1"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives-1"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>&lt;code>kubectl apply&lt;/code> is a core part of the Kubernetes config workflow, but it is
buggy and hard to fix. This functionality will be regularized and moved to the
control plane.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Example problems today:&lt;/p>
&lt;ul>
&lt;li>User does POST, then changes something and applies: surprise!&lt;/li>
&lt;li>User does an apply, then &lt;code>kubectl edit&lt;/code>, then applies again: surprise!&lt;/li>
&lt;li>User does GET, edits locally, then apply: surprise!&lt;/li>
&lt;li>User tweaks some annotations, then applies: surprise!&lt;/li>
&lt;li>Alice applies something, then Bob applies something: surprise!&lt;/li>
&lt;/ul>
&lt;p>Why can&amp;rsquo;t a smaller change fix the problems? Why hasn&amp;rsquo;t it already been fixed?&lt;/p>
&lt;ul>
&lt;li>Too many components need to change to deliver a fix&lt;/li>
&lt;li>Organic evolution and lack of systematic approach
&lt;ul>
&lt;li>It is hard to make fixes that cohere instead of interfere without a clear model of the feature&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Lack of API support meant client-side implementation
&lt;ul>
&lt;li>The client sends a PATCH to the server, which necessitated strategic merge patch&amp;ndash;as no patch format conveniently captures the data type that is actually needed.&lt;/li>
&lt;li>Tactical errors: SMP was not easy to version, fixing anything required client and server changes and a 2 release deprecation period.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The implications of our schema were not understood, leading to bugs.
&lt;ul>
&lt;li>e.g., non-positional lists, sets, undiscriminated unions, implicit context&lt;/li>
&lt;li>Complex and confusing defaulting behavior (e.g., Always pull policy from :latest)&lt;/li>
&lt;li>Non-declarative-friendly API behavior (e.g., selector updates)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>&amp;ldquo;Apply&amp;rdquo; is intended to allow users and systems to cooperatively determine the
desired state of an object. The resulting system should:&lt;/p>
&lt;ul>
&lt;li>Be robust to changes made by other users, systems, defaulters (including mutating admission control webhooks), and object schema evolution.&lt;/li>
&lt;li>Be agnostic about prior steps in a CI/CD system (and not require such a system).&lt;/li>
&lt;li>Have low cognitive burden:
&lt;ul>
&lt;li>For integrators: a single API concept supports all object types; integrators
have to learn one thing total, not one thing per operation per API object.
Client-side logic should be kept to a minimum; curl should be sufficient to
use the apply feature.&lt;/li>
&lt;li>For users: looking at a config change, it should be intuitive what the
system will do. The “magic” is easy to understand and invoke.&lt;/li>
&lt;li>Error messages should&amp;ndash;to the extent possible&amp;ndash;tell users why they had a
conflict, not just what the conflict was.&lt;/li>
&lt;li>Error messages should be delivered at the earliest possible point of
intervention.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Goal: The control plane delivers a comprehensive solution.&lt;/p>
&lt;p>Goal: Apply can be called by non-Go languages and non-kubectl clients (e.g.,
via curl).&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Multi-object apply will not be changed: it remains client side for now&lt;/li>
&lt;li>Some sources of user confusion will not be addressed:
&lt;ul>
&lt;li>Changing the name field makes a new object rather than renaming an existing object&lt;/li>
&lt;li>Changing fields that can’t really be changed (e.g., Service type).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>(Please note that when this KEP was started, the KEP process was much less well
defined and we have been treating this as a requirements / mission statement
document; KEPs have evolved into more than that.)&lt;/p>
&lt;p>A brief list of the changes:&lt;/p>
&lt;ul>
&lt;li>Apply will be moved to the control plane.
&lt;ul>
&lt;li>The &lt;a href="https://goo.gl/UbCRuf"
target="_blank" rel="noopener">original design&lt;/a>
is in a Google doc; joining the
kubernetes-dev or kubernetes-announce list will grant permission to see it.
The implementation has changed since then, so the document is mainly useful for
historical understanding. The test cases and examples there are still valid.&lt;/li>
&lt;li>The &lt;a href="https://goo.gl/nRZVWL"
target="_blank" rel="noopener">original design for structured diff and merge&lt;/a>
is readable in the same way.
In practice we found a mechanism that better fits our needs (tracking field
managers), but the formalization of our schema from that document is still
correct.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Apply is invoked by sending a certain Content-Type with the verb PATCH.&lt;/li>
&lt;li>Instead of using a &lt;code>kubectl.kubernetes.io/last-applied-configuration&lt;/code> annotation,
the control plane will track a &amp;ldquo;manager&amp;rdquo; for every field.&lt;/li>
&lt;li>Apply is for users and/or CI/CD systems. We modify the POST, PUT (and
non-apply PATCH) verbs so that when controllers or other systems make changes
to an object, they are made &amp;ldquo;managers&amp;rdquo; of the fields they change.&lt;/li>
&lt;li>The things our &amp;ldquo;Go IDL&amp;rdquo; describes are formalized: &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff"
target="_blank" rel="noopener">structured merge and diff&lt;/a>
&lt;/li>
&lt;li>Existing Go IDL files will be fixed (e.g., by &lt;a href="https://github.com/kubernetes/kubernetes/pull/70100/files"
target="_blank" rel="noopener">fixing the directives&lt;/a>
)&lt;/li>
&lt;li>Dry-run will be implemented on control plane verbs (POST, PUT, PATCH).
&lt;ul>
&lt;li>Admission webhooks will have their API extended accordingly.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>An upgrade path will be implemented so that version skew between kubectl and
the control plane will not have disastrous results.&lt;/li>
&lt;/ul>
&lt;p>The linked documents should be read for a more complete picture.&lt;/p>
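&lt;p>As a rough illustration of invoking apply without kubectl, the following Go sketch builds (without sending) such a PATCH request. It assumes the &lt;code>application/apply-patch+yaml&lt;/code> content type and the &lt;code>fieldManager&lt;/code> query parameter used by released Kubernetes versions, neither of which is spelled out in this document. The helper &lt;code>newApplyRequest&lt;/code>, the host, the object name, and the manager name are hypothetical placeholders.&lt;/p>

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/url"
)

// applyContentType marks a PATCH as an apply request (assumed value, see lead-in).
const applyContentType = "application/apply-patch+yaml"

// newApplyRequest builds, but does not send, an apply PATCH for a ConfigMap.
// baseURL, namespace, name and manager are placeholders chosen by the caller.
func newApplyRequest(baseURL, namespace, name, manager string, config []byte) (*http.Request, error) {
	target := fmt.Sprintf("%s/api/v1/namespaces/%s/configmaps/%s?fieldManager=%s",
		baseURL, url.PathEscape(namespace), url.PathEscape(name), url.QueryEscape(manager))
	req, err := http.NewRequest(http.MethodPatch, target, bytes.NewReader(config))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", applyContentType)
	return req, nil
}

func main() {
	// The body is the desired state of the object, expressed as ordinary YAML.
	config := []byte("apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: demo\ndata:\n  key: value\n")
	req, err := newApplyRequest("https://kube-apiserver.example", "default", "demo", "ci-pipeline", config)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("Content-Type:", req.Header.Get("Content-Type"))
}
```

&lt;p>A real client would also attach credentials and send the request with an &lt;code>http.Client&lt;/code>; that part is omitted here.&lt;/p>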
&lt;h3 id="implementation-detailsnotesconstraints-optional">Implementation Details/Notes/Constraints [optional]&lt;/h3>
&lt;p>(TODO: update this section with current design)&lt;/p>
&lt;h4 id="api-topology">API Topology&lt;/h4>
&lt;p>Server-side apply has to understand the topology of the objects in order to make
valid merging decisions. To that end, new Go markers and
OpenAPI extensions have been created:&lt;/p>
&lt;h5 id="lists">Lists&lt;/h5>
&lt;p>Lists can behave in three main ways, depending on their actual semantics.
New annotations allow API authors to define this behavior.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Atomic lists: The list is owned by only one person and can only be entirely
replaced. This is the default for lists. It is defined either in Go IDL by
prefixing the list with &lt;code>// +listType=atomic&lt;/code>, or in the OpenAPI
with &lt;code>&amp;quot;x-kubernetes-list-type&amp;quot;: &amp;quot;atomic&amp;quot;&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sets: The list is a set (it has to be of a scalar type). Items in the list
must appear at most once. Individual actors of the API can own individual items.
It is defined either in Go IDL by prefixing the list with &lt;code>// +listType=set&lt;/code>, or in the OpenAPI with
&lt;code>&amp;quot;x-kubernetes-list-type&amp;quot;: &amp;quot;set&amp;quot;&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Associative lists: Kubernetes has a pattern of using lists as dictionaries, with
&amp;ldquo;name&amp;rdquo; being a very common key. People can now reproduce this pattern by using
&lt;code>// +listType=map&lt;/code> along with &lt;code>// +listMapKey=name&lt;/code> in Go IDL, or in the OpenAPI with &lt;code>&amp;quot;x-kubernetes-list-type&amp;quot;: &amp;quot;map&amp;quot;&lt;/code>
along with &lt;code>&amp;quot;x-kubernetes-list-map-keys&amp;quot;: [&amp;quot;name&amp;quot;]&lt;/code>.
Items of an associative list are owned by the person who applied the item to
the list.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>For compatibility with the existing markers, the &lt;code>patchStrategy&lt;/code> and
&lt;code>patchMergeKey&lt;/code> markers are automatically converted to the corresponding &lt;code>listType&lt;/code>
and &lt;code>listMapKey&lt;/code> when the latter are missing.&lt;/p>
&lt;h5 id="maps-and-structs">Maps and structs&lt;/h5>
&lt;p>Maps and structures can behave in two ways:&lt;/p>
&lt;ul>
&lt;li>Each item in the map or field in the structure is independent of the
others, and they can be changed by different actors. This is the default behavior,
but can be explicitly specified with &lt;code>// +mapType=granular&lt;/code> or &lt;code>// +structType=granular&lt;/code> respectively. They map to the same OpenAPI extension:
&lt;code>&amp;quot;x-kubernetes-map-type&amp;quot;: &amp;quot;granular&amp;quot;&lt;/code>.&lt;/li>
&lt;li>All the fields of the structure or items of the map are treated as one unit; we say the map/struct is
atomic. That can be specified with &lt;code>// +mapType=atomic&lt;/code> or &lt;code>// +structType=atomic&lt;/code> respectively. They map to the same OpenAPI extension:
&lt;code>&amp;quot;x-kubernetes-map-type&amp;quot;: &amp;quot;atomic&amp;quot;&lt;/code>.&lt;/li>
&lt;/ul>
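&lt;p>As an illustration of where these markers go, the sketch below declares a made-up API type. &lt;code>WidgetSpec&lt;/code>, &lt;code>Limits&lt;/code>, and their fields are hypothetical; only the marker comments come from the text above. The markers have no effect at run time: they are ordinary Go comments that tooling reads when generating the schema.&lt;/p>

```go
package main

import "fmt"

// WidgetSpec is a hypothetical API type used only to show marker placement.
type WidgetSpec struct {
	// Labels is granular (the default): each key can be changed by a different actor.
	// +mapType=granular
	Labels map[string]string `json:"labels,omitempty"`

	// Selector is atomic: the whole map is treated as one unit.
	// +mapType=atomic
	Selector map[string]string `json:"selector,omitempty"`

	// Limits is an atomic struct: all of its fields are treated as one unit.
	// +structType=atomic
	Limits Limits `json:"limits"`
}

// Limits is a hypothetical nested struct.
type Limits struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
}

func main() {
	spec := WidgetSpec{
		Labels:   map[string]string{"team": "storage"},
		Selector: map[string]string{"app": "widget"},
		Limits:   Limits{CPU: "500m", Memory: "128Mi"},
	}
	fmt.Printf("%+v\n", spec)
}
```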
&lt;h4 id="kubectl">Kubectl&lt;/h4>
&lt;h5 id="server-side-apply">Server-side Apply&lt;/h5>
&lt;p>Since server-side apply is currently in the Alpha phase, it is not
enabled by default on kubectl. To use server-side apply on servers
with the feature, run the command
&lt;code>kubectl apply --experimental-server-side ...&lt;/code>.&lt;/p>
&lt;p>If the feature is not available or enabled on the server, the command
will fail rather than fall back to client-side apply, due to significant
semantic differences.&lt;/p>
&lt;p>As the feature graduates to the Beta phase, the flag will be renamed to &lt;code>--server-side&lt;/code>.&lt;/p>
&lt;p>The long-term plan is for this feature to be the default apply on all
Kubernetes clusters. The semantic differences between server-side
apply and client-side apply will make a smooth roll-out difficult, so
the best way to achieve this has not been decided yet.&lt;/p>
&lt;h4 id="status-wiping">Status Wiping&lt;/h4>
&lt;h5 id="current-behavior">Current Behavior&lt;/h5>
&lt;p>Right before being persisted to etcd, resources in the apiserver undergo a preparation mechanism that is custom for every resource kind.
It takes care of things like incrementing object generation and status wiping.
This happens through &lt;a href="https://github.com/kubernetes/kubernetes/blob/bc1360ab158d524c5a7132c8dd9dc7f7e8889af1/staging/src/k8s.io/apiserver/pkg/registry/rest/update.go#L49"
target="_blank" rel="noopener">PrepareForUpdate&lt;/a>
and &lt;a href="https://github.com/kubernetes/kubernetes/blob/bc1360ab158d524c5a7132c8dd9dc7f7e8889af1/staging/src/k8s.io/apiserver/pkg/registry/rest/create_update.go#L37"
target="_blank" rel="noopener">PrepareForCreate&lt;/a>
.&lt;/p>
&lt;p>The problem with status wiping at this level is that when a user applies a field that gets wiped later on, the field is still recorded as owned by that user.
The apply mechanism (FieldManager) cannot know which fields get wiped for which resource and therefore cannot ignore them.&lt;/p>
&lt;p>Additionally, ignoring status as a whole is not enough, as it should be possible to own status (and other fields) on some occasions. More conversation on this can be found in the &lt;a href="https://github.com/kubernetes/kubernetes/issues/75564"
target="_blank" rel="noopener">GitHub issue&lt;/a>
where the problem got reported.&lt;/p>
&lt;h5 id="proposed-change">Proposed Change&lt;/h5>
&lt;p>Add an interface that resource strategies can implement, to provide field sets affected by status wiping.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// staging/src/k8s.io/apiserver/pkg/registry/rest/rest.go&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ResetFieldsProvider is an optional interface that a strategy can implement&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// to expose a set of fields that get reset before persisting the object.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResetFieldsProvider &lt;span style="color:#a2f;font-weight:bold">interface&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ResetFieldsFor returns a set of fields for the provided version that get reset before persisting the object.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If no fieldset is defined for a version, nil is returned.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">ResetFieldsFor&lt;/span>(version &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>) &lt;span style="color:#666">*&lt;/span>fieldpath.Set
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Additionally, this interface is implemented by &lt;code>registry.Store&lt;/code> which forwards it to the corresponding strategy (if applicable).
If &lt;code>registry.Store&lt;/code> cannot provide a field set, it returns nil.&lt;/p>
&lt;p>An example implementation for the interface inside the pod strategy could be:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// pkg/registry/core/pod/strategy.go&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ResetFieldsFor returns a set of fields for the provided version that get reset before persisting the object.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// If no fieldset is defined for a version, nil is returned.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">func&lt;/span> (podStrategy) &lt;span style="color:#00a000">ResetFieldsFor&lt;/span>(version &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>) &lt;span style="color:#666">*&lt;/span>fieldpath.Set {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> set, ok &lt;span style="color:#666">:=&lt;/span> resetFieldsByVersion[version]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">if&lt;/span> !ok {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">return&lt;/span> &lt;span style="color:#a2f;font-weight:bold">nil&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">return&lt;/span> set
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">var&lt;/span> resetFieldsByVersion = &lt;span style="color:#a2f;font-weight:bold">map&lt;/span>[&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>]&lt;span style="color:#666">*&lt;/span>fieldpath.Set{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;v1&amp;#34;&lt;/span>: fieldpath.&lt;span style="color:#00a000">NewSet&lt;/span>(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> fieldpath.&lt;span style="color:#00a000">MakePathOrDie&lt;/span>(&lt;span style="color:#b44">&amp;#34;status&amp;#34;&lt;/span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When creating the handlers in &lt;a href="https://github.com/kubernetes/kubernetes/blob/3ff0ed46791a821cb7053c1e25192e1ecd67a6f0/staging/src/k8s.io/apiserver/pkg/endpoints/installer.go"
target="_blank" rel="noopener">installer.go&lt;/a>
the current &lt;code>rest.Storage&lt;/code> is checked to implement the &lt;code>ResetFieldsProvider&lt;/code> interface and the result is passed to the FieldManager.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// staging/src/k8s.io/apiserver/pkg/endpoints/installer.go&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">var&lt;/span> resetFields &lt;span style="color:#666">*&lt;/span>fieldpath.Set
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">if&lt;/span> resetFieldsProvider, isResetFieldsProvider &lt;span style="color:#666">:=&lt;/span> storage.(rest.ResetFieldsProvider); isResetFieldsProvider {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> resetFields = resetFieldsProvider.&lt;span style="color:#00a000">ResetFieldsFor&lt;/span>(a.group.GroupVersion.Version)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When provided with a field set, the FieldManager strips all &lt;code>resetFields&lt;/code> from incoming update and apply requests.
This causes the user/manager to not own those fields.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">if&lt;/span> f.resetFields &lt;span style="color:#666">!=&lt;/span> &lt;span style="color:#a2f;font-weight:bold">nil&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> patchObjTyped = patchObjTyped.&lt;span style="color:#00a000">Remove&lt;/span>(f.resetFields)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h5 id="alternatives">Alternatives&lt;/h5>
&lt;p>We looked at a way to get the fields affected by status wiping without defining them separately.
Mainly by pulling the reset logic from the strategies &lt;code>PrepareForCreate&lt;/code> and &lt;code>PrepareForUpdate&lt;/code> methods into a new method &lt;code>ResetFields&lt;/code> implementing an &lt;code>ObjectResetter&lt;/code> interface.&lt;/p>
&lt;p>This approach did not work as expected, because the strategy works on internal types while the FieldManager handles external api types.
The conversion between the two and creating the diff was complex and would have caused a notable amount of allocations.&lt;/p>
&lt;h5 id="implementation-history">Implementation History&lt;/h5>
&lt;ul>
&lt;li>12/2019 &lt;a href="https://github.com/kubernetes/kubernetes/pull/86083"
target="_blank" rel="noopener">#86083&lt;/a>
implementing a poc for the described approach&lt;/li>
&lt;/ul>
&lt;h4 id="api-audit">API Audit&lt;/h4>
&lt;p>The &lt;code>ManagedFields&lt;/code> fields of an object in the API audit log may not be very useful. We want to provide a mechanism
that lets the cluster operator opt in to omitting managed fields from the audit log.&lt;/p>
&lt;p>We propose the following changes to the &lt;code>audit.k8s.io/Policy&lt;/code> API, which give the cluster operator a more
granular way to control the omission of managed fields in the audit log:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> Policy &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> OmitManagedFields &lt;span style="color:#0b0;font-weight:bold">bool&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;omitManagedFields,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> PolicyRule &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> OmitManagedFields &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">bool&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;omitManagedFields,omitempty&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above API changes will be introduced in &lt;code>v1&lt;/code>, &lt;code>v1beta1&lt;/code> and &lt;code>v1alpha1&lt;/code> of &lt;code>audit.k8s.io&lt;/code>.&lt;/p>
&lt;p>A new field &lt;code>OmitManagedFields&lt;/code> is added to both &lt;code>Policy&lt;/code> and &lt;code>PolicyRule&lt;/code> making the following possible:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Policy.OmitManagedFields&lt;/code> sets the default policy for omitting managed fields globally.
&lt;ul>
&lt;li>the default value is &lt;code>false&lt;/code>: managed fields are not omitted, which retains the current behavior.&lt;/li>
&lt;li>a value of &lt;code>true&lt;/code> will omit managed fields from being written to the API audit log unless &lt;code>PolicyRule&lt;/code> overrides.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>PolicyRule.OmitManagedFields&lt;/code> can be used to override the global default for a particular set of request(s),
it has three possible values:
&lt;ul>
&lt;li>&lt;code>nil&lt;/code> (default value): the cluster operator did not specify any value,
the global default specified in &lt;code>Policy.OmitManagedFields&lt;/code> is in effect.&lt;/li>
&lt;li>&lt;code>true&lt;/code>: the cluster operator opted in to omit managed fields for a given set of request(s), and it overrides the global default.&lt;/li>
&lt;li>&lt;code>false&lt;/code>: the cluster operator chose to keep managed fields for a given set of request(s), and it overrides the global default.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
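&lt;p>The override rules above can be sketched in Go as follows. This is a minimal illustration, not the apiserver implementation: the trimmed-down &lt;code>Policy&lt;/code> and &lt;code>PolicyRule&lt;/code> types carry only the proposed field, and the helpers &lt;code>effectiveOmitManagedFields&lt;/code> and &lt;code>boolPtr&lt;/code> are hypothetical names introduced for the example.&lt;/p>

```go
package main

import "fmt"

// Policy is a trimmed-down stand-in that carries only the proposed global default.
type Policy struct {
	OmitManagedFields bool
}

// PolicyRule is a trimmed-down stand-in; a nil pointer means "not specified".
type PolicyRule struct {
	OmitManagedFields *bool
}

// effectiveOmitManagedFields (hypothetical helper) resolves the tri-state:
// an explicit rule-level value wins, otherwise the policy-wide default applies.
func effectiveOmitManagedFields(policy Policy, rule PolicyRule) bool {
	if rule.OmitManagedFields != nil {
		return *rule.OmitManagedFields
	}
	return policy.OmitManagedFields
}

// boolPtr (hypothetical helper) returns a pointer to a copy of b.
func boolPtr(b bool) *bool {
	p := new(bool)
	*p = b
	return p
}

func main() {
	policy := Policy{OmitManagedFields: true}

	// Rule does not specify a value: the global default is in effect.
	fmt.Println(effectiveOmitManagedFields(policy, PolicyRule{}))

	// Rule explicitly keeps managed fields, overriding the global default.
	fmt.Println(effectiveOmitManagedFields(policy, PolicyRule{OmitManagedFields: boolPtr(false)}))
}
```

&lt;p>Using a pointer for the rule-level field is what makes the unset case distinguishable from an explicit &lt;code>false&lt;/code>.&lt;/p>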
&lt;p>This ensures the following:&lt;/p>
&lt;ul>
&lt;li>with an existing &lt;code>Policy&lt;/code> object, the new version of the apiserver will maintain the current behavior, which
is to include managed fields in the audit log&lt;/li>
&lt;li>the cluster operator must opt in to enable omission of managed fields&lt;/li>
&lt;/ul>
&lt;p>Let&amp;rsquo;s look at a few examples:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># omit managed fields for all request and all response bodies&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>audit.k8s.io/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Policy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">omitManagedFields&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">rules&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">level&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>RequestResponse &lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># omit managed fields for all request and all response bodies&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># except for Pod for which we want to include managed fields in audit log&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>audit.k8s.io/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Policy&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">omitManagedFields&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">rules&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">level&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>RequestResponse&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">omitManagedFields&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">false&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">resources&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;pods&amp;#34;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">level&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>RequestResponse&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting alpha to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/features/kube_features.go#L100"
target="_blank" rel="noopener">ServerSideApply&lt;/a>
&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Does enabling the feature change any default behavior?&lt;/strong>&lt;/p>
&lt;p>While this changes how objects are modified and then stored in the database, all the changes should be strictly backward compatible, and shouldn’t break existing automation or users. The increase in size can possibly have adverse, surprising consequences including increased memory usage for controllers, increased bandwidth usage when fetching objects, bigger objects when displaying for users (kubectl get -o yaml). We’re trying to mitigate all of these with the addition of a new header.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?&lt;/strong>
Also set &lt;code>disable-supported&lt;/code> to &lt;code>true&lt;/code> or &lt;code>false&lt;/code> in &lt;code>kep.yaml&lt;/code>.
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).&lt;/p>
&lt;p>Yes. The consequence is that managed fields will be reset for server-side applied objects (requiring a read/write cycle on the impacted resources).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What happens if we re-enable the feature if it was previously rolled back?&lt;/strong>&lt;/p>
&lt;p>The feature will be restored. Server-side applied objects will have lost their “set” which may cause some surprising behavior (fields might not be removed as expected).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.&lt;/p>
&lt;p>Tests are in place for upgrading from client side to server side apply and vice versa.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can a rollout fail? Can it impact already running workloads?&lt;/strong>
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
&lt;/p>
&lt;p>There is no specific way that the rollout can fail. The rollout can&amp;rsquo;t impact existing workloads.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What specific metrics should inform a rollback?&lt;/strong>&lt;/p>
&lt;p>The feature shouldn&amp;rsquo;t affect any existing behavior. A surprisingly high number of modification rejections could be a sign that something is not working properly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/strong>&lt;/p>
&lt;p>Because the feature doesn&amp;rsquo;t affect existing behavior, rollback and upgrades haven&amp;rsquo;t been specifically tested.
The feature is being used by the cluster role aggregator though. Upgrading/downgrading/upgrading, which
could result in the managedFields being removed, wouldn&amp;rsquo;t cause any problems since the &lt;code>Rules&lt;/code> field
filled by the controller is &lt;code>atomic&lt;/code>, and thus doesn&amp;rsquo;t depend on the current state of the managedFields.&lt;/p>
&lt;p>The new &lt;code>managedFields&lt;/code> field is cleared when it is incorrect. That protects us from having invalid data inserted by a potential bad upgrade.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can an operator determine if the feature is in use by workloads?&lt;/strong>
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.&lt;/p>
&lt;p>Any existing metric split by request verb will record the &lt;a href="https://github.com/kubernetes/kubernetes/blob/8f6ffb24df989608b87451f89b8ac9fc338ed71c/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L507-L509"
target="_blank" rel="noopener">APPLY&lt;/a>
verb if the feature is in use.&lt;/p>
&lt;p>Additionally, the OpenAPI spec exposes the available media types for each individual endpoint. The presence of the &lt;code>apply&lt;/code> type for the PATCH verb of an endpoint indicates whether the feature is enabled for that specific resource, for example:&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-json" data-lang="json">&lt;span style="display:flex;">&lt;span>&lt;span style="">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b44">&amp;#34;patch&amp;#34;&lt;/span>&lt;span style="">:&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#008000;font-weight:bold">&amp;#34;consumes&amp;#34;&lt;/span>: [
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;application/json-patch+json&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;application/merge-patch+json&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;application/strategic-merge-patch+json&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;application/apply-patch+yaml&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="">...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>&lt;strong>What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?&lt;/strong>&lt;/p>
&lt;p>There is no specific metric attached to server-side apply. All PATCH requests that use SSA are recorded with the verb APPLY in metrics, so API server metrics that are split by verb include them automatically. These include &lt;code>apiserver_request_total&lt;/code>, &lt;code>apiserver_longrunning_gauge&lt;/code>, &lt;code>apiserver_response_sizes&lt;/code>, &lt;code>apiserver_request_terminations_total&lt;/code>, and &lt;code>apiserver_selfrequest_total&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>Components exposing the metric: kube-apiserver&lt;/li>
&lt;/ul>
&lt;p>Apply requests (&lt;code>PATCH&lt;/code> with &lt;code>application/apply-patch+yaml&lt;/code> mime type) have the same level of SLIs as other types of requests.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/strong>
Apply requests (&lt;code>PATCH&lt;/code> with &lt;code>application/apply-patch+yaml&lt;/code> mime type) have the same level of SLOs as other types of requests.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any missing metrics that would be useful to have to improve observability
of this feature?&lt;/strong> n/a&lt;/p>
&lt;/li>
&lt;/ul>
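&lt;p>The OpenAPI check described in this section can be automated. The sketch below assumes the caller has already extracted the JSON path item for a single endpoint from the published OpenAPI v2 document; the types &lt;code>operation&lt;/code> and &lt;code>pathItem&lt;/code> and the function &lt;code>supportsApply&lt;/code> are hypothetical names for this example and model only the fields it needs.&lt;/p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyPatchMediaType is the media type used by server-side apply requests.
const applyPatchMediaType = "application/apply-patch+yaml"

// operation models only the "consumes" list of an OpenAPI v2 operation.
type operation struct {
	Consumes []string `json:"consumes"`
}

// pathItem models only the "patch" operation of an OpenAPI v2 path item.
type pathItem struct {
	Patch *operation `json:"patch"`
}

// supportsApply (hypothetical helper) reports whether the PATCH operation of
// the given path item lists the apply media type among its consumed types.
func supportsApply(pathItemJSON []byte) (bool, error) {
	item := new(pathItem)
	if err := json.Unmarshal(pathItemJSON, item); err != nil {
		return false, err
	}
	if item.Patch == nil {
		// The endpoint has no PATCH operation at all.
		return false, nil
	}
	for _, mediaType := range item.Patch.Consumes {
		if mediaType == applyPatchMediaType {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	doc := []byte(`{"patch": {"consumes": ["application/json-patch+json", "application/apply-patch+yaml"]}}`)
	enabled, err := supportsApply(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("apply enabled:", enabled)
}
```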
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Does this feature depend on any specific services running in the cluster?&lt;/strong> No&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong> No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>
No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new calls to the cloud
provider?&lt;/strong> No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing size or count of
the existing API objects?&lt;/strong> Objects applied using server side apply will have their managed fields metadata populated. &lt;code>managedFields&lt;/code> metadata fields can represent up to 60% of the total size of an object, increasing the size of objects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?&lt;/strong> No&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong> Since objects are larger with the new &lt;code>managedFields&lt;/code>, cache sizes and network bandwidth requirements will increase.&lt;/p>
&lt;/li>
&lt;/ul>
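&lt;p>To gauge the size impact on a particular object, one option is to compare the encoded size of &lt;code>metadata.managedFields&lt;/code> with the size of the whole object. The sketch below is an illustration only: it works on an object that has already been fetched as JSON, measures compact JSON rather than the storage encoding, and &lt;code>managedFieldsShare&lt;/code> and &lt;code>objectEnvelope&lt;/code> are hypothetical names.&lt;/p>

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// objectEnvelope captures only the raw bytes of metadata.managedFields.
type objectEnvelope struct {
	Metadata struct {
		ManagedFields json.RawMessage `json:"managedFields"`
	} `json:"metadata"`
}

// managedFieldsShare (hypothetical helper) returns the fraction of the
// object's compact JSON size that is taken up by metadata.managedFields.
// It returns 0 when the field is absent.
func managedFieldsShare(objectJSON []byte) (float64, error) {
	envelope := new(objectEnvelope)
	if err := json.Unmarshal(objectJSON, envelope); err != nil {
		return 0, err
	}
	if len(envelope.Metadata.ManagedFields) == 0 {
		return 0, nil
	}

	// Compact both encodings so that whitespace does not skew the ratio.
	whole := new(bytes.Buffer)
	if err := json.Compact(whole, objectJSON); err != nil {
		return 0, err
	}
	managed := new(bytes.Buffer)
	if err := json.Compact(managed, envelope.Metadata.ManagedFields); err != nil {
		return 0, err
	}
	return float64(managed.Len()) / float64(whole.Len()), nil
}

func main() {
	obj := []byte(`{
		"metadata": {
			"name": "example",
			"managedFields": [{"manager": "kubectl", "operation": "Apply"}]
		},
		"data": {"key": "value"}
	}`)
	share, err := managedFieldsShare(obj)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("managedFields share: %.2f\n", share)
}
```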
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>The Troubleshooting section currently serves the &lt;code>Playbook&lt;/code> role. We may consider
splitting it into a dedicated &lt;code>Playbook&lt;/code> document (potentially with some monitoring
details). For now, we leave it here.&lt;/p>
&lt;p>&lt;em>This section must be completed when targeting beta graduation to a release.&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How does this feature react if the API server and/or etcd is unavailable?&lt;/strong>&lt;/p>
&lt;p>The feature is part of the API server and will not function without it.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are other known failure modes?&lt;/strong>
For each of them, fill in the following information by copying the below template:&lt;/p>
&lt;ul>
&lt;li>[Failure mode brief description]
&lt;ul>
&lt;li>Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node? Apply requests (&lt;code>PATCH&lt;/code> with &lt;code>application/apply-patch+yaml&lt;/code> mime type) have the same level of SLIs as other types of requests.&lt;/li>
&lt;li>Mitigations: What can be done to stop the bleeding, especially for already
running user workloads? This shouldn&amp;rsquo;t affect running workloads, and this feature shouldn&amp;rsquo;t alter the behavior of previously existing mechanisms like PATCH and PUT.&lt;/li>
&lt;li>Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue? The feature uses very little logging, and errors should be returned directly to the user.&lt;/li>
&lt;li>Testing: Are there any tests for failure mode? Failure modes are tested exhaustively both as unit-tests and as integration tests.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What steps should be taken if SLOs are not being met to determine the problem?&lt;/strong> n/a&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>We used a feature branch to ensure that no partial state of this feature would
be in master. We developed the new &amp;ldquo;business logic&amp;rdquo; in a
&lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff"
target="_blank" rel="noopener">separate repo&lt;/a>
for
velocity and reusability.&lt;/p>
&lt;h3 id="testing-plan">Testing Plan&lt;/h3>
&lt;p>The specific logic of apply will be tested by extensive unit tests in the
&lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff"
target="_blank" rel="noopener">structured merge and diff&lt;/a>
repo. The integration between that repo and kubernetes/kubernetes will mainly
be tested by integration tests in &lt;a href="https://github.com/kubernetes/kubernetes/tree/master/test/integration/apiserver/apply"
target="_blank" rel="noopener">test/integration/apiserver/apply&lt;/a>
and &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/cmd/apply.sh"
target="_blank" rel="noopener">test/cmd&lt;/a>
,
as well as unit tests where applicable. The feature will also be enabled in the
&lt;a href="https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features"
target="_blank" rel="noopener">alpha-features e2e test suite&lt;/a>
,
which runs every hour and every time someone types &lt;code>/test pull-kubernetes-e2e-gce-alpha-features&lt;/code>
on a PR. This will ensure that the cluster can still start up and the other
endpoints will function normally when the feature is enabled.&lt;/p>
&lt;p>Unit Tests in &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff"
target="_blank" rel="noopener">structured merge and diff&lt;/a>
repo for:&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Merge typed objects of the same type with a schema. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/typed/merge_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Merge deduced typed objects without a schema (for CRDs). &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/typed/deduced_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Convert a typed value to a field set. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/typed/toset_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Diff two typed values. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/typed/symdiff_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Validate a typed value against its schema. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/typed/validate_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Get correct conflicts when applying. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/conflict_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works for deduced typed objects. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/deduced_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works for leaf fields with scalar values. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/leaf_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works for items in associative lists of scalars. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/set_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works for items in associative lists with keys. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/key_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works for nested schemas, including recursive schemas. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/nested_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works for multiple appliers. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/9f6585cadf64c6b61b5a75bde69ba07d5d34dc3f/merge/multiple_appliers_test.go#L31-L685"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works when the object conversion changes value of map keys. &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/9f6585cadf64c6b61b5a75bde69ba07d5d34dc3f/merge/multiple_appliers_test.go#L687-L886"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works when unknown/obsolete versions are present in managedFields (for when APIs are deprecated). &lt;a href="https://github.com/kubernetes-sigs/structured-merge-diff/blob/master/merge/obsolete_versions_test.go"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>Unit Tests for:&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply strips certain fields (like name and namespace) from managers. &lt;a href="https://github.com/kubernetes/kubernetes/blob/8a6a2883f9a38e09ae941b62c14f4e68037b2d21/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/fieldmanager_test.go#L69-L139"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> ManagedFields API can be round tripped through the structured-merge-diff format. &lt;a href="https://github.com/kubernetes/kubernetes/blob/4394bf779800710e67beae9bddde4bb5425ce039/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/managedfields_test.go#L30-L156"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Manager identifiers passed to structured-merge-diff are encoded as json. &lt;a href="https://github.com/kubernetes/kubernetes/blob/4394bf779800710e67beae9bddde4bb5425ce039/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/managedfields_test.go#L158-L202"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Managers will be sorted by operation, then timestamp, then manager name. &lt;a href="https://github.com/kubernetes/kubernetes/blob/4394bf779800710e67beae9bddde4bb5425ce039/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/managedfields_test.go#L204-L304"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Conflicts will be returned as readable status errors. &lt;a href="https://github.com/kubernetes/kubernetes/blob/69b9167dcbc8eea2ca5653fa42584539920a1fd4/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/conflict_test.go#L31-L106"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Fields API can be round tripped through the structured-merge-diff format. &lt;a href="https://github.com/kubernetes/kubernetes/blob/0e1d50e70fdc9ed838d75a7a1abbe5fa607d22a1/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/fields_test.go#L29-L57"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Fields API conversion to and from the structured-merge-diff format catches errors. &lt;a href="https://github.com/kubernetes/kubernetes/blob/0e1d50e70fdc9ed838d75a7a1abbe5fa607d22a1/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/fields_test.go#L59-L109"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Path elements can be round tripped through the structured-merge-diff format. &lt;a href="https://github.com/kubernetes/kubernetes/blob/6b2e4682fe883eebcaf1c1e43cf2957dde441174/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/pathelement_test.go#L21-L54"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Path element conversion will ignore unknown qualifiers. &lt;a href="https://github.com/kubernetes/kubernetes/blob/6b2e4682fe883eebcaf1c1e43cf2957dde441174/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/pathelement_test.go#L56-L61"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Path element conversion will fail if a known qualifier&amp;rsquo;s value is invalid. &lt;a href="https://github.com/kubernetes/kubernetes/blob/6b2e4682fe883eebcaf1c1e43cf2957dde441174/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/pathelement_test.go#L63-L84"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Can convert both built-in objects and CRDs to structured-merge-diff typed objects. &lt;a href="https://github.com/kubernetes/kubernetes/blob/42aba643290c19a63168513bd758822e8014a0fd/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/typeconverter_test.go#L40-L135"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Can convert structured-merge-diff typed objects between API versions. &lt;a href="https://github.com/kubernetes/kubernetes/blob/0e1d50e70fdc9ed838d75a7a1abbe5fa607d22a1/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/fieldmanager/internal/versionconverter_test.go#L32-L69"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>Integration tests for:&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Creating an object with apply works with default and custom storage implementations. &lt;a href="https://github.com/kubernetes/kubernetes/blob/1b8c8f1daf4b1ed6d17ee1d2f40d62c8ecec0e15/test/integration/apiserver/apply/apply_test.go#L55-L121"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Create is blocked on apply if uid is provided. &lt;a href="https://github.com/kubernetes/kubernetes/blob/1b8c8f1daf4b1ed6d17ee1d2f40d62c8ecec0e15/test/integration/apiserver/apply/apply_test.go#L123-L154"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply has conflicts when changing fields set by Update, and is able to force. &lt;a href="https://github.com/kubernetes/kubernetes/blob/1b8c8f1daf4b1ed6d17ee1d2f40d62c8ecec0e15/test/integration/apiserver/apply/apply_test.go#L156-L239"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> There are no changes to the managedFields API. &lt;a href="https://github.com/kubernetes/kubernetes/blob/1b8c8f1daf4b1ed6d17ee1d2f40d62c8ecec0e15/test/integration/apiserver/apply/apply_test.go#L241-L341"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> ManagedFields has no entries for managers who manage no fields. &lt;a href="https://github.com/kubernetes/kubernetes/blob/1b8c8f1daf4b1ed6d17ee1d2f40d62c8ecec0e15/test/integration/apiserver/apply/apply_test.go#L343-L392"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Apply works with custom resources. &lt;a href="https://github.com/kubernetes/kubernetes/blob/b55417f429353e1109df8b3bfa2afc8dbd9f240b/staging/src/k8s.io/apiextensions-apiserver/test/integration/apply_test.go#L34-L117"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Run kubectl apply tests with server-side flag enabled. &lt;a href="https://github.com/kubernetes/kubernetes/blob/81e6407393aa46f2695e71a015f93819f1df424c/test/cmd/apply.sh#L246-L314"
target="_blank" rel="noopener">link&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>E2E and Conformance tests will be added for GA.&lt;/p>
&lt;h2 id="graduation-criteria">Graduation Criteria&lt;/h2>
&lt;p>An alpha version of this is targeted for 1.14.&lt;/p>
&lt;p>This can be promoted to beta when it is a drop-in replacement for the existing
kubectl apply, and has no regressions (which aren&amp;rsquo;t bug fixes). This KEP will be
updated when we know the concrete things changing for beta.&lt;/p>
&lt;p>A GA version of this is targeted for 1.22.&lt;/p>
&lt;ul>
&lt;li>E2E tests are created and graduate to conformance&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2144-clientgo-apply"
target="_blank" rel="noopener">Apply for client-go&amp;rsquo;s typed client&lt;/a>
is implemented and at least one kube-controller-manager uses that client&lt;/li>
&lt;li>Outstanding bugs around status wiping and scale subresource are fixed&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;h4 id="upgrade-from-kubectl-client-side-to-server-side-apply">Upgrade from kubectl Client-Side to Server-Side Apply&lt;/h4>
&lt;p>With client-side &lt;code>kubectl apply&lt;/code>, the annotation
&lt;code>kubectl.kubernetes.io/last-applied-configuration&lt;/code> tracks ownership for a single
shared field manager. With server-side &lt;code>kubectl apply --server-side&lt;/code>, the
&lt;code>.metadata.managedFields&lt;/code> field tracks ownership for multiple field managers.&lt;/p>
&lt;p>Users who wish to start using server-side apply for objects managed with
client-side apply would encounter a field manager conflict: the field set that
the user now wants to manage with server-side apply will be owned by the
client-side apply field manager.&lt;/p>
&lt;p>If we don&amp;rsquo;t specifically handle this case, then users would need to force
conflicts with &lt;code>kubectl apply --server-side --force-conflicts&lt;/code>. This extra step
is not desirable for users who wish to onboard to server-side apply.&lt;/p>
&lt;p>However, we know that the user&amp;rsquo;s intent when upgrading is to take ownership of the
client-side apply fields, so we can do this for them and avoid the conflict.&lt;/p>
&lt;h5 id="avoiding-conflicts-from-client-side-apply-to-server-side-apply">Avoiding Conflicts from Client-Side Apply to Server-Side Apply&lt;/h5>
&lt;p>We&amp;rsquo;ll use the &lt;code>kubectl&lt;/code> user-agent and the client-side apply
&lt;code>last-applied-configuration&lt;/code> annotation to identify when to do the upgrade.&lt;/p>
&lt;p>When server-side apply is run with &lt;code>kubectl apply --server-side&lt;/code> on an object
with a &lt;code>last-applied-configuration&lt;/code> annotation for client-side apply, then the
annotation will be upgraded to the managed fields server-side apply notation.&lt;/p>
&lt;p>To upgrade the &lt;code>last-applied-configuration&lt;/code> annotation, the following procedure
will be used.&lt;/p>
&lt;ol>
&lt;li>Identify if the server-side apply is from the &lt;code>kubectl&lt;/code> user-agent&lt;/li>
&lt;li>Identify if the server-side apply would result in a conflict&lt;/li>
&lt;li>Create a fieldset from the &lt;code>last-applied-configuration&lt;/code> annotation.&lt;/li>
&lt;li>Remove all fields from the &lt;code>last-applied-configuration&lt;/code> annotation that are
added, missing, or different than the corresponding field of the live
object. Because the fields have changed, client-side apply does not own
them.&lt;/li>
&lt;li>Compare the &amp;ldquo;last-applied&amp;rdquo; fieldset to the conflict fieldset. Take the
difference as the new conflict fieldset. If the conflict fieldset is empty,
then the conflicts are allowed and we force the server-side apply. If the
conflict fieldset is not empty, then return the conflict fieldset.&lt;/li>
&lt;/ol>
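&lt;p>Steps 3 to 5 of the procedure above can be sketched as follows. This is a minimal illustration rather than the API server implementation: objects are assumed to be flattened into a mapping from field path to value, and the function name &lt;code>remaining_conflicts&lt;/code> is hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python">def remaining_conflicts(last_applied, live, conflicts):
    """Return the conflicts that client-side apply ownership does not explain.

    last_applied and live map field paths to values (an assumed, flattened
    representation); conflicts is the set of conflicting field paths.
    """
    # Step 4: drop last-applied fields that are missing from, or differ from,
    # the live object. Client-side apply no longer owns those fields.
    owned = {
        path
        for path, value in last_applied.items()
        if path in live and live[path] == value
    }
    # Step 5: the new conflict fieldset is the difference between the
    # conflict fieldset and the fields client-side apply still owns.
    return set(conflicts).difference(owned)


last_applied = {"spec.replicas": 3, "spec.image": "nginx:1.0"}
live = {"spec.replicas": 5, "spec.image": "nginx:1.0"}

# Only spec.image conflicts, and client-side apply still owns it.
print(remaining_conflicts(last_applied, live, {"spec.image"}))
# spec.replicas was changed on the live object, so that conflict remains.
print(remaining_conflicts(last_applied, live, {"spec.image", "spec.replicas"}))
&lt;/code>&lt;/pre>
&lt;p>An empty result corresponds to the case where the server-side apply is forced; a non-empty result is the conflict fieldset returned to the user.&lt;/p>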
&lt;h4 id="downgrade-from-kubectl-server-side-to-client-side-apply">Downgrade from kubectl Server-Side to Client-Side Apply&lt;/h4>
&lt;p>Client-side &lt;code>kubectl apply&lt;/code> users can incrementally upgrade to a version of
&lt;code>kubectl&lt;/code> that can send a server-side apply.&lt;/p>
&lt;p>We can sync the intent between server-side and client-side apply by keeping the
&lt;code>last-applied-configuration&lt;/code> annotation up-to-date with the &lt;code>.managedFields&lt;/code>
field.&lt;/p>
&lt;p>Client-side apply will continue to work.&lt;/p>
&lt;h4 id="downgrade-the-api-server">Downgrade the API Server&lt;/h4>
&lt;p>When the API server is downgraded and server-side apply is disabled, the
&lt;code>.metadata.managedFields&lt;/code> field will be cleared, since the API server doesn&amp;rsquo;t
know about this field. A server-side apply will then fail with a content-type unknown
error.&lt;/p>
&lt;p>A client-side apply would succeed because the &lt;code>last-applied-configuration&lt;/code>
annotation is preserved and up-to-date as described in the downgrade above.&lt;/p>
&lt;h2 id="implementation-history-1">Implementation History&lt;/h2>
&lt;ul>
&lt;li>Early 2018: @lavalamp begins thinking about apply and writing design docs&lt;/li>
&lt;li>2018Q3: Design shift from merge + diff to tracking field managers.&lt;/li>
&lt;li>2019Q1: Alpha.&lt;/li>
&lt;li>2019Q3: Beta.&lt;/li>
&lt;/ul>
&lt;p>(For more details, one can view the apply-wg recordings, or join the mailing list
and view the meeting notes. TODO: links)&lt;/p>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>Why should this KEP &lt;em>not&lt;/em> be implemented: many bugs in kubectl apply will go
away. Users might be depending on the bugs.&lt;/p>
&lt;h2 id="alternatives-1">Alternatives&lt;/h2>
&lt;p>It&amp;rsquo;s our belief that all routes to fixing the user pain involve
centralizing this functionality in the control plane.&lt;/p></description></item><item><title>Resources: Appropriate use of node-role labels</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1143/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1143/</guid><description>
&lt;h1 id="appropriate-use-of-node-role-labels">Appropriate use of node-role labels&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#use-of-node-rolekubernetesio-labels"
>Use of &lt;code>node-role.kubernetes.io/*&lt;/code> labels&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#current-users-of-node-rolekubernetesio-within-the-project-that-must-change"
>Current users of &lt;code>node-role.kubernetes.io/*&lt;/code> within the project that must change&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#service-load-balancer"
>Service load-balancer&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#node-controller-excludes-master-nodes-from-consideration-for-eviction"
>Node controller excludes master nodes from consideration for eviction&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#kubernetes-e2e-tests"
>Kubernetes e2e tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#preventing-accidental-reintroduction"
>Preventing accidental reintroduction&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#migrating-existing-deployments"
>Migrating existing deployments&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#instructions-for-deployers"
>Instructions for deployers&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature enablement and rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#future-work"
>Future work&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#reference"
>Reference&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>&lt;strong>ACTION REQUIRED:&lt;/strong> In order to merge code into a release, there must be an issue in &lt;a href="https://github.com/kubernetes/enhancements/issues"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
referencing this KEP and targeting a release milestone &lt;strong>before &lt;a href="https://github.com/kubernetes/sig-release/tree/master/releases"
target="_blank" rel="noopener">Enhancement Freeze&lt;/a>
of the targeted release&lt;/strong>.&lt;/p>
&lt;p>These checklist items &lt;em>must&lt;/em> be updated for the enhancement to be released.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> kubernetes/enhancements issue in release milestone, which links to KEP: &lt;a href="https://github.com/kubernetes/enhancements/issues/1143"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/1143&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> KEP approvers have set the KEP status to &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Test plan is in place, giving consideration to SIG Architecture and SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://github.com/kubernetes/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note:&lt;/strong> Any PRs to move a KEP to &lt;code>implementable&lt;/code> or significant changes once it is marked &lt;code>implementable&lt;/code> should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Clarify that the &lt;code>node-role.kubernetes.io/*&lt;/code> label is for use only by users and external projects and may not be used to vary Kubernetes behavior. Define migration process for all internal consumers of these labels.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>The &lt;code>node-role.kubernetes.io/master&lt;/code> and the broader &lt;code>node-role.kubernetes.io&lt;/code> namespace for labels were introduced to provide a simple organizational and grouping convention for cluster users. The labels were reserved solely for organizing nodes via a convention that tools could recognize to display information to end users, and for use by opinionated external tooling that wished to simplify topology concepts. Use of the label by components within the Kubernetes project (those projects subject to API review) was restricted. Specifically, no project could mandate the use of those labels in a conformant distribution, since we anticipated that many deployments of Kubernetes would have more nuanced control-plane topologies than simply &amp;ldquo;a control plane node&amp;rdquo;.&lt;/p>
&lt;p>Over time, several changes to Kubernetes core and related projects were introduced that depended on the &lt;code>node-role.kubernetes.io/master&lt;/code> label to vary their behavior in contravention of the guidance the label was approved under. This was unintentional and due to unclear reviewer guidelines that have since been more strictly enforced. Likewise, the complexity of Kubernetes deployments has increased and the simplistic mapping of control plane concepts to a node has proven to limit the ability of conformant Kubernetes distributions to self-host, as anticipated. The lack of clarity in how to use node-role and the disjoint mechanisms within the code has been a point of confusion for contributors that we wish to remove.&lt;/p>
&lt;p>Finally, we wish to clarify that external components may use node-role tolerations and labels as they wish as long as they are cognizant that not all conformant distributions will expose or allow those tolerations or labels to be set.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>This KEP:&lt;/p>
&lt;ul>
&lt;li>Clarifies that the use of the &lt;code>node-role.kubernetes.io/*&lt;/code> label namespace is reserved solely for end-user and external Kubernetes consumers, and:
&lt;ul>
&lt;li>Must not be used to vary behavior within Kubernetes projects that are subject to API review (kubernetes/kubernetes and all components that expose APIs under the &lt;code>*.k8s.io&lt;/code> namespace)&lt;/li>
&lt;li>Must not be required to be present for a cluster to be conformant&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Describes the locations within Kubernetes that must be changed to use an alternative mechanism for behavior
&lt;ul>
&lt;li>Suggests approaches for each location to migrate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Describes the timeframe and migration process for Kubernetes distributions and deployments to update labels&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="use-of-node-rolekubernetesio-labels">Use of &lt;code>node-role.kubernetes.io/*&lt;/code> labels&lt;/h3>
&lt;ul>
&lt;li>Kubernetes components MUST NOT set or alter behavior on any label within the &lt;code>node-role.kubernetes.io/*&lt;/code> namespace.&lt;/li>
&lt;li>Kubernetes components (such as &lt;code>kubectl&lt;/code>) MAY simplify the display of &lt;code>node-role.kubernetes.io/*&lt;/code> labels to convey the node roles of a node&lt;/li>
&lt;li>Kubernetes examples and documentation MUST NOT leverage the node-role labels for node placement&lt;/li>
&lt;li>External users, administrators, conformant Kubernetes distributions, and extensions MAY use &lt;code>node-role.kubernetes.io/*&lt;/code> without reservation
&lt;ul>
&lt;li>Extensions are recommended not to vary behavior based on node-role, but MAY do so as they wish&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>First party components like &lt;code>kubeadm&lt;/code> MAY use node-roles to simplify their own deployment mechanisms.&lt;/li>
&lt;li>Conformance tests MUST NOT depend on the node-role labels in any fashion&lt;/li>
&lt;li>Ecosystem controllers that desire to be placed on the masters MAY tolerate the node-role master taint or set nodeSelector to the master nodes in order to be placed, but SHOULD recognize that some deployment models will not have these node-roles, or may prohibit deployments that attempt to schedule to masters as unprivileged users. In general we recommend limiting this sort of placement rule to examples, docs, or simple deployment configurations rather than embedding the logic in code.&lt;/li>
&lt;/ul>
&lt;h3 id="current-users-of-node-rolekubernetesio-within-the-project-that-must-change">Current users of &lt;code>node-role.kubernetes.io/*&lt;/code> within the project that must change&lt;/h3>
&lt;p>The following components vary behavior based on the presence of the node-role labels:&lt;/p>
&lt;h4 id="service-load-balancer">Service load-balancer&lt;/h4>
&lt;p>The service load balancer implementation previously implemented a heuristic where &lt;code>node-role.kubernetes.io/master&lt;/code> is used to exclude masters from the candidate nodes for a service. This is an implementation detail of the cluster and is not allowed. Since there is value in excluding nodes from service load balancer candidacy in some deployments, an alpha feature gated label &lt;code>alpha.service-controller.kubernetes.io/exclude-balancer&lt;/code> was added in Kubernetes 1.9.&lt;/p>
&lt;p>This label should be moved to beta in Kube 1.19 at its final name &lt;code>node.kubernetes.io/exclude-from-external-load-balancers&lt;/code>, its feature gate &lt;code>ServiceNodeExclusion&lt;/code> should default on in 1.19, the gate &lt;code>ServiceNodeExclusion&lt;/code> should be declared GA in 1.21, and the gate will be removed in 1.22. The old alpha label should be honored in 1.21 and removed in 1.22.&lt;/p>
&lt;p>Starting in 1.16, the legacy code block should be gated on &lt;code>LegacyNodeRoleBehavior=true&lt;/code>.&lt;/p>
&lt;h4 id="node-controller-excludes-master-nodes-from-consideration-for-eviction">Node controller excludes master nodes from consideration for eviction&lt;/h4>
&lt;p>The &lt;code>k8s.io/kubernetes/pkg/util/system/IsMasterNode(nodeName)&lt;/code> function is used by the NodeLifecycleController to exclude nodes with a node name that ends in &lt;code>master&lt;/code> or starts with &lt;code>master-&lt;/code> when considering whether to mark nodes as disrupted. A recent PR attempted to change this to use node-roles and was blocked. Instead, the controller should be updated to use a label &lt;code>node.kubernetes.io/exclude-disruption&lt;/code> to decide whether to exclude nodes from being considered for disruption handling.&lt;/p>
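&lt;p>As a rough illustration of why a name-based heuristic is fragile, the sketch below follows the prose description above rather than the actual Go source, and contrasts it with an explicit label check. The function names are hypothetical, and treating the mere presence of the label as opting in is an assumption.&lt;/p>
&lt;pre>&lt;code class="language-python">EXCLUDE_DISRUPTION_LABEL = "node.kubernetes.io/exclude-disruption"


def is_master_by_name(node_name):
    # Heuristic as described in the text: it looks at the node name only.
    return node_name.endswith("master") or node_name.startswith("master-")


def is_excluded_by_label(node_labels):
    # Proposed replacement: the node carries an explicit label.
    # Assumption: presence of the label is enough, whatever its value.
    return EXCLUDE_DISRUPTION_LABEL in node_labels


# An ordinary worker whose name happens to end in "master".
print(is_master_by_name("build-webmaster"))
# The same node is only excluded if an administrator labels it.
print(is_excluded_by_label({"kubernetes.io/hostname": "build-webmaster"}))
&lt;/code>&lt;/pre>
&lt;p>With the label-based check, exclusion becomes an explicit decision by the deployer instead of a side effect of how nodes are named.&lt;/p>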
&lt;h4 id="kubernetes-e2e-tests">Kubernetes e2e tests&lt;/h4>
&lt;p>The e2e tests use a number of heuristics including the &lt;code>IsMasterNode(nodeName)&lt;/code> function and the node-roles labels to select nodes. In order for conformant Kubernetes clusters to run the tests, the e2e suite must change to use individual user-provided label selectors to identify nodes to test, nodes that have special rules for testing unusual cases, and for other selection behaviors. The label selectors may be defaulted by the test code to their current values, as long as a conformant cluster operator can execute the e2e suite against an arbitrary cluster.&lt;/p>
&lt;p>The &lt;code>IsMasterNode()&lt;/code> method will be moved to be test specific, identified as deprecated, and will be removed as soon as possible.&lt;/p>
&lt;p>QUESTION: Is a single label selector sufficient to identify nodes to test?&lt;/p>
&lt;h4 id="preventing-accidental-reintroduction">Preventing accidental reintroduction&lt;/h4>
&lt;p>In order to prevent reviewers from accidentally allowing code changes that leverage this functionality, we should clarify the Godoc of the constant to limit its use. A lint process could be run as part of verify that requires approval of a small list to modify exclusions (currently only cmd/kubeadm will be allowed to use that constant, with all test functions being abstracted). The review doc should call out that labels must be scoped to a particular feature enablement vs being broad.&lt;/p>
&lt;p>Some components like the external cloud provider controllers (considered to fall within these rules due to implementing k8s.io APIs) may be vulnerable to accidental assumptions about topology - code review and e2e tests are our primary mechanism to prevent regression.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="migrating-existing-deployments">Migrating existing deployments&lt;/h3>
&lt;p>The proposed fixes will all require deployment-level changes. These changes must be staged across several releases, and it should be possible for deployers to move early and &amp;ldquo;fix&amp;rdquo; the issues that may be caused by their topology.&lt;/p>
&lt;p>Therefore, for each change we recommend the following process to adopt the new labels in successive releases:&lt;/p>
&lt;ul>
&lt;li>Release 1 (1.16):
&lt;ul>
&lt;li>Introduce a feature gate for disabling node-role being honored. The gate defaults to on. &lt;code>LegacyNodeRoleBehavior=true&lt;/code>&lt;/li>
&lt;li>Define the new node label with an associated feature gate for each feature area. The gate defaults to off. &lt;code>ServiceNodeExclusion=false&lt;/code> and &lt;code>NodeDisruptionExclusion=false&lt;/code>&lt;/li>
&lt;li>Behavior for each functional area is defined as &lt;code>(LegacyNodeRoleBehavior == on &amp;amp;&amp;amp; node_has_role) || (FeatureGate == on &amp;amp;&amp;amp; node_has_label)&lt;/code>&lt;/li>
&lt;li>No new components may leverage node-roles within Kubernetes projects.&lt;/li>
&lt;li>Early adopters may label their nodes to opt in to the features, even in the absence of the gate.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Release 2 (1.17):
&lt;ul>
&lt;li>The legacy alpha label &lt;code>alpha.service-controller.kubernetes.io/exclude-balancer&lt;/code> is marked as deprecated&lt;/li>
&lt;li>Deprecation of node role behavior in tree is announced for 1.21, with a detailed plan for cluster administrators and deployers&lt;/li>
&lt;li>Gates are officially alpha&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Release 3 (1.19):
&lt;ul>
&lt;li>The old label &lt;code>alpha.service-controller.kubernetes.io/exclude-balancer&lt;/code> is removed&lt;/li>
&lt;li>For both labels, usage is reviewed and as appropriate the label is declared beta/GA and the feature gate is set on&lt;/li>
&lt;li>All Kubernetes deployments should be updated to add node labels as appropriate: &lt;code>kubectl label nodes -l node-role.kubernetes.io/master LABEL_A=VALUE_A&lt;/code>&lt;/li>
&lt;li>Documentation will be provided on making the transition&lt;/li>
&lt;li>Deployments may set &lt;code>LegacyNodeRoleBehavior=false&lt;/code> after they have set the appropriate labels.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Release 4 (1.21):
&lt;ul>
&lt;li>Default the legacy gate &lt;code>LegacyNodeRoleBehavior&lt;/code> to off. Admins whose deployments still use the old labels may set &lt;code>LegacyNodeRoleBehavior=true&lt;/code> during 1.21 to get the legacy behavior.&lt;/li>
&lt;li>Deployments should stop setting &lt;code>LegacyNodeRoleBehavior=false&lt;/code> if they opted out early.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Release 5 (1.22):
&lt;ul>
&lt;li>The &lt;code>LegacyNodeRoleBehavior&lt;/code> gate and all feature-level gates are removed; components that attempt to set these gates will fail to start.&lt;/li>
&lt;li>Code that references node-roles within Kubernetes will be removed.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
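&lt;p>The behavior rule introduced in Release 1 can be written out as a small sketch to show how the gate defaults change the outcome across releases. This is an illustration of the boolean rule only, not code from any Kubernetes component; the function name &lt;code>node_is_excluded&lt;/code> and its parameters are hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python">MASTER_ROLE_LABEL = "node-role.kubernetes.io/master"
LB_LABEL = "node.kubernetes.io/exclude-from-external-load-balancers"


def node_is_excluded(node_labels, feature_label, legacy_gate, feature_gate):
    """Evaluate the rule:
    (LegacyNodeRoleBehavior and node_has_role) or (FeatureGate and node_has_label)
    """
    node_has_role = MASTER_ROLE_LABEL in node_labels
    node_has_label = feature_label in node_labels
    return (legacy_gate and node_has_role) or (feature_gate and node_has_label)


unmigrated = {MASTER_ROLE_LABEL: ""}
migrated = {MASTER_ROLE_LABEL: "", LB_LABEL: "true"}

# Release 1 defaults: legacy gate on, feature gate off.
print(node_is_excluded(unmigrated, LB_LABEL, True, False))
# Release 4 defaults: legacy gate off, feature gate on.
print(node_is_excluded(unmigrated, LB_LABEL, False, True))
print(node_is_excluded(migrated, LB_LABEL, False, True))
&lt;/code>&lt;/pre>
&lt;p>Under this rule, a node that carries only the node-role label stops being excluded once the legacy gate defaults to off, which is why nodes need the new labels before that release.&lt;/p>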
&lt;p>In Release 5 (which could be as early as 1.21) this KEP will be considered complete.&lt;/p>
&lt;h4 id="instructions-for-deployers">Instructions for deployers&lt;/h4>
&lt;p>The current behavior, in which the &lt;code>node-role.kubernetes.io/master&lt;/code> label on a node prevents it from being part of service load balancers or from being disrupted when NotReady, is deprecated and will be fully removed in Kubernetes 1.20. Administrators and Kubernetes deployers should follow these steps.&lt;/p>
&lt;p>If you are using the &lt;code>alpha.service-controller.kubernetes.io/exclude-balancer&lt;/code> label in your deployments to exclude specific nodes from your deployment, the label has been replaced in 1.17 with &lt;code>node.kubernetes.io/exclude-from-external-load-balancers&lt;/code>. All administrators should run the following command before upgrading to Kubernetes 1.18 and set the feature gate &lt;code>ServiceNodeExclusion=true&lt;/code>:&lt;/p>
&lt;pre>&lt;code>kubectl label nodes --selector=alpha.service-controller.kubernetes.io/exclude-balancer \
node.kubernetes.io/exclude-from-external-load-balancers=true
&lt;/code>&lt;/pre>
&lt;p>Cluster deployers that rely on the existing behavior where master nodes are not part of the service load balancer and master workloads will not be evicted if the master is NotReady for longer than the grace period should run the following command after upgrading to Kubernetes 1.18:&lt;/p>
&lt;pre>&lt;code>kubectl label nodes --selector=node-role.kubernetes.io/master \
node.kubernetes.io/exclude-from-external-load-balancers=true \
node.kubernetes.io/exclude-disruption=true
&lt;/code>&lt;/pre>
&lt;p>After setting these labels in 1.18, administrators will need to take no further action.&lt;/p>
&lt;p>Cluster deployers that wish to manage this migration during the 1.17 to 1.18 upgrade should label nodes and set feature gates before upgrading to 1.18. If &lt;code>LegacyNodeRoleBehavior=false&lt;/code> is set, it must be removed prior to the 1.21 to 1.22 upgrade.&lt;/p>
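&lt;p>The labeling steps above can be sketched as a small helper that, given the current labels of a node, returns the replacement labels it still needs. This is only an illustration under stated assumptions: the function name &lt;code>labels_to_add&lt;/code> is hypothetical, the legacy exclusion label is mapped to the replacement label named in the text, and the master-role branch applies only to deployers that rely on the existing behavior.&lt;/p>

```python
# Label keys taken from the instructions above.
LEGACY_EXCLUDE = "alpha.service-controller.kubernetes.io/exclude-balancer"
MASTER_ROLE = "node-role.kubernetes.io/master"
EXCLUDE_LB = "node.kubernetes.io/exclude-from-external-load-balancers"
EXCLUDE_DISRUPTION = "node.kubernetes.io/exclude-disruption"


def labels_to_add(node_labels):
    """Hypothetical helper: return the new labels a node is still missing."""
    wanted = {}
    # Nodes carrying the old alpha label get the replacement label.
    if LEGACY_EXCLUDE in node_labels:
        wanted[EXCLUDE_LB] = "true"
    # Assumption: the deployer wants to keep the legacy master behavior.
    if MASTER_ROLE in node_labels:
        wanted[EXCLUDE_LB] = "true"
        wanted[EXCLUDE_DISRUPTION] = "true"
    # Only report labels that are not already set to the wanted value.
    return {k: v for k, v in wanted.items() if node_labels.get(k) != v}
```

&lt;p>A node that already carries both new labels yields an empty result, so a check like this could be re-run safely before and after an upgrade.&lt;/p>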
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;ul>
&lt;li>Unit tests to verify selection using feature gates&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;ul>
&lt;li>New labels and feature flags become beta after one release, become GA and defaulted on after two, and are removed two releases after they are defaulted on (so 4 releases from when this is first implemented).&lt;/li>
&lt;li>Documentation for migrating to the new labels is available in 1.18.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>As described in the migration process, deployers and administrators have 2 releases to migrate their clusters.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>Controllers are updated after the control plane, so consumers must update the labels on their nodes before they update controller processes in 1.21.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature enablement and rollback&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>LegacyNodeRoleBehavior&lt;/code>, &lt;code>ServiceNodeExclusion&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: &lt;code>kube-apiserver&lt;/code>, &lt;code>kube-controller-manager&lt;/code>, cloud controller managers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?&lt;/strong>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Yes&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What happens if we reenable the feature if it was previously rolled back?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>The old behavior is present.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Yes&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;p>Covered in migration strategy.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring requirements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>How can an operator determine if the feature is in use by workloads?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Not applicable to workloads&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Not applicable&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Not applicable&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Are there any missing metrics that would be useful to have to improve
observability of this feature?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Does this feature depend on any specific services running in the cluster?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Will enabling / using this feature result in any new calls to cloud
provider?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Will enabling / using this feature result in increasing size or count
of the existing API objects?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Will enabling / using this feature result in increasing time taken by any
operations covered by existing SLIs/SLOs?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>How does this feature react if the API server and/or etcd is unavailable?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Not applicable&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What are other known failure modes?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Not applicable&lt;/p>
&lt;ul>
&lt;li>&lt;strong>What steps should be taken if SLOs are not being met to determine the problem?&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Not applicable&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2019-07-16: Created&lt;/li>
&lt;li>2020-04-15: Labels promoted to beta in 1.19 in &lt;a href="https://github.com/kubernetes/kubernetes/pull/90126"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/90126&lt;/a>
&lt;/li>
&lt;li>2020-06-01: Updated for 1.19 with details of production readiness&lt;/li>
&lt;li>2021-01-06: GA in 1.21 and marked to be removed in 1.22&lt;/li>
&lt;/ul>
&lt;h2 id="future-work">Future work&lt;/h2>
&lt;p>This proposal touches on the important topic of scheduling policy - the ability of clusters to restrict where arbitrary workloads may run - by noting that some conformant clusters may reject attempts to schedule onto masters. This is out of scope of this KEP except to indicate that node-role use by ecosystem components may conflict with future enhancements in this area.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://groups.google.com/d/msg/kubernetes-sig-architecture/ZKUOPy2PNJ4/lDh4hs4HBQAJ"
target="_blank" rel="noopener">https://groups.google.com/d/msg/kubernetes-sig-architecture/ZKUOPy2PNJ4/lDh4hs4HBQAJ&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/pull/35975"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/35975&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/pull/39112"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/39112&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/pull/76654"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/76654&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/pull/80021"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/80021&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/pull/78500"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/78500&lt;/a>
- Work to remove master role label from e2e&lt;/li>
&lt;/ul></description></item><item><title>Resources: Artifact Distribution Policy</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3000/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3000/</guid><description>
&lt;h1 id="kep-3000-image-promotion-and-distribution-policy">KEP 3000: Image Promotion and Distribution Policy&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#why-a-new-domain"
>Why a new domain?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#how-can-we-help"
>How can we help?&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-is-not-in-scope"
>What is not in scope&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-good-goals-to-shoot-for"
>What are good goals to shoot for&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-exactly-are-you-doing"
>What exactly are you doing?&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#registryk8sio-request-handling"
>registry.k8s.io request handling&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats"
>Notes/Constraints/Caveats&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#alternatives--background"
>Alternatives / Background&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-much-is-this-going-to-save-us"
>How much is this going to save us?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>For a few years now, we have been using k8s.gcr.io in all our repositories as the default registry for downloading images.&lt;/p>
&lt;p>Distributing Kubernetes comes at a great cost, nearing $150k USD/month (mostly egress) in donations.&lt;/p>
&lt;p>Additionally, some of our community members are unable to access the official release container images due to country-level firewalls that do not let them connect to Google services.&lt;/p>
&lt;p>Ideally, we can dramatically reduce cost and allow everyone in the world to download the container images released by our community.&lt;/p>
&lt;p>We are now used to using the &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-release/1734-k8s-image-promoter"
target="_blank" rel="noopener">image promoter process&lt;/a>
to promote images to the official kubernetes container registry using the infrastructure (GCR staging repos etc) provided by &lt;a href="https://github.com/kubernetes/k8s.io/tree/main/k8s.gcr.io"
target="_blank" rel="noopener">sig-k8s-infra&lt;/a>
&lt;/p>
&lt;h2 id="why-a-new-domain">Why a new domain?&lt;/h2>
&lt;p>So far we (the whole Kubernetes project) have been using GCP as our default infrastructure provider for things like GCS, GCR, GKE-based prow clusters, etc. Google has graciously sponsored a lot of our infrastructure costs as well. However, for about a year now our costs have been skyrocketing, because the community usage of this infrastructure has been coming from other cloud providers such as AWS and Azure. So, in conjunction with CNCF staff, we are putting together a plan to host copies of images and binaries nearer to where they are used rather than incur cross-cloud costs.&lt;/p>
&lt;p>One part of this plan is to set up a redirecting web service that can identify where the traffic is coming from and redirect to the nearest image layer/repository. This is why we are setting up a new service using what we call an &lt;a href="https://github.com/kubernetes-sigs/oci-proxy"
target="_blank" rel="noopener">oci-proxy&lt;/a>
for everyone to use. This redirector will identify traffic coming from, for example, a certain AWS region, and will then issue an HTTP redirect to a source in that AWS region. If we get traffic from GKE/GCP or we don&amp;rsquo;t know where the traffic is coming from, it will still redirect to the current infrastructure (k8s.gcr.io).&lt;/p>
&lt;h2 id="how-can-we-help">How can we help?&lt;/h2>
&lt;p>When Kubernetes master opens up for v1.25 development, we need to update all default URLs in our code and test harness to the new registry URL. As a team, sig-k8s-infra is signing up to ensure that this oci-proxy based registry.k8s.io will be as robust and available as the current setup. As a backup, we will continue to run the current k8s.gcr.io as well, so do not worry about that going away. Turning on traffic to the new URL will help us monitor and fix things if/when they break, and we will be able to tune traffic and lower our costs of operation.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>A policy and procedure for use by SIG Release to promote container images to multiple registries and mirrors.&lt;/p>
&lt;p>A solution to allow redirection to appropriate mirrors to lower cost and allow access from any cloud or country globally.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>Anything related to creation of artifacts, bom, staging buckets.&lt;/p>
&lt;h3 id="what-is-not-in-scope">What is not in scope&lt;/h3>
&lt;ul>
&lt;li>Currently we focus on AWS only. We are getting a lot of help from AWS in terms of technical details as well as targeted infrastructure costs for standing up and running this infrastructure&lt;/li>
&lt;/ul>
&lt;h3 id="what-are-good-goals-to-shoot-for">What are good goals to shoot for&lt;/h3>
&lt;ul>
&lt;li>In terms of cost reduction, monitor GCP infrastructure and get to the point where we fully avoid serving large binary image layers from GCR/GCS&lt;/li>
&lt;li>We can add other AWS regions and clouds as needed in a well-known, documented way&lt;/li>
&lt;li>Seamless transition for the community from the old k8s.gcr.io to registry.k8s.io with the same rock-solid stability as we now have with k8s.gcr.io&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>There are two intertwined concepts that are part of this proposal.&lt;/p>
&lt;p>First, the policy and procedures to promote/upload our container images to multiple providers. Our existing processes upload only to GCS buckets. Ideally we extend the existing software/promotion process to push directly to multiple providers. Alternatively we use a second process to synchronize container images from our existing production buckets to similar constructs at other providers.&lt;/p>
&lt;p>Additionally we require a registry and artifact url-redirection solution to the local cloud provider or country.&lt;/p>
&lt;h2 id="what-exactly-are-you-doing">What exactly are you doing?&lt;/h2>
&lt;ul>
&lt;li>We are setting up an AWS account with an IAM role and s3 buckets in AWS regions where we see a large percentage of source image pull traffic&lt;/li>
&lt;li>We will iterate on a sandbox url (registry-sandbox.k8s.io) for our experiments and ONLY promote things to (registry.k8s.io) when we have complete confidence&lt;/li>
&lt;li>Both registry and registry-sandbox are serving traffic using oci-proxy on Google Cloud Run&lt;/li>
&lt;li>oci-proxy will be updated to identify incoming traffic from AWS regions based on IP ranges so we can route traffic to s3 buckets in that region. If a specific AWS region does not currently host s3 buckets, we will redirect to the nearest region which does have s3 buckets (tradeoff between storage and network costs)&lt;/li>
&lt;li>We will bulk sync existing image layers to these s3 buckets as a starting point (from GCS/GCR)&lt;/li>
&lt;li>We will update image-promoter to push to these s3 buckets in addition to the current setup&lt;/li>
&lt;li>We will set up monitoring/reporting to check on new costs we incur on the AWS infrastructure and update what we do in GCP infrastructure as well to include the new components&lt;/li>
&lt;li>We will have a plan in place on how we could add additional AWS regions in the future&lt;/li>
&lt;li>We will have CI jobs that will run against registry-sandbox.k8s.io as well to monitor stability before we promote code to registry&lt;/li>
&lt;li>We will automate the deployment/monitoring and testing of code landing in the oci-proxy repository&lt;/li>
&lt;/ul>
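&lt;p>The region selection described in the list above (match the client IP against known ranges, then fall back to the nearest region that hosts buckets) might look roughly like the sketch below. All data here is hypothetical: the CIDRs are documentation-only ranges rather than real AWS ranges, and the region tables and function names are assumptions made for illustration, not taken from oci-proxy.&lt;/p>

```python
import ipaddress

# Hypothetical data: documentation-only CIDRs (RFC 5737), not real AWS ranges.
REGION_BY_CIDR = {
    "192.0.2.0/24": "us-east-1",
    "198.51.100.0/24": "eu-west-1",
    "203.0.113.0/24": "eu-west-3",
}
# Assumption: only some regions host buckets; others map to a nearby one.
REGIONS_WITH_BUCKETS = {"us-east-1", "eu-west-1"}
NEAREST_BUCKET_REGION = {"eu-west-3": "eu-west-1"}


def region_for_ip(ip):
    """Return the region whose range contains ip, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for cidr, region in REGION_BY_CIDR.items():
        if addr in ipaddress.ip_network(cidr):
            return region
    return None


def bucket_region_for_ip(ip):
    """Return the region to serve blobs from, or None to use the upstream."""
    region = region_for_ip(ip)
    if region is None:
        return None  # unknown source: the caller redirects to the upstream
    if region in REGIONS_WITH_BUCKETS:
        return region
    # Region without buckets: use the nearest region that has them, if any.
    return NEAREST_BUCKET_REGION.get(region)
```

&lt;p>A linear scan over CIDRs is enough for a sketch; a real service with many ranges would more likely use a prefix tree or a sorted lookup.&lt;/p>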
&lt;h3 id="registryk8sio-request-handling">registry.k8s.io request handling&lt;/h3>
&lt;p>Requests to &lt;a href="https://registry.k8s.io"
target="_blank" rel="noopener">registry.k8s.io&lt;/a>
follow this flow:&lt;/p>
&lt;ol>
&lt;li>If it&amp;rsquo;s a request for &lt;code>/&lt;/code>: redirect to our wiki page about the project&lt;/li>
&lt;li>If it&amp;rsquo;s not a request for &lt;code>/&lt;/code> and does not start with &lt;code>/v2/&lt;/code>: 404 error&lt;/li>
&lt;li>For registry API requests, all of which start with &lt;code>/v2/&lt;/code>:&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>If it&amp;rsquo;s not a blob request: redirect to &lt;em>Upstream Registry&lt;/em>&lt;/li>
&lt;li>If it&amp;rsquo;s not a known AWS IP: redirect to &lt;em>Upstream Registry&lt;/em>&lt;/li>
&lt;li>If it&amp;rsquo;s a known AWS IP AND HEAD request for the layer succeeds in S3: redirect to S3&lt;/li>
&lt;li>If it&amp;rsquo;s a known AWS IP AND HEAD fails: redirect to &lt;em>Upstream Registry&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>Currently the &lt;em>Upstream Registry&lt;/em> is &lt;a href="https://k8s.gcr.io"
target="_blank" rel="noopener">https://k8s.gcr.io&lt;/a>
.&lt;/p>
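&lt;p>The request flow above can be summarized as a pure decision function. The following is a minimal sketch, not the oci-proxy implementation: the 307 status code, the placeholder wiki and bucket URLs, the blob-path check, and the names &lt;code>route&lt;/code> and &lt;code>Decision&lt;/code> are all assumptions made for illustration.&lt;/p>

```python
from dataclasses import dataclass

UPSTREAM_REGISTRY = "https://k8s.gcr.io"   # named in the text above
WIKI_URL = "https://wiki.example.invalid"  # placeholder, real URL not given
S3_BASE = "https://bucket.example.invalid"  # placeholder bucket endpoint
REDIRECT = 307  # assumption: the text only says "redirect"


@dataclass(frozen=True)
class Decision:
    status: int
    location: str = ""


def is_blob_request(path):
    # Assumption: blob requests look like /v2/NAME/blobs/DIGEST.
    return "/blobs/" in path


def route(path, is_known_aws_ip, head_ok):
    """Decide the response for a request path.

    head_ok is a callable that reports whether a HEAD request for the
    layer succeeds in S3; it is only called for known AWS clients.
    """
    if path == "/":
        return Decision(REDIRECT, WIKI_URL)
    if not path.startswith("/v2/"):
        return Decision(404)
    upstream = Decision(REDIRECT, UPSTREAM_REGISTRY + path)
    if not is_blob_request(path):
        return upstream
    if not is_known_aws_ip:
        return upstream
    if head_ok(path):
        # Hypothetical key layout: reuse the request path in the bucket.
        return Decision(REDIRECT, S3_BASE + path)
    return upstream
```

&lt;p>Keeping the decision separate from the HTTP handler makes every branch of the flow easy to exercise without network access.&lt;/p>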
&lt;h3 id="notesconstraintscaveats">Notes/Constraints/Caveats&lt;/h3>
&lt;p>The primary purpose of the KEP is getting consensus on the agreed policy and procedure to unblock our community and move forward together.&lt;/p>
&lt;p>There has been a lot of activity around the technology and tooling for both goals, but we need shared agreement on policy and procedure first.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>This is the primary pipeline for delivering Kubernetes worldwide. Ensuring the appropriate SLAs and support as well as artifact integrity is crucial.&lt;/p>
&lt;h2 id="alternatives--background">Alternatives / Background&lt;/h2>
&lt;ul>
&lt;li>Original KEP
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-release/1734-k8s-image-promoter"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/tree/master/keps/sig-release/1734-k8s-image-promoter&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Oras
&lt;ul>
&lt;li>&lt;a href="https://github.com/oras-project/oras"
target="_blank" rel="noopener">https://github.com/oras-project/oras&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>KubeCon Talk
&lt;ul>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=F2IFjz7sr9Q"
target="_blank" rel="noopener">https://www.youtube.com/watch?v=F2IFjz7sr9Q&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Apache has a widespread mirror network
&lt;ul>
&lt;li>@dims has experience here&lt;/li>
&lt;li>&lt;a href="http://ws.apache.org/mirrors.cgi"
target="_blank" rel="noopener">http://ws.apache.org/mirrors.cgi&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://infra.apache.org/mirrors.html"
target="_blank" rel="noopener">https://infra.apache.org/mirrors.html&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/k8s.io/issues/1834"
target="_blank" rel="noopener">Umbrella issue: k8s.gcr.io =&amp;gt; registry.k8s.io solution k/k8s.io#1834
&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/ii/registry.k8s.io#registryk8sio"
target="_blank" rel="noopener">ii/registry.k8s.io Implementation proposals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://ii.nz/post/building-a-data-pipline-for-displaying-kubernetes-public-artifact-traffic/"
target="_blank" rel="noopener">ii.nz/blog :: Building a data pipeline for displaying Kubernetes public artifact traffic
&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="how-much-is-this-going-to-save-us">How much is this going to save us?&lt;/h3>
&lt;p>Cost of K8s Artifact hosting - Data Studio Graphs&lt;/p>
&lt;p>&lt;img src="https://i.imgur.com/LAn4UIE.png" alt="">&lt;/p>
&lt;p>Analysis has been done on usage patterns related to providers. AWS participated in this process and has a keen interest in helping drive down cost by providing artifacts directly to their clients consuming resources from the public registry.&lt;/p></description></item><item><title>Resources: Artifact Generation</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2503/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2503/</guid><description/></item><item><title>Resources: Asynchronous API calls during scheduling</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/5229/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/5229/</guid><description>
&lt;h1 id="kep-5229-asynchronous-api-calls-during-scheduling">KEP-5229: Asynchronous API calls during scheduling&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api-calls-categorization"
>API calls categorization&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#1-how-to-handle-pod-rescheduling-while-waiting-for-the-api-call-to-complete"
>1: How to handle Pod rescheduling while waiting for the API call to complete&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#use-advanced-queue-and-dont-block-the-pod-from-being-scheduled-in-the-meantime"
>Use advanced queue and don&amp;rsquo;t block the Pod from being scheduled in the meantime&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#2-what-component-should-handle-the-api-calls"
>2: What component should handle the API calls&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#21-make-the-api-calls-queued-in-a-separate-component"
>2.1: Make the API calls queued in a separate component&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#22-send-api-calls-through-a-kube-schedulers-cache"
>2.2: Send API calls through a kube-scheduler&amp;rsquo;s cache&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#asynchronous-api-call-failure"
>Asynchronous API call failure&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#object-updated-by-an-external-component-causing-a-race-with-the-scheduler"
>Object updated by an external component causing a race with the scheduler&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#api-calls-added-at-a-higher-rate-than-execution-rate-leading-to-memory-explosion"
>API calls added at a higher rate than execution rate leading to memory explosion&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#pod-is-retried-based-on-an-old-object"
>Pod is retried based on an old object&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#out-of-tree-plugins-start-using-asynchronous-api-calls-framework"
>Out-of-tree plugins start using asynchronous API calls framework&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#proposal-c-create-a-separate-component-managing-api-calls-but-treat-the-cache-as-a-middleware"
>Proposal C: Create a separate component managing API calls, but treat the cache as a middleware&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary-of-api-call-management"
>Summary of API call management&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#enqueueing-a-new-api-call"
>Enqueueing a new API call&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#enqueueing-another-api-call-for-the-same-object"
>Enqueueing another API call for the same object&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#receiving-object-update-through-event-handlers"
>Receiving object update through event handlers&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#executing-the-api-call"
>Executing the API call&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#enqueueing-an-api-call-while-a-previous-one-is-in-flight"
>Enqueueing an API call while a previous one is in-flight&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#waiting-for-the-api-call-to-finish"
>Waiting for the API call to finish&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#retrying-api-calls"
>Retrying API calls&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#11-handle-api-calls-in-the-scheduling-queue"
>1.1: Handle API calls in the scheduling queue&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#12-handle-api-calls-in-the-handleschedulingfailure"
>1.2: Handle API calls in the handleSchedulingFailure&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#21-just-dispatch-goroutines"
>2.1: Just dispatch goroutines&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternative-design-proposals"
>Alternative design proposals&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#proposal-a-create-a-separate-component-managing-api-calls"
>Proposal A: Create a separate component managing API calls&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#proposal-b-make-a-schedulers-cache-managing-api-calls"
>Proposal B: Make a scheduler&amp;rsquo;s cache managing API calls&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP proposes making all API calls during scheduling asynchronous, by introducing a new kube-scheduler-wide way of handling such calls.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Scheduling performance is crucial. One of the bottlenecks is the API calls done during the scheduling cycle.
The binding cycle is already asynchronous, but it would still be beneficial to re-evaluate whether the current model of busy-waiting goroutines is good long-term.&lt;/p>
&lt;p>Introducing one universal approach for handling API calls in the kube-scheduler could make these calls consistent and allow better control over the number of dispatched goroutines.
Already asynchronous calls could also be migrated to this approach.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>P0: Make the scheduling cycle free of blocking API calls, i.e., make all API calls asynchronous.&lt;/li>
&lt;li>P0: Make the solution extendable for custom/future use cases.&lt;/li>
&lt;li>P1: Skip some types of updates if subsequent updates soon make them irrelevant.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Prioritize high-importance updates (like binding) over low-importance ones if updates to the kube-apiserver get throttled.&lt;/li>
&lt;li>Change how the already asynchronous procedures, such as the binding cycle or asynchronous preemption goroutines, actually work.
They should remain asynchronous and continue to wait for the API calls to finish before proceeding.
Any further refinements to these stages could be added in future revisions of this KEP or in separate KEPs.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>There are a few ways to make API calls asynchronous.
They are introduced below to facilitate discussion and identify the most suitable solution.&lt;/p>
&lt;p>These questions have to be answered:&lt;/p>
&lt;ol>
&lt;li>How to handle Pod rescheduling while waiting for the API call to complete&lt;/li>
&lt;li>What component should handle the API calls&lt;/li>
&lt;/ol>
&lt;p>Also, races (collisions) between multiple API calls for a single object should be mitigated by the design.&lt;/p>
&lt;p>Note that this KEP focuses on making individual API calls asynchronous. Some procedures, such as the binding cycle or asynchronous preemption,
will still be separate goroutines with the ability to wait for the (async) API calls to finish.
This way, dependencies between calls that rely on each other won&amp;rsquo;t need to be implemented.&lt;/p>
&lt;h3 id="api-calls-categorization">API calls categorization&lt;/h3>
&lt;p>Before selecting the best approach, the kube-scheduler&amp;rsquo;s API calls have to be analyzed against the goals.
The following operations involve API calls during the main scheduling cycle and have to be made asynchronous (1st goal):&lt;/p>
&lt;ol>
&lt;li>Updating a Pod status in &lt;code>handleSchedulingFailure&lt;/code> when a Pod is unschedulable.&lt;/li>
&lt;li>[Feature proposal: &lt;a href="https://github.com/kubernetes/kubernetes/issues/130668"
target="_blank" rel="noopener">#130668&lt;/a>
] Updating the status of a Pod that is rejected by the &lt;code>PreEnqueue&lt;/code> plugins in the scheduling queue.&lt;/li>
&lt;/ol>
&lt;p>These API calls are already asynchronous in their own ways:&lt;/p>
&lt;ol start="3">
&lt;li>[Feature proposal: &lt;a href="https://github.com/kubernetes/enhancements/issues/5278"
target="_blank" rel="noopener">KEP-5278&lt;/a>
] Set &lt;code>nominatedNodeName&lt;/code> in delayed binding scenarios.&lt;/li>
&lt;li>Preemption - &lt;code>ClearNominatedNodeName&lt;/code> and Pod eviction (made asynchronous by &lt;a href="https://github.com/kubernetes/enhancements/issues/4832"
target="_blank" rel="noopener">KEP-4832&lt;/a>
).&lt;/li>
&lt;li>Pod binding - is in the asynchronous binding phase.&lt;/li>
&lt;/ol>
&lt;p>All three of the above API calls could be migrated to the new mechanism.&lt;/p>
&lt;p>In-tree plugins&amp;rsquo; operations that involve non-Pod API calls during scheduling and could be made asynchronous
(but don&amp;rsquo;t have to be supported from the very beginning):&lt;/p>
&lt;ol start="6">
&lt;li>Volume binding - is in the &lt;code>PreBind&lt;/code> phase, hence asynchronous.&lt;/li>
&lt;li>DRA ResourceClaim deallocating in &lt;code>PostFilter&lt;/code>.&lt;/li>
&lt;li>DRA removing &lt;code>ReservedFor&lt;/code> in &lt;code>Unreserve&lt;/code>.&lt;/li>
&lt;li>DRA ResourceClaims binding - is in the &lt;code>PreBind&lt;/code> phase, hence asynchronous.&lt;/li>
&lt;li>[Feature proposal: &lt;a href="https://github.com/kubernetes/enhancements/issues/5004"
target="_blank" rel="noopener">KEP-5004&lt;/a>
] The extended resource feature will add a &lt;code>ResourceClaim&lt;/code> creation API call to the &lt;code>PreBind&lt;/code> phase.&lt;/li>
&lt;li>Other potential DRA features.&lt;/li>
&lt;/ol>
&lt;p>The relevance order of API calls, in which more relevant calls could cancel less relevant calls for the same Pod (3rd goal):&lt;/p>
&lt;ul>
&lt;li>Pod deletion caused by preemption (4) should cancel all Pod-based API calls for such a Pod.&lt;/li>
&lt;li>Pod binding (5) should cancel Pod status update API calls (1 - 3), because they are no longer relevant.&lt;/li>
&lt;li>Updating Pod status (1, 2) and setting &lt;code>nominatedNodeName&lt;/code> (3) should cancel previous such updates.
Both are calls to the &lt;code>status&lt;/code> subresource of a Pod, so they should overwrite (merge) the previous calls properly
when the newest status is stored in-memory.&lt;/li>
&lt;li>API calls for non-Pod resources (6 - 11) should be further analyzed as they are not likely to consider the Pod-based API calls,
hence implementing those shouldn&amp;rsquo;t block making (1 - 2) calls asynchronous.&lt;/li>
&lt;/ul>
&lt;p>There is no need to send two API calls for one Pod, because more relevant calls should override less relevant ones,
and status updates can be combined into one call.
There is no scenario in which two API calls for different Pods, or more generally &lt;strong>any&lt;/strong> two API calls that do not involve the same object,
should be canceled or merged, so the relevance order between such calls does not need to be analyzed.&lt;/p>
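&lt;p>The relevance order above can be illustrated with a minimal sketch. The type names, the numeric relevance values, and the &lt;code>Pending&lt;/code> map are hypothetical and are not the KEP&amp;rsquo;s actual API; the sketch only assumes that at most one call is kept per object and that a more relevant call replaces a less relevant one.&lt;/p>

```go
package main

import "fmt"

// CallType identifies the kind of API call. The names and relevance values
// are hypothetical and only mirror the ordering described in the text.
type CallType string

const (
	StatusUpdate CallType = "StatusUpdate" // items 1-3
	Binding      CallType = "Binding"      // item 5
	Delete       CallType = "Delete"       // item 4 (preemption)
)

// relevance maps each call type to its rank; a higher rank wins.
var relevance = map[CallType]int{StatusUpdate: 1, Binding: 2, Delete: 3}

// Call is a pending API call for a single object.
type Call struct {
	Type    CallType
	Details string // payload placeholder, e.g. a condition reason or node name
}

// Pending keeps at most one not yet executed call per object UID.
type Pending map[string]Call

// Add applies the relevance order and reports what happened to the new call.
func (p Pending) Add(uid string, c Call) string {
	old, exists := p[uid]
	switch {
	case !exists:
		p[uid] = c
		return "enqueued"
	case relevance[c.Type] > relevance[old.Type]:
		p[uid] = c
		return "replaced less relevant call"
	case relevance[c.Type] == relevance[old.Type]:
		p[uid] = c // same relevance: the newer details overwrite the older ones
		return "merged"
	default:
		return "skipped"
	}
}

func main() {
	p := Pending{}
	fmt.Println(p.Add("pod-a", Call{StatusUpdate, "Unschedulable"}))
	fmt.Println(p.Add("pod-a", Call{Binding, "node-1"}))
	fmt.Println(p.Add("pod-a", Call{StatusUpdate, "Unschedulable"}))
	fmt.Println(p.Add("pod-b", Call{StatusUpdate, "Unschedulable"}))
}
```

&lt;p>Calls for different objects never interact in this sketch, which matches the observation that only calls for the same object need a relevance order.&lt;/p>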
&lt;p>In terms of API call priority, the order might be different (non-goal, but considered):&lt;/p>
&lt;ul>
&lt;li>Pod binding (5) should have the highest priority as this is the main purpose of the kube-scheduler.&lt;/li>
&lt;li>Pod deletion caused by preemption (4) should also be important to free up space for high-priority Pods.&lt;/li>
&lt;li>Updating Pod status (1, 2) could be less important and called if there is space for it.
It&amp;rsquo;s worth considering if setting &lt;code>nominatedNodeName&lt;/code> (3) should have the same priority or higher,
because a higher delay might affect other components like Cluster Autoscaler or Karpenter.&lt;/li>
&lt;li>API calls for non-Pod resources (6 - 11) could be analyzed case by case, but are likely equally important to (5) or (4).&lt;/li>
&lt;/ul>
&lt;h3 id="1-how-to-handle-pod-rescheduling-while-waiting-for-the-api-call-to-complete">1: How to handle Pod rescheduling while waiting for the API call to complete&lt;/h3>
&lt;p>There are multiple possible ways to handle such API calls, especially for Pod status updates.
Other (potential) use cases should also be considered when choosing the solution.
Three ways were analyzed, but the non-blocking approach, presented below, was selected.&lt;/p>
&lt;h4 id="use-advanced-queue-and-dont-block-the-pod-from-being-scheduled-in-the-meantime">Use advanced queue and don&amp;rsquo;t block the Pod from being scheduled in the meantime&lt;/h4>
&lt;p>This approach allows the Pod to enter the scheduling queue and be scheduled again even before the status update API call completes, without blocking it.
This requires implementing advanced logic for queueing API calls in the kube-scheduler and migrating &lt;strong>all&lt;/strong> Pod-based API calls done during scheduling to this method,
including the binding API call. The new component should be able to resolve any conflicts in the incoming API calls and cluster status updates as well as parallelize them properly,
e.g., don&amp;rsquo;t parallelize two updates of the same Pod. This requires &lt;a href="#21-make-the-api-calls-queued-in-a-separate-component"
>making the API calls queued in a separate component&lt;/a>
or
&lt;a href="#22-send-api-calls-through-a-kube-schedulers-cache"
>sending API calls through a kube-scheduler&amp;rsquo;s cache&lt;/a>
, presented below, to be implemented.&lt;/p>
&lt;p>All Pod-based scenarios (1 - 5) could and should be implemented when choosing this approach.
Still, a single error reporting path for Pod condition updates could be considered but wouldn&amp;rsquo;t be required.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Allows the Pod to be scheduled again even before the API call completes, which could reduce end-to-end Pod startup latency.&lt;/li>
&lt;li>Simplifies introducing new API calls to the kube-scheduler assuming the collision handling logic is implemented correctly.&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Requires implementing complex, advanced queueing logic.&lt;/li>
&lt;li>Necessitates migrating &lt;strong>all&lt;/strong> Pod-based API calls to this method, but introduces unification, which could be desirable.&lt;/li>
&lt;li>Implementing collision resolution (e.g., for same-Pod updates) is complex, but could allow optimizing the number of API calls overall.&lt;/li>
&lt;/ul>
&lt;h3 id="2-what-component-should-handle-the-api-calls">2: What component should handle the API calls&lt;/h3>
&lt;p>Another question is how to actually make the API calls asynchronous and which component should be responsible for this.
Two alternatives were considered. Ultimately, both contributed to the design of the final architecture,
which consists of both queueing and caching approaches.&lt;/p>
&lt;h4 id="21-make-the-api-calls-queued-in-a-separate-component">2.1: Make the API calls queued in a separate component&lt;/h4>
&lt;p>To make asynchronous dispatching more advanced, queueing the API calls in a separate component could be explored.
A new component might understand what the API calls are intended to do and eventually delay, skip, or merge them,
e.g., don&amp;rsquo;t set &lt;code>nominatedNodeName&lt;/code> when Pod binding is enqueued.
Initially, it could be a framework, which might be extended in the future, e.g., by introducing the possibility of setting delays.&lt;/p>
&lt;p>If two update API calls for the same Pod are enqueued, a merging mechanism should handle this case.
See &lt;a href="#api-calls-categorization"
>API calls categorization&lt;/a>
for more details.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Allows for advanced goroutine dispatching logic.&lt;/li>
&lt;li>Can potentially delay, skip, or merge API calls based on type (e.g., skip &lt;code>nominatedNodeName&lt;/code> if binding is pending).&lt;/li>
&lt;li>All collisions could be resolved at the new component level, not relying on higher-level mechanisms.&lt;/li>
&lt;li>Allows supporting all scenarios without additional structures.&lt;/li>
&lt;li>Provides a framework that can be extended in the future.&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Requires complex logic to handle potential conflicts between different update types for the same Pod.&lt;/li>
&lt;li>Needs a clear strategy for how to update the in-memory Pod object during scheduling.&lt;/li>
&lt;li>Requires extra steps to cache the updated objects.&lt;/li>
&lt;/ul>
&lt;h4 id="22-send-api-calls-through-a-kube-schedulers-cache">2.2: Send API calls through a kube-scheduler&amp;rsquo;s cache&lt;/h4>
&lt;p>A second approach could be to have a consistent Pod state in the kube-scheduler itself first and then change it through the API.
This means that all API calls would have to go through the kube-scheduler&amp;rsquo;s cache, change the Pod there, and after that, execute.
However, Pod updates might come from outside the kube-scheduler, e.g., a user changes the spec or another component changes the status.
This extended cache would have to merge the internal state of the Pod with the external state,
including the Pod update made by the kube-scheduler that will come as an event as well.
Currently, the Pod object stored in the cache is based only on events that come to the kube-scheduler.&lt;/p>
&lt;p>Another thing to think of is that the cache stores only the bound Pods. The rest of the Pods are stored in the scheduling queue,
so once again, API calls might need to go through the scheduling queue itself.&lt;/p>
&lt;p>The cache proposal would still need to reuse some ideas of the first approach to achieve merging or skipping API calls.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Aims for a consistent internal state of the Pod within the kube-scheduler before calling the API, possibly simplifying conflict resolution.&lt;/li>
&lt;li>Allows for advanced goroutine dispatching logic.&lt;/li>
&lt;li>All collisions could be resolved at the cache, not relying on higher-level mechanisms.&lt;/li>
&lt;li>Can potentially delay, skip, or merge API calls based on type (e.g., skip &lt;code>nominatedNodeName&lt;/code> if binding is pending),
but merging would be possible only if the cache stores additional data (what fields should be updated, etc.).&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Requires the cache to handle and merge updates coming from both the kube-scheduler&amp;rsquo;s internal actions and external API events.&lt;/li>
&lt;li>The cache currently only stores bound Pods, requiring integration with the scheduling queue for pending Pods.&lt;/li>
&lt;li>Complex logic is needed to handle external updates arriving while an internal update is pending or in progress.&lt;/li>
&lt;/ul>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="asynchronous-api-call-failure">Asynchronous API call failure&lt;/h4>
&lt;p>When an asynchronous API call fails, the caller should be able to handle this.
This can be done by using an &lt;code>OnFailure&lt;/code> channel that passes the error, allowing callers to react accordingly.
For current API calls, this will be enough - updating Pod status failures are already unhandled (only logged), so this KEP won&amp;rsquo;t make that situation worse.
However, more graceful handling, such as retries, could be added in the future.&lt;/p>
&lt;p>It could be riskier when previous calls were skipped or overwritten and a subsequent call fails.
This results in losing previous decisions (outside the kube-scheduler) as well as the last change not being applied externally.
This should still be handled correctly, as a failed binding will result in applying a failed Pod status anyway,
and a Pod with binding canceled because of deletion (preemption) could still be retried.
Nevertheless, this risk should be taken into consideration when extending feature usage in the future and should be properly documented in the code.&lt;/p>
&lt;p>Another aspect is caching: applying a change to a cache should be reversible.
This could be done by storing two versions of an object (like in AssumeCache) and restoring the older version in case of a failure.
However, for basic usage, this won&amp;rsquo;t be required for Pod-based API calls.&lt;/p>
&lt;h4 id="object-updated-by-an-external-component-causing-a-race-with-the-scheduler">Object updated by an external component causing a race with the scheduler&lt;/h4>
&lt;p>If a single field can be updated by both the scheduler and another component, making the update API call asynchronous might extend the race window.
One such case is the &lt;code>NominatedNodeName&lt;/code> use case, extended by &lt;a href="https://github.com/kubernetes/enhancements/issues/5278"
target="_blank" rel="noopener">KEP-5278&lt;/a>
.&lt;/p>
&lt;p>However, in this KEP, we assume that the default kube-scheduler should have precedence when applying updates to objects (Pods),
and any custom logic could be implemented by changing the default if needed.&lt;/p>
&lt;h4 id="api-calls-added-at-a-higher-rate-than-execution-rate-leading-to-memory-explosion">API calls added at a higher rate than execution rate leading to memory explosion&lt;/h4>
&lt;p>As pending API calls will be stored in the scheduler, slower processing of these calls, while maintaining a high frequency of additions, might result in significant memory usage.
This can already occur, for example, when many Pods are waiting to be bound simultaneously.
However, if it turns out to be a real problem, a timeout could be added to the API call that will limit the time the call might spend in the queue, discarding it afterward.&lt;/p>
&lt;h4 id="pod-is-retried-based-on-an-old-object">Pod is retried based on an old object&lt;/h4>
&lt;p>Since a Pod won&amp;rsquo;t be blocked from retrying scheduling when a status update API call for that Pod is being executed, it might enter the next scheduling cycle before the call completes.
However, the &lt;code>PodScheduled&lt;/code> condition is not used during scheduling, and &lt;code>NominatedNodeName&lt;/code> is reflected in the &lt;code>nominator&lt;/code>, so having an outdated Pod object won&amp;rsquo;t cause any harm.
Still, any future use cases might introduce issues here, so caching the updates could be considered to fully mitigate this risk.&lt;/p>
&lt;h4 id="out-of-tree-plugins-start-using-asynchronous-api-calls-framework">Out-of-tree plugins start using asynchronous API calls framework&lt;/h4>
&lt;p>The framework should be designed to handle such custom use cases, but it should be explicitly documented what capabilities are allowed (and supported) for out-of-tree plugins.
For example, adding a new Pod-based API call might require changes in the original implementations.
Not all use cases might be covered by the first release of this feature, but eventually, they should be fully supported and documented accordingly.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>This section describes the most important design details. Three proposals based on the above ideas that combine queueing, caching,
and a separate component for managing API calls were considered. Ultimately, proposal C was selected,
and the details of proposals A and B can be found in the &lt;a href="#alternative-design-proposals"
>alternative design proposals&lt;/a>
section at the end of the KEP.
Specifically, see &lt;a href="#proposal-a-create-a-separate-component-managing-api-calls"
>proposal A&lt;/a>
for the proposed &lt;code>APIQueue&lt;/code> structure.&lt;/p>
&lt;h3 id="proposal-c-create-a-separate-component-managing-api-calls-but-treat-the-cache-as-a-middleware">Proposal C: Create a separate component managing API calls, but treat the cache as a middleware&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/proposal-C-cache-and-separate-component.png" alt="proposal C">&lt;/p>
&lt;p>This proposal combines the strengths of proposals A and B by making a cache a middleware between scheduling/binding cycles, plugins, and event handlers.
This way, we could achieve the cache advantages of proposal B, while also allowing multiple caches to coexist.
Direct API queue operations would still be possible (e.g., for some out-of-tree plugins that don&amp;rsquo;t need to cache any object).&lt;/p>
&lt;p>The &lt;code>APIQueue&lt;/code> design from proposal A could be largely reused in this approach. If an object needs to be modified, it would first go through the cache,
then be added to the API queue, and, based on the result, properly stored in the cache. This decoupled approach would allow adding a &lt;code>StatusUpdateCall&lt;/code> through the scheduler&amp;rsquo;s cache,
but for example, a &lt;code>ResourceClaimUpdate&lt;/code> could go through the DRA manager, simplifying the adaptation of this KEP.&lt;/p>
&lt;p>This proposal could be implemented as a second step extension of proposal A.&lt;/p>
&lt;h3 id="summary-of-api-call-management">Summary of API call management&lt;/h3>
&lt;p>Below is a summary of the steps in API call management that would be introduced by the proposals above.&lt;/p>
&lt;h4 id="enqueueing-a-new-api-call">Enqueueing a new API call&lt;/h4>
&lt;p>Having a separate component (&lt;code>APIQueue&lt;/code> in proposal A and partially C) would make the API calls explicit to the caller by directly calling &lt;code>Add()&lt;/code> on the &lt;code>APIQueue&lt;/code>.
This means it will be visible from the scheduler or plugins that an API call will be sent, and various options could be easily passed.&lt;/p>
&lt;p>Using a cache (proposals B and C), the API call will be hidden and executed implicitly when needed, based on the cache&amp;rsquo;s internal logic.
It&amp;rsquo;s questionable how to pass some options to the API call, e.g., an &lt;code>OnFinish&lt;/code> channel or additional metadata. Error handling might also be less verbose for the caller.&lt;/p>
&lt;p>Updating a cache with API call details would be similar across all proposals. Given the details, it would be possible to know precisely which fields will be updated by the API call.
Some &lt;code>Update()&lt;/code> method could then apply these changes to an object, and the result could be stored in the cache. If any future update appears, it will be routed similarly.&lt;/p>
&lt;p>In all proposals, if there isn&amp;rsquo;t any API call already enqueued for a given object, its UID will be added to the queue that will later be consumed by the API calls runner.
In other scenarios, more advanced logic will be required. See the section below for more details.&lt;/p>
&lt;h4 id="enqueueing-another-api-call-for-the-same-object">Enqueueing another API call for the same object&lt;/h4>
&lt;p>Another API call for the same object could be enqueued, while the previous one is still waiting to be executed.
Based on API calls categorization, some updates might need to be merged. This logic has to be implemented and could be achieved similarly for all three proposals.
In general, given the API calls categorization, the calls could be simply merged by overwriting the details with the new ones, if applicable.
For &lt;code>StatusUpdateCall&lt;/code>, merging will check if the &lt;code>NominatedNodeName&lt;/code> or Pod condition changed and then overwrite these fields accordingly.&lt;/p>
&lt;p>Skipping or overwriting less or more important API calls could be done by configuring an importance value for each &lt;code>CallType&lt;/code>
and then making a decision based on comparison while adding a new API call. Not all API calls would need to implement their merging strategy.
Merging should also allow deciding if the API call should be removed from the queue when the update reverts a previous one that wasn&amp;rsquo;t executed yet.&lt;/p>
&lt;p>In proposals A and C, the merging strategy (&lt;code>Merge()&lt;/code> method in &lt;code>APICall&lt;/code>) would implement this merging logic.
In proposal B, some other configurable method would need to be designed to implement this.&lt;/p>
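&lt;p>As an illustration of merging two pending status updates for the same Pod, the following sketch overwrites only the fields that the newer call sets. The &lt;code>StatusUpdate&lt;/code> struct, its field names, and the standalone &lt;code>Merge&lt;/code> function are assumptions made for this example and do not reflect the actual &lt;code>APICall&lt;/code> interface.&lt;/p>

```go
package main

import "fmt"

// StatusUpdate is a hypothetical stand-in for the details of a StatusUpdateCall.
// A false "Set" flag means the call leaves that field untouched.
type StatusUpdate struct {
	SetNominatedNodeName bool
	NominatedNodeName    string
	SetCondition         bool
	ConditionReason      string
}

// Merge folds a newer call into an older one that has not been executed yet.
// Fields set by the newer call overwrite the older values; the rest are kept.
func Merge(older, newer StatusUpdate) StatusUpdate {
	merged := older
	if newer.SetNominatedNodeName {
		merged.SetNominatedNodeName = true
		merged.NominatedNodeName = newer.NominatedNodeName
	}
	if newer.SetCondition {
		merged.SetCondition = true
		merged.ConditionReason = newer.ConditionReason
	}
	return merged
}

func main() {
	older := StatusUpdate{SetCondition: true, ConditionReason: "Unschedulable"}
	newer := StatusUpdate{SetNominatedNodeName: true, NominatedNodeName: "node-1"}
	// The merged call carries both the condition and the nominated node name,
	// so a single API call is enough.
	fmt.Printf("%+v\n", Merge(older, newer))
}
```

&lt;p>A real implementation would also have to decide when a merged call becomes a no-op and can be removed from the queue, which this sketch does not attempt.&lt;/p>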
&lt;p>Merging, overwriting, or skipping a call could get more complicated if the previous API call is already in flight.
See the &lt;a href="#enqueueing-an-api-call-while-a-previous-one-is-in-flight"
>enqueueing an API call while a previous one is in-flight&lt;/a>
section for more details.
In proposal B, setting the merging strategy might be more complicated and could require providing custom logic through some interfaces.&lt;/p>
&lt;h4 id="receiving-object-update-through-event-handlers">Receiving object update through event handlers&lt;/h4>
&lt;p>An object might get updated or deleted externally in the meantime, while some API call is enqueued for the same object.
One such scenario might be setting &lt;code>NominatedNodeName&lt;/code> by an external component (see &lt;a href="https://github.com/kubernetes/enhancements/issues/5278"
target="_blank" rel="noopener">KEP-5278&lt;/a>
).
For Pod status updates themselves, making an update based on the old object wouldn&amp;rsquo;t cause trouble,
because of the strategic merge patch used – it will just overwrite the Pod condition or &lt;code>NominatedNodeName&lt;/code> if needed.
It is assumed that the scheduler should overwrite all such updates according to the actual needs,
and if it&amp;rsquo;s not expected, custom logic could always be added using an &lt;code>APICall&lt;/code> interface.&lt;/p>
&lt;p>However, to support other potential use cases and have the newest object possible in the cache (proposals B and C, and optionally A),
merging the object received by event handlers with API call details should also be added.
It would work similarly to updating a cache in the section above.&lt;/p>
&lt;p>It also should be defined how to handle such external updates if the API call is completed and the scheduler is waiting for the update to come in event handlers.
The &lt;code>ResourceVersion&lt;/code> of the object could be used to distinguish it, i.e., apply the API call details
as long as the &lt;code>ResourceVersion&lt;/code> of the received object is older than the version returned by the update API call.&lt;/p>
&lt;h4 id="executing-the-api-call">Executing the API call&lt;/h4>
&lt;p>In all three proposals, the API call could be executed by a dedicated goroutine (the API calls runner). The runner checks whether a goroutine is available in the pool
(whose size could be configurable) and, if so, fetches the first resource ID from the queue. The API call for this resource is then executed in a new goroutine,
which is freed for the next call once the call completes.&lt;/p>
&lt;h4 id="enqueueing-an-api-call-while-a-previous-one-is-in-flight">Enqueueing an API call while a previous one is in-flight&lt;/h4>
&lt;p>One other possible scenario occurs when an API call is executing (is in-flight) and a new API call for the same object is added.
If both calls have the same type, standard merging logic could be applied. This involves adjusting the new API call with the in-flight call&amp;rsquo;s details
to reflect the changes that are already in-flight and avoid repeating them in the next call
(note that the in-flight call should still be stored in the map with details, but not in the queue with resource IDs).&lt;/p>
&lt;h4 id="waiting-for-the-api-call-to-finish">Waiting for the API call to finish&lt;/h4>
&lt;p>In some use cases, the caller would like to wait for the asynchronous API call to finish.
This could be achieved by passing an &lt;code>OnFinish&lt;/code> channel along with the call that will receive the API call result (nil or error).
This way, procedures that are already asynchronous, such as binding, can be easily migrated to the new mechanism
just by blocking on the call completion. This channel could be easily used with proposal A,
but proposals B and C would require passing it through cache methods, which could be less readable.&lt;/p>
&lt;h4 id="retrying-api-calls">Retrying API calls&lt;/h4>
&lt;p>As API calls are getting overwritten or skipped, failure of one call might end up in losing multiple operations.
That&amp;rsquo;s why, for retryable errors, it should be possible to re-enqueue the API call and try it again soon.
Such logic could be explored, but having an &lt;code>OnFinish&lt;/code> channel and handling errors by the caller should be enough for the actual use cases.&lt;/p>
&lt;p>For example, if a binding API call fails, the binding cycle procedure for that Pod will be notified via the &lt;code>OnFinish&lt;/code> channel.
It will then invoke a failure handler that re-adds the Pod to the scheduling queue to retry.&lt;/p>
&lt;p>If no procedure listens on the &lt;code>OnFinish&lt;/code> channel of a call (e.g., for a status update), the error will be unhandled (only logged).
This aligns with the current implementation, and status updates aren&amp;rsquo;t critical enough to implement more advanced retry logic.
Update conflicts also won&amp;rsquo;t be an issue for status updates, as a strategic merge patch is used,
and the update will overwrite a condition and &lt;code>NominatedNodeName&lt;/code> if a conflict occurs.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->
&lt;!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- &lt;package>: &lt;date> - &lt;current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->
&lt;ul>
&lt;li>&lt;code>pkg/scheduler&lt;/code>: &lt;code>2025-06-09&lt;/code> - &lt;code>69.6%&lt;/code>&lt;/li>
&lt;li>&lt;code>pkg/scheduler/backend/cache&lt;/code>: &lt;code>2025-06-09&lt;/code> - &lt;code>85.7%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler"
target="_blank" rel="noopener">&lt;code>k8s.io/kubernetes/test/integration/scheduler&lt;/code>&lt;/a>
&lt;ul>
&lt;li>Modify and add test cases covering the feature (with feature flag enabled and disabled), including handling unschedulable pods, preemption and binding.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf"
target="_blank" rel="noopener">scheduler_perf&lt;/a>
&lt;ul>
&lt;li>Add test cases measuring performance of scenarios that use asynchronous API calls (with feature flag enabled and disabled).&lt;/li>
&lt;li>Performance improvement should be visible for the &lt;code>Unschedulable&lt;/code> test case.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>The feature is scoped within the kube-scheduler internally, so there is no interaction with other components.
The whole feature should be already covered by integration tests.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;p>N/A&lt;/p>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Implement the feature behind a feature flag and enable it by default.&lt;/li>
&lt;li>Migrate all Pod-based API calls done during scheduling and binding to the asynchronous version.&lt;/li>
&lt;li>Implement all tests from &lt;a href="#test-plan"
>Test Plan&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from users and fix reported bugs.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>&lt;strong>Upgrade&lt;/strong>&lt;/p>
&lt;p>During the beta period, the feature gate &lt;code>SchedulerAsyncAPICalls&lt;/code> is enabled by default, so users don&amp;rsquo;t need to opt in.
This is a purely in-memory feature for the kube-scheduler, so no special actions are required outside the scheduler.&lt;/p>
&lt;p>&lt;strong>Downgrade&lt;/strong>&lt;/p>
&lt;p>Users need to disable the feature gate.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>This is a purely in-memory feature for the kube-scheduler, and hence there is no version skew strategy.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>SchedulerAsyncAPICalls&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-scheduler&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Pod scheduling might be retried even if the API call hasn&amp;rsquo;t yet been executed.
For instance, a Pod might be retried before its &lt;code>PodScheduled&lt;/code> condition is set to &lt;code>false&lt;/code> (indicating it&amp;rsquo;s unschedulable).
Consequently, external components that would rely on a strict ordering of &lt;code>applying a condition -&amp;gt; retrying a Pod&lt;/code> might be less informed.&lt;/p>
&lt;p>Moreover, some API calls might be canceled. In such cases, if the Pod is bound shortly after,
the &lt;code>PodScheduled&lt;/code> condition might not be set to &lt;code>false&lt;/code> at all, as the binding takes precedence.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes.
In Beta, the feature can be disabled by restarting the kube-scheduler with the feature gate turned off.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The kube-scheduler again starts to run API calls asynchronously.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Given it&amp;rsquo;s a purely in-memory feature and enablement/disablement requires restarting the component
(to change the value of the feature flag), having feature tests is enough.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>A partial rollout failure cannot occur because the kube-scheduler is the only component that rolls out this feature.
However, if upgrading the kube-scheduler itself fails, new Pods won&amp;rsquo;t be scheduled,
while Pods that are already scheduled won&amp;rsquo;t be affected.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>&lt;code>pending_async_api_calls&lt;/code> metric is large or growing abnormally&lt;/li>
&lt;li>&lt;code>async_api_call_execution_total&lt;/code> value with &lt;code>result&lt;/code> indicating error is large even if there are no issues with kube-apiserver&lt;/li>
&lt;li>&lt;code>async_api_call_duration_seconds&lt;/code> visibly increased even if there are no issues with kube-apiserver&lt;/li>
&lt;li>&lt;code>event_handling_duration_seconds&lt;/code> visibly increased&lt;/li>
&lt;li>&lt;code>scheduling_attempt_duration_seconds&lt;/code> visibly increased&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>No. This feature is an in-memory feature of the scheduler,
so its calculations start from scratch every time the scheduler is restarted.
A single upgrade and the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path are therefore equivalent.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>Check whether the values of the &lt;code>async_api_call_execution_total&lt;/code>, &lt;code>async_api_call_duration_seconds&lt;/code> and &lt;code>pending_async_api_calls&lt;/code> metrics change as pods are processed.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>In the default scheduler, we should see the throughput around 100-150 pods/s (&lt;a href="https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&amp;amp;metriccategoryname=Scheduler&amp;amp;metricname=LoadSchedulingThroughput&amp;amp;TestName=load"
target="_blank" rel="noopener">ref&lt;/a>
),
and this feature shouldn&amp;rsquo;t bring any regression there.&lt;/p>
&lt;p>Based on that, &lt;code>schedule_attempts_total&lt;/code> should grow by at least 100 per second
and &lt;code>scheduling_algorithm_duration_seconds&lt;/code> shouldn&amp;rsquo;t exceed 10 ms on average,
if there is a sufficient number of pending pods in the cluster.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:
&lt;ul>
&lt;li>&lt;code>async_api_call_execution_total&lt;/code>&lt;/li>
&lt;li>&lt;code>async_api_call_duration_seconds&lt;/code>&lt;/li>
&lt;li>&lt;code>pending_async_api_calls&lt;/code>&lt;/li>
&lt;li>&lt;code>scheduling_attempt_duration_seconds&lt;/code>&lt;/li>
&lt;li>&lt;code>event_handling_duration_seconds&lt;/code>&lt;/li>
&lt;li>&lt;code>pod_scheduling_sli_duration_seconds&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Components exposing the metric: kube-scheduler&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;ul>
&lt;li>&lt;code>async_api_call_execution_total&lt;/code> with &lt;code>call_type&lt;/code> and &lt;code>result&lt;/code> labels to indicate how many async API calls with specific &lt;code>call_type&lt;/code> completed with that &lt;code>result&lt;/code>.&lt;/li>
&lt;li>&lt;code>async_api_call_duration_seconds&lt;/code> with &lt;code>call_type&lt;/code> and &lt;code>result&lt;/code> labels to indicate how long it took for async API calls with specific &lt;code>call_type&lt;/code> to complete with that &lt;code>result&lt;/code>.&lt;/li>
&lt;li>&lt;code>pending_async_api_calls&lt;/code> with &lt;code>call_type&lt;/code> label to indicate how many async API calls are enqueued for specific &lt;code>call_type&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>Not visibly end-to-end, although the binding API call could be slightly delayed by being routed through the API queue.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>Memory usage within the kube-scheduler might increase due to the queue storing pending API calls.
Memory increase is expected to be linear with the number of pending API calls.&lt;/p>
&lt;p>The number of goroutines will also increase to dispatch API calls, which could affect the CPU usage of the kube-scheduler.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>If the API server is unavailable, the API calls will fail.
The scheduler already handles such cases (by retrying scheduling in most of them), and this feature should not change that.
See the &lt;a href="#retrying-api-calls"
>retrying API calls&lt;/a>
section for more details.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>Unknown&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>8th Apr 2025: The initial KEP is submitted.&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>Alternatives were considered in two areas:&lt;/p>
&lt;ol>
&lt;li>Where and how to handle API calls during queueing and scheduling.&lt;/li>
&lt;li>How to make the API calls asynchronous.&lt;/li>
&lt;/ol>
&lt;h4 id="11-handle-api-calls-in-the-scheduling-queue">1.1: Handle API calls in the scheduling queue&lt;/h4>
&lt;p>One possible approach is to send the API calls through a scheduling queue.
This allows delaying putting the pod into &lt;code>unschedulablePods&lt;/code> after updating the pod.
This prevents race conditions from parallel updates of a single pod because, during the API call,
the pod is in-flight and thus not eligible for rescheduling.&lt;/p>
&lt;p>A new method could be added to the &lt;code>PriorityQueue&lt;/code>, which will take the function to be called asynchronously.
It should also make sure the pod is stored in &lt;code>inFlightPods&lt;/code> to register the cluster events that will happen during the asynchronous part.
Calling &lt;code>AddUnschedulableIfNotPresent&lt;/code> at the end ensures there won&amp;rsquo;t be any race with the asynchronous pod update.
Because the pod would need to be in &lt;code>inFlightPods&lt;/code> during the API call, the size of &lt;code>inFlightEvents&lt;/code> might increase,
but as long as the API call executes quickly, there won&amp;rsquo;t be a significant memory pressure.&lt;/p>
&lt;p>An example solution could look like:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Author: @sanposhiho&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">func&lt;/span> (p &lt;span style="color:#666">*&lt;/span>PriorityQueue) &lt;span style="color:#00a000">AddUnschedulableAsync&lt;/span>(pInfo &lt;span style="color:#666">*&lt;/span>framework.QueuedPodInfo, fn &lt;span style="color:#a2f;font-weight:bold">func&lt;/span>() &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>) {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Make sure the Pod is in inFlightPods before starting the goroutine&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">go&lt;/span> &lt;span style="color:#a2f;font-weight:bold">func&lt;/span>() { &lt;span style="color:#080;font-style:italic">// Or another way of dispatching&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Run fn first &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a2f;font-weight:bold">if&lt;/span> err &lt;span style="color:#666">:=&lt;/span> &lt;span style="color:#00a000">fn&lt;/span>(); err &lt;span style="color:#666">!=&lt;/span> &lt;span style="color:#a2f;font-weight:bold">nil&lt;/span> { &lt;span style="color:#666">...&lt;/span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Push the pod back to the unschedQ after completing fn().&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> p.&lt;span style="color:#00a000">AddUnschedulableIfNotPresent&lt;/span>(&lt;span style="color:#666">...&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This way, we could cover pod status updates during the failure handler (1) and pod status updates for &lt;code>PreEnqueue&lt;/code> plugins (2).
Asynchronous preemption (4) could be migrated to this approach by adding a possibility to return a function from &lt;code>PostFilter&lt;/code> plugins in &lt;code>PostFilterResult&lt;/code>
and calling this function probably in the failure handler together with the status update.&lt;/p>
&lt;p>However, this method cannot be used for the scenario of setting &lt;code>nominatedNodeName&lt;/code> (3), because this operation also occurs during successful scheduling.
Therefore, additional effort would have to be made to specifically ensure that the &lt;code>nominatedNodeName&lt;/code> doesn&amp;rsquo;t collide with a potential status update.
Probably, before this status update in the failure handler, the code should try to cancel the set &lt;code>nominatedNodeName&lt;/code> API call or wait until it finishes.
After that, it should proceed with setting the unschedulable status via the API. The binding call might similarly need to wait.&lt;/p>
&lt;p>Another aspect to consider is how to dispatch the goroutines, as discussed in the &lt;a href="#2-how-to-make-the-api-calls-asynchronous"
>how to make the API calls asynchronous&lt;/a>
section.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Allows delaying putting unschedulable pods back to the queue until the API update completes.&lt;/li>
&lt;li>Prevents race conditions for parallel updates of a single pod by delaying the &lt;code>AddUnschedulableIfNotPresent&lt;/code> call.&lt;/li>
&lt;li>Can easily cover status updates for both scheduling failures and &lt;code>PreEnqueue&lt;/code> failures.&lt;/li>
&lt;li>Asynchronous preemption could be migrated to this approach, increasing consistency.&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Handling of failures might not be consistent, requiring &lt;code>AddUnschedulableAsync&lt;/code> to be called in two places.&lt;/li>
&lt;li>Delaying the &lt;code>AddUnschedulableAsync&lt;/code> call increases pod queuing latency because the initial backoff timestamp is set there.&lt;/li>
&lt;li>Cannot be used for the &lt;code>nominatedNodeName&lt;/code> scenario, requiring additional effort and separate handling.&lt;/li>
&lt;li>Might visibly increase the size of &lt;code>inFlightEvents&lt;/code> if API calls are slow or if there are many calls.&lt;/li>
&lt;/ul>
&lt;h4 id="12-handle-api-calls-in-the-handleschedulingfailure">1.2: Handle API calls in the handleSchedulingFailure&lt;/h4>
&lt;p>Another approach could be to make all unschedulable status update API calls within &lt;code>handleSchedulingFailure&lt;/code>.
This would make this handler the only error reporting path. Synchronous API calls within this handler could be made asynchronous,
but additional effort would be needed to prevent race conditions. This could be achieved by blocking the retries of the pod using &lt;code>PreEnqueue&lt;/code>
(similar to asynchronous preemption) or by implementing advanced queueing logic.&lt;/p>
&lt;p>This way, again, we could cover pod status updates during the failure handler (1),
but pod status updates for &lt;code>PreEnqueue&lt;/code> plugins (2) will require more refactoring by either:&lt;/p>
&lt;ul>
&lt;li>Running a simplified scheduling cycle for pods that were rejected by the &lt;code>PreEnqueue&lt;/code> to update the pod condition.
This might negatively impact scheduling performance because a portion of the scheduling cycles will be spent on pods that are ultimately unschedulable.
Moreover, &lt;code>PreEnqueue&lt;/code> plugins might also need to be called within this simplified scheduling cycle,
or alternatively, &lt;code>PreFilter&lt;/code> plugins could implement the necessary PreEnqueue logic, duplicating it.&lt;/li>
&lt;li>Calling &lt;code>handleSchedulingFailure&lt;/code> directly from the scheduling queue when a pod is rejected by the &lt;code>PreEnqueue&lt;/code>.
This might be feasible, although it would create a circular dependency between the scheduling queue and the handler;
however, it wouldn&amp;rsquo;t have the same performance implications as the solution above.&lt;/li>
&lt;/ul>
&lt;p>Asynchronous preemption could also be migrated to this approach by exposing a function,
provided that the blocking behavior in &lt;code>PreEnqueue&lt;/code> is consistent with the actual preemption blocking mechanism.&lt;/p>
&lt;p>Again, this method cannot be used for the scenario of setting &lt;code>nominatedNodeName&lt;/code> (3), because this operation also occurs during successful scheduling.
Therefore, additional effort would have to be made to specifically ensure that the &lt;code>nominatedNodeName&lt;/code> doesn&amp;rsquo;t collide with a potential status update.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Makes the failure handler the single path of reporting unschedulable status errors.&lt;/li>
&lt;li>Asynchronous preemption could potentially be migrated to this approach, increasing consistency.&lt;/li>
&lt;li>Pod would be immediately put into the scheduling queue, starting the backoff timer right away.&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Requires additional effort to prevent race conditions for updates.&lt;/li>
&lt;li>Handling &lt;code>PreEnqueue&lt;/code> rejections requires significant refactoring (implementing a simplified scheduling cycle or a direct &lt;code>handleSchedulingFailure&lt;/code> call).
&lt;ul>
&lt;li>Simplified scheduling cycle for &lt;code>PreEnqueue&lt;/code> rejections could impact performance and duplicate &lt;code>PreEnqueue&lt;/code> logic.&lt;/li>
&lt;li>Direct &lt;code>handleSchedulingFailure&lt;/code> call would introduce circular dependency.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Cannot be used for the &lt;code>nominatedNodeName&lt;/code> scenario, requiring additional effort and separate handling.&lt;/li>
&lt;/ul>
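&lt;p>The idea of blocking a pod&amp;rsquo;s retries while its status update is still running can be illustrated with a minimal sketch. Everything here is hypothetical and not part of the scheduler: &lt;code>inFlightGate&lt;/code>, &lt;code>allowEnqueue&lt;/code> and &lt;code>runAsync&lt;/code> are made-up names, and &lt;code>allowEnqueue&lt;/code> only stands in for the check a &lt;code>PreEnqueue&lt;/code> plugin might perform.&lt;/p>

```go
package main

import (
	"fmt"
	"sync"
)

// inFlightGate is a hypothetical helper that remembers which pods still have
// an asynchronous status update running.
type inFlightGate struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

func newInFlightGate() *inFlightGate {
	g := new(inFlightGate)
	g.inFlight = make(map[string]bool)
	return g
}

// allowEnqueue plays the role of a PreEnqueue check: a pod may be retried
// only when no API call is in flight for it.
func (g *inFlightGate) allowEnqueue(podUID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !g.inFlight[podUID]
}

// runAsync marks the pod as in flight, runs update in a goroutine, and clears
// the mark when update returns. The returned function blocks until then.
func (g *inFlightGate) runAsync(podUID string, update func() error) (wait func()) {
	g.mu.Lock()
	g.inFlight[podUID] = true
	g.mu.Unlock()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = update() // error handling is out of scope for this sketch
		g.mu.Lock()
		delete(g.inFlight, podUID)
		g.mu.Unlock()
	}()
	return wg.Wait
}

func main() {
	gate := newInFlightGate()

	// release keeps the fake API call blocked until main has checked the gate.
	var release sync.WaitGroup
	release.Add(1)
	wait := gate.runAsync("pod-1", func() error {
		release.Wait()
		return nil
	})

	fmt.Println("during call:", gate.allowEnqueue("pod-1")) // false
	release.Done()
	wait()
	fmt.Println("after call:", gate.allowEnqueue("pod-1")) // true
}
```

&lt;p>The mark is set synchronously before the goroutine starts, so a retry cannot slip in between scheduling failure and the start of the update. This sketch does not address the &lt;code>nominatedNodeName&lt;/code> scenario, which would still need separate handling as noted above.&lt;/p>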
&lt;h4 id="21-just-dispatch-goroutines">2.1: Just dispatch goroutines&lt;/h4>
&lt;p>With appropriate handling of races during updates, we could just dispatch goroutines with API calls.
A potential drawback is that we won&amp;rsquo;t limit the number of these goroutines and won&amp;rsquo;t be able to, e.g., delay the calls.
Limiting goroutines could still be easily achieved by having some group with a limited number of goroutines and a simple queue that will store pending calls.
Some delay might appear due to side effects, especially when there are problems with the kube-apiserver,
so some higher-level mechanism such as (1.1) or (1.2) would need to prevent pod update races.&lt;/p>
&lt;p>Pros:&lt;/p>
&lt;ul>
&lt;li>Simple to implement if the appropriate race handling is chosen.&lt;/li>
&lt;li>Can easily be extended with a simple queue and worker pool to limit number of goroutines.&lt;/li>
&lt;/ul>
&lt;p>Cons:&lt;/p>
&lt;ul>
&lt;li>Does not inherently support delaying calls.&lt;/li>
&lt;li>Higher-level mechanisms (like 1.1 or 1.2) would be needed to prevent pod update races.&lt;/li>
&lt;li>&lt;code>nominatedNodeName&lt;/code> scenario support would require more effort in (1.1) or (1.2).&lt;/li>
&lt;li>Prevents further optimizations, e.g., two API calls can&amp;rsquo;t be merged.&lt;/li>
&lt;/ul>
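&lt;p>The bounded variant mentioned above, a fixed group of goroutines draining a simple queue of pending calls, could be sketched roughly as follows. The names &lt;code>callQueue&lt;/code>, &lt;code>drain&lt;/code> and &lt;code>errCallFailed&lt;/code> are illustrative assumptions, and each API call is reduced to a plain &lt;code>func() error&lt;/code>.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errCallFailed is a placeholder error used by the fake calls below.
var errCallFailed = errors.New("call failed")

// callQueue is a hypothetical FIFO of pending API calls guarded by a mutex.
type callQueue struct {
	mu      sync.Mutex
	pending []func() error
}

// push appends a call to the end of the queue.
func (q *callQueue) push(call func() error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, call)
}

// pop removes and returns the oldest call, or false if the queue is empty.
func (q *callQueue) pop() (func() error, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return nil, false
	}
	call := q.pending[0]
	q.pending = q.pending[1:]
	return call, true
}

// drain runs every queued call using at most `workers` goroutines and
// returns the number of calls that returned an error.
func drain(q *callQueue, workers int) int {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for n := workers; n > 0; n-- {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				call, ok := q.pop()
				if !ok {
					return // queue is empty, this worker is done
				}
				if err := call(); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return failed
}

func main() {
	q := new(callQueue)
	for n := 10; n > 0; n-- {
		id := n
		q.push(func() error {
			if id%5 == 0 {
				return errCallFailed // calls 10 and 5 fail
			}
			return nil
		})
	}
	fmt.Println("failed calls:", drain(q, 3)) // failed calls: 2
}
```

&lt;p>The sketch limits concurrency but, as the cons above note, it neither delays calls nor merges two calls for the same pod, and it does nothing to prevent pod update races on its own.&lt;/p>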
&lt;h3 id="alternative-design-proposals">Alternative design proposals&lt;/h3>
&lt;p>Three design proposals were considered, and &lt;a href="#proposal-c-create-a-separate-component-managing-api-calls-but-treat-the-cache-as-a-middleware"
>proposal C&lt;/a>
was selected for implementation.
The other two proposals are presented below for comparison.&lt;/p>
&lt;h4 id="proposal-a-create-a-separate-component-managing-api-calls">Proposal A: Create a separate component managing API calls&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/proposal-A-separate-component.png" alt="proposal A">&lt;/p>
&lt;p>An API queue could be implemented by adding a new component to the scheduler that would have to understand the API calls&amp;rsquo; details
and (potentially) be able to modify the cache (see the dotted lines in the diagram). This approach would provide an extensible interface and would understand the precedence of API calls.
A standalone component would leave the cache less informed, i.e., not updated with the API calls&amp;rsquo; details, so the scheduler would work with outdated data.
This could be prevented by making the API queue a middleware between the event handler and the cache (dotted lines). This wouldn&amp;rsquo;t have to be fully implemented at first (it could support only a subset of use cases),
but it would allow handling the multiple cached stores that currently exist in the scheduler, i.e., the scheduler cache, nominator, DRA manager (&lt;code>claimTracker&lt;/code>), and volume binding &lt;code>AssumeCache&lt;/code>.&lt;/p>
&lt;p>The interface for the new component could look like the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APICallType &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> StatusUpdateCall APICallType = &lt;span style="color:#b44">&amp;#34;status_update&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> BindingCall APICallType = &lt;span style="color:#b44">&amp;#34;binding&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PreemptionCall APICallType = &lt;span style="color:#b44">&amp;#34;preemption&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// PVCBinding etc.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// APICall describes the API call to be made and stores all required data to make the call,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// e.g. fields that should be updated or object to be added/removed.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APICall &lt;span style="color:#a2f;font-weight:bold">interface&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// CallType returns an API call type. This should be unique across all APICall implementations that could be in the queue at one moment.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">CallType&lt;/span>() APICallType
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// UID returns UID of an object that this call is related to&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">UID&lt;/span>() types.UID
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Execute makes the actual API call&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">Execute&lt;/span>(client clientset.Interface) &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Merge merges two API calls with the same APICallType into one&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">Merge&lt;/span>(oldObj APICall) (&lt;span style="color:#0b0;font-weight:bold">bool&lt;/span>, &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Not required from the very beginning:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Update updates the obj using APICall details and returns the new version&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">Update&lt;/span>(obj &lt;span style="color:#0b0;font-weight:bold">any&lt;/span>) (&lt;span style="color:#0b0;font-weight:bold">any&lt;/span>, &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> QueuedAPICall &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> APICall
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// OnFinish is a channel where the API call result is sent.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// It allows to synchronize on the call completeness, e.g., in binding&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// and handle its result well.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> OnFinish &lt;span style="color:#a2f;font-weight:bold">chan&lt;/span>&lt;span style="color:#666">&amp;lt;-&lt;/span> &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> APIQueue &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">func&lt;/span> (aq &lt;span style="color:#666">*&lt;/span>APIQueue) &lt;span style="color:#00a000">Add&lt;/span>(apiCall QueuedAPICall) &lt;span style="color:#0b0;font-weight:bold">error&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// If API call for specific UID is already enqueued,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// check the callType and skip, replace or merge the call depending on precedence.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">func&lt;/span> (aq &lt;span style="color:#666">*&lt;/span>APIQueue) &lt;span style="color:#00a000">Update&lt;/span>(obj &lt;span style="color:#0b0;font-weight:bold">any&lt;/span>) (&lt;span style="color:#0b0;font-weight:bold">any&lt;/span>, &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>) {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Update the object using API call details if any is enqueued for its UID.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">func&lt;/span> (aq &lt;span style="color:#666">*&lt;/span>APIQueue) &lt;span style="color:#00a000">Run&lt;/span>() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Dispatch limited number of goroutines if queue is non empty.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>APIQueue would provide an &lt;code>Add()&lt;/code> method that would be used to enqueue an API call that has to be executed.
&lt;code>APICall&lt;/code> would provide all required methods to handle it, especially &lt;code>Execute()&lt;/code> for running, &lt;code>Merge()&lt;/code> for merging it with the same call type (e.g. &lt;code>StatusUpdateCall&lt;/code>) that is already enqueued.
There should be only one &lt;code>APICall&lt;/code> implementation with the same &lt;code>CallType&lt;/code> at any given moment (prevented by &lt;code>APIQueue&lt;/code>), but extending this behavior could be considered in the future.
Supporting a cache would require adding an &lt;code>Update()&lt;/code> method that would take the object and update it with API call details (e.g., set NominatedNodeName in a Pod that will soon be updated by the call).
This updated object could then be stored in the cache, and having the call details would make it possible to know which fields need to be changed if any future update occurs before the API call is executed.&lt;/p>
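&lt;p>The skip, replace or merge behavior described for &lt;code>Add()&lt;/code> can be illustrated with a minimal sketch. This is not the proposed implementation: the &lt;code>call&lt;/code> and &lt;code>queue&lt;/code> types, the call type strings and the numeric &lt;code>precedence&lt;/code> field are all hypothetical names chosen for illustration, and the rule that a higher precedence value wins is an assumption.&lt;/p>

```go
package main

import "fmt"

// call is a hypothetical, simplified stand-in for a queued API call.
type call struct {
	uid        string            // UID of the object the call targets
	callType   string            // illustrative values: "status_update", "binding"
	precedence int               // assumption: a higher value wins over a lower one
	fields     map[string]string // fields the call would change on the object
}

// queue keeps at most one pending call per object UID.
type queue struct {
	pending map[string]call
}

func newQueue() *queue {
	q := new(queue)
	q.pending = map[string]call{}
	return q
}

// add enqueues c and reports what happened:
// "added", "merged", "replaced" or "skipped".
func (q *queue) add(c call) string {
	old, ok := q.pending[c.uid]
	switch {
	case !ok:
		q.pending[c.uid] = c
		return "added"
	case old.callType == c.callType:
		// Same call type: merge field by field, newer values win.
		merged := map[string]string{}
		for k, v := range old.fields {
			merged[k] = v
		}
		for k, v := range c.fields {
			merged[k] = v
		}
		old.fields = merged
		q.pending[c.uid] = old
		return "merged"
	case c.precedence > old.precedence:
		// Different call type with higher precedence: replace the old call.
		q.pending[c.uid] = c
		return "replaced"
	default:
		// Different call type with equal or lower precedence: drop it.
		return "skipped"
	}
}

func main() {
	q := newQueue()
	nominate := call{uid: "pod-a", callType: "status_update", precedence: 1,
		fields: map[string]string{"nominatedNodeName": "node1"}}
	condition := call{uid: "pod-a", callType: "status_update", precedence: 1,
		fields: map[string]string{"condition": "PodScheduled=False"}}
	bind := call{uid: "pod-a", callType: "binding", precedence: 2,
		fields: map[string]string{"nodeName": "node1"}}

	fmt.Println(q.add(nominate))
	fmt.Println(q.add(condition))
	fmt.Println(q.add(bind))
	fmt.Println(q.add(nominate))
}
```

&lt;p>Keeping the changed fields next to the pending call is what would let an &lt;code>Update()&lt;/code>-style method re-apply them to a newer copy of the object before the call is executed.&lt;/p>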
&lt;h4 id="proposal-b-make-a-schedulers-cache-managing-api-calls">Proposal B: Make a scheduler&amp;rsquo;s cache managing API calls&lt;/h4>
&lt;p>&lt;img src="https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/proposal-B-cache.png" alt="proposal B">&lt;/p>
&lt;p>This approach differs from the previous one. Instead of creating a separate component, this would reuse the scheduler&amp;rsquo;s cache to handle API calls.
Its advantage would be keeping a consistent state of the updated object in the scheduler and invisibly dispatching API calls if needed.
The largest caveat could be refactoring the scheduler&amp;rsquo;s cache if non-Pod API calls had to be supported - the cache is currently split into multiple, more specialized caches,
i.e., scheduler cache, nominator, DRA manager (&lt;code>claimTracker&lt;/code>), and volume binding &lt;code>AssumeCache&lt;/code>. This means that the scheduler&amp;rsquo;s cache might need to be extended by these use cases or
be able to support those custom storage options using some interfaces. Having a cache would still require storing additional metadata (details), similar to proposal A,
required to make the API calls and to handle incoming updates from the event handler properly (store information about what the API call will change and be able to apply those changes to an updated object).&lt;/p>
&lt;p>It would also require adding specialized methods to the cache to consume details needed to merge the calls and objects properly; for instance, the default &lt;code>UpdatePod&lt;/code> method might not be useful,
because it would be too generic for our use cases. Supporting out-of-tree plugins might also be harder, as it would require making the cache extensible to store some custom objects
and somehow add new methods.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Asynchronous Preemption</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4832/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4832/</guid><description>
&lt;h1 id="kep-4832-asynchronous-preemption">KEP-4832: Asynchronous Preemption&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#when-kube-apiserver-is-unstable"
>When kube-apiserver is unstable&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#consideration-to-race-condition"
>Consideration to race condition&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#the-pod2s-scheduling-is-successful-pod2-is-equal-or-lower-priority-than-pod1"
>The pod2&amp;rsquo;s scheduling is successful (pod2 is equal or lower priority than pod1)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#the-pod2s-scheduling-is-successful-pod2-is-higher-priority-than-pod1"
>The pod2&amp;rsquo;s scheduling is successful (pod2 is higher priority than pod1)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#the-pod2s-scheduling-is-failed-and-starts-the-preemption-pod2-is-equal-or-lower-priority-than-pod1"
>The pod2&amp;rsquo;s scheduling is failed and starts the preemption (pod2 is equal or lower priority than pod1)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#the-pod2s-scheduling-is-failed-and-starts-the-preemption-pod2-is-higher-priority-than-pod1"
>The pod2&amp;rsquo;s scheduling is failed and starts the preemption (pod2 is higher priority than pod1)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#introduce-a-new-extension-point"
>Introduce a new extension point&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP proposes decoupling the API calls for preemption from the scheduling cycle, to improve scheduling throughput in scenarios where scheduling fails.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>A cluster basically runs only one scheduler,
and hence scheduling throughput is a crucial metric for the scheduler.&lt;/p>
&lt;p>The scheduler schedules Pods one by one within the scheduling cycle,
and we try to reduce the API calls made there as much as possible to improve the scheduling cycle throughput.&lt;/p>
&lt;p>The binding cycle is an example of this approach:&lt;/p>
&lt;ol>
&lt;li>The scheduling cycle decides where the Pod should go.&lt;/li>
&lt;li>At the end of the scheduling cycle, the scheduler reserves the Node within the scheduler&amp;rsquo;s cache so that the next scheduling cycle will take the current Pod into consideration.&lt;/li>
&lt;li>The scheduling cycle ends and the binding cycle starts; the binding cycle is run asynchronously, and the scheduler starts the next scheduling cycle.&lt;/li>
&lt;/ol>
&lt;p>This flow allows us to decouple the API call to assign Pod to the Node from the scheduling cycle so that the API call doesn&amp;rsquo;t block the scheduling throughput.&lt;/p>
&lt;p>However, we have a similar problem with preemption: it runs at the PostFilter extension point, which is part of the scheduling cycle.
Preemption has to make API calls to update Pods&amp;rsquo; conditions and delete Pods, which can block the scheduling cycle and reduce scheduling throughput.&lt;/p>
&lt;p>scheduler-perf &lt;a href="https://github.com/kubernetes/kubernetes/blob/342da505bdefbd849b808cca3cb76c24a993025f/test/integration/scheduler_perf/config/performance-config.yaml#L641"
target="_blank" rel="noopener">actually shows&lt;/a>
that the preemption scenario currently takes much longer than the others.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Improve scheduling throughput when pods require issuing preemptions by making API calls asynchronous&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Making the same enhancement for DRA is not a goal of this KEP because DRA is still under construction.
&lt;ul>
&lt;li>DRA maintainers could technically apply the same approach alongside this KEP, but this KEP doesn&amp;rsquo;t discuss how.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>The preemption plugin makes the API calls for preemption asynchronously after the &lt;code>PostFilter&lt;/code> extension point
so that the scheduler can move on to scheduling other Pods while those API calls are in flight.
After the preemption goroutine is done, scheduling of the Pod that triggered the preemption will be retried.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="when-kube-apiserver-is-unstable">When kube-apiserver is unstable&lt;/h4>
&lt;p>When kube-apiserver is unstable and API calls in the preemption goroutine fail frequently,
the scheduler could make non-optimal scheduling decisions:
it nominates Pods at &lt;code>PostFilter&lt;/code>, but those Pods won&amp;rsquo;t be scheduled on the nominated Nodes because the preemption API calls fail.&lt;/p>
&lt;p>Let&amp;rsquo;s say many mid-priority Pods are making the preemption API calls.
With this proposal, while the preemption goroutines for them are running,
the scheduler assumes they&amp;rsquo;ll eventually be scheduled on the Nodes
that the preemptions are targeting, via &lt;code>.Status.NominatedNodeName&lt;/code>.
So, the scheduling of other Pods with the same or lower priority takes those preemptor Pods into consideration,
which is correct if the preemption goroutine finishes successfully, but leads to suboptimal scheduling results otherwise.
(Higher priority Pods won&amp;rsquo;t be affected; they can take the place reserved for lower priority Pods via &lt;code>.Status.NominatedNodeName&lt;/code>.)&lt;/p>
&lt;p>However, when kube-apiserver is unstable, the scheduler doesn&amp;rsquo;t behave well in the first place
because it relies on a lot of communication with kube-apiserver.
Even if the scheduler makes the best scheduling decision, the binding API call might still fail.&lt;/p>
&lt;p>So, this issue doesn&amp;rsquo;t need special attention.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>To make preemption asynchronous, we will change the preemption plugin&amp;rsquo;s implementation as follows:&lt;/p>
&lt;ol>
&lt;li>The preemption PostFilter plugin calculates the preemption target and nominates the Pod for the Node. (We&amp;rsquo;ll use the &lt;code>AddNominatedPod&lt;/code> API exposed from the scheduling framework to plugins.)&lt;/li>
&lt;li>The preemption PostFilter plugin starts a goroutine to make the API calls, and returns a success status without waiting for the goroutine to finish.&lt;/li>
&lt;li>The preemption plugin blocks the Pod while the preemption goroutine is in progress, using the PreEnqueue extension point, so that the target Pod won&amp;rsquo;t be retried during this time.&lt;/li>
&lt;/ol>
&lt;p>Afterwards, the preemption goroutine makes the actual API calls to delete victim Pods and set &lt;code>Pod.Status.NominatedNodeName&lt;/code>.
If the preemption goroutine fails at some point, it reverts the nomination via &lt;code>AddNominatedPod&lt;/code> with &lt;a href="https://github.com/kubernetes/kubernetes/blob/f5c538418189e119a8dbb60e2a2b22394548e326/pkg/scheduler/schedule_one.go#L135"
target="_blank" rel="noopener">&lt;code>clearNominatedNode&lt;/code>&lt;/a>
.&lt;/p>
&lt;p>Once the preemption goroutine completes, the preemption plugin ungates the Pod;
the Pod is moved back to the scheduling queue by the Pod/delete event, and (hopefully) scheduled on the nominated node in the next scheduling cycle.&lt;/p>
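&lt;p>The flow above can be sketched as a toy model. This is only an illustration under stated assumptions, not the plugin&amp;rsquo;s code: the &lt;code>preemptor&lt;/code> type, its methods and &lt;code>errAPI&lt;/code> are hypothetical names, the &lt;code>apiCalls&lt;/code> callback stands in for deleting victim Pods and setting &lt;code>Pod.Status.NominatedNodeName&lt;/code>, and ungating the Pod after a failed goroutine is an assumption. Requeueing through the Pod/delete event is not modeled.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAPI is a hypothetical error used to simulate a failed API call.
var errAPI = errors.New("api call failed")

// preemptor is a toy model of the plugin state; all names are hypothetical.
type preemptor struct {
	mu        sync.Mutex
	nominated map[string]string // pod -> nominated node (in-memory nomination)
	inFlight  map[string]bool   // pods whose preemption goroutine is running
	wg        sync.WaitGroup
}

func newPreemptor() *preemptor {
	p := new(preemptor)
	p.nominated = map[string]string{}
	p.inFlight = map[string]bool{}
	return p
}

// postFilter nominates pod for node, starts the API calls in a goroutine
// and returns immediately without waiting for them.
func (p *preemptor) postFilter(pod, node string, apiCalls func() error) {
	p.mu.Lock()
	p.nominated[pod] = node
	p.inFlight[pod] = true
	p.mu.Unlock()

	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		err := apiCalls()
		p.mu.Lock()
		defer p.mu.Unlock()
		if err != nil {
			delete(p.nominated, pod) // revert the nomination on failure
		}
		delete(p.inFlight, pod) // assumption: ungate the pod in both cases
	}()
}

// preEnqueue reports whether pod may re-enter the scheduling queue.
func (p *preemptor) preEnqueue(pod string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return !p.inFlight[pod]
}

// nominatedNode returns the node pod is nominated for, or "" if none.
func (p *preemptor) nominatedNode(pod string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.nominated[pod]
}

// wait blocks until all preemption goroutines have finished.
func (p *preemptor) wait() { p.wg.Wait() }

func main() {
	p := newPreemptor()

	var gate sync.WaitGroup
	gate.Add(1)
	p.postFilter("pod1", "node1", func() error {
		gate.Wait() // pretend the API calls are slow
		return nil
	})
	fmt.Println("pod1 may be enqueued while in flight:", p.preEnqueue("pod1"))
	gate.Done()
	p.wait()
	fmt.Println("pod1 may be enqueued after completion:", p.preEnqueue("pod1"))
	fmt.Printf("pod1 nomination after success: %q\n", p.nominatedNode("pod1"))

	p.postFilter("pod2", "node1", func() error { return errAPI })
	p.wait()
	fmt.Printf("pod2 nomination after failure: %q\n", p.nominatedNode("pod2"))
}
```

&lt;p>The point of the sketch is that the nomination is recorded before the goroutine starts, so later scheduling cycles already see it while the API calls are still running.&lt;/p>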
&lt;h3 id="consideration-to-race-condition">Consideration to race condition&lt;/h3>
&lt;p>Thanks to the nomination at &lt;code>PostFilter&lt;/code>, this new asynchronous preemption shouldn&amp;rsquo;t introduce any race condition between scheduling cycles.&lt;/p>
&lt;p>The following sections walk through each scenario to confirm that there is no problem.&lt;/p>
&lt;p>Let&amp;rsquo;s say pod1&amp;rsquo;s preemption (targeting node1) is in progress in the preemption goroutine, and the next scheduling cycle is scheduling pod2.&lt;/p>
&lt;h4 id="the-pod2s-scheduling-is-successful-pod2-is-equal-or-lower-priority-than-pod1">The pod2&amp;rsquo;s scheduling is successful (pod2 is equal or lower priority than pod1)&lt;/h4>
&lt;p>As described above, pod1&amp;rsquo;s &lt;code>PostFilter&lt;/code> nominates pod1 for node1.&lt;/p>
&lt;p>During the scheduling cycle, the scheduler takes into consideration nominated Pods whose priority is equal to or higher than that of the Pod being scheduled (pod2); pod1 is such a Pod,
meaning pod2 won&amp;rsquo;t rob pod1 of its place on node1.&lt;/p>
&lt;h4 id="the-pod2s-scheduling-is-successful-pod2-is-higher-priority-than-pod1">The pod2&amp;rsquo;s scheduling is successful (pod2 is higher priority than pod1)&lt;/h4>
&lt;p>Even though pod1 is nominated for the node, the scheduler allows pod2 to take node1, where pod1&amp;rsquo;s preemption made the space.&lt;/p>
&lt;p>Then, when pod1 comes back to the scheduling cycle, it may not be able to land on node1 because pod2 is scheduled there now.
This happens with both the current scheduler and the scheduler after this KEP, so there is no new issue here.&lt;/p>
&lt;h4 id="the-pod2s-scheduling-is-failed-and-starts-the-preemption-pod2-is-equal-or-lower-priority-than-pod1">The pod2&amp;rsquo;s scheduling is failed and starts the preemption (pod2 is equal or lower priority than pod1)&lt;/h4>
&lt;p>The preemption also takes nominated Pods into consideration when calculating the preemption target.&lt;/p>
&lt;p>Therefore, if the two preemptions for pod1 and pod2 coincidentally select the same Node,
then the preemption for pod2 should decide to make space for both pod1 and pod2.&lt;/p>
&lt;p>So, we don&amp;rsquo;t have to worry about two preemptions targeting the same Node causing any issue.&lt;/p>
&lt;h4 id="the-pod2s-scheduling-is-failed-and-starts-the-preemption-pod2-is-higher-priority-than-pod1">The pod2&amp;rsquo;s scheduling is failed and starts the preemption (pod2 is higher priority than pod1)&lt;/h4>
&lt;p>The pod2&amp;rsquo;s preemption ignores pod1&amp;rsquo;s nomination for node1.&lt;/p>
&lt;p>If the two preemptions for pod1 and pod2 coincidentally select the same Node,
then the preemption for pod2 may just select the same preemption targets as pod1,
and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.&lt;/p>
&lt;p>But, this isn&amp;rsquo;t an issue because the final result is completely the same as the current scheduler;
with the current scheduler, pod1 preempts some Pods on node1, then pod2&amp;rsquo;s scheduling starts, pod2 takes node1,
and when pod1 comes back to the scheduling cycle, it (probably) cannot be scheduled on node1 because of pod2.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go&lt;/code>: &lt;code>2024-09-07&lt;/code> - &lt;code>85.4&lt;/code>&lt;/li>
&lt;li>&lt;code>/pkg/scheduler/framework/preemption/preemption.go&lt;/code>: &lt;code>2024-09-07&lt;/code> - &lt;code>27.2&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>Because the coverage for preemption.go is pretty low, we have to improve the testing there before the change for this KEP.&lt;/p>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>We have to add integration tests to make sure the asynchronous preemption is performed appropriately,
especially in the scenarios listed in &lt;a href="#consideration-to-race-condition"
>Consideration to race condition&lt;/a>
.&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>We&amp;rsquo;ll add test cases in which multiple Pods trigger preemption.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature flag&lt;/li>
&lt;li>All tests mentioned in &lt;a href="#test-plan"
>Test Plan&lt;/a>
are implemented.&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from users and fix reported bugs.&lt;/li>
&lt;li>Change the feature flag to be enabled by default.&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from users and fix reported bugs.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>&lt;strong>Upgrade&lt;/strong>&lt;/p>
&lt;p>During the alpha period, users have to enable the feature gate &lt;code>SchedulerAsyncPreemption&lt;/code> to opt in to this feature.
This is a purely in-memory feature of kube-scheduler, so no other special actions are required outside the scheduler.&lt;/p>
&lt;p>&lt;strong>Downgrade&lt;/strong>&lt;/p>
&lt;p>Users need to disable the feature gate.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>This is a purely in-memory feature of kube-scheduler, and hence no version skew strategy is needed.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: &lt;code>SchedulerAsyncPreemption&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-scheduler&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No. The feature is a performance optimization that affects every Pod that needs preemption, but there are no functional changes: the result of the preemption is the same.
However, as mentioned in &lt;a href="#when-kube-apiserver-is-unstable"
>When kube-apiserver is unstable&lt;/a>
, scheduling results could be different.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes.
The feature can be disabled in Alpha and Beta versions
by restarting kube-scheduler with the feature-gate off.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The scheduler again starts to make the preemption API calls asynchronously.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Given it&amp;rsquo;s a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag),
having feature tests is enough.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>A partial rollout failure cannot happen because the scheduler is the only component that rolls out this feature.
However, if upgrading the scheduler itself fails somehow, new Pods won&amp;rsquo;t be scheduled anymore.
If this enhancement introduces a bug in the preemption, and downgrading the scheduler also fails somehow,
running Pods could be affected, for example, by being deleted by mistake (depending on the bug).&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>Something may be wrong with the preemption if &lt;code>goroutines_duration_seconds{operation=preemption}&lt;/code> is too high.
Also, if &lt;code>preemption_attempts_total&lt;/code> increases too much, that might indicate bugs around the preemption.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>No. This feature is an in-memory feature of the scheduler,
so a plain upgrade and the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path behave the same.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>This feature is used during all Pods&amp;rsquo; preemption if the feature gate is enabled.
You can see if the scheduler triggers any preemptions via &lt;code>preemption_attempts_total&lt;/code> metric.&lt;/p>
&lt;p>You can find Pods that have triggered the preemption by referring to &lt;code>.Status.NominatedNodeName&lt;/code>,
and Pods that have been preempted by referring to their condition with &lt;code>type: DisruptionTarget&lt;/code> and &lt;code>reason: PreemptionByScheduler&lt;/code>.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> API .status
&lt;ul>
&lt;li>Other field: If &lt;code>.Status.NominatedNodeName&lt;/code> of Pods is non-empty, they have experienced the preemption running asynchronously.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;ul>
&lt;li>The failure rate of the preemption goroutine (&lt;code>goroutines_execution_total{result=error, operation=preemption}&lt;/code>/&lt;code>goroutines_execution_total{operation=preemption}&lt;/code>) should be &amp;lt; 0.01.&lt;/li>
&lt;/ul>
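&lt;p>As a worked example of the ratio above, the following sketch checks the failure rate against the 0.01 threshold. The function names are hypothetical, the counter values would come from the two &lt;code>goroutines_execution_total&lt;/code> series, and treating an empty total as a rate of zero is an assumption.&lt;/p>

```go
package main

import "fmt"

// sloThreshold is the failure-rate bound stated in the SLO.
const sloThreshold = 0.01

// failureRate returns errorCount/totalCount.
// Assumption: when nothing has run yet, the rate is treated as 0.
func failureRate(errorCount, totalCount float64) float64 {
	if totalCount == 0 {
		return 0
	}
	return errorCount / totalCount
}

// meetsSLO reports whether the failure rate is strictly below the threshold.
func meetsSLO(errorCount, totalCount float64) bool {
	return sloThreshold > failureRate(errorCount, totalCount)
}

func main() {
	// Example counter values; these numbers are made up for illustration.
	fmt.Println(meetsSLO(5, 1000))
	fmt.Println(meetsSLO(20, 1000))
}
```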
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: &lt;code>goroutines_execution_total{result=error, operation=preemption}&lt;/code>&lt;/li>
&lt;li>Components exposing the metric: kube-scheduler&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;ul>
&lt;li>&lt;code>goroutines_duration_seconds&lt;/code> (w/ label: &lt;code>operation&lt;/code>): to observe how long each preemption goroutine takes to complete.&lt;/li>
&lt;li>&lt;code>goroutines_execution_total&lt;/code> (w/ labels: &lt;code>operation&lt;/code>, &lt;code>result&lt;/code>): to observe how many preemption goroutines have failed.&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No. Just move the existing API calls from &lt;code>PostFilter&lt;/code> into goroutines.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>The scheduler starts to run more goroutines in the preemption plugin, so CPU usage may go up.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>In such cases, API calls for the preemption fail in the preemption goroutines.
However, in that situation the scheduler essentially cannot do anything, not just preemption, because it cannot get objects, bind Pods to Nodes, etc.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>Nothing.&lt;/p>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>Sep 07, 2024: The initial KEP is submitted.&lt;/li>
&lt;li>Nov 08, 2024: The implementation PR is merged.&lt;/li>
&lt;li>Feb 03, 2025: The PR to promote it to beta is submitted.&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="introduce-a-new-extension-point">Introduce a new extension point&lt;/h3>
&lt;p>To make this kind of scenario easier to implement for other plugins, we can implement a new extension point &lt;code>AsyncPostFilter&lt;/code>.
We calculate the preemption target and nominate the Pod for the Node at &lt;code>PostFilter&lt;/code>, and then &lt;code>AsyncPostFilter&lt;/code> starts asynchronously, in which the preemption plugin makes API calls for the preemption.&lt;/p>
&lt;p>The Pod won&amp;rsquo;t be queued back to the queue until &lt;code>AsyncPostFilter&lt;/code> is done.&lt;/p>
&lt;p>We don&amp;rsquo;t go with this idea because we can implement the async preemption without introducing a new extension point.
Adding a new extension point unnecessarily is something we may regret later, and we can still add one if it proves really necessary.&lt;/p></description></item><item><title>Resources: Authorize with Selectors</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4601/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4601/</guid><description>
&lt;h1 id="kep-4601-authorize-with-selectors">KEP-4601: Authorize with Selectors&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#authorization-attributes-changes"
>Authorization Attributes changes&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#future-proofing-your-authorization-webhook-for-future-verbs"
>Future-proofing your authorization webhook for future verbs&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#subjectaccessreview-changes"
>SubjectAccessReview Changes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#node-authorizer-changes"
>Node Authorizer Changes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#cel-authorizer-changes"
>CEL Authorizer Changes&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#as-a-sar-client-i-want-to-check-a-request-with-a-field-or-label-selector"
>As a SAR client, I want to check a request with a field or label selector&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#as-an-authorization-webhook-author-i-want-to-easily-consume-the-field-and-label-selectors"
>As an authorization webhook author, I want to easily consume the field and label selectors&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#client-provides-field-or-label-selector-to-kube-apiserver-that-does-not-parse"
>client provides field or label selector to kube-apiserver that does not parse&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#client-provides-field-or-label-selector-to-kube-apiserver-with-improper-verb"
>client provides field or label selector to kube-apiserver with improper verb&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#client-provides-sar-where-field-rawselector-does-not-match-field-requirements"
>client provides SAR where field rawSelector does not match field requirements.&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#new-kube-apiserver-old-webhook-authorizer"
>New kube-apiserver, old webhook authorizer&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#old-kube-apiserver-new-in-cluster-authorizer-or-any-sar-client"
>Old kube-apiserver, new in-cluster authorizer (or any SAR client)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The authorization attributes will be extended to include field selectors and label selectors from
List, Watch, and DeleteCollection.
This will allow authorizers to use these selectors when making an authorization decision.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Security for per-node workloads could be improved by exposing field and label selectors to authorizers.
Adding them as authorization attributes allows the development of new kinds of authorizers that
leverage this information to provide security.
In particular, it enables out-of-tree authorizers to experiment with ways to express restrictions based on field and label selectors.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Add field and label selectors to authorization attributes for List, Watch, and DeleteCollection verbs.&lt;/li>
&lt;li>Add field and label selectors to webhook authorization types.&lt;/li>
&lt;li>Add field and label selectors to SelfSubjectAccessReview (SSAR), SubjectAccessReview (SAR), and LocalSubjectAccessReview.&lt;/li>
&lt;li>Update node authorizer to restrict on nodeName field selector.&lt;/li>
&lt;li>Add field and label selectors to CEL authorizer implementation.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Create a generic in-tree authorizer that manages field or label selectors.&lt;/li>
&lt;li>Expand the audit surface area, since requestURI is already included&lt;/li>
&lt;li>Expand the admission surface area (admission.Attributes, AdmissionReview, available to admission)
since admission verbs don&amp;rsquo;t support field/label selectors&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>List, Watch, and DeleteCollection requests directly have field and label selector options.
A single-item List or Watch request is still a list as normal (including selectors), but also includes a name.&lt;/p>
&lt;h3 id="authorization-attributes-changes">Authorization Attributes changes&lt;/h3>
&lt;p>The authorization attributes have easy access to the query parameter field and label selectors.
To avoid confusion, field and label selectors will not be included in authorization attributes for kube-apiserver requests
with verbs where the field selector has no semantic meaning.
In practice this means that (for now), only List, Watch, and DeleteCollection have field and label selectors.&lt;/p>
&lt;p>SubjectAccessReviews submitted to the kube-apiserver with verbs that do not honor the selectors will NOT modify the field and label selector attributes.
The client is trusted to be sending only combinations that will be honored.&lt;/p>
&lt;p>Any authorizer that gets an error from &lt;code>GetFieldSelector&lt;/code> or &lt;code>GetLabelSelector&lt;/code> may attempt to authorize without
field or label selectors since that will authorize using a wider permission (field and label selectors can only reduce access).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> Attributes &lt;span style="color:#a2f;font-weight:bold">interface&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// GetFieldSelector is lazy, thread-safe, and stores the parsed result and error.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// It can return an error if the field selector cannot be parsed.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Remember that field selector formats vary based on the version of the API being used!&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">GetFieldSelector&lt;/span>() (fields.Requirements, &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// GetLabelSelector is lazy, thread-safe, and stores the parsed result and error.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// It can return an error if the field selector cannot be parsed.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#00a000">GetLabelSelector&lt;/span>() (labels.Requirements, &lt;span style="color:#0b0;font-weight:bold">error&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Webhook authors: remember that the list of verbs accepting field and label selectors may change over time.
If the kube-apiserver sends the FieldSelector or LabelSelector to a webhook, the kube-apiserver intends to honor the selector attributes.&lt;/p>
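&lt;p>The fallback described above for an authorizer that gets an error from &lt;code>GetFieldSelector&lt;/code> can be sketched as follows. This is a minimal illustration, not the in-tree implementation: &lt;code>Requirement&lt;/code>, &lt;code>selectorAttrs&lt;/code>, &lt;code>fakeAttrs&lt;/code>, &lt;code>errUnparseable&lt;/code>, and &lt;code>authorizeList&lt;/code> are hypothetical names, and the policy (allow an unrestricted list only to subjects that hold that wider permission, otherwise require a &lt;code>spec.nodeName&lt;/code> match) is an assumption chosen for the example.&lt;/p>

```go
package main

import "errors"

// Requirement is a hypothetical, simplified stand-in for one parsed
// field selector requirement (not the real fields.Requirement type).
type Requirement struct {
	Key      string
	Operator string
	Value    string
}

// selectorAttrs is a hypothetical, trimmed-down view of the Attributes
// interface shown above.
type selectorAttrs interface {
	GetFieldSelector() ([]Requirement, error)
}

// errUnparseable is an illustrative error for a selector that does not parse.
var errUnparseable = errors.New("field selector does not parse")

// fakeAttrs is a test double that returns canned results.
type fakeAttrs struct {
	reqs []Requirement
	err  error
}

func (f fakeAttrs) GetFieldSelector() ([]Requirement, error) { return f.reqs, f.err }

// authorizeList decides a list request under an assumed policy:
// subjects with the wider, unrestricted list permission are always allowed;
// everyone else must restrict the request to nodeName.
func authorizeList(a selectorAttrs, hasUnrestrictedList bool, nodeName string) bool {
	if hasUnrestrictedList {
		return true
	}
	reqs, err := a.GetFieldSelector()
	if err != nil {
		// Evaluate the request as if it carried no selector at all.
		// That needs the wider permission, which this subject lacks.
		return false
	}
	want := Requirement{Key: "spec.nodeName", Operator: "=", Value: nodeName}
	for _, r := range reqs {
		if r == want {
			return true
		}
	}
	return false
}
```

&lt;p>Because selectors can only reduce access, treating an unparseable selector as an unrestricted request never grants more than the subject already holds.&lt;/p>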
&lt;h4 id="future-proofing-your-authorization-webhook-for-future-verbs">Future-proofing your authorization webhook for future verbs&lt;/h4>
&lt;p>As of 1.31, the only verbs with field and label selectors are List, Watch, and DeleteCollection.
In the future, the kube-apiserver may add field and label selectors to Get, Create, Update, Patch, and Delete.&lt;/p>
&lt;ul>
&lt;li>For Get, this means the retrieved object must match the field and label selector.&lt;/li>
&lt;li>For Create, this means that the resource after all mutation is complete (finalObject) must match the field and label selector.&lt;/li>
&lt;li>For Update/Patch, this means that the finalNewObject and oldObject must match the field and label selector.&lt;/li>
&lt;li>For Delete, this means that the oldObject must match the field and label selector.&lt;/li>
&lt;li>For subresources, if the storage layer cannot verify the parent object matches the selector (both old and new), the request must be rejected.&lt;/li>
&lt;/ul>
&lt;p>We do not allow field and label selectors for Get, because if a client is specifying a selector, they can add a &lt;code>.metadata.name&lt;/code>
field selector and use a List to get equivalent functionality.&lt;/p>
&lt;h3 id="subjectaccessreview-changes">SubjectAccessReview Changes&lt;/h3>
&lt;p>SubjectAccessReview is used for two purposes:&lt;/p>
&lt;ol>
&lt;li>Authorization webhook calls from the kube-apiserver to a webhook.
This usage likely benefits from a serialization with &lt;code>[]Requirement&lt;/code>.&lt;/li>
&lt;li>Authorization checks from a client (often a server process using in-cluster authorization like kube-rbac-proxy).
This usage likely benefits from a serialization that matches the query parameter.&lt;/li>
&lt;/ol>
&lt;p>Their needs are best met with two different serializations (see user stories).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> SubjectAccessReviewSpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ResourceAttributes &lt;span style="color:#666">*&lt;/span>ResourceAttributes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ResourceAttributes &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> FieldSelector &lt;span style="color:#666">*&lt;/span>FieldSelectorAttributes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> LabelSelector &lt;span style="color:#666">*&lt;/span>LabelSelectorAttributes
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// FieldSelectorAttributes indicates a field limited access.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// For webhooks:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// The kube-apiserver will never send a request with rawSelector set, but we cannot control what other clients directly send.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are empty, the request is not limited.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are empty, the request is not limited.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are present, the requirements should be honored&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are present, the request is invalid.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Webhook authors are encouraged to&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * ensure rawSelector and requirements are not both set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * consider the requirements field if set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * not try to parse or consider the rawSelector field if set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// This is to avoid another CVE-2022-2880 (i.e. getting different systems to agree on how exactly to parse&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// a query is not something we want), see https://www.oxeye.io/resources/golang-parameter-smuggling-attack for more details.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// For the kube-apiserver:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are empty, the request is not limited.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are empty, the rawSelector will be parsed and limited if the parsing succeeds.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are present, the requirements should be honored&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are present, the request is invalid.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> FieldSelectorAttributes &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// rawSelector is the serialization of a field selector that would be included in a query parameter.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Webhook implementations are encouraged to ignore rawSelector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The kube-apiserver&amp;#39;s SubjectAccessReview will parse the rawSelector. &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RawSelector &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// requirements is the parsed interpretation of a field selector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// All requirements must be met for a resource instance to match the selector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Webhook implementations should handle requirements, but how to handle them is up to the webhook.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Since requirements can only limit the request, it is safe to authorize as unlimited request if the requirements&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// are not understood.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Requirements []FieldSelectorRequirement
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// LabelSelectorAttributes indicates a label limited access.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// For webhooks:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// The kube-apiserver will never send a request with rawSelector set, but we cannot control what other clients directly send.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are empty, the request is not limited.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are empty, the request is not limited.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are present, the requirements should be honored&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are present, the request is invalid.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// Webhook authors are encouraged to&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * ensure rawSelector and requirements are not both set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * consider the requirements field if set&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * not try to parse or consider the rawSelector field if set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// This is to avoid another CVE-2022-2880 (i.e. getting different systems to agree on how exactly to parse&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// a query is not something we want), see https://www.oxeye.io/resources/golang-parameter-smuggling-attack for more details.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// For the kube-apiserver:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are empty, the request is not limited.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are empty, the rawSelector will be parsed and limited if the parsing succeeds.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is empty and requirements are present, the requirements should be honored&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// * If rawSelector is present and requirements are present, the request is invalid.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> LabelSelectorAttributes &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// rawSelector is the serialization of a field selector that would be included in a query parameter.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Webhook implementations are encouraged to ignore rawSelector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The kube-apiserver&amp;#39;s SubjectAccessReview will parse the rawSelector. &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> RawSelector &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// requirements is the parsed interpretation of a label selector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// All requirements must be met for a resource instance to match the selector.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Webhook implementations should handle requirements, but how to handle them is up to the webhook.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Since requirements can only limit the request, it is safe to authorize as unlimited request if the requirements&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// are not understood.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Requirements []metav1.LabelSelectorRequirement
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> FieldSelectorRequirement &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// key is the field selector key that the requirement applies to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Key &lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;key&amp;#34; protobuf:&amp;#34;bytes,1,opt,name=key&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// operator represents a key&amp;#39;s relationship to a set of values.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Valid operators are In, NotIn, Exists, DoesNotExist&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The list of operators may grow in the future.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Webhook authors are encouraged to ignore unrecognized operators and assume they don&amp;#39;t limit the request.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The semantics of &amp;#34;all requirements are AND&amp;#39;d will not change, so other requirements can continue to be enforced.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Operator LabelSelectorOperator &lt;span style="color:#b44">`json:&amp;#34;operator&amp;#34; protobuf:&amp;#34;bytes,2,opt,name=operator,casttype=LabelSelectorOperator&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// values is an array of string values. If the operator is In or NotIn,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the values array must be non-empty. If the operator is Exists or DoesNotExist,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the values array must be empty.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +listType=atomic&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Values []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span> &lt;span style="color:#b44">`json:&amp;#34;values,omitempty&amp;#34; protobuf:&amp;#34;bytes,3,rep,name=values&amp;#34;`&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Importantly, if old webhook authorizers do not honor these new fields, they will assume the broadest possible access and fail closed.
If old in-cluster authorization does not include field and label selectors, the kube-apiserver will assume the broadest possible access and fail closed.&lt;/p>
&lt;h3 id="node-authorizer-changes">Node Authorizer Changes&lt;/h3>
&lt;p>The node authorizer will be modified to only authorize node clients to &lt;code>list&lt;/code> and &lt;code>watch&lt;/code> pods with fieldSelectors
containing &lt;code>spec.nodeName=$nodeName&lt;/code>.
The node authorizer will be modified to authorize pod &lt;code>get&lt;/code> requests based on the graph.&lt;/p>
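&lt;p>The list/watch rule above can be sketched as follows. This is a minimal illustration, not the node authorizer implementation: the &lt;code>FieldRequirement&lt;/code> type and the &lt;code>nodeMayListPods&lt;/code> function are hypothetical names, and the sketch assumes the field selector has already been parsed into requirements that are AND'd together.&lt;/p>

```go
package main

import "fmt"

// FieldRequirement is a hypothetical, simplified stand-in for one parsed
// field selector requirement. It is not the real Kubernetes type.
type FieldRequirement struct {
	Key      string
	Operator string // "=" or "!=" in this sketch
	Value    string
}

// nodeMayListPods reports whether a node client may list or watch pods.
// Because requirements are AND'd, a single requirement that pins
// spec.nodeName to the caller's own node is enough to scope the request.
func nodeMayListPods(nodeName string, reqs []FieldRequirement) bool {
	for _, r := range reqs {
		if r.Key != "spec.nodeName" || r.Operator != "=" {
			continue
		}
		if r.Value == nodeName {
			return true
		}
	}
	return false
}

func main() {
	own := []FieldRequirement{{Key: "spec.nodeName", Operator: "=", Value: "node-a"}}
	other := []FieldRequirement{{Key: "spec.nodeName", Operator: "=", Value: "node-b"}}

	fmt.Println("own node:", nodeMayListPods("node-a", own))
	fmt.Println("other node:", nodeMayListPods("node-a", other))
	fmt.Println("no selector:", nodeMayListPods("node-a", nil))
}
```

&lt;p>In this sketch a request without a selector, or with a selector naming another node, is not authorized by the rule.&lt;/p>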
&lt;h3 id="cel-authorizer-changes">CEL Authorizer Changes&lt;/h3>
&lt;p>While admission isn&amp;rsquo;t supported on List, Watch, or DeleteCollection, it is reasonable to expect that secondary authorization
checks may desire to use those verbs and leverage the field and label selector capabilities.
To support this, we will add two congruent options, similar to:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b44">&amp;#34;fieldSelector&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> cel.&lt;span style="color:#00a000">MemberOverload&lt;/span>(&lt;span style="color:#b44">&amp;#34;resourcecheck_fieldselector&amp;#34;&lt;/span>, []&lt;span style="color:#666">*&lt;/span>cel.Type{ResourceCheckType, cel.StringType}, ResourceCheckType,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> cel.&lt;span style="color:#00a000">BinaryBinding&lt;/span>(resourceCheckName))},
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This will allow usage like &lt;code>authorizer.group('').resource('pods').fieldSelector('spec.nodeName=foo').check('list').allowed()&lt;/code>.
The parsing will happen during the call to &lt;code>allowed&lt;/code> where we track errors and have means of handling them already.
Field and label selectors that fail to parse will be ignored.
No validation of verb and selector combinations is performed.&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="as-a-sar-client-i-want-to-check-a-request-with-a-field-or-label-selector">As a SAR client, I want to check a request with a field or label selector&lt;/h4>
&lt;p>This type of client will probably find the stringified serialization format used in query parameters
the most convenient format for building its request.
Providing the query parameter serialization format avoids the need for a client to grow a decently complex lexer/parser.&lt;/p>
&lt;h4 id="as-an-authorization-webhook-author-i-want-to-easily-consume-the-field-and-label-selectors">As an authorization webhook author, I want to easily consume the field and label selectors&lt;/h4>
&lt;p>This type of consumer will probably find a serialized &lt;code>[]Requirement&lt;/code> the most convenient way to consume the field and label selectors.
Providing the parsed value avoids the need for every consumer to grow a decently complex lexer/parser.&lt;/p>
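&lt;p>To show how the two formats relate, here is a rough sketch that turns a query-parameter style field selector into a list of requirements. The &lt;code>Requirement&lt;/code> type and &lt;code>parseFieldSelector&lt;/code> function are hypothetical and deliberately simplified: the sketch assumes equality maps to &lt;code>In&lt;/code> and inequality to &lt;code>NotIn&lt;/code> with a single value, and it ignores escaping rules that a real parser has to handle.&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// Requirement is a hypothetical, simplified parsed selector requirement.
type Requirement struct {
	Key      string
	Operator string // "In" for = and ==, "NotIn" for != (assumption of this sketch)
	Values   []string
}

// parseFieldSelector splits a raw selector such as "a=b,c!=d" into
// requirements. Any term without an operator makes the whole parse fail.
func parseFieldSelector(raw string) ([]Requirement, error) {
	if raw == "" {
		return nil, nil
	}
	var reqs []Requirement
	for _, term := range strings.Split(raw, ",") {
		var key, value, op string
		if k, v, ok := strings.Cut(term, "!="); ok {
			key, value, op = k, v, "NotIn"
		} else if k, v, ok := strings.Cut(term, "=="); ok {
			key, value, op = k, v, "In"
		} else if k, v, ok := strings.Cut(term, "="); ok {
			key, value, op = k, v, "In"
		} else {
			return nil, fmt.Errorf("invalid selector term %q", term)
		}
		if key == "" {
			return nil, fmt.Errorf("empty key in selector term %q", term)
		}
		reqs = append(reqs, Requirement{Key: key, Operator: op, Values: []string{value}})
	}
	return reqs, nil
}

func main() {
	reqs, err := parseFieldSelector("spec.nodeName=foo,status.phase!=Running")
	fmt.Println(reqs, err)

	_, err = parseFieldSelector("spec.nodeName")
	fmt.Println(err)
}
```

&lt;p>A SAR client would send only the raw string, while a webhook author would receive something shaped like the parsed slice and would not need this parsing step at all.&lt;/p>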
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;p>Remember to update these places in existing code:&lt;/p>
&lt;ol>
&lt;li>authorization webhook matchConditions, which evaluates the v1 SubjectAccessReview that would be sent to the webhook: &lt;a href="https://github.com/kubernetes/kubernetes/blob/bb838fde5bb9df4becb9fd267c84759be9f5400f/staging/src/k8s.io/apiserver/pkg/authorization/cel/compile.go#L197-L205"
target="_blank" rel="noopener">ref&lt;/a>
.&lt;/li>
&lt;li>v1 / v1beta1 SAR translation function &lt;a href="https://github.com/kubernetes/kubernetes/blob/bb838fde5bb9df4becb9fd267c84759be9f5400f/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go#L472-L485"
target="_blank" rel="noopener">ref&lt;/a>
&lt;/li>
&lt;li>v1 SubjectAccessReview construction function &lt;a href="https://github.com/kubernetes/kubernetes/blob/bb838fde5bb9df4becb9fd267c84759be9f5400f/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go#L198"
target="_blank" rel="noopener">ref&lt;/a>
&lt;/li>
&lt;li>cache size decision &lt;a href="https://github.com/kubernetes/kubernetes/blob/bb838fde5bb9df4becb9fd267c84759be9f5400f/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go#L440"
target="_blank" rel="noopener">ref&lt;/a>
&lt;/li>
&lt;/ol>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="client-provides-field-or-label-selector-to-kube-apiserver-that-does-not-parse">client provides field or label selector to kube-apiserver that does not parse&lt;/h4>
&lt;p>The kube-apiserver may still authorize the request without considering the selectors (system:masters for instance).
It will be up to the REST handler to accept or reject requests for bad selectors.
This approach also allows an aggregated API server to have extended field and label selector syntax, though we strongly discourage doing so.
The kube-apiserver will attempt to authorize without the selector information.&lt;/p>
&lt;ul>
&lt;li>If the client is authorized without the selector, then Allow, since they have broader permission.&lt;/li>
&lt;li>If the client is not authorized without the selector, then either NoOpinion or Fail, depending on intent.&lt;/li>
&lt;/ul>
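&lt;p>The two outcomes above could look roughly like this. The sketch is illustrative only: &lt;code>Request&lt;/code>, &lt;code>collectionWide&lt;/code>, and &lt;code>authorizeIgnoringSelector&lt;/code> are hypothetical names, the policy is a placeholder map, and the sketch picks NoOpinion for the unauthorized case even though the text leaves the choice between NoOpinion and Fail open.&lt;/p>

```go
package main

import "fmt"

// Decision models two of the possible authorizer outcomes.
type Decision string

const (
	Allow     Decision = "Allow"
	NoOpinion Decision = "NoOpinion"
)

// Request is a hypothetical, trimmed-down authorization request.
type Request struct {
	User        string
	Verb        string
	Resource    string
	RawSelector string // selector exactly as sent by the client; assumed unparseable here
}

// collectionWide is a placeholder policy: users allowed to act on the
// whole collection regardless of any selector.
var collectionWide = map[string]bool{"cluster-admin": true}

// authorizeIgnoringSelector handles a request whose selector did not parse.
// The selector is not consulted, so only collection-wide permission counts.
func authorizeIgnoringSelector(r Request) Decision {
	if collectionWide[r.User] {
		return Allow // broader permission covers any selector
	}
	return NoOpinion // this sketch chooses NoOpinion rather than Fail
}

func main() {
	bad := "spec.nodeName" // no operator, so it would not parse
	fmt.Println(authorizeIgnoringSelector(Request{User: "cluster-admin", Verb: "list", Resource: "pods", RawSelector: bad}))
	fmt.Println(authorizeIgnoringSelector(Request{User: "kubelet-a", Verb: "list", Resource: "pods", RawSelector: bad}))
}
```

&lt;p>Whether the REST handler later rejects the malformed selector is a separate step and is not modeled here.&lt;/p>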
&lt;h4 id="client-provides-field-or-label-selector-to-kube-apiserver-with-improper-verb">client provides field or label selector to kube-apiserver with improper verb&lt;/h4>
&lt;p>Consider a client that sends an Update request with a field selector on it.
The metav1.UpdateOptions type doesn&amp;rsquo;t allow this, but imagine a devious user with an alternative client library.
The &lt;code>ResolveRequestInfo&lt;/code> method will not add field and label selectors to the &lt;code>requestInfo&lt;/code>, so they will not appear
in the &lt;code>authorization.Attributes&lt;/code>, so the spurious selectors are not passed to the authorizer.
This keeps authorization behavior exactly as it was previously.&lt;/p>
&lt;p>SubjectAccessReviews are not modified prior to calling the kube-apiserver authorizer.
This allows skew in support between the kube-apiserver and other apiservers.&lt;/p>
&lt;h4 id="client-provides-sar-where-field-rawselector-does-not-match-field-requirements">client provides SAR where field rawSelector does not match field requirements.&lt;/h4>
&lt;p>The request is rejected.
Only one of &lt;code>rawSelector&lt;/code> and &lt;code>requirements&lt;/code> can be specified.&lt;/p>
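&lt;p>A validation step enforcing this rule might look like the sketch below. The struct is a simplified stand-in for the selector attributes in a SubjectAccessReview, and &lt;code>validateSelector&lt;/code> and &lt;code>errBothSet&lt;/code> are hypothetical names chosen for illustration.&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
)

// Requirement is a simplified stand-in for a parsed selector requirement.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// FieldSelectorAttributes is a simplified stand-in holding both
// representations of a field selector.
type FieldSelectorAttributes struct {
	RawSelector  string
	Requirements []Requirement
}

// errBothSet is a hypothetical error value for this sketch.
var errBothSet = errors.New("rawSelector and requirements are mutually exclusive")

// validateSelector rejects attributes that set both representations.
// Setting neither, or exactly one, passes.
func validateSelector(a FieldSelectorAttributes) error {
	if a.RawSelector == "" || len(a.Requirements) == 0 {
		return nil
	}
	return errBothSet
}

func main() {
	both := FieldSelectorAttributes{
		RawSelector:  "spec.nodeName=foo",
		Requirements: []Requirement{{Key: "spec.nodeName", Operator: "In", Values: []string{"bar"}}},
	}
	fmt.Println(validateSelector(both))
	fmt.Println(validateSelector(FieldSelectorAttributes{RawSelector: "spec.nodeName=foo"}))
}
```

&lt;p>Rejecting the combination outright avoids having to decide which representation wins when the two disagree.&lt;/p>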
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.
All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->
&lt;!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- &lt;package>: &lt;date> - &lt;current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->
&lt;pre tabindex="0">&lt;code>k8s.io/kubernetes/pkg/registry/authorization/subjectaccessreview: 61.9% of statements
k8s.io/kubernetes/pkg/registry/authorization/util: 82.6% of statements
k8s.io/kubernetes/plugin/pkg/auth/authorizer/node: 77.0% of statements
k8s.io/kubernetes/pkg/apis/admissionregistration/validation: 87.6% of statements
k8s.io/kubernetes/pkg/apis/authorization/validation: 97.0% of statements
k8s.io/apiserver/pkg/admission/plugin/cel: 83.6% of statements
k8s.io/apiserver/pkg/authorization/cel: 53.9% of statements
k8s.io/apiserver/pkg/endpoints/filters: 77.2% of statements
k8s.io/apiserver/pkg/endpoints/request: 65.4% of statements
k8s.io/apiserver/plugin/pkg/authorizer/webhook: 86.6% of statements
&lt;/code>&lt;/pre>&lt;p>Unit tests exercise node authorization, CEL compilation for authorization webhook and admission &lt;code>matchConditions&lt;/code>,
and CEL compilation for authorizer use with and without the feature enabled:&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/plugin/pkg/auth/authorizer/node/node_authorizer_test.go#L75-L81"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/plugin/pkg/auth/authorizer/node/node_authorizer_test.go#L75-L81&lt;/a>
&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/pkg/authorization/cel/compile_test.go#L34"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/pkg/authorization/cel/compile_test.go#L34&lt;/a>
&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook_v1_test.go#L806"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook_v1_test.go#L806&lt;/a>
&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/pkg/admission/plugin/cel/filter_test.go#L503-L620"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/0b1d123fd040359da11dc772947a7908ee907910/staging/src/k8s.io/apiserver/pkg/admission/plugin/cel/filter_test.go#L503-L620&lt;/a>
&lt;/p>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->
&lt;!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/tree/c5f2fc05ad5ef3d68f35263f9f965101b371b8cc/test/integration/apiserver/cel/authorizerselector"
target="_blank" rel="noopener">&lt;code>test/integration/apiserver/cel/authorizerselector/...&lt;/code>&lt;/a>
- &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?test=test%2Fintegration%2Fapiserver%2Fcel%2Fauthorizerselector"
target="_blank" rel="noopener">triage history&lt;/a>
&lt;/p>
&lt;ul>
&lt;li>Fully exercise the new CEL authorizer functions with the feature enabled and disabled&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/blob/c5f2fc05ad5ef3d68f35263f9f965101b371b8cc/test/integration/auth/authz_config_test.go#L472-L485"
target="_blank" rel="noopener">&lt;code>test/integration/auth TestMultiWebhookAuthzConfig&lt;/code>&lt;/a>
- &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?text=TestMultiWebhookAuthzConfig&amp;amp;test=test%2Fintegration%2Fauth"
target="_blank" rel="noopener">triage history&lt;/a>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>positive and negative match tests for a webhook matchCondition using selector matching, on actual API requests using selectors and on SubjectAccessReview requests&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://testgrid.k8s.io/sig-release-master-blocking#integration-master&amp;amp;include-filter-by-regex=test/integration/apiserver/cel/authorizerselector%7ctest/integration/auth&amp;amp;width=5"
target="_blank" rel="noopener">Test history&lt;/a>
&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>This feature is fully tested with unit and integration tests.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;!--
**Note:** *Not required until targeted at a release.*
Define graduation milestones.
These may be defined in terms of API maturity, [feature gate] graduations, or as
something else. The KEP should keep this high-level with a focus on what
signals will be looked at to determine graduation.
Consider the following in developing the graduation criteria for this enhancement:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]
Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
or by redefining what graduation means.
In general we try to use the same stages (alpha, beta, GA), regardless of how the
functionality is accessed.
[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
-->
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented behind a feature flag&lt;/li>
&lt;li>Unit tests demonstrating wiring and fallback&lt;/li>
&lt;li>Integration test demonstrating field selector wiring
&lt;ul>
&lt;li>must include fallback on parsing error as well&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Determine if additional tests are necessary&lt;/li>
&lt;li>Ensure reliability of existing tests&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>All bugs resolved and no new bugs requiring code change since the previous shipped release&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
&lt;p>On upgrade to a version that enables the feature, no configuration changes are required
to maintain previous behavior of CEL expressions and authorization webhooks.
All existing CEL expressions and authorization webhook responses behave identically.&lt;/p>
&lt;p>On upgrade to a version that enables the feature, to make use of the new feature:&lt;/p>
&lt;ul>
&lt;li>authorization webhooks can inspect incoming SubjectAccessReview requests for field and label selector information&lt;/li>
&lt;li>authorization webhook configuration files can include &lt;code>matchConditions&lt;/code> that inspect field and label selector information&lt;/li>
&lt;li>admission webhook API &lt;code>matchConditions&lt;/code> can use authorizer fieldSelector / labelSelector functions&lt;/li>
&lt;li>SubjectAccessReview API requests can specify fieldSelector / labelSelector fields&lt;/li>
&lt;/ul>
&lt;p>On downgrade to a version that does not enable the feature by default, or if the feature is disabled:&lt;/p>
&lt;ul>
&lt;li>field and label selector information will no longer be sent to authorization webhooks&lt;/li>
&lt;li>authorization webhook configuration files can no longer include &lt;code>matchConditions&lt;/code> that inspect field and label selector information&lt;/li>
&lt;li>admission webhook API &lt;code>matchConditions&lt;/code> that use authorizer fieldSelector / labelSelector functions will not error, but will no-op&lt;/li>
&lt;li>SubjectAccessReview API requests that specify fieldSelector / labelSelector fields will drop those fields&lt;/li>
&lt;/ul>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;h4 id="new-kube-apiserver-old-webhook-authorizer">New kube-apiserver, old webhook authorizer&lt;/h4>
&lt;p>The new kube-apiserver will include the field and label selectors, but the old webhook authorizer will ignore them.
The old authorizer will assume the broadest possible action and authorize accordingly.
Because the old authorizer will only allow the action if the user has permission to act on the entire collection, this fails safely.
There may be more rejections than expected, but this behavior matches previous behavior.&lt;/p>
&lt;h4 id="old-kube-apiserver-new-in-cluster-authorizer-or-any-sar-client">Old kube-apiserver, new in-cluster authorizer (or any SAR client)&lt;/h4>
&lt;p>The new client will include the field and label selectors, but the kube-apiserver will ignore them.
The kube-apiserver will assume the broadest possible action and authorize accordingly.
Because the kube-apiserver will only allow the action if the user has permission to act on the entire collection, this fails safely.
There may be more rejections than expected, but this behavior matches previous behavior.&lt;/p>
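&lt;p>The fail-safe argument in both skew cases can be illustrated with a small sketch. Everything here is hypothetical: the &lt;code>request&lt;/code> type, the two placeholder policy maps, and the &lt;code>oldAuthorizer&lt;/code> and &lt;code>newAuthorizer&lt;/code> functions only model the idea that a selector-unaware component sees a request for the whole collection.&lt;/p>

```go
package main

import "fmt"

// request is a hypothetical list-pods request.
type request struct {
	user     string
	nodeName string // value of a spec.nodeName selector; empty means no selector
}

// Placeholder policies for this sketch.
var collectionWide = map[string]bool{"admin": true}        // may list all pods
var nodeScoped = map[string]string{"kubelet-a": "node-a"} // may list pods on one node

// oldAuthorizer does not know about selectors, so it can only answer
// whether the user may act on the entire collection.
func oldAuthorizer(r request) bool {
	return collectionWide[r.user]
}

// newAuthorizer also honors a grant scoped to a single node.
func newAuthorizer(r request) bool {
	if collectionWide[r.user] {
		return true
	}
	node, ok := nodeScoped[r.user]
	if !ok {
		return false
	}
	return r.nodeName == node
}

func main() {
	scoped := request{user: "kubelet-a", nodeName: "node-a"}
	fmt.Println("old:", oldAuthorizer(scoped), "new:", newAuthorizer(scoped))

	all := request{user: "admin"}
	fmt.Println("old:", oldAuthorizer(all), "new:", newAuthorizer(all))
}
```

&lt;p>In this model the selector-unaware authorizer never allows a request that the selector-aware one would deny. It can only reject more, which matches the behavior described above.&lt;/p>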
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;!--
This section must be completed when targeting alpha to a release.
-->
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: AuthorizeWithSelectors&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Feature gate name: AuthorizeNodeWithSelectors&lt;/li>
&lt;li>Components depending on the feature gate:
&lt;ul>
&lt;li>kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Yes. The kube-apiserver will send field and label selector information to authorization webhooks.
The node authorizer will start preventing kubelets from listing pods that are not on their node.&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. Set the FeatureGate to false and restart the kube-apiserver.
The kube-apiserver will stop sending field and label selector information to authorization webhooks.
Persisted CEL expressions using &lt;code>fieldSelector&lt;/code> and &lt;code>labelSelector&lt;/code> authorization functions will still function.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The kube-apiserver will send field and label selector information to authorization webhooks.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes. Integration tests exercise behavior of CEL expressions with the feature enabled and disabled.&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/tree/0b1d123fd040359da11dc772947a7908ee907910/test/integration/apiserver/cel/authorizerselector"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/tree/0b1d123fd040359da11dc772947a7908ee907910/test/integration/apiserver/cel/authorizerselector&lt;/a>
&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;p>Non-kubelet clients using kubelet credentials to make API requests could be forbidden
if they are listing/watching pods without filtering to pods scheduled to the node,
or if they are listing/watching nodes other than their own node.&lt;/p>
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;p>Use of kubelet credentials to make API requests the kubelet is not authorized to make
is unexpected, but could be detected by an increase in the &lt;code>authorization_attempts_total{result=denied}&lt;/code>
metric and by audit events showing requests from a user in the &lt;code>system:nodes&lt;/code> group
with an &lt;code>authorization.k8s.io/decision=forbid&lt;/code> audit annotation.&lt;/p>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;p>Handling of persisted CEL expressions using selector features was tested
with the feature disabled, and with a compatibility version of 1.30,
to ensure that a previous version API server would not have to handle
CEL expressions it did not understand.&lt;/p>
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;p>No&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;p>None&lt;/p>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;p>Workloads do not use this feature directly.&lt;/p>
&lt;p>Audit events of SubjectAccessReview API requests would show if
selector information was being provided.&lt;/p>
&lt;p>Authorization webhooks would be able to observe selector information
provided in requests.&lt;/p>
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;p>Most of the uses are internal to cluster administrators:&lt;/p>
&lt;ul>
&lt;li>authorization webhooks configured with matchConditions using fieldSelector/labelSelector
pass validation and only route requests passing those conditions to the webhook
(&lt;code>apiserver_authorization_match_condition_exclusions_total&lt;/code> metric will increment if match conditions skip)&lt;/li>
&lt;li>authorization webhooks can inspect the SubjectAccessReview requests sent to them to observe selector information&lt;/li>
&lt;li>admission webhooks and validating admission policies can use &lt;code>fieldSelector&lt;/code> and &lt;code>labelSelector&lt;/code> authorizer methods
and pass API validation.&lt;/li>
&lt;/ul>
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;p>Use of this feature should not change existing API SLOs.&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one more of these and delete the rest.
-->
&lt;p>Use of this feature should not change existing API SLIs.&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;p>There are already metrics for the layers this feature is adding to:&lt;/p>
&lt;ul>
&lt;li>authorization latency&lt;/li>
&lt;li>authorization success&lt;/li>
&lt;li>webhook authorizer match condition latency&lt;/li>
&lt;li>webhook authorizer match condition success&lt;/li>
&lt;li>webhook admission match condition latency&lt;/li>
&lt;li>webhook admission match condition success&lt;/li>
&lt;li>validating admission policy match condition latency&lt;/li>
&lt;li>validating admission policy match condition success&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;p>No.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;p>Existing API fields containing CEL expressions support additional CEL functions.&lt;/p>
&lt;p>SubjectAccessReview types (which are not persisted) add new fields for fieldSelector and labelSelector data.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;p>Enabling the feature adds negligible size to authorization webhook payloads.&lt;/p>
&lt;p>Using the authorization selector functions in CEL expressions in authorization webhook matchConditions,
admission webhook matchConditions, and validating admission policies can take additional time,
though this is no different from increasing the complexity or number of CEL expressions generally.
CEL expressions that can be set via REST APIs are subject to cost estimation to limit the complexity
and size of the input data used for selectors.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
&lt;p>No, this feature does not touch nodes.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>This feature is fully contained within the API server.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;ul>
&lt;li>Non-kubelet clients using kubelet credentials are forbidden
&lt;ul>
&lt;li>Detection: logs of non-kubelet client, &lt;code>authorization_attempts_total{result=denied}&lt;/code>
metric increasing, audit events showing requests from a user in the &lt;code>system:nodes&lt;/code> group
with an &lt;code>authorization.k8s.io/decision=forbid&lt;/code> audit annotation&lt;/li>
&lt;li>Mitigations:
&lt;ul>
&lt;li>change the non-kubelet client to use its own credential (preferred)&lt;/li>
&lt;li>adjust the non-kubelet client to use field selectors on pods and nodes&lt;/li>
&lt;li>temporarily disable the &lt;code>AuthorizeNodeWithSelectors&lt;/code> feature gate in kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Diagnostics: the node authorizer logs the following messages at verbosity level 2
when a client attempts to use kubelet credentials to read nodes or pods without
using the expected field selector:
&lt;ul>
&lt;li>&lt;code>node '...' cannot read all nodes, only its own Node object&lt;/code>&lt;/li>
&lt;li>&lt;code>node '...' cannot read '...', only its own Node object&lt;/code>&lt;/li>
&lt;li>&lt;code>can only list/watch pods with spec.nodeName field selector&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Testing: There are tests ensuring the node authorizer forbids these overly broad
read requests. Use of kubelet credentials by non-kubelet clients to make API
requests the kubelet is not authorized to make is unexpected and unwanted.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>Determine whether webhook latency, or the evaluation latency of matchConditions that use these selector
functions, is the primary contributor, and whether that change correlates with enablement of this feature.
Test whether eliminating use of the CEL selector functions in the offending CEL expression resolves the issue.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>v1.31: Alpha release&lt;/li>
&lt;li>v1.32: Beta release&lt;/li>
&lt;li>v1.34: Stable release&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>None considered&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>None considered&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>None&lt;/p></description></item><item><title>Resources: Auto delete PVCs created by StatefulSet</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1847/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1847/</guid><description>
&lt;h1 id="kep-1847-auto-delete-pvcs-created-by-statefulset">KEP-1847: Auto delete PVCs created by StatefulSet&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#background"
>Background&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#changes-required"
>Changes required&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#user-stories"
>User Stories&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-0"
>Story 0&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3"
>Story 3&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#objects-associated-with-the-statefulset"
>Objects Associated with the StatefulSet&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#volume-delete-policy-for-the-statefulset-created-pvcs"
>Volume delete policy for the StatefulSet created PVCs&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#whenscaled-policy-of-delete"
>&lt;code>whenScaled&lt;/code> policy of &lt;code>Delete&lt;/code>.&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#whendeleted-policy-of-delete"
>&lt;code>whenDeleted&lt;/code> policy of &lt;code>Delete&lt;/code>.&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-cascading-deletion"
>Non-Cascading Deletion&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#mutating-persistentvolumeclaimretentionpolicy"
>Mutating &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code>&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#cluster-role-change-for-statefulset-controller"
>Cluster role change for statefulset controller&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>E2E tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#upgradedowngrade--feature-enableddisable-tests"
>Upgrade/downgrade &amp;amp; feature enabled/disable tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha-release"
>Alpha release&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta-release"
>Beta release&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga-release"
>GA release&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-can-this-feature-be-enabled--disabled-in-a-live-cluster"
>How can this feature be enabled / disabled in a live cluster?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#does-enabling-the-feature-change-any-default-behavior"
>Does enabling the feature change any default behavior?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement"
>Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back"
>What happens if we reenable the feature if it was previously rolled back?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#are-there-any-tests-for-feature-enablementdisablement"
>Are there any tests for feature enablement/disablement?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-can-a-rollout-fail-can-it-impact-already-running-workloads"
>How can a rollout fail? Can it impact already running workloads?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-specific-metrics-should-inform-a-rollback"
>What specific metrics should inform a rollback?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested"
>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis"
>Is the rollout accompanied by any deprecations and/or removals of features, APIs,&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads"
>How can an operator determine if the feature is in use by workloads?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine"
>What are the SLIs (Service Level Indicators) an operator can use to determine&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-the-reasonable-slos-service-level-objectives-for-the-above-slis"
>What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability"
>Are there any missing metrics that would be useful to have to improve observability&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#does-this-feature-depend-on-any-specific-services-running-in-the-cluster"
>Does this feature depend on any specific services running in the cluster?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-any-new-api-calls"
>Will enabling / using this feature result in any new API calls?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-introducing-new-api-types"
>Will enabling / using this feature result in introducing new API types?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider"
>Will enabling / using this feature result in any new calls to the cloud provider?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects"
>Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos"
>Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components"
>Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable"
>How does this feature react if the API server and/or etcd is unavailable?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-are-other-known-failure-modes"
>What are other known failure modes?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem"
>What steps should be taken if SLOs are not being met to determine the problem?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The proposal is to add a feature to autodelete the PVCs created by StatefulSet.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Currently, the PVCs created automatically by the StatefulSet are not deleted when
the StatefulSet is deleted. As can be seen by the discussion in the issue
&lt;a href="https://github.com/kubernetes/kubernetes/issues/55045"
target="_blank" rel="noopener">55045&lt;/a>
there are several use
cases where the automatically created PVCs should be deleted as well. In many
StatefulSet use cases, PVCs have a different lifecycle than the pods of the
StatefulSet, and should not be deleted at the same time. Because of this, PVC
deletion will be opt-in for users.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Provide a feature to auto delete the PVCs created by StatefulSet when the volumes are no
longer in use, to ease management of StatefulSets that don&amp;rsquo;t live indefinitely. Because
application state should survive StatefulSet maintenance, the feature ensures that
pod restarts caused by non-scale-down events, such as a rolling update or node drain, do not
delete the PVC.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;p>This proposal does not plan to address how the underlying PVs are treated on PVC deletion.
That functionality will continue to be governed by the reclaim policy of the storage class.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="background">Background&lt;/h3>
&lt;p>The &lt;code>garbagecollector&lt;/code> controller is responsible for ensuring that when a StatefulSet
is deleted, the corresponding pods spawned from the StatefulSet are deleted as well. The
&lt;code>garbagecollector&lt;/code> uses an &lt;code>OwnerReference&lt;/code> added to the &lt;code>Pod&lt;/code> by the StatefulSet
controller to delete the Pod. This proposal leverages a similar mechanism to automatically
delete the PVCs created by the controller from the StatefulSet&amp;rsquo;s VolumeClaimTemplate.&lt;/p>
&lt;h3 id="changes-required">Changes required&lt;/h3>
&lt;p>The following changes are required:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Add &lt;code>persistentVolumeClaimRetentionPolicy&lt;/code> to the StatefulSet spec with the following fields.&lt;/p>
&lt;ul>
&lt;li>&lt;code>whenDeleted&lt;/code> - specifies if the VolumeClaimTemplate PVCs are deleted when
their StatefulSet is deleted.&lt;/li>
&lt;li>&lt;code>whenScaled&lt;/code> - specifies if VolumeClaimTemplate PVCs are deleted when
their corresponding pod is deleted on a StatefulSet scale-down, that is,
when the number of pods in a StatefulSet is reduced via the Replicas field.&lt;/li>
&lt;/ul>
&lt;p>These fields may be set to the following values.&lt;/p>
&lt;ul>
&lt;li>&lt;code>Retain&lt;/code> - the default policy, which is also used when no policy is
specified. This specifies the existing behavior: when a StatefulSet is
deleted or scaled down, no action is taken with respect to the PVCs
created by the StatefulSet.&lt;/li>
&lt;li>&lt;code>Delete&lt;/code> - specifies that the appropriate PVCs as described above will be
deleted in the corresponding scenario, either on StatefulSet deletion or scale-down.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Add &lt;code>patch&lt;/code> to the statefulset controller rbac cluster role for &lt;code>persistentvolumeclaims&lt;/code>.&lt;/p>
&lt;/li>
&lt;/ol>
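&lt;p>As a minimal sketch of the first change, a StatefulSet manifest using the new field might look like the following. The names &lt;code>web&lt;/code> and &lt;code>data&lt;/code>, the image, and the storage size are hypothetical and chosen only for illustration; the policy values shown are one possible combination, not a recommendation.&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                      # hypothetical name
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  # Field proposed by this KEP; both subfields default to Retain.
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete          # delete PVCs when the StatefulSet is deleted
    whenScaled: Retain           # keep PVCs on scale-down
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: registry.example/app:1.0   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data                 # hypothetical template name
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 1Gi
&lt;/code>&lt;/pre>
&lt;p>Omitting &lt;code>persistentVolumeClaimRetentionPolicy&lt;/code> entirely is equivalent to setting both subfields to &lt;code>Retain&lt;/code>, which preserves the existing behavior.&lt;/p>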
&lt;h3 id="user-stories">User Stories&lt;/h3>
&lt;h4 id="story-0">Story 0&lt;/h4>
&lt;p>The user is happy with the legacy behavior of a StatefulSet. They leave all fields
of &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code> set to &lt;code>Retain&lt;/code>. Traditional
StatefulSet behavior is unchanged, both on set deletion and on scale-down.&lt;/p>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>The user is running a StatefulSet as part of an application with a finite lifetime. During
the application&amp;rsquo;s existence the StatefulSet maintains per-pod state, even across scale-up
and scale-down. In order to maximize performance, volumes are retained during scale-down
so that scale-up can leverage the existing volumes. When the application is finished, the
volumes created by the StatefulSet are no longer needed and can be automatically
reclaimed.&lt;/p>
&lt;p>The user would set &lt;code>persistentVolumeClaimRetentionPolicy.whenDeleted&lt;/code> to &lt;code>Delete&lt;/code>, which
would ensure that the PVCs created automatically during the StatefulSet
activation are deleted once the StatefulSet is deleted.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>The user is cost conscious, and can sustain slower scale-up speeds even after a
scale-down, because scaling events are rare, and volume data can be
reconstructed, albeit slowly, during a scale up. However, it is necessary to
bring down the StatefulSet temporarily by deleting it, and then bring it back up
by reusing the volumes. This is accomplished by setting
&lt;code>persistentVolumeClaimRetentionPolicy.whenScaled&lt;/code> to &lt;code>Delete&lt;/code>, and leaving
&lt;code>persistentVolumeClaimRetentionPolicy.whenDeleted&lt;/code> at &lt;code>Retain&lt;/code>.&lt;/p>
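&lt;p>A sketch of the relevant fragment of the StatefulSet spec for this story, assuming the field names proposed in this KEP (the rest of the manifest is omitted):&lt;/p>
&lt;pre>&lt;code class="language-yaml">spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete    # reclaim volumes of pods removed by a scale-down
    whenDeleted: Retain   # keep volumes so a recreated StatefulSet can reuse them
&lt;/code>&lt;/pre>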
&lt;h4 id="story-3">Story 3&lt;/h4>
&lt;p>The user is very cost conscious, and can sustain slower scale-up speeds even after a
scale-down. The user does not want to pay for volumes that are not in use under any
circumstance, and so wants them to be reclaimed as soon as possible. On scale-up
a new volume will be provisioned and the new pod will have to
re-initialize. However, for short-lived interruptions when a pod is killed and
recreated, such as a rolling update or node disruption, the data on volumes is
persisted. This is a key property that ephemeral storage, like emptyDir, cannot
provide.&lt;/p>
&lt;p>The user would set both &lt;code>persistentVolumeClaimRetentionPolicy.whenScaled&lt;/code> and
&lt;code>persistentVolumeClaimRetentionPolicy.whenDeleted&lt;/code> to &lt;code>Delete&lt;/code>, ensuring PVCs are
deleted when corresponding Pods are deleted. New Pods created during a scale-up
that follows a scale-down will wait for freshly created PVCs. PVCs are also deleted
when the set is deleted, reclaiming volumes as quickly as possible and
minimizing expense.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (optional)&lt;/h3>
&lt;p>This feature applies to PVCs which are defined by the volumeClaimTemplate of a
StatefulSet. Any PVC and PV provisioned from this mechanism will function with
this feature. These PVCs are identified by the static naming scheme used by
StatefulSets. Auto-provisioned and pre-provisioned PVCs will be treated
identically, so that if a user pre-provisions a PVC matching those of a
VolumeClaimTemplate it will be deleted according to the deletion policy.&lt;/p>
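&lt;p>To illustrate the naming scheme, assume the convention of joining the volumeClaimTemplate name, the StatefulSet name, and the pod ordinal with hyphens. For a hypothetical StatefulSet &lt;code>web&lt;/code> with a template named &lt;code>data&lt;/code> and three replicas, the fragment below shows which PVC names would be matched. The object is deliberately incomplete (no &lt;code>spec&lt;/code>) and is only meant to show the name.&lt;/p>
&lt;pre>&lt;code class="language-yaml"># PVC names matched for StatefulSet web with volumeClaimTemplate data:
#   data-web-0, data-web-1, data-web-2
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-0   # a pre-provisioned PVC with this name is still subject to the policy
&lt;/code>&lt;/pre>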
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>Currently the PVCs created by StatefulSet are not deleted automatically. Using
&lt;code>whenScaled&lt;/code> or &lt;code>whenDeleted&lt;/code> set to &lt;code>Delete&lt;/code> would delete the PVCs
automatically. Since this involves persistent data being deleted, users should
take appropriate care using this feature. Having the &lt;code>Retain&lt;/code> behavior as
default will ensure that the PVCs remain intact by default and only a conscious
choice made by user will involve any persistent data being deleted.&lt;/p>
&lt;p>This proposed API causes the PVCs associated with the StatefulSet to have
behavior close to, but not the same as, ephemeral volumes, such as emptyDir or
generic ephemeral volumes. This may cause user confusion. PVCs under this policy
will be more durable than ephemeral volumes would be, as they are only deleted on
scale-down or StatefulSet deletion, and not on other pod deletion and recreation
events such as eviction or the death of their node.&lt;/p>
&lt;p>User documentation will emphasize the race conditions associated with changing
policy or rolling back the feature concurrently with StatefulSet deletion or
scale-down. See below in &lt;a href="#design-details"
>Design Details&lt;/a>
for more information.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="objects-associated-with-the-statefulset">Objects Associated with the StatefulSet&lt;/h3>
&lt;p>When a StatefulSet spec has a &lt;code>VolumeClaimTemplate&lt;/code>, PVCs are dynamically created using a
static naming scheme, and each Pod is created with a claim to the corresponding PVC. These
are the precise PVCs meant when referring to the volume or PVC for Pod below, and these
are the only PVCs modified with an ownerRef. Other PVCs referenced by the StatefulSet Pod
template are not affected by this behavior.&lt;/p>
&lt;p>OwnerReferences are used to manage PVC deletion. All such references used for
this feature will set the controller field to the StatefulSet or Pod as
appropriate. This will be used to distinguish references added by the controller
from, for example, user-created owner references. When ownerRefs are removed, it
is understood that only those ownerRefs whose controller field matches the
StatefulSet or Pod in question are affected.&lt;/p>
&lt;p>The controller flag will be set for these references. If there is already a
different (non-StatefulSet) controller set for a PVC, an ownerRef will not be
added. This will mean that the autodelete functionality will not be operative. An
event will be created to reflect this.&lt;/p>
&lt;p>To summarize,&lt;/p>
&lt;p>&lt;strong>If the StatefulSet is a controller owner&lt;/strong>,&lt;/p>
&lt;ul>
&lt;li>the PVC lifecycle will be fully managed by the StatefulSet controller&lt;/li>
&lt;li>old owner references will be updated from &lt;code>controller=false&lt;/code> to &lt;code>controller=true&lt;/code>
(see Upgrade / Downgrade Strategy, below).&lt;/li>
&lt;li>the StatefulSet will be removed as the owner and controller when the retain policy is specified in the StatefulSet.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>If someone else is the controller&lt;/strong>,&lt;/p>
&lt;ul>
&lt;li>the PVC lifecycle will not be touched by the StatefulSet controller. The PVC
will stay when the delete policy is specified in the StatefulSet.&lt;/li>
&lt;li>old StatefulSet owner references will be removed.&lt;/li>
&lt;/ul>
&lt;h3 id="volume-delete-policy-for-the-statefulset-created-pvcs">Volume delete policy for the StatefulSet created PVCs&lt;/h3>
&lt;p>A new field named &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code> of the type
&lt;code>StatefulSetPersistentVolumeClaimRetentionPolicy&lt;/code> will be added to the StatefulSet. It
specifies the circumstances under which the associated PVCs
are automatically deleted, as described above. The default policy
is to retain PVCs in all cases.&lt;/p>
&lt;p>The &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code> object will be mutable. The deletion
mechanism will be based on reconciliation, so as long as the field is changed
far from StatefulSet deletion or scale-down, the policy will work as
expected. Mutability does introduce race conditions if it is changed while a
StatefulSet is being deleted or scaled down and may result in PVCs not being
deleted as expected when the policy is being changed from &lt;code>Retain&lt;/code>, and PVCs
being deleted unexpectedly when the policy is being changed to &lt;code>Retain&lt;/code>. PVCs
will be reconciled before a scale-down or deletion to reduce this race as much
as possible, although it will still occur. The former case can be mitigated by
manually deleting PVCs. The latter case will result in lost data, but only in
PVCs that were originally marked for deletion. Life does not always
have an undo button.&lt;/p>
&lt;h4 id="whenscaled-policy-of-delete">&lt;code>whenScaled&lt;/code> policy of &lt;code>Delete&lt;/code>.&lt;/h4>
&lt;p>If &lt;code>persistentVolumeClaimRetentionPolicy.whenScaled&lt;/code> is set to &lt;code>Delete&lt;/code>, the Pod will be
set as the owner of the PVCs created from the &lt;code>VolumeClaimTemplates&lt;/code> just before
the scale-down is performed by the StatefulSet controller. When a Pod is
deleted, the PVC owned by the Pod is also deleted.&lt;/p>
&lt;p>The current StatefulSet controller implementation ensures that the manually deleted pods
are restored before the scale-down logic is run. This combined with the fact that the
owner references are set only before the scale-down will ensure that manual deletions do
not automatically delete the PVCs in question.&lt;/p>
&lt;p>During scale-up, if a PVC has an OwnerRef that does not match the Pod, it indicates that
the PVC was referred to by the deleted Pod and is in the process of getting
deleted. The controller will skip the reconcile loop until PVC deletion finishes, avoiding
a race condition.&lt;/p>
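&lt;p>As a rough sketch of the state described above, the metadata of a PVC just before its Pod is removed by a scale-down might look like the following. The names &lt;code>data-web-2&lt;/code> and &lt;code>web-2&lt;/code> and the UID are hypothetical placeholders; the exact set of fields written by the controller is an assumption based on the description in this section.&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-2                 # hypothetical PVC for pod ordinal 2
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: web-2                    # the Pod being condemned by the scale-down
    uid: 00000000-0000-0000-0000-000000000000   # placeholder; must match the Pod UID
    controller: true               # marks the reference as added by the controller
&lt;/code>&lt;/pre>
&lt;p>Because the reference carries the UID of a specific Pod, a replacement Pod created later under the same name does not match it, which is how the controller can tell that the PVC belongs to a deleted Pod.&lt;/p>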
&lt;h4 id="whendeleted-policy-of-delete">&lt;code>whenDeleted&lt;/code> policy of &lt;code>Delete&lt;/code>.&lt;/h4>
&lt;p>When &lt;code>persistentVolumeClaimRetentionPolicy.whenDeleted&lt;/code> is set to &lt;code>Delete&lt;/code>, an owner
reference pointing to the StatefulSet is added to each VolumeClaimTemplate PVC
when it is created. When a scale-up or scale-down occurs, the PVC is
unchanged. PVCs in use before a scale-down will be used again when the
scale-up occurs.&lt;/p>
&lt;p>In the existing StatefulSet reconcile loop, the associated VolumeClaimTemplate
PVCs will be checked to see if the ownerRef is correct according to the
&lt;code>persistentVolumeClaimRetentionPolicy&lt;/code> and updated accordingly. This includes PVCs
that have been manually provisioned. It will be most consistent and easy
to reason about if all VolumeClaimTemplate PVCs are treated uniformly rather
than trying to guess at their provenance.&lt;/p>
&lt;p>When the StatefulSet is deleted, these PVCs will also be deleted, but only after
the Pod gets deleted. Since the Pod StatefulSet ownership has
&lt;code>blockOwnerDeletion&lt;/code> set to &lt;code>true&lt;/code>, pods will get deleted before the StatefulSet
is deleted. The &lt;code>blockOwnerDeletion&lt;/code> for PVCs will be set to &lt;code>false&lt;/code> which
ensures that PVC deletion happens only after the StatefulSet is deleted. This is
necessary because of PVC protection which does not allow PVC deletion until all
pods referencing it are deleted.&lt;/p>
&lt;p>The deletion policies may be combined in order to get the delete behavior both
on set deletion as well as scale-down.&lt;/p>
&lt;h4 id="non-cascading-deletion">Non-Cascading Deletion&lt;/h4>
&lt;p>When a StatefulSet is deleted without cascading, e.g. with &lt;code>kubectl delete --cascade=false&lt;/code>, the
existing behavior is retained and no PVC will be deleted. Only the StatefulSet resource
will be affected.&lt;/p>
&lt;h4 id="mutating-persistentvolumeclaimretentionpolicy">Mutating &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code>&lt;/h4>
&lt;p>Recall that as defined above, the PVCs associated with a StatefulSet are found
by the StatefulSet volumeClaimTemplate static naming scheme. The Pods associated
with the StatefulSet can be found by their controllerRef.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>From a deletion policy to &lt;code>Retain&lt;/code>&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>When mutating any delete policy to retain, the PVC ownerRefs to the
StatefulSet are removed. If a scale-down is in progress, each remaining PVC
ownerRef to its pod is removed, by matching the index of the PVC to the Pod
index.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>From &lt;code>Retain&lt;/code> to a deletion policy&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>When mutating from the &lt;code>Retain&lt;/code> policy to a deletion policy, the StatefulSet
PVCs are updated with an ownerRef to the StatefulSet. If a scale-down is in
process, remaining PVCs are given an ownerRef to their Pod (by index, as above).&lt;/p>
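&lt;p>The two transitions above amount to recomputing, for each PVC, which owner it should carry under the current policy. The following Python sketch shows one way to express that. The function name and the tuple representation are hypothetical, and the precedence of the Pod owner while its Pod is being scaled down is an assumption of the sketch, not a statement about the controller.&lt;/p>

```python
def desired_owner(when_deleted, when_scaled, statefulset, pod, pod_being_removed):
    """Return the (kind, name) owner a PVC should carry, or None for no owner.

    when_deleted / when_scaled are the policy values "Retain" or "Delete".
    pod_being_removed is True when a scale-down is removing this PVC's Pod.
    """
    if when_scaled == "Delete" and pod_being_removed:
        # Assumption: during a scale-down the Pod owner takes precedence.
        return ("Pod", pod)
    if when_deleted == "Delete":
        return ("StatefulSet", statefulset)
    # Both policies are Retain, or no scale-down affects this PVC.
    return None


if __name__ == "__main__":
    # Mutating whenDeleted from Delete to Retain drops the StatefulSet owner.
    print(desired_owner("Delete", "Retain", "web", "web-2", False))
    print(desired_owner("Retain", "Retain", "web", "web-2", False))
```

&lt;p>Reconciling toward a computed target, rather than reacting to the policy change itself, keeps the result the same regardless of how the PVC was provisioned.&lt;/p>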
&lt;h3 id="cluster-role-change-for-statefulset-controller">Cluster role change for statefulset controller&lt;/h3>
&lt;p>In order to update the PVC ownerReference, &lt;code>buildControllerRoles&lt;/code> will be updated with the
&lt;code>patch&lt;/code> verb on the PVC resource.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h4 id="unit-tests">Unit tests&lt;/h4>
&lt;p>From &lt;a href="https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&amp;amp;include-filter-by-regex=statefulset"
target="_blank" rel="noopener">https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&amp;include-filter-by-regex=statefulset&lt;/a>
&lt;/p>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/controller/statefulset&lt;/code>: &lt;code>2024-10-07&lt;/code>: &lt;code>86.5%&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/registry/apps/statefulset&lt;/code>: &lt;code>2022-10-07&lt;/code>: &lt;code>62.7%&lt;/code>&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/registry/apps/statefulset/storage&lt;/code>: &lt;code>2022-10-07&lt;/code>: &lt;code>64%&lt;/code>&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>&lt;code>test/integration/statefulset&lt;/code>: &lt;code>2024-10-07&lt;/code>: &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?job=ci-kubernetes-integration&amp;amp;test=TestAutodeleteOwnerRefs"
target="_blank" rel="noopener">No failures&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>Added &lt;code>TestAutodeleteOwnerRefs&lt;/code> to &lt;code>k8s.io/kubernetes/test/integration/statefulset&lt;/code>.&lt;/p>
&lt;h5 id="e2e-tests">E2E tests&lt;/h5>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/google-gce#gci-gce-statefulset"
target="_blank" rel="noopener">gci-gce-statefulset&lt;/a>
: &lt;code>2024-10-07&lt;/code>: &lt;code>0 Failures&lt;/code>&lt;/li>
&lt;li>&lt;a href="https://storage.googleapis.com/k8s-triage/index.html?test=.*StatefulSetPersistentVolumeClaimPolicy.*"
target="_blank" rel="noopener">triage&lt;/a>
: &lt;code>2024-10-09&lt;/code>:
&lt;ul>
&lt;li>Flaky failures in &lt;code>ci-kubernetes-kind-e2e-parallel&lt;/code>, &lt;code>ci-kubernetes-kind-e2e-parallel-1-30&lt;/code> and
&lt;code>ci-kubernetes-kind-ipv6-e2e-parallel-1-31&lt;/code>. These jobs also had failures in many other tests, so this appears to
be a general infrastructure flake.&lt;/li>
&lt;li>&lt;code> [sig-apps] StatefulSet Non-retain StatefulSetPersistentVolumeClaimPolicy should delete PVCs after adopting pod (WhenScaled)&lt;/code> seems to have real flakes which will be investigated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Added &lt;code>Feature:StatefulSetAutoDeletePVC&lt;/code> tests to &lt;code>k8s.io/kubernetes/test/e2e/apps/&lt;/code>.&lt;/p>
&lt;h5 id="upgradedowngrade--feature-enableddisable-tests">Upgrade/downgrade &amp;amp; feature enabled/disable tests&lt;/h5>
&lt;p>The following scenarios were manually tested.&lt;/p>
&lt;pre>&lt;code>1. Create statefulset in previous version and upgrade to the version
supporting this feature. The PVCs should remain intact.
2. Downgrade to an earlier version and check that the PVCs with Retain
remain intact and that the others, with policies set before upgrade,
get deleted depending on whether the references were already set.
&lt;/code>&lt;/pre>
&lt;p>Since &lt;code>rancher.io/local-path&lt;/code> now provides a default storage class, StatefulSets
can be tested with kind with the following procedure.&lt;/p>
&lt;ul>
&lt;li>Create a &lt;a href="https://kind.sigs.k8s.io/"
target="_blank" rel="noopener">kind&lt;/a>
cluster with the following &lt;code>config.yaml&lt;/code>.
&lt;pre tabindex="0">&lt;code>apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
featureGates:
  StatefulSetAutoDeletePVC: false
nodes:
- role: control-plane
  image: kindest/node:v1.31.0
&lt;/code>&lt;/pre>This is done with &lt;code>kind create cluster --config config.yaml&lt;/code>.&lt;/li>
&lt;li>The configuration adds the feature gate to all control plane services. In a
kind cluster, these are stored in the &lt;code>/etc/kubernetes/manifests&lt;/code> directory of
the kind docker container serving as the control plane node. The manifests are
reconciled to the control plane, so the cluster can be upgraded or downgraded
from the StatefulSet retention policy feature with a bash script like the
following.
&lt;pre tabindex="0">&lt;code>for c in kube-apiserver kube-controller-manager kube-scheduler; do
  docker exec kind-control-plane \
    sed -i -r &amp;#34;s|(StatefulSetAutoDeletePVC)=false|\1=true|&amp;#34; \
    /etc/kubernetes/manifests/$c.yaml
  echo $c updated
done
&lt;/code>&lt;/pre>To downgrade, swap &lt;code>false&lt;/code> and &lt;code>true&lt;/code> in the above. Note that the kind control
plane will be unreachable for a minute or so while the reconciliation occurs.&lt;/li>
&lt;/ul>
&lt;p>For the upgrade scenario, a StatefulSet was created in a cluster with the
feature gate disabled. The feature gate was enabled, the StatefulSet was scaled
down or deleted, and it was confirmed that no PVCs were deleted.&lt;/p>
&lt;p>In the downgrade scenario, four StatefulSets were created with all possibilities
of WhenScaled and WhenDeleted policies. After the downgrade, it was confirmed that
(1) no PVCs are deleted when the StatefulSet is scaled down, and (2) PVCs are
deleted when the WhenDeleted policy is Delete, and the StatefulSet is deleted.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha-release">Alpha release&lt;/h4>
&lt;ul>
&lt;li>(Done) Complete adding the items in the &amp;lsquo;Changes required&amp;rsquo; section.&lt;/li>
&lt;li>(Done) Add unit, functional, upgrade and downgrade tests to the automated k8s tests.&lt;/li>
&lt;/ul>
&lt;h4 id="beta-release">Beta release&lt;/h4>
&lt;ul>
&lt;li>(Done) Enable feature gate for e2e pipelines&lt;/li>
&lt;/ul>
&lt;h4 id="ga-release">GA release&lt;/h4>
&lt;ul>
&lt;li>(Done) Validate with customer workloads. There has been no customer feedback
aside from some unrelated issues on GKE which showed that customers were using
delete strategies, and analysis of owner references on PVCs that motivated
&lt;a href="https://github.com/kubernetes/kubernetes/issues/122400"
target="_blank" rel="noopener">#122400&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>This feature adds a new field to the StatefulSet. The default value for the new field
maintains the existing behavior of StatefulSets.&lt;/p>
&lt;p>On a downgrade, the &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code> field will be hidden on
any StatefulSets. The behavior in this case will be identical to mutating the
policy field to &lt;code>Retain&lt;/code>, as described above, including the edge cases
introduced if this is done during a scale-down or StatefulSet deletion.&lt;/p>
&lt;p>The initial beta version did not set the controller flag in the owner
reference. This was fixed in later versions, so that the controller flag is
set. The behavior is then as follows.&lt;/p>
&lt;ul>
&lt;li>If the StatefulSet or one of its Pods is a controller, owner references are
updated as specified by the retention policy.&lt;/li>
&lt;li>If the StatefulSet or one of its Pods is an owner but not a controller,
&lt;ul>
&lt;li>if there is no other controller, the StatefulSet or Pod owners will be set
as controller (i.e., they are assumed to be from the initial beta version and
updated).&lt;/li>
&lt;li>The controller will be updated as specified by the retention policy.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>If there is another resource that is a controller,
&lt;ul>
&lt;li>any (non-controller) StatefulSet or Pod owner reference is removed.&lt;/li>
&lt;li>The retention policy is ignored.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
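&lt;p>The rules above form a small decision procedure over a PVC's owner references. The Python sketch below is one reading of them: the function name, the dictionary shape and the returned action labels are all hypothetical, names stand in for the UIDs a real controller would compare, and treating a PVC with no owner at all as the ordinary case is an assumption.&lt;/p>

```python
def classify_claim(owner_refs, managed_names):
    """Return the action the rules above suggest for one PVC.

    owner_refs: list of dicts with "name" and "controller" keys.
    managed_names: names of the StatefulSet and its Pods.
    """
    ours = [ref for ref in owner_refs if ref["name"] in managed_names]
    foreign_controller = any(
        ref["controller"] and ref["name"] not in managed_names
        for ref in owner_refs
    )
    if foreign_controller:
        # Another resource controls the PVC: drop our refs, ignore the policy.
        return "remove-our-refs-and-ignore-policy"
    if any(ref["controller"] for ref in ours):
        return "apply-policy"
    if ours:
        # Owner without the controller flag: assumed to be from the initial beta.
        return "promote-to-controller-then-apply-policy"
    # Assumption: a PVC without any owner is handled like the ordinary case.
    return "apply-policy"


if __name__ == "__main__":
    managed = {"web", "web-0", "web-1"}
    print(classify_claim([{"name": "web", "controller": False}], managed))
```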
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>There are only apiserver and kube-controller-manager changes involved. Node
components are not involved so there is no version skew between nodes and the
control plane. Since the api changes are backwards compatible, as long as the
apiserver version which originally added the new StatefulSet fields is rolled
out before the kube-controller-manager, behavior will be correct. Since the alpha
API has been out since 1.23 and there have been no incompatible changes to the
API, the order of any modern apiserver &amp;amp; kube-controller-manager rollout should
not matter anyway.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h5 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h5>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: StatefulSetAutoDeletePVC&lt;/li>
&lt;li>Components depending on the feature gate
&lt;ul>
&lt;li>kube-controller-manager, which orchestrates the volume deletion.&lt;/li>
&lt;li>kube-apiserver, to manage the new policy field in the StatefulSet
resource (eg dropDisabledFields).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h5 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h5>
&lt;p>No. What happens during StatefulSet deletion differs from current behavior
only when the user explicitly specifies the
&lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code>. Hence there is no change in user-visible
behavior by default.&lt;/p>
&lt;h5 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h5>
&lt;p>Yes. Disabling the feature gate will cause the new field to be ignored. If the feature
gate is re-enabled, the new behavior will start working.&lt;/p>
&lt;p>When the &lt;code>PersistentVolumeClaimRetentionPolicy&lt;/code> has &lt;code>WhenDeleted&lt;/code> set to
&lt;code>Delete&lt;/code> at the time the feature is disabled, the VolumeClaimTemplate PVC ownerRefs must be removed.&lt;/p>
&lt;p>There are new corner cases here. For example, if a StatefulSet deletion is in
process when the feature is disabled or enabled, the appropriate ownerRefs
will not have been added and PVCs may not be deleted. The exact behavior will
be discovered during feature testing. In any case the mitigation will be to
manually delete any PVCs.&lt;/p>
&lt;h5 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h5>
&lt;p>In the simple case of reenabling the feature without concurrent StatefulSet
deletion or scale-down, nothing needs to be done when the deletion policy has
&lt;code>whenScaled&lt;/code> set to &lt;code>Delete&lt;/code>. When the policy has &lt;code>whenDeleted&lt;/code> set to &lt;code>Delete&lt;/code>, the
VolumeClaimTemplate PVC ownerRefs must be set to the StatefulSet.&lt;/p>
&lt;p>As above, if there is a concurrent scale-down or StatefulSet deletion, more
care needs to be taken. This will be detailed further during feature testing.&lt;/p>
&lt;h5 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h5>
&lt;p>Feature enablement and disablement tests will be added, including for
StatefulSet behavior during transitions in conjunction with scale-down or
deletion.&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;h5 id="how-can-a-rollout-fail-can-it-impact-already-running-workloads">How can a rollout fail? Can it impact already running workloads?&lt;/h5>
&lt;p>If there is a control plane update which disables the feature while a stateful
set is in the process of being deleted or scaled down, it is undefined which
PVCs will be deleted. Before the update, PVCs will be marked for deletion;
until the updated controller has a chance to reconcile, some PVCs may be
garbage collected before their owner references can be removed.
We do not think this is a true failure, as it should be clear to
an operator that there is an essential race condition when a cluster update
happens during a stateful set scale down or delete.&lt;/p>
&lt;h5 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h5>
&lt;p>The operator can monitor &lt;code>kube_persistent_volume_*&lt;/code> metrics from
kube-state-metrics to watch for large numbers of undeleted
PersistentVolumes. If consistent behavior is required, the operator can wait
for those metrics to stabilize.&lt;/p>
&lt;h5 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h5>
&lt;p>Yes. The race condition wasn&amp;rsquo;t exposed, but we confirmed the PVCs were updated correctly.&lt;/p>
&lt;h5 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h5>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;p>Metrics are provided by &lt;code>kube-state-metrics&lt;/code> unless otherwise noted.&lt;/p>
&lt;h5 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h5>
&lt;p>&lt;code>kube_statefulset_persistent_volume_claim_retention_policy&lt;/code> will have nonzero
counts for the &lt;code>delete&lt;/code> policy fields.&lt;/p>
&lt;h5 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h5>
&lt;ul>
&lt;li>Metric name: &lt;code>kube_statefulset_status_replicas_current&lt;/code> should be near
&lt;code>kube_statefulset_status_replicas_ready&lt;/code>.
&lt;ul>
&lt;li>[Optional] Aggregation method: &lt;code>gauge&lt;/code>&lt;/li>
&lt;li>Components exposing the metric: &lt;code>kube-state-metrics&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h5 id="what-are-the-reasonable-slos-service-level-objectives-for-the-above-slis">What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/h5>
&lt;p>&lt;code>kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current&lt;/code> should be near 1.0, although as
unhealthy replicas are often an application error rather than a problem with
the stateful set controller, this will need to be tuned by an operator on a
per-cluster basis.&lt;/p>
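&lt;p>As a worked example of the ratio, the Python sketch below computes the share of current replicas that are ready and compares it with a threshold. The function names and the 0.95 default are hypothetical; as noted above, the threshold is something an operator would tune per cluster.&lt;/p>

```python
def ready_ratio(ready, current):
    """Fraction of current replicas that are ready.

    An empty StatefulSet (no current replicas) is treated as healthy.
    """
    if current == 0:
        return 1.0
    return ready / current


def meets_slo(ready, current, threshold=0.95):
    """True when the ready ratio is at or above the chosen threshold."""
    return ready_ratio(ready, current) >= threshold


if __name__ == "__main__":
    # Three of four replicas ready gives a ratio of 0.75.
    print(ready_ratio(3, 4), meets_slo(3, 4))
```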
&lt;h5 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h5>
&lt;p>kube-state-metrics has filled a gap in the traditional lack of metrics from
core Kubernetes controllers.&lt;/p>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;h5 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h5>
&lt;p>No, outside of depending on the scheduler, the garbage collector and volume
management (provisioning, attaching, etc) as does almost anything in
Kubernetes. This feature does not add any new dependencies that did not
already exist with the stateful set controller.&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;h5 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h5>
&lt;p>Yes and no. This feature will result in additional resource deletion calls, which will
scale like the number of pods in the stateful set (i.e., one PVC per pod and possibly one
PV per PVC depending on the reclaim policy). There will not be additional watches,
because the existing pod watches will be used. There will be additional
patches to set PVC ownerRefs, scaling like the number of pods in the StatefulSet.&lt;/p>
&lt;p>However, anyone who uses this feature would have made those resource deletions
anyway: those PVs cost money. Aside from the additional patches for ownerRefs,
there shouldn&amp;rsquo;t be much overall increase beyond the second-order effect of
this feature allowing more automation.&lt;/p>
&lt;h5 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h5>
&lt;p>No.&lt;/p>
&lt;h5 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h5>
&lt;p>PVC deletion may cause PV deletion, depending on reclaim policy, which will result in
cloud provider calls through the volume API. However, as noted above, these calls would
have been happening anyway, manually.&lt;/p>
&lt;h5 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h5>
&lt;ul>
&lt;li>PVC, new ownerRef; ~64 bytes&lt;/li>
&lt;li>StatefulSet, new field; ~8 bytes (holds a string enumeration, either &amp;ldquo;Delete&amp;rdquo; or &amp;ldquo;Retain&amp;rdquo;)&lt;/li>
&lt;/ul>
&lt;h5 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h5>
&lt;p>No. (There are currently no StatefulSet SLOs?)&lt;/p>
&lt;p>Note that scale-up may be slower when volumes were deleted by scale-down. This
is by design of the feature.&lt;/p>
&lt;h5 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h5>
&lt;p>No.&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;h5 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h5>
&lt;p>PVC deletion will be paused. If the control plane went unavailable in the middle
of a stateful set being deleted or scaled down, there may be deleted Pods whose
PVCs have not yet been deleted. Deletion will continue normally after the
control plane returns.&lt;/p>
&lt;h5 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h5>
&lt;ul>
&lt;li>PVCs from a stateful set not being deleted as expected.
&lt;ul>
&lt;li>Detection: This can be detected by higher than expected counts of
&lt;code>kube_persistentvolumeclaim_status_phase{phase=&amp;#34;Bound&amp;#34;}&lt;/code>, lower than
expected counts of &lt;code>kube_persistentvolume_status_phase{phase=&amp;#34;Released&amp;#34;}&lt;/code>,
and by an operator listing and examining PVCs.&lt;/li>
&lt;li>Mitigations: We expect this to happen only if there are other,
operator-installed, controllers that are also managing owner refs on
PVCs. Any such PVCs can be deleted manually. The conflicting controllers
will have to be manually discovered.&lt;/li>
&lt;li>Diagnostics: Logs from kube-controller-manager and stateful set controller.&lt;/li>
&lt;li>Testing: Tests are in place for confirming owner refs are added by the
&lt;code>StatefulSet&lt;/code> controller, but Kubernetes does not test against external
custom controllers.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h5 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h5>
&lt;p>Stateful set SLOs are new with this feature and are in the process of being
evaluated. If they are not being met, the kube-controller-manager (where the
stateful set controller lives) should be examined and/or restarted.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>1.21, KEP created.&lt;/li>
&lt;li>1.23, alpha implementation.&lt;/li>
&lt;li>1.27, graduation to beta.&lt;/li>
&lt;li>1.31, fix controller references.&lt;/li>
&lt;li>1.32, graduation to GA.&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>The StatefulSet field update is required.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>Users can delete the PVC manually. The friction associated with that is the motivation of
the KEP.&lt;/p></description></item><item><title>Resources: Auto-refreshing official CVE feed</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3203/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3203/</guid><description>
&lt;h1 id="kep-3203-auto-refreshing-official-cve-feed">KEP-3203: Auto-Refreshing Official CVE Feed&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3"
>Story 3&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#story-4"
>Story 4&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#pre-requisites"
>Pre-requisites&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#overview"
>Overview&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#json-blob-construction-will-fail"
>JSON blob construction will fail&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#misuse-of-auto-refresh-feature"
>Misuse of Auto-Refresh feature&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#large-json-blob-could-lead-to-slower-readwrite-and-resource-consumption"
>Large JSON blob could lead to slower read/write and resource consumption&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#storage-of-cve-feed-blob"
>Storage of CVE feed blob&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#1-only-use-google-cloud-bucket"
>1. &lt;strong>Only use Google Cloud Bucket&lt;/strong>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#2-only-use-git-repository"
>2. &lt;strong>Only use Git Repository&lt;/strong>&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>
.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir
in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and
SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements
for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit
by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for
publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to
mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Currently it is not possible to filter for issues or PRs that are related to
CVEs announced by Kubernetes. This KEP addresses this concern by using automation
to label these issues or PRs with the new label &lt;strong>official-cve-feed&lt;/strong>. The
in-scope issues are closed issues that have a CVE ID and were officially
announced as Kubernetes CVEs by the Security Response Committee (SRC) in the past.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>With the growing number of eyes on Kubernetes, the number of CVEs related to
Kubernetes has increased. Although most CVEs that directly, indirectly, or
transitively impact Kubernetes are regularly fixed, there is no single place
for the end users of Kubernetes to programmatically subscribe to or pull the data
of fixed CVEs. Current options are either
&lt;a href="https://github.com/kubernetes/sig-security/issues/1"
target="_blank" rel="noopener">broken or incomplete&lt;/a>
.&lt;/p>
&lt;p>An auto-refreshing CVE feed will allow end users to programmatically fetch the
list of CVEs and get the latest information from the Kubernetes
community.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Create a periodically auto-refreshing, machine-readable list of official
Kubernetes CVEs.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Triage and vulnerability disclosure: This will continue to be done by the SRC.&lt;/li>
&lt;li>Listing CVEs that are identified in build-time dependencies and container
images. Only official CVEs announced by the Kubernetes SRC will be published
in the feed.&lt;/li>
&lt;li>Integration with &lt;a href="https://github.com/CVEProject/cvelist"
target="_blank" rel="noopener">CVEProject&lt;/a>
may
happen at a future stage but currently is not planned or scoped.&lt;/li>
&lt;/ul>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a K8s end user, I want a list of CVEs with relevant information that I can
fetch programmatically, so I can track when new CVEs are announced.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>As a K8s end user, I want to use my browser to get a list of fixed CVEs from
the official K8s website, so that I can trust it as an authoritative source of
data through the implicit trust offered by the website certificate and domain name.&lt;/p>
&lt;h4 id="story-3">Story 3&lt;/h4>
&lt;p>As a K8s maintainer, I want to create a process that auto-updates the CVE feed when
the SRC announces new CVEs, so that I do not have to do extra work to maintain this
feed manually.&lt;/p>
&lt;h4 id="story-4">Story 4&lt;/h4>
&lt;p>As a K8s platform provider, I want to automatically know if my Kubernetes
clusters are vulnerable to any of the CVEs the SRC has announced. I want to have a
programmatically available API to parse this kind of data so I can easily
provide it to users of my platform.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="pre-requisites">Pre-requisites&lt;/h3>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Add official-cve-label &lt;a href="https://github.com/kubernetes/test-infra/pull/23428"
target="_blank" rel="noopener">https://github.com/kubernetes/test-infra/pull/23428&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Search for and identify closed issues that have a CVE ID (e.g. CVE-1001-12345)
in the issue description or summary (this
search &lt;a href="https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue&amp;#43;in%3Abody&amp;#43;%22CVSS%3A3.%22&amp;#43;label%3Acommittee%2Fsecurity-response&amp;#43;is%3Aclosed&amp;#43;"
target="_blank" rel="noopener">filter&lt;/a>
is giving the most accurate data so far)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Label those issues with &lt;code>official-cve-feed&lt;/code>
using &lt;a href="https://docs.github.com/en/rest/reference/issues"
target="_blank" rel="noopener">https://docs.github.com/en/rest/reference/issues&lt;/a>
REST API&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Add &lt;code>official-cve-feed&lt;/code> label as part of SRC
playbook: &lt;a href="https://github.com/kubernetes/committee-security-response/pull/133"
target="_blank" rel="noopener">https://github.com/kubernetes/committee-security-response/pull/133&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="overview">Overview&lt;/h3>
&lt;ul>
&lt;li>Generate a JSON blob from the issues in the &lt;code>k/k&lt;/code>
repo that carry the &lt;code>official-cve-feed&lt;/code> label.&lt;/li>
&lt;li>Create a Prow job to periodically generate this JSON blob.&lt;/li>
&lt;li>Push this JSON blob when needed (e.g. when a new CVE is announced) to a
Google Cloud Storage (GCS) bucket.&lt;/li>
&lt;li>Using Hugo and other tooling (such as Netlify), publish the list from this
JSON blob on the official Kubernetes website during the &lt;code>k/website&lt;/code> build.&lt;/li>
&lt;li>Generate an RSS feed (atom format) with Hugo templates using the generated
JSON blob.&lt;/li>
&lt;/ul>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="json-blob-construction-will-fail">JSON blob construction will fail&lt;/h4>
&lt;p>If the generation of the JSON blob listing known CVEs fails, downstream
jobs will also fail. A blob construction failure will alert the owners of
this feature, who will take action as needed. If the failure cannot be fixed
in a reasonable amount of time, the CVE feed will be stale until it is fixed. In
case of an urgent need from the community to update the vulnerabilities feed,
the JSON blob will be updated manually using the &lt;code>gsutil&lt;/code> command.&lt;/p>
&lt;h4 id="misuse-of-auto-refresh-feature">Misuse of Auto-Refresh feature&lt;/h4>
&lt;p>Without proper filtering and control over who can label GitHub issues, the list
of CVEs can end up with a poor signal-to-noise ratio, making it
unusable.&lt;/p>
&lt;p>For this purpose, filtering is applied so that only issues that are marked
as &lt;code>closed&lt;/code>
will be part of the list. Additionally, the &lt;code>official-cve-feed&lt;/code> label is a
&lt;a href="https://github.com/kubernetes/test-infra/blob/master/config/prow/plugins.yaml#L140-L150"
target="_blank" rel="noopener">restricted&lt;/a>
label that can only be applied by SRC and SIG Security Tooling Leads.&lt;/p>
&lt;h4 id="large-json-blob-could-lead-to-slower-readwrite-and-resource-consumption">Large JSON blob could lead to slower read/write and resource consumption&lt;/h4>
&lt;p>The blob will only be rewritten if the generated blob is different from the existing
blob. A hash file is created and stored alongside the generated blob. Before every
push, the stored hash is compared with the hash of the newly generated file.
If the hashes match, writing to the bucket is skipped; if they differ, writing to
the bucket is triggered.&lt;/p>
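&lt;p>The hash check described above could look roughly like the following Python sketch. It is an illustration only: the KEP does not name a hash algorithm, so the use of SHA-256 is an assumption, and the names &lt;code>blob_digest&lt;/code> and &lt;code>needs_upload&lt;/code> are hypothetical.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">import hashlib


def blob_digest(blob: bytes) -> str:
    """Return the SHA-256 hex digest of a feed blob (algorithm is an assumption)."""
    return hashlib.sha256(blob).hexdigest()


def needs_upload(generated_blob: bytes, stored_digest) -> bool:
    """Decide whether the bucket must be rewritten.

    stored_digest is the content of the hash file kept next to the blob,
    or None when no hash file exists yet (for example on the first run).
    """
    if stored_digest is None:
        return True
    return blob_digest(generated_blob) != stored_digest.strip()


# Placeholder blobs; the CVE ID is made up for illustration.
old_blob = b'{"items": []}'
new_blob = b'{"items": [{"id": "CVE-0000-00001"}]}'
stored = blob_digest(old_blob)
print(needs_upload(old_blob, stored))  # unchanged blob, so the write is skipped
print(needs_upload(new_blob, stored))  # changed blob, so the write is triggered
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Comparing digests rather than whole blobs means the job only has to read a small hash file from the bucket before deciding whether to write.&lt;/p>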
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>Implementing this design involves a Prow job that:&lt;/p>
&lt;ol>
&lt;li>Queries the GitHub API for fixed official CVEs&lt;/li>
&lt;li>Generates a JSON blob based on the query results&lt;/li>
&lt;li>Writes the JSON blob to the GCS bucket if it differs from the existing blob&lt;/li>
&lt;li>Triggers the &lt;code>k/website&lt;/code> build using a Netlify
&lt;a href="https://docs.netlify.com/configure-builds/build-hooks/"
target="_blank" rel="noopener">build-hook&lt;/a>
. The secret
token to trigger the build is added as an External Secret. See the example for
&lt;a href="https://github.com/kubernetes/k8s.io/pull/2222/files"
target="_blank" rel="noopener">snyk-token&lt;/a>
&lt;/li>
&lt;li>The &lt;code>k/website&lt;/code> build pulls the JSON blob from the GCS bucket during the website rebuild
and publishes it at a location such as
&lt;code>https://kubernetes.io/security/official-cve-feed.json&lt;/code>&lt;/li>
&lt;li>&lt;code>k/website&lt;/code> renders the JSON blob as an HTML table for viewing the list of
fixed CVEs from a browser at this location:
&lt;code>https://kubernetes.io/docs/reference/issues-security/official-cve-feed&lt;/code>
and linked from this
page: &lt;code>https://kubernetes.io/docs/reference/issues-security/security/&lt;/code>&lt;/li>
&lt;/ol>
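&lt;p>Steps 1 and 2 above can be sketched in Python as follows. This is a minimal illustration, not the actual job: the issue dictionaries are a simplified, hypothetical stand-in for GitHub API results, the output field names (&lt;code>id&lt;/code>, &lt;code>summary&lt;/code>, &lt;code>url&lt;/code>) are assumptions, and for brevity the CVE ID is searched for only in the issue title.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">import json
import re

# Assumed pattern for CVE identifiers (CVE-YYYY-NNNN with four or more digits).
CVE_ID = re.compile(r'CVE-\d{4}-\d{4,}')
FEED_LABEL = 'official-cve-feed'


def build_feed(issues):
    """Return the feed as a JSON string built from simplified issue dicts.

    Each dict is a hypothetical stand-in for a GitHub issue with the keys
    'title', 'state', 'labels' (a list of label names) and 'url'.
    """
    items = []
    for issue in issues:
        # Only closed issues carrying the restricted label are in scope.
        if issue['state'] != 'closed' or FEED_LABEL not in issue['labels']:
            continue
        match = CVE_ID.search(issue['title'])
        if match is None:
            continue
        items.append({
            'id': match.group(0),
            'summary': issue['title'],
            'url': issue['url'],
        })
    # Sorted keys keep the output identical for identical input.
    return json.dumps({'items': items}, indent=2, sort_keys=True)


# Placeholder data; the CVE IDs and URLs are made up for illustration.
sample_issues = [
    {'title': 'CVE-0000-00001: example issue', 'state': 'closed',
     'labels': [FEED_LABEL], 'url': 'https://example.com/issues/1'},
    {'title': 'CVE-0000-00002: still open', 'state': 'open',
     'labels': [FEED_LABEL], 'url': 'https://example.com/issues/2'},
    {'title': 'CVE-0000-00003: not labeled', 'state': 'closed',
     'labels': [], 'url': 'https://example.com/issues/3'},
]
print(build_feed(sample_issues))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Producing deterministic output for unchanged input matters here, because step 3 only writes to the bucket when the generated blob differs from the existing one.&lt;/p>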
&lt;p>&lt;em>Notes&lt;/em>:&lt;/p>
&lt;ul>
&lt;li>A GCS bucket needs to be created. Example PR for this looks
like &lt;a href="https://github.com/kubernetes/k8s.io/pull/2570/files"
target="_blank" rel="noopener">this&lt;/a>
&lt;/li>
&lt;li>Additional custom fields need to be added to make JSON feed compliant with
&lt;a href="https://validator.jsonfeed.org/"
target="_blank" rel="noopener">https://validator.jsonfeed.org/&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>This is a process KEP implemented using a periodic Prow job. It does not implement any functional use case of Kubernetes, so no e2e, unit, or integration tests are applicable. Going forward, the test plan will mostly consist of monitoring the Prow job for failures as and when needed.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>Feature implemented with working JSON feed and tabular list&lt;/li>
&lt;li>Initial e2e testing completed and alerting set up to detect failures&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Gather feedback from developers and end users&lt;/li>
&lt;li>Make JSON feed compliant with &lt;code>jsonfeed&lt;/code> spec&lt;/li>
&lt;li>Add &lt;code>RSS&lt;/code> feed for the CVE list&lt;/li>
&lt;li>Add fields that signal freshness of the data&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Not applicable&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>Not applicable&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;p>Not applicable as per this
&lt;a href="https://github.com/kubernetes/enhancements/pull/3204#issuecomment-1042367862"
target="_blank" rel="noopener">comment&lt;/a>
&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="storage-of-cve-feed-blob">Storage of CVE feed blob&lt;/h3>
&lt;p>There are two options to store the CVE feed JSON blob:&lt;/p>
&lt;h4 id="1-only-use-google-cloud-bucket">1. &lt;strong>Only use Google Cloud Bucket&lt;/strong>&lt;/h4>
&lt;p>A new Google Cloud bucket can be created, to which the CVE feed is written using
the &lt;code>gsutil&lt;/code> tool and from which it is read via a &lt;code>curl&lt;/code> call.&lt;/p>
&lt;ul>
&lt;li>Advantages:
&lt;ul>
&lt;li>Transparent updates to the JSON blob, where the Prow job run will be identical every time.&lt;/li>
&lt;li>Access control for writing to the bucket follows least privilege, i.e. it is managed
via a service account and Google group membership&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Disadvantages:
&lt;ul>
&lt;li>The CVE list has an unofficial-looking URL, which makes it hard for an
end user to judge its authenticity and provenance.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="2-only-use-git-repository">2. &lt;strong>Only use Git Repository&lt;/strong>&lt;/h4>
&lt;p>Store it as a version controlled artifact in one of the &lt;code>kubernetes&lt;/code>
GitHub Org repositories.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Advantages:&lt;/p>
&lt;ul>
&lt;li>When a Git repository, especially &lt;code>k/website&lt;/code>, hosts the JSON blob, the URL
would be &lt;code>k8s.io/static/security/official-cve-feed.json&lt;/code>,
which is much more recognizable, intuitive in terms of trust, TLS-enabled,
and unlikely to be spoofed.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Disadvantages:&lt;/p>
&lt;ul>
&lt;li>This might get delayed by PR review and approval process. However, this
can be prevented through use of &lt;code>skip-review&lt;/code> label.&lt;/li>
&lt;li>A fork of &lt;code>k/website&lt;/code> or
&lt;code>k/sig-security&lt;/code> would need to be maintained under a GitHub robot account, which adds overhead for the GitHub admins who manage the
robot accounts and their forks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>In summary, because both approaches have pros and cons, the finalized approach
combines the good parts of both alternatives by storing the blob in a Google
Cloud bucket but rendering it via the &lt;code>kubernetes/website&lt;/code> GitHub repository.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>A GCS bucket to store JSON blob is needed, with its corresponding service account.&lt;/p></description></item><item><title>Resources: Azure Availability Zones</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/586/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/586/</guid><description/></item><item><title>Resources: Backoff Limits Per Index For Indexed Jobs</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3850/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3850/</guid><description>
&lt;h1 id="kep-3850-backoff-limits-per-index-for-indexed-jobs">KEP-3850: Backoff Limits Per Index For Indexed Jobs&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-3"
>Story 3&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#performance-benchmark"
>Performance benchmark&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#the-job-object-too-big"
>The Job object too big&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#exponential-backoff-delay-issue"
>Exponential backoff delay issue&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#too-fast-job-status-updates"
>Too fast Job status updates&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#job-api"
>Job API&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#tracking-the-number-of-failures-per-index"
>Tracking the number of failures per index&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#failed-indexes-format"
>Failed indexes format&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#job-completion"
>Job completion&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#failindex-action"
>FailIndex action&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#exponential-backoff-delay-per-index"
>Exponential backoff delay per index&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#upgrade"
>Upgrade&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#downgrade"
>Downgrade&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#backofflimitperindex-inside-new-runpolicy"
>backoffLimitPerIndex inside new runPolicy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#mark-job-complete-if-some-indexes-failed"
>Mark Job Complete if some indexes failed&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#support-backofflimitperindex-when-restartpolicyonfailure"
>Support backoffLimitPerIndex when restartPolicy=OnFailure&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#mutually-exclusive-backofflimit-and-backofflimitperindex"
>Mutually exclusive backoffLimit and backoffLimitPerIndex&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#use-bool-field"
>Use bool field&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#use-enum-field"
>Use enum field&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#global-exponential-backoff-delay"
>Global exponential backoff delay&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#exponential-backoff-delay-with-in-memory-tracking"
>Exponential backoff delay with in-memory tracking&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternative-ways-to-support-high-number-of-completions"
>Alternative ways to support high number of completions&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#keep-failedindexes-field-as-a-bitmap"
>Keep failedIndexes field as a bitmap&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#keep-the-list-of-failed-indexes-in-a-dedicated-api-object"
>Keep the list of failed indexes in a dedicated API object&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implicit-limit-on-the-number-of-failed-indexes"
>Implicit limit on the number of failed indexes&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#skip-uncountedterminatedpods-when-backofflimitperindex-is-used"
>Skip uncountedTerminatedPods when backoffLimitPerIndex is used&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP extends the Job API to support indexed jobs where the backoff limit is
per index, and the Job can continue execution despite some of its indexes failing.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Currently, the indexes of an indexed job share a single backoff limit.
When the job reaches this shared backoff limit, the job controller marks the entire
job as failed, and the resources are cleaned up, including indexes that have yet
to run to completion.&lt;/p>
&lt;p>As a result, the current implementation does not cover the situation where the workload
is truly embarrassingly parallel and each index is independent of other indexes.&lt;/p>
&lt;p>For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
then each test run would only be able to find a single test failure.&lt;/p>
&lt;p>Other popular batch services like AWS Batch use a separate backoff limit for each index,
showing that this is a common use case that should be supported by Kubernetes.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>allow counting failures towards the backoffLimit independently for each index,&lt;/li>
&lt;li>allow Job execution to continue despite some of its indexes failing,&lt;/li>
&lt;li>allow failing an index (stopping the recreation of pods for the index) using pod failure policy.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>allow controlling the number of retries per index when the pod&amp;rsquo;s &lt;code>restartPolicy=OnFailure&lt;/code>
(see &lt;a href="#support-backofflimitperindex-when-restartpolicyonfailure"
>Support backoffLimitPerIndex when restartPolicy=OnFailure&lt;/a>
).&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>We propose a new policy for running Indexed Jobs in which the backoff limit
controls the number of retries per index. When the new policy is used, all
indexes execute until they succeed or fail. We also propose a new API field
to control the number of failed indexes.&lt;/p>
&lt;p>Additionally, we propose a new action in &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures"
target="_blank" rel="noopener">PodFailurePolicy&lt;/a>
, called FailIndex,
to short-circuit failing of the index before the backoff limit per index is
reached.&lt;/p>
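&lt;p>The difference between a shared and a per-index backoff limit can be illustrated with a small Python model. This is a sketch under stated assumptions, not the job controller&amp;rsquo;s implementation: it assumes an index (or, with the shared limit, the whole Job) is considered failed once its number of pod failures exceeds the limit, and the function names are hypothetical.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">from collections import Counter


def failed_indexes(pod_failures, backoff_limit_per_index):
    """Return the sorted indexes whose failure count exceeds the per-index limit.

    pod_failures holds one index number per failed pod.
    """
    counts = Counter(pod_failures)
    return sorted(i for i, n in counts.items() if n > backoff_limit_per_index)


def job_fails_with_shared_limit(pod_failures, backoff_limit):
    """Model the current behavior: all indexes share one failure budget."""
    return len(pod_failures) > backoff_limit


# Index 3 fails three times, index 7 fails once.
pod_failures = [3, 3, 7, 3]
print(failed_indexes(pod_failures, backoff_limit_per_index=1))
print(job_fails_with_shared_limit(pod_failures, backoff_limit=3))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>In this model, only index 3 is marked as failed under the per-index limit and the remaining indexes keep running, whereas the same four failures exhaust a shared limit of 3 and fail the whole Job.&lt;/p>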
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a CI/CD platform administrator, I want to use Indexed Jobs to run
suites of integration tests, one suite per index. A failure of one suite should
not interrupt running of other suites. Additionally, I would like to be able
to control the maximal number of retries per index.&lt;/p>
&lt;p>The following Job configuration could satisfy my use case:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitPerIndex&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">  &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">    &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>&lt;span style="color:#008000;font-weight:bold">restartPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Never&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">      &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-container&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">        &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-image&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">        &lt;/span>&lt;span style="color:#008000;font-weight:bold">command&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;./tests-runner&amp;#34;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In this case, we run 10 indexes representing the test suites. We allow for one
failure per index.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>As the CI/CD platform administrator from &lt;a href="#story-1"
>Story 1&lt;/a>,
I want to control how failures are handled using the pod failure policy.
In particular, I want to use the pod failure policy to avoid restarting some
indexes, based on exit codes.&lt;/p>
&lt;p>The following Job configuration could satisfy my use case:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitPerIndex&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">restartPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Never&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-container&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-image&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">command&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;./tests-runner&amp;#34;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">podFailurePolicy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">rules&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">action&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>FailIndex&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">onExitCodes&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">operator&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>In&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">values&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#666">42&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="story-3">Story 3&lt;/h4>
&lt;p>As the CI/CD platform administrator from &lt;a href="#story-1"
>Story 1&lt;/a>,
I want to fail the entire Job if more than 50% of the indexes fail.
This cuts the cost of running the tests when a compilation issue
would cause all tests to fail.&lt;/p>
&lt;p>The following Job configuration could satisfy my use case:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitPerIndex&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">maxFailedIndexes&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">5&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">restartPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Never&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-container&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-image&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">command&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;./tests-runner&amp;#34;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h4 id="performance-benchmark">Performance benchmark&lt;/h4>
&lt;p>We assess the performance of the Beta implementation against
Indexed Jobs with the regular &lt;code>backoffLimit&lt;/code>, using two integration tests
(&lt;code>BenchmarkLargeIndexedJob&lt;/code> and &lt;code>BenchmarkLargeFailureHandling&lt;/code>)
from &lt;a href="https://github.com/kubernetes/kubernetes/pull/121393"
target="_blank" rel="noopener">PR #121393&lt;/a>
.&lt;/p>
&lt;p>In the &lt;code>BenchmarkLargeIndexedJob&lt;/code> test, the measured part creates N pods,
marks them as &lt;code>Succeeded&lt;/code>, and waits for the Job status to be updated accordingly.
This is a sanity test for &lt;code>backoffLimitPerIndex&lt;/code>, demonstrating that the
new code branches don&amp;rsquo;t have a significant performance impact.&lt;/p>
&lt;p>Here are the results (lines re-ordered from smallest to the largest N):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>go &lt;span style="color:#a2f">test&lt;/span> -benchmem -run&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;^&lt;/span>$&lt;span style="color:#b44">&amp;#34;&lt;/span> -timeout&lt;span style="color:#666">=&lt;/span>80m -bench &lt;span style="color:#b44">&amp;#34;^BenchmarkLargeIndexedJob&amp;#34;&lt;/span> k8s.io/kubernetes/test/integration/job | grep &lt;span style="color:#b44">&amp;#34;^Benchmark&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10-48 1 &lt;span style="color:#666">3034342185&lt;/span> ns/op &lt;span style="color:#666">14391160&lt;/span> B/op &lt;span style="color:#666">164352&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>100-48 1 &lt;span style="color:#666">3050613253&lt;/span> ns/op &lt;span style="color:#666">111100464&lt;/span> B/op &lt;span style="color:#666">1324757&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>1000-48 1 &lt;span style="color:#666">19382609963&lt;/span> ns/op &lt;span style="color:#666">1133953568&lt;/span> B/op &lt;span style="color:#666">13079710&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10_000-48 1 &lt;span style="color:#666">222696805443&lt;/span> ns/op &lt;span style="color:#666">11610639800&lt;/span> B/op &lt;span style="color:#666">131946944&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10-48 1 &lt;span style="color:#666">3025650312&lt;/span> ns/op &lt;span style="color:#666">14757368&lt;/span> B/op &lt;span style="color:#666">166282&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>100-48 1 &lt;span style="color:#666">3045479158&lt;/span> ns/op &lt;span style="color:#666">114324072&lt;/span> B/op &lt;span style="color:#666">1345524&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>1000-48 1 &lt;span style="color:#666">19384632203&lt;/span> ns/op &lt;span style="color:#666">1161105080&lt;/span> B/op &lt;span style="color:#666">13216319&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10_000-48 1 &lt;span style="color:#666">223635439324&lt;/span> ns/op &lt;span style="color:#666">11911685592&lt;/span> B/op &lt;span style="color:#666">133325939&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the &lt;code>BenchmarkLargeFailureHandling&lt;/code> test, the measured part of the test marks
N running pods as &lt;code>Failed&lt;/code> and waits for the Job status to be updated accordingly.
To make the test comparable between regular indexed jobs and jobs with
&lt;code>backoffLimitPerIndex&lt;/code>, we set the maximum backoff delay due to pod failures to 10ms.
Here are the results (lines re-ordered from smallest to the largest N):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>go &lt;span style="color:#a2f">test&lt;/span> -benchmem -run&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#b44">&amp;#34;^&lt;/span>$&lt;span style="color:#b44">&amp;#34;&lt;/span> -timeout&lt;span style="color:#666">=&lt;/span>80m -bench &lt;span style="color:#b44">&amp;#34;^BenchmarkLargeFailureHandling&amp;#34;&lt;/span> k8s.io/kubernetes/test/integration/job | grep &lt;span style="color:#b44">&amp;#34;^Benchmark&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10-48 1 &lt;span style="color:#666">2021272442&lt;/span> ns/op &lt;span style="color:#666">13813736&lt;/span> B/op &lt;span style="color:#666">165760&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>100-48 1 &lt;span style="color:#666">3036166978&lt;/span> ns/op &lt;span style="color:#666">109866704&lt;/span> B/op &lt;span style="color:#666">1310651&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>1000-48 1 &lt;span style="color:#666">21049273834&lt;/span> ns/op &lt;span style="color:#666">1074301144&lt;/span> B/op &lt;span style="color:#666">12832549&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10_000-48 1 &lt;span style="color:#666">202327947010&lt;/span> ns/op &lt;span style="color:#666">10926201704&lt;/span> B/op &lt;span style="color:#666">131423197&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10-48 1 &lt;span style="color:#666">3016501067&lt;/span> ns/op &lt;span style="color:#666">14676224&lt;/span> B/op &lt;span style="color:#666">175301&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>100-48 1 &lt;span style="color:#666">3038839798&lt;/span> ns/op &lt;span style="color:#666">112090728&lt;/span> B/op &lt;span style="color:#666">1323948&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>1000-48 1 &lt;span style="color:#666">21057643253&lt;/span> ns/op &lt;span style="color:#666">1096364096&lt;/span> B/op &lt;span style="color:#666">13008669&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;&lt;span style="color:#b8860b">_size&lt;/span>&lt;span style="color:#666">=&lt;/span>10_000-48 1 &lt;span style="color:#666">202373728278&lt;/span> ns/op &lt;span style="color:#666">11185209520&lt;/span> B/op &lt;span style="color:#666">132578325&lt;/span> allocs/op
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The above results show that jobs using &lt;code>.spec.backoffLimitPerIndex&lt;/code> are
about 1% slower than regular indexed jobs. In practice, the difference
is expected to be masked by the exponential backoff delay due to pod failures.&lt;/p>
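&lt;p>As a rough cross-check of the timing figures, the relative slowdown can be
computed directly from the &lt;code>ns/op&lt;/code> columns. The sketch below is only an illustration:
the helper name &lt;code>overheadPercent&lt;/code> is hypothetical, and the inputs are the
size=10_000 rows copied from the two tables above.&lt;/p>

```go
package main

import "fmt"

// overheadPercent returns how much slower (in percent) the candidate
// measurement is relative to the baseline measurement.
// The function name is hypothetical; it is not part of the benchmark code.
func overheadPercent(baselineNs, candidateNs float64) float64 {
	return (candidateNs - baselineNs) / baselineNs * 100
}

func main() {
	// ns/op values copied from the size=10_000 rows of the tables above:
	// regular indexed job first, job with backoffLimitPerIndex second.
	fmt.Printf("BenchmarkLargeIndexedJob:      %.2f%%\n",
		overheadPercent(222696805443, 223635439324))
	fmt.Printf("BenchmarkLargeFailureHandling: %.2f%%\n",
		overheadPercent(202327947010, 202373728278))
}
```

&lt;p>Each benchmark ran a single iteration per size, so small differences
between rows may be within measurement noise.&lt;/p>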
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="the-job-object-too-big">The Job object too big&lt;/h4>
&lt;p>With the new field &lt;code>.status.failedIndexes&lt;/code>, the Job object can grow significantly
larger, as every failed index is recorded in the field.&lt;/p>
&lt;p>Note that a similar risk is already present for Indexed Jobs, due to the
existing &lt;code>.status.completedIndexes&lt;/code> field (see
&lt;a href="https://github.com/kubernetes/kubernetes/issues/118085"
target="_blank" rel="noopener">Indexed Jobs can break with high number of parallelism or completions&lt;/a>
).&lt;/p>
&lt;p>To mitigate this risk, we first constrain &lt;code>.spec.maxFailedIndexes&lt;/code>
to &lt;code>10^5&lt;/code>, which is the same limit that currently applies to &lt;code>.spec.parallelism&lt;/code>.&lt;/p>
&lt;p>Second, we validate that the fields fall within one of the following sets of scalability limits:&lt;/p>
&lt;ol>
&lt;li>&lt;code>.spec.completions&amp;lt;=10^5&lt;/code>, &lt;code>.spec.parallelism&amp;lt;=10^5&lt;/code>, &lt;code>.spec.maxFailedIndexes&amp;lt;=10^5&lt;/code>&lt;/li>
&lt;li>&lt;code>.spec.completions&lt;/code> unlimited (&amp;lt;= max int32 ~2*10^9), &lt;code>.spec.parallelism&amp;lt;=10^4&lt;/code>, &lt;code>.spec.maxFailedIndexes&amp;lt;=10^4&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>In (1.), in the worst-case scenario, every index is present either
in &lt;code>completedIndexes&lt;/code> or in &lt;code>failedIndexes&lt;/code>, but not in both. Thus the combined
size of both fields is limited by &lt;code>(5+1)*10^5=0.572Mi&lt;/code>, where:&lt;/p>
&lt;ul>
&lt;li>5 is the maximal number of digits in the indexes,&lt;/li>
&lt;li>1 is for separation character,&lt;/li>
&lt;li>10^5 is the total number of listed indexes.&lt;/li>
&lt;/ul>
&lt;p>In (2.), the worst-case scenario for the &lt;code>completedIndexes&lt;/code> field is when every
third index is missing from the field because it corresponds to either a failed or
a hanging index, so it is a &amp;ldquo;gap&amp;rdquo;. Then, between every two gaps we have two indexes
listed. Since there are at most 10^4 failed and 10^4 hanging indexes (bounded by
&lt;code>.spec.maxFailedIndexes&lt;/code> and &lt;code>.spec.parallelism&lt;/code>, respectively), the size of
the &lt;code>completedIndexes&lt;/code> field is limited
by: &lt;code>(10+1)*2*(10^4+10^4)=0.42Mi&lt;/code>, where:&lt;/p>
&lt;ul>
&lt;li>10 is the maximal number of digits in the indexes&lt;/li>
&lt;li>1 is for the separation character&lt;/li>
&lt;li>2*(10^4+10^4) is the number of indexes explicitly listed in the field - two indexes per gap.&lt;/li>
&lt;/ul>
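&lt;p>The worst case above follows from the text format of the field. As far as we
understand the format used for &lt;code>completedIndexes&lt;/code>, runs of three or more consecutive
indexes are compressed to &lt;code>first-last&lt;/code>, while shorter runs are listed explicitly.
The following sketch illustrates this under that assumption; &lt;code>formatIndexes&lt;/code> is a
hypothetical helper, not the Job controller code.&lt;/p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatIndexes renders a sorted, duplicate-free list of indexes in a
// compressed text format: runs of three or more consecutive numbers become
// "first-last", everything else is listed explicitly, separated by commas.
// This is an illustrative helper, not the Job controller implementation.
func formatIndexes(sorted []int) string {
	var parts []string
	i := 0
	for i != len(sorted) {
		// Move j to the last element of the consecutive run starting at i.
		j := i
		for j+1 != len(sorted) {
			if sorted[j+1] != sorted[j]+1 {
				break
			}
			j++
		}
		first, last := strconv.Itoa(sorted[i]), strconv.Itoa(sorted[j])
		switch j - i {
		case 0: // single index
			parts = append(parts, first)
		case 1: // two neighbours are not compressed
			parts = append(parts, first, last)
		default: // three or more neighbours
			parts = append(parts, first+"-"+last)
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(formatIndexes([]int{1, 3, 4, 5, 7})) // prints 1,3-5,7
	// Every third index is a gap, so nothing can be compressed.
	fmt.Println(formatIndexes([]int{0, 1, 3, 4, 6, 7})) // prints 0,1,3,4,6,7
}
```

&lt;p>With every third index missing, each gap costs two explicitly listed indexes,
which is where the factor 2 in the bound comes from.&lt;/p>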
&lt;p>The size of the &lt;code>failedIndexes&lt;/code> field is limited by: &lt;code>(10+1)*10^4=0.105Mi&lt;/code>, where:&lt;/p>
&lt;ul>
&lt;li>10 is the maximal number of digits in the indexes,&lt;/li>
&lt;li>1 is for the separation character&lt;/li>
&lt;li>10^4 is the maximal number of indexes present in the field.&lt;/li>
&lt;/ul>
&lt;p>Thus, the size of both fields is capped at &lt;code>0.572Mi&lt;/code> for the limits in (1.) and
&lt;code>0.525Mi&lt;/code> for the limits in (2.).&lt;/p>
&lt;p>For comparison, before the introduction of &lt;code>.status.failedIndexes&lt;/code>, the maximum
size of &lt;code>.status.completedIndexes&lt;/code> was limited by &lt;code>(5+1)*10^5*2/3=0.382Mi&lt;/code> in
the (1.) case, and &lt;code>(10+1)*2*10^4=0.21Mi&lt;/code> in the (2.) case. This means an increase
of &lt;code>0.19Mi&lt;/code> in the (1.) case and &lt;code>0.315Mi&lt;/code> in the (2.) case.&lt;/p>
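&lt;p>The bounds above can be reproduced with a few lines of arithmetic. This is a
minimal sketch; the helper names &lt;code>listBytes&lt;/code> and &lt;code>mi&lt;/code> are hypothetical, and it assumes
one byte per character.&lt;/p>

```go
package main

import "fmt"

const bytesPerMi = 1024 * 1024

// listBytes bounds the size of a comma-separated index list: every listed
// index takes at most maxDigits characters plus one separator character.
func listBytes(maxDigits, listedIndexes int) int {
	return (maxDigits + 1) * listedIndexes
}

// mi converts a byte count to mebibytes.
func mi(b int) float64 {
	return float64(b) / bytesPerMi
}

func main() {
	// Limits (1.): at most 10^5 indexes, each with at most 5 digits.
	both1 := listBytes(5, 100000)

	// Limits (2.): up to 10 digits per index. completedIndexes lists two
	// indexes per gap with at most 10^4+10^4 gaps; failedIndexes lists at
	// most 10^4 indexes.
	completed2 := listBytes(10, 2*(10000+10000))
	failed2 := listBytes(10, 10000)

	fmt.Printf("(1.) both fields:      %.3f Mi\n", mi(both1))
	fmt.Printf("(2.) completedIndexes: %.3f Mi\n", mi(completed2))
	fmt.Printf("(2.) failedIndexes:    %.3f Mi\n", mi(failed2))
	fmt.Printf("(2.) both fields:      %.3f Mi\n", mi(completed2+failed2))
}
```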
&lt;p>The limits are aligned with the soft limits proposed
as a fix for regular indexed jobs
(see &lt;a href="https://github.com/kubernetes/kubernetes/issues/118085#issuecomment-1564520559"
target="_blank" rel="noopener">here&lt;/a>
).
However, when &lt;code>backoffLimitPerIndex&lt;/code> is used, we propose making these limits
hard.&lt;/p>
&lt;p>We believe that the scalability limits should be sufficient for most Job use cases.
For workloads requiring larger jobs, users should be able to create multiple Jobs,
orchestrated by &lt;a href="https://github.com/kubernetes-sigs/jobset"
target="_blank" rel="noopener">JobSet&lt;/a>
.&lt;/p>
&lt;h3 id="exponential-backoff-delay-issue">Exponential backoff delay issue&lt;/h3>
&lt;p>Currently, a pod is recreated by the Job controller with exponential backoff
delay (10s, 20s, 40s &amp;hellip;), counted from the last failure time.&lt;/p>
&lt;p>One complication is that the last failure time for failed pods may increase over
time, as it falls back to &lt;code>now&lt;/code> in some cases
(see the &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/backoff_utils.go#L160-L182"
target="_blank" rel="noopener">code&lt;/a>
).
Thus, there is a risk that pods hitting the fallback continuously bump
the last failure time, which keeps postponing the recreation of the pod.&lt;/p>
&lt;p>This risk is present both when computing the exponential backoff delay globally
(as for regular indexed Jobs) and when computing it per index, as proposed in this KEP
(see &lt;a href="#exponential-backoff-delay-per-index"
>Exponential backoff delay per index&lt;/a>
).&lt;/p>
&lt;p>To mitigate this risk, the time of the last failure is currently recorded
in memory (globally for all pods within a Job), and a new failed pod may bump
it only until it is added to the &lt;code>uncountedTerminatedPods&lt;/code> structure.&lt;/p>
&lt;p>However, tracking the last failure time per index might be costly in terms of memory
consumption (see &lt;a href="#exponential-backoff-delay-with-in-memory-tracking"
>Exponential backoff delay with in-memory tracking&lt;/a>
).&lt;/p>
&lt;p>Thus, in order to mitigate this risk we propose to compute the finish time for
a pod as the first available value of the following (avoiding the ever-increasing
fallback to &lt;code>now&lt;/code>):&lt;/p>
&lt;ol>
&lt;li>max &lt;code>finishedAt&lt;/code> of all containers, if specified for all containers&lt;/li>
&lt;li>&lt;code>LastTransitionTime&lt;/code> for the &lt;code>Ready=False&lt;/code> condition&lt;/li>
&lt;li>&lt;code>deletionTimestamp&lt;/code> - &lt;code>deletionGracePeriodSeconds&lt;/code> if &lt;code>deletionTimestamp&lt;/code> is set&lt;/li>
&lt;/ol>
&lt;p>Here, (3.) marks the moment of deletion, which approximates
the current behavior. (2.) is used when Kubelet loses track of one of its containers;
the &lt;code>Ready=False&lt;/code> condition is set by Kubelet when transitioning a pod to the &lt;code>Failed&lt;/code>
phase: &lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/kubelet/status/status_manager.go#L1060-L1068"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/kubelet/status/status_manager.go#L1060-L1068&lt;/a>
.
When none of the above values is available to compute the finish time, we
fall back to the pod&amp;rsquo;s creation time.&lt;/p>
&lt;p>This fix can be considered a preparatory PR before the KEP, as to some extent
it solves the preexisting issue.&lt;/p>
&lt;h3 id="too-fast-job-status-updates">Too fast Job status updates&lt;/h3>
&lt;p>In this KEP the Job controller needs to keep updating the new status field
&lt;code>.status.failedIndexes&lt;/code> to reflect the current status of the Job. This can raise
concerns of overwhelming the API server with status updates.&lt;/p>
&lt;p>First, observe that the new field does not entail additional Job status updates.
When a pod terminates (either by failure or success), it triggers a Job status update
to increment the &lt;code>.status.failed&lt;/code> or &lt;code>.status.succeeded&lt;/code> counter fields. These
updates are also used to update the pre-existing &lt;code>.status.completedIndexes&lt;/code>
field, and the new &lt;code>.status.failedIndexes&lt;/code> field.&lt;/p>
&lt;p>Second, to mitigate this risk, the Job controller already has a mechanism
to batch Job status updates per Job.&lt;/p>
&lt;p>The mechanism works as follows: the Job controller maintains a queue of &lt;code>syncJob&lt;/code>
invocations per job
(see &lt;a href="https://github.com/kubernetes/kubernetes/blob/72a3990728b2a8979effb37b9800beb3117349f6/pkg/controller/job/job_controller.go#L118"
target="_blank" rel="noopener">in code&lt;/a>
).
New items are added to the queue with a delay (1s for pod events, such as:
delete, add, update). The delay allows for deduplication of the sync per Job.&lt;/p>
&lt;p>One place where a new item is added to the queue, specific to this KEP, is when
the exponential backoff delay hasn&amp;rsquo;t yet elapsed for any index awaiting pod
recreation; we then requeue the next Job status update. The delay is computed
as the minimum of the delays computed for all indexes requiring pod recreation,
but not less than 1s.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>We introduce a new Job API field, called &lt;code>.spec.backoffLimitPerIndex&lt;/code>.
When set, it limits the number of retries, counted independently for each index.&lt;/p>
&lt;p>Additionally, we propose the &lt;code>.spec.maxFailedIndexes&lt;/code> field to control
the maximum number of failed indexes. Once the number is exceeded, the entire
Job is marked Failed and its execution is terminated.&lt;/p>
&lt;p>We also propose to extend the PodFailurePolicy with a new action, called
&lt;code>FailIndex&lt;/code> to allow an index to fail fast before reaching the backoff limit
per index.&lt;/p>
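&lt;p>Taken together, the three additions amount to two simple decisions. The sketch
below is one reading of the rules above, not the controller implementation; the
function names are hypothetical, and it assumes that an index fails once its
failure count exceeds the limit.&lt;/p>

```go
package main

import "fmt"

// indexFailed reports whether an index is marked as failed: either a pod
// failure matched a FailIndex rule of the pod failure policy, or the number
// of failures of this index exceeded backoffLimitPerIndex.
func indexFailed(failures, backoffLimitPerIndex int, matchedFailIndex bool) bool {
	return matchedFailIndex || failures > backoffLimitPerIndex
}

// jobFailed reports whether the entire Job is marked Failed because the
// number of failed indexes exceeded maxFailedIndexes.
func jobFailed(failedIndexes, maxFailedIndexes int) bool {
	return failedIndexes > maxFailedIndexes
}

func main() {
	// Story 1 uses backoffLimitPerIndex: 1, so one failure is tolerated.
	fmt.Println(indexFailed(1, 1, false)) // prints false
	fmt.Println(indexFailed(2, 1, false)) // prints true

	// Story 2: a failure matching the FailIndex rule fails the index at once.
	fmt.Println(indexFailed(1, 1, true)) // prints true

	// Story 3 uses maxFailedIndexes: 5 with 10 indexes.
	fmt.Println(jobFailed(5, 5)) // prints false
	fmt.Println(jobFailed(6, 5)) // prints true
}
```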
&lt;h3 id="job-api">Job API&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// PodFailurePolicyAction specifies how a Pod failure is handled.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// +enum&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> PodFailurePolicyAction &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">const&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// This is an action which might be taken on a pod failure - mark the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Job&amp;#39;s index as failed to avoid restarts within this index. This action&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// can only be used when backoffLimitPerIndex is set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> PodFailurePolicyActionFailIndex PodFailurePolicyAction = &lt;span style="color:#b44">&amp;#34;FailIndex&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// JobSpec describes how the job execution will look like.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> JobSpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Specifies the limit for the number of retries within an&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// index before marking this index as failed. When enabled the number of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// failures per index is kept in the pod&amp;#39;s&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// batch.kubernetes.io/job-index-failure-count annotation. It can only&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// be set when Job&amp;#39;s completionMode=Indexed, and the Pod&amp;#39;s restart&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// policy is Never. The field is immutable.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> BackoffLimitPerIndex &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">int32&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Specifies the maximal number of failed indexes before marking the Job as&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// failed, when backoffLimitPerIndex is set. Once the number of failed&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// indexes exceeds this number the entire Job is marked as Failed and its&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// execution is terminated. When left as null the job continues execution of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// all of its indexes and is marked with the `Complete` Job condition.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// It can only be specified when backoffLimitPerIndex is set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// It can be null or up to completions. It is required and must be&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// less than or equal to 10^4 when completions is greater than 10^5.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> MaxFailedIndexes &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">int32&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> JobStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#666">...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// FailedIndexes holds the failed indexes when backoffLimitPerIndex is set.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The indexes are represented in a text format analogous to that of the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// `completedIndexes` field, i.e. they are kept as decimal integers&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// separated by commas. The numbers are listed in increasing order. Three or&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// more consecutive numbers are compressed and represented by the first and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// last element of the series, separated by a hyphen.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// For example, if the failed indexes are 1, 3, 4, 5 and 7, they are&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// represented as &amp;#34;1,3-5,7&amp;#34;.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// +optional&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> FailedIndexes &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note that the &lt;code>PodFailurePolicyAction&lt;/code> type is already defined in master with
three possible enum values: &lt;code>Ignore&lt;/code>, &lt;code>FailJob&lt;/code> and &lt;code>Count&lt;/code> (see &lt;a href="https://github.com/kubernetes/kubernetes/blob/72a3990728b2a8979effb37b9800beb3117349f6/pkg/apis/batch/types.go#L113-L131"
target="_blank" rel="noopener">here&lt;/a>
).&lt;/p>
&lt;p>We allow specifying custom values for both &lt;code>.spec.backoffLimit&lt;/code> and &lt;code>.spec.backoffLimitPerIndex&lt;/code>.
This allows for a controlled downgrade. Also, when &lt;code>.spec.backoffLimitPerIndex&lt;/code>
is specified, we default &lt;code>.spec.backoffLimit&lt;/code> to the max int32 value. This way
we ensure that old clients of the API don&amp;rsquo;t break when reading or trying to modify
a &lt;code>.spec.backoffLimit&lt;/code> that would otherwise be nil.&lt;/p>
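&lt;p>As an illustration only, the defaulting rule described above could be sketched in Go as follows. The &lt;code>jobSpec&lt;/code> type and the &lt;code>applyBackoffDefaults&lt;/code> helper are hypothetical stand-ins introduced for this sketch; they are not the actual API types or defaulting code.&lt;/p>

```go
package main

import "math"

// jobSpec is a hypothetical, trimmed-down stand-in for the real JobSpec;
// only the two fields relevant to this sketch are included.
type jobSpec struct {
	BackoffLimit         *int32
	BackoffLimitPerIndex *int32
}

// applyBackoffDefaults sketches the rule from the text: when
// backoffLimitPerIndex is set and backoffLimit is unset, backoffLimit is
// defaulted to the max int32 value so that it is never nil.
func applyBackoffDefaults(spec *jobSpec) {
	if spec.BackoffLimitPerIndex == nil {
		return // per-index backoff is not in use; this sketch leaves the spec as-is
	}
	if spec.BackoffLimit == nil {
		limit := new(int32)
		*limit = math.MaxInt32
		spec.BackoffLimit = limit
	}
}
```

&lt;p>A user-provided &lt;code>.spec.backoffLimit&lt;/code> is left untouched by this sketch, which matches the intent of allowing both fields to be set for a controlled downgrade.&lt;/p>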
&lt;h3 id="tracking-the-number-of-failures-per-index">Tracking the number of failures per index&lt;/h3>
&lt;p>In order to determine if the backoff limit per index is exceeded, we keep
track of the number of failures per index. For this purpose we use the Pod
annotation &lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code>, which holds
the number of pod failures for a given index. It is set to &lt;code>0&lt;/code> for the first
pod created for a given index.&lt;/p>
&lt;p>When the Job controller sees a failed pod corresponding to a given index, and the
value of the annotation &lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code> is greater
than or equal to the configured backoff limit per index, the index is marked
as failed and added to &lt;code>.status.failedIndexes&lt;/code>.&lt;/p>
&lt;p>When the Job controller creates a replacement pod for a failed pod of a given
index, it first checks that the index isn&amp;rsquo;t finished yet (it is in neither
&lt;code>.status.failedIndexes&lt;/code> nor &lt;code>.status.completedIndexes&lt;/code>).
Then, if &lt;code>x&lt;/code> is the highest &lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code>
for the index, the newly created pod will have the annotation set to &lt;code>x+1&lt;/code>.
An exception is when the newly failed pod matches the &lt;code>Ignore&lt;/code> action in the pod
failure policy. In this case the replacement pod does not increment the
value in the annotation.&lt;/p>
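&lt;p>The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the Job controller code: the helper names are hypothetical, pods are reduced to their annotation maps, and treating a missing or malformed annotation as &lt;code>0&lt;/code> is an assumption of the sketch.&lt;/p>

```go
package main

import "strconv"

// failureCountAnnotation is the annotation key named in the text.
const failureCountAnnotation = "batch.kubernetes.io/job-index-failure-count"

// failureCount reads the per-index failure count from a pod's annotations.
// Assumption of this sketch: a missing, malformed or negative value counts as 0.
func failureCount(annotations map[string]string) int32 {
	n, err := strconv.ParseInt(annotations[failureCountAnnotation], 10, 32)
	if err != nil || 0 > n {
		return 0
	}
	return int32(n)
}

// indexExceedsLimit reports whether a failed pod uses up the retries of its
// index: the annotation value is greater than or equal to the limit.
func indexExceedsLimit(failedPod map[string]string, backoffLimitPerIndex int32) bool {
	return failureCount(failedPod) >= backoffLimitPerIndex
}

// replacementAnnotations builds the annotation for a replacement pod, given the
// highest failure count seen for the index. The count is incremented unless
// the failed pod matched the Ignore action of the pod failure policy.
func replacementAnnotations(highest int32, matchedIgnore bool) map[string]string {
	next := highest + 1
	if matchedIgnore {
		next = highest
	}
	return map[string]string{failureCountAnnotation: strconv.Itoa(int(next))}
}
```

&lt;p>With a limit of &lt;code>0&lt;/code>, the first pod of an index (annotation &lt;code>0&lt;/code>) already satisfies the check when it fails, so the index is not retried.&lt;/p>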
&lt;p>In order to keep track of the number of failures per index, the Job controller
removes the finalizer of a failed pod for a given index only once the replacement
pod (with the incremented value of &lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code>) is
created, or the index is marked as failed in &lt;code>.status.failedIndexes&lt;/code>. This means
that these are the main steps when handling a failed pod to prepare it for
deletion:&lt;/p>
&lt;ol>
&lt;li>the Pod is recognized as failed&lt;/li>
&lt;li>the Pod UID is recorded in the Job status (&lt;code>.status.uncountedTerminatedPods&lt;/code>)&lt;/li>
&lt;li>the replacement Pod is created&lt;/li>
&lt;li>the Pod&amp;rsquo;s finalizer is removed&lt;/li>
&lt;/ol>
&lt;p>Here, the new feature adds a dependency between steps (3.) and (4.), as previously
these steps could be performed in any order. Note that typically, when a pod is
deleted or fails, the replacement pod is created with a backoff delay, starting
from 10s. This means that after the proposed change the pod finalizer removal
will be paused for at least 10s, until the backoff elapses and the replacement
pod is created. While this may result in pods hanging around before garbage
collection, it does not directly affect the rate of pod recreation.&lt;/p>
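&lt;p>The ordering constraint on finalizer removal can be expressed as a small predicate. The sketch below is illustrative only; the &lt;code>failedPodState&lt;/code> type and its fields are hypothetical names that mirror the steps listed above, not structures of the Job controller.&lt;/p>

```go
package main

// failedPodState is a hypothetical summary of how far the handling of one
// failed pod has progressed.
type failedPodState struct {
	UIDRecorded        bool // step 2: UID is in .status.uncountedTerminatedPods
	ReplacementCreated bool // step 3: replacement pod with incremented count exists
	IndexMarkedFailed  bool // the index was added to .status.failedIndexes
}

// canRemoveFinalizer sketches the rule from the text: the finalizer of a
// failed pod may be removed (step 4) only once the replacement pod has been
// created or the index has been marked as failed.
func canRemoveFinalizer(s failedPodState) bool {
	if !s.UIDRecorded {
		return false // step 2 comes before step 4 in the listed order
	}
	return s.ReplacementCreated || s.IndexMarkedFailed
}
```

&lt;p>While the backoff delay for the replacement pod has not elapsed, &lt;code>ReplacementCreated&lt;/code> stays false in this model, which corresponds to the paused finalizer removal described above.&lt;/p>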
&lt;p>Note that the first step (1.) will also be impacted by
&lt;a href="https://github.com/kubernetes/enhancements/issues/3939"
target="_blank" rel="noopener">KEP-3939: Consider Terminating pods as active pods in Jobs.&lt;/a>
&lt;/p>
&lt;h3 id="failed-indexes-format">Failed indexes format&lt;/h3>
&lt;p>The format of the &lt;code>.status.failedIndexes&lt;/code> field is analogous to the one used for
successful indexes represented by the &lt;a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job#track-completed-indexes-in-job-status"
target="_blank" rel="noopener">&lt;code>completedIndexes&lt;/code> field&lt;/a>
, which is a
text format grouping consecutive integers into ranges. In the special case when
the indexes are non-consecutive, they are represented by comma-separated numbers.
In the worst-case scenario, for example when only the even indexes fail, no ranges
can be formed and the field is a string of comma-separated values. In
order to constrain the size of the field we cap the number of completions
(see &lt;a href="#the-job-object-too-big"
>The Job object too big&lt;/a>
for more details).&lt;/p>
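&lt;p>To make the format concrete, here is a minimal Go sketch of an encoder that follows the rules stated in the &lt;code>FailedIndexes&lt;/code> API comment (increasing order, runs of three or more consecutive numbers compressed with a hyphen). The function name &lt;code>formatIndexes&lt;/code> is hypothetical, and the sketch assumes the input contains no duplicates.&lt;/p>

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// formatIndexes renders a set of distinct indexes in the text format used for
// failedIndexes and completedIndexes, for example "1,3-5,7".
func formatIndexes(indexes []int) string {
	sorted := append([]int(nil), indexes...) // copy so the caller's slice is not reordered
	sort.Ints(sorted)

	var parts []string
	i := 0
	for len(sorted) > i {
		// Move j to the last element of the run of consecutive numbers starting at i.
		j := i
		for len(sorted) > j+1 {
			if sorted[j+1] != sorted[j]+1 {
				break
			}
			j++
		}
		first, last := strconv.Itoa(sorted[i]), strconv.Itoa(sorted[j])
		switch j - i {
		case 0: // single number
			parts = append(parts, first)
		case 1: // only two consecutive numbers: listed separately
			parts = append(parts, first, last)
		default: // three or more consecutive numbers: compressed range
			parts = append(parts, first+"-"+last)
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}
```

&lt;p>For the indexes 1, 3, 4, 5 and 7 this yields &lt;code>1,3-5,7&lt;/code>, while a set such as 0, 2, 4 cannot be compressed and is rendered as &lt;code>0,2,4&lt;/code>.&lt;/p>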
&lt;h3 id="job-completion">Job completion&lt;/h3>
&lt;p>When the backoff limit per index is used, we execute indexes until all of them
are finished (either failed or succeeded), or the number of failed indexes
exceeds the specified &lt;code>.spec.maxFailedIndexes&lt;/code>.&lt;/p>
&lt;p>Then, the Job is marked as completed (the &lt;code>Complete&lt;/code> Job condition type) when
all indexes have succeeded. The Job is marked as failed (the &lt;code>Failed&lt;/code> Job condition)
when at least one index has failed. The &lt;code>Failed&lt;/code> condition is added once
all indexes have finished their execution (either failed or succeeded), or when
the number of failed indexes exceeds the specified &lt;code>.spec.maxFailedIndexes&lt;/code>.&lt;/p>
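&lt;p>The decision described in this section can be summarized in a small function. This is a sketch under the rules stated in this section only; the name &lt;code>jobOutcome&lt;/code> and the use of plain counters instead of the Job status are assumptions made for illustration.&lt;/p>

```go
package main

// jobOutcome returns the terminal condition type for a Job that uses the
// backoff limit per index: "Complete", "Failed", or "" while the Job is still
// running. succeeded and failed are the numbers of finished indexes.
func jobOutcome(completions, succeeded, failed int32, maxFailedIndexes *int32) string {
	// Exceeding maxFailedIndexes terminates the Job early.
	if maxFailedIndexes != nil {
		if failed > *maxFailedIndexes {
			return "Failed"
		}
	}
	// Otherwise wait until every index has either succeeded or failed.
	if completions > succeeded+failed {
		return ""
	}
	if failed > 0 {
		return "Failed"
	}
	return "Complete"
}
```

&lt;p>For example, with 10 completions and &lt;code>maxFailedIndexes&lt;/code> set to 1, the second failed index makes the function return &lt;code>Failed&lt;/code> even though other indexes have not finished yet.&lt;/p>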
&lt;h3 id="failindex-action">FailIndex action&lt;/h3>
&lt;p>In order to allow early termination of indexes with the &lt;code>FailIndex&lt;/code> action
we add the corresponding index to the set of failed indexes represented by
&lt;code>.status.failedIndexes&lt;/code>. This action can only be used if backoff limit per index
is used.&lt;/p>
&lt;h3 id="exponential-backoff-delay-per-index">Exponential backoff delay per index&lt;/h3>
&lt;p>First, we solve the issue of increasing failure time for deleted pods when the
finalizer removal is delayed, by modifying the definition of the pod finish time,
to avoid fallback to &lt;code>now&lt;/code>
(see also &lt;a href="#exponential-backoff-delay-issue"
>Exponential backoff delay issue&lt;/a>
).&lt;/p>
&lt;p>Second, we compute the backoff delay within each index independently. The number
of consecutive failures per index can be derived from the
&lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code> annotation of the last failed pod,
plus one. This is because any successful pod marks the index as successful and
stops retries. Note that using the annotation value means that failed pods
matching the &lt;code>Ignore&lt;/code> rule are skipped in the calculation, but this behavior is
consistent with the handling of ignored pod failures for the regular backoff limit.&lt;/p>
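&lt;p>As a rough illustration of the per-index computation, the sketch below derives the delay from the annotation value of the last failed pod. The function name &lt;code>backoffDelay&lt;/code> is hypothetical, and both the doubling factor and the &lt;code>maxDelay&lt;/code> cap are assumptions of this sketch rather than values specified in this section; only the 10s starting delay is taken from the text.&lt;/p>

```go
package main

import "time"

// backoffDelay sketches an exponential backoff for one index.
// lastFailureCount is the job-index-failure-count annotation value of the
// last failed pod of the index; the number of consecutive failures is that
// value plus one. The delay is base for the first failure and doubles with
// every further failure (assumed factor), capped at maxDelay (assumed cap).
func backoffDelay(lastFailureCount int32, base, maxDelay time.Duration) time.Duration {
	failures := lastFailureCount + 1
	delay := base
	for i := int32(1); failures > i; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay // stop early, which also avoids overflow
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}
```

&lt;p>With a base of 10s, annotation values 0, 1 and 2 give delays of 10s, 20s and 40s in this sketch. Because the value is tracked per index, a frequently failing index does not lengthen the delay of other indexes.&lt;/p>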
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;p>Unit tests will be added along with any new code introduced. In particular,
the following scenarios will be covered with unit tests:&lt;/p>
&lt;ul>
&lt;li>handling or ignoring of &lt;code>.spec.backoffLimitPerIndex&lt;/code> by the Job
controller when the feature gate is enabled or disabled, respectively,&lt;/li>
&lt;li>handling or ignoring of the pod failure policy rule with the &lt;code>FailIndex&lt;/code> action
when the &lt;code>JobBackoffLimitPerIndex&lt;/code> feature gate is enabled or disabled, respectively,&lt;/li>
&lt;li>validation of a job configuration with respect to &lt;code>.spec.backoffLimitPerIndex&lt;/code> by
kube-apiserver (including limits for &lt;code>.spec.maxFailedIndexes&lt;/code>,
&lt;code>.spec.parallelism&lt;/code> and &lt;code>.spec.completions&lt;/code>), when the feature gate is enabled
or disabled,&lt;/li>
&lt;li>marking of the Job as &lt;code>Complete&lt;/code> only once all indexes are completed,&lt;/li>
&lt;li>termination of Job execution and marking it as failed when
&lt;code>.spec.maxFailedIndexes&lt;/code> is exceeded,&lt;/li>
&lt;li>calculation of the exponential backoff delay per index when &lt;code>backoffLimitPerIndex&lt;/code>
is used,&lt;/li>
&lt;li>a fuzzer roundtrip test for API when &lt;code>backoffLimit&lt;/code> is set to max int32.&lt;/li>
&lt;/ul>
&lt;p>The core packages (with their unit test coverage) which are going to be modified during the implementation:&lt;/p>
&lt;ul>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/controller/job&lt;/code>: &lt;code>27 Apr 2023&lt;/code> - &lt;code>90.4%&lt;/code> &lt;!--(main logic to handle backoffLimitPerIndex, FailIndex and maxFailedIndexes)-->&lt;/li>
&lt;li>&lt;code>k8s.io/kubernetes/pkg/apis/batch/validation&lt;/code>: &lt;code>27 Apr 2023&lt;/code> - &lt;code>98.5%&lt;/code> &lt;!--(validation of the job configuration for backoffLimitPerIndex and FailIndex)-->&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;p>The following scenarios will be covered with integration tests:&lt;/p>
&lt;ul>
&lt;li>enabling, disabling and re-enabling of the &lt;code>JobBackoffLimitPerIndex&lt;/code> feature gate (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1030"
target="_blank" rel="noopener">code&lt;/a>
)&lt;/li>
&lt;li>handling of the &lt;code>.spec.backoffLimitPerIndex&lt;/code> when the &lt;code>FailIndex&lt;/code> action is used (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1888"
target="_blank" rel="noopener">code&lt;/a>
),&lt;/li>
&lt;li>handling of the &lt;code>.spec.backoffLimitPerIndex&lt;/code> when &lt;code>.spec.maxFailedIndexes&lt;/code> isn&amp;rsquo;t set (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1688"
target="_blank" rel="noopener">code&lt;/a>
),&lt;/li>
&lt;li>handling of the &lt;code>.spec.backoffLimitPerIndex&lt;/code> when &lt;code>.spec.maxFailedIndexes&lt;/code> is set (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1846"
target="_blank" rel="noopener">code&lt;/a>
),&lt;/li>
&lt;li>handling of the &lt;code>.spec.backoffLimit&lt;/code> when &lt;code>.spec.backoffLimitPerIndex&lt;/code> is set (&lt;a href="https://github.com/kubernetes/kubernetes/blob/master/test/integration/job/job_test.go#L1744"
target="_blank" rel="noopener">code&lt;/a>
),&lt;/li>
&lt;li>handling of the exponential backoff delay per index when &lt;code>.spec.backoffLimitPerIndex&lt;/code> is set (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/integration/job/job_test.go#L1120"
target="_blank" rel="noopener">code&lt;/a>
).&lt;/li>
&lt;/ul>
&lt;p>See the k8s-triage page for the &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?job=integration&amp;amp;test=BackoffLimitPerIndex"
target="_blank" rel="noopener">BackoffLimitPerIndex integration tests&lt;/a>
.&lt;/p>
&lt;p>More integration tests might be added to ensure good code coverage based on the
actual implementation.&lt;/p>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;p>The following scenarios are covered with e2e tests for Beta:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://testgrid.k8s.io/sig-apps#gce"
target="_blank" rel="noopener">sig-apps#gce&lt;/a>
:
&lt;ul>
&lt;li>Job should execute all indexes despite some failing when using backoffLimitPerIndex (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/e2e/apps/job.go#L602"
target="_blank" rel="noopener">code&lt;/a>
)&lt;/li>
&lt;li>Job should terminate job execution when the number of failed indexes exceeds maxFailedIndexes (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/e2e/apps/job.go#L635"
target="_blank" rel="noopener">code&lt;/a>
)&lt;/li>
&lt;li>Job should mark indexes as failed when the FailIndex action is matched in podFailurePolicy (&lt;a href="https://github.com/kubernetes/kubernetes/blob/20b12ad5c389ff74792988bf1e0c10fe2820d9a1/test/e2e/apps/job.go#L670"
target="_blank" rel="noopener">code&lt;/a>
)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>See the k8s-triage page for the &lt;a href="https://storage.googleapis.com/k8s-triage/index.html?job=e2e&amp;amp;test=should%20mark%20indexes%20as%20failed%20when%20the%20FailIndex%20action%20is%20matched%20in%20podFailurePolicy%7Cshould%20terminate%20job%20execution%20when%20the%20number%20of%20failed%20indexes%20exceeds%20maxFailedIndexes%7Cshould%20execute%20all%20indexes%20despite%20some%20failing%20when%20using%20backoffLimitPerIndex"
target="_blank" rel="noopener">BackoffLimitPerIndex e2e tests&lt;/a>
.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>the feature implemented behind the &lt;code>JobBackoffLimitPerIndex&lt;/code> feature flag&lt;/li>
&lt;li>change the logic of computing the exponential backoff delay (see &lt;a href="#exponential-backoff-delay-issue"
>here&lt;/a>
)&lt;/li>
&lt;li>user-facing documentation, including the warning for setting completions &amp;gt; 10^5&lt;/li>
&lt;li>The &lt;code>JobBackoffLimitPerIndex&lt;/code> feature flag disabled by default&lt;/li>
&lt;li>Tests: unit and integration&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Address reviews and bug reports from Alpha users&lt;/li>
&lt;li>Implement the &lt;code>job_finished_indexes_total&lt;/code> metric&lt;/li>
&lt;li>E2e tests are in Testgrid and linked in KEP&lt;/li>
&lt;li>Move the &lt;a href="https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89"
target="_blank" rel="noopener">new reason declarations&lt;/a>
from Job controller to the API package&lt;/li>
&lt;li>Evaluate performance of Job controller for jobs using backoff limit per index
with benchmarks at the integration or e2e level (discussion pointers from Alpha
review: &lt;a href="https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406"
target="_blank" rel="noopener">thread1&lt;/a>
and &lt;a href="https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076"
target="_blank" rel="noopener">thread2&lt;/a>
)&lt;/li>
&lt;li>The feature flag enabled by default&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Address reviews and bug reports from Beta users&lt;/li>
&lt;li>Write a blog post about the feature&lt;/li>
&lt;li>Revisit extending the &lt;a href="https://kubernetes.io/docs/tasks/job/pod-failure-policy/"
target="_blank" rel="noopener">hands-on guide for Pod failure policy&lt;/a>
to use &lt;code>FailIndex&lt;/code>&lt;/li>
&lt;li>Graduate e2e tests as conformance tests&lt;/li>
&lt;li>Lock the &lt;code>JobBackoffLimitPerIndex&lt;/code> feature gate&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;h4 id="upgrade">Upgrade&lt;/h4>
&lt;p>An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying
&lt;code>.spec.backoffLimitPerIndex&lt;/code>.
There is no difference in behavior of Jobs if &lt;code>.spec.backoffLimitPerIndex&lt;/code> is
not set.&lt;/p>
&lt;h4 id="downgrade">Downgrade&lt;/h4>
&lt;p>A downgrade to a version which does not support this feature should not require
any additional configuration changes. Jobs which specified
&lt;code>.spec.backoffLimitPerIndex&lt;/code> (to make use of this feature) will be
handled in the default way, i.e. using &lt;code>.spec.backoffLimit&lt;/code>.
However, since &lt;code>.spec.backoffLimit&lt;/code> defaults to the max int32 value
(see &lt;a href="#job-api"
>here&lt;/a>
), it might require manually setting &lt;code>.spec.backoffLimit&lt;/code>
to ensure failed pods are not retried indefinitely.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>This feature is limited to the control plane.&lt;/p>
&lt;p>Note that kube-apiserver can be at the N+1 skew version relative to the
kube-controller-manager (see &lt;a href="https://kubernetes.io/releases/version-skew-policy/#kube-controller-manager-kube-scheduler-and-cloud-controller-manager"
target="_blank" rel="noopener">here&lt;/a>
).
In that case, the Job controller operates on the version of the Job object that
already supports the new Job API.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Feature gate (also fill in values in &lt;code>kep.yaml&lt;/code>)
&lt;ul>
&lt;li>Feature gate name: JobBackoffLimitPerIndex&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver, kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Other
&lt;ul>
&lt;li>Describe the mechanism:&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane?&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume &lt;code>Dynamic Kubelet Config&lt;/code> feature is enabled).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. Using the feature gate is the recommended way. When the feature is disabled,
the Job controller handles pod failures in the default way, even if
&lt;code>.spec.backoffLimitPerIndex&lt;/code> is set.&lt;/p>
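&lt;p>As a minimal sketch of what disabling the gate could look like, the fragment below sets the standard
&lt;code>--feature-gates&lt;/code> flag on kube-controller-manager. The static Pod manifest layout is an
assumption (kubeadm-style clusters). The same flag value would also be set on kube-apiserver,
and other deployment methods configure component flags differently.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Fragment of a kube-controller-manager static Pod manifest (assumed kubeadm-style layout).
# Only the relevant flag is shown; all other flags stay unchanged.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --feature-gates=JobBackoffLimitPerIndex=false
&lt;/code>&lt;/pre>&lt;/div>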
&lt;!--
Describe the consequences on existing workloads (e.g., if this is a runtime
feature, can it break the existing applications?).
Feature gates are typically disabled by setting the flag to `false` and
restarting the component. No other changes should be necessary to disable the
feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>The Job controller starts to handle pod failures according to the specified
&lt;code>.spec.backoffLimitPerIndex&lt;/code> or &lt;code>.spec.maxFailedIndexes&lt;/code> fields.&lt;/p>
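&lt;p>For illustration, a Job that the controller would again handle per index after re-enablement might look like
the sketch below. The Job name, the field values, and the container are hypothetical. Note that
&lt;code>.spec.maxFailedIndexes&lt;/code> is only meaningful together with &lt;code>.spec.backoffLimitPerIndex&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: job-per-index-example   # hypothetical name
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed       # per-index backoff applies to Indexed Jobs
  backoffLimitPerIndex: 1       # each index tolerates one retry (illustrative)
  maxFailedIndexes: 5           # illustrative cap on the number of failed indexes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36.1
        command: [&amp;#34;sleep&amp;#34;]
        args: [&amp;#34;60&amp;#34;]
&lt;/code>&lt;/pre>&lt;/div>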
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes, there is an &lt;a href="https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763"
target="_blank" rel="noopener">integration test&lt;/a>
which tests the following path: enablement -&amp;gt; disablement -&amp;gt; re-enablement.&lt;/p>
&lt;!--
The e2e framework does not currently support enabling or disabling feature
gates. However, unit tests in each component dealing with managing data, created
with and without the feature, are necessary. At the very least, think about
conversion tests if API types are being modified.
Additionally, for features that are introducing a new API field, unit tests that
are exercising the `switch` of feature gate itself (what happens if I disable a
feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>This change does not affect how a rollout or rollback can fail.&lt;/p>
&lt;p>The change is opt-in, thus a rollout doesn&amp;rsquo;t impact already running pods.&lt;/p>
&lt;p>The rollback might affect how pod failures are handled, since they will
be counted only against &lt;code>.spec.backoffLimit&lt;/code>, which defaults to the max int32
value when &lt;code>.spec.backoffLimitPerIndex&lt;/code> is used (see &lt;a href="#job-api"
>here&lt;/a>
).
Thus, as in the case of a downgrade (see &lt;a href="#downgrade"
>here&lt;/a>
),
it might be required to manually set &lt;code>.spec.backoffLimit&lt;/code> to ensure failed pods
are not retried indefinitely.&lt;/p>
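&lt;p>A sketch of a Job spec that sets an explicit global limit next to the per-index one, so that a rollback does
not leave retries effectively unbounded. The value &lt;code>6&lt;/code> is illustrative (it is the default for
&lt;code>.spec.backoffLimit&lt;/code> when per-index backoff is not used); a real workload may need a different value.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Fragment of a Job manifest; fields not shown stay unchanged.
spec:
  completionMode: Indexed
  backoffLimitPerIndex: 1   # used while the feature gate is enabled
  backoffLimit: 6           # explicit global cap that still applies after a rollback (illustrative)
&lt;/code>&lt;/pre>&lt;/div>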
&lt;!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;p>A substantial increase in the &lt;code>job_sync_duration_seconds&lt;/code> metric.&lt;/p>
&lt;p>Also, a substantial increase in the total number of pods, as it may take
additional time to get the finalizers removed.&lt;/p>
&lt;p>Additionally, a substantial increase in the difference of
&lt;code>terminated_pods_tracking_finalizer_total&lt;/code> for the &lt;code>add&lt;/code> and &lt;code>delete&lt;/code> labels may
indicate that it takes too long to delete the finalizers.&lt;/p>
&lt;p>The feature is opt-in, so in case of issues it is enough to stop using the
&lt;code>.spec.backoffLimitPerIndex&lt;/code> API field.&lt;/p>
&lt;!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>The Upgrade-&amp;gt;downgrade-&amp;gt;upgrade testing was done manually using the &lt;code>alpha&lt;/code>
version in 1.28 with the following steps:&lt;/p>
&lt;ol>
&lt;li>Start the cluster with the &lt;code>JobBackoffLimitPerIndex&lt;/code> feature gate enabled:&lt;/li>
&lt;/ol>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kind create cluster --name per-index --image kindest/node:v1.28.0 --config config.yaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>using &lt;code>config.yaml&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Cluster&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kind.x-k8s.io/v1alpha4&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">featureGates&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">&amp;#34;JobBackoffLimitPerIndex&amp;#34;: &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">nodes&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>control-plane&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>- &lt;span style="color:#008000;font-weight:bold">role&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>worker&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then, create the job using &lt;code>.spec.backoffLimitPerIndex=1&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl create -f job.yaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>using &lt;code>job.yaml&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">metadata&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>job-longrun&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">3&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">3&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitPerIndex&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">template&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">restartPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Never&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">containers&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>sleep&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">image&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>busybox:1.36.1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">command&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;sleep&amp;#34;&lt;/span>]&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">args&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>[&lt;span style="color:#b44">&amp;#34;1800&amp;#34;&lt;/span>]&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># 30min&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">imagePullPolicy&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>IfNotPresent&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Wait for the pods to be running, then delete the pod with index 0:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl delete pods -l job-name&lt;span style="color:#666">=&lt;/span>job-longrun,batch.kubernetes.io/job-completion-index&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#666">0&lt;/span> --grace-period&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#666">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Wait for the replacement pod to be created and repeat the deletion.&lt;/p>
&lt;p>Check the Job status and confirm that &lt;code>.status.failedIndexes=&amp;quot;0&amp;quot;&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get &lt;span style="color:#a2f">jobs&lt;/span> -ljob-name&lt;span style="color:#666">=&lt;/span>job-longrun -oyaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Also, notice that &lt;code>.status.active=2&lt;/code>, because the pod for a failed index is not
re-created.&lt;/p>
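&lt;p>At this point, the relevant part of the Job status would be expected to look roughly like the fragment below.
It only restates the two fields named in this step; all other status fields are omitted.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Relevant fields only; other status fields are omitted.
status:
  active: 2                  # the pod for the failed index is not re-created
  failedIndexes: &amp;#34;0&amp;#34;         # index 0 exhausted its per-index backoff limit
&lt;/code>&lt;/pre>&lt;/div>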
&lt;ol start="2">
&lt;li>Simulate a downgrade by disabling the feature gate for the API server and the control plane.&lt;/li>
&lt;/ol>
&lt;p>Then, verify that 3 pods are running again and that &lt;code>.status.failedIndexes&lt;/code> is
gone by running:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get &lt;span style="color:#a2f">jobs&lt;/span> -ljob-name&lt;span style="color:#666">=&lt;/span>job-longrun -oyaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This produces output similar to:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>...&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">status&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">active&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">3&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">failed&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">2&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">ready&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">2&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="3">
&lt;li>Simulate an upgrade by re-enabling the feature gate for the API server and the control plane.&lt;/li>
&lt;/ol>
&lt;p>Then, delete the pod with index 1:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl delete pods -l job-name&lt;span style="color:#666">=&lt;/span>job-longrun,batch.kubernetes.io/job-completion-index&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#666">1&lt;/span> --grace-period&lt;span style="color:#666">=&lt;/span>&lt;span style="color:#666">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Wait for the replacement pod to be created and repeat the deletion.
Check the Job status and confirm that &lt;code>.status.failedIndexes=&amp;quot;1&amp;quot;&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>kubectl get &lt;span style="color:#a2f">jobs&lt;/span> -ljob-name&lt;span style="color:#666">=&lt;/span>job-longrun -oyaml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Also, notice that &lt;code>.status.active=2&lt;/code>, because the pod for a failed index is not
re-created.&lt;/p>
&lt;p>This demonstrates that the feature is working again for the job.&lt;/p>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Even if applying deprecation policies, they may still surprise some users.
-->
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>By the presence of the &lt;code>.spec.backoffLimitPerIndex&lt;/code> field in the jobs.&lt;/p>
&lt;p>For Beta, we are also considering introducing the &lt;code>job_finished_indexes_total&lt;/code>
metric
(see also &lt;a href="#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature"
>here&lt;/a>
).&lt;/p>
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;!--
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
for each individual pod.
Pick one or more of these and delete the rest.
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Job API .status
&lt;ul>
&lt;li>field: &lt;code>failedIndexes&lt;/code> will not be empty as indexes fail&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Pod API
&lt;ul>
&lt;li>annotation: &lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code> is present for
pods created by Jobs with this feature enabled&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
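&lt;p>For example, the metadata of a replacement pod created after one failure of index 0 might contain the
following. This is a sketch: the values are illustrative, and other labels and annotations are omitted.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml"># Fragment of a Pod created by a Job that uses .spec.backoffLimitPerIndex.
metadata:
  labels:
    batch.kubernetes.io/job-completion-index: &amp;#34;0&amp;#34;      # index this pod runs
  annotations:
    batch.kubernetes.io/job-index-failure-count: &amp;#34;1&amp;#34;   # failures recorded so far for this index (illustrative)
&lt;/code>&lt;/pre>&lt;/div>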
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>This feature does not propose SLOs.&lt;/p>
&lt;!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.
It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors &lt;= 1%
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job &lt;= 10%
- 99.9% of /health requests per day finish with 200 code
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;!--
Pick one or more of these and delete the rest.
-->
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name:
&lt;ul>
&lt;li>&lt;code>job_sync_duration_seconds&lt;/code> (existing): can be used to see how much the
feature enablement increases the time spent in the sync job&lt;/li>
&lt;li>&lt;code>job_finished_indexes_total&lt;/code> (new): can be used to determine if the indexes
are marked failed.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Components exposing the metric: kube-controller-manager&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>For Beta we will introduce a new metric &lt;code>job_finished_indexes_total&lt;/code>
with labels &lt;code>status=(failed|succeeded)&lt;/code>, and &lt;code>backoffLimit=(perIndex|global)&lt;/code>.
It will count the number of failed and succeeded indexes across jobs using
&lt;code>backoffLimitPerIndex&lt;/code>, or regular Indexed Jobs (using only &lt;code>.spec.backoffLimit&lt;/code>).
It might be useful to determine the global ratio of failed vs. succeeded indexes
when &lt;code>backoffLimitPerIndex&lt;/code> is used.&lt;/p>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
-->
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Yes, but only when the &lt;code>.spec.backoffLimitPerIndex&lt;/code> field is set.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>API type(s): Job&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimated increase in size:&lt;/p>
&lt;ul>
&lt;li>The new &lt;code>.status.failedIndexes&lt;/code> field and the pre-existing
&lt;code>.status.completedIndexes&lt;/code> field are impacted. When the scalability limits are
respected, the maximal increase of the total size of both fields can be estimated
as &lt;code>190Ki&lt;/code> (see &lt;a href="#the-job-object-too-big"
>The Job object too big&lt;/a>
for more details).&lt;/li>
&lt;li>New &lt;code>.spec.backoffLimitPerIndex&lt;/code> field of &lt;code>*int32&lt;/code> is 12 bytes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>API type(s): Pod&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimated increase in size:
the new annotation &lt;code>batch.kubernetes.io/job-index-failure-count&lt;/code>, which keeps the
current number of retries per index, is around 50 bytes.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>We don&amp;rsquo;t expect this increase to be captured by existing
&lt;a href="https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md"
target="_blank" rel="noopener">SLO/SLIs&lt;/a>
.&lt;/p>
&lt;!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>The added dependency of removing finalizers only after pod
recreation (see &lt;a href="#tracking-the-number-of-failures-per-index"
>Tracking the number of failures per index&lt;/a>
) may keep pods around longer (around 10s, which is the backoff for pod recreation)
before actual deletion (requested or by PodGC).&lt;/p>
&lt;p>This can increase RAM consumption, but only for a short period of time. Also,
it affects only the failing pods.&lt;/p>
&lt;!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
Think through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No. This feature does not introduce any resource-exhausting operations.&lt;/p>
&lt;!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>No change from existing behavior of the Job controller.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;p>None.&lt;/p>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>N/A.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;ul>
&lt;li>2023-01-23: Initial version of the KEP PR &lt;a href="https://github.com/kubernetes/enhancements/pull/3774"
target="_blank" rel="noopener">Backoff Limit Per Job #3774&lt;/a>
&lt;/li>
&lt;li>2023-04-26: The KEP PR &lt;a href="https://github.com/kubernetes/enhancements/pull/3967"
target="_blank" rel="noopener">Backoff limit per Job Index #3967&lt;/a>
takes over from &lt;a href="https://github.com/kubernetes/enhancements/pull/3774"
target="_blank" rel="noopener">#3774&lt;/a>
&lt;/li>
&lt;li>2023-05-08: The KEP PR ready for review&lt;/li>
&lt;li>2023-06-07: The KEP PR merged&lt;/li>
&lt;li>2023-07-13: The implementation PR &lt;a href="https://github.com/kubernetes/kubernetes/pull/118009"
target="_blank" rel="noopener">Support BackoffLimitPerIndex in Jobs #118009&lt;/a>
under review&lt;/li>
&lt;li>2023-07-18: Merge the API PR &lt;a href="https://github.com/kubernetes/kubernetes/pull/119294"
target="_blank" rel="noopener">Extend the Job API for BackoffLimitPerIndex&lt;/a>
&lt;/li>
&lt;li>2023-07-18: Merge the Job Controller PR &lt;a href="https://github.com/kubernetes/kubernetes/pull/118009"
target="_blank" rel="noopener">Support BackoffLimitPerIndex in Jobs&lt;/a>
&lt;/li>
&lt;li>2023-08-04: Merge user-facing docs PR &lt;a href="https://github.com/kubernetes/website/pull/41921"
target="_blank" rel="noopener">Docs update for Job&amp;rsquo;s backoff limit per index (alpha in 1.28)&lt;/a>
&lt;/li>
&lt;li>2023-08-06: Merge KEP update reflecting decisions during the implementation phase &lt;a href="https://github.com/kubernetes/enhancements/pull/4123"
target="_blank" rel="noopener">Update for KEP3850 &amp;ldquo;Backoff Limit Per Index&amp;rdquo;&lt;/a>
&lt;/li>
&lt;li>2023-10-02: &lt;a href="https://github.com/kubernetes/enhancements/pull/4228"
target="_blank" rel="noopener">Update KEP-3850 &amp;ldquo;Backoff Limit Per Index&amp;rdquo; for Beta&lt;/a>
&lt;/li>
&lt;li>2023-10-20: &lt;a href="https://github.com/kubernetes/kubernetes/pull/121292"
target="_blank" rel="noopener">Introduce the job_finished_indexes_total metric&lt;/a>
&lt;/li>
&lt;li>2023-10-23: &lt;a href="https://github.com/kubernetes/kubernetes/pull/121356"
target="_blank" rel="noopener">Graduate BackoffLimitPerIndex to Beta&lt;/a>
&lt;/li>
&lt;li>2023-10-24: &lt;a href="https://github.com/kubernetes/kubernetes/pull/121471"
target="_blank" rel="noopener">Indicate Job Backoff Limit Per Index reason consts are beta&lt;/a>
&lt;/li>
&lt;li>2023-10-25: &lt;a href="https://github.com/kubernetes/kubernetes/pull/121368"
target="_blank" rel="noopener">Backoff limit per index e2e test&lt;/a>
&lt;/li>
&lt;li>2023-11-02: &lt;a href="https://github.com/kubernetes/kubernetes/pull/121633"
target="_blank" rel="noopener">Add remaining e2e tests for Job BackoffLimitPerIndex based on KEP&lt;/a>
&lt;/li>
&lt;li>2023-11-02: &lt;a href="https://github.com/kubernetes/kubernetes/pull/121393"
target="_blank" rel="noopener">Benchmark job with backoff limit per index&lt;/a>
&lt;/li>
&lt;li>2023-11-02: &lt;a href="https://github.com/kubernetes/enhancements/pull/4321"
target="_blank" rel="noopener">Update KEP3850 &amp;ldquo;BackoffLimitPerIndex for Indexed Jobs&amp;rdquo;&lt;/a>
&lt;/li>
&lt;li>2025-02-07: &lt;a href="https://github.com/kubernetes/enhancements/pull/5154"
target="_blank" rel="noopener">KEP3850: graduate Backoff Limit Per Index for Job to stable&lt;/a>
&lt;/li>
&lt;li>2025-02-25: &lt;a href="https://github.com/kubernetes/kubernetes/pull/130390"
target="_blank" rel="noopener">Add Job e2e for tracking failure count per index&lt;/a>
&lt;/li>
&lt;li>2025-03-01: &lt;a href="https://github.com/kubernetes/kubernetes/pull/130061"
target="_blank" rel="noopener">Graduate Backoff Limit Per Index as stable&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;h3 id="backofflimitperindex-inside-new-runpolicy">backoffLimitPerIndex inside new runPolicy&lt;/h3>
&lt;p>We could nest the new fields (&lt;code>maxFailedIndexes&lt;/code> and &lt;code>backoffLimitPerIndex&lt;/code>) inside
another field. Proposed alternative names for the field:&lt;/p>
&lt;ol>
&lt;li>&lt;code>runPolicy&lt;/code>&lt;/li>
&lt;li>&lt;code>completionPolicy&lt;/code>&lt;/li>
&lt;li>&lt;code>failurePolicy&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>For example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimit&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">4&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">runPolicy&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitPerIndex&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">maxFailedIndexes&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>...&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The option (3.) suggests that the fields are about declaring the Job as failed.
However, the &lt;code>backoffLimitPerIndex&lt;/code> field not only counts failures
towards the backoff limit per index, but also allows all indexes to execute
despite failures. More generic names, like (1.) and (2.), are therefore preferred.&lt;/p>
&lt;p>Also, options (1.) and (2.) may be reused in the context of the success policy,
which is the subject of
&lt;a href="https://github.com/kubernetes/kubernetes/issues/117600"
target="_blank" rel="noopener">Job success/completion policy&lt;/a>
.
It might be beneficial for the API to consider the conditions for the Job
success or failure under the same field.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>It is not clear what the best name is going forward. Also, it seems that
&lt;code>backoffLimitPerIndex&lt;/code> should be next to &lt;code>backoffLimit&lt;/code>. This was discussed,
and the consensus is that &amp;ldquo;top-level&amp;rdquo; is fine
(see &lt;a href="https://github.com/kubernetes/enhancements/pull/3967#discussion_r1196170192"
target="_blank" rel="noopener">here&lt;/a>
).&lt;/p>
&lt;h3 id="mark-job-complete-if-some-indexes-failed">Mark Job Complete if some indexes failed&lt;/h3>
&lt;p>The alternative to the proposed &lt;a href="#job-completion"
>Job completion&lt;/a>
strategy.&lt;/p>
&lt;p>Allow execution of all indexes, tolerating up to &lt;code>.spec.maxFailedIndexes&lt;/code>
failed indexes. Then, mark the Job &lt;code>Complete&lt;/code> even if some indexes failed.
The Job is marked &lt;code>Failed&lt;/code> only if the number of failed indexes exceeds the
specified &lt;code>.spec.maxFailedIndexes&lt;/code> limit. In that case, the &lt;code>reason&lt;/code>
field could be &lt;code>FailedIndexes&lt;/code>, and the &lt;code>message&lt;/code> field would list
the first few failed indexes.&lt;/p>
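&lt;p>The decision rule of this alternative can be sketched as follows. This is a minimal illustration, not the Job controller's code: the &lt;code>Condition&lt;/code> tuple, the &lt;code>terminal_condition&lt;/code> helper, the message wording, and the number of indexes shown in the message are all hypothetical.&lt;/p>

```python
from typing import NamedTuple


class Condition(NamedTuple):
    """Hypothetical stand-in for a Job condition (type, reason, message)."""
    type: str
    reason: str
    message: str


def terminal_condition(failed: list[int], max_failed_indexes: int,
                       shown: int = 3) -> Condition:
    """Pick the terminal condition once every index has finished.

    Under this alternative the Job is Complete unless the number of failed
    indexes exceeds max_failed_indexes.
    """
    if len(failed) > max_failed_indexes:
        # List only the first few failed indexes to keep the message short.
        listed = ",".join(str(i) for i in failed[:shown])
        suffix = ",..." if len(failed) > shown else ""
        return Condition("Failed", "FailedIndexes",
                         f"failed indexes: {listed}{suffix}")
    return Condition("Complete", "", "")


print(terminal_condition([2, 5], max_failed_indexes=2).type)        # Complete
print(terminal_condition([2, 5, 7, 9], max_failed_indexes=2).type)  # Failed
```

&lt;p>Note that in the first call the Job ends up &lt;code>Complete&lt;/code> with two failed indexes, which is the behavior that would force users to inspect the failed indexes separately.&lt;/p>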
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>This approach is less intuitive to the end-users of the API, compared
to the proposal. In particular, in some cases it would require custom logic in
the user&amp;rsquo;s controller to determine if the Job is failed.&lt;/p>
&lt;h3 id="support-backofflimitperindex-when-restartpolicyonfailure">Support backoffLimitPerIndex when restartPolicy=OnFailure&lt;/h3>
&lt;p>We&amp;rsquo;ve considered supporting &lt;code>backoffLimitPerIndex&lt;/code> when the pod&amp;rsquo;s &lt;code>restartPolicy=OnFailure&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>When restartPolicy=OnFailure, it is the Kubelet&amp;rsquo;s responsibility to restart the pod.
If, on the other hand, the maximum number of restarts were enforced by the
Job controller, race conditions would be possible. For example, in between the
checks by the Job controller, the Kubelet could execute more restarts than the specified
&lt;code>.spec.backoffLimit&lt;/code>. The problematic counting of failures with
restartPolicy=OnFailure is tracked in
&lt;a href="https://github.com/kubernetes/kubernetes/issues/109870"
target="_blank" rel="noopener">When restartPolicy=OnFailure the calculation for number of retries is not accurate&lt;/a>
.&lt;/p>
&lt;p>We believe that this feature can be supported well by using the pod-level API,
started in this KEP:
&lt;a href="https://github.com/kubernetes/enhancements/issues/3322"
target="_blank" rel="noopener">Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure&lt;/a>
.&lt;/p>
&lt;p>Once the pod-level API is done, we could consider supporting &lt;code>.spec.backoffLimitPerIndex&lt;/code>
when &lt;code>restartPolicy=OnFailure&lt;/code> is set in the pod&amp;rsquo;s spec. In this case we could set the pod-level
&lt;code>maxRestartTimes&lt;/code> field based on the Job-level &lt;code>.spec.backoffLimit&lt;/code>, leaving the
responsibility of enforcing the limit to the Kubelet.&lt;/p>
&lt;p>We will re-assess the decision if the Pod-level API graduates to GA in the
KEP: &lt;a href="https://github.com/kubernetes/enhancements/issues/3322"
target="_blank" rel="noopener">Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure&lt;/a>
.
For example, when &lt;code>maxRestartTimes&lt;/code> is specified for &lt;code>restartPolicy=OnFailure&lt;/code>,
we could support &lt;code>maxFailedIndexes&lt;/code>, which would control the number of
failed indexes (those that exceeded &lt;code>maxRestartTimes&lt;/code> and are marked failed).&lt;/p>
&lt;h3 id="mutually-exclusive-backofflimit-and-backofflimitperindex">Mutually exclusive backoffLimit and backoffLimitPerIndex&lt;/h3>
&lt;p>We&amp;rsquo;ve also considered making the &lt;code>backoffLimit&lt;/code> and &lt;code>backoffLimitPerIndex&lt;/code>
fields mutually exclusive.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>There is no way to control downgrade, as the value of &lt;code>backoffLimit&lt;/code> would
always default to 6. Also, old API clients may error trying to read or modify
Job objects with backoffLimit=nil.&lt;/p>
&lt;h3 id="use-bool-field">Use bool field&lt;/h3>
&lt;p>We&amp;rsquo;ve considered using a bool &lt;code>backoffLimitPerIndex&lt;/code> field. Here is an example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimit&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitPerIndex&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#a2f;font-weight:bold">true&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>...&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>It does not allow specifying both &lt;code>.spec.backoffLimit&lt;/code> and &lt;code>.spec.backoffLimitPerIndex&lt;/code>
in the same config. While setting both fields can be confusing in regular use,
it can be helpful to support the use case of controlled downgrade.&lt;/p>
&lt;h3 id="use-enum-field">Use enum field&lt;/h3>
&lt;p>We&amp;rsquo;ve considered using an enum &lt;code>backoffLimitTarget: Job|Index&lt;/code> field (another
name for this concept could be &lt;code>backoffLimitGranularity&lt;/code>) to specify that
failures should be tracked per index. Here, the default would be &lt;code>Job&lt;/code>. Here is
an example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>batch/v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">kind&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Job&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb">&lt;/span>&lt;span style="color:#008000;font-weight:bold">spec&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">parallelism&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completions&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">10&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">completionMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Indexed&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimit&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">1&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">backoffLimitTarget&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>Index&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>...&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>No targets other than &lt;code>Job&lt;/code> and &lt;code>Index&lt;/code> will be added in the foreseeable
future. Thus, it seems like an unnecessary complication. The dedicated name
&lt;code>backoffLimitPerIndex&lt;/code> also seems to better reflect the user&amp;rsquo;s intention.&lt;/p>
&lt;p>As with the bool field (see &lt;a href="#use-bool-field"
>Use bool field&lt;/a>
), it does
not allow setting both &lt;code>.spec.backoffLimit&lt;/code> and &lt;code>.spec.backoffLimitPerIndex&lt;/code>
to control the downgrade.&lt;/p>
&lt;h3 id="global-exponential-backoff-delay">Global exponential backoff delay&lt;/h3>
&lt;p>We could also keep the exponential backoff delay global, and let a per-index
delay be enabled by a dedicated API field in a future KEP, say &lt;code>backoffDelayPerIndex&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>The idea of using &lt;code>backoffLimitPerIndex&lt;/code> is to make the indexes independent.
Thus, failures or successes in one index should not influence backoff delays
for another index. We are leaving the decision open to community feedback and
discussions, though.&lt;/p>
&lt;h3 id="exponential-backoff-delay-with-in-memory-tracking">Exponential backoff delay with in-memory tracking&lt;/h3>
&lt;p>Instead of modifying the definition of pod&amp;rsquo;s finish time (see &lt;a href="#exponential-backoff-delay-issue"
>Exponential backoff delay issue&lt;/a>
)
we could keep track of the &amp;ldquo;failure time&amp;rdquo; for failed pods in-memory.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>As the number of failed indexes is capped at 10^5, keeping track of failure
times for all pods would take at least 8 B per failed pod, which is around 1 MiB per
Job in the worst-case scenario. This is a non-negligible memory increase.&lt;/p>
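&lt;p>The estimate above can be reproduced with a short calculation. This is a back-of-the-envelope sketch that assumes exactly one 8-byte timestamp per failed pod and one tracked pod per failed index, so it gives a lower bound; the constant and function names are illustrative only.&lt;/p>

```python
# Lower-bound estimate for tracking failure times in memory.
# Assumptions: one 8-byte timestamp per failed pod, one failed pod per index.
MAX_FAILED_INDEXES = 10**5   # cap on failed indexes stated in the text
BYTES_PER_TIMESTAMP = 8      # e.g. a 64-bit Unix timestamp


def tracking_bytes(failed_pods: int,
                   bytes_per_pod: int = BYTES_PER_TIMESTAMP) -> int:
    """Return the minimum number of bytes needed to track failure times."""
    return failed_pods * bytes_per_pod


worst_case = tracking_bytes(MAX_FAILED_INDEXES)
print(worst_case)                     # 800000 bytes
print(round(worst_case / 2**20, 2))   # 0.76 MiB
```

&lt;p>Real bookkeeping (map keys, pointers, several failed pods per index) would add to this figure, which is why the text rounds it to about 1 MiB per Job.&lt;/p>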
&lt;p>The extra tracking information is not needed if counting pods as terminated is done
in &lt;a href="https://github.com/kubernetes/enhancements/pull/3940"
target="_blank" rel="noopener">KEP-3939: Consider terminating pods in job controller&lt;/a>
.
In this case we can assume that the failure time of each pod does not change
after its phase is terminal.&lt;/p>
&lt;h3 id="alternative-ways-to-support-high-number-of-completions">Alternative ways to support high number of completions&lt;/h3>
&lt;p>In the current proposal, a high number of completions (like 10^6) is supported
by specifying the &lt;code>.spec.maxFailedIndexes&lt;/code> field. This way the size
of the &lt;code>failedIndexes&lt;/code> field is controlled.&lt;/p>
&lt;p>See below for alternative approaches proposed.&lt;/p>
&lt;h4 id="keep-failedindexes-field-as-a-bitmap">Keep failedIndexes field as a bitmap&lt;/h4>
&lt;p>To fit more failed indexes into the field, we could use a bitmap.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>it is not human-readable, which makes manual inspection harder&lt;/li>
&lt;li>it is harder to parse by user-provided controllers&lt;/li>
&lt;li>it introduces a format different from the one used to keep the succeeded indexes in &lt;code>.status.completedIndexes&lt;/code>&lt;/li>
&lt;/ul>
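&lt;p>For comparison, the text format used by &lt;code>.status.completedIndexes&lt;/code> (comma-separated decimal indexes, with runs of consecutive indexes written as &lt;code>first-last&lt;/code>, for example &lt;code>1,3-5,7&lt;/code>) is easy to read and to parse. The sketch below shows how a user-provided controller might expand such a string; &lt;code>parse_indexes&lt;/code> is a hypothetical helper, not part of any Kubernetes client library.&lt;/p>

```python
def parse_indexes(text: str) -> list[int]:
    """Expand an interval string such as "1,3-5,7" into a list of indexes."""
    indexes: list[int] = []
    if not text:
        return indexes
    for part in text.split(","):
        # "3-5" splits into ("3", "-", "5"); a single "7" gives ("7", "", "").
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        indexes.extend(range(start, end + 1))
    return indexes


print(parse_indexes("1,3-5,7"))  # [1, 3, 4, 5, 7]
```

&lt;p>A bitmap would need base64 or similar decoding plus bit arithmetic to answer the same question.&lt;/p>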
&lt;h4 id="keep-the-list-of-failed-indexes-in-a-dedicated-api-object">Keep the list of failed indexes in a dedicated API object&lt;/h4>
&lt;p>The idea is to keep the heavy fields outside of the Job API object itself.
It could be a new API object, for example JobFailedIndexes.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>This approach significantly increases the complexity of the Job controller, which
would need to register and manage another API object. It may also have a performance
impact, as the Job controller would need to query the object. Finally, it also
complicates things for end users who want to fetch the list of failed indexes.&lt;/p>
&lt;h4 id="implicit-limit-on-the-number-of-failed-indexes">Implicit limit on the number of failed indexes&lt;/h4>
&lt;p>An alternative is to have an implicit limit on the number of failed indexes, for
example, by capping the size of the &lt;code>.status.failedIndexes&lt;/code> field at
300KB. This could allow running a Job with completions at the level of 10^6, without
an explicit limit on the maximum number of failed indexes.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>It may behave unpredictably, impacting the user experience. For example,
when a user sets &lt;code>maxFailedIndexes&lt;/code> to 10^6, the Job may complete if the failed indexes
are consecutive, but the Job may also fail if the size of the object exceeds the
limits due to non-consecutive indexes failing.&lt;/p>
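&lt;p>The unpredictability can be illustrated by encoding two sets of failed indexes and comparing their sizes. The sketch assumes that &lt;code>.status.failedIndexes&lt;/code> uses the same interval text format as &lt;code>.status.completedIndexes&lt;/code> (runs of three or more consecutive indexes compressed to &lt;code>first-last&lt;/code>) and reads the 300KB limit as 300,000 bytes; &lt;code>format_indexes&lt;/code> and &lt;code>LIMIT_BYTES&lt;/code> are hypothetical names.&lt;/p>

```python
LIMIT_BYTES = 300 * 1000  # assumed reading of the 300KB implicit limit


def format_indexes(indexes: list[int]) -> str:
    """Encode sorted, unique indexes, compressing runs of 3+ as "first-last"."""
    parts: list[str] = []
    i = 0
    while i != len(indexes):
        j = i
        # Extend the run while the next index is consecutive.
        while j + 1 != len(indexes) and indexes[j + 1] == indexes[j] + 1:
            j += 1
        if j - i >= 2:
            parts.append(f"{indexes[i]}-{indexes[j]}")
        else:
            parts.extend(str(x) for x in indexes[i:j + 1])
        i = j + 1
    return ",".join(parts)


dense = format_indexes(list(range(10**6)))       # every index failed
sparse = format_indexes(list(range(0, 10**6, 2)))  # every other index failed

print(dense)                      # 0-999999
print(len(sparse) > LIMIT_BYTES)  # True
```

&lt;p>Here a Job with all 10^6 indexes failed fits in a few bytes, while a Job with half as many failures, spread out, exceeds the assumed limit.&lt;/p>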
&lt;h3 id="skip-uncountedterminatedpods-when-backofflimitperindex-is-used">Skip uncountedTerminatedPods when backoffLimitPerIndex is used&lt;/h3>
&lt;p>It&amp;rsquo;s been proposed (see &lt;a href="https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848"
target="_blank" rel="noopener">link&lt;/a>
)
that when backoffLimitPerIndex is used, then we could skip the interim step of
recording terminated pods in &lt;code>.status.uncountedTerminatedPods&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Reasons for deferring / rejecting&lt;/strong>&lt;/p>
&lt;p>First, if we stop using &lt;code>.status.uncountedTerminatedPods&lt;/code>, it means that
&lt;code>.status.failed&lt;/code> can no longer track the number of failed pods. Thus, it would
require a change of semantics to denote just the number of failed indexes. This
has downsides:&lt;/p>
&lt;ul>
&lt;li>two different semantics of the field, depending on the used feature&lt;/li>
&lt;li>lost information about some failed pods within an index (some users may care
to investigate succeeded indexes with at least one failed pod)&lt;/li>
&lt;/ul>
&lt;p>Second, it would only optimize the unhappy path, where there are failures. Also,
the saving is only 1 request per 500 failed pods, which does not seem essential.&lt;/p>
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Beta APIs Are Off by Default</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3136/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/3136/</guid><description>
&lt;h1 id="kep-3136-beta-apis-are-off-by-default">KEP-3136: Beta APIs Are Off by Default&lt;/h1>
&lt;!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->
&lt;!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.
Ensure the TOC is wrapped with
&lt;code>&amp;lt;!-- toc --&amp;rt;&amp;lt;!-- /toc --&amp;rt;&lt;/code>
tags, and then generate with `hack/update-toc.sh`.
-->
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Starting with the Kubernetes release in which this change is introduced, new beta APIs will not be enabled in clusters by default.
Existing beta APIs, and new versions of existing beta APIs, will continue to be enabled by default:
if v1beta.some.group is currently enabled by default and we create v1beta2.some.group, v1beta2.some.group will still be enabled by default.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Beta APIs are not considered stable and reliance upon APIs in this state leads to exposure to bugs,
guaranteed migration pain for users when the APIs move to stable, and the risk that dependencies will
grow around unfinished APIs.
Enabling beta APIs by default exacerbates these problems by turning them on in nearly every cluster.
We observed these problems as we removed long-standing beta APIs and the PRR survey tells us that over
90% of cluster-admins leave production clusters with these APIs enabled.
Unsuitability for production use is documented at &lt;a href="https://kubernetes.io/docs/reference/using-api/#api-versioning"
target="_blank" rel="noopener">https://kubernetes.io/docs/reference/using-api/#api-versioning&lt;/a>
(&amp;ldquo;The software is not recommended for production uses&amp;rdquo;), but defaulting on means they are present in nearly every
production cluster.
By disabling beta APIs by default, a cluster-admin can opt-in for specific APIs without having every
incomplete API present in the cluster.
This is now practical to do since conformance no longer relies on non-stable APIs.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ol>
&lt;li>Disable new beta APIs by default.&lt;/li>
&lt;li>Continue enabling existing beta APIs and new versions of existing beta APIs by default:
if v1beta.some.group is currently enabled by default and we create v1beta2.some.group, v1beta2.some.group will still be enabled by default.&lt;/li>
&lt;li>Allow enabling specific resources in beta. Enable coolnewjobtype.v1beta1.batch.k8s.io without enabling other-neat-job.v1beta1.batch.k8s.io&lt;/li>
&lt;/ol>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ol>
&lt;li>Change feature gate defaults.
Feature gates control new features (not just new APIs) and they are on by default for beta features.
This KEP is not changing the lifecycle flow for feature gates.
It is currently alpha=off-by-default, beta=on-by-default, stable=locked-to-on.&lt;/li>
&lt;/ol>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>New beta APIs will be placed into the &lt;code>DisableVersions&lt;/code> stanza instead of the &lt;code>EnableVersions&lt;/code> stanza (see &lt;a href="https://github.com/kubernetes/kubernetes/blob/0669da445fa8c1ae07c15c0827f0e83da11cbe58/pkg/controlplane/instance.go#L643"
target="_blank" rel="noopener">DefaultAPIResourceConfigSource&lt;/a>
).
The &lt;code>--runtime-config&lt;/code> flag will be extended to allow &lt;code>group/version/resource=true&lt;/code>, to enable specific resources.
To enable a beta API, a cluster-admin will have to add the appropriate &lt;code>--runtime-config&lt;/code> flags.&lt;/p>
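&lt;p>The flag format described above can be illustrated with a minimal Go sketch. The helpers &lt;code>parseRuntimeConfig&lt;/code> and &lt;code>isResourceKey&lt;/code> are hypothetical names invented for this example and are not the kube-apiserver implementation. The sketch also assumes that a key given without a value counts as enabled.&lt;/p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRuntimeConfig is a hypothetical helper, not the kube-apiserver parser.
// It turns a comma-separated --runtime-config value into a map from key to
// enabled state. Assumption: a key without "=value" is treated as enabled.
func parseRuntimeConfig(value string) (map[string]bool, error) {
	result := map[string]bool{}
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		key, raw, hasValue := strings.Cut(entry, "=")
		enabled := true
		if hasValue {
			parsed, err := strconv.ParseBool(raw)
			if err != nil {
				return nil, fmt.Errorf("invalid value %q for key %q", raw, key)
			}
			enabled = parsed
		}
		result[key] = enabled
	}
	return result, nil
}

// isResourceKey reports whether a key has the group/version/resource shape
// that this proposal adds, as opposed to group/version or api/beta.
func isResourceKey(key string) bool {
	return len(strings.Split(key, "/")) == 3
}

func main() {
	cfg, err := parseRuntimeConfig("batch.k8s.io/v1beta1/coolnewjobtype=true,api/beta=false")
	if err != nil {
		panic(err)
	}
	for key, enabled := range cfg {
		fmt.Println(key, enabled, isResourceKey(key))
	}
}
```

&lt;p>Keeping the three-segment check separate from parsing lets the existing two-segment keys continue to work unchanged in this sketch.&lt;/p>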
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>As a cluster-admin I want to enable the coolnewjobtype.v1beta1.batch.k8s.io API in my cluster.&lt;/p>
&lt;p>To do this I call &lt;code>kube-apiserver --runtime-config=batch.k8s.io/v1beta1/coolnewjobtype&lt;/code>.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>As a cluster-admin I want to enable all beta APIs as in past releases.&lt;/p>
&lt;p>To do this I call &lt;code>kube-apiserver --runtime-config=api/beta=true&lt;/code>.
This already exists and will continue to function.&lt;/p>
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;p>Installers, utilities, controllers, etc. that need to know if a certain beta API is present can continue to use the
existing discovery mechanisms (example: kubectl&amp;rsquo;s api-resources subcommand or the &lt;code>/apis/apps/v1&lt;/code> REST API).&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>Adoption of beta features will slow.
Given how Kubernetes is now treated, this is a good thing, not a bad thing.
Those users that want to move quickly and get new features can do so by enabling all beta features
or just enabling those that are important for their workload.
The &lt;a href="https://datastudio.google.com/reporting/2e9c7439-202b-48a9-8c57-4459e0d69c8d/page/Cv5HB"
target="_blank" rel="noopener">PRR survey&lt;/a>
shows that
over 30% of cluster-admins have enabled alpha features on at least some production clusters, so cluster-admins are willing and able to enable features
that are not on by default when they are desired.&lt;/p>
&lt;p>If two or more APIs are tightly coupled together, it will now be possible to enable them independently.
This can lead to unanticipated failure modes, but should only impact beta APIs with beta dependencies.
While this is a risk, it is not very common and components should fail safe as a general principle.&lt;/p>
&lt;p>If beta APIs are off by default, it&amp;rsquo;s possible that fewer clients will use them and provide feedback.
This is a risk, but early adopters are able to enable these features and have a history of enabling alpha features.
When moving from beta to GA, it will be important for SIGs to explicitly seek feedback.
We will address this by extending the PRR questionnaire to include a GA-targeted question to validate that the feature
was reasonably validated in production use-cases.&lt;/p>
&lt;p>If beta APIs are off by default, it is possible that SIGs don&amp;rsquo;t treat taking an API to beta as an indication of a &amp;ldquo;mostly-baked&amp;rdquo; API.
If this happens, then more transformation may be required.
Keeping our beta API rules consistent and continuing to enforce easy-to-use APIs seems to be the best option.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>Integration tests will be written to ensure that no new beta APIs are enabled in the kube-apiserver by default.
Unit tests will be written to ensure that the new flag functionality works as expected.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>This KEP is a policy KEP, not a feature KEP. It will start as GA.&lt;/p>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Integration and unit tests from above.&lt;/li>
&lt;li>updating the enablement docs for beta
&lt;ul>
&lt;li>&lt;a href="https://kubernetes.io/docs/reference/using-api/#api-versioning"
target="_blank" rel="noopener">https://kubernetes.io/docs/reference/using-api/#api-versioning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#using-a-feature"
target="_blank" rel="noopener">https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#using-a-feature&lt;/a>
(Even though that page is about feature gates, it is likely worth calling out there that new beta REST APIs are no
longer enabled by default.)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>email to &lt;a href="mailto:dev@kubernetes.io"
>dev@kubernetes.io&lt;/a>
to explain the new policy&lt;/li>
&lt;li>blog post explaining change in time for 1.24 release&lt;/li>
&lt;li>CI configuration updated to have a testing mode that enables beta APIs, likely set using &lt;code>kube-apiserver --runtime-config=api/beta=true&lt;/code>&lt;/li>
&lt;li>extend the PRR questionnaire to include a GA-targeted question to validate that the feature was reasonably validated in production use-cases.&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>The additional command line flag format for &lt;code>--runtime-config&lt;/code> will not be recognized by older versions of Kubernetes.
This means that when downgrading, cluster-admins will have to adjust their CLI arguments if they opted into a new beta API.
This is consistent with flag handling for new features today.
Because this only impacts new beta APIs, there is no behavior change for existing APIs on upgrade.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>Because this only impacts new beta APIs, there is no novel skew risk.&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;p>Not applicable because this is a policy KEP.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;!--
Why should this KEP _not_ be implemented?
-->
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
--></description></item><item><title>Resources: Beta Feature Gate Promotion Requirements</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/5241/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/5241/</guid><description>
&lt;!--
**Note:** When your KEP is complete, all of these comment blocks should be removed.
To get started with this template:
- [ ] **Pick a hosting SIG.**
Make sure that the problem space is something the SIG is interested in taking
up. KEPs should not be checked in without a sponsoring SIG.
- [ ] **Create an issue in kubernetes/enhancements**
When filing an enhancement tracking issue, please make sure to complete all
fields in that template. One of the fields asks for a link to the KEP. You
can leave that blank until this KEP is filed, and then go back to the
enhancement and add the link.
- [ ] **Make a copy of this template directory.**
Copy this template into the owning SIG's directory and name it
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) assigned to your enhancement above.
- [ ] **Fill out as much of the kep.yaml file as you can.**
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
"Status", and date-related fields.
- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary" and "Motivation" sections.
These should be easy if you've preflighted the idea of the KEP with the
appropriate SIG(s).
- [ ] **Create a PR for this KEP.**
Assign it to people in the SIG who are sponsoring this process.
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the KEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
Just because a KEP is merged does not mean it is complete or approved. Any KEP
marked as `provisional` is a working document and subject to change. You can
denote sections that are under active debate as follows:
```
&lt;&lt;[UNRESOLVED optional short context or usernames ]>>
Stuff that is being argued.
&lt;&lt;[/UNRESOLVED]>>
```
When editing KEPS, aim for tightly-scoped, single-topic PRs to keep discussions
focused. If you disagree with what is already in a document, open a new PR
with suggested changes.
One KEP corresponds to one "feature" or "enhancement" for its whole lifecycle.
You do not need a new KEP to move from beta to GA, for example. If
new details emerge that belong in the KEP, edit the KEP. Once a feature has become
"implemented", major changes should get new KEPs.
The canonical place for the latest set of instructions (and the likely source
of this file) is [here](https://raw.githubusercontent.com/kubernetes/enhancements/master/keps/NNNN-kep-template/README.md).
**Note:** Any PRs to move a KEP to `implementable`, or significant changes once
it is marked `implementable`, must be approved by each of the KEP approvers.
If none of those approvers are still appropriate, then changes to that list
should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
&lt;h1 id="kep-5241-beta-feature-gate-promotion-requirements">KEP-5241: Beta Feature Gate Promotion Requirements&lt;/h1>
&lt;!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->
&lt;!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.
Ensure the TOC is wrapped with
&lt;code>&amp;lt;!-- toc --&amp;gt;&amp;lt;!-- /toc --&amp;gt;&lt;/code>
tags, and then generate with `hack/update-toc.sh`.
-->
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#what-if-i-need-to-add-capability-to-my-feature"
>What if I need to add capability to my feature?&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#who-will-make-sure-that-new-keps-follow-the-promotion-rules"
>Who will make sure that new KEPs follow the promotion rules?&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#this-may-slow-the-rate-that-new-features-are-promoted"
>This may slow the rate that new features are promoted.&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.
For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Feature gates must meet all functional, security, monitoring, and testing requirements, and must
resolve all identified issues and gaps, before being enabled by default.
The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Feature gates that are enabled by default are enabled in every production Kubernetes cluster in the world.
We must avoid turning every production cluster into a testing cluster for unstable or incomplete features.
Even feature gates that make flags accessible but require a secondary configuration to use must be
stable, because it is unrealistic to expect everyone to understand the graduation stages of various flags
for each release: the only stages that really matter are &amp;ldquo;takes enabling an explicit alpha feature gate&amp;rdquo;
and &amp;ldquo;my production cluster accepts this as valid by default&amp;rdquo;.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Feature gates must meet all functional, security, monitoring, and testing requirements, and must
resolve all identified issues and gaps, before being enabled by default.&lt;/li>
&lt;li>The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Changing the rules that keep beta APIs off by default.&lt;/li>
&lt;li>Changing the imperfect mechanisms we have for API evolution.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Kubernetes feature gates have four levels: GA (locked-on), GA (disable-able), Beta, and Alpha.&lt;/p>
&lt;ol>
&lt;li>GA (locked-on) means that a feature gate is unconditionally enabled in all production kubernetes clusters and
that feature cannot be disabled.&lt;/li>
&lt;li>GA (disable-able) is only for feature gates that include a new API serialization that cannot be enabled by default
until the API reaches stable. This means that the first time the API is enabled in production, the feature will
be GA, but also can be disabled. This is a less common state and does not apply to most features.&lt;/li>
&lt;li>Beta means that a feature gate is usually enabled in all production Kubernetes clusters by default
and that feature can be disabled.
Exceptions exist for entirely new APIs and some node features, but this is broadly the case.&lt;/li>
&lt;li>Alpha means that a feature gate is disabled in all production Kubernetes clusters by default and
can be optionally enabled by setting a &lt;code>--feature-gate&lt;/code> command line argument.&lt;/li>
&lt;/ol>
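&lt;p>The four levels above can be summarized in a small Go sketch. The &lt;code>gateLevel&lt;/code> type and the &lt;code>resolve&lt;/code> function are hypothetical names invented for this illustration; they are not the Kubernetes feature gate API, and the Beta entry models only the usual enabled-by-default case.&lt;/p>

```go
package main

import "fmt"

// gateLevel is a hypothetical type used only for this illustration; it is
// not the k8s.io/component-base/featuregate API.
type gateLevel struct {
	Name    string
	Default bool // state when the cluster-admin sets nothing
	Locked  bool // true when the default cannot be overridden
}

// levels mirrors the four levels listed above.
var levels = []gateLevel{
	{Name: "GA (locked-on)", Default: true, Locked: true},
	{Name: "GA (disable-able)", Default: true, Locked: false},
	{Name: "Beta", Default: true, Locked: false}, // usual case; exceptions exist
	{Name: "Alpha", Default: false, Locked: false},
}

// resolve returns the effective state of a gate at the given level.
// requested holds the admin's explicit choice; explicit is false when the
// admin set nothing, in which case the default applies.
func resolve(level gateLevel, requested, explicit bool) (bool, error) {
	if !explicit {
		return level.Default, nil
	}
	if level.Locked {
		if requested != level.Default {
			return level.Default, fmt.Errorf("%s: gate is locked to %t", level.Name, level.Default)
		}
	}
	return requested, nil
}

func main() {
	for _, level := range levels {
		// Ask to disable a gate at every level and show what happens.
		enabled, err := resolve(level, false, true)
		fmt.Println(level.Name, enabled, err)
	}
}
```

&lt;p>In this sketch only the locked-on level refuses the request to disable, which is the property the following paragraphs rely on.&lt;/p>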
&lt;p>Making the jump to GA (cannot be disabled) without actual field experience is irresponsible.
The first time we take a feature gate enabled by default in production Kubernetes clusters, we must
have a way to disable the feature in case of unexpected stability, performance, or security issues.&lt;/p>
&lt;p>Enabling incomplete features in production Kubernetes clusters by default is irresponsible.
Features that are known to be incomplete naturally bring with them additional stability, performance, and security issues.
Once a feature has been enabled in a production Kubernetes cluster by default, adding to it carries
greater risk to upgrading clusters and the ecosystem.
The feature can easily have become relied upon by workloads and other platform extensions.
If adding those capabilities causes a stability, performance, or security incident, the
cost of disabling those features in a cluster becomes significantly greater, and disabling them breaks existing
clusters, workloads, and use-cases.
This posture makes upgrades higher risk than necessary.&lt;/p>
&lt;p>To balance these concerns, we are changing how we evaluate Beta and GA stability criteria.
The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.
Promotion from Beta to GA must have no significant change for the release.
This means that Beta criteria must include all functional, security, monitoring, and testing requirements along
with resolving all issues and gaps identified prior to beta.&lt;/p>
&lt;p>Phasing in larger features over time can be done by bringing separate feature gates through alpha, beta, and GA.
Each feature gate needs to meet the beta and GA criteria for completeness, functionality, security, monitoring, and testing.
After meeting the criteria for enabled by default, and at the SIG&amp;rsquo;s discretion, the new feature gate could be
set to enabled by default in the release it is introduced.
Importantly, the features need to behave in a way that allows old and new clients to interoperate, and new additions
to larger features must be independently disablable, each with its own path to GA.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;h4 id="what-if-i-need-to-add-capability-to-my-feature">What if I need to add capability to my feature?&lt;/h4>
&lt;p>To handle this situation, we described above how to add a second feature gate for the new behavior.
This provides a mechanism for adding needed capability, but ensures that
cluster-admins never end up stuck after upgrade because they rely on v1.Y-1 behavior that new capability
in v1.Y broke under the same feature gate.&lt;/p>
&lt;h4 id="who-will-make-sure-that-new-keps-follow-the-promotion-rules">Who will make sure that new KEPs follow the promotion rules?&lt;/h4>
&lt;p>We&amp;rsquo;ll adjust the KEP template to indicate the allowed criteria, so authors should notice.
SIG approvers should enforce those standards.
PRR approvers can be a final backstop.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>Once merged, this document is our new position until it is superseded by another position statement.&lt;/p>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;h3 id="this-may-slow-the-rate-that-new-features-are-promoted">This may slow the rate that new features are promoted.&lt;/h3>
&lt;p>For this to be true, we would have to have previously enabled feature gates in production that were knowingly
incomplete in functionality, security, monitoring, or testing, or that had known bugs.
We hope this was not the common case, but if it was common enough to have an impact, we&amp;rsquo;re pleased that
the result is preventing incomplete feature gates from being enabled in production clusters.&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;p>None proposed so far.&lt;/p></description></item><item><title>Resources: Block ExternalIPs via Admission Control</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2200/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/2200/</guid><description>
&lt;h1 id="kep-2200-deny-use-of-externalips-via-admission-control">KEP-2200: Deny use of ExternalIPs via admission control&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This proposal is in response to CVE-2020-8554: &amp;ldquo;Man in the middle using
LoadBalancer or ExternalIPs&amp;rdquo;.&lt;/p>
&lt;p>Fundamentally the &lt;code>Service.spec.externalIPs[]&lt;/code> feature is bad. It predates
&lt;code>Service.spec.type=LoadBalancer&lt;/code> and, now that we have that, has very few
use-cases. In short an unprivileged user can hijack an IP address via a
Service spec. In contrast, &lt;code>type=LoadBalancer&lt;/code> uses Service status, which most
normal users should not be allowed to write.&lt;/p>
&lt;p>This KEP proposes to block the use of ExternalIPs via a built-in admission
controller. The justification for this, as opposed to a webhook, is that 99%
of users will never use this feature, and making them ALL run a webhook seems
terrible.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>&lt;a href="https://github.com/kubernetes/kubernetes/issues/97110"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/97110&lt;/a>
&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;p>Make it possible to disable an insecure feature for the vast majority of users
very quickly.&lt;/p>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Make this the default (breaking change)&lt;/li>
&lt;li>Make the feature safe to use.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>This KEP proposes to add a built-in admission controller
&amp;ldquo;DenyServiceExternalIPs&amp;rdquo;, which rejects any CREATE or UPDATE operation which
adds a new value to &lt;code>Service.spec.externalIPs&lt;/code>. Existing values will be
tolerated and may be removed.&lt;/p>
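&lt;p>The rule described above can be sketched in Go as follows. This is a minimal illustration, not the actual admission plugin: &lt;code>newExternalIPs&lt;/code> and &lt;code>validateExternalIPs&lt;/code> are hypothetical names, the inputs are plain string slices instead of Service objects, and the addresses come from a documentation range.&lt;/p>

```go
package main

import "fmt"

// newExternalIPs is a hypothetical helper. It returns the entries in updated
// that are not present in existing.
func newExternalIPs(existing, updated []string) []string {
	seen := make(map[string]bool, len(existing))
	for _, ip := range existing {
		seen[ip] = true
	}
	var added []string
	for _, ip := range updated {
		if !seen[ip] {
			added = append(added, ip)
		}
	}
	return added
}

// validateExternalIPs rejects an operation that adds a value to externalIPs.
// For a CREATE, pass nil as existing. Keeping or removing existing values is
// allowed, matching the behavior described in the proposal.
func validateExternalIPs(existing, updated []string) error {
	if added := newExternalIPs(existing, updated); len(added) > 0 {
		return fmt.Errorf("may not add new externalIPs: %v", added)
	}
	return nil
}

func main() {
	// CREATE with an external IP: rejected.
	fmt.Println(validateExternalIPs(nil, []string{"203.0.113.7"}))
	// UPDATE that only removes an existing value: allowed.
	fmt.Println(validateExternalIPs([]string{"203.0.113.7", "203.0.113.8"}, []string{"203.0.113.8"}))
}
```

&lt;p>Comparing against the existing set, and not simply rejecting any non-empty list, is what lets Services that already carry external IPs keep being updated.&lt;/p>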
&lt;p>The number of rejected operations will be exposed by the standard admission
metrics (&lt;code>apiserver_admission_controller_admission_duration_seconds_bucket{name=&amp;quot;DenyServiceExternalIPs&amp;quot;,rejected=&amp;quot;true&amp;quot;, ...}&lt;/code>).&lt;/p>
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;p>Alice the admin does not want her users using this insecure feature. She
enables this admission controller and knows no user can use it. She can then
audit existing users and make them stop.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>Some installations may want to use this feature in a more controlled way. They
can use a custom webhook admission controller or a policy controller to enforce
their own rules.&lt;/p>
&lt;p>This is a precedent we should not set lightly. In this case the VAST majority
of users do not need this feature and this proposal is very surgical in nature.
As far as we know, there are few other unprivileged fields with this much
power anywhere in our API, and most of those already have some form of controls
on them.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>One simple admission controller should be enough to disable this misfeature.
Unfortunately, it cannot be on by default (that would be breaking).&lt;/p>
&lt;p>This means that platform-providers may need to expose an option to control
this. While we generally try to avoid mixing knobs that cluster-users would
set with knobs that cluster-providers own, it seems reasonable to close this as
soon as possible and consider better answers when we have more cases to
generalize from. See &amp;ldquo;Alternatives&amp;rdquo; below for more.&lt;/p>
&lt;p>See &amp;ldquo;Proposal&amp;rdquo; above.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;ul>
&lt;li>Unit tests to ensure CREATE and UPDATE operations are rejected when adding
new &lt;code>externalIPs&lt;/code>.&lt;/li>
&lt;li>Unit tests to ensure UPDATE operations allow existing &lt;code>externalIPs&lt;/code>.&lt;/li>
&lt;/ul>
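&lt;p>The rule these unit tests exercise can be sketched as follows. This is only an illustration of the behaviour described in this KEP, not the admission plugin&amp;rsquo;s actual code; the function names &lt;code>new_external_ips&lt;/code> and &lt;code>admit&lt;/code> are hypothetical, and the addresses come from a documentation-only range.&lt;/p>

```python
def new_external_ips(old_ips, new_ips):
    """Return the externalIPs values present in new_ips but not in old_ips."""
    existing = set(old_ips or [])
    return [ip for ip in (new_ips or []) if ip not in existing]


def admit(operation, old_ips, new_ips):
    """Return True when the request would be allowed under the sketched rule.

    Hypothetical helper: CREATE and UPDATE are rejected only when they add a
    value that was not already stored; removals and unrelated edits pass.
    """
    if operation == "CREATE":
        # Nothing is stored yet, so every listed value counts as new.
        old_ips = []
    elif operation != "UPDATE":
        # Only CREATE and UPDATE are described as being checked.
        return True
    return not new_external_ips(old_ips, new_ips)


stored = ["192.0.2.10", "192.0.2.11"]
print(admit("UPDATE", stored, ["192.0.2.11"]))                # removing a value
print(admit("UPDATE", stored, ["192.0.2.10", "192.0.2.99"]))  # changing a value
```

&lt;p>Comparing by set membership rather than by position means that reordering or removing existing values is tolerated, while changing a value is treated as adding a new one.&lt;/p>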
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>This feature will debut as &amp;ldquo;GA&amp;rdquo;, bypassing alpha and beta. It is opt-in
and very small in scope.&lt;/p>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;p>Cluster upgrades/downgrades should not be an issue.&lt;/p>
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>N/A&lt;/p>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Other flag
&lt;ul>
&lt;li>Flag name: &amp;ndash;enable-admission-plugins (existing)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Does enabling the feature change any default behavior?&lt;/strong>
Yes. The &lt;code>externalIPs&lt;/code> field will not be allowed to mutate, except to remove
existing values.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/strong>
Yes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What happens if we reenable the feature if it was previously rolled back?&lt;/strong>
No problem.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>
Unit tests should suffice.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can a rollout fail? Can it impact already running workloads?&lt;/strong>
It could start disallowing all Service operations if the controller were
buggy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What specific metrics should inform a rollback?&lt;/strong>
&lt;code>apiserver_admission_controller_admission_duration_seconds_bucket{name=&amp;quot;DenyServiceExternalIPs&amp;quot;,rejected=&amp;quot;true&amp;quot;, ...}&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/strong>
Manual testing:&lt;/p>
&lt;ul>
&lt;li>Create a service &amp;ldquo;extip&amp;rdquo; with 2 &lt;code>externalIPs&lt;/code> values&lt;/li>
&lt;li>Upgrade to new apiserver and enable new admission controller&lt;/li>
&lt;li>Try to create a new service using &lt;code>externalIPs&lt;/code> -&amp;gt; fail&lt;/li>
&lt;li>Try to change the &amp;ldquo;extip&amp;rdquo; service in an unrelated way -&amp;gt; OK&lt;/li>
&lt;li>Try to change the value of one &lt;code>externalIPs&lt;/code> value in extip -&amp;gt; fail&lt;/li>
&lt;li>Try to remove the [0] value of &lt;code>externalIPs&lt;/code> -&amp;gt; OK&lt;/li>
&lt;li>Try to add the removed value back -&amp;gt; fail&lt;/li>
&lt;li>Remove the last &lt;code>externalIPs&lt;/code> value -&amp;gt; OK&lt;/li>
&lt;li>Try to add the removed value back -&amp;gt; fail&lt;/li>
&lt;li>Revert to &amp;ldquo;standard&amp;rdquo; apiserver&lt;/li>
&lt;li>Try to add the removed value back -&amp;gt; OK&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can an operator determine if the feature is in use by workloads?&lt;/strong>
There are two possible facets of this: 1) Is the admission control enabled?
and 2) Are any users using externalIPs?&lt;/p>
&lt;p>To point 1, admins can look at their admission control config
(&amp;ndash;enable-admission-plugins) and look for &lt;code>DenyServiceExternalIPs&lt;/code> in that
list.&lt;/p>
&lt;p>To point 2, admins can look at all services in the cluster for use of
the &lt;code>externalIPs&lt;/code> field. Via kubectl:&lt;/p>
&lt;pre tabindex="0">&lt;code>kubectl get svc --all-namespaces -o go-template=&amp;#39;
{{- range .items -}}
{{if .spec.externalIPs -}}
{{.metadata.namespace}}/{{.metadata.name}}: {{.spec.externalIPs}}{{&amp;#34;\n&amp;#34;}}
{{- end}}
{{- end -}}
&amp;#39;
&lt;/code>&lt;/pre>&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/strong>
N/A&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the reasonable SLOs (Service Level Objectives) for the above SLIs?&lt;/strong>
N/A&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/strong>
This proposes to use the existing
&lt;code>apiserver_admission_controller_admission_duration_seconds_bucket{name=&amp;quot;DenyServiceExternalIPs&amp;quot;, ...}&lt;/code> metrics.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Does this feature depend on any specific services running in the cluster?&lt;/strong>
No.&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new calls to the cloud provider?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong>
No.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How does this feature react if the API server and/or etcd is unavailable?&lt;/strong>
It is part of the apiserver REST path, so it only runs while the apiserver is serving requests.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are other known failure modes?&lt;/strong>
None.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What steps should be taken if SLOs are not being met to determine the problem?&lt;/strong>
N/A&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2020-12-07: First draft&lt;/li>
&lt;li>2021-01-04: Edits to PRR section.&lt;/li>
&lt;li>2021-01-15: Edits from feedback.&lt;/li>
&lt;/ul>
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;p>It is a slippery slope to other ad hoc policies. Counter: this is very
surgical and overwhelmingly not a useful feature.&lt;/p>
&lt;p>Users who REALLY need this feature can enable it and apply whatever bespoke
admission policies they need (or not).&lt;/p>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;ul>
&lt;li>Force users to use policy controllers as webhooks. Forever.&lt;/li>
&lt;li>Make a breaking API change and disable or rip out the feature.&lt;/li>
&lt;li>Add a new flag telling validation logic to disallow this field.&lt;/li>
&lt;li>Make a more complex API to define which namespaces can use this feature
and/or which IPs they can use.&lt;/li>
&lt;li>Make a new API that allows cluster-users to enable this sort of field-block
without changing admission-control flags on apiserver.&lt;/li>
&lt;/ul></description></item><item><title>Resources: bound service account token improvements</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4193/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/4193/</guid><description>
&lt;h1 id="kep-4193-bound-service-account-token-improvements">KEP-4193: bound service account token improvements&lt;/h1>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#embedding-pods-bound-node-information-in-tokens"
>Embedding Pod&amp;rsquo;s bound Node information in tokens&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#allowing-serviceaccount-tokens-to-be-bound-to-a-node-object"
>Allowing ServiceAccount tokens to be bound to a Node object&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#extending-tokenreview-to-verify-tokens-bound-to-node-objects"
>Extending TokenReview to verify tokens bound to Node objects&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#including-a-uuid-jti-on-each-issued-jwt"
>Including a UUID (&lt;a href="https://datatracker.ietf.org/doc/html/rfc7519#section-4.1.7">JTI&lt;/a>) on each issued JWT&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#user-stories-optional"
>User Stories (Optional)&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#story-1"
>Story 1&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#story-2"
>Story 2&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#notesconstraintscaveats-optional"
>Notes/Constraints/Caveats (Optional)&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisite-testing-updates"
>Prerequisite testing updates&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#unit-tests"
>Unit tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#integration-tests"
>Integration tests&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#e2e-tests"
>e2e tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha"
>Alpha&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta"
>Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#ga"
>GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#upgrade--downgrade-strategy"
>Upgrade / Downgrade Strategy&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#version-skew-strategy"
>Version Skew Strategy&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#drawbacks"
>Drawbacks&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives"
>Alternatives&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#infrastructure-needed-optional"
>Infrastructure Needed (Optional)&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;p>Items marked with (R) are required &lt;em>prior to targeting to a milestone / release&lt;/em>.&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Enhancement issue in release milestone, which links to KEP dir in &lt;a href="https://git.k8s.io/enhancements"
target="_blank" rel="noopener">kubernetes/enhancements&lt;/a>
(not the initial KEP PR)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) KEP approvers have approved the KEP status as &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> e2e Tests for all Beta API Operations (endpoints)&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Ensure GA e2e tests meet requirements for &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> (R) Minimum Two Week Window for GA e2e tests to prove flake free&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Graduation criteria is in place
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox"> (R) &lt;a href="https://github.com/kubernetes/community/pull/1806"
target="_blank" rel="noopener">all GA Endpoints&lt;/a>
must be hit by &lt;a href="https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md"
target="_blank" rel="noopener">Conformance Tests&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review completed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> (R) Production readiness review approved&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> User-facing documentation has been created in &lt;a href="https://git.k8s.io/website"
target="_blank" rel="noopener">kubernetes/website&lt;/a>
, for publication to &lt;a href="https://kubernetes.io/"
target="_blank" rel="noopener">kubernetes.io&lt;/a>
&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox"> Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Token projection and alternative audiences on JWTs issued by the apiserver enable an external entity to validate the
identity and certain properties (e.g. associated ServiceAccount or Pod) of the caller.&lt;/p>
&lt;p>When attempting to verify a token associated with a Pod, it is not possible to verify that the Pod is associated with
a specific Node without &lt;code>get&lt;/code>ing the relevant Pod object (embedded as a private claim in the JWT) and cross-referencing
the named &lt;code>spec.nodeName&lt;/code>.&lt;/p>
&lt;p>To allow for a robust chain of identity verification from the requester all the way through to the projected token, it
would be beneficial if the Node object reference associated with the requesting Pod were embedded into the signed JWT.&lt;/p>
&lt;p>This is especially useful in cases where the external software wants to avoid replay attacks with projected service account
tokens. The external software can cross-reference the identity of the caller with the Node reference embedded in
the JWT, which allows this verification to be rooted upon the same root of trust that the kubelet/requesting entity uses.&lt;/p>
&lt;p>By embedding the identity of the Node the Pod is running on, we can cross-reference this information with an identity
passed along to the external service, thus removing the ability for a malicious actor to &amp;lsquo;replay&amp;rsquo; a projected token
from another Node.&lt;/p>
&lt;p>This will be implemented as an additional &lt;code>node&lt;/code> entry in the private claims embedded into each JWT returned by the
TokenRequest API, in a similar manner to how the ServiceAccount, Pod or Secret is referenced.&lt;/p>
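&lt;p>As a rough illustration of the shape this gives the token payload, the sketch below builds the private claims with an optional &lt;code>node&lt;/code> entry next to the existing references. The helper name &lt;code>private_claims&lt;/code>, the top-level &lt;code>kubernetes.io&lt;/code> claim name and all sample values are assumptions made for the example; the real claims are produced by the apiserver.&lt;/p>

```python
import json


def private_claims(namespace, service_account, pod=None, node=None):
    """Build a nested private-claims object; each reference is a (name, uid) pair.

    Hypothetical helper: the layout is an assumption based on the description
    above, where the Node is referenced in the same way as the ServiceAccount
    and Pod.
    """
    claims = {
        "namespace": namespace,
        "serviceaccount": {"name": service_account[0], "uid": service_account[1]},
    }
    if pod is not None:
        claims["pod"] = {"name": pod[0], "uid": pod[1]}
    if node is not None:
        # The new entry: only the name and UID of the Node are embedded.
        claims["node"] = {"name": node[0], "uid": node[1]}
    return {"kubernetes.io": claims}


example = private_claims(
    "default",
    ("builder", "sa-uid-example"),
    pod=("web-0", "pod-uid-example"),
    node=("node-a", "node-uid-example"),
)
print(json.dumps(example, indent=2, sort_keys=True))
```

&lt;p>An external verifier that already knows which Node the caller is running on can compare that identity with the &lt;code>node&lt;/code> entry, without fetching the Pod object.&lt;/p>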
&lt;p>Additionally, to provide a robust means of tracking token usage within the audit log we can embed a unique identifier for
each token, which can then also be recorded in future audit entries for requests made with this token.&lt;/p>
&lt;p>As we are adding support for &lt;code>node&lt;/code> metadata associated with Pods, we will also add the ability to bind a token/JWT
to a Node object directly, similar to how a token can be bound to a Pod or Secret resource today.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>Embedding information about the Node that a pod is running on into signed JWTs.&lt;/li>
&lt;li>Make it easier to track the actions a single token has taken, and cross-reference that back to the origin of the token
(via audit log inspection).&lt;/li>
&lt;li>Provide a means of checking whether a Pod&amp;rsquo;s token is associated with the same Node as it was associated with when the
initial TokenRequest was made (via an extra field that can be observed from the TokenReview API).&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Embedding requester information. This is discussed further in the alternatives
considered section, and a future KEP may revisit this.&lt;/li>
&lt;li>Embedding information beyond the immutable Node name and UID into the token. We aim to mimic what is done with the ref fields
for secret, pod and serviceaccount (not introduce any additional properties).&lt;/li>
&lt;li>Changing default behaviour of the SA authenticator to enforce the referenced Node object still exists.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;h3 id="embedding-pods-bound-node-information-in-tokens">Embedding Pod&amp;rsquo;s bound Node information in tokens&lt;/h3>
&lt;p>The kube-apiserver will be extended to automatically embed the &lt;code>name&lt;/code> and &lt;code>uid&lt;/code> of the &lt;em>Node&lt;/em> a Pod is associated
with (via &lt;code>spec.nodeName&lt;/code>) in generated tokens when a TokenRequest &lt;code>create&lt;/code> call is serviced.&lt;/p>
&lt;p>As the &amp;lsquo;pod&amp;rsquo; is already available in this area of code, which contains the &lt;code>nodeName&lt;/code>, we will just need to plumb
through a Getter for Node objects into the TokenRequest storage layer so the node&amp;rsquo;s UID can be fetched, similar to
what is done for pod &amp;amp; secret objects.&lt;/p>
&lt;h3 id="allowing-serviceaccount-tokens-to-be-bound-to-a-node-object">Allowing ServiceAccount tokens to be bound to a Node object&lt;/h3>
&lt;p>Similar to how a token can be bound to a Pod or Secret object, we will also extend the TokenRequest API to allow
binding directly to Node objects (without needing to bind to a Pod as well).&lt;/p>
&lt;p>This allows users to obtain a token that is tied specifically to the &lt;em>Node&lt;/em> object&amp;rsquo;s lifecycle, i.e. when the Node
object is deleted, the token will be invalidated.&lt;/p>
&lt;h3 id="extending-tokenreview-to-verify-tokens-bound-to-node-objects">Extending TokenReview to verify tokens bound to Node objects&lt;/h3>
&lt;p>The SA authenticator will be extended to check whether a token that is bound to a Node object is still valid, by
first checking whether the Node object with the name given in the JWT still exists, and if it does, validating whether
the UID of that Node is equal to the UID embedded in the token.&lt;/p>
&lt;p>Tokens bound to Pod objects will continue to only validate the referenced pod.
This avoids changing the previous behaviour for validation of tokens issued for pods.
Deletion of a node triggers deletion of the pods associated with that node after a &lt;a href="https://github.com/kubernetes/kubernetes/blob/fc786dcd1d2efcc241e0e2392086934f2806555d/pkg/controller/podgc/gc_controller.go#L50-L52"
target="_blank" rel="noopener">period of time&lt;/a>
,
which ultimately invalidates those tokens.&lt;/p>
&lt;p>Tokens that are directly bound to Node objects will always validate the name and UID, as binding tokens to Node objects
is a new option and therefore enforcing this validation check from day 1 is non-breaking.&lt;/p>
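&lt;p>The check described for Node-bound tokens can be sketched as below. This is an illustration under assumptions, not the SA authenticator&amp;rsquo;s code: &lt;code>validate_node_binding&lt;/code> and the &lt;code>get_node&lt;/code> lookup are hypothetical names, and a plain dictionary stands in for the Node getter.&lt;/p>

```python
def validate_node_binding(claimed_name, claimed_uid, get_node):
    """Return (ok, reason) for a token bound directly to a Node object.

    get_node is a hypothetical lookup that returns a dict with a 'uid' key,
    or None when no Node with that name exists.
    """
    node = get_node(claimed_name)
    if node is None:
        # The Node named in the token has been deleted.
        return False, "node no longer exists"
    if node["uid"] != claimed_uid:
        # A Node with the same name was recreated, so the UID differs.
        return False, "node UID does not match"
    return True, ""


# A dictionary stands in for the cluster's Node objects.
nodes = {"node-a": {"uid": "uid-1"}}
print(validate_node_binding("node-a", "uid-1", nodes.get))
print(validate_node_binding("node-a", "uid-2", nodes.get))
```

&lt;p>Checking the UID as well as the name is what invalidates a token when a Node is deleted and later recreated under the same name.&lt;/p>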
&lt;h3 id="including-a-uuid-jti-on-each-issued-jwt">Including a UUID (&lt;a href="https://datatracker.ietf.org/doc/html/rfc7519#section-4.1.7"
target="_blank" rel="noopener">JTI&lt;/a>
) on each issued JWT&lt;/h3>
&lt;p>When a TokenRequest is being issued/fulfilled, we will modify the issuing code to also generate and embed a UUID which
can be later used to trace the requests that a specific issued token has made to the apiserver via the audit log.&lt;/p>
&lt;p>This will require changing the JWT issuing code to actually generate this UUID, as well as extending the code around the
audit log to have it record this information into audit entries when a token is issued (via the &lt;code>authentication.kubernetes.io/issued-credential-id&lt;/code> audit annotation).&lt;/p>
&lt;p>As this UUID will be embedded as part of a user&amp;rsquo;s ExtraInfo, it&amp;rsquo;ll automatically be persisted into audit events for all
requests made using a token that embeds a credential identifier (as &lt;code>authentication.kubernetes.io/credential-id&lt;/code>).&lt;/p>
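&lt;p>A minimal sketch of the issuing step, assuming the identifier is a random (version 4) UUID stored in the standard &lt;code>jti&lt;/code> claim. The helper name &lt;code>with_jti&lt;/code> and the sample claims are hypothetical; signing the token and recording the audit annotation are left out.&lt;/p>

```python
import uuid


def with_jti(claims):
    """Return a copy of claims with a freshly generated random UUID as 'jti'."""
    stamped = dict(claims)
    # uuid4() is random, so every issued token gets its own identifier.
    stamped["jti"] = str(uuid.uuid4())
    return stamped


first = with_jti({"sub": "system:serviceaccount:default:builder"})
second = with_jti({"sub": "system:serviceaccount:default:builder"})
print(first["jti"] != second["jti"])
```

&lt;p>Because the identifier is generated per token rather than per ServiceAccount, two tokens issued for the same identity remain distinguishable in the audit log.&lt;/p>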
&lt;h3 id="user-stories-optional">User Stories (Optional)&lt;/h3>
&lt;h4 id="story-1">Story 1&lt;/h4>
&lt;p>Alice hosts a service that verifies host identity using an out-of-band mechanism and also submits a bound token that
contains a node assertion.&lt;/p>
&lt;p>The node assertion can be checked to ensure the host identity matches the node assertion of the token.&lt;/p>
&lt;h4 id="story-2">Story 2&lt;/h4>
&lt;p>Bob is an administrator of a cluster and has noticed some strange request patterns from an unknown service account.&lt;/p>
&lt;p>Bob would like to understand who initially requested or authorised the issuance of this token. To do so, Bob looks up the JTI
of the token making the suspicious requests by inspecting the user&amp;rsquo;s ExtraInfo in the audit log entries for these suspect requests.&lt;/p>
&lt;p>This JTI is then used for a further audit log lookup - namely, looking for the TokenRequest &lt;code>create&lt;/code> call which contains
the audit annotation with key &lt;code>authentication.kubernetes.io/issued-credential-id&lt;/code> and the value set to that of the suspect token.&lt;/p>
&lt;p>This allows Bob to determine precisely who made the original request for this token, and (depending on the &amp;lsquo;chain&amp;rsquo;
above this token), allows Bob to recursively perform this lookup to find all involved parties that led to this token
being issued.&lt;/p>
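&lt;p>Bob&amp;rsquo;s two lookups can be sketched over a list of audit events that have already been parsed from JSON. The event layout here is simplified, the helper names are hypothetical, and the name of the ExtraInfo key (&lt;code>CREDENTIAL_ID_KEY&lt;/code>) is an assumption written with the same prefix as the annotation key named in this story. The sketch also assumes both places record the identifier as the same string.&lt;/p>

```python
# Assumed name of the ExtraInfo key that carries the token's identifier.
CREDENTIAL_ID_KEY = "authentication.kubernetes.io/credential-id"
# Audit annotation key named in the story above.
ISSUED_KEY = "authentication.kubernetes.io/issued-credential-id"


def credential_ids(event):
    """Return the credential identifiers recorded for the requesting user."""
    extra = event.get("user", {}).get("extra", {})
    return extra.get(CREDENTIAL_ID_KEY, [])


def find_issuing_event(events, jti):
    """Return the 'create' event whose annotation names this identifier."""
    for event in events:
        if event.get("verb") != "create":
            continue
        if event.get("annotations", {}).get(ISSUED_KEY) == jti:
            return event
    return None


# Invented sample data: one issuing call followed by one suspicious request.
events = [
    {
        "verb": "create",
        "user": {"username": "system:node:node-a", "extra": {}},
        "annotations": {ISSUED_KEY: "example-jti-1"},
    },
    {
        "verb": "list",
        "user": {
            "username": "system:serviceaccount:default:builder",
            "extra": {CREDENTIAL_ID_KEY: ["example-jti-1"]},
        },
    },
]

suspect = events[1]
for jti in credential_ids(suspect):
    origin = find_issuing_event(events, jti)
    if origin is not None:
        print(origin["user"]["username"])
```

&lt;p>Repeating the lookup with the credential identifier of the issuing event&amp;rsquo;s own user, when it has one, walks further up the chain.&lt;/p>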
&lt;h3 id="notesconstraintscaveats-optional">Notes/Constraints/Caveats (Optional)&lt;/h3>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;ul>
&lt;li>Adding additional cross-referencing validation checks into the TokenReview API may break some user workflows that
involve deleting Node objects and restarting kubelets to allow them to be recreated. As a result, the TokenReview
API will &lt;strong>NOT&lt;/strong> be modified to permit tightening this validation behaviour. Instead, we rely on the existing protections &amp;amp;
mechanisms for invalidating a Node&amp;lt;&amp;gt;Pod binding (i.e. auto-deletion of the Pods a fixed time period after the Node object
is deleted).&lt;/li>
&lt;/ul>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;p>The &lt;code>pkg/serviceaccount/claims.go&lt;/code> file&amp;rsquo;s &lt;code>Claims&lt;/code> &lt;a href="https://github.com/kubernetes/kubernetes/blob/99190634ab252604a4496882912ac328542d649d/pkg/serviceaccount/claims.go#L61-L97"
target="_blank" rel="noopener">function&lt;/a>
will be modified to accept a &lt;code>core.Node&lt;/code>. This will be made available in the call-site for this function
(&lt;code>pkg/registry/core/serviceaccount/storage/token.go&lt;/code>) by passing through a Getter for Node objects, similar to how
secret objects are fetched.&lt;/p>
&lt;p>The associated &lt;code>Validator&lt;/code> used to validate and parse service account tokens will also be extended to extract this
new information from tokens if it is available.&lt;/p>
&lt;p>In &lt;code>pkg/registry/core/serviceaccount/storage/token.go&lt;/code>, the &lt;code>Create&lt;/code> function will also be extended to add an audit
annotation including the generated service account token&amp;rsquo;s JTI, to make it possible to map a future request which
used this token back to the initial point at which the token was generated (i.e. to allow deeper inspection of who
the requester is).&lt;/p>
&lt;p>In the file &lt;code>staging/src/k8s.io/apiserver/pkg/authentication/serviceaccount/util.go&lt;/code>, the &lt;code>ServiceAccountInfo.UserInfo&lt;/code>
method will be modified to also return this information in the returned &lt;code>user.Info&lt;/code> struct.&lt;/p>
&lt;p>These proposed changes can also be reviewed in &lt;a href="https://github.com/kubernetes/kubernetes/pull/119739"
target="_blank" rel="noopener">the draft pull request&lt;/a>
.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.&lt;/p>
&lt;h5 id="prerequisite-testing-updates">Prerequisite testing updates&lt;/h5>
&lt;h5 id="unit-tests">Unit tests&lt;/h5>
&lt;p>&lt;code>pkg/registry/core/serviceaccount/storage&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>Coverage before (&lt;code>release-1.28&lt;/code>): &lt;code>k8s.io/kubernetes/pkg/registry/core/serviceaccount/storage 8.354s coverage: 10.7% of statements&lt;/code>&lt;/li>
&lt;li>Coverage after: &lt;code>k8s.io/kubernetes/pkg/registry/core/serviceaccount/storage 8.394s coverage: 8.7% of statements&lt;/code>&lt;/li>
&lt;li>Test ensuring audit annotations are added to audit events for the &lt;code>serviceaccounts/&amp;lt;name&amp;gt;/token&lt;/code> subresource.&lt;/li>
&lt;li>Tests verifying it&amp;rsquo;s possible to bind a token to a Node object.&lt;/li>
&lt;li>Tests ensuring tokens bound to pod objects also embed associated node metadata.&lt;/li>
&lt;li>NOTE: the majority of this file is untested with &lt;em>unit tests&lt;/em> (instead, using integration tests). &lt;a href="https://github.com/kubernetes/kubernetes/issues/121515"
target="_blank" rel="noopener">#121515&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>staging/src/k8s.io/apiserver/pkg/authentication/serviceaccount&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>Coverage before (&lt;code>release-1.28&lt;/code>): &lt;code>k8s.io/apiserver/pkg/authentication/serviceaccount 0.567s coverage: 60.8% of statements&lt;/code>&lt;/li>
&lt;li>Coverage after: &lt;code>k8s.io/apiserver/pkg/authentication/serviceaccount 0.569s coverage: 70.1% of statements&lt;/code>&lt;/li>
&lt;li>Test ensuring that service account info (JTI, node name and UID) is correctly extracted from a presented JWT.&lt;/li>
&lt;li>Tests to ensure the information is NOT extracted when the feature gate is disabled.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>pkg/serviceaccount&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>Coverage before (&lt;code>release-1.28&lt;/code>): &lt;code>k8s.io/kubernetes/pkg/serviceaccount 0.755s coverage: 72.4% of statements&lt;/code>&lt;/li>
&lt;li>Coverage after: &lt;code>k8s.io/kubernetes/pkg/serviceaccount 0.786s coverage: 72.7% of statements&lt;/code>&lt;/li>
&lt;li>Extending tests to ensure Node info is embedded into extended claims (name and uid)&lt;/li>
&lt;li>Tests to ensure &lt;code>ID&lt;/code>/&lt;code>JTI&lt;/code> field is always set to a random UUID.&lt;/li>
&lt;li>Tests to ensure the info embedded on a JWT is extracted from the token and into the ServiceAccountInfo when
a token is validated.&lt;/li>
&lt;li>Tests to ensure the information is NOT embedded or extracted when the feature gate is disabled.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>staging/src/k8s.io/kubectl/pkg/cmd/create&lt;/code>:&lt;/p>
&lt;ul>
&lt;li>Coverage before (&lt;code>release-1.28&lt;/code>): &lt;code>k8s.io/kubectl/pkg/cmd/create 0.995s coverage: 55.1% of statements&lt;/code>&lt;/li>
&lt;li>Coverage after: &lt;code>k8s.io/kubectl/pkg/cmd/create 0.949s coverage: 55.2% of statements&lt;/code>&lt;/li>
&lt;li>Add tests ensuring it&amp;rsquo;s possible to request a token that is bound to a Node object (gated by environment variable during alpha)&lt;/li>
&lt;/ul>
&lt;h5 id="integration-tests">Integration tests&lt;/h5>
&lt;ul>
&lt;li>Test that calls the TokenRequest API to obtain a token that is bound to a Pod. It should assert that the token embeds
a reference to the Pod object, as well as to the Node object that the Pod is assigned to.&lt;/li>
&lt;li>Test that calls the TokenRequest API to obtain a token that is bound to a Node. It should assert that the token embeds
a reference to the Node object.&lt;/li>
&lt;li>Test that calls the TokenReview API with a token that is bound to a Node object that no longer exists. It should
assert that the token does not validate once the Node has been deleted.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>k8s.io/kubernetes/test/integration/auth/svcaccttoken_test.go&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.29/test/integration/auth/svcaccttoken_test.go#L247"
target="_blank" rel="noopener">TestServiceAccountTokenCreate_bound to a service account and pod&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.29/test/integration/auth/svcaccttoken_test.go#L415"
target="_blank" rel="noopener">TestServiceAccountTokenCreate_bound to service account and a pod with an assigned nodeName that does not exist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.29/test/integration/auth/svcaccttoken_test.go#L416"
target="_blank" rel="noopener">TestServiceAccountTokenCreate_bound to service account and a pod with an assigned nodeName&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.29/test/integration/auth/svcaccttoken_test.go#L418"
target="_blank" rel="noopener">TestServiceAccountTokenCreate_fails to bind to a Node if the feature gate is disabled&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/blob/release-1.29/test/integration/auth/svcaccttoken_test.go#L448"
target="_blank" rel="noopener">TestServiceAccountTokenCreate_bound to service account and node&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h5 id="e2e-tests">e2e tests&lt;/h5>
&lt;ul>
&lt;li>Extend existing TokenRequest e2e tests to check that the scheduled Node&amp;rsquo;s name and UID are embedded and that a generated JTI is present.&lt;/li>
&lt;/ul>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="alpha">Alpha&lt;/h4>
&lt;ul>
&lt;li>JTI feature implemented behind a feature flag &lt;code>ServiceAccountTokenJTI&lt;/code>.&lt;/li>
&lt;li>Embedding Pod&amp;rsquo;s assigned Node name/uid feature implemented behind a feature flag &lt;code>ServiceAccountTokenPodNodeInfo&lt;/code>.&lt;/li>
&lt;li>Support verifying JWTs bound to Node objects with feature flag &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code>.&lt;/li>
&lt;li>Allowing tokens bound to Node objects to be issued with feature flag &lt;code>ServiceAccountTokenNodeBinding&lt;/code>.&lt;/li>
&lt;li>Initial e2e tests completed and enabled&lt;/li>
&lt;/ul>
&lt;h4 id="beta">Beta&lt;/h4>
&lt;ul>
&lt;li>Decide what the default of the new flag should be
&lt;ul>
&lt;li>Decision: this flag was not added during alpha, and MAY be added post-beta, but will definitely default to &lt;strong>off&lt;/strong>.&lt;/li>
&lt;li>This does not need to block promotion of ServiceAccountTokenPodNodeInfo feature as a result.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Decide if using an audit annotation is the correct approach
&lt;ul>
&lt;li>Decision: audit annotation is the correct approach as this is only for &lt;code>serviceaccounts/&amp;lt;name&amp;gt;/token&lt;/code> requests, not all&lt;/li>
&lt;li>Renaming audit annotation to &lt;code>authentication.kubernetes.io/issued-credential-id&lt;/code> to disambiguate from &lt;code>authentication.kubernetes.io/credential-id&lt;/code> in user&amp;rsquo;s ExtraInfo&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Docs around the SA JWT schema (this does not exist today)&lt;/li>
&lt;/ul>
&lt;h4 id="ga">GA&lt;/h4>
&lt;ul>
&lt;li>Allowing time for feedback and any other user-experience reports.&lt;/li>
&lt;li>Conformance tests&lt;/li>
&lt;li>Consolidate the existing service account docs to be more coherent and avoid duplication,
especially with regard to consuming service account tokens outside of Kubernetes:
&lt;ul>
&lt;li>&lt;a href="https://kubernetes.io/docs/concepts/security/service-accounts"
target="_blank" rel="noopener">https://kubernetes.io/docs/concepts/security/service-accounts&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin"
target="_blank" rel="noopener">https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account"
target="_blank" rel="noopener">https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade--downgrade-strategy">Upgrade / Downgrade Strategy&lt;/h3>
&lt;!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade, in order to make use of the enhancement?
-->
&lt;h3 id="version-skew-strategy">Version Skew Strategy&lt;/h3>
&lt;p>Embedding a Pod&amp;rsquo;s assigned Node name into a JWT does not require any coordination between clients and the apiserver,
as no components require this information to be embedded. This is purely additive, and the only rollback concerns
would be around third party software that consumes this information. This software should always verify whether a
&lt;code>node&lt;/code> claim is embedded into tokens if they require using it, and provide a fall-back behaviour (i.e. a GET to the
apiserver to fetch the Pod &amp;amp; Node object) if they need to maintain compatibility with older apiservers.&lt;/p>
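&lt;p>The fall-back behaviour described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the token has already been verified and decoded into a dict, that the private claims live under a &lt;code>kubernetes.io&lt;/code> key with &lt;code>pod&lt;/code> and &lt;code>node&lt;/code> entries, and that &lt;code>fetch_node&lt;/code> is a hypothetical callback that performs the GET against the apiserver.&lt;/p>

```python
from typing import Callable, Optional, Tuple

NodeRef = Tuple[str, str]  # (node name, node UID)


def node_for_token(
    claims: dict,
    fetch_node: Callable[[str, str], Optional[NodeRef]],
) -> Optional[NodeRef]:
    """Return the Node a Pod-bound token refers to, preferring the embedded claim."""
    k8s = claims.get("kubernetes.io", {})  # assumed private-claim namespace
    node = k8s.get("node")
    if node and node.get("name") and node.get("uid"):
        # Newer apiserver: the node reference is embedded in the token.
        return node["name"], node["uid"]
    pod = k8s.get("pod")
    if not pod:
        # The token is not bound to a Pod, so there is no Node to resolve.
        return None
    # Older apiserver: fall back to looking up the Pod (and its Node) via the API.
    return fetch_node(k8s.get("namespace", ""), pod["name"])
```

&lt;p>A consumer written this way keeps working against apiservers that do not embed the claim, at the cost of an extra API round trip in the fall-back path.&lt;/p>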
&lt;p>Binding a token to a Node introduces a new validation mechanism, and therefore we must allow one release cycle after
introducing the ability to &lt;strong>validate&lt;/strong> tokens, before we can begin permitting &lt;strong>issuance&lt;/strong> of these tokens.
This is a critical step from a security standpoint, as otherwise the following sequence could occur:&lt;/p>
&lt;ol>
&lt;li>An administrator upgrades their apiserver/control plane.&lt;/li>
&lt;li>A user requests a token bound to a Node, expecting it to be invalidated when the Node is deleted.&lt;/li>
&lt;li>The administrator rolls back the apiserver to an older version.&lt;/li>
&lt;li>The Node object is deleted.&lt;/li>
&lt;li>The token issued in (2) continues to be accepted/validated, despite the Node object no longer existing.&lt;/li>
&lt;/ol>
&lt;p>By graduating validation a release &lt;strong>earlier&lt;/strong> than issuance, we can ensure any tokens that are bound to a Node
object will be correctly validated even after a rollback.&lt;/p>
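&lt;p>The rollback hazard can be illustrated with a toy model. The &lt;code>validate&lt;/code> function below is a hypothetical stand-in for the authenticator, not the real apiserver code, and the flat &lt;code>node_uid&lt;/code> field is a simplification of the actual claim layout.&lt;/p>

```python
def validate(token: dict, existing_node_uids: set, validates_node_claim: bool) -> bool:
    """Toy stand-in for apiserver token validation (not the real authenticator)."""
    node_uid = token.get("node_uid")
    if node_uid is None or not validates_node_claim:
        # An apiserver that does not know about the node claim simply ignores it.
        return True
    return node_uid in existing_node_uids


# A token is issued while the Node still exists.
token = {"sub": "system:serviceaccount:default:demo", "node_uid": "uid-a"}
nodes = {"uid-a"}

# The Node object is then deleted.
nodes.discard("uid-a")

# An apiserver without validation support still accepts the token;
# one with validation support rejects it.
accepted_without_validation = validate(token, nodes, validates_node_claim=False)
accepted_with_validation = validate(token, nodes, validates_node_claim=True)
```

&lt;p>Graduating validation first means the version rolled back to already behaves like the second call.&lt;/p>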
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;ul>
&lt;li>&lt;code>ServiceAccountTokenJTI&lt;/code> feature flag will toggle including JTI information in tokens, as well as recording JTIs in the audit log / the SA user info.&lt;/li>
&lt;li>&lt;code>ServiceAccountTokenPodNodeInfo&lt;/code> feature flag will toggle including node info associated with pods in tokens.&lt;/li>
&lt;li>&lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> feature flag will toggle the apiserver validating Node claims in node bound service account tokens.&lt;/li>
&lt;li>&lt;code>ServiceAccountTokenNodeBinding&lt;/code> feature flag will toggle allowing service account tokens to be bound to Node objects.&lt;/li>
&lt;/ul>
&lt;p>The &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> feature will graduate to beta in version v1.30, a release earlier than &lt;code>ServiceAccountTokenNodeBinding&lt;/code>
to ensure a safe rollback from version v1.31 to v1.30 (more info below in rollback considerations section).&lt;/p>
&lt;p>The &lt;code>ServiceAccountTokenNodeBinding&lt;/code> feature gate must only be enabled once the &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> feature has been enabled.
Disabling the &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> feature whilst keeping &lt;code>ServiceAccountTokenNodeBinding&lt;/code> enabled would allow tokens that are expected to
be bound to the lifetime of a particular Node to validate even if that Node no longer exists.
The &lt;a href="#rollout-upgrade-and-rollback-planning"
>rollout &amp;amp; rollback section&lt;/a>
below goes into further detail.&lt;/p>
&lt;p>All other feature flags can be disabled without any unexpected adverse effects or coordination required.&lt;/p>
&lt;h6 id="how-can-this-feature-be-enabled--disabled-in-a-live-cluster">How can this feature be enabled / disabled in a live cluster?&lt;/h6>
&lt;ul>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> Feature gate&lt;/p>
&lt;ul>
&lt;li>Feature gate name: &lt;code>ServiceAccountTokenJTI&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> Feature gate&lt;/p>
&lt;ul>
&lt;li>Feature gate name: &lt;code>ServiceAccountTokenPodNodeInfo&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> Feature gate&lt;/p>
&lt;ul>
&lt;li>Feature gate name: &lt;code>ServiceAccountTokenNodeBinding&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> Feature gate&lt;/p>
&lt;ul>
&lt;li>Feature gate name: &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h6 id="does-enabling-the-feature-change-any-default-behavior">Does enabling the feature change any default behavior?&lt;/h6>
&lt;p>Enabling the &lt;code>ServiceAccountTokenPodNodeInfo&lt;/code> and/or &lt;code>ServiceAccountTokenJTI&lt;/code> feature gate will cause additional information
to be stored/persisted into service account JWTs, as well as new audit annotations being recorded in the audit log.
This is all purely additive, so no changes to existing features, schemas or fields are expected.&lt;/p>
&lt;p>Enabling the &lt;code>ServiceAccountTokenNodeBinding&lt;/code> feature gate will permit binding tokens to Node objects, which is a change in
behaviour (albeit not to an existing feature, so is not problematic).&lt;/p>
&lt;h6 id="can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement">Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?&lt;/h6>
&lt;p>Yes. Future tokens will then not embed this information. Any existing issued tokens &lt;strong>will&lt;/strong> still have this
information embedded, however.&lt;/p>
&lt;p>If these fields are deemed to be problematic for other systems interpreting these tokens, users will need to re-issue
these tokens before presenting them elsewhere.&lt;/p>
&lt;p>Once the feature(s) have graduated to GA, it will not be possible to disable this behaviour.&lt;/p>
&lt;h6 id="what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back">What happens if we reenable the feature if it was previously rolled back?&lt;/h6>
&lt;p>Future tokens will once again include this information. There are no adverse effects.&lt;/p>
&lt;h6 id="are-there-any-tests-for-feature-enablementdisablement">Are there any tests for feature enablement/disablement?&lt;/h6>
&lt;p>Yes (as noted above in the test plan)&lt;/p>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;p>Rolling this out will be done by enabling the feature flag on all control plane hosts.&lt;/p>
&lt;p>The &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> feature gate should be enabled and complete rollout before the
&lt;code>ServiceAccountTokenNodeBinding&lt;/code> gate is enabled, so all active servers will correctly validate tokens issued by
any server.&lt;/p>
&lt;p>The &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> will be defaulted to on one release &lt;strong>before&lt;/strong> &lt;code>ServiceAccountTokenNodeBinding&lt;/code>
to account for this. Concretely, &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> will be enabled by default in v1.30 and
&lt;code>ServiceAccountTokenNodeBinding&lt;/code> will be enabled by default in v1.31.&lt;/p>
&lt;p>This should not cause any issues during upgrades.
Rollback is done by removing/disabling the feature gate(s).&lt;/p>
&lt;h6 id="how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads">How can a rollout or rollback fail? Can it impact already running workloads?&lt;/h6>
&lt;p>During a rollback, there is a concern that tokens that were issued prior to the rollback that are bound directly to a
Node object (i.e. not bound to a Pod that also embeds node info, which is informational) could be accepted by an older
apiserver even if the bound Node object no longer exists (as it would not know to verify the new &lt;code>node&lt;/code> claim).&lt;/p>
&lt;p>To help avoid this, the feature will be graduated in two phases:&lt;/p>
&lt;ul>
&lt;li>First, graduating the acceptance/validation of explicitly Node-bound tokens in one release&lt;/li>
&lt;li>Secondly, graduating the issuance of explicitly Node-bound tokens in the following release&lt;/li>
&lt;/ul>
&lt;p>This allows for a safe rollback in which the same security expectations are enforced once a token has been issued.&lt;/p>
&lt;p>If a user explicitly &lt;em>disables&lt;/em> &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> but keeps &lt;code>ServiceAccountTokenNodeBinding&lt;/code> enabled,
the node claims in the issued tokens will not be properly validated. This configuration will be explicitly denied by the
kube-apiserver and will cause it to exit on startup.&lt;/p>
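&lt;p>The startup check described above amounts to a simple dependency rule between the two gates. The sketch below is illustrative only: the function names are hypothetical, it parses a &lt;code>Name=true|false&lt;/code> list in the style of the &lt;code>--feature-gates&lt;/code> flag, and it treats unset gates as disabled, whereas the real defaults depend on the release.&lt;/p>

```python
def parse_feature_gates(flag_value: str) -> dict:
    """Parse a comma-separated list of Name=true|false pairs into a dict."""
    gates = {}
    for pair in flag_value.split(","):
        if not pair.strip():
            continue
        name, _, value = pair.partition("=")
        gates[name.strip()] = value.strip().lower() == "true"
    return gates


def check_gates(gates: dict) -> None:
    """Raise ValueError for the unsafe gate combination."""
    # Simplification: a gate that is not listed is treated as disabled.
    binding = gates.get("ServiceAccountTokenNodeBinding", False)
    validation = gates.get("ServiceAccountTokenNodeBindingValidation", False)
    if binding and not validation:
        raise ValueError(
            "ServiceAccountTokenNodeBinding requires "
            "ServiceAccountTokenNodeBindingValidation to be enabled"
        )
```

&lt;p>Failing fast at startup is preferable to issuing tokens whose node claims would silently go unvalidated.&lt;/p>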
&lt;h6 id="what-specific-metrics-should-inform-a-rollback">What specific metrics should inform a rollback?&lt;/h6>
&lt;ul>
&lt;li>&lt;code>authentication_attempts&lt;/code>&lt;/li>
&lt;li>&lt;code>authorization_attempts_total&lt;/code>&lt;/li>
&lt;li>&lt;code>serviceaccount_valid_tokens_total&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>New metrics that can be used to identify if the feature is in use:&lt;/p>
&lt;ul>
&lt;li>&lt;code>serviceaccount_authentication_pod_node_ref_verified_total&lt;/code>&lt;/li>
&lt;li>&lt;code>serviceaccount_authentication_bound_object_verified_total{bound_object_kind=&amp;quot;Node&amp;quot;}&lt;/code>&lt;/li>
&lt;li>&lt;code>serviceaccount_bound_tokens_issued_pod_with_node_tokens_total&lt;/code>&lt;/li>
&lt;li>&lt;code>serviceaccount_bound_tokens_issued_total{bound_object_kind=&amp;quot;Node&amp;quot;}&lt;/code>&lt;/li>
&lt;li>&lt;code>serviceaccount_bound_tokens_issued_with_identifier_total&lt;/code>&lt;/li>
&lt;/ul>
&lt;h6 id="were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested">Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path tested?&lt;/h6>
&lt;p>&lt;strong>For &lt;code>ServiceAccountTokenJTI&lt;/code> feature (alpha v1.29, beta v1.30, GA v1.32):&lt;/strong>&lt;/p>
&lt;p>&lt;em>Without&lt;/em> the feature gate enabled, issued service account tokens &lt;em>will not&lt;/em> have their &lt;code>jti&lt;/code> field set to a random UUID,
and the audit log will not persist the issued credential identifier when issuing a token.&lt;/p>
&lt;p>&lt;em>With&lt;/em> the feature gate enabled, issued service account tokens will have their &lt;code>jti&lt;/code> field set to a random UUID.
Additionally, the audit event recorded when issuing a new token will have a new annotation added (&lt;code>authentication.kubernetes.io/issued-credential-id&lt;/code>).
As a service account token&amp;rsquo;s JTI field is used to infer the credential identifier, which forms part of a user&amp;rsquo;s &lt;code>ExtraInfo&lt;/code>,
audit events generated using this newly issued token will also include this JTI (persisted as &lt;code>authentication.kubernetes.io/credential-id&lt;/code>).&lt;/p>
&lt;p>If the feature is &lt;em>disabled&lt;/em> and a token is presented that includes a credential identifier, &lt;strong>it will still be persisted into the audit log&lt;/strong>
as part of the UserInfo in the audit event.&lt;/p>
&lt;p>As none of these fields are actually used to validate a token, enabling &amp;amp; disabling the feature
does not cause any adverse side effects.&lt;/p>
&lt;p>&lt;strong>For &lt;code>ServiceAccountTokenNodeBinding&lt;/code> (alpha v1.29, beta v1.31, GA v1.33) and &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> (alpha v1.29, beta v1.30, GA v1.32) features:&lt;/strong>&lt;/p>
&lt;p>&lt;em>Without&lt;/em> the feature gate enabled, service account tokens that have been bound to Node objects will not have their
node reference claims validated (to ensure the referenced node exists).&lt;/p>
&lt;p>&lt;em>With&lt;/em> the feature gate enabled, if a token has a &lt;code>node&lt;/code> claim contained within it, it&amp;rsquo;ll be validated to ensure the
corresponding Node object actually exists.&lt;/p>
&lt;p>Disabling this feature will therefore &lt;em>relax&lt;/em> the security posture of the cluster in an unexpected way, as tokens that
may have been previously invalid (because their corresponding Node does not exist) may become valid again.&lt;/p>
&lt;p>Node bound tokens may only be issued if the &lt;code>ServiceAccountTokenNodeBinding&lt;/code> feature is enabled, and it is not possible
to enable &lt;code>ServiceAccountTokenNodeBinding&lt;/code> without &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> being enabled too.&lt;/p>
&lt;p>This is further mitigated by graduating the &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> feature one release &lt;strong>earlier&lt;/strong>
than &lt;code>ServiceAccountTokenNodeBinding&lt;/code>.&lt;/p>
&lt;p>Tokens that are bound to objects other than Nodes are unaffected.&lt;/p>
&lt;p>&lt;strong>For &lt;code>ServiceAccountTokenPodNodeInfo&lt;/code> feature (alpha v1.29, beta v1.30, GA v1.32):&lt;/strong>&lt;/p>
&lt;p>&lt;em>Without&lt;/em> the feature gate enabled, tokens that are bound to Pod objects will not include information about the Node
that the pod is scheduled/assigned to.&lt;/p>
&lt;p>&lt;em>With&lt;/em> the feature enabled, newly minted tokens that are bound to Pod objects will include metadata about the Node, namely
the Node&amp;rsquo;s name and UID.&lt;/p>
&lt;p>These fields are &lt;strong>not validated&lt;/strong> and therefore disabling the feature after enabling it will not cause any adverse side-effects.&lt;/p>
&lt;!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
&lt;h6 id="is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc">Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?&lt;/h6>
&lt;p>No.&lt;/p>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;p>New metrics:&lt;/p>
&lt;ul>
&lt;li>&lt;code>serviceaccount_authentication_pod_node_ref_verified_total&lt;/code> - new metric that is incremented when a token bound to a Pod has its Node reference verified&lt;/li>
&lt;li>&lt;code>serviceaccount_authentication_bound_object_verified_total{bound_object_kind=&amp;quot;Node&amp;quot;}&lt;/code> - new metric that is incremented when a token bound to a Node has its reference verified&lt;/li>
&lt;li>&lt;code>serviceaccount_bound_tokens_issued_pod_with_node_tokens_total&lt;/code> - new metric that is incremented when a node ref is embedded into a bound Pod token (aka implicitly added)&lt;/li>
&lt;li>&lt;code>serviceaccount_bound_tokens_issued_total{bound_object_kind=&amp;quot;Node&amp;quot;}&lt;/code> - new metric that is incremented whenever a bound token is issued that references a Node (explicitly added)&lt;/li>
&lt;li>&lt;code>serviceaccount_bound_tokens_issued_with_identifier_total&lt;/code> - new metric that is incremented whenever a token that contains an identifier/JTI is issued&lt;/li>
&lt;/ul>
&lt;h6 id="how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads">How can an operator determine if the feature is in use by workloads?&lt;/h6>
&lt;p>The metrics detailed above provide a clear signal as to whether these features are being used.&lt;/p>
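&lt;p>As a rough sketch of how these metrics could be read, the helper below sums the relevant counters from a dump of the apiserver&amp;rsquo;s metrics in Prometheus text format. The metric names are taken from the list above; the function name and the assumption that sample lines carry no trailing timestamp are illustrative.&lt;/p>

```python
# Metric name -> label fragment that must be present (None means any series).
WATCHED = {
    "serviceaccount_authentication_pod_node_ref_verified_total": None,
    "serviceaccount_authentication_bound_object_verified_total": 'bound_object_kind="Node"',
    "serviceaccount_bound_tokens_issued_pod_with_node_tokens_total": None,
    "serviceaccount_bound_tokens_issued_total": 'bound_object_kind="Node"',
    "serviceaccount_bound_tokens_issued_with_identifier_total": None,
}


def feature_usage(metrics_text: str) -> dict:
    """Sum the sample values of each watched metric in Prometheus text output."""
    totals = {name: 0.0 for name in WATCHED}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "series value" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]
        labels = series[len(metric):]
        if metric not in WATCHED:
            continue
        required = WATCHED[metric]
        if required is None or required in labels:
            totals[metric] += float(value)
    return totals
```

&lt;p>Any total greater than zero suggests the corresponding feature has been exercised since the apiserver started.&lt;/p>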
&lt;!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
&lt;h6 id="how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance">How can someone using this feature know that it is working for their instance?&lt;/h6>
&lt;p>For the node info part, use the TokenRequest API to obtain a token bound to a Pod and inspect the contents of the issued JWT.
For JTIs, use the TokenRequest API to obtain any ServiceAccount token and inspect the contents of the issued JWT.&lt;/p>
&lt;p>For the validation/verification, the user can use the SelfSubjectAccessReview API to check whether the token is still valid.
To do so, they&amp;rsquo;d need to obtain a token that is bound to a Pod, delete the corresponding Node object that the Pod is scheduled
on, and observe that the token is no longer valid via the SelfSubjectAccessReview API.&lt;/p>
&lt;p>A similar process could be used for tokens bound to Node objects directly.&lt;/p>
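&lt;p>Inspecting an issued JWT can be done with a few lines of code. The sketch below decodes the payload segment &lt;strong>without&lt;/strong> verifying the signature, so it is suitable for debugging only. The helper names are hypothetical, and the &lt;code>kubernetes.io&lt;/code> claim layout is an assumption to check against the tokens your cluster actually issues.&lt;/p>

```python
import base64
import json
import uuid


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def summarize(token: str) -> dict:
    """Report whether the token carries a UUID jti and pod/node references."""
    claims = decode_jwt_payload(token)
    k8s = claims.get("kubernetes.io", {})  # assumed private-claim namespace
    try:
        uuid.UUID(str(claims.get("jti")))
        jti_is_uuid = True
    except ValueError:
        jti_is_uuid = False
    return {
        "jti_is_uuid": jti_is_uuid,
        "has_pod": "pod" in k8s,
        "has_node": "node" in k8s,
    }
```

&lt;p>For a Pod-bound token issued with the relevant gates enabled, all three values would be expected to be true.&lt;/p>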
&lt;h6 id="what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement">What are the reasonable SLOs (Service Level Objectives) for the enhancement?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service">What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?&lt;/h6>
&lt;p>N/A&lt;/p>
&lt;h6 id="are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature">Are there any missing metrics that would be useful to have to improve observability of this feature?&lt;/h6>
&lt;p>No&lt;/p>
&lt;!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;p>None&lt;/p>
&lt;h6 id="does-this-feature-depend-on-any-specific-services-running-in-the-cluster">Does this feature depend on any specific services running in the cluster?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;p>N/A&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-api-calls">Will enabling / using this feature result in any new API calls?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-introducing-new-api-types">Will enabling / using this feature result in introducing new API types?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider">Will enabling / using this feature result in any new calls to the cloud provider?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects">Will enabling / using this feature result in increasing size or count of the existing API objects?&lt;/h6>
&lt;p>Additional audit log annotation keys, as well as extending the JWT claims we embed into service account tokens.&lt;/p>
&lt;p>The maximum size of a UUID is 36 bytes.
The maximum size of a Node object&amp;rsquo;s name is 253 bytes.
The maximum size of a Node object&amp;rsquo;s UID is 36 bytes.&lt;/p>
&lt;p>This additional data will be recorded into issued JWTs as well as audit log events.&lt;/p>
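&lt;p>Taking the maximum sizes above at face value, a back-of-the-envelope upper bound on the extra token size can be worked out as follows. The calculation counts only the raw values and ignores JSON keys and punctuation, so the real overhead will be slightly larger.&lt;/p>

```python
import math

UUID_BYTES = 36        # applies to both the JTI and the Node UID
NODE_NAME_BYTES = 253  # maximum Node name length

# jti + node name + node uid, before any encoding
raw_extra = UUID_BYTES + NODE_NAME_BYTES + UUID_BYTES

# The JWT payload is base64url encoded: every 3 input bytes become 4 characters.
encoded_extra = math.ceil(raw_extra * 4 / 3)

print(raw_extra, encoded_extra)  # 325 434
```

&lt;p>In the worst case this adds a few hundred bytes per token, which is small compared with the rest of a signed JWT.&lt;/p>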
&lt;h6 id="will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos">Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?&lt;/h6>
&lt;p>Enabling the feature fractionally increases the time spent issuing service account JWTs (mainly due to UUID generation). This is expected to be negligible.&lt;/p>
&lt;h6 id="will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components">Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h6 id="can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc">Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?&lt;/h6>
&lt;p>No&lt;/p>
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->
&lt;h6 id="how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable">How does this feature react if the API server and/or etcd is unavailable?&lt;/h6>
&lt;p>Not applicable. This change is solely within the apiserver, and does not touch etcd.&lt;/p>
&lt;h6 id="what-are-other-known-failure-modes">What are other known failure modes?&lt;/h6>
&lt;!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
&lt;h6 id="what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem">What steps should be taken if SLOs are not being met to determine the problem?&lt;/h6>
&lt;p>After observing an issue (e.g. uptick in denied authentication requests or a significant shift in any metrics added for this KEP),
kube-apiserver logs from the authenticator may be used to debug.&lt;/p>
&lt;p>Additionally, manually attempting to exercise the affected codepaths would surface information that&amp;rsquo;d aid debugging.
For example, attempting to issue a node bound token, or attempting to authenticate to the apiserver using a node bound token.&lt;/p>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>KEP marked implementable and merged for the v1.29 release&lt;/li>
&lt;li>KEP implemented in an alpha state for v1.29&lt;/li>
&lt;li>Renamed audit annotation used for the &lt;code>serviceaccounts/&amp;lt;name&amp;gt;/token&lt;/code> endpoint to be clearer: &lt;a href="https://github.com/kubernetes/kubernetes/pull/123098"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/123098&lt;/a>
&lt;/li>
&lt;li>Added restrictions to disallow enabling &lt;code>ServiceAccountTokenNodeBinding&lt;/code> without &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code>: &lt;a href="https://github.com/kubernetes/kubernetes/pull/123135"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/123135&lt;/a>
&lt;/li>
&lt;li>&lt;code>ServiceAccountTokenJTI&lt;/code>, &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> and &lt;code>ServiceAccountTokenPodNodeInfo&lt;/code> promoted to beta for v1.30 release&lt;/li>
&lt;li>Promoted &lt;code>ServiceAccountTokenNodeBinding&lt;/code> to beta for v1.31 release&lt;/li>
&lt;li>Promoted &lt;code>ServiceAccountTokenJTI&lt;/code>, &lt;code>ServiceAccountTokenPodNodeInfo&lt;/code>, &lt;code>ServiceAccountTokenNodeBindingValidation&lt;/code> to stable for v1.32 release&lt;/li>
&lt;li>Promoted &lt;code>ServiceAccountTokenNodeBinding&lt;/code> to stable for v1.33 release&lt;/li>
&lt;/ul>
&lt;!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
&lt;h2 id="drawbacks">Drawbacks&lt;/h2>
&lt;ul>
&lt;li>TBC&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives">Alternatives&lt;/h2>
&lt;!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
&lt;h2 id="infrastructure-needed-optional">Infrastructure Needed (Optional)&lt;/h2>
&lt;p>N/A&lt;/p></description></item><item><title>Resources: Bound Service Account Tokens</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1205/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/1205/</guid><description>
&lt;h1 id="bound-service-account-tokens">Bound Service Account Tokens&lt;/h1>
&lt;h2 id="table-of-contents">Table Of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#background"
>Background&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#design-details"
>Design Details&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#tokenrequest"
>TokenRequest&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#token-attenuations"
>Token Attenuations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#audience-binding"
>Audience binding&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#time-binding"
>Time Binding&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#object-binding"
>Object Binding&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#api-changes"
>API Changes&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#add-tokenrequestsauthenticationk8sio"
>Add &lt;code>tokenrequests.authentication.k8s.io&lt;/code>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#modify-tokenreviewsauthenticationk8sio"
>Modify &lt;code>tokenreviews.authentication.k8s.io&lt;/code>&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#example-flow"
>Example Flow&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#service-account-authenticator-modification"
>Service Account Authenticator Modification&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#acls-for-tokenrequest"
>ACLs for TokenRequest&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#tokenrequestprojection"
>TokenRequestProjection&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#api-change"
>API Change&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#file-permission"
>File Permission&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#proposed-heuristics"
>Proposed Heuristics&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives-considered"
>Alternatives Considered&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#serviceaccount-admission-controller-migration"
>ServiceAccount Admission Controller Migration&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#prerequisites"
>Prerequisites&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#safe-rollout-of-time-bound-token"
>Safe Rollout of Time-bound Token&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#tokenrequesttokenrequestprojection"
>TokenRequest/TokenRequestProjection&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rootcaconfigmap"
>RootCAConfigMap&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#boundserviceaccounttokenvolume"
>BoundServiceAccountTokenVolume&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#tokenrequesttokenrequestprojection-1"
>TokenRequest/TokenRequestProjection&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#beta-ga"
>Beta-&amp;gt;GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#rootcaconfigmap-1"
>RootCAConfigMap&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#beta-ga-1"
>Beta-&amp;gt;GA&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#boundserviceaccounttokenvolume-1"
>BoundServiceAccountTokenVolume&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#alpha-beta"
>Alpha-&amp;gt;Beta&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#beta---ga-graduation"
>Beta -&amp;gt; GA Graduation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#production-readiness-review-questionnaire"
>Production Readiness Review Questionnaire&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#feature-enablement-and-rollback"
>Feature Enablement and Rollback&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#rollout-upgrade-and-rollback-planning"
>Rollout, Upgrade and Rollback Planning&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#monitoring-requirements"
>Monitoring Requirements&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#dependencies"
>Dependencies&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#scalability"
>Scalability&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#troubleshooting"
>Troubleshooting&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>This KEP describes an API that would allow workloads running on Kubernetes to
request JSON Web Tokens that are audience, time and eventually key bound. In
addition, this KEP introduces a new mechanism of distribution with support for
bound service account tokens and explores how to migrate from the existing
mechanism backwards compatibly.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Kubernetes already provisions JWTs to workloads. This functionality is on by
default and thus widely deployed. The current workload JWT system has serious
issues:&lt;/p>
&lt;ol>
&lt;li>Security: JWTs are not audience bound. Any recipient of a JWT can masquerade
as the presenter to anyone else.&lt;/li>
&lt;li>Security: The current model of storing the service account token in a Secret
and delivering it to nodes results in a broad attack surface for the
Kubernetes control plane when powerful components are run - giving a service
account a permission means that any component that can see that service
account&amp;rsquo;s secrets is at least as powerful as the component.&lt;/li>
&lt;li>Security: JWTs are not time bound. A JWT compromised via 1 or 2 is valid
for as long as the service account exists. This may be mitigated with
service account signing key rotation, but rotation is not supported by
client-go and not automated by the control plane, and thus is not widely deployed.&lt;/li>
&lt;li>Scalability: JWTs require a Kubernetes secret per service account.&lt;/li>
&lt;/ol>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>We would like to introduce a new mechanism for provisioning Kubernetes service
account tokens that is compatible with our current security and scalability
requirements.&lt;/p>
&lt;h2 id="design-details">Design Details&lt;/h2>
&lt;h3 id="tokenrequest">TokenRequest&lt;/h3>
&lt;p>Infrastructure to support on-demand token requests will be implemented in the
core apiserver. Once this API exists, a client of the apiserver will request an
attenuated token for its own use. The API will enforce required attenuations,
e.g. audience and time binding.&lt;/p>
&lt;h4 id="token-attenuations">Token Attenuations&lt;/h4>
&lt;h5 id="audience-binding">Audience binding&lt;/h5>
&lt;p>Tokens issued from this API will be audience bound. Audience of requested tokens
will be bound by the &lt;code>aud&lt;/code> claim. The &lt;code>aud&lt;/code> claim is an array of strings
(usually URLs) that correspond to the intended audience of the token. A
recipient of a token is responsible for verifying that it identifies as one of
the values in the audience claim, and should otherwise reject the token. The
TokenReview API will support this validation.&lt;/p>
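&lt;p>As a rough illustration of the recipient-side check described above, the
following Go sketch accepts a token only if one of the recipient&amp;rsquo;s own
identifiers appears in the token&amp;rsquo;s &lt;code>aud&lt;/code> claim. The function name
&lt;code>audienceMatches&lt;/code> and the second example identifier are hypothetical and
are not part of the proposed API.&lt;/p>

```go
package main

import "fmt"

// audienceMatches reports whether any identifier the recipient answers to
// appears in the token's aud claim. The name and signature are illustrative
// assumptions, not part of the proposed API.
func audienceMatches(tokenAudiences, recipientIDs []string) bool {
	// Index the recipient's identifiers for constant-time lookup.
	known := make(map[string]struct{}, len(recipientIDs))
	for _, id := range recipientIDs {
		known[id] = struct{}{}
	}
	for _, aud := range tokenAudiences {
		if _, ok := known[aud]; ok {
			return true
		}
	}
	// No overlap: the recipient should reject the token.
	return false
}

func main() {
	aud := []string{"https://kubernetes.default.svc"}
	fmt.Println(audienceMatches(aud, []string{"https://kubernetes.default.svc"}))
	fmt.Println(audienceMatches(aud, []string{"https://vault.example.com"}))
}
```

&lt;p>A token with an empty audience list never matches in this sketch, which errs
on the side of rejecting tokens.&lt;/p>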
&lt;h5 id="time-binding">Time Binding&lt;/h5>
&lt;p>Tokens issued from this API will be time bound. Time validity of these tokens
will be claimed in the following fields:&lt;/p>
&lt;ul>
&lt;li>&lt;code>exp&lt;/code>: expiration time&lt;/li>
&lt;li>&lt;code>nbf&lt;/code>: not before&lt;/li>
&lt;li>&lt;code>iat&lt;/code>: issued at&lt;/li>
&lt;/ul>
&lt;p>A recipient of a token should verify that the token is valid at the time that
the token is presented, and should otherwise reject the token. The TokenReview
API will support this validation.&lt;/p>
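&lt;p>The time check can be sketched as follows, assuming the claims have already
been parsed into Unix seconds. The type &lt;code>timeClaims&lt;/code> and the function
&lt;code>validAt&lt;/code> are hypothetical names, and the sketch applies no clock-skew
leeway, which a real validator may want to allow.&lt;/p>

```go
package main

import "fmt"

// timeClaims holds the three time-related JWT claims as Unix seconds.
// The struct and function names are illustrative assumptions.
type timeClaims struct {
	Exp int64 // expiration time
	Nbf int64 // not before
	Iat int64 // issued at
}

// validAt reports whether the token is usable at the instant now (Unix
// seconds): at or after nbf, and strictly before exp.
func validAt(c timeClaims, now int64) bool {
	if c.Nbf > now {
		return false // presented too early
	}
	if now >= c.Exp {
		return false // already expired
	}
	return true
}

func main() {
	c := timeClaims{Exp: 1516844643, Nbf: 1516841043, Iat: 1516841043}
	fmt.Println(validAt(c, 1516841100)) // inside the validity window
	fmt.Println(validAt(c, 1516850000)) // after exp
}
```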
&lt;p>Cluster administrators will be able to configure the maximum validity duration
for expiring tokens. During the migration off of the old service account tokens,
clients of this API may request tokens that are valid for many years. These
tokens will be drop-in replacements for the current service account tokens.&lt;/p>
&lt;h5 id="object-binding">Object Binding&lt;/h5>
&lt;p>Tokens issued from this API may be bound to a Kubernetes object in the same
namespace as the service account. The name, group, version, kind and uid of the
object will be embedded as claims in the issued token. A token bound to an
object will only be valid for as long as that object exists.&lt;/p>
&lt;p>Only a subset of object kinds will support object binding. Initially the only
kinds that will be supported are:&lt;/p>
&lt;ul>
&lt;li>v1/Pod&lt;/li>
&lt;li>v1/Secret&lt;/li>
&lt;/ul>
&lt;p>The TokenRequest API will validate this binding.&lt;/p>
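&lt;p>A minimal sketch of the object-binding check, under stated assumptions: the
&lt;code>uidLookup&lt;/code> callback is a hypothetical stand-in for a real API lookup in
the service account&amp;rsquo;s namespace, and comparing UIDs is an assumption made here
so that an object deleted and recreated under the same name does not revive an
old token.&lt;/p>

```go
package main

import "fmt"

// boundObjectRef mirrors the object claims described above.
type boundObjectRef struct {
	Kind       string
	APIVersion string
	Name       string
	UID        string
}

// uidLookup returns the UID of the live object with the given kind and name,
// and false if no such object exists. It stands in for a real API lookup.
type uidLookup func(kind, name string) (string, bool)

// bindingValid reports whether the bound object is of a supported kind and
// still exists with the UID recorded in the token.
func bindingValid(ref boundObjectRef, lookup uidLookup) bool {
	switch ref.Kind {
	case "Pod", "Secret":
		// Supported kinds; continue with the existence check.
	default:
		return false
	}
	liveUID, found := lookup(ref.Kind, ref.Name)
	if !found {
		return false // the bound object no longer exists
	}
	return liveUID == ref.UID
}

func main() {
	live := map[string]string{
		"Pod/pod-foo-346acf": "a4bb8aa4-0168-11e8-92e5-42010af00002",
	}
	lookup := func(kind, name string) (string, bool) {
		uid, ok := live[kind+"/"+name]
		return uid, ok
	}
	ref := boundObjectRef{
		Kind:       "Pod",
		APIVersion: "v1",
		Name:       "pod-foo-346acf",
		UID:        "a4bb8aa4-0168-11e8-92e5-42010af00002",
	}
	fmt.Println(bindingValid(ref, lookup))
}
```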
&lt;h4 id="api-changes">API Changes&lt;/h4>
&lt;h5 id="add-tokenrequestsauthenticationk8sio">Add &lt;code>tokenrequests.authentication.k8s.io&lt;/code>&lt;/h5>
&lt;p>We will add an imperative API (a la TokenReview) to the &lt;code>authentication.k8s.io&lt;/code>
API group:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> TokenRequest &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Spec TokenRequestSpec
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Status TokenRequestStatus
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> TokenRequestSpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Audiences are the intended audiences of the token. A token issued&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// for multiple audiences may be used to authenticate against any of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the audiences listed. This implies a high degree of trust between&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// the target audiences.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Audiences []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ValidityDuration is the requested duration of validity of the request. The&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// token issuer may return a token with a different validity duration so a&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// client needs to check the &amp;#39;expiration&amp;#39; field in a response.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ValidityDuration metav1.Duration
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// BoundObjectRef is a reference to an object that the token will be bound to.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// The token will only be valid for as long as the bound object exists.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> BoundObjectRef &lt;span style="color:#666">*&lt;/span>BoundObjectReference
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> BoundObjectReference &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Kind of the referent. Valid kinds are &amp;#39;Pod&amp;#39; and &amp;#39;Secret&amp;#39;.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Kind &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// API version of the referent.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> APIVersion &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Name of the referent.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Name &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// UID of the referent.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> UID types.UID
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> TokenRequestStatus &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Token is the token data&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Token &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Expiration is the time of expiration of the returned token. Empty means the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// token does not expire.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Expiration metav1.Time
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This API will be exposed as a subresource under a serviceaccount object. A
requestor for a token for a specific service account will &lt;code>POST&lt;/code> a
&lt;code>TokenRequest&lt;/code> to the &lt;code>/token&lt;/code> subresource of that serviceaccount object.&lt;/p>
&lt;h5 id="modify-tokenreviewsauthenticationk8sio">Modify &lt;code>tokenreviews.authentication.k8s.io&lt;/code>&lt;/h5>
&lt;p>The TokenReview API will be extended to support passing an additional audience
field which the service account authenticator will validate.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-golang" data-lang="golang">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> TokenReviewSpec &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Token is the opaque bearer token.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Token &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Audiences are the identifiers that the client identifies as.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Audiences []&lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h5 id="example-flow">Example Flow&lt;/h5>
&lt;pre tabindex="0">&lt;code>&amp;gt; POST /api/v1/namespaces/default/serviceaccounts/default/token
&amp;gt; {
&amp;gt; &amp;#34;kind&amp;#34;: &amp;#34;TokenRequest&amp;#34;,
&amp;gt; &amp;#34;apiVersion&amp;#34;: &amp;#34;authentication.k8s.io/v1&amp;#34;,
&amp;gt; &amp;#34;spec&amp;#34;: {
&amp;gt; &amp;#34;audience&amp;#34;: [
&amp;gt; &amp;#34;https://kubernetes.default.svc&amp;#34;
&amp;gt; ],
&amp;gt; &amp;#34;validityDuration&amp;#34;: &amp;#34;99999h&amp;#34;,
&amp;gt; &amp;#34;boundObjectRef&amp;#34;: {
&amp;gt; &amp;#34;kind&amp;#34;: &amp;#34;Pod&amp;#34;,
&amp;gt; &amp;#34;apiVersion&amp;#34;: &amp;#34;v1&amp;#34;,
&amp;gt; &amp;#34;name&amp;#34;: &amp;#34;pod-foo-346acf&amp;#34;
&amp;gt; }
&amp;gt; }
&amp;gt; }
{
&amp;#34;kind&amp;#34;: &amp;#34;TokenRequest&amp;#34;,
&amp;#34;apiVersion&amp;#34;: &amp;#34;authentication.k8s.io/v1&amp;#34;,
&amp;#34;spec&amp;#34;: {
&amp;#34;audience&amp;#34;: [
&amp;#34;https://kubernetes.default.svc&amp;#34;
],
&amp;#34;validityDuration&amp;#34;: &amp;#34;99999h&amp;#34;,
&amp;#34;boundObjectRef&amp;#34;: {
&amp;#34;kind&amp;#34;: &amp;#34;Pod&amp;#34;,
&amp;#34;apiVersion&amp;#34;: &amp;#34;v1&amp;#34;,
&amp;#34;name&amp;#34;: &amp;#34;pod-foo-346acf&amp;#34;
}
},
&amp;#34;status&amp;#34;: {
&amp;#34;token&amp;#34;:
&amp;#34;eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJz[payload omitted].EkN-[signature omitted]&amp;#34;,
&amp;#34;expiration&amp;#34;: &amp;#34;Jan 24 16:36:00 PST 3018&amp;#34;
}
}
&lt;/code>&lt;/pre>&lt;p>The token payload will be:&lt;/p>
&lt;pre tabindex="0">&lt;code>{
&amp;#34;iss&amp;#34;: &amp;#34;https://example.com/some/path&amp;#34;,
&amp;#34;sub&amp;#34;: &amp;#34;system:serviceaccount:default:default&amp;#34;,
&amp;#34;aud&amp;#34;: [
&amp;#34;https://kubernetes.default.svc&amp;#34;
],
&amp;#34;exp&amp;#34;: 24412841114,
&amp;#34;iat&amp;#34;: 1516841043,
&amp;#34;nbf&amp;#34;: 1516841043,
&amp;#34;kubernetes.io&amp;#34;: {
&amp;#34;serviceAccountUID&amp;#34;: &amp;#34;c0c98eab-0168-11e8-92e5-42010af00002&amp;#34;,
&amp;#34;boundObjectRef&amp;#34;: {
&amp;#34;kind&amp;#34;: &amp;#34;Pod&amp;#34;,
&amp;#34;apiVersion&amp;#34;: &amp;#34;v1&amp;#34;,
&amp;#34;uid&amp;#34;: &amp;#34;a4bb8aa4-0168-11e8-92e5-42010af00002&amp;#34;,
&amp;#34;name&amp;#34;: &amp;#34;pod-foo-346acf&amp;#34;
}
}
}
&lt;/code>&lt;/pre>&lt;h4 id="service-account-authenticator-modification">Service Account Authenticator Modification&lt;/h4>
&lt;p>The service account token authenticator will be extended to support validation
of time and audience binding claims.&lt;/p>
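&lt;p>To make the claims that the authenticator has to inspect more concrete, the
following Go sketch decodes the payload segment of a compact JWT into a small
struct. It is for inspection only: it deliberately skips signature verification,
which a real authenticator must perform before trusting any claim. The type
&lt;code>claims&lt;/code> and the function &lt;code>decodeClaims&lt;/code> are hypothetical names, and
the sketch assumes &lt;code>aud&lt;/code> is always encoded as an array.&lt;/p>

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// claims captures only the standard fields discussed in this KEP.
type claims struct {
	Sub string   `json:"sub"`
	Aud []string `json:"aud"`
	Exp int64    `json:"exp"`
	Nbf int64    `json:"nbf"`
	Iat int64    `json:"iat"`
}

// decodeClaims extracts the claims from the payload segment of a compact JWT.
// It does NOT verify the signature, so it is suitable for inspection only.
func decodeClaims(token string) (*claims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token must have three dot-separated segments")
	}
	// JWT segments are base64url encoded without padding.
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, fmt.Errorf("decoding payload: %w", err)
	}
	c := new(claims)
	if err := json.Unmarshal(raw, c); err != nil {
		return nil, fmt.Errorf("parsing payload: %w", err)
	}
	return c, nil
}

func main() {
	// Build an unsigned stand-in token; header and signature are placeholders.
	payload := `{"sub":"system:serviceaccount:default:default","aud":["https://kubernetes.default.svc"],"exp":24412841114,"iat":1516841043,"nbf":1516841043}`
	token := "header." + base64.RawURLEncoding.EncodeToString([]byte(payload)) + ".signature"
	c, err := decodeClaims(token)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Sub, c.Aud, c.Exp)
}
```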
&lt;h4 id="acls-for-tokenrequest">ACLs for TokenRequest&lt;/h4>
&lt;p>The NodeAuthorizer will allow the kubelet to use its credentials to request a
service account token on behalf of pods running on that node. The
NodeRestriction admission controller will require that these tokens are pod
bound.&lt;/p>
&lt;h3 id="tokenrequestprojection">TokenRequestProjection&lt;/h3>
&lt;p>TokenRequestProjection is a ServiceAccountToken volume projection that
maintains a service account token requested by the node from the TokenRequest API.&lt;/p>
&lt;h4 id="api-change">API Change&lt;/h4>
&lt;p>A new volume projection will be implemented with an API that closely matches the
TokenRequest API.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ProjectedVolumeSource &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Sources []VolumeProjection
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DefaultMode &lt;span style="color:#666">*&lt;/span>&lt;span style="color:#0b0;font-weight:bold">int32&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> VolumeProjection &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Secret &lt;span style="color:#666">*&lt;/span>SecretProjection
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> DownwardAPI &lt;span style="color:#666">*&lt;/span>DownwardAPIProjection
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ConfigMap &lt;span style="color:#666">*&lt;/span>ConfigMapProjection
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ServiceAccountToken &lt;span style="color:#666">*&lt;/span>ServiceAccountTokenProjection
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// ServiceAccountTokenProjection represents a projected service account token&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// volume. This projection can be used to insert a service account token into&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// the pod&amp;#39;s runtime filesystem for use against APIs (Kubernetes API Server or&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#080;font-style:italic">// otherwise).&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a2f;font-weight:bold">type&lt;/span> ServiceAccountTokenProjection &lt;span style="color:#a2f;font-weight:bold">struct&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Audience is the intended audience of the token. A recipient of a token&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// must identify itself with an identifier specified in the audience of the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// token, and otherwise should reject the token. The audience defaults to the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// identifier of the apiserver.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Audience &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// ExpirationSeconds is the requested duration of validity of the service&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// account token. As the token approaches expiration, the kubelet volume&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// plugin will proactively rotate the service account token. The kubelet will&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// start trying to rotate the token if the token is older than 80 percent of&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// its time to live or if the token is older than 24 hours. Defaults to 1 hour&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// and must be at least 10 minutes.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ExpirationSeconds &lt;span style="color:#0b0;font-weight:bold">int64&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#080;font-style:italic">// Path is the relative path of the file to project the token into.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Path &lt;span style="color:#0b0;font-weight:bold">string&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A volume plugin implemented in the kubelet will project a service account token
sourced from the TokenRequest API into volumes created from
ProjectedVolumeSources. As the token approaches expiration, the kubelet volume
plugin will proactively rotate the service account token. The kubelet will start
trying to rotate the token if the token is older than 80 percent of its time to
live or if the token is older than 24 hours.&lt;/p>
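&lt;p>The rotation rule above can be sketched as a small predicate. This is an
illustration under stated assumptions, not the kubelet&amp;rsquo;s implementation: the
function name &lt;code>needsRotation&lt;/code> is hypothetical, all times are Unix
seconds, and &amp;ldquo;older than&amp;rdquo; is read as a strict comparison.&lt;/p>

```go
package main

import "fmt"

// maxTokenAgeSeconds is the 24 hour age limit from the rotation rule.
const maxTokenAgeSeconds = 24 * 60 * 60

// needsRotation applies the rule described above: rotate once the token is
// older than 80 percent of its time to live, or older than 24 hours.
// All arguments are Unix seconds.
func needsRotation(issuedAt, expiresAt, now int64) bool {
	ttl := expiresAt - issuedAt
	age := now - issuedAt
	if age > maxTokenAgeSeconds {
		return true
	}
	// Compare age/ttl against 80/100 using integer arithmetic only.
	return age*100 > ttl*80
}

func main() {
	// A one hour token issued at t=0 passes the 80 percent mark after 2880s.
	fmt.Println(needsRotation(0, 3600, 2880))
	fmt.Println(needsRotation(0, 3600, 2881))
}
```

&lt;p>For a token with the default one hour lifetime, the 80 percent mark falls at
48 minutes, so the 24 hour limit only matters for tokens requested with a
lifetime longer than 30 hours.&lt;/p>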
&lt;p>To replace the current service account token secrets, we also need to inject the
cluster&amp;rsquo;s CA certificate bundle. We will deploy it as a configmap per-namespace
and reference it using a ConfigMapProjection.&lt;/p>
&lt;p>A projected volume source that is equivalent to the current service account
secret:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="display:flex;">&lt;span>- &lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kube-api-access-xxxxx&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">projected&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">defaultMode&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">420&lt;/span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#080;font-style:italic"># 0644&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">sources&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">serviceAccountToken&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">expirationSeconds&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#666">3600&lt;/span>&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">path&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>token&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">configMap&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">items&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">key&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>ca.crt&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">path&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>ca.crt&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">name&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>kube-root-ca.crt&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">downwardAPI&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">items&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>- &lt;span style="color:#008000;font-weight:bold">fieldRef&lt;/span>:&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">apiVersion&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>v1&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">fieldPath&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>metadata.namespace&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#bbb"> &lt;/span>&lt;span style="color:#008000;font-weight:bold">path&lt;/span>:&lt;span style="color:#bbb"> &lt;/span>namespace&lt;span style="color:#bbb">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="file-permission">File Permission&lt;/h4>
&lt;p>The secret projections are currently written with world readable (0644,
effectively 444) file permissions. Given that file permissions are one of the
oldest and most hardened isolation mechanisms on Unix, this is not ideal.
We would like to opportunistically restrict permissions for projected service
account tokens as long as we can show that they won’t break users if we are to
migrate away from secrets to distribute service account credentials.&lt;/p>
&lt;h5 id="proposed-heuristics">Proposed Heuristics&lt;/h5>
&lt;ul>
&lt;li>&lt;em>Case 1&lt;/em>: The pod has an fsGroup set. We can set the file permission on the
token file to 0600 and let the fsGroup mechanism work as designed. It will
set the permissions to 0640, chown the token file to the fsGroup and start
the containers with a supplemental group that grants them access to the
token file. This works today.&lt;/li>
&lt;li>&lt;em>Case 2&lt;/em>: The pod’s containers declare the same runAsUser for all containers
(ephemeral containers are excluded) in the pod. We chown the token file to
the pod’s runAsUser to grant the containers access to the token. All
containers must have UID either specified in container security context or
inherited from pod security context. Preferred UIDs in container images are
ignored.&lt;/li>
&lt;li>&lt;em>Fallback&lt;/em>: We set the file permissions to world readable (0644) to match
the behavior of secrets.&lt;/li>
&lt;/ul>
&lt;p>This gives users that run as non-root greater isolation between users without
breaking existing applications. We also may consider adding more cases in the
future as long as we can ensure that they won’t break users.&lt;/p>
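&lt;p>The heuristics above amount to a three-way decision, sketched here in Go.
The types &lt;code>podInfo&lt;/code> and &lt;code>tokenFilePerms&lt;/code> are hypothetical
simplifications of the pod fields involved, &lt;code>-1&lt;/code> is used as an
illustrative &amp;ldquo;unset&amp;rdquo; marker, and the 0600 mode in Case 2 is an assumption,
since the text above only specifies the ownership change for that case.&lt;/p>

```go
package main

import "fmt"

// podInfo is a simplified, hypothetical view of the fields the heuristics read.
type podInfo struct {
	HasFSGroup bool
	// RunAsUsers holds the effective runAsUser of each non-ephemeral
	// container (from the container or pod security context); -1 means unset.
	RunAsUsers []int64
}

// tokenFilePerms is the decision: the mode to write and an optional owner.
type tokenFilePerms struct {
	Mode     uint32
	ChownUID int64 // -1 means leave ownership unchanged
}

// commonRunAsUser returns the single UID shared by all containers, if any.
func commonRunAsUser(uids []int64) (int64, bool) {
	if len(uids) == 0 {
		return -1, false
	}
	first := uids[0]
	for _, uid := range uids {
		if uid == -1 || uid != first {
			return -1, false
		}
	}
	return first, true
}

// chooseTokenFilePerms applies Case 1, then Case 2, then the fallback.
func chooseTokenFilePerms(p podInfo) tokenFilePerms {
	// Case 1: fsGroup is set; write 0600 and let fsGroup handling widen it.
	if p.HasFSGroup {
		return tokenFilePerms{Mode: 0o600, ChownUID: -1}
	}
	// Case 2: every container declares the same runAsUser (assumed mode 0600).
	if uid, ok := commonRunAsUser(p.RunAsUsers); ok {
		return tokenFilePerms{Mode: 0o600, ChownUID: uid}
	}
	// Fallback: world readable, matching the behavior of secrets.
	return tokenFilePerms{Mode: 0o644, ChownUID: -1}
}

func main() {
	perms := chooseTokenFilePerms(podInfo{RunAsUsers: []int64{1000, 1000}})
	fmt.Printf("mode=%#o chown=%d\n", perms.Mode, perms.ChownUID)
}
```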
&lt;h5 id="alternatives-considered">Alternatives Considered&lt;/h5>
&lt;ul>
&lt;li>We can create a volume for each UserID and set the owner to be that UserID
with mode 0400. If the user doesn&amp;rsquo;t specify runAsUser, fetching the UserID
from the image requires a redesign of how the kubelet handles volume mounts and
image pulling. This has significant implementation complexity because:
&lt;ul>
&lt;li>We would have to reorder container creation to introspect images (that
might declare USER or GROUP directives) to pass this information to the
projected volume mounter.&lt;/li>
&lt;li>Further, images are mutable so these directives may change over the
lifetime of the pod.&lt;/li>
&lt;li>Volumes are shared between all pods that mount them today. Mapping a
single logical volume in a pod spec to distinct mount points is likely a
significant architectural change.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>We pick a random group and set fsGroup on all pods in the service account
admission controller. It’s unclear how we would do this without conflicting
with usage of groups and potentially compromising security.&lt;/li>
&lt;li>We set token files to be world readable always. Problems with this are
discussed above.&lt;/li>
&lt;/ul>
&lt;h3 id="serviceaccount-admission-controller-migration">ServiceAccount Admission Controller Migration&lt;/h3>
&lt;h4 id="prerequisites">Prerequisites&lt;/h4>
&lt;p>Before migration to a version with &lt;code>BoundServiceAccountTokenVolume=true&lt;/code>, cluster
operators should:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Set feature gate &lt;code>TokenRequest=true&lt;/code> (defaults to &lt;code>true&lt;/code> since 1.12).&lt;/p>
&lt;ul>
&lt;li>This feature requires the following flags to the API server:
&lt;ul>
&lt;li>&lt;code>--service-account-issuer&lt;/code>&lt;/li>
&lt;li>&lt;code>--service-account-signing-key-file&lt;/code>&lt;/li>
&lt;li>&lt;code>--service-account-key-file&lt;/code>&lt;/li>
&lt;li>&lt;code>--api-audiences&lt;/code> (defaults to the value of &lt;code>--service-account-issuer&lt;/code>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Set feature gate &lt;code>TokenRequestProjection=true&lt;/code> (defaults to &lt;code>true&lt;/code> since
1.12).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Update all workloads to a newer version of the officially supported Kubernetes
client libraries, which reload the token:&lt;/p>
&lt;ul>
&lt;li>Go: &amp;gt;= v0.15.7&lt;/li>
&lt;li>Python: &amp;gt;= v12.0.0&lt;/li>
&lt;li>Java: &amp;gt;= v9.0.0&lt;/li>
&lt;li>Javascript: &amp;gt;= v0.10.3&lt;/li>
&lt;li>Ruby: master branch&lt;/li>
&lt;li>Haskell: v0.3.0.0&lt;/li>
&lt;/ul>
&lt;p>For community-maintained client libraries, feel free to contribute to them
if the reloading logic is missing.&lt;/p>
&lt;p>&lt;strong>Note&lt;/strong>: If it is difficult to find every place that uses in-cluster
config, cluster operators can pass the flag
&lt;code>--service-account-extend-token-expiration=true&lt;/code> to kube apiserver to
temporarily give tokens a longer expiration during the migration. Any usage of
a legacy token will be recorded in both metrics and audit logs. After fixing
all the potentially broken workloads, turn off the flag so that the original
expiration settings are honored. Note that the
&lt;code>--service-account-extend-token-expiration&lt;/code> mitigation defaults to true, and
that cluster administrators can set
&lt;code>--service-account-extend-token-expiration=false&lt;/code> to turn off the mitigation
if desired.&lt;/p>
&lt;ul>
&lt;li>Metrics: &lt;code>serviceaccount_stale_tokens_total&lt;/code>&lt;/li>
&lt;li>Audit: look for the &lt;code>authentication.k8s.io/stale-token&lt;/code> annotation&lt;/li>
&lt;/ul>
&lt;p>See the next section for details on how to discover the workloads that will
suffer from expired tokens.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>If anything goes wrong, please file a bug and CC @kubernetes/sig-auth-bugs. More
contact information is available
&lt;a href="https://github.com/kubernetes/community/tree/master/sig-auth#contact"
target="_blank" rel="noopener">here&lt;/a>
.&lt;/p>
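&lt;p>For workloads that cannot use one of the client libraries above, the reload
behavior can be approximated by re-reading the projected token file on a timer.
The following is a minimal sketch, not the implementation used by any official
client: the class name &lt;code>ReloadingToken&lt;/code> and the 60-second refresh
interval are assumptions made for illustration, while the path is the conventional
in-cluster mount point for the service account token.&lt;/p>

```python
import time

# Conventional in-cluster mount path for the service account token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class ReloadingToken:
    """Caches the token and re-reads the file once the cache is too old.

    Hypothetical helper for illustration; max_age_seconds is an assumed value.
    """

    def __init__(self, path=TOKEN_PATH, max_age_seconds=60.0, clock=time.monotonic):
        self._path = path
        self._max_age = max_age_seconds
        self._clock = clock
        self._token = None
        self._read_at = None

    def get(self):
        """Return the current token, re-reading the file if the cache expired."""
        now = self._clock()
        if self._token is None or now - self._read_at >= self._max_age:
            with open(self._path, encoding="utf-8") as f:
                self._token = f.read().strip()
            self._read_at = now
        return self._token
```

&lt;p>A caller would invoke &lt;code>get()&lt;/code> before each request and place the result in
the &lt;code>Authorization: Bearer&lt;/code> header, so a token rotated on disk by the kubelet
is picked up within one refresh interval.&lt;/p>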
&lt;h4 id="safe-rollout-of-time-bound-token">Safe Rollout of Time-bound Token&lt;/h4>
&lt;p>Legacy service account tokens distributed via secrets are not time-bound. Many
client libraries have come to depend on this behavior. Once time-bound service
account tokens are in use, in-cluster clients that do not periodically reload the
token from the projected volume would have their requests rejected as soon as the
initial token expires.&lt;/p>
&lt;p>In order to allow gradual adoption of time-bound tokens, we would:&lt;/p>
&lt;ol>
&lt;li>Pick a constant period D between one and two hours. The value of D would be
static across Kubernetes deployments and chosen to avoid colliding with commonly
requested durations.&lt;/li>
&lt;li>Modify service account admission control to inject a token valid for D when
the BoundServiceAccountTokenVolume feature is enabled.&lt;/li>
&lt;li>Modify the kube-apiserver TokenRequest API. When it receives a TokenRequest with
a requested validity period of D, extend the token lifetime to one year. At the same
time, save the originally requested D in the &lt;code>kubernetes.io/warnafter&lt;/code> field of
the minted token.&lt;/li>
&lt;li>In the TokenRequest status, tell clients that the token is valid only
for D, encouraging clients to reload the token as if it were valid for D.&lt;/li>
&lt;/ol>
&lt;p>This modification could be optionally enabled by providing a command line flag
to kube-apiserver.&lt;/p>
&lt;p>These extended tokens would remain valid and continue to be accepted for one
year. At the same time, the authentication side could monitor whether clients
are properly reloading tokens by:&lt;/p>
&lt;ol>
&lt;li>Comparing the &lt;code>kubernetes.io/warnafter&lt;/code> field with the current time. If the current
time is after the &lt;code>kubernetes.io/warnafter&lt;/code> field, the calling client is
not reloading its token regularly.&lt;/li>
&lt;li>Exposing metrics that count the legacy and stale tokens used.&lt;/li>
&lt;li>Adding annotations to audit events for legacy and stale tokens, including the
information necessary to locate the problematic client.&lt;/li>
&lt;/ol>
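&lt;p>The extension and staleness rules above can be sketched as two small functions.
This is an illustration under stated assumptions rather than the kube-apiserver
code: the function names are hypothetical, the value chosen for D is only an
example of a period between one and two hours, and the claims are modeled as a
plain dictionary with &lt;code>warnafter&lt;/code> nested under a &lt;code>kubernetes.io&lt;/code> key.&lt;/p>

```python
D_SECONDS = 3607  # illustrative value for D, between one and two hours
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def mint_claims(issued_at, requested_seconds, extend=True):
    """Build token claims; extend the lifetime when the request asks for exactly D."""
    claims = {"iat": issued_at, "exp": issued_at + requested_seconds}
    if extend and requested_seconds == D_SECONDS:
        # Keep the token usable for a year, but remember when it should have
        # been reloaded so that stale use can be detected.
        claims["exp"] = issued_at + ONE_YEAR_SECONDS
        claims["kubernetes.io"] = {"warnafter": issued_at + D_SECONDS}
    return claims


def classify_token(claims, now):
    """Return 'expired', 'stale', or 'valid' for the given claims at time now."""
    if now >= claims["exp"]:
        return "expired"
    warn_after = claims.get("kubernetes.io", {}).get("warnafter")
    if warn_after is not None and now > warn_after:
        # Still accepted, but the client did not reload within D.
        return "stale"
    return "valid"
```

&lt;p>A token classified as &lt;code>stale&lt;/code> is the case that would increment the stale
token metric and add the audit annotation, while the request itself still
succeeds.&lt;/p>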
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;h4 id="tokenrequesttokenrequestprojection">TokenRequest/TokenRequestProjection&lt;/h4>
&lt;ul>
&lt;li>Unit tests&lt;/li>
&lt;li>E2E tests
&lt;ul>
&lt;li>Projected JWT tokens are correctly mounted. (conformance test)&lt;/li>
&lt;li>The owner and mode of projected tokens are correctly set&lt;/li>
&lt;li>In-cluster clients work with token rotation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="rootcaconfigmap">RootCAConfigMap&lt;/h4>
&lt;ul>
&lt;li>Unit tests&lt;/li>
&lt;li>E2E tests
&lt;ul>
&lt;li>Every namespace has configmap &lt;code>kube-root-ca.crt&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="boundserviceaccounttokenvolume">BoundServiceAccountTokenVolume&lt;/h4>
&lt;ul>
&lt;li>Unit tests&lt;/li>
&lt;li>An upgrade test&lt;/li>
&lt;/ul>
&lt;ol>
&lt;li>Create pod A with the feature disabled; pod A is working and a secret
volume is mounted&lt;/li>
&lt;li>Enable the feature; pod A continues working&lt;/li>
&lt;li>Create pod B; it is working and projected volumes are mounted&lt;/li>
&lt;/ol>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;h4 id="tokenrequesttokenrequestprojection-1">TokenRequest/TokenRequestProjection&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Alpha&lt;/th>
&lt;th>Beta&lt;/th>
&lt;th>GA&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1.10&lt;/td>
&lt;td>1.12&lt;/td>
&lt;td>1.20&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h5 id="beta-ga">Beta-&amp;gt;GA&lt;/h5>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> In use by multiple distributions&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Approved by PRR and scalability&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Any known bugs fixed&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Tests passing
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> E2E test &lt;a href="https://testgrid.k8s.io/sig-auth-gce#gce"
target="_blank" rel="noopener">ServiceAccounts should mount projected service account
token when requested&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> E2E test &lt;a href="https://testgrid.k8s.io/sig-auth-gce#gce"
target="_blank" rel="noopener">ServiceAccounts should set ownership and permission when
RunAsUser or FsGroup is
present&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> E2E test
&lt;a href="https://testgrid.k8s.io/sig-auth-gce#gce-slow"
target="_blank" rel="noopener">ServiceAccounts should support InClusterConfig with token rotation&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="rootcaconfigmap-1">RootCAConfigMap&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Alpha&lt;/th>
&lt;th>Beta&lt;/th>
&lt;th>GA&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1.13&lt;/td>
&lt;td>1.20&lt;/td>
&lt;td>1.21&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h5 id="beta-ga-1">Beta-&amp;gt;GA&lt;/h5>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> In use by multiple distributions&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Approved by PRR and scalability&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Any known bugs fixed&lt;/li>
&lt;/ul>
&lt;h4 id="boundserviceaccounttokenvolume-1">BoundServiceAccountTokenVolume&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Alpha&lt;/th>
&lt;th>Beta&lt;/th>
&lt;th>GA&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1.13&lt;/td>
&lt;td>1.21&lt;/td>
&lt;td>1.22&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h5 id="alpha-beta">Alpha-&amp;gt;Beta&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> Any known bugs fixed&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> PodSecurityPolicies that allow secrets but not projected volumes
will prevent the use of token volumes.
&lt;ul>
&lt;li>Fixed in &lt;a href="https://github.com/kubernetes/kubernetes/pull/92006"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/92006&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> In-cluster clients that don’t reload service account tokens will
start failing an hour after deployment.
&lt;ul>
&lt;li>Mitigation added in
&lt;a href="https://github.com/kubernetes/kubernetes/issues/68164"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/issues/68164&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Pods running as non root may not access the service account token.
&lt;ul>
&lt;li>Fixed in &lt;a href="https://github.com/kubernetes/kubernetes/pull/89193"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/89193&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Dynamic clientbuilder does not invalidate token.
&lt;ul>
&lt;li>Fixed in &lt;a href="https://github.com/kubernetes/kubernetes/pull/99324"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/99324&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> Tests passing&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Upgrade test
&lt;a href="https://testgrid.k8s.io/sig-auth-gce#upgrade-tests"
target="_blank" rel="noopener">sig-auth-serviceaccount-admission-controller-migration&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> TokenRequest/TokenRequestProjection GA&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;input checked="" disabled="" type="checkbox"> RootCAConfigMap GA&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h5 id="beta---ga-graduation">Beta -&amp;gt; GA Graduation&lt;/h5>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Allow kube-apiserver to recognize multiple issuers to enable a
non-disruptive issuer change.
&lt;ul>
&lt;li>Fixed in &lt;a href="https://github.com/kubernetes/kubernetes/pull/101155"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/101155&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> The new &lt;code>ServiceAccount&lt;/code> admission controller works as intended in Beta
for &amp;gt;= 1 minor release without significant issues.&lt;/li>
&lt;/ul>
&lt;h2 id="production-readiness-review-questionnaire">Production Readiness Review Questionnaire&lt;/h2>
&lt;h3 id="feature-enablement-and-rollback">Feature Enablement and Rollback&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can this feature be enabled / disabled in a live cluster?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Feature gate name: &lt;code>BoundServiceAccountTokenVolume&lt;/code>&lt;/li>
&lt;li>Components depending on the feature gate: kube-apiserver and
kube-controller-manager&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime of the control
plane? yes, need to restart kube-apiserver and kube-controller-manager.&lt;/li>
&lt;li>Will enabling / disabling the feature require downtime or reprovisioning
of a node? no.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Does enabling the feature change any default behavior?&lt;/strong> yes, pods'
service account tokens will expire after 1 year by default and are not
stored as Secrets any more.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Can the feature be disabled once it has been enabled (i.e. can we roll
back the enablement)?&lt;/strong> yes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What happens if we reenable the feature if it was previously rolled
back?&lt;/strong> the same as the first enablement.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any tests for feature enablement/disablement?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>unit test: plugin/pkg/admission/serviceaccount/admission_test.go&lt;/li>
&lt;li>upgrade test:
test/e2e/upgrades/serviceaccount_admission_controller_migration.go&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="rollout-upgrade-and-rollback-planning">Rollout, Upgrade and Rollback Planning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can a rollout fail? Can it impact already running workloads?&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>creation of CA configmap can fail due to permission / quota / admission
errors.&lt;/li>
&lt;li>newly issued tokens could fail to be recognized by skewed API servers
not configured with the bound token signing key/issuer.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What specific metrics should inform a rollback?&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>creation of CA configmap,
&lt;ul>
&lt;li>&lt;code>root_ca_cert_publisher_rate_limiter_use&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>authentication errors in (n-1) API servers,
&lt;ul>
&lt;li>&lt;code>authentication_attempts&lt;/code>&lt;/li>
&lt;li>&lt;code>authentication_duration_seconds&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Were upgrade and rollback tested? Was the upgrade-&amp;gt;downgrade-&amp;gt;upgrade path
tested?&lt;/strong>
For upgrade, an e2e test is set up and running here:
&lt;a href="https://testgrid.k8s.io/sig-auth-gce#upgrade-tests&amp;amp;width=5"
target="_blank" rel="noopener">https://testgrid.k8s.io/sig-auth-gce#upgrade-tests&amp;amp;width=5&lt;/a>
&lt;/p>
&lt;p>For downgrade, we manually verified that a workload continues to
authenticate successfully.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Is the rollout accompanied by any deprecations and/or removals of
features, APIs, fields of API types, flags, etc.?&lt;/strong> no&lt;/li>
&lt;/ul>
&lt;h3 id="monitoring-requirements">Monitoring Requirements&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How can an operator determine if the feature is in use by workloads?&lt;/strong>&lt;/p>
&lt;p>Check TokenRequest in use:&lt;/p>
&lt;ul>
&lt;li>&lt;code>serviceaccount_valid_tokens_total&lt;/code>: cumulative valid projected service
account tokens used&lt;/li>
&lt;li>&lt;code>serviceaccount_stale_tokens_total&lt;/code>: cumulative stale projected service
account tokens used&lt;/li>
&lt;li>&lt;code>apiserver_request_total&lt;/code>: with labels &lt;code>group=&amp;quot;&amp;quot;,version=&amp;quot;v1&amp;quot;,resource=&amp;quot;serviceaccounts&amp;quot;,subresource=&amp;quot;token&amp;quot;&lt;/code>&lt;/li>
&lt;li>&lt;code>apiserver_request_duration_seconds&lt;/code>: with labels &lt;code>group=&amp;quot;&amp;quot;,version=&amp;quot;v1&amp;quot;,resource=&amp;quot;serviceaccounts&amp;quot;,subresource=&amp;quot;token&amp;quot;&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Metrics
&lt;ul>
&lt;li>Metric name: &lt;code>apiserver_request_total&lt;/code>&lt;/li>
&lt;li>Aggregation method: &lt;code>group=&amp;quot;&amp;quot;,version=&amp;quot;v1&amp;quot;,resource=&amp;quot;serviceaccounts&amp;quot;,subresource=&amp;quot;token&amp;quot;&lt;/code>&lt;/li>
&lt;li>Components exposing the metric: kube-apiserver&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What are the reasonable SLOs (Service Level Objectives) for the above
SLIs?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>per-day percentage of API calls finishing with 5XX errors &amp;lt;= 1%&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Are there any missing metrics that would be useful to have to improve
observability of this feature?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>add granularity to &lt;code>storage_operation_duration_seconds&lt;/code> to distinguish
projected volume sources (configmap, secret, token, etc.), or add new metrics
so that we can know the usage of projected tokens.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
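&lt;p>The SLO above reduces to a ratio over one day of request counts. The sketch below
assumes the counts have already been collected from &lt;code>apiserver_request_total&lt;/code>
for the token subresource and grouped by HTTP status code; the function names and
the input shape are hypothetical and only illustrate the arithmetic.&lt;/p>

```python
def error_ratio_5xx(counts_by_code):
    """Fraction of requests that finished with a 5XX status code."""
    total = sum(counts_by_code.values())
    if total == 0:
        return 0.0  # no traffic, so no errors to report
    errors = sum(
        count for code, count in counts_by_code.items() if str(code).startswith("5")
    )
    return errors / total


def meets_slo(counts_by_code, max_ratio=0.01):
    """True when the 5XX ratio does not exceed the 1% objective."""
    return max_ratio >= error_ratio_5xx(counts_by_code)
```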
&lt;h3 id="dependencies">Dependencies&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Does this feature depend on any specific services running in the
cluster?&lt;/strong> There are no new components required, but specific versions of
kubelet and kube-controller-manager are required&lt;/p>
&lt;p>TokenRequest depends on kubelets &amp;gt;= 1.12&lt;/p>
&lt;p>BoundServiceAccountTokenVolume depends on kubelets &amp;gt;= 1.12 with TokenRequest
enabled (default since 1.12) and kube-controller-manager &amp;gt;= 1.12 with
RootCAConfigMap feature enabled (default since 1.20)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="scalability">Scalability&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new API calls?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>API call type: &lt;code>TokenRequest&lt;/code>&lt;/li>
&lt;li>estimated throughput: 1/pod every ~48 minutes.&lt;/li>
&lt;li>originating component: kubelet&lt;/li>
&lt;li>components listing and/or watching resources they didn&amp;rsquo;t before: N/A.&lt;/li>
&lt;li>API calls that may be triggered by changes of some Kubernetes resources:
N/A.&lt;/li>
&lt;li>periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.): 1 call per pod every ~48 minutes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in introducing new API types?&lt;/strong>
no.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in any new calls to the cloud
provider?&lt;/strong> no.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing size or count of
the existing API objects?&lt;/strong> controller creates one additional configmap per
namespace.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in increasing time taken by any
operations covered by &lt;a href="https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos"
target="_blank" rel="noopener">existing SLIs/SLOs&lt;/a>
?&lt;/strong> no.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, &amp;hellip;) in any components?&lt;/strong> it adds a
token minting operation in the API server every ~48 minutes for every pod.&lt;/p>
&lt;/li>
&lt;/ul>
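&lt;p>The load implied by one &lt;code>TokenRequest&lt;/code> call per pod every ~48 minutes can be
estimated with simple arithmetic. The sketch below takes the 48-minute interval
from the answers above; the function names and the example cluster size are
assumptions used only for illustration, and each pod is assumed to mount a single
projected token.&lt;/p>

```python
REFRESH_INTERVAL_MINUTES = 48  # from the throughput estimate in this section


def token_requests_per_second(pod_count, interval_minutes=REFRESH_INTERVAL_MINUTES):
    """Average TokenRequest calls per second for the whole cluster."""
    return pod_count / (interval_minutes * 60)


def token_requests_per_day(pod_count, interval_minutes=REFRESH_INTERVAL_MINUTES):
    """Total TokenRequest calls per day for the whole cluster."""
    return pod_count * (24 * 60) / interval_minutes


if __name__ == "__main__":
    pods = 150_000  # hypothetical cluster size
    print(f"{token_requests_per_second(pods):.1f} TokenRequest calls/s")
    print(f"{token_requests_per_day(pods):.0f} TokenRequest calls/day")
```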
&lt;h3 id="troubleshooting">Troubleshooting&lt;/h3>
&lt;p>The Troubleshooting section currently serves the &lt;code>Playbook&lt;/code> role. We may
consider splitting it into a dedicated &lt;code>Playbook&lt;/code> document (potentially with
some monitoring details). For now, we leave it here.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>How does this feature react if the API server and/or etcd is
unavailable?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>TokenRequest API is unavailable&lt;/li>
&lt;li>configmap containing API server CA bundle cannot be created or fetched&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>What are other known failure modes?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>failure to issue token via token subresource&lt;/p>
&lt;ul>
&lt;li>Detection: check &lt;code>apiserver_request_total&lt;/code> with labels
&lt;code>group=&amp;quot;&amp;quot;,version=&amp;quot;v1&amp;quot;,resource=&amp;quot;serviceaccounts&amp;quot;,subresource=&amp;quot;token&amp;quot;&lt;/code>&lt;/li>
&lt;li>Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
the kube-apiserver and recreate pods.&lt;/li>
&lt;li>Diagnostics: &amp;ldquo;failed to generate token&amp;rdquo; in kube-apiserver log.&lt;/li>
&lt;li>Testing: &lt;a href="https://testgrid.k8s.io/sig-auth-gce#gce&amp;amp;width=5&amp;amp;include-filter-by-regex=ServiceAccounts%20should%20mount%20projected%20service%20account%20token"
target="_blank" rel="noopener">e2e test&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>failure to create root CA config map&lt;/p>
&lt;ul>
&lt;li>Detection: check &lt;code>root_ca_cert_publisher_sync_total&lt;/code> from
kube-controller-manager. (available in 1.21+)&lt;/li>
&lt;li>Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
the kube-apiserver and recreate pods.&lt;/li>
&lt;li>Diagnostics: &amp;ldquo;syncing [namespace]/[configmap name] failed&amp;rdquo; in
kube-controller-manager log.&lt;/li>
&lt;li>Testing: &lt;a href="https://testgrid.k8s.io/sig-auth-gce#gce&amp;amp;width=5&amp;amp;include-filter-by-regex=ServiceAccounts%20should%20guarantee%20kube-root-ca.crt%20exist%20in%20any%20namespace"
target="_blank" rel="noopener">e2e test&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>kubelet fails to renew token&lt;/p>
&lt;ul>
&lt;li>Detection: check &lt;code>apiserver_request_total&lt;/code> with labels
&lt;code>group=&amp;quot;&amp;quot;,version=&amp;quot;v1&amp;quot;,resource=&amp;quot;serviceaccounts&amp;quot;,subresource=&amp;quot;token&amp;quot;&lt;/code> to
see whether requests for a new token failed; check the kubelet log.&lt;/li>
&lt;li>Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
the kube-apiserver and recreate pods.&lt;/li>
&lt;li>Diagnostics: &amp;ldquo;token [namespace]/[token name] expired and refresh failed&amp;rdquo;
in kubelet log.&lt;/li>
&lt;li>Testing: &lt;a href="https://testgrid.k8s.io/sig-auth-gce#gce-slow&amp;amp;width=5"
target="_blank" rel="noopener">e2e test&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>workload fails to refresh token from disk&lt;/p>
&lt;ul>
&lt;li>Detection: &lt;code>serviceaccount_stale_tokens_total&lt;/code> emitted by kube-apiserver&lt;/li>
&lt;li>Mitigations: update client library to newer version.&lt;/li>
&lt;li>Diagnostics: look for &lt;code>authentication.k8s.io/stale-token&lt;/code> in audit log if
&lt;code>--service-account-extend-token-expiration=true&lt;/code>, or check authentication
error in kube-apiserver log.&lt;/li>
&lt;li>Testing: covered in all client libraries&amp;rsquo; unit tests.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>What steps should be taken if SLOs are not being met to determine the
problem?&lt;/strong> Check the kube-apiserver, kube-controller-manager, and kubelet logs.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Resources: Bounding Self-Labeling Kubelets</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/279/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/279/</guid><description>
&lt;h1 id="bounding-self-labeling-kubelets">Bounding Self-Labeling Kubelets&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#capturing-dedicated-workloads"
>Capturing Dedicated Workloads&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-timeline"
>Implementation Timeline&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#alternatives-considered"
>Alternatives Considered&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#file-or-flag-based-configuration-of-the-apiserver-to-allow-specifying-allowed-labels"
>File or flag-based configuration of the apiserver to allow specifying allowed labels&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#api-based-configuration-of-the-apiserver-to-allow-specifying-allowed-labels"
>API-based configuration of the apiserver to allow specifying allowed labels&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#allow-kubelets-to-add-any-labels-they-wish-and-add-noschedule-taints-if-disallowed-labels-are-added"
>Allow kubelets to add any labels they wish, and add NoSchedule taints if disallowed labels are added&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#forbid-all-labels-regardless-of-namespace-except-for-a-specifically-allowed-set"
>Forbid all labels regardless of namespace except for a specifically allowed set&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Today the node client has total authority over its own Node labels.
This ability is incredibly useful for the node auto-registration flow.
The kubelet reports a set of well-known labels, as well as additional
labels specified on the command line with &lt;code>--node-labels&lt;/code>.&lt;/p>
&lt;p>While this distributed method of registration is convenient and expedient, it
has two problems that a centralized approach would not have. The minor one is that
it makes management difficult: instead of configuring labels in a centralized
place, we must configure &lt;code>N&lt;/code> kubelet command lines. More significantly, the
approach greatly compromises security. Below are two straightforward escalations
from an initially compromised node that illustrate the attack vector.&lt;/p>
&lt;h3 id="capturing-dedicated-workloads">Capturing Dedicated Workloads&lt;/h3>
&lt;p>Suppose company &lt;code>foo&lt;/code> needs to run an application that deals with PII on
dedicated nodes to comply with government regulation. A common mechanism for
implementing dedicated nodes in Kubernetes today is to set a label or taint
(e.g. &lt;code>foo/dedicated=customer-info-app&lt;/code>) on the node and to select these
dedicated nodes in the workload controller running &lt;code>customer-info-app&lt;/code>.&lt;/p>
&lt;p>Since nodes self-report labels upon registration, an intruder can easily
register a compromised node with the label &lt;code>foo/dedicated=customer-info-app&lt;/code>. The
scheduler will then bind &lt;code>customer-info-app&lt;/code> to the compromised node, potentially
giving the intruder easy access to the PII.&lt;/p>
&lt;p>This attack also extends to secrets. Suppose company &lt;code>foo&lt;/code> runs their
outward-facing nginx on dedicated nodes to reduce exposure of the company&amp;rsquo;s publicly
trusted server certificates. They use the secret mechanism to distribute the
serving certificate key. An intruder captures the dedicated nginx workload in
the same way and can now use the node certificate to read the company&amp;rsquo;s serving
certificate key.&lt;/p>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Modify the &lt;code>NodeRestriction&lt;/code> admission plugin to prevent Kubelets from self-setting labels
within the &lt;code>k8s.io&lt;/code> and &lt;code>kubernetes.io&lt;/code> namespaces &lt;em>except for these specifically allowed labels/prefixes&lt;/em>:&lt;/p>
&lt;pre tabindex="0">&lt;code>kubernetes.io/hostname
kubernetes.io/instance-type
kubernetes.io/os
kubernetes.io/arch
beta.kubernetes.io/instance-type
beta.kubernetes.io/os
beta.kubernetes.io/arch
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
failure-domain.kubernetes.io/zone
failure-domain.kubernetes.io/region
[*.]kubelet.kubernetes.io/*
[*.]node.kubernetes.io/*
&lt;/code>&lt;/pre>&lt;/li>
&lt;li>
&lt;p>Reserve and document the &lt;code>node-restriction.kubernetes.io/*&lt;/code> label prefix for cluster administrators
that want to label their &lt;code>Node&lt;/code> objects centrally for isolation purposes.&lt;/p>
&lt;blockquote>
&lt;p>The &lt;code>node-restriction.kubernetes.io/*&lt;/code> label prefix is reserved for cluster administrators
to isolate nodes. These labels cannot be self-set by kubelets when the &lt;code>NodeRestriction&lt;/code>
admission plugin is enabled.&lt;/p>&lt;/blockquote>
&lt;/li>
&lt;/ol>
&lt;p>This accomplishes the following goals:&lt;/p>
&lt;ul>
&lt;li>continues allowing people to use arbitrary labels under their own namespaces any way they wish&lt;/li>
&lt;li>supports legacy labels kubelets are already adding&lt;/li>
&lt;li>provides a place under the &lt;code>kubernetes.io&lt;/code> label namespace for node isolation labeling&lt;/li>
&lt;li>provides a place under the &lt;code>kubernetes.io&lt;/code> label namespace for kubelets to self-label with kubelet and node-specific labels&lt;/li>
&lt;/ul>
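&lt;p>The label rule in this proposal can be sketched as a single predicate. This is an
illustration of the rule as written above, not the &lt;code>NodeRestriction&lt;/code> plugin
source: the function names are hypothetical, and the reading of
&lt;code>[*.]&lt;/code> as "this domain or any subdomain of it" is an assumption.&lt;/p>

```python
# Exact labels that kubelets may still set inside the restricted namespaces.
ALLOWED_LABELS = frozenset({
    "kubernetes.io/hostname",
    "kubernetes.io/instance-type",
    "kubernetes.io/os",
    "kubernetes.io/arch",
    "beta.kubernetes.io/instance-type",
    "beta.kubernetes.io/os",
    "beta.kubernetes.io/arch",
    "failure-domain.beta.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/region",
    "failure-domain.kubernetes.io/zone",
    "failure-domain.kubernetes.io/region",
})
# Domains (and their subdomains) under which any label name is allowed.
ALLOWED_DOMAINS = ("kubelet.kubernetes.io", "node.kubernetes.io")
# Namespaces in which everything else is forbidden for kubelets.
RESTRICTED_DOMAINS = ("kubernetes.io", "k8s.io")


def _in_domain(domain, base):
    """True if domain is base itself or a subdomain of base."""
    return domain == base or domain.endswith("." + base)


def kubelet_may_set(label_key):
    """Return True if a kubelet may self-set the given label key."""
    if "/" not in label_key:
        return True  # unprefixed labels are outside the restricted namespaces
    domain = label_key.split("/", 1)[0]
    if not any(_in_domain(domain, base) for base in RESTRICTED_DOMAINS):
        return True  # labels under other namespaces stay unrestricted
    if label_key in ALLOWED_LABELS:
        return True
    return any(_in_domain(domain, base) for base in ALLOWED_DOMAINS)
```

&lt;p>Under this reading, &lt;code>node-restriction.kubernetes.io/*&lt;/code> labels are rejected
without a special case, because they fall inside the &lt;code>kubernetes.io&lt;/code> namespace
and match none of the allowed labels or prefixes.&lt;/p>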
&lt;h2 id="implementation-timeline">Implementation Timeline&lt;/h2>
&lt;p>v1.13:&lt;/p>
&lt;ul>
&lt;li>Kubelet deprecates setting &lt;code>kubernetes.io&lt;/code> or &lt;code>k8s.io&lt;/code> labels via &lt;code>--node-labels&lt;/code>,
other than the specifically allowed labels/prefixes described above,
and warns when invoked with &lt;code>kubernetes.io&lt;/code> or &lt;code>k8s.io&lt;/code> labels outside that set.&lt;/li>
&lt;li>NodeRestriction admission prevents kubelets from adding/removing/modifying &lt;code>[*.]node-restriction.kubernetes.io/*&lt;/code> labels on Node &lt;em>create&lt;/em> and &lt;em>update&lt;/em>&lt;/li>
&lt;li>NodeRestriction admission prevents kubelets from adding/removing/modifying &lt;code>kubernetes.io&lt;/code> or &lt;code>k8s.io&lt;/code>
labels other than the specifically allowed labels/prefixes described above on Node &lt;em>update&lt;/em> only&lt;/li>
&lt;/ul>
&lt;p>v1.14:&lt;/p>
&lt;ul>
&lt;li>Begin migration/removal of in-tree &lt;code>--node-labels&lt;/code> use outside of the allowed set by addons:
&lt;ul>
&lt;li>&lt;code>beta.kubernetes.io/fluentd-ds-ready&lt;/code>
&lt;ul>
&lt;li>addon: remove from the nodeSelector&lt;/li>
&lt;li>kube-up: remove from the default &lt;code>--node-labels&lt;/code> flag&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>beta.kubernetes.io/metadata-proxy-ready&lt;/code>
&lt;ul>
&lt;li>addon: announce the nodeSelector will switch to &lt;code>cloud.google.com/metadata-proxy-ready&lt;/code> in 1.15&lt;/li>
&lt;li>kube-up: add &lt;code>cloud.google.com/metadata-proxy-ready=true&lt;/code> along with the existing label to &lt;code>--node-labels&lt;/code>&lt;/li>
&lt;li>kube-up: add &lt;code>cloud.google.com/metadata-proxy-ready=true&lt;/code> to existing nodes with the &lt;code>beta.kubernetes.io/metadata-proxy-ready=true&lt;/code> label&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>beta.kubernetes.io/kube-proxy-ds-ready&lt;/code>
&lt;ul>
&lt;li>addon: announce the nodeSelector will switch to &lt;code>node.kubernetes.io/kube-proxy-ds-ready&lt;/code> in 1.15&lt;/li>
&lt;li>kube-up: add &lt;code>node.kubernetes.io/kube-proxy-ds-ready=true&lt;/code> along with the existing label to &lt;code>--node-labels&lt;/code>&lt;/li>
&lt;li>kube-up: add &lt;code>node.kubernetes.io/kube-proxy-ds-ready=true&lt;/code> to existing nodes with the &lt;code>beta.kubernetes.io/kube-proxy-ds-ready=true&lt;/code> label&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>beta.kubernetes.io/masq-agent-ds-ready&lt;/code>
&lt;ul>
&lt;li>addon: announce the nodeSelector will switch to &lt;code>node.kubernetes.io/masq-agent-ds-ready&lt;/code> in 1.16&lt;/li>
&lt;li>kube-up: add &lt;code>node.kubernetes.io/masq-agent-ds-ready=true&lt;/code> to existing nodes with the &lt;code>beta.kubernetes.io/masq-agent-ds-ready=true&lt;/code> label&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>v1.16:&lt;/p>
&lt;ul>
&lt;li>Complete migration/removal of in-tree &lt;code>--node-labels&lt;/code> use outside of the allowed set by addons:
&lt;ul>
&lt;li>&lt;code>beta.kubernetes.io/metadata-proxy-ready&lt;/code>
&lt;ul>
&lt;li>addon: change the nodeSelector to &lt;code>cloud.google.com/metadata-proxy-ready&lt;/code>&lt;/li>
&lt;li>kube-up: stop setting &lt;code>beta.kubernetes.io/metadata-proxy-ready&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>beta.kubernetes.io/kube-proxy-ds-ready&lt;/code>
&lt;ul>
&lt;li>addon: change the nodeSelector to &lt;code>node.kubernetes.io/kube-proxy-ds-ready&lt;/code>&lt;/li>
&lt;li>kube-up: stop setting &lt;code>beta.kubernetes.io/kube-proxy-ds-ready&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>beta.kubernetes.io/masq-agent-ds-ready&lt;/code>
&lt;ul>
&lt;li>addon: change the nodeSelector to &lt;code>node.kubernetes.io/masq-agent-ds-ready&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Kubelet removes the ability to set &lt;code>kubernetes.io&lt;/code> or &lt;code>k8s.io&lt;/code> labels via &lt;code>--node-labels&lt;/code>
other than the specifically allowed labels/prefixes described above (deprecation period
of 6 months for CLI elements of admin-facing components is complete)&lt;/li>
&lt;/ul>
&lt;p>v1.19:&lt;/p>
&lt;ul>
&lt;li>NodeRestriction admission prevents kubelets from adding/removing/modifying &lt;code>kubernetes.io&lt;/code> or &lt;code>k8s.io&lt;/code>
labels other than the specifically allowed labels/prefixes described above on Node &lt;em>update&lt;/em> and &lt;em>create&lt;/em>
(oldest supported kubelet running against a v1.19 apiserver is v1.17)&lt;/li>
&lt;/ul>
&lt;h2 id="alternatives-considered">Alternatives Considered&lt;/h2>
&lt;h3 id="file-or-flag-based-configuration-of-the-apiserver-to-allow-specifying-allowed-labels">File or flag-based configuration of the apiserver to allow specifying allowed labels&lt;/h3>
&lt;ul>
&lt;li>A fixed set of labels and label prefixes is simpler to reason about, and makes every cluster behave consistently&lt;/li>
&lt;li>File-based config isn&amp;rsquo;t easily inspectable to be able to verify enforced labels&lt;/li>
&lt;li>File-based config isn&amp;rsquo;t easily kept in sync in HA apiserver setups&lt;/li>
&lt;/ul>
&lt;h3 id="api-based-configuration-of-the-apiserver-to-allow-specifying-allowed-labels">API-based configuration of the apiserver to allow specifying allowed labels&lt;/h3>
&lt;ul>
&lt;li>A fixed set of labels and label prefixes is simpler to reason about, and makes every cluster behave consistently&lt;/li>
&lt;li>An API object that controls the allowed labels is a potential escalation path for a compromised node&lt;/li>
&lt;/ul>
&lt;h3 id="allow-kubelets-to-add-any-labels-they-wish-and-add-noschedule-taints-if-disallowed-labels-are-added">Allow kubelets to add any labels they wish, and add NoSchedule taints if disallowed labels are added&lt;/h3>
&lt;ul>
&lt;li>To be robust, this approach would also likely involve a controller to automatically inspect labels and remove the NoSchedule taint. This seemed overly complex. Additionally, it was difficult to come up with a tainting scheme that preserved information about which labels were the cause.&lt;/li>
&lt;/ul>
&lt;h3 id="forbid-all-labels-regardless-of-namespace-except-for-a-specifically-allowed-set">Forbid all labels regardless of namespace except for a specifically allowed set&lt;/h3>
&lt;ul>
&lt;li>This was much more disruptive to existing usage of &lt;code>--node-labels&lt;/code>.&lt;/li>
&lt;li>This was much more difficult to integrate with other systems allowing arbitrary topology labels like CSI.&lt;/li>
&lt;li>This placed restrictions on how labels outside the &lt;code>kubernetes.io&lt;/code> and &lt;code>k8s.io&lt;/code> label namespaces could be used, which didn&amp;rsquo;t seem proper.&lt;/li>
&lt;/ul></description></item><item><title>Resources: Breaking apart the Kubernetes test tarball</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/714/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/resources/keps/714/</guid><description>
&lt;h1 id="breaking-apart-the-kubernetes-test-tarball">Breaking apart the kubernetes test tarball&lt;/h1>
&lt;h2 id="table-of-contents">Table of Contents&lt;/h2>
&lt;!-- toc -->
&lt;ul>
&lt;li>&lt;a href="#release-signoff-checklist"
>Release Signoff Checklist&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#summary"
>Summary&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#motivation"
>Motivation&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#goals"
>Goals&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#non-goals"
>Non-Goals&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#proposal"
>Proposal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#internal-structure-of-the-test-tarball"
>Internal structure of the test tarball&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#binary-artifacts"
>Binary artifacts&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#portable-sources"
>Portable sources&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#updating-dependencies-on-kubernetes-testtargz"
>Updating dependencies on &lt;code>kubernetes-test.tar.gz&lt;/code>&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#dependencies-outside-the-kubernetes-organization"
>Dependencies outside the Kubernetes organization&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#risks-and-mitigations"
>Risks and Mitigations&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#test-plan"
>Test Plan&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#graduation-criteria"
>Graduation Criteria&lt;/a>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#references"
>References&lt;/a>
&lt;/li>
&lt;li>&lt;a href="#implementation-history"
>Implementation History&lt;/a>
&lt;/li>
&lt;/ul>
&lt;!-- /toc -->
&lt;h2 id="release-signoff-checklist">Release Signoff Checklist&lt;/h2>
&lt;ul>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &lt;a href="https://github.com/kubernetes/enhancements/issues/714"
target="_blank" rel="noopener">k/enhancements issue in release milestone and linked to KEP&lt;/a>
&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> KEP approvers have set the KEP status to &lt;code>implementable&lt;/code>&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Design details are appropriately documented&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Test plan is in place, giving consideration to SIG Architecture and SIG Testing input&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Graduation criteria is in place&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> &amp;ldquo;Implementation History&amp;rdquo; section is up-to-date for milestone&lt;/li>
&lt;li>&lt;input checked="" disabled="" type="checkbox"> Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The Kubernetes release artifacts include a &amp;ldquo;mondo&amp;rdquo; test tarball that contains both
&amp;ldquo;portable&amp;rdquo; test sources (such as shell scripts and image manifests) and
platform-specific test binaries for all supported client, node, and
server platforms.&lt;/p>
&lt;p>This KEP proposes replacing the monolithic test tarball with platform-specific
tarballs, matching the existing pattern used for the client, node, and server
tarballs.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>As the number of supported client, server, and node platforms has increased, the
size of the test tarball has increased as well. As of Kubernetes v1.13.2, the
official &lt;code>kubernetes-test.tar.gz&lt;/code> is approximately 1.2GB; previous releases
ranged from 1.3 - 1.5GB. Several years ago,
&lt;a href="https://github.com/kubernetes/kubernetes/issues/28435"
target="_blank" rel="noopener">there were complaints&lt;/a>
that the &amp;ldquo;full&amp;rdquo; &lt;code>kubernetes.tar.gz&lt;/code> release tarball was too big at 1.4GB.
Many of the motivations for breaking up that tarball carry over to this proposal.&lt;/p>
&lt;p>The Bazel effort is another driving motivation. It&amp;rsquo;s possible to build all
release artifacts solely using Bazel, and there is progress being made on
supporting cross-compilation of binary artifacts, but combining multiple
target platforms in one Bazel call is currently not well-supported.
Separating this tarball would make it easier to ensure that we can use Bazel to
produce artifacts identical to those of the non-Bazel build.&lt;/p>
&lt;h3 id="goals">Goals&lt;/h3>
&lt;ul>
&lt;li>The Kubernetes test artifacts are broken apart such that users only need to
download binaries for the platforms they&amp;rsquo;re using.&lt;/li>
&lt;li>Be largely invisible to most developers: everything should just keep working.&lt;/li>
&lt;/ul>
&lt;h3 id="non-goals">Non-Goals&lt;/h3>
&lt;ul>
&lt;li>Changing the underlying build system. Both the make/shell-based build
system and the Bazel-based build system will be affected, but users can
continue to use their existing workflow.&lt;/li>
&lt;li>Removing cruft from the test tarballs. Likely, there are binaries and portable
sources no longer being used anywhere, but we won&amp;rsquo;t prune them with this
effort.&lt;/li>
&lt;li>Changing what is released independent of the test tarball; i.e. whether we
should make test binaries able to be downloaded directly from GCS.&lt;/li>
&lt;/ul>
&lt;h2 id="proposal">Proposal&lt;/h2>
&lt;p>Instead of building and distributing a single &lt;code>kubernetes-test.tar.gz&lt;/code> with all
portable sources and compiled binaries for all platforms, produce several
platform-specific tarballs, one for each platform defined in
&lt;a href="https://github.com/kubernetes/kubernetes/blob/193f659a1cd454b93cbe1e7b1f13b77c21783461/hack/lib/golang.sh#L150-L160"
target="_blank" rel="noopener">&lt;code>KUBE_TEST_PLATFORMS&lt;/code>&lt;/a>
:&lt;/p>
&lt;ul>
&lt;li>&lt;code>kubernetes-test-linux-amd64.tar.gz&lt;/code>&lt;/li>
&lt;li>&lt;code>kubernetes-test-linux-arm.tar.gz&lt;/code>&lt;/li>
&lt;li>&lt;code>kubernetes-test-linux-arm64.tar.gz&lt;/code>&lt;/li>
&lt;li>&lt;code>kubernetes-test-linux-s390x.tar.gz&lt;/code>&lt;/li>
&lt;li>&lt;code>kubernetes-test-linux-ppc64le.tar.gz&lt;/code>&lt;/li>
&lt;li>&lt;code>kubernetes-test-darwin-amd64.tar.gz&lt;/code>&lt;/li>
&lt;li>&lt;code>kubernetes-test-windows-amd64.tar.gz&lt;/code>&lt;/li>
&lt;/ul>
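&lt;p>The naming scheme above can be sketched as follows. This is an illustrative sketch, not the build system&amp;rsquo;s implementation: &lt;code>TEST_PLATFORMS&lt;/code> and &lt;code>split_tarball_name&lt;/code> are hypothetical names, and the platform list simply mirrors the tarballs listed above.&lt;/p>

```python
# Hypothetical platform list mirroring the tarballs listed above; the
# authoritative list is KUBE_TEST_PLATFORMS in hack/lib/golang.sh.
TEST_PLATFORMS = [
    "linux/amd64",
    "linux/arm",
    "linux/arm64",
    "linux/s390x",
    "linux/ppc64le",
    "darwin/amd64",
    "windows/amd64",
]


def split_tarball_name(platform):
    """Map an "os/arch" platform string to its split test tarball name."""
    os_name, arch = platform.split("/")
    return f"kubernetes-test-{os_name}-{arch}.tar.gz"


if __name__ == "__main__":
    # Print one tarball name per platform.
    for platform in TEST_PLATFORMS:
        print(split_tarball_name(platform))
```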
&lt;h3 id="internal-structure-of-the-test-tarball">Internal structure of the test tarball&lt;/h3>
&lt;p>At present, the Kubernetes test tarball has several components, all
rooted under a &lt;code>kubernetes/&lt;/code> top-level directory.&lt;/p>
&lt;h4 id="binary-artifacts">Binary artifacts&lt;/h4>
&lt;p>The test binary artifacts are currently organized into directories divided by platform:&lt;/p>
&lt;ul>
&lt;li>&lt;code>platforms/&lt;/code>
&lt;ul>
&lt;li>&lt;code>darwin/&lt;/code>, &lt;code>linux/&lt;/code>, &lt;code>windows/&lt;/code>
&lt;ul>
&lt;li>&lt;code>amd64/&lt;/code>, &lt;code>arm/&lt;/code>, &lt;code>arm64/&lt;/code>, &lt;code>ppc64le/&lt;/code>, &lt;code>s390x/&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>For comparison, the existing platform-specific tarballs
(such as &lt;code>kubernetes-client-linux-amd64.tar.gz&lt;/code>) place all binaries under
a constant path with no platform information: &lt;code>kubernetes/client/bin/kubectl&lt;/code>.&lt;/p>
&lt;p>Scripts (such as &lt;code>cluster/get-kube-binaries.sh&lt;/code>) &lt;a href="https://github.com/kubernetes/kubernetes/blob/193f659a1cd454b93cbe1e7b1f13b77c21783461/cluster/get-kube-binaries.sh#L143-L156"
target="_blank" rel="noopener">extract these tarballs
back into platform-specific directories&lt;/a>
to support downloading multiple platforms into a single workspace.&lt;/p>
&lt;p>The test tarball should follow the lead of the other platform-specific tarballs
and place the binaries under &lt;code>test/bin&lt;/code>. We can then reuse the existing
functionality already implemented for the other tarballs.&lt;/p>
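&lt;p>As a rough sketch of the path handling this implies, the function below maps a member of a platform-specific test tarball to its location in a multi-platform workspace. It assumes members of the form &lt;code>kubernetes/test/bin/NAME&lt;/code> and a workspace that keeps binaries under &lt;code>platforms/OS/ARCH/&lt;/code>. The function name, the strip count, and the example binary name are assumptions for illustration and are not taken from &lt;code>get-kube-binaries.sh&lt;/code>.&lt;/p>

```python
from pathlib import PurePosixPath


def workspace_path(member, platform, strip_components=3):
    """Return where an archive member would land in a multi-platform workspace.

    Assumes members look like "kubernetes/test/bin/NAME" and that the
    workspace keeps binaries under "platforms/OS/ARCH/". Both the layout
    and the default strip count are illustrative assumptions.
    """
    parts = PurePosixPath(member).parts
    remainder = parts[strip_components:]
    if not remainder:
        raise ValueError(f"{member!r} has nothing left after stripping")
    os_name, arch = platform.split("/")
    return str(PurePosixPath("platforms", os_name, arch, *remainder))


if __name__ == "__main__":
    # "e2e.test" is only an example binary name.
    print(workspace_path("kubernetes/test/bin/e2e.test", "linux/amd64"))
```

&lt;p>Because the platform is carried by the destination directory rather than by the path inside the tarball, tarballs for several platforms can be unpacked into one workspace without overwriting each other.&lt;/p>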
&lt;h4 id="portable-sources">Portable sources&lt;/h4>
&lt;p>Portable sources are basically copied directly from the source tree:&lt;/p>
&lt;ul>
&lt;li>&lt;code>test/e2e/testing-manifests/&lt;/code>&lt;/li>
&lt;li>&lt;code>test/images/&lt;/code>&lt;/li>
&lt;li>&lt;code>test/kubemark/&lt;/code>&lt;/li>
&lt;li>&lt;code>hack/&lt;/code> (&lt;a href="https://github.com/kubernetes/kubernetes/blob/193f659a1cd454b93cbe1e7b1f13b77c21783461/hack/lib/golang.sh#L193-L197"
target="_blank" rel="noopener">partially&lt;/a>
)&lt;/li>
&lt;/ul>
&lt;p>We have two options for these:&lt;/p>
&lt;ol>
&lt;li>Continue to distribute as a separate tarball, either &lt;code>kubernetes-test.tar.gz&lt;/code>,
or possibly something like &lt;code>kubernetes-test-portable.tar.gz&lt;/code>.
&lt;ul>
&lt;li>This makes the distinction very clear vs. the binary artifacts&lt;/li>
&lt;li>There&amp;rsquo;s already some precedent, such as the &lt;code>kubernetes-manifest.tar.gz&lt;/code> tarball&lt;/li>
&lt;li>It slightly complicates downloading of test dependencies&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Duplicate these sources into each binary-specific tarball.
&lt;ul>
&lt;li>Simplifies test dependency distribution - may only need to download one
tarball if client and server are same platform&lt;/li>
&lt;li>Portable sources are small (as of v1.13.2, approximately 2.7MB uncompressed
or about 186KB compressed) so duplication isn&amp;rsquo;t a huge worry&lt;/li>
&lt;li>Complicates extraction of tarballs with existing scripts, since they assume
everything is platform-specific&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>We propose the first option as slightly preferable given the tradeoffs.
Since we intend to continue distributing the mondo test tarball over a
deprecation period, we&amp;rsquo;ll use the name &lt;code>kubernetes-test-portable.tar.gz&lt;/code> for the
portable sources.&lt;/p>
&lt;h3 id="updating-dependencies-on-kubernetes-testtargz">Updating dependencies on &lt;code>kubernetes-test.tar.gz&lt;/code>&lt;/h3>
&lt;p>Currently the CI workflows and &lt;code>kubetest&lt;/code> use the &lt;code>cluster/get-kube.sh&lt;/code> and
&lt;code>cluster/get-kube-binaries.sh&lt;/code> scripts to download all artifacts, and
conveniently &lt;code>get-kube-binaries.sh&lt;/code> is versioned with the release artifacts in
&lt;code>kubernetes.tar.gz&lt;/code>, so simply making &lt;code>get-kube-binaries.sh&lt;/code> aware of the new
tarballs should be sufficient for most CI and developer needs.&lt;/p>
&lt;p>Because the test tarball includes binaries used both on the host running tests
(such as the &lt;code>e2e&lt;/code> binary), as well as binaries which may run on the nodes
(&lt;code>e2e.node&lt;/code>), we would need to make sure to download binary test artifacts
targeting the host platform, node platform, and possibly server platform.&lt;/p>
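&lt;p>A minimal sketch of that selection, assuming the split naming scheme and the &lt;code>kubernetes-test-portable.tar.gz&lt;/code> name proposed earlier in this KEP. The function &lt;code>tarballs_to_download&lt;/code> is hypothetical and does not describe how &lt;code>get-kube-binaries.sh&lt;/code> is implemented.&lt;/p>

```python
def tarballs_to_download(host, node, server=None):
    """List the test tarballs needed for a given host/node/server combination.

    Platforms are "os/arch" strings. Duplicates are collapsed so that a
    platform shared by host, node, and server is downloaded only once.
    """
    platforms = []
    for platform in (host, node, server):
        if platform is not None and platform not in platforms:
            platforms.append(platform)

    # The portable sources are needed regardless of platform.
    names = ["kubernetes-test-portable.tar.gz"]
    for platform in platforms:
        os_name, arch = platform.split("/")
        names.append(f"kubernetes-test-{os_name}-{arch}.tar.gz")
    return names


if __name__ == "__main__":
    # Tests driven from a macOS workstation against arm64 Linux nodes.
    for name in tarballs_to_download("darwin/amd64", "linux/arm64"):
        print(name)
```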
&lt;p>A quick search reveals a few other uses of &lt;code>kubernetes-test.tar.gz&lt;/code>, mostly in
&lt;code>cluster/&lt;/code>. We can update these to use the platform-specific tarballs, possibly
with a fallback to the mondo-tarball if worried about versioning.&lt;/p>
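&lt;p>One way such a fallback could look, sketched under the assumption that the caller already knows which artifact names a given release publishes. The names &lt;code>pick_test_artifacts&lt;/code>, &lt;code>MONDO&lt;/code>, and &lt;code>PORTABLE&lt;/code> are hypothetical.&lt;/p>

```python
MONDO = "kubernetes-test.tar.gz"
PORTABLE = "kubernetes-test-portable.tar.gz"


def pick_test_artifacts(available, platform):
    """Prefer the split tarballs; fall back to the mondo tarball if needed.

    `available` is the collection of artifact names a release publishes and
    `platform` is an "os/arch" string.
    """
    os_name, arch = platform.split("/")
    split = f"kubernetes-test-{os_name}-{arch}.tar.gz"
    if split in available and PORTABLE in available:
        return [PORTABLE, split]
    if MONDO in available:
        # Older releases that only publish the mondo tarball.
        return [MONDO]
    raise LookupError(f"no test tarball published for {platform}")


if __name__ == "__main__":
    older_release = {MONDO}
    print(pick_test_artifacts(older_release, "linux/amd64"))
```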
&lt;h4 id="dependencies-outside-the-kubernetes-organization">Dependencies outside the Kubernetes organization&lt;/h4>
&lt;p>Searching GitHub for references to &lt;code>kubernetes-test.tar.gz&lt;/code> largely returns
forks of the main kubernetes repository (including some very old forks,
identifiable by the script &lt;code>e2e-from-release.sh&lt;/code>). Since these forks are not
likely to depend on upstream release artifacts, we can ignore them.&lt;/p>
&lt;p>The Samsung SDS CNCT kraken-lib repository has a reference to &lt;code>kubernetes-test.tar.gz&lt;/code>
in its &lt;a href="https://github.com/samsung-cnct/kraken-lib/blob/aceab16c316bafcdb1f542dc67876dd2e5279f6b/build-scripts/conformance-tests#L16"
target="_blank" rel="noopener">conformance test script&lt;/a>
,
but this repo is also marked deprecated and read-only, and there have been no
changes since July 2018.&lt;/p>
&lt;p>In vmware/simple-k8s-test-env, the &lt;code>sk8.sh&lt;/code> file uses
&lt;a href="https://github.com/vmware/simple-k8s-test-env/blob/master/sk8.sh#L4481"
target="_blank" rel="noopener">&lt;code>kubernetes-test.tar.gz&lt;/code>&lt;/a>
,
and this repo seems actively maintained, so we should make sure this continues
working.&lt;/p>
&lt;p>The &lt;a href="https://github.com/knative/test-infra/blob/8ef3dc1c2ed07e64024bc68c9dbd1a2e10e9e975/scripts/e2e-tests.sh#L118-L120"
target="_blank" rel="noopener">reference&lt;/a>
to &lt;code>kubernetes-test.tar.gz&lt;/code> in knative/test-infra is hilarious.&lt;/p>
&lt;p>There may be other uses that are not easily identifiable, so we will follow a
deprecation process of the mondo test tarball as described in the next section.&lt;/p>
&lt;h3 id="risks-and-mitigations">Risks and Mitigations&lt;/h3>
&lt;p>It&amp;rsquo;s hard to tell who uses these test tarballs outside the core project or
without tools like &lt;code>kubetest&lt;/code>. We&amp;rsquo;ll need to broadcast this change widely
so that any downstream users are aware of the incompatible changes.&lt;/p>
&lt;p>As this is an inherently breaking change, we must decide when to
cause the break. Assuming this effort is targeted for the 1.14 release:&lt;/p>
&lt;ol>
&lt;li>We can continue to produce a mondo-tarball for two releases, along with new
split tarballs; i.e., both 1.14 and 1.15 would contain both split and mondo
test tarballs, while 1.16 would only use a split test tarball. This way one
could continue to use the mondo tarball through the 1.15 release cycle,
and then switch to using split test tarballs for 1.16, as all supported
releases would then be producing split test tarballs.&lt;/li>
&lt;li>We can make a clean break for 1.14, not producing any mondo test tarballs.
Downstream users would need to account for the break immediately,
and would also need to special-case for older releases that still use the
mondo test tarball.&lt;/li>
&lt;li>A somewhat hybrid approach, mixing 1 and 2 backwards in time:
&lt;ol type="a">
&lt;li>Produce both mondo test tarballs and split test tarballs on master for
a few weeks.&lt;/li>
&lt;li>Backport split tarballs to older releases still supported (1.11 through
1.13), but continue to produce mondo test tarballs. We would never remove
the mondo test tarballs from these releases, instead continuing to
produce both.&lt;/li>
&lt;li>Update all test infrastructure to use split test tarballs.&lt;/li>
&lt;li>Remove the mondo test tarball from 1.14 before the first beta release.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ol>
&lt;p>Given the Kubernetes &lt;a href="https://kubernetes.io/docs/reference/using-api/deprecation-policy/"
target="_blank" rel="noopener">deprecation policy&lt;/a>
,
we should go for option 1 and continue to distribute mondo and split
test tarballs for the 1.14 release, and possibly for several releases
thereafter. (It&amp;rsquo;s not entirely clear exactly which deprecation policy applies
to this change, however.)&lt;/p>
&lt;p>We&amp;rsquo;ll mark the mondo test tarball as deprecated in the 1.14 release, both
through announcements in the release notes, as well as a &lt;code>DEPRECATION&lt;/code> notice
in the mondo test tarball.&lt;/p>
&lt;h3 id="test-plan">Test Plan&lt;/h3>
&lt;p>We&amp;rsquo;ll start by building both the mondo test tarball and split test tarballs in
CI, followed by updating test infrastructure to use the new split tarballs.
We will monitor TestGrid jobs to ensure that nothing is noticeably broken by
the change; the jobs we will watch most closely are those on the
&lt;a href="https://testgrid.k8s.io/sig-release-master-blocking"
target="_blank" rel="noopener">sig-release-master-blocking&lt;/a>
,
&lt;a href="https://testgrid.k8s.io/sig-release-master-informing"
target="_blank" rel="noopener">sig-release-master-informing&lt;/a>
,
and &lt;a href="https://testgrid.k8s.io/sig-release-master-upgrade"
target="_blank" rel="noopener">sig-release-master-upgrade&lt;/a>
dashboards.&lt;/p>
&lt;p>We&amp;rsquo;ll also reach out to community members testing on non-amd64 architectures,
since they&amp;rsquo;re most likely to be impacted by this change.&lt;/p>
&lt;p>We&amp;rsquo;ll work with any downstream consumers we can find to update them to use the
split tarballs ahead of the 1.14.0 release, but will continue to support
the mondo test tarball through at least 1.14&amp;rsquo;s complete lifecycle.&lt;/p>
&lt;h3 id="graduation-criteria">Graduation Criteria&lt;/h3>
&lt;p>To consider this effort complete, we should no longer be distributing a
mondo-tarball of test artifacts, and all TestGrid dashboards should show a
level of greenness similar to that before the change.&lt;/p>
&lt;p>While ideally we&amp;rsquo;d make a clean break, removing the mondo-tarball at the same
time as we create the platform-specific test tarballs, to ensure a smoother
rollout we will distribute both the split and mondo test tarballs for a while,
and this effort will not be deemed complete until the mondo test tarball is
gone.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;p>Similar discussion and work on the other release tarballs:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/issues/28435"
target="_blank" rel="noopener">Release tarballs are too big&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/issues/28629"
target="_blank" rel="noopener">Build release tars per-architecture&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/pull/35737"
target="_blank" rel="noopener">Stop including arch-specific binaries in kubernetes.tar.gz&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/kubernetes/issues/38725"
target="_blank" rel="noopener">Implicitly call cluster/get-kube-binaries.sh&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://groups.google.com/d/msg/kubernetes-dev/n9H9I8TrOT4/1cyV5r9fAAAJ"
target="_blank" rel="noopener">kubernetes-dev announcement about removing arch-specific binaries from kubernetes.tar.gz &amp;ldquo;full&amp;rdquo; tarball&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://www.youtube.com/watch?v=WbqRursx39k&amp;amp;t=13m28s"
target="_blank" rel="noopener">Recording&lt;/a>
and
&lt;a href="https://docs.google.com/document/d/1z8MQpr_jTwhmjLMUaqQyBk1EYG_Y_3D4y4YdMJ7V1Kk/edit#heading=h.1fpwoneimh52"
target="_blank" rel="noopener">notes&lt;/a>
from sig-testing weekly meeting&lt;/li>
&lt;/ul>
&lt;h2 id="implementation-history">Implementation History&lt;/h2>
&lt;ul>
&lt;li>2019-01-18: proposal on Slack and creation of the KEP&lt;/li>
&lt;li>2019-01-28: KEP announced on sig-testing and sig-release mailing lists&lt;/li>
&lt;li>2019-01-29: discussion at sig-testing weekly meeting&lt;/li>
&lt;li>2019-02-14: implementation &lt;a href="https://github.com/kubernetes/kubernetes/pull/74065"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/74065&lt;/a>
created, deprecation notice included in mondo test tarball&lt;/li>
&lt;li>2019-02-22: implementation &lt;a href="https://github.com/kubernetes/kubernetes/pull/74065"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/74065&lt;/a>
merged&lt;/li>
&lt;li>2019-09-24: Stop building kubernetes-test.tar.gz: &lt;a href="https://github.com/kubernetes/kubernetes/pull/83093"
target="_blank" rel="noopener">https://github.com/kubernetes/kubernetes/pull/83093&lt;/a>
&lt;/li>
&lt;li>2021-08-16: Retroactive stable declaration&lt;/li>
&lt;/ul></description></item></channel></rss>