<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kubernetes Contributors – WG Node Lifecycle</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/node-lifecycle/</link><description>Recent content in WG Node Lifecycle on Kubernetes Contributors</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/node-lifecycle/index.xml" rel="self" type="application/rss+xml"/><item><title>Community: WG Node Lifecycle Charter</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/node-lifecycle/charter/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/node-lifecycle/charter/</guid><description>
&lt;h1 id="wg-node-lifecycle-charter">WG Node Lifecycle Charter&lt;/h1>
&lt;p>This charter adheres to the conventions described in the &lt;a href="https://deploy-preview-776--kubernetes-contributor.netlify.app/committee-steering/governance/README.md"
>Kubernetes Charter README&lt;/a>
and uses
the Roles and Organization Management outlined in &lt;a href="https://deploy-preview-776--kubernetes-contributor.netlify.app/committee-steering/governance/wg-governance.md"
>wg-governance&lt;/a>
.&lt;/p>
&lt;h2 id="scope">Scope&lt;/h2>
&lt;p>The Kubernetes ecosystem currently faces challenges in node maintenance scenarios, with multiple
projects independently addressing similar issues. The goal of this working group is to develop
unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.&lt;/p>
&lt;p>To solve node drain properly, we must first understand the node lifecycle. This includes
provisioning/sunsetting of nodes, PodDisruptionBudgets, API-initiated eviction and node
shutdown. These in turn affect node and pod autoscaling, de/scheduling, load balancing, and
the applications running in the cluster. All of these areas have issues and would benefit from a
unified approach.&lt;/p>
&lt;h3 id="in-scope">In scope&lt;/h3>
&lt;ul>
&lt;li>Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs
and extending the current ones. This includes exploring extension to or interactions with the Node
object.&lt;/li>
&lt;li>Analyze the node lifecycle, the Node API, and possible interactions. We want to explore augmenting
the Node API to expose additional state or status in order to coalesce other core Kubernetes and
community APIs around node lifecycle management.&lt;/li>
&lt;li>Improve the disruption model that is currently implemented by the API-initiated Eviction API and PDBs.
Improve the descheduling, availability and migration capabilities of today&amp;rsquo;s application
workloads. Also explore the interactions with other eviction mechanisms.&lt;/li>
&lt;li>Coordinate pod termination and issues around de/scheduling, preemption, eviction and readiness
probes.&lt;/li>
&lt;li>Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.&lt;/li>
&lt;li>Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and
the new disruption model/evictions. This includes balancing of the pods according to scheduling
constraints.&lt;/li>
&lt;li>Consider improving the pod lifecycle of DaemonSets and static pods during a node maintenance.&lt;/li>
&lt;li>Explore the cloud provider use cases and how they can hook into the node lifecycle, so that
users can use the same APIs or configurations across the board.&lt;/li>
&lt;li>Migrate users of the eviction-based kubectl-like drain (kubectl, cluster autoscaler, karpenter,
&amp;hellip;) and other scenarios to use the new unified node draining approach.&lt;/li>
&lt;li>Explore the possible reasons why a node was terminated/drained/killed and how to
track and react to each of them. Consider past discussions/historical perspective
(e.g. &amp;ldquo;tombstones&amp;rdquo;).&lt;/li>
&lt;li>Explore a feedback mechanism for ensuring schedulability (e.g. readiness) and capabilities of the
node. These can apply during the provisioning of the node, but also during the rest of the
node lifecycle.&lt;/li>
&lt;/ul>
&lt;h3 id="out-of-scope">Out of scope&lt;/h3>
&lt;ul>
&lt;li>Implementing cloud-provider-specific logic. The goal is to have a high-level API that providers
can use, hook into, or extend.&lt;/li>
&lt;li>Infrastructure provisioning or deprovisioning solutions, or physical infrastructure lifecycle
management solutions.&lt;/li>
&lt;/ul>
&lt;h2 id="stakeholders">Stakeholders&lt;/h2>
&lt;ul>
&lt;li>SIG Apps&lt;/li>
&lt;li>SIG Autoscaling&lt;/li>
&lt;li>SIG CLI&lt;/li>
&lt;li>SIG Cloud Provider&lt;/li>
&lt;li>SIG Cluster Lifecycle&lt;/li>
&lt;li>SIG Network&lt;/li>
&lt;li>SIG Node&lt;/li>
&lt;li>SIG Scheduling&lt;/li>
&lt;li>SIG Storage&lt;/li>
&lt;/ul>
&lt;p>Stakeholders range from multiple SIGs to a broad set of end users,
public and private cloud providers, Kubernetes distribution providers,
and cloud provider end users. Here are some user stories:&lt;/p>
&lt;ul>
&lt;li>As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without
requiring any manual intervention.&lt;/li>
&lt;li>As a cluster admin, I want to be able to observe the node drain via the API and check on its
progress. I also want to be able to discover workloads that are blocking the node
drain.&lt;/li>
&lt;li>As a cluster admin, I want to be able to perform arbitrary actions after the node drain is
complete, such as resetting GPU drivers, resetting NICs, performing software updates or shutting
down the machine.&lt;/li>
&lt;li>As a cluster admin, I want to reduce the cost of doing maintenance on my hardware accelerators by
using control-plane APIs to help coordinate maintenance and drain a Node.&lt;/li>
&lt;li>To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet,
and other actors should use a new eviction API to gracefully remove pods. This would enable new
migration strategies that prefer to surge (upscale) pods first rather than downscale them. It
would also allow other users/components to monitor pods that are gracefully removed/terminated
and provide better behaviour in terms of de/scheduling, scaling and availability.&lt;/li>
&lt;li>As an end user, I would like more alternatives to blue-green upgrades, especially with special
hardware accelerators. I would like to choose a strategy on how to coordinate the node drain and
the upgrade to achieve better cost-effectiveness.&lt;/li>
&lt;li>As a cloud provider, I need to perform regular maintenance on the hardware in my fleet. Enhancing
Kubernetes to help cloud service providers safely remove hardware will reduce operational costs.&lt;/li>
&lt;li>As an end user or admin, I would like to use a mixture of on-demand and spot instances in my
clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for
nodes will improve cluster stability in scenarios where instances may be terminated by the cloud
provider due to cost-related thresholds.&lt;/li>
&lt;li>As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with
accelerators) and either prevent termination for a long period of time or have a reliable
migration path. Features like &lt;code>terminationGracePeriodSeconds&lt;/code> are not sufficient as the
termination/migration can take hours if not days.&lt;/li>
&lt;li>As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, flushing cached writes
to the underlying storage and completing storage cleanup routines.&lt;/li>
&lt;li>As a cluster admin, I want a node to be declared as fully drained after all volumes are unmounted
from it.&lt;/li>
&lt;li>As an application developer, I find the signal provided by readiness probes insufficient in some scenarios.
For example, there might be no change in readiness during a node shutdown, even though the
application should be removed from endpoints/load balancer. I want to stop incoming traffic to my
application in such scenarios.&lt;/li>
&lt;/ul>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>The WG will coordinate requirement gathering, design, implementation, and progression through
graduation stages.&lt;/p>
&lt;p>The group will help coordinate the graduation of existing Kubernetes Enhancement Proposals (KEPs)
as well as explore new APIs and scenarios.&lt;/p>
&lt;p>Areas we expect to explore:&lt;/p>
&lt;ul>
&lt;li>An API to express node drain/maintenance.&lt;/li>
&lt;li>An API to solve the problems with the API-initiated Eviction API and PDBs.&lt;/li>
&lt;li>An API/mechanism to gracefully terminate pods during a node shutdown.&lt;/li>
&lt;li>An API to deschedule pods that use DRA devices.&lt;/li>
&lt;li>An API to remove pods from endpoints before they terminate.&lt;/li>
&lt;li>An API to track the schedulability (e.g. readiness) and capabilities of the node.&lt;/li>
&lt;li>Introduce enhancements across multiple Kubernetes SIGs to add support and integration for the new
APIs to solve a wide range of issues.&lt;/li>
&lt;/ul>
&lt;p>We expect to provide reference implementations of the new APIs including but not limited to
controllers (kube-controller-manager), API validation, integration with existing core components and
extension points for the ecosystem. This should be accompanied by E2E / Conformance tests.&lt;/p>
&lt;h2 id="relevant-features-keps-and-documents">Relevant Features, KEPs and Documents&lt;/h2>
&lt;ul>
&lt;li>Declarative Node Maintenance: &lt;a href="https://github.com/kubernetes/enhancements/issues/4212"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/4212&lt;/a>
&lt;/li>
&lt;li>EvictionRequest API: &lt;a href="https://github.com/kubernetes/enhancements/issues/4563"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/4563&lt;/a>
&lt;/li>
&lt;li>Graceful Node Shutdown: &lt;a href="https://github.com/kubernetes/enhancements/issues/2000"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/2000&lt;/a>
&lt;/li>
&lt;li>DRA: device taints and tolerations: &lt;a href="https://github.com/kubernetes/enhancements/issues/5055"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/5055&lt;/a>
&lt;/li>
&lt;li>Disrupted Pods should be removed from endpoints: &lt;a href="https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8"
target="_blank" rel="noopener">https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8&lt;/a>
&lt;/li>
&lt;li>Node Readiness Gates: &lt;a href="https://github.com/kubernetes/enhancements/issues/5233"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/5233&lt;/a>
&lt;/li>
&lt;li>Allow the kubelet to trigger rescheduling of pods: &lt;a href="https://docs.google.com/document/d/1-wJhiNy84w7tzFdo9HqwTu5DrVSuXFLGTUv8FBiRAAc"
target="_blank" rel="noopener">https://docs.google.com/document/d/1-wJhiNy84w7tzFdo9HqwTu5DrVSuXFLGTUv8FBiRAAc&lt;/a>
&lt;/li>
&lt;li>Implicit tolerations: &lt;a href="https://github.com/kubernetes/enhancements/issues/5282"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/5282&lt;/a>;
add a Filter plugin to ensure that non-GPU pods are not scheduled on GPU nodes: &lt;a href="https://github.com/kubernetes-sigs/scheduler-plugins/pull/812"
target="_blank" rel="noopener">https://github.com/kubernetes-sigs/scheduler-plugins/pull/812&lt;/a>
&lt;/li>
&lt;li>Node Resource Hot Plug: &lt;a href="https://github.com/kubernetes/enhancements/issues/3953"
target="_blank" rel="noopener">https://github.com/kubernetes/enhancements/issues/3953&lt;/a>
&lt;/li>
&lt;/ul>
&lt;h2 id="relevant-projects">Relevant Projects&lt;/h2>
&lt;p>This is a list of known projects that solve similar problems in the ecosystem or would benefit from
the efforts of this WG:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/aws/aws-node-termination-handler"
target="_blank" rel="noopener">https://github.com/aws/aws-node-termination-handler&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/foriequal0/pod-graceful-drain"
target="_blank" rel="noopener">https://github.com/foriequal0/pod-graceful-drain&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/jukie/karpenter-deprovision-controller"
target="_blank" rel="noopener">https://github.com/jukie/karpenter-deprovision-controller&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubereboot/kured"
target="_blank" rel="noopener">https://github.com/kubereboot/kured&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler"
target="_blank" rel="noopener">https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes-sigs/cluster-api/"
target="_blank" rel="noopener">https://github.com/kubernetes-sigs/cluster-api/&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes-sigs/karpenter"
target="_blank" rel="noopener">https://github.com/kubernetes-sigs/karpenter&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubernetes-sigs/kubespray"
target="_blank" rel="noopener">https://github.com/kubernetes-sigs/kubespray&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/kubevirt/kubevirt"
target="_blank" rel="noopener">https://github.com/kubevirt/kubevirt&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/medik8s/node-maintenance-operator"
target="_blank" rel="noopener">https://github.com/medik8s/node-maintenance-operator&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/Mellanox/maintenance-operator"
target="_blank" rel="noopener">https://github.com/Mellanox/maintenance-operator&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/NVIDIA/pika"
target="_blank" rel="noopener">https://github.com/NVIDIA/pika&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/openshift/machine-config-operator"
target="_blank" rel="noopener">https://github.com/openshift/machine-config-operator&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/planetlabs/draino"
target="_blank" rel="noopener">https://github.com/planetlabs/draino&lt;/a>
&lt;/li>
&lt;li>&lt;a href="https://github.com/strimzi/drain-cleaner"
target="_blank" rel="noopener">https://github.com/strimzi/drain-cleaner&lt;/a>
&lt;/li>
&lt;/ul>
&lt;p>There are also internal, custom solutions that companies use.&lt;/p>
&lt;h2 id="prioritization">Prioritization&lt;/h2>
&lt;p>The group activity will focus on bringing the following features to a stable state (GA):&lt;/p>
&lt;ul>
&lt;li>Declarative Node Maintenance&lt;/li>
&lt;li>EvictionRequest API&lt;/li>
&lt;li>Graceful Node Shutdown&lt;/li>
&lt;/ul>
&lt;p>The group has the following priorities, which mostly apply to WG calls but can also apply to the general
WG review/work/guidance capacity:&lt;/p>
&lt;ol>
&lt;li>Urgent topics concerning the WG focus features, especially during the KEP and code freeze periods.&lt;/li>
&lt;li>Discussing issues within the scope of the WG.&lt;/li>
&lt;li>Presentations within the scope of the WG.&lt;/li>
&lt;li>Other WG features or features within the scope of the WG. If a topic is suggested multiple times,
try to prevent starvation.&lt;/li>
&lt;/ol>
&lt;h2 id="roles-and-organization-management">Roles and Organization Management&lt;/h2>
&lt;p>This WG adheres to the Roles and Organization Management outlined in &lt;a href="https://deploy-preview-776--kubernetes-contributor.netlify.app/committee-steering/governance/wg-governance.md"
>wg-governance&lt;/a>
and opts in to updates and modifications to &lt;a href="https://deploy-preview-776--kubernetes-contributor.netlify.app/committee-steering/governance/wg-governance.md"
>wg-governance&lt;/a>
.&lt;/p>
&lt;h2 id="timelines-and-disbanding">Timelines and Disbanding&lt;/h2>
&lt;p>The working group will disband once the features (KEPs) and core APIs mentioned
&lt;a href="#prioritization"
>above&lt;/a>
have reached a stable state (GA), and ongoing maintenance ownership is
established within the relevant SIGs. We will review whether the working group should disband if
appropriate SIG ownership can&amp;rsquo;t be reached or no additional coordination is needed.&lt;/p></description></item></channel></rss>