<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kubernetes Contributors – WG AI Gateway</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/ai-gateway/</link><description>Recent content in WG AI Gateway on Kubernetes Contributors</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/ai-gateway/index.xml" rel="self" type="application/rss+xml"/><item><title>Community: WG AI Gateway Charter</title><link>https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/ai-gateway/charter/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-776--kubernetes-contributor.netlify.app/community/community-groups/wg/ai-gateway/charter/</guid><description>
&lt;h1 id="wg-ai-gateway-charter">WG AI Gateway Charter&lt;/h1>
&lt;p>This charter adheres to the conventions described in the &lt;a href="https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md"
target="_blank" rel="noopener">Kubernetes Charter
README&lt;/a>
and uses the Roles and Organization Management outlined in
&lt;a href="https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md"
target="_blank" rel="noopener">wg-governance&lt;/a>
.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>We’ve seen large growth in the number of “AI Gateways” that have been launched
in the last couple of years which deploy and operate on Kubernetes, often
utilizing Gateway API. This WG aims to determine if the relevant features have
staying power and will be commonly useful to users for years to come, and if we
should expand the Kubernetes standards around this.&lt;/p>
&lt;p>In SIG Network we have the Gateway API Inference Extension (GIE) project. The
GIE currently is paired with a Gateway and “schedules” routes according to
capabilities and metrics advertised by model serving platforms. For the purposes
of this document we’ll call this the “model serving use case”, as this currently
mainly covers the use case where models are being hosted on Kubernetes. There
are deployment situations where users won’t host models but still use a Gateway
to control access to third-party services (e.g. Gemini, OpenAI, Mistral, Claude,
etc.); we’ll call this the “egress use case”. We find that in both the model
serving and egress use cases users want to be able to add more advanced filters,
policies and other plugins that control or modify inference requests.&lt;/p>
&lt;p>However, there are many features we haven’t fully explored yet that seem to be
cleanly addable at the HTTPRoute level via filters or policies. Perhaps some
would even be applicable at the Gateway level. For example, it is conceivable
you might add a “semantic routing” filter at the HTTPRoute level to
determine which model to route to before the “routing/scheduling” layer. Or
perhaps you need a policy to rate-limit token usage for requests (maybe this
could even apply at the Gateway level). For the purposes of this charter,
we’ll refer to features at this level as “AI Gateway” features.&lt;/p>
&lt;h2 id="scope">Scope&lt;/h2>
&lt;p>The scope of this WG is to define terms like &amp;ldquo;AI Gateway&amp;rdquo; in the context of
Kubernetes and propose deliverables that need to be adopted in order to &lt;strong>manage
AI traffic&lt;/strong> on Kubernetes, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Prompt Guards&lt;/strong> - Define and enforce content safety rules for inference
content to detect and block sensitive or malicious prompts.&lt;/li>
&lt;li>&lt;strong>Token Rate Limiting&lt;/strong> - Enforce rate-limiting rules based on token usage to
control usage and cost.&lt;/li>
&lt;li>&lt;strong>Semantic Routing&lt;/strong> - Make a routing decision for an inference request
based on semantic similarity of the request body.&lt;/li>
&lt;li>&lt;strong>Semantic Caching&lt;/strong> - Provide caching for inference responses based on the
semantic similarity of prompts.&lt;/li>
&lt;li>&lt;strong>Response Risk&lt;/strong> - Define and enforce content safety rules for inference
response content to detect and block sensitive responses from generative AI
models.&lt;/li>
&lt;li>&lt;strong>Failure Modes&lt;/strong> - Determine how inference routing failures should be handled and
which failure modes are important to cover. For instance, this may
encompass fallback and retry policies.&lt;/li>
&lt;li>&lt;strong>Observability&lt;/strong> - Evaluate observability mechanisms for “AI Gateways”
and, if AI Gateway-specific features are needed, make suggestions
in line with existing tools.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: The above list of features should be considered an example, and
non-exhaustive. We may not act on all of these, but the purpose is more to
illustrate the kind of features we will be exploring.&lt;/p>&lt;/blockquote>
&lt;p>Across features that are explored by this WG, we will also explore the
application of these features to multi-cluster use cases and provide support
for multi-cluster deployment scenarios.&lt;/p>
&lt;h3 id="in-scope">In Scope&lt;/h3>
&lt;p>Overall guidance for the WG is to control scope as much as is feasible. The WG
will support model serving via AI networking and traffic management features
(but will not work on model serving itself, unless in conjunction with WG
Serving). In particular, the following is in scope:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Providing definitions for networking-related AI terms in a Kubernetes
context, such as &amp;ldquo;AI Gateway&amp;rdquo;.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Defining important use-cases for Kubernetes users, including both single and
multi-cluster use cases.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Determining which common features and capabilities in the &amp;ldquo;AI Gateway&amp;rdquo; space
need to be covered by Kubernetes standards and APIs according to user and
implementation needs.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Creating proposals for &amp;ldquo;AI Gateway&amp;rdquo; features and capabilities to the
appropriate sub-projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Proposing new sub-projects if existing sub-projects are not sufficient.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="out-of-scope">Out of Scope&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Developing whole &amp;ldquo;AI Gateway&amp;rdquo; solutions. This group will focus on enabling
existing and new solutions to be more easily deployed and managed on
Kubernetes, not creating any new Gateways.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Any specific kind of hardware support is generally out of scope.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This group will not cover the entire spectrum of networking for AI. For
instance: RDMA networks are generally out of scope.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>While we serve the &amp;ldquo;model serving use case&amp;rdquo;, an important distinction is
that working directly on model serving and AI workloads themselves is
not in scope unless done in collaboration with WG Serving (see below for a
more complete explanation of this nuance).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We may look into ways in which observability can be tailored
specifically for AI Gateways and may make suggestions to groups like
OpenTelemetry, but we are not going to develop new standards for this as part
of this working group.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="additional-scope-distinctions">Additional Scope Distinctions&lt;/h3>
&lt;p>There is a subtle distinction to be made when it comes to the scope of this WG
for load-balancing and routing inference, particularly when dealing with inference
&lt;em>workloads&lt;/em>: When the use case includes local model serving on the cluster, and
routing and load-balancing features &lt;em>rely on information from the inference
workloads&lt;/em>, this kind of routing falls under the scope of WG Serving.&lt;/p>
&lt;p>A good example of this is the &lt;a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension"
target="_blank" rel="noopener">Gateway API Inference Extension (GIE)&lt;/a>
.
This project came from WG Serving and specifically handles advanced routing and
load-balancing for inference which is informed by metrics and capabilities being
advertised by the model serving platform (e.g. vLLM). In this vein, the GIE is
effectively an alternative to the Kubernetes &lt;code>Service&lt;/code> API, whereas this WG
means to operate more at the &lt;code>Gateway&lt;/code> and &lt;code>HTTPRoute&lt;/code> level.&lt;/p>
&lt;p>Use cases which have to interact with the model serving layer for networking
(as described above) are generally out of scope for this WG. If some feature
the WG is working on absolutely must cross this line, the effort MUST be brought
to WG Serving and worked on as a joint effort with them.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>A compendium of AI-related networking definitions (e.g. &amp;ldquo;AI Gateway&amp;rdquo;) and
key use-cases for Kubernetes users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Provide a space for collaboration and experimentation to determine the most
viable features and capabilities that Kubernetes should support. If there is
strong consensus on any particular ideas, the WG will facilitate and
coordinate the delivery of proposals in the appropriate areas.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="stakeholders">Stakeholders&lt;/h2>
&lt;ul>
&lt;li>SIG Network&lt;/li>
&lt;li>SIG MultiCluster&lt;/li>
&lt;/ul>
&lt;h3 id="related-wgs">Related WGs&lt;/h3>
&lt;ul>
&lt;li>WG Serving - The domain of WG Serving is AI Workloads, which can be served by
some of the networking support we want to add. When we have proposals that
are strongly relevant to serving, we will loop them in so they can provide
feedback.&lt;/li>
&lt;/ul>
&lt;h2 id="roles-and-organization-management">Roles and Organization Management&lt;/h2>
&lt;p>This working group adheres to the Roles and Organization Management outlined in
&lt;a href="https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md"
target="_blank" rel="noopener">wg-governance&lt;/a>
and opts in to updates and modifications to &lt;a href="https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md"
target="_blank" rel="noopener">wg-governance&lt;/a>
.&lt;/p>
&lt;h2 id="exit-criteria">Exit Criteria&lt;/h2>
&lt;p>The WG is done when its deliverables are complete, according to the defined
scope and a list of key use cases and features agreed upon by the group.&lt;/p>
&lt;p>Ideally we want the lifecycle of the WG to go something like this:&lt;/p>
&lt;ol>
&lt;li>Determine definitions and key use cases for Kubernetes users and
implementations, and document those.&lt;/li>
&lt;li>Determine a list of key features that Kubernetes needs to best support the
defined use cases.&lt;/li>
&lt;li>For each feature in that list, submit proposals supporting it to the
appropriate sub-projects OR propose new sub-projects if deemed necessary.&lt;/li>
&lt;li>Once the feature list is complete, leave behind some guidance and best
practices for future implementations and then exit.&lt;/li>
&lt;/ol></description></item></channel></rss>