Introduction

Gateway API Inference Extension is an official Kubernetes project focused on extending Gateway API with inference-specific routing extensions.

The overall resource model focuses on two new inference-focused personas and the corresponding resources they are expected to manage:

(Figure: Gateway API Inference Extension resource model)

API Resources

InferencePool

InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but will expand to have some inference-specific capabilities. When combined with InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our InferencePool documentation.
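
As a rough sketch, assuming the v1alpha2 API surface (field names may differ in other versions), an InferencePool could look like the following; the resource names and labels below are illustrative:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct   # illustrative pool name
spec:
  # Select the model server Pods that belong to this pool,
  # much like a Service selector.
  selector:
    app: vllm-llama3-8b-instruct
  # Port the model server Pods listen on.
  targetPortNumber: 8000
  # The endpoint selection extension (endpoint picker) that the
  # Gateway consults when routing to this pool.
  extensionRef:
    name: vllm-llama3-8b-instruct-epp
```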

InferenceModel

An InferenceModel represents a model or adapter, and its associated configuration. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our InferenceModel documentation.
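
For illustration, an InferenceModel that exposes a client-facing model name and translates it to a backend adapter could look roughly like this (again assuming the v1alpha2 API; the model name, adapter name, and weight are made up):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  # Model name that clients send in their requests.
  modelName: food-review
  # Relative criticality of this model compared to others in the pool.
  criticality: Standard
  # The InferencePool that serves this model.
  poolRef:
    name: vllm-llama3-8b-instruct
  # Translate the requested model name to one or more backend models.
  targetModels:
  - name: food-review-lora   # illustrative backend adapter name
    weight: 100
```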

Composable Layers

This project aims to define specifications that enable a compatible ecosystem for extending Gateway API with custom endpoint selection algorithms. It defines a set of patterns across three distinct layers of components:

Gateway API Implementations

Gateway API has more than 25 implementations. As this pattern stabilizes, we expect a wide set of these implementations to support this project.

Endpoint Selection Extension

As part of this project, we're building an initial reference extension. Over time, we hope to see a wide variety of extensions emerge that follow this pattern and provide a wide range of choices.

Model Server Frameworks

This project will work closely with model server frameworks to establish a shared standard for interacting with these extensions, particularly around metrics and observability, so that extensions can make informed routing decisions. The project is currently focused on integrations with vLLM and Triton, and will be open to other integrations as they are requested.

Request Flow

To illustrate how this all comes together, it may be helpful to walk through a sample request.

  1. The Gateway selects the correct InferencePool (a set of endpoints running a model server framework) or Service to route to. This logic is based on the existing Gateway and HTTPRoute APIs and will be familiar to any Gateway API user or implementer; a sample HTTPRoute is sketched after this list.

  2. If the request should be routed to an InferencePool, the Gateway will forward the request information to the endpoint selection extension for that pool.

  3. The extension will fetch metrics from the endpoints in the InferencePool and determine which endpoint can best achieve the configured objectives. Note that this kind of metrics probing may happen asynchronously, depending on the extension.

  4. The extension will instruct the Gateway which endpoint the request should be routed to.

  5. The Gateway will route the request to the desired endpoint.
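
To make step 1 concrete, here is a minimal sketch of an HTTPRoute that sends traffic to an InferencePool instead of a Service (the Gateway and pool names are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway   # illustrative Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    # The backend is an InferencePool rather than a Service.
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
```

From the Gateway's perspective this is ordinary HTTPRoute matching; the inference-specific behavior begins in step 2, when the request is handed to the pool's endpoint selection extension.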

(Figure: Gateway API Inference Extension request flow)

Who is working on Gateway API Inference Extension?

This project is being driven by WG-Serving and SIG-Network to improve and standardize routing to inference workloads in Kubernetes. Check out the implementations reference to see the latest projects and products that support this project. If you are interested in contributing to this project or building an implementation using Gateway API, don’t hesitate to get involved!