Serving Multiple Inference Pools¶
This guide assumes you have completed the Getting Started guide before following this one.
Unreleased/main branch
This guide tracks main and is intended for users who want the very latest features and fixes and are comfortable with potential breakage.
A company may need to deploy multiple large language models (LLMs) in a cluster to support different workloads. For example, a Llama model could power a chatbot interface, while a DeepSeek model might serve a recommendation application. Additionally, each base model may have multiple Low-Rank Adaptations (LoRAs). LoRAs associated with the same base model are served by the same backend inference server that hosts the base model. Clients request a LoRA by providing its name as the model name in the request body.
For serving multiple inference pools, the system needs to extract information such as the model name from the request body. This pattern of serving multiple models behind a single endpoint is common among providers and is generally expected by clients.
For such model-aware routing, use the Body-Based Routing (BBR) component as described in this guide.
How¶
BBR extracts the model name from the request body, looks up the corresponding base model in a ConfigMap, and adds this information to the X-Gateway-Base-Model-Name header. This header is then used to match and route the request to the appropriate InferencePool and its associated Endpoint Picker Extension (EPP) instances.
⚠️ Note: All model names, including base model and LoRA names, must be unique so that the correct InferencePool for each request can be determined.
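For illustration, the routing that results from this header can be pictured as an HTTPRoute with a header match per base model. This is only a sketch; the actual route is generated later by the InferencePool Helm chart, and the resource name and API group shown here are assumptions that may differ by release:

# Illustrative sketch only; the Helm chart generates the real HTTPRoute.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-deepseek-r1-route   # hypothetical name
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Base-Model-Name
        value: deepseek/vllm-deepseek-r1
    backendRefs:
    - group: inference.networking.k8s.io   # group may differ by release
      kind: InferencePool
      name: vllm-deepseek-r1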
Deploy Body-Based Routing Extension¶
export CHART_VERSION=v0
For GKE:
export GATEWAY_PROVIDER=gke
helm install body-based-router \
--set provider.name=$GATEWAY_PROVIDER \
--version $CHART_VERSION \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing
For Istio:
export GATEWAY_PROVIDER=istio
helm install body-based-router \
--set provider.name=$GATEWAY_PROVIDER \
--version $CHART_VERSION \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing
Kgateway does not require the Body-Based Routing extension because it implements body-based routing natively. To use Body-Based Routing with Kgateway, apply an AgentgatewayPolicy:
apiVersion: gateway.kgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: bbr
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  traffic:
    phase: PreRouting
    transformation:
      request:
        set:
        - name: X-Gateway-Base-Model-Name
          value: |
            {
              "meta-llama/Llama-3.1-8B-Instruct": "meta-llama/Llama-3.1-8B-Instruct",
              "food-review-1": "meta-llama/Llama-3.1-8B-Instruct",
              "deepseek/vllm-deepseek-r1": "deepseek/vllm-deepseek-r1",
              "ski-resorts": "deepseek/vllm-deepseek-r1",
              "movie-critique": "deepseek/vllm-deepseek-r1",
            }[json(request.body).model]
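The value expression builds the header from the request body: it reads the model field, which may be a base model name or a LoRA adapter name, looks it up in the inline map, and returns the corresponding base model name.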
For other providers:
helm install body-based-router \
--version $CHART_VERSION \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing
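Optionally, confirm that the BBR pod is running before continuing. The pod name typically includes the Helm release name, so the exact filter below is an assumption:

kubectl get pods | grep body-based-router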
Serving a Second Model Server¶
The example uses a vLLM simulator because it is the lowest common denominator configuration that can run in every environment.
The manifest uses deepseek/vllm-deepseek-r1 as the base model, with two LoRA adapters, ski-resorts and movie-critique.
Deploy the second model server along with a mapping from LoRA adapters to the base model:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/bbr/sim-deployment.yaml
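Before wiring up the InferencePool, you can check that the simulator pods are running. The app=vllm-deepseek-r1 label is assumed here because it is the same selector passed to the InferencePool chart below:

kubectl get pods -l app=vllm-deepseek-r1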
Deploy the Second InferencePool and Endpoint Picker Extension¶
To create the HTTPRoute mapping via the Helm chart, use the experimental experimentalHttpRoute flag to specify the base model that should be associated with the InferencePool.
Set the Helm chart version (unless already set).
export IGW_CHART_VERSION=v0
For GKE:
export GATEWAY_PROVIDER=gke
helm install vllm-deepseek-r1 \
--dependency-update \
--set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
--set provider.name=$GATEWAY_PROVIDER \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.baseModel=deepseek/vllm-deepseek-r1 \
--version $IGW_CHART_VERSION \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
For Istio:
export GATEWAY_PROVIDER=istio
helm install vllm-deepseek-r1 \
--dependency-update \
--set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
--set provider.name=$GATEWAY_PROVIDER \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.baseModel=deepseek/vllm-deepseek-r1 \
--version $IGW_CHART_VERSION \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
For other providers:
helm install vllm-deepseek-r1 \
--dependency-update \
--set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.baseModel=deepseek/vllm-deepseek-r1 \
--version $IGW_CHART_VERSION \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
After the installation, verify that you have two InferencePools and two EPP pods, one per base model, running without errors:
kubectl get inferencepools
kubectl get pods
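You should see one InferencePool per base model: the vllm-llama3-8b-instruct pool from the Getting Started guide and the vllm-deepseek-r1 pool installed above, each with its own EPP pod.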
Upgrade the First InferencePool and Endpoint Picker Extension¶
Update the first model server's mapping of LoRA adapters to the base model:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/bbr/configmap.yaml
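This ConfigMap gives BBR the mapping from the Llama LoRA adapter names (such as food-review-1) to their base model, analogous to the DeepSeek mapping deployed alongside the simulator above. You can confirm it was created with kubectl get configmaps.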
Run helm upgrade to update the HTTPRoute mapping of the first InferencePool in place:
For GKE:
export GATEWAY_PROVIDER=gke
helm upgrade vllm-llama3-8b-instruct oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
--dependency-update \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set provider.name=$GATEWAY_PROVIDER \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.baseModel=meta-llama/Llama-3.1-8B-Instruct \
--version $IGW_CHART_VERSION
For Istio:
export GATEWAY_PROVIDER=istio
helm upgrade vllm-llama3-8b-instruct oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
--dependency-update \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set provider.name=$GATEWAY_PROVIDER \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.baseModel=meta-llama/Llama-3.1-8B-Instruct \
--version $IGW_CHART_VERSION
For other providers:
helm upgrade vllm-llama3-8b-instruct oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool \
--dependency-update \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set experimentalHttpRoute.enabled=true \
--set experimentalHttpRoute.baseModel=meta-llama/Llama-3.1-8B-Instruct \
--version $IGW_CHART_VERSION
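After the upgrade, each InferencePool should have an HTTPRoute that matches on the X-Gateway-Base-Model-Name header for its base model. You can list the routes with:

kubectl get httproutes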
Try the setup¶
First, make sure that the setup still works as before by sending a request to the LoRA adapter of the first model, which was set up in the Getting started (Latest/Main) guide.
Try it out¶
Wait until the gateway is ready.
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "food-review-1",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
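Because food-review-1 maps to the meta-llama/Llama-3.1-8B-Instruct base model, BBR should route this request to the first InferencePool, exactly as before.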
Send a few requests to the Llama model directly:
curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Linux is said to be an open source kernel because "}
    ]
  }'
Send a few requests to the DeepSeek model to verify that it works:
curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/vllm-deepseek-r1",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Linux is said to be an open source kernel because "}
    ]
  }'
Send a few requests to the LoRA of the Llama model as follows:
curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "food-review-1",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a review of the best restaurants in San Francisco"}
    ]
  }'
Send a few requests to one LoRA adapter of the DeepSeek model as follows:
curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "movie-critique",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the best movies of 2025?"}
    ]
  }'
Send a few requests to another LoRA adapter of the DeepSeek model as follows:
curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ski-resorts",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about ski deals"}
    ]
  }'
Send a few requests to the DeepSeek model using the completions endpoint:
curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/vllm-deepseek-r1",
    "prompt": "What is the best ski resort in Austria?",
    "max_tokens": 20,
    "temperature": 0
  }'
Send a few requests to the first DeepSeek LoRA adapter as follows:
curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ski-resorts",
    "prompt": "What is the best ski resort in Austria?",
    "max_tokens": 20,
    "temperature": 0
  }'
Send a few requests to the second DeepSeek LoRA adapter as follows:
curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "movie-critique",
    "prompt": "Tell me about movies",
    "max_tokens": 20,
    "temperature": 0
  }'