InferencePool Rollout¶
This guide shows you how to perform incremental rollout operations, which gradually deploy new versions of your inference infrastructure so that you can update an InferencePool with minimal service disruption. It also provides guidance on traffic splitting and rollbacks to help ensure reliable InferencePool rollouts.
InferencePool rollout is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary.
Use Cases¶
Use cases for InferencePool rollout:
- Node (compute, accelerator) update rollout
- Base model rollout
- Model server framework rollout
Node (compute, accelerator) update rollout¶
Node update rollouts safely migrate inference workloads to new node hardware or accelerator configurations. This process happens in a controlled manner without interrupting model service. Use node update rollouts to minimize service disruption during hardware upgrades, driver updates, or security issue resolution.
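For a node or accelerator update, the new InferencePool's serving Deployment is typically identical to the old one except for where its pods are scheduled. The partial manifest below is a minimal sketch (the container spec is omitted; the complete manifests appear in the example later in this guide) that highlights the nodeSelector targeting the new accelerator. The GKE label key shown is the one used in that example; substitute the label your platform uses.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-new    # new Deployment backing the new InferencePool
spec:
  template:
    spec:
      # Only the scheduling constraint changes for a node/accelerator update.
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-h100-80gb"   # new accelerator type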
Base model rollout¶
Base model rollouts migrate traffic in phases to a new base LLM while retaining compatibility with existing LoRA adapters. Use base model rollouts to upgrade to improved model architectures or to address model-specific issues.
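For example, the client-facing model name can stay constant while a new InferenceModel points at the pool that serves the new base model. The following minimal sketch reuses the names from the example later in this guide and omits optional fields such as targetModels:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review-new
spec:
  modelName: food-review               # unchanged name that clients request
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct-new  # pool serving the new base model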
Model server framework rollout¶
Model server framework rollouts enable the seamless deployment of new versions or entirely different serving frameworks, like updating from an older vLLM version to a newer one, or even migrating from a custom serving solution to a managed one. This type of rollout is critical for introducing performance enhancements, new features, or security patches within the serving layer itself, without requiring changes to the underlying base models or application logic. By incrementally rolling out framework updates, teams can ensure stability and performance, quickly identifying and reverting any regressions before they impact the entire inference workload.
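In practice, a framework update usually means that only the container image (and possibly its args) differs between the old and new pools' Deployments. The partial manifest below is a minimal sketch with an illustrative, hypothetical image tag; pin whichever model server version you are rolling out:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-new
spec:
  template:
    spec:
      containers:
      - name: vllm
        image: "vllm/vllm-openai:v0.8.4"   # illustrative pinned version of the new framework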
How to perform an InferencePool rollout¶
- Deploy new infrastructure: Create a new InferencePool configured with the new nodes (compute/accelerator), model server, or base model that you chose.
- Configure traffic splitting: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The backendRefs.weight field controls the traffic percentage allocated to each pool; see the sketch after this list.
- Maintain InferenceModel integrity: Retain the existing InferenceModel configuration to ensure uniform model behavior across node configurations, base model versions, and model server versions.
- Preserve rollback capability: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary.
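The sketch below shows only the backendRefs portion of an HTTPRoute rule splitting traffic 90/10 between an existing pool and its replacement; the complete HTTPRoute used in the example appears later in this guide.

backendRefs:
- group: inference.networking.x-k8s.io
  kind: InferencePool
  name: vllm-llama3-8b-instruct        # existing pool keeps most traffic
  weight: 90
- group: inference.networking.x-k8s.io
  kind: InferencePool
  name: vllm-llama3-8b-instruct-new    # new pool receives a small share
  weight: 10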
Example¶
This is an example of an InferencePool rollout driven by a node (compute, accelerator) update.
Prerequisites¶
Follow the steps in the main guide.
Deploy new infrastructure¶
You start with an existing InferencePool named vllm-llama3-8b-instruct. To replace the original InferencePool, you create a new InferencePool named vllm-llama3-8b-instruct-new, along with an InferenceModel and an Endpoint Picker Extension, configured with the updated node specification of the nvidia-h100-80gb accelerator type:
kubectl apply -f - <<EOF
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review-new
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct-new
  targetModels:
  - name: food-review-1
    weight: 100
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-new
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct-new
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct-new
    spec:
      containers:
      - name: vllm
        image: "vllm/vllm-openai:latest"
        imagePullPolicy: Always
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--max-num-seqs"
        - "1024"
        - "--compilation-config"
        - "3"
        - "--enable-lora"
        - "--max-loras"
        - "2"
        - "--max-lora-rank"
        - "8"
        - "--max-cpu-loras"
        - "12"
        env:
        - name: VLLM_USE_V1
          value: "1"
        - name: PORT
          value: "8000"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
          value: "true"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        lifecycle:
          preStop:
            sleep:
              seconds: 30
        livenessProbe:
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          failureThreshold: 5
          timeoutSeconds: 1
        readinessProbe:
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          failureThreshold: 1
          timeoutSeconds: 1
        startupProbe:
          failureThreshold: 600
          initialDelaySeconds: 2
          periodSeconds: 1
          httpGet:
            path: /health
            port: http
            scheme: HTTP
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - name: adapters
          mountPath: "/adapters"
      initContainers:
      - name: lora-adapter-syncer
        tty: true
        stdin: true
        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
        restartPolicy: Always
        imagePullPolicy: Always
        env:
        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
          value: "/config/configmap.yaml"
        volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
        - name: config-volume
          mountPath: /config
      restartPolicy: Always
      enableServiceLinks: false
      terminationGracePeriodSeconds: 130
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-h100-80gb"
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: adapters
        emptyDir: {}
      - name: config-volume
        configMap:
          name: vllm-llama3-8b-instruct-adapters-new
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters-new
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3-8b-instruct-adapters-new
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review-1
          source: Kawon/llama3.1-food-finetune_v14_r8
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct-new
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama3-8b-instruct-new
  extensionRef:
    name: vllm-llama3-8b-instruct-epp-new
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct-epp-new
  namespace: default
spec:
  selector:
    app: vllm-llama3-8b-instruct-epp-new
  ports:
  - protocol: TCP
    port: 9002
    targetPort: 9002
    appProtocol: http2
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-epp-new
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct-epp-new
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct-epp-new
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct-epp-new
    spec:
      terminationGracePeriodSeconds: 130
      containers:
      - name: epp
        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
        imagePullPolicy: Always
        args:
        - -poolName
        - "vllm-llama3-8b-instruct-new"
        - "-poolNamespace"
        - "default"
        - -v
        - "4"
        - --zap-encoder
        - "json"
        - -grpcPort
        - "9002"
        - -grpcHealthPort
        - "9003"
        ports:
        - containerPort: 9002
        - containerPort: 9003
        - name: metrics
          containerPort: 9090
        livenessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          grpc:
            port: 9003
            service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
EOF
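Before splitting any traffic, you may want to confirm that the new serving pods and the new Endpoint Picker are running, and that the serving pods were scheduled onto the intended accelerator nodes:

kubectl get pods -l app=vllm-llama3-8b-instruct-new -o wide
kubectl get pods -l app=vllm-llama3-8b-instruct-epp-new

The -o wide output includes the node each pod landed on, which should match the new accelerator node pool.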
Direct traffic to the new inference pool¶
By configuring an HTTPRoute, as shown below, you can incrementally split traffic between the original vllm-llama3-8b-instruct pool and the new vllm-llama3-8b-instruct-new pool.
kubectl edit httproute llm-route
Change the backendRefs list in the HTTPRoute to match the following:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      weight: 90
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-new
      weight: 10
    matches:
    - path:
        type: PathPrefix
        value: /
With the above configuration, roughly one in every ten requests is sent to the new version. Try it out:
- Get the gateway IP:
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
- Send a few requests as follows:
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ "model": "food-review", "prompt": "Write as if you were a critic: San Francisco", "max_tokens": 100, "temperature": 0 }'
Finish the rollout¶
Modify the HTTPRoute to direct 100% of the traffic to the latest version of the InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-new
      weight: 100
    matches:
    - path:
        type: PathPrefix
        value: /
Delete the old version of the InferencePool, InferenceModel, and Endpoint Picker Extension¶
kubectl delete InferenceModel food-review --ignore-not-found
kubectl delete Deployment vllm-llama3-8b-instruct --ignore-not-found
kubectl delete ConfigMap vllm-llama3-8b-instruct-adapters --ignore-not-found
kubectl delete InferencePool vllm-llama3-8b-instruct --ignore-not-found
kubectl delete Deployment vllm-llama3-8b-instruct-epp --ignore-not-found
kubectl delete Service vllm-llama3-8b-instruct-epp --ignore-not-found
With this, all requests should be served by the new InferencePool.
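To verify, send the same request as before; with the old pool removed, responses now come from the new InferencePool:

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ "model": "food-review", "prompt": "Write as if you were a critic: San Francisco", "max_tokens": 100, "temperature": 0 }'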