Prefix Cache Aware Benchmarking¶

This guide shows how to deploy a prefix-cache-aware benchmarking config using inference-perf.

Prerequisites¶

Before you begin, ensure you have the following:

Helm 3+: Installation Guide
Kubernetes Cluster: Access to a Kubernetes cluster
Hugging Face Token Secret: A Hugging Face token to pull models.
Prefix-Cache Model Server Deployed: Ensure a model server deployment with prefix caching enabled (e.g., gpu-prefix-cache-deployment.yaml).
Gateway Deployed: Your inference server/gateway must be deployed and accessible within the cluster.

Follow the benchmarking guide for more information on how to set up the gateway and how to validate benchmark results.

Shared Prefix Dataset Configuration¶

The chart uses the shared_prefix dataset type, which is designed to test caching efficiency. These parameters are located under config.data.shared_prefix:

num_groups: The number of shared prefix groups.
num_prompts_per_group: The number of prompts within each shared prefix group.
system_prompt_len: The length of the system prompt.
question_len: The length of the question part of the prompt.
output_len: The desired length of the model's output.
enable_multi_turn_chat: Creates a user session that keeps the conversation going, with the accumulated chat context appended to each request.
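As a rough sketch of how these parameters interact, the arithmetic below estimates the dataset size and how much of each prompt is shared. It assumes that every prompt is the group's system prompt followed by one question, that lengths are counted in tokens, and that multi-turn chat is off. The numbers are illustrative, not the chart defaults.

```shell
# Illustrative values only (assumptions, not the chart defaults).
num_groups=4
num_prompts_per_group=8
system_prompt_len=1024
question_len=64

# Every group contributes num_prompts_per_group prompts.
total_prompts=$(( num_groups * num_prompts_per_group ))

# Assumed prompt layout: shared system prompt + per-prompt question.
prompt_len=$(( system_prompt_len + question_len ))

# Integer percentage of each prompt that is reusable from the prefix cache.
shared_pct=$(( 100 * system_prompt_len / prompt_len ))

echo "total prompts: ${total_prompts}"            # total prompts: 32
echo "shared share of each prompt: ${shared_pct}%" # shared share of each prompt: 94%
```

Under these assumptions, raising system_prompt_len relative to question_len increases the cacheable share of each prompt, while raising num_groups increases the number of distinct prefixes competing for cache space.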

The default values for the dataset are defined in the chart, but you can override them using --set config.data.shared_prefix.<parameter> flags.

Example:

helm install my-release ../inference-perf -f short-questions__many-system-prompts-values.yaml --set config.data.shared_prefix.num_groups=512
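When several parameters change at once, a small override file can be easier to review than a long list of --set flags. The sketch below writes one; the file name my-shared-prefix-overrides.yaml and all values in it are hypothetical, and the key nesting simply mirrors the config.data.shared_prefix path described above.

```shell
# Write a hypothetical override file. Values are illustrative only.
cat > my-shared-prefix-overrides.yaml <<'EOF'
config:
  data:
    shared_prefix:
      num_groups: 512
      num_prompts_per_group: 8
      system_prompt_len: 2048
      question_len: 128
      output_len: 256
      enable_multi_turn_chat: false
EOF

# Pass it after the scenario values file; Helm gives precedence to the
# right-most -f file when the same key appears in both:
#   helm install my-release ../inference-perf -f <values-file> -f my-shared-prefix-overrides.yaml
```

Any --set flag on the same command line still overrides both files.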

Deployment¶

This chart supports four configurations located under gateway-api-inference-extension/benchmarking/prefix-cache-aware:

short-questions__many-system-prompts-values.yaml: High System Prompt Overlap, High System Cache Pressure
long-questions__many-system-prompts-values.yaml: Low System Prompt Overlap, Low System Cache Pressure
short-questions__few-system-prompts-values__multi-chat-values.yaml: High System Prompt Overlap, No System Cache Pressure
long-questions__few-system-prompts-values__multi-chat-values.yaml: Low System Prompt Overlap, No System Cache Pressure

1. Check out the repo.¶

git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension/benchmarking/prefix-cache-aware

2. Get the target IP.¶

The examples below show how to get the IP of a gateway or a Kubernetes service.

# Get gateway IP
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
# Get LoadBalancer k8s service IP
SVC_IP=$(kubectl get service/vllm-qwen3-32b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo $GW_IP
echo $SVC_IP
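Either lookup may come back empty, for example when the gateway or LoadBalancer has not been assigned an address yet. A minimal guard, assuming you want to prefer the gateway and fall back to the service, could look like this. pick_target_ip is a hypothetical helper, not part of the chart or of kubectl.

```shell
# pick_target_ip (hypothetical helper): print the first non-empty argument,
# or fail with a message on stderr if every argument is empty.
pick_target_ip() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no target IP found; is the gateway or service ready?" >&2
  return 1
}

# Usage, with GW_IP and SVC_IP set by the commands above:
#   IP=$(pick_target_ip "$GW_IP" "$SVC_IP") || exit 1
```

Failing early here avoids installing a benchmark release that points at an empty address.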

3. Deploying the Configuration¶

This example uses the short-questions__many-system-prompts-values.yaml file, which is optimized for scenarios where a high cache hit rate is expected.

export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'

# HUGGINGFACE PARAMETERS
# Option A: Pass Token Directly
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
# Option B: Use Existing Kubernetes Secret
export HF_SECRET_NAME='<YOUR_SECRET_NAME>'
export HF_SECRET_KEY='<YOUR_SECRET_KEY>'

# HUGGINGFACE OPTION A: pass the token directly
helm install prefix-cache-benchmark ../inference-perf -f short-questions__many-system-prompts-values.yaml \
  --set "config.server.base_url=http://${IP}:${PORT}" \
  --set "token.hfToken=${HF_TOKEN}"

# HUGGINGFACE OPTION B: use an existing Kubernetes secret.
# Replace the last --set flag above with these two:
#   --set "token.hfSecret.name=${HF_SECRET_NAME}" \
#   --set "token.hfSecret.key=${HF_SECRET_KEY}"

Parameters to customize:

prefix-cache-benchmark: A unique name for this deployment. You can deploy multiple charts as long as they have different names.
-f <values file>: The benchmark config you want to test, chosen from the four files listed under Deployment.
token.hfToken: Your Hugging Face token. The inference-perf chart creates a new Kubernetes Secret containing this token.
token.hfSecret.name: The name of your Kubernetes Secret containing the Hugging Face token (default: hf-token).
token.hfSecret.key: The key in your Kubernetes Secret pointing to the Hugging Face token (default: token).
config.server.base_url: The base URL (IP and port) of your inference server.
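A common slip is to run the install with IP or PORT still set to its placeholder. The sketch below builds the value for config.server.base_url and refuses obviously wrong inputs. build_base_url is a hypothetical helper, and the http scheme is taken from the command above.

```shell
# build_base_url (hypothetical helper): print http://IP:PORT, or fail if the
# IP is empty or still a <PLACEHOLDER>, or if the port is not numeric.
build_base_url() {
  ip="$1"
  port="$2"
  case "$ip" in
    ''|*'<'*) echo "IP is unset or still a placeholder" >&2; return 1 ;;
  esac
  case "$port" in
    ''|*[!0-9]*) echo "PORT must be a number" >&2; return 1 ;;
  esac
  printf 'http://%s:%s\n' "$ip" "$port"
}

# Usage, with IP and PORT exported as in step 3:
#   BASE_URL=$(build_base_url "$IP" "$PORT") || exit 1
#   helm install ... --set "config.server.base_url=${BASE_URL}"
```

This only checks the shape of the inputs; it does not confirm that the server is reachable.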

4. Clean Up¶

To uninstall the deployed charts:

helm uninstall prefix-cache-benchmark

Post Benchmark Analysis¶

Follow the benchmarking guide instructions to compare benchmark results.

Running E2E Tests¶

The following E2E tests run on GitHub using GitHub Actions.

If you have MAINTAINER access or above, you can trigger the workflow run from the GitHub Actions page, or by leaving a comment on the PR.

Please make sure no other GKE tests of the same type are running at the same time, as they can interfere with each other.

Test name	Link	PR comment trigger
GKE Prefix Cache Aware Test	https://github.com/gateway-api-inference-extension/.github/workflows/e2e-prefill-heavy-gke.yaml	/run-gke-prefix-cache
GKE Prefill Heavy Test with Standalone EPP	https://github.com/gateway-api-inference-extension/.github/workflows/e2e-prefill-heavy-gke-standlone-epp.yaml	/run-gke-prefix-cache-standalone-epp