AI Agent Deployment on Kubernetes: Scaling Patterns for Production
A practical guide to deploying and scaling AI agents on Kubernetes — from GPU scheduling and model serving to autoscaling strategies and cost-effective resource management.
Why Kubernetes for AI Agents
Kubernetes has become the default platform for deploying AI agents in production. Its container orchestration, auto-scaling, service discovery, and declarative configuration model align well with the requirements of multi-agent systems. But deploying AI workloads on Kubernetes requires patterns that differ from traditional web application deployments.
AI agents have unique resource requirements: GPU access for local model inference, high memory for context windows, variable latency requirements, and bursty compute patterns. This guide covers the patterns that work.
Deployment Architecture
Separating Agent Logic from Model Serving
The most maintainable architecture separates agent orchestration logic from model inference:
# Agent deployment - CPU-only, handles orchestration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-support-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: customer-support-agent
  template:
    metadata:
      labels:
        app: customer-support-agent
    spec:
      containers:
      - name: agent
        image: myregistry/support-agent:v2.1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        env:
        - name: LLM_ENDPOINT
          value: "http://model-server:8000/v1"
        - name: REDIS_URL
          value: "redis://agent-cache:6379"
---
# Model server deployment - GPU-enabled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest  # pin a specific tag in production
        args: ["--model", "mistralai/Mistral-7B-Instruct-v0.3"]
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
This separation lets you scale agent logic independently from model inference, upgrade models without redeploying agents, and share model servers across multiple agent types.
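For the agent's LLM_ENDPOINT to resolve, the model server also needs a ClusterIP Service in front of its pods. A minimal sketch (the `app: model-server` label selector is an assumption and must match whatever labels your model-server pods actually carry):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  selector:
    app: model-server   # assumed pod label; adjust to your Deployment's labels
  ports:
  - name: http
    port: 8000
    targetPort: 8000
```

Because agents address the Service name rather than individual pods, model-server replicas can be added, removed, or upgraded without agent-side changes.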
GPU Scheduling Strategies
GPU resources are expensive. Maximize utilization with these approaches:
Time-sharing with MPS (Multi-Process Service): Run multiple inference workloads on the same GPU. Works well when individual requests do not saturate GPU compute.
Fractional GPUs: Use tools like nvidia-device-plugin with time-slicing or MIG (Multi-Instance GPU) on A100s to partition a single GPU into multiple smaller allocations.
Spot/Preemptible nodes: Run non-latency-critical workloads (batch processing, evaluation, fine-tuning) on spot instances for 60-70% cost savings.
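Time-slicing with the NVIDIA device plugin is enabled through its configuration file. A sketch that advertises each physical GPU as four schedulable replicas (the ConfigMap name is illustrative; the plugin must be deployed with a flag pointing at this config):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU appears as 4 allocatable GPUs
```

Note that time-sliced pods share GPU memory with no isolation between them; size your replica count against the combined VRAM footprint of the workloads you expect to co-schedule.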
Auto-Scaling Patterns
Horizontal Pod Autoscaler (HPA)
Standard CPU/memory-based HPA does not work well for AI workloads because inference is GPU-bound, not CPU-bound. Use custom metrics instead:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"   # Scale up when queue > 5 per pod
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"  # Scale up when GPU > 80% utilized
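Custom Pods metrics like inference_queue_depth are not available to the HPA out of the box; they typically reach the custom metrics API through an adapter such as prometheus-adapter. A sketch of an adapter rule, assuming the model server already exports the metric to Prometheus with pod and namespace labels:

```yaml
# prometheus-adapter rule: expose inference_queue_depth as a Pods metric
rules:
- seriesQuery: 'inference_queue_depth{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "inference_queue_depth"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```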
KEDA (Kubernetes Event-Driven Autoscaling)
KEDA is particularly useful for event-driven agent architectures. Scale agent pods based on message queue depth:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-scaler
spec:
  scaleTargetRef:
    name: customer-support-agent
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: redis-streams
    metadata:
      address: agent-cache:6379
      stream: agent-tasks
      consumerGroup: support-agents
      lagCount: "10"  # Scale when 10+ messages pending
Networking and Service Mesh
gRPC for Model Serving
Use gRPC instead of REST for internal model serving. gRPC's binary protocol, HTTP/2 multiplexing, and streaming support can reduce per-request overhead, commonly cited at 30-40% versus REST for high-throughput inference workloads.
Health Checks
AI model servers need custom health checks that go beyond TCP port checks:
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # Models take time to load
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health/ready     # Model loaded and warm
    port: 8000
  initialDelaySeconds: 180
  periodSeconds: 10
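On Kubernetes 1.18 and later, a startupProbe is a cleaner way to absorb long model-load times than a large initialDelaySeconds, because it defers the liveness and readiness probes until the first success while still bounding total startup time:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30   # allow up to 30 * 10s = 5 minutes for model load
  periodSeconds: 10
```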
Cost Optimization
- Right-size GPU instances: Profile your model's actual VRAM and compute requirements. Many teams over-provision by 50% or more
- Use node pools: Separate GPU and CPU node pools to avoid paying GPU prices for CPU-only workloads
- Implement scale-to-zero: For low-traffic agent types, use KEDA to scale to zero pods when idle
- Cache aggressively: Redis or Memcached for embedding caches, prompt caches, and response caches
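Scale-to-zero with KEDA is mostly a matter of setting minReplicaCount to zero. A sketch for a hypothetical low-traffic agent (the deployment and stream names are illustrative; cooldownPeriod is how long KEDA waits after the last event before scaling to zero):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-agent-scaler
spec:
  scaleTargetRef:
    name: batch-report-agent   # illustrative low-traffic deployment
  minReplicaCount: 0           # scale to zero when idle
  maxReplicaCount: 5
  cooldownPeriod: 300          # seconds of inactivity before scaling to zero
  triggers:
  - type: redis-streams
    metadata:
      address: agent-cache:6379
      stream: batch-tasks
      consumerGroup: batch-agents
      lagCount: "1"
```

Budget for cold-start latency on the first request after an idle period, since the pod (and any model it loads) must come up from scratch.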
Observability Stack
Deploy alongside your agents:
- Prometheus + Grafana: GPU utilization, inference latency, queue depth, token throughput
- OpenTelemetry Collector: Distributed tracing across multi-agent pipelines
- Loki or Elasticsearch: Structured logging for conversation debugging
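GPU utilization metrics for these dashboards usually come from NVIDIA's DCGM exporter. If you run the Prometheus Operator, a ServiceMonitor sketch to scrape it (the label selector and port name are assumptions that must match how the exporter's Service is deployed):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter   # assumed Service label
  endpoints:
  - port: metrics          # assumed port name on the exporter Service
    interval: 15s
```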
The key to successful Kubernetes deployment of AI agents is treating model serving as infrastructure (stable, shared, GPU-optimized) and agent logic as application code (frequently deployed, independently scaled, CPU-based).