# LLM Caching Strategies for Cost Optimization: Prompt, Semantic, and KV Caching
Practical techniques to reduce LLM inference costs by 40-80 percent through prompt caching, semantic caching, and KV cache optimization in production systems.
## LLM Inference Costs Add Up Fast

At $3-15 per million input tokens for frontier models, LLM costs become significant at scale. A customer support agent handling 10,000 conversations per day at 2,000 input tokens each consumes 20 million input tokens, or $60-300 daily, before counting output tokens. Caching strategies can reduce these costs by 40-80 percent while simultaneously improving latency.
Three caching approaches address different patterns: exact prompt caching, semantic caching, and KV cache optimization.
## Exact Prompt Caching
The simplest approach: hash the full prompt and cache the response. If the same prompt appears again, return the cached response without calling the LLM.
```python
import hashlib
import json

import redis
from openai import AsyncOpenAI

cache = redis.Redis(host="localhost", port=6379, db=0)
openai_client = AsyncOpenAI()


async def cached_llm_call(messages: list, model: str, ttl: int = 3600) -> dict:
    # sort_keys=True keeps the hash stable regardless of dict key ordering
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)  # cache hit: no LLM call, no token cost

    response = await openai_client.chat.completions.create(
        model=model, messages=messages
    )
    result = response.to_dict()
    cache.setex(cache_key, ttl, json.dumps(result))  # expire after ttl seconds
    return result
```
### When Exact Caching Works
- Repeated system prompts: Many requests share identical system prompts
- Structured queries: Classification tasks with a fixed set of inputs
- Batch processing: Re-running analysis on unchanged data
### When It Fails

Exact caching has a low hit rate for conversational applications where each message includes unique user input: even a one-character difference produces a different hash.
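Hit rates can sometimes be improved by canonicalizing the prompt before hashing, so that trivial whitespace or casing differences still map to the same key. A minimal sketch; the normalization rules here are illustrative assumptions, not a standard:

```python
import hashlib
import json


def canonical_cache_key(messages: list, model: str) -> str:
    """Normalize messages before hashing so trivial differences still hit the cache."""
    normalized = [
        # Collapse runs of whitespace and lowercase the text (assumed rules;
        # lowercasing is unsafe if prompts contain case-sensitive code or IDs)
        {"role": m["role"], "content": " ".join(m["content"].split()).lower()}
        for m in messages
    ]
    payload = json.dumps({"messages": normalized, "model": model}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


# Whitespace and casing differences now produce the same cache key
k1 = canonical_cache_key([{"role": "user", "content": "What is RAG?"}], "gpt-4o")
k2 = canonical_cache_key([{"role": "user", "content": "  what is  rag? "}], "gpt-4o")
```

The trade-off is that normalization slightly blurs the "exact" guarantee; keep the rules conservative for prompts where casing carries meaning.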
## Semantic Caching
Semantic caching matches queries by meaning rather than exact text. "What's the weather in NYC?" and "How's the weather in New York City?" should return the same cached response.
Implementation uses embedding models and vector similarity:
```python
async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    # embed(), vector_store, llm_call(), and ttl_cutoff are application-specific
    # helpers: an embedding model, a vector database client, the underlying LLM
    # call, and a timestamp marking the oldest entry still considered fresh.
    query_embedding = embed(query)

    # Search the vector store for the closest previous query that hasn't expired
    results = vector_store.search(
        vector=query_embedding,
        limit=1,
        filter={"created_at": {"$gt": ttl_cutoff}},
    )
    if results and results[0].score > threshold:
        return results[0].metadata["response"]  # cache hit: reuse stored answer

    # Cache miss: call the LLM and store the result for future lookups
    response = await llm_call(query)
    vector_store.upsert({
        "vector": query_embedding,
        "metadata": {"query": query, "response": response},
    })
    return response
```
### Tuning the Similarity Threshold
- 0.98+: Nearly identical queries only. Low hit rate, very safe.
- 0.95-0.98: Paraphrases and minor variations. Good balance.
- 0.90-0.95: Loosely similar queries. Higher hit rate but risk of returning irrelevant cached responses.
Test with your actual query distribution to find the right threshold.
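One way to calibrate is to compute cosine similarities over labeled pairs of queries from your logs and see where paraphrases separate from unrelated queries. A toy sketch with hand-made vectors standing in for embeddings (a real evaluation would use your embedding model on real query pairs):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def cache_hit(similarity: float, threshold: float = 0.95) -> bool:
    """Apply the same hit rule the semantic cache uses."""
    return similarity > threshold

# Toy vectors: a near-paraphrase pair vs. an unrelated pair
paraphrase_sim = cosine_similarity([0.9, 0.1, 0.4], [0.88, 0.12, 0.42])
unrelated_sim = cosine_similarity([0.9, 0.1, 0.4], [0.1, 0.9, 0.2])
```

Sweeping the threshold over such labeled pairs gives a hit-rate versus false-hit curve, which is a more defensible way to pick a value than the rough bands above.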
## Provider-Level Prompt Caching
Anthropic and OpenAI now offer server-side prompt caching that reduces costs for repeated prompt prefixes.
### Anthropic Prompt Caching
Anthropic caches prompt prefixes marked with a cache_control parameter. Subsequent requests with the same prefix hit the cache, reducing input token costs by 90 percent for the cached portion. The cache has a 5-minute TTL that resets on each hit.
This is particularly effective for:
- Long system prompts (1,000+ tokens)
- RAG contexts where the retrieved documents are appended to a fixed instruction prefix
- Multi-turn conversations where the history grows but the system prompt remains constant
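The request shape looks roughly like this (field names follow Anthropic's prompt-caching documentation; the model string and `LONG_SYSTEM_PROMPT` are placeholders):

```python
# Placeholder standing in for a 1,000+ token instruction prefix
LONG_SYSTEM_PROMPT = "You are a support agent for ..."

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached server-side
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the attached policy."}],
}
```

Passing this payload to `client.messages.create(**request)` on repeat requests should show cache reads in the response's usage fields, with the cached prefix billed at the discounted rate.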
### OpenAI Cached Tokens

OpenAI automatically caches prompt prefixes once a prompt reaches 1,024 tokens and charges 50 percent less for the cached portion. Unlike Anthropic's approach, caching is automatic: no API changes are required.
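You can verify caching is kicking in by inspecting the usage block on each response, where cached tokens are reported under `prompt_tokens_details.cached_tokens`. A sketch of the savings arithmetic, using a dict shaped like that usage payload (the numbers are illustrative):

```python
def cached_token_savings(cached_tokens: int, price_per_million: float,
                         discount: float = 0.5) -> float:
    """Input-token dollars saved on one call by the cached-token discount."""
    return cached_tokens * price_per_million * discount / 1e6

# Shaped like response.usage from the chat completions API; cached_tokens
# appears once the shared prefix crosses the 1,024-token threshold
usage = {"prompt_tokens": 3000, "prompt_tokens_details": {"cached_tokens": 2048}}

saved = cached_token_savings(
    usage["prompt_tokens_details"]["cached_tokens"],
    price_per_million=3.0,
)
```

Logging `cached_tokens` per request is an easy way to monitor how well your prompt structure preserves a stable prefix.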
## KV Cache Optimization
For self-hosted models, the key-value cache stored during autoregressive generation is a major memory and compute bottleneck.
### Techniques
- PagedAttention (vLLM): Manages KV cache memory like virtual memory pages, eliminating fragmentation and enabling higher batch sizes
- Prefix caching: Shares KV cache entries across requests with identical prompt prefixes, avoiding redundant computation
- Quantized KV cache: Storing cached keys and values in FP8 or INT8 precision reduces memory by 50 percent with minimal quality impact
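The win from prefix caching is proportional to how many leading tokens two requests share. A toy helper to measure that overlap (illustrative only: real engines such as vLLM match fixed-size cache blocks rather than individual tokens):

```python
def shared_prefix_length(tokens_a: list[int], tokens_b: list[int]) -> int:
    """Number of leading tokens two tokenized requests have in common."""
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n

# Two requests sharing a 4-token system prompt, then diverging on user input
req1 = [101, 2023, 2003, 1037, 7592, 999]
req2 = [101, 2023, 2003, 1037, 2129, 2024]

reused = shared_prefix_length(req1, req2)  # KV entries for these 4 tokens can be shared
```

Structuring prompts so the static instructions come first and per-request content comes last maximizes this reusable prefix.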
## Cost Savings Calculator
For a system processing 100,000 LLM calls per day:
| Strategy | Typical Hit Rate | Cost Reduction |
|---|---|---|
| Exact prompt cache | 5-15% | 5-15% |
| Semantic cache | 15-40% | 15-40% |
| Provider prompt caching | 60-90% of tokens | 30-50% |
| Combined approach | — | 50-80% |
The strategies are complementary. A production system should layer exact caching (cheapest to implement), semantic caching (catches paraphrases), and provider-level caching (reduces per-token cost for cache misses).
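The layered effect can be sketched numerically. The prices and hit rates below are assumptions chosen from the ranges quoted above, not measurements:

```python
def daily_input_cost(
    calls_per_day: int,
    tokens_per_call: int,
    price_per_million: float,
    exact_hit_rate: float = 0.10,      # exact cache answers these for free
    semantic_hit_rate: float = 0.25,   # of the remainder, semantic cache catches these
    cached_token_share: float = 0.70,  # share of miss tokens in the cached prefix
    cached_token_discount: float = 0.50,  # e.g. 50% price for cached tokens
) -> float:
    """Daily input-token cost after layering all three caching strategies."""
    misses = calls_per_day * (1 - exact_hit_rate) * (1 - semantic_hit_rate)
    tokens = misses * tokens_per_call
    # Cached-prefix tokens are billed at a discount; the rest at full price
    effective_tokens = tokens * (
        cached_token_share * cached_token_discount + (1 - cached_token_share)
    )
    return effective_tokens * price_per_million / 1e6

baseline = 100_000 * 2_000 * 3.0 / 1e6       # no caching: $600/day at $3/M tokens
layered = daily_input_cost(100_000, 2_000, 3.0)
```

With these assumed rates the layered cost comes out around $263/day versus $600 uncached, a reduction of roughly 56 percent, inside the 50-80 percent band in the table.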
Sources: Anthropic Prompt Caching Documentation | vLLM PagedAttention Paper | GPTCache GitHub