
LLM Caching Strategies for Cost Optimization: Prompt, Semantic, and KV Caching

Practical techniques to reduce LLM inference costs by 40-80 percent through prompt caching, semantic caching, and KV cache optimization in production systems.

LLM Inference Costs Add Up Fast

At $3-15 per million input tokens for frontier models, LLM costs become significant at scale. A customer support agent handling 10,000 conversations per day with 2,000 tokens per conversation costs $60-300 daily on input tokens alone. Caching strategies can reduce these costs by 40-80 percent while simultaneously improving latency.
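The arithmetic behind that estimate, as a quick sanity check:

```python
conversations_per_day = 10_000
tokens_per_conversation = 2_000

daily_tokens = conversations_per_day * tokens_per_conversation  # 20M input tokens

low = daily_tokens / 1_000_000 * 3    # $60/day at $3 per million input tokens
high = daily_tokens / 1_000_000 * 15  # $300/day at $15 per million input tokens
```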

Three caching approaches address different patterns: exact prompt caching, semantic caching, and KV cache optimization.

Exact Prompt Caching

The simplest approach: hash the full prompt and cache the response. If the same prompt appears again, return the cached response without calling the LLM.

import hashlib
import json

import redis
from openai import AsyncOpenAI

cache = redis.Redis(host="localhost", port=6379, db=0)
openai_client = AsyncOpenAI()

async def cached_llm_call(messages: list, model: str, ttl: int = 3600):
    # sort_keys makes the hash independent of dict key order
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    response = await openai_client.chat.completions.create(
        model=model, messages=messages
    )
    payload = response.to_dict()
    cache.setex(cache_key, ttl, json.dumps(payload))
    # Return the dict form so cache hits and misses have the same shape
    return payload

When Exact Caching Works

  • Repeated system prompts: Many requests share identical system prompts
  • Structured queries: Classification tasks with a fixed set of inputs
  • Batch processing: Re-running analysis on unchanged data

When It Fails

Exact caching has a low hit rate for conversational applications where each message includes unique user input. Even one character difference produces a different hash.
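To see that sensitivity concretely, here is a minimal sketch using the same hashing scheme as above: dropping a single character produces an unrelated cache key.

```python
import hashlib
import json

def cache_key(messages: list, model: str) -> str:
    return hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

a = cache_key([{"role": "user", "content": "What's the weather in NYC?"}], "gpt-4o")
b = cache_key([{"role": "user", "content": "What's the weather in NYC"}], "gpt-4o")
# a and b share no meaningful relationship despite near-identical inputs
```

Light normalization (trimming whitespace, collapsing case) before hashing can recover some hits, but it cannot bridge genuine paraphrases; that is what semantic caching is for.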

Semantic Caching

Semantic caching matches queries by meaning rather than exact text. "What's the weather in NYC?" and "How's the weather in New York City?" should return the same cached response.

Implementation uses embedding models and vector similarity:

import time

# `embed`, `vector_store`, and `llm_call` are stand-ins for your embedding
# model, vector database client, and LLM wrapper.

async def semantic_cache_lookup(query: str, threshold: float = 0.95, ttl: int = 3600):
    query_embedding = embed(query)
    ttl_cutoff = time.time() - ttl

    # Search vector store for similar previous queries
    results = vector_store.search(
        vector=query_embedding,
        limit=1,
        filter={"created_at": {"$gt": ttl_cutoff}}
    )

    if results and results[0].score > threshold:
        return results[0].metadata["response"]

    # Cache miss: call LLM and store
    response = await llm_call(query)
    vector_store.upsert({
        "vector": query_embedding,
        "metadata": {"query": query, "response": response,
                     "created_at": time.time()}
    })
    return response

Tuning the Similarity Threshold

  • 0.98+: Nearly identical queries only. Low hit rate, very safe.
  • 0.95-0.98: Paraphrases and minor variations. Good balance.
  • 0.90-0.95: Loosely similar queries. Higher hit rate but risk of returning irrelevant cached responses.

Test with your actual query distribution to find the right threshold.
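One way to run that test: replay similarity scores from labeled query pairs in your logs and measure hit rate and false-hit rate at each candidate threshold. The `evaluate_thresholds` helper and the sample scores below are illustrative, not from a real workload.

```python
def evaluate_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (cosine_similarity, is_true_match) from labeled logs."""
    results = {}
    for t in thresholds:
        hits = [(s, truth) for s, truth in scored_pairs if s >= t]
        false_hits = sum(1 for _, truth in hits if not truth)
        results[t] = {
            "hit_rate": len(hits) / len(scored_pairs),
            "false_hit_rate": false_hits / len(hits) if hits else 0.0,
        }
    return results

sample = [(0.99, True), (0.97, True), (0.96, False),
          (0.93, True), (0.91, False), (0.85, False)]
results = evaluate_thresholds(sample, [0.90, 0.95, 0.98])
```

Pick the highest threshold whose hit rate still moves your costs; a rising false-hit rate is the signal you have loosened too far.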

Provider-Level Prompt Caching

Anthropic and OpenAI now offer server-side prompt caching that reduces costs for repeated prompt prefixes.

Anthropic Prompt Caching

Anthropic caches prompt prefixes marked with a cache_control parameter. Subsequent requests with the same prefix hit the cache, reducing input token costs by 90 percent for the cached portion. The cache has a 5-minute TTL that resets on each hit.

This is particularly effective for:

  • Long system prompts (1,000+ tokens)
  • RAG contexts where the retrieved documents are appended to a fixed instruction prefix
  • Multi-turn conversations where the history grows but the system prompt remains constant
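A minimal sketch of what such a request body looks like with Anthropic's Messages API. The model id and system prompt are placeholders, and the commented-out call assumes the `anthropic` SDK and an API key.

```python
# Placeholder system prompt; in practice this would be your real 1,000+ token prefix
LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 200

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder; use your deployed model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached (~5 min TTL)
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}

# import anthropic
# response = anthropic.Anthropic().messages.create(**request)
```

Only the marked prefix is cached; the per-turn user messages after it are billed normally.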

OpenAI Cached Tokens

OpenAI automatically caches prompt prefixes longer than 1,024 tokens and charges 50 percent less for cached tokens. Unlike Anthropic's approach, caching is automatic — no API changes required.
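Since the discount applies per token, the effective input cost follows directly from the cached-token count in the response usage. The helper below is an illustrative sketch; the $2.50-per-million price is an assumed example rate, not a quote.

```python
def effective_input_cost(prompt_tokens: int, cached_tokens: int,
                         price_per_million: float) -> float:
    """Cached tokens are billed at 50% of the normal input rate."""
    uncached = prompt_tokens - cached_tokens
    return (uncached + 0.5 * cached_tokens) * price_per_million / 1_000_000

# Example: 5,000-token prompt, 4,096 tokens served from cache, $2.50/M input
cost = effective_input_cost(5_000, 4_096, 2.50)
```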

KV Cache Optimization

For self-hosted models, the key-value cache stored during autoregressive generation is a major memory and compute bottleneck.

Techniques

  • PagedAttention (vLLM): Manages KV cache memory like virtual memory pages, eliminating fragmentation and enabling higher batch sizes
  • Prefix caching: Shares KV cache entries across requests with identical prompt prefixes, avoiding redundant computation
  • Quantized KV cache: Storing cached keys and values in FP8 or INT8 precision reduces memory by 50 percent with minimal quality impact
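The memory stakes are easy to estimate from the standard KV cache size formula: two tensors per layer (keys and values), one entry per KV head, head dimension, and sequence position. The model shape below is illustrative of an 8B-class model with grouped-query attention, not a measurement of any specific deployment.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    # Factor of 2 for keys + values
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative shape: 32 layers, 8 KV heads (GQA), head_dim 128,
# 8K context, batch of 16
fp16_bytes = kv_cache_bytes(32, 8, 128, 8192, 16, 2)  # 16 GiB
fp8_bytes = kv_cache_bytes(32, 8, 128, 8192, 16, 1)   # 8 GiB
```

At these sizes the KV cache rivals the model weights themselves, which is why halving its precision or sharing prefixes translates directly into larger batches and lower cost per request.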

Cost Savings Calculator

For a system processing 100,000 LLM calls per day:

Strategy                   Typical Hit Rate     Cost Reduction
Exact prompt cache         5-15%                5-15%
Semantic cache             15-40%               15-40%
Provider prompt caching    60-90% of tokens     30-50%
Combined approach          n/a                  50-80%
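A back-of-the-envelope sketch of the layered approach for that 100,000-call system. The hit rates and provider discount below are assumed midpoints for illustration, not measurements.

```python
calls_per_day = 100_000
tokens_per_call = 2_000
price_per_million = 3.0  # $ per million input tokens

baseline = calls_per_day * tokens_per_call / 1_000_000 * price_per_million  # $600/day

exact_hit = 0.10          # served from Redis, effectively free
semantic_hit = 0.25       # of remaining traffic, served from the vector cache
provider_discount = 0.40  # blended per-token saving on the residual misses

misses = calls_per_day * (1 - exact_hit) * (1 - semantic_hit)
cost = misses * tokens_per_call / 1_000_000 * price_per_million * (1 - provider_discount)
savings = 1 - cost / baseline  # roughly 0.6, inside the 50-80% combined range
```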

The strategies are complementary. A production system should layer exact caching (cheapest to implement), semantic caching (catches paraphrases), and provider-level caching (reduces per-token cost for cache misses).

Sources: Anthropic Prompt Caching Documentation | vLLM PagedAttention Paper | GPTCache GitHub
