What prompt caching teaches us about LLM system design
Prompt caching's constraints — one-hour TTL, four breakpoints, exact-match requirements, 1,024-token minimum — are not arbitrary limits. They are a design language that pushes toward specific architectural patterns.
Prompt caching is usually framed as a performance optimization: add cache_control breakpoints, get faster and cheaper requests. But read the rules carefully and a deeper story emerges. The constraints are not random — they encode assumptions about what good LLM system architecture looks like. Understanding those assumptions teaches you more than the caching feature itself.
The constraints as a design language
Here is what prompt caching actually requires, per the Anthropic API course:
| Constraint | What it literally means | What it implies about your architecture |
|---|---|---|
| Manual breakpoints only | You must explicitly mark cache_control: {"type": "ephemeral"} on text blocks | Caching is an intentional design decision, not an automatic optimization |
| Maximum 4 breakpoints | You can partition context into at most 4 cacheable segments | Your context has a natural granularity limit — too many segments and you’re doing something wrong |
| Exact-match requirement | A single character change invalidates the entire cached block | Stable and variable content must live in separate text blocks at the API boundary |
| 1,024-token minimum | Total cached content must exceed this threshold | Tiny caches aren’t worth the architectural overhead — caching is for substantial, reused context |
| One-hour TTL | Cache expires after 60 minutes of inactivity | Cache design targets session-level reuse, not persistent storage |
| Longhand form required | The shorthand "content": "string" form does not support caching | Caching demands explicit structure — you cannot accidentally cache or accidentally miss |
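The last row is easy to trip over in practice. A minimal sketch of the two message shapes as the Messages API accepts them; only the longhand form has anywhere to attach cache_control:

# Shorthand form: content is a plain string, so there is nowhere to attach cache_control
uncacheable = {"role": "user", "content": "Summarize the attached report."}

# Longhand form: content is a list of blocks, and cache_control attaches per block
cacheable = {
    "role": "user",
    "content": [{
        "type": "text",
        "text": "Summarize the attached report.",
        "cache_control": {"type": "ephemeral"},
    }],
}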
Each constraint is a lesson. Together they form a coherent position on how LLM applications should be built.
Lesson 1: Separate stable from variable content at the API boundary
The exact-match requirement is the most consequential constraint. Change “please” to “kindly” in a cached block and the entire cache invalidates. This is not a bug — it’s forcing you to keep stable and variable content in separate text blocks.
The architecture this produces:
# Wrong: stable and variable content mixed in one block
# Adding the user's name invalidates the entire cache
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": f"System instructions... (5000 tokens)\n\nUser {name} asks: {query}",
        "cache_control": {"type": "ephemeral"}
    }]
}]
# Right: stable content cached, variable content uncached
messages = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "System instructions... (5000 tokens)",
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": f"User {name} asks: {query}"
        }
    ]
}]

This looks like a minor API detail. It’s actually an architectural principle: the division between stable and variable content should be visible in your request construction code. If you cannot point to the line where stable context ends and variable context begins, your caching strategy is accidental.
The stable-to-variable boundary also maps cleanly onto the system prompt vs. messages distinction covered in the Prompt Engineering Fundamentals. System prompts and tool definitions are stable (cache them). Per-request user data and conversation turns are variable (don’t). The API’s structure and the caching rules converge on the same design.
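The same boundary shows up at the top level of a request. A minimal sketch, assuming the Anthropic Python SDK; the model name, the search_kb tool, SYSTEM_PROMPT, and user_query are illustrative placeholders, not values from this article:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    # Stable: tool definitions, cached
    tools=[{
        "name": "search_kb",  # hypothetical tool
        "description": "Search the internal knowledge base.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        "cache_control": {"type": "ephemeral"},
    }],
    # Stable: system prompt, cached
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,  # assumed to be well above the 1,024-token minimum
        "cache_control": {"type": "ephemeral"},
    }],
    # Variable: per-request user data, never cached
    messages=[{"role": "user", "content": user_query}],
)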
Lesson 2: The four-breakpoint limit defines a natural granularity
Four breakpoints means at most four cacheable segments. Why four? Because the processing order — tools first, then system prompt, then messages — suggests a natural partitioning:
- Tool definitions — Stable across conversations, rarely changing
- System prompt — The role, rules, and format specifications
- Long reference documents — Uploaded PDFs, knowledge base articles, code files
- Conversation prefix — The first N turns of a long conversation, up to the point where new messages append
| Breakpoint | What goes before it | Why cache it |
|---|---|---|
| 1 | Tool definitions | Processed once, reused across all conversations |
| 2 | System prompt + rules | Same for every request in a deployment |
| 3 | Reference documents | Large token count, reused within a session |
| 4 | Conversation history prefix | Avoids reprocessing earlier turns as conversation grows |
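Assembled into one request, the four breakpoints look roughly like this. A sketch under the same assumptions as the earlier examples; TOOL_DEFINITIONS, SYSTEM_PROMPT, reference_document, conversation_prefix, and current_query are placeholders:

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=2048,
    # Breakpoint 1: cache_control on the last tool caches the whole tool list
    tools=TOOL_DEFINITIONS[:-1] + [
        {**TOOL_DEFINITIONS[-1], "cache_control": {"type": "ephemeral"}}
    ],
    # Breakpoint 2: system prompt and rules
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": [
            # Breakpoint 3: long reference documents reused within the session
            {"type": "text", "text": reference_document,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 4: the conversation prefix, cached up to the recent turns
            {"type": "text", "text": conversation_prefix,
             "cache_control": {"type": "ephemeral"}},
            # Variable: the current query stays uncached
            {"type": "text", "text": current_query},
        ],
    }],
)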
If you need more than four breakpoints, you are likely trying to cache things that should be dynamically assembled or retrieved — a sign that your architecture is treating cacheable context as more granular than it actually is.
Lesson 3: The one-hour TTL targets session economics, not persistence
Caching survives for one hour. Not a day, not a week. This tells you what the feature optimizes for: a user session with multiple back-and-forth turns, or a batch of related requests within a single workflow.
This has implications for what you should cache and what you should retrieve:
Caching and retrieval solve different problems at different timescales. Caching handles session-level reuse with zero additional infrastructure. Retrieval handles cross-session reuse at the cost of a vector database and embedding pipeline. Confusing the two — trying to use caching as a retrieval replacement or retrieval as a cache — produces systems that are either too expensive or too stale.
Lesson 4: Manual breakpoints mean caching is architecture, not optimization
Automatic caching would be easier. The fact that Anthropic chose manual breakpoints is telling. Caching changes system behavior — cached content is identical across requests, which means any per-request customization must live outside the cached blocks. That decision has architectural consequences.
If caching were automatic, you’d discover at runtime whether your architecture happens to be cache-friendly. With manual breakpoints, you decide at design time. The distinction matters:
def build_request(system_prompt: str, user_query: str, session_history: list):
    """A cache-aware request builder forces design decisions."""
    content = []

    # Decision 1: Is the system prompt stable enough to cache?
    content.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}
    })

    # Decision 2: How much conversation history should be cached?
    if len(session_history) > 10:
        prefix = format_messages(session_history[:-2])
        content.append({
            "type": "text",
            "text": prefix,
            "cache_control": {"type": "ephemeral"}
        })
        # Recent turns stay uncached so they can vary
        content.append({
            "type": "text",
            "text": format_messages(session_history[-2:])
        })
    else:
        content.append({
            "type": "text",
            "text": format_messages(session_history)
        })

    # Decision 3: The user's current query is always variable
    content.append({
        "type": "text",
        "text": user_query
    })
    return content

Each cache_control placement is a decision about what is identical across requests. Making those decisions explicit — rather than letting an optimizer discover them — is the point.
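A usage sketch, with SYSTEM_PROMPT, session_history, the client, and the model name carried over as placeholders from the earlier examples:

content = build_request(
    system_prompt=SYSTEM_PROMPT,
    user_query="What changed in the last release?",
    session_history=session_history,
)
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": content}],
)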
Lesson 5: Minimum thresholds create a floor for meaningful structure
The 1,024-token minimum (summed across cached blocks) means caching is not for small optimizations. You don’t cache a single example or a one-line instruction. Caching rewards substantial, reused context.
This creates a natural threshold: if your stable context is under ~1,000 tokens, the architectural overhead of managing cached blocks may not be worth the savings. The feature self-selects for systems with meaningful amounts of shared context — systems with comprehensive system prompts, large tool catalogs, or long reference documents.
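One way to respect that floor is to gate the breakpoint on an estimate of the stable context's size. A rough sketch; the four-characters-per-token estimate is a crude heuristic rather than the API's tokenizer, and SYSTEM_PROMPT and REFERENCE_DOCUMENT are placeholders:

def stable_context_is_cacheable(stable_texts: list[str], min_tokens: int = 1024) -> bool:
    """Return True when the stable context plausibly clears the caching minimum."""
    estimated_tokens = sum(len(text) // 4 for text in stable_texts)  # rough estimate
    return estimated_tokens >= min_tokens

# Only attach cache_control when the stable blocks are worth caching
apply_cache_control = stable_context_is_cacheable([SYSTEM_PROMPT, REFERENCE_DOCUMENT])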
Speculative cache warming: hiding the cache creation cost
Cache creation still has a cost — the first request that populates a cache block pays for the full processing of those tokens. In a 150K-token context, that means a 20-second wait before the first token appears. Speculative cache warming hides that cost by populating the cache while the user is still typing.
The pattern is simple: when the user focuses an input field (or a conversation starts), send a 1-token request against the stable context with cache_control enabled. The API processes all the cached blocks and stores them. By the time the user submits their real question, the cache is warm:
import asyncio

async def warm_cache(client, stable_context):
    """Send a 1-token request to populate the cache in the background."""
    await client.messages.create(
        model=MODEL,
        max_tokens=1,
        messages=[stable_context],  # Same cache_control blocks as real requests
        system=SYSTEM_PROMPT,
    )

# Start warming as soon as the user focuses the input field
# (runs inside an async handler; client is an async Anthropic client)
cache_task = asyncio.create_task(warm_cache(client, stable_context))

# User types for ~3 seconds — cache warming completes in the background
user_question = await get_user_input()
await cache_task  # Ensure warming is done

# Real request now hits a warm cache — TTFT drops from ~21s to ~2s
response = await client.messages.create(
    model=MODEL,
    max_tokens=4096,
    messages=[stable_context, {"role": "user", "content": user_question}],
    system=SYSTEM_PROMPT,
)

The numbers from a real workload (150K tokens of SQLite source code): standard caching produced a 20.9-second time-to-first-token. With speculative warming, TTFT dropped to 1.9 seconds — a 90.7% improvement. The total response time fell from 28.3s to 8.4s.
Speculative warming doesn’t change the caching architecture — it exploits it. The same stable/variable separation, the same manual breakpoint placement, the same session-level TTL. It just moves the cache population step earlier in time, hiding the cost behind user activity. This is the kind of optimization that becomes possible once you treat caching as infrastructure rather than as an API flag.
The architecture these constraints converge on
Read collectively, the constraints describe a specific system architecture:
| Layer | Content | Cache strategy | Change frequency |
|---|---|---|---|
| Static | Tool definitions, Skill metadata, MCP registrations | Cache at breakpoint 1 | Days to weeks |
| Stable | System prompt, role definition, rules, format specs | Cache at breakpoint 2 | Hours to days |
| Session | Reference documents, uploaded files, memory context | Cache at breakpoint 3 | Within a session |
| Conversation | Prior turns in the current exchange | Cache at breakpoint 4 up to recent turns | Within a conversation |
| Variable | Current user message, last 1-2 turns | Never cached | Every request |
Each layer has a different change frequency, and the cache boundary sits at the transition between layers. If your system does not naturally separate into these layers, caching will be awkward — and that awkwardness is itself a signal that your architecture may need refactoring.
When the constraints fight back
Not every system fits this model. The constraints become painful in specific situations:
Highly personalized per-request context. If every request includes user-specific data woven through the instructions, there is nothing stable to cache. The system is inherently variable. Caching won’t help, and forcing it will produce fragile cache keys that invalidate constantly.
Rapidly iterating prompts during development. The exact-match requirement means every prompt tweak invalidates the cache. During active prompt engineering, caching is counterproductive. It belongs in production, not in the eval loop.
Context that changes more often than hourly. If your tool definitions or reference documents update more frequently than the TTL, cached versions will serve stale data. The cache doesn’t know content changed — it only knows the cached bytes match.
The broader lesson: platform constraints are design recommendations
Prompt caching is one feature in one API. But the pattern generalizes. Extended thinking’s budget tokens shape how you allocate reasoning space. Citations’ structured offsets shape how you design document processing pipelines. Agent memory’s hierarchical paths shape how you organize persistent state.
Platform constraints are not arbitrary. They encode the platform’s theory of how applications should be built. Reading constraints as design recommendations, rather than as limitations to work around, teaches you the system’s intended architecture faster than any tutorial.
Takeaways
Separate stable from variable at the API boundary
The exact-match requirement forces you to keep cached and uncached content in distinct text blocks — making the stable-to-variable division visible in your code.
Four breakpoints define natural granularity
The limit maps onto a four-layer model: tools, system prompt, reference documents, and conversation history — each with different change frequencies.
Manual breakpoints make caching an architectural decision
Anthropic chose explicit over automatic caching because cache placement is a design decision, not an optimization detail.
One-hour TTL targets session reuse, not persistence
Caching handles within-session repetition. Cross-session reuse over days or weeks belongs in a retrieval system, not a cache.
Platform constraints are design recommendations
Reading API limits as architectural guidance rather than arbitrary restrictions reveals the intended system design faster than documentation.