What prompt caching teaches us about LLM system design
Prompt caching's constraints — one-hour TTL, four breakpoints, exact-match requirements, 1,024-token minimum — are not arbitrary limits. They are a design language that pushes toward specific architectural patterns.
Prompt caching is usually framed as a performance optimization: add cache_control breakpoints, get faster and cheaper requests. But read the rules carefully and a deeper story emerges. The constraints are not random — they encode assumptions about what good LLM system architecture looks like. Understanding those assumptions teaches you more than the caching feature itself.
The constraints as a design language
Here is what prompt caching actually requires, per the Anthropic API course:
| Constraint | What it literally means | What it implies about your architecture |
|---|---|---|
| Manual breakpoints only | You must explicitly mark cache_control: {"type": "ephemeral"} on text blocks | Caching is an intentional design decision, not an automatic optimization |
| Maximum 4 breakpoints | You can partition context into at most 4 cacheable segments | Your context has a natural granularity limit — too many segments and you’re doing something wrong |
| Exact-match requirement | A single character change invalidates the entire cached block | Stable and variable content must live in separate text blocks at the API boundary |
| 1,024-token minimum | Total cached content must exceed this threshold | Tiny caches aren’t worth the architectural overhead — caching is for substantial, reused context |
| One-hour TTL | Cache expires after 60 minutes of inactivity | Cache design targets session-level reuse, not persistent storage |
| Longhand form required | The shorthand "content": "string" form does not support caching | Caching demands explicit structure — you cannot accidentally cache or accidentally miss |
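The last row is easy to trip over in practice. A minimal sketch of the two message shapes as the Messages API accepts them; only the longhand form has anywhere to attach cache_control:

# Shorthand form: content is a plain string, so there is nowhere to attach cache_control
uncacheable = {"role": "user", "content": "Summarize the attached report."}

# Longhand form: content is a list of blocks, and cache_control attaches per block
cacheable = {
    "role": "user",
    "content": [{
        "type": "text",
        "text": "Summarize the attached report.",
        "cache_control": {"type": "ephemeral"},
    }],
}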
Each constraint is a lesson. Together they form a coherent position on how LLM applications should be built.
Lesson 1: Separate stable from variable content at the API boundary
The exact-match requirement is the most consequential constraint. Change “please” to “kindly” in a cached block and the entire cache invalidates. This is not a bug — it’s forcing you to keep stable and variable content in separate text blocks.
The architecture this produces:
# Wrong: stable and variable content mixed in one block
# Adding the user's name invalidates the entire cache
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": f"System instructions... (5000 tokens)\n\nUser {name} asks: {query}",
        "cache_control": {"type": "ephemeral"}
    }]
}]
# Right: stable content cached, variable content uncached
messages = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "System instructions... (5000 tokens)",
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": f"User {name} asks: {query}"
        }
    ]
}]

This looks like a minor API detail. It’s actually an architectural principle: the division between stable and variable content should be visible in your request construction code. If you cannot point to the line where stable context ends and variable context begins, your caching strategy is accidental.
The stable-to-variable boundary also maps cleanly onto the system prompt vs. messages distinction covered in the Prompt Engineering Fundamentals. System prompts and tool definitions are stable (cache them). Per-request user data and conversation turns are variable (don’t). The API’s structure and the caching rules converge on the same design.
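The same boundary shows up at the top level of a request. A minimal sketch, assuming the Anthropic Python SDK; the model name, the search_kb tool, SYSTEM_PROMPT, and user_query are illustrative placeholders, not values from this article:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    # Stable: tool definitions, cached
    tools=[{
        "name": "search_kb",  # hypothetical tool
        "description": "Search the internal knowledge base.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        "cache_control": {"type": "ephemeral"},
    }],
    # Stable: system prompt, cached
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,  # assumed to be well above the 1,024-token minimum
        "cache_control": {"type": "ephemeral"},
    }],
    # Variable: per-request user data, never cached
    messages=[{"role": "user", "content": user_query}],
)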
Lesson 2: The four-breakpoint limit defines a natural granularity
Four breakpoints means at most four cacheable segments. Why four? Because the processing order — tools first, then system prompt, then messages — suggests a natural partitioning:
- Tool definitions — Stable across conversations, rarely changing
- System prompt — The role, rules, and format specifications
- Long reference documents — Uploaded PDFs, knowledge base articles, code files
- Conversation prefix — The first N turns of a long conversation, up to the point where new messages append
| Breakpoint | What goes before it | Why cache it |
|---|---|---|
| 1 | Tool definitions | Processed once, reused across all conversations |
| 2 | System prompt + rules | Same for every request in a deployment |
| 3 | Reference documents | Large token count, reused within a session |
| 4 | Conversation history prefix | Avoids reprocessing earlier turns as conversation grows |
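Assembled into one request, the four breakpoints look roughly like this. A sketch under the same assumptions as the earlier examples; TOOL_DEFINITIONS, SYSTEM_PROMPT, reference_document, conversation_prefix, and current_query are placeholders:

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=2048,
    # Breakpoint 1: cache_control on the last tool caches the whole tool list
    tools=TOOL_DEFINITIONS[:-1] + [
        {**TOOL_DEFINITIONS[-1], "cache_control": {"type": "ephemeral"}}
    ],
    # Breakpoint 2: system prompt and rules
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": [
            # Breakpoint 3: long reference documents reused within the session
            {"type": "text", "text": reference_document,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 4: the conversation prefix, cached up to the recent turns
            {"type": "text", "text": conversation_prefix,
             "cache_control": {"type": "ephemeral"}},
            # Variable: the current query stays uncached
            {"type": "text", "text": current_query},
        ],
    }],
)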
If you need more than four breakpoints, you are likely trying to cache things that should be dynamically assembled or retrieved — a sign that your architecture is treating cacheable context as more granular than it actually is.
Lesson 3: The one-hour TTL targets session economics, not persistence
Caching survives for one hour. Not a day, not a week. This tells you what the feature optimizes for: a user session with multiple back-and-forth turns, or a batch of related requests within a single workflow.
This has implications for what you should cache and what you should retrieve:
Caching and retrieval solve different problems at different timescales. Caching handles session-level reuse with zero additional infrastructure. Retrieval handles cross-session reuse at the cost of a vector database and embedding pipeline. Confusing the two — trying to use caching as a retrieval replacement or retrieval as a cache — produces systems that are either too expensive or too stale.
Lesson 4: Manual breakpoints mean caching is architecture, not optimization
Automatic caching would be easier. The fact that Anthropic chose manual breakpoints is telling. Caching changes system behavior — cached content is identical across requests, which means any per-request customization must live outside the cached blocks. That decision has architectural consequences.
If caching were automatic, you’d discover at runtime whether your architecture happens to be cache-friendly. With manual breakpoints, you decide at design time. The distinction matters:
def build_request(system_prompt: str, user_query: str, session_history: list):
    """A cache-aware request builder forces design decisions."""
    content = []

    # Decision 1: Is the system prompt stable enough to cache?
    content.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}
    })

    # Decision 2: How much conversation history should be cached?
    if len(session_history) > 10:
        prefix = format_messages(session_history[:-2])
        content.append({
            "type": "text",
            "text": prefix,
            "cache_control": {"type": "ephemeral"}
        })
        # Recent turns stay uncached so they can vary
        content.append({
            "type": "text",
            "text": format_messages(session_history[-2:])
        })
    else:
        content.append({
            "type": "text",
            "text": format_messages(session_history)
        })

    # Decision 3: The user's current query is always variable
    content.append({
        "type": "text",
        "text": user_query
    })
    return content

Each cache_control placement is a decision about what is identical across requests. Making those decisions explicit — rather than letting an optimizer discover them — is the point.
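A usage sketch, with SYSTEM_PROMPT, session_history, the client, and the model name carried over as placeholders from the earlier examples:

content = build_request(
    system_prompt=SYSTEM_PROMPT,
    user_query="What changed in the last release?",
    session_history=session_history,
)
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": content}],
)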
Lesson 5: Minimum thresholds create a floor for meaningful structure
The 1,024-token minimum (summed across cached blocks) means caching is not for small optimizations. You don’t cache a single example or a one-line instruction. Caching rewards substantial, reused context.
This creates a natural threshold: if your stable context is under ~1,000 tokens, the architectural overhead of managing cached blocks may not be worth the savings. The feature self-selects for systems with meaningful amounts of shared context — systems with comprehensive system prompts, large tool catalogs, or long reference documents.
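One way to respect that floor is to gate the breakpoint on an estimate of the stable context's size. A rough sketch; the four-characters-per-token estimate is a crude heuristic rather than the API's tokenizer, and SYSTEM_PROMPT and REFERENCE_DOCUMENT are placeholders:

def stable_context_is_cacheable(stable_texts: list[str], min_tokens: int = 1024) -> bool:
    """Return True when the stable context plausibly clears the caching minimum."""
    estimated_tokens = sum(len(text) // 4 for text in stable_texts)  # rough estimate
    return estimated_tokens >= min_tokens

# Only attach cache_control when the stable blocks are worth caching
apply_cache_control = stable_context_is_cacheable([SYSTEM_PROMPT, REFERENCE_DOCUMENT])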
Speculative cache warming: hiding the cache creation cost
Cache creation still has a cost — the first request that populates a cache block pays for the full processing of those tokens. In a 150K-token context, that means a 20-second wait before the first token appears. Speculative cache warming hides that cost by populating the cache while the user is still typing.
The pattern is simple: when the user focuses an input field (or a conversation starts), send a 1-token request against the stable context with cache_control enabled. The API processes all the cached blocks and stores them. By the time the user submits their real question, the cache is warm:
import asyncio

async def warm_cache(client, stable_context):
    """Send a 1-token request to populate the cache in the background."""
    await client.messages.create(
        model=MODEL,
        max_tokens=1,
        messages=[stable_context],  # Same cache_control blocks as real requests
        system=SYSTEM_PROMPT,
    )

# Start warming as soon as the user focuses the input field
# (runs inside an async handler; client is an async Anthropic client)
cache_task = asyncio.create_task(warm_cache(client, stable_context))

# User types for ~3 seconds — cache warming completes in the background
user_question = await get_user_input()
await cache_task  # Ensure warming is done

# Real request now hits a warm cache — TTFT drops from ~21s to ~2s
response = await client.messages.create(
    model=MODEL,
    max_tokens=4096,
    messages=[stable_context, {"role": "user", "content": user_question}],
    system=SYSTEM_PROMPT,
)

The numbers from a real workload (150K tokens of SQLite source code): standard caching produced a 20.9-second time-to-first-token. With speculative warming, TTFT dropped to 1.9 seconds — a 90.7% improvement. The total response time fell from 28.3s to 8.4s.
Speculative warming doesn’t change the caching architecture — it exploits it. The same stable/variable separation, the same manual breakpoint placement, the same session-level TTL. It just moves the cache population step earlier in time, hiding the cost behind user activity. This is the kind of optimization that becomes possible once you treat caching as infrastructure rather than as an API flag.
The architecture these constraints converge on
Read collectively, the constraints describe a specific system architecture:
| Layer | Content | Cache strategy | Change frequency |
|---|---|---|---|
| Static | Tool definitions, Skill metadata, MCP registrations | Cache at breakpoint 1 | Days to weeks |
| Stable | System prompt, role definition, rules, format specs | Cache at breakpoint 2 | Hours to days |
| Session | Reference documents, uploaded files, memory context | Cache at breakpoint 3 | Within a session |
| Conversation | Prior turns in the current exchange | Cache at breakpoint 4 up to recent turns | Within a conversation |
| Variable | Current user message, last 1-2 turns | Never cached | Every request |
Each layer has a different change frequency, and the cache boundary sits at the transition between layers. If your system does not naturally separate into these layers, caching will be awkward — and that awkwardness is itself a signal that your architecture may need refactoring.
When the constraints fight back
Not every system fits this model. The constraints become painful in specific situations:
Highly personalized per-request context. If every request includes user-specific data woven through the instructions, there is nothing stable to cache. The system is inherently variable. Caching won’t help, and forcing it will produce fragile cache keys that invalidate constantly.
Rapidly iterating prompts during development. The exact-match requirement means every prompt tweak invalidates the cache. During active prompt engineering, caching is counterproductive. It belongs in production, not in the eval loop.
Context that changes more often than hourly. If your tool definitions or reference documents update more frequently than the TTL, cached versions will serve stale data. The cache doesn’t know content changed — it only knows the cached bytes match.
The broader lesson: platform constraints are design recommendations
Prompt caching is one feature in one API. But the pattern generalizes. Extended thinking’s budget tokens shape how you allocate reasoning space. Citations’ structured offsets shape how you design document processing pipelines. Agent memory’s hierarchical paths shape how you organize persistent state.
Platform constraints are not arbitrary. They encode the platform’s theory of how applications should be built. Reading constraints as design recommendations, rather than as limitations to work around, teaches you the system’s intended architecture faster than any tutorial.
Takeaways
Separate stable from variable at the API boundary
The exact-match requirement forces you to keep cached and uncached content in distinct text blocks — making the stable-to-variable division visible in your code.
Four breakpoints define natural granularity
The limit maps onto a four-layer model: tools, system prompt, reference documents, and conversation history — each with different change frequencies.
Manual breakpoints make caching an architectural decision
Anthropic chose explicit over automatic caching because cache placement is a design decision, not an optimization detail.
One-hour TTL targets session reuse, not persistence
Caching handles within-session repetition. Cross-session reuse over days or weeks belongs in a retrieval system, not a cache.
Platform constraints are design recommendations
Reading API limits as architectural guidance rather than arbitrary restrictions reveals the intended system design faster than documentation.