FREE PREVIEW — INCIDENT 1 OF 10

The 1h to 5m Cache TTL Silent Downgrade

Free preview chapter from Claude Code Incident Postmortems — a forthcoming Gumroad release (target date 2026-05-05; current price on the product page).

What this book is. Ten production-level Claude Code failures, reverse-engineered from public GitHub Issues, Anthropic staff replies, official documentation, and 800 hours of verified operator logs. Each incident follows a five-part structure: Signal, Root cause, Evidence, Fix/Workaround, Prevention (with a detection hook). This page is Incident 1 in full. The other nine and four appendices are in the book.

Not affiliated with Anthropic PBC. Every claim is traceable to a cited public source.

Period: March 6, 2026 to April 12, 2026 (closed as not planned)
Severity: High for Max subscribers, neutral-to-positive for API users
Status: Closed without fix; environment variables promised, not yet shipped
Primary sources: #46829 · docs.anthropic.com/prompt-caching · theregister.com 2026-04-13

What you will learn from this postmortem:

  1. Why a "cost-neutral on average" change can still drain a specific subscriber's weekly quota in hours.
  2. How to tell from /usage output whether your sessions are running on 5-minute or 1-hour cache TTL.
  3. Why Anthropic's official "not planned" response is technically correct and still leaves Max users worse off.
  4. A concrete detection hook you can install today that logs cache tier per request without waiting for a future environment variable.

Signal

In the first week of March 2026, Max $200/month subscribers who had been using Claude Code for six-plus months without ever hitting the 5-hour quota cap suddenly began to hit it. Some hit both the 5-hour and the weekly limit in the same day. Reddit threads from this window include "20x max usage gone in 19 minutes" (r/ClaudeAI, 330+ comments) and "Claude Code Limits Were Silently Reduced" (r/ClaudeCode, 360+ comments).

There had been no announcement. The CLI version had not changed in a way that explained the difference. Same prompts, same codebase, same workflow — different burn rate.

The /usage endpoint (in versions that shipped it) showed the usual breakdown by cache tier, but without a reference point for "what should this look like," most users could not tell that anything was off until they hit the cap.

This is the pattern you should learn to recognize from this incident: nothing visibly breaks, a single hidden dial moves, and the cost of your existing workflow shifts by a double-digit percentage.

Root cause

Every request Claude Code sends to the Anthropic API can tag its prompt cache with a TTL — the time the cached content stays valid for cheap reads. Two tiers exist: ephemeral_5m (the 5-minute default) and ephemeral_1h (the 1-hour extended tier, which per Anthropic's prompt-caching documentation costs more per cache write but survives human-scale gaps between turns).

The client chooses which tier to tag per request. From roughly February 7 to March 5, 2026, Claude Code was tagging nearly all Max-subscription main-conversation turns as ephemeral_1h. Subagent turns were often tagged ephemeral_5m even in that window — a detail that matters later.
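For orientation, the tier rides on the request body as a cache_control block attached to the cached content. The shape below follows Anthropic's prompt-caching documentation; the message content around it is purely illustrative:

```json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "<large shared prefix: CLAUDE.md, file tree, ...>",
      "cache_control": { "type": "ephemeral", "ttl": "1h" }
    }
  ]
}
```

Swapping "1h" for "5m" is the entire difference between the two tiers — which is why the change was invisible to anyone not inspecting raw request payloads.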

On March 6, 2026, an experiment gate on the client flipped. The logic became per-request selection rather than "1h by default." In practice this meant a large fraction of Max main-conversation turns began writing to ephemeral_5m.

For a workflow where the same main session is used intermittently over a 20-minute coding session — five minutes of typing, a three-minute tool call, a twelve-minute look at the output — each gap over five minutes expires the cache. The next turn must re-upload the same context as a fresh cache write. Every re-upload is billed as cache_creation_input_tokens, which Max subscription quota accounting weights heavily.
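A toy calculation makes the mechanism concrete. The gap lengths and context size below are assumptions, not measurements from the incident; the point is that every inter-turn gap longer than the TTL converts one cheap cache read into a full cache write:

```shell
gaps="4 7 2 12 6 3 9"     # assumed minutes between consecutive main-session turns
context_tokens=200000     # assumed size of the cached prefix

for ttl in 5 60; do
  writes=1                # the first turn always writes the cache
  for g in $gaps; do
    # Any gap longer than the TTL expires the cache, forcing a full re-write.
    [ "$g" -gt "$ttl" ] && writes=$((writes + 1))
  done
  echo "TTL=${ttl}m writes=${writes} cache_creation_tokens=$((writes * context_tokens))"
done
```

On these assumed gaps, the 5-minute tier pays five full-context writes (1,000,000 cache_creation tokens) where the 1-hour tier pays one (200,000) — the same session, a 5x difference in the quota-heavy token class.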

The change did not break anything. It simply moved the efficient-reuse assumption from "1 hour" to "5 minutes." For a subscription user whose work pattern straddles the 5-minute boundary all day, this translates directly into faster quota exhaustion.

Evidence

Issue #46829 ("Cache TTL silently regressed from 1h to 5m around early March 2026") was opened by seanGSISG with a 119,866-call analysis covering January 11 through April 11, 2026, across two machines on Max subscription and Vertex billing. The data showed three distinct phases:

Period                               5m-tier share of cache_creation tokens (Max main turns)
January 29 to February 6             ~58% (1h rollout in progress)
February 7 to March 5                0% (full 1h)
March 6 to March 31                  ~20 to 45% (per-request mix)
April 1 onward (post v2.1.90 fix)    0 to 6% (bug fix restored most main turns to 1h)

A second independent dataset from @spm1001 (approximately 407,000 turns on Max + Vertex, UK and DE) confirmed the timeline, matched the March 6 breakpoint, and added a detail that became important for reproducing the incident: Vertex-billed turns are always 5-minute TTL. Across 35,000 Vertex-billed turns, every single cache creation was tagged ephemeral_5m, with no exceptions. If you are running Claude Code through Google Cloud Vertex AI billing, you have never received the 1-hour cache optimization regardless of the client gate state.

A note on the financial figures in the OP. The original post calculated the regression as a $949 over-cost on Sonnet and $1,581 over-cost on Opus, roughly 17.1 percent. After Anthropic staff replied, the OP corrected their own math in issuecomment-4233120483. The corrected figures separate main-session turns (21.8 : 1 read-to-write ratio, 35.3 million 5-minute tokens, $121 upper-bound waste) from subagent turns (9.1 : 1, 239.8 million 5-minute tokens). Subagents have a 1.4-second median inter-turn gap, so a 5-minute TTL almost never expires on them, and the cheaper 5-minute write tier is in fact the right choice. Combined, the corrected math showed API users saved approximately $418 on average rather than losing $949.

This retraction is one of the most instructive moments in the entire thread, and it matters for your operational thinking. Averaging across all request types hid the subscriber-specific harm. The "net positive on API cost" story is true. It is also true that Max subscribers ran out of quota faster because main-conversation turns — the ones that sit across human-scale gaps — are exactly the ones that suffer from short TTL. Two different user populations, one change, two different outcomes. If you spot an incident postmortem that reports only an "on average" figure, apply the same split.

Official response

Boris Cherny (bcherny, Anthropic) replied on the thread with a clarification of the per-request heuristic and a commitment: "We will soon be changing the client-side default to 1h for a few queries ... We will also give you env vars to force 1h and 5m." In the same comment he added that the client falls back to the default value (5 minutes) when telemetry is disabled, because the experiment gate is delivered through the telemetry channel. Users who opted out of telemetry for privacy reasons were effectively opted out of the cache optimization as well.

Jarred Sumner (Jarred-Sumner, Anthropic) addressed a related client-side bug in which sessions that started already in subscription overage would stay pinned to 5-minute TTL until the session exited. That bug was reported fixed in v2.1.90. The CHANGELOG for v2.1.90 does not mention the fix; it is recorded only in the Issue thread.

The Register summarized Anthropic's public position on April 13, 2026 as "Claude quota drain not caused by cache tweaks." That framing is technically defensible for API customers and technically hollow for Max customers who cannot buy quota back at token price.

Issue #46829 was closed with status "not planned" on April 12, 2026. The subscription-side quota impact was referenced to Issue #45756 ("Pro Max 5x Quota Exhausted in 1.5 Hours Despite Moderate Usage") for separate handling.

Diagnosis — how to tell if you were affected

If you ran Claude Code on Max or Pro between March 6 and April 1 and you started hitting your 5-hour or weekly quota for the first time, the answer is likely yes.

For ongoing diagnosis on current versions, use the /usage command (available in v2.1.118+):

/usage

The output shows cumulative cache_creation_input_tokens and cache_read_input_tokens. A healthy coding workflow on 1-hour TTL typically shows a read-to-write ratio of 10 : 1 or better on the main session. If you see a ratio below 5 : 1 and your work pattern includes gaps of 5 to 60 minutes between turns, you are paying the 5-minute-TTL tax.
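A minimal sketch of that threshold check — the counters below are placeholders; substitute the cumulative numbers from your own /usage output:

```shell
cache_creation=3500000    # placeholder: cumulative cache_creation_input_tokens
cache_read=14000000       # placeholder: cumulative cache_read_input_tokens

ratio=$((cache_read / cache_creation))   # integer ratio is enough for a threshold test
if [ "$ratio" -lt 5 ]; then
  echo "WARN: read/write ratio ~${ratio}:1 — consistent with the 5-minute-TTL tax"
else
  echo "OK: read/write ratio ~${ratio}:1"
fi
```

With the placeholder numbers the ratio is 4 : 1, which lands in the warning band for a workflow with multi-minute gaps.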

A second diagnostic is telemetry state. Run:

grep -i telemetry ~/.claude/settings.json

If telemetry is disabled, your client is reading the hard-coded default cache TTL (5 minutes for most queries) and you are not receiving the gate updates that restore 1-hour behavior for main-conversation turns.

A third diagnostic, specific to Vertex customers: you cannot fix this by client upgrade, and you should plan your budget against the 5-minute baseline permanently until Anthropic ships automatic Vertex support (no published date as of this writing).

Fix and workaround

There is no official fix. CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 is discussed in Incident 7 of this book and partially compensates by removing a background cache chain that inflates cost independently of TTL. That is not the same as fixing the TTL itself.

Promised but not yet shipped as of the closing of Issue #46829:

  1. Environment variables to force 1-hour or 5-minute TTL.
  2. A client-side default of 1h for more query types.
  3. Automatic 1-hour cache support for Vertex-billed customers.

Until those ship, the available levers for Max subscribers are operational rather than configurational:

  1. Keep main-session turns under 5 minutes apart when the context is expensive. If you are working on a 200K-token CLAUDE.md and a large file tree, stay in the session. Breaking for a 10-minute code review means the next turn is a 200K-token cache write.
  2. Use subagents for short, one-shot tasks. They already benefit from the 5-minute tier and their cost structure is actually improved by the March 6 change.
  3. Avoid keepalive ping scripts on subscription. Third-party tools like yujiachen-y/claude-code-cache-keepalive send periodic dummy requests to keep the cache warm. As kaiomp pointed out on the thread, every ping is a full cache-read on the prefix — cheap in API dollars, but it counts against your 5-hour and weekly quotas. Keepalive works for API customers paying per token; it actively harms Max subscribers by spending coding capacity on cache maintenance. Measured benefit does not exceed measured quota cost.
  4. Accept the peak-hour penalty. As opriz noted, cache misses do not save compute — they force a full prefill. In the 5 to 11 AM Pacific window (business hours for most North American users), capacity is tight, TTFT (time to first token) rises, and throttling complaints spike. If you can schedule heavy work outside that window, do.
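To put a number on workaround 3, here is a back-of-envelope sketch. The prefix size, ping interval, and workday length are all assumed figures, not measurements from the thread:

```shell
prefix_tokens=200000   # assumed cached prefix size
pings_per_hour=15      # assumed: one keepalive ping every 4 minutes
hours=8                # assumed workday length

# Each ping performs a full cache read of the prefix. Cheap in API dollars,
# but on subscription every read counts against the 5-hour and weekly quotas.
keepalive_reads=$((prefix_tokens * pings_per_hour * hours))
echo "keepalive cache-read tokens per workday: ${keepalive_reads}"
```

Twenty-four million read tokens a day of pure cache maintenance is why keepalive is an API-customer trick, not a subscriber one.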

Prevention — a detection hook you can install today

The cc-safe-setup repository includes a PostToolUse hook that logs cache tier per request. Copy it into ~/.claude/hooks/cache-tier-logger.sh:

#!/usr/bin/env bash
# cache-tier-logger.sh — logs TTL tier per API call for after-the-fact diagnosis.
# PostToolUse event, triggered on any model call.

set -eu

# Read tool-call JSON from stdin.
payload="$(cat)"

# Extract the TTL tier; skip non-API events (and unparseable payloads) silently.
# The `|| true` keeps a jq parse failure from killing the hook under `set -e`.
tier="$(printf '%s' "$payload" | jq -r '.tool_input.cache_control.ttl // empty' 2>/dev/null || true)"
[ -z "$tier" ] && exit 0

timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
creation="$(printf '%s' "$payload" | jq -r '.tool_output.usage.cache_creation_input_tokens // 0')"
read="$(printf '%s' "$payload" | jq -r '.tool_output.usage.cache_read_input_tokens // 0')"

log_dir="${HOME}/.claude/logs"
mkdir -p "$log_dir"
printf '%s\t%s\t%s\t%s\n' "$timestamp" "$tier" "$creation" "$read" >> "$log_dir/cache-tier.log"

Wire it up in ~/.claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      { "matcher": "*", "hooks": [{ "type": "command", "command": "~/.claude/hooks/cache-tier-logger.sh" }] }
    ]
  }
}

After a day of normal use, inspect the log:

awk '{c[$2]++; w[$2]+=$3; r[$2]+=$4} END { for (t in c) printf "%s\tcalls=%d\twrite=%d\tread=%d\tratio=%.1f\n", t, c[t], w[t], r[t], (w[t]>0 ? r[t]/w[t] : 0) }' ~/.claude/logs/cache-tier.log

If you see ephemeral_5m dominating on main-session turns with a read-to-write ratio under 5 : 1, and you know your workflow includes multi-minute gaps, you are absorbing the 5-minute-TTL penalty. That is signal worth a change of workflow (see "Fix and workaround" above).
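To sanity-check the summary pipeline before trusting a day of real data, you can feed the same awk program a synthetic log. The field layout — timestamp, tier, write tokens, read tokens — matches what the hook above emits; the numbers are invented:

```shell
log="$(mktemp)"
printf '2026-04-20T10:00:00Z\tephemeral_5m\t200000\t400000\n' >> "$log"
printf '2026-04-20T10:07:00Z\tephemeral_5m\t200000\t200000\n' >> "$log"
printf '2026-04-20T11:00:00Z\tephemeral_1h\t200000\t4000000\n' >> "$log"

# Same summary program as above, run against the synthetic log.
out="$(awk '{c[$2]++; w[$2]+=$3; r[$2]+=$4}
            END { for (t in c) printf "%s\tcalls=%d\twrite=%d\tread=%d\tratio=%.1f\n",
                                      t, c[t], w[t], r[t], (w[t]>0 ? r[t]/w[t] : 0) }' "$log")"
echo "$out"
rm -f "$log"
```

You should see the 5-minute tier summarized with a 1.5 : 1 ratio and the 1-hour tier with 20.0 : 1 (row order is unspecified, since awk does not guarantee iteration order over associative arrays).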

The hook is MIT-licensed and pinned in the cc-safe-setup repository so the implementation stays current as new cache metadata fields are added.

Lessons

  1. A "cost-neutral on average" change is a meaningless statement to any individual user. Averages hide distribution. The user population that absorbs the regression is rarely the population Anthropic's internal dashboards center.
  2. When an OP retracts their own numbers, the retraction itself is evidence. The seanGSISG retraction shows that naive token-waste math assumes the wrong counterfactual. If your own postmortem of a cost incident uses "tokens × wrong-tier-price" as the waste number, you are making the same mistake.
  3. Telemetry is not a neutral setting. Disabling telemetry has a second-order effect on cache optimization delivery. If your organization disables telemetry for privacy reasons, document that you are also accepting slower cache tuning uptake and plan budget accordingly.
  4. Anthropic support channels do not provide incident reimbursement on Max. No quota refund was offered on this incident. If your organization depends on predictable quota, build in a 30 percent buffer for silent client-side regressions of this kind.

The other 9 incidents in the book

  1. Resume Attachment Relocation Bug (v2.1.69 to v2.1.89, "fixed" with residual drift)
  2. Native Sentinel Replacement cch=00000 (v2.1.36+, still unfixed — 10-20x cost multiplier on standalone Bun binary)
  3. Opus 4.7 silent mid-session downgrade to 1M context (#49541)
  4. Opus 4.7 tokenizer 1.35 to 1.46x token inflation (launch 2026-04-16)
  5. Opus 4.7 long-context retrieval regression (MRCR -46pt, 256k-8-needle -32.7pt)
  6. extractMemories Double Cache Chain (default ON, 2x API cost baseline)
  7. v2.1.105 MCP stdio non-JSON regression
  8. Weekly quota reset time bug (April 2026 cluster, independent 4-reporter window)
  9. /doctor dismissed plus settings.json corruption (#52648 pattern)

Plus four appendices: a one-page postmortem template for your own incidents, ten detection hooks packaged with copy-paste install, how to read GitHub Issues productively, and further reading (Mariański's binary disassembly series, Simon Willison's token counting work, The Register coverage, cc-safe-setup).

Claude Code Incident Postmortems

Target date 2026-05-05 · See product page for current price · Buyers receive revised PDF free if Anthropic ships fixes that materially change any incident

Gumroad store →
cc-safe-setup (detection hooks, MIT) · Token Book (everyday cost reduction)

Want next month's incidents, not last quarter's?

Postmortems is a one-time book of 10 fully-investigated incidents. CC Safety Lab is the monthly companion that tracks 4-8 new incidents, ships fix-it hooks, and updates the safety checklist every month. ¥500/month, Founder pricing locked.

See what's in the May 2026 issue →

If this preview was useful, star cc-safe-setup — the detection hook in this chapter ships from there, and the book's source archive is pinned to a repository release.