
If an AI agent only answers one or two rounds of questions, token cost is usually easy to understand: how much the user sends in, how much the model returns, and the bill roughly follows. Research agents are different. In the commercial research-agent scenario at Atypica.AI, a task continuously accumulates user requirements, research plans, tool search results, interview notes, sub-agent conclusions, and interim reports. The further the task progresses, the more each model call needs to carry the long context already built up earlier. Before optimization, one of our reports usually consumed around 600,000 tokens.



The real source of runaway cost is not "how many tokens were generated this time," but rather:
The same stable long prefix is repeatedly sent back into the model.
Claude Prompt Cache is designed to solve exactly this problem: cache the stable, sufficiently long prompt prefix that will appear repeatedly later. When subsequent requests hit the same prefix, they no longer need to be billed as full input again.
An easily overlooked point is that Anthropic/Claude Prompt Cache does not "automatically cache conversation history." Engineering-wise, cache control must be manually added to request messages. Place it too early, and the cached prefix is too short to be valuable. Place it too often, and cache writes themselves increase cost.
For Claude requests, the code needs to write provider options onto the messages themselves, such as `anthropic.cacheControl` or Bedrock's `cachePoint`. Without this step, there is no controllable checkpoint. See https://platform.claude.com/docs/en/build-with-claude/prompt-caching
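As a concrete illustration, here is a minimal sketch of a raw Claude Messages API request with one explicit checkpoint. The `cache_control` field follows Anthropic's documented prompt-caching format; the model id, `LONG_STABLE_SYSTEM_PROMPT`, and `conversationSoFar` are placeholders, not Atypica.AI's actual values.

```ts
// Placeholders standing in for the real stable prefix and accumulated history.
const LONG_STABLE_SYSTEM_PROMPT = "You are a research agent..."; // stable across calls
const conversationSoFar: Array<{ role: "user" | "assistant"; content: string }> = [];

// One explicit cache checkpoint: the `cache_control` marker tells the provider
// that everything up to and including this block is a cacheable prefix.
// Without a marker like this, nothing gets cached.
const requestBody = {
  model: "claude-sonnet-4-20250514", // illustrative model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_STABLE_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: conversationSoFar,
};
```

The same marker can also be attached to content blocks inside user or assistant messages, which is what makes "checkpoint on a stable message boundary" possible later in this post.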

Checkpoints should follow token length, not message count: a single tool output can be longer than a dozen chat messages, so message count is not a reliable proxy. At the same time, checkpoints can only sit on stable message boundaries. An online request cannot be split at an arbitrary token position, so cache options can only be attached to reproducible boundaries such as system, user, and assistant messages.
Our conclusion is: do not place checkpoints by "message number." Place them by cumulative input token length. The strategy should first be simulated offline using historical sessions, then verified online with cache metadata. Use real ChatMessage and ChatStatistics records from the DB for offline simulation, then use provider metadata for online verification. Without real cache read/write statistics, cache optimization easily remains stuck at "it should save money in theory." Offline verification also lets us estimate token consumption without repeatedly running the dynamic workflow online, making it much easier to experiment and reach a practical conclusion.
A caching strategy should not go online just because it "looks reasonable." We needed to answer a more specific question:
If this checkpoint strategy is applied to real historical Study tasks, can it actually read long prefixes that are reused later?
To answer this, we built an offline simulator. This process mattered more than the final thresholds themselves, because it turned caching strategy from "experience-based judgment" into "a reproducible cost experiment."

The simulator's key inputs were not summaries returned by an API, but two kinds of real records from the database:
- ChatMessage: reconstructs which historical messages the agent could see before each model call;
- ChatStatistics: provides real token-statistics events and the times when those events occurred.
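For reference, the simulator only needs a thin slice of each record. The shapes below are illustrative, not the actual schema:

```ts
// Illustrative record shapes (not the real schema): the minimum the offline
// simulator needs from each table.
interface ChatMessageRecord {
  chatId: string;
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  createdAt: Date; // lets us reconstruct what was visible before any model call
}

interface ChatStatisticsRecord {
  chatId: string;
  kind: "tokens"; // one real token event per model call
  inputTokens: number;
  outputTokens: number;
  createdAt: Date; // time anchor used to align the event with ChatMessage history
}
```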
The overall process was:
- pull UserChat.kind = 'study' tasks from production data;
- treat each ChatStatistics (tokens) row as one real token event;
- reconstruct the visible ChatMessage history based on the event timestamp;
- replay each candidate checkpoint strategy against that reconstructed context and score it (a sketch of the replay loop follows the strategy table below).

| Strategy | Example | Purpose |
|---|---|---|
| Current index strategy | current-message-index | Baseline for the first implementation |
| Fixed token thresholds | 1K / 4K / 16K / 64K, 4K / 12K / 24K / 48K, etc. | Test whether cumulative-length checkpoints save more |
| Long-prefix quantiles | 25% / 50% / 75% / 90% | See whether "relative position" beats fixed thresholds |
| Recent assistant boundaries | last-4-assistant-boundaries | Test whether boundaries close to the current call are easier to hit |
| Tool phase boundaries | planning, interview, discussion, sub-agent, report | Test whether semantic phases can serve as cache boundaries |
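In outline, the replay loop looked like the sketch below, reusing the illustrative ChatMessageRecord / ChatStatisticsRecord shapes from above. Here `countTokens` stands in for whatever token estimator is used, and a Strategy is just a function that decides checkpoint positions from the visible history. This is a simplified model of the real simulator, not its actual code.

```ts
// A strategy maps the visible history (plus per-message token counts) to the
// indices of messages that would carry a cache checkpoint.
type Strategy = (visible: ChatMessageRecord[], perMessageTokens: number[]) => number[];

function replaySession(
  messages: ChatMessageRecord[],
  events: ChatStatisticsRecord[],
  strategy: Strategy,
  countTokens: (text: string) => number,
) {
  let cacheRead = 0;
  let cacheWrite = 0;
  let uncached = 0;
  let cachedPrefixTokens = 0; // longest prefix already written to cache so far

  const ordered = [...events].sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime());
  for (const event of ordered) {
    // Reconstruct what the model could see at this token event.
    const visible = messages.filter((m) => m.createdAt.getTime() <= event.createdAt.getTime());
    const perMessage = visible.map((m) => countTokens(m.content));
    const totalInput = perMessage.reduce((a, b) => a + b, 0);

    // Where would this strategy have placed its last checkpoint?
    const checkpoints = strategy(visible, perMessage);
    const last = checkpoints.length ? Math.max(...checkpoints) : -1;
    const prefixTokens = last >= 0 ? perMessage.slice(0, last + 1).reduce((a, b) => a + b, 0) : 0;

    // Prefix already cached counts as a read, newly cached prefix as a write,
    // and everything else is billed as ordinary input.
    const read = Math.min(cachedPrefixTokens, prefixTokens);
    const write = Math.max(0, prefixTokens - cachedPrefixTokens);
    cacheRead += read;
    cacheWrite += write;
    uncached += totalInput - read - write;
    cachedPrefixTokens = Math.max(cachedPrefixTokens, prefixTokens);
  }

  return { cacheRead, cacheWrite, uncached };
}
```

The read rates and estimated savings in the tables below then come from feeding these three totals into the billing model described next.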
Prompt Cache's benefit comes from reads, but its cost comes from writes. If a checkpoint is written and almost no later request reads it, that write is pure extra cost. So we compared strategies with a model approximating Bedrock Claude's cache billing shape:

estimated billed input = uncached input tokens + floor(cache read tokens / 10) + floor(cache write tokens * 1.25)
This model intentionally does two things at the same time: it rewards cache reads by billing them at roughly one-tenth of the base input price, and it penalizes cache writes by charging a 25% premium on them.
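In code, the scoring function is essentially the formula above; the 10x read discount and 1.25x write premium approximate Claude's published cache pricing, and the exact multipliers vary by model and platform.

```ts
// Approximate cache billing: reads cost ~10% of the base input price, writes
// carry a ~25% premium, and everything else is billed as normal input.
function estimateBilledInput(uncached: number, cacheRead: number, cacheWrite: number): number {
  return uncached + Math.floor(cacheRead / 10) + Math.floor(cacheWrite * 1.25);
}

// Savings relative to re-sending the full input uncached on every call.
function estimateSavings(uncached: number, cacheRead: number, cacheWrite: number): number {
  const withoutCache = uncached + cacheRead + cacheWrite;
  return withoutCache === 0 ? 0 : 1 - estimateBilledInput(uncached, cacheRead, cacheWrite) / withoutCache;
}
```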
After multiple rounds of experiments, the final production implementation uses:
const thresholds = [1024, 4096, 16384, 65536];
That is 1K / 4K / 16K / 64K. This was not chosen by intuition. It converged through one full round of research:
- reconstruct what the model could see from historical ChatMessage records;
- use ChatStatistics token events as time anchors;
- score every candidate strategy under the billing model above;
- observe that 4K / 12K / 24K / 48K scored highest in the base run.
But conservative simulation and production explainability favored 1K / 4K / 16K / 64K as the default strategy. This distinction matters: we were not chasing the single highest score in one offline table, but choosing a more robust default across savings, stability, and engineering observability.
At first, we used the easiest strategy to think of: a message-index heuristic.
- Place a checkpoint after message 4
- Place a checkpoint after message 8
- Place a checkpoint after message 16
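In code, that baseline amounted to little more than a lookup against fixed positions (a sketch, using the indices from the list above):

```ts
// Naive baseline: checkpoint purely by message position, regardless of how
// many tokens those messages actually contain.
const CHECKPOINT_INDICES = [4, 8, 16];

function isIndexCheckpoint(messageIndex: number): boolean {
  return CHECKPOINT_INDICES.includes(messageIndex);
}
```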
The problem with this strategy is that it assumes "message count" roughly correlates with "context length." In research agents, this assumption often fails: a single tool result, interview transcript, or sub-agent report can be longer than a dozen ordinary messages, so the same message index can correspond to wildly different prefix lengths. The fix is to stop caring which message number we are on and care only about whether cumulative input tokens have crossed key thresholds.
The strategy itself is not complicated.

The current production implementation uses 1K / 4K / 16K / 64K:
const thresholds = [1024, 4096, 16384, 65536];
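Below is a simplified sketch of the placement rule, using the `thresholds` constant above. `estimateTokens` stands in for the real token estimator, and the production code handles provider-specific details this sketch omits.

```ts
// Walk the history in order, accumulate estimated input tokens, and mark the
// first stable message boundary after each threshold is crossed.
function placeCheckpoints(
  messages: Array<{ role: string; content: string }>,
  estimateTokens: (text: string) => number,
): number[] {
  const checkpoints: number[] = [];
  let cumulative = 0;
  let crossed = 0; // thresholds already handled

  messages.forEach((message, index) => {
    cumulative += estimateTokens(message.content);
    if (crossed < thresholds.length && cumulative >= thresholds[crossed]) {
      // The first message boundary at or past this threshold gets the checkpoint;
      // thresholds crossed inside the same message collapse into one boundary.
      checkpoints.push(index);
      while (crossed < thresholds.length && cumulative >= thresholds[crossed]) {
        crossed++;
      }
    }
  });

  return checkpoints; // at most one entry per threshold
}
```

The returned indices are the only places where cacheControl / cachePoint options get attached before the request is sent; everything after the last checkpoint is billed as ordinary input.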
The meaning of these thresholds is straightforward: place a cache checkpoint at the first stable message boundary after cumulative input tokens cross 1K, 4K, 16K, and 64K respectively.
The first base run used 20 high-token Study sessions. 4K / 12K / 24K / 48K was the strongest static candidate:

| Strategy | Read rate | Est. savings | Cache read | Cache write | Est. billed input |
|---|---|---|---|---|---|
| token-thresholds-4k-12k-24k-48k | 98.6% | 86.9% | 52.6M | 747.0K | 7.0M |
| long-prefix-quartiles | 98.6% | 86.8% | 52.6M | 851.0K | 7.1M |
| last-4-assistant-boundaries | 98.7% | 86.6% | 52.7M | 974.6K | 7.2M |
| tool-phase-boundaries | 83.4% | 73.6% | 44.5M | 610.0K | 14.1M |
| current-message-index | 3.2% | 2.6% | 1.7M | 121.0K | 52.0M |
This result shows that the first index-based strategy was not merely slightly behind. It basically failed to capture the truly long prefixes that Study actually reuses. But the base run was still optimistic, because when some token stats were recorded, the assistant output may already have been written into the database. So we ran the conservative run: excluding non-chat-history reuse events and dropping the current assistant output.
| Strategy | Read rate | Est. savings | Cache read | Cache write | Est. billed input |
|---|---|---|---|---|---|
| token-thresholds-1k-4k-16k-64k | 87.1% | 72.9% | 4.8M | 240.4K | 1.5M |
| token-thresholds-1k-2k-4k-8k | 87.1% | 72.9% | 4.8M | 240.4K | 1.5M |
| long-prefix-quartiles | 86.9% | 72.8% | 4.8M | 240.4K | 1.5M |
| token-thresholds-4k-12k-24k-48k | 58.8% | 48.5% | 3.2M | 218.6K | 2.9M |
| current-message-index | 0.4% | 0.0% | 21.7K | 21.7K | 5.5M |
This is why we ultimately chose 1K / 4K / 16K / 64K: 4K / 12K / 24K / 48K was highest in the base run, but 1K / 4K / 16K / 64K was more robust in the conservative run, and it is easier to explain and monitor. A production default should prioritize robustness and observability, not only the highest score in one offline sample.
The most important point is not to treat 72.9% or 86.9% as a promise of online savings, but to understand the strategy difference:
Message-index checkpoints often fail to cache the prefixes that are truly long and truly reused; token-threshold checkpoints are more likely to hit the cost concentration zones of research agents.
Offline simulation can only show that "this strategy is more reasonable on historical context." It cannot prove that real requests will hit cache. Before launch, two additional things need to be verified:
- whether the cache options we attach actually reach the provider and produce real cache writes and reads;
- whether those reads and writes land in our own statistics ledger.
We ultimately need to see these fields in the statistics pipeline:
`cacheReadInputTokens`, `cacheWriteInputTokens`
This step is critical. In engineering work, three states often look similar but are completely different: "we set the cache options," "the provider actually read the cache," and "the savings show up in our own statistics." Only when the provider's cache metadata lands in ChatStatistics.extra.cache can we say that cache savings have entered the observable ledger.
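The normalization step itself is small. The raw Anthropic response field names below are real; the internal ledger shape is a simplified stand-in for what actually gets written under ChatStatistics.extra.cache.

```ts
// Map the provider's raw cache usage fields onto the internal ledger fields
// that the statistics pipeline reads.
interface ProviderCacheUsage {
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface CacheLedgerEntry {
  cacheWriteInputTokens: number;
  cacheReadInputTokens: number;
}

function toCacheLedgerEntry(usage: ProviderCacheUsage): CacheLedgerEntry {
  return {
    cacheWriteInputTokens: usage.cache_creation_input_tokens ?? 0,
    cacheReadInputTokens: usage.cache_read_input_tokens ?? 0,
  };
}
```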
In a smoke test through the PPIO/Anthropic path, when the same long prefix was requested consecutively, the first request returned `cache_creation_input_tokens=10401, cache_read_input_tokens=0`, and the second returned `cache_creation_input_tokens=0, cache_read_input_tokens=10401`.
The corresponding `cacheReadInputTokens` and `cacheWriteInputTokens` then land in the statistics table, proving that this pipeline not only works at the request layer but also enters Atypica.AI's own token ledger.

First, caching strategy should be designed around prefix reuse. For long-context agents, the main cost is often repeated input of stable prefixes, not one-off output.
Second, Anthropic caching requires manual checkpoints. Do not think of Prompt Cache as provider-side automatic optimization. The system must explicitly choose boundaries and write `cacheControl` / `cachePoint` / `cache_control` into requests.
Third, cache writes have a cost. Writing more does not necessarily save more. The write premium can only be amortized when later requests truly read a sufficiently long stable prefix.
Fourth, message count is not a reliable metric. Tool calls, report drafts, and sub-agent results can all make the context suddenly much longer. The message number does not tell you whether the prefix is worth caching.
Fifth, fixed token thresholds are a strong default. They are simple enough, easy to explain, easy to monitor, and do not depend on complex semantic judgments about workflow stages.
Sixth, offline simulation must use real messages and real token events. Looking only at API summaries or current frontend statistics can easily understate or overstate historical calls.
Seventh, online verification must land in the cache read/write ledger. Without provider metadata, it is hard for a cache strategy to move from "theoretical optimization" to "sustainable optimization."
For research agents, Prompt Cache is not a simple switch. It is a cost strategy designed around how long context grows. The core lesson from this Atypica.AI experiment can be summarized in one sentence:
Manually place checkpoints on stable message boundaries after cumulative tokens cross key thresholds, then verify the strategy with real historical data and online cache metadata.
This method does not try to find a permanently optimal set of thresholds in one shot. Its more important value is establishing an iterative loop: choose a strategy from historical tasks, verify savings with real provider metadata, then continue calibrating thresholds with new online data. In the end, cost optimization for long-context agents is not about one magic parameter. It is about connecting context structure, provider billing models, and internal statistics ledgers into one reliable chain.