Cached Workload Pricing
(Pricing data as of December 2025)
Caching discounts matter most when input dominates the bill.
This page shows total cost per 1M output tokens across three workload shapes, Retrieval (8:1), Balanced (1:1), and Reasoning (1:8), at 0%, 25%, and 50% input caching. Caching applies to input tokens only; output tokens are never cached.
Frontier Models (Frontier High)
| Model | Standard Input ($/1M) | Standard Output ($/1M) | Cached Input ($/1M) | 0% (8:1) | 25% (8:1) | 50% (8:1) | 0% (1:1) | 25% (1:1) | 50% (1:1) | 0% (1:8) | 25% (1:8) | 50% (1:8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | $0.175 | $28.00 | $24.85 | $21.70 | $15.75 | $15.36 | $14.96 | $14.22 | $14.17 | $14.12 |
| Gemini 3 Pro | $2.00 | $12.00 | $0.200 | $28.00 | $24.40 | $20.80 | $14.00 | $13.55 | $13.10 | $12.25 | $12.19 | $12.14 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $0.300 | $39.00 | $33.60 | $28.20 | $18.00 | $17.32 | $16.65 | $15.38 | $15.29 | $15.21 |
| DeepSeek-R1 | $3.00 | $7.00 | $0.300 | $31.00 | $25.60 | $20.20 | $10.00 | $9.32 | $8.65 | $7.38 | $7.29 | $7.21 |
| Llama 3.1 405B | $3.50 | $3.50 | $0.350 | $31.50 | $25.20 | $18.90 | $7.00 | $6.21 | $5.42 | $3.94 | $3.84 | $3.74 |
- 8:1 = 8M input + 1M output
- 1:1 = 1M input + 1M output
- 1:8 = 125K input + 1M output
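The table values can be reproduced with a short calculation. A minimal sketch, using the prices and token counts from the table above (the function name is illustrative):

```python
def workload_cost(input_tokens, output_tokens, p_in, p_out, p_cache, cache_share):
    """Total dollar cost for a workload with a fraction of input tokens cached.

    Prices are $ per 1M tokens; cache_share is the fraction of input tokens
    billed at the cached-input rate instead of the standard rate.
    """
    input_cost = ((1 - cache_share) * input_tokens * p_in
                  + cache_share * input_tokens * p_cache) / 1_000_000
    output_cost = output_tokens * p_out / 1_000_000
    return input_cost + output_cost

# GPT-5.2, Retrieval (8:1): 8M input + 1M output
print(round(workload_cost(8_000_000, 1_000_000, 1.75, 14.00, 0.175, 0.00), 2))  # 28.0
print(round(workload_cost(8_000_000, 1_000_000, 1.75, 14.00, 0.175, 0.25), 2))  # 24.85
print(round(workload_cost(8_000_000, 1_000_000, 1.75, 14.00, 0.175, 0.50), 2))  # 21.7
```

The same function with `input_tokens=125_000` reproduces the Reasoning (1:8) column.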
Takeaway 1: Cached Input Impact Is Driven by Workload Shape
Cached input has the largest impact in input-heavy workloads and the smallest impact in output-heavy workloads.
What To Notice:
- The largest cost reductions appear in Retrieval (8:1) workloads, where input volume dominates total spend.
- The smallest cost changes appear in Reasoning (1:8) workloads, where output tokens account for nearly all cost.
Retrieval (8:1): total cost as a percent of the baseline (0% cached input).
| Model | 0% (8:1) | 25% (8:1) | 50% (8:1) |
|---|---|---|---|
| GPT-5.2 | 100% | 89% | 78% |
| Gemini 3 Pro | 100% | 87% | 74% |
| Claude Sonnet 4.5 | 100% | 86% | 72% |
| DeepSeek-R1 | 100% | 83% | 65% |
| Llama 3.1 405B | 100% | 80% | 60% |
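The percentages above are each scenario's total cost divided by its 0%-cache baseline. A quick check, using the Retrieval (8:1) totals from the Frontier table (rounding to whole percents is illustrative):

```python
# Retrieval (8:1) totals from the Frontier table: (0%, 25%, 50% cache), $ per 1M output.
retrieval_costs = {
    "GPT-5.2":           (28.00, 24.85, 21.70),
    "Gemini 3 Pro":      (28.00, 24.40, 20.80),
    "Claude Sonnet 4.5": (39.00, 33.60, 28.20),
    "DeepSeek-R1":       (31.00, 25.60, 20.20),
    "Llama 3.1 405B":    (31.50, 25.20, 18.90),
}

for model, (base, c25, c50) in retrieval_costs.items():
    pct25 = round(100 * c25 / base)  # percent of baseline at 25% caching
    pct50 = round(100 * c50 / base)  # percent of baseline at 50% caching
    print(f"{model}: 100% / {pct25}% / {pct50}%")
```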
Takeaway 2: Cached Input Sensitivity Is Driven by Token Pricing Structure
Under the same workload shape, models exhibit different sensitivity to cached input based on how total cost is split between input and output tokens.
What To Notice:
- Models with a larger share of total cost coming from input tokens see larger percentage reductions as cached input increases.
- Models where output pricing dominates total cost remain relatively insensitive to cached input, even under identical workload mixes.
- This explains why cost reductions vary meaningfully across models within the same workload regime.
Model input, cached input, and output token pricing.
| Model | Standard Input ($/1M) | Standard Output ($/1M) | Cached Input ($/1M) |
|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | $0.175 |
| Gemini 3 Pro | $2.00 | $12.00 | $0.200 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $0.300 |
| DeepSeek-R1 | $3.00 | $7.00 | $0.300 |
| Llama 3.1 405B | $3.50 | $3.50 | $0.350 |
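One way to quantify this sensitivity is the input share of the uncached bill at a fixed token mix. A rough sketch at the 1:1 mix, using prices from the table (the share metric is this note's framing, not a value from the tables):

```python
# (standard input $/1M, standard output $/1M) from the pricing table above
prices = {
    "GPT-5.2":        (1.75, 14.00),
    "Llama 3.1 405B": (3.50, 3.50),
}

for model, (p_in, p_out) in prices.items():
    # At 1:1 (1M in, 1M out), input's share of the uncached bill is p_in / (p_in + p_out).
    share = p_in / (p_in + p_out)
    print(f"{model}: input is {share:.0%} of the 1:1 bill")
```

This matches the Balanced column: Llama 3.1 405B, where input is half the 1:1 bill, sheds roughly 23% of cost at 50% caching ($7.00 to $5.42), while GPT-5.2, where input is about 11%, sheds only about 5% ($15.75 to $14.96).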
Methodology and Assumptions
1) Pricing data: Input and output token prices are taken from publicly available provider pricing pages as of December 2025. All values reflect list pricing only; enterprise and volume discounts are not included.
2) Derived costs: Derived costs show the total dollar cost to produce 1 million output tokens under fixed input-to-output ratios (Retrieval 8:1, Balanced 1:1, Reasoning 1:8), with varying input cache shares (0%, 25%, 50%). Costs are calculated using: InputCost = ((1−c)*I*Pin + c*I*Pcache) / 1,000,000 and OutputCost = (O*Pout) / 1,000,000, where c is the cache share, I is input tokens, O is output tokens (1M), Pin is input price, Pout is output price, and Pcache is cached input price.
3) Input-to-output ratios: The ratios shown (8:1, 1:1, 1:8) are used to hold workload shape constant. They are analytical lenses, not statements about how real workloads behave in practice.
4) Normalization: All costs are normalized to 1 million output tokens so pricing behavior can be compared consistently across models with different input and output price structures.
5) Caching: Caching applies only to input tokens. The cache share (c) represents the fraction of input tokens that are cached and priced at Pcache instead of Pin. Output tokens are never cached.
6) Scope: This analysis is limited to pricing mechanics. It does not account for model quality, reasoning depth, latency, throughput, safety, or system-level costs.
7) Data sourcing: Some prices (including DeepSeek-R1 and Llama 3.1 405B Instruct Turbo) are sourced from Together.ai.
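The cache-share definition in (5) can equivalently be folded into a single blended input price. A small sketch (the function name is illustrative; note that every model in the tables above prices cached input at 10% of standard input):

```python
def blended_input_price(p_in, p_cache, cache_share):
    """Effective $/1M input price when cache_share of input tokens are cached."""
    return (1 - cache_share) * p_in + cache_share * p_cache

# GPT-5.2 at 50% caching: effective input price per 1M tokens
print(round(blended_input_price(1.75, 0.175, 0.50), 4))  # 0.9625
```

Multiplying this blended price by total input tokens (in millions) and adding output cost gives the same totals as the InputCost/OutputCost formulas in (2).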