If you’re running RL or agentic post-training on a long-context model and your provider bills you per token with no cross-turn prefix cache, the prefill cost grows quadratically with the number of turns, because every turn re-prefills the full conversation history. On Fireworks, session-affinity routing keeps an episode pinned to one replica so the KV cache is reused across turns, and the cached portion is billed at a steep discount (or, on dedicated, contributes zero extra compute).

The calculator below makes that difference concrete. Set your episode shape (turns, context growth, generation length) and compare three Fireworks paths against Tinker’s published rates:
- Tinker — flat per-token billing, no cross-turn cache (re-prefill every turn)
- Fireworks Serverless — Standard and Priority tiers; cached prompt tokens billed at a separate (much lower) rate
- Fireworks Dedicated — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate
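
To make the quadratic growth concrete, here is a minimal sketch that tallies prompt tokens over one episode, with and without a cross-turn prefix cache. It assumes a simplified episode shape in which the prompt starts at `initial_prompt` tokens and grows by a fixed `added_per_turn` tokens after every turn. These names and the function itself are illustrative and are not taken from the calculator's source.

```python
def prefill_tokens(initial_prompt: int, added_per_turn: int, turns: int) -> dict:
    """Tally prompt tokens for one episode under a simplified growth model.

    Assumption: the context grows by exactly `added_per_turn` tokens per turn.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")

    total_prompt = 0  # prompt tokens summed over all turns (what a no-cache provider prefills)
    uncached = 0      # tokens that still need prefill when the prefix cache is reused
    context = initial_prompt
    for turn in range(turns):
        total_prompt += context
        # With a prefix cache, only tokens new since the previous turn are prefilled.
        uncached += initial_prompt if turn == 0 else added_per_turn
        context += added_per_turn

    cached = total_prompt - uncached
    return {
        "no_cache_prefill": total_prompt,
        "uncached": uncached,
        "cached": cached,
        "cache_hit": cached / total_prompt,
    }


# Hypothetical episode shapes: same per-turn sizes, double the turn count.
short_ep = prefill_tokens(initial_prompt=4_000, added_per_turn=4_000, turns=8)
long_ep = prefill_tokens(initial_prompt=4_000, added_per_turn=4_000, turns=16)
print(short_ep["no_cache_prefill"], long_ep["no_cache_prefill"])
print(short_ep["uncached"], long_ep["uncached"])
```

In this toy setup, doubling the turn count nearly quadruples the no-cache prefill total, while the uncached portion under a prefix cache only doubles.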
How the numbers come together
Tinker (the cost customers describe)
Each turn re-prefills the full accumulated context:

$$
\text{prefill tokens} = \sum_{t=1}^{N} \bigl(P + (t-1)\,\Delta\bigr) = N P + \Delta\,\frac{N(N-1)}{2}
$$

where $P$ is the initial prompt (system + tools + task), $\Delta$ is the context added per turn (model response + tool result), and $N$ is the turn count. This is quadratic in $N$.

Fireworks Serverless (prefix cache included)
Across one episode, each token is prefilled at most once. The remainder of the prompt is served from the prefix cache and billed at the cached input rate, so the prompt-side cost is the uncached tokens times the input rate plus the cached tokens times the cached input rate. The uncached portion grows linearly with the turn count instead of quadratically, and the cached portion is billed at a rate roughly 5–6× lower than the uncached input rate. The two effects compound: at 8 turns the cache hit is ~77%, at 20+ turns it exceeds 95%.

Fireworks Dedicated (GPU-hour billing)
Dedicated deployments are billed per GPU-second, so the prefix cache shows up as higher effective throughput rather than a discount on per-token rates. On a saturated cluster, the cost of an episode is the GPU time spent on uncached prefill and generation, multiplied by the GPU rate. Cached tokens contribute essentially nothing to wall-clock work, and the cluster’s effective $/M token rate falls as utilization rises. For continuous RL training, where rollouts run at a sustained pace, dedicated is typically the cheapest path at scale.

The calculator’s dedicated path uses saturated throughput estimates as defaults. A small, lightly loaded test deployment will look more expensive per token than these numbers because the cluster is paid for whether it’s busy or idle. Tune the throughput inputs in the Advanced panel to match your actual rollout pace.
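
The three billing paths described in this section can be compared with a short sketch like the one below. It is not the calculator's implementation: the closed-form token counts assume the prompt grows by a fixed amount each turn, the dedicated cost model (cluster-wide tokens per second for prefill and decode) is an assumed simplification, every rate and throughput in the usage example is a made-up placeholder rather than a published price, and all class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass

PER_M = 1_000_000  # per-token rates below are quoted per 1M tokens


@dataclass
class Episode:
    initial_prompt: int      # tokens in system + tools + task
    added_per_turn: int      # tokens appended per turn (response + tool result)
    generated_per_turn: int  # tokens sampled per turn
    turns: int

    @property
    def total_prompt(self) -> int:
        # Sum over turns of the full context: N*P + delta*N*(N-1)/2
        n = self.turns
        return n * self.initial_prompt + self.added_per_turn * n * (n - 1) // 2

    @property
    def uncached_prompt(self) -> int:
        # With a prefix cache, each prompt token is prefilled at most once.
        return self.initial_prompt + (self.turns - 1) * self.added_per_turn

    @property
    def generated(self) -> int:
        return self.turns * self.generated_per_turn


def flat_per_token_cost(ep: Episode, prefill_rate: float, sample_rate: float) -> float:
    """No cross-turn cache: every prompt token is billed at the prefill rate."""
    return (ep.total_prompt * prefill_rate + ep.generated * sample_rate) / PER_M


def cached_per_token_cost(ep: Episode, input_rate: float, cached_rate: float,
                          output_rate: float) -> float:
    """Prefix cache: uncached and cached prompt tokens are billed at separate rates."""
    cached = ep.total_prompt - ep.uncached_prompt
    return (ep.uncached_prompt * input_rate
            + cached * cached_rate
            + ep.generated * output_rate) / PER_M


def gpu_time_cost(ep: Episode, gpu_hour_rate: float, gpus: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Assumed saturated-cluster model: only uncached prefill and decode take time."""
    seconds = ep.uncached_prompt / prefill_tps + ep.generated / decode_tps
    return seconds / 3600 * gpu_hour_rate * gpus


# Placeholder inputs only; substitute real rates and measured throughput.
ep = Episode(initial_prompt=4_000, added_per_turn=4_000, generated_per_turn=500, turns=8)
flat = flat_per_token_cost(ep, prefill_rate=1.0, sample_rate=3.0)
cached = cached_per_token_cost(ep, input_rate=1.0, cached_rate=0.2, output_rate=3.0)
dedicated = gpu_time_cost(ep, gpu_hour_rate=5.0, gpus=8, prefill_tps=50_000, decode_tps=5_000)
print(f"flat: {flat:.4f}  cached: {cached:.4f}  dedicated: {dedicated:.4f}")
```

The dedicated figure is only meaningful when the cluster is kept busy; an idle cluster still accrues GPU time that this per-episode model does not capture.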
What’s covered
The calculator currently includes the four models for which Tinker publishes per-token rates:

| Model | Tinker prefill / sample (per 1M tokens) | Fireworks Serverless |
|---|---|---|
| Kimi K2.6 (128K) | 12.81 | Standard + Priority |
| Kimi K2.5 (128K) | 12.81 | Standard only |
| Qwen3.5-397B-A17B (256K) | 10.00 | — (dedicated only) |
| GPT-OSS-120B (128K) | 1.54 | — (dedicated only) |
Pricing for both sides is defined in snippets/multi-turn-cost-calculator.jsx; update it there if either side’s pricing changes.
Sources
- Tinker pricing: thinkingmachines.ai/tinker
- Fireworks serverless pricing: docs.fireworks.ai/serverless/pricing
- Fireworks GPU-hour pricing: fireworks.ai/pricing
- Related: RFT Cost Estimator — same idea, but for the training-side bill (Fireworks GPU-hour, no comparison column).