If you’re running RL or agentic post-training on a long-context model and your provider bills per token with no cross-turn prefix cache, prefill cost grows quadratically with the number of turns, because every turn re-prefills the full conversation history. On Fireworks, session-affinity routing keeps an episode pinned to one replica so the KV cache is reused across turns, and the cached portion is billed at a steep discount (or, on dedicated deployments, contributes zero extra compute).

The calculator below makes that difference concrete. Set your episode shape (turns, context growth, generation length) and compare three Fireworks paths against Tinker’s published rates:
  • Tinker — flat per-token billing, no cross-turn cache (re-prefill every turn)
  • Fireworks Serverless — Standard and Priority tiers; cached prompt tokens billed at a separate (much lower) rate
  • Fireworks Dedicated — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate

How the numbers come together

Tinker (the cost customers describe)

Each turn re-prefills the full accumulated context:

$$\text{Prefill tokens (Tinker)} = \sum_{t=1}^{T} P_t = T \cdot P_1 + \Delta \cdot \frac{T(T-1)}{2}$$

where $P_1$ is the initial prompt (system + tools + task), $\Delta$ is the context added per turn (model response + tool result), and $T$ is the turn count. This is quadratic in $T$.

$$\text{Cost (Tinker)} = \frac{\text{Prefill tokens}}{10^6} \cdot r_{\text{prefill}} + \frac{\text{Decode tokens}}{10^6} \cdot r_{\text{sample}}$$
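The Tinker-side formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names, the episode shape, and the 500-tokens-per-turn decode figure are assumptions chosen for the example, and the per-turn prompt is taken to be $P_t = P_1 + (t-1)\Delta$, which is what the closed form implies.

```python
def tinker_prefill_tokens(p1: int, delta: int, turns: int) -> int:
    """Total prefill tokens when every turn re-prefills the whole history.

    Closed form of sum_{t=1..T} P_t with P_t = p1 + (t - 1) * delta.
    """
    # turns * (turns - 1) is always even, so the integer division is exact.
    return turns * p1 + delta * (turns * (turns - 1) // 2)


def tinker_cost(prefill_tokens: int, decode_tokens: int,
                r_prefill: float, r_sample: float) -> float:
    """Episode cost; both rates are expressed per 1M tokens."""
    return prefill_tokens / 1e6 * r_prefill + decode_tokens / 1e6 * r_sample


# Illustrative episode shape (assumed values, not calculator defaults).
P1, DELTA, TURNS = 4_000, 2_000, 8
DECODE_TOKENS = 500 * TURNS  # assumes 500 generated tokens per turn

prefill = tinker_prefill_tokens(P1, DELTA, TURNS)

# Cross-check the closed form against the explicit per-turn sum.
assert prefill == sum(P1 + (t - 1) * DELTA for t in range(1, TURNS + 1))

# Rates taken from the Kimi rows of the model table on this page.
print(prefill, tinker_cost(prefill, DECODE_TOKENS, 5.15, 12.81))
```

Doubling `TURNS` while holding `P1` and `DELTA` fixed roughly quadruples the `DELTA` term, which is the quadratic growth described above.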

Fireworks Serverless — prefix cache included

Across one episode, each token is prefilled at most once. The remainder of the prompt is served from the prefix cache at the cached input rate:

$$\text{Uncached prompt} = P_T = P_1 + (T - 1) \Delta$$

$$\text{Cached prompt} = \sum_{t=1}^{T} P_t - P_T$$

$$\text{Cost} = \frac{P_T}{10^6} \cdot r_{\text{in}} + \frac{\text{Cached}}{10^6} \cdot r_{\text{cached}} + \frac{\text{Decode}}{10^6} \cdot r_{\text{out}}$$

The portion billed at the full input rate is linear in $T$ instead of quadratic, and the cached portion is billed at roughly 5–6× less than the uncached input rate. The two effects compound: at 8 turns the cache hit is ~77%, at 20+ turns it exceeds 95%.
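As a rough sketch of the serverless split above, the following Python computes the uncached and cached prompt tokens for one episode and prices them. The function names and episode shape are assumptions for illustration, and the three rates are placeholders, not Fireworks' actual prices; substitute the values from the current rate card.

```python
def serverless_split(p1: int, delta: int, turns: int) -> tuple[int, int]:
    """Return (uncached, cached) prompt tokens for one episode."""
    # Sum of P_t over all turns, with P_t = p1 + (t - 1) * delta.
    total = turns * p1 + delta * (turns * (turns - 1) // 2)
    # Each token is prefilled at most once, so only P_T is uncached.
    uncached = p1 + (turns - 1) * delta
    return uncached, total - uncached


def cache_hit_fraction(uncached: int, cached: int) -> float:
    """Share of all prompt tokens that are served from the prefix cache."""
    return cached / (uncached + cached)


def serverless_cost(uncached: int, cached: int, decode_tokens: int,
                    r_in: float, r_cached: float, r_out: float) -> float:
    """Episode cost; all three rates are expressed per 1M tokens."""
    return (uncached / 1e6 * r_in
            + cached / 1e6 * r_cached
            + decode_tokens / 1e6 * r_out)


# Illustrative episode shape (assumed values, not calculator defaults).
uncached, cached = serverless_split(p1=4_000, delta=2_000, turns=8)

# Placeholder rates for illustration only, NOT real Fireworks pricing.
R_IN, R_CACHED, R_OUT = 1.00, 0.20, 3.00

print(uncached, cached, cache_hit_fraction(uncached, cached))
print(serverless_cost(uncached, cached, 4_000, R_IN, R_CACHED, R_OUT))
```

The hit fraction depends on the ratio of `p1` to `delta` as well as on the turn count, so it is worth recomputing for your own episode shape.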

Fireworks Dedicated — GPU-hour billing

Dedicated deployments are billed per GPU-second, so the prefix cache shows up as higher effective throughput rather than a discount on per-token rates. On a saturated cluster:

$$\text{Cluster-hours} = \frac{\text{Uncached prompt} / \text{prefill TPS}}{3600}$$

$$\text{Cost} = \text{Cluster-hours} \cdot N_{\text{GPU}} \cdot r_{\text{GPU/hr}}$$

Cached tokens contribute essentially nothing to wall-clock work; the cluster’s effective per-million-token rate falls as utilization rises. For continuous RL training, where rollouts run at sustained pace, dedicated is typically the cheapest path at scale.
The calculator’s dedicated path uses saturated throughput estimates as defaults. A small, lightly-loaded test deployment will look more expensive per token than these numbers because the cluster is paid for whether it’s busy or idle. Tune the throughput inputs in the Advanced panel to match your actual rollout pace.
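The dedicated formulas in this section translate directly into code. The sketch below is a minimal Python version under stated assumptions: the function name is hypothetical, and the throughput, GPU count, and hourly rate are made-up round numbers for illustration, not measured or published figures.

```python
def dedicated_cost(uncached_prompt_tokens: float, prefill_tps: float,
                   n_gpus: int, gpu_hour_rate: float) -> float:
    """Cost of prefilling the uncached prompt on a saturated cluster.

    prefill_tps is the cluster-wide prefill throughput in tokens/second;
    gpu_hour_rate is the price of one GPU for one hour.
    """
    # Seconds of prefill work, converted to hours of cluster time.
    cluster_hours = (uncached_prompt_tokens / prefill_tps) / 3600
    return cluster_hours * n_gpus * gpu_hour_rate


# Hypothetical inputs: 36M uncached tokens at 10k tokens/s is one
# cluster-hour, billed across 8 GPUs at 5.00 per GPU-hour.
print(dedicated_cost(36_000_000, 10_000, n_gpus=8, gpu_hour_rate=5.0))
```

Because the cluster is paid for whether it is busy or idle, lowering `prefill_tps` to match a lightly loaded deployment raises the cost for the same token count.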

What’s covered

The calculator currently includes the four models for which Tinker publishes per-token rates:
| Model | Tinker prefill / sample (USD per 1M tokens) | Fireworks Serverless |
| --- | --- | --- |
| Kimi K2.6 (128K) | 5.15 / 12.81 | Standard + Priority |
| Kimi K2.5 (128K) | 5.15 / 12.81 | Standard only |
| Qwen3.5-397B-A17B (256K) | 4.00 / 10.00 | Not available (dedicated only) |
| GPT-OSS-120B (128K) | 0.63 / 1.54 | Not available (dedicated only) |

All Fireworks-side rates are taken from the public pages linked below and the constants live in snippets/multi-turn-cost-calculator.jsx — update there if either side’s pricing changes.

Sources

This is an estimator, not a quote. Real costs depend on your exact workload, cache hit rate, hardware utilization, and rate-card terms at run time.