What does the Inference Cost Calculator tool output?

$ / 1M tokens; GPU hours / day; Recommended cluster shape

What inputs does the Inference Cost Calculator need?

Model size + variant; Throughput target (QPS); Hardware mix

Inference Cost Calculator

Per-million-token cost for self-hosted inference across H100 / H200 / B200 / MI300.

The engineer question
What does it cost to self-host a 70B model at 100k QPS?

Result

$ / 1M output tokens (decode only): $0.43
Tokens / sec / GPU (est., scaled from 1,800 at 70B): 1,800 tok/s
Aggregate token rate (100.0k req/s × 500 tok): 50.00M tok/s
GPU-hours / day: 666.7k
GPUs needed (steady state, no redundancy / headroom): 27,778
Cost / day (compute only): $1.87M
Recommended cluster shape (27,778 GPUs total): 3,473 × 8-GPU nodes
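
The sizing figures in this result can be reproduced with a few lines of arithmetic. The sketch below is an illustration only, not the calculator's actual code: the function name cluster_shape and its parameters are hypothetical, and it assumes the GPU count is rounded up to whole GPUs and then to whole 8-GPU nodes, which matches the numbers shown.

```python
import math

def cluster_shape(qps, tokens_per_request, tok_s_per_gpu, gpus_per_node=8):
    """Return (aggregate tok/s, GPUs needed, nodes needed) at steady state.

    Hypothetical helper: no redundancy or headroom is included.
    """
    aggregate_tok_s = qps * tokens_per_request           # requests/s x output tokens each
    gpus = math.ceil(aggregate_tok_s / tok_s_per_gpu)    # round up to whole GPUs
    nodes = math.ceil(gpus / gpus_per_node)              # round up to whole nodes
    return aggregate_tok_s, gpus, nodes

# Default inputs: 100k req/s, 500 output tokens per request, 1,800 tok/s/GPU.
print(cluster_shape(100_000, 500, 1_800))  # (50000000, 27778, 3473)
```

Because nodes are rounded up, 3,473 nodes hold 27,784 GPUs, slightly more than the 27,778 strictly required.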

Recommendation

~27,778 GPUs (3,473 nodes) is a serious dedicated cluster. Owned hardware or a multi-year reserved commit will beat on-demand $/hr by roughly 2–4×, so treat the $/1M-token figure as an on-demand upper bound. Add ~20% GPU headroom for traffic spikes and node failures.
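
As a rough illustration of what this recommendation implies in numbers, the sketch below applies the ~20% headroom and the 2–4× reserved/owned discount to the default result. The helper names are hypothetical, and it assumes the per-token cost scales linearly with $/GPU-hr, so a 2–4× cheaper hourly rate yields a 2–4× cheaper $/1M-token figure.

```python
import math

def with_headroom(gpus, headroom=0.20):
    """GPU count after adding fractional headroom, rounded up (hypothetical helper)."""
    return math.ceil(gpus * (1 + headroom))

def reserved_range(on_demand_usd, low_factor=2.0, high_factor=4.0):
    """(low, high) cost if reserved/owned pricing is 2-4x cheaper than on-demand."""
    return on_demand_usd / high_factor, on_demand_usd / low_factor

gpus = with_headroom(27_778)          # steady-state GPUs from the result above
nodes = math.ceil(gpus / 8)           # 8-GPU nodes
low, high = reserved_range(0.43)      # on-demand $ / 1M output tokens

print(gpus, nodes)                    # 33334 4167
print(f"${low:.2f} to ${high:.2f} per 1M output tokens")
```

These are the same first-order numbers with two adjustments applied; none of the excluded costs (networking, storage, power) are added back.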

Assumptions

· FIRST-ORDER ESTIMATE, not a benchmark. Real throughput depends on serving stack (vLLM/TRT-LLM/SGLang), quantization, sequence length, KV-cache pressure and batch composition. Treat every number as typically accurate only to within a factor of 2 in either direction.
· Throughput model: tok/s/GPU = ref_tok_s × (70B ÷ model_B). Linear inverse scaling with active params is a rough memory-bandwidth heuristic; very small (<7B) and very large (>200B, multi-GPU) models deviate.
· $/1M tokens = $/hr ÷ (tok/s/GPU × 3600) × 1e6. GPU-hours/day = ceil(aggregate_tok_s ÷ tok/s/GPU) × 24.
· NVIDIA H100 80GB (SXM): ~$2.80/GPU-hr on-demand cloud-equivalent (mid-2026 list pricing, approximate); decode anchor ~1,800 tok/s for a 70B dense model under continuous batching (~60% utilization folded in).
· Reserved / committed-use / owned-hardware TCO is typically 2–4× cheaper per GPU-hr than the on-demand rates used here — this calculator returns an on-demand upper bound.
· Source basis: vendor spec sheets (NVIDIA H100/H200/B200, AMD MI300X) for memory bandwidth, plus trade-press and public cloud GPU price surveys for $/hr. Numbers are typical, not vendor-audited, and drift quarter to quarter.
· EXCLUDED: prefill / input-token cost (only output/decode tokens are priced), networking (InfiniBand/RoCE), CPU head nodes, storage, load-balancer/router overhead, redundancy & autoscaling headroom, power & cooling, software licensing, and engineering time.
· MoE models: enter active params per token, not total params, or throughput will be badly underestimated.
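
The formulas in the assumptions above can be written out as a minimal Python sketch. The function names are hypothetical and this is not the calculator's source; it simply transcribes the stated throughput model and cost formulas, using the H100 anchor values ($2.80/GPU-hr, 1,800 tok/s at 70B) as defaults. The MoE parameter counts in the last lines are made-up numbers for illustration only.

```python
import math

def tok_s_per_gpu(model_b, ref_tok_s=1_800.0, ref_model_b=70.0):
    """tok/s/GPU = ref_tok_s x (70B / model_B); model_b is ACTIVE params in billions."""
    return ref_tok_s * (ref_model_b / model_b)

def usd_per_million_tokens(usd_per_gpu_hr, tok_s):
    """$/1M tokens = $/hr / (tok/s/GPU x 3600) x 1e6."""
    return usd_per_gpu_hr / (tok_s * 3600) * 1e6

def gpu_hours_per_day(aggregate_tok_s, tok_s):
    """GPU-hours/day = ceil(aggregate_tok_s / tok/s/GPU) x 24."""
    return math.ceil(aggregate_tok_s / tok_s) * 24

# Default inputs: 70B dense model on H100 at $2.80/GPU-hr, 50M tok/s aggregate.
rate = tok_s_per_gpu(70)                          # 1800.0 tok/s/GPU
unit_cost = usd_per_million_tokens(2.80, rate)    # about 0.43
hours = gpu_hours_per_day(50_000_000, rate)       # 666672
print(f"${unit_cost:.2f} / 1M tokens, {hours:,} GPU-hours/day, "
      f"${hours * 2.80:,.0f} / day")

# Hypothetical MoE: 400B total params, 40B active per token (illustrative numbers).
print(tok_s_per_gpu(40))    # active params: 3150.0 tok/s/GPU
print(tok_s_per_gpu(400))   # total params by mistake: roughly 10x lower
```

With the default inputs this reproduces the $0.43, 666.7k GPU-hours and $1.87M/day figures in the result. The last two lines show why entering total rather than active parameters for an MoE model understates throughput.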

