How fast is the DGX Spark, really? Prefill vs. decode, and the 273 GB/s wall
Short answer: the DGX Spark reads your prompt fast and writes its reply slow. Processing the prompt (prefill) is one of the quickest things it does; generating the answer token by token for a single user (decode) is one of the slowest — about 3 tokens/sec on a dense 70B model. Nothing is broken. Decode is capped by memory bandwidth (273 GB/s) — that's physics, not a config bug. Whether the Spark is the right box for you depends on the shape of your workload, not on any single tokens/sec figure.
Two phases, opposite bottlenecks
Every LLM request runs in two phases, and each is limited by something different:
- Prefill: reading your prompt. All prompt tokens are processed in parallel, so it's compute-bound, and compute is the Spark's strength: its FP4 tensor cores are rated at roughly 1 PFLOPS. The Spark prefills about 4× faster than a Mac Studio M3 Ultra.
- Decode: writing the answer, one token at a time. Each new token requires reading the model's active weights from memory once, so it's memory-bandwidth-bound, and 273 GB/s is the Spark's weak spot.
That makes the single-stream decode ceiling simple arithmetic:
max tokens/sec ≈ memory bandwidth ÷ bytes read per token
A 70B model in FP8 is about 70 GB of weights, so 273 ÷ 70 ≈ 3.9 tok/s — a hard ceiling, ~2.7–3 in practice. That's the number frustrated owners keep posting. It isn't a misconfiguration; it's the bandwidth.
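The ceiling formula is easy to check in a few lines of Python. This is a minimal sketch, not a benchmark: the function name `decode_ceiling_tok_s` is made up for illustration, and it assumes each token reads the listed weight bytes exactly once, ignoring KV-cache reads and any other overhead.

```python
SPARK_BANDWIDTH_GB_S = 273.0  # memory bandwidth figure quoted in this article


def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float = SPARK_BANDWIDTH_GB_S) -> float:
    """Upper bound on single-stream decode speed: bandwidth / bytes read per token."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Approximate GB read per token for the dense models discussed in this article.
    for name, gb in [("70B FP8", 70.0), ("70B Q4", 40.0), ("32B FP8", 32.0)]:
        print(f"{name}: {decode_ceiling_tok_s(gb):.1f} tok/s ceiling")
```

Real numbers land below the ceiling (the ~2.7–3 tok/s reported for a dense 70B in FP8), because no system sustains its full rated bandwidth.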
| Model (single user, batch 1) | Bytes read / token | Decode ceiling | Notes |
|---|---|---|---|
| Llama 3.3 70B, FP8 (dense) | ~70 GB | ~3.9 tok/s | the screenshot everyone shares |
| Llama 70B, Q4 (dense) | ~40 GB | ~6.8 tok/s | harder quant, fewer bytes |
| Qwen 32B, FP8 (dense) | ~32 GB | ~8.5 tok/s | smaller dense model |
| gpt-oss-120b, MXFP4 (MoE, ~5B active) | ~3–4 GB | tens of tok/s | only the active experts are read |
The MoE row is the insight worth keeping: what decode reads each token is the active parameters, not the total. A 120B mixture-of-experts model that activates ~5B parameters per token decodes far faster than a dense 70B, because it touches a fraction of the weights at each step. "Bigger" can be faster.
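To see why active parameters matter more than total parameters, here is a rough sketch that computes both footprints for a 120B MoE. The helper names are hypothetical, and the flat 4 bits per weight is an assumption for illustration: real MXFP4 checkpoints carry scale factors and may keep some layers in higher precision, which is presumably why the table estimates ~3–4 GB rather than the 2.5 GB this simple model gives.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8.0


def ceiling_tok_s(read_gb: float, bandwidth_gb_s: float = 273.0) -> float:
    """Decode ceiling from bytes read per token (273 GB/s is this article's figure)."""
    return bandwidth_gb_s / read_gb


# ASSUMED illustration values: 120B total, ~5B active, a flat 4 bits per weight.
total_gb = weights_gb(120.0, 4.0)   # what has to fit in memory
active_gb = weights_gb(5.0, 4.0)    # what one decode step actually reads

if __name__ == "__main__":
    print(f"resident weights: {total_gb:.0f} GB")
    print(f"read per token:   {active_gb:.1f} GB")
    print(f"ceiling if every weight were read: {ceiling_tok_s(total_gb):.1f} tok/s")
    print(f"ceiling reading only active weights: {ceiling_tok_s(active_gb):.1f} tok/s")
```

The total size decides whether the model fits in memory; the active size decides how fast it decodes.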
Which workloads suit the Spark?
It comes down to two things: your input-to-output ratio, and how many requests run at once.
| Workload shape | Example workloads | Bottleneck | Spark verdict |
|---|---|---|---|
| Long input → short output | RAG, summarization, extraction, long-context Q&A, code review over big files | Compute (prefill) | Good: bandwidth barely matters |
| Short input → long output, single user | Interactive chat, long-form writing, single-stream agents | Memory bandwidth (decode) | Worst case: the 3 tok/s scenario |
| High concurrency / batched | Serving several users, offline batch jobs | Amortizes toward compute | Good: throughput climbs with batch size |
The Spark is a far stronger multi-stream server than its single-user number suggests: one weight read serves the whole batch, so aggregate throughput scales even though each individual stream stays slow.
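A simple roofline-style model shows the batching effect. Treat it as a sketch under loud assumptions: `EFFECTIVE_FLOPS` is a placeholder value chosen for illustration, not a measured Spark number, the function names are hypothetical, and KV-cache reads (which grow with batch size and context length) are ignored.

```python
BANDWIDTH_BYTES_S = 273e9   # memory bandwidth quoted in this article
EFFECTIVE_FLOPS = 100e12    # ASSUMPTION: placeholder sustained compute, not a measured Spark figure


def step_time_s(batch: int, weight_bytes: float, active_params: float) -> float:
    """Time for one decode step that emits one token for every stream in the batch."""
    memory_s = weight_bytes / BANDWIDTH_BYTES_S                # weights are read once per step
    compute_s = batch * 2.0 * active_params / EFFECTIVE_FLOPS  # ~2 FLOPs per parameter per token
    return max(memory_s, compute_s)                            # the slower resource sets the pace


def aggregate_tok_s(batch: int, weight_bytes: float = 70e9, active_params: float = 70e9) -> float:
    """Total tokens/sec across all streams (defaults: dense 70B in FP8)."""
    return batch / step_time_s(batch, weight_bytes, active_params)


if __name__ == "__main__":
    for batch in (1, 8, 32):
        total = aggregate_tok_s(batch)
        print(f"batch {batch:>2}: {total:6.1f} tok/s total, {total / batch:.1f} tok/s per stream")
```

In this model the per-stream rate stays pinned at the bandwidth ceiling while the total grows with batch size, until the compute term overtakes the memory term. Real scaling is less clean, but the shape is the same.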
One caveat people skip: prefill is fast next to a Mac, but it isn't free. Feed a 100k-token prompt to a model with many active parameters per token (a dense 70B, say) and prefill alone can take minutes. Fast prefill is a relative advantage, not an absolute one.
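A back-of-the-envelope estimate makes the caveat concrete. This sketch uses the common approximation of about 2 FLOPs per active parameter per token, and the sustained compute rate passed in is an assumed placeholder, not a Spark measurement. It also ignores attention cost, which grows with the square of the prompt length, so real prefill on very long prompts will be slower than this.

```python
def prefill_seconds(prompt_tokens: int, active_params: float, effective_flops: float) -> float:
    """Rough prefill time: tokens x ~2 FLOPs per active parameter / sustained FLOPS."""
    if effective_flops <= 0:
        raise ValueError("effective_flops must be positive")
    return prompt_tokens * 2.0 * active_params / effective_flops


if __name__ == "__main__":
    ASSUMED_FLOPS = 100e12  # ASSUMPTION: illustrative sustained rate, not a measured figure
    for name, active in [("dense 70B", 70e9), ("MoE, ~5B active", 5e9)]:
        secs = prefill_seconds(100_000, active, ASSUMED_FLOPS)
        print(f"{name}: ~{secs:.0f} s to prefill a 100k-token prompt")
```

Under that assumed rate, the dense 70B needs minutes for the prompt while the low-activation MoE needs seconds, which is why active parameter count matters for prefill too.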
Three ways to speed up decode
If single-stream decode is too slow, here are the levers, strongest first:
- Switch to an MoE model. Fewer active parameters per token means fewer bytes read, which means faster decode. The single biggest lever.
- Quantize harder. Q4 reads about 40% fewer bytes than FP8 (roughly 40 GB vs. 70 GB on a 70B model); INT4/AWQ less still. You trade quality for bytes.
- Add speculative decoding. A small draft model proposes several tokens; the big model checks them in one read instead of several, so accepted tokens come almost free. It helps more on a bandwidth-starved box like the Spark than on a bandwidth-rich GPU. NVIDIA publishes a Spark playbook for it.
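For the speculative decoding lever, the usual way to reason about the gain is the expected number of tokens produced per big-model pass. The sketch below uses the standard simplifying assumption that each drafted token is accepted independently with the same probability; the function name and the example acceptance rate are illustrative, not measurements from a Spark.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per big-model pass.

    Assumes each of the draft_len proposed tokens is accepted independently
    with probability accept_rate; the big model always contributes one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    base_tok_s = 273.0 / 70.0  # dense 70B FP8 ceiling from this article
    gain = expected_tokens_per_pass(accept_rate=0.8, draft_len=4)  # assumed values
    print(f"{gain:.2f} tokens per pass, about {base_tok_s * gain:.1f} tok/s if drafting were free")
```

Drafting is not free, so the real speedup is smaller, and it depends heavily on how often the draft model guesses right for your prompts.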
"Training, not inference" is the wrong axis
You'll hear that the Spark is "for training, not inference." Wrong frame. Training is compute-heavy and batched, so bandwidth matters less — but its real limits on the Spark are memory capacity and the missing NVLink for multi-GPU scaling, and few people report doing serious fine-tuning on one.
The axis that actually predicts speed is compute-bound and batched (fast) vs bandwidth-bound and single-stream (slow). Training, prefill, and batched serving sit on the fast side; single-user interactive decode sits on the slow side. Plenty of inference — RAG, batch, long-context — runs great. Plenty doesn't.
So, good or bad?
Neither. The Spark is workload-shaped:
- Good at: prefill-heavy work (RAG, long-context), MoE models, batched and multi-user serving, fitting 70B–120B models that won't load on a 32 GB card, and serving as a CUDA dev box for code you'll deploy elsewhere.
- Bad at: fast single-user, long-output chat on a dense 70B+ model — exactly what most buyers picture when they order one.
How to find out for your workload
Here's the catch: a spec sheet won't tell you which side of that line you're on. It depends on your prompts, your model, your concurrency. The only reliable way to know is to run your real workload on a real GB10 and watch the tokens/sec — before you spend $4,699.
That's what hourly rental is for. Spin up a Spark, run your own model and prompts for an afternoon, and let the numbers make the decision. Enverge rents GB10 Sparks by the hour for exactly this: test-drive the box on your workload first, then buy with your eyes open.
FAQ
How many tokens per second does a DGX Spark do?
On a dense 70B (FP8) at batch 1, about 3 tok/s — capped by memory bandwidth. MoE models and batched workloads are much faster, and prefill is excellent.
Why is my DGX Spark so slow?
If you're running a dense 70B+ model for single-user chat, ~3 tok/s is the expected bandwidth ceiling, not a bug. Switch to an MoE model, quantize harder, or add speculative decoding.
Is the DGX Spark good for inference?
For prefill-heavy (RAG, long-context), MoE, and batched or multi-user inference — yes. For fast single-user, long-output chat on a dense model — no.
DGX Spark vs Mac Studio for LLMs?
A Mac Studio (≈800 GB/s) decodes about 3× faster on large models; the Spark prefills about 4× faster and brings the CUDA ecosystem. Mac wins single-user generation; Spark wins long-context, agentic, and CUDA workflows.
Can I try a DGX Spark before buying one?
Yes — you can rent a GB10 by the hour and run your own workload to measure real performance before committing to the ~$4,699 purchase.