What Fits in 128GB? A Practical Model Size Guide for DGX Spark
One of the biggest reasons engineers look at the NVIDIA DGX Spark is simple: 128GB of unified memory. That number changes what kinds of models you can load, fine-tune, and experiment with on a single machine.
But specs alone don't answer the real question:
What actually fits in 128GB?
This guide gives you a practical framework for thinking about model size on DGX Spark — not just theoretical limits, but what fits comfortably, what fits tightly, and what leaves room for real work.
Why 128GB Matters
Most AI engineers are used to thinking in terms of 24GB, 48GB, or 80GB VRAM. That's the mental model created by consumer RTX GPUs, workstation cards, and datacenter GPUs like the H100.
The DGX Spark is different.
With 128GB of unified memory, you can:
- run larger models on a single box
- avoid aggressive quantization in some workflows
- keep more room for KV cache, larger context windows, and bigger batches
- run multiple models at the same time
- fine-tune models that are awkward or impossible on an 80GB GPU
That doesn't mean everything magically fits. The usable ceiling still depends on:
- model precision
- context length
- KV cache size
- inference vs fine-tuning
- how much headroom you want for actual throughput
So the right question isn't just “does it fit?”
It's:
“Does it fit well enough to be useful?”
The Rule of Thumb
A quick shortcut for estimating model memory:
Inference memory estimate
Approximate model weight size:
- FP16 / BF16 → ~2 bytes per parameter
- FP8 → ~1 byte per parameter
- INT4 / 4-bit → ~0.5 bytes per parameter
So:
- 8B model in FP16 ≈ 16GB
- 70B model in FP16 ≈ 140GB
- 70B model in FP8 ≈ 70GB
- 120B model in 4-bit ≈ 60GB
- 200B model in 4-bit ≈ 100GB
That gets you close — but it's only the weights.
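The arithmetic behind those figures can be sketched in a few lines. This is a minimal estimator, assuming decimal gigabytes (1 GB = 1e9 bytes) and the rule-of-thumb bytes-per-parameter values listed above; the function name `weight_gb` and the precision labels are illustrative, not part of any library.

```python
# Rule-of-thumb bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Estimate weight memory in decimal GB. Covers weights only."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


examples = [(8, "fp16"), (70, "fp16"), (70, "fp8"), (120, "int4"), (200, "int4")]
for size, precision in examples:
    print(f"{size}B in {precision}: ~{weight_gb(size, precision):.0f} GB")
```

Real quantized checkpoints often land slightly above these numbers because of scales, zero points, and layers kept at higher precision, so treat the output as a lower bound.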
You also need memory for:
- runtime overhead
- KV cache
- tokenizer/runtime buffers
- activations (especially for training)
- adapters / LoRA weights
- multiple concurrent requests
That means the raw weight size is only the starting point.
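Of the items above, KV cache is usually the one that grows fastest, because it scales with context length and concurrency. Here is a rough sketch of the standard estimate for a transformer with grouped-query attention. The configuration used below (32 layers, 8 KV heads, head dimension 128) is an assumed 8B-class layout chosen for illustration, and `kv_cache_gb` is a hypothetical helper name.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in decimal GB (FP16/BF16 cache by default)."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * batch_size / 1e9


# Assumed 8B-class configuration (illustrative values, not read from a model card).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

print(f"8K context, 1 request:    {kv_cache_gb(**cfg, context_tokens=8192):.2f} GB")
print(f"128K context, 4 requests: "
      f"{kv_cache_gb(**cfg, context_tokens=131072, batch_size=4):.2f} GB")
```

Under these assumptions a single 8K-token request costs about 1 GB, while four concurrent 128K-token requests cost several times more than the 16GB of FP16 weights. That is why context length and concurrency belong in the estimate alongside parameter count.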
What Fits Comfortably on DGX Spark
These are workloads that fit with enough headroom left to actually work productively.
7B to 14B models in FP16
Examples:
- Llama 3.1 8B
- Mistral 7B
- DeepSeek Coder 6.7B
- Qwen 14B-class models
These fit easily.
Typical weight footprint:
- 7B FP16 → ~14GB
- 8B FP16 → ~16GB
- 14B FP16 → ~28GB
That leaves plenty of room for:
- large context windows
- bigger batch sizes
- multiple services or sidecars
- local eval tooling
- fine-tuning workflows
If your workload is centered on 7B–14B models, DGX Spark gives you a very comfortable margin.
30B-class models in FP16 or BF16
Examples:
- 30B open-weight reasoning or coding models
- larger domain-specific fine-tuning targets
Approximate weight footprint:
- 30B FP16 / BF16 → ~60GB
That is still very workable in 128GB.
You now have enough room left for:
- practical inference
- some tuning workflows
- larger contexts
- moderate KV cache
This is one of the major breaks from smaller GPUs. On 24GB or 48GB hardware, 30B starts getting annoying fast. On DGX Spark, it's realistic.
What Fits Well but Needs Thought
These are the workloads where DGX Spark becomes really interesting.
70B models in FP8 or quantized formats
Examples:
- Llama 3.1 70B
- Qwen 72B-class models
Approximate weight footprint:
- 70B FP16 → ~140GB → exceeds 128GB before any KV cache or runtime overhead
- 70B FP8 → ~70GB
- 70B 4-bit → ~35GB
This is where DGX Spark has real leverage over 80GB GPUs.
A 70B model in FP8 on DGX Spark can fit with meaningful room left over for:
- inference runtime overhead
- KV cache
- agent-style workflows
- adapters / LoRA weights
A 70B model in 4-bit fits very comfortably.
For many teams, this is the sweet spot:
big enough to be serious, small enough to run on one DGX Spark without pain.
Multiple models at once
This is one of the most overlooked use cases.
For example, a multi-agent or compound stack might look like:
- reasoning model → 30GB
- coding model → 20GB
- embedding model → 8GB
- reranker / support model → 4GB
- runtime overhead → 10–15GB
Total: still comfortably within DGX Spark range.
That means you can build:
- agent systems
- retrieval-augmented pipelines
- multi-model orchestration
- local benchmark harnesses
without constant model swapping.
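A quick way to sanity-check a stack like this is to add up the budget and look at what remains. The sketch below uses the example figures from the list above, taking the upper end of the 10–15GB runtime overhead range; the dictionary keys are just labels.

```python
TOTAL_GB = 128

# Example stack from the list above; overhead uses the upper end of 10-15GB.
stack_gb = {
    "reasoning model": 30,
    "coding model": 20,
    "embedding model": 8,
    "reranker / support model": 4,
    "runtime overhead": 15,
}

used_gb = sum(stack_gb.values())
headroom_gb = TOTAL_GB - used_gb

for name, size in stack_gb.items():
    print(f"{name:<26} {size:>4} GB")
print(f"{'total':<26} {used_gb:>4} GB")
print(f"{'headroom':<26} {headroom_gb:>4} GB")
```

Because the memory is unified, the operating system and any other processes draw from the same pool, so the headroom figure is best read as an upper bound on what is left for KV cache and concurrency.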
What Fits Tightly
120B models with quantization
Approximate weight footprint:
- 120B FP16 → ~240GB → no
- 120B FP8 → ~120GB → technically near the ceiling, but very tight
- 120B 4-bit → ~60GB → very plausible
A 120B model in 4-bit is realistic on DGX Spark.
A 120B model in FP8 is theoretically near the edge, but in practice you likely won't want to run it that way unless the rest of the stack is extremely lean.
This is where the difference between “loads once” and “works well” matters.
If your goal is useful inference with room for context, concurrency, and stability, 120B generally wants quantization.
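One way to make the "loads once" versus "works well" distinction concrete is to look at headroom after weights rather than at a yes/no fit. The sketch below assumes a 20GB minimum headroom threshold, which is a hypothetical value picked for illustration; the right number depends on your context lengths and concurrency.

```python
def headroom_gb(params_billions: float, bytes_per_param: float,
                total_gb: float = 128.0) -> float:
    """Memory left after weights, in decimal GB (weights-only estimate)."""
    return total_gb - params_billions * bytes_per_param


def verdict(params_billions: float, bytes_per_param: float,
            min_headroom_gb: float = 20.0) -> str:
    # min_headroom_gb is an assumed threshold, not a measured requirement.
    room = headroom_gb(params_billions, bytes_per_param)
    if room < 0:
        return "does not fit"
    if room < min_headroom_gb:
        return "loads, but tight"
    return "fits with working room"


for label, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    room = headroom_gb(120, bytes_per_param)
    print(f"120B {label}: headroom {room:.0f} GB -> {verdict(120, bytes_per_param)}")
```

With these assumptions, 120B in FP8 leaves only about 8GB for everything else, while 4-bit leaves about 68GB.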
200B-class models in 4-bit
Approximate weight footprint:
- 200B 4-bit → ~100GB
This is in the “possible, but tight” zone.
A 200B-class model can fit in 128GB when aggressively quantized, but you should think of this as:
- a technical edge case
- a benchmarking / experimentation scenario
- not the most comfortable everyday setup
You may have less room than you'd like for:
- long context windows
- heavy KV cache
- concurrent usage
- extra tooling
So yes, 200B-class models can be part of the DGX Spark story — but more as a demonstration of ceiling than the default production pattern.
What Does Not Fit Comfortably
70B in FP16
At ~140GB just for weights, this exceeds 128GB before counting KV cache or runtime overhead, so it needs FP8 or lower-precision quantization to run on a single DGX Spark.
120B in FP16
At ~240GB, this is clearly out.
Large training jobs without memory-efficient methods
Even if a model can be loaded for inference, training is more demanding because you also need memory for:
- optimizer state
- gradients
- activations
That means full fine-tuning quickly becomes more expensive than inference.
So when thinking about fine-tuning, the practical model size limit is lower than the pure inference limit.
Inference vs Fine-Tuning: Very Different Limits
This is the most important distinction people miss.
For inference
You only need:
- model weights
- KV cache
- runtime overhead
For fine-tuning
You also need:
- gradients
- optimizer states
- activations
- adapter memory (for LoRA/QLoRA)
That means, as a rough guide that depends heavily on optimizer choice, precision, and sequence length:
- 8B and 13B full fine-tuning → comfortable
- 30B full fine-tuning → possible with careful setup
- 70B full fine-tuning → generally not the default path
- 70B LoRA / QLoRA → much more realistic
So if you're evaluating DGX Spark for training, the answer is not just “what fits?” but “what fits with a viable training recipe?”
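To see why the recipe matters so much, here is a rough sketch of training memory before activations. It assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam-style training (2 for weights, 2 for gradients, 12 for FP32 master weights plus two optimizer moments). The 6-bytes-per-parameter "lean" recipe and the 0.2B-parameter adapter size are hypothetical illustrations, not measurements from DGX Spark.

```python
def full_finetune_gb(params_billions: float,
                     weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0,
                     optimizer_bytes: float = 12.0) -> float:
    """Weights + gradients + optimizer state in decimal GB. Activations excluded."""
    return params_billions * (weight_bytes + grad_bytes + optimizer_bytes)


def lora_gb(params_billions: float, base_bytes: float = 0.5,
            adapter_params_billions: float = 0.2) -> float:
    """Frozen quantized base plus a small trainable adapter (size is hypothetical)."""
    # Only the adapter carries gradients and optimizer state.
    return params_billions * base_bytes + full_finetune_gb(adapter_params_billions)


for size in (8, 13, 30, 70):
    standard = full_finetune_gb(size)
    lean = full_finetune_gb(size, optimizer_bytes=2.0)  # assumed 8-bit optimizer state
    print(f"{size}B full fine-tune: ~{standard:.0f} GB standard, ~{lean:.0f} GB lean")

print(f"70B QLoRA-style: ~{lora_gb(70):.1f} GB before activations")
```

Under the standard 16-byte assumption, even an 8B model lands at roughly 128GB before activations, so full fine-tuning on a single box generally relies on memory-efficient optimizers, gradient checkpointing, or adapter-based methods.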
A Practical Decision Table
Here’s a more useful summary than raw specs.
| Workload | DGX Spark verdict |
| --- | --- |
| 7B–14B FP16 inference | Easy |
| 30B FP16 inference | Comfortable |
| 70B FP8 inference | Good fit |
| 70B 4-bit inference | Easy |
| 120B 4-bit inference | Good fit |
| 200B 4-bit inference | Possible, tight |
| 8B full fine-tuning | Easy |
| 13B full fine-tuning | Good fit |
| 30B full fine-tuning | Possible with care |
| 70B LoRA / QLoRA | Good fit |
| 70B full fine-tuning | Generally not ideal |
DGX Spark vs H100 80GB: Why This Matters
This is where DGX Spark becomes especially interesting.
An H100 has much higher raw throughput, but only 80GB of memory.
That means there are real workflows where DGX Spark is more convenient:
- 70B FP8 experiments
- multi-model agent stacks
- 120B-class quantized inference
- memory-heavy prototyping
- context-heavy workflows where the extra memory matters more than raw speed
If your workload is mostly throughput-bound, H100 still wins.
If your workload is memory-bound, DGX Spark can be the more practical machine.
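The memory side of that comparison can be framed as an inverse question: given a memory budget, how large a model fits once some fraction is held back for KV cache and runtime overhead? The sketch below assumes a 25% reserve, which is a hypothetical planning figure rather than a recommendation, and it says nothing about throughput.

```python
def max_params_billions(total_gb: float, bytes_per_param: float,
                        reserve_fraction: float = 0.25) -> float:
    """Largest model (billions of params) whose weights fit after the reserve."""
    # reserve_fraction is an assumed share kept free for KV cache and overhead.
    usable_gb = total_gb * (1.0 - reserve_fraction)
    return usable_gb / bytes_per_param


print(f"{'precision':<10} {'80GB':>8} {'128GB':>8}")
for label, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    small = max_params_billions(80, bytes_per_param)
    large = max_params_billions(128, bytes_per_param)
    print(f"{label:<10} {small:>7.0f}B {large:>7.0f}B")
```

With that reserve, a 70B FP8 model fits in the 128GB budget but not in the 80GB one, which matches the first item in the list above.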
What 128GB Means for Real Teams
For a small team or solo AI engineer, 128GB on DGX Spark unlocks three big things:
1. Fewer compromises
You don't have to immediately choose between:
- tiny models
- aggressive quantization
- awkward offloading
- multi-GPU complexity
2. Better workflow continuity
A model that works on DGX Spark is easier to test, benchmark, and iterate on without moving to a larger cluster too early.
3. More serious local experimentation
You can run serious open-weight models on a single machine and still have room for the rest of the stack.
That changes the development loop.
The Bottom Line
If your question is:
“Can DGX Spark run serious modern open-weight models?”
The answer is yes.
If your question is:
“What fits comfortably in 128GB?”
A good practical answer is:
- 7B–14B → effortless
- 30B → comfortable
- 70B → very workable with FP8 or quantization
- 120B → realistic with quantization
- 200B → possible, but tight and more experimental
That is exactly why DGX Spark is interesting: it sits in the gap between consumer hardware and datacenter-scale infrastructure.
It gives you enough memory to do real AI work without forcing you straight into H100/H200 economics.
Try DGX Spark Yourself
Want to test your own model sizes on real hardware?
Visit spark.enverge.ai to request access.
You can validate your actual workload directly instead of guessing from spec sheets.
Enverge provides cloud access to NVIDIA DGX Spark hardware for AI researchers, engineers, and startups. Use DGX Spark Cloud for large-model inference, fine-tuning, benchmarking, and Blackwell-native experimentation without buying the hardware upfront.