What Fits in 128GB? A Practical Model Size Guide for DGX Spark

TL;DR: On DGX Spark's 128GB, 7B–14B models run effortlessly, 30B is comfortable, 70B works well at FP8 or 4-bit, and 120B is realistic with 4-bit quantization. 200B is possible but tight. The right question isn't "does it fit?" but "does it fit with enough headroom for context, KV cache, and real throughput?" Don't assume FP16 for 70B+; plan for FP8 or 4-bit.

One of the biggest reasons engineers look at the NVIDIA DGX Spark is simple: 128GB of unified memory. That number changes what kinds of models you can load, fine-tune, and experiment with on a single machine.

But specs alone don't answer the real question:

What actually fits in 128GB?

This guide gives you a practical framework for thinking about model size on DGX Spark — not just theoretical limits, but what fits comfortably, what fits tightly, and what leaves room for real work.

Why 128GB Matters

Most AI engineers are used to thinking in terms of 24GB, 48GB, or 80GB VRAM. That's the mental model created by consumer RTX GPUs, workstation cards, and datacenter GPUs like the H100.

The DGX Spark is different.

With 128GB of unified memory, you can:

run larger models on a single box
avoid aggressive quantization in some workflows
keep more room for KV cache, larger context windows, and bigger batches
run multiple models at the same time
fine-tune models that are awkward or impossible on an 80GB GPU

That doesn't mean everything magically fits. The usable ceiling still depends on:

model precision
context length
KV cache size
inference vs fine-tuning
how much headroom you want for actual throughput

So the right question isn't just “does it fit?”

It's: “Does it fit well enough to be useful?”

Just because a model fits doesn't mean it runs well: latency, context headroom, and agent workflows matter as much as parameter count. And when you need to make a model fit, the practical next step is often to reach for a quantization guide rather than a bigger box.

The Rule of Thumb

A quick shortcut for estimating model memory:

Inference memory estimate

Approximate model weight size:

FP16 / BF16 → ~2 bytes per parameter
FP8 → ~1 byte per parameter
INT4 / 4-bit → ~0.5 bytes per parameter

So:

8B model in FP16 ≈ 16GB
70B model in FP16 ≈ 140GB
70B model in FP8 ≈ 70GB
120B model in 4-bit ≈ 60GB
200B model in 4-bit ≈ 100GB
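
The arithmetic above can be written as a small helper. This is a minimal sketch of the rule of thumb only: the function name and precision keys are illustrative, 1GB is taken as 1e9 bytes to match the rounded figures in this guide, and real checkpoints can differ slightly (for example, 4-bit formats usually store extra scaling data).

```python
# Approximate bytes per parameter, taken from the rule of thumb above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes). Weights only."""
    # (params_billion * 1e9 params) * bytes / 1e9 simplifies to this product.
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    examples = [(8, "fp16"), (70, "fp16"), (70, "fp8"), (120, "int4"), (200, "int4")]
    for size, precision in examples:
        print(f"{size}B {precision}: ~{weight_gb(size, precision):.0f} GB")
```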

That gets you close — but it's only the weights.

You also need memory for:

runtime overhead
KV cache
tokenizer/runtime buffers
activations (especially for training)
adapters / LoRA weights
multiple concurrent requests

That means the raw weight size is only the starting point.
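
Of the items in that list, the KV cache is the one that grows with context length, and it can be estimated from a model's shape. The sketch below assumes a standard transformer that stores one key and one value vector per layer for every cached token. The layer count, KV head count, and head dimension are assumed values for a 70B-class model with grouped-query attention, and the 10GB runtime reserve is a hypothetical placeholder; read the real numbers from your model's config.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int = 1,
                bytes_per_value: float = 2.0) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes).

    The leading factor of 2 covers the key and the value vector that each
    layer stores per token. bytes_per_value=2.0 assumes an FP16/BF16 cache.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * batch_size / 1e9


# Assumed shape for a 70B-class model with grouped-query attention.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

if __name__ == "__main__":
    weights_gb = 70.0   # 70B at FP8, from the rule of thumb
    reserve_gb = 10.0   # hypothetical runtime overhead reserve
    for ctx in (8_192, 32_768, 131_072):
        kv = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx)
        total = weights_gb + reserve_gb + kv
        print(f"context {ctx:>7}: KV ~{kv:.1f} GB, total ~{total:.1f} GB")
```

With these assumed values, a single 32K-token sequence costs roughly 10.7GB of cache, and the cost scales linearly with both context length and the number of concurrent sequences.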

What Fits Comfortably on DGX Spark

These are workloads that fit with enough headroom left to actually work productively.

7B to 14B models in FP16

Examples:

Llama 3.1 8B
Mistral 7B
DeepSeek Coder 6.7B
Qwen 14B-class models

These fit easily.

Typical weight footprint:

7B FP16 → ~14GB
8B FP16 → ~16GB
14B FP16 → ~28GB

That leaves plenty of room for:

large context windows
bigger batch sizes
multiple services or sidecars
local eval tooling
fine-tuning workflows

If your workload is centered on 7B–14B models, DGX Spark gives you a very comfortable margin.

30B-class models in FP16 or BF16

Examples:

30B open-weight reasoning or coding models
larger domain-specific fine-tuning targets

Approximate weight footprint:

30B FP16 → ~60GB

That is still very workable in 128GB.

You now have enough room left for:

practical inference
some tuning workflows
larger contexts
moderate KV cache

This is one of the major breaks from smaller GPUs. On 24GB or 48GB hardware, 30B starts getting annoying fast. On DGX Spark, it's realistic.

What Fits Well but Needs Thought

These are the workloads where DGX Spark becomes really interesting.

70B models in FP8 or quantized formats

Examples:

Llama 3.1 70B
Qwen 72B-class models

Approximate weight footprint:

70B FP16 → ~140GB → larger than 128GB on weights alone
70B FP8 → ~70GB
70B 4-bit → ~35GB

This is where DGX Spark has real leverage over 80GB GPUs.

A 70B model in FP8 on DGX Spark can fit with meaningful room left over for:

inference runtime overhead
KV cache
agent-style workflows
adapters / LoRA weights

A 70B model in 4-bit fits very comfortably.

For many teams, this is the sweet spot: big enough to be serious, small enough to run on one DGX Spark without pain.

Multiple models at once

This is one of the most overlooked use cases.

For example, a multi-agent or compound stack might look like:

reasoning model → 30GB
coding model → 20GB
embedding model → 8GB
reranker / support model → 4GB
runtime overhead → 10–15GB

Total: roughly 72–77GB, still comfortably within DGX Spark's 128GB.

That means you can build:

agent systems
retrieval-augmented pipelines
multi-model orchestration
local benchmark harnesses

without constant model swapping.
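
The budget for the example stack above can be checked in a few lines. This is a sketch using only the footprints listed in this section; the helper name is illustrative, and the leftover figure still has to cover KV cache and anything else running on the machine.

```python
TOTAL_MEMORY_GB = 128

# Footprints from the example stack in this section.
stack_gb = {
    "reasoning model": 30,
    "coding model": 20,
    "embedding model": 8,
    "reranker / support model": 4,
}
# Runtime overhead was given as a range, so keep both ends.
overhead_range_gb = (10, 15)


def stack_total_gb(models: dict, overhead_gb: float) -> float:
    """Sum of all model footprints plus runtime overhead, in GB."""
    return sum(models.values()) + overhead_gb


if __name__ == "__main__":
    low = stack_total_gb(stack_gb, overhead_range_gb[0])
    high = stack_total_gb(stack_gb, overhead_range_gb[1])
    print(f"stack total: {low}-{high} GB")
    print(f"left over:   {TOTAL_MEMORY_GB - high}-{TOTAL_MEMORY_GB - low} GB")
```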

What Fits Tightly

120B models with quantization

Approximate weight footprint:

120B FP16 → ~240GB → no
120B FP8 → ~120GB → technically near the ceiling, but very tight
120B 4-bit → ~60GB → very plausible

A 120B model in 4-bit is realistic on DGX Spark.

A 120B model in FP8 is theoretically near the edge: ~120GB of weights leaves only about 8GB for the runtime, KV cache, and everything else on the machine. In practice you likely won't want to run it that way unless the rest of the stack is extremely lean.

This is where the difference between “loads once” and “works well” matters.

If your goal is useful inference with room for context, concurrency, and stability, 120B generally wants quantization.
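
One way to make "loads once" versus "works well" concrete is to ask how many tokens of KV cache fit in whatever is left after the weights. The sketch below is illustrative only: the 8GB reserve and the 400,000 bytes of cache per token are hypothetical placeholders, not measured values for any particular 120B model.

```python
def max_cached_tokens(total_gb: float, weights_gb: float,
                      reserve_gb: float, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights and a fixed reserve.

    Returns 0 when nothing is left over (or the weights do not fit at all).
    """
    free_bytes = (total_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


if __name__ == "__main__":
    RESERVE_GB = 8                # hypothetical: OS, runtime, buffers
    KV_BYTES_PER_TOKEN = 400_000  # hypothetical per-token cache cost
    for label, weights in [("120B FP8", 120), ("120B 4-bit", 60)]:
        tokens = max_cached_tokens(128, weights, RESERVE_GB, KV_BYTES_PER_TOKEN)
        print(f"{label}: room for ~{tokens:,} cached tokens")
```

The cached-token budget is shared across all concurrent requests, so a figure that looks generous for one long conversation can shrink quickly under concurrency.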

200B-class models in 4-bit

Approximate weight footprint:

200B 4-bit → ~100GB

This is in the “possible, but tight” zone.

A 200B-class model can fit in 128GB when aggressively quantized — NVIDIA positions the Spark for models up to 200 billion parameters — but you should think of this as:

a technical edge case
a benchmarking / experimentation scenario
not the most comfortable everyday setup

You may have less room than you'd like for:

long context windows
heavy KV cache
concurrent usage
extra tooling

So yes, 200B-class models can be part of the DGX Spark story — but more as a demonstration of ceiling than the default production pattern.

What Does Not Fit Comfortably

70B in FP16

At ~140GB just for weights, this exceeds the machine's 128GB before any KV cache or runtime overhead is counted.

120B in FP16

At ~240GB, this is clearly out.

Large training jobs without memory-efficient methods

Even if a model can be loaded for inference, training is more demanding because you also need memory for:

optimizer state
gradients
activations

That means full fine-tuning quickly becomes more expensive than inference.

So when thinking about fine-tuning, the practical model size limit is lower than the pure inference limit.

Inference vs Fine-Tuning: Very Different Limits

This is the most important distinction people miss.

For inference

You only need:

model weights
KV cache
runtime overhead

For fine-tuning

You also need:

gradients
optimizer states
activations
adapter memory (for LoRA/QLoRA)

That means:

8B and 13B full fine-tuning → comfortable
30B full fine-tuning → possible with careful setup
70B full fine-tuning → generally not the default path
70B LoRA / QLoRA → much more realistic

So if you're evaluating DGX Spark for training, the answer is not just “what fits?” but “what fits with a viable training recipe?” For the practice of fine-tuning on DGX Spark (full, LoRA, QLoRA), see the companion guide.
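
To see why the training recipe matters as much as the parameter count, here is a rough estimator for the memory held by weights, gradients, and optimizer state. It is a sketch under stated assumptions: the bytes-per-parameter figures depend on the recipe, the "lean" recipe and the 0.5B adapter size are hypothetical examples, and activations are deliberately left out because they depend on batch size and sequence length.

```python
def training_state_gb(frozen_params_b: float, frozen_bytes: float,
                      trainable_params_b: float, trainable_bytes: float) -> float:
    """Weights, gradients, and optimizer state in GB. Activations are excluded.

    frozen_*    : parameters (in billions) that are only stored, e.g. a
                  quantized base model, and their bytes per parameter.
    trainable_* : parameters (in billions) being updated, and the bytes per
                  parameter for weights + gradients + optimizer state.
    """
    return frozen_params_b * frozen_bytes + trainable_params_b * trainable_bytes


# Classic mixed-precision AdamW: 2 (FP16 weights) + 2 (FP16 gradients)
# + 12 (FP32 master weights and two FP32 Adam moments) = 16 bytes/param.
MIXED_PRECISION_ADAMW = 16.0
# A leaner, hypothetical recipe: BF16 weights and gradients plus 8-bit
# optimizer moments = 2 + 2 + 2 = 6 bytes/param.
LEAN_BF16_8BIT_OPT = 6.0

if __name__ == "__main__":
    print("8B full, lean recipe: ", training_state_gb(0, 0, 8, LEAN_BF16_8BIT_OPT), "GB")
    print("70B full, lean recipe:", training_state_gb(0, 0, 70, LEAN_BF16_8BIT_OPT), "GB")
    # QLoRA-style: frozen 4-bit base plus 0.5B adapter parameters (hypothetical).
    print("70B QLoRA-style:      ", training_state_gb(70, 0.5, 0.5, MIXED_PRECISION_ADAMW), "GB")
```

Under these assumptions, full fine-tuning of a 70B model lands far above 128GB even with the lean recipe, while the QLoRA-style setup stays well under it. The per-parameter byte counts are inputs rather than constants because the choice of optimizer and precision moves the result by a large factor.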

A Practical Decision Table

Here’s a more useful summary than raw specs.

Workload	DGX Spark verdict
7B–14B FP16 inference	Easy
30B FP16 inference	Comfortable
70B FP8 inference	Good fit
70B 4-bit inference	Easy
120B 4-bit inference	Good fit
200B 4-bit inference	Possible, tight
8B full fine-tuning	Easy
13B full fine-tuning	Good fit
30B full fine-tuning	Possible with care
70B LoRA / QLoRA	Good fit
70B full fine-tuning	Generally not ideal

DGX Spark vs H100 80GB: Why This Matters

This is where DGX Spark becomes especially interesting.

An H100 has much higher raw throughput, but only 80GB of memory.

That means there are real workflows where DGX Spark is more convenient:

70B FP8 experiments
multi-model agent stacks
120B-class quantized inference
memory-heavy prototyping
context-heavy workflows where the extra memory matters more than raw speed

If your workload is mostly throughput-bound, H100 still wins, but look at prefill and decode performance separately before assuming what "throughput" means for your request shape. If your workload is memory-bound, DGX Spark can be the more practical machine.
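
A back-of-the-envelope way to think about the decode side: for a dense model generating one sequence at a time, each new token has to read every weight once, so memory bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below is only that ceiling, not a benchmark. The 273 GB/s bandwidth figure is an assumption to be checked against the spec sheet for your hardware, and prefill behaves differently because it processes many tokens per weight read.

```python
def decode_tokens_per_s_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Ceiling on single-stream decode speed for a dense model.

    Assumes every weight is read once per generated token and ignores
    KV cache reads, compute time, and kernel overhead, so measured
    throughput will be lower.
    """
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    ASSUMED_BANDWIDTH_GB_S = 273.0  # assumption: verify against the spec sheet
    for label, weights in [("70B FP8", 70.0), ("70B 4-bit", 35.0), ("8B FP16", 16.0)]:
        bound = decode_tokens_per_s_upper_bound(ASSUMED_BANDWIDTH_GB_S, weights)
        print(f"{label}: at most ~{bound:.1f} tokens/s per stream")
```

The same function makes the trade-off visible: halving the weight size through quantization doubles the ceiling, which is one reason lower precision often helps speed as well as fit.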

What 128GB Means for Real Teams

For a small team or solo AI engineer, 128GB on DGX Spark unlocks three big things:

1. Fewer compromises

You don't have to immediately choose between:

tiny models
aggressive quantization
awkward offloading
multi-GPU complexity

2. Better workflow continuity

A model that works on DGX Spark is easier to test, benchmark, and iterate on without moving to a larger cluster too early.

3. More serious local experimentation

You can run serious open-weight models on a single machine and still have room for the rest of the stack.

That changes the development loop.

The Bottom Line

If your question is:

“Can DGX Spark run serious modern open-weight models?”

The answer is yes.

If your question is:

“What fits comfortably in 128GB?”

A good practical answer is:

7B–14B → effortless
30B → comfortable
70B → very workable with FP8 or 4-bit quantization
120B → realistic with quantization
200B → possible, but tight and more experimental

That is exactly why DGX Spark is interesting: it sits in the gap between consumer hardware and datacenter-scale infrastructure.

It gives you enough memory to do real AI work without forcing you straight into H100/H200 economics.

Try DGX Spark Yourself

Want to test your own model sizes on real hardware?

Visit spark.enverge.ai to request access.

You can validate your actual workload directly instead of guessing from spec sheets.

FAQ

What model sizes fit comfortably in 128GB on DGX Spark?

7B–14B runs effortlessly in FP16, 30B is comfortable, 70B works well at FP8 or 4-bit, and 120B is realistic with 4-bit quantization. 200B is possible but tight.

Why are fine-tuning limits different from inference limits?

Fine-tuning also needs optimizer state, gradients, and activations — so full fine-tuning on 70B is generally unrealistic, while 70B LoRA/QLoRA fits well. Inference only needs weights, KV cache, and runtime overhead.

Can a 70B model run in FP16 on 128GB?

No, not comfortably. 70B FP16 weighs ~140GB — beyond the practical single-box limit. Plan for FP8 (~70GB) or 4-bit (~35GB) instead.

Enverge provides cloud access to NVIDIA DGX Spark hardware for AI researchers, engineers, and startups. Use DGX Spark Cloud for large-model inference, fine-tuning, benchmarking, and Blackwell-native experimentation without buying the hardware upfront.