How to Fine-Tune an LLM on NVIDIA DGX Spark
The NVIDIA DGX Spark's 128GB of unified memory changes the fine-tuning game. Models that require multi-GPU setups or aggressive quantization on other hardware? They fit entirely in memory on a single DGX Spark — no sharding, no compromises.
This guide walks you through fine-tuning LLMs on DGX Spark step by step, from choosing your method to running your first training job.
Why Fine-Tune on DGX Spark?
Fine-tuning adapts a pre-trained model to your specific use case — your data, your domain, your tone. The DGX Spark is uniquely suited for this because:
- 128GB unified memory: load 30B-class models in 16-bit precision without quantization (70B weights alone are ~140GB in FP16 and need 8-bit or lower to fit)
- Blackwell architecture (GB10, compute capability 12.1): native NVFP4 support and 5th-gen Tensor Cores
- Desktop form factor — no cloud egress fees, your data stays local
- $0.65/hour cloud access — ~4.5x cheaper per hour than renting an H100
For context: fine-tuning Llama 3.1 8B with a 16K context window requires ~45GB of memory. On an H100 (80GB), that's tight. On DGX Spark (128GB), you have 83GB to spare for larger batches, longer contexts, or bigger models.
Choosing Your Fine-Tuning Method
There are three main approaches, each with different memory and quality trade-offs:
Full Fine-Tuning
Updates every parameter in the model. Produces the highest quality results but requires the most memory.
| Model Size | Memory Required (FP16) | Fits on H100? | Fits on DGX Spark? |
| --- | --- | --- | --- |
| 8B | ~45GB | ✅ (tight) | ✅ |
| 13B | ~65GB | ⚠️ (marginal) | ✅ |
| 30B | ~120GB | ❌ | ✅ |
| 70B | ~280GB | ❌ | ⚠️ (with FP8) |
Best for: Models up to 30B where you have enough training data (10K+ samples) and want maximum quality.
LoRA (Low-Rank Adaptation)
Freezes the base model and trains small adapter matrices. Uses dramatically less memory while achieving 90-95% of full fine-tuning quality.
| Model Size | Memory Required (LoRA) | Fits on H100? | Fits on DGX Spark? |
| --- | --- | --- | --- |
| 8B | ~20GB | ✅ | ✅ |
| 70B | ~50GB | ✅ (tight) | ✅ |
| 120B | ~80GB | ❌ | ✅ |
| 200B | ~128GB | ❌ | ✅ (tight) |
Best for: Large models (70B+), limited training data, or when you want to experiment quickly.
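To make "small adapter matrices" concrete, here is a minimal sketch that counts the weights LoRA adds when every attention and MLP projection is adapted. The layer dimensions are assumptions taken from the published Llama 3.1 8B and 70B model configs, and `DecoderShape` and `lora_trainable_params` are hypothetical helper names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    """Linear-layer dimensions of a Llama-style decoder (assumed values)."""
    hidden: int        # model width
    intermediate: int  # MLP width
    kv_dim: int        # num_key_value_heads * head_dim
    layers: int        # number of decoder blocks


# Assumed from the published Llama 3.1 model configs.
LLAMA_31_8B = DecoderShape(hidden=4096, intermediate=14336, kv_dim=1024, layers=32)
LLAMA_31_70B = DecoderShape(hidden=8192, intermediate=28672, kv_dim=1024, layers=80)


def lora_trainable_params(shape, rank):
    """Count adapter weights when all seven projections per layer are adapted.

    A LoRA adapter on a (d_in x d_out) linear layer adds two matrices,
    A (d_in x rank) and B (rank x d_out), i.e. rank * (d_in + d_out) weights.
    """
    h, m, kv = shape.hidden, shape.intermediate, shape.kv_dim
    projections = [
        (h, h),   # q_proj
        (h, kv),  # k_proj
        (h, kv),  # v_proj
        (h, h),   # o_proj
        (h, m),   # gate_proj
        (h, m),   # up_proj
        (m, h),   # down_proj
    ]
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections)
    return per_layer * shape.layers


params_8b = lora_trainable_params(LLAMA_31_8B, rank=16)
params_70b = lora_trainable_params(LLAMA_31_70B, rank=16)
print(f"8B,  r=16: {params_8b:,} trainable parameters")
print(f"70B, r=16: {params_70b:,} trainable parameters")
```

Under these assumptions the adapters stay well below 1% of the base model's parameter count at rank 16, which is why gradient and optimizer memory shrink so much compared with full fine-tuning.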
QLoRA (Quantized LoRA)
Combines 4-bit quantization with LoRA adapters. The most memory-efficient method — lets you fine-tune models that wouldn't fit even with LoRA alone.
| Model Size | Memory Required (QLoRA) | Fits on H100? | Fits on DGX Spark? |
| --- | --- | --- | --- |
| 70B | ~38GB | ✅ | ✅ |
| 120B | ~65GB | ✅ (tight) | ✅ |
| 200B | ~105GB | ❌ | ✅ |
Best for: 200B+ models, memory-constrained experiments, quick prototyping.
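A quick way to sanity-check whether a model can even be loaded is to compute the weight-only footprint at a given precision. The sketch below does just that arithmetic; `weight_footprint_gb` is a hypothetical helper, it uses decimal gigabytes, and it deliberately ignores gradients, optimizer state, adapters, and activations, all of which come on top during training.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Print the weight-only footprint for a few model sizes and precisions.
for size_b in (8, 70, 200):
    row = ", ".join(
        f"{bits}-bit: {weight_footprint_gb(size_b * 1e9, bits):.0f}GB"
        for bits in (16, 8, 4)
    )
    print(f"{size_b}B -> {row}")
```

Treat the result as a lower bound: if the weights alone exceed the available memory at a given precision, no training method at that precision will fit.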
Step-by-Step: Fine-Tuning Llama 3.1 8B on DGX Spark
Let's walk through a complete example — full fine-tuning of Llama 3.1 8B using PyTorch and Hugging Face Transformers.
Prerequisites
SSH into your DGX Spark instance and set up the environment:
# Create a virtual environment
python3 -m venv ~/finetune-env
source ~/finetune-env/bin/activate
# Install dependencies
pip install torch transformers datasets accelerate peft trl
pip install bitsandbytes # for quantization methods
Prepare Your Dataset
Fine-tuning quality depends on your data. Here's how to structure it for instruction tuning:
from datasets import Dataset

# Your training data: instruction/response pairs
data = [
    {
        "instruction": "Summarize the key benefits of edge AI deployment.",
        "response": "Edge AI reduces latency by processing data locally, lowers bandwidth costs by minimizing cloud transfers, improves privacy by keeping sensitive data on-device, and enables real-time decision-making in environments with limited connectivity."
    },
    {
        "instruction": "What are the main challenges of fine-tuning large language models?",
        "response": "The primary challenges include high memory requirements, risk of catastrophic forgetting, the need for high-quality domain-specific data, hyperparameter sensitivity, and the computational cost of full-parameter training on models above 30B parameters."
    },
    # Add your training examples here (aim for 1K-50K samples)
]

dataset = Dataset.from_list(data)
For real projects, load your data from JSON, CSV, or Hugging Face Hub:
from datasets import load_dataset

# From a JSON Lines file (split="train" returns a Dataset rather than a DatasetDict)
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")

# From Hugging Face Hub
dataset = load_dataset("your-org/your-dataset", split="train")
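For reference, a JSON Lines file holds one JSON object per line. The sketch below writes and re-reads instruction/response records in that layout using only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the sample record is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, the layout a JSON Lines loader expects."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


examples = [
    {"instruction": "Define edge AI in one sentence.",
     "response": "Edge AI runs models on or near the device that produces the data."},
]

# Round-trip through a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = write_jsonl(examples, Path(tmp) / "my_training_data.jsonl")
    round_trip = read_jsonl(jsonl_path)

print(round_trip[0]["instruction"])
```

Keeping the same two keys in every record means the formatting function used later can rely on `instruction` and `response` always being present.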
Option A: Full Fine-Tuning
This loads the full model in 16-bit precision (BF16, the checkpoint's native dtype) and trains every parameter:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Load model: fits comfortably in 128GB
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Format data for instruction tuning (Llama 3 chat layout).
# <|begin_of_text|> is left out because the tokenizer prepends it itself.
def format_prompt(example):
    return {
        "text": (
            "<|start_header_id|>user<|end_header_id|>\n\n"
            f"{example['instruction']}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
            f"{example['response']}<|eot_id|>"
        )
    }

formatted = dataset.map(format_prompt)

# Training configuration. SFTConfig extends TrainingArguments with the
# SFT-specific options; older TRL releases took them on SFTTrainer instead.
training_args = SFTConfig(
    output_dir="./llama-8b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,  # DGX Spark can handle batch=4 easily
    gradient_accumulation_steps=4,  # Effective batch size: 16
    learning_rate=2e-5,
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                      # Blackwell supports BF16 natively
    optim="adamw_torch",
    max_grad_norm=1.0,
    dataset_text_field="text",
    max_length=4096,                # named max_seq_length in older TRL releases
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted,
    processing_class=tokenizer,     # `tokenizer=` in older TRL releases
)
trainer.train()
trainer.save_model("./llama-8b-finetuned/final")
Memory usage: ~45GB — leaving 83GB free on DGX Spark.
Training time: ~2 hours for 10K samples, 3 epochs.
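The batch settings translate into a number of optimizer steps, which is useful when reading training logs or choosing a warmup length. This is a rough sketch of that arithmetic for the 10K-sample, 3-epoch run above; `training_steps` is a hypothetical helper, and the exact rounding a trainer applies can differ by library version.

```python
import math


def training_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = training_steps(
    num_samples=10_000, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}")
```

Dividing the wall-clock time of a run by this step count gives a per-step time that can be compared across batch sizes and methods.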
Option B: LoRA Fine-Tuning (for 70B+ Models)
For larger models, LoRA lets you train efficiently by only updating small adapter layers:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer

model_name = "meta-llama/Llama-3.1-70B-Instruct"

# Load model: 70B in FP16 is ~140GB, so load the base weights in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    # INT8 via bitsandbytes (not FP8): ~70GB, fits on DGX Spark
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Make the quantized base model safe to train adapters on
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                 # Rank: higher = more capacity, more memory
    lora_alpha=32,        # Scaling factor
    target_modules=[      # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable: ~207M / 70B total (~0.3%)

training_args = SFTConfig(
    output_dir="./llama-70b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,   # LoRA typically uses higher LR
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="adamw_torch",
    dataset_text_field="text",
    max_length=2048,      # named max_seq_length in older TRL releases
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted,     # the formatted dataset from Option A
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
model.save_pretrained("./llama-70b-lora/final")
Memory usage: ~80GB for this configuration (8-bit base weights plus adapters, optimizer state, and activations). That fits on DGX Spark (128GB) or an H200 (141GB), but not on an H100 (80GB).
Option C: Using Unsloth (2x Faster)
Unsloth is optimized specifically for NVIDIA hardware and DGX Spark. It delivers up to 2x faster training with 60% less memory:
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Unsloth handles model loading with built-in optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=False,  # Keep 16-bit weights; no 4-bit quantization needed on DGX Spark
)

# Apply LoRA with Unsloth's optimized kernels
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

training_args = TrainingArguments(
    output_dir="./llama-8b-unsloth",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Unsloth's efficiency allows larger batches
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    warmup_steps=10,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted,  # the formatted dataset from Option A
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()
model.save_pretrained("./llama-8b-unsloth/final")
Memory usage: ~18GB with LoRA, ~30GB for full fine-tuning.
Training time: ~1 hour for 10K samples — roughly 2x faster than vanilla PyTorch.
Tips for Better Results
1. Data Quality > Data Quantity
A clean dataset of 5,000 high-quality examples will outperform a noisy dataset of 100,000. Focus on:
- Consistency — same format for every example
- Relevance — only include examples from your target domain
- Diversity — cover the range of inputs your model will see in production
- Accuracy — every response should be one you'd be happy to ship
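Some of these checks can be automated before training starts. The sketch below covers only the mechanical part (consistent fields, non-empty text, exact duplicates after whitespace and case normalization); relevance and accuracy still need human review. `audit_examples` is a hypothetical helper, and the sample records are made up for illustration.

```python
def audit_examples(examples, required_keys=("instruction", "response")):
    """Split records into clean ones and (index, reason) problems."""
    clean, problems, seen = [], [], set()
    for index, example in enumerate(examples):
        # Consistency: every required field must be a non-empty string.
        missing = [
            key for key in required_keys
            if not isinstance(example.get(key), str) or not example[key].strip()
        ]
        if missing:
            problems.append((index, "missing or empty: " + ", ".join(missing)))
            continue
        # Duplicates: compare after collapsing whitespace and lowercasing.
        fingerprint = tuple(
            " ".join(example[key].split()).lower() for key in required_keys
        )
        if fingerprint in seen:
            problems.append((index, "duplicate"))
            continue
        seen.add(fingerprint)
        clean.append(example)
    return clean, problems


sample = [
    {"instruction": "Define LoRA.", "response": "A low-rank adapter method."},
    {"instruction": "define  lora.", "response": "A low-rank adapter method."},
    {"instruction": "Explain QLoRA.", "response": "   "},
    {"instruction": "No answer here"},
]
clean, problems = audit_examples(sample)
print(f"{len(clean)} clean, {len(problems)} flagged")
for index, reason in problems:
    print(f"  record {index}: {reason}")
```

Running a pass like this on the raw file makes formatting problems visible as a short list instead of as unexplained training noise.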
2. Start Small, Scale Up
Don't jump to fine-tuning a 70B model on day one:
- Start with 8B — fast iteration, quick experiments
- Validate your data pipeline — make sure formatting is correct
- Tune hyperparameters — learning rate, epochs, batch size
- Scale to 70B — once you've confirmed your approach works
3. Use the Right Learning Rate
| Method | Recommended LR |
| --- | --- |
| Full fine-tuning | 1e-5 to 5e-5 |
| LoRA | 1e-4 to 3e-4 |
| QLoRA | 1e-4 to 3e-4 |
Too high → model forgets its pre-training (catastrophic forgetting).
Too low → model barely learns your data.
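The value in the table is the peak learning rate; with a warmup ratio such as the 0.1 used in the examples, the rate actually applied changes every step. The sketch below assumes a linear warmup followed by a linear decay to zero, which is a common default shape but not necessarily the exact schedule a given trainer uses; `linear_warmup_decay` and `RECOMMENDED_LR` are hypothetical names.

```python
# Peak learning-rate ranges copied from the table above.
RECOMMENDED_LR = {
    "full": (1e-5, 5e-5),
    "lora": (1e-4, 3e-4),
    "qlora": (1e-4, 3e-4),
}


def linear_warmup_decay(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at `step`: ramp up linearly, then decay linearly to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / decay_steps


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {linear_warmup_decay(step, total, peak_lr=2e-5):.2e}")
```

The warmup phase is what keeps the first updates small, which matters most at the upper end of each recommended range.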
4. Monitor for Overfitting
With small datasets (< 5K samples), overfitting is the main risk. Watch for:
- Training loss keeps dropping but eval loss starts rising
- Model memorizes training examples instead of generalizing
Mitigations: use a validation split, early stopping, and lower the number of epochs.
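The early-stopping rule itself is simple: stop once the eval loss has failed to improve for a set number of evaluations. Here is a minimal, library-free sketch of that logic; `early_stop_epoch` is a hypothetical helper and the loss values are invented for illustration. In a real run the same idea is available through the trainer's callback mechanism rather than hand-rolled code.

```python
def early_stop_epoch(eval_losses, patience=2, min_delta=0.0):
    """Return the epoch index at which training should stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Invented eval losses: improving for three epochs, then rising (overfitting).
eval_losses = [1.90, 1.50, 1.30, 1.35, 1.40, 1.50]
stop_at = early_stop_epoch(eval_losses, patience=2)
best_epoch = min(range(len(eval_losses)), key=eval_losses.__getitem__)
print(f"stop after epoch {stop_at}, keep the checkpoint from epoch {best_epoch}")
```

Saving a checkpoint per epoch, as the examples above do, is what makes it possible to roll back to the best epoch after stopping.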
5. Leverage DGX Spark's Memory Advantage
Because you have 128GB to work with, you can afford to:
- Use larger batch sizes — improves training stability
- Keep context windows long — 4K-16K tokens instead of truncating to 512
- Skip quantization — train in FP16/BF16 for better gradient quality
- Load evaluation models alongside — run benchmarks without unloading
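To see what a longer context window buys on a particular dataset, it helps to measure how many examples a given limit would truncate. The sketch below assumes you already have per-example token counts (for instance from the length of each tokenized example); `truncation_report` is a hypothetical helper and the counts are made up.

```python
def truncation_report(token_counts, limits=(512, 4096, 16384)):
    """For each length limit, report the share of examples cut and tokens kept."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    total_tokens = sum(token_counts)
    report = {}
    for limit in limits:
        truncated = sum(1 for count in token_counts if count > limit)
        kept = sum(min(count, limit) for count in token_counts)
        report[limit] = {
            "examples_truncated": truncated / len(token_counts),
            "tokens_kept": kept / total_tokens,
        }
    return report


# Made-up token counts for four training examples.
token_counts = [300, 600, 5000, 20000]
report = truncation_report(token_counts)
for limit, stats in report.items():
    print(
        f"max length {limit:5d}: "
        f"{stats['examples_truncated']:.0%} of examples truncated, "
        f"{stats['tokens_kept']:.0%} of tokens kept"
    )
```

If most examples already fit at 4K, raising the limit further costs memory without adding training signal, so the measurement is worth doing before picking a value.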
What to Fine-Tune For
Here are the most common fine-tuning use cases that benefit from DGX Spark:
| Use Case | Recommended Model | Method | Memory |
| --- | --- | --- | --- |
| Domain-specific chatbot | Llama 3.1 8B | Full | ~45GB |
| Code assistant | DeepSeek Coder 33B | LoRA | ~40GB |
| Medical/Legal expert | Llama 3.1 70B | LoRA | ~80GB |
| Multilingual assistant | Qwen 2.5 72B | LoRA | ~85GB |
| Reasoning model | Llama 3.1 8B | Full (SFT+RLHF) | ~60GB |
| Summarization | Mistral 7B | Full | ~35GB |
From Fine-Tuned Model to Production
Once your model is trained, you'll want to serve it efficiently:
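# Export to GGUF for llama.cpp / Ollama
# (the converter is a script in the llama.cpp repository, not a pip module)
git clone https://github.com/ggml-org/llama.cpp
pip install -r llama.cpp/requirements.txt
python3 llama.cpp/convert_hf_to_gguf.py ./llama-8b-finetuned/final \
    --outtype f16 --outfile ./llama-8b-finetuned/final.gguf

# Or serve directly with vLLM
python3 -m vllm.entrypoints.openai.api_server \
    --model ./llama-8b-finetuned/final \
    --host 0.0.0.0 --port 8000

# Or push to Hugging Face Hub (Python)
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-org/your-model", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./llama-8b-finetuned/final",
    repo_id="your-org/your-model",
    repo_type="model",
)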
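Once the vLLM server is up, it can be queried over its OpenAI-compatible HTTP API. The sketch below builds such a request with the standard library only. It assumes the server runs on localhost port 8000 as started above, and that the served model name equals the path passed to `--model`. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, model="./llama-8b-finetuned/final",
                       base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,  # assumed to match the value given to --model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Summarize the key benefits of edge AI deployment.")
print(request.full_url)
# With the server running, call: print(send(request))
```

Sending a handful of held-out prompts this way is a quick smoke test that the exported model loads and responds in the format you trained for.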
Getting Started
Fine-tuning on DGX Spark takes minutes to set up:
- Get access at spark.enverge.ai
- SSH in and install your training stack
- Prepare your data — quality matters more than quantity
- Pick your method — full fine-tuning for 8-30B, LoRA for 70B+
- Train — most jobs finish in 1-4 hours
- Deploy — export and serve your custom model
The 128GB of unified memory means you can fine-tune models that simply don't fit on other single-GPU hardware. No sharding, no multi-node complexity, no compromises.
Enverge provides cloud access to NVIDIA DGX Spark hardware for AI researchers, engineers, and teams. Plans start at $0.65/hour with SSH, Docker, and the full NVIDIA AI stack pre-installed.