How to Fine-Tune an LLM on NVIDIA DGX Spark

TL;DR: DGX Spark's 128 GB of unified memory lets you fully fine-tune models up to about 30B in 16-bit precision, or adapt 70B+ models with LoRA/QLoRA, on a single device with no multi-GPU sharding. Use full fine-tuning for maximum quality when you have enough data, LoRA for large models and fast iteration, and QLoRA when memory is tight. Cloud access at $0.65/hr is ~4.5× cheaper per hour than an H100; the total cost of a job also depends on how long the run takes on each machine.

The NVIDIA DGX Spark's 128GB of unified memory changes the fine-tuning game. Models that need multi-GPU setups or aggressive quantization on other hardware can fit entirely in memory on a single DGX Spark, with no sharding. A companion guide gives a size-by-size view of what fits in 128GB.

This guide walks you through fine-tuning LLMs on DGX Spark step by step, from choosing your method to running your first training job.

Why Fine-Tune on DGX Spark?

Fine-tuning adapts a pre-trained model to your specific use case — your data, your domain, your tone. The DGX Spark is uniquely suited for this because:

128GB unified memory — load models up to ~30B in 16-bit precision, or a 70B model with 8-bit weights (70B in FP16 is ~140GB and does not fit)
Blackwell architecture (compute capability 12.1) — native NVFP4 support and 5th-gen Tensor Cores
Desktop form factor — no cloud egress fees, your data stays local
$0.65/hour cloud access — ~4.5x cheaper per hour than renting an H100

For context: fine-tuning Llama 3.1 8B with a 16K context window requires ~45GB of memory. On an H100 (80GB), that's tight. On DGX Spark (128GB), you have 83GB to spare for larger batches, longer contexts, or bigger models.
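The per-hour price gap above only makes a job cheaper if the run does not take proportionally longer. Here is a minimal sketch of that arithmetic, assuming the $0.65/hr rate and the ~4.5x ratio quoted above. The H100 rate is implied by that ratio rather than a quoted price, and the job durations are hypothetical.

```python
SPARK_RATE = 0.65                      # USD per hour, from the pricing above
PRICE_RATIO = 4.5                      # "~4.5x cheaper per hour", from the pricing above
H100_RATE = SPARK_RATE * PRICE_RATIO   # implied hourly rate, not a quoted price


def job_cost(hours: float, rate: float) -> float:
    """Total cost of a job that runs for `hours` at `rate` USD per hour."""
    return hours * rate


def spark_is_cheaper(spark_hours: float, h100_hours: float) -> bool:
    """True when the whole job costs less on DGX Spark than on the H100."""
    return job_cost(spark_hours, SPARK_RATE) < job_cost(h100_hours, H100_RATE)


if __name__ == "__main__":
    # Hypothetical durations: the same job takes 2 h on Spark and 1 h on an H100.
    print(f"Spark cost: ${job_cost(2.0, SPARK_RATE):.2f}")
    print("Spark cheaper overall:", spark_is_cheaper(2.0, 1.0))
```

Under these assumptions, Spark wins on total cost as long as the run takes less than about 4.5 times as long as it would on the H100.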

Choosing Your Fine-Tuning Method

There are three main approaches, each with different memory and quality trade-offs:

Full Fine-Tuning

Updates every parameter in the model. Produces the highest quality results but requires the most memory.

Model Size	Memory Required (FP16)	Fits on H100?	Fits on DGX Spark?
8B	~45GB	✅ (tight)	✅
13B	~65GB	⚠️ (marginal)	✅
30B	~120GB	❌	✅
70B	~280GB	❌	⚠️ (with FP8)

Best for: Models up to 30B where you have enough training data (10K+ samples) and want maximum quality.

LoRA (Low-Rank Adaptation)

Freezes the base model and trains small adapter matrices. Uses dramatically less memory while achieving 90-95% of full fine-tuning quality.

Model Size	Memory Required (LoRA)	Fits on H100?	Fits on DGX Spark?
8B	~20GB	✅	✅
70B	~80GB (8-bit base weights)	❌	✅
120B	~80GB	❌	✅
200B	~128GB	❌	✅ (tight)

Best for: Large models (70B+), limited training data, or when you want to experiment quickly.
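To make "small adapter matrices" concrete, the sketch below counts the trainable parameters LoRA adds. Each adapted weight of shape (out, in) gains two low-rank matrices, so it contributes r × (in + out) parameters. The layer dimensions are assumptions based on the published Llama 3.1 8B and 70B configurations, and the seven adapted projections are one common choice rather than a requirement.

```python
def lora_params_per_layer(hidden: int, kv_dim: int, intermediate: int, r: int) -> int:
    """Adapter parameters added to one transformer layer.

    A weight of shape (out, in) gets A (r x in) and B (out x r),
    which together hold r * (in + out) trainable parameters.
    """
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (kv_dim, hidden),
        "v_proj": (kv_dim, hidden),
        "o_proj": (hidden, hidden),
        "gate_proj": (intermediate, hidden),
        "up_proj": (intermediate, hidden),
        "down_proj": (hidden, intermediate),
    }
    return sum(r * (out_dim + in_dim) for out_dim, in_dim in shapes.values())


def lora_params(hidden: int, kv_dim: int, intermediate: int, layers: int, r: int = 16) -> int:
    """Total adapter parameters across all layers."""
    return layers * lora_params_per_layer(hidden, kv_dim, intermediate, r)


# Assumed dimensions (Llama 3.1 style, grouped-query attention with kv_dim=1024).
LLAMA_8B = dict(hidden=4096, kv_dim=1024, intermediate=14336, layers=32)
LLAMA_70B = dict(hidden=8192, kv_dim=1024, intermediate=28672, layers=80)

if __name__ == "__main__":
    for name, cfg in (("8B", LLAMA_8B), ("70B", LLAMA_70B)):
        print(f"{name}: {lora_params(**cfg, r=16):,} adapter parameters at r=16")
```

With these dimensions, rank 16 gives roughly 42M trainable parameters for the 8B model and roughly 207M for the 70B model. Both are well under 1% of the base model, which is why gradient and optimizer memory shrink so much.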

QLoRA (Quantized LoRA)

Combines 4-bit quantization with LoRA adapters. The most memory-efficient method — lets you fine-tune models that wouldn't fit even with LoRA alone. (For inference-side quantization techniques on the same hardware, see the companion post.)

Model Size	Memory Required (QLoRA)	Fits on H100?	Fits on DGX Spark?
70B	~38GB	✅	✅
120B	~65GB	✅ (tight)	✅
200B	~105GB	❌	✅

Best for: 200B+ models, memory-constrained experiments, quick prototyping.
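The three methods differ mainly in how the base weights are stored. As a rough sketch, the weights alone take parameter count times bytes per parameter. This ignores gradients, optimizer state, activations, and quantization metadata, so real runs need more than these figures (the tables above include such overhead). The function and constant names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "8-bit": 1.0, "4-bit": 0.5}

DGX_SPARK_GB = 128  # unified memory, from the article


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def weights_fit(params_billions: float, precision: str, capacity_gb: float = DGX_SPARK_GB) -> bool:
    """True if the weights alone fit; training needs extra headroom on top."""
    return weight_memory_gb(params_billions, precision) < capacity_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(70, precision)
        print(f"70B weights at {precision}: {gb:.0f} GB, fits in 128 GB: {weights_fit(70, precision)}")
```

This is why a 70B model needs 8-bit or 4-bit base weights on a 128GB machine: at 16 bits the weights alone are about 140GB.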

Step-by-Step: Fine-Tuning Llama 3.1 8B on DGX Spark

Let's walk through a complete example — full fine-tuning of Llama 3.1 8B using PyTorch and Hugging Face Transformers.

Prerequisites

SSH into your DGX Spark instance and set up the environment:

# Create a virtual environment
python3 -m venv ~/finetune-env
source ~/finetune-env/bin/activate

# Install dependencies
pip install torch transformers datasets accelerate peft trl
pip install bitsandbytes  # for quantization methods

Prepare Your Dataset

Fine-tuning quality depends on your data. Here's how to structure it for instruction tuning:

from datasets import Dataset

# Your training data — instruction/response pairs
data = [
    {
        "instruction": "Summarize the key benefits of edge AI deployment.",
        "response": "Edge AI reduces latency by processing data locally, lowers bandwidth costs by minimizing cloud transfers, improves privacy by keeping sensitive data on-device, and enables real-time decision-making in environments with limited connectivity."
    },
    {
        "instruction": "What are the main challenges of fine-tuning large language models?",
        "response": "The primary challenges include high memory requirements, risk of catastrophic forgetting, the need for high-quality domain-specific data, hyperparameter sensitivity, and the computational cost of full-parameter training on models above 30B parameters."
    },
    # Add your training examples here (aim for 1K-50K samples)
]

dataset = Dataset.from_list(data)

For real projects, load your data from JSON, CSV, or Hugging Face Hub:

from datasets import load_dataset

# From a JSON file
# split="train" returns a Dataset; without it you get a DatasetDict
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")

# From Hugging Face Hub
dataset = load_dataset("your-org/your-dataset", split="train")
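Before loading a JSONL file like the one above, it can help to check that every line parses and has the fields the training code expects. The following is a minimal standard-library sketch. The helper names are hypothetical, and the required keys are assumed to match the instruction/response format used in this guide.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("instruction", "response")  # fields used in this guide's examples


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means the file looks fine."""
    problems = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            for key in REQUIRED_KEYS:
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((lineno, f"missing or empty '{key}'"))
    return problems
```

Calling `validate_jsonl("my_training_data.jsonl")` before `load_dataset` surfaces formatting problems early, while they are still cheap to fix.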

Option A: Full Fine-Tuning

This loads the full model in 16-bit precision (torch_dtype="auto" keeps the checkpoint's dtype, which is BF16 for Llama 3.1) and trains every parameter:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Load model — fits comfortably in 128GB
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Format data for instruction tuning.
# In the Llama 3 chat format each header is followed by a blank line ("\n\n") before the content.
def format_prompt(example):
    return {
        "text": f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{example['response']}<|eot_id|>"
    }

formatted = dataset.map(format_prompt)

# Training configuration
training_args = TrainingArguments(
    output_dir="./llama-8b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,      # DGX Spark can handle batch=4 easily
    gradient_accumulation_steps=4,       # Effective batch size: 16
    learning_rate=2e-5,
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                           # Blackwell supports BF16 natively
    optim="adamw_torch",
    max_grad_norm=1.0,
)

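# Train. These SFTTrainer keyword arguments match older TRL releases. Newer releases
# take dataset_text_field and the sequence-length limit through SFTConfig and rename
# tokenizer= to processing_class=, so adjust to the version you have installed.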
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=4096,
)

trainer.train()
trainer.save_model("./llama-8b-finetuned/final")

Memory usage: ~45GB — leaving 83GB free on DGX Spark. Training time: ~2 hours for 10K samples, 3 epochs.
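The batch settings above determine how many optimizer steps the run takes, which drives both training time and the warmup length. The sketch below reproduces that arithmetic for a single device. The 10K-sample figure comes from the timing estimate above, and exact rounding may differ slightly between Trainer versions.

```python
import math


def training_steps(num_samples: int, per_device_batch: int, grad_accum: int,
                   epochs: int, warmup_ratio: float) -> dict:
    """Derive step counts from the TrainingArguments used above (single device)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    plan = training_steps(num_samples=10_000, per_device_batch=4,
                          grad_accum=4, epochs=3, warmup_ratio=0.1)
    for name, value in plan.items():
        print(f"{name}: {value}")
```

For 10K samples this works out to an effective batch of 16, 625 steps per epoch, and 1,875 optimizer steps in total, of which 188 are warmup.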

Option B: LoRA Fine-Tuning (for 70B+ Models)

For larger models, LoRA lets you train efficiently by only updating small adapter layers:

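from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

model_name = "meta-llama/Llama-3.1-70B-Instruct"

# Load model: 70B in FP16 is ~140GB, so load the base weights in 8-bit to fit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    # bitsandbytes 8-bit is INT8 (LLM.int8()), not FP8: ~70GB of weights
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
# Recommended PEFT preparation step before training on a quantized base model
model = prepare_model_for_kbit_training(model)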
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    r=16,                        # Rank — higher = more capacity, more memory
    lora_alpha=32,               # Scaling factor
    target_modules=[             # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on these seven projections: ~207M trainable of ~70B total (about 0.3%)

training_args = TrainingArguments(
    output_dir="./llama-70b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,              # LoRA typically uses higher LR
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="adamw_torch",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer.train()
model.save_pretrained("./llama-70b-lora/final")

Memory usage: ~80GB. On a single device that means DGX Spark (128GB) or H200 (141GB); it does not fit on an 80GB H100.

Option C: Using Unsloth (2x Faster)

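Unsloth is a training library optimized for NVIDIA hardware, including DGX Spark. It can deliver up to 2x faster training with 60% less memory. It is not part of the prerequisites above, so install it first with pip install unsloth: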

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Unsloth handles model loading with built-in optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=False,          # Keep 16-bit weights; 8B fits easily on DGX Spark
)

# Apply LoRA with Unsloth's optimized kernels
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

training_args = TrainingArguments(
    output_dir="./llama-8b-unsloth",
    num_train_epochs=3,
    per_device_train_batch_size=8,    # Unsloth's efficiency allows larger batches
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    warmup_steps=10,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=4096,
)

trainer.train()
model.save_pretrained("./llama-8b-unsloth/final")

Memory usage: ~18GB with LoRA, ~30GB for full fine-tuning. Training time: ~1 hour for 10K samples — roughly 2x faster than vanilla PyTorch.

Tips for Better Results

1. Data Quality > Data Quantity

A clean dataset of 5,000 high-quality examples will often outperform a noisy dataset of 100,000. Focus on:

Consistency — same format for every example
Relevance — only include examples from your target domain
Diversity — cover the range of inputs your model will see in production
Accuracy — every response should be one you'd be happy to ship
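A first pass at the consistency and accuracy checks above can be automated. The sketch below normalizes whitespace, drops examples that are too short to be useful, and removes exact duplicates. The `min_chars` threshold is a hypothetical starting point, not a recommended value, and the field names follow this guide's instruction/response format.

```python
def clean_examples(examples, min_chars=10):
    """Normalize whitespace, drop very short examples, and remove duplicates."""
    seen = set()
    kept = []
    for example in examples:
        # Collapse runs of whitespace and trim both ends.
        instruction = " ".join(str(example.get("instruction", "")).split())
        response = " ".join(str(example.get("response", "")).split())
        if len(instruction) < min_chars or len(response) < min_chars:
            continue
        key = (instruction.lower(), response.lower())
        if key in seen:
            continue  # case-insensitive exact duplicate
        seen.add(key)
        kept.append({"instruction": instruction, "response": response})
    return kept


if __name__ == "__main__":
    raw = [
        {"instruction": "Explain what LoRA does.", "response": "It trains small adapter matrices."},
        {"instruction": "explain  what LoRA does.", "response": "It trains small adapter matrices. "},
        {"instruction": "Hi", "response": "Hello"},
    ]
    print(len(clean_examples(raw)), "of", len(raw), "examples kept")
```

Relevance and diversity still need human review or a domain-specific filter; this only catches the mechanical problems.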

2. Start Small, Scale Up

Don't jump to fine-tuning a 70B model on day one:

Start with 8B — fast iteration, quick experiments
Validate your data pipeline — make sure formatting is correct
Tune hyperparameters — learning rate, epochs, batch size
Scale to 70B — once you've confirmed your approach works

3. Use the Right Learning Rate

Method	Recommended LR
Full fine-tuning	1e-5 to 5e-5
LoRA	1e-4 to 3e-4
QLoRA	1e-4 to 3e-4

Too high → model forgets its pre-training (catastrophic forgetting). Too low → model barely learns your data.
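The value in the table is the peak learning rate, not the rate used at every step. Assuming the Trainer's default linear schedule with warmup, the rate ramps up from zero, reaches the peak when warmup ends, and then decays linearly back to zero. This is a plain-Python sketch of that shape, not the library's implementation.

```python
def linear_schedule_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to the peak.
        return peak_lr * (step / max(1, warmup_steps))
    # Decay: fall linearly from the peak to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    # Example values: full fine-tuning peak of 2e-5 over 1,875 steps with 188 warmup steps.
    for step in (0, 94, 188, 1000, 1875):
        print(f"step {step:>4}: lr = {linear_schedule_lr(step, 2e-5, 188, 1875):.2e}")
```

A short warmup is part of what makes the higher LoRA rates safe, because the first updates are taken at a fraction of the peak.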

4. Monitor for Overfitting

With small datasets (< 5K samples), overfitting is the main risk. Watch for:

Training loss keeps dropping but eval loss starts rising
Model memorizes training examples instead of generalizing

Mitigations: use a validation split, early stopping, and lower the number of epochs.
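The early-stopping rule can be sketched in a few lines: stop once the eval loss has failed to improve for a set number of consecutive evaluations. The function below is an illustration in plain Python, with a hypothetical `patience` default. In a real run, the `EarlyStoppingCallback` shipped with Transformers plays this role.

```python
def early_stop_index(eval_losses, patience=2, min_delta=0.0):
    """Return the index of the evaluation at which training would stop, or None.

    Training stops after `patience` consecutive evaluations without an
    improvement of more than `min_delta` over the best loss seen so far.
    """
    best = float("inf")
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return index
    return None


if __name__ == "__main__":
    # Hypothetical eval losses: improving for three evaluations, then rising.
    losses = [1.20, 1.00, 0.90, 0.95, 1.00, 1.10]
    print("stop at evaluation index:", early_stop_index(losses, patience=2))
```

Whichever mechanism you use, keep the checkpoint from the best evaluation rather than the last one.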

5. Leverage DGX Spark's Memory Advantage

Because you have 128GB to work with, you can afford to:

Use larger batch sizes — improves training stability
Keep context windows long — 4K-16K tokens instead of truncating to 512
Skip quantization — train in FP16/BF16 for better gradient quality
Load evaluation models alongside — run benchmarks without unloading

What to Fine-Tune For

Here are the most common fine-tuning use cases that benefit from DGX Spark:

Use Case	Recommended Model	Method	Memory
Domain-specific chatbot	Llama 3.1 8B	Full	~45GB
Code assistant	DeepSeek Coder 33B	LoRA	~40GB
Medical/Legal expert	Llama 3.1 70B	LoRA	~80GB
Multilingual assistant	Qwen 2.5 72B	LoRA	~85GB
Reasoning model	Llama 3.1 8B	Full (SFT+RLHF)	~60GB
Summarization	Mistral 7B	Full	~35GB

From Fine-Tuned Model to Production

Once your model is trained, you'll want to serve it efficiently:

# Export to GGUF for llama.cpp / Ollama (run from a clone of the llama.cpp repo)
python3 convert_hf_to_gguf.py ./llama-8b-finetuned/final --outtype f16 --outfile llama-8b-finetuned.gguf

# Or serve directly with vLLM
python3 -m vllm.entrypoints.openai.api_server \
    --model ./llama-8b-finetuned/final \
    --host 0.0.0.0 --port 8000

# Or push to Hugging Face Hub (this part is Python, not shell)
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
    folder_path="./llama-8b-finetuned/final",
    repo_id="your-org/your-model",
    repo_type="model",
)
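Once the vLLM server above is running, any OpenAI-compatible client can query it. The sketch below uses only the standard library. It assumes the server listens on localhost:8000 as configured above and that the model is served under the same path passed to --model, which is vLLM's default naming to my knowledge. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="./llama-8b-finetuned/final",
                       base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the server's OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,  # assumed to match the --model value given to vLLM
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Summarize the key benefits of edge AI deployment."))` is a quick way to compare the fine-tuned model's answers against your training data.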

Getting Started

Fine-tuning on DGX Spark takes minutes to set up — start with how to rent DGX Spark if you don't have a box yet:

Get access at spark.enverge.ai
SSH in and install your training stack
Prepare your data — quality matters more than quantity
Pick your method — full fine-tuning for 8-30B, LoRA for 70B+
Train — most jobs finish in 1-4 hours
Deploy — export and serve your custom model

The 128GB of unified memory means you can fine-tune models that simply don't fit on other single-GPU hardware. No sharding, no multi-node complexity, no compromises. The same economics that make Spark strong for training also favor running research experiments where iteration count beats peak FLOPS.


FAQ

Which fine-tuning method should I use on DGX Spark?

Use full fine-tuning for 8B–30B when you want maximum quality and have enough data. Use LoRA for 70B+ or fast iteration; use QLoRA when memory is tight or you're prototyping on 200B-class models.

Why fine-tune on DGX Spark instead of an H100?

128 GB unified memory loads large models without multi-GPU sharding — a 70B LoRA run needs ~80 GB, which doesn't fit on an 80 GB H100. Cloud access at $0.65/hr is also ~4.5× cheaper per hour.

What learning rate should I use for LoRA vs full fine-tuning?

Full fine-tuning: 1e-5 to 5e-5. LoRA and QLoRA: 1e-4 to 3e-4. Too high causes catastrophic forgetting; too low barely adapts to your data.

Enverge provides cloud access to NVIDIA DGX Spark hardware for AI researchers, engineers, and teams. Plans start at $0.65/hour with SSH, Docker, and the full NVIDIA AI stack pre-installed.