Learning Fine-Tuning on NVIDIA DGX Spark - Part 1

If you are working with NVIDIA DGX Spark (or your own rig) and trying to do anything beyond toy experimentation, you will very quickly run into challenges.
This article documents the steps, decisions, and some of the failures I hit while getting NVIDIA AI Workbench running cleanly on DGX Spark with a Blackwell-capable PyTorch base image.
By the end of this article, you will have:
AI Workbench running on DGX Spark (This is pretty straightforward)
A CUDA-enabled PyTorch environment that supports sm_121 (This is where I had some issues)
A clean, reproducible base image
A foundation suitable for fine-tuning modern models
What I Was Trying to Accomplish
The requirements were non-negotiable:
Use NVIDIA AI Workbench as the development control plane
Run on DGX Spark (GB10, Blackwell, sm_121)
Fine-tune modern models (starting with Phi-4)
Use CUDA-enabled PyTorch, not CPU-only fallbacks
Avoid silent GPU architecture incompatibilities
Failure #1: Generic Python Base Images
Starting from a standard Python base image fails exactly how you would expect:
PyTorch installs as +cpu
torch.cuda.is_available() returns False
AI Workbench does not magically make PyTorch CUDA-aware. The base image determines everything.
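The quickest tell is the local version suffix on the wheel itself: CPU-only builds ship as something like 2.x.y+cpu. A minimal sketch of that check (the helper name and example version strings are mine, for illustration):

```python
# Minimal sketch: detect a CPU-only PyTorch wheel from its version string.
# CPU-only wheels carry a "+cpu" local version suffix (e.g. "2.5.1+cpu"),
# while CUDA wheels carry a "+cuXXX" suffix or none at all.
def is_cpu_only_build(version: str) -> bool:
    return version.endswith("+cpu")

# Inside the container you would check the real install:
#   import torch
#   print(torch.__version__)          # ends in "+cpu" on a generic base image
#   print(torch.cuda.is_available())  # False on a CPU-only build

print(is_cpu_only_build("2.5.1+cpu"))    # True
print(is_cpu_only_build("2.5.1+cu128"))  # False
```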
Failure #2: Built-in Workbench PyTorch Images
The next logical step was to use NVIDIA-provided PyTorch base images directly from AI Workbench.
Symptoms:
CUDA appears available
GPU is visible
Warnings about unsupported architecture: sm_121
Root cause:
These images were compiled for sm_80, sm_86, and sm_90
Blackwell (sm_120/sm_121) support was not present
At this point the issue was not Docker or Workbench; it was the PyTorch build target.
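You can confirm the build target directly: torch.cuda.get_arch_list() reports the SM architectures baked into the wheel. A sketch of the check (the helper name is mine, and the example arch lists are illustrative, not a dump from any specific image):

```python
# Minimal sketch: check whether a PyTorch build was compiled for Blackwell.
# torch.cuda.get_arch_list() returns the SM targets the wheel was built with;
# this helper just scans that list for a Blackwell-class entry.
def supports_blackwell(arch_list: list[str]) -> bool:
    return any(arch in ("sm_120", "sm_121", "compute_120") for arch in arch_list)

# Inside the container:
#   import torch
#   print(torch.cuda.get_arch_list())

print(supports_blackwell(["sm_80", "sm_86", "sm_90"]))    # False -> warnings
print(supports_blackwell(["sm_90", "sm_100", "sm_120"]))  # True
```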
Choosing a PyTorch Base That Supports Blackwell
Once the constraint was clear, the solution was straightforward.
NVIDIA NGC PyTorch Container
The working base image: nvcr.io/nvidia/pytorch:25.12-py3
Why this image:
Blackwell-capable (sm_120, compatible with sm_121)
CUDA 13.x user-space
NVIDIA-validated for DGX-class systems
From a GPU and framework standpoint, this is the correct foundation.
Unfortunately, AI Workbench adds another constraint, which led to some trial and error.
Why NGC Images Fail in AI Workbench by Default
NGC images are valid Docker images. They are not valid AI Workbench base environments.
AI Workbench validates image metadata before pulling layers. If required labels are missing, the image is rejected immediately with errors such as:
invalid base environment (invalid OS)
no OSDistro set
no OSDistroRelease set
You must explicitly provide the metadata Workbench expects.
The Fix: A Minimal Wrapper Image
This is the cleanest solution and the one that scales. No recompiles. No rebuilding PyTorch. Just metadata.
Inspect the Base OS
Inside the NGC container:
OS: linux
Distro: ubuntu
Release: 24.04
Verified via /etc/os-release.
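One way to read those fields without starting an interactive shell is to cat /etc/os-release straight out of the image (docker run --rm nvcr.io/nvidia/pytorch:25.12-py3 cat /etc/os-release). The extraction itself is just two fields; a sketch against a sample os-release payload (the sample content below is illustrative, matching what I observed):

```shell
# Pull the two fields Workbench needs (ID and VERSION_ID) out of
# /etc/os-release content. Sample payload stands in for the real file.
os_release='PRETTY_NAME="Ubuntu 24.04.1 LTS"
ID=ubuntu
VERSION_ID="24.04"'

distro=$(printf '%s\n' "$os_release" | sed -n 's/^ID=//p')
release=$(printf '%s\n' "$os_release" | sed -n 's/^VERSION_ID="\(.*\)"/\1/p')
printf 'distro=%s release=%s\n' "$distro" "$release"
# prints: distro=ubuntu release=24.04
```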
Wrapper Dockerfile with Workbench Metadata
The wrapper image does one thing only; it adds the labels required by AI Workbench.
FROM nvcr.io/nvidia/pytorch:25.12-py3
LABEL com.nvidia.workbench.schema-version="v2" \
com.nvidia.workbench.name="NGC PyTorch 25.12 (Workbench)" \
com.nvidia.workbench.description="Wrapper for nvcr.io/nvidia/pytorch:25.12-py3 with Workbench metadata" \
com.nvidia.workbench.image-version="25.12.1" \
com.nvidia.workbench.cuda-version="13.0" \
com.nvidia.workbench.os="linux" \
com.nvidia.workbench.os-distro="ubuntu" \
com.nvidia.workbench.os-distro-release="24.04" \
com.nvidia.workbench.programming-languages="python3"
No software changes are introduced.
Building and Verifying the Wrapper Image
docker build --no-cache -t nvwb-pytorch-25.12:latest .
Verify that the labels are present:
docker inspect nvwb-pytorch-25.12:latest \
--format '{{range $k,$v := .Config.Labels}}{{println $k "=" $v}}{{end}}' \
| grep 'com.nvidia.workbench'
If these labels are missing or incorrect, AI Workbench will reject the image.
Publishing the Image to GitHub Container Registry (GHCR)
Why GHCR
AI Workbench needs a container URL it can pull, and I chose GHCR. Any registry works, as long as the image can be pulled from a valid URL.
Local images are ignored
GHCR works reliably for personal and public images
Authentication Requirements
I used a classic GitHub Personal Access Token. It might be better to use a fine-grained token, but since this is a local development setup, I didn't think it was necessary to determine the exact permissions needed and kept it basic.
read:packages
write:packages
echo "$GH_TOKEN" | docker login ghcr.io -u <username> --password-stdin
Docker Credential Helper Pitfall
My Docker config initially contained:
"credHelpers": {
"ghcr.io": "workbench"
}
This silently overrides standard Docker authentication and causes:
Failed pushes
wb-svc errors
AI Workbench unable to read image metadata
Fix:
Remove the credHelpers entry for ghcr.io
Allow Docker to use auths normally
This is subtle and easy to miss.
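For reference, after removing the helper entry and re-running docker login, the relevant part of ~/.docker/config.json should look roughly like this (a sketch; the exact contents of the ghcr.io entry depend on how your Docker install stores credentials):

```json
{
  "auths": {
    "ghcr.io": {}
  }
}
```

The key point is the absence of a credHelpers mapping for ghcr.io, so the auths entry written by docker login is actually used.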
Tagging and Pushing (Pinned Tags Only)
Avoid latest.
docker tag nvwb-pytorch-25.12:latest ghcr.io/brianbaldock/nvwb-pytorch-25.12:25.12.1
docker push ghcr.io/brianbaldock/nvwb-pytorch-25.12:25.12.1
Why latest Causes Workbench Validation Failures
AI Workbench validates metadata before pulling layers.
With latest:
Multiple digests
Registry-side caching
Inconsistent metadata resolution
Pinned tags remove ambiguity and resolve validation failures immediately.
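If you script the push, a tiny guard keeps latest from sneaking back in. This is my own helper, not a Workbench or Docker feature, and it only handles simple references without registry ports:

```shell
# Refuse to push a Workbench base image that is untagged or tagged "latest".
check_tag() {
  tag="${1##*:}"               # text after the last ":" in the reference
  case "$tag" in
    latest|"$1")               # "$1" case: no colon at all, i.e. untagged
      echo "refusing '$1': pin an explicit version tag" >&2
      return 1 ;;
    *)
      echo "ok: $1" ;;
  esac
}

check_tag ghcr.io/brianbaldock/nvwb-pytorch-25.12:25.12.1
# prints: ok: ghcr.io/brianbaldock/nvwb-pytorch-25.12:25.12.1
```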
Creating the AI Workbench Project
In AI Workbench:
New Project → Custom Container
Container URL:
ghcr.io/<username>/nvwb-pytorch-25.12:25.12.1
Result:
Base environment validated
Project created successfully
GPU visible inside the container
Final Verification
Inside the container:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
Expected output:
CUDA-enabled PyTorch
NVIDIA GB10
No architecture warnings
Key Takeaways
AI Workbench is metadata-driven, not just Docker-driven
NGC containers require wrapper labels
Blackwell requires the correct PyTorch build target
GHCR authentication is not Git authentication
Docker credHelpers can silently override auths
latest is unsafe for Workbench base images
Wrapper images are the cleanest long-term approach
What’s Next
This article establishes a clean, reproducible base.
Next articles in this series will build on it:
Fine-tuning Phi-4 on DGX Spark
LoRA vs QLoRA on Blackwell
Serving models with vLLM
Multi-container AI Workbench workflows