
Learning Fine-Tuning on Nvidia DGX Spark - Part 1


If you are working with NVIDIA DGX Spark (or your own rig) and trying to do anything beyond toy experimentation, you will very quickly run into challenges.

👉
Setting up your own rig? See my previous article to get started: Deploying Local AI Inference with vLLM and ChatUI in Docker

This article documents the steps, decisions, and some of the failures I hit while getting NVIDIA AI Workbench running cleanly on DGX Spark with a Blackwell-capable PyTorch base image.

By the end of this article, you will have:

  • AI Workbench running on DGX Spark (This is pretty straightforward)

  • A CUDA-enabled PyTorch environment that supports sm_121 (This is where I had some issues)

  • A clean, reproducible base image

  • A foundation suitable for fine-tuning modern models

What I Was Trying to Accomplish

The requirements were non-negotiable:

  • Use NVIDIA AI Workbench as the development control plane

  • Run on DGX Spark (GB10, Blackwell, sm_121)

  • Fine-tune modern models (starting with Phi-4)

  • Use CUDA-enabled PyTorch, not CPU-only fallbacks

  • Avoid silent GPU architecture incompatibilities

Failure #1: Generic Python Base Images

Starting from a standard Python base image fails exactly as you would expect:

  • PyTorch installs as +cpu

  • torch.cuda.is_available() returns False

AI Workbench does not magically make PyTorch CUDA-aware. The base image determines everything.
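In practice, the quickest tell is the wheel's local version suffix. A minimal sketch (the helper name is mine, not an AI Workbench or PyTorch API) that flags a CPU-only install from `torch.__version__`:

```python
# Hypothetical helper: detect a CPU-only PyTorch wheel from its version
# string before wasting time on a training run that will never touch the GPU.
def is_cpu_only_build(torch_version: str) -> bool:
    # CPU-only wheels carry a "+cpu" local version suffix, e.g. "2.5.1+cpu";
    # CUDA wheels use "+cuXYZ", e.g. "2.5.1+cu124".
    return torch_version.partition("+")[2] == "cpu"

print(is_cpu_only_build("2.5.1+cpu"))    # True  -> the CPU fallback described above
print(is_cpu_only_build("2.5.1+cu124"))  # False -> a CUDA-enabled build
```

Pair this with `torch.cuda.is_available()` and you catch the failure at import time rather than mid-run.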

Failure #2: Built-in Workbench PyTorch Images

The next logical step was to use NVIDIA-provided PyTorch base images directly from AI Workbench.

Symptoms:

  • CUDA appears available

  • GPU is visible

  • Warnings about unsupported architecture: sm_121

Root cause:

  • These images were compiled for sm_80, sm_86, and sm_90

  • Blackwell (sm_120 / sm_121) support was not present

At this point the issue was not Docker or Workbench; it was the PyTorch build target.
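You can make this failure mode explicit with a check against the compiled architecture list. A sketch, with the list hard-coded to mirror the failing images; in a live environment it would come from `torch.cuda.get_arch_list()`:

```python
# Sketch: check whether a PyTorch build's compiled targets cover Blackwell.
def covers_blackwell(arch_list: list[str]) -> bool:
    return any(arch in ("sm_120", "sm_121") for arch in arch_list)

workbench_image_archs = ["sm_80", "sm_86", "sm_90"]  # the built-in Workbench images
print(covers_blackwell(workbench_image_archs))       # False -> the warning above
print(covers_blackwell(["sm_90", "sm_120"]))         # True  -> Blackwell-capable
```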

Choosing a PyTorch Base That Supports Blackwell

Once the constraint was clear, the solution was straightforward.

NVIDIA NGC PyTorch Container

The working base image:

nvcr.io/nvidia/pytorch:25.12-py3

Why this image:

  • Blackwell-capable (sm_120, compatible with sm_121)

  • CUDA 13.x user-space

  • NVIDIA-validated for DGX-class systems

From a GPU and framework standpoint, this is the correct foundation.

Unfortunately, AI Workbench adds another constraint, which led to some trial and error.

Why NGC Images Fail in AI Workbench by Default

NGC images are valid Docker images. They are not valid AI Workbench base environments.

AI Workbench validates image metadata before pulling layers. If required labels are missing, the image is rejected immediately with errors such as:

  • invalid base environment (invalid OS)

  • no OSDistro set

  • no OSDistroRelease set

🔑
Key point: NGC containers are not automatically Workbench-compatible. Lesson learned.

You must explicitly provide the metadata Workbench expects.

The Fix: A Minimal Wrapper Image

This is the cleanest solution and the one that scales. No recompiles. No rebuilding PyTorch. Just metadata.

Inspect the Base OS

Inside the NGC container:

  • OS: linux

  • Distro: ubuntu

  • Release: 24.04

Verified via /etc/os-release.
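The values Workbench needs map directly onto `/etc/os-release` fields. A small sketch that parses the file's key=value format, with sample content standing in for the real file inside the NGC container:

```python
# Sketch: parse /etc/os-release content to pull the values the
# Workbench labels need (distro and release).
def parse_os_release(text: str) -> dict[str, str]:
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value.strip('"')  # values may or may not be quoted
    return fields

sample = 'ID=ubuntu\nVERSION_ID="24.04"\nPRETTY_NAME="Ubuntu 24.04 LTS"'
info = parse_os_release(sample)
print(info["ID"], info["VERSION_ID"])  # ubuntu 24.04
```

`ID` and `VERSION_ID` feed the `os-distro` and `os-distro-release` labels in the wrapper Dockerfile below.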

Wrapper Dockerfile with Workbench Metadata

The wrapper image does one thing only: it adds the labels required by AI Workbench.

FROM nvcr.io/nvidia/pytorch:25.12-py3

LABEL com.nvidia.workbench.schema-version="v2" \
      com.nvidia.workbench.name="NGC PyTorch 25.12 (Workbench)" \
      com.nvidia.workbench.description="Wrapper for nvcr.io/nvidia/pytorch:25.12-py3 with Workbench metadata" \
      com.nvidia.workbench.image-version="25.12.1" \
      com.nvidia.workbench.cuda-version="13.0" \
      com.nvidia.workbench.os="linux" \
      com.nvidia.workbench.os-distro="ubuntu" \
      com.nvidia.workbench.os-distro-release="24.04" \
      com.nvidia.workbench.programming-languages="python3"

No software changes are introduced.

Building and Verifying the Wrapper Image

docker build --no-cache -t nvwb-pytorch-25.12:latest .

Verify that the labels are present:

docker inspect nvwb-pytorch-25.12:latest \
  --format '{{range $k,$v := .Config.Labels}}{{println $k "=" $v}}{{end}}' \
  | grep 'com.nvidia.workbench'

If these labels are missing or incorrect, AI Workbench will reject the image.
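The same check can be expressed as a plain function, which is handy in CI before an image is published. The required-key list below reflects the wrapper Dockerfile above; this mirrors the behavior I observed, not an official validation API:

```python
# Sketch: the label check Workbench effectively performs before pulling layers.
REQUIRED_LABELS = [
    "com.nvidia.workbench.schema-version",
    "com.nvidia.workbench.os",
    "com.nvidia.workbench.os-distro",
    "com.nvidia.workbench.os-distro-release",
]

def missing_labels(image_labels: dict[str, str]) -> list[str]:
    # Treat absent or empty labels the same way: both fail validation.
    return [key for key in REQUIRED_LABELS if not image_labels.get(key)]

labels = {
    "com.nvidia.workbench.schema-version": "v2",
    "com.nvidia.workbench.os": "linux",
}
print(missing_labels(labels))  # the two os-distro labels are still missing
```

Feed it the dict from `docker inspect --format '{{json .Config.Labels}}'` and an empty result means the metadata gate should pass.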

Publishing the Image to GitHub Container Registry (GHCR)

Why GHCR

  • AI Workbench needs a container URL it can pull from; I chose GHCR, but any registry that serves the image from a valid URL works

  • Local images are ignored

  • GHCR works reliably for personal and public images

Authentication Requirements

I used a classic GitHub Personal Access Token with the following scopes. A fine-grained token would be tighter, but since this is a local development setup I kept it basic rather than working out the minimal permissions:

  • read:packages

  • write:packages

echo "$GH_TOKEN" | docker login ghcr.io -u <username> --password-stdin

Docker Credential Helper Pitfall

My Docker config initially contained:

"credHelpers": {
  "ghcr.io": "workbench"
}

This silently overrides standard Docker authentication and causes:

  • Failed pushes

  • wb-svc errors

  • AI Workbench unable to read image metadata

Fix:

  • Remove the credHelpers entry for ghcr.io

  • Allow Docker to use auths normally

This is subtle and easy to miss.
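If you'd rather script the fix than hand-edit `~/.docker/config.json`, a minimal sketch of the cleanup (the function and defaults are mine, applied to the config shape shown above):

```python
import json

# Sketch: drop the ghcr.io credHelpers entry from a Docker config dict,
# mirroring the manual fix described above.
def remove_cred_helper(config: dict, registry: str = "ghcr.io") -> dict:
    helpers = config.get("credHelpers", {})
    helpers.pop(registry, None)
    if not helpers:
        config.pop("credHelpers", None)  # leave a clean config behind
    return config

config = json.loads('{"credHelpers": {"ghcr.io": "workbench"}, "auths": {}}')
print(remove_cred_helper(config))  # {'auths': {}}
```

With the entry gone, `docker login ghcr.io` writes to `auths` and pushes work again.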

Tagging and Pushing (Pinned Tags Only)

Avoid latest.

docker tag nvwb-pytorch-25.12:latest ghcr.io/brianbaldock/nvwb-pytorch-25.12:25.12.1
docker push ghcr.io/brianbaldock/nvwb-pytorch-25.12:25.12.1

Why latest Causes Workbench Validation Failures

AI Workbench validates metadata before pulling layers.

With latest:

  • Multiple digests

  • Registry-side caching

  • Inconsistent metadata resolution

Pinned tags remove ambiguity and resolve validation failures immediately.
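A simple guard against the ambiguous references described above can be enforced in a build script. A heuristic sketch (it checks the tag string only, not registry-side digests):

```python
# Sketch: reject untagged or :latest image references before pushing.
def is_pinned(image_ref: str) -> bool:
    _, _, tag = image_ref.rpartition(":")
    # No colon, or a "tag" containing "/", means no real tag was given
    # (e.g. the colon found belonged to a registry port).
    return bool(tag) and "/" not in tag and tag != "latest"

print(is_pinned("ghcr.io/user/nvwb-pytorch-25.12:25.12.1"))  # True
print(is_pinned("ghcr.io/user/nvwb-pytorch-25.12:latest"))   # False
print(is_pinned("ghcr.io/user/nvwb-pytorch-25.12"))          # False
```

Pinning by digest (`@sha256:...`) is stricter still, but a versioned tag was enough to satisfy Workbench validation here.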

Creating the AI Workbench Project

In AI Workbench:

  • New Project → Custom Container

  • Container URL:

ghcr.io/<username>/nvwb-pytorch-25.12:25.12.1

Result:

  • Base environment validated

  • Project created successfully

  • GPU visible inside the container


Final Verification

Inside the container:

import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

Expected output:

  • CUDA-enabled PyTorch

  • NVIDIA GB10

  • No architecture warnings

Key Takeaways

  • AI Workbench is metadata-driven, not just Docker-driven

  • NGC containers require wrapper labels

  • Blackwell requires the correct PyTorch build target

  • GHCR authentication is not Git authentication

  • Docker credHelpers can silently override auth

  • latest is unsafe for Workbench base images

  • Wrapper images are the cleanest long-term approach


What’s Next

This article establishes a clean, reproducible base.

Next articles in this series will build on it:

  • Fine-tuning Phi-4 on DGX Spark

  • LoRA vs QLoRA on Blackwell

  • Serving models with vLLM

  • Multi-container AI Workbench workflows
