ML Environment Setup with PyTorch

VerifiedCaution

Guides creation of isolated ML environments with PyTorch, automatically detecting hardware (NVIDIA, AMD, CPU) to install the correct build. Useful when starting new ML projects or troubleshooting GPU/environment compatibility.

Sby Skills Guide Bot
DevelopmentIntermediate
1106/2/2026
Claude Code
#ml-environment#pytorch#gpu-setup#hardware-detection#python-setup

Recommended for

Our review

This skill sets up an isolated ML environment with PyTorch, auto-detecting hardware (NVIDIA/AMD GPU or CPU) and installing appropriate builds.

Strengths

  • Automatic detection and configuration of GPU or CPU hardware
  • Support for NVIDIA (including Blackwell), AMD RDNA, and Strix Halo GPUs
  • Reproducible installation via bash scripts and pinned PyTorch version

Limitations

  • Does not handle manual GTT memory configuration for Strix Halo
  • Pinned PyTorch version (2.10.0) may require updates for newer releases
  • Depends on correctly installed GPU drivers
When to use it

Use this skill to quickly bootstrap a new ML project or troubleshoot PyTorch environment issues related to GPU detection.

When not to use it

Avoid this skill if you need advanced custom configuration (other frameworks, specific dependencies) or if you work in an already configured environment with package managers like Anaconda.

Security analysis

Caution
Quality score85/100

The skill uses Bash to execute setup scripts that require elevated privileges (sudo usermod) and download/install packages, which could modify the system. No destructive or exfiltration commands are present, but the scripts' sources aren't transparent to the auditor, making the risk moderate.

Findings
  • Running bash scripts from user's skills directory without user review, which could perform arbitrary system modifications (installing packages, modifying user groups).

Examples

Create new ML project
Set up a new ML project called 'my-ml-project' with PyTorch in ~/projects. Auto-detect my hardware and install the right version of PyTorch.
Fix GPU detection
I'm trying to use PyTorch but it doesn't see my GPU. Can you detect my hardware and install the correct PyTorch build for my NVIDIA RTX 4090?
Initialize Strix Halo environment
Set up a PyTorch environment for my AMD Strix Halo (gfx1151) system. Use ROCm 7 and remember to add me to render/video groups.

name: ml-env description: Set up ML environments with PyTorch and auto-detect hardware. Use this when creating new ML projects, setting up PyTorch, or troubleshooting GPU/environment issues. Guides you through creating isolated project environments with hardware-specific PyTorch builds (NVIDIA/AMD/CPU). allowed-tools: Read, Bash, WebFetch activation-precedence: high

ML Environment Setup & Troubleshooting

This skill helps you create and manage isolated ML environments with PyTorch. It auto-detects your hardware (NVIDIA GPU, AMD GPU, or CPU) and installs the appropriate PyTorch build.

Creating a New ML Project

I can help you set up a complete ML project with PyTorch in seconds. Here's what I'll do:

  1. Create your project directory
  2. Create a .gitignore for ML files
  3. Run hardware detection
  4. Install PyTorch with the right backend
  5. Install ML libraries
  6. Validate everything works

To get started, tell me:

  • Project name/path where you want it created
  • I'll handle the rest!

Interactive Setup Process

When you ask me to set up a new ML project, I will:

# 1. Create the project directory
mkdir -p ~/projects/my-ml-project

# 2. Create .gitignore
# (ignores ml-env/, data/, models/, logs/, etc.)

# 3. Run the setup script
bash ~/.claude/skills/ml-env/scripts/setup-universal.sh

# 4. Show you the results

The setup script will:

  • Detect your GPU (NVIDIA with nvidia-smi, AMD with rocminfo, or fallback to CPU)
  • Ask questions for special hardware (Blackwell GPUs, Strix Halo, etc.)
  • Install Python 3.13 virtual environment with uv
  • Install PyTorch 2.10.0 with correct backend
  • Install ML libraries: numpy, pandas, scikit-learn, jupyter, accelerate, etc.
  • Create ml-env/ directory in your project
  • Optionally initialize git

After Setup: Using Your Environment

Once created, activating is simple:

cd ~/projects/my-ml-project
source ml-env/bin/activate  # Regular environments
# OR
source ml-env/activate-safe.sh  # If you use conda (ignores conda settings)

Check it works:

python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"

Hardware-Specific Guidance

NVIDIA GPUs

  • Supported: RTX 3090, 4090, 5090, and most Ampere/Ada/Blackwell
  • Installation: CUDA 12.8 (stable) or CUDA 13.0 (for RTX 5090)
  • Driver requirement: 520+ for CUDA 12.8, 550+ for CUDA 13.0
  • WSL2 users: Use Windows NVIDIA driver only (do NOT install Linux driver)

AMD RDNA (RX 6000/7000 series)

  • Installation: ROCm 6.2
  • Requirements: User in render and video groups
  • Setup: sudo usermod -aG render,video $USER && newgrp render

AMD Strix Halo (gfx1151)

⚠️ This requires special handling - official PyTorch wheels do NOT work!

  • GPU: Ryzen AI MAX+ 395 with gfx1151
  • Critical issue: Official PyTorch wheels fail with "HIP error: invalid device function"
  • Solution: Use AMD gfx1151-specific builds
  • ROCm 7 (Recommended): https://repo.amd.com/rocm/whl/gfx1151/ - ~31 TFLOPS BF16
  • ROCm 6.4.4+ (Fallback): https://rocm.nightlies.amd.com/v2/gfx1151/ - ~12 TFLOPS BF16
  • Memory limits: Default ~33GB; configure GTT for larger models (30B+)
  • Setup requires: User in render and video groups, Linux kernel 6.14+ (6.16.9+ recommended for automatic UMA/GTT behavior)

Reference project: See ~/Projects/amdtest for a working gfx1151 setup example.

See TROUBLESHOOTING.md for complete Strix Halo setup and GTT memory configuration.

CPU-Only Systems

  • Works everywhere
  • Good for development/testing
  • Use for learning before scaling to GPU

Current Versions (2026)

  • PyTorch: 2.10.0
  • Python: 3.13 (or 3.12 if needed)
  • CUDA: 12.8 (main), 13.0 (Blackwell experimental)
  • ROCm: 6.2 (RDNA), 7.x preferred for Strix Halo (6.4.4+ as fallback)
  • Key ML libs: numpy, pandas, matplotlib, scikit-learn, jupyter, accelerate, tensorboard

Validating an Existing Environment

If you already have a project and want to verify it works:

cd ~/your-ml-project
bash ~/.claude/skills/ml-env/scripts/validate.sh

This will check:

  • Python and PyTorch versions
  • GPU/CPU backend detection
  • GPU memory and specifications
  • Computation tests

Troubleshooting

GPU Not Detected

NVIDIA:

nvidia-smi  # Check driver is installed
python -c "import torch; print(torch.cuda.is_available())"

AMD:

rocm-smi  # Check ROCm installation
rocminfo | grep gfx  # Check GPU architecture

CUDA Out of Memory

  1. Reduce batch size
  2. Enable mixed precision training
  3. Use torch.cuda.empty_cache() between batches
  4. Try gradient accumulation

PyTorch Not Finding GPU After Install

  1. Activate the environment: source ml-env/bin/activate
  2. Check driver version
  3. Reinstall PyTorch with correct index URL:
    uv pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu128
    

Strix Halo Specific Issues

See TROUBLESHOOTING.md for detailed Strix Halo troubleshooting, GTT memory setup, and performance optimization.

Best Practices

  1. Always activate first: Before running any Python/ML code
  2. Use virtual environments: Never install to system Python
  3. Move models to device: Explicitly move tensors to GPU
  4. Monitor memory: Keep an eye on GPU memory usage
  5. Test on CPU first: Develop with small data on CPU, scale to GPU
  6. Save checkpoints: Don't train for hours without saving progress

Common Workflows

Training a Model

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = YourModel().to(device)

for batch in dataloader:
    x, y = batch
    x, y = x.to(device), y.to(device)
    loss = model(x, y)
    loss.backward()

Mixed Precision Training (Faster, Less Memory)

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    with autocast():
        loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Checking GPU Memory

import torch
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
    print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f}GB")
    print(f"Total: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f}GB")

Reference Documentation

Scripts in This Skill

All scripts are in ~/.claude/skills/ml-env/scripts/:

  • setup-universal.sh - Hardware detection and PyTorch installation (used during initial setup)
  • validate.sh - Validate an existing environment and test GPU/CPU

When to Use This Skill

Use me when you:

  • Want to create a new ML project
  • Need to set up PyTorch with GPU support
  • Are troubleshooting GPU/CUDA/ROCm issues
  • Want to update or maintain your ML environment
  • Have hardware-specific questions (NVIDIA, AMD, Strix Halo)
  • Need guidance on ML best practices

Questions?

Ask me anything about:

  • Creating new ML projects
  • Hardware setup and troubleshooting
  • PyTorch installation
  • GPU/CPU configuration
  • ML best practices
  • Updating packages
Related skills