Notre avis
Ce guide fournit des règles et bonnes pratiques pour la programmation CUDA, incluant la gestion de la mémoire, l'optimisation des kernels et la vérification des erreurs.
Points forts
- Impose des macros de vérification d'erreur robustes
- Couvre la hiérarchie mémoire GPU et l'accès coalescent
- Propose des patrons de conception de kernels
- Adapté pour l'optimisation de l'occupation des SM
Limites
- Nécessite des connaissances préalables en CUDA
- Ne couvre pas tous les détails d'architecture spécifiques
- Peut ne pas convenir aux frameworks de haut niveau comme PyTorch
Lorsque vous développez des kernels CUDA ou optimisez du code de calcul GPU.
Lorsque vous utilisez des bibliothèques GPU de haut niveau qui abstraient les détails CUDA.
Analyse de sécurité
SûrThe skill contains only CUDA programming guidelines and code snippets. It does not include any commands that perform system modifications, data exfiltration, or unsafe operations. The provided macros and patterns are defensive and standard, posing no execution risk.
Aucun point d'attention détecté
Exemples
Help me write a CUDA kernel for vector addition that uses grid-stride loops and checks for errors with the CUDA_CHECK macro.I have a CUDA kernel that accesses global memory in a strided pattern. How can I rewrite it to ensure coalesced memory access?name: cuda-guide description: | CUDA/GPU computing guardrails, patterns, and best practices for AI-assisted development. Use when working with CUDA files (.cu, .cuh), or when the user mentions CUDA/GPU programming. Provides kernel design patterns, memory hierarchy guidelines, and occupancy optimization specific to this project's coding standards. license: MIT metadata: author: samuel version: "1.0" category: language language: cuda extensions: ".cu,.cuh"
CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory
Guardrails
Error Checking
- ALWAYS check CUDA API return values with a macro wrapper
- ALWAYS call
cudaGetLastError()after every kernel launch - ALWAYS call
cudaDeviceSynchronize()before reading kernel results on the host - Use
compute-sanitizer(successor tocuda-memcheck) in development builds - Handle
cudaErrorMemoryAllocationgracefully; never assume GPU memory is infinite
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
#define CUDA_CHECK_KERNEL() \
do { \
cudaError_t err = cudaGetLastError(); \
if (err != cudaSuccess) { \
fprintf(stderr, "Kernel launch error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
Memory Management
- Pair every
cudaMallocwith acudaFree; prefer RAII wrappers in C++ host code - Use
cudaMallocManaged(Unified Memory) for prototyping; switch to explicit transfers for production - Use
cudaMallocHost(pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute - Prefer
cudaMemcpyAsyncwith streams over synchronouscudaMemcpy - Never access device pointers from host code or host pointers from device code (except Unified Memory)
- Call
cudaMemsetorcudaMemsetAsyncto zero-initialize device buffers
Kernel Design
- Block size must be a multiple of warp size (32); prefer 128, 256, or 512
- Calculate grid size as
(n + block_size - 1) / block_size - Always include bounds checking:
if (idx < n)at the top of every kernel - Use grid-stride loops for kernels that must handle arbitrary data sizes
- Document thread mapping: which dimension maps to which data axis
- Mark device-only helpers as
__device__, host+device as__host__ __device__
// Grid-stride loop: works with any grid size, any data size
__global__ void saxpy(float a, const float* x, float* y, int n) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < n;
i += blockDim.x * gridDim.x) {
y[i] = a * x[i] + y[i];
}
}
Synchronization
- Use
__syncthreads()after every shared memory write before any thread reads another thread's value - Never place
__syncthreads()inside a conditional branch that not all threads in a block will reach (deadlock) - Use
__syncwarp()(CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution - Use
cudaDeviceSynchronize()sparingly in production; prefer stream synchronization withcudaStreamSynchronize() - Use CUDA events (
cudaEventRecord/cudaEventSynchronize) for fine-grained inter-stream ordering
Performance
- Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view
- Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions
- Aim for >60% of peak memory bandwidth in memory-bound kernels
- Avoid warp divergence: ensure threads within a warp take the same branch when possible
- Prefer
floatoverdoubleon consumer GPUs (2x throughput difference) - Minimize atomic operations on global memory; use shared memory atomics with a final reduction
Memory Hierarchy
Understanding the memory hierarchy is the single most important factor in CUDA performance.
| Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write | |-------------|-------|-------------------|------|--------|------------| | Registers | Thread | 1 | ~255 per thread | N/A | R/W | | Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W | | L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R | | L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W | | Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W | | Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R | | Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R |
Decision guide:
- Data reused within a thread -> registers (automatic via local variables)
- Data shared across threads in a block ->
__shared__memory - Read-only data broadcast to all threads ->
__constant__memory - Large read-only data with spatial locality -> texture memory
- Everything else -> global memory with coalesced access patterns
Key Patterns
Kernel Launch Configuration
// Query device for optimal configuration
void launch_optimized(const float* input, float* output, int n) {
int block_size;
int min_grid_size;
// Let the runtime suggest optimal block size for maximum occupancy
cudaOccupancyMaxPotentialBlockSize(
&min_grid_size, &block_size, my_kernel, 0, n);
int grid_size = (n + block_size - 1) / block_size;
my_kernel<<<grid_size, block_size>>>(input, output, n);
CUDA_CHECK_KERNEL();
}
Coalesced Memory Access
// BAD: Strided access -- adjacent threads access non-adjacent memory
// Each warp issues 32 separate memory transactions
__global__ void transpose_naive(const float* in, float* out, int W, int H) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < W && y < H) {
out[x * H + y] = in[y * W + x]; // Write is strided
}
}
// GOOD: Use shared memory to coalesce both reads and writes
__global__ void transpose_coalesced(
const float* in, float* out, int W, int H
) {
__shared__ float tile[32][33]; // +1 padding avoids bank conflicts
int x = blockIdx.x * 32 + threadIdx.x;
int y = blockIdx.y * 32 + threadIdx.y;
if (x < W && y < H) {
tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read
}
__syncthreads();
x = blockIdx.y * 32 + threadIdx.x;
y = blockIdx.x * 32 + threadIdx.y;
if (x < H && y < W) {
out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write
}
}
Shared Memory Tiling
// Dot product of two vectors using shared memory reduction
__global__ void dot_product(
const float* a, const float* b, float* result, int n
) {
__shared__ float cache[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Each thread computes its partial sum via grid-stride
float partial = 0.0f;
for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
partial += a[i] * b[i];
}
cache[tid] = partial;
__syncthreads();
// Tree reduction in shared memory
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
cache[tid] += cache[tid + s];
}
__syncthreads();
}
if (tid == 0) {
atomicAdd(result, cache[0]);
}
}
Warp-Level Primitives (CUDA 9+)
// Warp-level reduction using shuffle instructions -- no shared memory needed
__device__ float warp_reduce_sum(float val) {
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
return val;
}
// Block-level reduction combining warp shuffles and shared memory
__device__ float block_reduce_sum(float val) {
__shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block)
int lane = threadIdx.x % warpSize;
int warp_id = threadIdx.x / warpSize;
val = warp_reduce_sum(val);
if (lane == 0) {
warp_sums[warp_id] = val;
}
__syncthreads();
// First warp reduces the warp sums
int num_warps = (blockDim.x + warpSize - 1) / warpSize;
val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
if (warp_id == 0) {
val = warp_reduce_sum(val);
}
return val;
}
Performance
Occupancy Calculator
// Query occupancy at compile time for tuning
void report_occupancy() {
int block_size = 256;
int num_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&num_blocks, my_kernel, block_size, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int active_warps = num_blocks * (block_size / prop.warpSize);
int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
float occupancy = (float)active_warps / max_warps;
printf("Occupancy: %.1f%% (%d/%d warps)\n",
occupancy * 100, active_warps, max_warps);
}
Nsight Profiling Workflow
# System-level trace: find CPU/GPU idle gaps, stream concurrency
nsys profile -o trace ./program
nsys stats trace.nsys-rep
# Kernel-level analysis: roofline, memory throughput, occupancy
ncu --set full -o kernel_report ./program
ncu -i kernel_report.ncu-rep # Open in Nsight Compute GUI
# Quick single-metric check
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program
Memory Bandwidth Measurement
// Measure effective bandwidth of a kernel
void measure_bandwidth(int n) {
size_t bytes = 2 * n * sizeof(float); // Read A + Write B
cudaEvent_t start, stop;
CUDA_CHECK(cudaEventCreate(&start));
CUDA_CHECK(cudaEventCreate(&stop));
CUDA_CHECK(cudaEventRecord(start));
copy_kernel<<<grid, block>>>(d_in, d_out, n);
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
float ms = 0;
CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
float gb_per_sec = bytes / (ms * 1e6);
printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec);
CUDA_CHECK(cudaEventDestroy(start));
CUDA_CHECK(cudaEventDestroy(stop));
}
Tooling
Essential Commands
# Compile CUDA code
nvcc -arch=sm_80 -O3 -o program main.cu # Single file
nvcc -arch=native -lineinfo -o program main.cu # With debug line info
# CMake build
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86"
cmake --build build -j$(nproc)
# Runtime debugging
compute-sanitizer ./program # Memory errors (replaces cuda-memcheck)
compute-sanitizer --tool racecheck ./program # Shared memory race conditions
compute-sanitizer --tool initcheck ./program # Uninitialized device memory reads
compute-sanitizer --tool synccheck ./program # Synchronization errors
# Profiling
nsys profile ./program # System-level timeline
ncu ./program # Kernel-level metrics
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program
# Device info
nvidia-smi # GPU status and memory usage
nvcc --version # CUDA compiler version
CMakeLists.txt Template
cmake_minimum_required(VERSION 3.18)
project(myproject LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 86)
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
find_package(CUDAToolkit REQUIRED)
add_library(kernels src/kernels.cu)
target_include_directories(kernels PUBLIC include)
add_executable(main src/main.cpp)
target_link_libraries(main kernels CUDA::cudart)
enable_testing()
add_executable(tests tests/test_kernels.cu)
target_link_libraries(tests kernels CUDA::cudart)
add_test(NAME gpu_tests COMMAND tests)
References
For detailed patterns and examples, see:
- references/patterns.md -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline
External References
Expert Next.js App Router
Developpement
Un skill qui transforme Claude en expert Next.js App Router.
Générateur de README
Developpement
Crée des README.md professionnels et complets pour vos projets.
Rédacteur de Documentation API
Developpement
Génère de la documentation API complète au format OpenAPI/Swagger.