| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +GPUStreamlines (`cuslines`) is a GPU-accelerated tractography package for diffusion MRI. It supports **three GPU backends**: NVIDIA CUDA, Apple Metal (Apple Silicon), and WebGPU (cross-platform via wgpu-py). The backend is auto-detected at import time in `cuslines/__init__.py` (priority: Metal → CUDA → WebGPU). Kernels are compiled at runtime (NVRTC for CUDA, `MTLDevice.newLibraryWithSource` for Metal, `device.create_shader_module` for WebGPU/WGSL). |
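The detection priority described in the overview can be sketched as follows. This is a minimal illustration, not the code in `cuslines/__init__.py`: the names `BACKEND_PROBES`, `module_available`, and `detect_backend` are hypothetical, and probing by importable module name (`Metal`, `cuda`, `wgpu`) is an assumption about how availability might be checked.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Priority order stated in the overview: Metal, then CUDA, then WebGPU.
# The probed module names are assumptions chosen for illustration.
BACKEND_PROBES = (
    ("metal", "Metal"),   # pyobjc-framework-Metal is imported as `Metal`
    ("cuda", "cuda"),     # cuda-python / cuda-core live under `cuda`
    ("webgpu", "wgpu"),   # wgpu-py is imported as `wgpu`
)


def module_available(name: str) -> bool:
    """Return True if `name` looks importable, without importing it."""
    try:
        return find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def detect_backend(
    is_available: Callable[[str], bool] = module_available,
) -> Optional[str]:
    """Return the first backend in priority order whose module is found."""
    for backend, module in BACKEND_PROBES:
        if is_available(module):
            return backend
    return None
```

Passing the availability check as a parameter keeps the priority logic testable on a machine that has none of the GPU packages installed, for example `detect_backend(lambda m: m == "wgpu")`.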
| 8 | + |
| 9 | +## Build & Run |
| 10 | + |
| 11 | +```bash |
| 12 | +# Install (pick your backend) |
| 13 | +pip install ".[cu13]" # CUDA 13 |
| 14 | +pip install ".[cu12]" # CUDA 12 |
| 15 | +pip install ".[metal]" # Apple Metal (Apple Silicon) |
| 16 | +pip install ".[webgpu]" # WebGPU (cross-platform: NVIDIA, AMD, Intel, Apple) |
| 17 | + |
| 18 | +# From PyPI |
| 19 | +pip install "cuslines[cu13]" |
| 20 | +pip install "cuslines[metal]" |
| 21 | +pip install "cuslines[webgpu]" |
| 22 | + |
| 23 | +# GPU run (downloads HARDI dataset if no data passed) |
| 24 | +python run_gpu_streamlines.py --output-prefix small --nseeds 1000 --ngpus 1 |
| 25 | + |
| 26 | +# Force a specific backend |
| 27 | +python run_gpu_streamlines.py --device=webgpu --output-prefix small --nseeds 1000 |
| 28 | + |
| 29 | +# CPU reference run (for comparison/debugging) |
| 30 | +python run_gpu_streamlines.py --device=cpu --output-prefix small --nseeds 1000 |
| 31 | + |
| 32 | +# Docker |
| 33 | +docker build -t gpustreamlines . |
| 34 | +``` |
| 35 | + |
| 36 | +There is no dedicated test or lint suite. Validate by comparing CPU vs GPU outputs on the same seeds. |
| 37 | + |
| 38 | +## Architecture |
| 39 | + |
| 40 | +**Two-layer design**: Python orchestration + GPU kernels compiled at runtime. Three parallel backend implementations share the same API surface. |
| 41 | + |
| 42 | +``` |
| 43 | +run_gpu_streamlines.py # CLI entry: DIPY model fitting → CPU or GPU tracking |
| 44 | +cuslines/ |
| 45 | + __init__.py # Auto-detects Metal → CUDA → WebGPU backend at import |
| 46 | + boot_utils.py # Shared bootstrap matrix preparation (OPDT/CSA) for all backends |
| 47 | + cuda_python/ # CUDA backend |
| 48 | + cu_tractography.py # GPUTracker: context manager, multi-GPU allocation |
| 49 | + cu_propagate_seeds.py # SeedBatchPropagator: chunked seed processing |
| 50 | + cu_direction_getters.py # Direction getter ABC + Boot/Prob/PTT implementations |
| 51 | + cutils.py # REAL_DTYPE, REAL3_DTYPE, checkCudaErrors(), ModelType enum |
| 52 | + _globals.py # AUTO-GENERATED from globals.h (never edit manually) |
| 53 | + cuda_c/ # CUDA kernel source |
| 54 | + globals.h # Source-of-truth for constants (REAL_SIZE, thread config) |
| 55 | + generate_streamlines_cuda.cu, boot.cu, ptt.cu, tracking_helpers.cu, utils.cu |
| 56 | + cudamacro.h, cuwsort.cuh, ptt.cuh, disc.h |
| 57 | + metal/ # Metal backend (mirrors cuda_python/) |
| 58 | + mt_tractography.py, mt_propagate_seeds.py, mt_direction_getters.py, mutils.py |
| 59 | + metal_shaders/ # MSL kernel source (mirrors cuda_c/) |
| 60 | + globals.h, types.h, philox_rng.h |
| 61 | + generate_streamlines_metal.metal, boot.metal, ptt.metal |
| 62 | + tracking_helpers.metal, utils.metal, warp_sort.metal |
| 63 | + webgpu/ # WebGPU backend (mirrors metal/) |
| 64 | + wg_tractography.py, wg_propagate_seeds.py, wg_direction_getters.py, wgutils.py |
| 65 | + benchmark.py # Cross-backend benchmark: python -m cuslines.webgpu.benchmark |
| 66 | + wgsl_shaders/ # WGSL kernel source (mirrors metal_shaders/) |
| 67 | + globals.wgsl, types.wgsl, philox_rng.wgsl |
| 68 | + utils.wgsl, warp_sort.wgsl, tracking_helpers.wgsl |
| 69 | + generate_streamlines.wgsl # Prob/PTT buffer bindings + Prob getNum/gen kernels |
| 70 | + boot.wgsl # Boot direction getter kernels (standalone module) |
| 71 | + disc.wgsl, ptt.wgsl # PTT support |
| 72 | +``` |
| 73 | + |
| 74 | +**Data flow**: DIPY preprocessing → seed generation → GPUTracker context → SeedBatchPropagator chunks seeds across GPUs → kernel launch → stream results to TRK/TRX output. |
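The chunking step in the data flow above could look roughly like this. It is a sketch under stated assumptions, not the `SeedBatchPropagator` implementation: `iter_chunks` and `split_across_gpus` are hypothetical helpers, and seeds are plain tuples rather than the arrays the package uses.

```python
from typing import Iterator, List, Sequence, Tuple

Seed = Tuple[float, float, float]


def iter_chunks(seeds: Sequence[Seed], chunk_size: int) -> Iterator[Sequence[Seed]]:
    """Yield consecutive slices of at most `chunk_size` seeds."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(seeds), chunk_size):
        yield seeds[start:start + chunk_size]


def split_across_gpus(chunk: Sequence[Seed], ngpus: int) -> List[Sequence[Seed]]:
    """Divide one chunk into `ngpus` contiguous, near-equal parts."""
    if ngpus <= 0:
        raise ValueError("ngpus must be positive")
    base, extra = divmod(len(chunk), ngpus)
    parts: List[Sequence[Seed]] = []
    start = 0
    for gpu in range(ngpus):
        # The first `extra` devices take one additional seed each.
        size = base + (1 if gpu < extra else 0)
        parts.append(chunk[start:start + size])
        start += size
    return parts
```

Each chunk bounds how many seeds are in flight at once, and the per-device split keeps the parts contiguous so results can be written back in seed order.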
| 75 | + |
| 76 | +**Direction getters** (subclasses of `GPUDirectionGetter`): |
| 77 | +- `BootDirectionGetter` — bootstrap sampling from SH coefficients (OPDT/CSA models) |
| 78 | +- `ProbDirectionGetter` — probabilistic selection from ODF/PMF (CSD model) |
| 79 | +- `PttDirectionGetter` — Probabilistic Tracking with Turning (CSD model) |
| 80 | + |
| 81 | +Each has `from_dipy_*()` class methods for initialization from DIPY models. |
| 82 | + |
| 83 | +## Critical Conventions |
| 84 | + |
| 85 | +- **`_globals.py` is auto-generated** from `cuslines/cuda_c/globals.h` during `setup.py` build via `defines_to_python()`. Never edit it manually; change `globals.h` and rebuild. |
| 86 | +- **GPU arrays must be C-contiguous**: always pass arrays through `np.ascontiguousarray()` and use the project's scalar types (`REAL_DTYPE`, `REAL_SIZE` from `cutils.py` or `mutils.py`). |
| 87 | +- **All CUDA API calls must be wrapped** with `checkCudaErrors()`. |
| 88 | +- **Angle units**: The CLI accepts degrees; internals convert to radians before the GPU layer. |
| 89 | +- **Multi-GPU**: CUDA uses explicit `cudaSetDevice()` calls; Metal and WebGPU are single-GPU only. |
| 90 | +- **CPU/GPU parity**: `run_gpu_streamlines.py` maintains parallel CPU and GPU code paths — keep both in sync when changing arguments or model-selection logic. |
| 91 | +- **Logger**: use `logging.getLogger("GPUStreamlines")`. |
| 92 | +- **Kernel compilation**: CUDA uses `cuda.core.Program` with NVIDIA headers. Metal uses `MTLDevice.newLibraryWithSource_options_error_()` with MSL source concatenated from `metal_shaders/`. WebGPU uses `device.create_shader_module()` with WGSL source concatenated from `wgsl_shaders/`. |
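To make the `_globals.py` convention concrete, here is a rough sketch of how a header's `#define` constants might be turned into Python assignments. It is not the project's `defines_to_python()`; the function name `defines_to_python_sketch`, the regex, and the choice to keep only plain numeric literals are all assumptions.

```python
import re

# Matches lines such as `#define REAL_SIZE 4`. Function-like macros and
# defines without a value (header guards) do not match.
_DEFINE_RE = re.compile(r"^\s*#\s*define\s+([A-Za-z_]\w*)\s+(.+?)\s*$")


def defines_to_python_sketch(header_text: str) -> str:
    """Turn simple numeric #define lines into Python assignments."""
    lines = ["# AUTO-GENERATED: do not edit; change globals.h and rebuild."]
    for raw in header_text.splitlines():
        code = raw.split("//", 1)[0]  # drop trailing C++-style comments
        match = _DEFINE_RE.match(code)
        if not match:
            continue
        name, value = match.groups()
        try:
            number = int(value, 0)  # decimal and 0x... literals
        except ValueError:
            try:
                number = float(value.rstrip("fF"))  # e.g. 1e-6f
            except ValueError:
                continue  # not a plain numeric literal; skip it
        lines.append(f"{name} = {number!r}")
    return "\n".join(lines) + "\n"
```

The sketch ignores `/* ... */` comments and expression-valued macros, which a real generator would need to handle or reject explicitly.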
| 93 | + |
| 94 | +## Metal Backend Notes |
| 95 | + |
| 96 | +- **Unified memory**: Metal buffers use `storageModeShared` — numpy arrays are directly GPU-accessible (zero memcpy per batch, vs ~6 in CUDA). |
| 97 | +- **float3 alignment**: All buffers use `packed_float3` (12 bytes) with `load_f3()`/`store_f3()` helpers. Metal `float3` is 16 bytes in registers. |
| 98 | +- **Page alignment**: Use `aligned_array()` from `mutils.py` for arrays passed to `newBufferWithBytesNoCopy`. |
| 99 | +- **No double precision**: Only `REAL_SIZE=4` (float32) is ported. |
| 100 | +- **Warp primitives**: `__shfl_sync` → `simd_shuffle`, `__ballot_sync` → `simd_ballot`. SIMD width = 32. |
| 101 | +- **SH basis**: Always use `real_sh_descoteaux(legacy=True)` for all matrices. See `boot_utils.py`. |
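The 12-byte versus 16-byte distinction in the float3 note can be illustrated on the host side with the standard `struct` module. This is only an analogy for the buffer layout: `pack_points` and `load_f3_host` are hypothetical names, not the MSL `load_f3()`/`store_f3()` helpers.

```python
import struct
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]

PACKED_FLOAT3 = struct.Struct("<3f")    # x, y, z with no padding: 12 bytes
PADDED_FLOAT3 = struct.Struct("<3f4x")  # x, y, z plus 4 pad bytes: 16 bytes


def pack_points(points: Iterable[Vec3]) -> bytes:
    """Lay points out back to back, 12 bytes each, as a packed buffer would."""
    return b"".join(PACKED_FLOAT3.pack(*p) for p in points)


def load_f3_host(buffer: bytes, index: int) -> Vec3:
    """Read the point at `index` using the 12-byte stride."""
    return PACKED_FLOAT3.unpack_from(buffer, index * PACKED_FLOAT3.size)
```

Reading a packed buffer with a 16-byte stride would land on the wrong offsets from the second element onward, which is why the stride has to match the packed layout on both sides.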
| 102 | + |
| 103 | +## WebGPU Backend Notes |
| 104 | + |
| 105 | +- **Cross-platform**: wgpu-py maps to Metal (macOS), Vulkan (Linux/Windows), D3D12 (Windows). Install: `pip install "cuslines[webgpu]"`. |
| 106 | +- **Explicit readbacks**: `device.queue.read_buffer()` for GPU→CPU (~3 per seed batch, matching CUDA's cudaMemcpy pattern). |
| 107 | +- **WGSL shaders**: Concatenated in dependency order by `compile_program()`. Boot compiles standalone; Prob/PTT share `generate_streamlines.wgsl`. |
| 108 | +- **Buffer binding**: Boot needs 17 buffers across 3 bind groups. Prob/PTT use 2 bind groups. `layout="auto"` only includes reachable bindings. |
| 109 | +- **Subgroups required**: Device feature `"subgroup"` (singular, not `"subgroups"`). Naga does NOT support the `enable subgroups;` directive. |
| 110 | +- **WGSL constraints**: No `ptr<storage>` parameters (use module-scope accessors). `var<workgroup>` sizes must be compile-time constants. PhiloxState is pass-by-value (return result structs). |
| 111 | +- **Boot standalone module**: `_kernel_files()` returns `[]` to avoid `params` struct redefinition. |
| 112 | +- **Benchmark**: `python -m cuslines.webgpu.benchmark --nseeds 10000` — auto-detects all backends. |
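The "concatenated in dependency order" step from the WGSL shaders note might be sketched like this. It is not the project's `compile_program()`: `concat_shader_sources` is a hypothetical helper, and the order in `PROB_PTT_ORDER` is an assumption built from the file names in the architecture tree.

```python
from pathlib import Path
from typing import Iterable

# Assumed dependency order for the Prob/PTT module; the order actually
# used by compile_program() may differ.
PROB_PTT_ORDER = (
    "globals.wgsl", "types.wgsl", "philox_rng.wgsl", "utils.wgsl",
    "warp_sort.wgsl", "tracking_helpers.wgsl", "disc.wgsl", "ptt.wgsl",
    "generate_streamlines.wgsl",
)


def concat_shader_sources(shader_dir: Path, filenames: Iterable[str]) -> str:
    """Read each file in the given order and join them into one source string."""
    parts = []
    for name in filenames:
        text = (shader_dir / name).read_text(encoding="utf-8")
        # A marker comment makes compiler error locations easier to trace.
        parts.append(f"// ---- {name} ----\n{text}")
    return "\n".join(parts)
```

The resulting string is the kind of input that would then be handed to `device.create_shader_module()`. Keeping the order in one tuple makes it easy to see that definitions appear before the kernels that use them.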
| 113 | + |
| 114 | +## Key Dependencies |
| 115 | + |
| 116 | +- `dipy` — diffusion models, CPU direction getters, seeding, stopping criteria |
| 117 | +- `nibabel` — NIfTI/TRK file I/O (`StatefulTractogram`) |
| 118 | +- `trx-python` — TRX format support (memory-mapped, for large outputs) |
| 119 | +- `cuda-python` / `cuda-core` / `cuda-cccl` — CUDA Python bindings, kernel compilation, C++ headers |
| 120 | +- `pyobjc-framework-Metal` / `pyobjc-framework-MetalPerformanceShaders` — Metal Python bindings (macOS only) |
| 121 | +- `wgpu` — WebGPU Python bindings (wgpu-native, cross-platform) |
| 122 | +- `numpy` — array operations throughout |