GPU Compute: Hand-Tuned PTX Kernels on Tensor Cores

No CUDA Toolkit Required

NSL loads CUDA at runtime via the driver API. If you have an NVIDIA GPU with drivers installed, GPU acceleration just works. No CUDA toolkit, no nvcc, no build system -- just tensor.matmul(a, b) and it routes to tensor cores automatically.
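Loading the driver API at runtime amounts to opening the driver library that ships with the GPU driver and calling `cuInit`. NSL does this from C++; the following Python/ctypes sketch shows the same idea, and the probing order and fallback message are illustrative:

```python
import ctypes

def load_cuda_driver():
    """Try to load the CUDA driver API at runtime (no toolkit needed).

    Returns the loaded library handle, or None when no working NVIDIA
    driver is present. cuInit(0) must be called before any other
    driver-API function.
    """
    # libcuda ships with the driver, not the CUDA toolkit.
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue
        if lib.cuInit(0) == 0:  # 0 == CUDA_SUCCESS
            return lib
    return None

driver = load_cuda_driver()
print("GPU available" if driver else "no driver -- falling back to CPU")
```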

52 Hand-Tuned PTX Kernels

All kernels are stored as external .ptx files and loaded at runtime. Changes to PTX take effect without recompiling C++.
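Runtime PTX loading is a single driver-API call: `cuModuleLoadData` JIT-compiles the PTX text for whatever GPU is installed, which is why editing a `.ptx` file takes effect on the next run with no C++ rebuild. A hedged ctypes sketch; the helper name and error handling are illustrative, not NSL's code:

```python
import ctypes

def load_ptx_module(driver, path):
    """Load a .ptx file through the driver API at runtime.

    `driver` is a ctypes handle to libcuda/nvcuda (or None when no
    driver is available). Note that in real use a CUDA context must be
    current before modules can be loaded.
    """
    if driver is None:
        return None
    image = open(path, "rb").read() + b"\0"   # PTX text must be NUL-terminated
    module = ctypes.c_void_p()
    status = driver.cuModuleLoadData(ctypes.byref(module), image)
    return module if status == 0 else None    # 0 == CUDA_SUCCESS

# Without a driver the helper degrades gracefully:
print(load_ptx_module(None, "sgemm_tf32.ptx"))  # filename is hypothetical
```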

AI/ML Kernels (34 files)

Tensor Core SGEMM:

  • TF32 128x128 with cp.async pipelining (15.1 TFLOPS peak)
  • TF32 32x128 for small-M batch inference
  • FP16 128x128 (27.0 TFLOPS peak)
  • BF16 128x128 (same geometry as FP16, better dynamic range)
  • Synchronous variants for small K dimensions
  • Activations: ReLU, GELU, SiLU, sigmoid, tanh (vectorized PTX)

Attention: FlashAttention-2, fused softmax, layer normalization

Other: Element-wise ops, conv2d (implicit GEMM), text search, NaN check
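The fused softmax listed above is, mathematically, the numerically stable row-wise softmax. A minimal NumPy sketch of the math only (a fused kernel computes this per row in one pass; this is not the PTX):

```python
import numpy as np

def stable_softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the max-subtraction trick."""
    m = x.max(axis=-1, keepdims=True)       # subtracting the row max avoids exp overflow
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0]])
print(stable_softmax(scores))               # each row sums to 1.0
```

Without the max subtraction, `exp` overflows for large logits; with it, even rows like `[1000.0, 1000.0]` produce finite probabilities.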

Game/Compute Kernels (18 files)

  • Physics: Frustum culling, cloth simulation (Verlet + PBD), SPH fluid simulation
  • Rendering: Depth pyramid (HiZ), SDF ray marching, BVH construction (Morton codes)
  • Simulation: Particle system, spatial hashing, noise generation (value + Worley)
  • Math: Bitonic sort, prefix scan, FFT (radix-2 Cooley-Tukey)
  • Crypto: SHA-256 batch hash, Keccak-256 batch hash
  • Data: Gather/scatter, transpose, reductions (sum/max/min, global + dimensional)
Auto-Dispatch: 7 Priority Levels

When you call tensor.matmul(a, b), NSL picks the best kernel automatically:

| Priority | Path | When |
|----------|------|------|
| 1 | FP16 HGEMM | Opt-in; dims divisible by 16/32; alpha=1, beta=0 |
| 2 | BF16 BGEMM | Opt-in; same geometry as FP16 |
| 3 | TF32 Small-M | M <= 128 |
| 4 | TF32 128x128 | General large matrices |
| 5 | Hybrid | Mixed precision |
| 6 | cuBLAS TF32 | Fallback when custom kernels are unavailable |
| 7 | FP32 | Last resort |
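The priority order can be sketched as a simple cascade. The function and flag names below are illustrative, not NSL's actual API; only the alignment/opt-in checks for priorities 1-2 and the M threshold for priority 3 are modeled, with everything else falling through to the general TF32 path:

```python
def pick_matmul_path(m, k, n, *, allow_fp16=False, allow_bf16=False,
                     alpha=1.0, beta=0.0):
    """Illustrative sketch of the 7-level dispatch order."""
    def tc_aligned(*dims):
        # The post requires dims divisible by 16/32; 16 is used here.
        return all(d % 16 == 0 for d in dims)

    if allow_fp16 and tc_aligned(m, k, n) and alpha == 1.0 and beta == 0.0:
        return "FP16 HGEMM"          # priority 1 (opt-in)
    if allow_bf16 and tc_aligned(m, k, n) and alpha == 1.0 and beta == 0.0:
        return "BF16 BGEMM"          # priority 2 (opt-in)
    if m <= 128:
        return "TF32 small-M"        # priority 3
    return "TF32 128x128"            # priority 4; 5-7 are fallbacks, not modeled

print(pick_matmul_path(4096, 4096, 4096))                    # TF32 128x128
print(pick_matmul_path(64, 4096, 4096))                      # TF32 small-M
print(pick_matmul_path(4096, 4096, 4096, allow_fp16=True))   # FP16 HGEMM
```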

Large matrices are automatically split into sub-grids (<=64 or <=256 blocks per launch) with stream synchronization between launches.
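A minimal sketch of the splitting step, assuming a flat block count and the 256-block cap (the helper name is hypothetical; the real code launches each chunk and synchronizes the stream in between):

```python
def split_launches(total_blocks, max_blocks=256):
    """Split a grid into sub-grid launches of at most max_blocks blocks."""
    launches = []
    start = 0
    while start < total_blocks:
        count = min(max_blocks, total_blocks - start)
        launches.append((start, count))   # (first block index, block count)
        start += count                    # stream sync would go between launches
    return launches

print(split_launches(1000))  # four launches: 256 + 256 + 256 + 232 blocks
```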

Performance

Benchmark on an RTX 5060 Laptop GPU (SM 12.0, Blackwell):

| Shape | NSL TF32 (GFLOPS) | cuBLAS (GFLOPS) | Speedup |
|-------|-------------------|-----------------|---------|
| 512x512 | 6,934 | 3,126 | 2.22x |
| 1024^3 | 8,209 | 3,563 | 2.30x |
| 2048^3 | 11,289 | 3,644 | 3.10x |
| 4096^3 | 11,124 | 3,644 | 3.05x |
| LLM FFN Up | 11,974 | 3,366 | 3.56x |
| Average | 7,716 | 3,213 | 2.40x |

FP16 peak: 27.0 TFLOPS (4.36x over cuBLAS TF32).

SM 12.0 (Blackwell) Discoveries

While building these kernels, we discovered undocumented hardware behaviors on NVIDIA's Blackwell architecture:

1. TF32 A-fragment swap -- the PTX spec says (a0, a1) map to the row group, but the actual hardware uses (a0, a2). Without this correction, TF32 kernels produce wrong results.

2. FP16 uses the standard mapping -- unlike TF32, FP16 MMA does NOT swap fragments.

3. JIT NaN bug -- compiling .target sm_80 PTX for sm_120 introduced sporadic NaNs. Fixed by targeting .target sm_120 with .version 8.7.

NaN Safety: 4-Layer Reliability

Despite the hardware quirks, NSL guarantees correct output:

1. Sub-grid splitting -- limits blocks per launch

2. NaN pre-fill -- the output buffer is filled with NaN before the kernel runs, so unwritten tiles are caught

3. Host-side detection -- samples 6 rows via a D-to-H copy and scans for NaN

4. Fallback chain -- TF32 -> cuBLAS FP32 -> scalar FP32
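Layers 2-4 can be sketched on the host with NumPy. The kernel list, sampling stride, and all function names below are illustrative stand-ins for the real GPU paths:

```python
import numpy as np

def safe_matmul(a, b, kernels):
    """NaN-safety chain: pre-fill with NaN, run, sample, fall through."""
    for name, kernel in kernels:
        out = np.full((a.shape[0], b.shape[1]), np.nan)  # layer 2: NaN pre-fill
        kernel(a, b, out)
        # Layer 3: host-side detection -- sample a handful of rows for NaN.
        sample = out[:: max(1, out.shape[0] // 6)]
        if not np.isnan(sample).any():
            return out, name
    raise RuntimeError("all fallbacks produced NaN")   # layer 4 exhausted

def broken_kernel(a, b, out):
    # Simulates an unwritten tile: only the top half of rows get results.
    out[: a.shape[0] // 2] = a[: a.shape[0] // 2] @ b

def scalar_fp32(a, b, out):
    out[:] = a @ b          # last-resort reference path

a, b = np.ones((8, 4)), np.ones((4, 8))
result, used = safe_matmul(a, b, [("tf32", broken_kernel), ("fp32", scalar_fp32)])
print(used)                 # the broken kernel is caught; fallback is used
```

The NaN pre-fill is what makes detection cheap: an unwritten tile cannot masquerade as zeros, so scanning a few rows is enough to catch it.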

GPU Tensor Cache

Chained operations reuse GPU-resident data:

    let a = tensor.ones([10000])
    let b = tensor.ones([10000])
    let c = tensor.add(a, b)   # uploads a,b to GPU, caches results
    let d = tensor.mul(c, a)   # cache hit on c and a -- zero uploads
    let e = tensor.add(d, d)   # cache hit on d -- zero uploads
    

Three operations: six host-to-device transfers without the cache, two with it.
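A toy model of the cache's bookkeeping (the class and names are illustrative, not NSL internals). Each operand is uploaded only on a cache miss, and results are born on-device and cached, which is what collapses six transfers to two for the chain above:

```python
class TensorCache:
    """Toy GPU tensor cache: resident tensors are never re-uploaded."""
    def __init__(self):
        self.resident = {}    # host tensor id -> "device buffer"
        self.uploads = 0

    def get(self, tensor):
        key = id(tensor)
        if key not in self.resident:
            self.resident[key] = list(tensor)   # stand-in for an H-to-D copy
            self.uploads += 1
        return self.resident[key]

def binop(cache, x, y):
    dx, dy = cache.get(x), cache.get(y)
    result = [u + v for u, v in zip(dx, dy)]    # computed "on device"
    cache.resident[id(result)] = result         # result stays GPU-resident
    return result

cache = TensorCache()
a, b = [1.0] * 4, [1.0] * 4
c = binop(cache, a, b)    # uploads a, b
d = binop(cache, c, a)    # cache hits on c and a -- zero uploads
e = binop(cache, d, d)    # cache hit on d -- zero uploads
print(cache.uploads)      # 2 (without the cache it would be 6)
```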

Interposer: Accelerate Any CUDA App

gpu_interpose.dll hooks cublasSgemm and cublasLtMatmul in any CUDA application and routes them to NSL's kernels. Inject it into PyTorch, TensorFlow, or any CUDA binary:

  • Zero passthrough degradation (GetModuleHandle, not LoadLibrary)
  • 5 wins, 16 ties, 0 losses vs stock cuBLAS (median-of-5)
  • Dramatically more consistent (1-4% CV vs 20-45% stock for game shapes)

All 41 verification + 14 performance tests passing.
