GPU Compute: Hand-Tuned PTX Kernels on Tensor Cores

No CUDA Toolkit Required

NSL loads CUDA at runtime via the driver API. If you have an NVIDIA GPU with drivers installed, GPU acceleration just works. No CUDA toolkit, no nvcc, no build system -- just tensor.matmul(a, b) and it routes to tensor cores automatically.
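Loading the driver API at runtime amounts to opening the driver library that ships with the GPU driver and calling `cuInit`. NSL does this from C++; the following Python/ctypes sketch shows the same idea, and the probing order and fallback message are illustrative:

```python
import ctypes

def load_cuda_driver():
    """Try to load the CUDA driver API at runtime (no toolkit needed).

    Returns the loaded library handle, or None when no working NVIDIA
    driver is present. cuInit(0) must be called before any other
    driver-API function.
    """
    # libcuda ships with the driver, not the CUDA toolkit.
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue
        if lib.cuInit(0) == 0:  # 0 == CUDA_SUCCESS
            return lib
    return None

driver = load_cuda_driver()
print("GPU available" if driver else "no driver -- falling back to CPU")
```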

52 Hand-Tuned PTX Kernels

All kernels are stored as external .ptx files and loaded at runtime. Changes to PTX take effect without recompiling C++.
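Runtime PTX loading is a single driver-API call: `cuModuleLoadData` JIT-compiles the PTX text for whatever GPU is installed, which is why editing a `.ptx` file takes effect on the next run with no C++ rebuild. A hedged ctypes sketch; the helper name and error handling are illustrative, not NSL's code:

```python
import ctypes

def load_ptx_module(driver, path):
    """Load a .ptx file through the driver API at runtime.

    `driver` is a ctypes handle to libcuda/nvcuda (or None when no
    driver is available). Note that in real use a CUDA context must be
    current before modules can be loaded.
    """
    if driver is None:
        return None
    image = open(path, "rb").read() + b"\0"   # PTX text must be NUL-terminated
    module = ctypes.c_void_p()
    status = driver.cuModuleLoadData(ctypes.byref(module), image)
    return module if status == 0 else None    # 0 == CUDA_SUCCESS

# Without a driver the helper degrades gracefully:
print(load_ptx_module(None, "sgemm_tf32.ptx"))  # filename is hypothetical
```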

AI/ML Kernels (34 files)

Tensor Core SGEMM:

  • TF32 128x128 with cp.async pipelining (15.1 TFLOPS peak)
  • TF32 32x128 for small-M batch inference
  • FP16 128x128 (27.0 TFLOPS peak)
  • BF16 128x128 (same geometry as FP16, better dynamic range)
  • Synchronous variants for small K dimensions
  • Activations: ReLU, GELU, SiLU, sigmoid, tanh (vectorized PTX)

Attention: FlashAttention-2, fused softmax, layer normalization

Other: Element-wise ops, conv2d (implicit GEMM), text search, NaN check
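The fused softmax listed above is, mathematically, the numerically stable row-wise softmax. A minimal NumPy sketch of the math only (a fused kernel computes this per row in one pass; this is not the PTX):

```python
import numpy as np

def stable_softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the max-subtraction trick."""
    m = x.max(axis=-1, keepdims=True)       # subtracting the row max avoids exp overflow
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0]])
print(stable_softmax(scores))               # each row sums to 1.0
```

Without the max subtraction, `exp` overflows for large logits; with it, even rows like `[1000.0, 1000.0]` produce finite probabilities.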

Game/Compute Kernels (18 files)

  • Physics: Frustum culling, cloth simulation (Verlet + PBD), SPH fluid simulation
  • Rendering: Depth pyramid (HiZ), SDF ray marching, BVH construction (Morton codes)
  • Simulation: Particle system, spatial hashing, noise generation (value + Worley)
  • Math: Bitonic sort, prefix scan, FFT (radix-2 Cooley-Tukey)
  • Crypto: SHA-256 batch hash, Keccak-256 batch hash
  • Data: Gather/scatter, transpose, reductions (sum/max/min, global + dimensional)
Auto-Dispatch: 7 Priority Levels

When you call tensor.matmul(a, b), NSL picks the best kernel automatically:

| Priority | Path | When |
|----------|------|------|
| 1 | FP16 HGEMM | Opt-in; dims divisible by 16/32; alpha=1, beta=0 |
| 2 | BF16 BGEMM | Opt-in; same geometry as FP16 |
| 3 | TF32 Small-M | M <= 128 |
| 4 | TF32 128x128 | General large matrices |
| 5 | Hybrid | Mixed precision |
| 6 | cuBLAS TF32 | Fallback when custom kernels are unavailable |
| 7 | FP32 | Last resort |
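The priority order can be sketched as a simple cascade. The function and flag names below are illustrative, not NSL's actual API; only the alignment/opt-in checks for priorities 1-2 and the M threshold for priority 3 are modeled, with everything else falling through to the general TF32 path:

```python
def pick_matmul_path(m, k, n, *, allow_fp16=False, allow_bf16=False,
                     alpha=1.0, beta=0.0):
    """Illustrative sketch of the 7-level dispatch order."""
    def tc_aligned(*dims):
        # The post requires dims divisible by 16/32; 16 is used here.
        return all(d % 16 == 0 for d in dims)

    if allow_fp16 and tc_aligned(m, k, n) and alpha == 1.0 and beta == 0.0:
        return "FP16 HGEMM"          # priority 1 (opt-in)
    if allow_bf16 and tc_aligned(m, k, n) and alpha == 1.0 and beta == 0.0:
        return "BF16 BGEMM"          # priority 2 (opt-in)
    if m <= 128:
        return "TF32 small-M"        # priority 3
    return "TF32 128x128"            # priority 4; 5-7 are fallbacks, not modeled

print(pick_matmul_path(4096, 4096, 4096))                    # TF32 128x128
print(pick_matmul_path(64, 4096, 4096))                      # TF32 small-M
print(pick_matmul_path(4096, 4096, 4096, allow_fp16=True))   # FP16 HGEMM
```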

Large matrices are automatically split into sub-grids (<=64 or <=256 blocks per launch) with stream synchronization between launches.
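A minimal sketch of the splitting step, assuming a flat block count and the 256-block cap (the helper name is hypothetical; the real code launches each chunk and synchronizes the stream in between):

```python
def split_launches(total_blocks, max_blocks=256):
    """Split a grid into sub-grid launches of at most max_blocks blocks."""
    launches = []
    start = 0
    while start < total_blocks:
        count = min(max_blocks, total_blocks - start)
        launches.append((start, count))   # (first block index, block count)
        start += count                    # stream sync would go between launches
    return launches

print(split_launches(1000))  # four launches: 256 + 256 + 256 + 232 blocks
```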

Performance

Benchmark on an RTX 5060 Laptop GPU (SM 12.0, Blackwell):

| Shape | NSL TF32 (GFLOPS) | cuBLAS (GFLOPS) | Speedup |
|-------|-------------------|-----------------|---------|
| 512x512 | 6,934 | 3,126 | 2.22x |
| 1024^3 | 8,209 | 3,563 | 2.30x |
| 2048^3 | 11,289 | 3,644 | 3.10x |
| 4096^3 | 11,124 | 3,644 | 3.05x |
| LLM FFN Up | 11,974 | 3,366 | 3.56x |
| Average | 7,716 | 3,213 | 2.40x |

FP16 peak: 27.0 TFLOPS (4.36x over cuBLAS TF32).

SM 12.0 (Blackwell) Discoveries

While building these kernels, we discovered undocumented hardware behaviors on NVIDIA's Blackwell architecture:

1. TF32 A-fragment swap -- the PTX spec says (a0, a1) map to the row group, but the actual hardware uses (a0, a2). Without this correction, TF32 kernels produce wrong results.

2. FP16 uses the standard mapping -- unlike TF32, FP16 MMA does NOT swap fragments.

3. JIT NaN bug -- compiling .target sm_80 PTX for sm_120 introduced sporadic NaNs. Fixed by targeting .target sm_120 with .version 8.7.

NaN Safety: 4-Layer Reliability

Despite the hardware quirks, NSL guarantees correct output:

1. Sub-grid splitting -- limits blocks per launch

2. NaN pre-fill -- the output buffer is filled with NaN before the kernel runs, so unwritten tiles are caught

3. Host-side detection -- samples 6 rows via a D-to-H copy and scans for NaN

4. Fallback chain -- TF32 -> cuBLAS FP32 -> scalar FP32
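Layers 2-4 can be sketched on the host with NumPy. The kernel list, sampling stride, and all function names below are illustrative stand-ins for the real GPU paths:

```python
import numpy as np

def safe_matmul(a, b, kernels):
    """NaN-safety chain: pre-fill with NaN, run, sample, fall through."""
    for name, kernel in kernels:
        out = np.full((a.shape[0], b.shape[1]), np.nan)  # layer 2: NaN pre-fill
        kernel(a, b, out)
        # Layer 3: host-side detection -- sample a handful of rows for NaN.
        sample = out[:: max(1, out.shape[0] // 6)]
        if not np.isnan(sample).any():
            return out, name
    raise RuntimeError("all fallbacks produced NaN")   # layer 4 exhausted

def broken_kernel(a, b, out):
    # Simulates an unwritten tile: only the top half of rows get results.
    out[: a.shape[0] // 2] = a[: a.shape[0] // 2] @ b

def scalar_fp32(a, b, out):
    out[:] = a @ b          # last-resort reference path

a, b = np.ones((8, 4)), np.ones((4, 8))
result, used = safe_matmul(a, b, [("tf32", broken_kernel), ("fp32", scalar_fp32)])
print(used)                 # the broken kernel is caught; fallback is used
```

The NaN pre-fill is what makes detection cheap: an unwritten tile cannot masquerade as zeros, so scanning a few rows is enough to catch it.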

GPU Tensor Cache

Chained operations reuse GPU-resident data:

    let a = tensor.ones([10000])
    let b = tensor.ones([10000])
    let c = tensor.add(a, b)   # uploads a,b to GPU, caches results
    let d = tensor.mul(c, a)   # cache hit on c and a -- zero uploads
    let e = tensor.add(d, d)   # cache hit on d -- zero uploads
    

Three operations: six host-to-device transfers without the cache, two with it.
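A toy model of the cache's bookkeeping (the class and names are illustrative, not NSL internals). Each operand is uploaded only on a cache miss, and results are born on-device and cached, which is what collapses six transfers to two for the chain above:

```python
class TensorCache:
    """Toy GPU tensor cache: resident tensors are never re-uploaded."""
    def __init__(self):
        self.resident = {}    # host tensor id -> "device buffer"
        self.uploads = 0

    def get(self, tensor):
        key = id(tensor)
        if key not in self.resident:
            self.resident[key] = list(tensor)   # stand-in for an H-to-D copy
            self.uploads += 1
        return self.resident[key]

def binop(cache, x, y):
    dx, dy = cache.get(x), cache.get(y)
    result = [u + v for u, v in zip(dx, dy)]    # computed "on device"
    cache.resident[id(result)] = result         # result stays GPU-resident
    return result

cache = TensorCache()
a, b = [1.0] * 4, [1.0] * 4
c = binop(cache, a, b)    # uploads a, b
d = binop(cache, c, a)    # cache hits on c and a -- zero uploads
e = binop(cache, d, d)    # cache hit on d -- zero uploads
print(cache.uploads)      # 2 (without the cache it would be 6)
```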

Interposer: Accelerate Any CUDA App

gpu_interpose.dll hooks cublasSgemm and cublasLtMatmul in any CUDA application and routes them to NSL's kernels. Inject it into PyTorch, TensorFlow, or any CUDA binary:

  • Zero passthrough degradation (GetModuleHandle, not LoadLibrary)
  • 5 wins, 16 ties, 0 losses vs stock cuBLAS (median-of-5)
  • Dramatically more consistent (1-4% CV vs 20-45% stock for game shapes)

All 41 verification + 14 performance tests passing.
