CPU Compute: AVX-512 SGEMM at 104% of Sustained Peak

NSL TeamFeb 20, 20263 min readDeep Dive

The BLIS 5-Loop Algorithm

NSL's CPU matrix multiplication implements the BLIS algorithm -- the same approach used by BLAS libraries like OpenBLAS and MKL:

for jc = 0..N step NC:           # L3 loop
  for pc = 0..K step KC:         # L2 loop
    pack_b(B -> NR-wide strips)  # Serial
    parallel_for MC blocks:      # Parallel over M
      pack_a(A -> MR-tall panels)
      for each MR x NR tile:
        micro_kernel(packA, packB, C)  # AVX-512 FMA

AVX-512 Micro-Kernel (6x32)

The core compute unit processes 6 rows x 32 columns per call:

12 ZMM accumulators (6 rows x 2 halves of 16 floats)
K-unrolled by 2 for instruction-level parallelism
384 FLOPs per K step (6 x 32 x 2)
228 GFLOPS single-core = 100% of theoretical peak

AVX2 fallback uses a 6x16 micro-kernel with YMM registers.

Cache-Optimized Tile Sizes

MR=6, NR=32     # Micro-kernel shape
KC=384           # B strip = 48KB (fills L1)
MC=256           # A panel = 393KB (fits L2)
NC=4096          # B panel in L3

Four Parallelization Strategies

| Strategy | When | How |

|----------|------|-----|

| M-parallel | M >> MC | Split MC blocks across threads |

| N-parallel | Small M, large N | Each thread packs own B strip |

| Dynamic MC | mc_blocks < threads | Reduce MC for more blocks |

| Fast path | M,N<=128, K<=512 | Single-threaded, no dispatch overhead |

Performance (Zen 5, 16 cores, AVX-512)

Laptop 45W TDP throttles all-core AVX-512 from 3.5 GHz to ~1.7 GHz:

| Shape | GFLOPS | % Sustained Peak |

|-------|--------|------------------|

| 128x128 | 206 | 11.8% |

| 512x512 | 536 | 30.7% |

| 1024^3 | 768 | 43.9% |

| 2048^3 | 1,415 | 80.8% |

| 4096^3 | 1,648 | 94.2% |

| LLM FFN Up | 1,822 | 104.1% |

| LLM Attn QKV | 1,787 | 102.1% |

| HPC 8K | 1,768 | 101.0% |

Peak: 1,822 GFLOPS (104.1% of sustained). Yes, over 100% -- the sustained peak is a conservative estimate from a 4096x4096 workload, while LLM shapes have better cache behavior.

Additional CPU Kernels

Activations: ReLU, GELU, SiLU, sigmoid, tanh, softmax (AVX-512 + AVX2)

Reductions: sum, max, min (horizontal SIMD)
Neural: Layer norm, fused attention (QKV)
Game: Bitonic sort, prefix scan, FFT, frustum cull, particle integration, spatial hash
Linear algebra: 8x8 block transpose with AVX-512 shuffle

Benchmark Stability

Research-grade measurement methodology:

VirtualLock + prefault for memory pinning
rdtsc cycle-accurate timing for micro-shapes
25% outlier trimming (keep middle 50%)
Adaptive warmup (CV < 2% convergence)
Thermal cooldown between tests
Core pinning + ABOVE_NORMAL priority
40 verification + 20 performance tests, exit code 1 on any failure