The BLIS 5-Loop Algorithm
NSL's CPU matrix multiplication implements the BLIS algorithm -- the same approach used by BLAS libraries like OpenBLAS and MKL:
for jc = 0..N step NC: # L3 loop
for pc = 0..K step KC: # L2 loop
pack_b(B -> NR-wide strips) # Serial
parallel_for MC blocks: # Parallel over M
pack_a(A -> MR-tall panels)
for each MR x NR tile:
micro_kernel(packA, packB, C) # AVX-512 FMA
AVX-512 Micro-Kernel (6x32)
The core compute unit processes 6 rows x 32 columns per call:
- 12 ZMM accumulators (6 rows x 2 halves of 16 floats)
- K-unrolled by 2 for instruction-level parallelism
- 384 FLOPs per K step (6 x 32 x 2)
- 228 GFLOPS single-core = 100% of theoretical peak
AVX2 fallback uses a 6x16 micro-kernel with YMM registers.
Cache-Optimized Tile Sizes
MR=6, NR=32 # Micro-kernel shape
KC=384 # B strip = 48KB (fills L1)
MC=256 # A panel = 393KB (fits L2)
NC=4096 # B panel in L3
Four Parallelization Strategies
| Strategy | When | How |
|----------|------|-----|
| M-parallel | M >> MC | Split MC blocks across threads |
| N-parallel | Small M, large N | Each thread packs own B strip |
| Dynamic MC | mc_blocks < threads | Reduce MC for more blocks |
| Fast path | M,N<=128, K<=512 | Single-threaded, no dispatch overhead |
Performance (Zen 5, 16 cores, AVX-512)
Laptop 45W TDP throttles all-core AVX-512 from 3.5 GHz to ~1.7 GHz:
| Shape | GFLOPS | % Sustained Peak |
|-------|--------|------------------|
| 128x128 | 206 | 11.8% |
| 512x512 | 536 | 30.7% |
| 1024^3 | 768 | 43.9% |
| 2048^3 | 1,415 | 80.8% |
| 4096^3 | 1,648 | 94.2% |
| LLM FFN Up | 1,822 | 104.1% |
| LLM Attn QKV | 1,787 | 102.1% |
| HPC 8K | 1,768 | 101.0% |
Peak: 1,822 GFLOPS (104.1% of sustained). Yes, over 100% -- the sustained peak is a conservative estimate from a 4096x4096 workload, while LLM shapes have better cache behavior.
Additional CPU Kernels
Benchmark Stability
Research-grade measurement methodology: