CPU Compute: AVX-512 SGEMM at 104% of Sustained Peak

The BLIS 5-Loop Algorithm

NSL's CPU matrix multiplication implements the BLIS algorithm -- the same approach used by BLAS libraries like OpenBLAS and MKL:

for jc = 0..N step NC:           # L3 loop
  for pc = 0..K step KC:         # L2 loop
    pack_b(B -> NR-wide strips)  # Serial
    parallel_for MC blocks:      # Parallel over M
      pack_a(A -> MR-tall panels)
      for each MR x NR tile:
        micro_kernel(packA, packB, C)  # AVX-512 FMA
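The loop nest above can be sketched in plain C. This is a scaled-down scalar model, not NSL's actual code: the tile sizes are shrunk for illustration, the scalar `micro_kernel` stands in for the AVX-512 FMA kernel, and matrix dimensions are assumed to be exact multiples of the tiles.

```c
#include <assert.h>
#include <math.h>

/* Illustrative tile sizes, scaled down from the article's values. */
#define MC 8
#define KC 8
#define NC 16
#define MR 2
#define NR 4

/* Scalar stand-in for the AVX-512 micro-kernel:
 * C[MR x NR] += A_panel(KC x MR) * B_strip(KC x NR). */
static void micro_kernel(int kc, const float *a, const float *b,
                         float *c, int ldc) {
    for (int k = 0; k < kc; k++)
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                c[i * ldc + j] += a[k * MR + i] * b[k * NR + j];
}

/* C += A(MxK) * B(KxN), row-major; caller zero-initializes C.
 * Dimensions must be multiples of MC/NC/KC for this sketch. */
static void blis_gemm(int M, int N, int K,
                      const float *A, const float *B, float *C) {
    float packA[MC * KC], packB[KC * NC];
    for (int jc = 0; jc < N; jc += NC)            /* L3 loop */
        for (int pc = 0; pc < K; pc += KC) {      /* L2 loop */
            /* Pack B into NR-wide, KC-deep strips (serial). */
            for (int jr = 0; jr < NC; jr += NR)
                for (int k = 0; k < KC; k++)
                    for (int j = 0; j < NR; j++)
                        packB[jr * KC + k * NR + j] =
                            B[(pc + k) * N + jc + jr + j];
            /* In NSL this loop over MC blocks runs in parallel. */
            for (int ic = 0; ic < M; ic += MC) {
                /* Pack A into MR-tall, KC-deep panels. */
                for (int ir = 0; ir < MC; ir += MR)
                    for (int k = 0; k < KC; k++)
                        for (int i = 0; i < MR; i++)
                            packA[ir * KC + k * MR + i] =
                                A[(ic + ir + i) * K + pc + k];
                /* Sweep every MR x NR tile of the block. */
                for (int jr = 0; jr < NC; jr += NR)
                    for (int ir = 0; ir < MC; ir += MR)
                        micro_kernel(KC, packA + ir * KC, packB + jr * KC,
                                     C + (ic + ir) * N + jc + jr, N);
            }
        }
}
```

The packing steps are what make the real kernel fast: the micro-kernel then reads A and B with unit stride regardless of the original leading dimensions.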

AVX-512 Micro-Kernel (6x32)

The core compute unit processes 6 rows x 32 columns per call:

  • 12 ZMM accumulators (6 rows x 2 halves of 16 floats)
  • K-unrolled by 2 for instruction-level parallelism
  • 384 FLOPs per K step (6 x 32 x 2)
  • 228 GFLOPS single-core = 100% of theoretical peak
  • AVX2 fallback uses a 6x16 micro-kernel with YMM registers
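The register and FLOP accounting behind those bullets is straightforward arithmetic, restated here as compile-time constants (constant names are mine, not NSL's):

```c
/* Register and FLOP accounting for the 6x32 AVX-512 micro-kernel. */
enum { MR = 6, NR = 32 };
enum { ZMM_LANES = 16 };                        /* 512 bits / 32-bit float  */
enum { ACCUMULATORS = MR * (NR / ZMM_LANES) };  /* 6 rows x 2 halves = 12   */
enum { FLOPS_PER_K_STEP = MR * NR * 2 };        /* FMA = mul + add -> 384   */
```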

Cache-Optimized Tile Sizes

MR=6, NR=32      # Micro-kernel shape
KC=384           # B strip = 48 KB (fills L1)
MC=256           # A panel = 384 KB (fits L2)
NC=4096          # B panel in L3
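The cache footprints follow directly from the tile sizes and 4-byte floats, again as a quick compile-time check (names are mine):

```c
/* Packed-buffer footprints implied by the tile sizes above. */
enum { NR = 32, KC = 384, MC = 256 };
enum { B_STRIP_BYTES = KC * NR * 4 };   /* 49,152 B = 48 KB, sized for L1  */
enum { A_PANEL_BYTES = MC * KC * 4 };   /* 393,216 B = 384 KB, sized for L2 */
```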

Four Parallelization Strategies

| Strategy | When | How |
|----------|------|-----|
| M-parallel | M >> MC | Split MC blocks across threads |
| N-parallel | Small M, large N | Each thread packs its own B strip |
| Dynamic MC | mc_blocks < threads | Reduce MC to create more blocks |
| Fast path | M,N <= 128, K <= 512 | Single-threaded, no dispatch overhead |
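The table above can be read as a dispatch function. This is a hypothetical sketch mirroring the four rows; the exact thresholds (other than the fast-path bounds quoted in the table) are illustrative guesses, not NSL's real cutoffs:

```c
#include <string.h>

#define MC 256  /* M-blocking factor from the tile table */

/* Hypothetical strategy dispatcher; thresholds beyond the fast-path
 * bounds are illustrative, not NSL's actual heuristics. */
static const char *choose_strategy(int M, int N, int K, int threads) {
    if (M <= 128 && N <= 128 && K <= 512)
        return "fast-path";                /* single-threaded, no dispatch  */
    int mc_blocks = (M + MC - 1) / MC;     /* how many MC blocks exist      */
    if (M <= MC && N >= 8 * MC)
        return "N-parallel";               /* short-fat: each thread packs  */
                                           /* its own B strip               */
    if (mc_blocks < threads)
        return "dynamic-MC";               /* shrink MC to feed all threads */
    return "M-parallel";                   /* default: split MC blocks      */
}
```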

Performance (Zen 5, 16 cores, AVX-512)

The laptop's 45 W TDP throttles the all-core AVX-512 clock from 3.5 GHz to ~1.7 GHz:

| Shape | GFLOPS | % Sustained Peak |
|-------|--------|------------------|
| 128x128 | 206 | 11.8% |
| 512x512 | 536 | 30.7% |
| 1024^3 | 768 | 43.9% |
| 2048^3 | 1,415 | 80.8% |
| 4096^3 | 1,648 | 94.2% |
| LLM FFN Up | 1,822 | 104.1% |
| LLM Attn QKV | 1,787 | 102.1% |
| HPC 8K | 1,768 | 101.0% |

Peak: 1,822 GFLOPS (104.1% of sustained). Yes, over 100% -- the sustained peak is a conservative estimate from a 4096x4096 workload, while LLM shapes have better cache behavior.
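The baseline behind those percentages can be backed out from the table itself: dividing each GFLOPS figure by its percentage implies a sustained-peak estimate of roughly 1,750 GFLOPS (an inference from the rows, not a number stated in the post):

```c
/* Back out the sustained-peak baseline implied by a benchmark row. */
static double implied_sustained_peak(double gflops, double pct_of_sustained) {
    return gflops / pct_of_sustained * 100.0;
}
```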

Additional CPU Kernels

  • Activations: ReLU, GELU, SiLU, sigmoid, tanh, softmax (AVX-512 + AVX2)
  • Reductions: sum, max, min (horizontal SIMD)
  • Neural: Layer norm, fused attention (QKV)
  • Game: Bitonic sort, prefix scan, FFT, frustum cull, particle integration, spatial hash
  • Linear algebra: 8x8 block transpose with AVX-512 shuffle

Benchmark Stability

Research-grade measurement methodology:

  • VirtualLock + prefault for memory pinning
  • rdtsc cycle-accurate timing for micro-shapes
  • 25% outlier trimming (keep middle 50%)
  • Adaptive warmup (CV < 2% convergence)
  • Thermal cooldown between tests
  • Core pinning + ABOVE_NORMAL priority
  • 40 verification + 20 performance tests, exit code 1 on any failure