This repository explores the step-by-step optimization of General Matrix Multiplication (GEMM) in C++. GEMM is the core mathematical operation behind nearly all modern deep learning models, including Large Language Models and Vision Transformers.
This project starts with a mathematically correct but highly inefficient "naive" implementation and gradually applies memory, caching, and compiler optimizations to achieve massive performance gains.
At its core, GEMM computes each output element of a matrix as a dot product:
C_ij = sum from k=0 to N-1 of (A_ik * B_kj)
FLOPs (Floating Point Operations) is the metric used to measure computational cost.
For an N x N matrix multiplication:
- The output matrix has N² elements
- Calculating a single element requires N multiplications and N-1 additions
- Total operations per element: N + (N-1) approximately equals 2N
- Total FLOPs = N² x 2N = 2N³
Because matrix multiplication scales at O(N³), doubling the matrix size increases the computational workload by a factor of 8.
All benchmarks were run on an Intel i5-13450HX (13th Gen) multiplying two 2048x2048 floating-point matrices.
The most natural way to write matrix multiplication is a triple-nested loop corresponding to the mathematical formula:
```cpp
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        float sum = 0;
        for (int k = 0; k < N; k++) {
            sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum; // Memory write on EVERY iteration!
        }
    }
}
```

The Flaw: We write to memory (`C[i * N + j]`) on every iteration of the innermost loop. Each store has to go through the memory hierarchy instead of staying in a register, wasting bandwidth on a value that only matters once the dot product is complete.
We can easily speed this up by accumulating the dot product inside a local variable (which the compiler places in a high-speed CPU register) and only writing to memory once per output element.
```cpp
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        float sum = 0;
        for (int k = 0; k < N; k++) {
            sum += A[i * N + k] * B[k * N + j];
        }
        C[i * N + j] = sum; // Moved OUTSIDE the k-loop!
    }
}
```

The Flaw: We are still thrashing the CPU cache. In C++, matrices are stored in row-major order, so accessing `B[k * N + j]` inside the k-loop forces the CPU to jump forward in memory by N elements (N × 4 bytes for floats) on every iteration, missing the cache almost every time.
By swapping the two inner loops, we fundamentally change the memory access pattern:
```cpp
for (int i = 0; i < N; i++) {
    for (int k = 0; k < N; k++) {
        float temp = A[i * N + k]; // Load once per (i, k) pair
        for (int j = 0; j < N; j++) {
            C[i * N + j] += temp * B[k * N + j]; // Sequential access!
        }
    }
}
```

The Fix: The innermost loop now iterates over j, so both C and B are accessed sequentially (stride-1 in memory). The CPU can load entire 64-byte cache lines and use every float in them, eliminating the memory bottleneck. This simple change yields a massive speedup without altering the math. Note that this kernel accumulates into C, so C must be zero-initialized before it runs.
Writing cache-friendly code is only half the battle; the compiler can push the kernel much further:

- `-O3`: enables aggressive optimizations (loop unrolling, function inlining, auto-vectorization)
- `-march=native`: emits CPU-specific instructions for the build machine's architecture
- `-ffast-math`: allows faster, though less strictly IEEE-compliant, floating-point operations
Before Optimization (Raw C++):
After Optimization (-O3 -march=native -ffast-math):
With compiler optimizations, the ikj kernel completes the 17.18 billion FLOPs of a 2048x2048 multiply in about 5.5 seconds, achieving ~3.1 GFLOPS. The naive implementation would take considerably longer due to cache thrashing — estimated 10-28x slower depending on matrix size and CPU.
| Kernel | 256x256 Time | 256x256 GFLOPS | Speedup |
|---|---|---|---|
| Naive ijk | ~12 ms | ~2.9 | 1x |
| Register | ~11 ms | ~3.1 | 1.1x |
| ikj | ~1.2 ms | ~28 | 9.8x |
To run these benchmarks on your own machine, clone the repository and compile with aggressive optimization flags:
```bash
# Clone the repo
git clone https://github.com/cheese-cakee/Benchmarking-GEMM.git
cd Benchmarking-GEMM

# Compile with GCC/Clang (recommended flags)
g++ -std=c++17 -O3 -march=native -ffast-math -static -o optimized.exe src/gemm_all.cpp

# Run
./optimized.exe
```

Note: The `-static` flag is required on Windows/MSYS2 to work around linker issues.
```
├── src/
│   ├── gemm.hpp          # Header with all kernel declarations
│   ├── gemm_kernels.cpp  # Kernel implementations
│   ├── gemm_all.cpp      # Main benchmark suite (all-in-one)
│   └── main.cpp          # Original benchmark harness
├── Makefile              # Build automation
├── baseline.exe          # Pre-built naive binary (no optimization)
└── optimized.exe         # Pre-built optimized binary (all kernels)
```
This project demonstrates a fundamental truth of high-performance computing:
Understanding computer hardware (memory and caches) is just as important as understanding Big O notation.
Simple loop reordering — without changing the math at all — can yield ~10x speedups. The key insight is that memory access patterns matter more than raw algorithmic complexity when data doesn't fit in cache.
| Optimization | Speedup |
|---|---|
| Register optimization | ~1.3x |
| Loop reordering (ikj) | ~10-20x |
| Compiler flags (-O3 -march=native -ffast-math) | Up to 65x |