CUDA Matrix Multiply: Memory and Data Locality

01 — Matrix Multiplication Overview

Matrix multiplication maps naturally to a 2D CUDA grid

For square matrix multiplication, the output matrix P is computed from input matrices M and N. Each output element is the dot product of one row of M and one column of N.

P[row][col] = Σ_k=0^Width−1 M[row][k] × N[k][col]

The most natural CUDA mapping is one thread per output cell. A 2D thread block computes a rectangular tile of output cells, and the full grid of blocks covers the full output matrix.

Animated dot-product view

row of M + column of N → one P cell

M

×

N

=

P

Each CUDA thread computes one output element of P.

Main idea: the arithmetic is simple. The performance challenge is memory traffic: many nearby threads need overlapping rows and columns from global memory.

02 — Basic CUDA Matrix Multiplication Kernel

The basic kernel computes one output cell per thread

The basic kernel uses 2D indexing to compute the output row and column for each CUDA thread. Then the thread loops over k to perform the dot product.

Code walkthrough: basic kernel

line highlight + visual state are synchronized

Basic Kernel
1__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
      float Pvalue = 0.0f;
      for (int k = 0; k < Width; ++k) {
          Pvalue += M[Row * Width + k] * N[k * Width + Col];
      }
      P[Row * Width + Col] = Pvalue;
  }
11}

// thread state and dot product

M

N

P

Row

—

Col

—

k

—

Pvalue

0

The current thread computes one P element using a row of M and a column of N.

Row identifies the row of M and the row of P.
Col identifies the column of N and the column of P.
Pvalue is a private per-thread accumulator, typically stored in a register.
The bounds check prevents invalid memory writes when the grid is larger than the matrix.

03 — Thread-to-Output Mapping

A 2D grid of blocks covers the output matrix

Assume Width = 4 and BLOCK_WIDTH = 2. Each block has a 2 × 2 group of threads, so each block computes a 2 × 2 tile of the output matrix.

Animated block/thread mapping

Block(row, col) + Thread(row, col) → P[row][col]

// output matrix P

Row = blockIdx.y × blockDim.y + threadIdx.y

Col = blockIdx.x × blockDim.x + threadIdx.x

// active block and thread

blockIdx

—

threadIdx

—

Row

—

Col

—

Each thread maps to one output element.

04 — Global Memory Access Problem

The basic kernel has correct math but poor data locality

In the basic kernel, every thread reads a full row of M and a full column of N directly from global memory. Nearby threads reuse many of the same input values, but without tiling those values are repeatedly fetched.

Repeated global memory reads

same values are loaded again by nearby threads

// global memory row and column reuse

current load

repeated load

// threads reading overlapping data

Many threads need the same row or column values, but the basic kernel fetches them directly from global memory each time.

Locality problem: reuse exists across threads, but the basic kernel does not capture that reuse in fast on-chip memory.

05 — Tiling and Data Locality

Tiling moves reused data into shared memory

Tiling divides the dot product into phases. In each phase, a thread block cooperatively loads one tile of M and one tile of N into shared memory, then reuses those tile values for multiple multiply-add operations.

Tiling/blocking concept

global memory → shared memory → many threads reuse

global memory

M tile + N tile

→

on-chip memory

Shared tiles

→

thread block

Many dot-product updates

Tiling improves locality by loading a block of data once, then reusing it many times.

Identify a tile of global memory data reused by multiple threads.
Load the tile into shared memory.
Synchronize to make sure the tile is fully loaded.
Compute using shared memory.
Synchronize again before overwriting shared memory in the next phase.

06 — Tiled Matrix Multiplication Phases

Each output tile is computed in phases

For a 4 × 4 matrix with TILE_WIDTH = 2, Block(0,0) computes the top-left 2 × 2 output tile. It needs two phases along the inner dimension.

Animated phased execution for Block(0,0)

Phase 0 then Phase 1

M

Shared Mds

Shared Nds

N

P tile

Phase 0 loads the first pair of tiles.

P_0,0 = (M_0,0N_0,0 + M_0,1N_1,0) + (M_0,2N_2,0 + M_0,3N_3,0)

07 — Loading Tiles into Shared Memory

Each thread loads one M element and one N element

For each phase p, the M tile moves horizontally across a row of M, while the N tile moves vertically down a column of N. Each thread stores its loaded values into Mds[ty][tx] and Nds[ty][tx].

Code walkthrough: tiled loading and compute

shared-memory tile fill + reuse

Tiled Load
1__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
3int Row = by * TILE_WIDTH + ty;
4int Col = bx * TILE_WIDTH + tx;
5for (int p = 0; p < Width / TILE_WIDTH; ++p) {
6    Mds[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
7    Nds[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
8    __syncthreads();
9    for (int k = 0; k < TILE_WIDTH; ++k)
10        Pvalue += Mds[ty][k] * Nds[k][tx];
11    __syncthreads();
12}

// shared memory state

Mds

Nds

Every thread cooperatively loads one M element and one N element.

08 — Phased Execution and Synchronization

Two barriers protect shared memory reuse

Tiled matrix multiplication uses __syncthreads() twice inside each phase. The first barrier ensures all tile values are loaded before computation. The second barrier ensures all threads are finished using the old tile before any thread overwrites shared memory with the next tile.

Why barriers are required

within one block only

// thread arrival at barrier

Threads that arrive early wait until every thread in the block reaches the barrier.

// without synchronization

Mds

Nds

Without synchronization, a fast thread may read an uninitialized or overwritten shared-memory value.

Scope: __syncthreads() synchronizes threads in the same block only. It does not synchronize different blocks.

09 — Tiled Matrix Multiplication Kernel

The complete tiled kernel

The clean tiled kernel below assumes square matrices and assumes Width is a multiple of TILE_WIDTH. Boundary checks are added later.

Tiled Kernel
1#define TILE_WIDTH 16
2__global__ void MatrixMulTiledKernel(float* M, float* N, float* P, int Width) {
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x, ty = threadIdx.y;
  int Row = blockIdx.y * TILE_WIDTH + ty;
  int Col = blockIdx.x * TILE_WIDTH + tx;
  float Pvalue = 0.0f;
  for (int p = 0; p < Width / TILE_WIDTH; ++p) {
      Mds[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
      Nds[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
      __syncthreads();
      for (int k = 0; k < TILE_WIDTH; ++k)
          Pvalue += Mds[ty][k] * Nds[k][tx];
      __syncthreads();
  }
  P[Row * Width + Col] = Pvalue;
18}

Allocate shared tiles: Mds and Nds are per-block arrays.
Compute output coordinates: Row and Col identify the output cell.
Loop over phases: each phase moves by one tile width through the dot-product dimension.
Load, sync, compute, sync: this protects shared memory and enables reuse.
Store final value: after all phases, each thread writes its one output cell.

10 — Tile Size and Resource Tradeoffs

Tile width controls reuse, work, and resource usage

For a tile width TILE_WIDTH, each block has TILE_WIDTH × TILE_WIDTH threads. In each phase, each thread loads one M value and one N value, so each block performs 2 × TILE_WIDTH² global float loads.

FLOPs per global memory load = (2 × TILE_WIDTH³) / (2 × TILE_WIDTH²) = TILE_WIDTH

TILE_WIDTH = 16

Threads per block = 16 × 16 = 256
Global loads per phase = 2 × 256 = 512 floats
Work per phase = 256 × (2 × 16) = 8192 FLOPs
Ratio = 16 FLOPs per memory load
Shared memory = 2 × 16 × 16 × 4 = 2 KB

TILE_WIDTH = 32

Threads per block = 32 × 32 = 1024
Global loads per phase = 2 × 1024 = 2048 floats
Work per phase = 1024 × (2 × 32) = 65536 FLOPs
Ratio = 32 FLOPs per memory load
Shared memory = 2 × 32 × 32 × 4 = 8 KB

Tradeoff: larger tiles improve reuse, but they consume more threads and shared memory per block. Occupancy also depends on registers, block limits, and hardware constraints.

11 — Handling Arbitrary Matrix Sizes

Real matrices may not be multiples of TILE_WIDTH

The clean tiled kernel assumes square matrices whose dimensions are exact multiples of the tile width. Real applications often have matrix boundaries where some tile positions fall outside the valid matrix.

3×3 matrix with TILE_WIDTH = 2

boundary tiles include invalid positions

// valid and invalid tile cells

valid matrix element

outside matrix

// boundary handling rule

check index

inside matrix?

→

if invalid

write 0 into shared tile

Instead of padding the whole matrix in memory, the kernel writes 0 for out-of-bound tile elements.

Why zero works: a zero value contributes nothing to the multiply-add, so invalid tile positions do not corrupt the final output.

12 — Boundary Checks and Zero Padding

M, N, and P need different boundary conditions

The condition for loading an M tile element is different from loading an N tile element, and both are different from the condition for storing an output P element.

Loading M

M condition
1if (Row < M_height && p * TILE_WIDTH + tx < M_width)
2    Mds[ty][tx] = M[Row * M_width + p * TILE_WIDTH + tx];
3else Mds[ty][tx] = 0.0f;

Loading N

N condition
1if (p * TILE_WIDTH + ty < N_height && Col < N_width)
2    Nds[ty][tx] = N[(p * TILE_WIDTH + ty) * N_width + Col];
3else Nds[ty][tx] = 0.0f;

Output store condition
1if (Row < P_height && Col < P_width)
2    P[Row * P_width + Col] = Pvalue;

Important: do not only check whether the output cell is valid. A thread that does not store a valid P element may still load a valid input tile element that other threads need.

Rectangular matrices

If M is Height_M × Width_M and N is Width_M × Width_N, then P is Height_M × Width_N. The inner dimension Width_M controls the number of phases.

M

Height_M × Width_M

×

N

Width_M × Width_N

=

P

Height_M × Width_N

13 — Key Takeaways

Checklist for understanding tiled matrix multiplication

One thread

Computes one output element P[Row][Col] using a dot product.

One block

Computes one tile of the output matrix P.

One phase

Loads one tile from M and one tile from N, then performs partial dot-product work.

Shared memory

Stores reusable tile values so nearby threads do not repeatedly fetch the same values from global memory.

Barriers

The first barrier protects tile loading. The second protects tile reuse before the next phase overwrites shared memory.

Boundary handling

Invalid tile positions are replaced with zero; stores to P are guarded separately.

Mental model: tiled matrix multiplication repeats this loop: load tile → synchronize → compute with tile → synchronize → next tile.