Thread Mapping
1 Thread → 1 P Cell
Core Operation
Dot Product
Optimization
Shared-Memory Tiling
Main Issue
Global Memory Reuse
Load M tile
global → shared
→
Load N tile
global → shared
→
Sync + compute
reuse shared tiles
→
Write P cell
one output per thread
Inefficient or invalid access
Tiled matrix multiplication improves memory locality by loading reusable tiles into shared memory, synchronizing the block, and reusing those values across many multiply-add operations.
01 — Matrix Multiplication Overview
Matrix multiplication maps naturally to a 2D CUDA grid
For square matrix multiplication, the output matrix P is computed from input matrices M and N. Each output element is the dot product of one row of M and one column of N.
P[row][col] = Σk=0Width−1 M[row][k] × N[k][col]
The most natural CUDA mapping is one thread per output cell. A 2D thread block computes a rectangular tile of output cells, and the full grid of blocks covers the full output matrix.
Animated dot-product view
row of M + column of N → one P cell
Each CUDA thread computes one output element of P.
Main idea: the arithmetic is simple. The performance challenge is memory traffic: many nearby threads need overlapping rows and columns from global memory.
02 — Basic CUDA Matrix Multiplication Kernel
The basic kernel computes one output cell per thread
The basic kernel uses 2D indexing to compute the output row and column for each CUDA thread. Then the thread loops over k to perform the dot product.
Code walkthrough: basic kernel
line highlight + visual state are synchronized
Basic Kernel
1__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
2 int Row = blockIdx.y * blockDim.y + threadIdx.y;
3 int Col = blockIdx.x * blockDim.x + threadIdx.x;
4 if ((Row < Width) && (Col < Width)) {
5 float Pvalue = 0.0f;
6 for (int k = 0; k < Width; ++k) {
7 Pvalue += M[Row * Width + k] * N[k * Width + Col];
8 }
9 P[Row * Width + Col] = Pvalue;
10 }
11}
// thread state and dot product
The current thread computes one P element using a row of M and a column of N.
Row identifies the row of M and the row of P.
Col identifies the column of N and the column of P.
Pvalue is a private per-thread accumulator, typically stored in a register.
- The bounds check prevents invalid memory writes when the grid is larger than the matrix.
03 — Thread-to-Output Mapping
A 2D grid of blocks covers the output matrix
Assume Width = 4 and BLOCK_WIDTH = 2. Each block has a 2 × 2 group of threads, so each block computes a 2 × 2 tile of the output matrix.
Animated block/thread mapping
Block(row, col) + Thread(row, col) → P[row][col]
// active block and thread
Each thread maps to one output element.
04 — Global Memory Access Problem
The basic kernel has correct math but poor data locality
In the basic kernel, every thread reads a full row of M and a full column of N directly from global memory. Nearby threads reuse many of the same input values, but without tiling those values are repeatedly fetched.
Repeated global memory reads
same values are loaded again by nearby threads
// global memory row and column reuse
// threads reading overlapping data
Many threads need the same row or column values, but the basic kernel fetches them directly from global memory each time.
Locality problem: reuse exists across threads, but the basic kernel does not capture that reuse in fast on-chip memory.
05 — Tiling and Data Locality
Tiling moves reused data into shared memory
Tiling divides the dot product into phases. In each phase, a thread block cooperatively loads one tile of M and one tile of N into shared memory, then reuses those tile values for multiple multiply-add operations.
Tiling/blocking concept
global memory → shared memory → many threads reuse
global memory
M tile + N tile
→
on-chip memory
Shared tiles
→
thread block
Many dot-product updates
Tiling improves locality by loading a block of data once, then reusing it many times.
- Identify a tile of global memory data reused by multiple threads.
- Load the tile into shared memory.
- Synchronize to make sure the tile is fully loaded.
- Compute using shared memory.
- Synchronize again before overwriting shared memory in the next phase.
06 — Tiled Matrix Multiplication Phases
Each output tile is computed in phases
For a 4 × 4 matrix with TILE_WIDTH = 2, Block(0,0) computes the top-left 2 × 2 output tile. It needs two phases along the inner dimension.
Animated phased execution for Block(0,0)
Phase 0 then Phase 1
Phase 0 loads the first pair of tiles.
P0,0 = (M0,0N0,0 + M0,1N1,0) + (M0,2N2,0 + M0,3N3,0)
07 — Loading Tiles into Shared Memory
Each thread loads one M element and one N element
For each phase p, the M tile moves horizontally across a row of M, while the N tile moves vertically down a column of N. Each thread stores its loaded values into Mds[ty][tx] and Nds[ty][tx].
Code walkthrough: tiled loading and compute
shared-memory tile fill + reuse
Tiled Load
1__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
3int Row = by * TILE_WIDTH + ty;
4int Col = bx * TILE_WIDTH + tx;
5for (int p = 0; p < Width / TILE_WIDTH; ++p) {
6 Mds[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
7 Nds[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
8 __syncthreads();
9 for (int k = 0; k < TILE_WIDTH; ++k)
10 Pvalue += Mds[ty][k] * Nds[k][tx];
11 __syncthreads();
12}
// shared memory state
Every thread cooperatively loads one M element and one N element.
08 — Phased Execution and Synchronization
Two barriers protect shared memory reuse
Tiled matrix multiplication uses __syncthreads() twice inside each phase. The first barrier ensures all tile values are loaded before computation. The second barrier ensures all threads are finished using the old tile before any thread overwrites shared memory with the next tile.
Why barriers are required
within one block only
// thread arrival at barrier
Threads that arrive early wait until every thread in the block reaches the barrier.
// without synchronization
Without synchronization, a fast thread may read an uninitialized or overwritten shared-memory value.
Scope: __syncthreads() synchronizes threads in the same block only. It does not synchronize different blocks.
09 — Tiled Matrix Multiplication Kernel
The complete tiled kernel
The clean tiled kernel below assumes square matrices and assumes Width is a multiple of TILE_WIDTH. Boundary checks are added later.
Tiled Kernel
1#define TILE_WIDTH 16
2__global__ void MatrixMulTiledKernel(float* M, float* N, float* P, int Width) {
3 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
4 __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
5 int tx = threadIdx.x, ty = threadIdx.y;
6 int Row = blockIdx.y * TILE_WIDTH + ty;
7 int Col = blockIdx.x * TILE_WIDTH + tx;
8 float Pvalue = 0.0f;
9 for (int p = 0; p < Width / TILE_WIDTH; ++p) {
10 Mds[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
11 Nds[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
12 __syncthreads();
13 for (int k = 0; k < TILE_WIDTH; ++k)
14 Pvalue += Mds[ty][k] * Nds[k][tx];
15 __syncthreads();
16 }
17 P[Row * Width + Col] = Pvalue;
18}
- Allocate shared tiles:
Mds and Nds are per-block arrays.
- Compute output coordinates:
Row and Col identify the output cell.
- Loop over phases: each phase moves by one tile width through the dot-product dimension.
- Load, sync, compute, sync: this protects shared memory and enables reuse.
- Store final value: after all phases, each thread writes its one output cell.
10 — Tile Size and Resource Tradeoffs
Tile width controls reuse, work, and resource usage
For a tile width TILE_WIDTH, each block has TILE_WIDTH × TILE_WIDTH threads. In each phase, each thread loads one M value and one N value, so each block performs 2 × TILE_WIDTH² global float loads.
FLOPs per global memory load = (2 × TILE_WIDTH³) / (2 × TILE_WIDTH²) = TILE_WIDTH
TILE_WIDTH = 16
Threads per block = 16 × 16 = 256
Global loads per phase = 2 × 256 = 512 floats
Work per phase = 256 × (2 × 16) = 8192 FLOPs
Ratio = 16 FLOPs per memory load
Shared memory = 2 × 16 × 16 × 4 = 2 KB
TILE_WIDTH = 32
Threads per block = 32 × 32 = 1024
Global loads per phase = 2 × 1024 = 2048 floats
Work per phase = 1024 × (2 × 32) = 65536 FLOPs
Ratio = 32 FLOPs per memory load
Shared memory = 2 × 32 × 32 × 4 = 8 KB
Tradeoff: larger tiles improve reuse, but they consume more threads and shared memory per block. Occupancy also depends on registers, block limits, and hardware constraints.
11 — Handling Arbitrary Matrix Sizes
Real matrices may not be multiples of TILE_WIDTH
The clean tiled kernel assumes square matrices whose dimensions are exact multiples of the tile width. Real applications often have matrix boundaries where some tile positions fall outside the valid matrix.
3×3 matrix with TILE_WIDTH = 2
boundary tiles include invalid positions
// valid and invalid tile cells
// boundary handling rule
check index
inside matrix?
→
if invalid
write 0 into shared tile
Instead of padding the whole matrix in memory, the kernel writes 0 for out-of-bound tile elements.
Why zero works: a zero value contributes nothing to the multiply-add, so invalid tile positions do not corrupt the final output.
12 — Boundary Checks and Zero Padding
M, N, and P need different boundary conditions
The condition for loading an M tile element is different from loading an N tile element, and both are different from the condition for storing an output P element.
Loading M
M condition
1if (Row < M_height && p * TILE_WIDTH + tx < M_width)
2 Mds[ty][tx] = M[Row * M_width + p * TILE_WIDTH + tx];
3else Mds[ty][tx] = 0.0f;
Loading N
N condition
1if (p * TILE_WIDTH + ty < N_height && Col < N_width)
2 Nds[ty][tx] = N[(p * TILE_WIDTH + ty) * N_width + Col];
3else Nds[ty][tx] = 0.0f;
Output store condition
1if (Row < P_height && Col < P_width)
2 P[Row * P_width + Col] = Pvalue;
Important: do not only check whether the output cell is valid. A thread that does not store a valid P element may still load a valid input tile element that other threads need.
Rectangular matrices
If M is Height_M × Width_M and N is Width_M × Width_N, then P is Height_M × Width_N. The inner dimension Width_M controls the number of phases.
13 — Key Takeaways
Checklist for understanding tiled matrix multiplication
One thread
Computes one output element P[Row][Col] using a dot product.
One block
Computes one tile of the output matrix P.
One phase
Loads one tile from M and one tile from N, then performs partial dot-product work.
Shared memory
Stores reusable tile values so nearby threads do not repeatedly fetch the same values from global memory.
Barriers
The first barrier protects tile loading. The second protects tile reuse before the next phase overwrites shared memory.
Boundary handling
Invalid tile positions are replaced with zero; stores to P are guarded separately.
Mental model: tiled matrix multiplication repeats this loop: load tile → synchronize → compute with tile → synchronize → next tile.