What this CUDA code computes
The kernel implements single-precision matrix multiplication, often called SGEMM. It computes an output matrix C from input matrices A and B.
The code uses shared memory tiling. Instead of every thread repeatedly loading all values from slow global memory, each thread block cooperatively loads a tile of A and a tile of B into on-chip shared memory, reuses those tile values, and then moves to the next tile along the inner k dimension.
High-level flow
global memory → shared memory → register sum → global memorylarge input matrices
per-block shared memory tiles
one output element per thread
- One thread computes one output element. The output position is controlled by
rowandcol. - One block computes one output tile. Since
TILE_SIZE = 16, a full block computes a 16 × 16 tile ofC. - Each phase loads a tile pair. One 16 × 16 tile from
Aand one 16 × 16 tile fromB. - Boundary checks allow rectangular matrices. The kernel handles sizes where
m,n, andkare not multiples of 16.
Full kernel walkthrough with synchronized animation
The real code uses TILE_SIZE = 16. The animation below uses a smaller 4 × 4 visual tile so the movement is readable, but the same indexing logic applies to the actual 16 × 16 CUDA block.
Host launch configuration
The host wrapper basicSgemm() chooses the CUDA block and grid dimensions. The block size is exactly the tile size, so each CUDA block has 16 × 16 = 256 threads.
n or m is not divisible by 16, the grid still launches enough blocks to cover the boundary tiles. Boundary checks inside the kernel prevent invalid writes.How a thread maps to one output element
Each thread uses its block position and local thread position to compute one global output coordinate.
col = blockIdx.x × TILE_SIZE + threadIdx.x
2D block grid and output tile mapping
toy example: visual tile = 4Because the block is two-dimensional, threadIdx.y selects the row inside the output tile, while threadIdx.x selects the column inside the output tile.
Shared memory tiles: A_s and B_s
The two shared arrays are allocated once per thread block. All 256 threads in that block can access the same A_s and B_s, but different blocks have separate copies.
__shared__ float B_s[16][16];
Without tiling
repeated global loadsEach thread repeatedly reads the row of A and column of B directly from global memory. Neighboring threads often reload values that other threads also need.
With tiling
cooperative reuseThreads cooperatively load a tile once, then many threads reuse those values from shared memory during the inner compute loop.
__syncthreads() is a block-level barrier.The phase loop walks across the inner k dimension
The output element C[row][col] is a dot product. The dot product length is k. Instead of reading all k elements at once, the loop processes the dot product one tile at a time.
Phase timeline
same pattern repeats for each tileFor tile phase tile, the thread computes:
b_row = tile × TILE_SIZE + threadIdx.y
Why the two barriers are necessary
There are two barriers in each phase. The first barrier protects the compute loop. The second barrier protects the next tile load.
First barrier
after loading tiles__syncthreads() ensures all threads have finished writing A_s and B_s before any thread starts reading from them.
Second barrier
after using tilesThe next phase overwrites A_s and B_s. This barrier ensures no thread is still using the old tile when another thread begins overwriting it.
During the compute loop, each thread reads one row from A_s and one column from B_s. The thread accumulates the products in a private register variable named sum.
Boundary checks and zero fill
The code supports matrix dimensions that are not multiples of TILE_SIZE. Boundary threads may point outside the valid matrix. Instead of reading invalid memory, those threads write 0.0f into shared memory.
A tile check
row and a_colA_s[ty][tx] = A[row*k + a_col];
else A_s[ty][tx] = 0;
B tile check
b_row and colB_s[ty][tx] = B[b_row*n + col];
else B_s[ty][tx] = 0;
C store check
row and colC[row*n + col] = sum;
0 × something or something × 0, so it does not change the dot product. This avoids padding the whole matrix in memory.Why tiling reduces global memory traffic
The general rule is that tiling reduces global memory traffic by a factor of TILE_SIZE. With TILE_SIZE = 16, a tiled kernel can reduce repeated global loads by about 16× compared with a direct non-tiled kernel.
Non-tiled
reloads values oftenEach A element is loaded once for every output column it contributes to, so when n = 64, each A value is loaded 64 times. Each B element is loaded once for every output row, so when m = 64, each B value is also loaded 64 times.
Tiled, 16 × 16
reuse through shared memoryWith TILE_SIZE = 16 and a 64-wide matrix, there are 64 / 16 = 4 output tile groups per dimension. Each element is loaded 4 times instead of 64 times, so the reduction factor equals TILE_SIZE.
Larger tiles generally improve reuse because each loaded value can be used by more threads before the next tile is loaded, but larger tiles also increase thread count, shared memory use, and occupancy pressure.
Common pitfalls and correctness checklist
- Wrong leading dimension:
Ais indexed with widthk,Bwith widthn, andCwith widthn. - Missing first sync: a thread may read
A_sorB_sbefore another thread finishes loading it. - Missing second sync: one thread may overwrite shared memory for the next tile while another is still computing with the current tile.
- No boundary check: non-multiple dimensions can read outside
AorB, or write outsideC. - Confusing block dimensions:
blockIdx.xselects output tile columns;blockIdx.yselects output tile rows.
C. Each phase loads one compatible tile of A and one compatible tile of B. Threads synchronize, reuse the tiles to add partial dot-product contributions, synchronize again, and repeat until the final sum is written to C.