Assignment 3 — CUDA SGEMM Tiled Kernel

01 — Kernel Goal

What this CUDA code computes

The kernel implements single-precision matrix multiplication, often called SGEMM. It computes an output matrix C from input matrices A and B.

A is m × k, B is k × n, C is m × n, and C[row][col] = Σ_i=0^k−1 A[row][i] × B[i][col]

The code uses shared memory tiling. Instead of every thread repeatedly loading all values from slow global memory, each thread block cooperatively loads a tile of A and a tile of B into on-chip shared memory, reuses those tile values, and then moves to the next tile along the inner k dimension.

High-level flow

global memory → shared memory → register sum → global memory

Global A and B
large input matrices

→

A_s and B_s
per-block shared memory tiles

→

C[row*n+col]
one output element per thread

One thread computes one output element. The output position is controlled by row and col.
One block computes one output tile. Since TILE_SIZE = 16, a full block computes a 16 × 16 tile of C.
Each phase loads a tile pair. One 16 × 16 tile from A and one 16 × 16 tile from B.
Boundary checks allow rectangular matrices. The kernel handles sizes where m, n, and k are not multiples of 16.

02 — Animated Code Walkthrough

Full kernel walkthrough with synchronized animation

The real code uses TILE_SIZE = 16. The animation below uses a smaller 4 × 4 visual tile so the movement is readable, but the same indexing logic applies to the actual 16 × 16 CUDA block.

kernel(3).cu
1#define TILE_SIZE 16
2
3__global__ void mysgemm(int m, int n, int k, const float *A, const float *B, float* C) {
4    __shared__ float A_s[TILE_SIZE][TILE_SIZE];
5    __shared__ float B_s[TILE_SIZE][TILE_SIZE];
6    unsigned int row = blockIdx.y * TILE_SIZE + threadIdx.y;
7    unsigned int col = blockIdx.x * TILE_SIZE + threadIdx.x;
8    float sum = 0.0f;
9
10    for (unsigned int tile = 0; tile < (k + TILE_SIZE - 1) / TILE_SIZE; ++tile) {
11        unsigned int a_col = tile * TILE_SIZE + threadIdx.x;
12        unsigned int b_row = tile * TILE_SIZE + threadIdx.y;
13
14        if (row < m && a_col < k) {
15            A_s[threadIdx.y][threadIdx.x] = A[row * k + a_col];
16        } else {
17            A_s[threadIdx.y][threadIdx.x] = 0.0f;
18        }
19
20        if (b_row < k && col < n) {
21            B_s[threadIdx.y][threadIdx.x] = B[b_row * n + col];
22        } else {
23            B_s[threadIdx.y][threadIdx.x] = 0.0f;
24        }
25
26        __syncthreads();
27
28        for (unsigned int i = 0; i < TILE_SIZE; ++i) {
29            sum += A_s[threadIdx.y][i] * B_s[i][threadIdx.x];
30        }
31
32        __syncthreads();
33    }
34
35    if (row < m && col < n) {
36        C[row * n + col] = sum;
37    }
38}

Stage title

stage

Representative thread block: blockIdx = (0, 0), visual TILE = 4

Global Am × k

Global Bk × n

Global Cm × n

A_s shared tilethread block local

B_s shared tilethread block local

03 — Host Wrapper

Host launch configuration

The host wrapper basicSgemm() chooses the CUDA block and grid dimensions. The block size is exactly the tile size, so each CUDA block has 16 × 16 = 256 threads.

host wrapper
1void basicSgemm(int m, int n, int k, const float *A, const float *B, float *C)
2{
3    const unsigned int BLOCK_SIZE = TILE_SIZE;
4    dim3 dim_block(BLOCK_SIZE, BLOCK_SIZE, 1);
5    dim3 dim_grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE,
6                  (m + BLOCK_SIZE - 1) / BLOCK_SIZE,
7                  1);
8    mysgemm<<<dim_grid, dim_block>>>(m, n, k, A, B, C);
9}

dim_block.x

16 threads across columns

dim_block.y

16 threads across rows

dim_grid.x

ceil(n / 16) output tile columns

dim_grid.y

ceil(m / 16) output tile rows

Why the ceiling formula? If n or m is not divisible by 16, the grid still launches enough blocks to cover the boundary tiles. Boundary checks inside the kernel prevent invalid writes.

04 — Thread Mapping

How a thread maps to one output element

Each thread uses its block position and local thread position to compute one global output coordinate.

row = blockIdx.y × TILE_SIZE + threadIdx.y
col = blockIdx.x × TILE_SIZE + threadIdx.x

2D block grid and output tile mapping

toy example: visual tile = 4

Thread coordinates inside one block

Output C tile produced by the same block

Thread(1,2) computes C[2][1] when blockIdx=(0,0). In the real kernel, the tile is 16 × 16, not 4 × 4.

Because the block is two-dimensional, threadIdx.y selects the row inside the output tile, while threadIdx.x selects the column inside the output tile.

05 — Shared Memory

Shared memory tiles: `A_s` and `B_s`

The two shared arrays are allocated once per thread block. All 256 threads in that block can access the same A_s and B_s, but different blocks have separate copies.

__shared__ float A_s[16][16];
__shared__ float B_s[16][16];

Without tiling

repeated global loads

Each thread repeatedly reads the row of A and column of B directly from global memory. Neighboring threads often reload values that other threads also need.

With tiling

cooperative reuse

Threads cooperatively load a tile once, then many threads reuse those values from shared memory during the inner compute loop.

Scope: shared memory is visible only to threads in the same block. This is why __syncthreads() is a block-level barrier.

06 — Phase Loop

The phase loop walks across the inner `k` dimension

The output element C[row][col] is a dot product. The dot product length is k. Instead of reading all k elements at once, the loop processes the dot product one tile at a time.

number of phases = ceil(k / TILE_SIZE) = (k + TILE_SIZE − 1) / TILE_SIZE

Phase timeline

same pattern repeats for each tile

Load A tile

Load B tile

__syncthreads() ①

Compute TILE_SIZE products

__syncthreads() ②

Next tile

For tile phase tile, the thread computes:

a_col = tile × TILE_SIZE + threadIdx.x
b_row = tile × TILE_SIZE + threadIdx.y

07 — Synchronization + Compute

Why the two barriers are necessary

There are two barriers in each phase. The first barrier protects the compute loop. The second barrier protects the next tile load.

First barrier

after loading tiles

__syncthreads() ensures all threads have finished writing A_s and B_s before any thread starts reading from them.

Second barrier

after using tiles

The next phase overwrites A_s and B_s. This barrier ensures no thread is still using the old tile when another thread begins overwriting it.

Without synchronization: one thread could read an uninitialized shared-memory value, or another thread could overwrite the tile before a slower thread finishes using it.

sum += A_s[threadIdx.y][i] × B_s[i][threadIdx.x]

During the compute loop, each thread reads one row from A_s and one column from B_s. The thread accumulates the products in a private register variable named sum.

08 — Boundary Checks

Boundary checks and zero fill

The code supports matrix dimensions that are not multiples of TILE_SIZE. Boundary threads may point outside the valid matrix. Instead of reading invalid memory, those threads write 0.0f into shared memory.

A tile check

row and a_col

if (row < m && a_col < k)
A_s[ty][tx] = A[row*k + a_col];
else A_s[ty][tx] = 0;

B tile check

b_row and col

if (b_row < k && col < n)
B_s[ty][tx] = B[b_row*n + col];
else B_s[ty][tx] = 0;

C store check

row and col

if (row < m && col < n)
C[row*n + col] = sum;

Why zero works: an out-of-range tile element contributes 0 × something or something × 0, so it does not change the dot product. This avoids padding the whole matrix in memory.

09 — Memory Locality

Why tiling reduces global memory traffic

The general rule is that tiling reduces global memory traffic by a factor of TILE_SIZE. With TILE_SIZE = 16, a tiled kernel can reduce repeated global loads by about 16× compared with a direct non-tiled kernel.

Non-tiled

reloads values often

Each A element is loaded once for every output column it contributes to, so when n = 64, each A value is loaded 64 times. Each B element is loaded once for every output row, so when m = 64, each B value is also loaded 64 times.

2 × 64 × 64 × 64 = 524,288 global loads

Tiled, 16 × 16

reuse through shared memory

With TILE_SIZE = 16 and a 64-wide matrix, there are 64 / 16 = 4 output tile groups per dimension. Each element is loaded 4 times instead of 64 times, so the reduction factor equals TILE_SIZE.

2 × 64 × 64 × 4 = 32,768 global loads

Reduction

491,520 fewer loads

Factor

TILE_SIZE = 16×

64×64 ops

262,144 multiplies

mul + add

524,288 FLOPs

Larger tiles generally improve reuse because each loaded value can be used by more threads before the next tile is loaded, but larger tiles also increase thread count, shared memory use, and occupancy pressure.

10 — Pitfalls + Checklist

Common pitfalls and correctness checklist

Wrong leading dimension: A is indexed with width k, B with width n, and C with width n.
Missing first sync: a thread may read A_s or B_s before another thread finishes loading it.
Missing second sync: one thread may overwrite shared memory for the next tile while another is still computing with the current tile.
No boundary check: non-multiple dimensions can read outside A or B, or write outside C.
Confusing block dimensions: blockIdx.x selects output tile columns; blockIdx.y selects output tile rows.

Final mental model: each block owns one tile of C. Each phase loads one compatible tile of A and one compatible tile of B. Threads synchronize, reuse the tiles to add partial dot-product contributions, synchronize again, and repeat until the final sum is written to C.

Tiled Matrix Multiplication Kernel Explained