CUDA Thread Hierarchy Notes

EE147: Graphics Processing Unit Computing and Programming

Warp Size
1 Warp = 32 Threads
Block Size Example
256 Threads / Block
Warps per Block
8 Warps / Block
Total Threads
25,600 Threads

CUDA hierarchy

Thread → Warp → Block → Grid

2D indexing map

From block coordinates to row and column
CUDA structure
Active highlight
Divergence / inactive threads
CUDA work starts with individual threads, groups them into warps, groups warps into blocks, and launches blocks inside a grid.
01 — CUDA Thread Hierarchy

CUDA Thread Hierarchy Notes

CUDA organizes work into a simple hierarchy: threads, warps, blocks, and grids.

Thread → Warp → Block → Grid

1. Thread

A thread is the smallest execution unit in CUDA. Each thread runs one instance of the kernel.

Each thread has its own:

2. Warp

A warp is a group of 32 threads. The GPU schedules and executes work in warps.

1 warp = 32 threads

For example, a block with 256 threads contains 8 warps:

256 threads per block ÷ 32 = 8 warps per block

Threads inside the same warp execute together. If threads in one warp take different branches, the warp diverges. Divergence can reduce performance because the GPU may need to execute each branch path separately.

3. Block

A block, also called a thread block, is a group of threads. Threads in the same block can cooperate with each other.

Threads inside the same block can:

Example kernel launch:

kernel<<<100, 256>>>();

This launch creates 100 blocks, and each block has 256 threads. Inside each block:

threadIdx.x = 0, 1, 2, ..., 255

A block runs entirely on one Streaming Multiprocessor, or SM. The block is not split across multiple SMs.

4. Grid

A grid is the full collection of blocks launched by one kernel call.

kernel<<<100, 256>>>();

This means:

100 blocks and 256 threads per block

The total number of threads is:

100 × 256 = 25,600 total threads
02 — Built-in Variables

CUDA Built-in Variables

CUDA gives every thread built-in variables that describe where it is located in the launch.

threadIdx

threadIdx is the thread's index inside its block.

threadIdx.x, threadIdx.y, threadIdx.z

blockDim

blockDim gives the number of threads in each block along each dimension.

blockDim.x, blockDim.y, blockDim.z

For example:

dim3 block(16, 16);

Inside the kernel, this means:

blockDim.x = 16, blockDim.y = 16, blockDim.z = 1

The number of threads per block is:

16 × 16 = 256 threads

blockIdx

blockIdx is the block's index inside the grid.

blockIdx.x, blockIdx.y, blockIdx.z

gridDim

gridDim gives the number of blocks in the grid along each dimension.

gridDim.x, gridDim.y, gridDim.z

For example:

dim3 grid(64, 64);

Inside the kernel, gridDim.x is 64 and gridDim.y is 64. The grid contains:

64 × 64 = 4096 blocks
03 — 1D Indexing

Global Thread Index

For a 1D kernel, each thread usually computes one global index. That index tells the thread which array element to process.

i = blockIdx.x × blockDim.x + threadIdx.x

Example:

blockIdx.x = 2, blockDim.x = 256, threadIdx.x = 10

Then:

i = 2 × 256 + 10 = 522
04 — 2D Indexing

2D Thread Index Example

For images or matrices, a 2D block and 2D grid often make the indexing easier to understand.

dim3 block(16, 16);
dim3 grid(64, 64);

Each block has:

16 × 16 = 256 threads

The grid has:

64 × 64 = 4096 blocks

The total number of threads is:

4096 × 256 = 1,048,576 threads

A 2D kernel often computes the global row and column like this:

row = blockIdx.y × blockDim.y + threadIdx.y
col = blockIdx.x × blockDim.x + threadIdx.x
05 — Hardware Mapping

Hardware Mapping Reference

The CUDA hierarchy is logical, but it maps to GPU hardware in a useful way.

Thread → CUDA core execution
Warp → scheduled execution group
Block → one Streaming Multiprocessor, or SM
Grid → complete GPU workload

Core Hardware Rules

Simple Analogy

Grid = entire factory
Block = one team
Warp = small squad of 32 workers
Thread = one worker
06 — Function Specifiers

CUDA Function Declarations and Device/Host Distinction

CUDA extends C/C++ with function specifiers. These keywords say where a function runs and where it can be called from.

Function Specifier Runs On Called From
__device__ float DeviceFunc() Device, meaning GPU Device, meaning GPU
__global__ void KernelFunc() Device, meaning GPU Host, meaning CPU
__host__ float HostFunc() Host, meaning CPU Host, meaning CPU

Important Rules

07 — Compiler Pipeline

Compiling a CUDA Program

CUDA programs are usually stored in .cu files and compiled with NVIDIA's CUDA compiler driver, nvcc.

08 — Grayscale Application

Practical Application: Color-to-Grayscale Conversion

An RGB image stores three color channels: red, green, and blue. A grayscale image stores one intensity value per pixel.

A common weighted conversion is:

GrayPixel = 0.21 × R + 0.71 × G + 0.07 × B

The CUDA kernel below maps each thread to one pixel:

#define CHANNELS 3 // Red, Green, Blue

__global__ void colorConvert(unsigned char* grayImage,
                             unsigned char* rgbImage,
                             int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height) {
        int grayOffset = y * width + x;
        int rgbOffset  = grayOffset * CHANNELS;

        unsigned char r = rgbImage[rgbOffset];
        unsigned char g = rgbImage[rgbOffset + 1];
        unsigned char b = rgbImage[rgbOffset + 2];

        grayImage[grayOffset] =
            (unsigned char)(0.21f * r + 0.71f * g + 0.07f * b);
    }
}
09 — Hardware Limits

Thread Block Scheduling and Hardware Limits

Thread blocks are designed to be independent. Because of this, the GPU can schedule blocks in any order across available SMs.

Hardware Capacity and Resource Limits

Each SM has limits on how many blocks and threads can be active at the same time. For the example below, assume these limits:

Block size affects occupancy. Occupancy is how much of the SM's available thread capacity is being used.

Block Geometry Threads per Block Active Blocks per SM Active Threads per SM Result
8 × 8 64 8 blocks 512 / 1536 The block limit is reached first, so many thread slots are unused.
16 × 16 256 6 blocks 1536 / 1536 This reaches full thread occupancy in this example.
32 × 32 1024 1 block 1024 / 1536 The large block size limits how many blocks can fit on the SM.
The best block size depends on the hardware, register use, shared memory use, and the kernel's memory access pattern. Occupancy is useful, but it is not the only performance metric.
10 — Warp Divergence

Warps and Control Flow Divergence Mechanics

Linearization Layout Mapping Rules

CUDA can define blocks in 1D, 2D, or 3D, but the hardware still forms warps from a linear ordering of threads.

Warp Control Divergence

Threads in the same warp normally execute the same instruction together. If an if statement sends some threads down one branch and other threads down another branch, the warp diverges.

Key idea: divergence is a warp-level issue, not a block-level issue. A block can contain many warps, and only one of them might diverge.

Visual examples

Example 1: no divergence

All 32 threads in the warp make the same decision, so the warp stays fully parallel.

  • All threads take the same branch.
  • The warp executes one path.
  • No serialization happens.

Example 2: branch divergence

A condition such as if (threadIdx.x < 16) splits one warp into two different paths.

  • Threads 0 to 15 take one path.
  • Threads 16 to 31 take another path.
  • The warp must execute both paths one after the other.

Example 3: boundary divergence

This is the common case created by a boundary check such as if (i < n) in the final warp.

  • Only part of the warp is still inside the valid range.
  • Valid threads do work; the rest are inactive.
  • This usually affects only the last warp or last block.

Quantifying the boundary check performance penalty

Consider a 1D vector addition kernel processing n = 1000 elements using 256 threads per block:

__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

The launch needs four blocks:

ceil(1000 / 256) = 4 blocks

1,000-element boundary example

256 threads per block → 8 warps per block → 32 warps total

The visual below shows four blocks. Blocks 0, 1, and 2 are fully valid. In block 3, only the final warp diverges.

Block size
256 threads
Warps per block
8 warps
Total warps
32 warps
Divergent warps
1 warp
  • Blocks 0, 1, and 2: These cover global thread indices 0 through 767. All threads are inside the array bounds, so no boundary divergence occurs.
  • Block 3: This covers global thread indices 768 through 1023.
  • Warps 0 through 6 in block 3: These cover indices 768 through 991. All are valid, so they do not diverge.
  • Warp 7 in block 3: This covers indices 992 through 1023. Threads 992 through 999 are valid, but threads 1000 through 1023 are out of bounds.

Conclusion: Only one warp out of 32 total warps diverges. The other 31 warps execute normally. This is why simple boundary checks are usually acceptable, especially when only the final block is partially outside the data range.