CUDA Thread Hierarchy Notes

01 — CUDA Thread Hierarchy

CUDA Thread Hierarchy Notes

CUDA organizes work into a simple hierarchy: threads, warps, blocks, and grids.

Thread \to Warp \to Block \to Grid

1. Thread

A thread is the smallest execution unit in CUDA. Each thread runs one instance of the kernel.

Each thread has its own:

threadIdx, which tells the thread its location inside the block
private registers
private local variables

2. Warp

A warp is a group of 32 threads. The GPU schedules and executes work in warps.

1 warp = 32 threads

For example, a block with 256 threads contains 8 warps:

256 threads per block \div 32 = 8 warps per block

Threads inside the same warp execute together. If threads in one warp take different branches, the warp diverges. Divergence can reduce performance because the GPU may need to execute each branch path separately.

3. Block

A block, also called a thread block, is a group of threads. Threads in the same block can cooperate with each other.

Threads inside the same block can:

share low-latency shared memory
synchronize using __syncthreads()
work together on the same part of a problem

Example kernel launch:

kernel<<<100, 256>>>();

This launch creates 100 blocks, and each block has 256 threads. Inside each block:

threadIdx.x = 0, 1, 2, ..., 255

A block runs entirely on one Streaming Multiprocessor, or SM. The block is not split across multiple SMs.

4. Grid

A grid is the full collection of blocks launched by one kernel call.

kernel<<<100, 256>>>();

This means:

100 blocks and 256 threads per block

The total number of threads is:

100 \times 256 = 25,600 total threads

02 — Built-in Variables

CUDA Built-in Variables

CUDA gives every thread built-in variables that describe where it is located in the launch.

`threadIdx`

threadIdx is the thread's index inside its block.

threadIdx.x, threadIdx.y, threadIdx.z

`blockDim`

blockDim gives the number of threads in each block along each dimension.

blockDim.x, blockDim.y, blockDim.z

For example:

dim3 block(16, 16);

Inside the kernel, this means:

blockDim.x = 16, blockDim.y = 16, blockDim.z = 1

The number of threads per block is:

16 \times 16 = 256 threads

`blockIdx`

blockIdx is the block's index inside the grid.

blockIdx.x, blockIdx.y, blockIdx.z

`gridDim`

gridDim gives the number of blocks in the grid along each dimension.

gridDim.x, gridDim.y, gridDim.z

For example:

dim3 grid(64, 64);

Inside the kernel, gridDim.x is 64 and gridDim.y is 64. The grid contains:

64 \times 64 = 4096 blocks

03 — 1D Indexing

Global Thread Index

For a 1D kernel, each thread usually computes one global index. That index tells the thread which array element to process.

i = blockIdx.x \times blockDim.x + threadIdx.x

Example:

blockIdx.x = 2, blockDim.x = 256, threadIdx.x = 10

Then:

i = 2 \times 256 + 10 = 522

04 — 2D Indexing

2D Thread Index Example

For images or matrices, a 2D block and 2D grid often make the indexing easier to understand.

dim3 block(16, 16);
dim3 grid(64, 64);

Each block has:

16 \times 16 = 256 threads

The grid has:

64 \times 64 = 4096 blocks

The total number of threads is:

4096 \times 256 = 1,048,576 threads

A 2D kernel often computes the global row and column like this:

row = blockIdx.y \times blockDim.y + threadIdx.y

col = blockIdx.x \times blockDim.x + threadIdx.x

05 — Hardware Mapping

Hardware Mapping Reference

The CUDA hierarchy is logical, but it maps to GPU hardware in a useful way.

Thread \to CUDA core execution

Warp \to scheduled execution group

Block \to one Streaming Multiprocessor, or SM

Grid \to complete GPU workload

Core Hardware Rules

A warp always contains exactly 32 threads.
A block is made of threads.
A grid is made of blocks.
Threads in the same block can synchronize.
Blocks cannot directly synchronize with each other inside a normal kernel.
A block runs entirely on one SM.
Different blocks can run on different SMs.

Simple Analogy

Grid = entire factory

Block = one team

Warp = small squad of 32 workers

Thread = one worker

06 — Function Specifiers

CUDA Function Declarations and Device/Host Distinction

CUDA extends C/C++ with function specifiers. These keywords say where a function runs and where it can be called from.

Function Specifier	Runs On	Called From
`__device__ float DeviceFunc()`	Device, meaning GPU	Device, meaning GPU
`__global__ void KernelFunc()`	Device, meaning GPU	Host, meaning CPU
`__host__ float HostFunc()`	Host, meaning CPU	Host, meaning CPU

Important Rules

__global__ marks a function as a CUDA kernel.
A __global__ kernel is launched with triple angle brackets, like kernel<<<grid, block>>>();.
A __global__ function must return void.
CUDA specifiers use double underscores before and after the word, such as __global__.
__host__ and __device__ can be used together, such as __host__ __device__ void CommonMath(). This tells the compiler to generate both CPU and GPU versions.
If no specifier is given, CUDA treats the function as a normal host function.

07 — Compiler Pipeline

Compiling a CUDA Program

CUDA programs are usually stored in .cu files and compiled with NVIDIA's CUDA compiler driver, nvcc.

Source separation: nvcc separates host code from device code.
Host code: CPU code is passed to a normal C/C++ compiler, such as gcc, clang, or cl.exe.
Device code: GPU code is compiled into PTX, which stands for Parallel Thread Execution.
Hardware mapping: PTX is translated into lower-level GPU instructions called SASS. This can happen before runtime or just-in-time at runtime.

08 — Grayscale Application

Practical Application: Color-to-Grayscale Conversion

An RGB image stores three color channels: red, green, and blue. A grayscale image stores one intensity value per pixel.

A common weighted conversion is:

GrayPixel = 0.21 \times R + 0.71 \times G + 0.07 \times B

The CUDA kernel below maps each thread to one pixel:

#define CHANNELS 3 // Red, Green, Blue

__global__ void colorConvert(unsigned char* grayImage,
                             unsigned char* rgbImage,
                             int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if (x < width && y < height) {
        int grayOffset = y * width + x;
        int rgbOffset  = grayOffset * CHANNELS;

        unsigned char r = rgbImage[rgbOffset];
        unsigned char g = rgbImage[rgbOffset + 1];
        unsigned char b = rgbImage[rgbOffset + 2];

        grayImage[grayOffset] =
            (unsigned char)(0.21f * r + 0.71f * g + 0.07f * b);
    }
}

09 — Hardware Limits

Thread Block Scheduling and Hardware Limits

Thread blocks are designed to be independent. Because of this, the GPU can schedule blocks in any order across available SMs.

Hardware Capacity and Resource Limits

Each SM has limits on how many blocks and threads can be active at the same time. For the example below, assume these limits:

Maximum blocks per SM: 8 blocks
Maximum threads per SM: 1536 threads
Maximum threads per block: 1024 threads

Block size affects occupancy. Occupancy is how much of the SM's available thread capacity is being used.

Block Geometry	Threads per Block	Active Blocks per SM	Active Threads per SM	Result
`8 × 8`	64	8 blocks	512 / 1536	The block limit is reached first, so many thread slots are unused.
`16 × 16`	256	6 blocks	1536 / 1536	This reaches full thread occupancy in this example.
`32 × 32`	1024	1 block	1024 / 1536	The large block size limits how many blocks can fit on the SM.

The best block size depends on the hardware, register use, shared memory use, and the kernel's memory access pattern. Occupancy is useful, but it is not the only performance metric.

10 — Warp Divergence

Warps and Control Flow Divergence Mechanics

Linearization Layout Mapping Rules

CUDA can define blocks in 1D, 2D, or 3D, but the hardware still forms warps from a linear ordering of threads.

Thread coordinates are linearized in row-major order.
The x dimension increments fastest, followed by y, then z.
The linear thread order is split into groups of 32 threads.
Warp 0 contains linear thread positions 0 through 31. Warp 1 contains positions 32 through 63, and so on.

Warp Control Divergence

Threads in the same warp normally execute the same instruction together. If an if statement sends some threads down one branch and other threads down another branch, the warp diverges.

The GPU runs one branch while masking off the threads that should not take that branch.
Then it switches masks and runs the other branch.
This makes the branch paths run sequentially instead of fully in parallel.

Key idea: divergence is a warp-level issue, not a block-level issue. A block can contain many warps, and only one of them might diverge.

Visual examples

Example 1: no divergence

All 32 threads in the warp make the same decision, so the warp stays fully parallel.

All threads take the same branch.
The warp executes one path.
No serialization happens.

Example 2: branch divergence

A condition such as if (threadIdx.x < 16) splits one warp into two different paths.

Threads 0 to 15 take one path.
Threads 16 to 31 take another path.
The warp must execute both paths one after the other.

Example 3: boundary divergence

This is the common case created by a boundary check such as if (i < n) in the final warp.

Only part of the warp is still inside the valid range.
Valid threads do work; the rest are inactive.
This usually affects only the last warp or last block.

Quantifying the boundary check performance penalty

Consider a 1D vector addition kernel processing $n = 1000$ elements using 256 threads per block:

__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

The launch needs four blocks:

ceil(1000 / 256) = 4 blocks

1,000-element boundary example

256 threads per block → 8 warps per block → 32 warps total

The visual below shows four blocks. Blocks 0, 1, and 2 are fully valid. In block 3, only the final warp diverges.

Block size

256 threads

Warps per block

8 warps

Total warps

32 warps

Divergent warps

1 warp

Blocks 0, 1, and 2: These cover global thread indices 0 through 767. All threads are inside the array bounds, so no boundary divergence occurs.
Block 3: This covers global thread indices 768 through 1023.
Warps 0 through 6 in block 3: These cover indices 768 through 991. All are valid, so they do not diverge.
Warp 7 in block 3: This covers indices 992 through 1023. Threads 992 through 999 are valid, but threads 1000 through 1023 are out of bounds.

Conclusion: Only one warp out of 32 total warps diverges. The other 31 warps execute normally. This is why simple boundary checks are usually acceptable, especially when only the final block is partially outside the data range.

CUDA hierarchy

2D indexing map