Example 1: no divergence
All 32 threads in the warp make the same decision, so the warp stays fully parallel.
- All threads take the same branch.
- The warp executes one path.
- No serialization happens.
EE147: Graphics Processing Unit Computing and Programming
CUDA organizes work into a simple hierarchy: threads, warps, blocks, and grids.
A thread is the smallest execution unit in CUDA. Each thread runs one instance of the kernel.
Each thread has its own:
threadIdx, which tells the thread its location inside the blockA warp is a group of 32 threads. The GPU schedules and executes work in warps.
For example, a block with 256 threads contains 8 warps:
Threads inside the same warp execute together. If threads in one warp take different branches, the warp diverges. Divergence can reduce performance because the GPU may need to execute each branch path separately.
A block, also called a thread block, is a group of threads. Threads in the same block can cooperate with each other.
Threads inside the same block can:
__syncthreads()Example kernel launch:
kernel<<<100, 256>>>();
This launch creates 100 blocks, and each block has 256 threads. Inside each block:
threadIdx.x = 0, 1, 2, ..., 255A block runs entirely on one Streaming Multiprocessor, or SM. The block is not split across multiple SMs.
A grid is the full collection of blocks launched by one kernel call.
kernel<<<100, 256>>>();
This means:
The total number of threads is:
CUDA gives every thread built-in variables that describe where it is located in the launch.
threadIdxthreadIdx is the thread's index inside its block.
threadIdx.x, threadIdx.y, threadIdx.zblockDimblockDim gives the number of threads in each block along each dimension.
blockDim.x, blockDim.y, blockDim.zFor example:
dim3 block(16, 16);
Inside the kernel, this means:
blockDim.x = 16, blockDim.y = 16, blockDim.z = 1The number of threads per block is:
blockIdxblockIdx is the block's index inside the grid.
blockIdx.x, blockIdx.y, blockIdx.zgridDimgridDim gives the number of blocks in the grid along each dimension.
gridDim.x, gridDim.y, gridDim.zFor example:
dim3 grid(64, 64);
Inside the kernel, gridDim.x is 64 and gridDim.y is 64. The grid contains:
For a 1D kernel, each thread usually computes one global index. That index tells the thread which array element to process.
blockIdx.x × blockDim.x + threadIdx.xExample:
blockIdx.x = 2, blockDim.x = 256, threadIdx.x = 10Then:
For images or matrices, a 2D block and 2D grid often make the indexing easier to understand.
dim3 block(16, 16);
dim3 grid(64, 64);
Each block has:
The grid has:
The total number of threads is:
A 2D kernel often computes the global row and column like this:
blockIdx.y × blockDim.y + threadIdx.yblockIdx.x × blockDim.x + threadIdx.xThe CUDA hierarchy is logical, but it maps to GPU hardware in a useful way.
CUDA extends C/C++ with function specifiers. These keywords say where a function runs and where it can be called from.
| Function Specifier | Runs On | Called From |
|---|---|---|
__device__ float DeviceFunc() |
Device, meaning GPU | Device, meaning GPU |
__global__ void KernelFunc() |
Device, meaning GPU | Host, meaning CPU |
__host__ float HostFunc() |
Host, meaning CPU | Host, meaning CPU |
__global__ marks a function as a CUDA kernel.__global__ kernel is launched with triple angle brackets, like kernel<<<grid, block>>>();.__global__ function must return void.__global__.__host__ and __device__ can be used together, such as __host__ __device__ void CommonMath(). This tells the compiler to generate both CPU and GPU versions.CUDA programs are usually stored in .cu files and compiled with NVIDIA's CUDA compiler driver, nvcc.
nvcc separates host code from device code.gcc, clang, or cl.exe.An RGB image stores three color channels: red, green, and blue. A grayscale image stores one intensity value per pixel.
A common weighted conversion is:
The CUDA kernel below maps each thread to one pixel:
#define CHANNELS 3 // Red, Green, Blue
__global__ void colorConvert(unsigned char* grayImage,
unsigned char* rgbImage,
int width, int height)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
if (x < width && y < height) {
int grayOffset = y * width + x;
int rgbOffset = grayOffset * CHANNELS;
unsigned char r = rgbImage[rgbOffset];
unsigned char g = rgbImage[rgbOffset + 1];
unsigned char b = rgbImage[rgbOffset + 2];
grayImage[grayOffset] =
(unsigned char)(0.21f * r + 0.71f * g + 0.07f * b);
}
}
Thread blocks are designed to be independent. Because of this, the GPU can schedule blocks in any order across available SMs.
Each SM has limits on how many blocks and threads can be active at the same time. For the example below, assume these limits:
Block size affects occupancy. Occupancy is how much of the SM's available thread capacity is being used.
| Block Geometry | Threads per Block | Active Blocks per SM | Active Threads per SM | Result |
|---|---|---|---|---|
8 × 8 |
64 | 8 blocks | 512 / 1536 | The block limit is reached first, so many thread slots are unused. |
16 × 16 |
256 | 6 blocks | 1536 / 1536 | This reaches full thread occupancy in this example. |
32 × 32 |
1024 | 1 block | 1024 / 1536 | The large block size limits how many blocks can fit on the SM. |
CUDA can define blocks in 1D, 2D, or 3D, but the hardware still forms warps from a linear ordering of threads.
Threads in the same warp normally execute the same instruction together. If an if statement sends some threads down one branch and other threads down another branch, the warp diverges.
All 32 threads in the warp make the same decision, so the warp stays fully parallel.
A condition such as if (threadIdx.x < 16) splits one warp into two different paths.
This is the common case created by a boundary check such as if (i < n) in the final warp.
Consider a 1D vector addition kernel processing n = 1000 elements using 256 threads per block:
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < n) {
C[i] = A[i] + B[i];
}
}
The launch needs four blocks:
The visual below shows four blocks. Blocks 0, 1, and 2 are fully valid. In block 3, only the final warp diverges.
Conclusion: Only one warp out of 32 total warps diverges. The other 31 warps execute normally. This is why simple boundary checks are usually acceptable, especially when only the final block is partially outside the data range.