CUDA Memory Hierarchy
CUDA memory is easiest to understand from two views: the hardware view and the programmer view. The hardware view explains where memory physically sits in the GPU. The programmer view explains what memory spaces CUDA code can use.
Hardware View
Inside a Streaming Multiprocessor (SM), the SIMT front end fetches, decodes, and schedules instructions. The SIMD datapath executes issued instructions. Several memory structures sit close to the execution hardware:
- Register file: private storage for each thread.
- Shared memory (SMem): fast on-chip scratchpad shared by threads in the same block.
- L1 data cache: cache for data accesses.
- Texture and constant caches: specialized read-only caches.
- Global memory: off-chip DRAM accessed through memory partitions and the interconnection network.
Programmer View
From the CUDA programmer's perspective, memory is organized by scope, lifetime, and speed.
| Declaration | Memory | Scope | Lifetime / Notes |
|---|---|---|---|
int LocalVar; | Register or local memory | Thread | Private to one thread. Scalars usually go into registers. Arrays or spilled registers may go into local memory. |
__shared__ int SharedVar; | Shared memory | Block | Shared by all threads in the block. Exists for the lifetime of the block. |
__device__ int GlobalVar; | Global memory | Grid / device | Visible to kernels on the device. Persists for the application unless freed or overwritten. |
__constant__ int ConstantVar; | Constant memory | Grid / device | Read-only from device code. Useful when many threads read the same value. |
__device__ is optional when used with __shared__ or __constant__. Automatic scalar variables inside a kernel normally reside in registers. Shared memory is visible only to threads in the same block.
Shared Memory, Tiling, and Blocking
Shared memory is explicitly allocated inside a kernel and used as fast temporary storage. It is commonly used as a tile or partial-sum buffer.
Tiling data flow
global memory → shared tile → reused by threadsCUDA#define TILE_WIDTH 16
__global__ void blurKernel(unsigned char* in,
unsigned char* out,
int width, int height)
{
__shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
// Threads cooperatively load data into ds_in.
// Then they synchronize before using the tile.
__syncthreads();
}
A poor memory access pattern can make every thread repeatedly fetch data from global memory. Tiling reduces this cost by copying a small region of global memory into fast on-chip shared memory, then reusing it many times.
The Reduction Pattern
A reduction summarizes many input values into one output value using a binary operation.
Partition and summarize
input array → chunks → partial results → final valueReduction tree example
[3, 1, 7, 0, 4, 1, 6, 3]Block-level shared memory
stride 1 → stride 2 → stride 4Naive vs improved mapping
scattered lanes vs packed lanesSum
Identity value is 0.
Product
Identity value is 1.
Maximum
Identity value is negative infinity.
Minimum
Identity value is positive infinity.
For parallel reduction, the operation should be associative. Commutativity is also helpful because it allows the implementation to rearrange the order of operations.
Sequential vs. Parallel Reduction
Sequential vs. Parallel Reduction
same input, different scheduleCUDA reduction uses the parallel tree idea, then maps it onto threads, shared memory, and synchronization.
Sequential Reduction
A sequential reduction visits every input value once. For a sum reduction:
Cfloat sum = 0.0f; // identity value for sum
for (int i = 0; i < N; ++i) {
sum += input[i];
}
This algorithm performs N visits and N − 1 useful combine operations, so its work complexity is O(N).
Parallel Reduction Tree
A reduction tree combines pairs of values at each level. For N values, the number of steps is approximately log₂(N), while the total number of combine operations is still N − 1.
Basic Block-Level Reduction Design
A common CUDA reduction design lets each block handle 2 × blockDim.x input elements. Each thread loads two values into shared memory, then the block reduces those values into one partial sum.
CUDA#define BLOCK_SIZE 256
__global__ void blockReduceSum(const float* input,
float* blockSums,
unsigned int n)
{
__shared__ float partialSum[2 * BLOCK_SIZE];
unsigned int t = threadIdx.x;
unsigned int start = 2 * blockIdx.x * blockDim.x;
unsigned int i0 = start + t;
unsigned int i1 = start + blockDim.x + t;
partialSum[t] = (i0 < n) ? input[i0] : 0.0f;
partialSum[blockDim.x + t] = (i1 < n) ? input[i1] : 0.0f;
__syncthreads();
// Reduction loop goes here.
if (t == 0) {
blockSums[blockIdx.x] = partialSum[0];
}
}
Code Walkthrough: Block-Level Reduction
code line → thread action → shared memory update__shared__ float partialSum[2 * BLOCK_SIZE];unsigned int t = threadIdx.x;unsigned int start = 2 * blockIdx.x * blockDim.x;partialSum[t] = input[start + t];partialSum[blockDim.x + t] = input[start + blockDim.x + t];for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) { __syncthreads(); if (t % stride == 0) { partialSum[2 * t] += partialSum[2 * t + stride]; }}if (t == 0) { output[blockIdx.x] = partialSum[0];}Naive Reduction Kernel: Interleaved Thread Usage
The naive mapping uses increasing stride distances. In each step, a thread adds a value from farther away in the shared memory array.
CUDAfor (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
__syncthreads();
if (t % stride == 0) {
partialSum[2 * t] += partialSum[2 * t + stride];
}
}
Code Walkthrough: Naive Interleaved Reduction
scattered active threads as stride growsfor (unsigned int stride = 1; stride <= blockDim.x; stride *= 2){ __syncthreads(); if (t % stride == 0) { partialSum[2 * t] += partialSum[2 * t + stride]; }}Why This Mapping Performs Poorly
The issue is not the number of arithmetic operations. The issue is which threads stay active.
- As stride grows, fewer threads perform useful work.
- Threads that do not perform additions still occupy lanes in their warp when a branch is evaluated.
- Early iterations can create control divergence because some threads in the same warp execute the addition path and others do not.
- Later iterations may have entire warps inactive. That removes divergence in those warps, but little useful work remains.
| Step | Stride | Active Pattern | Performance Concern |
|---|---|---|---|
| 1 | 1 | Many threads active | Good arithmetic density. |
| 2 | 2 | Fewer active threads, spaced out | More branch imbalance within warps. |
| 3–6 | 4–32 | Active work becomes sparse | Poor lane utilization. |
| 7+ | 64 and above | Very few active threads remain | Active warps may contain only one useful lane. |
Better Reduction Model: Contiguous Thread Usage
A better mapping keeps active threads consecutive at the front of the block. This improves warp behavior because whole warps tend to be either completely active or completely inactive.
CUDAfor (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
__syncthreads();
if (t < stride) {
partialSum[t] += partialSum[t + stride];
}
}
Code Walkthrough: Better Contiguous Reduction
packed active threads at the frontfor (unsigned int stride = blockDim.x; stride > 0; stride /= 2){ __syncthreads(); if (t < stride) { partialSum[t] += partialSum[t + stride]; }}Naive vs. Contiguous Mapping
stride = 2 active-thread layoutWhy Contiguous Indexing Helps
- Active threads are packed at the front of the block.
- Complete warps tend to evaluate the branch the same way.
- The first several steps have no warp divergence because each warp is either fully active or fully inactive.
- Divergence appears only in the final few steps when the number of active threads becomes smaller than one warp.
Barrier Synchronization
The call to __syncthreads() is required between reduction steps because threads communicate through shared memory. A thread must not read a partial sum until the previous step has finished writing it.
Why __syncthreads() Is Needed
shared memory dependency barrier
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2){ __syncthreads(); if (t % stride == 0) { partialSum[2 * t] += partialSum[2 * t + stride]; }}__syncthreads() guarantees: every thread in the block reaches the barrier before any thread passes it; shared memory writes from the previous step are visible after the barrier; the next step uses complete and correct partial sums.
__syncthreads() inside a branch unless every thread in the block is guaranteed to execute that same branch. Otherwise, the block can deadlock.Handling Block Partial Sums
At the end of a block-level reduction, thread 0 writes one value per block into an output array:
CUDAif (threadIdx.x == 0) {
blockSums[blockIdx.x] = partialSum[0];
}
What Happens After Each Block Finishes?
block partial sums → final resultif (threadIdx.x == 0) { output[blockIdx.x] = partialSum[0];}__syncthreads() only synchronizes threads inside one block. Blocks cannot directly synchronize with each other inside a normal kernel. That is why each block writes a partial sum, and a later step combines those partial sums.
After this, there are several ways to finish the global reduction:
- Launch another reduction kernel on the
blockSumsarray. - Repeat kernel launches until only one value remains.
- If only a small number of block sums remain, copy them back to the CPU and finish on the host.
- Use atomic operations to accumulate into one global variable, if the performance trade-off is acceptable.
Exam and Interview Checklist
- Memory hierarchy: know the difference between registers, shared memory, global memory, and constant memory.
- Shared memory scope: shared memory is visible only within one block and exists only while that block executes.
- Reduction definition: a reduction combines many values into one value using an associative operator.
- Work efficiency: a tree reduction performs
N − 1combine operations, comparable to the sequential algorithm. - Step complexity: a tree reduction has about
log₂(N)levels. - Synchronization:
__syncthreads()is required between shared-memory reduction steps. - Divergence: scattered active threads cause warp divergence and low lane utilization.
- Optimization idea: keep active threads consecutive and compact partial sums toward the front of shared memory.