Reduction

EE147: Graphics Processing Unit Computing and Programming — focused notes on CUDA memories, shared memory, reduction trees, synchronization, and divergence-aware reduction kernels.

Fast On-Chip Memory
Shared Memory
Reduction Work
N − 1 Ops
Tree Depth
log₂(N) Steps
Block Design
2 × blockDim.x

Reduction tree

pairwise combine → one final value

CUDA memory hierarchy

registers, shared memory, global memory

Naive vs. contiguous indexing

why active thread layout matters
Active / useful work
Shared memory / tile
Divergence / inactive lanes
Final result
Reduction is work-efficient when it performs N − 1 combine operations, but performance depends heavily on shared memory usage, synchronization, and warp-friendly thread indexing.
01 — CUDA Memory Hierarchy

CUDA Memory Hierarchy

CUDA memory is easiest to understand from two views: the hardware view and the programmer view. The hardware view explains where memory physically sits in the GPU. The programmer view explains what memory spaces CUDA code can use.

Hardware View

Inside a Streaming Multiprocessor (SM), the SIMT front end fetches, decodes, and schedules instructions. The SIMD datapath executes issued instructions. Several memory structures sit close to the execution hardware:

Fetch
Decode
Schedule
Register File
SIMD Datapath

Programmer View

From the CUDA programmer's perspective, memory is organized by scope, lifetime, and speed.

DeclarationMemoryScopeLifetime / Notes
int LocalVar;Register or local memoryThreadPrivate to one thread. Scalars usually go into registers. Arrays or spilled registers may go into local memory.
__shared__ int SharedVar;Shared memoryBlockShared by all threads in the block. Exists for the lifetime of the block.
__device__ int GlobalVar;Global memoryGrid / deviceVisible to kernels on the device. Persists for the application unless freed or overwritten.
__constant__ int ConstantVar;Constant memoryGrid / deviceRead-only from device code. Useful when many threads read the same value.
Declaration rules worth remembering: __device__ is optional when used with __shared__ or __constant__. Automatic scalar variables inside a kernel normally reside in registers. Shared memory is visible only to threads in the same block.
02 — Shared Memory, Tiling, and Blocking

Shared Memory, Tiling, and Blocking

Shared memory is explicitly allocated inside a kernel and used as fast temporary storage. It is commonly used as a tile or partial-sum buffer.

Tiling data flow

global memory → shared tile → reused by threads
CUDA#define TILE_WIDTH 16

__global__ void blurKernel(unsigned char* in,
                           unsigned char* out,
                           int width, int height)
{
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];

    // Threads cooperatively load data into ds_in.
    // Then they synchronize before using the tile.
    __syncthreads();
}

A poor memory access pattern can make every thread repeatedly fetch data from global memory. Tiling reduces this cost by copying a small region of global memory into fast on-chip shared memory, then reusing it many times.

Core takeaway: Load a tile from global memory once, reuse it from shared memory many times, and reduce expensive global memory accesses.
03 — The Reduction Pattern

The Reduction Pattern

A reduction summarizes many input values into one output value using a binary operation.

y = x₀ ⊕ x₁ ⊕ x₂ ⊕ ··· ⊕ xN−1

Partition and summarize

input array → chunks → partial results → final value

Reduction tree example

[3, 1, 7, 0, 4, 1, 6, 3]

Block-level shared memory

stride 1 → stride 2 → stride 4

Naive vs improved mapping

scattered lanes vs packed lanes

Sum

Identity value is 0.

Product

Identity value is 1.

Maximum

Identity value is negative infinity.

Minimum

Identity value is positive infinity.

For parallel reduction, the operation should be associative. Commutativity is also helpful because it allows the implementation to rearrange the order of operations.

Partition and summarize: partition a large data set into chunks, let many threads compute partial results in parallel, then use a reduction tree to combine the partial results into one final value.
04 — Sequential vs. Parallel Reduction

Sequential vs. Parallel Reduction

Sequential vs. Parallel Reduction

same input, different schedule
Same result, different schedule: both approaches compute the same final result. The key difference is how the work is scheduled.

CUDA reduction uses the parallel tree idea, then maps it onto threads, shared memory, and synchronization.

Sequential Reduction

A sequential reduction visits every input value once. For a sum reduction:

Cfloat sum = 0.0f;              // identity value for sum
for (int i = 0; i < N; ++i) {
    sum += input[i];
}

This algorithm performs N visits and N − 1 useful combine operations, so its work complexity is O(N).

Parallel Reduction Tree

A reduction tree combines pairs of values at each level. For N values, the number of steps is approximately log₂(N), while the total number of combine operations is still N − 1.

N/2 + N/4 + N/8 + ··· + 1 = N − 1
Work-efficient but not always resource-efficient: the tree does the right amount of total work, but the first level can require a very large number of parallel resources.
05 — Basic Block-Level Reduction Design

Basic Block-Level Reduction Design

A common CUDA reduction design lets each block handle 2 × blockDim.x input elements. Each thread loads two values into shared memory, then the block reduces those values into one partial sum.

CUDA#define BLOCK_SIZE 256

__global__ void blockReduceSum(const float* input,
                               float* blockSums,
                               unsigned int n)
{
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t     = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    unsigned int i0 = start + t;
    unsigned int i1 = start + blockDim.x + t;

    partialSum[t]              = (i0 < n) ? input[i0] : 0.0f;
    partialSum[blockDim.x + t] = (i1 < n) ? input[i1] : 0.0f;

    __syncthreads();

    // Reduction loop goes here.

    if (t == 0) {
        blockSums[blockIdx.x] = partialSum[0];
    }
}

Code Walkthrough: Block-Level Reduction

code line → thread action → shared memory update
1__shared__ float partialSum[2 * BLOCK_SIZE];
2
3unsigned int t = threadIdx.x;
4unsigned int start = 2 * blockIdx.x * blockDim.x;
5
6partialSum[t] = input[start + t];
7partialSum[blockDim.x + t] = input[start + blockDim.x + t];
8
9for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
10 __syncthreads();
11
12 if (t % stride == 0) {
13 partialSum[2 * t] += partialSum[2 * t + stride];
14 }
15}
16
17if (t == 0) {
18 output[blockIdx.x] = partialSum[0];
19}
Stage 1 — Shared memory allocation Shared memory is allocated once per block. All threads in the block can access this array.
Threads in Block
4
Values Loaded
8
Reduction Steps
3
Final Block Result
25
06 — Naive Interleaved Reduction

Naive Reduction Kernel: Interleaved Thread Usage

The naive mapping uses increasing stride distances. In each step, a thread adds a value from farther away in the shared memory array.

CUDAfor (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
    __syncthreads();

    if (t % stride == 0) {
        partialSum[2 * t] += partialSum[2 * t + stride];
    }
}

Code Walkthrough: Naive Interleaved Reduction

scattered active threads as stride grows
1for (unsigned int stride = 1;
2 stride <= blockDim.x;
3 stride *= 2)
4{
5 __syncthreads();
6
7 if (t % stride == 0) {
8 partialSum[2 * t] += partialSum[2 * t + stride];
9 }
10}
Stage 1 — Initial shared memory The naive kernel starts with values already loaded into shared memory.
Input Values
8
Threads
4
Reduction Levels
3
Final Result
25
Problem summary: Naive interleaved indexing works correctly, but it spreads active work across the thread IDs. As the stride grows, many threads in the same warp become inactive.

Why This Mapping Performs Poorly

The issue is not the number of arithmetic operations. The issue is which threads stay active.

StepStrideActive PatternPerformance Concern
11Many threads activeGood arithmetic density.
22Fewer active threads, spaced outMore branch imbalance within warps.
3–64–32Active work becomes sparsePoor lane utilization.
7+64 and aboveVery few active threads remainActive warps may contain only one useful lane.
07 — Better Contiguous Reduction

Better Reduction Model: Contiguous Thread Usage

A better mapping keeps active threads consecutive at the front of the block. This improves warp behavior because whole warps tend to be either completely active or completely inactive.

CUDAfor (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();

    if (t < stride) {
        partialSum[t] += partialSum[t + stride];
    }
}

Code Walkthrough: Better Contiguous Reduction

packed active threads at the front
1for (unsigned int stride = blockDim.x;
2 stride > 0;
3 stride /= 2)
4{
5 __syncthreads();
6
7 if (t < stride) {
8 partialSum[t] += partialSum[t + stride];
9 }
10}
Stage 1 — Initial shared memory The better reduction model keeps active threads next to each other.
Input Values
8
Threads
4
Active Threads
Consecutive
Final Result
25

Naive vs. Contiguous Mapping

stride = 2 active-thread layout
Takeaway: Both methods compute the same answer. The improved version changes the indexing pattern so useful work stays packed into consecutive threads.

Why Contiguous Indexing Helps

Core takeaway: thread index usage matters. Keeping active partial sums in the front of the array keeps active threads consecutive, which reduces warp divergence.
08 — Barrier Synchronization

Barrier Synchronization

The call to __syncthreads() is required between reduction steps because threads communicate through shared memory. A thread must not read a partial sum until the previous step has finished writing it.

Why __syncthreads() Is Needed

shared memory dependency barrier
1for (unsigned int stride = 1;
2 stride <= blockDim.x;
3 stride *= 2)
4{
5 __syncthreads();
6
7 if (t % stride == 0) {
8 partialSum[2 * t] += partialSum[2 * t + stride];
9 }
10}
Stage 1 — Before reduction All threads have loaded values into shared memory, but CUDA threads do not automatically stay aligned unless we synchronize them.
Scope
One Block
Protects
Shared Memory
Prevents
Stale Reads
Does Not Sync
Other Blocks
What __syncthreads() guarantees: every thread in the block reaches the barrier before any thread passes it; shared memory writes from the previous step are visible after the barrier; the next step uses complete and correct partial sums.
Common bug: never place __syncthreads() inside a branch unless every thread in the block is guaranteed to execute that same branch. Otherwise, the block can deadlock.
09 — Handling Block Partial Sums

Handling Block Partial Sums

At the end of a block-level reduction, thread 0 writes one value per block into an output array:

CUDAif (threadIdx.x == 0) {
    blockSums[blockIdx.x] = partialSum[0];
}

What Happens After Each Block Finishes?

block partial sums → final result
1if (threadIdx.x == 0) {
2 output[blockIdx.x] = partialSum[0];
3}
Stage 1 — Block chunks Each block independently reduces a chunk of the input.
Blocks
4
Partial Sums
4
Final Result
86
Common Finish
Kernel / CPU
Important CUDA Limitation: __syncthreads() only synchronizes threads inside one block. Blocks cannot directly synchronize with each other inside a normal kernel. That is why each block writes a partial sum, and a later step combines those partial sums.

After this, there are several ways to finish the global reduction:

10 — Exam and Interview Checklist

Exam and Interview Checklist

  1. Memory hierarchy: know the difference between registers, shared memory, global memory, and constant memory.
  2. Shared memory scope: shared memory is visible only within one block and exists only while that block executes.
  3. Reduction definition: a reduction combines many values into one value using an associative operator.
  4. Work efficiency: a tree reduction performs N − 1 combine operations, comparable to the sequential algorithm.
  5. Step complexity: a tree reduction has about log₂(N) levels.
  6. Synchronization: __syncthreads() is required between shared-memory reduction steps.
  7. Divergence: scattered active threads cause warp divergence and low lane utilization.
  8. Optimization idea: keep active threads consecutive and compact partial sums toward the front of shared memory.