Reduction

01 — CUDA Memory Hierarchy

CUDA Memory Hierarchy

CUDA memory is easiest to understand from two views: the hardware view and the programmer view. The hardware view explains where memory physically sits in the GPU. The programmer view explains what memory spaces CUDA code can use.

Hardware View

Inside a Streaming Multiprocessor (SM), the SIMT front end fetches, decodes, and schedules instructions. The SIMD datapath executes issued instructions. Several memory structures sit close to the execution hardware:

Register file: private storage for each thread.
Shared memory (SMem): fast on-chip scratchpad shared by threads in the same block.
L1 data cache: cache for data accesses.
Texture and constant caches: specialized read-only caches.
Global memory: off-chip DRAM accessed through memory partitions and the interconnection network.

Fetch

→

Decode

→

Schedule

→

SIMD Datapath

Programmer View

From the CUDA programmer's perspective, memory is organized by scope, lifetime, and speed.

Declaration	Memory	Scope	Lifetime / Notes
`int LocalVar;`	Register or local memory	Thread	Private to one thread. Scalars usually go into registers. Arrays or spilled registers may go into local memory.
`__shared__ int SharedVar;`	Shared memory	Block	Shared by all threads in the block. Exists for the lifetime of the block.
`__device__ int GlobalVar;`	Global memory	Grid / device	Visible to kernels on the device. Persists for the application unless freed or overwritten.
`__constant__ int ConstantVar;`	Constant memory	Grid / device	Read-only from device code. Useful when many threads read the same value.

Declaration rules worth remembering: __device__ is optional when used with __shared__ or __constant__. Automatic scalar variables inside a kernel normally reside in registers. Shared memory is visible only to threads in the same block.

02 — Shared Memory, Tiling, and Blocking

Shared Memory, Tiling, and Blocking

Shared memory is explicitly allocated inside a kernel and used as fast temporary storage. It is commonly used as a tile or partial-sum buffer.

Tiling data flow

global memory → shared tile → reused by threads

CUDA#define TILE_WIDTH 16

__global__ void blurKernel(unsigned char* in,
                           unsigned char* out,
                           int width, int height)
{
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];

    // Threads cooperatively load data into ds_in.
    // Then they synchronize before using the tile.
    __syncthreads();
}

A poor memory access pattern can make every thread repeatedly fetch data from global memory. Tiling reduces this cost by copying a small region of global memory into fast on-chip shared memory, then reusing it many times.

Core takeaway: Load a tile from global memory once, reuse it from shared memory many times, and reduce expensive global memory accesses.

03 — The Reduction Pattern

The Reduction Pattern

A reduction summarizes many input values into one output value using a binary operation.

y = x₀ \oplus x₁ \oplus x₂ \oplus \cdot\cdot\cdot \oplus x N-1

Partition and summarize

input array → chunks → partial results → final value

Reduction tree example

[3, 1, 7, 0, 4, 1, 6, 3]

Block-level shared memory

stride 1 → stride 2 → stride 4

Naive vs improved mapping

scattered lanes vs packed lanes

Sum

Identity value is 0.

Product

Identity value is 1.

Maximum

Identity value is negative infinity.

Minimum

Identity value is positive infinity.

For parallel reduction, the operation should be associative. Commutativity is also helpful because it allows the implementation to rearrange the order of operations.

Partition and summarize: partition a large data set into chunks, let many threads compute partial results in parallel, then use a reduction tree to combine the partial results into one final value.

04 — Sequential vs. Parallel Reduction

Sequential vs. Parallel Reduction

same input, different schedule

Same result, different schedule: both approaches compute the same final result. The key difference is how the work is scheduled.

CUDA reduction uses the parallel tree idea, then maps it onto threads, shared memory, and synchronization.

Sequential Reduction

A sequential reduction visits every input value once. For a sum reduction:

Cfloat sum = 0.0f;              // identity value for sum
for (int i = 0; i < N; ++i) {
    sum += input[i];
}

This algorithm performs N visits and N − 1 useful combine operations, so its work complexity is O(N).

Parallel Reduction Tree

A reduction tree combines pairs of values at each level. For N values, the number of steps is approximately log₂(N), while the total number of combine operations is still N − 1.

N/2 + N/4 + N/8 + \cdot\cdot\cdot + 1 = N - 1

Work-efficient but not always resource-efficient: the tree does the right amount of total work, but the first level can require a very large number of parallel resources.

05 — Basic Block-Level Reduction Design

Basic Block-Level Reduction Design

A common CUDA reduction design lets each block handle 2 × blockDim.x input elements. Each thread loads two values into shared memory, then the block reduces those values into one partial sum.

CUDA#define BLOCK_SIZE 256

__global__ void blockReduceSum(const float* input,
                               float* blockSums,
                               unsigned int n)
{
    __shared__ float partialSum[2 * BLOCK_SIZE];

    unsigned int t     = threadIdx.x;
    unsigned int start = 2 * blockIdx.x * blockDim.x;

    unsigned int i0 = start + t;
    unsigned int i1 = start + blockDim.x + t;

    partialSum[t]              = (i0 < n) ? input[i0] : 0.0f;
    partialSum[blockDim.x + t] = (i1 < n) ? input[i1] : 0.0f;

    __syncthreads();

    // Reduction loop goes here.

    if (t == 0) {
        blockSums[blockIdx.x] = partialSum[0];
    }
}

Code Walkthrough: Block-Level Reduction

code line → thread action → shared memory update

1__shared__ float partialSum[2 * BLOCK_SIZE];

3unsigned int t = threadIdx.x;

4unsigned int start = 2 * blockIdx.x * blockDim.x;

6partialSum[t] = input[start + t];

7partialSum[blockDim.x + t] = input[start + blockDim.x + t];

9for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {

10 __syncthreads();

12 if (t % stride == 0) {

13 partialSum[2 * t] += partialSum[2 * t + stride];

14 }

15}

17if (t == 0) {

18 output[blockIdx.x] = partialSum[0];

19}

Stage 1 — Shared memory allocation Shared memory is allocated once per block. All threads in the block can access this array.

Threads in Block

Values Loaded

Reduction Steps

Final Block Result

06 — Naive Interleaved Reduction

Naive Reduction Kernel: Interleaved Thread Usage

The naive mapping uses increasing stride distances. In each step, a thread adds a value from farther away in the shared memory array.

CUDAfor (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
    __syncthreads();

    if (t % stride == 0) {
        partialSum[2 * t] += partialSum[2 * t + stride];
    }
}

Code Walkthrough: Naive Interleaved Reduction

scattered active threads as stride grows

1for (unsigned int stride = 1;

2 stride <= blockDim.x;

3 stride *= 2)

4{

5 __syncthreads();

7 if (t % stride == 0) {

8 partialSum[2 * t] += partialSum[2 * t + stride];

9 }

10}

Stage 1 — Initial shared memory The naive kernel starts with values already loaded into shared memory.

Input Values

Threads

Reduction Levels

Final Result

Problem summary: Naive interleaved indexing works correctly, but it spreads active work across the thread IDs. As the stride grows, many threads in the same warp become inactive.

Why This Mapping Performs Poorly

The issue is not the number of arithmetic operations. The issue is which threads stay active.

As stride grows, fewer threads perform useful work.
Threads that do not perform additions still occupy lanes in their warp when a branch is evaluated.
Early iterations can create control divergence because some threads in the same warp execute the addition path and others do not.
Later iterations may have entire warps inactive. That removes divergence in those warps, but little useful work remains.

Step	Stride	Active Pattern	Performance Concern
1	1	Many threads active	Good arithmetic density.
2	2	Fewer active threads, spaced out	More branch imbalance within warps.
3–6	4–32	Active work becomes sparse	Poor lane utilization.
7+	64 and above	Very few active threads remain	Active warps may contain only one useful lane.

07 — Better Contiguous Reduction

Better Reduction Model: Contiguous Thread Usage

A better mapping keeps active threads consecutive at the front of the block. This improves warp behavior because whole warps tend to be either completely active or completely inactive.

CUDAfor (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();

    if (t < stride) {
        partialSum[t] += partialSum[t + stride];
    }
}

Code Walkthrough: Better Contiguous Reduction

packed active threads at the front

1for (unsigned int stride = blockDim.x;

2 stride > 0;

3 stride /= 2)

4{

5 __syncthreads();

7 if (t < stride) {

8 partialSum[t] += partialSum[t + stride];

9 }

10}

Stage 1 — Initial shared memory The better reduction model keeps active threads next to each other.

Input Values

Threads

Active Threads

Consecutive

Final Result

Naive vs. Contiguous Mapping

stride = 2 active-thread layout

Takeaway: Both methods compute the same answer. The improved version changes the indexing pattern so useful work stays packed into consecutive threads.

Why Contiguous Indexing Helps

Active threads are packed at the front of the block.
Complete warps tend to evaluate the branch the same way.
The first several steps have no warp divergence because each warp is either fully active or fully inactive.
Divergence appears only in the final few steps when the number of active threads becomes smaller than one warp.

Core takeaway: thread index usage matters. Keeping active partial sums in the front of the array keeps active threads consecutive, which reduces warp divergence.

08 — Barrier Synchronization

Barrier Synchronization

The call to __syncthreads() is required between reduction steps because threads communicate through shared memory. A thread must not read a partial sum until the previous step has finished writing it.

Why `__syncthreads()` Is Needed

shared memory dependency barrier

1for (unsigned int stride = 1;

2 stride <= blockDim.x;

3 stride *= 2)

4{

5 __syncthreads();

7 if (t % stride == 0) {

8 partialSum[2 * t] += partialSum[2 * t + stride];

9 }

10}

Stage 1 — Before reduction All threads have loaded values into shared memory, but CUDA threads do not automatically stay aligned unless we synchronize them.

Scope

One Block

Protects

Shared Memory

Prevents

Stale Reads

Does Not Sync

Other Blocks

What __syncthreads() guarantees: every thread in the block reaches the barrier before any thread passes it; shared memory writes from the previous step are visible after the barrier; the next step uses complete and correct partial sums.

Common bug: never place __syncthreads() inside a branch unless every thread in the block is guaranteed to execute that same branch. Otherwise, the block can deadlock.

09 — Handling Block Partial Sums

Handling Block Partial Sums

At the end of a block-level reduction, thread 0 writes one value per block into an output array:

CUDAif (threadIdx.x == 0) {
    blockSums[blockIdx.x] = partialSum[0];
}

What Happens After Each Block Finishes?

block partial sums → final result

1if (threadIdx.x == 0) {

2 output[blockIdx.x] = partialSum[0];

3}

Stage 1 — Block chunks Each block independently reduces a chunk of the input.

Blocks

Partial Sums

Final Result

Common Finish

Kernel / CPU

Important CUDA Limitation: __syncthreads() only synchronizes threads inside one block. Blocks cannot directly synchronize with each other inside a normal kernel. That is why each block writes a partial sum, and a later step combines those partial sums.

After this, there are several ways to finish the global reduction:

Launch another reduction kernel on the blockSums array.
Repeat kernel launches until only one value remains.
If only a small number of block sums remain, copy them back to the CPU and finish on the host.
Use atomic operations to accumulate into one global variable, if the performance trade-off is acceptable.

10 — Exam and Interview Checklist

Exam and Interview Checklist

Memory hierarchy: know the difference between registers, shared memory, global memory, and constant memory.
Shared memory scope: shared memory is visible only within one block and exists only while that block executes.
Reduction definition: a reduction combines many values into one value using an associative operator.
Work efficiency: a tree reduction performs N − 1 combine operations, comparable to the sequential algorithm.
Step complexity: a tree reduction has about log₂(N) levels.
Synchronization: __syncthreads() is required between shared-memory reduction steps.
Divergence: scattered active threads cause warp divergence and low lane utilization.
Optimization idea: keep active threads consecutive and compact partial sums toward the front of shared memory.

Reduction

Reduction tree

CUDA memory hierarchy

Naive vs. contiguous indexing

CUDA Memory Hierarchy

Hardware View

Programmer View

Shared Memory, Tiling, and Blocking

Tiling data flow

The Reduction Pattern

Partition and summarize

Reduction tree example

Block-level shared memory

Naive vs improved mapping

Sum

Product

Maximum

Minimum

Sequential vs. Parallel Reduction

Sequential vs. Parallel Reduction

Sequential Reduction

Parallel Reduction Tree

Basic Block-Level Reduction Design

Code Walkthrough: Block-Level Reduction

Naive Reduction Kernel: Interleaved Thread Usage

Code Walkthrough: Naive Interleaved Reduction

Why This Mapping Performs Poorly

Better Reduction Model: Contiguous Thread Usage

Code Walkthrough: Better Contiguous Reduction

Naive vs. Contiguous Mapping

Why Contiguous Indexing Helps

Barrier Synchronization

Why `__syncthreads()` Is Needed

Handling Block Partial Sums

What Happens After Each Block Finishes?

Exam and Interview Checklist

Reduction tree

CUDA memory hierarchy

Naive vs. contiguous indexing

CUDA Memory Hierarchy

Hardware View

Programmer View

Shared Memory, Tiling, and Blocking

Tiling data flow

The Reduction Pattern

Partition and summarize

Reduction tree example

Block-level shared memory

Naive vs improved mapping

Sum

Product

Maximum

Minimum

Sequential vs. Parallel Reduction

Sequential vs. Parallel Reduction

Sequential Reduction

Parallel Reduction Tree

Basic Block-Level Reduction Design

Code Walkthrough: Block-Level Reduction

Naive Reduction Kernel: Interleaved Thread Usage

Code Walkthrough: Naive Interleaved Reduction

Why This Mapping Performs Poorly

Better Reduction Model: Contiguous Thread Usage

Code Walkthrough: Better Contiguous Reduction

Naive vs. Contiguous Mapping

Why Contiguous Indexing Helps

Barrier Synchronization

Why __syncthreads() Is Needed

Handling Block Partial Sums

What Happens After Each Block Finishes?

Exam and Interview Checklist

Why `__syncthreads()` Is Needed