CUDA Pinned Memory & CUDA Streams

EE147: Graphics Processing Unit Computing and Programming — Spring 2026

Transfer Engine
DMA over PCIe
Host Buffer Type
Pinned Memory
Scheduling Tool
CUDA Streams
Goal
Overlap Copy + Compute
H2D input copy
pinned host buffer
Kernel compute
GPU execution
D2H output copy
copy engine
Next segment
stream queue
Host-device transfer
Kernel computation
Device-host transfer
Pinned memory makes DMA efficient. Streams organize asynchronous copies and kernel launches so independent work can overlap.
01 — Big Picture

Why pinned memory and streams matter

CUDA programs usually spend time on two kinds of work: moving data between CPU and GPU memory, and executing kernels on the GPU. If the program performs these steps one at a time, expensive hardware may sit idle.

copy input to GPU → run kernel → copy output back to CPU

This order is simple, but it serializes communication and computation. Pinned memory and CUDA streams are tools for turning that serial workflow into a pipeline.

Main idea: pinned memory solves the transfer safety and efficiency problem for DMA. Streams solve the scheduling problem by creating queues that can overlap independent copies and kernels.
  1. Pinned memory: keeps host pages resident in physical memory so DMA can safely access them.
  2. Asynchronous copies: allow the host to submit a copy and continue issuing more work.
  3. Streams: group copies and kernel launches into ordered queues.
  4. Segmentation: splits a large array into chunks so one chunk can compute while another chunk transfers.
02 — DMA Transfer Path

CPU-GPU data transfer using DMA

DMA stands for Direct Memory Access. It is a hardware mechanism that transfers bytes between memory regions without making the CPU manually copy every byte.

DMA transfer path

host DRAM → PCIe → GPU global memory
host
CPU Main Memory
(DRAM)
interconnect
PCIe + DMA Engine
device
GPU Global Memory
DMA is efficient because it transfers large byte regions through specialized hardware instead of CPU load/store loops.
03 — Virtual Memory

Virtual memory and why page locking is needed

Normal C/C++ pointer values are usually virtual addresses. The operating system and hardware memory-management unit translate them into physical addresses.

Memory is managed in pages. A virtual page may be present in physical DRAM, or it may be paged out by the operating system to make room for other pages.

program pointer
Virtual Address
translation
Page Table / MMU
hardware memory
Physical DRAM Page
Important: DMA uses physical addresses. If the OS could page out a buffer during a DMA transfer, the transfer engine could read from or write to the wrong physical location.

For efficiency, DMA does not re-translate every address for every byte. After the transfer starts, the involved physical pages must remain valid. This is why host buffers used directly by DMA must be pinned.

04 — Pinned Host Memory

Pinned host memory

Pinned host memory is CPU memory whose pages are locked so the operating system cannot page them out. It is also called page-locked memory, locked pages, or non-pageable host memory.

Pinned memory in one sentence: pinned memory is CPU memory that is guaranteed to stay in physical DRAM, making it safe and efficient for DMA transfers between the CPU and GPU.

Pageable vs pinned transfer

extra staging copy vs direct DMA
ordinary host
Pageable Buffer
extra step
Temporary Pinned Buffer
device
GPU Memory
pinned host
Pinned Buffer
device
GPU Memory
If the host buffer is already pinned, CUDA does not need the extra staging copy.
Benefit
cudaMemcpy() can be faster because it avoids the extra copy through a temporary pinned buffer.
!
Limit
Pinned memory is a limited system resource. Overusing it can hurt overall OS memory management.
05 — CUDA API

Allocating and freeing pinned memory

Pinned memory is allocated with cudaHostAlloc() and released with cudaFreeHost(). From the programmer's perspective, the returned pointer can be used similarly to memory returned by malloc().

CUDA API
1cudaError_t cudaHostAlloc(void **pHost,
2 size_t size,
3 unsigned int flags);
4cudaFreeHost(ptr);
ParameterMeaning
pHostAddress of the pointer that will receive the allocated host memory address.
sizeNumber of bytes to allocate.
flagsAllocation option. Use cudaHostAllocDefault for the basic case.
Pinned vector buffers
1const int N = 1 << 20;
2const size_t bytes = N * sizeof(float);
3float *h_A, *h_B, *h_C;
4cudaHostAlloc((void**)&h_A, bytes, cudaHostAllocDefault);
5cudaHostAlloc((void**)&h_B, bytes, cudaHostAllocDefault);
6cudaHostAlloc((void**)&h_C, bytes, cudaHostAllocDefault);
7// Use h_A, h_B, h_C like normal arrays.
8cudaFreeHost(h_A); cudaFreeHost(h_B); cudaFreeHost(h_C);
06 — Serialized vs Overlapped

Serialized transfer and device overlap

Without streams, vector addition often follows a serialized pattern: transfer input A, transfer input B, run the kernel, then transfer output C back to the host.

Serialized timing

one stage at a time
timeline
Trans A
Trans B
Kernel
Trans C
Idle hardware gaps
During transfers, GPU compute may be idle. During compute, PCIe/copy engines may be idle.

Device overlap means the GPU can execute a kernel while also copying data between host and device memory. Modern GPUs generally support this, but the code must expose enough independent work.

07 — CUDA Streams

CUDA streams

A CUDA stream is a queue of operations submitted by the host. Operations include kernel launches, asynchronous memory copies, and events.

stream 0 — FIFO queue
Memcpy A.0 H2D
Memcpy B.0 H2D
Kernel 0 compute
Memcpy C.0 D2H
stream 1 — FIFO queue
Memcpy A.1 H2D
Memcpy B.1 H2D
Kernel 1 compute
Memcpy C.1 D2H
Stream rule: operations inside the same stream execute in issue order. Operations in different streams may run concurrently if the hardware has resources and there are no dependencies preventing overlap.
08 — Asynchronous Work

Asynchronous copies and kernel launches in streams

To place work in streams, create stream objects and pass the stream as an argument to asynchronous copies and kernel launches.

Stream setup
1cudaStream_t stream0, stream1;
2cudaStreamCreate(&stream0);
3cudaStreamCreate(&stream1);
Async copy + stream launch
1cudaMemcpyAsync(d_A0, h_A + i, bytes, cudaMemcpyHostToDevice, stream0);
2vecAdd<<<blocks, threads, 0, stream0>>>(d_A0, d_B0, d_C0);
3cudaMemcpyAsync(h_C + i, d_C0, bytes, cudaMemcpyDeviceToHost, stream0);
Important: asynchronous host-device copies require pinned host memory to provide true asynchronous behavior and efficient DMA transfer.
09 — Work Segmentation

Segmenting work for overlap

A large vector can be divided into smaller segments. Each segment has its own input copy, kernel execution, and output copy. Streams allow adjacent segments to occupy different hardware engines at the same time.

Ideal pipeline intuition

different segments overlap
segment 0
A.0
B.0
C.0 = A.0 + B.0
C.0
done
segment 1
wait
A.1
B.1
C.1 = A.1 + B.1
C.1
done
segment 2
wait
A.2
B.2
C.2 = A.2 + B.2
C.2
Without overlap: Tserial = Tin + Tk + Tout
With good pipelining: Tpipeline ≈ max(Tin, Tk, Tout)
10 — Two-Stream Vector Addition

Simple multi-stream vector addition

Each stream needs its own device buffers. If both streams used the same device arrays, one stream could overwrite data while the other stream is still using it.

Separate device buffers
1float *d_A0, *d_B0, *d_C0; // stream 0
2float *d_A1, *d_B1, *d_C1; // stream 1
Simple two-stream loop
1for (int i = 0; i < n; i += SegSize * 2) {
2 cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
3 cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
4 vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
5 cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
6 // Repeat same pattern for stream1 using i + SegSize.
7}

This version uses streams, but the operation ordering can still prevent the best overlap on some GPUs.

11 — Ordering and Copy Queues

Operation ordering and copy engine queues

Streams preserve ordering inside each queue, but hardware copy engines also have queues. If output copy C.0 is submitted too early, it may block future input transfers A.1 and B.1.

Less ideal ordering

C.0 can block A.1 and B.1
copy engine
A.0
B.0
C.0
A.1
B.1
C.1
kernel engine
waiting for inputs
Kernel 0
Kernel 1
idle

A better approach is to submit all input copies first, then kernels, then output copies.

Better ordering
1// Input copies first
2cudaMemcpyAsync(d_A0, h_A + i, bytes, cudaMemcpyHostToDevice, stream0);
3cudaMemcpyAsync(d_B0, h_B + i, bytes, cudaMemcpyHostToDevice, stream0);
4cudaMemcpyAsync(d_A1, h_A + i + SegSize, bytes, cudaMemcpyHostToDevice, stream1);
5cudaMemcpyAsync(d_B1, h_B + i + SegSize, bytes, cudaMemcpyHostToDevice, stream1);
6// Kernels next
7vecAdd<<<blocks, threads, 0, stream0>>>(d_A0, d_B0, d_C0);
8vecAdd<<<blocks, threads, 0, stream1>>>(d_A1, d_B1, d_C1);
9// Output copies last
10cudaMemcpyAsync(h_C + i, d_C0, bytes, cudaMemcpyDeviceToHost, stream0);
11cudaMemcpyAsync(h_C + i + SegSize, d_C1, bytes, cudaMemcpyDeviceToHost, stream1);
Rule of thumb: keep future input transfers moving early. If output copies are issued too early, they can block input copies for later segments.
12 — Ideal Pipelining

Toward ideal pipelined timing

The better two-stream ordering improves overlap, but it may still not be ideal across loop iterations. To get closer to ideal pipelining, the lecture notes suggest using at least three buffers for each original input/output array.

Triple buffering intuition

A, B, and C each need multiple in-flight buffers
stage 0
A.0
B.0
Kernel 0
C.0
free buffer
stage 1
offset
A.1
B.1
Kernel 1
C.1
free
stage 2
offset
A.2
B.2
Kernel 2
C.2
Tradeoff: more buffering can improve overlap, but it makes the code more complicated and uses more device memory.
13 — Synchronization

Waiting for stream or device completion

CUDA gives different synchronization functions depending on whether you need to wait for one stream or the entire device.

FunctionWaits forUse case
cudaStreamSynchronize(stream0)All queued work in stream0.Wait for one segment or one stream.
cudaDeviceSynchronize()All previously issued work on the current device.Wait before final timing or reading final results.
14 — Exam-Style Summary

Exam-style summary

Pinned memory
Host memory that cannot be paged out. It is safe for DMA and can avoid an extra staging copy during cudaMemcpy().
DMA
Transfer hardware that moves bytes between physical address regions, often across PCIe, without CPU byte-by-byte copying.
Streams
FIFO queues of CUDA operations. Same-stream operations are ordered; different streams may overlap.
Synchronization
cudaStreamSynchronize() waits for one stream. cudaDeviceSynchronize() waits for all streams on the device.
Most important intuition: pinned memory makes transfer efficient, and streams arrange work so that the copy engine and kernel engine can stay busy at the same time.