CUDA Pinned Memory & CUDA Streams

01 — Big Picture

Why pinned memory and streams matter

CUDA programs usually spend time on two kinds of work: moving data between CPU and GPU memory, and executing kernels on the GPU. If the program performs these steps one at a time, expensive hardware may sit idle.

copy input to GPU \to run kernel \to copy output back to CPU

This order is simple, but it serializes communication and computation. Pinned memory and CUDA streams are tools for turning that serial workflow into a pipeline.

Main idea: pinned memory solves the transfer safety and efficiency problem for DMA. Streams solve the scheduling problem by creating queues that can overlap independent copies and kernels.

Pinned memory: keeps host pages resident in physical memory so DMA can safely access them.
Asynchronous copies: allow the host to submit a copy and continue issuing more work.
Streams: group copies and kernel launches into ordered queues.
Segmentation: splits a large array into chunks so one chunk can compute while another chunk transfers.

02 — DMA Transfer Path

CPU-GPU data transfer using DMA

DMA stands for Direct Memory Access. It is a hardware mechanism that transfers bytes between memory regions without making the CPU manually copy every byte.

DMA transfer path

host DRAM → PCIe → GPU global memory

host

CPU Main Memory
(DRAM)

↔

interconnect

PCIe + DMA Engine

↔

device

GPU Global Memory

DMA is efficient because it transfers large byte regions through specialized hardware instead of CPU load/store loops.

cudaMemcpy() uses DMA for host-device transfers when possible.
The CPU is freed for other tasks while the transfer hardware performs the copy.
The DMA engine moves data between physical memory address regions.
The system interconnect is usually PCIe on desktop and server systems.

03 — Virtual Memory

Virtual memory and why page locking is needed

Normal C/C++ pointer values are usually virtual addresses. The operating system and hardware memory-management unit translate them into physical addresses.

Memory is managed in pages. A virtual page may be present in physical DRAM, or it may be paged out by the operating system to make room for other pages.

program pointer

Virtual Address

⟷

translation

Page Table / MMU

⟷

hardware memory

Physical DRAM Page

Important: DMA uses physical addresses. If the OS could page out a buffer during a DMA transfer, the transfer engine could read from or write to the wrong physical location.

For efficiency, DMA does not re-translate every address for every byte. After the transfer starts, the involved physical pages must remain valid. This is why host buffers used directly by DMA must be pinned.

04 — Pinned Host Memory

Pinned host memory

Pinned host memory is CPU memory whose pages are locked so the operating system cannot page them out. It is also called page-locked memory, locked pages, or non-pageable host memory.

Pinned memory in one sentence: pinned memory is CPU memory that is guaranteed to stay in physical DRAM, making it safe and efficient for DMA transfers between the CPU and GPU.

Pageable vs pinned transfer

extra staging copy vs direct DMA

ordinary host

Pageable Buffer

→

extra step

Temporary Pinned Buffer

→

device

GPU Memory

pinned host

Pinned Buffer

→

device

GPU Memory

If the host buffer is already pinned, CUDA does not need the extra staging copy.

▲

Benefit

cudaMemcpy() can be faster because it avoids the extra copy through a temporary pinned buffer.

Limit

Pinned memory is a limited system resource. Overusing it can hurt overall OS memory management.

05 — CUDA API

Allocating and freeing pinned memory

Pinned memory is allocated with cudaHostAlloc() and released with cudaFreeHost(). From the programmer's perspective, the returned pointer can be used similarly to memory returned by malloc().

CUDA API
1cudaError_t cudaHostAlloc(void **pHost,
2                          size_t size,
3                          unsigned int flags);
4cudaFreeHost(ptr);

Parameter	Meaning
`pHost`	Address of the pointer that will receive the allocated host memory address.
`size`	Number of bytes to allocate.
`flags`	Allocation option. Use `cudaHostAllocDefault` for the basic case.

Pinned vector buffers
1const int N = 1 << 20;
2const size_t bytes = N * sizeof(float);
3float *h_A, *h_B, *h_C;
4cudaHostAlloc((void**)&h_A, bytes, cudaHostAllocDefault);
5cudaHostAlloc((void**)&h_B, bytes, cudaHostAllocDefault);
6cudaHostAlloc((void**)&h_C, bytes, cudaHostAllocDefault);
7// Use h_A, h_B, h_C like normal arrays.
8cudaFreeHost(h_A); cudaFreeHost(h_B); cudaFreeHost(h_C);

06 — Serialized vs Overlapped

Serialized transfer and device overlap

Without streams, vector addition often follows a serialized pattern: transfer input A, transfer input B, run the kernel, then transfer output C back to the host.

Serialized timing

one stage at a time

timeline

Trans A

Trans B

Kernel

Trans C

Idle hardware gaps

During transfers, GPU compute may be idle. During compute, PCIe/copy engines may be idle.

Device overlap means the GPU can execute a kernel while also copying data between host and device memory. Modern GPUs generally support this, but the code must expose enough independent work.

Use pinned host memory.
Use cudaMemcpyAsync(), not blocking cudaMemcpy().
Use multiple streams.
Use separate device buffers for in-flight segments.

07 — CUDA Streams

CUDA streams

A CUDA stream is a queue of operations submitted by the host. Operations include kernel launches, asynchronous memory copies, and events.

stream 0 — FIFO queue

Memcpy A.0 H2D

Memcpy B.0 H2D

Kernel 0 compute

Memcpy C.0 D2H

stream 1 — FIFO queue

Memcpy A.1 H2D

Memcpy B.1 H2D

Kernel 1 compute

Memcpy C.1 D2H

Stream rule: operations inside the same stream execute in issue order. Operations in different streams may run concurrently if the hardware has resources and there are no dependencies preventing overlap.

08 — Asynchronous Work

Asynchronous copies and kernel launches in streams

To place work in streams, create stream objects and pass the stream as an argument to asynchronous copies and kernel launches.

Stream setup
1cudaStream_t stream0, stream1;
2cudaStreamCreate(&stream0);
3cudaStreamCreate(&stream1);

Async copy + stream launch
1cudaMemcpyAsync(d_A0, h_A + i, bytes, cudaMemcpyHostToDevice, stream0);
2vecAdd<<<blocks, threads, 0, stream0>>>(d_A0, d_B0, d_C0);
3cudaMemcpyAsync(h_C + i, d_C0, bytes, cudaMemcpyDeviceToHost, stream0);

Important: asynchronous host-device copies require pinned host memory to provide true asynchronous behavior and efficient DMA transfer.

09 — Work Segmentation

Segmenting work for overlap

A large vector can be divided into smaller segments. Each segment has its own input copy, kernel execution, and output copy. Streams allow adjacent segments to occupy different hardware engines at the same time.

Ideal pipeline intuition

different segments overlap

segment 0

A.0

B.0

C.0 = A.0 + B.0

C.0

done

segment 1

wait

A.1

B.1

C.1 = A.1 + B.1

C.1

done

segment 2

wait

A.2

B.2

C.2 = A.2 + B.2

C.2

Without overlap: T_serial = T_in + T_k + T_out

With good pipelining: T_pipeline ≈ max(T_in, T_k, T_out)

10 — Two-Stream Vector Addition

Simple multi-stream vector addition

Each stream needs its own device buffers. If both streams used the same device arrays, one stream could overwrite data while the other stream is still using it.

Separate device buffers
1float *d_A0, *d_B0, *d_C0;  // stream 0
2float *d_A1, *d_B1, *d_C1;  // stream 1

Simple two-stream loop
1for (int i = 0; i < n; i += SegSize * 2) {
2    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
3    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
4    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
5    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
6    // Repeat same pattern for stream1 using i + SegSize.
7}

This version uses streams, but the operation ordering can still prevent the best overlap on some GPUs.

11 — Ordering and Copy Queues

Operation ordering and copy engine queues

Streams preserve ordering inside each queue, but hardware copy engines also have queues. If output copy C.0 is submitted too early, it may block future input transfers A.1 and B.1.

Less ideal ordering

C.0 can block A.1 and B.1

copy engine

A.0

B.0

C.0

A.1

B.1

C.1

kernel engine

waiting for inputs

Kernel 0

Kernel 1

idle

A better approach is to submit all input copies first, then kernels, then output copies.

Better ordering
1// Input copies first
2cudaMemcpyAsync(d_A0, h_A + i,          bytes, cudaMemcpyHostToDevice, stream0);
3cudaMemcpyAsync(d_B0, h_B + i,          bytes, cudaMemcpyHostToDevice, stream0);
4cudaMemcpyAsync(d_A1, h_A + i + SegSize, bytes, cudaMemcpyHostToDevice, stream1);
5cudaMemcpyAsync(d_B1, h_B + i + SegSize, bytes, cudaMemcpyHostToDevice, stream1);
6// Kernels next
7vecAdd<<<blocks, threads, 0, stream0>>>(d_A0, d_B0, d_C0);
8vecAdd<<<blocks, threads, 0, stream1>>>(d_A1, d_B1, d_C1);
9// Output copies last
10cudaMemcpyAsync(h_C + i,           d_C0, bytes, cudaMemcpyDeviceToHost, stream0);
11cudaMemcpyAsync(h_C + i + SegSize, d_C1, bytes, cudaMemcpyDeviceToHost, stream1);

Rule of thumb: keep future input transfers moving early. If output copies are issued too early, they can block input copies for later segments.

12 — Ideal Pipelining

Toward ideal pipelined timing

The better two-stream ordering improves overlap, but it may still not be ideal across loop iterations. To get closer to ideal pipelining, the lecture notes suggest using at least three buffers for each original input/output array.

Triple buffering intuition

A, B, and C each need multiple in-flight buffers

stage 0

A.0

B.0

Kernel 0

C.0

free buffer

stage 1

offset

A.1

B.1

Kernel 1

C.1

free

stage 2

offset

A.2

B.2

Kernel 2

C.2

Tradeoff: more buffering can improve overlap, but it makes the code more complicated and uses more device memory.

13 — Synchronization

Waiting for stream or device completion

CUDA gives different synchronization functions depending on whether you need to wait for one stream or the entire device.

Function	Waits for	Use case
`cudaStreamSynchronize(stream0)`	All queued work in `stream0`.	Wait for one segment or one stream.
`cudaDeviceSynchronize()`	All previously issued work on the current device.	Wait before final timing or reading final results.

Calling cudaDeviceSynchronize() too often can destroy overlap.
Forgetting to synchronize before reading output can make the CPU read incomplete results.
Synchronizing one stream does not mean all other streams are finished.

14 — Exam-Style Summary

Exam-style summary

Pinned memory

Host memory that cannot be paged out. It is safe for DMA and can avoid an extra staging copy during cudaMemcpy().

DMA

Transfer hardware that moves bytes between physical address regions, often across PCIe, without CPU byte-by-byte copying.

Streams

FIFO queues of CUDA operations. Same-stream operations are ordered; different streams may overlap.

Synchronization

cudaStreamSynchronize() waits for one stream. cudaDeviceSynchronize() waits for all streams on the device.

Most important intuition: pinned memory makes transfer efficient, and streams arrange work so that the copy engine and kernel engine can stay busy at the same time.