Why pinned memory and streams matter
CUDA programs usually spend time on two kinds of work: moving data between CPU and GPU memory, and executing kernels on the GPU. If the program performs these steps one at a time, expensive hardware may sit idle.
This order is simple, but it serializes communication and computation. Pinned memory and CUDA streams are tools for turning that serial workflow into a pipeline.
- Pinned memory: keeps host pages resident in physical memory so DMA can safely access them.
- Asynchronous copies: allow the host to submit a copy and continue issuing more work.
- Streams: group copies and kernel launches into ordered queues.
- Segmentation: splits a large array into chunks so one chunk can compute while another chunk transfers.
CPU-GPU data transfer using DMA
DMA stands for Direct Memory Access. It is a hardware mechanism that transfers bytes between memory regions without making the CPU manually copy every byte.
DMA transfer path
host DRAM → PCIe → GPU global memory(DRAM)
cudaMemcpy()uses DMA for host-device transfers when possible.- The CPU is freed for other tasks while the transfer hardware performs the copy.
- The DMA engine moves data between physical memory address regions.
- The system interconnect is usually PCIe on desktop and server systems.
Virtual memory and why page locking is needed
Normal C/C++ pointer values are usually virtual addresses. The operating system and hardware memory-management unit translate them into physical addresses.
Memory is managed in pages. A virtual page may be present in physical DRAM, or it may be paged out by the operating system to make room for other pages.
For efficiency, DMA does not re-translate every address for every byte. After the transfer starts, the involved physical pages must remain valid. This is why host buffers used directly by DMA must be pinned.
Pinned host memory
Pinned host memory is CPU memory whose pages are locked so the operating system cannot page them out. It is also called page-locked memory, locked pages, or non-pageable host memory.
Pageable vs pinned transfer
extra staging copy vs direct DMAcudaMemcpy() can be faster because it avoids the extra copy through a temporary pinned buffer.Allocating and freeing pinned memory
Pinned memory is allocated with cudaHostAlloc() and released with cudaFreeHost(). From the programmer's perspective, the returned pointer can be used similarly to memory returned by malloc().
| Parameter | Meaning |
|---|---|
pHost | Address of the pointer that will receive the allocated host memory address. |
size | Number of bytes to allocate. |
flags | Allocation option. Use cudaHostAllocDefault for the basic case. |
Serialized transfer and device overlap
Without streams, vector addition often follows a serialized pattern: transfer input A, transfer input B, run the kernel, then transfer output C back to the host.
Serialized timing
one stage at a timeDevice overlap means the GPU can execute a kernel while also copying data between host and device memory. Modern GPUs generally support this, but the code must expose enough independent work.
- Use pinned host memory.
- Use
cudaMemcpyAsync(), not blockingcudaMemcpy(). - Use multiple streams.
- Use separate device buffers for in-flight segments.
CUDA streams
A CUDA stream is a queue of operations submitted by the host. Operations include kernel launches, asynchronous memory copies, and events.
Asynchronous copies and kernel launches in streams
To place work in streams, create stream objects and pass the stream as an argument to asynchronous copies and kernel launches.
Segmenting work for overlap
A large vector can be divided into smaller segments. Each segment has its own input copy, kernel execution, and output copy. Streams allow adjacent segments to occupy different hardware engines at the same time.
Ideal pipeline intuition
different segments overlapSimple multi-stream vector addition
Each stream needs its own device buffers. If both streams used the same device arrays, one stream could overwrite data while the other stream is still using it.
This version uses streams, but the operation ordering can still prevent the best overlap on some GPUs.
Operation ordering and copy engine queues
Streams preserve ordering inside each queue, but hardware copy engines also have queues. If output copy C.0 is submitted too early, it may block future input transfers A.1 and B.1.
Less ideal ordering
C.0 can block A.1 and B.1A better approach is to submit all input copies first, then kernels, then output copies.
Toward ideal pipelined timing
The better two-stream ordering improves overlap, but it may still not be ideal across loop iterations. To get closer to ideal pipelining, the lecture notes suggest using at least three buffers for each original input/output array.
Triple buffering intuition
A, B, and C each need multiple in-flight buffersWaiting for stream or device completion
CUDA gives different synchronization functions depending on whether you need to wait for one stream or the entire device.
| Function | Waits for | Use case |
|---|---|---|
cudaStreamSynchronize(stream0) | All queued work in stream0. | Wait for one segment or one stream. |
cudaDeviceSynchronize() | All previously issued work on the current device. | Wait before final timing or reading final results. |
- Calling
cudaDeviceSynchronize()too often can destroy overlap. - Forgetting to synchronize before reading output can make the CPU read incomplete results.
- Synchronizing one stream does not mean all other streams are finished.
Exam-style summary
cudaMemcpy().cudaStreamSynchronize() waits for one stream. cudaDeviceSynchronize() waits for all streams on the device.