01 — Top-Level Architecture
Hardware Interconnect Overview
Modern GPGPUs employ a highly parallel grid structure: independent computing blocks linked to specialized high-bandwidth memory subsystems.
SIMT Core Clusters (cluster) ⟷ Interconnection Network (fabric) ⟷ Memory Partitions
- SIMT Core Clusters — Houses standalone Processing Cores (SMs). Each SM runs thousands of threads concurrently.
- Interconnection Network — On-chip crossbar or mesh that routes memory traffic between core clusters and memory partitions in parallel.
- Memory Partitions — Interleaved DRAM channels (GDDR5/HBM) optimized for massive data bandwidth.
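To make the idea of interleaved DRAM channels concrete, here is a minimal sketch of low-order address interleaving. The function name `partition_of`, the chunk size, and the partition count are assumptions for illustration only; real GPUs generally use vendor-specific (and often hashed) mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping: consecutive chunks of `chunk_bytes` rotate
   round-robin across `num_partitions` memory partitions. */
uint32_t partition_of(uint64_t addr, uint32_t chunk_bytes, uint32_t num_partitions)
{
    assert(chunk_bytes > 0 && num_partitions > 0);
    return (uint32_t)((addr / chunk_bytes) % num_partitions);
}
```

With an assumed 256-byte chunk and 6 partitions, addresses 0 to 255 land in partition 0, 256 to 511 in partition 1, and the pattern wraps back to partition 0 at address 1536. A streaming access pattern therefore keeps all channels busy at once.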
02 — Microarchitecture
SIMT Core Internal Structure
A single SIMT Core bridges a SIMT Front-End instruction pipeline with a parallelized SIMD Back-End — engineered around fine-grained multithreading.
Fetch → Decode → Schedule / Scoreboard → Issue → SIMD ALUs (×N lanes)
- Instruction Cache & Buffers — Stores active instruction streams from concurrent execution contexts.
- Scoreboard System — Monitors dependency conflicts and hazards; gates safe instruction issuance.
- SIMT Execution Stack — Manages branch divergence via precise bit-mask tracking.
- Decoupled Warp Schedulers — Multiple concurrent units picking from ready, non-stalled warps.
- Monolithic Register File — Large on-chip register file that holds the registers of every resident warp, so switching between warps needs no state save or restore.
- Memory Subsystem — L1 Data Cache, Shared Scratchpad (SMem), Texture Cache, Constant Cache.
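The scoreboard's gating role can be sketched as a per-warp bit vector of registers with writes still in flight. This is a simplified model under stated assumptions: the `Scoreboard` type, the 64-register limit, and the function names are hypothetical and do not describe any particular GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-warp scoreboard: bit r is set while register r
   has a write in flight (assumes at most 64 registers). */
typedef struct {
    uint64_t pending_writes;
} Scoreboard;

/* An instruction may issue only if none of its source or destination
   registers is awaiting a write (covers RAW and WAW hazards). */
bool can_issue(const Scoreboard *sb, uint64_t src_mask, uint64_t dst_mask)
{
    assert(sb != NULL);
    return (sb->pending_writes & (src_mask | dst_mask)) == 0;
}

/* Reserve the destination registers when the instruction issues. */
void on_issue(Scoreboard *sb, uint64_t dst_mask)
{
    assert(sb != NULL);
    sb->pending_writes |= dst_mask;
}

/* Release them once the result has been written back. */
void on_writeback(Scoreboard *sb, uint64_t dst_mask)
{
    assert(sb != NULL);
    sb->pending_writes &= ~dst_mask;
}
```

A warp scheduler in this model would call `can_issue` for each candidate warp and pick only from those that pass, which is the "ready, non-stalled" condition in the list above.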
03 — Latency Hiding
High-Latency Hiding Mechanics
GPGPUs don't rely on speculative execution or out-of-order pipelines. Latency is hidden by rotating between thousands of resident thread contexts.
$$
\text{Latency Masking Ratio} = \frac{\text{Memory Pipeline Cycle Delay}}{\text{Number of Co-Resident Scheduled Warps}}
$$
// live warp scheduler — fine-grained interleaving
When a warp stalls on a global memory read, the scheduler immediately switches to a warp that is ready to issue, keeping the SIMD ALUs busy.
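As a worked illustration of the ratio above, the sketch below computes it and inverts it. It reads the ratio as the number of cycles of independent work each warp must supply to cover the memory delay; that reading, the function names, and the numbers in the usage note are assumptions, not measured figures.

```c
#include <assert.h>

/* Ratio from the text: memory delay in cycles divided by co-resident warps. */
double latency_masking_ratio(unsigned memory_delay_cycles, unsigned resident_warps)
{
    assert(resident_warps > 0);
    return (double)memory_delay_cycles / (double)resident_warps;
}

/* Assumed inverse: warps needed if each warp can issue only
   `issue_cycles_per_warp` cycles of work before it stalls too (rounded up). */
unsigned warps_needed(unsigned memory_delay_cycles, unsigned issue_cycles_per_warp)
{
    assert(issue_cycles_per_warp > 0);
    return (memory_delay_cycles + issue_cycles_per_warp - 1) / issue_cycles_per_warp;
}
```

For an assumed 400-cycle memory delay and 32 resident warps the ratio is 12.5, so each warp would need about 12.5 cycles of issuable work. If warps only have 8 cycles of work each, roughly 50 warps are needed to keep the ALUs occupied.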
04 — Divergence Hardware
The SIMT Hardware Execution Stack
When a warp hits a conditional branch and threads disagree, the hardware serializes paths via a dedicated SIMT Stack. Each entry tracks three fields:
- Target PC — Memory address of the instruction block to process next.
- Token / Type — Structural marker: R = Reconvergence, S = Split path.
- Active Mask — Bit vector where 1 = active thread, 0 = idle.
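The three fields above could be laid out as a small struct. This is only a sketch: the type names, the 32-bit PC, and the 32-thread mask width are assumptions, and real hardware encodings are not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Token values follow the R / S markers used in the text. */
typedef enum { TOKEN_RECONVERGE = 'R', TOKEN_SPLIT = 'S' } TokenType;

typedef struct {
    uint32_t  target_pc;    /* address of the instruction block to run next */
    TokenType token;        /* R = reconvergence, S = split path */
    uint32_t  active_mask;  /* bit i is 1 if thread i is active, 0 if idle */
} SimtStackEntry;

/* Returns true if the given thread (0-based) executes under this entry. */
bool thread_is_active(const SimtStackEntry *e, unsigned thread)
{
    assert(e != NULL && thread < 32);
    return ((e->active_mask >> thread) & 1u) != 0;
}
```

For example, an entry with `active_mask = 0x3` enables threads 0 and 1 and leaves every other lane idle for that path.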
05 — Divergence Trace
Branch Divergence — 4-Thread Warp Walkthrough
A 4-thread warp hits an if/else. Threads 1&2 take the TRUE path, threads 3&4 take FALSE. The SIMT stack serializes and reconverges.
C / CUDA
// A: common base path, all threads run in lockstep
if (condition) {
    // B: TRUE path, threads 1 & 2 only
} else {
    // D: FALSE path, threads 3 & 4 only
}
// E: reconvergence point, all threads reunite
// simt stack
| Target PC | Token | Active Mask |
FALSE path active / reconverged
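The walkthrough can be replayed with a small software model of the stack. This is a simplified sketch, not a description of real hardware: block labels A, B, D, E stand in for program counters, threads 1 to 4 map to mask bits 0 to 3, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    char     target_pc;    /* basic-block label stands in for a real PC */
    char     token;        /* 'R' = reconvergence, 'S' = split path */
    uint32_t active_mask;  /* bit i set means thread i+1 is active */
} StackEntry;

typedef struct {
    char     block;        /* block that executed */
    uint32_t mask;         /* threads active while it executed */
} Step;

/* Replays the if/else example and records which block runs under which mask.
   Returns the number of steps written to `trace` (capacity must be >= 4). */
int run_divergence_trace(uint32_t taken_mask, Step *trace, int capacity)
{
    const uint32_t full_mask = 0xFu;  /* 4-thread warp */
    StackEntry stack[3];
    int top = 0, n = 0;

    assert(capacity >= 4);
    assert((taken_mask & ~full_mask) == 0);

    trace[n++] = (Step){ 'A', full_mask };  /* common path, lockstep */

    /* At the branch: push the reconvergence entry first, then the two
       split paths, so the TRUE path ends up on top and runs first. */
    stack[top++] = (StackEntry){ 'E', 'R', full_mask };
    stack[top++] = (StackEntry){ 'D', 'S', full_mask & ~taken_mask };
    stack[top++] = (StackEntry){ 'B', 'S', taken_mask };

    /* Pop until empty; a path with no active threads is skipped. */
    while (top > 0) {
        StackEntry e = stack[--top];
        if (e.active_mask != 0)
            trace[n++] = (Step){ e.target_pc, e.active_mask };
    }
    return n;
}
```

Calling `run_divergence_trace(0x3, trace, 4)` models threads 1 and 2 taking the TRUE path: the recorded order is A with mask 1111, B with 0011, D with 1100, then E with 1111 once the reconvergence entry is popped.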
06 — Trade-offs
Hardware Design Trade-offs
The SIMT stack accurately tracks divergence but introduces measurable overheads:
Memory Storage Overhead
Stack depth grows with branch nesting, and every resident warp needs its own stack, so total storage scales with both nesting depth and occupancy.
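A back-of-the-envelope estimate shows how this storage multiplies. The function below is a sketch; every parameter value in the usage note (warp count, stack depth, field widths) is an assumed figure chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of SIMT-stack storage per core: one stack per resident warp,
   each entry holding a PC, a token, and one mask bit per thread. */
uint64_t simt_stack_bits(uint32_t resident_warps, uint32_t max_depth,
                         uint32_t pc_bits, uint32_t token_bits, uint32_t warp_size)
{
    assert(resident_warps > 0 && max_depth > 0);
    uint64_t entry_bits = (uint64_t)pc_bits + token_bits + warp_size;
    return (uint64_t)resident_warps * max_depth * entry_bits;
}
```

Assuming 64 resident warps, a 16-entry stack, a 32-bit PC, a 1-bit token, and 32 threads per warp, this gives 66,560 bits, or 8,320 bytes per core. Doubling either occupancy or supported nesting depth doubles the total.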
SIMD Efficiency Drop
Serialization leaves the lanes whose mask bit is 0 idle. A 4-thread warp split 50/50 runs at no more than 50% ALU utilization while diverged.
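The utilization figure follows directly from the active mask. The helper below is a minimal sketch (the function name is hypothetical) that counts set mask bits and divides by the warp width.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of SIMD lanes doing useful work for one issued instruction. */
double simd_efficiency(uint32_t active_mask, unsigned warp_size)
{
    unsigned active = 0;

    assert(warp_size > 0 && warp_size <= 32);
    /* Clear the lowest set bit each iteration to count active threads. */
    for (uint32_t m = active_mask; m != 0; m &= m - 1)
        active++;
    assert(active <= warp_size);
    return (double)active / (double)warp_size;
}
```

For the 4-thread example, a mask of 0011 or 1100 yields 0.5 on each serialized path, while the reconverged mask 1111 yields 1.0.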
Key insight: The SIMT model is efficient when threads in a warp follow the same path. Divergence is not a bug — it's a cost the hardware explicitly manages — but minimizing it is critical for peak throughput.