GPU Microarchitecture: SIMT Core & Hardware Stack

EE147: Graphics Processing Unit Computing and Programming — Spring 2026

01 — Top-Level Architecture

Hardware Interconnect Overview

Modern GPGPUs are organized as an array of independent SIMT core clusters, connected through an interconnection network to high-bandwidth memory partitions.

Compute: SIMT Core Clusters
Fabric: Interconnection Network
Memory: Memory Partitions

02 — Microarchitecture

SIMT Core Internal Structure

A single SIMT core couples a SIMT front end (the instruction pipeline) with a parallel SIMD back end, and is engineered around fine-grained multithreading.

Pipeline: Fetch → Decode → Schedule → Scoreboard → Issue → SIMD ALUs (×N lanes)

03 — Latency Hiding

High-Latency Hiding Mechanics

GPGPUs don't rely on speculative execution or out-of-order pipelines. Latency is hidden by rotating between thousands of resident thread contexts.

$$\text{Latency Masking Ratio} = \frac{\text{Memory Pipeline Cycle Delay}}{\text{Number of Co-Resident Scheduled Warps}}$$
[Animation: live warp scheduler, fine-grained interleaving. Warp states shown: Executing, Stalled (memory), Waiting.]

When a warp stalls on a global memory read, the scheduler immediately switches to a warp that is not stalled, keeping the SIMD ALUs busy.
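
The effect of warp count on latency hiding can be illustrated with a small host-side simulation in C. This is a minimal sketch under stated assumptions, not a model of any real scheduler: one issue slot per cycle, a switch-on-stall policy (stay on the current warp until it stalls, then move to the next ready warp), and every warp issuing `compute_per_load` instructions, the last of which is a load that stalls it for `mem_latency` cycles. The function name, its parameters, and the `MAX_WARPS` limit are hypothetical.

```c
#define MAX_WARPS 64  /* arbitrary cap for this sketch */

/* Returns the fraction of cycles in which some warp issued an instruction.
 * Assumed model: one issue slot per cycle, switch-on-stall scheduling. */
double alu_utilization(int num_warps, int compute_per_load,
                       int mem_latency, int total_cycles)
{
    int ready_at[MAX_WARPS] = {0};   /* first cycle each warp may issue again */
    int since_load[MAX_WARPS] = {0}; /* instructions issued since last stall */
    int issued = 0;
    int cur = 0;                     /* warp the scheduler is currently on */

    if (num_warps < 1 || num_warps > MAX_WARPS || compute_per_load < 1 ||
        mem_latency < 0 || total_cycles <= 0)
        return 0.0;

    for (int cycle = 0; cycle < total_cycles; cycle++) {
        /* Try the current warp first, then the others in order. */
        for (int i = 0; i < num_warps; i++) {
            int w = (cur + i) % num_warps;
            if (ready_at[w] > cycle)
                continue;            /* still waiting on memory */
            issued++;
            cur = w;
            if (++since_load[w] == compute_per_load) {
                /* This instruction was the load: stall the warp. */
                ready_at[w] = cycle + 1 + mem_latency;
                since_load[w] = 0;
            }
            break;                   /* only one issue per cycle */
        }
    }
    return (double)issued / total_cycles;
}
```

Under these assumptions, with `compute_per_load = 4`, `mem_latency = 96`, and 1000 simulated cycles, one warp keeps the issue slot busy 4% of the time, 5 warps reach 20%, and 25 warps reach 100%, because the other 24 warps supply 96 issue cycles while the first one waits.
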

04 — Divergence Hardware

The SIMT Hardware Execution Stack

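When a warp hits a conditional branch and its threads disagree on the outcome, the hardware serializes the paths via a dedicated SIMT stack. Each entry tracks three fields: the reconvergence PC, the next PC to execute, and the active mask of threads on that path.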
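
A minimal sketch of what one stack entry and the mask split at a branch might look like in C. It assumes the three fields are a reconvergence PC, the next PC to execute, and an active mask, as in the commonly described stack-based SIMT model. The field names, the 32-bit widths, and both helper functions are illustrative assumptions, not a description of any particular GPU.

```c
#include <stdint.h>

/* One SIMT stack entry (names and widths are assumptions). */
typedef struct {
    uint32_t reconv_pc;    /* PC at which this path rejoins the others */
    uint32_t next_pc;      /* next instruction to execute on this path */
    uint32_t active_mask;  /* bit i set => lane i executes this path */
} SimtEntry;

/* Split the current active mask by per-lane branch outcome.
 * taken_pred has bit i set if lane i evaluated the condition as true. */
void split_mask(uint32_t active, uint32_t taken_pred,
                uint32_t *taken, uint32_t *not_taken)
{
    *taken = active & taken_pred;
    *not_taken = active & ~taken_pred;
}

/* A branch diverges only when both sides keep at least one active lane. */
int diverges(uint32_t active, uint32_t taken_pred)
{
    return (active & taken_pred) != 0 && (active & ~taken_pred) != 0;
}
```

For a 4-lane warp with all lanes active (`0xF`) where lanes 0 and 1 take the branch (`0x3`), `split_mask` yields `0x3` for the taken path and `0xC` for the other, and `diverges` reports true. If every lane takes the branch, no split is needed and nothing has to be pushed.
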

05 — Divergence Trace

Branch Divergence — 4-Thread Warp Walkthrough

A 4-thread warp hits an if/else. Threads 1 and 2 take the TRUE path; threads 3 and 4 take the FALSE path. The SIMT stack serializes the two paths, then reconverges the warp.

C / CUDA
// A: common base path, all threads run in lockstep
if (condition) {
    // B: TRUE path, threads 1 & 2 only
} else {
    // D: FALSE path, threads 3 & 4 only
}
// E: reconvergence point, all threads reunite

[Interactive trace: warp state and SIMT stack panels. Stack columns: Target PC Token | Active Mask. Legend: TRUE path active; FALSE path active / reconverged; thread masked off.]

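
The walkthrough above can be replayed in software. The following C sketch models blocks A, B, D, and E as symbolic PCs and drives them with a small stack. It is an illustration under assumptions, not the hardware's actual algorithm: bit i of a mask stands for thread i+1, the TRUE path is pushed last so that it runs first, and all names (`run_if_else`, `TraceStep`, `PC_A` and so on) are hypothetical.

```c
#include <stdint.h>

enum { PC_A, PC_B, PC_D, PC_E, PC_NONE };

typedef struct { int reconv_pc; int next_pc; uint32_t mask; } SimtEntry;
typedef struct { int pc; uint32_t mask; } TraceStep;

/* Replays the if/else example. cond_true has bit i set if thread i+1
 * takes the TRUE path. Records each executed (block, mask) pair in out,
 * which must hold at least max_steps entries, and returns the count. */
int run_if_else(uint32_t full_mask, uint32_t cond_true,
                TraceStep *out, int max_steps)
{
    SimtEntry stack[4];
    int top = 0, n = 0;
    stack[0] = (SimtEntry){ PC_NONE, PC_A, full_mask };

    while (top >= 0 && n < max_steps) {
        SimtEntry *e = &stack[top];
        out[n++] = (TraceStep){ e->next_pc, e->mask };

        if (e->next_pc == PC_A) {
            uint32_t t = e->mask & cond_true;
            uint32_t f = e->mask & ~cond_true;
            if (t && f) {
                /* Divergence: current entry waits at E, push both paths. */
                e->next_pc = PC_E;
                stack[++top] = (SimtEntry){ PC_E, PC_D, f };
                stack[++top] = (SimtEntry){ PC_E, PC_B, t };
            } else {
                /* Uniform branch: no push needed. */
                e->next_pc = t ? PC_B : PC_D;
            }
        } else if (e->next_pc == PC_B || e->next_pc == PC_D) {
            if (e->reconv_pc == PC_E)
                top--;               /* reached reconvergence PC: pop */
            else
                e->next_pc = PC_E;   /* uniform path falls through to E */
        } else {
            top--;                   /* E is the last block */
        }
    }
    return n;
}
```

Calling `run_if_else(0xF, 0x3, trace, 8)` records four steps: A with mask `0xF`, B with `0x3`, D with `0xC`, then E with `0xF` again. With a uniform condition such as `cond_true = 0xF`, only three steps are recorded and the stack never grows.
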
06 — Trade-offs

Hardware Design Trade-offs

The SIMT stack accurately tracks divergence but introduces measurable overheads:

Memory Storage Overhead
Stack depth per warp grows with branch nesting depth. Every resident warp needs its own stack state, so total storage scales with occupancy.
SIMD Efficiency Drop
Serialization leaves hardware lanes idle wherever the active mask is zero. A 4-thread warp split 50/50 runs at no more than 50% ALU utilization during divergence.
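
Both overheads can be estimated with simple arithmetic. The C sketch below is illustrative only: it assumes each stack entry holds two PCs plus one mask bit per lane, and it defines SIMD efficiency as active lane-slots divided by total lane-slots over the issued instructions. The function names and every parameter value used with them are hypothetical.

```c
#include <stdint.h>

/* Portable population count: number of set bits in a mask. */
static int popcount32(uint32_t m)
{
    int c = 0;
    while (m) { m &= m - 1; c++; }
    return c;
}

/* SIMD efficiency over n issued instructions, given each one's active mask. */
double simd_efficiency(const uint32_t *masks, int n, int warp_width)
{
    long active = 0;
    if (n <= 0 || warp_width <= 0)
        return 0.0;
    for (int i = 0; i < n; i++)
        active += popcount32(masks[i]);
    return (double)active / ((double)n * warp_width);
}

/* Total SIMT stack storage in bits, assuming an entry is
 * two PCs plus one active-mask bit per lane. */
uint64_t simt_stack_bits(uint64_t resident_warps, uint64_t max_depth,
                         uint64_t pc_bits, uint64_t warp_width)
{
    uint64_t entry_bits = 2 * pc_bits + warp_width;
    return resident_warps * max_depth * entry_bits;
}
```

For the 4-thread example, the two divergent steps with masks `0x3` and `0xC` give an efficiency of 0.5, and the whole A, B, D, E sequence gives 0.75. As a purely hypothetical sizing, 48 resident warps with a stack depth of 8, 32-bit PCs, and 32 lanes would need 48 × 8 × 96 = 36,864 bits of stack state.
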
Key insight: The SIMT model is efficient when threads in a warp follow the same path. Divergence is not a bug — it's a cost the hardware explicitly manages — but minimizing it is critical for peak throughput.