GPU Microarchitecture — SIMT Core Notes

01 — Top-Level Architecture

Hardware Interconnect Overview

Modern GPGPUs are organized as an array of independent compute clusters connected through an on-chip network to high-bandwidth memory partitions.

SIMT Core Clusters ⟷ Interconnection Network ⟷ Memory Partitions

SIMT Core Clusters — House standalone processing cores (SMs). Each SM keeps up to a couple of thousand threads resident and interleaves their execution.
Interconnection Network — On-chip crossbar or mesh routing for simultaneous data distribution.
Memory Partitions — Interleaved DRAM channels (GDDR5/HBM) optimized for massive data bandwidth.
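To make the idea of interleaved channels concrete, here is a minimal sketch of how consecutive address chunks could be spread across memory partitions. The function name `partition_of`, the plain modulo mapping, and the example parameters are assumptions for illustration only; real GPUs use vendor-specific and often hashed mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping: consecutive chunks of interleave_bytes go to
   consecutive partitions, wrapping around after num_partitions. */
unsigned partition_of(uint64_t addr, uint32_t interleave_bytes,
                      uint32_t num_partitions) {
    assert(interleave_bytes > 0 && num_partitions > 0);
    return (unsigned)((addr / interleave_bytes) % num_partitions);
}
```

With an assumed 256-byte granularity and 8 partitions, addresses 0 to 255 map to partition 0, address 256 maps to partition 1, and address 2048 wraps back to partition 0, so a long sequential access stream touches every channel in turn.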

02 — Microarchitecture

SIMT Core Internal Structure

Each SIMT core pairs a SIMT front end (fetch, decode, schedule, issue) with a SIMD back end of parallel execution lanes, and is built around fine-grained multithreading.

Fetch → Decode → Schedule / Scoreboard → Issue → SIMD ALUs (×N lanes)

Instruction Cache & Buffers — Stores active instruction streams from concurrent execution contexts.
Scoreboard System — Monitors dependency conflicts and hazards; gates safe instruction issuance.
SIMT Execution Stack — Manages branch divergence via precise bit-mask tracking.
Decoupled Warp Schedulers — Multiple concurrent units picking from ready, non-stalled warps.
Monolithic Register File — Large in-core register file that holds the registers of every resident warp at once, so switching between warps needs no save or restore.
Memory Subsystem — L1 Data Cache, Shared Scratchpad (SMem), Texture Cache, Constant Cache.
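The scoreboard's gating role can be sketched as a per-warp bit vector of registers with writes still in flight. This is a simplified model, not a description of any specific GPU: the 64-register limit, the two-source instruction format, and all type and function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One pending-write bit per register; 64 registers is an assumption. */
typedef struct {
    uint64_t pending;   /* bit r set = register r has a write in flight */
} Scoreboard;

typedef struct {
    int dst;            /* destination register, or -1 if none */
    int src[2];         /* source registers, -1 for unused slots */
} Instr;

static uint64_t reg_bit(int r) {
    assert(r >= 0 && r < 64);
    return (uint64_t)1 << r;
}

/* True if no source or destination register has a write in flight
   (no read-after-write or write-after-write hazard). */
bool can_issue(const Scoreboard *sb, const Instr *in) {
    uint64_t needed = 0;
    if (in->dst >= 0) needed |= reg_bit(in->dst);
    for (int i = 0; i < 2; ++i)
        if (in->src[i] >= 0) needed |= reg_bit(in->src[i]);
    return (sb->pending & needed) == 0;
}

/* Reserve the destination register when the instruction issues. */
void on_issue(Scoreboard *sb, const Instr *in) {
    if (in->dst >= 0) sb->pending |= reg_bit(in->dst);
}

/* Release the destination register when the result is written back. */
void on_writeback(Scoreboard *sb, int dst) {
    sb->pending &= ~reg_bit(dst);
}
```

In this model a load that writes r3 blocks a later add that reads r3 until `on_writeback` clears the bit, which is the point at which the warp becomes eligible for scheduling again.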

03 — Latency Hiding

High-Latency Hiding Mechanics

GPGPUs don't rely on speculative execution or out-of-order pipelines. Latency is hidden by rotating between thousands of resident thread contexts.

$$\text{Latency Masking Ratio} = \frac{\text{Memory Pipeline Cycle Delay}}{\text{Number of Co-Resident Scheduled Warps}}$$
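As a worked example, the ratio can be evaluated directly. One plausible reading, assumed here, is that it gives the number of issue cycles each co-resident warp must contribute, on average, to cover a single memory stall. The function name and the sample numbers are illustrative.

```c
#include <assert.h>

/* Memory delay divided by the number of co-resident warps: the cycles
   of other work each warp must supply to cover one stall, under the
   interpretation assumed above. */
double latency_masking_ratio(double mem_delay_cycles, int resident_warps) {
    assert(resident_warps > 0);
    return mem_delay_cycles / (double)resident_warps;
}
```

For an assumed 400-cycle memory delay, 32 resident warps give a ratio of 12.5 and 64 resident warps give 6.25, so doubling occupancy halves the independent work each warp needs between memory accesses.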

Warp scheduler states (fine-grained interleaving): Executing, Stalled (memory), Waiting.

When a warp stalls on a global memory read, the scheduler immediately issues from another ready warp, with no context save or restore, keeping the SIMD ALUs busy.
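The selection step can be sketched as a loose round-robin pick over per-warp ready flags. Real schedulers use a range of policies (for example greedy-then-oldest), so the policy, the `MAX_WARPS` bound, and the function name below are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WARPS 64   /* assumed upper bound on resident warps */

/* Starting just after the warp that issued last, return the first
   ready warp, or -1 if every warp is stalled. Pass last_issued = -1
   before any warp has issued. */
int pick_next_warp(const bool ready[], int num_warps, int last_issued) {
    assert(num_warps > 0 && num_warps <= MAX_WARPS);
    assert(last_issued >= -1 && last_issued < num_warps);
    for (int step = 1; step <= num_warps; ++step) {
        int w = (last_issued + step) % num_warps;
        if (ready[w]) return w;
    }
    return -1;
}
```

If warp 1 is stalled on memory and warp 0 issued last, the pick skips straight to warp 2. A result of -1 corresponds to the case where no warp can hide the latency and the ALUs idle for that cycle.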

04 — Divergence Hardware

The SIMT Hardware Execution Stack

When a warp hits a conditional branch and threads disagree, the hardware serializes paths via a dedicated SIMT Stack. Each entry tracks three fields:

Target PC — Memory address of the instruction block to process next.
Token / Type — Structural marker: R = Reconvergence, S = Split path.
Active Mask — Bit vector where 1 = active thread, 0 = idle.
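The three fields map naturally onto a small record, with the stack itself as a fixed-size array per warp. The sketch below assumes 32-bit PCs, warps of up to 32 threads, and a maximum depth of 16; these sizes and all identifiers are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SIMT_STACK_DEPTH 16   /* assumed maximum nesting depth */

typedef enum { TOKEN_R, TOKEN_S } Token;   /* R = reconvergence, S = split path */

typedef struct {
    uint32_t target_pc;     /* address of the instruction block to run next */
    Token    token;         /* structural marker */
    uint32_t active_mask;   /* bit i set = thread i active, clear = idle */
} SimtEntry;

typedef struct {
    SimtEntry entries[SIMT_STACK_DEPTH];
    int       size;         /* valid entries; the top is entries[size - 1] */
} SimtStack;

void simt_push(SimtStack *s, uint32_t pc, Token t, uint32_t mask) {
    assert(s->size < SIMT_STACK_DEPTH);
    s->entries[s->size++] = (SimtEntry){ pc, t, mask };
}

SimtEntry simt_pop(SimtStack *s) {
    assert(s->size > 0);
    return s->entries[--s->size];
}

const SimtEntry *simt_top(const SimtStack *s) {
    assert(s->size > 0);
    return &s->entries[s->size - 1];
}
```

The entry on top of the stack determines which PC the warp fetches from and which lanes are enabled. Popping it hands control to the entry beneath.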

05 — Divergence Trace

Branch Divergence — 4-Thread Warp Walkthrough

A 4-thread warp hits an if/else. Threads 1 and 2 take the TRUE path; threads 3 and 4 take the FALSE path. The SIMT stack serializes the two paths and then reconverges the warp.

C / CUDA
// A: common base path, all threads run in lockstep
if (condition) {
    // B: TRUE path, threads 1 & 2 only
} else {
    // D: FALSE path, threads 3 & 4 only
}
// E: reconvergence point, all threads reunite

SIMT stack columns: Target PC, Token, Active Mask.

Warp state legend: TRUE path active; FALSE path active / reconverged; thread masked off.
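The walkthrough can be replayed in a few lines of C. This is a deliberately simplified model: each labelled block (A, B, D, E) counts as a single step, thread i maps to mask bit i-1, and the TRUE path is assumed to run before the FALSE path. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { char pc; char token; uint8_t mask; } Entry;   /* pc is a block label */
typedef struct { char pc; uint8_t mask; } Step;

/* Replays the if/else for a 4-thread warp. true_mask has bit i-1 set
   when thread i takes the TRUE path. Writes the executed (block, mask)
   pairs to out and returns how many were written (at most 4). */
int trace_if_else(uint8_t true_mask, Step out[4]) {
    const uint8_t full = 0xF;
    assert((true_mask & ~full) == 0);
    Entry stack[3];
    int top = 0, n = 0;

    out[n++] = (Step){ 'A', full };              /* common path, all lanes on */

    /* At the branch: one reconvergence entry, then one entry per
       non-empty path, pushed so that the TRUE path ends up on top. */
    uint8_t false_mask = full & (uint8_t)~true_mask;
    stack[top++] = (Entry){ 'E', 'R', full };
    if (false_mask) stack[top++] = (Entry){ 'D', 'S', false_mask };
    if (true_mask)  stack[top++] = (Entry){ 'B', 'S', true_mask };

    /* Each path runs until it reaches E, then its entry is popped and
       the next entry takes over. */
    while (top > 0) {
        Entry e = stack[--top];
        out[n++] = (Step){ e.pc, e.mask };
    }
    return n;
}
```

Calling `trace_if_else(0x3, out)` yields four steps: A with mask 0xF, B with mask 0x3 (threads 1 and 2), D with mask 0xC (threads 3 and 4), and E with mask 0xF once the warp has reconverged. When every thread agrees (mask 0xF or 0x0), only one path is pushed and no lane sits idle.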

06 — Trade-offs

Hardware Design Trade-offs

The SIMT stack accurately tracks divergence but introduces measurable overheads:

Memory Storage Overhead

Stack depth grows with branch nesting depth. Every resident warp needs its own stack, so total storage scales with both nesting depth and occupancy.
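A back-of-the-envelope estimate shows how these factors multiply. The sketch assumes each entry stores a PC, a token, and one mask bit per thread, and that every warp gets a full-depth stack. The field widths and function names are illustrative, not taken from a specific design.

```c
#include <assert.h>
#include <stdint.h>

/* Bits in one stack entry: PC + token + one active-mask bit per thread. */
uint64_t simt_entry_bits(unsigned pc_bits, unsigned token_bits,
                         unsigned warp_size) {
    return (uint64_t)pc_bits + token_bits + warp_size;
}

/* Total SIMT stack storage for one core, in bits. */
uint64_t simt_stack_bits(unsigned resident_warps, unsigned max_depth,
                         unsigned pc_bits, unsigned token_bits,
                         unsigned warp_size) {
    assert(resident_warps > 0 && max_depth > 0);
    return (uint64_t)resident_warps * max_depth *
           simt_entry_bits(pc_bits, token_bits, warp_size);
}
```

With assumed values of 64 resident warps, depth 16, 32-bit PCs, a 1-bit token, and 32-thread warps, each entry is 65 bits and the core needs 66,560 bits (8,320 bytes) of stack storage. Doubling either occupancy or supported nesting depth doubles that figure.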

SIMD Efficiency Drop

Serialization idles every lane whose mask bit is 0. A 4-thread warp split 2/2 uses only 2 of its 4 lanes on each path, so ALU utilization is 50% while the paths are serialized.
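The utilization figure can be computed from the active masks under which instructions are issued. The sketch assumes each issued instruction occupies all lanes for one cycle; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits in an active mask (portable, no compiler builtins). */
static unsigned popcount32(uint32_t m) {
    unsigned c = 0;
    for (; m; m &= m - 1) ++c;
    return c;
}

/* Fraction of lane-cycles doing useful work over a sequence of issued
   instructions, each described by the active mask it ran under. */
double simd_utilization(const uint32_t masks[], int n, unsigned warp_size) {
    assert(n > 0 && warp_size > 0 && warp_size <= 32);
    unsigned long active = 0;
    for (int i = 0; i < n; ++i) active += popcount32(masks[i]);
    return (double)active / ((double)n * warp_size);
}
```

For the two serialized paths of a 4-thread warp (masks 0x3 and 0xC) the result is 0.5. Including one full-mask instruction before and one after the branch (0xF, 0x3, 0xC, 0xF) raises the average to 0.75, which shows why short divergent regions cost less than long ones.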

Key insight: The SIMT model is efficient when threads in a warp follow the same path. Divergence is not a bug — it's a cost the hardware explicitly manages — but minimizing it is critical for peak throughput.