CS 450: Computer Architecture

Sihang Liu


Computer architecture occupies the seam between mathematics and engineering, between the abstract and the physical. It asks: how do we design a machine that faithfully executes the instructions a programmer writes, does so quickly, cheaply, and without consuming too much power, and can be extended as technology evolves? This course answers that question by moving through every major layer of a modern processor — the instruction set contract, the pipeline that executes instructions, the branch predictor that keeps the pipeline full, the cache hierarchy that feeds data at speed, and the memory system that underpins it all. Along the way the notes encounter some of the deepest design tradeoffs in all of engineering, and a few sobering security lessons that follow naturally from those tradeoffs.


Chapter 1: Foundations of Computer Architecture

1.1 What Computer Architecture Is

The word architecture in computer science carries a specific historical weight. In 1964, Gene Amdahl wrote that architecture describes “the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the dataflow and controls, the logic design, and the physical implementation.” That definition separates what a machine does — the contract between hardware and software — from how it does it.

A broader working definition is this: computer architecture is the science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy, cost, and reliability goals. The architect must simultaneously understand the software running on top, the physical technology underneath, the problems that motivated the design, and the designs of the past. Yale Patt famously described the architect’s role in four directions. Looking backward, the architect studies past tradeoffs, their upsides and downsides, and the workloads that shaped them. Looking forward, the architect imagines new designs and evaluates new choices. Looking upward, the architect understands the application problems that demand new solutions. Looking downward, the architect tracks the capabilities of circuits and predicts the direction of technology.

1.2 The Abstraction Stack

Modern computing is organized into layers, each hiding the complexity of the one below from the one above. At the top sit user-facing applications — web browsers, databases, machine-learning frameworks. Below them are algorithms and data structures, then programming languages, then compilers, then the operating system. At the hardware boundary lies the Instruction Set Architecture (ISA), the interface that the programmer (or compiler) sees. Below the ISA is the microarchitecture, the concrete implementation of the ISA. Below that are logic gates, transistors, and ultimately the physics of semiconductors.

The power of abstraction is twofold. First, isolation: a programmer writing Python need not know anything about how the JIT compiler or the processor pipeline works. Second, productivity: each layer can be developed and optimized independently. A new microarchitecture that runs the same ISA faster does not require rewriting any software.

Yet the abstraction stack is not inviolable. When one layer hits a fundamental limit — when DRAM cannot get faster simply by shrinking, when a single core cannot be made much faster without burning too much power — the only path forward is to break the abstraction and co-design across layers. This course returns to that theme repeatedly.

[Figure: the computing abstraction stack. Software layers — applications, algorithms and data structures, programming languages (Python, C, Java), compilers/linkers/runtimes, and the operating system — sit above the Instruction Set Architecture (ISA), the hardware/software "contract." Below the ISA are the microarchitecture (pipeline, cache, branch predictor), logic design (gates, flip-flops, adders), and circuits and devices (transistors, capacitors, physics). Each layer hides complexity below; the ISA is the hardware/software boundary.]

The computing abstraction stack. The ISA (green) is the critical boundary: software above it can run on any hardware that correctly implements it.

1.3 Why Architecture Matters Today

For roughly three decades, computer architects relied on a pair of reliable engines: Moore’s Law and Dennard scaling. Moore’s Law (Gordon Moore, 1965) observed that the number of transistors on a chip doubles roughly every two years. Dennard scaling (Robert Dennard, 1974) observed that as transistors shrink, their power density stays constant, so a chip of the same area runs faster without consuming more power.

Both engines have stalled. Transistor counts still grow, but scaling has become increasingly difficult: DRAM cells below roughly 35–40 nm cannot shrink further without losing the charge needed for reliable sensing, and leakage current grows as transistors scale. Dennard scaling broke around 2005 because the electric field in a transistor does not scale cleanly with its physical dimensions. The result is the power wall: chip designers can place more transistors on a die, but cannot run all of them simultaneously without melting the package.

This crisis has driven a fundamental paradigm shift. The industry moved from single fast cores to multicore designs. It created domain-specific accelerators — GPUs, TPUs, FPGAs — optimized for particular workloads. It invented new memory technologies to address the memory wall: the observation that processor speeds improved by roughly 60,000× over two decades while memory bandwidth improved by only 100×. The mismatch between computation speed and memory bandwidth has become the central performance bottleneck in modern systems, motivating the processing-in-memory ideas discussed at the end of this course.


Chapter 2: Computing Models and the ISA

2.1 The Von Neumann Model

Every major ISA today — x86, ARM, MIPS, RISC-V — is grounded in the Von Neumann model of computation, also called the stored-program computer. Its two defining properties are so familiar they can seem obvious, yet each was a genuine conceptual breakthrough in the 1940s.

The first property is the stored program: instructions are stored in the same linear memory array as data. Memory is unified; a stored value’s meaning (instruction or data) depends on the control signals that access it, not on any physical distinction. The second property is sequential instruction processing: one instruction is fetched, executed, and completed at a time, and a program counter (PC) — also called the instruction pointer — advances sequentially through memory, jumping only when a control-flow instruction explicitly redirects it.

Von Neumann Model. A computing model in which (1) instructions and data share a single, unified memory, and (2) instructions are processed in the sequential order specified by the PC, except when a branch, jump, or call explicitly changes the PC.

The Von Neumann model is not the only possible model. It is simply the dominant one — and understanding why requires briefly considering the alternative.

[Figure: the Von Neumann stored-program model. A unified memory holds both instructions and data; a processing unit contains the PC (instruction pointer), control unit, ALU, and temporary registers, connected to memory by fetch/store paths and to I/O. The PC advances sequentially; branch instructions explicitly redirect it.]

The Von Neumann stored-program model. Instructions and data share the same unified memory; the PC sequences through instructions in order.

2.2 The Dataflow Model

In the dataflow model, instructions are not executed in program-counter order. Instead, an instruction executes — or fires — when all of its input operands are available. Programs are represented as directed graphs: nodes are operations, arcs carry data tokens from producers to consumers. There is no instruction pointer; the data dependencies among instructions fully determine execution order.

The appeal is parallelism. In a Von Neumann machine, the instruction x = v − w must wait for the instruction v = a + b to complete, even if there is idle hardware that could be doing other work. In a dataflow machine, all instructions whose inputs are ready fire simultaneously, achieving maximal instruction-level parallelism without the programmer specifying it explicitly. A concrete example is computing the Hamming distance between two binary strings: the XOR of two operands, the bit-count loop, and the accumulation can all be expressed as a graph of operations that fire as tokens flow, achieving higher throughput than sequential execution on the same hardware.
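The firing rule can be made concrete with a toy interpreter. The sketch below is purely illustrative — the graph representation, node names, and helper are invented for this example, not taken from any real dataflow machine — but it shows the essential property: nodes fire as soon as their input tokens arrive, with no program counter sequencing them.

```python
# Toy dataflow interpreter (illustrative only, not a real dataflow ISA).
# Each node fires as soon as all of its input tokens have arrived;
# there is no program counter sequencing the nodes.

def run_dataflow(nodes, tokens):
    """nodes: {name: (op, [input names])}; tokens: initial operand values."""
    fired = []
    remaining = dict(nodes)
    while remaining:
        # A node is ready when every input token is available.
        ready = [n for n, (op, ins) in remaining.items()
                 if all(i in tokens for i in ins)]
        if not ready:
            raise RuntimeError("deadlock: no node can fire")
        for n in ready:                       # all ready nodes fire "simultaneously"
            op, ins = remaining.pop(n)
            tokens[n] = op(*[tokens[i] for i in ins])
            fired.append(n)
    return tokens, fired

# Hamming distance of two words as a two-node dataflow graph:
# XOR the operands, then count the set bits of the result.
graph = {
    "x":    (lambda a, b: a ^ b,          ["a", "b"]),
    "dist": (lambda x: bin(x).count("1"), ["x"]),
}
tokens, order = run_dataflow(graph, {"a": 0b1011, "b": 0b0110})
# tokens["dist"] is the Hamming distance: 3 for these operands
```

On a machine with many function units, every node in `ready` really could execute in the same step — the parallelism falls out of the graph, not out of any explicit annotation.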

Dataflow advantages include excellent exploitation of irregular parallelism and absence of false sequencing constraints. But the disadvantages are severe in practice. Precise exception handling is nearly impossible when there is no well-defined program state between instruction firings. Debugging requires reconstructing the program’s history from distributed token state. The overhead of matching tokens to their destination nodes is large. Memory locality is hard to exploit because the execution order is determined by data readiness, not by the physical arrangement of data in memory.

Dataflow at the ISA level has not succeeded commercially. The Monsoon processor (1990) was an explicit-token-store dataflow architecture that never reached the mainstream. What has succeeded is using dataflow ideas underneath the sequential ISA: out-of-order execution, in which the hardware dynamically executes whichever instructions have ready operands while others wait, is essentially hardware-scheduled dataflow hidden beneath the sequential programming model.

2.3 The Instruction Set Architecture

An ISA is the agreed-upon interface between software and hardware. The compiler assumes the ISA; the processor guarantees it. Software written to a given ISA runs on any processor that implements it, regardless of the internal implementation. This separation has enormous economic value: an ISA can outlive any particular microarchitecture by decades. The x86 ISA, introduced in 16-bit form in 1978 and extended to 64 bits in 2003, runs today on processors that would have been inconceivable to its original designers.

Instruction Processing Style

Every ISA must specify how instructions identify and obtain their operands. The four canonical styles differ in the number of operands explicitly named:

A 0-address (stack) machine keeps all operands on an implicit last-in, first-out stack. Operations such as ADD pop two values, add them, and push the result. Arguments are pushed explicitly with PUSH instructions and results retrieved with POP. Code is compact — no operand fields are needed in operate instructions — and procedure calls are natural since parameters live on the stack. The disadvantage is inflexibility: when an operation needs operands that are not at the top of the stack, extra push/pop instructions are required.
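A minimal evaluator makes the 0-address style tangible. The opcode names and tuple encoding below are hypothetical, chosen only to show that operate instructions carry no operand fields — only PUSH names a value.

```python
# Minimal 0-address (stack) machine sketch — hypothetical opcodes,
# showing that operate instructions need no operand fields.

def run_stack(program):
    stack = []
    for instr in program:
        op = instr[0]
        if op == "PUSH":               # only PUSH names an explicit operand
            stack.append(instr[1])
        elif op == "ADD":              # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":              # pop two values, push their product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "POP":              # retrieve the final result
            return stack.pop()
    return stack[-1]

# (2 + 3) * 4 in postfix order: operands pushed first, operations after.
result = run_stack([("PUSH", 2), ("PUSH", 3), ("ADD",),
                    ("PUSH", 4), ("MUL",), ("POP",)])
```

Notice the inflexibility the text describes: if `4` had been pushed before `2` and `3`, extra stack-shuffling instructions would be needed to bring the right operands to the top before `MUL`.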

A 1-address (accumulator) machine uses a special register called the accumulator for all operations. One operand comes from memory; the other is implicitly the accumulator; the result goes back to the accumulator. Early microprocessors and many simple microcontrollers used this style. The Intel 8080, ancestor of the x86, is an accumulator machine.

A 2-address machine names two operands: one is both a source and the destination. The PDP-11 and the x86 follow this convention. The drawback is that one source operand is overwritten by the result; preserving the original value requires an extra copy instruction.

A 3-address machine names three operands: two sources and a distinct destination. MIPS, Alpha, ARM, and RISC-V all use this style. Three-address instructions are longer, but the compiler has far more flexibility in mapping computation to registers.

Data Types and Addressing Modes

The ISA specifies what data types instructions operate on. Integer and floating-point types are universal. Some ISAs include higher-level types: the VAX had instructions for doubly-linked list insertion (INSQUEUE) and removal (REMQUEUE), string operations, and array access with bounds checking (INDEX). The advantage of high-level types is that the compiler writes fewer instructions for common patterns. The disadvantage is that the microarchitect must implement complex operations efficiently, and the compiler loses the fine-grained control needed for optimization.

Binary-coded decimal (BCD) is a notable historical data type: each decimal digit is encoded in four bits, making conversion to human-readable decimal trivial. Legacy financial and telecommunications systems still use BCD instructions in the x86.

Addressing modes specify how to compute the effective memory address for a load or store instruction. The most important modes are:

  • Absolute (direct): the address is a constant embedded in the instruction.
  • Register indirect: the address is the value in a base register.
  • Displaced (based): the address is a constant plus a base register — the ubiquitous offset(base) of MIPS and [base + displacement] of x86.
  • Indexed: the address is a base register plus an index register, useful for array access.
  • Scaled indexed: the address is a base plus an index multiplied by an element size (e.g., base + index*4 for a 4-byte integer array).
  • Auto-increment/decrement: the address is the base register, which is then incremented or decremented automatically, useful for sequential traversal.

More addressing modes make the ISA more expressive and reduce code size, but they complicate the microarchitect’s job and make instruction decode slower.
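The arithmetic behind these modes is simple enough to write out. The sketch below is illustrative — register names and the dict-based machine state are invented — but each branch mirrors one bullet above.

```python
# Effective-address computation for common addressing modes — a sketch;
# register state is a plain dict, and the names are invented for illustration.

def effective_address(mode, regs, **f):
    if mode == "absolute":            # constant address embedded in the instruction
        return f["const"]
    if mode == "register_indirect":   # address held in a base register
        return regs[f["base"]]
    if mode == "displaced":           # offset(base): constant + base register
        return regs[f["base"]] + f["disp"]
    if mode == "indexed":             # base register + index register
        return regs[f["base"]] + regs[f["index"]]
    if mode == "scaled":              # base + index * element size
        return regs[f["base"]] + regs[f["index"]] * f["scale"]
    raise ValueError(f"unknown mode: {mode}")

regs = {"r1": 0x1000, "r2": 3}
# Fourth element (index 3) of a 4-byte integer array based at r1:
addr = effective_address("scaled", regs, base="r1", index="r2", scale=4)
# addr == 0x100C
field = effective_address("displaced", regs, base="r1", disp=8)
# field == 0x1008, the ubiquitous 8(r1) / [r1 + 8] pattern
```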

CISC vs. RISC

The tension between complex and simple instructions reflects a deep tradeoff called the semantic gap: how close should the ISA sit to the high-level language (narrow gap, many complex instructions) versus the underlying hardware control signals (wide gap, few simple instructions)?

Complex Instruction Set Computer (CISC) architectures, like the x86 and VAX, have narrow semantic gaps. A single instruction can load, add, and store, or copy an entire string (REP MOVS). Advantages: denser code, simpler compiler. Disadvantages: complex hardware, harder to optimize at a fine grain.

Reduced Instruction Set Computer (RISC) architectures, pioneered by John Cocke at IBM in the mid-1970s and popularized by MIPS, SPARC, Alpha, and ARM, have wide semantic gaps. Instructions are simple and uniform: load-store architecture (operate instructions work only on registers; memory is accessed by dedicated load and store instructions), fixed-length encoding, and three-address format. Advantages: simple hardware, short cycle time, good compiler optimization opportunities. Disadvantages: more instructions, larger code.

The practical resolution of this debate is translation. Modern x86 processors translate complex x86 instructions into simple internal micro-operations (µops) in hardware, gaining RISC-like internal execution while preserving x86 software compatibility. Transmeta’s Crusoe processor did the same transformation in software, using a code morphing layer that translated x86 instructions into the processor’s internal VLIW instructions.

Semantic Gap. The distance in complexity between high-level language constructs and ISA instructions. CISC architectures minimize the semantic gap; RISC architectures maximize it. Translation layers — in hardware or software — allow ISA complexity to be decoupled from implementation complexity.

Instruction Length

Fixed-length instructions (e.g., MIPS: always 32 bits; Alpha: always 32 bits) are easier to decode in parallel and make it trivial to compute the address of the next instruction: simply add 4. Variable-length instructions (e.g., x86: 1–15 bytes) allow compact encoding of common short instructions and rare long ones, reducing memory bandwidth and improving cache hit rates, but make simultaneous decode of multiple instructions complex and expensive.

Registers

The number of architectural registers profoundly affects performance and code size. Fewer registers (e.g., x86’s original 8 general-purpose registers) mean smaller instruction encoding and simpler hardware, but force the compiler to frequently save and restore (“spill”) values to memory. More registers (e.g., MIPS and Alpha’s 32 GPRs, SPARC’s 128 windowed registers) allow the compiler to keep more live values in fast register storage, reducing spills.

Memory Organization

The ISA specifies the address space (how many uniquely addressable locations exist — typically \(2^{64}\) in 64-bit architectures) and addressability (how much data each location contains — almost universally 8 bits today, though early machines used bit, 64-bit, or other granularities). It also specifies whether memory accesses must be aligned (e.g., a 4-byte load must start at an address divisible by 4) or can be unaligned. Alignment simplifies hardware; unaligned access makes the programmer’s or compiler’s life easier.
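The alignment rule reduces to one bit-test. A hypothetical check, assuming power-of-two access sizes (the universal case):

```python
# Alignment check sketch: an n-byte access is aligned when the address is a
# multiple of n — equivalently, its low log2(n) bits are all zero.
# Assumes size is a power of two.

def is_aligned(addr, size):
    return addr & (size - 1) == 0

aligned_4   = is_aligned(0x1004, 4)   # address divisible by 4: aligned
unaligned_4 = is_aligned(0x1006, 4)   # low bits 0b10 set: unaligned
```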

ISA Elements Summary

The full list of things an ISA must specify is longer than it might appear:

  • Instructions: opcodes, operand specifiers, data types, instruction formats
  • Memory: address space, addressability, alignment rules, virtual memory interface
  • Registers: count, size, special-purpose vs. general-purpose
  • Privilege modes: user vs. supervisor (e.g., x86 rings 0–3)
  • Exception and interrupt handling: vectored or non-vectored interrupts, precise vs. imprecise exceptions
  • I/O: memory-mapped I/O (a region of the address space maps to device registers) vs. special I/O instructions (x86’s IN/OUT)

Chapter 3: Microarchitecture

3.1 ISA vs. Microarchitecture

The microarchitecture is the concrete implementation of an ISA. Many different microarchitectures can implement the same ISA: the Intel 80386, Pentium, Pentium Pro, Core, and Skylake all implement the x86 ISA, yet their internal organizations differ radically. The ISA is the car’s pedals and steering wheel; the microarchitecture is the engine and drivetrain.

What belongs to the ISA vs. the microarchitecture? The ISA defines the ADD instruction; the microarchitecture defines whether the adder uses ripple-carry, carry-lookahead, or some other implementation. The ISA specifies that exceptions are precise (the machine state is well-defined at the point of exception); the microarchitecture implements mechanisms — reorder buffers, exception flags in pipeline registers — to make that guarantee hold even though instructions execute speculatively and out of order.

Programmer-visible state includes all registers named by the ISA (general-purpose registers, PC, condition codes, control registers) and all of memory. Programmer-invisible state includes caches, pipeline registers, branch predictors, reorder buffers, and reservation stations. The software can never directly observe or modify invisible state; from the software’s perspective, it does not exist.

3.2 Processing an Instruction

Every instruction transforms the architectural state AS to a new state AS’. The Von Neumann model specifies what AS’ should be for each instruction; the microarchitecture specifies how the transformation is performed. The microarchitect has enormous freedom in this how, provided the visible outcome is correct.

Instruction Processing Cycle. The five canonical phases of processing a single instruction:
  1. Fetch (IF): read the instruction from memory at the address in the PC.
  2. Decode (ID/RF): determine the instruction type; read source register values from the register file.
  3. Address Generation (EX/AG): compute the effective memory address (for loads/stores) or perform the ALU operation.
  4. Memory Access (MEM): read from or write to data memory (only for load and store instructions).
  5. Write-back (WB): write the result back to the destination register.
Not all instructions require all five phases, but all must go through them in order.

3.3 Single-Cycle Microarchitecture

The simplest possible microarchitecture executes each instruction in a single clock cycle. The entire five-phase processing sequence is implemented as a single combinational logic path from the state elements (registers and memory) through the datapath and back. No intermediate state is stored; the clock edge at the end of the cycle simultaneously captures all updates to all state elements.

We trace the MIPS single-cycle datapath as a concrete example. MIPS is a clean RISC ISA with three instruction formats:

  • R-type (register–register): opcode (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6). Used for ALU operations: ADD rd, rs, rt means GPR[rd] ← GPR[rs] + GPR[rt]; PC ← PC + 4.
  • I-type (immediate): opcode (6), rs (5), rt (5), immediate (16). Used for loads, stores, branches, and immediate ALU ops: ADDI rt, rs, imm means GPR[rt] ← GPR[rs] + sign_extend(imm); PC ← PC + 4.
  • J-type (jump): opcode (6), target (26). Used for unconditional jumps.

MIPS 32-bit Instruction Formats

  R-type:  opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
  I-type:  opcode (6) | rs (5) | rt (5) | immediate (16)
  J-type:  opcode (6) | jump target (26)

All MIPS instructions are exactly 32 bits wide (fixed-length encoding).

MIPS instruction encoding. R-type uses all fields for register operands; I-type embeds a 16-bit immediate; J-type dedicates 26 bits to a jump target.
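Because the formats are fixed, decoding is pure shift-and-mask logic. The sketch below assembles and decodes one R-type word; the field widths follow the formats above, and the sample instruction uses the real MIPS `ADD` encoding (opcode 0, funct 0x20).

```python
# Decoding a 32-bit MIPS R-type word with shifts and masks.
# Field widths (from MSB to LSB): opcode 6, rs 5, rt 5, rd 5, shamt 5, funct 6.

def decode_rtype(word):
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >> 6)  & 0x1F,
        "funct":  word & 0x3F,
    }

# ADD $3, $1, $2  (opcode 0, funct 0x20): GPR[3] <- GPR[1] + GPR[2]
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_rtype(word)
```

This regularity is precisely the decode advantage of fixed-length RISC encodings: every field sits at a known bit position, so all fields can be extracted in parallel.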

For an R-type ALU instruction, the datapath is: fetch the instruction from I-memory using PC, decode it to determine rs, rt, rd, and the ALU function, read GPR[rs] and GPR[rt] from the register file, feed both values to the ALU, write the result to GPR[rd], and increment PC by 4. All of this happens inside a single clock cycle.

For a load word (LW rt, offset(base)): fetch instruction, decode, read GPR[base], add the sign-extended offset in the ALU to form the effective address, read D-memory at that address, and write the loaded value to GPR[rt]. This requires two memory accesses — instruction fetch and data read — which is why single-cycle machines typically use separate instruction and data memories.

Branch instructions (BEQ rs, rt, offset): compute PC + 4 + sign_extend(offset) × 4 as the branch target. Compare GPR[rs] and GPR[rt] in the ALU to determine the branch condition. Multiplex between PC+4 and the branch target based on the condition.
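The target arithmetic can be checked directly. A sketch of the BEQ target computation, with a small sign-extension helper (the function names are invented):

```python
# Branch-target arithmetic for BEQ: the 16-bit offset is sign-extended,
# scaled by 4 (instructions are word-aligned), and added to PC + 4.

def sign_extend16(value):
    """Interpret a 16-bit field as a signed value."""
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    return pc + 4 + sign_extend16(offset16) * 4

fwd  = branch_target(0x1000, 0x0003)   # forward branch: 0x1000 + 4 + 12
back = branch_target(0x1000, 0xFFFF)   # offset -1: 0x1000 + 4 - 4
```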

Critical Path and Latency Analysis

The clock cycle time in a single-cycle machine is determined by the critical path — the longest combinational delay from any state element output to any state element input. With the rough timing figures: instruction memory read = 200 ps, ALU = 100 ps, data memory read/write = 200 ps, register file read/write = 50 ps, the critical path for a load instruction is:

I-memory (200) + register file read (50) + ALU (100) + D-memory (200) + register file write (50) = 600 ps

A register–register ALU instruction takes only 400 ps along its critical path. But in a single-cycle machine, the clock period must accommodate the slowest instruction — the load at 600 ps. Every instruction, including the simple register–register add that needs only 400 ps, wastes the remaining 200 ps waiting for the clock.
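The arithmetic above is worth spelling out, using the same rough timing figures from the text:

```python
# Critical-path arithmetic for the single-cycle machine, using the
# rough timing figures given in the text (in picoseconds).

DELAYS = {"imem": 200, "rf_read": 50, "alu": 100, "dmem": 200, "rf_write": 50}

load_path = ["imem", "rf_read", "alu", "dmem", "rf_write"]   # LW: slowest
alu_path  = ["imem", "rf_read", "alu", "rf_write"]           # ADD: skips D-memory

load_ps = sum(DELAYS[s] for s in load_path)   # 600 ps -> sets the clock period
alu_ps  = sum(DELAYS[s] for s in alu_path)    # 400 ps along the ALU path
wasted  = load_ps - alu_ps                    # 200 ps idle for every ALU op
```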

Limitations of Single-Cycle

The single-cycle design violates all three of the key microarchitecture design principles:

  • Critical path: the machine is designed around the worst case (load), not the typical case.
  • Common case: simple arithmetic instructions — by far the most frequent — are slowed to match the rarest, slowest instruction.
  • Balanced design: the ALU, data memory, and register file are all provided as combinational resources needed in parallel, but most instructions use only a subset; the rest are wasted.

If a complex ISA instruction like x86’s REP MOVS (copy a string of arbitrary length) were implemented as a single-cycle instruction, it would require an enormous and absurd combinational circuit. The critical path would be impossibly long.

3.4 Multi-Cycle Microarchitecture

The multi-cycle microarchitecture breaks instruction execution into multiple shorter clock cycles, allowing a shorter clock period. Each phase of the instruction processing cycle becomes a separate clock cycle. The clock period is set by the slowest phase, not the slowest instruction. A register–register add that needs only the fetch, decode, execute, and write-back phases takes four cycles; a load that needs all five takes five cycles. On average, the multi-cycle design is much more efficient.

Multi-cycle machines introduce intermediate state: pipeline registers that hold partial results between phases. The register file is written only at the end of the last cycle of an instruction; from the programmer’s perspective, the update is still atomic.


Chapter 4: Pipelining

4.1 The Basic Idea

Even the multi-cycle design leaves hardware idle: during the execute phase, the instruction fetch hardware sits unused. Pipelining exploits this idle hardware. While one instruction is being executed, the next instruction is being decoded, and the one after that is being fetched.

Pipelining. A technique in which the instruction processing cycle is divided into distinct stages, and different instructions occupy different stages simultaneously. Like an assembly line, pipelining increases throughput — instructions complete at the rate of one per stage (in steady state) — without necessarily reducing the latency of any individual instruction.

The laundry analogy is apt. Doing four loads of laundry sequentially — wash, dry, fold, put away, then repeat — takes much longer than pipelining: while load 2 is in the dryer, load 3 is in the washer, and load 4 is waiting. The total throughput improves by up to a factor of four. The latency of any single load is unchanged, but the throughput of completing multiple loads is dramatically better.

In an ideal k-stage pipeline built from an unpipelined datapath of latency T, each stage takes T/k plus a latch overhead o, so throughput approaches 1/(T/k + o) = k/(T + k·o) ≈ k/T when o is small. The cost increases by the overhead of k−1 sets of pipeline registers (latches between stages), but this is modest compared to the throughput gain.
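The model above can be evaluated numerically. A sketch with assumed figures — an unpipelined latency of 600 ps (the load critical path from Chapter 3) and a hypothetical 20 ps latch overhead per stage:

```python
# Ideal pipeline throughput model: an unpipelined latency T split into
# k stages, each paying a latch overhead o. Figures are illustrative:
# T = 600 ps (the load critical path), o = 20 ps (assumed latch delay).

def throughput(T, k, o):
    cycle = T / k + o      # clock period = stage latency + latch overhead
    return 1.0 / cycle     # instructions per ps in steady state

base = throughput(T=600, k=1, o=0)     # unpipelined baseline: 1/600
pipe = throughput(T=600, k=5, o=20)    # 5 stages at 140 ps each: 1/140
speedup = pipe / base                  # less than 5x, because of the latches
```

The latch overhead is why pipelines cannot be deepened indefinitely: as k grows, the o term comes to dominate the cycle time.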

4.2 The Five-Stage MIPS Pipeline

The MIPS single-cycle datapath divides naturally into five stages: IF, ID/RF, EX/AG, MEM, WB. By adding four sets of pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB), we obtain a five-stage pipeline. Each register captures the state produced by one stage and feeds it to the next.

With stage latencies of roughly 200 ps (IF), 50 ps (ID/RF), 100 ps (EX), 200 ps (MEM), and 50 ps (WB) — the same component delays as in Chapter 3 — the maximum stage latency is 200 ps, which becomes the clock period of the pipelined design. The throughput is one instruction per 200 ps (ignoring stalls), compared to one instruction per 600 ps in the single-cycle machine. The speedup is 3×, not 5× as the ideal model predicts, because the clock period is set by the slowest stage (200 ps), not by the total latency divided by the number of stages.

An important constraint: no resource may be used by more than one stage simultaneously. If instructions share the same data memory for fetch and for data access, two instructions in different stages would conflict. The solution is to use separate instruction cache (I-cache) and data cache (D-cache).

Five-Stage Pipeline — Timing Diagram

            Cycle:  1    2    3    4    5    6    7    8
    Inst 1:        IF   ID   EX   MEM  WB
    Inst 2:             IF   ID   EX   MEM  WB
    Inst 3:                  IF   ID   EX   MEM  WB
    Inst 4:                       IF   ID   EX   MEM  WB

Steady state: one instruction completes per cycle · latency per instruction = 5 cycles

Five-stage MIPS pipeline timing. Four instructions overlap in steady state — while Inst 1 writes back, Inst 4 is just being fetched. One instruction completes per cycle.

4.3 Why Pipelines Are Not Ideal

Real instruction pipelines violate all three assumptions of the ideal pipeline model:

Different instructions ≠ identical operations. A branch does not need a memory access stage; a load does not have the same latency as a register–register add. Forcing all instructions through the same five stages wastes hardware and introduces external fragmentation — some stages are idle for some instructions.

Non-uniform stage latency → internal fragmentation. The MEM stage (200 ps) is twice as long as the ID stage (100 ps). All stages must run at the same clock period (200 ps), so the ID stage is only half-utilized.

Instructions are not independent. Successive instructions often depend on each other’s results. This is the most consequential violation and gives rise to pipeline hazards — conditions that prevent the pipeline from continuing at full throughput.

4.4 Pipeline Hazards

Three classes of hazard disrupt pipeline flow:

Resource hazards arise when two instructions in different pipeline stages need the same hardware resource at the same time. Example: if the machine had only a single shared memory, an instruction in IF (needing instruction memory) and an instruction in MEM (needing data memory) would conflict. Solution: either duplicate the resource (separate I-cache and D-cache) or detect the conflict and stall one instruction.

Data hazards arise from dependencies between instructions. Three sub-types:

  • Flow (RAW) dependency: a later instruction reads a register that an earlier instruction has not yet written. This is a true dependency — the later instruction genuinely needs the result.
  • Anti (WAR) dependency: a later instruction writes a register that an earlier instruction has not yet read.
  • Output (WAW) dependency: two instructions write the same register; the later write must be the one that survives.

In the five-stage in-order pipeline, WAR and WAW dependencies rarely cause actual stalls because writes happen in WB and reads in ID, and instructions proceed in order. RAW dependencies are the critical case.
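Classifying a dependency between two instructions is a matter of comparing register names. A sketch, with each instruction represented as a (destination, sources) pair — an invented encoding for illustration:

```python
# Classifying the dependency between two instructions, each written as
# (destination register, [source registers]) — an illustrative encoding.

def classify(earlier, later):
    dst_e, srcs_e = earlier
    dst_l, srcs_l = later
    deps = []
    if dst_e in srcs_l:
        deps.append("RAW")   # flow: later reads what earlier writes
    if dst_l in srcs_e:
        deps.append("WAR")   # anti: later overwrites an earlier source
    if dst_e == dst_l:
        deps.append("WAW")   # output: both write the same register
    return deps

add = ("r1", ["r2", "r3"])   # ADD r1, r2, r3
sub = ("r2", ["r1", "r4"])   # SUB r2, r1, r4
deps = classify(add, sub)    # RAW on r1 and WAR on r2
```

Only the RAW entry forces the in-order pipeline to wait; the WAR entry is harmless here because SUB writes r2 in WB, long after ADD has read it in ID.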

Control hazards arise from branches and jumps: when the pipeline fetches the next instruction, it does not yet know whether the current instruction is a branch, and if so, whether it is taken. By the time the branch is resolved, one or more instructions have already been fetched from the wrong path.

4.5 Handling Data Hazards

Detection

Two approaches identify data dependencies before they cause incorrect execution:

Scoreboarding associates a single valid bit with each register. An instruction that will write a register clears the valid bit when it enters the pipeline; the bit is restored when the write completes. An instruction in the decode stage stalls if any of its source or destination registers have cleared valid bits. This is simple — 1 bit per register — but conservative: it stalls on output and anti-dependencies that could be handled without stalling.
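The one-bit-per-register mechanism fits in a few lines. A sketch of the valid-bit scoreboard described above (class and method names are invented):

```python
# One-valid-bit-per-register scoreboarding sketch: decode stalls when any
# source or destination register is marked busy by an in-flight writer.

class Scoreboard:
    def __init__(self, num_regs=32):
        self.valid = [True] * num_regs

    def can_issue(self, dst, srcs):
        # Conservative: stall on the destination too, not just the sources.
        return all(self.valid[r] for r in [dst, *srcs])

    def issue(self, dst):        # a writer enters the pipeline: clear its bit
        self.valid[dst] = False

    def writeback(self, dst):    # the write completed: restore the bit
        self.valid[dst] = True

sb = Scoreboard()
sb.issue(1)                          # ADD r1, ... is now in flight
stall = not sb.can_issue(2, [1, 4])  # SUB r2, r1, r4 must stall on r1
sb.writeback(1)                      # ADD writes back
go = sb.can_issue(2, [1, 4])         # now SUB may issue
```

The conservatism is visible in `can_issue`: checking the destination register makes WAW hazards stall even when, in this in-order pipeline, the write order would come out right anyway.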

Combinational dependency check logic explicitly checks whether any instruction in a later pipeline stage will write to a source register of the instruction being decoded. This avoids unnecessary stalls for WAR and WAW, but the logic grows complex as the pipeline deepens or widens (as in superscalar execution).

Resolution: Stalling

The simplest resolution is to stall the dependent instruction: hold it in the decode stage (and freeze all earlier stages) until its source data is available. A bubble (a no-operation NOP) propagates through the pipeline in place of the stalled instruction. This is correct but wastes cycles.

Consider a flow dependency where instruction I writes register rx and instruction J reads rx, with J immediately following I (distance 1). In the five-stage pipeline: I completes WB at the end of cycle 5; J needs rx in ID at the start of cycle 3. J must stall for 2 cycles.

Resolution: Forwarding (Bypassing)

Stalling is often unnecessary. The value written by instruction I is available at the end of I’s EX stage — even before I writes it to the register file in WB. If we can route that intermediate result directly to the input of instruction J’s EX stage, we eliminate the stall.

Data Forwarding (Bypassing). The technique of routing a computed value directly from the pipeline stage where it becomes available to the pipeline stage where it is needed, bypassing the register file. Forwarding eliminates stalls for data hazards whenever the producer's result is available before (or at the same time as) the consumer needs it.

The key insight is that the register file is not the fundamental communication channel between instructions — it is an abstraction. What matters is maintaining the correct data flow between operations, regardless of where that data physically resides at any moment. Forwarding paths from EX/MEM and MEM/WB registers to EX-stage inputs cover the most common cases. The only case that cannot be forwarded away is a load–use hazard: if instruction I is a load and instruction J immediately follows and reads the loaded register, the loaded data is not available until the end of I’s MEM stage, one cycle after J would need it in EX. One stall cycle is unavoidable.
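The forwarding decision reduces to mux-select logic at the EX-stage inputs. A hedged sketch of that control for one source operand (Python; the stage-register dictionaries and return labels are illustrative):

```python
# Sketch of EX-stage forwarding-mux control for one source register in
# the classic five-stage pipeline. The youngest in-flight producer wins
# when both pipeline registers hold writes to the same register.
def forward_select(src_reg, ex_mem, mem_wb):
    """Return where the EX stage should take operand src_reg from.

    ex_mem / mem_wb describe the instructions currently in those
    pipeline registers: {'writes_reg': register number or None}.
    """
    if src_reg == 0:                      # register 0 is hardwired (MIPS-style)
        return "REGFILE"
    if ex_mem["writes_reg"] == src_reg:   # youngest producer (distance 1)
        return "EX/MEM"
    if mem_wb["writes_reg"] == src_reg:   # older producer (distance 2)
        return "MEM/WB"
    return "REGFILE"                      # no in-flight producer

# ADD r3,... sits in EX/MEM; SUB r5,... sits in MEM/WB.
print(forward_select(3, {"writes_reg": 3}, {"writes_reg": 5}))  # EX/MEM
print(forward_select(5, {"writes_reg": 3}, {"writes_reg": 5}))  # MEM/WB
print(forward_select(7, {"writes_reg": 3}, {"writes_reg": 5}))  # REGFILE
```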

Data Forwarding (Bypassing) Paths

Data forwarding paths. The producer's result is routed directly from the EX/MEM or MEM/WB pipeline register back to the EX-stage inputs, eliminating most stalls. The load-use hazard (one stall cycle) is the sole unavoidable case.

Software Scheduling

An alternative to hardware interlocking is to let the compiler rearrange instructions to avoid hazards. If the compiler places an independent instruction between the producer and consumer, the hardware can forward or the pipeline naturally resolves the dependency without stalling. This static scheduling works well when the compiler knows instruction latencies and when latencies are predictable. It fails when latency varies at run time — for example, a load that hits the cache takes 1 cycle but misses takes 100 cycles. Dynamic (hardware) scheduling handles variable latencies gracefully.


Chapter 5: Control Hazards and Solutions

5.1 The Control Dependency Problem

Every instruction is control-dependent on all preceding branches: if a branch redirects execution, instructions fetched after the branch (but before the branch resolves) are on the wrong path. This is the control hazard. In the five-stage MIPS pipeline, the branch condition is evaluated at the end of the EX stage; by then, the pipeline has already fetched two more instructions. If the branch is taken, those two instructions must be flushed (squashed), wasting 2 cycles.

The key constraint: the next instruction’s address (the fetch PC) must be determined before the instruction currently being fetched is even decoded, let alone executed. This is the challenge that branch prediction addresses.

5.2 Strategies for Control Hazards

The architect has several tools, each with distinct tradeoffs.

Stalling until the branch resolves guarantees correctness but wastes 2–4 cycles per branch (depending on pipeline depth). With 15–25% of instructions being branches, this is prohibitively expensive in deep pipelines.

Predict not-taken assumes the branch falls through to PC+4 and fetches the next sequential instruction. When the branch is actually not taken, no penalty. When taken, the incorrectly-fetched instructions are squashed. Since approximately 70% of branches are taken overall (and nearly 90% of backward, loop-closing branches are taken), this strategy has about 30% accuracy on conditional branches — not much better than stalling.

Predicated execution eliminates branches by converting conditional code into unconditional instructions that include a condition predicate. Each instruction carries a condition code; if the condition is false, the instruction executes as a NOP. The classic 32-bit ARM ISA has had full predication since its earliest versions (the 64-bit AArch64 ISA later dropped most of it in favor of a few conditional-select instructions). The advantage is that the pipeline never needs to predict which path is taken; the disadvantages are wasted execution of instructions whose predicates are false, and the ISA complexity of encoding conditions in every instruction.

Delayed branching changes the ISA semantics: the N instructions after a branch (the branch delay slots) always execute, regardless of the branch outcome. The branch takes effect only after those N instructions. The compiler fills the delay slots with useful instructions that would have executed anyway. MIPS, SPARC, and HP-PA use a 1-slot delay. The difficulty is that useful instructions that are independent of the branch and its target are not always available to fill the slots, and the scheme becomes untenable as pipelines deepen and the delay grows.

Fine-grained multithreading maintains multiple hardware thread contexts (separate PCs, register files) and switches among threads every cycle. By the time the pipeline resolves a branch from thread A, the pipeline has been executing instructions from threads B, C, and D; no instructions from thread A have been wasted. This sacrifices single-thread performance for aggregate throughput and is used in network processors, Niagara (Sun’s 32-way multithreaded SPARC), and the peripheral processing units of the CDC 6600. The key trade is: no stalls from any single thread’s hazards, but each thread advances only once per N cycles.


Chapter 6: Branch Prediction

6.1 Why Branch Prediction Matters

The impact of branch mispredictions scales with both branch frequency and pipeline depth. Consider a 5-wide superscalar processor with a 20-cycle branch resolution latency, processing 500 instructions with one branch per 5 instructions:

  • At 100% accuracy: 100 cycles (perfect, no wrong-path work).
  • At 99% accuracy: 120 cycles (20% overhead).
  • At 98% accuracy: 140 cycles (40% overhead).
  • At 95% accuracy: 200 cycles (100% overhead).

Even a 1% misprediction rate doubles execution time in this scenario. Accurate branch prediction is not an optimization — it is existential for high-performance processors.
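The cycle counts above follow from a simple model (a Python sketch; the function and parameter names are illustrative):

```python
# Reproduces the arithmetic above: a 5-wide machine executes 500
# instructions (100 branches) in 100 cycles if every prediction is
# right; each misprediction adds the 20-cycle resolution latency.
def total_cycles(n_instr=500, width=5, branches=100,
                 accuracy=1.0, mispredict_penalty=20):
    base = n_instr // width
    mispredictions = branches * (1 - accuracy)
    return base + mispredictions * mispredict_penalty

for acc in (1.00, 0.99, 0.98, 0.95):
    print(acc, round(total_cycles(accuracy=acc)))
```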

6.2 The Three Prediction Problems

A complete branch prediction system must answer three questions at the fetch stage, before the current instruction is even decoded:

  1. Is the fetched instruction a branch?
  2. If so, is it taken or not taken?
  3. If taken, what is the target address?

The Branch Target Buffer (BTB) addresses questions 1 and 3. It is a small, fast cache indexed by the PC of the branch instruction. Each entry stores the target address computed from the branch’s previous execution. If the PC hits in the BTB, the instruction is (almost certainly) a branch, and the cached target address is used to fetch the next instruction speculatively. If the PC misses, either it is not a branch or it is a branch seen for the first time; either way, fetch continues at PC+4.

Question 2 — direction prediction — requires separate mechanisms and is where the interesting engineering lies.

6.3 Static Branch Prediction

Static predictors make predictions at compile time, independently of run-time behavior.

Always not-taken: trivial to implement (no BTB needed), but achieves only 30–40% accuracy on conditional branches since most branches (especially loop-closing backward branches) are taken.

Always taken: better, at 60–70% accuracy. Backward branches (target address less than branch PC, typical of loops) are taken nearly 90% of the time.

BTFN (backward taken, forward not-taken): loop-closing (backward) branches predicted taken; if-then-else (forward) branches predicted not-taken. A simple heuristic that approximates the statistical distribution of branches in typical programs.

Profile-based prediction: the compiler runs the program on a representative input and records the direction each branch actually took. These statistics are encoded as hint bits in the branch instruction. Accurate when the profile input is representative of production workloads; fragile when it is not.

Program-analysis heuristics (Ball and Larus, PLDI 1993) derive predictions from code structure: a branch protecting a loop body is likely taken; a pointer comparison is likely unequal; BLEZ (branch if ≤ 0) is likely not taken because negative integers often encode error conditions. These heuristics achieve roughly 80% accuracy without profiling.

6.4 Dynamic Branch Prediction

Dynamic predictors observe the run-time behavior of branches and adapt their predictions accordingly. They have no representativeness problem and naturally adapt to input-dependent behavior.

Last-Time (1-bit) Predictor

The simplest dynamic predictor remembers, for each branch, how it went last time it executed. A single bit per branch encodes this history. The accuracy is approximately 85% for typical programs, much better than always-taken or always-not-taken. The failure mode is loops: the predictor always mispredicts the last iteration (predicting taken when the loop exits) and the first iteration of the next invocation (predicting not-taken before the loop is re-entered). For a loop with N iterations: accuracy = (N−2)/N.

A worse failure mode is alternating branches (TNTNTN…): accuracy degrades to 0%.

Two-Bit Saturating Counter

Adding hysteresis to the 1-bit predictor prevents a single exceptional outcome from changing the prediction. A 2-bit counter per branch has four states: strongly taken (11), weakly taken (10), weakly not-taken (01), strongly not-taken (00). The counter increments on a taken outcome and decrements on not-taken, saturating at the extremes. A prediction changes from taken to not-taken only after two consecutive not-taken outcomes. This achieves roughly 85–90% accuracy on typical programs and is the bimodal predictor — the foundation of all later designs.

Two-Bit Saturating Counter (2BC). A four-state counter that encodes a branch direction prediction with hysteresis. The states are strongly taken, weakly taken, weakly not-taken, and strongly not-taken. Two consecutive wrong outcomes are required to change the predicted direction. A table of 2BCs indexed by the branch PC constitutes the bimodal predictor.
Two-Bit Saturating Counter — State Machine

Two-bit saturating counter state machine. Green states predict Taken; red states predict Not-Taken. Hysteresis means one wrong outcome is not enough to change the prediction.
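A minimal sketch of the counter itself (Python; the state encoding follows the 00–11 scheme above):

```python
# A 2-bit saturating counter: states 0..3, where 0/1 predict not-taken
# and 2/3 predict taken. Increment on a taken outcome, decrement on
# not-taken, saturating at the extremes.
class TwoBitCounter:
    def __init__(self, state=3):          # start strongly taken (11)
        self.state = state

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# Loop behavior: from strongly-taken, one not-taken outcome (loop exit)
# only moves to weakly-taken, so the next loop entry is still predicted
# taken — the hysteresis the 1-bit predictor lacks.
c = TwoBitCounter()
c.update(False)            # loop exits once
print(c.predict())         # True: one wrong outcome does not flip it
c.update(False)            # a second consecutive not-taken does
print(c.predict())         # False
```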

Global History Predictors

The key insight motivating more powerful predictors is correlation: the outcome of a branch may correlate with the outcomes of other branches that recently executed, not just with the branch’s own last outcome.

Consider the code:

if (aa == 2) aa = 0;  // B1
if (bb == 2) bb = 0;  // B2
if (aa != bb) { ... } // B3

If both B1 and B2 are taken (meaning aa and bb are both reset to 0), then B3 is certainly not taken (aa == bb == 0). The outcome of B3 is fully determined by the outcomes of B1 and B2.

The two-level global history predictor (Yeh and Patt, MICRO 1991) captures global correlation. A Global History Register (GHR) records the directions (T or NT) of the most recent N branches, shifting in each new outcome. The GHR value indexes into a Pattern History Table (PHT) of 2-bit saturating counters. The 2BC entry at index GHR is updated after each branch resolution.

The gshare predictor (McFarling, DEC WRL 1993) improves on simple two-level global prediction by XOR-hashing the GHR with the branch PC before indexing the PHT. This adds per-branch context, reducing aliasing (two different branches with the same GHR value using the same PHT entry).

Two-Level Global (Gshare) Branch Predictor
Gshare predictor. The Global History Register (N last branch outcomes) is XOR-hashed with the branch PC to index the Pattern History Table. The selected 2-bit counter supplies the prediction.
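A hedged sketch of gshare in a few lines of Python (the history length, table size, and PC shift are illustrative choices):

```python
# Minimal gshare: the global history register is XORed with the branch
# PC to index a table of 2-bit saturating counters.
class Gshare:
    def __init__(self, history_bits=10):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                              # global history register
        self.pht = [2] * (1 << history_bits)      # init weakly taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2     # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask  # shift in outcome

# An alternating branch (T, NT, T, NT, ...) defeats a last-time
# predictor, but gshare learns it: the two history contexts index two
# different counters, each of which sees a consistent outcome.
g = Gshare()
for _ in range(100):                   # warm up the predictor
    for outcome in (True, False):
        g.update(0x400100, outcome)
hits = 0
for _ in range(10):                    # measure after warmup
    for outcome in (True, False):
        hits += (g.predict(0x400100) == outcome)
        g.update(0x400100, outcome)
print(hits)   # 20 of 20 once the two contexts are trained
```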

Local History Predictors

A complementary idea is local correlation: a branch’s outcome may correlate with its own past outcomes, not just with recent outcomes of other branches. A loop branch is taken repeatedly, then not-taken once, then taken repeatedly again. A local history register that records the past N outcomes of this specific branch can identify the pattern precisely.

The two-level local history predictor maintains a separate per-branch history register. The history register value indexes a PHT of 2-bit counters. For a loop with N iterations, the local history can perfectly predict the last (exiting) iteration after enough warmup.

Hybrid (Tournament) Predictors

No single predictor is best for all branches. Some branches exhibit global correlation; others exhibit local correlation; still others are simple bimodal. Hybrid predictors use multiple predictor tables and a meta-predictor (selector) to choose which prediction to use for each branch.

The Alpha 21264 uses a tournament predictor: a global predictor, a local predictor, and a 2-bit saturating counter selector for each branch. The selector tracks which predictor has been more accurate for this branch’s recent history and bets on that one. Hybrid predictors achieve 90–97% accuracy on typical workloads.
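The selector update can be sketched as follows (a minimal Python model; the convention that high counter values mean "trust the global predictor" is an illustrative choice):

```python
# Tournament meta-predictor sketch: a 2-bit saturating selector per
# branch trains toward whichever component predictor was right when the
# two disagree; agreement carries no signal either way.
def update_selector(sel, global_correct, local_correct):
    """sel in 0..3; values >= 2 mean 'use the global prediction'."""
    if global_correct and not local_correct:
        return min(3, sel + 1)        # global was right, local wrong
    if local_correct and not global_correct:
        return max(0, sel - 1)        # local was right, global wrong
    return sel                        # both right or both wrong

sel = 2                               # weakly favoring the global predictor
sel = update_selector(sel, global_correct=True, local_correct=False)
print(sel)   # 3: now strongly favoring the global predictor
```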

Neural and Perceptron Predictors

The perceptron branch predictor (Jimenez and Lin, HPCA 2001) treats direction prediction as a classification problem. Each branch has an associated perceptron — a vector of weights, one per GHR bit. The prediction is the sign of the dot product of the GHR (expressed as +1 for taken, −1 for not-taken) and the weight vector. On a misprediction, the weights are updated: the weight for GHR[i] is incremented if GHR[i] agreed with the correct outcome, decremented otherwise. Perceptrons can learn longer history patterns than 2-bit counters and achieve higher accuracy at the cost of slower prediction (a dot product computation).
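A minimal perceptron predictor in Python, following the update rule described above (simplified here to train only on mispredictions; the history length is an illustrative choice). For a linearly separable branch, such as one whose outcome equals a single history bit, the perceptron convergence theorem guarantees training reaches a mistake-free pass:

```python
from itertools import product

# Perceptron predictor sketch: history bits are encoded as +1 (taken)
# and -1 (not-taken), the prediction is the sign of the dot product,
# and on a misprediction the weights move toward the actual outcome.
class Perceptron:
    def __init__(self, history_len=8):
        self.w = [0] * (history_len + 1)       # w[0] is a bias weight

    def predict(self, hist):                   # hist: list of +1/-1
        y = self.w[0] + sum(w * h for w, h in zip(self.w[1:], hist))
        return y >= 0                          # True = predict taken

    def train(self, hist, taken):
        """Update on a misprediction; return True if one occurred."""
        if self.predict(hist) == taken:
            return False
        t = 1 if taken else -1
        self.w[0] += t
        for i, h in enumerate(hist):
            self.w[i + 1] += t * h             # strengthen agreeing bits
        return True

# Target: outcome equals history bit 3 — linearly separable, so the
# training loop below terminates with a clean pass over all patterns.
p = Perceptron()
patterns = [list(h) for h in product((1, -1), repeat=8)]
while any(p.train(h, taken=(h[3] == 1)) for h in patterns):
    pass                                       # repeat until no mistakes
print(all(p.predict(h) == (h[3] == 1) for h in patterns))  # True
```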

More recently, convolutional neural networks (BranchNet, MICRO 2020) have been applied to hard-to-predict branches, learning which subset of the global history is informative for a given branch. Light-weight binary neural network approximations (XNOR-Net) reduce the cost enough for practical hardware use.


Chapter 7: Parallelism — SIMD, Multicore, and Accelerators

7.1 Flynn’s Taxonomy

In 1966, Michael Flynn classified computer architectures by the number of simultaneous instruction streams and data streams:

  • SISD (single instruction, single data): the classic Von Neumann sequential processor.
  • SIMD (single instruction, multiple data): one instruction operates on multiple data elements. Includes array processors and vector processors.
  • MISD (multiple instructions, single data): systolic arrays and streaming processors are the closest real examples.
  • MIMD (multiple instructions, multiple data): multiprocessors and multithreaded processors.
Flynn's taxonomy classifies architectures by instruction stream count and data stream count. SIMD and MIMD are the dominant practical categories.

7.2 Vector and Array Processing

Vector and array processors exploit data-level parallelism — the observation that many programs apply the same operation to large collections of independent data elements.

An array processor applies a single instruction to all elements of a vector simultaneously, using multiple independent arithmetic units operating in parallel at the same time step.

A vector processor applies a single instruction to all elements of a vector sequentially, but uses a deeply pipelined functional unit. The CRAY-1 (1975) is the canonical example: it had eight 64-element vector registers (each element 64 bits), 16 memory banks for interleaved access, and deeply pipelined functional units for add, multiply, shift, and logic. A single VADD V2, V0, V1 instruction adds 64 pairs of 64-bit floating-point values through the pipeline, producing one result per cycle in steady state.

Vector code advantages:

  • No data dependencies within a vector (elements are independent), enabling fully-pipelined execution.
  • Each instruction specifies a large amount of work, reducing instruction fetch bandwidth.
  • Highly regular memory access patterns (stride-1 or constant-stride) enable efficient bank interleaving and prefetching.
  • Eliminates explicit loop control and branch overhead.

Memory bandwidth bottleneck: if arithmetic throughput exceeds memory bandwidth, the vector unit stalls waiting for operands. Interleaving memory across multiple banks is essential: with 16 banks and 11-cycle bank latency, a new bank can be accessed every cycle, sustaining one element per cycle throughput.

Amdahl’s Law limits vectorization benefits. If a fraction f of a program can be vectorized with speedup S:

\[ \text{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}} \]

As S → ∞, the maximum speedup approaches 1/(1−f). If 10% of the program is sequential and cannot be vectorized, the maximum possible speedup from infinite parallelism is only 10×.
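The formula can be evaluated directly (a short Python sketch):

```python
# Amdahl's Law as stated above: speedup = 1 / ((1 - f) + f/S),
# where f is the vectorizable fraction and S its speedup.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

print(round(amdahl(0.9, 10), 2))     # 5.26: vectorizing 90% with S = 10
print(round(amdahl(0.9, 1e12), 6))   # 10.0: the 1/(1-f) ceiling for f = 0.9
```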

Vector chaining further improves performance by forwarding the output of one vector operation directly to the input of the next, allowing the second operation to begin on the first element before the first operation has finished all elements. This is the vector equivalent of scalar data forwarding.

7.3 Multicore Processors

When a single processor core’s clock frequency could no longer be increased without melting the chip (the power wall), the industry turned to multicore: placing multiple complete processor cores on the same die, sharing the last-level cache and memory controller.

The multicore tradeoff against a single large core:

In favor of multicore:

  • Simpler cores are more power-efficient and easier to design and verify.
  • Multiple cores increase system throughput on multiprogrammed workloads (multiple independent programs running simultaneously).
  • Parallel applications can use all cores simultaneously.

Against multicore:

  • Requires explicitly parallel software (threads, processes) to utilize multiple cores — transparent performance improvement for a single-threaded program is minimal.
  • Shared hardware resources (last-level cache, memory bandwidth) create contention and can reduce per-thread performance compared to running alone.
  • Amdahl’s Law: even a small sequential fraction drastically limits parallel speedup.

Multicore Design Variants

Symmetric multicore (homogeneous): all cores are identical, running at the same voltage and frequency. Simple but unable to exploit the performance-per-watt differences between large and small cores.

Asymmetric multicore (heterogeneous): one large, fast core handles serial work; many small, efficient cores handle parallel work. ARM’s big.LITTLE and Apple’s Performance/Efficiency core design are real examples.

Dynamic multicore: an asymmetric design where large cores are powered off during parallel phases and small cores during serial phases.

Composed/fused multicore: small cores that can be logically merged into a single large core for serial execution, then split apart for parallel execution.

Research by Esmaeilzadeh et al. (Dark Silicon and the End of Multicore Scaling, ISCA 2011) showed that multicore scaling alone cannot sustain historical performance improvement rates — transistor density grows faster than power budgets allow all cores to be active. The “dark silicon” problem means that a large fraction of the chip’s transistors must remain powered off at any given time.

7.4 Accelerators: The TPU

As multicore scaling plateaus, domain-specific accelerators emerge: hardware designed to efficiently execute one class of computation, accepting poor performance or efficiency on everything else.

The Google Tensor Processing Unit (TPU) is the canonical modern accelerator. Its origin story is instructive: in 2013, Google estimated that if users spoke to Google’s voice-recognition system for just 3 minutes per day, serving the resulting computation on CPUs would have required doubling the company’s entire data center fleet. The solution was custom hardware for deep neural network (DNN) inference, with a development cycle of just 15 months from project start to deployment in data centers.

The TPU is built around a matrix multiply unit containing 65,536 (256×256) 8-bit multiply-accumulate units. At 700 MHz, this delivers a peak throughput of 92 trillion operations per second — more than 25× the MAC throughput of a GPU and more than 100× that of a CPU. The chip includes 28 MiB of on-chip activation memory (3.5× the on-chip memory of contemporary GPUs) and two DDR3 channels for weight storage.

The key hardware mechanism is systolic execution: data flows rhythmically through the matrix array, arriving at processing elements in a pipelined fashion from two directions and being multiplied and accumulated in transit, without being written to memory between operations. This eliminates the energy overhead of repeated SRAM reads that would otherwise dominate a naive matrix multiplication implementation.

The programmer’s view is deliberately simple: five CISC instructions (ReadHostMemory, WriteHostMemory, ReadWeights, MatrixMultiply/Convolve, Activate) each represent more than 10 clock cycles of computation. Control complexity is handled in software — no branches, in-order execution, software-controlled buffers and pipeline synchronization.

7.5 Systolic Arrays

The systolic array concept, introduced by H.T. Kung (1982), provides the theoretical foundation for accelerators like the TPU. Data flows from a memory (“heart”) rhythmically through an array of simple processing elements (“cells”), each performing a multiply-accumulate on the data passing through it. No element has its own memory; data is pipelined through the array from input to output.

Systolic arrays are optimal for computations with high arithmetic intensity (many operations per memory access) and regular, predictable data flow patterns — exactly the structure of convolution and matrix multiplication that appears throughout modern deep learning. The WARP computer at CMU (1984–1988) was an early programmable systolic array used for vision and robotics.

Systolic Array — 3×3 Matrix Multiply
A 3×3 systolic array for matrix–vector multiplication. Activations stream right through each row; weights trickle down through each column. Each PE accumulates a multiply–add without accessing any external memory — the key to the TPU's energy efficiency.
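With the pipelined timing abstracted away, the arithmetic the array performs reduces to one multiply-accumulate per PE, with partial sums flowing down each column (a Python sketch; the function name is illustrative):

```python
# Dataflow of the matrix-vector systolic array described above, timing
# abstracted away: activations a[r] stream through row r, weights
# W[r][c] sit on the PE at (r, c), and the partial sum trickles down
# column c. No intermediate value touches external memory.
def systolic_matvec(W, a):
    """Each column c accumulates sum_r a[r] * W[r][c]."""
    n = len(a)
    out = [0] * n
    for c in range(n):                 # each column of PEs
        partial = 0
        for r in range(n):             # partial sum flows down the column
            partial += a[r] * W[r][c]  # one multiply-accumulate per PE
        out[c] = partial
    return out

W = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(systolic_matvec(W, [1, 0, 1]))   # [8, 10, 12]
```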

Chapter 8: Memory Hierarchy and Caching

8.1 The Memory Hierarchy Problem

An idealized memory would have zero latency, infinite capacity, zero cost, and infinite bandwidth. Unfortunately, these properties oppose each other in every known technology. Larger memories take longer to access (more circuitry to decode addresses, longer wires). Faster memories — SRAM — are much more expensive and physically larger per bit than slower memories — DRAM. Disk is cheap and abundant but 100,000× slower than DRAM.

The memory hierarchy resolves this dilemma by providing multiple levels of storage, each faster and smaller than the one below:

Level        | Technology        | Typical Size    | Typical Latency
Registers    | SRAM (flip-flops) | 32–256 values   | < 1 ns
L1 Cache     | SRAM              | 32–64 KiB       | 1–4 ns
L2 Cache     | SRAM              | 256 KiB–4 MiB   | 5–15 ns
L3 Cache     | SRAM              | 4–64 MiB        | 20–40 ns
Main Memory  | DRAM              | 8–512 GiB       | 50–100 ns
Storage      | SSD/HDD           | TiB             | 0.1–10 ms

The hierarchy works because of locality: programs tend to reuse recently accessed data (temporal locality) and access data near recently-accessed addresses (spatial locality). If most accesses hit the fast, small top level, the average access time approaches the fast level’s latency even though most data resides in the slow bottom level.


The memory hierarchy. Capacity increases downward; speed and cost-per-byte increase upward. Locality makes the hierarchy effective: most accesses hit the fast L1 cache.

8.2 Cache Fundamentals

A cache is an automatically managed fast memory that stores recently accessed data blocks. When the processor requests an address, the hardware first checks the cache (a cache hit); if the data is not present (a cache miss), it is fetched from the next level of the hierarchy and placed in the cache.

Block size (also called cache line size) determines the unit of transfer. Modern caches typically use 64-byte blocks. When a miss occurs, the entire 64-byte block containing the requested address is fetched. This exploits spatial locality: nearby bytes will likely be needed soon and arrive “for free.”

Cache Organization

A cache is organized as S sets, each containing W ways (blocks). An address is split into three fields:

  • Block offset (log₂(block size) bits): identifies the byte within the block.
  • Index (log₂(S) bits): selects the set.
  • Tag (remaining bits): identifies which memory block occupies a way.
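For example, with 64-byte blocks and 16 sets, a 32-bit address decomposes as follows (a Python sketch; the constants match that geometry):

```python
# Address decomposition for a cache with 64-byte blocks (6 offset bits)
# and 16 sets (4 index bits); the tag is the remaining high-order bits.
BLOCK_BITS = 6      # log2(64)
INDEX_BITS = 4      # log2(16)

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), index, offset)   # 0x48d2a 15 13: set 15, byte 13 of the block
```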

A direct-mapped cache (W=1) maps each memory address to exactly one set. Simple, fast, but susceptible to conflict misses: two frequently-used addresses that map to the same set will evict each other repeatedly, even if most of the cache is empty.

A fully associative cache (S=1) has a single set containing all W ways. Any block can go anywhere. Conflict misses are eliminated, but the tag comparison must be done in parallel across all ways, requiring expensive hardware.

A set-associative cache (S sets, W ways) is the common compromise. Most modern L1 caches are 4-way or 8-way set-associative; L2 and L3 caches use 8–16 ways. Doubling associativity typically reduces miss rate by about the same amount as doubling cache size, at a fraction of the area cost.

4-Way Set-Associative Cache — Address Decomposition
Address decomposition and cache lookup for a 4-way set-associative cache with 16 sets and 64-byte blocks. The index selects one set; all four way tags are compared simultaneously; on a hit the matching way's data is returned.

Cache Misses: Three Categories

The three Cs framework classifies all cache misses:

Compulsory (cold) misses: every address is accessed for the first time at some point. These misses are irreducible regardless of cache size or associativity. Prefetching can reduce their cost by bringing blocks into the cache before they are needed.

Capacity misses: the program’s working set — the set of distinct memory blocks actively used within a time window — exceeds the cache capacity. Even a fully-associative cache of the given size would miss. The only remedy is a larger cache or a more compact data representation.

Conflict misses: multiple memory blocks that are frequently used together happen to map to the same cache set, evicting each other. A fully-associative cache of the same total size would not miss. Higher associativity reduces conflict misses.

Replacement Policies

When a cache miss occurs and the target set is full, a victim block must be chosen for eviction. The key policies:

LRU (Least Recently Used): evict the block in the set that was accessed least recently. This is the best practical approximation to Belady’s optimal (OPT) policy — evict the block whose next access is furthest in the future — because past recency is a reasonable predictor of future recency. True LRU requires log₂(W!) bits per set to track the full ordering, which is expensive for high associativity. Modern processors use approximate LRU: not-most-recently-used (NMRU) or hierarchical LRU with two bits per way.

Random: evict a randomly chosen block. Surprisingly competitive with LRU in practice, and dramatically better when the working set slightly exceeds the cache size (LRU thrashes in a cyclic pattern while random avoids the pathological case). For example, a 4-way cache with a 5-block cyclic working set A→B→C→D→E→A→…: LRU achieves 0% hit rate (each access evicts the block needed next), while random achieves roughly 20%.

FIFO (First In, First Out): evict the oldest block. Simple but generally worse than LRU.

Optimal (Belady’s OPT): evict the block whose next use is furthest in the future. Achieves the minimum possible miss rate for a given cache size and access stream. Requires future knowledge, so it is not implementable in hardware, but it provides a lower bound for evaluating practical policies.
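The LRU pathology described above can be reproduced in a few lines (a Python sketch using an OrderedDict as the recency ordering):

```python
from collections import OrderedDict

# A fully-associative cache of 4 blocks under the cyclic access stream
# A B C D E A B C D E ... — LRU evicts exactly the block that will be
# needed next, so every access misses.
def lru_hits(capacity, stream):
    cache = OrderedDict()
    hits = 0
    for b in stream:
        if b in cache:
            hits += 1
            cache.move_to_end(b)           # mark most-recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least-recently used
            cache[b] = True
    return hits

stream = list("ABCDE") * 100
print(lru_hits(4, stream))   # 0: LRU thrashes on the cyclic pattern
print(lru_hits(5, stream))   # 495: one more block, only 5 cold misses
```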

Write Policies

When the processor writes to a cached address, two decisions must be made:

Write-through vs. write-back: write-through updates both the cache and the next memory level immediately. Write-back marks the block as dirty and writes it to the next level only when it is evicted, potentially coalescing multiple writes into a single transfer. Write-back is far more bandwidth-efficient (a block written 100 times generates 100 writes with write-through but only 1 with write-back), at the cost of a dirty bit per block and the complexity of handling evictions of dirty blocks.

Write-allocate vs. no-write-allocate: on a write miss, write-allocate fetches the missing block into the cache, then updates it; no-write-allocate writes directly to the next level without fetching the block. Write-back caches almost always use write-allocate; write-through caches often use no-write-allocate.

8.3 Memory Technology: SRAM and DRAM

SRAM (Static RAM): each bit cell contains six transistors — two cross-coupled inverters with two access transistors. The feedback keeps the state without refresh. Fast (sub-nanosecond access), no refresh needed, and process-compatible with logic manufacturing. But six transistors per bit cell makes SRAM expensive and low-density. SRAM is used for register files, L1–L3 caches, and BTBs.

DRAM (Dynamic RAM): each bit cell contains one transistor and one capacitor. The capacitor’s charge state encodes a bit. DRAM is 6× denser than SRAM and much cheaper, but: (1) the capacitor leaks, requiring periodic refresh (reading and rewriting) every 64 ms; (2) reads are destructive (the charge dissipates when sensed, and must be restored by the sense amplifier); and (3) latency is 50–100 ns, far slower than SRAM.

8.4 Improving Cache Performance

The average memory access time (AMAT) governs effective performance:

\[ \text{AMAT} = \text{hit\_latency} + \text{miss\_rate} \times \text{miss\_penalty} \]

Improvements target each term.

Reducing miss rate:

  • Higher associativity reduces conflict misses.
  • Victim cache (Jouppi, ISCA 1990): a small fully-associative buffer between L1 and L2 that stores blocks evicted from L1. If L1 evicts block A and immediately needs it again (ping-ponging), the victim cache catches it.
  • Skewed-associative caches (Seznec, ISCA 1993): different ways use different hash functions for the index, redistributing conflicts across sets.
  • Software: loop interchange (access arrays in their storage order), blocking/tiling (divide a large working set into pieces that each fit in the cache).

Reducing miss penalty:

  • Multi-level caches: an L2 cache absorbs misses from L1 with much lower penalty than going to DRAM.
  • Critical-word-first: when a block is being fetched, deliver the requested word to the processor first before the rest of the block arrives.
  • Non-blocking (lockup-free) caches (Kroft, ISCA 1981): allow the processor to continue issuing memory requests while a miss is being serviced. Miss Status Handling Registers (MSHRs) track outstanding misses and merge requests to the same block. This enables memory-level parallelism (MLP) — multiple cache misses overlapping in time, amortizing their combined latency.

Reducing hit latency:

  • Smaller, lower-associativity L1 caches: a smaller cache is faster. L1 hit latency sits on a critical path in the pipeline; L1 caches typically have a 2–4 cycle latency so that they do not lengthen the clock period.
  • Parallel tag and data access: check the tag and begin reading the data simultaneously; cancel the data read if the tag misses.

8.5 Cache Hierarchy in Practice

Modern processors use split L1 caches (separate instruction and data) because the instruction fetch unit and the load/store unit access the cache in different pipeline stages; a unified L1 would need two ports on a single structure to serve both every cycle. L2 and L3 caches are unified (instructions and data share capacity) because the L1s filter most accesses, reducing interference between the instruction and data streams.

The multi-level cache hierarchy is analyzed with a recursive latency formula. If level-i has latency \(t_i\) and miss rate \(m_i\), then the effective access time from level i is:

\[ T_i = t_i + m_i \cdot T_{i+1} \]

For a Pentium 4 at 3.6 GHz with L1 (4 cycles ≈ 1.1 ns, 16 KiB), L2 (18 cycles = 5 ns, 1 MiB), and main memory (180 cycles = 50 ns): with an L1 miss rate of 5% and an L2 miss rate of 1%, the effective access time is approximately 1.1 + 0.05 × (5 + 0.01 × 50) ≈ 1.37 ns — not bad. But if the L1 miss rate is 10% and the L2 miss rate is 50%, the effective access time rises to 1.1 + 0.10 × (5 + 0.5 × 50) ≈ 4.1 ns, roughly a 3× slowdown from cache pressure alone.
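The recursion is easy to mechanize; a minimal sketch using the L1/L2/memory latencies above:

```python
# Recursive effective-access-time model: T_i = t_i + m_i * T_{i+1}.
def effective_latency(levels, memory_ns):
    """levels: list of (hit_latency_ns, miss_rate) pairs, fastest level first."""
    t = memory_ns                       # T at the last level is raw DRAM latency
    for lat, miss in reversed(levels):  # fold the hierarchy from the bottom up
        t = lat + miss * t
    return t

good = effective_latency([(1.1, 0.05), (5.0, 0.01)], 50.0)  # about 1.37 ns
bad  = effective_latency([(1.1, 0.10), (5.0, 0.50)], 50.0)  # about 4.1 ns
print(round(good, 2), round(bad, 2))
```

The same function handles any depth of hierarchy, which makes it handy for quick what-if comparisons of cache configurations.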


Chapter 9: Memory Systems and DRAM

9.1 DRAM Organization

A DRAM chip is organized as a hierarchy of increasing granularity:

Cell → Subarray → Bank → Chip → Rank (multiple chips on a DIMM) → Channel

Within a bank, cells are arranged in a 2D array of rows and columns. To read a cell, the wordline of its row is asserted, pulling charge from all cells in that row onto the corresponding bitlines. A row of sense amplifiers detects and amplifies these tiny charge differences, effectively restoring the row’s charge while reading it. This row-sized buffer — the row buffer — acts as a cache within the DRAM bank.

A DRAM access proceeds in three phases:

  1. Row Activate (RAS): assert the wordline, read the row into the row buffer. This destroys the cells’ charge, requiring restoration.
  2. Column Read/Write (CAS): select specific columns from the row buffer and transfer them to the chip output.
  3. Precharge: restore the bitlines for the next activate.

If two consecutive accesses go to the same row, the row buffer is already loaded — a row buffer hit — and only the CAS command is needed, reducing latency dramatically. If they go to different rows in the same bank, the old row must be precharged first — a row buffer miss — incurring the full RAS + CAS latency.

Modern DRAM latencies are characterized by timing parameters: tRCD (RAS-to-CAS delay, typically 13–14 ns), tCL (CAS latency, 13–14 ns), tRP (precharge time, 13–14 ns). A worst-case access (precharge + activate + read) takes roughly 40–50 ns.
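The three access cases reduce to a small latency model; the timing values below are assumed midpoints of the ranges given above:

```python
# Latency of one DRAM read given the bank's currently open row.
tRCD, tCL, tRP = 13.5, 13.5, 13.5   # assumed timing parameters, in ns

def access_latency(open_row, target_row):
    if open_row == target_row:      # row buffer hit: CAS only
        return tCL
    if open_row is None:            # bank precharged/idle: activate + CAS
        return tRCD + tCL
    return tRP + tRCD + tCL         # row conflict: precharge, activate, CAS

print(access_latency(7, 7))        # 13.5  (hit)
print(access_latency(None, 7))     # 27.0  (idle bank)
print(access_latency(3, 7))        # 40.5  (worst case)
```

The 3× gap between the hit and conflict cases is why controllers work so hard to schedule around the row buffer.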

[Figure: DRAM bank. A 2D cell array with sense amplifiers feeding an 8,192-bit (≈1 KB) row buffer; tRCD ≈ 13 ns separates activate from CAS, tCL ≈ 13 ns is the CAS latency, and a row buffer hit skips RAS entirely.]
DRAM bank internals. Asserting a row's wordline charges the bitlines, which the sense amplifiers detect and amplify into the row buffer. Subsequent column reads to the same row (row buffer hits) pay only the CAS latency — roughly 3× faster than a full activate from a new row.

9.2 DRAM Controller

The DRAM controller (now typically on-chip in modern processors) manages all interactions with DRAM. Its responsibilities:

  • Correctness: issue refresh commands so that every row is refreshed at least once per 64 ms, restoring charge in cells that are not otherwise accessed. A row left unrefreshed for longer than 64 ms leaks charge to an unrecoverable level.
  • Scheduling: reorder memory requests to maximize performance. FCFS (first come, first served) is simple but ignores row buffer status. FR-FCFS (first-ready, first-come-first-served) prioritizes row-buffer-hit requests over misses and, among requests of the same type, prioritizes older requests. FR-FCFS maximizes DRAM throughput but can starve requestors whose accesses rarely hit the row buffer.
  • Address mapping: distribute accesses across banks and channels to enable parallelism. Cache-block interleaving maps consecutive cache blocks to consecutive banks, enabling parallel access to consecutive addresses. XOR-based mapping randomizes bank assignment to reduce conflict probability.
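The FR-FCFS selection rule fits in a few lines; the request-tuple layout here is an assumption of this sketch, not a real controller interface:

```python
# FR-FCFS sketch: prefer row-buffer hits, breaking ties by age.
# A request is (arrival_order, bank, row); open_rows maps bank -> open row.
def fr_fcfs_pick(queue, open_rows):
    hits = [r for r in queue if open_rows.get(r[1]) == r[2]]
    pool = hits if hits else queue            # no hits: fall back to oldest
    return min(pool, key=lambda r: r[0])      # oldest request in the pool

queue = [(0, 0, 5), (1, 0, 9), (2, 1, 9)]
open_rows = {0: 9, 1: 2}
print(fr_fcfs_pick(queue, open_rows))   # (1, 0, 9): a younger request wins
                                        # because it hits bank 0's open row
```

Request (0, 0, 5) is older but would require a precharge; FR-FCFS serves the row-buffer hit first, which is exactly the throughput/fairness tension noted above.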

9.3 The DRAM Scaling Problem and RowHammer

DRAM reliability becomes a deep concern as feature sizes shrink. The most alarming manifestation is RowHammer (Kim et al., ISCA 2014): by repeatedly opening and closing a DRAM row (an “aggressor” row) thousands of times within a 64 ms refresh interval, the electromagnetic disturbance causes bit flips in adjacent “victim” rows.

The attack requires no special privileges — any process that can execute a tight memory access loop can cause RowHammer. The exploit has been demonstrated to escalate privileges on Linux, attack browsers via JavaScript (Rowhammer.js), and root Android devices. Over 85% of DRAM modules manufactured in 2012–2013 were found to be vulnerable.

RowHammer illustrates a fundamental tension in the memory hierarchy: optimizing for density and cost creates physical vulnerabilities that become security exploits. The fix — Target Row Refresh (TRR) and related mitigations — requires the DRAM controller to track row activation counts and refresh vulnerable neighbors, adding complexity and overhead.

[Figure: RowHammer. An aggressor row is activated and precharged ~100,000 times within one 64 ms refresh window, flipping capacitor bits (0→1 or 1→0) in the victim rows above and below it without those rows ever being accessed. Mitigation: TRR refreshes the neighbors after N activations.]
RowHammer. Rapidly toggling a DRAM aggressor row disturbs adjacent victim rows through capacitive/EM coupling, flipping bits without ever reading or writing the victim rows directly.

9.4 Multiple Banks and Channels

A single DRAM bank is a sequential bottleneck: each access requires an activate–read–precharge sequence. Multiple banks within a rank can service requests in parallel (as long as they go to different banks). Multiple channels each have independent buses and can service requests truly in parallel.

With N channels (each 64-bit bus), peak memory bandwidth scales linearly with N. Modern high-end servers use 8 DDR5 channels for aggregate bandwidth exceeding 300 GB/s.

The interaction between virtual-to-physical address mapping (managed by the OS) and DRAM bank/channel mapping is significant: the OS page allocator influences which bank a page lands in. Page coloring — allocating pages to specific colors (bank/row positions) — lets the OS minimize bank conflicts and inter-application interference.
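The two mapping schemes from Section 9.2 can be sketched as bit manipulations; the block size, bank count, and choice of which high bits to XOR in are all illustrative assumptions:

```python
# DRAM address-to-bank mapping sketches, assuming 64 B blocks and 8 banks.
BLOCK_BITS, BANK_BITS = 6, 3

def bank_interleaved(addr):
    """Cache-block interleaving: consecutive blocks map to consecutive banks."""
    return (addr >> BLOCK_BITS) & ((1 << BANK_BITS) - 1)

def bank_xor(addr, row_shift=16):
    """XOR-based mapping: fold (assumed) row bits in to break up bad strides."""
    return bank_interleaved(addr) ^ ((addr >> row_shift) & 0b111)

# Four consecutive 64 B blocks land in banks 0..3, so they can be
# activated in parallel rather than serialized behind one bank.
print([bank_interleaved(i * 64) for i in range(4)])   # [0, 1, 2, 3]
```

A power-of-two stride that always hits the same bank under plain interleaving gets scattered by the XOR variant, which is the point of the randomization.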


Chapter 10: New Memory Technologies

10.1 Why DRAM Alone Is Not Enough

Several trends are straining DRAM’s ability to serve as the sole main memory technology:

Scaling limits: DRAM stores charge on a capacitor that must be large enough for reliable sensing. Scaling transistors below 35–40 nm makes it increasingly difficult to build capacitors of adequate charge capacity. ITRS projections showed that DRAM density scaling would plateau before conventional logic technology.

Power consumption: IBM measured that roughly 50% of the energy in a server is spent in the off-chip memory hierarchy. DRAM consumes power even when idle (keeping charge in row buffers, periodic refresh) and has high active energy per access.

Capacity gap: the number of processor cores doubles approximately every 2 years; DRAM DIMM capacity doubles only every 3 years. The available memory per core is declining — a serious problem for data-intensive workloads.

10.2 Phase Change Memory

Phase Change Memory (PCM) stores data as the phase — crystalline or amorphous — of a chalcogenide glass material. Crystalline glass has low electrical resistivity (logical 1); amorphous glass has high resistivity (logical 0). Switching between phases requires brief high-current pulses:

  • SET (crystalline, “1”): heat the material above its crystallization temperature (~600°C) with a sustained pulse and allow it to cool slowly.
  • RESET (amorphous, “0”): heat the material above its melting point (~700°C) with a short pulse and quench rapidly.
  • Read: apply a small current and measure resistance without disturbing the phase.

PCM advantages over DRAM:

  • Better scalability: PCM’s switching mechanism depends on current pulse magnitude, which scales linearly with feature size. Prototyped at 20 nm; expected to reach 9 nm.
  • Higher density: multi-level cell PCM stores 2–4 bits per cell by exploiting the large resistance ratio between amorphous and crystalline states.
  • Non-volatile: no refresh, low idle power, data retained without power for over 10 years at 85°C.

PCM disadvantages:

  • Higher latency: read ~50 ns (4× DRAM), write ~150 ns (12× DRAM).
  • Lower endurance: the SET/RESET thermal cycling degrades the contacts. Typical endurance is ~10^8 writes per cell — 10 million times worse than DRAM, 1000× better than NAND Flash.
  • Higher write energy: PCM writes consume 2–50× more energy than DRAM writes.

Lee et al. (ISCA 2009) showed that naively replacing DRAM with PCM incurs 1.6× delay and 2.2× energy overhead, with an average lifetime of only 500 hours under typical workloads. With architectural optimizations — narrow row buffers to reduce destructive reads, cache-block-granularity writes to reduce unnecessary wear, wear-leveling to distribute writes uniformly — the system can achieve 1.2× delay, 1.0× energy, and 5.6-year average lifetime.

10.3 Hybrid Memory Systems and Non-Volatile Memory

A natural architecture is a hybrid of DRAM and PCM (or other NVM): DRAM serves as a fast volatile tier; PCM provides large-capacity non-volatile storage at lower cost. The controller (in hardware or OS) migrates hot pages to DRAM and cold pages to PCM.

The deeper opportunity is persistent memory (also called non-volatile main memory or NVM). If PCM or STT-MRAM can be accessed via load and store instructions — rather than the block I/O interface of disks — the traditional two-level storage model (volatile memory + block storage) collapses into a single persistent address space. Programs can manipulate persistent data structures with load and store, eliminating the operating system and file system overhead of translating between virtual memory and disk blocks.

Intel’s Optane DC Persistent Memory (based on 3D XPoint technology, a form of PCM-like material) offered exactly this capability, delivering byte-addressable, persistent memory that fits in DIMM slots. Experiments showed 5–24× performance improvement and 5–16× energy improvement for file-intensive workloads (PostMark) compared to traditional SSD-based storage.

The challenge of persistent memory is crash consistency: if a system fails mid-write (e.g., while updating a linked list — first linking the new node forward, then backward), the data structure may be left in an inconsistent state. Crash-consistent programming requires flush instructions (to ensure CPU caches are drained to NVM before declaring a transaction complete) and careful ordering of stores. This is a rich active research area.
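The ordering discipline can be illustrated with a toy persistence model; the `NVM` class, its method names, and the crash behavior are illustrative assumptions, not a real persistent-memory API:

```python
# Toy persistence model: stores land in a volatile "cache" and reach durable
# NVM only on flush; a crash discards everything not yet flushed.
class NVM:
    def __init__(self):
        self.durable, self.cache = {}, {}
    def store(self, key, val):  self.cache[key] = val
    def flush(self, key):       self.durable[key] = self.cache[key]
    def crash(self):            self.cache = dict(self.durable)

# Correct order: persist the node BEFORE publishing the pointer to it.
nvm = NVM()
nvm.store("node", "payload"); nvm.flush("node")
nvm.store("head", "node");    nvm.flush("head")
nvm.crash()
assert nvm.durable["head"] == "node" and nvm.durable["node"] == "payload"

# Wrong order: the pointer becomes durable first, and a crash before the
# node is flushed leaves a persistent pointer to garbage.
bad = NVM()
bad.store("head", "node"); bad.flush("head")
bad.store("node", "payload")        # crash strikes before this flush
bad.crash()
assert "node" not in bad.durable    # dangling persistent pointer
```

Real systems express the same discipline with cache-line flush instructions and store fences; the invariant is identical: data before pointer.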


Chapter 11: Processing-in-Memory and Hardware Security

11.1 The Data Movement Problem

Modern computing systems are overwhelmingly processor-centric: all meaningful computation happens in the CPU, and everything else (memory, storage, network) merely moves data to and from the processor. This design made sense when processors were the bottleneck. Today, the bottleneck is data movement.

The numbers are stark: a 64-bit addition consumes roughly 0.1 pJ; accessing DRAM consumes ~1,000× more energy per operation. In mobile systems, 62.7% of total system energy is spent on data movement, not computation. Profiling Google’s data center workloads (Kanev et al., ISCA 2015) showed that large fractions of CPU time are spent in memory access code, not arithmetic.

The response is a paradigm shift toward processing-in-memory (PIM): moving computation to where the data lives rather than moving data to where the computation is.

11.2 PIM Approaches

Minimally modified DRAM: Seshadri et al. (MICRO 2017) showed that existing DRAM sense amplifiers can perform bulk bitwise AND, OR, and NOT operations by simultaneously activating multiple rows in a DRAM array. Activating three rows at once (two operand rows plus a control row) causes charge sharing on each bitline, which settles to the bitwise majority of the three values; a control row of all zeros specializes the majority into AND, and a control row of all ones yields OR. This provides in-DRAM bulk bitwise operations with 30–60× performance and energy improvement over conventional CPU approaches, requiring only minor changes to the DRAM chip and controller.
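A toy model of the charge-sharing computation, assuming the triple-row-activation scheme of Ambit-style designs (real hardware operates on entire multi-kilobyte rows in parallel, not Python lists):

```python
# Triple-row activation computes a bitwise majority on each bitline; fixing
# the third (control) row to all-0s or all-1s yields AND or OR respectively.
def maj(a, b, c):
    return 1 if a + b + c >= 2 else 0       # bitline settles to majority charge

def bulk_and(row_a, row_b):
    return [maj(a, b, 0) for a, b in zip(row_a, row_b)]   # control row = 0s

def bulk_or(row_a, row_b):
    return [maj(a, b, 1) for a, b in zip(row_a, row_b)]   # control row = 1s

print(bulk_and([1, 1, 0, 0], [1, 0, 1, 0]))   # [1, 0, 0, 0]
print(bulk_or([1, 1, 0, 0], [1, 0, 1, 0]))    # [1, 1, 1, 0]
```

Because every bitline computes simultaneously, one activation performs a row-wide (e.g., 8 KB-wide) bitwise operation, which is where the 30–60× bulk throughput comes from.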

3D-stacked memory with logic: emerging memory technologies stack multiple DRAM layers on top of a logic die using vertical interconnects (Through-Silicon Vias, TSVs). The logic die can implement in-memory processors with access to the stacked DRAM at much higher bandwidth than any off-chip bus (8 TB/s for HBM vs. ~50 GB/s for DDR5). Tesseract (ISCA 2015) demonstrated a PIM architecture for graph processing that achieved 10× speedup over a conventional CPU system on workloads like PageRank.

PIM faces significant adoption barriers: programming models must expose near-memory computation; the operating system’s virtual memory system must be extended; cache coherence between main processor caches and PIM units must be maintained. These system-level challenges are active research problems.

11.3 Side-Channel Attacks

The final theme of this course is hardware security — specifically, how architectural features that were designed for performance can be exploited to leak secret information.

A side channel is an information channel that arises from the physical implementation of a system, not from its logical specification. Rather than breaking the mathematical encryption algorithm, a side-channel attack exploits measurable physical quantities — execution time, power consumption, electromagnetic radiation, or memory access patterns — to infer secrets.

Side Channel Attack. An attack that exploits information gained from the physical implementation of a computer system, including timing, power consumption, electromagnetic radiation, or memory access patterns, rather than weaknesses in the implemented algorithm itself. Side-channel attacks are non-invasive, stealthy, and extremely difficult to eliminate because they exploit fundamental hardware behaviors.

Cache Side Channels

Caches are shared between programs (and even between security domains) in modern systems. The cache hit/miss status of a memory access leaks information about which addresses have been recently accessed, which may reveal secrets.

Flush+Reload (Yarom and Falkner, USENIX Security 2014): the attacker and victim share a memory-mapped file (e.g., a shared library). The attack proceeds in three steps:

  1. The attacker uses the x86 CLFLUSH instruction to evict a shared memory line from the cache.
  2. The victim executes, possibly accessing the evicted line (or not, depending on the secret computation).
  3. The attacker accesses the same memory line and times the access. A fast access (cache hit) reveals that the victim accessed that line; a slow access (cache miss) reveals that it did not.

Applied to RSA decryption: the RSA square-and-multiply algorithm accesses different code paths (the square-reduce path vs. the square-reduce-multiply-reduce path) depending on bits of the secret exponent. Probing the cache residency of code functions in the RSA implementation after each victim computation reveals the secret exponent bit by bit.
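The attacker's decision logic reduces to a threshold test on the reload time. A simulation sketch (the cycle constants and the `multiply_fn` label are illustrative, and the cache is modeled as a simple set):

```python
# Flush+Reload as a one-bit oracle per probe, simulated end to end.
HIT_CYCLES, MISS_CYCLES, THRESHOLD = 4, 200, 100

def victim(secret_bit, cache):
    if secret_bit:                    # secret-dependent access to the shared line
        cache.add("multiply_fn")

def probe(cache):
    latency = HIT_CYCLES if "multiply_fn" in cache else MISS_CYCLES
    return 1 if latency < THRESHOLD else 0   # fast reload => victim touched it

recovered = []
for bit in [1, 0, 1, 1, 0]:           # secret exponent bits, one per iteration
    cache = set()                     # step 1: CLFLUSH evicts the shared line
    victim(bit, cache)                # step 2: victim executes
    recovered.append(probe(cache))    # step 3: attacker reloads and times
print(recovered)                      # [1, 0, 1, 1, 0]
```

The real attack is noisier (prefetchers, context switches), so each bit is typically probed several times and majority-voted, but the core logic is this threshold test.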

[Figure: Flush+Reload. ① The attacker flushes a monitored line of shared memory (e.g., an RSA square() or multiply() function in a shared library) with CLFLUSH. ② The victim executes, loading the line only if the secret bit is 1. ③ The attacker reloads the line and times it: a ~4-cycle hit means the bit was 1, a ~200-cycle miss means 0. Repeated per key bit, this recovers the full RSA exponent in a few hundred measurements, and it works across VMs sharing deduplicated physical pages.]
Flush+Reload in three steps. The attacker's ability to distinguish a ~4-cycle cache hit from a ~200-cycle DRAM miss turns the cache into a one-bit oracle per probe, leaking one bit of the victim's secret computation per iteration.

Prime+Probe requires no shared memory. The attacker fills one or more cache sets with its own data (primes the cache). After the victim executes, the attacker measures access times to its own data. Cache lines that were evicted (slow access) reveal that the victim accessed addresses mapping to those sets — again leaking information about the victim’s computation.

DRAM Side Channels

DRAM row buffers function as caches within DRAM banks. DRAMA (Pessl et al., USENIX Security 2016) demonstrated that row buffer hit/miss timing differences create a side channel exploitable across CPU sockets. By timing accesses to DRAM addresses that map to the same bank but different rows, an attacker can detect whether a victim accessed that bank (causing row buffer eviction, slower access) — creating a covert channel with bandwidth up to 2.1 Mbps on a desktop system.

Power Side Channels

Processor power consumption varies with the data being processed. Differential Power Analysis (DPA) exploits this: by collecting power traces during many executions of a cryptographic algorithm (e.g., AES) with different plaintexts but the same key, statistical analysis can isolate the contribution of specific key bits to the power trace.

AES performs a substitution box (S-box) lookup in each of 10 rounds. The least significant bit of the S-box output is data-dependent (it depends on both the plaintext and the key). By repeatedly computing AES(plaintexts, key_guess) for all 256 possible guesses for one byte of the key and correlating the LSB of the S-box output with the measured power, the correct key byte produces a strong correlation while incorrect guesses produce noise. This process, repeated for each key byte, recovers the full AES key.
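The statistical core of DPA fits in a short simulation. Everything below is synthetic: a shuffled toy S-box stands in for the real AES S-box, and "power" is modeled as the leaked LSB plus Gaussian noise:

```python
import random

random.seed(0)
sbox = list(range(256))
random.shuffle(sbox)          # hypothetical S-box, not the AES S-box
key = 0x3C                    # the secret key byte to recover

# Collect traces: each trace leaks the LSB of sbox[pt ^ key] plus noise.
traces = []
for _ in range(2000):
    pt = random.randrange(256)
    power = (sbox[pt ^ key] & 1) + random.gauss(0, 0.5)
    traces.append((pt, power))

def difference_of_means(guess):
    """Partition traces by the predicted S-box LSB under this key guess."""
    ones = [p for pt, p in traces if sbox[pt ^ guess] & 1]
    zeros = [p for pt, p in traces if not sbox[pt ^ guess] & 1]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

# The correct guess predicts the leak, so its partitions differ by ~1.0;
# wrong guesses partition the traces nearly at random, leaving only noise.
best = max(range(256), key=difference_of_means)
print(hex(best))
```

Repeating this per key byte recovers the full key, which is why unmasked hardware implementations of AES fall to DPA with a few thousand traces.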

Architectural Implications

Side-channel attacks reveal a fundamental tension in computer architecture: the optimizations that make processors fast — caching, speculative execution, branch prediction — all create observable timing variations that can be exploited. The most dramatic recent examples are Spectre and Meltdown (2018), which exploit CPU speculative execution to read arbitrary memory across security boundaries. These attacks demonstrated that microarchitectural state (which instructions executed speculatively, what data was loaded into the cache) can be observed through timing measurements, violating the assumption that programmer-invisible state cannot affect security.

The lesson for computer architects is that correctness is not enough. A processor can behave exactly as its ISA specifies and still leak secrets through timing. Designing secure hardware requires reasoning about physical observability, not just logical behavior — a challenge that is reshaping how processors are designed.


Chapter 12: Microarchitecture Simulation

12.1 Why Simulate?

Designing a new microarchitecture directly in hardware is enormously expensive: a 16 nm chip costs roughly $310M in non-recurring engineering costs and takes years to fabricate and test. Architects need a way to evaluate design choices — cache associativity, pipeline depth, branch predictor algorithms, memory scheduling policies — before committing to silicon.

Microarchitectural simulation provides a software model of a processor that predicts performance (IPC, cache miss rates, DRAM bandwidth utilization, etc.) without building hardware. The tradeoff is between accuracy and simulation speed.

12.2 Simulation Approaches

Cycle-accurate simulation models the processor at clock-cycle granularity, tracking the state of every pipeline stage, cache set, and queue on every cycle. It provides high accuracy but is slow: cycle-accurate simulators typically run at 1–10 MIPS (million simulated instructions per second), three to four orders of magnitude slower than real hardware.

Trace-based simulation decouples the simulated CPU from the simulated memory system. A trace of memory addresses is collected from a real or simulated run, then replayed through the cache/memory simulator. This is much faster but less accurate (the CPU model is simplified, and interactions between the CPU and memory system are not captured).

FPGA-accelerated simulation (FireSim, Karandikar et al.) uses field-programmable gate arrays to implement RTL descriptions of processors in hardware, running at MHz rather than kHz speeds. FireSim demonstrated simulation of a 1024-node RISC-V data center at 3.4 MHz (13 billion simulated instructions per second), enabling full Linux boot and realistic application benchmarks. The key innovations are: using an open ISA (RISC-V) and open-source SoC implementations so the RTL is available, using the public cloud (AWS EC2 F1 instances) for FPGA access, and automating the process of mapping simulation to FPGA resources.

12.3 gem5

gem5 (Binkert et al., 2011) is the dominant open-source microarchitectural simulator, used in roughly 70% of recent computer architecture research. It is a discrete-event simulator: events (instruction fetches, cache misses, DRAM responses) are placed in a priority queue ordered by simulation time (measured in ticks, typically 1 ps per tick). The simulation engine dequeues the head event, executes it (which may enqueue new events), and repeats.
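The engine's core loop is a few lines around a priority queue. This sketch mimics the structure of a discrete-event simulator, not gem5's actual classes:

```python
import heapq

# Minimal discrete-event engine: a priority queue of (tick, seq, event)
# entries, where executing an event may schedule further events.
class Sim:
    def __init__(self):
        self.queue, self.now, self.seq = [], 0, 0
    def schedule(self, tick, fn):
        heapq.heappush(self.queue, (tick, self.seq, fn))  # seq breaks tick ties
        self.seq += 1
    def run(self):
        while self.queue:
            self.now, _, fn = heapq.heappop(self.queue)
            fn(self)                       # the event may enqueue new events

log = []
def fetch(sim):
    log.append(("fetch", sim.now))
    sim.schedule(sim.now + 1000, miss)     # at 1 ps/tick, 1000 ticks = 1 ns
def miss(sim):
    log.append(("cache_miss", sim.now))

sim = Sim()
sim.schedule(0, fetch)
sim.run()
print(log)   # [('fetch', 0), ('cache_miss', 1000)]
```

Time advances only when an event fires, so idle spans cost nothing to simulate, which is why discrete-event simulation scales to long-running workloads.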

gem5 is written in Python (configuration) and C++ (simulation core). A modular standard library allows components — processor models (in-order, out-of-order), cache hierarchies (no cache, private L1/L2, MESI two-level coherent), and memory systems (DDR3, DDR4) — to be swapped with minimal code changes.

gem5 supports two simulation modes: Syscall Emulation (SE) mode, which simulates only the processor and memory and forwards system calls to the host OS; and Full System (FS) mode, which simulates a complete system including an OS, enabling boot of Linux and execution of unmodified binaries. SE mode is faster; FS mode is more accurate for workloads with significant OS interaction.

An important caveat: gem5’s accuracy has not been thoroughly validated against real hardware for most configurations. Relative comparisons (design A vs. design B) are more meaningful than absolute cycle count predictions. Using a simulator without understanding its limitations can lead to misleading conclusions.


These notes draw substantially on the lecture slides of Sihang Liu (CS 450/650, University of Waterloo, Winter 2026), which are in turn adapted from courses at UVa (Dr. Khan) and CMU (Dr. Mutlu). External sources that enriched the presentation include Patterson and Hennessy’s Computer Organization and Design (5th ed.), Onur Mutlu’s 18-447 lecture notes at Carnegie Mellon, and primary research papers cited throughout.
