CS 251: Computer Organization and Design
Rosina Kharal
Chapter 1: Computer Abstractions and Performance
Classes of Computers
Computers come in three broad classes, each with distinct design priorities. Desktop computers are general-purpose machines designed to balance cost against performance, running a wide variety of software. Server computers are network-connected systems optimized for high capacity, throughput, and reliability — they range from small rack-mounted units to warehouse-scale data centers. Embedded computers are hidden as components inside larger systems (automobiles, appliances, industrial controllers), where stringent constraints on power, performance, and cost dominate the design.
Despite these differences, all three classes share the same fundamental components: a processor (CPU) with a datapath and control unit, memory (cache, main memory, secondary storage), and input/output mechanisms. The hardware/software interface is mediated through the Instruction Set Architecture (ISA), which defines the set of instructions the processor can execute, the registers, memory addressing modes, and data types. The ISA is the contract between software and hardware: any implementation that correctly executes the ISA will run any program written for it.
Measuring Performance
Performance is not a single number — it depends on what you measure. Response time (also called execution time or latency) is how long it takes to complete a single task. Throughput is the total amount of work done per unit time (tasks per second, transactions per hour). Replacing a processor with a faster version improves both response time and throughput; adding more processors (in a parallel system) primarily improves throughput.
We define performance as the inverse of execution time:
\[\text{Performance} = \frac{1}{\text{Execution Time}}\]
Saying “machine X is \(n\) times faster than machine Y” means:
\[\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n\]
For example, if program P takes 10 seconds on machine A and 15 seconds on machine B, then A is \(15/10 = 1.5\) times faster than B.
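The ratio can be checked in a couple of lines (the helper name is ours, not from the text):

```python
# "n times faster" as a ratio of execution times, per the formula above.
def speedup(time_y, time_x):
    return time_y / time_x

assert speedup(15, 10) == 1.5   # A (10 s) is 1.5x faster than B (15 s)
```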
CPU Time
Elapsed time includes everything: processing, I/O, OS overhead, and idle time. CPU time isolates just the time the processor spends working on a given program, comprising user CPU time (executing the program’s own instructions) and system CPU time (OS calls made on behalf of the program).
The operation of digital hardware is governed by a constant-rate clock. The clock period (cycle time) is the duration of one clock cycle — for example, 250 ps = 0.25 ns. The clock frequency (clock rate) is its reciprocal: a 250 ps period corresponds to a 4.0 GHz clock rate. CPU time is given by:
\[\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}\]
Performance improves by reducing the number of clock cycles or by increasing the clock rate — though hardware designers must often trade one against the other, since a faster clock may require more cycles per instruction.
Instruction Count and CPI
The number of clock cycles depends on how many instructions the program executes and how many cycles each instruction takes:
\[\text{Clock Cycles} = \text{Instruction Count} \times \text{CPI}\]
where CPI (Cycles Per Instruction) is the average number of clock cycles per instruction. Combining with the CPU time formula:
\[\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}\]
The instruction count is determined by the program, the ISA, and the compiler. CPI is determined by the CPU hardware. If different instruction classes have different cycle costs, the overall CPI is a weighted average:
\[\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\]
The Performance Equation
The complete performance equation ties together all three factors:
\[\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}\]
Each factor is influenced by different aspects of the system: the algorithm and compiler affect instruction count (and possibly CPI); the programming language affects both instruction count and CPI; the ISA affects instruction count, CPI, and clock cycle time; and the hardware implementation determines CPI and clock cycle time.
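A worked example of the full equation with a weighted CPI; the instruction mix, per-class CPIs, and the 2 GHz clock are illustrative assumptions, not values from the text:

```python
# CPU time from the three-factor equation with a weighted CPI.
# All numbers below are assumed for illustration.
counts = {"A": 50_000_000, "B": 30_000_000, "C": 20_000_000}  # instruction mix
cpis = {"A": 1, "B": 2, "C": 3}                               # cycles per class

total = sum(counts.values())
weighted_cpi = sum(cpis[c] * counts[c] / total for c in counts)  # 1.7

def cpu_time(instr_count, cpi, clock_rate_hz):
    return instr_count * cpi / clock_rate_hz

t = cpu_time(total, weighted_cpi, 2e9)   # seconds at 2 GHz
```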
Pitfall: MIPS as a Performance Metric
MIPS (Millions of Instructions Per Second) is sometimes used as a performance metric:
\[\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}\]
However, MIPS is misleading because it does not account for differences in ISAs between computers (a CISC instruction may do far more work than a RISC instruction), nor for differences in the complexity of instructions within the same ISA. CPI varies between programs on a given CPU, so a single MIPS number cannot characterize a machine’s performance across workloads.
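A small numeric sketch of the pitfall, with all figures assumed for illustration: the machine with the higher MIPS rating takes longer to run the program.

```python
# A higher MIPS rating can accompany worse performance.
clock = 4e9                     # both machines at 4 GHz (assumed)

ic1, cpi1 = 8e9, 1.0            # machine 1: simple ISA, many instructions
ic2, cpi2 = 2e9, 2.0            # machine 2: richer ISA, far fewer instructions

mips1 = clock / (cpi1 * 1e6)    # 4000 MIPS
mips2 = clock / (cpi2 * 1e6)    # 2000 MIPS
time1 = ic1 * cpi1 / clock      # 2.0 s
time2 = ic2 * cpi2 / clock      # 1.0 s
# Machine 1 wins on MIPS yet takes twice as long to finish the program.
```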
Chapter 2: Digital Logic and Data Representation
Digital Logic Design
Modern computers are built from billions of transistors organized into logic gates, which in turn form the functional units of a processor. Understanding how logic gates combine to represent Boolean functions is the foundation of all computer hardware design.
A truth table is a tabular method of specifying a Boolean function, listing every possible combination of inputs alongside the corresponding output. For a function with \(n\) inputs, the truth table has \(2^n\) rows. The Sum of Products (SOP) form extracts a Boolean expression directly from the truth table by identifying every row where the output is 1, writing a product (AND) term for each such row, and ORing those terms together. For example, if a three-input function produces a 1 at rows where \(A=1, B=0, C=1\) and \(A=1, B=1, C=0\), the SOP expression is \(A\bar{B}C + AB\bar{C}\). SOP can always be simplified using Boolean algebra identities.
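The SOP extraction can be mechanized; a minimal sketch in Python (function and variable names are ours), using the example function above:

```python
# Mechanical sum-of-products extraction from a truth table.
from itertools import product

def sop(f, names=("A", "B", "C")):
    """One AND term per truth-table row where f is 1, ORed together."""
    terms = []
    for bits in product((0, 1), repeat=len(names)):
        if f(*bits):
            terms.append("".join(n if b else n + "'" for n, b in zip(names, bits)))
    return " + ".join(terms)

# The example above: output 1 at (A,B,C) = (1,0,1) and (1,1,0).
f = lambda a, b, c: (a and not b and c) or (a and b and not c)
assert sop(f) == "AB'C + ABC'"
```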
These identities allow any expression built from AND, OR, and NOT to be rewritten using only NAND gates or only NOR gates, which is valuable because NAND and NOR are each universal — any Boolean function can be implemented with just one gate type.
For example, XOR written in AND/OR/NOT form is
\[A \oplus B = A\bar{B} + \bar{A}B\]
Transistors and CMOS
In CMOS (Complementary Metal-Oxide-Semiconductor) technology, logic gates are built from two types of transistors: NMOS and PMOS. An NMOS transistor acts as a switch that closes (conducts) when its gate input is HIGH (1); a PMOS transistor closes when its gate input is LOW (0). In both cases, they are open (non-conducting) otherwise.
A CMOS gate consists of a pull-up network of PMOS transistors connected between the supply voltage \(V_{DD}\) and the output, and a pull-down network of NMOS transistors connected between the output and ground. When the input combination makes the pull-up network conduct and the pull-down network not conduct, the output is driven HIGH; when the pull-down network conducts and the pull-up does not, the output is driven LOW. Exactly one network should be on at any given time, ensuring that the output is never floating and that there is no direct path from \(V_{DD}\) to ground.
For a CMOS NOT gate (inverter): a single PMOS transistor (gate connected to input \(A\)) is connected from \(V_{DD}\) to the output, and a single NMOS transistor (gate also connected to \(A\)) is connected from the output to ground. When \(A = 0\), the PMOS conducts (output = 1) and the NMOS is off; when \(A = 1\), the NMOS conducts (output = 0) and the PMOS is off.
CMOS inverter — PMOS pull-up and NMOS pull-down, complementary behavior.
A CMOS NAND gate uses two PMOS transistors in parallel (pull-up) and two NMOS transistors in series (pull-down). The output is 0 only when both inputs are 1 (both NMOS transistors conduct), implementing \(\overline{A \cdot B}\). A CMOS NOR gate uses two PMOS in series and two NMOS in parallel.
Sequential Circuits
Combinational circuits produce outputs that depend only on current inputs. Sequential circuits add memory: their output depends on both current inputs and past history (stored state). Sequential circuits use feedback loops to hold state.
SR Latch
The simplest memory element is the SR latch (Set-Reset latch), built from two cross-coupled NOR gates. The inputs are \(S\) (Set) and \(R\) (Reset); the outputs are \(Q\) and \(\bar{Q}\). When \(S=1, R=0\), the latch is set (\(Q=1\)). When \(S=0, R=1\), it is reset (\(Q=0\)). When \(S=0, R=0\), the latch holds its current state. The combination \(S=1, R=1\) is forbidden because it drives both outputs to 0, causing undefined behavior when both inputs return to 0.
D Latch
The SR latch’s forbidden state is eliminated by the D latch, which has a single data input \(D\) and an enable input. When the enable is HIGH, the latch is transparent — \(Q\) follows \(D\). When the enable is LOW, the latch holds its current value. The D latch is level-triggered.
D Flip-Flop (Master-Slave)
D flip-flop state diagram — Q captures D on rising clock edge, holds until next edge.
The D flip-flop (edge-triggered register) consists of two D latches connected in series. The first (master) is enabled on the LOW phase of the clock and the second (slave) is enabled on the HIGH phase. The result is that the flip-flop samples its \(D\) input on the rising edge of the clock and holds the value for the entire clock period. This edge-triggered behavior is critical: in synchronous digital design, all state elements are clocked flip-flops, ensuring that all state changes happen at discrete clock edges and that combinational logic has a full clock period to stabilize.
Finite State Machines
A Finite State Machine (FSM) is a sequential circuit that transitions between a finite set of states based on the current state and input values. In a Moore machine, outputs depend only on the current state (not directly on inputs). The design process involves:
- Draw a state diagram identifying all states and labeled transitions.
- Encode states as binary values.
- Construct the next-state table (current state + inputs → next state).
- Construct the output table (current state → outputs).
- Derive Boolean expressions for next-state logic and output logic from these tables.
- Implement with D flip-flops (one per state bit) and combinational logic.
A classic example is a traffic-light controller. States encode which lights are green/red, and transitions occur on a timer signal. The output (light colors) depends on the current state only.
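A minimal sketch of such a controller as a Moore machine in Python; the state names, the dictionary encoding, and the single timer input are our assumptions:

```python
# Two-state Moore traffic-light controller (illustrative, not from the text).
NEXT = {  # (current state, timer) -> next state
    ("NS_GREEN", 0): "NS_GREEN", ("NS_GREEN", 1): "EW_GREEN",
    ("EW_GREEN", 0): "EW_GREEN", ("EW_GREEN", 1): "NS_GREEN",
}
OUTPUT = {  # Moore: output is a function of the current state only
    "NS_GREEN": ("green", "red"),   # (north-south light, east-west light)
    "EW_GREEN": ("red", "green"),
}

def step(state, timer):
    return NEXT[(state, timer)], OUTPUT[state]

state = "NS_GREEN"
for timer in (0, 1, 0, 1):
    state, lights = step(state, timer)
# After the two timer pulses, the controller is back in NS_GREEN.
```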
Combinational Building Blocks
Decoders
An \(n\)-to-\(2^n\) decoder takes \(n\) input bits and activates exactly one of \(2^n\) output lines — the one corresponding to the binary value of the input. All other outputs are 0. For example, a 2-to-4 decoder has inputs \(A_1, A_0\) and four outputs \(D_0, D_1, D_2, D_3\), where \(D_i = 1\) iff the binary value of \(A_1 A_0\) equals \(i\).
Decoders are used in register files to select which register to write, in memory to select which row/word to access, and in many address-decoding applications.
Multiplexors
A \(2^n\)-to-1 multiplexor (MUX) selects one of \(2^n\) data inputs to forward to its output, based on an \(n\)-bit select signal. For a 2-to-1 MUX with inputs \(D_0, D_1\) and select \(S\): \(\text{Out} = \bar{S} \cdot D_0 + S \cdot D_1\). Internally, a MUX is built from AND gates (one per input, gated by the appropriate decoded select combination) and an OR gate combining all AND outputs.
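The 2-to-1 MUX equation can be checked exhaustively:

```python
# Out = S'.D0 + S.D1, verified over every input combination.
def mux2(d0, d1, s):
    return ((1 - s) & d0) | (s & d1)

for d0 in (0, 1):
    for d1 in (0, 1):
        assert mux2(d0, d1, 0) == d0   # S = 0 selects D0
        assert mux2(d0, d1, 1) == d1   # S = 1 selects D1
```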
MUXes appear throughout the datapath as key decision points — for example, choosing between a register value and an immediate in the ALU input, or choosing between the ALU result and memory data for write-back.
Register Files
A register file is a small, fast array of registers. In MIPS, the register file holds 32 registers of 32 bits each. It has write ports and read ports:
- Write: A decoder takes the 5-bit destination register number and enables the write enable of exactly that register’s D flip-flops. The 32-bit write data is presented to all registers, but only the selected one latches it on the clock edge.
- Read: A MUX per read port selects the output of one of the 32 registers based on a 5-bit register number. Reading is combinational (no clock edge needed).
Karnaugh Maps
Boolean expressions derived from truth tables are often not minimal — they may contain redundant terms that translate into unnecessary gates. Karnaugh maps (K-maps) provide a graphical technique for simplifying Boolean equations, working well for functions with up to four variables. The method exploits the algebraic identity \(PA + P\bar{A} = P\): when two minterms differ in exactly one variable, that variable can be eliminated.
A K-map is a grid where each cell corresponds to one row of the truth table. Adjacent cells differ in exactly one variable — this adjacency is arranged using Gray code ordering so that physically neighboring cells always differ by a single bit. For a three-variable function \(f(A,B,C)\), the K-map is a \(2 \times 4\) grid; for four variables \(f(A,B,C,D)\), it is a \(4 \times 4\) grid.
The simplification procedure is:
- Fill each cell with the function’s output value (0 or 1) for that input combination.
- Circle groups of adjacent 1s. Each group must be a rectangular block whose size is a power of 2 in each dimension (1, 2, or 4 cells per side).
- Every 1 must be included in at least one circle.
- Use the fewest, largest circles possible — larger circles eliminate more variables.
- Circles may wrap around the edges of the K-map (the left edge is adjacent to the right edge, and the top to the bottom).
- For each circle, write a product term that includes only the variables that have the same value throughout the circle. Variables that appear in both complemented and uncomplemented form within the circle are eliminated.
- OR the product terms together to obtain the simplified expression.
Don’t-care conditions (marked X in the truth table) represent input combinations that cannot occur or whose output does not matter. In the K-map, don’t-cares may be circled if doing so enlarges a group and simplifies the expression, but they need not be circled — they are used opportunistically.
Timing and Glitches
Real circuits do not respond instantaneously to input changes — signals take time to propagate through gates. The propagation delay \(t_{pd}\) is the maximum delay from any input change to the corresponding output stabilizing. The contamination delay \(t_{cd}\) is the minimum delay from an input change to the output beginning to change. These may differ because of asymmetric rising and falling edges, different input-to-output paths, or temperature effects.
The critical path through a circuit is the longest chain of gates from input to output, determining the overall propagation delay. The short path is the shortest such chain, determining the contamination delay.
A glitch occurs when a single input change causes the output to temporarily assume an incorrect value before settling to the correct value. Glitches arise when different signal paths through the circuit have different delays, causing intermediate inconsistency. For example, in the circuit for \(Y = AB + \bar{B}C\), if \(A=1\), \(C=1\), and \(B\) transitions from 1 to 0, the term \(AB\) goes low before \(\bar{B}C\) goes high — creating a momentary dip in \(Y\) (which should remain 1 throughout). Glitches can be fixed by adding a consensus term — an extra gate that covers the transition region. In this example, adding the term \(AC\) eliminates the glitch.
Memory: SRAM and DRAM
SRAM
SRAM (Static RAM) stores each bit in a cross-coupled inverter pair (a 6-transistor cell), which holds its value as long as power is supplied. SRAM is fast (1–10 ns access time) and does not require refresh, but it is expensive per bit and has a large cell size. SRAM is used for processor caches (L1, L2, L3).
An SRAM array is organized as rows and columns. Two-level decoding splits the address into row and column parts: the row decoder (using a standard \(n\)-to-\(2^n\) decoder) activates an entire row of cells, placing their contents on bit lines. Then a column MUX selects the specific column(s) needed. This two-level scheme reduces the hardware complexity compared to a flat fully decoded array.
Three-state (tri-state) buffers allow multiple SRAM outputs to share a common data bus: each buffer either drives its output HIGH or LOW, or presents a high-impedance (disconnected) state when not selected, preventing bus conflicts.
DRAM
DRAM (Dynamic RAM) stores each bit as a charge on a capacitor, requiring only 1 transistor per cell. This makes DRAM far denser and cheaper than SRAM. However, capacitors leak charge, so DRAM must be refreshed periodically (typically every 64 ms) by reading and rewriting each row. DRAM access times are slower than SRAM (tens to hundreds of nanoseconds). DRAM is used for main memory.
Data Representation
Integer Representation
Characters are stored using ASCII encoding, where each character maps to a 7-bit integer (e.g., ‘A’ = 65, ‘a’ = 97). Unsigned integers in \(n\) bits represent values \(0\) to \(2^n - 1\). For a 32-bit unsigned integer, the range is \(0\) to \(4{,}294{,}967{,}295\).
Signed-magnitude representation uses the most-significant bit as a sign bit (0 = positive, 1 = negative) and the remaining bits as the magnitude. This representation has two encodings of zero (\(+0\) and \(-0\)) and requires special-case hardware for arithmetic.
Two’s complement representation fixes both problems by giving the most-significant bit a negative weight: an \(n\)-bit value \(x\) represents
\[x = -x_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} x_i \cdot 2^i\]
For 32-bit two’s complement, the range is \(-2^{31}\) to \(2^{31}-1\), i.e., \(-2{,}147{,}483{,}648\) to \(2{,}147{,}483{,}647\). Key properties: there is exactly one encoding of zero; negation is performed by inverting all bits and adding 1; addition/subtraction hardware is the same for signed and unsigned numbers (overflow must be detected separately).
Sign extension converts an \(n\)-bit two’s complement value to a wider representation by replicating the sign bit into all new upper bits. For example, extending the 8-bit value \(1111\ 1010_2\) (which represents \(-6\)) to 16 bits yields \(1111\ 1111\ 1111\ 1010_2\), still \(-6\).
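Both operations are easy to mimic on raw bit patterns; a sketch using the example value above (helper names are ours):

```python
# Two's complement interpretation and sign extension on bit patterns.
def from_twos_complement(bits, n):
    """Read an n-bit pattern (held in a Python int) as a signed value."""
    return bits - (1 << n) if bits & (1 << (n - 1)) else bits

def sign_extend(bits, n, m):
    """Widen an n-bit pattern to m bits by replicating the sign bit."""
    if bits & (1 << (n - 1)):
        bits |= ((1 << (m - n)) - 1) << n
    return bits

x = 0b11111010                                    # the 8-bit example above
assert from_twos_complement(x, 8) == -6
assert sign_extend(x, 8, 16) == 0b1111111111111010
assert from_twos_complement(sign_extend(x, 8, 16), 16) == -6   # still -6
```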
IEEE 754 Floating-Point
IEEE 754 is the standard for representing floating-point numbers in binary. A single-precision (32-bit) number has three fields:
- Sign bit \(s\): 1 bit. \(0\) for positive, \(1\) for negative.
- Biased exponent \(e\): 8 bits, stored with a bias of 127. The actual exponent is \(e - 127\).
- Significand (mantissa) \(f\): 23 bits, representing the fractional part after an implied leading 1.
The implied leading 1 means the significand effectively has 24 bits of precision. The biased exponent scheme allows floating-point numbers to be compared as integers. Special cases:
- Exponent = 0, fraction = 0: the value is \(\pm 0\).
- Exponent = 255, fraction = 0: \(\pm\infty\).
- Exponent = 255, fraction \(\neq 0\): NaN (Not a Number).
- Exponent = 0, fraction \(\neq 0\): denormalized number, with value \((-1)^s \times 0.f \times 2^{-126}\), enabling gradual underflow.
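The three fields can be pulled out of a Python float for inspection (the helper name is ours):

```python
# Split a Python float into the three IEEE 754 single-precision fields.
import struct

def fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF       # biased exponent
    frac = bits & 0x7FFFFF          # 23-bit fraction
    return sign, exp, frac

# -0.75 = (-1)^1 x 1.1_2 x 2^-1: sign 1, biased exponent 126 (126 - 127 = -1),
# fraction 0b100...0 (the .5 that follows the implied leading 1).
assert fields(-0.75) == (1, 126, 1 << 22)
assert fields(1.0) == (0, 127, 0)   # implied leading 1, zero fraction
```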
Floating-point addition algorithm: (1) Compare exponents; shift the significand of the smaller number right by the difference to align binary points. (2) Add the significands. (3) Normalize the result (shift so there is a single 1 before the binary point), adjusting the exponent. (4) Round to fit the 23-bit significand field using guard and round bits.
Binary Arithmetic
Full Adder and Ripple-Carry Adder
A full adder produces a sum bit and a carry-out from two input bits and a carry-in:
\[S = A \oplus B \oplus C_{in}\]
\[C_{out} = AB + C_{in}(A \oplus B)\]
A ripple-carry adder chains \(n\) full adders together, with the carry-out of each stage feeding the carry-in of the next. A 32-bit ripple-carry adder has a critical path of 32 carry propagations. Carry-lookahead adders compute carries in parallel, reducing delay to \(O(\log n)\) — not covered in depth in this course but noted as an important optimization.
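A direct translation of the full-adder equations, chained into a ripple-carry adder:

```python
# Full adder per the equations above, chained into an n-bit ripple-carry adder.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n=32):
    """Add two n-bit values bit by bit; the carry ripples through n stages."""
    result, carry = 0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry   # final carry-out

assert full_adder(1, 1, 1) == (1, 1)
assert ripple_carry_add(13, 7, 8) == (20, 0)
```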
ALU Design
Subtraction \(A - B\) is implemented as \(A + (-B) = A + \bar{B} + 1\) (two’s complement negation). An ALU performs the operation selected by a control signal:
| ALU control | Function |
|---|---|
| 0000 | AND |
| 0001 | OR |
| 0010 | add |
| 0110 | subtract |
| 0111 | set-on-less-than (slt) |
| 1100 | NOR |
The slt (set-on-less-than) operation computes \(A - B\) and examines the sign bit of the result: if negative, the output is 1; otherwise 0.
Logical shift operations (sll, srl) move bits left or right, filling vacated positions with zeros. These are implemented by barrel shifters. In MIPS, shifts are used for multiplication/division by powers of 2 and for bit manipulation.
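A sketch of the ALU following the control table above; Python ints are masked to 32 bits to mimic hardware wraparound, and the slt case uses the sign-bit method just described:

```python
# 32-bit ALU sketch matching the control table above (illustrative).
MASK = 0xFFFFFFFF

def alu(ctrl, a, b):
    if ctrl == 0b0000:
        res = a & b
    elif ctrl == 0b0001:
        res = a | b
    elif ctrl == 0b0010:
        res = (a + b) & MASK
    elif ctrl == 0b0110:
        res = (a + (~b & MASK) + 1) & MASK          # A + B' + 1
    elif ctrl == 0b0111:                             # slt: sign bit of A - B
        res = ((a + (~b & MASK) + 1) & MASK) >> 31
    elif ctrl == 0b1100:
        res = ~(a | b) & MASK
    else:
        raise ValueError("unknown ALU control")
    return res, int(res == 0)                        # result and Zero flag

assert alu(0b0111, 3, 5) == (1, 0)   # 3 < 5 -> 1
assert alu(0b0110, 7, 7)[1] == 1     # Zero flag detects equality for beq
```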
Chapter 3: The MIPS Instruction Set Architecture
Overview
MIPS (Microprocessor without Interlocked Pipeline Stages) is a classic RISC (Reduced Instruction Set Computer) architecture used extensively in embedded systems and as an educational platform. It is a load-store architecture: arithmetic and logical operations work only on registers, and only lw and sw instructions access memory.
Key properties of MIPS:
- 32 general-purpose registers, each 32 bits wide.
- Fixed 32-bit instruction length.
- Three instruction formats: R-format, I-format, J-format.
- Memory is byte-addressed; words are 4 bytes and must be word-aligned.
Registers
MIPS has 32 registers named $0 through $31, with software conventions for their use:
| Register | Name | Use |
|---|---|---|
| $0 | $zero | Always 0; writes ignored |
| $1 | $at | Assembler temporary |
| $2–$3 | $v0–$v1 | Function return values |
| $4–$7 | $a0–$a3 | Function arguments |
| $8–$15 | $t0–$t7 | Temporaries (caller-saved) |
| $16–$23 | $s0–$s7 | Saved temporaries (callee-saved) |
| $28 | $gp | Global pointer |
| $29 | $sp | Stack pointer |
| $30 | $fp | Frame pointer |
| $31 | $ra | Return address |
Instruction Formats
R-format (Register): used for arithmetic and logical operations that take two register sources and write to a register destination.
| op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6) |
The op field is 0 for all R-format instructions; the funct field distinguishes them. rs and rt are source registers; rd is the destination. shamt is the shift amount (nonzero only for shift instructions).
I-format (Immediate): used for loads, stores, branches, and immediate arithmetic.
| op (6) | rs (5) | rt (5) | immediate (16) |
The 16-bit immediate is sign-extended to 32 bits. For lw/sw, the effective address is \(\text{rs} + \text{sign\_extend(immediate)}\). For beq/bne, the branch target is \(\text{PC}+4 + 4 \times \text{sign\_extend(immediate)}\) (PC-relative, word-offset).
J-format (Jump): used for unconditional jump.
| op (6) | target (26) |
The jump target address is formed by taking the upper 4 bits of \(\text{PC}+4\), concatenating the 26-bit target field, and appending 00 (word-aligned). This gives a 32-bit address, allowing jumps to any address within the same 256 MB region.
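The target construction takes only a few lines; the PC and target-field values in the checks are hypothetical:

```python
# Pseudo-direct jump target: upper 4 bits of PC+4, the 26-bit field, then "00".
def jump_target(pc, target26):
    return ((pc + 4) & 0xF0000000) | (target26 << 2)

# Hypothetical values chosen for illustration.
assert jump_target(0x00400000, 0x000100) == 0x00000400
assert jump_target(0x10000000, 0x000001) == 0x10000004
```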
Key Instructions
The most important MIPS instructions covered in this course:
- add rd, rs, rt — rd = rs + rt (R-format; traps on overflow)
- sub rd, rs, rt — rd = rs - rt
- addi rt, rs, imm — rt = rs + sign_extend(imm)
- and, or, nor — bitwise operations (R-format)
- slt rd, rs, rt — rd = (rs < rt) ? 1 : 0
- slti rt, rs, imm — immediate version of slt
- lw rt, offset(rs) — load word from memory
- sw rt, offset(rs) — store word to memory
- beq rs, rt, offset — branch if equal
- bne rs, rt, offset — branch if not equal
- j target — unconditional jump
- jal target — jump and link (stores return address in $ra)
- jr rs — jump to address in register
- sll rd, rt, shamt — shift left logical
Addressing Modes
MIPS uses several addressing modes — ways an instruction specifies the location of an operand:
- Register addressing: the operand is in a named register (add $t0, $t1, $t2).
- Immediate addressing: a constant embedded in the instruction itself (addi $t0, $t1, 4).
- Base (displacement) addressing: effective address = base register + sign-extended offset (lw $t0, 8($sp)).
- PC-relative addressing: branch target = PC + 4 + 4 × offset (beq).
- Pseudo-direct addressing: jump target constructed from upper PC bits + 26-bit field + 00 (j).
Chapter 4: The Single-Cycle Datapath
Overview
The single-cycle datapath executes each instruction in exactly one clock cycle. The clock period must be long enough for the slowest instruction to complete. All functional units operate in parallel, and the result is registered at the end of the cycle. While simple, this is inefficient because fast instructions must wait for the cycle time set by the slowest instruction.
Single-cycle MIPS datapath — simplified flow from PC through functional units.
Functional Units
The datapath contains these major units:
- Program Counter (PC): a 32-bit register holding the address of the current instruction.
- Instruction memory: a read-only memory that outputs the instruction at the address given by PC.
- Register file: 32×32-bit registers with two read ports and one write port.
- Sign extender: extends the 16-bit immediate to 32 bits.
- ALU: performs arithmetic and logical operations; outputs a 32-bit result and a Zero flag.
- Data memory: read/write memory for lw/sw; controlled by MemRead and MemWrite.
- Adders: one adds PC + 4 (for sequential execution); another computes the branch target.
- Shift-left-2: multiplies the sign-extended branch offset by 4 (because branch targets are word-aligned).
Control Signals
The main control unit decodes the 6-bit opcode field and asserts the appropriate control signals:
| Signal | Purpose |
|---|---|
RegDst | 0 = write to rt, 1 = write to rd |
ALUSrc | 0 = second ALU operand is register rt, 1 = sign-extended immediate |
MemToReg | 0 = write ALU result to register, 1 = write memory data to register |
RegWrite | 1 = write to destination register |
MemRead | 1 = read data memory |
MemWrite | 1 = write data memory |
Branch | 1 = enable branch logic |
ALUOp | 2-bit code directing the ALU control unit |
The ALU control unit takes ALUOp and the 6-bit funct field (for R-format) to produce the 4-bit ALU control signal.
For R-format (ALUOp = 10): the funct field determines the operation (add, subtract, AND, OR, slt).
For lw/sw (ALUOp = 00): ALU always adds (to compute effective address).
For beq (ALUOp = 01): ALU subtracts (Zero flag used to detect equality).
Branch Logic
For beq, the branch target is computed in parallel with the ALU comparison. The branch is taken if the ALU’s Zero output is 1 AND the Branch control signal is 1. The PCSrc multiplexor selects between PC+4 (not taken) and the branch target (taken).
Instruction Timing
In the single-cycle design, instruction completion times are determined by the critical path through each instruction’s data flow:
| Instruction | Critical path | Time |
|---|---|---|
| beq | IM → RegFile → ALU | 350 ps |
| R-format | IM → RegFile → ALU → RegFile | 400 ps |
| sw | IM → RegFile → ALU → DM | 550 ps |
| lw | IM → RegFile → ALU → DM → RegFile | 600 ps |
Since the clock period is fixed at 600 ps (set by lw, the slowest), all instructions take 600 ps regardless of their actual need. This wastes 50 ps for sw, 200 ps for R-format, and 250 ps for beq instructions.
Modifying the Datapath
The single-cycle datapath can be extended to support additional instructions:
Jump (j): Add a second MUX after the branch MUX to select between the branch target and the jump target. The jump target is formed by shifting the 26-bit field left by 2 and concatenating with the upper 4 bits of PC+4. A new Jump control signal is added.
Branch-not-equal (bne): Add an inverter on the Zero signal, and OR it with the BNE control signal to generate PCSrc for bne. Alternatively, add a separate MUX level. Only one new control line is needed.
Set register (SREG) and relative jump (JREL): Each new instruction type requires examining what new paths are needed in the datapath and adding MUXes and control lines accordingly, following the same systematic approach.
Chapter 5: The Multicycle and Pipelined Datapath
The Multicycle Datapath
In the multicycle datapath, each instruction is broken into multiple shorter steps, each taking one clock cycle. A shorter clock period can be used because each stage does less work. Functional units (ALU, adders) can be shared across stages. The five stages are:
- IF (Instruction Fetch): Fetch the instruction from instruction memory; PC ← PC + 4.
- ID (Instruction Decode / Register Fetch): Decode the instruction; read the two source registers; sign-extend the immediate.
- EX (Execute): ALU performs the operation (arithmetic, address computation, or comparison).
- MEM (Memory Access): For lw/sw, access data memory. For branches, the branch decision is made here (if branching is placed in the MEM stage). For other instructions, no memory operation.
- WB (Write Back): Write result back to the register file.
5-stage pipeline — IF/ID/EX/MEM/WB with inter-stage registers holding forwarded values.
Intermediate pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) are added between stages to hold values as they pass through the pipeline. Each register stores all the information the later stages need — instruction fields, computed values, control signals.
With a clock period of 200 ps per stage, a lw instruction takes 5 cycles = 1000 ps — worse than the single-cycle design for a single instruction. The benefit emerges only when instructions overlap (pipelining).
Pipelining
Pipelining exploits instruction-level parallelism by overlapping the execution of multiple instructions. While instruction \(i\) is in the EX stage, instruction \(i+1\) is in ID and instruction \(i+2\) is in IF. In the ideal steady-state case (no hazards), the throughput is one instruction completed per clock cycle — a CPI (Cycles Per Instruction) of 1.
The key insight is that pipelining improves throughput (instructions per second), not latency (time to complete a single instruction). A pipelined lw still takes 5 clock cycles; but in steady state, one lw completes every cycle.
Control signals generated in the ID stage must be carried forward through the pipeline registers alongside the data, so that each instruction’s signals arrive at the right stage at the right time. The MEM/WB register carries WB-stage signals; the EX/MEM register carries EX/MEM/WB signals; and so on.
Pipeline Startup Time
For a program of \(N\) instructions flowing through the 5-stage pipeline:
\[\text{Total clock cycles} = N + 4\]
The +4 accounts for pipeline fill time: 4 cycles pass at the start before the first instruction completes. This is sometimes called the pipeline startup overhead. In most problems, if the pipeline is described as “already running,” this startup cost is omitted.
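The formula and its consequence for speedup can be checked numerically, using the 200 ps stage time and 600 ps single-cycle period from the text:

```python
# Total time for N instructions: pipelined (N + 4 cycles at 200 ps) versus
# single-cycle (N cycles at 600 ps), using figures from the text.
def pipelined_time_ps(n, stage_ps=200, stages=5):
    return (n + stages - 1) * stage_ps   # N + 4 cycles for a 5-stage pipe

def single_cycle_time_ps(n, cycle_ps=600):
    return n * cycle_ps

n = 1_000_000
speedup = single_cycle_time_ps(n) / pipelined_time_ps(n)
# As n grows, the fill overhead vanishes and speedup approaches 600/200 = 3.
```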
Chapter 6: Pipeline Hazards
A hazard is any condition that prevents the next instruction from executing in the next clock cycle, forcing the pipeline to stall or take corrective action. There are three types: structural, data, and control.
Structural Hazards
A structural hazard occurs when two instructions need the same hardware resource at the same time. In the basic pipelined MIPS datapath, the key structural hazard is resolved by using separate instruction and data memories — if a single memory were used, an lw in the MEM stage and the fetch in the IF stage would conflict. With separate I-cache and D-cache, this structural hazard is eliminated.
Data Hazards
A data hazard occurs when an instruction depends on the result of a preceding instruction that has not yet written its result to the register file.
Forwarding (Bypassing)
The most important solution to data hazards is forwarding (also called bypassing): rather than waiting for the result to be written to the register file, it is forwarded directly from wherever it is in the pipeline to where it is needed. Two forwarding paths exist:
EX hazard: The instruction immediately following an arithmetic instruction needs the result. The result is available at the end of the EX stage (in the EX/MEM pipeline register). It can be forwarded directly to the ALU input of the next instruction (which is now in EX). The forwarding unit sets ForwardA = 10 (or ForwardB = 10) when:
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd ≠ 0
EX/MEM.RegisterRd = ID/EX.RegisterRs (for ForwardA)
MEM hazard: The instruction two behind needs the result. The result is available in the MEM/WB register. It is forwarded to the ALU input of the instruction now in EX. The forwarding unit sets ForwardA = 01 when:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd ≠ 0
MEM/WB.RegisterRd = ID/EX.RegisterRs
(and no EX hazard takes priority)
The forwarding unit compares register numbers from the pipeline registers and generates multiplexor select signals (ForwardA, ForwardB) that choose among: the normal register file output (00), the MEM/WB forwarded value (01), and the EX/MEM forwarded value (10).
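The forwarding conditions above can be sketched as a small Python function (a software model only; the real forwarding unit is combinational hardware, and the field names below are illustrative):

```python
def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs):
    """Return the ForwardA mux select for the first ALU input."""
    # EX hazard takes priority: the EX/MEM value is the most recent result.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10   # select the EX/MEM forwarded value
    # MEM hazard: result sits in MEM/WB and no EX hazard shadows it.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01   # select the MEM/WB forwarded value
    return 0b00       # normal register file output

# add $1,$2,$3 immediately followed by add $4,$1,$5: EX hazard on $1
print(forward_a(True, 1, False, 0, 1))   # 2 (binary 10)
```

The same function with `id_ex_rt` in place of `id_ex_rs` gives ForwardB; note how the `!= 0` test prevents forwarding a "result" destined for `$zero`.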
Load-Use Hazard
Forwarding cannot fully solve the load-use hazard: when a lw instruction is immediately followed by an instruction that uses the loaded value, the data is not available until the end of the MEM stage of the lw — which is too late to forward to the EX stage of the following instruction.
The only solution is to stall the pipeline for one clock cycle (insert a “bubble” or NOP). The hazard detection unit compares ID/EX.MemRead with the register numbers in the IF/ID register:
if (ID/EX.MemRead = 1) and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
then stall
When a stall is detected: the PC is prevented from advancing (the same instruction is re-fetched); the IF/ID register is preserved (the same instruction is re-decoded); and a NOP is inserted into the ID/EX register by zeroing all control signals. This effectively inserts one dead cycle, giving lw time to complete the MEM stage before its result is needed.
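The hazard detection condition can be modeled directly (field names illustrative, as above):

```python
def need_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID uses the register a lw in EX is loading."""
    return bool(id_ex_memread and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))

# lw $2, 0($1) followed by add $3, $2, $4: the add reads $2 -> stall
print(need_stall(True, 2, 2, 4))    # True
# lw followed by an independent instruction: no stall
print(need_stall(True, 2, 5, 6))    # False
# The preceding instruction is not a load: never a load-use stall
print(need_stall(False, 2, 2, 4))   # False
```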
The load-use hazard can be avoided by code rearrangement: if the compiler can find an independent instruction to place between the lw and its use, no stall is needed. Code rearrangement is always preferred over hardware stalls when possible. However, the load-use hazard is still classified as a data hazard even when stalling hardware is present.
Control Hazards
A control hazard (or branch hazard) occurs because the pipeline fetches instructions sequentially, but a branch may redirect control to a different address. By the time the branch outcome is known, several instructions following the branch may have already entered the pipeline.
Branch in MEM Stage
If the branch decision is made in the MEM stage (4th stage), three instructions following the branch have already been fetched and partially executed. These are called errant instructions. Two solutions:
- Branch flushing: Zero out the control bits in the IF/ID, ID/EX, and EX/MEM pipeline registers, effectively replacing the errant instructions with NOPs. The branch penalty is 3 clock cycles.
- Code rearrangement / NOPs: The compiler inserts three NOPs (or moves three independent instructions) after every branch. No hardware flushing needed, but code size increases.
Branch in ID Stage
If additional hardware computes the branch target and evaluates the branch condition in the ID stage (2nd stage), only one errant instruction has been fetched. The branch penalty drops to 1 cycle.
Solutions:
- Flushing: Zero the IF/ID register to kill the one errant instruction.
- NOP / code rearrangement: The compiler places one NOP or one independent instruction in the branch delay slot.
The branch delay slot is the instruction immediately following a branch, which always executes regardless of whether the branch is taken. The compiler tries to fill it with a useful instruction; if none is available, a NOP is inserted.
Branch Data Hazard
When the branch instruction in the ID stage needs a register value that is being produced by the immediately preceding instruction, a branch data hazard exists. With forwarding that only targets the EX stage, this cannot be resolved by forwarding alone for a branch in ID.
One stall is inserted between the producing instruction and the branch. Code rearrangement is the preferred solution. For example:
# Hazardous:
112 addi $1, $1, -1
116 bne  $1, $0, -5     # needs $1 immediately: one stall required
# Solution via code rearrangement: move an independent instruction
# into the slot between the producer and the branch
112 addi $1, $1, -1
116 addi $4, $4, 4      # moved here, fills the stall slot
120 bne  $1, $0, -6     # now safe (offset adjusted for the extra instruction)
Pipeline Performance Analysis
\[\text{Total CC} = (\text{instruction count}) + (\text{load-use stalls}) + (\text{branch penalties}) + 4\]The +4 is the pipeline startup cost. Branch penalties are 1 cycle per taken branch (with branching in ID and flushing), and load-use stalls are 1 cycle per load-use hazard.
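As a numeric sketch of the cycle-count formula (the instruction and stall counts below are made up for illustration):

```python
def total_cycles(instr_count, load_use_stalls, taken_branches, startup=4):
    """Total clock cycles: one per instruction, plus stalls, plus pipeline fill."""
    return instr_count + load_use_stalls + taken_branches + startup

# e.g. 100 instructions, 5 load-use stalls, 10 taken branches (1 cc penalty each)
cc = total_cycles(100, 5, 10)
print(cc)        # 119
print(cc / 100)  # effective CPI including startup: 1.19
```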
\[\text{CPI} = 1 + (\text{load stall rate}) + (\text{branch stall rate})\]
Loop Unrolling (FYI)
Loop unrolling is a compiler technique that reduces branch overhead by replicating the loop body multiple times and reducing the number of branch instructions. For example, a loop that processes one element per iteration can be unrolled 4× to process four elements per iteration, reducing the number of branch checks by a factor of 4. This also exposes more independent instructions for the compiler to schedule to fill hazard slots.
Exceptions and Interrupts (FYI)
Exceptions (internal, e.g., arithmetic overflow, undefined instruction) and interrupts (external, e.g., I/O device signals) require the processor to transfer control to an exception handler. In MIPS, the Cause register records the type of exception, and the EPC (Exception Program Counter) saves the address of the offending instruction. The processor branches to a fixed exception handler address.
Vectored interrupts provide separate handler addresses for each interrupt type, avoiding a software dispatch table in the handler.
In the pipelined processor, when an exception occurs: (1) the current instruction and all subsequent instructions that have entered the pipeline are flushed (zeroed out); (2) the EPC is saved; (3) the Cause register is set; (4) the PC is set to the handler address. The flushing hardware sets all control signals to zero in the affected pipeline registers, effectively converting them to NOPs.
Chapter 7: Memory Hierarchy and Caches
The Memory Hierarchy
All modern computers face a fundamental tension: faster memories are more expensive per bit and smaller, while larger memories are slower and cheaper. The solution is the memory hierarchy — a series of storage levels with different speed/size/cost tradeoffs, arranged so that the most frequently accessed data is kept in the fastest, smallest levels closest to the processor.
A typical hierarchy from fastest to slowest:
- Registers — within the CPU, 32 × 32 bits, access in one clock cycle.
- L1 cache — on-chip SRAM, 32 KB typical, 1–4 cycle access, split instruction/data.
- L2 cache — on-chip or close SRAM, 256 KB–1 MB, 4–12 cycles.
- L3 cache — larger on-chip SRAM, several MB.
- Main memory (RAM) — DRAM, several GB, 50–100+ cycles.
- Secondary storage (disk/SSD) — nonvolatile, TB capacity, millions of cycles.
The hierarchy works because of the principle of locality: temporal locality (an item accessed recently is likely to be accessed again soon) and spatial locality (items near a recently accessed item are likely to be accessed soon).
The terminology for hierarchy interactions: a block is the minimum unit of data transfer between levels; a hit occurs when the requested data is found in the upper (faster) level; a miss occurs when it is not and must be fetched from a lower level. The hit rate is the fraction of accesses that are hits; the miss rate = 1 − hit rate. The miss penalty is the additional time needed to fetch a block from the next lower level.
Direct-Mapped Cache
In a direct-mapped cache, each memory block maps to exactly one cache location: \[\text{cache index} = (\text{block address}) \bmod (\text{number of blocks in cache})\]Each cache entry has three fields:
- Valid bit: indicates whether the entry holds valid data (initially 0 at power-on).
- Tag: the upper bits of the original block address, stored alongside the data to verify a hit.
- Data: the block of data (one or more words).
When the processor requests address \(A\):
- Use the index bits of \(A\) to locate the cache entry.
- Check the valid bit; if 0, it is a miss.
- If valid, compare the tag field with the tag bits of \(A\). If they match, it is a hit — return the data from the data field.
- If the tag does not match, it is a miss — fetch the block from RAM, store it in the cache at this index (evicting whatever was there), update the tag, set valid bit to 1, and return the data.
The processor issues 32-bit byte addresses; since words are 4 bytes, the upper 30 bits form the word address. For a cache of \(2^k\) blocks and block size of \(2^b\) words:
- The lowest \(b\) bits select the word within the block (block offset).
- The next \(k\) bits are the index.
- The remaining \(32 - k - b - 2\) bits form the tag (the 2 accounts for byte offset within a word).
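The bit-field layout above can be checked with a short helper (the example parameters, an 8-block cache with 4-word blocks, are chosen arbitrarily):

```python
def split_address(addr, k, b):
    """Split a 32-bit byte address into (tag, index, word offset, byte offset)."""
    byte_off = addr & 0x3                     # 2 bits: byte within word
    word_off = (addr >> 2) & ((1 << b) - 1)   # b bits: word within block
    index    = (addr >> (2 + b)) & ((1 << k) - 1)
    tag      = addr >> (2 + b + k)            # remaining 32 - k - b - 2 bits
    return tag, index, word_off, byte_off

# 8-block cache (k = 3), 4-word blocks (b = 2)
print(split_address(0x1234, 3, 2))   # (36, 3, 1, 0)
```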
The primary weakness of direct-mapped caches is conflict misses: two addresses that map to the same cache index will continually evict each other, causing poor performance even if the cache is otherwise large enough.
Memory hierarchy pyramid — faster and smaller toward the top; slower and larger toward the bottom.
Cache direct-mapped structure — address split into tag / index / offset to locate data.
Set-Associative Cache
A set-associative cache reduces conflict misses by allowing each memory block to map to any of \(W\) locations within a set. The cache is organized as \(2^k / W\) sets, each containing \(W\) ways (entries). A block maps to a unique set (determined by index bits) but can be placed in any of the \(W\) ways within that set.
On a lookup, the index selects the set, then all \(W\) tag comparators run in parallel to check all ways simultaneously. A hit is indicated if any way’s valid bit is set and tag matches.
- 1-way set associative = direct-mapped.
- 2-way set associative: each block can go to one of 2 locations in its set.
- 4-way set associative: one of 4 locations.
- Fully associative: a block can go anywhere in the cache; there is only one set.
As associativity increases, hit rates improve (fewer conflict misses) but lookup time and hardware cost increase (more comparators needed). The miss rate vs. associativity vs. cache size trade-off is empirical: real processors use 4-way or 8-way set-associative caches as a practical balance.
For a 32-bit address with an \(N\)-way set-associative cache of \(2^k\) blocks in total and block size \(2^b\) words:
- Block offset: \(b\) bits (+ 2 byte offset bits).
- Index: \(k - \log_2 N\) bits (fewer index bits as \(N\) increases, because there are fewer sets).
- Tag: remaining upper bits (more tag bits as \(N\) increases).
When fully associative, all bits except the block offset form the tag and there is no index field.
Replacement Policies
When a miss occurs and all ways in a set are occupied, one entry must be evicted. The standard replacement policy is LRU (Least Recently Used): evict the entry that was last accessed furthest in the past. LRU approximates optimal replacement and works well due to temporal locality. For large associativity, approximate LRU using reference bits that are set on access and periodically cleared.
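A toy model of a 2-way set-associative cache with LRU replacement (block addresses and parameters invented for illustration; a real cache tracks recency with a few bits per set, not a dictionary):

```python
from collections import OrderedDict

class SetAssocCache:
    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # one OrderedDict per set, mapping tag -> block; order tracks recency
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_addr):
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)        # refresh: this way is now most recent
            return "hit"
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict the least recently used way
        s[tag] = block_addr           # fill a way with the new block
        return "miss"

c = SetAssocCache()
# Blocks 0 and 4 map to the same set but coexist in the two ways:
print(c.access(0), c.access(4), c.access(0))   # miss miss hit
```

With a direct-mapped cache (1 way), the same sequence would thrash: blocks 0 and 4 would evict each other on every access.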
Larger Block Sizes
Bringing a single word from RAM on a miss is inefficient. Instead, a larger block size exploits spatial locality by bringing in multiple consecutive words. With block size 4 words, accessing address \(A\) also brings in the three adjacent words. Subsequent accesses to those words are cache hits.
The cost of a larger block is slightly increased miss penalty (more data to transfer) but greatly reduced miss rate for sequential access patterns. Typical miss penalties:
- 1-word block: 100 cc (100 clock cycles to access RAM).
- 4-word block: 104 cc (small extra cost to transfer 4 words).
- 8-word block: 108 cc.
Average Memory Access Time (AMAT)
\[\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}\]For example, with a 1 cc hit time, a 5% miss rate, and a 20 cc miss penalty: \[\text{AMAT} = 1 + 0.05 \times 20 = 2 \text{ cc}\]
Cache Miss Handling
On a read miss: stall the entire processor; fetch the block from RAM; place it in the cache; resume execution.
On a write (store):
- Write-through: every write goes to both the cache and RAM immediately. RAM and cache are always consistent. Disadvantage: every sw instruction causes a slow RAM write.
- Write-back: write only to the cache. A dirty bit is added to each cache entry to track whether the cached data has been modified but not yet written to RAM. When a dirty block is evicted, it is written back to RAM first. Write-back is more complex but much faster, since many writes to the same location require only one eventual RAM write.
The dirty bit adds overhead on cache miss with write-back: if the evicted block is dirty, it must be written to RAM (104 cc for a 4-word block) before the new block can be loaded.
Total Execution Time with Cache
\[\text{Total CC} = \text{Pipeline CC} + \text{Instruction cache miss penalty} + \text{Data cache miss penalty}\]where:
- Pipeline CC accounts for instructions executed, load-use stalls, and branch penalties.
- Instruction cache misses: each miss adds the RAM access time (e.g., 104 cc per 4-word block miss). On the first pass through a loop, instruction cache misses occur; on subsequent iterations, the instructions are already in cache.
- Data cache misses: each lw or sw that causes a miss adds the RAM access time. With block size 4, accessing \(M\) unique data words that fall in \(M/4\) distinct blocks causes \(M/4\) misses.
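The total-cycle formula above can be evaluated directly (all counts below are illustrative, using the 104 cc penalty for a 4-word block used in this chapter):

```python
def total_cc(pipeline_cc, i_misses, d_misses, miss_penalty=104):
    """Pipeline cycles plus the RAM penalty for every cache miss."""
    return pipeline_cc + (i_misses + d_misses) * miss_penalty

# e.g. 150 pipeline cycles, 3 instruction-cache misses, 5 data-cache misses
print(total_cc(150, 3, 5))   # 150 + 8 * 104 = 982
```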
Example analysis approach for a loop executing 20 iterations with 9 instructions starting at address 112, block size 4 words, and 104 cc RAM penalty:
- Instruction count: 3 + 6 × 20 + 1 = 124 (setup + loop body × iterations + branch not taken at end), adjusted for load-use stalls and branch penalty.
- Instruction cache misses: the 9 instructions occupy addresses 112 through 144. With a 4-word (16-byte) block size, address 112 is aligned to a block boundary (112 = 7 × 16), so the instructions span three blocks: 112–124, 128–140, and 144–156. On the first pass: 3 instruction cache misses = 3 × 104 cc = 312 cc. On loop-back iterations, the instructions are already cached.
- Data cache misses: lw $5, 0($4) accesses a different address on each iteration. With block size 4, four consecutive word accesses fall in one block: if the 20 addresses are consecutive words (4 bytes apart) and block-aligned, only 20/4 = 5 misses occur (5 × 104 cc). If instead the addresses are spaced a full block or more apart, every access misses: 20 misses = 20 × 104 cc.
Chapter 8: Virtual Memory
Motivation
Virtual memory is the level of the memory hierarchy between RAM and disk. It serves two purposes:
- Allow programs to exceed physical RAM: the operating system keeps only the currently needed pages of a program in RAM; the rest remain on disk.
- Multiprogramming and protection: multiple programs can run simultaneously, each believing it has the entire address space. The OS isolates programs from each other using per-process address mappings.
Each process sees a large virtual address space (64-bit programs see \(2^{64}\) addresses). The physical address space is determined by the amount of RAM installed. The OS manages the mapping between them.
Pages and Page Tables
A page table maps virtual page numbers (VPN) to physical page numbers (PPN). Each process has its own page table, stored in RAM, and pointed to by the page table register (PTR). Each entry contains:
- Valid bit: 1 if the page is currently in RAM; 0 if it is on disk (triggering a page fault on access).
- Dirty bit: 1 if the page has been written since it was loaded from disk.
- Reference bit: used to approximate LRU for page replacement.
- Physical page number: the location in RAM where this page resides.
Address translation works as follows. A 64-bit virtual address is split:
\[ \underbrace{\text{Virtual Page Number (VPN)}}_{52 \text{ bits}} \quad \underbrace{\text{Page Offset}}_{12 \text{ bits}} \]The VPN is used as an index into the page table (via the PTR) to obtain the PPN. The physical address is then:
\[ \underbrace{\text{Physical Page Number (PPN)}}_{?? \text{ bits}} \quad \underbrace{\text{Page Offset (unchanged)}}_{12 \text{ bits}} \]The page offset bits do not change between virtual and physical address because the page is mapped as a whole unit — only the page number changes.
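The translation is pure bit manipulation: shift out the offset, look up the VPN, splice the PPN back on. A sketch with 4 KB pages (12-bit offset) and a made-up toy page table:

```python
PAGE_OFFSET_BITS = 12                  # 4 KB pages

page_table = {5: 2, 6: 9}              # VPN -> PPN for pages currently in RAM

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS                 # upper bits: page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)  # lower 12 bits: unchanged
    if vpn not in page_table:
        raise RuntimeError("page fault: VPN %d not in RAM" % vpn)
    return (page_table[vpn] << PAGE_OFFSET_BITS) | offset

# VPN 5 maps to PPN 2; the offset 0x034 passes through unchanged
print(hex(translate(0x5034)))   # 0x2034
```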
Page Faults
A page fault occurs when the requested virtual page is not in RAM (valid bit = 0). The OS must:
- Choose a page to evict from RAM (using LRU, LFU, FIFO, or random replacement).
- If the evicted page is dirty, write it back to disk (write-back policy — write-through to disk is too expensive).
- Load the requested page from disk into the freed RAM location.
- Update the page table entry (set valid bit = 1, update PPN).
- Resume execution of the faulting instruction.
Page faults are expensive (disk access takes millions of cycles), so minimizing them is critical. Virtual memory always uses a write-back policy because write-through to disk would be prohibitively slow.
Translation Lookaside Buffer (TLB)
Every memory access requires a virtual-to-physical translation, which requires accessing the page table in RAM — adding one RAM access per instruction. This overhead is unacceptable.
The Translation Lookaside Buffer (TLB) is a small, fully associative cache storing recently used page table entries. It resides on-chip, close to the processor.
When the processor generates a virtual address:
- Check the TLB for the VPN.
- TLB hit: get the PPN directly. Then use the PPN to check the cache (and RAM if cache miss).
- TLB miss: access the page table in RAM (costs one RAM access). If valid (page in RAM), load the translation into the TLB and use the PPN. If invalid (page on disk): page fault — load the page from disk, update page table, update TLB.
A TLB hit provides only the translation; the data itself must still be found in the cache or RAM. If there is a TLB hit, a page fault is impossible (TLB entries only reference pages that are in RAM).
The valid bit in a TLB entry indicates whether the TLB entry is valid (not garbage data) — it does not indicate whether the page is in RAM vs. disk. The valid bit in the page table indicates whether the page is in RAM.
The Full Memory Access Flow
For every instruction fetch and every lw/sw:
TLB lookup (VA → PA):
- TLB hit → have PA.
- TLB miss → page table lookup in RAM → if page valid, load TLB; if page invalid, page fault → load from disk, update PT, load TLB.
Cache lookup (using PA):
- Cache hit → data returned; no RAM access needed.
- Cache miss → fetch block from RAM; update cache; return data.
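The whole flow can be sketched as a toy simulation. The structures and cycle costs below are invented for illustration (roughly 100 cc for a RAM access, a million for disk), not measured values:

```python
tlb, page_table, cache = {}, {7: 3}, {}   # toy VPN->PPN maps and a PA set

def access(vpn, offset):
    cycles = 1
    if vpn not in tlb:                         # TLB miss: walk page table in RAM
        cycles += 100
        if vpn not in page_table:              # page fault: fetch page from disk
            cycles += 1_000_000
            page_table[vpn] = len(page_table)  # pretend the OS picked a frame
        tlb[vpn] = page_table[vpn]             # load the translation into the TLB
    pa = (tlb[vpn] << 12) | offset
    if pa not in cache:                        # cache miss: fetch block from RAM
        cycles += 100
        cache[pa] = True
    return cycles

print(access(7, 0))   # cold: TLB miss + cache miss -> 201
print(access(7, 0))   # warm: TLB hit + cache hit -> 1
```

Note how the table's "impossible" row falls out of the code: a TLB hit skips the page-table walk entirely, so the disk can never be touched on that path.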
Summary of possible combinations for a system with a TLB, a page table in RAM, a physically addressed cache, and disk:
| TLB | Page Table | RAM (Cache) | Disk | Possible? |
|---|---|---|---|---|
| Hit | — | Hit (cache) | — | Yes (fastest path) |
| Hit | — | Miss (cache) | — | Yes (TLB hit, cache miss, go to RAM) |
| Miss | Hit | Hit (cache) | — | Yes (costly: PT lookup in RAM, but data in cache — rare) |
| Miss | Hit | Miss (cache) | — | Yes (PT lookup + RAM access) |
| Miss | Miss (fault) | — | Access | Yes (page fault, most expensive) |
| Hit | — | — | Access | No (TLB hit means page is in RAM) |
Process Switching
When the OS switches from one process (program A) to another (program B):
- Save the state of A: PC, all registers, and the PTR.
- Flush the TLB (its translations belong to A and are invalid for B).
- Flush the cache (optional; depends on whether the cache uses physical or virtual addressing).
- Load the state of B: restore PC, registers, and PTR for B.
- B begins with initially empty TLB and cache, gradually filling as it executes.
The page tables themselves remain in RAM throughout the switch; the PTR is the only thing that needs to change to select the correct page table.
Chapter 9: Other Architectures
Microprogramming
Microprogramming is a technique for implementing the control unit of a processor. Instead of hardwired combinational logic, the control unit runs a small program (the microprogram) for each assembly instruction. Each microinstruction corresponds to one set of control signals to the datapath (one clock cycle). Microinstructions are stored in a ROM.
The microprogram for an instruction specifies sequencing: FETCH (go decode next assembly instruction), NEXT (go to next microinstruction in sequence), or JUMP N (go to microinstruction N). Microprogramming makes it easy to add new instructions or fix bugs by changing ROM contents rather than redesigning hardware.
VAX is a classic microprogrammed architecture. It is a CISC (Complex Instruction Set Computer) with variable-length instructions (1 to 54 bytes), 304 instruction types, and instructions like MOVC3 (copy a string), INSQUE (insert into a queue), and POLY (evaluate a polynomial). These complex instructions simplify assembly programming and reduce code size, but make the hardware more complex.
RISC vs. CISC
RISC (Reduced Instruction Set Computer) architectures (like MIPS and SPARC) embrace a philosophy of simplicity:
- Small number of simple, fixed-length instructions.
- Load-store architecture (only lw/sw access memory).
- Large general-purpose register file.
- Single-cycle execution of most instructions.
- Heavy reliance on the compiler to optimize code.
CISC architectures (like VAX and Intel x86) provide many complex instructions with variable lengths and multiple addressing modes, aiming to reduce code size and simplify programming at the expense of more complex hardware.
Modern processors often blend both: Intel’s Pentium Pro and later processors decode complex x86 instructions into simpler micro-ops that execute on a RISC-like internal pipeline.
SPARC
SPARC is a RISC architecture designed to support object-oriented languages and efficient subroutine calls. It has 32 visible registers at any time, organized as a register window into a large register file of 40–520 registers.
On a subroutine call, the register window slides forward by 16: the caller’s output registers become the callee’s input registers, and the callee gets a fresh set of local and output registers. On return, the window slides back. This mechanism avoids saving and restoring registers on most calls, trading register-file area for performance.
SPARC uses fixed 32-bit instructions with 5 instruction formats, delayed branches (the instruction after a branch always executes — the branch delay slot), and 32 single-precision or 16 double-precision floating-point registers.
Intel x86 / Pentium
The Intel x86 architecture grew from the original 8086 (16-bit) through the 80386 (32-bit) to the modern 64-bit extensions. Key milestones:
- 80386 (1985): First 32-bit Intel processor.
- 80486 (1989): Added on-chip cache, instruction pipeline, and integrated math coprocessor.
- Pentium (1993): Superscalar microarchitecture (executes multiple instructions in parallel); split L1 instruction/data caches.
- Pentium Pro (1995): Added L2 cache; decodes complex x86 instructions into RISC-like micro-ops for internal execution.
- Pentium III (1999): Added SSE (SIMD single-precision floating-point); 10-stage pipeline.
- Pentium 4 (2000): Added SSE2; 20-stage pipeline; 4K-entry branch predictor; instruction trace cache stores decoded micro-ops; 8-way set-associative L1 data cache (8 KB) and L2 (256 KB, 128-byte lines).
The x86 instruction set is large: 48 move instructions, 12 binary arithmetic instructions, 57 string instructions, 48 control transfer instructions, and many more. Addressing modes include register, immediate, direct, register indirect, base + displacement, indexed, and complex combinations like [BX + DI + 2040H]. This rich addressing mode set is a defining feature of CISC architectures.
Modern Intel processors translate x86 instructions into micro-ops internally, allowing the RISC execution engine to benefit from pipeline techniques developed for simple fixed-length architectures, while maintaining full binary compatibility with the enormous installed base of x86 software.
Chapter 10: Memory-Mapped Input and Output
Accessing I/O Devices
A computer must communicate with external devices — keyboards, monitors, printers, network controllers, and disk drives. Rather than introducing a separate I/O instruction set, most modern architectures use memory-mapped I/O: each I/O device is assigned one or more addresses in the same address space as main memory. When the processor executes a lw or sw to one of these reserved addresses, the hardware routes the access to the appropriate device rather than to RAM.
A portion of the address space is dedicated to I/O devices. For example, in a MIPS system the addresses 0xFFFF0000 through 0xFFFFFFFF might be reserved for I/O. The operating system uses the virtual memory address translation mechanism to make these addresses accessible only from kernel mode, preventing user programs from directly manipulating hardware.
Address Decoding
An address decoder sits between the processor’s memory interface and the devices. It examines the address on the bus and asserts the write-enable signal of the appropriate device (or memory) while deactivating all others. A ReadData multiplexor selects which device’s output data is routed back to the processor, based on the same decoded address.
For instance, if I/O Device 1 is mapped to address 0xFFFFFFF4, the instruction sw $t0, 0xFFF4($0) causes the address decoder to assert WE1 (write-enable for Device 1) while keeping memory’s write-enable inactive. Conversely, lw $t3, 0xFFF4($0) causes the ReadData MUX to select Device 1’s output data and route it to register $t3.
Each I/O device exposes three types of registers through its mapped addresses: command registers (which cause the device to perform actions), status registers (which indicate what the device is doing and whether errors have occurred), and data registers (which hold the data being transferred to or from the device).
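The decoder's job can be sketched in software (the device addresses and signal names below are invented; real decoding is a handful of gates on the address lines):

```python
DEVICE1_ADDR = 0xFFFFFFF4
DEVICE2_ADDR = 0xFFFFFFF8
IO_BASE      = 0xFFFF0000   # everything at or above this is I/O space

def decode(addr):
    """Assert exactly one write-enable for the given address."""
    return {
        "mem_we": addr < IO_BASE,        # ordinary RAM access
        "we1":    addr == DEVICE1_ADDR,  # write-enable for Device 1
        "we2":    addr == DEVICE2_ADDR,  # write-enable for Device 2
    }

print(decode(0xFFFFFFF4))   # {'mem_we': False, 'we1': True, 'we2': False}
print(decode(0x00001000))   # {'mem_we': True, 'we1': False, 'we2': False}
```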
Controlling I/O
Three mechanisms govern when and how data transfers occur between the processor and I/O devices:
Polling is the simplest approach: the processor periodically reads the device’s status register to check whether the device is ready. If ready, the processor reads or writes the data register; if an error is indicated, it takes corrective action. Polling is common in small or low-performance embedded systems with predictable timing requirements. In general-purpose systems, however, polling wastes CPU time — the processor spins in a loop checking status instead of doing useful work.
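A polling loop looks like this in sketch form (the `read_status`/`read_data` accessors and the READY bit are hypothetical; on real hardware they would be lw instructions to the device's mapped status and data registers):

```python
import time

READY = 0x1   # assumed ready bit in the status register

def poll_read(read_status, read_data, timeout_s=1.0):
    """Spin on the status register until READY, then read the data register."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_status() & READY:
            return read_data()
    raise TimeoutError("device never became ready")

# Fake device that becomes ready on the third status read:
state = {"polls": 0}
def fake_status():
    state["polls"] += 1
    return READY if state["polls"] >= 3 else 0

print(poll_read(fake_status, lambda: 42))   # 42
```

The wasted work is visible in the loop itself: every iteration that reads a not-ready status is a cycle the CPU could have spent elsewhere, which is exactly what interrupts eliminate.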
Interrupt-driven I/O eliminates polling overhead. When a device is ready or an error occurs, its I/O controller sends an interrupt signal to the processor. The interrupt mechanism is similar to exceptions: the processor finishes the current instruction, saves the program counter, and transfers control to an interrupt handler. The Cause register identifies the interrupting device. Priority interrupts allow more urgent devices to preempt the handlers of less urgent ones, ensuring that time-critical devices are serviced promptly.
Direct Memory Access (DMA) offloads bulk data transfer entirely. Instead of the CPU copying data word-by-word between a device and memory, the OS provides the DMA controller with a starting memory address and a transfer length. The DMA controller then transfers data autonomously — the processor is free to execute other instructions. When the transfer is complete (or if an error occurs), the DMA controller interrupts the processor.
DMA and Cache Coherence
DMA introduces a subtle consistency problem. If the DMA controller writes new data to a main-memory block that is currently cached, the cached copy becomes stale. Conversely, if a dirty cache block has not yet been written back to main memory and the DMA controller reads that memory location, it reads stale data. Solutions include flushing the relevant cache blocks before initiating a DMA transfer, or marking I/O memory regions as non-cacheable.
A similar issue arises with virtual memory: the DMA controller operates on physical addresses, but the OS allocates memory in virtual pages that may not be contiguous in physical memory. The OS must either break large DMA transfers into page-sized fragments or allocate physically contiguous pages for DMA buffers.
Interconnecting Components
The processor, memory, and I/O controllers must be interconnected. A (system) bus is a shared communication channel consisting of parallel wires for data transfer and control signals for synchronization. Buses can become performance bottlenecks because only one device can drive the bus at a time.
A bus contains data lines (carrying address and data, sometimes multiplexed onto the same wires), control lines (indicating the type of transaction and synchronizing transfers), and may be synchronous (clocked) or asynchronous (using request/acknowledge handshaking).
Modern systems separate buses by function. The processor-memory bus (or front-side bus) is short and fast, designed to match the memory subsystem's bandwidth. I/O buses are longer, support multiple device connections, and follow standardized interfaces for interoperability; they connect to the processor-memory bus through bridge chips. The North Bridge (memory controller) manages transfers between the processor and main memory, while the South Bridge (I/O controller hub) manages transfers between peripheral devices and the rest of the system.
Common I/O bus standards include USB (up to 60 MB/s for USB 2.0, external, hot-pluggable), FireWire/IEEE 1394 (50–100 MB/s), PCI Express (250 MB/s per lane, internal), Serial ATA (300 MB/s, internal), and Serial Attached SCSI (300 MB/s, external). Each standard makes different tradeoffs between bandwidth, cable length, device count, and hot-pluggability.
Chapter 11: Storage Technologies
Magnetic Disk Storage
Hard disk drives (HDDs) provide nonvolatile storage using rotating magnetic platters. A drive contains 1 to 4 platters spinning at 5,400 to 15,000 RPM, with 10,000 to 50,000 tracks per surface. Each track is divided into 100 to 500 sectors, typically 512 bytes each (though 4096-byte sectors are increasingly common). Each sector stores a sector ID, a gap, the data payload, an error-correcting code (ECC) for detecting and correcting read errors, and another gap.
Accessing a specific sector involves several delays:
- Queuing delay: waiting if other accesses are already pending.
- Seek time: the time to move the read/write head to the correct track (typically 4–9 ms average).
- Rotational latency: the time for the desired sector to rotate under the head. Average rotational latency is half a rotation: 5.6 ms at 5,400 RPM, 2.0 ms at 15,000 RPM.
- Transfer time: the time to read the sector data, a function of sector size, rotation speed, and recording density.
- Controller overhead: I/O processing overhead, typically ~0.2 ms.
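Summing the components gives the average access time; a sketch using example figures (7,200 RPM drive, 6 ms average seek, 50 MB/s media transfer rate, all assumed):

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_s, sector_bytes=512,
                   controller_ms=0.2):
    """Average time to access one sector, in milliseconds (no queuing delay)."""
    rotational_ms = 0.5 * 60_000 / rpm                  # half a rotation
    transfer_ms = sector_bytes / (transfer_mb_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms + controller_ms

print(round(disk_access_ms(6, 7200, 50), 2))   # about 10.38 ms
```

Note that seek and rotation dominate: the actual 512-byte transfer contributes about 0.01 ms, which is why sequential access (no re-seek per sector) is so much faster than random access.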
Modern disk drives include on-board caches and microprocessors that prefetch sectors in anticipation of sequential access and optimize seek ordering, reducing effective access times below the worst-case averages.
Flash Storage
Flash memory is nonvolatile semiconductor storage — 100 to 1,000 times faster than hard disks, smaller, lighter, and more power-efficient. It costs more per gigabyte than magnetic disk but less than DRAM.
There are two main types. NOR flash provides random read/write access at the bit level and is typically used for instruction memory in embedded systems. NAND flash is denser (more bits per unit area) but supports only block-at-a-time access; it is cheaper per gigabyte and is used in USB drives, SSDs, and media storage.
Flash cells wear out after thousands to hundreds of thousands of write cycles. Wear levelling algorithms in the flash controller distribute writes evenly across all cells to extend the device’s lifetime. This wear-out characteristic means flash is not a direct drop-in replacement for DRAM (which tolerates unlimited writes), but for storage applications where write frequency is moderate, flash provides an excellent balance of speed, density, and power consumption.
RAID
RAID (Redundant Array of Independent Disks) uses multiple smaller disks as a single logical volume. Parallelism across disks improves performance, and redundancy provides fault tolerance — especially when combined with hot-swapping (replacing a failed disk without shutting down the system).
RAID 0 (Striping): Data is striped across \(N\) disks with no redundancy. Performance scales with the number of disks, but any single disk failure destroys the entire array.
RAID 1 (Mirroring): Every write goes to both a data disk and a mirror disk (\(N + N\) total disks). On failure, the mirror provides a complete copy. Storage efficiency is only 50%, but read performance doubles since either copy can serve a read.
RAID 2 (ECC): Data is split at the bit level across \(N\) data disks, with \(E\) additional disks storing error-correcting codes. Generally too complex for practical use.
- RAID 3 (Bit-Interleaved Parity): Data is spread at a fine grain (bits or bytes) across \(N\) disks, with one dedicated parity disk. Every disk participates in every access, so the array services one request at a time. On failure, the missing disk’s data is reconstructed from the surviving disks and the parity.
RAID 4 (Block-Interleaved Parity): Like RAID 3, but data is striped at the block level. Only the disk holding the requested block (plus the parity disk on writes) needs to be accessed, enabling independent reads from different disks. The single parity disk can become a bottleneck on writes.
RAID 5 (Distributed Parity): Like RAID 4, but parity blocks are distributed across all \(N+1\) disks rather than concentrated on one. This eliminates the parity-disk bottleneck and is the most widely deployed RAID level.
RAID 6 (Dual Distributed Parity): Uses \(N+2\) disks with two independent parity calculations distributed across all disks. Can tolerate two simultaneous disk failures.
RAID 0+1 stripes first, then mirrors the striped arrays. RAID 10 mirrors first, then stripes across the mirrored pairs. Both combine the performance of striping with the redundancy of mirroring, and are common in database and enterprise storage deployments.
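The parity schemes of RAID 3 through 5 all rest on the same XOR arithmetic, which a minimal sketch can make concrete. This assumes byte-level XOR over one stripe of small data blocks; real arrays operate on large blocks and, in RAID 5, rotate the parity block across disks, but the reconstruction math is identical.

```python
def parity(blocks):
    # Parity block = XOR of all blocks, byte by byte.
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
p = parity(stripe)                      # stored on the parity disk

# Disk 1 fails: rebuild its block by XORing the survivors with the parity,
# since A ^ C ^ (A ^ B ^ C) = B.
rebuilt = parity([stripe[0], stripe[2], p])
print(rebuilt == stripe[1])             # True
```

The same property explains the RAID 4/5 write penalty: updating one data block requires re-reading the old data and old parity to compute the new parity, turning one logical write into two reads and two writes.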
Chapter 12: Multiprocessor Architectures
From Instruction-Level to Thread-Level Parallelism
Throughout this course, we have seen parallelism at the instruction level: pipelining overlaps the execution of multiple instructions, and superscalar architectures issue multiple instructions per clock cycle. SIMD (Single Instruction, Multiple Data) extensions allow one instruction to operate on multiple data elements in parallel — for example, adding four pairs of floating-point numbers simultaneously.
However, single-processor performance improvements have slowed due to power constraints, diminishing returns from instruction-level parallelism, and memory latency walls. The industry response has been multicore processors: chips containing multiple independent processor cores, each capable of executing its own instruction stream. This is MIMD (Multiple Instruction, Multiple Data) parallelism, where each core runs a separate thread or process.
Shared Memory Multiprocessors
In a Shared Memory Multiprocessor (SMP), all processor cores share a single physical address space. Any core can access any memory location, and cores communicate by reading and writing shared variables. Synchronization is achieved through hardware-supported primitives like locks and atomic operations.
Two memory access models exist within the shared-memory paradigm:
- UMA (Uniform Memory Access): All processors have equal access time to all memory. This is typical of small-scale multiprocessors where all cores connect to memory through a shared bus or crossbar.
- NUMA (Non-Uniform Memory Access): Each processor has local memory that it can access faster than remote memory belonging to other processors. NUMA scales to larger systems but requires the OS and programmer to be aware of data placement for good performance.
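The shared-variable communication and lock-based synchronization described above can be illustrated with a small sketch, using Python threads as a stand-in for cores: all threads update one shared counter, and a lock provides the mutual exclusion that hardware atomic operations supply in a real shared-memory multiprocessor.

```python
import threading

counter = 0                    # shared variable visible to all threads
lock = threading.Lock()        # stands in for a hardware-supported lock

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:             # acquire/release around the shared update,
            counter += 1       # so no two threads interleave mid-update

threads = [threading.Thread(target=worker, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                 # 40000: no updates lost
```

Without the lock, the read-modify-write of `counter += 1` could interleave between threads, silently losing updates; this is exactly the race that hardware atomic operations exist to prevent.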
Message Passing
An alternative to shared memory is message passing, where each processor has its own private physical address space. Processors communicate by explicitly sending and receiving messages through an interconnection network. Due to the higher setup cost of message transfers, message-passing architectures are typically implemented as clusters — collections of commodity computers connected by high-speed networking.
Message passing avoids the cache coherence complexity of shared-memory systems but places the burden of data distribution and communication on the programmer.
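The contrast with shared memory can be made concrete with a small message-passing sketch, assuming Python's multiprocessing module as a stand-in for a cluster interconnect: each worker runs in its own private address space and communicates only through explicit sends and receives on queues.

```python
import multiprocessing as mp

def worker(inbox, outbox):
    data = inbox.get()          # explicit receive: no shared variables
    outbox.put(sum(data))       # explicit send of the partial result

if __name__ == "__main__":
    inbox, outbox = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(inbox, outbox))
    p.start()
    inbox.put([1, 2, 3, 4])     # distribute work as a message
    print(outbox.get())         # 10: result collected as a message
    p.join()
```

Nothing is shared: the worker could equally run on another machine with the queues replaced by network sockets, which is exactly how clusters operate.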
Clusters and Grid Computing
Loosely coupled clusters are networks of independent computers, each with its own memory, OS, and storage, connected by standard I/O interfaces such as Ethernet. They excel at applications with naturally independent tasks — web servers, databases, and embarrassingly parallel simulations. Key advantages include high availability (one node can fail without bringing down the others), scalability (add more nodes as needed), and affordability (built from commodity hardware).
Grid computing extends the cluster idea across geographic boundaries: separate computers interconnected by wide-area networks (including the Internet) cooperate on large computations. Work units are distributed to idle machines, and results are collected centrally. Notable examples include SETI@home (distributing radio telescope data analysis to millions of volunteer PCs) and the World Community Grid (applying distributed computing to scientific research in health and sustainability).
Challenges of Parallel Programming
The main obstacle to realizing multiprocessor performance is not hardware but software. Writing efficient parallel programs requires solving three hard problems:
Partitioning: decomposing the computation into tasks that can execute concurrently. Not all algorithms are easily parallelizable — sequential dependencies can limit the achievable speedup. The compiler and programmer must identify code regions that can run independently.
Coordination: managing the execution of parallel tasks, ensuring correct synchronization of shared data, and maintaining load balance so that no core sits idle while others are overloaded. Synchronization overhead (locks, barriers, atomic operations) reduces the effective parallelism.
Communication overhead: ensuring that intermediate results are consistent across cores. In shared-memory systems this involves cache coherence protocols; in message-passing systems it requires explicit send/receive operations. If communication time dominates computation time, adding more cores yields diminishing returns.
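The limit that sequential dependencies place on speedup can be quantified with Amdahl's law: if a fraction \(f\) of the work parallelizes perfectly across \(n\) cores, overall speedup is \(1 / ((1 - f) + f/n)\). The 95%-parallel figure below is an illustrative assumption.

```python
def amdahl_speedup(f, n):
    # f: fraction of execution time that is parallelizable
    # n: number of cores applied to the parallel fraction
    return 1.0 / ((1.0 - f) + f / n)

# Even with 95% of the program parallel, 16 cores give well under 16x.
print(round(amdahl_speedup(0.95, 16), 2))   # 9.14
# The speedup can never exceed 1 / (1 - f), no matter how many cores.
print(round(1 / (1 - 0.95), 1))             # 20.0
```

This is why partitioning matters so much: shrinking the sequential fraction from 5% to 1% raises the ceiling from 20x to 100x, a far bigger win than adding cores.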
These challenges explain why, despite the availability of multicore hardware, many applications still struggle to exploit more than a handful of cores effectively. The design of parallel algorithms and programming models remains an active area of research.