CS 251: Computer Organization and Design
Rosina Kharal
Part 1: Digital Logic and Data Representation
Digital Logic Design
Modern computers are built from billions of transistors organized into logic gates, which in turn form the functional units of a processor. Understanding how logic gates combine to represent Boolean functions is the foundation of all computer hardware design.
A truth table is a tabular method of specifying a Boolean function, listing every possible combination of inputs alongside the corresponding output. For a function with \(n\) inputs, the truth table has \(2^n\) rows. The Sum of Products (SOP) form extracts a Boolean expression directly from the truth table by identifying every row where the output is 1, writing a product (AND) term for each such row, and ORing those terms together. For example, if a three-input function produces a 1 at rows where \(A=1, B=0, C=1\) and \(A=1, B=1, C=0\), the SOP expression is \(A\bar{B}C + AB\bar{C}\). SOP can always be simplified using Boolean algebra identities.
These identities allow any expression built from AND, OR, and NOT to be rewritten using only NAND gates or only NOR gates, which is valuable because NAND and NOR are each universal — any Boolean function can be implemented with just one gate type.
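The truth-table-to-SOP extraction described above can be automated. A minimal Python sketch (the function `sop` and the example function `g` are illustrative names, not from the course):

```python
from itertools import product

def sop(f, names):
    """Build a Sum-of-Products expression from a truth table:
    one AND term per row where f evaluates to 1, ORed together."""
    terms = []
    for row in product([0, 1], repeat=len(names)):
        if f(*row):
            # A variable appears complemented (') when its bit is 0.
            lits = [n if b else n + "'" for n, b in zip(names, row)]
            terms.append("".join(lits))
    return " + ".join(terms)

# The example from the text: 1 only at A=1,B=0,C=1 and A=1,B=1,C=0.
g = lambda a, b, c: (a and not b and c) or (a and b and not c)
print(sop(g, ["A", "B", "C"]))  # AB'C + ABC'
```

Rows are enumerated in binary-counting order, so the terms come out in ascending row order.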
For example, XOR is expressed in SOP form as:
\[A \oplus B = A\bar{B} + \bar{A}B\]
Transistors and CMOS
In CMOS (Complementary Metal-Oxide-Semiconductor) technology, logic gates are built from two types of transistors: NMOS and PMOS. An NMOS transistor acts as a switch that closes (conducts) when its gate input is HIGH (1); a PMOS transistor closes when its gate input is LOW (0). In both cases, they are open (non-conducting) otherwise.
A CMOS gate consists of a pull-up network of PMOS transistors connected between the supply voltage \(V_{DD}\) and the output, and a pull-down network of NMOS transistors connected between the output and ground. When the input combination makes the pull-up network conduct and the pull-down network not conduct, the output is driven HIGH; when the pull-down network conducts and the pull-up does not, the output is driven LOW. Exactly one network should be on at any given time, ensuring that the output is never floating and that there is no direct path from \(V_{DD}\) to ground.
For a CMOS NOT gate (inverter): a single PMOS transistor (gate connected to input \(A\)) is connected from \(V_{DD}\) to the output, and a single NMOS transistor (gate also connected to \(A\)) is connected from the output to ground. When \(A = 0\), the PMOS conducts (output = 1) and the NMOS is off; when \(A = 1\), the NMOS conducts (output = 0) and the PMOS is off.
A CMOS NAND gate uses two PMOS transistors in parallel (pull-up) and two NMOS transistors in series (pull-down). The output is 0 only when both inputs are 1 (both NMOS transistors conduct), implementing \(\overline{A \cdot B}\). A CMOS NOR gate uses two PMOS in series and two NMOS in parallel.
Sequential Circuits
Combinational circuits produce outputs that depend only on current inputs. Sequential circuits add memory: their output depends on both current inputs and past history (stored state). Sequential circuits use feedback loops to hold state.
SR Latch
The simplest memory element is the SR latch (Set-Reset latch), built from two cross-coupled NOR gates. The inputs are \(S\) (Set) and \(R\) (Reset); the outputs are \(Q\) and \(\bar{Q}\). When \(S=1, R=0\), the latch is set (\(Q=1\)). When \(S=0, R=1\), it is reset (\(Q=0\)). When \(S=0, R=0\), the latch holds its current state. The combination \(S=1, R=1\) is forbidden because it drives both outputs to 0, causing undefined behavior when both inputs return to 0.
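The cross-coupled NOR behavior can be simulated by updating both gates until they settle. A sketch (the fixed step count is just an arbitrary settling bound, not a hardware parameter):

```python
def nor(a, b):
    return int(not (a or b))

def sr_latch(s, r, q, qbar, steps=4):
    """Iterate the two cross-coupled NOR gates until they settle.
    q and qbar carry the stored state (the feedback loop)."""
    for _ in range(steps):
        q, qbar = nor(r, qbar), nor(s, q)
    return q, qbar

q, qbar = sr_latch(1, 0, 0, 1)      # Set: Q becomes 1
q, qbar = sr_latch(0, 0, q, qbar)   # Hold: state unchanged
q, qbar = sr_latch(0, 1, q, qbar)   # Reset: Q becomes 0
```

Trying `s=1, r=1` and then returning both to 0 shows why the combination is forbidden: the simultaneous update oscillates instead of settling.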
D Latch
The SR latch’s forbidden state is eliminated by the D latch, which has a single data input \(D\) and an enable input. When the enable is HIGH, the latch is transparent — \(Q\) follows \(D\). When the enable is LOW, the latch holds its current value. The D latch is level-triggered.
D Flip-Flop (Master-Slave)
The D flip-flop (edge-triggered register) consists of two D latches connected in series. The first (master) is enabled on the LOW phase of the clock and the second (slave) is enabled on the HIGH phase. The result is that the flip-flop samples its \(D\) input on the rising edge of the clock and holds the value for the entire clock period. This edge-triggered behavior is critical: in synchronous digital design, all state elements are clocked flip-flops, ensuring that all state changes happen at discrete clock edges and that combinational logic has a full clock period to stabilize.
Finite State Machines
A Finite State Machine (FSM) is a sequential circuit that transitions between a finite set of states based on the current state and input values. In a Moore machine, outputs depend only on the current state (not directly on inputs). The design process involves:
- Draw a state diagram identifying all states and labeled transitions.
- Encode states as binary values.
- Construct the next-state table (current state + inputs → next state).
- Construct the output table (current state → outputs).
- Derive Boolean expressions for next-state logic and output logic from these tables.
- Implement with D flip-flops (one per state bit) and combinational logic.
A classic example is a traffic-light controller. States encode which lights are green/red, and transitions occur on a timer signal. The output (light colors) depends on the current state only.
Combinational Building Blocks
Decoders
An \(n\)-to-\(2^n\) decoder takes \(n\) input bits and activates exactly one of \(2^n\) output lines — the one corresponding to the binary value of the input. All other outputs are 0. For example, a 2-to-4 decoder has inputs \(A_1, A_0\) and four outputs \(D_0, D_1, D_2, D_3\), where \(D_i = 1\) iff the binary value of \(A_1 A_0\) equals \(i\).
Decoders are used in register files to select which register to write, in memory to select which row/word to access, and in many address-decoding applications.
Multiplexors
A \(2^n\)-to-1 multiplexor (MUX) selects one of \(2^n\) data inputs to forward to its output, based on an \(n\)-bit select signal. For a 2-to-1 MUX with inputs \(D_0, D_1\) and select \(S\): \(\text{Out} = \bar{S} \cdot D_0 + S \cdot D_1\). Internally, a MUX is built from AND gates (one per input, gated by the appropriate decoded select combination) and an OR gate combining all AND outputs.
MUXes appear throughout the datapath as key decision points — for example, choosing between a register value and an immediate in the ALU input, or choosing between the ALU result and memory data for write-back.
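Both building blocks are easy to model in software. This sketch mirrors the gate-level structure described above, with a MUX built as AND terms gated by decoded select lines and ORed together (function names are illustrative):

```python
def decode(n, a):
    """n-to-2^n decoder: exactly one output line is 1,
    the one whose position equals the binary input value a."""
    return [int(i == a) for i in range(2 ** n)]

def mux(inputs, select):
    """2^n-to-1 MUX the gate-level way: AND each data input with
    its decoded select line, then OR all the products together."""
    n = len(inputs).bit_length() - 1
    sel_lines = decode(n, select)
    return max(d & s for d, s in zip(inputs, sel_lines))  # OR of ANDs

assert decode(2, 3) == [0, 0, 0, 1]
assert mux([0, 1, 1, 0], 2) == 1
```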
Register Files
A register file is a small, fast array of registers. In MIPS, the register file holds 32 registers of 32 bits each. It has write ports and read ports:
- Write: A decoder takes the 5-bit destination register number and enables the write enable of exactly that register’s D flip-flops. The 32-bit write data is presented to all registers, but only the selected one latches it on the clock edge.
- Read: A MUX per read port selects the output of one of the 32 registers based on a 5-bit register number. Reading is combinational (no clock edge needed).
Memory: SRAM and DRAM
SRAM
SRAM (Static RAM) stores each bit in a cross-coupled inverter pair (a 6-transistor cell), which holds its value as long as power is supplied. SRAM is fast (1–10 ns access time) and does not require refresh, but it is expensive per bit and has a large cell size. SRAM is used for processor caches (L1, L2, L3).
An SRAM array is organized as rows and columns. Two-level decoding splits the address into row and column parts: the row decoder (using a standard \(n\)-to-\(2^n\) decoder) activates an entire row of cells, placing their contents on bit lines. Then a column MUX selects the specific column(s) needed. This two-level scheme reduces the hardware complexity compared to a flat fully decoded array.
Three-state (tri-state) buffers allow multiple SRAM outputs to share a common data bus: each buffer either drives its output HIGH or LOW, or presents a high-impedance (disconnected) state when not selected, preventing bus conflicts.
DRAM
DRAM (Dynamic RAM) stores each bit as a charge on a capacitor, requiring only 1 transistor per cell. This makes DRAM far denser and cheaper than SRAM. However, capacitors leak charge, so DRAM must be refreshed periodically (typically every 64 ms) by reading and rewriting each row. DRAM access times are slower than SRAM (tens to hundreds of nanoseconds). DRAM is used for main memory.
Data Representation
Integer Representation
Characters are stored using ASCII encoding, where each character maps to a 7-bit integer (e.g., ‘A’ = 65, ‘a’ = 97). Unsigned integers in \(n\) bits represent values \(0\) to \(2^n - 1\). For a 32-bit unsigned integer, the range is \(0\) to \(4{,}294{,}967{,}295\).
Signed-magnitude representation uses the most-significant bit as a sign bit (0 = positive, 1 = negative) and the remaining bits as the magnitude. This representation has two encodings of zero (\(+0\) and \(-0\)) and requires special-case hardware for arithmetic.
Two's complement interprets the most-significant bit with negative weight:
\[x = -x_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} x_i \cdot 2^i\]
For 32-bit two’s complement, the range is \(-2^{31}\) to \(2^{31}-1\), i.e., \(-2{,}147{,}483{,}648\) to \(2{,}147{,}483{,}647\). Key properties: there is exactly one encoding of zero; negation is performed by inverting all bits and adding 1; addition/subtraction hardware is the same for signed and unsigned numbers (overflow must be detected separately).
Sign extension converts an \(n\)-bit two’s complement value to a wider representation by replicating the sign bit into all new upper bits. For example, extending the 8-bit value \(1111\ 1010_2\) (which represents \(-6\)) to 16 bits yields \(1111\ 1111\ 1111\ 1010_2\), still \(-6\).
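The two's complement properties and sign extension can be checked with a few lines of Python (the helper names are illustrative):

```python
def to_twos(x, n):
    """Encode a Python int as an n-bit two's complement pattern."""
    return x & ((1 << n) - 1)

def from_twos(bits, n):
    """Interpret an n-bit pattern: the MSB has weight -2^(n-1)."""
    return bits - (1 << n) if bits & (1 << (n - 1)) else bits

def sign_extend(bits, n, m):
    """Widen an n-bit value to m bits by replicating the sign bit."""
    return to_twos(from_twos(bits, n), m)

assert from_twos(0b11111010, 8) == -6
assert sign_extend(0b11111010, 8, 16) == 0b1111111111111010
# Negation = invert all bits, then add 1:
assert to_twos(-6, 8) == ((6 ^ 0xFF) + 1) & 0xFF == 0b11111010
```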
IEEE 754 Floating-Point
IEEE 754 is the standard for representing floating-point numbers in binary. A single-precision (32-bit) number has three fields:
- Sign bit \(s\): 1 bit. \(0\) for positive, \(1\) for negative.
- Biased exponent \(e\): 8 bits, stored with a bias of 127. The actual exponent is \(e - 127\).
- Significand (mantissa) \(f\): 23 bits, representing the fractional part after an implied leading 1.
The implied leading 1 means the significand effectively has 24 bits of precision. The biased exponent scheme allows floating-point numbers to be compared as integers. Special cases:
- Exponent = 0, fraction = 0: the value is \(\pm 0\).
- Exponent = 255, fraction = 0: \(\pm\infty\).
- Exponent = 255, fraction \(\neq 0\): NaN (Not a Number).
- Exponent = 0, fraction \(\neq 0\): denormalized number, with value \((-1)^s \times 0.f \times 2^{-126}\), enabling gradual underflow.
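The field layout can be verified in software by reinterpreting a float's bits as an integer. A sketch using Python's struct module (the function name is illustrative):

```python
import struct

def decode_f32(x):
    """Split a single-precision float into its three IEEE 754 fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31              # sign bit
    e = (bits >> 23) & 0xFF     # 8-bit biased exponent
    f = bits & 0x7FFFFF         # 23-bit fraction
    return s, e, f

s, e, f = decode_f32(-6.5)      # -6.5 = -1.101_2 x 2^2
assert (s, e) == (1, 129)       # actual exponent = 129 - 127 = 2
assert f == 0b101 << 20         # fraction bits after the implied 1
```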
Floating-point addition algorithm: (1) Compare exponents; shift the significand of the smaller number right by the difference to align binary points. (2) Add the significands. (3) Normalize the result (shift so there is a single 1 before the binary point), adjusting the exponent. (4) Round to fit the 23-bit significand field using guard and round bits.
Binary Arithmetic
Full Adder and Ripple-Carry Adder
A full adder adds two input bits and a carry-in, producing a sum bit and a carry-out:
\[S = A \oplus B \oplus C_{in}\]
\[C_{out} = AB + C_{in}(A \oplus B)\]
A ripple-carry adder chains \(n\) full adders together, with the carry-out of each stage feeding the carry-in of the next. A 32-bit ripple-carry adder has a critical path of 32 carry propagations. Carry-lookahead adders compute carries in parallel, reducing delay to \(O(\log n)\) — not covered in depth in this course but noted as an important optimization.
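The full-adder equations and the ripple-carry chain translate directly into code. A bit-level sketch (not how the hardware is built, but the same logic):

```python
def full_adder(a, b, cin):
    """One-bit full adder: S = A^B^Cin, Cout = AB + Cin(A^B)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n=32):
    """Chain n full adders; each stage's carry-out feeds the next
    stage's carry-in, exactly as in the hardware chain."""
    carry, result = 0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry   # final carry out = unsigned overflow

assert ripple_carry_add(7, 9, 8) == (16, 0)
assert ripple_carry_add(0xFF, 1, 8) == (0, 1)  # wraps; carry out = 1
```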
ALU Design
Subtraction \(A - B\) is implemented as \(A + (-B) = A + \bar{B} + 1\) (two’s complement negation). An ALU performs the operation selected by a control signal:
| ALU control | Function |
|---|---|
| 0000 | AND |
| 0001 | OR |
| 0010 | add |
| 0110 | subtract |
| 0111 | set-on-less-than (slt) |
| 1100 | NOR |
The slt (set-on-less-than) operation computes \(A - B\) and examines the sign bit of the result: if negative, the output is 1; otherwise 0.
Logical shift operations (sll, srl) move bits left or right, filling vacated positions with zeros. These are implemented by barrel shifters. In MIPS, shifts are used for multiplication/division by powers of 2 and for bit manipulation.
Part 2: The MIPS Instruction Set Architecture
Overview
MIPS (Microprocessor without Interlocked Pipeline Stages) is a classic RISC (Reduced Instruction Set Computer) architecture used extensively in embedded systems and as an educational platform. It is a load-store architecture: arithmetic and logical operations work only on registers, and only lw and sw instructions access memory.
Key properties of MIPS:
- 32 general-purpose registers, each 32 bits wide.
- Fixed 32-bit instruction length.
- Three instruction formats: R-format, I-format, J-format.
- Memory is byte-addressed; words are 4 bytes and must be word-aligned.
Registers
MIPS has 32 registers named $0 through $31, with software conventions for their use:
| Register | Name | Use |
|---|---|---|
| $0 | $zero | Always 0; writes ignored |
| $1 | $at | Assembler temporary |
| $2–$3 | $v0–$v1 | Function return values |
| $4–$7 | $a0–$a3 | Function arguments |
| $8–$15 | $t0–$t7 | Temporaries (caller-saved) |
| $16–$23 | $s0–$s7 | Saved temporaries (callee-saved) |
| $28 | $gp | Global pointer |
| $29 | $sp | Stack pointer |
| $30 | $fp | Frame pointer |
| $31 | $ra | Return address |
Instruction Formats
R-format (Register): used for arithmetic and logical operations that take two register sources and write to a register destination.
| op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6) |
The op field is 0 for all R-format instructions; the funct field distinguishes them. rs and rt are source registers; rd is the destination. shamt is the shift amount (nonzero only for shift instructions).
I-format (Immediate): used for loads, stores, branches, and immediate arithmetic.
| op (6) | rs (5) | rt (5) | immediate (16) |
The 16-bit immediate is sign-extended to 32 bits. For lw/sw, the effective address is \(\text{rs} + \text{sign\_extend(immediate)}\). For beq/bne, the branch target is \(\text{PC}+4 + 4 \times \text{sign\_extend(immediate)}\) (PC-relative, word-offset).
J-format (Jump): used for unconditional jump.
| op (6) | target (26) |
The jump target address is formed by taking the upper 4 bits of \(\text{PC}+4\), concatenating the 26-bit target field, and appending 00 (word-aligned). This gives a 32-bit address, allowing jumps to any address within the same 256 MB region.
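The three formats can be exercised in software. This sketch packs an I-format instruction and computes branch and jump targets as described above; the register numbers come from the table earlier, and the `lw` opcode (35) is the standard MIPS value:

```python
def encode_i(op, rs, rt, imm):
    """Pack an I-format instruction: op(6) rs(5) rt(5) immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def branch_target(pc, imm16):
    """PC-relative: target = PC + 4 + 4 * sign_extend(offset)."""
    off = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return pc + 4 + 4 * off

def jump_target(pc, target26):
    """Pseudo-direct: upper 4 bits of PC+4, 26-bit field, then 00."""
    return ((pc + 4) & 0xF0000000) | (target26 << 2)

# lw $t0, 8($sp): op=35, rs=$sp=29, rt=$t0=8, offset=8
assert encode_i(35, 29, 8, 8) == 0x8FA80008
assert branch_target(0x00400010, 0xFFFF) == 0x00400010  # offset -1
assert jump_target(0x00400010, 0x0100000) == 0x00400000
```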
Key Instructions
The most important MIPS instructions covered in this course:
- add rd, rs, rt — rd = rs + rt (R-format; traps on overflow)
- sub rd, rs, rt — rd = rs - rt
- addi rt, rs, imm — rt = rs + sign_extend(imm)
- and, or, nor — bitwise operations (R-format)
- slt rd, rs, rt — rd = (rs < rt) ? 1 : 0
- slti rt, rs, imm — immediate version of slt
- lw rt, offset(rs) — load word from memory
- sw rt, offset(rs) — store word to memory
- beq rs, rt, offset — branch if equal
- bne rs, rt, offset — branch if not equal
- j target — unconditional jump
- jal target — jump and link (stores return address in $ra)
- jr rs — jump to address in register
- sll rd, rt, shamt — shift left logical
Addressing Modes
MIPS uses several addressing modes — ways an instruction specifies the location of an operand:
- Register addressing: the operand is in a named register (add $t0, $t1, $t2).
- Immediate addressing: a constant embedded in the instruction itself (addi $t0, $t1, 4).
- Base (displacement) addressing: effective address = base register + sign-extended offset (lw $t0, 8($sp)).
- PC-relative addressing: branch target = PC + 4 + 4 × offset (beq).
- Pseudo-direct addressing: jump target constructed from upper PC bits + 26-bit field + 00 (j).
Part 3: The Single-Cycle Datapath
Overview
The single-cycle datapath executes each instruction in exactly one clock cycle. The clock period must be long enough for the slowest instruction to complete. All functional units operate in parallel, and the result is registered at the end of the cycle. While simple, this is inefficient because fast instructions must wait for the cycle time set by the slowest instruction.
Functional Units
The datapath contains these major units:
- Program Counter (PC): a 32-bit register holding the address of the current instruction.
- Instruction memory: a read-only memory that outputs the instruction at the address given by PC.
- Register file: 32×32-bit registers with two read ports and one write port.
- Sign extender: extends the 16-bit immediate to 32 bits.
- ALU: performs arithmetic and logical operations; outputs a 32-bit result and a Zero flag.
- Data memory: read/write memory for lw/sw; controlled by MemRead and MemWrite.
- Adders: one adds PC + 4 (for sequential execution); another computes the branch target.
- Shift-left-2: multiplies the sign-extended branch offset by 4 (because branch targets are word-aligned).
Control Signals
The main control unit decodes the 6-bit opcode field and asserts the appropriate control signals:
| Signal | Purpose |
|---|---|
| RegDst | 0 = write to rt, 1 = write to rd |
| ALUSrc | 0 = second ALU operand is register rt, 1 = sign-extended immediate |
| MemToReg | 0 = write ALU result to register, 1 = write memory data to register |
| RegWrite | 1 = write to destination register |
| MemRead | 1 = read data memory |
| MemWrite | 1 = write data memory |
| Branch | 1 = enable branch logic |
| ALUOp | 2-bit code directing the ALU control unit |
The ALU control unit takes ALUOp and the 6-bit funct field (for R-format) to produce the 4-bit ALU control signal.
For R-format (ALUOp = 10): the funct field determines the operation (add, subtract, AND, OR, slt).
For lw/sw (ALUOp = 00): ALU always adds (to compute effective address).
For beq (ALUOp = 01): ALU subtracts (Zero flag used to detect equality).
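The two-level decoding can be written as a small lookup. A sketch assuming the standard MIPS funct codes (0x20 add, 0x22 sub, 0x24 AND, 0x25 OR, 0x2A slt):

```python
def alu_control(alu_op, funct):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and,
    for R-format instructions, the 6-bit funct field."""
    if alu_op == 0b00:               # lw/sw: add to form the address
        return 0b0010
    if alu_op == 0b01:               # beq: subtract, test the Zero flag
        return 0b0110
    return {0x20: 0b0010,            # add
            0x22: 0b0110,            # sub
            0x24: 0b0000,            # AND
            0x25: 0b0001,            # OR
            0x2A: 0b0111}[funct]     # slt

assert alu_control(0b00, 0) == 0b0010     # lw/sw always add
assert alu_control(0b10, 0x2A) == 0b0111  # R-format slt
```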
Branch Logic
For beq, the branch target is computed in parallel with the ALU comparison. The branch is taken if the ALU’s Zero output is 1 AND the Branch control signal is 1. The PCSrc multiplexor selects between PC+4 (not taken) and the branch target (taken).
Instruction Timing
In the single-cycle design, instruction completion times are determined by the critical path through each instruction’s data flow:
| Instruction | Critical path | Time |
|---|---|---|
| beq | IM → RegFile → ALU | 350 ps |
| R-format | IM → RegFile → ALU → RegFile | 400 ps |
| sw | IM → RegFile → ALU → DM | 550 ps |
| lw | IM → RegFile → ALU → DM → RegFile | 600 ps |
Since the clock period is fixed at 600 ps (set by lw, the slowest), all instructions take 600 ps regardless of their actual need. This wastes 250 ps for beq, 200 ps for R-format, and 50 ps for sw instructions.
Modifying the Datapath
The single-cycle datapath can be extended to support additional instructions:
Jump (j): Add a second MUX after the branch MUX to select between the branch target and the jump target. The jump target is formed by shifting the 26-bit field left by 2 and concatenating with the upper 4 bits of PC+4. A new Jump control signal is added.
Branch-not-equal (bne): Add an inverter on the Zero signal, and OR it with the BNE control signal to generate PCSrc for bne. Alternatively, add a separate MUX level. Only one new control line is needed.
Set register (SREG) and relative jump (JREL): Each new instruction type requires examining what new paths are needed in the datapath and adding MUXes and control lines accordingly, following the same systematic approach.
Part 4: The Multicycle and Pipelined Datapath
The Multicycle Datapath
In the multicycle datapath, each instruction is broken into multiple shorter steps, each taking one clock cycle. A shorter clock period can be used because each stage does less work. Functional units (ALU, adders) can be shared across stages. The five stages are:
- IF (Instruction Fetch): Fetch the instruction from instruction memory; PC ← PC + 4.
- ID (Instruction Decode / Register Fetch): Decode the instruction; read the two source registers; sign-extend the immediate.
- EX (Execute): ALU performs the operation (arithmetic, address computation, or comparison).
- MEM (Memory Access): For lw/sw, access data memory. For branches, the branch decision is made here (if branching is placed in the MEM stage). For other instructions, no memory operation.
- WB (Write Back): Write result back to the register file.
Intermediate pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) are added between stages to hold values as they pass through the pipeline. Each register stores all the information the later stages need — instruction fields, computed values, control signals.
With a clock period of 200 ps per stage, a lw instruction takes 5 cycles = 1000 ps — worse than the single-cycle design for a single instruction. The benefit emerges only when instructions overlap (pipelining).
Pipelining
Pipelining exploits instruction-level parallelism by overlapping the execution of multiple instructions. While instruction \(i\) is in the EX stage, instruction \(i+1\) is in ID and instruction \(i+2\) is in IF. In the ideal steady-state case (no hazards), the throughput is one instruction completed per clock cycle — a CPI (Cycles Per Instruction) of 1.
The key insight is that pipelining improves throughput (instructions per second), not latency (time to complete a single instruction). A pipelined lw still takes 5 clock cycles; but in steady state, one lw completes every cycle.
Control signals generated in the ID stage must be carried forward through the pipeline registers alongside the data, so that each instruction’s signals arrive at the right stage at the right time. The MEM/WB register carries WB-stage signals; the EX/MEM register carries EX/MEM/WB signals; and so on.
Pipeline Startup Time
For \(N\) instructions flowing through a 5-stage pipeline that starts empty:
\[\text{Total clock cycles} = N + 4\]
The +4 accounts for pipeline fill time (4 cycles at the start before the first instruction completes). This is sometimes called the pipeline startup overhead. In most problems, if the pipeline is described as “already running,” this startup cost is omitted.
Part 5: Pipeline Hazards
A hazard is any condition that prevents the next instruction from executing in the next clock cycle, forcing the pipeline to stall or take corrective action. There are three types: structural, data, and control.
Structural Hazards
A structural hazard occurs when two instructions need the same hardware resource at the same time. In the basic pipelined MIPS datapath, the key structural hazard is resolved by using separate instruction and data memories — if a single memory were used, an lw in the MEM stage and the fetch in the IF stage would conflict. With separate I-cache and D-cache, this structural hazard is eliminated.
Data Hazards
A data hazard occurs when an instruction depends on the result of a preceding instruction that has not yet written its result to the register file.
Forwarding (Bypassing)
The most important solution to data hazards is forwarding (also called bypassing): rather than waiting for the result to be written to the register file, it is forwarded directly from wherever it is in the pipeline to where it is needed. Two forwarding paths exist:
EX hazard: The instruction immediately following an arithmetic instruction needs the result. The result is available at the end of the EX stage (in the EX/MEM pipeline register). It can be forwarded directly to the ALU input of the next instruction (which is now in EX). The forwarding unit sets ForwardA = 10 (or ForwardB = 10) when:
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd ≠ 0
EX/MEM.RegisterRd = ID/EX.RegisterRs (for ForwardA)
MEM hazard: The instruction two behind needs the result. The result is available in the MEM/WB register. It is forwarded to the ALU input of the instruction now in EX. The forwarding unit sets ForwardA = 01 when:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd ≠ 0
MEM/WB.RegisterRd = ID/EX.RegisterRs
(and no EX hazard takes priority)
The forwarding unit compares register numbers from the pipeline registers and generates multiplexor select signals (ForwardA, ForwardB) that choose among: the normal register file output (00), the MEM/WB forwarded value (01), and the EX/MEM forwarded value (10).
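The priority between the two hazards can be captured directly in code. A sketch of the ForwardA logic (the dictionary field names are illustrative stand-ins for pipeline register fields):

```python
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Mux select for the ALU's first operand:
    10 = EX/MEM value, 01 = MEM/WB value, 00 = register file."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return 0b10                  # EX hazard takes priority
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return 0b01                  # MEM hazard
    return 0b00                      # no hazard: use register file output

# When both older instructions wrote the same register,
# the more recent (EX/MEM) result wins:
assert forward_a({"RegWrite": 1, "Rd": 8},
                 {"RegWrite": 1, "Rd": 8}, 8) == 0b10
```

The `Rd != 0` test encodes the $zero rule: register 0 is never a real destination, so it must never trigger forwarding.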
Load-Use Hazard
Forwarding cannot fully solve the load-use hazard: when a lw instruction is immediately followed by an instruction that uses the loaded value, the data is not available until the end of the MEM stage of the lw — which is too late to forward to the EX stage of the following instruction.
The only solution is to stall the pipeline for one clock cycle (insert a “bubble” or NOP). The hazard detection unit compares ID/EX.MemRead with the register numbers in the IF/ID register:
if (ID/EX.MemRead = 1) and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
then stall
When a stall is detected: the PC is prevented from advancing (the same instruction is re-fetched); the IF/ID register is preserved (the same instruction is re-decoded); and a NOP is inserted into the ID/EX register by zeroing all control signals. This effectively inserts one dead cycle, giving lw time to complete the MEM stage before its result is needed.
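The detection condition above, as a sketch (field names are illustrative stand-ins for pipeline register fields):

```python
def load_use_stall(id_ex, if_id):
    """Stall when the instruction in EX is a load (MemRead = 1) whose
    destination (rt) matches either source of the instruction in ID."""
    return bool(id_ex["MemRead"] and
                id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))

# lw $t0, 0($s0) immediately followed by add $t2, $t0, $t1: stall.
assert load_use_stall({"MemRead": 1, "Rt": 8}, {"Rs": 8, "Rt": 9})
# A non-load producer is handled by forwarding instead: no stall.
assert not load_use_stall({"MemRead": 0, "Rt": 8}, {"Rs": 8, "Rt": 9})
```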
The load-use hazard can be avoided by code rearrangement: if the compiler can find an independent instruction to place between the lw and its use, no stall is needed. Code rearrangement is always preferred over hardware stalls when possible. However, the load-use hazard is still classified as a data hazard even when stalling hardware is present.
Control Hazards
A control hazard (or branch hazard) occurs because the pipeline fetches instructions sequentially, but a branch may redirect control to a different address. By the time the branch outcome is known, several instructions following the branch may have already entered the pipeline.
Branch in MEM Stage
If the branch decision is made in the MEM stage (4th stage), three instructions following the branch have already been fetched and partially executed. These are called errant instructions. Two solutions:
- Branch flushing: Zero out the control bits in the IF/ID, ID/EX, and EX/MEM pipeline registers, effectively replacing the errant instructions with NOPs. The branch penalty is 3 clock cycles.
- Code rearrangement / NOPs: The compiler inserts three NOPs (or moves three independent instructions) after every branch. No hardware flushing needed, but code size increases.
Branch in ID Stage
If additional hardware computes the branch target and evaluates the branch condition in the ID stage (2nd stage), only one errant instruction has been fetched. The branch penalty drops to 1 cycle.
Solutions:
- Flushing: Zero the IF/ID register to kill the one errant instruction.
- NOP / code rearrangement: The compiler places one NOP or one independent instruction in the branch delay slot.
The branch delay slot is the instruction immediately following a branch, which always executes regardless of whether the branch is taken. The compiler tries to fill it with a useful instruction; if none is available, a NOP is inserted.
Branch Data Hazard
When the branch instruction in the ID stage needs a register value that is being produced by the immediately preceding instruction, a branch data hazard exists. With forwarding that only targets the EX stage, this cannot be resolved by forwarding alone for a branch in ID.
One stall is inserted between the producing instruction and the branch. Code rearrangement is the preferred solution. For example:
# Hazardous:
112 addi $1, $1, -1
116 bne  $1, $0, -5   # needs $1 immediately
# Solution via code rearrangement:
112 addi $1, $1, -1
116 addi $4, $4, 4    # independent instruction moved into the stall slot
120 bne  $1, $0, -5   # now safe
Pipeline Performance Analysis
\[\text{Total CC} = (\text{instruction count}) + (\text{load-use stalls}) + (\text{branch penalties}) + 4\]
The +4 is the pipeline startup cost. Branch penalties are 1 cycle per taken branch (with branching in ID and flushing), and load-use stalls are 1 cycle per load-use hazard.
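A worked example of the cycle-count formula (the instruction mix here is made up for illustration):

```python
def pipeline_cycles(n_instr, load_use_stalls, taken_branches, fill=4):
    """Total clock cycles for a 5-stage pipeline that starts empty,
    resolves branches in ID (1-cycle flush per taken branch), and
    pays 1 cycle per load-use hazard."""
    return n_instr + load_use_stalls + taken_branches + fill

# 100 instructions, 5 load-use hazards, 10 taken branches:
cc = pipeline_cycles(100, 5, 10)
assert cc == 119
# Effective CPI including fill; ignoring fill it would be 1.15.
assert round(cc / 100, 2) == 1.19
```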
\[\text{CPI} = 1 + (\text{load stall rate}) + (\text{branch stall rate})\]
Loop Unrolling (FYI)
Loop unrolling is a compiler technique that reduces branch overhead by replicating the loop body multiple times and reducing the number of branch instructions. For example, a loop that processes one element per iteration can be unrolled 4× to process four elements per iteration, reducing the number of branch checks by a factor of 4. This also exposes more independent instructions for the compiler to schedule to fill hazard slots.
Exceptions and Interrupts (FYI)
Exceptions (internal, e.g., arithmetic overflow, undefined instruction) and interrupts (external, e.g., I/O device signals) require the processor to transfer control to an exception handler. In MIPS, the Cause register records the type of exception, and the EPC (Exception Program Counter) saves the address of the offending instruction. The processor branches to a fixed exception handler address.
Vectored interrupts provide separate handler addresses for each interrupt type, avoiding a software dispatch table in the handler.
In the pipelined processor, when an exception occurs: (1) the current instruction and all subsequent instructions that have entered the pipeline are flushed (zeroed out); (2) the EPC is saved; (3) the Cause register is set; (4) the PC is set to the handler address. The flushing hardware sets all control signals to zero in the affected pipeline registers, effectively converting them to NOPs.
Part 6: Memory Hierarchy and Caches
The Memory Hierarchy
All modern computers face a fundamental tension: faster memories are more expensive per bit and smaller, while larger memories are slower and cheaper. The solution is the memory hierarchy — a series of storage levels with different speed/size/cost tradeoffs, arranged so that the most frequently accessed data is kept in the fastest, smallest levels closest to the processor.
A typical hierarchy from fastest to slowest:
- Registers — within the CPU, 32 × 32 bits, access in one clock cycle.
- L1 cache — on-chip SRAM, 32 KB typical, 1–4 cycle access, split instruction/data.
- L2 cache — on-chip or close SRAM, 256 KB–1 MB, 4–12 cycles.
- L3 cache — larger on-chip SRAM, several MB.
- Main memory (RAM) — DRAM, several GB, 50–100+ cycles.
- Secondary storage (disk/SSD) — nonvolatile, TB capacity, millions of cycles.
The hierarchy works because of the principle of locality:
- Temporal locality: an item accessed recently is likely to be accessed again soon.
- Spatial locality: items whose addresses are near a recently accessed item are likely to be accessed soon.
The terminology for hierarchy interactions: a block is the minimum unit of data transfer between levels; a hit occurs when the requested data is found in the upper (faster) level; a miss occurs when it is not and must be fetched from a lower level. The hit rate is the fraction of accesses that are hits; the miss rate = 1 − hit rate. The miss penalty is the additional time needed to fetch a block from the next lower level.
Direct-Mapped Cache
In a direct-mapped cache, each memory block maps to exactly one cache location:
\[\text{cache index} = (\text{block address}) \bmod (\text{number of blocks in cache})\]
Each cache entry has three fields:
- Valid bit: indicates whether the entry holds valid data (initially 0 at power-on).
- Tag: the upper bits of the original block address, stored alongside the data to verify a hit.
- Data: the block of data (one or more words).
When the processor requests address \(A\):
- Use the index bits of \(A\) to locate the cache entry.
- Check the valid bit; if 0, it is a miss.
- If valid, compare the tag field with the tag bits of \(A\). If they match, it is a hit — return the data from the data field.
- If the tag does not match, it is a miss — fetch the block from RAM, store it in the cache at this index (evicting whatever was there), update the tag, set valid bit to 1, and return the data.
For a 32-bit byte address (words are 4 bytes), with a cache of \(2^k\) blocks and block size of \(2^b\) words:
- The lowest \(b\) bits select the word within the block (block offset).
- The next \(k\) bits are the index.
- The remaining \(32 - k - b - 2\) bits form the tag (the 2 accounts for byte offset within a word).
The primary weakness of direct-mapped caches is conflict misses: two addresses that map to the same cache index will continually evict each other, causing poor performance even if the cache is otherwise large enough.
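A toy simulation makes the address split and the conflict-miss behavior concrete (class and parameter names are illustrative):

```python
def split_address(addr, k, b):
    """Split a 32-bit byte address into tag / index / block offset for
    a direct-mapped cache of 2^k blocks, 2^b words per block."""
    word = addr >> 2                    # drop the 2-bit byte offset
    offset = word & ((1 << b) - 1)      # word within the block
    index = (word >> b) & ((1 << k) - 1)
    tag = word >> (b + k)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self, k, b):
        self.k, self.b = k, b
        self.lines = {}                 # index -> stored tag (valid if present)

    def access(self, addr):
        tag, index, _ = split_address(addr, self.k, self.b)
        hit = self.lines.get(index) == tag
        self.lines[index] = tag         # on a miss, fetch and overwrite
        return hit

c = DirectMappedCache(k=4, b=2)   # 16 blocks of 4 words = 256 B
assert not c.access(0x0000)       # cold miss
assert c.access(0x0004)           # same block: spatial-locality hit
assert not c.access(0x0100)       # same index, different tag: conflict
assert not c.access(0x0000)       # evicted by the conflict: miss again
```

The last two accesses show the conflict-miss pathology: 0x0000 and 0x0100 ping-pong in the same line even though the rest of the cache is empty.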
Set-Associative Cache
A set-associative cache reduces conflict misses by allowing each memory block to map to any of \(W\) locations within a set. The cache is organized as \(2^k / W\) sets, each containing \(W\) ways (entries). A block maps to a unique set (determined by index bits) but can be placed in any of the \(W\) ways within that set.
On a lookup, the index selects the set, then all \(W\) tag comparators run in parallel to check all ways simultaneously. A hit is indicated if any way’s valid bit is set and tag matches.
- 1-way set associative = direct-mapped.
- 2-way set associative: each block can go to one of 2 locations in its set.
- 4-way set associative: one of 4 locations.
- Fully associative: a block can go anywhere in the cache; there is only one set.
As associativity increases, hit rates improve (fewer conflict misses) but lookup time and hardware cost increase (more comparators needed). The miss rate vs. associativity vs. cache size trade-off is empirical: real processors use 4-way or 8-way set-associative caches as a practical balance.
For a 32-bit address with an \(N\)-way set-associative cache of \(2^k\) blocks and block size \(2^b\) words:
- Block offset: \(b\) bits (+ 2 byte offset bits).
- Index: \(k - \log_2 N\) bits (fewer index bits as \(N\) increases, because there are fewer sets).
- Tag: remaining upper bits (more tag bits as \(N\) increases).
When fully associative, all bits except the block offset form the tag and there is no index field.
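The parallel tag check can be sketched as follows, with each set held as a list of (valid, tag) pairs; the names, set count, and contents are illustrative, not from the course:

```python
def sa_lookup(sets, block_addr):
    """N-way set-associative lookup: the index bits pick one set, then
    every way's (valid, tag) pair is compared -- in hardware, all the
    comparators run at once."""
    num_sets = len(sets)
    index = block_addr % num_sets      # index bits
    tag = block_addr // num_sets       # remaining upper bits
    hit = any(valid and t == tag for valid, t in sets[index])
    return hit, index, tag

# 2 sets x 2 ways; block address 6 maps to set 0 with tag 3:
sets = [[(1, 3), (0, 0)], [(1, 1), (1, 5)]]
hit, index, tag = sa_lookup(sets, 6)
```

With one set (`num_sets = 1`) this degenerates to a fully associative lookup: every entry's tag is compared and there is no index.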
Replacement Policies
When a miss occurs and all ways in a set are occupied, one entry must be evicted. The standard replacement policy is LRU (Least Recently Used): evict the entry that was last accessed furthest in the past. LRU approximates optimal replacement and works well due to temporal locality. For large associativity, approximate LRU using reference bits that are set on access and periodically cleared.
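LRU within one set can be modeled with an ordered dictionary whose order tracks recency; this is a toy sketch of the policy's behavior, not of how hardware implements it:

```python
from collections import OrderedDict

class LRUSet:
    """One set of a W-way cache; OrderedDict order is recency order,
    with the most recently used tag at the end."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)        # refresh recency on a hit
            return "hit"
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)     # evict least recently used
        self.tags[tag] = True
        return "miss"

s = LRUSet(ways=2)
results = [s.access(t) for t in [1, 2, 1, 3, 2]]  # 3 evicts 2, then 2 evicts 1
```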
Larger Block Sizes
Bringing a single word from RAM on a miss is inefficient. Instead, a larger block size exploits spatial locality by bringing in multiple consecutive words. With block size 4 words, accessing address \(A\) also brings in the three adjacent words. Subsequent accesses to those words are cache hits.
The cost of a larger block is slightly increased miss penalty (more data to transfer) but greatly reduced miss rate for sequential access patterns. Typical miss penalties:
- 1-word block: 100 cc (100 clock cycles to access RAM).
- 4-word block: 104 cc (small extra cost to transfer 4 words).
- 8-word block: 108 cc.
Average Memory Access Time (AMAT)
\[\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}\]
For example, with a hit time of 1 cc, a miss rate of 5%, and a miss penalty of 20 cc:
\[\text{AMAT} = 1 + 0.05 \times 20 = 2 \text{ cc}\]
Cache Miss Handling
On a read miss: stall the entire processor; fetch the block from RAM; place it in the cache; resume execution.
On a write (store):
- Write-through: every write goes to both the cache and RAM immediately, so RAM and the cache are always consistent. Disadvantage: every sw instruction causes a slow RAM write.
- Write-back: write only to the cache. A dirty bit is added to each cache entry to track whether the cached data has been modified but not yet written to RAM. When a dirty block is evicted, it is written back to RAM first. Write-back is more complex but much faster, since many writes to the same location require only one eventual RAM write.
The dirty bit adds overhead on cache miss with write-back: if the evicted block is dirty, it must be written to RAM (104 cc for a 4-word block) before the new block can be loaded.
Total Execution Time with Cache
\[\text{Total CC} = \text{Pipeline CC} + \text{Instruction cache miss penalty} + \text{Data cache miss penalty}\]
where:
- Pipeline CC accounts for instructions executed, load-use stalls, and branch penalties.
- Instruction cache misses: each miss adds the RAM access time (e.g., 104 cc per 4-word block miss). On the first pass through a loop, instruction cache misses occur; on subsequent iterations, the instructions are already in cache.
- Data cache misses: each lw or sw that causes a miss adds the RAM access time. With block size 4, accessing \(M\) unique data words that fall in \(M/4\) distinct blocks causes \(M/4\) misses.
Example analysis approach for a loop executing 20 iterations with 9 instructions starting at address 112, block size 4 words, and 104 cc RAM penalty:
- Instruction count: 3 + 6 × 20 + 1 = 124 (setup + loop body × iterations + branch not taken at end), adjusted for load-use stalls and branch penalty.
- Instruction cache misses: the 9 instructions occupy byte addresses 112 to 144 (9 words). With 4-word (16-byte) blocks, address 112 is block-aligned (112 = 7 × 16), so the instructions span the blocks starting at 112, 128, and 144 — 3 blocks. On the first pass: 3 instruction cache misses = 3 × 104 cc = 312 cc. On loop-back iterations, the instructions are already cached.
- Data cache misses:
lw $5, 0($4) accesses a different address on each iteration. With block size 4, four consecutive words fall in one block, so if the 20 addresses are consecutive words (stride 4 bytes) from a block-aligned base, they fall in 20 / 4 = 5 blocks → 5 data cache misses = 5 × 104 cc. If the stride is at least one block (16 bytes), every access falls in its own block → 20 misses = 20 × 104 cc.
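Cold-cache (compulsory) miss counts like these amount to counting distinct blocks touched. A quick check in Python; the data base address and strides here are assumptions for illustration:

```python
def blocks_touched(byte_addresses, block_bytes=16):
    """One cold-cache miss per distinct 4-word (16-byte) block touched."""
    return len({a // block_bytes for a in byte_addresses})

# 9 instructions at byte addresses 112, 116, ..., 144:
inst_misses = blocks_touched(range(112, 148, 4))

# 20 loads of consecutive words (stride 4 bytes) from a block-aligned base:
data_misses = blocks_touched(range(0, 20 * 4, 4))

# 20 loads with a stride of one full block (16 bytes):
strided_misses = blocks_touched(range(0, 20 * 16, 16))
```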
Part 7: Virtual Memory
Motivation
Virtual memory is the level of the memory hierarchy between RAM and disk. It serves two purposes:
- Allow programs to exceed physical RAM: the operating system keeps only the currently needed pages of a program in RAM; the rest remain on disk.
- Multiprogramming and protection: multiple programs can run simultaneously, each believing it has the entire address space. The OS isolates programs from each other using per-process address mappings.
Each process sees a large virtual address space (64-bit programs see \(2^{64}\) addresses). The physical address space is determined by the amount of RAM installed. The OS manages the mapping between them.
Pages and Page Tables
A page table maps virtual page numbers (VPN) to physical page numbers (PPN). Each process has its own page table, stored in RAM, and pointed to by the page table register (PTR). Each entry contains:
- Valid bit: 1 if the page is currently in RAM; 0 if it is on disk (triggering a page fault on access).
- Dirty bit: 1 if the page has been written since it was loaded from disk.
- Reference bit: used to approximate LRU for page replacement.
- Physical page number: the location in RAM where this page resides.
The page offset bits do not change between virtual and physical address because the page is mapped as a whole unit — only the page number changes.
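The pass-through of the offset is easy to see in a sketch; the page-table contents and the 4 KB page size below are illustrative assumptions:

```python
PAGE_SIZE = 4096                     # hypothetical 4 KB pages -> 12 offset bits

# hypothetical per-process page table: VPN -> (valid bit, PPN)
page_table = {0: (1, 5), 1: (0, None), 2: (1, 9)}

def translate(vaddr):
    """Replace the virtual page number with the physical one;
    the page offset is copied through unchanged."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    valid, ppn = page_table[vpn]
    if not valid:
        raise RuntimeError("page fault: page is on disk")
    return ppn * PAGE_SIZE + offset

paddr = translate(2 * PAGE_SIZE + 123)   # VPN 2 -> PPN 9, same offset 123
```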
Page Faults
A page fault occurs when the requested virtual page is not in RAM (valid bit = 0). The OS must:
- Choose a page to evict from RAM (using LRU, LFU, FIFO, or random replacement).
- If the evicted page is dirty, write it back to disk (write-back policy — write-through to disk is too expensive).
- Load the requested page from disk into the freed RAM location.
- Update the page table entry (set valid bit = 1, update PPN).
- Resume execution of the faulting instruction.
Page faults are expensive (disk access takes millions of cycles), so minimizing them is critical. Virtual memory always uses a write-back policy because write-through to disk would be prohibitively slow.
Translation Lookaside Buffer (TLB)
Every memory access requires a virtual-to-physical translation, which requires accessing the page table in RAM — adding one RAM access per instruction. This overhead is unacceptable.
The Translation Lookaside Buffer (TLB) is a small, fully associative cache storing recently used page table entries. It resides on-chip, close to the processor.
When the processor generates a virtual address:
- Check the TLB for the VPN.
- TLB hit: get the PPN directly. Then use the PPN to check the cache (and RAM if cache miss).
- TLB miss: access the page table in RAM (costs one RAM access). If valid (page in RAM), load the translation into the TLB and use the PPN. If invalid (page on disk): page fault — load the page from disk, update page table, update TLB.
A TLB hit only saves the page table lookup — the data itself still needs to be found in the cache or RAM. However, a TLB hit rules out a page fault, because TLB entries only reference pages that are already in RAM.
The valid bit in a TLB entry indicates whether the TLB entry is valid (not garbage data) — it does not indicate whether the page is in RAM vs. disk. The valid bit in the page table indicates whether the page is in RAM.
The Full Memory Access Flow
For every instruction fetch and every lw/sw:
TLB lookup (VA → PA):
- TLB hit → have PA.
- TLB miss → page table lookup in RAM → if page valid, load TLB; if page invalid, page fault → load from disk, update PT, load TLB.
Cache lookup (using PA):
- Cache hit → data returned; no RAM access needed.
- Cache miss → fetch block from RAM; update cache; return data.
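The flow above can be sketched with Python dicts standing in for each hardware structure; all contents are hypothetical, and the disk path is elided with an exception rather than modeled:

```python
def mem_access(vaddr, tlb, page_table, cache, ram, page_size=4096):
    """One memory access: TLB first, page table walk on a TLB miss,
    then the physically addressed cache, then RAM.
    Returns (data, events) where events lists the levels consulted."""
    events = []
    vpn, offset = divmod(vaddr, page_size)
    if vpn in tlb:
        events.append("tlb hit")               # translation without a PT lookup
        ppn = tlb[vpn]
    else:
        events.append("tlb miss")              # one extra RAM access for the PT
        valid, ppn = page_table[vpn]
        if not valid:
            raise RuntimeError("page fault")   # disk handling elided
        tlb[vpn] = ppn                         # cache the translation
    paddr = ppn * page_size + offset
    if paddr in cache:
        events.append("cache hit")
    else:
        events.append("cache miss")
        cache[paddr] = ram[paddr]              # fetch the block from RAM
    return cache[paddr], events

page_table = {0: (1, 3)}
tlb, cache = {}, {}
ram = {3 * 4096 + 8: 42}
first = mem_access(8, tlb, page_table, cache, ram)   # cold TLB and cache
second = mem_access(8, tlb, page_table, cache, ram)  # both now warm
```

Repeating the same access turns both misses into hits, which is exactly the locality the hierarchy relies on.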
Summary of possible combinations for a single memory access:
| TLB | Page Table | RAM (Cache) | Disk | Possible? |
|---|---|---|---|---|
| Hit | — | Hit (cache) | — | Yes (fastest path) |
| Hit | — | Miss (cache) | — | Yes (TLB hit, cache miss, go to RAM) |
| Miss | Hit | Hit (cache) | — | Yes (costly: PT lookup in RAM, but data in cache — rare) |
| Miss | Hit | Miss (cache) | — | Yes (PT lookup + RAM access) |
| Miss | Miss (fault) | — | Access | Yes (page fault, most expensive) |
| Hit | — | — | Access | No (TLB hit means page is in RAM) |
Process Switching
When the OS switches from one process (program A) to another (program B):
- Save the state of A: PC, all registers, and the PTR.
- Flush the TLB (its translations belong to A and are invalid for B).
- Flush the cache (optional; depends on whether the cache uses physical or virtual addressing).
- Load the state of B: restore PC, registers, and PTR for B.
- B begins with initially empty TLB and cache, gradually filling as it executes.
The page tables themselves remain in RAM throughout the switch; the PTR is the only thing that needs to change to select the correct page table.
Part 8: Other Architectures
Microprogramming
Microprogramming is a technique for implementing the control unit of a processor. Instead of hardwired combinational logic, the control unit runs a small program (the microprogram) for each assembly instruction. Each microinstruction corresponds to one set of control signals to the datapath (one clock cycle). Microinstructions are stored in a ROM.
The microprogram for an instruction specifies sequencing: FETCH (go decode next assembly instruction), NEXT (go to next microinstruction in sequence), or JUMP N (go to microinstruction N). Microprogramming makes it easy to add new instructions or fix bugs by changing ROM contents rather than redesigning hardware.
VAX is a classic microprogrammed architecture. It is a CISC (Complex Instruction Set Computer) with variable-length instructions (1 to 54 bytes), 304 instruction types, and instructions like MOVC3 (copy a string), INSQUE (insert into a queue), and POLY (evaluate a polynomial). These complex instructions simplify assembly programming and reduce code size, but make the hardware more complex.
RISC vs. CISC
RISC (Reduced Instruction Set Computer) architectures (like MIPS and SPARC) embrace a philosophy of simplicity:
- Small number of simple, fixed-length instructions.
- Load-store architecture (only lw/sw access memory).
- Large general-purpose register file.
- Single-cycle execution of most instructions.
- Heavy reliance on the compiler to optimize code.
CISC architectures (like VAX and Intel x86) provide many complex instructions with variable lengths and multiple addressing modes, aiming to reduce code size and simplify programming at the expense of more complex hardware.
Modern processors often blend both: Intel’s Pentium Pro and later processors decode complex x86 instructions into simpler micro-ops that execute on a RISC-like internal pipeline.
SPARC
SPARC is a RISC architecture designed to support object-oriented languages and efficient subroutine calls. It has 32 visible registers at any time, organized as a register window into a large register file of 40–520 registers.
On a subroutine call, the register window slides forward by 16: the caller’s output registers become the callee’s input registers, and the callee gets a fresh set of local and output registers. On return, the window slides back. This mechanism avoids saving and restoring registers on most calls, trading register-file area for performance.
SPARC uses fixed 32-bit instructions with 5 instruction formats, delayed branches (the instruction after a branch always executes — the branch delay slot), and 32 single-precision or 16 double-precision floating-point registers.
Intel x86 / Pentium
The Intel x86 architecture grew from the original 8086 (16-bit) through the 80386 (32-bit) to the modern 64-bit extensions. Key milestones:
- 80386 (1985): First 32-bit Intel processor.
- 80486 (1989): Added on-chip cache, instruction pipeline, and integrated math coprocessor.
- Pentium (1993): Superscalar microarchitecture (executes multiple instructions in parallel); split L1 instruction/data caches.
- Pentium Pro (1995): Added L2 cache; decodes complex x86 instructions into RISC-like micro-ops for internal execution.
- Pentium III (1999): Added SSE (SIMD single-precision floating-point); 10-stage pipeline.
- Pentium 4 (2000): Added SSE2; 20-stage pipeline; 4K-entry branch predictor; instruction trace cache stores decoded micro-ops; 4-way set-associative L1 data cache (8 KB) and 8-way L2 (256 KB, 128-byte lines).
The x86 instruction set is large: 48 move instructions, 12 binary arithmetic instructions, 57 string instructions, 48 control transfer instructions, and many more. Addressing modes include register, immediate, direct, register indirect, base + displacement, indexed, and complex combinations like [BX + DI + 2040H]. This rich addressing mode set is a defining feature of CISC architectures.
Modern Intel processors translate x86 instructions into micro-ops internally, allowing the RISC execution engine to benefit from pipeline techniques developed for simple fixed-length architectures, while maintaining full binary compatibility with the enormous installed base of x86 software.