ECE 327: Digital Hardware Systems

Andrew Boutros

Estimated study time: 1 hr 9 min

Sources and References

Primary references — Neil H. E. Weste & David Money Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed., Pearson, 2011; David Patterson & John Hennessy, Computer Organization and Design: The Hardware/Software Interface, 6th ed., Morgan Kaufmann, 2021. HDL reference — Stuart Sutherland, Simon Davidmann & Peter Flake, SystemVerilog for Design, 2nd ed., Springer, 2006. FPGA — Intel/Altera Quartus Prime Handbook; AMD/Xilinx Vivado Design Suite User Guide (UG910, UG901, UG835 — publicly available at xilinx.com/support). Online resources — MIT OCW 6.004 Computation Structures (open access); Synopsys Design Compiler User Guide; IEEE Std 1800-2017 (SystemVerilog Language Reference Manual).


Chapter 1: Introduction and the Digital Design Process

1.0 Historical Context

The trajectory of digital hardware over the past five decades is one of the most dramatic growth stories in engineering history. In 1971, Intel’s 4004 microprocessor contained 2,300 transistors switching at 740 kHz and delivered about 60,000 instructions per second. A modern Apple M4 chip contains over 28 billion transistors, operates at up to 4.4 GHz, and delivers trillions of operations per second across its CPU and Neural Engine cores. This 10-million-fold improvement in transistor count follows Moore’s Law — the empirical observation by Gordon Moore in 1965 that the number of transistors on a chip doubles roughly every two years as lithography technology advances.

Moore’s Law has slowed at the frontier of advanced nodes (below 5 nm) because transistors can no longer be made meaningfully smaller without quantum mechanical leakage becoming unmanageable. The industry has responded with heterogeneous integration (packaging multiple chiplets together), new device structures (FinFETs, Gate-All-Around FETs), and a renewed emphasis on architecture-level and microarchitecture-level optimisation. This shift is why ECE 327 exists: when you can no longer rely on faster transistors to improve performance, you must design smarter hardware. Pipelining, parallelism, specialised datapaths, efficient memory hierarchies, and low-power techniques are the tools that keep progress alive in the post-Dennard scaling era.

1.1 Why Digital Hardware?

Every computation you run — from a smartphone camera app to a neural-network accelerator in a data centre — eventually executes on digital hardware. Software engineers write code that a processor interprets, but the processor itself is a precisely designed arrangement of billions of transistors. Understanding how those transistors are organized into logic gates, how logic gates are composed into arithmetic units and finite-state machines, how those building blocks are coordinated with a clock, and how the resulting circuit is described, verified, and implemented in silicon or on an FPGA constitutes the subject matter of ECE 327.

The course sits at the intersection of three disciplines: electrical engineering (transistor physics and circuit behaviour), computer engineering (architecture and microarchitecture), and software engineering (hardware description languages, verification methodology). A practitioner who is fluent in all three is equipped to design custom silicon chips, FPGA-based accelerators, embedded controllers, and high-performance data-path engines — some of the most sought-after engineering roles in the industry.

1.1.1 Digital Abstraction

An analogue signal is a continuous-valued waveform that can take any voltage between the supply rails. A digital system replaces that continuum with a binary abstraction: any voltage above a threshold \( V_{IH} \) is interpreted as logic 1, and any voltage below \( V_{IL} \) is interpreted as logic 0. The region between \( V_{IL} \) and \( V_{IH} \) is the forbidden zone; a well-designed gate never operates in it during steady state.

This abstraction buys two enormous advantages. First, it provides noise immunity: small perturbations due to power-supply noise, thermal effects, or crosstalk cannot flip a bit as long as they stay within the noise margins \( NM_H = V_{OH} - V_{IH} \) and \( NM_L = V_{IL} - V_{OL} \). Second, it enables composability: the output of one gate directly drives the input of the next with no need for analogue level-shifting circuitry, because logic levels are regenerated by each stage.
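As a quick numeric illustration of the noise-margin definitions above, here is a small Python calculation. The threshold values are illustrative (classic TTL-compatible levels), not taken from any specific technology in the course:

```python
# Noise-margin calculation from the definitions above.
# Threshold values are illustrative TTL-compatible levels (assumed).
V_OL, V_OH = 0.4, 2.4   # worst-case output levels (V)
V_IL, V_IH = 0.8, 2.0   # input interpretation thresholds (V)

NM_L = V_IL - V_OL      # low-side noise margin
NM_H = V_OH - V_IH      # high-side noise margin

print(f"NM_L = {NM_L:.1f} V, NM_H = {NM_H:.1f} V")
```

Any noise spike smaller than these margins cannot corrupt a logic level, which is the quantitative content of the noise-immunity claim.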

1.1.2 Design Abstraction Levels

The journey from behaviour to silicon passes through several layers:

  • Behavioural / algorithmic level: what the circuit computes, expressed in pseudocode or a programming language.
  • Register-Transfer Level (RTL): computation expressed as operations on registers, using a hardware description language such as SystemVerilog. This is the primary design entry level in ECE 327.
  • Gate / netlist level: Boolean equations implemented as interconnected logic primitives (AND, OR, inverter, flip-flop).
  • Transistor / circuit level: individual transistors and their connections.
  • Layout / physical level: geometric polygons on silicon layers, ready for fabrication.

A logic synthesis tool (e.g., Synopsys Design Compiler, Cadence Genus) translates RTL to a gate netlist; a place-and-route tool (e.g., Cadence Innovus for ASIC, Vivado for FPGA) converts the netlist into a physical implementation. In this course the target technology is an FPGA, so synthesis and place-and-route are performed by the AMD/Xilinx Vivado tool suite.

1.2 The Hardware Development Process

Digital hardware design follows a methodology often summarised as design–verify–implement.

Specification captures functional and non-functional requirements: what computations the circuit must perform, at what clock frequency, with what latency and throughput, in what power budget, and on what target technology.

RTL Design is the act of writing SystemVerilog (or VHDL) that correctly implements the specification. The designer describes the circuit at the register-transfer level: which data are stored in which registers, what combinational logic transforms them each clock cycle, and what finite-state machine governs control decisions.

Functional Verification uses simulation to check that the RTL behaves correctly. A testbench — itself a SystemVerilog module — applies stimulus to the design under test (DUT) and checks the outputs against a reference model. In industrial practice, verification consumes more engineering effort than design.

Synthesis maps the RTL to a target technology library. For FPGAs the tool decomposes logic into look-up tables (LUTs), flip-flops, carry chains, and hard IP blocks. For ASICs it maps to a standard-cell library. The synthesiser also reports the resource count (area) and the estimated critical-path delay (timing).

Place and Route assigns each logic element a physical location and routes the wires between them. After routing, the tool performs static timing analysis (STA) to verify that every path satisfies setup and hold constraints at the target clock frequency.

Implementation and Testing for FPGAs means generating a bitstream and programming the board. The design is then tested in hardware, using either embedded logic analysers (e.g., Xilinx ILA) or external test equipment.

Co-simulation and hardware emulation. For very large designs it may be impractical to simulate entirely in software. Hardware emulators (e.g., Cadence Palladium) run the RTL on purpose-built hardware hundreds of times faster than software simulation, enabling full-system verification before tape-out. FPGAs are themselves widely used as emulation platforms in industry.

Chapter 2: SystemVerilog for Digital Hardware

2.1 Modules, Ports, and Hierarchy

SystemVerilog describes hardware as a collection of modules. A module is the fundamental unit of design hierarchy: it has a port list declaring its inputs and outputs, and a body describing its internal behaviour.

module adder #(parameter WIDTH = 8) (
    input  logic [WIDTH-1:0] a, b,
    input  logic             cin,
    output logic [WIDTH-1:0] sum,
    output logic             cout
);
    assign {cout, sum} = a + b + cin;
endmodule

The keyword logic declares a 4-state variable (0, 1, X for unknown, Z for high-impedance). The parameter construct allows the same module to be instantiated at different bit-widths, a crucial feature for writing reusable, parameterised hardware.

Hierarchy is created by instantiating one module inside another. The top-level module connects to physical board pins through the constraints file; all other modules are sub-designs instantiated within it.

2.2 Concurrent vs. Sequential Semantics

A critical conceptual shift from software to hardware is that hardware is inherently concurrent. All the logic in a circuit operates simultaneously, every clock edge. SystemVerilog captures this through two assignment styles.

Continuous assignments (assign) describe combinational (non-clocked) logic. The right-hand side is evaluated and driven continuously on the left-hand net whenever any input changes:

assign y = (a & b) | (~a & c);

Procedural blocks (always_ff, always_comb, always_latch) describe behaviour triggered by events. The synthesiser infers the appropriate hardware from the sensitivity list and the style of assignment.

  • always_ff @(posedge clk) — infer flip-flops (edge-triggered registers).
  • always_comb — infer purely combinational logic (no state, no latches).
  • always_latch — infer latches (level-sensitive state). Latches are usually unintentional in synchronous design and should be avoided.
always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
        q <= '0;
    else
        q <= d;
end

The non-blocking assignment <= inside always_ff is essential: all right-hand sides are evaluated before any left-hand sides are updated. This models the behaviour of real flip-flops, where all outputs change simultaneously at the clock edge, not sequentially.

Non-blocking vs. blocking assignment. Inside always_ff, use non-blocking (<=) assignments — they model register behaviour correctly. Inside always_comb, use blocking (=) assignments — they model combinational logic correctly. Mixing the two in a single block is a frequent source of subtle simulation/synthesis mismatches.

2.3 Data Types and Arithmetic

SystemVerilog supports both 2-state (bit, byte, int) and 4-state (logic, reg) types. For synthesisable RTL, logic is recommended because it can carry X (unknown) values during simulation, which helps catch uninitialised-register bugs.

Arithmetic operations (+, -, *, <<, >>) on logic vectors are synthesisable. The synthesiser instantiates appropriate hardware — ripple-carry adders, barrel shifters, or multipliers depending on context and optimisation settings.

Bit-selects and concatenation are fundamental:

logic [15:0] a, whole, ext;
logic [7:0]  hi, lo;

assign hi    = a[15:8];              // bit select (upper byte)
assign lo    = a[7:0];               // bit select (lower byte)
assign whole = {hi, lo};             // concatenation
assign ext   = {{8{lo[7]}}, lo};     // replication: sign-extend lo to 16 bits

2.4 Finite-State Machines in SystemVerilog

Finite-state machines (FSMs) are the canonical structure for control logic in digital design. An FSM has a finite set of states, transitions guarded by combinational conditions on inputs, and output values that depend on state (Moore machine) or on state and inputs (Mealy machine).

The recommended two-block coding style separates state register logic from next-state and output logic:

typedef enum logic [1:0] {
    IDLE = 2'b00,
    LOAD = 2'b01,
    PROC = 2'b10,
    DONE = 2'b11
} state_t;

state_t state, next_state;

// State register
always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) state <= IDLE;
    else        state <= next_state;
end

// Next-state and output logic
always_comb begin
    next_state = state;  // default: stay
    out        = 1'b0;   // default output

    case (state)
        IDLE: if (start) next_state = LOAD;
        LOAD: next_state = PROC;
        PROC: if (done_cond) begin
                  next_state = DONE;
                  out = 1'b1;
              end
        DONE: next_state = IDLE;
    endcase
end

Using an enum with named states makes the code self-documenting and prevents the designer from accidentally creating invalid binary encodings. The synthesiser chooses between binary, one-hot, and Gray-code state encodings based on optimisation objectives.

Example: Traffic light controller. Consider a two-road intersection. States are NS_GREEN, NS_YELLOW, EW_GREEN, EW_YELLOW. The FSM transitions on a timer input. The output is a 4-bit signal encoding the light colours for both roads. Because the output depends only on the current state (not on timer), this is a Moore machine, and the output logic is simply a lookup from state to colour pattern.
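The Moore property — output as a pure lookup from state — can be seen in a behavioural Python model of this FSM. This is a sketch for illustration, not the course's SystemVerilog; the `timer_expired` input and the colour tuples are assumptions consistent with the description above:

```python
# Behavioural model of the traffic-light Moore FSM described above.
# Next-state function: advance only when the timer expires.
NEXT = {"NS_GREEN": "NS_YELLOW", "NS_YELLOW": "EW_GREEN",
        "EW_GREEN": "EW_YELLOW", "EW_YELLOW": "NS_GREEN"}

# Moore output: a pure lookup from state to (NS colour, EW colour).
LIGHTS = {"NS_GREEN": ("green", "red"),  "NS_YELLOW": ("yellow", "red"),
          "EW_GREEN": ("red", "green"),  "EW_YELLOW": ("red", "yellow")}

def step(state, timer_expired):
    """One clock cycle: returns (next_state, output)."""
    nxt = NEXT[state] if timer_expired else state
    return nxt, LIGHTS[state]   # output depends on the current state only

state = "NS_GREEN"
for expired in [False, True, True]:
    state, out = step(state, expired)
print(state, LIGHTS[state])     # EW_GREEN ('red', 'green')
```

Because the output table never consults the timer, the same structure translates directly into the two-block SystemVerilog style shown earlier, with the output logic reduced to a case statement over the state register.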

2.5 Generate Constructs and Structural Hierarchy

SystemVerilog’s generate statement allows conditional and iterative instantiation of hardware at elaboration time (before simulation or synthesis). This is indispensable for parameterised designs:

module ripple_adder #(parameter N = 8) (
    input  logic [N-1:0] a, b,
    input  logic         cin,
    output logic [N-1:0] sum,
    output logic         cout
);
    logic [N:0] carry;
    assign carry[0] = cin;
    assign cout     = carry[N];

    genvar i;
    generate
        for (i = 0; i < N; i++) begin : fa_loop
            full_adder fa_inst (
                .a(a[i]), .b(b[i]), .cin(carry[i]),
                .sum(sum[i]), .cout(carry[i+1])
            );
        end
    endgenerate
endmodule

The generate for loop unrolls at elaboration time, creating \( N \) distinct full_adder instances named fa_loop[0].fa_inst through fa_loop[N-1].fa_inst. Each instance is a real hardware element in the netlist. This is structurally superior to a behavioural for loop inside an always_comb block when you want hierarchical visibility into each stage (for timing analysis or debug).

generate if enables conditional instantiation based on parameters — for example, generating a Wallace-tree multiplier for large widths and a simple array multiplier for small widths:

generate
    if (WIDTH > 16)
        wallace_mult #(.W(WIDTH)) m (.a, .b, .p);
    else
        array_mult  #(.W(WIDTH)) m (.a, .b, .p);
endgenerate

2.6 Testbenches and Simulation

A testbench is a non-synthesisable SystemVerilog module that wraps the DUT, generates stimulus, and checks results. It has no ports (it models the environment of the chip).

module tb_adder;
    logic [7:0] a, b;
    logic       cin;
    logic [7:0] sum;
    logic       cout;

    adder #(.WIDTH(8)) dut (.*);  // .* port connections

    initial begin
        // Test 1: simple addition
        a = 8'd10; b = 8'd20; cin = 1'b0;
        #10;
        assert (sum === 8'd30 && cout === 1'b0)
            else $error("Test 1 failed: sum=%0d cout=%0b", sum, cout);

        // Test 2: overflow
        a = 8'hFF; b = 8'h01; cin = 1'b0;
        #10;
        assert (sum === 8'h00 && cout === 1'b1)
            else $error("Test 2 failed");

        $finish;
    end
endmodule

The #10 delay advances simulation time by 10 time units; delays are not synthesisable, so they belong only in testbenches. The === operator performs 4-state comparison, distinguishing X and Z from 0 and 1. SystemVerilog’s assert and $error are powerful verification constructs; $monitor, $display, and $dumpvars provide printing and waveform capture for debugging.

2.6.1 Clocked Testbench Methodology

Most designs are synchronous and therefore require a clock generator in the testbench. The standard pattern drives the DUT through a series of clock-aligned stimulus–check pairs:

module tb_dut;
    logic       clk = 0, rst_n;
    logic [7:0] in_data, out_data, expected;
    logic       in_valid, out_valid;
    always #5 clk = ~clk;  // 10-unit period (100 MHz equivalent)

    // DUT instantiation omitted for brevity

    // Apply stimulus on the rising edge, check on the same or next edge
    initial begin
        rst_n = 0;
        @(posedge clk); #1; // small hold past edge for signal stability
        rst_n = 1;
        // drive stimulus
        @(posedge clk); #1;
        in_data  = 8'hAB;
        in_valid = 1'b1;
        // wait for output to be valid
        @(posedge clk iff out_valid);
        assert (out_data === expected) else $fatal(1, "Mismatch");
        $finish;
    end
endmodule

The #1 hold time after the clock edge ensures that the testbench applies stimulus after flip-flop outputs settle, which mimics real-world timing and avoids race conditions in simulation.

2.6.2 Coverage-Driven Verification

In production verification, functional coverage is used to measure how thoroughly a testbench has exercised the design’s behaviour space. SystemVerilog’s covergroup construct defines bins of interest:

covergroup cg_adder @(posedge clk);
    cp_a: coverpoint a {
        bins zero     = {0};
        bins mid      = {[1:254]};
        bins max_val  = {255};
    }
    cp_carry: coverpoint cout;
    cx_carry_a: cross cp_a, cp_carry;
endgroup

When the coverage model shows 100% bin coverage for the cross of carry-out with input range, the team has confidence that all interesting combinations were exercised. Modern verification methodologies (UVM — Universal Verification Methodology) build on these constructs to create modular, reusable testbench environments for large system-level designs.


Chapter 3: Pipelining and Performance Analysis

3.1 Throughput, Latency, and Initiation Interval

Two metrics govern the performance of a digital circuit:

Latency is the time from when the first input of a computation is presented to when the final output is produced. For a purely combinational implementation of an \( n \)-bit adder, the latency equals the propagation delay of the longest logic path.

Throughput is the rate at which results are produced, measured in operations per unit time (or per clock cycle). For a non-pipelined combinational circuit, throughput is \( 1/T_{critical} \) where \( T_{critical} \) is the critical-path delay. For a pipelined circuit, throughput approaches \( 1/T_{clk} \) as pipeline depth increases, where \( T_{clk} \) is the clock period.

The initiation interval (II) is the number of clock cycles between accepting successive inputs. For a fully pipelined circuit, II = 1 (a new input can be accepted every cycle). For a circuit with resource contention, II > 1.

3.2 Pipelining Principles

Pipelining inserts registers between stages of a computation so that each stage can work on a different piece of data simultaneously. Consider a three-stage pipeline with stages \( S_1, S_2, S_3 \):

\[ \text{Latency}_{\text{pipelined}} = 3 \times T_{clk} \]\[ \text{Throughput}_{\text{pipelined}} = 1 / T_{clk} \]

whereas the equivalent non-pipelined circuit has latency \( T_{S_1} + T_{S_2} + T_{S_3} \) and throughput \( 1/(T_{S_1} + T_{S_2} + T_{S_3}) \).

The pipeline throughput improvement factor approaches the number of stages \( k \) when the stages have equal delay. Unequal stage delays waste the potential speedup because the clock period is constrained by the slowest stage. Good pipeline design balances stages by retiming — moving logic across register boundaries to equalise delay.
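The effect of unbalanced stages is easy to quantify. The following Python calculation uses made-up stage delays and register overhead, chosen only for illustration:

```python
# Numeric illustration of the pipelining equations above.
# Stage delays (ns) and register overhead are assumed values.
stages = [3.0, 5.0, 4.0]          # T_S1, T_S2, T_S3
t_reg  = 0.5                      # clk-to-Q + setup overhead per register

# Non-pipelined: one long combinational path.
t_comb = sum(stages)              # 12.0 ns latency and clock period
throughput_flat = 1 / t_comb

# Pipelined: the clock is set by the slowest stage plus register overhead.
t_clk = max(stages) + t_reg       # 5.5 ns
latency_pipe = len(stages) * t_clk
throughput_pipe = 1 / t_clk

print(f"non-pipelined: latency {t_comb} ns, throughput {throughput_flat:.3f}/ns")
print(f"pipelined:     latency {latency_pipe} ns, throughput {throughput_pipe:.3f}/ns")
```

The speedup here is 12/5.5 ≈ 2.2×, short of the ideal 3× for three stages, precisely because the 5 ns stage dominates the clock period — the motivation for retiming.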

Critical path. The critical path is the longest delay path from any register output (or primary input) to any register input (or primary output) through purely combinational logic. The maximum clock frequency is \[ f_{max} = (T_{cp} + T_{setup} + T_{skew})^{-1} \]

where \( T_{cp} \) is the combinational delay, \( T_{setup} \) is the flip-flop setup time, and \( T_{skew} \) is the clock skew. Identifying and reducing the critical path is the central task of timing optimisation.

3.3 Pipeline Hazards

When pipelining a datapath that has dependencies between consecutive operations, hazards arise:

Data hazards occur when a later instruction needs a result that an earlier instruction has not yet written to the register. In hardware pipelines (unlike software pipelines running on a processor), the designer must explicitly forward intermediate results or insert pipeline stalls.

Structural hazards occur when multiple pipeline stages require the same hardware resource simultaneously. The solution is either to duplicate the resource or to stall one stage.

Control hazards occur when the pipeline must change direction (e.g., due to a conditional branch). In processor pipelines this leads to branch prediction and pipeline flushing; in custom hardware pipelines it typically manifests as needing to invalidate or drain in-flight data.

3.3.1 Forwarding and Stall Logic in Custom Pipelines

In a custom pipelined datapath (not a general-purpose processor), data hazards appear whenever a computation in stage \( k \) depends on a result produced by a computation that is currently in stage \( k + m \) for \( m \geq 1 \). The designer must trace every producer–consumer pair and insert forwarding wires (MUXes fed from intermediate pipeline registers) or issue stalls.

The SystemVerilog pattern for a forwarding MUX in a 3-stage pipeline is:

// In the execute stage, select between the register file output and
// a forwarded value from a later-in-pipe stage
always_comb begin
    // Forward from EX/MEM pipeline register
    if (ex_mem_we && ex_mem_rd == id_ex_rs)
        alu_src_a = ex_mem_result;
    // Forward from MEM/WB pipeline register
    else if (mem_wb_we && mem_wb_rd == id_ex_rs)
        alu_src_a = mem_wb_result;
    else
        alu_src_a = id_ex_rdata_a;
end

The conditions compare destination and source register addresses with the equality operator. This forwarding logic adds a MUX delay in front of the ALU inputs, so the designer must ensure that the forwarding path does not itself become the critical path of the EX stage.

3.4 Latency-Insensitive Design

In complex systems with multiple pipelined components, the communication latency between components may vary depending on placement during physical implementation. Latency-insensitive design (LI design) decouples the functional correctness of a circuit from communication latency using handshaking protocols.

The standard LI handshake uses two control signals in addition to the data bus:

  • valid: asserted by the sender when data on the bus is valid.
  • ready (or stall): asserted by the receiver when it can accept data.

A transaction occurs when both valid and ready are asserted simultaneously. This is the backbone of the AXI-Stream and AXI4 protocols widely used in FPGA and ASIC design.
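The transaction rule — data moves only on cycles where both valid and ready are high — can be sketched cycle by cycle in Python. The stall pattern below is made up for illustration:

```python
# Cycle-by-cycle sketch of the valid/ready handshake described above.
to_send = [0xA, 0xB, 0xC]
ready_pattern = [False, True, True, False, True, True]  # receiver stalls twice

sent, received = 0, []
for cycle, ready in enumerate(ready_pattern):
    valid = sent < len(to_send)      # sender asserts valid while it has data
    if valid and ready:              # transaction: both high on the same cycle
        received.append(to_send[sent])
        sent += 1

print(received)   # all three words delivered despite the stall cycles
```

No data is lost or duplicated regardless of when the receiver stalls — the property that makes the handshake latency-insensitive.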

FIFO-based latency-insensitive dataflow. A FIFO buffer between two pipeline stages decouples their rates. The upstream stage writes when the FIFO is not full (not_full signal asserted); the downstream stage reads when the FIFO is not empty (not_empty asserted). The FIFO absorbs transient rate mismatches. In Vivado, a shallow FIFO (e.g., 16 entries) can be inferred from the synthesis template or instantiated as the FIFO18E2 primitive.

Chapter 4: Datapath Design — Arithmetic Circuits

4.1 Binary Adders

Addition is the most fundamental arithmetic operation. Its implementation illustrates the central trade-off in digital design: hardware cost (area, power) versus performance (speed).

4.1.1 Ripple-Carry Adder

A ripple-carry adder (RCA) chains \( n \) full-adder cells, passing the carry from each stage to the next. The full-adder equations for bit position \( i \) are:

\[ s_i = a_i \oplus b_i \oplus c_i \]\[ c_{i+1} = (a_i \cdot b_i) + (b_i \cdot c_i) + (a_i \cdot c_i) \]

where \( c_i \) is the carry-in and \( c_{i+1} \) is the carry-out of position \( i \).

The critical path through an \( n \)-bit RCA runs through all \( n \) carry stages, giving latency proportional to \( n \). For a 64-bit adder the carry chain spans 64 gate delays — a significant fraction of the critical path in modern arithmetic-intensive designs.

4.1.2 Carry-Lookahead Adder

The carry-lookahead adder (CLA) eliminates the serial carry propagation by computing carry signals in parallel using generate and propagate terms:

\[ G_i = a_i \cdot b_i \quad \text{(generate: carry is produced regardless of } c_i\text{)} \]\[ P_i = a_i \oplus b_i \quad \text{(propagate: carry-in is passed to carry-out)} \]

The carry into position \( i+1 \) can then be expressed without waiting for earlier carries:

\[ c_{i+1} = G_i + P_i \cdot c_i \]

For a 4-bit group:

\[ c_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 \cdot c_0 \]

Flattened lookahead expressions like this grow impractically wide beyond a few bits, so practical CLAs apply the lookahead hierarchically over groups, giving an overall logic depth of \( O(\log n) \) and reducing the adder latency to logarithmic in the word width. The cost is additional logic for the lookahead generation circuitry.
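The generate/propagate relations can be checked bit by bit in Python. Note that this model computes the carries iteratively for clarity; the hardware CLA flattens the same recurrence into parallel logic:

```python
# Bit-level check of the G/P carry recurrence above against ordinary addition.
def cla_add(a, b, n=4, c0=0):
    g = [(a >> i & 1) & (b >> i & 1) for i in range(n)]   # G_i = a_i . b_i
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(n)]   # P_i = a_i XOR b_i
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))                    # c_{i+1} = G_i + P_i c_i
    s = sum((p[i] ^ c[i]) << i for i in range(n))         # s_i = P_i XOR c_i
    return s, c[n]                                        # (sum, carry-out)

for a in range(16):
    for b in range(16):
        s, cout = cla_add(a, b)
        assert (cout << 4) | s == a + b
print("carry-lookahead matches a + b for all 4-bit operand pairs")
```

The exhaustive loop confirms that the recurrence \( c_{i+1} = G_i + P_i c_i \) reproduces binary addition exactly.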

4.1.3 Carry-Select Adder

A carry-select adder precomputes two versions of a group of adder bits — one assuming carry-in = 0, another assuming carry-in = 1 — and then uses a multiplexer to select the correct result once the actual carry-in is known. The latency is limited by the time to propagate carries across groups (logarithmic) plus one MUX delay per group, with the advantage that the redundant computation is simple and fast.

FPGA carry chains. Modern FPGAs (Xilinx 7-series, UltraScale) implement a dedicated carry chain using hard carry-logic primitives (CARRY4, CARRY8) that run vertically through the fabric. These chains are far faster than carry logic built from LUTs because the signal traverses dedicated wiring. Vivado's synthesis engine automatically maps + operations to these primitives; manually writing carry-chain code rarely improves on the tool's results.

4.2 Subtraction and Signed Arithmetic

Two’s complement representation is the universal convention for signed integers in digital hardware. The negation of an \( n \)-bit two’s complement number \( A \) is \( -A = \overline{A} + 1 \) (bitwise invert plus one). Subtraction \( A - B \) is implemented as \( A + \overline{B} + 1 \), so a single adder with the carry-in set to 1 and \( B \) inverted performs both addition and subtraction depending on a control signal — this is the standard ALU subtractor circuit.
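The adder/subtractor identity is worth verifying numerically: within an \( n \)-bit datapath, \( A + \overline{B} + 1 \) equals \( A - B \) modulo \( 2^n \). A minimal Python sketch:

```python
# The shared adder/subtractor circuit described above, modelled in Python.
N = 8
MASK = (1 << N) - 1

def addsub(a, b, sub):
    """One adder does both ops: invert B and set carry-in when subtracting."""
    b_in = (b ^ MASK) if sub else b           # conditional bitwise invert of B
    return (a + b_in + (1 if sub else 0)) & MASK

assert addsub(10, 3, sub=1) == 7
assert addsub(3, 10, sub=1) == (3 - 10) & MASK   # wraps: -7 mod 256 = 249
assert addsub(200, 100, sub=0) == (200 + 100) & MASK
print("A + ~B + 1 reproduces A - B modulo 2^N")
```

In hardware, `sub` drives both the XOR gates on the B input and the adder's carry-in, which is exactly the standard ALU subtractor circuit.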

4.3 Multiplication

4.3.1 Array Multiplier

The schoolbook long-multiplication algorithm generates \( n \) partial products (each the multiplicand ANDed with one multiplier bit and shifted left by that bit’s position) and sums them. An array multiplier implements this directly: \( n \) rows of AND gates produce the partial products, and \( n - 1 \) rows of adders sum them. Latency is \( O(n) \) and area is \( O(n^2) \).

4.3.2 Wallace Tree Multiplier

A Wallace tree reduces the \( n \) partial products in parallel using a tree of carry-save adders (CSAs). A CSA takes three \( n \)-bit inputs and produces two \( n \)-bit outputs (a sum vector and a carry vector) in constant time (one full-adder level). By arranging CSAs in a tree, the \( n \) partial products are reduced to two vectors in \( O(\log n) \) CSA levels, which are then combined with a single fast adder. Wallace tree latency is therefore \( O(\log n) \); the full-adder count remains \( O(n^2) \), comparable to the array multiplier, but the wiring is irregular and considerably harder to lay out.
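The CSA itself is just a column of full adders with no carry chain, which makes it two lines of bitwise Python. This sketch checks the invariant a Wallace tree relies on:

```python
# A carry-save adder: three inputs reduce to a sum vector and a carry
# vector with no carry propagation between bit positions.
def csa(x, y, z):
    s = x ^ y ^ z                    # per-bit sum (full-adder sum output)
    c = (x & y) | (y & z) | (x & z)  # per-bit carry (full-adder carry output)
    return s, c << 1                 # carry vector is weighted one bit higher

# Invariant: sum vector + carry vector equals the true three-input sum.
for trio in [(5, 9, 14), (255, 255, 255), (0, 1, 2)]:
    s, c = csa(*trio)
    assert s + c == sum(trio)
print("sum vector + carry vector == x + y + z")
```

Because each CSA level is a single full-adder delay regardless of width, stacking them in a tree is what gives the Wallace multiplier its logarithmic reduction depth.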

4.3.3 Booth Encoding

Booth’s algorithm recodes the multiplier in a radix-4 signed-digit representation. Every two bits of the multiplier (with one bit of overlap) produce one of the digits \( \{-2, -1, 0, +1, +2\} \). This halves the number of partial products from \( n \) to \( n/2 \), reducing the Wallace tree size and improving performance, particularly for wide operands. Modified Booth encoding (MBE) is standard in commercial multiply units.

DSP blocks on FPGAs. Xilinx UltraScale FPGAs contain DSP48E2 hard blocks, each implementing a 27×18 two's-complement multiplier with a pre-adder and a 48-bit ALU/accumulator, all internally pipelined. A 16×16 multiply-accumulate (MAC) operation maps entirely into a single DSP block, consuming no LUT resources. Arithmetic-intensive designs (FIR filters, matrix engines) are mapped almost entirely to these blocks, achieving throughputs of hundreds of GOPS.

Chapter 5: Memory Systems

5.1 Static RAM (SRAM)

Static RAM is the workhorse of on-chip memory. Each bit is stored in a cross-coupled inverter pair (a bistable latch): two inverters whose input and output are swapped so that each holds the other in a stable state. Two pass transistors connect the storage node to a bit-line pair (BL, BL_bar) when the word-line (WL) is raised.

A standard 6-transistor (6T) SRAM cell consists of two NMOS access transistors and two cross-coupled CMOS inverters (four transistors). Cell area is roughly 140–200 \( F^2 \) (where \( F \) is the minimum feature size), making SRAM about 10–20 times larger per bit than DRAM but capable of sub-nanosecond access times with no need for periodic refresh.

Read operation: both bit-lines are precharged to \( V_{DD} \); the word-line is raised; the cell differentially discharges one bit-line toward ground; a sense amplifier detects the small differential voltage (typically 100 mV) and amplifies it to full logic swing.

Write operation: the write driver forces the bit-lines to the new value strongly enough to overpower the cell's internal feedback and flip the storage nodes; the write drivers must therefore be sized to win against the cell's pull-up transistors.

5.1.1 SRAM Timing: Read and Write Cycle

SRAM access is governed by two timing parameters. The read access time \( t_{AA} \) is the delay from address stable to data valid on the outputs. The write cycle time \( t_{WC} \) specifies the minimum time the write-enable must be asserted together with stable address and data. In synchronous SRAM (typical in FPGA BRAMs and embedded SRAM macros), address and control are registered on the rising clock edge, so the effective latency is one clock cycle (for a simple synchronous read) or more for pipelined SRAM.

On-chip SRAM read timing budget:

\[ T_{clk} \geq T_{clk\text{-}to\text{-}q,\,addr\,reg} + T_{decoder} + T_{bitline} + T_{sense\,amp} + T_{setup,\,output\,reg} \]

Each term has a physical origin: the address register’s clock-to-Q delay launches the address; the row decoder asserts the correct word-line; the bit-line pair differentially discharges; the sense amplifier regenerates the full swing; the output register captures the result. Minimising \( T_{bitline} \) requires short bit-lines (small memory arrays) and large precharge transistors.
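Summing the budget with concrete numbers makes the trade-off tangible. The delays below are assumed values for illustration, not measurements from any real SRAM macro:

```python
# The SRAM read-timing budget above, with assumed component delays (ps).
budget = {
    "clk_to_q_addr_reg": 80,
    "row_decoder":       150,
    "bitline_discharge": 220,   # dominant term: shrinks with shorter bit-lines
    "sense_amplifier":   120,
    "output_reg_setup":  60,
}
t_min = sum(budget.values())    # minimum clock period, ps
f_max_ghz = 1e3 / t_min         # 1000 ps per ns -> GHz

print(f"T_clk >= {t_min} ps  =>  f_max ~= {f_max_ghz:.2f} GHz")
```

With these numbers the bit-line discharge is the largest single term, which is why large memories are banked into several small arrays rather than built as one tall column.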

5.2 Memory in FPGA Designs

Xilinx 7-series and UltraScale FPGAs contain block RAM (BRAM) hard tiles — true synchronous dual-port memories with configurable aspect ratios, up to 36 Kb per tile. Vivado’s inference engine recognises the standard RTL patterns for BRAM:

// Simple dual-port BRAM: one write port, one read port
always_ff @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    rdata <= mem[raddr];
end

Writing rdata <= mem[raddr] (registered read) infers a synchronous read, matching BRAM behaviour. An unregistered assign rdata = mem[raddr] infers distributed RAM built from LUT RAM (LUTRAM) — smaller capacity but combinational (zero-cycle) read access.

UltraScale also provides UltraRAM (URAM) tiles of 288 Kb each with a fixed 72-bit data width, suited for very large lookup tables or streaming buffers.

5.3 Register Files

A register file is a small, multi-ported memory array typically implemented from flip-flops or LUTRAMs. A standard 32-entry, 32-bit register file with two read ports and one write port appears in every pipelined processor. In SystemVerilog:

logic [31:0] regs [0:31];

always_ff @(posedge clk)
    if (we && (waddr != 5'b0)) regs[waddr] <= wdata;

assign rdata1 = regs[raddr1];
assign rdata2 = regs[raddr2];

The condition waddr != 5'b0 enforces the RISC convention that register 0 is hardwired to zero. Because register 0 is never written, regs[0] must also be initialised to zero (or the read ports muxed) so that reads of register 0 return 0 rather than X.

5.4 FIFOs

A FIFO (first-in, first-out) queue is an essential building block for producer–consumer interfaces. It is implemented as a circular buffer in memory with two independent pointers: a write pointer (enqueue side) and a read pointer (dequeue side). The FIFO is empty when the pointers are equal and full when the write pointer would lap the read pointer (requiring one extra bit for disambiguation in binary counter FIFOs).

For asynchronous FIFOs (read and write clocks are different domains), the pointers must be synchronised using Gray-code counters — Gray code ensures that only one bit changes per pointer increment, preventing metastability from sampling a partially-updated binary counter across the clock-domain crossing.
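The single-bit-change property is easy to demonstrate. Binary-to-Gray conversion is one XOR, and a short Python check confirms that every pointer increment, including the wrap, flips exactly one bit:

```python
# Binary-to-Gray conversion and the property the asynchronous FIFO relies on.
def bin_to_gray(b):
    return b ^ (b >> 1)

# Every increment of a 4-bit pointer flips exactly one Gray-code bit,
# including the wrap from 15 back to 0.
for i in range(16):
    diff = bin_to_gray(i) ^ bin_to_gray((i + 1) % 16)
    assert bin(diff).count("1") == 1
print("one bit changes per increment -- safe to sample mid-transition")
```

A synchroniser that samples a Gray-coded pointer mid-transition therefore sees either the old value or the new value, never an inconsistent mixture, which is what makes the cross-domain pointer comparison safe.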

5.5 Clock Domain Crossing

When data passes between two clock domains (two independent clocks with no fixed phase relationship), there is a non-zero probability that a flip-flop in the receiving domain captures a value during its metastable window — a brief period around the clock edge where the outcome is undetermined. If the flip-flop enters a metastable state, its output will eventually resolve to a valid 0 or 1, but the resolution time is exponentially distributed and may exceed one clock period.

The standard mitigation is a two-flip-flop synchroniser: two flip-flops clocked by the destination domain are placed in series on the crossing signal. The first flip-flop may enter a metastable state, but the probability that it fails to resolve within one destination clock period is extremely small (MTBF measured in millions of years for typical devices). The second flip-flop always captures a valid resolved value.

// Two-FF synchroniser for single-bit control signals
module sync_ff #(parameter STAGES = 2) (
    input  logic clk_dst, d,
    output logic q
);
    // ASYNC_REG marks the chain as a synchroniser so Vivado places the
    // flip-flops adjacently, minimising routing delay between stages.
    (* ASYNC_REG = "TRUE" *) logic [STAGES-1:0] pipe = '0;

    always_ff @(posedge clk_dst)
        pipe <= {pipe[STAGES-2:0], d};

    assign q = pipe[STAGES-1];
endmodule

Multi-bit data buses require special treatment — it is not sufficient to synchronise each bit independently, because different bits may resolve in different destination clock cycles and produce inconsistent (partially-updated) data. The correct approaches are: use a Gray-coded counter (for pointers), use a request-acknowledge handshake, or pass multi-bit data through an asynchronous FIFO.


Chapter 6: FPGA Architecture

6.1 The FPGA Paradigm

A field-programmable gate array (FPGA) is an integrated circuit containing a large array of configurable logic blocks (CLBs), programmable interconnect, and a variety of hard IP blocks (memories, DSP units, transceivers, PCIe controllers), all configured by loading a bitstream into internal configuration memory (SRAM cells). Unlike an ASIC, an FPGA can be reprogrammed in the field — hence the name.

FPGAs bridge the gap between software (flexible but slow) and ASICs (fast and efficient but expensive and inflexible). They are widely used for prototyping, low-volume products, and applications where algorithms evolve post-deployment, and increasingly as accelerators for machine learning inference and scientific computing.

6.2 Look-Up Tables

The fundamental logic primitive in an FPGA is the look-up table (LUT). A \( k \)-input LUT is a small SRAM with \( 2^k \) bits of storage and \( k \) address inputs. Because any Boolean function of \( k \) variables can be expressed as a \( 2^k \)-entry truth table, a LUT can implement any \( k \)-input Boolean function. Xilinx 7-series and UltraScale FPGAs use 6-input LUTs (LUT6), each storing 64 configuration bits.

LUT as function generator. A LUT6 with inputs \( a_5, \ldots, a_0 \) addresses a 64-bit configuration SRAM. During operation the configuration bits act as programmable truth-table entries. The output \( f = \text{LUT}[a_5 a_4 a_3 a_2 a_1 a_0] \) (reading the addressed bit) implements any Boolean function of six variables in a single logic level.

Synthesis tools decompose complex Boolean functions into networks of LUTs using technology mapping algorithms (e.g., FlowMap). The key figure of merit is the LUT depth of the critical path, since each LUT adds one stage of delay.

6.3 Flip-Flops and Slices

Each LUT in Xilinx FPGAs is paired with D flip-flops. In 7-series devices, four LUT6s and eight flip-flops (plus carry-chain logic and multiplexer logic) form a slice. Slices tile the FPGA fabric; the ratio of flip-flops to LUTs is 2:1 (two FFs per LUT). For a highly registered design (deep pipeline) the flip-flop count may be the binding constraint; for a logic-intensive design with many combinational levels, LUT count is the constraint.

6.4 Programmable Interconnect

Logic elements accomplish nothing without wires to connect them. FPGA routing uses a hierarchical network of programmable switch matrices. Short connections within a CLB are made through local routing; connections between nearby CLBs use general routing tracks; long-distance connections use horizontal and vertical long lines or global lines (e.g., BUFG clock networks).

Each programmable crosspoint is implemented as a pass transistor controlled by an SRAM configuration bit. The routing delay is often larger than the LUT delay and is the dominant source of timing variability between designs with the same logic structure but different placement.

Why placement matters for timing. Two implementations of the same RTL can have critical-path delays differing by 2× depending on how the place-and-route tool assigns resources to physical locations. Long nets spanning many switch matrices contribute hundreds of picoseconds of delay. Designers can guide placement with floorplanning constraints (Pblocks in Vivado) to co-locate timing-critical logic and shorten critical nets.

6.5 Hard IP Blocks

Beyond LUTs and FFs, modern FPGAs integrate hard (non-programmable) silicon blocks:

  • Block RAM (BRAM): synchronous dual-port memory tiles (see §5.2).
  • DSP blocks (DSP48E2, DSP58): hard multiply-accumulate engines (see §4.3).
  • Phase-locked loops (PLLs/MMCMs): generate multiple clock frequencies from a reference; control phase relationships.
  • SerDes / transceivers: high-speed serialiser/deserialiser for PCIe, Ethernet, and other serial protocols.
  • PCIe hard controller: implements the PCIe protocol layers in silicon.
  • High Bandwidth Memory (HBM) interfaces: present on Xilinx Alveo and Versal devices for memory-bandwidth-intensive applications.

Understanding which operations map naturally to which hard blocks is essential for high-efficiency FPGA design. A 32-bit floating-point multiply that would require hundreds of LUTs can often be absorbed into a chain of DSP blocks.

6.6 FPGA Design Flow with Vivado

The AMD/Xilinx Vivado tool suite processes RTL through the following stages:

  1. Synthesis (synth_design): maps RTL to Xilinx primitives; reports LUT, FF, BRAM, DSP resource counts; estimates timing based on placement-independent delays.
  2. Implementation — Opt: logic optimisation and retiming to improve timing before placement.
  3. Implementation — Place (place_design): assigns primitives to physical tiles; iteratively improves placement to minimise wire lengths and congestion.
  4. Implementation — Route (route_design): assigns nets to routing resources; runs static timing analysis on routed delays.
  5. Bitstream generation (write_bitstream): produces the .bit file for JTAG programming or the .bin file for SPI flash.

The timing report (report_timing_summary) after routing is the authoritative source of timing information. A design passes timing when every setup slack and every hold slack is non-negative.


Chapter 7: Timing Analysis

7.1 Setup and Hold Constraints

Every flip-flop in a synchronous circuit has two fundamental timing requirements:

Setup constraint: data must arrive at the flip-flop’s D input at least \( T_{setup} \) before the clock edge that will capture it. The setup constraint for a path from flip-flop launch FF1 to capture flip-flop FF2 is:

\[ T_{clk\_to\_q}^{FF1} + T_{combinational} + T_{setup}^{FF2} < T_{clk} \]

which rearranges to:

\[ T_{clk\_to\_q} + T_{combinational} < T_{clk} - T_{setup} \]

Equivalently, the setup slack is:

\[ \text{setup\_slack} = T_{clk} - T_{setup} - (T_{clk\_to\_q} + T_{comb}) \]

A negative setup slack means the circuit fails timing at the desired frequency and the clock must be slowed or the critical path shortened.
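A quick numerical sketch of the slack formula, with hypothetical delay values:

```python
def setup_slack(t_clk, t_setup, t_clk_to_q, t_comb):
    # setup_slack = T_clk - T_setup - (T_clk_to_q + T_comb); negative => failure
    return t_clk - t_setup - (t_clk_to_q + t_comb)

# Hypothetical numbers: 250 MHz clock (4.0 ns period), 0.5 ns setup time,
# 0.4 ns clock-to-Q delay, 3.3 ns of combinational logic delay.
slack = setup_slack(4.0, 0.5, 0.4, 3.3)
print(round(slack, 2))  # → -0.2  (fails timing: slow the clock or shorten the path)
```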

Hold constraint: data must not change at the capture flip-flop’s D input until at least \( T_{hold} \) after the clock edge. For paths with very short combinational delay, a later data transition could arrive before the hold window closes, corrupting the captured value. The hold constraint is:

\[ T_{clk\_to\_q}^{FF1} + T_{combinational} > T_{hold}^{FF2} \]

Hold violations are fixed by adding intentional delay (buffer insertion) on short paths; they cannot be fixed by slowing the clock.

7.2 Clock Skew

Clock skew \( T_{skew} \) is the difference in arrival time of the clock edge at two flip-flops driven by the same clock signal. A positive skew (the capture flip-flop’s clock arrives later than the launch flip-flop’s clock) is helpful for setup: it effectively widens the timing window. Negative skew tightens the setup constraint and can introduce hold violations on short paths. FPGA global clock networks (BUFG) minimise skew across the fabric to typically less than 50 ps.

7.3 Static Timing Analysis

Static timing analysis (STA) computes the worst-case timing along every path in the circuit without requiring simulation. STA tools perform:

  1. Graph construction: model the circuit as a directed graph with nodes (flip-flops, primary I/Os) and edges (combinational paths with annotated delays).
  2. Arrival time propagation (forward): compute the latest signal arrival time at each node assuming worst-case (slow) conditions.
  3. Required time back-propagation (backward): starting from timing endpoints, compute the latest time data is allowed to arrive.
  4. Slack computation: slack = required time − arrival time. Negative slack = timing failure.

STA is exhaustive — it checks every path, including paths that are logically impossible. These false paths may be constrained away using set_false_path directives so that the tool does not attempt to meet timing on them.
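The forward and backward passes can be sketched on a toy graph; the node names, edge delays, and the 9-unit required time below are illustrative:

```python
# Toy STA over a combinational DAG: edges maps node -> [(successor, delay)].
# Launch points have arrival time 0; the endpoint's required time is the
# clock period minus setup (hypothetically 10 - 1 = 9 time units).
edges = {"a": [("c", 3)], "b": [("c", 5)], "c": [("out", 2)], "out": []}
order = ["a", "b", "c", "out"]  # topological order

arrival = {n: 0 for n in order}
for n in order:                      # forward pass: latest arrival time
    for succ, d in edges[n]:
        arrival[succ] = max(arrival[succ], arrival[n] + d)

required = {n: float("inf") for n in order}
required["out"] = 9
for n in reversed(order):            # backward pass: required time
    for succ, d in edges[n]:
        required[n] = min(required[n], required[succ] - d)

slack = {n: required[n] - arrival[n] for n in order}
print(slack["out"])  # → 2  (arrival is 7, through the slower b→c branch)
```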

7.3.1 Multi-Corner Multi-Mode Analysis

Real VLSI timing analysis must account for variation in process, voltage, and temperature (PVT). A slow-corner characterisation uses worst-case process (slow transistors), minimum voltage, and maximum temperature — this corner dominates setup-time analysis. A fast-corner characterisation uses best-case process (fast transistors), maximum voltage, and minimum temperature — this corner dominates hold-time analysis (the fear is that data propagates so fast it arrives before the hold window closes).

On FPGAs, Vivado provides multiple timing corners through its device model, and by default it checks both setup and hold across the supported operating range. The constraints file (XDC) specifies the operating conditions; the designer should always verify timing against the actual speed grade of the target part (for Xilinx devices, -1 is slower than -2) to ensure margin.

7.4 Critical Path Optimisation

The critical path is the path with the smallest (most negative) setup slack. Common optimisation techniques include:

Retiming: move registers through combinational logic (backward or forward across logic gates) to balance stage delays without changing the circuit’s cycle-by-cycle I/O behaviour. Vivado’s synthesis -retiming option enables this automatically.

Logic restructuring: replace a deep adder tree with a shallower structure (e.g., carry-lookahead instead of ripple-carry); restructure Boolean expressions to reduce LUT depth.

Pipeline insertion: add pipeline registers at strategic points to break long combinational paths, accepting increased latency in exchange for higher clock frequency and throughput.

Physical optimisation: after placement, move logic cells to shorter distances from their fanout loads. Vivado’s phys_opt_design performs replication of high-fanout nets and cell relocation to reduce critical-path delays.

7.5 Interconnect Delay Model

On-chip wires are not ideal: they have resistance \( R \) and capacitance \( C \). A long wire modelled as a lumped RC chain with \( n \) segments of resistance \( r \) and capacitance \( c \) each has a time constant estimated by the Elmore delay model:

\[ \tau_{Elmore} = \sum_{i=1}^{n} r_i \cdot C_{downstream,i} \]

where \( C_{downstream,i} \) is the total capacitance downstream of resistor \( r_i \). For a uniform wire of length \( l \), resistance per unit length \( R_s \), and capacitance per unit length \( C_s \):

\[ \tau = (R_s C_s l^2) / 2 \]

The quadratic dependence on length means that doubling a wire’s length quadruples its delay. This is a key reason why placement quality strongly affects timing: spreading logic across the chip is far worse than keeping communicating logic co-located.
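The Elmore sum and its distributed-wire limit can be checked numerically; the unit resistance and capacitance per length below are hypothetical:

```python
def elmore_delay(segments):
    # segments: list of (r_i, c_i) pairs ordered from driver to load.
    # tau = sum_i r_i * C_downstream_i, where C_downstream_i includes c_i itself.
    tau = 0.0
    for i, (r, _) in enumerate(segments):
        c_downstream = sum(c for _, c in segments[i:])
        tau += r * c_downstream
    return tau

# A uniform wire split into n lumped segments approaches Rs*Cs*l^2 / 2 = 0.5
# as n grows; the n-segment sum is 0.5 * (1 + 1/n).
n, Rs, Cs, l = 100, 1.0, 1.0, 1.0
segs = [(Rs * l / n, Cs * l / n)] * n
print(round(elmore_delay(segs), 3))  # → 0.505, close to the distributed limit 0.5
```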

On FPGAs, routing delays are modelled as resistance/capacitance values for each programmable routing segment, pre-characterised by Xilinx and stored in the device timing model. After place-and-route, timing reports show routed delays that include both logic and routing components.


Chapter 8: Power Analysis and Optimisation

8.1 Sources of Power Consumption

Power in CMOS circuits has two primary components.

Dynamic power arises from charging and discharging capacitive nodes as logic values switch:

\[ P_{dynamic} = \alpha \cdot C_L \cdot V_{DD}^2 \cdot f_{clk} \]

where \( \alpha \) is the activity factor (average probability that a node switches per clock cycle), \( C_L \) is the load capacitance, \( V_{DD} \) is the supply voltage, and \( f_{clk} \) is the clock frequency. The quadratic dependence on \( V_{DD} \) makes voltage scaling the most powerful lever for reducing dynamic power.
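A numerical sketch of the dynamic power equation; the activity factor, capacitance, voltage, and frequency values are hypothetical:

```python
def dynamic_power(alpha, c_load, vdd, f_clk):
    # P_dynamic = alpha * C_L * Vdd^2 * f_clk (watts, with SI inputs)
    return alpha * c_load * vdd ** 2 * f_clk

# Hypothetical operating point: alpha = 0.1, 1 nF of total switched
# capacitance, 0.9 V supply, 500 MHz clock.
p_nominal = dynamic_power(0.1, 1e-9, 0.9, 500e6)
p_scaled  = dynamic_power(0.1, 1e-9, 0.9 * 0.8, 500e6)  # 20% voltage reduction
print(round(p_nominal * 1e3, 2), round(p_scaled / p_nominal, 2))  # → 40.5 0.64
```

The 0.64 ratio is the quadratic effect in action: a 20% voltage reduction alone cuts dynamic power by 36%, before any frequency reduction is applied.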

Static (leakage) power flows even when the circuit is idle, due to sub-threshold conduction and gate-oxide tunnelling in transistors. At advanced deep-submicron nodes, leakage can rival dynamic power, making power management critically important in systems-on-chip.

Short-circuit power flows momentarily during transitions when both PMOS and NMOS transistors in a gate are simultaneously conducting. Well-designed CMOS circuits minimise this by ensuring fast input transitions.

8.2 Clock Gating

Since clock networks are the highest-activity nets (switching every cycle), disabling the clock to idle portions of the design saves significant dynamic power. Clock gating inserts an AND gate (or dedicated integrated clock gating cell) between the clock and a group of flip-flops:

\[ \text{gated\_clk} = \text{clk} \cdot \text{enable} \]

A bare AND gate can glitch if enable changes while the clock is high, so production flows use an integrated clock-gating (ICG) cell, which latches the enable during the clock’s low phase so the gated clock only ever changes cleanly. Synthesis tools typically infer clock gates automatically from always_ff blocks with enable conditions.

Clock gating efficiency. A RISC processor's register file is clocked at the full frequency but most registers are not written every cycle. With clock gating, only the one or two written registers consume dynamic power per cycle. This can reduce register-file power by 80–90% in typical workloads.

8.3 Operand Isolation

A combinational multiplier or adder consumes dynamic power even when its result will not be used (e.g., when a downstream register is clock-gated). Operand isolation (also called input gating) inserts a register or multiplexer to hold the operands constant when the unit is idle, preventing spurious switching and eliminating wasted dynamic power:

always_ff @(posedge clk)
    if (enable) operand_reg <= data_in;
// Combinational logic sees stable inputs when not enabled

8.3.1 Dynamic Voltage and Frequency Scaling

For systems where workload intensity varies over time — a battery-powered sensor that wakes up periodically to process data, for example — dynamic voltage and frequency scaling (DVFS) can match power consumption to workload. When the workload is light, both the supply voltage and clock frequency are reduced; since dynamic power scales as \( V_{DD}^2 f \), a 30% voltage reduction combined with a proportional frequency reduction can reduce power by more than 60%. DVFS requires on-chip voltage regulators and PLL/MMCM reconfiguration (on Xilinx devices, the MMCM can be dynamically reconfigured via the DRP — Dynamic Reconfiguration Port). The complexity of DVFS management is non-trivial and is typically handled by a power management unit implemented in firmware or hardware.

8.4 Power in FPGA Designs

Vivado’s Power Analysis tool (invoked via report_power) estimates dynamic and static power given a post-implementation netlist and switching activity. Activity can be sourced from:

  • Default estimates (industry-standard toggle rates).
  • Switching activity interchange format (SAIF) files from simulation.
  • VCD (Value Change Dump) files.

Low-power techniques for FPGAs include:

  • Reducing clock frequency: directly reduces dynamic power and allows voltage scaling on some devices.
  • Reducing supply voltage: works in conjunction with lower frequency.
  • Shutting down unused logic: Xilinx partial reconfiguration can blank out unused regions, eliminating their dynamic power.
  • Minimising routing: poor placement leads to long, capacitive nets with high dynamic power.

Chapter 9: Resource Sharing and Scheduling

9.1 The Area–Throughput Trade-off

Every arithmetic operation — multiply, divide, compare — requires hardware resources. If a computation requires \( k \) multiplications, instantiating \( k \) multiplier circuits in parallel maximises throughput (all \( k \) operations complete in one cycle) but maximises area. A single shared multiplier can perform all \( k \) multiplications over \( k \) cycles, minimising area but reducing throughput \( k \)-fold. Between these extremes lies a continuum of design points.

9.2 Hardware Scheduling

Scheduling assigns operations to time-steps (clock cycles), subject to data-dependency constraints (an operation cannot be scheduled before its inputs are ready) and resource constraints (only \( r \) copies of each resource type are available per cycle). The result is a schedule — a mapping from operations to cycle numbers.

As-soon-as-possible (ASAP) scheduling assigns each operation to the earliest cycle permitted by its data dependencies, ignoring resource constraints. ASAP gives minimum latency but may require many parallel resources.

As-late-as-possible (ALAP) scheduling assigns each operation to the latest cycle permitted by the overall latency constraint. ALAP maximises the opportunity for resource sharing.

Force-directed scheduling is a heuristic that tries to distribute operations across time-steps to level the resource usage — assigning operations to cycles where their resource type is under-utilised. It is widely used in high-level synthesis tools.

Sharing a multiplier between two independent paths. Suppose a circuit computes both a × b and c × d in every iteration, and these multiplications are independent. In a throughput-constrained design (II = 1) both multiplications must happen in the same cycle, requiring two multipliers. But if II = 2 is acceptable, a single multiplier can be time-multiplexed: on odd cycles it computes a × b, on even cycles c × d. Input multiplexers select the operands; an output register and counter-based control steer the result to the correct destination.

9.2.1 Modulo Scheduling

When resource constraints limit the initiation interval to II > 1, the designer must ensure that each time step in the repeating schedule pattern does not exceed the available resources. Modulo scheduling is an algorithm that computes the minimum feasible II as:

\[ II_{min} = \max\!\left(\lceil T_{recurrence} \rceil,\; \max_r \lceil N_r / R_r \rceil\right) \]

where \( T_{recurrence} \) is the recurrence-constrained II from loop-carried dependencies, \( N_r \) is the number of operations of resource type \( r \) in the loop body, and \( R_r \) is the number of available units of resource \( r \). For a loop body containing 4 multiply operations with only 2 DSP blocks available, the resource-constrained lower bound is \( \lceil 4/2 \rceil = 2 \) cycles II. If the loop also has a recurrence (an operation depends on a result from the previous iteration) with total delay spanning 3 cycles, then II must be at least 3.
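The II lower bound can be computed directly; the sketch below reproduces the worked numbers from the text (4 multiplies, 2 DSP blocks, 3-cycle recurrence):

```python
import math

def min_ii(t_recurrence, ops_per_resource):
    # ops_per_resource: {resource name: (N_r ops in loop body, R_r available units)}
    resource_ii = max(math.ceil(n / r) for n, r in ops_per_resource.values())
    return max(math.ceil(t_recurrence), resource_ii)

# 4 multiplies on 2 DSP blocks -> resource-constrained bound of 2 cycles.
print(min_ii(0, {"dsp": (4, 2)}))  # → 2
# A 3-cycle loop-carried recurrence raises the bound to 3.
print(min_ii(3, {"dsp": (4, 2)}))  # → 3
```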

9.3 Loop Pipelining and Loop Unrolling

When an algorithm contains a loop, the designer faces a choice between spatial (parallel) and temporal (sequential) implementation:

Loop unrolling replicates the loop body \( U \) times, effectively processing \( U \) loop iterations in parallel. Throughput increases by \( U \times \); area increases by \( U \times \) as well. Full unrolling (unroll factor = loop bound) maps the entire loop to combinational logic.

Loop pipelining (also called loop folding) keeps a single instance of the loop body but applies pipelining so that different iterations overlap in execution. A fully pipelined loop with initiation interval II = 1 processes one new iteration per cycle, achieving the same throughput as full unrolling with significantly less area — at the cost of \( k \) cycles of warm-up latency for a \( k \)-stage pipeline.
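The cycle-count trade-off between the two styles can be sketched as follows; the iteration count, unroll factor, and pipeline depth are hypothetical:

```python
import math

def cycles_unrolled(n_iters, unroll):
    # U parallel copies of the loop body, one "round" of U iterations per cycle
    # (idealised: ignores any residual pipeline latency in the body).
    return math.ceil(n_iters / unroll)

def cycles_pipelined(n_iters, stages, ii):
    # k-stage pipelined body: k cycles of warm-up latency, then one new
    # iteration every II cycles.
    return stages + (n_iters - 1) * ii

# 1024 iterations: 4x unrolling (4x the area) vs a 4-stage, II = 1 pipeline.
print(cycles_unrolled(1024, 4))      # → 256
print(cycles_pipelined(1024, 4, 1))  # → 1027
```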


Chapter 10: Faults, Testing, and Reliability

10.1 Manufacturing Defects and Fault Models

Integrated circuits emerge from a manufacturing process that is not perfectly repeatable. Defects — microscopic impurities, pattern irregularities, or electrostatic discharge events — can cause circuit failures. To make manufactured chips testable, the design community adopted abstract fault models that approximate the electrical behaviour of common defects.

Stuck-at fault model: the most widely used model. A node is assumed to be permanently stuck at logic 0 (SA0) or logic 1 (SA1), regardless of what driving logic computes. A test vector detects a stuck-at-0 fault on node \( n \) if and only if it would make \( n = 1 \) in the fault-free circuit (sensitising the fault) and the fault’s effect can propagate to a primary output (observability).
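A toy sketch of sensitisation and propagation for a single stuck-at-0 fault; the two-gate circuit below is hypothetical:

```python
# Fault-free circuit: out = (a & b) | c. We inject a stuck-at-0 fault on
# the internal AND node and check which vectors detect it.
def circuit(a, b, c, sa0_on_and=False):
    n = 0 if sa0_on_and else (a & b)  # internal node, possibly stuck at 0
    return n | c

# A vector detects the fault iff it drives the node to 1 (sensitisation)
# AND the difference reaches the output (propagation here requires c = 0).
print(circuit(1, 1, 0) != circuit(1, 1, 0, sa0_on_and=True))  # → True (detected)
print(circuit(1, 1, 1) != circuit(1, 1, 1, sa0_on_and=True))  # → False (c = 1 masks it)
```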

Bridging fault: two nodes are shorted together. More physically realistic than stuck-at but harder to model and test.

Transition fault: a node can only produce 0→1 or 1→0 transitions too slowly. Transition faults model delay defects, important in high-speed circuits.

10.2 Test Coverage and Fault Coverage

Fault coverage is the fraction of all single stuck-at faults detected by a test vector set. High fault coverage (> 99%) is required for consumer electronics shipped in millions of units. A design with 95% coverage means 5% of faults go undetected; at volume this translates to many faulty units reaching customers.

Test vector generation is automated by ATPG (automatic test pattern generation) tools (e.g., Synopsys TetraMAX). ATPG finds a set of test vectors that maximises fault coverage. The complexity of ATPG is strongly influenced by the controllability (how easily a node can be set to 0 or 1 from primary inputs) and observability (how easily the value of a node is visible at primary outputs) of nodes in the circuit.

10.2.1 Path Sensitisation and D-Calculus

The formal method for identifying whether a fault can be detected by a given test vector is D-calculus (Roth, 1966). The D symbol represents “1 in the fault-free circuit, 0 in the faulty circuit” (a wire that is stuck-at-0 carries D if its fault-free value is 1). \( \overline{D} \) represents “0 in the fault-free circuit, 1 in the faulty circuit.” Activating the fault at its site and propagating the resulting D or \( \overline{D} \) through the combinational network to a primary output, using Boolean algebra on the extended alphabet \( \{0, 1, D, \overline{D}, X\} \), constitutes the test generation procedure.

Modern ATPG tools implement significantly faster algorithms (such as the PODEM and FAN algorithms) that prune the search space aggressively, making fault coverage computation tractable for circuits with millions of gates.

10.3 Design for Testability

Testability is a design property, not a post-design add-on. Common design-for-testability (DFT) techniques include:

Scan insertion: all flip-flops in the design are linked into one or more scan chains by adding a multiplexer on each flip-flop’s data input. In test mode, the scan chain shifts test data into all flip-flops serially, and then captures combinational logic outputs into the flip-flops in a single clock cycle, after which the results are shifted out for comparison. Scan gives very high observability and controllability of internal state with minimal circuit overhead.

Built-in self-test (BIST): an on-chip test controller generates test vectors and checks responses using signature analysis (a linear-feedback shift register that compresses the output sequence into a compact signature). BIST is particularly important for embedded memories (March BIST algorithms systematically test all SRAM cell patterns) and for circuits that are hard to access from external pins.
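A pattern-generator LFSR can be sketched in a few lines; the 16-bit tap set (16, 14, 13, 11) corresponds to the maximal-length polynomial x^16 + x^14 + x^13 + x^11 + 1, and the seed value is arbitrary:

```python
def lfsr_sequence(seed, taps=(16, 14, 13, 11), width=16, count=10):
    # Fibonacci LFSR: feedback bit is the XOR of the tapped positions,
    # shifted in at the LSB. Used as a BIST pseudo-random pattern source.
    state = seed
    out = []
    for _ in range(count):
        out.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)
    return out

patterns = lfsr_sequence(seed=0xACE1)
print(len(patterns), len(set(patterns)))  # → 10 10 (no repeats in this short run)
```

A maximal-length 16-bit LFSR cycles through all 65,535 non-zero states before repeating, so a short run of patterns is guaranteed distinct.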

10.4 Reliability and Fault Tolerance

Beyond manufacturing test, circuits must operate reliably in the field in the presence of transient faults. High-energy particle strikes (cosmic rays, alpha particles) can flip the state of a bit stored in a memory cell or flip-flop — a single-event upset (SEU). Reliability requirements differ by application: a consumer device may tolerate occasional soft errors, but a satellite or medical device cannot.

Triple modular redundancy (TMR) replicates critical logic three times and uses a majority voter: if one copy computes an incorrect value, the voter overrides it with the result from the other two. TMR provides fault tolerance at the cost of roughly 3× area and power.

Error-correcting codes (ECC) protect memory arrays against SEUs. A SEC-DED Hamming code corrects 1-bit errors and detects 2-bit errors with only modest overhead (8 check bits for 64 data bits in the standard (72, 64) code). Vivado’s memory IP generator includes optional ECC.
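A minimal sketch of Hamming single-error correction, using the small (7, 4) code for brevity rather than the wider code used for memories:

```python
def hamming74_encode(d):
    # d: 4 data bits [d1, d2, d3, d4]; codeword layout [p1, p2, d1, p3, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    # Recompute parity over positions {1,3,5,7}, {2,3,6,7}, {4,5,6,7};
    # the syndrome is the 1-based position of a single flipped bit (0 = clean).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return c

code = hamming74_encode([1, 0, 1, 1])
corrupted = code[:]
corrupted[4] ^= 1  # simulate a single-event upset
assert hamming74_correct(corrupted) == code  # the upset is corrected
```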


Chapter 11: Advanced Topics — Overlays, Timing Closure, and High-Level Concepts

11.1 FPGA Overlays

An overlay is a reconfigurable accelerator architecture implemented on top of an FPGA’s standard logic fabric — essentially a softcore architecture that provides a higher-level programmable substrate on top of the underlying FPGA. Examples include coarse-grained reconfigurable arrays (CGRAs), vector processors, and neural-network inference engines.

The motivation for overlays is to reduce the long place-and-route compile times of FPGA design (minutes to hours) to milliseconds: the overlay hardware is compiled once and the computation kernel is mapped to the overlay’s programming model much faster than to the raw FPGA fabric. The trade-off is efficiency: an overlay consumes more area and achieves lower clock frequency than a design optimised directly for the FPGA.

11.2 Timing Closure in Practice

Timing closure is the process of iteratively modifying a design until all timing constraints are met after place-and-route. It is one of the most time-consuming activities in FPGA and ASIC design, particularly for large designs close to the technology’s performance limits.

A systematic timing-closure methodology involves:

  1. Identify the critical path using report_timing_summary and report_timing -n 10 (top-10 failing paths).
  2. Determine the cause: is it a long combinational chain (logic-bound), a long routing wire (routing-bound), or a clock-domain crossing problem?
  3. Apply the appropriate fix: add pipeline stage (logic-bound), apply floorplan constraint to reduce wire length (routing-bound), recode the CDC synchroniser (CDC problem).
  4. Re-run implementation and check if the fix propagated correctly.
  5. Repeat until all slack is positive.

A common mistake is to chase a single path without understanding why it is critical. The correct approach is to look at the pattern of failing paths: if many paths all involve a common register or block, fixing that block’s logic will close multiple paths at once.

11.3 The RTL-to-Silicon Mental Model

It is worth consolidating the mental model that underpins all of ECE 327. At every level of the hierarchy, the designer works simultaneously with three views of the same circuit:

  • Behavioural view: what computation the circuit performs (algorithm, data flow, I/O protocol).
  • Structural view: what hardware components implement it (LUTs, FFs, BRAM, DSP blocks, adders, FSMs).
  • Physical view: where those components live on the chip or FPGA fabric and how they are connected.

Mastery of digital hardware engineering requires fluent translation among all three views. When a synthesis tool reports 95% LUT utilisation, a structural thinker knows the design is close to full and routing congestion will slow timing. When a timing report shows a 2 ns negative slack on a carry-chain path, a physical thinker knows the adder is being driven from a register on the opposite side of the fabric and a floorplan constraint is needed. When a simulation fails on the 1000th stimulus vector, a behavioural thinker deduces that the FSM is missing a transition from a corner-case state.

Career perspective. The skills developed in ECE 327 — RTL design, functional verification, FPGA implementation, and timing analysis — are directly applicable to co-op and full-time roles at semiconductor companies (Intel, AMD, Nvidia, Qualcomm, Tenstorrent, Groq), cloud-FPGA teams (AWS, Microsoft Azure), and academic research groups in computer architecture and VLSI. The course's use of industry-standard tools (Vivado, SystemVerilog) means that a student who completes the labs has genuine tool experience to discuss in technical interviews.

Chapter 12: Synthesis Methodology and RTL Coding Style

12.1 Synthesisable vs. Non-Synthesisable Constructs

Not all valid SystemVerilog is synthesisable. The synthesis tool must be able to map every construct to physical hardware; constructs that only make sense in simulation (timing delays, file I/O, real-valued variables, system tasks) are non-synthesisable and must be confined to testbenches.

Synthesisable constructs: always_ff, always_comb, assign, case, if-else, for loops with constant bounds, module instantiation, parameter, generate, arithmetic and logical operators, packed arrays.

Non-synthesisable constructs: #delay, initial (outside of FPGA initial-value inference), $display, $random, real type, event variables, wait statements, force/release.

12.2 Inferring vs. Instantiating

Designers choose between two styles for hardware primitives:

Inference writes generic RTL that the synthesiser maps to technology-specific primitives. For example, always_ff @(posedge clk) infers a D flip-flop regardless of the target device. Inference is portable (the same RTL synthesises for Xilinx, Intel, or an ASIC target) and benefits from tool optimisations.

Instantiation directly names a technology primitive (e.g., CARRY8, DSP48E2, BUFG). This gives the designer precise control over the implementation but ties the code to a specific vendor and device family. Instantiation is used when the synthesiser’s inference does not produce the desired mapping, or when using primitives that cannot be inferred (e.g., transceiver primitives).

12.3 Common RTL Pitfalls

Unintended latches: a combinational always_comb block that does not assign a signal in all branches of an if-else or case infers a latch on that signal. Always use a default assignment at the top of the block:

always_comb begin
    out = '0;  // default
    case (sel)
        2'b00: out = a;
        2'b01: out = b;
        // 2'b10, 2'b11 handled by default
    endcase
end

Multiple drivers: assigning the same variable in two separate always blocks causes a multiple-driver error. Synthesisers will report this as an error; simulators may give unpredictable results.

X propagation: if a register is not reset, its initial value in simulation is X. Any logic that depends on an uninitialised register will propagate X, causing simulation failures that are hard to debug. Always reset flip-flops that affect downstream control logic.

Width mismatches: implicit truncation or zero-extension when assigning vectors of different widths is legal but often unintentional. Use explicit '0 fills or $signed() casts and enable width-mismatch warnings in the simulator.

12.4 Parameterised Design

Reusable hardware modules should be parameterised over data width, memory depth, and any other dimension that may change between instances:

module shift_reg #(
    parameter int WIDTH = 8,
    parameter int DEPTH = 4
) (
    input  logic             clk, rst_n, en,
    input  logic [WIDTH-1:0] d,
    output logic [WIDTH-1:0] q
);
    logic [WIDTH-1:0] sr [0:DEPTH-1];

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            foreach (sr[i]) sr[i] <= '0;
        else if (en) begin
            sr[0] <= d;
            foreach (sr[i])
                if (i > 0) sr[i] <= sr[i-1];
        end
    end

    assign q = sr[DEPTH-1];
endmodule

At instantiation time, #(.WIDTH(16), .DEPTH(8)) overrides the defaults. The synthesiser generates separate hardware for each unique parameter combination.
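For example, a wrapper might instantiate the shift register above with widened parameters (the wrapper name and its port names are illustrative):

```systemverilog
module top (
    input  logic        clk, rst_n, en,
    input  logic [15:0] sample_in,
    output logic [15:0] sample_out
);
    // 16-bit wide, 8-deep instance: overrides the defaults of 8 and 4.
    shift_reg #(
        .WIDTH(16),
        .DEPTH(8)
    ) u_sr (
        .clk   (clk),
        .rst_n (rst_n),
        .en    (en),
        .d     (sample_in),
        .q     (sample_out)
    );
endmodule
```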


Chapter 13: Case Studies in Digital Hardware Design

13.1 A Pipelined FIR Filter

A finite impulse response (FIR) filter computes:

\[ y[n] = \sum_{k=0}^{N-1} h[k] \cdot x[n-k] \]

for \( N \) coefficients \( h[k] \) and input samples \( x[n] \). A direct-form implementation consists of a shift register for the sample history, multipliers for each tap, and an adder tree to sum the products.

A fully pipelined FIR maps each MAC operation to a DSP block and pipelines the adder tree. For a 32-tap filter on an UltraScale FPGA, Vivado maps the design onto 32 DSP48E2 blocks in a cascade, achieving one output sample per clock cycle at clock rates exceeding 400 MHz.

A key design decision is whether to exploit coefficient symmetry: a linear-phase FIR has symmetric coefficients, \( h[k] = h[N-1-k] \), so a pre-adder lets each multiplier serve two taps. This halves the multiplier count at the cost of a pre-addition stage.
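One symmetric tap might be sketched as follows (the module name, widths, and register staging are illustrative, not tuned to a particular DSP primitive):

```systemverilog
// One symmetric tap: pre-add the two samples that share coefficient h[k],
// then use a single multiplier for both.
module sym_tap #(
    parameter int W = 16
) (
    input  logic                clk,
    input  logic signed [W-1:0] x_early,  // x[n-k]
    input  logic signed [W-1:0] x_late,   // x[n-(N-1-k)]
    input  logic signed [W-1:0] h,        // shared coefficient h[k]
    output logic signed [2*W:0] p         // registered product
);
    logic signed [W:0] pre;               // one extra bit for the sum

    always_ff @(posedge clk) begin
        pre <= x_early + x_late;          // pre-adder stage
        p   <= pre * h;                   // one multiplier serves two taps
    end
endmodule
```

The extra pipeline register on the pre-adder is the cost noted above; the payoff is that a 32-tap symmetric filter needs only 16 multipliers.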

13.2 A Pipelined Processor

A 5-stage RISC pipeline (IF, ID, EX, MEM, WB) is a canonical case study in digital hardware design. The five stages communicate via pipeline registers:

  • IF (Instruction Fetch): read the next instruction from instruction memory (BRAM) using the program counter.
  • ID (Instruction Decode / Register Read): decode the opcode, read source registers from the register file, generate the sign-extended immediate.
  • EX (Execute): the ALU computes the result; the branch condition is evaluated; the branch target is computed.
  • MEM (Memory Access): data memory is read (load) or written (store).
  • WB (Write Back): the result is written back to the register file.

Data forwarding (bypassing) is needed to handle RAW (read-after-write) data hazards: the output of the EX or MEM stage is forwarded directly to the EX-stage input instead of waiting for WB to update the register file. The forwarding unit compares source and destination register addresses to determine when forwarding is needed.
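A sketch of the forwarding select for one source operand, in the same combinational style as the hazard-detection code in Section 13.2.1 (the signal names follow the common textbook convention and are assumptions here):

```systemverilog
// EX-stage forwarding select for operand rs1.
// Priority: the most recent producer (EX/MEM) wins over MEM/WB.
always_comb begin
    fwd_a = 2'b00;  // default: value read from the register file
    if (ex_mem_regwrite && (ex_mem_rd != '0) &&
        (ex_mem_rd == id_ex_rs1))
        fwd_a = 2'b10;  // forward ALU result from EX/MEM
    else if (mem_wb_regwrite && (mem_wb_rd != '0) &&
             (mem_wb_rd == id_ex_rs1))
        fwd_a = 2'b01;  // forward write-back value from MEM/WB
end
```

The `!= '0` checks exclude register x0, which is hard-wired to zero in RISC-V-style ISAs and must never be forwarded.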

A load-use hazard (a load followed immediately by an instruction that uses the loaded value) cannot be resolved by forwarding alone; it requires a one-cycle stall inserted by the hazard detection unit, which holds the PC and IF/ID pipeline register constant while flushing the ID/EX register.

13.2.1 Hazard Detection Unit Implementation

The hazard detection unit is a purely combinational block comparing register addresses across pipeline stages. It produces a stall for the PC and IF/ID registers and a flush (bubble insertion) for the ID/EX register; the sketch below derives both from a single stall signal:

// Load-use hazard detection
always_comb begin
    stall = 1'b0;
    if (id_ex_memread &&
       (id_ex_rd == if_id_rs1 || id_ex_rd == if_id_rs2)) begin
        stall = 1'b1;
    end
end

When stall is asserted, the PC register and IF/ID pipeline register hold their values (the same instruction is re-fetched and re-decoded), and a NOP bubble is inserted into the ID/EX stage. After one stall cycle, the load result is available in the EX/MEM register and forwarding handles the dependency. Without the stall, the instruction after the load would read the register file before the write-back of the loaded data, producing an incorrect result — this is the canonical load-use data hazard that makes load scheduling important in compiler optimisation for RISC architectures.

13.3 AXI-Stream Handshake Interface

The AXI4-Stream protocol is an industry-standard protocol (ARM AMBA specification) for streaming data between IP blocks. Its minimal interface consists of:

  • TDATA: the data payload (width configurable, typically a power of two bytes).
  • TVALID: driven by the master (sender), asserts when TDATA is valid.
  • TREADY: driven by the slave (receiver), asserts when it can accept data.
  • TLAST: marks the last beat of a packet (optional but common).

A transaction occurs on the rising clock edge when both TVALID and TREADY are high. The sender must hold TDATA and TVALID constant until TREADY is asserted; the receiver must assert TREADY whenever it can accept data.
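A minimal consumer-side sketch of the handshake (the module and port names are ours; a real consumer would gate s_tready on internal buffer state rather than tying it high):

```systemverilog
// Minimal AXI-Stream consumer: accepts one beat per cycle.
module axis_sink #(
    parameter int W = 32
) (
    input  logic         clk, rst_n,
    input  logic [W-1:0] s_tdata,
    input  logic         s_tvalid,
    input  logic         s_tlast,
    output logic         s_tready,
    output logic [W-1:0] captured      // last accepted beat
);
    assign s_tready = 1'b1;            // always ready in this sketch

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            captured <= '0;
        else if (s_tvalid && s_tready) // transaction: both high on the edge
            captured <= s_tdata;
    end
endmodule
```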

Vivado’s IP catalogue provides AXI-Stream FIFO, width converter, and interconnect IP that handle the handshake protocol transparently. Custom modules need only implement the producer or consumer side of the protocol.

13.4 A Matrix-Vector Multiplication Accelerator

Matrix-vector multiplication (MVM) is the dominant operation in neural-network inference. A systolic-array implementation tiles the matrix computation across a grid of processing elements (PEs), each containing a MAC unit. Each PE holds one matrix element locally and accumulates its contribution to the partial dot-product over time.

For a systolic array of \( N \times N \) PEs with word width \( W \) bits:

  • Each PE contains one DSP block (1 multiply, 1 accumulate).
  • The matrix is loaded row-by-row into a local register in each PE.
  • The input vector is broadcast or pulsed through the array.
  • The output vector is collected from one column of PEs.
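A single PE might be sketched as follows (the names, widths, and the simplified load/accumulate control are assumptions):

```systemverilog
// One systolic PE: holds a preloaded weight, multiplies the streamed
// vector element, and accumulates the partial dot-product.
module pe #(
    parameter int W     = 16,
    parameter int ACC_W = 40           // wide enough to avoid overflow
) (
    input  logic                    clk, rst_n,
    input  logic                    load_w,  // load a new matrix element
    input  logic signed [W-1:0]     w_in,    // matrix element
    input  logic signed [W-1:0]     x_in,    // streamed vector element
    output logic signed [ACC_W-1:0] acc      // partial dot-product
);
    logic signed [W-1:0] weight;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            weight <= '0;
            acc    <= '0;
        end else if (load_w)
            weight <= w_in;
        else
            acc <= acc + weight * x_in;      // MAC: maps to one DSP block
    end
endmodule
```

Accumulation pauses while a new weight is being loaded; a full design would add a clear signal so the accumulator can be reset between output vectors.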

For a 16×16 systolic array on an UltraScale+ device (which has thousands of DSP48E2 blocks), the implementation achieves:

\[ \text{Throughput} = N^2 \text{ MACs/cycle} = 256 \text{ MACs/cycle} = 512 \text{ ops/cycle} \]

counting each MAC as two operations (one multiply, one add). At a 300 MHz clock this is approximately 150 GOPS (giga-operations per second) for 16-bit integer operations — competitive with an edge-inference accelerator and implementable in a fraction of the chip area of a full GPU.

The design exercise in implementing such an accelerator illustrates all the themes of ECE 327: RTL coding of parameterised datapaths, BRAM-based weight storage, pipelining the MAC chain, timing closure of the critical accumulator path, and power estimation using Vivado’s power report.


Summary of Key Equations and Relationships

For quick reference, the central quantitative relationships in ECE 327 are:

Maximum clock frequency:

\[ f_{max} = \left( T_{clk\text{-}to\text{-}q} + T_{comb} + T_{setup} \right)^{-1} \]

Dynamic power:

\[ P_{dyn} = \alpha C_L V_{DD}^2 f \]

Elmore delay for a wire segment:

\[ \tau = \frac{R_s C_s l^2}{2} \]

where \( R_s \) and \( C_s \) are the wire resistance and capacitance per unit length and \( l \) is the wire length.

Pipeline throughput:

\[ \text{Throughput} = T_{clk}^{-1} \quad \text{(for II} = 1\text{)} \]

Setup slack:

\[ \text{slack}_{setup} = (T_{clk} - T_{setup}) - (T_{clk\text{-}to\text{-}q} + T_{comb}) \]

FIR filter output:

\[ y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \]

These relationships link the physical properties of the circuit (transistor speed, wire parasitics, supply voltage) to system-level metrics (clock frequency, throughput, power), forming the quantitative backbone of digital hardware engineering.
