ECE 224: Embedded Microprocessor Systems
Bill Bishop
Sources and References
Primary reference — Joseph Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd ed., Newnes/Elsevier, 2013. [Comprehensive treatment of Cortex-M architecture, NVIC, and peripheral interfaces.]
Architecture documentation — ARM Limited, ARM Cortex-M3 Technical Reference Manual (publicly available at developer.arm.com); ARM Limited, ARMv7-M Architecture Reference Manual. [Definitive ISA specification and memory map details.]
Embedded systems design — Jonathan Valvano, Embedded Systems: Real-Time Interfacing to ARM Cortex-M Microcontrollers, 5th ed., CreateSpace, 2016. [Open-access companion site: users.ece.utexas.edu/~valvano/]
Bus and interface standards — IEEE Std 1149.1 (JTAG); NXP Semiconductors, I2C-bus specification and user manual, Rev. 7.0, 2021; Motorola/NXP, SPI Block Guide, V03.06.
Error-correcting codes — Richard Hamming, “Error Detecting and Error Correcting Codes,” Bell System Technical Journal, 29(2):147–160, 1950. [Original paper, freely accessible.]
Open courseware — MIT OCW 6.004 Computation Structures; University of Texas ECE 319K open lecture materials; Nordic Semiconductor Academy (academy.nordicsemi.com).
Chapter 1: Embedded Systems and the Interfacing Challenge
1.1 What Is an Embedded System?
An embedded system is a computer integrated into a larger product whose primary purpose is not computation. The car-engine management unit optimising fuel injection, the wrist-worn heart-rate monitor sampling photoplethysmographic signals, the dishwasher controller sequencing wash cycles, the industrial PLC managing a conveyor belt — all are embedded systems. What distinguishes them from a general-purpose computer is not capability, but dedication and constraint. An embedded processor exists to do one class of jobs, under restrictions on cost, power consumption, physical size, real-time responsiveness, and long-term reliability that would be intolerable in a laptop.
The 32-bit ARM Cortex-M family dominates this space. It offers a thoroughly modern RISC architecture — pipelined, Thumb-2 instruction set, hardware multiply and optional hardware divide — at a price point that allows it to appear in devices costing under a dollar in volume. ECE 224 uses this architecture as the concrete platform through which abstract interfacing concepts are explored.
The central engineering challenge of an embedded system is not the processor itself, but the interfaces: the mechanisms by which the processor moves data to and from the world around it. Sensors produce continuously varying voltages. Actuators require timed pulse sequences. Memory chips require carefully sequenced address and data handshakes. Communication peers may speak completely different protocols. Translating between the processor’s clean, synchronous, digital worldview and the messy, asynchronous, analogue reality outside the chip is the subject of this course.
1.2 Core Engineering Questions in Interfacing
Every interfacing design decision can be framed around a small set of fundamental questions:
How do you transfer data from one type of device to another? A sensor produces a continuous analogue voltage; the processor expects a discrete digital word. A keyboard produces asynchronous serial pulses; the processor reads a parallel register. Answering this question requires understanding converters, encoders, bus protocols, and timing constraints.
How should data be organised? Should multiple bytes be packed into a word? Should data be stored in a FIFO buffer or a circular queue? Should it be transferred as individual bytes or as DMA bursts? Organisational choices affect throughput, latency, and the complexity of the software that consumes the data.
When should data be exchanged? The processor could poll a peripheral register in a tight loop, burning CPU cycles but achieving minimal latency. Or it could configure the peripheral to assert an interrupt when data is ready, freeing the CPU for other work. Or it could configure a DMA controller to move blocks of data autonomously, offloading data movement entirely.
What conditions must be satisfied for reliable data exchange? Metastability in flip-flops, signal integrity on long traces, jitter in recovered clocks, and setup/hold violations are all physical realities that undermine logically correct designs unless carefully managed.
Why do different interfaces exist, and under what circumstances does one outperform another? UART is simple but slow. SPI is fast but requires more wires. I2C is elegant for short distances with many slaves. USB handles hot-plugging and variable bandwidth. PCIe provides multi-gigabit throughput. Each exists because a different combination of cost, speed, distance, and simplicity was optimal for its application domain.
1.3 The Nios II Laboratory Platform
The laboratory strand of ECE 224 instantiates these concepts on an Intel/Altera FPGA board using the Nios II soft-core processor, designed using Intel Quartus Prime and programmed using the Intel Nios II Software Build Tools (SBT). This System-on-a-Programmable-Chip (SOPC) approach lets students wire up custom hardware peripherals — parallel I/O ports, timer cores, DAC interfaces — in the FPGA fabric and then access them from C software through a memory-mapped register model. The conceptual architecture is identical to a fixed silicon microcontroller; only the implementation substrate is reconfigurable.
Lab 0 establishes the baseline SOPC design. Lab 1 investigates parallel interfacing and interrupt-driven synchronisation: pushbuttons raise interrupts, an interrupt service routine reads input state, updates seven-segment displays, and reschedules a timer. Lab 2 investigates analogue interfacing: audio data is read from an SD card, optionally processed, and streamed through a digital-to-analogue converter at a precise sample rate enforced by a timer interrupt.
Chapter 2: Computer Structure and Processor Organisation
2.1 A Review of Processor Structure
A microprocessor is a finite-state machine realised in VLSI that executes a program stored in memory. Its essential components are a set of registers that hold operands and state, an Arithmetic Logic Unit (ALU) that performs integer operations, a control unit that decodes instructions and orchestrates data flow, and a bus interface that moves instructions and data between the processor and memory.
The instruction execution cycle consists, in its simplest form, of three phases: fetch, where the next instruction word is read from memory at the address held in the Program Counter (PC); decode, where the control unit interprets the opcode and operand fields; and execute, where the ALU performs the computation or the memory interface initiates a load/store. Modern processors overlap these phases using pipelining: while one instruction executes, the next is being decoded and the one after that is being fetched.
The ARM Cortex-M3 uses a 3-stage pipeline (fetch, decode, execute). The Cortex-M4 extends this with a single-cycle multiply and optional single-precision floating-point unit. Both processors employ branch speculation: the pipeline fetches the instruction after a branch while the branch target is being resolved. A mispredicted branch causes a pipeline flush, the cost of which must be accounted for in real-time worst-case execution time (WCET) analysis.
2.2 Clock Signals and Control Signals
All synchronous digital logic is disciplined by a clock signal — a square wave of known frequency that defines when state elements (flip-flops and registers) sample their inputs. The setup time \( t_{su} \) is the minimum interval that data must be stable before the clock edge; the hold time \( t_h \) is the minimum interval data must remain stable after the clock edge. Violating either constraint places the flip-flop in a potentially metastable state.
Metastability is not merely a theoretical concern. Whenever an asynchronous signal — one not synchronised to the local clock domain — is sampled by a flip-flop, a metastability event can occur. The standard mitigation is a synchroniser: two back-to-back flip-flops clocked in the receiving clock domain, giving the first stage’s output a full clock period to resolve before the second stage samples it. Two-stage synchronisers are standard in commercial practice for moderate clock frequencies; higher-frequency designs may require three stages.
Control signals govern the flow of data through the processor and across the bus interface: read/write enables, chip selects, bus request/grant lines, interrupt request lines, DMA acknowledge signals. Understanding the timing relationships among these signals — their setup and hold requirements relative to clock edges, their propagation delays, and their tri-state behaviour on shared buses — is the foundation for reliable interface design.
2.3 Memory System Organisation
Embedded processors use a flat, byte-addressed memory space. The ARM Cortex-M architecture defines a 4 GB address space (32-bit addresses), partitioned into regions with architectural significance:
| Address Range | Typical Use |
|---|---|
| 0x00000000 – 0x1FFFFFFF | Code (Flash) |
| 0x20000000 – 0x3FFFFFFF | SRAM |
| 0x40000000 – 0x5FFFFFFF | Peripheral registers |
| 0x60000000 – 0x9FFFFFFF | External RAM |
| 0xA0000000 – 0xDFFFFFFF | External device |
| 0xE0000000 – 0xFFFFFFFF | Private Peripheral Bus (PPB) — NVIC, SysTick, debug |
The peripheral register region is the foundation of memory-mapped I/O. A GPIO output data register, a UART transmit data register, or a SPI control register is simply a word at a particular address in this region. Writing to that address sends data to the peripheral; reading from it retrieves peripheral status. From the software’s perspective, the peripheral is indistinguishable from memory.
2.4 Bus Interfacing and Timing
A bus is a shared communication pathway connecting a processor to one or more memory or I/O devices. The fundamental transaction on any bus is a transfer: the master (typically the processor or a DMA controller) drives address lines to select a target device, drives data lines (on a write) or tristates them (on a read), and asserts control signals to indicate the direction and timing.
Bus protocols differ in how they synchronise master and slave. In a synchronous bus, all transfers are referenced to a common clock; the slave must respond within a fixed number of clock cycles. This simplicity comes at the cost of being limited to the speed of the slowest slave. In an asynchronous bus, the master and slave exchange handshake signals (request and acknowledge, or REQ/ACK) without a shared clock. The transfer completes only when the slave explicitly acknowledges. An asynchronous bus can accommodate slaves of arbitrary speed but requires careful design to avoid hang conditions if an acknowledge never arrives.
In partially interlocked (also called semi-synchronous) designs, the slave can insert wait states — holding a WAIT or READY signal active to extend the bus cycle — but the base protocol is clocked. This hybrid approach, used by most practical bus standards, achieves the implementation simplicity of a synchronous bus while accommodating slow devices.
Split-cycle or split-transaction buses separate the address and data phases of a transfer, allowing another master to use the bus during the slave’s access time. This dramatically improves bus utilisation in multi-master systems at the cost of a more complex protocol.
Chapter 3: Synchronisation and Data Transfers
3.1 The Synchronisation Problem
A peripheral operates at its own pace, governed by its physical characteristics. A keypad switch closes in a few milliseconds; a temperature sensor settles after many milliseconds; a UART receiver produces a new byte every few hundred microseconds at 115200 baud; a high-speed ADC produces samples every few tens of nanoseconds. The processor executes instructions at its own pace — perhaps hundreds of millions per second. Synchronisation is the problem of coordinating these two rates so that data is never lost (because the processor read before the peripheral had new data) or overwritten (because the peripheral produced new data before the processor consumed the old data).
3.2 Polling
The simplest synchronisation strategy is polling: the processor loops, repeatedly reading a peripheral status register, until a flag indicates that data is ready. This is straightforward to implement and delivers the lowest possible latency — the processor detects the event within a single polling interval — but it wastes CPU cycles on the checking loop.
Consider a UART that sets a Receive Data Ready (RDR) bit in its status register when a new byte has arrived. A polling loop in C would be:
volatile uint8_t *status_reg = (volatile uint8_t *)UART_STATUS;
volatile uint8_t *data_reg = (volatile uint8_t *)UART_RX_DATA;
while (!(*status_reg & UART_RDR_BIT))
; /* spin */
uint8_t byte = *data_reg;
The CPU executes the test repeatedly, consuming power and preventing other work. If the peripheral is slow — a human typing at a keyboard — the fraction of CPU time wasted polling approaches 100%. If the CPU has nothing else to do and response latency must be minimised (e.g., reading encoder pulses on a motion controller), polling is entirely appropriate. But in most embedded applications, the CPU has other tasks: running a user interface, computing a control law, servicing other peripherals. A polling loop for one peripheral prevents all of these.
3.3 Interrupt-Driven I/O
An interrupt is a hardware-initiated transfer of control from the currently executing program to a special routine called an interrupt service routine (ISR) or interrupt handler. When a peripheral event occurs — a UART byte arrives, a timer expires, a GPIO edge is detected — the peripheral asserts an interrupt request line. The CPU’s interrupt controller accepts the request, saves the processor state onto the stack, and dispatches to the ISR. When the ISR finishes, the saved state is restored and execution returns to the interrupted program.
Interrupt-driven I/O transforms the CPU’s relationship with peripherals: instead of waiting for the peripheral, the CPU works on useful computation until the peripheral demands attention. The overhead is the latency from event occurrence to ISR entry — determined by the number of instructions to finish executing, plus the interrupt latency of the controller hardware — and the time spent saving and restoring processor state.
3.3.1 Interrupt Latency and Response Analysis
Let \( T_{ISR} \) be the worst-case execution time of the ISR, and let \( T_{inter} \) be the minimum interval between successive interrupt requests from a given source. A necessary (but not sufficient) condition for the ISR to keep up with the source is:
\[ T_{ISR} \leq T_{inter}. \]
When multiple interrupt sources exist with periods \( T_1, T_2, \ldots, T_n \) and ISR costs \( C_1, C_2, \ldots, C_n \), the CPU utilisation devoted to interrupt handling is approximately
\[ U_{IRQ} = \sum_{i=1}^{n} \frac{C_i}{T_i}, \]
and the remaining fraction \( 1 - U_{IRQ} \) is available for background computation. This is the same rate-monotonic utilisation formula familiar from real-time systems analysis.
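The utilisation sum is straightforward to evaluate in code. The sketch below uses hypothetical source parameters (the struct name and numbers are illustrative, not from any particular platform) and integer arithmetic in parts per million to avoid floating point on a small MCU:

```c
#include <assert.h>

/* Hypothetical interrupt source: minimum period and worst-case ISR cost,
   both in CPU cycles. */
typedef struct { unsigned period_cycles; unsigned cost_cycles; } irq_source_t;

/* U_IRQ = sum(C_i / T_i), returned in parts per million so the arithmetic
   stays in integers. */
static unsigned irq_utilisation_ppm(const irq_source_t *src, int n) {
    unsigned long long u = 0;
    for (int i = 0; i < n; i++)
        u += (1000000ULL * src[i].cost_cycles) / src[i].period_cycles;
    return (unsigned)u;
}
```

For example, a 115200-baud UART byte interrupt on a 100 MHz CPU arrives roughly every 8680 cycles; with a 200-cycle ISR it costs about 2.3% of the CPU, and a 1 kHz timer tick with a 500-cycle ISR adds 0.5%.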
3.3.2 Nested Interrupts and Priority
Modern embedded systems must handle multiple simultaneous interrupt sources. The Nested Vectored Interrupt Controller (NVIC) on Cortex-M processors assigns a numerical priority level to each interrupt channel. A lower numerical value means higher priority. When the CPU is executing an ISR of priority \( P \) and an interrupt of priority \( Q < P \) arrives, the NVIC preempts the running ISR, saves its state, and dispatches the higher-priority handler. This nested interrupt capability ensures that time-critical events receive bounded service regardless of what less-critical ISRs are executing.
3.4 Shared Data Hazards
When an ISR and a background task share a data structure — a circular buffer, a counter, a flag — a race condition can occur if the ISR preempts the background task in the middle of a multi-step update. Suppose the background task is incrementing a 32-bit counter that the ISR also reads:
/* background task */
counter = counter + 1; /* may compile to: load, add, store */
If the ISR preempts after the load but before the store, the ISR sees a stale value of counter. The subsequent store by the background task overwrites the ISR’s read with the pre-incremented value — a lost update. This is a classic read-modify-write hazard on a non-atomic operation.
The Cortex-M architecture provides a solution: LDREX / STREX (load-link / store-conditional) instructions for software atomics, plus the CPSID I / CPSIE I instructions to globally disable/enable interrupts around a critical section. For short critical sections, disabling interrupts is simplest:
__disable_irq(); /* CPSID I — disables all configurable interrupts */
counter++;
__enable_irq(); /* CPSIE I */
A more structured approach is a mutex (mutual exclusion semaphore): a synchronisation primitive that ensures only one thread of execution enters a critical section at a time. In a bare-metal embedded system without an RTOS, a binary flag protected by interrupt-disable/enable serves the same purpose.
Chapter 4: Parallel Interfacing
4.1 The Role of the Parallel Interface
A parallel interface transfers multiple bits simultaneously over dedicated signal lines, one bit per line. It represents the oldest and most direct method of connecting a digital device to a processor bus. Understanding parallel interfacing requires understanding the timing mismatch between the bus and the device: the bus has its own timing — address valid, data valid, read/write cycle times — and the device has its own timing — access time for memory, setup time for latches, propagation delay for combinational logic. The interface must translate between these two timing regimes without violating either device’s constraints.
4.2 Timing Diagrams and Constraint Analysis
A bus timing diagram specifies, relative to the clock edge or address-strobe edge:
- Address valid \( t_{AV} \): time after the clock edge when the address lines are guaranteed stable.
- Data valid (for write): time after the clock edge when the data lines are guaranteed stable.
- Data required (for read): the latest point before the sampling edge by which the device must present stable data.
- Cycle time \( t_{cyc} \): total duration of one bus transaction.
The device’s data sheet specifies its own constraints:
- Address access time \( t_{acc} \): time from address-stable to data-stable on the device’s output (for read).
- Write setup time \( t_{ws} \): minimum time data must be valid before the write-enable de-asserts.
- Write pulse width \( t_{wp} \): minimum duration of the write-enable pulse.
The interface designer must verify that the bus timing satisfies the device constraints, inserting wait states if necessary to extend the bus cycle and give slow devices sufficient access time.
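The wait-state calculation can be sketched numerically. The model below is deliberately simplified (an assumption, not any particular bus standard): the address is driven at the start of the cycle, data is sampled \( N+1 \) clock periods later when \( N \) wait states are inserted, and the device needs its access time plus a setup margin before the sampling edge:

```c
#include <assert.h>

/* Minimum wait states for a read under a simplified model: a zero-wait cycle
   gives the device one clock period between address-valid and the sampling
   edge; each wait state adds one more period. */
static unsigned wait_states(unsigned t_acc_ns, unsigned t_setup_ns,
                            unsigned t_clk_ns) {
    unsigned required = t_acc_ns + t_setup_ns;              /* device + setup */
    unsigned cycles = (required + t_clk_ns - 1) / t_clk_ns; /* ceiling divide */
    return (cycles > 1) ? cycles - 1 : 0;
}
```

A 70 ns SRAM with a 5 ns setup margin on a 50 MHz (20 ns period) bus needs \( \lceil 75/20 \rceil = 4 \) periods, i.e. three wait states; a 10 ns device needs none.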
4.3 Memory-Mapped I/O Peripheral Access in C
On the Nios II and Cortex-M platforms, peripheral registers are accessed through pointers to volatile memory. A well-structured embedded C driver defines base addresses and register offsets as preprocessor constants, then wraps register accesses in inline functions:
#define GPIO_BASE 0xFF200000u
#define GPIO_DATA_REG (*(volatile uint32_t *)(GPIO_BASE + 0x00))
#define GPIO_DIR_REG (*(volatile uint32_t *)(GPIO_BASE + 0x04))
#define GPIO_IRQ_REG (*(volatile uint32_t *)(GPIO_BASE + 0x08))
#define GPIO_IRQ_MASK (*(volatile uint32_t *)(GPIO_BASE + 0x0C))
static inline void gpio_set_dir(uint32_t mask) { GPIO_DIR_REG = mask; }
static inline void gpio_write(uint32_t val) { GPIO_DATA_REG = val; }
static inline uint32_t gpio_read(void) { return GPIO_DATA_REG; }
Writing a 1 to a bit position in the direction register configures that bit as an output; writing 0 configures it as an input. Writing to the data register drives outputs; reading from the data register samples inputs. This API pattern — base address plus register-offset macros plus thin inline wrappers — is standard in embedded system driver libraries (see, e.g., STM32 HAL or TI DriverLib).
4.4 Servicing Latency and Throughput
The design of an interrupt-driven parallel interface requires careful analysis of two figures of merit:
Interrupt latency is the time from the peripheral asserting its interrupt request to the first instruction of the ISR executing. This includes the time remaining in the current instruction (variable) plus the NVIC exception-entry sequence — register stacking, vector fetch, and pipeline refill — which is fixed at 12 cycles on a Cortex-M3 with zero-wait-state memory. Best-case Cortex-M3 interrupt latency is therefore 12 clock cycles, with some variation due to late-arriving exceptions and bus stalls.
Throughput is the rate at which the interface can sustain data transfer. For an interrupt-driven GPIO that triggers on a strobe from an external device:
\[ \text{max throughput} = \frac{1}{T_{ISR} + T_{overhead}}, \]
where \( T_{overhead} \) includes interrupt entry/exit costs. If the ISR takes 50 cycles and interrupt overhead is 24 cycles (12 entry + 12 exit), then at 100 MHz the maximum throughput is \( 100 \times 10^6 / 74 \approx 1.35 \) million transactions per second.
Chapter 5: Error Detection and Error Correction
5.1 Why Errors Occur
Data stored in memory or transmitted across a channel can be corrupted by physical noise: ionising radiation causing single-event upsets in SRAM cells, electromagnetic interference inducing bit errors on long cable runs, power supply noise causing marginal flip-flops to make incorrect transitions. The probability of a bit error may be small — perhaps \( 10^{-12} \) per bit per hour in modern DRAM under normal conditions — but for a 1 Gbit memory operating for years, some errors are likely. Critical applications — aerospace, medical, financial — require the system to detect and correct such errors.
5.2 Parity — Single-Bit Error Detection
The simplest error-detecting code augments each data word with a single parity bit chosen so that the number of 1-bits in the extended word (data plus parity) is always even (even parity) or always odd (odd parity). The receiver recalculates the parity of the received word and compares it with the received parity bit; a mismatch indicates at least one bit error.
Even parity for a 4-bit word \( d_3 d_2 d_1 d_0 \):
\[ p = d_3 \oplus d_2 \oplus d_1 \oplus d_0. \]
Parity detects all single-bit errors and any odd number of bit errors. It cannot detect double-bit errors (two flips cancel in the XOR), and it cannot correct any errors — only detect them.
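The XOR reduction above maps directly to code. A minimal sketch for a 4-bit word (the function name is illustrative), folding the word onto itself so the low bit ends up as the XOR of all four data bits:

```c
#include <assert.h>
#include <stdint.h>

/* Even parity bit for a 4-bit word: p = d3 ^ d2 ^ d1 ^ d0.
   A parity mismatch on receive flags any odd number of bit flips. */
static uint8_t even_parity4(uint8_t d) {
    d &= 0x0Fu;     /* keep the four data bits */
    d ^= d >> 2;    /* fold upper pair onto lower pair */
    d ^= d >> 1;    /* fold again: bit 0 = XOR of all four bits */
    return d & 1u;
}
```

For 0b1011 (three 1-bits) the parity bit is 1; for 0b1001 (two 1-bits) it is 0.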
5.3 Hamming Codes — Single-Error Correction, Double-Error Detection (SECDED)
Richard Hamming’s 1950 insight was that by adding multiple redundant check bits, each covering a different subset of data bits, the position of a single-bit error could be precisely identified and corrected.
For a data word of length \( k \) bits, the number of parity bits \( r \) needed to correct single-bit errors satisfies:
\[ 2^r \geq k + r + 1. \]
For \( k = 4 \), we need \( r = 3 \) (since \( 2^3 = 8 \geq 8 = 4+3+1 \)), giving a 7-bit codeword. For \( k = 8 \), \( r = 4 \) (since \( 2^4 = 16 \geq 13 \)), giving a 12-bit codeword.
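The inequality can be solved by simple search. A small sketch (function name illustrative) that returns the smallest \( r \) satisfying \( 2^r \geq k + r + 1 \):

```c
#include <assert.h>

/* Smallest number of Hamming parity bits r such that 2^r >= k + r + 1,
   i.e. enough syndrome values to name every codeword position plus
   "no error". */
static unsigned hamming_parity_bits(unsigned k) {
    unsigned r = 1;
    while ((1u << r) < k + r + 1)
        r++;
    return r;
}
```

This reproduces the worked figures in the text: 3 parity bits for 4 data bits, 4 for 8, and 7 for 64 (to which SECDED adds one global parity bit, giving the 64+8 organisation used in ECC memory).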
In a systematic Hamming code, parity bits are placed at positions that are powers of 2 (1, 2, 4, 8, …), and data bits fill the remaining positions. Parity bit \( p_i \) at position \( 2^{i-1} \) covers all bit positions whose binary representation has a 1 in bit position \( i-1 \). As a worked example, consider encoding the data bits \( d_1 = 1, d_2 = 0, d_3 = 1, d_4 = 1 \):
- Position 1 (binary 001): parity bit \( p_1 \)
- Position 2 (binary 010): parity bit \( p_2 \)
- Position 3 (binary 011): data bit \( d_1 = 1 \)
- Position 4 (binary 100): parity bit \( p_4 \)
- Position 5 (binary 101): data bit \( d_2 = 0 \)
- Position 6 (binary 110): data bit \( d_3 = 1 \)
- Position 7 (binary 111): data bit \( d_4 = 1 \)
\( p_1 \) covers positions 1, 3, 5, 7: \( p_1 = 1 \oplus 0 \oplus 1 = 0 \). \( p_2 \) covers positions 2, 3, 6, 7: \( p_2 = 1 \oplus 1 \oplus 1 = 1 \). \( p_4 \) covers positions 4, 5, 6, 7: \( p_4 = 0 \oplus 1 \oplus 1 = 0 \).
Codeword: 1 1 0 0 1 1 0 (positions 7 down to 1: \( b_7=1, b_6=1, b_5=0, b_4=0, b_3=1, b_2=1, b_1=0 \)).
To correct a single-bit error, the receiver recomputes the three syndrome bits \( s_1, s_2, s_4 \) (using the same covering sets as the parity bits), concatenates them as a binary number \( s_4 s_2 s_1 \), and if non-zero, flips the bit at the position indicated by the syndrome.
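The encode-and-correct procedure can be sketched directly from the positional layout above (bit \( n \) of a byte holds codeword position \( n \); bit 0 is unused). This is a minimal illustration, not production ECC code:

```c
#include <assert.h>
#include <stdint.h>

/* XOR of the codeword bits at positions 1..7 whose index has any bit of
   'mask' set — the covering sets used by p1 (mask 1), p2 (2), p4 (4). */
static uint8_t parity_of(uint8_t cw, uint8_t mask) {
    uint8_t p = 0;
    for (int pos = 1; pos <= 7; pos++)
        if ((pos & mask) && ((cw >> pos) & 1u))
            p ^= 1u;
    return p;
}

/* Encode a 4-bit data word (bit 0 = d1 ... bit 3 = d4) into a Hamming(7,4)
   codeword: data bits at positions 3,5,6,7; parity bits at 1,2,4. */
static uint8_t hamming74_encode(uint8_t data) {
    uint8_t cw = 0;
    cw |= ((data >> 0) & 1u) << 3;   /* d1 -> position 3 */
    cw |= ((data >> 1) & 1u) << 5;   /* d2 -> position 5 */
    cw |= ((data >> 2) & 1u) << 6;   /* d3 -> position 6 */
    cw |= ((data >> 3) & 1u) << 7;   /* d4 -> position 7 */
    cw |= parity_of(cw, 1u) << 1;    /* p1 */
    cw |= parity_of(cw, 2u) << 2;    /* p2 */
    cw |= parity_of(cw, 4u) << 4;    /* p4 */
    return cw;
}

/* Recompute the syndrome s4 s2 s1; a non-zero syndrome is the position of
   the single flipped bit, so flipping it back corrects the error. */
static uint8_t hamming74_correct(uint8_t cw) {
    uint8_t s = parity_of(cw, 1u)
              | (uint8_t)(parity_of(cw, 2u) << 1)
              | (uint8_t)(parity_of(cw, 4u) << 2);
    if (s)
        cw ^= (uint8_t)(1u << s);
    return cw;
}
```

Encoding the worked example (\( d_1=1, d_2=0, d_3=1, d_4=1 \), i.e. the nibble 0xD) yields codeword bits 1100110 at positions 7..1; flipping any single bit, say position 5, produces a syndrome of 5 and is corrected.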
Adding a global parity bit over all positions yields SECDED: single-error correct, double-error detect. If the syndrome is non-zero but the global parity check passes, two bits are in error (detected but uncorrectable). ECC (Error-Correcting Code) memory in servers uses Hamming-based SECDED codes over 64-bit data words with 8 check bits.
5.4 Error Types in Practice
Hard errors are permanent faults — a stuck-at-0 or stuck-at-1 bit cell — that reproduce reliably. They are diagnosed at manufacturing test (using stuck-at fault models) and cause the device to be discarded or remapped.
Soft errors (single-event upsets) are transient bit flips caused by high-energy particles (cosmic rays, alpha particles from packaging materials) striking a storage cell and depositing enough charge to flip its state. They do not damage the cell; the next write restores correct operation. Soft error rates in SRAM are characterised in FITs (failures in time, where 1 FIT = 1 failure per \( 10^9 \) device-hours). High-reliability systems use ECC memory to tolerate soft errors in flight.
Chapter 6: Serial Interfacing
6.1 Motivation for Serial Communication
A parallel bus connecting two chips requires one wire per bit — 8, 16, or 32 wires for typical data widths — plus separate control and clock lines. At PCB scale this is manageable. But for communication over longer distances, between boards or to remote sensors, the cost of multiple parallel conductors becomes prohibitive. More importantly, at high frequencies, maintaining signal integrity across many parallel lines simultaneously — ensuring all bits arrive within a single bit period of each other — requires meticulous impedance matching and length matching that adds cost and constrains layout.
Serial communication resolves both problems by transmitting bits one at a time over a single data line (or a small number of lines). The trade-off is a reduction in raw throughput for a given clock rate, but serial techniques enable the use of differential signalling (such as LVDS or RS-485), which provides excellent noise immunity over long distances, and they enable clock encoding schemes that permit very high bit rates on short links.
6.2 Asynchronous Serial Communication: UART
The Universal Asynchronous Receiver/Transmitter (UART) is the oldest and most widely used serial interface in embedded systems. It requires only two signal lines per direction (typically labelled TX and RX), operates without a shared clock, and is supported by virtually every microcontroller produced in the last 40 years.
6.2.1 Frame Format
A UART frame consists of:
- Start bit — a single bit of logic 0, transitioning from the idle (logic 1) line state. This marks the beginning of a character.
- Data bits — typically 8 bits, transmitted LSB first.
- Parity bit (optional) — even, odd, or none.
- Stop bit(s) — one or two bits of logic 1, returning the line to idle.
At 115200 baud with 8N1 format (8 data bits, no parity, 1 stop bit), each frame is 10 bit-periods wide, consuming \( 10 / 115200 \approx 86.8 \) µs. Effective data throughput is 11,520 bytes per second, or 92.16 kbits/s.
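The frame arithmetic above is worth capturing as code, since it recurs whenever a UART buffer or timeout is sized. A small sketch (function names illustrative), using 64-bit intermediates so the nanosecond product cannot overflow:

```c
#include <assert.h>

/* Duration of one frame in nanoseconds: bits_per_frame / baud.
   For 8N1, bits_per_frame = 10 (start + 8 data + stop). */
static unsigned long long frame_time_ns(unsigned baud,
                                        unsigned bits_per_frame) {
    return (1000000000ULL * bits_per_frame) / baud;
}

/* Payload throughput in bytes per second for one data byte per frame. */
static unsigned bytes_per_second(unsigned baud, unsigned bits_per_frame) {
    return baud / bits_per_frame;
}
```

At 115200 baud with a 10-bit frame this gives 86,805 ns per frame and 11,520 bytes per second, matching the figures in the text.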
6.2.2 Clock Recovery
Because there is no shared clock, the receiver must derive bit timing from the data stream. A UART receiver samples each bit near its centre to maximise noise immunity. With an oversampling factor of 16 (the standard), the receiver samples the line 16 times per bit period. On detecting the start bit’s falling edge, it counts to sample 8 (the centre of the start bit) to confirm the start bit, then samples at counts 24, 40, 56, … relative to the falling edge — the centres of the subsequent bits.
The accuracy of this scheme depends on the two ends maintaining the same nominal baud rate. A frequency offset of \( \delta \) percent accumulates as a timing error of \( \delta \times N \)% over a frame of \( N \) bits. For 8N1 (10 bit-periods), the accumulated error at the last data bit must remain less than half a bit period (50%). This requires \( \delta < 5\% \), which standard crystal oscillators and fractional-N baud-rate generators achieve comfortably.
6.2.3 UART Initialisation in C
The following example configures a Cortex-M UART peripheral (using a notional register-level API):
#define UART_BASE 0x40004400u
#define UART_CR1 (*(volatile uint32_t *)(UART_BASE + 0x0C))
#define UART_BRR (*(volatile uint32_t *)(UART_BASE + 0x08))
#define UART_SR (*(volatile uint32_t *)(UART_BASE + 0x00))
#define UART_DR (*(volatile uint32_t *)(UART_BASE + 0x04))
#define UART_CR1_UE (1u << 13)
#define UART_CR1_TE (1u << 3)
#define UART_CR1_RE (1u << 2)
#define UART_SR_TXE (1u << 7)
#define UART_SR_RXNE (1u << 5)
void uart_init(uint32_t pclk_hz, uint32_t baud) {
UART_BRR = pclk_hz / baud; /* integer divider; ignores fractional */
UART_CR1 = UART_CR1_UE | UART_CR1_TE | UART_CR1_RE;
}
void uart_send_byte(uint8_t b) {
while (!(UART_SR & UART_SR_TXE)) /* wait for transmit data register empty */
;
UART_DR = b;
}
uint8_t uart_recv_byte(void) {
while (!(UART_SR & UART_SR_RXNE)) /* wait for receive data register not empty */
;
return (uint8_t)UART_DR;
}
In a real system, uart_send_byte would be driven by a transmit-buffer-empty interrupt rather than polling, and uart_recv_byte would be replaced by a receive interrupt that stores incoming bytes in a circular buffer for consumption by the application.
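The circular (ring) buffer mentioned above can be sketched as follows. This is a minimal single-producer/single-consumer version (names and size are illustrative): the ISR writes only the head index and the application writes only the tail, so no interrupt-disable section is needed around individual operations:

```c
#include <assert.h>
#include <stdint.h>

#define RX_BUF_SIZE 64u   /* power of two, so wrap-around is a cheap mask */

static volatile uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint32_t rx_head;   /* written only by the receive ISR */
static volatile uint32_t rx_tail;   /* written only by the application */

/* Called from the receive ISR with each incoming byte; drops the byte if
   the buffer is full (an overrun the application can count if desired). */
static void rx_put(uint8_t b) {
    uint32_t next = (rx_head + 1u) & (RX_BUF_SIZE - 1u);
    if (next != rx_tail) {          /* full when head would catch the tail */
        rx_buf[rx_head] = b;
        rx_head = next;
    }
}

/* Called from the application; returns 1 and stores a byte if available. */
static int rx_get(uint8_t *out) {
    if (rx_tail == rx_head)
        return 0;                   /* empty */
    *out = rx_buf[rx_tail];
    rx_tail = (rx_tail + 1u) & (RX_BUF_SIZE - 1u);
    return 1;
}
```

The receive ISR simply reads the data register and calls rx_put; the application drains bytes with rx_get at its leisure, preserving arrival order.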
6.3 Synchronous Serial: SPI
The Serial Peripheral Interface (SPI) is a synchronous, full-duplex protocol originally developed by Motorola. It uses four signals:
- SCLK (Serial Clock) — driven by the master.
- MOSI (Master Out Slave In) — data from master to slave.
- MISO (Master In Slave Out) — data from slave to master.
- SS̄ / CS̄ (Slave Select / Chip Select) — active-low, one per slave, driven by the master.
Because the clock is provided by the master, there is no need for clock recovery; both master and slave sample MISO/MOSI on the same clock edge. SPI achieves significantly higher throughput than UART — tens of megabits per second is common — and is well suited to ADCs, DACs, flash memories, and displays.
6.3.1 SPI Modes
SPI defines four clock polarity/phase combinations, selected by two bits CPOL (clock polarity) and CPHA (clock phase):
| Mode | CPOL | CPHA | Idle Clock | Sample Edge |
|---|---|---|---|---|
| 0 | 0 | 0 | Low | Rising |
| 1 | 0 | 1 | Low | Falling |
| 2 | 1 | 0 | High | Falling |
| 3 | 1 | 1 | High | Rising |
The slave device’s data sheet specifies which mode it supports. Mismatched SPI mode is a common cause of garbled data in embedded designs.
6.3.2 SPI Transaction
An SPI transaction begins when the master asserts (pulls low) the appropriate SS̄ line, clocks out a command or address byte on MOSI while simultaneously shifting in the slave’s response on MISO, and finally de-asserts SS̄. The master and slave shift registers exchange one bit per clock pulse, so after eight pulses they have swapped their full contents; what the slave transmits during the command-byte phase is often a dummy byte (0x00 or 0xFF), with the actual response appearing in subsequent bytes.
uint8_t spi_transfer(uint8_t tx) {
while (!(SPI_SR & SPI_SR_TXE))
;
SPI_DR = tx;
while (!(SPI_SR & SPI_SR_RXNE))
;
return (uint8_t)SPI_DR;
}
uint16_t adc_read_channel(uint8_t ch) {
    uint16_t result; /* a 10-bit sample does not fit in uint8_t */
    CS_LOW();
    spi_transfer(0x01); /* start bit */
    result = spi_transfer(0x80 | (ch << 4)); /* SGL/DIFF, channel */
    result = (result & 0x03) << 8; /* top two bits of the 10-bit result */
    result |= spi_transfer(0x00); /* clock out low byte of result */
    CS_HIGH();
    return result;
}
6.4 Two-Wire Serial: I2C
The Inter-Integrated Circuit (I2C) bus, developed by Philips (now NXP), provides a simple two-wire multi-master, multi-slave serial protocol using:
- SDA (Serial Data) — open-drain, bidirectional.
- SCL (Serial Clock) — open-drain, driven by master (slaves may hold it low to stretch the clock).
Open-drain signalling means any device on the bus can pull a line low, and a pull-up resistor returns it to logic high when no device drives it low. This wired-AND arrangement allows multiple masters and slaves to share the two wires without bus contention.
6.4.1 I2C Protocol Elements
An I2C transaction consists of:
- START condition — SDA falls while SCL is high. Only the master generates this.
- Address byte — 7 or 10 bits identifying the target slave, followed by a Read/Write bit. The slave whose address matches acknowledges by pulling SDA low during the acknowledgement clock pulse.
- Data byte(s) — transferred MSB first, each followed by an ACK from the receiver.
- STOP condition — SDA rises while SCL is high.
A repeated START (Sr) allows a master to change direction (read after write, or write after read) without releasing the bus.
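For concreteness, the first byte on the wire combines the 7-bit address with the direction bit; `i2c_addr_byte` below is an illustrative helper, not a vendor API:

```c
#include <stdint.h>

#define I2C_WRITE 0u
#define I2C_READ  1u

/* Form the I2C address byte: the 7-bit slave address occupies bits 7..1
 * and the R/W bit occupies bit 0 (1 = read, 0 = write). */
static inline uint8_t i2c_addr_byte(uint8_t addr7, uint8_t rw) {
    return (uint8_t)((uint8_t)(addr7 << 1) | (rw & 1u));
}
```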
I2C standard mode operates at 100 kbit/s, fast mode at 400 kbit/s, fast-mode plus at 1 Mbit/s, and high-speed mode at 3.4 Mbit/s. The pull-up resistor value trades speed against noise margin and power consumption — a lower resistance gives faster rise times (higher speed) but increases the current drawn whenever a device holds a line low.
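The speed side of that trade-off can be quantified: the I2C specification (NXP UM10204) bounds the maximum pull-up by the allowed rise time \( t_r \) and the bus capacitance \( C_b \), as \( R_{p,max} = t_r / (0.8473 \, C_b) \). A quick calculation, assuming a 100 pF bus:

```c
/* Maximum pull-up resistance for a given rise-time budget and bus
 * capacitance, from the I2C spec: Rp(max) = t_r / (0.8473 * Cb). */
static double i2c_rp_max(double t_r_s, double c_bus_f) {
    return t_r_s / (0.8473 * c_bus_f);
}
/* e.g. fast mode allows t_r <= 300 ns, so with Cb = 100 pF,
 * i2c_rp_max(300e-9, 100e-12) is about 3.5 kOhm. */
```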
6.4.2 Clock Stretching and Arbitration
A slow slave may hold SCL low after the falling edge of each clock pulse, forcing the master to pause. This clock stretching allows the slave to process incoming data at its own pace. Multi-master I2C requires bitwise arbitration: a master that drives SDA high while observing SDA low (due to another master) immediately loses arbitration, halts its transmission, and waits before retrying. Because arbitration is implicit in the SDA value, no special arbitration protocol is needed.
Chapter 7: Analogue Interfacing
7.1 The Analogue-Digital Boundary
The physical world is analogue: temperatures, pressures, accelerations, voltages, and acoustic pressures are continuous quantities that can take any value within their range. Embedded systems ultimately interact with this world — reading sensors, driving actuators — so they must convert between analogue and digital representations. The ADC (Analogue-to-Digital Converter) and DAC (Digital-to-Analogue Converter) are the components that span this boundary.
7.2 Digital-to-Analogue Conversion
A DAC accepts an \( n \)-bit digital code \( D \in \{0, 1, \ldots, 2^n - 1\} \) and produces a corresponding analogue output voltage:
\[ V_{out} = V_{ref} \cdot \frac{D}{2^n}, \]
where \( V_{ref} \) is the full-scale reference voltage. An ideal 12-bit DAC with \( V_{ref} = 3.3 \) V has a resolution (least significant bit, LSB) of \( 3.3 / 4096 \approx 0.806 \) mV.
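The transfer function translates directly into code; a one-line helper (illustrative, not a vendor API):

```c
#include <stdint.h>

/* Ideal DAC output voltage for an n-bit code with reference Vref. */
static double dac_volts(uint32_t code, unsigned nbits, double vref) {
    return vref * (double)code / (double)(1u << nbits);
}
/* e.g. for a 12-bit DAC with Vref = 3.3 V, one LSB is
 * dac_volts(1, 12, 3.3), about 0.806 mV, matching the text. */
```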
7.2.1 R-2R Ladder DAC
The R-2R ladder network implements DAC conversion using only two resistor values, making it practical to integrate in silicon. The ladder consists of resistors of value \( R \) in the series path and \( 2R \) in the shunt path. Each bit switches its shunt resistor between \( V_{ref} \) and GND. The network’s Thevenin equivalent produces a current proportional to the binary-weighted sum of the bit inputs. The R-2R architecture is insensitive to the absolute values of \( R \) and \( 2R \) provided their ratio is accurate — a constraint achievable with modern IC fabrication.
7.2.2 DAC Errors
Static errors are systematic deviations from the ideal transfer characteristic:
- Offset error: the output when the input code is zero is not exactly zero. Corrected by calibration.
- Gain error: the slope of the transfer characteristic deviates from the ideal. Also correctable.
- Differential Non-Linearity (DNL): the actual step size for a unit code increment deviates from the ideal 1 LSB. A DNL more negative than \( -1 \) LSB makes the DAC non-monotonic.
- Integral Non-Linearity (INL): the maximum deviation of the actual transfer characteristic from the ideal straight line.
Dynamic errors manifest during rapidly changing outputs:
- Settling time: the time from a code change to the output settling within \( \pm 0.5 \) LSB of its final value. It matters whenever the DAC is updated rapidly, e.g. at the audio sample rate.
- Glitch impulse: a transient spike occurring when multiple bits change simultaneously (e.g., from code 0111…1 to 1000…0). Major-carry transitions produce the largest glitches.
7.3 Analogue-to-Digital Conversion
An ADC samples a continuous analogue voltage at discrete time instants and quantises each sample to an \( n \)-bit integer. The sampling theorem (Nyquist-Shannon) mandates that the sampling frequency \( f_s \) must exceed twice the highest frequency component \( f_{max} \) in the analogue signal:
\[ f_s > 2 f_{max}. \]
Sampling below this rate causes aliasing: high-frequency content appears as low-frequency artefacts in the digital signal. In practice, an anti-aliasing filter (a low-pass filter with cut-off below \( f_s / 2 \)) is placed before the ADC input to suppress frequency content that would otherwise alias.
7.3.1 Sample-and-Hold
Before quantisation, the input voltage must be held constant for the duration of the conversion. A sample-and-hold (S&H) circuit does this: a transmission gate samples the analogue input onto a capacitor, then opens, holding the capacitor voltage constant while the ADC conversion proceeds. The aperture time of the S&H — the time uncertainty in when sampling occurs — limits the maximum input frequency. For an input sinusoid of amplitude \( A \) and frequency \( f \), the maximum rate of change is \( 2\pi f A \), so an aperture uncertainty of \( \Delta t \) produces a voltage error of \( 2\pi f A \Delta t \). To keep this error below 0.5 LSB with a full-scale \( A \) and \( n \)-bit ADC, the required aperture is:
\[ \Delta t < \frac{1}{2\pi f \cdot 2^n}. \]
For a 12-bit ADC sampling a 10 kHz signal, \( \Delta t < 1 / (2\pi \times 10000 \times 4096) \approx 3.9 \) ns.
7.3.2 Successive Approximation ADC
The Successive Approximation Register (SAR) ADC is the dominant architecture in embedded microcontrollers for medium-speed (up to a few MSPS), medium-resolution (8–16 bits) applications. It uses a binary search algorithm: starting from the MSB, it sets each bit to 1, compares the resulting DAC output to the input, and retains the bit if the DAC output is less than or equal to the input, or clears it otherwise.
```
for bit = n-1 down to 0:
    set bit → compare DAC output to V_in
    if DAC output > V_in: clear bit
    else:                 keep bit
```
An \( n \)-bit SAR ADC requires exactly \( n \) comparisons per conversion, making it deterministic and predictable — ideal for embedded real-time applications. Conversion time is \( n \times T_{clk,ADC} \), where \( T_{clk,ADC} \) is the ADC clock period.
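The binary search translates directly into C. In the sketch below the analogue comparator is modelled by a callback that reports whether the internal DAC output for a trial code exceeds \( V_{in} \); the callback and the `sar_demo` test harness are illustrative abstractions, not hardware registers:

```c
#include <stdint.h>

/* Comparator model: nonzero when the DAC output for 'code' exceeds Vin. */
typedef int (*sar_cmp_fn)(uint16_t code, void *ctx);

/* n-bit successive-approximation conversion: one comparison per bit. */
static uint16_t sar_convert(unsigned nbits, sar_cmp_fn dac_above_vin, void *ctx) {
    uint16_t code = 0;
    for (int bit = (int)nbits - 1; bit >= 0; bit--) {
        code |= (uint16_t)(1u << bit);        /* trial-set this bit */
        if (dac_above_vin(code, ctx))
            code &= (uint16_t)~(1u << bit);   /* DAC > Vin: clear it */
    }
    return code;
}

/* Ideal model for testing: DAC output equals the code itself. */
static uint16_t vin_model;
static int cmp_model(uint16_t code, void *ctx) { (void)ctx; return code > vin_model; }
static uint16_t sar_demo(uint16_t vin, unsigned nbits) {
    vin_model = vin;
    return sar_convert(nbits, cmp_model, 0);
}
```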
7.3.3 ADC in Embedded C
A typical SAR ADC initialisation and read sequence (Cortex-M STM32-style):
```c
/* Enable ADC clock in RCC, configure GPIO pin as analogue input */
ADC1->CR2 |= ADC_CR2_ADON;              /* power on ADC */
ADC1->SQR3 = channel;                   /* select channel */
ADC1->CR2 |= ADC_CR2_SWSTART;           /* start conversion */
while (!(ADC1->SR & ADC_SR_EOC))        /* wait for end-of-conversion */
    ;
uint16_t raw = (uint16_t)ADC1->DR;      /* read 12-bit result */
float voltage = raw * (3.3f / 4096.0f); /* scale to volts */
```
In a production design, the polling wait would be replaced by a DMA transfer triggered by the EOC signal, moving the converted sample directly into a buffer without CPU intervention.
7.4 Audio Playback: Lab 2 Architecture
Lab 2’s audio pipeline illustrates the integration of multiple interface concepts. An SD card (accessed via SPI) stores a raw PCM audio file sampled at 8 kHz or 44.1 kHz. A SysTick or general-purpose timer fires at the audio sample rate. The ISR reads the next sample from a software buffer and writes it to a DAC control register (via SPI or parallel interface). A second higher-level task, running in the background, refills the buffer from the SD card in block-aligned chunks. The design must ensure the buffer never empties (causing an audible glitch) despite the variable latency of SD card reads.
Chapter 8: Bus Data Transfer
8.1 Synchronous Bus Protocol
In a synchronous bus, every operation is governed by the system clock. A typical synchronous read transaction proceeds as follows:
- Master asserts address on the address bus at the start of clock period 1.
- Master asserts read (R/W̄ = 1) and bus enable signals.
- Slave decodes address, drives data onto the data bus before the sample point (typically the rising edge of clock period 2 or 3).
- Master samples data at the rising edge of clock period 2 (or later, if wait states are inserted).
- Master releases the bus; slave tri-states its data output.
The synchronous protocol is simple to implement and analyse. Its weakness is inflexibility: the cycle time is fixed, so slow devices must insert wait states, and fast devices cannot exploit their speed advantage beyond the minimum cycle time.
8.2 Asynchronous Bus Handshake
An asynchronous bus uses explicit request/acknowledge signals in place of a shared clock. A fully interlocked (four-cycle) handshake:
- Master asserts REQ (request).
- Slave responds with ACK (acknowledge) after completing the operation.
- Master observes ACK and de-asserts REQ.
- Slave observes REQ de-assert and de-asserts ACK.
This protocol adapts naturally to devices of any speed. A fast memory completes in a few nanoseconds; a slow I/O device may take milliseconds. The protocol accommodates both without wasted cycles. The disadvantage is increased latency per cycle due to the four handshake phases, and the requirement to handle metastability on REQ and ACK signals crossing clock domain boundaries.
8.3 Semi-Synchronous and Split-Cycle Buses
A semi-synchronous bus is clocked, but the slave can assert a WAIT signal to extend the cycle by whole clock periods. This is the most common practical design: the base cycle is fast (one or two clocks) and slow devices simply hold WAIT active for as many additional clocks as needed. Implementation is straightforward: the master samples WAIT one setup time before the data sample edge; if asserted, it re-latches the cycle’s address and control signals for another clock period.
A split-cycle bus separates address and data phases into independent sub-transactions. The master initiates the address phase and then releases the bus. When the slave has retrieved the data, it requests the bus again for the data phase. Between the two phases, the bus is free for other transactions. This dramatically improves utilisation for slow slaves, but requires the master and slave to maintain transaction context across the gap, adding protocol complexity.
8.4 Bus Performance Analysis
Bus throughput depends on the average number of cycles per transfer. Let \( T_{bus} \) be the nominal bus cycle time, \( n_{wait} \) the average number of wait states, and \( n_{overhead} \) the average overhead cycles (address phase, turnaround, etc.):
\[ \text{Bus throughput} = \frac{\text{data width in bytes}}{(1 + n_{wait} + n_{overhead}) \times T_{bus}}. \]
For a 32-bit synchronous bus at 50 MHz (\( T_{bus} = 20 \) ns) with an average of 1 wait state and no overhead:
\[ \text{Throughput} = \frac{4 \text{ B}}{2 \times 20 \text{ ns}} = \frac{4}{40 \times 10^{-9}} = 100 \text{ MB/s}. \]
Chapter 9: Bus Arbitration
9.1 The Need for Arbitration
A shared bus can have only one master active at a time; if two masters attempt to drive address or data lines simultaneously, contention occurs — the resulting voltage is neither a valid logic 0 nor a valid logic 1, and both transactions are corrupted. Arbitration is the process by which multiple masters negotiate for exclusive bus ownership.
Requirements for a good arbitration scheme:
- Mutual exclusion: only one master receives the bus grant at a time.
- Fairness: no master is permanently denied access (freedom from starvation).
- Bounded latency: each master receives the bus within a known worst-case time.
- Efficiency: arbitration overhead should be small relative to bus transaction time.
9.2 Centralised Arbitration: Daisy Chain
In daisy-chain arbitration, a single central arbiter drives a Bus Grant (BG) line that passes through masters in series. When a master wishes to use the bus, it asserts Bus Request (BR). The arbiter, seeing BR asserted and the bus idle, asserts BG. The grant propagates down the daisy chain until it reaches the first master that has a pending request; that master captures the grant and asserts Bus Busy (BB).
Daisy-chain priority is determined by physical position in the chain — the master closest to the arbiter always wins. This creates a static priority scheme that can starve lower-priority masters under heavy load. It is simple to implement (one extra wire per master, plus BR and BB shared lines) and scales to many masters at low hardware cost.
9.3 Centralised Arbitration: Non-Daisy-Chain (Independent Request)
Each master has its own BR line to the central arbiter. The arbiter implements a priority encoder (selecting the highest-priority requester) or a round-robin scheduler (advancing the priority pointer after each grant). Non-daisy-chain arbitration provides:
- Configurable priority: the arbiter software or hardware can implement any priority policy.
- Bounded waiting time for round-robin: with \( N \) masters, any requester waits at most \( N - 1 \) grants before receiving its turn.
- Higher wiring cost: \( N \) separate BR lines, typically bundled into a bus.
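The round-robin grant decision reduces to scanning the request lines starting just past the last grantee, which is what gives the \( N - 1 \) bound on waiting. A sketch of the pure logic (no bus signals):

```c
#include <stdint.h>

/* Round-robin arbiter: 'req' holds one request bit per master and
 * 'last' is the previously granted master. Returns the next master
 * to grant, or -1 if no requests are pending. */
static int rr_grant(uint32_t req, int last, int n_masters) {
    for (int i = 1; i <= n_masters; i++) {
        int m = (last + i) % n_masters;   /* rotate priority past 'last' */
        if (req & (1u << m))
            return m;
    }
    return -1;
}
```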
9.4 Distributed Arbitration
In a distributed scheme, no single arbiter exists. Each master observes the state of the bus and makes its own grant decision based on a shared arbitration protocol. Ethernet’s CSMA/CD (Carrier Sense Multiple Access with Collision Detection) is a distributed arbitration protocol: each station listens before transmitting, detects collisions, and implements a random binary-exponential backoff to resolve them.
In a wired-OR bus with assigned priority IDs, each master drives its priority bits onto an open-collector arbitration bus and monitors the bus. A master that drives a 1 but sees a 0 (a higher-priority master also requesting) withdraws immediately. This is the arbitration mechanism of I2C and also of the Controller Area Network (CAN bus), which uses bitwise NRZ arbitration without collision: the winning master’s frame is transmitted without corruption.
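The wired-AND mechanism can be simulated directly. In the sketch below (an illustrative model, not driver code) each still-active node drives its ID MSB-first, the bus carries the AND of all driven bits, and any node that sends a recessive 1 but reads back a dominant 0 withdraws — so the lowest ID always wins:

```c
#include <stdint.h>

/* Simulate bitwise wired-AND arbitration among n nodes transmitting
 * 'nbits'-wide IDs MSB-first. Returns the index of the winning node. */
static int arbitrate(const uint16_t *ids, int n, int nbits) {
    uint32_t active = (1u << n) - 1u;          /* all nodes still contending */
    for (int b = nbits - 1; b >= 0; b--) {
        int bus = 1;                           /* recessive unless driven low */
        for (int i = 0; i < n; i++)
            if ((active >> i) & 1u)
                bus &= (int)((ids[i] >> b) & 1u);  /* wired-AND */
        for (int i = 0; i < n; i++)            /* recessive senders that saw */
            if (((active >> i) & 1u) &&        /* dominant must drop out     */
                (int)((ids[i] >> b) & 1u) != bus)
                active &= ~(1u << i);
    }
    for (int i = 0; i < n; i++)
        if ((active >> i) & 1u) return i;
    return -1;
}

/* Convenience wrapper: arbitrate among three IDs. */
static int arbitrate3(uint16_t a, uint16_t b, uint16_t c, int nbits) {
    uint16_t ids[3] = { a, b, c };
    return arbitrate(ids, 3, nbits);
}
```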
Chapter 10: Direct Memory Access
10.1 CPU Bottleneck in High-Rate Transfers
Consider a microcontroller sampling an ADC at 44.1 kHz (audio-rate). Each interrupt fires every 22.7 µs. The ISR must read the ADC result register and store it in a buffer — perhaps 10–20 instructions. At 100 MHz, this consumes perhaps 200 ns per sample, or about 0.9% of CPU time. Tolerable.
Now consider a video ADC at 25 frames per second with \( 640 \times 480 \) pixels, 2 bytes per pixel: \( 15.36 \) million bytes per second. An interrupt per byte would fire 15.36 million times per second, and each interrupt’s entry/exit overhead (≥ 12 cycles on Cortex-M3) alone would consume \( 12 \times 15.36 \times 10^6 = 184 \times 10^6 \) cycles per second — 184% of a 100 MHz CPU. Interrupt-driven I/O cannot keep up.
Direct Memory Access (DMA) resolves this by delegating data movement to a dedicated hardware engine that operates in parallel with the CPU.
10.2 DMA Operation
A DMA controller (DMAC) is a specialised processor that performs memory-to-memory, peripheral-to-memory, or memory-to-peripheral transfers without CPU involvement. The CPU configures the DMAC by writing:
- Source address: the starting address from which data is read.
- Destination address: the starting address to which data is written.
- Transfer count: the number of elements (bytes, half-words, words) to transfer.
- Transfer width: the size of each element.
- Increment mode: whether source and/or destination address are incremented after each element.
- Trigger: the condition that initiates each element transfer (peripheral data-ready flag, software trigger, etc.).
Once configured and enabled, the DMAC performs the transfer autonomously. When complete, it asserts an interrupt to notify the CPU that the buffer is ready.
```c
DMA_Channel->CPAR  = (uint32_t)&ADC1->DR;     /* peripheral source */
DMA_Channel->CMAR  = (uint32_t)sample_buffer; /* memory destination */
DMA_Channel->CNDTR = BUFFER_SIZE;             /* number of samples */
DMA_Channel->CCR   = DMA_CCR_MINC             /* increment memory address */
                   | DMA_CCR_PSIZE_16BIT      /* 16-bit peripheral */
                   | DMA_CCR_MSIZE_16BIT      /* 16-bit memory */
                   | DMA_CCR_TCIE             /* transfer-complete interrupt */
                   | DMA_CCR_EN;              /* enable the channel */
```
The ADC triggers each DMA transfer on end-of-conversion. The CPU is free to execute other code. When BUFFER_SIZE samples have been collected, the DMA asserts an interrupt and the CPU processes the buffer.
10.3 Bus Bandwidth Implications
The DMAC must compete with the CPU for access to the memory bus. Three arbitration schemes are common:
Burst mode: the DMAC seizes the bus and transfers an entire block before relinquishing it. The CPU is stalled for the duration of the burst. This maximises DMA throughput but can introduce significant latency spikes in CPU execution — a problem for real-time control.
Cycle steal mode: the DMAC steals one bus cycle at a time, interleaving with CPU accesses. The CPU experiences a slight slowdown (each stolen cycle delays its memory access by one cycle) but is never stalled for long periods. Most embedded DMAC implementations use cycle stealing.
Transparent mode: the DMAC transfers data only during idle bus cycles — bus cycles in which the CPU is not accessing memory (e.g., during instruction execution using the cache). This is ideal but depends on cache hit rate; with poor locality, the CPU may have few idle cycles.
10.4 Double-Buffering for Continuous Streaming
Continuous audio playback or ADC streaming requires a buffer that is simultaneously being filled by DMA and consumed by the application. A double-buffer (ping-pong buffer) arrangement uses two equal-sized buffers:
- While DMA fills buffer A, the CPU processes buffer B.
- When DMA completes buffer A, it switches to buffer B and raises a half-complete / full-complete interrupt.
- The CPU switches to processing buffer A.
This requires the CPU processing rate to match or exceed the DMA fill rate. If processing takes longer than the time to fill one buffer, a buffer underrun occurs and audio continuity is lost. Lab 2’s audio playback system must be designed to prevent this.
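A minimal ping-pong scheme, assuming the DMA's half- and full-complete interrupts each hand one half of the buffer to the application (all names here are illustrative, not a vendor API):

```c
#include <stdint.h>

#define HALF_SAMPLES 256
static int16_t stream_buf[2][HALF_SAMPLES]; /* ping and pong halves */
static volatile int ready_half = -1;        /* -1: nothing pending */

/* Called from the (hypothetical) DMA ISR: half-complete releases
 * half 0, full-complete releases half 1 while the DMA wraps around.
 * Returns the half handed over (convenient for testing). */
static int dma_buffer_done(int full_complete) {
    ready_half = full_complete ? 1 : 0;
    return ready_half;
}

/* Application side: process whichever half the DMA has released.
 * Returns 1 if a half was processed, 0 if none was pending. */
static int consume_half(void (*process)(const int16_t *, int)) {
    int h = ready_half;
    if (h < 0)
        return 0;
    ready_half = -1;                        /* claim it before processing */
    process(stream_buf[h], HALF_SAMPLES);
    return 1;
}

/* Counting processor used for the self-test below. */
static int halves_seen;
static void count_half(const int16_t *buf, int n) { (void)buf; (void)n; halves_seen++; }
```

If `consume_half` ever lags by more than one half-buffer period, a half is overwritten before it is processed — exactly the underrun the text warns about.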
Chapter 11: Signal Integrity — Drivers, Transmission Lines, Grounding, and Shielding
11.1 Signal Integrity Fundamentals
As clock frequencies rise and trace lengths grow, the interconnecting wires on a PCB can no longer be treated as ideal conductors with instantaneous propagation. A PCB trace has a characteristic impedance \( Z_0 \), determined by its width, the substrate dielectric constant \( \varepsilon_r \), and the distance to the reference plane:
\[ Z_0 \approx \frac{87}{\sqrt{\varepsilon_r + 1.41}} \ln\left(\frac{5.98 h}{0.8 w + t}\right) \quad [\Omega], \]
where \( h \) is the dielectric thickness, \( w \) is the trace width, and \( t \) is the trace thickness (all in consistent units). A typical FR-4 microstrip with \( \varepsilon_r = 4.5 \) and appropriate geometry has \( Z_0 \approx 50 \) Ω.
When a signal encounters an impedance discontinuity — a via, a connector, or a receiver whose input impedance differs from \( Z_0 \) — a reflection occurs. The reflection coefficient at the receiver is:
\[ \Gamma = \frac{Z_L - Z_0}{Z_L + Z_0}, \]
where \( Z_L \) is the load impedance. For an unloaded CMOS input (\( Z_L \approx \infty \)), \( \Gamma \approx +1 \): the full incident voltage is reflected back, doubling the voltage at the receiver. This reflected wave travels back to the driver, bouncing again if the driver’s output impedance differs from \( Z_0 \), and continues until the reflections attenuate. The resulting oscillations, called ringing, can violate logic threshold voltages and cause spurious switching.
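A quick numerical check of the formula above, for the three canonical loads:

```c
/* Reflection coefficient at a load Z_L on a line of impedance Z0. */
static double refl_coeff(double z_load, double z0) {
    return (z_load - z0) / (z_load + z0);
}
/* refl_coeff(50.0, 50.0)  == 0.0  : matched, no reflection
 * refl_coeff(1e12, 50.0)  ~= +1.0 : open circuit (CMOS input)
 * refl_coeff(0.0,  50.0)  == -1.0 : short circuit inverts the wave */
```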
11.2 Termination Strategies
Source termination places a resistor \( R_s \) in series with the driver output, sized so that \( R_s \) plus the driver’s output impedance equals \( Z_0 \). The resistor absorbs reflections at the source end. Because the source resistance and \( Z_0 \) form a voltage divider, the signal initially launched onto the line is \( V_{supply}/2 \), reaching full amplitude only after the forward wave reflects off the open-circuit receiver and returns. This introduces a delay equal to twice the line propagation delay before the line settles — acceptable for point-to-point connections but problematic for multi-drop buses.
Parallel (end) termination places a resistor \( R_T = Z_0 \) between the receiver end of the line and a reference voltage. The terminated load impedance matches \( Z_0 \), so \( \Gamma = 0 \) and no reflections occur. The disadvantage is static power consumption: even in the idle state, current flows through \( R_T \).
11.3 Grounding and Power Integrity
A solid, low-impedance ground plane is the single most important design practice for signal integrity. At high frequencies, currents return to the source via the path of lowest inductance, not lowest resistance — and the path of lowest inductance is directly beneath the signal trace, in the adjacent reference plane. Slots or voids in the ground plane force return currents to detour around the obstruction, increasing loop inductance and generating electromagnetic interference (EMI).
Power delivery is also a concern: when a large number of output buffers switch simultaneously (simultaneous switching noise, SSN), a momentary large current is drawn from the power supply. The inductance of the power delivery network \( L_{PDN} \) and the current ramp rate \( dI/dt \) produce a voltage drop:
\[ \Delta V = L_{PDN} \cdot \frac{dI}{dt}. \]
Decoupling capacitors placed close to each IC’s power pins reduce this noise by providing a local charge reservoir. Their effectiveness depends on their self-resonant frequency being above the frequency of the switching event.
Chapter 12: Real-Time Operating System Concepts
12.1 From Bare Metal to an RTOS
A bare-metal embedded application consists of a superloop (a `while (1)` loop in `main`) and a collection of ISRs. This architecture works well for simple systems but scales poorly:
- Complex task interactions become difficult to reason about.
- Adding a new background computation requires manual time-slicing within the superloop.
- Priority inversion — a low-priority task holding a resource needed by a high-priority task, while a medium-priority task preempts it — can lead to deadline misses with no programmatic remedy.
A Real-Time Operating System (RTOS) provides a scheduler that manages multiple tasks (also called threads or processes), a timer service for periodic wake-up, and synchronisation primitives (semaphores, mutexes, queues) for inter-task communication.
12.2 Scheduling Policies
Fixed-priority preemptive scheduling assigns each task a static priority. The scheduler always runs the highest-priority ready task. A higher-priority task preempts a lower-priority one whenever it becomes ready (e.g., on a semaphore signal from an ISR). Under Rate Monotonic Analysis (RMA), tasks with shorter periods are assigned higher priorities, and the system is schedulable if:
\[ \sum_{i=1}^{n} \frac{C_i}{T_i} \leq n(2^{1/n} - 1), \]
where \( C_i \) is the worst-case execution time and \( T_i \) is the period of task \( i \). For \( n \to \infty \), the bound approaches \( \ln 2 \approx 0.693 \).
Round-robin scheduling gives each task equal time slices in rotation. It is fair but does not provide deterministic response time guarantees.
12.3 Semaphores and Mutexes
A binary semaphore has two states: taken (0) and given (1). A task that calls sem_take() blocks if the semaphore is taken; it proceeds when the semaphore is given by another task or ISR. An ISR can give a semaphore to wake a blocked task:
```c
/* ISR: signals that new data is available */
void UART_IRQHandler(void) {
    rx_buffer[rx_head++] = UART->DR;    /* rx_head assumed to wrap to the buffer size */
    osSemaphoreRelease(data_ready_sem); /* wake the blocked task */
}

/* Task: waits for and processes data */
void data_processor_task(void *arg) {
    for (;;) {
        osSemaphoreAcquire(data_ready_sem, osWaitForever);
        process(rx_buffer[rx_tail++]);  /* rx_tail likewise wraps */
    }
}
```
A mutex (mutual exclusion semaphore) is a binary semaphore with ownership: only the task that took the mutex can give it back. This enables priority inheritance — when a high-priority task blocks on a mutex held by a low-priority task, the RTOS temporarily elevates the low-priority task’s priority to that of the blocker, preventing priority inversion.
12.4 Inter-Task Communication with Queues
A queue is a FIFO data structure managed by the RTOS. One task (producer) writes items to the queue; another (consumer) reads them. The RTOS handles blocking and waking automatically: the producer blocks if the queue is full; the consumer blocks if it is empty. Queues are the cleanest mechanism for transferring data between tasks because they avoid shared-variable races — the RTOS’s queue operations are inherently atomic.
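The same idea can be implemented bare-metal as a single-producer, single-consumer ring buffer, which is safe between one ISR and one task without locking because each index has exactly one writer. A sketch (an RTOS queue additionally blocks and wakes tasks; `q_roundtrip` is a self-test helper):

```c
#include <stdint.h>

#define QCAP 8u  /* capacity; a power of two keeps the wrap cheap */

typedef struct {
    uint8_t buf[QCAP];
    volatile uint32_t head;  /* written only by the producer */
    volatile uint32_t tail;  /* written only by the consumer */
} spsc_queue_t;

/* Producer side (e.g. an ISR). Returns 0 if the queue is full. */
static int q_put(spsc_queue_t *q, uint8_t v) {
    if (q->head - q->tail == QCAP)
        return 0;
    q->buf[q->head % QCAP] = v;
    q->head++;               /* publish only after the slot is written */
    return 1;
}

/* Consumer side (e.g. a task). Returns 0 if the queue is empty. */
static int q_get(spsc_queue_t *q, uint8_t *v) {
    if (q->head == q->tail)
        return 0;
    *v = q->buf[q->tail % QCAP];
    q->tail++;
    return 1;
}

/* Self-test helper: push one value through the queue and read it back. */
static int q_roundtrip(uint8_t v) {
    static spsc_queue_t q;
    uint8_t out = 0;
    return q_put(&q, v) && q_get(&q, &out) && out == v;
}
```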
Chapter 13: Putting It All Together — System Design Perspectives
13.1 Interface Selection Criteria
Choosing the right interface for a given application requires balancing competing constraints:
| Criterion | UART | SPI | I2C | DMA |
|---|---|---|---|---|
| Wire count | 2 (TX/RX) | 4 + 1 per slave | 2 (shared) | N/A (bus mechanism) |
| Speed | 0.1–10 Mbps | 1–100+ Mbps | 0.1–3.4 Mbps | Limited by bus |
| Distance | 1–10 m (RS-232) | <1 m (PCB) | <1 m (PCB) | On-chip |
| Multi-slave | No (point-to-point) | Yes (one CS per slave) | Yes (address) | N/A |
| CPU load | High (interrupt per byte) | High (interrupt per frame) | High | Low (autonomous) |
| Complexity | Low | Medium | Medium | High (configuration) |
DMA should be used whenever the data rate is high enough to saturate interrupt-driven approaches. SPI is preferred over I2C when speed matters and wiring is not constrained. I2C is preferred for many slow peripherals on a shared bus. UART suffices for diagnostic output, host communication, and low-rate sensors.
13.2 Design Tradeoffs in Synchronisation
Every synchronisation mechanism has associated costs:
Polling minimises latency but wastes CPU cycles. It is appropriate when the CPU has nothing else to do and when latency is paramount — e.g., a motor encoder read in a tight control loop.
Interrupt-driven I/O amortises CPU cost over the interrupt period. It adds latency (interrupt entry overhead, typically 12–40 cycles) and complexity (shared-state hazards, re-entrancy concerns, stack depth planning). It is appropriate for the large middle ground of embedded I/O.
DMA maximises throughput and minimises CPU involvement but requires careful buffer management, cache coherency maintenance (on processors with data caches), and attention to DMA channel contention. It is appropriate for bulk transfers: audio, image sensors, network packets.
RTOS task-based design adds the overhead of context switching (100–300 cycles per switch in a typical RTOS) but provides modularity, priority management, and synchronisation primitives that make complex systems tractable. It is appropriate when the system has five or more concurrent activities with non-trivial interactions.
13.3 Noise, Jitter, and Metastability in Practice
System realities manifest in ways that pure logic design ignores:
Noise on a logic line — from power supply rail collapse, cross-talk from adjacent traces, or EMI coupling — can cause a clean digital signal to momentarily violate its valid-high or valid-low threshold. A Schmitt trigger input (with hysteresis) makes the threshold decision more robust: the input must cross a higher threshold to switch from low-to-high than from high-to-low, suppressing noise-induced oscillations around the threshold.
Jitter is the variation in the timing of a periodic signal — the difference between actual and ideal edge positions. In a UART receiver’s oversampling architecture, jitter in the recovered clock (which derives from the MCU’s internal oscillator) accumulates across the bits of a frame. In high-speed PCIe or USB links, jitter budgets are carefully allocated between transmitter, interconnect, and receiver to ensure reliable sampling.
Metastability in a synchroniser cannot be eliminated, only reduced. The mean time between failures (MTBF) of a synchroniser is:
\[ \text{MTBF} = \frac{e^{t_{res}/\tau}}{f_{data} \cdot f_{clk} \cdot T_0}, \]
where \( t_{res} \) is the resolution time (clock period minus other delays), \( \tau \) is the flip-flop’s metastability time constant (typically 0.1–0.3 ns for fast CMOS), \( f_{data} \) is the rate of input events, \( f_{clk} \) is the receiving clock frequency, and \( T_0 \) is a technology-dependent constant. Increasing the number of synchroniser stages increases \( t_{res} \) by one clock period, exponentially improving MTBF.
13.4 Embedded System Verification and Debugging
An embedded system that works in simulation but fails on hardware is a common experience. Debugging strategies include:
JTAG debugging: the ARM Cortex-M’s CoreSight debug infrastructure exposes registers, memory, and breakpoints over a 4- or 2-wire JTAG or SWD (Serial Wire Debug) interface. A debugger such as OpenOCD combined with GDB allows setting breakpoints, stepping through code, and inspecting peripheral registers — essential for hardware bring-up.
Logic analyser: a tool that captures dozens of digital signals simultaneously at high sample rates, timestamped, enabling timing analysis of bus protocols, interrupt latencies, and signal integrity issues. Protocol decoders for UART, SPI, I2C, and CAN make captured traces human-readable.
Oscilloscope with protocol decode: a mixed-signal oscilloscope (MSO) captures both analogue waveforms and digital logic, enabling signal integrity measurements (rise time, overshoot, ringing) and protocol decode in a single instrument.
Printf debugging via SemiHosting or ITM: the ARM Instrumentation Trace Macrocell (ITM) allows printf-style debug output to be streamed to the debugger over the SWD interface with negligible impact on real-time behaviour.
13.5 Learning Outcomes Revisited
By the end of ECE 224, you should be able to:
Identify system complications — noise, jitter, metastability, transmission line reflections — and select appropriate countermeasures: synchronisers, decoupling capacitors, termination resistors, Schmitt trigger inputs.
Compare and critically assess design tradeoffs: polling versus interrupt versus DMA, synchronous versus asynchronous bus, daisy-chain versus independent-request arbitration, UART versus SPI versus I2C. No single answer is universally best; the choice depends on the operating constraints.
Analyse the effects of synchronisation mechanisms: compute interrupt utilisation using \( \sum C_i / T_i \), verify that ISR response time satisfies device latency requirements, and determine whether priority assignment prevents deadline misses.
Design hardware and software components: write device drivers that correctly initialise peripherals, handle interrupts, manage shared state, and interface with application software through clean APIs. Build systems that remain correct under concurrent interrupt events and that recover gracefully from communication errors.
These skills transfer directly to industrial embedded systems work: every product that senses and actuates the physical world faces the same interface challenges, and the principles developed in ECE 224 — timing analysis, synchronisation, protocol design, signal integrity — are enduring foundations of the discipline.