ECE 224: Embedded Microprocessor Systems
Bill Bishop
Sources and References
Primary reference — Joseph Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, 3rd ed., Newnes/Elsevier, 2013. [Comprehensive treatment of Cortex-M architecture, NVIC, and peripheral interfaces.]
Architecture documentation — ARM Limited, ARM Cortex-M3 Technical Reference Manual (publicly available at developer.arm.com); ARM Limited, ARMv7-M Architecture Reference Manual. [Definitive ISA specification and memory map details.]
Embedded systems design — Jonathan Valvano, Embedded Systems: Real-Time Interfacing to ARM Cortex-M Microcontrollers, 5th ed., CreateSpace, 2016. [Open-access companion site: users.ece.utexas.edu/~valvano/]
Bus and interface standards — IEEE Std 1149.1 (JTAG); NXP Semiconductors, I2C-bus specification and user manual, Rev. 7.0, 2021; Motorola/NXP, SPI Block Guide, V03.06.
Error-correcting codes — Richard Hamming, “Error Detecting and Error Correcting Codes,” Bell System Technical Journal, 29(2):147–160, 1950. [Original paper, freely accessible.]
Open courseware — MIT OCW 6.004 Computation Structures; University of Texas ECE 319K open lecture materials; Nordic Semiconductor Academy (academy.nordicsemi.com).
Chapter 1: Embedded Systems and the Interfacing Challenge
1.1 What Is an Embedded System?
An embedded system is a computer integrated into a larger product whose primary purpose is not computation. The car-engine management unit optimising fuel injection, the wrist-worn heart-rate monitor sampling photoplethysmographic signals, the dishwasher controller sequencing wash cycles, the industrial PLC managing a conveyor belt — all are embedded systems. What distinguishes them from a general-purpose computer is not capability, but dedication and constraint. An embedded processor exists to do one class of jobs, under restrictions on cost, power consumption, physical size, real-time responsiveness, and long-term reliability that would be intolerable in a laptop.
The 32-bit ARM Cortex-M family dominates this space. It offers a thoroughly modern RISC architecture — pipelined, Thumb-2 instruction set, hardware multiply and optional hardware divide — at a price point that allows it to appear in devices costing under a dollar in volume. ECE 224 uses this architecture as the concrete platform through which abstract interfacing concepts are explored.
The central engineering challenge of an embedded system is not the processor itself, but the interfaces: the mechanisms by which the processor moves data to and from the world around it. Sensors produce continuously varying voltages. Actuators require timed pulse sequences. Memory chips require carefully sequenced address and data handshakes. Communication peers may speak completely different protocols. Translating between the processor’s clean, synchronous, digital worldview and the messy, asynchronous, analogue reality outside the chip is the subject of this course.
1.2 Core Engineering Questions in Interfacing
Every interfacing design decision can be framed around a small set of fundamental questions:
How do you transfer data from one type of device to another? A sensor produces a continuous analogue voltage; the processor expects a discrete digital word. A keyboard produces asynchronous serial pulses; the processor reads a parallel register. Answering this question requires understanding converters, encoders, bus protocols, and timing constraints.
How should data be organised? Should multiple bytes be packed into a word? Should data be stored in a FIFO buffer or a circular queue? Should it be transferred as individual bytes or as DMA bursts? Organisational choices affect throughput, latency, and the complexity of the software that consumes the data.
When should data be exchanged? The processor could poll a peripheral register in a tight loop, burning CPU cycles but achieving minimal latency. Or it could configure the peripheral to assert an interrupt when data is ready, freeing the CPU for other work. Or it could configure a DMA controller to move blocks of data autonomously, offloading data movement entirely.
What conditions must be satisfied for reliable data exchange? Metastability in flip-flops, signal integrity on long traces, jitter in recovered clocks, and setup/hold violations are all physical realities that undermine logically correct designs unless carefully managed.
Why do different interfaces exist, and under what circumstances does one outperform another? UART is simple but slow. SPI is fast but requires more wires. I2C is elegant for short distances with many slaves. USB handles hot-plugging and variable bandwidth. PCIe provides multi-gigabit throughput. Each exists because a different combination of cost, speed, distance, and simplicity was optimal for its application domain.
1.3 The Nios II Laboratory Platform
The laboratory strand of ECE 224 instantiates these concepts on an Intel/Altera FPGA board using the Nios II soft-core processor, designed using Intel Quartus Prime and programmed using the Intel Nios II Software Build Tools (SBT). This System-on-a-Programmable-Chip (SOPC) approach lets students wire up custom hardware peripherals — parallel I/O ports, timer cores, DAC interfaces — in the FPGA fabric and then access them from C software through a memory-mapped register model. The conceptual architecture is identical to a fixed silicon microcontroller; only the implementation substrate is reconfigurable.
Lab 0 establishes the baseline SOPC design. Lab 1 investigates parallel interfacing and interrupt-driven synchronisation: pushbuttons raise interrupts, an interrupt service routine reads input state, updates seven-segment displays, and reschedules a timer. Lab 2 investigates analogue interfacing: audio data is read from an SD card, optionally processed, and streamed through a digital-to-analogue converter at a precise sample rate enforced by a timer interrupt.
Chapter 2: Computer Structure and Processor Organisation
2.1 A Review of Processor Structure
A microprocessor is a finite-state machine realised in VLSI that executes a program stored in memory. Its essential components are a set of registers that hold operands and state, an Arithmetic Logic Unit (ALU) that performs integer operations, a control unit that decodes instructions and orchestrates data flow, and a bus interface that moves instructions and data between the processor and memory.
The instruction execution cycle consists, in its simplest form, of three phases: fetch, where the next instruction word is read from memory at the address held in the Program Counter (PC); decode, where the control unit interprets the opcode and operand fields; and execute, where the ALU performs the computation or the memory interface initiates a load/store. Modern processors overlap these phases using pipelining: while one instruction executes, the next is being decoded and the one after that is being fetched.
The ARM Cortex-M3 uses a 3-stage pipeline (fetch, decode, execute). The Cortex-M4 extends this with a single-cycle multiply and optional single-precision floating-point unit. Both processors employ branch speculation: the pipeline fetches the instruction after a branch while the branch target is being resolved. A mispredicted branch causes a pipeline flush, the cost of which must be accounted for in real-time worst-case execution time (WCET) analysis.
2.2 Clock Signals and Control Signals
All synchronous digital logic is disciplined by a clock signal — a square wave of known frequency that defines when state elements (flip-flops and registers) sample their inputs. The setup time \( t_{su} \) is the minimum interval that data must be stable before the clock edge; the hold time \( t_h \) is the minimum interval data must remain stable after the clock edge. Violating either constraint places the flip-flop in a potentially metastable state.
Metastability is not merely a theoretical concern. Whenever an asynchronous signal — one not synchronised to the local clock domain — is sampled by a flip-flop, a metastability event can occur. The standard mitigation is a synchroniser: two back-to-back flip-flops clocked in the receiving clock domain, giving the first stage’s output a full clock period to resolve before the second stage samples it. Two-stage synchronisers are standard in commercial practice for moderate clock frequencies; higher-frequency designs may require three stages.
Control signals govern the flow of data through the processor and across the bus interface: read/write enables, chip selects, bus request/grant lines, interrupt request lines, DMA acknowledge signals. Understanding the timing relationships among these signals — their setup and hold requirements relative to clock edges, their propagation delays, and their tri-state behaviour on shared buses — is the foundation for reliable interface design.
2.3 Memory System Organisation
Embedded processors use a flat, byte-addressed memory space. The ARM Cortex-M architecture defines a 4 GB address space (32-bit addresses), partitioned into regions with architectural significance:
| Address Range | Typical Use |
|---|---|
| 0x00000000 – 0x1FFFFFFF | Code (Flash) |
| 0x20000000 – 0x3FFFFFFF | SRAM |
| 0x40000000 – 0x5FFFFFFF | Peripheral registers |
| 0x60000000 – 0x9FFFFFFF | External RAM |
| 0xA0000000 – 0xDFFFFFFF | External device |
| 0xE0000000 – 0xFFFFFFFF | Private Peripheral Bus (PPB) — NVIC, SysTick, debug |
The peripheral register region is the foundation of memory-mapped I/O. A GPIO output data register, a UART transmit data register, or a SPI control register is simply a word at a particular address in this region. Writing to that address sends data to the peripheral; reading from it retrieves peripheral status. From the software’s perspective, the peripheral is indistinguishable from memory.
2.4 Bus Interfacing and Timing
A bus is a shared communication pathway connecting a processor to one or more memory or I/O devices. The fundamental transaction on any bus is a transfer: the master (typically the processor or a DMA controller) drives address lines to select a target device, drives data lines (on a write) or tristates them (on a read), and asserts control signals to indicate the direction and timing.
Bus protocols differ in how they synchronise master and slave. In a synchronous bus, all transfers are referenced to a common clock; the slave must respond within a fixed number of clock cycles. This simplicity comes at the cost of being limited to the speed of the slowest slave. In an asynchronous bus, the master and slave exchange handshake signals (request and acknowledge, or REQ/ACK) without a shared clock. The transfer completes only when the slave explicitly acknowledges. An asynchronous bus can accommodate slaves of arbitrary speed but requires careful design to avoid hang conditions if an acknowledge never arrives.
In partially interlocked (also called semi-synchronous) designs, the slave can insert wait states — holding a WAIT or READY signal active to extend the bus cycle — but the base protocol is clocked. This hybrid approach, used by most practical bus standards, achieves the implementation simplicity of a synchronous bus while accommodating slow devices.
Split-cycle or split-transaction buses separate the address and data phases of a transfer, allowing another master to use the bus during the slave’s access time. This dramatically improves bus utilisation in multi-master systems at the cost of a more complex protocol.
Chapter 3: Synchronisation and Data Transfers
3.1 The Synchronisation Problem
A peripheral operates at its own pace, governed by its physical characteristics. A keypad switch closes in a few milliseconds; a temperature sensor settles after many milliseconds; a UART receiver produces a new byte every few hundred microseconds at 115200 baud; a high-speed ADC produces samples every few tens of nanoseconds. The processor executes instructions at its own pace — perhaps hundreds of millions per second. Synchronisation is the problem of coordinating these two rates so that data is never lost (because the processor read before the peripheral had new data) or overwritten (because the peripheral produced new data before the processor consumed the old data).
3.2 Polling
The simplest synchronisation strategy is polling: the processor loops, repeatedly reading a peripheral status register, until a flag indicates that data is ready. This is straightforward to implement and delivers the lowest possible latency — the processor detects the event within a single polling interval — but it wastes CPU cycles on the checking loop.
Consider a UART that sets a Receive Data Ready (RDR) bit in its status register when a new byte has arrived. A polling loop in C would be:
volatile uint8_t *status_reg = (volatile uint8_t *)UART_STATUS;
volatile uint8_t *data_reg = (volatile uint8_t *)UART_RX_DATA;
while (!(*status_reg & UART_RDR_BIT))
; /* spin */
uint8_t byte = *data_reg;
The CPU executes the test repeatedly, consuming power and preventing other work. If the peripheral is slow — a human typing at a keyboard — the fraction of CPU time wasted polling approaches 100%. If the CPU has nothing else to do and response latency must be minimised (e.g., reading encoder pulses on a motion controller), polling is entirely appropriate. But in most embedded applications, the CPU has other tasks: running a user interface, computing a control law, servicing other peripherals. A polling loop for one peripheral prevents all of these.
3.3 Interrupt-Driven I/O
An interrupt is a hardware-initiated transfer of control from the currently executing program to a special routine called an interrupt service routine (ISR) or interrupt handler. When a peripheral event occurs — a UART byte arrives, a timer expires, a GPIO edge is detected — the peripheral asserts an interrupt request line. The CPU’s interrupt controller accepts the request, saves the processor state onto the stack, and dispatches to the ISR. When the ISR finishes, the saved state is restored and execution returns to the interrupted program.
Interrupt-driven I/O transforms the CPU’s relationship with peripherals: instead of waiting for the peripheral, the CPU works on useful computation until the peripheral demands attention. The overhead is the latency from event occurrence to ISR entry — determined by the number of instructions to finish executing, plus the interrupt latency of the controller hardware — and the time spent saving and restoring processor state.
3.3.1 Interrupt Latency and Response Analysis
Let \( T_{ISR} \) be the worst-case execution time of the ISR, and let \( T_{inter} \) be the minimum interval between successive interrupt requests from a given source. A necessary (but not sufficient) condition for the ISR to keep up with the source is:
\[ T_{ISR} \leq T_{inter}. \]
When multiple interrupt sources exist with periods \( T_1, T_2, \ldots, T_n \) and ISR costs \( C_1, C_2, \ldots, C_n \), the CPU utilisation devoted to interrupt handling is approximately
\[ U_{IRQ} = \sum_{i=1}^{n} \frac{C_i}{T_i}, \]
and the remaining fraction \( 1 - U_{IRQ} \) is available for background computation. This is the same rate-monotonic utilisation formula familiar from real-time systems analysis.
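The utilisation sum is straightforward to evaluate in code. The sketch below uses hypothetical source parameters (the struct name and numbers are illustrative, not from any particular platform) and integer arithmetic in parts per million to avoid floating point on a small MCU:

```c
#include <assert.h>

/* Hypothetical interrupt source: minimum period and worst-case ISR cost,
   both in CPU cycles. */
typedef struct { unsigned period_cycles; unsigned cost_cycles; } irq_source_t;

/* U_IRQ = sum(C_i / T_i), returned in parts per million so the arithmetic
   stays in integers. */
static unsigned irq_utilisation_ppm(const irq_source_t *src, int n) {
    unsigned long long u = 0;
    for (int i = 0; i < n; i++)
        u += (1000000ULL * src[i].cost_cycles) / src[i].period_cycles;
    return (unsigned)u;
}
```

For example, a 115200-baud UART byte interrupt on a 100 MHz CPU arrives roughly every 8680 cycles; with a 200-cycle ISR it costs about 2.3% of the CPU, and a 1 kHz timer tick with a 500-cycle ISR adds 0.5%.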
3.3.2 Nested Interrupts and Priority
Modern embedded systems must handle multiple simultaneous interrupt sources. The Nested Vectored Interrupt Controller (NVIC) on Cortex-M processors assigns a numerical priority level to each interrupt channel. A lower numerical value means higher priority. When the CPU is executing an ISR of priority \( P \) and an interrupt of priority \( Q < P \) arrives, the NVIC preempts the running ISR, saves its state, and dispatches the higher-priority handler. This nested interrupt capability ensures that time-critical events receive bounded service regardless of what less-critical ISRs are executing.
3.4 Shared Data Hazards
When an ISR and a background task share a data structure — a circular buffer, a counter, a flag — a race condition can occur if the ISR preempts the background task in the middle of a multi-step update. Suppose the background task is incrementing a 32-bit counter that the ISR also reads:
/* background task */
counter = counter + 1; /* may compile to: load, add, store */
If the ISR preempts after the load but before the store, the ISR sees a stale value of counter. The subsequent store by the background task overwrites the ISR’s read with the pre-incremented value — a lost update. This is a classic read-modify-write hazard on a non-atomic operation.
The Cortex-M architecture provides a solution: LDREX / STREX (load-link / store-conditional) instructions for software atomics, plus the CPSID I / CPSIE I instructions to globally disable/enable interrupts around a critical section. For short critical sections, disabling interrupts is simplest:
__disable_irq(); /* CPSID I — disables all configurable interrupts */
counter++;
__enable_irq(); /* CPSIE I */
A more structured approach is a mutex (mutual exclusion semaphore): a synchronisation primitive that ensures only one thread of execution enters a critical section at a time. In a bare-metal embedded system without an RTOS, a binary flag protected by interrupt-disable/enable serves the same purpose.
Chapter 4: Parallel Interfacing
4.1 The Role of the Parallel Interface
A parallel interface transfers multiple bits simultaneously over dedicated signal lines, one bit per line. It represents the oldest and most direct method of connecting a digital device to a processor bus. Understanding parallel interfacing requires understanding the timing mismatch between the bus and the device: the bus has its own timing — address valid, data valid, read/write cycle times — and the device has its own timing — access time for memory, setup time for latches, propagation delay for combinational logic. The interface must translate between these two timing regimes without violating either device’s constraints.
4.2 Timing Diagrams and Constraint Analysis
A bus timing diagram specifies, relative to the clock edge or address-strobe edge:
- Address valid \( t_{AV} \): time after the clock edge when the address lines are guaranteed stable.
- Data valid (for write): time after the clock edge when the data lines are guaranteed stable.
- Data required (for read): the latest point before the sampling edge by which the device must present stable data.
- Cycle time \( t_{cyc} \): total duration of one bus transaction.
The device’s data sheet specifies its own constraints:
- Address access time \( t_{acc} \): time from address-stable to data-stable on the device’s output (for read).
- Write setup time \( t_{ws} \): minimum time data must be valid before the write-enable de-asserts.
- Write pulse width \( t_{wp} \): minimum duration of the write-enable pulse.
The interface designer must verify that the bus timing satisfies the device constraints, inserting wait states if necessary to extend the bus cycle and give slow devices sufficient access time.
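The wait-state calculation can be sketched numerically. The model below is deliberately simplified (an assumption, not any particular bus standard): the address is driven at the start of the cycle, data is sampled \( N+1 \) clock periods later when \( N \) wait states are inserted, and the device needs its access time plus a setup margin before the sampling edge:

```c
#include <assert.h>

/* Minimum wait states for a read under a simplified model: a zero-wait cycle
   gives the device one clock period between address-valid and the sampling
   edge; each wait state adds one more period. */
static unsigned wait_states(unsigned t_acc_ns, unsigned t_setup_ns,
                            unsigned t_clk_ns) {
    unsigned required = t_acc_ns + t_setup_ns;              /* device + setup */
    unsigned cycles = (required + t_clk_ns - 1) / t_clk_ns; /* ceiling divide */
    return (cycles > 1) ? cycles - 1 : 0;
}
```

A 70 ns SRAM with a 5 ns setup margin on a 50 MHz (20 ns period) bus needs \( \lceil 75/20 \rceil = 4 \) periods, i.e. three wait states; a 10 ns device needs none.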
4.3 Memory-Mapped I/O Peripheral Access in C
On the Nios II and Cortex-M platforms, peripheral registers are accessed through pointers to volatile memory. A well-structured embedded C driver defines base addresses and register offsets as preprocessor constants, then wraps register accesses in inline functions:
#define GPIO_BASE 0xFF200000u
#define GPIO_DATA_REG (*(volatile uint32_t *)(GPIO_BASE + 0x00))
#define GPIO_DIR_REG (*(volatile uint32_t *)(GPIO_BASE + 0x04))
#define GPIO_IRQ_REG (*(volatile uint32_t *)(GPIO_BASE + 0x08))
#define GPIO_IRQ_MASK (*(volatile uint32_t *)(GPIO_BASE + 0x0C))
static inline void gpio_set_dir(uint32_t mask) { GPIO_DIR_REG = mask; }
static inline void gpio_write(uint32_t val) { GPIO_DATA_REG = val; }
static inline uint32_t gpio_read(void) { return GPIO_DATA_REG; }
Writing a 1 to a bit position in the direction register configures that bit as an output; writing 0 configures it as an input. Writing to the data register drives outputs; reading from the data register samples inputs. This API pattern — base address plus register-offset macros plus thin inline wrappers — is standard in embedded system driver libraries (see, e.g., STM32 HAL or TI DriverLib).
4.4 Servicing Latency and Throughput
The design of an interrupt-driven parallel interface requires careful analysis of two figures of merit:
Interrupt latency is the time from the peripheral asserting its interrupt request to the first instruction of the ISR executing. This includes the time remaining in the current instruction (variable) plus the NVIC exception-entry sequence — register stacking, vector fetch, and pipeline refill — which is fixed at 12 cycles on a Cortex-M3 with zero-wait-state memory. Best-case Cortex-M3 interrupt latency is therefore 12 clock cycles, with some variation due to late-arriving exceptions and bus stalls.
Throughput is the rate at which the interface can sustain data transfer. For an interrupt-driven GPIO that triggers on a strobe from an external device:
\[ \text{max throughput} = \frac{1}{T_{ISR} + T_{overhead}}, \]
where \( T_{overhead} \) includes interrupt entry/exit costs. If the ISR takes 50 cycles and interrupt overhead is 24 cycles (12 entry + 12 exit), then at 100 MHz the maximum throughput is \( 100 \times 10^6 / 74 \approx 1.35 \) million transactions per second.
Chapter 5: Error Detection and Error Correction
5.1 Why Errors Occur
Data stored in memory or transmitted across a channel can be corrupted by physical noise: ionising radiation causing single-event upsets in SRAM cells, electromagnetic interference inducing bit errors on long cable runs, power supply noise causing marginal flip-flops to make incorrect transitions. The probability of a bit error may be small — perhaps \( 10^{-12} \) per bit per hour in modern DRAM under normal conditions — but for a 1 Gbit memory operating for years, some errors are likely. Critical applications — aerospace, medical, financial — require the system to detect and correct such errors.
5.2 Parity — Single-Bit Error Detection
The simplest error-detecting code augments each data word with a single parity bit chosen so that the number of 1-bits in the extended word (data plus parity) is always even (even parity) or always odd (odd parity). The receiver recalculates the parity of the received word and compares it with the received parity bit; a mismatch indicates at least one bit error.
Even parity for a 4-bit word \( d_3 d_2 d_1 d_0 \):
\[ p = d_3 \oplus d_2 \oplus d_1 \oplus d_0. \]
Parity detects all single-bit errors and any odd number of bit errors. It cannot detect double-bit errors (two flips cancel in the XOR), and it cannot correct any errors — only detect them.
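The XOR reduction above maps directly to code. A minimal sketch for a 4-bit word (the function name is illustrative), folding the word onto itself so the low bit ends up as the XOR of all four data bits:

```c
#include <assert.h>
#include <stdint.h>

/* Even parity bit for a 4-bit word: p = d3 ^ d2 ^ d1 ^ d0.
   A parity mismatch on receive flags any odd number of bit flips. */
static uint8_t even_parity4(uint8_t d) {
    d &= 0x0Fu;     /* keep the four data bits */
    d ^= d >> 2;    /* fold upper pair onto lower pair */
    d ^= d >> 1;    /* fold again: bit 0 = XOR of all four bits */
    return d & 1u;
}
```

For 0b1011 (three 1-bits) the parity bit is 1; for 0b1001 (two 1-bits) it is 0.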
5.3 Hamming Codes — Single-Error Correction, Double-Error Detection (SECDED)
Richard Hamming’s 1950 insight was that by adding multiple redundant check bits, each covering a different subset of data bits, the position of a single-bit error could be precisely identified and corrected.
For a data word of length \( k \) bits, the number of parity bits \( r \) needed to correct single-bit errors satisfies:
\[ 2^r \geq k + r + 1. \]
For \( k = 4 \), we need \( r = 3 \) (since \( 2^3 = 8 \geq 8 = 4+3+1 \)), giving a 7-bit codeword. For \( k = 8 \), \( r = 4 \) (since \( 2^4 = 16 \geq 13 \)), giving a 12-bit codeword.
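The inequality can be solved by simple search. A small sketch (function name illustrative) that returns the smallest \( r \) satisfying \( 2^r \geq k + r + 1 \):

```c
#include <assert.h>

/* Smallest number of Hamming parity bits r such that 2^r >= k + r + 1,
   i.e. enough syndrome values to name every codeword position plus
   "no error". */
static unsigned hamming_parity_bits(unsigned k) {
    unsigned r = 1;
    while ((1u << r) < k + r + 1)
        r++;
    return r;
}
```

This reproduces the worked figures in the text: 3 parity bits for 4 data bits, 4 for 8, and 7 for 64 (to which SECDED adds one global parity bit, giving the 64+8 organisation used in ECC memory).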
In a systematic Hamming code, parity bits are placed at positions that are powers of 2 (1, 2, 4, 8, …), and data bits fill the remaining positions. Parity bit \( p_i \) at position \( 2^{i-1} \) covers all bit positions whose binary representation has a 1 in bit position \( i-1 \). As a worked example, consider encoding the data bits \( d_1 = 1, d_2 = 0, d_3 = 1, d_4 = 1 \):
- Position 1 (binary 001): parity bit \( p_1 \)
- Position 2 (binary 010): parity bit \( p_2 \)
- Position 3 (binary 011): data bit \( d_1 = 1 \)
- Position 4 (binary 100): parity bit \( p_4 \)
- Position 5 (binary 101): data bit \( d_2 = 0 \)
- Position 6 (binary 110): data bit \( d_3 = 1 \)
- Position 7 (binary 111): data bit \( d_4 = 1 \)
\( p_1 \) covers positions 1, 3, 5, 7: \( p_1 = 1 \oplus 0 \oplus 1 = 0 \). \( p_2 \) covers positions 2, 3, 6, 7: \( p_2 = 1 \oplus 1 \oplus 1 = 1 \). \( p_4 \) covers positions 4, 5, 6, 7: \( p_4 = 0 \oplus 1 \oplus 1 = 0 \).
Codeword: 1 1 0 0 1 1 0 (positions 7 down to 1: \( b_7=1, b_6=1, b_5=0, b_4=0, b_3=1, b_2=1, b_1=0 \)).
To correct a single-bit error, the receiver recomputes the three syndrome bits \( s_1, s_2, s_4 \) (using the same covering sets as the parity bits), concatenates them as a binary number \( s_4 s_2 s_1 \), and if non-zero, flips the bit at the position indicated by the syndrome.
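The encode-and-correct procedure can be sketched directly from the positional layout above (bit \( n \) of a byte holds codeword position \( n \); bit 0 is unused). This is a minimal illustration, not production ECC code:

```c
#include <assert.h>
#include <stdint.h>

/* XOR of the codeword bits at positions 1..7 whose index has any bit of
   'mask' set — the covering sets used by p1 (mask 1), p2 (2), p4 (4). */
static uint8_t parity_of(uint8_t cw, uint8_t mask) {
    uint8_t p = 0;
    for (int pos = 1; pos <= 7; pos++)
        if ((pos & mask) && ((cw >> pos) & 1u))
            p ^= 1u;
    return p;
}

/* Encode a 4-bit data word (bit 0 = d1 ... bit 3 = d4) into a Hamming(7,4)
   codeword: data bits at positions 3,5,6,7; parity bits at 1,2,4. */
static uint8_t hamming74_encode(uint8_t data) {
    uint8_t cw = 0;
    cw |= ((data >> 0) & 1u) << 3;   /* d1 -> position 3 */
    cw |= ((data >> 1) & 1u) << 5;   /* d2 -> position 5 */
    cw |= ((data >> 2) & 1u) << 6;   /* d3 -> position 6 */
    cw |= ((data >> 3) & 1u) << 7;   /* d4 -> position 7 */
    cw |= parity_of(cw, 1u) << 1;    /* p1 */
    cw |= parity_of(cw, 2u) << 2;    /* p2 */
    cw |= parity_of(cw, 4u) << 4;    /* p4 */
    return cw;
}

/* Recompute the syndrome s4 s2 s1; a non-zero syndrome is the position of
   the single flipped bit, so flipping it back corrects the error. */
static uint8_t hamming74_correct(uint8_t cw) {
    uint8_t s = parity_of(cw, 1u)
              | (uint8_t)(parity_of(cw, 2u) << 1)
              | (uint8_t)(parity_of(cw, 4u) << 2);
    if (s)
        cw ^= (uint8_t)(1u << s);
    return cw;
}
```

Encoding the worked example (\( d_1=1, d_2=0, d_3=1, d_4=1 \), i.e. the nibble 0xD) yields codeword bits 1100110 at positions 7..1; flipping any single bit, say position 5, produces a syndrome of 5 and is corrected.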
Adding a global parity bit over all positions yields SECDED: single-error correct, double-error detect. If the syndrome is non-zero but the global parity check passes, two bits are in error (detected but uncorrectable). ECC (Error-Correcting Code) memory in servers uses Hamming-based SECDED codes over 64-bit data words with 8 check bits.
5.4 Error Types in Practice
Hard errors are permanent faults — a stuck-at-0 or stuck-at-1 bit cell — that reproduce reliably. They are diagnosed at manufacturing test (using stuck-at fault models) and cause the device to be discarded or remapped.
Soft errors (single-event upsets) are transient bit flips caused by high-energy particles (cosmic rays, alpha particles from packaging materials) striking a storage cell and depositing enough charge to flip its state. They do not damage the cell; the next write restores correct operation. Soft error rates in SRAM are characterised in FITs (failures in time, where 1 FIT = 1 failure per \( 10^9 \) device-hours). High-reliability systems use ECC memory to tolerate soft errors in flight.
Chapter 6: Serial Interfacing
6.1 Motivation for Serial Communication
A parallel bus connecting two chips requires one wire per bit — 8, 16, or 32 wires for typical data widths — plus separate control and clock lines. At PCB scale this is manageable. But for communication over longer distances, between boards or to remote sensors, the cost of multiple parallel conductors becomes prohibitive. More importantly, at high frequencies, maintaining signal integrity across many parallel lines simultaneously — ensuring all bits arrive within a single bit period of each other — requires meticulous impedance matching and length matching that adds cost and constrains layout.
Serial communication resolves both problems by transmitting bits one at a time over a single data line (or a small number of lines). The trade-off is a reduction in raw throughput for a given clock rate, but serial techniques enable the use of differential signalling (such as LVDS or RS-485), which provides excellent noise immunity over long distances, and they enable clock encoding schemes that permit very high bit rates on short links.
6.2 Asynchronous Serial Communication: UART
The Universal Asynchronous Receiver/Transmitter (UART) is the oldest and most widely used serial interface in embedded systems. It requires only two signal lines per direction (typically labelled TX and RX), operates without a shared clock, and is supported by virtually every microcontroller produced in the last 40 years.
6.2.1 Frame Format
A UART frame consists of:
- Start bit — a single bit of logic 0, transitioning from the idle (logic 1) line state. This marks the beginning of a character.
- Data bits — typically 8 bits, transmitted LSB first.
- Parity bit (optional) — even, odd, or none.
- Stop bit(s) — one or two bits of logic 1, returning the line to idle.
At 115200 baud with 8N1 format (8 data bits, no parity, 1 stop bit), each frame is 10 bit-periods wide, consuming \( 10 / 115200 \approx 86.8 \) µs. Effective data throughput is 11,520 bytes per second, or 92.16 kbits/s.
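The frame arithmetic above is worth capturing as code, since it recurs whenever a UART buffer or timeout is sized. A small sketch (function names illustrative), using 64-bit intermediates so the nanosecond product cannot overflow:

```c
#include <assert.h>

/* Duration of one frame in nanoseconds: bits_per_frame / baud.
   For 8N1, bits_per_frame = 10 (start + 8 data + stop). */
static unsigned long long frame_time_ns(unsigned baud,
                                        unsigned bits_per_frame) {
    return (1000000000ULL * bits_per_frame) / baud;
}

/* Payload throughput in bytes per second for one data byte per frame. */
static unsigned bytes_per_second(unsigned baud, unsigned bits_per_frame) {
    return baud / bits_per_frame;
}
```

At 115200 baud with a 10-bit frame this gives 86,805 ns per frame and 11,520 bytes per second, matching the figures in the text.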
6.2.2 Clock Recovery
Because there is no shared clock, the receiver must derive bit timing from the data stream. A UART receiver samples each bit near its centre to maximise noise immunity. With an oversampling factor of 16 (the standard), the receiver samples the line 16 times per bit period. On detecting the start bit’s falling edge, it counts to sample 8 (the centre of the start bit) to confirm the start bit, then samples at counts 24, 40, 56, … relative to the falling edge — the centres of the subsequent bits.
The accuracy of this scheme depends on the two ends maintaining the same nominal baud rate. A frequency offset of \( \delta \) percent accumulates as a timing error of \( \delta \times N \)% over a frame of \( N \) bits. For 8N1 (10 bit-periods), the accumulated error at the last data bit must remain less than half a bit period (50%). This requires \( \delta < 5\% \), which standard crystal oscillators and fractional-N baud-rate generators achieve comfortably.
6.2.3 UART Initialisation in C
The following example configures a Cortex-M UART peripheral (using a notional register-level API):
#define UART_BASE 0x40004400u
#define UART_CR1 (*(volatile uint32_t *)(UART_BASE + 0x0C))
#define UART_BRR (*(volatile uint32_t *)(UART_BASE + 0x08))
#define UART_SR (*(volatile uint32_t *)(UART_BASE + 0x00))
#define UART_DR (*(volatile uint32_t *)(UART_BASE + 0x04))
#define UART_CR1_UE (1u << 13)
#define UART_CR1_TE (1u << 3)
#define UART_CR1_RE (1u << 2)
#define UART_SR_TXE (1u << 7)
#define UART_SR_RXNE (1u << 5)
void uart_init(uint32_t pclk_hz, uint32_t baud) {
UART_BRR = pclk_hz / baud; /* integer divider; ignores fractional */
UART_CR1 = UART_CR1_UE | UART_CR1_TE | UART_CR1_RE;
}
void uart_send_byte(uint8_t b) {
while (!(UART_SR & UART_SR_TXE)) /* wait for transmit data register empty */
;
UART_DR = b;
}
uint8_t uart_recv_byte(void) {
while (!(UART_SR & UART_SR_RXNE)) /* wait for receive data register not empty */
;
return (uint8_t)UART_DR;
}
In a real system, uart_send_byte would be driven by a transmit-buffer-empty interrupt rather than polling, and uart_recv_byte would be replaced by a receive interrupt that stores incoming bytes in a circular buffer for consumption by the application.
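The circular (ring) buffer mentioned above can be sketched as follows. This is a minimal single-producer/single-consumer version (names and size are illustrative): the ISR writes only the head index and the application writes only the tail, so no interrupt-disable section is needed around individual operations:

```c
#include <assert.h>
#include <stdint.h>

#define RX_BUF_SIZE 64u   /* power of two, so wrap-around is a cheap mask */

static volatile uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint32_t rx_head;   /* written only by the receive ISR */
static volatile uint32_t rx_tail;   /* written only by the application */

/* Called from the receive ISR with each incoming byte; drops the byte if
   the buffer is full (an overrun the application can count if desired). */
static void rx_put(uint8_t b) {
    uint32_t next = (rx_head + 1u) & (RX_BUF_SIZE - 1u);
    if (next != rx_tail) {          /* full when head would catch the tail */
        rx_buf[rx_head] = b;
        rx_head = next;
    }
}

/* Called from the application; returns 1 and stores a byte if available. */
static int rx_get(uint8_t *out) {
    if (rx_tail == rx_head)
        return 0;                   /* empty */
    *out = rx_buf[rx_tail];
    rx_tail = (rx_tail + 1u) & (RX_BUF_SIZE - 1u);
    return 1;
}
```

The receive ISR simply reads the data register and calls rx_put; the application drains bytes with rx_get at its leisure, preserving arrival order.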
6.3 Synchronous Serial: SPI
The Serial Peripheral Interface (SPI) is a synchronous, full-duplex protocol originally developed by Motorola. It uses four signals:
- SCLK (Serial Clock) — driven by the master.
- MOSI (Master Out Slave In) — data from master to slave.
- MISO (Master In Slave Out) — data from slave to master.
- SS̄ / CS̄ (Slave Select / Chip Select) — active-low, one per slave, driven by the master.
Because the clock is provided by the master, there is no need for clock recovery; both master and slave sample MISO/MOSI on the same clock edge. SPI achieves significantly higher throughput than UART — tens of megabits per second is common — and is well suited to ADCs, DACs, flash memories, and displays.
6.3.1 SPI Modes
SPI defines four clock polarity/phase combinations, selected by two bits CPOL (clock polarity) and CPHA (clock phase):
| Mode | CPOL | CPHA | Idle Clock | Sample Edge |
|---|---|---|---|---|
| 0 | 0 | 0 | Low | Rising |
| 1 | 0 | 1 | Low | Falling |
| 2 | 1 | 0 | High | Falling |
| 3 | 1 | 1 | High | Rising |
The slave device’s data sheet specifies which mode it supports. Mismatched SPI mode is a common cause of garbled data in embedded designs.
6.3.2 SPI Transaction
An SPI transaction begins when the master asserts (pulls low) the appropriate SS̄ line, clocks out a command or address byte on MOSI while simultaneously shifting in the slave’s response on MISO, and finally de-asserts SS̄. The master and slave shift registers exchange one bit per clock pulse, so after eight pulses they have swapped their full contents; what the slave transmits during the command-byte phase is often a dummy byte (0x00 or 0xFF), with the actual response appearing in subsequent bytes.
uint8_t spi_transfer(uint8_t tx) {
while (!(SPI_SR & SPI_SR_TXE))
;
SPI_DR = tx;
while (!(SPI_SR & SPI_SR_RXNE))
;
return (uint8_t)SPI_DR;
}
uint16_t adc_read_channel(uint8_t ch) {
    uint16_t result; /* a 10-bit sample does not fit in uint8_t */
    CS_LOW();
    spi_transfer(0x01); /* start bit */
    result = spi_transfer(0x80 | (ch << 4)); /* SGL/DIFF, channel */
    result = (result & 0x03) << 8; /* top two bits of the 10-bit result */
    result |= spi_transfer(0x00); /* clock out low byte of result */
    CS_HIGH();
    return result;
}
6.4 Two-Wire Serial: I2C
The Inter-Integrated Circuit (I2C) bus, developed by Philips (now NXP), provides a simple two-wire multi-master, multi-slave serial protocol using:
- SDA (Serial Data) — open-drain, bidirectional.
- SCL (Serial Clock) — open-drain, driven by master (slaves may hold it low to stretch the clock).
Open-drain signalling means any device on the bus can pull a line low, and a pull-up resistor returns it to logic high when no device drives it low. This wired-AND arrangement allows multiple masters and slaves to share the two wires without bus contention.
6.4.1 I2C Protocol Elements
An I2C transaction consists of:
- START condition — SDA falls while SCL is high. Only the master generates this.
- Address byte — 7 or 10 bits identifying the target slave, followed by a Read/Write bit. The slave whose address matches acknowledges by pulling SDA low during the acknowledgement clock pulse.
- Data byte(s) — transferred MSB first, each followed by an ACK from the receiver.
- STOP condition — SDA rises while SCL is high.
A repeated START (Sr) allows a master to change direction (read after write, or write after read) without releasing the bus.
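For concreteness, the first byte on the wire combines the 7-bit address with the direction bit; `i2c_addr_byte` below is an illustrative helper, not a vendor API:

```c
#include <stdint.h>

#define I2C_WRITE 0u
#define I2C_READ  1u

/* Form the I2C address byte: the 7-bit slave address occupies bits 7..1
 * and the R/W bit occupies bit 0 (1 = read, 0 = write). */
static inline uint8_t i2c_addr_byte(uint8_t addr7, uint8_t rw) {
    return (uint8_t)((uint8_t)(addr7 << 1) | (rw & 1u));
}
```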
I2C standard mode operates at 100 kbit/s, fast mode at 400 kbit/s, fast-mode plus at 1 Mbit/s, and high-speed mode at 3.4 Mbit/s. The pull-up resistor value trades speed against noise margin and power consumption — a lower resistance gives faster rise times (higher speed) but increases the current drawn whenever a device holds a line low.
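The speed side of that trade-off can be quantified: the I2C specification (NXP UM10204) bounds the maximum pull-up by the allowed rise time \( t_r \) and the bus capacitance \( C_b \), as \( R_{p,max} = t_r / (0.8473 \, C_b) \). A quick calculation, assuming a 100 pF bus:

```c
/* Maximum pull-up resistance for a given rise-time budget and bus
 * capacitance, from the I2C spec: Rp(max) = t_r / (0.8473 * Cb). */
static double i2c_rp_max(double t_r_s, double c_bus_f) {
    return t_r_s / (0.8473 * c_bus_f);
}
/* e.g. fast mode allows t_r <= 300 ns, so with Cb = 100 pF,
 * i2c_rp_max(300e-9, 100e-12) is about 3.5 kOhm. */
```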
6.4.2 Clock Stretching and Arbitration
A slow slave may hold SCL low after the falling edge of each clock pulse, forcing the master to pause. This clock stretching allows the slave to process incoming data at its own pace. Multi-master I2C requires bitwise arbitration: a master that drives SDA high while observing SDA low (due to another master) immediately loses arbitration, halts its transmission, and waits before retrying. Because arbitration is implicit in the SDA value, no special arbitration protocol is needed.
Chapter 7: Analogue Interfacing
7.1 The Analogue-Digital Boundary
The physical world is analogue: temperatures, pressures, accelerations, voltages, and acoustic pressures are continuous quantities that can take any value within their range. Embedded systems ultimately interact with this world — reading sensors, driving actuators — so they must convert between analogue and digital representations. The ADC (Analogue-to-Digital Converter) and DAC (Digital-to-Analogue Converter) are the components that span this boundary.
7.2 Digital-to-Analogue Conversion
A DAC accepts an \( n \)-bit digital code \( D \in \{0, 1, \ldots, 2^n - 1\} \) and produces a corresponding analogue output voltage:
\[ V_{out} = V_{ref} \cdot \frac{D}{2^n}, \]
where \( V_{ref} \) is the full-scale reference voltage. An ideal 12-bit DAC with \( V_{ref} = 3.3 \) V has a resolution (least significant bit, LSB) of \( 3.3 / 4096 \approx 0.806 \) mV.
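The transfer function translates directly into code; a one-line helper (illustrative, not a vendor API):

```c
#include <stdint.h>

/* Ideal DAC output voltage for an n-bit code with reference Vref. */
static double dac_volts(uint32_t code, unsigned nbits, double vref) {
    return vref * (double)code / (double)(1u << nbits);
}
/* e.g. for a 12-bit DAC with Vref = 3.3 V, one LSB is
 * dac_volts(1, 12, 3.3), about 0.806 mV, matching the text. */
```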
7.2.1 R-2R Ladder DAC
The R-2R ladder network implements DAC conversion using only two resistor values, making it practical to integrate in silicon. The ladder consists of resistors of value \( R \) in the series path and \( 2R \) in the shunt path. Each bit switches its shunt resistor between \( V_{ref} \) and GND. The network’s Thevenin equivalent produces a current proportional to the binary-weighted sum of the bit inputs. The R-2R architecture is insensitive to the absolute values of \( R \) and \( 2R \) provided their ratio is accurate — a constraint achievable with modern IC fabrication.
7.2.2 DAC Errors
Static errors are systematic deviations from the ideal transfer characteristic:
- Offset error: the output when the input code is zero is not exactly zero. Corrected by calibration.
- Gain error: the slope of the transfer characteristic deviates from the ideal. Also correctable.
- Differential Non-Linearity (DNL): the actual step size for a unit code increment deviates from the ideal 1 LSB. A DNL more negative than \( -1 \) LSB makes the DAC non-monotonic.
- Integral Non-Linearity (INL): the maximum deviation of the actual transfer characteristic from the ideal straight line.
Dynamic errors manifest during rapidly changing outputs:
- Settling time: the time from a code change to the output settling within \( \pm 0.5 \) LSB of its final value. It matters whenever the DAC is updated rapidly, e.g. at the audio sample rate.
- Glitch impulse: a transient spike occurring when multiple bits change simultaneously (e.g., from code 0111…1 to 1000…0). Major-carry transitions produce the largest glitches.
7.3 Analogue-to-Digital Conversion
An ADC samples a continuous analogue voltage at discrete time instants and quantises each sample to an \( n \)-bit integer. The sampling theorem (Nyquist-Shannon) mandates that the sampling frequency \( f_s \) must exceed twice the highest frequency component \( f_{max} \) in the analogue signal:
\[ f_s > 2 f_{max}. \]
Sampling below this rate causes aliasing: high-frequency content appears as low-frequency artefacts in the digital signal. In practice, an anti-aliasing filter (a low-pass filter with cut-off below \( f_s / 2 \)) is placed before the ADC input to suppress frequency content that would otherwise alias.
7.3.1 Sample-and-Hold
Before quantisation, the input voltage must be held constant for the duration of the conversion. A sample-and-hold (S&H) circuit does this: a transmission gate samples the analogue input onto a capacitor, then opens, holding the capacitor voltage constant while the ADC conversion proceeds. The aperture time of the S&H — the time uncertainty in when sampling occurs — limits the maximum input frequency. For an input sinusoid of amplitude \( A \) and frequency \( f \), the maximum rate of change is \( 2\pi f A \), so an aperture uncertainty of \( \Delta t \) produces a voltage error of \( 2\pi f A \Delta t \). To keep this error below 0.5 LSB with a full-scale \( A \) and \( n \)-bit ADC, the required aperture is:
\[ \Delta t < \frac{1}{2\pi f \cdot 2^n}. \]
For a 12-bit ADC sampling a 10 kHz signal, \( \Delta t < 1 / (2\pi \times 10000 \times 4096) \approx 3.9 \) ns.
7.3.2 Successive Approximation ADC
The Successive Approximation Register (SAR) ADC is the dominant architecture in embedded microcontrollers for medium-speed (up to a few MSPS), medium-resolution (8–16 bits) applications. It uses a binary search algorithm: starting from the MSB, it sets each bit to 1, compares the resulting DAC output to the input, and retains the bit if the DAC output is less than or equal to the input, or clears it otherwise.
```
for bit = n-1 down to 0:
    set bit → compare DAC output to V_in
    if DAC output > V_in: clear bit
    else:                 keep bit
```
An \( n \)-bit SAR ADC requires exactly \( n \) comparisons per conversion, making it deterministic and predictable — ideal for embedded real-time applications. Conversion time is \( n \times T_{clk,ADC} \), where \( T_{clk,ADC} \) is the ADC clock period.
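The binary search translates directly into C. In the sketch below the analogue comparator is modelled by a callback that reports whether the internal DAC output for a trial code exceeds \( V_{in} \); the callback and the `sar_demo` test harness are illustrative abstractions, not hardware registers:

```c
#include <stdint.h>

/* Comparator model: nonzero when the DAC output for 'code' exceeds Vin. */
typedef int (*sar_cmp_fn)(uint16_t code, void *ctx);

/* n-bit successive-approximation conversion: one comparison per bit. */
static uint16_t sar_convert(unsigned nbits, sar_cmp_fn dac_above_vin, void *ctx) {
    uint16_t code = 0;
    for (int bit = (int)nbits - 1; bit >= 0; bit--) {
        code |= (uint16_t)(1u << bit);        /* trial-set this bit */
        if (dac_above_vin(code, ctx))
            code &= (uint16_t)~(1u << bit);   /* DAC > Vin: clear it */
    }
    return code;
}

/* Ideal model for testing: DAC output equals the code itself. */
static uint16_t vin_model;
static int cmp_model(uint16_t code, void *ctx) { (void)ctx; return code > vin_model; }
static uint16_t sar_demo(uint16_t vin, unsigned nbits) {
    vin_model = vin;
    return sar_convert(nbits, cmp_model, 0);
}
```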
7.3.3 ADC in Embedded C
A typical SAR ADC initialisation and read sequence (Cortex-M STM32-style):
```c
/* Enable ADC clock in RCC, configure GPIO pin as analogue input */
ADC1->CR2 |= ADC_CR2_ADON;              /* power on ADC */
ADC1->SQR3 = channel;                   /* select channel */
ADC1->CR2 |= ADC_CR2_SWSTART;           /* start conversion */
while (!(ADC1->SR & ADC_SR_EOC))        /* wait for end-of-conversion */
    ;
uint16_t raw = (uint16_t)ADC1->DR;      /* read 12-bit result */
float voltage = raw * (3.3f / 4096.0f); /* scale to volts */
```
In a production design, the polling wait would be replaced by a DMA transfer triggered by the EOC signal, moving the converted sample directly into a buffer without CPU intervention.
7.4 Audio Playback: Lab 2 Architecture
Lab 2’s audio pipeline illustrates the integration of multiple interface concepts. An SD card (accessed via SPI) stores a raw PCM audio file sampled at 8 kHz or 44.1 kHz. A SysTick or general-purpose timer fires at the audio sample rate. The ISR reads the next sample from a software buffer and writes it to a DAC control register (via SPI or parallel interface). A second higher-level task, running in the background, refills the buffer from the SD card in block-aligned chunks. The design must ensure the buffer never empties (causing an audible glitch) despite the variable latency of SD card reads.
Chapter 8: Bus Data Transfer
8.1 Synchronous Bus Protocol
In a synchronous bus, every operation is governed by the system clock. A typical synchronous read transaction proceeds as follows:
- Master asserts address on the address bus at the start of clock period 1.
- Master asserts read (R/W̄ = 1) and bus enable signals.
- Slave decodes address, drives data onto the data bus before the sample point (typically the rising edge of clock period 2 or 3).
- Master samples data at the rising edge of clock period 2 (or later, if wait states are inserted).
- Master releases the bus; slave tri-states its data output.
The synchronous protocol is simple to implement and analyse. Its weakness is inflexibility: the cycle time is fixed, so slow devices must insert wait states, and fast devices cannot exploit their speed advantage beyond the minimum cycle time.
8.2 Asynchronous Bus Handshake
An asynchronous bus uses explicit request/acknowledge signals in place of a shared clock. A fully interlocked (four-cycle) handshake:
- Master asserts REQ (request).
- Slave responds with ACK (acknowledge) after completing the operation.
- Master observes ACK and de-asserts REQ.
- Slave observes REQ de-assert and de-asserts ACK.
This protocol adapts naturally to devices of any speed. A fast memory completes in a few nanoseconds; a slow I/O device may take milliseconds. The protocol accommodates both without wasted cycles. The disadvantage is increased latency per cycle due to the four handshake phases, and the requirement to handle metastability on REQ and ACK signals crossing clock domain boundaries.
8.3 Semi-Synchronous and Split-Cycle Buses
A semi-synchronous bus is clocked, but the slave can assert a WAIT signal to extend the cycle by whole clock periods. This is the most common practical design: the base cycle is fast (one or two clocks) and slow devices simply hold WAIT active for as many additional clocks as needed. Implementation is straightforward: the master samples WAIT one setup time before the data sample edge; if asserted, it re-latches the cycle’s address and control signals for another clock period.
A split-cycle bus separates address and data phases into independent sub-transactions. The master initiates the address phase and then releases the bus. When the slave has retrieved the data, it requests the bus again for the data phase. Between the two phases, the bus is free for other transactions. This dramatically improves utilisation for slow slaves, but requires the master and slave to maintain transaction context across the gap, adding protocol complexity.
8.4 Bus Performance Analysis
Bus throughput depends on the average number of cycles per transfer. Let \( T_{bus} \) be the nominal bus cycle time, \( n_{wait} \) the average number of wait states, and \( n_{overhead} \) the average overhead cycles (address phase, turnaround, etc.):
\[ \text{Bus throughput} = \frac{\text{data width in bytes}}{(1 + n_{wait} + n_{overhead}) \times T_{bus}}. \]
For a 32-bit synchronous bus at 50 MHz (\( T_{bus} = 20 \) ns) with an average of 1 wait state and no overhead:
\[ \text{Throughput} = \frac{4 \text{ B}}{2 \times 20 \text{ ns}} = \frac{4}{40 \times 10^{-9}} = 100 \text{ MB/s}. \]
Chapter 9: Bus Arbitration
9.1 The Need for Arbitration
A shared bus can have only one master active at a time; if two masters attempt to drive address or data lines simultaneously, contention occurs — the resulting voltage is neither a valid logic 0 nor a valid logic 1, and both transactions are corrupted. Arbitration is the process by which multiple masters negotiate for exclusive bus ownership.
Requirements for a good arbitration scheme:
- Mutual exclusion: only one master receives the bus grant at a time.
- Fairness: no master is permanently denied access (freedom from starvation).
- Bounded latency: each master receives the bus within a known worst-case time.
- Efficiency: arbitration overhead should be small relative to bus transaction time.
9.2 Centralised Arbitration: Daisy Chain
In daisy-chain arbitration, a single central arbiter drives a Bus Grant (BG) line that passes through masters in series. When a master wishes to use the bus, it asserts Bus Request (BR). The arbiter, seeing BR asserted and the bus idle, asserts BG. The grant propagates down the daisy chain until it reaches the first master that has a pending request; that master captures the grant and asserts Bus Busy (BB).
Daisy-chain priority is determined by physical position in the chain — the master closest to the arbiter always wins. This creates a static priority scheme that can starve lower-priority masters under heavy load. It is simple to implement (one extra wire per master, plus BR and BB shared lines) and scales to many masters at low hardware cost.
9.3 Centralised Arbitration: Non-Daisy-Chain (Independent Request)
Each master has its own BR line to the central arbiter. The arbiter implements a priority encoder (selecting the highest-priority requester) or a round-robin scheduler (advancing the priority pointer after each grant). Non-daisy-chain arbitration provides:
- Configurable priority: the arbiter software or hardware can implement any priority policy.
- Bounded waiting time for round-robin: with \( N \) masters, any requester waits at most \( N - 1 \) grants before receiving its turn.
- Higher wiring cost: \( N \) separate BR lines, typically bundled into a bus.
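The round-robin grant decision reduces to scanning the request lines starting just past the last grantee, which is what gives the \( N - 1 \) bound on waiting. A sketch of the pure logic (no bus signals):

```c
#include <stdint.h>

/* Round-robin arbiter: 'req' holds one request bit per master and
 * 'last' is the previously granted master. Returns the next master
 * to grant, or -1 if no requests are pending. */
static int rr_grant(uint32_t req, int last, int n_masters) {
    for (int i = 1; i <= n_masters; i++) {
        int m = (last + i) % n_masters;   /* rotate priority past 'last' */
        if (req & (1u << m))
            return m;
    }
    return -1;
}
```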
9.4 Distributed Arbitration
In a distributed scheme, no single arbiter exists. Each master observes the state of the bus and makes its own grant decision based on a shared arbitration protocol. Ethernet’s CSMA/CD (Carrier Sense Multiple Access with Collision Detection) is a distributed arbitration protocol: each station listens before transmitting, detects collisions, and implements a random binary-exponential backoff to resolve them.
In a wired-OR bus with assigned priority IDs, each master drives its priority bits onto an open-collector arbitration bus and monitors the bus. A master that drives a 1 but sees a 0 (a higher-priority master also requesting) withdraws immediately. This is the arbitration mechanism of I2C and also of the Controller Area Network (CAN bus), which uses bitwise NRZ arbitration without collision: the winning master’s frame is transmitted without corruption.
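The wired-AND mechanism can be simulated directly. In the sketch below (an illustrative model, not driver code) each still-active node drives its ID MSB-first, the bus carries the AND of all driven bits, and any node that sends a recessive 1 but reads back a dominant 0 withdraws — so the lowest ID always wins:

```c
#include <stdint.h>

/* Simulate bitwise wired-AND arbitration among n nodes transmitting
 * 'nbits'-wide IDs MSB-first. Returns the index of the winning node. */
static int arbitrate(const uint16_t *ids, int n, int nbits) {
    uint32_t active = (1u << n) - 1u;          /* all nodes still contending */
    for (int b = nbits - 1; b >= 0; b--) {
        int bus = 1;                           /* recessive unless driven low */
        for (int i = 0; i < n; i++)
            if ((active >> i) & 1u)
                bus &= (int)((ids[i] >> b) & 1u);  /* wired-AND */
        for (int i = 0; i < n; i++)            /* recessive senders that saw */
            if (((active >> i) & 1u) &&        /* dominant must drop out     */
                (int)((ids[i] >> b) & 1u) != bus)
                active &= ~(1u << i);
    }
    for (int i = 0; i < n; i++)
        if ((active >> i) & 1u) return i;
    return -1;
}

/* Convenience wrapper: arbitrate among three IDs. */
static int arbitrate3(uint16_t a, uint16_t b, uint16_t c, int nbits) {
    uint16_t ids[3] = { a, b, c };
    return arbitrate(ids, 3, nbits);
}
```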
Chapter 10: Direct Memory Access
10.1 CPU Bottleneck in High-Rate Transfers
Consider a microcontroller sampling an ADC at 44.1 kHz (audio-rate). Each interrupt fires every 22.7 µs. The ISR must read the ADC result register and store it in a buffer — perhaps 10–20 instructions. At 100 MHz, this consumes perhaps 200 ns per sample, or about 0.9% of CPU time. Tolerable.
Now consider a video ADC at 25 frames per second with \( 640 \times 480 \) pixels, 2 bytes per pixel: \( 15.36 \) million bytes per second. An interrupt per byte would fire 15.36 million times per second, and each interrupt’s entry/exit overhead (≥ 12 cycles on Cortex-M3) alone would consume \( 12 \times 15.36 \times 10^6 = 184 \times 10^6 \) cycles per second — 184% of a 100 MHz CPU. Interrupt-driven I/O cannot keep up.
Direct Memory Access (DMA) resolves this by delegating data movement to a dedicated hardware engine that operates in parallel with the CPU.
10.2 DMA Operation
A DMA controller (DMAC) is a specialised processor that performs memory-to-memory, peripheral-to-memory, or memory-to-peripheral transfers without CPU involvement. The CPU configures the DMAC by writing:
- Source address: the starting address from which data is read.
- Destination address: the starting address to which data is written.
- Transfer count: the number of elements (bytes, half-words, words) to transfer.
- Transfer width: the size of each element.
- Increment mode: whether source and/or destination address are incremented after each element.
- Trigger: the condition that initiates each element transfer (peripheral data-ready flag, software trigger, etc.).
Once configured and enabled, the DMAC performs the transfer autonomously. When complete, it asserts an interrupt to notify the CPU that the buffer is ready.
```c
DMA_Channel->CPAR  = (uint32_t)&ADC1->DR;     /* peripheral source */
DMA_Channel->CMAR  = (uint32_t)sample_buffer; /* memory destination */
DMA_Channel->CNDTR = BUFFER_SIZE;             /* number of samples */
DMA_Channel->CCR   = DMA_CCR_MINC             /* increment memory address */
                   | DMA_CCR_PSIZE_16BIT      /* 16-bit peripheral */
                   | DMA_CCR_MSIZE_16BIT      /* 16-bit memory */
                   | DMA_CCR_TCIE             /* transfer-complete interrupt */
                   | DMA_CCR_EN;              /* enable the channel */
```
The ADC triggers each DMA transfer on end-of-conversion. The CPU is free to execute other code. When BUFFER_SIZE samples have been collected, the DMA asserts an interrupt and the CPU processes the buffer.
10.3 Bus Bandwidth Implications
The DMAC must compete with the CPU for access to the memory bus. Three arbitration schemes are common:
Burst mode: the DMAC seizes the bus and transfers an entire block before relinquishing it. The CPU is stalled for the duration of the burst. This maximises DMA throughput but can introduce significant latency spikes in CPU execution — a problem for real-time control.
Cycle steal mode: the DMAC steals one bus cycle at a time, interleaving with CPU accesses. The CPU experiences a slight slowdown (each stolen cycle delays its memory access by one cycle) but is never stalled for long periods. Most embedded DMAC implementations use cycle stealing.
Transparent mode: the DMAC transfers data only during idle bus cycles — bus cycles in which the CPU is not accessing memory (e.g., during instruction execution using the cache). This is ideal but depends on cache hit rate; with poor locality, the CPU may have few idle cycles.
10.4 Double-Buffering for Continuous Streaming
Continuous audio playback or ADC streaming requires a buffer that is simultaneously being filled by DMA and consumed by the application. A double-buffer (ping-pong buffer) arrangement uses two equal-sized buffers:
- While DMA fills buffer A, the CPU processes buffer B.
- When DMA completes buffer A, it switches to buffer B and raises a half-complete / full-complete interrupt.
- The CPU switches to processing buffer A.
This requires the CPU processing rate to match or exceed the DMA fill rate. If processing takes longer than the time to fill one buffer, a buffer underrun occurs and audio continuity is lost. Lab 2’s audio playback system must be designed to prevent this.
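A minimal ping-pong scheme, assuming the DMA's half- and full-complete interrupts each hand one half of the buffer to the application (all names here are illustrative, not a vendor API):

```c
#include <stdint.h>

#define HALF_SAMPLES 256
static int16_t stream_buf[2][HALF_SAMPLES]; /* ping and pong halves */
static volatile int ready_half = -1;        /* -1: nothing pending */

/* Called from the (hypothetical) DMA ISR: half-complete releases
 * half 0, full-complete releases half 1 while the DMA wraps around.
 * Returns the half handed over (convenient for testing). */
static int dma_buffer_done(int full_complete) {
    ready_half = full_complete ? 1 : 0;
    return ready_half;
}

/* Application side: process whichever half the DMA has released.
 * Returns 1 if a half was processed, 0 if none was pending. */
static int consume_half(void (*process)(const int16_t *, int)) {
    int h = ready_half;
    if (h < 0)
        return 0;
    ready_half = -1;                        /* claim it before processing */
    process(stream_buf[h], HALF_SAMPLES);
    return 1;
}

/* Counting processor used for the self-test below. */
static int halves_seen;
static void count_half(const int16_t *buf, int n) { (void)buf; (void)n; halves_seen++; }
```

If `consume_half` ever lags by more than one half-buffer period, a half is overwritten before it is processed — exactly the underrun the text warns about.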
Chapter 11: Signal Integrity — Drivers, Transmission Lines, Grounding, and Shielding
11.1 Signal Integrity Fundamentals
As clock frequencies rise and trace lengths grow, the interconnecting wires on a PCB can no longer be treated as ideal conductors with instantaneous propagation. A PCB trace has a characteristic impedance \( Z_0 \), determined by its width, the substrate dielectric constant \( \varepsilon_r \), and the distance to the reference plane:
\[ Z_0 \approx \frac{87}{\sqrt{\varepsilon_r + 1.41}} \ln\left(\frac{5.98 h}{0.8 w + t}\right) \quad [\Omega], \]
where \( h \) is the dielectric thickness, \( w \) is the trace width, and \( t \) is the trace thickness (all in consistent units). A typical FR-4 microstrip with \( \varepsilon_r = 4.5 \) and appropriate geometry has \( Z_0 \approx 50 \) Ω.
When a signal encounters an impedance discontinuity — a via, a connector, or a receiver whose input impedance differs from \( Z_0 \) — a reflection occurs. The reflection coefficient at the receiver is:
\[ \Gamma = \frac{Z_L - Z_0}{Z_L + Z_0}, \]
where \( Z_L \) is the load impedance. For an unloaded CMOS input (\( Z_L \approx \infty \)), \( \Gamma \approx +1 \): the full incident voltage is reflected back, doubling the voltage at the receiver. This reflected wave travels back to the driver, bouncing again if the driver’s output impedance differs from \( Z_0 \), and continues until the reflections attenuate. The resulting oscillations, called ringing, can violate logic threshold voltages and cause spurious switching.
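A quick numerical check of the formula above, for the three canonical loads:

```c
/* Reflection coefficient at a load Z_L on a line of impedance Z0. */
static double refl_coeff(double z_load, double z0) {
    return (z_load - z0) / (z_load + z0);
}
/* refl_coeff(50.0, 50.0)  == 0.0  : matched, no reflection
 * refl_coeff(1e12, 50.0)  ~= +1.0 : open circuit (CMOS input)
 * refl_coeff(0.0,  50.0)  == -1.0 : short circuit inverts the wave */
```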
11.2 Termination Strategies
Source termination places a resistor \( R_s \) in series with the driver output, sized so that \( R_s \) plus the driver’s output impedance equals \( Z_0 \). The resistor absorbs reflections at the source end. Because the source resistance and \( Z_0 \) form a voltage divider, the signal initially launched onto the line is \( V_{supply}/2 \), reaching full amplitude only after the forward wave reflects off the open-circuit receiver and returns. This introduces a delay equal to twice the line propagation delay before the line settles — acceptable for point-to-point connections but problematic for multi-drop buses.
Parallel (end) termination places a resistor \( R_T = Z_0 \) between the receiver end of the line and a reference voltage. The terminated load impedance matches \( Z_0 \), so \( \Gamma = 0 \) and no reflections occur. The disadvantage is static power consumption: even in the idle state, current flows through \( R_T \).
11.3 Grounding and Power Integrity
A solid, low-impedance ground plane is the single most important design practice for signal integrity. At high frequencies, currents return to the source via the path of lowest inductance, not lowest resistance — and the path of lowest inductance is directly beneath the signal trace, in the adjacent reference plane. Slots or voids in the ground plane force return currents to detour around the obstruction, increasing loop inductance and generating electromagnetic interference (EMI).
Power delivery is also a concern: when a large number of output buffers switch simultaneously (simultaneous switching noise, SSN), a momentary large current is drawn from the power supply. The inductance of the power delivery network \( L_{PDN} \) and the current ramp rate \( dI/dt \) produce a voltage drop:
\[ \Delta V = L_{PDN} \cdot \frac{dI}{dt}. \]
Decoupling capacitors placed close to each IC’s power pins reduce this noise by providing a local charge reservoir. Their effectiveness depends on their self-resonant frequency being above the frequency of the switching event.
Chapter 12: Real-Time Operating System Concepts
12.1 From Bare Metal to an RTOS
A bare-metal embedded application consists of a superloop (a `while (1)` loop in `main`) and a collection of ISRs. This architecture works well for simple systems but scales poorly:
- Complex task interactions become difficult to reason about.
- Adding a new background computation requires manual time-slicing within the superloop.
- Priority inversion — a low-priority task holding a resource needed by a high-priority task, while a medium-priority task preempts it — can lead to deadline misses with no programmatic remedy.
A Real-Time Operating System (RTOS) provides a scheduler that manages multiple tasks (also called threads or processes), a timer service for periodic wake-up, and synchronisation primitives (semaphores, mutexes, queues) for inter-task communication.
12.2 Scheduling Policies
Fixed-priority preemptive scheduling assigns each task a static priority. The scheduler always runs the highest-priority ready task. A higher-priority task preempts a lower-priority one whenever it becomes ready (e.g., on a semaphore signal from an ISR). Under Rate Monotonic Analysis (RMA), tasks with shorter periods are assigned higher priorities, and the system is schedulable if:
\[ \sum_{i=1}^{n} \frac{C_i}{T_i} \leq n(2^{1/n} - 1), \]
where \( C_i \) is the worst-case execution time and \( T_i \) is the period of task \( i \). For \( n \to \infty \), the bound approaches \( \ln 2 \approx 0.693 \).
Round-robin scheduling gives each task equal time slices in rotation. It is fair but does not provide deterministic response time guarantees.
12.3 Semaphores and Mutexes
A binary semaphore has two states: taken (0) and given (1). A task that calls sem_take() blocks if the semaphore is taken; it proceeds when the semaphore is given by another task or ISR. An ISR can give a semaphore to wake a blocked task:
```c
/* ISR: signals that new data is available */
void UART_IRQHandler(void) {
    rx_buffer[rx_head++] = UART->DR;    /* rx_head assumed to wrap to the buffer size */
    osSemaphoreRelease(data_ready_sem); /* wake the blocked task */
}

/* Task: waits for and processes data */
void data_processor_task(void *arg) {
    for (;;) {
        osSemaphoreAcquire(data_ready_sem, osWaitForever);
        process(rx_buffer[rx_tail++]);  /* rx_tail likewise wraps */
    }
}
```
A mutex (mutual exclusion semaphore) is a binary semaphore with ownership: only the task that took the mutex can give it back. This enables priority inheritance — when a high-priority task blocks on a mutex held by a low-priority task, the RTOS temporarily elevates the low-priority task’s priority to that of the blocker, preventing priority inversion.
12.4 Inter-Task Communication with Queues
A queue is a FIFO data structure managed by the RTOS. One task (producer) writes items to the queue; another (consumer) reads them. The RTOS handles blocking and waking automatically: the producer blocks if the queue is full; the consumer blocks if it is empty. Queues are the cleanest mechanism for transferring data between tasks because they avoid shared-variable races — the RTOS’s queue operations are inherently atomic.
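The same idea can be implemented bare-metal as a single-producer, single-consumer ring buffer, which is safe between one ISR and one task without locking because each index has exactly one writer. A sketch (an RTOS queue additionally blocks and wakes tasks; `q_roundtrip` is a self-test helper):

```c
#include <stdint.h>

#define QCAP 8u  /* capacity; a power of two keeps the wrap cheap */

typedef struct {
    uint8_t buf[QCAP];
    volatile uint32_t head;  /* written only by the producer */
    volatile uint32_t tail;  /* written only by the consumer */
} spsc_queue_t;

/* Producer side (e.g. an ISR). Returns 0 if the queue is full. */
static int q_put(spsc_queue_t *q, uint8_t v) {
    if (q->head - q->tail == QCAP)
        return 0;
    q->buf[q->head % QCAP] = v;
    q->head++;               /* publish only after the slot is written */
    return 1;
}

/* Consumer side (e.g. a task). Returns 0 if the queue is empty. */
static int q_get(spsc_queue_t *q, uint8_t *v) {
    if (q->head == q->tail)
        return 0;
    *v = q->buf[q->tail % QCAP];
    q->tail++;
    return 1;
}

/* Self-test helper: push one value through the queue and read it back. */
static int q_roundtrip(uint8_t v) {
    static spsc_queue_t q;
    uint8_t out = 0;
    return q_put(&q, v) && q_get(&q, &out) && out == v;
}
```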
Chapter 13: Putting It All Together — System Design Perspectives
13.1 Interface Selection Criteria
Choosing the right interface for a given application requires balancing competing constraints:
| Criterion | UART | SPI | I2C | DMA |
|---|---|---|---|---|
| Wire count | 2 (TX/RX) | 4 + 1 per slave | 2 (shared) | N/A (bus mechanism) |
| Speed | 0.1–10 Mbps | 1–100+ Mbps | 0.1–3.4 Mbps | Limited by bus |
| Distance | 1–10 m (RS-232) | <1 m (PCB) | <1 m (PCB) | On-chip |
| Multi-slave | No (point-to-point) | Yes (one CS per slave) | Yes (address) | N/A |
| CPU load | High (interrupt per byte) | High (interrupt per frame) | High | Low (autonomous) |
| Complexity | Low | Medium | Medium | High (configuration) |
DMA should be used whenever the data rate is high enough to saturate interrupt-driven approaches. SPI is preferred over I2C when speed matters and wiring is not constrained. I2C is preferred for many slow peripherals on a shared bus. UART suffices for diagnostic output, host communication, and low-rate sensors.
13.2 Design Tradeoffs in Synchronisation
Every synchronisation mechanism has associated costs:
Polling minimises latency but wastes CPU cycles. It is appropriate when the CPU has nothing else to do and when latency is paramount — e.g., a motor encoder read in a tight control loop.
Interrupt-driven I/O amortises CPU cost over the interrupt period. It adds latency (interrupt entry overhead, typically 12–40 cycles) and complexity (shared-state hazards, re-entrancy concerns, stack depth planning). It is appropriate for the large middle ground of embedded I/O.
DMA maximises throughput and minimises CPU involvement but requires careful buffer management, cache coherency maintenance (on processors with data caches), and attention to DMA channel contention. It is appropriate for bulk transfers: audio, image sensors, network packets.
RTOS task-based design adds the overhead of context switching (100–300 cycles per switch in a typical RTOS) but provides modularity, priority management, and synchronisation primitives that make complex systems tractable. It is appropriate when the system has five or more concurrent activities with non-trivial interactions.
13.3 Noise, Jitter, and Metastability in Practice
System realities manifest in ways that pure logic design ignores:
Noise on a logic line — from power supply rail collapse, cross-talk from adjacent traces, or EMI coupling — can cause a clean digital signal to momentarily violate its valid-high or valid-low threshold. A Schmitt trigger input (with hysteresis) makes the threshold decision more robust: the input must cross a higher threshold to switch from low-to-high than from high-to-low, suppressing noise-induced oscillations around the threshold.
Jitter is the variation in the timing of a periodic signal — the difference between actual and ideal edge positions. In a UART receiver’s oversampling architecture, jitter in the recovered clock (which derives from the MCU’s internal oscillator) accumulates across the bits of a frame. In high-speed PCIe or USB links, jitter budgets are carefully allocated between transmitter, interconnect, and receiver to ensure reliable sampling.
Metastability in a synchroniser cannot be eliminated, only reduced. The mean time between failures (MTBF) of a synchroniser is:
\[ \text{MTBF} = \frac{e^{t_{res}/\tau}}{f_{data} \cdot f_{clk} \cdot T_0}, \]
where \( t_{res} \) is the resolution time (clock period minus other delays), \( \tau \) is the flip-flop’s metastability time constant (typically 0.1–0.3 ns for fast CMOS), \( f_{data} \) is the rate of input events, \( f_{clk} \) is the receiving clock frequency, and \( T_0 \) is a technology-dependent constant. Increasing the number of synchroniser stages increases \( t_{res} \) by one clock period, exponentially improving MTBF.
13.4 Embedded System Verification and Debugging
An embedded system that works in simulation but fails on hardware is a common experience. Debugging strategies include:
JTAG debugging: the ARM Cortex-M’s CoreSight debug infrastructure exposes registers, memory, and breakpoints over a 4- or 2-wire JTAG or SWD (Serial Wire Debug) interface. A debugger such as OpenOCD combined with GDB allows setting breakpoints, stepping through code, and inspecting peripheral registers — essential for hardware bring-up.
Logic analyser: a tool that captures dozens of digital signals simultaneously at high sample rates, timestamped, enabling timing analysis of bus protocols, interrupt latencies, and signal integrity issues. Protocol decoders for UART, SPI, I2C, and CAN make captured traces human-readable.
Oscilloscope with protocol decode: a mixed-signal oscilloscope (MSO) captures both analogue waveforms and digital logic, enabling signal integrity measurements (rise time, overshoot, ringing) and protocol decode in a single instrument.
Printf debugging via SemiHosting or ITM: the ARM Instrumentation Trace Macrocell (ITM) allows printf-style debug output to be streamed to the debugger over the SWD interface with negligible impact on real-time behaviour.
13.5 Learning Outcomes Revisited
By the end of ECE 224, you should be able to:
Identify system complications — noise, jitter, metastability, transmission line reflections — and select appropriate countermeasures: synchronisers, decoupling capacitors, termination resistors, Schmitt trigger inputs.
Compare and critically assess design tradeoffs: polling versus interrupt versus DMA, synchronous versus asynchronous bus, daisy-chain versus independent-request arbitration, UART versus SPI versus I2C. No single answer is universally best; the choice depends on the operating constraints.
Analyse the effects of synchronisation mechanisms: compute interrupt utilisation using \( \sum C_i / T_i \), verify that ISR response time satisfies device latency requirements, and determine whether priority assignment prevents deadline misses.
Design hardware and software components: write device drivers that correctly initialise peripherals, handle interrupts, manage shared state, and interface with application software through clean APIs. Build systems that remain correct under concurrent interrupt events and that recover gracefully from communication errors.
These skills transfer directly to industrial embedded systems work: every product that senses and actuates the physical world faces the same interface challenges, and the principles developed in ECE 224 — timing analysis, synchronisation, protocol design, signal integrity — are enduring foundations of the discipline.