ECE 423: Embedded Computer Systems

Rodolfo Pellizzoni

Estimated study time: 1 hr 5 min

Sources and References

Primary reference — Marilyn Wolf, High Performance Embedded Computing: Applications in Cyber-Physical Systems and Mobile Computing, 2nd ed., Morgan Kaufmann / Elsevier, 2014. (Available as a free O’Reilly e-text through the UW library.)

Supplementary texts — Wayne Wolf, Computers as Components: Principles of Embedded Computing System Design, 4th ed., Morgan Kaufmann, 2016; Giorgio C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, 3rd ed., Springer, 2011; Peter Marwedel, Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems and the Internet of Things, 4th ed., Springer, 2021.

Seminal papers and classic texts — C.L. Liu & James W. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, 20(1):46–61, January 1973 (open access); A. Burns & A. Wellings, Real-Time Systems and Programming Languages, 4th ed., Addison-Wesley, 2009.

Standards — AUTOSAR Classic Platform Specification (public); IEC 61508 Functional Safety of E/E/PE Safety-Related Systems; ISO 26262 Road Vehicles — Functional Safety.


Chapter 1: Introduction and the Embedded Design Space

1.1 What Makes a System “Embedded”?

An embedded computer system is a computing subsystem integrated into a larger device to perform a dedicated function, typically under real-time constraints, power budgets, and physical form-factor restrictions that would be considered irrelevant for a general-purpose desktop machine. The camera module in a smartphone, the engine-control unit (ECU) in an automobile, the insulin pump worn by a diabetic patient, and the flight-management computer aboard a commercial aircraft are all embedded systems. They share a set of distinguishing characteristics that collectively define the design challenge this course addresses.

The first characteristic is functional specialization. Unlike a general-purpose CPU whose instruction set and operating system are designed to run any program, an embedded processor is typically configured to execute a fixed, pre-validated workload. This allows aggressive optimization — a dedicated digital signal processor (DSP) for a hearing aid can execute audio-processing kernels dozens of times faster and at a fraction of the power of a Cortex-A-class application processor, precisely because it needs to do nothing else.

The second characteristic is real-time behaviour. Many embedded systems must respond to external events within a bounded, predetermined deadline. Missing that deadline — even if the correct result is eventually computed — constitutes a system failure. A brake-by-wire controller that delivers the correct braking force fifty milliseconds after a wheel-lock event is detected provides no safety benefit; the vehicle has already fishtailed. This temporal correctness requirement distinguishes embedded design from conventional software engineering, where optimizing average-case latency is usually sufficient.

The third characteristic is resource constraint. Memory, processing cycles, and electrical power are all scarce. The microcontroller at the heart of a wireless sensor node may have 64 KB of flash, 4 KB of SRAM, and a coin-cell battery that must last two years. Every design decision — choice of algorithm, data representation, communication protocol, sleep schedule — must be evaluated against this budget.

The fourth characteristic is physical coupling. An embedded system interacts continuously with the physical world through sensors and actuators. The dynamics of the physical plant — its response time, noise spectrum, and failure modes — impose requirements on the computing subsystem that have no analogue in cloud software.

Together these four characteristics define a design space that is far richer, and far more constrained, than general-purpose computing. ECE 423 provides the analytical and methodological tools to navigate that space systematically.

1.2 The Embedded Design Flow

A structured methodology is essential because the space of possible designs is enormous and the cost of discovering a fundamental architectural flaw late in the development cycle is catastrophic. The standard embedded design flow proceeds through several stages.

Requirements capture translates customer needs and system-level specifications into formal, verifiable requirements. For an automotive ECU this includes functional requirements (what torque the motor must produce), safety requirements (what happens on sensor failure), and non-functional requirements (maximum latency, power envelope, operating temperature range). The most common requirements-capture notation is a combination of natural-language requirements, use-case diagrams, and formal specification languages such as EAST-ADL or the AUTOSAR methodology.

Architecture exploration defines the hardware platform — processor type and count, memory hierarchy, communication buses, peripheral IP — and the partitioning of functions between hardware and software. This stage is inherently exploratory: the designer is searching a large, discontinuous design space for a configuration that simultaneously satisfies all requirements. Systematic design-space exploration (DSE) techniques, including Pareto analysis across objectives such as performance, power, and cost, are used to prune the search.

High-level synthesis (HLS) converts a behavioural description of an algorithm — typically written in C, C++, or SystemC — into a register-transfer level (RTL) hardware implementation. HLS tools schedule operations onto functional units, bind variables to registers, and generate the control finite-state machine (FSM) automatically.

Hardware/software co-design refines the partition decided during architecture exploration, co-simulates the hardware and software components together, and iterates until performance, area, and power targets are jointly met.

Verification and validation confirm that the implementation satisfies its specifications. Verification asks “did we build the system right?” (formal methods, model checking, simulation); validation asks “did we build the right system?” (hardware-in-the-loop testing, system integration test).

Contrast with general-purpose software engineering. In a web application, a failed deployment can be rolled back in seconds. In an embedded automotive system, a firmware update requires over-the-air programming infrastructure, regression testing, regulatory approval, and potentially a vehicle recall if the update introduces a regression. The asymmetric cost of late errors is why front-loaded, model-driven embedded design flows are standard in industry.

1.3 Models of Computation

A model of computation (MoC) is a mathematical framework that defines the components of a system, how they communicate, and what it means for the system to execute. Different MoCs are appropriate for different application domains, and selecting the wrong one leads to a specification that either cannot be efficiently implemented or that obscures critical real-time properties.

The principal MoCs encountered in embedded system design are the following.

Synchronous dataflow (SDF) represents the system as a directed graph of actors connected by FIFO channels. Each actor fires when sufficient tokens are available on its input channels, consuming a fixed number of tokens per firing and producing a fixed number on its outputs. SDF is statically schedulable — a compile-time schedule can be computed that satisfies all buffer bounds — making it attractive for signal-processing applications where throughput is the primary concern.
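
The compile-time schedulability of SDF rests on solving the balance equations implied by the fixed token rates: a repetition vector states how many times each actor must fire per iteration so that every channel returns to its initial token count. A minimal Python sketch, using a hypothetical two-actor chain (actor A produces 2 tokens per firing, B consumes 3):

```python
from fractions import Fraction
from math import lcm

# SDF edges: (producer, tokens produced per firing, consumer, tokens consumed).
edges = [("A", 2, "B", 3)]

# Fix actor A's firing rate at 1 and propagate the balance constraint
# rate(consumer) = rate(producer) * produced / consumed along each edge.
rates = {"A": Fraction(1)}
for prod, p, cons, c in edges:
    rates[cons] = rates[prod] * Fraction(p, c)

# Scale to the smallest integer repetition vector.
scale = lcm(*(r.denominator for r in rates.values()))
repetitions = {a: int(r * scale) for a, r in rates.items()}
print(repetitions)  # {'A': 3, 'B': 2}
```

Per iteration, A fires 3 times (producing 6 tokens) and B fires twice (consuming 6), so every FIFO returns to its initial state and the schedule can repeat forever with bounded buffers.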

Kahn process networks (KPN) relax the fixed-rate constraint: each process reads from blocking FIFO channels but can produce tokens at data-dependent rates. KPN is determinate (the output sequence is independent of scheduling order) and expressive enough to model most streaming computations, but static schedulability analysis is undecidable in general.

Communicating sequential processes (CSP) model the system as concurrent processes that synchronize by message passing over rendezvous channels. CSP is well-suited to reactive systems where the emphasis is on interaction protocols rather than throughput.

Finite state machines (FSM) and hierarchical FSMs model reactive, event-driven behaviour. Statecharts (Harel, 1987) extend flat FSMs with hierarchy, concurrency, and broadcast communication, enabling compact representation of complex reactive controllers. The AUTOSAR software architecture relies heavily on hierarchical state machines for ECU application software.

Timed automata add real-valued clocks to FSMs, enabling formal reasoning about timing. Tools such as UPPAAL allow model checking of real-time properties (“is it guaranteed that the airbag fires within 15 ms of detecting a collision?”) against timed automaton models of both the controller and the physical plant.


Chapter 2: System-on-Chip Design and Embedded Architectures

2.1 The System-on-Chip Paradigm

A system-on-chip (SoC) integrates all the components of a complete electronic system — processors, memories, digital logic, analog interfaces, and I/O controllers — onto a single piece of silicon. A modern SoC such as Apple’s A- or M-series or Qualcomm’s Snapdragon integrates tens of billions of transistors implementing heterogeneous processing elements, multiple cache levels, on-chip DRAM controllers, display engines, image signal processors, neural-processing units, radio baseband processors, and power-management circuits.

The SoC paradigm offers substantial advantages over a board-level multi-chip design: lower power (shorter interconnects reduce capacitive switching energy), higher bandwidth (on-chip buses operate at frequencies unattainable across package pins), reduced latency (no board-level parasitic delays), smaller physical footprint, and lower bill-of-materials cost at volume. The principal disadvantage is inflexibility: once a chip is taped out, its architecture is fixed. Errors that could be corrected in a board-level design by replacing a chip require a new mask set, costing millions of dollars and months of schedule.

IP Core. An intellectual property (IP) core is a pre-designed, pre-verified hardware module that can be licensed and integrated into a custom SoC. Hard cores are physical layout implementations optimized for a specific process node; they are area- and power-efficient but cannot be ported to a different node without re-implementing them. Soft cores are RTL descriptions that can be synthesized for any compatible process or mapped onto an FPGA; they are portable but less efficient than hard cores. Firm cores are netlists that occupy a middle ground: partially placed but technology-independent.

The challenge of SoC design is the integration of heterogeneous IP blocks from multiple sources, verifying that they interoperate correctly, and ensuring that the resulting system meets its timing, power, and functional requirements. This is substantially an engineering-management and methodology problem as much as a technical one.

2.2 Processor Architectures for Embedded Systems

2.2.1 General-Purpose Processors and Microcontrollers

General-purpose embedded processors (ARM Cortex-A series, RISC-V application-class cores) provide rich instruction sets, memory management units, operating system support, and out-of-order execution for maximum average-case performance. They are appropriate for applications such as smartphone application processors, automotive infotainment systems, and industrial HMI panels.

Microcontrollers (ARM Cortex-M series, AVR, MSP430) integrate processor core, flash memory, SRAM, and peripherals in a single low-power package optimized for deeply embedded, often battery-operated applications. The Cortex-M4 core, for instance, adds a hardware floating-point unit and DSP-oriented multiply-accumulate instructions to the Cortex-M3 baseline, enabling motor-control and sensor-fusion algorithms without an external DSP. Microcontrollers typically run without an MMU, executing code directly from on-chip flash in a bare-metal or RTOS environment.

2.2.2 Digital Signal Processors

A digital signal processor (DSP) is a microprocessor architecture specifically optimized for the arithmetic kernels that dominate signal processing: multiply-accumulate (MAC) operations, bit-reversal addressing for FFT butterfly stages, circular buffers for FIR filter delays, and packed SIMD instructions for processing multiple 16-bit samples per clock. Classic DSP architectural features include:

  • Harvard architecture: separate instruction and data memory buses, allowing simultaneous instruction fetch and data access. Many DSPs extend this to dual data buses, enabling two coefficient-data multiplications per cycle.
  • Single-cycle MAC: the multiply-accumulate operation \( A \leftarrow A + x \cdot y \) completes in one clock cycle, supported by a wide (typically 40-bit) accumulator whose guard bits prevent overflow on intermediate results.
  • Zero-overhead loops: a dedicated loop counter and repeat instruction eliminate the branch and decrement instructions at the bottom of inner loops, removing per-iteration branch overhead.
  • Saturation arithmetic: when a result overflows the representable range, it saturates to the maximum (or minimum) representable value rather than wrapping around, preventing the catastrophic quality degradation that two’s-complement overflow introduces in audio and image signals.
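
The contrast between two’s-complement wraparound and saturation can be sketched in a few lines of Python (the 16-bit sample width and 40-bit accumulator mirror the description above; the numeric values are illustrative):

```python
# 40-bit signed accumulator limits, as in a classic DSP MAC datapath.
ACC_MAX = (1 << 39) - 1
ACC_MIN = -(1 << 39)

def sat16(v):
    """Saturate a value to the signed 16-bit range instead of wrapping."""
    return max(-32768, min(32767, v))

def mac(acc, x, y):
    """acc <- acc + x*y, with saturation at the 40-bit accumulator limits."""
    acc += x * y
    return max(ACC_MIN, min(ACC_MAX, acc))

# Wraparound vs. saturation for an overflowing 16-bit sum:
raw = 30000 + 10000                       # 40000 overflows int16
wrapped = (raw + 0x8000) % 0x10000 - 0x8000  # two's-complement wraparound
print(wrapped, sat16(raw))                # -25536 32767
```

The wrapped result flips sign — an audible click in an audio sample — while the saturated result merely clips, which is why DSPs saturate by default in signal paths.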

Modern application processors increasingly incorporate DSP-style extensions (ARM NEON, RISC-V “V” vector extension) rather than a separate DSP, reflecting the trend toward heterogeneous SoCs where a single integrated circuit provides both application-class and signal-processing capabilities.

2.2.3 Very Long Instruction Word (VLIW) Architectures

VLIW processors expose instruction-level parallelism (ILP) to the compiler rather than discovering it at runtime through dynamic scheduling hardware. A VLIW instruction word is a wide bundle containing fields for multiple operations — for example, two integer ALU operations, one memory load, one memory store, and one branch, all issued in the same clock cycle. Because the scheduling is done at compile time, the processor requires no out-of-order execution logic, no reservation stations, and no branch prediction — the hardware is simple and power-efficient.

The Texas Instruments C6x family, widely used in baseband DSP and infrastructure processing, is a VLIW architecture with eight functional units per core, issuing up to eight 32-bit operations per clock. The compiler is responsible for filling all functional-unit slots; slots left empty due to data dependencies are NOPs.

The fundamental weakness of VLIW is binary compatibility: a compiled binary encodes assumptions about the functional unit count and latencies of the specific processor generation. Changing the hardware requires recompilation. Exposing NOP slots in binaries also wastes code memory. Predicated execution and software pipelining (loop unrolling combined with modulo scheduling) are the principal compiler techniques used to fill VLIW instruction words densely.

2.2.4 Superscalar Processors

A superscalar processor issues multiple instructions per clock cycle using hardware that dynamically identifies independent instructions in the instruction stream. Tomasulo’s algorithm (originally implemented in the IBM System/360 Model 91) is the canonical dynamic scheduling mechanism: a reservation station holds dispatched instructions until their operands arrive (via a common data bus), at which point the instruction executes on the available functional unit. A reorder buffer (ROB) ensures that instructions commit in program order, maintaining precise exception semantics.

For embedded systems, the trade-off with superscalar is unfavourable in many contexts: the out-of-order logic consumes significant area and power, and the dynamic behaviour makes worst-case execution time (WCET) analysis extremely difficult — the execution time of a code segment depends not only on its own instructions but on the state of the reservation stations, ROB, and cache at the time it executes. Safety-critical embedded systems therefore often favour simpler in-order pipelines even at the cost of average-case performance.

2.3 Memory Hierarchy in Embedded Systems

2.3.1 Cache Design and Coherency

Cache memory reduces the effective latency of DRAM access by exploiting temporal locality (recently accessed data will likely be accessed again soon) and spatial locality (data near recently accessed addresses will likely be accessed soon). A set-associative cache organizes its lines into \( S \) sets of \( E \) ways each; a block of \( B \) bytes occupies one way of one set. Given a physical address \( a \), the set index occupies bits \( [\log_2 B + \log_2 S - 1 : \log_2 B] \) and the tag occupies the remaining high-order bits.
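
The address decomposition above can be sketched directly in Python; the block size and set count below are illustrative parameters, not those of any particular cache:

```python
def decompose(addr, B=32, S=128):
    """Split a physical address into (tag, set index, block offset).

    B = block size in bytes, S = number of sets; both powers of two.
    """
    b = B.bit_length() - 1          # log2(B) offset bits
    s = S.bit_length() - 1          # log2(S) index bits
    offset = addr & (B - 1)         # low-order log2(B) bits
    index = (addr >> b) & (S - 1)   # next log2(S) bits select the set
    tag = addr >> (b + s)           # remaining high-order bits
    return tag, index, offset

print(decompose(0x0001A2C4))  # (26, 22, 4)
```

On a lookup, the index selects one set and the tag is compared against all \( E \) ways of that set in parallel; the offset selects the byte within the matching block.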

For embedded real-time systems, cache behaviour introduces timing non-determinism. A cache miss on a critical path can stall the processor for tens to hundreds of cycles waiting for DRAM. WCET analysis must account for worst-case cache miss patterns, which requires sophisticated abstract interpretation techniques (notably the cache analysis of Theiling, Ferdinand, and Wilhelm) or measurement-based methods with appropriate safety margins.

Cache coherency becomes relevant in multicore embedded SoCs. When core 0 writes to a cache line that core 1 also has cached, some mechanism must ensure that core 1 does not read stale data. Snooping-based protocols (MESI — Modified, Exclusive, Shared, Invalid) work well at the board level but consume significant bus bandwidth as core count grows. Directory-based protocols maintain a centralized or distributed directory tracking which cores hold each cache line; they scale better to larger core counts at the cost of directory lookup latency.

2.3.2 Scratchpad SRAM

A scratchpad memory (SPM) is a software-managed SRAM region whose address space is directly visible to the processor, with no hardware tag logic and no coherency traffic. Because the programmer (or compiler) explicitly places data and code into the scratchpad, access latency is perfectly predictable: every access to the SPM takes exactly the same number of cycles. This is a major advantage for WCET analysis.

Optimal scratchpad allocation is a constrained optimization problem: given a set of data objects with known access frequencies and sizes, and a scratchpad of capacity \( C \), assign objects to scratchpad or DRAM to minimize total expected memory access cost. The problem is equivalent to the 0/1 knapsack problem and thus NP-hard in general, but practical instances are solvable by integer linear programming or by dynamic-programming pseudo-polynomial algorithms.
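
The pseudo-polynomial dynamic program can be sketched in a few lines of Python; the object sizes and cycle savings below are hypothetical:

```python
# 0/1 knapsack DP for scratchpad allocation.
# Each object: (size in bytes, total cycles saved if placed in the SPM).
objects = [(512, 900), (1024, 1400), (256, 500), (768, 700)]
capacity = 2048  # scratchpad capacity C in bytes

# dp[c] = maximum cycles saved using at most c bytes of scratchpad.
dp = [0] * (capacity + 1)
for size, gain in objects:
    # Reverse scan so each object is placed at most once (0/1, not unbounded).
    for c in range(capacity, size - 1, -1):
        dp[c] = max(dp[c], dp[c - size] + gain)

print(dp[capacity])  # 2800
```

Here the best choice is the 512-, 1024-, and 256-byte objects (1792 bytes, 2800 cycles saved); swapping any of them for the 768-byte object saves fewer cycles.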

2.3.3 Memory Protection and Virtual Memory

An MPU (memory protection unit) partitions physical memory into regions with individually configured read/write/execute permissions and cacheability attributes. An MPU without address translation is common in Cortex-M processors; it allows the RTOS to enforce isolation between tasks without the overhead of virtual-to-physical address mapping.

An MMU (memory management unit) provides full virtual-to-physical address translation via a page table, enabling virtual address spaces, demand paging, and user-space isolation. The translation lookaside buffer (TLB) caches recent page-table lookups; a TLB miss triggers a page-table walk that may cost tens of cycles. In hard real-time systems, TLB misses introduce timing non-determinism and must be accounted for in WCET analysis or avoided by locking critical pages in the TLB.


Chapter 3: High-Level Synthesis

3.1 The HLS Problem

High-level synthesis (HLS) takes as input a behavioural description of a computation — typically an untimed or partially timed algorithmic description in C, C++, or a synthesizable subset of SystemC — and produces a register-transfer level (RTL) implementation ready for logic synthesis. The goal is to automate the manual, error-prone translation from algorithm to hardware that previously required weeks of engineering effort by digital design experts.

The core HLS problem decomposes into three tightly coupled subproblems.

Scheduling assigns each operation in the design’s data-flow graph (DFG) to a control step (clock cycle). The schedule must respect data dependencies — if operation \( o_j \) consumes the output of \( o_i \), then \( o_j \) must be scheduled no earlier than \( o_i \) plus the latency of \( o_i \). The schedule also respects resource constraints: if only one multiplier is available and two multiplications must occur, they must be in different control steps.

Binding maps each scheduled operation to a specific hardware resource (functional unit) and maps each variable to a specific storage element (register or memory port). Operations sharing a functional unit must be time-multiplexed; operations sharing a register must have non-overlapping lifetimes.

Control synthesis generates the finite state machine (FSM) that sequences the control steps, asserts data-path control signals, and manages external interfaces.

3.2 Scheduling Algorithms

The fundamental scheduling formulation is as follows. Let the DFG be a directed acyclic graph \( G = (V, E) \) where each vertex \( v \in V \) represents an operation with latency \( l(v) \) (measured in clock cycles) and each edge \( (u, v) \in E \) represents a data dependency. The schedule assigns each operation a start control step \( t(v) \). The constraints are:

For all edges \( (u, v) \in E \):

\[ t(v) \geq t(u) + l(u). \]

Subject to this, the scheduler optimizes one of two objectives: minimize schedule length subject to resource constraints (resource-constrained scheduling), or minimize resource usage subject to a latency constraint (latency-constrained scheduling).

ASAP (As Soon As Possible) scheduling assigns to each operation the earliest control step consistent with its data dependencies. Starting from operations with no inputs, the ASAP schedule is computed by a single topological traversal of the DFG:

for each v in topological order:
    t_asap(v) = max over predecessors u of (t_asap(u) + l(u))

ASAP gives the minimum possible schedule length but ignores resource constraints entirely; the resulting schedule may require an unbounded number of concurrent functional units.

ALAP (As Late As Possible) scheduling dually assigns each operation to the latest control step that does not push any successor beyond the schedule deadline \( L \):

for each v in reverse topological order:
    t_alap(v) = min over successors w of t_alap(w) - l(v)
    (for an operation with no successors, t_alap(v) = L - l(v))

The difference \( t_{\text{alap}}(v) - t_{\text{asap}}(v) \) is the mobility of operation \( v \) — how many control steps it can be delayed without lengthening the critical path. Operations on the critical path have zero mobility.
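
The ASAP/ALAP traversals and the resulting mobilities can be sketched in Python; the small DFG below is a hypothetical example with unit latencies:

```python
# DFG as a successor map: m1 -> a1 -> a2, with m2 feeding a2 directly.
succ = {"m1": ["a1"], "m2": ["a2"], "a1": ["a2"], "a2": []}
pred = {v: [u for u in succ if v in succ[u]] for v in succ}
lat = {v: 1 for v in succ}                      # unit latencies
order = ["m1", "m2", "a1", "a2"]                # a topological order

# ASAP: earliest start consistent with data dependencies.
asap = {}
for v in order:
    asap[v] = max((asap[u] + lat[u] for u in pred[v]), default=0)

L = max(asap[v] + lat[v] for v in succ)         # minimum schedule length

# ALAP: latest start that still meets the deadline L.
alap = {}
for v in reversed(order):
    alap[v] = min((alap[w] for w in succ[v]), default=L) - lat[v]

mobility = {v: alap[v] - asap[v] for v in order}
print(L, mobility)  # 3 {'m1': 0, 'm2': 1, 'a1': 0, 'a2': 0}
```

The chain m1 → a1 → a2 is the critical path (zero mobility), while m2 can slide one control step later without lengthening the schedule.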

List scheduling is a greedy heuristic that processes operations in priority order, assigning each to the earliest feasible control step given current resource usage. The priority can be ASAP time, urgency (slack-based), or a user-specified weight. List scheduling runs in polynomial time and gives good results in practice, though it does not guarantee optimality.

For exact optimal solutions under resource constraints, HLS tools formulate scheduling as an integer linear program (ILP), which is solvable in exponential worst-case time but manageable for the DFG sizes encountered in typical DSP kernels.

3.3 Resource Binding and Allocation

Given a schedule, binding assigns operations to hardware resources. Two operations can share a functional unit if they are not concurrently active (i.e., their control steps do not overlap). This is equivalent to a graph-coloring problem: construct a conflict graph where two operations are connected by an edge if they are concurrent; the minimum number of colors needed equals the minimum number of functional units of that type.

Left-edge algorithm: sort operations by start time; greedily assign each to the first resource that has become free again. When every lifetime is a single contiguous interval, the conflict graph is an interval graph, and this greedy scan is optimal — it uses the minimum possible number of resources.

Register binding is dual to functional-unit binding: two variables can share a register if their live ranges do not overlap. The lifetime of a variable \( x \) computed in control step \( s \) and last consumed in control step \( e \) is the interval \( [s, e] \). Register sharing is again a graph-coloring problem on the interval-overlap graph.
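
The left-edge scan applied to register binding can be sketched in Python; the variable lifetimes below are hypothetical closed intervals in control steps:

```python
def left_edge(lifetimes):
    """Bind variables (closed lifetime intervals [s, e]) to registers so that
    overlapping lifetimes never share a register."""
    regs = []      # regs[r] = end step of the last variable bound to register r
    binding = {}
    for var, (s, e) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, last_end in enumerate(regs):
            if last_end < s:          # register r is free again before var is born
                regs[r] = e
                binding[var] = r
                break
        else:
            regs.append(e)            # no free register: open a new one
            binding[var] = len(regs) - 1
    return binding, len(regs)

# Hypothetical lifetimes from a scheduled data path:
example_lifetimes = {"v1": (1, 3), "v2": (2, 5), "v3": (4, 6), "v4": (6, 8)}
print(left_edge(example_lifetimes))  # ({'v1': 0, 'v2': 1, 'v3': 0, 'v4': 1}, 2)
```

Two registers suffice here even though there are four variables, because v3 reuses v1’s register and v4 reuses v2’s once their lifetimes end.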

Example — FIR filter kernel. Consider a 4-tap direct-form FIR filter: \( y[n] = \sum_{k=0}^{3} h[k] \cdot x[n-k] \). The DFG has four multiply operations (one per tap) and three add operations arranged in a tree. Assuming unit latencies, with one multiplier and one adder the shortest resource-constrained schedule takes 6 control steps: multiplies in steps 1–4, with the adds overlapped into steps 3, 5, and 6. With two multipliers and one adder the length drops to 4 steps. An HLS tool exploring the resource-latency design space would present both configurations (and others) on a Pareto frontier of latency versus hardware cost, allowing the designer to choose based on the available area budget.
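
A minimal greedy list scheduler (unit latencies; a sketch, not a production tool) can explore this resource-latency trade-off on the FIR DFG. Note that adds can overlap with later multiplies, so the schedule is shorter than a fully serialized estimate:

```python
def list_schedule(n_mults, n_adds):
    # 4-tap FIR DFG: op -> predecessor ops (unit latency each).
    ops = {"m0": [], "m1": [], "m2": [], "m3": [],
           "a0": ["m0", "m1"], "a1": ["m2", "m3"], "a2": ["a0", "a1"]}
    limit = {"m": n_mults, "a": n_adds}   # resource limits by op type
    done, t = {}, 0                        # done[op] = control step it runs in
    while len(done) < len(ops):
        t += 1
        used = {"m": 0, "a": 0}
        for op, deps in ops.items():       # greedy issue in listed (ASAP) order
            # Ready only if every predecessor finished in an earlier step.
            ready = all(done.get(d, t) < t for d in deps)
            if op not in done and ready and used[op[0]] < limit[op[0]]:
                done[op] = t
                used[op[0]] += 1
    return max(done.values())

print(list_schedule(1, 1), list_schedule(2, 1))  # 6 4
```

With one multiplier and one adder the adds fill steps 3, 5, and 6 between multiplies; doubling the multipliers shortens the schedule to 4 steps.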

Chapter 4: Real-Time Scheduling Theory

4.1 Task Models and Real-Time Definitions

A real-time system is one in which the correctness of a computation depends on both the logical result and the time at which the result is produced. Hard real-time systems tolerate no deadline misses — a single violation is a system failure. Soft real-time systems tolerate occasional misses with some degradation in service quality (a missed video frame, for instance). Firm real-time systems tolerate misses but derive no value from a late result.

The standard periodic task model, formalized by Liu and Layland (1973), describes the real-time workload as a set of \( n \) tasks \( \tau = \{\tau_1, \tau_2, \ldots, \tau_n\} \). Each task \( \tau_i \) is characterized by:

  • Period \( T_i \): the task releases a new job every \( T_i \) time units.
  • Worst-case execution time \( C_i \): the maximum time any job of \( \tau_i \) requires on the processor.
  • Relative deadline \( D_i \): each job must complete within \( D_i \) time units of its release. In the implicit-deadline model, \( D_i = T_i \).

The utilization of task \( \tau_i \) is \( U_i = C_i / T_i \), and the total system utilization is

\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i}. \]

A necessary condition for schedulability under any work-conserving uniprocessor algorithm is \( U \leq 1 \).

Schedulability. A task set \( \tau \) is schedulable under a given algorithm if every job of every task meets its deadline for all possible release-time patterns. Proving schedulability requires worst-case analysis; it is not sufficient to simulate a few scenarios.

4.2 Rate-Monotonic Scheduling

Rate-Monotonic Scheduling (RMS), introduced by Liu and Layland, assigns fixed priorities to tasks based on their periods: the shorter the period, the higher the priority. At runtime, the processor always executes the highest-priority ready task (preemptive priority scheduling). RMS is provably optimal among all fixed-priority preemptive algorithms for implicit-deadline periodic task sets.

The famous Liu-Layland schedulability bound states that a set of \( n \) implicit-deadline periodic tasks is schedulable under RMS if

\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \leq n \left(2^{1/n} - 1\right). \]

As \( n \to \infty \), the right-hand side approaches \( \ln 2 \approx 0.693 \). This means that any task set with total utilization at most \( \ln 2 \) is guaranteed to be schedulable under RMS, regardless of the number of tasks or their individual periods.

Worked example — RMS schedulability. Consider three tasks: \( \tau_1 = (C_1=1, T_1=4) \), \( \tau_2 = (C_2=2, T_2=6) \), \( \tau_3 = (C_3=3, T_3=12) \). Total utilization: \[ U = \frac{1}{4} + \frac{2}{6} + \frac{3}{12} = 0.25 + 0.333 + 0.25 = 0.833. \]

The Liu-Layland bound for \( n = 3 \) is \( 3(2^{1/3}-1) \approx 0.780 \). Since \( U = 0.833 > 0.780 \), the bound is not satisfied. However, the Liu-Layland test is sufficient but not necessary: the task set may still be schedulable. To confirm, we apply the exact response-time analysis.

4.2.1 Exact Response-Time Analysis

The response time of task \( \tau_i \) (with tasks ordered by priority \( \tau_1 > \tau_2 > \ldots > \tau_n \)) is the smallest positive solution \( R_i \) of the recurrence:

\[ R_i^{(k+1)} = C_i + \sum_{j < i} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j, \]

starting from \( R_i^{(0)} = C_i \) and iterating until convergence. The term \( \lceil R_i^{(k)} / T_j \rceil \) counts the number of preemptions of \( \tau_i \) by the higher-priority task \( \tau_j \) within a window of length \( R_i^{(k)} \). Task \( \tau_i \) is schedulable if and only if \( R_i \leq D_i \).

Applying this to the example above: \( R_1 = 1 \leq 4 \) (trivially schedulable). For \( \tau_2 \): starting with \( R_2^{(0)} = 2 \), then \( R_2^{(1)} = 2 + \lceil 2/4 \rceil \cdot 1 = 3 \), then \( R_2^{(2)} = 2 + \lceil 3/4 \rceil \cdot 1 = 3 \) (converged). \( R_2 = 3 \leq 6 \). For \( \tau_3 \): \( R_3^{(0)} = 3 \), \( R_3^{(1)} = 3 + \lceil 3/4 \rceil \cdot 1 + \lceil 3/6 \rceil \cdot 2 = 3+1+2 = 6 \), \( R_3^{(2)} = 3 + \lceil 6/4 \rceil \cdot 1 + \lceil 6/6 \rceil \cdot 2 = 3+2+2=7 \), \( R_3^{(3)} = 3 + \lceil 7/4 \rceil \cdot 1 + \lceil 7/6 \rceil \cdot 2 = 3+2+4=9 \), \( R_3^{(4)} = 3 + 3 + 4 = 10 \), \( R_3^{(5)} = 3 + 3 + 4 = 10 \) (converged). \( R_3 = 10 \leq 12 \). The task set is indeed schedulable despite exceeding the Liu-Layland utilization bound.
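
The fixed-point iteration is mechanical and easy to automate. A minimal Python sketch (tasks listed highest-priority first; the loop may fail to terminate for overloaded task sets, so in practice one also caps \( R_i \) at \( D_i \)):

```python
from math import ceil

def response_times(tasks):
    """Exact response-time analysis for fixed-priority preemptive scheduling.

    tasks = [(C_i, T_i)] sorted by decreasing priority.
    """
    R = []
    for i, (C, T) in enumerate(tasks):
        r = C                                  # R_i^(0) = C_i
        while True:
            # Interference from each higher-priority task tau_j (j < i).
            nxt = C + sum(ceil(r / Tj) * Cj for Cj, Tj in tasks[:i])
            if nxt == r:                       # converged
                break
            r = nxt
        R.append(r)
    return R

# The worked example: tau1=(1,4), tau2=(2,6), tau3=(3,12).
print(response_times([(1, 4), (2, 6), (3, 12)]))  # [1, 3, 10]
```

Each \( R_i \) is then compared against \( D_i \); here all three are within their (implicit) deadlines, confirming the hand calculation.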

4.3 Earliest Deadline First Scheduling

Earliest Deadline First (EDF) is a dynamic-priority algorithm: at every scheduling decision point, the processor is assigned to the ready task with the earliest absolute deadline. Unlike RMS, EDF can achieve up to 100% processor utilization while guaranteeing that all tasks meet their deadlines. Formally:

EDF Schedulability Theorem (Liu and Layland 1973). A set of \( n \) independent, preemptive, periodic tasks with implicit deadlines is schedulable under EDF if and only if \[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \leq 1. \]

This is a tight necessary-and-sufficient condition: EDF can schedule any feasible task set. The optimality of EDF among all preemptive uniprocessor algorithms follows from an exchange argument: given any feasible schedule, swapping two adjacent time slices that violate the earliest-deadline-first order yields another feasible schedule; repeating the swap transforms it into an EDF schedule without introducing deadline misses, so if any algorithm can meet all deadlines, EDF can.
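
The utilization test can be cross-checked by simulating EDF over one hyperperiod. A minimal discrete-time sketch (unit time slices, implicit deadlines; an illustration, not a kernel scheduler):

```python
from math import lcm

def edf_feasible(tasks):
    """Simulate EDF over one hyperperiod; tasks = [(C_i, T_i)], D_i = T_i."""
    H = lcm(*(T for _, T in tasks))
    jobs = []                                 # each job: [abs_deadline, work left]
    for t in range(H):
        for C, T in tasks:
            if t % T == 0:
                jobs.append([t + T, C])       # release a new job
        if any(d <= t for d, w in jobs):
            return False                      # deadline reached with work left
        jobs.sort()                           # earliest absolute deadline first
        if jobs:
            jobs[0][1] -= 1                   # run the earliest-deadline job
        jobs = [j for j in jobs if j[1] > 0]  # drop completed jobs
    return not jobs                           # any leftover work is a miss

print(edf_feasible([(1, 4), (2, 6), (3, 12)]))  # True (U = 0.833 <= 1)
```

The worked task set of Section 4.2 passes, as the utilization bound predicts; an overloaded set such as \( \{(3,4), (2,6)\} \) with \( U \approx 1.08 \) fails.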

The practical disadvantage of EDF over RMS is implementation complexity. EDF requires a dynamic priority queue that is updated whenever a new job is released, whereas RMS requires only a static priority comparison. In legacy RTOS kernels without EDF support, practitioners use RMS and accept the utilization penalty.

4.4 Deadline Monotonic Scheduling

When task deadlines are not equal to periods (\( D_i \leq T_i \), constrained-deadline model), the optimal fixed-priority algorithm is Deadline Monotonic (DM) scheduling: priorities are assigned in order of increasing relative deadline. The schedulability test remains the response-time recurrence of Section 4.2.1, substituting \( D_i \) for the deadline bound.

Observation. DM reduces to RMS when \( D_i = T_i \) for all tasks, since shorter periods correlate with shorter deadlines in that case. When deadlines are shorter than periods, DM can be strictly better than RMS.

4.5 Priority Inversion and the Priority Inheritance Protocol

Priority inversion occurs when a high-priority task is blocked waiting for a resource held by a low-priority task, which in turn is preempted by a medium-priority task. The net effect is that the medium-priority task effectively runs ahead of the high-priority task, violating the priority ordering and potentially causing a deadline miss. The most famous instance of priority inversion in a production system is the Mars Pathfinder incident (1997), in which a high-priority bus-management task was repeatedly blocked by a low-priority meteorological data task holding a mutex on a shared information bus; the resulting watchdog-timer resets were cured by remotely enabling priority inheritance on the mutex.

Formally, suppose task \( \tau_H \) (high priority) needs a semaphore \( S \) held by \( \tau_L \) (low priority). Before \( \tau_L \) can release \( S \), it is preempted by \( \tau_M \) (medium priority, \( p_L < p_M < p_H \)). Task \( \tau_H \) is blocked not only for the duration of \( \tau_L \)’s critical section but for the entire duration of \( \tau_M \)’s execution — an unbounded priority inversion.

Priority Inheritance Protocol (PIP), proposed by Sha, Rajkumar, and Lehoczky (1990), resolves this: when a task \( \tau_H \) is blocked on a semaphore held by \( \tau_L \), task \( \tau_L \) temporarily inherits the priority of \( \tau_H \). This ensures that \( \tau_L \) is not preempted by \( \tau_M \), and \( \tau_H \)’s blocking time is bounded by the length of \( \tau_L \)’s critical section.

Under PIP, task \( \tau_i \) can be blocked at most once by each lower-priority task (and at most once per semaphore), so a safe bound on its worst-case blocking time is:

\[ B_i = \sum_{j > i} \max_{\text{resources shared with some } k \leq i} C_{j,\text{cs}}, \]

where \( C_{j,\text{cs}} \) is the worst-case length of a critical section of lower-priority task \( \tau_j \) guarding a resource also used by a higher-priority task. (Under PCP, the sum collapses to a single maximum, since total blocking is limited to one critical section.) This blocking term \( B_i \) is added to the response-time recurrence:

\[ R_i^{(k+1)} = C_i + B_i + \sum_{j < i} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j. \]

Priority Ceiling Protocol (PCP) goes further: each semaphore is assigned a priority ceiling equal to the highest priority of any task that may lock it. A task can acquire a semaphore only if its priority is strictly higher than the priority ceiling of all currently locked semaphores (held by other tasks). PCP prevents deadlock and bounds blocking to at most one critical section per task, regardless of the number of semaphores.

4.6 Sporadic and Aperiodic Tasks

Real systems invariably include tasks triggered by external events (interrupt arrival, sensor reading, communication message) rather than periodic timers. A sporadic task \( \tau_s \) is characterized by a minimum inter-arrival time \( T_s^{\min} \) and worst-case execution time \( C_s \); it models an event source with a known maximum rate. An aperiodic task has no guaranteed minimum inter-arrival time and is typically treated as best-effort background work.

Sporadic servers (Sprunt et al., 1989) and the total bandwidth server (TBS) of Spuri and Buttazzo allow aperiodic tasks to be served with bounded interference to the periodic task set. The key idea is to reserve a fixed bandwidth for the aperiodic workload: a budget \( Q_s \) replenished every period \( P_s \), giving server utilization \( U_s = Q_s / P_s \), so that the server participates in the standard RMS or EDF schedulability analysis like an ordinary periodic task. (A sporadic task itself can simply be analyzed as a periodic task with period \( T_s^{\min} \) and utilization \( C_s / T_s^{\min} \).)
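The TBS assigns each aperiodic job \( k \), arriving at \( r_k \) with execution time \( C_k \), the EDF deadline \( d_k = \max(r_k, d_{k-1}) + C_k / U_s \), where \( U_s \) is the server bandwidth; a minimal sketch (type and function names are illustrative):

```c
/* Total Bandwidth Server (Spuri & Buttazzo).  As long as the periodic
   utilization plus U_s does not exceed 1, serving aperiodic jobs under
   EDF with these deadlines preserves all periodic deadlines. */
typedef struct {
    double Us;        /* server utilization (reserved bandwidth) */
    double last_d;    /* deadline of the previous job, d_{k-1} */
} tbs_t;

void tbs_init(tbs_t *s, double Us) { s->Us = Us; s->last_d = 0.0; }

/* Deadline for a job arriving at r_k with execution time C_k:
   d_k = max(r_k, d_{k-1}) + C_k / U_s. */
double tbs_assign_deadline(tbs_t *s, double r_k, double C_k) {
    double start = (r_k > s->last_d) ? r_k : s->last_d;
    s->last_d = start + C_k / s->Us;
    return s->last_d;
}
```

With \( U_s = 0.25 \), a job arriving at \( t = 0 \) needing one time unit receives deadline 4; the server "charges" each job for its full bandwidth share.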


Chapter 5: Inter-Task Communication and Synchronization

5.1 Shared Memory Communication

The simplest mechanism for inter-task communication in an RTOS is shared memory: two tasks exchange data by reading and writing a common memory region. Shared memory offers the lowest latency — a producer's write becomes visible to the consumer without any data copying (immediately on a cache-coherent system, or after an explicit flush on a non-coherent one) — but requires explicit synchronization to prevent race conditions.

A race condition arises when the correctness of the result depends on the relative order of operations by concurrent tasks, and the language or runtime provides no guarantee about that order. In embedded C without synchronization, a structure update by one task may be partially visible to another task that reads it in the middle of the update, because the compiler may decompose a multi-word structure copy into multiple loads and stores that the scheduler can interleave.

The classical solutions are:

Disable/enable interrupts: on a uniprocessor, disabling interrupts creates an atomic section. This is the lowest-overhead mechanism, appropriate for very short critical sections, but it increases interrupt latency and is not applicable to multicore systems.

Semaphores: a semaphore \( S \) is an integer variable with two atomic operations, \( \text{wait}(S) \) and \( \text{signal}(S) \). A binary semaphore (mutex) used as a mutual exclusion lock ensures that at most one task executes in the critical section at any time. The kernel’s \( \text{wait}(S) \) operation blocks the calling task if \( S = 0 \), preventing busy-waiting.

Condition variables: a condition variable allows a task to block until a predicate becomes true, releasing the associated mutex atomically during the wait. The standard idiom is:

pthread_mutex_lock(&mutex);
while (!condition_satisfied()) {
    pthread_cond_wait(&cond, &mutex);  // atomically releases mutex and blocks
}
/* operate on shared data */
pthread_mutex_unlock(&mutex);

The while loop (rather than if) guards against spurious wakeups.

5.2 Message Passing

Message passing decouples producer and consumer: the producer writes a message to a queue (or channel) and the consumer reads from it, potentially at a different rate. Message-passing APIs in RTOS kernels (FreeRTOS queues, POSIX message queues, VxWorks message queues) typically support:

  • Blocking send: if the queue is full, the sender blocks until space is available.
  • Blocking receive: if the queue is empty, the receiver blocks until a message arrives.
  • Non-blocking (try) variants: return immediately with an error code if the operation cannot be satisfied.
  • Timeout variants: block for at most a specified duration.

Message passing has the advantage of naturally providing synchronization (the consumer cannot read a message that has not yet been sent) and of enabling easier formal reasoning about communication. It has higher overhead than shared memory due to data copying.
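All four API variants reduce to a ring buffer plus a blocking policy; a sketch of the non-blocking core (the names mirror no particular kernel's API, and messages are fixed-size for simplicity):

```c
#include <string.h>

/* Fixed-capacity message queue with the "try" semantics listed above:
   try_send fails when the queue is full, try_receive fails when it is
   empty.  A real RTOS queue layers blocking and timeout variants on top
   of the same buffer, putting the caller to sleep instead of failing. */
#define MSG_SIZE  16
#define QUEUE_LEN  8

typedef struct {
    char buf[QUEUE_LEN][MSG_SIZE];
    int head, tail, count;
} msgq_t;

void msgq_init(msgq_t *q) { q->head = q->tail = q->count = 0; }

int msgq_try_send(msgq_t *q, const char *msg) {
    if (q->count == QUEUE_LEN) return -1;        /* full: would block */
    memcpy(q->buf[q->tail], msg, MSG_SIZE);      /* copy = message-passing cost */
    q->tail = (q->tail + 1) % QUEUE_LEN;
    q->count++;
    return 0;
}

int msgq_try_receive(msgq_t *q, char *out) {
    if (q->count == 0) return -1;                /* empty: would block */
    memcpy(out, q->buf[q->head], MSG_SIZE);
    q->head = (q->head + 1) % QUEUE_LEN;
    q->count--;
    return 0;
}
```

The two `memcpy` calls make the copying overhead mentioned above explicit: each message crosses the queue twice.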

5.3 The Monitor Abstraction

A monitor (Hoare, 1974; Brinch Hansen, 1975) encapsulates shared data with the procedures that operate on it; the runtime guarantees mutual exclusion — only one task executes inside the monitor at a time. Condition variables within the monitor allow tasks to wait for internal conditions.

In the Hoare semantics, a signalling task yields the processor to the waiting task immediately (signal-and-wait). In the Mesa semantics (used in Java and POSIX), the signalling task continues and the waiting task is placed in a ready queue to be rescheduled later; the waiting task must re-check its predicate upon waking (hence the while loop idiom above).

For real-time systems, the monitor abstraction maps naturally onto the priority ceiling protocol: the monitor’s implicit mutex receives a priority ceiling equal to the highest priority of any task that calls any of the monitor’s procedures.


Chapter 6: Hardware/Software Co-design and Design Space Exploration

6.1 The Partitioning Problem

Hardware/software co-design addresses the following fundamental question: given a system specification, which computations should be implemented in hardware (dedicated silicon, FPGA fabric, or configurable logic) and which should be executed in software on a programmable processor?

Hardware implementations offer high throughput and low latency through spatial parallelism — many operations execute simultaneously in dedicated logic — but consume silicon area, are inflexible once fabricated, and have long development cycles. Software implementations are flexible and quick to modify, but execute operations sequentially (or with limited parallelism) and may not meet throughput or latency requirements for computationally intensive tasks.

The partitioning decision interacts with performance estimation, scheduling, and power budget in complex ways. Formally, the hardware/software partitioning problem is: given a task graph \( G = (V, E) \) where each vertex \( v \) represents a task with a software execution time \( t^{SW}(v) \) and a hardware execution time \( t^{HW}(v) \), and edges represent communication with cost \( c(u, v) \) (the communication overhead when \( u \) and \( v \) are on different execution domains), assign each vertex to hardware or software to minimize the overall schedule length subject to hardware area budget \( A \). This is an NP-hard combinatorial optimization problem, typically addressed by simulated annealing, genetic algorithms, or ILP.
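For very small graphs the problem can be solved exactly by enumeration, which makes the cost model concrete; this sketch deliberately simplifies the objective to a serialized schedule (no hardware/software overlap), whereas the makespan formulation above would also model parallelism:

```c
/* Exhaustive hardware/software partitioning for a tiny task graph.
   Bit i of the mask = 1 assigns vertex i to hardware.  Each edge whose
   endpoints land in different domains adds its communication cost.
   For realistic sizes this 2^n search is replaced by simulated
   annealing, genetic search, or ILP, as noted above. */
#define MAXV 16

typedef struct { int u, v; double cost; } edge_t;

double best_partition(int n, const double *t_sw, const double *t_hw,
                      const double *area, int ne, const edge_t *e,
                      double A, unsigned *best_mask) {
    double best = -1.0;
    for (unsigned m = 0; m < (1u << n); m++) {
        double len = 0.0, used = 0.0;
        for (int i = 0; i < n; i++) {
            if (m & (1u << i)) { len += t_hw[i]; used += area[i]; }
            else               { len += t_sw[i]; }
        }
        if (used > A) continue;                  /* over the area budget */
        for (int k = 0; k < ne; k++)             /* cross-domain edges */
            if (((m >> e[k].u) & 1) != ((m >> e[k].v) & 1))
                len += e[k].cost;
        if (best < 0.0 || len < best) { best = len; *best_mask = m; }
    }
    return best;
}
```

Even this toy version exhibits the essential tension: moving a vertex to hardware shrinks its execution time but may add communication cost on every edge it shares with software vertices.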

6.2 Performance Estimation

Performance estimation is the process of predicting the execution time of a computation before it is fully implemented, to guide architectural decisions. Several levels of abstraction are used.

Analytical models estimate performance from algorithmic complexity (operation counts, memory access patterns) combined with processor parameters (cycle time, IPC, cache miss penalty). For a matrix-matrix multiplication of \( n \times n \) matrices, the operation count is \( 2n^3 \) floating-point operations; given a processor delivering \( F \) FLOPS/s, the estimated time is \( 2n^3 / F \), adjusted for cache miss overhead using a roofline model.
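The matrix-multiplication estimate as code (the cache-penalty factor is a stand-in for the roofline correction; a value of 1.0 assumes no memory stalls):

```c
/* Analytical time estimate for n x n matrix multiplication on a
   processor delivering F floating-point operations per second:
   2n^3 operations, so t = (2n^3 / F) * penalty, penalty >= 1. */
double matmul_time_estimate(double n, double flops, double cache_penalty) {
    return (2.0 * n * n * n / flops) * cache_penalty;
}
```

For \( n = 1000 \) on a 1 GFLOPS processor with no cache penalty, the estimate is 2 seconds; a memory-bound implementation might see a penalty factor of 5 to 10.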

Simulation executes the computation on a software model of the target hardware. Instruction-set simulation (ISS) emulates the processor at the instruction level, providing cycle-accurate estimates at the cost of simulation speed (typically \(10^3\)–\(10^6\) times slower than real hardware). Transaction-level modeling (TLM) in SystemC simulates the system at the level of bus transactions, trading accuracy for speed.

Profiling on prototype hardware executes the computation on an FPGA or SoC evaluation board and measures actual execution time with hardware performance counters. This is the most accurate but requires a physical prototype.

6.3 FPGA-Based Prototyping

Field-programmable gate arrays (FPGAs) allow hardware designs to be implemented in reconfigurable logic without fabricating a custom chip. An FPGA consists of a matrix of configurable logic blocks (CLBs), interconnected by a programmable routing fabric, supplemented by hard blocks (DSP slices, block RAMs, high-speed transceivers) optimized for common operations.

FPGAs serve two complementary roles in embedded design. First, as prototyping platforms: a design intended for eventual ASIC fabrication is first mapped to an FPGA to verify functional correctness and measure performance before committing to the irreversible (and expensive) tape-out step. Second, as production platforms: at lower volumes, FPGAs are commercially viable end products, since their reconfigurability allows in-field update and multi-product platforms.

The HLS flow described in Chapter 3 targets both ASICs and FPGAs. FPGA-specific considerations include the LUT-based logic representation (optimized for 4–6 input functions), the granularity of block RAM resources, and the clock frequency limitations imposed by routing delay.


Chapter 7: Interconnect and Bus Protocols

7.1 On-Chip Interconnect

7.1.1 Shared Bus and Bus Arbitration

In a simple shared-bus architecture, all masters (processors, DMA controllers, IP cores) share a single set of address, data, and control lines. At most one master can use the bus at a time, requiring an arbitration protocol to resolve simultaneous requests. Common arbitration schemes:

Fixed priority: masters are statically ranked; simultaneous requests are granted to the highest-ranked requester. Simple and fast, but lower-priority masters may be starved.

Round-robin: each bus cycle, the next master in cyclic order is granted access. Provides fairness but may not meet latency requirements for high-priority masters.

TDMA (time-division multiple access): each master is allocated a fixed time slot in a repeating frame. TDMA provides perfect temporal isolation — one master’s bus usage cannot delay another’s — but wastes bandwidth when a master’s slot is unused.

The AMBA AHB (Advanced High-performance Bus) and AMBA APB (Advanced Peripheral Bus) specifications from ARM define standardized on-chip bus protocols widely used in Cortex-M and Cortex-A based SoCs.

7.1.2 Network-on-Chip

As the number of processing elements on an SoC grows beyond a dozen, shared-bus bandwidth becomes the primary bottleneck. A network-on-chip (NoC) replaces the shared bus with a packet-switched network embedded in the chip, consisting of routers connected by point-to-point links. This architecture scales to hundreds of processing elements and provides high aggregate bandwidth through spatial reuse of links.

Key NoC design parameters include topology (mesh, torus, fat-tree, butterfly), routing algorithm (dimension-ordered routing, adaptive routing), and flow control (wormhole, virtual cut-through, store-and-forward). Wormhole routing is standard in embedded NoCs: a packet is divided into flits (flow control units); the head flit reserves a path through the network and the remaining flits follow immediately in a pipelined fashion, minimizing the buffering required at each router.

For real-time embedded systems, NoC latency bounds are essential. Worst-case NoC latency analysis using network calculus or timed automata verifies that no packet misses its deadline under the worst-case traffic pattern.

7.2 Field-Level Protocols

7.2.1 I2C and SPI

The Inter-Integrated Circuit (I2C) bus is a two-wire serial protocol (clock SCL, data SDA) supporting multiple masters and multiple slaves on a shared bus, with 7-bit or 10-bit slave addressing. I2C uses open-drain signalling and relies on pull-up resistors, making it suitable for board-level communication at speeds up to 400 kbit/s (Fast mode), 1 Mbit/s (Fast-mode Plus), or 3.4 Mbit/s (High-speed mode). I2C is common for communicating with sensors, EEPROMs, and display controllers at low data rates.

The Serial Peripheral Interface (SPI) uses four wires (SCLK, MOSI, MISO, SS) and is typically single-master. SPI is faster than I2C (speeds up to hundreds of Mbit/s), supports full duplex, and has no addressing overhead — the master asserts the slave-select (SS) line to address a specific device. SPI is preferred for high-speed peripherals: flash memory, high-resolution ADCs, and display drivers.

7.2.2 CAN Bus

The Controller Area Network (CAN), developed by Bosch and standardized in ISO 11898, is a multi-master serial bus designed for automotive and industrial control networks. CAN operates at up to 1 Mbit/s over twisted-pair wiring and supports networks of up to approximately 30 nodes without repeaters.

CAN uses a message-based, non-destructive arbitration protocol. Each message carries an 11-bit (CAN 2.0A) or 29-bit (CAN 2.0B extended) identifier. When multiple nodes transmit simultaneously, the arbitration is resolved bit-by-bit: each node monitors the bus and backs off if it detects that another node has driven a dominant (logic 0) bit while it is driving a recessive (logic 1) bit. The node with the numerically lowest identifier wins arbitration without any bits being corrupted — hence “non-destructive.” This priority scheme allows safety-critical messages (assigned low identifiers) to always win arbitration against lower-priority messages.

CAN worst-case latency analysis. A CAN message with identifier \( m \) has a worst-case response time \( R_m \) computable by a fixed-point iteration analogous to RMS response-time analysis. Because frame transmission is non-preemptive, \( m \) may additionally wait for one lower-priority frame that has already won arbitration; this blocking term \( B_m \) equals the longest transmission time among lower-priority messages. Define \( hp(m) \) as the set of messages with lower identifiers (higher priority) than \( m \), and let \( w_m \) be the queuing delay before \( m \) wins arbitration: \[ w_m^{(k+1)} = B_m + \sum_{j \in hp(m)} \left\lceil \frac{w_m^{(k)} + t_{\text{bit}}}{T_j} \right\rceil C_j, \qquad R_m = w_m + C_m, \]

where \( C_m \) is the worst-case transmission time of message \( m \) (including bit-stuffing overhead: a standard frame with 8 bytes of payload takes at most 135 bit times), \( T_j \) is the minimum period of message \( j \), and \( t_{\text{bit}} \) is one bit time. The converged \( R_m \) determines whether the message meets its deadline.

7.2.3 FlexRay and Time-Sensitive Networking

CAN’s arbitration-based protocol does not provide guaranteed bounded latency; a high-priority message can delay all lower-priority messages indefinitely if it arrives at a high rate. For safety-critical applications requiring guaranteed latency (steer-by-wire, brake-by-wire), the FlexRay protocol provides a TDMA-based static segment where each node’s transmission slot is predetermined, guaranteeing worst-case latency equal to the cycle period. A dynamic segment allows additional CAN-like arbitration for non-time-critical messages.

Time-Sensitive Networking (TSN) extends standard IEEE 802.3 Ethernet with a set of standards (IEEE 802.1AS time synchronization, 802.1Qbv scheduled traffic, 802.1Qbu frame preemption) to provide bounded latency over Ethernet. The 802.1Qbv “time-aware shaper” opens and closes transmission gates according to a schedule computed offline, ensuring that high-priority frames are transmitted in dedicated time slots without interference from lower-priority traffic. TSN is becoming the standard for next-generation automotive Ethernet networks replacing CAN and FlexRay at gigabit speeds.


Chapter 8: System Software for Embedded Multicores

8.1 RTOS Fundamentals

A real-time operating system (RTOS) is a kernel that provides task management, inter-task communication, synchronization, and hardware abstraction with guaranteed worst-case latencies. An RTOS kernel is characterized by the following properties absent in general-purpose OS kernels (Linux, Windows):

Preemptive priority scheduling: the kernel can preempt a running task immediately when a higher-priority task becomes ready, with deterministic preemption latency (typically microseconds rather than the milliseconds of a Linux kernel).

Deterministic system call latency: operations such as semaphore acquisition, message-queue post, and task creation complete in bounded, documented time.

Priority inheritance: the kernel natively implements PIP or PCP on mutexes to prevent priority inversion.

Common RTOS kernels include FreeRTOS (open-source, widely used in microcontrollers), VxWorks (commercial, used in aerospace and defense), QNX Neutrino (commercial, used in automotive), and RTEMS (open-source, used in space applications).

8.2 Multicore Scheduling Challenges

The extension of real-time scheduling theory to multicore processors is significantly more complex than the uniprocessor case. Two principal scheduling paradigms are used.

Partitioned scheduling assigns each task to a specific core at design time; on each core, the local task set is scheduled independently using RMS or EDF. Partitioned scheduling is simple to implement (each core runs a conventional RTOS), provides cache isolation, and makes WCET analysis tractable (no migration between caches). The partitioning step is itself an NP-hard bin-packing problem: assigning tasks to cores to satisfy schedulability conditions on each core. A first-fit decreasing (FFD) heuristic provides a practical approximation.
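A sketch of FFD partitioning onto \( m \) EDF-scheduled cores (each with utilization capacity 1.0; array bounds and names are illustrative):

```c
/* First-fit decreasing: sort tasks by decreasing utilization, then place
   each task on the first core whose accumulated utilization stays <= 1.0
   (the exact EDF test on that core).  Writes the chosen core index into
   core_of[]; returns -1 if some task cannot be placed. */
int ffd_partition(const double *u, int n, int m, int *core_of) {
    int idx[64];                         /* task indices, sorted below */
    for (int i = 0; i < n; i++) idx[i] = i;
    for (int i = 1; i < n; i++)          /* insertion sort, decreasing u */
        for (int j = i; j > 0 && u[idx[j]] > u[idx[j-1]]; j--) {
            int t = idx[j]; idx[j] = idx[j-1]; idx[j-1] = t;
        }
    double load[64] = {0};
    for (int i = 0; i < n; i++) {
        int placed = -1;
        for (int c = 0; c < m && placed < 0; c++)
            if (load[c] + u[idx[i]] <= 1.0) {    /* per-core EDF test */
                load[c] += u[idx[i]];
                placed = c;
            }
        if (placed < 0) return -1;               /* partitioning failed */
        core_of[idx[i]] = placed;
    }
    return 0;
}
```

For RMS-scheduled cores, the per-core test would use the Liu-Layland bound (or exact response-time analysis) instead of the simple \( U \leq 1 \) check.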

Global scheduling maintains a single global priority queue and allows tasks to migrate between cores. Global EDF (G-EDF) and global fixed-priority scheduling can in theory achieve higher utilization than any partitioned scheme, but migration introduces cache warm-up penalties (a task migrated to a new core must reload its working set from the shared LLC or DRAM), and the Dhall effect shows that G-EDF can miss deadlines for task sets whose total utilization barely exceeds 1 (far below the capacity \( m \) of \( m \) cores) when a single task has utilization close to 1.

Clustered scheduling partitions cores into clusters and applies global scheduling within each cluster, trading the extremes of both approaches.

8.3 Memory Interference in Multicore Systems

In a multicore SoC, all cores share the DRAM controller and memory bus. A core running a memory-intensive task generates DRAM requests that may delay memory requests from other cores — a phenomenon called memory interference or DRAM contention. In the worst case, one core’s DRAM activity can delay another core’s memory access by hundreds of bus cycles, inflating WCET estimates dramatically.

Mechanisms to bound or reduce memory interference include:

Memory bandwidth partitioning: enforce a maximum memory request rate per core using a per-core token-bucket regulator in the DRAM controller, ensuring that no core can monopolize the memory bus.

DRAM bank partitioning: assign exclusive DRAM banks to each core (through careful page coloring in the OS or RTOS page allocator). Since DRAM row conflicts occur only within a bank, inter-core interference is eliminated when banks are partitioned.

TDMA memory arbitration: allocate DRAM bus slots to cores using a TDMA schedule, providing each core a guaranteed bandwidth allocation with deterministic latency bounds.

8.4 Worst-Case Execution Time Analysis

Worst-case execution time (WCET) analysis determines an upper bound on the maximum time a piece of code can take to execute on a given processor. WCET bounds are the fundamental input to the schedulability analysis of Chapters 4 and 5; overestimating them leads to pessimistic (but safe) schedulability conclusions, while underestimating them (or using average-case measurements) can result in deadline misses in production.

Two broad approaches exist.

Static WCET analysis analyzes the program’s control-flow graph (CFG) and models the hardware (cache, pipeline, branch predictor) using abstract interpretation or implicit path enumeration (IPET). The IPET method formulates the WCET computation as an ILP: integer variables count the number of times each basic block executes; the objective is to maximize total execution time subject to flow constraints derived from the CFG. Abstract interpretation of the cache models all possible cache states at each program point using a set abstraction (may-analysis: the set of blocks that might be in cache; must-analysis: the set of blocks guaranteed to be in cache), classifying each memory access as always-hit, always-miss, or unpredictable.
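For a loop-free CFG, the IPET objective reduces to a longest-path computation, which this sketch performs directly; blocks are assumed numbered in topological order with edges listed by source in that order (handling loops requires the full ILP with flow-bound constraints, as described above):

```c
/* WCET bound for a loop-free CFG by longest-path dynamic programming.
   c[v] is the local WCET of basic block v, block 0 is the entry, and
   edges e[] (u -> v) are sorted by source in topological order. */
#define MAXB 32

typedef struct { int u, v; } cfedge_t;

double wcet_bound(int nb, const double *c, int ne, const cfedge_t *e) {
    double longest[MAXB];
    longest[0] = c[0];                        /* entry block */
    for (int v = 1; v < nb; v++) longest[v] = -1.0;   /* unreached */
    for (int k = 0; k < ne; k++) {
        if (longest[e[k].u] < 0.0) continue;  /* source not reachable */
        double via = longest[e[k].u] + c[e[k].v];
        if (via > longest[e[k].v]) longest[e[k].v] = via;
    }
    double w = 0.0;                           /* max over all blocks */
    for (int v = 0; v < nb; v++)
        if (longest[v] > w) w = longest[v];
    return w;
}
```

On a diamond-shaped CFG (an if/else), the bound is the entry cost plus the costlier branch plus the join block, exactly the path the ILP would select.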

Measurement-based WCET instruments the code to measure execution times across many inputs and uses extreme-value theory (EVT) or probabilistic timing analysis to extrapolate a probabilistic WCET bound. This approach is simpler to apply to complex out-of-order processors where static modeling is intractable, but requires careful experimental design to cover corner cases.

The aiT static WCET analyzer from AbsInt and the Chronos tool from NUS are representative tools that implement static WCET analysis for common embedded processors. WCET analysis is mandated by safety standards such as IEC 61508 SIL 3/4 and DO-178C for safety-critical software.


Chapter 9: Safety, Security, and Functional Safety Standards

9.1 Functional Safety: IEC 61508 and ISO 26262

Functional safety is the part of the overall safety of a system that depends on the system functioning correctly in response to its inputs, including the management of likely operator errors, hardware failures, and environmental changes. It is distinct from safety from misuse or physical hazard.

IEC 61508, “Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems,” is the foundational international standard for functional safety. It introduces the concept of Safety Integrity Level (SIL), a discrete level (SIL 1 to SIL 4) corresponding to a required failure measure: the average probability of dangerous failure on demand (PFDavg) for low-demand safety functions, or the probability of dangerous failure per hour (PFH) for high-demand and continuous ones. A SIL 4 function must achieve a PFDavg between \( 10^{-5} \) and \( 10^{-4} \), whereas SIL 1 permits a PFDavg between \( 10^{-2} \) and \( 10^{-1} \).

SIL determination involves a structured hazard and risk analysis — typically HAZOP (Hazard and Operability Study) or FMEA (Failure Mode and Effects Analysis) — to identify hazards, estimate their severity and likelihood without the safety function, and determine the risk reduction needed from the safety function.

ISO 26262, “Road vehicles — Functional safety,” is the automotive adaptation of IEC 61508. It introduces Automotive Safety Integrity Levels (ASILs), labeled ASIL A through ASIL D (most stringent), plus QM (quality management, no specific safety requirements). ASIL is determined by a three-factor risk graph considering:

  • Severity (S0–S3): consequence of the hazard (no injury to life-threatening).
  • Exposure (E0–E4): frequency of the vehicle being in a situation where the hazard can occur.
  • Controllability (C0–C3): ability of the driver or other road users to avoid harm.

An autonomous emergency braking (AEB) failure that could cause a fatal collision in heavy traffic with little chance of driver intervention (S3, E4, C3) would be rated ASIL D, requiring the most rigorous design, verification, and documentation process.

ASIL Decomposition. ISO 26262 permits decomposing a single ASIL D requirement into two redundant subsystems each implementing an ASIL B requirement, provided the two subsystems are independent (no common-cause failure). This redundancy decomposition is the basis for dual-channel safety architectures: a main channel and a safety monitor that independently confirm each other's outputs before actuation.

9.2 Security in Embedded Systems

The security threat landscape for embedded systems differs from enterprise IT security in several respects: physical access to the device is often possible (enabling side-channel and fault-injection attacks), software update mechanisms may be absent or infrequent, and the consequences of security breaches may be physical (tampering with a pacemaker, compromising a vehicle’s steering).

9.2.1 Secure Boot

Secure boot establishes a chain of trust from reset to application software. The root of trust is a small piece of immutable ROM code (the boot ROM) that verifies a cryptographic signature on the next-stage bootloader before executing it. Each stage verifies the next: boot ROM → bootloader → RTOS → application. If any stage fails verification, the boot process halts or falls back to a recovery mode.

The signature verification typically uses RSA or ECDSA with a public key stored in one-time-programmable (OTP) fuses on the SoC. The OTP fuses cannot be modified after programming, preventing an attacker from substituting a different public key.

9.2.2 Trusted Execution Environment

A Trusted Execution Environment (TEE) partitions the processor into a normal world (untrusted OS and applications) and a secure world (trusted OS and trusted applications — TAs). The partition is enforced by the processor’s hardware security extensions: ARM TrustZone assigns each memory region, peripheral, and processor mode to either the secure or normal world. The secure monitor call (SMC) instruction is the only software-defined entry point into the secure world; all other transitions are blocked by hardware.

TAs executing in the secure world can protect cryptographic keys, biometric templates, DRM content, and payment credentials from malicious code executing in the normal world. The GlobalPlatform TEE specification standardizes the TEE API and management framework.

9.2.3 Hardware Security Modules

A Hardware Security Module (HSM) is a dedicated secure processor — either an on-chip subsystem (as in modern automotive SoCs) or an external chip — that stores cryptographic keys in tamper-resistant memory and performs cryptographic operations (AES, RSA, ECC, SHA, true random number generation) in hardware, never exposing key material to the main processor. In automotive applications, HSMs conforming to the AUTOSAR SHE (Secure Hardware Extension) or EVITA specifications provide the root of trust for vehicle communication security.


Chapter 10: Low-Power Design for Embedded Systems

10.1 Sources of Power Dissipation

Power dissipation in CMOS digital circuits has two principal components. Dynamic power arises from charging and discharging capacitances during switching:

\[ P_{\text{dyn}} = \alpha C V_{DD}^2 f, \]

where \( \alpha \) is the activity factor (fraction of clock cycles in which a node switches), \( C \) is the effective switched capacitance, \( V_{DD} \) is the supply voltage, and \( f \) is the clock frequency. The quadratic dependence on \( V_{DD} \) makes supply voltage reduction the most powerful lever for dynamic power reduction.

Static (leakage) power arises from subthreshold conduction and gate tunneling currents in transistors that are nominally off. As process nodes have scaled below 28 nm, static power has grown to be comparable to or larger than dynamic power in modern SoCs, particularly at high temperatures.

10.2 Dynamic Voltage and Frequency Scaling

Dynamic voltage and frequency scaling (DVFS) reduces \( V_{DD} \) and \( f \) together (since the maximum operating frequency scales approximately linearly with \( V_{DD} - V_t \), where \( V_t \) is the threshold voltage) to reduce dynamic power during periods of low computational demand. The power saving from reducing frequency from \( f_{\max} \) to \( f_{\min} = \alpha f_{\max} \) while proportionally reducing voltage is:

\[ \frac{P_{\text{dyn}}(f_{\min})}{P_{\text{dyn}}(f_{\max})} = \alpha^3, \]

because the frequency contributes one factor of \( \alpha \) and \( V_{DD}^2 \) contributes the other two. A 50% frequency reduction thus yields an 8× reduction in dynamic power. The trade-off is increased execution time: a task of \( C \) cycles takes twice the wall-clock time at half frequency, so the energy to complete a fixed workload falls only as \( \alpha^2 \), a 4× saving in this example rather than 8×.
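The scaling relations as code, separating power from energy for a fixed workload (assuming voltage scales in proportion to frequency):

```c
/* Running at a fraction alpha of f_max, with V_DD scaled proportionally:
   dynamic power falls as alpha^3, but because execution time stretches
   by 1/alpha, the energy to finish a fixed number of cycles falls only
   as alpha^2. */
double dvfs_power_ratio(double alpha)  { return alpha * alpha * alpha; }
double dvfs_energy_ratio(double alpha) { return alpha * alpha; }
```

At \( \alpha = 0.5 \): power ratio 0.125 (8× less power), energy ratio 0.25 (4× less energy per task).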

For real-time systems, DVFS must be applied while preserving schedulability. The YDS algorithm of Yao, Demers, and Shenker computes the minimum processor speed at each point in time that guarantees all deadlines under EDF, reducing energy to the provably optimal level. Under EDF with DVFS, the task set is schedulable at voltage \( V \) if

\[ \sum_{i=1}^{n} \frac{C_i(V)}{T_i} \leq 1, \]

where \( C_i(V) \) is the execution time at voltage \( V \).

10.3 Power States and Duty Cycling

Modern embedded SoCs implement a hierarchy of power states, ranging from full-performance active mode to deeply suspended states where most of the chip is powered off:

  • Active (C0): processor and all peripherals powered and clocked.
  • Clock-gated idle (C1): clock gated off to inactive functional units; state retained.
  • Retention (C2): processor core powered down; state saved to retention flip-flops at reduced voltage.
  • Deep sleep (C3+): power domains shut down; state saved to SRAM or external DRAM; restart requires full re-initialization.

Transitioning to a deeper power state saves more energy per unit time but incurs a wake-up latency. The break-even time for a given power state is the minimum time the system must remain in that state for the energy savings to exceed the energy cost of the transition. An event-driven RTOS scheduler that includes break-even-aware sleep decisions can reduce idle power significantly.

Duty cycling is the extension of sleep/wake cycles to the entire system: a wireless sensor node, for example, wakes periodically (say, once per minute), takes a sensor reading, transmits the data over IEEE 802.15.4, and returns to deep sleep. If the active phase lasts 10 ms out of a 60 s period, the duty cycle is \( 10 \times 10^{-3} / 60 \approx 1.67 \times 10^{-4} \), and the average power is approximately this fraction of the active power — extending battery life from hours to years.
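Both quantities above are one-line computations; a sketch (the break-even expression assumes a simple two-state model with a fixed transition energy):

```c
/* Break-even time for a power state: sleeping saves (P_idle - P_sleep)
   watts but costs E_transition joules to enter and leave, so sleeping
   pays off only for idle intervals longer than t_be. */
double break_even_time(double P_idle, double P_sleep, double E_transition) {
    return E_transition / (P_idle - P_sleep);
}

/* Average power under duty cycling: the duty-cycle-weighted mix of
   active and sleep power. */
double avg_power(double P_active, double P_sleep,
                 double t_active, double period) {
    double d = t_active / period;        /* duty cycle */
    return d * P_active + (1.0 - d) * P_sleep;
}
```

With the sensor-node numbers above (10 ms active out of 60 s, 100 mW active, negligible sleep power), the average power is roughly 17 µW.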


Chapter 11: Embedded Processor Software and Code Optimization

11.1 Compilation and Retargetable Code Generation

Compilers for embedded targets must go beyond standard optimization to meet code size, execution time, and energy constraints. The GCC and LLVM compiler frameworks support extensive retargeting through machine-description files (GCC) or TableGen-based backend specifications (LLVM), allowing a single compiler infrastructure to generate code for dozens of processor architectures.

Key embedded-specific compilation techniques include:

Instruction selection maps intermediate representation (IR) operations to target instructions using tree pattern matching. For DSP targets, composite instructions (MAC, butterfly operation) must be recognized from the IR tree; failure to recognize them forces multiple instructions and loses the performance advantage of the target.

Register allocation assigns program variables to processor registers, spilling to memory when registers are exhausted. On architectures with few registers (8051: 4 banks of 8 registers) or with register-file partitioning (DSP: separate address and data registers), register allocation significantly affects both code size and execution time.

Software pipelining (modulo scheduling) overlaps multiple iterations of a loop at the instruction level, filling the latency slots of one iteration’s long-latency operations with instructions from other iterations. This is essential for VLIW DSPs to achieve high utilization of functional units.

Code size optimization is critical on microcontrollers with limited flash memory. Techniques include: Thumb instruction set (16-bit compressed encoding of ARM instructions, reducing code size by roughly 30%); interprocedural common subexpression elimination; and function inlining tradeoffs (inlining reduces call overhead but increases code size, potentially reducing instruction cache effectiveness).

11.2 Embedded Operating System Concepts

Beyond the scheduling and IPC mechanisms of Chapters 4 and 5, an RTOS provides the following services for embedded applications.

Memory management: embedded RTOS memory allocators (FreeRTOS pvPortMalloc, RTEMS region manager) provide deterministic allocation and deallocation times, unlike the general-purpose malloc which has worst-case \( O(n) \) behavior dependent on heap fragmentation. Many safety-critical embedded systems prohibit dynamic memory allocation entirely after initialization, using statically allocated memory pools instead.
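A fixed-block pool is the simplest deterministic allocator: every operation is \( O(1) \) and fragmentation is impossible because all blocks are the same size. The sketch below is illustrative only; production RTOS pools (FreeRTOS heaps, RTEMS regions) add alignment guarantees, locking, and error hooks:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  32
#define NUM_BLOCKS  8

/* Each free block doubles as a link in the free list, so the pool
 * needs no metadata beyond one head pointer. */
typedef union block {
    union block *next;
    uint8_t      data[BLOCK_SIZE];
} block_t;

static block_t  pool_mem[NUM_BLOCKS];   /* statically allocated at build time */
static block_t *free_head;

void pool_init(void)
{
    free_head = NULL;
    for (int i = 0; i < NUM_BLOCKS; i++) {  /* thread every block onto the list */
        pool_mem[i].next = free_head;
        free_head = &pool_mem[i];
    }
}

void *pool_alloc(void)       /* O(1): pop the list head; NULL if exhausted */
{
    block_t *b = free_head;
    if (b != NULL)
        free_head = b->next;
    return b;
}

void pool_free(void *p)      /* O(1): push the block back onto the head */
{
    block_t *b = p;
    b->next = free_head;
    free_head = b;
}
```

Because both the memory and the worst-case timing are fixed at build time, this style of allocator is compatible with the "no dynamic allocation after initialization" rule common in safety-critical code.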

Timer management: software timers built on top of the RTOS tick interrupt allow tasks to be woken, callbacks to be executed, or watchdog timers to be reset at specified intervals. Hierarchical timer wheels provide \( O(1) \) timer insertion and expiry for arbitrary numbers of outstanding timers.
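The core of a timer wheel fits in a few lines. The single-level sketch below uses a per-timer "rounds" counter to handle timeouts longer than one wheel revolution; a hierarchical wheel instead cascades several levels of this structure. Names and sizes are illustrative, not from any particular RTOS:

```c
#include <stddef.h>
#include <stdio.h>

#define WHEEL_SLOTS 16

struct timer {
    struct timer *next;
    unsigned rounds;                 /* full wheel revolutions still to wait */
    void (*callback)(void *);
    void *arg;
};

static struct timer *wheel[WHEEL_SLOTS];
static unsigned now_tick;

/* O(1) insertion: hash the expiry tick to a slot. ticks_from_now >= 1. */
void timer_add(struct timer *t, unsigned ticks_from_now)
{
    unsigned slot = (now_tick + ticks_from_now) % WHEEL_SLOTS;
    t->rounds = (ticks_from_now - 1) / WHEEL_SLOTS;
    t->next = wheel[slot];
    wheel[slot] = t;
}

/* Called from the RTOS tick interrupt: only the current slot is scanned. */
void timer_tick(void)
{
    unsigned slot = ++now_tick % WHEEL_SLOTS;
    struct timer **pp = &wheel[slot];
    while (*pp) {
        struct timer *t = *pp;
        if (t->rounds == 0) {        /* expires on this revolution */
            *pp = t->next;           /* unlink, then fire */
            t->callback(t->arg);
        } else {
            t->rounds--;             /* survives until a later revolution */
            pp = &t->next;
        }
    }
}

/* Demo callback: counts expirations. */
static int fired_count;
static void count_cb(void *arg) { (void)arg; fired_count++; }
```

Insertion touches only the slot head, and each tick touches only one slot's list, which is what gives the wheel its \( O(1) \) behavior regardless of how many timers are outstanding.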

Device driver model: the RTOS provides a standardized interface between applications and hardware peripherals. A driver exports open, close, read, write, and ioctl operations; the application interacts with the hardware through this interface without knowledge of the physical register map. Driver design must handle concurrent access (multiple tasks calling read simultaneously), interrupt-driven I/O (the driver puts the calling task to sleep and wakes it from the interrupt service routine when data arrives), and DMA (the driver programs a DMA controller to transfer data between DRAM and the peripheral, freeing the processor during the transfer).
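The standardized interface is typically a table of function pointers that the application calls through, never touching device registers directly. The sketch below wires the table to a toy loopback "device" (write fills a buffer, read drains it); the struct and function names are illustrative, not from any particular RTOS:

```c
#include <stddef.h>
#include <string.h>

/* The driver interface: every driver exports the same operations. */
struct driver_ops {
    int (*open)(void *dev);
    int (*close)(void *dev);
    int (*read)(void *dev, void *buf, size_t len);
    int (*write)(void *dev, const void *buf, size_t len);
    int (*ioctl)(void *dev, int cmd, void *arg);
};

/* A toy loopback device standing in for real hardware. */
struct loop_dev { char buf[64]; size_t used; };

static int loop_open(void *d)  { ((struct loop_dev *)d)->used = 0; return 0; }
static int loop_close(void *d) { (void)d; return 0; }

static int loop_write(void *d, const void *b, size_t n)
{
    struct loop_dev *ld = d;
    if (n > sizeof(ld->buf) - ld->used)      /* truncate to available space */
        n = sizeof(ld->buf) - ld->used;
    memcpy(ld->buf + ld->used, b, n);
    ld->used += n;
    return (int)n;
}

static int loop_read(void *d, void *b, size_t n)
{
    struct loop_dev *ld = d;
    if (n > ld->used)
        n = ld->used;
    memcpy(b, ld->buf, n);
    memmove(ld->buf, ld->buf + n, ld->used - n);  /* shift remainder down */
    ld->used -= n;
    return (int)n;
}

static int loop_ioctl(void *d, int cmd, void *a)
{ (void)d; (void)cmd; (void)a; return -1; }      /* no commands supported */

const struct driver_ops loop_ops = {
    loop_open, loop_close, loop_read, loop_write, loop_ioctl
};
```

Swapping the loopback for a UART or SPI driver changes only the table's function bodies; application code calling through `loop_ops`-style pointers is unchanged, which is the point of the abstraction.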

Interrupt management: the RTOS manages the interaction between hardware interrupts and software tasks. An interrupt service routine (ISR) should be as short as possible — save context, acknowledge the interrupt, post a message or semaphore to a task, restore context and return. The deferred interrupt handling pattern moves all non-trivial processing into a task, keeping the ISR duration bounded and measurable for interrupt-latency analysis.
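The deferred-handling pattern can be sketched with a single-producer/single-consumer ring buffer between the ISR and the task, which needs no lock because each side writes only its own index. In a real RTOS the ISR would also post a semaphore to wake the task (e.g. FreeRTOS `xSemaphoreGiveFromISR`); that call is noted in a comment but omitted here so the sketch stays self-contained:

```c
#include <stdint.h>

#define QLEN 8u                        /* power of two */

static volatile uint8_t  q[QLEN];
static volatile unsigned q_head, q_tail;

/* Producer: runs in ISR context. Only enqueues and returns, so the
 * ISR duration stays short and bounded. */
void uart_isr(uint8_t byte)
{
    unsigned h = q_head;
    if (h - q_tail < QLEN) {           /* drop the byte if the queue is full */
        q[h % QLEN] = byte;
        q_head = h + 1;
        /* xSemaphoreGiveFromISR(rx_sem, ...) would go here to wake
         * the handler task. */
    }
}

/* Consumer: runs in task context. Returns 1 if a byte was dequeued. */
int uart_task_poll(uint8_t *out)
{
    unsigned t = q_tail;
    if (t == q_head)
        return 0;                      /* nothing pending */
    *out = q[t % QLEN];
    q_tail = t + 1;
    return 1;
}
```

All non-trivial processing (parsing, checksums, protocol state machines) happens in the task after `uart_task_poll` returns, which keeps the worst-case ISR path a handful of instructions for interrupt-latency analysis. On a multicore or out-of-order target the indices would additionally need atomic or barrier-qualified accesses; `volatile` alone suffices only for the simple single-core ISR/task case assumed here.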


Chapter 12: Summary and Design Methodology Integration

12.1 The Complete Embedded Design Flow

The topics of this course collectively form an integrated design methodology. Beginning with a system-level specification, the designer applies models of computation to structure the specification, high-level synthesis to generate hardware implementations of compute-intensive kernels, hardware/software partitioning to allocate functions to hardware and software, real-time scheduling analysis to verify temporal correctness, and communication protocol design to implement inter-component data exchange.

At each level, formal analysis techniques — schedulability analysis, WCET analysis, safety-level determination, NoC timing analysis — provide quantitative guarantees that the design meets its requirements before any silicon or firmware is committed. This front-loading of verification work is the defining characteristic of embedded systems engineering practice in safety-critical domains (automotive, aerospace, medical), and it is what distinguishes ECE 423 from software engineering courses that treat testing as the primary quality mechanism.

Several trends are reshaping embedded systems design practice.

Heterogeneous computing: the shift from homogeneous multicore to heterogeneous SoCs (combining CPU clusters, GPU shaders, neural processing units, and domain-specific accelerators on one die) is driven by the breakdown of Dennard scaling and the demand for AI inference at the network edge. The design challenge is efficiently scheduling a heterogeneous workload across heterogeneous resources, a significantly harder problem than homogeneous multicore scheduling.

RISC-V and open-source hardware: the RISC-V ISA, developed at UC Berkeley and now maintained by RISC-V International, is an open, royalty-free instruction set that has spawned a large ecosystem of open-source processor cores (Rocket, BOOM, CVA6, VexRiscv, Ibex). RISC-V is increasingly deployed in embedded SoCs, microcontrollers, and domain-specific accelerators, offering designers freedom to customize the ISA with application-specific extensions.

Machine learning at the edge: deploying neural network inference on embedded devices requires fixed-point quantization, structured pruning, and hardware-aware neural architecture search (NAS) to fit models into the memory and computational budgets of microcontrollers. TinyML frameworks (TensorFlow Lite for Microcontrollers, CMSIS-NN) optimize inference kernels for Cortex-M processors.

Security and safety convergence: as connected embedded systems (automotive, industrial IoT) face both functional safety and cybersecurity requirements simultaneously, standards bodies (ISO 21434 for automotive cybersecurity, IEC 62443 for industrial cybersecurity) are developing integrated safety-security frameworks. The fundamental tension — security mechanisms add latency and resource usage that may conflict with real-time safety requirements — is an active area of research.


Sources

  • Marilyn Wolf, High Performance Embedded Computing: Applications in Cyber-Physical Systems and Mobile Computing, 2nd ed., Morgan Kaufmann, 2014.
  • Wayne Wolf, Computers as Components: Principles of Embedded Computing System Design, 4th ed., Morgan Kaufmann, 2016.
  • C.L. Liu and James W. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, 20(1):46–61, 1973.
  • Giorgio C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, 3rd ed., Springer, 2011.
  • Peter Marwedel, Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems, and the Internet of Things, 4th ed., Springer, 2021.
  • Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky, “Priority inheritance protocols: an approach to real-time synchronization,” IEEE Transactions on Computers, 39(9):1175–1185, 1990.
  • AUTOSAR Classic Platform Specification, Release 22-11, AUTOSAR Consortium, 2022 (public).
  • ISO 26262:2018, Road Vehicles — Functional Safety, ISO, 2018.
  • IEC 61508:2010, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, IEC, 2010.
  • IEEE 802.1Qbv-2015, Enhancements for Scheduled Traffic, IEEE, 2015.
  • Bosch GmbH, CAN Specification Version 2.0, Robert Bosch GmbH, 1991 (public).