ECE 455: Embedded Software
Sebastian Fischmeister
Estimated study time: 58 minutes
Sources and References
Primary textbooks — J. Liu, Real-Time Systems, Prentice Hall, 2000; M. Wolf, Computers as Components, 3rd ed., Morgan Kaufmann, 2012; E. A. Lee and S. A. Seshia, Introduction to Embedded Systems: A Cyber-Physical Systems Approach, 2nd ed., MIT Press, 2016 (free PDF: leeseshia.org).
Supplementary texts — M. R. Lyu, Software Fault Tolerance, Wiley, 1995; J.-C. Laprie (ed.), Dependability: Basic Concepts and Terminology, Springer, 1992.
Research articles — A. Avizienis, J.-C. Laprie, and B. Randell, “Dependability and its Threats: A Taxonomy,” Building the Information Society, Springer, 2004; J. Regehr, “Safe and Structured Use of Interrupts in Real-Time and Embedded Software,” Handbook of Real-Time and Embedded Systems, CRC Press, 2008.
Chapter 1: Embedded Software Is Difficult
1.1 What Makes Embedded Software Unique
Embedded software is software that executes on a computing system tightly coupled to a physical process or environment. Unlike desktop applications or server-side software, embedded systems must satisfy correctness criteria that span both logical and temporal dimensions. A result that is computed correctly but delivered late is, in many contexts, a failure just as serious as a wrong result.
Several properties distinguish embedded software from general-purpose software:
- Timing constraints: Tasks must complete within specified deadlines imposed by the physical environment. A controller for an automotive anti-lock braking system must respond within milliseconds; a pacemaker pulse must arrive within a fraction of a heartbeat period.
- Resource constraints: Embedded processors often have limited RAM, ROM, and computational throughput. Memory cannot be expanded at will; dynamic allocation is frequently prohibited.
- Reliability requirements: Many embedded systems operate in environments where failure causes physical harm, financial loss, or safety hazards. Redundancy and fault tolerance must often be built in at the software level.
- Concurrency: Embedded systems interact with multiple hardware peripherals, sensors, and actuators simultaneously. This concurrency must be managed through interrupt handling, scheduling, and synchronization primitives.
- No general operating system: Many embedded systems run on bare metal or use a real-time operating system (RTOS) with a far smaller footprint and different guarantees than a commodity OS.
1.2 Challenges in Practice
1.2.1 Complexity of Interaction with Hardware
Embedded software directly programs hardware registers, memory-mapped I/O, and interrupt controllers. This requires knowledge of the hardware architecture, and errors at this level (e.g., incorrect interrupt masking, wrong register offsets) can cause subtle, non-reproducible bugs.
1.2.2 Non-Determinism and Timing Anomalies
Modern processors include pipelines, caches, branch predictors, and out-of-order execution units. These features dramatically improve average performance but introduce non-deterministic execution times. A single cache miss can cause an execution time variation of 10–100 cycles; pipeline stalls from branch mispredictions add further variability. Determining the worst-case execution time (WCET) of a program on such a processor is a hard research problem.
1.2.3 Concurrency and Race Conditions
Interrupt service routines (ISRs) preempt the main thread of execution at unpredictable times. Shared data structures accessed from both ISRs and task code must be protected. In embedded systems, the usual mechanisms—mutexes, semaphores—interact poorly with real-time scheduling, producing phenomena such as priority inversion.
1.2.4 Observability and Debugging Difficulties
Unlike desktop software, embedded software runs without a rich OS to support debugging. Debuggers must connect via JTAG or similar interfaces. Adding instrumentation probes changes timing and may mask bugs. Printf-style debugging is often unavailable or disrupts real-time behavior.
1.3 The Correctness Hierarchy
For an embedded system task \( \tau_i \), correctness requires:
- Logical correctness: the output value is correct.
- Temporal correctness: the output is produced before the deadline \( D_i \).
- Safety: the system never enters a hazardous state.
All three must hold simultaneously. Missing any one may constitute a system failure.
Chapter 2: Embedded Computing
2.1 Processor Architectures for Embedded Systems
Embedded processors span a wide design space, trading off performance, power, and cost:
- Microcontrollers (MCUs): Integrate CPU, memory, and peripherals on one chip. Examples: ARM Cortex-M series, AVR, MSP430. Typically used in cost-sensitive applications with moderate performance requirements.
- Application processors: Higher-performance cores (ARM Cortex-A) running Linux or Android. Used where rich OS features are needed but power budgets allow it.
- Digital Signal Processors (DSPs): Optimized for multiply-accumulate operations in tight loops (e.g., audio, image processing). Often Harvard architecture with separate instruction and data caches.
- FPGAs and ASICs: Reconfigurable or fixed hardware for extremely high throughput or deterministic timing requirements.
2.1.1 Memory Hierarchy
Typical embedded memory hierarchy:
| Level | Size | Latency | Type |
|---|---|---|---|
| Registers | 32–256 B | 0 cycles | On-chip |
| Cache (L1) | 4–64 KB | 1–4 cycles | SRAM |
| Scratchpad | 4–256 KB | 1–2 cycles | SRAM |
| Main memory | 256 KB–8 MB | 5–100 cycles | SRAM/Flash |
| External flash | 1 MB–1 GB | 50–200 cycles | NOR/NAND |
Scratchpad memories differ from caches: they are software-controlled (no automatic eviction) and provide deterministic access times, making them preferred in safety-critical embedded contexts.
2.2 Interrupts and Interrupt-Driven I/O
An interrupt is an asynchronous signal from hardware that diverts processor execution to an interrupt service routine (ISR). The interrupt controller (e.g., ARM NVIC) prioritizes concurrent interrupt requests.
The interrupt latency \( L_{int} \) includes:
- Time to finish the current atomic instruction (1–several cycles)
- Time to save context (push registers, update stack pointer)
- Time to load ISR address from interrupt vector table
- Pipeline flush time on modern processors
Regehr’s structured interrupt model distinguishes three categories of code:
- Interrupt-level code: runs inside an ISR, cannot block.
- Task-level code: runs in the main loop or RTOS tasks, can block.
- Shared data: must be protected by disabling interrupts or using lock-free structures.
2.2.1 Safe Interrupt Programming Patterns
Key rules for safe interrupt programming (after Regehr):
- Declare shared variables `volatile` to prevent compiler optimization from caching them in registers across ISR boundaries.
- Use atomic read-modify-write sequences or disable/enable interrupt brackets around accesses to multi-byte shared variables.
- Keep ISR execution time short; defer complex processing to a task using a flag or a deferred interrupt mechanism.
- Never call blocking functions (e.g., `malloc`, `printf`, blocking OS calls) from an ISR.
2.3 Embedded Software Structures
2.3.1 Round-Robin (Polling Loop)
The simplest embedded software structure: a single infinite loop that polls all peripherals in sequence.
while (1) {
check_sensor_A();
check_sensor_B();
handle_communication();
}
Advantages: simple, predictable, no OS needed. Disadvantages: latency of response is bounded by the full loop time; a slow peripheral handler delays everything else.
2.3.2 Foreground/Background System
An ISR (foreground) handles time-critical events; the main loop (background) performs less urgent processing. The ISR sets flags or places data in queues; the background reads them.
2.3.3 Real-Time Operating System (RTOS)
An RTOS provides task management, scheduling, synchronization, and communication. Tasks (threads) are scheduled by a kernel according to a scheduling policy (e.g., rate-monotonic, EDF). The RTOS abstraction separates concerns: each task focuses on its own logic; the scheduler enforces timing.
2.4 Bus and Communication Architectures
Embedded systems connect processors and peripherals via buses and serial protocols:
- SPI (Serial Peripheral Interface): Full-duplex, synchronous, short distance. Common for sensors, flash memory.
- I2C: Two-wire, addressable, multi-master capable. Slower than SPI.
- UART: Asynchronous serial. Ubiquitous for debug consoles.
- CAN bus: Differential pair, broadcast, used extensively in automotive applications. Deterministic arbitration based on message priority identifiers.
- Ethernet/Industrial Ethernet: Higher bandwidth; deterministic variants (EtherCAT, PROFINET) used in industrial automation.
Chapter 3: Embedded Systems Debugging
3.1 Why Embedded Debugging Is Hard
Embedded debugging faces unique obstacles not present in desktop software:
- No rich OS services: No `gdbserver` running natively; no filesystem for core dumps.
- Probe effect: Any observation (e.g., a `printf`, a logic analyzer probe) may alter the system's timing and cause the bug to disappear (Heisenbug).
- Limited memory: Cannot store large logs or enable extensive instrumentation.
- Real-time constraints: Stopping at a breakpoint halts the system, causing missed deadlines and possibly hardware damage.
3.2 Debugging Infrastructure
3.2.1 JTAG and SWD
JTAG (Joint Test Action Group, IEEE 1149.1) provides a serial boundary scan interface for testing and debugging. JTAG allows:
- Reading and writing CPU registers and memory while the CPU is halted.
- Setting hardware breakpoints (limited number, stored in dedicated debug registers).
- Tracing execution through an embedded trace buffer (ETB) or trace port (ETM on ARM Cortex).
SWD (Serial Wire Debug) is a two-pin ARM alternative to JTAG, used on space-constrained boards.
3.2.2 Hardware Trace and Profiling
The ARM CoreSight architecture provides non-intrusive hardware tracing:
- ETM (Embedded Trace Macrocell): Records instruction-by-instruction execution trace into a circular buffer or off-chip trace sink.
- ITM (Instrumentation Trace Macrocell): Allows software-generated trace messages over the SWD link without halting the CPU.
- DWT (Data Watchpoint and Trace): Monitors data accesses (reads/writes to specific addresses) and generates trace events.
3.2.3 Timing Analysis with Oscilloscopes and Logic Analyzers
GPIO toggling is a low-overhead technique: set a GPIO pin high at the start of a critical section and low at the end. An oscilloscope then measures the pulse width, giving the execution time directly. The technique is minimally intrusive, provided the few cycles of toggle overhead are accounted for.
3.3 Software Debugging Techniques
3.3.1 Assertion-Based Debugging
Insert runtime assertions that check invariants. In embedded systems, assertions can log to a hardware trace buffer or toggle an error LED rather than calling abort().
3.3.2 Stack Overflow Detection
Task stacks have fixed sizes in RTOSes. Stack overflow corrupts adjacent memory. Detection techniques:
- Stack painting: Fill the stack with a known pattern (e.g., `0xDEADBEEF`); periodically check whether the pattern near the bottom of the stack has been overwritten.
- MPU-based protection: Configure the ARM Memory Protection Unit to generate a fault on access below the stack bottom.
3.3.3 Logging and Tracing Frameworks
Lightweight logging frameworks (e.g., FreeRTOS+Trace, Segger SystemView) record task switches, ISR entries/exits, and semaphore operations with timestamps into a circular RAM buffer, which is read out post-mortem or via a debug probe.
Chapter 4: Real-Time Scheduling — Foundations
4.1 Task Model
A real-time task \( \tau_i \) is characterized by:
- \( C_i \): worst-case execution time (WCET) — the maximum time any job of the task can require on the given processor.
- \( T_i \): period (for periodic tasks) or minimum inter-arrival time (for sporadic tasks).
- \( D_i \): relative deadline — the maximum allowable response time from a job’s arrival. Often \( D_i = T_i \) (implicit deadline), but can be less (constrained deadline) or more (arbitrary deadline).
- \( \phi_i \): phase offset — the time of the first job’s release relative to time 0.
The utilization of task \( \tau_i \) is:
\[ U_i = \frac{C_i}{T_i} \]
The total system utilization is:
\[ U = \sum_{i=1}^{n} U_i = \sum_{i=1}^{n} \frac{C_i}{T_i} \]
A necessary condition for schedulability on a single processor is \( U \leq 1 \).
4.2 Preemptive vs. Non-Preemptive Scheduling
In preemptive scheduling, the currently executing task can be interrupted by a higher-priority task. In non-preemptive scheduling, once a task begins executing, it runs to completion (or until it voluntarily yields). Preemptive scheduling enables better responsiveness for high-priority tasks but requires context-save overhead and complicates shared-resource analysis.
4.3 Feasibility and Optimality
A schedule is feasible if every job meets its deadline. A scheduling algorithm is optimal within a class of algorithms if, whenever a feasible schedule exists, the algorithm produces one.
Two classic optimality results:
- EDF is optimal among preemptive uniprocessor algorithms with arbitrary deadlines.
- RM is optimal among fixed-priority preemptive uniprocessor algorithms for implicit-deadline periodic tasks.
Chapter 5: Worst-Case Execution Time Analysis
5.1 Motivation
WCET analysis determines an upper bound on the execution time of a program on a specific hardware platform. This bound is necessary to verify that task deadlines can be met: without a reliable WCET estimate, schedulability analysis is meaningless.
The challenge is that program execution time depends on:
- The input data (which branches are taken, which memory locations are accessed).
- The hardware state (cache contents, pipeline state, branch predictor history) at the start of execution.
5.2 Program Flow Analysis
5.2.1 Control Flow Graph
A control flow graph (CFG) \( G = (V, E) \) represents a program where each node \( v \in V \) is a basic block (maximal straight-line sequence of instructions with one entry and one exit), and each directed edge \( (u, v) \in E \) represents a possible transfer of control from block \( u \) to block \( v \).
5.2.2 Loop Bounds and Infeasible Paths
WCET analysis requires loop bounds: for each loop, an upper bound on the number of iterations. These are typically provided by the analyst as annotations or inferred by static analysis. Additionally, certain paths through the CFG may be infeasible: they cannot occur during any actual execution. Identifying and excluding infeasible paths tightens the WCET estimate.
5.3 Implicit Path Enumeration Technique (IPET)
The IPET formulates WCET computation as an integer linear program (ILP). For each basic block \( b \), let:
- \( x_b \): the number of times block \( b \) is executed (a non-negative integer).
- \( c_b \): the local execution time (WCET) of block \( b \).
The objective is to maximize total execution time:
\[ \text{maximize} \quad \sum_{b \in V} c_b \cdot x_b \]
Subject to:
- Flow conservation at each node: \( x_b = \sum_{e \text{ entering } b} x_e = \sum_{e \text{ leaving } b} x_e \)
- Loop bound constraints: For a loop header \( h \) with bound \( N_h \): \( x_h \leq N_h \cdot x_{entry} \)
- Non-negativity: \( x_b \geq 0 \) for all \( b \)
- Infeasible path exclusions: Linear constraints excluding forbidden combinations of edge executions.
The ILP solution yields the WCET upper bound.
5.4 Measurement-Based WCET
An alternative to static analysis is measurement-based WCET estimation: execute the program many times with diverse inputs and hardware states, record execution times, and apply statistical methods to estimate the upper tail.
Approaches:
- Extreme value theory (EVT): Fit a generalized extreme value distribution to observed maxima; extrapolate to a desired exceedance probability.
- Randomized measurement: Use random initial cache states and inputs to explore the distribution of execution times.
Measurement-based WCET estimates are not guaranteed safe upper bounds (they are probabilistic), making them unsuitable for hard real-time certification unless combined with hardware models that bound the gap between observed and worst-case times.
5.5 Hardware Effects on WCET
5.5.1 Cache Analysis
Instruction and data caches reduce average memory access time but introduce timing variability. WCET cache analysis classifies each memory access into:
- Always hit (AH): The referenced block is always in cache at this point. Contribute cache-hit time.
- Always miss (AM): The referenced block is never in cache. Contribute cache-miss time.
- Not classified (NC): May or may not be cached. Conservatively use cache-miss time.
Abstract interpretation over cache states (May and Must analyses) computes these classifications.
5.5.2 Pipeline Timing Anomalies
Pipeline timing anomalies occur on processors where a local worst-case decision (e.g., taking a longer path) leads to a globally shorter execution time, because the longer path warms the pipeline or avoids a later stall. This means WCET analysis cannot simply add the worst-case times of individual blocks independently.
5.5.3 WCET of Memory-Mapped I/O Access
Access to memory-mapped peripherals may have non-deterministic latency if the peripheral uses wait states or bus arbitration. WCET analysis must account for worst-case peripheral response times, which are typically provided by hardware specifications.
Chapter 6: Clock-Driven Scheduling
6.1 Table-Driven (Cyclic Executive) Scheduling
Clock-driven scheduling (also called table-driven scheduling or the cyclic executive) precomputes a static schedule offline and executes it using a timer interrupt. The schedule is a complete timetable specifying which task runs at each time slot.
6.1.1 Frame-Based Scheduling
The cyclic executive divides time into frames of fixed length \( f \). At the start of each frame, the executor dispatches tasks in a predefined sequence. The hyperperiod \( H \) is the least common multiple of all task periods:
\[ H = \text{lcm}(T_1, T_2, \ldots, T_n) \]
The complete schedule repeats every hyperperiod, which is divided into \( H/f \) frames.
6.1.2 Frame Size Constraints
The frame length \( f \) must satisfy three constraints:
Fit constraint: Every task must fit within a single frame (or a specified number of frames). Formally, \( f \geq C_i \) for all \( i \) (assuming each task is assigned to one frame slot and executes contiguously).
Divisibility constraint: The hyperperiod must be an integer multiple of the frame length: \( H \mod f = 0 \).
Deadline constraint: Every job must complete before its deadline. For a task with period \( T_i \) and deadline \( D_i \), the frame length must satisfy:
\[ 2f - \gcd(T_i, f) \leq D_i \]
This ensures that in the worst case (a job arrives just after the beginning of a frame), it is scheduled in a later frame and still completes before its deadline.
Worked example: consider three tasks \( \tau_1, \tau_2, \tau_3 \) with periods 4, 5, and 20, where the longest WCET in the set is 5. Hyperperiod \( H = \text{lcm}(4, 5, 20) = 20 \). Candidate frame sizes that divide 20: 1, 2, 4, 5, 10, 20. The fit constraint requires \( f \geq 5 \). Deadline constraint for \( \tau_3 \): \( 2f - \gcd(20, f) \leq 20 \). At \( f = 5 \): \( 10 - 5 = 5 \leq 20 \). At \( f = 10 \): \( 20 - 10 = 10 \leq 20 \). Choose \( f = 5 \), the smallest valid frame, giving 4 frames per hyperperiod.
6.1.3 Schedule Construction
Given a valid frame size, the schedule is constructed by assigning task slices to frame slots such that each task’s jobs are assigned to frames before their deadlines. This is typically done offline by a schedule synthesis tool or by hand for small task sets.
6.2 Advantages and Limitations of the Cyclic Executive
Advantages:
- Completely deterministic and analyzable; no runtime scheduling overhead.
- Periodic task preemption is impossible within a frame; task jitter is bounded.
- No RTOS needed; the entire system is a single state machine driven by a timer ISR.
Limitations:
- Inflexible: adding or modifying a task may require recomputing the entire schedule.
- Aperiodic task handling is awkward; aperiodic tasks must be integrated via polling servers.
- Long hyperperiods (when periods are coprime) produce very large schedule tables.
- Cannot exploit unused slack time dynamically without offline analysis.
6.3 Handling Aperiodic Tasks
Aperiodic tasks can be integrated into a cyclic executive using a polling server: a periodic task \( \tau_{PS} \) with period \( T_{PS} \) and WCET \( C_{PS} \) that, whenever it executes, checks for pending aperiodic work and executes it up to \( C_{PS} \) units.
The aperiodic response time under a polling server depends on the server period and the aperiodic task arrival pattern.
Chapter 7: Priority-Driven Scheduling
7.1 Fixed-Priority Scheduling: Rate Monotonic
7.1.1 Rate-Monotonic (RM) Priority Assignment
Rate-Monotonic scheduling assigns each task a fixed priority in inverse relation to its period: the shorter the period (the higher the rate), the higher the priority. Ties may be broken arbitrarily. Priorities are assigned once, offline, and never change at runtime.
7.1.2 Optimality of RM
Theorem (Liu & Layland, 1973): Rate-Monotonic scheduling is optimal among all fixed-priority preemptive scheduling algorithms for periodic tasks with implicit deadlines (\( D_i = T_i \)).
The optimality proof uses an exchange argument: given any valid fixed-priority assignment that is not RM, we can swap the priorities of any two adjacent tasks (where “adjacent” means one has a higher priority than the other but a longer period) without introducing a deadline miss. Repeated application of this argument transforms any valid assignment into the RM assignment, showing that if any fixed-priority assignment can schedule the task set, RM can also schedule it.
7.1.3 RM Utilization Bound
Liu and Layland proved that a set of \( n \) implicit-deadline periodic tasks is schedulable under RM if
\[ U \leq n \left( 2^{1/n} - 1 \right) \]
As \( n \to \infty \), this bound converges to \( \ln 2 \approx 0.693 \).
Values of the bound for small \( n \):
| \( n \) | Bound |
|---|---|
| 1 | 1.000 |
| 2 | 0.828 |
| 3 | 0.780 |
| 4 | 0.757 |
| \( \infty \) | 0.693 |
The utilization bound is a sufficient condition only: task sets with utilization above the bound may still be schedulable. An exact fixed-priority test requires response-time analysis.
7.1.4 Response-Time Analysis
A tighter schedulability test for RM is response-time analysis (RTA), which computes the worst-case response time of each task iteratively.
For task \( \tau_i \) (with tasks ordered by priority, \( \tau_1 \) highest), the worst-case response time \( R_i \) satisfies:
\[ R_i = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j \]where \( hp(i) \) is the set of tasks with strictly higher priority than \( \tau_i \).
This equation is solved by successive approximation (fixed-point iteration):
\[ R_i^{(0)} = C_i, \qquad R_i^{(k+1)} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \]Iteration continues until \( R_i^{(k+1)} = R_i^{(k)} \) (convergence) or \( R_i^{(k+1)} > D_i \) (deadline miss). The task set is schedulable if \( R_i \leq D_i \) for all \( i \).
Example with two tasks: \( \tau_1 \) with \( C_1 = 3, T_1 = D_1 = 10 \), and \( \tau_2 \) with \( C_2 = 4, T_2 = D_2 = 20 \).
For \( \tau_1 \): \( R_1 = C_1 = 3 \leq D_1 = 10 \). Schedulable.
For \( \tau_2 \): \( R_2^{(0)} = 4 \). \( R_2^{(1)} = 4 + \lceil 4/10 \rceil \cdot 3 = 4 + 3 = 7 \). \( R_2^{(2)} = 4 + \lceil 7/10 \rceil \cdot 3 = 4 + 3 = 7 \). Converged: \( R_2 = 7 \leq D_2 = 20 \). Schedulable.
7.1.5 Deadline Monotonic Scheduling
When deadlines differ from periods (\( D_i \leq T_i \)), the optimal fixed-priority assignment is Deadline Monotonic (DM): assign higher priority to tasks with shorter relative deadlines. DM reduces to RM when \( D_i = T_i \).
7.2 Dynamic Priority Scheduling: Earliest Deadline First
7.2.1 EDF Algorithm
Earliest Deadline First assigns priorities dynamically: at every scheduling decision point, the ready job with the earliest absolute deadline executes. Priorities therefore change at runtime as jobs arrive and complete.
7.2.2 Optimality of EDF
Theorem: EDF is optimal among all preemptive uniprocessor scheduling algorithms (including dynamic-priority algorithms) for independent tasks with arbitrary deadlines.
The proof uses an exchange argument on two jobs \( J_a \) and \( J_b \) where \( J_a \) runs before \( J_b \) in a feasible schedule but \( d_a > d_b \) (later deadline runs first). Swapping their execution order cannot cause \( J_a \) to miss its deadline (it still finishes before its original finish time) and may only help \( J_b \) (which now runs sooner). This transformation preserves feasibility; hence EDF produces a feasible schedule whenever any algorithm can.
7.2.3 EDF Utilization Bound
A set of independent, preemptable, implicit-deadline periodic tasks is schedulable under EDF if and only if
\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \leq 1 \]
EDF thus achieves full processor utilization, unlike RM, which is limited to 69.3% in the worst case. This makes EDF theoretically superior, but practical considerations (dynamic priority management overhead, system behavior under overload) favor RM in many industrial systems.
7.2.4 EDF Under Overload
When \( U > 1 \), EDF behaves unpredictably: it is not guaranteed to preferentially miss deadlines of less important tasks. Under RM, overload causes the lowest-priority (longest-period) task to miss deadlines first, which may be acceptable. This predictable overload behavior is a significant practical advantage of fixed-priority scheduling.
7.3 Priority Inversion
7.3.1 The Priority Inversion Problem
Priority inversion occurs when a high-priority task is blocked waiting for a shared resource held by a low-priority task, and a medium-priority task preempts the low-priority task, indirectly blocking the high-priority task.
Example with three tasks \( \tau_H \), \( \tau_M \), \( \tau_L \) (in decreasing priority order) sharing a mutex \( m \):
- t=0: \( \tau_L \) acquires \( m \).
- t=1: \( \tau_H \) is released, preempts \( \tau_L \), tries to acquire \( m \), blocks.
- t=2: \( \tau_M \) is released. Since \( \tau_H \) is blocked and \( \tau_L \) has lower priority, \( \tau_M \) runs (it needs no shared resource).
- t=3: \( \tau_M \) completes. \( \tau_L \) resumes and releases \( m \).
- t=4: \( \tau_H \) acquires \( m \) and continues.
The high-priority task \( \tau_H \) was blocked for an arbitrarily long time by the medium-priority \( \tau_M \): not through direct resource contention, but because \( \tau_M \) preempted the lock holder \( \tau_L \).
Priority inversion caused a famous anomaly on the Mars Pathfinder mission (1997): a high-priority task repeatedly missed its deadline, triggering watchdog-initiated system resets. The fix was to enable priority inheritance on the offending mutex.
7.3.2 Priority Inheritance Protocol (PIP)
Under PIP, when a task blocks on a resource held by a lower-priority task, the holder temporarily inherits the priority of the highest-priority task it is blocking; it reverts to its base priority when it releases the resource.
PIP properties:
- Each lower-priority task can block a higher-priority task at most once per activation, for the duration of at most one of its critical sections.
- A task \( \tau_i \) can be blocked for at most the duration of \( \min(n_i, m_i) \) critical sections, where \( n_i \) is the number of lower-priority tasks that can block \( \tau_i \) and \( m_i \) is the number of distinct resources that can be used to block \( \tau_i \).
- PIP prevents unbounded priority inversion but allows chained blocking: a task can block on a sequence of resources, each released in turn.
- PIP does not prevent deadlock when tasks can hold multiple resources simultaneously.
7.3.3 Priority Ceiling Protocol (PCP)
Under PCP, each resource is assigned a priority ceiling equal to the highest priority of any task that may use it. A task may acquire a resource only if its priority is strictly higher than the ceilings of all resources currently held by other tasks.
PCP properties:
- Deadlock-free: Because a task can acquire a resource only when it cannot be blocked later by a task holding a conflicting resource.
- Bounded blocking: Each task can be blocked at most once per activation, for at most the duration of a single critical section of a lower-priority task.
- The maximum blocking time for task \( \tau_i \) is bounded by the longest critical section of any lower-priority task that accesses a resource with ceiling \( \geq \) priority of \( \tau_i \).
7.3.4 Stack Resource Policy (SRP)
The Stack Resource Policy extends PCP to systems where tasks share a single stack (common in uniprocessor embedded systems). Under SRP, a task is allowed to begin execution (preempt the current task) only if its priority is greater than the current system ceiling (the maximum priority ceiling of all acquired resources). This ensures that once a task begins, it will not be blocked by any lower-priority resource holder.
A key advantage of SRP: tasks that share the same stack can be co-scheduled without stack overflow, reducing memory requirements.
Chapter 8: Temporal Programming
8.1 Time in Embedded Software
Managing time correctly is one of the most error-prone aspects of embedded software. Common errors:
- Using busy-wait loops for delays (non-portable, breaks under CPU frequency changes).
- Ignoring timer overflow (wrap-around of 16-bit or 32-bit counters).
- Incorrect synchronization between hardware timers and software state.
8.2 Hardware Timers
Embedded microcontrollers provide hardware timers with the following features:
- Free-running counter: Increments at a fixed rate derived from the system clock.
- Compare register: Generates an interrupt when the counter matches a preset value.
- Capture register: Records the counter value when an external event occurs.
- PWM mode: Automatically toggles a GPIO at configurable intervals.
For precise periodic task activation, the timer compare register is updated incrementally (not reloaded from zero) to avoid cumulative drift:
// Non-drifting timer: update compare register incrementally
void timer_isr(void) {
COMPARE_REG += PERIOD; // add period, not reset to PERIOD
task_activate(&tau1);
}
8.3 Lee-Seshia Temporal Model
The Lee-Seshia textbook introduces timed automata and synchronous reactive models for embedded temporal programming:
- Synchronous model: All components react simultaneously to a common clock tick. Each tick, every component reads inputs and produces outputs. Time is logically instantaneous within a tick; physical time is between ticks.
- Timed automata: Finite automata augmented with real-valued clocks. Guards on transitions specify clock constraints; invariants on states bound how long the automaton can remain in a state.
These models underpin model-driven development (MDD) tools such as MATLAB/Simulink, SCADE, and Lustre.
8.4 POSIX Real-Time Extensions
The POSIX standard (IEEE 1003.1b) defines interfaces for real-time programming:
- `clock_gettime()`, `clock_nanosleep()`: high-resolution clocks and sleeps.
- `timer_create()`, `timer_settime()`: POSIX interval timers.
- `sched_setscheduler()`: set the scheduling policy (`SCHED_FIFO`, `SCHED_RR`, `SCHED_DEADLINE`).
- `SCHED_DEADLINE`: a Linux-specific implementation of EDF with CBS (Constant Bandwidth Server) for aperiodic tasks.
Chapter 9: Dependability
9.1 The Laprie Framework
Dependability is the ability of a system to deliver its specified service in a trustworthy manner. The Laprie framework (formalized in the 2004 Avizienis-Laprie-Randell paper) decomposes dependability into attributes, threats, and means.
9.1.1 Dependability Attributes
- Availability: Readiness for correct service. Measured as the fraction of time the system delivers correct service.
- Reliability: Continuity of correct service. The probability that the system delivers correct service over a specified interval.
- Safety: Absence of catastrophic consequences on the user and environment.
- Integrity: Absence of improper system state alterations.
- Maintainability: Ability to undergo modifications and repairs.
Confidentiality and security are later additions to the taxonomy, forming the security attribute cluster alongside integrity and availability.
9.1.2 Dependability Threats
Faults, errors, and failures form a causal chain:
\[ \text{Fault} \longrightarrow \text{Error} \longrightarrow \text{Failure} \]
- Fault: The hypothesized or actual cause of an error. A fault is dormant until it is activated by the system's computation.
- Error: The part of the system state that may cause a subsequent failure. An error is a manifestation of a fault within the system boundary.
- Failure: An event that occurs when the delivered service deviates from the specified service. A failure occurs when an error propagates to the system output.
9.1.3 Fault Classes
The Avizienis-Laprie taxonomy classifies faults along multiple dimensions:
| Dimension | Classes |
|---|---|
| Phase of creation | Development faults, Operational faults |
| System boundaries | Internal, External |
| Phenomenological cause | Natural, Human-made |
| Dimension | Hardware, Software |
| Objective | Malicious, Non-malicious |
| Intent | Deliberate, Non-deliberate |
| Capability | Accidental, Incompetence |
| Persistence | Permanent, Transient, Intermittent |
Transient faults are caused by temporary conditions (cosmic rays, electromagnetic interference) and are often the hardest to reproduce. Intermittent faults recur due to marginal hardware conditions (e.g., a loose connector). Permanent faults persist until repaired.
9.1.4 Dependability Means
The means to achieve dependability:
- Fault prevention: Design to prevent faults from being introduced. Code reviews, formal methods, design patterns.
- Fault tolerance: Allow the system to continue correct operation in the presence of faults. Redundancy, error correction.
- Fault removal: Find and eliminate faults. Testing, verification, debugging.
- Fault forecasting: Estimate the number and impact of future faults. Reliability modeling, FMEA.
9.2 Reliability Quantification
9.2.1 Basic Reliability Metrics
Let \( F(t) \) be the cumulative failure distribution function and \( R(t) = 1 - F(t) \) the reliability function (survival function). The failure rate (hazard rate) is:
\[ \lambda(t) = \frac{f(t)}{R(t)} \]

where \( f(t) = dF/dt \) is the failure density.
For the exponential distribution (constant failure rate \( \lambda \)):
\[ R(t) = e^{-\lambda t}, \qquad \text{MTTF} = \frac{1}{\lambda} \]

where MTTF is the Mean Time To Failure.
9.2.2 Availability
For a repairable system with mean time to failure MTTF and mean time to repair MTTR:
\[ A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \]

High availability systems (e.g., 99.999% = “five nines”) require either very high MTTF or very low MTTR.
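The availability formula can also be turned around to compute the downtime budget a given availability target allows; a small C sketch:

```c
/* Steady-state availability: A = MTTF / (MTTF + MTTR). */
double availability(double mttf_h, double mttr_h) {
    return mttf_h / (mttf_h + mttr_h);
}

/* Downtime budget, in minutes per year, implied by availability a. */
double downtime_min_per_year(double a) {
    return (1.0 - a) * 365.0 * 24.0 * 60.0;
}
```

Five nines (A = 0.99999) works out to roughly 5.3 minutes of downtime per year, which makes clear why such systems need fast automated recovery rather than manual repair.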
9.2.3 The Bathtub Curve
The failure rate of physical hardware typically follows a bathtub curve:
- Infant mortality (decreasing \( \lambda \)): Early defects exposed during burn-in.
- Useful life (constant \( \lambda \)): Random failures; exponential distribution applies.
- Wear-out (increasing \( \lambda \)): Fatigue, corrosion; Weibull distribution applies.
Chapter 9b: Failure Mode and Effects Analysis (FMEA)
9b.1 Purpose and Scope
FMEA (Failure Mode and Effects Analysis) is a systematic, bottom-up technique for identifying potential failure modes of system components, analyzing their effects on system behavior, and prioritizing corrective actions.
FMEA is mandated by safety standards including IEC 61508, ISO 26262 (automotive), and DO-178C (aviation). It is typically performed early in the design phase and updated as the design evolves.
9b.2 FMEA Procedure
The standard FMEA process:
- Identify the system boundary and functions: Define what the system does and what counts as a failure.
- Enumerate failure modes: For each component, list the ways it can fail (e.g., “open circuit,” “short circuit,” “stuck at 0,” “produces incorrect output”).
- Analyze effects: For each failure mode, determine the effect on the immediate function, subsystem, and overall system.
- Determine detection: Identify how the failure mode is detected (during design, test, or field operation).
- Assign RPN: Calculate the Risk Priority Number.
9b.3 Risk Priority Number (RPN)
\[ \text{RPN} = S \times O \times D \]

where:
- S (Severity): Severity of the failure effect on a scale of 1–10 (1 = no effect, 10 = catastrophic/safety-critical).
- O (Occurrence): Likelihood that the failure mode will occur, on a scale of 1–10 (1 = almost impossible, 10 = almost certain).
- D (Detectability): Ability to detect the failure before it affects the customer/system, on a scale of 1–10 (1 = easily detected, 10 = undetectable).
RPN ranges from 1 to 1000. Failure modes with high RPN (typically > 100) require corrective action.
Example (ABS brake-pressure sensor):
- Failure mode: Sensor output stuck at 0 (no pressure reading).
- Effect: ABS controller interprets no brake application; ABS does not engage when it should.
- S: 9 (potential loss of vehicle control). O: 3 (relatively rare; sensor is robust). D: 4 (some diagnostic coverage via redundant sensors).
- RPN = 9 × 3 × 4 = 108. Corrective action: add sensor redundancy or a plausibility check.
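The RPN computation reduces to a few lines of code; a C sketch using the ABS sensor numbers above (the struct layout and the > 100 action threshold follow the text; field names are illustrative):

```c
/* One FMEA entry: severity, occurrence, detectability, each 1-10. */
typedef struct {
    const char *mode;   /* failure mode description */
    int s, o, d;        /* S, O, D ratings */
} FailureMode;

/* Risk Priority Number: RPN = S * O * D, range 1..1000. */
int rpn(const FailureMode *fm) {
    return fm->s * fm->o * fm->d;
}

/* Flag modes above the common corrective-action threshold. */
int needs_action(const FailureMode *fm) {
    return rpn(fm) > 100;
}
```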
9b.4 DFMEA vs. PFMEA
- DFMEA (Design FMEA): Focuses on design choices and their failure modes. Performed by design engineers.
- PFMEA (Process FMEA): Focuses on manufacturing and assembly process failures. Performed by manufacturing engineers.
9b.5 FMEA Limitations
- FMEA is single-fault analysis: it considers one failure at a time and may miss combinations of failures (covered by FMECA — Failure Mode, Effects, and Criticality Analysis, or FTA — Fault Tree Analysis).
- RPN is not a probability: two failure modes with the same RPN may have very different risk profiles if their severity, occurrence, and detectability differ.
- FMEA is only as good as the expertise of the team performing it and the completeness of the failure mode enumeration.
Chapter 10: Safety-Critical Systems Standards
10.1 The Safety Integrity Level (SIL) Framework
IEC 61508 is the foundational functional safety standard for electrical/electronic/programmable electronic safety-related systems. It defines four Safety Integrity Levels (SIL) based on the required probability of dangerous failure per hour (for continuously operated systems) or per demand (for on-demand systems).
| SIL | Probability of Failure on Demand (PFD) | Probability of Dangerous Failure per Hour (PFH) |
|---|---|---|
| SIL 1 | \( 10^{-2} \) to \( 10^{-1} \) | \( 10^{-6} \) to \( 10^{-5} \) |
| SIL 2 | \( 10^{-3} \) to \( 10^{-2} \) | \( 10^{-7} \) to \( 10^{-6} \) |
| SIL 3 | \( 10^{-4} \) to \( 10^{-3} \) | \( 10^{-8} \) to \( 10^{-7} \) |
| SIL 4 | \( 10^{-5} \) to \( 10^{-4} \) | \( 10^{-9} \) to \( 10^{-8} \) |
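The PFH column of the table can be encoded as a simple lookup for continuous-mode systems; a C sketch (band edges taken from the table; how to treat a value exactly on a band edge is a convention choice here):

```c
/* Map a probability of dangerous failure per hour (continuous mode)
 * to the IEC 61508 SIL band it falls in.
 * Returns 0 if pfh is too high to meet even SIL 1. */
int sil_from_pfh(double pfh) {
    if (pfh < 1e-8) return 4;   /* [1e-9, 1e-8): SIL 4 (or better) */
    if (pfh < 1e-7) return 3;   /* [1e-8, 1e-7): SIL 3 */
    if (pfh < 1e-6) return 2;   /* [1e-7, 1e-6): SIL 2 */
    if (pfh < 1e-5) return 1;   /* [1e-6, 1e-5): SIL 1 */
    return 0;                   /* does not meet any SIL */
}
```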
Higher SIL requires more rigorous development processes, higher test coverage, more formal methods use, and more extensive documentation.
10.2 Sector-Specific Standards
IEC 61508 is the generic standard; sector-specific standards adapt it for particular domains:
- ISO 26262: Automotive functional safety. Uses Automotive Safety Integrity Levels (ASIL A–D), analogous to SIL 1–4, with ASIL D being most stringent. Applies to passenger vehicles.
- DO-178C: Software considerations in airborne systems and equipment certification. Defines Design Assurance Levels (DAL A–E) with DAL A for catastrophic failure conditions.
- EN 50128: Railway safety. Defines SIL-based requirements for software in railway control systems.
- IEC 62443: Industrial automation and cybersecurity. Defines Security Levels (SL 1–4).
10.3 Software Development Process Requirements
IEC 61508 Part 3 (software) mandates specific development activities proportional to SIL:
| Technique | SIL 1 | SIL 2 | SIL 3 | SIL 4 |
|---|---|---|---|---|
| Structured design | Recommended | Recommended | Mandatory | Mandatory |
| Formal methods | Optional | Recommended | Recommended | Mandatory |
| Static analysis | Recommended | Mandatory | Mandatory | Mandatory |
| Proof of correctness | Not required | Optional | Recommended | Mandatory |
| Software FMEA | Optional | Recommended | Recommended | Mandatory |
10.4 WCET Analysis in Safety Standards
Safety standards for real-time systems require that task WCETs be established with a defined confidence level. IEC 61508 does not mandate a specific WCET analysis method, but the evidence package must demonstrate that timing requirements are met. DO-178C (aviation) requires structural coverage analysis (MC/DC — Modified Condition/Decision Coverage) of safety-critical code, which indirectly constrains the WCET analysis scope.
10.5 Hazard Analysis and Risk Assessment
Before FMEA, a system-level hazard analysis identifies hazardous situations and quantifies their risk. The standard risk graph maps severity and frequency of hazard occurrence to required SIL. Techniques include:
- HAZOP (Hazard and Operability Study): Structured review using guide words (No, More, Less, As Well As, Part Of, Reverse, Other Than) to identify deviations from design intent.
- FTA (Fault Tree Analysis): Top-down, deductive analysis. Starts from an undesired top-level event (failure) and deduces combinations of lower-level events (using AND/OR gates) that lead to it.
- ETA (Event Tree Analysis): Bottom-up, inductive. Starts from an initiating event and traces possible outcomes through a tree of succeeding events.
Chapter 11: Software Fault Tolerance
11.1 Motivation
Hardware redundancy alone is insufficient for high dependability: software bugs are a leading cause of system failures, and identical copies of software contain identical bugs. Software fault tolerance techniques specifically address the presence of software faults by providing redundant implementations or checkpointing mechanisms.
11.2 Design Diversity
11.2.1 N-Version Programming (NVP)
In N-version programming, \( N \) functionally equivalent versions are developed independently from a common specification and executed on the same input; a voter adjudicates their outputs. The voter uses majority voting (for \( N \geq 3 \)): the output value agreed upon by the majority of versions is accepted. For \( N = 3 \), the system tolerates one incorrect version; for \( N = 2k+1 \), it tolerates \( k \) simultaneous incorrect versions.
Reliability under NVP (assuming statistically independent version failures with probability \( p \)):
For \( N = 3 \):
\[ R_{NVP} = R^3 + 3R^2(1-R) = R^2(3-2R) \]

where \( R = 1-p \) is the probability that a single version produces a correct output for a given input.
Criticisms of NVP:
- Independence assumption: Empirical studies (Knight & Leveson, 1986) found that independently developed versions fail on the same inputs more often than the independence assumption predicts — a phenomenon called correlated failures.
- Specification faults: If the specification is incorrect, all versions inherit the same specification fault. NVP provides no protection against specification errors.
- Cost: Developing \( N \) versions is \( N \) times the development cost.
11.2.2 Recovery Blocks
Structure:
```
ensure   <acceptance test>
by       <primary module>
else by  <alternate module 1>
else by  <alternate module 2>
else error
```
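The same structure can be sketched in C; the function-pointer layout and the example modules below are illustrative, not part of the original recovery-block notation:

```c
#include <stdbool.h>

/* A module computes a candidate result; the acceptance test checks
 * plausibility and must be independent of the modules' design. */
typedef bool (*module_fn)(double in, double *out);
typedef bool (*accept_fn)(double in, double out);

/* Try primary and alternates in order until one passes the
 * acceptance test; return false if all fail ("else error"). */
bool recovery_block(module_fn mods[], int n, accept_fn accept,
                    double in, double *out) {
    for (int i = 0; i < n; i++) {
        double candidate;
        if (mods[i](in, &candidate) && accept(in, candidate)) {
            *out = candidate;
            return true;
        }
    }
    return false;
}

/* Illustrative modules: the primary yields an implausible value,
 * the alternate a plausible one. */
bool primary_bad(double in, double *out)  { (void)in; *out = -1.0;     return true; }
bool alternate_ok(double in, double *out) {            *out = in * 2.0; return true; }
bool in_range(double in, double out)      { (void)in; return out >= 0.0; }
```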
Comparison with NVP:
- Recovery blocks are sequential (alternates tried one at a time) vs. NVP (parallel execution).
- Recovery blocks rely on an acceptance test rather than a voter; the acceptance test must be less expensive than the primary module and must be independent of the primary module’s design.
- Recovery blocks have lower overhead when faults are rare (only the primary runs in the common case) but higher latency when failures occur.
11.2.3 Consensus Recovery Blocks
A hybrid approach combining NVP and recovery blocks: multiple versions run in parallel; if they disagree (detected by a voter), the acceptance test is used to select the correct output.
11.3 Checkpointing and Rollback
Checkpointing periodically saves the system state to stable storage. When a failure is detected, the system rolls back to the most recent valid checkpoint and re-executes from there.
11.3.1 Checkpointing in Real-Time Systems
In real-time systems, checkpointing introduces overhead and constraints:
- Checkpoint cost: Saving state takes time, adding to task execution time. If checkpoint overhead is significant, WCET must include it.
- Recovery time: Rolling back and re-executing takes time. The task must complete within its deadline even after a rollback, which constrains the checkpointing interval.
- Optimal checkpoint interval: Given a task with WCET \( C \), failure probability \( p \), and checkpoint overhead \( \delta \), the optimal checkpoint interval minimizes expected execution time.
11.3.2 Coordinated vs. Uncoordinated Checkpointing
In distributed embedded systems:
- Coordinated checkpointing: All nodes save state simultaneously, ensuring a globally consistent snapshot. Simple but requires coordination overhead.
- Uncoordinated checkpointing: Nodes save state independently; recovery requires rolling back to a consistent cut (the “domino effect” problem — rolling back one node may require others to roll back further).
11.4 Error Detection Techniques
Before fault tolerance can act, errors must be detected:
- Duplicate execution with comparison: Execute the same computation twice (on separate hardware or time-slices) and compare results. Detects transient hardware faults.
- Acceptance tests: Verify that output satisfies plausibility constraints (e.g., a computed value is within a physical range).
- Watchdog timers: A hardware timer that resets the system unless periodically “kicked” by the software. Detects software hangs and infinite loops.
- Cyclic Redundancy Checks (CRC): Detect memory corruption and communication errors.
- Parity and ECC: Parity detects single-bit errors; SECDED ECC in DRAM corrects single-bit errors and detects double-bit errors.
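As an illustration of the CRC bullet, a bitwise CRC-8 (polynomial 0x07, a common choice for short embedded frames) in C:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-8 with polynomial 0x07 and initial value 0x00,
 * suitable for detecting corruption in short frames or memory
 * regions.  Any change within an 8-bit burst yields a new CRC. */
uint8_t crc8(const uint8_t *data, size_t len) {
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}
```

Production code typically uses a 256-entry lookup table instead of the inner bit loop to trade 256 bytes of ROM for an 8x reduction in per-byte work.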
11.5 Redundancy Configurations
11.5.1 Triple Modular Redundancy (TMR)
Three identical hardware modules and a voter. The voter selects the majority output. TMR tolerates one module failure.
Reliability:
\[ R_{TMR}(t) = R(t)^3 + 3R(t)^2\left(1-R(t)\right) \]

TMR improves reliability when \( R(t) > 0.5 \) and degrades it when \( R(t) < 0.5 \) (the voter introduces additional failure probability). For long mission times, TMR may be worse than simplex.
11.5.2 Standby Redundancy
One primary module operates; one or more standbys are available. On failure of the primary, a standby takes over. Types:
- Hot standby: Standby runs synchronously with the primary; fast switchover.
- Warm standby: Standby is powered but not fully synchronized; medium switchover time.
- Cold standby: Standby is powered off; must boot before taking over.
Chapter 12: Embedded System Security
12.1 Security in Embedded Context
Embedded systems are increasingly connected (IoT, Industrial IoT, automotive) and face security threats that were historically not a concern for isolated embedded systems. The convergence of IT security with operational technology (OT) security creates a distinct set of challenges.
Key security concerns specific to embedded systems:
- Physical access: Attackers may have physical access to the device; extracting firmware via JTAG, reading flash memory, or hardware reverse engineering is possible.
- Limited resources: Cryptographic algorithms designed for servers may be too computationally expensive for low-power MCUs.
- Long lifetimes: Industrial embedded systems may operate for 20–30 years; software update mechanisms must be available but secure.
- Safety-security interaction: Security countermeasures (e.g., encryption latency, authentication delays) may violate real-time timing constraints and compromise safety.
12.2 Threat Modeling for Embedded Systems
STRIDE is a systematic threat classification:
| Threat | Description | Example |
|---|---|---|
| Spoofing | Impersonating a legitimate entity | Malicious sensor injecting fake readings |
| Tampering | Modifying data or code | Overwriting firmware via JTAG |
| Repudiation | Denying an action occurred | Device log alteration |
| Information disclosure | Unauthorized data access | Side-channel attack on cryptographic key |
| Denial of service | Preventing legitimate service | Flooding CAN bus with high-priority messages |
| Elevation of privilege | Gaining unauthorized capabilities | Exploiting buffer overflow to execute arbitrary code |
12.3 Common Embedded Attack Vectors
12.3.1 Memory Safety Vulnerabilities
Embedded systems often use C without memory-safe abstractions. Buffer overflows, format string vulnerabilities, and integer overflows are prevalent. On systems without an MMU (e.g., ARM Cortex-M0), there is no virtual memory isolation between tasks, making stack smashing attacks more impactful.
Mitigations:
- Use safe C subsets (MISRA-C, CERT-C).
- Enable stack canaries (compiler support: `-fstack-protector`).
- Use the ARM MPU to enforce memory region permissions.
- Static analysis tools (e.g., Polyspace, Coverity).
12.3.2 Side-Channel Attacks
Side-channel attacks exploit physical characteristics of the implementation rather than mathematical weaknesses:
- Power analysis: Measure power consumption during cryptographic operations to infer secret keys.
- Timing attacks: Measure execution time of conditional branches that depend on secret data.
- Electromagnetic (EM) analysis: Measure EM emissions correlated with computation.
Countermeasures: constant-time implementations, power-analysis resistant cryptographic libraries, hardware crypto accelerators with masking.
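A standard mitigation for the timing channel in secret-dependent comparisons is a constant-time equality check; a C sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Constant-time byte comparison: execution time does not depend on
 * where (or whether) the buffers differ.  By contrast, memcmp may
 * return at the first mismatch, leaking the length of the matching
 * prefix (e.g., of a MAC or password hash) through timing. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];
    return diff == 0;
}
```

Note that the compiler must not be allowed to re-introduce an early exit; in practice the accumulator is often marked `volatile` or the routine is taken from a vetted cryptographic library.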
12.3.3 Firmware Extraction and Reverse Engineering
Attackers read out firmware from flash memory (via JTAG, SWD, or chip decapping) and reverse-engineer it to find vulnerabilities or extract secrets. Countermeasures:
- Enable JTAG lockout (disable after production programming).
- Use hardware-enforced read-back protection (e.g., ARM TrustZone, STM32 RDP levels).
- Encrypt firmware at rest using a device-unique hardware key (PUF-based or OTP).
12.4 Secure Boot and Code Integrity
Secure boot ensures that only authenticated, unmodified firmware executes. Stages:
- Boot ROM (immutable, in chip): Contains root public key or hash. Verifies the first-stage bootloader signature.
- First-stage bootloader: Verifies the second-stage bootloader or OS kernel.
- Application: Verified before execution.
Each stage extends a chain of trust. If any link fails verification, boot halts or falls back to recovery mode.
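The chain-of-trust walk can be sketched in C; the `BootStage` layout and the `verify` callback are illustrative (a real implementation would call the SoC's crypto hardware or a signature-verification library):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One stage in the chain of trust.  The verify callback is assumed
 * to check the stage image against its signature using the key
 * trusted by the previous stage (hypothetical interface). */
typedef struct {
    const uint8_t *image;
    size_t len;
    bool (*verify)(const uint8_t *image, size_t len);
} BootStage;

/* Walk the chain in order; refuse to proceed past the first stage
 * that fails verification (halt or enter recovery mode). */
bool secure_boot(const BootStage stages[], int n) {
    for (int i = 0; i < n; i++)
        if (!stages[i].verify(stages[i].image, stages[i].len))
            return false;
    return true;
}

/* Stub verifiers for demonstration only. */
bool accept_all(const uint8_t *img, size_t len) { (void)img; (void)len; return true; }
bool reject_all(const uint8_t *img, size_t len) { (void)img; (void)len; return false; }
```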
12.5 Cryptographic Considerations for Embedded
Symmetric algorithms (AES-128, ChaCha20) are feasible even on low-power MCUs, especially with hardware acceleration. Public-key cryptography (RSA-2048, ECDSA) is expensive; on a Cortex-M4 without hardware acceleration, an RSA-2048 signing operation takes hundreds of milliseconds, which may violate real-time constraints.
Practical approach: use public-key cryptography only during boot/provisioning; use symmetric keys with pre-shared or derived session keys for runtime operation.
12.6 Safety and Security Co-Analysis
Safety and security are not orthogonal: a security attack can compromise safety, and safety countermeasures may create security vulnerabilities. For example:
- A safety watchdog that reboots the system on timeout may allow an attacker to trigger controlled reboots (denial of safety service).
- Redundant communication for safety (CAN bus broadcast) amplifies the attack surface for injection attacks.
- Encryption delays may cause deadline misses in safety-critical controllers.
Standards addressing this interaction include IEC 62443 (industrial cybersecurity) and emerging work on ISO 21434 (automotive cybersecurity, complementary to ISO 26262).
Appendix A: Key Formulas Reference
A.1 Scheduling
RM Utilization Bound:
\[ U_{RM} = n\left(2^{1/n}-1\right) \xrightarrow{n\to\infty} \ln 2 \approx 0.693 \]

EDF Necessary and Sufficient Condition (implicit deadlines):
\[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \leq 1 \]

Response-Time Analysis (fixed-point iteration):
\[ R_i^{(k+1)} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \]

Cyclic Executive Frame Constraint:
\[ 2f - \gcd(T_i, f) \leq D_i \quad \text{for all } i \]

A.2 Dependability
Reliability (exponential):
\[ R(t) = e^{-\lambda t}, \qquad \text{MTTF} = \frac{1}{\lambda} \]

Steady-State Availability:
\[ A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \]

TMR Reliability:
\[ R_{TMR}(t) = 3R(t)^2 - 2R(t)^3 \]

FMEA Risk Priority Number:
\[ \text{RPN} = S \times O \times D \]

A.3 WCET
IPET ILP (objective):
\[ \text{maximize} \quad \sum_{b \in V} c_b \cdot x_b \]

subject to flow conservation, loop bounds, and infeasible path constraints.
Appendix B: Glossary of Key Terms
| Term | Definition |
|---|---|
| WCET | Worst-Case Execution Time: upper bound on program execution time on given hardware |
| RTOS | Real-Time Operating System: OS providing guaranteed scheduling and timing |
| ISR | Interrupt Service Routine: code executed in response to a hardware interrupt |
| IPET | Implicit Path Enumeration Technique: ILP-based WCET computation |
| RM | Rate Monotonic: fixed-priority scheduling assigning priority inverse to period |
| EDF | Earliest Deadline First: dynamic-priority scheduling by absolute deadline |
| DM | Deadline Monotonic: fixed-priority scheduling assigning priority inverse to relative deadline |
| PIP | Priority Inheritance Protocol: lock-holder inherits blocker’s priority |
| PCP | Priority Ceiling Protocol: task blocked unless its priority exceeds all resource ceilings |
| SRP | Stack Resource Policy: single-stack variant of PCP |
| NVP | N-Version Programming: \( N \) independently developed software versions voted upon |
| TMR | Triple Modular Redundancy: three hardware modules with majority voter |
| SIL | Safety Integrity Level: IEC 61508 classification of safety requirements |
| ASIL | Automotive Safety Integrity Level: ISO 26262 classification for automotive systems |
| FMEA | Failure Mode and Effects Analysis: systematic fault identification and risk ranking |
| RPN | Risk Priority Number: \( S \times O \times D \) composite FMEA metric |
| FTA | Fault Tree Analysis: top-down deductive failure analysis |
| HAZOP | Hazard and Operability Study: structured review using guide words |
| ETM | Embedded Trace Macrocell: ARM hardware for non-intrusive instruction tracing |
| MPU | Memory Protection Unit: hardware enforcing memory region permissions without full MMU |