CS 452: Real-Time Programming

Martin Karsten



Sources and References

Primary textbooks — Liu, Jane W. S. Real-Time Systems. Prentice Hall, 2000. Buttazzo, Giorgio. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Springer, 4th ed., 2024. Arpaci-Dusseau, Remzi and Andrea. Operating Systems: Three Easy Pieces. University of Wisconsin, freely available at pages.cs.wisc.edu/~remzi/OSTEP/.

Supplementary texts — Tanenbaum, Andrew and Herbert Bos. Modern Operating Systems, 5th ed. Pearson, 2023. Hennessy, John and David Patterson. Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann, 2017.

Hardware references — ARM. ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile (DDI 0487). ARM Holdings. ARM. Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64). Broadcom. BCM2711 ARM Peripherals. Microchip Technology. MCP2515 Stand-Alone CAN Controller with SPI Interface (DS20001801). Märklin. CAN-Protokoll Beschreibung CS2/CS3 Version 2.0.

Online resources — MIT CSAIL. xv6: a simple, Unix-like teaching operating system and 6.S081 Operating Systems Engineering course materials. CMU. 18-349 Introduction to Embedded Systems. UIUC. CS 423 Operating Systems Design. Liu, C. L. and James W. Layland. “Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment.” Journal of the ACM 20, no. 1 (1973): 46–61. Lehoczky, John, Lui Sha, and Y. Ding. “The Rate Monotonic Scheduling Algorithm: Exact Characterization and Average Case Behavior.” Proceedings of the 10th IEEE Real-Time Systems Symposium, 1989.


Part I: Real-Time Systems Foundations

Chapter 1: What Real-Time Means

The phrase “real-time” is used casually to mean fast, but in the engineering literature it means something far more precise and far more demanding. A real-time system is a computing system whose correctness depends not only on the logical correctness of its outputs but on the time at which those outputs are produced. A system that computes the right answer too late has failed, even if the computation itself was flawless. This is the central axiom from which everything else in this course follows.

Understanding this definition requires carefully distinguishing between three qualities that are often conflated: speed, responsiveness, and timeliness. A fast system produces results quickly on average. A responsive system produces results quickly in the common case. A real-time system guarantees that results are produced within a specified deadline in all cases — including the worst case. The distinction matters enormously. A general-purpose web server that handles the median request in ten milliseconds but occasionally spends a second on garbage collection is fast and often responsive, but it is not real-time. A pacemaker that delivers a cardiac pulse within two milliseconds of the scheduled beat, every beat, with no exceptions, is a real-time system — even though it is not fast in the sense of performing complex computation.

Hard, Firm, Soft, and Near Real-Time

The literature recognizes a spectrum of timing requirements, and it is worth being precise about each.

A hard real-time system is one in which missing a deadline constitutes a system failure. The consequences of lateness are catastrophic and cannot be recovered from. The anti-lock braking system in a modern car must apply differential brake pressure within a fixed window when wheel slip is detected; if the computation is delayed by even a few milliseconds, the wheel locks and the driver loses directional control. A fly-by-wire flight control system must update its actuator commands faster than the aerodynamic time constants of the aircraft; a missed deadline can cause an aerodynamic instability from which the aircraft cannot recover. For hard real-time systems, timing guarantees must be proved, not estimated. Probabilistic arguments are insufficient.

A firm real-time system is one in which the value of a late result is zero, but missing a deadline does not cause catastrophic harm. A video conferencing codec that delivers a frame ten milliseconds too late is useless — the frame can be dropped — but the conversation continues. A financial trading system that fills an order after the market has moved may lose money but does not endanger lives. The distinction from soft real-time is subtle: in a firm system, late results are worthless; in a soft system, they are merely degraded in value.

A soft real-time system is one in which timeliness is a quality-of-service concern rather than a correctness requirement. Late results reduce quality but do not constitute a system failure. A music streaming service that occasionally buffers is annoying but functional. An interactive game that drops from 120 to 60 frames per second during an explosion is less pleasant but still playable. Most consumer software is soft real-time at best.

The phrase near real-time is sometimes used in industrial contexts to describe systems with latencies in the range of seconds to minutes, where “real-time” primarily indicates continuous processing rather than batch processing. This usage is too weak to be useful in an engineering discussion.

This course is primarily concerned with hard real-time systems, specifically the microkernel that drives physical hardware with deterministic timing requirements. The trains will behave incorrectly — collide, overshoot, stall — if the kernel does not respond to sensor events within bounded time. The stakes are not lives but the principle is the same: correctness requires timing guarantees, not timing hopes.

Why Timing Is Correctness

The insight that timing is part of correctness is counterintuitive to programmers trained on sequential algorithms, where correctness is purely a function of the relationship between input and output. It becomes obvious in the context of physical systems. Consider a train moving at 0.5 metres per second. At that speed, it traverses 5 centimetres in 100 milliseconds. If the kernel’s worst-case latency for processing a sensor event is 50 milliseconds, the train’s position uncertainty at the moment of response is 2.5 centimetres — a physically meaningful error. If the latency grows to 500 milliseconds, the uncertainty is 25 centimetres, which on a scale model track is the difference between successfully stopping before a bumper and embedding the locomotive in it.

The physical world is governed by differential equations, not by computational complexity. It does not pause while the kernel context-switches. The relationship between computation and the physical process it controls is mediated by time, and time is therefore a first-class correctness dimension.

This has a practical corollary. In a real-time system, it is not enough to know that something will eventually happen; you must know that it will happen by a specific moment. The kernel design choices made in this course — static priorities, no shared memory, bounded system call latency — are all engineering consequences of this requirement.

Real-Time vs. General-Purpose Operating Systems

A general-purpose operating system like Linux is optimized for average-case throughput. It uses techniques — address space layout randomization, transparent huge pages, dynamic frequency scaling, opportunistic preemption, the completely fair scheduler — that improve the common case at the potential expense of the worst case. A real-time OS makes the opposite trade: it accepts that average performance may be lower, but it guarantees that the worst case is bounded.

This affects nearly every design decision. Linux uses dynamic memory allocation extensively because it amortizes well across millions of requests. A hard real-time kernel avoids malloc entirely because the worst-case latency of a heap allocation — which may trigger compaction, or wait for a slow page fault — is unpredictable and potentially unbounded. Linux uses complex scheduler algorithms (CFS, EEVDF) because they are fair across thousands of threads. This kernel uses a simple static-priority FIFO scheduler because its worst-case response time is tractable to analyze. Linux permits priority inversion on shared resources because the common case is fast. This kernel uses Send/Receive/Reply message passing because it eliminates shared resources and, with them, the possibility of unbounded blocking.

None of these choices make the RT kernel “better” than Linux. They make it correct for a different optimization target. Understanding the target is understanding the course.

The Course: Build the Stack

CS 452’s answer to the challenge of real-time programming is radical: build the entire software stack from bare metal. You boot the Raspberry Pi 4 directly into your code, with no Linux underneath you, no libc above you, and no libraries in between. Your kernel allocates no memory at run time. Your scheduler runs in a deterministic number of instructions. Your I/O is handled by interrupt-driven servers that communicate through message passing. At the top of this stack sits a real-time application — a controller for a physical Märklin model railway — that must track train positions, compute stopping distances, reserve track sections, and coordinate multiple moving trains, all with the timing guarantees that physical hardware demands.

Building this system from scratch is the only way to develop genuine intuition about what real-time programming requires. Reading about context switching is not the same as writing the assembly that saves and restores every register. Knowing that priority inversion is a problem is not the same as watching a high-priority task starve because a lower-priority task is holding a resource it cannot release. The trains are not decoration; they are the oracle. A kernel that is logically correct but temporally imprecise will misbehave on the track in exactly the ways that timing theory predicts.

A Brief History of Real-Time Operating Systems

The field of real-time operating systems has a history interleaved with aerospace, military computing, and industrial automation. Understanding the historical arc clarifies why certain design decisions — like the RMS scheduler or the SRR communication model — emerged when and where they did.

The first real-time systems were not recognized as such. The SABRE airline reservation system (1960), jointly developed by IBM and American Airlines, processed seat-reservation requests in “real time” in the sense that a travel agent could confirm a booking in seconds rather than hours. It did not have hard deadlines in the engineering sense, but it established the idea that computers could interact with the physical world faster than humans.

The Apollo Guidance Computer (AGC, 1966) is a more direct ancestor. Designed by the MIT Instrumentation Laboratory, the AGC had to simultaneously control spacecraft attitude, process crew inputs, navigate by stellar sightings, and manage engine ignition — all on a roughly 2 MHz clock with 2,048 words of erasable core memory and 36,864 words of fixed rope memory. Its operating system (developed by Hal Laning and Ed Copps) used a priority-based multitasking scheme with cooperative scheduling among up to eight concurrent jobs. The famous AGC alarm “1202” during Apollo 11’s lunar descent indicated that the Executive (the scheduler) was overloaded, with more tasks requesting time than could be serviced within their deadlines. The system correctly prioritized flight-critical tasks and allowed navigation-aid tasks to be dropped, and the landing proceeded safely. This is an early recorded instance of graceful degradation under overload — a concept the Liu/Layland framework would formalize seven years later.

The 1970s and 1980s saw the emergence of commercial RTOSs. iRMX (1976, Intel), designed for the Intel 8080 and 8086, provided task management, inter-task communication, and I/O handling for industrial control applications. It introduced many concepts that remain standard: task priority, semaphores for synchronization, and file systems for embedded storage. QNX (1980, Quantum Software Systems, now BlackBerry QNX) took a more radical approach: a microkernel with message-passing IPC, designed to run on the IBM PC. QNX’s message-passing design directly influenced the CS 452 kernel architecture — the Send/Receive/Reply model in CS 452 is essentially QNX’s MsgSend/MsgReceive/MsgReply API, adapted for a simpler kernel without virtual memory.

VxWorks (1987, Wind River Systems) became the dominant RTOS in aerospace and industrial applications through the 1990s. It supported POSIX threads (pthreads), a BSD networking stack, and a wide range of hardware. It was the RTOS on Mars Pathfinder (1997), the Cassini spacecraft, and many other missions. VxWorks’s task scheduler uses fixed priorities with optional round-robin at equal priorities, closely matching the CS 452 approach.

The academic community was simultaneously developing the theoretical foundations. Liu and Layland’s 1973 paper (Chapter 15) provided the schedulability analysis framework. Lehoczky, Sha, and Ding (1989) extended it to exact characterization and average-case analysis. Audsley, Burns, Richardson, and Wellings (1993) developed the response time analysis (the fixed-point recurrence of Chapter 15). Buttazzo’s textbook Hard Real-Time Computing Systems (first edition 1997) unified these results into a coherent body of theory that remains the standard graduate reference.

The Raspberry Pi — the hardware platform for CS 452’s current offering — represents an unusual step in real-time systems education: using a high-performance, multi-core, out-of-order processor as a bare-metal microcontroller. Historically, embedded real-time systems ran on simple in-order microcontrollers (PIC, AVR, ARM Cortex-M) where worst-case execution time (WCET) analysis was tractable because the hardware had no caches, no branch prediction, and no out-of-order execution. The Cortex-A72 in the RPi 4 is far more complex. The pedagogical choice is deliberate: the hardware is representative of modern high-performance embedded systems (automotive ADAS, drone flight controllers, industrial robotics), where the performance budget demands modern processors but the timing requirements demand real-time guarantees. Learning to work with this tension is the deeper engineering lesson.

Timing Guarantees in Practice

The theoretical framework of hard real-time guarantees — prove that U ≤ n(2^(1/n)-1) and sleep soundly — often clashes with practical reality. Several caveats are worth noting early in the course:

WCET overestimation: conservative WCET analysis overestimates execution times to ensure correctness. A task measured to have a worst-case of 5 ms may typically run in 0.5 ms. This gap between WCET and average-case makes utilization bounds pessimistic. In practice, a system that is schedulable by analysis has substantial headroom.

Unbounded blocking: the Liu/Layland model assumes tasks are always ready at their period boundaries. In reality, tasks may block waiting for I/O, for a lock, or for an SRR reply. These blocking times must be added to the WCET for analysis purposes. In the message-passing kernel, blocking times are bounded by the server’s response time — which is itself analyzable if the server’s task set is analyzable.

Interrupt storms: a faulty peripheral that fires interrupts continuously can overwhelm the kernel’s interrupt handling capacity. Protection: the kernel should detect when interrupt service frequency exceeds a threshold and disable the offending interrupt. In CS 452, the MCP2515’s error-handling state machine (entering error-passive or bus-off modes) prevents a faulty CAN bus from flooding the kernel with interrupts.

Temporal isolation failures: in a system without memory protection (no MMU enforcing per-task address spaces), a bug in one task can corrupt another task’s memory, including its stack. The kernel cannot detect this without explicit stack canaries (Chapter 5). The lesson: correctness of the real-time schedule is only as good as correctness of the individual tasks.


Chapter 2: Polling, Cyclic Executives, and Their Limits

Before introducing tasks and preemption, it is instructive to take seriously the simplest possible approach to real-time control: the polling loop. Understanding where polling succeeds and where it fails motivates every design decision in the kernel that follows.

The Polling Loop

The polling loop — sometimes called the super-loop or scan loop — is the dominant paradigm in small embedded systems. Its structure is:

void main(void) {
    init();
    for (;;) {
        if (sensor_fired()) handle_sensor();
        if (key_pressed())  handle_key();
        if (clock_expired()) handle_clock();
        // ... check all conditions in sequence
    }
}

The loop runs continuously. Each iteration checks every condition and takes the associated action if the condition is true. The appeal is simplicity: there is no scheduler, no context switch, no synchronization, and no non-determinism beyond the hardware itself. The code is straightforward to read, trace, and debug.

For simple, slow systems with a single sensor and a single actuator, the polling loop is genuinely the right approach. An early thermostat — one sensor, one relay, a one-second control period — is perfectly served by a polling loop. The complexity of a task-based RTOS would be gratuitous overhead.

The difficulties emerge as the system’s requirements grow along three axes: the number of conditions to check, the frequency at which conditions arrive, and the duration of the actions that must be taken.

Time Analysis of the Polling Loop

The worst-case response latency of a polling loop is the time required to complete one full scan of all conditions and their actions. If we have conditions \(C_1, C_2, \ldots, C_n\) with action durations \(d_1, d_2, \ldots, d_n\), the worst-case latency before the handler for condition \(C_k\) begins is:

\[L_k = \sum_{i \neq k} d_i\]

In the worst case — when \(C_k\) fires immediately after the poll for \(C_k\) completed — the system must execute every other action once before checking \(C_k\) again. Completing the handler adds \(d_k\), so the worst-case response time is bounded by the sum of all action durations, \(\sum_{i} d_i\).

This has an immediate and severe implication. Suppose you have a train position sensor that must be processed within 50 milliseconds, and a UART receive handler that takes 20 milliseconds, and a track switch control function that takes 10 milliseconds, and a user-interface update that takes 30 milliseconds. The worst-case latency for the sensor is \(20 + 10 + 30 = 60\) milliseconds — already in violation of the deadline.

The standard response to this problem within the polling paradigm is to break long actions into shorter segments. Instead of a 30-millisecond UI update, you compute one line of the update per loop iteration. This reduces per-iteration cost but dramatically complicates the code: each segment must save its own progress state across iterations, the logical flow of the computation is shredded across many calls, and the result looks like a hand-written coroutine with all the readability hazards that implies.

The Cyclic Executive

The cyclic executive is a structured refinement of the polling loop. Rather than checking conditions on every iteration, it divides time into fixed frames of length \(F\). Each frame is subdivided into slots, and each slot is assigned a fixed task. The major cycle — the least common multiple of all task periods — determines how many distinct frame patterns are needed.

Suppose we have:

  • Task A with period 10 ms, execution time 3 ms
  • Task B with period 20 ms, execution time 5 ms
  • Task C with period 20 ms, execution time 4 ms

The major cycle is 20 ms. A valid frame arrangement might assign frames of 10 ms each:

  • Frame 1: A (3 ms), B (5 ms), slack
  • Frame 2: A (3 ms), C (4 ms), slack

The cyclic executive executes frame 1, then frame 2, then repeats. Timing is enforced by a hardware timer that fires at frame boundaries. If a task overruns its slot, the executive can detect this and take action (abort, skip, flag an error). If it finishes early, the remaining time is idle.

The cyclic executive has genuine advantages for hard real-time systems. Worst-case response times are statically known by construction. There is no scheduler at run time, which eliminates scheduling overhead and non-determinism. Temporal isolation between tasks is guaranteed at the frame boundary. Industrial avionics standards such as ARINC 653 mandate this approach for safety-critical software.

The limitations are equally genuine. The task set must be fixed at design time; dynamic task creation is impossible. Task periods must be harmonically related for the major cycle to remain tractable. Any interaction between tasks — a producer notifying a consumer, a resource shared between tasks — must be handled by careful frame alignment or explicit data copies, which is tedious and error-prone. A task that occasionally runs long — a WCET overrun — causes a ripple of deadline violations throughout the frame. And as the number of tasks grows, the frame table becomes difficult to construct and verify.

Most importantly, the cyclic executive offers no natural way to handle asynchronous events. When a train sensor fires at an arbitrary moment, the cyclic executive must either poll it every frame (introducing latency up to one frame period) or dedicate an interrupt to it — but an interrupt that executes arbitrary code during a critical frame immediately violates the strict timing properties that made the cyclic executive attractive in the first place.

The Case for Preemptive Task Scheduling

The polling loop and cyclic executive both fail gracefully only when the ratio between the fastest event rate and the slowest task period is small. When this ratio is large — when a hardware interrupt must be acknowledged in microseconds while a route-computation task takes milliseconds — the fixed-sequence approach runs out of headroom.

The fundamental problem is that both approaches conflate when to compute with what to compute. A train position update should run as soon as the sensor fires, regardless of which other computations are in progress. The natural description of this requirement is: there is a higher-priority activity that should preempt whatever is currently running when the sensor fires.

Preemptive multitasking separates these concerns. Each logically distinct activity becomes a task with its own priority, its own stack, and its own program counter. The kernel schedules the highest-priority ready task at all times. When a sensor interrupt fires, the kernel can immediately preempt the currently running task, run the sensor handler, and then return to the preempted task — all without requiring the preempted task to have been written in any special way. The task model provides clean logical isolation without sacrificing response time.

This is the architecture of the microkernel built in CS 452. The train control application is decomposed into tasks — a sensor notifier, a clock server, UART servers, a train supervisor, route engineers — each running at an appropriate priority and communicating through bounded message-passing operations. The kernel guarantees that the highest-priority ready task always runs, that message passing is bounded in duration, and that interrupt handling latency is bounded by the longest non-preemptible kernel operation. The trains are the test.

Assignment Zero: The Polling Loop in Practice

Assignment 0 of CS 452 asks students to build exactly a polling loop — a bare-metal program on the Raspberry Pi that, with no kernel, no tasks, and no interrupts, controls a train by repeatedly checking three inputs: the track sensor (via CAN), the real-time clock (via system timer), and the user’s keyboard (via UART). The assignment’s purpose is not merely to implement the polling loop — it is to make the student feel, viscerally, what the polling loop cannot do.

A typical polling loop for A0:

void main(void) {
    uart_init();
    spi_init();
    can_init();
    timer_init();

    for (;;) {
        // Check sensors
        if (can_rx_available()) {
            CanFrame frame;
            can_rx_read(&frame);
            if (is_sensor_event(&frame)) {
                handle_sensor(&frame);
            }
        }

        // Check clock
        if (timer_elapsed_ms() >= CLOCK_PERIOD_MS) {
            timer_reset();
            handle_clock();
        }

        // Check user input
        if (uart_rx_ready()) {
            char ch = uart_rx_read();
            handle_key(ch);
        }
    }
}

The student is required to implement conditional function calls — if a sensor fires, do something. The assignment is deliberately achievable with a polling loop because the single-train scenario has a small number of conditions and the actions are simple.

The payoff is in the failure modes. When the student tries to add a second train, the polling loop’s latency doubles (every additional condition adds to the scan time). When the student tries to implement accurate timing (send a speed command every 100 ms for velocity calibration), the polling loop’s timing drifts because each iteration takes a variable amount of time depending on which conditions are true. When the student tries to implement a “stop in 3 seconds” feature, the natural code is:

// Wrong: this blocks the entire polling loop for 3 seconds
timer_delay_ms(3000);
train_stop();

Using a blocking delay stops all other polling for 3 seconds. The sensor check, the keyboard check, everything halts. The correct version — accumulating elapsed time and checking against a target — is the embryo of the cyclic executive. And then the student realizes that with two trains, two timers, and two sensors, the “accumulate elapsed time” approach requires carefully interleaving all the checks and makes the code look like the hand-written coroutine mess described earlier.

At the end of A0, the student has seen the problem from the inside. The kernel they will spend the next three months building is the solution.

Comparison: RTOS vs. bare-metal super-loop

The real-time systems industry is divided between two camps: RTOS-based systems and bare-metal super-loop systems. Understanding when each is appropriate is part of the practical education.

Bare-metal super-loops dominate in small microcontrollers (8-bit AVR, PIC, simple Cortex-M0). The entire application fits in a few kilobytes of flash. There are no tasks, no context switches, no dynamic allocation. Everything is globally scoped. The code is often generated by a tool (STM32CubeMX, AVR Studio) that handles peripheral initialization. This approach is maintainable for simple applications — a temperature controller, a motor driver — but breaks down when the application has more than three or four independent concerns.

RTOS-based systems (FreeRTOS, Zephyr, RIOT-OS) are appropriate when the application has multiple concurrent concerns with different timing requirements. FreeRTOS in particular has become ubiquitous in Cortex-M applications: it provides tasks, semaphores, queues, and timers with a ~10 KB ROM footprint. The trade-off is additional complexity: task stacks must be allocated, stack sizes must be tuned, priority inversion must be managed, and deadlock must be avoided.

Microkernel systems (QNX, L4, CS 452’s kernel) push further: IPC is the only inter-task communication mechanism, the kernel is minimal, and all device drivers live in user space. This provides stronger isolation (a buggy device driver cannot corrupt the kernel) but requires more careful design of the service architecture.

The CS 452 kernel sits at the microkernel end of this spectrum. For a model railway controller, this is educationally appropriate even if a FreeRTOS-based system would be commercially simpler.


Part II: The Hardware Substrate

Chapter 3: ARMv8-A for Systems Programmers

The Raspberry Pi 4’s processor is the Broadcom BCM2711, which integrates four ARM Cortex-A72 cores. The Cortex-A72 implements the ARMv8.0-A instruction set architecture. Understanding this architecture at the level needed to write a kernel — not just to run programs on it — requires working through register files, exception levels, and the calling convention in some depth.

A Brief History of ARM

The first ARM processor (Acorn RISC Machine, later Advanced RISC Machine) was designed in 1983 by a small team at Acorn Computers. The original ARM1 had a 26-bit address space and a 16-entry register file, and it fit on a single chip that consumed less than 100 milliwatts. The key design insight was that a reduced, orthogonal instruction set with a fixed 32-bit instruction width would allow high clock rates and simple decode logic — valuable properties when transistor budgets were tight.

ARM Ltd., spun out in 1990, licensed the architecture rather than manufacturing chips. The model proved extraordinarily successful: by 2020, ARM cores were shipping at a rate exceeding 20 billion per year, appearing in smartphones, microcontrollers, servers, and automobiles. The ARMv8-A architecture, introduced in 2011 with the Cortex-A53 and A57, extended the ARM architecture to 64-bit addressing while maintaining backward compatibility with 32-bit AArch32 code.

The ARMv8-A label covers a wide family. The Cortex-A72 (as in the RPi 4) is a high-performance, superscalar, out-of-order core; on the RPi 4 it is clocked at 1.5 GHz, with a 48 KB L1 instruction cache and a 32 KB L1 data cache per core, and a 1 MB L2 cache shared across the four cores.

The Register File

In AArch64 (64-bit) execution state, the architectural register file provides:

  • x0–x30: thirty-one 64-bit general-purpose registers. When accessed as w0–w30, they name the lower 32 bits; a write to wN zeroes the upper 32 bits of the corresponding xN.
  • xzr / wzr: a zero register. Reads as zero; writes are discarded. This eliminates a dedicated zero-generating instruction in most cases.
  • SP (stack pointer): not a general register in AArch64. It must be 16-byte aligned at any instruction that requires an aligned stack (e.g., ldp/stp with [sp]). Each exception level has its own banked SP: SP_EL0, SP_EL1, SP_EL2, SP_EL3.
  • PC (program counter): not directly readable or writable by general instructions. It is implicitly updated by branches and explicitly visible in certain system register contexts.
  • PSTATE: a collection of condition flags and system control bits, not a single architectural register. The NZCV flags (negative, zero, carry, overflow) are set by arithmetic operations and tested by conditional instructions. The DAIF field (Debug, SError/System Error, IRQ, FIQ) masks the corresponding exception types.

The 128-bit SIMD and floating-point register file (v0–v31, also accessible as q, d, s, h, b for narrower views) is architecturally separate. Whether a task’s FP/SIMD state must be saved on context switch depends on whether the kernel uses FP instructions. In a bare-metal kernel, the conservative choice is to save the entire SIMD file or to disable FP access from EL0 and avoid it in the kernel entirely, since saving 32 × 16 bytes = 512 bytes per context switch is a noticeable overhead.
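The second option, trapping FP/SIMD use so that no per-task FP state ever needs saving, comes down to a short configuration of CPACR_EL1. This is an illustrative sketch; the FPEN field encoding (bits [21:20]) is taken from the ARMv8-A reference manual, where 0b00 traps FP/SIMD accesses at both EL0 and EL1.

```asm
// Illustrative: trap all FP/SIMD accesses so no task FP state exists
mrs     x0, CPACR_EL1
bic     x0, x0, #(3 << 20)     // FPEN = 0b00: trap at EL0 and EL1
msr     CPACR_EL1, x0
isb                            // make the new trap configuration visible
```

With this setting, any FP/SIMD instruction raises an exception, so a stray use in kernel or task code is caught immediately rather than silently corrupting another task's vector state.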

Exception Levels

ARMv8-A defines four exception levels (ELs), representing a privilege hierarchy. Higher exception levels have greater access to hardware resources and system registers.

Exception Levels (ARMv8-A): EL0 is the unprivileged level at which user applications run. EL1 is the privileged level at which an operating system kernel runs. EL2 supports virtualization (hypervisor). EL3 is the highest level, handling secure-monitor firmware (Arm Trusted Firmware on production systems). Each level has its own stack pointer and set of system registers.

The Raspberry Pi 4 boots at EL3 (firmware), drops to EL2 (for hypervisor init), and then to EL1 where the CS 452 kernel runs. User tasks run at EL0. The practical consequence is that on boot, your startup assembly must actively drop from whichever EL the bootloader left you in (typically EL2) down to EL1 before entering your C kernel.

Dropping from EL2 to EL1 requires:

  1. Configuring HCR_EL2 (Hypervisor Configuration Register) to declare that EL1 uses AArch64 and that exceptions targeting EL1 are not trapped to EL2.
  2. Preparing SPSR_EL2 with the desired PSTATE for EL1 (specifically: EL1h mode, DAIF bits masked).
  3. Setting ELR_EL2 to the address of the EL1 entry point.
  4. Executing eret, which atomically restores PSTATE from SPSR_EL2 and branches to ELR_EL2.
// Illustrative EL2→EL1 drop sequence
mrs     x0, CurrentEL          // read current EL (encoded in bits [3:2])
lsr     x0, x0, #2
cmp     x0, #2
bne     .not_el2               // already at EL1, skip

// Configure EL1 as AArch64
mov     x0, #(1 << 31)         // HCR_EL2.RW = 1: EL1 is AArch64
msr     HCR_EL2, x0

// Set up return PSTATE for EL1h (uses SP_EL1), DAIF all masked
mov     x0, #0x3C5             // DAIF=1111, EL1h (mode 0b0101)
msr     SPSR_EL2, x0

// Set return address to EL1 entry
adr     x0, el1_entry
msr     ELR_EL2, x0
isb                             // ensure all system register writes take effect
eret                            // drop to EL1

The isb (instruction synchronization barrier) before eret ensures that the processor has committed all preceding system register writes before executing the exception return. This is one of the places where ARMv8’s weak memory model becomes visible: system register writes are not instantaneously visible to the instruction pipeline unless a barrier forces synchronization.

The AAPCS64 Calling Convention

The Application Binary Interface (ABI) governs how compiled code from different translation units — or different source languages — can call each other. The ABI for AArch64 is specified in the ARM document Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64). Kernel writers must understand it for two reasons: the kernel is called from C and must return to C, and the context switch must save and restore exactly the registers that C code considers persistent.

The register assignment is:

Registers   Role
x0–x7       Integer arguments (first 8); x0 also holds the return value
x8          Indirect result register (pointer to space for large returns)
x9–x15      Temporary; caller-saved (callee may destroy)
x16–x17     Intra-procedure-call scratch (IP0/IP1); may be destroyed by linker veneers
x18         Platform register; avoid in portable code
x19–x28     Callee-saved; a function must preserve these across a call
x29         Frame pointer (FP)
x30         Link register (LR); holds return address for the current call

Two calling-convention rules that matter particularly for kernel writers:

First, the stack pointer must be 16-byte aligned at every public function boundary. The stp instruction with [sp, #imm] addressing requires the stack pointer to be 16-byte aligned; violating this causes an alignment fault. Since each stack frame should push an even number of 64-bit values (or pad to a 16-byte multiple), this is easy to maintain but also easy to forget.

Second, callee-saved registers must be preserved. When the kernel saves a task’s context, it must save x19–x28, x29 (FP), and x30 (LR) in addition to any argument and temporary registers that happen to be live. The correct approach is to save the entire general register file (x0–x30) plus the system registers ELR_EL1 and SPSR_EL1 that encode the user task’s return address and PSTATE.
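This save policy can be written down as a C struct. The layout below is an illustrative sketch (the field order is a design choice, not mandated by the architecture), sized so that stp/ldp pairs preserve 16-byte alignment:

```c
#include <stdint.h>

/* Hypothetical saved-context layout, assuming the kernel saves x0-x30,
 * the user stack pointer, ELR_EL1, and SPSR_EL1 on every kernel entry. */
struct context_frame {
    uint64_t x[31];     /* x0-x30: the full general register file */
    uint64_t sp_el0;    /* the user task's stack pointer */
    uint64_t elr_el1;   /* the user task's resume address */
    uint64_t spsr_el1;  /* the user task's saved PSTATE */
};

/* 34 eight-byte slots = 272 bytes; a multiple of 16, so saving/restoring
 * the frame with stp/ldp pairs keeps the stack pointer aligned. */
_Static_assert(sizeof(struct context_frame) == 34 * 8, "frame is 34 slots");
_Static_assert(sizeof(struct context_frame) % 16 == 0, "16-byte multiple");
```

The compile-time asserts catch the most common context-switch bug: the assembly save code and the C view of the frame drifting out of sync after a field is added.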

Mixing C and Assembly

Most of the kernel is written in C for readability and maintainability. Assembly is used exactly where it must be: for operations that have no C equivalent. The primary examples are:

  • System register access: mrs (move register from system register) and msr (move to system register) have no C syntax. You either write a short .S file or use GCC’s inline __asm__ extension for isolated register accesses.
  • Memory barriers: dmb (data memory barrier), dsb (data synchronization barrier), and isb (instruction synchronization barrier) are assembly instructions. GCC provides __sync_synchronize() and the __atomic_* builtins for most portable use cases, but for kernel code that must precisely control barrier semantics, raw assembly is cleaner.
  • Context switch: saving and restoring all 31 general-purpose registers, the stack pointer, and the two EL1 system registers requires assembly. There is no C idiom that can move arbitrary registers.
  • Exception vector table: the entries of the vector table sit at fixed 128-byte offsets within a 2048-byte-aligned table, which requires assembly .align directives and careful section placement.

GCC allows mixing C and assembly through the asm volatile ("instruction" : outputs : inputs : clobbers) syntax for short, isolated sequences. For the context switch, a separate .S file is more readable:

// In C: declare the context switch entry point
extern void kernel_entry(void);
// In .S: define kernel_entry
.global kernel_entry
.align 7
kernel_entry:
    // Save user task's registers, dispatch to C handler
    ...

The .align 7 ensures 128-byte alignment, required by the exception vector table format.

The ARM Memory Model

The ARMv8-A memory model is significantly more relaxed than the x86 TSO (Total Store Ordering) model that most systems programmers learn first. On x86, all stores become visible to all processors in program order — the only reordering allowed is that stores may become visible to the local CPU before they are globally visible (store buffer forwarding). On ARMv8, the memory model is closer to the PowerPC/SPARC RMO (Relaxed Memory Order): loads may be reordered with loads, stores may be reordered with stores, and loads may even overtake stores in some scenarios.

For bare-metal kernel development, the practical implications are:

Device register accesses must be ordered with barriers. A write to a device register (enabling an interrupt, loading a DMA descriptor, setting a GPIO pin) might be reordered by the store buffer or the memory subsystem relative to other instructions. A dsb sy ensures all preceding memory operations are globally visible before the barrier completes.

Cache coherency is automatic between CPU cores in the inner shareable domain (the four Cortex-A72 cores). You do not need barriers between cores for ordinary memory reads and writes — the hardware cache coherency protocol maintains a consistent view. Barriers are needed at synchronization points (like lock acquire/release) to enforce ordering guarantees.

System register writes require barriers to take effect in the instruction stream. Writing to VBAR_EL1, TTBR0_EL1, or SCTLR_EL1 does not take effect until after an ISB. The processor may have already fetched instructions past the write using the old register value; the ISB flushes the pipeline and forces a re-fetch.

The load-acquire / store-release idiom provides a lightweight barrier for mutual exclusion without a full DSB. ldar (load-acquire) prevents any load or store after the ldar from being reordered before it. stlr (store-release) prevents any load or store before the stlr from being reordered after it. These instructions enable efficient lock implementations on multiprocessor systems, though they are overkill for the single-core CS 452 scenario.
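The acquire/release idiom is usually reached from C through GCC’s __atomic builtins, which on ARMv8.0 lower to ldaxr/stxr and stlr. A minimal spinlock sketch (spinlock_t and the function names are illustrative):

```c
#include <stdint.h>

/* Spinlock sketch using acquire/release semantics. With GCC on AArch64,
 * the acquire exchange compiles to an ldaxr/stxr loop and the release
 * store to stlr; no full dsb is needed on either side. */
typedef struct { uint32_t locked; } spinlock_t;

static inline void spin_lock(spinlock_t *l) {
    /* Exchange returns the previous value: keep trying while it was 1. */
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
        ;
}

static inline void spin_unlock(spinlock_t *l) {
    /* Release: all prior accesses are visible before the lock appears free. */
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```

As the text notes, this machinery is overkill on a single core; it becomes load-bearing the moment a second core shares the data structure.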

The Cortex-A72 Pipeline

Understanding the Cortex-A72 pipeline is useful for WCET analysis and for understanding why certain optimization techniques (loop unrolling, data prefetching) help.

The Cortex-A72 is a 3-wide out-of-order processor. Its pipeline stages are approximately:

  1. Fetch: up to 16 bytes (four instructions) fetched per cycle from the 48 KB instruction cache.
  2. Decode/Rename: up to 3 instructions decoded per cycle. Register renaming eliminates WAR and WAW hazards.
  3. Dispatch: instructions dispatched to one of several reservation stations (integer ALU, load/store, branch, multiply, FP/SIMD).
  4. Execute: instructions execute out of order as their source operands become available, across dedicated issue ports (two simple integer ALUs, a multi-cycle integer unit, and separate branch, load/store, and FP/SIMD pipes).
  5. Commit: instructions retire in program order, committing their results and any memory effects.

Branch prediction: the Cortex-A72 has a multi-level branch predictor including a branch target buffer (BTB), a static predictor (predict backward-taken, forward-not-taken), and a return address stack (RAS). The RAS is particularly relevant for function calls: when a bl (branch and link) instruction is executed, the processor pushes the return address onto the RAS; when a ret (return, which is br x30) is executed, the processor pops from the RAS and speculatively fetches the return address before the x30 register has been resolved. This means that function call/return pairs are predicted correctly without pipeline stalls, provided the call depth does not exceed the RAS depth (~8 entries).

Misprediction penalty: approximately 15 cycles on the Cortex-A72. For a loop with 100 iterations, a single mispredicted loop-exit branch costs 15 cycles out of roughly 100+ loop-body cycles — significant but not dominant.

Load-use latency: an integer load from L1 cache has a 4-cycle latency. A sequence like:

ldr  x0, [x1]    // load from memory
add  x2, x0, #1  // immediate use of loaded value

will stall for 4 cycles between the load and the add. The out-of-order engine can sometimes hide this by executing other independent instructions during the load latency, but in register-pressure situations, the stall is unavoidable.

These pipeline details inform the timing model for the kernel’s hot paths. A context switch involves 32 stores (saving registers) followed by 32 loads (restoring the next task’s registers). At 1 store + 1 load per cycle (the L1 data cache can sustain this), the raw register save/restore takes 32 cycles. With instruction overhead, the full context switch path typically takes 100–300 cycles — 0.07–0.2 µs at 1.5 GHz.
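The cycle-to-time conversion behind these figures is plain integer arithmetic; a small sketch assuming the 1.5 GHz clock (CPU_MHZ and cycles_to_ns are illustrative names):

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds at an assumed 1.5 GHz clock,
 * using integer math only (the kernel builds with -mgeneral-regs-only,
 * so no floating point is available). */
#define CPU_MHZ 1500ULL

static inline uint64_t cycles_to_ns(uint64_t cycles) {
    return cycles * 1000ULL / CPU_MHZ;   /* ns = cycles / 1.5 */
}
```

At 1.5 GHz, the 100–300 cycle context-switch range maps to 66–200 ns, matching the 0.07–0.2 µs figure above.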

Out-of-Order Execution and Real-Time Correctness

The Cortex-A72’s out-of-order execution engine creates a subtle challenge for real-time systems that most embedded programmers encounter for the first time on the RPi 4. On a simple in-order microcontroller (Cortex-M0, AVR, PIC), the execution time of a code sequence is essentially the sum of the cycle counts of each instruction — predictable, additive, and easy to bound. On the Cortex-A72, instructions are reordered dynamically by the hardware, executed when operands are ready rather than in program order. This makes cycle-exact timing analysis difficult and requires understanding which transformations the hardware may perform.

Write reordering is the most practically important case. When your code writes to two memory locations:

mmio_write(UART_TXD,  'X');   // write byte to UART transmit register
mmio_write(UART_CR,  0x01);   // enable UART transmitter

the processor may internally reorder these two writes if it determines they are to independent addresses. For regular DRAM, this is an innocuous performance optimization — the observable outcome is the same. For device registers (MMIO), where write order is architecturally significant (you cannot enable the transmitter before putting a byte in the transmit register), reordering is catastrophic. The solution is the DSB SY instruction, which drains the store buffer and ensures all preceding writes are globally visible before the barrier completes. Device register write sequences must always be separated by barriers when order matters.
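The ordered-write pattern is typically wrapped in a small helper. mmio_write_ordered below is a hypothetical sketch: on the target the fence would be a literal dsb sy, modeled here with a GCC builtin fence so the code compiles on any host:

```c
#include <stdint.h>

/* Sketch of an ordered MMIO write. On AArch64 the fence would be
 * `dsb sy`; __atomic_thread_fence(__ATOMIC_SEQ_CST) stands in for it
 * here. The volatile cast keeps the compiler from reordering or
 * eliding the store itself. */
static inline void mmio_write_ordered(uintptr_t addr, uint32_t val) {
    *(volatile uint32_t *)addr = val;    /* the device register store */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* stand-in for dsb sy */
}
```

With this helper, the UART sequence above becomes two calls, each guaranteed globally visible before the next begins.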

Speculative loads create a second class of issues. The Cortex-A72 may speculatively execute a load before it has confirmed that the preceding branch will be taken. If the load addresses a device register (e.g., reading from UART_FR to check whether the TX FIFO is full), a speculative access could trigger side effects (some FIFO implementations clear a status bit on read). The architectural protection is the memory type: MMIO regions must be mapped as Device memory (Device-nGnRnE), which forbids the processor from performing speculative reads to them. An isb additionally prevents instructions after the barrier from executing until all preceding instructions have completed, which matters when a device read must not begin before earlier state changes are committed.

Reordering across the kernel/user boundary is less of a concern than it first appears. The ARMv8 exception entry mechanism performs a context synchronization operation (equivalent to an isb) on exception entry, so the user task’s architectural state is fully visible before the handler’s first instruction. Within a single core, the kernel always observes its own writes to the saved context frame in program order, so no barrier is needed between saving the frame and reading it in the scheduler; a dsb becomes necessary only when another observer (a second core or a DMA engine) must see the frame.

Cache coherency across cores: the Cortex-A72 is a quad-core processor. The CS 452 kernel runs on a single core (core 0), with the other three cores parked in a spin loop after boot. Depending on the firmware stub, more than one core may begin executing at the kernel entry point at 0x80000, so the boot stub must divert the secondary cores before the kernel’s global data structures are initialized; otherwise their stores could corrupt kernel state. The boot assembly checks MPIDR_EL1 and branches secondaries to a separate parking loop before any kernel initialization runs.

WCET analysis on an OOO processor: the fundamental difficulty with out-of-order execution and WCET is that the execution time of a code sequence depends not just on the instructions but on the micro-architectural state — the cache contents, the branch predictor history, the reservation station occupancy — at the time of execution. This state cannot be fully known at analysis time. Static WCET tools (AbsInt aiT, OTAWA) address this by constructing abstract interpretations of the processor state: rather than tracking the exact state, they track a set of possible states that safely over-approximates all possible concrete states. The result is a WCET bound that is correct for all possible micro-architectural states.

For the CS 452 kernel, which does not use formal static analysis, the practical approach is to ensure the kernel’s critical path (the exception entry, scheduler, and exception exit) is cache-resident at all times and that its execution time is measured under realistic conditions (warm cache, normal interrupt rates). The measured maximum, inflated by 20–30% as a safety margin, serves as the effective WCET. This approach accepts the risk that the true worst case might be higher, but in practice the gap between measured and true WCET is small when the cache is always warm — which the kernel guarantees for its own code by virtue of being executed frequently.
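The measure-and-inflate policy is a few lines of integer code. effective_wcet is a hypothetical helper, with the 30% margin chosen from the upper end of the range above:

```c
#include <stdint.h>

/* Measure-and-inflate WCET estimate: take the maximum over repeated
 * measurements of the critical path and add a 30% safety margin,
 * in integer arithmetic (no FP in the kernel). */
static uint64_t effective_wcet(const uint64_t *samples, int n) {
    uint64_t max = 0;
    for (int i = 0; i < n; i++)
        if (samples[i] > max)
            max = samples[i];
    return max + (max * 30) / 100;   /* measured max + 30% margin */
}
```

The samples would come from PMCCNTR_EL0 deltas taken around the exception entry/exit path under a realistic, warm-cache workload.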

Spectre and Meltdown: the Cortex-A72 is affected by Spectre variant 1 (bounds-check bypass) and variant 2 (branch target injection), discovered in 2018; it is not affected by the classic Meltdown attack (variant 3), which among ARM cores applies to the Cortex-A75. These attacks arise from speculative execution leaking information through the cache timing channel. The CS 452 kernel, which runs in a single address space with no untrusted code, is not exposed to these attacks — there is no adversarial code to exploit the speculation. In production systems with untrusted user code, mitigations (KPTI, retpoline, microcode updates) are required. Understanding that these mitigations exist — and that they carry a 5–20% performance tax on syscall-intensive workloads — is part of the broader education about why secure systems are harder to build than correct-but-insecure ones.

ARMv8 Synchronization Primitives

For completeness, ARMv8 provides two exclusive memory access instructions for implementing mutexes and other atomic operations on multiprocessor systems:

  • ldxr Xt, [Xn] (load exclusive register): reads the value at the address and tags the address as “exclusive access monitored.”
  • stxr Ws, Xt, [Xn] (store exclusive register): writes Xt to the address and sets Ws to 0 if the exclusive access succeeded (no other core modified the address since the ldxr) or 1 if it failed.

The classic compare-and-swap or atomic increment loop:

.Lretry:
    ldxr   x1, [x0]        // exclusive load of *x0 into x1
    add    x1, x1, #1      // increment
    stxr   w2, x1, [x0]    // exclusive store; w2 = 0 if succeeded
    cbnz   w2, .Lretry     // retry if the exclusive store failed

In the single-core CS 452 kernel, exclusive accesses essentially always succeed: there is no other core to clear the exclusive monitor, though an intervening exception can still make an stxr fail, which the retry loop absorbs. They become meaningful if the kernel is extended to multi-core operation.
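From C, the same operation is normally written with a GCC builtin and left to the compiler, which on ARMv8.0 (the Cortex-A72) emits exactly such an ldxr/stxr retry loop:

```c
#include <stdint.h>

/* Atomic increment via a GCC builtin; on the Cortex-A72 this lowers
 * to the ldxr / add / stxr / cbnz loop shown above. Returns the
 * incremented value. */
static inline uint64_t atomic_inc(uint64_t *p) {
    return __atomic_add_fetch(p, 1, __ATOMIC_SEQ_CST);
}
```

Preferring the builtin over hand-written assembly keeps the retry loop correct (clobbers, constraints, barriers) with no loss of control on this core.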

Floating-Point, SIMD, and Why the Kernel Disables Them

The Cortex-A72 implements the ARMv8 SIMD and floating-point extension, providing 32 128-bit vector registers (v0–v31) alongside the integer register file. Each v register is accessible as q (128-bit), d (64-bit), s (32-bit), h (16-bit), or b (8-bit) depending on the instruction. The NEON instruction set provides parallel arithmetic on packed integers or floating-point values — for example, fadd v0.4s, v1.4s, v2.4s adds four pairs of single-precision floats in one instruction.

From a kernel engineering standpoint, this capability creates a serious problem: context switch cost. Every task that could potentially use FP or SIMD registers has architectural state in those registers that must be preserved across context switches. Saving the 32 × 16 = 512 bytes of SIMD state on every context switch costs 16 stp instructions (each stores a pair of 128-bit q registers), with 16 ldp instructions to restore, roughly 32 extra memory operations per switch. For a kernel that targets sub-200-cycle context switches, this is a 15–20% overhead tax even when no task uses FP at all.

The standard solution in embedded and real-time kernels is lazy FP state save — sometimes called FPU lazy stacking in ARM terminology. The kernel starts each task with the FP unit disabled (CPACR_EL1.FPEN = 0b00). If a task executes a FP instruction while the FP unit is disabled, the processor takes a trap (an EL1 sync exception with ESR_EL1.EC = 0b000111). The trap handler saves the FP state of the previous FP-using task, restores the FP state of the current task, re-enables FP, and retries the instruction. Tasks that never use FP incur zero FP-save overhead. Tasks that do use FP incur FP-save overhead only at the first FP instruction after a context switch, and only if the previous task also used FP.
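The bookkeeping half of lazy FP save can be sketched in C. In this sketch, save_fp_state, restore_fp_state, and fp_enable are hypothetical stand-ins for the real assembly helpers (stubbed here with counters so the logic can be exercised off-target), and fp_owner tracks whose state is live in v0–v31:

```c
#include <stddef.h>

struct task { int id; };

/* Counting stubs standing in for the assembly save/restore/enable
 * helpers, so the ownership logic can be tested on a host. */
static int saves, restores, enables;
static void save_fp_state(struct task *t)    { (void)t; saves++; }
static void restore_fp_state(struct task *t) { (void)t; restores++; }
static void fp_enable(void)                  { enables++; }  /* CPACR_EL1.FPEN, then isb */

static struct task *fp_owner = NULL;   /* task whose state is in v0-v31 */

/* Called from the EL1 sync handler on ESR_EL1.EC = 0b000111. */
void handle_fp_trap(struct task *current) {
    if (fp_owner != current) {
        if (fp_owner != NULL)
            save_fp_state(fp_owner);   /* evict the previous user's state */
        restore_fp_state(current);     /* load the faulting task's state */
        fp_owner = current;
    }
    fp_enable();   /* eret then retries the faulting FP instruction */
}
```

A task that traps twice in a row pays the save/restore only once; a task that never touches FP never reaches this handler at all.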

The CS 452 kernel takes an even simpler approach: disable FP entirely for both kernel and user code using the compiler flag -mgeneral-regs-only. This flag instructs the compiler to not generate any FP or SIMD instructions, and to not assume that v0–v31 need to be preserved across function calls. The practical consequence is that all arithmetic — including velocity calculations, distance estimates, and stopping-distance interpolation — must use integer or fixed-point representations.

To understand why this is acceptable rather than onerous, consider that the Cortex-A72’s integer multiply-accumulate unit can compute a 64-bit mul x0, x1, x2 in a single instruction. Fixed-point arithmetic with a 16-bit fractional part (Q16.16 format) provides a resolution of ~0.000015 for a value up to 65535 — more than adequate for velocity in mm/s or distance in millimeters. The mechanics of fixed-point are covered in Chapter 17.
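A minimal Q16.16 sketch, assuming signed 32-bit raw values with a 64-bit intermediate for multiplication (the type and helper names are illustrative):

```c
#include <stdint.h>

/* Q16.16 fixed point: value = raw / 65536, resolution 1/65536. */
typedef int32_t q16_16;

#define Q16_ONE (1 << 16)

/* Widen to 64 bits before shifting back; a single mul handles this. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * b) >> 16);
}

static inline q16_16 q16_from_int(int32_t i) { return (q16_16)(i << 16); }
static inline int32_t q16_to_int(q16_16 q)   { return q >> 16; }
```

For example, a velocity of 3.5 mm/tick is stored as 3 × 65536 + 32768; multiplying it by 4 ticks yields exactly 14 mm with no rounding drift at this scale.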

The CPTR_EL2 and CPACR_EL1 system registers control FP/SIMD access:

/* In kernel init: ensure CPACR_EL1.FPEN[21:20] = 0b00 (traps FP to EL1) */
static void fpu_disable(void) {
    uint64_t cpacr;
    asm volatile("mrs %0, CPACR_EL1" : "=r"(cpacr));
    cpacr &= ~(3UL << 20);  /* FPEN = 0b00 — trap FP/SIMD access */
    asm volatile("msr CPACR_EL1, %0" :: "r"(cpacr));
    asm volatile("isb");
}

With -mgeneral-regs-only, the compiler will never emit FP instructions, so this trap will never fire in a correctly compiled kernel or user program. If a user task links against a library (such as a floating-point sprintf implementation) that does use FP, the trap will fire and the trap handler can either kill the task with an error or implement lazy FP stacking. For CS 452, the policy is simpler: link only freestanding C code, and use fixed-point arithmetic throughout.

There is one subtle exception: the compiler is free to use FP registers as “scratch” for passing struct arguments or returning struct values in some ABI conventions. AAPCS64 uses v0–v7 for FP/SIMD function arguments and v0–v3 for FP/SIMD return values. With -mgeneral-regs-only, these conventions are suppressed and all struct passing goes through integer registers or stack slots. This is important to be aware of when reading assembly output with objdump -d: if you see fmov or ld1 instructions in a -mgeneral-regs-only build, something has gone wrong (likely a library function call without the flag, or inline assembly).

The AArch64 System Register Space

Beyond the general-purpose registers, ARMv8-A provides a large system register space accessed via MRS (move system register to general register) and MSR (move general register to system register). The CS 452 kernel uses the following system registers:

Register      Encoding         Purpose
MPIDR_EL1     S3_0_c0_c0_5     Multiprocessor Affinity Register — CPU ID in bits 7:0
HCR_EL2       S3_4_c1_c1_0     Hypervisor Config Register — RW=1 sets EL1/0 to AArch64
SPSR_EL2      S3_4_c4_c0_0     Saved Program State (at EL2) — target state for ERET
ELR_EL2       S3_4_c4_c0_1     Exception Link Register (at EL2) — return address for ERET
SP_EL1        S3_4_c4_c1_0     Stack pointer for EL1 (the kernel SP)
VBAR_EL1      S3_0_c12_c0_0    Vector Base Address Register — exception vector table
ELR_EL1       S3_0_c4_c0_1     Exception Link Register (at EL1) — saved PC of faulting instruction
SPSR_EL1      S3_0_c4_c0_0     Saved Program State (at EL1) — saved PSTATE
ESR_EL1       S3_0_c5_c2_0     Exception Syndrome Register — exception type and information
TTBR0_EL1     S3_0_c2_c0_0     Translation Table Base Register 0 — page table pointer
TCR_EL1       S3_0_c2_c0_2     Translation Control Register — MMU configuration
MAIR_EL1      S3_0_c10_c2_0    Memory Attribute Indirection Register — memory type encoding
SCTLR_EL1     S3_0_c1_c0_0     System Control Register — MMU enable (bit 0), cache enable (bit 2)
CPACR_EL1     S3_0_c1_c0_2     Architectural Feature Access Control — FP/SIMD enable
PMCR_EL0      S3_3_c9_c12_0    Performance Monitor Control — enable cycle counter
PMCCNTR_EL0   S3_3_c9_c13_0    Performance Monitor Cycle Count — 64-bit cycle counter

The S3_x_cy_cx_z encoding is the generic system register form: S<op0>_<op1>_<CRn>_<CRm>_<op2>. This encoding is required when the register has no mnemonic alias in older assembler versions. GCC and GNU as support the named forms (VBAR_EL1, ELR_EL1, etc.) for all commonly used system registers.

AArch64 Assembly Programming for Systems Software

Writing a kernel requires more assembly than most applications: the boot sequence, the exception vector table, the context switch hot path, and a handful of critical inline sequences. These are not places where C is expressive enough — they require direct control over the register file, the stack pointer, and processor state. This section catalogs the AArch64 assembly idioms that appear in kernel code.

Address loading: loading an address of a symbol into a register is a common operation with three options, each with different trade-offs.

/* Option 1: ADR — PC-relative, ±1 MB range */
adr     x0, my_symbol           /* x0 = address of my_symbol */

/* Option 2: ADRP + ADD — PC-relative, ±4 GB range, 4K-aligned intermediate */
adrp    x0, my_symbol           /* x0 = 4K-aligned page containing my_symbol */
add     x0, x0, :lo12:my_symbol /* x0 += offset within the 4K page */

/* Option 3: LDR pseudo-instruction — literal pool, full 64-bit range */
ldr     x0, =my_symbol          /* assembler creates a literal pool entry */

ADR is preferred for kernel symbols because it is compact (one instruction) and position-independent. The ADRP+ADD pair handles symbols beyond the ±1 MB ADR range — useful for large kernel images. The LDR =symbol form uses a literal pool, which requires the assembler to generate a constant table nearby and can cause cache miss penalties; use it only for absolute constants that cannot be expressed as PC-relative.

Unconditional branches and calls: the AArch64 branch family has six members with distinct semantics:

Instruction   Effect                                             Use case
b label       Unconditional branch                               Infinite loops, switch table jumps
bl label      Branch with link: x30 := PC+4, then branch         Function call to named symbol
blr Xn        Branch with link to register                       Function call through pointer
ret           Branch to x30 (return hint)                        Return from function
ret Xn        Branch to Xn (return hint)                         Return via a register other than x30
eret          Return from exception: PC := ELR, PSTATE := SPSR   Exit exception handler, drop to lower EL

Note that ret is a hint to the branch predictor that this is a subroutine return (the processor predicts the return address from the return address stack). A plain br x30 would also work, but it would miss the predictor hint and risk a full branch misprediction (~15 cycles) on the return.

Conditional branches: AArch64 uses a PSTATE flags word (NZCV) for condition codes. Conditional branches test the flags:

/* After a comparison instruction: */
cmp     x1, x2              /* Sets N, Z, C, V */
b.eq    .Lequal             /* Branch if Z=1 (x1 == x2) */
b.ne    .Lnotequal          /* Branch if Z=0 */
b.lt    .Lless              /* Branch if N≠V (x1 < x2, signed) */
b.lo    .Llower             /* Branch if C=0 (x1 < x2, unsigned) */

/* Compact branch-if-zero / branch-if-nonzero: */
cbz     x0, .Lzero          /* if x0 == 0, branch (no flags modified) */
cbnz    x0, .Lnonzero       /* if x0 != 0, branch */

CBZ and CBNZ are particularly useful in kernel loops where checking for a null pointer or an empty queue avoids an explicit comparison instruction:

ldr     x0, [x1]            /* Load queue head */
cbz     x0, .Lempty_queue   /* Queue empty? */

Stack management: the kernel manages two stack pointers: SP_EL1 for the kernel stack and SP_EL0 for the user stack. The stack pointer in AArch64 must be 16-byte aligned at all function entry points (per AAPCS64). The kernel prologue/epilogue pattern:

/* Save registers before a C function call that may modify x19-x28 */
stp     x19, x20, [sp, #-16]!   /* push pair with pre-decrement */
stp     x21, x22, [sp, #-16]!

/* ... do work ... */

ldp     x21, x22, [sp], #16     /* pop pair with post-increment */
ldp     x19, x20, [sp], #16
ret

The ! suffix on the store (pre-indexed addressing) atomically decrements SP before the store. The post-indexed load increments SP after. These are the preferred idioms because they maintain 16-byte alignment (two 8-byte registers = 16 bytes per pair) and are recognized by exception traceback tools.

Inline assembly in C: the kernel’s C code frequently needs to execute single assembly instructions — reading system registers, issuing barriers, or touching hardware registers — where a full function call would be disproportionate overhead. GCC extended assembly syntax:

/* Basic form: asm volatile("instruction" : outputs : inputs : clobbers); */

/* Read cycle counter */
static inline uint64_t pmccntr(void) {
    uint64_t val;
    asm volatile("mrs %0, PMCCNTR_EL0" : "=r"(val));
    return val;
}

/* Data memory barrier (full system) */
static inline void dmb_sy(void) {
    asm volatile("dmb sy" ::: "memory");
}

/* Write to a system register */
static inline void set_vbar_el1(uint64_t vbar) {
    asm volatile("msr VBAR_EL1, %0" :: "r"(vbar));
    asm volatile("isb");
}

The "=r"(val) output constraint says “write the result to any register and store it in val.” The "r"(vbar) input constraint says “load vbar from any register.” The "memory" clobber tells GCC that the inline assembly may read or write arbitrary memory, preventing the compiler from reordering memory operations across the barrier instruction.

Avoiding function call overhead for hot paths: in the kernel’s interrupt handler, every instruction counts. Some small operations are best done as macros that expand to inline assembly rather than function calls:

/* Read GICC IAR without a function call */
#define GICC_IAR_READ() ({ \
    uint32_t _r; \
    asm volatile("ldr %w0, [%1]" \
        : "=r"(_r) \
        : "r"((volatile uint32_t *)GICC_IAR_ADDR)); \
    _r; \
})

The __attribute__((always_inline)) annotation on a function achieves the same effect when the full function calling convention is needed.

Bit manipulation idioms: AArch64 provides several bit manipulation instructions that are often more efficient than a sequence of AND/OR/SHIFT:

/* Extract bits [7:4] of x0 into x1 */
ubfx    x1, x0, #4, #4      /* Unsigned Bitfield eXtract */

/* Insert bits [3:0] of x1 into bits [7:4] of x0; other bits of x0 unchanged */
bfi     x0, x1, #4, #4      /* Bit Field Insert */

/* Clear bits [7:4] of x0 */
bfi     x0, xzr, #4, #4     /* insert zeros; or: and x0, x0, #~0xF0 */

/* Count leading zeros (for log2, priority encoding): */
clz     x1, x0              /* x1 = number of leading zeros in x0 */

For a simple O(1) scheduler that must find the highest non-empty priority queue, CLZ is ideal: maintain a 32-bit bitmask ready_mask where bit n is set if the queue at priority n is non-empty. The index of the highest set bit is then 31 - CLZ(ready_mask):

static inline int scheduler_find_highest_priority(uint32_t ready_mask) {
    int lz;
    asm("clz %w0, %w1" : "=r"(lz) : "r"(ready_mask));
    return 31 - lz;   /* bit 31 = priority 31 = highest;
                         returns -1 if ready_mask == 0 (clz of 0 is 32) */
}

This is O(1) regardless of the number of priority levels — a direct hardware implementation of the scheduler’s core operation.
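A portable variant uses GCC’s __builtin_clz, which compiles to a single clz on AArch64; the explicit guard is needed because __builtin_clz(0) is undefined at the C level (find_highest_priority is an illustrative name):

```c
/* Portable priority lookup: highest set bit of the ready mask, or -1
 * if no queue is ready. GCC lowers __builtin_clz to one clz on AArch64. */
static inline int find_highest_priority(unsigned ready_mask) {
    if (ready_mask == 0)
        return -1;                        /* no ready tasks */
    return 31 - __builtin_clz(ready_mask);
}
```

This form is also convenient for unit-testing the scheduler's selection logic on a development host before running it bare-metal.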

GAS Assembler directives used in kernel code:

Directive               Purpose
.align n                Align to 2^n bytes (e.g., .align 11 for 2048-byte alignment)
.rept n                 Repeat a block n times
.section .text.boot     Place in named linker section
.global sym             Make sym visible to the linker
.type sym, %function    Mark sym as a function (helps debuggers)
.size sym, .-sym        Record function size (for profiling tools)
.word 0xDEADBEEF        Place a 32-bit word literal
.quad addr              Place a 64-bit pointer (for jump tables)
.equ NAME, expr         Define an assembly constant

The .type and .size directives are optional but improve the quality of DWARF debug information and objdump output. Adding them to the exception vector table entries makes stack traces during kernel faults significantly more readable.

Stack Frames, Frame Pointers, and Backtrace

A stack frame is the region of the stack allocated for one function invocation. It holds the function’s local variables, saved registers, and (in standard calling conventions) the caller’s frame pointer and return address. Understanding the stack frame layout is essential for two debugging techniques that are invaluable in a bare-metal kernel: manual stack traces (walking the frame chain to reconstruct the call sequence that led to a fault) and stack sizing (measuring worst-case stack usage before deployment).

The AAPCS64 calling convention defines the standard frame record as a pair of 64-bit values stored at the bottom of the current frame:

  • [fp + 0]: the previous frame pointer (the caller’s x29 value)
  • [fp + 8]: the link register value at function entry (the return address)

A well-behaved function that may be called from code needing backtraces should save x29 and x30 at its entry, then set x29 to point to the saved pair:

/* Entry sequence for a frame-pointer-compliant function */
stp     x29, x30, [sp, #-16]!  /* push (fp, lr) pair, pre-decrement sp */
mov     x29, sp                 /* fp now points to the frame record */

/* ... function body ... */

ldp     x29, x30, [sp], #16    /* pop (fp, lr) pair, post-increment sp */
ret

The linked chain of frame records forms a frame chain:

x29        → [frame record for current function: (caller_fp, return_addr)]
caller_fp  → [frame record for caller: (caller's_caller_fp, caller's_return_addr)]
             ... each saved fp points to the next frame record up the stack ...
0            (chain ends: x29 = 0 at the entry point, so the bottom record holds (null, entry_point))

To produce a backtrace manually (in a kernel fault handler, without GDB):

void print_backtrace(void) {
    uint64_t *fp;
    asm volatile("mov %0, x29" : "=r"(fp));

    uart0_puts("Backtrace:\n");
    for (int depth = 0; depth < 16 && fp != NULL; depth++) {
        uint64_t prev_fp  = fp[0];
        uint64_t ret_addr = fp[1];
        uart0_printf("  [%d] 0x%016llx\n", depth, ret_addr);
        fp = (uint64_t *)prev_fp;
    }
}

print_backtrace() walks the frame chain from the current frame upward, printing each return address. The return addresses can then be resolved to function names offline using aarch64-elf-addr2line -e kernel8.elf <address>.

When frame pointers are not present: GCC by default may elide the frame pointer when -O2 is used and the compiler can track stack depth through other means. In a kernel, this optimization is counterproductive: faults are rare but must be debugged quickly, and a missing frame pointer makes the backtrace impossible. The compiler flag -fno-omit-frame-pointer forces frame-pointer-compliant code generation:

CFLAGS += -fno-omit-frame-pointer

This is already included in the recommended Makefile from Chapter 5. The performance cost is minor: one extra register allocation (x29 is tied up as the frame pointer) and two extra instructions at function entry and exit. For kernel code, which typically has much lower function call overhead than application code, this cost is negligible.

Stack usage profiling with -fstack-usage: GCC’s -fstack-usage flag generates a .su file alongside each .o file, recording the worst-case stack depth of each function in bytes:

task.c:scheduler_next   32       static
message.c:do_send       96       static
exception.c:handle_irq  272      dynamic,bounded

A static entry means the function uses a fixed-size stack frame deterministically. A dynamic,bounded entry means the function has variable-length array allocations or uses alloca, but the compiler can determine a bound. A dynamic entry (without “bounded”) means the stack depth is unbounded — a red flag in a real-time kernel.

The .su files can be post-processed with a script that walks the call graph and computes the maximum stack depth from any entry point:

# Pseudo-code: compute worst-case stack depth for each function
def max_depth(fn, call_graph, su_table, visited):
    if fn in visited: return su_table.get(fn, 0)  # cycle: use local estimate
    visited.add(fn)
    local = su_table.get(fn, 0)
    callees = call_graph.get(fn, [])
    child_max = max((max_depth(c, call_graph, su_table, visited) for c in callees), default=0)
    return local + child_max

This analysis, combined with the task stack size from Chapter 5 (STACK_SIZE = 64 KB), gives the kernel writer confidence that no task will overflow its stack under worst-case conditions. Typical stack usage for a CS 452 train control task is 2–4 KB; the 64 KB budget leaves ample headroom.

The stack frame in the context switch: the kernel’s context switch saves 34 registers (x0–x30, SP_EL0, ELR_EL1, SPSR_EL1) as a contiguous frame on the kernel stack. This saved frame is itself a valid frame record for GDB’s backtrace purposes — if the kernel is interrupted while in the exception handler, GDB can unwind through the saved frame into the user task’s call stack. The prerequisite is that the exception vector table entries are decorated with .type %function and the frame record (x29, x30) is pushed first, before the rest of the context save:

/* Exception entry: must look like a function prologue for GDB */
.type exception_entry, %function
exception_entry:
    stp     x29, x30, [sp, #-272]!  /* save fp + lr first (frame record) */
    mov     x29, sp                  /* set fp to frame record */
    stp     x0,  x1,  [sp, #16]    /* then save x0..x28 */
    stp     x2,  x3,  [sp, #32]
    /* ... */

This careful framing means that from GDB’s perspective, exception_entry looks like a regular function call, and unwinding through it into the interrupted task’s stack frame is possible. Without this setup, GDB’s backtrace stops at the exception boundary.

Canary checking via stack high-water mark: in addition to the byte canary at the bottom of the stack, a more precise diagnosis of stack usage uses a high-water mark pattern: fill the entire task stack with a known value (e.g., 0xDEADBEEF) at initialization, and then scan from the bottom upward to find the first non-canary word after a task run:

#define STACK_FILL 0xDEADBEEFUL

void task_stack_init(uint32_t *stack, size_t size) {
    for (size_t i = 0; i < size / 4; i++)
        stack[i] = STACK_FILL;
}

size_t task_stack_used(uint32_t *stack, size_t size) {
    size_t used = 0;
    for (size_t i = 0; i < size / 4; i++) {
        if (stack[i] != STACK_FILL) {
            used = (size / 4 - i) * 4;
            break;
        }
    }
    return used;
}

After running the system through its worst-case workload, task_stack_used() reports the maximum stack depth for each task. This is the empirical equivalent of the static -fstack-usage analysis — less rigorous (it samples actual execution, not all possible paths) but directly applicable to the specific workload.


Chapter 4: Memory and Devices on the BCM2711

The BCM2711 is not a general-purpose microprocessor; it is a system-on-chip (SoC) that integrates a four-core Cortex-A72 cluster with a GPU, a DMA engine, USB controllers, PCIe, and a rich set of peripheral interfaces — all on a single die. Understanding how these peripherals are visible to the ARM cores is prerequisite to writing any bare-metal device driver.

The Memory Map

ARM processors access memory and memory-mapped device registers through a unified address space. There is no separate I/O port space (as on x86); a GPIO control register and a DRAM cell are distinguished only by their address and the memory attributes the MMU assigns to that range.

On the RPi 4, the physical address space is organized as follows:

Physical Address Range       Contents
0x00000000–0x3FFFFFFF        SDRAM (first 1 GB; high-RAM models map additional RAM at higher addresses)
0xFE000000–0xFEFFFFFF        ARM peripheral registers (BCM2711)
0xFF800000–0xFF9FFFFF        ARM local registers (timers, mailboxes)
0xFF840000–0xFF847FFF        GIC-400 interrupt controller

The BCM2711 datasheet documents peripheral registers using bus addresses in the range 0x7E0000000x7EFFFFFF. These bus addresses correspond to physical addresses starting at 0xFE000000: subtract 0x7E000000 and add 0xFE000000. For example, the system timer control/status register is documented at bus address 0x7E003000; the physical address for CPU access is 0xFE003000. This discrepancy exists because the bus address space is also visible to the VideoCore GPU and DMA engine, which use their own address translation.

The kernel loads at physical address 0x80000, which is where the RPi 4 firmware drops the core after processing kernel8.img. From that point, the startup code configures the MMU with an identity mapping (physical address = virtual address) over the RAM range and device range, with appropriate memory type attributes: cacheable normal memory for RAM, device memory (uncacheable, strongly-ordered) for peripheral registers.

Volatile and Memory Ordering

The distinction between normal memory and device memory is not academic. Consider a UART transmit FIFO: writing to the FIFO register sends a byte over the serial line; reading a status register tells you whether the FIFO is full. These registers have side effects that depend on the exact sequence and number of accesses. A compiler that reorders or coalesces memory accesses for optimization — as GCC routinely does for normal memory — would break device driver code.

The C keyword volatile tells the compiler to treat every access to the annotated variable as an observable side effect that must not be reordered, eliminated, or coalesced. The standard idiom for MMIO in C is:

#define MMIO_BASE 0xFE000000UL

static inline void mmio_write(uint32_t reg, uint32_t val) {
    *(volatile uint32_t *)(MMIO_BASE + reg) = val;
}

static inline uint32_t mmio_read(uint32_t reg) {
    return *(volatile uint32_t *)(MMIO_BASE + reg);
}

The volatile qualifier prevents the compiler from moving the access, but it does not prevent the processor’s memory system from reordering it. The ARM memory model — like most modern out-of-order processors — allows stores to complete in an order different from program order, and allows loads to observe values from the store buffer before they are visible to other observers. For device registers, this reordering must be suppressed with memory barriers.

A data synchronization barrier (dsb sy) ensures that all memory accesses before the barrier complete before any memory access after the barrier begins. For device accesses, dsb is typically the right choice. An instruction synchronization barrier (isb) flushes the instruction pipeline, ensuring that subsequent instructions are fetched after any context changes (such as system register writes) take effect.

In practice, most simple device accesses in a single-core environment are not visibly affected by hardware reordering because the core’s pipeline is typically coherent with itself. However, the barriers become essential at two points: when enabling interrupts (where the GIC enable must be visible before the first interrupt fires), and when the DMA engine or other bus masters are involved.

The System Timer

The BCM2711 system timer (at physical base 0xFE003000) is a 64-bit free-running counter clocked at 1 MHz. Its key registers:

Offset        Register  Description
0x000         CS        Control and status; compare match flags
0x004         CLO       Lower 32 bits of the counter
0x008         CHI       Upper 32 bits of the counter
0x00C–0x018   C0–C3     Compare registers for four independent channels

The counter increments once per microsecond regardless of CPU clock speed or power state, making it a reliable time reference. Compare registers C0 and C2 are used by the VideoCore GPU; the kernel must use C1 and C3. When the free-running counter’s lower 32 bits match Cn, the corresponding channel fires an interrupt (signalled through the GIC).

The safe idiom for reading the 64-bit counter on a 32-bit-at-a-time bus:

uint64_t timer_read(void) {
    uint32_t hi, lo, hi2;
    do {
        hi  = mmio_read(TIMER_CHI);
        lo  = mmio_read(TIMER_CLO);
        hi2 = mmio_read(TIMER_CHI);
    } while (hi != hi2);           // repeat if the upper word rolled over
    return ((uint64_t)hi << 32) | lo;
}

To schedule a timer interrupt N microseconds in the future:

void timer_set(uint32_t channel, uint32_t delay_us) {
    uint32_t target = mmio_read(TIMER_CLO) + delay_us;
    mmio_write(TIMER_C0 + channel * 4, target);
}

Clearing the interrupt requires writing a 1 to the corresponding bit of the CS register (the BCM2711 timer uses write-1-to-clear semantics for its status flags).

GPIO

The BCM2711’s GPIO block (physical base 0xFE200000) controls 58 general-purpose pins. Each pin can be configured as input, output, or one of several alternate functions (UART, SPI, I2C, PWM, etc.). The function for each pin is set by the GPFSEL registers (one register per ten pins, three bits per pin). Output is set and cleared through GPSET0/1 and GPCLR0/1 (32 pins per register); input is read through GPLEV0/1.

GPIO pins can also be configured to generate edge-detect events, which propagate to interrupt status registers and eventually to the GIC. The MCP2515 CAN controller asserts its INT pin low when a message has been received or a transmit error has occurred; this is connected to GPIO pin 17, which is configured as a falling-edge-detect input. The resulting event appears at GIC interrupt ID 145.

Identity Mapping and Device Memory Types

The ARMv8-A MMU allows the kernel to assign memory attributes to address ranges. There are two critical attribute classes: normal memory (SDRAM, cacheable, weakly-ordered) and device memory (MMIO, uncacheable, strongly-ordered). The kernel’s startup code configures a minimal identity-mapped page table that assigns:

  • 0x00000000–0x3FFFFFFF: normal memory, inner/outer write-back cacheable
  • 0xFE000000–0xFFFFFFFF: device-nGnRnE memory (non-gathering, non-reordering, no early-write-acknowledgement)

The nGnRnE designation is the strongest device ordering — it prohibits the processor from merging consecutive writes to the same address, from reordering accesses within the device region, and from speculating ahead on early-write-acknowledge. This is the appropriate type for peripheral registers that have side effects.

ARMv8 Page Table Format

A complete understanding of the ARMv8-A MMU is not strictly required for CS 452 (the kernel uses a minimally-configured identity map), but knowing how it works demystifies the boot sequence and is essential for anyone who later implements memory protection.

ARMv8-A uses a four-level page table hierarchy (levels 0–3) for the full 64-bit virtual address space, though most practical implementations use only 2 or 3 levels. For a 4 KB page granule (the most common configuration), each level of the hierarchy has 512 entries (9 bits of index), and the full 48-bit virtual address is decomposed as:

VA bits   Index into
[47:39]   Level 0 table
[38:30]   Level 1 table
[29:21]   Level 2 table
[20:12]   Level 3 table
[11:0]    Byte within page (4 KB)

Each page table entry (PTE) is 8 bytes (64-bit). Bits [1:0] determine the descriptor type:

  • 0b01: Block descriptor (1 GB at level 1, 2 MB at level 2)
  • 0b11: Table descriptor at levels 0–2 (points to next-level table); Page descriptor at level 3

The memory attributes (normal vs. device, cacheable vs. uncacheable) are encoded in bits [4:2] of the PTE, which index into the MAIR_EL1 register (Memory Attribute Indirection Register). MAIR_EL1 holds 8 attribute types (one per byte); the PTE’s attribute index selects one. A typical MAIR_EL1 setup:

  • Index 0: Device-nGnRnE (0x00) — for peripheral registers
  • Index 1: Normal memory, inner/outer write-back, read-allocate, write-allocate (0xFF)

For the minimal CS 452 identity map, only two blocks are needed: a level 1 block descriptor at 0x00000000 for normal RAM (attribute index 1) and a level 1 block descriptor covering 0xC0000000–0xFFFFFFFF, the 1 GB block that contains the peripheral registers, for device memory (attribute index 0). Two PTEs in a single page cover the entire address space the kernel uses.

The walk through a virtual address lookup:

  1. TTBR0_EL1 holds the physical address of the level 0 table.
  2. The processor extracts VA[47:39] as the level 0 index (0 for all addresses in a 4 GB system with 4 KB pages and a 2-level map).
  3. The level 0 PTE points to the level 1 table.
  4. VA[38:30] indexes the level 1 table. Entry 0 is the 1 GB RAM block. The peripheral region at 0xFE000000 lies within the fourth gigabyte (0xC0000000–0xFFFFFFFF), so it is covered by a device-memory block at level 1 index 3. All other level 1 entries are invalid.

The simplicity of this setup is its virtue. The boot-time page table setup for a CS 452 kernel takes fewer than 30 assembly instructions.

The PL011 UART: Register-Level Programming

The PL011 is the BCM2711’s primary full-featured UART, used for terminal communication with the development workstation. Understanding it at the register level illustrates the general pattern of peripheral programming on the BCM2711.

The PL011 base address is 0xFE201000. Key registers (at their offsets from base):

Offset  Name    Description
0x000   DR      Data register: receive buffer (read) / transmit FIFO (write)
0x018   FR      Flag register: status bits (TXFF, RXFE, BUSY, etc.)
0x024   IBRD    Integer baud-rate divisor
0x028   FBRD    Fractional baud-rate divisor
0x02C   LCR_H   Line control register (FIFO enable, word length, parity)
0x030   CR      Control register (UART enable, TX enable, RX enable)
0x034   IFLS    FIFO interrupt level select
0x038   IMSC    Interrupt mask set/clear
0x040   MIS     Masked interrupt status
0x044   ICR     Interrupt clear register
0x048   DMACR   DMA control register

Baud-rate configuration: the PL011 clocks from UARTCLK, which on the RPi 4 is derived from the system PLL at 48 MHz (or 3 MHz in some configurations). The baud-rate divisor is:

\[ \text{BAUDDIV} = \frac{F_{\text{UARTCLK}}}{16 \times \text{baud rate}} \]

For 115200 baud at 48 MHz: BAUDDIV = 48,000,000 / (16 × 115,200) = 26.04167. The integer part goes in IBRD (= 26); the fractional part is encoded as FBRD = round(0.04167 × 64) = 3.

LCR_H configuration for 8N1 (8 data bits, no parity, 1 stop bit, FIFO enabled):

#define UART0_BASE   0xFE201000UL
#define UART0_DR    (UART0_BASE + 0x000)
#define UART0_FR    (UART0_BASE + 0x018)
#define UART0_IBRD  (UART0_BASE + 0x024)
#define UART0_FBRD  (UART0_BASE + 0x028)
#define UART0_LCRH  (UART0_BASE + 0x02C)
#define UART0_CR    (UART0_BASE + 0x030)
#define UART0_IFLS  (UART0_BASE + 0x034)
#define UART0_IMSC  (UART0_BASE + 0x038)
#define UART0_ICR   (UART0_BASE + 0x044)

#define FR_TXFF  (1 << 5)   // transmit FIFO full
#define FR_RXFE  (1 << 4)   // receive FIFO empty
#define FR_BUSY  (1 << 3)   // UART transmitter busy

void uart0_init(void) {
    /* Disable UART before reconfiguring */
    mmio_write(UART0_CR, 0);

    /* Wait for any ongoing transmission to complete */
    while (mmio_read(UART0_FR) & FR_BUSY) {}

    /* Configure GPIO 14 (TXD), 15 (RXD) as UART0 alt function 0 */
    uint32_t sel = mmio_read(0xFE200004);  // GPFSEL1 (pins 10-19)
    sel &= ~((7 << 12) | (7 << 15));       // clear GPIO 14, 15 fields
    sel |=   (4 << 12) | (4 << 15);        // set alt0 (100)
    mmio_write(0xFE200004, sel);

    /* Set baud rate: 115200 at 48 MHz UARTCLK */
    mmio_write(UART0_IBRD, 26);
    mmio_write(UART0_FBRD, 3);

    /* 8 data bits, 1 stop bit, no parity, FIFO enabled */
    mmio_write(UART0_LCRH, (3 << 5) | (1 << 4));   /* WLEN=11 (8 bits), FEN=1 */

    /* FIFO interrupt thresholds: TX at 1/4 full, RX at 1/8 full */
    mmio_write(UART0_IFLS, (0b000 << 3) | 0b001);  /* RXIFLSEL=1/8 (bits 5:3), TXIFLSEL=1/4 (bits 2:0) */

    /* Clear all pending interrupts */
    mmio_write(UART0_ICR, 0x7FF);

    /* Enable RX interrupt, TX interrupt; disable others */
    mmio_write(UART0_IMSC, (1 << 4) | (1 << 5));   /* RXIM=1, TXIM=1 */

    /* Enable UART: RX enable (bit 9), TX enable (bit 8), UART enable (bit 0) */
    mmio_write(UART0_CR, (1 << 9) | (1 << 8) | (1 << 0));
}

Polling transmit (for use before the interrupt-driven server is running):

void uart0_putc_poll(char c) {
    while (mmio_read(UART0_FR) & FR_TXFF) {}   // wait until FIFO not full
    mmio_write(UART0_DR, (uint32_t)c);
}

Interrupt-driven receive: when the RX FIFO reaches the 1/8-full threshold (4 bytes, since the PL011 has a 32-byte deep FIFO on the BCM2711), the RXIM interrupt fires. The interrupt handler drains the FIFO:

void uart0_rx_irq_handler(void) {
    while (!(mmio_read(UART0_FR) & FR_RXFE)) {
        uint32_t dr = mmio_read(UART0_DR);
        if (dr & (1 << 11)) { /* overrun error */
            mmio_write(UART0_ICR, 1 << 10);  /* clear OEIC (bit 10) */
            continue;
        }
        if (dr & (1 << 10)) { /* break */ continue; }
        if (dr & (1 << 9))  { /* parity error */ continue; }
        if (dr & (1 << 8))  { /* framing error */ continue; }
        char c = dr & 0xFF;
        rx_ring_push(c);    /* push to software ring buffer */
    }
    mmio_write(UART0_ICR, 1 << 4);  /* clear RXIC */
}

The data register’s upper bits carry error flags. Checking them prevents corrupted bytes from entering the receive buffer. The OEIC, RXIC, TXIC bits in ICR (Interrupt Clear Register) are write-1-to-clear.

Flow control (CTS/RTS): the PL011 supports hardware flow control. When CTSEn (bit 15 of CR) is set, the UART checks the CTS pin before transmitting; when RTSEn (bit 14) is set, the RTS pin is automatically asserted when the receive FIFO reaches a threshold. For the CS 452 terminal, flow control is typically not needed because the terminal server’s software buffering handles rate mismatches. For the UART connection to the CS3 (if used), flow control may be required if the CS3 does not have internal buffering.

The SPI0 Peripheral: Register-Level Programming

The SPI0 peripheral (base 0xFE204000) connects the RPi 4 to the MCP2515 CAN controller. Understanding SPI at the register level clarifies how each CAN frame transmission and reception happens.

The SPI0 uses a FIFO-based interface. Key registers:

Offset  Name   Description
0x000   CS     Control and status: DONE, RXD, TXD, CPOL, CPHA, CS select
0x004   FIFO   FIFO access: write to transmit, read to receive
0x008   CLK    Clock divider: SPI clock = core_clock / CLK
0x00C   DLEN   Data length (DMA mode only)

A single SPI transaction with the MCP2515 requires:

  1. Assert CS by writing CS register with CS[1:0] = chip select, TA = 1 (transfer active), CLEAR = 1 (clear FIFOs).
  2. Write command bytes to FIFO.
  3. Write dummy bytes to FIFO for bytes to be received (SPI is full-duplex; sending dummy 0x00 advances the clock for reception).
  4. Poll the DONE bit (CS[16]) until the transaction completes.
  5. Read received bytes from FIFO.
  6. Deassert CS by clearing TA.

For the MCP2515’s 10 MHz maximum clock and the BCM2711’s core clock at 200 MHz, CLK = 200 MHz / 10 MHz = 20.

#define SPI0_BASE  0xFE204000UL
#define SPI0_CS    (SPI0_BASE + 0x000)
#define SPI0_FIFO  (SPI0_BASE + 0x004)
#define SPI0_CLK   (SPI0_BASE + 0x008)

void spi0_init(void) {
    mmio_write(SPI0_CLK, 20);           // 200 MHz / 20 = 10 MHz
    mmio_write(SPI0_CS, 0);             // CPOL=0, CPHA=0 (MCP2515 mode 0)
}

/* Transfer one byte: transmit tx, return received byte */
uint8_t spi0_transfer_byte(uint8_t tx) {
    /* Clear FIFOs, set TA=1 */
    mmio_write(SPI0_CS, (1 << 7) | (3 << 4) | 1);  /* TA=1, clear both FIFOs (bits 5:4), CS=1 */
    /* Wait for TX FIFO space */
    while (!(mmio_read(SPI0_CS) & (1 << 18))) {}     /* TXD bit */
    mmio_write(SPI0_FIFO, tx);
    /* Wait for transfer to complete */
    while (!(mmio_read(SPI0_CS) & (1 << 16))) {}     /* DONE bit */
    /* Read received byte */
    uint8_t rx = mmio_read(SPI0_FIFO) & 0xFF;
    /* Deassert CS (TA=0) */
    mmio_write(SPI0_CS, 0);
    return rx;
}

A multi-byte SPI transaction (e.g., writing to MCP2515 TXBUF0 with 13 bytes) is more efficient with FIFO streaming:

void spi0_transfer(const uint8_t *tx, uint8_t *rx, int len) {
    /* Clear both FIFOs (bits 5:4), set TA=1, CS=1 */
    mmio_write(SPI0_CS, (1 << 7) | (3 << 4) | 1);

    int tx_sent = 0, rx_got = 0;
    while (rx_got < len) {
        /* Fill TX FIFO as much as possible */
        while (tx_sent < len && (mmio_read(SPI0_CS) & (1 << 18))) {
            mmio_write(SPI0_FIFO, tx[tx_sent++]);
        }
        /* Drain RX FIFO */
        while (mmio_read(SPI0_CS) & (1 << 17)) {  /* RXD bit */
            rx[rx_got++] = mmio_read(SPI0_FIFO);
        }
    }

    /* Wait for completion, deassert CS */
    while (!(mmio_read(SPI0_CS) & (1 << 16))) {}
    mmio_write(SPI0_CS, 0);
}

This streaming approach fills the TX FIFO and drains the RX FIFO in parallel, maximizing throughput. For the 13-byte MCP2515 read (1 command + 1 address + 11 data bytes), the streaming approach reduces wait time compared to byte-at-a-time transfer.

GPIO Deep Dive

GPIO (General Purpose Input/Output) on the BCM2711 is more complex than the name suggests. Each of the 58 pins can be configured for up to 6 alternate functions (GPIO, UART, SPI, I2C, PCM, etc.). The function select registers GPFSEL0–GPFSEL5 control 10 pins each (3 bits per pin, 30 bits per register):

Value  Function
000    Input
001    Output
010    Alternate function 5
011    Alternate function 4
100    Alternate function 0
101    Alternate function 1
110    Alternate function 2
111    Alternate function 3

For SPI0 (used to communicate with the MCP2515): GPIO 7 = CE1 (SPI chip select 1, alt function 0), GPIO 8 = CE0 (alt function 0), GPIO 9 = MISO (alt function 0), GPIO 10 = MOSI (alt function 0), GPIO 11 = SCLK (alt function 0). For UART0 (PL011): GPIO 14 = TXD0 (alt function 0), GPIO 15 = RXD0 (alt function 0). For GPIO 17 (MCP2515 INT): configure as input (function 000).

Pull-up/pull-down resistors on earlier Raspberry Pi SoCs (BCM2835/BCM2837) were controlled by GPPUD and GPPUDCLK0/1 through a clocked three-step sequence; the BCM2711 replaces this with the directly writable GPIO_PUP_PDN_CNTRL_REG0–3 registers (two bits per pin: 00 = none, 01 = pull-up, 10 = pull-down). The pull-up resistor on GPIO 17 (MCP2515 INT) ensures the line reads high when the MCP2515 is not asserting an interrupt (since INT is active-low, open-drain).

GPIO also provides edge and level event detection through registers GPEDS0/1 (event detect status), GPREN0/1 (rising edge enable), GPFEN0/1 (falling edge enable), GPHEN0/1 (high detect enable), and GPLEN0/1 (low detect enable). Setting a bit in GPFEN0 for GPIO 17 means the hardware will set the corresponding bit in GPEDS0 when a falling edge is detected (which is when the MCP2515’s INT line is asserted, since INT is active-low). This detection bit persists until explicitly cleared by writing a 1 to GPEDS0. The GIC-400 can be configured to route this GPIO event to an IRQ line, allowing the interrupt handler to be invoked without polling.

#define GPFSEL0    0xFE200000UL
#define GPFSEL1    0xFE200004UL
#define GPSET0     0xFE20001CUL
#define GPCLR0     0xFE200028UL
#define GPLEV0     0xFE200034UL
#define GPEDS0     0xFE200040UL
#define GPFEN0     0xFE200058UL

void gpio_configure_mcp_int(void) {
    /* GPIO 17: input, falling-edge detect, pull-up */
    uint32_t sel = mmio_read(GPFSEL1);
    sel &= ~(7u << 21);    /* GPIO 17 = bits [23:21] of GPFSEL1 */
    mmio_write(GPFSEL1, sel);  /* value 000 = input */

    /* Enable falling-edge detect on GPIO 17 */
    mmio_write(GPFEN0, mmio_read(GPFEN0) | (1u << 17));
}

bool gpio_mcp_int_asserted(void) {
    return (mmio_read(GPEDS0) >> 17) & 1;
}

void gpio_clear_mcp_int(void) {
    mmio_write(GPEDS0, 1u << 17);   /* write-1-to-clear */
}

The BCM2711 Mailbox Interface

The BCM2711 is not just a quad-core ARM chip — it is a system-on-chip that pairs the Cortex-A72 CPUs with a VideoCore VI GPU, and the two sides communicate through a mailbox interface. The mailbox mechanism matters to a bare-metal kernel for two reasons: the firmware that runs on the VideoCore side controls things that the ARM side needs (the UART clock rate being the most important), and the mailbox provides a way to query the board’s memory map.

Physically, the mailbox is a set of registers in the ARM peripheral area starting at 0xFE00B880. The ARM can write a message address into the mailbox write register; the VideoCore firmware processes the message and writes a reply to the same mailbox. The protocol is simple:

Register  Offset  Description
Read      0x00    Read a message from the VideoCore
Status    0x18    Bit 31 = full, bit 30 = empty
Write     0x20    Write a message to the VideoCore

Messages are tagged property requests. The caller allocates a 16-byte-aligned buffer in memory, fills in a property tag, and passes the physical address of the buffer (OR’d with channel 8, the property channel) to the mailbox write register. The VideoCore processes the request in place, writing the response into the same buffer.

The most important use for a kernel is querying the actual UART clock rate, since the firmware may change it from the nominal 48 MHz:

#define MBOX_BASE  0xFE00B880UL
#define MBOX_READ  (MBOX_BASE + 0x00)
#define MBOX_STAT  (MBOX_BASE + 0x18)
#define MBOX_WRITE (MBOX_BASE + 0x20)
#define MBOX_FULL  (1u << 31)
#define MBOX_EMPTY (1u << 30)
#define MBOX_CH_PROP 8

/* Must be 16-byte aligned — place in .data with explicit alignment */
static volatile uint32_t mbox_buf[36] __attribute__((aligned(16)));

uint32_t mbox_get_uart_clock(void) {
    mbox_buf[0]  = 8 * 4;      /* total size in bytes */
    mbox_buf[1]  = 0;          /* request code */
    mbox_buf[2]  = 0x00030002; /* tag: get clock rate */
    mbox_buf[3]  = 8;          /* value buffer size */
    mbox_buf[4]  = 0;          /* 0 = request */
    mbox_buf[5]  = 2;          /* clock ID 2 = UART */
    mbox_buf[6]  = 0;          /* response: clock rate (Hz) filled here */
    mbox_buf[7]  = 0;          /* end tag */

    /* Flush the buffer to memory before passing its address to VideoCore */
    asm volatile("dsb sy" ::: "memory");

    uint32_t addr = ((uint32_t)(uintptr_t)mbox_buf) | MBOX_CH_PROP;

    /* Wait until mailbox is not full, then write */
    while (mmio_read(MBOX_STAT) & MBOX_FULL) {}
    mmio_write(MBOX_WRITE, addr);

    /* Wait for response */
    for (;;) {
        while (mmio_read(MBOX_STAT) & MBOX_EMPTY) {}
        if (mmio_read(MBOX_READ) == addr) break;
    }

    /* Ensure the VideoCore's writes are observed before reading the response.
       (If the data cache is enabled over this buffer, a cache invalidate by
       VA is also required; dsb alone does not invalidate cache lines.) */
    asm volatile("dsb sy" ::: "memory");

    if (mbox_buf[1] == 0x80000000) /* SUCCESS */
        return mbox_buf[6];
    return 0;
}

Several things deserve attention here. The dsb sy (“data synchronization barrier, full system”) instructions force all preceding stores to complete before the VideoCore reads the buffer, and force all loads to see the VideoCore’s writes after the response arrives. Without them, the Cortex-A72’s out-of-order execution and store buffer might cause the VideoCore to read a partially-initialized request, or the ARM to read stale cache data instead of the VideoCore’s response. This is the same story as volatile mmio_read/write applied to a DMA-like scenario: whenever two separate execution agents share memory, explicit barriers are required.

The mailbox also provides a set-clock-rate tag (0x00038002, with the clock ID and desired rate in the tag’s value buffer) for raising the UART clock. In practice, most CS 452 kernel implementations accept the firmware’s default clock and compute baud divisors accordingly rather than setting the clock explicitly.

The Auxiliary Peripheral Block (AUX)

In addition to the PL011 UART used for the terminal, the BCM2711 contains an Auxiliary block at 0xFE215000 that provides a Mini UART (UART1) and two additional SPI controllers (SPI1 and SPI2). These are simpler and more limited than the PL011 and SPI0, but they share a single interrupt line, which matters for interrupt routing.

Mini UART (UART1) is a 16550-compatible UART with an 8-byte FIFO (much smaller than the PL011’s 32-byte deep FIFO). Its baud rate derivation differs: Mini UART baud = system_clock_freq / (8 × (BAUD + 1)). Because it shares an IRQ with SPI1 and SPI2, the interrupt handler must check the AUX_IRQ register to determine which device caused the interrupt.

For CS 452, the Mini UART is rarely used directly — the PL011 is preferred for the terminal. However, knowing that UART1 exists matters because the RPi 4 firmware may remap UART1 to the GPIO 14/15 pins by default (depending on the config.txt setting), which would prevent the PL011 from working on those pins unless dtoverlay=disable-bt or enable_uart=1 is set in the firmware configuration. The W26 course environment uses specific config.txt settings to ensure the PL011 is on GPIO 14/15; understanding why those settings are necessary requires knowing the AUX UART exists.

The AUX_ENABLES register at 0xFE215004 controls which auxiliary peripherals are active. Bit 0 enables UART1, bit 1 enables SPI1, bit 2 enables SPI2. Setting a bit starts the clock to the peripheral; clearing it stops it and resets the peripheral’s registers.

Memory-Mapped I/O Discipline: Putting It All Together

Every peripheral access in the BCM2711 follows the same pattern, but the details differ between device types in ways that are easy to get wrong:

Ordinary MMIO registers (UART DR, SPI FIFO, GPIO function select): mark as volatile and access through mmio_read/mmio_write. The volatile qualifier prevents the compiler from caching the value in a register or reordering accesses. Each volatile load generates an actual memory read; each volatile store generates an actual memory write. This is sufficient for single-threaded polling loops.

Registers that must be modified atomically (GPIO GPFSEL — multiple bits per register, shared between pins): use read-modify-write. The BCM2711 does not provide a SET/CLEAR register pair for GPFSEL (though it does for output level via GPSET/GPCLR), so a read-modify-write is the only option. If two cores might simultaneously modify different pins in the same GPFSEL register, a spinlock is required.

DMA-shared buffers (mailbox buffers, DMA descriptors): require explicit cache-coherency barriers (dsb sy) because the Cortex-A72 caches may not be visible to the VideoCore or the DMA engine, which are separate bus masters. The __attribute__((aligned(16))) ensures the buffer does not straddle a cache line boundary in a way that could cause partial visibility.

Peripheral ordering (multi-step sequences like SPI CS assert → FIFO write → wait for DONE → CS deassert): individual mmio_write calls within a single core are ordered by the Cortex-A72’s memory model for device memory (Device-nGnRnE type, configured by MAIR_EL1 and the page-table attribute index). The ARMv8 architecture guarantees that accesses to Device memory are not reordered by the CPU with respect to each other in program order. However, if you switch from MMIO writes to a delay loop that involves normal memory accesses, the compiler may reorder code unless barriers or volatile memory clobbers prevent it.

This discipline — volatile for MMIO, barriers for DMA, locks for multi-core, careful sequencing for multi-step protocols — appears in every embedded system and every OS. The BCM2711 just happens to make all the lessons concrete at once.


Chapter 5: Bare-Metal Toolchain

Writing a kernel means producing an executable with no operating system underneath it. The compiler and linker must be configured to match, and the startup code must take responsibility for every initialization step that normally happens automatically in a user-space program.

Cross-Compilation

The development machine (likely an x86-64 Linux box or macOS) cannot execute AArch64 binaries, so we use a cross-compiler: a version of GCC that runs on the host but produces code for the ARM64 target. The appropriate toolchain is either aarch64-linux-gnu-gcc (packaged by most Linux distributions) or a bare-metal variant such as aarch64-none-elf-gcc from the Arm GNU Toolchain; the -elf variants omit the Linux ABI assumptions and are preferred for kernels.

Compilation flags for a bare-metal kernel:

CC      := aarch64-elf-gcc
# -ffreestanding:       do not assume libc/crt0 are present
# -mgeneral-regs-only:  do not generate FP/SIMD instructions
# -nostdlib:            do not link standard libraries
# -nostartfiles:        do not link crt0.o
# (In Make, a '#' after the backslash would break the line continuation,
#  so the flag comments must live on their own lines.)
CFLAGS  := -Wall -Wextra -O2 -ffreestanding -mgeneral-regs-only \
           -nostdlib -nostartfiles

-ffreestanding is critical: it tells GCC that no hosted standard library exists, so it will not assume the semantics of functions like printf or generate calls to them implicitly. One caveat: even in freestanding mode, GCC reserves the right to emit calls to memcpy, memmove, memset, and memcmp (for example, for struct assignments and large initializers), so a freestanding kernel must provide these four functions.
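When these functions are needed, whether called explicitly or emitted by the compiler, minimal byte-wise versions suffice. One wrinkle: GCC can pattern-match such loops back into calls to memset/memcpy, so kernels typically compile their string routines with -fno-builtin or -fno-tree-loop-distribute-patterns. A sketch of the two most commonly needed:

```c
#include <stddef.h>

/* Byte-at-a-time implementations: correct and small, not fast. */
void *memset(void *dst, int c, size_t n) {
    unsigned char *d = dst;
    while (n--)
        *d++ = (unsigned char)c;
    return dst;
}

void *memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```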

-mgeneral-regs-only prevents the compiler from using floating-point or SIMD registers. This matters because FP registers must be saved and restored on every context switch if the kernel uses them; by banning them from the kernel itself, context switch code only needs to handle the integer register file.

The ELF Format and Linker Scripts

GCC produces ELF (Executable and Linkable Format) object files. The linker combines these into a final ELF binary, which is then converted to a raw binary image (kernel8.img) that the RPi 4 firmware can load.

The firmware loads kernel8.img at physical address 0x80000. The linker must be told this. The linker script (typically link.ld) controls:

  1. The output sections: .text (code), .rodata (read-only data), .data (initialized read-write data), .bss (uninitialized read-write data).
  2. The load address: the physical address at which each section will reside at run time.
  3. Symbol exports: addresses that become available as C-linkage symbols (__bss_start, __bss_end) for use by the startup code.

A minimal linker script for CS 452:

ENTRY(boot)

SECTIONS {
    . = 0x80000;                       /* load at RPi 4 kernel entry */

    .text : {
        KEEP(*(.text.boot))            /* boot.S first */
        *(.text .text.*)
    }

    .rodata : { *(.rodata .rodata.*) }

    .data : {
        . = ALIGN(8);
        *(.data .data.*)
    }

    .bss : {
        . = ALIGN(16);
        __bss_start = .;
        *(.bss .bss.*)
        *(COMMON)
        . = ALIGN(8);                  /* keep the 8-byte zeroing loop in bounds */
        __bss_end = .;
    }

    . = ALIGN(16);
    . += 0x8000;                       /* 32 KB temporary boot stack */
    _stack_top = .;

    /DISCARD/ : { *(.comment) *(.eh_frame) }
}

The ENTRY(boot) tells the linker which symbol is the entry point, recorded in the ELF header for tools that use it. The KEEP directive prevents the linker from discarding .text.boot even if no other section references it — necessary because the boot code is entered by the firmware, a caller the linker cannot see.

BSS Zeroing

The C standard guarantees that uninitialized global and static variables are zero-initialized. In a user-space program, the OS zeros the BSS segment before handing control to crt0. In a bare-metal kernel, you must do this yourself. The startup assembly code uses the __bss_start and __bss_end symbols exported by the linker script:

// Zero the BSS before entering C
adr     x0, __bss_start
adr     x1, __bss_end
mov     x2, #0
.Lbss_loop:
    cmp     x0, x1
    b.ge    .Lbss_done
    str     x2, [x0], #8       // store zero, advance 8 bytes
    b       .Lbss_loop
.Lbss_done:
    bl      kmain              // enter C kernel

Failing to zero BSS is one of the most common sources of puzzling bugs in bare-metal code. A global integer array that appears to contain zeros in your test environment may contain garbage in the field because DRAM power-on state is not deterministic.

Freestanding C and the Slab Allocator

Without libc, common functions must either be reimplemented or avoided. strlen, memcpy, memset, and memmove are easily reimplemented. malloc, free, and realloc should not exist at all.

The kernel’s memory model is static allocation only. Every data structure is either a global variable, a stack-allocated local, or a member of a statically-allocated pool. The slab allocator is the idiomatic pattern for managing fixed-size objects. Given a maximum of 64 tasks, the task descriptor pool is an array of 64 task descriptors, with a free-list maintained by embedding a next pointer in each descriptor (an intrusive linked list):

#include <stdint.h>

#define MAX_TASKS 64

typedef struct TaskDescriptor {
    int      tid;
    int      priority;
    int      state;
    uint64_t sp;             // saved stack pointer
    struct TaskDescriptor *next_send;  // send queue linkage (doubles as free-list link)
    // ... other fields
} TaskDescriptor;

static TaskDescriptor td_pool[MAX_TASKS];
static TaskDescriptor *td_freelist;

void td_init(void) {
    td_freelist = NULL;
    for (int i = MAX_TASKS - 1; i >= 0; i--) {
        td_pool[i].next_send = td_freelist;
        td_freelist = &td_pool[i];
    }
}

TaskDescriptor *td_alloc(void) {
    if (!td_freelist) return NULL;
    TaskDescriptor *td = td_freelist;
    td_freelist = td->next_send;
    return td;
}

void td_free(TaskDescriptor *td) {
    td->next_send = td_freelist;   // push the descriptor back onto the free list
    td_freelist = td;
}

This pattern allocates and frees in O(1) time with no fragmentation, at the cost of a fixed upper bound on the number of simultaneously live objects.

The Complete Kernel Boot Sequence

Understanding every step from hardware reset to the first user task is essential for debugging boot failures — which are uniquely difficult to diagnose because the kernel has no I/O infrastructure yet when they occur.

Step 0: Hardware reset. When the RPi 4 powers on, the ARM cores do not run first: the VideoCore GPU executes the on-chip boot ROM, which loads the second-stage bootloader from the board’s SPI EEPROM; that bootloader in turn loads the start4.elf firmware from the SD card’s FAT partition. The firmware reads config.txt, configures clock speeds and memory sizes, then loads kernel8.img (the ARM64 kernel image) to physical address 0x80000 and releases the primary core (core 0) to start executing there. Cores 1–3 remain in a spin loop awaiting a mailbox signal.

Step 1: Early assembly (boot.S). The processor is in EL2 at this point. The first instruction at 0x80000 is the boot symbol in boot.S. Tasks to complete before entering C:

.section ".text.boot"
.global boot

boot:
    // 1. Ensure we're executing on core 0 only
    mrs     x0, MPIDR_EL1
    and     x0, x0, #0xFF        // core number in bits [7:0]
    cbnz    x0, .Lhalt_secondary // cores 1-3 halt

    // 2. Drop from EL2 to EL1
    mrs     x0, CurrentEL
    lsr     x0, x0, #2
    cmp     x0, #2
    b.ne    .Lel1_entry          // already at EL1 (unusual)

    // Configure EL1 as AArch64
    mov     x0, #(1 << 31)       // HCR_EL2.RW = AArch64 at EL1
    msr     HCR_EL2, x0

    // Set up SPSR_EL2: return to EL1h, DAIF all masked
    mov     x0, #0x3C5
    msr     SPSR_EL2, x0
    adr     x0, .Lel1_entry
    msr     ELR_EL2, x0
    isb
    eret

.Lel1_entry:
    // 3. Set up a temporary kernel stack. After the eret to EL1h,
    //    sp is SP_EL1 (the SP_EL1 system register is only writable
    //    from EL2/EL3, so we simply write sp here).
    adr     x0, _stack_top       // symbol defined in linker script
    mov     sp, x0

    // 4. Zero the BSS section
    adr     x0, __bss_start
    adr     x1, __bss_end
    mov     x2, #0
.Lbss_loop:
    cmp     x0, x1
    b.ge    .Lbss_done
    str     x2, [x0], #8
    b       .Lbss_loop
.Lbss_done:

    // 5. Enter C kernel
    bl      kmain

    // Should never return
.Lhalt:
    wfe
    b       .Lhalt

.Lhalt_secondary:
    wfe
    b       .Lhalt_secondary

Step 2: C kernel entry (kmain). The C kmain function is now executing at EL1 with a temporary stack. It must initialize every subsystem before handing off to user space. The order matters — each step depends on the previous:

void kmain(void) {
    /* 1. Initialize UART immediately for debug output */
    uart0_init();
    uart0_puts("CS452 Kernel booting...\r\n");

    /* 2. Set up the exception vector table */
    extern void exception_vector_table(void);
    asm volatile ("msr VBAR_EL1, %0; isb"
                  :: "r"((uint64_t)exception_vector_table));
    uart0_puts("Exception vectors installed.\r\n");

    /* 3. Initialize the MMU with identity mapping */
    mmu_init();
    uart0_puts("MMU enabled.\r\n");

    /* 4. Initialize the GIC-400 */
    gic_init();
    uart0_puts("GIC initialized.\r\n");

    /* 5. Initialize kernel data structures */
    td_init();      // task descriptor pool
    scheduler_init(); // ready queues
    event_init();   // event registry
    uart0_puts("Kernel data structures initialized.\r\n");

    /* 6. Enable the system timer interrupt */
    timer_init();    // arm C1, enable GIC interrupt 97

    /* 7. Create the first user task */
    int first_tid = create_task(FIRST_TASK_PRIORITY, first_task);
    uart0_puts("First task created.\r\n");

    /* 8. Transfer to user space: schedule and run */
    TaskDescriptor *next = scheduler_next();
    /* This function does not return — it restores next's context and eret */
    kernel_exit_to_user(next->sp);
}

Step 3: MMU initialization. Before the MMU is enabled, accesses to device registers at 0xFE000000 work because the physical address is directly accessible. After the MMU is enabled with an identity map, the virtual and physical addresses are identical, so device accesses are unchanged. The key difference: the MMU now enforces memory type attributes (cacheable normal memory, device memory), which is necessary for correct cache behavior.

The MMU initialization sequence:

void mmu_init(void) {
    /* 1. Allocate page tables (statically, in BSS) */
    extern uint64_t page_table_l0[512];  /* 4KB, 512 × 8B entries */
    extern uint64_t page_table_l1[512];

    /* 2. Set MAIR_EL1: index 0 = Device-nGnRnE, index 1 = Normal WB */
    uint64_t mair = (0xFFULL << 8) | (0x00ULL);  /* [15:8]=Normal, [7:0]=Device */
    asm volatile ("msr MAIR_EL1, %0" :: "r"(mair));

    /* 3. Build level-0 table: one entry pointing to level-1 table */
    page_table_l0[0] = (uint64_t)page_table_l1 | 3;  /* table descriptor */

    /* 4. Build level-1 table: 1 GB blocks */
    /* Block 0: 0x00000000–0x3FFFFFFF = normal memory, attr index 1 */
    page_table_l1[0] = 0x00000000 | (3 << 8) | (1 << 10) | (1 << 2) | 1;
                                 /* SH=inner shareable, AF=1, AttrIdx=1, block */
    /* Block 3: 0xC0000000–0xFFFFFFFF = device memory, attr index 0 */
    page_table_l1[3] = 0xC0000000 | (1 << 10) | (0 << 2) | 1;

    /* 5. Set TCR_EL1: 48-bit VA, 4KB pages, TG0=4KB */
    uint64_t tcr = (16ULL << 0)   /* T0SZ = 16 → 48-bit VA */
                 | (1ULL  << 8)   /* IRGN0 = write-back write-allocate */
                 | (1ULL  << 10)  /* ORGN0 = write-back write-allocate */
                 | (3ULL  << 12)  /* SH0 = inner shareable */
                 | (0ULL  << 14); /* TG0 = 4KB granule */
    asm volatile ("msr TCR_EL1, %0; isb" :: "r"(tcr));

    /* 6. Set TTBR0_EL1 to the level-0 table */
    asm volatile ("msr TTBR0_EL1, %0; isb" :: "r"((uint64_t)page_table_l0));

    /* 7. Invalidate stale TLB entries, then enable MMU and caches */
    asm volatile ("tlbi vmalle1; dsb nsh; isb");
    uint64_t sctlr;
    asm volatile ("mrs %0, SCTLR_EL1" : "=r"(sctlr));
    sctlr |= (1 << 0)   /* M: MMU enable */
           | (1 << 2)   /* C: D-cache enable */
           | (1 << 12); /* I: I-cache enable */
    asm volatile ("dsb sy; msr SCTLR_EL1, %0; isb" :: "r"(sctlr));
}

A dsb sy before the SCTLR_EL1 write is required so that the page-table stores are visible to the table walker, and the isb after the write is mandatory: the MMU is not architecturally enabled until the ISB completes. Any data cached before the MMU is enabled must be cleaned and invalidated (DC CIVAC) if its memory attributes change, because lines allocated under the old interpretation of addresses must not serve stale data once translation and caching are active.

Step 4: First task execution. The kernel calls kernel_exit_to_user(next->sp) which restores the first task’s context and executes eret. The first task is now running at EL0 with full priority over all other tasks (since no other task exists yet). It calls Create() to spawn all server tasks, then calls Exit().

Why the order matters: with the MMU off, all data accesses are treated as Device-nGnRnE, so early MMIO happens to work; but once mmu_init() enables translation and the caches, correctness depends on the GIC’s registers at 0xFF841000 being mapped with Device attributes. Calling gic_init() after mmu_init() ensures every GIC access runs under the final, correct memory attributes. If event_init() is called before gic_init(), interrupt IDs may reference an uninitialized GIC state. The initialization order in kmain is not arbitrary; it reflects a dependency graph where each step requires all previous steps to have completed.

Debugging boot failures: a kernel that fails during boot typically manifests as no UART output (the uart0_puts never executes) or output stopping partway. The most common causes:

  1. MMU page fault at boot: if mmu_init() crashes, the system resets immediately. Insert uart0_puts("About to enable MMU\r\n") before the SCTLR write; if this prints but nothing else does, the MMU initialization has a bug.
  2. Stack overflow in kmain: the temporary stack set up in boot.S is small (often just a few KB). If kmain calls deeply-nested functions before td_init() sets up per-task stacks, the temporary stack overflows. Keep kmain shallow — initialize hardware, then delegate to user tasks.
  3. Wrong entry point: if the linker script does not put .text.boot first, the processor jumps to the wrong instruction at 0x80000. Verify with aarch64-elf-objdump -d kernel8.elf | head -20 that the first instructions are the boot code.

Debugging Without a Debugger

One of the most disorienting aspects of bare-metal development is the absence of familiar debugging tools. There is no GDB attached to a running kernel, no core dump on a crash, no strace to observe system calls. When the kernel crashes or hangs, the evidence is often just a stopped terminal and a frozen train. Developing effective debugging strategies is therefore as important as writing correct code in the first place.

UART as the primary debug output: even before the terminal server is running, the mini-UART can be used in polling mode to print diagnostic messages. A minimal kprintf implemented with a polling write loop provides a stream of evidence about what the kernel is doing:

void uart_putc_polling(char c) {
    while (!(mmio_read(AUX_MU_LSR) & 0x20)) {}  // wait for TX space
    mmio_write(AUX_MU_IO, c);
}

void kputs(const char *s) {
    while (*s) uart_putc_polling(*s++);
}

The polling UART adds deterministic latency (proportional to the number of bytes output) to every print statement. For timing-sensitive code, print statements change behavior. This is the real-time debugging paradox: the act of observing the system changes it. Two strategies address this:

Post-mortem ring buffers: instead of printing immediately, write to a circular buffer in RAM. After the bug manifests (or a test run completes), print the buffer contents. The ring buffer write is fast (a few cycles for a word write), so it minimally disturbs timing. The buffer can record (timestamp, event_id, data) tuples for a complete execution history.
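Such a trace buffer can be sketched in a few lines; the timestamp source and the names are illustrative, and a power-of-two capacity makes the wrap-around a single mask:

```c
#include <stdint.h>

#define TRACE_CAP 1024   /* power of two, so wrap-around is a cheap mask */

typedef struct {
    uint64_t timestamp;  /* e.g. the free-running system timer count */
    uint32_t event_id;
    uint32_t data;
} TraceEntry;

static TraceEntry trace_buf[TRACE_CAP];
static uint32_t   trace_head;   /* total entries ever written */

/* A few instructions per call: one small struct store plus an increment,
   so recording barely disturbs timing. */
static inline void trace(uint64_t now, uint32_t event_id, uint32_t data) {
    TraceEntry *e = &trace_buf[trace_head++ & (TRACE_CAP - 1)];
    e->timestamp = now;
    e->event_id  = event_id;
    e->data      = data;
}
```

After a run, dump the last min(trace_head, TRACE_CAP) entries oldest-first over the polling UART to reconstruct the execution history.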

Assertions with halts: insert assert() macros at critical invariants (queue lengths, task states, pointer bounds). On failure, halt the processor with a diagnostic message before the cascading effects make the root cause unrecognizable. In a real-time kernel, an assertion failure is a development-time event; disable assertions in production builds.
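A minimal kernel assertion macro along these lines; kassert_fail is a hypothetical reporting hook that stands in for the print-and-halt path so the sketch is self-contained:

```c
#include <stddef.h>

/* Hypothetical reporting hook; the kernel version would call
   kputs(msg) and then halt(). Recording the message keeps this
   sketch self-contained. */
static const char *kassert_last_msg = NULL;

static void kassert_fail(const char *msg) {
    kassert_last_msg = msg;
    /* kernel version: kputs(msg); halt(); */
}

/* Stringize the condition and capture the file so the diagnostic
   identifies the failed invariant. Compiles out under -DNDEBUG,
   matching the advice to disable assertions in production builds. */
#ifdef NDEBUG
#define KASSERT(cond) ((void)0)
#else
#define KASSERT(cond)                                           \
    do {                                                        \
        if (!(cond))                                            \
            kassert_fail("ASSERT failed: " #cond                \
                         " (" __FILE__ ")\r\n");                \
    } while (0)
#endif
```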

The canary pattern: place a known magic value at the bottom of each task stack. Periodically check that the canary is intact. If it has been overwritten, the task’s stack has overflowed — a common cause of mysterious kernel panics. At stack allocation:

td->stack[0] = STACK_CANARY_VALUE;  // bottom of stack

In the idle task (which runs at the lowest priority and has time to do maintenance):

for (int i = 0; i < MAX_TASKS; i++) {
    TaskDescriptor *td = &td_pool[i];
    if (td->state != TASK_UNUSED &&   /* TASK_UNUSED: state of a free slot */
        td->stack[0] != STACK_CANARY_VALUE) {
        kputs("STACK OVERFLOW in task ");
        kputd(td->tid);
        halt();
    }
}

QEMU emulation with GDB: the qemu-system-aarch64 emulator can run the kernel binary on x86 development hardware (with the -M raspi4b machine type). QEMU supports GDB remote debugging over a TCP socket. While QEMU’s timing behavior differs from real hardware (it does not emulate cache misses or interrupt latency accurately), it is invaluable for logical correctness testing. Develop and test on QEMU; test timing behavior on real hardware.

The Build System: Make and Beyond

A typical CS 452 project uses GNU Make to manage the build. The structure should separate source files cleanly by subsystem:

cs452/
├── kernel/
│   ├── boot.S          # EL2→EL1 drop, BSS clear, entry to kmain
│   ├── exception.S     # exception vector table, syscall/IRQ entry
│   ├── task.c          # task creation, destruction, scheduler
│   ├── message.c       # Send, Receive, Reply implementation
│   ├── event.c         # AwaitEvent, interrupt delivery
│   └── kernel.h        # internal kernel interface
├── user/
│   ├── nameserver.c    # Name Server task
│   ├── clockserver.c   # Clock Server + Clock Notifier
│   ├── uart.c          # UART TX/RX servers and notifiers
│   ├── can.c           # CAN TX/RX servers and notifiers
│   └── syscall.h       # user-visible syscall wrappers
├── train/
│   ├── engineer.c      # per-train control loop
│   ├── track.c         # Track Server, Dijkstra, reservations
│   ├── sensor.c        # sensor attribution
│   └── calibration.c   # velocity and stopping-distance tables
├── lib/
│   ├── kprintf.c       # formatted output (vsnprintf + Putc)
│   ├── string.c        # strlen, strcpy, memcpy, memset
│   └── queue.h         # type-safe queue/stack templates
├── link.ld             # linker script
└── Makefile

A minimal Makefile:

CC      := aarch64-elf-gcc
CFLAGS  := -Wall -Wextra -O2 -ffreestanding -mgeneral-regs-only \
           -nostdlib -nostartfiles -I.
LDFLAGS := -T link.ld -nostdlib -nostartfiles

SRCS    := $(shell find . -name '*.c' -o -name '*.S')
OBJS    := $(SRCS:.c=.o)
OBJS    := $(OBJS:.S=.o)

all: kernel8.img

kernel8.elf: $(OBJS)
	$(CC) $(LDFLAGS) -o $@ $^

kernel8.img: kernel8.elf
	aarch64-elf-objcopy -O binary $< $@

clean:
	rm -f $(OBJS) kernel8.elf kernel8.img

.PHONY: all clean

An important Makefile discipline: use pattern rules rather than listing every object file explicitly. As the project grows, adding a new .c file should automatically include it in the build.

Dependency generation: GCC can automatically generate Makefile dependency rules (-MMD -MP flags) so that changing a header file triggers recompilation of all files that include it. Without this, a stale object file from before a header change will silently produce incorrect behavior.

CFLAGS += -MMD -MP
-include $(OBJS:.o=.d)

QEMU-Based Development and Testing

QEMU can emulate the Raspberry Pi 4 (machine type raspi4b), allowing development and debugging on a regular workstation without physical hardware:

qemu-system-aarch64 \
    -M raspi4b \
    -kernel kernel8.elf \
    -serial stdio \
    -d int,cpu_reset \
    -no-reboot \
    -S -s

Flags:

  • -serial stdio: connect UART0 to the terminal’s stdin/stdout.
  • -d int,cpu_reset: log interrupts and CPU resets to stderr (useful for debugging exception-level transitions).
  • -no-reboot: stop QEMU if the simulated CPU resets (instead of restarting — helps catch boot crashes).
  • -S: pause at startup (wait for GDB).
  • -s: enable the GDB server on port 1234.

In a separate terminal:

aarch64-elf-gdb kernel8.elf
(gdb) target remote :1234
(gdb) break kmain
(gdb) continue

This attaches GDB to the running QEMU instance, sets a breakpoint at the kernel main entry, and begins execution. When kmain is reached, GDB halts and you can inspect registers, memory, and step through code.

QEMU limitations: QEMU does not simulate the BCM2711’s peripheral registers with hardware-accurate timing. Timer interrupts may not arrive at the correct frequency, and SPI/UART simulation is simplified. Logical correctness can be tested in QEMU; timing correctness must be validated on real hardware. In particular, the GIC-400 simulation in QEMU may not match the physical BCM2711 in all edge cases (e.g., spurious interrupt handling, priority mask behavior under nested interrupts).

A useful CI pattern: run a unit test suite against the kernel in QEMU on each commit to a shared repository. Tests that exercise message passing, the scheduler, and the clock server can run in QEMU without hardware access, providing fast feedback before deploying to the RPi.

Remote Deployment: SD Card and JTAG

The standard RPi 4 boot procedure loads kernel8.img from the boot partition of an SD card. During development, the edit-compile-flash-test cycle is slow. Two faster alternatives:

TFTP boot: configure the RPi 4 bootloader to attempt a network boot if the SD card is not found or if a GPIO pin is held. The development host runs a TFTP server; kernel8.img is served over the local network. This eliminates the SD card write step — a new binary is available to the RPi on next reset, taking approximately 5 seconds for a typical 100 KB kernel.

JTAG debugging: the BCM2711 exposes a JTAG interface on GPIO pins 22–27. A JTAG adapter (OpenOCD + J-Link or DAPLink) allows debugging directly on hardware: setting breakpoints, reading registers, and examining memory without modifying the running code. JTAG-based debugging is the gold standard for embedded debugging.

JTAG Architecture and the ARM Debug Access Port

JTAG (Joint Test Action Group, IEEE Std 1149.1) was originally designed for boundary-scan testing of PCB connections. Its later use as a debug interface leverages the same physical four-wire protocol (TDI, TDO, TCK, TMS, plus optional TRST) to access a processor’s internal debug registers without halting normal operation.

The JTAG chain is controlled by a Test Access Port (TAP) controller, a 16-state finite state machine driven by the TMS signal. Moving through the TAP states allows software to shift data into and out of two types of registers: instruction registers (IR), which select the operation, and data registers (DR), which carry the actual data. By chaining multiple TAPs together on a single JTAG chain, a single adapter can access multiple devices (CPU, FPGA, boundary scan cells).

ARM implements JTAG debug access through the Debug Access Port (DAP), which is part of the ARM CoreSight debug architecture. The DAP connects to multiple Access Ports (APs): a MEM-AP for memory-mapped debug access and an APB-AP for CoreSight component access. Through the MEM-AP, a debugger can read and write any address in the CPU’s physical address space — including peripheral registers at their MMIO addresses — without the CPU’s involvement.

The ROM Table is a CoreSight concept: at a fixed offset from the debug base address, there is a table listing the addresses of all debug components (CPU debug registers, ETM trace, CTI cross-trigger interface). OpenOCD reads the ROM Table during initialization to discover the system’s debug topology.

BCM2711 JTAG Pinout and Configuration

The BCM2711’s JTAG interface is multiplexed with GPIO pins 22–27. By default, these pins serve GPIO functions after reset; JTAG mode must be explicitly activated. There are two ways to do this:

config.txt option: add enable_jtag_gpio=1 to the first partition’s config.txt. This configures the GPIO pin mux before handing off to the kernel. The assignment is:

JTAG Signal    GPIO Pin    Alt Function
TRST_N         GPIO 22     Alt4
TDI            GPIO 26     Alt4
TMS            GPIO 27     Alt4
TCK            GPIO 25     Alt4
TDO            GPIO 24     Alt4
GND            GND         —

Software pin-mux: the kernel can configure the GPFSEL registers to set these GPIO pins to Alt4 function programmatically. This is useful when the debug session should be enabled only after certain initialization steps complete (e.g., after the MMU and caches are initialized), allowing UART-based debugging before JTAG takes over.
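A sketch of the programmatic route, with the pure bit manipulation factored out; GPFSEL2 covers GPIO 20–29 at 3 bits per pin, and Alt4 is function code 0b011 on the BCM2711 (the helper name is illustrative):

```c
#include <stdint.h>

#define GPIO_ALT4 3u   /* 3-bit GPFSEL function code 0b011 selects Alt4 */

/* Compute a new GPFSEL2 value with GPIO 22-27 switched to Alt4,
   leaving every other pin's function field untouched. */
static uint32_t gpfsel2_enable_jtag(uint32_t old) {
    for (int pin = 22; pin <= 27; pin++) {
        int shift = (pin - 20) * 3;           /* pin's field within GPFSEL2 */
        old = (old & ~(7u << shift)) | (GPIO_ALT4 << shift);
    }
    return old;
}

/* Kernel usage, assuming the mmio helpers and the BCM2711
   low-peripheral map (GPFSEL2 = 0xFE200008):

       mmio_write(GPFSEL2, gpfsel2_enable_jtag(mmio_read(GPFSEL2)));
*/
```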

Note that JTAG debugging requires physical access to the GPIO header of the RPi 4 — a 40-pin ribbon cable connecting the RPi to the debug adapter. The CS 452 lab’s RPi units typically have a JTAG header soldered for exactly this purpose.

OpenOCD Configuration for BCM2711

OpenOCD (Open On-Chip Debugger) is the open-source software that drives the JTAG adapter and exposes a GDB server. The BCM2711 configuration requires two files:

Interface file (for the specific adapter, e.g., a DAPLink or J-Link):

# interface/jlink.cfg
adapter driver jlink
transport select jtag
adapter speed 1000

Target file (for the BCM2711 core):

# target/bcm2711_rpi4.cfg
set _CHIPNAME bcm2711

# The BCM2711 has 4 Cortex-A72 cores; we use core 0 only
set _TARGETNAME $_CHIPNAME.a72.0

jtag newtap $_CHIPNAME tap -irlen 4 -expected-id 0x4ba00477

dap create $_CHIPNAME.dap -chain-position $_CHIPNAME.tap

# MEM-AP for AHB access to CPU registers and peripherals
target create $_TARGETNAME cortex_a \
    -dap $_CHIPNAME.dap -ap-num 0 \
    -dbgbase 0x80010000 -ctibase 0x80018000

# Report when core 0 halts (e.g. after a breakpoint)
$_TARGETNAME configure -event halted { echo "Core 0 halted" }

# GDB port
gdb_port 3333

Starting OpenOCD:

openocd -f interface/jlink.cfg -f target/bcm2711_rpi4.cfg

OpenOCD prints progress as it initializes the DAP, reads the ROM table, and connects to the Cortex-A72 debug registers. Once started, it listens on port 3333 for GDB connections and on port 4444 for a telnet command interface.

GDB Remote Target Session

With OpenOCD running, GDB connects over the remote protocol:

$ aarch64-elf-gdb kernel8.elf
(gdb) target remote :3333
(gdb) monitor halt              # halt the CPU
(gdb) info registers            # inspect all registers
(gdb) x/20i $pc                 # disassemble 20 instructions at current PC
(gdb) hbreak kmain              # set hardware breakpoint (limited to 6 on A72)
(gdb) continue                  # resume execution

The Cortex-A72 provides six hardware breakpoints and four hardware watchpoints (data access triggers). Hardware breakpoints do not modify the instruction stream — unlike software breakpoints, which replace an instruction with a BRK #0 — making them usable in code that validates its own integrity (e.g., a kernel that checksums its text segment). For CS 452, the primary uses are:

  • Breakpoints on exception vector entries: catching unexpected exceptions (SError, FIQ in EL1) without adding code to the handler.
  • Watchpoints on the ready_mask bitmask: catching any write to the scheduler’s priority mask, useful for debugging rare priority corruption.
  • Watchpoints on a task descriptor’s state field: catching any transition out of READY that isn’t through the scheduler’s scheduler_next() function.

The GDB monitor command passes arbitrary OpenOCD TCL commands through GDB to the OpenOCD server. Useful monitor commands:

monitor reset halt       # reset the CPU and halt before first instruction
monitor reg              # dump OpenOCD's view of registers (not GDB's)
monitor mdw 0x20000000 4 # read 4 words at memory address 0x20000000
monitor mww 0xFE200000 0 # write 0 to the GPIO GPFSEL0 register (dangerous!)

JTAG Limitations in Real-Time Systems

JTAG’s key limitation in a real-time context is that hardware breakpoints halt the processor, stopping all interrupt handling and all timing. When the CPU is halted at a breakpoint, the system timer’s free-running counter keeps advancing, but no interrupt handler fires. On breakpoint continue, the accumulated interrupts may fire all at once, distorting the kernel’s timing state. Specifically:

  • The clock server’s tick count may lag, causing tasks waiting on Delay() to wake up late.
  • The CAN receive buffer may overflow if the CPU is halted while a train sensor stream arrives.
  • The UART RX FIFO may overflow while the CPU is halted, silently dropping incoming bytes.

This means that hardware-breakpoint debugging is useful for logical correctness (is the right code path taken? is the data structure consistent?) but unreliable for timing correctness (is this code executing within its WCET?). For timing analysis, the ring-buffer profiler described in Chapter 20 is more appropriate: it records timestamps without halting the CPU.

A subtler issue: the GIC-400 in the BCM2711 is a separate hardware block that keeps operating while the CPU is halted via JTAG — interrupts arrive and are latched in the GIC’s pending registers. However, since the CPU is halted, the interrupt is not delivered. When the CPU resumes, the pending interrupt is immediately delivered. If the kernel was halted partway through an interrupt handler (after reading GICC_IAR but before writing GICC_EOIR), the GIC may be in an inconsistent state. The safe approach: always set breakpoints at function boundaries (where the kernel is in a known quiescent state), not inside interrupt handlers.

For most CS 452 use cases, QEMU + UART debugging is sufficient. JTAG becomes necessary when debugging problems that QEMU cannot simulate: precise interrupt timing, cache coherency bugs between the CPU and DMA, or boot-stage failures that occur before UART output is available.

Memory Layout: Stack Sizes and Task Counts

With static allocation, the total kernel memory usage is fully determined at compile time. A CS 452 kernel with:

  • 64 task descriptors at ~128 bytes each = 8 KB
  • 64 task stacks at 8 KB each = 512 KB
  • 64 kernel stacks at 4 KB each = 256 KB
  • Kernel code and data = ~100 KB (typical)
  • Delay queue, send queues, event registry = ~10 KB

Total: ~886 KB. The BCM2711 has 1–8 GB of SDRAM depending on the board variant; even 1 GB is roughly 1000× more than the kernel needs. The actual constraint is the CPU cache. Ideally the kernel’s working set fits in the 1 MB shared L2 cache; at 886 KB it barely does, and the task stacks dominate. Reducing user task stacks to 4 KB each saves 256 KB, bringing the total to ~630 KB.

Stack sizing is a classic embedded systems challenge. Too small and you get stack overflows; too large and you waste memory (and cache). The canary pattern helps diagnose overflows, but it is reactive. Proactive sizing requires either static stack analysis (tools like StackAnalyzer from AbsInt) or worst-case measurement (running the system with full debug instrumentation and observing the high-water mark of stack usage).

A practical heuristic for CS 452: allocate 8 KB for user task stacks and 4 KB for kernel stacks. This is sufficient for tasks that use a few hundred bytes of local variables and call 3–4 levels of functions. If a task does string formatting with large local buffers (e.g., a 256-byte kprintf buffer), account for it explicitly.
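The worst-case measurement approach can be sketched with a stack-painting pair of helpers (names hypothetical): fill each stack with a pattern at task creation, then have the idle task scan for the deepest overwritten word.

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_FILL 0xDEADBEEFu   /* arbitrary pattern, unlikely in real data */

/* Paint a task stack with the pattern when the task is created. */
static void stack_paint(uint32_t *stack, size_t words) {
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Count words from the bottom (lowest address) that still hold the
   pattern. Since AArch64 stacks grow downward, (words - result) is the
   high-water mark of stack usage in words. Run from the idle task. */
static size_t stack_untouched_words(const uint32_t *stack, size_t words) {
    size_t i = 0;
    while (i < words && stack[i] == STACK_FILL)
        i++;
    return i;
}
```

If the untouched count for a task ever approaches zero, its stack is undersized; if it stays large across long runs, the stack can safely shrink.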

C++ in the Kernel

CS 452 permits C++ in the kernel, subject to limitations. Useful C++ features:

  • Templates: for type-safe intrusive linked lists, priority queues, and FIFO queues without dynamic allocation. A Queue<T, N> template that holds up to N objects of type T avoids the type-unsafe void * casts of C.
  • RAII: for interrupt masking regions — a guard object that clears DAIF in its constructor and restores it in its destructor ensures that interrupts are always unmasked when the guard goes out of scope, even if an early return is taken.
  • Constexpr: for compile-time computation of queue sizes, priority bounds, and memory map offsets. No runtime cost.

C++ features that must be avoided:

  • Exceptions: require runtime support library (libstdc++ unwind tables) and _Unwind_RaiseException. Not available in freestanding mode.
  • RTTI (dynamic_cast, typeid): requires typeinfo objects in .rodata, adds size overhead, and may generate library calls.
  • Global constructors with non-trivial initialization: must be called explicitly via the .init_array section (Chapter 5 covers this). If a global constructor calls a kernel function (e.g., to register a name), and the kernel is not yet initialized when the constructor runs, the result is undefined behavior.
  • Standard library containers (std::vector, std::map): all use dynamic memory allocation. Leave operator new undefined so that any accidental new expression fails at link time, or define void *operator new(size_t) { halt(); } to trap it at runtime.

A Complete Makefile for a CS 452 Kernel

Building a bare-metal kernel involves several steps: compiling C and assembly sources, linking with a custom linker script, converting the ELF binary to a raw image, and deploying to the RPi 4. A well-organized Makefile ties these steps together so that make rebuilds only what changed and make deploy copies the kernel to the SD card.

# CS 452 Kernel Makefile
# Targets: make (build), make clean, make deploy, make qemu

# ─── Toolchain ─────────────────────────────────────────────────────────────────
CC      := aarch64-elf-gcc
AS      := aarch64-elf-as
LD      := aarch64-elf-ld
OBJCOPY := aarch64-elf-objcopy
OBJDUMP := aarch64-elf-objdump

# ─── Compiler Flags ─────────────────────────────────────────────────────────────
CFLAGS := -Wall -Wextra -Werror \
           -O2 -g3 \
           -ffreestanding \
           -mgeneral-regs-only \
           -nostdlib \
           -nostartfiles \
           -march=armv8-a \
           -mtune=cortex-a72 \
           -mcpu=cortex-a72 \
           -fno-omit-frame-pointer \
           -fstack-usage

# -g3 includes macro expansions in DWARF debug info (useful with GDB)
# -fstack-usage generates .su files with per-function stack usage estimates
# -fno-omit-frame-pointer preserves x29 as frame pointer for backtraces

ASFLAGS := -g
LDFLAGS := -T link.ld

# ─── Source Files ───────────────────────────────────────────────────────────────
C_SRCS   := $(wildcard kernel/*.c) $(wildcard user/*.c) $(wildcard lib/*.c)
ASM_SRCS := $(wildcard kernel/*.S) $(wildcard user/*.S)

OBJ_DIR  := build
C_OBJS   := $(patsubst %.c, $(OBJ_DIR)/%.o, $(C_SRCS))
ASM_OBJS := $(patsubst %.S, $(OBJ_DIR)/%.o, $(ASM_SRCS))
OBJS     := $(C_OBJS) $(ASM_OBJS)

# ─── Targets ───────────────────────────────────────────────────────────────────
KERNEL_ELF := build/kernel8.elf
KERNEL_IMG := build/kernel8.img
KERNEL_MAP := build/kernel8.map

.PHONY: all clean deploy qemu disasm size

all: $(KERNEL_IMG)

$(KERNEL_IMG): $(KERNEL_ELF)
	$(OBJCOPY) -O binary $< $@
	@echo "Built: $@ ($(shell wc -c < $@ | tr -d ' ') bytes)"

$(KERNEL_ELF): $(OBJS) link.ld
	$(LD) $(LDFLAGS) -Map=$(KERNEL_MAP) -o $@ $(OBJS)

$(OBJ_DIR)/%.o: %.c
	@mkdir -p $(dir $@)
	$(CC) $(CFLAGS) -c $< -o $@

$(OBJ_DIR)/%.o: %.S
	@mkdir -p $(dir $@)
	$(CC) $(CFLAGS) $(ASFLAGS) -c $< -o $@

clean:
	rm -rf $(OBJ_DIR)

# Copy kernel image to SD card (adjust SDCARD_MOUNT for your system)
SDCARD_MOUNT ?= /Volumes/BOOT
deploy: $(KERNEL_IMG)
	cp $(KERNEL_IMG) $(SDCARD_MOUNT)/kernel8.img
	sync
	@echo "Deployed to $(SDCARD_MOUNT)/kernel8.img"

# Run in QEMU with GDB server on port 1234
qemu: $(KERNEL_ELF)
	qemu-system-aarch64 \
	    -M raspi4b \
	    -kernel $(KERNEL_ELF) \
	    -serial stdio \
	    -no-reboot \
	    -d int,cpu_reset \
	    -S -s

# Disassemble kernel
disasm: $(KERNEL_ELF)
	$(OBJDUMP) -d --source --demangle $(KERNEL_ELF) | less

# Report size of each section
size: $(KERNEL_ELF)
	aarch64-elf-size -A $(KERNEL_ELF)

# Dependency generation: each .c compile also emits a .d file listing the
# headers it includes. Redefining the pattern rule here (same target and
# prerequisites) replaces the plain rule above with the dependency-aware one.
DEPFLAGS = -MT $@ -MMD -MP -MF $(OBJ_DIR)/$*.d
$(OBJ_DIR)/%.o: %.c
	@mkdir -p $(dir $@)
	$(CC) $(CFLAGS) $(DEPFLAGS) -c $< -o $@

DEPS := $(OBJS:.o=.d)
-include $(DEPS)

Several features of this Makefile deserve comment.

-march=armv8-a -mtune=cortex-a72 -mcpu=cortex-a72: these flags give the compiler complete information about the target CPU. -march specifies the instruction set (ARMv8-A, which the Cortex-A72 implements); -mtune enables micro-architectural tuning (e.g., instruction scheduling suited to the Cortex-A72 pipeline); -mcpu combines both, so it alone would suffice. Spelling out all three is redundant but makes intent explicit.

-fstack-usage: generates a .su file alongside each object file. Each .su file lists the stack frame size for every function in the translation unit. A static analysis tool (aarch64-elf-gcc -fstack-usage + a custom script) can walk the call graph and compute the maximum stack depth, helping to size task stacks without over-allocating. The .su data is also useful for identifying which functions have large stack frames and should be refactored.

-fno-omit-frame-pointer: the Cortex-A72 has 31 general-purpose registers, and GCC normally uses x29 as a frame pointer only when necessary. This flag forces x29 to always point to the current frame. This allows GDB to generate backtraces in debug sessions (without a frame pointer, the unwinder must rely on DWARF unwind tables, which in a freestanding build may not be complete). The cost is that x29 is unavailable as a general-purpose register, slightly increasing register pressure.

Dependency generation (-MMD -MP -MF): the -MMD flag makes GCC generate a Makefile dependency file (.d) listing all header files included by the source. The -include $(DEPS) at the bottom includes these dependency files, so that changing a header causes all dependent sources to be recompiled. Without this, changing a struct definition in a .h file would not trigger recompilation of files that include it — a common source of puzzling stale-binary bugs.

The deploy target assumes the SD card is mounted at /Volumes/BOOT (macOS) or /media/pi/BOOT (Linux). Adjust for your environment. The sync call flushes the filesystem cache to ensure the file is fully written before unmounting the card.

The disasm target is invaluable for debugging: make disasm | grep -A 20 "handle_irq" shows the disassembly of the interrupt handler with source annotations (from -g3). Cross-referencing the disassembly against the high-level C ensures that the compiler’s optimization did not inadvertently move a store before a barrier or eliminate a volatile read.

The Boot Sequence in Detail

The RPi 4’s boot sequence passes through several stages before the kernel begins:

Stage 0 (boot ROM): the VideoCore GPU runs first, executing a boot ROM fixed in silicon. On the RPi 4 it loads the second-stage bootloader from an onboard SPI EEPROM (earlier Pi models loaded bootcode.bin from the SD card; the RPi 4 does not use bootcode.bin).

Stage 1 (EEPROM bootloader): initializes SDRAM and loads start4.elf (the main GPU firmware) from the SD card.

Stage 2 (start4.elf): the full GPU firmware. It reads config.txt to configure system parameters (UART pin assignment, clock rates, memory split), configures the peripherals, sets the ARM core frequencies, and loads kernel8.img from the SD card into DRAM at address 0x80000. Then it sets the ARM core PC to 0x80000 and releases the ARM core from reset.

Stage 3 (your code): execution begins at address 0x80000, which is the first instruction of your boot.S. At this point:

  • The ARM core is at EL2 (Hypervisor level) with AArch64.
  • DRAM is accessible from 0x00000000 to the top of installed RAM.
  • The GPU-side peripherals (0xFE000000 onwards) are memory-mapped.
  • SP_EL2 is undefined (you must set it before making calls).
  • All four cores are active; cores 1–3 are parked by the firmware stub, waiting for a start address to be written to their spin-table entry in low memory.

The boot.S entry sequence:

.section .text.boot
.global boot

boot:
    /* Determine which CPU core we are */
    mrs     x0, MPIDR_EL1
    and     x0, x0, #3             /* CPU ID = bits [1:0] */
    cbnz    x0, .Lsecondary_hold   /* cores 1-3 go to a parking loop */

    /* We are core 0 — proceed with initialization */
    /* Drop from EL2 to EL1 */
    bl      el2_to_el1

    /* Set up the EL1 stack. We are now at EL1 with SPSel = 1, so sp is
       SP_EL1; a direct "msr SP_EL1" would trap at EL1. */
    ldr     x0, =__kernel_stack_top
    mov     sp, x0

    /* Initialize BSS */
    bl      bss_zero

    /* Initialize the exception vector table */
    adr     x0, exception_vector_table
    msr     VBAR_EL1, x0
    isb

    /* Enter the C kernel */
    bl      kmain
    b       .Lhalt      /* Should never return */

.Lsecondary_hold:
    /* Secondary CPUs park here indefinitely */
    wfi
    b       .Lsecondary_hold

.Lhalt:
    wfi
    b       .Lhalt

The MPIDR_EL1 register’s lower 2 bits give the CPU core number (0–3). Cores 1–3 are sent to an infinite WFI loop — they will never be used in the CS 452 single-core kernel. If a bug causes one of the secondary cores to escape this loop (perhaps through a firmware error), it would execute code at its current PC with an undefined stack, causing unpredictable behavior. A more robust park loop writes the core ID to a watchdog register and enters WFI; the watchdog reset fires if the core ever escapes.

The el2_to_el1 procedure configures the exception return state to drop from EL2 to EL1 with the correct PSTATE settings for a kernel that runs at EL1 with DAIF set and AArch64 state. This procedure was described in detail in Chapter 3.

Debugging Without GDB: The UART as a Debug Interface

In the early phases of kernel bring-up — before the interrupt system is working, before tasks exist, before the scheduler runs — the only debug output is the UART. A minimal polling UART driver (no interrupts, no task structure) suffices:

#include <stdint.h>

void uart_putc(char c) {
    volatile uint32_t *FR = (volatile uint32_t *)0xFE201018;  /* flag register */
    volatile uint32_t *DR = (volatile uint32_t *)0xFE201000;  /* data register */
    while (*FR & (1u << 5)) {}  /* TXFF: wait until TX FIFO not full */
    *DR = (uint32_t)c;
}

void uart_puts(const char *s) {
    for (; *s; s++) {
        if (*s == '\n') uart_putc('\r');
        uart_putc(*s);
    }
}

/* Minimal hex dump for debugging */
void uart_hex64(uint64_t v) {
    uart_puts("0x");
    for (int i = 60; i >= 0; i -= 4) {
        int nibble = (v >> i) & 0xF;
        uart_putc(nibble < 10 ? '0' + nibble : 'a' + nibble - 10);
    }
}

This code works even before the MMU is configured, before interrupts are enabled, and before any kernel data structure is initialized, because it needs nothing but MMIO accesses to a single peripheral and a loop counter in a register. It is the bare-metal equivalent of printf.

A common bring-up pattern: add uart_puts("Stage X\n") markers at each initialization step (EL2→EL1 drop, BSS zero, UART init, interrupt enable, first task creation). On the development host, minicom -b 115200 -D /dev/ttyUSB0 displays these messages in real time. When the system hangs, the last printed stage identifies the failing step. This technique predates formal debuggers by decades and remains the most reliable bring-up approach — no tool setup required, no hardware compatibility issues, no JTAG chain to configure.


Chapter 6: Serial Communication — UART, SPI, and CAN

Three serial interfaces connect the kernel to the outside world: UART for human-readable terminal communication, SPI for communication with the MCP2515 CAN controller, and CAN bus for communication with the Märklin train hardware. Each protocol represents a different point in the design space of serial communication.

UART: Asynchronous Serial

Universal Asynchronous Receiver/Transmitter (UART) is the oldest and simplest of the three. It is asynchronous — transmitter and receiver share no clock signal; instead, they agree in advance on a baud rate (bits per second) and each runs a local clock to sample the signal. A UART frame consists of:

  1. A start bit (logic 0, indicating the beginning of a frame)
  2. Five to nine data bits (typically eight)
  3. An optional parity bit (for error detection)
  4. One or two stop bits (logic 1, returning the line to idle)

At 115200 baud, each bit is approximately 8.7 µs wide. The receiver resynchronizes on the start-bit edge and samples each subsequent bit at its midpoint, so the two clocks may drift apart by at most half a bit time (about 4.3 µs) over the whole frame. Spread across the ten bit times of a frame (start + 8 data + stop), that is a baud-rate mismatch budget of roughly 5% — tight enough that cheap oscillators with poor tolerance can cause misreads.

The BCM2711 provides two UART implementations. The PL011 (UART0, at physical base 0xFE201000) is a full-featured ARM PrimeCell UART with 32-character TX and RX FIFOs, configurable trigger levels, and hardware flow control via RTS/CTS lines. The Mini-UART (AUX_MU, at physical base 0xFE215040) is simpler: it shares the AUX block with SPI1 and SPI2, supports no parity, and derives its baud rate from the variable VPU core clock. CS 452 uses the PL011 for terminal communication; the train hardware is reached over SPI and CAN rather than a second UART.
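For the PL011, the programmed baud rate comes from a 16-bit integer divisor (IBRD) and a 6-bit fractional divisor (FBRD) of the UART reference clock. A sketch of the calculation, assuming the common 48 MHz UART clock configured by the RPi 4 firmware (verify init_uart_clock in config.txt for your setup):

```c
#include <stdint.h>

/* Compute PL011 integer/fractional baud-rate divisors.
   BAUDDIV = UARTCLK / (16 * baud); IBRD is the integer part and
   FBRD = round(fraction * 64), per the PL011 reference manual. */
void pl011_divisors(uint32_t uartclk, uint32_t baud,
                    uint32_t *ibrd, uint32_t *fbrd) {
    /* Work in 1/64ths of the divisor to avoid floating point:
       div64 = round(UARTCLK * 4 / baud) == BAUDDIV * 64 */
    uint32_t div64 = (uartclk * 4u + baud / 2u) / baud;
    *ibrd = div64 >> 6;      /* integer part */
    *fbrd = div64 & 0x3Fu;   /* fractional part, 6 bits */
}
```

With a 48 MHz clock and 115200 baud this yields IBRD = 26, FBRD = 3.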

Flow control is critical when the receiver cannot guarantee bounded processing latency. Without flow control, a fast transmitter can overflow the receiver’s FIFO. Clear-To-Send (CTS) hardware flow control uses a dedicated hardware pin: the receiver asserts CTS when it can accept data; the transmitter checks CTS before each byte. The PL011 can perform CTS checking automatically in hardware, freeing the software from polling the status register before every byte.

The UART FIFOs have configurable trigger levels: the RXFIFO interrupt fires when the receive FIFO reaches 1/8, 1/4, 1/2, 3/4, or 7/8 full; the TXFIFO interrupt fires when the transmit FIFO falls below the trigger level. Choosing the right trigger level is an engineering trade-off: a low RX trigger (fire at 1/8 full) means frequent interrupts and low latency but high overhead; a high trigger (fire at 7/8 full) batches interrupts but risks overflow if the processor is slow to respond.
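The trade-off can be made concrete with a little arithmetic: the trigger level determines how many FIFO slots remain free when the interrupt fires, and hence the deadline for the handler to start draining. A hypothetical helper (integer microseconds, 8N1 framing and the PL011's 32-entry FIFO assumed):

```c
/* Microseconds available to service the RX interrupt before the FIFO
   overflows, given the trigger level in FIFO entries. One 8N1 byte
   occupies 10 bit times. */
unsigned rx_service_deadline_us(unsigned trigger_entries, unsigned baud) {
    unsigned byte_time_us = 10u * 1000000u / baud;   /* ~86 us at 115200 */
    return (32u - trigger_entries) * byte_time_us;
}
```

At 115200 baud, a 1/8 trigger (4 entries) leaves about 2.4 ms of slack; a 7/8 trigger (28 entries) leaves only ~340 µs.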

SPI: Synchronous Serial

The Serial Peripheral Interface (SPI) is a synchronous, full-duplex bus designed by Motorola in the 1980s for short-distance chip-to-chip communication. Unlike UART, SPI uses an explicit clock signal (SCLK) driven by a designated master. The master also drives MOSI (master out, slave in) while the slave simultaneously drives MISO (master in, slave out). A separate chip-select pin (CS̄, active low) addresses individual peripherals on a shared bus.

The SPI0 peripheral on the BCM2711 (physical base 0xFE204000) can operate at up to 125 MHz (half the system clock), though the MCP2515 is rated to 10 MHz maximum. SPI has no built-in addressing scheme: each chip-select assertion initiates a transaction with one specific peripheral. For the MCP2515, a typical transaction sequence is: assert CS̄, send command byte, send/receive address byte(s) and data byte(s), deassert CS̄.

Because SPI is synchronous, there is no baud-rate mismatch or sampling ambiguity. The trade-offs relative to UART are: SPI requires more wires (at least 4 vs. 2), the master must drive the clock, and the bus topology is point-to-point or single-master star rather than multi-drop.

CAN Bus: Deterministic Multi-Drop Serial

The Controller Area Network (CAN) protocol was developed by Robert Bosch GmbH in the early 1980s for automotive applications. Its defining characteristics — multi-drop topology, arbitration without a master, error detection and recovery — were driven by the needs of automotive systems: many ECUs sharing a single cable, with no single point of coordination.

A CAN bus is a differential signal pair (CANH, CANL). The bus can be in one of two states: dominant (logic 0, CANH–CANL ≈ 2V) or recessive (logic 1, CANH–CANL ≈ 0V). When multiple nodes transmit simultaneously, a dominant bit overrides all recessive bits — this is the “wired-AND” property that enables arbitration.

Non-destructive arbitration works as follows. Every node that wants to transmit begins transmitting its frame simultaneously, starting with the identifier field. Each node monitors the bus while transmitting. If it transmits a recessive bit but reads a dominant bit, it lost arbitration and immediately stops transmitting, deferring to the winner. The node with the numerically lowest identifier always wins. This scheme requires that all nodes have bit-synchronous clocks — CAN uses NRZ encoding with bit stuffing (after five consecutive identical bits, a complementary stuffing bit is inserted) to maintain clock synchronization.

A standard CAN data frame (CAN 2.0A, 11-bit identifier) contains:

SOF  | Arbitration (11-bit ID + RTR) | Control (6 bits) | Data (0–8 bytes) |
CRC (15 bits) | CRC delimiter | ACK slot | ACK delimiter | EOF (7 bits)

The Extended Frame format (CAN 2.0B, 29-bit identifier) extends the identifier field and is used by the Märklin CS2/CS3 protocol. The extended identifier provides 29 bits of address space, supporting up to 536 million distinct message identifiers.

CAN’s error detection is multi-layered: a 15-bit cyclic redundancy check, bit monitoring (each node reads back what it writes and checks for mismatch), bit-stuffing violation detection, frame format checking, and acknowledgement checking. Any detected error triggers an error frame, and the frame is retransmitted. Nodes track error counts in transmit and receive error counters; when a node’s transmit error counter exceeds 127, it enters error passive mode and stops asserting active error frames; above 255, it enters bus-off and disconnects entirely.

Supported bit rates range from 10 kbit/s (long cables, noisy environments) to 1 Mbit/s (short, clean cables). The Märklin system runs at 250 kbit/s, which with proper 120 Ω termination permits cable runs on the order of a couple of hundred metres while still providing ample throughput for the relatively low message rates of a model railway control protocol.

CAN Error Detection in Depth

CAN’s error detection is comprehensive because the protocol was designed for automotive applications where cable harnesses are noisy, connectors corrode, and the cost of a missed message could be a brake failure. The five error-detection mechanisms operate at different layers:

Bit error (transmitter monitors itself): while transmitting, every node reads back the bus level simultaneously. If a node transmits a recessive bit (logic 1) but reads a dominant bit (logic 0), it has been overridden by another node’s dominant bit — normal and expected during arbitration. If it happens after arbitration is settled (outside the arbitration field and the ACK slot), it is a bit error, and the transmitter immediately aborts and sends an error frame.

Stuff error: CAN uses non-return-to-zero (NRZ) encoding with bit stuffing — after 5 consecutive identical bits, a complementary stuffing bit is inserted. This limits the maximum run of identical bits to 5, ensuring clock synchronization by frequent signal transitions. A receiver that detects 6 consecutive identical bits in a normally-stuffed region has detected a stuff error.

CRC error: the transmitter computes a 15-bit CRC over the identifier and data fields and appends it. Each receiver independently computes the same CRC. If the computed CRC differs from the transmitted CRC, the frame has been corrupted.

Form error: certain bit fields (CRC delimiter, ACK delimiter, end-of-frame) must be recessive (logic 1). A dominant bit in these fields is a form error.

Acknowledgement error: after a frame is transmitted, the transmitter checks the ACK slot. Every receiver that successfully received the frame (no errors detected so far) overdrives the ACK slot to dominant. If no receiver acknowledges — the ACK slot remains recessive — the transmitter has sent a frame that no one received, and reports an ACK error. This detects the case where the transmitter is on the bus alone (disconnected cable) or all receivers are in error-passive mode.
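The CRC computation itself is simple enough to sketch. The CAN specification defines the 15-bit CRC bit-serially over the unstuffed frame bits with generator polynomial 0x4599; this illustrative version takes the bits as an array of 0/1 bytes:

```c
#include <stdint.h>

/* CAN 15-bit CRC (generator polynomial 0x4599), computed bit-by-bit over
   the unstuffed frame bits from SOF through the end of the data field. */
uint16_t can_crc15(const uint8_t *bits, int nbits) {
    uint16_t crc = 0;
    for (int i = 0; i < nbits; i++) {
        /* CRCNXT = next input bit XOR the register's top (14th) bit */
        uint16_t crcnxt = (uint16_t)((bits[i] & 1u) ^ ((crc >> 14) & 1u));
        crc = (uint16_t)((crc << 1) & 0x7FFFu);   /* shift, keep 15 bits */
        if (crcnxt)
            crc ^= 0x4599u;
    }
    return crc;
}
```

In hardware this runs one step per bit time; the MCP2515 performs it internally, so the host never computes it.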

When any of these errors is detected, the detecting node sends an active error frame: 6 consecutive dominant bits (a deliberate violation of the bit-stuffing rule, which other nodes will also detect as an error) followed by an 8-bit error delimiter. All nodes restart their receivers. The original frame’s transmission is aborted and must be retransmitted.

The error counter mechanism prevents faulty nodes from disrupting the bus indefinitely. Each node maintains a transmit error counter (TEC) and a receive error counter (REC). Detecting a transmit error increments TEC by 8; successfully transmitting a frame decrements TEC by 1. When TEC > 127, the node enters error passive mode: it still participates in communication, but it sends passive error frames (6 recessive bits, invisible to other nodes) and must wait an additional 8-bit suspend transmission time after the intermission before starting a new transmission. When TEC > 255, the node enters bus-off and disconnects from the bus entirely until software resets it (or after it has observed 128 occurrences of 11 consecutive recessive bits).
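The transmit-side fault-confinement rules reduce to a small state machine. A simplified model (it omits the REC and the finer increment rules in the specification):

```c
typedef enum { ERROR_ACTIVE, ERROR_PASSIVE, BUS_OFF } can_fault_state;

/* Simplified fault-confinement model: transmit errors add 8 to TEC,
   successful transmissions subtract 1 (saturating at 0). */
typedef struct { int tec; } can_node;

void node_tx_error(can_node *n)   { if (n->tec <= 255) n->tec += 8; }
void node_tx_success(can_node *n) { if (n->tec > 0) n->tec -= 1; }

can_fault_state node_state(const can_node *n) {
    if (n->tec > 255) return BUS_OFF;        /* disconnected from the bus */
    if (n->tec > 127) return ERROR_PASSIVE;  /* passive error frames only */
    return ERROR_ACTIVE;
}
```

Sixteen consecutive transmit errors (16 × 8 = 128) push a healthy node into error passive; thirty-two push it to bus-off.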

For the MCP2515, TEC and REC are readable in registers 0x1C and 0x1D. Monitoring these counters is good diagnostic practice: TEC/REC values creeping upward indicate CAN bus issues (loose connectors, interference, incorrect bit timing) before the node enters error passive mode.

Bit Timing Configuration

The MCP2515 bit timing is configured through three registers: CNF1 (0x2A), CNF2 (0x29), CNF3 (0x28). The CAN bit period is divided into four segments: Sync Seg (always 1 TQ), Prop Seg, Phase Seg 1, and Phase Seg 2, where TQ (Time Quantum) is derived from the oscillator clock. For a 16 MHz oscillator and 250 kbit/s:

\[ \text{Bit rate} = \frac{F_{\text{osc}}}{2 \times (\text{BRP} + 1) \times (1 + \text{PropSeg} + \text{PhaseSeg1} + \text{PhaseSeg2})} \]

Solving for 250 kbit/s at 16 MHz: the bit period must be 4 µs. With BRP=1, TQ = 2 × (1 + 1) / 16 MHz = 250 ns, so the bit must span 16 TQ. One valid split is SyncSeg=1, PropSeg=5, PhaseSeg1=6, PhaseSeg2=4, placing the sample point at 75% of the bit. The register fields store length − 1, giving CNF1=0x01 (SJW=1, BRP=1), CNF2=0xAC (BTLMODE=1, PHSEG1=5, PRSEG=4), CNF3=0x03 (PHSEG2=3). Microchip application note AN754 tabulates verified settings for other oscillator and bit-rate combinations.

Getting bit timing wrong is one of the most common MCP2515 bring-up errors. Symptoms: TEC rises rapidly, REC stays at 0 (receiver never successfully receives), no messages appear in receive buffers. Diagnosis: connect an oscilloscope to CANH/CANL and verify the bit timing matches the expected 4 µs bit period.
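The bit-rate formula is worth encoding directly, since off-by-one errors in the BRP and length-minus-one register fields are the usual culprit. This helper works in segment lengths (in TQ), not register encodings:

```c
#include <stdint.h>

/* MCP2515 nominal bit rate from the raw bit-timing parameters.
   BRP is the register value; TQ = 2 * (BRP + 1) / Fosc, and each bit
   spans SyncSeg (1 TQ) + PropSeg + PhaseSeg1 + PhaseSeg2 quanta. */
uint32_t mcp2515_bitrate(uint32_t fosc_hz, uint32_t brp,
                         uint32_t propseg, uint32_t ps1, uint32_t ps2) {
    uint32_t tq_per_bit = 1u + propseg + ps1 + ps2;
    return fosc_hz / (2u * (brp + 1u) * tq_per_bit);
}
```

For example, mcp2515_bitrate(16000000, 1, 5, 6, 4) returns 250000, matching the expected 4 µs bit period.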

MCP2515 Complete Initialization Sequence

The full MCP2515 initialization in C, starting from hardware reset through entering Normal operating mode:

/* MCP2515 register addresses */
#define MCP_CANSTAT   0x0E
#define MCP_CANCTRL   0x0F
#define MCP_CNF3      0x28
#define MCP_CNF2      0x29
#define MCP_CNF1      0x2A
#define MCP_CANINTE   0x2B
#define MCP_CANINTF   0x2C
#define MCP_EFLG      0x2D
#define MCP_TXB0CTRL  0x30
#define MCP_RXB0CTRL  0x60
#define MCP_RXB1CTRL  0x70
#define MCP_RXF0SIDH  0x00  /* receive filter 0 — standard ID high byte */
#define MCP_RXM0SIDH  0x20  /* receive mask 0 — standard ID high byte */

/* Control mode bits */
#define MCP_MODE_NORMAL    0x00
#define MCP_MODE_SLEEP     0x20
#define MCP_MODE_LOOPBACK  0x40
#define MCP_MODE_LISTENONLY 0x60
#define MCP_MODE_CONFIG    0x80

/* CANINTE interrupt enable bits */
#define MCP_RX0IE  (1 << 0)  /* RX buffer 0 full */
#define MCP_RX1IE  (1 << 1)  /* RX buffer 1 full */
#define MCP_TX0IE  (1 << 2)  /* TX buffer 0 empty */
#define MCP_TX1IE  (1 << 3)  /* TX buffer 1 empty */
#define MCP_TX2IE  (1 << 4)  /* TX buffer 2 empty */
#define MCP_ERRIE  (1 << 5)  /* error interrupt */

void mcp2515_reset(void) {
    spi0_cs_assert();
    spi0_transfer_byte(0xC0);  /* RESET command */
    spi0_cs_deassert();
    /* The MCP2515 enters Configuration mode after reset.
       Oscillator start-up time: 128 × Tosc = 128/16MHz = 8 µs.
       Wait at least 2× longer to be safe. */
    delay_us(20);
}

void mcp2515_write_reg(uint8_t addr, uint8_t val) {
    spi0_cs_assert();
    spi0_transfer_byte(0x02);   /* WRITE command */
    spi0_transfer_byte(addr);
    spi0_transfer_byte(val);
    spi0_cs_deassert();
}

uint8_t mcp2515_read_reg(uint8_t addr) {
    spi0_cs_assert();
    spi0_transfer_byte(0x03);   /* READ command */
    spi0_transfer_byte(addr);
    uint8_t val = spi0_transfer_byte(0x00);
    spi0_cs_deassert();
    return val;
}

void mcp2515_bit_modify(uint8_t addr, uint8_t mask, uint8_t data) {
    spi0_cs_assert();
    spi0_transfer_byte(0x05);   /* BIT MODIFY command */
    spi0_transfer_byte(addr);
    spi0_transfer_byte(mask);
    spi0_transfer_byte(data);
    spi0_cs_deassert();
}

void mcp2515_init(void) {
    mcp2515_reset();

    /* Verify we are in Configuration mode (CANSTAT[7:5] = 100) */
    uint8_t canstat = mcp2515_read_reg(MCP_CANSTAT);
    if ((canstat & 0xE0) != MCP_MODE_CONFIG) {
        /* MCP2515 not responding — check SPI wiring */
        panic("MCP2515 init failed: not in config mode");
    }

    /* Configure bit timing for 250 kbit/s with a 16 MHz oscillator.
       TQ = 2 × (BRP + 1) / Fosc, and the bit period is
       (1 + PropSeg + PhaseSeg1 + PhaseSeg2) TQ.  For 250 kbit/s we need
       a 4 µs bit: BRP = 1 gives TQ = 250 ns, so 16 TQ per bit.
       Split: SyncSeg = 1, PropSeg = 5, PhaseSeg1 = 6, PhaseSeg2 = 4
       (sample point at 75%).  Register fields store length − 1:
         CNF1 = 0x01  (SJW = 1 TQ, BRP = 1)
         CNF2 = 0xAC  (BTLMODE = 1, SAM = 0, PHSEG1 = 5, PRSEG = 4)
         CNF3 = 0x03  (PHSEG2 = 3)
       These values assume a 16 MHz crystal — check the marking on your
       MCP2515 board, since 8 MHz modules are common; Microchip AN754
       tabulates settings for other combinations. */
    mcp2515_write_reg(MCP_CNF1, 0x01);
    mcp2515_write_reg(MCP_CNF2, 0xAC);
    mcp2515_write_reg(MCP_CNF3, 0x03);

    /* Configure RXB0 to receive all frames (mask=0, filter don't-care) */
    mcp2515_write_reg(MCP_RXB0CTRL, 0x60);  /* RXM=11: receive all */
    mcp2515_write_reg(MCP_RXB1CTRL, 0x60);  /* RXB1 also receive all */

    /* Set acceptance mask to zero (accept all IDs) */
    for (int i = 0; i < 4; i++) {
        mcp2515_write_reg(MCP_RXM0SIDH + i, 0x00);
    }

    /* Enable RX buffer 0/1 full, TX buffer 0 empty, and error interrupts */
    mcp2515_write_reg(MCP_CANINTE, MCP_RX0IE | MCP_RX1IE | MCP_ERRIE | MCP_TX0IE);

    /* Enter Normal mode */
    mcp2515_write_reg(MCP_CANCTRL, MCP_MODE_NORMAL);

    /* Verify Normal mode (CANSTAT[7:5] = 000) */
    uint32_t timeout = 1000;
    while ((mcp2515_read_reg(MCP_CANSTAT) & 0xE0) != MCP_MODE_NORMAL) {
        if (timeout-- == 0)
            panic("MCP2515: failed to enter Normal mode");
        delay_us(1);
    }
}

The bit-timing calculation shows why hitting exactly 250 kbit/s from a 16 MHz oscillator requires care. The approach: choose BRP and the segment lengths such that the total number of TQ per bit period equals F_osc / (2 × (BRP + 1) × bitrate). Segment lengths must satisfy: PropSeg + PhaseSeg1 ≥ PhaseSeg2; PhaseSeg2 ≥ SJW; PhaseSeg2 ≥ 2 (the MCP2515’s information processing time); and SJW ≤ 4. These constraints come from the CAN standard and the MCP2515 register field widths.

CAN Acceptance Filters

The MCP2515 provides six acceptance filters (RXF0–RXF5) and two acceptance masks (RXM0, RXM1). Each received CAN frame’s identifier is compared against the filters using the mask:

\[ (\text{frame\_id} \, \& \, \text{mask}) == (\text{filter} \, \& \, \text{mask}) \]

If the comparison succeeds, the frame is accepted into the corresponding receive buffer. If no filter matches, the frame is discarded silently.
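The acceptance test is a one-liner, shown here as the software equivalent of what the MCP2515 evaluates in hardware for each incoming identifier:

```c
#include <stdbool.h>
#include <stdint.h>

/* A frame is accepted when it agrees with the filter in every bit
   position selected by the mask. */
bool can_filter_match(uint32_t frame_id, uint32_t filter, uint32_t mask) {
    return (frame_id & mask) == (filter & mask);
}
```

A mask of zero accepts every identifier, which is the "receive all" configuration used in the initialization code above.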

For the Märklin system, the CS3 sends back acknowledgement frames with the same command code as the original command but with the response bit (bit 16 of the extended identifier) set. If your kernel wants to receive only sensor events (command 0x10/0x11) and system status messages, you configure filters to match only those IDs:

void mcp2515_set_filter_for_sensor_events(void) {
    /* Sensor event (S88) command code = 0x10, shifted to extended ID bits [28:17] */
    /* Extended ID for sensor event: (0x10 << 17) = 0x00200000 */
    /* Encode in SIDH, SIDL, EID8, EID0 format */
    uint32_t filter_id = (0x10 << 17);  /* S88 event command */

    uint8_t sidh = (filter_id >> 21) & 0xFF;
    uint8_t sidl = ((filter_id >> 13) & 0xE0) | 0x08 | ((filter_id >> 16) & 0x03);
    uint8_t eid8 = (filter_id >> 8) & 0xFF;
    uint8_t eid0 = (filter_id) & 0xFF;

    /* Write filter 0 (for RXB0) */
    mcp2515_write_reg(MCP_RXF0SIDH, sidh);
    mcp2515_write_reg(MCP_RXF0SIDH + 1, sidl);
    mcp2515_write_reg(MCP_RXF0SIDH + 2, eid8);
    mcp2515_write_reg(MCP_RXF0SIDH + 3, eid0);

    /* Set mask to match only the command bits (bits [28:17]) */
    uint32_t mask_id = (0xFFF << 17);   /* 12-bit command field mask */
    mcp2515_write_reg(MCP_RXM0SIDH,     (mask_id >> 21) & 0xFF);
    mcp2515_write_reg(MCP_RXM0SIDH + 1, ((mask_id >> 13) & 0xE0) | 0x08);
    mcp2515_write_reg(MCP_RXM0SIDH + 2, (mask_id >> 8) & 0xFF);
    mcp2515_write_reg(MCP_RXM0SIDH + 3, mask_id & 0xFF);
}

In practice, for CS 452, it is simpler to configure the MCP2515 to receive all frames (mask = 0x000, filter = 0x000) and filter in software — the message rate is low enough that the software filtering overhead is negligible. Hardware filtering would be important at high CAN bus utilization (hundreds of nodes, high message rates) to prevent the kernel from being interrupted for every bus message.
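Software filtering then amounts to a couple of shifts. These helpers assume the identifier layout described above — command field in the bits above the response bit, response flag at bit 16 — and are illustrative, not part of any Märklin library:

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode fields of a received 29-bit extended identifier, using the
   layout assumed above (command above bit 16, response flag at bit 16). */
uint32_t marklin_command(uint32_t ext_id)     { return (ext_id >> 17) & 0xFFu; }
bool     marklin_is_response(uint32_t ext_id) { return ((ext_id >> 16) & 1u) != 0; }
```

A receive task can dispatch on marklin_command() and discard everything it does not care about.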

GPIO Pin Multiplexing for SPI and CAN

The RPi 4’s GPIO pins are multiplexed — each pin can serve one of several functions, selected by writing to the GPIO Function Select registers (GPFSELn at 0xFE200000–0xFE20000C). For SPI0, the relevant pins are:

GPIO      Alt Function   Signal       Physical Pin
GPIO 9    Alt0           SPI0_MISO    21
GPIO 10   Alt0           SPI0_MOSI    19
GPIO 11   Alt0           SPI0_SCLK    23
GPIO 8    Alt0           SPI0_CE0_N   24

For the MCP2515 INT pin (interrupt output), any GPIO in input mode with falling-edge detect enabled can be used. GPIO 17 (physical pin 11) is a common choice:

void gpio_init_spi_and_can(void) {
    volatile uint32_t *GPFSEL0 = (volatile uint32_t *)0xFE200000;
    volatile uint32_t *GPFSEL1 = (volatile uint32_t *)0xFE200004;
    volatile uint32_t *GPFEN0  = (volatile uint32_t *)0xFE200058;  /* falling edge detect */

    /* GPIO 8–9: Alt0 for SPI0 (GPFSEL0 covers GPIO 0–9, 3 bits per pin) */
    uint32_t fsel0 = *GPFSEL0;
    fsel0 = (fsel0 & ~(7u << 24)) | (4u << 24);  /* GPIO8:  Alt0 (CE0)  */
    fsel0 = (fsel0 & ~(7u << 27)) | (4u << 27);  /* GPIO9:  Alt0 (MISO) */
    *GPFSEL0 = fsel0;

    /* GPIO 10–11 and 17 (GPFSEL1 covers GPIO 10–19) */
    uint32_t fsel1 = *GPFSEL1;
    fsel1 = (fsel1 & ~(7u << 0))  | (4u << 0);   /* GPIO10: Alt0 (MOSI) */
    fsel1 = (fsel1 & ~(7u << 3))  | (4u << 3);   /* GPIO11: Alt0 (SCLK) */
    fsel1 = (fsel1 & ~(7u << 21));               /* GPIO17: Input (000) */
    *GPFSEL1 = fsel1;

    /* Enable falling-edge detect on GPIO17 (MCP2515 INT, active low) */
    *GPFEN0 |= (1u << 17);
}

The GPIO event detect registers (GPFEN0 for falling edge, GPREN0 for rising edge, GPLEV0 for current level) work by latching the detected event in GPEDS0 (event detect status). The kernel’s interrupt handler reads GPEDS0 to identify which GPIO triggered, then writes 1 to the corresponding GPEDS0 bit to clear the event.
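
The read-then-write-back pattern can be isolated in a small helper. Taking the register address as a parameter keeps the routine testable; on the RPi 4 the kernel would pass (volatile uint32_t *)0xFE200040, the GPEDS0 address from the BCM2711 peripheral map:

```c
#include <stdint.h>

/* Read the pending GPIO edge events and clear exactly those events by
 * writing the 1 bits back (GPEDS0 is write-1-to-clear). Returns the
 * bitmask of GPIOs that latched an edge. */
uint32_t gpio_take_events(volatile uint32_t *gpeds)
{
    uint32_t pending = *gpeds;   /* which GPIOs latched an edge */
    *gpeds = pending;            /* write 1s back to clear those events */
    return pending;
}
```

The kernel's GPIO interrupt handler then tests bit 17 of the returned mask to decide whether the MCP2515 asserted its INT line.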


Chapter 7: The Märklin Protocol over CAN

The Märklin Digital system uses a proprietary extension of CAN for communication between the Central Station (CS2 or CS3) and accessories, sensors, and other control nodes. Understanding this protocol is necessary for writing the train control layer — it defines every command that can move a locomotive, change a switch, or query a sensor.

The CS3 Architecture

The Märklin CS3 (Central Station 3) is the master controller on the CAN bus. It:

  • Maintains a list of active locomotives — any loco that has received a speed or direction command recently. The CS3 periodically cycles through this list to refresh commands, since Märklin locomotives use a DCC or Motorola protocol with watchdog-like semantics: a loco that stops receiving commands will eventually reset to stop.
  • Controls track power and turnout switches directly.
  • Relays commands from the CAN bus to the track using its internal DCC/Motorola encoder.
  • Reports sensor events from S88 sensor modules.

The CS3 sits between the CAN bus and the track. Your kernel sends commands to the CS3 over CAN; the CS3 executes them on the track. This two-hop architecture means that CAN message latency is not the only source of control delay — the CS3’s internal scheduling adds additional latency that must be accounted for in timing models.

MCP2515 Registers and SPI Interface

The MCP2515 (Microchip Technology) is a standalone CAN 2.0B controller that interfaces to any microprocessor via SPI. It handles all CAN framing, arbitration, error detection, and recovery in hardware, presenting a simple mailbox interface to the host.

The MCP2515 provides three transmit buffers (TXB0, TXB1, TXB2) and two receive buffers (RXB0, RXB1). Each buffer contains registers for the extended identifier, DLC (data length code), and up to 8 data bytes. Transmission is initiated by writing to the buffer and setting the TXREQ bit in TXBnCTRL. Reception is signalled by an interrupt flag in CANINTF.

SPI commands for the MCP2515:

Command           Byte             Description
RESET             0xC0             Reset internal registers to default
READ              0x03             Read from address
WRITE             0x02             Write to address
BIT MODIFY        0x05             Modify specific bits at address (mask + data)
LOAD TX BUFFER    0x40/0x42/0x44   Load TXB0/1/2 directly
REQUEST TO SEND   0x80–0x87        Initiate transmission of one or more TX buffers
READ STATUS       0xA0             Read status byte (TX/RX pending)
RX STATUS         0xB0             Read receive buffer status and filter match

Initialization sequence: RESET (which leaves the chip in Configuration mode), wait for the oscillator start-up timer (128 OSC1 clock cycles; waiting 128 µs is safely conservative), write CNF1/CNF2/CNF3 for 250 kbit/s bit timing, configure acceptance filters and masks, write CANINTE to enable RX interrupts, and finally write CANCTRL to enter Normal mode. The CNF and filter registers are writable only in Configuration mode, so entering Normal mode must come last.

A complete SPI transaction to send a CAN frame:

// assert CS (GPIO output low)
// SPI WRITE: load TXB0 starting at TXBnSIDH
spi_byte(0x02);             // WRITE command
spi_byte(0x31);             // TXB0SIDH address
spi_byte((id >> 21) & 0xFF); // SIDH: standard ID bits [10:3]
spi_byte(((id >> 13) & 0xE0) | 0x08 | ((id >> 16) & 0x03)); // SIDL + EXIDE=1
spi_byte((id >> 8) & 0xFF); // EID8
spi_byte(id & 0xFF);        // EID0
spi_byte(len & 0x0F);       // DLC
for (int i = 0; i < len; i++) spi_byte(data[i]);
// deassert CS
// assert CS
spi_byte(0x81);             // REQUEST TO SEND TXB0
// deassert CS

Märklin Command Codes

The Märklin CAN protocol uses the 29-bit extended identifier to encode command type, response flag, and hash: bits [28:17] carry the command (Befehl), bit 16 is the response flag, and bits [15:0] carry a 16-bit hash derived from the UID of the command originator.

Key command codes:

Code   Name        Purpose
0x00   System      Emergency stop, go, halt, power
0x04   Speed       Set locomotive speed (0–1000)
0x05   Direction   Set direction (1=forward, 2=backward, 3=change)
0x06   Function    Set function outputs (headlights, sound, etc.)
0x0B   Switch      Control track turnout or signal
0x10   S88 Event   Sensor contact event (from CS3)
0x11   S88 Poll    Request sensor status

For speed control, the speed argument ranges from 0 (stop) to 1000 (maximum). The mapping from speed level to actual locomotive velocity is highly non-linear and varies by locomotive model, decoder type, and motor characteristics. This non-linearity is precisely what the velocity calibration procedure in Chapter 17 measures and models.

Switch control (0x0B) requires specifying the turnout’s UID and the desired position (0 = straight/through, 1 = diverging). The CS3 briefly activates the solenoid coil to move the switch, then cuts power to prevent coil burnout. A critical operational note: double slips (double-crossover switches with four positions) should never be commanded to invalid states (both curved positions simultaneously) as this can mechanically damage the mechanism.

The CS3 active locomotive list is a practical constraint that affects software design. The CS3 accepts speed commands only for locomotives that it considers “active” — meaning they have been acknowledged by the CS3. Before sending speed commands to a locomotive, you must issue a Go command (system command 0x00) to activate the loco on the CS3. More subtly, the CS3 cycles through its active list and periodically re-sends speed commands; a loco you commanded to speed 500 will gradually slow if the CS3 removes it from its active list. Your kernel must either keep locos active through periodic commands or explicitly manage CS3 loco registration.

The S88 sensor modules report reed-switch contacts to the CS3, which forwards events over CAN. A sensor event message contains the module number and the contact number within the module. The CS3 polls S88 modules at a configurable rate; the maximum poll rate determines the sensor latency. At 250 kbit/s CAN with 16 contacts per S88 module, polling 8 modules takes approximately 1–2 ms.

Märklin CAN Message Format

A complete Märklin CAN message consists of the 29-bit extended identifier plus up to 8 data bytes. The extended identifier encodes:

Bits [28:17]        Bit [16]                Bits [15:0]
Command (12 bits)   Response flag (1 bit)   Hash (16 bits)

The command field encodes the message type; the values are defined by the Märklin protocol (see the command table above). The response flag is set by the CS3 when it is responding to a command (as opposed to initiating one). The hash is a 16-bit value derived from the originator’s UID, ensuring that messages from different devices have distinct identifiers so that CAN arbitration resolves cleanly.
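
Packing and unpacking this layout is pure bit manipulation. A small sketch (the hash is treated as a caller-supplied 16-bit value; deriving it from the UID is device-specific and not shown here):

```c
#include <stdint.h>
#include <stdbool.h>

/* Build a Märklin 29-bit extended identifier: command in bits [28:17],
 * response flag in bit 16, hash in bits [15:0]. */
uint32_t markln_id_pack(uint32_t command, bool response, uint16_t hash) {
    return ((command & 0xFFF) << 17) | ((uint32_t)response << 16) | hash;
}

/* The corresponding field extractors, used when parsing received frames. */
uint32_t markln_id_command(uint32_t id)  { return (id >> 17) & 0xFFF; }
bool     markln_id_response(uint32_t id) { return (id >> 16) & 1; }
uint16_t markln_id_hash(uint32_t id)     { return id & 0xFFFF; }
```

The CAN RX server uses the extractors to classify frames; the TX path uses the packer with the kernel's own hash.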

For a speed command (command 0x04), the data bytes are:

Byte   Content
0–3    Locomotive UID (32-bit big-endian)
4–5    Speed (16-bit big-endian, range 0–1000)
6      Direction (0=unchanged, 1=forward, 2=backward, 3=change)

The locomotive UID is a 32-bit identifier assigned to each Märklin locomotive decoder. It is printed on the underside of the locomotive or can be read from the CS3’s locomotive database. The CS3 uses the UID to identify which decoder should act on the command.

A complete speed command to locomotive UID 0x00000042 at speed 500 forward:

typedef struct {
    uint8_t data[8];
    uint8_t len;
} MarklnData;

void build_speed_cmd(MarklnData *d, uint32_t uid, uint16_t speed, uint8_t dir) {
    d->data[0] = (uid >> 24) & 0xFF;
    d->data[1] = (uid >> 16) & 0xFF;
    d->data[2] = (uid >>  8) & 0xFF;
    d->data[3] = (uid >>  0) & 0xFF;
    d->data[4] = (speed >> 8) & 0xFF;
    d->data[5] = (speed >> 0) & 0xFF;
    d->data[6] = dir;
    d->data[7] = 0;
    d->len = 7;     /* bytes 0-6: UID, speed, direction (see table above) */
}

void send_markln_cmd(uint8_t command, MarklnData *d, uint32_t my_hash) {
    uint32_t can_id = ((uint32_t)command << 17) | (my_hash & 0xFFFF);
    // Build CAN frame with extended identifier can_id and data d->data[0..d->len-1]
    can_tx_frame(can_id, d->data, d->len);
}

Locomotive Decoder Types

Märklin locomotives use several DCC/Motorola decoder variants, which affects the speed step resolution and the locomotive behavior:

14-step decoders (older Märklin Motorola format): 14 discrete speed steps plus stop. Speed 0 = stop; steps 1–14 map to increasing speeds. The CS3 maps these to the 0–1000 range: step n → speed = n × 71 (approximately). The coarse speed granularity makes smooth speed ramping impossible — only 14 distinct velocities are achievable.

28-step decoders (DCC): 28 discrete speed steps plus stop. More resolution than 14-step. Many modern Märklin locomotives use 28-step DCC decoders.

128-step decoders (DCC with extended speed steps): 128 speed steps, giving fine-grained control. Modern high-quality decoders support 128 steps. The CS3 maps the 0–1000 software range to the decoder’s step count transparently.

The type of decoder in each locomotive must be determined experimentally: set the speed to a low value and observe whether the locomotive moves, or use the CS3’s built-in decoder identification function. The calibration procedure (Chapter 17) produces a velocity table that is correct regardless of decoder type, because it measures actual physical velocity rather than relying on the decoder specification.
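
The coarse quantization of a 14-step decoder can be made concrete with the approximate mapping quoted above (step n → speed ≈ n × 71). A sketch, inverting that mapping to find the nearest step a 14-step decoder can actually produce (the factor 71 is the approximation from the text, not an exact CS3 constant):

```c
#include <stdint.h>

/* Quantize a 0-1000 software speed to the nearest of the 14 discrete
 * steps available on an old Motorola-format decoder. */
uint8_t speed_to_step14(uint16_t speed) {
    uint16_t step = (speed + 35) / 71;     /* round to the nearest step */
    return step > 14 ? 14 : (uint8_t)step;
}

/* The approximate speed the CS3 reports for a given step. */
uint16_t step14_to_speed(uint8_t step) { return step * 71; }
```

For example, a commanded speed of 500 lands on step 7 (actual speed ≈ 497): any smooth ramp through the 0–1000 range collapses to at most 14 distinct velocities.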

Emergency Stop and Safety

The Märklin system command 0x00 with subcommand 0x00 is a broadcast emergency stop: all locomotives stop immediately, all switch commands are cancelled, and the CS3 cuts track power. This is the safety command that the train control application should issue when any safety invariant is violated — two trains predicted to occupy the same segment, a sensor firing with no plausible attribution, or a kernel assertion failure.

Implementing an emergency stop involves:

  1. Sending the broadcast stop CAN frame.
  2. Waiting for the CS3 to acknowledge (or timing out).
  3. Marking all trains as stopped in the track server state.
  4. Blocking all subsequent speed commands until the operator manually resumes.

The emergency stop should be callable from any task, including interrupt handlers, through a dedicated “kill switch” mechanism that bypasses the normal SRR queue. One approach: a dedicated emergency-stop task at the highest priority, started at boot, that parks itself in a Receive loop. Any task that detects a safety violation calls Send to the emergency-stop task; the emergency-stop task wakes, issues the stop command, and blocks until an operator presses the resume key.

The Track Layout Data Format

The CS 452 course distributes a track graph data structure representing the specific Märklin track set available in the lab. The format is a C array of TrackNode structs, where each node represents a track section endpoint. The connectivity is specified as directed edges — each node has up to two “next” nodes (one for each direction of travel, or two for a switch).

Two track layouts are available (track A and track B), with different topologies. Track A is a figure-8 with four sidings; track B has a more complex structure with a central loop and multiple branch lines. Students choose which layout to demonstrate on. The track data array is handed out rather than having students measure it themselves, to ensure consistency and save time.

The key invariant of the track data: every edge has a corresponding reverse edge. If node 27 has a “next” to node 28 with distance 150 mm, then node 28 has a “next” to node 27 (traveling in the opposite direction) with the same distance. This reversibility property ensures that Dijkstra’s algorithm can find paths in either direction along any segment.
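
The invariant is cheap to verify at boot. A sketch over a simplified stand-in for the course's TrackNode (the field names here are assumptions; the real struct differs, but the check is the same shape):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified track node: up to two outgoing edges with lengths in mm. */
typedef struct Node {
    struct Node *next[2];   /* outgoing edges (NULL if unused) */
    int          dist[2];   /* edge length in mm */
} Node;

/* Does `to` have an edge back to `from` with the same distance? */
static bool has_edge_back(const Node *to, const Node *from, int dist) {
    for (int i = 0; i < 2; i++)
        if (to->next[i] == from && to->dist[i] == dist) return true;
    return false;
}

/* Check the reversibility invariant over the whole graph. */
bool track_is_reversible(const Node *nodes, int n) {
    for (int i = 0; i < n; i++)
        for (int e = 0; e < 2; e++)
            if (nodes[i].next[e] &&
                !has_edge_back(nodes[i].next[e], &nodes[i], nodes[i].dist[e]))
                return false;
    return true;
}
```

Running such a check once at startup turns a corrupted or mistyped track table into an immediate assertion failure instead of a pathfinding bug discovered mid-demo.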

The DCC Protocol: What the CS3 Does on the Track

The CS3 sits between your CAN commands and the trains. When you send a Märklin CAN speed command, the CS3 translates it into a DCC (Digital Command Control) packet and transmits it to the track. Understanding DCC demystifies why commands sometimes have unexpected effects and why the CS3 introduces additional latency beyond CAN communication time.

DCC is an NMRA-standardized digital protocol (NMRA Standard S-9.1) for controlling model trains. The track carries both power (approximately 15–18 V) and data simultaneously. The data is encoded as a bit stream modulated onto the power waveform: a 1 bit is two short half-cycles (58 µs per half, 116 µs total); a 0 bit is two long half-cycles (at least 100 µs per half, roughly 200 µs total). The track voltage oscillates continuously; the duration of each half-cycle determines the bit value. Locomotives with DCC decoders extract both power and data from the track signal.

A DCC packet consists of:

  1. A preamble of at least 14 consecutive 1 bits.
  2. A 0 start bit introducing the address byte.
  3. An address byte (1 byte for short addresses 1–127, 2 bytes for long addresses 128–10239).
  4. A 0 start bit introducing the data byte.
  5. One or more data bytes (instruction type and value).
  6. A 0 start bit introducing the error check byte.
  7. An error check byte (XOR of all preceding bytes).
  8. A 1 end bit completing the packet.

A complete DCC speed-and-direction packet to locomotive address 42, speed step 14 (of 28), forward direction:

Preamble: 1 1 1 1 1 1 1 1 1 1 1 1 1 1   (14 ones)
Start:    0
Address:  0 0 1 0 1 0 1 0               (0x2A = decimal 42)
Start:    0
Data:     0 1 1 0 1 1 1 0               (direction+speed: 0b01101110)
Start:    0
Check:    0 1 0 0 0 1 0 0               (error check: 0x2A XOR 0x6E = 0x44)
End:      1

The data byte for speed is encoded as 01DCSSSS in 28-step mode, where D is the direction bit and the five-bit speed value is split across C (its least-significant bit) and SSSS (its upper four bits); this split is what makes the encoding non-obvious at first glance.
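
The error check byte, by contrast, is straightforward: the XOR of the address and data bytes, as in the worked packet above. A one-liner sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* DCC error check byte: XOR of all address and data bytes in the packet. */
uint8_t dcc_check_byte(const uint8_t *bytes, size_t n) {
    uint8_t x = 0;
    for (size_t i = 0; i < n; i++) x ^= bytes[i];
    return x;
}
```

For the example above, dcc_check_byte over {0x2A, 0x6E} yields 0x44, matching the packet diagram.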

The CS3 generates these packets continuously for all “active” locomotives in its roster. A locomotive that has not received a DCC packet within a manufacturer-specific timeout will reset to speed 0 (the DCC “emergency stop” behavior). This is the watchdog mechanism mentioned in Chapter 7: the CS3 must continuously re-send commands to keep locomotives at their commanded speed. If the CAN bus to the CS3 is disrupted (the RPi 4 crashes or reboots), the CS3 will stop receiving speed commands from your kernel, but it will continue sending the last-commanded speed to the track until the locomotive’s decoder resets. This provides a brief “fail-safe” period, but ultimately the locomotive will stop.

DCC packet timing: a 1 bit takes 116 µs on the track and a 0 bit roughly 200 µs. A packet with a 14-bit preamble (14 × 116 µs ≈ 1.6 ms), four start/end bits, and three bytes of mixed bits (roughly 1.3–1.6 ms per byte) takes on the order of 7–8 ms. The CS3 cycles through all active locomotives at this rate; with 5 active locomotives, the refresh interval per locomotive is approximately 40–50 ms. This means a speed command update can take tens of milliseconds to take effect on the track, independent of CAN latency. This is the dominant delay in the train control critical path and explains why stopping distances are not sub-centimeter.

Motorola protocol: older Märklin locomotives use the proprietary Märklin/Motorola digital protocol rather than DCC. The Motorola protocol predates NMRA DCC standardization and is not interoperable. The CS3 supports both; it determines which protocol to use based on the locomotive decoder type registered in its database. The key difference from a software perspective: Motorola locomotives have 14 speed steps compared to DCC’s 28 or 128, and the protocol lacks some DCC features (no multi-function decoders with function outputs, no long addresses). The CS3 handles this transparently — your kernel always speaks the Märklin CAN protocol, and the CS3 chooses Motorola or DCC as appropriate.

CAN Bus Physical Layer and Error Handling

The MCP2515 speaks CAN 2.0B, which specifies a physical layer based on differential signaling over a twisted-pair bus. Understanding the physical layer matters because the train lab is a noisy electrical environment — locomotive motor brushes generate RF interference, and track power creates magnetic fields that can couple into the CAN cable.

Differential signaling: the CAN bus has two wires, CAN_H (high) and CAN_L (low). A recessive bit is signalled when both wires are at the same voltage (approximately 2.5 V each, differential = 0 V). A dominant bit is signalled when CAN_H is driven to approximately 3.5 V and CAN_L to approximately 1.5 V (differential = 2 V). Any node on the bus can force the bus dominant by driving CAN_H and CAN_L; a dominant state overrides a recessive state. This property is the basis of CAN arbitration.

Bit stuffing: to maintain synchronization between nodes, CAN uses bit stuffing: after 5 consecutive bits of the same polarity in the data stream, the transmitter inserts a bit of the opposite polarity. The receiver removes these stuffed bits. Bit stuffing limits run-length to 5 and ensures the receiver can re-synchronize its clock. A stuff error occurs when more than 5 consecutive identical bits are seen (indicating either a corrupted frame or a line fault).
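
The effect of stuffing on frame length can be computed by simulating the transmitted stream. A sketch (bits are given as a 0/1 array; on a real bus stuffing applies only up to the CRC field, a detail this simplified version ignores):

```c
#include <stdint.h>

/* Count how many stuff bits the transmitter would insert: after 5
 * consecutive identical bits, a bit of the opposite polarity is sent,
 * and that stuffed bit starts a new run of length 1. */
int count_stuff_bits(const uint8_t *bits, int n) {
    if (n == 0) return 0;
    int stuffed = 0;
    int run = 1;
    uint8_t prev = bits[0];
    for (int i = 1; i < n; i++) {
        if (bits[i] == prev) {
            run++;
        } else {
            prev = bits[i];
            run = 1;
        }
        if (run == 5) {
            stuffed++;
            prev = !prev;   /* the stuffed bit is transmitted next */
            run = 1;
        }
    }
    return stuffed;
}
```

In the worst case (long runs of identical bits) stuffing adds roughly one bit per five, which is why worst-case CAN frame transmission times are quoted about 20% longer than the nominal bit count suggests.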

CAN frame structure (extended frame, as used by Märklin):

Start-of-Frame (SOF): 1 dominant bit
Arbitration field: 11-bit base ID, SRR bit, IDE bit, 18-bit ID extension (29 identifier bits total)
Remote Transmission Request (RTR): 1 bit
Reserved bits: 2 bits (r1, r0; dominant in CAN 2.0)
Data Length Code (DLC): 4 bits (0-8 data bytes)
Data Field: 0-64 bits (0-8 bytes)
CRC Field: 15 bits + 1 CRC delimiter bit
ACK Field: 1 bit + 1 delimiter (receiver drives ACK dominant to acknowledge receipt)
End-of-Frame (EOF): 7 recessive bits
Intermission Frame Space (IFS): 3 recessive bits

Arbitration: when two nodes transmit simultaneously, each node monitors the bus state while transmitting. A node that has transmitted a recessive bit but detects a dominant bit on the bus knows it has been overridden by another node with a lower-ID (higher-priority) frame. It immediately stops transmitting and waits for the bus to be idle before retrying. This mechanism (bit-wise arbitration) ensures that the lowest-numbered CAN ID always wins, without any explicit collision detection or retry protocol. Märklin uses the command field in the extended ID for priority: system commands (0x00) have higher CAN priority than speed commands (0x04), which have higher priority than sensor poll commands (0x11).
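
The arbitration rule can be illustrated in software (no node actually runs this code; the comparison happens bit-serially in hardware as the frame is transmitted):

```c
#include <stdint.h>

/* Simulate bit-wise arbitration between two 29-bit identifiers, MSB first.
 * At the first differing bit, the node transmitting recessive (1) sees
 * dominant (0) on the bus and backs off, so the lower identifier wins.
 * Returns 0 if id_a wins, 1 if id_b wins. */
int can_arbitrate(uint32_t id_a, uint32_t id_b) {
    for (int bit = 28; bit >= 0; bit--) {
        int a = (id_a >> bit) & 1, b = (id_b >> bit) & 1;
        if (a != b) return a ? 1 : 0;   /* recessive transmitter loses */
    }
    return 0;   /* identical identifiers: a protocol violation on a real bus */
}
```

With the Märklin layout, a system frame (command 0x00) always beats a speed frame (command 0x04) because its identifier is numerically smaller.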

Error detection: CAN has five error detection mechanisms:

  1. Bit monitoring: a transmitting node monitors its own bits; if the received bit differs from the transmitted bit (outside of the arbitration phase for dominant bits), it is a bit error.
  2. Bit stuffing: a violation of the 5-bit limit is a stuff error.
  3. CRC check: the receiver independently computes the CRC and compares it to the transmitted CRC.
  4. Form check: the fixed-format fields (EOF, IFS, ACK delimiter, CRC delimiter) must have specific values.
  5. Acknowledgement check: if no node acknowledges (no dominant ACK bit), the transmitter detects an acknowledgement error.

When an error is detected, the detecting node transmits an error flag: 6 dominant bits (violating the bit-stuffing rule, which forces all other nodes to also generate error flags). After the error flag, the bus returns to idle and the transmitter retries.

Error counters: the MCP2515 maintains two error counters, TEC (Transmit Error Counter) and REC (Receive Error Counter). Each transmitted error flag increments TEC by 8; each received error flag increments REC by 1; each successful transmission or reception decrements the counter by 1. The counters drive a state machine:

TEC/REC     State           Behavior
≤ 127       Error Active    Normal; transmits active error flags
128–255     Error Passive   Transmits passive error flags (recessive, less disruptive)
TEC > 255   Bus Off         No transmit or receive until recovery

Bus-off recovery: when TEC exceeds 255, the MCP2515 enters Bus-Off mode and stops participating on the bus. To recover, the software must write the appropriate control bits to CANCTRL to initiate recovery. Recovery requires 128 occurrences of 11 consecutive recessive bits (the “error delimiter + IFS” sequence), which at 250 kbit/s takes approximately 128 × 11 × 4 µs = 5.6 ms.

void can_recover_bus_off(void) {
    uint8_t canstat = mcp2515_read_reg(MCP_CANSTAT);  /* CANSTAT reports the actual mode */
    if ((canstat & 0xE0) == MCP_MODE_CONFIG) {
        /* Dropped into config mode (bus-off): request Normal mode to recover */
        mcp2515_write_reg(MCP_CANCTRL, MCP_MODE_NORMAL);
    }
    /* Poll until Normal mode is confirmed */
    uint32_t timeout = 10000;   /* ~10 ms minimum; each iteration also costs an SPI read */
    while (--timeout) {
        uint8_t stat = mcp2515_read_reg(MCP_CANSTAT);
        if ((stat & 0xE0) == MCP_MODE_NORMAL) return;
        delay_us(1);
    }
    /* Timeout: serious hardware fault */
    kernel_panic("MCP2515: bus-off recovery timeout");
}

The CAN RX server should include a periodic check of the MCP2515’s error counters:

void can_rx_server_error_check(void) {
    uint8_t tec = mcp2515_read_reg(MCP_TEC);
    uint8_t rec = mcp2515_read_reg(MCP_REC);
    uint8_t eflg = mcp2515_read_reg(MCP_EFLG);

    if (eflg & MCP_EFLG_TXBO) {
        /* Bus-off — attempt recovery */
        can_recover_bus_off();
    } else if (eflg & (MCP_EFLG_TXEP | MCP_EFLG_RXEP)) {
        /* Error passive — log warning, monitor */
        log_event(LOG_CAN_ERROR_PASSIVE, (tec << 8) | rec);
    }
}

A robust production CAN driver calls this check after every received message and periodically from the clock server. In the lab, noise-induced errors are rare but not impossible; having the diagnostic path in place means the first sign of trouble produces a useful log entry rather than silent misbehavior.

CS3 Initialization and the Active Locomotive Protocol

The CS3 must be explicitly started before it will relay commands to the track. This is done through the Go command (Märklin system command 0x00, subcommand 0x01). The initialization sequence from your kernel is:

  1. Send the Stop command (system 0x00, subcommand 0x00) to ensure the track is off at startup.
  2. Wait approximately 500 ms for the CS3 to power up if it was just connected.
  3. Send the Go command (system 0x00, subcommand 0x01) to enable track power.
  4. For each locomotive you intend to control: send a speed 0 command to register it on the CS3’s active list.
  5. Verify the CS3 responded with an acknowledgement (command 0x00, response flag set).

void cs3_initialize(void) {
    MarklnData d;
    /* Stop command: bytes 0-3 = UID 0 (broadcast), byte 4 = subcommand 0 (stop) */
    memset(&d, 0, sizeof(d));
    d.data[4] = 0x00;   /* subcommand: stop */
    d.len = 5;
    send_markln_cmd(0x00, &d, MY_HASH);

    /* Wait for CS3 to stabilize */
    Delay(clock_tid, 50);  /* 500 ms at 10 ms/tick */

    /* Go command: subcommand = 0x01 */
    d.data[4] = 0x01;
    send_markln_cmd(0x00, &d, MY_HASH);
}

void cs3_register_loco(uint32_t uid) {
    MarklnData d;
    /* Speed 0, forward — this registers the loco on the CS3 active list */
    build_speed_cmd(&d, uid, 0, 1);
    send_markln_cmd(0x04, &d, MY_HASH);
}

After cs3_initialize() and cs3_register_loco() for each locomotive, the CS3 will forward subsequent speed and switch commands to the track. The “active list” registration must be repeated if the CS3 is power-cycled, if the CAN bus drops and reconnects, or if the CS3 does not receive a speed command for an extended period (typically 60–120 seconds, depending on CS3 firmware version).


Part III: A Microkernel from Scratch

Chapter 8: Tasks and the Static-Priority FIFO Scheduler

The central abstraction of this kernel is the task: an independent sequential computation with its own stack, its own program counter, and its own priority. Tasks are the unit of scheduling, the unit of communication, and the unit of resource isolation. Understanding what a task is, how it is represented in memory, and how the scheduler selects among ready tasks is the prerequisite for understanding every other kernel mechanism.

Why Tasks?

Chapter 2 argued that a polling loop cannot achieve bounded response latency when event rates are heterogeneous. The task abstraction resolves this by separating what to compute from when to compute it. Each logically distinct activity — processing a sensor event, updating the clock, servicing a UART receive request — becomes a task. The kernel continuously runs the highest-priority ready task, preempting it when a higher-priority task becomes ready. This ensures that the most urgent work always executes first, regardless of what other work is pending.

The analogy to hardware interrupt priority is deliberate. When the GIC delivers an IRQ, the processor drops whatever it was executing and handles the interrupt, then resumes the preempted computation. Tasks extend this priority-driven preemption into software, allowing arbitrarily many priority levels and complex interactions among them.

Task Descriptors

Each task is represented by a task descriptor (TD), a data structure that the kernel uses to track everything it needs to know about the task. The minimum fields are:

typedef struct TaskDescriptor {
    int       tid;            // unique task identifier
    int       parent_tid;     // tid of the creating task
    int       priority;       // static priority (higher value = higher priority)
    RunState  state;          // current run state (Active, Ready, SendWait, etc.)
    uint64_t  sp;             // saved stack pointer (when not running)
    struct TaskDescriptor *send_queue;   // head of senders waiting for Receive
    struct TaskDescriptor *send_next;    // next task in a send queue (also the ready-queue link)
    struct TaskDescriptor *reply_target; // task waiting for our Reply
} TaskDescriptor;

The sp field is crucial: when a task is not running, its stack pointer is stored here. The task’s registers — the full register file plus ELR_EL1 and SPSR_EL1 — are saved on the task’s own user stack. To resume a task, the kernel loads sp, pops the saved context, and executes eret.
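
One possible layout for that saved-context frame is sketched below. The slot order is an assumption and must match the kernel's save/restore assembly exactly; the pad slot keeps the frame a multiple of 16 bytes, since AArch64 requires SP to stay 16-byte aligned:

```c
#include <stdint.h>

/* Context frame saved on the task's own stack while it is not running.
 * One assumed layout: the general-purpose registers first, then the
 * exception-return state. */
typedef struct {
    uint64_t x[31];      /* x0-x30 (x30 is the link register) */
    uint64_t elr_el1;    /* resume address used by eret */
    uint64_t spsr_el1;   /* saved processor state */
    uint64_t pad;        /* keep sizeof a multiple of 16 bytes */
} TrapFrame;
```

With this layout, resuming a task is: set SP to td->sp, restore x0–x30, move elr_el1 and spsr_el1 into ELR_EL1 and SPSR_EL1, pop the frame, and execute eret.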

Task descriptors are allocated from a static pool. The maximum number of simultaneous tasks is a compile-time constant (typically 64 for CS 452). There is no dynamic allocation.
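
A static pool with an intrusive free list gives O(1) allocation and reclamation. In the sketch below the descriptor is reduced to just the link field; in the kernel, the full TaskDescriptor's existing next-pointer field can double as the free-list link, since an unallocated descriptor is on no other queue:

```c
#include <stddef.h>

#define MAX_TASKS 64

typedef struct TD {
    struct TD *next;   /* free-list link while unallocated */
    int        tid;
} TD;

static TD  pool[MAX_TASKS];
static TD *free_head;

/* Thread all descriptors onto the free list at boot. */
void td_pool_init(void) {
    for (int i = 0; i < MAX_TASKS - 1; i++) pool[i].next = &pool[i + 1];
    pool[MAX_TASKS - 1].next = NULL;
    free_head = pool;
}

/* O(1) allocation; NULL when the pool is exhausted (Create returns -2). */
TD *td_alloc(void) {
    TD *td = free_head;
    if (td) free_head = td->next;
    return td;
}

/* O(1) reclamation on Exit(). */
void td_free(TD *td) {
    td->next = free_head;
    free_head = td;
}
```

Both operations are a handful of instructions, so descriptor management never contributes meaningfully to system-call latency.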

Run States

A task is always in exactly one of seven states:

Active: The task is currently executing on the CPU. Exactly one task is Active at any time.

Ready: The task is able to run and is waiting for the CPU. All Ready tasks are in the ready queue.

Send-Blocked (SendWait): The task has called Send() but the receiver has not yet called Receive(). The task sits in the receiver’s send queue.

Receive-Blocked (ReceiveWait): The task has called Receive() but no sender is ready. The task waits until a message arrives.

Reply-Blocked (ReplyWait): The task has sent a message and the receiver has called Receive(), but Reply() has not yet been called. The task waits for the reply.

Event-Blocked: The task has called AwaitEvent() and is waiting for a hardware interrupt.

Exited: The task has called Exit() and its descriptor may be reclaimed.

These states form a directed graph. A task transitions between them only through explicit kernel operations — system calls and interrupt delivery — never spontaneously. This determinism is what allows the kernel to make scheduling decisions with bounded cost.

The Ready Queue

The scheduler selects the highest-priority Ready task. The data structure must support two operations efficiently: inserting a newly-Ready task, and extracting the highest-priority task. Priority queues implemented as heaps would give O(log n) for both; however, the kernel uses a simpler structure that gives O(1) for both operations: an array of FIFO queues, one per priority level.

#define MAX_PRIORITY 32   /* priority levels 0-31 */
#define MAX_TASKS    64

typedef struct {
    TaskDescriptor *head, *tail;
} ReadyQueue;

static ReadyQueue ready[MAX_PRIORITY];
static int        highest_ready = -1;  // cached highest non-empty priority

void scheduler_insert(TaskDescriptor *td) {
    /* send_next doubles as the ready-queue link: a task is in at most one queue at a time */
    int p = td->priority;
    if (ready[p].tail) {
        ready[p].tail->send_next = td;
    } else {
        ready[p].head = td;
    }
    ready[p].tail = td;
    td->send_next = NULL;
    if (p > highest_ready) highest_ready = p;
}

TaskDescriptor *scheduler_next(void) {
    while (highest_ready >= 0 && !ready[highest_ready].head)
        highest_ready--;
    if (highest_ready < 0) return NULL;  // no ready tasks (idle)
    TaskDescriptor *td = ready[highest_ready].head;
    ready[highest_ready].head = td->send_next;
    if (!ready[highest_ready].head) ready[highest_ready].tail = NULL;
    return td;
}

Within a priority level, tasks are scheduled FIFO — the task that became Ready earliest runs first. There is no time-slicing: once a task starts running, it continues until it makes a system call (which may yield the CPU, block, or explicitly defer) or until an interrupt makes a higher-priority task ready. This is a key design choice. Time-slicing adds unpredictability: a task that is preempted mid-computation has a longer worst-case response time. By using cooperative scheduling within a priority level, combined with the discipline that server tasks never block indefinitely, the kernel keeps worst-case latency tractable.

Comparison: Linux’s Completely Fair Scheduler (CFS) and its successor EEVDF use virtual runtime accounting to achieve fair CPU sharing among equal-priority threads, at the cost of O(log n) scheduling operations and cache-unfriendly data structures. The RT scheduler in Linux (SCHED_FIFO, SCHED_RR) more closely resembles the CS 452 kernel, but it supports preemption within the kernel itself, adds preemption points throughout the kernel, and must handle many more corner cases for a general-purpose system.

Creating and Destroying Tasks

Create(int priority, void (*function)(void)) allocates a task descriptor from the pool, initializes the task’s stack with a startup frame, and inserts it into the Ready queue at the given priority. The return value is the new task’s TID (a positive integer), -1 if the priority is invalid, or -2 if the descriptor pool is exhausted.

Stack initialization is subtle. When the task is first scheduled, the kernel’s context switch code will pop the saved context frame and execute eret. The “saved” context for a new task must be carefully crafted so that eret jumps to task_start, which calls the task’s function. A helper wrapper handles the case where the task function returns:

void task_start(void (*function)(void)) {
    function();
    Exit();        // if task returns, clean up automatically
}

Without the Exit() call at the end, a returning task function would execute whatever is on the stack below its frame — undefined behaviour with potentially catastrophic consequences on bare metal.
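
Crafting that initial frame can be sketched as follows. The slot indices (x0 first, ELR and SPSR near the top) are an assumption that must match the kernel's restore path, and the frame size of 34 eight-byte slots keeps SP 16-byte aligned:

```c
#include <stdint.h>

#define FRAME_SLOTS 34   /* x0-x30, ELR_EL1, SPSR_EL1, pad */

/* Build the fake "saved" context for a brand-new task so that the normal
 * restore-and-eret path starts it in task_start(fn). Returns the value to
 * store in td->sp. */
uint64_t *stack_init(uint64_t *stack_top,
                     void (*entry)(void (*)(void)),   /* task_start */
                     void (*fn)(void))                /* the task's function */
{
    uint64_t *sp = stack_top - FRAME_SLOTS;
    for (int i = 0; i < FRAME_SLOTS; i++) sp[i] = 0;
    sp[0]  = (uint64_t)(uintptr_t)fn;      /* x0: task_start's argument */
    sp[31] = (uint64_t)(uintptr_t)entry;   /* ELR_EL1: where eret resumes */
    sp[32] = 0;                            /* SPSR_EL1: EL0t, interrupts unmasked */
    return sp;
}
```

When the scheduler first selects this task, the restore path pops this frame exactly as if the task had been preempted, and eret lands in task_start with fn in x0.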

Exit() marks the task as Exited, removes it from all queues, and returns its descriptor to the free pool. The kernel immediately reschedules after Exit().

Yield() moves the calling task to the end of its priority queue. This is a cooperative-scheduling primitive: a task that wants to give other same-priority tasks a chance can yield. It has no effect in a system with only one task at each priority level. The kernel handles Yield() by treating it as a degenerate scheduler entry: the current task is re-inserted at the tail of its queue, and the scheduler selects the next task.

Priority Assignment Strategy

Choosing task priorities is an engineering decision that directly determines the system’s real-time behavior. The general principle is: assign higher priority to tasks with tighter deadlines and smaller execution times. This is precisely the Rate Monotonic principle from Chapter 15, applied to the software task set.

For the CS 452 train control application, a typical priority assignment (from highest to lowest):

Priority       Task                   Rationale
31 (highest)   Clock Notifier         Must wake immediately on each timer interrupt
30             CAN RX Notifier        Train sensor events are time-critical
29             UART RX Notifier       Interactive input must be responsive
28             UART TX Notifier       Feeds transmit FIFO — high urgency
25             Clock Server           Serves delayed tasks; must run before next tick
24             CAN RX Server          Processes sensor events
23             UART Servers           Handles terminal I/O
20             Train Engineer tasks   Per-train control loops
15             Route Planner          Path computation (not time-critical)
10             Name Server            Rarely called after init
1              User shell             Interactive, no hard deadlines
0 (lowest)     Idle task              Runs WFI when nothing else is ready

The notifiers are at the top because each one blocks in AwaitEvent and must resume promptly when its interrupt fires, servicing the event and re-blocking before the next occurrence of that interrupt. The clock notifier at priority 31 means that even while a train engineer task is computing a route, the clock tick is processed within one scheduling decision.

The idle task at priority 0 is special: it must exist and must run when no other task is ready, but it should never prevent useful work from executing. Assigning it priority 0 (or the minimum defined priority) ensures this.

A critical discipline: do not assign two tasks the same priority unless they truly have the same timing requirements and both are acceptable as candidates for indefinite deferral in favour of the other. Within a priority level, the scheduler is FIFO, not preemptive — one task at a given priority level cannot preempt another at the same level. If a high-priority task has a long execution time, it prevents a same-priority task from running until it yields or blocks, which may cause deadline violations.

Task Creation and the First Task

The kernel’s initial user task — the first task — is created by the kernel itself before transferring control to EL0. The first task is responsible for creating all other initial tasks (servers, notifiers) and then exiting or becoming the user shell. A typical first task:

void first_task(void) {
    // Create infrastructure servers
    Create(25, clock_server_main);
    Create(31, clock_notifier_main);
    Create(24, can_rx_server_main);
    Create(30, can_rx_notifier_main);
    Create(23, uart_rx_server_main);
    Create(29, uart_rx_notifier_main);
    Create(23, uart_tx_server_main);
    Create(28, uart_tx_notifier_main);
    Create(10, name_server_main);

    // Wait for servers to register before creating application tasks
    Delay(WhoIs("ClockServer"), 10);  // 100ms stabilization

    // Create application tasks
    Create(20, train_engineer_main);
    Create(15, route_planner_main);
    Create(1, user_shell_main);

    Exit();
}

The first task runs at a temporary priority. Since it creates servers at various priorities, some of those servers will immediately preempt the first task if their priority is higher. This is correct behaviour — the high-priority notifiers should start calling AwaitEvent immediately so they are ready to handle the first interrupt.

The Full Task Descriptor Layout

The minimal task descriptor shown earlier suffices for describing the concept, but a production-quality implementation requires additional fields. Here is a more complete descriptor that handles all the states and metadata needed by the full SRR system:

#define MAX_PRIORITY   32
#define MAX_TASKS      64
#define TASK_STACK_SZ  (8 * 1024)   /* 8 KB per task */

typedef enum {
    STATE_FREE        = 0,  /* descriptor is not in use (available for Create) */
    STATE_ACTIVE      = 1,  /* currently executing on the CPU */
    STATE_READY       = 2,  /* eligible to run, waiting in ready queue */
    STATE_SEND_WAIT   = 3,  /* blocked in Send(), waiting for receiver */
    STATE_RECV_WAIT   = 4,  /* blocked in Receive(), waiting for sender */
    STATE_REPLY_WAIT  = 5,  /* blocked after Send(), waiting for Reply() */
    STATE_EVENT_WAIT  = 6,  /* blocked in AwaitEvent(), waiting for interrupt */
    STATE_EXITED      = 7,  /* exited; descriptor may be reclaimed */
} TaskState;

typedef struct TaskDescriptor {
    /* Identity */
    int              tid;           /* unique task ID; 0 reserved for idle */
    int              parent_tid;    /* TID of the task that called Create() */
    int              priority;      /* 0 (lowest) to MAX_PRIORITY-1 (highest) */

    /* Scheduling */
    TaskState        state;
    uint64_t         sp;            /* saved EL0 stack pointer when not running */

    /* SRR: Send queue (when this TD is a Receive target) */
    struct TaskDescriptor *send_queue_head;
    struct TaskDescriptor *send_queue_tail;

    /* SRR: intrusive list link (when this TD is in a send queue) */
    struct TaskDescriptor *send_next;

    /* SRR: message buffers (set during Send/Receive pairing) */
    const void      *send_buf;      /* sender's message buffer */
    int              send_len;      /* sender's message length */
    void            *recv_buf;      /* receiver's buffer for message */
    int              recv_cap;      /* receiver's buffer capacity */
    void            *reply_buf;     /* sender's reply buffer */
    int              reply_cap;     /* sender's reply buffer capacity */

    /* SRR: return values */
    int              retval;        /* return value of syscall (Receive, Send, AwaitEvent) */

    /* AwaitEvent: which event this task is waiting for */
    int              event_id;      /* filled on entry to AwaitEvent */

    /* Scheduler: intrusive link in ready queue */
    struct TaskDescriptor *ready_next;

    /* Stack canary for overflow detection */
    uint32_t         stack_canary;  /* must always equal CANARY_VALUE */
} TaskDescriptor;

#define CANARY_VALUE 0xDEADBEEF

The canary field is placed at the bottom of the task descriptor (which is contiguous with the task’s stack in memory). If the task’s stack overflows downward into the descriptor, the canary is overwritten. A periodic kernel health check compares td->stack_canary against CANARY_VALUE for all active tasks; a mismatch triggers an emergency stop.

Memory layout: the kernel allocates task stacks statically:

/* Static allocation: descriptors and stacks together */
static TaskDescriptor task_pool[MAX_TASKS];
static uint8_t        task_stacks[MAX_TASKS][TASK_STACK_SZ];

The stack for task_pool[i] is task_stacks[i]. The user task’s stack pointer starts at the top of its stack: (uint64_t)task_stacks[i] + TASK_STACK_SZ. The stack grows downward as the task pushes data (function arguments, local variables, saved registers).

Placing the stack canary at the bottom of the descriptor (i.e., adjacent to where the stack would overflow into the descriptor) requires that the descriptor and stack are adjacent in memory. One approach: allocate them as a single struct { TaskDescriptor td; uint8_t stack[STACK_SZ]; }, then the stack overflows downward into the descriptor. Another approach: allocate stacks below their corresponding descriptors in memory. The canary approach detects overflows on the current check; it does not prevent the overflow from causing damage before the check runs. For guaranteed safety, a memory protection unit (MPU) or the MMU’s page-level protection would be needed.

Task Descriptor Pool Management

The free list of task descriptors:

static TaskDescriptor *free_list_head = NULL;

void td_pool_init(void) {
    for (int i = MAX_TASKS - 1; i >= 0; i--) {
        task_pool[i].state    = STATE_FREE;
        task_pool[i].tid      = -1;
        task_pool[i].send_next = free_list_head;
        free_list_head         = &task_pool[i];
    }
}

TaskDescriptor *td_alloc(void) {
    if (!free_list_head) return NULL;
    TaskDescriptor *td = free_list_head;
    free_list_head     = td->send_next;
    td->send_next      = NULL;
    return td;
}

void td_free(TaskDescriptor *td) {
    td->state    = STATE_FREE;
    td->tid      = -1;
    td->send_next = free_list_head;
    free_list_head = td;
}

The free list reuses the send_next field (which is not needed by a free descriptor) for list linkage — a common C trick to avoid adding a field that is only used in one state.

TID assignment: TIDs should be globally unique across the lifetime of the system, not just unique among currently-active tasks. If TIDs were reused immediately, a task that holds a stale TID (from a previous occupant of a descriptor) could accidentally send to the wrong task. The solution: use a monotonically increasing TID counter.

static int next_tid = 1;  /* 0 is reserved for the idle task */

TaskDescriptor *create_task(int priority, void (*fn)(void)) {
    TaskDescriptor *td = td_alloc();
    if (!td) return NULL;
    td->tid       = next_tid++;
    td->priority  = priority;
    td->state     = STATE_READY;
    td->stack_canary = CANARY_VALUE;
    init_task_stack(td, fn);
    scheduler_insert(td);
    return td;
}

With 64 task descriptors and a 32-bit TID counter, TIDs will never wrap in a CS 452 session (even creating a new task every millisecond for a full 4-hour session consumes about 14.4 million TIDs, far below \(2^{31}\)).

The Receive Queue and Priority Within the Server

Within the send queue of a server (the queue of SendWait tasks waiting for the server to Receive), tasks are ordered FIFO by default. This means a high-priority client that arrives after a low-priority client will wait behind the low-priority client. Is this correct?

For a Name Server with very short request processing time, FIFO ordering is fine — the wait is short. For a Track Server with potentially longer processing, a high-priority train engineer might wait behind a low-priority display task. This is a mild form of priority inversion mediated by the server’s queue discipline.

The fix is to order the send queue by sender priority: insert new senders in priority order rather than at the tail. This gives priority-ordered service: the server always processes the highest-priority pending request first.

void send_queue_insert_priority(TaskDescriptor *server, TaskDescriptor *sender) {
    TaskDescriptor **ptr = &server->send_queue_head;
    while (*ptr && (*ptr)->priority >= sender->priority)
        ptr = &(*ptr)->send_next;
    sender->send_next = *ptr;
    *ptr = sender;
    if (!sender->send_next)
        server->send_queue_tail = sender;
}

Priority-ordered send queues are used by QNX Neutrino by default. The trade-off is O(n) insertion (where n is queue depth, typically small) versus O(1) FIFO insertion. For the CS 452 kernel, FIFO is simpler and acceptable; a student extending the kernel could add priority ordering to the CAN TX Server to ensure high-priority train commands preempt display tasks.

The Idle Task and Power Management

The idle task runs when no other task is ready. On a bare-metal system without power management, the CPU would spin at full power doing nothing. The idle task instead executes the WFI (Wait For Interrupt) instruction:

void idle_task_main(void) {
    for (;;) {
        asm volatile ("wfi");
    }
}

WFI halts the processor core (stops the instruction fetch and execution pipelines) until an interrupt arrives. On the Cortex-A72, WFI reduces power consumption to approximately 10–20% of active power. The processor wakes immediately when an interrupt is asserted (the GIC delivers the interrupt to the CPU interface, which releases the WFI stall), so the interrupt latency from a WFI state is no worse than from normal execution.

The idle task must have the lowest possible priority (0) so it runs only when nothing else is ready. It must never call a kernel primitive that can block (Send(), AwaitEvent(), Delay()), since blocking the idle task would leave the CPU without a task to run. Yield() would not block it, but it is pointless: with nothing else ready, the scheduler immediately reselects the idle task.

The fraction of time the idle task runs is the idle fraction: 1 - total_utilization. Measuring it: the idle task increments a counter on each iteration of its loop. The absolute iteration rate is calibrated once at startup (by measuring how many iterations complete in 1 ms with no other tasks running). During normal operation, the idle rate is compared to the calibration rate to compute utilization. The W26 lecture suggests measuring idle time; this is the standard implementation of that measurement.

The O(1) Scheduler: Bitmask Priority Queue

The scheduling algorithm described in the preceding sections — choose the highest-priority ready task, break ties by FIFO order within a priority level — sounds simple to state but has a subtle implementation challenge: how do you find the highest-priority ready task efficiently? The naive approach scans all 32 priority levels from highest to lowest until it finds one with a non-empty queue. That scan is O(P) in the number of priority levels. For 32 levels this is a constant, but the constant is 32 — and the scheduler runs on every context switch, every Send, every Reply, every AwaitEvent completion. In a busy kernel with a dozen tasks exchanging messages and servicing interrupts, the scheduler can easily run tens of thousands of times per second. Shaving cycles off the scheduler’s hot path pays dividends across every task in the system.

The classical solution, pioneered in the Linux O(1) scheduler (2.6 kernel era) and adopted in QNX and virtually every serious real-time microkernel, is a bitmask priority queue. A single 32-bit integer ready_mask encodes which priority levels have at least one ready task: bit \(n\) is set if and only if there is at least one task at priority \(n\) ready to run. Finding the highest-priority ready task reduces to finding the position of the most-significant set bit in ready_mask, which the ARMv8-A CLZ (Count Leading Zeros) instruction performs in a single cycle.

/* One FIFO queue per priority level */
static TaskDescriptor *ready_queues[32][2]; /* [priority][0=head, 1=tail] */
static uint32_t        ready_mask;          /* bit n set → queue n non-empty */

void scheduler_insert(TaskDescriptor *td) {
    int p = td->priority;
    td->ready_next = NULL;
    if (ready_queues[p][0] == NULL) {
        ready_queues[p][0] = ready_queues[p][1] = td;
    } else {
        ready_queues[p][1]->ready_next = td;
        ready_queues[p][1] = td;
    }
    ready_mask |= (1u << p);
}

TaskDescriptor *scheduler_next(void) {
    if (ready_mask == 0) return idle_task;
    int p = scheduler_find_highest_priority(ready_mask);
    TaskDescriptor *td = ready_queues[p][0];
    ready_queues[p][0] = td->ready_next;
    if (ready_queues[p][0] == NULL) {
        ready_queues[p][1] = NULL;
        ready_mask &= ~(1u << p);   /* clear bit when queue drains */
    }
    return td;
}

scheduler_find_highest_priority uses the CLZ instruction introduced in the Chapter 3 assembly section:

static inline int scheduler_find_highest_priority(uint32_t mask) {
    int lz;
    asm("clz %w0, %w1" : "=r"(lz) : "r"(mask));
    return 31 - lz;
}

The entire scheduler_next path — CLZ, array index, pointer update, conditional bit clear — is a handful of instructions with no loops and no branches that depend on the number of tasks. This is what O(1) scheduling means in practice: the time to schedule is bounded by a constant that does not grow with the number of tasks or the number of priority levels.

To appreciate why this matters, compare with Linux’s Completely Fair Scheduler (CFS), introduced in kernel 2.6.23 to replace the O(1) scheduler for general-purpose workloads. CFS inserts and removes tasks from a red-black tree ordered by virtual runtime. Insertion and deletion in a balanced BST are \(O(\log n)\) where \(n\) is the number of runnable tasks. For a desktop with 500 runnable threads, \(\log_2 500 \approx 9\) — acceptable when tasks sleep for milliseconds at a time. For a real-time system with a 2 ms period timer and a kernel that schedules on every interrupt, even a small multiplicative constant matters. The O(1) bitmask approach dominates: no heap allocation, no pointer chasing, no rebalancing — just a CLZ and an array lookup.

QNX Neutrino uses the same bitmask-per-priority approach, organized into a 256-priority space as 8 words of 32 bits each, with an outer bitmask indicating which words are non-empty; two CLZ operations cover all 256 levels. The CS 452 kernel needs only 32 priorities, so one CLZ suffices, and the design collapses to the single-word form shown above.

One subtlety: the bitmask and queues must remain consistent at all times. The invariant is: bit \(n\) is set in ready_mask if and only if ready_queues[n][0] != NULL. Violating this invariant — for example, by clearing the bit but leaving a non-null head pointer — would cause the scheduler to skip a ready task indefinitely, a form of starvation that would manifest as a task that never runs again after a missed ready_mask update. The kernel must update both the pointer and the bitmask atomically (in the sense of no interleaved interrupt between the two operations). Since the kernel runs with interrupts disabled during scheduling decisions (by the DAIF invariant from Chapter 22), this atomicity is guaranteed.


Chapter 9: Context Switching across Exception Levels

Context switching is the mechanism by which the kernel moves the CPU from one task to another. At its core, a context switch saves the current task’s computational state — its register file, stack pointer, and instruction pointer — and restores the saved state of the next task. On ARMv8-A, a context switch between a user task (at EL0) and the kernel (at EL1) is accomplished through the exception mechanism.

What Is Context?

A running task’s context is everything the CPU needs to resume the task as if it had never been interrupted:

  1. General-purpose registers x0–x30: the current values of all 31 integer registers.
  2. SP_EL0: the user task’s stack pointer (distinct from SP_EL1, the kernel’s stack pointer).
  3. ELR_EL1: the address of the next instruction to execute in the user task (the “saved PC”).
  4. SPSR_EL1: the saved PSTATE — the condition flags, the DAIF mask, and the execution level that was active when the exception was taken.

When the processor takes a synchronous exception (such as an svc instruction), it automatically saves the current PC to ELR_EL1 and the current PSTATE to SPSR_EL1. The stack pointer switches from SP_EL0 (user) to SP_EL1 (kernel), and execution jumps to the appropriate entry in the exception vector table. At this point, the kernel’s stack is set up and the user’s SP is untouched, but x0–x30 still contain whatever values the user task had — they have not been saved anywhere.

The kernel’s first job after entering the vector table handler is to save x0–x30 to a known location. The most natural location is the kernel’s own stack (SP_EL1). The kernel then processes the system call, and before returning to any task (which may not be the same task that made the call), it restores x0–x30 and SP_EL0 for the next task from its saved context frame.

The Exception Vector Table

The ARMv8-A architecture defines the exception vector table as a 2048-byte structure containing 16 entries, each 128 bytes. The 16 entries cover four exception sources (Current EL with SP_EL0, Current EL with SP_ELx, Lower EL in AArch64, Lower EL in AArch32) crossed with four exception types (Synchronous, IRQ, FIQ, SError). For a kernel running at EL1:

  • Exceptions from EL0 AArch64 (user tasks) arrive at offset 0x400 (sync) or 0x480 (IRQ).
  • Exceptions from EL1 using SP_EL1 arrive at 0x200 (sync) or 0x280 (IRQ) — these are kernel-mode faults and should never occur in a correct kernel.

The table must be aligned to a 2048-byte boundary and its address loaded into VBAR_EL1 before any exception can be taken:

.align 11                       // 2^11 = 2048-byte alignment

.global exception_vector_table
exception_vector_table:
    // Entries at 0x000–0x180: current EL, SP_EL0 (not used)
    .rept 4
    .align 7                    // each entry is 128 bytes
    b       unhandled_exception
    .endr

    // Entries at 0x200–0x380: current EL, SP_ELx (kernel fault)
    .rept 4
    .align 7
    b       kernel_fault
    .endr

    // Entries at 0x400–0x580: lower EL AArch64 (EL0 syscalls/IRQs)
    .align 7
    b       sync_el0_handler     // svc from EL0
    .align 7
    b       irq_el0_handler      // IRQ while running EL0 task
    .align 7
    b       unhandled_exception  // FIQ
    .align 7
    b       unhandled_exception  // SError

    // Entries at 0x600–0x780: lower EL AArch32 (not used)
    .rept 4
    .align 7
    b       unhandled_exception
    .endr

// Install the vector table
init_vectors:
    adr     x0, exception_vector_table
    msr     VBAR_EL1, x0
    isb
    ret

The System Call Entry Path

When a user task executes svc #0 (the syscall instruction; the immediate is ignored by this kernel, syscall number is in x0), the processor:

  1. Saves the user PSTATE to SPSR_EL1.
  2. Saves the return address (address of the instruction after svc) to ELR_EL1.
  3. Switches from SP_EL0 to SP_EL1.
  4. Sets the DAIF mask bits (masking Debug, SError, IRQ, FIQ; the kernel runs with all interrupts masked unless it explicitly enables them).
  5. Jumps to exception_vector_table + 0x400.

The synchronous EL0 handler must immediately save all user registers. Since the kernel needs a scratch register to set up the save (specifically, a register to hold the stack pointer it is writing to), it uses the per-task kernel stack pointed to by SP_EL1. The save sequence:

sync_el0_handler:
    // At this point:
    // SP       = SP_EL1 (kernel stack, pre-allocated per task)
    // ELR_EL1  = user's return address
    // SPSR_EL1 = user's saved PSTATE
    // x0-x30   = user's register values (unsaved)

    sub     sp, sp, #(34 * 8)          // allocate frame: 31 regs + SP_EL0 + ELR + SPSR

    stp     x0,  x1,  [sp, #0]
    stp     x2,  x3,  [sp, #16]
    stp     x4,  x5,  [sp, #32]
    stp     x6,  x7,  [sp, #48]
    stp     x8,  x9,  [sp, #64]
    stp     x10, x11, [sp, #80]
    stp     x12, x13, [sp, #96]
    stp     x14, x15, [sp, #112]
    stp     x16, x17, [sp, #128]
    stp     x18, x19, [sp, #144]
    stp     x20, x21, [sp, #160]
    stp     x22, x23, [sp, #176]
    stp     x24, x25, [sp, #192]
    stp     x26, x27, [sp, #208]
    stp     x28, x29, [sp, #224]

    // Save x30 (LR), SP_EL0, ELR_EL1, SPSR_EL1
    mrs     x20, SP_EL0
    mrs     x21, ELR_EL1
    mrs     x22, SPSR_EL1
    stp     x30, x20, [sp, #240]
    stp     x21, x22, [sp, #256]

    // Save current SP to task descriptor, pass frame to C handler
    mov     x0, sp
    bl      handle_syscall             // handle_syscall(frame_ptr)

    // On return: x0 = SP of next task to run (may differ from entry SP)
    mov     sp, x0                     // switch to next task's saved frame

    // Restore next task's registers
    ldp     x21, x22, [sp, #256]
    msr     ELR_EL1, x21
    msr     SPSR_EL1, x22
    ldp     x30, x20, [sp, #240]
    msr     SP_EL0, x20

    ldp     x28, x29, [sp, #224]
    // ... (reverse of save sequence) ...
    ldp     x0,  x1,  [sp, #0]

    add     sp, sp, #(34 * 8)
    eret                               // return to user task

The C function handle_syscall(frame_ptr) examines the saved x0 to determine which syscall was requested, executes the appropriate kernel logic, and returns the stack pointer of the task that should run next. If the highest-priority ready task is different from the task that made the syscall, the returned SP is different from the entry SP — the context switch occurs at the mov sp, x0 instruction.

Kernel Stack Strategies

The kernel must decide how to allocate the kernel stack (SP_EL1) for each task. The options are:

Single kernel stack: all system calls share one kernel stack. This works only if system calls cannot nest — which is true in this kernel, since the kernel runs with interrupts disabled and does not make recursive calls into itself. The advantage is simplicity: one stack pointer to manage. The disadvantage is that it cannot support preemptive scheduling within the kernel.

Per-task kernel stacks: each task has its own kernel stack. When the task is descheduled, its kernel stack retains whatever the kernel was doing on behalf of that task. This approach supports preemption during kernel execution and is required for real preemptive kernels (Linux, FreeBSD). For CS 452, where the kernel is non-preemptive and system calls are short and bounded, per-task kernel stacks are the standard approach. Each task’s saved frame sits on its own kernel stack; the TD records the saved SP_EL1.

The two-stack model (user stack SP_EL0 + kernel stack SP_EL1) cleanly separates kernel and user memory. User tasks cannot corrupt the kernel stack by overflowing their own stack, and the kernel cannot accidentally reference user pointers beyond what it intentionally dereferences.

Initializing a New Task’s Stack

When Create() allocates a new task, it must initialize the task’s stack so that when the context switch restores the task, execution begins at task_start. This requires placing a carefully crafted frame on the task’s kernel stack that looks exactly like what the context-switch entry path would have saved.

Given the context switch save format (31 registers + SP_EL0 + ELR_EL1 + SPSR_EL1 = 34 × 8 = 272 bytes), the initial stack frame for a new task T calling function fn at priority p:

void task_init_stack(TaskDescriptor *td, void (*fn)(void),
                     uint64_t kstack_top, uint64_t ustack_top) {
    // Place the initial frame at the top of the task's kernel stack.
    // The user stack (SP_EL0) must be a separate region: every
    // exception rebuilds a frame at the top of the kernel stack, so
    // sharing one region would let the frame overwrite live user data.
    uint64_t *frame = (uint64_t *)(kstack_top - 272);

    // x0-x29 are zeroed; x0 is overwritten below to carry fn
    for (int i = 0; i < 30; i++) frame[i] = 0;

    // x30 (LR): not used in this scheme (ELR_EL1 holds the entry address)
    frame[30] = 0;

    // SP_EL0: user's stack pointer (grows down from the top)
    frame[31] = ustack_top;

    // ELR_EL1: the address to jump to on eret = task_start
    frame[32] = (uint64_t)task_start;

    // SPSR_EL1: return to EL0t (mode 0b0000), DAIF all cleared
    frame[33] = 0x00000000;

    // Pass fn as an argument: it will be in x0 when task_start runs
    frame[0] = (uint64_t)fn;

    td->sp = (uint64_t)frame;  // saved SP_EL1 for the first restore
}

task_start is a thin wrapper:

void task_start(void (*fn)(void)) {
    fn();
    Exit();
}

When the scheduler first selects this task and executes the context restore path, it loads the frame, sets SP_EL0, ELR_EL1, and SPSR_EL1, then executes eret. The processor drops to EL0, sets SP to the user’s stack pointer, and branches to task_start with x0 containing fn. The function begins executing in user mode. When it returns, task_start calls Exit(), which transitions the task to Exited state.

Measuring Context Switch Performance

The total cost of a Send/Receive/Reply roundtrip involves three kernel entries (one svc each for Send, Receive, Reply) and two task-to-task context switches. The system timer’s 1 MHz counter allows precise measurement:

void benchmark_srr(void) {
    int child = Create(MY_PRIORITY - 1, srr_responder);

    uint32_t start = timer_read_lo();
    for (int i = 0; i < 1000; i++) {
        char msg = 0, reply;
        Send(child, &msg, 1, &reply, 1);
    }
    uint32_t end = timer_read_lo();

    // Each iteration: three kernel entries (Send, Receive, Reply)
    // and two task-to-task context switches
    // cost_us = (end - start) / 1000
}

void srr_responder(void) {
    char msg, reply = 0;
    int tid;
    for (;;) {
        Receive(&tid, &msg, 1);
        Reply(tid, &reply, 1);
    }
}

On a well-optimized CS 452 kernel at 1.5 GHz, each Send/Receive/Reply roundtrip takes approximately 2–5 µs. Breaking this down:

  • Two context switches (sender→kernel→receiver, then receiver→kernel→sender): ~0.5 µs each.
  • Three kernel dispatch paths (dispatcher overhead, state machine transitions): ~0.3 µs each.
  • Two message copies (32 bytes each): ~0.2 µs each.

The message copy is O(message_length) — sending a 256-byte message is noticeable. For the train control application, messages are typically 8–32 bytes (a speed command, a sensor event, a position estimate), so copy overhead is minimal.

The IRQ Context Switch Path

System calls (svc) are one of two entry paths into the kernel from EL0. The other is the IRQ path — triggered when a hardware interrupt fires while a user task is running. The IRQ path must also save the current task’s context (just as the svc path does), but it cannot return to the interrupted task immediately: it must run the kernel’s interrupt handler first, which may unblock a notifier task, and then schedule the highest-priority ready task (which may be the notifier, not the original interrupted task).

When an IRQ fires while a user task is running at EL0, the processor:

  1. Saves the current PC to ELR_EL1 (the instruction the user task was about to execute — the IRQ is non-faulting, so this is the return address).
  2. Saves the user PSTATE to SPSR_EL1.
  3. Switches from SP_EL0 to SP_EL1.
  4. Sets the DAIF mask bits (all interrupts masked in the IRQ handler).
  5. Jumps to exception_vector_table + 0x480 (IRQ from lower EL AArch64).

The IRQ handler looks structurally identical to the syscall handler for the save/restore sequence, but the dispatch logic is different:

irq_el0_handler:
    // Same save sequence as sync_el0_handler
    sub     sp, sp, #(34 * 8)
    stp     x0,  x1,  [sp, #0]
    // ... (same STP sequence) ...
    mrs     x20, SP_EL0
    mrs     x21, ELR_EL1
    mrs     x22, SPSR_EL1
    stp     x30, x20, [sp, #240]
    stp     x21, x22, [sp, #256]

    // Pass frame pointer and indicate this is an IRQ (not syscall)
    mov     x0, sp
    bl      handle_irq          // handle_irq(frame_ptr)

    // x0 = SP of next task to run (may be notifier, not interrupted task)
    mov     sp, x0

    // Restore registers and return
    ldp     x21, x22, [sp, #256]
    msr     ELR_EL1, x21
    msr     SPSR_EL1, x22
    // ... (reverse of save) ...
    add     sp, sp, #(34 * 8)
    eret

The C handle_irq function:

uint64_t handle_irq(uint64_t interrupted_sp) {
    /* Acknowledge the interrupt */
    uint32_t iar = GICC->IAR;           // reads the interrupt ID
    int irq_id = iar & 0x3FF;           // lower 10 bits = interrupt ID

    /* Clear the source */
    if (irq_id == GIC_TIMER_C1) {
        TIMER->CS = (1 << 1);           // write-1-to-clear the C1 match bit
                                        // (|= would also clear other pending matches)
    } else if (irq_id == GIC_GPIO17) {
        /* CAN interrupt: don't clear here; the notifier will drain it */
    }

    /* Find the notifier task waiting for this event */
    TaskDescriptor *notifier = event_registry[irq_id];
    if (notifier && notifier->state == STATE_EVENT_WAIT) {
        notifier->state = STATE_READY;
        notifier->retval = irq_id;      // return value from AwaitEvent
        scheduler_insert(notifier);
    }

    /* Save interrupted task's SP to its descriptor */
    TaskDescriptor *current = current_task;
    current->sp = interrupted_sp;
    current->state = STATE_READY;
    scheduler_insert(current);          /* re-insert the interrupted task */

    /* Write End-of-Interrupt to GIC */
    GICC->EOIR = iar;

    /* Schedule next task */
    TaskDescriptor *next = scheduler_next();
    return next->sp;                    /* context switch to next task */
}

The critical difference between the IRQ path and the syscall path: in the IRQ path, the interrupted task must be re-inserted into the ready queue. The interrupted task did not choose to give up the CPU; it was preempted involuntarily. Its state must remain Ready (not blocked), and it must be re-inserted at its priority level. The next task to run is then chosen by the scheduler — which will be the newly-ready notifier if the notifier’s priority is higher than the interrupted task’s.

The full sequence for an IRQ preempting a medium-priority task to run a high-priority notifier:

[IRQ fires] → [context saved, interrupted task state = Ready]
           → [notifier unblocked, state = Ready]
           → [scheduler selects notifier (higher priority)]
           → [context restore: notifier's frame]
           → [eret: notifier runs at EL0]
...
[Notifier calls Send (svc)] → [context saved, notifier state = SendWait]
                            → [clock server unblocked, state = Ready]
                            → [scheduler selects clock server or original task]

This is the complete preemption chain. The interrupted task does not lose any state — its frame is saved on its own kernel stack, and the scheduler will return to it whenever its priority is again the highest among ready tasks.

The Symmetric Exception Entry/Exit Pattern

One elegance of this design: the save and restore sequences are symmetric and identical between the syscall and IRQ handlers. Both handlers:

  1. Save 34 values (31 GPRs + SP_EL0 + ELR_EL1 + SPSR_EL1) to the kernel stack.
  2. Call a C function that takes the frame pointer and returns the next task’s frame pointer.
  3. Restore 34 values from the returned frame pointer.
  4. Execute eret.

The C function (handle_syscall or handle_irq) is the only place that knows whether this is a voluntary yield (syscall) or an involuntary preemption (IRQ). From the assembly’s perspective, the handler is always “save context, call C, restore context, eret.” This symmetry means a single save/restore macro can be used for both paths, reducing code duplication and the risk of inconsistency.

The exception is FIQ (Fast IRQ). ARMv8-A distinguishes FIQ (highest-priority hardware interrupt) from IRQ. In a full system, FIQ would be used for the most time-critical interrupt — perhaps a motor over-current protection signal — and would be handled directly in the FIQ handler without going through the full kernel path. CS 452 does not use FIQ; all hardware interrupts are routed as IRQ through the GIC, and the priority between them is managed by the GIC’s IPRIORITYR registers.

Exception Handling From EL1

A subtle point: when the kernel itself is executing at EL1 with interrupts enabled, an IRQ can fire during kernel execution. This is handled by the 0x280 entry (Current EL with SP_ELx, IRQ) in the exception vector table. In the CS 452 kernel, this should never happen — the kernel runs with interrupts masked (DAIF set) during all kernel operations, unmasks them only when returning to EL0, and re-masks them on the next exception entry. The 0x280 handler should be:

kernel_irq_handler:
    // This should never execute in a correctly-written kernel
    // Reaching here means interrupts were unmasked during kernel execution — a bug
    b       kernel_fault

If this handler is ever reached, it indicates a serious kernel bug: the invariant that DAIF is always set during EL1 execution was violated. Making this a fatal fault (entering a debug loop or triggering an LED error code) is appropriate.

Synchronous Exception Dispatch: Handling User-Task Faults

Not all synchronous exceptions from EL0 are intentional system calls. A user task can generate a synchronous exception due to a programming error: an undefined instruction, an alignment fault, a permission fault from a memory access to an unmapped address, a stack overflow. The kernel must handle these gracefully — not by crashing, but by terminating the offending task and logging diagnostic information.

ARMv8-A identifies the cause of a synchronous exception through the Exception Syndrome Register (ESR_EL1). The kernel reads this register in the synchronous EL0 handler to determine whether the exception is a syscall (EC = 0x15, svc) or a fault (EC = other values).

The EC field occupies bits [31:26] of ESR_EL1. Important EC values:

EC (dec)   Hex    Cause
 0         0x00   Unknown reason (often an undefined instruction)
 7         0x07   FP/SIMD access (CPACR_EL1.FPEN trap)
21         0x15   SVC instruction from AArch64 — normal syscall
24         0x18   Instruction Abort from lower EL (PC to unmapped memory)
36         0x24   Data Abort from lower EL (load/store to unmapped or misaligned memory)
60         0x3C   BRK instruction (software breakpoint)

For the CS 452 kernel’s synchronous EL0 handler, the dispatch looks like:

void handle_sync_el0(uint64_t *frame, uint64_t esr) {
    int ec = (esr >> 26) & 0x3F;   /* Exception Class */

    switch (ec) {
        case 0x15: /* SVC: normal system call */
            handle_syscall(frame);
            break;

        case 0x24: /* Data Abort from EL0 */
            handle_data_abort_el0(frame, esr);
            break;

        case 0x18: /* Instruction Abort from EL0 */
            handle_insn_abort_el0(frame, esr);
            break;

        case 0x07: /* FP/SIMD access with FPEN disabled */
            handle_fpu_trap_el0(frame, esr);
            break;

        default:
            kill_task_with_fault(current_task, ec, esr);
            break;
    }
}

The assembly entry point reads ESR_EL1 and passes it to the C dispatcher as an argument; the C code could also fetch it itself with an inline mrs (as the data-abort handler does for FAR_EL1), but passing it keeps the hot-path system-register access in the assembly stub:

sync_el0_handler:
    // ... (save all registers as before) ...

    mrs     x1, ESR_EL1         // second argument: exception syndrome
    mov     x0, sp              // first argument: saved frame pointer
    bl      handle_sync_el0     // dispatch by exception class

    // restore and eret (same as before)

Data Abort handling (EC = 0x24): The fault address is in FAR_EL1 (Fault Address Register). The ISS field of ESR_EL1 (bits [24:0]) encodes whether it was a read or write (ISS[6] = WnR), whether it was a translation fault or permission fault (ISS[5:0] = DFSC — Data Fault Status Code), and additional details.

void handle_data_abort_el0(uint64_t *frame, uint64_t esr) {
    uint64_t far;
    asm volatile("mrs %0, FAR_EL1" : "=r"(far));

    int wnr  = (esr >> 6) & 1;    /* 1 = write, 0 = read */
    int dfsc = esr & 0x3F;         /* Data Fault Status Code */

    /* Translation fault: task accessed an address with no valid PTE.
       In a flat-memory kernel with no MMU (identity-mapped), this almost always
       means the task dereferenced a null or garbage pointer. */
    uart0_printf("FAULT: task %d %s to 0x%016lx, PC=0x%016lx, DFSC=0x%02x\n",
        current_task->tid,
        wnr ? "write" : "read",
        far,
        frame[32],   /* ELR_EL1 = fault PC */
        dfsc);

    kill_task_with_fault(current_task, EC_DATA_ABORT, esr);
}

kill_task_with_fault marks the task as Exited and saves the fault information in the task descriptor for post-mortem inspection; the dispatch path then continues with scheduler_next() to run the next ready task:

void kill_task_with_fault(TaskDescriptor *td, int ec, uint64_t esr) {
    td->state      = EXITED;
    td->fault_ec   = ec;
    td->fault_esr  = esr;
    /* Note: any tasks Reply-Blocked waiting for this task must be unblocked
       with an error code, or they will block forever. */
    unblock_reply_waiters(td, ERR_TASK_FAULTED);
}

The call to unblock_reply_waiters is critical and easy to forget. If task A called Send(td->tid, ...) and is now in ReplyWait, it will never receive a reply because the replier has died. The kernel must detect this and reply to all ReplyWait tasks with a fault error code (ERR_TASK_FAULTED = -3, for example). Failure to do so leaves those tasks blocked indefinitely — a silent hang with no diagnostic output.

FP/SIMD trap handling (EC = 0x07): if a user task executes a floating-point instruction (e.g., fadd d0, d0, d1) while CPACR_EL1.FPEN is configured to trap FP/SIMD access from EL0, the processor takes this exception. For the CS 452 kernel, user tasks should not use FP/SIMD operations (the API and problem domain do not require them), so the correct handling is to terminate the task with an error:

void handle_fpu_trap_el0(uint64_t *frame, uint64_t esr) {
    uart0_printf("FAULT: task %d attempted FP/SIMD instruction (PC=0x%016lx)\n",
        current_task->tid, frame[32]);
    kill_task_with_fault(current_task, EC_FPU_TRAP, esr);
}

An alternative design allows user tasks to use FP/SIMD by enabling FPEN in CPACR_EL1 for all tasks, but then the context switch must save and restore the 32 × 128-bit SIMD registers (512 bytes of additional state per task). For a train controller with no floating-point computation in its control logic, this overhead is not justified.

Undefined instruction handling (EC = 0x00): a task executing an instruction whose encoding is reserved or not implemented by the processor. Common causes: attempting to use AArch32 instructions in AArch64 mode, executing garbage data as code (stack overflow clobbering the return address), or using an instruction that requires a higher architecture version (e.g., SVE instructions on a Cortex-A72 which only supports NEON). The handler should log the faulting PC and terminate the task.

Stack overflow detection through faults: without a guard page (which requires an MMU with per-page protection), stack overflow is silent — the task simply corrupts memory below its stack allocation. A guard page would be an address range mapped as inaccessible; any access to it triggers a permission Data Abort (DFSC = 0x0D or 0x0F). The kernel handles the fault by comparing FAR_EL1 against the guard page range and reporting a “stack overflow” diagnostic rather than a generic abort. This makes debugging significantly easier: instead of a cryptic Data Abort at some arbitrary address, the user sees “task 7: stack overflow” and can examine the backtrace.

#define GUARD_PAGE_SIZE  0x1000  /* 4 KB: one page below each stack */

bool is_guard_page_fault(uint64_t fault_addr, TaskDescriptor *td) {
    uint64_t guard_base = (uint64_t)td->stack_base - GUARD_PAGE_SIZE;
    return fault_addr >= guard_base && fault_addr < guard_base + GUARD_PAGE_SIZE;
}

void handle_data_abort_el0(uint64_t *frame, uint64_t esr) {
    uint64_t far;
    asm volatile("mrs %0, FAR_EL1" : "=r"(far));

    if (is_guard_page_fault(far, current_task)) {
        uart0_printf("FAULT: task %d stack overflow (SP was below guard page)\n",
            current_task->tid);
    } else {
        uart0_printf("FAULT: task %d data abort at 0x%016lx\n",
            current_task->tid, far);
    }
    kill_task_with_fault(current_task, EC_DATA_ABORT, esr);
}

Without an MMU mapping guard pages as inaccessible, the fault will not be triggered by simple stack overflow — but the logic is correct for kernels that configure the MMU. Alternatively, the stack canary (Chapter 8) provides a software equivalent that detects overflow after the fact, at the cost of a per-tick scan.

The Debug Loop: A Minimal Fault Recovery Interface

When the kernel itself has an unrecoverable fault — a bug in the exception vector table, a corrupted scheduler queue, a double fault during fault handling — the kernel cannot continue and should not attempt to. The appropriate behavior is to enter a panic loop that:

  1. Disables all interrupts (DAIF set to 0xF).
  2. Outputs a diagnostic message to UART0.
  3. Halts the CPU with WFI, preventing further damage.

__attribute__((noreturn))
void kernel_panic(const char *msg) {
    asm volatile("msr daifset, #0xF");   /* Mask all exceptions */
    uart0_puts("\nKERNEL PANIC: ");
    uart0_puts(msg);
    uart0_puts("\n");
    for (;;) {
        asm volatile("wfi");
    }
}

The kernel_fault handler referenced in the exception vector table calls kernel_panic with diagnostic information extracted from system registers:

kernel_fault:
    // Minimal register save for diagnostics (don't touch SP — may be corrupted)
    // Use a statically allocated emergency buffer
    adr     x0, kernel_fault_msg
    bl      kernel_panic

    .section .data
kernel_fault_msg:
    .asciz "exception from EL1 — kernel bug\n"

The panic loop preserves the kernel’s register state until a JTAG debugger can be attached. In a production embedded system, the panic loop would also trigger a hardware watchdog reset after a timeout; in a development system, it is more useful to halt and allow inspection.

Performance Implications for Task Architecture

The SRR cost model has architectural implications. A task design that requires many small Send/Receive/Reply exchanges — one per sensor byte, one per timer tick — will spend more time in the kernel than a design that batches messages. The clock server batches multiple delayed tasks’ wakeups into a single tick notification; the UART TX server batches multiple bytes into a single FIFO drain. These batching strategies reduce the number of SRR operations per unit of real-world work.

The converse: large message copies are expensive. If a task needs to transfer 4 KB of data (a large display buffer, a route table), it should either split the data into smaller chunks or use a shared memory region with message-passing coordination (the message carries a pointer to the shared buffer, not the buffer itself — but this reintroduces the shared memory problem and requires careful lifetime management).


Chapter 10: Send / Receive / Reply Message Passing

The most distinctive feature of the CS 452 kernel is its communication model. User tasks do not communicate through shared memory, pipes, files, or signals. They communicate exclusively through Send/Receive/Reply (SRR) — a synchronous message-passing protocol that transfers data directly between task stacks, without any kernel buffer, and in doing so transfers control from sender to receiver to replier.

SRR is not a new invention. It descends from a long lineage of microkernel IPC designs: Brinch Hansen’s Concurrent Pascal (1975), QNX’s message-passing RTOS (1980), the Mach IPC (1986), L4’s fast IPC (1995), and many others. What these systems share is the insight that message passing, by eliminating shared memory, also eliminates the need for locks, semaphores, and monitors — and the priority inversion and deadlock hazards that come with them.

The Three Primitives

Send(int tid, const char *msg, int msglen, char *reply, int rplen) is called by a task that wants to request a service. It transmits a message to the task identified by tid and blocks until that task calls Reply(). The blocking is unconditional: a sender cannot poll or time out. This ensures that every Send eventually completes (assuming the receiver eventually calls Reply, which is an architectural invariant that server tasks must maintain).

Receive(int *tid, char *msg, int msglen) is called by a task that wants to accept a service request. If a sender is already waiting (in SendWait state), Receive immediately copies the message and returns. If no sender is waiting, Receive blocks (ReceiveWait) until one arrives. Receive cannot specify which sender it wants — it always accepts the head of the send queue.

Reply(int tid, const char *reply, int rplen) is called after processing a received request. It copies the reply buffer to the waiting sender’s reply buffer (pointed to in the original Send call), and transitions the sender from ReplyWait to Ready. Reply does not block: the replier continues running.

State Transitions

The three-way dance between sender and receiver generates a well-defined state machine. Let the server task be S and the client task be C.

Case 1: Client calls Send before Server calls Receive.

  • C calls Send → C enters SendWait, is added to S’s send queue.
  • S calls Receive → finds C in its send queue, copies message from C’s buffer to S’s buffer, sets C’s state to ReplyWait, returns.
  • S processes request, calls Reply(C) → copies reply to C’s reply buffer, C transitions to Ready.

Case 2: Server calls Receive before any Client.

  • S calls Receive → no senders waiting, S enters ReceiveWait.
  • C calls Send → finds S in ReceiveWait, copies message directly, sets C to ReplyWait, sets S to Ready.
  • S runs, processes request, calls Reply(C).

In both cases, exactly one data copy occurs for the message (from C’s buffer to S’s buffer) and one for the reply (from S’s buffer to C’s buffer). This zero-additional-copy property is fundamental: there is no intermediate kernel buffer, no DMA, no ring buffer. The kernel acts as a trusted intermediary that copies data directly between two user stacks under the protection of EL1 privilege.

Why No Shared Memory

The argument for eliminating shared memory is worth making explicitly, because shared memory is the dominant communication model in most programming environments.

Shared memory requires locks to protect concurrent accesses. Locks introduce the risk of deadlock (if tasks acquire locks in inconsistent order) and priority inversion (if a high-priority task is blocked waiting for a lock held by a low-priority task). These hazards are not merely theoretical; the Mars Pathfinder mission in 1997 was disrupted by a priority inversion on a shared-memory mutex that the operating system had not protected with priority inheritance (see Chapter 16 for the full account).

Message passing eliminates both hazards. If two tasks need to share a resource — say, the track switch state — one task (the “track server”) owns the resource exclusively and all access to it goes through Send/Receive/Reply to the track server. Concurrent access is impossible by design: the server can only Receive one request at a time. Deadlock from lock ordering cannot occur because there are no locks. Priority inversion is addressed architecturally: a high-priority client’s Send to a low-priority server makes the server effectively run at the client’s priority, because the scheduler will schedule the server before any task below the client’s priority (though in the standard kernel the server must explicitly re-enter Receive to service the client; see Chapter 20 for dynamic prioritization).

The cost of this approach is two copies per transaction (message and reply) instead of zero (for shared memory with lock). For small messages — a train command, a clock tick, a sensor status — this cost is negligible. For large data — a camera frame, an audio buffer — it would be prohibitive. This kernel is designed for small messages, and the API reflects this: Send and Receive take char * buffers with explicit length limits.

The Send Queue

Each task descriptor contains the head of a send queue: a linked list of TaskDescriptors in SendWait state that have called Send to this task but have not yet been Received. The queue is maintained in arrival order (FIFO), so the task that sent first is served first. There is no priority ordering within the send queue — a high-priority sender does not cut ahead of a low-priority sender already in the queue. This can be changed (priority-ordered send queues exist in some RTOSs), but the added complexity is rarely justified and changes the semantics in subtle ways.

The in-kernel do_send function:

void do_send(TaskDescriptor *sender, int target_tid,
             const char *msg, int msglen,
             char *reply, int rplen) {
    TaskDescriptor *receiver = td_lookup(target_tid);
    if (!receiver || receiver->state == EXITED) {
        sender->retval = -1;   // error: task doesn't exist
        scheduler_insert(sender);
        return;
    }

    sender->send_msg    = msg;
    sender->send_msglen = msglen;
    sender->reply_buf   = reply;
    sender->reply_len   = rplen;

    if (receiver->state == RECEIVEWAIT) {
        // Server already waiting — transfer immediately
        copy_msg(receiver, sender);
        receiver->receive_tid = sender->tid;
        receiver->retval = (msglen < receiver->receive_msglen)
                            ? msglen : receiver->receive_msglen;
        sender->state = REPLYWAIT;
        scheduler_insert(receiver);
    } else {
        // Enqueue sender on receiver's send queue
        sender->state = SENDWAIT;
        send_queue_append(receiver, sender);
    }
}

Comparison with Other IPC Designs

The SRR model has a striking family resemblance to QNX Neutrino’s MsgSend / MsgReceive / MsgReply. QNX uses the same blocking semantics and the same idea that the receiver should run at (at least) the sender’s priority when processing a request. QNX extends the model with asynchronous messages (MsgSendAsync) and pulse notifications for situations where blocking is undesirable.

Mach IPC uses ports (reference-counted kernel objects) rather than task IDs, and supports both synchronous (RPC-style) and asynchronous (one-way) messages. Messages are copied into kernel-managed ports, allowing more flexibility at the cost of an intermediate buffer and more complex object lifecycle management. The Mach IPC design influenced macOS and iOS, where it underlies every NSXPCConnection.

L4’s fast IPC (Liedtke, 1995) was specifically designed to achieve IPC fast enough that building systems on top of microkernel IPC would not be prohibitively expensive compared to monolithic kernels. L4 achieves IPC in a few hundred nanoseconds by mapping the sender’s page directly into the receiver’s address space for large messages (avoiding copies entirely) and by keeping the IPC path strictly in cache-friendly code. The CS 452 kernel’s kernel-stack-copy approach is conceptually similar but simpler.

SRR Worked Example: The Clock Server Interaction

Let us trace a complete SRR interaction: the Clock Notifier sends a tick to the Clock Server, which then unblocks a delayed client.

Setup: Clock Server (TID 2, priority 25), Clock Notifier (TID 3, priority 31), Client Task (TID 5, priority 10). Client has called Delay(clock_server_tid, 10), blocking itself in ReplyWait, waiting for the Clock Server to reply.

Tick arrives (interrupt fires):

  1. The GIC delivers IRQ 97 (System Timer C1) while the idle task (TID 0) is running.
  2. The kernel’s IRQ handler saves the idle task’s context and calls handle_irq.
  3. handle_irq reads GICC_IAR, clears the timer interrupt (writes to TIMER_CS to clear the match flag), writes GICC_EOIR, and finds TID 3 (Clock Notifier) in the event-blocked registry for event ID EVENT_TIMER_C1.
  4. TID 3 transitions from Event-Blocked to Ready. Its return value (the tick count) is set.
  5. The scheduler selects TID 3 (priority 31 > idle priority 0). The idle task’s context is saved. TID 3’s context is restored. eret jumps to TID 3.

Clock Notifier runs (TID 3):

  6. AwaitEvent returns the current tick count (say, 42).
  7. TID 3 calls Send(clock_server_tid, &tick, sizeof(tick), NULL, 0).
  8. do_send checks TID 2’s state: it is Receive-Blocked (the clock server is waiting for a tick or a client request). This is Case 2 from the SRR state machine.
  9. The tick message is copied from TID 3’s buffer into TID 2’s receive buffer.
  10. TID 3’s state transitions to ReplyWait (it is waiting for the clock server to Reply).
  11. TID 2’s state transitions to Ready. The clock server’s Receive() return value is set to indicate TID 3 sent the message.
  12. The scheduler selects TID 2 (priority 25 > anything else ready). TID 2’s context is restored.

Clock Server processes the tick (TID 2):

  13. Receive returns with sender_tid = 3, msg = {tick: 42}.
  14. The clock server increments its tick counter to 42.
  15. It scans the delay queue for tasks whose deadline ≤ 42. It finds TID 5, which had called Delay(clock_server_tid, 10) at tick 32, giving a deadline of 42: due exactly now. The clock server calls Reply(5, &current_tick, sizeof(current_tick)).
  16. do_reply finds TID 5 in ReplyWait state, copies the reply, transitions TID 5 to Ready.
  17. The clock server then calls Reply(3, NULL, 0) — releasing the Clock Notifier.
  18. The Clock Notifier (TID 3, priority 31) becomes Ready; since it is the highest priority, it immediately preempts the Clock Server.
  19. The Clock Notifier calls AwaitEvent(EVENT_TIMER_C1) again, returning to Event-Blocked state.
  20. The Clock Server resumes. Since TID 5 is at a lower priority (10 < 25), the Clock Server continues to run until it blocks again in Receive.

This trace illustrates several important points. First, the clock tick travels through three SRR interactions: Timer → Notifier → Server → Client. Each interaction is a context switch. The total latency from the timer interrupt to the client’s Delay() return is the sum of three kernel transitions plus the message copy times. Second, tasks are not resumed in strict priority order within the server’s processing loop — the clock server can call Reply to multiple tasks before returning to Receive, briefly changing the system’s ready set before the scheduler runs. Third, the scheduling policy is always enforced at the entry to the scheduler, not within the server’s processing loop.

The Decoupling Properties of SRR

A subtle and important property of SRR is the decoupling in time and decoupling in space that Reply enables:

Decoupling in time: a server can receive a request and delay its reply arbitrarily. The clock server receives the Delay request immediately (at the tick when the client calls Delay) but replies only when the deadline has passed — possibly many ticks later. The client is blocked in ReplyWait throughout this interval, not consuming CPU time. This is the key mechanism behind all deadline-based waiting in the kernel: the client does not poll; it is parked.

Decoupling in space: a server can delegate reply responsibility to another task. Suppose the clock server forwards a deadline notification to a separate “expiry handler” task, which replies to the original sender. The original sender doesn’t know or care that a different task replied — it just sees its Delay() return. This is the delegate or courier-reply pattern: the server accepts the request, hands off the reply obligation, and immediately returns to its receive loop. This pattern allows servers to achieve higher throughput by parallelizing request processing across multiple worker tasks, with each worker independently replying when its work is done.

The combination of these properties gives SRR an expressive power that shared-memory communication lacks. In shared-memory systems, a “delay until time X” operation requires either a spin loop (wasted CPU), a sleep call (OS-managed, with scheduler overhead), or an explicit timeout mechanism in a select/poll call. In SRR, the delay is implemented naturally: send to the clock server, don’t get a reply until time X. The resource (CPU) is freed immediately upon the send, without any polling.

Message Protocol Design

Every SRR interaction requires a message protocol: a specification of the data format for requests and replies. Good protocol design is as important as the task architecture, because protocol bugs (mismatched sizes, missing reply types, incorrect type codes) manifest as cryptic hangs or corrupted data.

The canonical message format uses a tagged union with a type discriminator:

typedef enum {
    MSG_SPEED_CMD,
    MSG_QUERY_POSITION,
    MSG_SWITCH_CMD,
    MSG_SENSOR_EVENT,
    MSG_DELAY,
    MSG_DELAYUNTIL,
    MSG_TIME,
    /* ... */
} MsgType;

typedef struct {
    MsgType type;
    union {
        struct { uint32_t loco_uid; uint16_t speed; uint8_t dir; } speed;
        struct { int track_node_id; } query_pos;
        struct { int switch_id; int position; }  sw;
        struct { int sensor_id; uint32_t time_us; } sensor;
        struct { int ticks; } delay;
        struct { int absolute_tick; } delay_until;
    };
} Message;

typedef struct {
    int   retval;   /* 0 = success, negative = error code */
    union {
        struct { int32_t pos_mm; int32_t vel_mmps; } position;
        int ticks;
        /* other reply payloads */
    };
} Reply;

The type field is always first; the receiver can read it to determine which branch of the union is valid before accessing any other field. This pattern requires that every Receive reads at least sizeof(MsgType) bytes — a reasonable assumption.

Protocol versioning: in a long-lived system, message formats may need to change. A common practice is to include a version field alongside the type:

typedef struct {
    uint8_t  version;   /* 1 = current; allows future expansion */
    MsgType  type;
    /* ... */
} Message;

The server can check the version and handle backward compatibility. For CS 452, protocol versioning is unnecessary — the system is self-contained and both sides of every interface are compiled together. But it is good discipline.

Error codes in replies: the retval field of the Reply struct should follow a consistent convention. Negative values indicate errors; zero or positive values indicate success (with a result). The specific negative values should have symbolic names:

#define ERR_INVALID_TID    (-1)
#define ERR_NOT_FOUND      (-2)
#define ERR_BUFFER_FULL    (-3)
#define ERR_INVALID_PARAM  (-4)

Every server reply path that can fail must include one of these codes, and callers must check for errors. Silently ignoring a negative return from WhoIs() or Time() is a common source of subtle bugs.

Synchronization Patterns in SRR

Beyond the basic client/server interaction, SRR enables several higher-level synchronization patterns that arise repeatedly in the train control application.

Rendezvous: two tasks synchronize by each sending to the other simultaneously. This is impossible in pure SRR (both would block in Send, waiting for the other to Receive). The solution: one task is designated the “server” and calls Receive; the other calls Send. After the exchange, the server can reply to confirm completion.

Broadcast notification: a server needs to notify N clients simultaneously. As discussed in Chapter 21, pure SRR cannot broadcast — it can only reply to one client at a time. The pre-subscribed reply pattern (clients pre-send “notify me” requests and the server keeps a list of waiting TIDs to reply to) resolves this efficiently.

Timeout with SRR: SRR does not natively support a “send with timeout” — a Send that returns ERR_TIMEOUT if the reply does not arrive within a specified interval. This is because Send blocks indefinitely; there is no kernel-supported interrupt path to unblock a waiting sender.

The standard workaround: use a timeout courier task. Create a courier task at a higher priority than the sender. The courier calls Delay(clock_tid, timeout), then sends a special TIMEOUT message to the server. The server, upon receiving a TIMEOUT message, replies to the waiting sender with ERR_TIMEOUT. This requires the server to know that the sender is waiting on a timeout, which requires explicit bookkeeping in the server’s state.

A simpler workaround for operations that need bounded blocking: decompose the operation into a non-blocking request and a subsequent poll. Instead of result = Send(server, request) with timeout, use:

/* Ask server to start the operation */
Send(server, &start_req, sizeof(start_req), NULL, 0);
/* Poll until done or timeout */
for (int i = 0; i < max_polls; i++) {
    Delay(clock_tid, POLL_INTERVAL);
    QueryResult r = query_operation_status(server);
    if (r.done) return r.result;
}
return ERR_TIMEOUT;

This is less efficient (multiple round-trips) but avoids the complexity of a timeout courier.

Barrier synchronization: N tasks all start a phase, wait until all N have finished, then start the next phase. In SRR, implement with a “phase server” that counts arriving tasks: when N tasks have sent to it, it replies to all N simultaneously. This is the pre-subscribed reply pattern again.

SRR and Deadlock: A Formal Analysis

The deadlock freedom guarantee of message-passing systems is not absolute — it depends on the topology of the send graph. A well-designed SRR system has a directed acyclic graph (DAG) of send relationships: no task sends to any task that (directly or transitively) sends back to it.

Formal criterion: let \(G = (V, E)\) be the directed graph whose vertices \(V\) are the tasks, with an edge \((A, B) \in E\) whenever task A can call Send to task B. The set of tasks is deadlock-free if \(G\) is a DAG.

Proof sketch: in a DAG, some task is a sink: it has no outgoing send-edges, so it never calls Send and can never be Send-blocked. It can therefore always reach its Receive and Reply, so every Send directed at it eventually completes and its senders unblock. Remove the sink from the graph; the remaining tasks form a smaller DAG, and by induction every Send in the system completes.

Counter-example (shows a cycle creates deadlock): task A sends to task B; task B sends to task A. Neither can receive while blocked in Send. This is the simplest deadlock.

Checking your system architecture for DAG structure is a valuable exercise. Draw all the inter-task Send relationships as arrows on paper. If you can number the tasks so that all arrows point from lower to higher numbers (a topological sort), the system is deadlock-free. If not, you have a potential deadlock.

Practical checks for the train control system:

  • The Name Server receives Sends from everyone; it never sends to anyone. ✓
  • The Clock Server receives from the Clock Notifier and from clients; it never sends (it only replies). ✓
  • The Train Engineer sends to the Track Server; the Track Server never sends back to engineers (it only replies). ✓
  • The Track Server sends to the CAN TX Server; the CAN TX Server never sends back to the Track Server. ✓

The potential issue: if a server ever needs to send to another server in response to a client request, it must use a courier task to avoid blocking. Any Send in a server’s receive-process-reply loop creates a potential cycle.


Chapter 11: The Name Server and Inter-Task Naming

The SRR primitives require a task to know the TID of the task it wants to communicate with. But how does a client task discover the TID of the server it needs? TIDs are assigned dynamically at Create() time; a client cannot know in advance what TID the clock server or UART server will have, especially since initialization order may vary.

The Name Server solves this problem. It is a well-known task (its TID is determined at system startup and broadcast through a fixed mechanism) that acts as a registry: other tasks register their names and TIDs with it, and clients query it to resolve names to TIDs.

RegisterAs and WhoIs

The Name Server exposes two operations:

RegisterAs(const char *name): the calling task sends its name to the Name Server, which records the mapping name → calling_tid. Returns 0 on success, -1 if the Name Server is unreachable.

WhoIs(const char *name): the calling task sends a name to the Name Server and receives back the TID registered under that name. Blocks until the name is registered (if it is not yet registered, the WhoIs caller may need to retry or wait).

Typical usage during system initialization:

// Clock server startup
int main(void) {
    RegisterAs("ClockServer");
    // ... main server loop ...
}

// Client startup
int clock_tid = WhoIs("ClockServer");
// now can call Time(clock_tid), Delay(clock_tid, ticks), etc.

The Name Server’s own TID is a special case. The kernel creates it as the first user task with a deterministically-assigned TID (e.g., TID 1), so clients can always find it. Alternatively, the kernel can expose the Name Server TID through a special syscall MyNameServerTid(), or it can simply be defined as a compile-time constant that both the Name Server and all clients agree upon.

Why the Name Server Comes First

The Name Server is always the first server written because every other server needs it. The Clock Server, the UART servers, the CAN server — all of them RegisterAs a name at startup. Clients call WhoIs before their first inter-server communication. The Name Server therefore has no dependencies on other servers (it cannot call WhoIs itself) and must be started before any other server.

This dependency ordering is a general principle of server-based system design: the services with the fewest dependencies are started first, and each subsequent service may depend on those that started before it. The Clock Server depends on the Name Server. The UART Transmit Server depends on the Name Server (to register) and optionally on the Clock Server (for timing). The train supervisor depends on the Name Server, the Clock Server, and the CAN server. Keeping this graph acyclic is a design invariant.

Implementing the Name Server

The Name Server’s implementation is deliberately minimal. Because no other server exists yet when the Name Server starts, it cannot use WhoIs or other server facilities — it is truly self-contained. A straightforward flat array implementation is sufficient:

#define MAX_NAMES    64
#define MAX_NAME_LEN 32

typedef struct {
    char name[MAX_NAME_LEN];
    int  tid;
} NameEntry;

static NameEntry registry[MAX_NAMES];
static int       registry_size = 0;

static int ns_lookup(const char *name) {
    for (int i = 0; i < registry_size; i++) {
        if (strncmp(registry[i].name, name, MAX_NAME_LEN) == 0)
            return registry[i].tid;
    }
    return -1;
}

static int ns_register(const char *name, int tid) {
    /* overwrite if already registered */
    for (int i = 0; i < registry_size; i++) {
        if (strncmp(registry[i].name, name, MAX_NAME_LEN) == 0) {
            registry[i].tid = tid;
            return 0;
        }
    }
    if (registry_size >= MAX_NAMES) return -1;
    strncpy(registry[registry_size].name, name, MAX_NAME_LEN - 1);
    registry[registry_size].name[MAX_NAME_LEN - 1] = '\0';  /* guarantee NUL termination */
    registry[registry_size].tid = tid;
    registry_size++;
    return 0;
}

The main server loop handles two request types — REGISTER and WHOIS:

typedef enum { NS_REGISTER, NS_WHOIS } NSMsgType;
typedef struct { NSMsgType type; char name[MAX_NAME_LEN]; } NSRequest;
typedef struct { int result; } NSReply;

void nameserver_main(void) {
    /* The name server does NOT call RegisterAs — it has no dependencies. */
    for (;;) {
        int sender;
        NSRequest req;
        Receive(&sender, &req, sizeof(req));

        NSReply reply;
        if (req.type == NS_REGISTER) {
            reply.result = ns_register(req.name, sender);
            Reply(sender, &reply, sizeof(reply));
        } else {  /* NS_WHOIS */
            int tid = ns_lookup(req.name);
            if (tid >= 0) {
                reply.result = tid;
                Reply(sender, &reply, sizeof(reply));
            } else {
                /* Name not yet registered: reply -1 so the client retries.
                   A more sophisticated implementation would hold the sender
                   (deferring the Reply) and answer once the name is registered. */
                reply.result = -1;
                Reply(sender, &reply, sizeof(reply));
            }
        }
    }
}

The handling of an unregistered name in WhoIs is worth examining. A naive implementation replies immediately with -1, requiring the client to retry. A more sophisticated implementation queues blocked WhoIs requests and replies to them when the name is later registered. The queued approach eliminates polling loops in clients but complicates the Name Server’s state management. For a small system with deterministic startup order (Name Server → Clock Server → UART Servers → CAN Server → Supervisory tasks), the simple immediate-reply plus client retry is usually sufficient.
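The queued variant's extra state can be sketched as pure bookkeeping, kept separate from the registry itself (all names and sizes below are illustrative): each name carries a small list of TIDs blocked in WhoIs, and a successful registration returns that list so the server can Reply to each waiter.

```c
#include <string.h>

#define WQ_MAX_NAMES   16
#define WQ_MAX_WAITERS  8
#define WQ_NAME_LEN    32

typedef struct {
    char name[WQ_NAME_LEN];
    int  tids[WQ_MAX_WAITERS];
    int  count;
} WaitQueue;

static WaitQueue wq[WQ_MAX_NAMES];
static int wq_size = 0;

/* Record `tid` as blocked in WhoIs on `name`. Returns 0, or -1 if full. */
static int wq_add_waiter(const char *name, int tid) {
    WaitQueue *q = NULL;
    for (int i = 0; i < wq_size; i++)
        if (strncmp(wq[i].name, name, WQ_NAME_LEN) == 0) { q = &wq[i]; break; }
    if (!q) {
        if (wq_size >= WQ_MAX_NAMES) return -1;
        q = &wq[wq_size++];
        strncpy(q->name, name, WQ_NAME_LEN - 1);
        q->name[WQ_NAME_LEN - 1] = '\0';
        q->count = 0;
    }
    if (q->count >= WQ_MAX_WAITERS) return -1;
    q->tids[q->count++] = tid;
    return 0;
}

/* On RegisterAs(name): copy out every waiter so the server can Reply
   to each. Returns the number of waiters drained. */
static int wq_drain(const char *name, int out[WQ_MAX_WAITERS]) {
    for (int i = 0; i < wq_size; i++) {
        if (strncmp(wq[i].name, name, WQ_NAME_LEN) == 0) {
            int n = wq[i].count;
            memcpy(out, wq[i].tids, (size_t)n * sizeof(int));
            wq[i].count = 0;
            return n;
        }
    }
    return 0;
}
```

On an NS_WHOIS miss the server would call wq_add_waiter(req.name, sender) and skip the Reply; on NS_REGISTER it replies to the registrant and then to every TID that wq_drain returns.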

The library-level wrappers RegisterAs() and WhoIs() package the message formatting and send/receive boilerplate so callers don’t see the raw protocol:

static int nameserver_tid = -1;  /* set by kernel or bootstrap */

int RegisterAs(const char *name) {
    NSRequest req = { .type = NS_REGISTER };
    strncpy(req.name, name, MAX_NAME_LEN);
    NSReply reply;
    Send(nameserver_tid, &req, sizeof(req), &reply, sizeof(reply));
    return reply.result;
}

int WhoIs(const char *name) {
    NSRequest req = { .type = NS_WHOIS };
    strncpy(req.name, name, MAX_NAME_LEN);
    NSReply reply;
    do {
        Send(nameserver_tid, &req, sizeof(req), &reply, sizeof(reply));
        if (reply.result < 0) Pass();  /* yield the processor before retrying */
    } while (reply.result < 0);
    return reply.result;
}

The retry loop in WhoIs is safe here only because the Name Server always replies immediately (it never holds senders). Note that the client cannot delay between retries via the Clock Server, since the Clock Server is itself located through WhoIs and may not exist yet; yielding between attempts is the only option. If the Name Server instead held senders and replied when the name registers, the client-side retry loop would not be needed at all.

The Name Server’s TID: A Bootstrapping Problem

Every server is found through the Name Server, but the Name Server itself cannot register with itself before it exists. This is the bootstrapping problem: how does the first client find the first server?

Three standard solutions exist in real-time and embedded systems:

Fixed TID convention: the kernel always assigns the Name Server a specific TID — say, TID 1 — because it is always the first user task created. All library code hard-codes nameserver_tid = 1. This works as long as the kernel is deterministic about TID assignment.

Kernel syscall: a dedicated syscall MyNameServerTid() returns the Name Server TID. The kernel stores this when it creates the Name Server task and makes it retrievable. This is cleaner than a magic constant and allows the kernel to change task scheduling without invalidating library code.

Out-of-band channel: the first user task (often a bootstrap task that creates everything else) creates the Name Server, receives its TID from Create(), and distributes the TID through a pre-arranged mechanism — perhaps a global variable set before any user task runs, or a parent-to-child TID passed via the Create return value.

The CS 452 convention uses the fixed TID approach with the Name Server assigned TID 1 (or in some implementations, a specific constant such as NS_TID = 0x0 or NS_TID = 3). What matters is that the convention is uniform across the kernel and all user libraries.

The Name Server in Context: QNX, L4, and Service Discovery

The CS 452 Name Server is philosophically related to service discovery mechanisms in production microkernel and distributed systems, though simpler in every dimension.

QNX Neutrino has a similar concept through its path-name space (procnto process manager). QNX services register a path in the filesystem namespace (e.g., /dev/ser1 for a serial port server), and clients open a file descriptor to that path to contact the server. The QNX mechanism is more general — it integrates with the POSIX file API and supports hierarchical namespaces — but the concept is identical: a well-known registry maps names to endpoints.

L4 microkernel systems (including Fiasco, seL4, and OKL4) handle naming through a capability system: each task holds capabilities (unforgeable tokens) that grant specific rights to communicate with specific tasks. There is no global name registry; instead, capabilities are distributed by a trusted authority at system initialization. This is more secure (a task cannot contact a server unless explicitly granted a capability) but more complex to bootstrap.

MINIX 3 (Andrew Tanenbaum’s microkernel used for teaching) has a simple name server analogous to CS 452’s. Its netserver, pm (process manager), and vfs (virtual file system server) are found by fixed task numbers, mirroring the fixed-TID approach.

In distributed systems, service discovery is a major field: Consul, Kubernetes DNS, AWS Service Discovery, and Zookeeper all solve the same problem at scale. They add replication, failure detection, health checking, and TTL-based expiry. The CS 452 Name Server trades all of those features for simplicity appropriate to a single-node, bounded-task-set system. The key insight from distributed systems still applies, however: a service registry is a single point of failure. In a production RTOS, a task-level watchdog should monitor the Name Server and restart it (restoring its registry from a checkpoint) if it fails.

Namespacing and Service Hierarchies

The flat namespace of the CS 452 Name Server is adequate for a small number of services, but as the system grows it can become unwieldy. Consider a system with four UART ports and two CAN controllers: naming conventions like "UART0-TX", "UART0-RX", "UART1-TX", "CAN0-TX", "CAN0-RX" work, but become fragile if names are misspelled or if two developers independently choose different naming conventions.

A common discipline is to centralize all names in a header file:

/* service_names.h */
#define SRV_CLOCK      "ClockServer"
#define SRV_UART0_TX   "UART0_TX"
#define SRV_UART0_RX   "UART0_RX"
#define SRV_CAN_TX     "CAN_TX"
#define SRV_CAN_RX     "CAN_RX"
#define SRV_TRACK      "TrackServer"
#define SRV_TRAIN(n)   ("Train" #n)  /* stringizes: SRV_TRAIN(24) yields "Train24"; literal arguments only */

Using symbolic constants instead of raw string literals eliminates typo-related bugs: a misspelled macro name fails to compile, whereas a misspelled string fails only at runtime. No compile-time check can verify that a service with a given name actually exists, but a runtime assertion in WhoIs (panicking if the lookup still fails after a timeout) catches the mismatch early during testing.

Hierarchical namespaces (e.g., Unix-style paths like /sensors/s88-0/channel-3) enable structured enumeration: a client can ask “give me all services under /sensors/” rather than knowing each name in advance. This is useful when the number of instances is not known at compile time. CS 452 systems are fully static (all trains, sensors, and servers are known at compile time), so a flat namespace is appropriate.
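For illustration, prefix enumeration over a flat registry is only a few lines. This is a hypothetical extension (the CS 452 Name Server does not provide it), with EnumEntry mirroring the NameEntry layout:

```c
#include <string.h>

#define ENUM_NAME_LEN 32

typedef struct {
    char name[ENUM_NAME_LEN];
    int  tid;
} EnumEntry;

/* Copy the TIDs of all entries whose name starts with `prefix` into out[].
   Returns the number of matches (at most max_out). */
static int enumerate_prefix(const EnumEntry *reg, int reg_size,
                            const char *prefix, int *out, int max_out) {
    size_t plen = strlen(prefix);
    int n = 0;
    for (int i = 0; i < reg_size && n < max_out; i++)
        if (strncmp(reg[i].name, prefix, plen) == 0)
            out[n++] = reg[i].tid;
    return n;
}
```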

Why the Name Server Matters for Timing Analysis

From a timing perspective, the Name Server is called only at initialization time, not in steady-state operation. A well-written train control system resolves all WhoIs calls during startup and caches the resulting TIDs in local variables. Steady-state control loops use only cached TIDs and never call WhoIs.

This matters because WhoIs adds an extra Send/Receive/Reply round-trip to every first call. If a control loop accidentally called WhoIs on every iteration (perhaps from not caching the result), it would add approximately 5 µs per call — tolerable in isolation but damaging if multiplied across dozens of inter-server calls per tick.

The discipline of caching TIDs is a microcosm of a broader principle in real-time systems: compute everything possible offline (at initialization) and avoid recomputation in the steady-state critical path. The same principle applies to route planning (compute paths at initialization and update incrementally rather than re-running Dijkstra every tick), to table lookups (precompute calibration curves at startup rather than recalculating on every speed command), and to message buffer allocation (allocate all buffers at startup in a fixed-size pool rather than calling a heap allocator in real time).
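The buffer-pool point can be made concrete. Below is a minimal free-list pool, a sketch rather than a prescribed design: all sizing happens at initialization, and alloc/free are O(1) with no heap calls on the steady-state path.

```c
#include <stddef.h>

#define POOL_BUFS    16
#define POOL_BUFSIZE 64

typedef struct {
    unsigned char data[POOL_BUFS][POOL_BUFSIZE];
    int next[POOL_BUFS];  /* free-list links: index of next free buffer */
    int free_head;        /* -1 when exhausted */
} BufPool;

static void pool_init(BufPool *p) {
    for (int i = 0; i < POOL_BUFS - 1; i++) p->next[i] = i + 1;
    p->next[POOL_BUFS - 1] = -1;
    p->free_head = 0;
}

/* O(1): pop the free-list head. Returns NULL when the pool is exhausted
   (a real-time system treats this as a design error, not a retry case). */
static void *pool_alloc(BufPool *p) {
    if (p->free_head < 0) return NULL;
    int i = p->free_head;
    p->free_head = p->next[i];
    return p->data[i];
}

/* O(1): push the buffer back onto the free list. */
static void pool_free(BufPool *p, void *buf) {
    int i = (int)(((unsigned char (*)[POOL_BUFSIZE])buf) - p->data);
    p->next[i] = p->free_head;
    p->free_head = i;
}
```

Because the pool never grows or shrinks, its worst-case behavior is known at design time, which is exactly the property a heap allocator cannot offer on the critical path.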

Analogies to Other Naming Systems

The Name Server implements the same concept as the Domain Name System (DNS): a distributed (here, centralized) registry that maps human-readable names to machine-usable identifiers. DNS maps hostnames to IP addresses; the Name Server maps service names to TIDs. The Name Server’s simplicity — a flat namespace, no TTLs, no replication — is appropriate for an embedded system with a known, static set of services.

In distributed systems, ZooKeeper and etcd serve a similar role: they provide a consistent, highly-available name registry on which other services depend at startup. The CS 452 Name Server trades availability (it’s a single point of failure — if it crashes, the system breaks) for simplicity (no quorum, no replication, no consensus protocol). The Raft and Paxos consensus algorithms that ZooKeeper uses to replicate its state across multiple nodes would be grotesquely over-engineered for a single-board microkernel system — but understanding why ZooKeeper is complex helps appreciate why the CS 452 Name Server can be so simple.


Chapter 12: Interrupts and the GIC-400

The timer, UART, and CAN controller all communicate with the CPU through interrupts. An interrupt is an asynchronous signal from a hardware device that causes the processor to suspend its current execution and invoke a handler. For a real-time kernel, interrupts are the mechanism by which the physical world injects events into the software — and their latency directly determines the system’s response time.

Why Interrupts?

Chapter 2 showed that polling is inadequate for systems with heterogeneous event rates. Interrupts invert the polling relationship: instead of the CPU repeatedly asking the device “do you have data?”, the device tells the CPU “I have data, handle me.” This allows the CPU to do useful work between device events, rather than burning cycles on status checks that usually return false.

The cost of interrupts is complexity. An interrupt may arrive at any moment in the execution of any instruction (with some exceptions — ARM guarantees certain atomicity properties for aligned single-register loads and stores). The handler must save and restore whatever state the interrupted code was using. If interrupts are nested (a higher-priority interrupt fires during the handling of a lower-priority one), the state management becomes more complex. The kernel controls this complexity by running with interrupts masked (DAIF bits set) during kernel execution and unmasking them only when returning to user tasks or when explicitly waiting for an event.

The GIC-400 Architecture

The BCM2711 integrates an ARM GIC-400, which implements the GICv2 specification. The GIC-400 has two components:

Distributor (GICD), at physical base 0xFF841000, manages the global pool of interrupt sources. It can enable or disable individual interrupts, assign priorities, and route interrupts to specific CPU cores. Key registers:

Register             Offset       Function
GICD_CTLR            0x000        Enable the distributor
GICD_ISENABLER[n]    0x100 + 4n   Set enable for interrupts 32n to 32n+31
GICD_IPRIORITYR[n]   0x400 + n    Priority of interrupt n (8-bit, lower = higher priority)
GICD_ITARGETSR[n]    0x800 + n    Target CPU for interrupt n (bit 0 = core 0)
GICD_ICFGR[n]        0xC00 + 4n   Level/edge configuration for interrupts

CPU Interface (GICC), at physical base 0xFF842000, is the per-core interface through which a CPU core receives interrupts. Key registers:

Register    Offset   Function
GICC_CTLR   0x000    Enable the CPU interface
GICC_PMR    0x004    Priority mask (0xFF = accept all priorities)
GICC_IAR    0x00C    Interrupt acknowledge (read to get interrupt ID and mark active)
GICC_EOIR   0x010    End of interrupt (write ID to signal completion)

GIC initialization sequence:

void gic_init(void) {
    // Distributor: enable
    mmio_write(GICD_BASE + GICD_CTLR, 1);

    // CPU Interface: enable, accept all priorities
    mmio_write(GICC_BASE + GICC_CTLR, 1);
    mmio_write(GICC_BASE + GICC_PMR,  0xFF);
}

void gic_enable(int irq) {
    // Route to core 0
    mmio_writeb(GICD_BASE + GICD_ITARGETSR + irq, 1);
    // Set priority (128 = middle)
    mmio_writeb(GICD_BASE + GICD_IPRIORITYR + irq, 128);
    // Enable
    mmio_write(GICD_BASE + GICD_ISENABLER + (irq / 32) * 4,
               1u << (irq % 32));
}

Interrupt Types and IDs

The GIC handles three types of interrupts:

SGIs (Software-Generated Interrupts, IDs 0–15): triggered by writing to GICD_SGIR. Used for inter-core communication (not relevant for single-core use).

PPIs (Private Peripheral Interrupts, IDs 16–31): core-private interrupts, such as the CPU-local timer. These are typically used by hypervisors and do not appear in the CS 452 interrupt set.

SPIs (Shared Peripheral Interrupts, IDs 32+): device interrupts. All the peripherals used in CS 452 are SPIs.

Key interrupt IDs on the BCM2711/RPi 4:

Peripheral                 GIC ID   Notes
System Timer C1            97       Use for clock ticks (C0, C2 used by GPU)
System Timer C3            99       Alternative tick source
UART0 (PL011)              153      All UART0 interrupts share this ID
GPIO bank 0 (pins 0–31)    145      Includes GPIO 17 (MCP2515 INT)

Interrupt Handling Sequence

When an IRQ fires and the CPU is in EL0 (user task running), the processor:

  1. Saves PC to ELR_EL1, PSTATE to SPSR_EL1.
  2. Switches to SP_EL1.
  3. Clears DAIF (masking further interrupts).
  4. Jumps to exception_vector_table + 0x480 (IRQ from lower EL, AArch64).

The IRQ handler must:

  1. Save user task context (identical to the syscall entry path).
  2. Read GICC_IAR to acknowledge the interrupt and obtain the interrupt ID.
  3. Clear the interrupt source (e.g., write to system timer CS register, or read from UART data register to drain the FIFO trigger condition).
  4. Write the interrupt ID to GICC_EOIR to signal completion.
  5. Find any task in Event-Blocked state waiting for this interrupt ID, unblock it, and insert it into the ready queue.
  6. Call the scheduler to select the next task.
  7. Restore context and eret.

In code, steps 2 through 5 look like this (the context save/restore and the scheduler call live in the surrounding entry/exit path):

void handle_irq(uint64_t *frame) {
    uint32_t iar = mmio_read(GICC_BASE + GICC_IAR);
    uint32_t irq_id = iar & 0x3FF;   // bits [9:0]

    if (irq_id == 1023) return;       // spurious interrupt

    // clear the source (device-specific)
    clear_interrupt_source(irq_id);

    // write EOIR
    mmio_write(GICC_BASE + GICC_EOIR, iar);

    // unblock any task awaiting this event
    unblock_awaiting_task(irq_id);
}

IRQ vs FIQ

ARMv8-A distinguishes between two interrupt types: IRQ (normal interrupt request) and FIQ (fast interrupt request). Historically, FIQ was faster because it had a larger banked register set, avoiding save/restore overhead. In GICv2, FIQ is used for secure interrupts (EL3 domain) and IRQ for non-secure interrupts (EL1 domain). In the CS 452 kernel, which runs entirely in non-secure EL1, all device interrupts arrive as IRQs. FIQ handling (exception_vector_table + 0x500) should branch to an error handler; receiving an FIQ in EL1 non-secure mode indicates a misconfigured GIC.

Edge-Triggered vs Level-Triggered Interrupts

Every interrupt source connected to the GIC-400 is configured as either edge-triggered or level-triggered, and the distinction matters enormously for correct interrupt handling. Getting this wrong produces one of two failure modes: missed interrupts (if you treat a level-triggered source as edge-triggered) or infinite interrupt storms (if you treat an edge-triggered source as level-triggered). The GIC-400 implements this configuration in the GICD_ICFGR (Interrupt Configuration Register) array, at distributor offset 0xC00 + (4 × n), where n is the register index. Each SPI uses two bits: the upper bit set to 1 means edge-triggered, 0 means level-triggered. SGIs are always edge-triggered; PPIs have fixed configuration.
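The ICFGR bit layout is easy to get wrong, so it is worth writing down. The hypothetical helper below computes the register index and upper-bit position for interrupt n, applied to a shadow array rather than the live MMIO registers so the arithmetic can be checked in isolation:

```c
#include <stdint.h>
#include <stdbool.h>

/* GICD_ICFGR[irq/16], 2 bits per interrupt; the upper bit of each
   pair selects edge (1) vs level (0) triggering. */
static void icfgr_set_trigger(uint32_t icfgr[], int irq, bool edge) {
    int reg   = irq / 16;            /* 16 interrupts per 32-bit register */
    int shift = (irq % 16) * 2 + 1;  /* upper bit of this interrupt's pair */
    if (edge)
        icfgr[reg] |=  (1u << shift);
    else
        icfgr[reg] &= ~(1u << shift);
}
```

In the kernel, the computed word would be read-modify-written to GICD_BASE + 0xC00 + 4*reg.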

Level-triggered interrupts fire as long as the interrupt signal is asserted (held low or high, depending on polarity). The GIC will continuously forward the interrupt to the CPU interface until the signal is deasserted. This means: if you acknowledge the interrupt at the GIC (by writing GICC_EOIR) before the source device has cleared the signal, the GIC immediately raises the interrupt again. The correct clearing sequence is therefore source-first, GIC-second: clear the condition at the device register, then write GICC_EOIR.

Edge-triggered interrupts fire once on a signal transition (rising edge, falling edge, or both). The GIC latches the edge event internally and forwards exactly one interrupt per edge. This means the GIC will not re-fire even if the source device’s signal remains asserted — but conversely, edges that arrive while the CPU has DAIF masked will only latch once; a rapid sequence of pulses during a masked window delivers only one interrupt notification. The correct clearing sequence is GIC-first, source-second: write GICC_EOIR to clear the GIC’s latched state, then clear the source register. Writing GICC_EOIR after clearing the source is also fine for edge-triggered sources, but writing it before clearing a level-triggered source is fatal.

Three concrete examples from the CS 452 system illustrate the stakes:

System timer (BCM2711 System Timer): the timer asserts its interrupt when the free-running counter matches the compare value for channel 1 (C1). The channel's bit in the CS status register remains set until explicitly cleared by writing a 1 to it:

void system_timer_clear(int channel) {
    /* Writing 1 to the CS bit clears the match — level signal drops */
    mmio_write(SYSTEM_TIMER_BASE + TIMER_CS, (1u << channel));
}

void handle_system_timer_irq(void) {
    /* Level-triggered: clear source BEFORE acknowledging GIC */
    system_timer_clear(1);
    /* Now safe to call GICC_EOIR — signal is deasserted */
    mmio_write(GICC_BASE + GICC_EOIR, TIMER_IRQ_ID);
    unblock_awaiting_task(TIMER_IRQ_ID);
}

If you reversed the order — writing GICC_EOIR first, then clearing the CS bit — the GIC would see the level still asserted and immediately re-raise the interrupt before handle_system_timer_irq returns. The result is an interrupt storm consuming 100% of CPU time.

MCP2515 CAN interrupt (GPIO 17, INT pin): The MCP2515’s INT pin is active-low and level-sensitive: it remains asserted until the interrupt flag in the MCP2515 CANINTF register is cleared over SPI. However, GPIO pins connected to the GIC are typically configured as edge-sensitive via the BCM2711’s GPIO event detect registers. Specifically, GPFEN0 (falling-edge detect) converts the level assertion into a single falling-edge event, which is then forwarded to the GIC as an edge-triggered interrupt.

This conversion has an important implication: if the MCP2515 asserts a second interrupt flag before the first is cleared, no second edge is generated (the signal never went high and came back down). The Notifier for the CAN server must therefore drain all pending interrupt flags on each wake-up, not just the first one. The pattern is:

void can_notifier_task(void) {
    for (;;) {
        AwaitEvent(CAN_GPIO_IRQ);

        /* Clear the GPIO falling-edge latch FIRST (write 1 to the GPEDS0
           bit), so that an edge arriving while we drain re-latches and
           produces a fresh interrupt instead of being lost. */
        mmio_write(GPIO_BASE + GPEDS0, (1u << 17));

        /* Drain all MCP2515 interrupt flags — there may be several */
        uint8_t canintf = mcp2515_read_register(MCP2515_CANINTF);
        while (canintf != 0) {
            mcp2515_bit_modify(MCP2515_CANINTF, canintf, 0x00); /* clear these flags */
            process_can_interrupts(canintf);
            canintf = mcp2515_read_register(MCP2515_CANINTF);   /* check for new */
        }
    }
}

UART TX FIFO interrupt (PL011 UART0): The TX interrupt fires when the TX FIFO level falls below the programmed threshold (by default, the half-empty threshold, configurable in UARTIFLS). This is a level-triggered condition: the interrupt remains asserted as long as the FIFO is below threshold. The standard handling pattern is to disable the TX interrupt at the source inside the handler, then fill the FIFO:

void handle_uart_tx_irq(void) {
    /* 1. Disable TX interrupt immediately — prevents re-firing while filling */
    uint32_t imsc = mmio_read(UART0_BASE + UART_IMSC);
    mmio_write(UART0_BASE + UART_IMSC, imsc & ~UART_IMSC_TXIM);

    /* 2. Acknowledge at GIC */
    mmio_write(GICC_BASE + GICC_EOIR, UART0_IRQ_ID);

    /* 3. Unblock the TX notifier — it will fill the FIFO and re-enable */
    unblock_awaiting_task(UART0_IRQ_ID);
}

Disabling the interrupt source before acknowledging the GIC breaks the level-trigger storm: the signal is deasserted before GICC_EOIR is written, so the GIC sees no active level when the next check occurs. The TX notifier, once unblocked, fills the FIFO and re-enables the interrupt via UART_IMSC — at which point the FIFO may already be below threshold again, re-asserting the interrupt immediately, or may be full enough to remain quiet until another burst of output.

Choosing between edge and level triggering involves a tradeoff. Level-triggered interrupts are more resilient: if an interrupt fires while DAIF is masked, the level remains asserted when masking lifts, so the interrupt is not missed. Edge-triggered interrupts can be missed if masking is too long; the BCM2711’s GPIO falling-edge detect provides a latch that partially mitigates this but only stores one event. For the system timer — where missing a tick means the clock server delivers incorrect delay accounting — level-triggered behavior is the right default. For the MCP2515 — where the GPIO edge-detect provides a convenient single-interrupt point for multiple flag events — edge detection at the GPIO boundary is acceptable provided the handler drains all flags.

The GICD_ICFGR configuration for the CS 452 kernel leaves the system timer IRQ level-triggered and treats the GPIO CAN interrupt as edge-triggered (the edge conversion is configured via the GPIO event-detect registers, not in GICD_ICFGR itself). UART interrupts are level-triggered, which is why the enable/disable pattern in the TX handler is mandatory rather than optional.

GIC Priority and Preemption

The GIC-400 supports 256 priority levels (0 is highest, 255 is lowest). The CPU Interface’s Priority Mask Register (GICC_PMR) filters which interrupts are forwarded to the CPU: only interrupts with priority numerically lower than GICC_PMR are forwarded. Setting GICC_PMR to 255 (as the kernel does during initialization) allows all priorities.

Nested interrupts: GICv2 supports interrupt preemption, in which a higher-priority interrupt preempts the handling of a lower-priority one. After the handler reads GICC_IAR, the GIC's running priority is raised to the acknowledged interrupt's priority, so re-enabling interrupts in software (via msr daifclr, #2) admits only strictly higher-priority interrupts; GICC_EOIR is written when handling completes, dropping the running priority again. The CS 452 kernel does not implement nested interrupt handling (it runs with DAIF masked throughout kernel execution), but understanding the mechanism is useful for more complex designs.

The GIC’s Interrupt Preemption is controlled by GICC_APR (Active Priorities Register), which tracks which priority levels are currently active. This allows the GIC to correctly implement nested preemption: after writing GICC_EOIR for a nested interrupt, the GIC checks GICC_APR to determine the previous priority and allows the appropriate lower-priority interrupt to resume.

Interrupt Latency Budget

In the CS 452 kernel, the interrupt latency (time from interrupt signal to first instruction of the interrupt handler) has three components:

  1. Hardware detection latency: the GIC detects the SPI signal and forwards it to the CPU interface. This is fixed in hardware, approximately 3–10 cycles.

  2. Pipeline drain: if an instruction is in-flight when the interrupt is detected, the processor must complete (or abort) it before taking the exception. ARMv8-A guarantees that interrupts are taken on instruction boundaries. On the Cortex-A72, out-of-order execution means some instructions may already be speculated; these are rolled back. Maximum pipeline drain latency is approximately 20 cycles.

  3. DAIF masking delay: if the kernel is currently executing with DAIF masked (DAIF.I = 1), the interrupt is held pending until the kernel unmasks. The kernel unmasks interrupts only when returning to EL0 via eret. Therefore, the worst-case interrupt latency from the kernel’s perspective is bounded by the longest kernel operation — the longest system call.

Measuring the longest system call: for a kernel with Send/Receive/Reply, the most expensive primitive is typically Send() when it causes a context switch. A sketch of one measurement approach:

void measure_irq_latency(void) {
    // Arm timer channel 1 to fire 100 µs from now
    uint32_t t0 = timer_read_lo();
    timer_set(1, 100);
    // Enable IRQ (unmask DAIF)
    asm volatile ("msr daifclr, #2");
    // Timer fires; record time in ISR
    // Latency = time_in_ISR - t0 - 100
}

In practice, with a simple kernel, interrupt latency is bounded at 1–5 µs, dominated by the context save path (saving 32 registers takes ~16 cycles = ~11 ns at 1.5 GHz, but cache misses can inflate this to several hundred nanoseconds if the kernel code is cache-cold).

The Hybrid Polling/Interrupt Strategy

For high-frequency events (such as CAN receive at 250 kbit/s), the interrupt overhead per event is non-trivial. At 250 kbit/s, a full frame carrying 8 data bytes occupies roughly 110–130 bits on the wire, so the maximum message rate is approximately 2,000 messages/second. At 3 µs of interrupt handling per message, interrupt processing alone consumes about 6 ms of CPU time per second, or 0.6% overhead. For a real train system with much lower message rates (hundreds per second), this is negligible.

But if the message rate were to spike (due to a malfunctioning Märklin device flooding the bus), interrupt processing could overwhelm the CPU. The W26 lecture mentions a hybrid strategy: initially poll for the event, and only enable interrupts if polling fails to find data within a short window. Pseudocode:

int receive_can_byte(void) {
    // Try polling first (for high-rate scenarios)
    for (int i = 0; i < POLL_ITERATIONS; i++) {
        if (mcp2515_rx_available()) return mcp2515_read_byte();
    }
    // Not available immediately — switch to interrupt-driven wait
    enable_can_interrupt();
    int result = AwaitEvent(EVENT_CAN_RX);
    disable_can_interrupt();
    return result;
}

This hybrid approach is also how the Linux kernel handles network I/O at high packet rates — the NAPI (New API) framework switches between interrupt-driven and polling modes based on traffic load.

SError: Asynchronous Aborts and Hardware Fault Diagnosis

ARMv8-A defines four exception types: synchronous (triggered by an instruction — SVC, data abort, undefined instruction), IRQ (normal interrupt), FIQ (fast interrupt, used for secure-world in GICv2), and SError (System Error). The first three have been covered at length. SError is the exception that bare-metal programmers encounter only during debugging — but when they do, it is almost always because something has gone seriously wrong.

SError (also called an asynchronous abort in the ARM manual) is generated by the hardware for error conditions detected asynchronously — that is, detected sometime after the instruction that caused them has completed. The canonical example is a write buffer error: when the CPU issues a write to a device register, the write is accepted into a store buffer and the instruction completes. Later, when the memory system attempts to commit the write to the device, an error is detected (perhaps because the physical address does not map to a real peripheral). The processor has already moved past the write instruction; the error is delivered asynchronously as an SError.

The distinction between synchronous and asynchronous exceptions is fundamental to how the processor reports errors. A synchronous exception is caused by a specific instruction that can be identified precisely — the exception is taken immediately, and ELR_EL1 points exactly at (or just after) the faulting instruction. An asynchronous exception is decoupled from any specific instruction: the hardware detects an error condition that was set in motion by an earlier operation, possibly several instructions ago or even from a different exception level. This decoupling is what makes SError hard to debug: the instruction address in ELR_EL1 at the time of the SError does not reliably identify the root cause.

Other SError sources on the BCM2711:

  • Memory ECC errors (single-bit corrected, multi-bit uncorrectable) on SDRAM (if ECC is configured — the RPi 4’s standard SDRAM is not ECC).
  • Bus fault on peripheral access: writing to an invalid peripheral register address (one that falls within the peripheral address range but does not correspond to any real register on the BCM2711) may generate a bus error that propagates as SError.
  • GIC SError: the GIC-400 can generate SError for internal consistency failures (extremely rare in normal operation).

The exception vector table’s SError entry is at offset 0x380 (from the current exception level, using SP_ELx) and 0x580 (from a lower exception level using AArch64):

/* SError handler in the vector table (EL0 → EL1) */
.balign 0x80
serror_handler_el0:
    stp     x29, x30, [sp, #-16]!
    stp     x0,  x1,  [sp, #-16]!
    mrs     x0, ESR_EL1             /* exception syndrome register */
    mrs     x1, FAR_EL1             /* fault address (may be meaningful for aborts) */
    bl      handle_serror
    /* SError is typically fatal; handle_serror calls kernel_panic */

The ESR_EL1.EC field for SError has the value 0b101111 (0x2F). The ISS field — including its DFSC (Data Fault Status Code) subfield — encodes additional information about the fault, but for most SError conditions on the BCM2711 the syndrome information is sparse — the hardware reports that an error occurred but not precisely which instruction triggered it, since the abort is asynchronous.

The FAR_EL1 (Fault Address Register) holds the virtual address associated with the faulting access, but for write buffer errors the address may reflect the buffered write’s target, which may be several instructions behind the current PC. Do not expect FAR_EL1 to reliably identify the offending access in an SError handler.

A practical SError handler for the CS 452 kernel:

void handle_serror(uint64_t esr, uint64_t far) {
    uint32_t ec  = (esr >> 26) & 0x3F;   /* Exception Class */
    uint32_t iss = esr & 0x1FFFFFF;       /* Instruction-Specific Syndrome */

    kprintf("SError: EC=0x%02x ISS=0x%07x FAR=0x%016llx PC=0x%016llx\n",
            ec, iss, far, read_pc_before_exception());

    /* Print the backtrace — may be unreliable for async aborts */
    print_backtrace();

    /* Halt */
    kprintf("SError is fatal. System halted.\n");
    for (;;) asm volatile("wfe");
}

The read_pc_before_exception() function reads the ELR_EL1 register, which holds the address the processor was executing when the SError was taken (or the address to which execution would return). For asynchronous exceptions, ELR_EL1 reflects the architectural state at the point of synchronization — the address of the instruction the processor was about to execute when the exception was taken, which may be substantially after the instruction that caused the error.

How to reach an SError on the BCM2711 in practice:

  1. Write to an invalid peripheral address. The BCM2711 peripheral base address is 0xFE000000 (mapped to 0x7E000000 in the VideoCore bus address space). Writing to an address inside this window that has no register behind it may or may not produce a bus error, depending on how the SoC’s address decoder is implemented. On some BCM2711 revisions, invalid peripheral addresses return undefined data on reads and are silently ignored on writes. On others, they produce an SError. Discovering which behavior your board has requires testing.

  2. Misconfigure the MMU. Inconsistent page table attributes can generate SError — for example, mapping SDRAM as Device-nGnRnE memory forbids speculative access and gathering, and certain access sequences to such a mapping may be reported as SError. More commonly, using an invalid MAIR_EL1 index in a page table entry descriptor produces unpredictable behavior.

  3. Stage 2 fault from EL2. If the kernel runs at EL1 with a hypervisor at EL2 (not the case in CS 452, which runs EL2 as pass-through), a Stage 2 translation fault in the hypervisor’s mapping generates a virtual SError to EL1. The virtual SError has ESR_EL1.EC = 0b101111 with the VSESR_EL2 register providing additional detail.

Masking and unmasking SError: the DAIF register’s A bit (PSTATE bit 8; bit 2 of the daifset/daifclr immediate) masks SError. The CS 452 kernel keeps SError unmasked while user tasks run at EL0 and masked during kernel execution at EL1. An SError that occurs while a user task is running is therefore taken to EL1 immediately; the kernel can choose to deliver it as a signal (in a POSIX-style system) or to terminate the faulting task. Since CS 452 has no signal delivery mechanism, the standard response is task termination with a diagnostic print.

The practical upshot for CS 452 students: if your kernel prints an SError exception without a clear cause, check your MMIO addresses against the BCM2711 datasheet. A common mistake is computing an incorrect offset (off-by-one in a register array, or using the wrong base address) that maps to an invalid peripheral address. The SError is the hardware’s way of reporting that the address does not exist.


Chapter 13: AwaitEvent, the Clock Server, and the Idle Task

With interrupts arriving at the hardware level and the GIC delivering them to the kernel, the question becomes: how do user-space tasks wait for interrupts? The answer is AwaitEvent — a primitive that allows a task to block until a specific hardware event fires, and that bridges the gap between kernel-level interrupt handling and user-space reactive programming.

AwaitEvent

AwaitEvent(int eventid) blocks the calling task until the hardware event identified by eventid fires. It returns a non-negative event-specific value (which for a timer event is the current tick count, for a UART event might be the received byte) or -1 if the event ID is invalid.

From the kernel’s perspective, AwaitEvent simply transitions the calling task from Ready to Event-Blocked and records the (eventid, task) mapping in the event registry. When the corresponding interrupt fires (Chapter 12), the kernel looks up the registry, finds the blocked task, copies the interrupt-specific data to the task’s return value, and transitions the task back to Ready.

The event registry must handle the case where an interrupt fires before any task has called AwaitEvent for it. This is the interrupt-before-waiter scenario and it typically occurs during initialization. The kernel can handle it by buffering the event data until a waiter arrives, or by simply losing the event. For a clock tick, losing one tick is recoverable (the next tick will fire 10 ms later); for a UART receive byte, losing the byte is a protocol error. Well-designed servers using AwaitEvent avoid this race by calling AwaitEvent before doing any work that might enable the interrupt.
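
A buffering registry can be sketched as one slot per event ID. This is a sketch only — the names (EventSlot, NO_TASK) and the single-entry pending buffer are illustrative choices, not the kernel’s required types:

```c
#include <assert.h>

#define NUM_EVENTS 8
#define NO_TASK   (-1)

/* Per-event slot: at most one waiter, at most one buffered firing. */
typedef struct {
    int waiter_tid;    /* task blocked in AwaitEvent, or NO_TASK */
    int pending;       /* 1 if the event fired with no waiter present */
    int pending_data;  /* data from that early firing */
} EventSlot;

static EventSlot events[NUM_EVENTS];

void event_init(void) {
    for (int i = 0; i < NUM_EVENTS; i++) {
        events[i].waiter_tid = NO_TASK;
        events[i].pending = 0;
    }
}

/* Called from AwaitEvent. Returns buffered data immediately, or -1
 * meaning "block the caller until event_fire() wakes it". */
int event_wait(int eventid, int tid) {
    EventSlot *e = &events[eventid];
    if (e->pending) {            /* interrupt-before-waiter: consume buffer */
        e->pending = 0;
        return e->pending_data;
    }
    e->waiter_tid = tid;         /* normal case: park the task */
    return -1;
}

/* Called from the interrupt handler. Returns the tid to unblock,
 * or NO_TASK if the firing was buffered instead. */
int event_fire(int eventid, int data) {
    EventSlot *e = &events[eventid];
    if (e->waiter_tid != NO_TASK) {
        int tid = e->waiter_tid;
        e->waiter_tid = NO_TASK;
        (void)data;              /* kernel copies data to tid's return value */
        return tid;
    }
    e->pending = 1;              /* no waiter yet: buffer one firing */
    e->pending_data = data;
    return NO_TASK;
}
```

A second firing before the buffered one is consumed would overwrite it — acceptable for a clock tick, a protocol error for a UART byte, exactly as discussed above.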

The Notifier Pattern

AwaitEvent is powerful but has a constraint: a task blocked in AwaitEvent cannot simultaneously serve incoming requests from other tasks. This is the fundamental tension in interrupt-driven server design — an interrupt notifier cannot also be a service responder.

The Notifier pattern resolves this by splitting the interrupt-handling task into two:

  1. A Notifier task (highest priority in the group): calls AwaitEvent in a tight loop, wakes up when the interrupt fires, sends a notification to the server, then calls AwaitEvent again.
  2. A Server task (lower priority): calls Receive in a loop, processes incoming requests (from both the notifier and from clients), and never calls AwaitEvent.

// Notifier (very high priority)
void clock_notifier(void) {
    int server_tid = WhoIs("ClockServer");
    Message msg = { .type = TICK };
    for (;;) {
        AwaitEvent(EVENT_TIMER_C1);   // returns the current tick count
        Send(server_tid, &msg, sizeof(msg), NULL, 0);
    }
}

// Clock Server (high priority, lower than notifier)
void clock_server(void) {
    RegisterAs("ClockServer");
    int current_ticks = 0;
    DelayQueue dq;
    dq_init(&dq);

    for (;;) {
        int sender_tid;
        Message msg;
        Receive(&sender_tid, &msg, sizeof(msg));

        if (msg.type == TICK) {
            current_ticks++;
            dq_unblock_expired(&dq, current_ticks);
            Reply(sender_tid, NULL, 0);
        } else if (msg.type == TIME) {
            Reply(sender_tid, &current_ticks, sizeof(current_ticks));
        } else if (msg.type == DELAY) {
            dq_insert(&dq, sender_tid, current_ticks + msg.ticks);
            // Do NOT reply yet — sender stays blocked until deadline
        } else if (msg.type == DELAYUNTIL) {
            dq_insert(&dq, sender_tid, msg.absolute_tick);
            // Again, no reply until the deadline tick arrives
        }
    }
}

The notifier runs at higher priority than the clock server because it must process the interrupt quickly (before the next tick fires). The clock server runs at higher priority than any task that delays on it, because it must be available to process the notifier’s send promptly.

The Clock Server in Detail

The clock server maintains a delay queue: a sorted list of (deadline, task_tid) pairs representing tasks blocked in Delay or DelayUntil. When the notifier sends a tick, the server:

  1. Increments its tick counter.
  2. Scans the delay queue for tasks whose deadline has passed (deadline ≤ current tick).
  3. Replies to each expired task (unblocking them).
  4. Replies to the notifier (allowing it to call AwaitEvent again).

The order of operations matters. The notifier must be replied to before the clock server’s next Receive — otherwise the notifier’s Send will be queued and the notifier blocks, unable to call AwaitEvent for the next tick. Since the notifier is at higher priority than the clock server, replying to the notifier will immediately schedule the notifier, which will call AwaitEvent, which will block again — leaving the server to run next. This is the correct flow.

Efficiency note: a linear scan of the delay queue is O(n) in the number of blocked tasks. For small task counts (≤ 64), this is negligible. A heap would give O(log n) insertion and O(1) minimum-key access, but the constant factor likely exceeds the linear scan for n < 50.

The Idle Task and CPU Measurement

When no task is Ready, the scheduler has no task to run. Rather than leaving the CPU in an undefined state, the kernel creates an idle task at the lowest possible priority (priority 0). The idle task never exits and never blocks:

void idle_task(void) {
    for (;;) {
        __asm__ volatile ("wfi");  // Wait For Interrupt
    }
}

wfi (Wait For Interrupt) is an ARM instruction that halts the CPU in a low-power state until an interrupt fires. When an interrupt arrives, the processor wakes, handles the interrupt (switching to the IRQ handler via the exception vector), and then schedules whatever task was unblocked by the interrupt. The idle task’s tick counter — how many 10 ms ticks it has received as the “active” task — is a direct measure of system idleness. A system where the idle task receives 95% of ticks has 5% CPU utilization; a well-functioning train control system should keep idle time above 90%.
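
Converting the idle tick count into a utilization figure is a small piece of integer arithmetic (kernels typically avoid floating point). A sketch with illustrative names — the course kernel does not prescribe this API:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t total_ticks;
    uint32_t idle_ticks;
} IdleStats;

/* Called once per clock tick; was_idle is 1 if the idle task was the
 * task interrupted by this tick. */
void idle_account(IdleStats *s, int was_idle) {
    s->total_ticks++;
    if (was_idle) s->idle_ticks++;
}

/* CPU utilization in tenths of a percent (e.g. 53 means 5.3%),
 * computed entirely in integer arithmetic. */
uint32_t cpu_util_permille(const IdleStats *s) {
    if (s->total_ticks == 0) return 0;
    return 1000u - (1000u * s->idle_ticks) / s->total_ticks;
}
```

The per-mille resolution is deliberate: at 100 ticks/second, a whole-percent figure would quantize away the small utilization changes that matter when idle time should stay above 90%.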

Measuring System Performance

The idle task’s tick counter provides the simplest performance metric, but finer measurements are possible. The system timer’s 1 MHz free-running counter allows microsecond-precision timing of any operation. A standard measurement idiom:

uint32_t start = mmio_read(SYSTEM_TIMER_CLO);
for (int i = 0; i < N; i++) {
    operation_to_measure();
}
uint32_t end = mmio_read(SYSTEM_TIMER_CLO);
uint32_t avg_us = (end - start) / N;

Running the operation N times amortizes the timer read overhead and averages out cache warming effects. The first few iterations may be slower due to instruction cache cold misses; for WCET analysis, you want the steady-state time (with a warm cache), not the cold-start time.
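
Separating cold-start from steady-state behavior requires keeping more than the average. A minimal statistics tracker (a sketch, not a course-provided utility) records min, max, and mean per iteration: the cold-start outlier shows up as the max, while the warm-cache steady state is approximated by the min:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t count, min, max;
    uint64_t sum;
} TimeStats;

void stats_init(TimeStats *s) {
    s->count = 0; s->sum = 0;
    s->min = UINT32_MAX; s->max = 0;
}

/* Record one measured duration in microseconds. */
void stats_record(TimeStats *s, uint32_t us) {
    s->count++;
    s->sum += us;
    if (us < s->min) s->min = us;
    if (us > s->max) s->max = us;
}

uint32_t stats_avg(const TimeStats *s) {
    return s->count ? (uint32_t)(s->sum / s->count) : 0;
}
```

For WCET purposes the max is the interesting number; for throughput estimates the average suffices.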

The Real-Time Control Loop

The clock server enables a precise control loop — the fundamental structure of real-time control:

void control_loop_task(void) {
    int clock_tid = WhoIs("ClockServer");
    int next_tick = Time(clock_tid) + CONTROL_PERIOD;

    for (;;) {
        // Read current state (via SRR to sensor server)
        TrainState state = read_train_state();

        // Compute control action
        int16_t speed_command = compute_speed(state);

        // Apply control action (via SRR to CAN TX server)
        send_speed_command(speed_command);

        // Wait for the next control period
        DelayUntil(clock_tid, next_tick);
        next_tick += CONTROL_PERIOD;
    }
}

The DelayUntil(clock_tid, next_tick) is crucial: by using an absolute deadline rather than a relative delay (Delay(clock_tid, CONTROL_PERIOD)), the loop compensates for variable computation time. If one iteration takes slightly longer than CONTROL_PERIOD ticks, the next iteration waits a correspondingly shorter time, keeping the loop on the intended schedule. Using a relative delay would cause the period to drift — each iteration adds the actual computation time to the nominal period, slowly increasing the effective period over time.

This pattern — update a target time at each iteration, sleep until the target — is the software equivalent of the cyclic executive’s frame boundary check, but implemented within the task model. The clock server’s DelayUntil primitive makes it clean.
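
The drift argument can be checked with a few lines of arithmetic. Assuming a 10-tick period and a computation that takes 1 tick per iteration, the relative-delay loop falls 1 tick further behind per iteration while the absolute-deadline schedule stays fixed (a self-contained simulation, not kernel code):

```c
#include <assert.h>

#define PERIOD 10

/* Wake-up tick of iteration n under absolute deadlines: the schedule
 * never drifts, as long as each computation fits within the period. */
int wake_absolute(int n) {
    return n * PERIOD;
}

/* Wake-up tick under relative delays, where each iteration computes
 * for `work` ticks and then calls Delay(PERIOD): the computation time
 * is added on top of the nominal period every single iteration. */
int wake_relative(int n, int work) {
    int t = 0;
    for (int i = 0; i < n; i++)
        t += work + PERIOD;   /* compute, then delay a full period */
    return t;
}
```

After 100 iterations with 1 tick of work, the relative-delay loop has drifted a full 100 ticks (one second at 10 ms/tick) behind the intended schedule.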

Timing Precision and Tick Granularity

The standard clock tick period in CS 452 is 10 ms. This means Delay(clock_tid, 1) waits approximately 10 ms, and timing resolution is 10 ms. For train control, 10 ms is coarser than might be ideal — a train at 500 mm/s travels 5 mm in 10 ms, which is the typical precision needed for stopping. An alternative configuration uses 1 ms ticks, giving 10× better timing resolution at the cost of 10× more clock server activity.

The system timer’s 1 MHz free-running counter provides 1 µs timing resolution independent of the tick period. For events that require sub-10ms timing (such as precise sensor event timestamps for velocity calibration), the system timer should be read directly using mmio_read(SYSTEM_TIMER_CLO) rather than using the clock server’s tick count. The clock server’s tick count is a coarse-grained reference; the system timer is the fine-grained reference.

Combining coarse and fine time: a common pattern is to use the clock server for scheduling periodic tasks (with 10 ms granularity) and the system timer for timestamping events (with 1 µs granularity). The velocity measurement in Chapter 17 uses the system timer to timestamp sensor events and computes inter-sensor intervals in microseconds; the clock server is used only to schedule the calibration procedure at regular intervals.
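
One subtlety of microsecond timestamps: SYSTEM_TIMER_CLO is a 32-bit counter that wraps roughly every 71.6 minutes. Plain unsigned subtraction remains correct across a single wrap, which is why timestamp deltas should be computed in uint32_t rather than widened first (a sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Interval between two 32-bit 1 MHz timestamps. Modulo-2^32 unsigned
 * subtraction gives the correct delta even when the counter wrapped
 * once between the samples (intervals must be shorter than the
 * ~71.6-minute wrap period). */
uint32_t interval_us(uint32_t start, uint32_t end) {
    return end - start;
}
```

Converting either timestamp to a wider type before subtracting breaks this property: the wrap then produces a huge negative (or huge positive) difference instead of the true interval.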

The Delay Queue Data Structure

The clock server’s delay queue must efficiently support two operations:

  • Insert a task with an absolute deadline.
  • Extract all tasks whose deadline ≤ current tick.

A sorted linked list provides O(1) extraction of expired tasks (just scan from the head until a non-expired task is found) and O(n) insertion (scan to find the correct position). For n ≤ 64 tasks, the linear scan is fast enough in practice.

A min-heap provides O(log n) insertion and O(log n) removal of the minimum, but the minimum-removal cost for the clock server is amortized across all tasks expired in a single tick (O(k log n) for k expired tasks per tick). Since k is typically 0 or 1 per tick in normal operation, the heap’s asymptotic advantage rarely materializes.

The recommended implementation for CS 452: a sorted doubly-linked list using intrusive next/prev pointers embedded in the task descriptor. Insertions happen rarely (once per Delay/DelayUntil call) compared to tick processing (once per 10 ms); keeping the queue sorted minimizes the tick processing path to a simple head-of-list check.

typedef struct DelayedTask {
    int tid;                      // task blocked in Delay/DelayUntil
    int deadline;                 // absolute tick when this task should wake
    struct DelayedTask *next;
} DelayedTask;

static DelayedTask  dq_pool[MAX_TASKS];
static DelayedTask *dq_head = NULL;  // sorted ascending by deadline
static DelayedTask *dq_free = NULL;  // free list

void dq_init(void) {
    for (int i = 0; i < MAX_TASKS - 1; i++) dq_pool[i].next = &dq_pool[i + 1];
    dq_pool[MAX_TASKS - 1].next = NULL;
    dq_free = dq_pool;
    dq_head = NULL;
}

void dq_insert(int tid, int deadline) {
    DelayedTask *entry = dq_free;  // never NULL: at most one entry per task
    dq_free = dq_free->next;
    entry->tid = tid;
    entry->deadline = deadline;

    // Insert in sorted order (FIFO among equal deadlines)
    DelayedTask **pos = &dq_head;
    while (*pos && (*pos)->deadline <= deadline) pos = &(*pos)->next;
    entry->next = *pos;
    *pos = entry;
}

// Called by clock server on each tick
void dq_process(int current_tick) {
    while (dq_head && dq_head->deadline <= current_tick) {
        DelayedTask *expired = dq_head;
        dq_head = dq_head->next;
        Reply(expired->tid, &current_tick, sizeof(current_tick));
        expired->next = dq_free;
        dq_free = expired;
    }
}

Advanced Timing: Jitter Measurement and Compensation

In an ideal system, a task that calls DelayUntil(clock_tid, T) wakes up at exactly tick T. In practice, several sources of delay cause the actual wake-up time to differ from the ideal:

Clock server processing delay: the clock server must process the previous tick’s notifications before starting the next Receive. If a tick expires 5 tasks, the server processes 5 Replies before re-entering Receive. This processing takes time, delaying the server’s ability to process the next tick’s notification. With the Clock Notifier at priority 31 and the Clock Server at priority 25, the Notifier’s next AwaitEvent fires immediately when the server Replies, but the server itself may be running for several microseconds before replying to the Notifier.

Scheduling delay: after the Clock Server replies to a delayed task, the delayed task becomes Ready but may not immediately run if a higher-priority task is also ready. For a task at priority 10 (below the clock server at 25), the task may wait until all tasks at priorities 11–31 have blocked before it gets CPU time. In a well-designed system, this delay is bounded by the sum of execution times of all tasks above priority 10 that are ready at the tick boundary.

Measuring jitter: record the intended wake-up tick and the actual wake-up tick (measured using Time(clock_tid) as the first action after waking) for each iteration of a periodic task. The difference is the jitter for that iteration:

void periodic_task(void) {
    int clock_tid = WhoIs(SRV_CLOCK);
    int next_tick  = Time(clock_tid) + PERIOD;
    int expected   = next_tick;

    for (;;) {
        DelayUntil(clock_tid, next_tick);
        int actual = Time(clock_tid);
        int jitter = actual - expected;

        /* Log the jitter: log_event(JITTER_MEASUREMENT, jitter); */

        /* ... do work ... */

        expected   = next_tick + PERIOD;
        next_tick += PERIOD;
    }
}

Compensating for jitter: if the periodic task’s execution time varies (e.g., because of variable-length CAN messages to process), the jitter accumulates over time. The absolute-deadline pattern (next_tick += PERIOD instead of Delay(PERIOD)) prevents this accumulation: each deadline is absolute, not relative to the previous wake-up. Even if one iteration wakes up 2 ticks late, the next iteration’s deadline is still next_tick + PERIOD from the original schedule, not from the late actual wake-up.

However, if a task is consistently late (every iteration wakes up 1–2 ticks late), it indicates that the system is over-committed at that priority level. The remedy is to reduce the task’s period, increase its priority, or reduce its execution time.

The Clock Server Under Load: Tick Loss Analysis

When the clock tick rate is high (1 ms ticks) and many tasks are delayed, the clock server may spend more time processing expired tasks than the tick period. This is tick loss: the clock server hasn’t finished processing tick N when tick N+1 fires.

Detection: because the single Clock Notifier blocks in Send() until the Server replies, ticks that fire while the Server is still busy are not queued behind it — they are missed at the AwaitEvent level. The loss can be detected with a sequence number in the tick message:

/* In the Clock Notifier, add a sequence number to the tick message */
typedef struct { int tick_seq; } TickMsg;

void clock_notifier_with_seqno(void) {
    int server_tid = WhoIs(SRV_CLOCK);
    for (;;) {
        // AwaitEvent returns the kernel's tick count, so a gap in
        // tick_seq reveals ticks that fired while the Server was busy
        TickMsg msg = { .tick_seq = AwaitEvent(EVENT_TIMER_C1) };
        Send(server_tid, &msg, sizeof(msg), NULL, 0);
    }
}

If the Server receives a tick with seq = 5 while its own count is 3, ticks 4 and 5 fired while earlier ticks were still being processed — a 2-tick backlog. Logging the maximum backlog depth helps diagnose system overload.

Prevention: keep the clock server’s per-tick processing fast. Do not call any SRR primitives (except Reply) within the clock server’s tick handler. Do not do computation that depends on the number of delayed tasks in a non-O(1) manner during the tick. If many tasks expire simultaneously (rare but possible if many tasks share the same absolute deadline), the tick processing time spikes. Spreading deadlines slightly (by randomizing task startup times) prevents synchronized spikes.

The rate-limit circuit breaker: a production clock server might include a self-protection mechanism: if the tick backlog exceeds a threshold, it temporarily increases the tick period (by updating the timer compare register to fire less frequently) and logs a warning. This is a form of load shedding — degrading timing precision temporarily to prevent cascading failure under overload.
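
The decision logic of such a circuit breaker is small. A hedged sketch — the threshold and the 2× back-off are illustrative policy choices, not course requirements:

```c
#include <assert.h>

#define BACKLOG_THRESHOLD 4

/* Given the nominal and current tick periods (in µs) and the observed
 * tick backlog, return the period to program into the timer compare
 * register. Doubles the period under overload; halves it back toward
 * nominal once the backlog clears. */
int next_tick_period(int nominal_us, int current_us, int backlog) {
    if (backlog > BACKLOG_THRESHOLD)
        return current_us * 2;        /* shed load: fire less often */
    if (backlog == 0 && current_us > nominal_us)
        return current_us / 2;        /* recover toward nominal */
    return current_us;                /* hold steady */
}
```

Doubling and halving give hysteresis for free: the period only moves when the backlog is decisively high or decisively clear, avoiding oscillation around the threshold.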

Temporal Logic and Timing Specifications

The clock server enables writing temporal logic specifications for the train control system. A temporal specification states invariants about the system’s behavior over time, not just at a single point. Two examples relevant to CS 452:

Safety property: “A train shall never occupy a reserved segment owned by another train at any time T.” This is an invariant over all times T. It is checked by verifying the reservation state is never in an inconsistent configuration.

Liveness property: “Every Delay call shall eventually return.” This is a progress guarantee. It is ensured if the Clock Notifier always replies to the Clock Server, and the Clock Server always processes the delay queue.

Timing property: “A sensor event shall be processed within 50 ms of occurrence.” This requires that the CAN RX Notifier wakes within 50 ms of the MCP2515 asserting INT, and that the processing chain (RX Notifier → CAN RX Server → Track Server) completes within the deadline. This is a bounded-response property.

Temporal Logic of Actions (TLA+), developed by Leslie Lamport, provides a formal language for specifying and model-checking such properties. While TLA+ is not standard practice in CS 452, understanding that specifications can be formal helps appreciate why the priority assignments, task architectures, and timing budgets are engineering decisions with provable consequences — not merely operational tuning.


Chapter 14: I/O Servers — UART, CAN, and the Notifier Pattern

With interrupts, AwaitEvent, and the Notifier pattern established, we can build the I/O servers that give user tasks access to the UART terminal and the CAN bus. These servers are the final layer between the kernel’s abstract SRR interface and the physical hardware.

UART Architecture

The UART communication layer consists of four tasks:

  • UART RX Notifier (very high priority): calls AwaitEvent(EVENT_UART_RX), wakes when data arrives, sends bytes to the RX Server.
  • UART RX Server (high priority): maintains an incoming byte buffer. Clients call Getc(tid) → server either immediately delivers a buffered byte or parks the client in a wait list until a byte arrives.
  • UART TX Notifier (very high priority): calls AwaitEvent(EVENT_UART_TX), wakes when the FIFO drains below the threshold, sends a signal to the TX Server.
  • UART TX Server (high priority): maintains an outgoing byte buffer. Clients call Putc(tid, ch) → server buffers the byte. When the TX Notifier signals FIFO-ready, the server drains the buffer into the FIFO.

The asymmetry between RX and TX is subtle. For RX, the interrupt fires when bytes arrive — the notifier is event-driven by received data. For TX, the interrupt fires when the FIFO drains — the notifier signals that the server can write more bytes. Without the TX Notifier pattern, the TX Server would busy-poll the UART FIFO status register, burning CPU on checking whether space is available.
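
The byte buffers in the RX and TX servers are typically fixed-size rings. A minimal sketch using free-running indices — the size and names are illustrative, and the power-of-two size lets a mask replace the modulo:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 256   /* must be a power of two for the index mask */

typedef struct {
    uint8_t  data[RING_SIZE];
    uint32_t head;      /* next byte to transmit */
    uint32_t tail;      /* next free slot */
} ByteRing;

int ring_empty(const ByteRing *r) { return r->head == r->tail; }
int ring_full(const ByteRing *r)  { return r->tail - r->head == RING_SIZE; }

int ring_push(ByteRing *r, uint8_t b) {
    if (ring_full(r)) return -1;            /* caller must park the client */
    r->data[r->tail++ & (RING_SIZE - 1)] = b;
    return 0;
}

int ring_pop(ByteRing *r) {
    if (ring_empty(r)) return -1;
    return r->data[r->head++ & (RING_SIZE - 1)];
}
```

The free-running indices (never wrapped back to zero) make empty and full unambiguous without sacrificing a slot: full is exactly tail − head == RING_SIZE, and unsigned overflow of the indices is harmless because only their difference matters.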

The PL011 interrupt handling sequence:

void uart_irq_handler(void) {
    uint32_t mis = mmio_read(UART0_BASE + UART_MIS);  // masked interrupt status

    if (mis & UART_MIS_RXMIS) {
        // RX interrupt: drain FIFO, buffer bytes, wake RX notifier
        while (!(mmio_read(UART0_BASE + UART_FR) & UART_FR_RXFE)) {
            char ch = mmio_read(UART0_BASE + UART_DR) & 0xFF;
            rx_fifo_push(ch);
        }
        // Clear interrupt
        mmio_write(UART0_BASE + UART_ICR, UART_ICR_RXIC);
        // Disable RX interrupt until notifier processes
        mmio_and(UART0_BASE + UART_IMSC, ~UART_IMSC_RXIM);
    }
    if (mis & UART_MIS_TXMIS) {
        // TX interrupt: signal notifier that FIFO has space
        mmio_write(UART0_BASE + UART_ICR, UART_ICR_TXIC);
        mmio_and(UART0_BASE + UART_IMSC, ~UART_IMSC_TXIM);
    }
}

The pattern of disabling the interrupt source in the handler and re-enabling it after AwaitEvent prevents spurious re-delivery. Without disabling, a level-triggered interrupt source (such as the UART FIFO-not-empty) would immediately re-fire after the handler completes, because the FIFO still has data. By disabling the interrupt and re-enabling it only after the notifier has confirmed it called AwaitEvent again, the server controls the interrupt rate.

UART CTS Hardware Flow Control

A subtle but important feature of the UART communication story is hardware flow control. The PL011 UART supports CTS/RTS (Clear-To-Send / Request-To-Send) handshaking, which allows the receiving end to signal to the transmitting end that it is ready to accept more data. Without flow control, a transmitter that sends faster than the receiver can process will overflow the receiver’s buffer — for the CS3 central station, which has finite buffering for incoming commands, this can cause commands to be silently dropped.

The CTS pin is an input to the UART transmitter. When CTS is asserted (driven low by the receiver), the transmitter can send. When CTS is deasserted (driven high), the transmitter must pause until CTS is re-asserted. From the PL011 perspective, hardware CTS is enabled by setting bit 15 (CTSEn) in the UART Control Register (CR):

#define UART_CR_CTSEN   (1u << 15)  /* CTS enable: check CTS before TX */
#define UART_CR_RTSEN   (1u << 14)  /* RTS enable: assert RTS on RX not full */
#define UART_CR_TXE     (1u << 8)   /* TX enable */
#define UART_CR_RXE     (1u << 9)   /* RX enable */
#define UART_CR_UARTEN  (1u << 0)   /* UART enable */

void uart0_enable_cts(void) {
    uint32_t cr = mmio_read(UART0_BASE + UART_CR);
    cr |= UART_CR_CTSEN;
    mmio_write(UART0_BASE + UART_CR, cr);
}

When CTSEn is set, the PL011 automatically checks the CTS input before attempting to shift a byte out of the transmit FIFO. If CTS is not asserted, the UART holds the byte in the FIFO until CTS becomes asserted. From the software’s perspective, this is invisible: writing bytes to the FIFO and waiting for the TX interrupt works identically with or without CTS. The only visible effect is that TX might take longer — the TX interrupt fires only after the byte is actually shifted out, not merely enqueued.

The CS3 and CTS: the Märklin CS3 central station’s UART connection (if a direct UART connection is used, rather than CAN) typically asserts CTS when its internal buffer is not full. When the CS3 is processing a large batch of commands (e.g., speed updates for all active locomotives), it briefly deasserts CTS. Without CTS enforcement, the PL011 would continue transmitting into the void; with CTS, it simply pauses.

The practical consequence in the train control application: the CAN TX server’s queue depth (CAN_TX_QUEUE_DEPTH = 32 in the earlier example) is a software buffer that holds frames waiting to be SPI-transferred to the MCP2515. The MCP2515 then handles the CAN bus arbitration independently. If the CAN bus is congested (other devices transmitting), the MCP2515 will hold the frame in TXB0 until the bus is idle. This is analogous to the hardware CTS mechanism — both are forms of backpressure from the physical layer to the software layer. The key insight: in a well-designed system, backpressure is propagated upward without blocking any task indefinitely; it manifests as a growing queue depth that the train engineer can observe and respond to by reducing the command rate.

Software flow control (XON/XOFF): for UARTs without hardware CTS, software flow control uses in-band control bytes: the receiver sends the ASCII XOFF character (0x13, Ctrl-S) to ask the transmitter to pause, and XON (0x11, Ctrl-Q) to resume. The TX server must scan the incoming byte stream for these control characters and pause or resume transmission accordingly. This is more complex and less reliable than hardware CTS (the control characters may be delayed by the very queue they are trying to control, creating a race condition), but it works over any serial link.
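
The TX-side bookkeeping for XON/XOFF reduces to a one-bit state machine fed by received bytes; a sketch using the standard ASCII control codes:

```c
#include <assert.h>

#define XON  0x11   /* Ctrl-Q: resume transmission */
#define XOFF 0x13   /* Ctrl-S: pause transmission  */

/* Process one received byte. Updates *tx_paused when the byte is a
 * flow-control character; returns 1 if the byte was consumed as flow
 * control (and must not be passed to the application), else 0. */
int swflow_rx_byte(int *tx_paused, unsigned char b) {
    if (b == XOFF) { *tx_paused = 1; return 1; }
    if (b == XON)  { *tx_paused = 0; return 1; }
    return 0;
}
```

Note that this consumes XON/XOFF out of the data stream, which is precisely why software flow control cannot carry arbitrary binary data without an additional escaping scheme.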

In the CS 452 terminal context, neither CTS nor XON/XOFF is typically required: the terminal server’s transmit rate is limited by the clock server tick rate (10 ms per debug print) and the PL011 FIFO drains at 115,200 baud ≈ 11,520 bytes/second. The 32-byte TX FIFO drains in under 3 ms — well within the 10 ms tick budget for steady-state output. For bursty output (e.g., a large kprintf of 1,024 bytes), the TX server blocks the next Putc call until the FIFO has space, which is handled by the TX notifier pattern without any explicit flow control. Hardware CTS becomes important only when the downstream device (the CS3 in direct UART mode) is genuinely slower than the PL011’s output rate.

The CTS interrupt: the PL011 also generates a modem status interrupt when CTS transitions (from asserted to deasserted or vice versa). This interrupt allows the TX server to be woken immediately when CTS is reasserted, rather than waiting for a timeout. The UART_MIS register bit CTSMMIS indicates a CTS state change. Handling this interrupt in the TX notifier:

void uart_tx_notifier_main(void) {
    int server_tid = MyParentTid();
    uint32_t imsc_flags = UART_IMSC_TXIM | UART_IMSC_CTSMMIM;

    for (;;) {
        /* Enable TX and CTS interrupts, then wait */
        mmio_or(UART0_BASE + UART_IMSC, imsc_flags);
        int event = AwaitEvent(EVENT_UART_TX);  /* returns UART_MIS value */

        /* Determine which interrupt fired */
        if (event & UART_MIS_TXMIS) {
            /* TX FIFO below threshold — more space available */
            Message msg = { .type = MSG_UART_TX_READY };
            Send(server_tid, &msg, sizeof(msg), NULL, 0);
        }
        if (event & UART_MIS_CTSMIS) {
            /* CTS state changed — check if transmitting should resume */
            uint32_t fr = mmio_read(UART0_BASE + UART_FR);
            if (!(fr & UART_FR_CTS)) {
                /* CTS is now deasserted — signal TX server to pause */
                Message msg = { .type = MSG_UART_CTS_DEASSERTED };
                Send(server_tid, &msg, sizeof(msg), NULL, 0);
            } else {
                /* CTS reasserted — signal TX server to resume */
                Message msg = { .type = MSG_UART_CTS_ASSERTED };
                Send(server_tid, &msg, sizeof(msg), NULL, 0);
            }
        }
    }
}

The TX server responds to MSG_UART_CTS_DEASSERTED by setting a cts_paused flag and refusing to load new bytes into the FIFO; it responds to MSG_UART_CTS_ASSERTED by clearing the flag and resuming transmission. The kernel’s task scheduler handles the rest: if a client calls Putc() while CTS is paused, the server buffers the byte in its ring buffer and replies immediately (the byte will be transmitted when CTS resumes). If the ring buffer fills up, Putc() blocks the caller — which is the correct behavior for a transmitter with a full buffer and a paused link.

This example illustrates the general principle of interrupt-driven hardware protocol management: the notifier converts hardware signals (interrupts) into software messages, the server maintains state based on those messages, and client tasks interact with the server without any direct knowledge of the hardware’s current state. The hardware’s complexity (CTS transitions, FIFO thresholds, error flags) is encapsulated entirely within the notifier/server pair.

CAN Server Architecture

The CAN server follows the same Notifier pattern, but the interrupt source is the MCP2515 INT pin (GPIO 17, GIC ID 145) rather than a built-in peripheral interrupt. The CAN RX Notifier calls AwaitEvent(EVENT_CAN_RX), wakes when the GPIO interrupt fires, reads the received CAN frame from the MCP2515 via SPI, and sends the frame to the CAN RX Server. The CAN TX Server handles frame transmission.

The SPI access for reading a received frame is a multi-step operation:

  1. Assert CS̄ (GPIO output low).
  2. Send SPI command 0x90 (READ RX BUFFER 0 starting at identifier).
  3. Clock in 13 bytes: 2 ID bytes, 2 extended ID bytes, 1 DLC byte, up to 8 data bytes.
  4. Deassert CS̄.
  5. Clear the RXB0 interrupt flag: SPI BIT MODIFY CANINTF, mask 0x01, data 0x00.

The SPI access must complete before returning from the interrupt handler, since the MCP2515 will continue asserting INT until the frame is read and CANINTF is cleared. This is different from the UART case, where the interrupt is cleared by writing to the ICR register.

The CS3 Active List Quirk

One practical challenge specific to the Märklin CS3: it maintains an internal list of “active” locomotives and periodically resends speed commands to them. This has two implications. First, a locomotive that has not received any command will not be on the active list and the CS3 may not forward your commands. Second, the CS3’s periodic refresh can conflict with your command timing — if you send a stop command but the CS3 sends a speed refresh 5 ms later, the locomotive may briefly accelerate before stopping.

The solution is to send a Märklin system “Go” command immediately after a speed command to any new locomotive, and to ensure that your stop commands set speed to exactly 0 (not “emergency stop”, which has different semantics and may confuse the CS3’s bookkeeping).

CAN TX Server: Managing Multiple TX Buffers

The MCP2515 provides three independent transmit buffers — TXB0, TXB1, TXB2 — each capable of holding one CAN frame (up to 8 data bytes) along with its arbitration ID and control fields. The chip will autonomously try to transmit whichever buffer has been loaded and marked for transmission, retrying after collision backoff according to the CAN protocol. When a transmission completes, the chip asserts INT (GPIO 17 → GIC 145) with the CANINTF.TXnIF bit set.

The existence of three buffers creates a design decision: should the kernel software use them as a three-slot deep FIFO (always fill all three in advance), or use only TXB0 and treat the chip as a single-slot queue? Using all three maximizes throughput by keeping the CAN bus busy during back-to-back frame transmissions. Using only TXB0 simplifies sequencing — you always know exactly which frame the chip is transmitting — at the cost of idle bus time while you SPI-transfer the next frame after the TX interrupt.

For train control, where commands are issued at most every few milliseconds and the CAN bus runs at 250 kbit/s (4 µs/bit; a full standard frame with worst-case stuffing is roughly 130 bits, about 500 µs), throughput is not the bottleneck. The simpler single-buffer design wins: use TXB0, wait for the TXIF interrupt, then load the next frame.
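The frame-time arithmetic can be checked directly. A standard CAN 2.0A data frame carries 44 fixed bits plus 8 bits per data byte, and bit stuffing can add at most ⌊(34 + 8s − 1)/4⌋ extra bits over the stuffable region (SOF through CRC). A sketch using this standard worst-case bound:

```c
/* Worst-case length (in bits) of a standard CAN 2.0A data frame with
   s data bytes: 44 fixed bits + 8s data bits + worst-case stuff bits
   over the stuffable region (SOF..CRC = 34 + 8s bits). */
int can_frame_bits_worst(int s) {
    int stuffable = 34 + 8 * s;
    return 44 + 8 * s + (stuffable - 1) / 4;
}

/* Worst-case frame time in microseconds at 250 kbit/s (4 us per bit). */
int can_frame_time_us(int s) {
    return can_frame_bits_worst(s) * 4;
}
```

For an 8-byte frame this gives 132 bits, i.e. 528 µs on the wire, confirming that back-to-back command frames every few milliseconds leave the bus mostly idle.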

/* CAN TX Server state */
#define CAN_TX_QUEUE_DEPTH 32

typedef struct {
    CanFrame queue[CAN_TX_QUEUE_DEPTH];  /* software queue ahead of TXB0 */
    int      head, tail;                 /* ring buffer indices */
    bool     chip_busy;                  /* true if TXB0 is loaded */
} CanTxState;

static void can_txq_push(CanTxState *s, const CanFrame *f) {
    /* Caller must check can_txq_full() first; pushing to a full queue
       would overwrite the oldest unsent frame. */
    s->queue[s->tail % CAN_TX_QUEUE_DEPTH] = *f;
    s->tail++;
}

static bool can_txq_empty(CanTxState *s) { return s->head == s->tail; }
static bool can_txq_full(CanTxState *s)  { return s->tail - s->head == CAN_TX_QUEUE_DEPTH; }

static CanFrame can_txq_pop(CanTxState *s) {
    return s->queue[s->head++ % CAN_TX_QUEUE_DEPTH];
}

When the CAN TX Server receives a MSG_CAN_TX request from a client (e.g., the Train Engineer task), it either loads TXB0 immediately (if the chip is idle) or enqueues the frame (if TXB0 is busy). The server then replies immediately so the client is never blocked on the transmission completing:

void can_tx_server_main(void) {
    RegisterAs("CAN_TX");

    int notifier_tid = Create(PRIORITY_VERY_HIGH, can_tx_notifier);
    CanTxState state = {0};
    int sender;
    Message msg;

    for (;;) {
        Receive(&sender, &msg, sizeof(msg));

        if (msg.type == MSG_CAN_TX) {
            Reply(sender, NULL, 0);  /* non-blocking — reply before transmitting */
            if (!state.chip_busy) {
                mcp2515_load_txb0(&msg.can_frame);
                mcp2515_request_tx(TXB0);
                state.chip_busy = true;
            } else {
                can_txq_push(&state, &msg.can_frame);
            }
        } else if (msg.type == MSG_CAN_TX_DONE) {
            /* Notifier reports TXB0 complete */
            Reply(sender, NULL, 0);
            if (!can_txq_empty(&state)) {
                CanFrame next = can_txq_pop(&state);
                mcp2515_load_txb0(&next);
                mcp2515_request_tx(TXB0);
                /* chip_busy stays true */
            } else {
                state.chip_busy = false;
            }
        }
    }
}

The CAN TX Notifier waits for the TX interrupt, then sends MSG_CAN_TX_DONE to the server:

void can_tx_notifier_main(void) {
    int server_tid = MyParentTid();
    Message msg = { .type = MSG_CAN_TX_DONE };

    for (;;) {
        AwaitEvent(EVENT_CAN_TX);   /* blocks until CANINTF.TX0IF set */
        /* Clear TX interrupt: SPI BIT MODIFY CANINTF, mask=0x04, data=0x00 */
        mcp2515_bit_modify(MCP_CANINTF, MCP_TX0IF, 0x00);
        Send(server_tid, &msg, sizeof(msg), NULL, 0);
    }
}

The mcp2515_load_txb0 function writes the frame identifier and data through SPI:

void mcp2515_load_txb0(const CanFrame *f) {
    uint8_t buf[13];
    /* Standard 11-bit CAN ID in bits 10:3 of SIDH, bits 2:0 of SIDL */
    buf[0] = (f->id >> 3) & 0xFF;     /* SIDH */
    buf[1] = (f->id & 0x07) << 5;     /* SIDL — no extended ID */
    buf[2] = 0x00;                     /* EID8 — not used */
    buf[3] = 0x00;                     /* EID0 — not used */
    buf[4] = f->dlc & 0x0F;           /* DLC  */
    for (int i = 0; i < f->dlc; i++)
        buf[5 + i] = f->data[i];

    spi0_cs_assert();
    spi0_transfer_byte(MCP_LOAD_TXB0SIDH);  /* 0x40 — load TXB0 starting at SIDH */
    for (int i = 0; i < 5 + f->dlc; i++)
        spi0_transfer_byte(buf[i]);
    spi0_cs_deassert();
}

void mcp2515_request_tx(int buf_num) {
    /* RTS command: 0x80 | (1 << buf_num) */
    spi0_cs_assert();
    spi0_transfer_byte(0x80 | (1 << buf_num));
    spi0_cs_deassert();
}

MCP2515 Error States and Recovery

The MCP2515 implements the CAN bus error state machine defined in the CAN 2.0B specification. Every node on the CAN bus maintains two counters: the Transmit Error Counter (TEC) and the Receive Error Counter (REC). These counters increase when the node detects or causes errors (bit errors, stuff errors, form errors, ACK errors) and decrease when transmissions succeed.

The error state machine has three states:

  1. Error-Active (TEC and REC both < 128): normal operation. The node actively participates in bus arbitration and error signalling.
  2. Error-Passive (TEC or REC ≥ 128): the node can still transmit and receive but uses passive error flags (recessive bits) rather than the dominant error flags of the error-active state. Importantly, an error-passive node must wait an extra 8-bit suspend transmission delay between frames.
  3. Bus-Off (TEC ≥ 256): the node disconnects from the bus entirely. It can only recover by a software-initiated reset after 128 occurrences of 11 consecutive recessive bits.
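The state machine above reduces to a small classifier on the two counters. A sketch, using the thresholds listed; the enum and function names are illustrative:

```c
typedef enum { CAN_ERROR_ACTIVE, CAN_ERROR_PASSIVE, CAN_BUS_OFF } CanErrorState;

/* Classify node state from the Transmit/Receive Error Counters.
   Bus-Off is reached on TEC alone; either counter at 128 or above
   puts the node in Error-Passive. */
CanErrorState can_error_state(unsigned tec, unsigned rec) {
    if (tec >= 256)
        return CAN_BUS_OFF;
    if (tec >= 128 || rec >= 128)
        return CAN_ERROR_PASSIVE;
    return CAN_ERROR_ACTIVE;
}
```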

The CS 452 kernel must poll the MCP2515 error flags register (EFLG at address 0x2D) to detect these transitions. A simple recovery strategy:

void can_check_errors(void) {
    uint8_t eflg = mcp2515_read_reg(MCP_EFLG);

    if (eflg & MCP_EFLG_TXBO) {
        /* Bus-Off: reset and re-initialize */
        mcp2515_reset();
        mcp2515_init();
        return;
    }
    if (eflg & MCP_EFLG_TXEP) {
        /* Error-Passive: log warning, increase inter-frame delay */
        log_event(LOG_CAN_ERROR_PASSIVE, eflg);
    }
    if (eflg & MCP_EFLG_RX0OVR) {
        /* RX buffer overflow — frame was lost */
        log_event(LOG_CAN_RX_OVERFLOW, 0);
        mcp2515_bit_modify(MCP_EFLG, MCP_EFLG_RX0OVR, 0x00);
    }
}

In practice, error-passive transitions in the CS 452 lab are almost always caused by a disconnected CAN cable, a misconfigured baud rate (the MCP2515 must match the CS3 at 250 kbit/s), or SPI clock timing violations. The first diagnostic step is to oscilloscope the CANH/CANL lines and verify the 250 kbit/s bit timing.

Märklin CAN Command Encoding

Every command sent to the Märklin Central Station CS3 is a CAN frame with an 11-bit standard ID. The frame structure is defined by the CS2/CS3 CAN protocol and is publicly documented in the Märklin protocol specification. The key commands relevant to the train application:

Command            CAN ID (hash)   Data bytes                Effect
System Go          0x00            4 bytes (sub-cmd 0x01)    Enable track power
System Stop        0x00            4 bytes (sub-cmd 0x00)    Emergency stop all trains
Loco Speed         0x04            6 bytes                   Set loco speed (0–1023 steps)
Loco Direction     0x05            5 bytes                   Set direction (0=fwd, 1=rev)
Accessory Control  0x0B            6 bytes                   Switch turnout position
Feedback Event     0x11            8 bytes (received)        Sensor state change

The 11-bit CAN ID encodes both the command hash (upper bits) and a priority/response flag. For commands sent to the CS3, bit 0 of the ID is 0; for responses from the CS3, bit 0 is 1. The CS3 also echoes every command with bit 0 set, which the CAN RX server can use for acknowledgement.
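Under this convention, the CAN RX server can match each CS3 echo to the outstanding command with a single bit test. A sketch of the acknowledgement bookkeeping; the CmdTracker type is illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t pending_id;   /* ID of the command awaiting its echo */
    bool     acked;
} CmdTracker;

void tracker_sent(CmdTracker *t, uint16_t cmd_id) {
    t->pending_id = cmd_id;
    t->acked = false;
}

/* Called for every received frame. Returns true if this frame is the
   echo of the pending command (its ID with bit 0 set). */
bool tracker_rx(CmdTracker *t, uint16_t rx_id) {
    if (!t->acked && rx_id == (uint16_t)(t->pending_id | 1)) {
        t->acked = true;
        return true;
    }
    return false;
}
```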

A loco speed command (command 0x04) sent at speed step 200 to loco UID 0x00034001 looks like:

CAN ID:  command 0x04, hash 0x000D
DLC:     6
Data:    0x00 0x03 0x40 0x01   (loco UID, big-endian)
         0x00 0xC8             (speed = 200; direction not changed)

The loco UID (4 bytes) is the hardware-assigned Märklin address of the locomotive decoder. The kernel maintains a mapping from human-assigned train numbers (1–5) to Märklin UIDs discovered during the initial calibration run.
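The train-number-to-UID mapping can be a small fixed table filled in during calibration. A sketch; the UID value below reuses the example from the text, and the names are illustrative:

```c
#include <stdint.h>

#define MAX_TRAINS 5

typedef struct {
    int      train_num;   /* human-assigned number, 1..5           */
    uint32_t uid;         /* Maerklin decoder UID, discovered once */
} TrainMapEntry;

static TrainMapEntry train_map[MAX_TRAINS];
static int train_map_count = 0;

void train_map_add(int num, uint32_t uid) {
    if (train_map_count < MAX_TRAINS) {
        train_map[train_map_count].train_num = num;
        train_map[train_map_count].uid = uid;
        train_map_count++;
    }
}

/* Returns the UID for a train number, or 0 if unknown. */
uint32_t train_map_lookup(int num) {
    for (int i = 0; i < train_map_count; i++)
        if (train_map[i].train_num == num)
            return train_map[i].uid;
    return 0;
}
```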

Software Design Principles

The worker/server architecture illustrated by the UART and CAN servers appears repeatedly throughout the train control application. The general structure is:

  • Server task: owns a resource (byte buffer, train state table, track segment graph), services requests from clients via Receive/Reply, and delegates blocking operations (waiting for hardware) to workers/notifiers.
  • Worker/Notifier task: performs one blocking operation repeatedly (AwaitEvent, or a send to a slow server) and reports results to its server.
  • Client task: calls Send to the server and blocks until service is complete.

The critical invariant: server tasks must never block indefinitely. A server that calls Send() risks blocking, which prevents it from Receiving new requests. A server that calls AwaitEvent() also blocks. The Notifier pattern resolves this by outsourcing the blocking to a dedicated task. A server that violates this invariant — even once — can deadlock the entire system if a client at higher priority needs the server while the server is blocked.

Terminal Server Design: Printf Without Blocking

One of the first things a developer wants when debugging bare-metal code is printf. But a naïve printf that calls uart_putc_polling() in a loop busy-waits roughly 87 µs per character at 115,200 baud, so a full line of output blocks the entire task for several milliseconds. In a real-time system, this is catastrophic: a debug print in the train engineer task can prevent the clock server from processing a tick on time.

The terminal server solves this: Putc(tid, ch) queues a byte to the UART TX server and returns immediately (or blocks only if the TX buffer is full). Clients never block on I/O — they hand off bytes to the server and continue running.

For convenience, a non-blocking kprintf can be implemented on top of the terminal server:

void kprintf(const char *fmt, ...) {
    char buf[256];
    va_list args;
    va_start(args, fmt);
    int len = vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);

    int tx_tid = WhoIs("UART_TX");
    for (int i = 0; i < len; i++) {
        Putc(tx_tid, buf[i]);
    }
}

vsnprintf is the only libc function called — it must be either available in a freestanding version or reimplemented. The implementation of vsnprintf for bare-metal is non-trivial (supporting %d, %x, %s, %f is several hundred lines) but is worth providing as a kernel utility.
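A minimal freestanding formatter along these lines, supporting only %d, %x, %s, %c, and %%, can fit in well under a hundred lines. This is a sketch, not the kernel's actual implementation; field widths, %u, and %f are omitted, and the mini_ names are mine:

```c
#include <stdarg.h>
#include <stddef.h>

/* Append one character, respecting the size limit (always count it). */
static void put(char *buf, size_t size, size_t *n, char c) {
    if (*n + 1 < size)
        buf[*n] = c;
    (*n)++;
}

/* Minimal freestanding vsnprintf: returns the length the full output
   would have, truncating the buffer at size-1 (like vsnprintf). */
int mini_vsnprintf(char *buf, size_t size, const char *fmt, va_list ap) {
    size_t n = 0;
    for (; *fmt; fmt++) {
        if (*fmt != '%') { put(buf, size, &n, *fmt); continue; }
        fmt++;
        if (*fmt == '\0') break;   /* stray '%' at end of format */
        switch (*fmt) {
        case 'd': {
            int v = va_arg(ap, int);
            unsigned u;
            char tmp[10];
            int t = 0;
            if (v < 0) { put(buf, size, &n, '-'); u = (unsigned)(-(long)v); }
            else       { u = (unsigned)v; }
            do { tmp[t++] = (char)('0' + u % 10); u /= 10; } while (u);
            while (t) put(buf, size, &n, tmp[--t]);
            break;
        }
        case 'x': {
            unsigned u = va_arg(ap, unsigned);
            char tmp[8];
            int t = 0;
            do { tmp[t++] = "0123456789abcdef"[u & 0xF]; u >>= 4; } while (u);
            while (t) put(buf, size, &n, tmp[--t]);
            break;
        }
        case 's': {
            const char *s = va_arg(ap, const char *);
            while (*s) put(buf, size, &n, *s++);
            break;
        }
        case 'c': put(buf, size, &n, (char)va_arg(ap, int)); break;
        default:  put(buf, size, &n, *fmt); break;
        }
    }
    if (size)
        buf[n < size ? n : size - 1] = '\0';
    return (int)n;
}

int mini_snprintf(char *buf, size_t size, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int r = mini_vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return r;
}
```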

The TX server’s internal buffer must be large enough to hold the output of a kprintf call without blocking — a 4 KB ring buffer is typical. When the buffer is full, the next Putc() call blocks until the UART TX notifier has drained enough bytes to the FIFO. In debug builds where kprintf output is frequent, the TX server can be observed consuming significant idle time draining the buffer.

A practical issue: printf to the terminal is not appropriate for timing-sensitive code paths. A kprintf call in the interrupt handler, or in the context switch path, adds variable latency that disturbs the very behavior you are trying to observe. The ring-buffer post-mortem technique (Chapter 5) is preferred for timing-sensitive debugging.

UART Terminal Parsing: The User Shell

The terminal input (UART RX) supports a user shell — a simple command-line interface that allows the operator to:

  • Enter train numbers and speed commands
  • Request diagnostic output (current sensor states, train positions, idle time)
  • Trigger emergency stop
  • Switch track positions manually

The shell is a simple task at low priority that calls Getc(rx_tid) in a loop, accumulating characters until a newline, then parsing the command. The parsing can be done with a simple linear scan for command keywords:

void shell_main(void) {
    int rx_tid = WhoIs("UART_RX");
    int tx_tid = WhoIs("UART_TX");
    char line[80];
    int  pos = 0;

    for (;;) {
        char ch = Getc(rx_tid);
        if (ch == '\r' || ch == '\n') {
            line[pos] = '\0';
            process_command(line, tx_tid);
            pos = 0;
        } else if ((ch == '\b' || ch == 0x7f) && pos > 0) {  // backspace or DEL
            pos--;
            Putc(tx_tid, '\b'); Putc(tx_tid, ' '); Putc(tx_tid, '\b');
        } else if (pos < sizeof(line) - 1) {
            line[pos++] = ch;
            Putc(tx_tid, ch);  // echo
        }
    }
}

The shell’s low priority ensures it never preempts the train control tasks. If the operator is typing a command while a train is approaching a sensor, the train control computation takes priority and the shell’s Getc call simply blocks until the higher-priority work is done.
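The process_command step left abstract above is a keyword scan followed by integer parsing. A freestanding sketch for a hypothetical "tr &lt;train&gt; &lt;speed&gt;" command; the command name, syntax, and TrainCmd struct are assumptions for illustration:

```c
#include <stdbool.h>

typedef struct {
    int train;
    int speed;
} TrainCmd;

/* Skip spaces, then parse a non-negative decimal integer.
   Returns the value, or -1 if no digits were found. Advances *p. */
static int parse_int(const char **p) {
    const char *s = *p;
    while (*s == ' ') s++;
    if (*s < '0' || *s > '9') return -1;
    int v = 0;
    while (*s >= '0' && *s <= '9')
        v = v * 10 + (*s++ - '0');
    *p = s;
    return v;
}

/* Parse "tr <train> <speed>". Returns true on success. */
bool parse_tr_command(const char *line, TrainCmd *out) {
    if (line[0] != 't' || line[1] != 'r') return false;
    const char *p = line + 2;
    out->train = parse_int(&p);
    out->speed = parse_int(&p);
    return out->train >= 0 && out->speed >= 0;
}
```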

Implementing the UART TX Ring Buffer

The UART TX server manages a ring buffer of bytes to transmit. The ring buffer is a classic circular buffer: two indices (head and tail) into a fixed array, where head indexes the next byte to be read and tail the next slot to be written.

#define TX_BUF_SIZE 4096

typedef struct {
    char    buf[TX_BUF_SIZE];
    int     head, tail;
    bool    tx_active;    // true if we have outstanding TX interrupt
} TxBuffer;

bool tx_buf_empty(TxBuffer *b) { return b->head == b->tail; }
bool tx_buf_full(TxBuffer *b)  { return (b->tail + 1) % TX_BUF_SIZE == b->head; }

void tx_buf_push(TxBuffer *b, char ch) {
    b->buf[b->tail] = ch;
    b->tail = (b->tail + 1) % TX_BUF_SIZE;
}

char tx_buf_pop(TxBuffer *b) {
    char ch = b->buf[b->head];
    b->head = (b->head + 1) % TX_BUF_SIZE;
    return ch;
}

When Putc() is called and the TX buffer is not full, the byte is pushed and the function returns immediately. When the UART TX FIFO has space (signalled by the TX notifier), the TX server drains as many bytes as will fit:

void handle_tx_ready(TxBuffer *b) {
    while (!tx_buf_empty(b) && uart0_tx_ready()) {
        mmio_write(UART0_DR, tx_buf_pop(b));
    }
    if (!tx_buf_empty(b)) {
        // Re-enable TX interrupt to get notified when FIFO drains
        enable_uart_tx_interrupt();
    } else {
        b->tx_active = false;
    }
}

The TX notifier blocks in AwaitEvent until the FIFO drains below the trigger level. The flow is: FIFO drains → TX interrupt fires → kernel unblocks TX notifier → notifier sends to TX server → server drains TX buffer into FIFO → if more bytes remain, enable TX interrupt again → otherwise, mark the link idle.

Comparison with xv6 and MIT 6.S081

The MIT 6.S081 course (Operating Systems Engineering) uses a different pedagogical OS: xv6, a teaching OS based on UNIX version 6, reimplemented in C for RISC-V. Comparing xv6’s I/O architecture with the CS 452 approach illustrates the trade-offs between a UNIX-style OS and a microkernel.

In xv6, I/O is handled by a monolithic kernel with device drivers in the kernel address space. The UART driver in xv6 (uart.c) uses a simple transmit buffer with a spinlock for synchronization between the console write path and the UART interrupt handler. This is the classic UNIX approach: drivers in kernel space, minimal context switches, shared memory with locks.

The CS 452 kernel makes the opposite choice on every dimension: drivers in user space (UART server), no shared memory, no locks, context switches for every byte via message passing. The CS 452 approach is more complex (more tasks, more SRR interactions) but provides stronger isolation (a buggy UART driver cannot corrupt the kernel), avoids priority inversion from spinlocks, and is more analyzable for worst-case timing.

In terms of raw throughput, the CS 452 approach is slower: each Putc() is one Send/Receive/Reply roundtrip (three context switches, ~6 µs). The xv6 approach uses a spinlock acquisition and a direct write (~100 ns). For a train control application with at most a few hundred bytes per second of terminal output, this difference is irrelevant. For a high-throughput network server, it would be disqualifying.

The choice of architecture reflects the course’s objectives: CS 452 teaches real-time systems principles (bounded latency, message passing, priority scheduling); MIT 6.S081 teaches traditional OS concepts (virtual memory, file systems, system call interfaces). Both are valuable; they optimize for different outcomes.


Part IV: Real-Time Scheduling Theory

Chapter 15: Liu and Layland’s World

The kernel’s static-priority scheduler does not, by itself, guarantee that any task meets its deadline. For that, we need a theory of schedulability: a mathematical framework for determining, given a set of tasks with known timing requirements, whether a particular scheduling algorithm will deliver every task’s output before its deadline.

This theory was founded by C. L. Liu and James Layland in their 1973 JACM paper “Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment.” The paper introduced the periodic task model, proved the optimality of rate-monotonic scheduling (RMS) among fixed-priority algorithms, and derived its least upper utilization bound; the exact schedulability characterization came later, in Lehoczky, Sha, and Ding's 1989 response-time work. Fifty years later, the Liu/Layland framework remains the starting point for RT scheduling analysis in both research and industry.

The Periodic Task Model

The model makes several simplifying assumptions that make analysis tractable:

  1. Each task \(\tau_i\) is periodic: it arrives with a fixed period \(T_i\) and must complete by its deadline, which coincides with the next period’s arrival (i.e., relative deadline \(D_i = T_i\)).
  2. Each task has a fixed, known worst-case execution time \(C_i\) per period.
  3. Tasks are independent: no precedence constraints, no shared resources (an assumption relaxed by later work on priority inheritance protocols).
  4. Preemption is instantaneous and has zero cost.
  5. All tasks arrive simultaneously at time 0 (the synchronous arrival or critical instant assumption).

Under these assumptions, each task's utilization and the total utilization of the set are:

\[ U_i = \frac{C_i}{T_i} \qquad U = \sum_{i=1}^{n} \frac{C_i}{T_i} \]

If \(U > 1\), the task set is unschedulable by any algorithm on a single processor — there simply is not enough CPU time. If \(U \leq 1\), the task set may be schedulable, depending on the algorithm.

Rate-Monotonic Scheduling

Rate-Monotonic Scheduling (RMS) assigns priorities according to task rates (inverse of period): the task with the shortest period gets the highest priority. Among fixed-priority algorithms, RMS is optimal: if any fixed-priority assignment can schedule the task set, RMS can too.

Liu/Layland RMS Schedulability Bound. A set of n independent periodic tasks is schedulable under RMS if and only if the critical instant analysis succeeds. A sufficient (but not necessary) condition is: \[ U = \sum_{i=1}^{n} \frac{C_i}{T_i} \leq n\left(2^{1/n} - 1\right) \]

The bound \(n(2^{1/n}-1)\) decreases as \(n\) grows:

Tasks \(n\)    Bound \(n(2^{1/n}-1)\)
1              1.000
2              0.828
3              0.780
4              0.757
5              0.743
10             0.718
\(\infty\)     \(\ln 2 \approx 0.693\)

This result is surprising: even in the worst case, RMS wastes at most about 31% of the CPU. For most realistic task sets, utilization can be substantially higher. The bound is tight in the sense that for every \(n\), there exists a task set with utilization just above the bound that RMS cannot schedule.

The intuition behind the bound involves the critical instant: the worst-case moment for any task is when it arrives simultaneously with all higher-priority tasks. At that moment, the task must wait for all higher-priority tasks to complete their current periods before it can run. The bound emerges from solving the recurrence that ensures the lowest-priority task still finishes before its deadline under this worst case.
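The bound and the sufficient test are easy to compute directly. A sketch that avoids libm by extracting the n-th root of 2 with Newton's method; the function names are illustrative:

```c
/* n-th root of 2 by Newton iteration on x^n = 2 (no libm needed). */
static double nth_root_of_2(int n) {
    double x = 1.5;
    for (int i = 0; i < 64; i++) {
        double p = 1.0;
        for (int j = 0; j < n - 1; j++)
            p *= x;                          /* p = x^(n-1) */
        x = ((n - 1) * x + 2.0 / p) / n;
    }
    return x;
}

/* Liu/Layland bound n(2^(1/n) - 1). */
double rms_bound(int n) { return n * (nth_root_of_2(n) - 1.0); }

/* Sufficient (not necessary) RMS schedulability test: U <= bound. */
int rms_sufficient(const double *C, const double *T, int n) {
    double U = 0.0;
    for (int i = 0; i < n; i++)
        U += C[i] / T[i];
    return U <= rms_bound(n);
}
```

Running the test on the three-task example below (U = 0.625 against a bound of 0.780) confirms the hand calculation.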

A Worked Example

Consider three tasks:

  • \(\tau_1\): \(C_1 = 1\,\text{ms}\), \(T_1 = 4\,\text{ms}\), \(U_1 = 0.25\)
  • \(\tau_2\): \(C_2 = 2\,\text{ms}\), \(T_2 = 8\,\text{ms}\), \(U_2 = 0.25\)
  • \(\tau_3\): \(C_3 = 1\,\text{ms}\), \(T_3 = 8\,\text{ms}\), \(U_3 = 0.125\)

Total utilization: \(U = 0.625\). The bound for \(n = 3\) is 0.780. Since \(0.625 \leq 0.780\), the sufficient condition is satisfied — RMS is guaranteed to schedule this task set.

Under RMS, \(\tau_1\) gets highest priority (rate 1/4), \(\tau_2\) and \(\tau_3\) share the lower tier (rate 1/8 each). The timeline at the critical instant (all arrive at \(t=0\)):

  • \(t=0\)–1: \(\tau_1\) runs (highest priority), completes at \(t=1\).
  • \(t=1\)–3: \(\tau_2\) runs, completes at \(t=3\).
  • \(t=3\)–4: \(\tau_3\) runs, completes at \(t=4\). Deadline of \(\tau_3\) is \(t=8\) — met with 4 ms to spare.

At \(t=4\), \(\tau_1\) arrives again and preempts whatever is running.

Now consider raising \(\tau_2\) to \(C_2 = 3\,\text{ms}\), giving \(U_2 = 3/8 = 0.375\) and \(U = 0.750\). This is still within the bound of 0.780, so RMS is still guaranteed feasible. If we raise \(\tau_2\) further to \(C_2 = 4\,\text{ms}\) (\(U = 0.875\)), the sufficient condition fails. We must run the critical instant analysis directly to determine schedulability.

Critical Instant Analysis

The critical instant analysis computes, for each task, its worst-case response time \(R_i\) — the time from its arrival to its completion under the worst-case interference from higher-priority tasks. For task \(\tau_i\) with higher-priority tasks indexed \(j < i\) (using RMS ordering), the response time satisfies the fixed-point equation:

\[ R_i^{(k+1)} = C_i + \sum_{j < i} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \]

Starting with \(R_i^{(0)} = C_i\) and iterating until \(R_i^{(k+1)} = R_i^{(k)}\) or \(R_i^{(k)} > T_i\) (deadline missed). The ceiling term counts how many times \(\tau_j\) can preempt \(\tau_i\) during an interval of length \(R_i\).

This analysis is more precise than the utilization bound: it may declare schedulable a task set that the bound rejects, and it is the definitive test for RMS.

Earliest Deadline First

Earliest Deadline First (EDF) is a dynamic priority algorithm: at each scheduling decision, the task with the earliest absolute deadline runs next. Unlike RMS, priorities change at run time — a task’s effective priority increases as its deadline approaches.

EDF Optimality. On a single processor, EDF is optimal among all preemptive scheduling algorithms: if any algorithm can schedule a feasible task set, EDF can. A task set is feasible under EDF if and only if \(U \leq 1\).

EDF achieves 100% processor utilization on feasible task sets. Where RMS guarantees schedulability only up to the \(n(2^{1/n}-1)\) bound (0.693 in the limit), EDF schedules any task set with \(U \leq 1\). This is a significant advantage in systems where processor utilization is a constraint.

The cost of EDF is implementation complexity. Priority is not static; it must be recomputed at every task activation (every period arrival) and every preemption point. A priority queue ordered by absolute deadline supports O(log n) insertion and O(log n) extraction of the minimum, with O(1) access to the front; a binary heap is the natural structure. This per-decision cost is negligible for small n but matters at high interrupt rates.

More significantly, overload behavior under EDF is poor. When the total utilization exceeds 1.0 (an overload condition), EDF provides no guarantees: any task may miss its deadline, and the choice of which task misses is unpredictable. RMS, under overload, always starves the lowest-priority task first — a predictable degradation that allows designers to identify the least-critical task and accept its deadline violations gracefully. In safety-critical systems where graceful degradation is important, the predictable overload behavior of RMS is often preferred over EDF’s higher utilization bound.

Modern Linux includes both SCHED_DEADLINE (an EDF variant with sporadic task support, added in Linux 3.14) and the SCHED_FIFO/SCHED_RR real-time scheduler (which approximates RMS). Linux 6.6 replaced the Completely Fair Scheduler (CFS) with EEVDF (Earliest Eligible Virtual Deadline First), a proportional-share scheduler with EDF-like fairness properties — though EEVDF targets throughput fairness rather than hard deadline guarantees.

Harmonic Task Sets

A harmonic task set is one in which every task’s period is an exact integer multiple of the next-shorter period: \(T_1 | T_2 | T_3 | \cdots | T_n\). For harmonic task sets, the RMS utilization bound is exactly 1.0 — 100% utilization is achievable.

This is not coincidental. In a harmonic task set, every task’s period boundary coincides with every shorter task’s period boundary. There is never a moment where a long-period task arrives mid-period of a short-period task, which is precisely the scenario that wastes utilization in non-harmonic sets. Harmonic task sets arise naturally when clock periods are powers of 2 (1 ms, 2 ms, 4 ms, 8 ms), which is one reason that power-of-2 timing structures are common in embedded systems.
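The divisibility chain is a one-pass test. A sketch, assuming periods sorted in increasing order:

```c
#include <stdbool.h>

/* True if every period exactly divides the next longer one.
   T must be sorted in increasing order. */
bool is_harmonic(const int *T, int n) {
    for (int i = 0; i + 1 < n; i++)
        if (T[i + 1] % T[i] != 0)
            return false;
    return true;
}
```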

For the train control application, task periods are naturally arranged hierarchically: sensor polling (highest rate, perhaps 50 Hz), clock tick (10 Hz), train position update (2–5 Hz), route optimization (1 Hz). These are not exactly harmonic, but they are close enough that RMS works well in practice. The W26 lecture correctly notes that “CPU [is] not bottleneck — track limits number of trains running” — the scheduling constraints are soft relative to the train dynamics.

Aperiodic and Sporadic Tasks

The Liu/Layland model covers only periodic tasks. In practice, systems also have aperiodic tasks (no fixed period, no deadline) and sporadic tasks (minimum inter-arrival time \(T_i\), but unpredictable exact arrival time, with deadline \(D_i\)).

Aperiodic tasks are typically handled by a background server: a periodic task at the lowest priority that accumulates aperiodic service capacity and uses it to respond to aperiodic requests. The polling server and the deferrable server are two classical designs. The key difference is that a polling server discards unused capacity at period boundaries (conservative but simple), while a deferrable server carries capacity forward (higher responsiveness but more complex analysis).

Sensor events in the train application are sporadic: a sensor fires when a train crosses it, not on a fixed schedule. In CS 452, sensor events are handled by the CAN RX server (which is triggered by MCP2515 interrupts) rather than by a polling-server construct. The interrupt-latency bound serves as the deadline analysis for sporadic events.

Proof Sketch: RMS Optimality

The intuition behind RMS optimality — that if any fixed-priority assignment can schedule a task set, RMS can too — rests on a simple exchange argument.

Suppose we have a fixed-priority assignment \(\pi\) that successfully schedules a task set \(\mathcal{T}\). Suppose further that in \(\pi\), some task \(\tau_i\) has higher priority than task \(\tau_j\), but \(\tau_i\) has a longer period than \(\tau_j\) (violating the RMS rule). We will show that swapping the priorities of \(\tau_i\) and \(\tau_j\) cannot cause a deadline violation.

Let \(T_i > T_j\) (τ_i has the longer period). In the current assignment, \(\tau_i\) preempts \(\tau_j\). In any interval of length \(T_j\), \(\tau_i\) may preempt \(\tau_j\) at most \(\lceil T_j / T_i \rceil = 1\) time (since \(T_i > T_j\)). So \(\tau_j\) is delayed by at most \(C_i\) per period.

After swapping priorities, \(\tau_j\) preempts \(\tau_i\). In any interval of length \(T_i\), \(\tau_j\) may preempt \(\tau_i\) up to \(\lceil T_i / T_j \rceil\) times, delaying \(\tau_i\) by at most \(\lceil T_i / T_j \rceil \cdot C_j\). But \(\tau_i\) has deadline \(T_i\), and the total interference from \(\tau_j\) over the interval \(T_i\) was already accounted for by the utilization analysis.

The formal proof uses the concept of a critical task set — the set of tasks that are schedulable at the utilization boundary — and shows by induction that any schedulable task set under any fixed-priority assignment is also schedulable under RMS. The argument is careful about ties (equal periods); for ties, either priority assignment works.

This proof establishes RMS as the “right” fixed-priority algorithm: not because it achieves the best possible schedulability bound, but because it matches the utilization bound with a priority assignment that is easy to compute and easy to justify.

The Response Time Analysis in Detail

The worst-case response time \(R_i\) of task \(\tau_i\) is the least fixed point of the recurrence

\[ R_i^{(k+1)} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \]

where \(hp(i)\) is the set of tasks with higher priority than \(\tau_i\). Starting from \(R_i^{(0)} = C_i\), each iteration includes additional interference from higher-priority tasks that could preempt during the response time interval.

The algorithm terminates in one of two ways:

  1. Convergence: \(R_i^{(k+1)} = R_i^{(k)}\). This is a fixed point; the response time is \(R_i = R_i^{(k)}\). If \(R_i \leq T_i\), task \(\tau_i\) meets its deadline.
  2. Overrun: \(R_i^{(k)} > T_i\). The task will miss its deadline; the iteration is abandoned.

The recurrence always terminates: the sequence \(R_i^{(k)}\) is non-decreasing, so it either converges to a fixed point at or below \(T_i\) or exceeds \(T_i\), at which point the task is declared unschedulable and the iteration stops. (For models with deadlines beyond the period, the hyperperiod — the LCM of all periods — serves as the ultimate bound.)
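The fixed-point iteration is mechanical enough to automate. A minimal sketch in C, assuming deadlines equal periods and tasks indexed in decreasing priority order; the names are illustrative, not from the CS 452 kernel:

```c
#include <stdint.h>

// One task: worst-case execution time C and period T (deadline = period).
// Units are arbitrary as long as C and T use the same unit.
typedef struct { int32_t c, t; } task_t;

// Worst-case response time of tasks[i], where tasks[0..i-1] all have higher
// priority. Returns -1 if the iteration exceeds the deadline (overrun).
static int32_t response_time(const task_t *tasks, int i) {
    int32_t r = tasks[i].c;                 // R_i^(0) = C_i
    for (;;) {
        int32_t next = tasks[i].c;
        for (int j = 0; j < i; j++) {       // interference from hp(i)
            int32_t n = (r + tasks[j].t - 1) / tasks[j].t;  // ceil(r / T_j)
            next += n * tasks[j].c;
        }
        if (next > tasks[i].t) return -1;   // overrun: deadline missed
        if (next == r) return r;            // fixed point reached
        r = next;
    }
}
```

Applied to the three-task example worked below (\(\tau_1: 1/4\), \(\tau_2: 2/8\), \(\tau_3: 1/8\)), it returns response times 1, 3, and 4.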

Worked response time analysis: given the three-task example from earlier (\(\tau_1: C=1, T=4\), \(\tau_2: C=2, T=8\), \(\tau_3: C=1, T=8\)):

For \(\tau_3\) (lowest priority, \(hp(3) = \{\tau_1, \tau_2\}\)):

  • \(R_3^{(0)} = C_3 = 1\)
  • \(R_3^{(1)} = 1 + \lceil 1/4 \rceil \cdot 1 + \lceil 1/8 \rceil \cdot 2 = 1 + 1 + 2 = 4\)
  • \(R_3^{(2)} = 1 + \lceil 4/4 \rceil \cdot 1 + \lceil 4/8 \rceil \cdot 2 = 1 + 1 + 2 = 4\)

Converged: \(R_3 = 4 \leq T_3 = 8\). Deadline met with 4 ms slack.

For \(\tau_2\) (\(hp(2) = \{\tau_1\}\)):

  • \(R_2^{(0)} = 2\)
  • \(R_2^{(1)} = 2 + \lceil 2/4 \rceil \cdot 1 = 2 + 1 = 3\)
  • \(R_2^{(2)} = 2 + \lceil 3/4 \rceil \cdot 1 = 2 + 1 = 3\)

Converged: \(R_2 = 3 \leq T_2 = 8\). Deadline met with 5 ms slack.

For \(\tau_1\) (highest priority, no preemption):

  • \(R_1 = C_1 = 1 \leq T_1 = 4\). Trivially feasible.

All three tasks meet their deadlines.

Schedulability Analysis for the Train Control System

Applying the Liu/Layland framework to the CS 452 task set requires assigning periods and execution times to the actual tasks. The following is a representative (not authoritative) estimate:

Task                  Period T   WCET C    Utilization C/T
Clock Notifier        10 ms      0.02 ms   0.002
Clock Server          10 ms      0.05 ms   0.005
CAN RX Notifier       4 ms       0.05 ms   0.013
CAN RX Server         4 ms       0.10 ms   0.025
UART Servers          5 ms       0.10 ms   0.020
Train Engineer (×2)   50 ms      2.0 ms    0.040 each
Route Planner         500 ms     5.0 ms    0.010
User Shell            100 ms     1.0 ms    0.010

Total utilization ≈ 0.165, far below the RMS bound of ~0.69. The train control system is comfortably schedulable. The W26 lecture’s observation that “CPU is not the bottleneck — the track limits the number of trains” is confirmed by this analysis: even with 10 trains, each requiring a separate Engineer task, the Engineer tasks alone would contribute 10 × 0.04 = 0.40, for a total of roughly 0.49 plus overhead — still well within the RMS bound.
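As a sanity check, the utilization total can be recomputed mechanically. A sketch using the representative figures from the table above (course estimates, not measurements):

```c
#include <stddef.h>

// WCET and period (both in ms) for each task in the table; the Train
// Engineer entry appears twice because two trains are assumed.
static const double wcet_ms[]   = {0.02, 0.05, 0.05, 0.10, 0.10, 2.0, 2.0, 5.0, 1.0};
static const double period_ms[] = {10,   10,   4,    4,    5,    50,  50,  500, 100};

// Total utilization U = sum of C_i / T_i.
static double total_utilization(void) {
    double u = 0.0;
    size_t n = sizeof wcet_ms / sizeof wcet_ms[0];
    for (size_t i = 0; i < n; i++) u += wcet_ms[i] / period_ms[i];
    return u;
}
```

This evaluates to ≈ 0.165, below even ln 2 ≈ 0.693 — the \(n \to \infty\) limit of the RMS bound \(n(2^{1/n} - 1)\), which is the most conservative value for any task count.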

Extending the Model: Blocking and Shared Resources

The Liu/Layland model assumes tasks are independent — no shared resources, no precedence constraints. In reality, tasks in the CS 452 system do share resources: they share the Name Server, the Clock Server, and the CAN TX Server. When a train engineer task sends to the CAN TX server, it blocks. This blocking time must be accounted for in the schedulability analysis.

The extension: when task \(\tau_i\) can be blocked by a lower-priority task (through a shared resource), its response time equation gains a blocking term \(B_i\):

\[ R_i^{(k+1)} = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \]

\(B_i\) is the longest time that \(\tau_i\) can be blocked by a single lower-priority task’s critical section. In a message-passing system, \(B_i\) is the longest time that the server processing \(\tau_i\)’s request might spend on a previous lower-priority client’s request before serving \(\tau_i\).

For the CAN TX Server, if the server is in the middle of a 5 µs SPI transaction when a high-priority train engineer sends a speed command, the engineer is blocked for 5 µs while the server finishes. This is the blocking term \(B_{\text{engineer}} = 5\) µs. Adding this to the response time analysis does not change the schedulability conclusion (5 µs is negligible compared to the 50 ms period), but it is important to account for it formally.

The practical implication: short critical sections matter. The longer the maximum time a server can be occupied serving a single client, the larger the blocking term for all higher-priority tasks waiting for that server. Breaking long operations into shorter segments (or using AwaitEvent to defer blocking to a notifier) reduces the blocking term.

Sporadic Task Analysis: The Critical Instance Theorem

For sporadic tasks with minimum inter-arrival time \(T_i^{\min}\) and deadline \(D_i \leq T_i^{\min}\), the critical instance analysis extends naturally. The worst case for task \(\tau_i\) is when it arrives at the same time as all higher-priority tasks. The response time analysis uses \(T_j^{\min}\) in place of \(T_j\):

\[ R_i^{(k+1)} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j^{\min}} \right\rceil C_j \]

This is valid because using the minimum inter-arrival time maximizes the interference — it assumes higher-priority tasks arrive as frequently as possible, which is the worst case for \(\tau_i\).

Train sensor events are sporadic with a minimum inter-arrival time determined by the train’s minimum speed and the shortest sensor-to-sensor distance. At minimum sensor separation of 100 mm and minimum speed of 50 mm/s, the minimum inter-arrival time is 2 seconds — giving an extremely low utilization contribution and negligible impact on schedulability.

Jitter and Temporal Validation

In practice, tasks are not perfectly periodic even when designed to be. Jitter is the variation in a task’s actual arrival time relative to its scheduled arrival time. Sources of jitter in the CS 452 kernel:

Timer interrupt jitter: the system timer fires at fixed intervals, but the interrupt handler may be delayed by the kernel’s non-preemptible window. If the kernel is in the middle of a Send (which takes up to 3 µs), the timer interrupt is delayed by up to 3 µs. For a 10 ms tick, 3 µs jitter is 0.03% — acceptable.

Clock server processing jitter: the clock server computes delayed replies only when it receives a tick notification. If the clock notifier is delayed (because a higher-priority task is running when the timer interrupt fires and takes time to yield), the tick processing is delayed. The notifier at priority 31 guarantees minimal jitter — it preempts everything except other priority-31 tasks (there should be none).

Release jitter: the variation between when a task should become ready (at its period boundary) and when the scheduler actually runs it. For a task at priority 20, the release jitter is bounded by the execution time of all tasks at priorities 21–31 that might be running at the moment of release. In the CS 452 system, the highest-priority tasks above 20 are mostly notifiers with very short execution times (< 1 µs), so release jitter is bounded by a few microseconds.

Measuring jitter is straightforward with the ring buffer profiling approach (Chapter 20): record the intended and actual execution start times for each task instance and compute the distribution.
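A minimal version of that ring buffer, with hypothetical names (the CS 452 kernel's actual profiling structures may differ):

```c
#include <stdint.h>

// Ring buffer of (intended, actual) release timestamps, one entry per task
// instance. Size and field names are illustrative.
#define JITTER_SLOTS 256

typedef struct {
    uint64_t intended_us[JITTER_SLOTS];
    uint64_t actual_us[JITTER_SLOTS];
    uint32_t head;      // total entries written; index wraps around
} jitter_buf_t;

// Record one task instance: when it should have started vs when it did.
static void jitter_record(jitter_buf_t *b, uint64_t intended, uint64_t actual) {
    uint32_t i = b->head % JITTER_SLOTS;
    b->intended_us[i] = intended;
    b->actual_us[i] = actual;
    b->head++;
}

// Maximum release jitter (actual - intended) over the recorded window.
static uint64_t jitter_max_us(const jitter_buf_t *b) {
    uint32_t n = b->head < JITTER_SLOTS ? b->head : JITTER_SLOTS;
    uint64_t worst = 0;
    for (uint32_t i = 0; i < n; i++) {
        uint64_t d = b->actual_us[i] - b->intended_us[i];
        if (d > worst) worst = d;
    }
    return worst;
}
```

Recording costs a few stores per instance, so it can stay enabled permanently; the maximum (rather than the mean) is what feeds the worst-case analysis.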

Historical Context: From Liu/Layland to Today

The 1973 Liu/Layland paper opened a research program that has produced hundreds of papers and several textbooks over fifty years. The key milestones:

1989 — Lehoczky, Sha, and Ding extended the analysis to exact schedulability characterization, showing that the response time recurrence gives the exact worst-case response time, not just a sufficient condition. This is the basis of the modern schedulability analysis described above.

1990 — Sha, Rajkumar, and Lehoczky introduced the Priority Ceiling Protocol and Priority Inheritance Protocol, extending the Liu/Layland model to handle shared resources. These results are described in Chapter 16.

1994 — Audsley, Burns, Richardson, and Wellings developed the response time analysis for generalized task sets with blocking and sporadic tasks. The fixed-point algorithm for \(R_i\) is their contribution.

1995 — Baruah, Rosier, and Howell extended the analysis to scheduling with deadlines earlier than periods (\(D_i < T_i\)), relevant when a task must finish before the next invocation begins but earlier than the next period boundary.

1996 — Spuri and Buttazzo developed the analysis for EDF with sporadic tasks, extending the 100% utilization result to non-trivial task sets.

2003 — Henia et al. applied WCET analysis to industrial systems and developed the Symta/S scheduling analysis tool.

2024 — Buttazzo’s 4th edition summarizes fifty years of results in a unified framework, including multicore scheduling (which remains an active research area). The CS 452 course deals with single-core scheduling, where the results are fully mature.

The endurance of the Liu/Layland framework testifies to the power of a well-chosen abstraction. The periodic task model is simple enough to admit exact analysis but rich enough to model the essential structure of real-time control systems. Extensions — sporadic tasks, blocking, EDF, multicore — build on the foundation without replacing it.


Chapter 16: Latency, WCET, and Priority Inversion

Scheduling theory provides a framework for analyzing task sets. In practice, applying the theory requires knowing the worst-case execution time (WCET) of each task — a number that is surprisingly difficult to determine accurately. And even with perfect WCET knowledge, interactions between tasks through shared resources introduce priority inversion, a phenomenon that can violate all scheduling guarantees.

Latency Taxonomy

Latency in a real-time context means the time from an event’s occurrence to the completion of the system’s response. For a train sensor event, this is the time from the moment a wheel crosses the reed switch to the moment the train control software has processed the reading and issued any necessary speed or switch commands.

This end-to-end latency decomposes into:

  1. Interrupt latency: time from the physical event to the processor entering the interrupt handler. Dominated by the kernel’s longest non-preemptible operation (the window during which the DAIF bits mask interrupts). In the CS 452 kernel, this is bounded by the worst-case system call duration — the most expensive of Send, Receive, or Reply, including context switch cost.

  2. Handler latency: time within the interrupt handler and the AwaitEvent unblocking path. For the GIC-400 with a simple IRQ handler, this is in the range of 1–5 µs.

  3. Scheduling latency: time for the awakened task to reach the CPU once it is ready. If the awakened task has the highest priority among ready tasks, this is zero — the kernel switches to it as soon as the handler returns. Otherwise it waits until all higher-priority tasks block. If multiple tasks share its priority, FIFO order applies.

  4. Task execution time: the time the task spends computing the response.

The total latency is the sum of all four. For hard real-time correctness, the sum must not exceed the deadline.

The distinction between average latency and worst-case latency is critical and often misunderstood. Average latency is a throughput metric. Worst-case latency is a correctness metric. A system that handles 99% of events in 1 ms but 1% of events in 100 ms has a worst-case latency of 100 ms regardless of its average. For hard real-time systems, the 1% case is what matters.

Worst-Case Execution Time

Worst-case execution time (WCET) analysis attempts to bound from above the execution time of a piece of code on a specific processor under the worst possible circumstances. This is harder than it sounds.

The principal sources of timing variability are:

Instruction cache behavior. A cache-cold first execution is much slower than a cache-warm subsequent execution. WCET analysis must either assume fully cold caches (pessimistic) or prove that certain cache lines will always be warm (complex static analysis).

Branch prediction. Modern out-of-order processors speculatively execute instructions past branches. When a branch is mispredicted, the speculative work is discarded and the pipeline is refilled — a penalty of 10–20 cycles on the Cortex-A72. WCET analysis must bound the number of mispredictions.

Memory access patterns. DRAM access latency is 50–100 cycles when the cache is cold. WCET analysis must identify which accesses may miss the cache and bound their cost.

Peripheral access timing. SPI transactions have a fixed bit-clock and a known maximum duration. Reading from the MCP2515 is bounded by the SPI transaction length and clock rate.

Two approaches to WCET:

Static analysis builds a model of the processor’s microarchitectural state (cache occupancy, pipeline fill, branch predictor history) and computes a bound analytically. Tools like OTAWA, aiT (AbsInt), and Chronos implement variations of this approach. The result is a provably correct upper bound — but the models are conservative, often significantly overestimating actual WCET. Exact WCET determination is undecidable in full generality (it subsumes the halting problem); practical tools restrict the program model — bounded loops, no unbounded recursion — to make the analysis feasible.

Measurement-based WCET runs the code many times with varied inputs and cache states and takes the maximum observed execution time, possibly with a safety margin added. This is practical but cannot provide a formal guarantee — there may always be a worse input not covered by the test set.

In CS 452, measurement-based WCET is the standard: run each kernel primitive many times (using the system timer), compute the maximum, and use it in scheduling analysis. The margin of error is accepted in exchange for implementation simplicity. For safety-critical systems (avionics, automotive), static analysis tools would be required.

Priority Inversion

Priority inversion occurs when a higher-priority task is forced to wait for a lower-priority task due to a shared resource. In a pure message-passing system with no shared memory, priority inversion is impossible by design — there are no shared resources to compete for. But the moment any resource is shared — a mutex, a hardware peripheral, a spinlock in the kernel — priority inversion can occur.

The mechanism is straightforward. Suppose task H (high priority), task M (medium priority), and task L (low priority) share a mutex \(\mu\).

  1. L acquires \(\mu\).
  2. H runs, needs \(\mu\), blocks on the mutex.
  3. M becomes ready. Since H is blocked (not running), M preempts L.
  4. M runs to completion. During M’s execution, H is effectively waiting for M to complete, even though H has higher priority than M.

This is priority inversion: H’s effective priority is M’s priority (or lower) because it cannot run until L finishes, and L cannot run until M is done.

The Mars Pathfinder Incident

The most cited real-world consequence of priority inversion is the Mars Pathfinder mission of 1997. After landing successfully, the spacecraft’s computer began experiencing system resets that interrupted science data collection. NASA engineers diagnosed the problem remotely — the first time a software patch was applied to a robot on another planet — and the mission recovered, but the incident became a canonical case study in real-time systems engineering.

The root cause: the spacecraft ran VxWorks (a commercial RTOS). Three tasks were relevant: an information bus task (high priority) that managed a shared resource (an ASI/MET bus) using a mutex, a meteorological data task (medium priority) with no connection to the mutex, and a data collection task (low priority) that also used the mutex.

The data collection task held the mutex. The information bus task tried to acquire it and blocked. The meteorological task preempted the data collection task (medium > low priority). With the data collection task unable to run and release the mutex, the information bus task remained blocked indefinitely. VxWorks included a watchdog timer that reset the processor if the information bus task was not serviced within a deadline. The watchdog fired, and the system reset.

VxWorks supported priority inheritance (the standard solution) but it was not enabled on Pathfinder, apparently due to a last-minute decision to avoid adding complexity. Enabling priority inheritance required setting one flag in the mutex initialization. The patch was uplinked, priority inheritance was enabled, and the resets stopped.

This story contains several lessons. First, priority inversion is a real danger, not a theoretical concern. Second, the solution (priority inheritance) was available but not used, for reasons that seemed sensible at the time. Third, the bug was intermittent and rare — it required a specific timing coincidence to manifest, which is why testing did not catch it. And fourth, the hardware-software architecture of Pathfinder was sophisticated enough to allow a remote software patch — a remarkable achievement for 1997.

The engineering post-mortem — Glenn Reeves’s first-hand account from the JPL flight software team, circulated widely alongside Mike Jones’s “What really happened on Mars?” summary — is a masterclass in RTOS debugging methodology. Key details from that account:

The system had been operating normally for months in ground testing. The timing conditions for the bug to manifest required:

  1. A science data gathering period (which caused high data-bus traffic from the meteorological data task)
  2. The information bus task to be active during that period
  3. The data collection task to hold the shared mutex at the moment the information bus task needed it

This three-way coincidence was rare enough to escape ground testing but frequent enough to recur over a multi-month mission, which is consistent with the observed failures. The lesson for CS 452: bugs that manifest rarely in testing may manifest reliably in production, where operation continues for longer and workloads differ from test conditions.

The remote debugging procedure itself is instructive. The VxWorks environment on Pathfinder retained a copy of the system symbol table (unlike production embedded firmware that strips symbols). This allowed the engineers to use VxWorks’s interactive shell (over the DSN radio link, with 10-minute one-way propagation delay to Mars) to set breakpoints, read memory, and diagnose the issue. The engineers could type commands into a terminal on Earth and, 10 minutes later, see results from the running spacecraft. This demonstrates the value of keeping debug facilities available in production systems — firmware stripped of symbols and debug facilities would have made the diagnosis far more difficult.

Priority Inversion in Other Domains

Beyond Pathfinder, priority inversion has caused notable incidents in other systems:

The L-1011 incident (1997): a real-time avionics system experienced priority inversion on a shared resource, causing a secondary flight computer to fail to update its status within a deadline. The watchdog reset the computer during flight, causing a brief instrument failure. No crash resulted, but the incident triggered an investigation that led to mandatory use of priority ceiling protocols in safety-critical avionics software under DO-178B/C.

Mars Spirit Rover (2004): the Mars Spirit rover suffered from a filesystem management issue that was exacerbated by priority-related scheduling problems. The rover’s flashfile system (a VxWorks component) was operating in a mode where it consumed excessive CPU time, starving other tasks. This is not strictly priority inversion (the consuming task had appropriate priority), but illustrates how resource consumption bugs in RTOS systems have similar symptoms: a critical task misses its deadline because lower-priority work is consuming excessive time.

Railway interlocking systems: multiple incidents in European railways have been traced to priority-related faults in programmable logic controllers (PLCs) used for interlocking. PLCs typically run cyclic executives rather than preemptive RTOS, so the failures manifest as cyclic-executive overruns (one task runs long, pushing subsequent tasks past their deadlines) rather than priority inversion per se. The effect is similar: a safety function that expects to execute within a deadline fails to do so.

These incidents illustrate that timing faults in safety-critical systems are not hypothetical classroom exercises. They are the dominant class of real-world RTOS failures.

The Response Time Analysis Under Priority Inversion

When a system uses mutexes with priority inheritance, the response time recurrence must include the blocking term \(B_i\) introduced in Chapter 15:

\[ R_i^{(k+1)} = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j \]

The blocking time \(B_i\) for task \(\tau_i\) is the maximum time that \(\tau_i\) can be blocked by lower-priority tasks through shared mutexes. The simplest bound, exact under the Priority Ceiling Protocol (where a task is blocked by at most one critical section):

\[ B_i = \max_{k : p(\tau_k) < p(\tau_i)} \max_{j : \tau_k \text{ uses } \mu_j \text{ and } \tau_i \text{ uses } \mu_j} CS(k, j) \]

where \(CS(k, j)\) is the duration of task \(\tau_k\)’s critical section on mutex \(\mu_j\). Informally: \(B_i\) is the duration of the longest critical section that any lower-priority task holds, for a mutex that \(\tau_i\) itself also uses. Under plain priority inheritance with multiple shared mutexes, chained blocking is possible — \(\tau_i\) can be blocked once per distinct lower-priority task that shares a mutex with it — and the max is replaced by a sum over those tasks.

For the Pathfinder scenario: the blocking term for the information bus task (high priority) due to the data collection task (low priority) is the duration of the data collection task’s critical section on the shared mutex — approximately 1 second in the worst case. Adding \(B = 1\) s to the information bus task’s response time: if the task’s period is 0.5 s, the response time \(R > T\) and the deadline is missed.

With priority inheritance enabled: the data collection task’s priority is temporarily raised to the information bus task’s priority when the information bus task is blocked. This eliminates the preemption by the meteorological task, reducing the critical section duration to the actual execution time of the data collection task’s body (a few milliseconds). \(B\) drops from ~1 s to ~5 ms; the deadline is safely met.

This quantitative analysis shows why priority inheritance is not merely “nice to have” — it is the difference between a bounded blocking time (suitable for schedulability analysis) and an unbounded one (unsuitable for any formal guarantee).

Priority Inheritance Protocol

The Priority Inheritance Protocol (PIP) resolves priority inversion by temporarily raising the priority of a mutex holder to the maximum priority of all tasks waiting for the mutex. When L holds \(\mu\) and H is waiting:

  1. L’s priority is raised to H’s priority.
  2. L now preempts M (since L’s effective priority is now H’s).
  3. L completes its critical section and releases \(\mu\).
  4. L’s priority drops back to its original value.
  5. H acquires \(\mu\) and runs.

PIP eliminates the pathological case where M indefinitely delays H by preventing M from preempting a mutex-holding L that is effectively running on behalf of H. The blocking time for H is bounded by the duration of L’s critical section — a well-defined and typically short quantity.

PIP does not prevent deadlock. If H tries to acquire \(\mu_1\) then \(\mu_2\), while L holds \(\mu_2\) and tries to acquire \(\mu_1\), neither can proceed. For deadlock prevention, the Priority Ceiling Protocol (PCP) extends PIP: each mutex has a priority ceiling equal to the highest priority of any task that may acquire it, and a task may only acquire a mutex if its current priority is strictly higher than the ceilings of all mutexes currently held by other tasks.

In the CS 452 message-passing kernel, these protocols are not needed — there are no mutexes. The “resource protection through server ownership” pattern provides equivalent protection without the complexity. The track server owns the track state; any task that needs to interact with the track state sends to the track server and waits. The server processes requests sequentially. There is no shared state, therefore no lock, therefore no priority inversion.

The Priority Ceiling Protocol in Detail

The Priority Ceiling Protocol (Sha, Rajkumar, and Lehoczky, 1990) is worth understanding even though the CS 452 kernel doesn’t need it, because it appears in every serious embedded RTOS (OSEK/VDX in automotive ECUs, ARINC 653 in avionics, POSIX with SCHED_FIFO and the PTHREAD_PRIO_PROTECT mutex type).

The PCP assigns to each mutex \(\mu_j\) a ceiling \(C(\mu_j)\) equal to the maximum priority of any task that may ever acquire \(\mu_j\). At any moment, the system ceiling is \(\Omega = \max_j C(\mu_j)\) over all currently-held mutexes.

A task \(\tau_i\) may acquire mutex \(\mu_j\) only if \(p(\tau_i) > \Omega\), where \(\Omega\) is computed over the mutexes held by tasks other than \(\tau_i\) (so a task can nest acquisitions of its own mutexes). If this condition fails, \(\tau_i\) is blocked — even if \(\mu_j\) is currently free.

This single rule has a remarkable consequence: under the PCP, a task can be blocked at most once, by at most one lower-priority critical section. The worst-case blocking time for any task is therefore bounded by the duration of a single critical section — making the analysis tractable for hard real-time systems.

Why at most one blocking critical section? Suppose \(\tau_H\) is blocked by the PCP at a mutex \(\mu_j\). This means \(p(\tau_H) \leq \Omega\). Since \(\Omega = C(\mu_k)\) for the highest-ceiling held mutex \(\mu_k\), and \(C(\mu_k)\) is the maximum priority among tasks that use \(\mu_k\), some task with priority at least \(p(\tau_H)\) shares \(\mu_k\). The task holding \(\mu_k\) must have acquired it when the system ceiling was below its own priority — meaning no other mutex capable of blocking \(\tau_H\) was held at that moment. By induction over acquisitions, at most one held critical section can block \(\tau_H\) before it runs.

PCP vs PIP comparison: PIP prevents priority inversion but allows chains of blocked tasks (so-called chained blocking). PCP prevents chained blocking but requires knowing, at mutex design time, which tasks will use each mutex. In a dynamically-configured system this is impractical; in a static embedded system (like an OSEK application where all tasks and mutexes are known at compile time), PCP is the standard choice.

In POSIX, pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_PROTECT) selects the ceiling protocol, and pthread_mutexattr_setprioceiling sets the ceiling value. Linux supports this in glibc’s pthreads implementation, though PTHREAD_PRIO_INHERIT is the more commonly used protocol.

Latency Analysis for the Train Control Critical Path

Applying the latency taxonomy to the train control system: the critical path is from a physical sensor firing to the command being transmitted to the CS3.

Reed switch fires
  → MCP2515 detects CAN frame (hardware delay: <1 µs)
  → MCP2515 asserts INT pin (GPIO 17)
  → GIC-400 detects GPIO interrupt
  → Kernel IRQ handler begins (~2–5 µs, bounded by longest non-preemptible kernel operation)
  → CAN RX Notifier unblocked (ReplyWait → Ready)
  → Notifier runs (priority 30), reads MCP2515 via SPI (~5 µs for 13 bytes at 10 MHz)
  → Notifier Send()s CAN frame to CAN RX Server
  → CAN RX Server runs (priority 24), parses Märklin sensor event
  → Server Send()s to Track Server
  → Track Server updates position estimate, checks for approaching hazard
  → Track Server Send()s speed command to CAN TX Server
  → CAN TX Server writes speed command to MCP2515 TX buffer via SPI (~5 µs)
  → MCP2515 arbitrates for CAN bus and transmits (~100 µs at 250 kbit/s)
  → CS3 receives command, applies to DCC track output (~10 ms for DCC cycle)

The end-to-end latency from sensor firing to speed command applied is dominated by the DCC track cycle (10 ms) and the CAN bus transmission (100 µs), not by the software. Software latency — from interrupt to MCP2515 TX write — is on the order of 50–100 µs, well under 1% of the total. This is consistent with the W26 lecture’s observation that the CPU is not the bottleneck.

However, if the system has many simultaneous events (multiple trains all crossing sensors within a few milliseconds), the software serialization becomes relevant. The Track Server processes events sequentially; 10 simultaneous events at 20 µs each = 200 µs of serialized processing, during which later events are queued. Priority scheduling ensures that the most urgent event (the train closest to a potential collision) is processed first if the CAN RX Server sends to the Track Server with the appropriate urgency.


Part V: The Train Application

Chapter 17: Train Kinematics and Calibration

Controlling a model train requires knowing where the train is at all times. Sensors tell you when the train passes fixed points on the track, but between sensors, you must predict the train’s position from its last known location and its current speed. This requires an accurate model of the relationship between the speed command sent to the CS3 and the actual physical velocity of the train.

That relationship is non-linear, train-specific, and subject to variation with temperature, motor wear, and track cleanliness. No formula from first principles gives it — it must be measured empirically. The calibration procedure described in this chapter is the foundation on which all train control rests.

Velocity as an Inherently Averaged Quantity

A train’s instantaneous velocity is not directly observable in this system. The system timer gives you the time at which two events occurred; the track geometry gives you the distance between two sensors. Velocity is derived:

\[ \bar{v} = \frac{d}{t_2 - t_1} \]

where \(d\) is the distance between sensors in millimetres and \(t_1, t_2\) are the sensor trigger times in microseconds. This is an average velocity over the inter-sensor interval, not an instantaneous velocity at either sensor. If the train was accelerating during the interval, the average is less than the final velocity. If it was decelerating (braking), the average is more than the final velocity. Only during constant-velocity travel does the average equal the instantaneous velocity.

The practical consequence: calibrate at constant velocity. Set a speed command, wait long enough for the train to reach steady state (a full acceleration cycle has passed), then time the train over a known distance. The result is the steady-state velocity for that speed command.

There is a subtle arithmetic trap in averaging velocities. Suppose a train traverses two equal-length segments: 100 mm in 10 s (v₁ = 10 mm/s) and 100 mm in 20 s (v₂ = 5 mm/s). The arithmetic mean of the velocities is 7.5 mm/s, but the total distance is 200 mm in 30 s, giving a true average of 6.7 mm/s. The correct procedure is: average the times, then compute velocity, not average the velocities directly. Equivalently, record the raw time-and-distance pairs and derive velocity from the total rather than averaging partial velocities. The distinction matters for calibration data averaged over multiple runs.
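The trap is easy to demonstrate in the kernel's integer-only arithmetic. A sketch using ×10 fixed-point velocities (one decimal place; names illustrative):

```c
#include <stdint.h>

// One calibration run: distance in mm, elapsed time in seconds.
typedef struct { int32_t d_mm; int32_t t_s; } run_t;

// Wrong: arithmetic mean of per-segment velocities (mm/s, x10 fixed point).
static int32_t mean_of_velocities_x10(const run_t *r, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++) sum += (r[i].d_mm * 10) / r[i].t_s;
    return sum / n;
}

// Right: total distance over total time (mm/s, x10 fixed point).
static int32_t velocity_from_totals_x10(const run_t *r, int n) {
    int32_t d = 0, t = 0;
    for (int i = 0; i < n; i++) { d += r[i].d_mm; t += r[i].t_s; }
    return (d * 10) / t;
}
```

For the two segments above (100 mm in 10 s, 100 mm in 20 s), the wrong mean gives 75 (7.5 mm/s) while the totals give 66 (6.6 mm/s — the text’s 6.7 rounds where this integer division truncates).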

Exponentially Weighted Moving Average

Train velocity varies continuously with track conditions, motor temperature, and battery voltage (if battery-powered). A calibration table built once at the start of a session may drift over time. The Exponentially Weighted Moving Average (EWMA) provides a lightweight adaptive filter:

\[ v_{\text{new}} = (1 - \alpha) \cdot v_{\text{old}} + \alpha \cdot v_{\text{measured}} \]

where \(\alpha \in (0, 1)\) is the learning rate (sometimes called the smoothing factor). A large \(\alpha\) (close to 1) makes the estimate respond quickly to new measurements but is noisy. A small \(\alpha\) (close to 0) gives a smooth but slow-tracking estimate.

For train velocity tracking, \(\alpha = 0.1\) to \(\alpha = 0.2\) is typical: the estimate changes slowly enough to filter out single-measurement noise (caused by sensor debounce timing or minor track irregularities) but quickly enough to track gradual velocity changes over tens of seconds.

Choosing \(\alpha\): if the train’s velocity changes significantly over one sensor-to-sensor interval, use a larger \(\alpha\). If the dominant noise is measurement noise (variation in sensor trigger times due to timing resolution), use a smaller \(\alpha\). In practice, both sources of variation are present, and \(\alpha\) is tuned empirically.

Fixed-Point Arithmetic

The Cortex-A72 has a floating-point unit and supports IEEE 754 single and double precision. However, the kernel is compiled with -mgeneral-regs-only, which forbids the compiler from using the FP/SIMD registers and thus avoids saving 512 bytes of SIMD state (32 registers × 16 bytes) on every context switch. Train control calculations — velocity, distance, stopping distance — must therefore be done in integer arithmetic with fixed-point representation.

In fixed-point arithmetic, all values are integers where the unit is not 1 but some power of 10 or power of 2. The most natural choice for train control is to represent velocities in mm/s as integers (no scaling needed — a train’s maximum velocity on the Märklin layout is under 1000 mm/s), and distances in mm as integers. Times come from the system timer in microseconds (µs).

Velocity computation from sensor data:

// distance_mm: integer, sensor-to-sensor distance in mm
// dt_us: integer, time difference in microseconds
// returns velocity in mm/s (integer)
int32_t compute_velocity(int32_t distance_mm, int32_t dt_us) {
    // v = d/t = distance_mm / (dt_us / 1000000)
    //         = distance_mm * 1000000 / dt_us  (mm/s)
    return (distance_mm * 1000000L) / dt_us;
}

The multiplication distance_mm * 1000000L is only safe when long is 64 bits (as it is on AArch64); on a platform with 32-bit long, distances above ~2147 mm overflow the intermediate product. Make the width explicit with int64_t rather than relying on the platform:

int32_t compute_velocity(int32_t distance_mm, int32_t dt_us) {
    return (int32_t)((int64_t)distance_mm * 1000000LL / dt_us);
}

For EWMA, the scaling factor α is represented as an integer fraction. With α = 0.1 (1/10):

// EWMA: v_new = (9 * v_old + 1 * v_measured) / 10
int32_t ewma_update(int32_t v_old, int32_t v_measured) {
    return (9 * v_old + v_measured) / 10;
}

The division by 10 introduces rounding error of at most 1 mm/s per update — negligible compared to measurement noise.
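For α values other than 1/10, the same pattern generalizes to any integer fraction; adding half the denominator before dividing rounds to nearest instead of truncating toward zero. A sketch (the constant names are illustrative, not course-mandated):

```c
#include <stdint.h>

#define A_NUM 1    /* alpha = A_NUM / A_DEN = 0.1 */
#define A_DEN 10

/* Fixed-point EWMA with round-to-nearest instead of truncation. */
int32_t ewma_update_rounded(int32_t v_old, int32_t v_meas) {
    int64_t acc = (int64_t)(A_DEN - A_NUM) * v_old + (int64_t)A_NUM * v_meas;
    return (int32_t)((acc + A_DEN / 2) / A_DEN);  /* add half before dividing */
}
```

For example, ewma_update_rounded(100, 109) computes (900 + 109 + 5) / 10 = 101, where the truncating version would give 100.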

Stopping Distance

A train at speed \(v\) (mm/s) issued a stop command at time \(t_0\) will continue moving during the deceleration phase. The stopping distance \(d_{\text{stop}}\) is the distance traveled from the moment the stop command is issued until the train halts.

\[ d_{\text{stop}} = \frac{v^2}{2a} \]

where \(a\) is the deceleration rate in mm/s². However, train deceleration is far from constant — it depends on the DCC decoder’s braking curve, the motor’s back-EMF characteristics, the grade of the track, and the train’s inertia. A constant-deceleration model is a rough approximation.
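As a sanity check against the measured table, the constant-deceleration formula can be evaluated in the same fixed-point style (a sketch; the deceleration parameter is the assumed constant rate, not a measured quantity):

```c
#include <stdint.h>

/* d = v^2 / (2a) in integer arithmetic.
 * v in mm/s, a in mm/s^2, result in mm. 64-bit intermediate avoids
 * overflow of v^2 for v up to ~1000 mm/s. */
int32_t stop_dist_const_decel_mm(int32_t v_mmps, int32_t a_mmps2) {
    if (a_mmps2 <= 0) return 0;
    return (int32_t)(((int64_t)v_mmps * v_mmps) / (2 * (int64_t)a_mmps2));
}
```

A train at 500 mm/s decelerating at a constant 250 mm/s² would stop in 500 mm; comparing this against the measured table gives a rough estimate of the effective deceleration rate.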

The practical approach is to measure stopping distance directly. For each speed level, command the train to stop from that speed at a fixed mark on the track and measure how far it travels before stopping. Repeat several times and average. This builds a lookup table indexed by speed level:

// stopping_distance_mm[speed_level] for speed_levels 1..31
int32_t stopping_distance_mm[MAX_SPEED_LEVEL + 1];

For speed levels not directly measured, linear interpolation between adjacent measurements:

int32_t interp_stop_distance(int speed_level) {
    int lo = speed_level / STEP * STEP;      // nearest measured level below
    int hi = lo + STEP;                       // nearest measured level above
    if (hi > MAX_SPEED_LEVEL) return stopping_distance_mm[MAX_SPEED_LEVEL];
    int32_t d_lo = stopping_distance_mm[lo];
    int32_t d_hi = stopping_distance_mm[hi];
    return d_lo + (d_hi - d_lo) * (speed_level - lo) / STEP;
}

The result is used to compute the stopping point: the train should be commanded to stop when its current position is exactly one stopping distance before the target point. Since position is estimated continuously from EWMA velocity and elapsed time, the stopping command should be issued when:

\[ \text{distance\_to\_target} \leq d_{\text{stop}}(\text{current\_speed}) \]
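Combining the dead-reckoned position estimate with the stopping-distance lookup, the stopping decision can be sketched as follows (names are illustrative; the stopping distance is assumed to have been looked up already):

```c
#include <stdint.h>
#include <stdbool.h>

/* Should the stop command be issued now? Position is dead-reckoned
 * from the last confirmed sensor using the EWMA velocity estimate. */
bool should_stop_now(int32_t target_pos_mm, int32_t last_sensor_pos_mm,
                     int32_t velocity_mmps, uint32_t now_us,
                     uint32_t last_sensor_time_us, int32_t stop_dist_mm) {
    int32_t est_pos_mm = last_sensor_pos_mm +
        (int32_t)((int64_t)velocity_mmps * (now_us - last_sensor_time_us) / 1000000);
    return (target_pos_mm - est_pos_mm) <= stop_dist_mm;
}
```

This check runs on every clock tick; once it becomes true, the speed-0 command is sent and the train coasts the remaining stopping distance to the target.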

Acceleration Modelling

Between a speed command change and the new steady-state velocity, the train undergoes an acceleration phase of variable duration. During this phase, the EWMA velocity estimate is unreliable because the measured velocities are changing and the EWMA is averaging over a transient.

Assuming constant acceleration \(a\) (a useful approximation), two scenarios arise when the train passes a sensor during an acceleration phase:

Scenario 1: Acceleration complete before the sensor. The train reaches its target velocity \(v_t\) before passing the sensor. In this case, the sensor measurement yields the steady-state velocity directly.

Scenario 2: Acceleration not complete at the sensor. The train is still accelerating when it passes the sensor. The sensor-to-sensor average velocity is less than \(v_t\). From the average velocity over the inter-sensor interval, one can estimate the velocity at sensor crossing:

\[ v_{\text{sensor}} \approx 2 \bar{v} - v_0 \]

where \(v_0\) is the velocity at the first sensor (if known) and \(\bar{v}\) is the measured average. Under constant acceleration this is exact: the time-averaged velocity equals the mean of the endpoint velocities, \(\bar{v} = (v_0 + v_{\text{sensor}})/2\). For real, non-constant acceleration profiles it remains an adequate approximation for practical calibration.

The Full Calibration Procedure

Building an accurate velocity model for a Märklin locomotive is a multi-step offline process. The following protocol provides repeatable results in a typical 3-hour lab session.

Step 1: Choose a calibration track segment. Select a straight segment of known length, with sensors at both ends. For best accuracy, the segment should be long enough that the train reaches steady-state velocity before the first sensor and remains at steady state until after the second. A segment of 600–1000 mm with sensors at both endpoints is suitable. Use the track graph to determine the exact sensor-to-sensor distance.

Step 2: Set a speed command. Issue a speed command to the train at the chosen Märklin speed level (1–14 for the slow steps, or 1–31 for the higher precision range). Wait for the train to complete at least two full sensor-to-sensor runs at that speed to allow the motor to thermally stabilize.

Step 3: Record sensor times. For each calibration run:

Sensor A fires at t₁ (µs, from system timer C1)
Sensor B fires at t₂ (µs)
Δt = t₂ - t₁

Record at least 5 runs at each speed level. Reject outliers (runs where \(\Delta t\) differs from the median by more than 10%).
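The median-based rejection rule can be sketched as follows (the in-place compaction interface and the 64-run cap are assumptions of this sketch):

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_i32(const void *a, const void *b) {
    int32_t x = *(const int32_t *)a, y = *(const int32_t *)b;
    return (x > y) - (x < y);
}

/* Keep runs whose dt is within 10% of the median; accepted values are
 * compacted to the front of dt_us. Returns the number kept. */
int reject_outliers(int32_t *dt_us, int n) {
    int32_t sorted[64];
    if (n <= 0 || n > 64) return 0;
    for (int i = 0; i < n; i++) sorted[i] = dt_us[i];
    qsort(sorted, n, sizeof(int32_t), cmp_i32);
    int32_t median = sorted[n / 2];
    int kept = 0;
    for (int i = 0; i < n; i++) {
        int32_t dev = dt_us[i] - median;
        if (dev < 0) dev = -dev;
        if ((int64_t)dev * 10 <= median)   /* |dt - median| <= 10% of median */
            dt_us[kept++] = dt_us[i];
    }
    return kept;
}
```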

Step 4: Compute average velocity. For each run, \(v = d / \Delta t \times 10^6\) (mm/s). Compute the mean over all accepted runs. This gives the steady-state velocity at the chosen speed level.

Step 5: Populate the velocity table. Store the result in a lookup array:

static int32_t velocity_table[MAX_SPEED_LEVEL + 1];  // mm/s indexed by speed level
velocity_table[speed_level] = mean_velocity_mmps;

Step 6: Repeat for all speed levels. Märklin speed levels 1–14 (or 1–31 depending on the address format) span from barely creeping (perhaps 50 mm/s) to full speed (perhaps 800–1000 mm/s). Calibrating all 31 levels takes approximately 30–45 minutes. An alternative is to measure every other level (0, 2, 4, …) and interpolate the rest.

Step 7: Calibrate stopping distances. For each speed level to be used in actual operation, measure the stopping distance:

  • Place the train at the measured speed level over sensor A.
  • Issue a stop command (speed 0) exactly when sensor A fires.
  • Measure where the train stops (by seeing which sensor it next triggers, or by physical measurement).
  • Repeat 5 times and average.
static int32_t stopping_table[MAX_SPEED_LEVEL + 1];  // mm indexed by speed level

Step 8: Cross-validate. With the velocity and stopping tables populated, run the full control loop and observe whether the train stops at its intended target. Adjust the stopping distance values empirically if the train consistently overshoots or undershoots.

The velocity table is train-specific and session-specific: a different locomotive (different DCC decoder, different motor) will have a completely different table, and even the same locomotive may drift between sessions as the motor warms up and track conditions change. This is why the EWMA filter (a runtime mechanism, not a calibration step) is important: it continuously corrects the table against observed sensor data.

The Velocity Table Data Structure

The velocity table is a simple array, but the data access pattern during operation is worth considering. During steady-state train control, velocity_table[current_speed] is accessed on every tick (10 ms) to update position estimates. Since MAX_SPEED_LEVEL is 31 and each entry is a 4-byte int32_t, the entire table fits in 128 bytes — well within the Cortex-A72’s L1 data cache. After the first few accesses at startup, all velocity lookups will hit L1 with a 4-cycle access time.

The stopping table is accessed less frequently (only when a stopping decision must be made), but it too is small enough for L1 residence. These tables exemplify the “compute offline, look up online” principle: the expensive work (calibration runs, EWMA fitting) is done once, and the results are stored for O(1) access during the real-time control loop.

For implementation, define the tables with const if they are compile-time constants. If they were populated by a calibration task and read by a control task on a different core, volatile alone would not suffice: it prevents the compiler from caching values in registers but provides no cross-core memory ordering, so barriers or atomics would be needed. On CS 452's single-core RPi the question does not arise:

/* Runtime calibration table — written once during calibration, then read-only */
typedef struct {
    int32_t velocity_mmps[MAX_SPEED_LEVEL + 1];
    int32_t stopping_mm[MAX_SPEED_LEVEL + 1];
    int32_t accel_time_ms[MAX_SPEED_LEVEL + 1];  /* estimated acceleration time */
} CalibrationTable;

static CalibrationTable cal;  /* global, populated at startup */

Connecting Calibration to Control Theory: From EWMA to PID

The EWMA filter is the simplest member of a broader family of recursive estimators. Understanding where EWMA sits in this family clarifies when to use it and when more sophisticated alternatives are needed.

The EWMA update rule \(v_{\text{new}} = (1-\alpha)v_{\text{old}} + \alpha v_{\text{measured}}\) is equivalent to a first-order IIR (Infinite Impulse Response) low-pass filter with a single pole at \(z = 1-\alpha\) in the z-domain. Its frequency response attenuates noise above the cutoff frequency \(f_c = -\ln(1-\alpha) / (2\pi T_s)\) where \(T_s\) is the sampling period (inter-sensor time). For \(\alpha = 0.1\) and \(T_s = 1\) s (one sensor per second), \(f_c \approx 0.016\) Hz — it passes very slow velocity changes and rejects everything faster. This is appropriate for thermal drift and motor wear but too slow if the train’s velocity changes intentionally (as during a speed command transition).

The logical extension of EWMA to full closed-loop control is the PID controller. A PID controller computes a control output \(u\) from a position error \(e = \text{target} - \text{actual}\) using three terms:

\[ u(t) = K_P e(t) + K_I \int_0^t e(\tau) d\tau + K_D \frac{de}{dt} \]
  • Proportional term \(K_P e\): respond to current error. Analogous to the EWMA’s correction term.
  • Integral term \(K_I \int e\): respond to accumulated error (eliminates steady-state offset from friction and slope).
  • Derivative term \(K_D \dot{e}\): respond to rate of change (provides damping, prevents overshoot).

For train stopping control, the “plant” (the train’s response to a speed command) is approximately a first-order system with a time constant \(\tau\) equal to the acceleration time. A P-controller (no integral, no derivative) with gain \(K_P = 1/\tau\) gives approximately critical damping. Full PID is overkill for train control — the trains don’t need the precision of, say, a quadcopter attitude controller — but understanding PID puts EWMA in context.
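The discrete form of the PID law, in the same fixed-point style used for the rest of the train code, might look like this (the gain scaling by 1000 and all names are assumptions of this sketch, not course-mandated values):

```c
#include <stdint.h>

/* Discrete PID step in integer arithmetic. Gains are stored
 * premultiplied by 1000 to preserve fractional values. */
typedef struct {
    int32_t kp_m, ki_m, kd_m;   /* gains x1000 */
    int64_t integral;           /* accumulated error x dt (ms) */
    int32_t prev_err;
} Pid;

int32_t pid_step(Pid *p, int32_t err, int32_t dt_ms) {
    p->integral += (int64_t)err * dt_ms;
    int32_t deriv = (dt_ms > 0) ? (err - p->prev_err) * 1000 / dt_ms : 0;
    p->prev_err = err;
    int64_t u = (int64_t)p->kp_m * err          /* proportional */
              + p->ki_m * p->integral / 1000    /* integral */
              + (int64_t)p->kd_m * deriv;       /* derivative */
    return (int32_t)(u / 1000);                 /* undo the x1000 scaling */
}
```

With ki_m and kd_m zeroed, this reduces to the pure P-controller described above.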

In CS 452, students typically implement a simplified control law: if the train is within one stopping distance of the target, issue a stop command. If the train overshoots the target (detected by sensor attribution showing the train past the target), issue a reverse command with a calibrated short stop pulse. This bang-bang approach (full speed ahead / full stop / reverse) is robust and simple, at the cost of some positional precision.

Multi-Train Velocity Tracking

When multiple trains are active simultaneously, each train requires its own velocity estimate and calibration table. The track server (Chapter 18) maintains a per-train state structure that includes:

typedef struct {
    int        loco_id;             /* Märklin locomotive UID */
    int        speed_level;         /* current commanded speed */
    int32_t    velocity_mmps;       /* EWMA-filtered velocity estimate */
    int32_t    last_sensor_pos_mm;  /* track graph position of last sensor */
    uint32_t   last_sensor_time_us; /* system time when last sensor fired */
    int        next_sensor_id;      /* predicted next sensor */
    uint32_t   expected_arrival_us; /* predicted arrival time at next sensor */
    int        reserved_ahead;      /* number of segments reserved ahead */
} TrainState;

The velocity and position estimates for each train are updated independently on each sensor event attributed to that train. Since sensor attribution (Chapter 18) happens before the state update, the state machine is: receive sensor event → attribute to train → update TrainState → recompute next sensor prediction → check stopping conditions.

Because each train has its own calibration table, the per-train CalibrationTable must be loaded before a train is allowed to operate. During initialization, the train controller task loads the calibration for each locomotive address from a stored table (populated by prior calibration sessions, or re-measured at startup).


Chapter 18: Sensor Attribution and the Track Server

A train layout with multiple trains presents an attribution problem: when a sensor fires, which train triggered it? If the system knows each train’s expected next sensor and expected arrival time (from its velocity model and current position estimate), it can resolve this unambiguously in the common case and detect anomalies — missed sensors, unexpected sensors, sensor failures — in the exceptional case.

The Track Graph

The track layout is represented as a directed graph where each node is a uniquely identified track element endpoint, and each edge is a traversable path between endpoints. The graph structure must capture:

  • Straight sections and curves: a single edge connecting two sensor endpoints with a known distance.
  • Switches (turnouts): a one-input, two-output (or two-input, one-output) structure. A switch in the “straight” position connects the input to one output; in the “curved” position, it connects to the other. The switch’s state determines which edges are traversable at that node.
  • Double slips: two interlocked switches sharing a crossing. Four possible states, of which two are valid.

The standard representation for a Märklin layout assigns each sensor two directed nodes (one for each direction of travel over the sensor), with directed edges connecting them. A sensor with two faces (S_a, S_b) has a node for “entering from the east” and a node for “entering from the west,” and edges from S_a to the next element in one direction and from S_b to the next in the other.

typedef struct {
    int   id;                   // unique track element identifier
    int   type;                 // STRAIGHT, SENSOR, BRANCH, MERGE
    int32_t dist_mm;            // distance to next element
    int   next[2];              // next element(s): [0] = default/straight, [1] = curved
} TrackNode;

static TrackNode track[NUM_NODES];

The track data is typically provided as a pre-built array (the course distributes the Märklin track graph in C format). Building it from scratch from the physical layout requires precise distance measurements with a ruler.

Predicting Next Sensor Events

For each active train, the system maintains:

  • Current position: the last confirmed sensor, plus estimated distance traveled since then.
  • Expected next sensor: the sensor that the train is predicted to reach next, given its current path and switch states.
  • Expected arrival time: current time + (remaining distance to next sensor) / current velocity.

When a sensor fires, the system compares the firing sensor against the expected next sensors of all active trains. The matching train is the one whose expected next sensor matches the fired sensor and whose expected arrival time is closest to the actual firing time.

The time-window tolerance is critical. If the tolerance is too tight, legitimate sensor events (where velocity estimation has accumulated error) are misattributed. If too loose, two trains near the same sensor cannot be distinguished. Typical tolerances are ±20% of the expected inter-sensor travel time.
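A minimal attribution routine implementing the expected-sensor and time-window rule might look like this (a standalone sketch: it returns a train index rather than a pointer, and TrainPred is a pared-down stand-in for the full per-train state):

```c
#include <stdint.h>

#define NUM_TRAINS 8

typedef struct {
    int      active;
    int      next_sensor_id;
    uint32_t expected_arrival_us;
    uint32_t window_us;            /* +- tolerance around expected arrival */
} TrainPred;

static TrainPred trains[NUM_TRAINS];

/* Returns the index of the best-matching train, or -1 if unattributed:
 * the train whose expected next sensor matches and whose expected
 * arrival time is closest to the actual firing time. */
int attribute_sensor(int sensor_id, uint32_t time_us) {
    int best = -1;
    uint32_t best_delta = 0;
    for (int i = 0; i < NUM_TRAINS; i++) {
        if (!trains[i].active || trains[i].next_sensor_id != sensor_id)
            continue;
        uint32_t d = (time_us > trains[i].expected_arrival_us)
                   ? time_us - trains[i].expected_arrival_us
                   : trains[i].expected_arrival_us - time_us;
        if (d > trains[i].window_us) continue;   /* outside tolerance window */
        if (best < 0 || d < best_delta) { best = i; best_delta = d; }
    }
    return best;
}
```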

Sensor Robustness and Failure Handling

Real sensor systems are not perfect. Common failure modes:

Dirty sensor (fails to fire): the train passes the sensor but no event is generated, because the reed switch contact is dirty or the magnet is misaligned. Detection: the train has traveled further than expected without a sensor event. Response: log the failure, advance the position estimate to the next predicted sensor, and reduce confidence in the position.

Ghost sensor (fires spuriously): a sensor fires without a train present, due to electrical noise or vibration. Detection: no train was expected at that sensor within any reasonable time window. Response: discard the event (but log it for diagnostics).

Sensor during switch transition: if a switch is transitioning (the solenoid coil is energizing) when a train passes, the train may take an unexpected path. Detection: the sensor that fires is not the expected next sensor for any train. Response: update all trains’ path predictions to consider both possible switch positions.

For disambiguation when multiple trains might have triggered a sensor, the system can track multiple hypotheses: rather than committing to a single prediction, maintain a set of possible (train, next sensor) pairs and update probabilities based on observed sensor events. This Bayesian approach is more robust but more complex to implement.

Dead Reckoning Between Sensors

Between sensor events, a train’s position is estimated by dead reckoning: integrating velocity over time.

position_estimate = last_sensor_position + velocity_estimate × (current_time - last_sensor_time)

This requires continuous updating as time passes. The estimate accumulates error: if the velocity estimate is off by ε mm/s, after t seconds the position error is ε·t mm. For a train at 500 mm/s with a 2% velocity error (10 mm/s) and a 2-second inter-sensor interval, the position error can grow to 20 mm — 2 cm. On a scale-model track with sensors every 30–100 cm, this is acceptable. On a straight run of 2 m without sensors, the cumulative error becomes concerning.

The position estimate is corrected each time a sensor fires. The correction is the difference between the estimated position and the sensor’s known position. This correction is applied immediately:

void on_sensor_event(int sensor_id, uint32_t time_us) {
    TrainState *t = attribute_sensor(sensor_id, time_us);
    if (!t) return;  // unattributed event

    int32_t sensor_pos = track[sensor_id].position_mm;  // sensor's known position
    int32_t estimated_pos = t->last_sensor_pos_mm +
        (int32_t)((int64_t)t->velocity_mmps * (time_us - t->last_sensor_time_us) / 1000000);

    int32_t error_mm = sensor_pos - estimated_pos;
    // Nudge the velocity estimate in the direction that reduces future error
    t->velocity_mmps = ewma_update(t->velocity_mmps,
        t->velocity_mmps + error_mm * CORRECTION_GAIN);
    t->last_sensor_pos_mm = sensor_pos;
    t->last_sensor_time_us = time_us;
}

The CORRECTION_GAIN factor converts a position error into a velocity adjustment. Choosing it too large causes oscillation; too small, and errors accumulate. This is analogous to the proportional term of the PID controller introduced in the previous chapter, which generalizes the EWMA approach.

The Track Server as a Resource Manager

The track server is the authoritative owner of track state: switch positions, segment reservations, and train position estimates. All access to this state goes through Send/Receive/Reply to the track server. No other task reads or writes track state directly.

This ownership model means the track server is on the critical path of every control decision. Its priority must be high enough that it is serviced promptly by the kernel, but not so high that it starves the notifiers. A priority just below the CAN RX server (which delivers sensor events to the track server) is appropriate.

The track server maintains:

  • A switch_state[] array (straight or curved for each turnout)
  • A reservation[] array (owner TID for each segment, or NO_TRAIN)
  • A train_state[] array (position, velocity, last sensor, expected next sensor) for each active train

Update frequency: the train state is updated on every sensor event (irregular) and optionally on every clock tick (to advance dead reckoning). A 10 ms tick is 100 ticks per second, so 10 trains require 10 × 100 = 1000 state updates per second, well within the server's capacity.

Multi-Hypothesis Tracking

When two trains are in adjacent sections of the layout, sensor attribution can become ambiguous — both trains have a plausible claim on the fired sensor. The single-hypothesis approach (commit to the most likely train) degrades gracefully when trains are well-separated but breaks down as they converge. A more robust solution is multi-hypothesis tracking (MHT), borrowed from radar tracking systems.

In MHT, rather than committing to one attribution immediately, the system maintains a tree of hypotheses. Each leaf of the tree is a complete consistent assignment of sensors to trains. When a new sensor fires, each existing hypothesis is extended in all plausible ways — one branch for “train A fired this sensor”, one for “train B fired it”. Implausible branches (those where the attribution implies impossible velocities or positions) are pruned. The surviving leaves are the current hypothesis set.

Before sensor S fires:
  H1: Train1 @ seg12, Train2 @ seg23
  H2: Train1 @ seg13, Train2 @ seg22

Sensor S fires (on the boundary between seg13 and seg23):
  Extend H1: S attributed to Train1 → H1a: Train1 @ S, Train2 @ seg23
             S attributed to Train2 → H1b: Train1 @ seg12, Train2 @ S
  Extend H2: S attributed to Train1 → H2a: Train1 @ S, Train2 @ seg22
             S attributed to Train2 → H2b: Train1 @ seg13, Train2 @ S

Prune: H2b has Train2 jumping backwards — impossible → pruned
Active: H1a, H1b, H2a

The probability of each hypothesis is updated using a likelihood model: how probable is the observed sensor time given this hypothesis’s position/velocity prediction? The likelihood is a Gaussian centered on the predicted arrival time:

\[ P(\text{sensor at } t | \text{train at position p, velocity v}) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(t - t_{\text{pred}})^2}{2\sigma^2}\right) \]

where \(t_{\text{pred}} = t_{\text{last}} + (p_{\text{sensor}} - p)/v\) is the predicted arrival time (the time of the last position update plus the remaining travel time) and \(\sigma\) encodes velocity uncertainty. In practice, this can be approximated by a rectangular window (weight 1 within ±tolerance, weight 0 otherwise) to avoid floating-point arithmetic.

The computational cost of MHT grows exponentially in the number of hypotheses without pruning. Two pruning strategies keep it tractable:

  1. Minimum weight pruning: discard any hypothesis whose probability falls below a threshold (e.g., 5% of the most probable hypothesis).
  2. Beam search: keep only the top-k hypotheses at each step, where k = 4 or 8 is enough for the CS 452 layout with 3–5 trains.

For the CS 452 train application, full MHT is not required. The single-hypothesis approach with a generous time tolerance works in most cases. MHT becomes necessary only when trains are consistently near each other — a scenario that good route planning avoids. Implementing the data structures for MHT (a linked tree of Hypothesis nodes with reference-counted TrainState snapshots) is a good exercise in applying CS 136E/246-level data structure skills to a real-time context.
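Beam-search pruning reduces to sorting by weight and truncating (a sketch; the Hypothesis struct here carries only the fields pruning needs, with weights as scaled integers to avoid floating point):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t weight;   /* scaled probability; higher = more likely */
    int      id;
} Hypothesis;

static int cmp_weight_desc(const void *a, const void *b) {
    const Hypothesis *x = a, *y = b;
    return (x->weight < y->weight) - (x->weight > y->weight);
}

/* Keep the k most probable hypotheses; returns the new count. */
int prune_beam(Hypothesis *h, int n, int k) {
    qsort(h, n, sizeof(Hypothesis), cmp_weight_desc);
    return (n < k) ? n : k;
}
```

Minimum-weight pruning composes with this: after the sort, also drop any entry whose weight falls below the threshold fraction of h[0].weight.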

Position Confidence and Sensor Failure Recovery

The system tracks a confidence level for each train’s position estimate, updated on every sensor event:

typedef enum {
    CONF_HIGH   = 3,   /* confirmed by sensor within expected window */
    CONF_MEDIUM = 2,   /* position inferred from prior sensors + dead reckoning */
    CONF_LOW    = 1,   /* sensor missed; position extrapolated */
    CONF_LOST   = 0,   /* two or more consecutive missed sensors */
} PositionConfidence;

When confidence drops to CONF_LOST, the train engineer task responds conservatively: reduce speed to a safe minimum and wait for a sensor event that re-establishes position. If the train is approaching a reservation boundary and confidence is below CONF_MEDIUM, it must stop before the boundary — it cannot safely commit to entering the next segment without knowing where it is.

The sensor failure recovery protocol:

  1. Miss detected: train has traveled more than expected_distance + tolerance_mm without a sensor event.
  2. Advance estimate: assume the train passed the missed sensor at the expected time, log the miss.
  3. Compute new next sensor: follow the track graph from the missed sensor.
  4. Wait for confirmation: the next sensor event confirms (or contradicts) the advanced estimate.
  5. Contradiction detected: if the next sensor does not match any plausible continuation from the advanced estimate, the train’s position is completely unknown → stop.

In well-maintained layouts, sensor misses are rare (< 1 per hour of operation). In CS 452 lab conditions with occasional dirty track, misses may be more frequent — making the recovery logic important.
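The confidence transitions can be sketched as a small pure function (the enum repeats the definition above under a shorter name so the fragment stands alone; the cap-at-MEDIUM rule for ticks without confirmation is an assumption of this sketch):

```c
#include <stdbool.h>

typedef enum { CONF_LOST = 0, CONF_LOW = 1, CONF_MEDIUM = 2, CONF_HIGH = 3 } Conf;

/* Update confidence after each sensor window: a confirmed sensor
 * restores full confidence; misses degrade it stepwise. */
Conf conf_update(Conf c, bool sensor_confirmed, int consecutive_misses) {
    if (sensor_confirmed) return CONF_HIGH;
    if (consecutive_misses >= 2) return CONF_LOST;
    if (consecutive_misses == 1) return CONF_LOW;
    /* no miss yet, but position is dead-reckoned: cap at MEDIUM */
    return (c > CONF_MEDIUM) ? CONF_MEDIUM : c;
}
```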

Track Server SRR Interface

The track server exposes a complete SRR interface to all other tasks:

/* Message types accepted by the track server */
typedef enum {
    TRACK_SENSOR_EVENT,      /* CAN RX server delivers sensor report */
    TRACK_RESERVE,           /* train engineer requests segment reservation */
    TRACK_RELEASE,           /* train engineer releases segment reservation */
    TRACK_SWITCH_SET,        /* set switch to STRAIGHT or CURVED */
    TRACK_SWITCH_GET,        /* query current switch position */
    TRACK_TRAIN_POS_GET,     /* query estimated position + velocity of a train */
    TRACK_ROUTE_REQUEST,     /* route planner returns computed path */
} TrackMsgType;

typedef struct {
    TrackMsgType type;
    union {
        struct { int sensor_id; uint32_t timestamp_us; } sensor;
        struct { int segment_id; int train_id; } reservation;
        struct { int switch_id; int position; } sw_set;
        struct { int switch_id; } sw_get;
        struct { int train_id; } pos_get;
        struct { int train_id; int path[MAX_PATH_LEN]; int path_len; } route;
    };
} TrackMsg;

typedef struct {
    int           ok;         /* 0 = success, <0 = error code */
    int32_t       pos_mm;     /* for pos_get: estimated position */
    int32_t       vel_mmps;   /* for pos_get: estimated velocity */
    int           sw_pos;     /* for sw_get: current position */
} TrackReply;

This interface is the contract between the track server and the rest of the system. Changes to the interface require corresponding changes to all callers — a useful argument for keeping the interface stable and narrow. In a production system, this interface would be formally specified (possibly in a language like Promela or TLA+) before implementation.
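A client-side wrapper for one of these messages might look as follows. Send() is stubbed here with a canned reply so the fragment stands alone; in the real system it is the kernel's SRR primitive, and the message and reply structs repeat simplified versions of the definitions above:

```c
#include <stdint.h>
#include <string.h>

typedef enum { TRACK_TRAIN_POS_GET = 5 } TrackMsgType;
typedef struct { TrackMsgType type; struct { int train_id; } pos_get; } TrackMsg;
typedef struct { int ok; int32_t pos_mm; int32_t vel_mmps; } TrackReply;

/* Stub standing in for the kernel Send primitive; pretends the track
 * server replied with position 1234 mm, velocity 500 mm/s. */
static int Send(int tid, const char *msg, int len, char *reply, int rlen) {
    (void)tid; (void)msg; (void)len; (void)rlen;
    TrackReply r = { 0, 1234, 500 };
    memcpy(reply, &r, sizeof r);
    return (int)sizeof r;
}

/* Client wrapper: query a train's estimated position and velocity.
 * Returns 0 on success, -1 on error. */
int TrainPosGet(int track_tid, int train_id, int32_t *pos_mm, int32_t *vel_mmps) {
    TrackMsg m = { .type = TRACK_TRAIN_POS_GET, .pos_get = { train_id } };
    TrackReply r;
    if (Send(track_tid, (const char *)&m, sizeof m,
             (char *)&r, sizeof r) < 0 || r.ok < 0)
        return -1;
    *pos_mm = r.pos_mm;
    *vel_mmps = r.vel_mmps;
    return 0;
}
```

Wrappers like this keep the wire format private to one file: callers see an ordinary function, and a change to TrackMsg touches only the wrapper.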


Chapter 19: Routing, Reservations, and Multi-Train Coordination

With a single train, collision avoidance is trivial — there is nothing to collide with. With multiple trains, each train must reserve the track segments it will occupy, ensuring no two trains ever occupy the same segment simultaneously. Routing finds the path from a train’s current position to its destination; reservation prevents the concurrent occupancy that would lead to collision.

Track Routing with Dijkstra’s Algorithm

The track graph of Chapter 18 defines the possible paths between any two locations on the layout. Finding the shortest (or least-cost) path is the shortest path problem, solved by Dijkstra’s algorithm.

Dijkstra’s algorithm maintains a set of visited nodes with known shortest distances, and a priority queue of candidate nodes ordered by tentative distance. It proceeds:

initialize dist[source] = 0, dist[all others] = ∞
priority_queue.insert(source, 0)

while priority_queue not empty:
    (u, d) = priority_queue.extract_min()
    if u already visited: continue
    mark u as visited
    
    for each edge (u, v, w) where v not visited:
        if dist[u] + w < dist[v]:
            dist[v] = dist[u] + w
            prev[v] = u
            priority_queue.insert(v, dist[v])

// reconstruct path: follow prev[] backwards from destination

For the train layout, nodes are sensor/switch endpoints and edge weights are distances in millimetres. The “shortest path” in distance gives the fastest route at constant speed, but in practice you may want to optimize for travel time (considering expected speed on each segment) or avoid congested segments.

The extracted path is a sequence of nodes. The corresponding sequence of switches that must be set can be derived from the path: any node at a branching point that is traversed on the curved side requires that switch to be set to the curved position.

Switch pre-setting: switches should be set to their required positions before the train reaches them, with enough lead time for the solenoid to complete its throw and the controller to confirm the position. The lead time depends on train speed and the distance from the current train position to the switch. The train engineer (a per-train task) is responsible for issuing switch commands in advance.

The Track Reservation Protocol

A reservation is the exclusive right to occupy a track segment. A train must hold a reservation for every segment it currently occupies or is about to enter. The granularity of reservations — how long a segment is — determines the balance between safety (finer granularity, more precise) and throughput (coarser granularity, fewer conflicts).

A simple reservation protocol:

Reserve(segment, train_id) → success or failure
Release(segment, train_id) → always succeeds

The track server maintains:
  owner[segment] = train_id (or NO_TRAIN)

Reserve:
  if owner[segment] == NO_TRAIN: owner[segment] = train_id; return SUCCESS
  else: return FAILURE

Release:
  owner[segment] = NO_TRAIN

A train must reserve the next segment before entering it. If the reservation fails (another train already owns the segment), the train must stop and wait. Stopping and waiting introduces its own problem: if two trains are each waiting for the other’s segment, the system is deadlocked.

Deadlock Detection and Prevention

In the track reservation context, deadlock occurs when:

  • Train A holds segment X and wants segment Y.
  • Train B holds segment Y and wants segment X.

Neither can proceed. More generally, deadlock requires a cycle in the wait-for graph: a directed graph where an edge from train A to train B means “train A is waiting for a segment that train B holds.”

Deadlock detection runs cycle detection on the wait-for graph whenever a reservation request fails. If a cycle is detected, one of the waiting trains must be chosen as a victim and backed up to a point where it can yield its segments. Choosing the victim wisely (e.g., the train with the shortest backup distance) minimizes disruption.

Deadlock prevention avoids the cycle entirely. The Banker’s algorithm generalizes to this setting: before granting a reservation, check whether the resulting allocation leaves the system in a safe state — one from which every train can eventually complete its route. If the allocation would create an unsafe state, deny it even if the segment is currently free.

In practice, a simpler prevention strategy often suffices: reserve in a consistent order. If every train always reserves segments in increasing segment-ID order, a cycle cannot form (a cycle would require A waiting for a segment with a lower ID than what A holds, which the ordering prohibits). The cost is that a train whose path reaches segments out of ID order cannot simply acquire them in path order; it must either acquire the lower-ID segment up front, or release and re-acquire reservations to restore increasing order.

Planning Horizon

A train cannot reserve its entire route at the moment of departure: other trains also have routes, and reserving the entire route would unnecessarily block other trains from using segments that the reserving train won’t reach for minutes. Instead, each train reserves a planning horizon of a few segments ahead — enough to guarantee it can stop safely if a future reservation fails, but not so many that it monopolizes the track.

The planning horizon must cover at least the stopping distance, expressed in track segments: a train must already hold a reservation for every segment it could enter before coming to a complete stop. A train that cannot stop within its current segment must therefore have reserved the next segment before entering the current one. For higher-speed trains, the planning horizon extends further.
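The conversion from stopping distance to horizon length can be sketched in one function (the 500 mm segment length is an assumption for illustration, not a measured value):

```c
#define SEGMENT_LEN_MM 500   /* assumed nominal segment length */

/* Number of segments to reserve ahead: the stopping distance rounded up
 * to whole segments, plus one segment of margin for position uncertainty. */
int planning_horizon_segments(int stop_dist_mm) {
    return (stop_dist_mm + SEGMENT_LEN_MM - 1) / SEGMENT_LEN_MM + 1;
}
```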

Dynamic adjustment of the planning horizon: if a reservation keeps failing (the track ahead is busy), the train should slow down to reduce its stopping distance and hence its required planning horizon. Conversely, once a congested area clears, the train can accelerate. This feedback between reservation status and speed is the essence of the train control problem.

Full Dijkstra Pseudocode for the Track Graph

Here is a complete Dijkstra implementation suitable for the track graph structure. The track has on the order of 100–200 nodes; even a simple \(O(V^2)\) Dijkstra runs in well under 1 ms at 1.5 GHz for graphs of this size.

#define INF 0x7FFFFFFF

typedef struct {
    int dist[NUM_NODES];     // shortest distance from source
    int prev[NUM_NODES];     // previous node on shortest path
    bool visited[NUM_NODES];
} DijkstraState;

void dijkstra(int source, DijkstraState *state, SwitchState *switches) {
    for (int i = 0; i < NUM_NODES; i++) {
        state->dist[i]    = INF;
        state->prev[i]    = -1;
        state->visited[i] = false;
    }
    state->dist[source] = 0;

    for (int iter = 0; iter < NUM_NODES; iter++) {
        // Find unvisited node with minimum distance (O(V) linear scan)
        int u = -1;
        for (int v = 0; v < NUM_NODES; v++) {
            if (!state->visited[v] && state->dist[v] != INF) {
                if (u == -1 || state->dist[v] < state->dist[u]) u = v;
            }
        }
        if (u == -1) break;  // all reachable nodes visited
        state->visited[u] = true;

        // Relax edges from u
        TrackNode *node = &track[u];
        for (int e = 0; e < 2; e++) {
            int v = node->next[e];
            if (v == -1) continue;             // no edge in this direction

            // For switch nodes, only one direction is traversable given current state
            if (node->type == BRANCH && switches[node->switch_id] != e) continue;

            int new_dist = state->dist[u] + node->dist_to_next[e];
            if (new_dist < state->dist[v]) {
                state->dist[v] = new_dist;
                state->prev[v] = u;
            }
        }
    }
}

// Reconstruct path from source to destination
int reconstruct_path(int dest, DijkstraState *state, int *path, int max_len) {
    int len = 0;
    for (int v = dest; v != -1; v = state->prev[v]) {
        if (len >= max_len) return -1;  // path too long
        path[len++] = v;
    }
    // Reverse: path[0..len-1] is now source→destination
    for (int i = 0, j = len - 1; i < j; i++, j--) {
        int tmp = path[i]; path[i] = path[j]; path[j] = tmp;
    }
    return len;
}

The SwitchState *switches parameter allows the routing algorithm to consider the current switch positions — a train may prefer a route that avoids switches it would need to throw (reducing transit time). An extension: multi-criteria shortest path that minimizes a weighted combination of distance, number of switch throws, and expected travel time.

A* Algorithm for Heuristic Route Planning

Dijkstra’s algorithm is optimal — it finds the true shortest path — but it visits every node whose distance from the source is less than the destination’s, which is wasteful when the destination lies in one direction. The A* algorithm (Hart, Nilsson, and Raphael, 1968) extends Dijkstra with a heuristic function \(h(v)\) that estimates the remaining distance from node \(v\) to the destination. The modified priority for each candidate node becomes:

\[ f(v) = g(v) + h(v) \]

where \(g(v)\) is the actual distance from the source (as in Dijkstra) and \(h(v)\) is the heuristic estimate of the remaining distance. If \(h(v)\) is admissible (never overestimates the true remaining distance), A* is guaranteed to find the optimal path.

For the track graph, a natural admissible heuristic is the Euclidean distance between node \(v\) and the destination node. Since the track does not teleport trains, the straight-line distance is a lower bound on the actual track distance. Computing Euclidean distances requires knowing the physical coordinates of each track node — which are available if you measure the layout carefully or use a CAD drawing.

/* Euclidean heuristic for track graph */
static int32_t heuristic(int node_id, int dest_id) {
    int64_t dx = track[node_id].x_mm - track[dest_id].x_mm;
    int64_t dy = track[node_id].y_mm - track[dest_id].y_mm;
    /* 64-bit products guard against overflow for large coordinates */
    return isqrt(dx * dx + dy * dy);
}

void astar(int source, int dest, DijkstraState *state) {
    /* Same init as Dijkstra */
    for (int i = 0; i < NUM_NODES; i++) {
        state->dist[i]    = INF;
        state->prev[i]    = -1;
        state->visited[i] = false;
    }
    state->dist[source] = 0;

    for (int iter = 0; iter < NUM_NODES; iter++) {
        /* Find unvisited node with minimum f = g + h */
        int u = -1;
        int min_f = INF;
        for (int v = 0; v < NUM_NODES; v++) {
            if (!state->visited[v] && state->dist[v] != INF) {
                int f = state->dist[v] + heuristic(v, dest);
                if (f < min_f) { min_f = f; u = v; }
            }
        }
        if (u == -1 || u == dest) break;  /* no reachable nodes left, or destination reached */
        state->visited[u] = true;

        /* Relax edges — same as Dijkstra */
        TrackNode *node = &track[u];
        for (int e = 0; e < 2; e++) {
            int v = node->next[e];
            if (v == -1 || state->visited[v]) continue;
            int new_dist = state->dist[u] + node->dist_to_next[e];
            if (new_dist < state->dist[v]) {
                state->dist[v] = new_dist;
                state->prev[v] = u;
            }
        }
    }
}

The A* implementation is nearly identical to Dijkstra; the only difference is the priority function. For the Märklin track graph with ~100–200 nodes, A* typically explores 20–50% fewer nodes than Dijkstra — a modest but measurable improvement. With the O(n) linear scan for the minimum (rather than a priority queue), both algorithms run in O(n²) anyway, so the practical speedup is small. The benefit of A* becomes more significant when a priority queue is used (O(n log n)), where A* can skip large portions of the graph.

Integer square root (for the heuristic without floating-point):

static int32_t isqrt(int64_t n) {
    if (n < 2) return (n < 0) ? 0 : (int32_t)n;   /* guards the n/x division below */
    int32_t x = (int32_t)(n >> 1) + 1;
    int32_t y = (int32_t)(((int64_t)x + n / x) >> 1);
    while (y < x) {
        x = y;
        y = (int32_t)(((int64_t)x + n / x) >> 1);
    }
    return x;
}

Newton’s method for integer square root converges quadratically once the iterate is near the root, but the crude initial guess n/2 + 1 costs roughly \(\log_2 \sqrt{n}\) halving steps to get there. For the distances that arise here (squared sums up to about 10⁷ mm²), that is on the order of a dozen iterations, still negligible next to the graph search itself.

The Route Planner Task

Routing is separated from the train engineer task as a dedicated Route Planner task. This separation is important for two reasons. First, routing is computationally expensive relative to the train engineer’s periodic control loop — running Dijkstra at every control tick (20 Hz) wastes CPU on a computation whose inputs (track graph, current switch positions) change infrequently. Second, route planning may block if the track server is busy; the train engineer’s control loop must not block on an operation that has variable latency.

void route_planner_main(void) {
    RegisterAs(SRV_ROUTE_PLANNER);
    int track_tid = WhoIs(SRV_TRACK);

    int sender;
    RouteRequest req;

    for (;;) {
        Receive(&sender, &req, sizeof(req));

        /* Query current switch positions from track server */
        SwitchStateMsg sw_msg = { .type = TRACK_SWITCH_GETALL };
        SwitchState    switches[NUM_SWITCHES];
        SwitchStateReply sw_reply;
        int track_sender;
        /* Blocking query: the route planner Sends to the track server and waits */
        Send(track_tid, &sw_msg, sizeof(sw_msg), &sw_reply, sizeof(sw_reply));
        memcpy(switches, sw_reply.switches, sizeof(switches));

        /* Run Dijkstra or A* */
        DijkstraState ds;
        dijkstra(req.source_node, &ds, switches);

        /* Reconstruct path */
        int path[MAX_PATH_LEN];
        int path_len = reconstruct_path(req.dest_node, &ds, path, MAX_PATH_LEN);

        /* Reply with computed path */
        RouteReply reply = {
            .ok = (path_len > 0),
            .path_len = path_len,
        };
        if (path_len > 0)
            memcpy(reply.path, path, path_len * sizeof(int));
        Reply(sender, &reply, sizeof(reply));
    }
}

The train engineer calls Send(route_planner_tid, &req, ...) to request a route. The route planner blocks on its Receive until a request arrives, then queries the track server for switch positions (an additional Send), runs the algorithm, and replies with the path. The train engineer is Reply-Blocked during this entire sequence — which is acceptable because route planning is infrequent (once per trip) and the latency (a few milliseconds) is small relative to the trip duration.

One refinement: the route planner can cache the most recent Dijkstra result indexed by (source, destination, switch_state_hash). If the switch state has not changed since the last routing, the cached path can be returned without re-running the algorithm. This is particularly useful when multiple trains are assigned routes from the same source to the same destination in quick succession.
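A minimal single-entry version of that cache might look like the following sketch (the switch-state hash is assumed to be computed by the caller; MAX_PATH_LEN mirrors the constant used in the planner above):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_PATH_LEN 64   /* assumed bound, matching the planner */

typedef struct {
    int      src, dest;
    uint32_t switch_hash;   /* hash of switch state when path was computed */
    int      path_len;
    int      path[MAX_PATH_LEN];
    bool     valid;
} RouteCache;

static RouteCache cache;

/* Returns true and fills path/len on a hit; false on a miss. */
bool cache_lookup(int src, int dest, uint32_t hash, int *path, int *len) {
    if (cache.valid && cache.src == src && cache.dest == dest &&
        cache.switch_hash == hash) {
        memcpy(path, cache.path, (size_t)cache.path_len * sizeof(int));
        *len = cache.path_len;
        return true;
    }
    return false;
}

void cache_store(int src, int dest, uint32_t hash, const int *path, int len) {
    cache.src = src; cache.dest = dest; cache.switch_hash = hash;
    cache.path_len = len; cache.valid = true;
    memcpy(cache.path, path, (size_t)len * sizeof(int));
}
```

The planner checks cache_lookup before running Dijkstra and calls cache_store after a successful computation; any switch command invalidates the entry by changing the hash.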

Practical Deadlock Scenarios and Avoidance

On a typical Märklin layout, deadlocks arise in predictable topological situations:

Head-on collision deadlock: train A is on segment S1 heading east; train B is on segment S2 heading west. S1 and S2 are adjacent — each train is waiting for the other’s segment. Resolution: the track server detects the cycle in the wait-for graph (A waits for S2, B waits for S1, cycle of length 2) and backs up one train to a siding.

Multi-train deadlock on a figure-8 layout: with three trains on a figure-8, it is possible for each train to hold one segment of the crossing and wait for the other crossing segment, creating a three-way cycle. Detection: cycle detection on a graph with at most 10 nodes (number of trains) is trivial even with a naive algorithm.

Consistent ordering heuristic: assign each track segment an integer ID. Require that all trains acquire segments in increasing ID order. A train whose path reaches segment 7 before segment 3 cannot acquire them in path order; it must either acquire 3 before 7 up front, or release every held segment with ID greater than 3, acquire 3, and then re-acquire the released segments in increasing order. This is expensive but guarantees deadlock freedom.

In practice, the Märklin layout used in CS 452 runs fewer than five trains at once and is designed to avoid the worst-case deadlock topologies. A simple wait-for cycle detection with timeout-based victim selection is sufficient.

The Train Engineer Task: Integrating Routing, Reservation, and Velocity

Each active train is controlled by a dedicated train engineer task. The engineer task is the integration point for all the concepts of Part V: it holds the route from Chapter 19, receives position updates from the track server (Chapter 18), uses velocity calibration tables from Chapter 17, and issues speed and switch commands via the CAN TX server (Chapter 14).

The engineer task’s primary control loop:

void train_engineer_main(void) {
    /* Initialization */
    int clock_tid = WhoIs(SRV_CLOCK);
    int track_tid = WhoIs(SRV_TRACK);
    int can_tx_tid = WhoIs(SRV_CAN_TX);

    EngineerState state = {
        .loco_id     = MY_LOCO_UID,
        .phase       = ENG_IDLE,
        .speed       = 0,
        .destination = NO_DEST,
    };
    state.next_tick = Time(clock_tid);   /* anchor the periodic loop to now */

    for (;;) {
        /* Wait for the next control tick (50 ms) */
        DelayUntil(clock_tid, state.next_tick);
        state.next_tick += ENGINEER_PERIOD_TICKS;

        /* Query current position from track server */
        TrainStatus status;
        QueryTrain(track_tid, state.loco_id, &status);

        /* Execute the current control phase */
        switch (state.phase) {
            case ENG_IDLE:
                /* Nothing to do */
                break;

            case ENG_ROUTE_PLANNED:
                /* Attempt to reserve the next segment ahead */
                if (Reserve(track_tid, status.next_segment, state.loco_id)) {
                    /* Reserved successfully — switch pre-setting */
                    if (next_requires_switch(&state, &status))
                        SetSwitch(can_tx_tid, state.next_switch_id,
                                  state.next_switch_pos);
                    state.phase = ENG_RUNNING;
                } else {
                    /* Reservation failed — slow down or stop */
                    state.speed = reduce_speed_for_wait(state.speed);
                    SetSpeed(can_tx_tid, state.loco_id, state.speed);
                }
                break;

            case ENG_RUNNING: {
                /* Check stopping condition */
                int dist_to_dest = compute_dist(&status, state.destination);
                int stop_dist    = interp_stop_distance(state.speed);

                if (dist_to_dest <= stop_dist) {
                    /* Issue stop command */
                    SetSpeed(can_tx_tid, state.loco_id, 0);
                    state.phase = ENG_STOPPING;
                } else {
                    /* Attempt next reservation */
                    int next_seg = next_segment_on_route(&state, &status);
                    if (next_seg >= 0 && !is_reserved(track_tid, next_seg, state.loco_id)) {
                        if (!Reserve(track_tid, next_seg, state.loco_id)) {
                            /* Can't reserve — must slow down */
                            int safe_speed = speed_for_stopping_distance(
                                dist_to_segment(&status, next_seg));
                            if (state.speed > safe_speed) {
                                state.speed = safe_speed;
                                SetSpeed(can_tx_tid, state.loco_id, state.speed);
                            }
                        }
                    }
                    /* Release segments behind us */
                    release_trailing_segments(track_tid, &status, &state);
                }
                break;
            }

            case ENG_STOPPING:
                /* Verify the train has stopped (velocity estimate near zero) */
                if (status.velocity_mmps < STOP_THRESHOLD_MMPS) {
                    state.phase   = ENG_IDLE;
                    state.speed   = 0;
                    /* Release all reservations */
                    release_all_segments(track_tid, state.loco_id);
                    notify_arrived(&state);
                }
                break;
        }
    }
}

This pseudocode captures the core logic without being an assignment answer: the control phases (IDLE → ROUTE_PLANNED → RUNNING → STOPPING → IDLE) are the structure of any train controller, derived from the physics of stopping distance and the semantics of the reservation system. The specific functions (interp_stop_distance, next_segment_on_route, release_trailing_segments) are black boxes whose implementation follows from the data structures of Chapters 17–19.

Key observations about the engineer task design:

Periodic control at 50 ms (20 Hz): the control loop runs at a fixed period, not on every sensor event. This simplifies timing analysis (the engineer’s period is well-defined) and decouples the control loop from the variable rate of sensor events. The track server (Chapter 18) accumulates position updates between engineer cycles and provides the latest position when queried.

Reservation-speed coupling: the reservation system and the velocity controller are coupled: if a reservation fails, the engineer must slow the train before it reaches the contested segment. The relationship between current speed, current distance to the segment, and required stop time is exactly the stopping distance calculation from Chapter 17.

Trailing segment release: as the train moves forward, it leaves segments behind. Releasing these segments promptly allows following trains to use them. Over-eager release (releasing a segment before the train has fully cleared it) is a safety bug; lazy release (holding segments long after clearing them) reduces throughput.

Multi-Train Coordination: The Merge and Pass Problems

Two classic multi-train coordination problems illustrate how the reservation and routing system handles congestion.

The merge problem: two trains on separate branches, both heading toward a single-track merge, must coordinate to avoid a head-on collision at the merge point. Neither train can see the other directly — they must coordinate through the track server.

Protocol: each train attempts to reserve the merge segment before the other. The first to reserve passes through; the second finds the segment reserved and waits. When the first train clears the merge (releases the reservation), the second can proceed.

The timing: if both trains are approaching the merge at the same speed, the “race” to reserve is non-deterministic from the track server’s perspective (it depends on which engineer task happens to query first). This non-determinism is acceptable — the system is safe regardless of which train wins, as long as only one holds the merge segment at a time.

The pass problem: one fast train is behind one slow train on the same track. Can the fast train pass? On a simple loop layout, there is no passing lane. On a figure-8 with sidings, the slow train can pull into a siding to allow the fast train to pass. The routing algorithm handles this transparently: the route planner considers sidings as valid path elements, and the fast train’s route may go through the siding without the fast train being aware of the slow train — the slow train’s presence is encoded only in the reservation system.

For the route planner to route the slow train into a siding, it needs:

  1. Knowledge that a siding exists (from the track graph).
  2. A criterion for choosing the siding route over the direct route (e.g., another train requesting the segment ahead indicates congestion — “traffic-aware routing”).

Traffic-aware routing computes the shortest path weighted by both distance and reservation availability: segments reserved by other trains get a high weight, encouraging routes that avoid them. This is more complex than pure Dijkstra but still runs in O(V²) with the same algorithm and suitably modified edge weights.
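The weighting idea reduces to a one-line edge-cost function (the penalty constant is an assumed tuning parameter, not a value from the course):

```c
#include <stdbool.h>

#define RESERVED_PENALTY_MM 5000   /* assumed tuning constant */

/* Edge weight for traffic-aware Dijkstra: segments reserved by other
 * trains are not forbidden, only made expensive, so a free detour wins
 * whenever one exists. */
int traffic_weight(int dist_mm, bool reserved_by_other) {
    return dist_mm + (reserved_by_other ? RESERVED_PENALTY_MM : 0);
}
```

Substituting traffic_weight for the raw dist_to_next value in the relaxation step is the only change needed to the Dijkstra code above.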

The Velocity-Reservation Invariant

The central safety invariant of the multi-train system:

At all times, every train holds reservations for all segments it currently occupies plus all segments it cannot stop before entering.

This is the safety invariant that the engineer task must maintain. Formally: if a train is at position \(p\) with speed \(v\), it must hold reservations for all segments within the stopping distance \(d_{\text{stop}}(v)\) ahead of \(p\).

Maintaining this invariant requires that reservation acquisitions happen before the train reaches the stopping-distance boundary. The timing is:

Train at speed v, distance D to next unreserved segment.
Stopping distance d_stop(v).
Safety condition: D > d_stop(v)     → still safe, no need to stop.
Warning condition: D ≤ 2·d_stop(v) → begin reservation attempt.
Critical condition: D ≤ d_stop(v)  → must slow down NOW (reservation attempt may have failed).

The two-stage (warning + critical) approach gives the engineer task two opportunities to acquire the reservation before the train is in danger. If the warning-stage reservation attempt fails (segment still occupied), the engineer begins slowing. If by the critical-stage the reservation still fails, the engineer issues a full stop command. The stopping distance calibration (Chapter 17) ensures the train can stop within the available distance.
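The two conditions translate directly into a classifier the engineer can evaluate every tick; a sketch, with d_stop assumed to come from the Chapter 17 calibration tables:

```c
/* Safety zone classification for the velocity-reservation invariant. */
typedef enum { ZONE_SAFE, ZONE_WARNING, ZONE_CRITICAL } SafetyZone;

SafetyZone classify_zone(int dist_to_unreserved_mm, int d_stop_mm) {
    if (dist_to_unreserved_mm <= d_stop_mm)     return ZONE_CRITICAL;  /* slow down NOW */
    if (dist_to_unreserved_mm <= 2 * d_stop_mm) return ZONE_WARNING;   /* start reserving */
    return ZONE_SAFE;
}
```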

This invariant-based design is a direct application of the “defensive programming” principle for real-time systems: safety conditions should be checked well before the deadline, so that there is time to respond gracefully rather than failing at the last moment.


Part VI: Engineering Real-Time Systems

Chapter 20: Optimization, Caching, and Measurement

The train control application introduces real performance constraints. Interrupt latency must be bounded. Message passing overhead must be predictable. Debugging must not alter timing. This chapter collects the engineering techniques that keep the system within its timing budget and make bugs observable without disturbing the behavior they reveal.

The Cortex-A72 Cache Hierarchy

The Cortex-A72 in the RPi 4 has:

  • L1 instruction cache: 48 KB, 3-way set-associative, virtually indexed, physically tagged (VIPT). Separate from the data cache.
  • L1 data cache: 32 KB, 2-way set-associative.
  • L2 unified cache: 1 MB shared across all four cores, 16-way set-associative. (In a single-core kernel, the L2 is effectively private.)

Cache line size is 64 bytes on both L1 and L2. Memory accesses that hit the L1 cost 4 cycles. L1 misses that hit L2 cost roughly 15 cycles. L2 misses (DRAM access) cost 50–100 cycles. On a 1.5 GHz core, one DRAM miss costs up to about 67 nanoseconds; a handful of misses on a hot path consumes a meaningful share of a tight interrupt-latency budget.

Cache behavior matters for real-time systems in two ways:

WCET variability: cold-cache execution is significantly slower than warm-cache execution. A kernel operation that takes 500 cycles on a warm cache might take 5,000 cycles on a cold cache. If WCET analysis uses warm-cache measurements, the true worst case (first execution after a long idle period) may be much worse.

Cache thrashing between tasks: if two tasks have working sets that map to the same cache sets and conflict with each other, each context switch evicts the other task’s data. Adding capacity does not necessarily help, because conflict misses depend on the mapping rather than the total size. (The superficially similar Belady’s anomaly, in which adding capacity increases misses, is specific to FIFO page replacement and does not occur in LRU caches.)

Practical mitigations:

  • Keep the kernel small enough to fit in L1 instruction cache. The exception vector table, syscall dispatcher, and scheduler are all cache-hot; less frequently executed paths (initialization, error handlers) can tolerate cache misses.
  • Align frequently-accessed data structures to cache-line boundaries using __attribute__((aligned(64))).
  • Avoid false sharing: if two tasks share data in the same cache line but modify different fields, they will invalidate each other’s caches unnecessarily. Place independently-modified fields in separate cache lines.
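The padding technique for the last point can be sketched as follows (the counter type is illustrative):

```c
#include <stdint.h>

#define CACHE_LINE 64

/* Each counter occupies its own cache line, so updates from different
 * contexts can never falsely share (and mutually invalidate) a line. */
typedef struct {
    uint64_t count;
    uint8_t  pad[CACHE_LINE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE))) PaddedCounter;

_Static_assert(sizeof(PaddedCounter) == CACHE_LINE, "one line per counter");
```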

Cache Maintenance

ARMv8-A provides explicit cache maintenance instructions for situations where software must directly manage cache coherency (e.g., after DMA transfers, or when modifying code at run time).

  • DC CIVAC Xn: Data Cache (Clean and Invalidate by Virtual Address to Point of Coherency). Writes back dirty cache lines to memory and marks them invalid. Use before a DMA transfer where the DMA engine reads memory that the CPU has modified.
  • DC IVAC Xn: Data Cache (Invalidate by Virtual Address to Point of Coherency). Discards cached data without writing back. Use after a DMA transfer where the DMA engine has written data that the CPU must read fresh.
  • IC IALLU: Instruction Cache (Invalidate All to Point of Unification). Use after writing new code to memory (self-modifying code, JIT compilation — not applicable in CS 452 but useful to know).

In a single-core system without DMA (the CS 452 scenario), explicit cache maintenance is rarely needed. The processor’s load/store unit maintains L1 and L2 coherency automatically for CPU-initiated accesses.

Volatile and Memory Barriers Revisited

Chapter 4 introduced volatile and memory barriers. Here we complete the picture with the full ordering taxonomy.

ARMv8-A defines three barrier types:

ISB (Instruction Synchronization Barrier): flushes the processor pipeline and refetches subsequent instructions. Required after writing to VBAR_EL1 (before the next exception can use the new vector), after modifying SCTLR_EL1 (before the new settings take effect), and after writing to branch predictor or cache control registers.

DMB (Data Memory Barrier): ensures that all memory accesses before the barrier are visible to all agents specified (e.g., dmb sy for the full system) before any memory access after the barrier begins. Use between two device register writes that must happen in order. A lighter variant dmb ish covers only the inner shareable domain (the CPU cluster).

DSB (Data Synchronization Barrier): stronger than DMB. Ensures not only that memory accesses complete, but that all cache maintenance operations complete before the barrier. Required before ISB and before certain system register writes.

The proper ordering for enabling an interrupt, for example:

str     w1, [x0, GICD_ISENABLER]   // write GIC enable register
dsb     sy                          // ensure write completes before unmasking
msr     daifclr, #0x2               // unmask IRQ
isb                                 // ensure unmask takes effect before next instruction

Without the DSB, the GIC enable register write might still be buffered when the unmask executes, so the ordering the code intends is not guaranteed. Without the ISB, instructions after the unmask may already be in flight before the new mask state takes effect.

Race Conditions

A race condition occurs when the outcome of a computation depends on the relative timing of two concurrent operations. In a bare-metal kernel with a single CPU and interrupts disabled during kernel execution, the sources of races are limited but still present.

Kernel/interrupt race: if an interrupt is enabled while the kernel is partway through updating a shared data structure, the interrupt handler may see an inconsistent state. Prevention: disable interrupts during any update to data structures that are read by interrupt handlers. The kernel keeps IRQs masked (the DAIF I bit set) while executing at EL1, unmasking only at explicit, deliberate points.

Read-modify-write on device registers: if two different code paths (or the kernel and an interrupt) both read-modify-write the same device register, one may lose the other’s update. Some BCM2711 registers support bit-set and bit-clear operations that are inherently atomic; for others, interrupts must be masked during the read-modify-write cycle.

TOCTOU (Time-Of-Check to Time-Of-Use): if the code checks a condition (e.g., “is the TX FIFO full?”) and then acts on it, an interrupt can fire between the check and the action, changing the condition. Mitigation: keep check-and-act sequences within a single interrupt-disabled section.

Debugging race conditions is notoriously difficult because adding instrumentation (printf, cycle counters) changes timing and may mask the bug. The scientific approach:

  1. Form a hypothesis about which resource is shared and which code paths access it.
  2. Instrument only at the granularity needed to confirm or refute the hypothesis.
  3. When confirmation is needed without timing disturbance, use the system timer to record timestamps in a ring buffer and examine them after the fact.
  4. A shadow copy technique: maintain a second copy of critical data that is updated atomically; compare it with the primary copy in a background check.
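Step 3 can be sketched as a fixed-size trace ring (names are illustrative); the per-event cost is one store and an increment, small enough not to perturb most timing bugs:

```c
#include <stdint.h>

#define TRACE_CAP 256   /* power of two, so the index wraps with a mask */

typedef struct { uint32_t time_us; uint16_t event; } TraceEntry;

static TraceEntry trace_buf[TRACE_CAP];
static unsigned   trace_head;   /* total events recorded */

/* Record one event; overwrites the oldest entry once the ring is full. */
static inline void trace(uint32_t time_us, uint16_t event) {
    trace_buf[trace_head & (TRACE_CAP - 1u)] =
        (TraceEntry){ .time_us = time_us, .event = event };
    trace_head++;
}
```

After the experiment, the buffer is dumped over the debug UART and examined offline, so the instrumentation never prints in the timing-sensitive path.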

Performance Measurement in Practice

For timing system calls and context switches:

uint32_t t1 = mmio_read(SYSTEM_TIMER_CLO);
for (int i = 0; i < N; i++) {
    // Call the operation to measure
    context_switch_roundtrip();
}
uint32_t t2 = mmio_read(SYSTEM_TIMER_CLO);
// average = (t2 - t1) / N  microseconds

Key considerations:

  • Run at least N = 1000 iterations to amortize measurement overhead and obtain statistical significance.
  • Measure both cache-cold (run once, measure, discard) and cache-warm (run N times, take average of last N/2) costs.
  • Include all overhead: the mmio_read itself costs a few cycles; the loop overhead costs a few; these are small but should be calibrated.
  • The system timer CLO reads only the lower 32 bits (free-running microsecond counter). It wraps around approximately every 72 minutes — not a concern for short measurements.
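For WCET arguments the maximum matters more than the mean, so it is worth recording both alongside the average; a minimal sketch:

```c
#include <stdint.h>

/* Running min/avg/max over measured latencies (microseconds or cycles). */
typedef struct { uint32_t min, max, n; uint64_t sum; } Stats;

void stats_add(Stats *s, uint32_t x) {
    if (s->n == 0 || x < s->min) s->min = x;
    if (s->n == 0 || x > s->max) s->max = x;
    s->sum += x;
    s->n++;
}

uint32_t stats_avg(const Stats *s) {
    return s->n ? (uint32_t)(s->sum / s->n) : 0;
}
```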

Measured performance targets for a well-optimized CS 452 kernel: a full Send/Receive/Reply roundtrip (three context switches) should take under 5 µs at 1.5 GHz.

Instruction-Level Optimization: The Context Switch Hot Path

Context switching is the most frequently executed kernel path. Every Send/Receive/Reply invokes at least one context switch; a busy train control system exchanging 1000 inter-task messages per second performs 3000 context switches per second. Even at the 5 µs-per-roundtrip target above, that is about 5 ms of every second (0.5% of CPU time) spent in message passing, and a slower kernel pays proportionally more. Reducing context switch latency pays dividends.

The save/restore sequence in the sync exception handler saves 34 registers (31 GPRs + SP_EL0 + ELR_EL1 + SPSR_EL1). The ARMv8-A STP instruction stores two 64-bit registers in a single instruction. Using STP pairs reduces the register-save path from 34 single stores to 17 store-pairs, nearly halving the instruction count:

/* Optimized save — 17 STP instructions instead of 34 STR */
sub     sp, sp, #(34*8)
stp     x0,  x1,  [sp, #(0*16)]
stp     x2,  x3,  [sp, #(1*16)]
stp     x4,  x5,  [sp, #(2*16)]
stp     x6,  x7,  [sp, #(3*16)]
stp     x8,  x9,  [sp, #(4*16)]
stp     x10, x11, [sp, #(5*16)]
stp     x12, x13, [sp, #(6*16)]
stp     x14, x15, [sp, #(7*16)]
stp     x16, x17, [sp, #(8*16)]
stp     x18, x19, [sp, #(9*16)]
stp     x20, x21, [sp, #(10*16)]
stp     x22, x23, [sp, #(11*16)]
stp     x24, x25, [sp, #(12*16)]
stp     x26, x27, [sp, #(13*16)]
stp     x28, x29, [sp, #(14*16)]
mrs     x20, SP_EL0
mrs     x21, ELR_EL1
mrs     x22, SPSR_EL1
stp     x30, x20, [sp, #(15*16)]
stp     x21, x22, [sp, #(16*16)]

This is 20 instructions total (17 STP + 3 MRS) compared to 34 STR + 3 MRS = 37 instructions. At approximately one instruction per cycle, ignoring pipeline effects, this saves about 17 cycles, roughly 11 ns at 1.5 GHz. The symmetric restore with LDP is analogous.

Beyond instruction count, memory bandwidth is the limiting factor. Saving 34 × 8 = 272 bytes to the kernel stack per context switch requires 272 bytes of write bandwidth from the register file to the L1 data cache. The 272 bytes span \(\lceil 272/64 \rceil = 5\) cache lines; assuming the L1 can retire one 64-byte line of writes per cycle, the memory bandwidth cost of a single context save is approximately 5 cycles. This is the irreducible minimum regardless of how the instructions are arranged, an argument for keeping the saved context as small as possible.

The calling convention (AAPCS64) divides registers into caller-saved (x0–x18) and callee-saved (x19–x28, plus the frame pointer x29 and link register x30). For a voluntary syscall, the compiled caller already assumes the caller-saved registers are clobbered across the call, so in principle the kernel need only preserve the callee-saved set. For an asynchronous interrupt, however, the interrupted task may have live values in any register, so everything must be saved. A kernel that wants a single save/restore path therefore saves all registers on both entry types.

Some microkernel implementations (notably the L4 family and seL4) exploit the AAPCS64 convention to pass message data directly through registers (without stack allocation) for short messages. In these register-based IPC designs, the kernel transfers up to 8 message words directly from the sender’s x0–x7 to the receiver’s x0–x7 during the context switch, without copying through memory. The CS 452 kernel uses stack-to-stack copy, which is simpler but involves a memory round-trip.

Profiling the Kernel: Where Time Actually Goes

To know what to optimize, one must measure. For a bare-metal kernel without an operating system to provide profiling infrastructure, the standard approach is to build a lightweight performance counter using the Cortex-A72’s cycle counter.

The ARMv8-A Performance Monitors Unit (PMU) includes a 64-bit cycle counter (PMCCNTR_EL0) and up to 6 event counters that track micro-architectural events (cache misses, branch mispredictions, instruction counts). Enabling the cycle counter:

/* Enable PMU cycle counter at EL1 */
mrs     x0, PMCR_EL0
orr     x0, x0, #1          /* enable */
orr     x0, x0, #(1 << 2)   /* reset cycle counter */
msr     PMCR_EL0, x0
mov     x0, #(1 << 31)      /* PMCNTENSET: enable cycle counter */
msr     PMCNTENSET_EL0, x0
isb

Reading the cycle counter in C:

static inline uint64_t cycle_count(void) {
    uint64_t cnt;
    asm volatile ("mrs %0, PMCCNTR_EL0" : "=r"(cnt));
    return cnt;
}

With the cycle counter, one can measure the cost of any kernel operation with cycle-level precision:

uint64_t t0 = cycle_count();
kernel_operation();
uint64_t t1 = cycle_count();
uint64_t elapsed_cycles = t1 - t0;
uint32_t elapsed_ns = (uint32_t)(elapsed_cycles * 667 / 1000);  /* at 1.5 GHz: 1 cycle ≈ 0.667 ns */

The PMU also supports the L1D miss event counter (event code 0x03) and L2D miss event counter (0x17). Measuring cache miss rates for the context switch hot path reveals whether the performance bottleneck is instruction-count (too many instructions), cache pressure (working set too large), or memory bandwidth (too many stores).

A useful profiling discipline for CS 452: measure the following at startup and display them on the kernel’s diagnostic UART:

Operation                 Cycles    Notes
Create() syscall          ~200      includes stack allocation
Context save (GPRs)       ~20       17 STP + 3 MRS instructions
Context restore (GPRs)    ~20       17 LDP + 3 MSR + ERET
Scheduler next()          ~10–30    depends on queue depth
Send/Reply roundtrip      ~200      two full context switches + message copy
AwaitEvent + notify       ~150      IRQ delivery to notifier

These numbers serve as regression tests: if a code change causes any of them to increase by more than 10%, it should be investigated.
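As a sketch of how such a regression gate might look, the following host-testable helper compares measured costs against recorded baselines with a 10% threshold. PerfCheck, check_regressions, and the sample figures are illustrative names, not part of the kernel:

```c
#include <stdint.h>
#include <stdio.h>

/* One row per measured operation, mirroring the table above. */
typedef struct {
    const char *name;
    uint64_t baseline_cycles;   /* recorded known-good cost */
    uint64_t measured_cycles;   /* cost measured at this boot */
} PerfCheck;

/* Flag any operation whose measured cost exceeds its baseline by more
   than 10%; returns the number of regressions found. */
static int check_regressions(const PerfCheck *checks, int n) {
    int failures = 0;
    for (int i = 0; i < n; i++) {
        uint64_t limit = checks[i].baseline_cycles
                       + checks[i].baseline_cycles / 10;
        if (checks[i].measured_cycles > limit) {
            printf("PERF REGRESSION: %s: %llu > %llu cycles\n",
                   checks[i].name,
                   (unsigned long long)checks[i].measured_cycles,
                   (unsigned long long)limit);
            failures++;
        }
    }
    return failures;
}
```

On the target this would run at startup, printing over the diagnostic UART; on the host it compiles unchanged for unit testing.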

Printf Without Destroying Timing: The Ring Buffer Strategy

The central difficulty of real-time debugging is the observer effect: adding instrumentation changes timing, and the changed timing may hide the bug. printf() — which involves multiple system calls, buffering, and UART output — is often orders of magnitude slower than the operation being observed.

The solution is to separate the fast observation path from the slow output path. The ring buffer log technique:

#define LOG_SIZE 4096
typedef struct {
    uint32_t timestamp_us;
    uint16_t event_code;
    int32_t  data;
} LogEntry;

static LogEntry log_buf[LOG_SIZE];
static int log_head = 0;

static inline void log_event(uint16_t code, int32_t data) {
    log_buf[log_head % LOG_SIZE] = (LogEntry){
        .timestamp_us = mmio_read(SYSTEM_TIMER_CLO),
        .event_code   = code,
        .data         = data,
    };
    log_head++;
}

log_event() costs a few tens of nanoseconds: one timer read (the dominant cost, since it is an MMIO access), three stores, and one increment. It has no I/O dependency and is cache-friendly (the log array is accessed sequentially, so hardware prefetch keeps the current cache line warm). After the bug occurs, the ring buffer is dumped at leisure — via UART, via a kernel crash handler, or via JTAG.

The ring buffer naturally handles wrap-around: when log_head exceeds LOG_SIZE, it overwrites old entries. For post-mortem analysis, the most recent LOG_SIZE events are always available. Increasing LOG_SIZE to 16384 (at 12 bytes per entry, about 192 KB) still fits easily within the BCM2711’s 4 GB of DRAM, though a larger buffer means longer dump times.
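A minimal host-side sketch of dumping the buffer oldest-first, mirroring the LogEntry/log_head scheme above (log_dump is a hypothetical helper):

```c
#include <stdint.h>
#include <stdio.h>

#define LOG_SIZE 4096

typedef struct {
    uint32_t timestamp_us;
    uint16_t event_code;
    int32_t  data;
} LogEntry;

/* Dump the ring buffer oldest-first and return the number of entries
   printed. Before the buffer has wrapped (head < LOG_SIZE), entries
   0..head-1 are valid; after wrapping, the oldest surviving entry
   sits at head % LOG_SIZE. */
static int log_dump(const LogEntry *buf, int head) {
    int count = head < LOG_SIZE ? head : LOG_SIZE;
    int start = head < LOG_SIZE ? 0 : head % LOG_SIZE;
    for (int i = 0; i < count; i++) {
        const LogEntry *e = &buf[(start + i) % LOG_SIZE];
        printf("%u us: event %u data %d\n",
               e->timestamp_us, e->event_code, e->data);
    }
    return count;
}
```

The same function works on the target, emitting over the polling UART from a crash handler.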

For high-frequency events (every tick, every interrupt), the ring buffer approach is the only viable strategy. Low-frequency events (task creation, speed command changes) can be logged with full kprintf without affecting timing.

Diagnosing Priority Inversion with the Ring Buffer

Priority inversion is one of the most counterintuitive bugs in RTOS systems. The Mars Pathfinder scenario (Chapter 16) illustrates the problem: a medium-priority task runs when a high-priority task should, causing deadline misses that appear as random system slowdowns with no obvious cause.

To diagnose priority inversion with a ring buffer, log every context switch with the following information:

  • Time of switch (µs)
  • From task (TID, priority, state)
  • To task (TID, priority)
  • Reason (Send, Receive, AwaitEvent, Reply, timer interrupt)

Post-mortem analysis: sort log entries by time, then look for intervals where the running task’s priority is lower than the highest-priority ready task. Each such interval is a potential priority inversion. Trace backward to find which Send is blocking the high-priority task and which task holds the resource being waited on.

This post-mortem approach works because priority inversion typically manifests as a timing anomaly (a task is late), and the ring buffer captures the exact sequence of context switches that led to it. Without the ring buffer, the only observable symptom is the deadline miss, which provides insufficient information to trace the cause.
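The scan described above can be sketched as a pure post-processing function, assuming each dumped log record carries the running task’s priority and the top ready priority at switch time. SwitchRecord and count_inversions are illustrative names; larger numbers mean higher priority, matching this course’s convention:

```c
#include <stdint.h>

/* One record per context switch, as dumped from the ring buffer. */
typedef struct {
    uint32_t t_us;         /* time of the switch */
    int running_prio;      /* priority of the task switched to */
    int top_ready_prio;    /* highest priority on the ready queues */
} SwitchRecord;

/* Count switches where a lower-priority task ran while a
   higher-priority task was ready: each one is a potential
   inversion interval to trace backward. */
static int count_inversions(const SwitchRecord *log, int n) {
    int inversions = 0;
    for (int i = 0; i < n; i++)
        if (log[i].running_prio < log[i].top_ready_prio)
            inversions++;
    return inversions;
}
```

In practice this runs on the host over a dumped log; each flagged record is a starting point for tracing the blocking Send chain.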

Worst-Case Execution Time: Measurement vs. Analysis

Measured WCET: run the target function 10,000 times under various input conditions and record the maximum observed time. This is fast and practical but statistically incomplete — the true worst case may occur with probability \(10^{-9}\) and never appear in measurement.
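A measured-WCET harness along these lines might look as follows. The time source is injected so the same code runs against PMCCNTR_EL0 on the target or a stub clock on the host; measure_wcet and the fake_* stubs are illustrative, not from the course kernel:

```c
#include <stdint.h>

typedef uint64_t (*now_fn)(void);

/* Run fn() `runs` times and keep the maximum observed cost. `now` is
   the injected time source: the PMU cycle counter on the target, any
   monotonic counter on the host. */
static uint64_t measure_wcet(void (*fn)(void), now_fn now, int runs) {
    uint64_t worst = 0;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now();
        fn();
        uint64_t dt = now() - t0;
        if (dt > worst) worst = dt;
    }
    return worst;
}

/* Host-side demonstration stubs: fake_work advances the fake clock by
   a varying amount, so the measured worst case is the largest step. */
static uint64_t fake_clock = 0;
static uint64_t fake_now(void) { return fake_clock; }
static void fake_work(void) {
    static int call = 0;
    fake_clock += (uint64_t)((++call % 3) + 1);  /* costs 2, 3, 1, ... */
}
```

This only captures the maximum *observed* time, which is exactly the statistical incompleteness noted above.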

Static WCET analysis: tools like AbsInt aiT and Bound-T analyze the binary’s control flow graph and the processor’s timing model to compute a provably sound upper bound on execution time. These tools account for pipeline stalls, cache miss penalties, and instruction-level parallelism. They produce bounds that are typically 10–30% higher than measured WCET, which is the cost of soundness.

The challenge of cache modeling: WCET analysis becomes exponentially harder with caches. The worst-case cache scenario (every access is a miss) is almost never achievable in practice, but a sound analysis must consider it. Abstract interpretation of cache behavior (Cousot and Cousot, 1977, applied to caches by Ferdinand, Heckmann, et al.) partitions memory accesses into always hit, always miss, first-miss (hit in later iterations of a loop), and not classified, and uses the partition to compute tight bounds.

For CS 452, full WCET analysis is impractical (it requires purchasing expensive tools and characterizing the BCM2711’s exact timing model). Instead, the hybrid approach:

  1. Measure WCET for all kernel syscalls at startup using the cycle counter.
  2. For user-level tasks, measure the worst-case observed time over a full test run.
  3. Reserve 50% headroom: if the measured WCET of the clock notifier is 3 µs, budget 6 µs for it in the schedule.
  4. Monitor system load using the idle task: if the idle task runs less than 30% of the time, the system is over-loaded and needs redesign.

The idle task utilization metric is a practical stand-in for formal schedulability analysis. The Liu–Layland bound for \(n\) tasks is \(U(n) = n(2^{1/n} - 1)\), which decreases toward \(\ln 2 \approx 0.693\) as \(n\) grows; for example, 21 tasks with identical utilizations \(U_i = 0.693/21 \approx 0.033\) sum to 0.693, below \(U(21) \approx 0.705\), and are therefore schedulable. If the measured total utilization (1 − idle fraction) exceeds 0.693, the system is beyond the RMS bound and may miss deadlines under worst-case input. Reducing task count, trimming per-task execution time, or lengthening periods resolves this.

The Timer Interrupt as a Profiling Trigger

A classical profiling technique on full operating systems is statistical sampling: at regular intervals (triggered by a timer interrupt), sample the program counter and accumulate a histogram of where time is being spent. This doesn’t require any code modification; functions that appear frequently in the histogram are the hot spots.

The same technique works in a bare-metal kernel. The GIC can be configured to route the system timer C3 interrupt (separate from C1 used by the clock server) to the kernel at a fixed 100 µs interval. The interrupt handler reads the current task’s program counter from ELR_EL1 (saved by the exception mechanism) and increments a counter in a histogram table. After a 1-second sampling window (10,000 samples), the top entries in the histogram identify the code consuming the most CPU time.

This technique is invaluable for locating unexpected hot spots: a busy-wait loop that was supposed to terminate quickly, a memory copy that is slower than expected, or a scheduler that is running more frequently than planned. The histogram approach works even for code that carries no instrumentation of its own; the only requirement is that IRQs are unmasked, which the kernel guarantees outside its short critical sections, so the sampling interrupt keeps firing no matter which code is running.
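A sketch of the histogram bookkeeping, assuming function address ranges are taken from the linker map (ProfEntry and accumulate are illustrative names):

```c
#include <stdint.h>

/* One row per function, with [start, end) covering its address
   range as read from the linker map. */
typedef struct {
    const char *name;
    uintptr_t start, end;
    int hits;
} ProfEntry;

/* Called once per timer sample with the interrupted program counter
   (ELR_EL1 on the target). Attributes the sample to the function
   whose range contains the PC; samples outside every range are
   silently dropped. */
static void accumulate(ProfEntry *table, int n, uintptr_t pc) {
    for (int i = 0; i < n; i++) {
        if (pc >= table[i].start && pc < table[i].end) {
            table[i].hits++;
            return;
        }
    }
}
```

After the sampling window, sorting the table by hits gives the hot-spot ranking directly.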


Chapter 21: Software Design Patterns for Real-Time

Good software architecture matters especially in real-time systems, because a structurally sound design makes timing analysis tractable while a poorly structured design makes it essentially impossible. This chapter catalogs the recurring patterns that appear in the CS 452 train control application and explains why each pattern exists.

The Server Pattern

A server is a task that owns a shared resource or provides a shared service. It runs a receive loop:

void server_main(void) {
    RegisterAs("ServerName");
    init_state();

    for (;;) {
        int sender_tid;
        RequestMessage req;
        Receive(&sender_tid, &req, sizeof(req));
        process_request(&req, sender_tid);
    }
}

process_request handles the request and calls Reply at some point — possibly immediately (if the result is available), or later (if the request must wait for a future event or a subsequent client interaction). Servers must never call Send() to another server from within the request-handling path, because Send() blocks and prevents the server from receiving new requests. If a server needs to communicate with another server, it delegates to a courier task (see below).

The server pattern provides:

  • Mutual exclusion by sequentialization: only one request is processed at a time. No locks needed.
  • Priority elevation: a high-priority client’s request will be served before any lower-priority client’s, because the scheduler runs the highest-priority ready task.
  • Single ownership: the resource’s state is consistent because only one code path modifies it.

The Worker Pattern

A worker is a task that performs one operation on behalf of a server and reports the result:

void worker_main(void) {
    int server_tid = WhoIs("ServerName");
    WorkRequest req = { .type = WORK_READY };  /* "give me work" */
    for (;;) {
        // Send to server: block until the server replies with a work item
        WorkItem item;
        Send(server_tid, &req, sizeof(req), &item, sizeof(item));
        // Do the work
        Result result = do_work(&item);
        // Send the result back to the server
        Send(server_tid, &result, sizeof(result), NULL, 0);
    }
}

Workers allow servers to offload time-consuming operations (slow I/O, blocking waits) without blocking themselves. The server maintains a pool of idle workers; when a request requires long computation, the server dispatches it to a worker and continues receiving other requests.

The Notifier Pattern

The Notifier pattern was introduced in Chapter 13. Its essence: separate the task that calls AwaitEvent (the notifier) from the task that serves requests (the server). The notifier has higher priority than the server so that interrupt events are processed with minimal latency.

A Notifier can also function as a courier: it carries a notification from an interrupt source (hardware) to a server (software) in exactly the same way a Courier carries a message from one server to another.

The Courier Pattern

A courier ferries messages between two servers. Neither server calls Send directly to the other (which would block one of them). Instead:

void courier_main(void) {
    int server_a = WhoIs("ServerA");
    int server_b = WhoIs("ServerB");

    Message msg;
    Message ack = {0};
    for (;;) {
        // Get work from server A; A's reply carries the message to
        // deliver. After the first iteration, the request we hand A is
        // B's reply from the previous round trip, closing the loop.
        Send(server_a, &ack, sizeof(ack), &msg, sizeof(msg));
        // Deliver to server B; B's reply becomes the next request to A
        Send(server_b, &msg, sizeof(msg), &ack, sizeof(ack));
    }
}

The courier is at higher priority than both servers if the inter-server communication is on a critical path; at lower priority if it is background work. Couriers are a clean way to implement cascaded server architectures without creating direct dependencies between servers.

The Proprietor Pattern

A proprietor controls exclusive access to a backend resource (a serial port, the SPI bus, a memory region) and serves requests sequentially. Unlike a generic server, a proprietor may call Send() to the backend resource in the course of handling a client request:

void spi_proprietor_main(void) {
    RegisterAs("SPI");
    for (;;) {
        int client;
        SPIRequest req;
        Receive(&client, &req, sizeof(req));
        // Execute the SPI transaction (blocking on hardware completion)
        SPIResult result = spi_transact(&req);
        Reply(client, &result, sizeof(result));
    }
}

The proprietor is effectively serializing all SPI access — an appropriate design since SPI is a single-master bus. Clients that need SPI access send to the proprietor and block; the proprietor runs the transaction and replies. Because the proprietor calls the SPI driver (not another SRR server), and the SPI driver is interrupt-driven (see Chapter 14), the actual blocking is done by the SPI driver’s internal notifier, not by the proprietor.

The Administrator Pattern

An administrator multiplexes a shared resource among multiple clients, managing a queue of pending requests and dispatching them as capacity becomes available. This is a generalization of the clock server (Chapter 13): it manages delayed-reply semantics, maintaining a list of clients that have sent requests but have not yet been replied to.

Client A sends request → Admin enqueues A's request, doesn't reply yet
Client B sends request → Admin enqueues B's request
Resource becomes available → Admin selects head of queue, dispatches to resource
Resource completes → Admin replies to A, dequeues B's request, dispatches to resource

The administrator is necessary whenever: (1) multiple clients contend for the same resource, (2) service is asynchronous (the resource takes time to complete), and (3) the order of replies must be controlled (FIFO, priority-ordered, etc.).
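The pending-request bookkeeping at the heart of an administrator is an ordinary FIFO of client TIDs; a self-contained sketch (PendQueue and the pq_* helpers are illustrative names):

```c
#define MAX_PENDING 16

/* Circular FIFO of client TIDs awaiting a Reply. */
typedef struct {
    int tids[MAX_PENDING];
    int head, tail, count;
} PendQueue;

/* Enqueue a client; returns -1 if the queue is full. */
static int pq_push(PendQueue *q, int tid) {
    if (q->count == MAX_PENDING) return -1;
    q->tids[q->tail] = tid;
    q->tail = (q->tail + 1) % MAX_PENDING;
    q->count++;
    return 0;
}

/* Dequeue the oldest client; returns -1 if empty. */
static int pq_pop(PendQueue *q) {
    if (q->count == 0) return -1;
    int tid = q->tids[q->head];
    q->head = (q->head + 1) % MAX_PENDING;
    q->count--;
    return tid;
}
```

The administrator pushes a TID on each Receive it cannot serve immediately, and pops one to Reply to each time the resource reports completion. A priority-ordered variant would replace the FIFO with the kernel’s priority queue.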

The Supervisor Pattern

A supervisor monitors a set of worker tasks and restarts any that fail (exit unexpectedly or stop responding). The supervisor pattern is drawn from Erlang/OTP’s actor model, where it is a foundational design primitive. In an RTOS, a supervisor provides fault tolerance without requiring the entire system to restart.

void supervisor_main(void) {
    /* Create all workers and remember their TIDs */
    int worker_tids[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) {
        worker_tids[i] = Create(worker_priority, worker_main);
    }

    /* Periodic heartbeat check */
    int clock_tid = WhoIs(SRV_CLOCK);
    for (;;) {
        Delay(clock_tid, HEARTBEAT_INTERVAL_TICKS);
        for (int i = 0; i < NUM_WORKERS; i++) {
            /* Send a ping; if the worker doesn't reply within a timeout,
               assume it has deadlocked and replace it */
            if (!ping_worker(worker_tids[i])) {
                /* Worker is unresponsive — destroy and recreate */
                Destroy(worker_tids[i]);
                worker_tids[i] = Create(worker_priority, worker_main);
                log_event(WORKER_RESTART, i);
            }
        }
    }
}

Implementing a full supervisor in CS 452 requires a Destroy() syscall that the basic kernel may not have. A simpler alternative is for workers to include a “timeout and re-initialize” mechanism within themselves: if a worker blocks for more than a threshold time without making progress, it sends an alert to the supervisor and reinitializes its own state. This self-healing approach avoids the complexity of task destruction while still providing recovery from state corruption.

The supervisor pattern is valuable for the train application: if the CAN TX server crashes (perhaps because a malformed message corrupted its stack), the supervisor can restart it. The MCP2515 will need to be reinitialized (RESET command followed by CNF register writes) after the restart, but this is straightforward and takes less than 1 ms.

The Observer Pattern

An observer (also called subscriber or event listener) receives notifications about events without having to poll. In the microkernel SRR model, true observers are implemented through a registration mechanism:

  1. The observable server maintains a list of subscriber TIDs.
  2. When an event of interest occurs, the server iterates the list and sends notifications to all subscribers.
  3. Subscribers register at initialization by sending a SUBSCRIBE message to the server.
/* Observable server (e.g., Track Server) */
static int subscribers[MAX_SUBS];
static int num_subscribers = 0;

void handle_subscribe(int sender_tid) {
    if (num_subscribers < MAX_SUBS)
        subscribers[num_subscribers++] = sender_tid;
    Reply(sender_tid, &ok, sizeof(ok));
}

void notify_all(SensorEvent *ev) {
    for (int i = 0; i < num_subscribers; i++) {
        /* Non-blocking: store event in subscriber's mailbox */
        /* In pure SRR, this requires a Courier or the subscriber
           to pre-send a Receive-ready message */
    }
}

There is a fundamental tension in implementing observers with SRR: Send() blocks the sender (the observable server) until the subscriber replies. If the observable server calls Send to 10 subscribers, it blocks for 10 round-trips before it can process the next event. For high-frequency events (10 sensor events per second × 10 subscribers = 100 inter-task messages/second), this is manageable. For very high frequencies (hundreds of events per second), a different approach is needed.

The standard CS 452 solution: subscribers pre-send a “ready for notification” message to the server and block. When an event occurs, the server Reply()s to all currently-blocked subscribers with the event data. The server never blocks: it Reply()s to tasks that are already waiting. Tasks that are slow to re-subscribe miss intermediate events (a pull model handles this differently). This is the same pattern as the Clock Server’s delayed-reply queue (Chapter 13), applied to event delivery instead of timing.

The State Machine Pattern

Real-time control logic is often best represented as an explicit finite state machine. The train controller task is a classic example: at any moment, a train is in one of a small number of states (free-running, slowing-for-stop, stopped, reversing, emergency-stopped), and each incoming event (sensor fires, speed calibrated, stop command received) triggers a state transition.

typedef enum {
    TRAIN_FREE_RUN,
    TRAIN_APPROACH_STOP,
    TRAIN_STOPPING,
    TRAIN_STOPPED,
    TRAIN_REVERSING,
    TRAIN_E_STOP
} TrainControlState;

typedef enum {
    EV_SENSOR_FIRED,
    EV_SPEED_CMD,
    EV_STOP_CMD,
    EV_ESTOP_CMD,
    EV_TIMEOUT
} TrainEvent;

void train_controller_main(void) {
    TrainControlState state = TRAIN_STOPPED;
    TrainParams params;
    // ... initialize params ...

    for (;;) {
        int sender;
        TrainMessage msg;
        Receive(&sender, &msg, sizeof(msg));
        Reply(sender, NULL, 0);  /* Ack immediately; process below */

        switch (state) {
            case TRAIN_FREE_RUN:
                if (msg.event == EV_SENSOR_FIRED)
                    state = handle_sensor_freerun(&params, &msg);
                else if (msg.event == EV_STOP_CMD)
                    state = TRAIN_APPROACH_STOP;
                break;
            case TRAIN_APPROACH_STOP:
                if (msg.event == EV_SENSOR_FIRED)
                    state = handle_sensor_approaching(&params, &msg);
                break;
            // ... additional states ...
        }
    }
}

The state machine pattern keeps control logic comprehensible even as the number of states and events grows. The state variable is always visible; the transition logic is localized in the switch statement; and the invariant “the train is always in exactly one state” is enforced structurally.

A key advantage of the state machine pattern in real-time systems is testability: by constructing sequences of events in test code and feeding them to the state machine, you can verify correct transitions without physical hardware. This is especially valuable for the emergency stop transitions, which are difficult to trigger reliably in hardware testing.
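As a sketch of that testing style, the transition logic can be factored into a pure function and exercised on the host. The transitions below cover only the ones shown above plus an assumed "e-stop from any state" rule; this is not the full controller:

```c
/* State and event types as in the controller above. */
typedef enum {
    TRAIN_FREE_RUN, TRAIN_APPROACH_STOP, TRAIN_STOPPING,
    TRAIN_STOPPED, TRAIN_REVERSING, TRAIN_E_STOP
} TrainControlState;

typedef enum {
    EV_SENSOR_FIRED, EV_SPEED_CMD, EV_STOP_CMD, EV_ESTOP_CMD, EV_TIMEOUT
} TrainEvent;

/* Pure transition function; unknown combinations leave the state
   unchanged, which a test can verify explicitly. */
static TrainControlState step(TrainControlState s, TrainEvent ev) {
    if (ev == EV_ESTOP_CMD)
        return TRAIN_E_STOP;           /* assumed: e-stop from any state */
    switch (s) {
    case TRAIN_FREE_RUN:
        if (ev == EV_STOP_CMD) return TRAIN_APPROACH_STOP;
        break;
    case TRAIN_STOPPED:
        if (ev == EV_SPEED_CMD) return TRAIN_FREE_RUN;   /* assumed */
        break;
    default:
        break;
    }
    return s;
}
```

The controller task then reduces to Receive, step(), and the side effects each new state demands, keeping all transition logic host-testable.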

The Deadline Monotonic Variant: Practical Priority Assignment

Chapter 15 established that Rate Monotonic Scheduling assigns priorities inversely proportional to period. In practice, the tasks in a train control system don’t all have hard periodic requirements. Many are sporadic (triggered by external events) or aperiodic (best-effort). Mapping these onto the priority space requires practical judgment:

Rule 1: interrupt-coupled tasks get the highest priorities. The CAN RX Notifier, the UART RX Notifier, and the GPIO interrupt handler are latency-critical. Every microsecond of additional latency in these tasks means a microsecond more error in sensor timestamps. Assign them priorities 28–31 (reserving 31 for the highest).

Rule 2: time-critical servers get the next tier. The Clock Server (which must tick at 10 ms), the CAN RX Server (which must process sensor events before the next one arrives), and the Track Server (which must update position estimates) are in the 20–27 range.

Rule 3: control tasks are in the middle. The train supervisor tasks, which compute routing and braking decisions, are in the 10–19 range. They should run faster than the human-interface tasks but slower than the kernel infrastructure.

Rule 4: diagnostic and display tasks get the lowest priorities. The UART TX server, the terminal display task, and the idle task are in the 1–9 range.

This layered priority assignment is the real-time equivalent of a software architecture diagram: it tells you not just what the tasks are, but the relative importance hierarchy between them.

Comparing the CS 452 Microkernel to QNX, seL4, and L4

The CS 452 kernel shares conceptual DNA with several production microkernels. Comparing them illuminates both the design decisions and their consequences.

QNX Neutrino uses SRR as its primary IPC mechanism, just as CS 452 does. QNX adds non-blocking variants (MsgSendNc, asynchronous message passing), a pulse mechanism for lightweight notifications, and a capability system for security. QNX kernel context switches take approximately 1–3 µs on Cortex-A series processors — comparable to CS 452’s target of under 5 µs for a full SRR roundtrip. QNX is used in automotive (QNX is a subsidiary of BlackBerry), medical devices, and industrial control. It is a direct descendant of the same design philosophy that CS 452 teaches.

L4 microkernel family (L4Ka::Pistachio, Fiasco.OC, seL4) is known for extremely fast IPC: optimized L4 fast paths have been measured at a few tens of CPU cycles on some architectures, compared to CS 452’s ~250-cycle target. L4 achieves this through register-based IPC (message words passed in registers, not copied through memory), lazy context switching (only saving registers that are actually used), and careful cache management. The price is a more complex ABI and tighter coupling between the kernel and user space.

seL4 (from NICTA/Data61) is the formal verification story: seL4 is the first general-purpose microkernel whose correctness has been proved in Isabelle/HOL, with the proof connecting the C implementation to the abstract specification. The seL4 proof covers functional correctness (the kernel behaves according to its spec), absence of undefined behavior, and absence of dead code. The IPC latency is somewhat slower than Fiasco.OC (~600 cycles in some benchmarks) due to the constraints imposed by the formal proof model. seL4 is used in critical defense and avionics applications.

MINIX 3 is the microkernel designed by Andrew Tanenbaum specifically for reliability: every device driver runs in its own user-space task, isolated from the kernel. A crashing network driver can be restarted without rebooting. MINIX 3 uses a message-passing model similar to CS 452 but adds process isolation and driver supervision. The performance penalty compared to a monolithic kernel is approximately 5–10% in I/O throughput — acceptable for most applications.

The common thread: all of these systems use message passing as the primary inter-task communication mechanism, all use preemptive scheduling with fixed priorities, and all are designed to make inter-task communication as fast as the hardware allows. CS 452’s kernel is a pedagogical distillation of these ideas, stripped to the essence that can be implemented in a semester and analyzed in a real-time scheduling course.

Design Anti-Patterns

Certain structures consistently lead to deadlocks, priority inversions, or timing violations:

Circular sends: task A sends to task B, which sends back to task A (directly or transitively). Since Send blocks, this creates an immediate deadlock. Detection: draw the send graph; it must be a DAG.
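Checking the DAG property mechanically is a small exercise in cycle detection; a host-side sketch over a hypothetical adjacency-matrix representation of the send graph:

```c
#define MAX_TASKS 8

/* Send graph as an adjacency matrix: sends[a][b] != 0 means task a
   Sends to task b. DFS three-colour cycle detection:
   state 0 = unseen, 1 = on the current DFS path, 2 = done. */
static int has_cycle_from(int sends[MAX_TASKS][MAX_TASKS],
                          int node, int *state) {
    state[node] = 1;
    for (int next = 0; next < MAX_TASKS; next++) {
        if (!sends[node][next]) continue;
        if (state[next] == 1) return 1;          /* back edge: cycle */
        if (state[next] == 0 && has_cycle_from(sends, next, state))
            return 1;
    }
    state[node] = 2;
    return 0;
}

/* Returns 1 if any send cycle (deadlock risk) exists. */
static int send_graph_has_cycle(int sends[MAX_TASKS][MAX_TASKS]) {
    int state[MAX_TASKS] = {0};
    for (int i = 0; i < MAX_TASKS; i++)
        if (state[i] == 0 && has_cycle_from(sends, i, state))
            return 1;
    return 0;
}
```

Running this over the design’s send graph before coding catches circular-send deadlocks on paper rather than on the track.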

Mixing Send and Receive in the same task: except in the Proprietor pattern (where the task Receives from clients and Sends to a dedicated backend), a task that both sends and receives is likely to deadlock. The server loop (Receive only) and client pattern (Send only) are the safe idioms.

Reply-before-send: replying to a sender before completing the work that the sender requested can lead to inconsistent state if the sender immediately sends another request based on the (not-yet-complete) result of the first.

High-priority blocking: the CS 452 kernel has no priority inheritance, so a high-priority task that calls Send to a low-priority server does not elevate the server’s execution priority. While the client is Reply-blocked, any medium-priority task can interpose between the send and the reply — a classic priority inversion that shows up as unexplained latency. Assign server priorities according to the highest-priority client, not the server’s own computation cost.

The guiding principle of the CS 452 kernel design applies universally: make the common case fast, make the worst case bounded, and choose simplicity over cleverness. A system with fewer tasks, fewer interactions, and fewer shared resources is not just easier to understand — it is easier to analyze, easier to debug, and more likely to be correct. As Antoine de Saint-Exupéry wrote of aircraft design, in a sentiment software architects share: “Perfection is attained not when there is nothing more to add, but when there is nothing more to remove.”

Testing Real-Time Code Without a Test Framework

Testing is harder in a real-time context than in application code for two reasons. First, the code interacts with hardware that is difficult to simulate (the GIC-400, the system timer, the MCP2515). Second, timing behavior under test may differ from timing behavior in production, invalidating timing-sensitive tests.

The most effective testing strategy for a CS 452 kernel is bottom-up integration testing combined with careful simulation stubs for hardware:

Layer 0: unit tests for pure functions. Functions that perform arithmetic (velocity calculation, distance interpolation, EWMA update), data structure operations (ring buffer push/pop, priority queue insert/extract), and protocol encoding/decoding (Märklin frame construction, track graph path reconstruction) can be tested with standard unit test techniques on the host machine. The key is to write these functions with no hardware dependencies — no mmio_read, no AwaitEvent, no timer reads — so they compile and run identically on x86 and AArch64.

Layer 1: kernel syscall correctness. A set of kernel test tasks that verify:

  • Create/Exit lifecycle: create N tasks, verify they all run and exit cleanly.
  • Send/Receive/Reply ordering: sender blocks until receiver calls Receive; receiver blocks until sender sends; both unblock when Reply is called.
  • Priority ordering: create tasks at priorities 5, 10, 15; verify they run in priority order.
  • AwaitEvent timing: create a notifier/server pair; verify the notifier wakes within 2 ticks of the timer interrupt.

These tests run on the RPi 4 hardware with the kernel booted normally. Success is verified by UART output (“PASS” or “FAIL”) rather than halting on assertion. Each test takes under 100 ms; the full suite runs in under 5 seconds.

Layer 2: integration smoke tests. A sequence of train control operations with the layout power on but no physical trains:

  • CAN TX: send a speed command; verify the MCP2515 transmits it (check CANINTF.TX0IF clears).
  • CAN RX: inject a sensor event by touching a reed switch; verify the CAN RX server receives it.
  • Clock: run the clock server for 10 ticks; verify the tick count increments and no tick is missed.
  • Name server: register two names; verify both can be looked up.

Layer 3: physical train tests. A single train running in a loop: verify sensor attribution is correct for 100 consecutive laps, timing jitter stays below 20 ms, and no missed ticks are detected. This requires the physical layout and is done last.

The canary pattern mentioned in Chapter 5 is a defensive testing technique: fill all task stacks with a known pattern (e.g., 0xDEADBEEF) at initialization and periodically scan for corruption. A stack that has been overwritten past its boundary is detected before it silently corrupts an adjacent data structure.
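A host-testable sketch of the canary discipline (stack_paint and stack_headroom are illustrative helpers; on the target, base would be the low end of each task’s stack region):

```c
#include <stdint.h>

#define CANARY 0xDEADBEEFu

/* Fill a task's stack region with the canary pattern at creation. */
static void stack_paint(uint32_t *base, int words) {
    for (int i = 0; i < words; i++)
        base[i] = CANARY;
}

/* Count untouched canary words starting from base[0]. Stacks grow
   downward toward base, so fewer intact words means less headroom;
   zero means the stack has overflowed its region. */
static int stack_headroom(const uint32_t *base, int words) {
    int intact = 0;
    while (intact < words && base[intact] == CANARY)
        intact++;
    return intact;
}
```

A periodic low-priority task can scan every stack and kassert on headroom below a threshold, catching overflows before they corrupt neighbours.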

Task Lifecycle Debugging: Common Failure Modes

Over the course of building the kernel, several failure modes appear repeatedly and are worth cataloging:

Missed Reply: a server receives a request but forgets to call Reply() in one code path (e.g., an early return on error). The sender remains Reply-Blocked indefinitely. Symptom: the sender never returns from Send(). Diagnosis: add a log entry at every Reply() call site; identify which branch was not reached.

Double Reply: a server calls Reply() twice for the same request — once immediately and once when a future event fires. The second Reply() writes to a TID that may now belong to a different task (the original task’s TID was recycled). This is undefined behavior and may produce sporadic crashes. Diagnosis: add an assertion that each server-state entry has a “replied” flag, and assert it is false before replying.

Send to own TID: a task sends to itself. The sender immediately becomes Send-blocked, and the only task that could ever Receive that message is the sender itself, which can no longer run — an instant one-task deadlock. Diagnosis: add a kernel assertion that the sender’s TID != the receiver’s TID.

Task pool exhaustion: Create() fails with -2 (no free descriptors). Symptom: tasks are not created. Diagnosis: add a diagnostic print in Create() for the -2 case; count active tasks. This is often caused by failing to call Exit() at the end of short-lived tasks, leaving their descriptors stuck in the Exited state.

Clock server tick loss: the clock notifier sends a tick to the clock server, but the clock server is blocked on a previous tick’s computation when the next tick fires. If the clock notifier is at a higher priority, it will preempt the clock server, but it cannot send another tick until the server processes the previous one (Send blocks). Result: one tick is delayed until the server finishes the previous computation. Diagnosis: compare expected tick count (initial ticks + elapsed time / tick period) with actual; any discrepancy indicates processing overruns.

Name server race: a task calls WhoIs("ServerName") before the server has called RegisterAs("ServerName"). The Name Server returns -1 (name not found). The task must either retry (adding delay) or wait for the server to register via a synchronization mechanism. The standard idiom: the first task creates all servers and then delays briefly before creating application tasks that look up servers.

Defensive Design: Assertions, Watchdogs, and Graceful Degradation

A real-time system cannot be brought down for a software update when a bug manifests in production. The trains must keep running, sensors must keep firing, and the CAN bus must keep communicating. This section covers the engineering discipline of defensive design — building a system that detects faults, limits their blast radius, and degrades gracefully rather than failing catastrophically.

Kernel assertions are the first line of defense. Unlike assert() from <assert.h> (which calls abort(), which requires a C runtime), a bare-metal kernel assertion must halt the CPU in a known state and emit diagnostic information via the always-available polling UART:

#define kassert(cond) do { \
    if (!(cond)) { \
        kprintf("KASSERT FAILED: %s:%d: %s\n", __FILE__, __LINE__, #cond); \
        print_backtrace(); \
        for (;;) asm volatile("wfe"); /* halt core, allow JTAG attach */ \
    } \
} while (0)

The wfe (Wait For Event) instruction halts the core while keeping it responsive to JTAG debug events — a JTAG debugger can attach to a halted core and inspect its register state. In contrast, wfi (Wait For Interrupt) also halts but responds to interrupts, which might allow the kernel to escape the halt state unexpectedly.

Assertions should be placed at every invariant boundary:

  • At the top of Reply(): kassert(sender->state == REPLY_WAIT) — catching the double-reply bug before it corrupts state.
  • At the top of scheduler_next(): kassert((ready_mask == 0) || (ready_queues[31-__builtin_clz(ready_mask)][0] != NULL)) — enforcing the scheduler invariant.
  • At every TID-to-pointer lookup: kassert(tid >= 0 && tid < MAX_TASKS && task_pool[tid].state != EXITED).

A useful discipline: write the invariant in a comment, then immediately write the kassert that checks it. The comment explains why; the assert enforces it.

Task watchdogs extend the assertion model to running tasks. A watchdog is a monitor task that expects to receive a “heartbeat” message from each critical task within a bounded interval. If the heartbeat is late, the watchdog declares the task failed and takes recovery action.

#define WATCHDOG_PERIOD_TICKS 50  /* 500 ms at 10 ms/tick */

typedef struct {
    int  tid;
    int  last_heartbeat_tick;
    bool alive;
} WatchedTask;

static WatchedTask watched[MAX_WATCHED_TASKS];
static int         num_watched;

/* Called by any task to check in */
void Heartbeat(int watchdog_tid) {
    HeartbeatMsg msg = { .tid = MyTid() };
    HeartbeatReply reply;
    Send(watchdog_tid, &msg, sizeof(msg), &reply, sizeof(reply));
}

/* Watchdog main loop */
void watchdog_main(void) {
    RegisterAs("Watchdog");
    int clock_tid = WhoIs("ClockServer");
    int tick = 0;

    for (;;) {
        tick = Time(clock_tid);

        /* Check for missed heartbeats */
        for (int i = 0; i < num_watched; i++) {
            if (watched[i].alive &&
                tick - watched[i].last_heartbeat_tick > WATCHDOG_PERIOD_TICKS) {
                kprintf("WATCHDOG: task %d missed heartbeat at tick %d\n",
                        watched[i].tid, tick);
                /* Recovery: restart the task, or enter degraded mode */
                handle_task_failure(watched[i].tid);
            }
        }

        /* Process incoming heartbeat messages (non-blocking check) */
        /* ... */

        Delay(clock_tid, WATCHDOG_PERIOD_TICKS / 5);
    }
}

The watchdog task runs at a priority slightly below the tasks it monitors. This creates a blind spot: if a critical task starves because a higher-priority task loops, the watchdog starves along with it and cannot report the failure. For CS 452’s purposes, the watchdog is therefore most useful for detecting server tasks that have deadlocked (stuck waiting for a reply that never comes) rather than for detecting CPU starvation.

SRR deadlock detection is a deeper problem. A deadlock in the SRR system forms a directed cycle in the “waiting for” graph: task A is blocked waiting for B (in SEND_WAIT or REPLY_WAIT), B is waiting for C, and C is waiting for A. The kernel can detect this at the moment any Send() would complete the cycle.

A practical detection algorithm: when Send(receiver_tid) is called, before blocking the sender, the kernel traverses the “waiting for” chain starting from receiver_tid:

/* Returns true if sending to receiver_tid would create a cycle.
   A blocked sender is waiting in either SEND_WAIT (its receiver has not
   yet called Receive) or REPLY_WAIT (its receiver has not yet called
   Reply); both states must be followed along the chain. */
bool would_deadlock(int sender_tid, int receiver_tid) {
    int current = receiver_tid;
    for (int depth = 0; depth < MAX_TASKS; depth++) {
        if (task_pool[current].state != SEND_WAIT &&
            task_pool[current].state != REPLY_WAIT)
            return false;                    /* current is not waiting for anyone */
        current = task_pool[current].waiting_for_tid;
        if (current == sender_tid)
            return true;                     /* cycle detected */
    }
    return false;
}

The traversal is O(n) in the number of tasks, which is acceptable for a small task set. On deadlock detection, the kernel should immediately panic (via kassert(false)) with a printed description of the cycle: “deadlock: task 5 → task 8 → task 3 → task 5”. This is far more useful than a silent hang.

In practice, deadlocks in a well-designed SRR system are rare — the send graph should be acyclic by construction. The deadlock checker serves as a safeguard against design errors, particularly during development when the server dependency graph is still evolving.

Stack canary monitoring complements the initialization-time canary fill described in Chapter 3. A background scan task periodically checks the canary regions at the bottom of each task stack:

#define STACK_CANARY  0xDEADBEEFDEADBEEFULL
#define CANARY_WORDS  8  /* 64 bytes of canary at the stack bottom */

void stack_monitor_main(void) {
    int clock_tid = WhoIs("ClockServer");
    for (;;) {
        Delay(clock_tid, 100);  /* check every 1 second */

        for (int tid = 0; tid < MAX_TASKS; tid++) {
            if (task_pool[tid].state == EXITED) continue;
            uint64_t *canary = (uint64_t *)task_pool[tid].stack_bottom;
            for (int i = 0; i < CANARY_WORDS; i++) {
                if (canary[i] != STACK_CANARY) {
                    kprintf("STACK OVERFLOW: task %d, word %d corrupted (0x%llx)\n",
                            tid, i, canary[i]);
                    kassert(false);
                }
            }
        }
    }
}

The scan task runs at priority 1 (just above idle). It will not starve any functional task but will detect stack overflows within 1 second of their occurrence — before the corrupted stack data has a chance to produce mysterious crashes elsewhere.

Graceful degradation is the most demanding form of defensive design: what does the system do when something breaks that it cannot fix? In a train control system, the relevant failures are:

  • A sensor reed switch fails (stops firing). The system last saw the train at some sensor; it cannot determine whether the train has advanced or is stuck. Safe response: command the train to stop immediately (it may already be stopped, in which case the command is harmless) and alert the operator via UART. The reservation system’s invariant — “all segments within stopping distance must be reserved” — ensures no other train enters the presumed-occupancy zone.

  • A train derails or is manually removed from the track. Sensors stop firing for that train, and eventually its next-expected sensor fires for a different train (or not at all). The operator must explicitly inform the system that the train has been removed. The track server should provide a TrainRemove(loco_id) command that releases all reservations held by that train and marks it inactive.

  • CAN bus failure. No sensor events arrive; no commands reach the CS3. All trains should issue emergency stops immediately — but the CAN bus failure means the stop commands cannot be transmitted either. The system is helpless; the DCC track output will maintain the last commanded speed. The CS3 has a built-in watchdog that stops all trains if it loses CAN communication for more than a few seconds. This is an example of hierarchical safety: the CS 452 software relies on the CS3’s hardware watchdog as a last resort, rather than trying to implement its own train-stopping mechanism that itself relies on CAN.

  • UART TX server crash. The operator terminal goes dark. The train system continues to run (the UART server is not on the critical path for train control), but the operator loses visibility. A periodic blinking LED on the RPi’s activity pin provides a heartbeat the operator can observe even with UART down.

The principle underlying all of these responses is fail-safe by default: when in doubt, stop the trains. A stopped train is not a useful train, but it is a safe train. Only advance the trains when the system has positive confidence in their positions and the safety of the track ahead. The reservation protocol implements this: a train that cannot acquire a reservation stops. The velocity-reservation invariant (Chapter 19) implements this: a train that might be too close to an unreserved segment slows preemptively.

This “stop-on-uncertainty” principle is the most important safety lesson of CS 452. It applies far beyond model trains: in avionics, a flight computer that loses confidence in its sensor suite should relinquish control to the pilot, not try to continue operating with degraded inputs. In automotive safety systems, a braking actuator that loses communication with the ECU should apply maximum braking (fail-safe), not release (fail-dangerous). Real-time systems engineers must decide, for each failure mode, whether the safe failure state is “freeze” (do nothing, hold last value) or “command to a safe output” (apply brakes, close valves, stop trains), and build that decision into the system architecture before the failure occurs.

The Architecture of a CS 452 Kernel in 2026

Looking at the complete system built over the course of CS 452, it is striking how much capability emerges from a small number of simple primitives:

Five kernel syscalls: Create, Exit, Yield, Send/Receive/Reply (counting SRR as three), AwaitEvent. That is seven primitives. With these seven, students build a complete real-time OS capable of controlling multiple trains on a physical layout.

One scheduling policy: static priority FIFO. No virtual runtime, no dynamic priorities, no fairness. The simplicity enables analysis: the worst-case behavior of every task can be computed from the priority table.

One IPC mechanism: synchronous message passing via the SRR triple. No shared memory, no semaphores, no mutexes. The absence of shared memory eliminates data races. The synchronous semantics make reasoning about task interaction simple: when a sender wakes from Send, the receiver has already processed its request.

One hardware abstraction: AwaitEvent. All hardware interaction — timers, UART, SPI, CAN — ultimately arrives through AwaitEvent. The notifier pattern wraps every hardware interrupt source in the same software structure. A programmer who understands the clock notifier understands the CAN RX notifier, the UART RX notifier, and every future hardware interface.

This design elegance is not accidental. It reflects decades of experience building operating systems, distilled by multiple generations of CS 452 instructors (Bell, Cowan, Berry, and now Karsten) into the minimum that is sufficient. The microkernel literature (QNX, L4, Mach) explored many more features — asynchronous IPC, capability systems, memory servers, namespace management — and CS 452 strips them away to expose the essential core.

The trains are a pedagogical device for delivering a lesson about computing at the boundary between software and physics. The sensor that fires, the motor that responds, the switch that throws — these are the ultimate arbiters of correctness. A kernel that looks correct in simulation but fails on hardware has failed its primary purpose. CS 452 demands hardware correctness, and that demand clarifies thinking about every design decision.

The Build-Measure-Verify Loop

A professional discipline that CS 452 implicitly teaches — and that transfers directly to industrial embedded systems work — is the build-measure-verify loop:

  1. Build: implement the smallest piece of functionality that can be tested independently.
  2. Measure: observe the behavior with cycle counters, ring buffer logs, or UART output. Record the measurement, not just the outcome.
  3. Verify: compare the measurement against the predicted behavior from the design. If they agree, proceed. If not, diagnose until they agree.

The temptation in embedded systems development is to skip the measure step: implement a feature, observe that “it seems to work,” and move on. This leads to systems that work under test conditions and fail in production. The train controller that “works fine with one train” often fails with two, precisely because the measurement step was skipped and a timing margin was assumed without being measured.

The professional embedded systems engineer maintains a personal logbook of measured performance numbers: “context switch = 240 cycles (measured 2026-01-15)”, “CAN frame TX = 5.2 µs at 10 MHz SPI”, “clock server jitter = ±12 µs (one tick period).” These numbers become the basis for future design decisions and serve as regression tests when the system is modified.

In an industrial setting, these measurements would be part of a formal Design Verification Plan (DVP) — a document that specifies, for each design requirement, the test that verifies it and the acceptance criterion. For a train control system, the DVP might specify: “Sensor event processed within 50 ms of occurrence: verified by ring buffer timestamp comparison, pass if all events within 50 ms across 100 consecutive sensor events.” CS 452 does not require a formal DVP, but the mental habit of specifying what “correct” means before testing is the same.

Closing Thoughts: Real-Time Programming as a Discipline

Real-time programming is, at its core, a discipline of making promises about time and keeping them. The kernel makes a promise: the highest-priority ready task will always run. The scheduler keeps it. The clock server makes a promise: DelayUntil(N) will return no earlier than tick N. The delay queue implementation keeps it. The train engineer makes a promise: the train will stop before the contested segment. The stopping-distance calibration and the reservation system together keep it.

What makes real-time programming intellectually interesting — and professionally important — is that these promises are verifiable. Unlike a general-purpose application where “correctness” is a matter of output matching expected output, a real-time system’s correctness is partially temporal: it is not enough to compute the right answer; the answer must arrive before the deadline. This temporal correctness is harder to measure and harder to guarantee, but it is not inaccessible. The tools of scheduling theory (Chapter 15), the discipline of WCET measurement (Chapter 16), the notifier pattern (Chapter 13), and the reservation protocol (Chapter 19) together form a coherent engineering methodology for temporal correctness.

The Märklin trains are small and slow. Their physics are simple and their failure modes are benign (a crash is a bump at 20 cm/s, not a catastrophe). The ABS system in your car, the flight management computer in a commercial aircraft, the safety PLC in a nuclear reactor — these are real-time systems where the promises are life-safety requirements and the consequences of failure are non-benign. The methodology is the same; the stakes are higher. Learning to build a correct, analyzable, timing-guaranteed microkernel on a Raspberry Pi is the beginning of learning to build those systems.


Chapter 22: Kernel System Call Reference

Every system call in this kernel is a carefully bounded operation with defined preconditions, postconditions, timing guarantees, and error conditions. This chapter documents the complete kernel API as a formal reference. The descriptions here are more precise than the explanations in earlier chapters, which prioritized intuition over specification.

Conventions

TID: a task identifier, a non-negative integer. TID 0 is the idle task. TIDs are assigned monotonically from 1 in creation order. A TID is valid as long as the task exists (not Exited). The kernel does not recycle TIDs; once a task exits, its TID is never reused.

Priority: an integer in [1, 31]. The idle task runs at priority 0. Priority 31 is the highest. The scheduler always runs the highest-priority task in the READY state.

Error codes: negative return values indicate errors. The specific negative values are defined by the kernel:

Code  Meaning
 -1   Invalid argument (bad TID, bad event ID, bad priority)
 -2   Resource exhausted (no free task descriptors for Create)
 -3   Target task Exited before or during SRR operation
 -4   Buffer truncation (message or reply was truncated due to size mismatch)

int Create(int priority, void (*fn)(void))

Description: creates a new task. The new task runs fn at the given priority. The task begins executing when the scheduler selects it; it is immediately placed in the READY state after creation.

Preconditions:

  • priority ∈ [1, 31].
  • fn is a valid function pointer that does not return normally without calling Exit() (undefined behavior if fn returns).
  • The task descriptor pool is not exhausted (pool size typically 64 or 128 tasks).

Postconditions:

  • On success, returns the new task’s TID (> 0).
  • The new task is in READY state with its stack initialized.
  • On error, returns -1 (invalid priority) or -2 (no descriptors available).

Timing: O(1). Involves stack allocation (clearing the stack to zero up to the frame), frame initialization, and scheduler insertion. Expected cost: ~200 cycles.

Error handling: if Create returns -2, the application has too many concurrent tasks. Either increase the pool size (by adjusting MAX_TASKS and recompiling) or ensure short-lived tasks call Exit().

Example:

int tid = Create(PRIORITY_CLOCK_NOTIFIER, clock_notifier);
if (tid < 0) kernel_panic("Create failed");

void Exit(void)

Description: terminates the calling task. The task’s descriptor transitions to EXITED state. The task’s stack memory is not freed (no heap). Any task blocked in SEND_WAIT or REPLY_WAIT on this task receives ERR_TASK_EXITED (-3).

Preconditions: none. May be called at any time.

Postconditions:

  • The calling task is removed from the READY queue.
  • The task descriptor is marked EXITED; its TID is no longer valid.
  • All tasks blocked in SEND_WAIT or REPLY_WAIT on this task are unblocked with return value -3.
  • Exit() does not return.

Timing: O(n) in the number of tasks blocked on the exiting task. Typically O(1) if no tasks are waiting.

Note: Calling Exit() is mandatory for tasks that finish their work. A task that reaches the end of its function body without calling Exit() will return to task_start, which calls Exit() automatically. However, directly calling Exit() at the known termination point is clearer.

int MyTid(void)

Description: returns the calling task’s own TID.

Preconditions: none.

Postconditions: returns the calling task’s TID (≥ 1 for all tasks except the idle task, which is TID 0).

Timing: O(1). Accesses the current task descriptor’s TID field. Expected cost: ~5 cycles.

int MyParentTid(void)

Description: returns the TID of the task that called Create to create the calling task.

Preconditions: none.

Postconditions: returns the parent TID if the parent is still alive; returns -1 if the parent has Exited.

Timing: O(1).

Note: parent TID is recorded at creation time and never changes. If the parent has Exited, the TID is stale but the value is still returned (it is the caller’s responsibility to check for liveness if needed).

void Yield(void)

Description: voluntarily relinquishes the CPU to any task of equal or higher priority. The calling task remains READY.

Preconditions: none.

Postconditions:

  • If any task has equal or higher priority and is READY, that task runs next.
  • If no task has higher priority and no task of equal priority is READY, the calling task immediately resumes.
  • The calling task remains in READY state throughout.

Timing: O(1) from the calling task’s perspective if no context switch occurs. If a context switch occurs, cost is equivalent to a context save + scheduler + context restore.

Usage: Yield() is primarily a testing primitive. In production code, prefer Send() for explicit synchronization. Yield() is occasionally useful in initialization sequences where a task must let another task run to completion before proceeding.

int Send(int tid, const void *msg, int msglen, void *reply, int rplen)

Description: sends a message to task tid and blocks until tid calls Reply(). The message is copied from the calling task’s msg buffer (of length msglen) into the receiver’s msg buffer (as specified in its Receive() call). The reply is copied from the replier’s buffer into the caller’s reply buffer (of length rplen).

Preconditions:

  • tid is a valid TID of an existing, non-Exited task.
  • msg points to msglen bytes of readable memory.
  • reply points to rplen bytes of writeable memory (may be NULL if rplen == 0).
  • msglen ≥ 0, rplen ≥ 0.

Postconditions:

  • On success, returns the number of bytes actually written to reply (≤ rplen).
  • The message has been delivered to the receiver.
  • The receiver has called Reply before Send returns.
  • On error: returns -1 if tid is invalid; returns -3 if tid’s task Exited before Reply was called.

Timing: variable. If the receiver is already in RECEIVE_WAIT, Send completes when the server calls Reply, which may be O(1) or involve arbitrary server computation. The caller is blocked for the entire duration of server processing.

Truncation semantics: if msglen exceeds the receiver’s Receive buffer size (msglen argument to Receive), the message is truncated to fit the receiver’s buffer. The receiver’s Receive return value is the actual number of bytes received (the minimum of the two sizes). Similarly, if the replier’s reply buffer is larger than rplen, the reply is truncated.

Deadlock conditions: Send will deadlock if tid’s task also calls Send back to the calling task (directly or transitively) and both are blocked in SEND_WAIT. Ensure the send graph is a DAG.

Example:

SpeedCmd cmd = { .loco_uid = uid, .speed = 500, .dir = 1 };
int ret = Send(can_tx_tid, &cmd, sizeof(cmd), NULL, 0);
if (ret < 0) log_event(LOG_SEND_FAIL, ret);

int Receive(int *tid, void *msg, int msglen)

Description: receives a message from any task that has called Send() to the calling task. If a sender is already waiting (in SEND_WAIT), copies the message immediately and returns. Otherwise, blocks (RECEIVE_WAIT) until a sender arrives.

Preconditions:

  • tid points to a writeable int.
  • msg points to msglen bytes of writeable memory.
  • msglen ≥ 0.

Postconditions:

  • On success, *tid is the TID of the task that sent the message.
  • msg contains the message (truncated to msglen bytes if the sender’s message was longer).
  • The sender has transitioned to REPLY_WAIT.
  • Returns the number of bytes written to msg (the minimum of sender’s msglen and receiver’s msglen).

Important: the sender remains blocked until Reply(*tid, ...) is called. Failure to call Reply leaves the sender blocked indefinitely.

Timing: O(1) if a sender is already waiting. If blocking, the task is descheduled and rescheduled when a sender arrives — cost is one context switch in and one out.

Usage pattern (standard server loop):

for (;;) {
    int sender;
    RequestMsg req;
    Receive(&sender, &req, sizeof(req));
    /* Process request */
    Reply(sender, &response, sizeof(response));
}

int Reply(int tid, const void *reply, int rplen)

Description: unblocks the task tid (which must be in REPLY_WAIT) and copies the reply buffer to that task’s reply buffer. The sender transitions to READY.

Preconditions:

  • tid is a valid TID.
  • Task tid must be in REPLY_WAIT state (it must have previously called Send to the calling task and received it via Receive).
  • reply points to rplen bytes of readable memory (may be NULL if rplen == 0).

Postconditions:

  • The reply is copied to tid’s reply buffer.
  • Task tid transitions from REPLY_WAIT to READY.
  • Reply does not block; the calling task continues running.
  • Returns 0 on success, -1 if tid is invalid, -3 if tid is not in REPLY_WAIT.

Timing: O(1). Involves a memory copy of rplen bytes and one scheduler insertion. Expected cost: ~50 cycles + copy time.

Double-Reply danger: calling Reply(tid, ...) twice for the same sender is a server logic error. Because this kernel never recycles TIDs, the second call cannot reach a different task; it returns -3, since tid is no longer in REPLY_WAIT. (In a kernel that did recycle TIDs, the second call could silently reply to an unrelated task, which is why the bug is treated as undefined behavior in general.) The safe pattern is to keep a “replied” flag per active sender and assert it is unset before each Reply.

int AwaitEvent(int eventid)

Description: blocks the calling task until hardware event eventid fires. On return, provides the event-specific data value.

Preconditions:

  • eventid is a valid event identifier recognized by the kernel.
  • The calling task is a notifier: it is the designated handler for this event ID. Only one task may await a given event at a time; a second call to AwaitEvent(eventid) while another task is already waiting returns -1.

Postconditions:

  • Returns a non-negative event-specific value (e.g., current tick count for timer events, received byte for UART events).
  • Returns -1 if eventid is invalid or another task is already waiting on this event.
  • The associated interrupt has been acknowledged (the kernel clears the interrupt source before unblocking the task).

Timing: blocks until the next occurrence of event eventid. Unblocking latency from interrupt to task running is bounded by the kernel’s interrupt handling path — O(1) operations, expected ~50–200 cycles depending on cache state.

Valid event IDs (defined as EVENT_* constants in the kernel header):

ID              Event                               Return value
EVENT_TIMER_C1  System Timer Compare 1 match        Current timer tick count
EVENT_UART0_RX  UART0 RX FIFO threshold crossed     Number of bytes ready (≥ 1)
EVENT_UART0_TX  UART0 TX FIFO below threshold       Number of bytes of space available
EVENT_CAN_RX    MCP2515 INT asserted (RX or error)  MCP2515 CANINTF register value
EVENT_CAN_TX    MCP2515 TX0 transmission complete   0

Design constraint: a notifier task should call AwaitEvent before enabling the interrupt source. If the interrupt fires before AwaitEvent, the kernel may lose the event (depending on the implementation). The canonical initialization order:

  1. Create notifier task.
  2. Notifier task calls AwaitEvent (blocking in EVENT_WAIT).
  3. Some other task enables the interrupt (writes to GIC ISENABLER, or enables MCP2515 CANINTE).

The notifier will wake on the first occurrence of the event after the interrupt is enabled.

Clock Server API: Time, Delay, DelayUntil

These three functions are provided by the clock server user-space task, not by the kernel directly. They are described here because they are part of the standard kernel environment and have timing guarantees that derive from kernel primitives.


int Time(int clock_server_tid)

Returns the current tick count. This is the number of timer C1 interrupts that have fired since the clock server started. At the default 10 ms tick rate, one tick = 10 ms.

Preconditions: clock_server_tid is the TID of a running clock server.

Postconditions: returns the current tick count (non-negative). Returns -1 if the TID is invalid or the server is not a clock server.

Timing: one full Send/Reply roundtrip to the clock server — approximately 5–10 µs.


int Delay(int clock_server_tid, int ticks)

Blocks the calling task for ticks clock ticks from the current moment.

Preconditions: ticks ≥ 0.

Postconditions: returns the current tick count when the task is unblocked (this is the first tick at or after current_tick + ticks). Returns -1 if ticks < 0 or the TID is invalid.

Timing: the task is unblocked on the first clock tick ≥ current_tick + ticks. Due to scheduling delays, the actual delay may be slightly longer than ticks × tick_period; it is never shorter.


int DelayUntil(int clock_server_tid, int absolute_tick)

Blocks the calling task until the clock tick count reaches absolute_tick.

Preconditions: absolute_tick is a tick count; if absolute_tick ≤ the current tick, the call returns immediately.

Postconditions: returns the current tick count when unblocked. Never blocks past the first tick ≥ absolute_tick.

Timing: equivalent to Delay semantics, but uses an absolute deadline rather than a relative delay.

Usage (preferred over Delay for periodic tasks):

int next = Time(clock_tid) + PERIOD;
for (;;) {
    do_work();
    DelayUntil(clock_tid, next);
    next += PERIOD;
}

This prevents drift: each deadline is absolute, not relative to the previous wake-up.

Name Server API: RegisterAs, WhoIs

These two functions are provided by the name server user-space task. They are standard in the CS 452 kernel environment.


int RegisterAs(const char *name)

Associates name with the calling task’s TID in the name server’s registry.

Preconditions: name is a non-NULL, null-terminated string of at most 15 characters.

Postconditions: future WhoIs(name) calls return the calling task’s TID. Only one task may be registered with a given name; a second RegisterAs with the same name is undefined (typically ignored or rejected with -1).

Timing: one full Send/Reply roundtrip to the name server. O(n) in the name server’s string comparison loop, where n = number of registered names.


int WhoIs(const char *name)

Looks up name in the name server’s registry and returns the associated TID.

Preconditions: name is a non-NULL, null-terminated string of at most 15 characters.

Postconditions: returns the TID of the task registered with name, or -1 if no task has registered with that name.

Timing: one full Send/Reply roundtrip. O(n) in the name server’s name count.

Usage pattern: at startup, each server calls RegisterAs immediately after creation. Other tasks call WhoIs at first use. For servers that may not yet be started at the point of first use, WhoIs can be retried with a small delay:

int server_tid = -1;
while (server_tid < 0) {
    server_tid = WhoIs("TrackServer");
    if (server_tid < 0) Delay(clock_tid, 2);   /* wait 20 ms and retry */
}

Syscall Numbering and ABI

Kernel syscalls are invoked via svc #0 with the syscall number in x0 and arguments in x1–x7. The return value appears in x0 after the syscall completes. Because AAPCS64 delivers a function’s first argument in x0, each stub must shift its arguments up by one register to free x0 for the syscall number.

Syscall #  Name         Arguments                                       Return
1          Create       x1=priority, x2=fn_ptr                          TID or error
2          MyTid        (none)                                          TID
3          MyParentTid  (none)                                          parent TID
4          Yield        (none)                                          0
5          Exit         (none)                                          (no return)
6          Send         x1=tid, x2=msg, x3=msglen, x4=reply, x5=rplen   bytes or error
7          Receive      x1=tid_ptr, x2=msg, x3=msglen                   bytes
8          Reply        x1=tid, x2=reply, x3=rplen                      0 or error
9          AwaitEvent   x1=eventid                                      event data or error

The syscall stubs in the user-space library (e.g., libkernel.a) implement these in assembly:

/* int Send(int tid, const void *msg, int msglen, void *reply, int rplen) */
.global Send
Send:
    /* AAPCS64 delivers tid..rplen in x0-x4; shift them up to x1-x5
       to free x0 for the syscall number. */
    mov     x5, x4
    mov     x4, x3
    mov     x3, x2
    mov     x2, x1
    mov     x1, x0
    mov     x0, #6          /* syscall number for Send */
    svc     #0
    ret                     /* x0 = return value (set by kernel) */

The kernel’s dispatch table in handle_syscall branches on the value of the saved x0 register (the syscall number) to the appropriate handler function.
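That dispatch can be sketched as a function-pointer table indexed by the saved x0 value. The handler names (sys_send, sys_reply) and the register-frame layout here are hypothetical placeholders, not the course's required API:

```c
#include <stdint.h>

#define SYSCALL_MAX        10
#define ERR_INVALID_PARAM (-1)

/* Each handler receives the trapped task's saved register frame:
   regs[0..7] hold x0..x7 as they were at the svc instruction. */
typedef int64_t (*syscall_handler)(uint64_t *regs);

/* Hypothetical handler stubs standing in for the real ones. */
static int64_t sys_send(uint64_t *regs)  { return (int64_t)regs[3]; }
static int64_t sys_reply(uint64_t *regs) { (void)regs; return 0; }

/* Indexed by the syscall numbers from the ABI table above. */
static const syscall_handler dispatch_table[SYSCALL_MAX] = {
    [6] = sys_send,
    [8] = sys_reply,
};

/* Branch on the saved x0 (the syscall number); reject unknowns so a
   corrupted or malicious number cannot index past the table. */
int64_t handle_syscall(uint64_t *regs) {
    uint64_t num = regs[0];
    if (num >= SYSCALL_MAX || !dispatch_table[num])
        return ERR_INVALID_PARAM;
    return dispatch_table[num](regs);
}
```

A switch statement works equally well; the table form makes the bounds check explicit and keeps per-syscall cost constant.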

Kernel Constants and Limits

The following constants define the operational boundaries of the kernel. They should be documented in a kernel.h header included by all tasks:

/* Priorities */
#define PRIORITY_MAX           31
#define PRIORITY_MIN            1
#define PRIORITY_IDLE           0
#define PRIORITY_CLOCK_NOTIFIER 31
#define PRIORITY_CLOCK_SERVER   25
#define PRIORITY_NAME_SERVER    10

/* Limits */
#define MAX_TASKS              64     /* maximum simultaneous tasks */
#define MAX_NAME_LEN           15     /* maximum RegisterAs/WhoIs name length */
#define MAX_MSG_LEN           256     /* maximum message or reply size */
#define STACK_SIZE          (64 * 1024)   /* 64 KB per task */
#define GUARD_PAGE_SIZE       4096    /* 4 KB guard page below each stack */

/* Event IDs */
#define EVENT_TIMER_C1          1
#define EVENT_UART0_RX          2
#define EVENT_UART0_TX          3
#define EVENT_CAN_RX            4
#define EVENT_CAN_TX            5

/* Error codes returned by syscalls */
#define ERR_INVALID_PARAM      (-1)
#define ERR_NO_DESCRIPTORS     (-2)
#define ERR_TASK_EXITED        (-3)
#define ERR_TRUNCATED          (-4)

The MAX_TASKS limit is determined by the size of the task descriptor pool, which is statically allocated:

static TaskDescriptor task_pool[MAX_TASKS];

This allocates sizeof(TaskDescriptor) × 64 bytes at compile time — typically ~128 bytes × 64 = 8 KB, well within the kernel’s statically allocated data segment. Increasing MAX_TASKS increases this allocation but has no other runtime cost.
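One way to manage this pool with O(1) allocation is an intrusive free list threaded through the descriptors themselves. This is a sketch under assumed names; the free-list scheme and the TaskDescriptor fields shown are illustrative:

```c
#include <stddef.h>

#define MAX_TASKS 64

typedef struct TaskDescriptor {
    int tid;
    struct TaskDescriptor *next_free;  /* intrusive free-list link */
    /* ...saved registers, stack pointer, state, priority, ... */
} TaskDescriptor;

static TaskDescriptor task_pool[MAX_TASKS];
static TaskDescriptor *free_list = NULL;

/* Chain every descriptor onto the free list once, at kernel init. */
void pool_init(void) {
    for (int i = 0; i < MAX_TASKS; i++) {
        task_pool[i].next_free = free_list;
        free_list = &task_pool[i];
    }
}

/* O(1) allocation; NULL when all MAX_TASKS descriptors are in use,
   which Create would report as ERR_NO_DESCRIPTORS. */
TaskDescriptor *pool_alloc(void) {
    TaskDescriptor *td = free_list;
    if (td != NULL)
        free_list = td->next_free;
    return td;
}

/* O(1) release when an exited task's descriptor is reclaimed. */
void pool_free(TaskDescriptor *td) {
    td->next_free = free_list;
    free_list = td;
}
```

Because both operations are constant time, Create and Exit incur no allocation cost that depends on how many tasks exist, which matters for the worst-case syscall bound in invariant 6 below.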

Summary: Invariants the Kernel Guarantees

The kernel’s correct operation rests on the following invariants, which hold at all times (modulo hardware failures):

  1. Scheduling invariant: the task currently executing is always the highest-priority task in READY state.

  2. SRR synchronization invariant: when Reply(tid) returns, task tid has received the reply and is in READY state; Reply itself does not block.

  3. Message copy integrity: the message delivered to a server’s Receive buffer exactly matches the bytes the client passed to Send (up to the minimum of the two buffer sizes); no bytes are lost or modified.

  4. AwaitEvent uniqueness: at most one task is in EVENT_WAIT for any given event ID at any time.

  5. Priority FIFO within priority level: among tasks at the same priority level in the READY queue, the one that became ready earliest runs first.

  6. Bounded interrupt latency: the time from an interrupt signal to the first instruction of the interrupt handler is bounded by the longest non-preemptible kernel operation. Since the kernel runs with IRQs masked, this bound is the worst-case execution path of the most expensive syscall, which can be determined by static analysis of the kernel code.

  7. DAIF invariant: at EL1, the DAIF.I (IRQ mask) bit is always set. The kernel never runs with IRQs unmasked at EL1. IRQs are enabled only at EL0 (user tasks), and the exception entry mechanism re-masks them atomically on exception entry.

These invariants collectively provide the foundation on which the scheduling theory of Chapter 15 rests: the scheduler guarantees (1), the SRR model depends on (2)–(3), the event system requires (4), the FIFO tiebreak is (5), the timing analysis uses (6), and kernel safety relies on (7).
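Invariants (1) and (5) together describe a per-priority FIFO ready structure. A minimal sketch follows; the names and fixed-size circular buffers are illustrative, and a real kernel might use a priority bitmask instead of the linear scan to find the highest non-empty queue:

```c
#define NUM_PRIORITIES 32   /* PRIORITY_IDLE .. PRIORITY_MAX */
#define QUEUE_CAP      64   /* one slot per possible task (MAX_TASKS) */

/* One circular FIFO of TIDs per priority level. */
struct fifo { int buf[QUEUE_CAP]; int head, tail, len; };

static struct fifo ready_queue[NUM_PRIORITIES];

/* Invariant 5: a task becoming READY enqueues at the tail of its
   priority's FIFO, so earlier-ready tasks run first within a level. */
void make_ready(int tid, int priority) {
    struct fifo *q = &ready_queue[priority];
    q->buf[q->tail] = tid;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->len++;
}

/* Invariant 1: always dequeue from the highest non-empty priority. */
int schedule(void) {
    for (int p = NUM_PRIORITIES - 1; p >= 0; p--) {
        struct fifo *q = &ready_queue[p];
        if (q->len > 0) {
            int tid = q->buf[q->head];
            q->head = (q->head + 1) % QUEUE_CAP;
            q->len--;
            return tid;
        }
    }
    return -1;  /* unreachable once the idle task exists at priority 0 */
}
```

With the idle task permanently at PRIORITY_IDLE, schedule() always finds a runnable task, so the -1 branch never executes in a correctly initialized kernel.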
