ECE 495: Autonomous Vehicles
Krzysztof Czarnecki
Estimated study time: 1 hr 5 min
Sources and References
Primary references — S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, 2005; R. Szeliski, Computer Vision: Algorithms and Applications, 2nd ed., Springer, 2022 (free PDF: szeliski.org/Book).
Supplementary texts — I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016 (free: deeplearningbook.org); S. Liu, L. Li, J. Tang, S. Wu, and J.-L. Gaudiot, Creating Autonomous Vehicle Systems, 2nd ed., Morgan and Claypool, 2020.
Online resources — MIT 6.S094 Deep Learning for Self-Driving Cars; SAE J3016 Levels of Driving Automation standard; Apollo autonomous driving open platform documentation.
Chapter 1: Introduction to Autonomous Driving Systems
1.1 History and Motivation
The quest for self-driving vehicles stretches back further than most engineers appreciate. Early milestones include the Stanford Cart (1960s–70s), Carnegie Mellon’s Navlab project (1980s), and Ernst Dickmanns’ VaMoRs vehicle in Germany, which drove on the Autobahn using dynamic vision in 1987. The DARPA Grand Challenges (2004, 2005, 2007) catalyzed modern autonomous driving: the 2005 Stanley vehicle from Stanford won the desert course, and the 2007 Urban Challenge produced vehicles capable of navigating real intersection rules. By the 2010s, industrial programs at Google (later Waymo), Tesla, GM Cruise, and a proliferation of startups moved the field from demonstration to pre-commercial deployment.
The motivation is both safety and efficiency. Human error accounts for over 90% of road crashes globally. An automated system that perceives faster, never tires, and acts deterministically under its operational limits has the potential to reduce fatalities dramatically. Secondary benefits include reduced congestion, accessibility for mobility-impaired individuals, and energy efficiency through cooperative driving.
1.2 SAE J3016 Levels of Driving Automation
SAE International standard J3016 defines six levels (0–5) of driving automation, which have become the industry lingua franca.
| Level | Name | Who monitors environment? | Who performs dynamic driving task (DDT)? |
|---|---|---|---|
| 0 | No Automation | Human | Human |
| 1 | Driver Assistance | Human | Human + system (one of steering or acceleration) |
| 2 | Partial Automation | Human | Human + system (both steering and acceleration) |
| 3 | Conditional Automation | System | System (human must be ready to intervene on request) |
| 4 | High Automation | System | System (within ODD; no human fallback needed) |
| 5 | Full Automation | System | System (all conditions, no ODD restriction) |
The critical boundary is between Level 2 and Level 3: at L2 the human must continuously supervise; at L3 the system supervises and only requests human intervention when it reaches a limit. This distinction has significant legal and engineering implications.
1.3 Automated Driving System (ADS) Functional Architecture
A complete ADS is typically organized into a sense–plan–act pipeline:
- Perception — Processes sensor data (cameras, LiDAR, radar, GPS/IMU) to produce a model of the environment: detected objects, free space, lane geometry, traffic signals.
- Prediction — Forecasts the future states of dynamic agents (pedestrians, vehicles) over a planning horizon.
- Planning — Determines the vehicle’s future trajectory: route planning (global), behavioral planning (when to merge, yield), and motion planning (exact trajectory).
- Control — Executes the planned trajectory by commanding steering, throttle, and brake actuators.
- Localization — Determines the vehicle’s precise pose within a map, often using HD maps fused with sensor data.
1.3.1 WISE ADS Framework
The WISE (Waymo–inspired Systems Engineering) framework emphasizes four pillars: What the system must do (requirements), Infrastructure (compute, sensors, connectivity), Safety (fault tolerance, fallback), and Evaluation (simulation, closed-track, public road testing). A key principle is the safety case: an ADS must demonstrate, with evidence, that residual risk is acceptably low within its ODD.
1.4 State of the Technology
Current ADS deployments rely heavily on sensor fusion: cameras provide dense color/texture information; LiDAR provides precise 3-D geometry; radar provides velocity via Doppler and is robust to weather; GPS/IMU provide coarse global pose. Perception is now dominated by deep learning (CNNs, transformers). Control is often classical (PID, MPC) with deep RL emerging for edge cases. End-to-end neural approaches that map sensor inputs directly to steering commands exist but face interpretability challenges.
Chapter 2: Computer Vision for Autonomous Driving
2.1 Image Representation
A grayscale image is a 2-D array \(I: \mathbb{Z}^2 \to [0, 255]\) where each element is a pixel intensity. A color image adds a third channel dimension, typically RGB: \(I \in \mathbb{R}^{H \times W \times 3}\). In floating-point processing, values are normalized to \([0,1]\).
2.1.1 Image Convolution and Gaussian Filtering
Linear filtering replaces each pixel with a weighted sum of its neighborhood. Given a kernel (filter) \(k\) of size \((2r+1) \times (2r+1)\), the 2-D discrete convolution is:
\[ (I * k)[x, y] = \sum_{u=-r}^{r} \sum_{v=-r}^{r} I[x-u, y-v]\, k[u, v] \]The Gaussian kernel smooths the image by weighting nearby pixels more heavily:
\[ G_\sigma(u,v) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right) \]Gaussian filtering is separable: \(G_\sigma(u,v) = g_\sigma(u)\, g_\sigma(v)\), so a 2-D Gaussian convolution can be implemented as two 1-D passes, reducing complexity from \(O(N^2 k^2)\) to \(O(N^2 k)\) where \(k\) is the kernel width.
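As a sketch, the separability described above can be implemented in a few lines of NumPy: the 2-D blur is two 1-D convolutions, one along rows and one along columns. Truncating the kernel at a radius of \(3\sigma\) is a common (illustrative) choice.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Sampled 1-D Gaussian g_sigma(u), normalized so the weights sum to 1."""
    u = np.arange(-radius, radius + 1)
    g = np.exp(-u**2 / (2.0 * sigma**2))
    return g / g.sum()

def gaussian_blur(image, sigma):
    """2-D Gaussian smoothing as two 1-D passes (rows, then columns)."""
    radius = int(3 * sigma)  # kernel covers roughly +/- 3 sigma
    g = gaussian_kernel_1d(sigma, radius)
    # Convolve every row, then every column; mode="same" keeps the image size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)
```

Because \(G_\sigma(u,v) = g_\sigma(u)\,g_\sigma(v)\), this produces the same interior result as convolving with the full 2-D kernel, at a fraction of the cost.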
2.2 Edge Detection: The Canny Pipeline
The Canny detector (1986) remains a gold standard for edges because it is optimal under three criteria: good detection, good localization, and single-response (no double edges).
Step 1 — Smoothing. Convolve with \(G_\sigma\) to suppress noise.
Step 2 — Gradient computation. Apply Sobel operators to get partial derivatives:
\[ G_x = \begin{bmatrix}-1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1\end{bmatrix} * I, \qquad G_y = \begin{bmatrix}1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1\end{bmatrix} * I \]Gradient magnitude and direction:
\[ M = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right) \]Step 3 — Non-maximum suppression. For each pixel, check if its gradient magnitude is a local maximum along the gradient direction. If not, suppress (set to 0). This thins edges to single-pixel width.
Step 4 — Double thresholding. Apply two thresholds \(T_\text{low} < T_\text{high}\):
- \(M > T_\text{high}\): strong edge
- \(T_\text{low} < M \leq T_\text{high}\): weak edge
- \(M \leq T_\text{low}\): not an edge
Step 5 — Hysteresis linking. Keep weak edges only if they are connected (8-connected) to a strong edge. This removes noise-induced weak responses while preserving genuine edge continuations.
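The gradient and thresholding steps above can be sketched in NumPy as follows. This is a simplified illustration, not a full Canny implementation: smoothing and non-maximum suppression are omitted, and hysteresis is done with a stack-based flood fill over 8-connected neighbors.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[ 1, 2, 1], [ 0, 0, 0], [-1, -2, -1]], dtype=float)

def conv3x3(img, k):
    """3x3 correlation with a zero-padded border (sign flip vs. true
    convolution does not affect the gradient magnitude)."""
    p = np.pad(img.astype(float), 1)
    out = np.zeros(img.shape, dtype=float)
    for du in range(3):
        for dv in range(3):
            out += k[du, dv] * p[du:du + img.shape[0], dv:dv + img.shape[1]]
    return out

def hysteresis_threshold(mag, t_low, t_high):
    """Double thresholding plus hysteresis linking on a magnitude map."""
    strong = mag > t_high
    weak = (mag > t_low) & ~strong
    keep = strong.copy()
    stack = list(zip(*np.nonzero(strong)))
    while stack:  # grow strong edges into 8-connected weak pixels
        i, j = stack.pop()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < mag.shape[0] and 0 <= nj < mag.shape[1]:
                    if weak[ni, nj] and not keep[ni, nj]:
                        keep[ni, nj] = True
                        stack.append((ni, nj))
    return keep
```

Isolated weak responses (noise) die out; weak pixels touching a strong chain survive, which is exactly the behavior Step 5 describes.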
2.3 Line Detection: The Hough Transform
The Hough transform maps image points to a parameter space to detect global structures like lines.
A line in image space can be parameterized as:
\[ \rho = x\cos\theta + y\sin\theta \]where \(\rho\) is the perpendicular distance from the origin and \(\theta \in [0°, 180°)\) is the angle. This avoids the infinite slope problem of \(y = mx + b\).
Algorithm:
- For every edge pixel \((x_i, y_i)\), compute \(\rho(\theta) = x_i\cos\theta + y_i\sin\theta\) for all \(\theta\) in a discrete set.
- Increment the accumulator cell \(A[\rho, \theta]\).
- Peaks in \(A\) correspond to lines with many supporting edge pixels.
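The voting loop above translates almost directly into NumPy. This minimal sketch discretizes \(\theta\) at 1° steps and \(\rho\) at one-pixel resolution; both resolutions are illustrative choices.

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    """Accumulate votes in (rho, theta) space for a boolean edge mask."""
    h, w = edge_mask.shape
    thetas = np.deg2rad(np.arange(n_theta))            # theta in [0, 180) degrees
    diag = int(np.ceil(np.hypot(h, w)))                # max possible |rho|
    rhos = np.arange(-diag, diag + 1)                  # rho can be negative
    acc = np.zeros((len(rhos), n_theta), dtype=int)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        rho = x * np.cos(thetas) + y * np.sin(thetas)  # one rho per theta
        idx = np.round(rho).astype(int) + diag         # shift rho to array index
        acc[idx, np.arange(n_theta)] += 1
    return acc, rhos, thetas
```

A horizontal line \(y = 2\) yields \(\rho = 2\) at \(\theta = 90°\), so every edge pixel on that line votes into the same accumulator cell.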
2.4 Corner Detection: The Harris Detector
Edges are useful but corners (two-directional intensity changes) are more repeatable for tracking and matching. The Harris detector uses the structure tensor (second-moment matrix) of local image gradients.
Define the gradient at each pixel: \(\nabla I = (I_x, I_y)^T\). The structure tensor over a window \(W\) is:
\[ M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \]Often the sum is replaced by a Gaussian-weighted sum. The eigenvalues \(\lambda_1, \lambda_2\) of \(M\) characterize the local geometry:
- Both small: flat region (no gradient)
- One large, one small: edge
- Both large: corner
Rather than computing eigenvalues directly, Harris uses the corner response function:
\[ R = \det(M) - \kappa\, [\text{tr}(M)]^2 = \lambda_1\lambda_2 - \kappa(\lambda_1 + \lambda_2)^2 \]with \(\kappa \approx 0.04{-}0.06\). Pixels with \(R > T\) are corner candidates; non-maximum suppression retains only local maxima.
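The response function can be evaluated densely without any explicit eigendecomposition, since \(\det(M)\) and \(\mathrm{tr}(M)\) come directly from the summed gradient products. A minimal NumPy sketch, using central-difference gradients and a plain 3×3 box window (a Gaussian-weighted window is the more common choice in practice):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * tr(M)^2 at every pixel."""
    img = img.astype(float)
    iy, ix = np.gradient(img)                 # gradients along rows, columns
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box_sum(a):
        """Sum of each pixel's 3x3 neighborhood (zero-padded border)."""
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    sxx, syy, sxy = box_sum(ixx), box_sum(iyy), box_sum(ixy)
    det = sxx * syy - sxy * sxy               # det(M) = lambda1 * lambda2
    tr = sxx + syy                            # tr(M)  = lambda1 + lambda2
    return det - k * tr**2
```

On a synthetic bright quadrant, the response is zero in flat regions, negative along the two straight edges, and peaks at the quadrant's corner, matching the eigenvalue cases listed above.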
Chapter 3: Machine Learning Foundations
3.1 Supervised Learning and Linear Regression
Supervised learning finds a function \(f_\theta: \mathcal{X} \to \mathcal{Y}\) from labeled pairs \(\{(\mathbf{x}_i, y_i)\}_{i=1}^N\). In linear regression, the prediction is:
\[ \hat{y} = \mathbf{w}^T \mathbf{x} + b = \boldsymbol{\theta}^T \tilde{\mathbf{x}} \]where \(\tilde{\mathbf{x}} = [1, \mathbf{x}^T]^T\) absorbs the bias. The cost function is mean squared error:
\[ J(\boldsymbol{\theta}) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 \]The closed-form normal equation solution is \(\boldsymbol{\theta}^* = (X^T X)^{-1} X^T \mathbf{y}\), where \(X\) is the design matrix. For high-dimensional inputs, iterative gradient descent is preferred.
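The normal-equation solution is a one-liner in NumPy. This sketch prepends the column of ones that absorbs the bias and uses `np.linalg.lstsq` rather than an explicit matrix inverse, which is the numerically preferred way to solve the same least-squares problem.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: theta* minimizing ||X theta - y||^2.

    A column of ones is prepended so theta[0] is the bias b."""
    Xt = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(Xt, y, rcond=None)
    return theta
```

Fitting noiseless data generated by \(y = 2x + 1\) recovers \(\boldsymbol{\theta} = (1, 2)\) exactly.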
3.2 Logistic Regression
For binary classification (\(y \in \{0,1\}\)), logistic regression models the posterior:
\[ P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1+e^{-z}} \]The sigmoid \(\sigma\) squashes the linear output to \((0,1)\). Training minimizes binary cross-entropy loss:
\[ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i) \right] \]The gradient with respect to \(\mathbf{w}\) is:
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^N (\hat{p}_i - y_i)\, \mathbf{x}_i \]This elegant result (prediction minus label, times input) motivates the general pattern seen in neural network gradients.
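The "prediction minus label, times input" gradient makes the training loop very short. A minimal batch gradient-descent sketch (learning rate and step count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, steps=2000):
    """Batch gradient descent on binary cross-entropy.

    Uses the gradient from the text: (1/N) * sum_i (p_i - y_i) x_i,
    with a ones column appended so w[0] plays the role of the bias."""
    Xt = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xt.shape[1])
    for _ in range(steps):
        p = sigmoid(Xt @ w)                 # predicted probabilities
        grad = Xt.T @ (p - y) / len(y)      # (prediction - label) * input
        w -= lr * grad
    return w

def predict(w, X):
    Xt = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(Xt @ w) > 0.5).astype(int)
```

On a linearly separable 1-D toy set, the learned boundary lands between the two classes.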
3.3 Neural Network Fundamentals
A feedforward neural network composes multiple layers:
\[ \mathbf{h}^{(l)} = \phi\!\left( W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)} \right), \quad l = 1, \ldots, L \]where \(\phi\) is a nonlinear activation. Common activations:
- ReLU: \(\phi(z) = \max(0, z)\) — sparse, avoids vanishing gradient, default for hidden layers
- Sigmoid: \(\sigma(z) = (1+e^{-z})^{-1}\) — output layer for binary classification
- Softmax: \(\phi(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}\) — output layer for multi-class classification
- Tanh: \(\tanh(z)\) — zero-centered, useful in recurrent networks
3.3.1 Backpropagation
Backpropagation efficiently computes \(\partial \mathcal{L}/\partial \theta\) for all parameters using the chain rule. Define the local gradient (error signal) at layer \(l\) as \(\boldsymbol{\delta}^{(l)} = \partial \mathcal{L} / \partial \mathbf{z}^{(l)}\) where \(\mathbf{z}^{(l)} = W^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\).
Forward pass: Compute and store \(\mathbf{z}^{(l)}\) and \(\mathbf{h}^{(l)}\) for all layers.
Backward pass (output layer \(L\)):
\[ \boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot \phi'\!\left(\mathbf{z}^{(L)}\right) \]Backward pass (hidden layer \(l < L\)):
\[ \boldsymbol{\delta}^{(l)} = \left( W^{(l+1)T} \boldsymbol{\delta}^{(l+1)} \right) \odot \phi'\!\left(\mathbf{z}^{(l)}\right) \]Parameter gradients:
\[ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{h}^{(l-1)}\right)^T, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)} \]Gradient descent update:
\[ \theta \leftarrow \theta - \eta\, \frac{\partial \mathcal{L}}{\partial \theta} \]where \(\eta\) is the learning rate. Mini-batch stochastic gradient descent (SGD) uses a random subset of \(B\) samples per update, providing a noisy but efficient estimate.
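The forward pass, error backpropagation, and gradient update above can be demonstrated end to end on a toy problem. This sketch trains a two-layer network (ReLU hidden layer, sigmoid output with cross-entropy, so \(\boldsymbol{\delta}^{(2)} = \mathbf{p} - \mathbf{y}\)) on XOR; the layer width, learning rate, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR labels

W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta, losses = 1.0, []
for _ in range(5000):
    # Forward pass: store z and h for each layer.
    z1 = X @ W1 + b1; h1 = np.maximum(z1, 0.0)    # ReLU hidden layer
    z2 = h1 @ W2 + b2; p = sigmoid(z2)            # sigmoid output
    pc = np.clip(p, 1e-9, 1 - 1e-9)               # numerical safety for log
    losses.append(float(-(Y * np.log(pc) + (1 - Y) * np.log(1 - pc)).mean()))
    # Backward pass: sigmoid + cross-entropy gives delta2 = p - y.
    d2 = (p - Y) / len(X)
    d1 = (d2 @ W2.T) * (z1 > 0)                   # delta through ReLU derivative
    # Gradient descent update.
    W2 -= eta * h1.T @ d2; b2 -= eta * d2.sum(axis=0)
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)
```

The recorded cross-entropy loss falls as training proceeds, which is the observable consequence of the update rule.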
3.3.2 Batch Normalization
Batch normalization (Ioffe & Szegedy, 2015) normalizes pre-activations within a mini-batch to stabilize training:
\[ \hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{z}_i + \beta \]where \(\mu_B\) and \(\sigma_B^2\) are the batch mean and variance, and \(\gamma, \beta\) are learned scale and shift parameters. Benefits: enables higher learning rates, acts as a regularizer, reduces sensitivity to initialization.
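The forward pass of training-mode batch normalization is only a few lines (inference uses running averages of \(\mu_B\) and \(\sigma_B^2\) instead, which this sketch omits):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Training-mode batch norm: normalize each feature over the batch,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = z.mean(axis=0)                      # per-feature batch mean
    var = z.var(axis=0)                      # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta
```

With \(\gamma = 1, \beta = 0\) the output has zero mean and unit variance per feature; nonzero \(\gamma, \beta\) let the network undo the normalization if that helps.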
Chapter 4: Convolutional Neural Networks
4.1 CNN Architecture
Convolutional Neural Networks exploit three key ideas for visual data: local connectivity (each unit sees only a local receptive field), weight sharing (the same filter is applied at all spatial locations), and spatial hierarchy (deeper layers combine local features into global structure).
4.1.1 Convolution Layer
A convolutional layer with \(K\) filters of size \(F \times F \times C_\text{in}\) produces \(K\) feature maps. The output at spatial location \((i,j)\) for filter \(k\) is:
\[ Z_{i,j,k} = \sum_{u=0}^{F-1}\sum_{v=0}^{F-1}\sum_{c=0}^{C_\text{in}-1} W_{u,v,c,k}\, X_{i\cdot s+u,\, j\cdot s+v,\, c} + b_k \]where \(s\) is the stride. Padding (typically \(P = \lfloor F/2 \rfloor\)) preserves spatial dimensions. After convolution, a nonlinearity (ReLU) is applied element-wise.
4.1.2 Pooling Layer
Pooling reduces spatial resolution and introduces local translation invariance. Max pooling takes the maximum over a \(2\times2\) window with stride 2, halving each spatial dimension. Average pooling takes the mean. Pooling has no learned parameters.
4.1.3 Fully Connected Layer
After spatial feature extraction, the feature map is flattened and passed through one or more fully connected (dense) layers for final classification or regression. The last FC layer typically outputs class logits for softmax.
4.1.4 Example Architecture — AlexNet (2012)
Conv(96, 11×11, s=4) → ReLU → MaxPool → Conv(256, 5×5) → ReLU → MaxPool → Conv(384, 3×3) → Conv(384, 3×3) → Conv(256, 3×3) → MaxPool → FC(4096) → FC(4096) → FC(1000, softmax)
AlexNet won ImageNet 2012 with a 15.3% top-5 error, far ahead of the 26.2% runner-up, launching the deep learning era in computer vision.
4.2 Semantic Segmentation
Semantic segmentation assigns a class label to every pixel. For autonomous driving this produces the “scene understanding” map: road, sidewalk, vehicle, pedestrian, sky, building.
4.2.1 Fully Convolutional Network (FCN)
Long et al. (2015) replaced all FC layers in a classification CNN with convolutional layers, enabling dense prediction. The spatial output is too coarse due to pooling, so transposed convolutions (sometimes called deconvolutions) upsample back to the input resolution. Skip connections from earlier (higher-resolution) layers add fine-grained spatial information back.
4.2.2 U-Net Encoder–Decoder
U-Net (Ronneberger et al., 2015) follows a symmetric encoder–decoder structure with skip connections at every resolution level:
- Encoder (contracting path): repeated blocks of [Conv → ReLU → Conv → ReLU → MaxPool], doubling channels at each level.
- Bottleneck: deepest representation at lowest spatial resolution.
- Decoder (expanding path): at each level, upsample (transposed conv or bilinear), concatenate the corresponding encoder feature map via skip connection, then [Conv → ReLU → Conv → ReLU].
The skip connections preserve spatial information that would otherwise be lost during downsampling, enabling precise boundary delineation — critical for segmenting road surface from curb.
Training uses pixel-wise cross-entropy loss. The Intersection over Union (IoU) metric per class, and mean IoU (mIoU), are standard evaluation metrics:
\[ \text{IoU}_c = \frac{|\text{pred}_c \cap \text{gt}_c|}{|\text{pred}_c \cup \text{gt}_c|} \]
Chapter 5: Object Detection
5.1 From Sliding Window to Region Proposals
Early detection applied a classifier at every position and scale (sliding window). This is accurate but computationally prohibitive. The paradigm evolved through three generations:
- Selective search + classification (R-CNN, 2014): propose ~2000 region candidates using a classical algorithm, warp each to a fixed size, classify with CNN. Slow: ~47s/image.
- Shared CNN features (Fast R-CNN, 2015): run CNN once on the whole image; project proposals onto feature map; use RoI pooling to extract fixed-size features per proposal. ~0.2s/image.
- Region Proposal Network (Faster R-CNN, 2015): replace selective search with a small CNN (RPN) sharing the backbone features. Fully end-to-end trainable. ~0.05s/image.
5.2 Anchor Boxes
Modern detectors predict offsets relative to anchors: a set of fixed bounding boxes of various scales and aspect ratios tiled across the feature map. Each anchor is associated with a location, and the network predicts:
- Objectness score: is there an object overlapping this anchor?
- Class probabilities (given object present)
- Bounding box regression offsets \((\Delta x, \Delta y, \Delta w, \Delta h)\) relative to anchor
The box regression targets are defined in log-space for width/height:
\[ t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a} \]An anchor is a positive example if it has IoU \(> 0.7\) with some ground-truth box; negative if IoU \(< 0.3\).
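The encode/decode pair implied by these equations is worth writing out, since getting the inverse transform wrong is a classic detector bug. A minimal sketch using center-size box format \((x, y, w, h)\):

```python
import numpy as np

def encode(box, anchor):
    """Box -> regression targets (t_x, t_y, t_w, t_h) per the equations above."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Inverse transform: predicted targets + anchor -> box (x, y, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])
```

The log-space width/height targets make the regression scale-invariant: doubling a box relative to its anchor always produces \(t_w = \log 2\), regardless of absolute size.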
5.3 YOLO: Real-Time Object Detection
YOLO (You Only Look Once, Redmon et al., 2016) divides the image into an \(S \times S\) grid. Each cell predicts \(B\) bounding boxes and \(C\) class probabilities simultaneously in a single forward pass, achieving real-time speed.
Grid cell prediction: Each cell outputs \(5B + C\) values: five per box, \((x, y, w, h, \text{confidence})\), where \(x, y\) are relative to the cell and \(w, h\) are relative to the image, plus \(C\) conditional class probabilities shared by the cell's boxes. Confidence is defined as:
\[ \text{conf} = P(\text{object}) \times \text{IoU}_{\text{pred}}^{\text{truth}} \]YOLO Loss Function: The total loss sums over all grid cells and boxes:
\[
\begin{aligned}
\mathcal{L} ={} & \lambda_\text{coord} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
& + \lambda_\text{coord} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2\right] \\
& + \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_\text{noobj} \sum_{i,j} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
& + \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \sum_{c} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]
The square root on \(w, h\) makes the loss less sensitive to size for large boxes; \(\lambda_\text{coord} = 5\) and \(\lambda_\text{noobj} = 0.5\) balance localization vs. confidence terms. Later versions (YOLOv3–v8) introduced multi-scale prediction and anchor priors.
Non-Maximum Suppression (NMS): After detection, many boxes overlap. NMS iteratively selects the highest-confidence box and suppresses all others with IoU \(> \text{threshold}\) (typically 0.5).
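The greedy procedure just described can be sketched directly. This illustration uses corner-format boxes \((x_1, y_1, x_2, y_2)\) and an \(O(n^2)\) loop, which is fine for the few hundred boxes a detector typically emits per image:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best box, drop overlapping boxes, repeat."""
    order = list(np.argsort(scores)[::-1])   # indices by descending score
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

Two near-duplicate detections of the same object collapse to one; a distant detection survives untouched.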
Evaluation metric — mAP: Mean Average Precision averages the area under the precision–recall curve across all classes and IoU thresholds. COCO standard uses mAP@[0.5:0.05:0.95], requiring accurate localization as well as correct classification.
Chapter 6: Probabilistic State Estimation
6.1 Dynamic System Modeling
An autonomous vehicle must track its own pose and the states of surrounding agents over time. The general framework models a hidden Markov model (HMM):
- State \(\mathbf{x}_t\): true system state at time \(t\) (position, velocity, orientation, etc.)
- Control input \(\mathbf{u}_t\): known input applied at time \(t\) (steering, throttle)
- Observation \(\mathbf{z}_t\): noisy sensor measurement at time \(t\)
- Motion model \(p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{u}_t)\): probabilistic state transition
- Sensor model \(p(\mathbf{z}_t \mid \mathbf{x}_t)\): likelihood of measurement given state
6.2 Bayes Filter: General Recursive Estimation
The goal is to maintain the belief \(\text{bel}(\mathbf{x}_t) = p(\mathbf{x}_t \mid \mathbf{z}_{1:t}, \mathbf{u}_{1:t})\), the posterior distribution over the state given all history.
Under the Markov assumption, the Bayes filter iterates two steps:
Prediction step (propagate through motion model):
\[ \overline{\text{bel}}(\mathbf{x}_t) = \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{u}_t)\, \text{bel}(\mathbf{x}_{t-1})\, d\mathbf{x}_{t-1} \]Update step (incorporate measurement):
\[ \text{bel}(\mathbf{x}_t) = \eta\, p(\mathbf{z}_t \mid \mathbf{x}_t)\, \overline{\text{bel}}(\mathbf{x}_t) \]where \(\eta\) is a normalizing constant ensuring \(\int \text{bel}(\mathbf{x}_t)\, d\mathbf{x}_t = 1\).
The Bayes filter is the theoretical foundation. Different implementations arise from different representations of the belief distribution:
- Kalman filter — Gaussian beliefs, linear systems
- Extended Kalman filter — Gaussian beliefs, nonlinear systems (Jacobian linearization)
- Particle filter — non-parametric, arbitrary distributions
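The two-step recursion is easiest to see in a discrete (histogram) implementation, where the integral becomes a sum. This sketch tracks a robot on a small circular 1-D grid; the motion kernel and sensor likelihood are illustrative.

```python
import numpy as np

def bayes_filter_step(bel, motion_kernel, likelihood):
    """One predict + update cycle of the discrete Bayes filter.

    bel: belief over N grid cells (sums to 1)
    motion_kernel: dict {cell shift: probability}, i.e. p(x_t | x_{t-1})
    likelihood: p(z_t | x_t) evaluated at every cell"""
    n = len(bel)
    # Prediction: push the belief through the motion model (circular world).
    pred = np.zeros(n)
    for shift, prob in motion_kernel.items():
        pred += prob * np.roll(bel, shift)
    # Update: multiply by the measurement likelihood, then renormalize (eta).
    post = likelihood * pred
    return post / post.sum()
```

Starting certain at cell 0, commanding "move one cell" (success probability 0.8), and then sensing cell 1 concentrates the belief at cell 1, exactly as the prediction/update equations dictate.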
6.3 Occupancy Grid Mapping
An occupancy grid divides the environment into a fixed-resolution grid of cells. Each cell \(m_i\) is a binary random variable: occupied (1) or free (0). The goal is to estimate \(p(m_i \mid \mathbf{z}_{1:t}, \mathbf{x}_{1:t})\) for each cell, assuming a known vehicle trajectory.
6.3.1 Log-Odds Representation
Rather than storing probabilities (which can become numerically unstable near 0 or 1), the log-odds representation is used:
\[ l_t(m_i) = \log \frac{p(m_i = 1 \mid \mathbf{z}_{1:t})}{p(m_i = 0 \mid \mathbf{z}_{1:t})} = \log \frac{p(m_i=1)}{1-p(m_i=1)} \]Using Bayes’ theorem and the assumption of independent cells, the update rule for each new measurement is:
\[ l_t(m_i) = l_{t-1}(m_i) + \log \frac{p(\mathbf{z}_t \mid m_i=1)}{p(\mathbf{z}_t \mid m_i=0)} + \log \frac{1 - p_0}{p_0} \]where \(p_0 = p(m_i = 1)\) is the prior occupancy probability (typically 0.5, giving \(\log\frac{1-p_0}{p_0} = 0\)). The term \(\log \frac{p(\mathbf{z}_t \mid m_i=1)}{p(\mathbf{z}_t \mid m_i=0)}\) is the inverse sensor model log-odds.
Inverse sensor model (LiDAR): For a LiDAR beam that hits at range \(r\):
- Cells along the beam before the hit: likely free → log-odds decremented (e.g., \(-0.4\))
- Cell at the hit point: likely occupied → log-odds incremented (e.g., \(+0.9\))
- Cells beyond the hit: unknown → no update
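A single-beam version of this update is a few lines once the map is stored in log-odds. The increment/decrement constants below are the illustrative values from the list above; the beam is assumed to run along one grid row.

```python
import numpy as np

L_FREE, L_OCC = -0.4, 0.9   # illustrative inverse-sensor-model log-odds

def update_beam(log_odds, hit_cell):
    """Occupancy update along one beam: cells before the hit become more
    free, the hit cell becomes more occupied, cells beyond stay unchanged."""
    log_odds[:hit_cell] += L_FREE
    log_odds[hit_cell] += L_OCC
    return log_odds

def probability(log_odds):
    """Recover p(occupied) from log-odds: p = 1 / (1 + exp(-l))."""
    return 1.0 / (1.0 + np.exp(-log_odds))
```

Starting from the prior \(p_0 = 0.5\) (log-odds 0), repeated observations of the same hit drive the hit cell toward 1 and the traversed cells toward 0, while unobserved cells stay at 0.5.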
Chapter 7: Kalman Filtering
7.1 The Kalman Filter
The Kalman filter (KF) is the optimal Bayes filter for linear Gaussian systems. The state and observation models are:
\[ \mathbf{x}_t = A\, \mathbf{x}_{t-1} + B\, \mathbf{u}_t + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, Q) \]\[ \mathbf{z}_t = H\, \mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, R) \]The belief is Gaussian: \(\text{bel}(\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_t, \Sigma_t)\).
7.1.1 KF Equations
Predict:
\[ \bar{\boldsymbol{\mu}}_t = A\, \boldsymbol{\mu}_{t-1} + B\, \mathbf{u}_t \]\[ \bar{\Sigma}_t = A\, \Sigma_{t-1}\, A^T + Q \]Update:
\[ K_t = \bar{\Sigma}_t\, H^T\, (H\, \bar{\Sigma}_t\, H^T + R)^{-1} \]\[ \boldsymbol{\mu}_t = \bar{\boldsymbol{\mu}}_t + K_t\, (\mathbf{z}_t - H\, \bar{\boldsymbol{\mu}}_t) \]\[ \Sigma_t = (I - K_t H)\, \bar{\Sigma}_t \]The Kalman gain \(K_t\) balances trust in the prediction vs. the measurement. When measurement noise \(R \to 0\), \(K_t \to H^{-1}\) (trust measurement fully). When process noise \(Q \to 0\), \(K_t \to 0\) (trust prediction fully).
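The predict/update equations map line for line onto NumPy. A minimal sketch, exercised on a 1-D constant-velocity tracking problem with position-only measurements (the noise covariances are illustrative):

```python
import numpy as np

def kf_step(mu, Sigma, u, z, A, B, H, Q, R):
    """One Kalman filter cycle: predict with (A, B, Q), update with (H, R)."""
    # Predict
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + Q
    # Update
    S = H @ Sigma_bar @ H.T + R                 # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_bar + K @ (z - H @ mu_bar)      # correct with the innovation
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_bar
    return mu_new, Sigma_new
```

With state \((p, v)\), \(A = \begin{bmatrix}1 & \Delta t\\ 0 & 1\end{bmatrix}\), and \(H = [1\;\,0]\), feeding position measurements from a target moving at unit velocity pulls both the position and the unmeasured velocity estimate toward the truth.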
7.2 Extended Kalman Filter (EKF)
Most robotics systems are nonlinear. The EKF handles nonlinear models:
\[ \mathbf{x}_t = g(\mathbf{x}_{t-1}, \mathbf{u}_t) + \mathbf{w}_t \]\[ \mathbf{z}_t = h(\mathbf{x}_t) + \mathbf{v}_t \]The EKF linearizes \(g\) and \(h\) around the current estimate using Jacobians:
\[ G_t = \frac{\partial g}{\partial \mathbf{x}}\Bigg|_{\boldsymbol{\mu}_{t-1}, \mathbf{u}_t}, \qquad H_t = \frac{\partial h}{\partial \mathbf{x}}\Bigg|_{\bar{\boldsymbol{\mu}}_t} \]EKF Predict:
\[ \bar{\boldsymbol{\mu}}_t = g(\boldsymbol{\mu}_{t-1}, \mathbf{u}_t), \qquad \bar{\Sigma}_t = G_t\, \Sigma_{t-1}\, G_t^T + Q \]EKF Update:
\[ K_t = \bar{\Sigma}_t\, H_t^T\, (H_t\, \bar{\Sigma}_t\, H_t^T + R)^{-1} \]\[ \boldsymbol{\mu}_t = \bar{\boldsymbol{\mu}}_t + K_t\, (\mathbf{z}_t - h(\bar{\boldsymbol{\mu}}_t)), \qquad \Sigma_t = (I - K_t H_t)\, \bar{\Sigma}_t \]
Chapter 8: Coordinate Frames and Vehicle Modeling
8.1 Coordinate Frames
Autonomous driving involves multiple coordinate systems, and correct transformation between them is essential for sensor fusion.
- World frame (ENU: East-North-Up or ECEF): globally fixed, used for maps and global localization.
- Vehicle body frame: origin at vehicle center of gravity (or rear axle center), \(x\) forward, \(y\) left, \(z\) up.
- Sensor frames: each sensor (camera, LiDAR) has its own frame defined by its mounting position and orientation.
- Image plane frame: pixels in \((u, v)\) coordinates with origin at top-left.
A rigid-body transformation between frame \(A\) and frame \(B\) is a 4×4 homogeneous matrix:
\[ T_{AB} = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \]where \(R \in SO(3)\) is a rotation matrix and \(\mathbf{t} \in \mathbb{R}^3\) is a translation. Point \(\mathbf{p}_A\) transforms as \(\mathbf{p}_B = T_{AB}\, \mathbf{p}_A\).
Camera projection: A 3-D point \((X,Y,Z)\) in the camera frame projects to pixel \((u, v)\):
\[ u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y \]where \(f_x, f_y\) are focal lengths and \((c_x, c_y)\) is the principal point.
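Both operations, rigid transform and pinhole projection, are short enough to write out. The intrinsics in the usage below are illustrative values, not calibration data from any particular camera.

```python
import numpy as np

def transform(T_ab, p_a):
    """Apply a 4x4 homogeneous transform T_AB to a 3-D point in frame A."""
    ph = np.append(p_a, 1.0)        # homogeneous coordinates [X, Y, Z, 1]
    return (T_ab @ ph)[:3]

def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = p_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])
```

For example, a 90° rotation about \(z\) maps \((1,0,0)\) to \((0,1,0)\), and a point 10 m in front of a camera with \(f = 800\) px and principal point \((640, 360)\) projects near the image center.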
8.2 Kinematic Bicycle Model
The kinematic bicycle model approximates the vehicle as a single-track (bicycle) system, lumping front and rear axles into single wheels. It is valid at low speed where tire slip is negligible.
State: \(\mathbf{x} = (x, y, \psi, v)^T\) where \((x,y)\) is the rear axle center position, \(\psi\) is the heading angle, and \(v\) is the speed.
Input: \((a, \delta)^T\) where \(a\) is longitudinal acceleration and \(\delta\) is the front wheel steering angle.
Equations of motion:
\[ \dot{x} = v \cos\psi \]\[ \dot{y} = v \sin\psi \]\[ \dot{\psi} = \frac{v}{L} \tan\delta \]\[ \dot{v} = a \]where \(L\) is the wheelbase (distance between front and rear axles).
Steering to curvature: The instantaneous turning radius is:
\[ R = \frac{L}{\tan\delta} \]and curvature \(\kappa = 1/R = \tan\delta / L\). For small angles, \(\tan\delta \approx \delta\), giving \(\kappa \approx \delta / L\).
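The equations of motion integrate directly with a forward-Euler step (finer integrators are used when accuracy matters; the wheelbase and timestep below are illustrative):

```python
import numpy as np

def bicycle_step(state, a, delta, L, dt):
    """One forward-Euler step of the kinematic bicycle model.

    state = (x, y, psi, v); a = acceleration; delta = steering angle;
    L = wheelbase; dt = integration timestep."""
    x, y, psi, v = state
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    psi += v / L * np.tan(delta) * dt
    v += a * dt
    return np.array([x, y, psi, v])
```

With zero steering the vehicle tracks a straight line; with constant steering the heading rate matches \(\dot\psi = (v/L)\tan\delta\) exactly (since \(v\) is constant when \(a = 0\)).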
Chapter 9: Vehicle Control
9.1 Lateral Control
Lateral control keeps the vehicle on a desired path by commanding the steering angle.
9.1.1 Pure Pursuit Controller
Pure pursuit is a geometric controller that steers toward a look-ahead point on the desired path, located at distance \(L_d\) ahead of the vehicle.
The steering angle \(\delta\) satisfying the bicycle model geometry for a circular arc from the vehicle to the look-ahead point is:
\[ \delta = \arctan\!\left(\frac{2 L \sin\alpha}{L_d}\right) \]where \(\alpha\) is the angle between the vehicle’s heading and the direction to the look-ahead point, \(L\) is the wheelbase, and \(L_d\) is the look-ahead distance.
The look-ahead distance is often scheduled with speed, \(L_d = k_v v\), giving responsive tracking at low speed and smoother, more stable behavior at high speed. Pure pursuit has no explicit cross-track error feedback; lateral error enters only implicitly through the angle \(\alpha\).
9.1.2 Stanley Controller
The Stanley controller (used by the Stanford Stanley that won DARPA 2005) uses both heading error and cross-track error:
\[ \delta = \psi_e + \arctan\!\left(\frac{k_e\, e}{v}\right) \]where \(\psi_e\) is the heading error (difference between vehicle heading and path tangent), \(e\) is the cross-track error at the front axle, \(v\) is the vehicle speed, and \(k_e > 0\) is a gain.
The heading error term corrects orientation; the \(\arctan\) term corrects lateral offset and includes velocity scaling (reduces aggressive response at high speed). The use of arctan saturates the steering command naturally.
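Both geometric steering laws above reduce to one-line functions once their inputs (\(\alpha\), or \(\psi_e\) and \(e\)) have been computed from the path. A minimal sketch:

```python
import numpy as np

def pure_pursuit_steer(alpha, L, Ld):
    """Pure pursuit: steer toward a look-ahead point at distance Ld.

    alpha = angle from vehicle heading to the look-ahead point,
    L = wheelbase. arctan2 is equivalent to arctan here since Ld > 0."""
    return np.arctan2(2.0 * L * np.sin(alpha), Ld)

def stanley_steer(psi_e, e, v, k_e):
    """Stanley: heading-error term plus arctan cross-track correction.

    psi_e = heading error, e = front-axle cross-track error, v = speed."""
    return psi_e + np.arctan2(k_e * e, v)
```

Sanity checks: both laws command zero steering when the vehicle is on the path and aligned with it, and both steer toward the path when it lies to one side.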
9.1.3 Model Predictive Control (MPC) for Lateral Control
Linear MPC formulates lateral control as a receding-horizon optimization. Given the linearized error-state model (lateral error \(e\), heading error \(\psi_e\)):
\[ \mathbf{x}_{k+1} = A_d\, \mathbf{x}_k + B_d\, \delta_k \]the MPC solves at each timestep:
\[ \min_{\delta_{0:N-1}} \sum_{k=0}^{N-1} \left( \mathbf{x}_k^T Q\, \mathbf{x}_k + \delta_k^T R\, \delta_k \right) + \mathbf{x}_N^T P\, \mathbf{x}_N \]subject to:
\[ \mathbf{x}_{k+1} = A_d\, \mathbf{x}_k + B_d\, \delta_k, \quad |\delta_k| \leq \delta_\text{max}, \quad |\delta_k - \delta_{k-1}| \leq \Delta\delta_\text{max} \]Only the first control input \(\delta_0^*\) is applied; the optimization is re-solved at the next timestep with updated state (receding horizon principle). MPC naturally handles input constraints (maximum steering angle, rate limits) which geometric controllers cannot.
9.2 Longitudinal Control
Longitudinal control regulates vehicle speed by commanding throttle and brake.
9.2.1 PID Controller
The PID controller computes a control command from the speed error \(e(t) = v_\text{ref}(t) - v(t)\):
\[ u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{de}{dt} \]In discrete time with timestep \(\Delta t\):
\[ u_k = K_p\, e_k + K_i\, \Delta t \sum_{j=0}^k e_j + K_d\, \frac{e_k - e_{k-1}}{\Delta t} \]Tuning: The proportional term provides immediate response; the integral term eliminates steady-state error (e.g., due to road grade); the derivative term damps oscillation.
Anti-windup: When the actuator saturates (maximum throttle), the integral can accumulate (“wind up”), causing overshoot when the constraint lifts. Anti-windup clamps the integral or back-calculates to prevent this.
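The discrete PID with output clamping and anti-windup can be sketched as a small class. This version uses conditional integration (pause the integrator while the output is saturated), one of several common anti-windup schemes; the gains and limits are illustrative.

```python
class PID:
    """Discrete PID controller with output clamping and integral
    anti-windup (integration pauses while the actuator is saturated)."""

    def __init__(self, kp, ki, kd, dt, u_min=-1.0, u_max=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Tentative output using the candidate (updated) integral.
        u = (self.kp * error
             + self.ki * (self.integral + error * self.dt)
             + self.kd * derivative)
        if self.u_min < u < self.u_max:
            self.integral += error * self.dt   # only integrate when unsaturated
        return min(max(u, self.u_min), self.u_max)
```

With a huge error the output clamps at `u_max` and the integral stays frozen, so there is no accumulated overshoot to unwind once the error shrinks.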
9.2.2 Adaptive Cruise Control (ACC)
ACC extends speed control to headway-keeping. The time-headway model maintains a desired gap \(d_\text{ref} = v\, \tau_h + d_0\) where \(\tau_h\) is the time headway (typically 1.5–2.5 s) and \(d_0\) is a minimum standstill gap. A cascade controller uses an outer loop on gap error to generate a speed reference, fed to the inner PID speed controller.
Chapter 10: Motion Planning
10.1 The Motion Planning Problem
Motion planning determines a trajectory \(\xi: [0, T] \to \mathcal{X}\) from the current state to a goal state that is:
- Collision-free — avoiding obstacles in the workspace
- Dynamically feasible — respecting vehicle kinematics and dynamics
- Comfortable — limiting jerk, lateral acceleration
- Rule-compliant — obeying traffic laws, speed limits, right-of-way
The configuration space \(\mathcal{C}\) represents all possible states (poses) of the vehicle. The free configuration space \(\mathcal{C}_\text{free} = \mathcal{C} \setminus \mathcal{C}_\text{obs}\) excludes states where the vehicle collides with obstacles.
10.2 Sampling-Based Motion Planning
10.2.1 Rapidly-Exploring Random Trees (RRT)
RRT grows a tree from the start state by repeatedly:
- Sample a random configuration \(\mathbf{x}_\text{rand} \in \mathcal{C}\)
- Find the nearest tree node \(\mathbf{x}_\text{near}\)
- Extend from \(\mathbf{x}_\text{near}\) toward \(\mathbf{x}_\text{rand}\) by step size \(\epsilon\), producing \(\mathbf{x}_\text{new}\)
- If the edge \([\mathbf{x}_\text{near}, \mathbf{x}_\text{new}]\) is collision-free, add \(\mathbf{x}_\text{new}\) to the tree
- Check if \(\mathbf{x}_\text{new}\) is within tolerance of the goal
RRT is probabilistically complete: if a solution exists, it will be found with probability 1 as the number of samples \(\to \infty\). However, RRT paths are typically jagged and suboptimal.
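The loop above fits in a short function. This minimal 2-D sketch makes several simplifying assumptions: a fixed \(10 \times 10\) workspace, a point-based collision check (the edge check is reduced to checking the new point), and a small goal-sampling bias, a common practical addition that speeds convergence.

```python
import math
import random

def rrt(start, goal, is_free, step=0.5, goal_tol=0.5,
        max_iters=5000, goal_bias=0.1, seed=0):
    """Minimal 2-D RRT; returns the start->goal path or None."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(max_iters):
        # Sample a random configuration (occasionally the goal itself).
        rand = goal if rng.random() < goal_bias else \
            (rng.uniform(0, 10), rng.uniform(0, 10))
        # Nearest tree node.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], rand))
        d = math.dist(nodes[i_near], rand)
        if d == 0.0:
            continue
        # Extend by at most `step` toward the sample.
        s = min(step, d)
        nx, ny = nodes[i_near]
        new = (nx + s * (rand[0] - nx) / d, ny + s * (rand[1] - ny) / d)
        if not is_free(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) < goal_tol:
            path, i = [], len(nodes) - 1     # walk parents back to the root
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```

In obstacle-free space the tree reaches the goal quickly; the returned path is the jagged, suboptimal kind the text describes, which is why smoothing (Section 10.2.3) follows.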
10.2.2 RRT* (Asymptotically Optimal RRT)
RRT* adds two steps after extending:
Near-neighbor rewiring:
- Find all nodes within a ball of radius \(r = \gamma(\log n / n)^{1/d}\) around \(\mathbf{x}_\text{new}\)
- Choose the parent that minimizes the total cost from the start
- Rewire neighbors through \(\mathbf{x}_\text{new}\) if it improves their cost
RRT* is asymptotically optimal: as \(n \to \infty\), the cost of the found path converges to the optimal cost. This makes it suitable for offline planning where time is available.
10.2.3 Path Smoothing
RRT paths are piecewise linear and jerky. Smoothing techniques include:
- Shortcutting: randomly sample two path points, check if direct connection is collision-free, replace if so.
- B-spline fitting: fit a smooth spline to the waypoints, enforce curvature constraints.
- Elastic band: represent path as a band that repels from obstacles while minimizing length.
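Shortcutting is the simplest of the three; a sketch follows, with the segment collision check done by sampling points at a fixed step (an assumption of this sketch, not a guarantee of safety between samples):

```python
import math
import random

def shortcut(path, collision_free_point, iters=200, seed=0):
    """Randomly try to replace sub-paths with straight segments."""
    rng = random.Random(seed)
    path = list(path)

    def segment_free(a, b, step=0.1):
        # sample the segment at fixed intervals and test each point
        n = max(1, int(math.dist(a, b) / step))
        return all(collision_free_point((a[0] + t * (b[0] - a[0]),
                                         a[1] + t * (b[1] - a[1])))
                   for t in (k / n for k in range(n + 1)))

    for _ in range(iters):
        if len(path) < 3:
            break
        # pick two non-adjacent waypoints and try a direct connection
        i, j = sorted(rng.sample(range(len(path)), 2))
        if j - i < 2:
            continue
        if segment_free(path[i], path[j]):
            path = path[:i + 1] + path[j:]   # drop intermediate waypoints
    return path
```

On an obstacle-free zig-zag path the loop collapses all intermediate waypoints, leaving only the two endpoints.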
10.3 Trajectory Planning
A trajectory specifies position as a function of time \(\mathbf{p}(t)\), unlike a path which is only geometric. Trajectory planning must satisfy both geometric constraints (obstacle avoidance) and temporal constraints (speed limits, arrival time).
10.3.1 Polynomial Trajectory Planning
For point-to-point motion in one dimension, fit a polynomial \(p(t) = c_0 + c_1 t + c_2 t^2 + \cdots + c_n t^n\) to boundary conditions (position, velocity, acceleration at start and end). The polynomial order \(n\) is chosen to match the number of boundary conditions.
Minimum-jerk trajectory (quintic, \(n=5\)): Given 6 boundary conditions \((p_0, \dot{p}_0, \ddot{p}_0, p_T, \dot{p}_T, \ddot{p}_T)\), the quintic polynomial minimizes:
\[ J = \int_0^T \left(\frac{d^3 p}{dt^3}\right)^2 dt \]
This produces a smooth profile that minimizes jerk (the derivative of acceleration), improving passenger comfort and reducing structural stress.
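The six boundary conditions determine the six quintic coefficients through a linear system; a sketch with NumPy:

```python
import numpy as np

def min_jerk_quintic(p0, v0, a0, pT, vT, aT, T):
    """Solve for quintic coefficients c0..c5 from six boundary conditions."""
    # rows: p(0), p'(0), p''(0), p(T), p'(T), p''(T)
    A = np.array([
        [1, 0, 0,    0,        0,         0],
        [0, 1, 0,    0,        0,         0],
        [0, 0, 2,    0,        0,         0],
        [1, T, T**2, T**3,     T**4,      T**5],
        [0, 1, 2*T,  3*T**2,   4*T**3,    5*T**4],
        [0, 0, 2,    6*T,      12*T**2,   20*T**3],
    ], dtype=float)
    b = np.array([p0, v0, a0, pT, vT, aT], dtype=float)
    return np.linalg.solve(A, b)

def eval_poly(c, t):
    """Evaluate p(t) = c0 + c1 t + ... + c5 t^5."""
    return sum(ci * t**i for i, ci in enumerate(c))
```

For a rest-to-rest move (`min_jerk_quintic(0, 0, 0, 1, 0, 0, 2.0)`) the result is the classic symmetric S-curve that passes through the midpoint at half the displacement.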
Minimum-snap trajectory (septic, \(n=7\)): Used in quadrotor motion planning, it minimizes \(\int_0^T (\ddddot{p}\,)^2 dt\), where snap is the 4th derivative of position. For ground vehicles, minimum-jerk suffices.
For multi-segment trajectories (sequences of waypoints), the polynomial is fit per segment with continuity constraints (\(C^2\) or \(C^3\)) at junction points. The full problem becomes a quadratic program (QP):
\[ \min_{\mathbf{c}} \mathbf{c}^T Q\, \mathbf{c} \quad \text{subject to} \quad A\, \mathbf{c} = \mathbf{b} \]
where \(Q\) encodes the integral of squared derivatives and \(A\mathbf{c} = \mathbf{b}\) enforces boundary and continuity conditions.
10.3.2 Frenet Frame Trajectory Planning
Highway driving is naturally described in the Frenet frame \((s, d)\) where \(s\) is the arc-length along the reference path (road centerline) and \(d\) is the lateral offset. Trajectory planning decomposes into:
- Longitudinal: design \(s(t)\) to match desired speed and maintain headway
- Lateral: design \(d(t)\) to smoothly change lanes or center within lane
Candidate trajectories are generated by sampling target states at a planning horizon \(T\) and fitting quintic polynomials. A cost function balances comfort (jerk), proximity to obstacles, deviation from the centerline, and speed. The lowest-cost collision-free trajectory is selected and executed over a short receding horizon.
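The lateral half of this scheme can be sketched as follows. The candidate offsets, horizon, and cost weights are illustrative, obstacle terms are omitted, and the integrals are approximated by Riemann sums:

```python
import numpy as np

def quintic(d0, d_dot0, d_ddot0, dT, d_dotT, d_ddotT, T):
    """Quintic coefficients for lateral offset d(t) from six boundary conditions."""
    A = np.array([[1, 0, 0,    0,       0,        0],
                  [0, 1, 0,    0,       0,        0],
                  [0, 0, 2,    0,       0,        0],
                  [1, T, T**2, T**3,    T**4,     T**5],
                  [0, 1, 2*T,  3*T**2,  4*T**3,   5*T**4],
                  [0, 0, 2,    6*T,     12*T**2,  20*T**3]], dtype=float)
    b = np.array([d0, d_dot0, d_ddot0, dT, d_dotT, d_ddotT], dtype=float)
    return np.linalg.solve(A, b)

def best_lateral(d0, d_dot0, d_ddot0, T=3.0, w_jerk=0.1, w_dev=1.0):
    """Sample terminal lateral offsets, pick the lowest-cost quintic d(t)."""
    ts = np.linspace(0.0, T, 50)
    dt = ts[1] - ts[0]
    best = None
    for d_target in np.linspace(-3.0, 3.0, 13):   # candidate end offsets [m]
        c = quintic(d0, d_dot0, d_ddot0, d_target, 0.0, 0.0, T)
        # third derivative of the quintic: 6 c3 + 24 c4 t + 60 c5 t^2
        jerk = 6 * c[3] + 24 * c[4] * ts + 60 * c[5] * ts**2
        d = sum(c[i] * ts**i for i in range(6))
        cost = w_jerk * np.sum(jerk**2) * dt + w_dev * np.sum(d**2) * dt
        if best is None or cost < best[0]:
            best = (cost, d_target, c)
    return best
```

Starting 1 m off-center with these weights, the selected candidate is the one that returns to the centerline (\(d_T = 0\)), since staying offset accumulates deviation cost faster than the maneuver accumulates jerk cost.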
Chapter 11: System Integration and Safety
11.1 Sensor Fusion Architecture
A production ADS relies on multiple heterogeneous sensors whose data must be fused into a coherent world model:
- LiDAR (light detection and ranging): provides dense 3-D point clouds at ~10 Hz. Accurate range (~2 cm), but no color/texture, sparse at distance, degraded in heavy rain/snow.
- Camera: dense color information, high resolution, cheap. Range must be inferred from geometry. Deep learning excels here.
- Radar: long range (~200 m), measures radial velocity directly (Doppler), all-weather. Low angular resolution, sparse returns.
- GPS/IMU: global position at low frequency (GPS) combined with high-rate inertial measurements (IMU). Dead reckoning between GPS fixes.
Early vs. late fusion: Early fusion (combine raw sensor data before processing) preserves complementarity but requires tight calibration and time synchronization. Late fusion (each sensor produces detections, then fuse detections) is more modular but loses some information. Mid-level fusion (fuse extracted features) is increasingly popular in deep architectures.
11.2 HD Map and Localization
High-definition (HD) maps encode road geometry at centimeter-level precision: lane boundaries, traffic signs, traffic lights, speed limits. Combined with LiDAR point cloud matching (iterative closest point, ICP, or Monte Carlo localization), the vehicle achieves centimeter-level localization within the map.
Map-based localization workflow:
- Pre-drive the route to build an HD map (LiDAR scan + GPS).
- At runtime, align the current LiDAR scan to the stored map using point cloud matching.
- Fuse the scan-match pose with IMU dead-reckoning in an EKF or factor graph.
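Step 2 of this workflow, scan-to-map alignment, can be sketched as a minimal 2-D ICP with brute-force nearest-neighbor correspondences and an SVD (Kabsch) solve per iteration. Production systems work in 3-D and add k-d trees, downsampling, and outlier rejection; this is a bare sketch of the principle:

```python
import numpy as np

def icp_2d(src, dst, iters=20):
    """Align point set src to dst; returns (R, t) with dst ~= R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    R, t = np.eye(2), np.zeros(2)
    for _ in range(iters):
        cur = src @ R.T + t
        # nearest neighbor in dst for each transformed src point (brute force)
        d2 = ((cur[:, None, :] - dst[None, :, :])**2).sum(axis=-1)
        match = dst[d2.argmin(axis=1)]
        # best rigid transform via SVD (Kabsch), with reflection guard
        mu_s, mu_m = cur.mean(axis=0), match.mean(axis=0)
        H = (cur - mu_s).T @ (match - mu_m)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        dR = Vt.T @ D @ U.T
        dt = mu_m - dR @ mu_s
        # compose the incremental transform with the running estimate
        R, t = dR @ R, dR @ t + dt
    return R, t
```

When the initial pose error is small relative to the point spacing (the situation IMU dead-reckoning is meant to guarantee), the nearest-neighbor correspondences are correct and the alignment converges in a single step.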
11.3 Safety and Redundancy
A fundamental principle is defense in depth: no single system failure should cause a safety-critical incident.
- Redundant sensing: e.g., two independent perception stacks using different algorithms; primary and backup GPS.
- Fail-safe fallback: if the ADS fails, transition to a minimal risk condition (MRC) — slow down and stop safely.
- ISO 26262 ASIL: Automotive Safety Integrity Levels classify the rigor of safety analysis required. Most ADS functions require ASIL-C or ASIL-D (highest).
- Formal verification and simulation: before deployment, billions of simulated miles are driven to cover corner cases (e.g., adverse weather, unusual road markings).
Chapter 12: Deep Learning for Perception — Advanced Topics
12.1 Transfer Learning and Domain Adaptation
Training deep networks from scratch requires enormous labeled datasets. Transfer learning fine-tunes networks pre-trained on large datasets (e.g., ImageNet) for AV perception tasks. The early layers (edge detectors, texture detectors) transfer well; later task-specific layers are re-trained.
Domain adaptation addresses the gap between training conditions (e.g., sunny California) and deployment conditions (e.g., snowy Canada). Techniques include data augmentation (synthetic rain/fog), domain randomization, and adversarial domain adaptation.
12.2 3-D Object Detection
For safety, 2-D bounding boxes in image space are insufficient — the 3-D position and dimensions of surrounding vehicles are needed. Methods:
- LiDAR-based: PointNet directly processes raw point clouds; VoxelNet voxelizes and applies 3-D convolutions; PointPillars uses vertical pillars for efficient BEV (bird’s-eye-view) detection.
- Fusion-based: AVOD, CLOCs, and similar methods fuse camera and LiDAR features for improved accuracy.
- Camera-only depth estimation: Mono3D, FCOS3D, and BEVDet use neural networks to infer 3-D from monocular images, leveraging learned priors about object sizes.
12.3 End-to-End Learning
End-to-end approaches (e.g., NVIDIA's DAVE-2 system, Wayve) train a single neural network from sensor inputs directly to control outputs (steering, throttle), bypassing modular decomposition. Advantages: no hand-designed intermediate representations; the ability to leverage more data. Disadvantages: lack of interpretability, difficulty verifying safety, and distribution shift that is hard to diagnose.
Imitation learning (behavioral cloning) trains on human-driver demonstrations. Reinforcement learning refines the policy through interaction with a simulator. In practice, production systems use hybrid architectures combining learned perception with classical planning and control.
Chapter 13: Summary and Key Formulas
13.1 Core Equations Reference
Image convolution:
\[ (I * k)[x,y] = \sum_{u,v} I[x-u, y-v]\, k[u,v] \]
Harris response:
\[ R = \det(M) - \kappa\, [\mathrm{tr}(M)]^2 \]
Logistic regression:
\[ \hat{p} = \sigma(\mathbf{w}^T\mathbf{x}+b), \quad \sigma(z) = \frac{1}{1+e^{-z}} \]
Backpropagation:
\[ \boldsymbol{\delta}^{(l)} = \left( (W^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \right) \odot \phi'(\mathbf{z}^{(l)}) \]
Bayes filter prediction:
\[ \overline{\mathrm{bel}}(\mathbf{x}_t) = \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1},\mathbf{u}_t)\,\mathrm{bel}(\mathbf{x}_{t-1})\,d\mathbf{x}_{t-1} \]
Bayes filter update:
\[ \mathrm{bel}(\mathbf{x}_t) = \eta\, p(\mathbf{z}_t \mid \mathbf{x}_t)\, \overline{\mathrm{bel}}(\mathbf{x}_t) \]
Occupancy grid log-odds update (inverse sensor model form):
\[ l_t = l_{t-1} + \log\frac{p(m=1 \mid z_t)}{1 - p(m=1 \mid z_t)} - \log\frac{p_0}{1-p_0} \]
Kalman gain:
\[ K_t = \bar{\Sigma}_t H^T (H\bar{\Sigma}_t H^T + R)^{-1} \]
Kinematic bicycle model:
\[ \dot{\psi} = \frac{v}{L}\tan\delta, \quad \kappa = \frac{\tan\delta}{L} \]
Stanley controller:
\[ \delta = \psi_e + \arctan\!\left(\frac{k_e e}{v}\right) \]
PID controller:
\[ u(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \frac{de}{dt} \]
13.2 Conceptual Connections
The architecture of a complete ADS follows the sense–plan–act cycle studied throughout the course:
- Sensing: cameras, LiDAR, radar produce raw data
- Perception (Chapters 2–5): CV algorithms and deep CNNs extract objects, lanes, free space from raw data
- State estimation (Chapters 6–7): Bayes filter, occupancy grids, Kalman/EKF fuse observations into a dynamic world model
- Localization: EKF fuses HD map matching with IMU to estimate vehicle pose
- Prediction: ML-based models forecast agent future states
- Planning (Chapter 10): RRT*, polynomial trajectories compute a feasible, safe trajectory
- Control (Chapter 9): Stanley/MPC lateral + PID longitudinal controllers track the trajectory
- Vehicle actuation: steering, throttle, brake respond to control commands
Each module relies on probabilistic reasoning about uncertainty — the unifying theme of the course. Bayes’ theorem connects perception (sensor likelihood), state estimation (belief update), and planning (risk-aware trajectory selection) into a coherent mathematical framework for building safe autonomous vehicles.