ECE 495: Autonomous Vehicles
Krzysztof Czarnecki
Estimated study time: 1 hr 5 min
Sources and References
Primary references — S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, 2005; R. Szeliski, Computer Vision: Algorithms and Applications, 2nd ed., Springer, 2022 (free PDF: szeliski.org/Book).
Supplementary texts — I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016 (free: deeplearningbook.org); S. Liu, L. Li, J. Tang, S. Wu, and J.-L. Gaudiot, Creating Autonomous Vehicle Systems, 2nd ed., Morgan and Claypool, 2020.
Online resources — MIT 6.S094 Deep Learning for Self-Driving Cars; SAE J3016 Levels of Driving Automation standard; Apollo autonomous driving open platform documentation.
Chapter 1: Introduction to Autonomous Driving Systems
1.1 History and Motivation
The quest for self-driving vehicles stretches back further than most engineers appreciate. Early milestones include the Stanford Cart (1960s–70s), Carnegie Mellon’s Navlab project (1980s), and Ernst Dickmanns’ VaMoRs vehicle in Germany, which drove on the Autobahn using dynamic vision in 1987. The DARPA Grand Challenges (2004, 2005, 2007) catalyzed modern autonomous driving: the 2005 Stanley vehicle from Stanford won the desert course, and the 2007 Urban Challenge produced vehicles capable of navigating real intersection rules. By the 2010s, industrial programs at Google (later Waymo), Tesla, GM Cruise, and a proliferation of startups moved the field from demonstration to pre-commercial deployment.
The motivation is both safety and efficiency. Human error accounts for over 90% of road crashes globally. An automated system that perceives faster, never tires, and acts deterministically under its operational limits has the potential to reduce fatalities dramatically. Secondary benefits include reduced congestion, accessibility for mobility-impaired individuals, and energy efficiency through cooperative driving.
1.2 SAE J3016 Levels of Driving Automation
SAE International standard J3016 defines six levels (0–5) of driving automation, which have become the industry lingua franca.
| Level | Name | Who monitors environment? | Who performs dynamic driving task (DDT)? |
|---|---|---|---|
| 0 | No Automation | Human | Human |
| 1 | Driver Assistance | Human | Human + system (one of steering or acceleration) |
| 2 | Partial Automation | Human | Human + system (both steering and acceleration) |
| 3 | Conditional Automation | System | System (human must be ready to intervene on request) |
| 4 | High Automation | System | System (within ODD; no human fallback needed) |
| 5 | Full Automation | System | System (all conditions, no ODD restriction) |
The critical boundary is between Level 2 and Level 3: at L2 the human must continuously supervise; at L3 the system supervises and only requests human intervention when it reaches a limit. This distinction has significant legal and engineering implications.
1.3 Automated Driving System (ADS) Functional Architecture
A complete ADS is typically organized into a sense–plan–act pipeline:
- Perception — Processes sensor data (cameras, LiDAR, radar, GPS/IMU) to produce a model of the environment: detected objects, free space, lane geometry, traffic signals.
- Prediction — Forecasts the future states of dynamic agents (pedestrians, vehicles) over a planning horizon.
- Planning — Determines the vehicle’s future trajectory: route planning (global), behavioral planning (when to merge, yield), and motion planning (exact trajectory).
- Control — Executes the planned trajectory by commanding steering, throttle, and brake actuators.
- Localization — Determines the vehicle’s precise pose within a map, often using HD maps fused with sensor data.
1.3.1 WISE ADS Framework
The WISE (Waymo–inspired Systems Engineering) framework emphasizes four pillars: What the system must do (requirements), Infrastructure (compute, sensors, connectivity), Safety (fault tolerance, fallback), and Evaluation (simulation, closed-track, public road testing). A key principle is the safety case: an ADS must demonstrate, with evidence, that residual risk is acceptably low within its ODD.
1.4 State of the Technology
Current ADS deployments rely heavily on sensor fusion: cameras provide dense color/texture information; LiDAR provides precise 3-D geometry; radar provides velocity via Doppler and is robust to weather; GPS/IMU provide coarse global pose. Perception is now dominated by deep learning (CNNs, transformers). Control is often classical (PID, MPC) with deep RL emerging for edge cases. End-to-end neural approaches that map sensor inputs directly to steering commands exist but face interpretability challenges.
Chapter 2: Computer Vision for Autonomous Driving
2.1 Image Representation
A grayscale image is a 2-D array \(I: \mathbb{Z}^2 \to [0, 255]\) where each element is a pixel intensity. A color image adds a third channel dimension, typically RGB: \(I \in \mathbb{R}^{H \times W \times 3}\). In floating-point processing, values are normalized to \([0,1]\).
2.1.1 Image Convolution and Gaussian Filtering
Linear filtering replaces each pixel with a weighted sum of its neighborhood. Given a kernel (filter) \(k\) of size \((2r+1) \times (2r+1)\), the 2-D discrete convolution is:
\[ (I * k)[x, y] = \sum_{u=-r}^{r} \sum_{v=-r}^{r} I[x-u, y-v]\, k[u, v] \]The Gaussian kernel smooths the image by weighting nearby pixels more heavily:
\[ G_\sigma(u,v) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right) \]Gaussian filtering is separable: \(G_\sigma(u,v) = g_\sigma(u)\, g_\sigma(v)\), so a 2-D Gaussian convolution can be implemented as two 1-D passes, reducing complexity from \(O(N^2 k^2)\) to \(O(N^2 k)\) where \(k\) is the kernel width.
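As a sketch, the separability described above can be implemented in a few lines of NumPy: the 2-D blur is two 1-D convolutions, one along rows and one along columns. Truncating the kernel at a radius of \(3\sigma\) is a common (illustrative) choice.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Sampled 1-D Gaussian g_sigma(u), normalized so the weights sum to 1."""
    u = np.arange(-radius, radius + 1)
    g = np.exp(-u**2 / (2.0 * sigma**2))
    return g / g.sum()

def gaussian_blur(image, sigma):
    """2-D Gaussian smoothing as two 1-D passes (rows, then columns)."""
    radius = int(3 * sigma)  # kernel covers roughly +/- 3 sigma
    g = gaussian_kernel_1d(sigma, radius)
    # Convolve every row, then every column; mode="same" keeps the image size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)
```

Because \(G_\sigma(u,v) = g_\sigma(u)\,g_\sigma(v)\), this produces the same interior result as convolving with the full 2-D kernel, at a fraction of the cost.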
2.2 Edge Detection: The Canny Pipeline
The Canny detector (1986) remains a gold standard for edges because it is optimal under three criteria: good detection, good localization, and single-response (no double edges).
Step 1 — Smoothing. Convolve with \(G_\sigma\) to suppress noise.
Step 2 — Gradient computation. Apply Sobel operators to get partial derivatives:
\[ G_x = \begin{bmatrix}-1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1\end{bmatrix} * I, \qquad G_y = \begin{bmatrix}1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1\end{bmatrix} * I \]Gradient magnitude and direction:
\[ M = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right) \]Step 3 — Non-maximum suppression. For each pixel, check if its gradient magnitude is a local maximum along the gradient direction. If not, suppress (set to 0). This thins edges to single-pixel width.
Step 4 — Double thresholding. Apply two thresholds \(T_\text{low} < T_\text{high}\):
- \(M > T_\text{high}\): strong edge
- \(T_\text{low} < M \leq T_\text{high}\): weak edge
- \(M \leq T_\text{low}\): not an edge
Step 5 — Hysteresis linking. Keep weak edges only if they are connected (8-connected) to a strong edge. This removes noise-induced weak responses while preserving genuine edge continuations.
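The gradient and thresholding steps above can be sketched in NumPy as follows. This is a simplified illustration, not a full Canny implementation: smoothing and non-maximum suppression are omitted, and hysteresis is done with a stack-based flood fill over 8-connected neighbors.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[ 1, 2, 1], [ 0, 0, 0], [-1, -2, -1]], dtype=float)

def conv3x3(img, k):
    """3x3 correlation with a zero-padded border (sign flip vs. true
    convolution does not affect the gradient magnitude)."""
    p = np.pad(img.astype(float), 1)
    out = np.zeros(img.shape, dtype=float)
    for du in range(3):
        for dv in range(3):
            out += k[du, dv] * p[du:du + img.shape[0], dv:dv + img.shape[1]]
    return out

def hysteresis_threshold(mag, t_low, t_high):
    """Double thresholding plus hysteresis linking on a magnitude map."""
    strong = mag > t_high
    weak = (mag > t_low) & ~strong
    keep = strong.copy()
    stack = list(zip(*np.nonzero(strong)))
    while stack:  # grow strong edges into 8-connected weak pixels
        i, j = stack.pop()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < mag.shape[0] and 0 <= nj < mag.shape[1]:
                    if weak[ni, nj] and not keep[ni, nj]:
                        keep[ni, nj] = True
                        stack.append((ni, nj))
    return keep
```

Isolated weak responses (noise) die out; weak pixels touching a strong chain survive, which is exactly the behavior Step 5 describes.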
2.3 Line Detection: The Hough Transform
The Hough transform maps image points to a parameter space to detect global structures like lines.
A line in image space can be parameterized as:
\[ \rho = x\cos\theta + y\sin\theta \]where \(\rho\) is the perpendicular distance from the origin and \(\theta \in [0°, 180°)\) is the angle. This avoids the infinite slope problem of \(y = mx + b\).
Algorithm:
- For every edge pixel \((x_i, y_i)\), compute \(\rho(\theta) = x_i\cos\theta + y_i\sin\theta\) for all \(\theta\) in a discrete set.
- Increment the accumulator cell \(A[\rho, \theta]\).
- Peaks in \(A\) correspond to lines with many supporting edge pixels.
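The voting loop above translates almost directly into NumPy. This minimal sketch discretizes \(\theta\) at 1° steps and \(\rho\) at one-pixel resolution; both resolutions are illustrative choices.

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    """Accumulate votes in (rho, theta) space for a boolean edge mask."""
    h, w = edge_mask.shape
    thetas = np.deg2rad(np.arange(n_theta))            # theta in [0, 180) degrees
    diag = int(np.ceil(np.hypot(h, w)))                # max possible |rho|
    rhos = np.arange(-diag, diag + 1)                  # rho can be negative
    acc = np.zeros((len(rhos), n_theta), dtype=int)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        rho = x * np.cos(thetas) + y * np.sin(thetas)  # one rho per theta
        idx = np.round(rho).astype(int) + diag         # shift rho to array index
        acc[idx, np.arange(n_theta)] += 1
    return acc, rhos, thetas
```

A horizontal line \(y = 2\) yields \(\rho = 2\) at \(\theta = 90°\), so every edge pixel on that line votes into the same accumulator cell.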
2.4 Corner Detection: The Harris Detector
Edges are useful but corners (two-directional intensity changes) are more repeatable for tracking and matching. The Harris detector uses the structure tensor (second-moment matrix) of local image gradients.
Define the gradient at each pixel: \(\nabla I = (I_x, I_y)^T\). The structure tensor over a window \(W\) is:
\[ M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \]Often the sum is replaced by a Gaussian-weighted sum. The eigenvalues \(\lambda_1, \lambda_2\) of \(M\) characterize the local geometry:
- Both small: flat region (no gradient)
- One large, one small: edge
- Both large: corner
Rather than computing eigenvalues directly, Harris uses the corner response function:
\[ R = \det(M) - \kappa\, [\text{tr}(M)]^2 = \lambda_1\lambda_2 - \kappa(\lambda_1 + \lambda_2)^2 \]with \(\kappa \approx 0.04{-}0.06\). Pixels with \(R > T\) are corner candidates; non-maximum suppression retains only local maxima.
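The response function can be evaluated densely without any explicit eigendecomposition, since \(\det(M)\) and \(\mathrm{tr}(M)\) come directly from the summed gradient products. A minimal NumPy sketch, using central-difference gradients and a plain 3×3 box window (a Gaussian-weighted window is the more common choice in practice):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * tr(M)^2 at every pixel."""
    img = img.astype(float)
    iy, ix = np.gradient(img)                 # gradients along rows, columns
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box_sum(a):
        """Sum of each pixel's 3x3 neighborhood (zero-padded border)."""
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    sxx, syy, sxy = box_sum(ixx), box_sum(iyy), box_sum(ixy)
    det = sxx * syy - sxy * sxy               # det(M) = lambda1 * lambda2
    tr = sxx + syy                            # tr(M)  = lambda1 + lambda2
    return det - k * tr**2
```

On a synthetic bright quadrant, the response is zero in flat regions, negative along the two straight edges, and peaks at the quadrant's corner, matching the eigenvalue cases listed above.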
Chapter 3: Machine Learning Foundations
3.1 Supervised Learning and Linear Regression
Supervised learning finds a function \(f_\theta: \mathcal{X} \to \mathcal{Y}\) from labeled pairs \(\{(\mathbf{x}_i, y_i)\}_{i=1}^N\). In linear regression, the prediction is:
\[ \hat{y} = \mathbf{w}^T \mathbf{x} + b = \boldsymbol{\theta}^T \tilde{\mathbf{x}} \]where \(\tilde{\mathbf{x}} = [1, \mathbf{x}^T]^T\) absorbs the bias. The cost function is mean squared error:
\[ J(\boldsymbol{\theta}) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 \]The closed-form normal equation solution is \(\boldsymbol{\theta}^* = (X^T X)^{-1} X^T \mathbf{y}\), where \(X\) is the design matrix. For high-dimensional inputs, iterative gradient descent is preferred.
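The normal-equation solution is a one-liner in NumPy. This sketch prepends the column of ones that absorbs the bias and uses `np.linalg.lstsq` rather than an explicit matrix inverse, which is the numerically preferred way to solve the same least-squares problem.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: theta* minimizing ||X theta - y||^2.

    A column of ones is prepended so theta[0] is the bias b."""
    Xt = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(Xt, y, rcond=None)
    return theta
```

Fitting noiseless data generated by \(y = 2x + 1\) recovers \(\boldsymbol{\theta} = (1, 2)\) exactly.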
3.2 Logistic Regression
For binary classification (\(y \in \{0,1\}\)), logistic regression models the posterior:
\[ P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1+e^{-z}} \]The sigmoid \(\sigma\) squashes the linear output to \((0,1)\). Training minimizes binary cross-entropy loss:
\[ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i) \right] \]The gradient with respect to \(\mathbf{w}\) is:
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^N (\hat{p}_i - y_i)\, \mathbf{x}_i \]This elegant result (prediction minus label, times input) motivates the general pattern seen in neural network gradients.
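The "prediction minus label, times input" gradient makes the training loop very short. A minimal batch gradient-descent sketch (learning rate and step count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, steps=2000):
    """Batch gradient descent on binary cross-entropy.

    Uses the gradient from the text: (1/N) * sum_i (p_i - y_i) x_i,
    with a ones column appended so w[0] plays the role of the bias."""
    Xt = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xt.shape[1])
    for _ in range(steps):
        p = sigmoid(Xt @ w)                 # predicted probabilities
        grad = Xt.T @ (p - y) / len(y)      # (prediction - label) * input
        w -= lr * grad
    return w

def predict(w, X):
    Xt = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(Xt @ w) > 0.5).astype(int)
```

On a linearly separable 1-D toy set, the learned boundary lands between the two classes.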
3.3 Neural Network Fundamentals
A feedforward neural network composes multiple layers:
\[ \mathbf{h}^{(l)} = \phi\!\left( W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)} \right), \quad l = 1, \ldots, L \]where \(\phi\) is a nonlinear activation. Common activations:
- ReLU: \(\phi(z) = \max(0, z)\) — sparse, avoids vanishing gradient, default for hidden layers
- Sigmoid: \(\sigma(z) = (1+e^{-z})^{-1}\) — output layer for binary classification
- Softmax: \(\phi(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}\) — output layer for multi-class classification
- Tanh: \(\tanh(z)\) — zero-centered, useful in recurrent networks
3.3.1 Backpropagation
Backpropagation efficiently computes \(\partial \mathcal{L}/\partial \theta\) for all parameters using the chain rule. Define the local gradient (error signal) at layer \(l\) as \(\boldsymbol{\delta}^{(l)} = \partial \mathcal{L} / \partial \mathbf{z}^{(l)}\) where \(\mathbf{z}^{(l)} = W^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\).
Forward pass: Compute and store \(\mathbf{z}^{(l)}\) and \(\mathbf{h}^{(l)}\) for all layers.
Backward pass (output layer \(L\)):
\[ \boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot \phi'\!\left(\mathbf{z}^{(L)}\right) \]Backward pass (hidden layer \(l < L\)):
\[ \boldsymbol{\delta}^{(l)} = \left( W^{(l+1)T} \boldsymbol{\delta}^{(l+1)} \right) \odot \phi'\!\left(\mathbf{z}^{(l)}\right) \]Parameter gradients:
\[ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{h}^{(l-1)}\right)^T, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)} \]Gradient descent update:
\[ \theta \leftarrow \theta - \eta\, \frac{\partial \mathcal{L}}{\partial \theta} \]where \(\eta\) is the learning rate. Mini-batch stochastic gradient descent (SGD) uses a random subset of \(B\) samples per update, providing a noisy but efficient estimate.
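The forward pass, error backpropagation, and gradient update above can be demonstrated end to end on a toy problem. This sketch trains a two-layer network (ReLU hidden layer, sigmoid output with cross-entropy, so \(\boldsymbol{\delta}^{(2)} = \mathbf{p} - \mathbf{y}\)) on XOR; the layer width, learning rate, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR labels

W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta, losses = 1.0, []
for _ in range(5000):
    # Forward pass: store z and h for each layer.
    z1 = X @ W1 + b1; h1 = np.maximum(z1, 0.0)    # ReLU hidden layer
    z2 = h1 @ W2 + b2; p = sigmoid(z2)            # sigmoid output
    pc = np.clip(p, 1e-9, 1 - 1e-9)               # numerical safety for log
    losses.append(float(-(Y * np.log(pc) + (1 - Y) * np.log(1 - pc)).mean()))
    # Backward pass: sigmoid + cross-entropy gives delta2 = p - y.
    d2 = (p - Y) / len(X)
    d1 = (d2 @ W2.T) * (z1 > 0)                   # delta through ReLU derivative
    # Gradient descent update.
    W2 -= eta * h1.T @ d2; b2 -= eta * d2.sum(axis=0)
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)
```

The recorded cross-entropy loss falls as training proceeds, which is the observable consequence of the update rule.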
3.3.2 Batch Normalization
Batch normalization (Ioffe & Szegedy, 2015) normalizes pre-activations within a mini-batch to stabilize training:
\[ \hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{z}_i + \beta \]where \(\mu_B\) and \(\sigma_B^2\) are the batch mean and variance, and \(\gamma, \beta\) are learned scale and shift parameters. Benefits: enables higher learning rates, acts as a regularizer, reduces sensitivity to initialization.
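The forward pass of training-mode batch normalization is only a few lines (inference uses running averages of \(\mu_B\) and \(\sigma_B^2\) instead, which this sketch omits):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Training-mode batch norm: normalize each feature over the batch,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = z.mean(axis=0)                      # per-feature batch mean
    var = z.var(axis=0)                      # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta
```

With \(\gamma = 1, \beta = 0\) the output has zero mean and unit variance per feature; nonzero \(\gamma, \beta\) let the network undo the normalization if that helps.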
Chapter 4: Convolutional Neural Networks
4.1 CNN Architecture
Convolutional Neural Networks exploit three key ideas for visual data: local connectivity (each unit sees only a local receptive field), weight sharing (the same filter is applied at all spatial locations), and spatial hierarchy (deeper layers combine local features into global structure).
4.1.1 Convolution Layer
A convolutional layer with \(K\) filters of size \(F \times F \times C_\text{in}\) produces \(K\) feature maps. The output at spatial location \((i,j)\) for filter \(k\) is:
\[ Z_{i,j,k} = \sum_{u=0}^{F-1}\sum_{v=0}^{F-1}\sum_{c=0}^{C_\text{in}-1} W_{u,v,c,k}\, X_{i\cdot s+u,\, j\cdot s+v,\, c} + b_k \]where \(s\) is the stride. Padding (typically \(P = \lfloor F/2 \rfloor\)) preserves spatial dimensions. After convolution, a nonlinearity (ReLU) is applied element-wise.
4.1.2 Pooling Layer
Pooling reduces spatial resolution and introduces local translation invariance. Max pooling takes the maximum over a \(2\times2\) window with stride 2, halving each spatial dimension. Average pooling takes the mean. Pooling has no learned parameters.
4.1.3 Fully Connected Layer
After spatial feature extraction, the feature map is flattened and passed through one or more fully connected (dense) layers for final classification or regression. The last FC layer typically outputs class logits for softmax.
4.1.4 Example Architecture — AlexNet (2012)
Conv(96, 11×11, s=4) → ReLU → MaxPool → Conv(256, 5×5) → ReLU → MaxPool → Conv(384, 3×3) → Conv(384, 3×3) → Conv(256, 3×3) → MaxPool → FC(4096) → FC(4096) → FC(1000, softmax)
AlexNet won ImageNet 2012 with a 15.3% top-5 error, far ahead of the 26.2% runner-up, launching the deep learning era in computer vision.
4.2 Semantic Segmentation
Semantic segmentation assigns a class label to every pixel. For autonomous driving this produces the “scene understanding” map: road, sidewalk, vehicle, pedestrian, sky, building.
4.2.1 Fully Convolutional Network (FCN)
Long et al. (2015) replaced all FC layers in a classification CNN with convolutional layers, enabling dense prediction. The spatial output is too coarse due to pooling, so transposed convolutions (sometimes called deconvolutions) upsample back to the input resolution. Skip connections from earlier (higher-resolution) layers add fine-grained spatial information back.
4.2.2 U-Net Encoder–Decoder
U-Net (Ronneberger et al., 2015) follows a symmetric encoder–decoder structure with skip connections at every resolution level:
- Encoder (contracting path): repeated blocks of [Conv → ReLU → Conv → ReLU → MaxPool], doubling channels at each level.
- Bottleneck: deepest representation at lowest spatial resolution.
- Decoder (expanding path): at each level, upsample (transposed conv or bilinear), concatenate the corresponding encoder feature map via skip connection, then [Conv → ReLU → Conv → ReLU].
The skip connections preserve spatial information that would otherwise be lost during downsampling, enabling precise boundary delineation — critical for segmenting road surface from curb.
Training uses pixel-wise cross-entropy loss. The Intersection over Union (IoU) metric per class, and mean IoU (mIoU), are standard evaluation metrics:
\[ \text{IoU}_c = \frac{|\text{pred}_c \cap \text{gt}_c|}{|\text{pred}_c \cup \text{gt}_c|} \]
Chapter 5: Object Detection
5.1 From Sliding Window to Region Proposals
Early detection applied a classifier at every position and scale (sliding window). This is accurate but computationally prohibitive. The paradigm evolved through three generations:
- Selective search + classification (R-CNN, 2014): propose ~2000 region candidates using a classical algorithm, warp each to a fixed size, classify with CNN. Slow: ~47s/image.
- Shared CNN features (Fast R-CNN, 2015): run CNN once on the whole image; project proposals onto feature map; use RoI pooling to extract fixed-size features per proposal. ~0.2s/image.
- Region Proposal Network (Faster R-CNN, 2015): replace selective search with a small CNN (RPN) sharing the backbone features. Fully end-to-end trainable. ~0.05s/image.
5.2 Anchor Boxes
Modern detectors predict offsets relative to anchors: a set of fixed bounding boxes of various scales and aspect ratios tiled across the feature map. Each anchor is associated with a location, and the network predicts:
- Objectness score: is there an object overlapping this anchor?
- Class probabilities (given object present)
- Bounding box regression offsets \((\Delta x, \Delta y, \Delta w, \Delta h)\) relative to anchor
The box regression targets are defined in log-space for width/height:
\[ t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a} \]An anchor is a positive example if it has IoU \(> 0.7\) with some ground-truth box; negative if IoU \(< 0.3\).
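The encode/decode pair implied by these equations is worth writing out, since getting the inverse transform wrong is a classic detector bug. A minimal sketch using center-size box format \((x, y, w, h)\):

```python
import numpy as np

def encode(box, anchor):
    """Box -> regression targets (t_x, t_y, t_w, t_h) per the equations above."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Inverse transform: predicted targets + anchor -> box (x, y, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])
```

The log-space width/height targets make the regression scale-invariant: doubling a box relative to its anchor always produces \(t_w = \log 2\), regardless of absolute size.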
5.3 YOLO: Real-Time Object Detection
YOLO (You Only Look Once, Redmon et al., 2016) divides the image into an \(S \times S\) grid. Each cell predicts \(B\) bounding boxes and \(C\) class probabilities simultaneously in a single forward pass, achieving real-time speed.
Grid cell prediction: Each cell outputs \(5B + C\) values: five per box, \((x, y, w, h, \text{confidence})\), where \(x, y\) are relative to the cell and \(w, h\) are relative to the image, plus \(C\) conditional class probabilities shared by the cell's boxes. Confidence is defined as:
\[ \text{conf} = P(\text{object}) \times \text{IoU}_{\text{pred}}^{\text{truth}} \]YOLO Loss Function: The total loss sums over all grid cells and boxes:
\[
\begin{aligned}
\mathcal{L} ={} & \lambda_\text{coord} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
& + \lambda_\text{coord} \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2\right] \\
& + \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_\text{noobj} \sum_{i,j} \mathbf{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
& + \sum_{i,j} \mathbf{1}_{ij}^{\text{obj}} \sum_{c} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]
The square root on \(w, h\) makes the loss less sensitive to size for large boxes; \(\lambda_\text{coord} = 5\) and \(\lambda_\text{noobj} = 0.5\) balance localization vs. confidence terms. Later versions (YOLOv3–v8) introduced multi-scale prediction and anchor priors.
Non-Maximum Suppression (NMS): After detection, many boxes overlap. NMS iteratively selects the highest-confidence box and suppresses all others with IoU \(> \text{threshold}\) (typically 0.5).
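The greedy procedure just described can be sketched directly. This illustration uses corner-format boxes \((x_1, y_1, x_2, y_2)\) and an \(O(n^2)\) loop, which is fine for the few hundred boxes a detector typically emits per image:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best box, drop overlapping boxes, repeat."""
    order = list(np.argsort(scores)[::-1])   # indices by descending score
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

Two near-duplicate detections of the same object collapse to one; a distant detection survives untouched.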
Evaluation metric — mAP: Mean Average Precision averages the area under the precision–recall curve across all classes and IoU thresholds. COCO standard uses mAP@[0.5:0.05:0.95], requiring accurate localization as well as correct classification.
Chapter 6: Probabilistic State Estimation
6.1 Dynamic System Modeling
An autonomous vehicle must track its own pose and the states of surrounding agents over time. The general framework models a hidden Markov model (HMM):
- State \(\mathbf{x}_t\): true system state at time \(t\) (position, velocity, orientation, etc.)
- Control input \(\mathbf{u}_t\): known input applied at time \(t\) (steering, throttle)
- Observation \(\mathbf{z}_t\): noisy sensor measurement at time \(t\)
- Motion model \(p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{u}_t)\): probabilistic state transition
- Sensor model \(p(\mathbf{z}_t \mid \mathbf{x}_t)\): likelihood of measurement given state
6.2 Bayes Filter: General Recursive Estimation
The goal is to maintain the belief \(\text{bel}(\mathbf{x}_t) = p(\mathbf{x}_t \mid \mathbf{z}_{1:t}, \mathbf{u}_{1:t})\), the posterior distribution over the state given all history.
Under the Markov assumption, the Bayes filter iterates two steps:
Prediction step (propagate through motion model):
\[ \overline{\text{bel}}(\mathbf{x}_t) = \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{u}_t)\, \text{bel}(\mathbf{x}_{t-1})\, d\mathbf{x}_{t-1} \]Update step (incorporate measurement):
\[ \text{bel}(\mathbf{x}_t) = \eta\, p(\mathbf{z}_t \mid \mathbf{x}_t)\, \overline{\text{bel}}(\mathbf{x}_t) \]where \(\eta\) is a normalizing constant ensuring \(\int \text{bel}(\mathbf{x}_t)\, d\mathbf{x}_t = 1\).
The Bayes filter is the theoretical foundation. Different implementations arise from different representations of the belief distribution:
- Kalman filter — Gaussian beliefs, linear systems
- Extended Kalman filter — Gaussian beliefs, nonlinear systems (Jacobian linearization)
- Particle filter — non-parametric, arbitrary distributions
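The two-step recursion is easiest to see in a discrete (histogram) implementation, where the integral becomes a sum. This sketch tracks a robot on a small circular 1-D grid; the motion kernel and sensor likelihood are illustrative.

```python
import numpy as np

def bayes_filter_step(bel, motion_kernel, likelihood):
    """One predict + update cycle of the discrete Bayes filter.

    bel: belief over N grid cells (sums to 1)
    motion_kernel: dict {cell shift: probability}, i.e. p(x_t | x_{t-1})
    likelihood: p(z_t | x_t) evaluated at every cell"""
    n = len(bel)
    # Prediction: push the belief through the motion model (circular world).
    pred = np.zeros(n)
    for shift, prob in motion_kernel.items():
        pred += prob * np.roll(bel, shift)
    # Update: multiply by the measurement likelihood, then renormalize (eta).
    post = likelihood * pred
    return post / post.sum()
```

Starting certain at cell 0, commanding "move one cell" (success probability 0.8), and then sensing cell 1 concentrates the belief at cell 1, exactly as the prediction/update equations dictate.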
6.3 Occupancy Grid Mapping
An occupancy grid divides the environment into a fixed-resolution grid of cells. Each cell \(m_i\) is a binary random variable: occupied (1) or free (0). The goal is to estimate \(p(m_i \mid \mathbf{z}_{1:t}, \mathbf{x}_{1:t})\) for each cell, assuming a known vehicle trajectory.
6.3.1 Log-Odds Representation
Rather than storing probabilities (which can become numerically unstable near 0 or 1), the log-odds representation is used:
\[ l_t(m_i) = \log \frac{p(m_i = 1 \mid \mathbf{z}_{1:t})}{p(m_i = 0 \mid \mathbf{z}_{1:t})} = \log \frac{p(m_i=1)}{1-p(m_i=1)} \]Using Bayes’ theorem and the assumption of independent cells, the update rule for each new measurement is:
\[ l_t(m_i) = l_{t-1}(m_i) + \log \frac{p(\mathbf{z}_t \mid m_i=1)}{p(\mathbf{z}_t \mid m_i=0)} + \log \frac{1 - p_0}{p_0} \]where \(p_0 = p(m_i = 1)\) is the prior occupancy probability (typically 0.5, giving \(\log\frac{1-p_0}{p_0} = 0\)). The term \(\log \frac{p(\mathbf{z}_t \mid m_i=1)}{p(\mathbf{z}_t \mid m_i=0)}\) is the inverse sensor model log-odds.
Inverse sensor model (LiDAR): For a LiDAR beam that hits at range \(r\):
- Cells along the beam before the hit: likely free → log-odds decremented (e.g., \(-0.4\))
- Cell at the hit point: likely occupied → log-odds incremented (e.g., \(+0.9\))
- Cells beyond the hit: unknown → no update
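A single-beam version of this update is a few lines once the map is stored in log-odds. The increment/decrement constants below are the illustrative values from the list above; the beam is assumed to run along one grid row.

```python
import numpy as np

L_FREE, L_OCC = -0.4, 0.9   # illustrative inverse-sensor-model log-odds

def update_beam(log_odds, hit_cell):
    """Occupancy update along one beam: cells before the hit become more
    free, the hit cell becomes more occupied, cells beyond stay unchanged."""
    log_odds[:hit_cell] += L_FREE
    log_odds[hit_cell] += L_OCC
    return log_odds

def probability(log_odds):
    """Recover p(occupied) from log-odds: p = 1 / (1 + exp(-l))."""
    return 1.0 / (1.0 + np.exp(-log_odds))
```

Starting from the prior \(p_0 = 0.5\) (log-odds 0), repeated observations of the same hit drive the hit cell toward 1 and the traversed cells toward 0, while unobserved cells stay at 0.5.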
Chapter 7: Kalman Filtering
7.1 The Kalman Filter
The Kalman filter (KF) is the optimal Bayes filter for linear Gaussian systems. The state and observation models are:
\[ \mathbf{x}_t = A\, \mathbf{x}_{t-1} + B\, \mathbf{u}_t + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, Q) \]\[ \mathbf{z}_t = H\, \mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, R) \]The belief is Gaussian: \(\text{bel}(\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_t, \Sigma_t)\).
7.1.1 KF Equations
Predict:
\[ \bar{\boldsymbol{\mu}}_t = A\, \boldsymbol{\mu}_{t-1} + B\, \mathbf{u}_t \]\[ \bar{\Sigma}_t = A\, \Sigma_{t-1}\, A^T + Q \]Update:
\[ K_t = \bar{\Sigma}_t\, H^T\, (H\, \bar{\Sigma}_t\, H^T + R)^{-1} \]\[ \boldsymbol{\mu}_t = \bar{\boldsymbol{\mu}}_t + K_t\, (\mathbf{z}_t - H\, \bar{\boldsymbol{\mu}}_t) \]\[ \Sigma_t = (I - K_t H)\, \bar{\Sigma}_t \]The Kalman gain \(K_t\) balances trust in the prediction vs. the measurement. When measurement noise \(R \to 0\), \(K_t \to H^{-1}\) (trust measurement fully). When process noise \(Q \to 0\), \(K_t \to 0\) (trust prediction fully).
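The predict/update equations map line for line onto NumPy. A minimal sketch, exercised on a 1-D constant-velocity tracking problem with position-only measurements (the noise covariances are illustrative):

```python
import numpy as np

def kf_step(mu, Sigma, u, z, A, B, H, Q, R):
    """One Kalman filter cycle: predict with (A, B, Q), update with (H, R)."""
    # Predict
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + Q
    # Update
    S = H @ Sigma_bar @ H.T + R                 # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_bar + K @ (z - H @ mu_bar)      # correct with the innovation
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_bar
    return mu_new, Sigma_new
```

With state \((p, v)\), \(A = \begin{bmatrix}1 & \Delta t\\ 0 & 1\end{bmatrix}\), and \(H = [1\;\,0]\), feeding position measurements from a target moving at unit velocity pulls both the position and the unmeasured velocity estimate toward the truth.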
7.2 Extended Kalman Filter (EKF)
Most robotics systems are nonlinear. The EKF handles nonlinear models:
\[ \mathbf{x}_t = g(\mathbf{x}_{t-1}, \mathbf{u}_t) + \mathbf{w}_t \]\[ \mathbf{z}_t = h(\mathbf{x}_t) + \mathbf{v}_t \]The EKF linearizes \(g\) and \(h\) around the current estimate using Jacobians:
\[ G_t = \frac{\partial g}{\partial \mathbf{x}}\Bigg|_{\boldsymbol{\mu}_{t-1}, \mathbf{u}_t}, \qquad H_t = \frac{\partial h}{\partial \mathbf{x}}\Bigg|_{\bar{\boldsymbol{\mu}}_t} \]EKF Predict:
\[ \bar{\boldsymbol{\mu}}_t = g(\boldsymbol{\mu}_{t-1}, \mathbf{u}_t), \qquad \bar{\Sigma}_t = G_t\, \Sigma_{t-1}\, G_t^T + Q \]EKF Update:
\[ K_t = \bar{\Sigma}_t\, H_t^T\, (H_t\, \bar{\Sigma}_t\, H_t^T + R)^{-1} \]\[ \boldsymbol{\mu}_t = \bar{\boldsymbol{\mu}}_t + K_t\, (\mathbf{z}_t - h(\bar{\boldsymbol{\mu}}_t)), \qquad \Sigma_t = (I - K_t H_t)\, \bar{\Sigma}_t \]
Chapter 8: Coordinate Frames and Vehicle Modeling
8.1 Coordinate Frames
Autonomous driving involves multiple coordinate systems, and correct transformation between them is essential for sensor fusion.
- World frame (ENU: East-North-Up or ECEF): globally fixed, used for maps and global localization.
- Vehicle body frame: origin at vehicle center of gravity (or rear axle center), \(x\) forward, \(y\) left, \(z\) up.
- Sensor frames: each sensor (camera, LiDAR) has its own frame defined by its mounting position and orientation.
- Image plane frame: pixels in \((u, v)\) coordinates with origin at top-left.
A rigid-body transformation between frame \(A\) and frame \(B\) is a 4×4 homogeneous matrix:
\[ T_{AB} = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \]where \(R \in SO(3)\) is a rotation matrix and \(\mathbf{t} \in \mathbb{R}^3\) is a translation. Point \(\mathbf{p}_A\) transforms as \(\mathbf{p}_B = T_{AB}\, \mathbf{p}_A\).
Camera projection: A 3-D point \((X,Y,Z)\) in the camera frame projects to pixel \((u, v)\):
\[ u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y \]where \(f_x, f_y\) are focal lengths and \((c_x, c_y)\) is the principal point.
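Both operations, rigid transform and pinhole projection, are short enough to write out. The intrinsics in the usage below are illustrative values, not calibration data from any particular camera.

```python
import numpy as np

def transform(T_ab, p_a):
    """Apply a 4x4 homogeneous transform T_AB to a 3-D point in frame A."""
    ph = np.append(p_a, 1.0)        # homogeneous coordinates [X, Y, Z, 1]
    return (T_ab @ ph)[:3]

def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = p_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])
```

For example, a 90° rotation about \(z\) maps \((1,0,0)\) to \((0,1,0)\), and a point 10 m in front of a camera with \(f = 800\) px and principal point \((640, 360)\) projects near the image center.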
8.2 Kinematic Bicycle Model
The kinematic bicycle model approximates the vehicle as a single-track (bicycle) system, lumping front and rear axles into single wheels. It is valid at low speed where tire slip is negligible.
State: \(\mathbf{x} = (x, y, \psi, v)^T\) where \((x,y)\) is the rear axle center position, \(\psi\) is the heading angle, and \(v\) is the speed.
Input: \((a, \delta)^T\) where \(a\) is longitudinal acceleration and \(\delta\) is the front wheel steering angle.
Equations of motion:
\[ \dot{x} = v \cos\psi \]\[ \dot{y} = v \sin\psi \]\[ \dot{\psi} = \frac{v}{L} \tan\delta \]\[ \dot{v} = a \]where \(L\) is the wheelbase (distance between front and rear axles).
Steering to curvature: The instantaneous turning radius is:
\[ R = \frac{L}{\tan\delta} \]and curvature \(\kappa = 1/R = \tan\delta / L\). For small angles, \(\tan\delta \approx \delta\), giving \(\kappa \approx \delta / L\).
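The equations of motion integrate directly with a forward-Euler step (finer integrators are used when accuracy matters; the wheelbase and timestep below are illustrative):

```python
import numpy as np

def bicycle_step(state, a, delta, L, dt):
    """One forward-Euler step of the kinematic bicycle model.

    state = (x, y, psi, v); a = acceleration; delta = steering angle;
    L = wheelbase; dt = integration timestep."""
    x, y, psi, v = state
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    psi += v / L * np.tan(delta) * dt
    v += a * dt
    return np.array([x, y, psi, v])
```

With zero steering the vehicle tracks a straight line; with constant steering the heading rate matches \(\dot\psi = (v/L)\tan\delta\) exactly (since \(v\) is constant when \(a = 0\)).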
Chapter 9: Vehicle Control
9.1 Lateral Control
Lateral control keeps the vehicle on a desired path by commanding the steering angle.
9.1.1 Pure Pursuit Controller
Pure pursuit is a geometric controller that steers toward a look-ahead point on the desired path, located at distance \(L_d\) ahead of the vehicle.
The steering angle \(\delta\) satisfying the bicycle model geometry for a circular arc from the vehicle to the look-ahead point is:
\[ \delta = \arctan\!\left(\frac{2 L \sin\alpha}{L_d}\right) \]where \(\alpha\) is the angle between the vehicle’s heading and the direction to the look-ahead point, \(L\) is the wheelbase, and \(L_d\) is the look-ahead distance.
The look-ahead distance is often scheduled with speed, \(L_d = k_v v\), giving responsive tracking at low speed and smoother, more stable behavior at high speed. Pure pursuit has no explicit cross-track error feedback; lateral error enters only implicitly through the angle \(\alpha\).
9.1.2 Stanley Controller
The Stanley controller (used by the Stanford Stanley that won DARPA 2005) uses both heading error and cross-track error:
\[ \delta = \psi_e + \arctan\!\left(\frac{k_e\, e}{v}\right) \]where \(\psi_e\) is the heading error (difference between vehicle heading and path tangent), \(e\) is the cross-track error at the front axle, \(v\) is the vehicle speed, and \(k_e > 0\) is a gain.
The heading error term corrects orientation; the \(\arctan\) term corrects lateral offset and includes velocity scaling (reduces aggressive response at high speed). The use of arctan saturates the steering command naturally.
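Both geometric steering laws above reduce to one-line functions once their inputs (\(\alpha\), or \(\psi_e\) and \(e\)) have been computed from the path. A minimal sketch:

```python
import numpy as np

def pure_pursuit_steer(alpha, L, Ld):
    """Pure pursuit: steer toward a look-ahead point at distance Ld.

    alpha = angle from vehicle heading to the look-ahead point,
    L = wheelbase. arctan2 is equivalent to arctan here since Ld > 0."""
    return np.arctan2(2.0 * L * np.sin(alpha), Ld)

def stanley_steer(psi_e, e, v, k_e):
    """Stanley: heading-error term plus arctan cross-track correction.

    psi_e = heading error, e = front-axle cross-track error, v = speed."""
    return psi_e + np.arctan2(k_e * e, v)
```

Sanity checks: both laws command zero steering when the vehicle is on the path and aligned with it, and both steer toward the path when it lies to one side.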
9.1.3 Model Predictive Control (MPC) for Lateral Control
Linear MPC formulates lateral control as a receding-horizon optimization. Given the linearized error-state model (lateral error \(e\), heading error \(\psi_e\)):
\[ \mathbf{x}_{k+1} = A_d\, \mathbf{x}_k + B_d\, \delta_k \]the MPC solves at each timestep:
\[ \min_{\delta_{0:N-1}} \sum_{k=0}^{N-1} \left( \mathbf{x}_k^T Q\, \mathbf{x}_k + \delta_k^T R\, \delta_k \right) + \mathbf{x}_N^T P\, \mathbf{x}_N \]subject to:
\[ \mathbf{x}_{k+1} = A_d\, \mathbf{x}_k + B_d\, \delta_k, \quad |\delta_k| \leq \delta_\text{max}, \quad |\delta_k - \delta_{k-1}| \leq \Delta\delta_\text{max} \]Only the first control input \(\delta_0^*\) is applied; the optimization is re-solved at the next timestep with updated state (receding horizon principle). MPC naturally handles input constraints (maximum steering angle, rate limits) which geometric controllers cannot.
9.2 Longitudinal Control
Longitudinal control regulates vehicle speed by commanding throttle and brake.
9.2.1 PID Controller
The PID controller computes a control command from the speed error \(e(t) = v_\text{ref}(t) - v(t)\):
\[ u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d\, \frac{de}{dt} \]In discrete time with timestep \(\Delta t\):
\[ u_k = K_p\, e_k + K_i\, \Delta t \sum_{j=0}^k e_j + K_d\, \frac{e_k - e_{k-1}}{\Delta t} \]Tuning: The proportional term provides immediate response; the integral term eliminates steady-state error (e.g., due to road grade); the derivative term damps oscillation.
Anti-windup: When the actuator saturates (maximum throttle), the integral can accumulate (“wind up”), causing overshoot when the constraint lifts. Anti-windup clamps the integral or back-calculates to prevent this.
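The discrete PID with output clamping and anti-windup can be sketched as a small class. This version uses conditional integration (pause the integrator while the output is saturated), one of several common anti-windup schemes; the gains and limits are illustrative.

```python
class PID:
    """Discrete PID controller with output clamping and integral
    anti-windup (integration pauses while the actuator is saturated)."""

    def __init__(self, kp, ki, kd, dt, u_min=-1.0, u_max=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Tentative output using the candidate (updated) integral.
        u = (self.kp * error
             + self.ki * (self.integral + error * self.dt)
             + self.kd * derivative)
        if self.u_min < u < self.u_max:
            self.integral += error * self.dt   # only integrate when unsaturated
        return min(max(u, self.u_min), self.u_max)
```

With a huge error the output clamps at `u_max` and the integral stays frozen, so there is no accumulated overshoot to unwind once the error shrinks.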
9.2.2 Adaptive Cruise Control (ACC)
ACC extends speed control to headway-keeping. The time-headway model maintains a desired gap \(d_\text{ref} = v\, \tau_h + d_0\) where \(\tau_h\) is the time headway (typically 1.5–2.5 s) and \(d_0\) is a minimum standstill gap. A cascade controller uses an outer loop on gap error to generate a speed reference, fed to the inner PID speed controller.
Chapter 10: Motion Planning
10.1 The Motion Planning Problem
Motion planning determines a trajectory \(\xi: [0, T] \to \mathcal{X}\) from the current state to a goal state that is:
- Collision-free — avoiding obstacles in the workspace
- Dynamically feasible — respecting vehicle kinematics and dynamics
- Comfortable — limiting jerk, lateral acceleration
- Rule-compliant — obeying traffic laws, speed limits, right-of-way
The configuration space \(\mathcal{C}\) represents all possible states (poses) of the vehicle. The free configuration space \(\mathcal{C}_\text{free} = \mathcal{C} \setminus \mathcal{C}_\text{obs}\) excludes states where the vehicle collides with obstacles.
10.2 Sampling-Based Motion Planning
10.2.1 Rapidly-Exploring Random Trees (RRT)
RRT grows a tree from the start state by repeatedly:
- Sample a random configuration \(\mathbf{x}_\text{rand} \in \mathcal{C}\)
- Find the nearest tree node \(\mathbf{x}_\text{near}\)
- Extend from \(\mathbf{x}_\text{near}\) toward \(\mathbf{x}_\text{rand}\) by step size \(\epsilon\), producing \(\mathbf{x}_\text{new}\)
- If the edge \([\mathbf{x}_\text{near}, \mathbf{x}_\text{new}]\) is collision-free, add \(\mathbf{x}_\text{new}\) to the tree
- Check if \(\mathbf{x}_\text{new}\) is within tolerance of the goal
RRT is probabilistically complete: if a solution exists, it will be found with probability 1 as the number of samples \(\to \infty\). However, RRT paths are typically jagged and suboptimal.
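The loop above fits in a short function. This minimal 2-D sketch makes several simplifying assumptions: a fixed \(10 \times 10\) workspace, a point-based collision check (the edge check is reduced to checking the new point), and a small goal-sampling bias, a common practical addition that speeds convergence.

```python
import math
import random

def rrt(start, goal, is_free, step=0.5, goal_tol=0.5,
        max_iters=5000, goal_bias=0.1, seed=0):
    """Minimal 2-D RRT; returns the start->goal path or None."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(max_iters):
        # Sample a random configuration (occasionally the goal itself).
        rand = goal if rng.random() < goal_bias else \
            (rng.uniform(0, 10), rng.uniform(0, 10))
        # Nearest tree node.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], rand))
        d = math.dist(nodes[i_near], rand)
        if d == 0.0:
            continue
        # Extend by at most `step` toward the sample.
        s = min(step, d)
        nx, ny = nodes[i_near]
        new = (nx + s * (rand[0] - nx) / d, ny + s * (rand[1] - ny) / d)
        if not is_free(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) < goal_tol:
            path, i = [], len(nodes) - 1     # walk parents back to the root
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```

In obstacle-free space the tree reaches the goal quickly; the returned path is the jagged, suboptimal kind the text describes, which is why smoothing (Section 10.2.3) follows.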
10.2.2 RRT* (Asymptotically Optimal RRT)
RRT* adds two steps after extending:
Near-neighbor rewiring:
- Find all nodes within a ball of radius \(r = \gamma(\log n / n)^{1/d}\) around \(\mathbf{x}_\text{new}\)
- Choose the parent that minimizes the total cost from the start
- Rewire neighbors through \(\mathbf{x}_\text{new}\) if it improves their cost
RRT* is asymptotically optimal: as \(n \to \infty\), the cost of the found path converges to the optimal cost. This makes it suitable for offline planning where time is available.
10.2.3 Path Smoothing
RRT paths are piecewise linear and jerky. Smoothing techniques include:
- Shortcutting: randomly sample two path points, check if direct connection is collision-free, replace if so.
- B-spline fitting: fit a smooth spline to the waypoints, enforce curvature constraints.
- Elastic band: represent path as a band that repels from obstacles while minimizing length.
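Shortcutting is the simplest of the three; a sketch follows, with the segment collision check done by sampling points at a fixed step (an assumption of this sketch, not a guarantee of safety between samples):

```python
import math
import random

def shortcut(path, collision_free_point, iters=200, seed=0):
    """Randomly try to replace sub-paths with straight segments."""
    rng = random.Random(seed)
    path = list(path)

    def segment_free(a, b, step=0.1):
        # sample the segment at fixed intervals and test each point
        n = max(1, int(math.dist(a, b) / step))
        return all(collision_free_point((a[0] + t * (b[0] - a[0]),
                                         a[1] + t * (b[1] - a[1])))
                   for t in (k / n for k in range(n + 1)))

    for _ in range(iters):
        if len(path) < 3:
            break
        # pick two non-adjacent waypoints and try a direct connection
        i, j = sorted(rng.sample(range(len(path)), 2))
        if j - i < 2:
            continue
        if segment_free(path[i], path[j]):
            path = path[:i + 1] + path[j:]   # drop intermediate waypoints
    return path
```

On an obstacle-free zig-zag path the loop collapses all intermediate waypoints, leaving only the two endpoints.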
10.3 Trajectory Planning
A trajectory specifies position as a function of time \(\mathbf{p}(t)\), unlike a path which is only geometric. Trajectory planning must satisfy both geometric constraints (obstacle avoidance) and temporal constraints (speed limits, arrival time).
10.3.1 Polynomial Trajectory Planning
For point-to-point motion in one dimension, fit a polynomial \(p(t) = c_0 + c_1 t + c_2 t^2 + \cdots + c_n t^n\) to boundary conditions (position, velocity, acceleration at start and end). The polynomial order \(n\) is chosen to match the number of boundary conditions.
Minimum-jerk trajectory (quintic, \(n=5\)): Given 6 boundary conditions \((p_0, \dot{p}_0, \ddot{p}_0, p_T, \dot{p}_T, \ddot{p}_T)\), the quintic polynomial minimizes:
\[ J = \int_0^T \left(\frac{d^3 p}{dt^3}\right)^2 dt \]
This produces a smooth profile that minimizes jerk (the derivative of acceleration), improving passenger comfort and reducing structural stress.
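The six boundary conditions determine the six quintic coefficients through a linear system; a sketch with NumPy:

```python
import numpy as np

def min_jerk_quintic(p0, v0, a0, pT, vT, aT, T):
    """Solve for quintic coefficients c0..c5 from six boundary conditions."""
    # rows: p(0), p'(0), p''(0), p(T), p'(T), p''(T)
    A = np.array([
        [1, 0, 0,    0,        0,         0],
        [0, 1, 0,    0,        0,         0],
        [0, 0, 2,    0,        0,         0],
        [1, T, T**2, T**3,     T**4,      T**5],
        [0, 1, 2*T,  3*T**2,   4*T**3,    5*T**4],
        [0, 0, 2,    6*T,      12*T**2,   20*T**3],
    ], dtype=float)
    b = np.array([p0, v0, a0, pT, vT, aT], dtype=float)
    return np.linalg.solve(A, b)

def eval_poly(c, t):
    """Evaluate p(t) = c0 + c1 t + ... + c5 t^5."""
    return sum(ci * t**i for i, ci in enumerate(c))
```

For a rest-to-rest move (`min_jerk_quintic(0, 0, 0, 1, 0, 0, 2.0)`) the result is the classic symmetric S-curve that passes through the midpoint at half the displacement.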
Minimum-snap trajectory (septic, \(n=7\)): Used in quadrotor motion planning, it minimizes \(\int_0^T (\ddddot{p}\,)^2 dt\), where snap is the 4th derivative of position. For ground vehicles, minimum-jerk suffices.
For multi-segment trajectories (sequences of waypoints), the polynomial is fit per segment with continuity constraints (\(C^2\) or \(C^3\)) at junction points. The full problem becomes a quadratic program (QP):
\[ \min_{\mathbf{c}} \mathbf{c}^T Q\, \mathbf{c} \quad \text{subject to} \quad A\, \mathbf{c} = \mathbf{b} \]
where \(Q\) encodes the integral of squared derivatives and \(A\mathbf{c} = \mathbf{b}\) enforces boundary and continuity conditions.
10.3.2 Frenet Frame Trajectory Planning
Highway driving is naturally described in the Frenet frame \((s, d)\) where \(s\) is the arc-length along the reference path (road centerline) and \(d\) is the lateral offset. Trajectory planning decomposes into:
- Longitudinal: design \(s(t)\) to match desired speed and maintain headway
- Lateral: design \(d(t)\) to smoothly change lanes or center within lane
Candidate trajectories are generated by sampling target states at a planning horizon \(T\) and fitting quintic polynomials. A cost function balances comfort (jerk), proximity to obstacles, deviation from the centerline, and speed. The lowest-cost collision-free trajectory is selected and executed over a short receding horizon.
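The lateral half of this scheme can be sketched as follows. The candidate offsets, horizon, and cost weights are illustrative, obstacle terms are omitted, and the integrals are approximated by Riemann sums:

```python
import numpy as np

def quintic(d0, d_dot0, d_ddot0, dT, d_dotT, d_ddotT, T):
    """Quintic coefficients for lateral offset d(t) from six boundary conditions."""
    A = np.array([[1, 0, 0,    0,       0,        0],
                  [0, 1, 0,    0,       0,        0],
                  [0, 0, 2,    0,       0,        0],
                  [1, T, T**2, T**3,    T**4,     T**5],
                  [0, 1, 2*T,  3*T**2,  4*T**3,   5*T**4],
                  [0, 0, 2,    6*T,     12*T**2,  20*T**3]], dtype=float)
    b = np.array([d0, d_dot0, d_ddot0, dT, d_dotT, d_ddotT], dtype=float)
    return np.linalg.solve(A, b)

def best_lateral(d0, d_dot0, d_ddot0, T=3.0, w_jerk=0.1, w_dev=1.0):
    """Sample terminal lateral offsets, pick the lowest-cost quintic d(t)."""
    ts = np.linspace(0.0, T, 50)
    dt = ts[1] - ts[0]
    best = None
    for d_target in np.linspace(-3.0, 3.0, 13):   # candidate end offsets [m]
        c = quintic(d0, d_dot0, d_ddot0, d_target, 0.0, 0.0, T)
        # third derivative of the quintic: 6 c3 + 24 c4 t + 60 c5 t^2
        jerk = 6 * c[3] + 24 * c[4] * ts + 60 * c[5] * ts**2
        d = sum(c[i] * ts**i for i in range(6))
        cost = w_jerk * np.sum(jerk**2) * dt + w_dev * np.sum(d**2) * dt
        if best is None or cost < best[0]:
            best = (cost, d_target, c)
    return best
```

Starting 1 m off-center with these weights, the selected candidate is the one that returns to the centerline (\(d_T = 0\)), since staying offset accumulates deviation cost faster than the maneuver accumulates jerk cost.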
Chapter 11: System Integration and Safety
11.1 Sensor Fusion Architecture
A production ADS relies on multiple heterogeneous sensors whose data must be fused into a coherent world model:
- LiDAR (light detection and ranging): provides dense 3-D point clouds at ~10 Hz. Accurate range (~2 cm), but no color/texture, sparse at distance, degraded in heavy rain/snow.
- Camera: dense color information, high resolution, cheap. Range must be inferred from geometry. Deep learning excels here.
- Radar: long range (~200 m), measures radial velocity directly (Doppler), all-weather. Low angular resolution, sparse returns.
- GPS/IMU: global position at low frequency (GPS) combined with high-rate inertial measurements (IMU). Dead reckoning between GPS fixes.
Early vs. late fusion: Early fusion (combine raw sensor data before processing) preserves complementarity but requires tight calibration and time synchronization. Late fusion (each sensor produces detections, then fuse detections) is more modular but loses some information. Mid-level fusion (fuse extracted features) is increasingly popular in deep architectures.
11.2 HD Map and Localization
High-definition (HD) maps encode road geometry at centimeter-level precision: lane boundaries, traffic signs, traffic lights, speed limits. Combined with LiDAR point cloud matching (iterative closest point, ICP, or Monte Carlo localization), the vehicle achieves centimeter-level localization within the map.
Map-based localization workflow:
- Pre-drive the route to build an HD map (LiDAR scan + GPS).
- At runtime, align the current LiDAR scan to the stored map using point cloud matching.
- Fuse the scan-match pose with IMU dead-reckoning in an EKF or factor graph.
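Step 2 of this workflow, scan-to-map alignment, can be sketched as a minimal 2-D ICP with brute-force nearest-neighbor correspondences and an SVD (Kabsch) solve per iteration. Production systems work in 3-D and add k-d trees, downsampling, and outlier rejection; this is a bare sketch of the principle:

```python
import numpy as np

def icp_2d(src, dst, iters=20):
    """Align point set src to dst; returns (R, t) with dst ~= R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    R, t = np.eye(2), np.zeros(2)
    for _ in range(iters):
        cur = src @ R.T + t
        # nearest neighbor in dst for each transformed src point (brute force)
        d2 = ((cur[:, None, :] - dst[None, :, :])**2).sum(axis=-1)
        match = dst[d2.argmin(axis=1)]
        # best rigid transform via SVD (Kabsch), with reflection guard
        mu_s, mu_m = cur.mean(axis=0), match.mean(axis=0)
        H = (cur - mu_s).T @ (match - mu_m)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        dR = Vt.T @ D @ U.T
        dt = mu_m - dR @ mu_s
        # compose the incremental transform with the running estimate
        R, t = dR @ R, dR @ t + dt
    return R, t
```

When the initial pose error is small relative to the point spacing (the situation IMU dead-reckoning is meant to guarantee), the nearest-neighbor correspondences are correct and the alignment converges in a single step.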
11.3 Safety and Redundancy
A fundamental principle is defense in depth: no single system failure should cause a safety-critical incident.
- Redundant sensing: e.g., two independent perception stacks using different algorithms; primary and backup GPS.
- Fail-safe fallback: if the ADS fails, transition to a minimal risk condition (MRC) — slow down and stop safely.
- ISO 26262 ASIL: Automotive Safety Integrity Levels classify the rigor of safety analysis required. Most ADS functions require ASIL-C or ASIL-D (highest).
- Formal verification and simulation: before deployment, billions of simulated miles are driven to cover corner cases (e.g., adverse weather, unusual road markings).
Chapter 12: Deep Learning for Perception — Advanced Topics
12.1 Transfer Learning and Domain Adaptation
Training deep networks from scratch requires enormous labeled datasets. Transfer learning fine-tunes networks pre-trained on large datasets (e.g., ImageNet) for AV perception tasks. The early layers (edge detectors, texture detectors) transfer well; later task-specific layers are re-trained.
Domain adaptation addresses the gap between training conditions (e.g., sunny California) and deployment conditions (e.g., snowy Canada). Techniques include data augmentation (synthetic rain/fog), domain randomization, and adversarial domain adaptation.
12.2 3-D Object Detection
For safety, 2-D bounding boxes in image space are insufficient — the 3-D position and dimensions of surrounding vehicles are needed. Methods:
- LiDAR-based: PointNet directly processes raw point clouds; VoxelNet voxelizes and applies 3-D convolutions; PointPillars uses vertical pillars for efficient BEV (bird’s-eye-view) detection.
- Fusion-based: AVOD, CLOCs, and similar methods fuse camera and LiDAR features for improved accuracy.
- Camera-only depth estimation: Mono3D, FCOS3D, and BEVDet use neural networks to infer 3-D from monocular images, leveraging learned priors about object sizes.
12.3 End-to-End Learning
End-to-end approaches (e.g., NVIDIA's DAVE-2 system, Wayve) train a single neural network from sensor inputs directly to control outputs (steering, throttle), bypassing modular decomposition. Advantages: no hand-designed intermediate representations; the ability to leverage more data. Disadvantages: lack of interpretability, difficulty verifying safety, and distribution shift that is hard to diagnose.
Imitation learning (behavioral cloning) trains on human-driver demonstrations. Reinforcement learning refines the policy through interaction with a simulator. In practice, production systems use hybrid architectures combining learned perception with classical planning and control.
Chapter 13: Summary and Key Formulas
13.1 Core Equations Reference
Image convolution:
\[ (I * k)[x,y] = \sum_{u,v} I[x-u, y-v]\, k[u,v] \]
Harris response:
\[ R = \det(M) - \kappa\, [\mathrm{tr}(M)]^2 \]
Logistic regression:
\[ \hat{p} = \sigma(\mathbf{w}^T\mathbf{x}+b), \quad \sigma(z) = \frac{1}{1+e^{-z}} \]
Backpropagation:
\[ \boldsymbol{\delta}^{(l)} = \left( (W^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \right) \odot \phi'(\mathbf{z}^{(l)}) \]
Bayes filter prediction:
\[ \overline{\mathrm{bel}}(\mathbf{x}_t) = \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1},\mathbf{u}_t)\,\mathrm{bel}(\mathbf{x}_{t-1})\,d\mathbf{x}_{t-1} \]
Bayes filter update:
\[ \mathrm{bel}(\mathbf{x}_t) = \eta\, p(\mathbf{z}_t \mid \mathbf{x}_t)\, \overline{\mathrm{bel}}(\mathbf{x}_t) \]
Occupancy grid log-odds update (inverse sensor model form):
\[ l_t = l_{t-1} + \log\frac{p(m=1 \mid z_t)}{1 - p(m=1 \mid z_t)} - \log\frac{p_0}{1-p_0} \]
Kalman gain:
\[ K_t = \bar{\Sigma}_t H^T (H\bar{\Sigma}_t H^T + R)^{-1} \]
Kinematic bicycle model:
\[ \dot{\psi} = \frac{v}{L}\tan\delta, \quad \kappa = \frac{\tan\delta}{L} \]
Stanley controller:
\[ \delta = \psi_e + \arctan\!\left(\frac{k_e e}{v}\right) \]
PID controller:
\[ u(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \frac{de}{dt} \]
13.2 Conceptual Connections
The architecture of a complete ADS follows the sense–plan–act cycle studied throughout the course:
- Sensing: cameras, LiDAR, radar produce raw data
- Perception (Chapters 2–5): CV algorithms and deep CNNs extract objects, lanes, free space from raw data
- State estimation (Chapters 6–7): Bayes filter, occupancy grids, Kalman/EKF fuse observations into a dynamic world model
- Localization: EKF fuses HD map matching with IMU to estimate vehicle pose
- Prediction: ML-based models forecast agent future states
- Planning (Chapter 10): RRT*, polynomial trajectories compute a feasible, safe trajectory
- Control (Chapter 9): Stanley/MPC lateral + PID longitudinal controllers track the trajectory
- Vehicle actuation: steering, throttle, brake respond to control commands
Each module relies on probabilistic reasoning about uncertainty — the unifying theme of the course. Bayes’ theorem connects perception (sensor likelihood), state estimation (belief update), and planning (risk-aware trajectory selection) into a coherent mathematical framework for building safe autonomous vehicles.