ECE 457C: Reinforcement Learning

Mark Crowley

Estimated study time: 5 minutes


Sources and References

Equivalent UW courses — CS 486/686 (Introduction to Artificial Intelligence), CS 885 (Reinforcement Learning)

Primary textbook — Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018

Supplementary references — Dimitrakakis and Ortner, Decision Making Under Uncertainty and Reinforcement Learning; Ghavamzadeh et al., Bayesian Reinforcement Learning: A Survey, 2016

Equivalent UW Courses

ECE 457C lines up most closely with CS 885 (Pascal Poupart’s reinforcement-learning course) — both work through the Sutton and Barto textbook and cover MDPs, dynamic programming, TD methods, function approximation, and deep RL. CS 486/686 is the cross-listed undergrad/grad intro-to-AI course (one course, two numbers) and touches MDPs and basic RL in one or two lectures inside a much broader survey of search, logic, probability, Bayes nets, and ML, so it overlaps ECE 457C only at the MDP/RL segment. In practice ECE 457C gives engineering undergraduates a CS 885-style experience tuned for an undergraduate workload.

What This Course Adds Beyond the Equivalents

Compared with CS 486/686, ECE 457C goes much deeper into RL specifically — it is not a broad AI survey, and the assignments force students to implement SARSA, Q-learning, DQN, and policy-gradient methods in Python rather than just discussing them. The course also explicitly makes the connection to control theory, which is natural for ECE students and less emphasized in the CS equivalents. What it omits relative to CS 885 is the research-paper reading component and some of the more theoretical material (PAC bounds, convergence proofs in detail, bandit regret analysis), and unlike CS 486/686 it does not cover logic, knowledge representation, or classical search.

Topic Summary

Motivation and Probability Review

Why reasoning under uncertainty matters; the place of RL inside machine learning and AI; a quick refresher on probability, expectation, and conditional distributions. Sets up the vocabulary for everything that follows.

Multi-Armed Bandits

Exploration versus exploitation in the simplest non-trivial setting, with epsilon-greedy, UCB, and Thompson sampling. Bandits are MDPs without state, and the regret framework introduced here recurs throughout RL.
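As a sketch of the trade-off, here is a minimal epsilon-greedy bandit loop; the `epsilon_greedy_bandit` helper, the Gaussian reward model, and the arm means are illustrative assumptions, not course code.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy on a k-armed bandit with Gaussian rewards (illustrative)."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k          # estimated value of each arm
    N = [0] * k            # number of pulls of each arm
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit: greedy arm
        r = rng.gauss(true_means[a], 1.0)           # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # incremental sample mean
    return Q, N
```

With a small epsilon the greedy arm dominates the pull counts while every arm keeps a nonzero chance of being re-examined, which is exactly the exploration/exploitation tension the lectures formalize with regret.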

Markov Decision Processes

An MDP is the tuple \( (S, A, P, R, \gamma) \) with transition probabilities \( P(s' \mid s, a) \) and reward function \( R(s, a) \). The discounted return and the value function \( V^{\pi}(s) \) are the central objects of study.

Solving MDPs: Bellman Equations and DP

The Bellman expectation and optimality equations express \( V^{*} \) as a fixed point, and value iteration and policy iteration are the two canonical dynamic-programming solvers. Connection to control theory (LQR, Bellman’s principle of optimality) is made explicit.
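Value iteration itself fits in a few lines; the dictionary encoding of \( P \) and \( R \) below is an assumed convention for illustration, not the course's code.

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-8):
    """Value iteration (illustrative sketch).

    P[s][a] is a list of (prob, s_next) pairs; R[s][a] is the expected reward.
    Iterates the Bellman optimality backup until the largest update is < tol.
    """
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

On a two-state chain where `s1` loops on itself with reward 1 and discount 0.9, the fixed point is \( V^*(s_1) = 1/(1-\gamma) = 10 \), which the loop recovers numerically.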

SARSA and Q-Learning

Model-free TD control: SARSA is on-policy, Q-learning is off-policy, and both update the action-value estimate from single transitions. Q-learning bootstraps from the greedy successor action,

\[ Q(s, a) \leftarrow Q(s, a) + \alpha\bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr] \]

while SARSA replaces \( \max_{a'} Q(s', a') \) with \( Q(s', a') \) for the action the policy actually takes next. These are the workhorses of tabular RL.
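Writing the two update rules side by side makes the on-policy/off-policy contrast concrete; the dictionary-keyed Q-table below is an illustrative choice, not the course's implementation.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a_next actually taken in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The only difference is the bootstrap target: the `max` over actions versus the sampled next action, which is the whole off-policy/on-policy distinction in the tabular case.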

Temporal Difference Learning with Eligibility Traces

TD(\(\lambda\)) and eligibility traces interpolate between one-step TD and full Monte Carlo returns, trading bias against variance. Gives a unified view of the return-estimation spectrum.
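A backward-view sketch with accumulating traces shows how one TD error updates every recently visited state; the episode encoding and `td_lambda_episode` helper are hypothetical, for illustration only.

```python
def td_lambda_episode(episode, V, alpha=0.1, gamma=0.99, lam=0.8):
    """Backward-view TD(lambda) with accumulating eligibility traces (sketch).

    episode: list of (s, r, s_next, done) transitions in order.
    """
    e = {}  # eligibility trace per visited state
    for s, r, s_next, done in episode:
        # one-step TD error for this transition
        delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0          # bump trace for the visited state
        for x in list(e):
            V[x] = V.get(x, 0.0) + alpha * delta * e[x]
            e[x] *= gamma * lam             # decay all traces
    return V
```

Setting `lam=0` recovers one-step TD (only the current state has a nonzero trace), and `lam=1` pushes the update toward the Monte Carlo return, which is the bias/variance spectrum described above.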

State Representation and Value Function Approximation

When \( S \) is too large to tabulate, \( V \) and \( Q \) are approximated by parametric functions trained with stochastic gradient descent. Linear features first, then neural networks.
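The simplest instance is semi-gradient TD(0) with linear features, where the gradient of \( \hat V(s) = w^\top \phi(s) \) is just \( \phi(s) \); the feature-vector encoding below is assumed for illustration.

```python
def linear_vfa_td0_update(w, phi_s, r, phi_s_next, alpha=0.01, gamma=0.99,
                          done=False):
    """One semi-gradient TD(0) step for V(s) = w . phi(s) (sketch)."""
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, phi_s_next))
    delta = r + gamma * v_next - v_s        # TD error under current weights
    # semi-gradient: treat the bootstrap target as a constant
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi_s)]
```

Swapping the linear features for a neural network gives exactly the deep-RL setting covered next; only the gradient computation changes.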

Neural Network Basics

Fully connected MLPs, backpropagation, CNNs and LSTMs, and regularization. Reviewed only enough to support deep-RL methods later.

Deep Q-Networks

DQN replaces the Q-table with a neural network and stabilises training with experience replay and a target network. The Atari results that launched modern deep RL are the motivating example.


Policy Gradients and Actor-Critic

REINFORCE estimates \( \nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a\mid s)\, Q^{\pi}(s,a)] \), and actor-critic methods reduce variance by learning a value baseline. Opens the door to continuous-action RL.
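A stateless softmax policy makes the score-function update easy to see: \( \nabla_\theta \log \pi_\theta(a) = \mathbf{1}_a - \pi_\theta \). Everything below is an illustrative sketch, not course code.

```python
import math
import random

def softmax_probs(theta):
    """Softmax over per-action preferences theta (stateless policy for brevity)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, return_fn, alpha=0.1, rng=random):
    """One REINFORCE update: sample an action, observe a return G,
    then ascend alpha * G * grad log pi(a)."""
    probs = softmax_probs(theta)
    a = rng.choices(range(len(theta)), weights=probs)[0]
    G = return_fn(a)
    # grad of log softmax at action a is one-hot(a) - probs
    theta = [t + alpha * G * ((1.0 if i == a else 0.0) - probs[i])
             for i, t in enumerate(theta)]
    return theta, a, G
```

Repeating this step shifts probability mass toward high-return actions; an actor-critic replaces the sampled return `G` with a learned advantage estimate to cut variance.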

Advanced Algorithm Variants

A3C, A2C, DPG, DDPG, TRPO, PPO, SAC, and decision transformers as points in the design space of modern deep RL. Students see why there is no single winner.

Monte Carlo Tree Search

MCTS as an anytime planning algorithm for (PO)MDPs, and the AlphaGo/AlphaZero architecture that combines MCTS with learned value and policy networks. Ties planning and learning together.

Human Preference and Moral Reasoning

RLHF for large language models and the use of RL concepts in theories of human moral cognition. A short look at where the field is going beyond traditional control problems.
