CO 466: Continuous Optimization
Levent Tunçel
Based on lecture notes by Sibelius Peng
Sources and References
Primary textbook — Nocedal, J. & Wright, S. J. Numerical Optimization, 2nd ed. Springer, 2006.
Supplementary texts — Boyd, S. & Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004 (freely available at stanford.edu/~boyd/cvxbook/); Bertsekas, D. P. Nonlinear Programming, 3rd ed. Athena Scientific, 2016; Nesterov, Y. Lectures on Convex Optimization, 2nd ed. Springer, 2018.
Online resources — Stanford EE364A Convex Optimization I (Boyd — full lectures, notes, and problems freely available); MIT 6.253 Convex Analysis and Optimization (Bertsekas, OCW); Vandenberghe’s UCLA EE236B lecture notes; Nesterov’s CORE lecture notes on optimal first-order methods.
Chapter 1: Formulations and Foundations
The Continuous Optimization Framework
A continuous optimization problem asks for the minimum (or maximum) of an objective function over a feasible set defined by constraints. In its most general form:
\[ \inf_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) \leq 0,\; i = 1,\ldots,m,\quad h_j(x) = 0,\; j = 1,\ldots,p, \]where \(f \colon \mathbb{R}^n \to \mathbb{R}\) is the objective, \(g_i\) are inequality constraints, and \(h_j\) are equality constraints. The feasible set is \(S = \{x \in \mathbb{R}^n : g(x) \leq 0,\; h(x) = 0\}\).
A point \(\bar{x} \in S\) is a global minimizer if \(f(x) \geq f(\bar{x})\) for all \(x \in S\). It is a local minimizer if the inequality holds in some neighborhood of \(\bar{x}\) intersected with \(S\); a strict local minimizer if \(f(x) > f(\bar{x})\) for all nearby \(x \neq \bar{x}\) in \(S\). Finding global minimizers is in general intractable (NP-hard), but local minimizers are accessible through differential calculus and iterative algorithms.
The breadth of continuous optimization is staggering. Discrete structures can be encoded in continuous form: integer programming (the condition \(x \in \{0,1\}^n\) is equivalent to \(x_j(x_j-1) = 0\) for each \(j\)) becomes a mildly nonlinear continuous program. Even Fermat’s Last Theorem can be encoded as a continuous optimization problem — with optimal value zero if and only if FLT is true — demonstrating that continuous optimization problems with only a handful of variables can be as hard as the deepest questions in mathematics.
Convex Sets and Cones
The most tractable class of continuous optimization problems is convex optimization, where both the feasible set and the objective function are convex. A set \(C \subseteq \mathbb{R}^n\) is convex if for all \(x, y \in C\) and \(\theta \in [0,1]\), the point \(\theta x + (1-\theta)y \in C\) — any two points in \(C\) can be connected by a line segment lying entirely within \(C\). Important convex sets include halfspaces, polyhedra, balls, ellipsoids, the positive semidefinite cone, and the second-order (Lorentz) cone.
A set \(K \subseteq \mathbb{R}^n\) is a cone if \(\lambda x \in K\) for all \(x \in K\) and all \(\lambda \geq 0\). The non-negative orthant \(\mathbb{R}^n_+\), the positive semidefinite cone \(\mathbb{S}^n_+\), and the second-order cone \(\{(x, t) : \|x\| \leq t\}\) are fundamental cones. Conic optimization — minimizing a linear objective over the intersection of a cone and an affine subspace — encompasses linear programming (\(\mathbb{R}^n_+\)), second-order cone programming (SOCP), and semidefinite programming (SDP) as special cases.
Derivatives and Optimality Conditions
For smooth \(f \colon \mathbb{R}^n \to \mathbb{R}\), the gradient \(\nabla f(x) \in \mathbb{R}^n\) and the Hessian \(\nabla^2 f(x) \in \mathbb{R}^{n \times n}\) are the first and second derivatives. The directional derivative in direction \(d\) is \(f'(x; d) = \nabla f(x)^T d\). A necessary condition for \(x\) to be a local minimizer of an unconstrained smooth problem is \(\nabla f(x) = 0\) (stationarity). Second-order conditions refine this: if \(\nabla f(x) = 0\) and \(\nabla^2 f(x) \succ 0\) (positive definite), then \(x\) is a strict local minimizer.
A fundamental fixed-point perspective: smooth algorithms often reduce to finding fixed points of maps. The gradient-based iteration \(x_{k+1} = x_k - \alpha_k \nabla f(x_k)\) defines a discrete dynamical system whose fixed points are stationary points of \(f\). The interplay between fixed-point theory, contractions, and optimization algorithm convergence is a unifying thread through the subject.
Chapter 2: Unconstrained Optimization
Gradient Descent and Descent Algorithms
A descent algorithm generates a sequence \(x_0, x_1, x_2, \ldots\) satisfying \(f(x_{k+1}) < f(x_k)\) at each step. The canonical form is \(x_{k+1} = x_k + \alpha_k d_k\), where \(d_k\) is a descent direction (\(\nabla f(x_k)^T d_k < 0\)) and \(\alpha_k > 0\) is the step size (or learning rate). The simplest choice is gradient descent: \(d_k = -\nabla f(x_k)\).
The convergence rate of gradient descent depends critically on the curvature of \(f\). If \(f\) is smooth with \(L\)-Lipschitz gradient (\(\|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\|\)), then with step size \(\alpha = 1/L\), gradient descent satisfies:
\[ f(x_k) - f^* \leq \frac{L\|x_0 - x^*\|^2}{2k}, \]a convergence rate of \(O(1/k)\). If \(f\) is also \(\mu\)-strongly convex (\(\nabla^2 f(x) \succeq \mu I\) everywhere), then the rate improves to linear (geometric):
\[ \|x_{k+1} - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)\|x_k - x^*\|^2. \]The ratio \(\kappa = L/\mu\) is the condition number; large \(\kappa\) leads to slow convergence.
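To make the rates concrete, here is a minimal numpy sketch on an assumed diagonal quadratic with \(\mu = 1\) and \(L = 10\), where the contraction factor \(1 - \mu/L = 0.9\) is visible directly:

```python
import numpy as np

def gradient_descent(grad, x0, step, iters):
    """Fixed-step gradient descent: x_{k+1} = x_k - step * grad(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Quadratic f(x) = 0.5 x^T A x with Hessian A = diag(1, 10): mu = 1, L = 10.
A = np.diag([1.0, 10.0])
grad = lambda v: A @ v

# Step 1/L shrinks the error by at most (1 - mu/L) = 0.9 per iteration.
x = gradient_descent(grad, np.array([1.0, 1.0]), step=0.1, iters=200)
```

After 200 iterations the slow coordinate has contracted by \(0.9^{200} \approx 10^{-9}\), matching the linear-rate bound.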
Newton’s Method and Quasi-Newton Approaches
Newton’s method uses second-order information to update \(x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)\). Near a strict local minimizer where the Hessian is positive definite, Newton’s method converges quadratically: \(\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2)\). The quadratic convergence makes Newton’s method extremely fast near the solution — but each step requires forming and factoring the \(n \times n\) Hessian, costing \(O(n^3)\) operations.
Quasi-Newton methods approximate the Hessian (or its inverse) using only gradient information, achieving superlinear convergence at much lower per-iteration cost. The BFGS method (Broyden-Fletcher-Goldfarb-Shanno, 1970) maintains a positive definite approximation \(B_k \approx \nabla^2 f(x_k)\) updated by the rank-2 formula:
\[ B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}, \]where \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\). The BFGS update satisfies the secant condition \(B_{k+1}s_k = y_k\) (matching the change in gradient), and preserves positive definiteness when \(y_k^T s_k > 0\) (guaranteed by line searches that enforce the Wolfe curvature condition). The L-BFGS (limited-memory BFGS) variant stores only the \(m\) most recent update pairs rather than the full \(n \times n\) matrix, making it efficient for large-scale problems.
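Both properties can be checked numerically on a hypothetical \(2 \times 2\) example (for a quadratic model, the gradient change is exactly \(y = As\)):

```python
import numpy as np

def bfgs_update(B, s, y):
    """Rank-2 BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # true Hessian of a quadratic model
B = np.eye(2)                            # initial approximation
s = np.array([1.0, -0.5])                # step s_k = x_{k+1} - x_k
y = A @ s                                # gradient change y_k (exact for a quadratic)
B1 = bfgs_update(B, s, y)                # satisfies B1 @ s == y, stays positive definite
```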
The convergence analysis of quasi-Newton methods connects to the Dennis-Moré characterization: superlinear convergence occurs if and only if the Hessian approximation converges to the true Hessian along the solution directions, \(\|(B_k - \nabla^2 f(x^*))(x_{k+1}-x_k)\|/\|x_{k+1}-x_k\| \to 0\).
Conjugate Gradient Methods
For large-scale unconstrained problems with quadratic structure, the conjugate gradient (CG) method is often the method of choice. For the quadratic objective \(f(x) = \frac{1}{2}x^TAx - b^Tx\) (where \(A \succ 0\)), CG generates search directions \(d_k\) that are \(A\)-conjugate: \(d_i^T A d_j = 0\) for \(i \neq j\). This guarantees that CG minimizes \(f\) over the Krylov subspace \(\mathcal{K}_k = \mathrm{span}\{d_0, \ldots, d_{k-1}\}\), and converges in at most \(n\) steps in exact arithmetic. For non-quadratic objectives (nonlinear CG), the direction update formula generalizes, but convergence is only guaranteed under additional conditions.
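A minimal sketch on an assumed \(3 \times 3\) positive definite system, where \(n = 3\) iterations suffice in exact arithmetic:

```python
import numpy as np

def conjugate_gradient(A, b, x0, iters):
    """CG for min 0.5 x^T A x - b^T x, i.e. solving Ax = b with A symmetric PD."""
    x = x0.copy()
    r = b - A @ x            # residual (negative gradient)
    d = r.copy()             # initial search direction
    for _ in range(iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # coefficient keeping directions A-conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b, np.zeros(3), iters=3)  # n = 3 steps suffice
```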
Preconditioning accelerates convergence by transforming the problem to reduce the condition number: \(\kappa(M^{-1/2}AM^{-1/2})\) should be small. The preconditioned CG (PCG) method applies CG to the transformed system. Choosing an effective preconditioner (incomplete Cholesky, multigrid, domain decomposition) is often the most important practical decision in large-scale linear algebra.
Chapter 3: Constrained Optimization and KKT Theory
First-Order Optimality: KKT Conditions
For the general constrained problem, the Lagrangian is:
\[ \mathcal{L}(x, \lambda, \nu) = f(x) + \lambda^T g(x) + \nu^T h(x), \]where \(\lambda \in \mathbb{R}^m_+\) are Lagrange multipliers for the inequality constraints and \(\nu \in \mathbb{R}^p\) for the equality constraints. The Karush-Kuhn-Tucker (KKT) conditions are necessary conditions for local optimality at \(\bar{x}\) under a constraint qualification (CQ):
- Stationarity: \(\nabla f(\bar{x}) + \sum_i \lambda_i \nabla g_i(\bar{x}) + \sum_j \nu_j \nabla h_j(\bar{x}) = 0\)
- Primal feasibility: \(g(\bar{x}) \leq 0\), \(h(\bar{x}) = 0\)
- Dual feasibility: \(\lambda \geq 0\)
- Complementary slackness: \(\lambda_i g_i(\bar{x}) = 0\) for all \(i\)
The complementary slackness condition says that \(\lambda_i > 0\) only if the corresponding constraint \(g_i(\bar{x}) = 0\) is active (binding). For convex problems, KKT conditions are both necessary and sufficient for global optimality. The most common constraint qualifications are LICQ (linear independence of active constraint gradients) and the weaker MFCQ (Mangasarian-Fromovitz). The Abadie CQ is the weakest sufficient CQ: the tangent cone equals the linearized feasible cone.
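To make the four conditions concrete, here is a check at a known optimum of a toy problem (min \(\|x\|^2\) subject to \(x_1 + x_2 \geq 1\); the candidate point and multiplier are assumed known for the illustration):

```python
import numpy as np

# Toy problem: min f(x) = x1^2 + x2^2  subject to  g(x) = 1 - x1 - x2 <= 0.
# Candidate solution and multiplier (assumed known for this illustration):
x_bar = np.array([0.5, 0.5])
lam = 1.0

grad_f = 2.0 * x_bar                   # gradient of the objective
grad_g = np.array([-1.0, -1.0])        # gradient of the constraint
g_val = 1.0 - x_bar[0] - x_bar[1]      # constraint value (active: g = 0)

# Stationarity: grad f + lam * grad g should vanish at the optimum.
stationarity = grad_f + lam * grad_g
```

Here the constraint is active (\(g = 0\)) with \(\lambda = 1 > 0\), so complementary slackness holds with a binding constraint, exactly as the text describes.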
Second-Order Conditions
The second-order sufficient conditions (SOSC) for a local minimizer \(\bar{x}\) satisfying the KKT conditions with multipliers \((\bar{\lambda}, \bar{\nu})\) require the reduced Hessian to be positive definite on the critical cone:
\[ d^T \nabla^2_{xx}\mathcal{L}(\bar{x}, \bar{\lambda}, \bar{\nu})d > 0 \quad \text{for all } d \neq 0 \text{ in the critical cone}. \]The critical cone consists of directions that are tangent to the active constraints and along which the Lagrangian gradient vanishes. SOSC imply that \(\bar{x}\) is a strict local minimizer and that the problem is locally well-posed (perturbations cause smooth changes in the solution).
Strict complementarity — \(\bar{\lambda}_i - g_i(\bar{x}) > 0\) for all \(i\), meaning each constraint is either strictly inactive or has a strictly positive multiplier — simplifies the second-order analysis and is typically assumed for theoretical results. When it fails, the critical cone is larger and the behavior is more complex.
Augmented Lagrangians and ADMM
The augmented Lagrangian for equality constraints \(h(x) = 0\) is:
\[ \mathcal{L}_\rho(x, \nu) = f(x) + \nu^T h(x) + \frac{\rho}{2}\|h(x)\|^2, \]where \(\rho > 0\) is a penalty parameter. Unlike the classical penalty method (which sets \(\nu = 0\)), the augmented Lagrangian maintains an explicit dual variable \(\nu\). The method of multipliers alternates between minimizing \(\mathcal{L}_\rho(\cdot, \nu_k)\) over \(x\) and updating \(\nu_{k+1} = \nu_k + \rho h(x_{k+1})\). This converges under mild conditions, and unlike pure penalty methods, does not require \(\rho \to \infty\) — a finite \(\rho\) suffices.
The Alternating Direction Method of Multipliers (ADMM) extends the method of multipliers to the separable problem \(\min f(x) + g(z)\) subject to \(Ax + Bz = c\). ADMM alternates:
- \(x_{k+1} = \arg\min_x \mathcal{L}_\rho(x, z_k, \nu_k)\)
- \(z_{k+1} = \arg\min_z \mathcal{L}_\rho(x_{k+1}, z, \nu_k)\)
- \(\nu_{k+1} = \nu_k + \rho(Ax_{k+1} + Bz_{k+1} - c)\)
ADMM exploits problem separability, enabling decomposition over distributed systems or block structure. It has become hugely popular in machine learning and signal processing (lasso, basis pursuit, distributed learning) due to its simplicity and robustness.
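A small sketch of the three updates on an assumed instance with \(A = I\), \(B = -I\), \(c = 0\) (using the scaled dual variable \(u = \nu/\rho\)); the exact solution of this instance is the \(\ell_1\) proximal operator of \(a\):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_l1(a, lam, rho=1.0, iters=100):
    """ADMM (scaled dual u = nu/rho) for min 0.5||x-a||^2 + lam||z||_1, s.t. x = z."""
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    u = np.zeros_like(a)
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)   # x-update: strongly convex quadratic
        z = soft_threshold(x + u, lam / rho)    # z-update: l1 proximal operator
        u = u + (x - z)                         # dual update on the residual x - z
    return x, z

a = np.array([3.0, -0.5, 1.5])
x, z = admm_l1(a, lam=1.0)
# The exact solution is soft_threshold(a, 1.0) = [2.0, 0.0, 0.5].
```

Each subproblem is solved in closed form, which is exactly the separability that makes ADMM attractive for lasso-type problems.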
Proximal Point Methods
The proximal point algorithm is a fundamental algorithmic template: given \(x_k\), compute
\[ x_{k+1} = \arg\min_x \left\{f(x) + \frac{1}{2\alpha_k}\|x - x_k\|^2\right\} =: \mathrm{prox}_{\alpha_k f}(x_k). \]The proximal operator \(\mathrm{prox}_{\alpha f}\) has a closed form for many functions (for \(f = \|\cdot\|_1\), it is soft thresholding; for an indicator function, it is projection). The proximal point algorithm converges for any convex \(f\) under mild conditions; its per-iteration cost depends entirely on solving the proximal subproblem. ADMM can be viewed as a proximal point algorithm applied to the dual problem.
The closest-point map (projection onto a convex set \(C\), \(P_C(x) = \arg\min_{y \in C} \|x - y\|^2\)) is the special case of the proximal operator for the indicator function of \(C\). Projection algorithms — alternating projections, Dykstra’s algorithm — solve feasibility problems and are closely connected to ADMM.
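A sketch of alternating projections on two assumed sets (the unit ball and a hyperplane), using each set's closest-point map:

```python
import numpy as np

def project_ball(x, r=1.0):
    """Closest-point map onto the Euclidean ball of radius r."""
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

def project_hyperplane(x, a, b):
    """Closest-point map onto the hyperplane {y : a^T y = b}."""
    return x - ((a @ x - b) / (a @ a)) * a

# Alternating projections: find a point in (unit ball) ∩ {x1 + x2 = 1}.
a = np.array([1.0, 1.0])
x = np.array([5.0, -3.0])
for _ in range(50):
    x = project_hyperplane(project_ball(x), a, 1.0)
```

The iterates converge to a point in the intersection; since the last map applied is the hyperplane projection, that constraint holds exactly at every iterate.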
Sequential Quadratic Programming
Sequential Quadratic Programming (SQP) is the state-of-the-art method for smooth nonlinearly constrained optimization. Each iteration solves a quadratic program (QP) approximating the original problem near the current iterate: the objective is approximated by a quadratic using the Hessian of the Lagrangian, and the constraints are linearized. The QP solution gives both the primal step \(d_k\) and updated multipliers.
SQP converges quadratically near a solution satisfying SOSC and strict complementarity (superlinearly when quasi-Newton approximations replace the exact Hessian of the Lagrangian). The main challenge is globalization — ensuring convergence from arbitrary starting points — which requires careful merit functions or filter methods. SQP is implemented in SNOPT, a highly optimized code used throughout aerospace engineering and process optimization.
Interior-Point Methods and Barrier Functions
Interior-point methods (also called barrier methods) replace inequality constraints with a barrier function that becomes infinite at the boundary of the feasible set, turning the constrained problem into an unconstrained one:
\[ \min_x\; f(x) - \mu \sum_{i=1}^m \ln(-g_i(x)). \]The logarithmic barrier penalizes proximity to the boundary, keeping iterates strictly feasible. As \(\mu \to 0\), the solution converges to the constrained optimum. The central path \(\{x^*(\mu) : \mu > 0\}\) parameterizes the trajectory of barrier minimizers; following this path efficiently is the core of modern IPMs.
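The central path can be traced numerically on a tiny assumed instance, min \(x\) subject to \(x \geq 1\): the barrier objective is \(\varphi(x) = x - \mu\ln(x-1)\), and setting \(\varphi'(x) = 1 - \mu/(x-1) = 0\) gives \(x^*(\mu) = 1 + \mu\), which indeed approaches the constrained optimum as \(\mu \to 0\):

```python
def barrier_newton(mu, x0=2.0, iters=30):
    """Newton's method on phi(x) = x - mu*ln(x - 1), damped to stay interior."""
    x = x0
    for _ in range(iters):
        g = 1.0 - mu / (x - 1.0)        # phi'(x)
        H = mu / (x - 1.0) ** 2         # phi''(x) > 0 on the interior x > 1
        step = g / H
        while x - step <= 1.0:          # halve the step until strictly feasible
            step *= 0.5
        x -= step
    return x

# Three points on the central path x*(mu) = 1 + mu:
central_path = [barrier_newton(mu) for mu in (1.0, 0.1, 0.01)]
```

The damping loop is a crude stand-in for the line search a real IPM would use; it simply keeps every iterate strictly inside the feasible region.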
A self-concordant barrier for a convex cone \(K\) is a function \(F \colon \mathrm{int}(K) \to \mathbb{R}\) satisfying the third-order condition \(|D^3F(x)[h,h,h]| \leq 2(D^2F(x)[h,h])^{3/2}\) for all directions \(h\). Self-concordance (due to Nesterov and Nemirovskii) allows Newton’s method on the barrier to be analyzed without Lipschitz conditions on the Hessian — a pure intrinsic curvature condition. The resulting polynomial-time interior-point method solves linear programs in \(O(\sqrt{n}\ln(1/\epsilon))\) Newton iterations, establishing LP as tractable in a much stronger sense than the ellipsoid method.
Logarithmically homogeneous self-concordant barriers (LHSCB) exist for every proper cone (closed, convex, pointed, full-dimensional). For the positive semidefinite cone, the barrier is \(F(X) = -\ln\det X\), of parameter \(\nu = n\). For the second-order cone, the barrier \(F(x,t) = -\ln(t^2 - \|x\|^2)\) has parameter \(\nu = 2\). The barrier parameter \(\nu\) controls the complexity: the number of Newton iterations scales as \(O(\sqrt{\nu}\ln(1/\epsilon))\).
Complexity of First-Order Methods and Nesterov Acceleration
The study of worst-case complexity of first-order methods — algorithms using only gradient (or subgradient) information — reveals fundamental tradeoffs between computation and accuracy. For a convex function with \(L\)-Lipschitz gradient on a bounded domain of diameter \(D\), the optimal first-order algorithm achieves \(f(x_k) - f^* \leq O(LD^2/k^2)\). The \(1/k^2\) rate is achievable but not exceeded.
Nesterov’s 1983 optimal gradient method (sometimes called Nesterov acceleration or FISTA in the composite setting) achieves exactly the \(O(1/k^2)\) rate:
\[ y_{k+1} = x_k - \frac{1}{L}\nabla f(x_k), \quad x_{k+1} = y_{k+1} + \frac{k-1}{k+2}(y_{k+1} - y_k). \]The momentum term \(\frac{k-1}{k+2}(y_{k+1} - y_k)\) is the key: it extrapolates beyond the most recent gradient step. This optimal method is tight: no first-order algorithm can guarantee \(o(1/k^2)\) convergence for general smooth convex functions. The analysis uses an algebraic telescoping argument via an estimate sequence.
For strongly convex functions, the optimal rate is linear: \(\left(1 - \sqrt{\mu/L}\right)^k = \left(1 - 1/\sqrt{\kappa}\right)^k\), achieved by Nesterov’s 1983 method. By contrast, gradient descent achieves the rate \(\left(1 - 2/(\kappa+1)\right)^k\), which is worse by a factor of \(\sqrt{\kappa}\) in the exponent.
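The iteration from the display above can be sketched directly (a minimal version on an assumed quadratic with \(\kappa = 100\); the \(O(1/k^2)\) guarantee bounds the final objective by \(2L\|x_0 - x^*\|^2/(k+1)^2 \approx 4.4 \times 10^{-3}\) at \(k = 300\)):

```python
import numpy as np

def nesterov(grad, L, x0, iters):
    """Accelerated gradient method with momentum coefficient (k-1)/(k+2)."""
    x = x0.copy()
    y_prev = x0.copy()
    for k in range(1, iters + 1):
        y = x - grad(x) / L                           # gradient step
        x = y + (k - 1.0) / (k + 2.0) * (y - y_prev)  # momentum extrapolation
        y_prev = y
    return y_prev

# Ill-conditioned quadratic f(x) = 0.5 x^T A x with kappa = L/mu = 100.
A = np.diag([1.0, 100.0])
grad = lambda v: A @ v
x_acc = nesterov(grad, L=100.0, x0=np.ones(2), iters=300)
```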
Chapter 4: Beyond This Course
Natural Next Topics
Convex Duality and Conic Optimization
The Lagrangian \(\mathcal{L}(x, \lambda, \nu) = f(x) + \lambda^T g(x) + \nu^T h(x)\) defines a dual function \(q(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)\). The dual problem is \(\max_{\lambda \geq 0, \nu} q(\lambda, \nu)\). Weak duality always holds: \(q(\lambda, \nu) \leq f(\bar{x})\) for any primal-feasible \(\bar{x}\) and dual-feasible \((\lambda, \nu)\). Strong duality (duality gap zero) holds under Slater’s condition (existence of a strictly feasible primal point) for convex problems.
Semidefinite programming (SDP) — minimizing a linear function over the intersection of the positive semidefinite cone with an affine subspace — is a major generalization of linear programming with applications in control theory, combinatorial optimization (relaxations of max-cut, max-clique, graph coloring), and polynomial optimization (sum-of-squares certificates). Interior-point methods extend to SDP with the barrier \(-\ln\det X\), solving SDP instances in \(O(\sqrt{n}\ln(1/\epsilon))\) Newton steps.
Stochastic and Online Optimization
Modern machine learning operates with massive datasets, making full-gradient computations prohibitive. Stochastic gradient descent (SGD) replaces \(\nabla f(x) = \frac{1}{N}\sum_{i=1}^N \nabla f_i(x)\) with a single sample gradient \(\nabla f_i(x)\) drawn randomly, reducing per-iteration cost from \(O(N)\) to \(O(1)\). SGD converges but at a slower rate due to noise: \(O(1/\sqrt{k})\) for convex problems, matching the lower bound. Mini-batch SGD averages over a small batch of samples to reduce variance.
Variance reduction methods (SAG, SVRG, SAGA) achieve faster rates by periodically recomputing full gradients or maintaining running averages: these methods achieve linear convergence for strongly convex problems, matching deterministic rates while maintaining cheap per-iteration cost. Adam and its variants use adaptive coordinate-wise step sizes, empirically superior in deep learning though without the same convergence guarantees.
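A minimal SGD sketch on an assumed 1-D mean-estimation problem (with this objective and step size \(\alpha_k = 1/k\), the recursion is exactly the running average of the sampled points, so it converges to the sample mean):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)

# min_x (1/N) * sum_i 0.5*(x - data[i])^2 ; the minimizer is the sample mean.
x = 0.0
for k in range(1, 20_001):
    i = rng.integers(len(data))   # draw one sample index uniformly at random
    g = x - data[i]               # stochastic gradient from that single sample
    x -= g / k                    # diminishing step size alpha_k = 1/k
```

Each iteration touches one data point, illustrating the \(O(1)\) per-iteration cost against the \(O(N)\) cost of a full gradient.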
Nonsmooth Optimization
Many important objectives are non-differentiable: the \(\ell_1\) norm (for sparsity), the nuclear norm (for low-rank), the hinge loss (for SVMs), and indicator functions of feasible sets. The subgradient method generalizes gradient descent to nonsmooth convex functions but achieves only \(O(1/\sqrt{k})\) convergence. The proximal gradient method (or forward-backward splitting) handles composite objectives \(f(x) + g(x)\) where \(f\) is smooth and \(g\) is nonsmooth but has a computable proximal operator. For sparse optimization (\(g = \lambda\|\cdot\|_1\)), the proximal operator is soft thresholding, giving the ISTA/FISTA algorithms.
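A sketch of ISTA on an assumed tiny instance (with \(A = I\) the problem reduces to the \(\ell_1\) proximal problem, so the answer is soft thresholding of \(b\)):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Amat, b, lam, iters=100):
    """Proximal gradient (ISTA) for min 0.5||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(Amat, 2) ** 2            # Lipschitz constant of the smooth gradient
    x = np.zeros(Amat.shape[1])
    for _ in range(iters):
        g = Amat.T @ (Amat @ x - b)             # gradient of the smooth term
        x = soft_threshold(x - g / L, lam / L)  # forward step, then prox (backward) step
    return x

Amat = np.eye(3)
b = np.array([2.0, -0.3, 1.0])
x = ista(Amat, b, lam=0.5)
# With A = I the solution is soft_threshold(b, 0.5) = [1.5, 0.0, 0.5].
```

FISTA adds the Nesterov momentum step to this scheme, improving the rate from \(O(1/k)\) to \(O(1/k^2)\).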
Follow-up Courses and Reading
At the University of Waterloo
CO 671 (Semidefinite Programming) develops the theory and algorithms for SDP and its applications, including polynomial optimization and graph problems. CO 681 (Quantum Information Processing) uses SDP relaxations and conic programming in quantum information theory. CO 673 (Interior Point Methods for Optimization) provides rigorous development of self-concordant barrier theory and polynomial-time algorithms. CO 750/851 are graduate research seminars where Tunçel’s group explores cutting-edge directions in optimization.
Standard Texts
Nocedal & Wright Numerical Optimization (2nd ed., Springer) is the comprehensive reference for algorithms — line search methods, trust regions, nonlinear SQP, interior-point LP/QP, and their convergence analysis. Boyd & Vandenberghe Convex Optimization (Cambridge, freely available) is the leading text for convex analysis, duality, and applications in signal processing, statistics, and machine learning. Nesterov Introductory Lectures on Convex Optimization and Lectures on Convex Optimization (Springer) present the author’s own optimal methods with full proofs and sharp complexity bounds. Ruszczyński Nonlinear Optimization (Princeton) gives a careful mathematical treatment of both smooth and nonsmooth theory.
Open Problems and Active Research
The Complexity of Non-Convex Optimization
Local minima of non-convex functions are generally NP-hard to find (indeed, even verifying that a given point is a local minimizer is NP-hard). But many non-convex problems of practical interest — principal component analysis, matrix completion, phase retrieval, tensor decomposition, training neural networks — have structure that makes local search succeed in practice. Understanding when and why remains a central open problem. Recent theory shows that strict saddle points (second-order critical points with a negative Hessian eigenvalue) can be escaped by perturbed gradient descent; in over-parameterized neural networks, all local minima may be global minima. Making these observations rigorous and connecting them to algorithmic guarantees is an active frontier.
Optimal Algorithms and Lower Bounds
Nesterov’s acceleration achieves the optimal \(O(1/k^2)\) rate for smooth convex optimization among first-order methods. But what is optimal for second-order methods (using Hessians)? For general smooth convex functions in \(\mathbb{R}^n\), can Newton’s method be improved beyond quadratic convergence? For stochastic optimization, what is the optimal complexity among variance-reduction methods? These questions about lower bounds — matching them is a cornerstone of optimization theory — remain active.
Tunçel’s research focuses particularly on the geometry of conic optimization: the relationship between the barrier parameter \(\nu\) and combinatorial properties of the cone, new families of self-concordant barriers, and the extension of polynomial-time results to broader problem classes.
Scalability and Distributed Optimization
Modern optimization problems involve billions of variables and billions of constraints — distributed across multiple machines. Distributed optimization decomposes the problem across workers, communicating only gradients or model updates. Communication cost (not computation) dominates runtime in distributed settings, motivating communication-compressed algorithms (gradient sparsification, quantization). The relationship between communication complexity and statistical accuracy — how many bits of information must be exchanged to achieve a given optimization error — is a new information-theoretic frontier in machine learning systems.