CO 367: Nonlinear Optimization
Nargiz Kalantarova
CO 367 studies unconstrained and constrained optimization problems along with the algorithms used to solve them. The course builds rigorous theoretical foundations in nonlinear optimization, covering topics from basic calculus and linear algebra through modern interior-point methods. Applications in machine learning, signal processing, and economics appear throughout. The main references are Nocedal and Wright, Numerical Optimization (2nd ed., Springer, 2006) and lecture notes by Stephen Vavasis (CO367 Fall 2016).
Introduction and Mathematical Foundations
Unconstrained and Constrained Optimization
The unconstrained problem is
\[ \min_{x \in \mathbb{R}^n} f(x). \]
The constrained problem is
\[ \min f(x) \quad \text{subject to} \quad g_i(x) \leq 0,\ i \in I,\ h_j(x) = 0,\ j \in E, \]
where \( f \) is the objective function, \( g_i \) are inequality constraints, and \( h_j \) are equality constraints. Nonlinear optimization (also called nonlinear programming) describes problems in which the objective or constraint functions are nonlinear in the variables.
A canonical application is data fitting: given observations \( (t_i, y_i) \), one seeks polynomial coefficients \( x_0, x_1, \ldots, x_k \) minimizing \( \sum_{i=1}^n (p(t_i) - y_i)^2 \) where \( p(t) = x_0 + x_1 t + \cdots + x_k t^k \). Portfolio optimization, signal/image processing, and maximum likelihood estimation yield many further examples.
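As a concrete sketch (synthetic data and NumPy assumed, not part of the course notes), the polynomial fit is a linear least squares problem in the coefficients:

```python
import numpy as np

# Synthetic data: noisy samples of y = 1 + 2t - 3t^2 (coefficients made up).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * t - 3.0 * t**2 + 0.01 * rng.standard_normal(t.size)

k = 2                                        # polynomial degree
V = np.vander(t, k + 1, increasing=True)     # columns: 1, t, t^2
x, *_ = np.linalg.lstsq(V, y, rcond=None)    # minimizes sum_i (p(t_i) - y_i)^2
```

Because the model \( p(t) \) is linear in \( x_0, \ldots, x_k \), the problem is solved by one least squares call despite the nonlinear dependence on \( t \).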
Norms
Definition 1.2 (Norm). A norm \( \|\cdot\| : \mathbb{R}^n \to \mathbb{R} \) satisfies: (i) \( \|x\| \geq 0 \) for all \( x \); (ii) \( \|x\| = 0 \) iff \( x = 0 \); (iii) \( \|\alpha x\| = |\alpha| \|x\| \); (iv) triangle inequality \( \|x + y\| \leq \|x\| + \|y\| \).
The most common norms on \( \mathbb{R}^n \) are: the Euclidean norm \( \|x\|_2 = \sqrt{x_1^2 + \cdots + x_n^2} \); the \( \ell_1 \) norm \( \|x\|_1 = \sum |x_i| \); and the \( \ell_p \) norm \( \|x\|_p = (\sum |x_i|^p)^{1/p} \) for \( p \geq 1 \).
Theorem (Cauchy–Schwarz inequality). For all \( x, y \in \mathbb{R}^n \),
\[ |x^\top y| \leq \|x\|_2 \|y\|_2, \]
with equality if and only if \( x = \alpha y \) for some scalar \( \alpha \) (or \( y = 0 \)).
Linear Algebra Review
For a matrix \( A \in \mathbb{R}^{m \times n} \), the four fundamental subspaces are: the column space \( R(A) = \{Ax : x \in \mathbb{R}^n\} \); the row space \( R(A^\top) \); the nullspace \( N(A) = \{x : Ax = 0\} \); and the left nullspace \( N(A^\top) \).
Theorem 1.7 (Orthogonal Decomposition). \( R(A)^\perp = N(A^\top) \) and \( N(A)^\perp = R(A^\top) \), giving orthogonal decompositions \( \mathbb{R}^m = R(A) \oplus N(A^\top) \) and \( \mathbb{R}^n = N(A) \oplus R(A^\top) \).
Theorem 1.8 (Spectral Decomposition). If \( A \in \mathbb{R}^{n \times n} \) is symmetric, then \( A = UDU^\top \) where \( U \) is orthogonal and \( D \) is diagonal with the eigenvalues of \( A \) on the diagonal.
Positive Semidefinite and Positive Definite Matrices
Definition 1.9. A symmetric matrix \( A \in \mathbb{R}^{n \times n} \) is positive semidefinite (PSD), written \( A \succeq 0 \), if \( x^\top A x \geq 0 \) for all \( x \). It is positive definite (PD), written \( A \succ 0 \), if \( x^\top A x > 0 \) for all \( x \neq 0 \).
Theorem 2.2. For a symmetric matrix \( A \), the following are equivalent: (i) \( A \succeq 0 \); (ii) all eigenvalues of \( A \) are nonnegative; (iii) \( A = BB^\top \) for some matrix \( B \). A symmetric matrix is negative semidefinite (\( A \preceq 0 \)) if \( x^\top Ax \leq 0 \) for all \( x \), negative definite (\( A \prec 0 \)) if \( x^\top Ax < 0 \) for all \( x \neq 0 \), and indefinite if \( x^\top Ax \) takes both positive and negative values.
A useful practical test: a symmetric \( A \) is PSD iff all principal minors are nonnegative, and PD iff all leading principal minors are positive (Sylvester’s criterion, Theorem 2.5).
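These characterizations translate directly into numerical tests; a small sketch (NumPy assumed), using the eigenvalue criterion for PSD and the Cholesky factorization as a practical PD test:

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """PSD test via Theorem 2.2(ii): all eigenvalues nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

def is_pd(A):
    """PD test: Cholesky factorization succeeds iff symmetric A is PD."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3: PD
B = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues 3 and -1: indefinite
```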
Topology and Calculus Review
Topology of \( \mathbb{R}^n \)
The \( \varepsilon \)-neighborhood of \( x_0 \) is \( B_\varepsilon(x_0) = \{x : \|x - x_0\|_2 < \varepsilon\} \). A sequence \( \{x^k\} \) converges to \( x^* \) if for every \( \varepsilon > 0 \) there exists \( N \) such that \( \|x^k - x^*\| < \varepsilon \) for all \( k \geq N \). A set is closed if it contains all its limit points, open if every element is an interior point, bounded if it fits in some ball, and compact if both closed and bounded.
Theorem 2.16 (Equivalence of Norms). All norms on \( \mathbb{R}^n \) are equivalent: for any \( \|\cdot\|_a \) and \( \|\cdot\|_b \) there exist \( \alpha, \beta > 0 \) such that \( \alpha \|x\|_a \leq \|x\|_b \leq \beta \|x\|_a \).
Theorem 2.20 (Extreme Value Theorem). If \( E \subset \mathbb{R}^n \) is compact and \( f : E \to \mathbb{R} \) is continuous, then \( f \) attains both its maximum and minimum on \( E \).
Differentiability
A function \( f \) is differentiable at \( x_0 \) if there is a linear map \( Df_{x_0} \) such that
\[ \lim_{h \to 0} \frac{\|f(x_0+h) - f(x_0) - Df_{x_0}(h)\|}{\|h\|} = 0. \]
For real-valued \( f \), \( Df_{x_0}(h) = \nabla f(x_0)^\top h \), where the gradient is
\[ \nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)^\top. \]
The directional derivative in direction \( u \) is \( D_u f(x) = \nabla f(x)^\top u \).

The Hessian \( \nabla^2 f(x) \) is the \( n \times n \) matrix of second partial derivatives, \( (\nabla^2 f(x))_{ij} = \partial^2 f / \partial x_i \partial x_j \). If \( f \in C^2 \) then \( \nabla^2 f(x) \) is symmetric (by equality of mixed partials). For the quadratic \( f(x) = \frac{1}{2} x^\top Ax + b^\top x + c \) with \( A \) symmetric, we have \( \nabla f(x) = Ax + b \) and \( \nabla^2 f(x) = A \).
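A quick finite-difference check of these formulas (illustrative values, NumPy assumed):

```python
import numpy as np

# f(x) = 1/2 x'Ax + b'x with symmetric A has grad f = Ax + b, Hess f = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x + b @ x

x0 = np.array([0.5, -0.25])
h = 1e-6
# Central differences approximate each partial derivative of f at x0.
fd_grad = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h) for e in np.eye(2)])
```

For a quadratic, the central difference is exact up to roundoff, so `fd_grad` should agree with \( Ax_0 + b \) essentially to machine precision.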
Mean Value Theorem and First-Order Optimality
Useful Calculus Theorems
Theorem 3.1 (Mean Value Theorem). If \( f : \mathbb{R} \to \mathbb{R} \) is continuous on \( [a,b] \) and differentiable on \( (a,b) \), then there exists \( c \in (a,b) \) such that \( f(b) - f(a) = f'(c)(b-a) \).
Taylor's theorem with remainder: if \( f : \mathbb{R} \to \mathbb{R} \) is \( k \) times differentiable, then for some \( c \) between \( a \) and \( x \),
\[ f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(k-1)}(a)}{(k-1)!}(x-a)^{k-1} + \frac{f^{(k)}(c)}{k!}(x-a)^k. \]
In \( \mathbb{R}^n \), the mean value form gives, for \( f \in C^1 \),
\[ f(x+d) = f(x) + \nabla f(x + td)^\top d \]
and, for \( f \in C^2 \),
\[ f(x+d) = f(x) + \nabla f(x)^\top d + \frac{1}{2} d^\top \nabla^2 f(x + td) d \]
for some \( t \in (0,1) \) (in general different in each formula).
Unconstrained Optimality
A point \( x^* \) is a global minimizer if \( f(x^*) \leq f(x) \) for all \( x \), and a local minimizer if this holds in some neighborhood \( B_\delta(x^*) \). A critical point (or stationary point) satisfies \( \nabla f(x^*) = 0 \); one that is neither a local minimum nor a local maximum is called a saddle point.
Theorem 3.7 (FONC — First-Order Necessary Condition). If \( x^* \) is a local minimizer of \( f \in C^1 \), then \( \nabla f(x^*) = 0 \).
Proof sketch. If \( \nabla f(x^*) \neq 0 \), let \( d = -\nabla f(x^*) \). Then \( d^\top \nabla f(x^*) = -\|\nabla f(x^*)\|_2^2 < 0 \). By Taylor’s theorem, for small \( \alpha > 0 \), \( f(x^* + \alpha d) < f(x^*) \), contradicting local minimality.
Second-Order Optimality Conditions
Second-Order Conditions
Theorem (SONC — Second-Order Necessary Conditions). If \( x^* \) is a local minimizer of \( f \in C^2 \), then
\[ \nabla f(x^*) = 0 \quad \text{and} \quad \nabla^2 f(x^*) \succeq 0. \]
Proof. FONC gives \( \nabla f(x^*) = 0 \). If \( \nabla^2 f(x^*) \) were not PSD, there would exist \( y \) with \( y^\top \nabla^2 f(x^*) y < 0 \). Setting \( x = x^* + \alpha y \) and applying the quadratic Taylor expansion shows \( f(x) < f(x^*) \) for small \( \alpha \), a contradiction.
For a symmetric matrix \( A \) with smallest and largest eigenvalues \( \lambda_{\min} \) and \( \lambda_{\max} \),
\[ \lambda_{\min} \|x\|_2^2 \leq x^\top Ax \leq \lambda_{\max} \|x\|_2^2 \quad \forall x \in \mathbb{R}^n. \]
Theorem 4.4 (SOSC — Second-Order Sufficient Conditions). If \( \nabla f(x^*) = 0 \) and \( \nabla^2 f(x^*) \succ 0 \), then \( x^* \) is a strict local minimizer.
Proof. By Taylor, \( f(x) - f(x^*) = \frac{1}{2}(x-x^*)^\top \nabla^2 f(x^*)(x-x^*) + \psi(x-x^*) \). Since \( \nabla^2 f(x^*) \succ 0 \), with smallest eigenvalue \( \lambda_{\min} > 0 \), for small enough \( x \) near \( x^* \) the leading term dominates and \( f(x) > f(x^*) \).
Summary of second-order conditions:
- \( x^* \) local minimizer \( \Rightarrow \nabla f(x^*) = 0 \) and \( \nabla^2 f(x^*) \succeq 0 \)
- \( x^* \) local maximizer \( \Rightarrow \nabla f(x^*) = 0 \) and \( \nabla^2 f(x^*) \preceq 0 \)
- \( \nabla f(x^*) = 0 \) and \( \nabla^2 f(x^*) \succ 0 \Rightarrow \) strict local minimizer
- \( \nabla f(x^*) = 0 \) and \( \nabla^2 f(x^*) \prec 0 \Rightarrow \) strict local maximizer
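The four implications above can be packaged as a simple classifier of critical points from the Hessian eigenvalues (a sketch; the semidefinite cases are genuinely inconclusive at second order):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point (grad f = 0) from the Hessian H there."""
    eigs = np.linalg.eigvalsh(H)
    if eigs.min() > tol:
        return "strict local min"    # SOSC: Hessian PD
    if eigs.max() < -tol:
        return "strict local max"    # Hessian ND
    if eigs.min() < -tol and eigs.max() > tol:
        return "saddle"              # indefinite Hessian
    return "inconclusive"            # semidefinite: second order not enough

# Example: f(x, y) = x^2 - y^2 has a saddle at the origin, Hessian diag(2, -2).
```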
Coercive Functions and Convex Sets
Global Minimizers
Theorem 5.1. If \( f \in C^2 \) and \( x^* \) is a critical point, then \( x^* \) is a global minimizer if \( \nabla^2 f(x) \succeq 0 \) everywhere, or a strict global minimizer if \( \nabla^2 f(x) \succ 0 \) everywhere (and analogously for maximizers).
Definition 5.2 (Coercive). A continuous function \( f : \mathbb{R}^n \to \mathbb{R} \) is coercive if all sublevel sets \( \{x : f(x) \leq \beta\} \) are bounded, equivalently if \( \|x\|_2 \to +\infty \) implies \( f(x) \to +\infty \).
Theorem 5.3. If \( f \) is coercive and \( K \subseteq \mathbb{R}^n \) is closed and nonempty, then \( f \) has a global minimizer in \( K \). Proof: Coercivity confines any minimizer to a bounded sublevel set, which intersects \( K \) in a compact set; the EVT then applies.
Convex Sets
Definition 5.5 (Convex set). A set \( C \) is convex if for every \( x, y \in C \) and \( \lambda \in [0,1] \), \( (1-\lambda)x + \lambda y \in C \). Equivalently, the line segment joining any two points in \( C \) remains in \( C \).
Examples of convex sets include: singletons and \( \mathbb{R}^n \); lines and affine subspaces; halfspaces \( \{x : a^\top x \leq c\} \); polyhedra \( \{x : Ax \leq b\} \); open and closed balls; and the set of PSD matrices \( S^n_+ \). The latter is convex because for \( A, B \succeq 0 \) and \( \lambda \in [0,1] \), \( x^\top ((1-\lambda)A + \lambda B)x = (1-\lambda)x^\top Ax + \lambda x^\top Bx \geq 0 \).
Theorem 5.7 (Convex combination). If \( C \) is convex and \( x_1, \ldots, x_n \in C \) with \( \lambda_i \geq 0 \) and \( \sum \lambda_i = 1 \), then \( \sum \lambda_i x_i \in C \). Proof by induction, factoring off the last term.
Operations preserving convexity: intersection of convex sets; images and preimages under affine maps.
Convex Functions

Definition (Convex function). Let \( C \) be a convex set. A function \( f : C \to \mathbb{R} \) is convex if
\[ f((1-\alpha)x + \alpha y) \leq (1-\alpha)f(x) + \alpha f(y) \quad \forall x, y \in C,\ \alpha \in [0,1]. \]
The function is strictly convex if the inequality is strict for \( x \neq y \) and \( \alpha \in (0,1) \). Examples: \( a^\top x + b \) (affine, hence convex); \( \max(0, x) \) (convex but not strictly); \( x^2 \), \( e^x \) (strictly convex).
Convex Functions and First-Order Conditions
Concave Functions and Jensen’s Inequality
A function is concave if \( -f \) is convex, i.e., \( f((1-\alpha)x + \alpha y) \geq (1-\alpha)f(x) + \alpha f(y) \). The convexity inequality \( f((1-\alpha)x + \alpha y) \leq (1-\alpha)f(x) + \alpha f(y) \) is known as Jensen’s inequality, and extends to convex combinations: \( f(\sum \alpha_i x_i) \leq \sum \alpha_i f(x_i) \).
Operations preserving convexity in functions: nonnegative scalar multiples; sums; composition with affine maps (\( g(x) = f(Ax+b) \) is convex if \( f \) is); pointwise maximum (\( h(x) = \max\{f(x), g(x)\} \) is convex if \( f, g \) are).
First-Order Characterization of Convexity
Theorem (First-order characterization of convexity). Let \( f \in C^1 \) on an open convex set \( C \). Then \( f \) is convex on \( C \) iff
\[ f(y) \geq f(x) + \nabla f(x)^\top (y - x) \quad \forall x, y \in C. \]
The function is strictly convex iff the inequality is strict for \( x \neq y \).
This has a profound consequence: Corollary 6.1. If \( f \in C^1 \) is convex on an open convex set \( C \) and \( x^* \) is a critical point, then \( x^* \) is a global minimizer of \( f \) on \( C \). Proof: From the first-order condition, \( f(x) \geq f(x^*) + 0 = f(x^*) \) for all \( x \).
Second-Order Conditions for Convexity and Line Search
Second-Order Characterization of Convexity
Theorem 7.1. Suppose \( f \in C^2 \) on an open convex set \( C \). Then \( f \) is convex iff \( \nabla^2 f(x) \succeq 0 \) for every \( x \in C \). It is strictly convex if \( \nabla^2 f(x) \succ 0 \) for every \( x \in C \).
Theorem 7.3. Any local minimizer of a convex function \( f \) over a convex set \( C \) is also a global minimizer. Furthermore, any local minimizer of a strictly convex function is the unique global minimizer.
Line Search Methods
Most practical algorithms for minimizing \( f \in C^2 \) are iterative methods. Starting from \( x^{(0)} \), one repeatedly:
- Chooses a search direction \( d^{(k)} \)
- Chooses a step length \( \alpha_k > 0 \)
- Updates \( x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)} \)
A direction \( d \) is a descent direction of \( f \) at \( x \) if \( \nabla f(x)^\top d < 0 \). Two canonical descent directions: (1) \( d = -\nabla f(x) \) (steepest descent), since \( (-\nabla f)^\top \nabla f = -\|\nabla f\|_2^2 < 0 \); (2) \( d = -H^{-1}\nabla f(x) \) for any \( H \succ 0 \) (Newton/quasi-Newton direction).
Theorem 7.6. If \( d \) is a descent direction of \( f \in C^1 \) at \( x^* \), then there exists \( \delta > 0 \) such that \( f(x^* + \alpha d) < f(x^*) \) for all \( \alpha \in (0, \delta] \). This is proved by Taylor expansion: for small enough \( \alpha \), the first-order term \( \alpha \nabla f(x^*)^\top d \) (negative) dominates the error term \( \phi(\alpha d) \).
Wolfe Conditions and Lipschitz Continuity
Wolfe Conditions
Define \( \phi(\alpha) := f(x^{(k)} + \alpha d^{(k)}) \). For \( 0 < c_1 < c_2 < 1 \), the Wolfe conditions are:
- (W1) Sufficient decrease: \( f(x^{(k)} + \alpha d^{(k)}) \leq f(x^{(k)}) + c_1 \alpha \nabla f(x^{(k)})^\top d^{(k)} \)
- (W2) Curvature condition: \( \nabla f(x^{(k)} + \alpha d^{(k)})^\top d^{(k)} \geq c_2 \nabla f(x^{(k)})^\top d^{(k)} \)
The sufficient decrease condition alone does not prevent very small steps; the curvature condition guards against this. Together, the Wolfe conditions define a well-behaved inexact line search.
Theorem 8.1. If \( f \in C^1 \), \( d^{(k)} \) is a descent direction, \( f \) is bounded below along the ray, and \( 0 < c_1 < c_2 < 1 \), then step lengths satisfying the Wolfe conditions exist.
An alternative to the curvature condition is the backtracking line search: starting from a trial \( \bar{\alpha} > 0 \), repeatedly shrink \( \alpha \leftarrow \rho \alpha \) (with \( \rho \in (0,1) \)) until the sufficient decrease condition (W1) is met.
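A minimal backtracking implementation (a sketch, assuming the caller supplies a descent direction):

```python
import numpy as np

def backtracking(f, grad, x, d, alpha=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha by factor rho until sufficient decrease (W1) holds.
    Terminates for any descent direction d (grad(x)^T d < 0) and f in C^1."""
    slope = grad(x) @ d                 # negative for a descent direction
    while f(x + alpha * d) > f(x) + c1 * alpha * slope:
        alpha *= rho
    return alpha
```

For example, on \( f(x) = \|x\|_2^2 \) with \( d = -\nabla f(x) \), the accepted step satisfies (W1) by construction.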
Lipschitz Continuity
Definition 8.1. A function \( f : E \to \mathbb{R}^m \) is Lipschitz continuous on \( E \) if there exists \( L > 0 \) such that \( \|f(x) - f(y)\| \leq L\|x - y\| \) for all \( x, y \in E \).
Theorem (Zoutendijk). If \( f \) is bounded below, \( \nabla f \) is Lipschitz continuous, each \( d^{(k)} \) is a descent direction, and every step satisfies the Wolfe conditions, then
\[ \sum_{k \geq 0} \cos^2(\theta_k) \|\nabla f(x^{(k)})\|_2^2 < +\infty, \]
where \( \cos(\theta_k) = \frac{-\nabla f(x^{(k)})^\top d^{(k)}}{\|\nabla f(x^{(k)})\|_2 \|d^{(k)}\|_2} \) measures the angle between \( d^{(k)} \) and the steepest descent direction.
Steepest Descent and Convergence Rate
Steepest Descent Method

The steepest descent method sets \( d^{(k)} = -\nabla f(x^{(k)}) \). Lemma 9.2 shows this is the direction minimizing the directional derivative among all unit-norm directions: the solution to \( \min_{\|d\|_2=1} \nabla f(x^*)^\top d \) is \( d^* = -\nabla f(x^*)/\|\nabla f(x^*)\|_2 \), by Cauchy-Schwarz.
The algorithm:
- Compute direction \( d^{(k)} = -\nabla f(x^{(k)}) \)
- Choose \( \alpha_k \) by line search on \( \phi(\alpha) = f(x^{(k)} + \alpha d^{(k)}) \)
- Update \( x^{(k+1)} = x^{(k)} - \alpha_k \nabla f(x^{(k)}) \)
Theorem 9.3. With exact line search, consecutive steepest-descent steps are orthogonal: \( (x^{(k+2)} - x^{(k+1)})^\top (x^{(k+1)} - x^{(k)}) = 0 \).
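For a quadratic objective the exact step length has the closed form \( \alpha_k = g^\top g / g^\top A g \) with \( g = \nabla f(x^{(k)}) \), which makes Theorem 9.3 easy to verify numerically; a sketch with made-up data:

```python
import numpy as np

# Steepest descent with exact line search on f(x) = 1/2 x'Ax - b'x, A > 0.
A = np.array([[10.0, 0.0], [0.0, 1.0]])   # condition number kappa = 10
b = np.array([1.0, 1.0])

x = np.zeros(2)
steps = []
for _ in range(25):
    g = A @ x - b                         # gradient of the quadratic
    alpha = (g @ g) / (g @ A @ g)         # exact minimizer of phi(alpha)
    x_new = x - alpha * g
    steps.append(x_new - x)
    x = x_new
```

Consecutive steps come out orthogonal (the characteristic zig-zag), and the iterates approach \( x^* = A^{-1}b \) at the linear rate governed by \( \kappa \).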
Convergence Rate of Steepest Descent
For a strongly convex quadratic \( f(x) = \frac{1}{2}x^\top Ax - b^\top x \) with \( A \succ 0 \) and eigenvalues \( \lambda_1 \geq \cdots \geq \lambda_n > 0 \), steepest descent with exact line search satisfies
\[ f(x^{(k+1)}) - f(x^*) \leq \left(\frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n}\right)^2 (f(x^{(k)}) - f(x^*)). \]
The condition number \( \kappa = \lambda_1/\lambda_n \) governs convergence speed: when \( \kappa \) is large (ill-conditioned), the factor approaches 1 and convergence is very slow. If \( A = \alpha I \), convergence happens in one step.
Strongly Convex Functions and Newton’s Method
Strongly Convex Functions
Definition 10.1 (Strongly convex). \( f : C \to \mathbb{R} \) is strongly convex with constant \( \beta > 0 \) if \( f(x) - \frac{\beta}{2}\|x\|_2^2 \) is convex. Equivalently, for \( f \in C^2 \), \( \nabla^2 f(z) \succeq \beta I \) for all \( z \).
For \( f \in C^1 \) strongly convex with constant \( \beta \), the first-order characterization sharpens to
\[ f(y) \geq f(x) + \nabla f(x)^\top (y-x) + \frac{\beta}{2} \|y-x\|_2^2, \]
which in turn gives \( f(y) \geq f(x) - \frac{1}{2\beta}\|\nabla f(x)\|_2^2 \) (minimize the right-hand side over \( y \)). Since this holds for the optimal \( y = x^* \), a small gradient implies near-optimality. Furthermore, \( \|x - x^*\|_2 \leq \frac{2}{\beta}\|\nabla f(x)\|_2 \).
Theorem 10.3. For the steepest descent method applied to a strongly convex quadratic with \( A \succ 0 \), convergence is linear: \( f(x^{(k+1)}) - f(x^*) \leq \left(\frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n}\right)^2(f(x^{(k)}) - f(x^*)) \).
Newton’s Method
Newton's method minimizes the local quadratic model of \( f \) at \( x^{(k)} \):
\[ q(x^{(k)} + h) = f(x^{(k)}) + \nabla f(x^{(k)})^\top h + \frac{1}{2} h^\top \nabla^2 f(x^{(k)}) h. \]
If \( \nabla^2 f(x^{(k)}) \succ 0 \), the minimizer is \( h = -(\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)}) \), giving the Newton step \( d^{(k)} = -(\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)}) \). In practice one solves \( \nabla^2 f(x^{(k)}) d^{(k)} = -\nabla f(x^{(k)}) \) as a linear system.
Convergence orders: a sequence \( \{x^{(k)}\} \to x^* \) converges linearly (order 1, factor \( C < 1 \)) if \( \|x^{(k+1)} - x^*\| \leq C\|x^{(k)} - x^*\| \), quadratically (order 2) if \( \|x^{(k+1)} - x^*\| \leq C\|x^{(k)} - x^*\|^2 \), and sublinearly if the ratio \( \|x^{(k+1)} - x^*\| / \|x^{(k)} - x^*\| \) tends to 1. Steepest descent exhibits linear convergence in most cases; Newton's method (once close enough) converges quadratically.
Theorem 10.8 (Newton convergence). Under local assumptions (Lipschitz-continuous Hessian near \( x^* \), \( \nabla f(x^*) = 0 \), \( \nabla^2 f(x^*) \succ 0 \)), if \( x^{(0)} \) is sufficiently close to \( x^* \), Newton's method with \( \alpha_k = 1 \) converges to \( x^* \) quadratically.
Newton Convergence Proof
Proof of Quadratic Convergence
The proof quantifies how each Newton iterate stays within a ball around \( x^* \) while the error contracts quadratically. Lemma 11.1 guarantees that for a continuous matrix-valued function \( F \), if \( F(x_0) \) is nonsingular, then \( F(x)^{-1} \) exists and depends continuously on \( x \) in some ball around \( x_0 \).
Using \( \nabla f(x^*) = 0 \) and the fundamental theorem of calculus, the Newton step (with \( \alpha_k = 1 \)) satisfies
\[ x^{(k+1)} - x^* = (\nabla^2 f(x^{(k)}))^{-1} \int_0^1 (\nabla^2 f(x^{(k)}) - \nabla^2 f(x^* + t(x^{(k)} - x^*)))(x^{(k)} - x^*)\,dt. \]
Applying Lipschitz continuity of the Hessian gives \( \|x^{(k+1)} - x^*\|_2 \leq L\|\nabla^2 f(x^*)^{-1}\|_2 \|x^{(k)} - x^*\|_2^2 \) for \( x^{(k)} \) close enough to \( x^* \), confirming quadratic convergence.
Remarks on Newton’s method:
- Newton converges quadratically near a minimizer, but may diverge if started far away — it is not globally convergent.
- Each iteration costs \( O(n^2) \) to form the Hessian and \( O(n^3) \) to solve the linear system.
- One solves the linear system \( \nabla^2 f(x^{(k)})d^{(k)} = -\nabla f(x^{(k)}) \) rather than forming \( (\nabla^2 f)^{-1} \) directly.
- When the Hessian is not PD (e.g., near a saddle), one replaces it with a PD approximation.
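A bare-bones Newton iteration reflecting these remarks (unit steps, solving the linear system rather than inverting; the example function is made up):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method with alpha_k = 1: solve Hess d = -grad at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)   # never form the inverse explicitly
    return x

# Example: f(x1, x2) = x1^2 + exp(x2) - x2, minimized at (0, 0).
grad = lambda x: np.array([2 * x[0], np.exp(x[1]) - 1])
hess = lambda x: np.array([[2.0, 0.0], [0.0, np.exp(x[1])]])
```

Started near the minimizer, the gradient norm collapses in a handful of iterations, consistent with quadratic convergence.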
Quasi-Newton Methods and Trust-Region Methods
BFGS Quasi-Newton Method
Quasi-Newton methods approximate Newton’s method using only first-order information, updating a positive-definite matrix \( H_k \succ 0 \) at each step. The search direction is \( d^{(k)} = -H_k^{-1} \nabla f(x^{(k)}) \).
The updated approximation is required to satisfy the secant equation
\[ B_{k+1} s^{(k)} = y^{(k)}, \]
where \( s^{(k)} = x^{(k+1)} - x^{(k)} \) and \( y^{(k)} = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)}) \). This mirrors the property of the exact Hessian of a quadratic function, where \( A(x-y) = \nabla f(x) - \nabla f(y) \).
With \( s^{(k)}, y^{(k)} \) as above, the BFGS update of the Hessian approximation, which enforces the secant equation, is
\[ B^{(k+1)} = B^{(k)} - \frac{B^{(k)} s^{(k)} (s^{(k)})^\top B^{(k)}}{(s^{(k)})^\top B^{(k)} s^{(k)}} + \frac{y^{(k)} (y^{(k)})^\top}{(y^{(k)})^\top s^{(k)}}. \]
(Equivalently, the inverse approximation \( H^{(k)} = (B^{(k)})^{-1} \) is updated as \( H^{(k+1)} = (I - \tau_k s^{(k)} (y^{(k)})^\top) H^{(k)} (I - \tau_k y^{(k)} (s^{(k)})^\top) + \tau_k s^{(k)} (s^{(k)})^\top \) with \( \tau_k = 1/((y^{(k)})^\top s^{(k)}) \).) Theorem 12.1. If \( B^{(k)} \succ 0 \) and \( (y^{(k)})^\top s^{(k)} > 0 \), then \( B^{(k+1)} \succ 0 \). The step sizes must be chosen (via Wolfe conditions) to guarantee \( (y^{(k)})^\top s^{(k)} > 0 \). BFGS converges superlinearly under suitable conditions.
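The secant property and the preservation of positive definiteness (Theorem 12.1) are easy to check numerically; here is a sketch of the direct update of \( B \), one standard form of the BFGS formula, on made-up vectors:

```python
import numpy as np

def bfgs_update(B, s, y):
    """Direct BFGS update of the Hessian approximation B.
    Requires y's > 0; the result satisfies the secant equation B_new s = y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

B0 = np.eye(2)
s = np.array([1.0, 0.0])    # step x_{k+1} - x_k
y = np.array([2.0, 1.0])    # gradient difference; y's = 2 > 0
B1 = bfgs_update(B0, s, y)
```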
Trust-Region Methods
Each iteration minimizes the quadratic model \( q \) over a ball of radius \( \delta_k \) around the current iterate:
\[ x^{\text{TEST}} = \arg\min_{\|d\| \leq \delta_k} q(x^{(k)} + d), \]
where \( \delta_k \) is the trust-region radius. The ratio \( \rho = (f(x^{(k)}) - f(x^{\text{TEST}})) / (q(x^{(k)}) - q(x^{\text{TEST}})) \) measures how well the model predicts the actual decrease. If \( \rho \leq 1/4 \), shrink \( \delta \); if \( \rho \geq 3/4 \) and the constraint is tight, expand \( \delta \); accept the step if \( \rho \geq 1/8 \).
Theorems 13.2–13.4 establish that the trust-region method converges to a point with \( \nabla f = 0 \) and \( \nabla^2 f \succeq 0 \) from any starting point, and when the iterates converge near a point satisfying SOSC, the trust-region step coincides with Newton’s step and convergence becomes quadratic.
Trust-Region Subproblem and Conjugate Directions
Trust-Region Subproblem
The trust-region subproblem (TRS) is
\[ \min_{d \in \mathbb{R}^n} \,f(x^{(k)}) + \nabla f(x^{(k)})^\top d + \tfrac{1}{2} d^\top H_k d \quad \text{s.t.} \quad \|d\|_2 \leq \delta_k. \]
Theorem 13.1. \( d^* \) is a global minimizer of TRS iff it is feasible and there exists \( \lambda \geq 0 \) such that \( (H_k + \lambda I)d^* = -\nabla f(x^{(k)}) \), \( \lambda(\delta_k - \|d^*\|_2) = 0 \), and \( H_k + \lambda I \succeq 0 \).
Conjugate Direction Methods (Optional)
For solving \( Ax = b \) with \( A \succ 0 \) (equivalently minimizing \( f(x) = \frac{1}{2}x^\top Ax - b^\top x \)), the conjugate direction method uses directions \( d_0, \ldots, d_{n-1} \) satisfying \( d_i^\top A d_j = 0 \) for \( i \neq j \). Such directions are linearly independent, and Theorem 13.7 guarantees convergence to \( x^* \) in at most \( n \) steps. The conjugate gradient (CG) method computes conjugate directions iteratively using only the previous direction, making it efficient for large sparse systems.
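A compact sketch of the standard CG iteration (toy data):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-12):
    """Solve Ax = b for symmetric PD A; at most n steps in exact arithmetic."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual (equals -grad of 1/2 x'Ax - b'x)
    d = r.copy()                   # first direction: steepest descent
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r)) * d   # A-conjugate to previous d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
```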
MATLAB Tutorial
This lecture was a MATLAB tutorial on programming for numerical optimization. No mathematical content was presented.
Least Squares Problems
Linear Least Squares
Given \( A \in \mathbb{R}^{m \times n} \) with linearly independent columns and \( b \in \mathbb{R}^m \), the linear least squares problem is
\[ \min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2. \]
Since \( \nabla^2 f(x) = 2A^\top A \succ 0 \) (as \( \text{rank}(A) = n \)), the unique solution is \( x_{\text{LS}} = (A^\top A)^{-1} A^\top b \), obtained by solving the normal equations \( A^\top A x = A^\top b \).
In practice, the QR factorization is preferred over Cholesky for the normal equations, as it is numerically more stable when \( A \) is ill-conditioned. Theorem 15.1: Any \( A \in \mathbb{R}^{m \times n} \) admits \( A = Q_1 R_1 \) with \( Q_1 \in \mathbb{R}^{m \times n} \) having orthonormal columns and \( R_1 \in \mathbb{R}^{n \times n} \) upper triangular (when columns are linearly independent). The normal equations simplify to \( R_1 x = Q_1^\top b \), solvable by back-substitution in \( O(n^2) \) operations.
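Both solution routes side by side on toy data (NumPy assumed):

```python
import numpy as np

# Fit a line to the points (1,1), (2,2), (3,2): columns of A are [1, t].
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Route 1: thin QR, then solve the triangular system R1 x = Q1' b.
Q1, R1 = np.linalg.qr(A, mode='reduced')
x_qr = np.linalg.solve(R1, Q1.T @ b)

# Route 2: normal equations (fine here, less stable when A is ill-conditioned).
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
```

Both routes return intercept \( 2/3 \) and slope \( 1/2 \) for this data.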
Nonlinear Least Squares
In nonlinear least squares one minimizes \( f(x) = \sum_{k=1}^m r_k(x)^2 \) for residuals \( r_k \) with Jacobian \( J(x) \); then \( \nabla f(x) = 2J(x)^\top r(x) \) and
\[ \nabla^2 f(x) = 2\sum_{k=1}^m r_k(x) \nabla^2 r_k(x) + 2J(x)^\top J(x). \]
The Gauss-Newton method drops the first term (which requires second derivatives) and uses the approximation \( \nabla^2 f \approx 2J^\top J \), giving updates \( x^{(k+1)} = x^{(k)} - \alpha_k (J(x^{(k)})^\top J(x^{(k)}))^{-1} J(x^{(k)})^\top r(x^{(k)}) \). This uses only first-order information and is effective when residuals are small or nearly linear.
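A Gauss-Newton sketch for a one-parameter exponential model (model and data made up; the problem is zero-residual, so convergence is fast):

```python
import numpy as np

# Fit y ~= exp(c * t); the data are generated with c = 0.5 (no noise).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.exp(0.5 * t)

def residual(c):
    return np.exp(c * t) - y

def jacobian(c):
    return (t * np.exp(c * t)).reshape(-1, 1)    # dr_k / dc

c = np.array([0.2])                              # starting guess
for _ in range(20):
    J, r = jacobian(c[0]), residual(c[0])
    d = np.linalg.solve(J.T @ J, -J.T @ r)       # Gauss-Newton step, alpha = 1
    c = c + d
```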
Constrained Optimization
NLP Formulation and Active Sets
The general nonlinear program (NLP) is
\[ \min f(x) \quad \text{s.t.} \quad g_i(x) \leq 0,\ i \in I;\ h_j(x) = 0,\ j \in E. \]
A special case is linear programming (LP): \( \min c^\top x \) subject to \( Ax = b, x \geq 0 \). A convex program (CP) is an NLP in which \( f, g_1, \ldots, g_m \) are all convex and \( h_1, \ldots, h_p \) are affine.
Definition 16.1 (Active set). At a feasible point \( \bar{x} \), the active set is \( A(\bar{x}) = E \cup \{i \in I : g_i(\bar{x}) = 0\} \). An inequality constraint \( g_i \) is active at \( \bar{x} \) if \( g_i(\bar{x}) = 0 \) and inactive if \( g_i(\bar{x}) < 0 \).
Theorem 16.3. If \( x^* \) is a local minimizer of \( \min f(x) \) s.t. \( g_i(x) \leq 0 \), then there exists no vector \( d \) satisfying simultaneously \( \nabla f(x^*)^\top d < 0 \) and \( \nabla g_i(x^*)^\top d < 0 \) for all \( i \in A(x^*) \). This is the constraint analog of FONC: no direction can simultaneously decrease \( f \) and preserve feasibility.
KKT Conditions
Gordan’s Lemma and KKT
Lemma 17.1 (Gordan). For any \( A \in \mathbb{R}^{m \times n} \), exactly one of the following holds: (I) \( \exists x \) with \( Ax < 0 \); (II) \( \exists y \neq 0, y \geq 0 \) with \( A^\top y = 0 \).

KKT conditions, inequality-constrained case: if \( x^* \) is a local minimizer of \( \min f(x) \) s.t. \( g_i(x) \leq 0 \) and LICQ holds at \( x^* \), then there exist multipliers \( \lambda_i \geq 0 \) with
\[ \nabla f(x^*) + \sum_{i \in A(x^*)} \lambda_i \nabla g_i(x^*) = 0. \]
The proof applies Gordan's Lemma: since no descent direction exists (Theorem 16.3), forming the matrix \( A \) from \( \nabla f(x^*)^\top \) and the active constraint gradients, alternative (I) is ruled out, so alternative (II) gives a nonnegative combination equaling zero. LICQ forces the coefficient of \( \nabla f \) to be positive, allowing normalization.
Definition 17.3 (LICQ). At a feasible point \( \bar{x} \), the Linear Independence Constraint Qualification (LICQ) holds if the active gradients \( \{\nabla g_i(\bar{x}) : i \in A(\bar{x})\} \cup \{\nabla h_j(\bar{x}) : j \in E\} \) are linearly independent.
Theorem 17.4 (KKT Necessary Conditions — full NLP). If \( x^* \) is a local minimizer of an NLP and LICQ holds, then there exist \( \lambda^* \in \mathbb{R}^m \), \( \mu^* \in \mathbb{R}^p \) such that:
- (Dual feasibility): \( \nabla_x L(x^*, \lambda^*, \mu^*) = 0 \), \( \lambda^* \geq 0 \)
- (Primal feasibility): \( g_i(x^*) \leq 0 \), \( h_j(x^*) = 0 \)
- (Complementary slackness): \( \lambda_i^* g_i(x^*) = 0 \)
where the Lagrangian is \( L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x) \).
A KKT point is any feasible \( x^* \) satisfying these conditions. A constraint qualification (CQ) is any condition that guarantees KKT at every local minimizer.
Theorem 18.2 (KKT Sufficiency for convex programs). For a CP, if there exist multipliers \( \bar{\lambda} \geq 0 \), \( \bar{\mu} \) such that the KKT conditions hold at \( \bar{x} \), then \( \bar{x} \) is a global minimizer of the CP.
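For a small convex program the KKT conditions can be checked directly. A sketch for \( \min x_1^2 + x_2^2 \) s.t. \( 1 - x_1 - x_2 \leq 0 \), whose candidate solution \( x^* = (1/2, 1/2) \), \( \lambda^* = 1 \) was worked out by hand:

```python
import numpy as np

# Candidate KKT point for  min x1^2 + x2^2  s.t.  g(x) = 1 - x1 - x2 <= 0.
x = np.array([0.5, 0.5])
lam = 1.0

grad_f = 2.0 * x
grad_g = np.array([-1.0, -1.0])

stationarity = grad_f + lam * grad_g   # grad_x L: should vanish, with lam >= 0
g_val = 1.0 - x.sum()                  # primal feasibility: should be <= 0
comp = lam * g_val                     # complementary slackness: should be 0
```

Since the problem is a CP, Theorem 18.2 upgrades this KKT point to a global minimizer.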
Tangent Cones and Constraint Qualifications
Cones and Tangent Cones
A set \( K \) is a cone if \( x \in K \) and \( \alpha > 0 \) implies \( \alpha x \in K \). For example, \( \mathbb{R}^n_+ = \{x : x_i \geq 0\} \) is a cone.
Definition 18.5 (Tangent cone). Given \( \bar{x} \in F \), a vector \( d \) is a tangent vector to \( F \) at \( \bar{x} \) if there exist sequences \( x^{(k)} \to \bar{x} \) in \( F \) and scalars \( t_k \to 0^+ \) with \( (x^{(k)} - \bar{x})/t_k \to d \). The set of all tangent vectors is the tangent cone \( T_F(\bar{x}) \).
The linearized cone at a feasible point \( \bar{x} \) is
\[ L_F(\bar{x}) = \{d : \nabla g_i(\bar{x})^\top d \leq 0,\ \forall i \in A(\bar{x}) \cap I;\ \nabla h_j(\bar{x})^\top d = 0,\ \forall j \in E\}. \]
Theorem 19.2. Always \( T_F(\bar{x}) \subseteq L_F(\bar{x}) \). When LICQ holds, \( T_F(\bar{x}) = L_F(\bar{x}) \); likewise when all active constraints are affine (Theorem 19.3).
Definition 19.5 (ACQ). The Abadie Constraint Qualification holds at \( \bar{x} \) if \( T_F(\bar{x}) = L_F(\bar{x}) \). This is weaker than LICQ and also guarantees KKT conditions (Theorem 19.6).
Definition/Theorem 19.7 (SCQ). For a convex program, if there exists a Slater point \( \bar{x} \) with \( g_i(\bar{x}) < 0 \) for all \( i \in I \) and \( h_j(\bar{x}) = 0 \) for all \( j \in E \), then the Slater Constraint Qualification (SCQ) holds and \( T_F(x) = L_F(x) \) for every feasible \( x \).
Second-Order Conditions for Constrained Optimization
Theorem 19.1. If \( x^* \) is a local minimizer, then \( \nabla f(x^*)^\top d \geq 0 \) for every tangent direction \( d \in T_F(x^*) \).
Given a KKT point \( x^* \) with multipliers \( \lambda^* \), the critical cone is
\[ C(x^*, \lambda^*) = \{d \in L_F(x^*) : \nabla g_i(x^*)^\top d = 0,\ \forall i \in A(x^*) \cap I \text{ with } \lambda_i^* > 0\}. \]
The critical cone consists of directions for which first-order information alone cannot determine whether \( f \) increases or decreases.
Theorem 19.9 (Second-order constrained conditions).
- (SONC): If \( x^* \) is a local minimizer with LICQ and KKT multipliers \( \lambda^*, \mu^* \), then \( d^\top \nabla^2_{xx} L(x^*, \lambda^*, \mu^*) d \geq 0 \) for all \( d \in C(x^*, \lambda^*) \).
- (SOSC): If \( x^* \) is a KKT point and \( d^\top \nabla^2_{xx} L(x^*, \lambda^*, \mu^*) d > 0 \) for all \( d \in C(x^*, \lambda^*) \setminus \{0\} \), then \( x^* \) is a strict local minimizer.
Duality Theory
Lagrangian Dual
Given the Lagrangian \( L(x, \lambda, \mu) \) over a domain \( D \), the Lagrangian dual function is
\[ q(\lambda, \mu) := \min_{x \in D} L(x, \lambda, \mu), \]
where \( \lambda \geq 0 \). The dual problem is \( \max q(\lambda, \mu) \) over its domain. The dual is always a convex program (even when the primal is not), as \( q \) is a pointwise infimum of affine functions of \( (\lambda, \mu) \).
Theorem 20.3 (Weak Duality). If \( p^* \) and \( d^* \) are the optimal primal and dual values, then \( d^* \leq p^* \). Proof: For any feasible \( x \) and \( (\lambda, \mu) \in \text{dom}(q) \), \( q(\lambda, \mu) \leq L(x, \lambda, \mu) = f(x) + \lambda^\top g(x) + \mu^\top h(x) \leq f(x) \) since \( g(x) \leq 0, h(x) = 0, \lambda \geq 0 \).
The difference \( p^* - d^* \geq 0 \) is the duality gap. When \( p^* = d^* \), strong duality holds.
Theorem 20.4 (Strong duality for convex programs). For a CP, if the KKT conditions hold at \( \bar{x} \) with multipliers \( \bar{\lambda}, \bar{\mu} \), then \( q(\bar{\lambda}, \bar{\mu}) = f(\bar{x}) = p^* \), so strong duality holds.
Example 20.2 (Dual of an LP). For \( (P): \min c^\top x \) s.t. \( Ax \leq b \), the dual objective is \( q(\lambda) = -b^\top \lambda \) when \( A^\top \lambda + c = 0 \), giving dual \( (D): \max -b^\top \lambda \) s.t. \( A^\top \lambda + c = 0, \lambda \geq 0 \) — the standard LP dual.
Example 20.5 (Dual of a QP). For \( \min x^\top Cx \) s.t. \( Ax \leq b \) with \( C \succ 0 \), the Lagrangian is minimized at \( x = -\frac{1}{2}C^{-1}A^\top\lambda \), giving \( q(\lambda) = -\frac{1}{4}\lambda^\top AC^{-1}A^\top\lambda - \lambda^\top b \), a concave dual.
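Strong duality for this QP can be checked numerically on a small instance (data made up), using the formulas from Example 20.5:

```python
import numpy as np

# QP: min x'Cx  s.t.  Ax <= b, with C = I and one active constraint.
C = np.eye(2)
A = np.array([[1.0, 1.0]])
b = np.array([-1.0])

Cinv = np.linalg.inv(C)
M = A @ Cinv @ A.T                        # matrix in the dual quadratic
lam = np.linalg.solve(M / 2.0, -b)        # maximizer of q (interior: lam > 0)
x = -0.5 * Cinv @ A.T @ lam               # primal minimizer of the Lagrangian

q_val = -0.25 * lam @ M @ lam - lam @ b   # dual optimal value
p_val = x @ C @ x                         # primal value at the feasible x
```

Here \( \lambda = 1 \), \( x = (-1/2, -1/2) \), and primal and dual values agree at \( 1/2 \), so the duality gap is zero.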
Duality Applications and Gradient Projection
Portfolio Optimization (Markowitz)
The Markowitz mean-variance portfolio problem is
\[ \min_w \, w^\top \Sigma w \quad \text{s.t.} \quad \bar{r}^\top w \geq r_{\min},\ e^\top w = 1, \]
where \( w \in \mathbb{R}^n \) is the portfolio weight vector, \( \bar{r} \) is the vector of expected returns, \( \Sigma \) is the covariance matrix (assumed PSD), and \( r_{\min} \) is the minimum acceptable expected return. The objective minimizes variance (risk), subject to a return lower bound and a budget constraint.
Gradient Projection Method
For minimizing \( f \) over a closed convex set \( \Omega \) with Euclidean projection \( P_\Omega \), the gradient projection iteration is
\[ x^{(k+1)} = P_\Omega(x^{(k)} - \alpha_k \nabla f(x^{(k)})). \]
Proposition 21.3. If \( x^* \) is a local minimum of \( f \) on \( \Omega \), then \( \nabla f(x^*)^\top(x - x^*) \geq 0 \) for all \( x \in \Omega \).
Proposition 21.4 (Projection characterization). \( y_{\bar{x}} = P_\Omega(\bar{x}) \) iff \( (\bar{x} - y_{\bar{x}})^\top(y - y_{\bar{x}}) \leq 0 \) for all \( y \in \Omega \).
For convex \( f \) with Lipschitz-continuous gradient and a suitable fixed step size \( \alpha \), after \( m \) iterations
\[ f(x^{(m)}) - f(x^*) \leq \frac{\|x^* - x^{(0)}\|_2^2}{2\alpha m}. \]
This is an \( O(1/m) \) (sublinear) convergence rate, slower than Newton's quadratic rate but applicable to any convex problem of this form.
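A sketch on a box constraint, where the projection is coordinate-wise clipping (example data made up):

```python
import numpy as np

# min ||x - c||^2 over Omega = [0, 1]^2, with c outside the box.
c = np.array([2.0, -0.5])
grad = lambda x: 2.0 * (x - c)
project = lambda x: np.clip(x, 0.0, 1.0)   # P_Omega for a box is clipping

x = np.array([0.5, 0.5])
alpha = 0.25                               # fixed step, <= 1/L with L = 2 here
for _ in range(100):
    x = project(x - alpha * grad(x))
```

The iterates settle on the boundary point \( (1, 0) \), the projection of \( c \) onto the box, which satisfies the variational inequality of Proposition 21.3.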
Penalty Methods
Quadratic Penalty Method
For the equality-constrained problem \( \min f(x) \) s.t. \( h_j(x) = 0 \), \( j = 1, \ldots, p \), the quadratic penalty function is
\[ Q(x; \mu) = f(x) + \mu \sum_{j=1}^p h_j(x)^2, \]
where \( \mu > 0 \) is the penalty parameter. As \( \mu \to +\infty \), constraint violation is penalized more heavily and the minimizers of \( Q \) approach the feasible region.
The algorithm: start with arbitrary \( x^{(0)} \) and small \( \mu \); repeatedly minimize \( Q(x; \mu) \) (using any unconstrained method), then multiply \( \mu \) by a factor \( C > 1 \).
With inequality constraints included, the penalty becomes
\[ Q(x; \mu) = f(x) + \mu \sum_{j \in E} h_j(x)^2 + \mu \sum_{i \in I} \max(0, g_i(x))^2. \]
Theorem 22.3 (Convergence). If the iterates converge to \( x^* \), then either \( x^* \) is a critical point of the infeasibility measure \( \|h(x)\|_2^2 \), or (under LICQ) \( x^* \) is a KKT point.
A practical drawback: when \( \mu \) is very large, the Hessian \( \nabla^2_{xx} Q = \nabla^2 f + 2\mu A^\top A \) (with \( A \) the constraint Jacobian) becomes severely ill-conditioned (its condition number grows with \( \mu \)), making Newton steps inaccurate.
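A sketch of the penalty loop on \( \min x_1^2 + 2x_2^2 \) s.t. \( x_1 + x_2 = 1 \), whose true solution \( (2/3, 1/3) \) is easy to verify by hand; since \( Q \) is itself a convex quadratic here, each subproblem is solved exactly:

```python
import numpy as np

def minimize_Q(mu):
    """Exact minimizer of Q(x; mu) = x1^2 + 2 x2^2 + mu (x1 + x2 - 1)^2."""
    # Setting grad Q = 0 gives the linear system below.
    H = np.array([[2 + 2 * mu, 2 * mu],
                  [2 * mu, 4 + 2 * mu]])
    return np.linalg.solve(H, np.array([2 * mu, 2 * mu]))

mu = 1.0
for _ in range(8):
    x = minimize_Q(mu)
    mu *= 10.0            # tighten the penalty each outer iteration
```

The constraint violation of the subproblem minimizer shrinks like \( O(1/\mu) \), illustrating why \( \mu \) must be driven up, and also why the subproblems eventually become ill-conditioned.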
Log-Barrier Method
For inequality-constrained problems, the log-barrier function is
\[ B(x, \rho) = f(x) - \rho \sum_{i \in I} \log(-g_i(x)), \quad \rho > 0, \]
defined only on the strictly feasible interior \( \{x : g_i(x) < 0 \ \forall i\} \). As \( \rho \to 0 \), the barrier term \( -\rho \log(-u) \) approaches the indicator function of \( \{u \leq 0\} \), and decreasing \( \rho \) drives the barrier minimizers towards the boundary.
Setting \( \nabla_x B(x, \rho) = 0 \) gives the barrier stationarity condition
\[ \nabla f(x) - \sum_{i \in I} \frac{\rho}{g_i(x)} \nabla g_i(x) = 0. \]
Interior-Point Methods
Log-Barrier Central Path
Consider \( \min f(x) \) subject to \( g_i(x) \leq 0 \), \( i = 1, \ldots, m \), and \( Ax = b \), with barrier function
\[ B(x, \rho) = f(x) - \rho \sum_{i=1}^m \log(-g_i(x)). \]
For each \( \rho > 0 \), the minimizer \( x_\rho \) of \( B \) over \( \{x : Ax = b\} \) is the central point. The set of central points \( \{x_\rho : \rho > 0\} \) is the central path.
Setting \( \lambda_i(\rho) := -\rho / g_i(x_\rho) > 0 \), the stationarity condition at \( x_\rho \) reads
\[ \nabla f(x_\rho) + \sum_{i=1}^m \lambda_i(\rho)\nabla g_i(x_\rho) + A^\top \mu = 0. \]
These are the KKT conditions with an approximate complementarity \( -g_i(x_\rho)\lambda_i(\rho) = \rho \) (instead of 0). The central point is thus “nearly optimal” in a quantifiable sense.
Convergence: by weak duality, \( f(x_\rho) - p^* \leq \rho m \), so as \( \rho \to 0 \) the central points approach the primal optimum, with optimality gap proportional to \( \rho \).

Interior-Point Method for LP
For the LP \( \min c^\top x \) s.t. \( Ax = b,\ x \geq 0 \), with dual variables \( y \) and slacks \( s \), the perturbed KKT system is
\[ A^\top y + s - c = 0, \quad Ax - b = 0, \quad x_j s_j = \rho \quad \forall j, \]
where \( x > 0,\ s > 0 \). Primal-dual interior-point methods apply Newton’s method to this system for fixed \( \rho \), then decrease \( \rho \). Modern LP solvers (e.g., in MATLAB or Python) typically use this approach.
Newton’s Method for Equality Constraints
For \( \min f(x) \) s.t. \( Ax = b \), each iteration minimizes the quadratic model of \( f \) subject to the constraints:
\[ \min_{d} \, f(x) + \nabla f(x)^\top d + \frac{1}{2}d^\top \nabla^2 f(x) d \quad \text{s.t.} \quad A(x+d) = b. \]
For feasible \( x \), the optimality conditions form the KKT system
\[ \begin{pmatrix} \nabla^2 f(x) & A^\top \\ A & 0 \end{pmatrix} \begin{pmatrix} d \\ w \end{pmatrix} = \begin{pmatrix} -\nabla f(x) \\ 0 \end{pmatrix}, \]
and the Newton update is \( x \leftarrow x + d \). Under suitable conditions this converges superlinearly or quadratically, and combined with the log-barrier method it forms the backbone of modern interior-point algorithms.
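One step of this scheme on a tiny problem (quadratic objective, so a single Newton step from a feasible point is exact; data made up):

```python
import numpy as np

# min x1^2 + x2^2 + x3^2  s.t.  x1 + x2 + x3 = 1; solution (1/3, 1/3, 1/3).
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

x = np.array([1.0, 0.0, 0.0])     # feasible start: Ax = b
g = 2.0 * x                       # gradient of f
H = 2.0 * np.eye(3)               # Hessian of f

# Assemble and solve the KKT system  [H A'; A 0] [d; w] = [-g; 0].
K = np.block([[H, A.T], [A, np.zeros((1, 1))]])
rhs = np.concatenate([-g, np.zeros(1)])
d = np.linalg.solve(K, rhs)[:3]
x = x + d
```

The update lands exactly on the constrained minimizer while staying feasible, since \( Ad = 0 \) keeps \( Ax = b \).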
References: J. Nocedal and S. J. Wright, Numerical Optimization (2nd ed., Springer, 2006); S. Vavasis, CO367 Fall 2016 Lecture Notes; H. Wolkowicz, CO367 Fall 2015 Lecture Notes.