CO 770: Polynomial Optimization and Sum-of-Squares
Estimated study time: 1 hr 30 min
These notes synthesize material from G. Blekherman, P.A. Parrilo, and R.R. Thomas’s Semidefinite Optimization and Convex Algebraic Geometry, J.B. Lasserre’s An Introduction to Polynomial and Semi-Algebraic Optimization, M. Laurent’s survey on sums of squares and moment matrices, and P.A. Parrilo’s lecture notes, enriched with material from MIT 6.972 and Stanford CS369H.
1. Nonnegative Polynomials and Sums of Squares
The central question of polynomial optimization is disarmingly simple: given a polynomial \( p(x) \in \mathbb{R}[x_1, \ldots, x_n] \), determine \( p^* = \inf_{x \in \mathbb{R}^n} p(x) \). This problem, in its full generality, is NP-hard even for polynomials of degree four. Yet a remarkable thread running from Hilbert through Artin to Parrilo and Lasserre reveals that the algebraic structure of polynomials — particularly the relationship between nonnegativity and representation as sums of squares — gives rise to powerful convex relaxations that transform this intractable problem into a sequence of semidefinite programs.
We begin with the most fundamental objects: nonnegative polynomials and sums of squares.
1.1 Nonnegative Polynomials
A polynomial \( p \in \mathbb{R}[x_1, \ldots, x_n] \) is nonnegative (or positive semidefinite) if \( p(x) \geq 0 \) for all \( x \in \mathbb{R}^n \). We write \( P_{n,2d} \) for the cone of nonnegative polynomials in \( n \) variables of degree at most \( 2d \). (Note that a nonnegative polynomial must have even degree, since any odd-degree polynomial takes arbitrarily large negative values.)
The set \( P_{n,2d} \) is a closed, convex, full-dimensional cone in the vector space \( \mathbb{R}[x_1, \ldots, x_n]_{\leq 2d} \). Convexity is immediate: if \( p, q \geq 0 \) and \( \lambda \in [0,1] \), then \( \lambda p + (1-\lambda) q \geq 0 \) pointwise. Closedness follows from the fact that pointwise nonnegativity is preserved under limits in coefficient space.
However, testing membership in \( P_{n,2d} \) is in general co-NP-hard. Even for quartics (\( 2d = 4 \)) in several variables, deciding nonnegativity is computationally intractable. This motivates the search for tractable inner approximations.
1.2 Sums of Squares
A polynomial \( p \in \mathbb{R}[x_1, \ldots, x_n] \) is a sum of squares (SoS) if \( p = \sum_i q_i^2 \) for finitely many polynomials \( q_i \in \mathbb{R}[x_1, \ldots, x_n] \). We write \( \Sigma_{n,2d} \) for the cone of sum-of-squares polynomials in \( n \) variables of degree at most \( 2d \).
Every sum of squares is nonnegative, since each \( q_i(x)^2 \geq 0 \) pointwise. Hence \( \Sigma_{n,2d} \subseteq P_{n,2d} \). The fundamental question — first asked by Minkowski and answered in a landmark result by Hilbert — is: when does equality hold?
1.3 The Gram Matrix Method
The key to making the SoS condition computationally tractable is the Gram matrix representation. Suppose \( p = \sum_i q_i^2 \) with \( \deg q_i \leq d \), and write each \( q_i(x) = v_i^T [x]_d \), where \( [x]_d \) denotes the vector of all monomials of degree at most \( d \). Then
\[ p(x) = \sum_i \left( v_i^T [x]_d \right)^2 = [x]_d^T Q \, [x]_d, \]
where \( Q = \sum_i v_i v_i^T \succeq 0 \). The matrix \( Q \) is called a Gram matrix for \( p \).
Conversely, if \( p(x) = [x]_d^T Q \, [x]_d \) with \( Q \succeq 0 \), then \( Q \) has a Cholesky-type factorization \( Q = L^T L \), and setting \( w(x) = L [x]_d \), we obtain \( p(x) = w(x)^T w(x) = \sum_i w_i(x)^2 \).
The dimension of the monomial vector \( [x]_d \) is \( s = \binom{n+d}{d} \), so the Gram matrix \( Q \) is \( s \times s \). The condition \( p(x) = [x]_d^T Q \, [x]_d \) imposes linear constraints on \( Q \): the coefficient of each monomial \( x^\alpha \) in the expansion must equal the corresponding coefficient of \( p \). Together with \( Q \succeq 0 \), this is a semidefinite feasibility problem.
For example, take \( p(x) = x^4 - 2x^2 + 1 \) with \( [x]_2 = (1, x, x^2)^T \). One Gram matrix is
\[ Q = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{pmatrix}. \]
One can verify \( Q \succeq 0 \) (eigenvalues \( 0, 0, 2 \)), confirming that \( p = (1 - x^2)^2 \), a sum of one square.
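This check is easy to carry out numerically. A minimal sketch, assuming the monomial basis \( (1, x, x^2) \) and the Gram matrix of \( (1 - x^2)^2 \) with eigenvalues \( 0, 0, 2 \):

```python
import numpy as np

# Monomial basis [x]_2 = (1, x, x^2); a Gram matrix for p(x) = 1 - 2x^2 + x^4.
Q = np.array([[ 1.0, 0.0, -1.0],
              [ 0.0, 0.0,  0.0],
              [-1.0, 0.0,  1.0]])

# Positive semidefiniteness via eigenvalues.
eigs = np.sort(np.linalg.eigvalsh(Q))

# The coefficient of x^k in [x]_2^T Q [x]_2 is the antidiagonal sum over i + j = k.
coeffs = [sum(Q[i, k - i] for i in range(3) if 0 <= k - i <= 2) for k in range(5)]

print(eigs)    # eigenvalues 0, 0, 2
print(coeffs)  # coefficients of 1 - 2x^2 + x^4, lowest degree first
```

The antidiagonal-sum rule is exactly the coefficient-matching constraint described above, specialized to one variable.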
The Gram matrix is generally not unique. The set of valid Gram matrices for a given polynomial forms an affine section of the PSD cone, known as the Gram spectrahedron.
1.4 Hilbert’s 1888 Theorem
Theorem (Hilbert, 1888). Every nonnegative polynomial is a sum of squares, i.e., \( P_{n,2d} = \Sigma_{n,2d} \), if and only if one of the following holds:
- \( n = 1 \) (univariate polynomials of any degree),
- \( 2d = 2 \) (quadratic forms in any number of variables),
- \( n = 2, 2d = 4 \) (bivariate quartics).
This theorem is one of the most beautiful results in real algebraic geometry. Let us prove the “if” direction for each case.
Case 1: \( n = 1 \). A nonnegative univariate polynomial has real roots of even multiplicity, and its complex roots come in conjugate pairs. Grouping conjugate factors gives \( p(x) = c \prod_j (x - a_j)^{2m_j} \prod_k \left( (x - b_k)^2 + c_k^2 \right) \) with \( c > 0 \), and each factor is a sum of at most two squares. By the two-squares (Brahmagupta–Fibonacci) identity, the product of sums of two squares is again a sum of two squares. Therefore \( p \) is a sum of (at most two) squares.
Case 2: \( 2d = 2 \). A nonnegative quadratic form \( p(x) = x^T A x \) with \( A \) symmetric satisfies \( A \succeq 0 \). Writing \( A = L^T L \) (e.g., via Cholesky or an eigendecomposition), we get \( p(x) = \|Lx\|^2 = \sum_i (L_i x)^2 \), an explicit SoS decomposition.
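The decomposition in Case 2 can be reproduced numerically. A minimal sketch (the matrix \( A \) below is an arbitrary illustrative PSD choice; the eigendecomposition is used instead of plain Cholesky so that singular PSD matrices are also handled):

```python
import numpy as np

# A PSD matrix A defines the nonnegative quadratic form p(x) = x^T A x.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Factor A = L^T L via the eigendecomposition A = V diag(w) V^T,
# taking L = diag(sqrt(w)) V^T (valid for any PSD A).
w, V = np.linalg.eigh(A)
L = np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

x = np.array([0.7, -1.3])
form = float(x @ A @ x)            # x^T A x
sos = float(np.sum((L @ x) ** 2))  # sum_i (L_i x)^2

print(form, sos)  # the two values agree
```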
Case 3: \( n = 2, 2d = 4 \). This is the deepest of the three cases. Hilbert’s original proof used the theory of ternary quartics and a topological argument. We sketch a modern approach via the Gram matrix.
A nonnegative bivariate quartic \( p(x,y) \) (equivalently, after homogenization, a nonnegative ternary quartic form) has monomial vector \( [x,y]_2 = (1, x, y, x^2, xy, y^2)^T \), so its Gram matrix \( Q \) is \( 6 \times 6 \). After a generic linear change of variables and scaling, one may assume \( p \) is strictly positive away from finitely many zeros. A dimension count shows there is room for a PSD Gram matrix: the space of \( 6 \times 6 \) symmetric matrices has dimension \( \binom{7}{2} = 21 \), a ternary quartic has \( \binom{6}{2} = 15 \) coefficients, so the affine space of Gram matrices of \( p \) (whose intersection with the PSD cone is the Gram spectrahedron) has dimension \( 21 - 15 = 6 \). Hilbert showed that this affine space always meets the PSD cone, using a continuity argument on the space of ternary quartics. A complete modern proof appears in Blekherman-Parrilo-Thomas, Chapter 3.
The “only if” direction — that in all remaining cases there exist nonnegative polynomials that are not sums of squares — was first demonstrated by Hilbert through a non-constructive counting argument. The first explicit example came much later.
1.5 The Motzkin Polynomial
Theorem (Motzkin, 1967). The polynomial \( M(x,y) = x^4 y^2 + x^2 y^4 - 3 x^2 y^2 + 1 \) is nonnegative but not a sum of squares.
Nonnegativity. By the AM–GM inequality applied to the three terms \( x^4 y^2, x^2 y^4, 1 \),
\[ \frac{x^4 y^2 + x^2 y^4 + 1}{3} \geq \sqrt[3]{x^4 y^2 \cdot x^2 y^4 \cdot 1} = x^2 y^2, \]
hence \( M(x,y) = x^4 y^2 + x^2 y^4 + 1 - 3 x^2 y^2 \geq 0 \). Equality holds iff \( x^4 y^2 = x^2 y^4 = 1 \), i.e., at \( (x,y) = (\pm 1, \pm 1) \).
Not a sum of squares. Suppose for contradiction that \( M = \sum_i q_i^2 \). Since \( M \) has degree 6 in two variables, each \( q_i \) has degree at most 3, so \( q_i(x,y) = a_i + b_i x + c_i y + d_i x^2 + e_i xy + f_i y^2 + g_i x^3 + h_i x^2 y + k_i x y^2 + l_i y^3 \).
Since \( M(x, 0) = 1 \) for all \( x \), we need \( \sum_i q_i(x,0)^2 = 1 \). Each \( q_i(x,0) \) is a polynomial in \( x \) alone, and their squares sum to the constant 1. This forces each \( q_i(x,0) \) to be a constant, say \( q_i(x,0) = a_i \) with \( \sum_i a_i^2 = 1 \). In particular, the coefficients of \( x, x^2, x^3 \) in \( q_i(x,0) \) must vanish: \( b_i = d_i = g_i = 0 \).
By the same argument applied to \( M(0,y) = 1 \), we get \( c_i = f_i = l_i = 0 \).
After these vanishings, \( q_i(x,y) = a_i + e_i xy + h_i x^2 y + k_i x y^2 \), so
\[ q_i^2 = a_i^2 + 2a_i e_i \, xy + 2a_i h_i \, x^2 y + 2a_i k_i \, xy^2 + e_i^2 \, x^2 y^2 + \cdots, \]
where the omitted terms involve only the monomials \( x^3 y^2, x^2 y^3, x^3 y^3, x^4 y^2, x^2 y^4 \). In particular, the only contribution to the monomial \( x^2 y^2 \) in \( \sum_i q_i^2 \) is \( \sum_i e_i^2 \geq 0 \). But the coefficient of \( x^2 y^2 \) in \( M \) is \( -3 < 0 \), a contradiction. Therefore \( M \) is not a sum of squares.
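The nonnegativity half of the theorem is easy to sanity-check numerically (the SoS obstruction is of course a proof, not a computation). A quick grid check:

```python
# Numerical sanity check for the Motzkin polynomial
# M(x, y) = x^4 y^2 + x^2 y^4 + 1 - 3 x^2 y^2.

def motzkin(x, y):
    return x**4 * y**2 + x**2 * y**4 + 1 - 3 * x**2 * y**2

# Nonnegativity on a sample grid; the minimum value 0 is attained
# exactly at the four points (±1, ±1).
grid = [i / 10 for i in range(-30, 31)]
min_val = min(motzkin(x, y) for x in grid for y in grid)

print(min_val)
print([motzkin(sx, sy) for sx in (-1, 1) for sy in (-1, 1)])  # all zero
```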
The Motzkin polynomial occupies a special place in the history of real algebraic geometry. After Hilbert’s non-constructive proof in 1888, it took nearly 80 years before T.S. Motzkin found this concrete example in 1967. Shortly after, R.M. Robinson produced additional examples and systematic constructions. The existence of the “gap” between nonnegativity and SoS has profound consequences for optimization.
1.6 Degree and Dimension Bounds
For a polynomial in \( n \) variables of degree \( 2d \), the monomial vector \( [x]_d \) has length
\[ s(n,d) = \binom{n+d}{d}, \]
so the Gram matrix is \( s \times s \), and the SDP has \( \binom{s+1}{2} \) free entries in the matrix variable, subject to \( \binom{n+2d}{2d} \) linear equality constraints (one per monomial in \( p \)). For instance:
- \( n = 2, 2d = 6 \): \( s = 10 \), Gram matrix is \( 10 \times 10 \), and there are 28 monomial constraints.
- \( n = 5, 2d = 6 \): \( s = 56 \), Gram matrix is \( 56 \times 56 \).
- \( n = 10, 2d = 6 \): \( s = 286 \), which is already a large SDP.
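The dimension bookkeeping can be tabulated directly with `math.comb`; the \( (n, 2d) \) pairs below are illustrative:

```python
from math import comb

def gram_size(n, d):
    """Length of the monomial vector [x]_d in n variables: s = C(n + d, d)."""
    return comb(n + d, d)

def num_constraints(n, two_d):
    """Number of monomials of degree <= 2d in n variables (one equality each)."""
    return comb(n + two_d, two_d)

# Gram matrix sizes and constraint counts for a few (n, 2d) pairs.
for n, two_d in [(2, 6), (5, 4), (5, 6), (10, 4), (10, 6)]:
    s = gram_size(n, two_d // 2)
    print(n, two_d, s, num_constraints(n, two_d))
```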
The exponential growth in \( s(n,d) \) is the main computational bottleneck for SoS methods in high dimensions.
1.7 The Gap Between Nonnegativity and SoS
While the Motzkin polynomial shows that the gap exists, a natural quantitative question arises: how large is it? Blekherman (2006) proved that, in a precise asymptotic sense, the gap is vast. Specifically, he showed that as \( n \to \infty \) with \( 2d \geq 4 \) fixed, the ratio of the volumes of \( \Sigma_{n,2d} \) and \( P_{n,2d} \) (intersected with a normalization hyperplane) converges to zero exponentially fast. In other words, “most” nonnegative polynomials are not sums of squares. This is a striking contrast with the computational experience, where the SoS relaxation is often tight for structured optimization problems.
Despite this volumetric gap, the SoS relaxation is remarkably powerful in practice and in theory. The remainder of this course will explore exactly how and why.
2. Semidefinite Programming Relaxations
Having established that testing whether a polynomial is a sum of squares can be cast as a semidefinite program, we now develop the SDP machinery systematically. The connection between polynomial optimization and semidefinite programming, formalized independently by Shor, Parrilo, and Lasserre around 2000, is one of the most important developments in modern optimization.
2.1 Review of Semidefinite Programming
A semidefinite program in standard (primal) form is
\[ \begin{aligned} \text{(Primal)} \quad & \min_{X} \; \langle C, X \rangle \\ & \text{subject to } \langle A_i, X \rangle = b_i, \quad i = 1, \ldots, m, \\ & X \succeq 0, \end{aligned} \]
where \( X, C, A_1, \ldots, A_m \in \mathbb{S}^n \) (the space of \( n \times n \) real symmetric matrices), \( b \in \mathbb{R}^m \), and \( \langle A, B \rangle = \operatorname{tr}(AB) \).
The dual SDP is
\[ \begin{aligned} \text{(Dual)} \quad & \max_{y} b^T y \\ & \text{subject to } \sum_{i=1}^m y_i A_i + S = C, \\ & S \succeq 0. \end{aligned} \]Weak duality always holds: if \( X \) is primal feasible and \( (y, S) \) is dual feasible, then \( \langle C, X \rangle \geq b^T y \). The duality gap \( \langle C, X \rangle - b^T y = \langle S, X \rangle \geq 0 \) (since \( S, X \succeq 0 \) implies \( \operatorname{tr}(SX) \geq 0 \)).
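The key inequality behind weak duality, \( \operatorname{tr}(SX) \geq 0 \) for PSD \( S, X \), can be sanity-checked on random PSD pairs; the reason it holds is that writing \( X = B B^T \) gives \( \operatorname{tr}(S B B^T) = \operatorname{tr}(B^T S B) \geq 0 \):

```python
import numpy as np

rng = np.random.default_rng(0)

# tr(SX) >= 0 whenever S, X are PSD: with X = B B^T,
# tr(S B B^T) = tr(B^T S B), a sum of nonnegative diagonal entries.
for _ in range(100):
    S = (lambda M: M @ M.T)(rng.standard_normal((4, 4)))  # PSD by construction
    X = (lambda M: M @ M.T)(rng.standard_normal((4, 4)))  # PSD by construction
    gap = float(np.trace(S @ X))
    assert gap >= -1e-10

print("tr(SX) >= 0 confirmed on 100 random PSD pairs")
```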
Strong duality. If the primal SDP is strictly feasible (i.e., there exists \( X \succ 0 \) satisfying the constraints) or the dual is strictly feasible, then the duality gap is zero and the optimum is attained on the other side. More precisely (see also Ramana’s exact duality theory and the monograph of de Klerk):
- If the primal has a strictly feasible point, the dual optimum is attained and equals the primal optimal value.
- If the dual has a strictly feasible point, the primal optimum is attained and equals the dual optimal value.
SDPs generalize linear programs (LPs), which arise when all the matrices are diagonal so that the PSD constraint reduces to entrywise nonnegativity, and second-order cone programs (SOCPs). They can be solved in polynomial time to arbitrary precision via interior-point methods. In practice, solvers such as MOSEK, SeDuMi, and SDPA handle problems with matrix sizes in the hundreds or low thousands.
2.2 The Basic SoS Relaxation
Consider the unconstrained problem
\[ p^* = \inf_{x \in \mathbb{R}^n} p(x). \]
Equivalently, \( p^* \) is the largest constant that can be subtracted from \( p \) while preserving nonnegativity:
\[ p^* = \sup \{ \gamma \in \mathbb{R} : p - \gamma \in P_{n,2d} \}. \]
The SoS relaxation replaces the intractable cone \( P_{n,2d} \) by \( \Sigma_{n,2d} \):
\[ p^*_{\text{sos}} = \sup \{ \gamma \in \mathbb{R} : p - \gamma \in \Sigma_{n,2d} \}. \]
Since \( \Sigma_{n,2d} \subseteq P_{n,2d} \), we have \( p^*_{\text{sos}} \leq p^* \). This relaxation is an SDP: the condition \( p - \gamma \in \Sigma_{n,2d} \) is equivalent to the existence of a Gram matrix \( Q \succeq 0 \) such that \( p(x) - \gamma = [x]_d^T Q \, [x]_d \), which is a linear matrix inequality (LMI) in the variables \( (\gamma, Q) \).
Example. Take \( p(x) = x^4 + 2x^3 - x^2 - 2x + 1 \) with \( [x]_2 = (1, x, x^2)^T \), and seek the largest \( \gamma \) for which \( p(x) - \gamma = [x]_2^T Q \, [x]_2 \) with \( Q \succeq 0 \). Write \( Q = \begin{pmatrix} q_{00} & q_{01} & q_{02} \\ q_{01} & q_{11} & q_{12} \\ q_{02} & q_{12} & q_{22} \end{pmatrix} \). Matching coefficients:
- \( x^0 \): \( q_{00} = 1 - \gamma \)
- \( x^1 \): \( 2q_{01} = -2 \), so \( q_{01} = -1 \)
- \( x^2 \): \( q_{11} + 2q_{02} = -1 \)
- \( x^3 \): \( 2q_{12} = 2 \), so \( q_{12} = 1 \)
- \( x^4 \): \( q_{22} = 1 \)

Solving the resulting SDP gives \( \gamma^* = 0 \): taking \( q_{02} = -1 \) and \( q_{11} = 1 \), the matrix \( Q \) equals \( vv^T \) with \( v = (1, -1, -1)^T \), so \( p = (1 - x - x^2)^2 \geq 0 \) and the relaxation is exact, \( p^*_{\text{sos}} = p^* = 0 \).
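The coefficient matching above can be verified mechanically. A short check, assuming \( \gamma = 0 \) and the particular feasible choice \( q_{02} = -1 \), \( q_{11} = 1 \) (the matching equations pin the quartic down as \( p(x) = x^4 + 2x^3 - x^2 - 2x + 1 \)):

```python
import numpy as np

# Coefficient matching at gamma = 0: q00 = 1, q01 = -1, q12 = 1, q22 = 1,
# and q11 + 2 q02 = -1 solved by q02 = -1, q11 = 1.
Q = np.array([[ 1.0, -1.0, -1.0],
              [-1.0,  1.0,  1.0],
              [-1.0,  1.0,  1.0]])

# Q = v v^T with v = (1, -1, -1), hence PSD of rank one.
assert float(np.min(np.linalg.eigvalsh(Q))) > -1e-12

# Coefficient of x^k in [x]_2^T Q [x]_2 is the antidiagonal sum over i + j = k.
coeffs = [sum(Q[i, k - i] for i in range(3) if 0 <= k - i <= 2) for k in range(5)]
print(coeffs)  # coefficients of p, lowest degree first
```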
2.3 Constrained Polynomial Optimization and Lasserre’s Hierarchy
\[ p^* = \inf \{ p(x) : g_j(x) \geq 0, \, j = 1, \ldots, m \}, \]where each \( g_j \in \mathbb{R}[x_1, \ldots, x_n] \). The unconstrained SoS relaxation does not account for the constraints \( g_j \geq 0 \). To incorporate them, we use weighted SoS decompositions.
To this end, define the quadratic module generated by \( g_1, \ldots, g_m \):
\[ M(g_1, \ldots, g_m) = \left\{ \sigma_0 + \sum_{j=1}^m \sigma_j g_j : \sigma_0, \sigma_1, \ldots, \sigma_m \in \Sigma_{n,*} \right\}, \]
where \( \Sigma_{n,*} \) denotes the set of all SoS polynomials (of any degree). Elements of \( M(g_1, \ldots, g_m) \) are nonnegative on the feasible set \( K = \{ x : g_j(x) \geq 0 \; \forall j \} \), since each \( \sigma_j \) is nonnegative everywhere.
The level-\( r \) Lasserre (SoS) relaxation is
\[ p^*_r = \sup \left\{ \gamma : p - \gamma = \sigma_0 + \sum_{j=1}^m \sigma_j g_j, \;\; \sigma_j \in \Sigma_{n,*}, \;\; \deg(\sigma_0) \leq 2r, \; \deg(\sigma_j g_j) \leq 2r \right\}. \]
Here the degree bounds on \( \sigma_j \) ensure that each summand has degree at most \( 2r \), so each level is an SDP of finite size, and \( p^*_r \leq p^*_{r+1} \leq p^* \).
Convergence of Lasserre’s hierarchy (Lasserre, 2001). If the quadratic module \( M(g_1, \ldots, g_m) \) is Archimedean — meaning there exists \( N > 0 \) such that \( N - \|x\|^2 \in M(g_1, \ldots, g_m) \) — then \( p^*_r \to p^* \) as \( r \to \infty \).
The proof rests on Putinar’s Positivstellensatz (which we prove in Chapter 3): if a polynomial is strictly positive on \( K \) and the quadratic module is Archimedean, then it belongs to \( M(g_1, \ldots, g_m) \). Now \( p - p^* \geq 0 \) on \( K \), but it may vanish there, so we perturb: for any \( \varepsilon > 0 \), the polynomial \( p - (p^* - \varepsilon) = (p - p^*) + \varepsilon \) is strictly positive on \( K \), hence lies in \( M(g_1, \ldots, g_m) \) with some finite degree bound \( 2r(\varepsilon) \) on the multipliers. This gives \( p^*_r \geq p^* - \varepsilon \) for all \( r \geq r(\varepsilon) \), and since \( p^*_r \leq p^* \) always, we conclude \( p^*_r \to p^* \).
The Archimedean condition is a compactness assumption: it implies that the feasible set \( K \) is compact (it is automatically closed, and the condition forces it to be bounded). When \( K \) is compact, one can always add a redundant constraint \( g_{m+1}(x) = N - \|x\|^2 \geq 0 \) for large enough \( N \) to ensure the Archimedean property.
2.4 Finite Convergence and Flat Extensions
A remarkable feature of the Lasserre hierarchy is that, under certain genericity conditions, it converges in finitely many steps.
Finite convergence (Nie, 2014; Laurent, 2009). Suppose the feasible set \( K \) is compact. If standard constraint qualifications hold at every global minimizer (linear independence of the active constraint gradients, strict complementarity, and second-order sufficiency), then there exists a finite \( r^* \) such that \( p^*_r = p^* \) for all \( r \geq r^* \).
The mechanism for finite convergence is the flat extension condition on the dual moment matrix, which we explore in Chapter 4. The dual of the Lasserre hierarchy involves moment matrices, and the flat extension condition provides a criterion for extracting global optimizers from the SDP solution.
2.5 Comparison with LP and SOCP Relaxations
Before semidefinite programming entered the picture, polynomial optimization was approached through linear programming relaxations (e.g., the reformulation-linearization technique of Sherali and Adams) and second-order cone programming (SOCP). The key comparison:
LP relaxations replace the PSD cone with the nonnegative orthant. They are computationally cheaper but provide weaker bounds. The Sherali-Adams LP hierarchy requires \( \Omega(n) \) rounds to certify integrality in many cases.
SOCP relaxations use the second-order cone \( \{ (t, x) : t \geq \|x\| \} \) and are intermediate in both cost and strength. They can capture some quadratic structure but miss the full power of SoS.
SDP/SoS relaxations are the strongest among these, at the cost of larger problem sizes. The Lasserre hierarchy subsumes both the LP hierarchy and provides provably tighter bounds.
For any relaxation order \( r \), the bound from the Lasserre SDP hierarchy is at least as tight as the Sherali-Adams LP bound of the same order.
This follows from the fact that every Sherali-Adams certificate, a nonnegative combination of products of constraints, is a special case of an SoS certificate in which the SoS multipliers are restricted to nonnegative constants.
3. The Positivstellensatz and Real Algebraic Geometry
The convergence of the Lasserre hierarchy depends on deep results from real algebraic geometry. These results — the various Positivstellensätze — provide algebraic certificates for the positivity of a polynomial on a semialgebraic set. In this chapter we develop the necessary theory.
3.1 Real Algebraic Varieties and Semialgebraic Sets
where \( g_j, h_k \in \mathbb{R}[x_1, \ldots, x_n] \). A semialgebraic set is a finite Boolean combination (union, intersection, complement) of basic semialgebraic sets.
By the Tarski-Seidenberg theorem, the projection of a semialgebraic set is again semialgebraic. This quantifier elimination result is a cornerstone of real algebraic geometry.
3.2 Stengle’s Positivstellensatz
The first Positivstellensatz we present is the most general, but also the least constructive.
Given \( g_1, \ldots, g_m \), define the preordering
\[ T(g_1, \ldots, g_m) = \left\{ \sum_{S \subseteq \{1, \ldots, m\}} \sigma_S \prod_{j \in S} g_j : \sigma_S \in \Sigma \right\}, \]
where \( \Sigma \) denotes the set of SoS polynomials. This is the smallest preordering of \( \mathbb{R}[x] \) containing \( g_1, \ldots, g_m \).
Stengle’s Positivstellensatz (1974). Let \( g_1, \ldots, g_m \in \mathbb{R}[x_1, \ldots, x_n] \) and \( K = \{ x \in \mathbb{R}^n : g_j(x) \geq 0 \; \forall j \} \). Then:
- \( K = \emptyset \) if and only if \( -1 \in T(g_1, \ldots, g_m) \).
- For \( f \in \mathbb{R}[x] \), \( f > 0 \) on \( K \) if and only if there exist \( p, q \in T(g_1, \ldots, g_m) \) such that \( pf = 1 + q \).
- \( f \geq 0 \) on \( K \) if and only if there exist \( p, q \in T(g_1, \ldots, g_m) \) and \( k \in \mathbb{N} \) such that \( pf = f^{2k} + q \).
Stengle’s result is beautiful but has a significant computational drawback: the degrees of the certificates \( p \) and \( q \) are not bounded a priori, and the representation involves products of the \( g_j \) (leading to exponentially many terms in \( m \)). For optimization purposes, we need representations that are more structured.
3.3 Putinar’s Positivstellensatz
The quadratic module \( M(g_1, \ldots, g_m) \) is Archimedean if there exists \( N > 0 \) such that the polynomial \( N - \sum_{i=1}^n x_i^2 \) belongs to \( M(g_1, \ldots, g_m) \). Equivalently, the quadratic module contains a ball constraint certifying that \( K \) lies in the ball of radius \( \sqrt{N} \).
If \( M(g_1, \ldots, g_m) \) is Archimedean, then the associated semialgebraic set \( K = \{ x : g_j(x) \geq 0 \; \forall j \} \) is compact. The converse does not hold in general, but if \( K \) is compact, one can always add \( g_{m+1}(x) = N - \|x\|^2 \) for large enough \( N \) to make the quadratic module Archimedean.
Putinar’s Positivstellensatz (1993). Suppose \( M(g_1, \ldots, g_m) \) is Archimedean. If \( f \in \mathbb{R}[x] \) is strictly positive on \( K \), then \( f \in M(g_1, \ldots, g_m) \); that is,
\[ f = \sigma_0 + \sum_{j=1}^m \sigma_j g_j \]
for some SoS polynomials \( \sigma_0, \sigma_1, \ldots, \sigma_m \).
Proof sketch. The proof uses functional analysis, specifically the Riesz representation theorem and properties of positive linear functionals.
Step 1. Consider the set \( C = \{ L \in \mathbb{R}[x]^* : L(1) = 1, \, L(\sigma) \geq 0 \; \forall \sigma \in \Sigma, \, L(\sigma g_j) \geq 0 \; \forall \sigma \in \Sigma, \forall j \} \). Elements of \( C \) are positive linear functionals on the quadratic module. By Haviland’s theorem, each such \( L \) is integration against a Borel probability measure supported on \( K \).
Step 2. The Archimedean property ensures that \( C \) is compact (in the weak-* topology) and that the linear functionals in \( C \) are determined by their values on polynomials of bounded degree.
Step 3. If \( f > 0 \) on \( K \), then \( L(f) > 0 \) for all \( L \in C \) (since \( L \) is integration against a probability measure on \( K \)). By a separation argument (Hahn-Banach), \( f \) belongs to the closure of \( M(g_1, \ldots, g_m) \). The Archimedean property then ensures that the quadratic module is closed, giving the representation.
The full proof requires careful handling of the topology and uses Jacobi’s representation theorem. See Blekherman-Parrilo-Thomas, Chapter 3, or Marshall’s Positive Polynomials and Sums of Squares.
The significance of Putinar’s theorem for optimization is immediate: it tells us that the Lasserre hierarchy converges, since the hierarchy searches for representations of the form \( f - \gamma = \sigma_0 + \sum_j \sigma_j g_j \) with bounded degree SoS multipliers.
3.4 Schmüdgen’s Theorem
Schmüdgen’s theorem preceded Putinar’s and uses the preordering instead of the quadratic module.
Schmüdgen’s Positivstellensatz (1991). Let \( g_1, \ldots, g_m \in \mathbb{R}[x_1, \ldots, x_n] \) and suppose \( K = \{ x : g_j(x) \geq 0 \; \forall j \} \) is compact. If \( f \in \mathbb{R}[x] \) is strictly positive on \( K \), then \( f \in T(g_1, \ldots, g_m) \).
Schmüdgen’s result has the advantage of requiring only compactness (not the Archimedean property, though compactness implies the preordering is Archimedean) and produces a certificate in the preordering. However, the preordering has \( 2^m \) generators (all products \( \prod_{j \in S} g_j \) for \( S \subseteq \{1, \ldots, m\} \)), making it computationally more expensive than the quadratic module representation used by Putinar.
3.5 Certificates of Positivity and SDP
The various Positivstellensätze can be organized by their computational implications:
Hierarchy of representations:
- Stengle's Positivstellensatz: Most general (works for \( f > 0, f \geq 0, K = \emptyset \)), but uses the preordering with unbounded degrees and multiplicative certificates. Not directly useful for SDP.
- Schmüdgen's theorem: Requires \( K \) compact and \( f > 0 \) on \( K \). Uses the preordering with finitely bounded degrees. Leads to an SDP with \( 2^m \) SoS blocks.
- Putinar's theorem: Requires Archimedean quadratic module and \( f > 0 \) on \( K \). Uses the quadratic module with finitely bounded degrees. Leads to an SDP with \( m + 1 \) SoS blocks — this is the basis of the Lasserre hierarchy.
3.6 Historical Context
The quest for algebraic certificates of nonnegativity has a rich history. Hilbert’s 17th problem (1900) asked whether every nonnegative polynomial is a sum of squares of rational functions. Artin (1927) answered affirmatively, but the proof was non-constructive. Krivine (1964) and Stengle (1974) developed the Positivstellensatz in the framework of ordered fields. Schmüdgen (1991) and Putinar (1993) provided the key refinements that enable computational applications. Shor (1987), Parrilo (2000), and Lasserre (2001) independently recognized the connection to semidefinite programming, launching the modern era of polynomial optimization.
4. Moment Problems and Duality
The SoS/SDP approach to polynomial optimization has a beautiful dual interpretation through the theory of moments. This duality — between algebraic certificates of positivity (SoS side) and moment sequences (dual side) — is at the heart of the Lasserre hierarchy. In this chapter, we develop the moment-theoretic perspective.
4.1 The Classical Moment Problem
The moment problem asks: given a sequence \( y = (y_\alpha) \) indexed by multi-indices \( \alpha \), when does there exist a nonnegative measure \( \mu \) with \( y_\alpha = \int x^\alpha \, d\mu \) for all \( \alpha \)?
The classical moment problems are named for their domain:
- Hamburger moment problem (\( \mathbb{R} \)): characterize sequences that are moments of a measure on the entire real line.
- Stieltjes moment problem (\( [0, \infty) \)): characterize moments of a measure on the half-line.
- Hausdorff moment problem (\( [0, 1] \)): characterize moments of a measure on a bounded interval.
Theorem (Hamburger). A real sequence \( (y_k)_{k \geq 0} \) is the moment sequence of a nonnegative measure on \( \mathbb{R} \) if and only if the infinite Hankel matrix \( H = (y_{i+j})_{i,j \geq 0} \) is positive semidefinite.
Necessity is immediate: if \( y_k = \int x^k \, d\mu \), then for any finitely supported vector \( p \),
\[ p^T H p = \sum_{i,j} p_i p_j \, y_{i+j} = \int \left( \sum_i p_i x^i \right)^2 d\mu \geq 0. \]
So \( H \succeq 0 \).
Conversely, if \( H \succeq 0 \), the positive linear functional \( L(p) = \sum_\alpha p_\alpha y_\alpha \) satisfies \( L(p^2) \geq 0 \) for all polynomials \( p \). By the Riesz-Haviland theorem, \( L \) is represented by a positive measure, i.e., there exists \( \mu \geq 0 \) with \( y_k = \int x^k \, d\mu \).
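A concrete instance of the Hankel condition: for Lebesgue measure on \( [0,1] \) the moments are \( y_k = \int_0^1 x^k \, dx = 1/(k+1) \), so the truncated Hankel matrix is the Hilbert matrix, which must be (and is) positive definite:

```python
import numpy as np

# Moments of Lebesgue measure on [0, 1]: y_k = 1 / (k + 1).
# The truncated Hankel matrix H_ij = y_{i+j} is the Hilbert matrix.
d = 5
H = np.array([[1.0 / (i + j + 1) for j in range(d)] for i in range(d)])

min_eig = float(np.min(np.linalg.eigvalsh(H)))
print(min_eig)  # strictly positive: the measure has infinite support
```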
4.2 The Moment Matrix
For the multivariate setting relevant to polynomial optimization, we use the following construction.
In other words, if we list all monomials of degree \( \leq d \) as \( x^{\alpha_1}, \ldots, x^{\alpha_s} \) (with \( s = \binom{n+d}{d} \)), then \( M_d(y) \) is the \( s \times s \) matrix whose \( (i,j) \)-entry is \( y_{\alpha_i + \alpha_j} \). The moment matrix is the natural dual object to the Gram matrix.
If \( y \) has a representing measure \( \mu \), then \( M_d(y) \succeq 0 \): indeed,
\[ v^T M_d(y) \, v = \sum_{\alpha, \beta} v_\alpha v_\beta \, y_{\alpha + \beta} = \int \left( \sum_\alpha v_\alpha x^\alpha \right)^2 d\mu \geq 0 \]
for all \( v \in \mathbb{R}^s \).
4.3 Localizing Matrices
To handle constraints \( g_j(x) \geq 0 \), we introduce localizing matrices.
For a polynomial \( g(x) = \sum_\beta g_\beta x^\beta \), define the shifted sequence \( (g \cdot y)_\alpha = \sum_\beta g_\beta \, y_{\alpha + \beta} \). The localizing matrix \( M_{d - d_g}(g \cdot y) \), with \( d_g = \lceil \deg(g)/2 \rceil \), is the moment matrix of the shifted sequence.
If \( y_\alpha = \int x^\alpha \, d\mu \) for a measure \( \mu \) supported on \( \{ x : g(x) \geq 0 \} \), then \( M_{d-d_g}(g \cdot y) \succeq 0 \): for any \( v \),
\[ v^T M_{d - d_g}(g \cdot y) \, v = \int g(x) \left( \sum_\alpha v_\alpha x^\alpha \right)^2 d\mu \geq 0, \]
since \( g(x) \geq 0 \) on the support of \( \mu \) and \( \left( \sum_\alpha v_\alpha x^\alpha \right)^2 \geq 0 \) everywhere.
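These PSD conditions can be checked on a toy atomic measure. A minimal sketch, assuming \( \mu = \delta_{(1/2,\,0)} \), which is supported inside the unit disk \( \{ g \geq 0 \} \) for \( g(x) = 1 - x_1^2 - x_2^2 \):

```python
import numpy as np

# Moments of mu = Dirac mass at (0.5, 0): y_a = 0.5^{a1} * 0^{a2}.
def y(a1, a2):
    return 0.5 ** a1 * 0.0 ** a2  # note 0.0 ** 0 == 1.0 in Python

# Moment matrix M_1(y) in the monomial basis (1, x1, x2).
exps = [(0, 0), (1, 0), (0, 1)]
M1 = np.array([[y(a[0] + b[0], a[1] + b[1]) for b in exps] for a in exps])

# Localizing matrix for g = 1 - x1^2 - x2^2 at this level is the scalar
# (g . y)_0 = y_00 - y_20 - y_02.
loc = y(0, 0) - y(2, 0) - y(0, 2)

assert float(np.min(np.linalg.eigvalsh(M1))) > -1e-12  # M_1(y) is PSD
print(loc)  # 1 - 0.25 - 0 = 0.75 >= 0, consistent with support in the disk
```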
4.4 Lasserre’s Moment–SoS Duality
Recall the level-\( r \) SoS relaxation
\[ p^*_r = \sup \left\{ \gamma : p - \gamma = \sigma_0 + \sum_j \sigma_j g_j, \; \deg(\sigma_0) \leq 2r, \; \deg(\sigma_j g_j) \leq 2r \right\}. \]
Its dual is the moment relaxation
\[ p^{\text{mom}}_r = \inf \left\{ L_y(p) : y_0 = 1, \; M_r(y) \succeq 0, \; M_{r - d_j}(g_j \cdot y) \succeq 0, \; j = 1, \ldots, m \right\}, \]
where \( d_j = \lceil \deg(g_j)/2 \rceil \) and \( L_y(p) = \sum_\alpha p_\alpha y_\alpha \) is the linear functional associated to the sequence \( y \).
Moment-SoS duality. The primal and dual problems above are SDP duals of each other. Under Slater’s condition (existence of a strictly feasible point in one of the problems), strong duality holds and the optimal values coincide.
Proof sketch. On the SoS side, write \( \sigma_0 = [x]_r^T Q_0 [x]_r \) and \( \sigma_j = [x]_{r - d_j}^T Q_j [x]_{r - d_j} \) with \( Q_j \succeq 0 \). The coefficient-matching constraints from \( p - \gamma = \sigma_0 + \sum_j \sigma_j g_j \) are linear in \( (\gamma, Q_0, \ldots, Q_m) \). This is a standard SDP.
Its Lagrangian dual introduces multipliers \( y_\alpha \) for each coefficient constraint. After collecting terms, the dual objective becomes \( \sum_\alpha p_\alpha y_\alpha = L_y(p) \), and the dual PSD constraints become precisely \( M_r(y) \succeq 0 \) (dual to \( Q_0 \succeq 0 \)) and \( M_{r-d_j}(g_j \cdot y) \succeq 0 \) (dual to \( Q_j \succeq 0 \)).
4.5 The Truncated Moment Problem and Flat Extensions
The moment problem for the Lasserre hierarchy is a truncated moment problem: we only have finitely many moments \( (y_\alpha)_{|\alpha| \leq 2r} \). The key question is: when does a truncated moment sequence come from an actual measure?
Flat extension theorem (Curto–Fialkow). Suppose \( M_r(y) \succeq 0 \) satisfies the rank (flatness) condition \( \operatorname{rank} M_r(y) = \operatorname{rank} M_{r-1}(y) \). Then \( y \) admits a representing measure of the form
\[ \mu = \sum_{i=1}^k w_i \, \delta_{x_i}, \]
where \( k = \operatorname{rank} M_r(y) \), \( w_i > 0 \), and \( x_1, \ldots, x_k \in \mathbb{R}^n \).
The flat extension condition is both necessary and sufficient for representing \( y \) by a finitely-atomic measure with the minimum number of atoms. The atomic measure is unique, and the atoms \( x_i \) can be computed from the column space of \( M_r(y) \) via a Schur-like decomposition or simultaneous diagonalization of multiplication operators.
4.6 Extracting Optimal Solutions
The flat extension theorem gives a practical method for extracting optimal solutions from the Lasserre hierarchy:
Suppose the optimal moment sequence \( y^* \) of the level-\( r \) moment relaxation satisfies the rank condition
\[ \operatorname{rank} M_r(y^*) = \operatorname{rank} M_{r - d_0}(y^*), \]
where \( d_0 = \lceil \deg(p)/2 \rceil \). Then the flat extension theorem applies, and the global optimizers of the original polynomial optimization problem can be extracted as the atoms of the representing measure.
Example. Minimize \( p(x) = x_1^2 + x_2^2 \) over \( K = \{ x : 1 - x_1^2 - x_2^2 \geq 0 \} \). At level \( r = 1 \), the moment matrix in the basis \( (1, x_1, x_2) \) is
\[ M_1(y) = \begin{pmatrix} 1 & y_{10} & y_{01} \\ y_{10} & y_{20} & y_{11} \\ y_{01} & y_{11} & y_{02} \end{pmatrix}, \]
and the localizing matrix for \( g_1 = 1 - x_1^2 - x_2^2 \) is the scalar \( 1 - y_{20} - y_{02} \geq 0 \). The dual objective is \( L_y(p) = y_{20} + y_{02} \). Minimizing subject to the PSD and localizing constraints yields \( y_{10} = y_{01} = y_{11} = 0 \), \( y_{20} = y_{02} = 0 \), giving \( M_1(y^*) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \), which has rank 1. The flat extension condition is satisfied, and the unique representing measure is \( \delta_{(0,0)} \), the Dirac mass at the origin — confirming the optimizer \( x^* = (0,0) \).
4.7 The Rank Condition in Practice
In numerical SDP solutions, exact rank detection is impossible, so one uses a numerical rank threshold. If the singular values of \( M_r(y^*) \) drop sharply (say from \( 10^{-1} \) to \( 10^{-8} \)), one identifies the numerical rank as the number of singular values above the threshold. Software packages such as GloptiPoly (Henrion, Lasserre, and Löfberg) automate this process.
The atoms (global optimizers) are then extracted via the following procedure:
- Compute a basis for the column space of \( M_r(y^*) \).
- Form the multiplication matrices \( N_i \) for each variable \( x_i \) (defined by the column relations of \( M_r(y^*) \)).
- Simultaneously diagonalize the \( N_i \) to find the common eigenvectors, which yield the optimizer coordinates.
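The steps above can be illustrated on a univariate toy problem with two atoms at \( \pm 1 \); here the multiplication matrix is read off by hand from the column relations (in the column basis \( \{1, x\} \), the relation \( x^2 \equiv 1 \) holds because the \( x^2 \) column of \( M_2 \) equals the \( 1 \) column):

```python
import numpy as np

# Toy extraction for mu = (1/2) delta_{-1} + (1/2) delta_{+1} on the line.
# Moments: y_k = (1 + (-1)^k) / 2, i.e., y = (1, 0, 1, 0, 1).
y = [1.0, 0.0, 1.0, 0.0, 1.0]

M2 = np.array([[y[i + j] for j in range(3)] for i in range(3)])  # basis (1, x, x^2)
M1 = M2[:2, :2]

# Flatness: rank M_2 = rank M_1 = 2 (= number of atoms).
assert np.linalg.matrix_rank(M2) == np.linalg.matrix_rank(M1) == 2

# Multiplication by x in the column basis {1, x}: 1 -> x, x -> x^2 ≡ 1.
N = np.array([[0.0, 1.0],
              [1.0, 0.0]])

atoms = sorted(np.linalg.eigvals(N).real)
print(atoms)  # the atoms of the representing measure
```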
5. LP/SDP Hierarchies in Combinatorial Optimization
The algebraic and convex-analytic machinery developed in the preceding chapters finds powerful applications in combinatorial optimization. Here, the polynomial optimization perspective provides a systematic way to generate increasingly tight convex relaxations of NP-hard discrete problems. We begin with the Lovász theta function — perhaps the most elegant single SDP relaxation in combinatorics — and proceed to the systematic hierarchies of Lovász-Schrijver, Sherali-Adams, and Lasserre.
5.1 The Lovász Theta Function
Let \( G = (V, E) \) be a graph with vertex set \( V = \{1, \ldots, n\} \). The stability number \( \alpha(G) \) (maximum independent set size) is NP-hard to compute.
The Lovász theta function of \( G \) is defined by the SDP
\[ \vartheta(G) = \max \left\{ \langle J, X \rangle : \operatorname{tr}(X) = 1, \; X_{ij} = 0 \text{ for all } \{i,j\} \in E, \; X \succeq 0 \right\}, \]
where \( J \) is the \( n \times n \) all-ones matrix.
It satisfies the sandwich inequalities
\[ \alpha(G) \leq \vartheta(G) \leq \bar{\chi}(G), \]
where \( \bar{\chi}(G) \) is the fractional chromatic number of the complement graph.
\( \alpha(G) \leq \vartheta(G) \): Let \( S \subseteq V \) be a maximum independent set of size \( \alpha(G) \). Define \( X = \frac{1}{|S|} \chi_S \chi_S^T \), where \( \chi_S \) is the characteristic vector of \( S \). Then:
- \( X \succeq 0 \) (rank-one PSD matrix).
- \( \operatorname{tr}(X) = \frac{|S|}{|S|} = 1 \).
- If \( \{i,j\} \in E \), then \( i \) and \( j \) cannot both be in \( S \) (since \( S \) is independent), so \( X_{ij} = 0 \).
- \( \langle J, X \rangle = \frac{1}{|S|} (\chi_S^T \mathbf{1})^2 = \frac{|S|^2}{|S|} = |S| = \alpha(G) \).
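The feasibility argument above is concrete enough to run. A small check on the 5-cycle \( C_5 \), using the independent set \( S = \{0, 2\} \) (vertex labels are an illustrative 0-based choice):

```python
import numpy as np

# Feasible point of the theta SDP for C_5, built from the independent set
# S = {0, 2} as X = (1/|S|) * chi_S chi_S^T.
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
chi = np.zeros(n)
chi[[0, 2]] = 1.0
X = np.outer(chi, chi) / 2.0

assert abs(np.trace(X) - 1.0) < 1e-12           # tr(X) = 1
assert all(X[i, j] == 0.0 for i, j in edges)    # X_ij = 0 on edges
value = float(np.ones(n) @ X @ np.ones(n))      # <J, X> = |S|
print(value)  # 2.0 = alpha(C_5), while theta(C_5) = sqrt(5) ≈ 2.236
```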
\( \vartheta(G) \leq \bar{\chi}(G) \): by SDP duality, \( \vartheta(G) \) can also be expressed as
\[ \vartheta(G) = \min \left\{ \lambda_{\max}(A) : A \in \mathbb{S}^n, \; A_{ij} = 1 \text{ whenever } i = j \text{ or } \{i,j\} \notin E \right\}, \]
where \( \lambda_{\max}(A) \) is the largest eigenvalue. One can show that this is at most the fractional chromatic number of the complement, using the LP relaxation of the chromatic number as an upper bound.
The theta function is remarkable because it can be computed in polynomial time (via SDP), yet it sits “between” two NP-hard graph parameters. For perfect graphs, \( \vartheta(G) = \alpha(G) \), and combined with the ellipsoid method this yields polynomial-time algorithms for the maximum independent set in perfect graphs (Grötschel, Lovász, and Schrijver). Lovász used \( \vartheta \) to settle the Shannon capacity of the pentagon: \( \Theta(C_5) = \sqrt{5} \), where \( C_5 \) is the 5-cycle.
5.2 The Lovász-Schrijver Hierarchy
Lovász and Schrijver (1991) introduced a systematic method to tighten LP relaxations of combinatorial optimization problems.
The idea is to lift a point \( x \in P \subseteq [0,1]^n \) to a symmetric \( (n+1) \times (n+1) \) matrix \( Y \) (with rows and columns indexed by \( \{0, 1, \ldots, n\} \)) that behaves like the moment matrix \( \binom{1}{x}\binom{1}{x}^T \), and to impose on \( Y \) constraints that are valid for all 0/1 points of \( P \).
More precisely, let \( \tilde{P} = \operatorname{cone}\{ (1, x) : x \in P \} \subseteq \mathbb{R}^{n+1} \). Then \( N(P) \) is the set of \( x \in \mathbb{R}^n \) such that there exists a symmetric matrix \( Y \) with:
- \( Y e_0 = (1, x)^T \) (the first column encodes \( x \)),
- \( \operatorname{diag}(Y) = Y e_0 \) (diagonal entries equal the first column, reflecting \( x_i^2 = x_i \) at 0/1 points),
- \( Y e_i \in \tilde{P} \) and \( Y (e_0 - e_i) \in \tilde{P} \) for each \( i = 1, \ldots, n \).
The strengthened operator \( N_+(P) \) additionally requires \( Y \succeq 0 \).
One round of the operator satisfies
\[ P_I \subseteq N_+(P) \subseteq N(P) \subseteq P, \]
where \( P_I = \operatorname{conv}(P \cap \{0,1\}^n) \) is the integer hull. Moreover, \( N^n(P) = P_I \) and \( N_+^n(P) = P_I \) (after at most \( n \) rounds).
5.3 The Sherali-Adams Hierarchy
The Sherali-Adams hierarchy (1990) is a purely LP-based lift-and-project method.
At level \( r \), each inequality defining \( P \) is multiplied by every product \( \prod_{i \in I} x_i \prod_{j \in J} (1 - x_j) \) with \( |I| + |J| \leq r \). After expanding products and linearizing (replacing each product \( \prod_{i \in U} x_i \) by a variable \( y_U \), using \( x_i^2 = x_i \)), this becomes a system of linear inequalities in the variables \( y_S \).
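The linearization step can be sketched in a few lines; multilinear polynomials over 0/1 variables are stored as maps from index sets to coefficients (a hedged illustration, not a full Sherali-Adams implementation):

```python
# Multilinear polynomials over 0/1 variables as {frozenset_of_indices: coeff};
# multiplying by x_k uses the Boolean identity x_k^2 = x_k.

def multiply_by_var(poly, k):
    out = {}
    for mono, c in poly.items():
        key = frozenset(mono | {k})       # x_k * prod_{i in U} x_i, with x_k^2 = x_k
        out[key] = out.get(key, 0) + c
    return out

# The constraint 1 - x1 - x2 >= 0, multiplied by x3, then linearized:
# each monomial prod_{i in U} x_i becomes the lifted variable y_U.
constraint = {frozenset(): 1, frozenset({1}): -1, frozenset({2}): -1}
lifted = multiply_by_var(constraint, 3)

print(lifted)  # represents y_{3} - y_{1,3} - y_{2,3} >= 0
```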
\( \text{SA}_n(P) = P_I \) for any 0/1 polytope \( P \subseteq [0,1]^n \).
5.4 The Lasserre/SoS Hierarchy for Combinatorial Problems
The Lasserre hierarchy from Chapter 2, when specialized to 0/1 optimization, gives the strongest known hierarchy.
For a 0/1 optimization problem \( \max\{ c^T x : x \in P \cap \{0,1\}^n \} \), we can write it as a polynomial optimization problem with constraints \( x_i^2 = x_i \) (Boolean constraints) and the linear constraints defining \( P \). The Lasserre hierarchy at level \( r \) introduces a pseudo-expectation operator (moment matrix) and imposes PSD constraints.
That is, the Lasserre hierarchy at level \( r \) is at least as tight as \( r \) rounds of \( N_+ \), which is at least as tight as \( r \) rounds of Sherali-Adams.
Proof sketch. The containment \( N_+^r(P) \subseteq \text{SA}_r(P) \) follows from the fact that the \( N_+ \) operator imposes a PSD constraint on the lifting matrix, while Sherali-Adams only imposes nonnegativity constraints on the lifted variables. Every PSD matrix has nonnegative diagonal entries and satisfies the Cauchy-Schwarz inequality on its \( 2 \times 2 \) minors, and these conditions imply the Sherali-Adams constraints.
The containment \( \text{Lass}_r(P) \subseteq N_+^r(P) \) is more subtle. The Lasserre hierarchy at level \( r \) imposes PSD constraints on the full moment matrix \( M_r(y) \), while \( r \) rounds of \( N_+ \) only impose PSD constraints on certain principal submatrices. Laurent showed that the Lasserre moment matrix conditions imply all the \( N_+ \) conditions at the same level.
5.5 Integrality Gaps and Rounds
A central question in the study of hierarchies is: how many rounds are needed to solve a given problem exactly, or to achieve a certain approximation ratio?
Integrality gaps for MAX-CUT.
- The basic SDP relaxation (Goemans-Williamson) achieves a 0.878 approximation ratio for MAX-CUT.
- \( \Omega(n) \) rounds of Sherali-Adams cannot improve the integrality gap beyond the trivial 1/2 ratio.
- \( O(\log n) \) rounds of the Lasserre hierarchy suffice to get an arbitrarily good approximation for MAX-CUT on bounded-degree graphs (de la Vega and Kenyon-Mathieu, 2007).
Integrality gaps for vertex cover.
- The basic LP relaxation has integrality gap 2.
- \( \Omega(n) \) rounds of Sherali-Adams maintain a gap of \( 2 - \varepsilon \) (Charikar, Makarychev, and Makarychev, 2009).
- Even after \( \Omega(\sqrt{\log n / \log \log n}) \) rounds of the Lasserre hierarchy, the integrality gap remains at least \( 7/6 - \varepsilon \) (Schoenebeck, Tulsiani, and Trevisan).
This is the level-1 SoS/Lasserre relaxation with the Boolean constraints \( x_i^2 = 1 \). The Goemans-Williamson rounding algorithm shows that this relaxation yields a cut of expected value at least \( 0.878 \cdot \text{OPT} \). Higher levels of the hierarchy further tighten the relaxation, and for specific graph families the hierarchy converges to the integer optimum in a controlled number of rounds.
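These quantities can be computed exactly for the 5-cycle, whose optimal SDP embedding places the vertices on a circle at angles \( 4\pi k/5 \). The sketch below compares the SDP value with the expected cut of random-hyperplane rounding; the resulting ratio \( \approx 0.884 \) is the analysis on one specific graph, not the worst case:

```python
import numpy as np

# Optimal SDP embedding of C_5: vertex k at angle 4*pi*k/5 on the unit circle.
n = 5
angles = 4 * np.pi * np.arange(n) / n
V = np.column_stack([np.cos(angles), np.sin(angles)])
edges = [(k, (k + 1) % n) for k in range(n)]

# SDP objective sum (1 - <v_i, v_j>)/2, versus the expected cut
# sum arccos(<v_i, v_j>)/pi under Goemans-Williamson rounding.
sdp_val = sum((1 - V[i] @ V[j]) / 2 for i, j in edges)
exp_cut = sum(np.arccos(np.clip(V[i] @ V[j], -1, 1)) / np.pi for i, j in edges)

print(sdp_val, exp_cut, exp_cut / sdp_val)   # 4.52..., 4.0, 0.884...
```

Here the expected rounded cut equals 4, the true maximum cut of \( C_5 \), while the SDP reports \( \frac{5}{2}(1 - \cos(4\pi/5)) \approx 4.52 \).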
5.6 The Coloring Problem
For the graph coloring problem, the Lovász theta function of the complement graph gives \( \vartheta(\bar{G}) \leq \chi(G) \). The Lasserre hierarchy applied to the fractional chromatic number relaxation tightens this bound. However, the chromatic number is notoriously hard to approximate: assuming the Unique Games Conjecture, no polynomial-time algorithm can achieve a \( k^{1-\varepsilon} \)-approximation for \( k \)-coloring. The Lasserre hierarchy provides the best known algorithmic approximations, but strong lower bounds on the number of rounds required are known.
6. Sum-of-Squares Proofs and Proof Complexity
In this chapter, we shift perspective from optimization to proof complexity. The sum-of-squares method is not merely a computational tool; it is a proof system. An SoS certificate that \( p(x) \geq 0 \) on a domain \( K \) is a syntactic proof of a semantic statement. The study of SoS as a proof system — its power and limitations — connects polynomial optimization to some of the deepest questions in theoretical computer science.
6.1 SoS as a Proof System
where each \( \sigma_j \) is a sum of squares and \( \deg(\sigma_0), \deg(\sigma_j g_j) \leq d \).
The SoS degree of a statement \( \mathcal{A} \vdash f \geq 0 \) is the minimum degree \( d \) of such a proof.
By Putinar’s Positivstellensatz, if the constraints define a compact set and \( f > 0 \) on that set, an SoS proof exists (of some finite degree). The question is: what degree is needed?
6.2 Relation to Other Proof Systems
The SoS proof system sits in a hierarchy of algebraic proof systems:
Proof system hierarchy:
- Resolution: The weakest standard proof system, operating on clauses; its power is closely related to that of low-degree Sherali-Adams.
- Polynomial Calculus (PC): Proves unsatisfiability by deriving \( 1 = 0 \) from polynomial equations. A degree-\( d \) PC proof is at least as powerful as degree-\( d \) Nullstellensatz.
- Positivstellensatz/Nullstellensatz: Static algebraic proof systems. Degree-\( d \) Positivstellensatz refutations correspond to Stengle-type certificates.
- SoS (degree \( d \)): Strictly more powerful than degree-\( d \) Polynomial Calculus and degree-\( d \) Positivstellensatz. This is because SoS uses PSD certificates (rather than certificates with merely nonnegative coefficients), capturing more algebraic structure.
- Frege / Extended Frege: The strongest standard proof systems. It is conjectured (but wide open) that they are strictly more powerful than SoS.
More precisely, a degree-\( d \) SoS refutation can simulate a degree-\( d \) Sherali-Adams refutation (since Sherali-Adams uses nonnegative-combination certificates, a special case of SoS certificates), and a degree-\( d \) SoS refutation of an unsatisfiable system can be found by solving an SDP of size \( n^{O(d)} \).
6.3 Lower Bounds on SoS Degree
The landmark lower bounds on SoS degree come from Grigoriev (2001) and Schoenebeck (2008).
Grigoriev’s lower bound (2001). Let \( \phi \) be a random 3-XOR instance on \( n \) variables with clause density above the satisfiability threshold (so \( \phi \) is unsatisfiable with high probability). Then any SoS refutation of \( \phi \) requires degree \( \Omega(n) \).
Proof sketch. The proof constructs a pseudo-expectation operator \( \tilde{\mathbb{E}} \) of degree \( \Omega(n) \) that is consistent with the constraints of \( \phi \). A pseudo-expectation of degree \( d \) is a linear functional \( \tilde{\mathbb{E}} : \mathbb{R}[x]_{\leq d} \to \mathbb{R} \) satisfying:
- \( \tilde{\mathbb{E}}[1] = 1 \),
- \( \tilde{\mathbb{E}}[p^2] \geq 0 \) for all \( p \) with \( \deg(p^2) \leq d \),
- \( \tilde{\mathbb{E}}[p^2 \cdot g_j] \geq 0 \) for all constraints \( g_j \geq 0 \) and all \( p \) with \( \deg(p^2 g_j) \leq d \).
If a degree-\( d \) SoS refutation existed, the SoS/moment duality would imply no such pseudo-expectation exists. Hence constructing a pseudo-expectation of degree \( d \) proves that no refutation of degree \( d \) is possible.
Grigoriev constructs the pseudo-expectation by showing that the moment matrix can be made PSD using a careful probabilistic argument (moment method plus spectral bounds). The random structure of the 3-XOR instance provides enough “room” to satisfy all the constraints up to degree \( \Omega(n) \).
Schoenebeck’s lower bound (2008). For random instances of \( k \)-CSPs above the satisfiability threshold, any SoS refutation requires degree \( \Omega(n) \), for any constant \( k \).
Schoenebeck’s result extends Grigoriev’s from 3-XOR to arbitrary predicate CSPs, and the proof technique (constructing high-degree pseudo-expectations) has become a standard tool in SoS lower bounds.
6.4 Planted Clique and SoS
The planted clique problem is a central problem in average-case complexity. In the \( G(n, 1/2, k) \) model, a random graph on \( n \) vertices is augmented with a clique of size \( k \). The computational task is to find (or detect) the planted clique.
SoS and planted clique (Barak, Hopkins, Kelner, Kothari, Moitra, Potechin, 2019; Hopkins, Kothari, Potechin, Raghavendra, Schramm, 2017).
- For \( k = \Omega(\sqrt{n}) \), degree-4 SoS can distinguish planted clique graphs from purely random graphs.
- For \( k \leq n^{1/2 - \varepsilon} \), even degree-\( \Omega(\log n / \log \log n) \) SoS cannot distinguish the two distributions (Barak-Hopkins-Kelner-Kothari-Moitra-Potechin).
- There exist SoS lower bounds of degree \( \Omega(n^{1/3}) \) (with suitable parameterization) showing that the \( \sqrt{n} \) barrier for SoS is robust.
The planted clique problem is widely believed to be hard for \( k = o(\sqrt{n}) \), and the SoS lower bounds support this. The \( \sqrt{n} \) threshold is a “computational phase transition” that appears to be fundamental, not an artifact of a particular algorithm.
6.5 The Unique Games Conjecture and SoS
The Unique Games Conjecture (UGC) of Khot (2002) asserts that for every \( \varepsilon > 0 \), it is NP-hard to distinguish whether a unique game has value \( \geq 1 - \varepsilon \) or \( \leq \varepsilon \). The UGC implies optimal inapproximability results for MAX-CUT, vertex cover, and many other problems.
Crucially, constant-degree SoS does not refute the UGC: there exist pseudo-expectations of constant degree consistent with unique games instances of value \( \varepsilon \). However, the sub-exponential version of the UGC (asserting hardness for \( 2^{n^\delta} \)-time algorithms) may be refuted by SoS — Barak, Brandão, Harrow, Kelner, Steurer, and Zhou (2012) showed that \( n^{O(1/\varepsilon)} \) rounds of SoS can solve certain families of unique games instances.
This suggests a rich interplay: SoS is the “right” algorithmic framework for testing the UGC, and understanding SoS lower bounds is equivalent to understanding the hardness landscape of constraint satisfaction.
6.6 Pseudo-Expectations and Pseudo-Distributions
A degree-\( d \) pseudo-distribution over \( \mathbb{R}^n \) is a finitely-supported signed measure \( \tilde{\mu} \) (or more abstractly, a linear functional \( \tilde{\mathbb{E}} : \mathbb{R}[x]_{\leq d} \to \mathbb{R} \)) satisfying:
- Normalization: \( \tilde{\mathbb{E}}[1] = 1 \).
- Pseudo-positivity: \( \tilde{\mathbb{E}}[p(x)^2] \geq 0 \) for all \( p \) with \( \deg(p) \leq d/2 \).
The associated pseudo-expectation operator \( \tilde{\mathbb{E}} \) extends to constrained settings by additionally requiring:
- \( \tilde{\mathbb{E}}[p(x)^2 g_j(x)] \geq 0 \) for all constraints \( g_j \geq 0 \) and all \( p \) with \( \deg(p^2 g_j) \leq d \).
Pseudo-distributions are the dual objects to SoS proofs. A degree-\( d \) pseudo-distribution satisfying all constraints is a certificate that no degree-\( d \) SoS refutation exists. The moment matrix \( M_{d/2}(y) \) of a pseudo-distribution is the PSD matrix whose entries are the pseudo-moments \( y_\alpha = \tilde{\mathbb{E}}[x^\alpha] \).
The key insight is that pseudo-distributions “look like” genuine distributions to all degree-\( d \) polynomial tests. They can assign positive pseudo-probability to events that are impossible (e.g., a graph being 3-colorable when it is not), as long as no degree-\( d \) polynomial test can detect the inconsistency. The higher the degree, the harder it is to fool the SoS verifier, and the degree needed to detect the inconsistency characterizes the SoS proof complexity of the statement.
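A minimal example (assumed here for illustration): for MAX-CUT on a triangle with \( x_i \in \{\pm 1\} \), the degree-2 pseudo-moments \( \tilde{\mathbb{E}}[x_i] = 0 \), \( \tilde{\mathbb{E}}[x_i x_j] = -1/2 \) form a PSD moment matrix, so they pass every degree-2 polynomial test, yet they report a pseudo-cut of 2.25 on a graph whose true maximum cut is 2:

```python
import numpy as np

# Moment matrix indexed by (1, x1, x2, x3), with E~[x_i] = 0,
# E~[x_i x_j] = -1/2, and E~[x_i^2] = 1 (consistent with x_i in {-1, +1}).
M = np.array([
    [1.0,  0.0,  0.0,  0.0],
    [0.0,  1.0, -0.5, -0.5],
    [0.0, -0.5,  1.0, -0.5],
    [0.0, -0.5, -0.5,  1.0],
])

# PSD check: these pseudo-moments define a valid degree-2 pseudo-expectation.
assert np.linalg.eigvalsh(M).min() > -1e-9

# Pseudo-cut value: sum of (1 - E~[x_i x_j]) / 2 over the triangle's edges.
pseudo_cut = sum((1 - M[i, j]) / 2 for i, j in [(1, 2), (1, 3), (2, 3)])
print(pseudo_cut)   # 2.25, although the true maximum cut of a triangle is 2
```

No genuine \( \pm 1 \) distribution can achieve pairwise correlations of \( -1/2 \) on all three edges simultaneously, but detecting that requires the degree-4 constraint coming from \( (x_1 + x_2 + x_3)^2 \)-type reasoning at higher precision than degree-2 tests provide.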
7. Applications and Modern Developments
The sum-of-squares method has evolved from a tool for polynomial optimization into a broadly applicable algorithmic paradigm. In this final chapter, we survey applications spanning algorithm design, quantum information, control theory, and machine learning, and we discuss open problems at the frontier of the field.
7.1 Robust Estimation via SoS
One of the most striking modern applications of SoS is in robust statistics. The problem of estimating the mean and covariance of a distribution when an adversary has corrupted an \( \varepsilon \)-fraction of the samples has been studied since the work of Tukey and Huber, but efficient algorithms with near-optimal error guarantees emerged only recently, with SoS playing a central role.
in polynomial time, where \( k \) is the number of certifiable moments.
The key idea is certifiable sub-Gaussianity: the moment conditions of the underlying distribution can be verified by low-degree SoS proofs. When this holds, the SoS relaxation “knows” enough about the distributional structure to identify and downweight corrupted samples.
for all unit vectors \( v \), where \( C \) is an absolute constant.
The SoS approach to robust estimation has led to optimal or near-optimal algorithms for robust mean estimation, robust covariance estimation, robust linear regression, and list-decodable learning. The general paradigm — called the proofs-to-algorithms framework — converts SoS proofs of identifiability into efficient estimation algorithms.
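The one-dimensional version of the problem already illustrates the phenomenon: the empirical mean is destroyed by an \( \varepsilon \)-fraction of corruptions while the median is not. This is a toy sketch (with assumed synthetic data); the SoS algorithms above are what achieve such robustness in high dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps = 1000, 0.1

samples = rng.standard_normal(n)   # clean data: true mean 0, variance 1
samples[: int(eps * n)] = 100.0    # adversary replaces an eps-fraction

naive = samples.mean()             # dragged to roughly eps * 100 = 10
robust = np.median(samples)        # stays near the true mean 0
print(naive, robust)
```

In high dimensions no coordinate-wise median-style trick suffices, which is where certifiable moment bounds enter.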
7.2 Tensor Decomposition
Tensor decomposition is a fundamental problem in machine learning, signal processing, and data science. Given a tensor \( T = \sum_{i=1}^r a_i^{\otimes k} \) (a rank-\( r \) symmetric tensor), the goal is to recover the components \( a_1, \ldots, a_r \).
SoS tensor decomposition (Hopkins, Shi, Steurer, 2015; Ma, Shi, Steurer, 2016). For random overcomplete tensors (where \( r \) can be much larger than \( n \)), degree-\( O(k) \) SoS can decompose order-\( k \) tensors with \( r = \tilde{O}(n^{k/2}) \) components in polynomial time. This matches or improves upon all known polynomial-time algorithms.
The SoS approach to tensor decomposition works by formulating the problem as a polynomial optimization (maximizing a polynomial over the unit sphere) and showing that the SoS hierarchy captures enough structure to succeed. The connection to the injective tensor norm and spectral methods reveals deep algebraic structure.
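In the easiest regime of orthonormal components with \( r \leq n \), decomposition needs no SoS at all: Jennrich's simultaneous-diagonalization trick recovers the \( a_i \) from a single random contraction, and the SoS results above extend the guarantee far into the overcomplete regime. A sketch for order-3 tensors (assumed synthetic setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 3

# Orthonormal components a_1, ..., a_r and the tensor T = sum_i a_i^{(x)3}.
A = np.linalg.qr(rng.standard_normal((n, n)))[0][:, :r]
T = np.einsum('ia,ja,ka->ijk', A, A, A)

# Contract the third mode with a random vector:
# M_w = sum_i (a_i . w) a_i a_i^T, whose eigenvectors are the a_i
# (the random weights a_i . w are distinct with probability 1).
w = rng.standard_normal(n)
Mw = np.einsum('ijk,k->ij', T, w)
vals, vecs = np.linalg.eigh(Mw)

# Keep the r eigenvectors with the largest |eigenvalue|.
recovered = vecs[:, np.argsort(-np.abs(vals))[:r]]

# Each recovered column matches some a_i up to sign.
overlap = np.abs(recovered.T @ A)
print(overlap.max(axis=0))   # each entry should be close to 1
```

The overcomplete case \( r > n \) breaks this eigendecomposition argument, which is precisely where the SoS relaxations of Hopkins-Shi-Steurer and Ma-Shi-Steurer take over.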
7.3 Quantum Information: Entanglement Detection
In quantum information theory, determining whether a quantum state is entangled (not separable) is a fundamental problem. A bipartite state \( \rho \) on \( \mathbb{C}^{d_A} \otimes \mathbb{C}^{d_B} \) is separable if it can be written as \( \rho = \sum_i p_i \rho_i^A \otimes \rho_i^B \) with \( p_i \geq 0 \). Testing separability is NP-hard in general.
Convergence of DPS. The DPS hierarchy converges: \( \rho \) passes all levels if and only if \( \rho \) is separable. Moreover, if \( \rho \) is entangled, it is detected at a finite level \( k \) (with \( k \) depending on how far \( \rho \) is from the separable set).
This hierarchy is precisely a moment/SoS hierarchy in disguise, where the “polynomial” variables parameterize the local state \( \rho^B \) and the “nonnegativity” constraint is separability. The connection was made explicit by Parrilo and exploits the isomorphism between PSD matrices and certain polynomial cones.
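The flavor of the lowest levels can be seen with the positive-partial-transpose (PPT) test, a necessary condition for separability closely related to the bottom of the DPS hierarchy: separable states have PSD partial transpose, and the maximally entangled Bell state fails the test.

```python
import numpy as np

# Density matrix of the Bell state |Phi+> = (|00> + |11>) / sqrt(2).
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho = np.outer(phi, phi)

# Partial transpose on the second qubit: index (i j, k l) -> (i l, k j).
rho_pt = rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)

# A negative eigenvalue of the partial transpose certifies entanglement.
min_eig = np.linalg.eigvalsh(rho_pt).min()
print(min_eig)   # -0.5
```

The eigenvalue \( -1/2 \) is the extreme case; states closer to the separable set require higher DPS levels to detect.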
7.4 Control Theory: Lyapunov Stability
In control theory, establishing the stability of a nonlinear system \( \dot{x} = f(x) \) often requires finding a Lyapunov function \( V(x) \) satisfying \( V(x) > 0 \) for \( x \neq 0 \) and \( \dot{V}(x) = \nabla V(x) \cdot f(x) < 0 \). When \( f \) is polynomial and we search for \( V \) in the class of polynomial Lyapunov functions, this becomes a polynomial nonnegativity problem.
SoS Lyapunov functions (Parrilo, 2000). Given a polynomial vector field \( f : \mathbb{R}^n \to \mathbb{R}^n \) with \( f(0) = 0 \), we can search for a polynomial Lyapunov function \( V(x) \) of degree \( 2d \) satisfying:
- \( V(x) - \varepsilon \|x\|^2 \in \Sigma \) (positive definiteness),
- \( -\nabla V(x) \cdot f(x) - \varepsilon \|x\|^2 \in \Sigma \) (Lyapunov decrease condition),
where the coefficients of \( V \) are decision variables determined by the SDP. The SoS constraints ensure that \( V \) and \( -\dot{V} \) are nonnegative (in fact positive definite), certifying local stability. Parrilo’s thesis demonstrates this methodology on numerous control examples, including region-of-attraction estimation.
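A minimal instance (an assumed toy system, not from the text): for the linear field \( f(x) = (-x_1 + x_2, -x_1 - x_2) \), the candidate \( V(x) = x_1^2 + x_2^2 \) gives \( \dot{V} = -2\|x\|^2 \), and both SoS conditions reduce to PSD checks on explicit diagonal Gram matrices:

```python
import numpy as np

eps = 0.1

# V(x) = x1^2 + x2^2 and f(x) = (-x1 + x2, -x1 - x2) give
# Vdot = grad V . f = -2 * ||x||^2, so the two SoS conditions have
# the diagonal Gram matrices below.
Q_pos = (1 - eps) * np.eye(2)   # Gram matrix of V - eps * ||x||^2
Q_dec = (2 - eps) * np.eye(2)   # Gram matrix of -Vdot - eps * ||x||^2
assert np.linalg.eigvalsh(Q_pos).min() >= 0
assert np.linalg.eigvalsh(Q_dec).min() >= 0

# Numerical spot check that Vdot really equals -2 * ||x||^2.
rng = np.random.default_rng(1)
for _ in range(100):
    x = rng.standard_normal(2)
    f = np.array([-x[0] + x[1], -x[0] - x[1]])
    assert abs(2 * x @ f + 2 * (x @ x)) < 1e-9
```

For genuinely nonlinear \( f \), the Gram matrices are no longer diagonal and the search over the coefficients of \( V \) becomes the SDP described above.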
7.5 Machine Learning: Kernel Methods and SoS
There is a deep connection between SoS polynomials and kernel methods in machine learning. The Gram matrix of a polynomial kernel \( K(x, x') = (\langle x, x' \rangle + 1)^d \) is positive semidefinite, and functions in the associated RKHS (reproducing kernel Hilbert space) can be written as inner products involving the SoS monomial basis.
More precisely, the SoS relaxation can be viewed as kernel regression in a lifted feature space: the monomial map \( x \mapsto [x]_d \) is the feature map, and the Gram matrix \( Q \) corresponds to the weight matrix in the lifted space. This perspective connects polynomial optimization to statistical learning theory and explains why SoS methods are effective for learning polynomial functions.
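The identity behind this view is easy to verify in two variables: \( (\langle x, x' \rangle + 1)^2 = \langle \varphi(x), \varphi(x') \rangle \) for the scaled monomial feature map \( \varphi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2) \). A quick numerical check:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (<x, y> + 1)^2 in two variables."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

rng = np.random.default_rng(2)
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    # Kernel value equals the inner product in the lifted monomial space.
    assert abs((x @ y + 1) ** 2 - phi(x) @ phi(y)) < 1e-9
```

The \( \sqrt{2} \) scalings are exactly what makes the cross terms in the expansion of the squared kernel match the lifted inner product.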
Recent work by Klivans and Kothari shows that SoS-based algorithms achieve the best known guarantees for learning polynomial threshold functions and agnostic learning of halfspaces under adversarial noise.
7.6 The Proofs-to-Algorithms Paradigm
A unifying theme in modern SoS applications is the proofs-to-algorithms paradigm: if a mathematical fact (identifiability, correctness, uniqueness) has a low-degree SoS proof, then there is an efficient algorithm witnessing that fact.
Meta-theorem (informal). Let \( \mathcal{A} \) be a set of polynomial constraints and \( f(x) \geq 0 \) a conclusion. If there is a degree-\( d \) SoS proof that \( \mathcal{A} \vdash f(x) \geq 0 \), then the SDP relaxation at level \( d/2 \) of the Lasserre hierarchy certifies this, and one can extract an algorithmic solution (rounding the pseudo-distribution) in time \( n^{O(d)} \).
This meta-theorem has been applied to:
- Robust estimation (converting identifiability proofs into algorithms),
- Tensor decomposition (converting uniqueness proofs into decomposition algorithms),
- Planted problems (converting distinguishing proofs into recovery algorithms),
- Community detection and clustering.
7.7 The SoS Algorithm for Planted Problems
A planted problem has the following structure: nature draws a hidden structure \( x^* \) (e.g., a planted partition, coloring, or clique), generates observations \( Y \) from a distribution \( P(Y | x^*) \), and the algorithm must recover \( x^* \) from \( Y \).
The SoS approach to planted problems works as follows:
- Formulate the constraints on \( x^* \) as polynomial equalities and inequalities.
- Write the observation model as polynomial constraints on \( Y \) and \( x^* \).
- Solve the Lasserre hierarchy and extract a solution from the pseudo-distribution.
For the planted partition (stochastic block model) with two communities of size \( n/2 \), degree-4 SoS succeeds down to the computational threshold \( \sqrt{n} \cdot \text{SNR} = \Omega(1) \). For the planted sparse vector problem, degree-4 SoS matches the best known spectral algorithms. The SoS framework provides a principled way to design and analyze algorithms for a wide class of average-case problems.
7.8 Connections to Algebraic Geometry
The SoS method has deep roots in algebraic geometry that continue to yield insights.
The real Nullstellensatz states that the real variety \( V_{\mathbb{R}}(I) \) is empty if and only if \( 1 \in \sqrt[\mathbb{R}]{I} \), which is equivalent (by Stengle) to \( -1 \in I + \Sigma \). This connects the feasibility of polynomial systems to SoS certificates.
In the context of optimization, the algebraic geometry of the feasible set — its dimension, degree, singularities, and the structure of the associated ideal — determines the behavior of the Lasserre hierarchy. For instance:
If the ideal \( I = (h_1, \ldots, h_l) \) is zero-dimensional (i.e., \( V(I) \) is finite), then the Lasserre hierarchy achieves finite convergence: there exists \( r^* \) such that the hierarchy is exact at level \( r^* \). The value \( r^* \) is bounded by the regularity of the ideal, which in turn is bounded by the Bézout number \( \prod_i \deg(h_i) \).
7.9 Symmetry Reduction
When the polynomial optimization problem has symmetry (e.g., invariance under a group action), the SDP can be dramatically simplified using representation theory.
Symmetry reduction (Gatermann and Parrilo, 2004). If the polynomial \( p(x) \) and the constraints \( g_j(x) \) are invariant under a finite group \( G \) acting on \( \mathbb{R}^n \), then the Gram matrix \( Q \) can be chosen to be block-diagonal, with blocks corresponding to the irreducible representations of \( G \). This replaces one large \( s \times s \) PSD constraint by several much smaller blocks and cuts the number of free parameters by roughly a factor of \( |G| \).
For the Motzkin polynomial \( M(x,y) = x^4 y^2 + x^2 y^4 - 3x^2 y^2 + 1 \), which is invariant under the group \( G = \{(x,y) \mapsto (\pm x, \pm y), (x,y) \mapsto (y, x)\} \cong D_4 \) (the dihedral group of order 8), the symmetry-adapted basis splits the full \( 10 \times 10 \) Gram matrix (indexed by the monomials of degree at most 3) into blocks of size at most 3. Recall that \( M \) is nonnegative but not a sum of squares; the symmetry-reduced SDP certifies this infeasibility, and the same reduction applies to the SoS certificate of \( (x^2 + y^2) M(x,y) \). This makes the SDPs much more tractable.
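A quick numerical check of the \( D_4 \) invariance, and of the fact that \( M \) is nonnegative with minimum value 0 attained at \( (\pm 1, \pm 1) \):

```python
import numpy as np

def M(x, y):
    # Motzkin polynomial: nonnegative (by AM-GM), but not a sum of squares.
    return x**4 * y**2 + x**2 * y**4 - 3 * x**2 * y**2 + 1

rng = np.random.default_rng(3)
X, Y = rng.standard_normal(1000), rng.standard_normal(1000)

assert M(X, Y).min() > -1e-9           # nonnegative on the random sample
assert abs(M(1.0, 1.0)) < 1e-12        # minimum 0 attained at (1, 1)
assert np.allclose(M(X, Y), M(Y, X))   # invariant under the swap ...
assert np.allclose(M(X, Y), M(-X, Y))  # ... and under sign flips (D4 action)
```

The zeros at \( (\pm 1, \pm 1) \) are what obstruct an SoS representation of \( M \) itself.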
7.10 DSOS and SDSOS: Alternatives to SoS
While SoS (via SDP) is powerful, solving SDPs is computationally expensive for large-scale problems. This has motivated the search for more scalable alternatives.
A polynomial \( p \) is diagonally dominant sum of squares (DSOS) if it can be written as \( p(x) = [x]_d^T Q [x]_d \) where \( Q \) is diagonally dominant (i.e., \( Q_{ii} \geq \sum_{j \neq i} |Q_{ij}| \) for all \( i \)). A polynomial is scaled diagonally dominant sum of squares (SDSOS) if \( p(x) = [x]_d^T (D Q D) [x]_d \) where \( Q \) is diagonally dominant and \( D \) is a diagonal matrix with positive diagonal entries.
The hierarchy \( \text{DSOS} \subseteq \text{SDSOS} \subseteq \text{SoS} \subseteq \text{nonneg} \) provides a trade-off between computational cost and tightness:
- DSOS feasibility is a linear program (LP).
- SDSOS feasibility is a second-order cone program (SOCP).
- SoS feasibility is an SDP.
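Diagonal dominance of a candidate Gram matrix, the DSOS condition, is checkable by linear inequalities alone. A sketch on the assumed toy example \( p(x,y) = 2x^2 + 2xy + 2y^2 \):

```python
import numpy as np

def is_diag_dominant(Q):
    """Check Q_ii >= sum_{j != i} |Q_ij| for every i; together with a
    nonnegative diagonal this implies Q is PSD (Gershgorin)."""
    off = np.abs(Q).sum(axis=1) - np.abs(np.diag(Q))
    return bool(np.all(np.diag(Q) >= off))

# p(x, y) = 2x^2 + 2xy + 2y^2 = [x, y] Q [x, y]^T with a diagonally
# dominant Gram matrix, so p is DSOS -- certified with no SDP solve.
Q = np.array([[2.0, 1.0], [1.0, 2.0]])
assert is_diag_dominant(Q)

# [[1, 2], [2, 1]] is not diagonally dominant (and indeed not PSD).
assert not is_diag_dominant(np.array([[1.0, 2.0], [2.0, 1.0]]))
```

In an LP formulation the absolute values are handled by splitting each \( Q_{ij} \) into positive and negative parts, which is what makes DSOS feasibility a linear program.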
7.11 Open Problems
We conclude with several open problems that drive current research in polynomial optimization and SoS.
Open Problem 1: Optimal SoS degree for planted clique. What is the exact SoS degree threshold for detecting a planted clique of size \( k = n^{1/2 - \varepsilon} \)? Current lower bounds give \( \Omega(\log n) \); is the true answer \( \Theta(n^c) \) for some \( c > 0 \)?
Open Problem 2: SoS and the Unique Games Conjecture. Can sub-exponential levels of the SoS hierarchy refute the Unique Games Conjecture? The Barak-Brandão-Harrow-Kelner-Steurer-Zhou result gives a partial answer, but the full question remains open.
Open Problem 3: Degree bounds in Positivstellensätze. What are the optimal degree bounds in Putinar’s Positivstellensatz as a function of the geometry of the feasible set? Known bounds are doubly exponential in the worst case (Lombardi, Perrucci, Roy); can they be improved for structured problems?
Open Problem 4: SoS vs. spectral methods. Is SoS strictly more powerful than spectral methods (eigenvalue-based algorithms) for natural average-case problems? For the planted clique problem, the two approaches succeed at the same threshold \( k = \Theta(\sqrt{n}) \), raising the question of whether this is a coincidence or reflects a deeper equivalence.
Open Problem 5: Practical scalability. Develop algorithms for solving large-scale SoS SDPs (with millions of variables) efficiently. Current interior-point methods scale as \( O(s^3) \) per iteration; first-order methods (ADMM, Frank-Wolfe) and exploiting sparsity/symmetry are active areas of research.
Open Problem 6: Non-commutative SoS. Extend SoS methods to non-commutative polynomial optimization (relevant to quantum information and operator algebras). The non-commutative Positivstellensatz (Helton, McCullough) provides a foundation, but computational methods are less developed.
The sum-of-squares method sits at a remarkable crossroads of algebra, optimization, and computer science. From Hilbert’s 1888 theorem to Parrilo’s 2000 thesis and the explosion of applications in the 2010s, the journey from nonnegativity to semidefinite programming has been one of the most productive intellectual developments in modern mathematics. The field remains vibrant, with new applications and theoretical insights appearing regularly, and the fundamental question — what can efficient algebraic proof systems certify? — continues to drive research at the frontier.