CO 486/CS 766: Theory of Quantum Information

Debbie Leung

Estimated study time: 4 hr 40 min

Table of contents

Sources and References

Primary textbook — John Watrous, The Theory of Quantum Information (Cambridge University Press, 2018). The Fall 2011 lecture notes (LN2011) that formed the basis of the textbook are cited throughout.

Supplementary texts — T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley, 2006); Michael Nielsen and Isaac Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2000); John Preskill, Lecture Notes for Physics 219: Quantum Computation, Chapter 10 (2022).

Online resources — Debbie Leung, QIC 820 / CO 486 / CS 766 course page (UWaterloo, Fall 2023): https://www.math.uwaterloo.ca/~wcleung/qic820-f2023.html; Jason Sikora’s SDP lecture notes (UWaterloo, Fall 2023); Preskill Phys 219 lecture notes at theory.caltech.edu; arXiv references cited inline for research results.

Chapter 1: Mathematical Foundations

The theory of quantum information rests on a precise algebraic framework. This chapter develops the necessary mathematical infrastructure: complex Euclidean spaces as the arena for quantum states, linear operators as the language for transformations and measurements, the vec operator as a bridge between operators and vectors, and convexity as the geometry underlying optimisation.

1.1 Complex Euclidean Spaces

Classical information is stored in a finite set of symbols. Quantum information generalises this by allowing probability amplitudes — complex numbers — in place of probabilities. The natural home for these amplitudes is a complex Euclidean space indexed by the classical symbol set.

Definition 1.1 (Complex Euclidean Space). Let $\Sigma$ be a finite nonempty set. The complex Euclidean space associated with $\Sigma$ is $\mathcal{X} = \mathbb{C}^\Sigma$, the vector space of all functions $u: \Sigma \to \mathbb{C}$ with componentwise addition and scalar multiplication. For $\Sigma = \{1,\ldots,k\}$ this is $\mathbb{C}^k$.

Definition 1.2 (Inner product and norms). The inner product of $u, v \in \mathcal{X}$ is \[\langle u, v \rangle = \sum_{a \in \Sigma} \overline{u(a)}\, v(a),\] and the Euclidean norm is $\|u\| = \sqrt{\langle u,u\rangle}$. The standard basis is $\{e_a : a \in \Sigma\}$ with $e_a(b) = \mathbf{1}[a=b]$. More generally, the $p$-norm is $\|u\|_p = \bigl(\sum_a |u(a)|^p\bigr)^{1/p}$ and the sup-norm is $\|u\|_\infty = \max_a|u(a)|$.

The inner product is linear in the second argument and conjugate-linear in the first — the physics convention, opposite to the mathematics convention. We use the Dirac bra-ket notation $|u\rangle$ for $u \in \mathcal{X}$ and $\langle u|$ for its dual (conjugate transpose), so $\langle u|v\rangle = \langle u,v\rangle$.

The norms satisfy the Cauchy-Schwarz inequality $|\langle u,v\rangle| \leq \|u\|\|v\|$, with equality iff $u$ and $v$ are proportional. The Euclidean norm and the 2-norm coincide on vectors. The norms are related by $\|u\|_\infty \leq \|u\|_2 \leq \|u\|_1 \leq \sqrt{|\Sigma|}\,\|u\|_2 \leq |\Sigma|\,\|u\|_\infty$.

Definition 1.3 (Direct sum and tensor product). The direct sum $\mathcal{X}_1 \oplus \mathcal{X}_2$ corresponds to index set $\{1\}\times\Sigma_1 \cup \{2\}\times\Sigma_2$; a vector is a pair $u_1 \oplus u_2$ with $\langle u_1\oplus u_2, v_1\oplus v_2\rangle = \langle u_1,v_1\rangle + \langle u_2,v_2\rangle$. The tensor product $\mathcal{X}_1 \otimes \mathcal{X}_2$ corresponds to $\Sigma_1 \times \Sigma_2$, with product vectors $(u_1 \otimes u_2)(a_1,a_2) = u_1(a_1)u_2(a_2)$.

Example 1.1. A single qubit register has $\Sigma = \{0,1\}$, so $\mathcal{X} = \mathbb{C}^2$. The standard basis is $|0\rangle = (1,0)^T$ and $|1\rangle = (0,1)^T$. A two-qubit register has $\mathcal{X} \otimes \mathcal{X} = \mathbb{C}^4$ with basis $\{|00\rangle, |01\rangle, |10\rangle, |11\rangle\}$. The Bell state $|\beta\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle + |11\rangle)$ lives in $\mathbb{C}^2 \otimes \mathbb{C}^2$ but is not a product $u \otimes v$ — it is entangled.

1.2 Linear Operators and Operator Norms

Definition 1.4 (Spaces of operators). The space of linear maps from $\mathcal{X}$ to $\mathcal{Y}$ is $L(\mathcal{X},\mathcal{Y})$, itself a complex Euclidean space of dimension $\dim\mathcal{X}\cdot\dim\mathcal{Y}$. We write $L(\mathcal{X}) = L(\mathcal{X},\mathcal{X})$. The adjoint of $A \in L(\mathcal{X},\mathcal{Y})$ is the unique $A^* \in L(\mathcal{Y},\mathcal{X})$ satisfying $\langle v, Au\rangle = \langle A^*v, u\rangle$ for all $u,v$. In matrix terms, $A^* = \bar{A}^T$ (conjugate transpose). The Hilbert-Schmidt inner product on $L(\mathcal{X},\mathcal{Y})$ is $\langle A, B\rangle = \operatorname{Tr}(A^*B)$.

Definition 1.5 (Distinguished operator classes). For $A \in L(\mathcal{X})$:

Hermitian: $A = A^*$. The set $\operatorname{Herm}(\mathcal{X})$ is a real vector space. Eigenvalues of Hermitian operators are real.
Positive semidefinite (PSD): $A \in \operatorname{Pos}(\mathcal{X})$ iff $A = B^*B$ for some $B$, equivalently all eigenvalues of $A$ are $\geq 0$. Write $A \geq 0$.
Positive definite: $A \in \operatorname{Pd}(\mathcal{X})$ iff all eigenvalues of $A$ are $> 0$. Write $A > 0$.
Unitary: $UU^* = U^*U = I$. These form the group $\mathcal{U}(\mathcal{X})$.
Projector: $\Pi^2 = \Pi = \Pi^*$. Projectors correspond to subspaces of $\mathcal{X}$.

The Löwner order $A \geq B$ means $A - B \geq 0$.

Theorem 1.1 (Spectral theorem). Every normal operator $A \in L(\mathcal{X})$ (satisfying $AA^* = A^*A$) admits a spectral decomposition $A = \sum_i \lambda_i \Pi_i$ where $\{\Pi_i\}$ are mutually orthogonal projectors summing to $I$ and $\{\lambda_i\}$ are the distinct eigenvalues. For Hermitian $A$, all $\lambda_i \in \mathbb{R}$. For PSD $A$, all $\lambda_i \geq 0$.

The spectral theorem allows us to define operator functions: for $f:\mathbb{R} \to \mathbb{R}$ and Hermitian $A = \sum_i \lambda_i \Pi_i$, set $f(A) = \sum_i f(\lambda_i)\Pi_i$. The most important cases are $\sqrt{A}$ (for $A \geq 0$) and $\log A$ (for $A > 0$).

Example 1.2. Let $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$. Then $A$ is Hermitian with eigenvalues $\lambda_1 = 4, \lambda_2 = 2$ and eigenvectors $|+\rangle = \tfrac{1}{\sqrt{2}}(|0\rangle+|1\rangle)$, $|-\rangle = \tfrac{1}{\sqrt{2}}(|0\rangle-|1\rangle)$. The spectral decomposition is $A = 4|+\rangle\langle+| + 2|-\rangle\langle-|$. The PSD square root is $\sqrt{A} = 2|+\rangle\langle+| + \sqrt{2}|-\rangle\langle-|$.

Every operator also has a singular value decomposition, which is the analogue of the spectral theorem for non-square or non-normal operators.

Theorem 1.2 (Singular value decomposition). Every $A \in L(\mathcal{X},\mathcal{Y})$ of rank $r$ decomposes as $A = \sum_{i=1}^r s_i |y_i\rangle\langle x_i|$ where $s_1 \geq \cdots \geq s_r > 0$ are the singular values and $\{|x_i\rangle\} \subset \mathcal{X}$, $\{|y_i\rangle\} \subset \mathcal{Y}$ are orthonormal sets. The singular values are the square roots of the eigenvalues of $A^*A$.

Definition 1.6 (Schatten $p$-norms). For $A \in L(\mathcal{X},\mathcal{Y})$ with singular values $s_1 \geq \cdots \geq s_r > 0$, the Schatten $p$-norm is \[\|A\|_p = \Bigl(\sum_{i=1}^r s_i^p\Bigr)^{1/p}.\] The three fundamental cases are:

Trace norm $\|A\|_1 = \sum_i s_i$ (also called nuclear norm or 1-norm).
Frobenius norm $\|A\|_2 = \sqrt{\sum_{a,b}|A_{ab}|^2} = \sqrt{\operatorname{Tr}(A^*A)}$.
Spectral norm $\|A\|_\infty = s_1$ (largest singular value; also called operator norm).

These norms satisfy $\|A\|_\infty \leq \|A\|_2 \leq \|A\|_1$ and the duality $\|A\|_1 = \max\{\langle A, B\rangle : \|B\|_\infty \leq 1\}$. The trace norm is especially important because it governs distinguishability of quantum states.

The trace norm has a direct operational meaning: for two quantum states $\rho,\sigma$, the probability of correctly identifying the state (given uniformly random prior) equals $\tfrac{1}{2}+\tfrac{1}{4}\|\rho-\sigma\|_1$. This is the Holevo-Helstrom theorem (Theorem 2.1). The spectral norm (operator norm) $\|A\|_\infty = \lambda_{\max}(A^*A)^{1/2}$ is the largest singular value; for Hermitian $A$ it is $\max(|\lambda_{\max}(A)|,|\lambda_{\min}(A)|)$. The Frobenius norm $\|A\|_2 = \sqrt{\operatorname{Tr}(A^*A)}$ is the Hilbert-Schmidt norm and is induced by the inner product $\langle A,B\rangle = \operatorname{Tr}(A^*B)$.

A key inequality connecting trace norm and fidelity: $\|\rho-\sigma\|_1^2 \leq 4(1-F(\rho,\sigma)^2)$. This is useful for bounding distinguishability from fidelity. For pure states $\rho = |\psi\rangle\langle\psi|$ and $\sigma = |\phi\rangle\langle\phi|$: $\|\rho-\sigma\|_1 = 2\sqrt{1-|\langle\psi|\phi\rangle|^2}$ (the “Bures distance”) and $F = |\langle\psi|\phi\rangle|$, giving the tight Fuchs-van de Graaf bound as an equality.

The Schatten norms also satisfy Hölder’s inequality: $|\langle A, B\rangle| \leq \|A\|_p \|B\|_q$ for $1/p + 1/q = 1$. For $p = q = 2$ this is the Cauchy-Schwarz inequality for the Hilbert-Schmidt inner product. The Frobenius norm is submultiplicative: $\|AB\|_2 \leq \|A\|_2\|B\|_2$. The spectral norm is submultiplicative: $\|AB\|_\infty \leq \|A\|_\infty\|B\|_\infty$. An important inequality connecting the three: $\|ABC\|_1 \leq \|A\|_2\|B\|_1\|C\|_2$.

Theorem 1.3 (Variational formula for the trace norm). For $A \in L(\mathcal{X})$, \[\|A\|_1 = \max_{U \in \mathcal{U}(\mathcal{X})} \operatorname{Re}\langle A, U\rangle = \max_{U \in \mathcal{U}(\mathcal{X})} \operatorname{Re}\operatorname{Tr}(A^*U).\]

Proof. Let $A = \sum_i s_i |y_i\rangle\langle x_i|$ be the SVD. Take $U = \sum_i |y_i\rangle\langle x_i| + V_{\ker A}$ where $V_{\ker A}$ is any unitary on $\ker A$. Then $\operatorname{Tr}(A^*U) = \sum_i s_i = \|A\|_1$. For any unitary $U$, Cauchy-Schwarz gives $|\operatorname{Tr}(A^*U)| \leq \|A\|_1\|U\|_\infty = \|A\|_1$. $\square$

Example 1.3. For $A = \begin{pmatrix}1 & 0 \\ 0 & -1\end{pmatrix} = Z$ (Pauli-$Z$), the singular values are both 1, so $\|Z\|_1 = 2$, $\|Z\|_2 = \sqrt{2}$, $\|Z\|_\infty = 1$. The optimal unitary in the variational formula is $U = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix} = I$, giving $\operatorname{Tr}(Z^*I) = \operatorname{Tr}(Z) = 0 \neq 2$. Wait — the Pauli $Z$ is Hermitian, so $A^* = Z$. We want $\max_U \operatorname{Re}\operatorname{Tr}(ZU)$. The SVD of $Z = |0\rangle\langle 0| - |1\rangle\langle 1|$ has $s_1 = s_2 = 1$ and optimal $U = |0\rangle\langle 0| + |1\rangle\langle 1| \cdot (-1)^* = |0\rangle\langle 0| - |1\rangle\langle 1| = Z$, giving $\operatorname{Tr}(ZZ) = \operatorname{Tr}(I) = 2 = \|Z\|_1$. $\checkmark$

1.3 The Vec Operator and Tensor Identities

The vec operator provides a canonical identification between operators (matrices) and vectors (column vectors in a tensor product space). This identification is essential for purification theory, the Choi isomorphism, and many duality arguments.

Definition 1.7 (Vec operator). The vec operator $\operatorname{vec}: L(\mathcal{X},\mathcal{Y}) \to \mathcal{Y} \otimes \mathcal{X}$ is defined on rank-one operators by $\operatorname{vec}(|y\rangle\langle x|) = |y\rangle \otimes |x\rangle$ and extended by linearity. Equivalently, if $A$ has matrix entries $A_{ij}$ then $\operatorname{vec}(A)$ is the column vector obtained by stacking the columns of $A$ (column-major order in the standard basis). The map vec is an isometric isomorphism: $\langle A, B\rangle = \langle \operatorname{vec}(A), \operatorname{vec}(B)\rangle$.

Example 1.4. For $\mathcal{X} = \mathbb{C}^2$, $\operatorname{vec}(I_2) = |0\rangle|0\rangle + |1\rangle|1\rangle \in \mathbb{C}^2 \otimes \mathbb{C}^2$. This is (proportional to) the maximally entangled state. For $A = |0\rangle\langle 1|$, $\operatorname{vec}(A) = |0\rangle|1\rangle$. For a general $2\times 2$ matrix $A = \begin{pmatrix}a & b \\ c & d\end{pmatrix}$, $\operatorname{vec}(A) = (a,c,b,d)^T$ (columns stacked).

Theorem 1.4 (Tensor identity for vec). For $A \in L(\mathcal{X}_1,\mathcal{Y}_1)$, $B \in L(\mathcal{X}_2,\mathcal{Y}_2)$, $C \in L(\mathcal{X}_2,\mathcal{X}_1)$, \[(A \otimes B)\operatorname{vec}(C) = \operatorname{vec}(ACB^T).\]

Proof. It suffices to check on rank-one $C = |x_1\rangle\langle x_2|$. Then $\operatorname{vec}(C) = |x_1\rangle \otimes |x_2\rangle$, so $(A\otimes B)\operatorname{vec}(C) = A|x_1\rangle \otimes B|x_2\rangle = \operatorname{vec}(A|x_1\rangle\langle x_2|B^T) = \operatorname{vec}(ACB^T)$. Linearity extends to general $C$. $\square$

Two important special cases: setting $C = I$ gives $\operatorname{vec}(A) = (A\otimes I)\operatorname{vec}(I) = (I\otimes A^T)\operatorname{vec}(I)$. Setting $A = I$ gives $(I\otimes B)\operatorname{vec}(C) = \operatorname{vec}(CB^T)$.

The tensor identity has a key consequence for quantum channels. If $\Phi(M) = \sum_k A_k M B_k^*$, then the natural representation satisfies $K(\Phi)\operatorname{vec}(M) = \operatorname{vec}(\Phi(M))$, so $K(\Phi) = \sum_k A_k \otimes \bar{B}_k$. The Choi matrix is $J(\Phi) = (\Phi\otimes I)(\operatorname{vec}(I)\operatorname{vec}(I)^*)$, and the tensor identity gives the reconstruction formula $\Phi(M) = \operatorname{Tr}_\mathcal{X}[(I\otimes M^T)J(\Phi)]$.

The vec identity also makes the Choi-Kraus correspondence transparent: $J(\Phi) = \sum_k \operatorname{vec}(A_k)\operatorname{vec}(A_k)^*$ for Kraus operators $\{A_k\}$. This is the Gram matrix of the vectors $\{\operatorname{vec}(A_k)\}$, and its PSD property gives the CP condition.

Example 1.5. For the completely depolarising channel $\Omega(\rho) = \tfrac{I}{d}\operatorname{Tr}(\rho)$ on $\mathbb{C}^d$, the Choi matrix is $J(\Omega) = (\Omega\otimes I)(|\beta\rangle\langle\beta|)$ where $|\beta\rangle = \operatorname{vec}(I/\sqrt{d})$. Since $\Omega$ maps every input to $I/d$, $J(\Omega) = \tfrac{I_{d^2}}{d}$. The Choi matrix being a multiple of identity confirms that the channel is the "maximally noisy" channel.

Example 1.5b (Choi matrix of the unitary channel). For the unitary channel $\mathcal{U}(\rho) = U\rho U^*$ with $U \in \mathcal{U}(\mathbb{C}^d)$, the Kraus representation has a single Kraus operator $A_1 = U$. So $J(\mathcal{U}) = \operatorname{vec}(U)\operatorname{vec}(U)^*$ is a rank-1 PSD matrix — a pure state on $\mathcal{Y}\otimes\mathcal{X}$. The trace condition: $\operatorname{Tr}_\mathcal{Y}(J(\mathcal{U})) = \operatorname{Tr}_\mathcal{Y}(\operatorname{vec}(U)\operatorname{vec}(U)^*) = UU^* = I$ ✓. The channel is TP. Interpretation: unitary channels have rank-1 Choi matrices — they carry all their information in a single direction in the space of operators, reflecting the fact that the channel is deterministic (no probabilistic mixture of errors).

Theorem 1.4b (Choi-Jamiołkowski isomorphism properties). The Choi isomorphism $\Phi \mapsto J(\Phi)$ satisfies:

$\Phi$ is CP iff $J(\Phi) \geq 0$ (Choi's theorem).
$\Phi$ is TP iff $\operatorname{Tr}_\mathcal{Y}(J(\Phi)) = I_\mathcal{X}$.
$\Phi$ is Hermiticity-preserving iff $J(\Phi) = J(\Phi)^*$.
$\Phi$ is trace-non-increasing iff $\operatorname{Tr}_\mathcal{Y}(J(\Phi)) \leq I_\mathcal{X}$.
Composition: if $\Phi_1: L(\mathcal{X})\to L(\mathcal{Y})$ and $\Phi_2: L(\mathcal{Y})\to L(\mathcal{Z})$, then $J(\Phi_2\circ\Phi_1) = [\Phi_2\otimes I_\mathcal{X}](J(\Phi_1))$, not a simple product.

1.4 Matrix Inequalities and Operator Monotone Functions

Operator functions inherit an ordering structure from the Löwner order on $\operatorname{Herm}(\mathcal{X})$, but this ordering interacts with matrix operations in subtle ways.

Definition 1.9 (Operator monotone function). A function $f:(0,\infty)\to\mathbb{R}$ is operator monotone if $A \geq B > 0 \Rightarrow f(A) \geq f(B)$. It is operator convex if $f(\lambda A + (1-\lambda)B) \leq \lambda f(A) + (1-\lambda)f(B)$ for all $A,B > 0$.

Key operator monotone functions: $f(t) = t^\alpha$ for $\alpha \in (0,1]$ is operator monotone; $f(t) = \log t$ is operator monotone; $f(t) = t^2$ is operator convex but not operator monotone. Crucially, $A \geq B \geq 0$ does not imply $A^2 \geq B^2$ in general (the square function is not operator monotone).

Example 1.6. Let $A = \begin{pmatrix}2&1\\1&2\end{pmatrix}$ and $B = \begin{pmatrix}1&0\\0&0\end{pmatrix}$. Then $A - B = \begin{pmatrix}1&1\\1&2\end{pmatrix}$ has eigenvalues $\tfrac{3\pm\sqrt{5}}{2} > 0$, so $A > B$. However, $A^2 - B^2 = \begin{pmatrix}5&4\\4&5\end{pmatrix} - \begin{pmatrix}1&0\\0&0\end{pmatrix} = \begin{pmatrix}4&4\\4&5\end{pmatrix}$ has eigenvalues $\tfrac{9\pm\sqrt{17}}{2}$, both positive — so here $A^2 > B^2$. But this is not always the case: take $A = 2I, B = \begin{pmatrix}1&1\\1&1\end{pmatrix}$; then $A>B$ but computing $A^2 - B^2 = 4I - 2B = \begin{pmatrix}2&-2\\-2&2\end{pmatrix}$ which has a zero eigenvalue, showing the inequality becomes tight.

Theorem 1.9 (Golden-Thompson inequality). For Hermitian $A, B \in \operatorname{Herm}(\mathcal{X})$, \[\operatorname{Tr}(e^{A+B}) \leq \operatorname{Tr}(e^A e^B).\] In particular, if $A$ and $B$ commute, equality holds. The inequality is strict when they do not commute.

The Golden-Thompson inequality is used in the proof of the data processing inequality for quantum relative entropy via the Peierls-Bogoliubov inequality.

Theorem 1.4c (Lieb's concavity theorem). For $t \in [0,1]$ and any matrix $K$, the function $(A,B) \mapsto \operatorname{Tr}(K^* A^t K B^{1-t})$ is jointly concave on $\operatorname{Pd}(\mathcal{X}) \times \operatorname{Pd}(\mathcal{X})$.

Lieb’s concavity theorem (1973) is one of the deepest results in matrix analysis. Its many corollaries include the joint convexity of quantum relative entropy (Theorem 5.8) and the strong subadditivity of von Neumann entropy (Theorem 5.10). The original proof used complex interpolation methods; Carlen-Lieb (2012) later gave a simpler proof using the operator geometric mean.

Theorem 1.4d (Peierls-Bogoliubov inequality). For Hermitian $A,H$ with $H$ traceless, \[\operatorname{Tr}(e^{A+H}) \leq \operatorname{Tr}(e^A)\exp\!\bigl(\langle H\rangle_A\bigr), \quad \langle H\rangle_A = \frac{\operatorname{Tr}(He^A)}{\operatorname{Tr}(e^A)}.\] Combined with the Golden-Thompson inequality: these inequalities are the quantum analogues of the classical moment-generating function bounds used in large deviation theory.

Example 1.6b (Operator convexity of $f(t) = -\log t$). For positive definite matrices $A,B > 0$ and $\lambda \in (0,1)$, the operator convexity of $-\log$ means $-\log(\lambda A + (1-\lambda)B) \leq -\lambda\log A - (1-\lambda)\log B$, equivalently $\log(\lambda A + (1-\lambda)B) \geq \lambda\log A + (1-\lambda)\log B$. This is the **matrix concavity of the logarithm**. In scalar form it is just the Jensen inequality $\log(\lambda a + (1-\lambda)b) \geq \lambda\log a + (1-\lambda)\log b$ for positive reals. The quantum version does not follow from the scalar version term-by-term (since eigenbases can differ) and requires Lieb's concavity.

1.5 Convexity

Quantum information theory is pervaded by convexity arguments — density operators form a convex set, separable states form a convex set, and the feasible regions of SDPs are convex. We collect the key results here.

Definition 1.8 (Convex sets and functions). A set $A \subseteq \mathbb{R}^n$ is convex if $\lambda u + (1-\lambda)v \in A$ for all $u,v \in A$, $\lambda \in [0,1]$. The convex hull $\operatorname{conv}(A)$ is the smallest convex set containing $A$. A point $u \in A$ is an extreme point if $u = \lambda v + (1-\lambda)w$ with $v,w \in A$ implies $v = w = u$. A convex cone $A$ additionally satisfies $u \in A \Rightarrow \lambda u \in A$ for all $\lambda \geq 0$. A function $f: A \to \mathbb{R}$ is convex if $f(\lambda u + (1-\lambda)v) \leq \lambda f(u) + (1-\lambda)f(v)$ for all $u,v \in A$, $\lambda \in [0,1]$; concave if $-f$ is convex.

Theorem 1.5 (Carathéodory's theorem). If $A \subseteq \mathbb{R}^n$, then every $u \in \operatorname{conv}(A)$ is a convex combination of at most $n+1$ points of $A$.

Theorem 1.6 (Krein-Milman theorem). Every compact convex set in $\mathbb{R}^n$ equals the closed convex hull of its extreme points.

Theorem 1.7 (Separating hyperplane theorem). If $A \subseteq \mathbb{R}^n$ is closed convex and $u \notin A$, there exists $r \in \mathbb{R}^n$ and $c \in \mathbb{R}$ with $\langle r, a\rangle \geq c > \langle r, u\rangle$ for all $a \in A$.

Theorem 1.8 (Sion's minimax theorem). Let $A, B$ be compact convex sets and $f: A \times B \to \mathbb{R}$ be convex in the first argument and concave in the second. Then $\min_{a \in A}\max_{b \in B} f(a,b) = \max_{b \in B}\min_{a \in A} f(a,b)$.

Convexity theorems are applied throughout: Carathéodory bounds the decomposition length of separable states; Krein-Milman identifies pure states as extreme points and underpins channel representations; the separating hyperplane theorem gives entanglement witnesses; Sion’s theorem underlies strong duality for SDPs.

Example 1.7 (Convex hull of pure states). For $\mathcal{X} = \mathbb{C}^2$, the set $D(\mathcal{X})$ of qubit density matrices is the Bloch ball. The extreme points are pure states on the Bloch sphere $S^2$. By Krein-Milman, every density matrix $\rho = \tfrac{1}{2}(I+\vec{r}\cdot\vec\sigma)$ with $|\vec{r}|\leq 1$ is a convex combination of (at most $4=2^2$ by Carathéodory, but for qubits at most 2) pure states. Explicitly: $\rho = p|\psi_1\rangle\langle\psi_1| + (1-p)|\psi_2\rangle\langle\psi_2|$ where $|\psi_1\rangle,|\psi_2\rangle$ are the eigenstates of $\rho$ and $p = \lambda_{\max}(\rho)$. For the Bloch sphere: pure states are $\vec{r}$ on the sphere, and the maximally mixed state is the origin $\vec{r}=0 = \tfrac{1}{2}|\psi_1\rangle\langle\psi_1|+\tfrac{1}{2}|\psi_2\rangle\langle\psi_2|$ for any antipodal pair $\{|\psi_1\rangle,|\psi_2\rangle\}$.

Definition 1.10 (Conic duality). For a convex cone $K \subseteq \mathbb{R}^n$, its dual cone is $K^* = \{y : \langle y,x\rangle \geq 0\;\forall x\in K\}$. The PSD cone $\operatorname{Pos}(\mathcal{X})$ is self-dual: $\operatorname{Pos}(\mathcal{X})^* = \operatorname{Pos}(\mathcal{X})$. This self-duality is why the inner product pairing in the SDP duality derivation yields the same PSD constraint in both primal and dual.

Proposition 1.1 (Operator Jensen inequality). For operator convex $f$ and $A \in \operatorname{Herm}(\mathcal{X})$ with spectrum in $[a,b]$ and an isometry $V: \mathcal{Y}\to\mathcal{X}$: \[f(V^*AV) \leq V^*f(A)V.\] Taking $V = |v\rangle$ (a unit vector) gives the scalar Jensen inequality. Taking $V$ to be the Stinespring isometry of a quantum channel gives the operator Jensen inequality used in quantum information.

Chapter 2: Quantum States and Channels

Quantum mechanics is a probabilistic theory, but the probabilities arise from a more fundamental description in terms of amplitudes. A quantum state is not simply a probability distribution over classical configurations — it is a positive semidefinite operator that encodes both the classical uncertainty and the quantum coherence of a system.

2.1 Registers and Density Operators

Definition 2.1 (Register). A register $X$ is a physical system with a finite classical state set $\Sigma_X$ and associated CES $\mathcal{X} = \mathbb{C}^{\Sigma_X}$. For a composite register $(X_1,\ldots,X_n)$, the CES is $\mathcal{X}_1 \otimes \cdots \otimes \mathcal{X}_n$.

Definition 2.2 (Density operator). The set of quantum states (density operators) of register $X$ is \[D(\mathcal{X}) = \{\rho \in \operatorname{Pos}(\mathcal{X}) : \operatorname{Tr}\rho = 1\}.\] A state $\rho \in D(\mathcal{X})$ is pure if $\operatorname{rank}(\rho) = 1$, i.e., $\rho = |u\rangle\langle u|$ for some unit vector $|u\rangle$. Otherwise it is mixed.

The set $D(\mathcal{X})$ is a compact convex set. Its extreme points are precisely the pure states $\{|u\rangle\langle u| : \|u\| = 1\}$. By Krein-Milman, every $\rho \in D(\mathcal{X})$ is a convex combination of pure states (this is just the spectral decomposition: $\rho = \sum_x p(x)|e_x\rangle\langle e_x|$). The maximally mixed state $\frac{I}{d}$ (with $d = \dim\mathcal{X}$) is the centroid of $D(\mathcal{X})$.

The non-uniqueness of the pure-state decomposition is fundamental. While the spectral decomposition $\rho = \sum_x p(x)|e_x\rangle\langle e_x|$ uses eigenstates, the same density matrix can be written as a mixture over any ensemble of pure states consistent with the eigenvalues (via the Schrödinger mixture theorem). This non-uniqueness has a crucial physical consequence: no measurement of $\rho$ alone can determine which pure-state ensemble was prepared. The state $\rho$ encapsulates all operationally distinguishable information about the preparation.

Proposition 2.0 (Bloch sphere parameterisation and Pauli decomposition). Any qubit state $\rho \in D(\mathbb{C}^2)$ can be written uniquely as $\rho = \tfrac{1}{2}(I+\vec{r}\cdot\vec\sigma)$ with $\vec{r} = (r_x,r_y,r_z)\in\mathbb{R}^3$, $|\vec{r}|\leq 1$, where $\vec\sigma = (X,Y,Z)$ are the Pauli matrices. The Bloch vector $\vec{r}$ has entries $r_i = \operatorname{Tr}(\sigma_i\rho)$. For general $d$-dimensional $\rho$, the generalised Pauli (Gell-Mann) decomposition is $\rho = \frac{1}{d}(I + \sum_\alpha c_\alpha \lambda_\alpha)$ where $\{\lambda_\alpha\}$ are the $d^2-1$ generalised Gell-Mann matrices (traceless Hermitian, orthonormal under $\operatorname{Tr}(\lambda_\alpha\lambda_\beta) = 2\delta_{\alpha\beta}$) and $c_\alpha = \tfrac{d}{2}\operatorname{Tr}(\lambda_\alpha\rho)$.

The map $\rho \mapsto \vec{r}$ is affine and identifies $D(\mathbb{C}^2)$ with the unit ball $\{\vec{r}:\,|\vec{r}|\leq 1\}$. Channels act as affine maps on the Bloch ball: $\Phi(\rho) \leftrightarrow M\vec{r}+\vec{c}$ for a contraction matrix $M$ and translation $\vec{c}$. Unital channels ($\Phi(I/2)=I/2$) have $\vec{c}=0$.

Example 2.1. For a qubit $\mathcal{X} = \mathbb{C}^2$, a density matrix has the form \[\rho = \begin{pmatrix}p & z \\ \bar{z} & 1-p\end{pmatrix}, \quad p \in [0,1],\; z \in \mathbb{C},\; p(1-p) \geq |z|^2.\] The constraint $p(1-p) \geq |z|^2$ ensures $\rho \geq 0$. In the Bloch sphere parameterisation, $\rho = \tfrac{1}{2}(I + r_x X + r_y Y + r_z Z)$ with $\vec{r} = (r_x,r_y,r_z) \in \mathbb{R}^3$, $\|\vec{r}\| \leq 1$. Pure states correspond to $\|\vec{r}\| = 1$ (the Bloch sphere surface); the maximally mixed state is the origin $\vec{r} = 0$.

Definition 2.3 (Partial trace). For a bipartite register $(X,Y)$ with state $\rho \in D(\mathcal{X}\otimes\mathcal{Y})$, the reduced state on $X$ is $\rho^X = \operatorname{Tr}_Y(\rho) \in D(\mathcal{X})$, where $\operatorname{Tr}_Y: L(\mathcal{X}\otimes\mathcal{Y}) \to L(\mathcal{X})$ is the partial trace defined by $\operatorname{Tr}_Y(A\otimes B) = A\operatorname{Tr}(B)$ and extended by linearity.

Example 2.2. For the Bell state $|\beta\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$ and $\rho = |\beta\rangle\langle\beta|$, the reduced state on the first qubit is $\rho^X = \operatorname{Tr}_Y(\rho) = \tfrac{1}{2}|0\rangle\langle 0| + \tfrac{1}{2}|1\rangle\langle 1| = \tfrac{I}{2}$. This is the maximally mixed state — Alice's qubit is maximally uncertain despite the joint state being pure.

2.2 Measurements and POVMs

The most general quantum measurement on register $X$ is described by a positive operator-valued measure.

Definition 2.4 (POVM). A positive operator-valued measure (POVM) on $\mathcal{X}$ with outcome set $\Gamma$ is a function $M:\Gamma \to \operatorname{Pos}(\mathcal{X})$ satisfying the completeness condition \[\sum_{a \in \Gamma} M(a) = I_\mathcal{X}.\] When register $X$ is in state $\rho$, the probability of outcome $a$ is $p(a) = \langle M(a), \rho\rangle = \operatorname{Tr}(M(a)\rho)$. This is the Born rule.

Definition 2.5 (Projective measurement). A POVM where every $M(a)$ is an orthogonal projector ($M(a)^2 = M(a) = M(a)^*$) is called a projective measurement. The standard computational basis measurement has $M(a) = |a\rangle\langle a|$ for each $a \in \Sigma$.

A projective measurement $\{P_a\}$ on a pure state $|\psi\rangle$ yields outcome $a$ with probability $\|P_a|\psi\rangle\|^2 = \langle\psi|P_a|\psi\rangle$. After the measurement the state collapses to $P_a|\psi\rangle/\|P_a|\psi\rangle\|$. For mixed states, the post-measurement state is $P_a\rho P_a / \operatorname{Tr}(P_a\rho)$.

The key operational difference between a POVM and a projective measurement: a POVM can have more outcomes than the dimension of $\mathcal{X}$, at the cost of not necessarily being a projection. Neumark’s (Naimark’s) dilation theorem states that any POVM on $\mathcal{X}$ can be implemented as a projective measurement on a larger system $\mathcal{X}\otimes\mathcal{Y}$: there exist projectors $\{Q_a\}$ on $\mathcal{X}\otimes\mathcal{Y}$ and a fixed state $|0\rangle \in \mathcal{Y}$ such that $\operatorname{Tr}(M(a)\rho) = \operatorname{Tr}(Q_a(\rho\otimes|0\rangle\langle 0|))$ for all $\rho$.

Proposition 2.2 (Naimark dilation). Every POVM $\{M(a)\}_{a\in\Gamma}$ on $\mathcal{X}$ with $|\Gamma|$ outcomes can be realised as a projective measurement on $\mathcal{X}\otimes\mathbb{C}^{|\Gamma|}$ followed by discarding the second register.

Proof. Define the isometry $V: \mathcal{X}\to\mathcal{X}\otimes\mathbb{C}^{|\Gamma|}$ by $V|\psi\rangle = \sum_a\sqrt{M(a)}|\psi\rangle\otimes|a\rangle$. The completeness condition $\sum_a M(a) = I$ ensures $V^*V = \sum_a M(a) = I$, so $V$ is indeed an isometry. Then $\operatorname{Tr}(M(a)\rho) = \operatorname{Tr}((I\otimes|a\rangle\langle a|)V\rho V^*)$, giving the projective measurement $Q_a = V(I_\mathcal{X})V^*|a\rangle\langle a|$ on the dilated space. $\square$

Example 2.3. The Hadamard basis measurement on a qubit has $M(+) = |{+}\rangle\langle{+}|$ and $M(-) = |{-}\rangle\langle{-}|$ where $|\pm\rangle = \tfrac{1}{\sqrt{2}}(|0\rangle \pm |1\rangle)$. On state $|0\rangle$, this gives outcome $+$ with probability $|\langle{+}|0\rangle|^2 = 1/2$ and $-$ with probability $1/2$.

Example 2.4b (Trine POVM — a non-projective POVM). The trine states for a qubit are three states equally spaced at $120°$ in the Bloch sphere equatorial plane: $|\psi_k\rangle = \cos(\tfrac{2\pi k}{3})|0\rangle + \sin(\tfrac{2\pi k}{3})|1\rangle$ for $k=0,1,2$. The POVM $M(k) = \tfrac{2}{3}|\psi_k\rangle\langle\psi_k|$ satisfies $\sum_{k=0}^2 M(k) = \tfrac{2}{3}(|\psi_0\rangle\langle\psi_0|+|\psi_1\rangle\langle\psi_1|+|\psi_2\rangle\langle\psi_2|) = \tfrac{2}{3}\cdot\tfrac{3}{2}I = I$ (using the frame property of the trine states). This POVM has 3 outcomes but acts on a 2-dimensional Hilbert space — it is genuinely not a projective measurement. By Naimark's theorem, it can be implemented as a projective measurement on $\mathbb{C}^2\otimes\mathbb{C}^3$, i.e., an ancilla qubit plus the system qubit with a 6-dimensional total Hilbert space.

Theorem 2.5 (Holevo's bound for POVMs). For a quantum state $\rho$ and any POVM $\{M(a)\}_{a\in\Gamma}$, the probability distribution $p(a) = \operatorname{Tr}(M(a)\rho)$ satisfies \[H(p) \leq S(\rho) + \log|\Gamma|,\] where $H(p) = -\sum_a p(a)\log p(a)$ is the Shannon entropy of the outcome. In particular, measuring a $d$-dimensional quantum state can produce at most $\log d$ bits of Shannon entropy from any POVM.

Proof sketch. The joint state before measurement is $\rho\otimes|0\rangle\langle 0|$ in $\mathcal{X}\otimes\mathbb{C}^{|\Gamma|}$. After the Naimark dilation, the projective measurement leaves a classical-quantum state $\sum_a p(a)|a\rangle\langle a|\otimes\rho_a$ where $\rho_a = M(a)^{1/2}\rho M(a)^{1/2}/p(a)$. The von Neumann entropy of the classical part is $H(p)$. Subadditivity: $H(p) = S(\text{classical register}) \leq S(\rho) + S(\text{ancilla})$. Since the ancilla is $|\Gamma|$-dimensional: $S(\text{ancilla}) \leq \log|\Gamma|$. $\square$

2.3 Binary State Discrimination and the Holevo-Helstrom Theorem

A fundamental task is: given a register holding either state $\rho_0$ (with prior $p_0$) or $\rho_1$ (with prior $p_1 = 1-p_0$), find the measurement maximising the probability of correctly identifying the state. This is the quantum analogue of the classical hypothesis testing problem.

The key insight is that the optimal measurement projects onto the eigenspaces of the operator difference $p_0\rho_0 - p_1\rho_1$. This is the quantum version of the Neyman-Pearson lemma: in classical hypothesis testing, the optimal test is the likelihood ratio test. Quantum mechanically, the eigenvectors of $p_0\rho_0-p_1\rho_1$ play the role of the likelihood ratio threshold. States in the positive eigenspace are more likely to have come from $\rho_0$; states in the negative eigenspace from $\rho_1$.

\[d = \sup_{M_0+M_1=I,\,M_0,M_1\geq 0}[p_0\langle M_0,\rho_0\rangle + p_1\langle M_1,\rho_1\rangle] = \tfrac{1}{2} + \tfrac{1}{2}\|p_0\rho_0-p_1\rho_1\|_1.\]

The last equality is the Holevo-Helstrom theorem. Its dual SDP is $\beta = \inf\operatorname{Tr}(Y)$ s.t. $Y \geq p_0\rho_0$ and $Y \geq p_1\rho_1$, and strong duality $d=\beta$ follows since both are strictly feasible.

Theorem 2.1 (Holevo-Helstrom theorem). The maximum probability of correctly identifying the state from the ensemble $\{(p_0,\rho_0),(p_1,\rho_1)\}$ is \[\Pr[\text{correct}] = \frac{1}{2}\bigl(1 + \|p_0\rho_0 - p_1\rho_1\|_1\bigr).\] The optimal measurement projects onto the positive and negative eigenspaces of $p_0\rho_0 - p_1\rho_1$: $M_0 = \Pi^+$, $M_1 = \Pi^- = I - \Pi^+$.

Proof. Any two-outcome POVM with $M_0 + M_1 = I$ and $M_0, M_1 \geq 0$ satisfies $-I \leq M_0 - M_1 \leq I$. Write $T = M_0 - M_1$, so $M_0 = \tfrac{1}{2}(I+T)$. The success probability is \[p_0\operatorname{Tr}(M_0\rho_0) + p_1\operatorname{Tr}(M_1\rho_1) = \tfrac{1}{2} + \tfrac{1}{2}\operatorname{Tr}[(p_0\rho_0 - p_1\rho_1)T] = \tfrac{1}{2} + \tfrac{1}{2}\langle p_0\rho_0-p_1\rho_1, T\rangle.\] This is maximised by choosing $T = \Pi^+ - \Pi^-$, and the variational formula $\|A\|_1 = \max_{-I \leq T \leq I}\langle A, T\rangle$ gives $\tfrac{1}{2}(1 + \|p_0\rho_0-p_1\rho_1\|_1)$. $\square$

With equal priors $p_0 = p_1 = 1/2$, the maximum success probability is $\tfrac{1}{2}(1 + \tfrac{1}{2}\|\rho_0 - \rho_1\|_1)$, so the trace distance $\tfrac{1}{2}\|\rho_0-\rho_1\|_1$ is the bias above guessing. This gives the SDP formulation of trace distance.

Example 2.4. Discriminate $\rho_0 = |0\rangle\langle 0|$ from $\rho_1 = |+\rangle\langle+| = \tfrac{1}{2}\begin{pmatrix}1&1\\1&1\end{pmatrix}$ with equal priors. The operator $\tfrac{1}{2}(\rho_0-\rho_1) = \tfrac{1}{4}\begin{pmatrix}1 & -1 \\ -1 & -1\end{pmatrix}$ has eigenvalues $\pm\tfrac{1}{2\sqrt{2}}$, so $\|\rho_0-\rho_1\|_1 = \tfrac{2}{\sqrt{2}} = \sqrt{2}$. Maximum success probability $= \tfrac{1}{2}(1+\tfrac{\sqrt{2}}{2}) \approx 0.854$.

2.4 Multi-State Discrimination

The binary discrimination problem extends naturally to $n$ states. With states $\{\rho_k\}_{k=1}^n$ prepared with priors $\{p_k\}$, the optimal POVM satisfies the following.

Definition 2.10 (Quantum state exclusion). In the exclusion problem, Bob tries to output an index $j$ such that $j \neq k$ (he was *not* given state $\rho_j$). This is dual to discrimination: rather than identifying the state, he rules one out. The SDP for exclusion has $\inf$ instead of $\sup$ and the constraints are swapped: $\sum_k M_k = I$ with $M_k$ being the "exclude $k$" measurement.

Example 2.8 (Symmetric state discrimination). For $n$ equally likely pure states $\{|\psi_k\rangle\}_{k=1}^n$ forming a *symmetric informationally complete (SIC) POVM* — equally spaced in the Bloch sphere sense — the optimal discrimination POVM is the SIC-POVM itself: $M_k = \tfrac{1}{n}|\psi_k\rangle\langle\psi_k|$. For a qubit ($n = 4$ SIC states): $\{|\psi_k\rangle\}$ are the vertices of a regular tetrahedron on the Bloch sphere, and $\sum_k|\psi_k\rangle\langle\psi_k| = I$ (completeness). The success probability is $\tfrac{1}{4}\sum_k|\langle\psi_k|\psi_k\rangle|^2 = \tfrac{1}{4}\cdot 4\cdot 1 = 1/4$ times probability, giving $P_{\text{succ}} = 1/2$ for equiprobable qubits.

Pretty-good measurement (PGM): For an ensemble $\{p_k,|\psi_k\rangle\}$, the pretty-good measurement is $M_k = p_k\rho^{-1/2}|\psi_k\rangle\langle\psi_k|\rho^{-1/2}$ where $\rho = \sum_k p_k|\psi_k\rangle\langle\psi_k|$. It is not optimal in general but achieves within a constant factor of the optimal discrimination probability and is often optimal for symmetric ensembles.

2.5 Quantum Channels

Definition 2.6 (Completely positive map). A linear map $\Phi \in T(\mathcal{X},\mathcal{Y}) = L(L(\mathcal{X}),L(\mathcal{Y}))$ is:

Positive if $X \geq 0 \Rightarrow \Phi(X) \geq 0$.
Completely positive (CP) if $\Phi \otimes I_\mathcal{Z}$ is positive for every CES $\mathcal{Z}$.
Trace-preserving (TP) if $\operatorname{Tr}(\Phi(X)) = \operatorname{Tr}(X)$ for all $X$.

A quantum channel is a CPTP map $\Phi \in C(\mathcal{X},\mathcal{Y})$. Channels model the most general physically realisable evolution of a quantum system.

The distinction between positive and CP is important. The transpose map $T(A) = A^T$ is positive (it maps PSD operators to PSD operators) but not CP: $(T \otimes I)(|\beta\rangle\langle\beta|)$ is not PSD for the Bell state $|\beta\rangle$. This is why complete positivity — not just positivity — is the physically meaningful condition: any quantum channel must remain valid when applied to part of a larger system.

Remark (Why CP and not just positive?). Consider a system $\mathcal{X}$ entangled with a reference system $\mathcal{Z}$. A positive-but-not-CP map $\Phi$ applied to $\mathcal{X}$ alone gives $\Phi\otimes I_\mathcal{Z}$, which may fail to be positive on entangled states. This would produce non-physical (negative-eigenvalue) density operators. Hence physical channels must be CP. The transpose map is not physically implementable as a channel precisely because it fails this test.

Example 2.5 (Standard channels).

Unitary channel: $\mathcal{U}(\rho) = U\rho U^*$ for unitary $U$. Noiseless reversible evolution.
Dephasing channel: $\Delta(\rho) = \sum_a |a\rangle\langle a|\rho|a\rangle\langle a|$. Destroys off-diagonal coherences; Kraus operators $M_a = |a\rangle\langle a|$.
Depolarising channel: $\Omega_p(\rho) = (1-p)\rho + p\tfrac{I}{d}$. Mixes with the maximally mixed state.
Amplitude damping (qubit): $\Phi(\rho) = K_0\rho K_0^* + K_1\rho K_1^*$ with $K_0 = \begin{pmatrix}\!1\!&\!0\!\\\!0\!&\!\sqrt{1-\gamma}\!\end{pmatrix}$, $K_1 = \begin{pmatrix}\!0\!&\!\sqrt{\gamma}\!\\\!0\!&\!0\!\end{pmatrix}$. Models spontaneous emission.
Partial trace $\operatorname{Tr}_\mathcal{X}: L(\mathcal{X}\otimes\mathcal{Y}) \to L(\mathcal{Y})$: discarding subsystem $\mathcal{X}$.

Definition 2.7 (Choi matrix). The Choi matrix of $\Phi \in T(\mathcal{X},\mathcal{Y})$ is \[J(\Phi) = (\Phi \otimes I_\mathcal{X})(\operatorname{vec}(I_\mathcal{X})\operatorname{vec}(I_\mathcal{X})^*) = \sum_{a,b \in \Sigma} \Phi(|a\rangle\langle b|) \otimes |a\rangle\langle b| \in L(\mathcal{Y}\otimes\mathcal{X}).\] The map $\Phi \mapsto J(\Phi)$ is a linear bijection (the Choi isomorphism). The inverse is $\Phi(A) = \operatorname{Tr}_\mathcal{X}((I_\mathcal{Y}\otimes A^T)J(\Phi))$.

2.5 Representations of Channels and CP/TP Characterisations

Every quantum channel has four equivalent representations, each illuminating a different physical aspect.

Theorem 2.2 (Representations of CP maps; Watrous Thm 5.3). For $\Phi \in T(\mathcal{X},\mathcal{Y})$, the following are equivalent:

$\Phi$ is completely positive.
$\Phi \otimes I_\mathcal{X}$ is positive.
$J(\Phi) \in \operatorname{Pos}(\mathcal{Y}\otimes\mathcal{X})$.
$\Phi$ has a Kraus representation: $\Phi(M) = \sum_{a \in \Gamma} A_a M A_a^*$ for operators $\{A_a\} \subset L(\mathcal{X},\mathcal{Y})$.
As in (4), with exactly $r = \operatorname{rank}(J(\Phi))$ Kraus operators.
$\Phi$ has a Stinespring dilation: $\Phi(M) = \operatorname{Tr}_\mathcal{Z}(AMA^*)$ for isometry $A \in L(\mathcal{X},\mathcal{Y}\otimes\mathcal{Z})$.
As in (6), with $\dim\mathcal{Z} = \operatorname{rank}(J(\Phi))$.

Proof sketch. The cycle is $(1)\Rightarrow(2)\Rightarrow(3)\Rightarrow(5)\Rightarrow(4)\Rightarrow(6)\Rightarrow(1)$. The key step $(3)\Rightarrow(5)$: let $J(\Phi) = \sum_i \lambda_i |u_i\rangle\langle u_i|$ be the spectral decomposition with $\lambda_i > 0$. Each $|u_i\rangle \in \mathcal{Y}\otimes\mathcal{X}$, so write $|u_i\rangle = \operatorname{vec}(A_i)$ for some $A_i \in L(\mathcal{X},\mathcal{Y})$. Setting $K_i = \sqrt{\lambda_i}A_i$ gives the Kraus operators. The identity $J(\Phi) = \sum_i \operatorname{vec}(K_i)\operatorname{vec}(K_i)^*$ is verified using the definition of $J(\Phi)$ and the tensor identity. $\square$

Theorem 2.3 (Characterisations of TP; Watrous Thm 5.4). For CP map $\Phi$ with Kraus operators $\{A_a\}$, the following are equivalent:

$\Phi$ is trace-preserving.
The adjoint $\Phi^*$ is unital: $\Phi^*(I_\mathcal{Y}) = I_\mathcal{X}$.
$\operatorname{Tr}_\mathcal{Y}(J(\Phi)) = I_\mathcal{X}$.
$\sum_a A_a^* A_a = I_\mathcal{X}$.

The adjoint $\Phi^*: L(\mathcal{Y}) \to L(\mathcal{X})$ is defined by $\langle B, \Phi(A)\rangle = \langle \Phi^*(B), A\rangle$. For a Kraus map $\Phi(M) = \sum_a A_a M A_a^*$, the adjoint is $\Phi^*(B) = \sum_a A_a^* B A_a$. The TP condition $\sum_a A_a^*A_a = I$ is equivalent to $\Phi^*(I) = I$ (unital adjoint, the Heisenberg picture unitality).

The adjoint has the following physical interpretation. In the Schrödinger picture, $\Phi$ evolves states: $\rho \mapsto \Phi(\rho)$. In the Heisenberg picture, $\Phi^*$ evolves observables: an observable $B$ on $\mathcal{Y}$ is “pulled back” to $\Phi^*(B)$ on $\mathcal{X}$, so that the expected value is $\operatorname{Tr}(B\Phi(\rho)) = \operatorname{Tr}(\Phi^*(B)\rho)$. The TP condition ensures that the constant observable $I_\mathcal{Y}$ (which always has expectation 1) is preserved: $\Phi^*(I_\mathcal{Y}) = I_\mathcal{X}$. More generally, for a measurement POVM $\{M(a)\}_a$ on the output space, the pullback $\{\Phi^*(M(a))\}_a$ is the effective POVM on the input space.

Example 2.7 (Choi matrix of the dephasing channel). Using the definition with $\mathcal{X} = \mathbb{C}^2$ and $|\beta\rangle = \operatorname{vec}(I_2) = |00\rangle+|11\rangle$: \[J(\Delta) = (\Delta\otimes I_2)(|\beta\rangle\langle\beta|) = \Delta(|0\rangle\langle 0|)\otimes|0\rangle\langle 0| + \Delta(|0\rangle\langle 1|)\otimes|0\rangle\langle 1| + \Delta(|1\rangle\langle 0|)\otimes|1\rangle\langle 0| + \Delta(|1\rangle\langle 1|)\otimes|1\rangle\langle 1|.\] Since $\Delta(|0\rangle\langle 0|) = |0\rangle\langle 0|$, $\Delta(|1\rangle\langle 1|) = |1\rangle\langle 1|$, and $\Delta(|0\rangle\langle 1|) = \Delta(|1\rangle\langle 0|) = 0$: \[J(\Delta) = \begin{pmatrix}1&0&0&0\\0&0&0&0\\0&0&0&0\\0&0&0&1\end{pmatrix}.\] This is PSD (eigenvalues $1,1,0,0$) with rank 2, consistent with two Kraus operators. Verifying TP: $\operatorname{Tr}_{L(\mathcal{Y})}(J(\Delta)) = |0\rangle\langle 0| + |1\rangle\langle 1| = I$. $\checkmark$

The four representations of a channel form a cycle: Stinespring $\to$ Kraus (group rows of the isometry), Kraus $\to$ Choi (apply formula), Choi $\to$ Kraus (spectral decompose), Kraus $\to$ natural representation (vectorise). Each representation is uniquely useful: Kraus for computation, Choi for characterising CP, Stinespring for purification arguments.

Definition 2.10 (Complementary channel). Given a channel $\Phi \in C(\mathcal{X},\mathcal{Y})$ with Stinespring dilation $\Phi(M) = \operatorname{Tr}_\mathcal{Z}(AMA^*)$ for isometry $A: \mathcal{X}\to\mathcal{Y}\otimes\mathcal{Z}$, the complementary channel is $\hat\Phi \in C(\mathcal{X},\mathcal{Z})$ defined by $\hat\Phi(M) = \operatorname{Tr}_\mathcal{Y}(AMA^*)$. The complementary channel sends the input to the environment $\mathcal{Z}$ rather than the output $\mathcal{Y}$.

The complementary channel captures exactly the information that “leaks” to the environment during the evolution. If $\Phi$ is a noisy communication channel, $\hat\Phi$ models the eavesdropper’s channel. The symmetry between $\Phi$ and $\hat\Phi$ is not complete: while $\hat{(\hat\Phi)}$ is isometrically equivalent to $\Phi$ (by the fact that the Stinespring dilation is pure on $\mathcal{Y}\otimes\mathcal{Z}$), the dimensions of $\mathcal{Y}$ and $\mathcal{Z}$ can differ.

Proposition 2.3 (Properties of complementary channels).

A channel $\Phi$ is degradable if there exists a channel $\mathcal{D}$ such that $\hat\Phi = \mathcal{D}\circ\Phi$ (the environment can be obtained from the output by further processing).
A channel is anti-degradable if $\Phi = \mathcal{D}'\circ\hat\Phi$ for some channel $\mathcal{D}'$ (the output can be obtained from the environment).
Degradable implies zero coherent information gap: for degradable channels, the LSD theorem's regularisation is unnecessary and $Q(\Phi) = \max_\rho I_c(\rho,\Phi)$ (single-letter).
Anti-degradable implies zero quantum capacity: $Q(\Phi) = 0$, since the environment always has more information than the output.

Example 2.8b (Complementary channel of the dephasing channel). The dephasing channel $\Delta(\rho) = \sum_a |a\rangle\langle a|\rho|a\rangle\langle a|$ on $\mathbb{C}^d$ has Stinespring isometry $A = \sum_a|a\rangle\langle a|\otimes|a\rangle$ (maps $|a\rangle \to |a\rangle|a\rangle$). The output space is $\mathcal{Y} = \mathbb{C}^d$ and environment $\mathcal{Z} = \mathbb{C}^d$. The channel is $\Delta(\rho) = \operatorname{Tr}_\mathcal{Z}(A\rho A^*)$ and the complementary is $\hat\Delta(\rho) = \operatorname{Tr}_\mathcal{Y}(A\rho A^*)$. Computing: $A\rho A^* = \sum_{a,b}\rho_{ab}|aa\rangle\langle bb|$, so $\hat\Delta(\rho) = \operatorname{Tr}_\mathcal{Y}(\sum_{ab}\rho_{ab}|aa\rangle\langle bb|) = \sum_a\rho_{aa}|a\rangle\langle a|$ — the classical channel that produces the diagonal of $\rho$ in the computational basis. So $\hat\Delta = \Delta$: the dephasing channel is self-complementary. Degradability: $\hat\Delta = \Delta\circ\Delta = \Delta$ ✓ (since dephasing is idempotent). The dephasing channel is degradable, confirming its quantum capacity $Q(\Delta) = 0$ (one can verify: coherent information is non-positive for all inputs).

Remark (Complementary channel and quantum privacy). The complementary channel $\hat\Phi$ models the information available to an eavesdropper. A channel has zero quantum capacity iff the environment always "knows more" than the output (anti-degradable), meaning $\Phi = \mathcal{D}\circ\hat\Phi$. This is the channel-theoretic expression of the no-quantum-communication condition, and is dual to the no-private-information condition on the eavesdropper's side. Quantum key distribution works by ensuring the eavesdropper's channel $\hat\Phi$ is close to a constant channel.

Example 2.9 (Choi matrix to Kraus operators via spectral decomposition). Consider the amplitude damping channel on a qubit with damping parameter $\gamma \in [0,1]$: \[\Phi(\rho) = K_0\rho K_0^* + K_1\rho K_1^*, \quad K_0 = \begin{pmatrix}1 & 0 \\ 0 & \sqrt{1-\gamma}\end{pmatrix}, \quad K_1 = \begin{pmatrix}0 & \sqrt{\gamma} \\ 0 & 0\end{pmatrix}.\] We compute $J(\Phi)$ directly from the definition $J(\Phi) = (\Phi\otimes I)(|\beta\rangle\langle\beta|)$ with $|\beta\rangle = |00\rangle+|11\rangle$: \[J(\Phi) = \Phi(|0\rangle\langle 0|)\otimes|0\rangle\langle 0| + \Phi(|0\rangle\langle 1|)\otimes|0\rangle\langle 1| + \Phi(|1\rangle\langle 0|)\otimes|1\rangle\langle 0| + \Phi(|1\rangle\langle 1|)\otimes|1\rangle\langle 1|.\] Computing: $\Phi(|0\rangle\langle 0|) = |0\rangle\langle 0|$, $\Phi(|0\rangle\langle 1|) = \sqrt{1-\gamma}|0\rangle\langle 1|$, and $\Phi(|1\rangle\langle 1|) = (1-\gamma)|1\rangle\langle 1| + \gamma|0\rangle\langle 0|$. Thus in the basis $\{|00\rangle,|01\rangle,|10\rangle,|11\rangle\}$: \[J(\Phi) = \begin{pmatrix}1 & 0 & 0 & \sqrt{1-\gamma} \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \gamma & 0 \\ \sqrt{1-\gamma} & 0 & 0 & 1-\gamma\end{pmatrix}.\] Now we recover the Kraus operators from the eigenvectors of $J(\Phi)$. The eigenvalues of $J(\Phi)$ are $1, 1-\gamma, \gamma, 0$ (using the block structure). The eigenvector for eigenvalue 1 is $|u_1\rangle = |00\rangle + \sqrt{1-\gamma}|11\rangle / \sqrt{2-\gamma}$... Actually, it is more direct to recognise that $J(\Phi) = \operatorname{vec}(K_0)\operatorname{vec}(K_0)^* + \operatorname{vec}(K_1)\operatorname{vec}(K_1)^*$. One checks: $\operatorname{vec}(K_0) = (1,0,0,\sqrt{1-\gamma})^T$ and $\operatorname{vec}(K_1) = (0,0,\sqrt{\gamma},0)^T$, and their outer products sum to $J(\Phi)$. $\checkmark$

Example 2.10 (Stinespring dilation for the depolarising channel). The depolarising channel $\Omega_p(\rho) = (1-p)\rho + \tfrac{p}{3}(X\rho X + Y\rho Y + Z\rho Z)$ has Kraus operators $K_0 = \sqrt{1-p}\,I$, $K_1 = \sqrt{p/3}\,X$, $K_2 = \sqrt{p/3}\,Y$, $K_3 = \sqrt{p/3}\,Z$. The Stinespring isometry $A: \mathbb{C}^2 \to \mathbb{C}^2\otimes\mathbb{C}^4$ stacks these as rows: \[A|\psi\rangle = \sqrt{1-p}|\psi\rangle|0\rangle + \sqrt{p/3}\,X|\psi\rangle|1\rangle + \sqrt{p/3}\,Y|\psi\rangle|2\rangle + \sqrt{p/3}\,Z|\psi\rangle|3\rangle.\] Then $\Omega_p(\rho) = \operatorname{Tr}_{\mathbb{C}^4}(A\rho A^*)$. Check: $A^*A = (1-p)I + \tfrac{p}{3}(X^2+Y^2+Z^2) = (1-p)I + pI = I$. ✓ The environment $\mathbb{C}^4$ registers which Pauli error occurred (or no error). Tracing out the environment destroys this record, giving the depolarising noise.

Definition 2.11 (Natural representation). The natural representation of $\Phi \in T(\mathcal{X},\mathcal{Y})$ is the matrix $K(\Phi) \in L(\mathcal{Y}\otimes\mathcal{X})$ defined by $K(\Phi)\operatorname{vec}(M) = \operatorname{vec}(\Phi(M))$ for all $M \in L(\mathcal{X})$. If $\Phi(M) = \sum_k A_kMB_k^*$, then $K(\Phi) = \sum_k A_k\otimes\bar{B}_k$. The relationship between the natural and Choi representations is $J(\Phi) = S\cdot K(\Phi)$ where $S$ is the swap operator that permutes the two tensor factors.

2.5b Quantum Instruments and Post-Measurement States

The Kraus representation captures the marginal statistics of a measurement but not the post-measurement states. The more complete framework is the quantum instrument.

Definition 2.12 (Quantum instrument). A quantum instrument $\mathcal{I} = \{\Phi_a\}_{a\in\Gamma}$ is a finite collection of CP maps $\Phi_a: L(\mathcal{X})\to L(\mathcal{Y})$ such that $\sum_a\Phi_a$ is trace-preserving. The probability of outcome $a$ is $p(a) = \operatorname{Tr}(\Phi_a(\rho))$, and the post-measurement state is $\Phi_a(\rho)/p(a)$.

The key relationship: every POVM $\{M(a)\}$ arises as the marginal of a quantum instrument. Specifically, $M(a) = (\Phi_a)^*(I_\mathcal{Y})$ where $(\Phi_a)^*$ is the adjoint of $\Phi_a$. The simplest instrument associated with POVM element $M(a) = K_a^*K_a$ is the Lüders instrument $\Phi_a(\rho) = K_a\rho K_a^*$. The corresponding post-measurement state is $K_a\rho K_a^*/\operatorname{Tr}(K_a^*K_a\rho)$.

Example 2.11 (Projective measurement as instrument). For a projective POVM $M(a) = P_a$ (projectors): the Lüders instrument $\Phi_a(\rho) = P_a\rho P_a$ corresponds to the von Neumann measurement collapse. The post-measurement state is $P_a\rho P_a/\operatorname{Tr}(P_a\rho)$. The sum $\sum_a\Phi_a(\rho) = \sum_a P_a\rho P_a = \Delta(\rho)$ (the dephasing channel, which is TP). For an orthonormal basis $\{|a\rangle\}$: $\Delta(\rho) = \sum_a|a\rangle\langle a|\rho|a\rangle\langle a| = \operatorname{diag}(\rho)$, the diagonal channel.

Definition 2.13 (Quantum channel from instrument via classical register). Given an instrument $\mathcal{I} = \{\Phi_a\}$, the associated measure-and-record} channel maps $\rho$ to the classical-quantum state \[\mathcal{J}(\rho) = \sum_{a\in\Gamma}|a\rangle\langle a|\otimes\Phi_a(\rho) \in D(\mathbb{C}^\Gamma\otimes\mathcal{Y}).\] The classical register stores the measurement outcome; the quantum register stores the post-measurement state. The overall map $\mathcal{J}$ is a CPTP channel.

2.5c Random Unitary Channels and Pauli Channels

A practically important class of quantum channels are the random unitary channels (also called unital channels of Pauli type), which model incoherent noise processes.

Definition 2.10b (Random unitary channel). A channel $\Phi: L(\mathcal{X})\to L(\mathcal{X})$ is a random unitary channel if $\Phi(\rho) = \sum_k p_k U_k \rho U_k^*$ for unitaries $U_k$ and probability distribution $p_k \geq 0$, $\sum_k p_k = 1$. Random unitary channels are automatically unital: $\Phi(I) = \sum_k p_k U_k I U_k^* = I$.

\[\Phi(\rho) = (1-p_x-p_y-p_z)\rho + p_x X\rho X + p_y Y\rho Y + p_z Z\rho Z,\]

parameterised by error probabilities $p_x, p_y, p_z \geq 0$ with $p_x+p_y+p_z \leq 1$. Special cases: the depolarising channel has $p_x = p_y = p_z = p/3$; the dephasing channel has $p_x=p_y=0$, $p_z=p$; the bit-flip channel has $p_x=p$, $p_y=p_z=0$.

The Pauli channel has a particularly clean Choi matrix: since $\{I,X,Y,Z\}/\sqrt{2}$ is an orthonormal basis for $L(\mathbb{C}^2)$, the Choi matrix decomposes as $J(\Phi) = \sum_P\hat\Phi_P\, P\otimes P^T$, where $\hat\Phi_P$ are the eigenvalues of $\Phi$ in the Pauli basis. The CP condition on $J(\Phi)\geq 0$ gives constraints on the probabilities $(p_x,p_y,p_z)$.

Example 2.8c (Bloch ball contraction under Pauli channel). For the dephasing channel $\Phi(\rho) = (1-p)\rho + pZ\rho Z$ in Bloch vector form: $\vec{r} \mapsto M\vec{r}$ with $M = \operatorname{diag}(1-2p, 1-2p, 1)$. This contracts the equatorial plane ($r_x,r_y$) by factor $1-2p$ while leaving the $Z$-axis intact. The channel maps the Bloch sphere to an ellipsoid with semiaxes $(1-2p, 1-2p, 1)$. For the fully depolarising channel ($p_x=p_y=p_z=p/3$): $M = (1-4p/3)I$, contracting the entire Bloch sphere uniformly. At $p=3/4$: $M=0$, mapping everything to the origin (maximally mixed state).

2.6 Discrete Weyl Operators and Quantum Teleportation

The generalised Pauli operators form an orthonormal basis for $L(\mathcal{X})$ and underlie the quantum one-time pad and teleportation.

Definition 2.8 (Discrete Weyl operators). For $\mathcal{X} = \mathbb{C}^d$ with computational basis $\{|0\rangle,\ldots,|d-1\rangle\}$ and $\omega = e^{2\pi i/d}$, define the shift operator $X = \sum_{a}|a+1\rangle\langle a|$ (indices mod $d$) and clock operator $Z = \sum_a \omega^a|a\rangle\langle a|$. The discrete Weyl operator is $W_{b,c} = X^b Z^c$ for $b,c \in \mathbb{Z}_d$.

Proposition 2.1 (Properties of Weyl operators).

Unitarity: Each $W_{b,c}$ is unitary.
Orthogonality: $\langle W_{b,c}, W_{b',c'}\rangle = d\cdot\mathbf{1}[b=b',c=c']$, so $\{W_{b,c}/\sqrt{d}\}_{b,c}$ is an orthonormal basis for $L(\mathcal{X})$.
Commutation: $XZ = \omega ZX$, so $W_{b,c}W_{b',c'} = \omega^{bc'} W_{b+b',c+c'}$.
Depolarisation: $\frac{1}{d^2}\sum_{b,c}W_{b,c}AW_{b,c}^* = \operatorname{Tr}(A)\cdot\frac{I}{d}$ for all $A$.

Definition 2.9 (Quantum one-time pad). The quantum OTP uses a shared uniformly random secret key $(b,c) \in \mathbb{Z}_d^2$ to encrypt state $\rho$ by sending $W_{b,c}\rho W_{b,c}^*$. The eavesdropper sees $\frac{1}{d^2}\sum_{b,c}W_{b,c}\rho W_{b,c}^* = \frac{I}{d}$, independent of $\rho$. Bob recovers $\rho$ by applying $W_{b,c}^*$. The scheme requires $2\log_2 d$ secret bits — precisely the quantum information content of the $d$-dimensional state.

Quantum Teleportation Protocol (Bennett et al. 1993)
Setup: Alice and Bob share the maximally entangled state $|\beta_{00}\rangle = \tfrac{1}{\sqrt{d}}\sum_a|a\rangle_A|a\rangle_B$. Alice wishes to send unknown state $|\psi\rangle_C = \sum_a\alpha_a|a\rangle_C$ to Bob using only classical communication.
Step 1: Alice measures registers $C, A$ in the Bell basis $\{(W_{b,c}\otimes I)|\beta_{00}\rangle\}_{b,c}$, obtaining outcome $(b,c)$ each with probability $1/d^2$.
Step 2: Alice sends the classical outcome $(b,c)$ to Bob (costs $2\log_2 d$ classical bits).
Step 3: Bob applies $W_{b,c}^*$ to his register $B$, recovering $|\psi\rangle$.
Why it works: The pre-measurement state of $C,A,B$ is $\sum_{b,c}\tfrac{1}{d}(W_{b,c}\otimes I)|\beta_{00}\rangle_{AB}\otimes W_{b,c}^*|\psi\rangle_B$ (expansion in the Bell basis). Alice's measurement collapses $B$ to $W_{b,c}^*|\psi\rangle$ for the observed $(b,c)$.

Remark. Teleportation does not allow faster-than-light communication because Alice must transmit the classical outcome $(b,c)$ before Bob can apply the correction. Without this transmission, Bob's register is in state $\operatorname{Tr}_{CA}(\rho_{CAB}) = I/d$ — the maximally mixed state, conveying no information about $|\psi\rangle$.

Superdense Coding Protocol (Bennett-Wiesner 1992)
Task: Alice wants to send 2 classical bits to Bob using only 1 qubit of quantum communication, with a pre-shared Bell pair.
Setup: Alice and Bob share $|\beta_{00}\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$.
Step 1: Alice encodes her 2-bit message $(b,c) \in \{0,1\}^2$ by applying $W_{b,c}$ to her qubit:

(0,0): apply $I$ → joint state $|\beta_{00}\rangle$
(0,1): apply $Z$ → joint state $|\beta_{01}\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle-|11\rangle)$
(1,0): apply $X$ → joint state $|\beta_{10}\rangle = \tfrac{1}{\sqrt{2}}(|01\rangle+|10\rangle)$
(1,1): apply $iY$ → joint state $|\beta_{11}\rangle = \tfrac{1}{\sqrt{2}}(|01\rangle-|10\rangle)$

Step 2: Alice sends her qubit to Bob (1 qubit transmission).
Step 3: Bob performs a Bell measurement on both qubits and identifies which of the four Bell states he received, recovering 2 classical bits.
Resource accounting: 1 pre-shared ebit + 1 qubit communication → 2 classical bits transmitted. Combined with teleportation: 2 cbits + 1 ebit → 1 qubit. These are inverse protocols, and together they confirm that 1 ebit + 1 qubit $\leftrightarrow$ 2 cbits (in resource terms).

Theorem 2.4 (No-cloning theorem). There is no quantum channel $\Phi: L(\mathcal{X})\to L(\mathcal{X}\otimes\mathcal{X})$ such that $\Phi(|\psi\rangle\langle\psi|) = |\psi\rangle\langle\psi|\otimes|\psi\rangle\langle\psi|$ for all $|\psi\rangle \in \mathcal{X}$.

Proof. Suppose such $\Phi$ existed. It must clone two orthonormal states $|0\rangle,|1\rangle$: $\Phi(|0\rangle\langle 0|) = |00\rangle\langle 00|$ and $\Phi(|1\rangle\langle 1|) = |11\rangle\langle 11|$. By linearity and CP: \[\Phi(|+\rangle\langle+|) = \Phi\!\Bigl(\tfrac{|0\rangle\langle 0|+|1\rangle\langle 1|+|0\rangle\langle 1|+|1\rangle\langle 0|}{2}\Bigr) = \tfrac{|00\rangle\langle 00|+|11\rangle\langle 11|+\Phi(|0\rangle\langle 1|)+\Phi(|1\rangle\langle 0|)}{2}.\] For $\Phi$ to clone $|+\rangle$, we'd need $\Phi(|+\rangle\langle+|) = |++\rangle\langle++| = \tfrac{1}{4}(|00\rangle+|01\rangle+|10\rangle+|11\rangle)(\langle 00|+\langle 01|+\langle 10|+\langle 11|)$, which requires $\Phi(|0\rangle\langle 1|) = \tfrac{1}{2}(|00\rangle\langle 11|+|01\rangle\langle 00|+|01\rangle\langle 11|+\ldots)$. But the Choi matrix of $\Phi$ would fail to be positive semidefinite (one can check that the off-diagonal requirements are inconsistent with CP). More elegantly: the fidelity argument — if $\Phi$ clones, then for any two non-orthogonal states $|\psi\rangle,|\phi\rangle$, $|\langle\psi|\phi\rangle|^2 = F(\Phi(|\psi\rangle\langle\psi|),\Phi(|\phi\rangle\langle\phi|)) = F(|\psi\psi\rangle\langle\psi\psi|,|\phi\phi\rangle\langle\phi\phi|) = |\langle\psi|\phi\rangle|^4$. This requires $|\langle\psi|\phi\rangle|^2 = |\langle\psi|\phi\rangle|^4$, which holds only for $|\langle\psi|\phi\rangle|\in\{0,1\}$ — i.e., orthogonal or equal states. Contradiction. $\square$

Chapter 3: Purifications and Fidelity

Purification is the idea that any mixed state of a system can be regarded as part of a pure state on a larger system. This perspective converts questions about mixed states into questions about pure states, often dramatically simplifying proofs. The fidelity function, which quantifies how close two quantum states are, has its deepest characterisation through purifications via Uhlmann’s theorem.

3.1 Purifications

Definition 3.1 (Purification). Let $P \in \operatorname{Pos}(\mathcal{X})$. A vector $|u\rangle \in \mathcal{X}\otimes\mathcal{Y}$ is a purification of $P$ if $\operatorname{Tr}_\mathcal{Y}(|u\rangle\langle u|) = P$. The auxiliary space $\mathcal{Y}$ is called the purifying system.

The connection to the vec operator is fundamental: $|u\rangle$ purifies $P$ iff $|u\rangle = \operatorname{vec}(A)$ for some $A \in L(\mathcal{Y},\mathcal{X})$ with $AA^* = P$. This uses the identity $\operatorname{Tr}_\mathcal{Y}(\operatorname{vec}(A)\operatorname{vec}(A)^*) = AA^*$.

Theorem 3.1 (Existence of purifications). $P \in \operatorname{Pos}(\mathcal{X})$ has a purification in $\mathcal{X}\otimes\mathcal{Y}$ if and only if $\dim\mathcal{Y} \geq \operatorname{rank}(P)$.

Proof. We need $A \in L(\mathcal{Y},\mathcal{X})$ with $AA^* = P$. The rank-factorisation of $P$ is possible iff $\dim\mathcal{Y} \geq \operatorname{rank}(P)$. Explicitly: let $P = \sum_{k=1}^r \lambda_k|d_k\rangle\langle d_k|$ be the spectral decomposition and $\{|y_k\rangle\}_{k=1}^r$ any orthonormal set in $\mathcal{Y}$. Set $A = \sum_k\sqrt{\lambda_k}|d_k\rangle\langle y_k|$. Then $|u\rangle = \operatorname{vec}(A) = \sum_k\sqrt{\lambda_k}|d_k\rangle|y_k\rangle \in \mathcal{X}\otimes\mathcal{Y}$ purifies $P$. $\square$

Theorem 3.2 (Unitary equivalence of purifications). If $|u\rangle, |v\rangle \in \mathcal{X}\otimes\mathcal{Y}$ both satisfy $\operatorname{Tr}_\mathcal{Y}(|u\rangle\langle u|) = \operatorname{Tr}_\mathcal{Y}(|v\rangle\langle v|) = P$, then there exists a unitary $V \in \mathcal{U}(\mathcal{Y})$ such that $|v\rangle = (I_\mathcal{X}\otimes V)|u\rangle$.

Proof. Write $|u\rangle = \operatorname{vec}(A)$ and $|v\rangle = \operatorname{vec}(B)$ with $AA^* = BB^* = P$. Let $P = \sum_k\lambda_k|d_k\rangle\langle d_k|$ and choose orthonormal sets $\{|y_k\rangle\},\{|z_k\rangle\}$ so that $A = \sum_k\sqrt{\lambda_k}|d_k\rangle\langle y_k|$ and $B = \sum_k\sqrt{\lambda_k}|d_k\rangle\langle z_k|$. Define $V$ on $\operatorname{span}\{|y_k\rangle\}$ by $V|y_k\rangle = |z_k\rangle$ and extend to a unitary on $\mathcal{Y}$. Then $(I\otimes V)\operatorname{vec}(A) = \operatorname{vec}(AV^T) = \operatorname{vec}(B)$. $\square$

Example 3.1. Let $\rho = \tfrac{3}{4}|0\rangle\langle 0| + \tfrac{1}{4}|1\rangle\langle 1| \in D(\mathbb{C}^2)$. A purification is $|u\rangle = \tfrac{\sqrt{3}}{2}|0\rangle|0\rangle + \tfrac{1}{2}|1\rangle|1\rangle \in \mathbb{C}^2\otimes\mathbb{C}^2$. Another purification is $|v\rangle = \tfrac{\sqrt{3}}{2}|0\rangle|1\rangle + \tfrac{1}{2}|1\rangle|0\rangle$. These are related by the swap unitary $V = X$ (Pauli-$X$) acting on the second qubit: $(I\otimes X)|u\rangle = |v\rangle$.

The unitary equivalence theorem (Theorem 3.2) is the quantum analogue of the fact that all probability distributions with the same marginal can be related by a coupling. Operationally: if you have access to only the $\mathcal{X}$ part of a purification, the choice of purifying system $\mathcal{Y}$ is irrelevant — all purifications of $\rho$ look the same to Alice. Bob can rotate between them without affecting Alice.

Theorem 3.3 (Schmidt decomposition). Any $|u\rangle \in \mathcal{X}\otimes\mathcal{Y}$ with $\dim\mathcal{X} \leq \dim\mathcal{Y}$ has a unique Schmidt decomposition \[|u\rangle = \sum_{k=1}^r \sqrt{\lambda_k}|e_k\rangle|f_k\rangle,\quad \lambda_1 \geq \cdots \geq \lambda_r > 0,\] where $\{|e_k\rangle\}$ is orthonormal in $\mathcal{X}$, $\{|f_k\rangle\}$ is orthonormal in $\mathcal{Y}$, and $r = \operatorname{rank}(\operatorname{Tr}_\mathcal{Y}|u\rangle\langle u|)$ is the Schmidt rank. The Schmidt coefficients $\{\lambda_k\}$ are the eigenvalues of $\operatorname{Tr}_\mathcal{Y}|u\rangle\langle u|$.

Proof. Write $|u\rangle = \sum_{ij}C_{ij}|i\rangle|j\rangle$ for some matrix $C$. The SVD $C = \sum_k\sqrt{\lambda_k}|e_k\rangle\langle f_k|$ (with $|e_k\rangle, |f_k\rangle$ the left and right singular vectors) gives $|u\rangle = \sum_k\sqrt{\lambda_k}|e_k\rangle|f_k\rangle$. The reduced state on $\mathcal{X}$ is $\operatorname{Tr}_\mathcal{Y}|u\rangle\langle u| = CC^* = \sum_k\lambda_k|e_k\rangle\langle e_k|$, confirming that the $\lambda_k$ are eigenvalues of the reduced state. Uniqueness of the SVD up to phases of degenerate singular values. $\square$

Remark (No Schmidt decomposition for mixed states). The Schmidt decomposition applies only to pure bipartite states. For mixed states $\rho^{AB}$, the analogous decomposition is more complex: one uses the spectral decomposition of the Choi-like operator $C_\rho = \sum_{ij}\rho_{ij,kl}|ij\rangle\langle kl|$, which is related to the partial transpose. A mixed state can have a "Schmidt number" $\geq 2$ even when it is not entangled in the state sense — this leads to the theory of Schmidt rank for operators (Definition 6.5 in Chapter 6).

Example 3.1b (Schmidt rank and entanglement detection). The four Bell states \[|\beta_{ab}\rangle = \tfrac{1}{\sqrt{2}}\sum_{x\in\{0,1\}}(-1)^{ax}|x\rangle|(x\oplus b)\rangle, \quad a,b\in\{0,1\}\] all have Schmidt rank 2 and entropy of entanglement 1 ebit — they are maximally entangled. The state $|00\rangle$ has Schmidt rank 1 — it is a product state. For an arbitrary pure state in $\mathbb{C}^2\otimes\mathbb{C}^d$, the Schmidt rank determines whether the state is entangled: rank 1 iff product, rank $\geq 2$ iff entangled. For a mixed state $\rho^{AB}$, the **Schmidt number** is the minimum Schmidt rank across all pure-state decompositions $\rho = \sum_k p_k|\psi_k\rangle\langle\psi_k|$. A separable state has Schmidt number 1; an entangled state has Schmidt number $\geq 2$. Witnessing Schmidt number $\geq k$ requires entanglement witnesses from the dual cone of states with Schmidt number $< k$.

Definition 3.2 (Purification of a bipartite state). If $\rho^{AB} \in D(\mathcal{X}_A\otimes\mathcal{X}_B)$, a purification is a pure state $|\psi\rangle^{ABR} \in \mathcal{X}_A\otimes\mathcal{X}_B\otimes\mathcal{X}_R$ with $\operatorname{Tr}_R(|\psi\rangle\langle\psi|) = \rho^{AB}$. In the Schmidt decomposition of $|\psi\rangle^{A(BR)}$, the Schmidt coefficients of the bipartition $A:(BR)$ equal the square roots of the eigenvalues of $\rho^A = \operatorname{Tr}_{BR}(\rho^{ABR})$. Purifications of $\rho^{AB}$ in $\mathcal{X}_A\otimes\mathcal{X}_B\otimes\mathcal{X}_R$ exist iff $\dim\mathcal{X}_R \geq \operatorname{rank}(\rho^{AB})$.

The purification of a bipartite state plays a crucial role in proofs about conditional entropy, quantum channel capacity, and Uhlmann’s theorem. A key identity: for any pure tripartite state $|\psi\rangle^{ABR}$, $S(\rho^A) = S(\rho^{BR})$ and $S(\rho^{AB}) = S(\rho^R)$. These follow from the Schmidt decomposition across each bipartition and the purity of the global state.

Proposition 3.1 (Entropy equalities for pure global states). Let $|\psi\rangle^{ABC}$ be a pure tripartite state. Then:

$S(AB) = S(C)$ and $S(AC) = S(B)$ and $S(BC) = S(A)$.
The conditional entropies satisfy $S(A|B) = -S(A|C)$ and cyclic permutations.
The quantum mutual information satisfies $I(A:B) = S(A)+S(B)-S(C)$ and is symmetric in all cyclic permutations.

These identities are the quantum analogues of the chain rule for conditional probabilities and are the key technical tool in entropy-based proofs of quantum information theorems. They encode the monogamy structure: if $S(A|B) < 0$ (A entangled with B), then $S(A|C) > 0$ (A not entangled with C), consistent with monogamy of entanglement.

3.2 The Fidelity Function

Definition 3.2 (Fidelity). For $P,Q \in \operatorname{Pos}(\mathcal{X})$, the fidelity is \[F(P,Q) = \|\sqrt{P}\sqrt{Q}\|_1 = \operatorname{Tr}\bigl|\sqrt{P}\sqrt{Q}\bigr| = \operatorname{Tr}\sqrt{\sqrt{P}Q\sqrt{P}}.\]

Proposition 3.1 (Properties of fidelity).

Symmetry: $F(P,Q) = F(Q,P)$.
Range: For $\rho,\sigma \in D(\mathcal{X})$, $0 \leq F(\rho,\sigma) \leq 1$.
Pure-mixed: $F(|\psi\rangle\langle\psi|,\sigma) = \sqrt{\langle\psi|\sigma|\psi\rangle}$.
Pure-pure: $F(|\psi\rangle\langle\psi|,|\phi\rangle\langle\phi|) = |\langle\psi|\phi\rangle|$.
Multiplicativity: $F(P_1\otimes P_2, Q_1\otimes Q_2) = F(P_1,Q_1)F(P_2,Q_2)$.
Extremes: $F(\rho,\sigma) = 1$ iff $\rho = \sigma$; $F(\rho,\sigma) = 0$ iff $\operatorname{supp}(\rho)\perp\operatorname{supp}(\sigma)$.

Example 3.2. For two qubit pure states $|\psi\rangle = \cos\theta|0\rangle+\sin\theta|1\rangle$ and $|\phi\rangle = \cos\phi|0\rangle+\sin\phi|1\rangle$, $F = |\langle\psi|\phi\rangle| = |\cos(\theta-\phi)|$. For the maximally mixed state $\sigma = I/2$ and any pure state $|\psi\rangle$, $F(|\psi\rangle\langle\psi|, I/2) = \sqrt{1/2} = 1/\sqrt{2}$.

3.3 Uhlmann’s Theorem

Uhlmann’s theorem gives the most elegant characterisation of fidelity: it is the maximal overlap between purifications. The theorem can be understood intuitively as follows: the fidelity $F(\rho,\sigma)$ measures how “close” the two states are in terms of how much their purifications can be aligned. Since all purifications of $\rho$ in a fixed system $\mathcal{X}\otimes\mathcal{Y}$ are related by unitaries on $\mathcal{Y}$, the question of maximising $|\langle u|v\rangle|$ over purifications $|v\rangle$ of $\sigma$ reduces to the variational problem $\max_V |\operatorname{Tr}(A^*BV^T)|$, which by the trace-norm variational formula equals $\|A^*B\|_1$.

Theorem 3.3 (Uhlmann's theorem). Let $P,Q \in \operatorname{Pos}(\mathcal{X})$ and let $|u\rangle \in \mathcal{X}\otimes\mathcal{Y}$ (with $\dim\mathcal{Y} \geq \max\{\operatorname{rank}P,\operatorname{rank}Q\}$) be any purification of $P$. Then \[F(P,Q) = \max\bigl\{|\langle u|v\rangle| : |v\rangle \in \mathcal{X}\otimes\mathcal{Y},\;\operatorname{Tr}_\mathcal{Y}(|v\rangle\langle v|) = Q\bigr\}.\]

Proof. Write $|u\rangle = \operatorname{vec}(A)$ with $AA^* = P$ and $|v\rangle = (I\otimes V)\operatorname{vec}(B)$ with $BB^* = Q$ and $V \in \mathcal{U}(\mathcal{Y})$ (by Theorem 3.2, all purifications of $Q$ have this form). Then \[\langle u|v\rangle = \langle\operatorname{vec}(A),(I\otimes V)\operatorname{vec}(B)\rangle = \langle\operatorname{vec}(A),\operatorname{vec}(BV^T)\rangle = \langle A, BV^T\rangle = \operatorname{Tr}(A^*BV^T).\] By the variational formula $\|C\|_1 = \max_U |\operatorname{Tr}(CU)|$, maximising over unitaries $V$ gives \[\max_V |\langle u|v\rangle| = \|A^*B\|_1 = \|\sqrt{P}\sqrt{Q}\|_1 = F(P,Q),\] using the SVD and the polar decomposition for the last equality. $\square$

Remark. Uhlmann's theorem has a deep physical interpretation: fidelity measures how entangled the two states can appear to be when purified into a joint system. The more similar $P$ and $Q$, the more their purifications can be aligned.

The proof of Uhlmann’s theorem via SDPs proceeds differently and more directly shows the connection between fidelity and the primal optimisation. Following Watrous (Lecture 6, LN2011) and the in-course SDP notes from Debbie Leung, the key steps are:

Proof of Uhlmann via SDP duality (sketch). Consider the primal SDP \[d = \sup_{W \in \operatorname{Pos}(\mathcal{X}\otimes\mathcal{Y})}\langle W, uu^*\rangle \quad\text{s.t.}\quad \operatorname{Tr}_\mathcal{Y}(W) = Q,\] where $|u\rangle$ is a fixed purification of $P$. The dual SDP is \[\beta = \inf_{Z \in \operatorname{Herm}(\mathcal{X})} \langle Q, Z\rangle \quad\text{s.t.}\quad Z\otimes I_\mathcal{Y} \geq uu^*.\] The dual constraint $Z\otimes I_\mathcal{Y} \geq uu^*$ is equivalent, via a Schur complement argument, to $\langle Z^{-1}, P\rangle \leq 1$ for $Z > 0$. Restricting to $\langle Z^{-1},P\rangle = 1$ (scaling invariance) and squaring gives $\beta^2 = \inf_{Z>0}\langle P,Z^{-1}\rangle\langle Q,Z\rangle$. By Alberti's theorem, this equals $F(P,Q)^2$, so $\beta = F(P,Q)$. Strong duality follows from the strictly feasible primal point $W = Q\otimes\tfrac{I_\mathcal{Y}}{\dim\mathcal{Y}}$ and the dual point $Z = \|u\|^2 I_\mathcal{X}$. The primal optimal value $d = F(P,Q)^2$ is achieved by $W = |v^*\rangle\langle v^*|$ where $|v^*\rangle$ is the optimal purification of $Q$ achieving the fidelity overlap. $\square$

Uhlmann’s theorem can also be derived from the SDP formulation of fidelity. The primal SDP for $F(P,Q)$ is

\[d = \sup\;\tfrac{1}{2}\operatorname{Tr}(X)+\tfrac{1}{2}\operatorname{Tr}(X^*)\quad\text{s.t.}\quad\begin{bmatrix}P & X \\ X^* & Q\end{bmatrix}\geq 0.\]

The dual is $\beta = \inf\;\tfrac{1}{2}\langle P,Y\rangle+\tfrac{1}{2}\langle Q,Z\rangle$ s.t. $\begin{bmatrix}Y & I \\ I & Z\end{bmatrix}\geq 0$. Strong duality ($d=\beta$, attained) is verified using Slater’s condition. The dual with the substitution $Z = Y^{-1}$ (optimal when $P,Q>0$) becomes Alberti’s theorem. The primal with the Schur complement gives $\|V\|_\infty\leq 1$ which connects to the variational formula $F(P,Q) = \sup_{\|V\|_\infty\leq 1}\operatorname{Re}\operatorname{Tr}(V^*\sqrt{P}\sqrt{Q}) = \|\sqrt{P}\sqrt{Q}\|_1$. This is exactly the statement of Uhlmann’s theorem with $V = U$ a unitary (the variational formula is maximised by a unitary).

Example 3.4 (Fidelity between qubit states via SDP). Let $\rho = \begin{pmatrix}3/4 & 0 \\ 0 & 1/4\end{pmatrix}$ and $\sigma = \begin{pmatrix}1/2 & 1/2 \\ 1/2 & 1/2\end{pmatrix} = |+\rangle\langle+|$. Then $\sqrt{\rho} = \begin{pmatrix}\sqrt{3}/2 & 0 \\ 0 & 1/2\end{pmatrix}$ and $F(\rho,\sigma) = \|\sqrt{\rho}\sqrt{\sigma}\|_1$. Since $\sigma$ is rank 1, $\sqrt{\sigma} = \sigma$ (as a PSD operator, $\sigma^{1/2} = |+\rangle\langle+|$). Then $\sqrt{\rho}\sqrt{\sigma} = \begin{pmatrix}\sqrt{3}/2 & 0 \\ 0 & 1/2\end{pmatrix}\begin{pmatrix}1/2 & 1/2 \\ 1/2 & 1/2\end{pmatrix} = \begin{pmatrix}\sqrt{3}/4 & \sqrt{3}/4 \\ 1/4 & 1/4\end{pmatrix}$. The singular values are $\tfrac{1}{4}(\sqrt{3}+1)$ and $0$, so $F = \tfrac{\sqrt{3}+1}{4} \approx 0.683$.

3.3b Process Fidelity and the Entanglement Fidelity

Beyond the state fidelity $F(\rho,\sigma)$, there is a natural fidelity measure for quantum channels that compares the “distance” between two quantum processes.

Definition 3.2b (Entanglement fidelity). For a quantum channel $\Phi: L(\mathcal{X})\to L(\mathcal{Y})$ and a state $\rho \in D(\mathcal{X})$, the entanglement fidelity is \[F_e(\rho,\Phi) = \langle u|(\Phi\otimes I)(|u\rangle\langle u|)|u\rangle,\] where $|u\rangle \in \mathcal{X}\otimes\mathcal{X}$ is any purification of $\rho$ with $\operatorname{Tr}_\mathcal{X}(|u\rangle\langle u|) = \rho$. The entanglement fidelity measures how well $\Phi$ preserves the entanglement of $\rho$ with a reference system.

The entanglement fidelity has the following key properties. First, it is independent of the choice of purification $|u\rangle$ (by the unitary equivalence of purifications and linearity). Second, for the identity channel $\Phi = \operatorname{id}$, $F_e(\rho,\operatorname{id}) = 1$ for any $\rho$. Third, the average fidelity over all pure states is related to $F_e$ by $\bar F(\Phi) = \int F(\Phi(|\psi\rangle\langle\psi|),|\psi\rangle\langle\psi|)d|\psi\rangle = \frac{dF_e(\rho_{\text{mm}},\Phi)+1}{d+1}$ where $\rho_{\text{mm}} = I/d$ (Horodecki-Horodecki-Horodecki 1999).

Definition 3.2c (Process fidelity / gate fidelity). For two quantum channels $\Phi,\Psi: L(\mathcal{X})\to L(\mathcal{Y})$, the process fidelity is \[F_{\text{proc}}(\Phi,\Psi) = F(J(\Phi)/d, J(\Psi)/d),\] where $J(\Phi), J(\Psi)$ are their Choi matrices and $d = \dim\mathcal{X}$. The process fidelity measures how similar the Choi matrices are, and reduces to $F(\rho,\sigma)$ when $\Phi,\Psi$ are replacer channels $\Phi(\cdot)=\rho, \Psi(\cdot)=\sigma$.

Example 3.4b (Process fidelity of dephased unitary). Consider the ideal unitary channel $\mathcal{U}(\rho) = U\rho U^*$ and the dephased version $\Phi(\rho) = (1-p)\mathcal{U}(\rho) + p Z\mathcal{U}(\rho)Z^*$ (phase error with probability $p$). The Choi matrices are $J(\mathcal{U}) = |\psi_U\rangle\langle\psi_U|$ (rank 1, pure) and $J(\Phi) = (1-p)|\psi_U\rangle\langle\psi_U| + p(Z\otimes I)|\psi_U\rangle\langle\psi_U|(Z\otimes I)^*$. The process fidelity is $F_{\text{proc}}(\mathcal{U},\Phi)^2 = (1-p)^2$, so $F_{\text{proc}} = 1-p$. For $p=0$: perfect channel, $F_{\text{proc}}=1$. For $p=1/2$: half the time the gate applies an extra $Z$, giving $F_{\text{proc}}=1/2$.

3.4 Alberti’s Theorem

While Uhlmann characterises fidelity through purifications, Alberti’s theorem gives a variational characterisation through an optimisation over positive definite operators — the form that naturally arises as the dual of the fidelity SDP.

Theorem 3.4. For $P,Q \in \operatorname{Pos}(\mathcal{X})$, \[F(P,Q) = \inf_{Y \in \operatorname{Pd}(\mathcal{X})} \Bigl\{\tfrac{1}{2}\langle P,Y\rangle + \tfrac{1}{2}\langle Q,Y^{-1}\rangle\Bigr\}.\]

Corollary 3.1 (Alberti's theorem). $F(P,Q)^2 = \inf_{Y > 0} \langle P,Y\rangle\langle Q,Y^{-1}\rangle$.

Proof of Corollary from Theorem 3.4. By the AM-GM inequality, $\tfrac{1}{2}\langle P,Y\rangle + \tfrac{1}{2}\langle Q,Y^{-1}\rangle \geq \sqrt{\langle P,Y\rangle\langle Q,Y^{-1}\rangle}$. Squaring the infimum of both sides: $F(P,Q)^2 \geq \inf_{Y>0}\langle P,Y\rangle\langle Q,Y^{-1}\rangle$. Equality holds when $\langle P,Y\rangle = \langle Q,Y^{-1}\rangle$, which can always be achieved by scaling $Y \to \lambda Y$ for appropriate $\lambda > 0$. $\square$

3.5 Fuchs-van de Graaf Inequalities

The fidelity and trace distance are the two primary distance measures for quantum states. They are related by tight inequalities.

Theorem 3.5 (Fuchs-van de Graaf inequalities). For $\rho,\sigma \in D(\mathcal{X})$, \[1 - \tfrac{1}{2}\|\rho-\sigma\|_1 \leq F(\rho,\sigma) \leq \sqrt{1 - \tfrac{1}{4}\|\rho-\sigma\|_1^2}.\]

Proof. Lower bound: Using $\|\rho-\sigma\|_1 \geq \|\sqrt\rho-\sqrt\sigma\|_2^2 = 2 - 2\operatorname{Re}\operatorname{Tr}(\sqrt\rho\sqrt\sigma) \geq 2 - 2\|\sqrt\rho\sqrt\sigma\|_1 = 2 - 2F(\rho,\sigma)$, which rearranges to $F \geq 1 - \tfrac{1}{2}\|\rho-\sigma\|_1$.

Upper bound: By Uhlmann, choose purifications $|u\rangle,|v\rangle$ achieving $F(\rho,\sigma) = |\langle u|v\rangle|$. Then $\|\rho-\sigma\|_1 \leq \|uu^*-vv^*\|_1 = 2\sqrt{1-|\langle u|v\rangle|^2} = 2\sqrt{1-F^2}$ (using data processing under the partial trace and the pure-state trace distance formula). Rearranging: $F \leq \sqrt{1-\tfrac{1}{4}\|\rho-\sigma\|_1^2}$. $\square$

Example 3.3. For $\rho = |0\rangle\langle 0|$ and $\sigma = |+\rangle\langle+|$, $F = |\langle 0|+\rangle| = 1/\sqrt{2} \approx 0.707$. Trace distance: $\tfrac{1}{2}\|\rho-\sigma\|_1 = \sqrt{1-F^2} = 1/\sqrt{2}$. Check bounds: lower: $1 - 1/\sqrt{2} \approx 0.293 \leq 0.707$ ✓; upper: $\sqrt{1-1/2} = 1/\sqrt{2} = 0.707 = F$ — the upper bound is tight here because the states are pure.

3.6 Data Processing and Joint Concavity

Theorem 3.6 (Data processing inequalities). For any channel $\Phi \in C(\mathcal{X},\mathcal{Y})$ and $P,Q \in \operatorname{Pos}(\mathcal{X})$:

$\|\Phi(P)-\Phi(Q)\|_1 \leq \|P-Q\|_1$.
$F(\Phi(P),\Phi(Q)) \geq F(P,Q)$.

Proof. (1): Write $\tfrac{1}{2}\|\Phi(P)-\Phi(Q)\|_1 = \sup_{M+M'=I,\,M,M'\geq 0}\tfrac{1}{2}[\langle M,\Phi(P)\rangle + \langle M',\Phi(Q)\rangle] = \sup\tfrac{1}{2}[\langle\Phi^*(M),P\rangle + \langle\Phi^*(M'),Q\rangle]$. Since $\Phi^*(M)+\Phi^*(M') = \Phi^*(I) = I$ (TP) and $\Phi^*(M),\Phi^*(M')\geq 0$ (positivity), $(\Phi^*(M),\Phi^*(M'))$ is feasible for the SDP of $\tfrac{1}{2}\|P-Q\|_1$, giving the bound.

(2): Let $X$ be optimal for $F(P,Q) = \tfrac{1}{2}\operatorname{Tr}(X)+\tfrac{1}{2}\operatorname{Tr}(X^*)$ s.t. $\begin{bmatrix}P&X\\X^*&Q\end{bmatrix}\geq 0$. Since $\Phi$ is CP, $\begin{bmatrix}\Phi(P)&\Phi(X)\\\Phi(X)^*&\Phi(Q)\end{bmatrix}\geq 0$, making $\Phi(X)$ feasible for $F(\Phi(P),\Phi(Q))$. By TP, $\tfrac{1}{2}\operatorname{Tr}(\Phi(X))+\tfrac{1}{2}\operatorname{Tr}(\Phi(X)^*) = \tfrac{1}{2}\operatorname{Tr}(X)+\tfrac{1}{2}\operatorname{Tr}(X^*) = F(P,Q)$. $\square$

Theorem 3.7 (Joint concavity of fidelity). For probability vector $(p_k)$ and states $\rho_k,\sigma_k$, \[F\!\Bigl(\sum_k p_k\rho_k, \sum_k p_k\sigma_k\Bigr) \geq \sum_k p_k F(\rho_k,\sigma_k).\]

Proof. For each $k$, choose purifications $|u_k\rangle,|v_k\rangle$ of $\rho_k,\sigma_k$ with $|\langle u_k|v_k\rangle| = F(\rho_k,\sigma_k)$ (by Uhlmann). Define $|u\rangle = \sum_k\sqrt{p_k}|u_k\rangle|k\rangle$ and $|v\rangle = \sum_k\sqrt{p_k}|v_k\rangle|k\rangle$ in $\mathcal{X}\otimes\mathcal{Y}\otimes\mathcal{K}$. These are purifications of $\sum_k p_k\rho_k$ and $\sum_k p_k\sigma_k$ respectively. By Uhlmann: $F(\sum p_k\rho_k,\sum p_k\sigma_k) \geq |\langle u|v\rangle| = |\sum_k p_k\langle u_k|v_k\rangle| \geq \sum_k p_k|\langle u_k|v_k\rangle| = \sum_k p_k F(\rho_k,\sigma_k)$. $\square$

Chapter 4: Semidefinite Programming

Semidefinite programming (SDP) is to quantum information what linear programming is to classical combinatorial optimisation. Many fundamental quantities — trace distance, fidelity, diamond norm, optimal state discrimination — are exactly expressible as SDPs, enabling both computation and duality-based proofs.

In quantum information, SDPs appear in two distinct roles. First, as computational tools: given the numerical data (Kraus operators, state descriptions, Hamiltonian parameters), SDPs compute exact optimal values efficiently using interior-point methods (e.g., SeDuMi, CVXPY, MOSEK in $O(n^{3.5})$ time for an $n$-dimensional variable). Second, as proof techniques: the primal-dual structure gives certificates of optimality and allows compact proofs of quantum information theorems by finding explicit dual witnesses. The Holevo-Helstrom theorem, Uhlmann’s theorem, and the quantum capacity hierarchy all admit SDP-based proofs that are often more transparent than direct analytical arguments.

A key principle is that most quantum information quantities arising from optimisation over POVMs, channels, or states are SDPs because the feasible region (density matrices, POVMs, channels) is a spectrahedron — a convex set defined by linear matrix inequalities. The Choi isomorphism maps the CP condition to a PSD constraint, converting channel optimisation into standard SDP form.

The duality structure of SDPs has a direct physical interpretation in quantum information. The primal optimises over physical objects (states, measurements, channels). The dual provides witnesses: the dual feasible point certifies that the primal objective cannot exceed a certain value, and the dual optimal witnesses (entanglement witnesses, measurement witnesses) identify the physical constraints. This is why SDP duality is so powerful in quantum information: optimal protocols and impossibility proofs are two sides of the same SDP.

4.1 The Primal and Dual SDP

Definition 4.1 (Primal SDP). A semidefinite program (primal form) is an optimisation problem \[d = \sup\,\langle A, X\rangle \quad\text{subject to}\quad \Phi(X) = B,\; X \geq 0,\] where $A \in \operatorname{Herm}(\mathcal{X})$, $B \in \operatorname{Herm}(\mathcal{Y})$, and $\Phi: L(\mathcal{X})\to L(\mathcal{Y})$ is a Hermiticity-preserving linear map. The variable $X \in \operatorname{Pos}(\mathcal{X})$ is the primal variable. Set $d = -\infty$ if infeasible and $d = +\infty$ if unbounded.

Definition 4.2 (Dual SDP). The dual SDP associated to the primal above is \[\beta = \inf\,\langle B, Y\rangle \quad\text{subject to}\quad \Phi^*(Y) - S = A,\; S \geq 0,\; Y \in \operatorname{Herm}(\mathcal{Y}).\] The variable $Y \in \operatorname{Herm}(\mathcal{Y})$ is the dual variable. The adjoint $\Phi^*$ satisfies $\langle Y,\Phi(X)\rangle = \langle\Phi^*(Y),X\rangle$.

The dual is derived via Lagrange multipliers. Associate dual variable $Y \in \operatorname{Herm}(\mathcal{Y})$ to the equality constraint $\Phi(X)=B$ and $S \in \operatorname{Pos}(\mathcal{X})$ to the inequality $X \geq 0$. The Lagrangian is $\langle A,X\rangle + \langle B-\Phi(X),Y\rangle + \langle X,S\rangle = \langle B,Y\rangle + \langle X, A - \Phi^*(Y) + S\rangle$. For this to be bounded as $X$ varies over $\operatorname{Pos}(\mathcal{X})$, we need $A - \Phi^*(Y) + S = 0$, giving the dual constraint.

The sign convention: the dual variable $Y$ for the equality constraint is free (unconstrained as a Hermitian matrix, since the constraint is an equality). The dual variable $S$ for the inequality $X \geq 0$ is positive semidefinite (since $-S$ is the Lagrange multiplier for $X \geq 0$, and positivity of $-S$ would mean $S \leq 0$, but we choose the sign so that $S \geq 0$ pairs with $\langle X,S\rangle \geq 0$).

The notation $\Phi^*$ for the adjoint of the constraint map should not be confused with the adjoint (Heisenberg-picture) channel. In the SDP context, $\Phi: L(\mathcal{X})\to L(\mathcal{Y})$ is any Hermiticity-preserving linear map, and its adjoint $\Phi^*: L(\mathcal{Y})\to L(\mathcal{X})$ satisfies $\langle Y, \Phi(X)\rangle = \langle\Phi^*(Y), X\rangle$. When $\Phi = \Phi_{\text{ch}}$ is a quantum channel, the adjoint of the constraint map and the adjoint of the channel coincide, but for general SDP constraint maps they differ.

Example 4.1 (A concrete 3×3 SDP). The following primal-dual pair from the lecture supplement illustrates weak but not strong duality: \[d = \sup(-t) \;\text{s.t.}\; \begin{bmatrix}t&a&b\\a&0&\tfrac{1}{2}\\b&\tfrac{1}{2}&c\end{bmatrix}\geq 0, \quad \beta = \inf y \;\text{s.t.}\;\begin{bmatrix}y+1&0&0\\0&2&y\\0&y&z\end{bmatrix}\geq 0.\] The primal optimal is $d = 0$ (take $t\to 0^+$) and the dual optimal is $\beta = 0$, so $d = \beta$. However, this particular instance does not attain strong duality — a gap exists between attainment in the primal and dual.

4.2 Weak and Strong Duality

Theorem 4.1 (Weak duality). If $X$ is primal feasible and $(Y,S)$ is dual feasible, then $\langle A,X\rangle \leq \langle B,Y\rangle$. Hence $d \leq \beta$.

Proof. $\langle B,Y\rangle - \langle A,X\rangle = \langle\Phi(X),Y\rangle - \langle A,X\rangle = \langle X,\Phi^*(Y)\rangle - \langle X,A\rangle = \langle X,\Phi^*(Y)-A\rangle = \langle X,S\rangle \geq 0$, since $X,S \geq 0$. $\square$

The duality gap $\beta - d$ can be strictly positive. Closing the gap requires a regularity condition. Example 4.1 shows a gap exists between attainment (not between the values $d=\beta=0$), but Example 4.4 below shows what happens when Slater’s condition fails: the primal optimal is approached but not attained. This is analogous to the LP degeneracy situation, but for matrix variables.

A useful test: if the SDP arises from a quantum information problem with a natural “maximally noisy” or “identity” solution, strict feasibility usually holds for both primal and dual — precisely the regime of most quantum information SDPs.

Theorem 4.2 (Slater's condition — primal). If there exists a strictly feasible primal solution ($X > 0$ with $\Phi(X) = B$) and $d$ is finite, then $d = \beta$ and $\beta$ is attained.

Theorem 4.3 (Slater's condition — dual). If there exists a strictly feasible dual solution ($S > 0$ and $\Phi^*(Y)-S = A$) and $\beta$ is finite, then $d = \beta$ and $d$ is attained.

Corollary 4.1. If both primal and dual are strictly feasible, then $d = \beta$ and both optima are attained.

4.3 Complementary Slackness

Theorem 4.4 (Complementary slackness). Assume $d = \beta$. If $X$ is primal optimal and $(Y,S)$ is dual optimal, then $\langle X,S\rangle = 0$, equivalently $XS = SX = 0$. This implies \[X\Phi^*(Y) = XA \quad\text{and}\quad \Phi^*(Y)X = AX.\]

Proof. From the weak duality calculation, $\langle B,Y\rangle - \langle A,X\rangle = \langle X,S\rangle$. If $d=\beta$ and both are attained, the left side is zero, so $\langle X,S\rangle = 0$. Since $X,S\geq 0$, this forces $XS = 0$. Substituting $S = \Phi^*(Y)-A$ gives $X(\Phi^*(Y)-A) = 0$. $\square$

Complementary slackness provides necessary and sufficient conditions for optimality when strong duality holds — it is the quantum analogue of complementary slackness in LP.

The complementary slackness conditions can be interpreted geometrically: $XS = 0$ means the support of $X$ is orthogonal to the support of $S$. Since $S = \Phi^*(Y) - A$, this says “the primal optimal $X$ is supported only on the subspace where the dual inequality $\Phi^*(Y) \geq A$ is tight (equality holds).” This is exactly analogous to the classical LP condition that a basic feasible solution has nonzero variables only where the corresponding dual constraints are active.

Example 4.4b (Complementary slackness for quantum state discrimination). For the state discrimination SDP with ensemble $\{(1/2,\rho_0),(1/2,\rho_1)\}$ and optimal measurement $(M_0,M_1)$, the dual variable is $Y = \tfrac{1}{2}M_0\rho_0 + \tfrac{1}{2}M_1\rho_1$. The complementary slackness condition reads: \[M_0(Y - \tfrac{1}{2}\rho_0) = 0 \quad\text{and}\quad M_1(Y - \tfrac{1}{2}\rho_1) = 0.\] These say: $M_0$ is supported only on the subspace where $Y = \tfrac{1}{2}\rho_0$ (the "boundary" between the two hypotheses), and similarly for $M_1$. In the Helstrom case with $p_0=p_1=1/2$, the optimal $Y = \tfrac{1}{2}\Pi^+\rho_0\Pi^+ + \tfrac{1}{2}\Pi^-\rho_1\Pi^-$ and the slackness conditions are immediately satisfied: $M_0 = \Pi^+$ is supported on the positive eigenspace of $\rho_0-\rho_1$, so $M_0(Y-\tfrac{1}{2}\rho_0) = \Pi^+(\tfrac{1}{2}\Pi^+\rho_0\Pi^+ - \tfrac{1}{2}\rho_0)\Pi^+ = 0$ since $\Pi^+\rho_0\Pi^+ = \rho_0\Pi^+$ on the positive eigenspace. ✓

Example 4.4 (Trace distance from scratch via SDP duality). We derive the SDP formulation of trace distance $\tfrac{1}{2}\|\rho-\sigma\|_1$ for two qubit states $\rho = |0\rangle\langle 0|$ and $\sigma = |1\rangle\langle 1|$ by constructing the primal and dual from scratch.

Primal: The discrimination probability is $d = \sup\tfrac{1}{2}[\langle M,\rho\rangle + \langle M',\sigma\rangle]$ s.t. $M + M' = I$, $M,M' \geq 0$. This is an SDP in the block variable $X = \operatorname{diag}(M,M') \in L(\mathbb{C}^2\oplus\mathbb{C}^2)$.

Dual variable: Associate $Y \in \operatorname{Herm}(\mathbb{C}^2)$ to the equality constraint $M+M' = I$. The dual constraint is $\Phi^*(Y) \geq A$ where $A = \tfrac{1}{2}\operatorname{diag}(\rho,\sigma)$ and $\Phi^*(Y) = \operatorname{diag}(Y,Y)$. So the constraint is $Y \geq \tfrac{1}{2}\rho$ and $Y \geq \tfrac{1}{2}\sigma$. The dual objective is $\operatorname{Tr}(Y)$.

Dual SDP: $\beta = \inf\operatorname{Tr}(Y)$ s.t. $Y \geq \tfrac{1}{2}|0\rangle\langle 0|$ and $Y \geq \tfrac{1}{2}|1\rangle\langle 1|$. Since $\rho = |0\rangle\langle 0|$ and $\sigma = |1\rangle\langle 1|$ are orthogonal, the minimum $Y = \tfrac{1}{2}(|0\rangle\langle 0|+|1\rangle\langle 1|) = \tfrac{I}{2}$, giving $\beta = 1$. The primal achieves this with $M = |0\rangle\langle 0|$, $M' = |1\rangle\langle 1|$: success probability $= \tfrac{1}{2}(1+1) = 1$. Strong duality holds: $d = \beta = 1$.

Complementary slackness check: With optimal $M = |0\rangle\langle 0|$, $M' = |1\rangle\langle 1|$, and $Y = I/2$: The slackness operator for $M$ is $Y - \tfrac{1}{2}\rho = \tfrac{I}{2} - \tfrac{1}{2}|0\rangle\langle 0| = \tfrac{1}{2}|1\rangle\langle 1|$. We need $M(Y-\tfrac{1}{2}\rho) = |0\rangle\langle 0|\cdot\tfrac{1}{2}|1\rangle\langle 1| = 0$. ✓

Example 4.5 (Lagrange multiplier derivation of the dual). Consider the simple SDP: $d = \sup\operatorname{Tr}(AX)$ s.t. $\operatorname{Tr}(X) = 1$, $X \geq 0$ (this is a maximisation over density matrices). Associate dual variable $y \in \mathbb{R}$ to the trace constraint. The Lagrangian is \[\mathcal{L} = \operatorname{Tr}(AX) + y(1-\operatorname{Tr}(X)) + \langle S, X\rangle = y + \operatorname{Tr}((A-yI+S)X).\] For boundedness over $X \geq 0$, we need $A - yI + S = 0$, i.e., $S = yI - A$, which requires $S \geq 0$, i.e., $y \geq \lambda_{\max}(A)$. The dual is thus: $\beta = \inf\{y : y \geq \lambda_{\max}(A)\} = \lambda_{\max}(A)$. The primal achieves this with $X = |v\rangle\langle v|$ the top eigenvector of $A$. This is the variational formula for the maximum eigenvalue: $\lambda_{\max}(A) = \sup_{\rho \in D(\mathcal{X})}\operatorname{Tr}(A\rho) = \sup_{\|u\|=1}\langle u|A|u\rangle$. Strong duality is immediate since both sides equal $\lambda_{\max}(A)$.

4.4 Computing Adjoint Maps

To use any specific SDP, one must compute the adjoint $\Phi^*$ of the constraint map $\Phi$. Two methods are available.

Proposition 4.1 (Method 1: Kraus inspection). If $\Phi(X) = \sum_k M_k X N_k^*$ (Kraus-like form), then $\Phi^*(Y) = \sum_k M_k^* Y N_k$.

Proposition 4.2 (Method 2: inner product definition). Compute $\langle Y, \Phi(X)\rangle$ explicitly in terms of the matrix entries $X_{ab}$, rearrange to the form $\langle\cdots, X\rangle$, and read off $\Phi^*(Y)$.

Example 4.2. Let $\Phi = \operatorname{Tr}_\mathcal{X}: L(\mathcal{X}\otimes\mathcal{Y})\to L(\mathcal{Y})$, $\Phi(A\otimes B) = \operatorname{Tr}(A)\cdot B$. For $Z \in L(\mathcal{Y})$: \[\langle Z, \Phi(M)\rangle = \langle Z, \operatorname{Tr}_\mathcal{X}(M)\rangle = \operatorname{Tr}(Z^*\operatorname{Tr}_\mathcal{X}(M)) = \operatorname{Tr}((I\otimes Z)^* M) = \langle I_\mathcal{X}\otimes Z, M\rangle.\] So $\Phi^*(Z) = I_\mathcal{X}\otimes Z$. This confirms the standard result: the adjoint of the partial trace is the tensor-with-identity map.

Example 4.3 (Adjoint of a map defined by Kraus operators). Consider the completely positive map $\Psi: L(\mathbb{C}^3) \to L(\mathbb{C}^3)$ defined by \[\Psi(X) = \begin{pmatrix}X_{11}+X_{23}+X_{32} & 0 & 0 \\ 0 & X_{22} & 0 \\ 0 & 0 & 0\end{pmatrix}\] (from the supplementary SDP notes). We use Method 2 (inner product) to find $\Psi^*(Y)$. Compute $\langle Y, \Psi(X)\rangle = \operatorname{Tr}(Y^*\Psi(X))$: \[\langle Y,\Psi(X)\rangle = Y_{11}(X_{11}+X_{23}+X_{32}) + Y_{22}X_{22} = Y_{11}X_{11} + Y_{22}X_{22} + Y_{11}X_{23} + Y_{11}X_{32}.\] This must equal $\langle\Psi^*(Y),X\rangle = \sum_{a,b}[\Psi^*(Y)]_{ab}^*X_{ab}$. Reading off coefficients: \[[\Psi^*(Y)]_{11} = Y_{11},\quad [\Psi^*(Y)]_{22} = Y_{22},\quad [\Psi^*(Y)]_{23} = Y_{11},\quad [\Psi^*(Y)]_{32} = Y_{11}.\] All other entries of $\Psi^*(Y)$ are 0. Hence \[\Psi^*(Y) = \begin{pmatrix}Y_{11} & 0 & 0 \\ 0 & Y_{22} & Y_{11} \\ 0 & Y_{11} & 0\end{pmatrix}.\] One can verify this from the Kraus representation: $\Psi(X) = M_1XM_1^* + M_2XM_2^* + M_3XM_3^*$ where \[M_1 = \begin{pmatrix}1&0&0\\0&0&0\\0&0&0\end{pmatrix},\; M_2 = \begin{pmatrix}0&0&0\\0&1&0\\0&0&0\end{pmatrix},\; M_3 = \tfrac{1}{\sqrt{2}}\begin{pmatrix}0&0&1\\0&0&0\\1&0&0\end{pmatrix} + \tfrac{1}{\sqrt{2}}\begin{pmatrix}0&1&0\\0&0&0\\0&0&0\end{pmatrix}.\] The adjoint Kraus operators give $\Psi^*(Y) = \sum_k M_k^*YM_k$, recovering the same formula.

4.4b Application: Quantum State Tomography as Semidefinite Program

Quantum state tomography (QST) is the process of estimating a quantum state $\rho$ from the outcomes of multiple measurements. The key computational step — fitting the experimental frequencies to a valid density matrix — is naturally an SDP.

Definition 4.2b (QST as SDP). Given measurement outcomes $\{f_k\}_{k=1}^m$ (empirical frequencies) from a set of POVM elements $\{M_k\}_{k=1}^m$, the maximum likelihood estimate of $\rho$ is the solution to: \[\rho^* = \arg\min_{\rho \in D(\mathcal{X})}\|\{f_k - \operatorname{Tr}(M_k\rho)\}_{k=1}^m\|_2^2 \quad\text{or equivalently}\quad \arg\min_{\rho \in D(\mathcal{X})} -\sum_k n_k \log\operatorname{Tr}(M_k\rho),\] where $n_k$ are the raw counts. The constraint $\rho \in D(\mathcal{X})$ (i.e., $\rho\geq 0, \operatorname{Tr}(\rho)=1$) is a semidefinite constraint, making QST an SDP (for the least-squares version) or a convex optimisation (for maximum likelihood).

Example 4.3b (Qubit tomography). For a qubit, complete tomography requires measuring in three bases: $\{|0\rangle,|1\rangle\}$, $\{|+\rangle,|-\rangle\}$, and $\{|+i\rangle,|-i\rangle\}$. This gives 6 POVM elements. Suppose frequencies are $f_0 = 0.6, f_1 = 0.4$ (Z basis), $f_+ = 0.7, f_- = 0.3$ (X basis), $f_{+i} = 0.5, f_{-i} = 0.5$ (Y basis). The Bloch vector is estimated as $\vec{r} = (r_x, r_y, r_z)$ where $r_z = f_0 - f_1 = 0.2$, $r_x = f_+ - f_- = 0.4$, $r_y = f_{+i} - f_{-i} = 0$. The estimated state is $\rho = \tfrac{1}{2}(I + 0.4X + 0.2Z)$. Check validity: $|\vec{r}| = \sqrt{0.16+0.04} = \sqrt{0.20} \approx 0.447 \leq 1$ ✓. The SDP constraint $\rho \geq 0$ is automatically satisfied since $|\vec{r}| \leq 1$. If the raw frequencies had given $|\vec{r}| > 1$ (due to finite sample noise), the SDP would project to the nearest valid density matrix on the Bloch ball.

The regularised QST problem can also be formulated as a semidefinite program: $\min_\rho \lambda\operatorname{Tr}(\rho^2) - \sum_k f_k\log\operatorname{Tr}(M_k\rho)$ s.t. $\rho\geq 0$, $\operatorname{Tr}(\rho)=1$. The Tikhonov regularisation term $\lambda\operatorname{Tr}(\rho^2) = \lambda\|\rho\|_2^2$ penalises pure states and encourages mixed solutions, acting as a prior for the maximum entropy principle. The resulting SDP has unique solution when the measurement is informationally complete (i.e., the POVM spans $\operatorname{Herm}(\mathcal{X})$).

4.5 Application: Optimal State Discrimination

Theorem 4.5 (Thm 8.3, Watrous). A POVM $(M_1,\ldots,M_n)$ is optimal for discriminating ensemble $\{(p_k,\rho_k)\}_{k=1}^n$ if and only if there exists $Y \in \operatorname{Herm}(\mathcal{X})$ such that:

$Y = \sum_k p_k M_k\rho_k$ (Hermitian), and
$Y \geq p_k\rho_k$ for all $k$.

Proof. Cast the discrimination problem as the SDP: $d = \sup\sum_k p_k\langle M_k,\rho_k\rangle$ s.t. $\sum_k M_k = I$, $M_k \geq 0$. The dual variable for the equality constraint is $Y \in \operatorname{Herm}(\mathcal{X})$; duality gives $\beta = \inf\operatorname{Tr}(Y)$ s.t. $Y \geq p_k\rho_k\;\forall k$.

Both primal (uniform POVM: $M_k = I/n$) and dual ($Y = \lambda I$ for large $\lambda$) are strictly feasible, so strong duality holds and both optima are attained.

At optimality, complementary slackness $M_k(Y - p_k\rho_k) = 0$ for each $k$ gives $M_kY = p_kM_k\rho_k$. Summing over $k$ and using $\sum_kM_k = I$: $Y = \sum_k p_k M_k\rho_k$. This is Hermitian (the primal objective evaluated at the optimal), which is condition (1). Condition (2) is dual feasibility. The converse follows by reversing the argument: conditions (1) and (2) imply the primal and dual objectives coincide, so the POVM is optimal. $\square$

Example 4.3 (Helstrom measurement from CS). For two states with priors $p_0,p_1$ and $n=2$: the dual constraint is $Y \geq p_k\rho_k$. Take $Y = p_0\Pi^+\rho_0\Pi^+ + p_1\Pi^-\rho_1\Pi^-$ where $\Pi^\pm$ project onto the positive/negative eigenspaces of $p_0\rho_0-p_1\rho_1$. One checks $Y = \sum_k p_k M_k\rho_k$ is Hermitian and $Y \geq p_k\rho_k$ for both $k$, confirming $M_0 = \Pi^+, M_1 = \Pi^-$ is optimal.

4.6 Application: Fidelity as an SDP

Lemma 4.1 (Schur complement). For $C > 0$, \[\begin{bmatrix}A & B \\ B^* & C\end{bmatrix} \geq 0 \iff A \geq BC^{-1}B^*.\]

Theorem 4.6 (Fidelity SDP). For $P,Q \in \operatorname{Pos}(\mathcal{X})$ with $P,Q > 0$, \[F(P,Q) = \sup\Bigl\{\tfrac{1}{2}\operatorname{Tr}(X)+\tfrac{1}{2}\operatorname{Tr}(X^*) : \begin{bmatrix}P & X \\ X^* & Q\end{bmatrix}\geq 0\Bigr\}.\] The dual SDP is $\beta = \inf\{\tfrac{1}{2}\langle P,Y\rangle + \tfrac{1}{2}\langle Q,Z\rangle : \begin{bmatrix}Y & I \\ I & Z\end{bmatrix}\geq 0\}$, and the dual optimal with $Y = Z^{-1}$ recovers Alberti's theorem.

Proof sketch. By the Schur complement, $\begin{bmatrix}P&X\\X^*&Q\end{bmatrix}\geq 0$ iff $P \geq XQ^{-1}X^*$ iff $\|P^{-1/2}XQ^{-1/2}\|_\infty\leq 1$. Writing $X = P^{1/2}VQ^{1/2}$ for $\|V\|_\infty\leq 1$: \[\operatorname{Tr}(X) = \operatorname{Tr}(P^{1/2}VQ^{1/2}) = \langle V^*, P^{1/2}Q^{1/2}\rangle.\] Maximising over $V$ by the variational formula: $\sup\operatorname{Re}\operatorname{Tr}(X) = \|P^{1/2}Q^{1/2}\|_1 \cdot (\text{phase})$. Choosing the phase of $X$ to make $\operatorname{Tr}(X) = \operatorname{Tr}(X^*)$ gives $F(P,Q)$. Strong duality: $X=0$ is primal feasible (for $P,Q\geq 0$) and $Y=Z=2I$ is strictly dual feasible, so Slater applies. $\square$

4.7 The Diamond Norm

Definition 4.3 (Diamond norm). The diamond norm of $\Phi \in T(\mathcal{X},\mathcal{Y})$ is \[\|\Phi\|_\diamond = \sup_{\mathcal{Z}}\|\Phi\otimes I_\mathcal{Z}\|_1 = \max_{\rho \in D(\mathcal{X}\otimes\mathcal{X})}\|(\Phi\otimes I_\mathcal{X})(\rho)\|_1,\] where the supremum over auxiliary systems $\mathcal{Z}$ is attained at $\mathcal{Z} \cong \mathcal{X}$ (Lemma 20.2 of LN2011).

When two channels $\Phi_0,\Phi_1$ are given uniformly at random, the optimal probability of correctly identifying which was applied equals $\tfrac{1}{2} + \tfrac{1}{4}\|\Phi_0-\Phi_1\|_\diamond$. The auxiliary system $\mathcal{Z}$ is essential: without it (using only $\|\cdot\|_1$), the discrimination probability may be strictly lower.

Theorem 4.7 (Diamond norm as maximal fidelity; Thm 21.1 of LN2011). Let $\Phi_0,\Phi_1 \in C(\mathcal{X},\mathcal{Y})$ with Stinespring dilations $\Phi_i(M) = \operatorname{Tr}_\mathcal{Z}(A_i M A_i^*)$. Define $\Psi(M) = \operatorname{Tr}_\mathcal{Z}(A_0 M A_1^*)$. Then \[\|\Psi\|_\diamond = F_{\max}(\Phi_0,\Phi_1) := \max_{\rho_0,\rho_1}F(\Phi_0(\rho_0),\Phi_1(\rho_1)).\]

Remark. Since the fidelity SDP is an SDP, and $F_{\max}$ is a maximum of fidelities over quantum states, the diamond norm is computable as an SDP. This is one of the most useful computational tools in quantum information: checking whether two quantum channels are close (in operational terms) reduces to solving a semidefinite program.

Theorem 4.8 (Diamond norm SDP). For $\Phi \in T(\mathcal{X},\mathcal{Y})$ Hermiticity-preserving, the diamond norm satisfies \[\tfrac{1}{2}\|\Phi\|_\diamond = \sup_{\rho \in D(\mathcal{X}\otimes\mathcal{X})} \tfrac{1}{2}\|(\Phi\otimes I)(\rho)\|_1.\] This can be cast as an SDP. Writing the trace norm $\|A\|_1 = \max_{U\in\mathcal{U}(\mathcal{Y}\otimes\mathcal{X})}\operatorname{Re}\operatorname{Tr}(A^*U)$, the diamond norm equals the value of the SDP: \[\tfrac{1}{2}\|\Phi\|_\diamond = \sup_{X,\rho}\Bigl\{\operatorname{Re}\operatorname{Tr}(X) : \begin{bmatrix}(\Phi\otimes I)(\rho) & X \\ X^* & (\Phi^{\dagger}\otimes I)(\rho)\end{bmatrix}\geq 0,\;\rho \in D(\mathcal{X}\otimes\mathcal{X})\Bigr\}.\] For the special case where $\Phi_0,\Phi_1$ are quantum channels and one wants $\|\Phi_0-\Phi_1\|_\diamond$, the auxiliary system restriction to $\mathcal{Z}\cong\mathcal{X}$ (dim$\mathcal{Z}=$ dim$\mathcal{X}$) is sufficient, making this a finite-dimensional SDP.

Example 4.6 (Diamond norm of the completely depolarising channel). Let $\Omega(\rho) = I/d$ be the completely depolarising channel and $\text{id}(\rho) = \rho$ the identity channel. Their difference $\Delta = \text{id} - \Omega$ satisfies \[\|\Delta\|_\diamond = \|\text{id}-\Omega\|_\diamond.\] Take the maximally entangled input $|\beta\rangle\langle\beta|$. Then $(\text{id}\otimes I)(|\beta\rangle\langle\beta|) = |\beta\rangle\langle\beta|$ and $(\Omega\otimes I)(|\beta\rangle\langle\beta|) = \tfrac{I}{d}\otimes\tfrac{I}{d}$. The trace norm is \[\||\beta\rangle\langle\beta| - \tfrac{I}{d^2}\|_1 = \bigl(1-\tfrac{1}{d^2}\bigr) + (d^2-1)\cdot\tfrac{1}{d^2} = 2\bigl(1-\tfrac{1}{d^2}\bigr).\] For a qubit ($d=2$): $\|\text{id}-\Omega\|_\diamond = 2(1-1/4) = 3/2$. Since the identity and depolarising channels are "maximally distinguishable" (one preserves quantum information, the other destroys it), this large diamond norm value $3/2$ confirms that a single use of the unknown channel with an entangled probe can reliably distinguish them.

4.8 Application: Quantum Channel Simulation and the Reverse Shannon Theorem

A natural question arising from the diamond norm is: when can one channel $\Phi$ simulate another channel $\Psi$ using shared entanglement and classical communication? The reverse Shannon theorem answers this question and provides a unified framework connecting channel simulation, classical capacity, and entanglement consumption.

Definition 4.4 (Channel simulation). A channel $\Phi: L(\mathcal{X})\to L(\mathcal{Y})$ can be $\varepsilon$-simulated by $n$ uses of $\Psi: L(\mathcal{X}')\to L(\mathcal{Y}')$ using $E$ ebits and $C$ classical bits if there exist encoding and decoding operations such that the diamond distance between $\Phi$ and the composition $\mathcal{D}\circ\Psi^{\otimes n}\circ\mathcal{E}$ is at most $\varepsilon$.

Theorem 4.9 (Quantum reverse Shannon theorem, Bennett-Devetak-Harrow-Shor-Winter 2014). Let $\Phi$ be a quantum channel with entanglement-assisted classical capacity $C_E(\Phi) = \max_\rho I(A:B)_\rho$. Then $\Phi$ can be simulated (in the limit of many copies) using $C_E(\Phi)$ cbits per use plus $\tfrac{1}{2}C_E(\Phi)$ ebits per use. This simulation is optimal: no protocol can do it with fewer cbits per ebit.

The reverse Shannon theorem gives a duality between the forward problem (maximise communication rate given a channel) and the reverse problem (minimise the resources needed to simulate a channel). Together with the HSW theorem, it implies that the classical capacity equals the channel simulation cost in the entanglement-assisted regime.

Example 4.7 (Simulation cost of the qubit erasure channel). The erasure channel $\Phi_p(\rho) = (1-p)\rho + p|e\rangle\langle e|$ (erasing with probability $p$) has entanglement-assisted capacity $C_E = 2(1-p)$ bits per use (since the quantum capacity is $1-2p$ and the ebit doubles this: $C_E = 2Q + 2\cdot 0 = 2(1-p)$... let us be more careful: from the BSST formula, $C_E = \max_\rho[S(\rho)+S(\Phi_p(\rho))-S_e(\rho,\Phi_p)]$. For the maximally mixed input $\rho = I/2$: output has entropy $S(\Phi_p(I/2)) = S((1-p)\tfrac{I}{2}+p|e\rangle\langle e|) = h(p)+1-p$ bits; entropy exchange $S_e = h(p)$ (from the erasure flag). So $C_E = 1 + h(p) + 1-p - h(p) = 2-p$). The simulation uses $2-p$ cbits and $(2-p)/2 = 1-p/2$ ebits per use.

Chapter 5: Quantum Entropy and Information

Shannon’s information theory and quantum mechanics meet in the study of entropy. The classical Shannon entropy governs the compression of classical data; its quantum generalisation, the von Neumann entropy, governs quantum data compression and lies at the heart of every capacity theorem for quantum channels.

5.1 Shannon Entropy and the Asymptotic Equipartition Property

Definition 5.1 (Shannon entropy). Let $X$ be a random variable with finite sample space $\Omega$ and distribution $p$. The Shannon entropy is \[H(X) = H(p) = -\sum_{x \in \Omega} p(x)\log p(x),\] in bits (logarithm base 2), with the convention $0\log 0 = 0$. For a binary source with bias $p$, $H(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy function.

Example 5.1. A fair coin has $H = 1$ bit. A biased coin with $p(0)=0.9, p(1)=0.1$ has $H = -0.9\log 0.9 - 0.1\log 0.1 \approx 0.469$ bits — less uncertainty, less entropy. The uniform distribution on $|\Omega|$ outcomes has entropy $\log|\Omega|$, the maximum for a given alphabet size.

Definition 5.2 (Typical sequences). For an iid source $X_1,X_2,\ldots$ distributed as $X$ and $\delta > 0$, a sequence $x^n = x_1\cdots x_n \in \Omega^n$ is $\delta$-typical if \[\Bigl|-\tfrac{1}{n}\log p(x^n) - H(X)\Bigr| \leq \delta.\] The typical set is $T_{n,\delta} = \{x^n \in \Omega^n : x^n\text{ is }\delta\text{-typical}\}$.

Theorem 5.1 (Asymptotic Equipartition Property). For any $\varepsilon,\delta > 0$ and all sufficiently large $n$:

$p(T_{n,\delta}) \geq 1-\varepsilon$ — the typical set has high probability.
$(1-\varepsilon)2^{n(H-\delta)} \leq |T_{n,\delta}| \leq 2^{n(H+\delta)}$ — the typical set is exponentially smaller than $\Omega^n$.
For any $A \subseteq \Omega^n$ with $p(A) \geq 1-\varepsilon$, $|A| \geq (1-2\varepsilon)2^{n(H-\delta)}$ — the typical set is the smallest high-probability set.

Proof of (1). The random variable $-\tfrac{1}{n}\log p(X^n) = \tfrac{1}{n}\sum_{i=1}^n (-\log p(X_i))$ is an average of iid random variables with mean $H(X)$ and finite variance $\operatorname{Var}[\log p(X)] < \infty$. By the (weak) law of large numbers, this converges in probability to $H(X)$, so $\Pr[X^n \notin T_{n,\delta}] \to 0$. For explicit rates, Chebyshev gives $\Pr[|{-\tfrac{1}{n}\log p(X^n)}-H|\geq\delta] \leq \operatorname{Var}/({n\delta^2}) \to 0$. $\square$

The AEP is the rigorous form of the intuitive statement that “most sequences from a source look the same.” More precisely: the typical sequences each have probability roughly $2^{-nH}$, and there are roughly $2^{nH}$ of them. The remaining $|\Omega|^n - 2^{nH}$ atypical sequences collectively have negligible probability.

The AEP is the foundational tool in both classical and quantum Shannon theory. Conceptually, it says that a source with entropy $H$ behaves, asymptotically, like a uniform distribution over a set of size $2^{nH}$. All compression, channel coding, and capacity theorems follow from this: the source can be compressed to $nH$ bits with negligible error (Shannon’s source coding theorem), and a channel with capacity $C$ can transmit at rate $R < C$ by identifying the $2^{nR}$ messages with codewords in the typical set of the output (Shannon’s noisy channel theorem).

Example 5.1c (AEP for a biased coin). For a source with $p(0)=0.9, p(1)=0.1$, the entropy is $H \approx 0.469$ bits. For $n=1000$ symbols: typical sequences have approximately $1000\times 0.9 = 900$ zeros and $100$ ones. The number of typical sequences is $\binom{1000}{100} \approx 2^{nH} = 2^{469}$. All atypical sequences (those with significantly different proportions) have total probability $\to 0$ as $n\to\infty$. Shannon's theorem says: these 1000 symbols can be compressed into 469 bits (instead of 1000 bits naively), with error probability $\to 0$. The compression ratio is $469/1000 = 0.469 = H$ bits per symbol.

Definition 5.2b (Classical mutual information). For a joint distribution $p(x,y)$ on $\Omega_X\times\Omega_Y$, the mutual information is \[I(X:Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = \sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = D_{\text{KL}}(p_{XY}\|p_X p_Y).\] It measures the information shared between $X$ and $Y$: how much knowing one reduces uncertainty about the other. $I(X:Y) \geq 0$ (from the non-negativity of KL divergence) with equality iff $X\perp Y$.

Example 5.1b. For a binary symmetric channel (BSC) with crossover probability $p$: input $X$ uniform, output $Y = X\oplus Z$ with noise $Z\sim\text{Bernoulli}(p)$. Then $H(Y) = 1$ bit, $H(Y|X) = h(p)$, so $I(X:Y) = 1 - h(p)$ bits. Shannon's noisy channel coding theorem says the maximum reliable transmission rate equals $\max_{p_X}I(X:Y) = 1 - h(p)$. For a fair coin channel ($p=1/2$): $I = 0$ — no information gets through. For a noiseless channel ($p=0$): $I = 1$ bit per use.

5.2 Shannon’s Noiseless Coding Theorem

Theorem 5.2 (Shannon's noiseless coding theorem). For an iid source with entropy $H$ bits per symbol and any $\varepsilon > 0$:

Direct: If $R > H$, there exist encoder-decoder pairs with $nR$ bits achieving error probability $\leq \varepsilon$ for all large $n$.
Converse: If $R < H$, any encoder-decoder with $nR$ bits has error probability bounded away from 0 for all large $n$.

Proof sketch. Direct: Map the $\leq 2^{n(H+\delta)}$ typical sequences bijectively to binary strings of length $n(H+\delta)$ and declare failure on atypical inputs. By AEP property (1), failure probability is $\leq\varepsilon$.

Converse: Any encoder with $nR < n(H-\delta)$ bits maps onto at most $2^{nR}$ codewords. These cover at most $2^{nR}$ source sequences. By AEP property (3), any set of size $< (1-2\varepsilon)2^{n(H-\delta)}$ has probability $< 1-\varepsilon$, so the decoder fails with probability $> \varepsilon$. $\square$

5.3 Von Neumann Entropy

Definition 5.3 (Von Neumann entropy). For $\rho \in D(\mathcal{X})$ with spectral decomposition $\rho = \sum_x p(x)|e_x\rangle\langle e_x|$, the von Neumann entropy is \[S(\rho) = -\operatorname{Tr}(\rho\log\rho) = -\sum_x p(x)\log p(x) = H(p).\]

Proposition 5.1 (Properties of von Neumann entropy).

Range: $0 \leq S(\rho) \leq \log\dim\mathcal{X}$.
Purity: $S(\rho) = 0$ iff $\rho$ is pure. $S(\rho) = \log d$ iff $\rho = I/d$.
Unitary invariance: $S(U\rho U^*) = S(\rho)$.
Additivity: $S(\rho\otimes\sigma) = S(\rho)+S(\sigma)$.
Concavity: $S(\sum_k p_k\rho_k) \geq \sum_k p_k S(\rho_k)$.
Subadditivity: For bipartite $\rho^{AB}$, $S(AB) \leq S(A)+S(B)$, with equality iff $\rho^{AB} = \rho^A\otimes\rho^B$.

The von Neumann entropy has a maximum entropy interpretation: among all states $\rho$ satisfying linear constraints $\operatorname{Tr}(O_k\rho) = c_k$, the one maximising $S(\rho)$ is the thermal (Gibbs) state $\rho = Z^{-1}e^{-\sum_k\beta_k O_k}$, where $Z = \operatorname{Tr}(e^{-\sum_k\beta_k O_k})$ and $\{\beta_k\}$ are Lagrange multipliers. For a single energy constraint $\operatorname{Tr}(H\rho) = E$: the maximum entropy state is the Boltzmann-Gibbs state $\rho_\beta = e^{-\beta H}/\operatorname{Tr}(e^{-\beta H})$ with temperature $T = 1/(k_B\beta)$. This is the quantum statistical mechanics principle of maximum entropy (Jaynes 1957).

Example 5.2b (Entropy of a thermal qubit). For a qubit with Hamiltonian $H = \tfrac{\omega}{2}Z$ (energy gap $\omega$), the Gibbs state at inverse temperature $\beta$ is \[\rho_\beta = \frac{e^{-\beta H}}{\operatorname{Tr}(e^{-\beta H})} = \frac{1}{e^{\beta\omega/2}+e^{-\beta\omega/2}}\begin{pmatrix}e^{-\beta\omega/2} & 0 \\ 0 & e^{\beta\omega/2}\end{pmatrix} = \begin{pmatrix}p_\beta & 0 \\ 0 & 1-p_\beta\end{pmatrix},\] where $p_\beta = \tfrac{e^{-\beta\omega/2}}{2\cosh(\beta\omega/2)}$ is the ground state population. The entropy is $S(\rho_\beta) = h(p_\beta)$. At infinite temperature $\beta\to 0$: $p_\beta\to 1/2$, $S\to 1$ bit (maximally mixed). At zero temperature $\beta\to\infty$: $p_\beta\to 1$, $S\to 0$ (pure ground state). The entropy is maximised at infinite temperature, confirming the Gibbs state as the maximum entropy state for the energy constraint.

Theorem 5.2.1 (Subadditivity of von Neumann entropy). For any bipartite state $\rho^{AB} \in D(\mathcal{X}_A\otimes\mathcal{X}_B)$, \[S(\rho^{AB}) \leq S(\rho^A) + S(\rho^B).\] Equality holds if and only if $\rho^{AB} = \rho^A\otimes\rho^B$.

Proof. Use the quantum relative entropy: $S(\rho^{AB}\|\rho^A\otimes\rho^B) \geq 0$ (Klein's inequality) and expand: \[S(\rho^{AB}\|\rho^A\otimes\rho^B) = \operatorname{Tr}[\rho^{AB}(\log\rho^{AB} - \log\rho^A\otimes I - I\otimes\log\rho^B)] = -S(\rho^{AB}) + S(\rho^A) + S(\rho^B).\] Hence $S(\rho^{AB}) \leq S(\rho^A)+S(\rho^B)$. Equality iff $\rho^{AB} = \rho^A\otimes\rho^B$ (by equality in Klein's). $\square$

Theorem 5.2.2 (Araki-Lieb inequality). $|S(\rho^A)-S(\rho^B)| \leq S(\rho^{AB})$.

Proof. Let $|u\rangle^{ABR}$ be a purification of $\rho^{AB}$. Then $S(\rho^{ABR}) = 0$ and by subadditivity: $S(\rho^{AB}) + S(\rho^R) \geq S(\rho^{ABR}) = 0$, so $S(\rho^{AB}) \geq -S(\rho^R)$. Since $\rho^R$ is the reduced state of a purification of $\rho^{AB}$, we have $S(\rho^R) = S(\rho^{AB})$. No, let me be more careful: $|u\rangle$ purifies $\rho^{AB}$, so $\rho^R$ is the reduced state of a purification. But actually the purification of $\rho^{AB}$ gives $S(ABR)=0$, and the reduced states satisfy $S(R) = S(AB)$ since the global state is pure. Subadditivity on the bipartition $AB:R$ gives $S(AB) + S(R) \geq S(ABR) = 0$, which is trivial. Instead, note that $S(A) = S(BR)$ (purification) and apply subadditivity to $A:B$: $S(AB) \leq S(A)+S(B)$. For the lower bound: since $|u\rangle^{ABR}$ is pure, $S(AB) = S(R)$. Also $S(B) = S(AR)$. Applying subadditivity $S(AR) \leq S(A)+S(R) = S(A)+S(AB)$, giving $S(B) \leq S(A)+S(AB)$, i.e., $S(AB) \geq S(B)-S(A)$. By symmetry, $S(AB) \geq S(A)-S(B)$. $\square$

The Araki-Lieb inequality shows that quantum conditional entropy $S(A|B) = S(AB)-S(B)$ satisfies $S(A|B) \geq -S(A)$. This is remarkable: while classically $H(A|B) \geq 0$ always, quantum conditional entropy can be as negative as $-S(A)$, achieved when $\rho^{AB}$ is pure (so $S(AB)=0$ and $S(A|B) = -S(A)$).

5.3b Concavity of Entropy and the Subadditivity Chain

The concavity of von Neumann entropy $S(\sum_k p_k\rho_k) \geq \sum_k p_k S(\rho_k)$ has a clean proof via quantum relative entropy: $S(\bar\rho\|\bar\rho) - \sum_k p_k S(\rho_k\|\bar\rho) = S(\bar\rho)\cdot\sum_k p_k - \sum_k p_k S(\rho_k) = -\sum_k p_k[S(\rho_k)-S(\bar\rho)+D(\rho_k\|\bar\rho)]$. Since $D\geq 0$ and $S(\bar\rho) = S(\sum_kp_k\rho_k)$, this gives concavity.

\[S\!\Bigl(\sum_k p_k\rho_k\Bigr) - \sum_k p_kS(\rho_k) = \sum_k p_k S(\rho_k\|\bar\rho) \geq 0.\]

The right side is the Holevo information $\chi$ of the ensemble $\{(p_k,\rho_k)\}$ with average state $\bar\rho = \sum_k p_k\rho_k$. This connects the concavity of entropy directly to the Holevo bound on accessible information (Theorem 5.11): the information gap between “knowing which $\rho_k$” and “having $\bar\rho$” exactly equals the Holevo information, which upper-bounds how much Alice can communicate to Bob through the ensemble.

Example 5.2c (Holevo information equals entropy concavity gap). For the ensemble $\{(1/2,|0\rangle\langle 0|),(1/2,|1\rangle\langle 1|)\}$ with $\bar\rho = I/2$: the entropy gain $S(I/2) - \tfrac{1}{2}(S(|0\rangle\langle 0|)+S(|1\rangle\langle 1|)) = 1 - 0 = 1$ bit, equal to $\chi = 1$ bit. For the ensemble $\{(1/2,|0\rangle\langle 0|),(1/2,|+\rangle\langle+|)\}$ with $\bar\rho = \tfrac{1}{4}\begin{pmatrix}3&1\\1&1\end{pmatrix}$: $S(\bar\rho) \approx 0.601$ bits and $S(\rho_k) = 0$ for both, so $\chi \approx 0.601$ bits. Bob can decode at most 0.601 bits per use of this channel.

5.4 Quantum Data Compression

Theorem 5.3 (Schumacher's theorem). The minimum compression rate for a quantum source with pure-state ensemble $\{(p(x),|\psi_x\rangle)\}$ is $r = S(\rho)$ where $\rho = \sum_x p(x)|\psi_x\rangle\langle\psi_x|$.

The proof uses the typical subspace: the eigenspace of $\rho^{\otimes n}$ spanned by eigenvectors with eigenvalue close to $2^{-nS(\rho)}$. Projecting onto this subspace and transmitting the projection succeeds with high fidelity by the quantum AEP, using only $n(S(\rho)+\delta)$ qubits. The converse uses Fannes’ inequality and Winter’s argument.

\[\mathcal{S}_{n,\delta} = \operatorname{span}\{|e_{x_1}\rangle\otimes\cdots\otimes|e_{x_n}\rangle : x^n \in T_{n,\delta}\},\]

and the typical projector $\Pi_{n,\delta}$ projects onto $\mathcal{S}_{n,\delta}$.

Proposition 5.2 (Quantum AEP). For any $\varepsilon,\delta>0$ and all large $n$:

$\operatorname{Tr}(\Pi_{n,\delta}\rho^{\otimes n}) \geq 1-\varepsilon$ — the typical subspace has high probability.
$(1-\varepsilon)2^{n(S(\rho)-\delta)} \leq \dim\mathcal{S}_{n,\delta} \leq 2^{n(S(\rho)+\delta)}$.
The typical projector satisfies $2^{-n(S(\rho)+\delta)}I \leq \Pi_{n,\delta}\rho^{\otimes n}\Pi_{n,\delta} \leq 2^{-n(S(\rho)-\delta)}\Pi_{n,\delta}$.

The TTS (typical subspace transmission) protocol uses an isometry $V: \mathcal{S}_{n,\delta}\to\mathbb{C}^{\lceil 2^{n(S+\delta)}\rceil}$ to compress into $n(S+\delta)$ qubits. The decoder applies $V^\dagger$ followed by replacement of the discarded part with a fixed state $|0\rangle$. The fidelity with the input (referenced against a purification) is shown to be $\geq 1-O(\varepsilon)$ by the quantum AEP.

Proof of the direct part of Schumacher's theorem. Let $\varepsilon > 0$ and $\delta > 0$. For any $n$-qubit input $\sigma^{(n)} = \rho_{x_1}\otimes\cdots\otimes\rho_{x_n}$ from the source, the entanglement fidelity of the TTS protocol is \[F_e(\sigma^{(n)},\mathcal{E}\circ\mathcal{D}) = \operatorname{Tr}(\Pi_{n,\delta}\sigma^{(n)})\] where $\mathcal{E}$ (compress) and $\mathcal{D}$ (decompress) are the encoding/decoding channels. For the average input state $\rho^{\otimes n}$: $\operatorname{Tr}(\Pi_{n,\delta}\rho^{\otimes n}) \geq 1-\varepsilon$ by Quantum AEP property (1). By Fannes' inequality, this implies fidelity close to 1.

More precisely: let $|u\rangle = \operatorname{vec}(\sqrt{\rho^{\otimes n}}/\|\cdot\|)$ be a purification reference. The fidelity with the identity channel is $F_e(\rho^{\otimes n},\text{id}) = 1$. The TTS channel has entanglement fidelity $F_e(\rho^{\otimes n},\mathcal{E}\circ\mathcal{D}) = \operatorname{Tr}(\Pi_{n,\delta}\rho^{\otimes n}) \geq 1-\varepsilon$. By the Fuchs-van de Graaf inequality (applied to the compressed channel versus identity), the worst-case input fidelity also approaches 1. $\square$

Definition 5.4 (Fannes' inequality). For $\rho,\sigma \in D(\mathcal{X})$ with $\|\rho-\sigma\|_1 \leq \varepsilon \leq 1/e$, \[|S(\rho)-S(\sigma)| \leq \varepsilon\log\dim\mathcal{X} + \eta(\varepsilon), \quad \eta(\varepsilon) = -\varepsilon\log\varepsilon.\]

5.5 Entanglement Concentration and Dilution

For bipartite pure state $|\psi\rangle_{AB} = \sum_v\sqrt{p(v)}|v\rangle_A|v\rangle_B$ in Schmidt form, define the entropy of entanglement $E(|\psi\rangle) = S(\operatorname{Tr}_B|\psi\rangle\langle\psi|) = H(p)$.

Theorem 5.4 (Entanglement concentration). Given $n$ copies of $|\psi\rangle$, LOCC can extract $\approx nE(|\psi\rangle)$ maximally entangled pairs (ebits) asymptotically.

Theorem 5.5 (Entanglement dilution; Bennett-Bernstein-Popescu-Schumacher 1996). Starting from $nE(|\psi\rangle)+\delta$ ebits, LOCC can produce $n$ approximate copies of $|\psi\rangle$. The optimal dilution rate is $E(|\psi\rangle)$, achieved via teleportation of the TTS-compressed state.

The equality of concentration and dilution rates means the entropy of entanglement is the unique entanglement measure for bipartite pure states under asymptotic LOCC. If classical communication is charged, dilution costs only $O(\sqrt{n})$ extra cbits — savings quantified by the entanglement spread $\Delta(|\psi\rangle) = \log\operatorname{rank}(\rho_A) - \log(1/\|\rho_A\|_\infty)$.

Example 5.5 (Schumacher compression for a qubit source). Alice's source emits qubits from the ensemble: $\rho_0 = |0\rangle\langle 0|$ with probability $3/4$, and $\rho_1 = |+\rangle\langle+|$ with probability $1/4$. The average state is \[\rho = \tfrac{3}{4}|0\rangle\langle 0| + \tfrac{1}{4}|+\rangle\langle+| = \begin{pmatrix}7/8 & 1/8 \\ 1/8 & 1/8\end{pmatrix}.\] The eigenvalues of $\rho$ are $\lambda_\pm = \tfrac{1}{2}\bigl(1 \pm \sqrt{1 - 4\det\rho}\bigr)$. We have $\det\rho = \tfrac{7}{64} - \tfrac{1}{64} = \tfrac{6}{64} = \tfrac{3}{32}$, so \[\lambda_\pm = \tfrac{1\pm\sqrt{1-3/8}}{2} = \tfrac{1\pm\sqrt{5/8}}{2}.\] Numerically $\lambda_+ \approx 0.951$ and $\lambda_- \approx 0.049$. The von Neumann entropy is $S(\rho) = h(\lambda_+) \approx 0.287$ bits.

The Schumacher compression protocol requires only $\approx 0.287n$ qubits to encode $n$ copies of this source with high fidelity, compared to $n$ qubits naively. The typical subspace has dimension $\approx 2^{0.287n}$: for $n=100$, roughly $2^{28.7} \approx 4\times 10^8$ dimensions out of the full $2^{100}$ — a dramatic compression.

To see this concretely for small $n=2$: the eigenbasis of $\rho^{\otimes 2}$ consists of products of eigenvectors $\{|e_+\rangle,|e_-\rangle\}$ with eigenvalues $\lambda_+^2 \approx 0.904$, $\lambda_+\lambda_- \approx 0.047$ (twice), and $\lambda_-^2 \approx 0.002$. The typical sequences at the threshold $2^{-2S} \approx 2^{-0.574}$ are the two states $|e_+e_-\rangle,|e_-e_+\rangle$ along with $|e_+e_+\rangle$. Together these carry probability $0.904 + 2\times 0.047 \approx 0.998$, confirming that the typical subspace has dimension 3 and captures nearly all the probability mass.

Example 5.6 (Entanglement concentration step-by-step). Alice and Bob share $n=4$ copies of $|\psi\rangle = \tfrac{\sqrt{3}}{2}|00\rangle + \tfrac{1}{2}|11\rangle$, which has Schmidt coefficients $(\tfrac{3}{4},\tfrac{1}{4})$ and entropy of entanglement $E = h(3/4) = h(1/4) \approx 0.811$ ebits. The protocol:

Alice measures her 4 qubits in the eigenbasis of $\rho_A = \tfrac{3}{4}|0\rangle\langle 0|+\tfrac{1}{4}|1\rangle\langle 1|$. She obtains a typical string $x_1x_2x_3x_4 \in \{0,1\}^4$.
The global joint state after her measurement collapses to a product of $\sim nE = 3.24$ Schmidt components. A typical string with $\sim n\cdot\tfrac{3}{4} = 3$ zeros and $\sim n\cdot\tfrac{1}{4} = 1$ one corresponds to a state with Schmidt coefficient $\approx 2^{-n H(3/4,1/4)} = 2^{-3.24}$.
Alice communicates her measurement outcome (classical): $4$ cbits. Bob applies a local unitary to his 4 qubits based on Alice's announcement.
The result is approximately $\lfloor nE\rfloor = 3$ maximally entangled pairs, with fidelity approaching 1 as $n\to\infty$.

For $n=1000$ copies, the protocol extracts $\approx 811$ ebits from 1000 copies of $|\psi\rangle$, asymptotically optimal.

5.6 Quantum Relative Entropy

Definition 5.5 (Quantum relative entropy). For $P \in D(\mathcal{X})$ and $Q \in D(\mathcal{X})\cap\operatorname{Pd}(\mathcal{X})$, \[S(P\|Q) = \operatorname{Tr}[P(\log P - \log Q)].\]

Theorem 5.6 (Klein's inequality). $S(P\|Q) \geq 0$ with equality iff $P = Q$.

Proof. Klein's inequality: for convex $f$, $\operatorname{Tr}[f(A)] \geq \operatorname{Tr}[f(B)] + \operatorname{Tr}[f'(B)(A-B)]$. Apply with $f = -\log$, $A = P$, $B = Q$: $-\operatorname{Tr}[\log P] \geq -\operatorname{Tr}[\log Q] + \operatorname{Tr}[Q^{-1}(P-Q)]$, so $\operatorname{Tr}[P(\log P-\log Q)] \geq \operatorname{Tr}[P-Q] = 0$. $\square$

Theorem 5.7 (Pinsker's inequality). $S(P\|Q) \geq \tfrac{1}{2}\|P-Q\|_1^2$.

An immediate application: $S(\rho\|I/d) = \log d - S(\rho) \geq 0$, giving the upper bound $S(\rho) \leq \log d$.

Example 5.3 (Quantum relative entropy between qubit states). Let $\rho = \begin{pmatrix}3/4 & 0 \\ 0 & 1/4\end{pmatrix}$ and $\sigma = I/2$. Then $\log\rho = \begin{pmatrix}\log(3/4) & 0 \\ 0 & \log(1/4)\end{pmatrix}$ and $\log\sigma = -(\log 2)I$. So \[S(\rho\|\sigma) = \operatorname{Tr}[\rho(\log\rho-\log\sigma)] = \tfrac{3}{4}\log\tfrac{3}{4}+\tfrac{1}{4}\log\tfrac{1}{4} + \log 2 = \log 2 - S(\rho) \approx 0.189 \text{ bits}.\] This confirms the formula $S(\rho\|I/d) = \log d - S(\rho)$. Note that $S(\rho\|\sigma) > 0$ since $\rho \neq \sigma$ (Klein's inequality). If we had taken $\sigma = \rho$, then $S(\rho\|\rho) = 0$.

The quantum relative entropy also satisfies the data processing inequality: for any channel $\Phi$, $S(\Phi(\rho)\|\Phi(\sigma))\leq S(\rho\|\sigma)$. This is one of the most fundamental results in quantum information — it says that processing quantum information cannot increase the distinguishability of two states. The classical analogue is the data processing inequality for classical divergences.

A natural question is: when does equality hold? The answer is given by the Petz recovery map, which provides a universal sufficient condition for equality and connects data processing to approximate quantum error correction.

Definition 5.5b (Petz recovery map). Given a channel $\Phi$ and a state $\sigma > 0$, the Petz recovery map is the channel $\mathcal{P}_{\sigma,\Phi}: L(\mathcal{Y})\to L(\mathcal{X})$ defined by \[\mathcal{P}_{\sigma,\Phi}(\tau) = \sigma^{1/2}\,\Phi^*\!\bigl(\Phi(\sigma)^{-1/2}\,\tau\,\Phi(\sigma)^{-1/2}\bigr)\,\sigma^{1/2},\] where $\Phi^*$ is the adjoint (Heisenberg-picture) channel. The Petz map is designed so that if $\Phi$ is a degrading of some quantum channel and $\sigma$ is a fixed point, then $\mathcal{P}_{\sigma,\Phi}\circ\Phi = \operatorname{id}$ — the channel is perfectly recoverable.

Theorem 5.6b (Petz recovery map and equality in DPI). For states $\rho,\sigma > 0$ and a channel $\Phi$, \[S(\Phi(\rho)\|\Phi(\sigma)) = S(\rho\|\sigma) \iff \mathcal{P}_{\sigma,\Phi}\circ\Phi(\rho) = \rho.\] In words: the data processing inequality is tight if and only if the state $\rho$ can be perfectly recovered from its image $\Phi(\rho)$ using the Petz map built from $\sigma$.

This characterisation has deep implications for approximate quantum error correction. If a channel $\Phi$ only slightly decreases $S(\rho\|\sigma)$, then there exists a near-perfect recovery map — and quantitatively, the Fawzi-Renner theorem (2015) shows that $S(\rho\|\sigma) - S(\Phi(\rho)\|\Phi(\sigma)) \geq -2\log F(\rho, \mathcal{P}_{\sigma,\Phi}\circ\Phi(\rho))$, giving a recovery fidelity bound purely from the relative entropy decrease.

Example 5.3b (Petz map for the dephasing channel). Let $\Phi = \Delta$ be the dephasing channel $\Delta(\rho) = \sum_a |a\rangle\langle a|\rho|a\rangle\langle a|$ and $\sigma = I/d$ (maximally mixed). Then $\Phi(\sigma) = \sigma$ and the Petz map is $\mathcal{P}_{\sigma,\Delta}(\tau) = (I/d)^{1/2}\Delta^*\bigl((I/d)^{-1/2}\tau(I/d)^{-1/2}\bigr)(I/d)^{1/2} = \Delta^*(d\tau)/d = \Delta(\tau)$ (using $\Delta^* = \Delta$ for the self-adjoint dephasing). For a diagonal state $\rho = \sum_a\lambda_a|a\rangle\langle a|$: $\Delta(\rho) = \rho$ and $\mathcal{P}\circ\Delta(\rho) = \rho$. Equality holds as expected — dephasing a diagonal state loses no relative entropy to the maximally mixed state.

Example 5.7 (Data processing inequality — strict and attained). With $\rho = \operatorname{diag}(3/4,1/4)$ and $\sigma = I/2$:

Dephasing channel $\Delta(\rho) = \rho$ (fixes diagonal states): $S(\Delta(\rho)\|\Delta(\sigma)) = S(\rho\|\sigma) \approx 0.189$. Equality — dephasing is information-lossless here since both inputs are already diagonal.
Erasure channel $\Phi_p$ (erases with probability $p=1/2$): the output is a mixture of the original state and an erasure flag. The relative entropy is $\approx (1-p) \cdot 0.189 + 0 = 0.095$ bits — strictly less. The erasure discards half the information.
Completely depolarising channel $\Omega$: maps every state to $I/d$, so $\Omega(\rho) = \Omega(\sigma) = I/2$ and $S(\Omega(\rho)\|\Omega(\sigma)) = 0$. Maximally noisy channel destroys all distinguishability.

These examples illustrate that equality in data processing holds only when the channel acts identically on the relevant subspaces — captured formally by the Petz recovery map theorem.

5.6b Quantum Hypothesis Testing and Stein’s Lemma

A fundamental task in statistics is hypothesis testing: given many copies of an unknown state, decide whether it is $\rho$ (null hypothesis) or $\sigma$ (alternative). The quantum relative entropy governs the optimal error exponents.

Definition 5.5c (Type I and Type II errors). In quantum hypothesis testing with $n$ copies of the unknown state, a test $T_n \in [0,I]$ (a binary POVM element) defines:

Type I error (false alarm): $\alpha_n = \operatorname{Tr}((I-T_n)\rho^{\otimes n})$ — probability of rejecting $\rho$ when it is true.
Type II error (missed detection): $\beta_n = \operatorname{Tr}(T_n\sigma^{\otimes n})$ — probability of accepting $\rho$ when $\sigma$ is the true state.

Theorem 5.13b (Quantum Stein's lemma, Hiai-Petz 1991, Ogawa-Nagaoka 2000). Fix the Type I error rate at $\alpha_n \leq \varepsilon$ for some $\varepsilon \in (0,1)$. The minimum Type II error exponent is \[\lim_{n\to\infty}-\frac{1}{n}\log\beta_n^* = S(\rho\|\sigma).\] In other words, the optimal Type II error decreases exponentially with rate exactly $S(\rho\|\sigma)$, independent of the constraint $\varepsilon$.

The quantum Stein’s lemma establishes the quantum relative entropy as the fundamental distinguishability measure in the asymptotic hypothesis testing regime. Its classical analogue (Stein’s lemma for classical distributions) uses the KL divergence. The proof of the quantum version is substantially harder and uses the operator inequality techniques developed for quantum entropy — specifically, Lemma 3.17 of Hiai-Petz which is equivalent to the data processing inequality.

Example 5.9 (Quantum Stein's lemma for qubit states). For $\rho = |0\rangle\langle 0|$ and $\sigma = I/2$ (maximally mixed), $S(\rho\|\sigma) = \log 2 = 1$ bit. Stein's lemma says: with $n$ copies, the minimum Type II error (probability of confusing the pure state $|0\rangle\langle 0|$ with the maximally mixed state when the Type I error is fixed) decreases as $\beta_n^* \approx 2^{-n}$. This makes intuitive sense: measuring in the computational basis perfectly distinguishes $|0\rangle\langle 0|$ from $I/2$ in the limit — $I/2$ gives $|0\rangle$ with probability $1/2$ each time, while $|0\rangle\langle 0|$ always gives $|0\rangle$. The Type II error probability after $n$ measurements is $(1/2)^n = 2^{-n}$, exactly matching the Stein exponent.

5.7 Strong Subadditivity

Theorem 5.8 (Joint convexity of quantum relative entropy). For $\lambda \in [0,1]$, \[S(\lambda\rho_0+(1-\lambda)\rho_1 \| \lambda\sigma_0+(1-\lambda)\sigma_1) \leq \lambda S(\rho_0\|\sigma_0) + (1-\lambda)S(\rho_1\|\sigma_1).\]

Theorem 5.9 (Data processing for quantum relative entropy). For any channel $\Phi$ and states $\rho,\sigma$, \[S(\Phi(\rho)\|\Phi(\sigma)) \leq S(\rho\|\sigma).\]

Theorem 5.10 (Strong subadditivity). For any $\rho \in D(\mathcal{X}\otimes\mathcal{Y}\otimes\mathcal{Z})$, \[S(XYZ) + S(Z) \leq S(XZ) + S(YZ),\] equivalently $S(X|YZ) \leq S(X|Z)$ (conditioning on more information reduces uncertainty).

Proof sketch. Apply Theorem 5.9 with the completely dephasing channel on $\mathcal{Y}$: $\Delta_Y(\rho) = \sum_b|b\rangle\langle b|_Y\rho|b\rangle\langle b|_Y$. Under $\Delta_Y$, $S(\rho\|\rho^X\otimes I_Y/d_Y\otimes\rho^Z)$ decreases (data processing), which gives $S(XYZ) - S(XZ) \leq S(Y|Z)$. Rearranging yields SSA. $\square$

Remark (SSA history and difficulty). Strong subadditivity was conjectured by Lanford and Robinson in 1968 and proved by Lieb and Ruskai in 1973. The proof is notoriously difficult. It relies on the Lieb concavity theorem: $f(A,B) = \operatorname{Tr}(A^tKB^{1-t}K^*)$ is jointly concave in $(A,B) \in \operatorname{Pd}\times\operatorname{Pd}$ for $t\in[0,1]$. Multiple simpler proofs have been found since: Ruskai (2002) gave an algebraic proof, Petz (1994) connected it to the monotone metric, and Carlen-Lieb (2012) used complex interpolation. The data processing inequality for quantum relative entropy (Theorem 5.9) actually implies SSA (the proof above shows this), and the data processing inequality itself is equivalent to the concavity of conditional entropy.

Definition 5.6 (Derived quantities from SSA).

Quantum mutual information: $S(X:Y)_\rho = S(\rho^X)+S(\rho^Y)-S(\rho^{XY}) \geq 0$.
Conditional entropy: $S(X|Y)_\rho = S(\rho^{XY})-S(\rho^Y)$ (can be negative for entangled states).
SSA is equivalent to $S(X:Y|Z) \geq 0$, i.e., conditioning on $Z$ cannot increase mutual information.

Example 5.8 (Negative conditional entropy from entanglement). Let $\rho^{AB} = |\beta\rangle\langle\beta|$ be the Bell state on $\mathcal{X}_A\otimes\mathcal{X}_B = \mathbb{C}^2\otimes\mathbb{C}^2$. Then $S(\rho^{AB}) = 0$ (pure state) and $S(\rho^A) = S(\rho^B) = 1$ ebit (each reduced state is maximally mixed). The conditional entropy is \[S(A|B)_\rho = S(\rho^{AB}) - S(\rho^B) = 0 - 1 = -1 \text{ ebit}.\] This negative value has operational meaning: Bob's knowledge of his half of the Bell state actually *helps* the classical transmission of Alice's qubit — it provides a resource. In the quantum teleportation protocol, the ebit precisely compensates for the missing qubit communication: $-1$ ebit of conditional entropy means Bob can reconstruct Alice's qubit from the 2 classical bits she sends, using the pre-shared entanglement.

Contrast with a classically correlated state $\sigma^{AB} = \tfrac{1}{2}(|00\rangle\langle 00|+|11\rangle\langle 11|)$: here $S(\sigma^{AB}) = 1$, $S(\sigma^B) = 1$, so $S(A|B)_\sigma = 0$. Classical correlations always give non-negative conditional entropy.

Theorem 5.13 (Quantum capacity and hashing inequality). For a quantum channel $\Phi: L(\mathcal{X})\to L(\mathcal{Y})$, the quantum capacity (by the LSD theorem) is \[Q = \lim_{n\to\infty}\frac{1}{n}\max_{\rho \in D(\mathcal{X}^{\otimes n})} S(A|E)_{\rho^{AE}},\] where $E$ is the environment of the Stinespring dilation and the conditional entropy $S(A|E)$ is evaluated on the state after the channel acts. The single-letter hashing bound is $Q \geq \max_\rho S(A|E)_\rho = \max_\rho[S(\Phi(\rho)) - S_e(\rho,\Phi)]$, where $S_e$ is the entropy exchange. SSA implies this quantity is non-negative precisely when the channel can transmit quantum information.

5.8 Holevo’s Theorem

Definition 5.7 (Accessible information). For ensemble $\mathcal{E} = \{p(x),\rho_x\}_{x\in\Sigma}$, the accessible information is $I_{\text{acc}}(\mathcal{E}) = \max_M I(X:Y)$, where the max is over all POVMs and $Y$ is the measurement outcome.

Definition 5.8 (Holevo information). The Holevo information of $\mathcal{E}$ is \[\chi(\mathcal{E}) = S\!\Bigl(\sum_x p(x)\rho_x\Bigr) - \sum_x p(x)S(\rho_x).\]

Theorem 5.11 (Holevo's theorem, 1973). $I_{\text{acc}}(\mathcal{E}) \leq \chi(\mathcal{E})$.

Proof. Form the CQ state $\Lambda = \sum_x p(x)|x\rangle\langle x|\otimes\rho_x$. Any measurement $M$ on the quantum system maps $\Lambda$ to the classical-classical state $\Gamma = \sum_{x,y}p(x)p(y|x)|x\rangle\langle x|\otimes|y\rangle\langle y|$. Mutual information $I(X:Y) = S(X:Y)_\Gamma = S(X:W)_\Lambda - [S(X:W)_\Lambda - S(X:Y)_\Gamma]$. Since the measurement is a local operation on $W$ (Bob's side), data processing gives $S(X:Y) \leq S(X:W)_\Lambda = \chi(\mathcal{E})$. $\square$

Theorem 5.12 (Fano's inequality). If $A,B$ are random variables on the same alphabet $\Sigma$ and $q = \Pr[A\neq B]$, then \[H(A|B) \leq h(q) + q\log(|\Sigma|-1),\] where $h(q) = -q\log q-(1-q)\log(1-q)$ is the binary entropy.

Fano’s inequality lower-bounds the error rate in any decoding scheme. Combined with Holevo’s theorem, it implies the quantum Holevo bound: transmitting $n$ classical bits reliably requires at least $n$ qubits in the channel, even with optimal encoding and measurement.

Example 5.4 (Holevo information for a classical-quantum channel). Alice encodes bit $x\in\{0,1\}$ as state $\rho_x$: $\rho_0 = |0\rangle\langle 0|$ and $\rho_1 = |+\rangle\langle+|$, each with probability $1/2$. The average state is $\bar\rho = \tfrac{1}{2}(|0\rangle\langle 0| + |+\rangle\langle+|) = \tfrac{1}{2}\begin{pmatrix}3/2 & 1/2 \\ 1/2 & 1/2\end{pmatrix} = \begin{pmatrix}3/4 & 1/4 \\ 1/4 & 1/4\end{pmatrix}$. The eigenvalues of $\bar\rho$ are $\tfrac{1\pm 1/\sqrt{2}}{2}$, so $S(\bar\rho) = h\!\left(\tfrac{1+1/\sqrt{2}}{2}\right) \approx 0.601$ bits. Since $\rho_0,\rho_1$ are pure, $S(\rho_x) = 0$. Thus $\chi = S(\bar\rho) \approx 0.601 < 1$ bit. Holevo's theorem says Bob can extract at most 0.601 bits of information about $x$, even with the best possible measurement.

\[C = \lim_{n\to\infty}\frac{1}{n}\chi_n(\Phi),\quad \chi_n(\Phi) = \max_{\mathcal{E}}\chi(\Phi^{\otimes n}(\mathcal{E})),\]

where the max is over all ensembles on $\mathcal{X}^{\otimes n}$. The regularisation over $n$ is generally necessary (Hastings 2008 showed strict superadditivity of $\chi$ for some channels), making the classical capacity uncomputable in general.

Proof sketch of HSW (direct part). Fix $\varepsilon > 0$ and a rate $R < \chi(\Phi)$. Choose an ensemble $\mathcal{E} = \{p(x),\rho_x\}_{x\in\Sigma}$ achieving $\chi(\Phi,\mathcal{E}) > R$. Randomly generate a codebook: for each message $m \in [2^{nR}]$, draw an iid codeword $x^n(m) = (x_1(m),\ldots,x_n(m))$ each symbol independently from $p(x)$. Encode message $m$ as the tensor product state $\rho_{x^n(m)} = \rho_{x_1(m)}\otimes\cdots\otimes\rho_{x_n(m)}$. Send through $n$ uses of $\Phi$. Bob decodes using a joint measurement on $\Phi^{\otimes n}(\rho_{x^n(m)})$.

The typical subspace argument shows: with high probability, the output $\Phi^{\otimes n}(\rho_{x^n})$ lies in a typical subspace of dimension $\approx 2^{nS(\Phi(\bar\rho))}$, while the contribution from each codeword lies in a subspace of dimension $\approx 2^{n\sum_x p(x)S(\Phi(\rho_x))}$. Since $R < \chi = S(\Phi(\bar\rho)) - \sum_x p(x)S(\Phi(\rho_x))$, the number of codewords $2^{nR}$ times the per-codeword subspace dimension $2^{n\sum p(x)S(\Phi(\rho_x))}$ is still exponentially smaller than the typical subspace. Hence a random measurement succeeds with high probability in identifying the message. $\square$

Remark (Quantum channel capacity hierarchy). For any quantum channel $\Phi$, the four capacities satisfy: \[Q(\Phi) \leq C(\Phi) \leq C_E(\Phi) = \max_\rho I(A:B)_\rho.\] Here $Q$ is the quantum capacity, $C$ the classical capacity, and $C_E$ the entanglement-assisted classical capacity. The gap $C_E - C$ measures how much entanglement assistance helps, and the gap $C - Q$ measures the advantage of classical over quantum communication. All three are in general different: for the erasure channel with probability $p$, $Q = (1-2p)^+ \cdot 1$ qubit, $C = (1-p)$ bit, and $C_E = 2(1-p)$ bits (superdense coding doubling).

Chapter 6: Separability and Entanglement Theory

Entanglement is the defining resource of quantum information. This chapter develops the mathematical theory of entanglement: how to characterise it, how to detect it, and how it behaves under local operations and classical communication.

The distinction between separable and entangled states is not merely a mathematical curiosity — it has profound operational consequences. Entangled states violate Bell inequalities, which are constraints that any classically correlated (separable) state must satisfy. Detecting this violation operationally certifies entanglement without any characterisation of the measurement devices (device-independent certification).

The mathematical theory of entanglement is rich and still not fully understood. The separability problem is NP-hard; the connection between PPT and separability breaks down above $2\times 3$ dimensions; the quantum capacities remain uncomputable in general; and the relationship between entanglement measures is complex. The framework of SDP relaxations (§6.3b) and convex optimisation over quantum states (§6.7, §6.8) provides the best current tools for these problems, and many of the key results in this chapter are proved using the SDP duality machinery of Chapter 4.

6.1 Separable Operators

Definition 6.1 (Separable operator). An operator $P \in \operatorname{Pos}(\mathcal{X}_A\otimes\mathcal{X}_B)$ is separable if \[P = \sum_j Q_j \otimes R_j, \quad Q_j \in \operatorname{Pos}(\mathcal{X}_A),\; R_j \in \operatorname{Pos}(\mathcal{X}_B).\] The set of separable operators is $\operatorname{Sep}(\mathcal{X}_A:\mathcal{X}_B)$. The separable density operators are $\operatorname{SepD}(\mathcal{X}_A:\mathcal{X}_B) = \operatorname{Sep}\cap D(\mathcal{X}_A\otimes\mathcal{X}_B)$. Operators outside $\operatorname{Sep}$ are entangled.

Proposition 6.1. $\operatorname{SepD}$ is a compact convex set. Its extreme points are the product pure states $\{|u\rangle\langle u|\otimes|v\rangle\langle v|\}$. By Carathéodory's theorem applied in $\operatorname{Herm}(\mathcal{X}_A\otimes\mathcal{X}_B) \cong \mathbb{R}^{d^4}$, every separable state decomposes into at most $d^4$ product terms (where $d = \dim\mathcal{X}_A = \dim\mathcal{X}_B$).

Example 6.1. For $\mathcal{X}_A = \mathcal{X}_B = \mathbb{C}^2$: the state $\rho = \tfrac{1}{4}I_4 = \tfrac{1}{4}(|00\rangle\langle 00|+|01\rangle\langle 01|+|10\rangle\langle 10|+|11\rangle\langle 11|)$ is separable (it is $\tfrac{I}{2}\otimes\tfrac{I}{2}$). The Bell state $|\beta\rangle\langle\beta|$ is entangled — it cannot be written as $\sum_j Q_j\otimes R_j$ with $Q_j,R_j\geq 0$.

6.2 The Horodecki Criterion and Entanglement Witnesses

Before developing the algebraic theory of separability, it is worth understanding the operational test that motivated the field: the CHSH inequality.

Theorem 6.0 (CHSH inequality, Clauser-Horne-Shimony-Holt 1969). For any separable state $\rho \in \operatorname{SepD}(\mathcal{X}_A:\mathcal{X}_B)$ and any observables $A_0, A_1 \in \operatorname{Herm}(\mathcal{X}_A)$, $B_0, B_1 \in \operatorname{Herm}(\mathcal{X}_B)$ with $\|A_i\|_\infty \leq 1$ and $\|B_j\|_\infty \leq 1$, \[\langle A_0\otimes B_0\rangle_\rho + \langle A_0\otimes B_1\rangle_\rho + \langle A_1\otimes B_0\rangle_\rho - \langle A_1\otimes B_1\rangle_\rho \leq 2.\] Quantum mechanics allows violation up to the Tsirelson bound $2\sqrt{2}$.

Example 6.0 (Bell state violates CHSH). For $|\beta\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$, take $A_0 = Z$, $A_1 = X$, $B_0 = \tfrac{Z+X}{\sqrt{2}}$, $B_1 = \tfrac{Z-X}{\sqrt{2}}$. Then: \[\langle A_0\otimes B_0\rangle + \langle A_0\otimes B_1\rangle + \langle A_1\otimes B_0\rangle - \langle A_1\otimes B_1\rangle = \tfrac{1}{\sqrt{2}}+\tfrac{1}{\sqrt{2}}+\tfrac{1}{\sqrt{2}}+\tfrac{1}{\sqrt{2}} = 2\sqrt{2} > 2.\] This violates the CHSH inequality, proving that no separable state could produce these correlations. The value $2\sqrt{2}$ is exactly the Tsirelson bound — the maximum achievable by any quantum state.

Deciding separability is NP-hard in general (Gurvits 2003). However, necessary conditions give powerful tests that suffice to detect most entangled states in practice.

The separating hyperplane theorem applied to the convex set $\operatorname{Sep}$ gives the abstract existence of a witness for every entangled state. The constructive challenge is to find explicit witnesses that are practically measurable in the lab. An entanglement witness $H$ must be measurable (i.e., expressible as a sum of local observables $H = \sum_i a_i\otimes b_i$) and satisfy $\langle H,P\otimes Q\rangle \geq 0$ for all products $P,Q\geq 0$.

A common construction: given a linear map $\Lambda$ that is positive but not CP, define $H = (I\otimes\Lambda^*)(\Pi^+)$ where $\Pi^+ = |\beta\rangle\langle\beta|$ is the maximally entangled projector. Then $\langle H,\rho\rangle = \langle\Pi^+,(I\otimes\Lambda)\rho\rangle = F(\Lambda(\rho)) \geq 0$ for separable $\rho$ (since $\Lambda$ is positive and $\rho^A\geq 0$ for product states), but $\langle H,|\beta\rangle\langle\beta|\rangle = \operatorname{Tr}(\Pi^+ (I\otimes\Lambda)(|\beta\rangle\langle\beta|)) < 0$ when $\Lambda$ is not CP.

Theorem 6.1 (Horodecki criterion). $P \in \operatorname{Pos}(\mathcal{X}_A\otimes\mathcal{X}_B)$ is separable if and only if $(\Phi\otimes I_{\mathcal{X}_B})(P) \geq 0$ for every positive map $\Phi: L(\mathcal{X}_A)\to L(\mathcal{X}_B)$.

Proof sketch. If $P = \sum_j Q_j\otimes R_j$, then $(\Phi\otimes I)(P) = \sum_j\Phi(Q_j)\otimes R_j \geq 0$ since $\Phi$ is positive. Conversely, if $P\notin\operatorname{Sep}$, the separating hyperplane theorem gives $H \in \operatorname{Herm}(\mathcal{X}_A\otimes\mathcal{X}_B)$ with $\langle H,P\rangle < 0$ and $\langle H,Q\otimes R\rangle \geq 0$ for all $Q,R\geq 0$. One constructs a positive unital map $\Phi$ from $H$ such that $(\Phi\otimes I)(P)$ has a negative eigenvalue. $\square$

Definition 6.2 (Entanglement witness). A Hermitian $H$ is an entanglement witness for $P$ if $\langle H,Q\otimes R\rangle \geq 0$ for all $Q,R\geq 0$ but $\langle H,P\rangle < 0$. The existence of such $H$ is equivalent to $P \notin \operatorname{Sep}$.

Definition 6.3 (PPT criterion). The transpose map $T: L(\mathcal{X})\to L(\mathcal{X})$ is $T(A) = A^T$ (complex conjugate in the standard basis). A state $\rho \in D(\mathcal{X}_A\otimes\mathcal{X}_B)$ is PPT (positive partial transpose) if $(T\otimes I)(\rho) \geq 0$.

Theorem 6.2 (Peres PPT criterion). Every separable state is PPT.

Proof. The transpose map $T$ is positive (but not CP). For $\rho = \sum_j Q_j\otimes R_j \in \operatorname{Sep}$: $(T\otimes I)(\rho) = \sum_j T(Q_j)\otimes R_j = \sum_j Q_j^T\otimes R_j \geq 0$, since $Q_j^T \geq 0$ (transpose preserves PSD) and $R_j \geq 0$. $\square$

Theorem 6.3 (Horodecki 2×2 and 2×3). In dimension $2\otimes 2$ and $2\otimes 3$, the PPT condition is also sufficient for separability: $\rho$ is separable if and only if it is PPT.

Remark. This theorem fails in dimension $2\otimes 4$ and above: there exist PPT entangled (bound entangled) states. The key example is the Choi state. The Choi map $\Lambda: L(\mathbb{C}^3)\to L(\mathbb{C}^3)$ defined by $\Lambda(A) = \begin{pmatrix}A_{11}+A_{33}&-A_{12}&-A_{13}\\-A_{21}&A_{22}+A_{11}&-A_{23}\\-A_{31}&-A_{32}&A_{33}+A_{22}\end{pmatrix}$ is positive but not completely positive. The Choi state $J(\Lambda)/\operatorname{Tr}(J(\Lambda))$ is the canonical example of a PPT entangled state in $3\otimes 3$.

Example 6.2 (PPT criterion detecting entanglement). Consider the two-qubit state \[\rho = \begin{pmatrix}1/2 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2\end{pmatrix} = |\beta_{00}\rangle\langle\beta_{00}|,\] the maximally entangled Bell state with $|\beta_{00}\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$. The partial transpose acts as $(T\otimes I)\rho$: it transposes the $A$ subsystem's matrix elements, which exchanges $|01\rangle\langle 10|$ and $|10\rangle\langle 01|$: \[(T\otimes I)\rho = \begin{pmatrix}1/2 & 0 & 0 & 0 \\ 0 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 0 \\ 0 & 0 & 0 & 1/2\end{pmatrix}.\] The eigenvalues of this matrix are $1/2, 1/2, 1/2, -1/2$. The negative eigenvalue $-1/2$ proves the Bell state is entangled (not PPT), hence not separable. The optimal entanglement witness is the eigenvector of $(T\otimes I)\rho$ with the negative eigenvalue: $H = |\psi^-\rangle\langle\psi^-|$ where $|\psi^-\rangle = \tfrac{1}{\sqrt{2}}(|01\rangle-|10\rangle)$.

Definition 6.5 (Werner states). For $d$-dimensional bipartite systems, the Werner state with parameter $\alpha \in [-1,1]$ is \[W_\alpha = \frac{1}{d^2-\alpha d}\bigl(I - \alpha\mathbb{F}\bigr),\] where $\mathbb{F} = \sum_{i,j}|ij\rangle\langle ji|$ is the swap operator. Werner states are invariant under $(U\otimes U)\rho(U\otimes U)^*$ for any unitary $U$. For $\alpha > 0$, $W_\alpha$ is entangled; for $\alpha \leq 0$, $W_\alpha$ is separable. For $\alpha \in (0,1/d]$, $W_\alpha$ is PPT and entangled (bound entangled in dimension $d\geq 3$); for $\alpha > 1/d$, $W_\alpha$ is NPT.

Example 6.2. For the Bell state $|\beta_{00}\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$, the density matrix is $\rho = \tfrac{1}{2}\begin{pmatrix}1&0&0&1\\0&0&0&0\\0&0&0&0\\1&0&0&1\end{pmatrix}$. Applying partial transpose on the first qubit: $(T\otimes I)(\rho) = \tfrac{1}{2}\begin{pmatrix}1&0&0&0\\0&0&1&0\\0&1&0&0\\0&0&0&1\end{pmatrix}$, which has eigenvalue $-1/2 < 0$. So $|\beta_{00}\rangle$ is NPT (not PPT), confirming it is entangled.

6.3 The Ball of Separable States

Lemma 6.1 (Lemma 14.2, Watrous). For $A = \sum_{a,b}A_{ab}\otimes|e_a\rangle\langle e_b| \in L(\mathcal{X}\otimes\mathbb{C}^\Sigma)$, \[\|A\|_\infty \leq \|A\|_2 \leq \Bigl(\sum_{a,b}\|A_{ab}\|_2^2\Bigr)^{1/2}.\]

Theorem 6.3 (Thm 14.4, Watrous). If $A \in \operatorname{Herm}(\mathcal{X}_A\otimes\mathcal{X}_B)$ satisfies $\|A\|_2 \leq 1$, then $I - A \in \operatorname{Sep}(\mathcal{X}_A:\mathcal{X}_B)$.

Theorem 6.3 says there is a Frobenius-norm ball of radius 1 around the identity operator $I$ that lies entirely within $\operatorname{Sep}$, after normalisation. This gives a concrete sufficient condition for separability and is used in the theory of quantum error correction (highly mixed states near the maximally mixed state are separable).

6.3b SDP Relaxations of Separability

Since exact separability testing is NP-hard, one uses SDP-based relaxations. The basic relaxation is the PPT test (Def. 6.3), which is an SDP feasibility check. More powerful is the DPS (Doherty-Parrilo-Spedalieri) hierarchy of SDP relaxations.

Definition 6.3b (SDP separability hierarchy). For integer $k\geq 1$, define the set $\operatorname{Sep}^{(k)}(\mathcal{X}_A:\mathcal{X}_B)$ of states that have a symmetric extension to $k$ copies of $\mathcal{X}_B$: $\rho^{AB}$ is in $\operatorname{Sep}^{(k)}$ if there exists $\omega^{AB_1\cdots B_k}$ symmetric under permutations of $B_1,\ldots,B_k$ such that $\operatorname{Tr}_{B_2\cdots B_k}\omega = \rho^{AB}$ and $\omega$ is PPT across each $A:B_j$ bipartition. The level-$k$ SDP constraint is $\omega \geq 0$, $\omega$ symmetric, $(T_{B_j}\otimes I)(\omega)\geq 0$ for all $j$.

The DPS hierarchy satisfies $\operatorname{Sep} \subset \cdots \subset \operatorname{Sep}^{(k+1)} \subset \operatorname{Sep}^{(k)} \subset \cdots \subset \operatorname{Sep}^{(1)} = \operatorname{PPT}$. The remarkable result (Brandão-Harrow 2013) is that for any $\varepsilon > 0$, there exists $k = O(\log(d)/\varepsilon^2)$ such that $\operatorname{Sep}^{(k)} \subseteq \{$\rho : \min_{\sigma\in\operatorname{Sep}}\tfrac{1}{2}\|\rho-\sigma\|_1 \leq \varepsilon\}$. This gives an efficient algorithm for approximating separability to within trace-norm precision, establishing that the separability problem has a quasipolynomial-time algorithm in fixed precision.

Example 6.2b (Level-1 SDP check for a qubit-qubit state). For $\rho = W_\alpha = (1-\alpha)\tfrac{I}{4}+\alpha|\beta_{00}\rangle\langle\beta_{00}|$ (Werner state), the level-1 DPS check reduces to the PPT condition: $(T\otimes I)(W_\alpha)\geq 0$ iff $\alpha\leq 1/3$. For level 2, the symmetric extension condition is: does there exist $\omega^{AB_1B_2}$ with $\operatorname{Tr}_{B_2}\omega = W_\alpha$ and $\omega^{AB_1B_2} = \omega^{AB_2B_1}$? For Werner states, the level-2 bound gives $\alpha \leq 1/3$ as well (since Werner states are PPT iff separable in $2\otimes 2$). For general states in higher dimensions, each level of the hierarchy can strictly improve the separability certificate.

6.4 Separable Maps and Schmidt Rank

Definition 6.4 (Schmidt rank). The Schmidt rank of $w \in \mathcal{X}_A\otimes\mathcal{X}_B$ is the rank of the corresponding operator $C \in L(\mathcal{X}_B,\mathcal{X}_A)$ with $w = \operatorname{vec}(C)$. Equivalently, it is the number of nonzero terms in the Schmidt decomposition $w = \sum_k\alpha_k|u_k\rangle|v_k\rangle$.

Definition 6.5 (Entanglement rank of operators). The set $\operatorname{Ent}_r(\mathcal{X}_A:\mathcal{X}_B)$ consists of all $P \in \operatorname{Pos}(\mathcal{X}_A\otimes\mathcal{X}_B)$ decomposable as $P = \sum_j|w_j\rangle\langle w_j|$ with each $|w_j\rangle$ having Schmidt rank $\leq r$. Note $\operatorname{Ent}_1 = \operatorname{Sep}$ and $\operatorname{Ent}_N = \operatorname{Pos}$ for $N = \min(\dim\mathcal{X}_A,\dim\mathcal{X}_B)$.

Definition 6.6 (Separable CP map). A map $\Phi \in C(\mathcal{X}_A\otimes\mathcal{X}_B,\mathcal{Y}_A\otimes\mathcal{Y}_B)$ is separable if it has a Kraus representation $\Phi(M) = \sum_j(A_j\otimes B_j)M(A_j\otimes B_j)^*$ with product Kraus operators.

Proposition 6.2 (Prop 15.1, Watrous). For a CP map $\Phi$ from $\mathcal{X}_A\otimes\mathcal{X}_B$ to $\mathcal{Y}_A\otimes\mathcal{Y}_B$, the following are equivalent:

$\Phi$ has product Kraus operators: $\Phi(M) = \sum_j(A_j\otimes B_j)M(A_j\otimes B_j)^*$.
$J(\Phi) \in \operatorname{Sep}(\mathcal{Y}_A\mathcal{X}_A:\mathcal{Y}_B\mathcal{X}_B)$.
$\Phi$ decomposes as a sum of product channels.

Theorem 6.4 (Thm 15.3, Watrous). Separable channels cannot increase Schmidt rank: if $R \in \operatorname{Ent}_r$ and $\Phi$ is separable CP, then $\Phi(R) \in \operatorname{Ent}_r$.

6.5 LOCC

Definition 6.7 (LOCC). Local operations and classical communication (LOCC) is the class of channels $\Phi \in C(\mathcal{X}_A\otimes\mathcal{X}_B,\mathcal{Y}_A\otimes\mathcal{Y}_B)$ implementable by: multiple rounds, where in each round one party applies a quantum instrument (a set of CP maps summing to a CP map) and broadcasts the classical outcome to the other party. Write $\text{LOCC}_N$ for protocols with at most $N$ rounds, $\text{LOCC}^\infty$ for arbitrarily many rounds, and $\text{LOCC}$ for the topological closure.

The chain of inclusions is: $\text{LO} \subset \text{LOCC}_1 \subset \text{LOCC}_N \subset \text{LOCC}^\infty \subset \text{LOCC} \subsetneq \text{SEP} \subset \text{PPT-preserving}$.

Example 6.4 (LOCC discrimination of two Bell states). Alice and Bob share one of two states: $|\beta_{00}\rangle = \tfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$ or $|\beta_{01}\rangle = \tfrac{1}{\sqrt{2}}(|01\rangle+|10\rangle)$, each with probability 1/2. They wish to determine which state they have using only LOCC.

Protocol: Alice measures her qubit in the computational basis $\{|0\rangle,|1\rangle\}$ and announces the outcome $a\in\{0,1\}$. If the state is $|\beta_{00}\rangle$: outcome $a=0$ leaves Bob in $|0\rangle$ (probability 1/2), and outcome $a=1$ leaves Bob in $|1\rangle$ (probability 1/2). If the state is $|\beta_{01}\rangle$: outcome $a=0$ leaves Bob in $|1\rangle$, and outcome $a=1$ leaves Bob in $|0\rangle$. After Alice’s announcement: Bob sees $|a\rangle$ iff state was $|\beta_{00}\rangle$, and $|1-a\rangle$ iff state was $|\beta_{01}\rangle$. Bob measures his qubit: if his outcome agrees with $a$, they have $|\beta_{00}\rangle$; if it disagrees, they have $|\beta_{01}\rangle$. This 1-round LOCC protocol succeeds with probability 1. It requires only 1 cbit of classical communication.

Contrast with 4 Bell states: Walgate-Hardy-Short-Vedral (2000) showed that any two orthogonal bipartite pure states can be perfectly discriminated by LOCC. But all four Bell states $\{|\beta_{00}\rangle,|\beta_{01}\rangle,|\beta_{10}\rangle,|\beta_{11}\rangle\}$ cannot be discriminated perfectly by any LOCC protocol. This is the archetypal example of a separable operation outside LOCC.

Theorem 6.5 (Lo-Popescu 1997). For two parties sharing a known bipartite pure state, any LOCC protocol can be simulated by a one-way protocol where Alice alone makes measurements and communicates to Bob.

Theorem 6.6 (LOCC $\subsetneq$ SEP). There exist nine orthogonal product states in $\mathbb{C}^3\otimes\mathbb{C}^3$ (an unextendable product basis, UPB) that cannot be perfectly discriminated by LOCC, but can be discriminated by a separable measurement.

Theorem 6.7 (Chitambar-Leung-Lo 2012). LOCC is not closed: there exists a channel in $\overline{\text{LOCC}^\infty}\setminus\text{LOCC}^\infty$. This means adding more rounds strictly increases the set of achievable LOCC operations.

Example 6.3 (Bennett et al. UPB). The nine states in $\mathbb{C}^3\otimes\mathbb{C}^3$: \[|0\rangle|0-1\rangle,\;|2\rangle|1-2\rangle,\;|0-1\rangle|2\rangle,\;|1-2\rangle|0\rangle,\;|1\rangle|1\rangle\] (where $|a-b\rangle = |a\rangle-|b\rangle$ unnormalised) together with 4 more form a basis with no additional product vector orthogonal to all of them. Hence the state $\rho = \tfrac{1}{4}(I-\sum_{i=1}^5|u_i\rangle\langle u_i|)$ is PPT but entangled.

6.6 Nielsen’s Theorem

Definition 6.8 (Majorisation). For $u,v\in\mathbb{R}^k$ with components sorted in decreasing order, say $v \prec u$ ($v$ is majorised by $u$) if $\sum_{j=1}^m v_j \leq \sum_{j=1}^m u_j$ for all $m$ and $\sum_j v_j = \sum_j u_j$. For Hermitian operators: $D \prec C$ iff $\lambda(D) \prec \lambda(C)$, iff $D = \Lambda(C)$ for some doubly stochastic (mixed unitary) channel $\Lambda$.

Theorem 6.8 (Nielsen's theorem, 1999). Let $|x\rangle \in \mathcal{X}_A\otimes\mathcal{X}_B$ and $|y\rangle \in \mathcal{Y}_A\otimes\mathcal{Y}_B$. There exists an LOCC channel mapping $|x\rangle\langle x|$ to $|y\rangle\langle y|$ if and only if \[\operatorname{Tr}_{\mathcal{X}_B}|x\rangle\langle x| \prec \operatorname{Tr}_{\mathcal{Y}_B}|y\rangle\langle y|.\] That is, $|y\rangle$ is more entangled than $|x\rangle$ iff the Schmidt coefficients of $|y\rangle$ are more spread out (majorised by those of $|x\rangle$).

Proof sketch — necessity. Any LOCC channel is separable, and separable channels satisfy $\rho^X \to \Lambda(\rho^X)$ for some doubly stochastic channel $\Lambda$ acting on the eigenvalues. Hence the Schmidt coefficients can only become more mixed (majorised). $\square$

Proof sketch — sufficiency. If $\lambda(x) \prec \lambda(y)$ (the Schmidt coefficients of $|x\rangle$ majorise those of $|y\rangle$), then $\rho^A_x \prec \rho^A_y$, meaning $\rho^A_x = \Lambda(\rho^A_y)$ for a doubly stochastic $\Lambda$. By Birkhoff's theorem, $\Lambda = \sum_j q_j P_{\sigma_j}$ is a convex combination of permutation matrices $P_{\sigma_j}$. The LOCC protocol: Alice measures in the eigenbasis of $\rho^A_x$, applies a random permutation of eigenvalues to simulate $\Lambda$, communicates the classical outcome to Bob, and Bob adjusts his state accordingly using the unitary correction. The correctness follows from the probabilistic structure of doubly stochastic maps. $\square$

Example 6.5 (Nielsen's theorem — explicit qubit protocol). Let $|x\rangle = \tfrac{\sqrt{3}}{2}|00\rangle+\tfrac{1}{2}|11\rangle$ (Schmidt coefficients $p = (3/4, 1/4)$) and $|y\rangle = \tfrac{1}{\sqrt{2}}|00\rangle+\tfrac{1}{\sqrt{2}}|11\rangle$ (Bell state, coefficients $q = (1/2,1/2)$). Check majorisation: $q_1 = 1/2 \leq 3/4 = p_1$ ✓ and $q_1+q_2 = 1 = p_1+p_2$ ✓. So $q \prec p$, meaning LOCC can convert $|x\rangle\to|y\rangle$.

Explicit protocol: The doubly stochastic matrix mapping $p\to q$ is $\Lambda = \begin{pmatrix}1/2&1/2\\1/2&1/2\end{pmatrix}$, the maximally mixing matrix. By Birkhoff’s theorem: $\Lambda = \tfrac{1}{2}P_{\text{id}} + \tfrac{1}{2}P_{\text{swap}}$.

Step 1 — Alice’s measurement: Alice measures in the eigenbasis $\{|0\rangle_A,|1\rangle_A\}$ of $\rho^A_x = \tfrac{3}{4}|0\rangle\langle 0|+\tfrac{1}{4}|1\rangle\langle 1|$:

Outcome $a=0$ (prob 3/4): state collapses to $|00\rangle_B = |0\rangle_A|0\rangle_B$.
Outcome $a=1$ (prob 1/4): state collapses to $|11\rangle_B = |1\rangle_A|1\rangle_B$.

Step 2 — Alice communicates $a$: Bob receives the bit $a$.

Step 3 — Bob’s correction: Bob applies a random unitary from the set $\{I, X\}$ each with probability 1/2 (drawn independently of $a$):

If he applies $I$: state is $|a\rangle_A|a\rangle_B$.
If he applies $X$: state is $|a\rangle_A|(1-a)\rangle_B$.

Averaging: $\tfrac{1}{2}|a\rangle|a\rangle + \tfrac{1}{2}|a\rangle|1-a\rangle = |a\rangle\tfrac{|0\rangle+|1\rangle}{\sqrt{2}} \cdot\sqrt{2}$. Wait — this doesn't quite work as stated. The correct construction from the proof: let $A = \rho^A_y$ (the target reduced state) and use the Uhlmann theorem / Schur-Horn argument to construct the isometry. In practice, the LOCC protocol requires Alice to apply a probabilistic mixture of permutation unitaries conditioned on her initial measurement outcome, then communicate to Bob who applies the conjugate correction. The resulting state has reduced state $\rho^A_y = I/2$, matching the Bell state target. $\checkmark$

6.7 Entanglement Measures

Definition 6.9 (Entanglement cost and distillable entanglement). For bipartite $\rho \in D(\mathcal{X}_A\otimes\mathcal{X}_B)$:

Entanglement cost: $E_c(\rho) = \inf\{r : \exists\text{ LOCC protocol converting ebit}^{\otimes nr}\to\rho^{\otimes n}\text{ for large }n\}$.
Distillable entanglement: $E_d(\rho) = \sup\{r : \exists\text{ LOCC protocol converting }\rho^{\otimes n}\to\text{ebit}^{\otimes nr}\text{ for large }n\}$.

Theorem 6.9. $E_c(\rho) \geq E_d(\rho)$ for all $\rho$.

Proof. Suppose $E_c(\rho) < r < E_d(\rho)$ for some $r$. Then one could: (a) use $nr$ ebits and LOCC to prepare $\rho^{\otimes n}$, and then (b) distill $\rho^{\otimes n}$ into $nr'$ ebits for some $r' > r$. This would be a net ebit gain from LOCC alone, which contradicts monotonicity of entanglement under LOCC (entanglement measures cannot increase under LOCC). $\square$

Definition 6.10 (Entanglement of formation). $E_f(\rho) = \min\sum_k p_k S(\operatorname{Tr}_B|\psi_k\rangle\langle\psi_k|)$ over all pure-state decompositions $\rho = \sum_k p_k|\psi_k\rangle\langle\psi_k|$.

Theorem 6.10 (Entanglement of formation for Bell-diagonal states). For the two-qubit Werner state $W_\alpha = (1-\alpha)\tfrac{I}{4} + \alpha|\beta_{00}\rangle\langle\beta_{00}|$ with $\alpha \in [0,1]$, \[E_f(W_\alpha) = h\!\left(\tfrac{1+\sqrt{1-\alpha^2}}{2}\right),\] where $h$ is the binary entropy. For $\alpha = 1$ (Bell state), $E_f = 1$ ebit. For $\alpha \leq 1/3$ the state is separable and $E_f = 0$.

Remark. Hayden-Horodecki-Terhal (2000) proved $E_c(\rho) = \lim_{n\to\infty}\tfrac{1}{n}E_f(\rho^{\otimes n})$. Hastings (2008) showed $E_f$ is non-additive, so $E_c < E_f$ in general — a major open problem resolved. Whether $E_d > 0$ for all entangled states also remains open.

Definition 6.12 (Squashed entanglement). The squashed entanglement is $E_{\text{sq}}(\rho^{AB}) = \tfrac{1}{2}\inf_{\rho^{ABE}:\,\operatorname{Tr}_E\rho^{ABE}=\rho^{AB}} S(A:B|E)_\rho$, where the infimum is over all extensions. This is the first additive entanglement monotone that interpolates between $E_d$ and $E_c$: $E_d \leq E_{\text{sq}} \leq E_c$. SSA (Theorem 5.10) ensures $E_{\text{sq}} \geq 0$. Brandão-Christandl-Yard (2011) proved $E_{\text{sq}}$ is faithful: $E_{\text{sq}}(\rho) = 0$ iff $\rho$ is separable.

6.8 PPT States and Bound Entanglement

Theorem 6.10 (Thm 18.1, Watrous). Separable channels preserve PPT: if $\rho$ is PPT and $\Phi$ is a separable CP map, then $\Phi(\rho)$ is PPT.

Theorem 6.11 (Thm 18.3, Watrous). $E_d(\rho) = 0$ for every PPT state $\rho$.

Proof. Any LOCC distillation protocol is separable (LOCC $\subseteq$ SEP). By Theorem 6.10, applying any separable channel to $\rho^{\otimes n}$ (PPT) yields a PPT state. An ebit $|\beta\rangle\langle\beta|$ is not PPT — its partial transpose has eigenvalue $-1/2$. Any state with fidelity $\geq 1-\varepsilon$ to an ebit has partial-transpose eigenvalue $\leq -\tfrac{1}{2}+O(\sqrt\varepsilon) < 0$ for small $\varepsilon$. Hence no LOCC protocol can produce a state close to an ebit from a PPT state, so $E_d = 0$. $\square$

Definition 6.11 (Bound entanglement). A state is bound entangled if it is entangled ($E_c > 0$) but has $E_d = 0$. By Theorem 6.11, all PPT entangled states are bound entangled.

Example 6.5 (UPB bound entangled state). Using the five UPB vectors $\{|u_i\rangle\}$ of Example 6.3 in $\mathbb{C}^3\otimes\mathbb{C}^3$, define $\rho = \tfrac{1}{4}(I - \sum_i|u_i\rangle\langle u_i|)$. This state is:

PPT: because no product state is orthogonal to all five UPB vectors, $(T\otimes I)(\rho)\geq 0$.
Entangled: the projector onto the orthogonal complement of the UPB entangles with any purification.
Bound entangled: PPT implies $E_d = 0$, but the entanglement cannot be distilled.

Remark. The hierarchy for bipartite states is $\operatorname{SepD} \subsetneq \operatorname{PPT} \subsetneq D(\mathcal{X}_A\otimes\mathcal{X}_B)$, with all inclusions strict in dimension $\geq 2\times 3$. Whether all non-PPT (NPT) states satisfy $E_d > 0$ remains one of the most important open problems in quantum information theory.

The concept of activation of bound entanglement (Brandão-Plenio 2008) shows that bound entangled states, despite having $E_d = 0$, can still serve as a resource when used alongside other entangled states. In particular, a bound entangled state tensored with a small amount of free entanglement can sometimes yield more distillable entanglement than the free entanglement alone — “activating” the latent entanglement in the bound state. This demonstrates that the bound-free distinction is not a sharp resource boundary, and understanding when and how much bound entanglement can be activated is an active area of research.

The Rains bound $E_d(\rho) \leq R(\rho) = \min_{S:\,(T\otimes I)(S)\geq 0}S(\rho\|S)$ provides an SDP-computable upper bound on distillable entanglement and unifies several earlier results.

Theorem 6.13 (Rains bound as an SDP). The Rains bound can be rewritten as \[R(\rho) = \min_{S \geq 0,\,(T\otimes I)(S)\geq 0} S(\rho\|S) = \min_{S \geq 0,\,(T\otimes I)(S)\geq 0}\bigl[\operatorname{Tr}(S) - S(\rho) - \operatorname{Tr}[\rho(\log S - \log\rho)]\bigr].\] For qubit-qubit states $\rho$ that are Bell-diagonal, $R(\rho)$ can be computed explicitly using the structure of the PPT set in $2\times 2$ systems. The Rains bound is strictly tighter than the logarithmic negativity $E_N(\rho) = \log\|(T\otimes I)(\rho)\|_1$ for some states, but is harder to compute.

Example 6.6 (Logarithmic negativity of the Werner state). For $W_\alpha = (1-\alpha)\tfrac{I}{4}+\alpha|\beta_{00}\rangle\langle\beta_{00}|$ with $\alpha > 1/3$ (entangled), the partial transpose is \[(T\otimes I)(W_\alpha) = (1-\alpha)\tfrac{I}{4} + \alpha(T\otimes I)(|\beta_{00}\rangle\langle\beta_{00}|).\] The partial transpose of the Bell state has eigenvalues $+1/2,+1/2,+1/2,-1/2$. So $(T\otimes I)(W_\alpha)$ has eigenvalues $\tfrac{1-\alpha}{4}+\tfrac{\alpha}{2} = \tfrac{1+\alpha}{4}$ (thrice) and $\tfrac{1-\alpha}{4}-\tfrac{\alpha}{2} = \tfrac{1-3\alpha}{4}$. For $\alpha > 1/3$ the last eigenvalue is negative. The trace norm is $\|(T\otimes I)(W_\alpha)\|_1 = 3\cdot\tfrac{1+\alpha}{4} + |\tfrac{1-3\alpha}{4}| = \tfrac{3+3\alpha}{4}+\tfrac{3\alpha-1}{4} = \tfrac{2+6\alpha}{4} = \tfrac{1+3\alpha}{2}$ for $\alpha > 1/3$. Thus the logarithmic negativity is \[E_N(W_\alpha) = \log\tfrac{1+3\alpha}{2}.\] For $\alpha = 1$ (Bell state): $E_N = \log 2 = 1$ ebit. For $\alpha = 1/3$ (separability threshold): $E_N = \log 1 = 0$. The logarithmic negativity continuously interpolates between 0 (separable) and 1 ebit (maximally entangled) for Werner states.

6.9 Monogamy of Entanglement

One of the most counterintuitive features of quantum entanglement is that it is monogamous: if Alice and Bob share a maximally entangled state, then Alice cannot be entangled with any third party Charlie. This contrasts sharply with classical correlations, which can be shared freely among many parties.

Theorem 6.14 (Coffman-Kundu-Wootters monogamy inequality, 2000). For any three-qubit pure state $|\psi\rangle^{ABC}$, \[\tau(A:B) + \tau(A:C) \leq \tau(A:BC),\] where $\tau(\rho) = C(\rho)^2$ is the **tangle** (square of the concurrence $C$). The concurrence of a two-qubit state $\rho$ is $C(\rho) = \max(0, \lambda_1 - \lambda_2 - \lambda_3 - \lambda_4)$ where $\lambda_1\geq\lambda_2\geq\lambda_3\geq\lambda_4$ are the square roots of the eigenvalues of $\rho(Y\otimes Y)\bar\rho(Y\otimes Y)$.

The CKW inequality is the quantitative version of monogamy: Alice’s entanglement with Bob plus Alice’s entanglement with Charlie cannot exceed Alice’s entanglement with the joint system $BC$. The slack $\tau(A:BC) - \tau(A:B) - \tau(A:C)$ is called the residual tangle or 3-tangle, a genuine tripartite entanglement measure.

Example 6.7 (W state violates monogamy of concurrence? No.). The W state $|W\rangle = \tfrac{1}{\sqrt{3}}(|100\rangle+|010\rangle+|001\rangle)$ has fully symmetric one-qubit reduced states. For any bipartition $A:B$, $\tau(A:B) = 4/9$. The right-hand side: $\tau(A:BC)$ with $|W\rangle^{ABC}$ pure is $\tau = 4S(\rho^A)(1-S(\rho^A))$ by the concurrence-entropy relation for pure states? Actually for pure states: $C(|\psi\rangle^{AB})^2 = 2(1-\operatorname{Tr}((\rho^A)^2))$. So for $A:BC$: $\tau(A:BC) = 2(1-\operatorname{Tr}((\rho^A)^2))$. We have $\rho^A = \operatorname{Tr}_{BC}|W\rangle\langle W| = \tfrac{1}{3}|0\rangle\langle 0|\cdot 2 + \tfrac{1}{3}|1\rangle\langle 1|\cdot 1$ — wait, $\rho^A = \tfrac{2}{3}|0\rangle\langle 0| + \tfrac{1}{3}|1\rangle\langle 1|$. Then $\operatorname{Tr}((\rho^A)^2) = 4/9+1/9 = 5/9$, so $\tau(A:BC) = 2(1-5/9) = 8/9$. CKW: $8/9 \geq 4/9 + 4/9 = 8/9$. The inequality holds with equality — all the entanglement in $|W\rangle$ is bipartite; there is zero residual tangle (no genuine tripartite entanglement in the GHZ sense for the W class).

Remark (General monogamy from squashed entanglement). The squashed entanglement is monogamous for all dimensions: for any $n$-partite state, $E_{\text{sq}}(A:B_1\cdots B_n) \geq \sum_i E_{\text{sq}}(A:B_i)$. This follows from the definition as a conditional mutual information and the chain rule for mutual information: $I(A:B_1\cdots B_n|E) \geq \sum_i I(A:B_i|B_1\cdots B_{i-1}E) \geq \sum_i I(A:B_i|E)$, where the last step uses SSA. This is the most general monogamy result known for a computable entanglement measure.

Chapter 7: Quantum Channel Capacities

The capacity theory for quantum channels is the quantum analogue of Shannon’s channel coding theorems, but far richer due to the multiplicity of resources (classical bits, qubits, entanglement) that can be transmitted and consumed.

7.1 Classical Capacity

Definition 7.1 (Classical capacity). The classical capacity $C(\Phi)$ of a quantum channel $\Phi$ is the maximum rate of reliable classical information transmission using $n$ independent uses of $\Phi$, in the limit $n\to\infty$.

Theorem 7.1 (Holevo-Schumacher-Westmoreland, 1997). $C(\Phi) = \lim_{n\to\infty}\tfrac{1}{n}\chi(\Phi^{\otimes n})$, where the Holevo information $\chi(\Phi)$ is \[\chi(\Phi) = \max_{\{p_k,\rho_k\}}\chi(\Phi(\mathcal{E})) = \max_{\{p_k,\rho_k\}}\Bigl[S\!\Bigl(\Phi\!\Bigl(\sum_k p_k\rho_k\Bigr)\Bigr) - \sum_k p_k S(\Phi(\rho_k))\Bigr].\]

The regularisation over $n$ in the HSW theorem is necessary in general (Hastings 2008 proved that $\chi$ is not additive), which makes $C(\Phi)$ uncomputable. However, for degradable channels (channels $\Phi$ for which there exists a map $\mathcal{N}$ with $\Phi_E = \mathcal{N}\circ\Phi$, where $\Phi_E$ is the complementary channel), the Holevo information is additive and $C = \chi(\Phi)$.

The relationship between the HSW capacity and the classical capacity of the quantum measurement channel is important. If one is restricted to product input states (no entanglement between uses), the maximum rate is $\chi^{(1)}(\Phi) = \max_{\{p_k,\rho_k\}}\chi(\Phi(\mathcal{E}))$. The full classical capacity may be larger because entangled codewords across multiple uses can increase the Holevo information. Concretely: for the qubit depolarising channel $\Omega_p$, entangled codewords provide no advantage ($\chi$ is additive) and $C(\Omega_p) = 1 - h(p) - p\log 3$. For the qubit erasure channel, no entanglement advantage either. The Hastings channel requires a specific carefully constructed channel where entanglement increases the Holevo information by a constant factor.

Example 7.1 (Classical capacity of the qubit dephasing channel). The dephasing channel $\Delta_p(\rho) = (1-p)\rho + p Z\rho Z$ has a diagonal Choi matrix. For a qubit input $\rho = \tfrac{1}{2}(I + r_x X + r_y Y + r_z Z)$: \[\Delta_p(\rho) = \tfrac{1}{2}(I + (1-2p)(r_x X + r_y Y) + r_z Z).\] The dephasing channel contracts $r_x$ and $r_y$ by factor $(1-2p)$ but leaves $r_z$ intact. The optimal encoding uses the computational basis states $\{\rho_0 = |0\rangle\langle 0|, \rho_1 = |1\rangle\langle 1|\}$ with equal priors. Then $\bar\rho = I/2$, $S(\Delta_p(I/2)) = 1$, and $S(\Delta_p(|0\rangle\langle 0|)) = S(\Delta_p(|1\rangle\langle 1|)) = 0$ (the computational basis is preserved). So $\chi(\Delta_p) = 1$ bit regardless of $p$! The dephasing channel has full classical capacity — it only destroys off-diagonal quantum coherence, not the classical information in the diagonal.

7.2 Quantum Capacity and the LSD Theorem

Definition 7.2 (Quantum capacity). The quantum capacity $Q(\Phi)$ is the maximum rate of reliable qubit transmission through $\Phi$, measured in qubits per channel use.

Theorem 7.2 (Lloyd-Shor-Devetak theorem, 1997-2005). $Q(\Phi) = \lim_{n\to\infty}\tfrac{1}{n}\max_\rho S(A|E)_{\rho^{AE}}$, where $\rho^{AE}$ is the state of the input $A$ and the environment $E$ after the Stinespring dilation of $\Phi$.

The single-letter lower bound is the coherent information $I_c(\rho,\Phi) = S(\Phi(\rho)) - S_e(\rho,\Phi)$, where $S_e$ is the entropy exchange (equal to $S(AE)_{\rho^{AE}}$). SSA implies $I_c \leq S(\Phi(\rho)) \leq \log\dim\mathcal{Y}$, and the LSD theorem says $Q \geq \max_\rho I_c(\rho,\Phi)$, with equality for degradable channels.

Example 7.2 (Quantum capacity of the erasure channel). The erasure channel $\Phi_p(\rho) = (1-p)\rho\otimes|0\rangle\langle 0| + p\tfrac{I}{d}\otimes|e\rangle\langle e|$ (where $|e\rangle$ is an erasure flag) erases the qubit with probability $p$. The coherent information is $I_c = \max_\rho[(1-p)S(\rho) - 0] = (1-p)\log d$. Since the erasure channel is degradable, $Q = (1-p)\log d$. For a qubit: $Q = (1-p)$ qubits per use. For $p \geq 1/2$, $Q = 0$ but $E_d > 0$ — entanglement-assisted communication can recover some capacity.

7.3 Entanglement-Assisted Capacity

Theorem 7.3 (Bennett-Shor-Smolin-Thapliyal 2002). The entanglement-assisted classical capacity of a quantum channel $\Phi$ is \[C_E(\Phi) = \max_\rho S(\rho) + S(\Phi(\rho)) - S_e(\rho,\Phi) = \max_\rho I(A:B)_\rho,\] where $I(A:B) = S(A) + S(B) - S(AB)$ is the quantum mutual information, and the maximisation is over input states. This is the fully quantum version of Shannon's noisy channel theorem. Unlike $C(\Phi)$ and $Q(\Phi)$, $C_E$ is always given by a single-letter formula with no regularisation needed.

Remark (Resource inequalities). The four capacities are related by resource inequalities involving ebits ($E$), qubits ($Q$), and classical bits ($C$):

Teleportation: $1Q + 1E \geq 2C$
Superdense coding: $1Q + 1E \leq 2C$ (using entanglement to double classical capacity)
Quantum capacity lower bound: $C_E \geq C + Q$
All these equalities hold for specific channels (e.g., identity channel: $C = Q = 1$, $C_E = 2$)

The framework of resource inequalities, pioneered by Devetak-Harrow-Winter, gives a unified treatment of quantum Shannon theory.

Example 7.3 (Entanglement-assisted capacity of the depolarising channel). For the qubit depolarising channel $\Omega_p(\rho) = (1-p)\rho + p I/2$, the input-output mutual information is \[I(A:B)_\rho = S(\Omega_p(\rho)) - S_e(\rho,\Omega_p) = S(\Omega_p(\rho)) - S(\Phi_E(\rho)),\] where $\Phi_E$ is the complementary channel. For the maximally mixed input $\rho = I/2$: $S(\Omega_p(I/2)) = 1$ (maximally mixed output since $\Omega_p(I/2) = I/2$). The environment after the Stinespring dilation of the depolarising channel has $S_e(I/2, \Omega_p) = h(p) + (1-p)\log 3$ (the entropy of the error syndrome). So $I(A:B) = 1 - h(p) - (1-p)\log 3$, which for $p=0$ gives 2 bits (pure channel: $C_E = 2$), and for $p=3/4$ gives 0 (completely depolarising). The entanglement-assisted capacity $C_E = \max_\rho I(A:B)_\rho$ equals $2 - 2h(p/2)$ by a more careful optimisation.

7.4 Private Classical Capacity

Classical information can be transmitted through a quantum channel not only reliably, but also privately — so that an eavesdropper (who holds the channel environment) obtains negligible information. This is the quantum-mechanical basis for quantum key distribution.

Definition 7.3 (Private classical capacity). The private classical capacity $P(\Phi)$ is the maximum rate of reliable and private classical information transmission, where "private" means that the environment's state $\hat\Phi(\rho)$ carries negligible mutual information about the message.

Theorem 7.4 (Devetak 2005). The private classical capacity is \[P(\Phi) = \lim_{n\to\infty}\frac{1}{n}\max_{\{p_x,\rho_x\}}\Bigl[\chi(\Phi^{\otimes n}(\{p_x,\rho_x\})) - \chi(\hat\Phi^{\otimes n}(\{p_x,\rho_x\}))\Bigr],\] where $\chi$ is the Holevo information and $\hat\Phi$ is the complementary channel. The single-letter lower bound is the **coherent information**, showing that $P(\Phi) \geq Q(\Phi)$.

The inequality $P \geq Q$ has a beautiful physical interpretation: any rate at which quantum information can be reliably transmitted implies the same rate of private classical communication (encode the classical message in a qubit state). The deeper result is that $P = Q$ for many natural channels — the quantum capacity of transmitting qubits is exactly the same as the private classical capacity. This connection was made precise by the Devetak-Shor (2005) equivalence: private codes can be converted to quantum codes and vice versa.

Example 7.4 (Private capacity of the erasure channel). For the erasure channel $\Phi_p$ with erasure probability $p$, the complementary channel $\hat\Phi_p$ is the erasure channel with probability $1-p$ (the eavesdropper gets the qubit with probability $p$ and the erasure flag with probability $1-p$). The private capacity is \[P(\Phi_p) = \max_\rho[\chi(\Phi_p(\rho)) - \chi(\hat\Phi_p(\rho))] = (1-p) - p\cdot 0 - [(1-p)\cdot 0 + p\cdot 1] = \max(1-2p, 0).\] For $p < 1/2$: private capacity equals quantum capacity $Q = 1-2p$. For $p \geq 1/2$: $P = Q = 0$. The erasure channel achieves perfect privacy when $p = 0$ (identity channel: eavesdropper gets nothing) and zero privacy for $p > 1/2$ (eavesdropper has more information than the receiver).

Remark (Quantum Shannon theory hierarchy). For any quantum channel $\Phi$, the capacities satisfy the hierarchy: \[Q(\Phi) \leq P(\Phi) \leq C(\Phi) \leq C_E(\Phi).\] The inequalities are in general strict. The channel may have $C > 0$ but $Q = 0$ (e.g., a classical replacement channel), or $P < C$ (e.g., a channel with entanglement-breaking noise structure). The equality $P = Q$ holds for degradable channels, and $C = C_E/2$ holds when the channel is the identity (saturating the teleportation/superdense coding balance). Understanding all gaps between these capacities is an active research area.

7.5 Quantum Key Distribution and the BB84 Protocol

Quantum key distribution (QKD) is the application of quantum information theory to cryptography. It allows two parties to establish a shared secret key whose security is guaranteed by the laws of physics, not computational hardness assumptions.

Definition 7.4 (Quantum key distribution). A QKD protocol is an interactive protocol between Alice (sender) and Bob (receiver) over a quantum channel and an authenticated classical channel, such that at the end, they share a common key $K$ with:

Correctness: Alice and Bob's keys agree with high probability.
Security: The eavesdropper Eve has negligible mutual information about $K$, even given control over the quantum channel.

Example 7.5 (BB84 protocol, Bennett-Brassard 1984). Alice and Bob communicate over a noisy quantum channel with Eve potentially intercepting.

Preparation: Alice chooses random bits $b_i\in\{0,1\}$ and random bases $\theta_i\in\{Z,X\}$. She sends qubits: if $\theta_i = Z$: send $|b_i\rangle$; if $\theta_i = X$: send $|b_i\rangle_X = H|b_i\rangle$ where $H$ is the Hadamard gate.
Measurement: Bob measures each qubit in a random basis $\theta'_i\in\{Z,X\}$.
Sifting: Alice and Bob publicly announce their basis choices $\theta_i, \theta'_i$ and keep only those bits where $\theta_i = \theta'_i$ (half the bits on average). These form the sifted key.
Error estimation: They sacrifice a random fraction of the sifted key to estimate the quantum bit error rate (QBER). If QBER exceeds a threshold $\sim 11\%$, they abort.
Classical post-processing: Error correction (information reconciliation) and privacy amplification produce the final key.

\[\ell \geq n(1 - h(e_Z) - h(e_X)),\]

where $\ell$ is the secure key length, $n$ is the sifted key length, $e_Z$ is the bit error rate, and $e_X$ is the phase error rate. This formula connects directly to the quantum capacity: the key rate equals the coherent information $I_c$ of the depolarising channel with the measured error rates.

Chapter 8: Quantum Error Correction

Quantum error correction (QEC) is the theory of protecting quantum information against noise. The key insight — that quantum information can be encoded in larger Hilbert spaces in ways that allow errors to be detected and corrected without disturbing the encoded information — is the foundation of fault-tolerant quantum computation.

The existence of quantum error correction is perhaps the most surprising fact in quantum information theory: despite the no-cloning theorem (Theorem 2.4) and the no-broadcasting theorem for mixed states, it is still possible to protect quantum information against a restricted class of errors. The resolution of this apparent contradiction is that QEC does not copy the quantum information — it spreads it across many qubits in a highly entangled way, so that no single qubit carries enough information for the no-cloning argument to apply. The Knill-Laflamme conditions (Definition 8.2) make this precise: the error must leave no information about the logical codeword in the error syndrome.

The entropy connection is deep: correcting errors is dual to erasing entropy, and the channel capacity framework of Chapter 7 and the QEC framework of this chapter are two sides of the same coin — the quantum capacity $Q(\Phi)$ equals the maximum code rate for which QEC is possible through $\Phi$.

8.1 The Error Correction Framework

Definition 8.1 (Quantum error correcting code). A $((n,k,d))$ quantum error correcting code encodes $k$ logical qubits into $n$ physical qubits such that any error on up to $d-1$ physical qubits can be detected, and any error on up to $\lfloor(d-1)/2\rfloor$ physical qubits can be corrected. The code space is a $2^k$-dimensional subspace $\mathcal{C} \subseteq (\mathbb{C}^2)^{\otimes n}$. The distance $d$ is the minimum weight of an undetectable error.

Definition 8.2 (Knill-Laflamme conditions). Let $\mathcal{C}$ be a code subspace and $P_\mathcal{C}$ the projector onto $\mathcal{C}$. A set of errors $\{E_j\} \subset L((\mathbb{C}^2)^{\otimes n})$ is correctable if and only if for all $|u\rangle,|v\rangle \in \mathcal{C}$: \[\langle u|E_i^*E_j|v\rangle = c_{ij}\langle u|v\rangle\] for some Hermitian matrix $c = (c_{ij})$. That is, errors map $\mathcal{C}$ to orthogonal subspaces (when $c$ is diagonal) and leave no information about the logical codeword in the error syndrome.

Theorem 8.1 (Knill-Laflamme theorem). For a channel $\Phi$ with Kraus operators $\{E_j\}$, there exists a recovery channel $\mathcal{R}$ such that $\mathcal{R}\circ\Phi\circ\mathcal{E} = \operatorname{id}$ on $D(\mathbb{C}^{2^k})$ (where $\mathcal{E}$ is the encoding) if and only if the Knill-Laflamme conditions hold for $\{E_j\}$.

Example 8.1 (The three-qubit bit-flip code). Encode $\alpha|0\rangle + \beta|1\rangle$ into $\alpha|000\rangle+\beta|111\rangle$. For the bit-flip channel $\Phi(\rho) = (1-p)\rho + pX\rho X$ acting independently on each qubit, the Kraus operators for the $n=3$ channel are products $K_{abc} = X^a\otimes X^b\otimes X^c$ (a single bit-flip on any one qubit). The Knill-Laflamme conditions check: $\langle\bar{0}|X_i^* X_j|\bar{0}\rangle = \langle 000|X_i X_j|000\rangle$ and similarly for $|\bar{1}\rangle = |111\rangle$. If $i = j$: $\langle 000|I|000\rangle = 1$. If $i\neq j$: $\langle 000|X_iX_j|000\rangle = \langle 000|$ (state with two bits flipped)\(\rangle = 0$ — orthogonal code states. The matrix $c = I$ (identity), so single-qubit bit-flip errors are correctable. $\checkmark$

8.2 Entropy and Error Correction

Theorem 8.2 (Quantum Singleton bound). Any $((n,k,d))$ code satisfies $n - k \geq 2(d-1)$. Codes achieving equality are called quantum MDS (maximum distance separable) codes.

Theorem 8.3 (Quantum capacity from error correction). A channel $\Phi$ has positive quantum capacity ($Q(\Phi) > 0$) if and only if there exists a code of positive rate that is recoverable from $\Phi$. The coherent information $I_c(\rho,\Phi) = S(\Phi(\rho)) - S_e(\rho,\Phi)$ is positive if and only if the encoded qubit can survive the channel with positive fidelity.

The connection between quantum error correction and quantum entropy is deep: the coherent information $I_c$ is exactly the rate at which information survives a noisy channel, and the Knill-Laflamme conditions can be rephrased in terms of conditional entropy. Specifically: a code $\mathcal{C}$ is exactly correctable for $\Phi$ if and only if $S(A|B)_\omega = -S(A)_\omega$ where $\omega^{AB}$ is the state after sending half of the maximally entangled state on $\mathcal{C}\otimes\mathcal{C}$ through $\Phi$. This entropy formulation is equivalent to the Knill-Laflamme conditions and connects error correction directly to the negative conditional entropy criterion for entanglement (§5.3). Specifically, $\{E_j\}$ are correctable for $\mathcal{C}$ if and only if the “error syndrome” (which Kraus operator applied) is statistically independent of the logical information in $\mathcal{C}$. This is the quantum analogue of the classical separation between parity check bits and message bits in linear codes.

Definition 8.3 (Stabiliser code). An $((n,k,d))$ stabiliser code is defined by an abelian subgroup $\mathcal{S}$ of the $n$-qubit Pauli group (generated by tensor products of $I,X,Y,Z$) with $|\mathcal{S}| = 2^{n-k}$ and $-I \notin \mathcal{S}$. The code space is $\mathcal{C} = \{|\psi\rangle : g|\psi\rangle = |\psi\rangle \;\forall g\in\mathcal{S}\}$. Error detection: apply each generator of $\mathcal{S}$ as a measurement; if any eigenvalue $-1$ is detected, an error has occurred. The $2^{n-k}$ syndrome bits uniquely identify the error coset.

Example 8.2 (The five-qubit perfect code). The $((5,1,3))$ code (the smallest code that corrects all single-qubit errors) has stabilisers generated by: \[g_1 = XZZXI, \; g_2 = IXZZX, \; g_3 = XIXZZ, \; g_4 = ZXIXZ.\] It encodes 1 logical qubit into 5 physical qubits with distance 3, achieving the quantum Singleton bound $5-1 = 4 = 2(3-1)$. The code is optimal: no $((4,1,3))$ code exists. Its logical operators are $\bar{X} = XXXXX$ and $\bar{Z} = ZZZZZ$ (up to stabilisers).

8.3 CSS Codes and Fault Tolerance

The most practically important class of stabiliser codes are the CSS (Calderbank-Shor-Steane) codes, built from two classical linear codes with a nested structure. Their special form separates bit-flip and phase-flip error correction, making fault-tolerant gate implementation much simpler.

Definition 8.4 (CSS code). Let $C_1$ and $C_2$ be classical binary linear codes with $C_2 \subseteq C_1 \subseteq \mathbb{F}_2^n$, with parity check matrices $H_1$ (for $C_1$) and $H_2$ (for $C_2$). The condition $C_2 \subseteq C_1$ is equivalent to $H_1 H_2^T = 0$. The CSS code $\operatorname{CSS}(C_1, C_2)$ encodes $k = \dim C_1 - \dim C_2$ logical qubits into $n$ physical qubits. The stabiliser generators are:

$X$-type: $X^{h}$ for each row $h$ of $H_2^T$ (equivalently, for each codeword in $C_2^\perp$).
$Z$-type: $Z^{h}$ for each row $h$ of $H_1$ (checks for $C_1^\perp$-errors).

The orthogonality condition $H_1H_2^T = 0$ ensures the $X$- and $Z$-generators commute.

Example 8.3 (The seven-qubit Steane code). Take $C_1 = C_2 = [7,4,3]_2$ (the classical Hamming code). Since the Hamming code is self-dual ($C = C^\perp$), the CSS construction gives a $[[7,1,3]]$ code — the Steane code — encoding 1 logical qubit into 7 physical qubits with distance 3. The $X$-stabilisers are the rows of the parity check matrix $H$ applied as Pauli-$X$, and the $Z$-stabilisers are the same rows applied as Pauli-$Z$. The logical operators are $\bar{X} = XXXXXXX$ and $\bar{Z} = ZZZZZZZ$. The key advantage: transversal logical gates (applying the same physical gate to each qubit independently) implement a non-trivial set of logical gates, including all Clifford gates. This makes CSS codes especially attractive for fault-tolerant quantum computation.

Theorem 8.4 (Fault tolerance threshold theorem, Aharonov-Ben-Or 1997, Knill-Laflamme-Zurek 1998). There exists a threshold error rate $p_{\text{th}} > 0$ such that if the physical error rate per gate satisfies $p < p_{\text{th}}$, then an $n$-qubit logical circuit of depth $d$ can be simulated fault-tolerantly with error $\varepsilon$ using $O(n\,d\,\operatorname{poly}(\log(nd/\varepsilon)))$ physical qubits and gates, at logical error rate $O((p/p_{\text{th}})^{2^L})$ for $L$ levels of concatenation.

The threshold theorem is the foundational result of fault-tolerant quantum computation. It says that if hardware errors are below a threshold (roughly $10^{-3}$ to $10^{-2}$ for modern codes), quantum computations of arbitrary length can be performed reliably by using quantum error correction with only a polylogarithmic overhead. The exponential suppression $(p/p_{\text{th}})^{2^L}$ comes from concatenation: at each level of the code hierarchy, the logical error rate squares, driving it towards zero.

Remark (Surface codes and topological quantum memory). The **surface code** (Kitaev 1997, Dennis et al. 2002) is a 2D CSS code defined on a planar lattice where qubits live on edges and stabilisers are products of $X$ or $Z$ on qubits around each face or vertex. The distance scales as $\sqrt{n}$ for an $n$-qubit patch, achieving a threshold of $\sim 1\%$ under independent depolarising noise. The surface code is the leading architecture for near-term fault-tolerant quantum computers because:

Only nearest-neighbour gates are needed (2D layout is hardware-friendly).
The high threshold ($\sim 1\%$) is achievable with current superconducting qubit technology.
Error syndromes can be measured repeatedly without collapsing the logical qubit.

The surface code encodes 1 logical qubit in $O(d^2)$ physical qubits with distance $d$, requiring a code cycle of $O(d)$ syndrome measurements to achieve logical error rate $\sim (p/p_{\text{th}})^{d/2}$.

8.4 Approximate Quantum Error Correction and the Knill-Laflamme Conditions Revisited

The exact Knill-Laflamme conditions require the error syndrome to carry zero information about the logical codeword. In practice, this is often relaxed to approximate error correction — tolerating a small fidelity loss.

Theorem 8.5 (Approximate Knill-Laflamme conditions, Leung-Nielsen-Chuang-Yamamoto 1997). A code $\mathcal{C}$ is $\varepsilon$-approximately correctable for a channel $\Phi$ with Kraus operators $\{E_j\}$ if there exists a recovery channel $\mathcal{R}$ with $\|\mathcal{R}\circ\Phi - \operatorname{id}\|_\diamond \leq \varepsilon$. This holds whenever \[\max_{j,k}\left|\langle u|E_j^*E_k|v\rangle - c_{jk}\langle u|v\rangle\right| \leq \frac{\varepsilon}{|\mathcal{C}|}\] for all $|u\rangle,|v\rangle \in \mathcal{C}$. The Petz recovery map $\mathcal{P}_{\sigma,\Phi}$ (with $\sigma = P_\mathcal{C}/\dim\mathcal{C}$ the code-subspace maximally mixed state) provides the canonical approximate recovery.

Example 8.4 (Approximate correction of small-angle rotations). The single-qubit rotation $R_z(\theta) = e^{-i\theta Z/2}$ acts as a near-identity error for small $\theta$. Expanding: $R_z(\theta) = \cos(\theta/2)I - i\sin(\theta/2)Z \approx I - i(\theta/2)Z$ for small $\theta$. The Knill-Laflamme condition for exact $Z$-error correction is $\langle u|Z^*Z|v\rangle = \langle u|v\rangle$ (which holds, so $Z$ errors are exactly correctable). Since $R_z(\theta) = \cos(\theta/2)I + \sin(\theta/2)(-iZ)$, the error operators are $E_0 = \cos(\theta/2)I$ and $E_1 = -i\sin(\theta/2)Z$. For any code that exactly corrects $Z$ errors, the approximate correction fidelity is \[F_e \geq \cos^4(\theta/2) \approx 1 - \theta^2 + O(\theta^4).\] With concatenated QEC, one can reduce this to $\sim (c\theta^2)^{2^L}$ after $L$ levels, exponentially suppressing the small-angle rotation error without needing to correct it exactly at the physical level.

Key Formulas Summary

Central Identities and Inequalities

Linear algebra:

Vec tensor identity: $(A\otimes B)\operatorname{vec}(C) = \operatorname{vec}(ACB^T)$
Trace norm variational formula: $\|A\|_1 = \max_{U\in\mathcal{U}}\operatorname{Re}\operatorname{Tr}(A^*U)$
Schatten duality: $\|A\|_1 = \max\{\langle A,B\rangle : \|B\|_\infty\leq 1\}$

Quantum states and channels:

Choi isomorphism: $\Phi$ is CP iff $J(\Phi)\geq 0$; TP iff $\operatorname{Tr}_\mathcal{Y}(J(\Phi)) = I_\mathcal{X}$
Kraus via Choi: $J(\Phi) = \sum_k\operatorname{vec}(A_k)\operatorname{vec}(A_k)^*$
Heisenberg picture: $\operatorname{Tr}(B\Phi(\rho)) = \operatorname{Tr}(\Phi^*(B)\rho)$

Distance measures:

Trace distance: $\tfrac{1}{2}\|\rho-\sigma\|_1 = \max_{0\leq M\leq I}\operatorname{Tr}(M(\rho-\sigma))$
Fidelity SDP: $F(P,Q) = \sup\{\tfrac{1}{2}\operatorname{Tr}(X)+\tfrac{1}{2}\operatorname{Tr}(X^*) : \begin{bmatrix}P&X\\X^*&Q\end{bmatrix}\geq 0\}$
Uhlmann's theorem: $F(P,Q) = \max_{\text{purifications }|v\rangle\text{ of }Q}|\langle u|v\rangle|$
Alberti's theorem: $F(P,Q)^2 = \inf_{Y>0}\langle P,Y^{-1}\rangle\langle Q,Y\rangle$
Fuchs-van de Graaf: $1-\tfrac{1}{2}\|\rho-\sigma\|_1 \leq F(\rho,\sigma) \leq \sqrt{1-\tfrac{1}{4}\|\rho-\sigma\|_1^2}$

Entropy and information:

Von Neumann entropy: $S(\rho) = -\operatorname{Tr}(\rho\log\rho)$
Subadditivity: $S(AB)\leq S(A)+S(B)$; Strong subadditivity: $S(XYZ)+S(Z)\leq S(XZ)+S(YZ)$
Araki-Lieb: $|S(A)-S(B)|\leq S(AB)$
Klein's inequality: $S(\rho\|\sigma) = \operatorname{Tr}[\rho(\log\rho-\log\sigma)]\geq 0$
Data processing: $S(\Phi(\rho)\|\Phi(\sigma))\leq S(\rho\|\sigma)$
Holevo information: $\chi(\mathcal{E}) = S(\sum_xp_x\rho_x) - \sum_xp_xS(\rho_x)\geq I_{\text{acc}}(\mathcal{E})$

SDP duality:

Weak duality: primal $d\leq\beta$ dual always holds
Slater's condition: strict feasibility (primal or dual) closes the gap and guarantees attainment
Complementary slackness: at optimality with $d=\beta$, $\langle X,S\rangle = 0$, i.e., $XS=0$
Diamond norm SDP: $\|\Phi\|_\diamond = \max_{\rho\in D(\mathcal{X}\otimes\mathcal{X})}\|(\Phi\otimes I)(\rho)\|_1$

Channel capacities:

Classical capacity (HSW): $C = \lim_n\tfrac{1}{n}\chi(\Phi^{\otimes n})$
Quantum capacity (LSD): $Q = \lim_n\tfrac{1}{n}\max_\rho I_c(\rho,\Phi^{\otimes n})$
Entanglement-assisted: $C_E = \max_\rho I(A:B)_\rho$ (single-letter, no regularisation)
Hierarchy: $Q\leq P\leq C\leq C_E$

Entanglement:

Separability: $\rho\in\operatorname{Sep}$ iff $(\Phi\otimes I)(\rho)\geq 0$ for all positive maps $\Phi$ (Horodecki)
PPT is necessary for separability; sufficient in $2\otimes 2$ and $2\otimes 3$
Nielsen's theorem: LOCC maps $|x\rangle\to|y\rangle$ iff $\lambda(x)\prec\lambda(y)$ (majorisation)
Monogamy (CKW): $\tau(A:B)+\tau(A:C)\leq\tau(A:BC)$ for three-qubit states

Quantum error correction:

Knill-Laflamme: errors $\{E_j\}$ correctable iff $\langle u|E_i^*E_j|v\rangle = c_{ij}\langle u|v\rangle$ for all $|u\rangle,|v\rangle\in\mathcal{C}$
Quantum Singleton bound: $n-k\geq 2(d-1)$
Fault-tolerance threshold: below $p_{\text{th}}\approx 10^{-2}$, logical error rate $\to 0$ exponentially with concatenation depth