ECE 417: Image Processing
Oleg Michailovich
Estimated study time: 1 hr 15 min
Table of contents
These notes cover the full scope of ECE 417 as taught at the University of Waterloo in Winter 2026. They are designed to be self-contained, rigorous, and derivation-focused, integrating classical image processing theory with modern data-driven methods.
Sources and References
The following publicly available references informed the content of these notes:
- Gonzalez, R. C. and Woods, R. E. Digital Image Processing, 4th edition. Pearson, 2018.
- Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition. Springer, 2022. Open access at szeliski.org/Book/.
- Prince, S. J. D. Computer Vision: Models, Learning, and Inference. Cambridge University Press, 2012. Open access at computervisionmodels.com.
- MIT OpenCourseWare, 6.869 Advances in Computer Vision. Open course materials.
- Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. “Image quality assessment: from error visibility to structural similarity.” IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- Lowe, D. G. “Distinctive image features from scale-invariant keypoints.” International Journal of Computer Vision, 60(2):91–110, 2004.
- He, K., Zhang, X., Ren, S., and Sun, J. “Deep residual learning for image recognition.” CVPR, 2016.
- Ronneberger, O., Fischer, P., and Brox, T. “U-Net: Convolutional networks for biomedical image segmentation.” MICCAI, 2015.
- Dalal, N. and Triggs, B. “Histograms of oriented gradients for human detection.” CVPR, 2005.
- Rudin, L., Osher, S., and Fatemi, E. “Nonlinear total variation based noise removal algorithms.” Physica D, 60(1–4):259–268, 1992.
Chapter 1: Image Formation and Digital Representation
1.1 What Is a Digital Image?
A digital image is a discrete, finite-dimensional representation of a continuous visual signal. To build a rigorous framework, we begin with the continuous model and trace exactly how discretization introduces its own algebra and constraints.
A continuous grayscale image is a function \( f : \mathbb{R}^2 \to \mathbb{R} \), where \( f(x, y) \) represents the intensity (luminance, radiance, or some photometric quantity) at spatial coordinates \( (x, y) \). For a colour image, the codomain is extended to \( \mathbb{R}^3 \) or some other multi-channel space. Physical images have two fundamental constraints: they are bounded in spatial extent and in amplitude. Hence in practice we work with \( f : [0, W] \times [0, H] \to [0, L_{\max}] \).
The process of converting a continuous signal to this discrete array involves two independent operations: sampling (discretizing spatial coordinates) and quantization (discretizing intensity values).
1.2 Sampling and the Nyquist Criterion
Spatial sampling replaces the continuous domain with a lattice. For a rectangular lattice with horizontal spacing \( \Delta x \) and vertical spacing \( \Delta y \), the sampled image is:
\[ f_s[m, n] = f(m \Delta x,\; n \Delta y) \]The question of whether \( f_s \) faithfully represents \( f \) is answered by the two-dimensional Nyquist–Shannon sampling theorem. The 2D Fourier transform of \( f \) is:
\[ F(u, v) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, e^{-j 2\pi (ux + vy)}\, dx\, dy \]The spatial bandwidth of \( f \) is the region in the frequency plane \((u, v)\) outside which \( F(u, v) = 0 \). If \( f \) is band-limited with maximum horizontal frequency \( u_{\max} \) and maximum vertical frequency \( v_{\max} \), then perfect reconstruction from samples is possible if and only if:
\[ \frac{1}{\Delta x} \geq 2u_{\max}, \qquad \frac{1}{\Delta y} \geq 2v_{\max} \]Violation of these conditions causes aliasing — high-frequency energy folds back into the baseband and masquerades as lower frequencies. Anti-aliasing filters (low-pass filters applied before sampling) are therefore a fundamental component of every image acquisition pipeline.
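To make the folding concrete, the following 1D sketch (illustrative frequencies, not from the text) samples a sinusoid below the Nyquist rate and verifies that the samples are indistinguishable from those of a lower-frequency alias:

```python
import numpy as np

# Sample a u0 = 6 cycles/unit sinusoid at fs = 8 samples/unit.
# fs < 2*u0, so the Nyquist criterion is violated.
u0, fs = 6.0, 8.0
t = np.arange(64) / fs
aliased = np.cos(2 * np.pi * u0 * t)

# The folded (alias) frequency is |u0 - fs| = 2 cycles/unit:
# the samples coincide exactly with those of a 2-cycle sinusoid.
expected = np.cos(2 * np.pi * 2.0 * t)
assert np.allclose(aliased, expected)
```

No post-hoc processing can distinguish the two signals, which is why the anti-aliasing filter must act before sampling.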
1.3 Quantization
Given a sampled continuous-valued image \( f_s[m, n] \in [f_{\min}, f_{\max}] \), quantization maps each real-valued sample to one of \( L \) discrete levels. Uniform quantization with step size \( \Delta = (f_{\max} - f_{\min}) / L \) maps:
\[ f[m, n] = \left\lfloor \frac{f_s[m, n] - f_{\min}}{\Delta} \right\rfloor \]The quantization error \( e[m, n] = f_s[m, n] - f[m, n] \cdot \Delta - f_{\min} \) lies in \( [0, \Delta) \). Modelling the error as uniformly distributed over one quantization step (variance \( \Delta^2 / 12 \)), the peak signal-to-quantization-noise ratio of a \( b \)-bit image is approximately:
\[ \text{PSNR} \approx 6.02 \cdot b + 1.76 \; \text{dB} \]where \( b \) is the number of bits per pixel. Each additional bit of quantization precision buys roughly 6 dB of SNR. With only 4 bits (16 levels), images exhibit visible false contouring — smooth gradients appear banded because adjacent quantization levels differ noticeably.
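A minimal pure-Python sketch of the floor quantizer above (the 8-bit depth and \([0,1]\) range are illustrative choices), verifying the stated error bound empirically:

```python
import random

def quantize(x, b, fmin=0.0, fmax=1.0):
    """Uniform floor quantizer with L = 2**b levels over [fmin, fmax]."""
    L = 2 ** b
    delta = (fmax - fmin) / L
    k = min(int((x - fmin) / delta), L - 1)   # clamp x = fmax into the top bin
    return fmin + k * delta                   # reconstructed value

random.seed(0)
b = 8
delta = 1.0 / 2 ** b
errors = [x - quantize(x, b) for x in (random.random() for _ in range(10_000))]

# Floor quantization error lies in [0, delta), as in the text
assert all(0.0 <= e < delta for e in errors)
# Its mean-square value is delta**2 / 3 (error uniform over one step)
mse = sum(e * e for e in errors) / len(errors)
assert abs(mse - delta**2 / 3) < 0.1 * delta**2
```

Subtracting \( \Delta/2 \) at reconstruction (mid-tread rounding) centres the error on zero and reduces its mean-square value to \( \Delta^2/12 \), the figure behind the 6 dB-per-bit rule.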
1.4 Colour Models
1.4.1 The RGB Model
The RGB model is additive: colours are formed by summing contributions from red (R), green (G), and blue (B) primaries. It is device-dependent — the same RGB triplet renders differently on monitors with different phosphor or LED characteristics. The sRGB standard fixes a specific set of primaries and a viewing environment, enabling interchange.
1.4.2 The HSV and HSL Models
HSV (Hue, Saturation, Value) reinterprets an RGB cube as a hexagonal cone. Hue \( H \in [0^\circ, 360^\circ) \) encodes the dominant wavelength, Saturation \( S \in [0, 1] \) encodes colour purity, and Value \( V \in [0, 1] \) encodes brightness. The conversion from RGB (normalized to \([0,1]\)) is:
\[ V = \max(R, G, B), \qquad S = \begin{cases} \frac{V - \min(R, G, B)}{V} & \text{if } V \neq 0 \\ 0 & \text{otherwise} \end{cases} \]and \( H \) is determined by which channel achieves the maximum. HSV is more intuitive for image editing because artists think in terms of hue shifts and saturation adjustments rather than primary-channel values.
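The hexcone conversion can be written out directly; the sketch below follows the max-channel rule for \( H \) and cross-checks against the standard library's `colorsys` (which reports hue on \([0, 1)\) rather than in degrees):

```python
import colorsys

def rgb_to_hsv(r, g, b):
    """RGB in [0,1] -> (H in degrees, S, V), following the hexcone formulas."""
    v = max(r, g, b)
    c = v - min(r, g, b)             # chroma
    s = 0.0 if v == 0 else c / v
    if c == 0:
        h = 0.0                      # hue is undefined for greys; 0 by convention
    elif v == r:
        h = 60.0 * (((g - b) / c) % 6)
    elif v == g:
        h = 60.0 * ((b - r) / c + 2)
    else:
        h = 60.0 * ((r - g) / c + 4)
    return h, s, v

# Pure red: hue 0 degrees, fully saturated, full value
assert rgb_to_hsv(1.0, 0.0, 0.0) == (0.0, 1.0, 1.0)
# Agrees with the standard library
h, s, v = rgb_to_hsv(0.2, 0.6, 0.4)
assert abs(h / 360.0 - colorsys.rgb_to_hsv(0.2, 0.6, 0.4)[0]) < 1e-9
```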
1.4.3 The YCbCr Model
YCbCr separates luminance (Y) from chrominance (Cb, Cr). It is the standard representation for video compression (JPEG, MPEG, H.264). The linear transform from RGB (full-range, \([0,255]\)) is:
\[ \begin{pmatrix} Y \\ Cb \\ Cr \end{pmatrix} = \begin{pmatrix} 16 \\ 128 \\ 128 \end{pmatrix} + \begin{pmatrix} 65.481 & 128.553 & 24.966 \\ -37.797 & -74.203 & 112.0 \\ 112.0 & -93.786 & -18.214 \end{pmatrix} \begin{pmatrix} R/255 \\ G/255 \\ B/255 \end{pmatrix} \]The human visual system has lower spatial acuity for colour (chrominance) than for brightness (luminance). JPEG exploits this by subsampling Cb and Cr by a factor of 2 in each dimension (4:2:0 format), discarding roughly half the chrominance data before anyone perceives the loss.
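Applying the matrix above is a three-line computation. A quick sanity check: white must map to the studio-range peak \( Y = 235 \) with neutral chroma, since each matrix row (other than the offsets) sums to 219, 0, and 0 respectively:

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range 8-bit RGB -> studio-range YCbCr (BT.601 scaling)."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    y  = 16.0  +  65.481 * r + 128.553 * g +  24.966 * b
    cb = 128.0 -  37.797 * r -  74.203 * g + 112.0   * b
    cr = 128.0 + 112.0   * r -  93.786 * g -  18.214 * b
    return y, cb, cr

# White: Y = 16 + 219 = 235, Cb = Cr = 128 (no chroma)
y, cb, cr = rgb_to_ycbcr(255, 255, 255)
assert abs(y - 235.0) < 1e-9 and abs(cb - 128.0) < 1e-9 and abs(cr - 128.0) < 1e-9
```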
1.4.4 The CIE Lab Model
CIE Lab (or \( L^* a^* b^* \)) is a perceptually uniform colour space standardized by the International Commission on Illumination (CIE). Distances in Lab are approximately proportional to perceived colour differences — a property that neither RGB nor HSV possesses. The \( L^* \) axis runs from 0 (black) to 100 (white); \( a^* \) runs from green (negative) to red (positive); \( b^* \) runs from blue (negative) to yellow (positive). The conversion goes through CIE XYZ tristimulus values via nonlinear cube-root functions that model the compressive lightness response of the human visual system. In image quality assessment, the \( \Delta E_{ab} \) metric:
\[ \Delta E_{ab} = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2} \]measures the perceptual distance between two colours.
Chapter 2: Spatial Domain Image Enhancement
2.1 Point Transforms
A point transform (or intensity transform) applies the same function to every pixel independently:
\[ g[m, n] = T(f[m, n]) \]where \( T : \{0, \ldots, L-1\} \to \{0, \ldots, L-1\} \). Common examples include gamma correction \( T(r) = c \cdot r^\gamma \) (which adjusts brightness for display on screens with nonlinear phosphor response), thresholding, contrast stretching, and negative images \( T(r) = L - 1 - r \).
Piecewise-linear remapping (such as contrast stretching) is itself a special case of the general point transform.
2.2 Histogram Equalization
The image histogram \( h(k) \) counts the number of pixels with intensity \( k \):
\[ h(k) = \#\{(m, n) : f[m, n] = k\}, \quad k = 0, 1, \ldots, L-1 \]The normalized histogram \( p(k) = h(k) / (MN) \) is an estimate of the probability mass function (PMF) of the intensity random variable. Histogram equalization seeks a transform \( T \) such that the output image has a uniform histogram — maximizing contrast by spreading out intensity values.
For continuous images, the equalization transform is the cumulative distribution function (CDF):
\[ T(r) = \int_0^r p_f(t)\, dt \]This maps input intensity \( r \) to output intensity equal to the fraction of pixels darker than \( r \), scaled to \([0, 1]\). For discrete images the analogous formula is:
\[ s_k = T(k) = (L - 1) \sum_{j=0}^{k} p(j) = (L - 1) \cdot P(k) \]where \( P(k) \) is the cumulative histogram. The output levels \( s_k \) are rounded to the nearest integer. The resulting image histogram is approximately uniform (exactly uniform only in the continuous case).
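The discrete formula translates directly into a lookup table. A minimal NumPy sketch (the synthetic low-contrast image is purely illustrative):

```python
import numpy as np

def equalize(img, L=256):
    """Histogram-equalize an integer image with intensities in [0, L-1]."""
    hist = np.bincount(img.ravel(), minlength=L)
    cdf = np.cumsum(hist) / img.size              # cumulative histogram P(k)
    lut = np.round((L - 1) * cdf).astype(img.dtype)  # s_k = (L-1) P(k)
    return lut[img]

rng = np.random.default_rng(0)
# A low-contrast image: intensities squeezed into [100, 140]
img = rng.integers(100, 141, size=(64, 64)).astype(np.uint8)
out = equalize(img)
# Equalization stretches the occupied range toward [0, 255]
assert out.min() < img.min() and out.max() > img.max()
```

Because \( T \) is built once and applied as a table lookup, the cost is \( O(MN + L) \) regardless of image content.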
2.3 Adaptive Histogram Equalization and CLAHE
Global histogram equalization treats the entire image as a single statistical population, which fails when different regions have very different local statistics (e.g., a chest X-ray with both dark lung fields and bright bones). Adaptive Histogram Equalization (AHE) computes a separate CDF for each pixel using a neighbourhood window centred on that pixel, then applies the local transform. This produces excellent local contrast, but:
- It is computationally expensive — one histogram per pixel.
- It excessively amplifies noise in uniform regions where the local histogram is sharply peaked.
CLAHE (Contrast Limited AHE) addresses the second problem by clipping each local histogram at a threshold \( \beta \) before computing the CDF:
\[ \tilde{h}(k) = \min(h(k), \beta), \quad \forall k \]The clipped probability mass is redistributed uniformly across all bins. This limits the maximum slope of the equalization transform, capping local gain at \( \beta / (MN/L) \). CLAHE with a clip limit near 2 is now a standard preprocessing step in medical imaging (fundus photography, MRI scout images, ultrasound enhancement).
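The clip-and-redistribute step can be sketched in a few lines. Note that a single redistribution pass (used here, and common in practice) may push some bins slightly above \( \beta \) again; full CLAHE implementations optionally iterate:

```python
import numpy as np

def clip_histogram(hist, beta):
    """Clip a local histogram at beta and redistribute the excess uniformly."""
    hist = hist.astype(float)
    excess = np.sum(np.maximum(hist - beta, 0.0))  # mass removed by clipping
    clipped = np.minimum(hist, beta)
    return clipped + excess / hist.size            # total mass is preserved

h = np.array([50, 3, 2, 1, 0, 0, 0, 4], dtype=float)
hc = clip_histogram(h, beta=10.0)
assert abs(hc.sum() - h.sum()) < 1e-9              # same pixel count
assert hc.max() <= 10.0 + (50.0 - 10.0) / 8 + 1e-9 # peak strongly limited
```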
2.4 Spatial Filtering
A spatial filter replaces each pixel with a weighted combination of its neighbours. The general operation is discrete 2D convolution:
\[ g[m, n] = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w[s, t]\, f[m - s,\, n - t] \]where \( w \) is the filter kernel (also called mask or template) of size \((2a+1) \times (2b+1)\). The kernel is said to be separable if \( w[s, t] = w_1[s] \cdot w_2[t] \), which reduces the cost from \( k^2 \) to \( 2k \) multiply-accumulate operations per pixel for a \( k \times k \) kernel.
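Separability can be verified numerically: a row pass with \( w_2 \) followed by a column pass with \( w_1 \) must reproduce the full 2D convolution with \( w[s,t] = w_1[s] w_2[t] \). A didactic sketch (the kernels below are illustrative; `conv2d` uses "valid" boundaries for brevity):

```python
import numpy as np

def conv2d(f, w):
    """Direct 'valid' 2D convolution (no padding), per the double-sum formula."""
    kh, kw = w.shape
    H, W = f.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    wf = w[::-1, ::-1]                       # flip kernel: true convolution
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(wf * f[i:i + kh, j:j + kw])
    return out

rng = np.random.default_rng(1)
f = rng.standard_normal((16, 16))
w1 = np.array([1.0, 2.0, 1.0])               # smoothing along one axis
w2 = np.array([1.0, 0.0, -1.0])              # differencing along the other
w = np.outer(w1, w2)                         # separable: w[s,t] = w1[s] * w2[t]

# Row pass then column pass reproduces the full 2D convolution
rows = np.array([np.convolve(row, w2, mode='valid') for row in f])
both = np.array([np.convolve(col, w1, mode='valid') for col in rows.T]).T
assert np.allclose(both, conv2d(f, w))
```

The kernel chosen here is in fact the Sobel-style smoothed derivative, a classic example of a separable filter.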
2.4.1 Linear Smoothing Filters
The box filter (averaging filter) uses \( w[s,t] = \frac{1}{(2a+1)(2b+1)} \) — a uniform weighted average. It is separable and fast but blurs edges severely and leaves ringing artefacts because its frequency response (a 2D sinc function) has substantial sidelobes.
The Gaussian filter uses:
\[ w[s, t] = \frac{1}{2\pi\sigma^2}\, e^{-(s^2 + t^2) / (2\sigma^2)} \]It is the unique filter that is simultaneously separable, rotationally symmetric, and has no negative sidelobes in the frequency domain. Its Fourier transform is itself a Gaussian:
\[ W(u, v) = e^{-2\pi^2 \sigma^2 (u^2 + v^2)} \]This makes the Gaussian a true low-pass filter — it attenuates high frequencies monotonically without ringing. The parameter \( \sigma \) controls the smoothing radius; a good rule of thumb is to use a kernel of radius \( 3\sigma \) (which captures 99.7% of the Gaussian’s mass).
2.4.2 Median Filter
The median filter is nonlinear: it replaces each pixel with the median of the intensities in its neighbourhood. It is extraordinarily effective at removing impulse noise (salt-and-pepper noise), where random pixels are set to 0 or 255. The key property is that median filtering preserves edges better than linear smoothing filters: an edge is a rank discontinuity, and the median across an edge boundary is always one of the two edge levels, not an average of them.
The rank-order filter family generalizes the median: the \( \alpha \)-trimmed mean discards the \( \alpha\%\) lowest and highest values before averaging; erosion and dilation (Chapter 6) use the minimum and maximum, respectively.
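The edge-preservation property is easy to demonstrate: on a two-level image with a step edge, the 3x3 median returns one of the two levels everywhere, so impulses vanish while the edge survives exactly. A small sketch (the test image and impulse positions are illustrative):

```python
import numpy as np

def median_filter(f, radius=1):
    """Median filter with a square window (edge-replicated borders)."""
    fp = np.pad(f, radius, mode='edge')
    H, W = f.shape
    k = 2 * radius + 1
    # Stack the k*k shifted views, take the pixelwise median
    stack = [fp[i:i + H, j:j + W] for i in range(k) for j in range(k)]
    return np.median(stack, axis=0)

f = np.full((16, 16), 100.0)
f[:, 8:] = 200.0                              # vertical step edge
noisy = f.copy()
noisy[3, 3], noisy[10, 12] = 255.0, 0.0       # salt and pepper impulses

out = median_filter(noisy)
assert out[3, 3] == 100.0 and out[10, 12] == 200.0  # impulses removed
assert np.array_equal(out, f)                        # edge fully intact
```

A linear 3x3 average, by contrast, would leave residue of each impulse in nine pixels and blur the step into intermediate grey levels.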
2.4.3 Bilateral Filter
The bilateral filter is a nonlinear, edge-preserving smoothing filter that weights each neighbour by both spatial proximity and intensity similarity:
\[ g[m, n] = \frac{1}{W[m,n]} \sum_{s,t} f[m-s,\, n-t]\; G_{\sigma_s}(s, t)\; G_{\sigma_r}(f[m-s,n-t] - f[m,n]) \]where \( G_{\sigma_s}(s, t) = \exp(-(s^2 + t^2)/(2\sigma_s^2)) \) is the spatial Gaussian, \( G_{\sigma_r}(\cdot) \) is the range Gaussian with parameter \( \sigma_r \), and the normalization factor is:
\[ W[m,n] = \sum_{s,t} G_{\sigma_s}(s, t)\; G_{\sigma_r}(f[m-s,n-t] - f[m,n]) \]The range kernel ensures that pixels across a strong intensity edge contribute negligible weight, so edges are preserved while noise is averaged out within homogeneous regions. The bilateral filter has two limitations: it requires \( O(k^2) \) operations per pixel with no obvious separability, and it may create gradient reversal artefacts (the “staircase” or “halos” effect) when the range parameter \( \sigma_r \) is too large. Extensions such as the guided filter and non-local means address these limitations.
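A brute-force implementation of the two-kernel formula above, as a didactic sketch (the \( \sigma_s, \sigma_r \), window radius, and the noisy-step test image are illustrative choices, not prescriptions):

```python
import numpy as np

def bilateral(f, sigma_s=2.0, sigma_r=0.1, radius=3):
    """Brute-force bilateral filter on a 2D float image (didactic, O(k^2)/pixel)."""
    H, W = f.shape
    g = np.empty_like(f)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys**2 + xs**2) / (2 * sigma_s**2))   # G_{sigma_s}
    fp = np.pad(f, radius, mode='edge')
    for m in range(H):
        for n in range(W):
            patch = fp[m:m + 2 * radius + 1, n:n + 2 * radius + 1]
            range_w = np.exp(-(patch - f[m, n])**2 / (2 * sigma_r**2))
            w = spatial * range_w
            g[m, n] = np.sum(w * patch) / np.sum(w)          # normalized by W[m,n]
    return g

# A noisy step edge: smoothing reduces noise but keeps the step
rng = np.random.default_rng(0)
step = np.hstack([np.zeros((16, 16)), np.ones((16, 16))])
noisy = step + 0.05 * rng.standard_normal(step.shape)
out = bilateral(noisy)
assert np.var(out[:, :12]) < np.var(noisy[:, :12])      # flat side smoother
assert out[:, 20:].mean() - out[:, :12].mean() > 0.9    # edge preserved
```

With an intensity jump of 1 and \( \sigma_r = 0.1 \), the range weight across the edge is \( e^{-50} \approx 0 \): pixels on the far side contribute essentially nothing, which is exactly the mechanism described above.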
Chapter 3: Frequency Domain Processing
3.1 The Two-Dimensional Discrete Fourier Transform
The 2D DFT of an \( M \times N \) image \( f[m, n] \) is:
\[ F[u, v] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f[m, n]\, e^{-j 2\pi \left(\frac{um}{M} + \frac{vn}{N}\right)}, \quad u = 0, \ldots, M-1,\; v = 0, \ldots, N-1 \]The inverse DFT (IDFT) is:
\[ f[m, n] = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F[u, v]\, e^{j 2\pi \left(\frac{um}{M} + \frac{vn}{N}\right)} \]The index \( (u, v) \) represents a spatial frequency: \( u \) cycles per \( M \) pixels in the horizontal direction, \( v \) cycles per \( N \) pixels in the vertical direction. After the common practice of centring the DFT spectrum by multiplying \( f[m,n] \) by \((-1)^{m+n}\), the DC component \( F[0,0] = \sum_{m,n} f[m,n] \) appears at the centre of the displayed spectrum.
Spatial convolution corresponds to pointwise multiplication in the frequency domain. This is the theoretical justification for frequency-domain filtering: multiply the spectrum by a filter transfer function and invert.
3.1.1 Key Properties of the 2D DFT
Periodicity. \( F[u, v] \) is periodic with period \( M \) in \( u \) and period \( N \) in \( v \).
Conjugate symmetry. For a real image, \( F[u, v] = F^*[-u, -v] \), meaning \( |F[u,v]| = |F[-u,-v]| \). The magnitude spectrum is symmetric about the origin.
Shift theorem. A spatial shift \( f[m-m_0, n-n_0] \) introduces a linear phase ramp \( F[u,v] \cdot e^{-j2\pi(um_0/M + vn_0/N)} \). The magnitude spectrum is unchanged; only the phase changes.
Energy (Parseval’s theorem).
\[ \sum_{m,n} |f[m,n]|^2 = \frac{1}{MN} \sum_{u,v} |F[u,v]|^2 \]Total image energy equals total spectral energy (up to a scale factor). In practice, natural images concentrate most energy near DC — the power spectrum falls off roughly as \( 1/(u^2 + v^2) \), a consequence of the approximate scale invariance of natural scene statistics.
Separability. The 2D DFT decomposes into 1D DFTs:
\[ F[u, v] = \sum_{m=0}^{M-1} \left[ \sum_{n=0}^{N-1} f[m, n]\, e^{-j2\pi vn/N} \right] e^{-j2\pi um/M} \]This allows computation by first applying 1D FFTs along all rows, then 1D FFTs along all columns, reducing complexity from \( O(M^2 N^2) \) for direct computation to \( O(MN \log(MN)) \) using the Fast Fourier Transform.
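Both separability and conjugate symmetry can be checked in a few lines with NumPy's FFT routines:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 16))   # a real-valued "image"

# Separability: 1D FFT along every row, then along every column,
# reproduces the full 2D DFT
F_sep = np.fft.fft(np.fft.fft(f, axis=1), axis=0)
assert np.allclose(F_sep, np.fft.fft2(f))

# Conjugate symmetry of a real image: F[u,v] = F*[-u,-v]
# (negative indices wrap modulo M, N — the periodicity property)
F = np.fft.fft2(f)
u, v = 3, 5
assert np.isclose(F[u, v], np.conj(F[-u, -v]))
```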
3.2 Frequency Domain Filters
3.2.1 Ideal Filters
The ideal low-pass filter (ILPF) passes all frequencies within a circle of radius \( D_0 \) (in the centred spectrum) and blocks everything outside:
\[ H_{\text{LP}}[u, v] = \begin{cases} 1 & \text{if } D(u, v) \leq D_0 \\ 0 & \text{otherwise} \end{cases} \]where \( D(u, v) = \sqrt{u^2 + v^2} \) is the distance from DC. Despite its mathematical simplicity, the ILPF produces severe ringing artefacts (Gibbs phenomenon) in the spatial domain. The cause is that the inverse Fourier transform of a hard circular cutoff is a 2D sinc-like function (a Bessel function \( J_1 \)), which has large oscillatory sidelobes.
3.2.2 Butterworth Filters
The Butterworth low-pass filter of order \( n \) smooths the frequency-domain transition:
\[ H_{\text{BW}}[u, v] = \frac{1}{1 + \left( D(u,v) / D_0 \right)^{2n}} \]At \( D = D_0 \), the gain is \( 1/\sqrt{2} \) (the half-power point, \(-3\) dB). As \( n \to \infty \), the Butterworth filter approaches the ideal filter. For small \( n \) (typically \( n = 1 \) or \( n = 2 \)), the transition is gentle and ringing is minimal. The trade-off is that the filter does not fully block high frequencies for finite \( n \).
3.2.3 Gaussian Filters
The Gaussian low-pass filter in the frequency domain is:
\[ H_{\text{Gauss}}[u, v] = e^{-D^2(u,v) / (2\sigma^2)} \]Because the inverse Fourier transform of a Gaussian is a Gaussian (the Gaussian is its own transform), this filter has no ringing whatsoever. It is the standard reference for smooth frequency-domain filtering. The frequency at which the gain falls to one-half is \( D_0 = \sigma\sqrt{2\ln 2} \approx 1.177\sigma \); the half-power (\(-3\) dB) point lies at \( \sigma\sqrt{\ln 2} \approx 0.833\sigma \).
High-pass versions of all three filters are obtained by complementation: \( H_{\text{HP}} = 1 - H_{\text{LP}} \).
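The whole pipeline — centre the spectrum, multiply by a transfer function, invert — fits in a short sketch. Because filtering is linear, the low-pass and complementary high-pass outputs must sum back to the original image exactly:

```python
import numpy as np

def gaussian_lpf(shape, sigma):
    """Centred Gaussian low-pass transfer function H[u,v] = exp(-D^2/(2 sigma^2))."""
    M, N = shape
    u = np.arange(M) - M // 2               # frequencies measured from the centre
    v = np.arange(N) - N // 2
    D2 = u[:, None]**2 + v[None, :]**2
    return np.exp(-D2 / (2 * sigma**2))

def filter_image(f, H):
    """Multiply the centred spectrum by H and invert."""
    F = np.fft.fftshift(np.fft.fft2(f))     # fftshift centres DC, like (-1)^(m+n)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * H)))

rng = np.random.default_rng(0)
f = rng.standard_normal((64, 64))
H_lp = gaussian_lpf(f.shape, sigma=4.0)
g = filter_image(f, H_lp)
# Low-pass filtering removes energy (Parseval), never adds it
assert np.sum(g**2) < np.sum(f**2)
# Complementation: H_HP = 1 - H_LP, so LP output + HP output = f
g_hp = filter_image(f, 1.0 - H_lp)
assert np.allclose(g + g_hp, f)
```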
3.2.4 Notch Filters
Notch filters target specific frequency components — useful for removing periodic noise (e.g., horizontal banding from electronic interference, grid patterns from halftone printing). A notch reject filter has zero gain at a specific frequency pair \( (u_0, v_0) \) and its conjugate \( (-u_0, -v_0) \):
\[ H_{\text{notch}}[u, v] = \prod_{k} \left(1 - H_{\text{LP},k}[u, v]\right) \]where \( H_{\text{LP},k} \) is a small Gaussian or Butterworth centred at each notch frequency. The position of the notch is identified by inspecting the power spectrum \( |F[u,v]|^2 \) for sharp spikes away from DC.
Chapter 4: Image Restoration
4.1 The Degradation Model
Image restoration seeks to undo the damage introduced by a known or estimated degradation process. The standard linear model is:
\[ g[m, n] = (h * f)[m, n] + \eta[m, n] \]where \( f \) is the true (unknown) image, \( h \) is the point spread function (PSF) of the degradation system, \( * \) denotes convolution, and \( \eta \) is additive noise. In the frequency domain:
\[ G[u, v] = H[u, v]\, F[u, v] + N[u, v] \]The degradation transfer function \( H[u, v] \) is called the optical transfer function (OTF); its magnitude is the modulation transfer function (MTF). Common PSF models include:
- Gaussian blur: \( h[m,n] \propto \exp(-(m^2+n^2)/(2\sigma^2)) \) — models defocus and atmospheric turbulence.
- Motion blur: a line PSF in the direction of motion of length \( L \) pixels — \( H[u,v] = \frac{\sin(\pi(uL_x + vL_y))}{\pi(uL_x + vL_y)} e^{-j\pi(uL_x + vL_y)} \) for motion \( (L_x, L_y) \).
- Out-of-focus disk: \( h \) is a uniform disk of radius \( R \).
4.2 Noise Models
The most widely used model is additive white Gaussian noise, \( \eta \sim \mathcal{N}(0, \sigma_\eta^2) \), justified by the central limit theorem applied to the many independent sources of sensor and electronic noise. Other common noise models include:
- Poisson (shot) noise: dominant in low-light imaging and X-ray; variance equals the mean intensity.
- Rayleigh noise: models noise in range images (radar, sonar).
- Uniform noise: models quantization error.
- Impulse (salt-and-pepper) noise: random pixels set to 0 or 255 with probability \( p \); median filtering is the natural remedy.
4.3 Inverse Filtering
The naive inverse filter recovers \( F[u,v] \) by dividing:
\[ \hat{F}[u, v] = \frac{G[u, v]}{H[u, v]} = F[u, v] + \frac{N[u, v]}{H[u, v]} \]The restored image is exact when there is no noise, but catastrophically sensitive to noise in practice: wherever \( |H[u,v]| \) is small (which is everywhere at high frequencies for a blurring PSF), the noise is amplified without bound. To partially mitigate this, one applies a truncated inverse filter that divides only where \( |H| \) exceeds some threshold \( \epsilon \).
4.4 The Wiener Filter
The Wiener filter is the optimal linear estimator of \( F[u,v] \) from \( G[u,v] \) under the mean-squared-error criterion, when the signal and noise power spectra are known. Minimizing:
\[ E\left[|F[u,v] - \hat{F}[u,v]|^2\right] \]over all linear estimators \( \hat{F} = W \cdot G \) yields:
\[ W[u, v] = \frac{H^*[u, v]}{|H[u, v]|^2 + S_{\eta\eta}[u, v] / S_{ff}[u, v]} \]where \( S_{ff}[u, v] = E[|F[u,v]|^2] \) is the power spectral density of the signal and \( S_{\eta\eta}[u, v] = E[|N[u,v]|^2] = \sigma_\eta^2 \) for white noise. The ratio \( K = S_{\eta\eta} / S_{ff} \) is the inverse signal-to-noise ratio. When \( K \to 0 \) (no noise), the Wiener filter approaches the inverse filter. When \( K \) is large (heavy noise), the Wiener filter suppresses the estimate toward zero at all frequencies. In practice, \( K \) is often approximated by a constant \( K = \sigma_\eta^2 / \overline{|F|^2} \).
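A frequency-domain sketch with a constant-\( K \) Wiener filter (the blur kernel, noise level, and \( K \) below are illustrative; the FFT implements circular convolution, which is assumed throughout):

```python
import numpy as np

def wiener_deconv(g, h, K):
    """Frequency-domain Wiener filter with constant noise-to-signal ratio K."""
    G = np.fft.fft2(g)
    H = np.fft.fft2(h, s=g.shape)            # zero-pad the PSF to image size
    W = np.conj(H) / (np.abs(H)**2 + K)      # the Wiener transfer function
    return np.real(np.fft.ifft2(W * G))

rng = np.random.default_rng(0)
f = rng.standard_normal((32, 32))
h = np.outer([0.25, 0.5, 0.25], [0.25, 0.5, 0.25])   # small separable blur
g = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h, s=f.shape)))
g_noisy = g + 0.01 * rng.standard_normal(f.shape)

restored = wiener_deconv(g_noisy, h, K=1e-3)
# The Wiener estimate is closer to f than the degraded observation is
assert np.mean((restored - f)**2) < np.mean((g_noisy - f)**2)
```

Setting `K=0` reproduces the naive inverse filter; with this PSF, whose transfer function vanishes near the Nyquist frequencies, that division amplifies the noise catastrophically, which is precisely the failure mode described in Section 4.3.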
4.5 Regularized Restoration
A more general framework replaces the statistical MSE criterion with a deterministic regularization problem: minimize
\[ J(\hat{f}) = \| g - h * \hat{f} \|^2 + \lambda \| L * \hat{f} \|^2 \]where \( L \) is a high-pass operator (commonly the Laplacian \( \nabla^2 \)) and \( \lambda > 0 \) is the regularization parameter. In the frequency domain, the solution is:
\[ \hat{F}[u, v] = \frac{H^*[u, v]}{|H[u, v]|^2 + \lambda |L[u, v]|^2} G[u, v] \]This is the Tikhonov regularized inverse filter. Choosing \( L = \nabla^2 \) and \( \lambda \propto K \) recovers approximately the Wiener filter. The parameter \( \lambda \) trades off data fidelity against smoothness: large \( \lambda \) produces over-smoothed restorations; small \( \lambda \) lets noise through. Cross-validation, the discrepancy principle (Morozov), or Stein’s unbiased risk estimate (SURE) can be used to select \( \lambda \) automatically.
4.6 Total Variation and Blind Deconvolution
The total variation (TV) regularizer replaces the quadratic smoothness penalty with:
\[ \|\nabla \hat{f}\|_1 = \sum_{m,n} \sqrt{(\hat{f}[m+1,n] - \hat{f}[m,n])^2 + (\hat{f}[m,n+1] - \hat{f}[m,n])^2} \]TV regularization promotes piecewise constant reconstructions — it heavily penalizes diffuse gradient regions (noise) while tolerating sharp jumps (edges). The ROF model (Rudin, Osher, Fatemi, 1992) pioneered TV minimization for image denoising and restoration.
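The penalty's behaviour is easy to see numerically: a clean step edge pays the TV cost of its jump only once per row, while noise pays at every pixel. A sketch of the isotropic forward-difference TV (the test images are illustrative):

```python
import numpy as np

def total_variation(f):
    """Isotropic discrete TV via forward differences, per the formula above."""
    dx = f[1:, :-1] - f[:-1, :-1]            # vertical differences
    dy = f[:-1, 1:] - f[:-1, :-1]            # horizontal differences
    return np.sum(np.sqrt(dx**2 + dy**2))

# A sharp unit step: TV = one unit jump per row
step = np.zeros((32, 32))
step[:, 16:] = 1.0
rng = np.random.default_rng(0)
noisy = step + 0.1 * rng.standard_normal(step.shape)

# TV charges the edge once, but charges every noise wiggle
assert total_variation(step) < total_variation(noisy)
```

This asymmetry — cheap edges, expensive texture/noise — is exactly why TV minimization denoises without blurring boundaries.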
Blind deconvolution addresses the situation where the PSF \( h \) is unknown. This is a bilinear inverse problem: both \( f \) and \( h \) must be estimated simultaneously from \( g \). Without additional constraints, the problem is severely ill-posed. Modern approaches use alternating minimization (alternately estimating \( f \) given \( h \) and vice versa) combined with strong priors on both the image (TV or sparsity) and the PSF (non-negativity, unit sum, small support).
Chapter 5: Image Compression
5.1 Information-Theoretic Foundations
Claude Shannon’s source coding theorem establishes the fundamental limit of lossless compression: the minimum average code length achievable for a memoryless source with PMF \( p(k) \) is the entropy:
\[ H = -\sum_{k} p(k) \log_2 p(k) \quad \text{(bits per symbol)} \]Any uniquely decodable code must have average length \( \bar{L} \geq H \). Huffman coding achieves \( \bar{L} < H + 1 \). Arithmetic coding can approach \( H \) to within 1 bit for the entire message, regardless of symbol probabilities.
5.2 Lossless Compression Methods
Huffman coding constructs a prefix-free code by building a binary tree bottom-up: repeatedly merge the two symbols with the lowest probability into a parent node. The resulting code assigns shorter codewords to more probable symbols. The code is optimal in the sense that no other prefix-free code has a smaller expected length given the symbol probabilities.
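The bottom-up construction can be sketched with a heap; the example probabilities are illustrative. The resulting average length must land in the Shannon bracket \( H \le \bar{L} < H + 1 \):

```python
import heapq
import math

def huffman_code(probs):
    """Build a Huffman prefix code; returns {symbol_index: bitstring}."""
    code = {i: '' for i in range(len(probs))}
    heap = [(p, n, [n]) for n, p in enumerate(probs)]  # n breaks probability ties
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)   # two least-probable subtrees
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1:
            code[s] = '0' + code[s]          # prepend a bit as the tree grows up
        for s in syms2:
            code[s] = '1' + code[s]
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return code

probs = [0.4, 0.3, 0.2, 0.1]
code = huffman_code(probs)
avg_len = sum(p * len(code[i]) for i, p in enumerate(probs))
entropy = -sum(p * math.log2(p) for p in probs)
assert entropy <= avg_len < entropy + 1      # Shannon's bracket
```

For these probabilities \( H \approx 1.846 \) bits/symbol and the Huffman code achieves \( \bar{L} = 1.9 \): close to, but necessarily above, the entropy.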
Arithmetic coding represents the entire message as a single real number in \([0, 1)\). Each successive symbol narrows the current interval by a factor equal to the symbol’s probability, and the final interval is encoded as the shortest binary fraction falling within it. This avoids the integer-length constraint of Huffman codes and achieves average code lengths closer to entropy.
LZW (Lempel–Ziv–Welch) is a dictionary-based method used in GIF and early TIFF: it dynamically builds a table of patterns seen so far and encodes each parsed phrase (the longest match in the table) as an index into the table. It requires no prior knowledge of the source statistics.
PNG (Portable Network Graphics) applies a spatial predictor (one of five filter modes, chosen per-row) before DEFLATE compression (LZ77 + Huffman). The predictor exploits spatial correlation to reduce the entropy before entropy coding. PNG is lossless, supports alpha channels, and is the standard for graphics and screenshots.
5.3 Lossy Compression: JPEG and DCT
JPEG (Joint Photographic Experts Group) is the dominant standard for natural image compression. Its encoding pipeline consists of:
- Colour space conversion: RGB → YCbCr, followed by 4:2:0 chroma subsampling.
- Block decomposition: the image is split into non-overlapping \( 8 \times 8 \) pixel blocks.
- DCT: each block undergoes the 2D Discrete Cosine Transform:
\[ C[u, v] = \frac{2}{N}\, \alpha(u)\, \alpha(v) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} f[m, n] \cos\left(\frac{(2m+1)u\pi}{2N}\right) \cos\left(\frac{(2n+1)v\pi}{2N}\right) \]where \( N = 8 \), \( \alpha(0) = 1/\sqrt{2} \), \( \alpha(k) = 1 \) for \( k > 0 \).
- Quantization: each DCT coefficient \( C[u,v] \) is divided by a quantization step \( Q[u,v] \) from the quantization table and rounded to the nearest integer: \( \hat{C}[u,v] = \text{round}(C[u,v] / Q[u,v]) \). The quantization table has larger steps at higher spatial frequencies, exploiting the fact that the eye is less sensitive to high-frequency errors.
- Entropy coding: the DC coefficient (top-left, zero frequency) is DPCM coded (difference from the previous block). The 63 AC coefficients are scanned in a zig-zag pattern from low to high frequency and run-length encoded, then Huffman coded.
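The DCT and quantization steps of the pipeline can be checked numerically. The sketch below builds the orthonormal 8x8 DCT matrix (using the \( \alpha(0) = 1/\sqrt{2} \) convention), quantizes with a single uniform step \( Q \) for simplicity (real JPEG uses a full 64-entry table), and bounds the reconstruction error:

```python
import numpy as np

N = 8
u = np.arange(N)[:, None]
m = np.arange(N)[None, :]
T = np.sqrt(2.0 / N) * np.cos((2 * m + 1) * u * np.pi / (2 * N))
T[0, :] /= np.sqrt(2.0)           # alpha(0) = 1/sqrt(2) makes T orthonormal

def dct2(block):   return T @ block @ T.T     # forward 2D DCT
def idct2(coeff):  return T.T @ coeff @ T     # inverse 2D DCT

Q = 16.0                                      # a single uniform quantization step
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float) - 128.0  # level shift
C = dct2(block)
C_hat = np.round(C / Q)                       # quantization: the only lossy step
recon = idct2(C_hat * Q)

assert np.allclose(T @ T.T, np.eye(N))        # the basis is orthonormal
assert np.allclose(idct2(dct2(block)), block) # lossless without quantization
# Each coefficient error is at most Q/2, so (orthonormality + Parseval)
# the spatial error is bounded by sqrt(64) * Q/2 = 4Q
assert np.max(np.abs(recon - block)) <= 4 * Q
```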
JPEG artefacts arise primarily from two sources. Blocking artefacts appear as visible 8×8 grid boundaries — a consequence of quantizing each block independently. Ringing (Gibbs) artefacts appear near sharp edges within blocks — caused by discarding high-frequency DCT coefficients. At very high compression ratios, a characteristic “mosquito noise” appears around text and strong edges.
5.4 Wavelet-Based Compression: JPEG 2000
JPEG 2000 replaces the block-DCT with a global Discrete Wavelet Transform (DWT), addressing the blocking artefact problem of JPEG. The DWT decomposes the image into a pyramid of subbands: low-frequency (LL), horizontal detail (LH), vertical detail (HL), and diagonal detail (HH) — computed at multiple scales via iterated filter banks.
The wavelet filter bank applies a low-pass filter \( h_0 \) and a high-pass filter \( h_1 \) along rows, then along columns, followed by subsampling by 2 in each direction. This produces four subbands of half the original resolution. The LL subband is then further decomposed to produce the next level. The final representation is a multiresolution pyramid.
JPEG 2000 key innovations:
- No blocking artefacts: the wavelet transform is global, not block-wise.
- Progressive transmission: the bitstream can be truncated at any point to yield a lower-quality image.
- Region of interest: specific areas can be compressed at higher quality.
- Lossless and lossy in the same bitstream: using the 5/3 integer wavelet for lossless and 9/7 for lossy.
5.5 Transform Coding Theory
The rate-distortion (R-D) function \( R(D) \) gives the minimum rate (bits per sample) needed to represent a source with distortion at most \( D \). For a Gaussian source with variance \( \sigma^2 \) under MSE distortion, the R-D function is:
\[ R(D) = \frac{1}{2} \log_2 \frac{\sigma^2}{D}, \quad 0 \leq D \leq \sigma^2 \]and \( R(D) = 0 \) for \( D \geq \sigma^2 \). Transform coding exploits this by: (1) applying a transform that decorrelates the source (so coefficients can be coded independently), and (2) allocating bits to each coefficient according to its variance using the water-filling principle. The optimal bit allocation gives more bits to high-variance (high-energy) coefficients and fewer to low-variance ones. The KLT (Karhunen–Loève Transform) is the theoretically optimal transform for decorrelation; the DCT is a near-optimal approximation for sources with exponential correlation (which natural images exhibit approximately).
Chapter 6: Morphological Image Processing
6.1 Binary Morphology
Morphological operations treat a binary image as a set \( A \subseteq \mathbb{Z}^2 \) (the set of foreground pixels) and use a structuring element \( B \) (a small set defining the neighbourhood shape) to probe the image.
Erosion: \[ A \ominus B = \{ z : B_z \subseteq A \} \]where \( B_z \) is \( B \) translated to centre \( z \). A pixel \( z \) survives erosion only if the entire structuring element fits within \( A \) when centred at \( z \). Erosion shrinks objects, eliminates small protrusions, and separates thin bridges.
Dilation: \[ A \oplus B = \{ z : \hat{B}_z \cap A \neq \emptyset \} \]where \( \hat{B} \) is the reflection of \( B \) about its origin. A pixel \( z \) is in the dilation if the reflected structuring element touches \( A \) when centred at \( z \). Dilation expands objects, fills small holes, and bridges narrow gaps.
Opening: \( A \circ B = (A \ominus B) \oplus B \). Erosion followed by dilation. Removes small objects (smaller than \( B \)) and smooths boundaries without significantly changing object area.
Closing: \( A \bullet B = (A \oplus B) \ominus B \). Dilation followed by erosion. Fills small holes and gaps without significantly changing object area.
Opening and closing are idempotent: applying the same operation twice gives the same result as applying it once. They are also anti-extensive and extensive, respectively: \( A \circ B \subseteq A \subseteq A \bullet B \).
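Because binary morphology is pure set algebra, the definitions can be implemented verbatim with Python sets and the ordering and idempotence properties verified exactly (the cross structuring element and 10x10 square are illustrative):

```python
def erode(A, B):
    """A erosion B: z survives iff B translated to z fits entirely inside A."""
    return {(x, y) for (x, y) in A
            if all((x + s, y + t) in A for (s, t) in B)}

def dilate(A, B):
    """A dilation B: the Minkowski sum {a + b}, equivalent to the
    reflected-element definition (B here is symmetric about its origin)."""
    return {(x + s, y + t) for (x, y) in A for (s, t) in B}

B = {(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)}       # 4-connected cross
A = {(i, j) for i in range(10) for j in range(10)}   # solid 10x10 square

opened = dilate(erode(A, B), B)    # opening: erosion then dilation
closed = erode(dilate(A, B), B)    # closing: dilation then erosion

assert opened <= A <= closed                       # anti-extensive / extensive
assert dilate(erode(opened, B), B) == opened       # opening is idempotent
```

Here the opening shaves the four corner pixels off the square (the cross cannot reach into a convex corner), a tiny instance of boundary smoothing.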
6.2 Hit-or-Miss Transform
The hit-or-miss transform detects specific patterns (configurations of foreground and background pixels) in a binary image. Given two structuring elements \( B_1 \) (the “hit” kernel, must be in the foreground) and \( B_2 \) (the “miss” kernel, must be in the background), with \( B_1 \cap B_2 = \emptyset \):
\[ A \otimes (B_1, B_2) = (A \ominus B_1) \cap (A^c \ominus B_2) \]where \( A^c \) is the complement of \( A \). This operation finds all positions where \( B_1 \) fits inside \( A \) and \( B_2 \) fits outside \( A \) simultaneously. Applications include finding isolated pixels, endpoints, and junctions in skeletonized images.
6.3 Skeletonization and Thinning
The morphological skeleton of a binary region \( A \) is the set of points that are centres of maximal disks fitting within \( A \). It can be computed iteratively as:
\[ S(A) = \bigcup_{k=0}^{K} \left[ (A \ominus k B) \setminus ((A \ominus k B) \circ B) \right] \]where \( A \ominus k B \) denotes \( k \) successive erosions and \( K \) is the last iteration before \( A \ominus K B \) becomes empty. The skeleton is a thin representation of a shape that preserves topology (connectivity and number of holes) and can be used to reconstruct the original set (approximately).
6.4 Grayscale Morphology
For grayscale images, morphology generalizes by replacing set intersection/union with min/max operations:
\[ (f \ominus b)[m, n] = \min_{(s,t) \in B} \{ f[m+s, n+t] - b[s, t] \} \]\[ (f \oplus b)[m, n] = \max_{(s,t) \in B} \{ f[m-s, n-t] + b[s, t] \} \]For a flat structuring element (\( b = 0 \) everywhere), these reduce to local minimum and local maximum filters. Grayscale opening (erosion then dilation) smooths bright features smaller than \( B \); closing fills dark features. The morphological gradient \( (f \oplus b) - (f \ominus b) \) is an edge detector; the top-hat transform \( f - (f \circ b) \) extracts bright features smaller than \( B \) against a slowly varying background.
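For a flat structuring element the min/max formulation is a few lines of NumPy. The sketch below (flat square element, illustrative test image) shows grayscale opening removing a bright speck smaller than the element while preserving a larger feature:

```python
import numpy as np

def erode(f, radius=1):
    """Grayscale erosion with a flat square element: local minimum."""
    fp = np.pad(f, radius, mode='edge')
    H, W = f.shape
    k = 2 * radius + 1
    stack = [fp[i:i + H, j:j + W] for i in range(k) for j in range(k)]
    return np.min(stack, axis=0)

def dilate(f, radius=1):
    """Grayscale dilation with a flat square element: local maximum."""
    fp = np.pad(f, radius, mode='edge')
    H, W = f.shape
    k = 2 * radius + 1
    stack = [fp[i:i + H, j:j + W] for i in range(k) for j in range(k)]
    return np.max(stack, axis=0)

def opening(f, radius=1):
    return dilate(erode(f, radius), radius)

f = np.zeros((12, 12))
f[5, 5] = 1.0            # isolated bright speck: smaller than the 3x3 element
f[1:4, 1:4] = 1.0        # 3x3 bright square: exactly fits the element
g = opening(f, radius=1)
assert g[5, 5] == 0.0                 # speck removed
assert np.all(g[1:4, 1:4] == 1.0)     # square preserved
```

The top-hat `f - opening(f)` of this image would therefore isolate exactly the speck.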
Chapter 7: Image Segmentation
7.1 Thresholding and Otsu’s Method
Thresholding is the simplest segmentation approach: a pixel is classified as foreground if \( f[m,n] \geq T \) and background otherwise. The choice of threshold \( T \) is critical. Manual selection is impractical; automatic methods are required.
Otsu’s method finds the threshold that maximizes the between-class variance. For a threshold \( T \), define:
- \( \omega_0(T) = \sum_{k=0}^{T} p(k) \) and \( \omega_1(T) = 1 - \omega_0(T) \): class probabilities.
- \( \mu_0(T) \) and \( \mu_1(T) \): class means.
- \( \mu_T = \omega_0 \mu_0 + \omega_1 \mu_1 \): global mean.
The between-class variance is:
\[ \sigma_B^2(T) = \omega_0(T)(\mu_0(T) - \mu_T)^2 + \omega_1(T)(\mu_1(T) - \mu_T)^2 = \omega_0 \omega_1 (\mu_0 - \mu_1)^2 \]Otsu’s threshold is:
\[ T^* = \arg\max_T \sigma_B^2(T) \]This is equivalent to minimizing the total within-class variance \( \sigma_W^2 = \omega_0 \sigma_0^2 + \omega_1 \sigma_1^2 \), since \( \sigma_B^2 + \sigma_W^2 = \sigma_T^2 \) (total variance, constant). Otsu’s method works best for bimodal histograms; for multimodal distributions, multilevel Otsu generalizes to \( K \) thresholds.
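A histogram-based sketch in NumPy (function name is ours): writing \( \mu(T) = \sum_{k \le T} k\,p(k) \), the between-class variance can be expressed as \( \sigma_B^2(T) = (\mu_T\,\omega_0(T) - \mu(T))^2 / (\omega_0(T)\,\omega_1(T)) \), which is algebraically equivalent to the expression above and needs only cumulative sums:

```python
import numpy as np

def otsu_threshold(img, L=256):
    # Normalized histogram p(k)
    hist = np.bincount(img.ravel(), minlength=L).astype(float)
    p = hist / hist.sum()
    k = np.arange(L)
    omega0 = np.cumsum(p)        # class-0 probability for each candidate T
    mu_cum = np.cumsum(k * p)    # partial mean sum mu(T)
    mu_T = mu_cum[-1]            # global mean
    # Between-class variance; guard against empty classes
    valid = (omega0 > 0) & (omega0 < 1)
    sigma_B2 = np.zeros(L)
    sigma_B2[valid] = (mu_T * omega0[valid] - mu_cum[valid]) ** 2 \
        / (omega0[valid] * (1.0 - omega0[valid]))
    return int(np.argmax(sigma_B2))
```

On a synthetic bimodal image the returned threshold separates the two modes exactly.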
7.2 Region-Based Segmentation
Region growing starts from one or more seed points and iteratively adds neighbouring pixels that satisfy a homogeneity criterion (e.g., \( |f[\text{neighbour}] - \mu_{\text{region}}| < \delta \)). The region grows until no more pixels satisfy the criterion. Results depend sensitively on seed selection and the homogeneity threshold.
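A breadth-first sketch of single-seed region growing (function name is ours; 4-connectivity, homogeneity test against the running region mean as in the criterion above):

```python
import numpy as np
from collections import deque

def region_grow(f, seed, delta):
    # BFS region growing: add 4-neighbours whose intensity is within
    # delta of the current region mean.
    M, N = f.shape
    mask = np.zeros((M, N), bool)
    mask[seed] = True
    total, count = float(f[seed]), 1
    q = deque([seed])
    while q:
        m, n = q.popleft()
        for dm, dn in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            mm, nn = m + dm, n + dn
            if 0 <= mm < M and 0 <= nn < N and not mask[mm, nn]:
                if abs(f[mm, nn] - total / count) < delta:
                    mask[mm, nn] = True
                    total += f[mm, nn]
                    count += 1
                    q.append((mm, nn))
    return mask
```

Note the order-dependence: because the mean is updated as pixels are added, different seed positions or visit orders can yield different regions.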
Region splitting and merging starts with the full image and recursively splits each region into quadrants if the region is not homogeneous (quadtree decomposition), then merges adjacent homogeneous regions. This is more systematic than region growing.
7.3 The Watershed Algorithm
Morphological watershed treats the gradient magnitude image as a topographic surface. Local minima (catchment basins) correspond to image regions; watershed lines form the boundaries. The algorithm floods the basins from their minima; when two flooding fronts meet, a watershed line (boundary) is placed. The result is a complete, connected partition of the image.
The main drawback is over-segmentation: noise creates spurious local minima and thus too many small regions. Solutions include:
- Pre-smoothing with a Gaussian to eliminate noise minima.
- Marker-based watershed: pre-specifying foreground and background markers, then running watershed only from those markers.
7.4 Active Contours (Snakes)
Active contours (snakes, Kass et al. 1988) model the boundary as a parametric curve \( \mathbf{v}(s) = (x(s), y(s)) \) that minimizes an energy functional combining internal smoothness and external image forces:
\[ E_{\text{snake}} = \int_0^1 \left[ \underbrace{\alpha |\mathbf{v}'(s)|^2 + \beta |\mathbf{v}''(s)|^2}_{\text{internal}} + \underbrace{E_{\text{image}}(\mathbf{v}(s))}_{\text{external}} \right] ds \]The internal energy penalizes stretching (\( |\mathbf{v}'|^2 \), tension term, weight \( \alpha \)) and bending (\( |\mathbf{v}''|^2 \), rigidity term, weight \( \beta \)). The external image energy is typically \( -|\nabla f|^2 \), so the curve is attracted to strong gradient (edge) locations.
Minimizing this functional via the Euler–Lagrange equations gives the force balance equation:
\[ \alpha \mathbf{v}''(s) - \beta \mathbf{v}''''(s) - \nabla E_{\text{image}} = 0 \]This is solved iteratively, often by discretizing the curve into control points and performing gradient descent. Limitations include sensitivity to initialization (the contour must start near the true boundary) and the tendency to get trapped at spurious local minima of the energy, such as strong gradients caused by noise or clutter, rather than the true object boundary. Level-set methods (Osher–Sethian) reformulate the curve as the zero level set of a higher-dimensional function, enabling topological changes (merging and splitting of contours) and more stable numerical implementation.
Chapter 8: Feature Extraction
8.1 Edge Detection
8.1.1 Gradient-Based Operators
Edges correspond to rapid intensity transitions. The image gradient:
\[ \nabla f = \left( \frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y} \right) \]has magnitude \( |\nabla f| = \sqrt{f_x^2 + f_y^2} \) and direction \( \theta = \arctan(f_y / f_x) \). In discrete images, partial derivatives are approximated by finite differences. The Sobel operator uses:
\[ G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} * f, \qquad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} * f \]The extra weighting (\( \pm 2 \) in the central row/column) provides some smoothing in the perpendicular direction, making Sobel more robust to noise than the simple first-difference operator.
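A direct NumPy sketch of the Sobel gradient (function name is ours; implemented as cross-correlation over 'valid' positions, which for these antisymmetric kernels differs from true convolution only by a sign flip that leaves the magnitude unchanged; `arctan2` is used to avoid the division in \( \arctan(f_y/f_x) \)):

```python
import numpy as np

def sobel_gradients(f):
    Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    Ky = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float)
    M, N = f.shape
    gx = np.zeros((M - 2, N - 2))
    gy = np.zeros((M - 2, N - 2))
    for m in range(M - 2):
        for n in range(N - 2):
            patch = f[m:m+3, n:n+3]
            gx[m, n] = (Kx * patch).sum()
            gy[m, n] = (Ky * patch).sum()
    mag = np.hypot(gx, gy)              # |grad f|
    theta = np.arctan2(gy, gx)          # gradient direction
    return mag, theta
```

For a vertical step edge the response is purely horizontal: magnitude 4 (the sum of the kernel weights \( 1+2+1 \)) and direction 0.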
8.1.2 The Canny Edge Detector
The Canny detector (1986) is a multi-stage algorithm designed to satisfy three optimality criteria: good detection (minimize missed edges and false edges), good localization (detected edge is close to the true edge), and single response (one response per edge). The pipeline is:
- Gaussian smoothing: \( f_\sigma = G_\sigma * f \), removing noise.
- Gradient computation: \( |\nabla f_\sigma| \) and \( \theta \) using Sobel-like operators.
- Non-maximum suppression (NMS): thin ridges in the gradient magnitude image to single-pixel width by retaining only pixels that are local maxima in the gradient direction.
- Double thresholding: classify each NMS pixel as strong (above \( T_{\text{high}} \)), weak (between \( T_{\text{low}} \) and \( T_{\text{high}} \)), or suppressed (below \( T_{\text{low}} \)).
- Edge tracking by hysteresis: retain weak pixels only if they are connected to at least one strong pixel, then suppress the rest.
The Canny detector is scale-dependent: larger \( \sigma \) detects coarser features and suppresses fine detail. Choosing \( T_{\text{high}} / T_{\text{low}} \approx 2\text{--}3 \) is standard.
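The final two stages above can be sketched as a double threshold followed by a breadth-first search over 8-connected neighbours (function name is ours; the input is assumed to be the gradient magnitude after non-maximum suppression):

```python
import numpy as np
from collections import deque

def hysteresis(nms_mag, t_low, t_high):
    # Strong pixels are kept outright; weak pixels survive only if they
    # are 8-connected (possibly through other weak pixels) to a strong one.
    strong = nms_mag >= t_high
    weak = (nms_mag >= t_low) & ~strong
    out = strong.copy()
    q = deque(zip(*np.nonzero(strong)))
    M, N = nms_mag.shape
    while q:
        m, n = q.popleft()
        for dm in (-1, 0, 1):
            for dn in (-1, 0, 1):
                mm, nn = m + dm, n + dn
                if 0 <= mm < M and 0 <= nn < N and weak[mm, nn] and not out[mm, nn]:
                    out[mm, nn] = True
                    q.append((mm, nn))
    return out
```

Weak pixels attached to a strong edge are retained; isolated weak responses (typically noise) are suppressed.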
8.2 Corner Detection: The Harris Detector
Edges are 1D features (defined along one direction); corners are 2D features (localized in both directions). The Harris corner detector (Harris and Stephens, 1988) measures the change in intensity when a small patch is shifted in all directions, formalized by the structure tensor (second moment matrix):
\[ M = \sum_{(s,t) \in W} w[s,t] \begin{pmatrix} f_x^2 & f_x f_y \\ f_x f_y & f_y^2 \end{pmatrix} \]where the sum is over a Gaussian window \( w \) centred at the pixel of interest. The eigenvalues \( \lambda_1, \lambda_2 \) of \( M \) characterize the local geometry:
- \( \lambda_1 \approx \lambda_2 \approx 0 \): flat region.
- \( \lambda_1 \gg \lambda_2 \approx 0 \): edge (one dominant gradient direction).
- \( \lambda_1 \approx \lambda_2 \gg 0 \): corner (strong response in all directions).
Rather than computing eigenvalues directly, Harris uses the response function:
\[ R = \det(M) - k \cdot \text{tr}(M)^2 = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 \]with \( k \approx 0.04\text{--}0.06 \). Large positive \( R \) indicates a corner; large negative \( R \) indicates an edge; small \( |R| \) indicates a flat region. Corners are then located as local maxima of \( R \) above a threshold.
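A minimal NumPy sketch of the response map (function name is ours; central differences for the gradients and a box window in place of the Gaussian \( w \), which keeps the example short without changing the qualitative behaviour):

```python
import numpy as np

def harris_response(f, k=0.05, win=3):
    # Central-difference gradients (interior pixels only)
    fx = np.zeros_like(f); fy = np.zeros_like(f)
    fx[:, 1:-1] = (f[:, 2:] - f[:, :-2]) / 2
    fy[1:-1, :] = (f[2:, :] - f[:-2, :]) / 2
    Ixx, Iyy, Ixy = fx * fx, fy * fy, fx * fy
    pad = win // 2
    M, N = f.shape
    R = np.zeros_like(f)
    for m in range(pad, M - pad):
        for n in range(pad, N - pad):
            # Box window in place of the Gaussian weighting
            a = Ixx[m-pad:m+pad+1, n-pad:n+pad+1].sum()
            b = Ixy[m-pad:m+pad+1, n-pad:n+pad+1].sum()
            c = Iyy[m-pad:m+pad+1, n-pad:n+pad+1].sum()
            det, tr = a * c - b * b, a + c
            R[m, n] = det - k * tr * tr
    return R
```

On a bright quadrant, the response is positive at the corner, negative along the edges, and zero in flat regions, matching the eigenvalue cases listed above.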
8.3 Scale-Invariant Feature Transform (SIFT)
SIFT (Lowe, 2004) detects and describes keypoints that are invariant to scale, rotation, and partially invariant to illumination and viewpoint. The detection pipeline uses the scale-space framework, building a Gaussian pyramid and computing the Difference of Gaussians (DoG) at each scale:
\[ D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma), \quad L = G_\sigma * f \]DoG approximates the Laplacian of Gaussian (LoG), which detects blobs. Keypoints are DoG extrema in the \((x, y, \sigma)\) space — they are simultaneously extrema across spatial position and scale. Subpixel localization is achieved by fitting a 3D quadratic to the DoG values. Keypoints near edges or with low contrast are discarded.
The SIFT descriptor is a \( 4 \times 4 \) grid of gradient orientation histograms (8 bins each), computed in a \( 16 \times 16 \) patch around the keypoint, yielding a 128-dimensional vector. The patch is rotated to the dominant gradient orientation, achieving rotation invariance, and the descriptor is normalized to unit length for illumination robustness.
8.4 Histogram of Oriented Gradients (HOG)
HOG (Dalal and Triggs, 2005) is a dense feature descriptor designed for object detection (especially pedestrians). The computation:
- Optionally normalize the image globally.
- Compute horizontal and vertical gradients \( G_x, G_y \) using simple \([-1, 0, 1]\) filters.
- Compute gradient magnitude and orientation.
- Partition the image into cells (e.g., \( 8 \times 8 \) pixels); for each cell, build an orientation histogram with \( K = 9 \) unsigned bins.
- Group cells into overlapping blocks (e.g., \( 2 \times 2 \) cells); normalize each block’s feature vector by its L2 norm.
- Concatenate all normalized block descriptors into the final feature vector.
HOG captures local shape and gradient structure robustly because the block normalization provides some tolerance to illumination variation. It became the standard baseline for pedestrian detection before deep learning.
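The per-cell histogram stage can be sketched as follows (function name is ours; hard assignment of each pixel to a single orientation bin, whereas the full descriptor uses bilinear vote interpolation plus the block-normalization step described above):

```python
import numpy as np

def hog_cells(f, cell=8, bins=9):
    # Unsigned orientation histograms per cell, magnitude-weighted votes
    gx = np.zeros_like(f); gy = np.zeros_like(f)
    gx[:, 1:-1] = f[:, 2:] - f[:, :-2]   # [-1, 0, 1] filters
    gy[1:-1, :] = f[2:, :] - f[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned: [0, 180)
    Mc, Nc = f.shape[0] // cell, f.shape[1] // cell
    H = np.zeros((Mc, Nc, bins))
    bw = 180.0 / bins
    for m in range(Mc * cell):
        for n in range(Nc * cell):
            b = int(ang[m, n] // bw) % bins
            H[m // cell, n // cell, b] += mag[m, n]
    return H
```

A vertical step edge puts all of its gradient energy into the 0-degree bin of the containing cell.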
Chapter 9: Wavelet and Multiresolution Analysis
9.1 Multiresolution Concept
Many image features of interest exist at multiple scales: coarse structure visible at low resolution, fine textures and edges appearing only at high resolution. Multiresolution analysis provides a framework for simultaneous representation at all scales.
A multiresolution analysis (MRA) of \( L^2(\mathbb{R}) \) is a nested sequence of closed subspaces:
\[ \cdots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots \subset L^2(\mathbb{R}) \]such that \( \bigcup_j V_j \) is dense in \( L^2 \), each \( V_j \) is spanned by integer translates of a single scaling function \( \phi_{j,k}(x) = 2^{j/2}\phi(2^j x - k) \), and the MRA satisfies a two-scale relation:
\[ \phi(x) = \sqrt{2} \sum_k h_0[k]\, \phi(2x - k) \]The complement of \( V_j \) in \( V_{j+1} \) is the wavelet subspace \( W_j \), spanned by the wavelet function \( \psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k) \) where:
\[ \psi(x) = \sqrt{2} \sum_k h_1[k]\, \phi(2x - k), \quad h_1[k] = (-1)^k h_0[1-k] \]The filters \( h_0 \) (low-pass, scaling filter) and \( h_1 \) (high-pass, wavelet filter) form a quadrature mirror filter (QMF) pair.
9.2 The 2D DWT and Image Compression
For images, the 2D DWT is applied separably. One level of decomposition applies \( h_0 \) and \( h_1 \) along rows, then along columns, producing four \( M/2 \times N/2 \) subbands: LL (coarse approximation), LH (horizontal edges), HL (vertical edges), HH (diagonal edges). The LL subband is recursively decomposed to form the multi-level pyramid. The JPEG 2000 standard uses the biorthogonal 9/7 wavelet (Daubechies, lossy) and the 5/3 wavelet (Le Gall, lossless).
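One level of the separable 2D DWT can be sketched with the orthonormal Haar pair \( h_0 = (1,1)/\sqrt{2} \), \( h_1 = (1,-1)/\sqrt{2} \) (function name is ours; the non-overlapping pair trick below is specific to length-2 filters):

```python
import numpy as np

def haar_dwt2(f):
    h0 = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass (scaling) filter
    h1 = np.array([1.0, -1.0]) / np.sqrt(2)   # high-pass (wavelet) filter

    def analyze(x, h):
        # Filter along axis 1 and downsample by 2; for length-2 Haar
        # filters this is a weighted sum of non-overlapping pairs.
        return h[0] * x[:, 0::2] + h[1] * x[:, 1::2]

    lo, hi = analyze(f, h0), analyze(f, h1)    # row filtering
    LL = analyze(lo.T, h0).T                   # then column filtering
    LH = analyze(lo.T, h1).T                   # row low, column high
    HL = analyze(hi.T, h0).T
    HH = analyze(hi.T, h1).T
    return LL, LH, HL, HH
```

Because the Haar pair is orthonormal, the transform preserves energy: the squared coefficients of the four subbands sum to the squared pixel values of the input.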
9.3 Statistical Image Models in the Wavelet Domain
Wavelet coefficients are often modelled as independent (as in simple coefficient-wise thresholding), but this assumption is only approximate; there are strong statistical dependencies between:
- Parent–child dependencies: coefficients at the same spatial location but different scales.
- Neighbourhood dependencies: coefficients at neighbouring locations within the same subband.
Hidden Markov tree (HMT) models (Crouse, Nowak, Baraniuk, 1998) capture parent–child dependencies using a two-state mixture per coefficient. Gaussian scale mixture (GSM) models (Portilla et al., 2003) model the joint statistics of a local neighbourhood as a product of a Gaussian random vector and a hidden scalar multiplier.
Chapter 10: Introduction to Deep Learning for Image Processing
10.1 Convolutional Neural Networks
A convolutional neural network (CNN) is a hierarchical model that learns image representations by composing layers of linear convolutions and pointwise nonlinearities. The key operations are:
Convolutional layer. Given an input feature map \( \mathbf{x} \in \mathbb{R}^{H \times W \times C_{\text{in}}} \), the convolutional layer with \( C_{\text{out}} \) filters of size \( k \times k \) computes:
\[ \mathbf{y}[m, n, c_{\text{out}}] = \sum_{c=1}^{C_{\text{in}}} \sum_{s,t} W[s, t, c, c_{\text{out}}]\, \mathbf{x}[m+s, n+t, c] + b[c_{\text{out}}] \]The weights \( W \) are shared across spatial positions, giving the layer translation equivariance and dramatically reducing the number of parameters compared to a fully connected layer.
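A literal NumPy transcription of this formula (function name is ours; stride 1, no padding, so the output is spatially smaller than the input):

```python
import numpy as np

def conv_layer(x, W, b):
    # x: (H, W, C_in); W: (k, k, C_in, C_out); b: (C_out,)
    H, Wd, Cin = x.shape
    k, Cout = W.shape[0], W.shape[3]
    out = np.zeros((H - k + 1, Wd - k + 1, Cout))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            patch = x[m:m+k, n:n+k, :]   # (k, k, C_in)
            # Contract over the spatial and input-channel axes
            out[m, n, :] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2])) + b
    return out
```

An all-ones filter divided by its number of taps acts as a box average, which is a quick sanity check for the indexing.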
Pooling layers spatially reduce the feature maps. Max pooling takes the maximum over a \( p \times p \) spatial region, providing local translation invariance and reducing computation. Average pooling takes the mean.
Batch normalization normalizes each channel’s activations to zero mean and unit variance across the batch, then applies learnable scale and shift parameters:
\[ \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta \]Batch norm accelerates training, reduces sensitivity to initialization, and acts as a regularizer. During inference, \( \mu_B \) and \( \sigma_B^2 \) are replaced by running statistics computed during training.
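The training-time computation for one fully connected activation matrix can be sketched as follows (function name is ours; per-channel statistics over the batch axis, matching the equations above):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, channels); normalize each channel over the batch,
    # then apply the learnable affine transform.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta
```

After normalization, each output channel has mean \( \beta \) and standard deviation approximately \( \gamma \) (exactly, up to the \( \epsilon \) term).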
Activation functions. The Rectified Linear Unit (ReLU) \( \sigma(x) = \max(0, x) \) is the standard choice. It avoids the vanishing gradient problem of sigmoid/tanh for deep networks and is computationally trivial.
10.2 ResNet and Residual Learning
As networks deepen, training becomes harder due to the vanishing gradient problem: gradients shrink exponentially as they propagate backward through many layers. He et al. (2016) introduced residual (skip) connections:
\[ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \]where \( \mathcal{F} \) is a stack of convolutional layers. The identity shortcut allows gradients to flow directly through the skip connection without going through the nonlinear layers, making very deep networks (50, 101, 152 layers) trainable. ResNet won the 2015 ILSVRC challenge with a \( 3.57\% \) top-5 error rate on ImageNet, surpassing human-level performance for the first time.
10.3 Image Classification
Image classification assigns a label from a fixed set \( \{c_1, \ldots, c_K\} \) to an input image. A CNN encoder extracts a feature vector \( \mathbf{z} \in \mathbb{R}^d \) (e.g., the output of global average pooling after the last convolutional block), and a linear classifier followed by softmax produces class probabilities:
\[ P(c_k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^\top \mathbf{z}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{z}}} \]Training minimizes the cross-entropy loss:
\[ \mathcal{L} = -\sum_{i} \log P(y_i | \mathbf{x}_i) \]via stochastic gradient descent with backpropagation.
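A numerically stable sketch of the classifier head and loss (function name is ours; subtracting the row maximum leaves the softmax unchanged but prevents overflow in the exponentials):

```python
import numpy as np

def softmax_xent(z, Wc, y):
    # z: (N, d) features; Wc: (d, K) classifier weights; y: (N,) labels
    logits = z @ Wc
    logits -= logits.max(axis=1, keepdims=True)   # stability shift
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()     # mean cross-entropy
```

With uniform logits over \( K \) classes the loss equals \( \log K \), a useful sanity check at initialization.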
10.4 Object Detection
Object detection localizes and classifies multiple objects in an image. It requires both what (class label) and where (bounding box). Two major paradigms are:
Two-stage detectors (R-CNN family). Region-based CNN (Girshick et al., 2014) and its successors (Fast R-CNN, Faster R-CNN) first generate region proposals (sub-images likely to contain objects), then classify each proposal independently. Faster R-CNN introduced the Region Proposal Network (RPN) — a small CNN that slides over the feature map and predicts objectness scores and bounding box refinements at each anchor, eliminating the expensive selective search step.
One-stage detectors (YOLO, SSD). YOLO (You Only Look Once, Redmon et al., 2016) divides the image into an \( S \times S \) grid; each cell predicts \( B \) bounding boxes with confidence scores and \( K \) class probabilities, all in a single forward pass. This enables real-time detection at \( 45 \)+ frames per second. The trade-off is lower accuracy on small or overlapping objects compared to two-stage methods.
10.5 Semantic Segmentation: U-Net
Semantic segmentation assigns a class label to every pixel. Fully convolutional networks (FCN, Long et al., 2015) replace the final fully connected layers with convolutional layers, enabling dense per-pixel output for inputs of arbitrary size. However, repeated pooling reduces spatial resolution, losing localization detail.
U-Net (Ronneberger et al., 2015) addresses this with an encoder–decoder architecture with skip connections:
- Encoder (contracting path): a sequence of convolutional blocks, each followed by max pooling, progressively halving spatial resolution while doubling feature channels.
- Decoder (expanding path): a sequence of up-convolution (transposed convolution) blocks, each followed by concatenation with the corresponding encoder feature map (via a skip connection) and convolutional refinement.
The skip connections pass fine-grained spatial information from early encoder layers directly to the decoder, enabling precise boundary delineation. U-Net was originally developed for biomedical image segmentation (cell boundaries in microscopy) and has since become a dominant architecture for any pixel-level prediction task.
Chapter 11: Perceptual Image Quality Assessment
11.1 Full-Reference Metrics
Classical metrics such as PSNR and RMSE measure pixel-wise fidelity but correlate poorly with human perception. An image can have high PSNR yet appear poor (ringing artefacts), or have lower PSNR yet appear better (less noticeable distortions).
The Structural Similarity Index (SSIM) (Wang et al., 2004) compares images on three dimensions: luminance, contrast, and structure:
\[ \text{SSIM}(\mathbf{x}, \mathbf{y}) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \]where \( \mu_x, \mu_y \) are local means, \( \sigma_x^2, \sigma_y^2 \) are local variances, \( \sigma_{xy} \) is local covariance, and \( C_1, C_2 \) are small constants for numerical stability. SSIM ranges from \( -1 \) to \( 1 \), with 1 indicating perfect structural similarity. The multi-scale extension (MS-SSIM) computes SSIM at multiple scales and combines them multiplicatively, better matching human perception over a range of viewing conditions.
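A sketch of the SSIM formula using global image statistics (function name is ours; the published metric evaluates these statistics in a sliding Gaussian window and averages the resulting local SSIM map, and MS-SSIM repeats this across scales):

```python
import numpy as np

def ssim_global(x, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    # Global-statistics SSIM for 8-bit-range images
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) \
        / ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))
```

Identical images score exactly 1; any luminance, contrast, or structural change lowers the score.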
FSIM (Feature Similarity) uses phase congruency and gradient magnitude as feature maps for comparison. These features are closely related to what the human visual system uses for edge and texture perception.
11.2 No-Reference and Reduced-Reference Metrics
No-reference (blind) metrics assess image quality without access to a reference image. They model either specific distortion types (blocking, blur, ringing) or learn a general quality function from a database of images with human opinion scores (MOS). The NIQE and BRISQUE metrics use natural scene statistics: they fit a parametric model to features of natural, undistorted images, and measure quality as the deviation of a test image’s statistics from this model.
11.3 Perceptual Loss for Deep Image Processing
In deep learning-based image processing (super-resolution, denoising, style transfer), the choice of training loss strongly influences perceptual quality. Pixel-wise \( \ell_2 \) loss produces blurry outputs because the network averages over multiple plausible reconstructions. Perceptual loss (Johnson et al., 2016) uses feature maps from a pre-trained VGG network:
\[ \mathcal{L}_{\text{perceptual}} = \sum_l \| \phi_l(\hat{f}) - \phi_l(f_{\text{ref}}) \|_2^2 \]where \( \phi_l \) denotes the activation at layer \( l \) of VGG. This loss encourages the output to match the reference in terms of high-level features (textures, object parts) rather than pixel values, producing visually sharper and more realistic results.
Chapter 12: Video Processing Fundamentals
12.1 Digital Video and Frame Rates
A digital video is a sequence of images (frames) indexed by time: \( f[m, n, t] \). The temporal sampling rate (frame rate) is measured in frames per second (fps). Cinema uses 24 fps; broadcast video 25 (PAL) or 29.97 (NTSC) fps; high-frame-rate content 60 or 120 fps. Temporal aliasing (motion judder) occurs when the frame rate is too low for fast motion.
12.2 Motion Estimation
Temporal redundancy — the fact that consecutive frames are highly similar — is the primary driver of video compression efficiency. Motion estimation finds a displacement vector \( \mathbf{d}[m,n] = (d_x, d_y) \) for each pixel or block such that:
\[ f[m, n, t] \approx f[m - d_x, n - d_y, t-1] \]Block matching (used in MPEG/H.264) partitions the current frame into \( 16 \times 16 \) macroblocks and searches a reference frame for the best-matching block, minimizing Sum of Absolute Differences (SAD) or Sum of Squared Differences (SSD). The displacement found is the motion vector. After prediction, only the prediction residual (error) needs to be coded, dramatically reducing bitrate.
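An exhaustive-search sketch for one block (function name is ours; real encoders use fast search patterns such as three-step or diamond search instead of the full scan):

```python
import numpy as np

def block_match(cur, ref, m0, n0, bs=16, search=8):
    # Exhaustive SAD search for the motion vector of the block at (m0, n0)
    block = cur[m0:m0+bs, n0:n0+bs]
    best_sad, best_d = np.inf, (0, 0)
    M, N = ref.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r0, c0 = m0 + dy, n0 + dx
            if r0 < 0 or c0 < 0 or r0 + bs > M or c0 + bs > N:
                continue   # candidate falls outside the reference frame
            sad = np.abs(block - ref[r0:r0+bs, c0:c0+bs]).sum()
            if sad < best_sad:
                best_sad, best_d = sad, (dy, dx)
    return best_d, best_sad
```

For a frame that is a pure translation of the reference, the search recovers the shift exactly with zero residual.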
Optical flow estimates a dense motion vector field. The Lucas–Kanade method assumes constant flow within a small spatial window and solves:
\[ \begin{pmatrix} \sum f_x^2 & \sum f_x f_y \\ \sum f_x f_y & \sum f_y^2 \end{pmatrix} \begin{pmatrix} d_x \\ d_y \end{pmatrix} = -\begin{pmatrix} \sum f_x f_t \\ \sum f_y f_t \end{pmatrix} \]where \( f_x, f_y, f_t \) are the spatiotemporal partial derivatives. The left matrix is the structure tensor from Section 8.2, and the system is well-posed only at corners (both eigenvalues large) — the aperture problem prevents unique estimation on edges.
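The \( 2 \times 2 \) system can be assembled and solved directly (function name is ours; central differences for the spatial derivatives and a simple frame difference for \( f_t \); `np.linalg.solve` fails exactly when the structure tensor is singular, i.e., in the aperture-problem case):

```python
import numpy as np

def lucas_kanade(f0, f1, m, n, half=2):
    # Spatiotemporal derivatives (central differences wrap at the border
    # via np.roll; keep the window away from the image edges).
    fx = (np.roll(f0, -1, axis=1) - np.roll(f0, 1, axis=1)) / 2
    fy = (np.roll(f0, -1, axis=0) - np.roll(f0, 1, axis=0)) / 2
    ft = f1 - f0
    w = (slice(m - half, m + half + 1), slice(n - half, n + half + 1))
    A = np.array([[(fx[w] ** 2).sum(), (fx[w] * fy[w]).sum()],
                  [(fx[w] * fy[w]).sum(), (fy[w] ** 2).sum()]])
    b = -np.array([(fx[w] * ft[w]).sum(), (fy[w] * ft[w]).sum()])
    return np.linalg.solve(A, b)   # (d_x, d_y)
```

If the temporal difference is synthesized so that the brightness-constancy constraint holds exactly, the solver recovers the flow vector exactly.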
12.3 Video Compression Standards
The H.264/AVC and H.265/HEVC standards extend JPEG-like intra-coding with inter-frame prediction. Each frame is coded as:
- I-frame (Intra): coded independently, like a JPEG image.
- P-frame (Predicted): coded as residual after motion-compensated prediction from one reference frame.
- B-frame (Bi-directional): uses both past and future reference frames for prediction.
H.265 achieves roughly twice the compression efficiency of H.264 at the same quality, primarily by using larger and more flexible coding units (up to \(64 \times 64\)) and more sophisticated motion prediction modes. The AV1 codec (open-source successor to VP9) achieves comparable efficiency to H.265 with royalty-free licensing.