CS 462: Formal Languages and Parsing
Jeffrey Shallit
Primary Reference: Jeffrey Shallit, A Second Course in Formal Languages and Automata Theory. Cambridge University Press, 2008. All major theorems, algorithms, and proofs in these notes follow or are inspired by this text. Students are strongly encouraged to read the book alongside these notes.
Chapter 1: Strings and Combinatorics on Words
Basic Notation
Formal language theory is ultimately a theory about strings. Before developing any deep machinery, we establish a precise vocabulary for reasoning about sequences of symbols.
A finite alphabet \(\Sigma\) is a nonempty finite set of symbols. Single symbols are typically denoted \(a, b, c \in \Sigma\). A finite word (or string) over \(\Sigma\) is a map from an initial segment \(\{0, 1, \ldots, n-1\}\) to \(\Sigma\); we write it as a sequence \(w = w[0] w[1] \cdots w[n-1]\). Its length is \(|w| = n\). The empty word \(\varepsilon\) has length 0 and is a first-class string — it is not excluded unless stated otherwise. An infinite word is a map from \(\mathbb{N}\) to \(\Sigma\), written \(\mathbf{w} = w[0]w[1]w[2]\cdots\) in boldface.
We use the following universe sets:
- \(\Sigma^*\): all finite words over \(\Sigma\), including \(\varepsilon\)
- \(\Sigma^+\): all nonempty finite words, \(\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}\)
- \(\Sigma^\omega\) (also \(\Sigma^{\mathbb{N}}\)): all infinite words over \(\Sigma\)
- \(\Sigma^\infty = \Sigma^* \cup \Sigma^\omega\): all finite and infinite words
Finite words are denoted by letters \(s, t, u, v, w, x, y, z\). Natural numbers \(\mathbb{N} = \{0, 1, 2, \ldots\}\) use letters \(i, j, k, \ell, m, n\).
A useful shorthand for subwords is interval notation: \(w[a..b] = w[a] w[a+1] \cdots w[b]\).
Concatenation of strings \(x\) and \(y\) is denoted \(xy\) or \(x \cdot y\). Concatenation is associative but not in general commutative. Writing it multiplicatively lets us define string powers: \(x^0 = \varepsilon\), \(x^n = x \cdot x^{n-1}\) for \(n \geq 1\). This satisfies \(x^{m+n} = x^m x^n\).
The subword relation partitions naturally into standard cases. Given strings \(x\) and \(z\):
- \(x\) is a prefix of \(z\) if \(\exists\, y\) with \(z = xy\)
- \(x\) is a suffix of \(z\) if \(\exists\, y\) with \(z = yx\)
- \(x\) is a subword (or factor) of \(z\) if \(\exists\, w, y\) with \(z = wxy\)
- \(x\) is a subsequence of \(z\) if \(x\) can be obtained by deleting zero or more symbols from \(z\)
The word “substring” is ambiguous in the literature — some authors mean subword (contiguous), others mean subsequence (not necessarily contiguous). We always say “subword” or “subsequence” explicitly.
A prefix, suffix, or subword is proper if it is strictly shorter than the containing word: \(x\) is a proper prefix of \(z\) if \(z = xy\) and \(x \neq z\) (equivalently, \(y \neq \varepsilon\)).
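The four relations above translate directly into code. A minimal Python sketch (the function names are ours, not from the text):

```python
def is_prefix(x: str, z: str) -> bool:
    # x is a prefix of z iff z = xy for some y
    return z.startswith(x)

def is_suffix(x: str, z: str) -> bool:
    # x is a suffix of z iff z = yx for some y
    return z.endswith(x)

def is_factor(x: str, z: str) -> bool:
    # x is a subword (factor) of z iff z = wxy for some w, y
    return x in z

def is_subsequence(x: str, z: str) -> bool:
    # x is obtained from z by deleting zero or more symbols; greedy
    # left-to-right matching suffices ('c in it' consumes the iterator).
    it = iter(z)
    return all(c in it for c in x)
```

Note that the empty word \(\varepsilon\) is a prefix, suffix, subword, and subsequence of every word, consistent with the conventions above.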
Primitive Words and Morphisms
The concept of primitive words is simple but surprisingly deep. A nonempty word \(w\) is primitive if it cannot be expressed as a nontrivial repetition of a shorter word: \(w = x^k\) implies \(k = 1\) (and hence \(x = w\)).
Every nonempty word \(w\) has a unique primitive root: there exists a unique primitive word \(z\) and a unique integer \(k \geq 1\) such that \(w = z^k\). (This follows from the Lyndon-Schützenberger theorem proven in the next section.) The set of primitive words over a binary alphabet is:
\[P_2 = \{0, 1, 01, 10, 001, 010, 011, 100, 101, 110, \ldots\}\]An open problem: is \(P_2\) context-free? The answer is almost certainly no, but no proof is known.
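The primitive root of a word can be computed by brute force over divisor lengths. A short sketch (ours, not from the text; this is quadratic in the worst case, though linear-time methods via the KMP failure function exist):

```python
def primitive_root(w: str):
    """Return (z, k) with z primitive and w = z^k; unique for nonempty w."""
    n = len(w)
    for p in range(1, n + 1):
        # w has a root of length p iff p divides n and w is w[:p] repeated.
        if n % p == 0 and w[:p] * (n // p) == w:
            return w[:p], n // p
```

The loop always terminates at \(p = n\), returning \((w, 1)\) when \(w\) is itself primitive.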
A morphism (or homomorphism) \(h : \Sigma^* \to \Delta^*\) is a map satisfying \(h(xy) = h(x)h(y)\) for all \(x, y \in \Sigma^*\). A morphism is completely determined by its values on single letters. Morphisms preserve concatenation, and hence string structure, making them a fundamental tool in formal language theory.
Further Operations on Words
The perfect shuffle of two words \(x, y\) of equal length \(n\) interleaves their symbols:
\[x \mathop{\text{Ш}} y = x[0]\, y[0]\, x[1]\, y[1] \cdots x[n-1]\, y[n-1]\]A classic example: \(\mathtt{term} \mathop{\text{Ш}} \mathtt{hoes} = \mathtt{theorems}\).
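The perfect shuffle is a one-liner in code. A sketch (assuming equal-length inputs, as in the definition):

```python
def perfect_shuffle(x: str, y: str) -> str:
    # Interleave the symbols of two equal-length words.
    assert len(x) == len(y)
    return "".join(a + b for a, b in zip(x, y))
```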
The reversal of \(x\) is \(x^R\), the symbols of \(x\) in reverse order. For example, \((\mathtt{stressed})^R = \mathtt{desserts}\). A word satisfying \(x = x^R\) is a palindrome.
There are two natural linear orderings on \(\Sigma^*\). Assume \(\Sigma\) carries an underlying linear order (e.g., \(a < b < c < \cdots\) or \(0 < 1 < 2 < \cdots\)).
- Lexicographic order: \(x < y\) if either \(x\) is a proper prefix of \(y\), or there is an index \(i\) such that \(x[j] = y[j]\) for all \(j < i\) and \(x[i] < y[i]\).
- Radix (length-lexicographic) order: \(x < y\) if \(|x| < |y|\), or \(|x| = |y|\) and \(x < y\) lexicographically. This gives a total well-order on \(\Sigma^*\): \(\varepsilon, 0, 1, 2, 00, 01, 02, 10, 11, \ldots\)
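Radix order corresponds exactly to the sort key "(length, word)". A small sketch (assuming the symbol order is the character order, e.g. \(0 < 1 < 2\)):

```python
def radix_key(w: str):
    # Compare by length first, then lexicographically.
    return (len(w), w)

words = ["10", "", "2", "00", "0", "1", "01", "02"]
in_radix_order = sorted(words, key=radix_key)
```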
A cyclic shift of \(x\) is any word of the form \(vu\) where \(x = uv\). Two words \(x, y\) are conjugates if there exist \(u, v\) with \(x = uv\) and \(y = vu\). For example, \(\mathtt{eat}, \mathtt{ate}, \mathtt{tea}\) are all conjugates of each other.
A nonempty word \(w\) is bordered if some nonempty word \(x \neq w\) is both a prefix and a suffix of \(w\); such an \(x\) is called a border. For example, \(\mathtt{entanglement}\) has border \(\mathtt{ent}\). The word \(\mathtt{alfalfa}\) has an overlapping border: \(\mathtt{alfa}\) is both a prefix and a suffix, and these occurrences overlap. Bordered words are intimately connected to the structure of solutions to the equation \(xy = yz\), as we will see shortly.
Periodicity of Infinite Words
For a nonempty finite word \(x\), define
\[x^\omega = xxx\cdots\]This is the infinite word obtained by repeating \(x\) indefinitely. An infinite word \(\mathbf{z}\) is purely periodic if \(\mathbf{z} = x^\omega\) for some finite \(x\). It is ultimately periodic if \(\mathbf{z} = y x^\omega\) for some finite words \(x, y\) with \(x \neq \varepsilon\).
A finite word \(w\) of length \(n\) has period \(p\) (where \(1 \leq p \leq n\)) if \(w[i] = w[i+p]\) for all valid \(i\). Note that every word has period \(n\). A word is periodic with period \(p < n\) if this holds. The minimum period of \(w\) is the smallest such \(p\).
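The minimum period can be found naively in \(O(n^2)\) time. A sketch (ours; the KMP failure function yields it in linear time):

```python
def min_period(w: str) -> int:
    # Smallest p in [1, n] with w[i] == w[i + p] for all 0 <= i < n - p.
    # The loop always succeeds at p = n, where the condition is vacuous.
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
```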
The Lyndon-Schützenberger Theorems
The Lyndon-Schützenberger theorems are among the deepest results in combinatorics on words. They give complete characterizations of when certain string equations have solutions — a kind of analogue, for strings, of classic number-theoretic identities.
The motivation is beautifully simple. Consider the equation \(x^2 + xy = y^2 - 1\) over the integers — it has solutions like \((x, y) = (0,1), (1,2), (3,5)\), which are consecutive Fibonacci numbers. Analogously, we can ask: what are the solutions to equations like \(xy = yx\) or \(x^2 = y^3\) over the monoid \(\Sigma^*\)?
The Equation xy = yz
Theorem 1.1 (Lyndon-Schützenberger). Let \(x, y, z \in \Sigma^+\). Then \(xy = yz\) if and only if there exist \(u \in \Sigma^+\), \(v \in \Sigma^*\), and an integer \(e \geq 0\) such that \(x = uv\), \(z = vu\), and \(y = (uv)^e u\).

This theorem completely parametrizes the solution set. The connection to bordered words: if \(xy = yz\) with \(|x| < |y|\), then \(y\) is bordered (its border relates directly to \(x\) and \(z\)).
For \((\Leftarrow)\), a direct computation suffices: \(xy = (uv)(uv)^e u = (uv)^{e+1} u = (uv)^e u (vu) = yz\). For \((\Rightarrow)\), we proceed by strong induction on \(|y|\).
Base case \(|y| = 1\): Write \(y = a\) for a single symbol. Then \(xa = az\). We can write \(x = ax'\) and \(z = z'a\) for some strings \(x', z'\), and then \(ax'a = az'a\) implies \(x' = z'\). Take \(u = a\), \(v = x' = z'\), \(e = 0\). Then \(x = uv\), \(z = vu\), \(y = u = (uv)^0 u\). \(\checkmark\)
Inductive step: Assume the result holds for all triples where the middle word has length less than \(|y|\). We break into two cases.
Case I: \(|x| \geq |y|\). Since \(xy = yz\), we have \(|xy| = |yz|\), so \(|x| = |z|\). Write \(x = yw\) for some \(w\) (possible since \(|x| \geq |y|\) and \(x\) has \(y\) as a prefix from \(xy = yz\)). Then take \(u = y\), \(v = w\), \(e = 0\).
Let us verify: \(x = yw = uv\). From \(xy = yz\) we get \((yw)y = yz\), i.e., \(y(wy) = yz\); canceling \(y\) on the left gives \(z = wy = vu\). Finally, \(y = (uv)^0 u = u = y\). ✓
Case II: \(|x| < |y|\). From \(xy = yz\), since \(|x| < |y|\) the word \(y\) starts with \(x\) and ends with \(z\), with some overlap region \(w\) defined by \(y = xw = wz\) (i.e., \(w\) is the word satisfying \(|w| = |y| - |x|\)). Note \(w \neq \varepsilon\) (since \(|x| < |y|\)) and \(x, z \neq \varepsilon\) (given).
Since \(|w| < |y|\), the inductive hypothesis applies to the equation
\[xw = wz\]yielding \(u\), \(v\), and \(e' \geq 0\) with
\[x = uv, \quad z = vu, \quad w = (uv)^{e'} u\]Then \(y = xw = uv \cdot (uv)^{e'} u = (uv)^{e'+1} u\). Setting \(e = e' + 1\), we have \(x = uv\), \(z = vu\), \(y = (uv)^e u\). ✓
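The parametrization can be confirmed on small examples by brute force. A verification sketch (our own helper, not an efficient algorithm):

```python
def ls_decompose(x: str, y: str, z: str):
    """Given xy = yz with x, z nonempty, search for u, v, e with
    x = uv, z = vu, and y = (uv)^e u."""
    assert x + y == y + z
    for cut in range(len(x) + 1):
        u, v = x[:cut], x[cut:]
        if z != v + u:
            continue
        # |y| = e|x| + |u|, so e is at most |y| / |x|.
        for e in range(len(y) // len(x) + 1):
            if (u + v) * e + u == y:
                return u, v, e
```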
Commuting Words
When do two words commute? In linear algebra, two matrices commute if and only if they share a common eigenbasis (in the diagonalizable case). The string analogue is: two words commute if and only if they are powers of a common word.
Theorem 1.2. Let \(x, y \in \Sigma^+\). The following conditions are equivalent:

1. There exists \(z \in \Sigma^+\) and integers \(k, \ell > 0\) with \(x = z^k\) and \(y = z^\ell\).
2. \(x^\omega = y^\omega\).
3. There exist integers \(i, j > 0\) with \(x^i = y^j\).
4. \(xy = yx\).
5. There exist integers \(r, s > 0\) with \(x^r y^s = y^s x^r\).
6. Define the morphism \(h : \{a, b\}^* \to \Sigma^*\) by \(h(a) = x\) and \(h(b) = y\). Then there exist two distinct words \(u, v \in \{a, b\}^*\) with \(h(u) = h(v)\).
7. \(x\{x,y\}^* \cap y\{x,y\}^* \neq \emptyset\).
8. \(x\{x,y\}^\omega \cap y\{x,y\}^\omega \neq \emptyset\).

The proof is a cycle of implications (1) \(\Rightarrow\) (2) \(\Rightarrow \cdots \Rightarrow\) (8) \(\Rightarrow\) (1).
Condition (1) is the canonical form: two words commute precisely when they are powers of a common root. Condition (6) is remarkable — it says commutativity is equivalent to the morphism \(h\) being “non-injective” on the free monoid \(\{a,b\}^*\).
(1) \(\Rightarrow\) (2): If \(x = z^k\) and \(y = z^\ell\), then \(x^\omega = (z^k)^\omega = z^\omega = (z^\ell)^\omega = y^\omega\).
(2) \(\Rightarrow\) (3): Let \(i = |y|\) and \(j = |x|\). The prefix of length \(|x||y|\) of \(x^\omega\) equals the prefix of the same length of \(y^\omega\). This gives \(x^i = y^j\).
(3) \(\Rightarrow\) (4): Suppose \(x^i = y^j\) with \(i, j > 0\). Assume without loss of generality that \(|x| \geq |y|\); then \(y\) is a prefix of \(x\), so write \(x = yw\). Then
\[y^j = x^i = (yw)^i = y(wy)^{i-1}w\]Canceling \(y\) on the left: \(y^{j-1} = (wy)^{i-1}w\). Appending \(y\) on the right: \(y^j = (wy)^{i-1}wy = (wy)^i\).
Now we have \((yw)^i = y^j = (wy)^i\). Comparing the first \(|y| + |w|\) characters gives \(yw = wy\). Since \(x = yw = wy\), appending \(y\): \(xy = (yw)y = y(wy) = yx\).
(4) \(\Rightarrow\) (5): Clear, since if \(xy = yx\), then \(x^r y^s = y^s x^r\) for all \(r, s > 0\).
(5) \(\Rightarrow\) (6): Suppose \(x^r y^s = y^s x^r\). Then \(h(a^r b^s) = h(b^s a^r)\), and since \(a^r b^s \neq b^s a^r\) (as \(r, s > 0\)), condition (6) holds.
(6) \(\Rightarrow\) (7): From condition (6), there exist distinct \(u, v\) with \(h(u) = h(v)\). By swapping if needed, we may assume \(u\) starts with \(a\) and \(v\) starts with \(b\) (or use any occurrence of \(a\) and \(b\) in them). With more care one shows that the common value \(h(u) = h(v)\) lies in \(x\{x,y\}^* \cap y\{x,y\}^*\).
(7) \(\Rightarrow\) (8): Any member of \(x\{x,y\}^*\) extends to a member of \(x\{x,y\}^\omega\).
(8) \(\Rightarrow\) (1): If \(\mathbf{w} \in x\{x,y\}^\omega \cap y\{x,y\}^\omega\), then \(\mathbf{w}\) starts with both \(x\) and \(y\). This forces \(x\) and \(y\) to be prefixes of each other’s powers. A careful length argument then shows they share a primitive root.
The Equation x² = y³
To appreciate the full power of Theorem 1.2, consider the equation \(x^2 = y^3\) over the integers: every solution has the form \(x = z^3\), \(y = z^2\) for some integer \(z\). The analogous string result follows immediately from condition (3): if \(x^2 = y^3\) for words \(x, y \in \Sigma^+\), then \(x\) and \(y\) are powers of a common word.
More generally, any equation \(x^i = y^j\) with \(i, j > 0\) over \(\Sigma^+\) forces \(x\) and \(y\) to be powers of a common word.
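Conditions (1), (3), and (4) are easy to confirm experimentally. A brief sketch (the `root` helper is ours):

```python
def root(w: str) -> str:
    # Primitive root: shortest z with w = z^k for some k >= 1.
    n = len(w)
    for p in range(1, n + 1):
        if n % p == 0 and w[:p] * (n // p) == w:
            return w[:p]

# Powers of a common word z satisfy x^2 = y^3 when x = z^3, y = z^2:
z = "ab"
x, y = z * 3, z * 2
```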
Repetitions in Words
A word is said to contain a square if it has a subword of the form \(ww\) for some nonempty \(w\). For example, \(\mathtt{abab}\) is a square. A word is squarefree if it contains no square. Are there arbitrarily long squarefree words? Over a binary alphabet the answer is no — every binary word of length \(\geq 4\) contains a square. But over a ternary alphabet, arbitrarily long squarefree words exist. This was proved by Thue in 1906.
An overlap is a word of the form \(axaxa\) where \(a \in \Sigma\) and \(x \in \Sigma^*\). Overlaps are a relaxation of squares: a word containing no overlap also avoids long repetitions. The remarkable fact, also due to Thue, is that overlaps can be avoided even over a binary alphabet: there is an infinite overlap-free word over \(\{0, 1\}\).
The Thue-Morse Sequence
The Thue-Morse sequence \(\mathbf{t} = t_0 t_1 t_2 \cdots\) is defined by
\[t_n = \text{(number of 1's in the binary expansion of } n) \bmod 2\]So \(\mathbf{t} = 0110100110010110\cdots\). Equivalently, \(\mathbf{t}\) is the fixed point of the morphism \(\mu : 0 \mapsto 01, 1 \mapsto 10\). That is, \(\mu(\mathbf{t}) = \mathbf{t}\).
The proof proceeds by showing that any overlap \(axaxa\) in \(\mathbf{t}\) can be “lifted” to an overlap in \(\mathbf{t}\) via the inverse of \(\mu\), eventually reaching a contradiction. This type of argument — reducing the problem on the image back to the preimage — is characteristic of morphic sequences.
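Thue's theorem can be checked empirically on long prefixes of \(\mathbf{t}\). A sketch: generate a prefix by iterating \(\mu\), then search for overlaps by brute force (an overlap \(axaxa\) is exactly a factor of length \(2m+1\) with period \(m = |ax|\)):

```python
def thue_morse(k: int) -> str:
    # k-fold iteration of mu: 0 -> 01, 1 -> 10, starting from "0";
    # yields the prefix of t of length 2^k.
    t = "0"
    for _ in range(k):
        t = "".join("01" if c == "0" else "10" for c in t)
    return t

def has_overlap(w: str) -> bool:
    # Search all factors f of length 2m + 1; f is an overlap iff
    # f consists of its first m symbols twice, followed by f[0].
    n = len(w)
    for i in range(n):
        for m in range(1, (n - i - 1) // 2 + 1):
            f = w[i:i + 2 * m + 1]
            if f == f[:m] * 2 + f[0]:
                return True
    return False
```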
Chapter 2: Regular Languages — Advanced Theory
Review: Finite Automata
We assume familiarity with the basics of regular languages from a first course. Recall the key objects:
- A deterministic finite automaton (DFA) is a 5-tuple \(M = (Q, \Sigma, \delta, q_0, F)\) where \(Q\) is a finite set of states, \(\delta : Q \times \Sigma \to Q\) is the transition function, \(q_0 \in Q\) is the start state, and \(F \subseteq Q\) is the set of accepting states.
- A nondeterministic finite automaton (NFA) has \(\delta : Q \times \Sigma \to 2^Q\) (a set-valued transition function), with optional \(\varepsilon\)-transitions.
- Every NFA can be converted to an equivalent DFA via the subset construction, with at most exponential blowup in the number of states.
- The regular languages are exactly those accepted by DFAs (and equivalently NFAs), and also exactly those described by regular expressions.
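The subset construction itself is short. A Python sketch (\(\varepsilon\)-transitions omitted for brevity; the representation, with `delta` mapping (state, symbol) to a set of successors, is our choice, not from the text):

```python
def subset_construction(alphabet, delta, start, accepting):
    """Convert an NFA (without epsilon-moves) to a DFA.
    DFA states are frozensets of NFA states; only reachable subsets
    are constructed, so the 2^|Q| worst case is often avoided."""
    q0 = frozenset([start])
    trans, states, work = {}, {q0}, [q0]
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(t for q in S for t in delta.get((q, a), ()))
            trans[S, a] = T
            if T not in states:
                states.add(T)
                work.append(T)
    final = {S for S in states if S & accepting}
    return states, trans, q0, final
```

For example, a 3-state NFA for binary words ending in 01 determinizes to just 3 reachable subsets, well below the \(2^3\) worst case.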
Moore and Mealy Machines
Beyond acceptance, finite automata can also compute output. There are two classical models.
The difference is subtle but important: a Moore machine attaches an output symbol to each state, so its output depends only on the state sequence (and is “one step delayed”), while a Mealy machine attaches an output symbol to each transition. Up to this one-step shift, the two models compute the same class of functions.
Advanced Closure Properties
The regular languages are closed under the usual Boolean operations (union, intersection, complement), concatenation, Kleene star, reversal, morphism, and inverse morphism. We now examine several more subtle closure properties.
Quotients
For \(x \in \Sigma^*\) and \(K \subseteq \Sigma^*\), define the left quotients \(x^{-1}L = \{y : xy \in L\}\) and \(K^{-1}L = \bigcup_{x \in K} x^{-1}L\). If \(L\) is regular, both are regular, even when \(K\) is an arbitrary (not necessarily regular) language. The proof for \(x^{-1}L\) is simple: given a DFA \(M\) for \(L\), start in state \(\delta^*(q_0, x)\) instead of \(q_0\). For \(K^{-1}L\) with \(K\) arbitrary, observe that \(x^{-1}L\) depends only on the state \(\delta^*(q_0, x)\), so there are only finitely many distinct quotients; \(K^{-1}L\) is a finite union of them, hence regular.
Half of a Language
A more exotic operation: given \(L\), define \(\frac{1}{2} L\) as the set of words that are the “first half” of some word in \(L\):
\[\tfrac{1}{2} L = \{x : \exists y \text{ with } |y| = |x| \text{ and } xy \in L\}\]
The key insight is that the reverse automaton \(M^R\) accepts exactly \(L^R\), the reversal of \(L\). We construct the NFA for \(\frac{1}{2}L\) as a product of \(M\) (running forward) with \(M^R\) started from all accepting states (running backward simultaneously). A valid split is found when the two “pointers” meet: the forward pointer from \(q_0\) and the backward pointer from some accepting state arrive at the same state \(q\) after the same number of steps. Formally, the NFA has state space \(Q \times 2^Q\) and accepts when the forward state belongs to the set of states simultaneously reachable backward.
Since \(Q\) is finite, this NFA has \(|Q| \cdot 2^{|Q|}\) states, yielding a regular language.
Cyclic Shift of a Language
The cyclic shifts of words in \(L\) form the language \(\text{cyc}(L) = \{yx : xy \in L\}\); if \(L\) is regular, so is \(\text{cyc}(L)\). The proof uses a product construction on two copies of the DFA for \(L\). A word \(yx\) (with \(xy \in L\)) lies in \(\text{cyc}(L)\) iff there is a state \(q\) with \(\delta^*(q_0, x) = q\) and \(\delta^*(q, y) \in F\). The NFA guesses \(q\), reads \(y\) starting from \(q\) and checks that it reaches an accepting state, then reads \(x\) starting from \(q_0\) and checks that it returns to \(q\).
Transducers and Nivat’s Theorem
Transducers generalize finite automata by producing output rather than just accepting/rejecting input. They are the finite-state analogue of functions between formal languages.
A transducer \(T\) computes a rational relation \(\tau_T \subseteq \Sigma^* \times \Delta^*\): we have \((x, y) \in \tau_T\) if there is an accepting path reading input \(x\) and producing output \(y\). The image of a language \(L\) under \(T\) is \(T(L) = \{y : \exists x \in L, (x,y) \in \tau_T\}\).
Nivat’s theorem characterizes rational relations: \(R \subseteq \Sigma^* \times \Delta^*\) is rational iff there exist an alphabet \(\Gamma\), a regular language \(K \subseteq \Gamma^*\), and morphisms \(g : \Gamma^* \to \Sigma^*\) and \(h : \Gamma^* \to \Delta^*\) such that \(R = \{(g(w), h(w)) : w \in K\}\). In other words, rational relations are exactly those that can be “decoded” from a regular language via a pair of morphisms.
Corollary: Rational transductions preserve regular languages. That is, if \(L\) is regular and \(T\) is a finite transducer, then \(T(L)\) is regular.
Composition of Transducers
Unlike DFAs (where composition is trivial via product), transducer composition is non-trivial because the output of one transducer must be fed as input to another. The result, due to Elgot and Mezei, is that rational relations are nonetheless closed under composition: if \(R_1 \subseteq \Sigma^* \times \Delta^*\) and \(R_2 \subseteq \Delta^* \times \Gamma^*\) are rational, so is their composition \(R_2 \circ R_1\).
However, rational transductions are not closed under complementation or intersection (as relations). This makes the study of transducers substantially richer than that of DFAs.
Two-Way Finite Automata
A two-way deterministic finite automaton (2DFA) is a DFA whose read head can move both left and right on the input tape. The input is surrounded by left and right endmarkers \(\vdash\) and \(\dashv\). Formally, the transition function has the form \(\delta : Q \times (\Sigma \cup \{\vdash, \dashv\}) \to Q \times \{-1, +1\}\), specifying the new state and the direction of head motion.
The remarkable theorem due to Shepherdson (1959) is that 2DFAs are no more powerful than 1DFAs:
Proof sketch: For each word \(w\), the behavior of \(M\) on \(w\) is summarized by a partial function \(\tau_w\) on states, recording, for each state in which the head crosses into \(w\), the state (if any) in which it eventually crosses back out. There are only finitely many such partial functions (at most \((|Q|+2)^{|Q|}\)), so the set of possible transformations is finite. The transformation monoid \(\mathcal{T}(M)\) is the set of all transformations induced by words in \(\Sigma^*\). We can simulate \(M\) by a 1DFA whose states are transformations: the start state is the identity transformation, and reading symbol \(a\) composes the current transformation with \(\tau_a\). Acceptance is determined by applying the final transformation to \(q_0\) and checking if the result is an accepting state.
This construction yields a 1DFA with at most \((|Q|+2)^{|Q|}\) states.
While 2DFAs are not more expressive than 1DFAs, they can be exponentially more succinct. There are languages whose smallest 2DFA has \(n\) states but whose smallest 1DFA requires \(2^n\) states.
Automata and Boolean Matrices
\[(T_a)_{ij} = 1 \iff \delta(q_i, a) = q_j\]\[T_w = T_{a_1} T_{a_2} \cdots T_{a_k}\](matrix product over the Boolean semiring, where \(1+1=1\)). A word \(w\) is accepted by \(M\) iff \((T_w)_{q_0, f} = 1\) for some \(f \in F\).
This viewpoint connects regular language theory to the theory of finite monoids: the set \(\{T_w : w \in \Sigma^*\}\) forms a finite monoid (the transition monoid of \(M\); for the minimal DFA it is the syntactic monoid of the language), and the language is characterized by which elements of this monoid are “accepting.” This is the foundation of algebraic automata theory.
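A small demonstration of the matrix viewpoint (our own toy DFA: two states, accepting words with an odd number of 1’s):

```python
def bool_matmul(A, B):
    # Matrix product over the Boolean semiring: + is 'or', x is 'and'.
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

IDENTITY = [[True, False], [False, True]]   # T_epsilon
T = {"0": [[True, False], [False, True]],   # 0 fixes both states
     "1": [[False, True], [True, False]]}   # 1 swaps q0 <-> q1

def accepts(w: str) -> bool:
    # w is accepted iff (T_w)[q0][f] = 1 for some accepting f; here F = {q1}.
    M = IDENTITY
    for c in w:
        M = bool_matmul(M, T[c])
    return M[0][1]
```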
The Myhill-Nerode Theorem
The Myhill-Nerode theorem is one of the most important results in automata theory. It gives a purely language-theoretic characterization of regularity, without reference to machines or grammars, and simultaneously explains the structure of the minimum DFA.

Define \(x \equiv_L y\) if for all \(z \in \Sigma^*\), \(xz \in L \iff yz \in L\). The theorem states: \(L\) is regular if and only if \(\equiv_L\) has finitely many equivalence classes, and in that case the number of classes equals the number of states of the minimal DFA for \(L\).
\((\Rightarrow)\) Suppose \(L\) is accepted by a DFA \(M\) with \(n\) states. For any \(x, y \in \Sigma^*\), if \(\delta^*(q_0, x) = \delta^*(q_0, y)\) (they lead to the same state), then for any \(z\), \(\delta^*(q_0, xz) = \delta^*(q_0, yz)\), so \(xz \in L \iff yz \in L\), meaning \(x \equiv_L y\). Thus all words leading to a given state lie in a single \(\equiv_L\)-class: the classes are coarsenings of the state preimages, so there are at most \(n\) classes.
\((\Leftarrow)\) Suppose \(\equiv_L\) has finitely many classes. We construct a DFA \(M_L\) whose states are the equivalence classes \([x]_L\) for \(x \in \Sigma^*\):
- State set: \(Q_L = \{[x]_L : x \in \Sigma^*\}\)
- Start state: \([\varepsilon]_L\)
- Transitions: \(\delta([x]_L, a) = [xa]_L\) (well-defined since if \(x \equiv_L y\) then \(xa \equiv_L ya\))
- Accept states: \(F_L = \{[x]_L : x \in L\}\) (well-defined since \(L\) is a union of \(\equiv_L\)-classes)
This DFA accepts \(L\): \(\delta^*([\varepsilon]_L, w) = [w]_L\), and \([w]_L \in F_L \iff w \in L\). So \(L\) is regular.
For minimality: the DFA \(M_L\) is minimal. Distinct classes \([x]_L \neq [y]_L\) are distinguishable: there exists \(z\) with \(xz \in L, yz \notin L\) (or vice versa). Consequently, in any DFA for \(L\), words from distinct classes must lead to distinct states; if \(x\) and \(y\) led to the same state, then reading \(z\) from that state would have to both accept and reject, a contradiction. So every DFA for \(L\) has at least as many states as \(M_L\), and \(M_L\) is the unique (up to isomorphism) minimal DFA for \(L\).
The Myhill-Nerode theorem has two uses: (1) proving regularity by bounding the number of equivalence classes, and (2) proving non-regularity by showing infinitely many classes. For the latter, it suffices to find an infinite set of words that are pairwise distinguishable.
Example: The language \(L = \{0^n 1^n : n \geq 0\}\) over \(\{0,1\}\) has infinitely many Myhill-Nerode classes. The words \(0^i\) for \(i = 0, 1, 2, \ldots\) are pairwise distinguishable: \(0^i 1^i \in L\) but \(0^j 1^i \notin L\) for \(j \neq i\). Hence \(L\) is not regular.
DFA Minimization Algorithms
Given a DFA \(M\), we want to compute the minimum DFA for \(L(M)\). By the Myhill-Nerode theorem, this amounts to finding the \(\equiv_L\)-classes. The main algorithms work by computing the indistinguishability relation and merging equivalent states.
Two states \(p, q\) are distinguishable if there exists a word \(z\) such that exactly one of \(\delta^*(p, z)\) and \(\delta^*(q, z)\) is in \(F\). We denote this by \(p \not\sim q\). Two states are equivalent (indistinguishable) if no such \(z\) exists, written \(p \sim q\).
Note: unreachable states must first be removed before minimization; states that cannot reach an accepting state are all equivalent to one another and will be merged by the algorithm automatically.
Algorithm NAIVE-MINIMIZE (O(n³))
The simplest approach iteratively marks distinguishable pairs.
NAIVE-MINIMIZE(M):
Initialize: mark (p, q) as distinguishable if exactly one is in F.
Repeat until no change:
For each unmarked pair (p, q) and each symbol a:
If (δ(p,a), δ(q,a)) is marked as distinguishable:
Mark (p, q) as distinguishable.
Merge all unmarked (indistinguishable) pairs.
Correctness: The invariant is that all marked pairs are truly distinguishable (base case: pairs where one is accepting, one is rejecting are distinguishable by \(\varepsilon\)). The iteration propagates distinguishability backwards through transitions. When no more pairs can be marked, the unmarked pairs are genuinely indistinguishable.
Running time: There are \(O(n^2)\) pairs, and each pass of the outer loop takes \(O(n^2 |\Sigma|)\) time to check all pairs and symbols. Pairs distinguishable by a word of length \(\ell\) are marked within \(\ell\) passes, and a distinguishing word of length less than \(n\) always suffices, so at most \(n\) passes are needed. Total: \(O(n^3 |\Sigma|)\).
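A direct transcription of NAIVE-MINIMIZE (a sketch; it assumes a complete DFA with all states reachable, and represents `delta` as a dict from (state, symbol) to state):

```python
from itertools import combinations

def indistinguishability_classes(Q, alphabet, delta, F):
    # Phase 1: mark distinguishable pairs until no change (fixpoint).
    marked = {frozenset(p) for p in combinations(Q, 2)
              if (p[0] in F) != (p[1] in F)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(Q, 2):
            pair = frozenset((p, q))
            if pair not in marked and any(
                    frozenset((delta[p, a], delta[q, a])) in marked
                    for a in alphabet):
                marked.add(pair)
                changed = True
    # Phase 2: merge unmarked (indistinguishable) pairs into classes.
    classes = []
    for q in Q:
        for c in classes:
            if frozenset((q, c[0])) not in marked:
                c.append(q)
                break
        else:
            classes.append([q])
    return [set(c) for c in classes]
```

The returned classes are exactly the states of the minimal DFA.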
Algorithm MINIMIZE (O(n²))
A more careful bookkeeping approach achieves \(O(n^2)\):
MINIMIZE(M):
Initialize a table T[p][q] = ∅ for all unmarked pairs.
Mark (p, q) if exactly one of p, q is in F.
For each symbol a and each pair (p, q):
For each (r, s) with δ(r, a) = p, δ(s, a) = q (preimage pairs):
Add (r, s) to T[p][q].
Process newly marked pairs using the table:
When (p, q) is marked, for each (r, s) in T[p][q]:
Mark (r, s) if not already marked.
This is essentially a BFS/DFS propagation using precomputed dependency lists. Each pair is processed at most once, giving \(O(n^2 |\Sigma|)\) total.
Algorithm FAST-MINIMIZE (Hopcroft, O(n log n))
Hopcroft’s algorithm achieves the optimal \(O(n \log n)\) time, which is tight in the worst case. The algorithm maintains a partition refinement structure.
Key idea: Start with the coarsest partition \(\{F, Q \setminus F\}\) (accepting vs. rejecting). Repeatedly refine: a block \(B\) is a splitter for block \(C\) with respect to symbol \(a\) if \(C\) contains states that go to \(B\) on \(a\) as well as states that don’t. Split \(C\) accordingly.
FAST-MINIMIZE(M):
Start with partition P = {F, Q \ F}.
Initialize waiting queue W = {min(F, Q\F)} (the smaller of the two initial blocks).
While W is nonempty:
Remove a splitter S from W.
For each symbol a:
Let C = {q : δ(q, a) ∈ S} (states that transition into S on a).
For each block B in P that is split by C:
Split B into B ∩ C and B \ C.
Add the smaller part to W (the "smaller half" trick).
The smaller-half trick: Each state is added to \(W\) at most \(O(\log n)\) times, because whenever a state re-enters \(W\), its containing block is at most half the size of the block it belonged to the previous time. This gives the \(O(n \log n)\) bound.
Algorithm BRZOZOWSKI
An elegant but conceptually surprising algorithm uses reversal and determinization:
BRZOZOWSKI(M):
1. Compute the reverse NFA M^R (swap initial and accepting states, reverse all transitions).
2. Apply subset construction to M^R to get DFA D_1.
3. Compute the reverse NFA D_1^R.
4. Apply subset construction to D_1^R to get DFA D_2.
Return D_2.
Why does this work? The key insight is that the subset construction applied to an NFA produces an equivalent DFA in which every state is reachable (by definition of the construction). Applied to \(M^R\), the result \(D_1\) is a DFA for the reverse language \(L^R\). Reversing again and applying the subset construction gives a reachable DFA for \(L\) in which, additionally, all states are pairwise distinguishable (this follows from a careful analysis of what the double reversal does to the states). A reachable DFA in which all states are pairwise distinguishable is the minimal DFA.
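Brzozowski's algorithm is strikingly short given a determinization routine. A self-contained sketch (the representations are our own; `delta` maps (state, symbol) to state):

```python
def determinize(alphabet, ndelta, starts, finals):
    # Subset construction for an NFA given by ndelta: (state, symbol) -> set.
    q0 = frozenset(starts)
    trans, states, work = {}, {q0}, [q0]
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(t for q in S for t in ndelta.get((q, a), ()))
            trans[S, a] = T
            if T not in states:
                states.add(T)
                work.append(T)
    return states, trans, q0, {S for S in states if S & set(finals)}

def reverse_nfa(delta, start, finals):
    # Reverse every transition; accepting states become the start states,
    # and the old start state becomes the unique accepting state.
    ndelta = {}
    for (q, a), t in delta.items():
        ndelta.setdefault((t, a), set()).add(q)
    return ndelta, set(finals), {start}

def brzozowski(alphabet, delta, start, finals):
    # Determinize the reverse, then determinize the reverse of the result.
    d1, s1, f1 = reverse_nfa(delta, start, finals)
    _, t1, q1, F1 = determinize(alphabet, d1, s1, f1)
    d2, s2, f2 = reverse_nfa(t1, q1, F1)
    return determinize(alphabet, d2, s2, f2)
```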
State Complexity
One of the most fruitful directions in formal language theory is the study of state complexity: the number of states in the minimum DFA for a language obtained by applying an operation to regular languages.
Key state complexity results:
| Operation | State Complexity | Tight? |
|---|---|---|
| Complement | \(n\) | Yes |
| Union | \(mn\) | Yes |
| Intersection | \(mn\) | Yes |
| Concatenation | \(m \cdot 2^n - 2^{n-1}\) | Yes |
| Kleene star | \(\frac{3}{4} \cdot 2^n\) | Yes |
| Reversal | \(2^n\) | Yes |
| Morphism | \(m^n\) (roughly) | Yes |
The exponential blowup for reversal and NFA-to-DFA conversion is tight:
The witness: the NFA over \(\{0,1\}\) with states \(\{q_0, q_1, \ldots, q_{n-1}\}\) where \(q_0\) is the start, \(q_{n-1}\) is the only accepting state, 0-transitions shift states cyclically (\(q_i \xrightarrow{0} q_{(i+1) \bmod n}\)), and 1-transitions non-deterministically reset to \(q_0\) or shift. After the subset construction, all \(2^n\) subsets of states turn out to be reachable.
Partial Orders and Higman’s Lemma
The subsequence ordering is a powerful well-quasi-order on strings. Recall that \(x\) is a subsequence of \(y\) if \(x\) can be obtained from \(y\) by deleting (not necessarily contiguous) symbols: \(x = a_1 a_2 \cdots a_k\) is a subsequence of \(y\) if \(y = u_0 a_1 u_1 a_2 u_2 \cdots a_k u_k\) for some words \(u_i\).
Define the upward and downward closures of \(L\) under the subsequence order \(\preceq\):
\[L^{\uparrow} = \{ y : \exists x \in L, x \text{ is a subsequence of } y \}\]\[L^{\downarrow} = \{ y : \exists x \in L, y \text{ is a subsequence of } x \}\]Higman’s lemma states that \(\preceq\) is a well-quasi-order: every infinite sequence of words \(w_1, w_2, \ldots\) over a finite alphabet contains a good pair, i.e., indices \(i < j\) with \(w_i \preceq w_j\). The proof is by induction on \(|\Sigma|\).

Base case \(|\Sigma| = 1\): Words are of the form \(a^n\). Any infinite sequence of lengths contains a non-decreasing pair by the infinite pigeonhole principle. ✓
Inductive step: Let \(|\Sigma| = k \geq 2\). Suppose for contradiction that there exists a bad sequence: an infinite sequence \(w_1, w_2, \ldots\) with no \(i < j\) satisfying \(w_i \preceq w_j\). Choose a bad sequence that is minimal in the following sense: \(w_1\) is of minimum length among all first elements of bad sequences; given \(w_1\), \(w_2\) is of minimum length among all second elements of bad sequences starting with \(w_1\); and so on.
Fix a symbol \(a \in \Sigma\). For each \(w_i\), let \(w_i'\) be the word obtained by deleting the first occurrence of \(a\) in \(w_i\) (if \(w_i\) contains no \(a\), then \(w_i' = w_i\) and is over \(\Sigma \setminus \{a\}\)). By the inductive hypothesis applied to \(\Sigma \setminus \{a\}\) (a smaller alphabet), the subsequence \(w_{i_1}', w_{i_2}', \ldots\) of terms not containing \(a\) has a good pair.
For terms containing \(a\), write \(w_i = u_i a w_i''\). The words \(w_i''\) are strictly shorter than the corresponding \(w_i\), so by the minimality of our bad sequence, the sequence obtained by replacing these terms with the \(w_i''\) cannot itself be bad; combining a good pair among the \(w_i''\) with the structure of the prefixes \(u_i\) then produces a good pair among the original \(w_i\), contradicting badness.
A more careful argument (the minimal bad sequence technique of Nash-Williams) formalizes this by showing that a minimal bad sequence would force a bad sequence of strictly shorter terms, which is impossible since lengths are nonnegative integers.
A striking corollary: for any language \(L\) whatsoever, \(L^\uparrow\) is regular. By Higman’s lemma the set of \(\preceq\)-minimal elements of \(L\) is a finite antichain \(\{x_1, \ldots, x_k\}\), so \(L^\uparrow = x_1^\uparrow \cup \cdots \cup x_k^\uparrow\), where \(x_i^\uparrow = \{y : x_i \preceq y\}\). Each \(x_i^\uparrow\) is regular: it is recognized by an NFA that non-deterministically skips symbols to embed \(x_i\). A finite union of regular languages is regular.
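Membership in \(x^\uparrow\) is just a subsequence test, so upward closures of finite antichains are easy to decide. A sketch (helper names are ours):

```python
def embeds(x: str, y: str) -> bool:
    # x <= y in the subsequence order; greedy left-to-right matching.
    it = iter(y)
    return all(c in it for c in x)

def in_upward_closure(y: str, minimal_elements) -> bool:
    # y is in the upward closure iff some minimal element embeds into y.
    return any(embeds(x, y) for x in minimal_elements)
```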
Chapter 3: Context-Free Languages — Advanced Theory
Review: Grammars and Pushdown Automata
A context-free grammar (CFG) is a tuple \(G = (V, \Sigma, R, S)\) where \(V\) is the set of variables (nonterminals), \(\Sigma\) is the terminal alphabet, \(R\) is a finite set of productions \(A \to \alpha\) (with \(A \in V\) and \(\alpha \in (V \cup \Sigma)^*\)), and \(S \in V\) is the start symbol. The language \(L(G)\) consists of all terminal strings derivable from \(S\).
A pushdown automaton (PDA) is a finite automaton augmented with an unbounded stack. PDAs and CFGs are equiexpressive: a language is context-free iff it is accepted by some PDA.
Closure Properties
Context-free languages enjoy several closure properties, though notably they are not closed under intersection or complement.
- Union, concatenation, and Kleene star
- Morphisms (homomorphisms)
- Inverse morphisms
- Intersection with regular languages
- Reversal
The failure of closure under intersection is witnessed by \(L_1 = \{a^n b^n c^m : n, m \geq 0\}\) and \(L_2 = \{a^m b^n c^n : n, m \geq 0\}\), both context-free, but \(L_1 \cap L_2 = \{a^n b^n c^n : n \geq 0\}\) is not context-free (by the pumping lemma).
Inverse morphism closure: This is the most useful closure property in practice. Given a CFL \(L\) over \(\Delta\) and a morphism \(h : \Sigma^* \to \Delta^*\), we have \(h^{-1}(L) = \{w \in \Sigma^* : h(w) \in L\}\).
Pumping Lemmas
The classical pumping lemma for CFLs states that long enough strings have a “pumpable” substructure. However, the classical lemma can be hard to apply for certain non-CFLs. Two stronger versions exist.
Ogden’s Lemma: In the classical pumping lemma we have no control over where the pumped pieces fall; Ogden’s refinement lets us mark positions of the string that we require to participate in the pumping. Precisely: for every CFL \(L\) there is a constant \(N\) such that every \(z \in L\) in which at least \(N\) positions are marked can be written \(z = uvwxy\) where:
- \(vwx\) contains at most \(N\) marked positions
- \(vx\) contains at least one marked position
- \(uv^i wx^i y \in L\) for all \(i \geq 0\)
Ogden’s lemma subsumes the classical pumping lemma (mark all positions) and is strictly stronger. It is useful, for example, for proving that \(\{a^{p} b^q c^r : p \neq q,\ q \neq r,\ p \neq r\}\) is not context-free.
The Interchange Lemma is another strengthening, useful for proving non-CFL results when Ogden’s lemma fails. It exploits the structure of parse trees more carefully.
Parikh’s Theorem and Semilinear Sets
One of the most surprising results about context-free languages is that their “letter-counting” behavior is no more complex than that of regular languages. This is Parikh’s theorem.
For a word \(w\) over alphabet \(\{a_1, \ldots, a_k\}\), define the Parikh vector \(\Psi(w) = (|w|_{a_1}, \ldots, |w|_{a_k}) \in \mathbb{N}^k\) where \(|w|_{a_i}\) counts occurrences of \(a_i\) in \(w\). For a language \(L\), let \(\Psi(L) = \{\Psi(w) : w \in L\} \subseteq \mathbb{N}^k\).
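For a concrete illustration (our own example): the non-regular CFL \(\{a^n b^n\}\) and the regular language \((ab)^*\) have the same Parikh image, exactly as the theorem below predicts.

```python
from collections import Counter

def parikh(w: str, alphabet: str) -> tuple:
    # Psi(w): occurrence counts of each symbol, in alphabet order.
    counts = Counter(w)
    return tuple(counts[a] for a in alphabet)
```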
A set of the form \(\{v_0 + c_1 v_1 + \cdots + c_m v_m : c_1, \ldots, c_m \in \mathbb{N}\}\) for fixed vectors \(v_0, \ldots, v_m \in \mathbb{N}^k\) is called linear, and a finite union of linear sets is semilinear. Parikh’s theorem states that \(\Psi(L)\) is semilinear for every context-free language \(L\); equivalently, every CFL has the same Parikh image as some regular language. Semilinear sets are exactly the sets definable in Presburger arithmetic (the first-order theory of \((\mathbb{N}, +)\)), which is decidable.
Consequences of Parikh’s Theorem: Any language property that depends only on the Parikh image (letter counts) is inherited from regular languages by all CFLs. For example: if \(L\) is a CFL over a one-letter alphabet \(\{a\}\), then \(L\) is actually regular (since semilinear subsets of \(\mathbb{N}\) are unions of arithmetic progressions, and unions of arithmetic progressions over \(\{a\}^*\) are regular). This provides an easy test: \(L = \{a^{p^2} : p \in \mathbb{N}\}\) is not context-free because \(\{p^2 : p \in \mathbb{N}\}\) is not semilinear (not a finite union of arithmetic progressions).
Deterministic Context-Free Languages
The DCFLs are the languages recognized by deterministic pushdown automata (DPDAs). Determinism for pushdown automata requires more careful formulation than for finite automata: a PDA is deterministic if, for every state \(q\), input symbol \(a \in \Sigma\), and stack symbol \(A\),
- \(\delta(q, a, A)\) contains at most one element (at most one transition), and
- If \(\delta(q, \varepsilon, A) \neq \emptyset\), then \(\delta(q, a, A) = \emptyset\) for all \(a \in \Sigma\) (no choice between \(\varepsilon\)-move and reading a symbol).
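The two conditions are mechanical enough to check by program. Here is a minimal sketch; the encoding of the transition table (a dict keyed by state, input symbol or `None` for ε, and stack top) is an assumption of this illustration, not the textbook's notation.

```python
def is_deterministic(delta):
    """Check the two DPDA determinism conditions on a transition table.
    delta maps (state, input_symbol_or_None, stack_top) to a set of moves;
    input None encodes an ε-move."""
    # Condition 1: at most one move per (state, input, stack-top) triple.
    if any(len(moves) > 1 for moves in delta.values()):
        return False
    # Condition 2: an ε-move from (q, A) excludes every reading move (q, a, A).
    eps_pairs = {(q, A) for (q, a, A), moves in delta.items()
                 if a is None and moves}
    return not any((q, A) in eps_pairs and a is not None and moves
                   for (q, a, A), moves in delta.items())

# a DPDA fragment for {a^n b^n : n >= 1}; moves are (new_state, stack_action)
delta_ok = {("q0", "a", "Z"): {("q0", "push A")},
            ("q0", "a", "A"): {("q0", "push A")},
            ("q0", "b", "A"): {("q1", "pop")},
            ("q1", "b", "A"): {("q1", "pop")}}
```

Adding an ε-move on a pair \((q, A)\) that already has reading moves makes the checker reject, which is exactly the second condition.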
Every regular language is a DCFL. The language \(\{a^n b^n : n \geq 0\}\) is a DCFL but not regular. The language \(\{ww^R : w \in \{a,b\}^*\}\) (even palindromes) is a CFL but not a DCFL.
Closure Under Complementation
The most important closure property of DCFLs, and one that sharply distinguishes them from general CFLs, is closure under complementation.
This is surprising: the CFLs are not closed under complementation (since they’re not closed under intersection, and \(A \cap B = \overline{\overline{A} \cup \overline{B}}\)), but the deterministic subclass is.
Lemma 3.7 (Well-formed DPDA): Every DCFL is accepted by a DPDA in normal form: one that (i) never loops, (ii) always reads the entire input, and (iii) accepts or rejects only at the end of input.
Given a normal-form DPDA \(M\) for \(L\), the complement \(\overline{L}\) is accepted by \(M\) with the accepting states swapped: \(F' = Q \setminus F\). Since \(M\) always reads the entire input and always terminates, this DFA-like trick works.
The normalization requires care. A DPDA might loop via \(\varepsilon\)-transitions. The key insight: if the DPDA takes more than \(|Q| \cdot |\Gamma|\) consecutive \(\varepsilon\)-transitions without the stack height ever dropping below its starting level, then by the pigeonhole principle some pair (state, top stack symbol) repeats at a matching stack height, so the DPDA is in a loop it can never escape. Such loops can be detected and replaced by explicit reject transitions.
To ensure the DPDA reads the entire input: add a “dead” state that absorbs all input after the DPDA would normally halt, marking the word as rejected only at the true end of input.
With these normalizations in place, the complement DPDA is well-defined.
Myhill-Nerode Characterization of DCFLs
Just as DFAs have a Myhill-Nerode characterization, DPDAs have an analogous one, though more complex. The right notion is that of a finite index for a certain equivalence on words:
A more concrete consequence:
Since there are infinitely many values of \(i\) but only finitely many states, by pigeonhole there exist \(i < j\) with \(q_i = q_j\). From configuration \((q_i, \gamma_i)\), reading \(b a^i\) must accept (the input spells the palindrome \(a^i b a^i\)); from \((q_j, \gamma_j)\), reading \(b a^j\) must accept \(a^j b a^j\). But the finite control is in the same state in both configurations, so the machine can distinguish the two situations only through its stack, and determinism forces it to consult the stack in the same way in both runs. A crossing-sequence-style argument then shows the machine must either accept the non-palindrome \(a^i b a^j\) or reject the palindrome \(a^i b a^i\) — a contradiction.
A cleaner formal argument uses the fact that the Nerode-like classes of PAL are infinite in number for any DPDA-based equivalence, contradicting the finite-control bound.
Chapter 4: Parsing
Introduction to Parsing
Parsing is the process of determining whether a string belongs to a language described by a CFG, and if so, constructing its parse tree (or trees, if the grammar is ambiguous). The central challenge in compiler construction is building efficient parsers for programming language grammars.
The classic example is \[L = \{a^i b^j c^k : i = j \text{ or } j = k\}.\] This language is inherently ambiguous: every CFG for it derives some strings (those with \(i = j = k\)) in more than one way.
Earley’s Algorithm
Before studying specialized parsers (LL, LR), it is instructive to consider the general-purpose Earley algorithm, which handles any context-free grammar.
Items: An Earley item is a dotted production \([A \to \alpha \bullet \beta, i]\) meaning: we are trying to derive from \(A\), have already seen \(\alpha\) starting at position \(i\) in the input, and still need to see \(\beta\).
Chart: The chart \(C[j]\) is the set of all items generated after reading the first \(j\) symbols of the input \(w = a_1 \cdots a_n\).
EARLEY(G, w = a_1 a_2 ... a_n):
Initialize C[0] = {[S' -> . S, 0]} (augmented start rule)
For j = 0 to n:
Repeat until no change:
PREDICTOR: For each [A -> α . B β, i] in C[j] and each rule B -> γ:
Add [B -> . γ, j] to C[j]
COMPLETER: For each [B -> γ ., i] in C[j] and each [A -> α . B β, k] in C[i]:
Add [A -> α B . β, k] to C[j]
SCANNER: For each [A -> α . a_{j+1} β, i] in C[j]:
Add [A -> α a_{j+1} . β, i] to C[j+1]
Accept if [S' -> S ., 0] ∈ C[n]
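The pseudocode above transcribes into a short Python recognizer. The dict-based grammar encoding (nonterminal → list of right-hand-side tuples) and the augmented start name `"S@"` are choices of this sketch; the fixed-point loop makes Predictor and Completer interact correctly even for ε-productions.

```python
def earley_parse(grammar, start, w):
    """Earley recognizer.  grammar maps each nonterminal to a list of
    right-hand sides (tuples of symbols); a symbol is a nonterminal iff it
    is a key of grammar.  An item is (lhs, rhs, dot, origin)."""
    n = len(w)
    chart = [set() for _ in range(n + 1)]
    START = "S@"                             # augmented start symbol
    chart[0].add((START, (start,), 0, 0))
    for j in range(n + 1):
        changed = True
        while changed:                       # PREDICTOR/COMPLETER to fixpoint
            changed = False
            for lhs, rhs, dot, i in list(chart[j]):
                if dot < len(rhs) and rhs[dot] in grammar:        # PREDICTOR
                    for gamma in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(gamma), 0, j)
                        if item not in chart[j]:
                            chart[j].add(item); changed = True
                elif dot == len(rhs):                             # COMPLETER
                    for lhs2, rhs2, dot2, k in list(chart[i]):
                        if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                            item = (lhs2, rhs2, dot2 + 1, k)
                            if item not in chart[j]:
                                chart[j].add(item); changed = True
        if j < n:                                                 # SCANNER
            for lhs, rhs, dot, i in chart[j]:
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == w[j]:
                    chart[j + 1].add((rhs and lhs, rhs, dot + 1, i)[0:1][0] if False else (lhs, rhs, dot + 1, i))
    return (START, (start,), 1, 0) in chart[n]
```

For the ambiguous grammar \(S \to S + S \mid a\), the recognizer accepts `a+a+a` (in any of its parse orders) and rejects malformed inputs.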
Complexity: Earley’s algorithm runs in \(O(n^3)\) time for general CFGs, \(O(n^2)\) for unambiguous grammars, and \(O(n)\) for LR(k) grammars. This makes it universal while remaining competitive with specialized parsers on restricted grammar classes.
LL(1) Grammars
An LL(1) grammar is one that can be parsed top-down with a single symbol of lookahead, scanning input Left to right and producing a Leftmost derivation.
FIRST and FOLLOW Sets
The intuition: \(\text{FIRST}(\alpha)\) is the set of terminals that can begin a string derived from \(\alpha\); \(\text{FOLLOW}(A)\) is the set of terminals that can appear immediately after an occurrence of \(A\) in any sentential form.
Algorithm to compute FIRST:
COMPUTE-FIRST:
Initialize FIRST(a) = {a} for each a ∈ Σ, FIRST(ε) = {ε}
For each variable A: FIRST(A) = ∅
Repeat until no change:
For each production A -> X_1 X_2 ... X_k:
Add FIRST(X_1) \ {ε} to FIRST(A)
If ε ∈ FIRST(X_1): Add FIRST(X_2) \ {ε} to FIRST(A)
If ε ∈ FIRST(X_1) ∩ FIRST(X_2): Add FIRST(X_3) \ {ε} to FIRST(A)
...
If ε ∈ FIRST(X_1) ∩ ... ∩ FIRST(X_k): Add ε to FIRST(A)
Algorithm to compute FOLLOW:
COMPUTE-FOLLOW:
Initialize FOLLOW(S) = {$}, FOLLOW(A) = ∅ for all other A
Repeat until no change:
For each production A -> α B β:
Add FIRST(β) \ {ε} to FOLLOW(B)
If ε ∈ FIRST(β): Add FOLLOW(A) to FOLLOW(B)
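Both fixed-point computations above fit in a few lines of Python. In this sketch (the encoding is an assumption: right-hand sides are tuples, `()` is ε, and the empty string stands for ε inside FIRST sets), a helper computes FIRST of a symbol sequence, which both loops need.

```python
EPS = ""   # stands for ε inside FIRST sets

def first_of_seq(seq, FIRST):
    """FIRST of a sequence of grammar symbols X1 ... Xk."""
    out = set()
    for X in seq:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    out.add(EPS)                      # every Xi can vanish
    return out

def compute_first_follow(grammar, start, terminals):
    """Fixed-point computation of FIRST and FOLLOW.  grammar maps each
    nonterminal to a list of right-hand sides (tuples); () is ε."""
    FIRST = {t: {t} for t in terminals}
    FIRST.update({A: set() for A in grammar})
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                f = first_of_seq(rhs, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add("$")            # $ marks the end of input
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in grammar:
                        continue                       # terminals get no FOLLOW
                    f = first_of_seq(rhs[i + 1:], FIRST)
                    add = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FIRST, FOLLOW
```

On the standard expression fragment \(E \to T\,E'\), \(E' \to +\,T\,E' \mid \varepsilon\), \(T \to a\) this yields \(\text{FIRST}(E') = \{+, \varepsilon\}\) and \(\text{FOLLOW}(T) = \{+, \$\}\).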
The LL(1) Parsing Table
A grammar is LL(1) iff for every pair of distinct productions \(A \to \alpha\) and \(A \to \beta\):
- \(\text{FIRST}(\alpha) \cap \text{FIRST}(\beta) = \emptyset\), and
- If \(\varepsilon \in \text{FIRST}(\alpha)\), then \(\text{FIRST}(\beta) \cap \text{FOLLOW}(A) = \emptyset\).
The LL(1) parsing table drives a stack-based predictive parser: the stack initially holds \(S\$\). At each step, if the stack top is a terminal \(a\) matching the current input symbol, pop and advance. If the stack top is a variable \(A\), look up \(T[A, a]\) and push the right-hand side in reverse order.
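The driver loop just described is tiny. Below is a minimal sketch; the hand-built table for the toy grammar \(S \to aSb \mid \varepsilon\) (language \(\{a^n b^n\}\)) is an assumption of this illustration, not a table the text constructs.

```python
def predictive_parse(table, start, w, nonterminals):
    """Table-driven LL(1) parse.  table[(A, a)] is the RHS tuple to expand
    A by on lookahead a; missing entries signal a parse error."""
    stack = ["$", start]
    input_ = list(w) + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        a = input_[pos]
        if top == a:                     # terminal (or $) matches lookahead
            pos += 1
        elif top in nonterminals:
            if (top, a) not in table:
                return False             # error entry: reject
            stack.extend(reversed(table[(top, a)]))   # push RHS in reverse
        else:
            return False                 # terminal mismatch
    return pos == len(input_)

# toy LL(1) grammar: S -> a S b | ε,  FOLLOW(S) = {b, $}
table = {("S", "a"): ("a", "S", "b"),
         ("S", "b"): (),
         ("S", "$"): ()}
```

Pushing the right-hand side in reverse keeps the leftmost symbol on top, so the parser traces out a leftmost derivation, as the L in LL promises.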
Removing Left Recursion
A grammar with a left-recursive rule \(A \to A\alpha \mid \beta\) cannot be LL(1), because every terminal in \(\text{FIRST}(\beta)\) would select both productions in the row for \(A\). Left recursion is eliminated by the following transformation: replace \(A \to A\alpha_1 \mid \cdots \mid A\alpha_m \mid \beta_1 \mid \cdots \mid \beta_k\) (where no \(\beta_i\) begins with \(A\)) by
\[A \to \beta_1 A' \mid \cdots \mid \beta_k A'\]
\[A' \to \alpha_1 A' \mid \cdots \mid \alpha_m A' \mid \varepsilon\]
For indirect left recursion (\(A \to B\alpha,\ B \to A\beta\)), a systematic elimination algorithm orders the variables and eliminates direct left recursion for each in turn.
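The direct-elimination transformation is purely syntactic, so it is easy to mechanize. A minimal sketch, assuming right-hand sides are tuples of symbols, `()` denotes ε, and the fresh nonterminal is named by appending an apostrophe:

```python
def remove_left_recursion(rules, A):
    """Eliminate direct left recursion for nonterminal A.
    rules: the right-hand sides of A, as tuples of symbols.
    Returns (new rules for A, rules for the fresh nonterminal A')."""
    rec  = [rhs[1:] for rhs in rules if rhs and rhs[0] == A]    # A -> A α
    base = [rhs for rhs in rules if not rhs or rhs[0] != A]     # A -> β
    Ap = A + "'"
    new_A  = [beta + (Ap,) for beta in base]                    # A  -> β A'
    new_Ap = [alpha + (Ap,) for alpha in rec] + [()]            # A' -> α A' | ε
    return new_A, new_Ap

# E -> E + T | T   becomes   E -> T E',   E' -> + T E' | ε
```

The transformed grammar generates the same language but builds the \(\alpha\)'s iteratively to the right instead of recursively to the left.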
Left Factoring
Two productions sharing a common prefix, \[A \to \alpha\beta_1 \mid \alpha\beta_2,\] are left-factored into \[A \to \alpha A', \quad A' \to \beta_1 \mid \beta_2.\] After left-recursion removal and left-factoring, many natural grammars become LL(1). However, no inherently ambiguous language has an LL(1) grammar, and some unambiguous grammars are not LL(1) even after these transformations.
LR(0) Grammars
LR parsing scans input Left to right and produces a Rightmost derivation in reverse. LR parsers are more powerful than LL parsers: every LL(k) grammar is LR(k), but not conversely.
Handles and Viable Prefixes
Intuitively, the handle of a right sentential form is the occurrence of the right-hand side of the production applied in the last step of the rightmost derivation; reducing the handle undoes that step. Bottom-up (LR) parsing works by repeatedly locating the handle and reducing it.
The set of viable prefixes is always a regular language — this is the key theorem underlying LR parsing.
LR(0) Items and Knuth’s Construction
An LR(0) item is a production with a dot indicating how much of the right-hand side has been seen: \([A \to \alpha \bullet \beta]\).
Knuth’s NFA-\(\varepsilon\): The LR(0) items form the states of an NFA recognizing the viable prefixes:
- Start state: \([S' \to \bullet S]\) (augmented grammar start item)
- Shift transitions: \([A \to \alpha \bullet a \beta] \xrightarrow{a} [A \to \alpha a \bullet \beta]\) for terminals \(a\)
- Goto transitions: \([A \to \alpha \bullet B \beta] \xrightarrow{B} [A \to \alpha B \bullet \beta]\) for nonterminals \(B\)
- \(\varepsilon\)-transitions: \([A \to \alpha \bullet B \beta] \xrightarrow{\varepsilon} [B \to \bullet \gamma]\) for each production \(B \to \gamma\) (prediction)
\((\Leftarrow)\): If the NFA accepts \(\gamma\), trace through the accepting path. Each transition corresponds to a step in a rightmost derivation. By induction, the accepting run corresponds to a right sentential form \(\gamma w\) where the dot marks exactly where we are in parsing.
\((\Rightarrow)\): Given a viable prefix \(\gamma\) (from a right sentential form \(\gamma w\)), the sequence of handles that appear in constructing the derivation corresponds to a path in the NFA from the start state to some item with dot at the right position. The \(\varepsilon\)-transitions handle the prediction steps.
Knuth DFA: Apply the standard subset construction to the NFA-\(\varepsilon\). The resulting DFA has states that are sets of LR(0) items (called LR(0) item sets or canonical collection). This DFA is the LR(0) automaton.
LR(0) Parsing Table: From the DFA, build the action/goto table:
- Shift: If item set \(I\) has \([A \to \alpha \bullet a \beta]\), then on input \(a\), shift and go to the item set reached by the DFA on \(a\) from \(I\).
- Reduce: If item set \(I\) contains a completed item \([A \to \alpha \bullet]\), then reduce by \(A \to \alpha\) — LR(0) performs this reduction regardless of the next input symbol (no lookahead is used).
- Accept: If item set \(I\) has \([S' \to S \bullet]\), accept.
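The closure/goto construction behind the table is short enough to show in full. A minimal sketch, assuming grammars are dicts mapping nonterminals to lists of right-hand-side tuples and using `"S@"` as an arbitrary augmented start name:

```python
def closure(items, grammar):
    """ε-closure of a set of LR(0) items; an item is (lhs, rhs, dot)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:  # dot before a nonterminal
                for gamma in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(gamma), 0)  # predicted item
                    if item not in items:
                        items.add(item); changed = True
    return frozenset(items)

def goto(I, X, grammar):
    """DFA transition: advance the dot over grammar symbol X, then close."""
    return closure({(l, r, d + 1) for l, r, d in I if d < len(r) and r[d] == X},
                   grammar)

def canonical_collection(grammar, start):
    """States and edges of Knuth's LR(0) DFA via subset construction."""
    I0 = closure({("S@", (start,), 0)}, grammar)        # augmented start item
    states, edges, work = {I0}, {}, [I0]
    while work:
        I = work.pop()
        for X in {r[d] for l, r, d in I if d < len(r)}:
            J = goto(I, X, grammar)
            edges[(I, X)] = J
            if J not in states:
                states.add(J); work.append(J)
    return states, edges
```

For the grammar \(S \to (S) \mid x\) this produces the six-state LR(0) automaton, including the accept state containing \([S' \to S \bullet]\).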
LR(0) grammars are a restricted class; more powerful are SLR(1), LALR(1), and LR(1) grammars (which use 1 symbol of lookahead in various ways). LALR(1) is the basis of the widely-used YACC/Bison parser generator.
Chapter 5: The Chomsky Hierarchy and Undecidability
Unrestricted Grammars and Turing Machines
We now ascend to the top of the Chomsky hierarchy. An unrestricted grammar (Type 0 grammar) has productions \(\alpha \to \beta\) where \(\alpha \in (V \cup \Sigma)^*\) with \(\alpha\) containing at least one variable, and \(\beta \in (V \cup \Sigma)^*\). There is no restriction on the form of productions.
The proof goes through Turing machines:
Grammar simulates TM: Given a TM \(M\), construct a grammar \(G\) with variables encoding TM configurations. Productions simulate TM transitions forward and also “guess” the input (using a nondeterministic grammar). The grammar derives a terminal string iff the TM accepts it.
TM simulates grammar: Given an unrestricted grammar \(G\), a nondeterministic TM enumerates all derivations of \(G\) (breadth-first) and accepts whenever a derivation of the input is found.
Context-Sensitive Languages and Linear Bounded Automata
Between context-free and recursively enumerable languages sits the class of context-sensitive languages (CSLs), defined by context-sensitive grammars (CSGs): every production \(\alpha \to \beta\) satisfies \(|\alpha| \leq |\beta|\) (productions are non-length-decreasing), with the single exception \(S \to \varepsilon\) permitted when \(S\) does not occur on any right-hand side.
The machine model for CSLs is the linear bounded automaton (LBA): a nondeterministic TM whose working tape is bounded by a linear function of the input length (typically, the tape is just the input tape).
An important consequence: CSLs are decidable. Given an LBA \(M\) and input \(w\), the number of distinct configurations on tape of length \(|w|\) is finite: \(|Q| \cdot |w| \cdot |\Gamma|^{|w|}\). We can enumerate all configurations and decide in finite time whether any accepting configuration is reachable. Hence:
The Chomsky Hierarchy
The four language classes form the Chomsky hierarchy:
| Type | Grammar | Machine | Closed Under |
|---|---|---|---|
| 3 (Regular) | Right-linear or left-linear | DFA / NFA | Boolean ops, concat, *, morphism, inverse morphism |
| 2 (CFL) | Context-free | PDA | Union, concat, *, morphism, inverse morphism, ∩ regular |
| 1 (CSL) | Context-sensitive | LBA | Boolean ops (all three), concat, *, ε-free morphism |
| 0 (RE) | Unrestricted | TM (nondeterministic) | Union, concat, *, morphism, intersection |
Key separations: \(\text{Regular} \subsetneq \text{CFL} \subsetneq \text{CSL} \subsetneq \text{Recursive} \subsetneq \text{RE} \subsetneq \text{All languages}\).
Each containment is strict. The separation between CSL and Recursive deserves mention: by the space hierarchy theorem there exist decidable languages that are not context-sensitive; explicit examples are obtained by diagonalizing against all LBAs rather than by simple counting languages (note that even \(\{a^{2^n} : n \geq 0\}\) is context-sensitive).
The Immerman-Szelepcsényi Theorem
A striking result: unlike CFLs, CSLs are closed under complementation. This was an open problem for nearly 20 years.
\(M'\) must accept \(w\) iff \(M\) does not accept \(w\). The difficulty: \(M'\) cannot simply simulate \(M\) and invert, because \(M\) is nondeterministic — “does not accept” means no computation path accepts, not just that one path rejects.
The algorithm uses inductive counting of reachable configurations:
Let \(C_0 = \{c_{\text{init}}\}\), where \(c_{\text{init}}\) is the initial configuration of \(M\) on \(w\). Define \(C_i\) as the set of configurations reachable from \(c_{\text{init}}\) in at most \(i\) steps. The key insight is:
- Compute \(|C_i|\) inductively: given the knowledge that \(|C_{i-1}| = k_{i-1}\), a nondeterministic LBA can count the number of configurations reachable in \(i\) steps, using the following argument:
For each candidate configuration \(c\), enumerate all configurations \(c'\) and nondeterministically verify membership of \(c'\) in \(C_{i-1}\) (by guessing a path of length \(\leq i-1\) from \(c_{\text{init}}\)); a counter checks that all \(k_{i-1}\) members of \(C_{i-1}\) were seen, which certifies that the remaining \(c'\) are *not* in \(C_{i-1}\). Then \(c \in C_i\) iff some verified \(c'\) equals \(c\) or steps to \(c\). Counting the candidates \(c\) that pass yields \(k_i = |C_i|\).
- Using this count, we can decide: is any accepting configuration in \(\bigcup_i C_i\)? If not, accept \(w\) as being in \(\overline{L}\).
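The counting rounds can be seen concretely on an explicit digraph. This is a toy model only: a real LBA stores just the *count* from the previous round and re-verifies membership by guessing paths, whereas here the reachable set is kept explicitly so the rounds are easy to inspect.

```python
def inductive_counting(adj, start):
    """Toy model of inductive counting: compute |C_i|, the number of
    configurations (here: graph nodes) reachable from `start` in at most
    i steps, one round at a time, until the count stabilizes."""
    C = {start}
    counts = [len(C)]
    while True:
        nxt = set(C)
        for u in C:
            nxt.update(adj.get(u, []))   # one more step from every known node
        if len(nxt) == len(C):           # count stabilized: all of C_∞ found
            break
        C = nxt
        counts.append(len(C))
    return counts, C

# configuration graph with an unreachable component {3, 4}
adj = {0: [1], 1: [2], 2: [0], 3: [4]}
counts, reach = inductive_counting(adj, 0)
# a verifier for "node t is NOT reachable" checks that it has certified all
# len(reach) reachable nodes and that t was not among them
```

The stabilized count is exactly what lets a nondeterministic machine *prove a negative*: having exhibited all `len(reach)` reachable configurations, it knows every other configuration (in particular, every accepting one) is unreachable.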
The total number of configurations is at most \(N = |Q| \cdot |w| \cdot |\Gamma|^{|w|}\), which is bounded exponentially in \(|w|\) — but the tape needed to write down indices and counts is only polynomial in \(|w|\) (since indices fit in \(O(\log N) = O(|w|)\) space). Hence the counting procedure itself fits on a linear tape.
This shows \(\overline{L}\) is also accepted by an LBA, hence is a CSL.
The Immerman-Szelepcsényi theorem was proved independently by Immerman and Szelepcsényi in 1987, and each received the Gödel Prize in 1995. In complexity theory, the result is equivalent to \(\text{NSPACE}(f(n)) = \text{co-NSPACE}(f(n))\) for space-constructible \(f(n) \geq \log n\) — in particular, NL = co-NL.
Kolmogorov Complexity
Kolmogorov complexity provides an information-theoretic measure of the “complexity” of individual strings. It is a powerful tool for proving lower bounds.
Key facts:
Incompressibility: For each \(n\), there is at least one string of length \(n\) with \(K(x) \geq n\) (there are \(2^n\) strings but only \(2^n - 1\) programs of length \(< n\)); more generally, at least half of all strings of length \(n\) satisfy \(K(x) \geq n - 1\).
Invariance: For any two universal TMs \(U, U'\), \(K_U(x) \leq K_{U'}(x) + c\) for some constant \(c\) depending only on \(U, U'\) (not on \(x\)).
Uncomputability: \(K(x)\) is not computable (by a diagonalization argument).
The incompressibility method is a non-constructive proof technique: assume a string \(x\) of high Kolmogorov complexity satisfies some property (by the incompressibility theorem, such strings exist), and derive a contradiction from the structure that property imposes. This is used to prove lower bounds on communication complexity, circuit size, and pumping-like lemmas for specific language classes.
Undecidability Results for CFLs
Several natural decision problems about CFLs are undecidable. The proofs use reductions from the Post Correspondence Problem (PCP).
A computation history is a sequence of configurations \(C_1 \# C_2 \# \cdots \# C_t\) where \(C_1\) is the initial configuration and each \(C_{i+1}\) follows from \(C_i\) by one TM step. A valid computation history is one that ends in an accepting configuration.
Encode: pairs in the PCP instance represent partial matches between successive configurations (with appropriate shifts for the TM head position). A solution to the PCP instance corresponds exactly to building a valid computation history of \(M\) on \(w\). Hence PCP has a solution iff \(M\) accepts \(w\).
The PCP undecidability is used as a stepping stone to prove that the following problems are undecidable:
- Given CFGs \(G_1, G_2\), is \(L(G_1) \cap L(G_2) = \emptyset\)?
- Given a CFG \(G\), is \(G\) ambiguous?
- Given CFGs \(G_1, G_2\), is \(L(G_1) = L(G_2)\)?
- Given a CFG \(G\), is \(L(G) = \Sigma^*\)?
Proof of (2): From a PCP instance with pairs \((x_i, y_i)\), \(1 \leq i \leq n\), build the grammar \(S \to S_x \mid S_y\), \(S_x \to x_i S_x c_i \mid x_i c_i\), \(S_y \to y_i S_y c_i \mid y_i c_i\), where \(c_1, \ldots, c_n\) are fresh index symbols. Each of \(S_x, S_y\) alone is unambiguous, so a string with two parse trees is exactly one derivable from both branches — that is, one whose index sequence is a PCP solution. Hence the grammar is ambiguous iff the PCP instance has a solution.
For (4): one can construct a CFG generating all strings that fail to be valid accepting computation histories of \(M\) on \(w\) (with every other configuration written reversed, any single defect — bad start, bad end, or a step that does not follow — is checkable by a PDA that guesses where the defect lies). This grammar generates \(\Sigma^*\) iff \(M\) does not accept \(w\), and (3) follows by comparing against a fixed grammar for \(\Sigma^*\).
PSPACE-Completeness of Regular Expression Universality
Even for regular languages, some problems are computationally hard. The following problems are PSPACE-complete:
- Given a regular expression \(r\), is \(L(r) = \Sigma^*\)?
- Given an NFA \(M\), is \(L(M) = \Sigma^*\)?
The key construction is the following:
We can write the set of all invalid computation histories as a union of regular languages, each capturing one type of invalidity: a bad start, a bad end, or two successive configurations that do not follow by one step of \(M\). Because each configuration has polynomially bounded length and a TM step changes the tape only locally, each defect is detected by a polynomial-size automaton scanning a bounded window of adjacent symbols. The union of these automata gives a single polynomial-size regular expression \(r_{M,w}\) capturing all invalid computation histories.
Since “valid accepting computation” is equivalent to “not an invalid computation history,” we get \(L(r_{M,w}) = \Sigma^*\) iff no valid accepting computation exists iff \(M\) does not accept \(w\).
The PSPACE-hardness follows because, for any polynomial-space machine \(M\) and input \(w\), the expression \(r_{M,w}\) can be constructed in polynomial time; its universality decides non-acceptance of \(w\), and since PSPACE is closed under complement, this reduction establishes PSPACE-hardness of universality.
Membership in PSPACE: non-universality can be decided in nondeterministic polynomial space by guessing a witness string symbol by symbol while maintaining only the set of NFA states reachable on the portion guessed so far. By Savitch’s theorem (\(\text{NPSPACE} = \text{PSPACE}\)) and closure of PSPACE under complement, universality itself is in PSPACE.
Chapter 6: Deterministic Context-Free Languages — Further Properties
Normal Forms
Every DCFL has a representation in a clean normal form, analogous to Chomsky Normal Form for CFGs.
Theorem: Every DCFL is accepted by a DPDA whose \(\varepsilon\)-moves only pop the stack (never push). Truly real-time DPDAs — those with no \(\varepsilon\)-moves at all — accept a strictly smaller class, so this pop-only normal form is the best DCFL analogue of trimming \(\varepsilon\)-transitions.
Further, DCFLs admit a “two-way” normal form: the input is read in two passes (forward and backward). This is used in the study of complexity properties of DCFLs.
Relationship to LR(0) Grammars
The relationship between DCFLs and LR-parseable grammars is tight:
A language is an LR(0) language iff it has an LR(0) grammar, and the LR(0) languages form a proper subclass of the DCFLs: a DCFL is LR(0) iff it has the prefix property (no word of the language is a proper prefix of another word of the language). For DCFLs in general, the right notion is LR(1).
The LR(1) condition uses one symbol of lookahead in the reduce action: instead of reducing regardless of the next symbol (LR(0)), the parser reduces only on lookaheads consistent with the specific derivation context. Knuth proved that a language has an LR(k) grammar for some \(k\) iff it is a DCFL, in which case it already has an LR(1) grammar.
Closure Properties of DCFLs
- Complement (Theorem 3.6)
- Inverse morphism
- Intersection with regular languages
The failure of closure under union shows DCFLs are not Boolean-closed: \(L_1 = \{a^n b^n c^m : n, m \geq 0\}\) and \(L_2 = \{a^m b^n c^n : n, m \geq 0\}\) are both DCFLs, but \(L_1 \cup L_2\) is not (it is a CFL, but inherently ambiguous, hence not deterministic).
Cook’s Theorem on 2DPDAs
A two-way deterministic pushdown automaton (2DPDA) is a DPDA whose input head can move bidirectionally. Cook’s theorem gives a remarkable simulation result: every language accepted by a 2DPDA can be recognized in linear time on a random-access machine, even though the 2DPDA itself may run for exponentially many steps.
The key technique is memoization of surface configurations (state, head position, top stack symbol): the simulator tabulates, for each surface configuration, the outcome of the computation segment that next pops the stack below its current level, so each segment is simulated only once. A famous byproduct: string matching is solvable by a 2DPDA, so Cook’s theorem implied a linear-time string-matching algorithm — the observation that inspired Knuth-Morris-Pratt.
Chapter 7: Other Language Classes
L-Systems
Lindenmayer systems (L-systems) are string-rewriting systems introduced by biologist Aristid Lindenmayer in 1968 to model the growth of multicellular organisms. Unlike Chomsky grammars, L-systems apply all productions simultaneously (in parallel) rather than one at a time.
Example: The morphism \(h : 0 \mapsto 01, 1 \mapsto 10\) with axiom \(w = 0\) generates the Thue-Morse sequence in the limit. After \(n\) steps: \(h^n(0)\).
The “D” in D0L stands for Deterministic (each symbol has exactly one production), the “0” (zero) indicates zero-sided context — each symbol is rewritten independently of its neighbors — and “L” for Lindenmayer.
The D0L Sequence Equivalence Problem
Given two D0L systems \((h, w)\) and \((h', w')\), do they generate the same sequence of words, i.e., is \(h^n(w) = h'^n(w')\) for all \(n\)? This problem is decidable — proved by Culik and Fris (1977) in a landmark result. However, the proof is highly non-trivial. The language-theoretic version (equality of the generated sets of words) is also decidable.
L-Systems and Computer Graphics
L-systems have found extensive application in computer graphics for generating fractal-like structures (plants, trees, coastlines). The turtle graphics interpretation assigns geometric meaning to symbols: \(F\) = move forward, \(+\) = turn left, \(-\) = turn right. Applying the morphism repeatedly generates increasingly detailed patterns, such as:
- Koch curve: \(F \mapsto F+F--F+F\)
- Sierpinski triangle: \(A \mapsto B+A+B, B \mapsto A-B-A\)
- Dragon curve: various morphisms on pairs of symbols
The book by Prusinkiewicz and Lindenmayer, The Algorithmic Beauty of Plants (Springer, 1990), is the definitive reference on L-systems in graphics.
Rational Series
A formal power series over a semiring \(\mathbb{K}\) assigns a coefficient to every word; we write \[S = \sum_{w \in \Sigma^*} (S, w) \cdot w\] where \((S, w) \in \mathbb{K}\) is the coefficient of \(w\).
Rational series generalize regular languages: a language \(L\) corresponds to the characteristic series \(\mathbf{1}_L\) (coefficient 1 for words in \(L\), 0 otherwise), and \(\mathbf{1}_L\) is rational over \(\mathbb{B} = (\{0,1\}, \vee, \wedge)\) iff \(L\) is regular.
The Theorem of Schützenberger
Schützenberger’s theorem states that a series is rational iff it is recognizable: there exist \(\lambda \in \mathbb{K}^{1 \times n}\), \(\gamma \in \mathbb{K}^{n \times 1}\), and a monoid morphism \(\mu : \Sigma^* \to \mathbb{K}^{n \times n}\) with \((S, w) = \lambda \, \mu(w) \, \gamma\). This triple \((\lambda, \mu, \gamma)\) is the linear representation of the rational series — the formal-language analogue of a linear recurrence for sequences over \(\mathbb{K}\).
Examples and Applications
Counting paths: If \(G\) is a directed graph with adjacency matrix \(A\), then \((A^k)_{ij}\) counts paths of length \(k\) from node \(i\) to node \(j\). This is a rational series over \((\mathbb{N}, +, \times)\).
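The path-counting claim is a two-line computation. A minimal sketch with plain list-of-lists matrices (no external libraries assumed):

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def path_counts(A, k):
    """k-th power of the adjacency matrix A; entry (i, j) counts the
    directed paths of length exactly k from node i to node j."""
    n = len(A)
    P = [[int(i == j) for j in range(n)] for i in range(n)]   # identity = A^0
    for _ in range(k):
        P = mat_mul(P, A)
    return P
```

On the directed 3-cycle \(0 \to 1 \to 2 \to 0\), the cube of the adjacency matrix is the identity: each node has exactly one length-3 path, back to itself.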
Probabilistic automata: A probabilistic finite automaton (PFA) is a WFA over the semiring of probabilities. The series \((S, w)\) is the probability that the automaton accepts \(w\). PFAs can define non-regular languages via cut-points: the set \(\{w : (S,w) > \lambda\}\) need not be regular when the cut-point \(\lambda\) is not isolated (Rabin).
Formal languages and power series: Over \(\mathbb{N}\), rational series correspond to counting the number of accepting paths — relevant to the study of the #P complexity class (counting problems).
Weighted transducers: The theory of rational series extends naturally to transducers, where the output is not a bit but an element of a semiring. Weighted transducers are used in speech recognition (weighted finite-state methods), natural language processing, and bioinformatics.
Hadamard Product
For two series \(S, T\), the Hadamard product is defined by \((S \odot T, w) = (S, w) \cdot (T, w)\) (componentwise product); the rational series are closed under Hadamard product over commutative semirings. This generalizes intersection of regular languages: the Hadamard product of characteristic series is the characteristic series of the intersection, mirroring the DFA product construction.
Appendix: Decision Problems Summary
The following table summarizes decidability and complexity for key decision problems, where “CFL” denotes context-free grammar input and “RE” denotes r.e. languages.
| Problem | Class | Status |
|---|---|---|
| DFA membership | DFA | O(n) |
| NFA membership | NFA | O(n · |Q|²) |
| DFA emptiness | DFA | O(|Q|) |
| DFA equivalence | DFA | O(n log n) |
| DFA minimization | DFA | O(n log n) |
| CFL membership | CFG | O(n³) (CYK / Earley) |
| CFL emptiness | CFG | Decidable (polynomial) |
| CFL finiteness | CFG | Decidable |
| CFL ambiguity | CFG | Undecidable |
| CFL ∩ CFL emptiness | CFG × CFG | Undecidable |
| CFL = Σ* | CFG | Undecidable |
| CFL = CFL | CFG × CFG | Undecidable |
| CSL membership | CSG | PSPACE-complete |
| RE membership | TM | Undecidable (semi-decidable) |
| Regex universality | Regex | PSPACE-complete |
| NFA universality | NFA | PSPACE-complete |
| PCP | Pairs | Undecidable |
Chapter 8: Fine and Wilf, Lyndon Words, and the Structure of Periodicity
The Fine and Wilf Theorem
Chapter 1 introduced periodicity for strings. The most important structural result about multiple periods in a single string is the Fine and Wilf theorem (1965). It answers: if a string has two different periods, what can we say?
Recall that a word \(w\) of length \(n\) has period \(p\) (with \(1 \leq p \leq n\)) if \(w[i] = w[i+p]\) for all valid indices (i.e., for all \(0 \leq i \leq n-p-1\)). Equivalently, \(w\) is a prefix of \(x^\omega\) for \(x = w[0..p-1]\).
The bound \(p + q - \gcd(p,q)\) is tight: there exist words of length \(p + q - \gcd(p,q) - 1\) with periods \(p\) and \(q\) but not period \(\gcd(p,q)\).
We show that any two indices \(i\) and \(j\) with \(j - i = d\) and \(0 \leq i < j \leq n-1\) are connected by a chain of steps of size \(\pm p\) and \(\pm q\) that stays within \([0, n-1]\).
Formally, consider the integers modulo \(p\) and modulo \(q\). Since \(d = \gcd(p,q)\), the additive subgroup generated by \(p\) and \(q\) in \(\mathbb{Z}\) equals \(d\mathbb{Z}\). So there exist non-negative integers \(\alpha, \beta\) with \(\alpha p - \beta q = d\) (or \(\beta q - \alpha p = d\)). Starting from index \(i\), we can reach \(i + d\) by taking \(\alpha\) steps of \(+p\) and \(\beta\) steps of \(-q\) (or vice versa). Each intermediate index lies in an interval we must check stays within \([0, n-1]\).
\[i + \alpha p \leq (n - d - 1) + (q/d - 1)p\]
One verifies this is at most \(n-1\) using the hypothesis \(n \geq p + q - \gcd(p,q) = p + q - d\).
More concretely: the sequence of indices \(i, i+p, i+p-q, i+p-q+p, \ldots\) that zigzags by steps of \(+p\) and \(-q\) reaches \(i+d\) and stays within \([0, n-1]\) by the assumed length bound, establishing \(w[i] = w[i+d]\).
Tightness example: Take \(p = 3, q = 5\), so \(d = \gcd(3,5) = 1\) and the bound is \(3 + 5 - 1 = 7\). The word \(w = \mathtt{abaaba}\) of length \(6 = p + q - d - 1\) has both periods 3 and 5 but not period 1 (it is not constant). Any word of length 7 with periods 3 and 5 is forced by the theorem to be constant.
Fine and Wilf for Two Morphisms
The textbook presents a generalization of Fine and Wilf to morphisms:
This generalization is used in the study of morphic sequences and has applications in deciding equality of DOL sequences.
Lyndon Words
The theory of Lyndon words is a beautiful piece of combinatorics on words with surprising connections to algebra, algorithms, and number theory.
Equivalently, a Lyndon word is a primitive word that is the lexicographically smallest element in its conjugacy class.
Examples: Over \(\{0,1\}\) with \(0 < 1\):
- \(0, 1, 01, 001, 011, 0001, 0011, 0111, \ldots\) are Lyndon words.
- \(10\) is not Lyndon: its proper suffix \(\mathtt{0}\) satisfies \(\mathtt{0} < \mathtt{10}\).
- \(0101\) is not Lyndon: it is not primitive (\(0101 = (01)^2\)).
Key properties:
- Every Lyndon word is primitive.
- If \(\ell_1, \ell_2\) are Lyndon words with \(\ell_1 \leq \ell_2\) lexicographically, then \(\ell_1 \ell_2\) is a Lyndon word iff \(\ell_1 < \ell_2\).
- If \(\ell\) is Lyndon and \(\ell = xy\) with \(x, y\) nonempty, then \(x < \ell < y\); in particular, every proper suffix of a Lyndon word is strictly greater than the word itself.
These properties arise from the interplay between order and concatenation in the free monoid.
The Chen-Fox-Lyndon Factorization
The most important result about Lyndon words is the unique factorization theorem:
Example: \(\mathtt{cbacba}\) factors as \(\mathtt{c} \cdot \mathtt{b} \cdot \mathtt{acb} \cdot \mathtt{a}\). Each factor is Lyndon (single letters always are, and \(\mathtt{acb}\) is smaller than its suffixes \(\mathtt{cb}\) and \(\mathtt{b}\)), and the factors are non-increasing: \(\mathtt{c} > \mathtt{b} > \mathtt{acb} > \mathtt{a}\). Simpler examples: \(\mathtt{aab}\) is itself a Lyndon word, so its factorization is the single factor \(\mathtt{aab}\); the word \(\mathtt{ba}\) factors as \(\mathtt{b} \cdot \mathtt{a}\).
Existence: Given \(w\), define \(\ell_1\) as the longest Lyndon prefix of \(w\) (a Lyndon word that is a prefix of \(w\)). Then continue with the remaining suffix. One shows this greedy approach terminates and gives a non-increasing sequence (the next Lyndon piece cannot be larger than the previous, because if it were, the current piece would not have been the longest Lyndon prefix).
Uniqueness: Two distinct factorizations into Lyndon words with the same non-increasing property would give a contradiction via a careful comparison of the first differing piece.
Duval’s Algorithm
Computing the Lyndon factorization efficiently is non-trivial, but the Duval algorithm (1983) achieves linear time \(O(|w|)\):
DUVAL(w):
    i = 0
    While i < |w|:
        j = i + 1; k = i
        While j < |w| and w[k] <= w[j]:       # compare against the current period
            If w[k] < w[j]: k = i             # w[i..j] is Lyndon; restart comparison
            Else: k = k + 1                   # w[i..j] continues a periodic repetition
            j = j + 1
        While i <= k:                         # output copies of the Lyndon word of length j - k
            Output w[i..i+(j-k)-1]
            i = i + (j - k)
The key invariant: at the start of each outer iteration, \(w[0..i-1]\) has already been factored, and the inner loop maintains that \(w[i..j-1]\) consists of repeated copies of a Lyndon word of length \(j - k\), possibly followed by a proper prefix of it. Since \(i\) never moves backward and \(j\) advances \(O(|w|)\) times in total, the algorithm runs in \(O(|w|)\) time.
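Duval's algorithm transcribes directly into Python:

```python
def duval(w):
    """Chen-Fox-Lyndon factorization of w in O(|w|) time (Duval, 1983)."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1   # restart, or continue the period
            j += 1
        while i <= k:                         # emit copies of the length-(j-k) factor
            factors.append(w[i:i + j - k])
            i += j - k
    return factors
```

For instance `duval("banana")` yields the non-increasing Lyndon factors `b · an · an · a`.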
Application — Sorting: The Lyndon factorization has a remarkable application to the Burrows-Wheeler Transform (BWT), used in data compression (bzip2). The BWT of a string is computed by sorting all cyclic rotations — and the Lyndon factorization controls the structure of this sorted order. Specifically, the suffix array of a word \(w\) can be computed in \(O(|w|)\) using the Lyndon factorization as a backbone.
The Necklace Problem
A necklace (or Lyndon necklace) is the lexicographically smallest representative of a conjugacy class. Computing the lexicographically smallest rotation of a string — finding the necklace — can be done in \(O(n)\) time by Booth’s algorithm (1980); it can also be extracted from the Lyndon factorization of \(ww\).
The number of Lyndon words of length \(n\) over a \(k\)-letter alphabet is given by Witt’s formula \[\frac{1}{n} \sum_{d \mid n} \mu(n/d) k^d\] where \(\mu\) is the Möbius function from number theory. This is related to counting primitive necklaces and has connections to the theory of free Lie algebras.
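The counting formula is easy to cross-check against a brute-force enumeration, using the characterization of Lyndon words as words strictly smaller than each of their proper suffixes (all helper names below are introduced for this sketch):

```python
from itertools import product

def mobius(n):
    """Möbius function μ(n) by trial factorization."""
    result, p, m = 1, 2, n
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0              # squared prime factor
            result = -result
        p += 1
    if m > 1:
        result = -result
    return result

def lyndon_count(n, k):
    """Number of Lyndon words of length n over a k-letter alphabet."""
    return sum(mobius(n // d) * k**d for d in range(1, n + 1) if n % d == 0) // n

def is_lyndon(w):
    """A nonempty word is Lyndon iff strictly smaller than each proper suffix."""
    return all(w < w[i:] for i in range(1, len(w)))
```

For \(n = 6, k = 2\) the formula gives \((64 - 8 - 4 + 2)/6 = 9\), matching the exhaustive count over all 64 binary words.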
Chapter 9: ω-Languages and Büchi Automata
Motivation and Topology of ω-Languages
The study of languages of infinite words is not merely a theoretical curiosity — it is the foundation of model checking and formal verification of concurrent and reactive systems. A program that runs forever (an operating system, a control loop, a communication protocol) cannot be modeled by a finite input; it produces an infinite stream of events. The right question is not “does this program terminate correctly?” but “does this program always eventually reach a good state?” or “does it never reach a bad state?”
These questions are naturally phrased as membership problems for languages of infinite words.
The topology of \(\Sigma^\omega\) is the Cantor space topology: the basic open sets are the cylinders \(\{w \in \Sigma^\omega : w \text{ starts with } u\}\) for finite \(u \in \Sigma^*\). An ω-language is open if it is a union of cylinders, and closed if its complement is open. This topology makes \(\Sigma^\omega\) compact and totally disconnected.
The Borel hierarchy for ω-languages classifies them by alternation of countable unions and intersections of open and closed sets. The ω-regular languages (defined below) sit low in this hierarchy: every ω-regular language is a finite Boolean combination of \(\Pi_2^0\) (i.e., \(G_\delta\)) sets, and by Landweber's theorem the DBA-recognizable languages are exactly the ω-regular \(G_\delta\) sets.
Büchi Automata
A nondeterministic Büchi automaton (NBA) is a tuple \(M = (Q, \Sigma, \delta, q_0, F)\), syntactically identical to an NFA, but run on infinite words: a run of \(M\) on \(\mathbf{w}\) is an infinite state sequence \(\rho = q_0 q_1 q_2 \cdots\) with \(q_{i+1} \in \delta(q_i, w[i])\) for all \(i \geq 0\). The automaton accepts \(\mathbf{w}\) if there exists a run \(\rho\) such that \(\text{Inf}(\rho) \cap F \neq \emptyset\), where \(\text{Inf}(\rho) = \{q \in Q : q = q_i \text{ for infinitely many } i\}\) is the set of states visited infinitely often.
In other words: a Büchi automaton accepts an infinite word if some run passes through an accepting state infinitely often. An ω-language is ω-regular if it is accepted by some NBA.
Examples:
The language of all infinite words over \(\{a, b\}\) that contain infinitely many \(a\)'s: use a 2-state automaton with states \(q_0\) (initial, non-accepting) and \(q_1\) (accepting). On \(a\): go to \(q_1\) (from either state). On \(b\): go to \(q_0\) (from either state). Accept at \(q_1\). The automaton is in \(q_1\) exactly when the symbol just read was \(a\), so its unique run on \(\mathbf{w}\) visits \(q_1\) infinitely often iff \(\mathbf{w}\) contains infinitely many \(a\)'s. Note that this automaton is in fact deterministic.
For any fixed \(u \in \Sigma^*\) and \(v \in \Sigma^+\), the singleton \(\{uv^\omega\}\) is ω-regular. (Caution: the set of all ultimately periodic words \(\{uv^\omega : u \in \Sigma^*, v \in \Sigma^+\}\) is not ω-regular — though every nonempty ω-regular language contains an ultimately periodic word.)
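Acceptance of an ultimately periodic word \(uv^\omega\) by a deterministic Büchi automaton is decidable by pure simulation: run \(u\), then iterate \(v\)-blocks until the state at a block boundary repeats, and check whether the resulting loop visits an accepting state. A minimal sketch (the dictionary encoding of the transition function is ours); the example below is the standard "infinitely many \(a\)'s" DBA:

```python
def dba_accepts(delta, q0, accepting, u, v):
    """Does the DBA accept the ultimately periodic word u v^omega?
    delta maps (state, symbol) -> state; v must be nonempty."""
    q = q0
    for c in u:                        # consume the finite prefix u
        q = delta[(q, c)]
    # The states at the boundaries of successive v-blocks must repeat.
    starts = []
    while q not in starts:
        starts.append(q)
        for c in v:
            q = delta[(q, c)]
    # Replay the loop once, collecting every state visited inside it.
    loop_states, p = set(), q
    for _ in range(len(starts) - starts.index(q)):
        for c in v:
            p = delta[(p, c)]
            loop_states.add(p)
    return bool(loop_states & accepting)

# DBA for "infinitely many a's": q1 iff the last symbol read was a.
INF_A = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q0"}
```

For example, `dba_accepts(INF_A, "q0", {"q1"}, "bb", "ab")` is true (the word \(bb(ab)^\omega\) has infinitely many \(a\)'s), while \(a\,b^\omega\) is rejected.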
Connection to regular languages: For a regular language \(L \subseteq \Sigma^*\), define \(L^\omega = \{w_1 w_2 \cdots : w_i \in L, w_i \neq \varepsilon\}\). This ω-language is always ω-regular. More generally (Büchi), an ω-language is ω-regular iff it is a finite union of languages of the form \(L_1 L_2^\omega\) for regular \(L_1, L_2\).
Closure Properties of ω-Regular Languages
Closure under union is straightforward (take the disjoint union of two NBAs). Closure under intersection requires a product construction with an extra flag: run both automata in parallel and track which of the two still owes a visit to its accepting set; accept when both visits occur infinitely often (the flag resets each time both have been seen). Closure under complementation is the hardest and requires converting to a deterministic model first.
Limits of Deterministic Büchi Automata
This is where ω-languages diverge sharply from finite-word languages. For finite words, NFA and DFA are equally expressive. For infinite words:
The language of words over \(\{a, b\}\) with only finitely many \(a\)'s — \((a+b)^* b^\omega\) — is ω-regular but not recognizable by any DBA. An NBA accepts it by guessing the position of the last \(a\); a DBA cannot make this guess.
Proof sketch that no DBA works: suppose a DBA recognized this language. On input \(b^\omega\) it must visit an accepting state after some prefix \(b^{n_1}\). On input \(b^{n_1} a b^\omega\) it must visit an accepting state again after some longer prefix \(b^{n_1} a b^{n_2}\). Iterating forever yields the word \(b^{n_1} a b^{n_2} a b^{n_3} \cdots\), which contains infinitely many \(a\)'s yet makes the DBA visit accepting states infinitely often — a contradiction.
Muller Automata and McNaughton’s Theorem
This is the fundamental theorem of ω-automata: nondeterminism can always be eliminated if we use the more powerful Muller acceptance condition (instead of requiring only some accepting state, we specify exactly which states appear infinitely often).
On reading each symbol, the Safra construction updates the tree deterministically:
- Powerset step: each node’s label is updated by applying the NBA transition function to all states in that label.
- Branching: states in a node's label that lie in \(F\) are spun off into a new, youngest child node — these track runs that have just visited an accepting state.
- Merging: a state that also appears in an older sibling is removed from the younger one (labels at each level stay disjoint); when a node's label equals the union of its children's labels, the children are deleted and the node is marked ("flashes green").
- Tree normalization: nodes with empty labels are removed.
The Muller condition is chosen so that the DMA accepts iff some node of the tree is eventually never deleted and is marked infinitely often. The resulting DMA has at most \(n^{O(n)} = 2^{O(n \log n)}\) states.
Other deterministic ω-automaton models include Rabin automata (accepting if some pair \((E_i, F_i)\) in a table has \(\text{Inf}(\rho) \cap E_i = \emptyset\) and \(\text{Inf}(\rho) \cap F_i \neq \emptyset\)) and Streett automata (the dual). All deterministic models are equally expressive for ω-regular languages.
Linear Temporal Logic and Model Checking
The primary application of ω-regular language theory is in linear temporal logic (LTL) and model checking. LTL formulae describe properties of infinite execution traces. Key operators:
- \(\mathbf{X}\, \varphi\): \(\varphi\) holds in the next step
- \(\mathbf{F}\, \varphi\): \(\varphi\) holds in some future step (eventually)
- \(\mathbf{G}\, \varphi\): \(\varphi\) holds in all future steps (globally)
- \(\varphi\, \mathbf{U}\, \psi\): \(\varphi\) holds until \(\psi\) holds
The model checking algorithm: translate \(\neg\varphi\) to an NBA \(B\), take the product of \(B\) with the system’s transition graph, check if the product has an accepting lasso (a path that enters a cycle through an accepting state). This is a reachability problem on the product graph, solvable in linear time in the product’s size.
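The accepting-lasso check at the heart of this procedure can be sketched with plain reachability passes over the product graph; this is a simplified quadratic version (the linear-time nested-DFS emptiness test is an optimization of the same idea), and the adjacency-dict encoding is ours:

```python
def reachable(graph, start):
    """All nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def has_accepting_lasso(graph, init, accepting):
    """Is some accepting state reachable from init AND on a cycle?"""
    for q in reachable(graph, init):
        if q in accepting:
            # q lies on a cycle iff q is reachable from one of its successors
            if any(q in reachable(graph, v) for v in graph.get(q, [])):
                return True
    return False
```

For example, in the graph `{0: [1], 1: [2], 2: [1]}` with initial node 0, states 1 and 2 lie on an accepting lasso but state 0 does not.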
Chapter 10: Automatic Sequences
From Finite Automata to Sequences
A DFAO (Deterministic Finite Automaton with Output) is a DFA augmented with an output function \(\tau : Q \to \Delta\) mapping states to an output alphabet \(\Delta\). Given a DFAO \(M\) and a natural number \(n\) written in base \(k\) (with leading zeros allowed), the DFAO reads \(n\)'s base-\(k\) digits and outputs \(\tau(\delta^*(q_0, [n]_k))\), where \([n]_k\) is the base-\(k\) representation of \(n\) (most significant digit first).
Automatic sequences are the “regular languages of number theory.” Just as regular languages are recognized by finite automata reading strings, automatic sequences are computed by finite automata reading the digits of an index.
Canonical Examples
The Thue-Morse sequence \(\mathbf{t} = 0110100110010110\cdots\) is 2-automatic. The DFAO has two states \(q_0\) (output 0) and \(q_1\) (output 1). On digit 0: stay in current state. On digit 1: switch state. Start in \(q_0\). Then \(t_n = \text{(sum of binary digits of } n) \bmod 2\), as computed by this DFAO reading \(n\) in binary.
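The 2-state DFAO just described is easy to simulate; a minimal sketch (the function name `thue_morse` is ours), which also confirms the digit-sum formula:

```python
def thue_morse(n):
    """t_n computed by the 2-state DFAO: toggle on digit 1, stay on digit 0.
    The output function is the identity on the state label."""
    state = 0                        # q0
    for digit in format(n, "b"):     # binary digits, most significant first
        if digit == "1":
            state = 1 - state        # switch state on digit 1
    return state
```

The first eight values are 0, 1, 1, 0, 1, 0, 0, 1, and `thue_morse(n)` agrees with the parity of the binary digit sum of \(n\).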
The Rudin-Shapiro sequence \(\mathbf{r} = (r_n)_{n \geq 0}\) counts the number of (possibly overlapping) occurrences of \(11\) in the binary representation of \(n\), mod 2. It is 2-automatic with 4 states.
The Baum-Sweet sequence \(\mathbf{b} = (b_n)_{n \geq 0}\): \(b_n = 1\) if the binary representation of \(n\) contains no odd-length block of consecutive 0’s, else \(b_n = 0\). It is 2-automatic with 3 states.
The Stern-Brocot sequence (Stern's diatomic sequence): defined by \(s(0) = 0, s(1) = 1, s(2n) = s(n), s(2n+1) = s(n) + s(n+1)\). This sequence is unbounded, so it is not 2-automatic (an automatic sequence takes only finitely many values); it is, however, 2-regular in the sense of Allouche and Shallit, and it has connections to the Stern-Brocot tree and the Farey sequence in number theory.
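The recurrence above can be transcribed directly; the memoized sketch below (function name ours) also exhibits the unboundedness that rules out automaticity:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stern(n):
    """Stern's diatomic sequence:
    s(0)=0, s(1)=1, s(2n)=s(n), s(2n+1)=s(n)+s(n+1)."""
    if n < 2:
        return n
    return stern(n // 2) if n % 2 == 0 else stern(n // 2) + stern(n // 2 + 1)
```

The first values are 0, 1, 1, 2, 1, 3, 2, 3, 1, and the maxima grow like Fibonacci numbers (the largest value among \(n < 2^k\) is \(F_{k+1}\)), so the sequence takes infinitely many values.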
Equivalent Characterizations
Theorem. For a sequence \(\mathbf{a} = (a_n)_{n \geq 0}\) over a finite alphabet and an integer \(k \geq 2\), the following are equivalent:
- \(\mathbf{a}\) is k-automatic.
- The k-kernel of \(\mathbf{a}\) — the set \(\{(a_{k^e n + r})_{n \geq 0} : e \geq 0, 0 \leq r < k^e\}\) — is finite.
- \(\mathbf{a}\) is the image under a coding (letter-to-letter morphism) of a fixed point of a k-uniform morphism (a morphism where every letter maps to a word of length exactly \(k\)).
The k-kernel is the set of all subsequences of the form \((a_{k^e n + r})_{n \geq 0}\), for all \(e \geq 0\) and \(0 \leq r < k^e\). Finiteness of the kernel captures the “finite-state” nature of the sequence: there are only finitely many distinct “sampling patterns.”
Conversely, given a finite k-kernel, define the states of the DFAO to be the elements of the kernel. The transition function: on reading digit \(d\), the kernel element \((a_{k^e n + r})_{n \geq 0}\) transitions to \((a_{k^{e+1} n + d k^e + r})_{n \geq 0}\), which is again a kernel element. (This DFAO reads the digits least significant first; that is harmless, since k-automaticity is preserved under reversing the input.) The output function maps each kernel element \((a_{k^e n + r})_{n \geq 0}\) to its first value \(a_r\).
Morphic Sequences
A sequence is morphic if it is the image of the fixed point of a (not necessarily uniform) morphism. Automatic sequences are a special case: every \(k\)-automatic sequence is morphic (via a \(k\)-uniform morphism), but not every morphic sequence is automatic.
Example: The Fibonacci word \(\mathbf{f}\) is the fixed point of the morphism \(\phi: a \mapsto ab, b \mapsto a\). It is morphic (clearly) but not \(k\)-automatic for any \(k\): in a \(k\)-automatic sequence, the frequency of each letter, when it exists, is rational, while the letters of the Fibonacci word occur with irrational frequencies (related to the golden ratio).
The sequence \((F_n \bmod 2)_{n \geq 0}\) (Fibonacci numbers mod 2) is periodic with period 3 (\(0, 1, 1, 0, 1, 1, \ldots\)), and every eventually periodic sequence is k-automatic for every \(k \geq 2\) — even though the Fibonacci numbers themselves, being unbounded, are not automatic as a sequence of integers.
Cobham’s Theorem
The central foundational result for automatic sequences concerns what happens when a sequence is automatic for two different bases:
Cobham's theorem (1969): let \(k, \ell \geq 2\) be multiplicatively independent, meaning \(k^a \neq \ell^b\) for all positive integers \(a, b\). If a sequence is both k-automatic and ℓ-automatic, then it is eventually periodic. Multiplicative independence is the same as saying \(k\) and \(\ell\) are not powers of a common integer. For example, \(k = 2\) and \(\ell = 3\) are multiplicatively independent, while \(k = 2\) and \(\ell = 4\) are not (since \(2^2 = 4^1\)).
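Multiplicative independence is mechanically checkable: if \(k^a = \ell^b\) has a solution at all, the minimal one has \(a \leq \log_2 \ell\) and \(b \leq \log_2 k\), so a bounded search suffices. A sketch (function name ours):

```python
def multiplicatively_independent(k, l):
    """True iff k^a != l^b for all positive integers a, b (assumes k, l >= 2).
    If k = c^i and l = c^j for a common base c, the minimal solution is
    a = j/gcd(i,j) <= log2(l), b = i/gcd(i,j) <= log2(k)."""
    for a in range(1, l.bit_length() + 1):
        for b in range(1, k.bit_length() + 1):
            if k ** a == l ** b:
                return False
    return True
```

For instance, 2 and 3 are independent, while 4 and 8 are not (\(4^3 = 8^2 = 64\)).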
Cobham’s theorem says: the “interesting” automatic sequences (non-eventually-periodic ones like Thue-Morse, Rudin-Shapiro) are tied to a single base. They cannot be captured by automata reading another, unrelated base. This gives automatic sequences a kind of “numerical specificity.”
The proof uses properties of morphisms and the structure of automatic sequences. Suppose \(\mathbf{a}\) is both k-automatic and ℓ-automatic, with k and ℓ multiplicatively independent.
The k-kernel and ℓ-kernel are both finite, giving \(\mathbf{a}\) self-similar structure at all scales \(k^a\) and at all scales \(\ell^b\). The heart of the argument is that multiplicative independence makes the set \(\{k^a/\ell^b : a, b \geq 1\}\) dense in the positive reals, so these two self-similar structures overlap at arbitrarily close scales. For any fixed window size \(m\), comparing the two structures forces local periods on the sequence of windows \((a_n, a_{n+1}, \ldots, a_{n+m-1})\), and a Fine-and-Wilf-style periodicity argument then shows the window sequence is eventually periodic. Since this holds for every \(m\), the sequence \(\mathbf{a}\) itself is eventually periodic. (The full proof is considerably more delicate than this sketch.)
Consequences:
- The Thue-Morse sequence is 2-automatic but not \(\ell\)-automatic for any \(\ell\) not a power of 2.
- A k-automatic sequence is also \(k^m\)-automatic for every \(m \geq 1\), and conversely (base change works between multiplicatively dependent bases). In contrast, a 6-automatic sequence need not be 2-automatic or 3-automatic: 6 is multiplicatively independent of both 2 and 3, so by Cobham's theorem a sequence that is both 6-automatic and 2-automatic must already be eventually periodic.
Automatic Sequences and Number Theory
Automatic sequences encode deep number-theoretic properties. The Champernowne word \(\mathbf{c} = 0\, 1\, 10\, 11\, 100\, 101\, \ldots\) (concatenation of all non-negative integers in binary) is not k-automatic for any \(k\): every finite binary word occurs in it as a factor, so its factor complexity (number of distinct length-\(n\) subwords) is the maximal \(2^n\), which grows super-linearly.
The factor complexity of an automatic sequence is at most linear: there exists \(C\) such that the number of distinct length-\(n\) subwords is at most \(Cn\). This is a hallmark of “structured” sequences.
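The linear bound is easy to observe empirically on the Thue-Morse word; the helper names below are ours, and the prefix is built by the block-doubling identity \(t_{2^{m+1}} = t_{2^m}\,\overline{t_{2^m}}\):

```python
def thue_morse_prefix(length):
    """A prefix of the Thue-Morse word via block doubling: t -> t + complement(t)."""
    t = [0]
    while len(t) < length:
        t += [1 - x for x in t]
    return t[:length]

def factor_complexity(seq, n):
    """Number of distinct length-n factors occurring in seq."""
    return len({tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)})
```

On a prefix of length 4096, the complexity values for \(n = 1, 2, 3, 4\) are 2, 4, 6, 10, and the counts stay well under \(4n\) — in line with the known bound for Thue-Morse.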
Connections to the Cartier operator in algebraic geometry, the Möbius function in analytic number theory, and fractal dimensions of certain Cantor-like sets make automatic sequences a meeting point of diverse mathematical fields.
Chapter 11: Algebraic Automata Theory
The Syntactic Monoid
Every regular language has a canonical algebraic object: its syntactic monoid. This section develops the theory connecting regular languages, finite monoids, and language varieties.
This is the monoid-theoretic analogue of the Myhill-Nerode theorem. Whereas Myhill-Nerode uses a right congruence (looking at \(xz \in L\) vs \(yz \in L\) for right contexts only), the syntactic congruence uses two-sided contexts and gives a finer equivalence.
Relationship to the minimal DFA: The transformation monoid of the minimal DFA for \(L\) is exactly \(\text{Synt}(L)\). Specifically, each word \(w \in \Sigma^*\) induces a function \(\phi_w : Q \to Q\) defined by \(\phi_w(q) = \delta^*(q, w)\). The set \(\{\phi_w : w \in \Sigma^*\}\) under composition is the syntactic monoid, with the monoid product being function composition.
Example: The language \(L = (ab)^*\) over \(\{a, b\}\). The syntactic monoid has 6 elements: the identity, the elements \(a, b, ab, ba\), and a zero \(0 = aa = bb\), subject to the relations \(aba = a\), \(bab = b\) (so \(ab\) and \(ba\) are idempotent). This monoid is the transformation monoid of the 3-state minimal DFA for \(L\).
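The transformation monoid can be generated mechanically: start from the identity and close the letter transformations under composition. A sketch (names ours; states must be numbered \(0, \ldots, n-1\)); the test below runs it on the 3-state minimal DFA for \((ab)^*\) (state 0 initial/accepting, state 1 after \(a\), state 2 dead):

```python
def transformation_monoid(delta, states, alphabet):
    """Close the letter transformations of a DFA under composition.
    delta maps (state, letter) -> state; transformations are tuples."""
    identity = tuple(states)
    letters = {a: tuple(delta[(q, a)] for q in states) for a in alphabet}
    monoid = {identity}
    frontier = [identity]
    while frontier:
        f = frontier.pop()
        for g in letters.values():
            h = tuple(g[f[q]] for q in states)   # phi_{wa}: apply f, then g
            if h not in monoid:
                monoid.add(h)
                frontier.append(h)
    return monoid
```

On the \((ab)^*\) DFA this yields exactly 6 transformations, including the zero \((2,2,2)\) and the idempotent \(\phi_{ab} = (0,2,2)\).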
Eilenberg’s Variety Theorem
The variety theorem of Eilenberg (1976) is a deep correspondence between algebraic properties of the syntactic monoid and logical/combinatorial properties of the language.
This is one of the deepest structural theorems in formal language theory. It reduces the classification of language families to the classification of finite monoid families — an algebraic problem amenable to the tools of group theory and semigroup theory.
Star-Free Languages and Aperiodic Monoids
The most celebrated instance of Eilenberg’s theorem is the characterization of star-free languages.
Equivalently (McNaughton-Papert), star-free languages are those definable in first-order logic FO[<] over word positions: the structure has domain \(\{0, 1, \ldots, |w|-1\}\), the order \(<\) on positions, and unary predicates \((Q_a)_{a \in \Sigma}\), where \(Q_a(i)\) means "position \(i\) carries symbol \(a\)."
This is Schützenberger’s theorem, one of the great results of 20th-century algebra and computer science. It precisely identifies the algebraic structure responsible for the absence of cyclic symmetry in a language.
Proof direction (aperiodic \(\Rightarrow\) star-free): If \(\text{Synt}(L)\) is aperiodic — there is an \(n\) with \(x^n = x^{n+1}\) for every element \(x\) — then no input word permutes any set of states of the minimal DFA non-trivially: there is no "group behavior." One shows, by induction on the ideal structure of the monoid, that \(L\) can be expressed using Boolean operations and concatenation alone; the star operation is never needed, because what star adds over star-free expressions is precisely counting modulo a period, and aperiodicity rules that out.
Proof direction (star-free \(\Rightarrow\) aperiodic): Every finite language is recognized by an aperiodic monoid. The class of languages with aperiodic syntactic monoids is closed under Boolean operations and concatenation (one checks that the product and quotient constructions for monoids preserve aperiodicity). Since star-free languages are the closure of finite languages under Boolean operations and concatenation, they all have aperiodic syntactic monoids.
Examples:
- \(L = (ab)^*\): aperiodic? The element \(ab\) in \(\text{Synt}(L)\) is idempotent (\((ab)^2 = ab\); the transformation \(\phi_{ab}\) is idempotent in the minimal DFA), and in fact every element \(x\) satisfies \(x^2 = x^3\). So yes, \(L\) is star-free. Indeed, \(L = \Sigma^* \setminus (b\Sigma^* \cup \Sigma^* a \cup \Sigma^* aa \Sigma^* \cup \Sigma^* bb \Sigma^*)\) — expressible without Kleene star.
- \(L = (ab)^+\): same, since \((ab)^+ = ab \cdot (ab)^*\).
- \(L = \{w : |w|_a \equiv 0 \pmod{3}\}\): not star-free. The element \(a\) in \(\text{Synt}(L)\) has order 3 (the DFA cycles through 3 states on reading \(a\)), generating a non-trivial group \(\mathbb{Z}/3\mathbb{Z}\) in the syntactic monoid.
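Both verdicts above can be checked mechanically: a finite monoid is aperiodic iff every element \(m\) satisfies \(m^t = m^{t+1}\) for some \(t\). A sketch over transformation tuples (names ours):

```python
def compose(f, g):
    """Apply f, then g (transformations on {0,...,n-1} as tuples)."""
    return tuple(g[x] for x in f)

def is_aperiodic_element(m):
    """Does m satisfy m^t == m^{t+1} for some t?  The powers of a
    transformation on n states repeat within n^n steps, so checking
    that many consecutive powers is a safe (if crude) bound."""
    p = m
    for _ in range(len(m) ** len(m)):
        q = compose(p, m)
        if q == p:
            return True
        p = q
    return False
```

For \((ab)^*\), the transformations \(\phi_a = (1,2,2)\) and \(\phi_{ab} = (0,2,2)\) stabilize, while for the mod-3 counting language \(\phi_a = (1,2,0)\) cycles with period 3 and never stabilizes — so that monoid is not aperiodic.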
The Krohn-Rhodes Theorem
The Krohn-Rhodes theorem (1965) is the “fundamental theorem of finite automata,” decomposing every finite automaton into a cascade of simpler components.
“Divides” means “is a quotient of a submonoid of.” The theorem says every finite state machine can be decomposed into a cascade (left-to-right pipeline) of:
- Simple group components: permutation automata whose transformation groups are finite simple groups, each dividing the original monoid.
- Reset components: the two-state "flip-flop" machine, with set, reset, and identity inputs.
This decomposition theorem is the foundation of complexity theory for finite automata and has connections to Boolean circuit complexity (the NC/AC hierarchy).
Chapter 12: Learning Regular Languages
The Learning Problem
The grammatical inference problem asks: given examples of a language, can we learn the language? Angluin’s seminal 1987 paper gave a polynomial-time algorithm for learning regular languages from membership queries and equivalence queries — a model of “exact learning.”
The setting: There is an unknown regular language \(L \subseteq \Sigma^*\) (the target). A learner (algorithm) can make two types of queries to a teacher (oracle):
- Membership query for \(w\): is \(w \in L\)? The teacher answers yes or no.
- Equivalence query for a DFA \(H\): is \(L(H) = L\)? The teacher answers yes (and the learner halts) or no and provides a counterexample — a word in \(L \Delta L(H) = (L \setminus L(H)) \cup (L(H) \setminus L)\).
The L* algorithm is one of the most practically important results in automata theory — it is used in model checking (learning specifications), protocol verification, and software testing.
The Observation Table
The L* algorithm maintains an observation table \((S, E, T)\) where:
- \(S \subseteq \Sigma^*\): a prefix-closed set of “short” words (rows).
- \(E \subseteq \Sigma^*\): a set of “experiments” / distinguishing suffixes (columns).
- \(T : (S \cup S \cdot \Sigma) \times E \to \{0, 1\}\): defined by \(T(s, e) = 1\) iff \(se \in L\).
The table has rows indexed by \(S \cup S \cdot \Sigma\) and columns by \(E\). The row for word \(s\) is the function \(\text{row}(s) : E \to \{0,1\}\) defined by \(\text{row}(s)(e) = T(s, e)\).
The table must satisfy two properties before we can build a DFA hypothesis:
- Closed: For every \(s \in S\) and \(a \in \Sigma\), there exists \(s' \in S\) with \(\text{row}(sa) = \text{row}(s')\).
- Consistent: For every \(s_1, s_2 \in S\) with \(\text{row}(s_1) = \text{row}(s_2)\), for every \(a \in \Sigma\), \(\text{row}(s_1 a) = \text{row}(s_2 a)\).
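The two table conditions translate directly into code; a minimal sketch (all names ours, with the membership oracle passed in as a Python predicate):

```python
def fill_table(S, E, alphabet, member):
    """Membership table T over (S ∪ S·Σ) × E, filled via the oracle."""
    T = {}
    for s in set(S) | {s + a for s in S for a in alphabet}:
        for e in E:
            T[(s, e)] = member(s + e)
    return T

def row(T, E, s):
    return tuple(T[(s, e)] for e in E)

def is_closed(S, E, T, alphabet):
    rows_of_S = {row(T, E, s) for s in S}
    return all(row(T, E, s + a) in rows_of_S for s in S for a in alphabet)

def is_consistent(S, E, T, alphabet):
    for s1 in S:
        for s2 in S:
            if s1 != s2 and row(T, E, s1) == row(T, E, s2):
                if any(row(T, E, s1 + a) != row(T, E, s2 + a) for a in alphabet):
                    return False
    return True
```

For the target "strings ending in \(a\)," the initial table with \(S = E = \{\varepsilon\}\) is not closed (the row of \(a\) is new); after adding \(a\) to \(S\), the table becomes closed and consistent.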
The L* Algorithm
L*(Σ, target L):
    Initialize S = {ε}, E = {ε}
    Fill T via membership queries.
    Repeat:
        While table is not closed or not consistent:
            If not consistent (∃ s1, s2 ∈ S with row(s1) = row(s2) but row(s1·a) ≠ row(s2·a)):
                Add a·e to E (for the distinguishing suffix e ∈ E); fill new column via queries.
            If not closed (∃ s ∈ S, a ∈ Σ with row(s·a) ≠ row(s') for all s' ∈ S):
                Add s·a to S; add s·a·b for all b ∈ Σ to S·Σ; fill new rows via queries.
        Build DFA hypothesis M from table:
            States = {row(s) : s ∈ S} (distinct rows)
            Start state = row(ε)
            δ(row(s), a) = row(s·a)
            Accept states = {row(s) : T(s, ε) = 1}
        Make equivalence query for M.
        If "yes": return M.
        If counterexample t is given:
            Add all prefixes of t to S; fill new rows via queries.
Correctness: When the table is closed and consistent, the hypothesis DFA \(M\) correctly separates all words in \(S \cup S\cdot\Sigma\) by their rows. The equivalence query either confirms \(M = L\) or provides a counterexample that reveals a new state (a new equivalence class of \(\equiv_L\) not yet captured in the table). Since the minimal DFA has \(n\) states, at most \(n-1\) counterexamples are ever received.
Complexity:
- Number of membership queries: polynomial in \(n\), \(|\Sigma|\), and the maximum counterexample length \(\ell\) — \(O(|\Sigma|\, \ell\, n^2)\) for the variant above, which adds all prefixes of each counterexample to \(S\).
- Number of equivalence queries: at most \(n\) (hence at most \(n - 1\) counterexamples before the final "yes").
- Total running time: polynomial in \(n\), \(|\Sigma|\), and \(\ell\).
Formally: let \(M\) be the hypothesis and \(t = a_1 a_2 \cdots a_m\) a counterexample, so \(t \in L \Delta L(M)\). Consider the prefixes \(\varepsilon, a_1, a_1 a_2, \ldots, t\). Since \(M\) decides \(t\) incorrectly, some transition along this prefix sequence merges two words that the target language distinguishes. Adding the prefixes to \(S\) therefore either (a) creates a new distinct row (a new state), or (b) reveals an inconsistency, which is repaired by adding a distinguishing suffix to \(E\) — and the repaired table again has more distinct rows. Either way the number of distinct rows strictly increases, and it can never exceed \(n\).
Extensions and Limitations
Lower bound: Neither query type suffices alone. With membership queries only, "password" automata force exponentially many queries; Angluin also showed that equivalence queries alone require exponentially many queries in the worst case. In the passive setting, the RPNI algorithm (Oncina-García 1992) learns the minimal DFA in polynomial time from a labeled sample, provided the sample contains a characteristic set of positive and negative examples.
Beyond regular languages: Can one learn CFLs? Under standard cryptographic assumptions, context-free grammars are not learnable in polynomial time from membership and equivalence queries (a hardness result of Angluin and Kharitonov). However, certain restricted classes (the regular k-testable languages, simple deterministic languages) are polynomial-time learnable.
Active learning in practice: The L* algorithm has been implemented in the LearnLib library and used successfully to infer models of software systems, network protocols (TCP, TLS), and hardware components.
Chapter 13: The CYK Algorithm and Parsing Complexity
Motivation: The General Parsing Problem
Earley’s algorithm (Chapter 4) handles arbitrary CFGs in \(O(n^3)\). But for the special case of grammars in Chomsky Normal Form (CNF), there is a strikingly clean dynamic programming algorithm — the Cocke-Younger-Kasami (CYK) algorithm — that achieves the same asymptotic bound with simpler structure and better constants.
Chomsky Normal Form
A CFG is in Chomsky Normal Form if every production has one of two forms:
- \(A \to BC\) where \(B, C \in V\) are variables (both non-terminal), or
- \(A \to a\) where \(a \in \Sigma\) is a terminal symbol.
The conversion algorithm eliminates \(\varepsilon\)-productions, unit productions (\(A \to B\)), and long productions (\(A \to X_1 X_2 \cdots X_k\) for \(k > 2\)) by introducing new variables; if \(\varepsilon \in L\), one additionally allows \(S \to \varepsilon\), with \(S\) not appearing on any right-hand side.
The CYK Algorithm
Define
\[\text{table}[i][j] = \{A \in V : A \Rightarrow^* a_i a_{i+1} \cdots a_j\}.\]
The CYK algorithm fills this table bottom-up:
CYK(G in CNF, w = a_1 ... a_n):
    For i = 1 to n:
        table[i][i] = {A : A -> a_i is a production}
    For length = 2 to n:
        For i = 1 to n - length + 1:
            j = i + length - 1
            table[i][j] = ∅
            For k = i to j-1:
                For each production A -> B C:
                    If B ∈ table[i][k] and C ∈ table[k+1][j]:
                        Add A to table[i][j]
    Accept iff S ∈ table[1][n]
Correctness: By induction on length. For length = 1, table[i][i] contains exactly the variables deriving \(a_i\) in one step. For length > 1, the split point \(k\) enumerates all possible first derivation steps in a binary split (valid because of CNF), and the sub-problems have already been solved.
Running time: \(O(n^3 \cdot |G|)\) where \(|G|\) is the number of productions. Three nested loops each run at most \(n\) times, and the inner production check takes \(O(|G|)\).
Space: \(O(n^2 \cdot |V|)\) for the table.
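The pseudocode above transcribes into a short recognizer; a minimal sketch (the production encoding — pairs \((A, \text{rhs})\) with rhs either a terminal or a tuple of two variables — is ours):

```python
def cyk(productions, start, w):
    """CYK recognition for a grammar in CNF.
    productions: list of (A, rhs), rhs a 1-char terminal string
    or a pair (B, C) of variables.  Rejects the empty string."""
    n = len(w)
    if n == 0:
        return False
    # table[(i, j)] = variables deriving w[i..j], 0-indexed inclusive
    table = {}
    for i in range(n):
        table[(i, i)] = {A for A, rhs in productions if rhs == w[i]}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cell = set()
            for k in range(i, j):                 # split point
                for A, rhs in productions:
                    if isinstance(rhs, tuple):
                        B, C = rhs
                        if B in table[(i, k)] and C in table[(k + 1, j)]:
                            cell.add(A)
            table[(i, j)] = cell
    return start in table[(0, n - 1)]
```

Example usage with a CNF grammar for \(\{a^n b^n : n \geq 1\}\) (productions \(S \to AB \mid AT\), \(T \to SB\), \(A \to a\), \(B \to b\)): the recognizer accepts `"aabb"` and rejects `"abab"`.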
Valiant’s O(n^2.37) Algorithm
Valiant (1975) showed that CFL recognition reduces to Boolean matrix multiplication, obtaining a subcubic parser; Lee (2002) proved the converse reduction, so the two problems are equivalent up to subcubic reductions:
The reduction encodes the “split” operation in CYK (does variable \(A\) span positions \(i\) to \(j\) with a split at \(k\)?) as a product of Boolean matrices, one for each grammar production.
Lower Bounds for Parsing
Is cubic time inherently necessary for CFL recognition? This is a major open problem:
Conjecture (CFL hardness): No algorithm can solve the general CFL recognition problem in \(O(n^{3-\varepsilon})\) time for any \(\varepsilon > 0\) without a breakthrough in matrix multiplication.
This conjecture is supported by Lee's equivalence with Boolean matrix multiplication: a combinatorial subcubic CFL recognizer would yield a combinatorial subcubic BMM algorithm, which has long resisted discovery. Fine-grained hypotheses such as conjectured lower bounds for k-Clique, the 3SUM conjecture, and SETH (the Strong Exponential Time Hypothesis) imply conditional lower bounds for variants of CFL recognition.
For special grammar subclasses:
- LL(k) and LR(k) grammars: recognition in \(O(n)\) (linear time).
- Unambiguous CFGs: Earley runs in \(O(n^2)\).
- General CNF grammars: \(O(n^\omega)\) via Valiant.
Chapter 14: String Matching and Automata
The Pattern Matching Problem
Given a text \(T = T[1..n]\) and pattern \(P = P[1..m]\), find all occurrences of \(P\) in \(T\). This is a central problem in bioinformatics, information retrieval, and compilers. The automata-theoretic approach yields the Knuth-Morris-Pratt (KMP) algorithm, which runs in optimal \(O(n + m)\) time.
The Failure Function and KMP
The naive pattern matching algorithm runs in \(O(mn)\). The key insight for speedup: when a mismatch occurs at position \(j\) of the pattern, we can often skip ahead using information about the pattern’s own structure — specifically, its failure function.
A border of a word is a proper prefix that is also a suffix. The failure function of \(P\) is defined by \(f[j] = \) the length of the longest border of \(P[1..j]\); it thus records the border lengths of all prefixes of \(P\). This connects directly to the notion of bordered words from Chapter 1: \(f[j] > 0\) iff \(P[1..j]\) is bordered.
Computing the failure function: A naive computation is \(O(m^2)\). The efficient approach (core of KMP preprocessing) uses the following:
COMPUTE-FAILURE(P):
    f[1] = 0
    k = 0
    For j = 2 to m:
        While k > 0 and P[k+1] ≠ P[j]:
            k = f[k]  // fall back along border chain
        If P[k+1] = P[j]:
            k++
        f[j] = k
    Return f
This runs in \(O(m)\) time. The amortized analysis: the variable \(k\) can increase by at most 1 per iteration of the outer loop (at most \(m\) increments total), and the while loop decrements \(k\) — so total decrements \(\leq\) total increments \(= O(m)\).
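The same amortized idea drives both the preprocessing and the search; a minimal Python sketch (0-indexed, so `f[j]` here is the border length of `P[:j+1]`; function names ours):

```python
def failure(P):
    """f[j] = length of the longest proper border of P[:j+1]."""
    m = len(P)
    f = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and P[k] != P[j]:
            k = f[k - 1]            # fall back along the border chain
        if P[k] == P[j]:
            k += 1
        f[j] = k
    return f

def kmp_search(T, P):
    """All 0-indexed occurrence positions of P in T, in O(|T| + |P|)."""
    f, k, occ = failure(P), 0, []
    for i, c in enumerate(T):
        while k > 0 and P[k] != c:
            k = f[k - 1]
        if P[k] == c:
            k += 1
        if k == len(P):
            occ.append(i - len(P) + 1)
            k = f[k - 1]            # continue: overlapping matches allowed
    return occ
```

For example, searching `"aba"` in `"abababa"` reports the overlapping occurrences at positions 0, 2, and 4.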
The KMP Automaton
The failure function defines a finite automaton for string matching:
The KMP search algorithm runs this DFA on the text \(T\): whenever the accept state \(m\) is reached, an occurrence of \(P\) has been found ending at that position in \(T\). Running the DFA on all of \(T\) takes \(O(n)\) time.
Automata-Theoretic Perspective
The pattern automaton \(M_P\) recognizes the language \(\Sigma^* P\) (all strings ending with the pattern \(P\)). The failure function computes the “suffix-to-prefix” overlap structure of the pattern, which is exactly the automaton’s transition function for mismatches.
This automaton perspective explains why KMP works: the DFA compactly encodes all possible alignments of the pattern simultaneously. The failure function is the DFA’s way of “not wasting” the information already scanned.
Connection to borders: A position \(j\) has \(f[j] = k > 0\) iff \(P[1..j]\) has a border of length \(k\). The while loop in COMPUTE-FAILURE "walks down the border chain": the borders of \(P[1..j]\) are exactly \(P[1..f[j]]\), \(P[1..f[f[j]]]\), and so on. An individual chain can have length up to \(j\) (for example, when \(P = a^m\)), but the amortized analysis above shows that the total cost of all chain traversals over the whole computation is \(O(m)\).
Aho-Corasick: Multiple Pattern Matching
The KMP algorithm finds occurrences of a single pattern. When searching for multiple patterns \(P_1, P_2, \ldots, P_k\) simultaneously, the Aho-Corasick algorithm (1975) builds a single DFA for all patterns at once.
The construction: build the trie of all patterns, then add failure links (analogous to the KMP failure function) from each node to the trie node whose label is the longest proper suffix of the current node's label that is itself a node of the trie. After \(O(\sum |P_i|)\) preprocessing (for a fixed alphabet), the automaton processes the text in \(O(n + \text{occ})\) time, where occ is the number of occurrences reported — independent of the number of patterns.
Applications: Virus signature detection (scanning for known malware signatures), genome assembly (finding k-mers), and lexical analysis in compilers all use Aho-Corasick.
The Aho-Corasick automaton is a regular language automaton: it accepts all strings containing at least one of \(P_1, \ldots, P_k\) as a substring. The language \(\Sigma^* P_1 \Sigma^* \cup \cdots \cup \Sigma^* P_k \Sigma^*\) is regular and the Aho-Corasick construction computes a near-minimal DFA for it in linear time.