CS 462: Formal Languages and Parsing
Jeffrey Shallit
Primary Reference: Jeffrey Shallit, A Second Course in Formal Languages and Automata Theory. Cambridge University Press, 2008. All major theorems, algorithms, and proofs in these notes follow or are inspired by this text. Students are strongly encouraged to read the book alongside these notes.
Chapter 1: Strings and Combinatorics on Words
Basic Notation
Formal language theory is ultimately a theory about strings. Before developing any deep machinery, we establish a precise vocabulary for reasoning about sequences of symbols.
A finite alphabet \(\Sigma\) is a nonempty finite set of symbols. Single symbols are typically denoted \(a, b, c \in \Sigma\). A finite word (or string) over \(\Sigma\) is a map from an initial segment \(\{0, 1, \ldots, n-1\}\) to \(\Sigma\); we write it as a sequence \(w = w[0] w[1] \cdots w[n-1]\). Its length is \(|w| = n\). The empty word \(\varepsilon\) has length 0 and is a first-class string — it is not excluded unless stated otherwise. An infinite word is a map from \(\mathbb{N}\) to \(\Sigma\), written \(\mathbf{w} = w[0]w[1]w[2]\cdots\) in boldface.
We use the following universe sets:
- \(\Sigma^*\): all finite words over \(\Sigma\), including \(\varepsilon\)
- \(\Sigma^+\): all nonempty finite words, \(\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}\)
- \(\Sigma^\omega\) (also \(\Sigma^{\mathbb{N}}\)): all infinite words over \(\Sigma\)
- \(\Sigma^\infty = \Sigma^* \cup \Sigma^\omega\): all finite and infinite words
Finite words are denoted by letters \(s, t, u, v, w, x, y, z\). Natural numbers \(\mathbb{N} = \{0, 1, 2, \ldots\}\) use letters \(i, j, k, \ell, m, n\).
A useful shorthand for subwords is interval notation: \(w[a..b] = w[a] w[a+1] \cdots w[b]\).
Concatenation of strings \(x\) and \(y\) is denoted \(xy\) or \(x \cdot y\). Concatenation is associative but not in general commutative. Writing it multiplicatively lets us define string powers: \(x^0 = \varepsilon\), \(x^n = x \cdot x^{n-1}\) for \(n \geq 1\). This satisfies \(x^{m+n} = x^m x^n\).
The subword relation partitions naturally into standard cases. Given strings \(x\) and \(z\):
- \(x\) is a prefix of \(z\) if \(\exists\, y\) with \(z = xy\)
- \(x\) is a suffix of \(z\) if \(\exists\, y\) with \(z = yx\)
- \(x\) is a subword (or factor) of \(z\) if \(\exists\, w, y\) with \(z = wxy\)
- \(x\) is a subsequence of \(z\) if \(x\) can be obtained by deleting zero or more symbols from \(z\)
The word “substring” is ambiguous in the literature — some authors mean subword (contiguous), others mean subsequence (not necessarily contiguous). We always say “subword” or “subsequence” explicitly.
A prefix, suffix, or subword is proper if it is strictly shorter than the containing word: \(x\) is a proper prefix of \(z\) if \(z = xy\) and \(x \neq z\) (equivalently, \(y \neq \varepsilon\)).
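The four relations above translate directly into code. A minimal Python sketch (the function names are ours, not from the text):

```python
def is_prefix(x: str, z: str) -> bool:
    # x is a prefix of z iff z = xy for some y
    return z.startswith(x)

def is_suffix(x: str, z: str) -> bool:
    # x is a suffix of z iff z = yx for some y
    return z.endswith(x)

def is_factor(x: str, z: str) -> bool:
    # x is a subword (factor) of z iff z = wxy for some w, y
    return x in z

def is_subsequence(x: str, z: str) -> bool:
    # x is obtained from z by deleting zero or more symbols; greedy
    # left-to-right matching suffices ('c in it' consumes the iterator).
    it = iter(z)
    return all(c in it for c in x)
```

Note that the empty word \(\varepsilon\) is a prefix, suffix, subword, and subsequence of every word, consistent with the conventions above.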
Primitive Words and Morphisms
The concept of primitive words is simple but surprisingly deep. A nonempty word \(w\) is primitive if it cannot be expressed as a nontrivial repetition of a shorter word: \(w = x^k\) implies \(k = 1\) (and hence \(x = w\)).
Every nonempty word \(w\) has a unique primitive root: there exists a unique primitive word \(z\) and a unique integer \(k \geq 1\) such that \(w = z^k\). (This follows from the Lyndon-Schützenberger theorem proven in the next section.) The set of primitive words over a binary alphabet is:
\[P_2 = \{0, 1, 01, 10, 001, 010, 011, 100, 101, 110, \ldots\}\]An open problem: is \(P_2\) context-free? The answer is almost certainly no, but no proof is known.
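The primitive root of a word can be computed by brute force over divisor lengths. A short sketch (ours, not from the text; this is quadratic in the worst case, though linear-time methods via the KMP failure function exist):

```python
def primitive_root(w: str):
    """Return (z, k) with z primitive and w = z^k; unique for nonempty w."""
    n = len(w)
    for p in range(1, n + 1):
        # w has a root of length p iff p divides n and w is w[:p] repeated.
        if n % p == 0 and w[:p] * (n // p) == w:
            return w[:p], n // p
```

The loop always terminates at \(p = n\), returning \((w, 1)\) when \(w\) is itself primitive.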
A morphism (or homomorphism) \(h : \Sigma^* \to \Delta^*\) is a map satisfying \(h(xy) = h(x)h(y)\) for all \(x, y \in \Sigma^*\). A morphism is completely determined by its values on single letters. Morphisms preserve concatenation, and hence string structure, making them a fundamental tool in formal language theory.
Further Operations on Words
The perfect shuffle of two words \(x, y\) of equal length \(n\) interleaves their symbols:
\[x \mathop{\text{Ш}} y = x[0]\, y[0]\, x[1]\, y[1] \cdots x[n-1]\, y[n-1]\]A classic example: \(\mathtt{term} \mathop{\text{Ш}} \mathtt{hoes} = \mathtt{theorems}\).
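The perfect shuffle is a one-liner in code. A sketch (assuming equal-length inputs, as in the definition):

```python
def perfect_shuffle(x: str, y: str) -> str:
    # Interleave the symbols of two equal-length words.
    assert len(x) == len(y)
    return "".join(a + b for a, b in zip(x, y))
```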
The reversal of \(x\) is \(x^R\), the symbols of \(x\) in reverse order. For example, \((\mathtt{stressed})^R = \mathtt{desserts}\). A word satisfying \(x = x^R\) is a palindrome.
There are two natural linear orderings on \(\Sigma^*\). Assume \(\Sigma\) carries an underlying linear order (e.g., \(a < b < c < \cdots\) or \(0 < 1 < 2 < \cdots\)).
- Lexicographic order: \(x < y\) if either \(x\) is a proper prefix of \(y\), or there is an index \(i\) such that \(x[j] = y[j]\) for all \(j < i\) and \(x[i] < y[i]\).
- Radix (length-lexicographic) order: \(x < y\) if \(|x| < |y|\), or \(|x| = |y|\) and \(x < y\) lexicographically. This gives a total well-order on \(\Sigma^*\): \(\varepsilon, 0, 1, 2, 00, 01, 02, 10, 11, \ldots\)
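Radix order corresponds exactly to the sort key "(length, word)". A small sketch (assuming the symbol order is the character order, e.g. \(0 < 1 < 2\)):

```python
def radix_key(w: str):
    # Compare by length first, then lexicographically.
    return (len(w), w)

words = ["10", "", "2", "00", "0", "1", "01", "02"]
in_radix_order = sorted(words, key=radix_key)
```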
A cyclic shift of \(x\) is any word of the form \(vu\) where \(x = uv\). Two words \(x, y\) are conjugates if there exist \(u, v\) with \(x = uv\) and \(y = vu\). For example, \(\mathtt{eat}, \mathtt{ate}, \mathtt{tea}\) are all conjugates of each other.
A nonempty word \(w\) is bordered if some nonempty word \(x \neq w\) is both a prefix and a suffix of \(w\); such an \(x\) is called a border. For example, \(\mathtt{entanglement}\) has border \(\mathtt{ent}\). The word \(\mathtt{alfalfa}\) has an overlapping border: \(\mathtt{alfa}\) is both a prefix and a suffix, and these occurrences overlap. Bordered words are intimately connected to the structure of solutions to the equation \(xy = yz\), as we will see shortly.
Periodicity of Infinite Words
For a nonempty finite word \(x\), define
\[x^\omega = xxx\cdots\]This is the infinite word obtained by repeating \(x\) indefinitely. An infinite word \(\mathbf{z}\) is purely periodic if \(\mathbf{z} = x^\omega\) for some finite \(x\). It is ultimately periodic if \(\mathbf{z} = y x^\omega\) for some finite words \(x, y\) with \(x \neq \varepsilon\).
A finite word \(w\) of length \(n\) has period \(p\) (where \(1 \leq p \leq n\)) if \(w[i] = w[i+p]\) for all valid \(i\). Note that every word has period \(n\). A word is periodic with period \(p < n\) if this holds. The minimum period of \(w\) is the smallest such \(p\).
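The minimum period can be found naively in \(O(n^2)\) time. A sketch (ours; the KMP failure function yields it in linear time):

```python
def min_period(w: str) -> int:
    # Smallest p in [1, n] with w[i] == w[i + p] for all 0 <= i < n - p.
    # The loop always succeeds at p = n, where the condition is vacuous.
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
```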
The Lyndon-Schützenberger Theorems
The Lyndon-Schützenberger theorems are among the deepest results in combinatorics on words. They give complete characterizations of when certain string equations have solutions — a kind of analogue, for strings, of classic number-theoretic identities.
The motivation is beautifully simple. Consider the equation \(x^2 + xy = y^2 - 1\) over the integers — it has solutions like \((x, y) = (0,1), (1,2), (3,5)\), which are consecutive Fibonacci numbers. Analogously, we can ask: what are the solutions to equations like \(xy = yx\) or \(x^2 = y^3\) over the monoid \(\Sigma^*\)?
The Equation xy = yz
Theorem 1.1 (Lyndon-Schützenberger). Let \(x, y, z \in \Sigma^+\). Then \(xy = yz\) if and only if there exist \(u \in \Sigma^+\), \(v \in \Sigma^*\), and an integer \(e \geq 0\) such that \(x = uv\), \(z = vu\), and \(y = (uv)^e u\).

This theorem completely parametrizes the solution set. The connection to bordered words: if \(xy = yz\) with \(|x| < |y|\), then \(y\) is bordered (its border relates directly to \(x\) and \(z\)).
For \((\Leftarrow)\), a direct computation suffices: \(xy = (uv)(uv)^e u = (uv)^{e+1} u = (uv)^e u (vu) = yz\). For \((\Rightarrow)\), we proceed by strong induction on \(|y|\).
Base case \(|y| = 1\): Write \(y = a\) for a single symbol. Then \(xa = az\). We can write \(x = ax'\) and \(z = z'a\) for some strings \(x', z'\), and then \(ax'a = az'a\) implies \(x' = z'\). Take \(u = a\), \(v = x' = z'\), \(e = 0\). Then \(x = uv\), \(z = vu\), \(y = u = (uv)^0 u\). \(\checkmark\)
Inductive step: Assume the result holds for all triples where the middle word has length less than \(|y|\). We break into two cases.
Case I: \(|x| \geq |y|\). Since \(xy = yz\), we have \(|xy| = |yz|\), so \(|x| = |z|\). Write \(x = yw\) for some \(w\) (possible since \(|x| \geq |y|\) and \(x\) has \(y\) as a prefix from \(xy = yz\)). Then take \(u = y\), \(v = w\), \(e = 0\).
Let us verify: \(x = yw = uv\). From \(xy = yz\) we get \((yw)y = yz\), i.e., \(y(wy) = yz\); canceling \(y\) on the left gives \(z = wy = vu\). Finally, \(y = (uv)^0 u = u = y\). ✓
Case II: \(|x| < |y|\). From \(xy = yz\), since \(|x| < |y|\) the word \(y\) starts with \(x\) and ends with \(z\), with some overlap region \(w\) defined by \(y = xw = wz\) (i.e., \(w\) is the word satisfying \(|w| = |y| - |x|\)). Note \(w \neq \varepsilon\) (since \(|x| < |y|\)) and \(x, z \neq \varepsilon\) (given).
Since \(|w| < |y|\), the inductive hypothesis applies to the equation
\[xw = wz\]yielding \(u\), \(v\), and \(e' \geq 0\) with
\[x = uv, \quad z = vu, \quad w = (uv)^{e'} u\]Then \(y = xw = uv \cdot (uv)^{e'} u = (uv)^{e'+1} u\). Setting \(e = e' + 1\), we have \(x = uv\), \(z = vu\), \(y = (uv)^e u\). ✓
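The parametrization can be confirmed on small examples by brute force. A verification sketch (our own helper, not an efficient algorithm):

```python
def ls_decompose(x: str, y: str, z: str):
    """Given xy = yz with x, z nonempty, search for u, v, e with
    x = uv, z = vu, and y = (uv)^e u."""
    assert x + y == y + z
    for cut in range(len(x) + 1):
        u, v = x[:cut], x[cut:]
        if z != v + u:
            continue
        # |y| = e|x| + |u|, so e is at most |y| / |x|.
        for e in range(len(y) // len(x) + 1):
            if (u + v) * e + u == y:
                return u, v, e
```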
Commuting Words
When do two words commute? In linear algebra, two matrices commute if and only if they share a common eigenbasis (in the diagonalizable case). The string analogue is: two words commute if and only if they are powers of a common word.
Theorem 1.2. Let \(x, y \in \Sigma^+\). The following conditions are equivalent:

1. There exists \(z \in \Sigma^+\) and integers \(k, \ell > 0\) with \(x = z^k\) and \(y = z^\ell\).
2. \(x^\omega = y^\omega\).
3. There exist integers \(i, j > 0\) with \(x^i = y^j\).
4. \(xy = yx\).
5. There exist integers \(r, s > 0\) with \(x^r y^s = y^s x^r\).
6. Define the morphism \(h : \{a, b\}^* \to \Sigma^*\) by \(h(a) = x\) and \(h(b) = y\). Then there exist two distinct words \(u, v \in \{a, b\}^*\) with \(h(u) = h(v)\).
7. \(x\{x,y\}^* \cap y\{x,y\}^* \neq \emptyset\).
8. \(x\{x,y\}^\omega \cap y\{x,y\}^\omega \neq \emptyset\).

The proof is a cycle of implications (1) \(\Rightarrow\) (2) \(\Rightarrow \cdots \Rightarrow\) (8) \(\Rightarrow\) (1).
Condition (1) is the canonical form: two words commute precisely when they are powers of a common root. Condition (6) is remarkable — it says commutativity is equivalent to the morphism \(h\) being “non-injective” on the free monoid \(\{a,b\}^*\).
(1) \(\Rightarrow\) (2): If \(x = z^k\) and \(y = z^\ell\), then \(x^\omega = (z^k)^\omega = z^\omega = (z^\ell)^\omega = y^\omega\).
(2) \(\Rightarrow\) (3): Let \(i = |y|\) and \(j = |x|\). The prefix of length \(|x||y|\) of \(x^\omega\) equals the prefix of the same length of \(y^\omega\). This gives \(x^i = y^j\).
(3) \(\Rightarrow\) (4): Suppose \(x^i = y^j\) with \(i, j > 0\). Assume without loss of generality that \(|x| \geq |y|\); then \(y\) is a prefix of \(x\), so write \(x = yw\). Then
\[y^j = x^i = (yw)^i = y(wy)^{i-1}w\]Canceling \(y\) on the left: \(y^{j-1} = (wy)^{i-1}w\). Appending \(y\) on the right: \(y^j = (wy)^{i-1}wy = (wy)^i\).
Now we have \((yw)^i = y^j = (wy)^i\). Comparing the first \(|y| + |w|\) characters gives \(yw = wy\). Since \(x = yw = wy\), appending \(y\): \(xy = (yw)y = y(wy) = yx\).
(4) \(\Rightarrow\) (5): Clear, since if \(xy = yx\), then \(x^r y^s = y^s x^r\) for all \(r, s > 0\).
(5) \(\Rightarrow\) (6): Suppose \(x^r y^s = y^s x^r\). Then \(h(a^r b^s) = h(b^s a^r)\), and since \(a^r b^s \neq b^s a^r\) (as \(r, s > 0\)), condition (6) holds.
(6) \(\Rightarrow\) (7): From condition (6), there exist distinct \(u, v\) with \(h(u) = h(v)\). By swapping if needed, we may assume \(u\) starts with \(a\) and \(v\) starts with \(b\) (or use any occurrence of \(a\) and \(b\) in them). With more care one shows that the common value \(h(u) = h(v)\) lies in \(x\{x,y\}^* \cap y\{x,y\}^*\).
(7) \(\Rightarrow\) (8): Any member of \(x\{x,y\}^*\) extends to a member of \(x\{x,y\}^\omega\).
(8) \(\Rightarrow\) (1): If \(\mathbf{w} \in x\{x,y\}^\omega \cap y\{x,y\}^\omega\), then \(\mathbf{w}\) starts with both \(x\) and \(y\). This forces \(x\) and \(y\) to be prefixes of each other’s powers. A careful length argument then shows they share a primitive root.
The Equation x² = y³
To appreciate the full power of Theorem 1.2, consider the equation \(x^2 = y^3\) over the integers: every solution has the form \(x = z^3\), \(y = z^2\) for some integer \(z\). The analogous string result follows immediately from condition (3): if \(x^2 = y^3\) for words \(x, y \in \Sigma^+\), then \(x\) and \(y\) are powers of a common word.
More generally, any equation \(x^i = y^j\) with \(i, j > 0\) over \(\Sigma^+\) forces \(x\) and \(y\) to be powers of a common word.
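Conditions (1), (3), and (4) are easy to confirm experimentally. A brief sketch (the `root` helper is ours):

```python
def root(w: str) -> str:
    # Primitive root: shortest z with w = z^k for some k >= 1.
    n = len(w)
    for p in range(1, n + 1):
        if n % p == 0 and w[:p] * (n // p) == w:
            return w[:p]

# Powers of a common word z satisfy x^2 = y^3 when x = z^3, y = z^2:
z = "ab"
x, y = z * 3, z * 2
```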
Repetitions in Words
A word is said to contain a square if it has a subword of the form \(ww\) for some nonempty \(w\). For example, \(\mathtt{abab}\) is a square. A word is squarefree if it contains no square. Are there arbitrarily long squarefree words? Over a binary alphabet the answer is no — every binary word of length \(\geq 4\) contains a square. But over a ternary alphabet, arbitrarily long squarefree words exist. This was proved by Thue in 1906.
An overlap is a word of the form \(axaxa\) where \(a \in \Sigma\) and \(x \in \Sigma^*\). Overlaps are a relaxation of squares: a word containing no overlap also avoids long repetitions. The remarkable fact, also due to Thue, is that overlaps can be avoided even over a binary alphabet: there is an infinite overlap-free word over \(\{0, 1\}\).
The Thue-Morse Sequence
The Thue-Morse sequence \(\mathbf{t} = t_0 t_1 t_2 \cdots\) is defined by
\[t_n = \text{(number of 1's in the binary expansion of } n) \bmod 2\]So \(\mathbf{t} = 0110100110010110\cdots\). Equivalently, \(\mathbf{t}\) is the fixed point of the morphism \(\mu : 0 \mapsto 01, 1 \mapsto 10\). That is, \(\mu(\mathbf{t}) = \mathbf{t}\).
The proof proceeds by showing that any overlap \(axaxa\) in \(\mathbf{t}\) can be “lifted” to an overlap in \(\mathbf{t}\) via the inverse of \(\mu\), eventually reaching a contradiction. This type of argument — reducing the problem on the image back to the preimage — is characteristic of morphic sequences.
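Thue's theorem can be checked empirically on long prefixes of \(\mathbf{t}\). A sketch: generate a prefix by iterating \(\mu\), then search for overlaps by brute force (an overlap \(axaxa\) is exactly a factor of length \(2m+1\) with period \(m = |ax|\)):

```python
def thue_morse(k: int) -> str:
    # k-fold iteration of mu: 0 -> 01, 1 -> 10, starting from "0";
    # yields the prefix of t of length 2^k.
    t = "0"
    for _ in range(k):
        t = "".join("01" if c == "0" else "10" for c in t)
    return t

def has_overlap(w: str) -> bool:
    # Search all factors f of length 2m + 1; f is an overlap iff
    # f consists of its first m symbols twice, followed by f[0].
    n = len(w)
    for i in range(n):
        for m in range(1, (n - i - 1) // 2 + 1):
            f = w[i:i + 2 * m + 1]
            if f == f[:m] * 2 + f[0]:
                return True
    return False
```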
Chapter 2: Regular Languages — Advanced Theory
Review: Finite Automata
We assume familiarity with the basics of regular languages from a first course. Recall the key objects:
- A deterministic finite automaton (DFA) is a 5-tuple \(M = (Q, \Sigma, \delta, q_0, F)\) where \(Q\) is a finite set of states, \(\delta : Q \times \Sigma \to Q\) is the transition function, \(q_0 \in Q\) is the start state, and \(F \subseteq Q\) is the set of accepting states.
- A nondeterministic finite automaton (NFA) has \(\delta : Q \times \Sigma \to 2^Q\) (a set-valued transition function), with optional \(\varepsilon\)-transitions.
- Every NFA can be converted to an equivalent DFA via the subset construction, with at most exponential blowup in the number of states.
- The regular languages are exactly those accepted by DFAs (and equivalently NFAs), and also exactly those described by regular expressions.
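The subset construction itself is short. A Python sketch (\(\varepsilon\)-transitions omitted for brevity; the representation, with `delta` mapping (state, symbol) to a set of successors, is our choice, not from the text):

```python
def subset_construction(alphabet, delta, start, accepting):
    """Convert an NFA (without epsilon-moves) to a DFA.
    DFA states are frozensets of NFA states; only reachable subsets
    are constructed, so the 2^|Q| worst case is often avoided."""
    q0 = frozenset([start])
    trans, states, work = {}, {q0}, [q0]
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(t for q in S for t in delta.get((q, a), ()))
            trans[S, a] = T
            if T not in states:
                states.add(T)
                work.append(T)
    final = {S for S in states if S & accepting}
    return states, trans, q0, final
```

For example, a 3-state NFA for binary words ending in 01 determinizes to just 3 reachable subsets, well below the \(2^3\) worst case.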
Moore and Mealy Machines
Beyond acceptance, finite automata can also compute output. There are two classical models.
The difference is subtle but important: a Moore machine attaches an output symbol to each state, so its output depends only on the state sequence (and is “one step delayed”), while a Mealy machine attaches an output symbol to each transition. Up to this one-step shift, the two models compute the same class of functions.
Advanced Closure Properties
The regular languages are closed under the usual Boolean operations (union, intersection, complement), concatenation, Kleene star, reversal, morphism, and inverse morphism. We now examine several more subtle closure properties.
Quotients
For \(x \in \Sigma^*\) and \(K \subseteq \Sigma^*\), define the left quotients \(x^{-1}L = \{y : xy \in L\}\) and \(K^{-1}L = \bigcup_{x \in K} x^{-1}L\). If \(L\) is regular, both are regular, even when \(K\) is an arbitrary (not necessarily regular) language. The proof for \(x^{-1}L\) is simple: given a DFA \(M\) for \(L\), start in state \(\delta^*(q_0, x)\) instead of \(q_0\). For \(K^{-1}L\) with \(K\) arbitrary, observe that \(x^{-1}L\) depends only on the state \(\delta^*(q_0, x)\), so there are only finitely many distinct quotients; \(K^{-1}L\) is a finite union of them, hence regular.
Half of a Language
A more exotic operation: given \(L\), define \(\frac{1}{2} L\) as the set of words that are the “first half” of some word in \(L\):
\[\tfrac{1}{2} L = \{x : \exists y \text{ with } |y| = |x| \text{ and } xy \in L\}\]
The key insight is that the reverse automaton \(M^R\) accepts exactly \(L^R\), the reversal of \(L\). We construct the NFA for \(\frac{1}{2}L\) as a product of \(M\) (running forward) with \(M^R\) started from all accepting states (running backward simultaneously). A valid split is found when the two “pointers” meet: the forward pointer from \(q_0\) and the backward pointer from some accepting state arrive at the same state \(q\) after the same number of steps. Formally, the NFA has state space \(Q \times 2^Q\) and accepts when the forward state belongs to the set of states simultaneously reachable backward.
Since \(Q\) is finite, this NFA has \(|Q| \cdot 2^{|Q|}\) states, yielding a regular language.
Cyclic Shift of a Language
The cyclic shifts of words in \(L\) form the language \(\text{cyc}(L) = \{yx : xy \in L\}\); if \(L\) is regular, so is \(\text{cyc}(L)\). The proof uses a product construction on two copies of the DFA for \(L\). A word \(yx\) (with \(xy \in L\)) lies in \(\text{cyc}(L)\) iff there is a state \(q\) with \(\delta^*(q_0, x) = q\) and \(\delta^*(q, y) \in F\). The NFA guesses \(q\), reads \(y\) starting from \(q\) and checks that it reaches an accepting state, then reads \(x\) starting from \(q_0\) and checks that it returns to \(q\).
Transducers and Nivat’s Theorem
Transducers generalize finite automata by producing output rather than just accepting/rejecting input. They are the finite-state analogue of functions between formal languages.
A transducer \(T\) computes a rational relation \(\tau_T \subseteq \Sigma^* \times \Delta^*\): we have \((x, y) \in \tau_T\) if there is an accepting path reading input \(x\) and producing output \(y\). The image of a language \(L\) under \(T\) is \(T(L) = \{y : \exists x \in L, (x,y) \in \tau_T\}\).
Nivat’s theorem characterizes rational relations: \(R \subseteq \Sigma^* \times \Delta^*\) is rational iff there exist an alphabet \(\Gamma\), a regular language \(K \subseteq \Gamma^*\), and morphisms \(g : \Gamma^* \to \Sigma^*\) and \(h : \Gamma^* \to \Delta^*\) such that \(R = \{(g(w), h(w)) : w \in K\}\). In other words, rational relations are exactly those that can be “decoded” from a regular language via a pair of morphisms.
Corollary: Rational transductions preserve regular languages. That is, if \(L\) is regular and \(T\) is a finite transducer, then \(T(L)\) is regular.
Composition of Transducers
Unlike DFAs (where composition is trivial via product), transducer composition is non-trivial because the output of one transducer must be fed as input to another. The result, due to Elgot and Mezei, is that rational relations are nonetheless closed under composition: if \(R_1 \subseteq \Sigma^* \times \Delta^*\) and \(R_2 \subseteq \Delta^* \times \Gamma^*\) are rational, so is their composition \(R_2 \circ R_1\).
However, rational transductions are not closed under complementation or intersection (as relations). This makes the study of transducers substantially richer than that of DFAs.
Two-Way Finite Automata
A two-way deterministic finite automaton (2DFA) is a DFA whose read head can move both left and right on the input tape. The input is surrounded by left and right endmarkers \(\vdash\) and \(\dashv\). Formally, the transition function has the form \(\delta : Q \times (\Sigma \cup \{\vdash, \dashv\}) \to Q \times \{-1, +1\}\), specifying the new state and the direction of head motion.
The remarkable theorem due to Shepherdson (1959) is that 2DFAs are no more powerful than 1DFAs:
Proof sketch: For each word \(w\), the behavior of \(M\) on \(w\) is summarized by a partial function \(\tau_w\) on states, recording, for each state in which the head crosses into \(w\), the state (if any) in which it eventually crosses back out. There are only finitely many such partial functions (at most \((|Q|+2)^{|Q|}\)), so the set of possible transformations is finite. The transformation monoid \(\mathcal{T}(M)\) is the set of all transformations induced by words in \(\Sigma^*\). We can simulate \(M\) by a 1DFA whose states are transformations: the start state is the identity transformation, and reading symbol \(a\) composes the current transformation with \(\tau_a\). Acceptance is determined by applying the final transformation to \(q_0\) and checking if the result is an accepting state.
This construction yields a 1DFA with at most \((|Q|+2)^{|Q|}\) states.
While 2DFAs are not more expressive than 1DFAs, they can be exponentially more succinct. There are languages whose smallest 2DFA has \(n\) states but whose smallest 1DFA requires \(2^n\) states.
Automata and Boolean Matrices
\[(T_a)_{ij} = 1 \iff \delta(q_i, a) = q_j\]\[T_w = T_{a_1} T_{a_2} \cdots T_{a_k}\](matrix product over the Boolean semiring, where \(1+1=1\)). A word \(w\) is accepted by \(M\) iff \((T_w)_{q_0, f} = 1\) for some \(f \in F\).
This viewpoint connects regular language theory to the theory of finite monoids: the set \(\{T_w : w \in \Sigma^*\}\) forms a finite monoid (the transition monoid of \(M\); for the minimal DFA it is the syntactic monoid of the language), and the language is characterized by which elements of this monoid are “accepting.” This is the foundation of algebraic automata theory.
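A small demonstration of the matrix viewpoint (our own toy DFA: two states, accepting words with an odd number of 1’s):

```python
def bool_matmul(A, B):
    # Matrix product over the Boolean semiring: + is 'or', x is 'and'.
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

IDENTITY = [[True, False], [False, True]]   # T_epsilon
T = {"0": [[True, False], [False, True]],   # 0 fixes both states
     "1": [[False, True], [True, False]]}   # 1 swaps q0 <-> q1

def accepts(w: str) -> bool:
    # w is accepted iff (T_w)[q0][f] = 1 for some accepting f; here F = {q1}.
    M = IDENTITY
    for c in w:
        M = bool_matmul(M, T[c])
    return M[0][1]
```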
The Myhill-Nerode Theorem
The Myhill-Nerode theorem is one of the most important results in automata theory. It gives a purely language-theoretic characterization of regularity, without reference to machines or grammars, and simultaneously explains the structure of the minimum DFA.

Define \(x \equiv_L y\) if for all \(z \in \Sigma^*\), \(xz \in L \iff yz \in L\). The theorem states: \(L\) is regular if and only if \(\equiv_L\) has finitely many equivalence classes, and in that case the number of classes equals the number of states of the minimal DFA for \(L\).
\((\Rightarrow)\) Suppose \(L\) is accepted by a DFA \(M\) with \(n\) states. For any \(x, y \in \Sigma^*\), if \(\delta^*(q_0, x) = \delta^*(q_0, y)\) (they lead to the same state), then for any \(z\), \(\delta^*(q_0, xz) = \delta^*(q_0, yz)\), so \(xz \in L \iff yz \in L\), meaning \(x \equiv_L y\). Thus all words leading to a given state lie in a single \(\equiv_L\)-class: the classes are coarsenings of the state preimages, so there are at most \(n\) classes.
\((\Leftarrow)\) Suppose \(\equiv_L\) has finitely many classes. We construct a DFA \(M_L\) whose states are the equivalence classes \([x]_L\) for \(x \in \Sigma^*\):
- State set: \(Q_L = \{[x]_L : x \in \Sigma^*\}\)
- Start state: \([\varepsilon]_L\)
- Transitions: \(\delta([x]_L, a) = [xa]_L\) (well-defined since if \(x \equiv_L y\) then \(xa \equiv_L ya\))
- Accept states: \(F_L = \{[x]_L : x \in L\}\) (well-defined since \(L\) is a union of \(\equiv_L\)-classes)
This DFA accepts \(L\): \(\delta^*([\varepsilon]_L, w) = [w]_L\), and \([w]_L \in F_L \iff w \in L\). So \(L\) is regular.
For minimality: the DFA \(M_L\) is minimal. Distinct classes \([x]_L \neq [y]_L\) are distinguishable: there exists \(z\) with \(xz \in L, yz \notin L\) (or vice versa). Consequently, in any DFA for \(L\), words from distinct classes must lead to distinct states; if \(x\) and \(y\) led to the same state, then reading \(z\) from that state would have to both accept and reject, a contradiction. So every DFA for \(L\) has at least as many states as \(M_L\), and \(M_L\) is the unique (up to isomorphism) minimal DFA for \(L\).
The Myhill-Nerode theorem has two uses: (1) proving regularity by bounding the number of equivalence classes, and (2) proving non-regularity by showing infinitely many classes. For the latter, it suffices to find an infinite set of words that are pairwise distinguishable.
Example: The language \(L = \{0^n 1^n : n \geq 0\}\) over \(\{0,1\}\) has infinitely many Myhill-Nerode classes. The words \(0^i\) for \(i = 0, 1, 2, \ldots\) are pairwise distinguishable: \(0^i 1^i \in L\) but \(0^j 1^i \notin L\) for \(j \neq i\). Hence \(L\) is not regular.
DFA Minimization Algorithms
Given a DFA \(M\), we want to compute the minimum DFA for \(L(M)\). By the Myhill-Nerode theorem, this amounts to finding the \(\equiv_L\)-classes. The main algorithms work by computing the indistinguishability relation and merging equivalent states.
Two states \(p, q\) are distinguishable if there exists a word \(z\) such that exactly one of \(\delta^*(p, z)\) and \(\delta^*(q, z)\) is in \(F\). We denote this by \(p \not\sim q\). Two states are equivalent (indistinguishable) if no such \(z\) exists, written \(p \sim q\).
Note: unreachable states must first be removed before minimization; states that cannot reach an accepting state are all equivalent to one another and will be merged by the algorithm automatically.
Algorithm NAIVE-MINIMIZE (O(n³))
The simplest approach iteratively marks distinguishable pairs.
NAIVE-MINIMIZE(M):
Initialize: mark (p, q) as distinguishable if exactly one is in F.
Repeat until no change:
For each unmarked pair (p, q) and each symbol a:
If (δ(p,a), δ(q,a)) is marked as distinguishable:
Mark (p, q) as distinguishable.
Merge all unmarked (indistinguishable) pairs.
Correctness: The invariant is that all marked pairs are truly distinguishable (base case: pairs where one is accepting, one is rejecting are distinguishable by \(\varepsilon\)). The iteration propagates distinguishability backwards through transitions. When no more pairs can be marked, the unmarked pairs are genuinely indistinguishable.
Running time: There are \(O(n^2)\) pairs, and each pass of the outer loop takes \(O(n^2 |\Sigma|)\) time to check all pairs and symbols. Pairs distinguishable by a word of length \(\ell\) are marked within \(\ell\) passes, and a distinguishing word of length less than \(n\) always suffices, so at most \(n\) passes are needed. Total: \(O(n^3 |\Sigma|)\).
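A direct transcription of NAIVE-MINIMIZE (a sketch; it assumes a complete DFA with all states reachable, and represents `delta` as a dict from (state, symbol) to state):

```python
from itertools import combinations

def indistinguishability_classes(Q, alphabet, delta, F):
    # Phase 1: mark distinguishable pairs until no change (fixpoint).
    marked = {frozenset(p) for p in combinations(Q, 2)
              if (p[0] in F) != (p[1] in F)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(Q, 2):
            pair = frozenset((p, q))
            if pair not in marked and any(
                    frozenset((delta[p, a], delta[q, a])) in marked
                    for a in alphabet):
                marked.add(pair)
                changed = True
    # Phase 2: merge unmarked (indistinguishable) pairs into classes.
    classes = []
    for q in Q:
        for c in classes:
            if frozenset((q, c[0])) not in marked:
                c.append(q)
                break
        else:
            classes.append([q])
    return [set(c) for c in classes]
```

The returned classes are exactly the states of the minimal DFA.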
Algorithm MINIMIZE (O(n²))
A more careful bookkeeping approach achieves \(O(n^2)\):
MINIMIZE(M):
Initialize a table T[p][q] = ∅ for all unmarked pairs.
Mark (p, q) if exactly one of p, q is in F.
For each symbol a and each pair (p, q):
For each (r, s) with δ(r, a) = p, δ(s, a) = q (preimage pairs):
Add (r, s) to T[p][q].
Process newly marked pairs using the table:
When (p, q) is marked, for each (r, s) in T[p][q]:
Mark (r, s) if not already marked.
This is essentially a BFS/DFS propagation using precomputed dependency lists. Each pair is processed at most once, giving \(O(n^2 |\Sigma|)\) total.
Algorithm FAST-MINIMIZE (Hopcroft, O(n log n))
Hopcroft’s algorithm achieves the optimal \(O(n \log n)\) time, which is tight in the worst case. The algorithm maintains a partition refinement structure.
Key idea: Start with the coarsest partition \(\{F, Q \setminus F\}\) (accepting vs. rejecting). Repeatedly refine: a block \(B\) is a splitter for block \(C\) with respect to symbol \(a\) if \(C\) contains states that go to \(B\) on \(a\) as well as states that don’t. Split \(C\) accordingly.
FAST-MINIMIZE(M):
Start with partition P = {F, Q \ F}.
Initialize waiting queue W = {min(F, Q\F)} (the smaller of the two initial blocks).
While W is nonempty:
Remove a splitter S from W.
For each symbol a:
Let C = {q : δ(q, a) ∈ S} (states that transition into S on a).
For each block B in P that is split by C:
Split B into B ∩ C and B \ C.
Add the smaller part to W (the "smaller half" trick).
The smaller-half trick: Each state is added to \(W\) at most \(O(\log n)\) times, because whenever a state re-enters \(W\), its containing block is at most half the size of the block it belonged to the previous time. This gives the \(O(n \log n)\) bound.
Algorithm BRZOZOWSKI
An elegant but conceptually surprising algorithm uses reversal and determinization:
BRZOZOWSKI(M):
1. Compute the reverse NFA M^R (swap initial and accepting states, reverse all transitions).
2. Apply subset construction to M^R to get DFA D_1.
3. Compute the reverse NFA D_1^R.
4. Apply subset construction to D_1^R to get DFA D_2.
Return D_2.
Why does this work? The key insight is that the subset construction applied to an NFA produces an equivalent DFA in which every state is reachable (by definition of the construction). Applied to \(M^R\), the result \(D_1\) is a DFA for the reverse language \(L^R\). Reversing again and applying the subset construction gives a reachable DFA for \(L\) in which, additionally, all states are pairwise distinguishable (this follows from a careful analysis of what the double reversal does to the states). A reachable DFA in which all states are pairwise distinguishable is the minimal DFA.
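Brzozowski's algorithm is strikingly short given a determinization routine. A self-contained sketch (the representations are our own; `delta` maps (state, symbol) to state):

```python
def determinize(alphabet, ndelta, starts, finals):
    # Subset construction for an NFA given by ndelta: (state, symbol) -> set.
    q0 = frozenset(starts)
    trans, states, work = {}, {q0}, [q0]
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(t for q in S for t in ndelta.get((q, a), ()))
            trans[S, a] = T
            if T not in states:
                states.add(T)
                work.append(T)
    return states, trans, q0, {S for S in states if S & set(finals)}

def reverse_nfa(delta, start, finals):
    # Reverse every transition; accepting states become the start states,
    # and the old start state becomes the unique accepting state.
    ndelta = {}
    for (q, a), t in delta.items():
        ndelta.setdefault((t, a), set()).add(q)
    return ndelta, set(finals), {start}

def brzozowski(alphabet, delta, start, finals):
    # Determinize the reverse, then determinize the reverse of the result.
    d1, s1, f1 = reverse_nfa(delta, start, finals)
    _, t1, q1, F1 = determinize(alphabet, d1, s1, f1)
    d2, s2, f2 = reverse_nfa(t1, q1, F1)
    return determinize(alphabet, d2, s2, f2)
```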
State Complexity
One of the most fruitful directions in formal language theory is the study of state complexity: the number of states in the minimum DFA for a language obtained by applying an operation to regular languages.
Key state complexity results:
| Operation | State Complexity | Tight? |
|---|---|---|
| Complement | \(n\) | Yes |
| Union | \(mn\) | Yes |
| Intersection | \(mn\) | Yes |
| Concatenation | \(m \cdot 2^n - 2^{n-1}\) | Yes |
| Kleene star | \(\frac{3}{4} \cdot 2^n\) | Yes |
| Reversal | \(2^n\) | Yes |
| Morphism | \(m^n\) (roughly) | Yes |
The exponential blowup for reversal and NFA-to-DFA conversion is tight:
The witness: the NFA over \(\{0,1\}\) with states \(\{q_0, q_1, \ldots, q_{n-1}\}\) where \(q_0\) is the start, \(q_{n-1}\) is the only accepting state, 0-transitions shift states cyclically (\(q_i \xrightarrow{0} q_{(i+1) \bmod n}\)), and 1-transitions non-deterministically reset to \(q_0\) or shift. After the subset construction, all \(2^n\) subsets of states turn out to be reachable.
Partial Orders and Higman’s Lemma
The subsequence ordering is a powerful well-quasi-order on strings. Recall that \(x\) is a subsequence of \(y\) if \(x\) can be obtained from \(y\) by deleting (not necessarily contiguous) symbols: \(x = a_1 a_2 \cdots a_k\) is a subsequence of \(y\) if \(y = u_0 a_1 u_1 a_2 u_2 \cdots a_k u_k\) for some words \(u_i\).
Define the upward and downward closures of \(L\) under the subsequence order \(\preceq\):
\[L^{\uparrow} = \{ y : \exists x \in L, x \text{ is a subsequence of } y \}\]\[L^{\downarrow} = \{ y : \exists x \in L, y \text{ is a subsequence of } x \}\]Higman’s lemma states that \(\preceq\) is a well-quasi-order: every infinite sequence of words \(w_1, w_2, \ldots\) over a finite alphabet contains a good pair, i.e., indices \(i < j\) with \(w_i \preceq w_j\). The proof is by induction on \(|\Sigma|\).

Base case \(|\Sigma| = 1\): Words are of the form \(a^n\). Any infinite sequence of lengths contains a non-decreasing pair by the infinite pigeonhole principle. ✓
Inductive step: Let \(|\Sigma| = k \geq 2\). Suppose for contradiction that there exists a bad sequence: an infinite sequence \(w_1, w_2, \ldots\) with no \(i < j\) satisfying \(w_i \preceq w_j\). Choose a bad sequence that is minimal in the following sense: \(w_1\) is of minimum length among all first elements of bad sequences; given \(w_1\), \(w_2\) is of minimum length among all second elements of bad sequences starting with \(w_1\); and so on.
Fix a symbol \(a \in \Sigma\). For each \(w_i\), let \(w_i'\) be the word obtained by deleting the first occurrence of \(a\) in \(w_i\) (if \(w_i\) contains no \(a\), then \(w_i' = w_i\) and is over \(\Sigma \setminus \{a\}\)). By the inductive hypothesis applied to \(\Sigma \setminus \{a\}\) (a smaller alphabet), the subsequence \(w_{i_1}', w_{i_2}', \ldots\) of terms not containing \(a\) has a good pair.
For terms containing \(a\), write \(w_i = u_i a w_i''\). The words \(w_i''\) are strictly shorter than the corresponding \(w_i\), so by the minimality of our bad sequence, the sequence obtained by replacing these terms with the \(w_i''\) cannot itself be bad; combining a good pair among the \(w_i''\) with the structure of the prefixes \(u_i\) then produces a good pair among the original \(w_i\), contradicting badness.
A more careful argument (the minimal bad sequence technique of Nash-Williams) formalizes this by showing that a minimal bad sequence would force a bad sequence of strictly shorter terms, which is impossible since lengths are nonnegative integers.
A striking corollary: for any language \(L\) whatsoever, \(L^\uparrow\) is regular. By Higman’s lemma the set of \(\preceq\)-minimal elements of \(L\) is a finite antichain \(\{x_1, \ldots, x_k\}\), so \(L^\uparrow = x_1^\uparrow \cup \cdots \cup x_k^\uparrow\), where \(x_i^\uparrow = \{y : x_i \preceq y\}\). Each \(x_i^\uparrow\) is regular: it is recognized by an NFA that non-deterministically skips symbols to embed \(x_i\). A finite union of regular languages is regular.
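Membership in \(x^\uparrow\) is just a subsequence test, so upward closures of finite antichains are easy to decide. A sketch (helper names are ours):

```python
def embeds(x: str, y: str) -> bool:
    # x <= y in the subsequence order; greedy left-to-right matching.
    it = iter(y)
    return all(c in it for c in x)

def in_upward_closure(y: str, minimal_elements) -> bool:
    # y is in the upward closure iff some minimal element embeds into y.
    return any(embeds(x, y) for x in minimal_elements)
```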
Chapter 3: Context-Free Languages — Advanced Theory
Review: Grammars and Pushdown Automata
A context-free grammar (CFG) is a tuple \(G = (V, \Sigma, R, S)\) where \(V\) is the set of variables (nonterminals), \(\Sigma\) is the terminal alphabet, \(R\) is a finite set of productions \(A \to \alpha\) (with \(A \in V\) and \(\alpha \in (V \cup \Sigma)^*\)), and \(S \in V\) is the start symbol. The language \(L(G)\) consists of all terminal strings derivable from \(S\).
A pushdown automaton (PDA) is a finite automaton augmented with an unbounded stack. PDAs and CFGs are equiexpressive: a language is context-free iff it is accepted by some PDA.
Closure Properties
Context-free languages enjoy several closure properties, though notably they are not closed under intersection or complement.
- Union, concatenation, and Kleene star
- Morphisms (homomorphisms)
- Inverse morphisms
- Intersection with regular languages
- Reversal
The failure of closure under intersection is witnessed by \(L_1 = \{a^n b^n c^m : n, m \geq 0\}\) and \(L_2 = \{a^m b^n c^n : n, m \geq 0\}\), both context-free, but \(L_1 \cap L_2 = \{a^n b^n c^n : n \geq 0\}\) is not context-free (by the pumping lemma).
Inverse morphism closure: This is the most useful closure property in practice. Given a CFL \(L\) over \(\Delta\) and a morphism \(h : \Sigma^* \to \Delta^*\), we have \(h^{-1}(L) = \{w \in \Sigma^* : h(w) \in L\}\).
Pumping Lemmas
The classical pumping lemma for CFLs states that long enough strings have a “pumpable” substructure. However, the classical lemma can be hard to apply for certain non-CFLs. Two stronger versions exist.
Ogden’s Lemma: In the classical pumping lemma we have no control over where the pumped pieces fall; Ogden’s refinement lets us mark positions of the string that we require to participate in the pumping. Precisely: for every CFL \(L\) there is a constant \(N\) such that every \(z \in L\) in which at least \(N\) positions are marked can be written \(z = uvwxy\) where:
- \(vwx\) contains at most \(N\) marked positions
- \(vx\) contains at least one marked position
- \(uv^i wx^i y \in L\) for all \(i \geq 0\)
Ogden’s lemma subsumes the classical pumping lemma (mark all positions) and is strictly stronger. It is useful, for example, for proving that \(\{a^{p} b^q c^r : p \neq q,\ q \neq r,\ p \neq r\}\) is not context-free.
The Interchange Lemma is another strengthening, useful for proving non-CFL results when Ogden’s lemma fails. It exploits the structure of parse trees more carefully.
Parikh’s Theorem and Semilinear Sets
One of the most surprising results about context-free languages is that their “letter-counting” behavior is no more complex than that of regular languages. This is Parikh’s theorem.
For a word \(w\) over alphabet \(\{a_1, \ldots, a_k\}\), define the Parikh vector \(\Psi(w) = (|w|_{a_1}, \ldots, |w|_{a_k}) \in \mathbb{N}^k\) where \(|w|_{a_i}\) counts occurrences of \(a_i\) in \(w\). For a language \(L\), let \(\Psi(L) = \{\Psi(w) : w \in L\} \subseteq \mathbb{N}^k\).
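For a concrete illustration (our own example): the non-regular CFL \(\{a^n b^n\}\) and the regular language \((ab)^*\) have the same Parikh image, exactly as the theorem below predicts.

```python
from collections import Counter

def parikh(w: str, alphabet: str) -> tuple:
    # Psi(w): occurrence counts of each symbol, in alphabet order.
    counts = Counter(w)
    return tuple(counts[a] for a in alphabet)
```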
A set of the form \(\{v_0 + c_1 v_1 + \cdots + c_m v_m : c_1, \ldots, c_m \in \mathbb{N}\}\) for fixed vectors \(v_0, \ldots, v_m \in \mathbb{N}^k\) is called linear, and a finite union of linear sets is semilinear. Parikh’s theorem states that \(\Psi(L)\) is semilinear for every context-free language \(L\); equivalently, every CFL has the same Parikh image as some regular language. Semilinear sets are exactly the sets definable in Presburger arithmetic (the first-order theory of \((\mathbb{N}, +)\)), which is decidable.
Consequences of Parikh’s Theorem: Any language property that depends only on the Parikh image (letter counts) is inherited from regular languages by all CFLs. For example: if \(L\) is a CFL over a one-letter alphabet \(\{a\}\), then \(L\) is actually regular (since semilinear subsets of \(\mathbb{N}\) are unions of arithmetic progressions, and unions of arithmetic progressions over \(\{a\}^*\) are regular). This provides an easy test: \(L = \{a^{p^2} : p \in \mathbb{N}\}\) is not context-free because \(\{p^2 : p \in \mathbb{N}\}\) is not semilinear (not a finite union of arithmetic progressions).
Deterministic Context-Free Languages
The DCFLs are the languages recognized by deterministic pushdown automata (DPDAs). Determinism for pushdown automata requires more careful formulation than for finite automata: a PDA is deterministic if, for every state \(q\), input symbol \(a \in \Sigma\), and stack symbol \(A\),
- \(\delta(q, a, A)\) contains at most one element (at most one transition), and
- If \(\delta(q, \varepsilon, A) \neq \emptyset\), then \(\delta(q, a, A) = \emptyset\) for all \(a \in \Sigma\) (no choice between \(\varepsilon\)-move and reading a symbol).
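The two conditions are mechanical enough to check by program. Here is a minimal sketch; the encoding of the transition table (a dict keyed by state, input symbol or `None` for ε, and stack top) is an assumption of this illustration, not the textbook's notation.

```python
def is_deterministic(delta):
    """Check the two DPDA determinism conditions on a transition table.
    delta maps (state, input_symbol_or_None, stack_top) to a set of moves;
    input None encodes an ε-move."""
    # Condition 1: at most one move per (state, input, stack-top) triple.
    if any(len(moves) > 1 for moves in delta.values()):
        return False
    # Condition 2: an ε-move from (q, A) excludes every reading move (q, a, A).
    eps_pairs = {(q, A) for (q, a, A), moves in delta.items()
                 if a is None and moves}
    return not any((q, A) in eps_pairs and a is not None and moves
                   for (q, a, A), moves in delta.items())

# a DPDA fragment for {a^n b^n : n >= 1}; moves are (new_state, stack_action)
delta_ok = {("q0", "a", "Z"): {("q0", "push A")},
            ("q0", "a", "A"): {("q0", "push A")},
            ("q0", "b", "A"): {("q1", "pop")},
            ("q1", "b", "A"): {("q1", "pop")}}
```

Adding an ε-move on a pair \((q, A)\) that already has reading moves makes the checker reject, which is exactly the second condition.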
Every regular language is a DCFL. The language \(\{a^n b^n : n \geq 0\}\) is a DCFL but not regular. The language \(\{ww^R : w \in \{a,b\}^*\}\) (even palindromes) is a CFL but not a DCFL.
Closure Under Complementation
The most important closure property of DCFLs, and one that sharply distinguishes them from general CFLs, is closure under complementation.
This is surprising: the CFLs are not closed under complementation (since they’re not closed under intersection, and \(A \cap B = \overline{\overline{A} \cup \overline{B}}\)), but the deterministic subclass is.
Lemma 3.7 (Well-formed DPDA): Every DCFL is accepted by a DPDA in normal form: one that (i) never loops, (ii) always reads the entire input, and (iii) accepts or rejects only at the end of input.
Given a normal-form DPDA \(M\) for \(L\), the complement \(\overline{L}\) is accepted by \(M\) with the accepting states swapped: \(F' = Q \setminus F\). Since \(M\) always reads the entire input and always terminates, this DFA-like trick works.
The normalization requires care. A DPDA might loop via \(\varepsilon\)-transitions. The key insight: if the DPDA takes more than \(|Q| \cdot |\Gamma|\) consecutive \(\varepsilon\)-transitions without the stack height ever dropping below its starting level, then by the pigeonhole principle some pair (state, top stack symbol) repeats at a matching stack height, so the DPDA is in a loop it can never escape. Such loops can be detected and replaced by explicit reject transitions.
To ensure the DPDA reads the entire input: add a “dead” state that absorbs all input after the DPDA would normally halt, marking the word as rejected only at the true end of input.
With these normalizations in place, the complement DPDA is well-defined.
Myhill-Nerode Characterization of DCFLs
Just as DFAs have a Myhill-Nerode characterization, DPDAs have an analogous one, though more complex. The right notion is that of a finite index for a certain equivalence on words:
A more concrete consequence:
Since there are infinitely many values of \(i\) but only finitely many states, by pigeonhole there exist \(i < j\) with \(q_i = q_j\). From configuration \((q_i, \gamma_i)\), reading \(b a^i\) must accept (the input spells the palindrome \(a^i b a^i\)); from \((q_j, \gamma_j)\), reading \(b a^j\) must accept \(a^j b a^j\). But the finite control is in the same state in both configurations, so the machine can distinguish the two situations only through its stack, and determinism forces it to consult the stack in the same way in both runs. A crossing-sequence-style argument then shows the machine must either accept the non-palindrome \(a^i b a^j\) or reject the palindrome \(a^i b a^i\) — a contradiction.
A cleaner formal argument uses the fact that the Nerode-like classes of PAL are infinite in number for any DPDA-based equivalence, contradicting the finite-control bound.
Chapter 4: Parsing
Introduction to Parsing
Parsing is the process of determining whether a string belongs to a language described by a CFG, and if so, constructing its parse tree (or trees, if the grammar is ambiguous). The central challenge in compiler construction is building efficient parsers for programming language grammars.
The classic example is \[L = \{a^i b^j c^k : i = j \text{ or } j = k\}.\] This language is inherently ambiguous: every CFG for it derives some strings (those with \(i = j = k\)) in more than one way.
Earley’s Algorithm
Before studying specialized parsers (LL, LR), it is instructive to consider the general-purpose Earley algorithm, which handles any context-free grammar.
Items: An Earley item is a dotted production \([A \to \alpha \bullet \beta, i]\) meaning: we are trying to derive from \(A\), have already seen \(\alpha\) starting at position \(i\) in the input, and still need to see \(\beta\).
Chart: The chart \(C[j]\) is the set of all items generated after reading the first \(j\) symbols of the input \(w = a_1 \cdots a_n\).
EARLEY(G, w = a_1 a_2 ... a_n):
Initialize C[0] = {[S' -> . S, 0]} (augmented start rule)
For j = 0 to n:
Repeat until no change:
PREDICTOR: For each [A -> α . B β, i] in C[j] and each rule B -> γ:
Add [B -> . γ, j] to C[j]
COMPLETER: For each [B -> γ ., i] in C[j] and each [A -> α . B β, k] in C[i]:
Add [A -> α B . β, k] to C[j]
SCANNER: For each [A -> α . a_{j+1} β, i] in C[j]:
Add [A -> α a_{j+1} . β, i] to C[j+1]
Accept if [S' -> S ., 0] ∈ C[n]
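The pseudocode above transcribes into a short Python recognizer. The dict-based grammar encoding (nonterminal → list of right-hand-side tuples) and the augmented start name `"S@"` are choices of this sketch; the fixed-point loop makes Predictor and Completer interact correctly even for ε-productions.

```python
def earley_parse(grammar, start, w):
    """Earley recognizer.  grammar maps each nonterminal to a list of
    right-hand sides (tuples of symbols); a symbol is a nonterminal iff it
    is a key of grammar.  An item is (lhs, rhs, dot, origin)."""
    n = len(w)
    chart = [set() for _ in range(n + 1)]
    START = "S@"                             # augmented start symbol
    chart[0].add((START, (start,), 0, 0))
    for j in range(n + 1):
        changed = True
        while changed:                       # PREDICTOR/COMPLETER to fixpoint
            changed = False
            for lhs, rhs, dot, i in list(chart[j]):
                if dot < len(rhs) and rhs[dot] in grammar:        # PREDICTOR
                    for gamma in grammar[rhs[dot]]:
                        item = (rhs[dot], tuple(gamma), 0, j)
                        if item not in chart[j]:
                            chart[j].add(item); changed = True
                elif dot == len(rhs):                             # COMPLETER
                    for lhs2, rhs2, dot2, k in list(chart[i]):
                        if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                            item = (lhs2, rhs2, dot2 + 1, k)
                            if item not in chart[j]:
                                chart[j].add(item); changed = True
        if j < n:                                                 # SCANNER
            for lhs, rhs, dot, i in chart[j]:
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == w[j]:
                    chart[j + 1].add((rhs and lhs, rhs, dot + 1, i)[0:1][0] if False else (lhs, rhs, dot + 1, i))
    return (START, (start,), 1, 0) in chart[n]
```

For the ambiguous grammar \(S \to S + S \mid a\), the recognizer accepts `a+a+a` (in any of its parse orders) and rejects malformed inputs.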
Complexity: Earley’s algorithm runs in \(O(n^3)\) time for general CFGs, \(O(n^2)\) for unambiguous grammars, and \(O(n)\) for LR(k) grammars. This makes it universal while remaining competitive with specialized parsers on restricted grammar classes.
LL(1) Grammars
An LL(1) grammar is one that can be parsed top-down with a single symbol of lookahead, scanning input Left to right and producing a Leftmost derivation.
FIRST and FOLLOW Sets
The intuition: \(\text{FIRST}(\alpha)\) is the set of terminals that can begin a string derived from \(\alpha\); \(\text{FOLLOW}(A)\) is the set of terminals that can appear immediately after an occurrence of \(A\) in any sentential form.
Algorithm to compute FIRST:
COMPUTE-FIRST:
Initialize FIRST(a) = {a} for each a ∈ Σ, FIRST(ε) = {ε}
For each variable A: FIRST(A) = ∅
Repeat until no change:
For each production A -> X_1 X_2 ... X_k:
Add FIRST(X_1) \ {ε} to FIRST(A)
If ε ∈ FIRST(X_1): Add FIRST(X_2) \ {ε} to FIRST(A)
If ε ∈ FIRST(X_1) ∩ FIRST(X_2): Add FIRST(X_3) \ {ε} to FIRST(A)
...
If ε ∈ FIRST(X_1) ∩ ... ∩ FIRST(X_k): Add ε to FIRST(A)
Algorithm to compute FOLLOW:
COMPUTE-FOLLOW:
Initialize FOLLOW(S) = {$}, FOLLOW(A) = ∅ for all other A
Repeat until no change:
For each production A -> α B β:
Add FIRST(β) \ {ε} to FOLLOW(B)
If ε ∈ FIRST(β): Add FOLLOW(A) to FOLLOW(B)
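Both fixed-point computations above fit in a few lines of Python. In this sketch (the encoding is an assumption: right-hand sides are tuples, `()` is ε, and the empty string stands for ε inside FIRST sets), a helper computes FIRST of a symbol sequence, which both loops need.

```python
EPS = ""   # stands for ε inside FIRST sets

def first_of_seq(seq, FIRST):
    """FIRST of a sequence of grammar symbols X1 ... Xk."""
    out = set()
    for X in seq:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    out.add(EPS)                      # every Xi can vanish
    return out

def compute_first_follow(grammar, start, terminals):
    """Fixed-point computation of FIRST and FOLLOW.  grammar maps each
    nonterminal to a list of right-hand sides (tuples); () is ε."""
    FIRST = {t: {t} for t in terminals}
    FIRST.update({A: set() for A in grammar})
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                f = first_of_seq(rhs, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add("$")            # $ marks the end of input
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in grammar:
                        continue                       # terminals get no FOLLOW
                    f = first_of_seq(rhs[i + 1:], FIRST)
                    add = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FIRST, FOLLOW
```

On the standard expression fragment \(E \to T\,E'\), \(E' \to +\,T\,E' \mid \varepsilon\), \(T \to a\) this yields \(\text{FIRST}(E') = \{+, \varepsilon\}\) and \(\text{FOLLOW}(T) = \{+, \$\}\).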
The LL(1) Parsing Table
A grammar is LL(1) iff for every pair of distinct productions \(A \to \alpha\) and \(A \to \beta\):
- \(\text{FIRST}(\alpha) \cap \text{FIRST}(\beta) = \emptyset\), and
- If \(\varepsilon \in \text{FIRST}(\alpha)\), then \(\text{FIRST}(\beta) \cap \text{FOLLOW}(A) = \emptyset\).
The LL(1) parsing table drives a stack-based predictive parser: the stack initially holds \(S\$\). At each step, if the stack top is a terminal \(a\) matching the current input symbol, pop and advance. If the stack top is a variable \(A\), look up \(T[A, a]\) and push the right-hand side in reverse order.
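The driver loop just described is tiny. Below is a minimal sketch; the hand-built table for the toy grammar \(S \to aSb \mid \varepsilon\) (language \(\{a^n b^n\}\)) is an assumption of this illustration, not a table the text constructs.

```python
def predictive_parse(table, start, w, nonterminals):
    """Table-driven LL(1) parse.  table[(A, a)] is the RHS tuple to expand
    A by on lookahead a; missing entries signal a parse error."""
    stack = ["$", start]
    input_ = list(w) + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        a = input_[pos]
        if top == a:                     # terminal (or $) matches lookahead
            pos += 1
        elif top in nonterminals:
            if (top, a) not in table:
                return False             # error entry: reject
            stack.extend(reversed(table[(top, a)]))   # push RHS in reverse
        else:
            return False                 # terminal mismatch
    return pos == len(input_)

# toy LL(1) grammar: S -> a S b | ε,  FOLLOW(S) = {b, $}
table = {("S", "a"): ("a", "S", "b"),
         ("S", "b"): (),
         ("S", "$"): ()}
```

Pushing the right-hand side in reverse keeps the leftmost symbol on top, so the parser traces out a leftmost derivation, as the L in LL promises.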
Removing Left Recursion
A grammar with a left-recursive rule \(A \to A\alpha \mid \beta\) cannot be LL(1), because every terminal in \(\text{FIRST}(\beta)\) would select both productions in the row for \(A\). Left recursion is eliminated by the following transformation: replace \(A \to A\alpha_1 \mid \cdots \mid A\alpha_m \mid \beta_1 \mid \cdots \mid \beta_k\) (where no \(\beta_i\) begins with \(A\)) by
\[A \to \beta_1 A' \mid \cdots \mid \beta_k A'\]
\[A' \to \alpha_1 A' \mid \cdots \mid \alpha_m A' \mid \varepsilon\]
For indirect left recursion (\(A \to B\alpha,\ B \to A\beta\)), a systematic elimination algorithm orders the variables and eliminates direct left recursion for each in turn.
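The direct-elimination transformation is purely syntactic, so it is easy to mechanize. A minimal sketch, assuming right-hand sides are tuples of symbols, `()` denotes ε, and the fresh nonterminal is named by appending an apostrophe:

```python
def remove_left_recursion(rules, A):
    """Eliminate direct left recursion for nonterminal A.
    rules: the right-hand sides of A, as tuples of symbols.
    Returns (new rules for A, rules for the fresh nonterminal A')."""
    rec  = [rhs[1:] for rhs in rules if rhs and rhs[0] == A]    # A -> A α
    base = [rhs for rhs in rules if not rhs or rhs[0] != A]     # A -> β
    Ap = A + "'"
    new_A  = [beta + (Ap,) for beta in base]                    # A  -> β A'
    new_Ap = [alpha + (Ap,) for alpha in rec] + [()]            # A' -> α A' | ε
    return new_A, new_Ap

# E -> E + T | T   becomes   E -> T E',   E' -> + T E' | ε
```

The transformed grammar generates the same language but builds the \(\alpha\)'s iteratively to the right instead of recursively to the left.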
Left Factoring
Two productions sharing a common prefix, \[A \to \alpha\beta_1 \mid \alpha\beta_2,\] are left-factored into \[A \to \alpha A', \quad A' \to \beta_1 \mid \beta_2.\] After left-recursion removal and left-factoring, many natural grammars become LL(1). However, no inherently ambiguous language has an LL(1) grammar, and some unambiguous grammars are not LL(1) even after these transformations.
LR(0) Grammars
LR parsing scans input Left to right and produces a Rightmost derivation in reverse. LR parsers are more powerful than LL parsers: every LL(k) grammar is LR(k), but not conversely.
Handles and Viable Prefixes
Intuitively, the handle of a right sentential form is the occurrence of the right-hand side of the production applied in the last step of the rightmost derivation; reducing the handle undoes that step. Bottom-up (LR) parsing works by repeatedly locating the handle and reducing it.
The set of viable prefixes is always a regular language — this is the key theorem underlying LR parsing.
LR(0) Items and Knuth’s Construction
An LR(0) item is a production with a dot indicating how much of the right-hand side has been seen: \([A \to \alpha \bullet \beta]\).
Knuth’s NFA-\(\varepsilon\): The LR(0) items form the states of an NFA recognizing the viable prefixes:
- Start state: \([S' \to \bullet S]\) (augmented grammar start item)
- Shift transitions: \([A \to \alpha \bullet a \beta] \xrightarrow{a} [A \to \alpha a \bullet \beta]\) for terminals \(a\)
- Goto transitions: \([A \to \alpha \bullet B \beta] \xrightarrow{B} [A \to \alpha B \bullet \beta]\) for nonterminals \(B\)
- \(\varepsilon\)-transitions: \([A \to \alpha \bullet B \beta] \xrightarrow{\varepsilon} [B \to \bullet \gamma]\) for each production \(B \to \gamma\) (prediction)
\((\Leftarrow)\): If the NFA accepts \(\gamma\), trace through the accepting path. Each transition corresponds to a step in a rightmost derivation. By induction, the accepting run corresponds to a right sentential form \(\gamma w\) where the dot marks exactly where we are in parsing.
\((\Rightarrow)\): Given a viable prefix \(\gamma\) (from a right sentential form \(\gamma w\)), the sequence of handles that appear in constructing the derivation corresponds to a path in the NFA from the start state to some item with dot at the right position. The \(\varepsilon\)-transitions handle the prediction steps.
Knuth DFA: Apply the standard subset construction to the NFA-\(\varepsilon\). The resulting DFA has states that are sets of LR(0) items (called LR(0) item sets or canonical collection). This DFA is the LR(0) automaton.
LR(0) Parsing Table: From the DFA, build the action/goto table:
- Shift: If item set \(I\) has \([A \to \alpha \bullet a \beta]\), then on input \(a\), shift and go to the item set reached by the DFA on \(a\) from \(I\).
- Reduce: If item set \(I\) contains a completed item \([A \to \alpha \bullet]\), then reduce by \(A \to \alpha\) — LR(0) performs this reduction regardless of the next input symbol (no lookahead is used).
- Accept: If item set \(I\) has \([S' \to S \bullet]\), accept.
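The closure/goto construction behind the table is short enough to show in full. A minimal sketch, assuming grammars are dicts mapping nonterminals to lists of right-hand-side tuples and using `"S@"` as an arbitrary augmented start name:

```python
def closure(items, grammar):
    """ε-closure of a set of LR(0) items; an item is (lhs, rhs, dot)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:  # dot before a nonterminal
                for gamma in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(gamma), 0)  # predicted item
                    if item not in items:
                        items.add(item); changed = True
    return frozenset(items)

def goto(I, X, grammar):
    """DFA transition: advance the dot over grammar symbol X, then close."""
    return closure({(l, r, d + 1) for l, r, d in I if d < len(r) and r[d] == X},
                   grammar)

def canonical_collection(grammar, start):
    """States and edges of Knuth's LR(0) DFA via subset construction."""
    I0 = closure({("S@", (start,), 0)}, grammar)        # augmented start item
    states, edges, work = {I0}, {}, [I0]
    while work:
        I = work.pop()
        for X in {r[d] for l, r, d in I if d < len(r)}:
            J = goto(I, X, grammar)
            edges[(I, X)] = J
            if J not in states:
                states.add(J); work.append(J)
    return states, edges
```

For the grammar \(S \to (S) \mid x\) this produces the six-state LR(0) automaton, including the accept state containing \([S' \to S \bullet]\).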
LR(0) grammars are a restricted class; more powerful are SLR(1), LALR(1), and LR(1) grammars (which use 1 symbol of lookahead in various ways). LALR(1) is the basis of the widely-used YACC/Bison parser generator.
Chapter 5: The Chomsky Hierarchy and Undecidability
Unrestricted Grammars and Turing Machines
We now ascend to the top of the Chomsky hierarchy. An unrestricted grammar (Type 0 grammar) has productions \(\alpha \to \beta\) where \(\alpha \in (V \cup \Sigma)^*\) with \(\alpha\) containing at least one variable, and \(\beta \in (V \cup \Sigma)^*\). There is no restriction on the form of productions.
The proof goes through Turing machines:
Grammar simulates TM: Given a TM \(M\), construct a grammar \(G\) with variables encoding TM configurations. Productions simulate TM transitions forward and also “guess” the input (using a nondeterministic grammar). The grammar derives a terminal string iff the TM accepts it.
TM simulates grammar: Given an unrestricted grammar \(G\), a nondeterministic TM enumerates all derivations of \(G\) (breadth-first) and accepts whenever a derivation of the input is found.
Context-Sensitive Languages and Linear Bounded Automata
Between context-free and recursively enumerable languages sits the class of context-sensitive languages (CSLs), defined by context-sensitive grammars (CSGs): every production \(\alpha \to \beta\) satisfies \(|\alpha| \leq |\beta|\) (productions are non-length-decreasing), with the single exception \(S \to \varepsilon\) permitted when \(S\) does not occur on any right-hand side.
The machine model for CSLs is the linear bounded automaton (LBA): a nondeterministic TM whose working tape is bounded by a linear function of the input length (typically, the tape is just the input tape).
An important consequence: CSLs are decidable. Given an LBA \(M\) and input \(w\), the number of distinct configurations on tape of length \(|w|\) is finite: \(|Q| \cdot |w| \cdot |\Gamma|^{|w|}\). We can enumerate all configurations and decide in finite time whether any accepting configuration is reachable. Hence:
The Chomsky Hierarchy
The four language classes form the Chomsky hierarchy:
| Type | Grammar | Machine | Closed Under |
|---|---|---|---|
| 3 (Regular) | Right-linear or left-linear | DFA / NFA | Boolean ops, concat, *, morphism, inverse morphism |
| 2 (CFL) | Context-free | PDA | Union, concat, *, morphism, inverse morphism, ∩ regular |
| 1 (CSL) | Context-sensitive | LBA | Boolean ops (all three), concat, *, ε-free morphism |
| 0 (RE) | Unrestricted | TM (nondeterministic) | Union, concat, *, morphism, intersection |
Key separations: \(\text{Regular} \subsetneq \text{CFL} \subsetneq \text{CSL} \subsetneq \text{Recursive} \subsetneq \text{RE} \subsetneq \text{All languages}\).
Each containment is strict. The separation between CSL and Recursive deserves mention: by the space hierarchy theorem there exist decidable languages that are not context-sensitive; explicit examples are obtained by diagonalizing against all LBAs rather than by simple counting languages (note that even \(\{a^{2^n} : n \geq 0\}\) is context-sensitive).
The Immerman-Szelepcsényi Theorem
A striking result: unlike CFLs, CSLs are closed under complementation. This was an open problem for nearly 20 years.
\(M'\) must accept \(w\) iff \(M\) does not accept \(w\). The difficulty: \(M'\) cannot simply simulate \(M\) and invert, because \(M\) is nondeterministic — “does not accept” means no computation path accepts, not just that one path rejects.
The algorithm uses inductive counting of reachable configurations:
Let \(C_0 = \{c_{\text{init}}\}\), where \(c_{\text{init}}\) is the initial configuration of \(M\) on \(w\). Define \(C_i\) as the set of configurations reachable from \(c_{\text{init}}\) in at most \(i\) steps. The key insight is:
- Compute \(|C_i|\) inductively: given the knowledge that \(|C_{i-1}| = k_{i-1}\), a nondeterministic LBA can count the number of configurations reachable in \(i\) steps, using the following argument:
For each candidate configuration \(c\), enumerate all configurations \(c'\) and nondeterministically verify membership of \(c'\) in \(C_{i-1}\) (by guessing a path of length \(\leq i-1\) from \(c_{\text{init}}\)); a counter checks that all \(k_{i-1}\) members of \(C_{i-1}\) were seen, which certifies that the remaining \(c'\) are *not* in \(C_{i-1}\). Then \(c \in C_i\) iff some verified \(c'\) equals \(c\) or steps to \(c\). Counting the candidates \(c\) that pass yields \(k_i = |C_i|\).
- Using this count, we can decide: is any accepting configuration in \(\bigcup_i C_i\)? If not, accept \(w\) as being in \(\overline{L}\).
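The counting rounds can be seen concretely on an explicit digraph. This is a toy model only: a real LBA stores just the *count* from the previous round and re-verifies membership by guessing paths, whereas here the reachable set is kept explicitly so the rounds are easy to inspect.

```python
def inductive_counting(adj, start):
    """Toy model of inductive counting: compute |C_i|, the number of
    configurations (here: graph nodes) reachable from `start` in at most
    i steps, one round at a time, until the count stabilizes."""
    C = {start}
    counts = [len(C)]
    while True:
        nxt = set(C)
        for u in C:
            nxt.update(adj.get(u, []))   # one more step from every known node
        if len(nxt) == len(C):           # count stabilized: all of C_∞ found
            break
        C = nxt
        counts.append(len(C))
    return counts, C

# configuration graph with an unreachable component {3, 4}
adj = {0: [1], 1: [2], 2: [0], 3: [4]}
counts, reach = inductive_counting(adj, 0)
# a verifier for "node t is NOT reachable" checks that it has certified all
# len(reach) reachable nodes and that t was not among them
```

The stabilized count is exactly what lets a nondeterministic machine *prove a negative*: having exhibited all `len(reach)` reachable configurations, it knows every other configuration (in particular, every accepting one) is unreachable.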
The total number of configurations is at most \(N = |Q| \cdot |w| \cdot |\Gamma|^{|w|}\), which is bounded exponentially in \(|w|\) — but the tape needed to write down indices and counts is only polynomial in \(|w|\) (since indices fit in \(O(\log N) = O(|w|)\) space). Hence the counting procedure itself fits on a linear tape.
This shows \(\overline{L}\) is also accepted by an LBA, hence is a CSL.
The Immerman-Szelepcsényi theorem was proved independently by Immerman and Szelepcsényi in 1987, and each received the Gödel Prize in 1995. In complexity theory, the result is equivalent to \(\text{NSPACE}(f(n)) = \text{co-NSPACE}(f(n))\) for space-constructible \(f(n) \geq \log n\) — in particular, NL = co-NL.
Kolmogorov Complexity
Kolmogorov complexity provides an information-theoretic measure of the “complexity” of individual strings. It is a powerful tool for proving lower bounds.
Key facts:
Incompressibility: For each \(n\), there is at least one string of length \(n\) with \(K(x) \geq n\) (there are \(2^n\) strings but only \(2^n - 1\) programs of length \(< n\)); more generally, at least half of all strings of length \(n\) satisfy \(K(x) \geq n - 1\).
Invariance: For any two universal TMs \(U, U'\), \(K_U(x) \leq K_{U'}(x) + c\) for some constant \(c\) depending only on \(U, U'\) (not on \(x\)).
Uncomputability: \(K(x)\) is not computable (by a diagonalization argument).
The incompressibility method is a non-constructive proof technique: assume a string \(x\) of high Kolmogorov complexity satisfies some property (by the incompressibility theorem, such strings exist), and derive a contradiction from the structure that property imposes. This is used to prove lower bounds on communication complexity, circuit size, and pumping-like lemmas for specific language classes.
Undecidability Results for CFLs
Several natural decision problems about CFLs are undecidable. The proofs use reductions from the Post Correspondence Problem (PCP).
A computation history is a sequence of configurations \(C_1 \# C_2 \# \cdots \# C_t\) where \(C_1\) is the initial configuration and each \(C_{i+1}\) follows from \(C_i\) by one TM step. A valid computation history is one that ends in an accepting configuration.
Encode: pairs in the PCP instance represent partial matches between successive configurations (with appropriate shifts for the TM head position). A solution to the PCP instance corresponds exactly to building a valid computation history of \(M\) on \(w\). Hence PCP has a solution iff \(M\) accepts \(w\).
The PCP undecidability is used as a stepping stone to prove that the following problems are undecidable:
- Given CFGs \(G_1, G_2\), is \(L(G_1) \cap L(G_2) = \emptyset\)?
- Given a CFG \(G\), is \(G\) ambiguous?
- Given CFGs \(G_1, G_2\), is \(L(G_1) = L(G_2)\)?
- Given a CFG \(G\), is \(L(G) = \Sigma^*\)?
Proof of (2): From a PCP instance with pairs \((x_i, y_i)\), \(1 \leq i \leq n\), build the grammar \(S \to S_x \mid S_y\), \(S_x \to x_i S_x c_i \mid x_i c_i\), \(S_y \to y_i S_y c_i \mid y_i c_i\), where \(c_1, \ldots, c_n\) are fresh index symbols. Each of \(S_x, S_y\) alone is unambiguous, so a string with two parse trees is exactly one derivable from both branches — that is, one whose index sequence is a PCP solution. Hence the grammar is ambiguous iff the PCP instance has a solution.
For (4): one can construct a CFG generating all strings that fail to be valid accepting computation histories of \(M\) on \(w\) (with every other configuration written reversed, any single defect — bad start, bad end, or a step that does not follow — is checkable by a PDA that guesses where the defect lies). This grammar generates \(\Sigma^*\) iff \(M\) does not accept \(w\), and (3) follows by comparing against a fixed grammar for \(\Sigma^*\).
PSPACE-Completeness of Regular Expression Universality
Even for regular languages, some problems are computationally hard. The following problems are PSPACE-complete:
- Given a regular expression \(r\), is \(L(r) = \Sigma^*\)?
- Given an NFA \(M\), is \(L(M) = \Sigma^*\)?
The key construction is the following:
We can write the set of all invalid computation histories as a union of regular languages, each capturing one type of invalidity: a bad start, a bad end, or two successive configurations that do not follow by one step of \(M\). Because each configuration has polynomially bounded length and a TM step changes the tape only locally, each defect is detected by a polynomial-size automaton scanning a bounded window of adjacent symbols. The union of these automata gives a single polynomial-size regular expression \(r_{M,w}\) capturing all invalid computation histories.
Since “valid accepting computation” is equivalent to “not an invalid computation history,” we get \(L(r_{M,w}) = \Sigma^*\) iff no valid accepting computation exists iff \(M\) does not accept \(w\).
The PSPACE-hardness follows because, for any polynomial-space machine \(M\) and input \(w\), the expression \(r_{M,w}\) can be constructed in polynomial time; its universality decides non-acceptance of \(w\), and since PSPACE is closed under complement, this reduction establishes PSPACE-hardness of universality.
Membership in PSPACE: non-universality can be decided in nondeterministic polynomial space by guessing a witness string symbol by symbol while maintaining only the set of NFA states reachable on the portion guessed so far. By Savitch’s theorem (\(\text{NPSPACE} = \text{PSPACE}\)) and closure of PSPACE under complement, universality itself is in PSPACE.
Chapter 6: Deterministic Context-Free Languages — Further Properties
Normal Forms
Every DCFL has a representation in a clean normal form, analogous to Chomsky Normal Form for CFGs.
Theorem: Every DCFL is accepted by a DPDA whose \(\varepsilon\)-moves only pop the stack (never push). Truly real-time DPDAs — those with no \(\varepsilon\)-moves at all — accept a strictly smaller class, so this pop-only normal form is the best DCFL analogue of trimming \(\varepsilon\)-transitions.
Further, DCFLs admit a “two-way” normal form: the input is read in two passes (forward and backward). This is used in the study of complexity properties of DCFLs.
Relationship to LR(0) Grammars
The relationship between DCFLs and LR-parseable grammars is tight:
A language is an LR(0) language iff it has an LR(0) grammar, and the LR(0) languages form a proper subclass of the DCFLs: a DCFL is LR(0) iff it has the prefix property (no word of the language is a proper prefix of another word of the language). For DCFLs in general, the right notion is LR(1).
The LR(1) condition uses one symbol of lookahead in the reduce action: instead of reducing regardless of the next symbol (LR(0)), the parser reduces only on lookaheads consistent with the specific derivation context. Knuth proved that a language has an LR(k) grammar for some \(k\) iff it is a DCFL, in which case it already has an LR(1) grammar.
Closure Properties of DCFLs
- Complement (Theorem 3.6)
- Inverse morphism
- Intersection with regular languages
The failure of closure under union shows DCFLs are not Boolean-closed: \(L_1 = \{a^n b^n c^m : n, m \geq 0\}\) and \(L_2 = \{a^m b^n c^n : n, m \geq 0\}\) are both DCFLs, but \(L_1 \cup L_2\) is not (it is a CFL, but inherently ambiguous, hence not deterministic).
Cook’s Theorem on 2DPDAs
A two-way deterministic pushdown automaton (2DPDA) is a DPDA whose input head can move bidirectionally. Cook’s theorem gives a remarkable simulation result: every language accepted by a 2DPDA can be recognized in linear time on a random-access machine, even though the 2DPDA itself may run for exponentially many steps.
The key technique is memoization of surface configurations (state, head position, top stack symbol): the simulator tabulates, for each surface configuration, the outcome of the computation segment that next pops the stack below its current level, so each segment is simulated only once. A famous byproduct: string matching is solvable by a 2DPDA, so Cook’s theorem implied a linear-time string-matching algorithm — the observation that inspired Knuth-Morris-Pratt.
Chapter 7: Other Language Classes
L-Systems
Lindenmayer systems (L-systems) are string-rewriting systems introduced by biologist Aristid Lindenmayer in 1968 to model the growth of multicellular organisms. Unlike Chomsky grammars, L-systems apply all productions simultaneously (in parallel) rather than one at a time.
Example: The morphism \(h : 0 \mapsto 01, 1 \mapsto 10\) with axiom \(w = 0\) generates the Thue-Morse sequence in the limit. After \(n\) steps: \(h^n(0)\).
The “D” in D0L stands for Deterministic (each symbol has exactly one production), the “0” (zero) indicates zero-sided context — each symbol is rewritten independently of its neighbors — and “L” for Lindenmayer.
The D0L Sequence Equivalence Problem
Given two D0L systems \((h, w)\) and \((h', w')\), do they generate the same sequence of words, i.e., is \(h^n(w) = h'^n(w')\) for all \(n\)? This problem is decidable — proved by Culik and Fris (1977) in a landmark result. However, the proof is highly non-trivial. The language-theoretic version (equality of the generated sets of words) is also decidable.
L-Systems and Computer Graphics
L-systems have found extensive application in computer graphics for generating fractal-like structures (plants, trees, coastlines). The turtle graphics interpretation assigns geometric meaning to symbols: \(F\) = move forward, \(+\) = turn left, \(-\) = turn right. Applying the morphism repeatedly generates increasingly detailed patterns, such as:
- Koch curve: \(F \mapsto F+F--F+F\)
- Sierpinski triangle: \(A \mapsto B+A+B, B \mapsto A-B-A\)
- Dragon curve: various morphisms on pairs of symbols
The book by Prusinkiewicz and Lindenmayer, The Algorithmic Beauty of Plants (Springer, 1990), is the definitive reference on L-systems in graphics.
Rational Series
A formal power series over a semiring \(\mathbb{K}\) assigns a coefficient to every word; we write \[S = \sum_{w \in \Sigma^*} (S, w) \cdot w\] where \((S, w) \in \mathbb{K}\) is the coefficient of \(w\).
Rational series generalize regular languages: a language \(L\) corresponds to the characteristic series \(\mathbf{1}_L\) (coefficient 1 for words in \(L\), 0 otherwise), and \(\mathbf{1}_L\) is rational over \(\mathbb{B} = (\{0,1\}, \vee, \wedge)\) iff \(L\) is regular.
The Theorem of Schützenberger
Schützenberger’s theorem states that a series is rational iff it is recognizable: there exist \(\lambda \in \mathbb{K}^{1 \times n}\), \(\gamma \in \mathbb{K}^{n \times 1}\), and a monoid morphism \(\mu : \Sigma^* \to \mathbb{K}^{n \times n}\) with \((S, w) = \lambda \, \mu(w) \, \gamma\). This triple \((\lambda, \mu, \gamma)\) is the linear representation of the rational series — the formal-language analogue of a linear recurrence for sequences over \(\mathbb{K}\).
Examples and Applications
Counting paths: If \(G\) is a directed graph with adjacency matrix \(A\), then \((A^k)_{ij}\) counts paths of length \(k\) from node \(i\) to node \(j\). This is a rational series over \((\mathbb{N}, +, \times)\).
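The path-counting claim is a two-line computation. A minimal sketch with plain list-of-lists matrices (no external libraries assumed):

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def path_counts(A, k):
    """k-th power of the adjacency matrix A; entry (i, j) counts the
    directed paths of length exactly k from node i to node j."""
    n = len(A)
    P = [[int(i == j) for j in range(n)] for i in range(n)]   # identity = A^0
    for _ in range(k):
        P = mat_mul(P, A)
    return P
```

On the directed 3-cycle \(0 \to 1 \to 2 \to 0\), the cube of the adjacency matrix is the identity: each node has exactly one length-3 path, back to itself.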
Probabilistic automata: A probabilistic finite automaton (PFA) is a WFA over the semiring of probabilities. The series \((S, w)\) is the probability that the automaton accepts \(w\). PFAs can define non-regular languages via cut-points: the set \(\{w : (S,w) > \lambda\}\) need not be regular when the cut-point \(\lambda\) is not isolated (Rabin).
Formal languages and power series: Over \(\mathbb{N}\), rational series correspond to counting the number of accepting paths — relevant to the study of the #P complexity class (counting problems).
Weighted transducers: The theory of rational series extends naturally to transducers, where the output is not a bit but an element of a semiring. Weighted transducers are used in speech recognition (weighted finite-state methods), natural language processing, and bioinformatics.
Hadamard Product
For two series \(S, T\), the Hadamard product is defined by \((S \odot T, w) = (S, w) \cdot (T, w)\) (componentwise product); the rational series are closed under Hadamard product over commutative semirings. This generalizes intersection of regular languages: the Hadamard product of characteristic series is the characteristic series of the intersection, mirroring the DFA product construction.
Appendix: Decision Problems Summary
The following table summarizes decidability and complexity for key decision problems, where “CFL” denotes context-free grammar input and “RE” denotes r.e. languages.
| Problem | Class | Status |
|---|---|---|
| DFA membership | DFA | O(n) |
| NFA membership | NFA | O(n · |Q|²) |
| DFA emptiness | DFA | O(|Q|) |
| DFA equivalence | DFA | O(n log n) |
| DFA minimization | DFA | O(n log n) |
| CFL membership | CFG | O(n³) (CYK / Earley) |
| CFL emptiness | CFG | Decidable (polynomial) |
| CFL finiteness | CFG | Decidable |
| CFL ambiguity | CFG | Undecidable |
| CFL ∩ CFL emptiness | CFG × CFG | Undecidable |
| CFL = Σ* | CFG | Undecidable |
| CFL = CFL | CFG × CFG | Undecidable |
| CSL membership | CSG | PSPACE-complete |
| RE membership | TM | Undecidable (semi-decidable) |
| Regex universality | Regex | PSPACE-complete |
| NFA universality | NFA | PSPACE-complete |
| PCP | Pairs | Undecidable |
Chapter 8: Fine and Wilf, Lyndon Words, and the Structure of Periodicity
The Fine and Wilf Theorem
Chapter 1 introduced periodicity for strings. The most important structural result about multiple periods in a single string is the Fine and Wilf theorem (1965). It answers: if a string has two different periods, what can we say?
Recall that a word \(w\) of length \(n\) has period \(p\) (with \(1 \leq p \leq n\)) if \(w[i] = w[i+p]\) for all valid indices (i.e., for all \(0 \leq i \leq n-p-1\)). Equivalently, \(w\) is a prefix of \(x^\omega\) for \(x = w[0..p-1]\).
The bound \(p + q - \gcd(p,q)\) is tight: there exist words of length \(p + q - \gcd(p,q) - 1\) with periods \(p\) and \(q\) but not period \(\gcd(p,q)\).
We show that any two indices \(i\) and \(j\) with \(j - i = d\) and \(0 \leq i < j \leq n-1\) are connected by a chain of steps of size \(\pm p\) and \(\pm q\) that stays within \([0, n-1]\).
Formally, consider the integers modulo \(p\) and modulo \(q\). Since \(d = \gcd(p,q)\), the additive subgroup generated by \(p\) and \(q\) in \(\mathbb{Z}\) equals \(d\mathbb{Z}\). So there exist non-negative integers \(\alpha, \beta\) with \(\alpha p - \beta q = d\) (or \(\beta q - \alpha p = d\)). Starting from index \(i\), we can reach \(i + d\) by taking \(\alpha\) steps of \(+p\) and \(\beta\) steps of \(-q\) (or vice versa). Each intermediate index lies in an interval we must check stays within \([0, n-1]\).
\[i + \alpha p \leq (n - d - 1) + (q/d - 1)p\]
One verifies this is at most \(n-1\) using the hypothesis \(n \geq p + q - \gcd(p,q) = p + q - d\).
More concretely: the sequence of indices \(i, i+p, i+p-q, i+p-q+p, \ldots\) that zigzags by steps of \(+p\) and \(-q\) reaches \(i+d\) and stays within \([0, n-1]\) by the assumed length bound, establishing \(w[i] = w[i+d]\).
Tightness example: Take \(p = 3, q = 5\), so \(d = \gcd(3,5) = 1\) and the bound is \(3 + 5 - 1 = 7\). The word \(w = \mathtt{abaaba}\) of length \(6 = p + q - d - 1\) has both periods 3 and 5 but not period 1 (it is not constant). Any word of length 7 with periods 3 and 5 is forced by the theorem to be constant.
Fine and Wilf for Two Morphisms
The textbook presents a generalization of Fine and Wilf to morphisms:
This generalization is used in the study of morphic sequences and has applications in deciding equality of DOL sequences.
Lyndon Words
The theory of Lyndon words is a beautiful piece of combinatorics on words with surprising connections to algebra, algorithms, and number theory.
Equivalently, a Lyndon word is a primitive word that is the lexicographically smallest element in its conjugacy class.
Examples: Over \(\{0,1\}\) with \(0 < 1\):
- \(0, 1, 01, 001, 011, 0001, 0011, 0111, \ldots\) are Lyndon words.
- \(10\) is not Lyndon: its proper suffix \(\mathtt{0}\) satisfies \(\mathtt{0} < \mathtt{10}\).
- \(0101\) is not Lyndon: it is not primitive (\(0101 = (01)^2\)).
Key properties:
- Every Lyndon word is primitive.
- If \(\ell_1, \ell_2\) are Lyndon words with \(\ell_1 \leq \ell_2\) lexicographically, then \(\ell_1 \ell_2\) is a Lyndon word iff \(\ell_1 < \ell_2\).
- If \(\ell\) is Lyndon and \(\ell = xy\) with \(x, y\) nonempty, then \(x < \ell < y\); in particular, every proper suffix of a Lyndon word is strictly greater than the word itself.
These properties arise from the interplay between order and concatenation in the free monoid.
The Chen-Fox-Lyndon Factorization
The most important result about Lyndon words is the unique factorization theorem:
Example: \(\mathtt{cbacba}\) factors as \(\mathtt{c} \cdot \mathtt{b} \cdot \mathtt{acb} \cdot \mathtt{a}\). Each factor is Lyndon (single letters always are, and \(\mathtt{acb}\) is smaller than its suffixes \(\mathtt{cb}\) and \(\mathtt{b}\)), and the factors are non-increasing: \(\mathtt{c} > \mathtt{b} > \mathtt{acb} > \mathtt{a}\). Simpler examples: \(\mathtt{aab}\) is itself a Lyndon word, so its factorization is the single factor \(\mathtt{aab}\); the word \(\mathtt{ba}\) factors as \(\mathtt{b} \cdot \mathtt{a}\).
Existence: Given \(w\), define \(\ell_1\) as the longest Lyndon prefix of \(w\) (a Lyndon word that is a prefix of \(w\)). Then continue with the remaining suffix. One shows this greedy approach terminates and gives a non-increasing sequence (the next Lyndon piece cannot be larger than the previous, because if it were, the current piece would not have been the longest Lyndon prefix).
Uniqueness: Two distinct factorizations into Lyndon words with the same non-increasing property would give a contradiction via a careful comparison of the first differing piece.
Duval’s Algorithm
Computing the Lyndon factorization efficiently is non-trivial, but the Duval algorithm (1983) achieves linear time \(O(|w|)\):
DUVAL(w):
    i = 0
    While i < |w|:
        j = i + 1; k = i
        While j < |w| and w[k] <= w[j]:       # compare against the current period
            If w[k] < w[j]: k = i             # w[i..j] is Lyndon; restart comparison
            Else: k = k + 1                   # w[i..j] continues a periodic repetition
            j = j + 1
        While i <= k:                         # output copies of the Lyndon word of length j - k
            Output w[i..i+(j-k)-1]
            i = i + (j - k)
The key invariant: at the start of each outer iteration, \(w[0..i-1]\) has already been factored, and the inner loop maintains that \(w[i..j-1]\) consists of repeated copies of a Lyndon word of length \(j - k\), possibly followed by a proper prefix of it. Since \(i\) never moves backward and \(j\) advances \(O(|w|)\) times in total, the algorithm runs in \(O(|w|)\) time.
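Duval's algorithm transcribes directly into Python:

```python
def duval(w):
    """Chen-Fox-Lyndon factorization of w in O(|w|) time (Duval, 1983)."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1   # restart, or continue the period
            j += 1
        while i <= k:                         # emit copies of the length-(j-k) factor
            factors.append(w[i:i + j - k])
            i += j - k
    return factors
```

For instance `duval("banana")` yields the non-increasing Lyndon factors `b · an · an · a`.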
Application — Sorting: The Lyndon factorization has a remarkable application to the Burrows-Wheeler Transform (BWT), used in data compression (bzip2). The BWT of a string is computed by sorting all cyclic rotations — and the Lyndon factorization controls the structure of this sorted order. Specifically, the suffix array of a word \(w\) can be computed in \(O(|w|)\) using the Lyndon factorization as a backbone.
The Necklace Problem
A necklace (or Lyndon necklace) is the lexicographically smallest representative of a conjugacy class. Computing the lexicographically smallest rotation of a string — finding the necklace — can be done in \(O(n)\) time by Booth’s algorithm (1980); it can also be extracted from the Lyndon factorization of \(ww\).
The number of Lyndon words of length \(n\) over a \(k\)-letter alphabet is given by Witt’s formula \[\frac{1}{n} \sum_{d \mid n} \mu(n/d) k^d\] where \(\mu\) is the Möbius function from number theory. This is related to counting primitive necklaces and has connections to the theory of free Lie algebras.
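The counting formula is easy to cross-check against a brute-force enumeration, using the characterization of Lyndon words as words strictly smaller than each of their proper suffixes (all helper names below are introduced for this sketch):

```python
from itertools import product

def mobius(n):
    """Möbius function μ(n) by trial factorization."""
    result, p, m = 1, 2, n
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0              # squared prime factor
            result = -result
        p += 1
    if m > 1:
        result = -result
    return result

def lyndon_count(n, k):
    """Number of Lyndon words of length n over a k-letter alphabet."""
    return sum(mobius(n // d) * k**d for d in range(1, n + 1) if n % d == 0) // n

def is_lyndon(w):
    """A nonempty word is Lyndon iff strictly smaller than each proper suffix."""
    return all(w < w[i:] for i in range(1, len(w)))
```

For \(n = 6, k = 2\) the formula gives \((64 - 8 - 4 + 2)/6 = 9\), matching the exhaustive count over all 64 binary words.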
Chapter 9: ω-Languages and Büchi Automata
Motivation and Topology of ω-Languages
The study of languages of infinite words is not merely a theoretical curiosity — it is the foundation of model checking and formal verification of concurrent and reactive systems. A program that runs forever (an operating system, a control loop, a communication protocol) cannot be modeled by a finite input; it produces an infinite stream of events. The right question is not “does this program terminate correctly?” but “does this program always eventually reach a good state?” or “does it never reach a bad state?”
These questions are naturally phrased as membership problems for languages of infinite words.
The topology of \(\Sigma^\omega\) is the Cantor space topology: the basic open sets are the cylinders \(\{w \in \Sigma^\omega : w \text{ starts with } u\}\) for finite \(u \in \Sigma^*\). An ω-language is open if it is a union of cylinders, and closed if its complement is open. This topology makes \(\Sigma^\omega\) compact and totally disconnected.
The Borel hierarchy for ω-languages classifies them by alternation of countable unions and intersections of open and closed sets. The ω-regular languages (defined below) sit low in this hierarchy: every ω-regular language is a finite Boolean combination of \(\Pi_2^0\) (i.e., \(G_\delta\)) sets, and by Landweber's theorem the DBA-recognizable languages are exactly the ω-regular \(G_\delta\) sets.
Büchi Automata
A nondeterministic Büchi automaton (NBA) is a tuple \(M = (Q, \Sigma, \delta, q_0, F)\), syntactically identical to an NFA, but run on infinite words: a run of \(M\) on \(\mathbf{w}\) is an infinite state sequence \(\rho = q_0 q_1 q_2 \cdots\) with \(q_{i+1} \in \delta(q_i, w[i])\) for all \(i \geq 0\). The automaton accepts \(\mathbf{w}\) if there exists a run \(\rho\) such that \(\text{Inf}(\rho) \cap F \neq \emptyset\), where \(\text{Inf}(\rho) = \{q \in Q : q = q_i \text{ for infinitely many } i\}\) is the set of states visited infinitely often.
In other words: a Büchi automaton accepts an infinite word if some run passes through an accepting state infinitely often. An ω-language is ω-regular if it is accepted by some NBA.
Examples:
The language of all infinite words over \(\{a, b\}\) that contain infinitely many \(a\)'s: use a 2-state automaton with states \(q_0\) (initial, non-accepting) and \(q_1\) (accepting). On \(a\): go to \(q_1\) (from either state). On \(b\): go to \(q_0\) (from either state). Accept at \(q_1\). The automaton is in \(q_1\) exactly when the symbol just read was \(a\), so its unique run on \(\mathbf{w}\) visits \(q_1\) infinitely often iff \(\mathbf{w}\) contains infinitely many \(a\)'s. Note that this automaton is in fact deterministic.
For any fixed \(u \in \Sigma^*\) and \(v \in \Sigma^+\), the singleton \(\{uv^\omega\}\) is ω-regular. (Caution: the set of all ultimately periodic words \(\{uv^\omega : u \in \Sigma^*, v \in \Sigma^+\}\) is not ω-regular — though every nonempty ω-regular language contains an ultimately periodic word.)
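Acceptance of an ultimately periodic word \(uv^\omega\) by a deterministic Büchi automaton is decidable by pure simulation: run \(u\), then iterate \(v\)-blocks until the state at a block boundary repeats, and check whether the resulting loop visits an accepting state. A minimal sketch (the dictionary encoding of the transition function is ours); the example below is the standard "infinitely many \(a\)'s" DBA:

```python
def dba_accepts(delta, q0, accepting, u, v):
    """Does the DBA accept the ultimately periodic word u v^omega?
    delta maps (state, symbol) -> state; v must be nonempty."""
    q = q0
    for c in u:                        # consume the finite prefix u
        q = delta[(q, c)]
    # The states at the boundaries of successive v-blocks must repeat.
    starts = []
    while q not in starts:
        starts.append(q)
        for c in v:
            q = delta[(q, c)]
    # Replay the loop once, collecting every state visited inside it.
    loop_states, p = set(), q
    for _ in range(len(starts) - starts.index(q)):
        for c in v:
            p = delta[(p, c)]
            loop_states.add(p)
    return bool(loop_states & accepting)

# DBA for "infinitely many a's": q1 iff the last symbol read was a.
INF_A = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q0"}
```

For example, `dba_accepts(INF_A, "q0", {"q1"}, "bb", "ab")` is true (the word \(bb(ab)^\omega\) has infinitely many \(a\)'s), while \(a\,b^\omega\) is rejected.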
Connection to regular languages: For a regular language \(L \subseteq \Sigma^*\), define \(L^\omega = \{w_1 w_2 \cdots : w_i \in L, w_i \neq \varepsilon\}\). This ω-language is always ω-regular. More generally (Büchi), an ω-language is ω-regular iff it is a finite union of languages of the form \(L_1 L_2^\omega\) for regular \(L_1, L_2\).
Closure Properties of ω-Regular Languages
Closure under union is straightforward (take the disjoint union of two NBAs). Closure under intersection requires a product construction with an extra flag: run both automata in parallel and track which of the two still owes a visit to its accepting set; accept when both visits occur infinitely often (the flag resets each time both have been seen). Closure under complementation is the hardest and requires converting to a deterministic model first.
Limits of Deterministic Büchi Automata
This is where ω-languages diverge sharply from finite-word languages. For finite words, NFA and DFA are equally expressive. For infinite words:
The language of words over \(\{a, b\}\) with only finitely many \(a\)'s — \((a+b)^* b^\omega\) — is ω-regular but not recognizable by any DBA. An NBA accepts it by guessing the position of the last \(a\); a DBA cannot make this guess.
Proof sketch that no DBA works: suppose a DBA recognized this language. On input \(b^\omega\) it must visit an accepting state after some prefix \(b^{n_1}\). On input \(b^{n_1} a b^\omega\) it must visit an accepting state again after some longer prefix \(b^{n_1} a b^{n_2}\). Iterating forever yields the word \(b^{n_1} a b^{n_2} a b^{n_3} \cdots\), which contains infinitely many \(a\)'s yet makes the DBA visit accepting states infinitely often — a contradiction.
Muller Automata and McNaughton’s Theorem
This is the fundamental theorem of ω-automata: nondeterminism can always be eliminated if we use the more powerful Muller acceptance condition (instead of requiring only some accepting state, we specify exactly which states appear infinitely often).
On reading each symbol, the Safra construction updates the tree deterministically:
- Powerset step: each node’s label is updated by applying the NBA transition function to all states in that label.
- Branching: states in a node's label that lie in \(F\) are spun off into a new, youngest child node — these track runs that have just visited an accepting state.
- Merging: a state that also appears in an older sibling is removed from the younger one (labels at each level stay disjoint); when a node's label equals the union of its children's labels, the children are deleted and the node is marked ("flashes green").
- Tree normalization: nodes with empty labels are removed.
The Muller condition is chosen so that the DMA accepts iff some node of the tree is eventually never deleted and is marked infinitely often. The resulting DMA has at most \(n^{O(n)} = 2^{O(n \log n)}\) states.
Other deterministic ω-automaton models include Rabin automata (accepting if some pair \((E_i, F_i)\) in a table has \(\text{Inf}(\rho) \cap E_i = \emptyset\) and \(\text{Inf}(\rho) \cap F_i \neq \emptyset\)) and Streett automata (the dual). All deterministic models are equally expressive for ω-regular languages.
Linear Temporal Logic and Model Checking
The primary application of ω-regular language theory is in linear temporal logic (LTL) and model checking. LTL formulae describe properties of infinite execution traces. Key operators:
- \(\mathbf{X}\, \varphi\): \(\varphi\) holds in the next step
- \(\mathbf{F}\, \varphi\): \(\varphi\) holds in some future step (eventually)
- \(\mathbf{G}\, \varphi\): \(\varphi\) holds in all future steps (globally)
- \(\varphi\, \mathbf{U}\, \psi\): \(\varphi\) holds until \(\psi\) holds
The model checking algorithm: translate \(\neg\varphi\) to an NBA \(B\), take the product of \(B\) with the system’s transition graph, check if the product has an accepting lasso (a path that enters a cycle through an accepting state). This is a reachability problem on the product graph, solvable in linear time in the product’s size.
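The accepting-lasso check at the heart of this procedure can be sketched with plain reachability passes over the product graph; this is a simplified quadratic version (the linear-time nested-DFS emptiness test is an optimization of the same idea), and the adjacency-dict encoding is ours:

```python
def reachable(graph, start):
    """All nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def has_accepting_lasso(graph, init, accepting):
    """Is some accepting state reachable from init AND on a cycle?"""
    for q in reachable(graph, init):
        if q in accepting:
            # q lies on a cycle iff q is reachable from one of its successors
            if any(q in reachable(graph, v) for v in graph.get(q, [])):
                return True
    return False
```

For example, in the graph `{0: [1], 1: [2], 2: [1]}` with initial node 0, states 1 and 2 lie on an accepting lasso but state 0 does not.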
Chapter 10: Automatic Sequences
From Finite Automata to Sequences
A DFAO (Deterministic Finite Automaton with Output) is a DFA augmented with an output function \(\tau : Q \to \Delta\) mapping states to an output alphabet \(\Delta\). Given a DFAO \(M\) and a natural number \(n\) written in base \(k\) (with leading zeros allowed), the DFAO reads \(n\)'s base-\(k\) digits and outputs \(\tau(\delta^*(q_0, [n]_k))\), where \([n]_k\) is the base-\(k\) representation of \(n\) (most significant digit first).
Automatic sequences are the “regular languages of number theory.” Just as regular languages are recognized by finite automata reading strings, automatic sequences are computed by finite automata reading the digits of an index.
Canonical Examples
The Thue-Morse sequence \(\mathbf{t} = 0110100110010110\cdots\) is 2-automatic. The DFAO has two states \(q_0\) (output 0) and \(q_1\) (output 1). On digit 0: stay in current state. On digit 1: switch state. Start in \(q_0\). Then \(t_n = \text{(sum of binary digits of } n) \bmod 2\), as computed by this DFAO reading \(n\) in binary.
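The 2-state DFAO just described is easy to simulate; a minimal sketch (the function name `thue_morse` is ours), which also confirms the digit-sum formula:

```python
def thue_morse(n):
    """t_n computed by the 2-state DFAO: toggle on digit 1, stay on digit 0.
    The output function is the identity on the state label."""
    state = 0                        # q0
    for digit in format(n, "b"):     # binary digits, most significant first
        if digit == "1":
            state = 1 - state        # switch state on digit 1
    return state
```

The first eight values are 0, 1, 1, 0, 1, 0, 0, 1, and `thue_morse(n)` agrees with the parity of the binary digit sum of \(n\).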
The Rudin-Shapiro sequence \(\mathbf{r} = (r_n)_{n \geq 0}\) counts the number of (possibly overlapping) occurrences of \(11\) in the binary representation of \(n\), mod 2. It is 2-automatic with 4 states.
The Baum-Sweet sequence \(\mathbf{b} = (b_n)_{n \geq 0}\): \(b_n = 1\) if the binary representation of \(n\) contains no odd-length block of consecutive 0’s, else \(b_n = 0\). It is 2-automatic with 3 states.
The Stern-Brocot sequence (Stern's diatomic sequence): defined by \(s(0) = 0, s(1) = 1, s(2n) = s(n), s(2n+1) = s(n) + s(n+1)\). This sequence is unbounded, so it is not 2-automatic (an automatic sequence takes only finitely many values); it is, however, 2-regular in the sense of Allouche and Shallit, and it has connections to the Stern-Brocot tree and the Farey sequence in number theory.
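The recurrence above can be transcribed directly; the memoized sketch below (function name ours) also exhibits the unboundedness that rules out automaticity:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stern(n):
    """Stern's diatomic sequence:
    s(0)=0, s(1)=1, s(2n)=s(n), s(2n+1)=s(n)+s(n+1)."""
    if n < 2:
        return n
    return stern(n // 2) if n % 2 == 0 else stern(n // 2) + stern(n // 2 + 1)
```

The first values are 0, 1, 1, 2, 1, 3, 2, 3, 1, and the maxima grow like Fibonacci numbers (the largest value among \(n < 2^k\) is \(F_{k+1}\)), so the sequence takes infinitely many values.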
Equivalent Characterizations
Theorem. For a sequence \(\mathbf{a} = (a_n)_{n \geq 0}\) over a finite alphabet and an integer \(k \geq 2\), the following are equivalent:
- \(\mathbf{a}\) is k-automatic.
- The k-kernel of \(\mathbf{a}\) — the set \(\{(a_{k^e n + r})_{n \geq 0} : e \geq 0, 0 \leq r < k^e\}\) — is finite.
- \(\mathbf{a}\) is the image under a coding (letter-to-letter morphism) of a fixed point of a k-uniform morphism (a morphism where every letter maps to a word of length exactly \(k\)).
The k-kernel is the set of all subsequences of the form \((a_{k^e n + r})_{n \geq 0}\), for all \(e \geq 0\) and \(0 \leq r < k^e\). Finiteness of the kernel captures the “finite-state” nature of the sequence: there are only finitely many distinct “sampling patterns.”
Conversely, given a finite k-kernel, define the states of the DFAO to be the elements of the kernel. The transition function: on reading digit \(d\), the kernel element \((a_{k^e n + r})_{n \geq 0}\) transitions to \((a_{k^{e+1} n + d k^e + r})_{n \geq 0}\), which is again a kernel element. (This DFAO reads the digits least significant first; that is harmless, since k-automaticity is preserved under reversing the input.) The output function maps each kernel element \((a_{k^e n + r})_{n \geq 0}\) to its first value \(a_r\).
Morphic Sequences
A sequence is morphic if it is the image of the fixed point of a (not necessarily uniform) morphism. Automatic sequences are a special case: every \(k\)-automatic sequence is morphic (via a \(k\)-uniform morphism), but not every morphic sequence is automatic.
Example: The Fibonacci word \(\mathbf{f}\) is the fixed point of the morphism \(\phi: a \mapsto ab, b \mapsto a\). It is morphic (clearly) but not \(k\)-automatic for any \(k\): in a \(k\)-automatic sequence, the frequency of each letter, when it exists, is rational, while the letters of the Fibonacci word occur with irrational frequencies (related to the golden ratio).
The sequence \((F_n \bmod 2)_{n \geq 0}\) (Fibonacci numbers mod 2) is periodic with period 3 (\(0, 1, 1, 0, 1, 1, \ldots\)), and every eventually periodic sequence is k-automatic for every \(k \geq 2\) — even though the Fibonacci numbers themselves, being unbounded, are not automatic as a sequence of integers.
Cobham’s Theorem
The central foundational result for automatic sequences concerns what happens when a sequence is automatic for two different bases:
Cobham's theorem (1969): let \(k, \ell \geq 2\) be multiplicatively independent, meaning \(k^a \neq \ell^b\) for all positive integers \(a, b\). If a sequence is both k-automatic and ℓ-automatic, then it is eventually periodic. Multiplicative independence is the same as saying \(k\) and \(\ell\) are not powers of a common integer. For example, \(k = 2\) and \(\ell = 3\) are multiplicatively independent, while \(k = 2\) and \(\ell = 4\) are not (since \(2^2 = 4^1\)).
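Multiplicative independence is mechanically checkable: if \(k^a = \ell^b\) has a solution at all, the minimal one has \(a \leq \log_2 \ell\) and \(b \leq \log_2 k\), so a bounded search suffices. A sketch (function name ours):

```python
def multiplicatively_independent(k, l):
    """True iff k^a != l^b for all positive integers a, b (assumes k, l >= 2).
    If k = c^i and l = c^j for a common base c, the minimal solution is
    a = j/gcd(i,j) <= log2(l), b = i/gcd(i,j) <= log2(k)."""
    for a in range(1, l.bit_length() + 1):
        for b in range(1, k.bit_length() + 1):
            if k ** a == l ** b:
                return False
    return True
```

For instance, 2 and 3 are independent, while 4 and 8 are not (\(4^3 = 8^2 = 64\)).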
Cobham’s theorem says: the “interesting” automatic sequences (non-eventually-periodic ones like Thue-Morse, Rudin-Shapiro) are tied to a single base. They cannot be captured by automata reading another, unrelated base. This gives automatic sequences a kind of “numerical specificity.”
The proof uses properties of morphisms and the structure of automatic sequences. Suppose \(\mathbf{a}\) is both k-automatic and ℓ-automatic, with k and ℓ multiplicatively independent.
The k-kernel and ℓ-kernel are both finite, giving \(\mathbf{a}\) self-similar structure at all scales \(k^a\) and at all scales \(\ell^b\). The heart of the argument is that multiplicative independence makes the set \(\{k^a/\ell^b : a, b \geq 1\}\) dense in the positive reals, so these two self-similar structures overlap at arbitrarily close scales. For any fixed window size \(m\), comparing the two structures forces local periods on the sequence of windows \((a_n, a_{n+1}, \ldots, a_{n+m-1})\), and a Fine-and-Wilf-style periodicity argument then shows the window sequence is eventually periodic. Since this holds for every \(m\), the sequence \(\mathbf{a}\) itself is eventually periodic. (The full proof is considerably more delicate than this sketch.)
Consequences:
- The Thue-Morse sequence is 2-automatic but not \(\ell\)-automatic for any \(\ell\) not a power of 2.
- A k-automatic sequence is also \(k^m\)-automatic for every \(m \geq 1\), and conversely (base change works between multiplicatively dependent bases). In contrast, a 6-automatic sequence need not be 2-automatic or 3-automatic: 6 is multiplicatively independent of both 2 and 3, so by Cobham's theorem a sequence that is both 6-automatic and 2-automatic must already be eventually periodic.
Automatic Sequences and Number Theory
Automatic sequences encode deep number-theoretic properties. The Champernowne word \(\mathbf{c} = 0\, 1\, 10\, 11\, 100\, 101\, \ldots\) (concatenation of all non-negative integers in binary) is not k-automatic for any \(k\): every finite binary word occurs in it as a factor, so its factor complexity (number of distinct length-\(n\) subwords) is the maximal \(2^n\), which grows super-linearly.
The factor complexity of an automatic sequence is at most linear: there exists \(C\) such that the number of distinct length-\(n\) subwords is at most \(Cn\). This is a hallmark of “structured” sequences.
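The linear bound is easy to observe empirically on the Thue-Morse word; the helper names below are ours, and the prefix is built by the block-doubling identity \(t_{2^{m+1}} = t_{2^m}\,\overline{t_{2^m}}\):

```python
def thue_morse_prefix(length):
    """A prefix of the Thue-Morse word via block doubling: t -> t + complement(t)."""
    t = [0]
    while len(t) < length:
        t += [1 - x for x in t]
    return t[:length]

def factor_complexity(seq, n):
    """Number of distinct length-n factors occurring in seq."""
    return len({tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)})
```

On a prefix of length 4096, the complexity values for \(n = 1, 2, 3, 4\) are 2, 4, 6, 10, and the counts stay well under \(4n\) — in line with the known bound for Thue-Morse.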
Connections to the Cartier operator in algebraic geometry, the Möbius function in analytic number theory, and fractal dimensions of certain Cantor-like sets make automatic sequences a meeting point of diverse mathematical fields.
Chapter 11: Algebraic Automata Theory
The Syntactic Monoid
Every regular language has a canonical algebraic object: its syntactic monoid. This section develops the theory connecting regular languages, finite monoids, and language varieties.
This is the monoid-theoretic analogue of the Myhill-Nerode theorem. Whereas Myhill-Nerode uses a right congruence (looking at \(xz \in L\) vs \(yz \in L\) for right contexts only), the syntactic congruence uses two-sided contexts and gives a finer equivalence.
Relationship to the minimal DFA: The transformation monoid of the minimal DFA for \(L\) is exactly \(\text{Synt}(L)\). Specifically, each word \(w \in \Sigma^*\) induces a function \(\phi_w : Q \to Q\) defined by \(\phi_w(q) = \delta^*(q, w)\). The set \(\{\phi_w : w \in \Sigma^*\}\) under composition is the syntactic monoid, with the monoid product being function composition.
Example: The language \(L = (ab)^*\) over \(\{a, b\}\). The syntactic monoid has 6 elements: the identity, the elements \(a, b, ab, ba\), and a zero \(0 = aa = bb\), subject to the relations \(aba = a\), \(bab = b\) (so \(ab\) and \(ba\) are idempotent). This monoid is the transformation monoid of the 3-state minimal DFA for \(L\).
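The transformation monoid can be generated mechanically: start from the identity and close the letter transformations under composition. A sketch (names ours; states must be numbered \(0, \ldots, n-1\)); the test below runs it on the 3-state minimal DFA for \((ab)^*\) (state 0 initial/accepting, state 1 after \(a\), state 2 dead):

```python
def transformation_monoid(delta, states, alphabet):
    """Close the letter transformations of a DFA under composition.
    delta maps (state, letter) -> state; transformations are tuples."""
    identity = tuple(states)
    letters = {a: tuple(delta[(q, a)] for q in states) for a in alphabet}
    monoid = {identity}
    frontier = [identity]
    while frontier:
        f = frontier.pop()
        for g in letters.values():
            h = tuple(g[f[q]] for q in states)   # phi_{wa}: apply f, then g
            if h not in monoid:
                monoid.add(h)
                frontier.append(h)
    return monoid
```

On the \((ab)^*\) DFA this yields exactly 6 transformations, including the zero \((2,2,2)\) and the idempotent \(\phi_{ab} = (0,2,2)\).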
Eilenberg’s Variety Theorem
The variety theorem of Eilenberg (1976) is a deep correspondence between algebraic properties of the syntactic monoid and logical/combinatorial properties of the language.
This is one of the deepest structural theorems in formal language theory. It reduces the classification of language families to the classification of finite monoid families — an algebraic problem amenable to the tools of group theory and semigroup theory.
Star-Free Languages and Aperiodic Monoids
The most celebrated instance of Eilenberg’s theorem is the characterization of star-free languages.
Equivalently (McNaughton-Papert), star-free languages are those definable in first-order logic FO[<] over word positions: the structure has domain \(\{0, 1, \ldots, |w|-1\}\), the order \(<\) on positions, and unary predicates \((Q_a)_{a \in \Sigma}\), where \(Q_a(i)\) means "position \(i\) carries symbol \(a\)."
This is Schützenberger’s theorem, one of the great results of 20th-century algebra and computer science. It precisely identifies the algebraic structure responsible for the absence of cyclic symmetry in a language.
Proof direction (aperiodic \(\Rightarrow\) star-free): If \(\text{Synt}(L)\) is aperiodic — there is an \(n\) with \(x^n = x^{n+1}\) for every element \(x\) — then no input word permutes any set of states of the minimal DFA non-trivially: there is no "group behavior." One shows, by induction on the ideal structure of the monoid, that \(L\) can be expressed using Boolean operations and concatenation alone; the star operation is never needed, because what star adds over star-free expressions is precisely counting modulo a period, and aperiodicity rules that out.
Proof direction (star-free \(\Rightarrow\) aperiodic): Every finite language is recognized by an aperiodic monoid. The class of languages with aperiodic syntactic monoids is closed under Boolean operations and concatenation (one checks that the product and quotient constructions for monoids preserve aperiodicity). Since star-free languages are the closure of finite languages under Boolean operations and concatenation, they all have aperiodic syntactic monoids.
Examples:
- \(L = (ab)^*\): aperiodic? The element \(ab\) in \(\text{Synt}(L)\) is idempotent (\((ab)^2 = ab\); the transformation \(\phi_{ab}\) is idempotent in the minimal DFA), and in fact every element \(x\) satisfies \(x^2 = x^3\). So yes, \(L\) is star-free. Indeed, \(L = \Sigma^* \setminus (b\Sigma^* \cup \Sigma^* a \cup \Sigma^* aa \Sigma^* \cup \Sigma^* bb \Sigma^*)\) — expressible without Kleene star.
- \(L = (ab)^+\): same, since \((ab)^+ = ab \cdot (ab)^*\).
- \(L = \{w : |w|_a \equiv 0 \pmod{3}\}\): not star-free. The element \(a\) in \(\text{Synt}(L)\) has order 3 (the DFA cycles through 3 states on reading \(a\)), generating a non-trivial group \(\mathbb{Z}/3\mathbb{Z}\) in the syntactic monoid.
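Both verdicts above can be checked mechanically: a finite monoid is aperiodic iff every element \(m\) satisfies \(m^t = m^{t+1}\) for some \(t\). A sketch over transformation tuples (names ours):

```python
def compose(f, g):
    """Apply f, then g (transformations on {0,...,n-1} as tuples)."""
    return tuple(g[x] for x in f)

def is_aperiodic_element(m):
    """Does m satisfy m^t == m^{t+1} for some t?  The powers of a
    transformation on n states repeat within n^n steps, so checking
    that many consecutive powers is a safe (if crude) bound."""
    p = m
    for _ in range(len(m) ** len(m)):
        q = compose(p, m)
        if q == p:
            return True
        p = q
    return False
```

For \((ab)^*\), the transformations \(\phi_a = (1,2,2)\) and \(\phi_{ab} = (0,2,2)\) stabilize, while for the mod-3 counting language \(\phi_a = (1,2,0)\) cycles with period 3 and never stabilizes — so that monoid is not aperiodic.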
The Krohn-Rhodes Theorem
The Krohn-Rhodes theorem (1965) is the “fundamental theorem of finite automata,” decomposing every finite automaton into a cascade of simpler components.
“Divides” means “is a quotient of a submonoid of.” The theorem says every finite state machine can be decomposed into a cascade (left-to-right pipeline) of:
- Simple group components: permutation automata whose transformation groups are finite simple groups, each dividing the original monoid.
- Reset components: the two-state "flip-flop" machine, with set, reset, and identity inputs.
This decomposition theorem is the foundation of complexity theory for finite automata and has connections to Boolean circuit complexity (the NC/AC hierarchy).
Chapter 12: Learning Regular Languages
The Learning Problem
The grammatical inference problem asks: given examples of a language, can we learn the language? Angluin’s seminal 1987 paper gave a polynomial-time algorithm for learning regular languages from membership queries and equivalence queries — a model of “exact learning.”
The setting: There is an unknown regular language \(L \subseteq \Sigma^*\) (the target). A learner (algorithm) can make two types of queries to a teacher (oracle):
- Membership query for \(w\): is \(w \in L\)? The teacher answers yes or no.
- Equivalence query for a DFA \(H\): is \(L(H) = L\)? The teacher answers yes (and the learner halts) or no and provides a counterexample — a word in \(L \Delta L(H) = (L \setminus L(H)) \cup (L(H) \setminus L)\).
The L* algorithm is one of the most practically important results in automata theory — it is used in model checking (learning specifications), protocol verification, and software testing.
The Observation Table
The L* algorithm maintains an observation table \((S, E, T)\) where:
- \(S \subseteq \Sigma^*\): a prefix-closed set of “short” words (rows).
- \(E \subseteq \Sigma^*\): a set of “experiments” / distinguishing suffixes (columns).
- \(T : (S \cup S \cdot \Sigma) \times E \to \{0, 1\}\): defined by \(T(s, e) = 1\) iff \(se \in L\).
The table has rows indexed by \(S \cup S \cdot \Sigma\) and columns by \(E\). The row for word \(s\) is the function \(\text{row}(s) : E \to \{0,1\}\) defined by \(\text{row}(s)(e) = T(s, e)\).
The table must satisfy two properties before we can build a DFA hypothesis:
- Closed: For every \(s \in S\) and \(a \in \Sigma\), there exists \(s' \in S\) with \(\text{row}(sa) = \text{row}(s')\).
- Consistent: For every \(s_1, s_2 \in S\) with \(\text{row}(s_1) = \text{row}(s_2)\), for every \(a \in \Sigma\), \(\text{row}(s_1 a) = \text{row}(s_2 a)\).
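The two table conditions translate directly into code; a minimal sketch (all names ours, with the membership oracle passed in as a Python predicate):

```python
def fill_table(S, E, alphabet, member):
    """Membership table T over (S ∪ S·Σ) × E, filled via the oracle."""
    T = {}
    for s in set(S) | {s + a for s in S for a in alphabet}:
        for e in E:
            T[(s, e)] = member(s + e)
    return T

def row(T, E, s):
    return tuple(T[(s, e)] for e in E)

def is_closed(S, E, T, alphabet):
    rows_of_S = {row(T, E, s) for s in S}
    return all(row(T, E, s + a) in rows_of_S for s in S for a in alphabet)

def is_consistent(S, E, T, alphabet):
    for s1 in S:
        for s2 in S:
            if s1 != s2 and row(T, E, s1) == row(T, E, s2):
                if any(row(T, E, s1 + a) != row(T, E, s2 + a) for a in alphabet):
                    return False
    return True
```

For the target "strings ending in \(a\)," the initial table with \(S = E = \{\varepsilon\}\) is not closed (the row of \(a\) is new); after adding \(a\) to \(S\), the table becomes closed and consistent.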
The L* Algorithm
L*(Σ, target L):
    Initialize S = {ε}, E = {ε}
    Fill T via membership queries.
    Repeat:
        While table is not closed or not consistent:
            If not consistent (∃ s1, s2 ∈ S with row(s1) = row(s2) but row(s1·a) ≠ row(s2·a)):
                Add a·e to E (for the distinguishing suffix e ∈ E); fill new column via queries.
            If not closed (∃ s ∈ S, a ∈ Σ with row(s·a) ≠ row(s') for all s' ∈ S):
                Add s·a to S; add s·a·b for all b ∈ Σ to S·Σ; fill new rows via queries.
        Build DFA hypothesis M from table:
            States = {row(s) : s ∈ S} (distinct rows)
            Start state = row(ε)
            δ(row(s), a) = row(s·a)
            Accept states = {row(s) : T(s, ε) = 1}
        Make equivalence query for M.
        If "yes": return M.
        If counterexample t is given:
            Add all prefixes of t to S; fill new rows via queries.
Correctness: When the table is closed and consistent, the hypothesis DFA \(M\) correctly separates all words in \(S \cup S\cdot\Sigma\) by their rows. The equivalence query either confirms \(M = L\) or provides a counterexample that reveals a new state (a new equivalence class of \(\equiv_L\) not yet captured in the table). Since the minimal DFA has \(n\) states, at most \(n-1\) counterexamples are ever received.
Complexity:
- Number of membership queries: polynomial in \(n\), \(|\Sigma|\), and the maximum counterexample length \(\ell\) — \(O(|\Sigma|\, \ell\, n^2)\) for the variant above, which adds all prefixes of each counterexample to \(S\).
- Number of equivalence queries: at most \(n\) (hence at most \(n - 1\) counterexamples before the final "yes").
- Total running time: polynomial in \(n\), \(|\Sigma|\), and \(\ell\).
Formally: let \(M\) be the hypothesis and \(t = a_1 a_2 \cdots a_m\) a counterexample, so \(t \in L \Delta L(M)\). Consider the prefixes \(\varepsilon, a_1, a_1 a_2, \ldots, t\). Since \(M\) decides \(t\) incorrectly, some transition along this prefix sequence merges two words that the target language distinguishes. Adding the prefixes to \(S\) therefore either (a) creates a new distinct row (a new state), or (b) reveals an inconsistency, which is repaired by adding a distinguishing suffix to \(E\) — and the repaired table again has more distinct rows. Either way the number of distinct rows strictly increases, and it can never exceed \(n\).
Extensions and Limitations
Lower bound: Neither query type suffices alone. With membership queries only, "password" automata force exponentially many queries; Angluin also showed that equivalence queries alone require exponentially many queries in the worst case. In the passive setting, the RPNI algorithm (Oncina-García 1992) learns the minimal DFA in polynomial time from a labeled sample, provided the sample contains a characteristic set of positive and negative examples.
Beyond regular languages: Can one learn CFLs? Under standard cryptographic assumptions, context-free grammars are not learnable in polynomial time from membership and equivalence queries (a hardness result of Angluin and Kharitonov). However, certain restricted classes (the regular k-testable languages, simple deterministic languages) are polynomial-time learnable.
Active learning in practice: The L* algorithm has been implemented in the LearnLib library and used successfully to infer models of software systems, network protocols (TCP, TLS), and hardware components.
Chapter 13: The CYK Algorithm and Parsing Complexity
Motivation: The General Parsing Problem
Earley’s algorithm (Chapter 4) handles arbitrary CFGs in \(O(n^3)\). But for the special case of grammars in Chomsky Normal Form (CNF), there is a strikingly clean dynamic programming algorithm — the Cocke-Younger-Kasami (CYK) algorithm — that achieves the same asymptotic bound with simpler structure and better constants.
Chomsky Normal Form
A CFG is in Chomsky Normal Form if every production has one of two forms:
- \(A \to BC\) where \(B, C \in V\) are variables (both non-terminal), or
- \(A \to a\) where \(a \in \Sigma\) is a terminal symbol.
The conversion algorithm eliminates \(\varepsilon\)-productions, unit productions (\(A \to B\)), and long productions (\(A \to X_1 X_2 \cdots X_k\) for \(k > 2\)) by introducing new variables; if \(\varepsilon \in L\), one additionally allows \(S \to \varepsilon\), with \(S\) not appearing on any right-hand side.
The CYK Algorithm
Define
\[\text{table}[i][j] = \{A \in V : A \Rightarrow^* a_i a_{i+1} \cdots a_j\}.\]
The CYK algorithm fills this table bottom-up:
CYK(G in CNF, w = a_1 ... a_n):
    For i = 1 to n:
        table[i][i] = {A : A -> a_i is a production}
    For length = 2 to n:
        For i = 1 to n - length + 1:
            j = i + length - 1
            table[i][j] = ∅
            For k = i to j-1:
                For each production A -> B C:
                    If B ∈ table[i][k] and C ∈ table[k+1][j]:
                        Add A to table[i][j]
    Accept iff S ∈ table[1][n]
Correctness: By induction on length. For length = 1, table[i][i] contains exactly the variables deriving \(a_i\) in one step. For length > 1, the split point \(k\) enumerates all possible first derivation steps in a binary split (valid because of CNF), and the sub-problems have already been solved.
Running time: \(O(n^3 \cdot |G|)\) where \(|G|\) is the number of productions. Three nested loops each run at most \(n\) times, and the inner production check takes \(O(|G|)\).
Space: \(O(n^2 \cdot |V|)\) for the table.
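The pseudocode above transcribes into a short recognizer; a minimal sketch (the production encoding — pairs \((A, \text{rhs})\) with rhs either a terminal or a tuple of two variables — is ours):

```python
def cyk(productions, start, w):
    """CYK recognition for a grammar in CNF.
    productions: list of (A, rhs), rhs a 1-char terminal string
    or a pair (B, C) of variables.  Rejects the empty string."""
    n = len(w)
    if n == 0:
        return False
    # table[(i, j)] = variables deriving w[i..j], 0-indexed inclusive
    table = {}
    for i in range(n):
        table[(i, i)] = {A for A, rhs in productions if rhs == w[i]}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cell = set()
            for k in range(i, j):                 # split point
                for A, rhs in productions:
                    if isinstance(rhs, tuple):
                        B, C = rhs
                        if B in table[(i, k)] and C in table[(k + 1, j)]:
                            cell.add(A)
            table[(i, j)] = cell
    return start in table[(0, n - 1)]
```

Example usage with a CNF grammar for \(\{a^n b^n : n \geq 1\}\) (productions \(S \to AB \mid AT\), \(T \to SB\), \(A \to a\), \(B \to b\)): the recognizer accepts `"aabb"` and rejects `"abab"`.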
Valiant’s O(n^2.37) Algorithm
Valiant (1975) showed that CFL recognition reduces to Boolean matrix multiplication, obtaining a subcubic parser; Lee (2002) proved the converse reduction, so the two problems are equivalent up to subcubic reductions:
The reduction encodes the “split” operation in CYK (does variable \(A\) span positions \(i\) to \(j\) with a split at \(k\)?) as a product of Boolean matrices, one for each grammar production.
Lower Bounds for Parsing
Is cubic time inherently necessary for CFL recognition? This is a major open problem:
Conjecture (CFL hardness): No algorithm can solve the general CFL recognition problem in \(O(n^{3-\varepsilon})\) time for any \(\varepsilon > 0\) without a breakthrough in matrix multiplication.
This conjecture is supported by Lee's equivalence with Boolean matrix multiplication: a combinatorial subcubic CFL recognizer would yield a combinatorial subcubic BMM algorithm, which has long resisted discovery. Fine-grained hypotheses such as conjectured lower bounds for k-Clique, the 3SUM conjecture, and SETH (the Strong Exponential Time Hypothesis) imply conditional lower bounds for variants of CFL recognition.
For special grammar subclasses:
- LL(k) and LR(k) grammars: recognition in \(O(n)\) (linear time).
- Unambiguous CFGs: Earley runs in \(O(n^2)\).
- General CNF grammars: \(O(n^\omega)\) via Valiant.
Chapter 14: String Matching and Automata
The Pattern Matching Problem
Given a text \(T = T[1..n]\) and pattern \(P = P[1..m]\), find all occurrences of \(P\) in \(T\). This is a central problem in bioinformatics, information retrieval, and compilers. The automata-theoretic approach yields the Knuth-Morris-Pratt (KMP) algorithm, which runs in optimal \(O(n + m)\) time.
The Failure Function and KMP
The naive pattern matching algorithm runs in \(O(mn)\). The key insight for speedup: when a mismatch occurs at position \(j\) of the pattern, we can often skip ahead using information about the pattern’s own structure — specifically, its failure function.
A border of a word is a proper prefix that is also a suffix. The failure function of \(P\) is defined by \(f[j] = \) the length of the longest border of \(P[1..j]\); it thus records the border lengths of all prefixes of \(P\). This connects directly to the notion of bordered words from Chapter 1: \(f[j] > 0\) iff \(P[1..j]\) is bordered.
Computing the failure function: A naive computation is \(O(m^2)\). The efficient approach (core of KMP preprocessing) uses the following:
COMPUTE-FAILURE(P):
    f[1] = 0
    k = 0
    For j = 2 to m:
        While k > 0 and P[k+1] ≠ P[j]:
            k = f[k]  // fall back along border chain
        If P[k+1] = P[j]:
            k++
        f[j] = k
    Return f
This runs in \(O(m)\) time. The amortized analysis: the variable \(k\) can increase by at most 1 per iteration of the outer loop (at most \(m\) increments total), and the while loop decrements \(k\) — so total decrements \(\leq\) total increments \(= O(m)\).
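The same amortized idea drives both the preprocessing and the search; a minimal Python sketch (0-indexed, so `f[j]` here is the border length of `P[:j+1]`; function names ours):

```python
def failure(P):
    """f[j] = length of the longest proper border of P[:j+1]."""
    m = len(P)
    f = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and P[k] != P[j]:
            k = f[k - 1]            # fall back along the border chain
        if P[k] == P[j]:
            k += 1
        f[j] = k
    return f

def kmp_search(T, P):
    """All 0-indexed occurrence positions of P in T, in O(|T| + |P|)."""
    f, k, occ = failure(P), 0, []
    for i, c in enumerate(T):
        while k > 0 and P[k] != c:
            k = f[k - 1]
        if P[k] == c:
            k += 1
        if k == len(P):
            occ.append(i - len(P) + 1)
            k = f[k - 1]            # continue: overlapping matches allowed
    return occ
```

For example, searching `"aba"` in `"abababa"` reports the overlapping occurrences at positions 0, 2, and 4.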
The KMP Automaton
The failure function defines a finite automaton for string matching:
The KMP search algorithm runs this DFA on the text \(T\): whenever the accept state \(m\) is reached, an occurrence of \(P\) has been found ending at that position in \(T\). Running the DFA on all of \(T\) takes \(O(n)\) time.
Automata-Theoretic Perspective
The pattern automaton \(M_P\) recognizes the language \(\Sigma^* P\) (all strings ending with the pattern \(P\)). The failure function computes the “suffix-to-prefix” overlap structure of the pattern, which is exactly the automaton’s transition function for mismatches.
This automaton perspective explains why KMP works: the DFA compactly encodes all possible alignments of the pattern simultaneously. The failure function is the DFA’s way of “not wasting” the information already scanned.
Connection to borders: A position \(j\) has \(f[j] = k > 0\) iff \(P[1..j]\) has a border of length \(k\). The while loop in COMPUTE-FAILURE "walks down the border chain": the borders of \(P[1..j]\) are exactly \(P[1..f[j]]\), \(P[1..f[f[j]]]\), and so on. An individual chain can have length up to \(j\) (for example, when \(P = a^m\)), but the amortized analysis above shows that the total cost of all chain traversals over the whole computation is \(O(m)\).
Aho-Corasick: Multiple Pattern Matching
The KMP algorithm finds occurrences of a single pattern. When searching for multiple patterns \(P_1, P_2, \ldots, P_k\) simultaneously, the Aho-Corasick algorithm (1975) builds a single DFA for all patterns at once.
The construction: build the trie of all patterns, then add failure links (analogous to the KMP failure function) from each node to the trie node whose label is the longest proper suffix of the current node's label that is itself a node of the trie. After \(O(\sum |P_i|)\) preprocessing (for a fixed alphabet), the automaton processes the text in \(O(n + \text{occ})\) time, where occ is the number of occurrences reported — independent of the number of patterns.
Applications: Virus signature detection (scanning for known malware signatures), genome assembly (finding k-mers), and lexical analysis in compilers all use Aho-Corasick.
The Aho-Corasick automaton is a regular language automaton: it accepts all strings containing at least one of \(P_1, \ldots, P_k\) as a substring. The language \(\Sigma^* P_1 \Sigma^* \cup \cdots \cup \Sigma^* P_k \Sigma^*\) is regular and the Aho-Corasick construction computes a near-minimal DFA for it in linear time.