CO 760: Online Algorithms and Competitive Analysis
These notes synthesize material from A. Borodin and R. El-Yaniv’s Online Computation and Competitive Analysis, N. Buchbinder and J. Naor’s survey on primal-dual online algorithms, and T. Roughgarden’s Beyond Worst-Case Analysis, enriched with material from Stanford CS369 and CMU 15-859.
1. Introduction to Online Computation
The theory of online algorithms addresses a fundamental question in algorithm design: how well can we perform when decisions must be made irrevocably, one at a time, before the full input is revealed? Unlike classical algorithm design, where we assume access to the entire input before producing any output, online computation models the realistic scenario in which data arrives incrementally and we must respond immediately. This framework captures settings ranging from cache management in operating systems, to financial decision-making, to network routing under uncertainty.
1.1 Online vs. Offline Problems
An online problem is specified by a sequence of requests \(\sigma = \sigma_1, \sigma_2, \ldots, \sigma_n\) arriving one at a time. Upon the arrival of each request \(\sigma_i\), an online algorithm \(\text{ALG}\) must produce a response before seeing any future request \(\sigma_{i+1}, \ldots, \sigma_n\). The cost incurred by the algorithm depends on the entire sequence of responses.
In contrast, an offline algorithm \(\text{OPT}\) has the luxury of seeing the entire request sequence \(\sigma\) upfront and can compute the globally optimal sequence of responses. The offline optimum serves as our benchmark: we measure the quality of an online algorithm by comparing its cost to the cost of the best offline solution.
Definition (Online Algorithm). An online algorithm \(\text{ALG}\) for an online minimization problem processes a request sequence \(\sigma = \sigma_1, \sigma_2, \ldots, \sigma_n\) sequentially. Upon receiving request \(\sigma_i\), the algorithm must produce a response \(r_i\) based only on \(\sigma_1, \ldots, \sigma_i\) and the previous responses \(r_1, \ldots, r_{i-1}\). The total cost \(\text{ALG}(\sigma)\) depends on the request sequence and all responses.
This definition immediately raises a natural question: since the online algorithm operates under strictly less information than the offline optimum, what is the worst-case penalty for this lack of foresight?
1.2 Competitive Analysis
The dominant framework for evaluating online algorithms is competitive analysis, introduced by Sleator and Tarjan in 1985 in their seminal work on list accessing and paging. Rather than measuring the absolute performance of an online algorithm, competitive analysis compares the online algorithm’s cost to the offline optimum on the same input.
Definition (Competitive Ratio). An online algorithm \(\text{ALG}\) for a minimization problem is \(c\)-competitive if there exists a constant \(\alpha\) such that for every request sequence \(\sigma\),
\[ \text{ALG}(\sigma) \leq c \cdot \text{OPT}(\sigma) + \alpha, \]
where \(\text{OPT}(\sigma)\) is the cost of an optimal offline algorithm on \(\sigma\). The competitive ratio of \(\text{ALG}\) is the infimum over all \(c\) for which \(\text{ALG}\) is \(c\)-competitive.
The additive constant \(\alpha\) is necessary to handle initial transients and finite-length effects but becomes negligible for long request sequences. When \(\alpha = 0\), we say the algorithm is strictly \(c\)-competitive.
For randomized algorithms, we take the expectation over the algorithm’s internal coin flips: a randomized algorithm is \(c\)-competitive (against an oblivious adversary) if
\[ \mathbb{E}[\text{ALG}(\sigma)] \leq c \cdot \text{OPT}(\sigma) + \alpha \quad \text{for all } \sigma. \]
1.3 Adversary Models
The power of the adversary — the entity generating the request sequence — is critical in the randomized setting. Three standard adversary models capture different levels of adversarial power.
Definition (Adversary Models).
(i) Oblivious adversary. The adversary must fix the entire request sequence \(\sigma\) in advance, before the algorithm makes any random choices. The adversary knows the algorithm’s code but not its random bits. The offline optimum is computed with full knowledge of \(\sigma\).
(ii) Adaptive online adversary. The adversary sees the algorithm’s responses as they are made (including the outcomes of random choices) and can adaptively choose the next request. However, the adversary must also serve the requests online, and its cost is compared to its own online cost.
(iii) Adaptive offline adversary. The adversary sees the algorithm’s responses as they are made and can adaptively choose the next request. But the adversary’s cost is compared to the optimal offline cost on the resulting sequence. This is the most powerful adversary.
For deterministic algorithms, all three adversary models are equivalent (since there is no randomness to observe). The oblivious adversary is the most commonly used model for randomized algorithms, and most results in these notes use the oblivious model unless stated otherwise.
Ben-David, Borodin, Karp, Tardos, and Wigderson (1994) established the following fundamental relationships between adversary models:
Theorem 1.1 (Ben-David et al.). If a randomized algorithm is \(c\)-competitive against the adaptive online adversary and there exists a \(d\)-competitive deterministic algorithm, then there exists a \(cd\)-competitive algorithm against the adaptive offline adversary.
This theorem shows that results against adaptive adversaries can sometimes be translated into results against the adaptive offline adversary at the cost of an extra factor.
1.4 The Ski Rental Problem
We begin our study with the simplest and most elegant online problem: the ski rental problem, also known as the rent-or-buy problem. Despite its simplicity, it captures the essential tension in online decision-making: committing to a large one-time cost versus repeatedly paying smaller incremental costs.
Example (Ski Rental). A skier goes skiing on multiple days but does not know in advance how many days they will ski. Each day, they must choose between renting skis at cost 1 per day or buying skis at cost \(B\) (after which skiing is free forever). The goal is to minimize total cost.
If the skier will ski for \(n\) days total, the optimal offline cost is \(\min(n, B)\): either rent for all \(n\) days if \(n < B\), or buy on day 1 if \(n \geq B\).
This simple problem already illustrates the difficulty of online computation. Let us analyze deterministic strategies.
Theorem 1.2 (Deterministic Ski Rental). The optimal deterministic competitive ratio for the ski rental problem is \(2 - 1/B\), achieved by the algorithm that rents for exactly \(B - 1\) days and then buys on day \(B\) (if the skier is still skiing).
Proof. Consider the algorithm \(\text{ALG}\) that rents for the first \(B-1\) days and buys on day \(B\). We analyze two cases:
Case 1: The skier skis for \(n \leq B - 1\) days. Then \(\text{ALG}(\sigma) = n = \text{OPT}(\sigma)\), so the ratio is 1.
Case 2: The skier skis for \(n \geq B\) days. Then \(\text{ALG}(\sigma) = (B-1) + B = 2B - 1\) (renting for \(B-1\) days, then buying). The optimal cost is \(\text{OPT}(\sigma) = B\). Thus the ratio is \((2B-1)/B = 2 - 1/B\).
The worst case is Case 2, giving competitive ratio \(2 - 1/B\).
To see this is optimal, consider any deterministic algorithm that buys on day \(d\) (if it has not stopped skiing). If the adversary stops the skier on day \(d\), the algorithm pays \((d-1) + B\) while the optimal cost is \(\min(d, B)\). If instead the adversary stops the skier on day \(d-1\), the algorithm pays \(d-1\) and the optimum pays \(\min(d-1, B)\). The adversary chooses whichever is worse. A careful optimization over \(d\) shows no deterministic algorithm achieves ratio better than \(2 - 1/B\). \(\blacksquare\)
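The case analysis above is easy to check numerically. The sketch below (plain Python; the horizon cap of 1000 days is an arbitrary choice of mine) computes the worst-case ratio of the rent-for-\(B-1\)-days-then-buy strategy over all horizon lengths:

```python
def alg_cost(n, B):
    """Cost of the break-even strategy: rent on days 1..B-1, buy on day B."""
    return n if n <= B - 1 else (B - 1) + B

def opt_cost(n, B):
    """Offline optimum: rent every day, or buy on day 1."""
    return min(n, B)

def worst_ratio(B, horizon=1000):
    return max(alg_cost(n, B) / opt_cost(n, B) for n in range(1, horizon + 1))

B = 10
print(worst_ratio(B))  # 2 - 1/B = 1.9
```

The worst case is attained by every horizon \(n \geq B\), matching Case 2 of the proof.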
Can randomization help? Beautifully, it can.
Theorem 1.3 (Randomized Ski Rental, Karlin et al., 1994). There exists a randomized algorithm for ski rental with competitive ratio \(e/(e-1) \approx 1.582\) against an oblivious adversary. This is optimal.
Proof sketch. As \(B \to \infty\), the optimal randomized strategy converges to the continuous one in which the buying time \(t\) is drawn from \([0, B]\) with density \(f(t) = \frac{e^{t/B}}{B(e-1)}\), normalized so that \(\int_0^B f(t)\,dt = 1\). If the skier skis for \(n\) days, the expected cost is
\[ \mathbb{E}[\text{ALG}] = \int_0^n f(t)(t + B)\,dt + \left(1 - \int_0^n f(t)\,dt\right) \cdot n. \]
The first term accounts for buying at time \(t \leq n\), and the second for the event that the buying time exceeds \(n\), in which case the skier rents throughout. A direct computation shows this equals \(\frac{e}{e-1} \cdot \min(n, B) = \frac{e}{e-1} \cdot \text{OPT}(\sigma)\) for every \(n\): the density is chosen precisely so that the ratio is equalized across all horizons.
For the lower bound, Yao’s minimax principle (discussed below) is used: we exhibit a distribution over inputs such that any deterministic algorithm has expected ratio at least \(e/(e-1)\). Consider the input where the number of skiing days is drawn from the distribution \(\Pr[n = i] \propto 1/i\) for \(i = 1, \ldots, B\). Against this distribution, any deterministic algorithm that buys on day \(d\) achieves expected ratio at least \(e/(e-1)\) as \(B \to \infty\). \(\blacksquare\)
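The continuous analysis can be sanity-checked numerically. The sketch below integrates the density \(f(t) = e^{t/B}/(B(e-1))\) by the midpoint rule (the grid resolution is an arbitrary choice) and confirms that the expected-cost-to-OPT ratio equals \(e/(e-1)\) for every horizon:

```python
import math

def expected_ratio(n, B, steps=20000):
    """E[ALG]/OPT when the buying time has density exp(t/B) / (B*(e-1)) on [0, B]."""
    h = min(n, B) / steps          # integrate over buying times t <= min(n, B)
    buy_mass = 0.0                 # Pr[buying time <= n]
    buy_cost = 0.0                 # E[(t + B) * 1{t <= n}]
    for i in range(steps):
        t = (i + 0.5) * h          # midpoint rule
        f = math.exp(t / B) / (B * (math.e - 1))
        buy_mass += f * h
        buy_cost += f * (t + B) * h
    exp_alg = buy_cost + (1 - buy_mass) * n   # never bought: rent for all n days
    return exp_alg / min(n, B)                # OPT = min(n, B)

B = 100.0
print(max(expected_ratio(n, B) for n in [10, 50, 100, 300]),
      math.e / (math.e - 1))
```

Both short horizons (\(n \ll B\)) and long ones (\(n > B\)) give the same ratio, illustrating why no horizon is worse than any other for this strategy.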
1.5 Yao’s Minimax Principle
The lower bound in the ski rental problem relies on a powerful technique: Yao’s minimax principle, which transfers a celebrated result from game theory to the analysis of algorithms.
Theorem (Yao’s Minimax Principle). For any probability distribution \(\mathcal{D}\) over request sequences,
\[ \sup_{\sigma} \frac{\mathbb{E}[R(\sigma)]}{\text{OPT}(\sigma)} \;\geq\; \inf_{\text{deterministic } A} \; \mathbb{E}_{\sigma \sim \mathcal{D}}\!\left[\frac{A(\sigma)}{\text{OPT}(\sigma)}\right] \]
for any randomized algorithm \(R\) (against an oblivious adversary). In other words, to prove a lower bound \(c\) on the competitive ratio of any randomized algorithm, it suffices to exhibit a distribution over inputs such that every deterministic algorithm has expected ratio at least \(c\).
This principle is a direct consequence of von Neumann’s minimax theorem applied to the zero-sum game between the algorithm designer and the adversary. It has become the standard tool for proving randomized lower bounds in competitive analysis.
1.6 The Rent-or-Buy Paradigm
The ski rental problem is the prototype for a broad class of online problems sharing the rent-or-buy structure: at each step, the algorithm can either pay a small incremental cost (rent) or commit to a large one-time cost (buy) that eliminates all future incremental costs. Many problems in networking, caching, and resource allocation exhibit this structure.
Example (TCP Acknowledgment). In the TCP acknowledgment problem, a server receives packets and must acknowledge them. Each unacknowledged packet incurs a delay cost of 1 per unit time. Sending an acknowledgment costs \(B\). The server must decide when to send acknowledgments. This is precisely a continuous version of ski rental: delaying acknowledgment corresponds to renting, and sending the acknowledgment corresponds to buying.
Example (Snoopy Caching). In a multiprocessor system with a shared bus, a processor may access a memory block that is cached by another processor. It can either read the block over the bus at cost 1 (rent) or invalidate the other cache and load the block locally at cost \(B\) (buy). The frequency of future accesses is unknown, yielding another instance of rent-or-buy.
The rent-or-buy paradigm extends naturally to more complex settings. In the multislope ski rental problem, there are multiple options with different combinations of one-time and recurring costs — for instance, renting at cost 1/day, leasing for a season at cost \(B_1\) with reduced daily cost \(\ell\), or buying at cost \(B_2 > B_1\). These generalizations arise in cloud computing, where resources can be provisioned on-demand, reserved, or purchased.
1.7 A Broader Perspective
Competitive analysis, while elegant and powerful, has well-known limitations. For some problems, such as paging, it gives sharp and practically meaningful results. For others, the worst-case adversarial model is too pessimistic, and the competitive ratios do not reflect practical performance. This tension has driven much of the research in Chapter 7 on beyond worst-case analysis.
Nevertheless, competitive analysis provides a clean mathematical framework that has yielded deep structural insights. The interplay between deterministic and randomized algorithms, the hierarchy of adversary models, and the connections to game theory and online learning make this a rich and beautiful area of combinatorial optimization.
2. Paging and the \(k\)-Server Problem
Paging is the canonical online problem, and the one that launched the field of competitive analysis. Introduced by Sleator and Tarjan in 1985, paging models the management of a two-level memory hierarchy: a fast cache of size \(k\) and a slow memory containing \(N > k\) pages. When a requested page is not in the cache (a page fault), the algorithm must decide which cached page to evict.
2.1 The Paging Problem
Definition (Paging). The paging problem is specified by a cache of size \(k\), a universe of \(N\) pages, and a request sequence \(\sigma = \sigma_1, \sigma_2, \ldots, \sigma_n\) where each \(\sigma_i\) is a page. If \(\sigma_i\) is in the cache (a hit), no cost is incurred. If \(\sigma_i\) is not in the cache (a fault), the algorithm must bring \(\sigma_i\) into the cache and evict some other page, at cost 1. The objective is to minimize the total number of faults.
Several classical page replacement policies have been studied:
Definition (Classical Paging Algorithms).
LRU (Least Recently Used): On a fault, evict the page whose most recent request was earliest.
FIFO (First In, First Out): On a fault, evict the page that has been in the cache the longest.
LFU (Least Frequently Used): On a fault, evict the page with the fewest requests so far.
LIFO (Last In, First Out): On a fault, evict the most recently loaded page.
Of these, LRU and FIFO are widely used in practice. LFU and LIFO, on the other hand, have unbounded competitive ratios and can perform very poorly from a competitive analysis standpoint.
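The two practical policies are easy to simulate. A minimal sketch (function names are mine) that counts faults for LRU and FIFO on a request sequence:

```python
from collections import OrderedDict, deque

def count_faults_lru(requests, k):
    """Count page faults of LRU with cache size k."""
    cache = OrderedDict()           # keys in recency order, most recent last
    faults = 0
    for p in requests:
        if p in cache:
            cache.move_to_end(p)    # a hit refreshes recency
        else:
            faults += 1
            if len(cache) == k:
                cache.popitem(last=False)   # evict least recently used
            cache[p] = True
    return faults

def count_faults_fifo(requests, k):
    """Count page faults of FIFO with cache size k."""
    cache, queue = set(), deque()   # queue records load order
    faults = 0
    for p in requests:
        if p not in cache:
            faults += 1
            if len(cache) == k:
                cache.discard(queue.popleft())  # evict oldest-loaded page
            cache.add(p)
            queue.append(p)
    return faults

reqs = [1, 2, 3, 1, 4, 1, 2, 5]
print(count_faults_lru(reqs, 3), count_faults_fifo(reqs, 3))  # 6 faults vs. 7
```

Note that a hit updates LRU’s state but not FIFO’s, which is why the two policies diverge on this sequence.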
2.2 Deterministic Paging
Sleator and Tarjan (1985) proved the following foundational result:
Theorem 2.1 (Sleator-Tarjan, 1985). LRU and FIFO are both \(k\)-competitive for paging, and no deterministic online paging algorithm can achieve a competitive ratio better than \(k\).
Proof of the lower bound. We show that any deterministic online algorithm \(\text{ALG}\) for paging with a cache of size \(k\) has competitive ratio at least \(k\). The adversary uses a universe of \(k + 1\) pages.
The adversary always requests the page not currently in the algorithm’s cache. Since there are \(k+1\) pages and the cache holds \(k\), there is always exactly one page not in the cache, and the adversary can identify it (since the algorithm is deterministic). This forces a fault on every single request, so after \(n\) requests, \(\text{ALG}\) incurs cost \(n\).
Now consider the offline algorithm that uses Bélády’s Longest-Forward-Distance (LFD) rule: on a fault, evict the page whose next request is furthest in the future. Whenever LFD faults, the evicted page is requested again later than every page remaining in the cache. Since the universe has \(k+1\) pages, the next fault can only occur when the evicted page is requested again, and the next requests of all \(k\) cached pages occur strictly before that. Hence LFD faults at most once every \(k\) requests after filling its cache, so \(\text{OPT}(\sigma) \leq n/k + k\).

Thus the competitive ratio is at least \(n / (n/k + k) \to k\) as \(n \to \infty\). Since this holds for arbitrarily long sequences, no deterministic algorithm achieves a ratio below \(k\). \(\blacksquare\)
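The adversary argument can be run as a simulation. The sketch below (names and parameters are mine) feeds LRU the page missing from its cache at every step, then computes the offline optimum with Bélády’s rule; the ratio approaches \(k\) for long sequences:

```python
def belady_faults(requests, k):
    """Offline optimum (Bélády / LFD): evict the page requested furthest in the future."""
    cache, faults = set(), 0
    for i, p in enumerate(requests):
        if p in cache:
            continue
        faults += 1
        if len(cache) == k:
            def next_use(q):
                return next((j for j in range(i + 1, len(requests))
                             if requests[j] == q), len(requests))
            cache.discard(max(cache, key=next_use))
        cache.add(p)
    return faults

def adversary_vs_lru(k, n):
    """Request, at every step, the unique page (of k+1) missing from LRU's cache."""
    pages = list(range(k + 1))
    cache = []                       # recency order, most recent last
    requests = []
    for _ in range(n):
        missing = next(p for p in pages if p not in cache)
        requests.append(missing)
        if len(cache) == k:
            cache.pop(0)             # LRU eviction (every request is a fault)
        cache.append(missing)
    return requests

k, n = 5, 200
reqs = adversary_vs_lru(k, n)
alg = n                              # LRU faults on every request by construction
opt = belady_faults(reqs, k)
print(alg / opt)                     # close to k for long sequences
```

Here the adversarial sequence degenerates into a cycle through the \(k+1\) pages, on which LFD faults only once per \(k\) requests after warm-up.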
Proof that LRU is \(k\)-competitive. We use a phase partition argument. Partition the request sequence into phases: phase 1 begins at the first request, and a new phase begins with the request that introduces the \((k+1)\)-th distinct page since the start of the current phase. By construction, each phase contains requests to exactly \(k\) distinct pages (except possibly the last, which may contain fewer).

LRU faults at most \(k\) times per phase. Once a page has been requested during a phase, fewer than \(k\) other distinct pages are requested before the phase ends, so it never becomes the least recently used page and is never evicted within the phase. Hence LRU faults at most once per distinct page, i.e., at most \(k\) times per phase.

Next, OPT faults at least once per phase, except possibly the last. Let \(p\) be the first request of phase \(i\). Just after serving \(p\), OPT’s cache contains \(p\) and at most \(k - 1\) other pages. Between the second request of phase \(i\) and the first request of phase \(i+1\) (inclusive), \(k\) distinct pages other than \(p\) are requested, so at least one of them is missing from OPT’s cache and causes a fault. These intervals are disjoint across phases, so OPT incurs at least one fault per phase (except possibly the last).

Therefore LRU faults at most \(k\) times per phase while OPT faults at least once per phase, giving a ratio of at most \(k\) up to an additive constant. \(\blacksquare\)
Remark. A paging algorithm is called conservative if, on any consecutive subsequence containing at most \(k\) distinct pages, it faults at most \(k\) times. Both LRU and FIFO are conservative, and essentially the same phase argument shows that every conservative algorithm is \(k\)-competitive.
2.3 The MARKING Algorithm
To improve upon the deterministic bound using randomization, Fiat, Karp, Luby, McGeoch, Sleator, and Young (1991) introduced the MARKING algorithm, a randomized paging algorithm with an elegant structure.
Definition (MARKING Algorithm). The MARKING algorithm operates in phases. At the beginning of each phase, all pages are unmarked. When a page is requested:
(1) If the requested page is in the cache, mark it.
(2) If the requested page is not in the cache (a fault), evict a page chosen uniformly at random from the set of unmarked pages in the cache, bring in the requested page, and mark it.
(3) If all pages in the cache are marked when a fault occurs, start a new phase: unmark all pages, then process the request as in (2).
Theorem 2.2 (Fiat-Karp-Luby-McGeoch-Sleator-Young, 1991). The MARKING algorithm is \((2H_k - 1)\)-competitive, where \(H_k = \sum_{i=1}^{k} 1/i\) is the \(k\)-th harmonic number; since \(H_k = \ln k + O(1)\), this ratio is \(2 \ln k + O(1)\).
The analysis proceeds by comparing MARKING to the offline optimum phase by phase. In each phase, the algorithm requests at most \(k\) distinct new pages. The key insight is that the expected number of faults of MARKING on the “stale” pages (those not requested in the previous phase) can be bounded using a coupon-collector-type argument, yielding the harmonic number.
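MARKING is short enough to implement directly. The sketch below (names, seeds, and the test instance are my choices) runs it against Bélády’s offline optimum and checks the empirical fault count against the \((2H_k - 1)\) bound:

```python
import random

def marking_faults(requests, k, seed=0):
    """MARKING: on a fault, evict a uniformly random unmarked cached page."""
    rng = random.Random(seed)
    cache, marked = set(), set()
    faults = 0
    for p in requests:
        if p in cache:
            marked.add(p)
            continue
        faults += 1
        if len(cache) == k:
            if not cache - marked:    # every cached page is marked:
                marked.clear()        # start a new phase
            victim = rng.choice(sorted(cache - marked))
            cache.discard(victim)
        cache.add(p)
        marked.add(p)
    return faults

def belady(requests, k):
    """Offline optimum (Bélády's LFD rule), for comparison."""
    cache, faults = set(), 0
    for i, p in enumerate(requests):
        if p in cache:
            continue
        faults += 1
        if len(cache) == k:
            def next_use(q):
                return next((j for j in range(i + 1, len(requests))
                             if requests[j] == q), len(requests))
            cache.discard(max(cache, key=next_use))
        cache.add(p)
    return faults

k = 6
gen = random.Random(1)
reqs = [gen.randrange(k + 2) for _ in range(500)]
Hk = sum(1 / i for i in range(1, k + 1))
avg = sum(marking_faults(reqs, k, seed=s) for s in range(20)) / 20
opt = belady(reqs, k)
print(avg, (2 * Hk - 1) * opt)   # empirical faults vs. the theoretical bound
```

On random inputs like this one MARKING is typically well inside the bound; the \((2H_k - 1)\) guarantee is for adversarial sequences.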
2.4 Optimal Randomized Paging
The MARKING algorithm is not quite optimal. The tight bound for randomized paging was established by a celebrated result:
Theorem 2.3 (Fiat et al., 1991; Achlioptas-Chrobak-Noga, 2000). The optimal randomized competitive ratio for paging with cache size \(k\) against an oblivious adversary is \(H_k = \sum_{i=1}^{k} 1/i = \ln k + O(1)\).
The lower bound \(H_k\) follows from Yao’s minimax principle. Consider a universe of \(k+1\) pages, and let the adversary choose each request uniformly at random from the \(k+1\) pages. Any deterministic algorithm with cache \(k\) faults with probability \(1/(k+1)\) on each request (since the requested page is equally likely to be any of the \(k+1\) pages, and one is not cached). The offline optimum, however, can be shown to fault at rate approximately \(1/((k+1)H_k)\) on this distribution, yielding the \(H_k\) lower bound.
The matching upper bound is achieved by the EQUITABLE algorithm of Achlioptas, Chrobak, and Noga, which is an intricate refinement of MARKING.
2.5 The \(k\)-Server Problem
The \(k\)-server problem is one of the most important and well-studied problems in online computation, generalizing paging to arbitrary metric spaces.
Definition (\(k\)-Server Problem). We are given a metric space \((M, d)\) with \(N\) points and \(k\) mobile servers located at points in \(M\). A request sequence \(\sigma = \sigma_1, \sigma_2, \ldots, \sigma_n\) specifies points in \(M\). To serve request \(\sigma_i\), we must move a server to point \(\sigma_i\) (if no server is already there). The cost is the total distance moved by all servers.
Paging is the special case where \(M\) is a uniform metric space on \(N\) points (all distances equal to 1), and each request must be served by having a server present at the requested point.
Theorem 2.4 and Conjecture (Manasse-McGeoch-Sleator, 1988). For any metric space with at least \(k+1\) points, every deterministic \(k\)-server algorithm has competitive ratio at least \(k\). The \(k\)-server conjecture asserts that this is tight: there exists a \(k\)-competitive deterministic algorithm for every metric space.
Manasse, McGeoch, and Sleator proved the lower bound of \(k\) and showed that the conjecture holds for \(k = 2\) and for any metric space with \(k + 1\) points. The conjecture remains one of the central open problems in online algorithms, though tremendous progress has been made.
2.6 The Work Function Algorithm
The most important algorithm for the \(k\)-server problem is the work function algorithm (WFA), introduced by Chrobak and Larmore (1991) and analyzed by Koutsoupias and Papadimitriou (1995).
Definition (Work Function). After processing requests \(\sigma_1, \ldots, \sigma_t\), the work function \(w_t(X)\) is defined for every configuration \(X\) of \(k\) servers as the minimum cost of serving \(\sigma_1, \ldots, \sigma_t\) starting from the initial configuration and ending in configuration \(X\).
Definition (Work Function Algorithm). When request \(\sigma_{t+1}\) arrives at point \(r\), the Work Function Algorithm moves the server that minimizes \(w_{t+1}(X') + d(X, X')\), where \(X\) is the current configuration, \(X'\) is the configuration after moving a server to \(r\), and \(d(X, X')\) is the cost of the move.
\[ s^* = \arg\min_{s \in X} \left[ w_{t+1}(X - s + r) + d(s, r) \right], \]where \(X - s + r\) denotes the configuration with server \(s\) moved to \(r\).
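The work function recurrence \(w_t(X) = \min_{X' \ni r} [w_{t-1}(X') + D(X', X)]\), where \(D\) is the min-cost matching distance between configurations, can be evaluated by brute force on tiny instances. A sketch (exponential in the number of points, so illustration only; function names and the example metric are mine):

```python
from itertools import combinations, permutations

def wfa_cost(points, k, dist, init, requests):
    """Work Function Algorithm by brute force over all k-point configurations."""
    def D(X, Y):
        # min-cost matching between two configurations (brute force)
        return min(sum(dist(a, b) for a, b in zip(X, p)) for p in permutations(Y))

    configs = [tuple(sorted(c)) for c in combinations(points, k)]
    w = {X: D(tuple(sorted(init)), X) for X in configs}     # w_0
    cur, total = tuple(sorted(init)), 0
    for r in requests:
        # w_t(X) = min over X' containing r of w_{t-1}(X') + D(X', X)
        w = {X: min(w[Xp] + D(Xp, X) for Xp in configs if r in Xp)
             for X in configs}
        if r not in cur:
            # move the server s minimizing w_t(X - s + r) + d(s, r)
            best = min(cur, key=lambda s:
                       w[tuple(sorted(set(cur) - {s} | {r}))] + dist(s, r))
            total += dist(best, r)
            cur = tuple(sorted(set(cur) - {best} | {r}))
    opt = min(w.values())    # offline optimum may end in any configuration
    return total, opt

pts = [0, 1, 3, 7]
d = lambda a, b: abs(a - b)
total, opt = wfa_cost(pts, 2, d, init=[0, 1], requests=[3, 7, 0, 3, 1, 7])
print(total, opt)
```

The returned `opt` is \(\min_X w_n(X)\), so the pair directly exhibits the online-vs-offline gap on the instance.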
Theorem 2.5 (Koutsoupias-Papadimitriou, 1995). The Work Function Algorithm is \((2k - 1)\)-competitive for the \(k\)-server problem on any metric space.
This remains the best known competitive ratio for general metric spaces. Closing the gap between \(k\) (the conjectured optimum) and \(2k-1\) is one of the major open problems in the field.
2.7 Special Cases and Recent Progress
For specific metric spaces, much better results are known:
Theorem 2.6 (Special Cases of \(k\)-Server).
(a) On the line, the Double Coverage algorithm of Chrobak, Karloff, Payne, and Vishwanathan (1991) is \(k\)-competitive; Chrobak and Larmore (1991) extended this to trees, confirming the \(k\)-server conjecture for tree metrics.
(b) On any metric space with \(k + 1\) points, the WFA is \(k\)-competitive.
(c) For \(k = 2\), the \(k\)-server conjecture holds on any metric space.
In a major breakthrough, Bubeck, Cohen, Lee, and Lee (2018) achieved an \(O(\log^2 k)\)-competitive randomized algorithm for the \(k\)-server problem on hierarchically separated trees (HSTs), and hence \(O(\log^2 k \cdot \log n)\)-competitive on general metrics via Bartal’s tree embedding; Lee (2018) subsequently obtained \(O(\log^6 k)\) on general metrics. The randomized \(k\)-server conjecture, which posited \(O(\log k)\)-competitive randomized algorithms for all metrics, was refuted by Bubeck, Coester, and Rabani (2023), who exhibited metrics on which every randomized algorithm has competitive ratio \(\omega(\log k)\).
3. Metrical Task Systems
The framework of metrical task systems (MTS), introduced by Borodin, Linial, and Saks (1992), provides a unifying abstraction that captures a wide variety of online problems, including paging, the \(k\)-server problem, and many others. In an MTS, an algorithm occupies a state in a metric space and must process tasks that impose state-dependent processing costs.
3.1 Definition and Formulation
Definition (Metrical Task System). A metrical task system (MTS) is defined by a metric space \((M, d)\) on \(n\) states. The algorithm occupies one state at any time. A task is a vector \(\vec{c} = (c_1, c_2, \ldots, c_n) \in (\mathbb{R}_{\geq 0} \cup \{\infty\})^n\), where \(c_i\) is the cost of processing the task while in state \(i\). Upon receiving task \(\vec{c}\), the algorithm may move from its current state \(s\) to any state \(s'\), paying the movement cost \(d(s, s')\) plus the processing cost \(c_{s'}\). The objective is to minimize the total cost (movement + processing) over the entire task sequence.
Example (Paging as MTS). Paging with cache size \(k\) and universe of \(N\) pages is an MTS on \(\binom{N}{k}\) states, where each state corresponds to a cache configuration. When page \(p\) is requested, the task vector sets \(c_X = 0\) if \(p \in X\) and \(c_X = \infty\) if \(p \notin X\). The metric is the symmetric difference: \(d(X, Y) = |X \setminus Y| = |Y \setminus X|\), representing the number of page loads needed to transform one cache configuration into another.
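The MTS definition suggests a simple offline dynamic program: after each task, track the cheapest cost of ending in each state. A minimal sketch (the instance is my own toy example):

```python
def mts_offline_opt(d, tasks, start):
    """Offline optimum of a metrical task system by dynamic programming.

    d: n x n movement-cost matrix; tasks: list of cost vectors; start: initial
    state.  opt[t] = cheapest cost of processing all tasks seen so far and
    ending in state t (for each task: move first, then process)."""
    n = len(d)
    opt = [0.0 if s == start else float("inf") for s in range(n)]
    for c in tasks:
        opt = [min(opt[s] + d[s][t] + c[t] for s in range(n)) for t in range(n)]
    return min(opt)

d = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                   # uniform metric, 3 states
tasks = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 1)]    # one cheap state per task
print(mts_offline_opt(d, tasks, start=0))                # → 3
```

This is the benchmark \(\text{OPT}\) that competitive guarantees for MTS are measured against; it runs in \(O(Tn^2)\) time for \(T\) tasks.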
3.2 Deterministic Bounds
The fundamental result for deterministic MTS is the tight \((2n - 1)\) bound:
Theorem 3.1 (Borodin-Linial-Saks, 1992). For any metrical task system on \(n\) states: (a) The Work Function Algorithm is \((2n - 1)\)-competitive. (b) No deterministic online algorithm achieves competitive ratio better than \(2n - 1\) on the uniform metric space.
The lower bound uses a potential function argument specific to the uniform metric. The upper bound follows from the analysis of the Work Function Algorithm, which extends naturally from the \(k\)-server setting.
Proof sketch of the lower bound. Consider the uniform metric on \(n\) states (all pairwise distances equal to 1). The adversary issues tasks \(\vec{c}\) with \(c_i = 0\) for a single state \(i\) and \(c_j = 1\) for all \(j \neq i\), always choosing the zero-cost state to differ from the algorithm’s current state. The algorithm must either pay 1 to stay (processing cost) or pay 1 to move (movement cost); in either case it pays at least 1 per task.
The offline algorithm, by contrast, can batch moves. By choosing to stay in a state and accumulate processing cost \(1\) per task, and then moving when it becomes efficient to do so, the offline can amortize movement costs. On the uniform metric with \(n\) states, the adversary can force a pattern where the online algorithm pays 1 per task while the offline pays \(1/(2n - 1)\) per task (amortized). This yields the lower bound of \(2n-1\). The detailed construction uses a potential function argument and a careful adversary strategy. \(\blacksquare\)
3.3 Randomized MTS
For randomized algorithms against an oblivious adversary, the situation improves dramatically:
Theorem 3.2 (Randomized MTS Bounds).
(a) On the uniform metric with \(n\) states, there exists a randomized algorithm with competitive ratio \(O(\log n)\) (Borodin et al., 1992).
(b) On any metric with \(n\) states, the competitive ratio is \(O(\log^2 n \cdot \log \log n)\) (Fiat and Mendel, 2003; Bubeck, Cohen, Lee, Lee, 2019).
(c) The randomized competitive ratio on the uniform metric is \(\Omega(\log n)\).
The \(O(\log n)\) upper bound on the uniform metric uses a beautiful connection to online learning: the algorithm maintains a probability distribution over states and updates it using a multiplicative-weights scheme, balancing movement cost against processing cost.
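The multiplicative-weights connection can be sketched in its fractional form: the algorithm’s “state” is a probability distribution over the \(n\) points, each task exponentially down-weights expensive states, and moving probability mass costs its total-variation distance. This is a rough illustration of the scheme, not the tuned algorithm from the literature; the learning rate and the test instance are my choices.

```python
import math

def mw_mts_uniform(tasks, n, eta=0.5):
    """Fractional multiplicative-weights sketch for MTS on the uniform metric.

    Maintains a distribution p over states; each task multiplies the weight of
    state i by exp(-eta * c_i).  Movement is charged as the total-variation
    distance between consecutive distributions; processing as <p, c>."""
    w = [1.0] * n
    p = [1.0 / n] * n
    cost = 0.0
    for c in tasks:
        w = [wi * math.exp(-eta * ci) for wi, ci in zip(w, c)]
        s = sum(w)
        q = [wi / s for wi in w]
        cost += 0.5 * sum(abs(qi - pi) for qi, pi in zip(q, p))  # movement
        cost += sum(qi * ci for qi, ci in zip(q, c))             # processing
        p = q
    return cost

# Tasks that alternately penalize each of two states.
tasks = [(1, 0), (0, 1)] * 50
print(mw_mts_uniform(tasks, 2))
```

The distribution hedges between the two states instead of chasing the currently cheap one, which is exactly the exploration-exploitation balance described above.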
3.4 Unfair Metrical Task Systems
An important variant is the unfair MTS, where different states have different “weights” affecting movement costs asymmetrically. This generalization captures problems like weighted paging, where different pages have different fetching costs.
Definition (Weighted Paging / Unfair MTS). In weighted paging, each page \(p\) has a weight \(w_p\), and loading page \(p\) into the cache costs \(w_p\). The objective is to minimize the total loading cost.
Bansal, Buchbinder, and Naor (2012) gave an \(O(\log k)\)-competitive randomized algorithm for weighted paging using the primal-dual method, matching the lower bound up to constant factors.
3.5 Connections to Online Learning
One of the deepest insights in modern online algorithms is the connection between MTS and online learning. The randomized algorithm for MTS on the uniform metric essentially runs a variant of the multiplicative weights update method, maintaining a probability distribution over states and updating it to balance exploration and exploitation.
This connection goes both ways: results from online learning (such as regret bounds for the experts problem) can be used to design online algorithms, and conversely, competitive analysis provides a framework for understanding the guarantees of learning algorithms in adversarial settings.
4. Online Matching and Allocation
Online matching problems are among the most impactful applications of online algorithm design, arising naturally in internet advertising, resource allocation, and market design. The foundational result of Karp, Vazirani, and Vazirani (1990) on online bipartite matching launched a prolific research area connecting competitive analysis with combinatorial optimization.
4.1 Online Bipartite Matching
Definition (Online Bipartite Matching). We are given a bipartite graph \(G = (U, V, E)\) where \(U\) is known in advance (the offline vertices) and the vertices of \(V\) arrive online. When vertex \(v \in V\) arrives, its edges to \(U\) are revealed, and the algorithm must immediately and irrevocably either match \(v\) to an unmatched neighbor in \(U\) or leave \(v\) unmatched. The goal is to maximize the size of the matching.
Example. Consider an online advertising platform. The set \(U\) represents ad slots (or advertisers), and each arriving user \(v \in V\) is compatible with certain advertisers. When a user arrives, the platform must immediately assign them to an advertiser (or not), and the assignment is irrevocable.
A natural greedy algorithm — match each arriving vertex to an arbitrary unmatched neighbor — achieves a competitive ratio of \(1/2\). This is tight for the greedy algorithm: consider a graph where \(U = \{a, b\}\), \(V = \{v_1, v_2\}\), \(v_1\) is connected to both \(a\) and \(b\), and \(v_2\) is connected only to \(a\). If greedy matches \(v_1\) to \(a\), then \(v_2\) is unmatched, giving a matching of size 1 versus the optimal size 2.
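The tight example for greedy is two lines of code. A minimal sketch (function name is mine; "first unmatched neighbor" stands in for "arbitrary"):

```python
def greedy_match(offline, arrivals):
    """Greedy online matching: match each arrival to its first unmatched neighbor."""
    matched = {}                      # offline vertex -> online partner
    for v, nbrs in arrivals:
        for u in nbrs:
            if u not in matched:
                matched[u] = v
                break                 # irrevocable decision
    return matched

# The tight example: v1 is adjacent to both a and b, v2 only to a.
m = greedy_match(["a", "b"], [("v1", ["a", "b"]), ("v2", ["a"])])
print(len(m))  # greedy (matching v1 to a) gets 1; the optimum (v1-b, v2-a) gets 2
```

Reversing the arrival order rescues greedy on this instance, which is exactly why the adversary controls the order.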
4.2 The Karp-Vazirani-Vazirani Algorithm
The celebrated result of Karp, Vazirani, and Vazirani (KVV) shows that randomization yields a substantial improvement.
Definition (RANKING Algorithm, KVV 1990). The RANKING algorithm works as follows:
(1) At the start, choose a random permutation \(\pi\) of the offline vertices \(U\).
(2) When an online vertex \(v\) arrives, match it to the unmatched neighbor \(u \in U\) with the highest rank \(\pi(u)\) (i.e., the one appearing earliest in the permutation). If all neighbors are matched, leave \(v\) unmatched.
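RANKING is a few lines on top of greedy. The sketch below (names, seeds, and the test instance are mine) runs it on a hard-style instance in which each arrival’s neighborhood is a suffix of a hidden random ordering of \(U\), so the optimum is a perfect matching:

```python
import random

def ranking(n_offline, arrivals, seed=0):
    """RANKING: fix a random permutation of the offline side; match each
    arriving vertex to its unmatched neighbor of highest rank."""
    rng = random.Random(seed)
    rank = list(range(n_offline))
    rng.shuffle(rank)                # rank[u] = random priority of vertex u
    matched = set()
    size = 0
    for nbrs in arrivals:            # each arrival is its list of offline neighbors
        free = [u for u in nbrs if u not in matched]
        if free:
            matched.add(min(free, key=lambda u: rank[u]))
            size += 1
    return size

n = 50
rng = random.Random(42)
perm = list(range(n))
rng.shuffle(perm)
arrivals = [perm[i:] for i in range(n)]      # v_i's neighbors: a hidden suffix
avg = sum(ranking(n, arrivals, seed=s) for s in range(200)) / 200
print(avg / n)   # at least roughly 1 - 1/e, since OPT is a perfect matching
```

The empirical ratio sits at or above \(1 - 1/e\); the guarantee of Theorem 4.1 below says this holds for every input in expectation.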
Theorem 4.1 (Karp-Vazirani-Vazirani, 1990). The RANKING algorithm achieves a competitive ratio of \(1 - 1/e \approx 0.632\) for online bipartite matching. This is optimal among all randomized algorithms against an oblivious adversary.
Proof. We present a streamlined version of the proof, following the approach of Devanur, Jain, and Kleinberg (2013).
Upper bound (\(1 - 1/e\) is achievable). Assume without loss of generality that the optimal offline matching is a perfect matching (we can add dummy vertices if needed), so \(|U| = |V| = n\) and the maximum matching has size \(n\).
Instead of a uniformly random permutation, it is equivalent (and analytically cleaner) to assign each offline vertex \(u\) an independent random threshold \(y_u \sim \text{Uniform}[0,1]\). When online vertex \(v\) arrives, it is matched to the eligible neighbor \(u\) with the smallest threshold \(y_u\) (equivalently, highest rank).
For each offline vertex \(u\) and its match \(v^* = \mu^*(u)\) in the optimal matching, we want to bound \(\Pr[u \text{ is matched by RANKING}]\). The key claim is:
Claim: For each offline vertex \(u\) matched in the optimum, \(\Pr[u \text{ is matched by RANKING}] \geq 1 - 1/e\).
To prove this claim, consider the modified process where online vertices arrive in the order \(v_1, \ldots, v_n\). We analyze the probability that the optimally-matched pair \((u, v^*)\) is disrupted, meaning \(v^*\) gets matched to some other offline vertex before getting a chance to match with \(u\), or \(u\) gets matched to some other online vertex before \(v^*\) arrives.
Define \(G_u\) as the event that \(u\) is matched by RANKING. We use a factor-revealing LP approach: the probability \(1 - \Pr[G_u]\) is maximized (over all possible graph structures and arrival orders) by a specific extremal graph, and on this extremal graph, the probability is exactly \(1/e\).
The factor-revealing LP, after simplification, shows that the worst case arises when the “competition” for \(u\) forms a chain of potential displacements, each reducing the surviving probability by a factor related to the thresholds. Summing over all possible displacement chains, the probability that \(u\) remains unmatched is at most
\[ \sum_{j=0}^{n-1} \frac{(-1)^j}{j!} \to \frac{1}{e} \text{ as } n \to \infty. \]Therefore \(\Pr[u \text{ is matched}] \geq 1 - 1/e\), and summing over all optimally-matched vertices:
\[ \mathbb{E}[|\text{RANKING}|] \geq (1 - 1/e) \cdot n = (1 - 1/e) \cdot |\text{OPT}|. \]
Lower bound (\(1 - 1/e\) is optimal). We show that no randomized algorithm achieves a ratio better than \(1 - 1/e\). By Yao’s minimax principle, it suffices to exhibit a distribution over inputs on which every deterministic algorithm achieves expected ratio at most \(1 - 1/e\).
Consider the graph \(G_n\) on \(2n\) vertices: \(U = \{u_1, \ldots, u_n\}\), \(V = \{v_1, \ldots, v_n\}\), where \(v_i\) is adjacent to \(u_1, \ldots, u_i\). The online vertices arrive in order \(v_1, v_2, \ldots, v_n\).
For any deterministic algorithm, consider how the matching is forced. Since \(v_1\) is connected only to \(u_1\), it must be matched to \(u_1\) (leaving it unmatched is never better). Then \(v_2\) is connected to \(u_1\) (already matched) and \(u_2\), so it matches to \(u_2\), and so on.
However, the adversary can truncate the sequence: consider the distribution over inputs in which the sequence stops after \(v_i\), with probability \(1/n\) for each \(i\). On this distribution, no deterministic algorithm does better than greedily matching each \(v_i\) to its lowest-indexed available neighbor, and the expected ratio of this greedy strategy converges to \(1 - 1/e\) as \(n \to \infty\). \(\blacksquare\)
4.3 The AdWords Problem
The AdWords problem, introduced by Mehta, Saberi, Vazirani, and Vazirani (2007), extends online matching to the setting of internet advertising with budgets.
Definition (AdWords Problem). There are \(m\) advertisers, each with a daily budget \(B_j\). Queries arrive online; each query \(i\) has a bid \(b_{ij}\) from each advertiser \(j\). When query \(i\) arrives, it must be assigned to some advertiser \(j\) (or left unassigned), and advertiser \(j\) pays \(\min(b_{ij}, \text{remaining budget of } j)\). The goal is to maximize total revenue.
Theorem 4.2 (Mehta-Saberi-Vazirani-Vazirani, 2007). In the small bids regime (where each bid is small relative to the budgets), the BALANCE algorithm achieves competitive ratio \(1 - 1/e\), and this is optimal.
In the equal-bids case, the BALANCE algorithm assigns each query to an interested advertiser with the highest remaining budget. For general small bids, Mehta et al. scale each bid by a function of the fraction of the advertiser's budget already spent and assign the query to the advertiser with the highest scaled bid; the analysis uses a carefully designed potential (tradeoff) function. Without the small-bids assumption the problem becomes harder: the natural greedy algorithm is \(1/2\)-competitive, and improving on this in general is substantially more difficult.
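The equal-bids case of BALANCE is simple enough to sketch directly. The representation below (unit bids, a set of interested advertisers per query) is an assumption made for this illustration:

```python
def balance(budgets, queries):
    """BALANCE for AdWords with unit bids (sketch).

    budgets: dict advertiser -> integer budget.
    queries: list of sets; each set holds the advertisers bidding
             (with bid 1) on that query.
    Returns total revenue.
    """
    remaining = dict(budgets)
    revenue = 0
    for interested in queries:
        # Assign the query to the interested advertiser with the
        # largest remaining budget, if anyone still has budget left.
        candidates = [j for j in interested if remaining[j] >= 1]
        if candidates:
            j = max(candidates, key=lambda j: remaining[j])
            remaining[j] -= 1
            revenue += 1
    return revenue
```

By spreading load across budgets, BALANCE avoids exhausting any single advertiser prematurely, which is exactly what the potential-function analysis quantifies.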
4.4 Online Weighted Matching
When edges have weights and we wish to maximize total weight, the problem becomes significantly harder:
Theorem 4.3. For online weighted bipartite matching:
(a) No deterministic algorithm achieves competitive ratio better than \(1/2\) (even with unit weights, as seen above).
(b) The free disposal model (where the algorithm can rematch at any time, paying only for the final matching) admits a \(1 - 1/e\) competitive algorithm.
(c) In the vertex-weighted model (weights on offline vertices only), the \(1 - 1/e\) ratio is achievable by an extension of RANKING due to Aggarwal, Goel, Karande, and Mehta (2011).
4.5 The Online Primal-Dual Framework
One of the most powerful and general techniques in online algorithm design is the online primal-dual method, developed systematically by Buchbinder and Naor. This framework gives a principled way to design and analyze online algorithms for a wide class of covering and packing problems.
The canonical setting is an online covering LP
\[ \min \sum_{j=1}^{n} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{n} a_{ij} x_j \geq 1 \text{ for all } i, \quad x_j \geq 0, \]where constraints arrive online: constraint \(i\) is revealed at time \(i\), along with the coefficients \(a_{ij}\) for all \(j\). The algorithm must increase variables \(x_j\) (they can never decrease) to satisfy the new constraint. The corresponding dual is the packing LP
\[ \max \sum_{i=1}^{m} y_i \quad \text{s.t.} \quad \sum_{i=1}^{m} a_{ij} y_i \leq c_j \text{ for all } j, \quad y_i \geq 0. \]The online primal-dual method works as follows:
The Buchbinder-Naor Online Primal-Dual Method.
When a new covering constraint \(\sum_j a_{ij} x_j \geq 1\) arrives:
(1) While the constraint is not yet satisfied:
- Increase the dual variable \(y_i\) continuously.
- Simultaneously, for each primal variable \(x_j\) involved in the constraint, increase \(x_j\) at rate \(\frac{\partial x_j}{\partial y_i} = \frac{a_{ij}}{c_j}\left(x_j + \frac{1}{d}\right)\), where \(d\) is the number of variables in the constraint (a multiplicative update, with an additive seed term that lets \(x_j\) grow away from zero).
(2) Stop when the primal constraint is satisfied.
Theorem 4.4 (Buchbinder-Naor). The online primal-dual method maintains an \(O(\log m)\)-competitive fractional solution for online set cover, where \(m\) is the number of sets; online randomized rounding then yields an \(O(\log m \cdot \log n)\)-competitive integral solution, where \(n\) is the size of the ground set. A logarithmic loss is unavoidable even offline: set cover cannot be approximated within \((1 - o(1)) \ln n\) unless P \(=\) NP.
The power of this framework lies in its generality: by formulating an online problem as an LP, designing primal and dual update rules, and bounding the ratio of primal to dual objectives, one obtains both the algorithm and its analysis simultaneously.
4.6 Online Set Cover
Definition (Online Set Cover). Given a universe \(\mathcal{U}\) of \(n\) elements and a collection \(\mathcal{S}\) of subsets of \(\mathcal{U}\) with costs, elements arrive online. When element \(e\) arrives, it must be covered by at least one set in the current solution. The algorithm may add sets to its solution but cannot remove them.
The online primal-dual method applied to set cover proceeds as follows. When element \(e\) arrives:
- Set the dual variable \(y_e \leftarrow 0\).
- While \(e\) is not covered, increase \(y_e\) continuously and for each set \(S\) containing \(e\), increase the fractional coverage \(x_S\) exponentially: \(x_S \leftarrow x_S \cdot (1 + 1/c_S) + 1/(n \cdot c_S)\).
- When some \(x_S \geq 1\), add set \(S\) to the solution.
This achieves an \(O(\log m \cdot \log n)\) competitive ratio, where \(m\) is the number of sets; nearly matching lower bounds are known for online set cover.
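The fractional part of the scheme above can be sketched as follows. This is a discretized version of the continuous process, with illustrative data structures (dicts of sets and costs); it assumes every arriving element belongs to at least one set:

```python
def online_fractional_set_cover(sets, costs, elements):
    """Online fractional set cover via multiplicative updates
    (a sketch in the spirit of Buchbinder-Naor, discretized).

    sets:     dict name -> frozenset of elements it covers.
    costs:    dict name -> positive cost.
    elements: arrival order of elements to cover.
    Returns the fractional solution x (dict name -> value).
    """
    n = len({e for S in sets.values() for e in S})
    x = {S: 0.0 for S in sets}
    for e in elements:
        containing = [S for S in sets if e in sets[S]]
        # Raise x_S multiplicatively until e is fractionally covered.
        while sum(x[S] for S in containing) < 1:
            for S in containing:
                x[S] = x[S] * (1 + 1 / costs[S]) + 1 / (n * costs[S])
    return x
```

An integral solution would additionally round: for instance, buy set \(S\) once \(x_S\) exceeds a random threshold, which is where the second logarithmic factor enters.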
4.7 Applications of the Primal-Dual Framework
The online primal-dual framework has been applied to a remarkable variety of problems:
Example (Applications).
(a) Online bipartite matching: the LP relaxation of matching and its dual yield the \(1-1/e\) guarantee for RANKING (and its vertex-weighted extension) when the analysis is cast in the primal-dual framework.
(b) Online routing and load balancing: The problem of routing requests in a network to minimize congestion can be formulated as an online packing LP, and the primal-dual method yields \(O(\log n)\)-competitive algorithms.
(c) Online scheduling: Machine scheduling with online job arrivals, where each job has processing times and deadlines, can be attacked via online primal-dual methods.
(d) Online facility location: Deciding where to open facilities and how to assign arriving clients to minimize total opening and connection cost.
5. Online Learning and the Experts Problem
The field of online learning, while originating from a different intellectual tradition than competitive analysis, shares deep connections with online algorithms. The experts problem and the multiplicative weights update (MWU) method provide a versatile toolkit that has found applications throughout computer science, from algorithm design to game theory to machine learning.
5.1 The Experts Problem
Definition (Prediction with Expert Advice). A decision maker must make predictions over \(T\) rounds. There are \(n\) experts, each of whom gives advice each round. In each round \(t = 1, 2, \ldots, T\):
(1) The decision maker selects an expert \(i_t\) (or, in the randomized version, a distribution \(p_t\) over experts).
(2) The adversary then reveals the losses \(\ell_t(i) \in [0,1]\) of all experts, and the decision maker incurs the loss of its choice, \(\ell_t(i_t)\) (or the expected loss \(\langle p_t, \ell_t \rangle\)).
The goal is to minimize regret, the difference between the decision maker’s total loss and that of the best single expert in hindsight:
\[ \text{Regret}_T = \sum_{t=1}^{T} \ell_t(i_t) - \min_{i \in [n]} \sum_{t=1}^{T} \ell_t(i). \]An algorithm achieves sublinear regret if \(\text{Regret}_T = o(T)\), so that the average per-round regret vanishes as \(T \to \infty\). This is a much weaker benchmark than competitive analysis (we compare to the best fixed expert, not the best adaptive strategy), but it applies to arbitrary loss sequences with no stochastic assumptions.
5.2 The Halving Algorithm
Before introducing the general multiplicative weights method, let us consider a simpler setting: the realizable case, where one expert is always correct.
Definition (Halving Algorithm). Maintain a set \(S\) of “surviving” experts (initially all \(n\) experts). In each round, predict with the majority vote of the surviving experts. After observing the outcome, eliminate all experts that were wrong.
Theorem 5.1. If one of the \(n\) experts makes zero mistakes, the Halving algorithm makes at most \(\log_2 n\) mistakes.
Proof. Each time the Halving algorithm makes a mistake, at least half of the surviving experts are eliminated (since the majority was wrong). Thus after \(m\) mistakes, at most \(n/2^m\) experts survive. Since the perfect expert is never eliminated, we need \(n/2^m \geq 1\), giving \(m \leq \log_2 n\). \(\blacksquare\)
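The Halving algorithm is short enough to state in code. The binary-prediction representation below is an assumption made for this sketch:

```python
def halving(predictions, outcomes):
    """Halving algorithm for the realizable experts setting (sketch).

    predictions[t][i]: expert i's binary prediction in round t.
    outcomes[t]:       the true binary outcome of round t.
    Assumes some expert is always correct. Returns the number of
    mistakes made by the algorithm.
    """
    n = len(predictions[0])
    alive = set(range(n))
    mistakes = 0
    for preds, y in zip(predictions, outcomes):
        # Predict with the majority vote of surviving experts.
        votes = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes >= len(alive) else 0
        if guess != y:
            mistakes += 1
        # Eliminate every surviving expert that was wrong.
        alive = {i for i in alive if preds[i] == y}
    return mistakes
```

By the theorem, `mistakes` never exceeds \(\log_2 n\) when the realizability assumption holds.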
5.3 The Weighted Majority Algorithm
The Halving algorithm breaks down when no expert is perfect: eventually every expert errs and the surviving set becomes empty. Instead of eliminating experts outright, we downweight experts that make mistakes. This is the key idea behind the Weighted Majority algorithm of Littlestone and Warmuth (1994).
Definition (Weighted Majority Algorithm). Fix a parameter \(\varepsilon \in (0, 1)\). Initialize weights \(w_1(i) = 1\) for all experts \(i\). In each round \(t\):
(1) Predict with the weighted majority: choose the prediction made by experts of higher total weight.
(2) For each expert \(i\) that made a mistake, update: \(w_{t+1}(i) = (1 - \varepsilon) \cdot w_t(i)\).
Theorem 5.2 (Littlestone-Warmuth). The number of mistakes \(M\) made by Weighted Majority satisfies
\[ M \leq 2(1+\varepsilon)\, m^* + \frac{2 \ln n}{\varepsilon}, \]where \(m^*\) is the number of mistakes of the best expert. For the randomized variant (predict with an expert drawn with probability proportional to its weight), setting \(\varepsilon\) optimally brings the expected number of mistakes down to \(m^* + O(\sqrt{m^* \ln n} + \ln n)\).
5.4 The Hedge Algorithm (Multiplicative Weights Update)
The Hedge algorithm, due to Freund and Schapire (1997), generalizes Weighted Majority to the full experts problem with arbitrary losses. It is one of the most important algorithms in all of theoretical computer science.
Definition (Hedge / Multiplicative Weights Update). Fix a learning rate \(\eta > 0\). Initialize weights \(w_1(i) = 1\) for all experts \(i \in [n]\). In each round \(t = 1, 2, \ldots, T\):
(1) Compute the probability distribution \(p_t(i) = w_t(i) / \Phi_t\), where \(\Phi_t = \sum_{i=1}^{n} w_t(i)\) is the total weight.
(2) Choose expert \(I_t \sim p_t\) (or equivalently, incur expected loss \(\langle p_t, \ell_t \rangle\)).
(3) Update the weights multiplicatively:
\[ w_{t+1}(i) = w_t(i) \cdot (1 - \eta)^{\ell_t(i)}. \]Equivalently, one can use the exponential-weights update \(w_{t+1}(i) = w_t(i) \cdot e^{-\eta \ell_t(i)}\); since \((1-\eta)^x = e^{x \ln(1-\eta)}\) and \(\ln(1-\eta) = -\eta - O(\eta^2)\), the two versions have the same asymptotic guarantees.
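The exponential-weights variant of Hedge takes only a few lines. The interface (a stream of loss vectors, returning both the algorithm's and the best expert's cumulative loss) is a choice made for this sketch:

```python
import math

def hedge(loss_stream, n, eta):
    """Hedge / multiplicative weights, exponential-weights variant (sketch).

    loss_stream: iterable of loss vectors, each a length-n list in [0,1].
    Returns (algorithm's total expected loss, best expert's total loss).
    """
    w = [1.0] * n
    alg_loss = 0.0
    cum = [0.0] * n
    for losses in loss_stream:
        phi = sum(w)
        p = [wi / phi for wi in w]            # p_t(i) = w_t(i) / Phi_t
        alg_loss += sum(pi * li for pi, li in zip(p, losses))
        for i, li in enumerate(losses):
            w[i] *= math.exp(-eta * li)       # exponential update
            cum[i] += li
    return alg_loss, min(cum)
```

With \(\eta = \sqrt{\ln n / T}\), the gap between the two returned values stays within the \(2\sqrt{T \ln n}\) regret bound on any loss sequence.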
Theorem 5.3 (Freund-Schapire). For any sequence of loss vectors \(\ell_t \in [0,1]^n\), Hedge with parameter \(\eta \in (0, 1/2]\) guarantees
\[ \sum_{t=1}^{T} \langle p_t, \ell_t \rangle \leq \min_{i \in [n]} \sum_{t=1}^{T} \ell_t(i) + \frac{\ln n}{\eta} + \eta T. \]Setting \(\eta = \sqrt{\ln n / T}\) gives expected regret at most \(2\sqrt{T \ln n}\).
Proof. We use a potential function argument. Define the potential \(\Phi_t = \sum_{i=1}^{n} w_t(i)\), so \(\Phi_1 = n\).
\[ \Phi_{t+1} = \sum_{i=1}^{n} w_{t+1}(i) = \sum_{i=1}^{n} w_t(i)(1 - \eta)^{\ell_t(i)}. \]Since \(t \mapsto (1-\eta)^t\) is convex, for \(x \in [0,1]\) we have
\[ (1-\eta)^x \leq (1-x) + x(1-\eta) = 1 - \eta x. \]Applying this with \(x = \ell_t(i)\) and using \(p_t(i) = w_t(i)/\Phi_t\):
\[ \Phi_{t+1} \leq \sum_{i=1}^{n} w_t(i)\left(1 - \eta \ell_t(i)\right) = \Phi_t \left(1 - \eta \langle p_t, \ell_t \rangle\right). \]Iterating over all \(T\) rounds and using \(1 + x \leq e^x\):
\[ \Phi_{T+1} \leq \Phi_1 \prod_{t=1}^{T} \left(1 - \eta \langle p_t, \ell_t \rangle\right) \leq n \cdot \exp\left(-\eta \sum_{t=1}^{T} \langle p_t, \ell_t \rangle\right). \]On the other hand, letting \(i^*\) be the best expert and \(L^* = \sum_{t=1}^T \ell_t(i^*)\) its total loss,
\[ \Phi_{T+1} = \sum_{i=1}^{n} w_{T+1}(i) \geq w_{T+1}(i^*) = (1-\eta)^{L^*}. \]Combining the two bounds and taking logarithms:
\[ L^* \ln(1-\eta) \leq \ln n - \eta \sum_{t=1}^{T} \langle p_t, \ell_t \rangle. \]Using \(\ln(1-\eta) \geq -\eta - \eta^2\) for \(\eta \in (0, 1/2]\) and rearranging:
\[ \sum_{t=1}^{T} \langle p_t, \ell_t \rangle \leq (1+\eta) L^* + \frac{\ln n}{\eta} \leq L^* + \eta T + \frac{\ln n}{\eta}, \]using \(L^* \leq T\) in the last step. With \(\eta = \sqrt{\ln n / T}\), the two error terms balance:
\[ \text{Regret}_T \leq 2\sqrt{T \ln n}. \]This is the desired bound. \(\blacksquare\)
Remark. The regret bound \(O(\sqrt{T \ln n})\) is tight: for any algorithm, there exists an adversarial loss sequence achieving regret \(\Omega(\sqrt{T \ln n})\). The proof uses a probabilistic argument where losses are drawn i.i.d. from appropriate distributions.
5.5 The Minimax Theorem via MWU
One of the most beautiful applications of the multiplicative weights method is a constructive proof of von Neumann’s minimax theorem. This connection reveals the deep relationship between online learning and game theory.
Theorem 5.4 (von Neumann Minimax). For every matrix \(A \in \mathbb{R}^{m \times n}\),
\[ \max_{p \in \Delta_m} \min_{q \in \Delta_n} p^\top A q = \min_{q \in \Delta_n} \max_{p \in \Delta_m} p^\top A q, \]where \(\Delta_k\) denotes the probability simplex in \(\mathbb{R}^k\).
Proof sketch. Consider the game where the row player uses MWU over \(T\) rounds, and in each round \(t\), the column player best-responds to the row player’s mixed strategy \(p_t\). Let \(q_t = e_{j_t}\) be the column player’s pure best response, so \(\langle p_t, A q_t \rangle = \min_j \langle p_t, A e_j \rangle\).
Since the row player runs Hedge over its \(m\) pure strategies (treating \(1 - (Aq_t)_i\) as losses), the regret bound gives
\[ \frac{1}{T}\sum_{t=1}^{T} p_t^\top A q_t \geq \max_{i} \frac{1}{T}\sum_{t=1}^{T} (A q_t)_i - 2\sqrt{\frac{\ln m}{T}} = \max_p\, p^\top A \bar{q} - 2\sqrt{\frac{\ln m}{T}} \geq \min_q \max_p\, p^\top A q - 2\sqrt{\frac{\ln m}{T}}, \]where \(\bar{q} = \frac{1}{T}\sum_t q_t\). On the other hand, since each \(q_t\) is a best response to \(p_t\), and the minimum of an average is at least the average of the minima,
\[ \frac{1}{T}\sum_{t=1}^{T} p_t^\top A q_t = \frac{1}{T}\sum_{t=1}^{T} \min_q\, p_t^\top A q \leq \min_q\, \bar{p}^\top A q \leq \max_p \min_q\, p^\top A q, \]where \(\bar{p} = \frac{1}{T}\sum_t p_t\). Combining the two chains and taking \(T \to \infty\) yields \(\min_q \max_p p^\top Aq \leq \max_p \min_q p^\top Aq\); since the reverse inequality is immediate, the minimax theorem follows. \(\blacksquare\)
This proof is not merely an alternative to the classical proof (via fixed-point theorems or LP duality) — it is algorithmic, providing an efficient procedure for computing approximate minimax strategies.
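The constructive argument translates directly into a solver for approximate minimax strategies. The function name and the assumption that payoffs lie in \([0,1]\) are choices made for this sketch:

```python
import math

def approx_minimax(A, T=2000):
    """Approximate the value of the zero-sum game p^T A q by running
    Hedge for the row player against column best responses (sketch).

    A: payoff matrix (row player maximizes), entries in [0, 1].
    Returns (value estimate, average row strategy, average column strategy).
    """
    m, n = len(A), len(A[0])
    eta = math.sqrt(math.log(m) / T)
    w = [1.0] * m
    p_bar = [0.0] * m
    q_bar = [0.0] * n
    for _ in range(T):
        phi = sum(w)
        p = [wi / phi for wi in w]
        # Column player best-responds: minimize p^T A e_j over j.
        j = min(range(n), key=lambda j: sum(p[i] * A[i][j] for i in range(m)))
        for i in range(m):
            p_bar[i] += p[i] / T
            # Row player's loss is 1 - payoff (Hedge minimizes loss).
            w[i] *= math.exp(-eta * (1 - A[i][j]))
        q_bar[j] += 1 / T
    value = sum(p_bar[i] * A[i][j] * q_bar[j]
                for i in range(m) for j in range(n))
    return value, p_bar, q_bar
```

By the regret bound, the averaged strategies \(\bar{p}, \bar{q}\) form an \(O(\sqrt{\ln m / T})\)-approximate equilibrium.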
5.6 Online Convex Optimization
The experts problem generalizes naturally to the powerful framework of online convex optimization (OCO), introduced by Zinkevich (2003).
Definition (Online Convex Optimization). Let \(\mathcal{K} \subseteq \mathbb{R}^d\) be a compact convex set. In each round \(t = 1, \ldots, T\):
(1) The algorithm selects a point \(x_t \in \mathcal{K}\).
(2) A convex loss function \(f_t : \mathcal{K} \to \mathbb{R}\) is revealed.
(3) The algorithm incurs loss \(f_t(x_t)\).
The regret is \(\sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x)\).
The experts problem is the special case where \(\mathcal{K} = \Delta_n\) (the simplex) and \(f_t(p) = \langle p, \ell_t \rangle\) (linear losses).
5.7 Follow the Regularized Leader
A unified framework for online convex optimization is Follow the Regularized Leader (FTRL): fix a convex regularizer \(R : \mathcal{K} \to \mathbb{R}\) and a learning rate \(\eta > 0\), and in each round play
\[ x_t = \arg\min_{x \in \mathcal{K}} \left\{ \sum_{s=1}^{t-1} f_s(x) + \frac{1}{\eta} R(x) \right\}. \]
Different choices of regularizer recover different algorithms:
Example (FTRL Instantiations).
(a) \(R(x) = \frac{1}{2}\|x\|_2^2\) (Euclidean regularizer) on \(\mathcal{K} = \mathbb{R}^d\) gives Online Gradient Descent (Zinkevich, 2003), achieving regret \(O(\sqrt{T})\).
(b) \(R(p) = \sum_i p_i \ln p_i\) (negative entropy) on \(\mathcal{K} = \Delta_n\) gives the Hedge / MWU algorithm, achieving regret \(O(\sqrt{T \ln n})\).
(c) \(R(x) = \sum_i \left( x_i \ln x_i + (1 - x_i) \ln(1 - x_i) \right)\) (negative binary entropy) on \(\mathcal{K} = [0,1]^n\) gives algorithms for online combinatorial optimization.
The general FTRL regret bound, for \(G\)-Lipschitz convex losses and a \(\lambda\)-strongly-convex regularizer, is
\[ \text{Regret}_T \leq \frac{R_{\max}}{\eta} + \frac{\eta G^2 T}{\lambda}, \]where \(R_{\max} = \max_{x \in \mathcal{K}} R(x) - \min_{x \in \mathcal{K}} R(x)\). Optimizing \(\eta\) gives \(\text{Regret}_T = O(G\sqrt{R_{\max} T / \lambda})\).
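Instantiation (a), online gradient descent, is short enough to sketch. The callback interface (one gradient oracle per round) and the unconstrained domain are assumptions made for this illustration:

```python
def ogd(grad_fns, d, eta):
    """Online gradient descent: FTRL with the Euclidean regularizer
    (sketch, unconstrained; grad_fns[t] maps x_t to grad f_t(x_t))."""
    x = [0.0] * d
    history = []
    for grad in grad_fns:
        history.append(list(x))         # the point played this round
        g = grad(x)
        # x_{t+1} = x_t - eta * grad f_t(x_t)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return history
```

For a constrained set \(\mathcal{K}\), each update would additionally project back onto \(\mathcal{K}\), which preserves the \(O(\sqrt{T})\) regret guarantee.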
5.8 Bandit Problems
In many applications, the decision maker observes only the loss of the chosen action, not the losses of all experts. This is the bandit setting, a much harder problem.
Definition (Multi-Armed Bandit). In the adversarial multi-armed bandit problem with \(n\) arms, in each round \(t\):
(1) The algorithm selects an arm \(I_t \in [n]\).
(2) The algorithm observes only the loss \(\ell_t(I_t)\) of the selected arm, not the losses of other arms.
Theorem 5.6 (Auer-Cesa-Bianchi-Freund-Schapire, 2002). The EXP3 algorithm achieves expected regret \(O(\sqrt{Tn \ln n})\) in the adversarial bandit setting. This is tight up to logarithmic factors (the lower bound is \(\Omega(\sqrt{Tn})\)).
The EXP3 algorithm uses importance-weighted loss estimates to construct unbiased estimators of the full loss vector, then feeds these estimates into the Hedge algorithm. The \(\sqrt{n}\) factor (compared to \(\sqrt{\ln n}\) for full information) is the price of bandit feedback.
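The importance-weighting idea can be sketched as follows. This bare version omits the uniform-exploration mixing term that practical EXP3 variants use to control the variance of the estimates; the `pull` oracle interface is an assumption made for this sketch:

```python
import math
import random

def exp3(pull, n, T, eta):
    """EXP3 for adversarial bandits (bare sketch, no exploration mixing).

    pull(t, i) returns the loss in [0,1] of arm i at round t; the
    algorithm only ever observes the arm it actually pulls.
    Returns the total loss incurred.
    """
    w = [1.0] * n
    total = 0.0
    for t in range(T):
        phi = sum(w)
        p = [wi / phi for wi in w]
        i = random.choices(range(n), weights=p)[0]
        loss = pull(t, i)
        total += loss
        # Importance-weighted estimate: unbiased for the full loss
        # vector, since arm i is seen with probability p[i].
        est = loss / p[i]
        w[i] *= math.exp(-eta * est)
    return total
```

Feeding these estimates into Hedge is exactly what costs the extra \(\sqrt{n}\) factor in the regret: the estimates have variance roughly \(1/p_t(i)\).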
6. Secretary Problems and Prophet Inequalities
The secretary problem and prophet inequalities represent a beautiful intersection of optimal stopping theory, probability, and online algorithm design. These problems and their generalizations have found surprising applications in mechanism design and auction theory.
6.1 The Classical Secretary Problem
Definition (Secretary Problem). An employer interviews \(n\) candidates sequentially for a single position. The candidates arrive in a uniformly random order (equivalently, their relative rankings are drawn from a random permutation). After each interview, the employer observes the relative ranking of the current candidate among all candidates seen so far, and must immediately and irrevocably accept or reject the candidate. The goal is to maximize the probability of hiring the best overall candidate.
This problem has a rich history, dating back to the 1960s, with contributions by Lindley, Dynkin, and others. It is also known as the “optimal stopping problem,” the “dowry problem,” or the “best-choice problem.”
Theorem 6.1 (Optimal Secretary Algorithm). The following algorithm maximizes the probability of selecting the best candidate:
(1) Reject the first \(r - 1\) candidates (the “observation phase”), where \(r = \lfloor n/e \rfloor\).
(2) After the observation phase, accept the first candidate who is better than all previously seen candidates.
This algorithm selects the best candidate with probability at least \(1/e \approx 0.368\) as \(n \to \infty\). No algorithm achieves a higher probability.
Proof. Consider a general threshold strategy with parameter \(r\): reject the first \(r - 1\) candidates, then accept the first candidate who is the best so far.
Let \(P_r(n)\) be the probability that this strategy selects the best candidate among \(n\) candidates. The strategy succeeds if and only if the best candidate is at position \(i \geq r\) and the best among the first \(i - 1\) candidates is among the first \(r - 1\) candidates (otherwise, a suboptimal candidate would have been selected earlier in positions \(r, \ldots, i-1\)).
The best candidate is equally likely to be at any position \(i \in \{1, \ldots, n\}\), each with probability \(1/n\). Conditioned on the best candidate being at position \(i\), the strategy succeeds if and only if the best of the first \(i-1\) candidates is among positions \(1, \ldots, r-1\). The probability of this event is \((r-1)/(i-1)\) (by the symmetry of the random ordering, the best of the first \(i-1\) candidates is equally likely to be in any of the \(i-1\) positions).
\[ P_r(n) = \sum_{i=r}^{n} \frac{1}{n} \cdot \frac{r-1}{i-1} = \frac{r-1}{n} \sum_{i=r}^{n} \frac{1}{i-1} = \frac{r-1}{n} \sum_{j=r-1}^{n-1} \frac{1}{j}. \]Writing \(x = r/n\) and approximating the harmonic sum by an integral as \(n \to \infty\),
\[ P_r(n) \to x \int_x^1 \frac{1}{t}\,dt = -x \ln x =: f(x). \]Maximizing over the threshold fraction \(x\):
\[ f'(x) = -\ln x - 1 = 0 \implies x = 1/e, \qquad f(1/e) = -\frac{1}{e} \ln \frac{1}{e} = \frac{1}{e}. \]Thus the optimal strategy rejects the first \(\lfloor n/e \rfloor\) candidates and then accepts the first candidate who is the best so far, achieving success probability \(1/e\) as \(n \to \infty\).
Optimality: To see that no strategy can do better, observe that any online algorithm can be described by a sequence of threshold rules (possibly adaptive). The probability of selecting the best candidate from any strategy is at most the probability achieved by the best threshold strategy, since the optimal stopping rule is determined by the relative ranks (which form a sufficient statistic). By the theory of optimal stopping (or by a direct backward-induction argument), the threshold strategy with \(r = \lfloor n/e \rfloor\) is optimal among all stopping rules. \(\blacksquare\)
Remark. The \(1/e\) success probability may seem low, but it is remarkable given that the algorithm must make irrevocable decisions with no knowledge of the future, and the comparison is to the absolute best candidate. The algorithm also has the elegant property of being memoryless after the observation phase: it simply accepts the first candidate who beats the current maximum.
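The \(1/e\) rule is easy to check empirically. A minimal Monte Carlo sketch (representing candidates by their ranks, with rank \(n\) being the best):

```python
import math
import random

def secretary_trial(n):
    """One run of the 1/e rule on a random permutation of ranks 1..n.
    Returns True iff the best candidate (rank n) is hired."""
    candidates = list(range(1, n + 1))
    random.shuffle(candidates)
    r = max(1, math.floor(n / math.e))
    # Observation phase: reject the first r - 1 candidates.
    best_seen = max(candidates[:r - 1], default=0)
    # Selection phase: hire the first record-setter, if any.
    for c in candidates[r - 1:]:
        if c > best_seen:
            return c == n
    return False
```

Averaging `secretary_trial` over many runs gives a success frequency close to \(1/e \approx 0.368\) (slightly above it for finite \(n\)).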
6.2 Multiple-Choice Secretary
A natural generalization allows selecting \(k\) candidates:
Definition (Multiple-Choice Secretary). An employer wishes to hire exactly \(k\) out of \(n\) candidates arriving in random order. The goal is to maximize the expected total value of the selected candidates (or, in another variant, the probability of selecting the top \(k\)).
Theorem 6.2 (Kleinberg, 2005). There exists an algorithm for the multiple-choice secretary problem that selects a set of \(k\) candidates with expected total value at least \((1 - O(\sqrt{1/k})) \cdot \text{OPT}\). As \(k \to \infty\), the competitive ratio approaches 1.
This shows that the secretary problem becomes easier as we are allowed to select more candidates, which is intuitive: more selections provide more “room for error.”
6.3 Matroid Secretary
A far-reaching generalization constrains the selected set to be independent in a given matroid:
Definition (Matroid Secretary Problem, Babaioff-Immorlica-Kleinberg, 2007). Given a matroid \((E, \mathcal{I})\) with weighted elements, the elements arrive in random order. When element \(e\) arrives, its weight is revealed and the algorithm must irrevocably accept or reject \(e\), subject to the constraint that the accepted set must be independent in the matroid. The goal is to maximize the total weight of accepted elements.
Theorem 6.3 (Matroid Secretary Conjecture). Babaioff, Immorlica, and Kleinberg (2007) conjectured that there exists a constant-competitive algorithm for the matroid secretary problem on any matroid. The conjecture remains open. The best known result for general matroids is \(O(\log \log \text{rank})\)-competitive (Feldman, Svensson, and Zenklusen, 2018). For specific matroids:
(a) Graphic matroids: constant-competitive (Korula-Pál, 2009).
(b) Transversal matroids: constant-competitive (Dimitrov-Plaxton, 2012).
(c) Partition matroids: constant-competitive (trivially, by running the classical secretary algorithm on each partition class).
6.4 Prophet Inequalities
Prophet inequalities address a related but different question: how well can a gambler who sees values one at a time do compared to a prophet who sees all values in advance?
Definition (Prophet Inequality Setting). There are \(n\) independent random variables \(X_1, X_2, \ldots, X_n\) (not necessarily identically distributed). The values are revealed one at a time. After seeing \(X_i\), the gambler must irrevocably decide whether to accept \(X_i\) and stop, or reject \(X_i\) and continue. The gambler knows the distributions but not the realizations. The goal is to maximize the expected accepted value, compared to \(\mathbb{E}[\max_i X_i]\) (the prophet’s expected value).
Theorem 6.4 (Samuel-Cahn, 1984). There exists a threshold strategy that achieves expected value at least \(\frac{1}{2} \mathbb{E}[\max_i X_i]\). Specifically, set the threshold \(\tau\) such that \(\Pr[\max_i X_i \geq \tau] = 1/2\), and accept the first \(X_i \geq \tau\). This achieves a gambler-to-prophet ratio of at least \(1/2\), and this is tight.
Proof. The gambler accepts some value exactly when \(\max_i X_i \geq \tau\), which happens with probability \(1/2\), and any accepted value is at least \(\tau\). Moreover, the gambler reaches variable \(i\) (has not yet stopped) with probability at least \(\Pr[\max_{j < i} X_j < \tau] \geq 1/2\), independently of \(X_i\). Therefore
\[ \mathbb{E}[\text{ALG}] \geq \tau \cdot \Pr[\max_i X_i \geq \tau] + \sum_i \Pr[\text{gambler reaches } i] \cdot \mathbb{E}[(X_i - \tau)^+] \geq \frac{\tau}{2} + \frac{1}{2} \sum_i \mathbb{E}[(X_i - \tau)^+] \geq \frac{1}{2}\, \mathbb{E}[\max_i X_i], \]where the last inequality uses the decomposition \(\mathbb{E}[\max_i X_i] \leq \tau + \mathbb{E}[(\max_i X_i - \tau)^+]\) and the fact that \(\sum_i \mathbb{E}[(X_i - \tau)^+] \geq \mathbb{E}[(\max_i X_i - \tau)^+]\). \(\blacksquare\)
Remark. The power of the prophet inequality is that the threshold \(\tau\) depends only on the distributions, not on the particular realization. This makes it a “posted-price” mechanism, with profound implications for auction design.
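The median-threshold rule is easy to test by simulation. The sketch below estimates both the threshold and the gambler's value by Monte Carlo; the sampler-based interface is an assumption made for this illustration:

```python
import random

def median_threshold_gambler(dists, trials=10000):
    """Estimate the gambler's value under the Samuel-Cahn median rule.

    dists: list of zero-argument samplers, one per random variable.
    Returns (estimated gambler value, estimated prophet value).
    """
    # Estimate tau as the empirical median of max_i X_i.
    maxima = sorted(max(d() for d in dists) for _ in range(trials))
    tau = maxima[trials // 2]             # Pr[max >= tau] ~ 1/2
    total = 0.0
    for _ in range(trials):
        xs = [d() for d in dists]
        # Accept the first value that clears the threshold, if any.
        accepted = next((x for x in xs if x >= tau), 0.0)
        total += accepted
    prophet = sum(maxima) / trials
    return total / trials, prophet
```

On typical instances the gambler's estimate comfortably exceeds half the prophet's, matching the guarantee (which is tight only for carefully constructed two-point distributions).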
6.5 Prophet Inequalities with Multiple Selections
When the gambler can select up to \(k\) values, the ratio improves:
Theorem 6.5 (Hajiaghayi-Kleinberg-Sandholm, 2007; Alaei, 2014). When the gambler can select up to \(k\) values:
(a) The ratio of gambler to prophet (where the prophet also selects \(k\) values) is \(1 - O(1/\sqrt{k})\), approaching 1 as \(k \to \infty\).
(b) Even for \(k = 1\), if the random variables are i.i.d., the ratio improves to \(\approx 0.745\), the Hill-Kertz constant (Hill-Kertz, 1982; matching algorithm by Correa et al., 2017).
6.6 Connections to Mechanism Design
The prophet inequality has a striking interpretation in mechanism design. Consider a seller with one item and \(n\) buyers with independent private values \(X_1, \ldots, X_n\). A posted-price mechanism sets a price \(\tau\) and offers the item to buyers sequentially; the first buyer with value \(X_i \geq \tau\) gets the item at price \(\tau\).
Theorem 6.6 (Chawla-Hartline-Malec-Sivan, 2010; Kleinberg-Weinberg, 2012). Posted-price mechanisms extract at least half the revenue of the optimal (Myerson) auction, and this connection extends to multi-item settings through matroid prophet inequalities.
This connection between prophet inequalities and mechanism design has been one of the most fruitful developments in algorithmic game theory over the past decade.
6.7 Order Selection: Secretary vs. Prophet
It is instructive to compare the secretary and prophet settings:
Remark (Secretary vs. Prophet).
In the secretary model, values are adversarial but the arrival order is random. The algorithm must select the best element, and the benchmark is the maximum value. The optimal ratio is \(1/e\).
In the prophet model, values are random (from known distributions) but the arrival order is adversarial (worst-case). The algorithm selects one value, and the benchmark is the expected maximum. The optimal ratio is \(1/2\).
These are incomparable: the secretary model has randomness in the order, while the prophet model has randomness in the values. Both capture different facets of online decision-making under uncertainty.
7. Beyond Worst-Case Analysis
Competitive analysis, for all its elegance, sometimes paints an overly pessimistic picture. Many online algorithms that perform well in practice have poor worst-case competitive ratios, while algorithms with optimal competitive ratios may behave poorly on typical inputs. This chapter surveys the rich landscape of alternatives to worst-case competitive analysis, following the program articulated by Roughgarden in Beyond Worst-Case Analysis.
7.1 Limitations of Competitive Analysis
Several well-known examples illustrate the limitations of classical competitive analysis:
Example (Paging). For paging with cache size \(k\), LRU is \(k\)-competitive, which is tight for deterministic algorithms. But in practice, LRU performs much better than a factor-\(k\) overhead on typical workloads, which exhibit temporal locality. The competitive analysis is driven by pathological adversarial sequences with no locality of reference (for instance, cycling through \(k+1\) distinct pages), which are unrealistic in most applications.
Example (List Accessing). For the list accessing problem (maintaining a linked list to minimize access cost), the Move-to-Front rule is 2-competitive. But on inputs with locality of reference, Move-to-Front is near-optimal. The worst-case analysis fails to capture this.
These examples motivate frameworks that go “beyond the worst case,” providing performance guarantees that are more nuanced and practically relevant.
7.2 Smoothed Analysis
Smoothed analysis, introduced by Spielman and Teng (2004) in their celebrated work explaining the practical efficiency of the simplex method, provides a middle ground between worst-case and average-case analysis.
Definition (Smoothed Analysis). In the smoothed analysis framework, the input is generated in two stages:
(1) An adversary selects a worst-case input.
(2) The input is then perturbed randomly (e.g., each coordinate is perturbed by independent Gaussian noise of variance \(\sigma^2\)).
The smoothed complexity is the expected cost (over the random perturbation) of the algorithm, maximized over the adversary’s choice. An algorithm has polynomial smoothed complexity if its smoothed complexity is polynomial in the input size and \(1/\sigma\).
Theorem 7.1 (Spielman-Teng, 2004). The simplex method (with a suitable pivot rule) has polynomial smoothed complexity, despite having exponential worst-case complexity. Specifically, the expected number of pivot steps is polynomial in \(n\) (the number of variables), \(m\) (the number of constraints), and \(1/\sigma\).
This result explains a decades-old mystery: why does the simplex method work so well in practice despite Klee-Minty-type worst-case examples? The answer is that the pathological inputs are fragile — even small perturbations destroy the exponential behavior.
Smoothed analysis has been applied to many other problems, including:
Example (Applications of Smoothed Analysis).
(a) \(k\)-means clustering: Lloyd’s algorithm for \(k\)-means has exponential worst-case running time but polynomial smoothed complexity (Arthur-Manthey-Röglin, 2011).
(b) Local search algorithms: Many local search heuristics have polynomial smoothed complexity despite exponential worst-case behavior.
(c) Online algorithms: The smoothed competitive ratio of LRU for paging can be shown to be much better than \(k\) when the input has small perturbations.
7.3 Resource Augmentation
Resource augmentation, introduced by Kalyanasundaram and Pruhs (2000), takes a different approach: instead of weakening the adversary, we strengthen the online algorithm by giving it more resources than the offline optimum.
Definition (Resource Augmentation). In a speed augmentation model, the online algorithm’s machines are \(s\) times faster than the offline optimum’s machines (for scheduling problems). In a machine augmentation model, the online algorithm has more machines. An algorithm is \((1+\varepsilon)\)-speed \(c\)-competitive if, with speed \(1+\varepsilon\), its cost is at most \(c\) times the optimal offline cost with speed 1.
Theorem 7.2 (Kalyanasundaram-Pruhs, 2000). For online scheduling to minimize weighted flow time:
(a) No \(O(1)\)-competitive algorithm exists without speed augmentation (for arbitrary weights).
(b) The nonclairvoyant algorithm SETF (Shortest Elapsed Time First) is \((1+\varepsilon)\)-speed \((1 + 1/\varepsilon)\)-competitive for minimizing total flow time.
(c) With \((1+\varepsilon)\)-speed augmentation, various online algorithms become \(O(1)\)-competitive for weighted flow time.
For paging, resource augmentation takes the form of cache augmentation:
Theorem 7.3 (Sleator-Tarjan, 1985). LRU with a cache of size \(k\) is \(k/(k-h+1)\)-competitive against an offline algorithm with cache size \(h \leq k\). In particular, with a cache twice the size of the offline optimum (i.e., \(k = 2h\)), LRU is 2-competitive.
This result shows that a modest increase in cache size dramatically reduces the competitive ratio from \(k\) to a small constant, which better reflects LRU's strong practical performance.
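The bound is easy to check empirically. Below is a minimal sketch (not from the text; function names are ours) that simulates LRU with cache size \(k = 2h\) against Belady's offline optimum with cache size \(h\):

```python
import random
from collections import OrderedDict

def lru_faults(requests, k):
    """Simulate LRU with cache size k; return the number of page faults."""
    cache = OrderedDict()
    faults = 0
    for p in requests:
        if p in cache:
            cache.move_to_end(p)           # p becomes most recently used
        else:
            faults += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict least recently used
            cache[p] = True
    return faults

def opt_faults(requests, h):
    """Belady's offline optimum with cache size h: on a fault, evict the
    cached page whose next request is farthest in the future."""
    cache, faults = set(), 0
    for i, p in enumerate(requests):
        if p in cache:
            continue
        faults += 1
        if len(cache) >= h:
            def next_use(q):
                for j in range(i + 1, len(requests)):
                    if requests[j] == q:
                        return j
                return float('inf')        # never requested again
            cache.remove(max(cache, key=next_use))
        cache.add(p)
    return faults

random.seed(0)
h, k = 4, 8                                # LRU gets twice the cache of OPT
sigma = [random.randrange(12) for _ in range(500)]
lru, opt = lru_faults(sigma, k), opt_faults(sigma, h)
# Sleator-Tarjan: lru <= (k / (k - h + 1)) * opt + O(k); here the ratio is < 2
```

On typical random sequences the observed ratio is well below the worst-case guarantee, consistent with the theorem.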
7.4 The Advice Model and Online Algorithms with Predictions
A more recent line of research studies online algorithms that receive advice — additional information about the future, possibly imperfect.
Definition (Advice Complexity). In the advice model (Böckenhauer-Komm-Královič-Rossmanith, 2009; Emek-Fraigniaud-Korman-Rosén, 2011), the online algorithm receives \(b\) bits of advice (written on a tape by an oracle that sees the entire input). The goal is to understand how the competitive ratio improves as a function of \(b\).
Theorem 7.4 (Advice for Paging). For paging with cache size \(k\):
(a) With \(O(n \log k)\) bits of advice (where \(n\) is the sequence length), an optimal solution can be computed.
(b) With \(O(n)\) bits of advice, the competitive ratio can be reduced to \(O(1)\).
(c) With \(o(n)\) bits of advice, no \(o(k)\)-competitive algorithm exists.
7.5 Learning-Augmented Algorithms
The advice model has evolved into the rapidly growing field of learning-augmented algorithms (also called “algorithms with predictions”), initiated by Lykouris and Vassilvitskii (2018) and Purohit, Svitkina, and Kumar (2018). The key idea is that the algorithm receives machine-learned predictions about the input, which are helpful when accurate but may be arbitrarily wrong.
Definition (Learning-Augmented Algorithm). A learning-augmented online algorithm receives a prediction \(\hat{y}\) about some aspect of the input (e.g., the total number of skiing days in ski rental, or the set of pages to be requested in paging). The algorithm’s performance is evaluated along three axes:
(1) Consistency: The competitive ratio when the prediction is correct (\(\hat{y} = y\)).
(2) Robustness: The competitive ratio in the worst case, regardless of prediction quality.
(3) Smoothness: How gracefully the competitive ratio degrades as the prediction error \(\|\hat{y} - y\|\) increases.
Theorem 7.5 (Learning-Augmented Ski Rental, Purohit et al., 2018). For ski rental with a prediction \(\hat{n}\) of the total number of skiing days:
There exists an algorithm parameterized by \(\lambda \in (0,1)\) that achieves:
(a) Consistency: \(1 + \lambda\) when \(\hat{n} = n\) (approaching 1 as \(\lambda \to 0\)).
(b) Robustness: \(1 + 1/\lambda\) in the worst case (approaching 2 as \(\lambda \to 1\), recovering the classical deterministic bound).
(c) Smoothness: the competitive ratio degrades gracefully with the prediction error \(|\hat{n} - n|\).
The algorithm works by hedging between trusting the prediction and using the classical deterministic strategy:
Example (Learning-Augmented Ski Rental Algorithm). Given prediction \(\hat{n}\) and parameter \(\lambda \in (0,1)\):
If \(\hat{n} < B\) (prediction says renting is optimal): rent until day \(\lceil B/\lambda \rceil\), buying only if still skiing then; the purchase is delayed past the classical day \(B\) because the prediction indicates a short season.
If \(\hat{n} \geq B\) (prediction says buying is optimal): buy on day \(\lceil \lambda B \rceil\) (earlier than the classical day-\(B\) strategy).
The parameter \(\lambda\) controls the trade-off between trusting the prediction (small \(\lambda\) for high consistency) and hedging against errors (large \(\lambda\) for high robustness).
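A minimal Python sketch of this hedging strategy (following the Purohit et al. deterministic variant; rounding choices and function names are ours):

```python
import math

def ski_rental_cost(n, B, buy_day):
    """Total cost if the season lasts n days and we plan to buy (price B,
    rent at 1/day) at the start of day buy_day."""
    if n < buy_day:
        return n                    # season ends before we ever buy
    return (buy_day - 1) + B        # rent buy_day - 1 days, then buy

def predicted_buy_day(n_hat, B, lam):
    """Deterministic hedging strategy with trust parameter lam in (0, 1]."""
    if n_hat >= B:                  # prediction says buying is optimal
        return math.ceil(lam * B)   # buy early
    return math.ceil(B / lam)       # prediction says renting: buy very late

def opt_cost(n, B):
    return min(n, B)                # offline optimum: rent throughout or buy upfront

B, lam = 100, 0.5
# correct prediction (n_hat = n = 150): cost 149 vs. optimum 100
alg = ski_rental_cost(150, B, predicted_buy_day(150, B, lam))
```

With a correct prediction the ratio here is \(1.49 \leq 1 + \lambda\); even with an arbitrarily wrong prediction the ratio never exceeds \(1 + 1/\lambda = 3\).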
Theorem 7.6 (Learning-Augmented Paging, Lykouris-Vassilvitskii, 2018). For paging, given predictions of which page will be evicted next by the optimal offline algorithm, there exists a learning-augmented algorithm achieving competitive ratio \(O(1 + \min(\sqrt{\eta}, k))\), where \(\eta\) is the number of incorrect predictions. This is \(O(1)\) when predictions are good and \(O(k)\) (matching the deterministic bound) when predictions are adversarial.
7.6 Instance Optimality
Another approach to moving beyond worst-case analysis is instance optimality, which asks: can an algorithm perform as well as the best possible algorithm on every single input, not just in the worst case?
Instance optimality is a very strong requirement. For example, for list accessing, the Move-to-Front algorithm is instance-optimal within the class of online algorithms that can only move accessed elements forward (Angelopoulos, 2007).
7.7 Semi-Random Models
Semi-random models provide another way to interpolate between worst-case and average-case analysis:
Definition (Semi-Random Model). In a semi-random model, the input is generated by a combination of adversarial and random choices. For example:
(a) Random-order model: The adversary chooses the set of requests, but they arrive in a uniformly random order (as in the secretary problem).
(b) Stochastic with adversarial perturbations: Requests are drawn from a known distribution, but an adversary can modify a bounded fraction.
(c) i.i.d. model: Requests are drawn i.i.d. from a known (or unknown) distribution.
Theorem 7.7 (Random-Order Model Results).
(a) For online bipartite matching in the random-order model (i.e., the offline vertices are adversarial but the arrival order is random), there exist algorithms achieving competitive ratio \(1 - 1/e + \varepsilon\) for a constant \(\varepsilon > 0\), beating the worst-case bound (Mahdian-Yan, 2011).
(b) For the secretary problem, the random-order assumption is essential: under an adversarial arrival order, no algorithm can select the best candidate with constant probability.
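The classical \(1/e\)-rule for the secretary problem is easy to test empirically in the random-order model. A small simulation (illustrative, not from the text):

```python
import math
import random

def secretary_success(n, trials=20000, seed=1):
    """Empirical success probability of the 1/e-rule: observe the first
    n/e candidates without hiring, then hire the first candidate better
    than everyone seen so far. Succeeds iff the best (rank 0) is hired."""
    rng = random.Random(seed)
    cutoff = int(n / math.e)
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))      # 0 = best candidate
        rng.shuffle(ranks)          # random-order model: uniform arrival order
        best_seen = min(ranks[:cutoff], default=n)
        hired = next((r for r in ranks[cutoff:] if r < best_seen), None)
        wins += (hired == 0)
    return wins / trials

p = secretary_success(50)           # close to 1/e ~ 0.368
```

The empirical success probability concentrates near \(1/e\), matching the classical analysis.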
7.8 The Price of Anarchy in Online Settings
The price of anarchy (PoA) measures the inefficiency of equilibria in games, and it connects to online algorithms through the lens of selfish agents making online decisions.
Theorem 7.8 (Smoothness Framework, Roughgarden, 2009). Call a cost-minimization game \((\lambda, \mu)\)-smooth if \(\sum_i C_i(s_i^*, s_{-i}) \leq \lambda \cdot C(s^*) + \mu \cdot C(s)\) for all strategy profiles \(s\) and \(s^*\). If a game is \((\lambda, \mu)\)-smooth, then \(\text{PoA} \leq \lambda / (1 - \mu)\).
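The bound follows from a two-line calculation: if \(s\) is a pure Nash equilibrium and \(s^*\) an optimal profile, then

```latex
\begin{align*}
C(s) = \sum_i C_i(s)
  &\le \sum_i C_i(s_i^*, s_{-i}) && \text{(no player gains by deviating to } s_i^* \text{)} \\
  &\le \lambda \, C(s^*) + \mu \, C(s) && \text{(smoothness)},
\end{align*}
```

and rearranging gives \(C(s) \le \frac{\lambda}{1-\mu}\, C(s^*)\).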
This connects online algorithms to game theory: when selfish agents use no-regret learning algorithms (like MWU) to make decisions, the resulting dynamics converge to outcomes whose social cost is bounded by the price of anarchy.
7.9 Connections to Competitive Equilibria
A final thread connects online algorithms to economic equilibrium theory. In many resource allocation settings, the competitive ratio of an online algorithm can be interpreted through the lens of market equilibria:
Example (Online Resource Allocation and Equilibria). In the AdWords problem, the BALANCE algorithm can be interpreted as computing approximate competitive equilibrium prices. As queries arrive, the algorithm implicitly sets “prices” for advertiser budgets, and the allocation is approximately market-clearing.
More generally, Devanur and Hayes (2009) showed that online algorithms for packing LPs can be interpreted as primal-dual algorithms that approximately compute competitive equilibrium prices in a corresponding market.
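As a toy illustration (a simplification we introduce: unit bids and unit-price queries; names are ours), BALANCE assigns each arriving query to the interested advertiser with the most remaining budget:

```python
def balance(budgets, queries):
    """Unit-bid BALANCE for AdWords: each query earns 1 from whichever
    interested advertiser it is assigned to, capped by that advertiser's budget.
    budgets: dict advertiser -> integer budget.
    queries: iterable of sets of interested advertisers."""
    remaining = dict(budgets)
    revenue = 0
    for interested in queries:
        live = [a for a in interested if remaining[a] >= 1]
        if not live:
            continue                                 # no interested advertiser has budget left
        a = max(live, key=lambda x: remaining[x])    # most remaining budget
        remaining[a] -= 1
        revenue += 1
    return revenue
```

Favoring the advertiser with the most remaining budget keeps budgets balanced, which is exactly the behavior that the implicit "prices" in the equilibrium interpretation encode.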
Remark (The Broader Landscape). The field of beyond worst-case analysis is rapidly evolving. Recent directions include:
(a) Data-driven algorithm design: Using samples from the input distribution to tune algorithm parameters (Gupta-Roughgarden, 2017).
(b) Online algorithms with recourse: Allowing the algorithm to change past decisions at a cost, bridging online and offline optimization.
(c) Competitive analysis with predictions: The learning-augmented framework has been extended to dozens of problems including scheduling, caching, ski rental, set cover, matchings, and more.
(d) Pareto-optimal algorithms: Designing algorithms that simultaneously achieve the best possible consistency-robustness trade-off curves.
These directions promise a richer understanding of algorithm performance that bridges theory and practice.
7.10 Summary: A Taxonomy of Models
We conclude with a summary of the models studied in this chapter and their relationships:
Remark (Taxonomy of Beyond Worst-Case Models).
Weakening the adversary:
- Smoothed analysis: adversary picks input, then it is randomly perturbed.
- Random-order model: adversary picks content, random permutation determines order.
- Semi-random models: adversary operates within a stochastic framework.
- Distributional analysis: inputs drawn from a specific distribution.
Strengthening the algorithm:
- Resource augmentation: algorithm has more resources than the benchmark.
- Advice/predictions: algorithm receives (possibly imperfect) extra information.
- Recourse: algorithm can revise past decisions at a cost.
Changing the benchmark:
- Instance optimality: compare to the best online algorithm, not offline optimum.
- Regret minimization: compare to the best fixed action, not the best adaptive strategy.
- Price of anarchy: compare equilibrium outcomes to optimal outcomes.
Each model captures different aspects of what makes certain inputs “easy” or “hard” and provides tools for explaining and predicting practical algorithm performance.
Epilogue: The State of the Art
The theory of online algorithms and competitive analysis has evolved dramatically since its inception in the 1980s. From the foundational work of Sleator and Tarjan on paging and list accessing, through the celebrated results of Karp, Vazirani, and Vazirani on online matching, to the modern synthesis of online algorithms with machine learning predictions, the field has maintained a remarkable balance between mathematical depth and practical relevance.
Several major open problems continue to drive research:
The \(k\)-server conjecture: Is there a \(k\)-competitive deterministic algorithm for the \(k\)-server problem on every metric space? The gap between \(k\) (lower bound) and \(2k-1\) (Work Function Algorithm) has resisted closure for over three decades. The randomized \(k\)-server conjecture (conjecturing \(O(\log k)\)-competitive randomized algorithms) is also open.
The matroid secretary conjecture: Is there a constant-competitive algorithm for the matroid secretary problem on every matroid?
Tight bounds for online matching variants: What are the optimal competitive ratios for edge-weighted online matching, stochastic matching, and matching with reweighing?
Optimal learning-augmented algorithms: For many problems, the optimal consistency-robustness trade-off curve is not yet known. Can we characterize the Pareto frontier for natural online problems?
Connections to deep learning: Can the learning-augmented framework be scaled to real-world problems where predictions come from neural networks? What are the right theoretical models?
The interplay between competitive analysis, online learning, game theory, and mechanism design continues to produce beautiful mathematics and influential algorithms. The framework of online computation provides a unified lens through which to understand decision-making under uncertainty — one of the most fundamental challenges in computer science and beyond.