CS 454/654 - Distributed Systems
Khuzaima Daudjee
Estimated reading time: 54 minutes
These notes synthesize Prof. Daudjee’s lecture slides and detailed course notes into textbook-style prose. The primary textbooks referenced are Tanenbaum & Van Steen’s Distributed Systems: Principles and Paradigms and Coulouris, Dollimore & Kindberg’s Distributed Systems: Concepts and Design. Where appropriate, enriching context and real-world examples have been added to bring the material to life.
Lesson 1: Introduction to Distributed Systems
What Is a Distributed System?
A distributed system is a collection of autonomous computers that appears to its users as a single coherent system. The individual machines communicate and coordinate through message passing over a network, yet the goal is to present a unified, transparent interface — as if the user were interacting with a single powerful computer. This definition, due to Tanenbaum and Van Steen, captures two essential tensions that run through the entire field: the machines are independent (they can fail, lag, or disagree), yet they must cooperate to maintain an illusion of unity.
Why not just use one powerful machine? The answer is multifaceted. First, scalability: a single machine has physical limits on CPU speed, memory, and I/O bandwidth, whereas a distributed system can grow by adding more nodes. Second, reliability: if one machine in a cluster fails, others can take over its workload, whereas a single machine is a catastrophic single point of failure. Third, geography: users are spread around the globe, and having servers close to them reduces latency. Fourth, economics: a cluster of commodity machines often costs far less than a single supercomputer of equivalent aggregate power. Google, Amazon, and Facebook all run on clusters of thousands of ordinary machines rather than a single monolithic server.
But distribution comes at a cost. The network is unreliable — messages can be delayed, lost, or delivered out of order. Clocks on different machines drift apart. Machines can crash at any moment, and there is no shared memory through which processes can coordinate. These challenges give rise to the fundamental problems studied in this course: naming, coordination, consistency, replication, transactions, and fault tolerance.
Goals and Design Principles
A well-designed distributed system should achieve several goals simultaneously.
Transparency is the concealment of distribution from the user. The system should hide the fact that resources are spread across multiple machines. There are several forms of transparency:
- Access transparency: Resources are accessed the same way regardless of whether they are local or remote. A programmer should not need different code to read a local file versus a remote one.
- Location transparency: The user does not need to know the physical location of a resource. A URL, for example, does not reveal which data center serves the content.
- Migration transparency: Resources can move without the user noticing. If a database migrates from one server to another, clients should be unaffected.
- Replication transparency: Multiple copies of a resource can exist without the user’s knowledge. When you read a web page, you may be served by any of several replica servers, but you see the same content.
- Concurrency transparency: Multiple users can access the same resource simultaneously without interfering with each other.
- Failure transparency: The system masks failures of individual components. If one server crashes, another takes over seamlessly.
Complete transparency is not always achievable or even desirable. Hiding a network failure from a user who is actively waiting for a response would be dishonest — better to show a timeout message. Similarly, a mobile user might want to know which replica they are accessing for data freshness reasons. The art of distributed systems design lies in choosing the right level of transparency for each situation.
Openness means the system’s interfaces are well-defined and publicly available, allowing new components to be added or existing ones to be replaced. Open systems use standard protocols (like HTTP, TCP/IP, or gRPC) so that heterogeneous machines — running different operating systems, written in different languages — can interoperate.
Scalability has three dimensions. A system is scalable in size if it can handle a growing number of users and resources. It is scalable geographically if it works well even when users and resources are spread across continents. And it is scalable administratively if it can span multiple independent organizations without requiring a single administrative authority.
Achieving scalability often requires techniques like decentralization (avoiding single points of coordination), caching (keeping frequently accessed data close to the user), and replication (maintaining multiple copies of data for both performance and fault tolerance). However, these techniques introduce their own challenges, particularly around consistency — ensuring that all copies of a piece of data agree.
Middleware
Users interact with distributed systems through applications, but the complexity of distribution is typically managed by a layer of software called middleware. Middleware sits between the application and the operating system / network stack, providing higher-level abstractions that make it easier to build distributed applications.
Examples of middleware include Remote Procedure Call (RPC) systems, message-oriented middleware (like Apache Kafka or RabbitMQ), distributed object systems (like CORBA or Java RMI), and distributed file systems (like NFS or HDFS). Middleware provides the “glue” that lets heterogeneous machines work together as a coherent system.
The middleware layer is responsible for implementing many of the transparency goals described above. For example, RPC middleware hides the fact that a procedure call crosses a network boundary, making remote calls look like local ones to the programmer.
Types of Distributed Systems
Distributed systems come in several flavors:
- Distributed computing systems focus on high-performance computation. Cluster computing uses homogeneous machines in a local network (think Hadoop clusters), while grid computing federates heterogeneous resources across organizations (think SETI@home or the CERN computing grid).
- Distributed information systems focus on data management. Enterprise systems often integrate multiple databases, transaction processing monitors, and application servers using middleware like Enterprise JavaBeans or .NET.
- Distributed pervasive systems embed computation into everyday objects — sensor networks, smart home devices, and the Internet of Things.
Lesson 2: Distributed Architectures
Architectural Styles
The architecture of a distributed system describes how its components are organized and how they interact. There are several fundamental architectural styles, each with different trade-offs.
In a layered architecture, components are organized into layers, where each layer provides services to the layer above and consumes services from the layer below. The OSI model and the TCP/IP stack are classic examples. Layering promotes modularity and separation of concerns, but strict layering can introduce performance overhead because every request must pass through multiple layers.
In an object-based architecture, each component is an object with a well-defined interface. Objects communicate through method invocations, which may be local or remote (via RPC or RMI). This style naturally supports encapsulation and reuse, but object references can be difficult to manage in a distributed setting.
In an event-based architecture, components communicate by publishing and subscribing to events. A component that detects an interesting condition publishes an event, and all components that have subscribed to that event type receive a notification. This style promotes loose coupling — publishers don’t need to know who their subscribers are — but makes it harder to reason about system behavior because the flow of control is implicit.
In a data-centered architecture, components communicate through a shared data space. A component writes data to the shared space, and other components read it. Tuple spaces (as in Linda) and shared distributed file systems are examples. This style is simple and elegant but can become a bottleneck if the shared space is not efficiently distributed.
Client-Server and Multi-Tier
The most prevalent distributed architecture is client-server. A server is a process that provides a service (such as file storage, web serving, or database access), and a client is a process that requests and uses that service. The interaction follows a request-reply pattern: the client sends a request, the server processes it, and sends back a response.
In practice, systems are often organized into multiple tiers. A two-tier architecture has clients talking directly to servers — for example, a desktop application connecting to a database. A three-tier architecture inserts an application server between the client and the database: the client handles presentation, the application server handles business logic, and the database handles data storage. Modern web applications typically use this pattern, with a browser as the client, a web application server in the middle, and a database at the back.
The division of functionality between client and server is a key design decision. A thin client performs minimal processing (just rendering the UI), while a fat client handles significant computation locally. The trend in web applications has shifted back and forth: early web apps used thin clients (server-rendered HTML), then fat clients became popular (single-page applications with heavy JavaScript), and now the pendulum is swinging back with server-side rendering frameworks.
Peer-to-Peer
In a peer-to-peer (P2P) architecture, there is no distinguished server. Every node in the system acts as both client and server, contributing resources (storage, bandwidth, computation) while also consuming them. BitTorrent, Bitcoin, and early file-sharing systems like Napster and Gnutella are prominent examples.
P2P systems can be structured or unstructured. In a structured P2P system, nodes are organized according to a specific topology — often a distributed hash table (DHT) — that allows efficient lookup of data items. Chord, Pastry, and CAN are examples of structured P2P systems. In an unstructured P2P system, nodes form an ad hoc network, and finding a specific data item requires flooding the network with queries or using random walks.
The key advantage of P2P is its extreme scalability and resilience: there is no single point of failure, and the system’s capacity grows with the number of participants. The key disadvantage is complexity: maintaining the overlay network, ensuring data availability, and preventing free-riding all require sophisticated algorithms.
Hybrid Architectures
Many real-world systems combine elements of client-server and P2P. Edge-server architectures place servers at the “edge” of the network, close to users, to reduce latency. Content delivery networks (CDNs) like Akamai and Cloudflare are examples: a central origin server holds the authoritative copy of content, but edge servers cache copies close to users around the world. Collaborative distributed systems like BitTorrent use a central tracker to help peers find each other, but the actual data transfer happens peer-to-peer.
Lesson 3: Networks
Network Fundamentals
Distributed systems rely on computer networks to communicate. Understanding the basics of networking is essential for understanding the performance characteristics and failure modes of distributed systems.
A computer network is an interconnected collection of autonomous computers that can exchange information. Networks are characterized by their scale (LAN, MAN, WAN), their topology (how nodes are connected), and their switching mechanism (how data moves through the network).
The OSI and TCP/IP Models
The OSI (Open Systems Interconnection) model organizes network functionality into seven layers:
| Layer | Name | Function |
|---|---|---|
| 7 | Application | End-user services (HTTP, FTP, DNS) |
| 6 | Presentation | Data encoding and encryption |
| 5 | Session | Dialog control and synchronization |
| 4 | Transport | End-to-end message transfer (TCP, UDP) |
| 3 | Network | Routing across multiple links (IP) |
| 2 | Data Link | Reliable transfer over a single link (Ethernet) |
| 1 | Physical | Bit transmission over physical medium |
In practice, the TCP/IP model is more widely used. It collapses the top three OSI layers into a single Application layer and the bottom two into a single Link layer, yielding four layers: Link, Internet (IP), Transport (TCP/UDP), and Application.
The layers interact through encapsulation: when an application sends data, each layer adds its own header (and sometimes trailer) to the data before passing it down. The receiving side strips off headers in reverse order. This encapsulation ensures that each layer can evolve independently.
Circuit Switching vs. Packet Switching
There are two fundamental approaches to moving data through a network.
Circuit switching establishes a dedicated communication path between sender and receiver for the duration of the conversation. The traditional telephone network uses circuit switching: when you make a call, a circuit is set up through a series of switches, and that circuit is reserved exclusively for your call. Resources (bandwidth, switch capacity) are dedicated and not shared. This guarantees consistent performance but wastes capacity when the circuit is idle (which, in a typical phone conversation, is about half the time due to silences).
Packet switching divides data into discrete packets that are routed independently through the network. Each packet carries enough information (source and destination addresses) to be routed, and different packets from the same message may take different paths. The Internet uses packet switching. Packets share network resources — when demand exceeds capacity, packets queue up in router buffers, leading to congestion and increased delays. If buffers overflow, packets are dropped and must be retransmitted.
Packet switching is more efficient than circuit switching for bursty data traffic (like web browsing) because resources are used only when there is actual data to send. Circuit switching is better for continuous streams (like traditional voice) where consistent latency matters.
Sources of Delay
Packets experience four types of delay on their journey through a packet-switched network:
- Nodal processing delay: The time a router takes to examine a packet’s header, check for bit errors, and determine the output link. Typically microseconds.
- Queuing delay: The time a packet waits in a router’s output buffer before it can be transmitted. This is the most variable component — it depends on the current congestion level and can range from microseconds to milliseconds.
- Transmission delay: The time to push all the bits of a packet onto the link. If the packet is \(L\) bits long and the link bandwidth is \(R\) bits per second, the transmission delay is \(L/R\).
- Propagation delay: The time for a bit to travel from one end of the link to the other. If the link length is \(d\) and the propagation speed is \(s \approx 2 \times 10^8\) m/s (for copper or fiber), the propagation delay is \(d/s\).
The total end-to-end delay for a packet traversing \(N\) links is the sum of all four delays at each hop.
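The four components can be combined in a short worked example. The numbers below (packet size, link speed, link length) are illustrative assumptions, not measurements:

```python
# Worked example: end-to-end delay for one packet over N identical links.

def end_to_end_delay(num_links, packet_bits, bandwidth_bps,
                     link_length_m, prop_speed_mps, proc_s, queue_s):
    """Sum the four delay components at each of num_links hops."""
    transmission = packet_bits / bandwidth_bps    # L/R
    propagation = link_length_m / prop_speed_mps  # d/s
    per_hop = proc_s + queue_s + transmission + propagation
    return num_links * per_hop

# A 1500-byte packet over 3 links of 10 Mb/s, 100 km each, with
# negligible processing and queuing delay:
delay = end_to_end_delay(
    num_links=3,
    packet_bits=1500 * 8,
    bandwidth_bps=10e6,
    link_length_m=100e3,
    prop_speed_mps=2e8,
    proc_s=0.0,
    queue_s=0.0,
)
print(f"{delay * 1000:.2f} ms")  # 3 * (1.2 ms + 0.5 ms) = 5.10 ms
```

Note that at this packet size and link speed, transmission delay dominates propagation delay; on a long-haul fiber link the balance reverses.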
Routing
Routing is the process of determining the path a packet takes from source to destination. Each router maintains a routing table that maps destination addresses to output links and costs. There are two main families of routing algorithms:
In distance-vector routing, each router periodically shares its routing table with its immediate neighbors. Each router computes the shortest path to every destination based on the information it receives. The Bellman-Ford algorithm underlies this approach. Distance-vector routing is simple but converges slowly and can suffer from the count-to-infinity problem when links fail.
In link-state routing, each router broadcasts information about the cost of its links to all other routers in the network. Every router then has a complete map of the network and can independently compute shortest paths using Dijkstra’s algorithm. Link-state routing converges faster and is more robust than distance-vector routing, but it requires more bandwidth and memory.
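The shortest-path computation at the heart of link-state routing can be sketched directly. The topology and link costs below are invented for illustration:

```python
import heapq

# Link-state routing sketch: each router holds the full topology and
# runs Dijkstra's algorithm to compute least-cost distances.

def dijkstra(graph, source):
    """Return the least-cost distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical 4-router network with symmetric link costs.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(dijkstra(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

In a real router, the `dist` map would be converted into a forwarding table mapping each destination to the first hop on its least-cost path.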
Lesson 4: Communication
Fundamentals of Communication
Communication is the lifeblood of a distributed system. Processes on different machines must exchange information to coordinate their actions, and the choice of communication mechanism profoundly affects the system’s performance, complexity, and fault tolerance.
At the lowest level, processes communicate through the network using protocols like TCP (reliable, ordered, stream-oriented) and UDP (unreliable, unordered, message-oriented). But building distributed applications directly on sockets is tedious and error-prone. Higher-level communication paradigms provide more convenient abstractions.
Remote Procedure Call (RPC)
The most influential communication paradigm in distributed systems is the Remote Procedure Call (RPC). The idea, introduced by Birrell and Nelson in 1984, is simple and powerful: make a remote call look exactly like a local procedure call. The programmer writes `result = server.compute(x, y)`, and the RPC system handles all the messy details of packing the arguments into a message, sending it over the network, waiting for a response, and unpacking the result.
The RPC mechanism works through stubs. The client stub is a local proxy for the remote procedure. When the client calls the stub, it marshals (serializes) the parameters into a message, sends it to the server, and blocks waiting for a reply. On the server side, the server stub (or skeleton) receives the message, unmarshals the parameters, calls the actual procedure, marshals the return value, and sends it back.
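The stub mechanism can be sketched in a few lines, with the network replaced by a direct call passing bytes. The names (`compute`, `client_stub`, `server_stub`) and the JSON wire format are illustrative choices, not a real RPC library:

```python
import json

# Minimal sketch of RPC stubs: the client stub marshals, the server
# stub unmarshals, dispatches, and marshals the reply.

def compute(x, y):                        # the actual remote procedure
    return x * y

def server_stub(message):
    """Unmarshal the request, dispatch, and marshal the result."""
    call = json.loads(message.decode())
    procedures = {"compute": compute}     # dispatch table
    result = procedures[call["proc"]](*call["args"])
    return json.dumps({"result": result}).encode()

def client_stub(proc_name, *args):
    """Marshal the call into a message, 'send' it, unmarshal the reply."""
    request = json.dumps({"proc": proc_name, "args": args}).encode()
    reply = server_stub(request)          # stands in for the network round trip
    return json.loads(reply.decode())["result"]

print(client_stub("compute", 6, 7))  # 42
```

The single line `server_stub(request)` is exactly where a real system would send the bytes over TCP and block for the reply; everything else is marshaling.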
RPC must deal with several complications that don’t arise with local calls:
- Parameter passing: Local calls can pass pointers, but pointers are meaningless across address spaces. RPC must pass parameters by value (copying the data into the message). Complex data structures like linked lists require deep serialization.
- Heterogeneity: The client and server may use different byte orderings (big-endian vs. little-endian), different floating-point formats, or different character encodings. An Interface Definition Language (IDL) defines the types and marshaling format to ensure interoperability.
- Failure handling: A local procedure call either completes or the entire process crashes. An RPC can fail in many more ways: the request message may be lost, the server may crash after receiving the request but before sending the reply, the reply may be lost, or the network may be partitioned. These partial failures make RPC semantics fundamentally different from local call semantics.
RPC systems typically offer one of several failure semantics:
| Semantics | Guarantee | Use Case |
|---|---|---|
| Maybe | No guarantee — the call may or may not execute | Unreliable, rarely used |
| At-least-once | The call executes one or more times | Safe for idempotent operations |
| At-most-once | The call executes zero or one time | Most common; used by Java RMI |
| Exactly-once | The call executes exactly once | Ideal but very difficult to achieve |
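A common way to obtain at-most-once semantics despite client retries is a server-side duplicate filter keyed by request identifier. The sketch below uses invented names (`Server`, `deposit`) to illustrate the idea:

```python
# At-most-once sketch: the client retries when a reply is lost, and the
# server filters duplicates by request id so the call executes only once.

class Server:
    def __init__(self):
        self.executed = {}   # request_id -> cached reply
        self.balance = 0

    def deposit(self, request_id, amount):
        if request_id in self.executed:      # duplicate: replay cached reply
            return self.executed[request_id]
        self.balance += amount               # execute the operation once
        reply = self.balance
        self.executed[request_id] = reply
        return reply

server = Server()
# The client sends request 1, the reply is lost, and it retries:
first = server.deposit(1, 100)
retry = server.deposit(1, 100)   # duplicate filtered; not re-applied
print(server.balance)            # 100, not 200
```

Note that `deposit` is not idempotent, so at-least-once semantics (blind retries with no filter) would double-charge; the duplicate table is what makes retries safe here.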
Message-Oriented Communication
While RPC is synchronous (the client blocks until the reply arrives), many distributed applications need asynchronous communication. Message-oriented middleware (MOM) provides this through message queues. A sender puts a message into a queue, and the receiver retrieves it later — the sender and receiver need not be running at the same time.
Message queuing provides several advantages: it decouples sender and receiver in time, it naturally supports load balancing (multiple receivers can drain the same queue), and it buffers against bursts of traffic. Systems like Apache Kafka, RabbitMQ, and Amazon SQS are widely used message-oriented middleware.
Communication can also be transient (the message is discarded if the receiver is not running) or persistent (the message is stored until the receiver retrieves it). Email is a familiar example of persistent communication; a phone call is transient.
Multicast Communication
Sometimes a message needs to be delivered to multiple recipients. Multicast communication provides this capability. In application-level multicast, the middleware maintains an overlay network (often a tree or mesh) through which messages are propagated. This is more flexible than network-level multicast (IP multicast), which requires router support that is often unavailable on the public Internet.
Gossip-based (or epidemic) protocols are a particularly elegant form of multicast. Each node periodically selects a random peer and exchanges information with it. Like a rumor spreading through a population, the information eventually reaches all nodes. Gossip protocols are robust (they work even if some nodes fail or are unreachable), scalable (each node does a constant amount of work regardless of system size), and simple to implement. They are widely used in modern distributed systems — for example, Amazon’s Dynamo uses gossip for failure detection and membership management.
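The epidemic spread of a rumor is easy to simulate. The sketch below models simple "push" gossip, where every informed node contacts one random peer per round; the parameters are illustrative:

```python
import random

# Toy simulation of epidemic dissemination: each round, every informed
# node pushes the rumor to one uniformly random peer.

def gossip_rounds(num_nodes, seed=0):
    """Return the number of rounds until all nodes hold the rumor."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    informed = {0}              # node 0 starts with the rumor
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        for node in list(informed):
            informed.add(rng.randrange(num_nodes))
    return rounds

print(gossip_rounds(1000))
```

Because the informed set can at most double per round, full dissemination takes at least \(\log_2 n\) rounds; in practice it takes only slightly more, which is the scalability property that makes gossip attractive.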
Lesson 5: Naming
The Role of Names
A name in a distributed system is a string of bits or characters used to refer to an entity — a file, a process, a machine, a user, or any other resource. Names are essential because they allow us to identify, locate, and access entities without knowing their physical details.
There is an important distinction between names, addresses, and identifiers. An address is a name that refers to an access point — it tells you where an entity can be reached. An identifier is a name that uniquely and permanently refers to an entity — no two entities share the same identifier, and an identifier always refers to the same entity. A human-readable name (like a domain name or a file path) is designed for people rather than machines.
The process of looking up the address or identifier associated with a name is called name resolution. A naming system (or name service) maintains the bindings between names and their associated attributes and performs resolution when requested.
Flat Naming
In a flat naming scheme, names are unstructured bit strings — they have no internal structure that reveals the entity’s location or organization. The challenge with flat names is how to locate the entity: you can’t look at the name and figure out where it is.
Several approaches exist:
Broadcasting and multicasting: Send a query to all nodes, and the one holding the entity responds. ARP (Address Resolution Protocol) uses this approach to map IP addresses to MAC addresses within a local network. This works well in small networks but doesn’t scale to large ones due to the overhead of broadcasting.
Forwarding pointers: When an entity moves from location A to location B, it leaves behind a pointer at A saying “I’ve moved to B.” A chain of these pointers leads from the entity’s original location to its current one. This is simple but the chain can become arbitrarily long, increasing lookup time and creating a fragile dependency on all the intermediate nodes.
Home-based approaches: Each entity has a designated “home” node that always knows its current location. When the entity moves, it informs its home. To find the entity, you contact its home. Mobile IP uses this approach: a mobile device’s home agent on the home network always knows where the device currently is.
Structured Naming
Structured naming organizes names into a hierarchy — a naming graph or name space. File systems are the most familiar example: a path like /users/alice/documents/report.pdf navigates a tree from the root through intermediate directories to the target file.
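Resolution in such a naming graph is a component-by-component walk from the root. The sketch below models the name space as nested dictionaries; the paths and contents are made up:

```python
# Sketch of resolving a path name in a hierarchical name space,
# modeled as nested dictionaries.

root = {
    "users": {
        "alice": {
            "documents": {"report.pdf": "<file contents>"},
        },
    },
}

def resolve(path):
    """Walk the naming graph from the root, one component at a time."""
    node = root
    for component in path.strip("/").split("/"):
        node = node[component]   # KeyError models "name not found"
        # a real resolver would also follow mount points and symbolic links
    return node

print(resolve("/users/alice/documents/report.pdf"))  # <file contents>
```

A distributed name service such as DNS performs exactly this walk, except that different levels of the tree are stored on different servers, so each step may involve contacting a new machine.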
The Domain Name System (DNS) is the most important structured naming system on the Internet. DNS organizes names into a hierarchy of domains: cs.uwaterloo.ca means the cs subdomain of uwaterloo, which is under the ca top-level domain. DNS uses a distributed database of resource records to resolve names to IP addresses (and other data).
DNS resolution can be iterative or recursive. In iterative resolution, the client contacts each name server in the hierarchy itself, following referrals. In recursive resolution, the client asks one name server, which contacts the next on the client’s behalf, and so on up the chain. Recursive resolution reduces the client’s work but increases the load on name servers.
DNS is a remarkable engineering achievement: it handles billions of queries per day, is highly available despite individual server failures, and has scaled from a few thousand hosts in the 1980s to hundreds of millions today. It achieves this through aggressive caching (resolved names are cached with a Time-To-Live) and replication (each zone is served by multiple name servers).
Attribute-Based Naming
Sometimes you want to find an entity by its properties rather than its name. Directory services support attribute-based naming, allowing queries like “find all color printers on the third floor.” LDAP (Lightweight Directory Access Protocol) is the standard protocol for directory services.
Attribute-based lookup is more flexible than name-based lookup but also more expensive — it requires maintaining indices over attribute values and searching them efficiently.
Lesson 6: Coordination
Clock Synchronization
In a centralized system, there is a single clock that all processes can consult. In a distributed system, each machine has its own clock, and these clocks inevitably drift apart. If two machines’ clocks disagree, they may order events differently — which can lead to subtle and dangerous bugs. Consider a distributed make system: if machine B’s clock lags machine A’s, a source file edited on B can receive a timestamp earlier than that of an object file previously compiled on A, so make wrongly concludes the object file is up to date and skips recompilation.
Physical clock synchronization attempts to keep all clocks close to real (UTC) time. The Network Time Protocol (NTP) is the standard approach: machines periodically synchronize with time servers, adjusting their clocks to account for network delays. NTP can achieve accuracy to within a few milliseconds over the Internet and sub-millisecond within a LAN.
When synchronizing clocks, it is important to never set a clock backward — doing so could cause events to appear to happen out of order or make time intervals negative. Instead, a clock that is ahead should be slowed down gradually, and a clock that is behind should be sped up gradually.
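This gradual correction is called slewing. The sketch below models a software clock that is ahead of true time and absorbs the error by ticking slightly slower than real time; the class name and rate are illustrative assumptions:

```python
# Sketch of clock slewing: a fast clock is never stepped backward;
# instead it runs at a slightly reduced rate until the error is absorbed.

class SlewedClock:
    def __init__(self, offset, rate=0.999):
        self.offset = offset   # how far ahead of true time we are (seconds)
        self.rate = rate       # tick slightly slower than real time
        self.now = 0.0         # current software clock reading

    def tick(self, real_elapsed):
        """Advance by slightly less than real time while an error remains."""
        if self.offset > 0:
            slowed = real_elapsed * self.rate
            self.offset -= real_elapsed - slowed  # absorb part of the error
            self.now += slowed
        else:
            self.now += real_elapsed              # error gone: run normally
        return self.now

clock = SlewedClock(offset=0.01)   # 10 ms ahead of true time
for _ in range(10):
    clock.tick(1.0)                # after ~10 s the error is absorbed
```

Because `now` only ever increases, the clock stays monotonic: no event can be stamped with a time earlier than a previous event's.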
Logical Clocks
It is not always necessary to synchronize system clocks to real time. In many cases, what matters is only the relative ordering of events — which event happened before which. This insight led Leslie Lamport to introduce logical clocks in his seminal 1978 paper.
The happened-before relation (denoted \(a \to b\)) captures causal ordering between events. Event \(a\) happened before event \(b\) if:
- \(a\) and \(b\) are events in the same process and \(a\) occurs before \(b\), or
- \(a\) is the sending of a message and \(b\) is the receiving of that message, or
- There exists an event \(c\) such that \(a \to c\) and \(c \to b\) (transitivity).
If neither \(a \to b\) nor \(b \to a\), then \(a\) and \(b\) are concurrent, written \(a \| b\).
Lamport’s Algorithm assigns a logical clock value \(C(e)\) to each event \(e\) such that if \(a \to b\) then \(C(a) < C(b)\). The algorithm has three rules:
- Before each event in a process, increment the local clock.
- When sending a message, attach the current clock value (timestamp) to the message.
- When receiving a message with timestamp \(ts\), set the local clock to \(\max(\text{local clock}, ts) + 1\).
Lamport timestamps capture the happened-before relation in one direction: if \(a \to b\) then \(C(a) < C(b)\). However, the converse is not true: \(C(a) < C(b)\) does not imply \(a \to b\). For a stronger guarantee, vector clocks are needed — but they come at higher cost (each timestamp is a vector of \(N\) integers for \(N\) processes).
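The three rules translate directly into code. The sketch below is a minimal single-machine model of Lamport clocks, with message delivery reduced to passing a timestamp:

```python
# Minimal Lamport clock following the three rules above.

class Process:
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1                       # rule 1: increment before event
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock                     # rule 2: timestamp the message

    def receive(self, ts):
        self.clock = max(self.clock, ts) + 1  # rule 3: merge and increment
        return self.clock

p, q = Process(), Process()
p.local_event()           # C = 1 at p
ts = p.send()             # C = 2 at p; the message carries ts = 2
q.local_event()           # C = 1 at q
q.receive(ts)             # C = max(1, 2) + 1 = 3 at q
print(p.clock, q.clock)   # 2 3
```

Note how the receive rule is what enforces \(C(\text{send}) < C(\text{receive})\): without the `max`, q's clock would have advanced to only 2 and the causal order of the message would be lost.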
Mutual Exclusion
In a centralized system, processes use shared-memory primitives like semaphores, locks, and monitors to coordinate access to critical sections. In a distributed system, there is no shared memory, so different mechanisms are needed.
Centralized Algorithm
The simplest approach is to emulate a centralized lock manager. One process is designated as the coordinator. To enter a critical section, a process sends a request to the coordinator. If the critical section is free, the coordinator replies with “OK.” If it is occupied, the coordinator queues the request and replies when the section becomes free. When a process leaves the critical section, it sends a release message to the coordinator.
This algorithm is simple and fair (requests are served in order), and requires only three messages per critical section entry (request, OK, release). However, the coordinator is a single point of failure: if it crashes, the entire mutual exclusion mechanism breaks down.
Distributed Algorithm
In Ricart and Agrawala’s distributed algorithm, a process that wants to enter a critical section broadcasts its request (with a Lamport timestamp) to all other processes. Each recipient replies with an “OK” unless it is either in the critical section or wants to enter it with a lower timestamp — in which case it defers its reply. A process enters the critical section only after receiving “OK” from every other process.
This algorithm is fully distributed, but it has significant drawbacks. The failure of any single process blocks the entire system (since its “OK” will never arrive). It requires \(2(N-1)\) messages per critical section entry, compared to 3 for the centralized algorithm. And paradoxically, it is more complex and less robust than the centralized approach — a good reminder that “distributed” does not always mean “better.”
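The heart of Ricart and Agrawala's algorithm is the decision each recipient makes: reply "OK" immediately or defer. Requests are compared as (timestamp, process id) pairs so that timestamp ties are broken deterministically. The sketch below shows only this decision rule, with invented state names:

```python
# The reply decision in Ricart-Agrawala. A sketch, not a full
# implementation: message transport and the deferred queue are omitted.

def should_defer(my_state, my_request, incoming_request):
    """my_state is 'RELEASED', 'WANTED', or 'HELD'.
    Requests are (lamport_timestamp, process_id) tuples."""
    if my_state == "HELD":
        return True                    # in the critical section: defer
    if my_state == "WANTED" and my_request < incoming_request:
        return True                    # my own request has priority: defer
    return False                       # otherwise reply OK immediately

# Process 2 (timestamp 8) receives a request from process 1 (timestamp 5):
print(should_defer("WANTED", (8, 2), (5, 1)))  # False — the older request wins
print(should_defer("WANTED", (5, 1), (8, 2)))  # True — defer the younger one
print(should_defer("RELEASED", None, (8, 2)))  # False — not competing
```

Comparing tuples gives a total order over requests, which is what guarantees that two competing processes never both defer (deadlock) or both proceed (violation of mutual exclusion).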
Token Ring Algorithm
Processes are arranged in a logical ring. A special token message circulates around the ring. When a process receives the token, it may enter the critical section if it wishes; when it exits (or if it doesn’t want to enter), it passes the token to the next process.
The token ring algorithm is simple and fair, but it has a disadvantage: the token constantly circulates even when no process wants to enter the critical section, wasting network bandwidth. If a process crashes while holding the token, the token is lost and must be regenerated — which requires a separate algorithm to detect the loss.
| Algorithm | Messages per Entry | Delay | Problems |
|---|---|---|---|
| Centralized | 3 | 2 | Single point of failure |
| Distributed | \(2(N-1)\) | \(2(N-1)\) | Any failure blocks all |
| Token Ring | 1 to \(\infty\) | 0 to \(N-1\) | Token loss |
Election Algorithms
Many distributed algorithms require a coordinator — a single process that takes on a special role (such as the lock manager in the centralized mutual exclusion algorithm). If the coordinator crashes, a new one must be elected. Election algorithms allow processes to agree on which among them should be the coordinator.
The Bully Algorithm
In the Bully Algorithm, each process has a unique numeric identifier. When a process detects that the coordinator has failed (e.g., by timing out on a message), it initiates an election by sending an “election” message to all processes with higher identifiers. If any of them responds, the initiator backs off and waits — the higher-numbered processes will take over the election. If no one responds, the initiator declares itself the coordinator and announces this to all processes.
The algorithm is called “bully” because the process with the highest identifier always wins — it “bullies” the others into submission. The algorithm handles the case where a previously crashed process comes back online: it immediately initiates an election and, if it has the highest identifier, becomes the new coordinator.
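The outcome of a Bully election can be sketched as follows. This is a toy model, not a message-level implementation: `alive` stands in for the set of responsive processes (which a real system discovers via timeouts), and the recursion mirrors how higher-numbered responders take over the election.

```python
# Sketch of the Bully algorithm's outcome: higher processes that respond
# take over the election, so the highest live identifier always wins.

def bully_election(initiator, alive):
    """Return the id that ends up coordinator when `initiator` (assumed
    alive) starts an election among the live processes in `alive`."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator          # no higher process answered: initiator wins
    # A higher process responded; it takes over and the election recurses
    # up the identifier order until the highest live process is reached.
    return bully_election(min(higher), alive)

# Process 2 detects the old coordinator's failure; 5 is the highest alive.
print(bully_election(2, {1, 2, 4, 5}))   # 5
```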
Ring-Based Election
In this algorithm, processes are organized in a logical ring, and each process knows its successor. When a process detects the coordinator’s failure, it sends an “election” message containing its own identifier to its successor. Each process that receives the message appends its own identifier and forwards it. When the message returns to the initiator (it recognizes its own identifier in the list), it selects the process with the highest identifier as the coordinator and sends a “coordinator” message around the ring to announce the result.
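The circulation step can be sketched in a few lines. This is an illustrative model (the hypothetical `ring` list stands in for each process knowing its successor): the election message makes one full trip, collecting identifiers, and the initiator picks the maximum.

```python
# Sketch of ring-based election: the message circulates once, each
# process appends its id, and the initiator picks the highest.

def ring_election(start, ring):
    """ring: list of live process ids in successor order.
    Returns the elected coordinator."""
    n = len(ring)
    i = ring.index(start)
    collected = []
    for step in range(n):                        # one full trip around the ring
        collected.append(ring[(i + step) % n])   # each process appends its id
    # The message has returned to the initiator, which picks the winner.
    return max(collected)

# Process 2 initiates; the message visits 2 -> 5 -> 1 -> 4 and back.
print(ring_election(2, [2, 5, 1, 4]))   # 5
```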
Lesson 7: Transactions
What Is a Transaction?
A transaction is a collection of operations that transforms the system from one consistent state to another. The key insight is that during execution, the system may temporarily be in an inconsistent state — but this is invisible to the outside world. The system either completes the entire transaction (making all changes permanent) or aborts it (undoing all changes), but it never leaves the data in a half-modified state.
Consider a concrete example: an airline reservation system. A travel agent inputs a flight number, date, and customer name. The transaction reads the number of seats sold, checks if the flight is full, and if there’s room, increments the seat count and records the customer’s name. If the flight is full, the transaction aborts — no changes are made. If there’s room, the transaction commits — both the seat count and the customer record are durably stored.
```
Begin_transaction Reservation
    temp ← Read(flight(date).stsold);
    if temp = flight(date).cap then
        Abort;
    else
        Write(flight(date).stsold, temp + 1);
        Write(flight(date).cname, customer_name);
        Commit;
end {Reservation}
```
ACID Properties
Transactions are characterized by four properties, collectively known as ACID:
Atomicity: A transaction is “all or nothing.” Either all of its operations are applied, or none of them are. If the system crashes halfway through a transaction, the recovery manager undoes any partial changes.
Consistency: A transaction takes the system from one valid state to another. Every transaction is assumed to be a correct program — it preserves all the integrity constraints of the data. This is the responsibility of the application programmer.
Isolation: The effects of a transaction are invisible to other transactions until it commits. Even though transactions may execute concurrently (interleaved in time), each transaction behaves as if it were the only one running. This is the property that this module focuses on most deeply.
Durability: Once a transaction commits, its effects are permanent. Even if the system crashes immediately after the commit, the data will not be lost. This is typically achieved by writing changes to a persistent log before applying them to the database.
Transaction Primitives
The work performed within a transaction is delimited by transaction primitives:
| Primitive | Description |
|---|---|
| BEGIN_TRANSACTION | Marks the start of a transaction |
| END_TRANSACTION (EOT) | Terminates the transaction and attempts to commit |
| ABORT_TRANSACTION | Kills the transaction and restores old values |
| READ | Reads a data item |
| WRITE | Writes a data item |
In a centralized system, applications submit transactional operations to a transaction manager (TM), which forwards them to a scheduler (SC) responsible for concurrency control, which in turn interacts with a recovery manager (RM) that logs changes for crash recovery.
In a distributed system, each site has its own TM, SC, and RM. A distributed transaction may access data on multiple sites, requiring coordination across the system to ensure global consistency.
Isolation and Serializability
Isolation is based on the concept of execution histories (also called schedules). An execution history specifies the order in which operations from concurrent transactions are interleaved. A serial history is one where all operations of one transaction complete before any operation of the next transaction begins — no interleaving at all.
Serial execution trivially guarantees consistency (assuming each transaction is correct), but it offers terrible performance because transactions must wait for each other even when they access different data items.
A serializable history allows interleaving but guarantees that the net effect on the data is equivalent to some serial history. In other words, concurrency is allowed, but the result must be indistinguishable from running the transactions one at a time. This is the gold standard for isolation.
Conflicting Operations
Two operations from different transactions conflict if they both access the same data item and at least one of them is a write:
| Operation 1 | Operation 2 | Conflict? | Reason |
|---|---|---|---|
| Read | Read | No | Reads don’t change data |
| Read | Write | Yes | Order determines what is read |
| Write | Write | Yes | Order determines final value |
Two histories are conflict equivalent if they contain the same operations and every pair of conflicting operations appears in the same relative order. A history is conflict serializable if it is conflict equivalent to some serial history.
Serialization Graph
The serialization graph (or precedence graph) is a powerful tool for testing serializability. Create a node for each transaction. For every pair of conflicting operations where one precedes the other in the history, draw a directed edge from the earlier transaction to the later one. A history is serializable if and only if the serialization graph is acyclic. If it is acyclic, a topological sort of the graph gives a conflict-equivalent serial order.
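The test described above is mechanical enough to code directly. The sketch below (hypothetical helpers `serialization_graph` and `is_serializable`; histories are lists of `(transaction, operation, item)` triples) builds the precedence graph and checks it for cycles with a depth-first search.

```python
# Build a serialization (precedence) graph from an execution history and
# test it for cycles. A history is conflict serializable iff the graph
# is acyclic.

def serialization_graph(history):
    """history: ordered list of (txn, op, item), op in {'r', 'w'}.
    Returns the set of edges (Ti, Tj) induced by conflicting pairs."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(history):
        for (t2, op2, x2) in history[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_serializable(history):
    edges = serialization_graph(history)
    nodes = {t for t, _, _ in history}
    adj = {t: [v for (u, v) in edges if u == t] for t in nodes}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in nodes}
    def dfs(u):                      # returns True if a cycle is reachable
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True          # back edge found: cycle
        color[u] = BLACK
        return False
    return not any(color[t] == WHITE and dfs(t) for t in nodes)

# T1 reads x, T2 writes x, T1 writes x: edges in both directions, a cycle.
history = [('T1', 'r', 'x'), ('T2', 'w', 'x'), ('T1', 'w', 'x')]
print(sorted(serialization_graph(history)))   # [('T1', 'T2'), ('T2', 'T1')]
print(is_serializable(history))               # False
```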
Isolation Anomalies
When isolation is not properly enforced, two classic anomalies can arise:
Lost updates: Two transactions both read a value, compute a new value based on it, and write the result. One transaction’s update is silently overwritten by the other. For example, if account B has $200 and two transactions each add 10%, the correct result is $242 (serial execution), but without proper isolation, both might read $200 and write $220, losing one update.
Inconsistent retrievals: A transaction reads multiple related data items while another transaction is modifying them. The reading transaction sees a mix of old and new values, leading to an inconsistent view. For example, a transaction summing all account balances might see a withdrawal from account A but not the corresponding deposit to account B, computing an incorrect total.
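The lost-update arithmetic from the example can be reproduced directly (integer arithmetic is used here so "add 10%" is exact; the interleaving is simulated in straight-line code rather than with real threads):

```python
# Account B holds $200; two transactions each add 10%.
balance = 200

# Serial execution: T2 sees T1's committed update.
b = balance
b = b + b // 10        # T1 commits: 220
b = b + b // 10        # T2 commits: 242
serial_result = b

# Interleaved without isolation: both transactions read the original 200.
t1 = balance + balance // 10   # T1 computes 220 from its read
t2 = balance + balance // 10   # T2 also computes 220 from a stale read
balance = t1                   # T1 writes 220
balance = t2                   # T2 overwrites with 220: T1's update is lost
print(serial_result, balance)  # 242 220
```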
Distributed Transaction Serializability
Serializability becomes more complex in a distributed system. Each site maintains its own local history of operations. For the global execution to be serializable, two conditions must hold:
- Each local history must be serializable.
- For any two conflicting transactions, their relative order must be the same in all local histories where they both appear.
Condition 2 is crucial and can be violated even when condition 1 holds. Consider transactions \(T_1\) (which adds 5 to \(x\)) and \(T_2\) (which multiplies \(x\) by 15). If \(T_1\) runs before \(T_2\) at site 1 but \(T_2\) runs before \(T_1\) at site 2, each local history is serial (and thus trivially serializable), but the global execution is not serializable because the relative order of the two conflicting transactions differs between sites.
Concurrency Control Algorithms
Two-Phase Locking (2PL)
Two-Phase Locking is the most widely used concurrency control algorithm. It uses two types of locks: read locks (shared) that allow multiple concurrent readers, and write locks (exclusive) that prevent all other access. The key rule is that a transaction’s lock acquisitions are divided into two phases:
- Growing phase: The transaction may acquire locks but may not release any.
- Shrinking phase: The transaction may release locks but may not acquire any new ones.
The lock point is the moment when a transaction transitions from growing to shrinking. 2PL guarantees serializability — the serialization order corresponds to the order of transactions’ lock points.
However, basic 2PL has two problems. First, a transaction may not know in advance which data items it will need to lock, making it difficult to determine when to stop acquiring locks. Second, cascading aborts can occur: if transaction \(T_1\) writes a data item and releases its lock, then \(T_2\) reads that item, and then \(T_1\) aborts, \(T_2\) must also be aborted because it read a value that was rolled back — a value that, logically, never existed.
Strict 2PL (S-2PL) eliminates cascading aborts by holding all locks until the transaction commits. This ensures no other transaction can read uncommitted data. The downside is reduced concurrency, since locks are held for longer.
In a distributed setting, 2PL can be implemented with either a centralized lock manager (one site manages all locks — simple but a single point of failure) or distributed lock managers (each site manages locks for data items it stores — more robust but requires coordination).
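A minimal strict-2PL lock table can be sketched as follows. This is a toy model under simplifying assumptions: conflicting requests fail immediately instead of blocking in a queue (so no deadlock handling is needed here), and the class name `LockTable` is illustrative.

```python
# Minimal strict-2PL lock table: read locks are shared, write locks are
# exclusive, and a transaction releases nothing until it commits.

class LockTable:
    def __init__(self):
        self.readers = {}   # item -> set of txns holding a read lock
        self.writer = {}    # item -> txn holding the write lock

    def read_lock(self, txn, item):
        w = self.writer.get(item)
        if w is not None and w != txn:
            return False                      # conflicts with another writer
        self.readers.setdefault(item, set()).add(txn)
        return True

    def write_lock(self, txn, item):
        w = self.writer.get(item)
        other_readers = self.readers.get(item, set()) - {txn}
        if (w is not None and w != txn) or other_readers:
            return False                      # conflicts with readers/writer
        self.writer[item] = txn
        return True

    def commit(self, txn):
        # Strictness: all of txn's locks are released only now, so no one
        # can ever have read its uncommitted writes.
        for item in list(self.writer):
            if self.writer[item] == txn:
                del self.writer[item]
        for holders in self.readers.values():
            holders.discard(txn)

lt = LockTable()
assert lt.read_lock('T1', 'x')
assert lt.read_lock('T2', 'x')         # read locks are shared
assert not lt.write_lock('T2', 'x')    # blocked by T1's read lock
lt.commit('T1')
assert lt.write_lock('T2', 'x')        # T1's commit released its lock
```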
Deadlocks
Locking can cause deadlocks: a cycle of transactions, each waiting for a lock held by the next. The four necessary conditions for deadlock are the same as in operating systems: mutual exclusion, hold and wait, no preemption, and circular wait.
Deadlocks are detected by constructing a Wait-For Graph (WFG): nodes are transactions, and an edge from \(T_i\) to \(T_j\) means \(T_i\) is waiting for a lock held by \(T_j\). A deadlock exists if and only if the WFG contains a cycle.
In a distributed system, local WFGs at individual sites may show no cycles, but the global WFG (the union of all local WFGs plus cross-site wait edges) may contain a cycle. This means distributed deadlock detection is harder than local deadlock detection.
Three approaches exist for managing deadlocks:
- Prevention: Guarantee deadlocks can never occur (e.g., require all locks to be acquired upfront, or impose a total order on lock acquisition). No runtime support needed.
- Avoidance: Detect potential deadlocks before they form and take evasive action. Requires runtime support.
- Detection and recovery: Allow deadlocks to form, then detect and break them by aborting a victim transaction. The victim is chosen by heuristics — for example, abort the youngest transaction (least work lost) or the transaction involved in the most cycles.
Probing is a technique for distributed deadlock detection. When a transaction \(W\) starts waiting for another transaction \(U\) that is itself waiting, a probe message is sent along the edges of the WFG. The probe accumulates the path of transactions. If the probe returns to its origin, a cycle (and thus a deadlock) has been detected.
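The probe mechanism can be sketched as follows. This is a simplified single-probe model (the `wfg` dictionary stands in for wait-for edges that, in a real system, span sites and are followed by forwarding the probe message):

```python
# Sketch of probe-based (edge-chasing) deadlock detection: a probe follows
# wait-for edges, accumulating the path; if it arrives back at its origin,
# a deadlock cycle has been found.

def probe(wfg, origin):
    """wfg: dict txn -> txn it waits for (absent/None if not waiting).
    Returns the deadlock cycle as a list of txns, or None."""
    path = [origin]
    current = wfg.get(origin)
    while current is not None:
        if current == origin:
            return path                 # probe returned to origin: deadlock
        if current in path:
            return None                 # a cycle, but not involving origin
        path.append(current)            # probe accumulates the path
        current = wfg.get(current)
    return None                         # chain ended: no deadlock via origin

# W waits for U, U waits for V, V waits for W: a (distributed) cycle.
print(probe({'W': 'U', 'U': 'V', 'V': 'W'}, 'W'))   # ['W', 'U', 'V']
```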
Lesson 8: Replication
Why Replicate?
Replication — maintaining multiple copies of data across different machines — is one of the most powerful tools in the distributed systems designer’s toolkit. It serves two complementary goals:
Reliability: If one replica is unavailable (due to a crash, network failure, or maintenance), clients can access another copy. Replication eliminates single points of failure and enables continued service during partial outages.
Performance: Replicas can be placed geographically close to users, reducing access latency. Multiple replicas can serve read requests in parallel, increasing throughput. By distributing the load, replication allows the system to scale.
But replication comes at a cost: consistency. When one replica is updated, the change must eventually propagate to all other replicas. If updates are applied at different replicas in different orders, the replicas may diverge — a violation of consistency. The fundamental challenge of replication is maintaining an appropriate level of consistency while achieving the performance and reliability benefits that motivated replication in the first place.
Logical vs. Physical Objects
When reasoning about replication, it is helpful to distinguish between logical objects and physical objects. A logical object (like “the variable \(x\)”) is what the user thinks about. A physical object (like “the copy of \(x\) on server 3”) is an actual data item stored on a specific machine.
Users operate on logical objects. The replication system translates logical operations into physical ones. For example, a logical write(x) becomes write(x₁), write(x₂), write(x₃) — one write to each physical copy of \(x\). The order in which these physical writes are applied, and when they become visible to reads, is what determines the consistency model.
Consistency Models
A consistency model defines the rules governing the order in which updates to replicated data become visible to processes. Stronger models are easier to program against but harder to implement efficiently.
Data-Centric Consistency
Strict consistency is the strongest model: any read of \(x\) returns the value written by the most recent write to \(x\), where “most recent” is defined by absolute global time. This requires all writes to be instantaneously visible to all processes — which is physically impossible in a distributed system due to propagation delays. Strict consistency is a theoretical ideal, not a practical reality.
Linearizability relaxes strict consistency slightly. It requires that all operations appear to take effect at some point between their invocation and completion, and that this ordering is consistent with real time. Formally, if operation \(op_1\) completes before operation \(op_2\) begins (in real time), then \(op_1\) must appear before \(op_2\) in the linearization. Linearizability is expensive because it requires synchronization based on physical timestamps.
Sequential consistency (proposed by Lamport) is more practical. It requires that the result of any execution is the same as if the operations of all processes were executed in some sequential order, and that the operations of each individual process appear in this sequence in the order specified by its program. Unlike linearizability, sequential consistency does not require this order to be consistent with real time — only that per-process order is preserved and all processes agree on a single global interleaving.
Client-Centric Consistency
For many applications, strict consistency is overkill. Consider a mobile user who accesses their email from different replicas as they travel. They don’t care whether all replicas agree at every instant — they just want a consistent experience from their perspective. This leads to client-centric consistency models:
Eventual consistency: In the absence of new updates, all replicas will eventually converge to the same value. This is the weakest useful consistency model and is sufficient for many applications (DNS, for example). The catch is “eventually” — there is no bound on how long convergence takes.
Within eventual consistency, four sub-guarantees are important:
- Monotonic reads: If a process reads a value of \(x\), any subsequent read of \(x\) by the same process will return that value or a more recent one.
- Monotonic writes: A write by a process on \(x\) is completed before any subsequent write on \(x\) by the same process.
- Read your writes: A write by a process on \(x\) will always be seen by a subsequent read of \(x\) by the same process.
- Writes follow reads: A write by a process on \(x\) following a previous read of \(x\) by the same process is guaranteed to take place on the same or a more recent value of \(x\).
Update Propagation
When a replica is updated, the change must be communicated to other replicas. There are three main choices for what to propagate:
- Invalidation: Notify other replicas that their copy is stale, without sending the new value. The next read will need to fetch the current value. This uses minimal bandwidth but increases read latency.
- Updated data: Send the new value to all replicas. This uses more bandwidth but allows replicas to serve reads immediately.
- Update operation: Send the operation itself (e.g., “increment \(x\) by 1”) rather than the new value. This is called active replication and can save bandwidth when updates are small relative to the data.
There are also two main choices for who initiates propagation:
| Aspect | Push-Based | Pull-Based |
|---|---|---|
| Server state | Tracks replicas and caches | None |
| Messages | Sends updates proactively | Client polls for updates |
| Response time | Immediate | Fetch-update time |
Push-based approaches are better for data that changes infrequently but is read often (like DNS). Pull-based approaches are better for data that changes frequently but is read occasionally.
Replication Protocols
Primary-Based Protocols
In a primary-based protocol, each data item has a designated primary copy. All writes are directed to the primary, which updates itself and then propagates the changes to backup copies.
In the remote-write variant, a client can contact any replica, but writes are forwarded to the primary. The primary applies the update, propagates it to all backups, waits for acknowledgments, and then confirms the write to the client. Reads can be served by any replica. This ensures sequential consistency but makes writes slow (they must wait for all backups to acknowledge).
In the local-write variant, the primary role can migrate. When a process wants to write a data item, the primary copy is first moved to the process’s local server, which becomes the new primary. This makes writes fast (they happen locally) but requires a migration step.
Replicated Write Protocols
In replicated write protocols, writes can be submitted to any replica, not just a designated primary.
Active replication sends every write operation to all replicas, which all execute it in the same order (using, for example, totally ordered multicast). This ensures strong consistency but requires reliable, ordered communication.
Quorum-based protocols assign each replica a number of votes. To read, a transaction must obtain a read quorum \(V_r\); to write, a write quorum \(V_w\). Two constraints must hold:
\[V_r + V_w > V\]

where \(V\) is the total number of votes. This ensures any read quorum overlaps with any write quorum, so a read will always see the most recent write.

\[V_w > V/2\]

This ensures any two write quorums overlap, preventing conflicting concurrent writes.
The special case \(V_r = 1, V_w = V\) is called Read-One-Write-All (ROWA): reads are fast (contact any one replica) but writes are slow (must update all replicas). At the opposite extreme, \(V_r = V, V_w = 1\) makes writes fast but reads slow — note, though, that \(V_w = 1\) violates the write-quorum overlap constraint, so it is only safe if concurrent writes are serialized by some other mechanism. In between, the system designer can tune the trade-off.
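The two quorum constraints are easy to check mechanically. The helper `valid_quorum` below is an illustrative sketch assuming every replica holds one vote, so \(V\) equals the number of replicas:

```python
# Check the two quorum constraints: Vr + Vw > V (read/write overlap)
# and Vw > V/2 (write/write overlap).

def valid_quorum(v_r, v_w, v):
    return v_r + v_w > v and v_w > v / 2

# With V = 5 equal-vote replicas:
print(valid_quorum(1, 5, 5))   # True  (ROWA: read one, write all)
print(valid_quorum(3, 3, 5))   # True  (majority quorums)
print(valid_quorum(2, 2, 5))   # False (quorums could miss each other)
```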
Lazy Replication
In lazy (or asynchronous) replication, updates are first applied to the primary and then propagated to replicas lazily, using separate transactions. This approach is simpler and faster for writes but introduces a window during which replicas may be out of date. The key requirement is that the order of replica updates must match the commit order at the primary — otherwise the replicas could diverge permanently.
The lazy approach with a single master acting as a sequencer is popular because it is relatively simple to implement: the sequencer determines the order of all updates, and replicas apply them in that same order.
Lesson 9: Distributed File Systems
Overview
Distributed file systems (DFS) provide shared access to files stored across multiple machines, transparently presenting them as if they were on a local disk. They are among the oldest and most widely used forms of distributed systems — your files at the University of Waterloo, for example, are stored on NFS servers, yet when you log in to any lab machine, your home directory appears as a local folder.
As client load increases on a single centralized file system, performance degrades. A distributed file system provides scalability by adding more server resources, supporting more concurrent clients while maintaining low access latencies.
Client-Server Architectures
There are two fundamental models for accessing files in a DFS:
Remote access model: The file remains on the server at all times. The client sends every file operation (read, write, seek) as a request to the server, which executes it and returns the result. This is simple and ensures that the server always has the latest version, but it generates heavy network traffic — every byte read or written crosses the network.
Upload/download model: When a client needs a file, the entire file (or relevant portions) is downloaded to the client machine. All operations are performed locally. When the client is done, the modified file is uploaded back to the server. This dramatically reduces network traffic for files that are accessed multiple times in a session, but it introduces cache consistency challenges — what if another client modifies the file on the server while you are working on your local copy?
NFS Architecture
The Network File System (NFS) is the most widely deployed DFS, originally developed by Sun Microsystems in 1984. NFS uses the remote access model (in its pure form) and provides location-transparent access to remote files.
The architecture relies on the Virtual File System (VFS) layer, which sits between the application and the actual file system implementation. VFS acts as a dispatcher: when an application performs a file operation, VFS checks whether the file is local or remote. If local, it forwards the request to the local file system. If remote, it forwards it to the NFS client, which sends an RPC to the NFS server on the remote machine. The server’s VFS then forwards the request to its local file system, and the result flows back through the same chain.
VFS maintains a v-node for each open file. If the file is local, the v-node points to an i-node. If the file is remote, the v-node points to a file handle on the remote server.
NFS Version 3 vs. Version 4
NFS has evolved significantly over the years. Two versions are particularly important:
| Feature | NFS v3 | NFS v4 |
|---|---|---|
| Server state | Stateless | Stateful |
| Open/Close | Not supported | Supported |
| Compound procedures | No | Yes |
| Mount transitivity | Not transitive (iterative) | Transitive (recursive) |
| File locking | Separate protocol | Integrated |
In NFS v3, the server is stateless: it does not track which clients have which files open. Every RPC from the client must include all the information the server needs to process the request (file handle, offset, etc.). This simplifies crash recovery — after a crash, the server simply restarts and clients re-send their requests — but it means the server cannot implement open/close semantics.
NFS v4 makes the server stateful: it supports open() and close() operations, allowing the server to track which clients have which files open. This enables better performance (compound procedures that batch multiple operations into a single RPC) and richer semantics (file locking integrated into the protocol rather than requiring a separate lock daemon).
Naming in NFS
NFS uses mount points to integrate remote file systems into the local name space. A client mounts a remote directory (exported by a server) at a local mount point, and from then on, files under that directory are accessed as if they were local. Different clients can mount the same remote directory at different local paths, so the global name space is not uniform — each client may see a different view.
The automounter extends this by mounting remote directories on demand. When a client accesses a path like /home/alice, the automounter intercepts the lookup, contacts the appropriate server, mounts Alice’s home directory, and the access proceeds transparently. When the directory is no longer in use, the automounter may unmount it to conserve resources.
In NFS v3, mount points are not transitive: if server A mounts a directory from server B, and a client mounts that directory from server A, the client will not see server B’s files. NFS v4 supports transitive mounts, allowing more flexible name space composition.
Semantics of File Sharing
A fundamental question for any DFS is: when one process modifies a file, when do other processes see the changes? There are several approaches:
UNIX semantics (one-copy semantics): Every write is immediately visible to all processes. This is the behavior of a single-machine file system and is the most intuitive, but it is expensive to implement in a DFS because every write must be immediately propagated to the server and invalidated in all caches.
Session semantics: Changes are visible only to the process that makes them until the file is closed. When the file is closed, changes are sent to the server and become visible to other processes that subsequently open the file. However, if two processes have the same file open and both are writing, the result is unpredictable. AFS and Coda use session semantics.
Immutable files: Files cannot be modified after creation. A “write” creates a new version of the file. This greatly simplifies sharing and replication because old versions never change.
Transactions: All changes to a file occur atomically, protected by the full ACID guarantees. This is the strongest semantic but also the most expensive.
File Locking in NFS
NFS v4 integrates file locking directly into the protocol. The locking operations are:
| Operation | Description |
|---|---|
| Lock | Acquire a lock on a range of bytes |
| Lockt | Test whether a conflicting lock exists |
| Locku | Release a lock on a range of bytes |
| Renew | Renew the lease on a lock |
Locks come with leases: a lease is a time-limited grant that must be periodically renewed. If a client crashes while holding a lock, the lease will eventually expire and the server will release the lock, preventing indefinite blocking of other clients. This is a clean solution to the problem of clients that die while holding locks.
Client-Side Caching
Client-side caching is critical for performance in DFS that use the upload/download model. Clients cache file data, attributes, file handles, and directory entries on the local machine. Subsequent accesses to cached data can be served locally, dramatically reducing network traffic and server load.
Research on AFS showed that client-side caching works well in practice because of three observations: (1) shared files are often valid for long periods, (2) sharing is predominantly read-only, and (3) file access exhibits strong temporal and spatial locality.
The key challenge is cache consistency: ensuring that cached copies reflect the latest version on the server. Two approaches are common:
Callback promises (used by Coda and AFS): When a client caches a file, the server promises to notify the client if the file is modified by another client. This notification invalidates the cached copy, and the client fetches a fresh copy on next access. This approach approximates UNIX semantics while maintaining the performance benefits of caching.
Polling (used by NFS): The client periodically checks whether its cached copy is still valid by comparing timestamps with the server. Given the current time \(T\) and the time the cache entry was last validated \(T_c\), the client checks if \(T - T_c < t\) (where \(t\) is a freshness interval). If so, the cache is considered valid. Otherwise, the client polls the server to compare the file’s modification timestamp. This is simpler than callbacks but can result in stale reads during the freshness interval.
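The polling rule reduces to a one-line comparison. The sketch below (hypothetical helper `cache_decision`; times are plain numbers rather than real clocks) makes the freshness check explicit:

```python
# NFS-style cache freshness check: trust the cached copy while
# T - Tc < t, otherwise poll the server to compare timestamps.

def cache_decision(now, last_validated, freshness_interval):
    if now - last_validated < freshness_interval:
        return "use cached copy"   # still inside the freshness window
    return "poll server"           # compare modification timestamps

print(cache_decision(10.0, 8.0, 3.0))   # use cached copy
print(cache_decision(10.0, 5.0, 3.0))   # poll server
```

The trade-off is visible in the first call: during the freshness window the client may serve a copy that is already stale on the server.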
Cluster-Based Distributed File Systems
In cluster-based distributed file systems, files are distributed across multiple servers in a cluster. There are two main approaches:
Whole-file distribution: Each file is stored entirely on a single server, but different files are spread across different servers. If one server goes down, its files are unavailable, but files on other servers remain accessible. Multiple files can be accessed in parallel (since they are on different servers).
File striping: Each file is split into blocks (stripes) that are distributed across multiple servers. Different blocks of the same file can be read in parallel, providing much higher bandwidth for large files. Even if one server goes down, only some blocks are lost (and with redundancy like RAID-style coding, no data is lost). Google’s GFS and Hadoop’s HDFS use this approach, storing each file in 64 MB chunks replicated across multiple nodes.
Lesson 10: Distributed Hashing
The Problem
Consider a distributed system that stores data across multiple nodes — for example, a distributed cache or a peer-to-peer file sharing system. A fundamental question arises: given a data item (identified by a key), which node should store it?
A naive approach is to use a hash function: compute hash(key) mod N, where \(N\) is the number of nodes. This distributes data evenly, but it has a fatal flaw: when a node is added or removed, \(N\) changes, and almost every key maps to a different node. This triggers a massive redistribution of data across the system — completely unacceptable in a dynamic environment where nodes join and leave frequently.
Consistent Hashing
Consistent hashing, introduced by Karger et al. in 1997, solves this problem elegantly. The key idea is to arrange the hash space as a circle (also called an identifier ring). Both nodes and keys are hashed onto this ring using the same hash function.
A key is assigned to the first node encountered when walking clockwise from the key’s position on the ring. This node is called the key’s successor. For example, in a ring with positions 0–7, if nodes sit at positions 0, 1, and 3, then:
- Key 1 maps to node 1 (successor(1) = 1)
- Key 2 maps to node 3 (successor(2) = 3)
- Key 6 maps to node 0 (successor(6) = 0, wrapping around)
The magic of consistent hashing is what happens when nodes join or leave:
Node join: When a new node joins the ring, it takes responsibility for the keys that fall between it and its predecessor — keys that were previously handled by its successor. Only these keys need to be moved; all other key assignments remain unchanged.
Node departure: When a node leaves, its keys are taken over by its successor. Again, only the departing node’s keys are affected.
In both cases, the number of keys that must be redistributed is proportional to \(K/N\) (where \(K\) is the total number of keys and \(N\) is the number of nodes) — the minimum possible disruption. Compare this to naive hashing, where a change in \(N\) remaps nearly all keys.
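The ring from the example above, with identifier space 0–7 and nodes at positions 0, 1, and 3, can be implemented in a few lines. This is an illustrative sketch (real systems hash node addresses and keys into a much larger space, often with virtual nodes for load balance):

```python
import bisect

# Consistent hashing: a key is owned by the first node clockwise from
# its position on the ring (its successor).

def successor(key, nodes, ring_size=8):
    nodes = sorted(nodes)
    i = bisect.bisect_left(nodes, key % ring_size)
    return nodes[i % len(nodes)]   # wrap past the highest position to node 0

nodes = [0, 1, 3]
assert successor(1, nodes) == 1    # key 1 -> node 1
assert successor(2, nodes) == 3    # key 2 -> node 3
assert successor(6, nodes) == 0    # key 6 wraps around to node 0

# A node joining at position 6 takes over only the keys between its
# predecessor (3) and itself; every other assignment is unchanged.
nodes_after_join = sorted(nodes + [6])
moved = [k for k in range(8)
         if successor(k, nodes) != successor(k, nodes_after_join)]
print(moved)   # [4, 5, 6]
```

Only three of the eight keys move, matching the \(K/N\) claim; under naive `hash(key) mod N`, changing \(N\) from 3 to 4 would remap almost all of them.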
Chord and Scalable Lookup
In a large distributed system, naive lookup on a consistent hash ring would require traversing the ring node by node — \(O(N)\) hops in the worst case. The Chord protocol (by Stoica et al., 2001) solves this with finger tables: each node maintains a small routing table with \(O(\log N)\) entries that point to exponentially spaced nodes around the ring. This allows any key to be found in \(O(\log N)\) hops, providing efficient and scalable lookup.
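A finger table can be computed directly from the ring. In Chord, entry \(i\) of node \(n\)'s table (for an \(m\)-bit identifier space) is \(successor((n + 2^{i-1}) \bmod 2^m)\); the sketch below uses the small example ring from the previous section (nodes at 0, 1, and 3 with \(m = 3\)):

```python
import bisect

# Chord finger tables: finger[i] = successor((n + 2^(i-1)) mod 2^m),
# giving each node O(log N) pointers at exponentially increasing distances.

def successor(key, nodes):
    nodes = sorted(nodes)
    i = bisect.bisect_left(nodes, key)
    return nodes[i % len(nodes)]        # wrap around the ring

def finger_table(n, nodes, m):
    ring = 2 ** m
    return [successor((n + 2 ** (i - 1)) % ring, nodes)
            for i in range(1, m + 1)]

# Ring of size 8 (m = 3) with nodes at 0, 1, 3.
print(finger_table(0, [0, 1, 3], 3))   # targets 1, 2, 4 -> [1, 3, 0]
print(finger_table(1, [0, 1, 3], 3))   # targets 2, 3, 5 -> [3, 3, 0]
```

Each lookup step forwards the query to the closest finger preceding the key, halving the remaining distance and yielding the \(O(\log N)\) hop bound.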
Consistent hashing is a foundational technique in modern distributed systems. It underlies Amazon’s Dynamo (which powers DynamoDB), Apache Cassandra, Akamai’s CDN, and many other large-scale distributed storage and caching systems. Its elegance lies in providing near-optimal load balancing and minimal disruption under changes in the node set, with a simple and easy-to-implement algorithm.