CPS 212: Distributed Information Systems
home calendar topics work resources

Virtual computing utilities


Cluster/storage services, service quality, SLAs


[Quorum slides], [other class slides]
[Hippodrome slides]

Brick-based systems: state and availability

Both papers deal with the problem of reliable state in brick-based systems (arrays of servers or storage nodes that are symmetric in the sense that all elements of the array have equivalent functions). One advantage of these systems is a flexible distribution of load across the nodes, which is useful for adaptation and failure recovery. SSM deals with the reliable state problem in a simpler form for soft state that may be discarded after some time interval. FAB uses a similar approach for hard or persistent state in a brick-based storage array. FAB involves many details that we will discuss later in the semester (erasure-coding, volume layout, quorum reconfiguration, etc.), so don't worry about those for now. I would like you to understand the motivation for the FAB structure and how it relates to SSM, and to understand how FAB extends the SSM algorithm to apply to hard state that may be shared by multiple clients. You can get by with reading the intro, Section 3.0, and Section 4.1.

[Slides for replicated state stores]


Atomic state updates in file systems, snapshots, and mirroring

We discuss requirements and implementation mechanisms for atomic durability in a variety of contexts, focusing on file systems. The solutions are related to atomic snapshots and mirroring. [Slides for atomic durability].

Transactional storage

[Stasis slides]

Network File System (NFS), cache consistency, and leases

For NFS we focus on the recovery model and consistency: the papers give an interesting historical and technical context, but don't get bogged down in the details (performance of the original implementations, etc.). The point is to get the basics of how the protocol works and how it has evolved. [Slides on NFS, consistency, and leases]

Introduction to consensus

Before discussing more sophisticated distributed systems, we introduce consensus issues and problems using some of the consensus slides. The asynchronous model applies to Internet-based systems, and the Fischer-Lynch-Patterson impossibility of consensus result applies. We discuss classical two-phase commit (2PC) and its blocking conditions.

We defer a discussion of the Paxos nonblocking asynchronous consensus algorithm until after we discuss a few systems that use lock services based on Paxos. These include the Frangipani scalable storage service and Google's cluster computing tool set using the Chubby lock service.


Scalable storage systems

Frangipani is over 10 years old, but in my view it is still the classic example of what is often called a parallel file system or cluster file system: symmetric file server nodes using locking to coordinate access to a shared network block store. Frangipani's block store was Petal, which has been superseded by FAB (above) among others.

Systems similar to Frangipani are now widely used in cluster and supercomputing settings. They include the Minnesota GFS, IBM's GPFS, and Lustre. Some systems centralize metadata functions or factor them in different ways, such as NASD/Panasas, Slice, IBM's StorageTank (TotalStorage SAN-FS), and NetApp GX. The recent pNFS movement supports parallel block storage in the context of standards-based NFS. Ursa Minor and Ceph are recent works in this area.

[Slides on Petal/Frangipani, 6-up]

Inside Google services

Google Labs has produced many innovations and interesting papers on their integrated tool set for data-intensive distributed cluster computing. The hard distributed systems issues are encapsulated in the Chubby lock service, and we focus on that. But we also discuss the computing model in MapReduce and Bigtable and how it uses Chubby and Google's distributed storage service (Google File System). Hadoop is an open-source variant of these tools. [Slides for Google and Chubby]

Asynchronous replication, clocks, and consistency

Reading: Bayou

For the week of Monday, November 5 we discuss optimistic replication and related material dealing with event ordering, logical clocks, and vector clocks. (three days)

slides: [PDF,6-up PDF]

We discuss two papers from the Bayou system. A good overview of Bayou can be found in the revisionist summary paper:

But we want to get into the specifics of how asynchronous replication protocols work. For that we will need to dig into the SOSP 1997 paper: We also discuss vector clocks in some other classic systems. These include the Coda high-availability file system that supports disconnected (mobile) operation, and the Treadmarks distributed shared memory system. These are easily found on the web. We also touch on causal multicast and the state machine model as defined in the classic Lamport paper: Slides: [PDF,6-up PDF]

Groups and group communication

Our look at Lamport's introduction to causal multicast leads into a discussion of causal and totally ordered group communication (CATOCS), also known as atomic ordered multicast or virtual synchrony. We discuss the process group approach to reliable distributed computing, whose most forceful apostle has been Ken Birman at Cornell. Dr. Birman and his colleagues have worked with this approach through several generations of systems over 20 years. We just touch on it. Our discussion also touches on some of the points from the debate about CATOCS that raged for a time. These readings are optional: [Slides for virtual synchrony, 6up]

Consensus and highly available services

For the last week of classes, we finish up consensus. We discuss the Paxos algorithm in depth, touch on Byzantine consensus and Byzantine Fault Tolerance (BFT), and discuss the Fox/Brewer "CAP Theorem" and how it relates to the various systems we studied.

The new reading for Paxos and consensus:

[Slides for Consensus and Paxos, or "Paxos Really Made Simple"]
Below are some topics we did not get to this semester, but wish we had.

Massive-scale storage and correlated failures


Incentives and accountability