CPS 212: Distributed * Systems
home calendar topics work resources

Introduction: Scalable Services and Their State

We start by discussing the motivation for incrementally scalable systems, the structure of cloud services and Internet-scale services, and some of the challenges that they face. We introduce some basic topics in managing data for these services: recovery and consistency.

State Partitioning and Coordination

This unit addresses two closely related needs for scalable services: distribute state and functions across the participating servers, and coordinate control and ownership of data in a way that is resistant to failures. The focus is on basic techniques that appear in many different systems: leased locks, hashed key/value stores, and consistent hashing. We also discuss relationships to data consistency and recovery.

Name Services and DNS

This unit is a first look at large-scale multi-domain services, using the Domain Name Service as an example. New issues: data integrity and security of ownership, location, and control. We discuss the current DNS implementation, security extensions (DNSSEC), and an alternative implementation based on a distributed key-value store (DHT). Model-driven caching (with Zipf distributions) is also useful for Web caching and other areas as well, but we won't be too concerned with the models themselves.

A Wide-area File Store on a DHT

Pond is a prototype of an ambitious architecture for a wide-area file store with a configurable consistency model, called OceanStore. Like CoDoNS it is based on a distributed hash table (a DHT system called Tapestry) with replication, and it uses hashing and signing with public key cryptography. Pond also gives a first look at Byzantine fault tolerance (BFT), primary-copy replication, erasure codes for redundancy, version trees and snapshot history ("time travel"), and Merkle trees. The important points to focus on are how Oceanstore maps a file abstraction with configurable consistency and snapshots onto the underlying key-value store, and keeps it robust and secure across a wide range of failures of its components.

Scalable File Systems

These systems illustrate a good continuum of redundancy schemes and various degrees of metadata decentralization. Like Pond (and Centrifuge and CoDoNs) they use primary write protocols (e.g., primary/backup replication).


Brick Storage with Voting

These systems illustrate uses of voting protocols as an alternative to primary schemes.


Scalable Key/Value Stores

We discuss related approaches for scalable state storage by major application service providers, e.g., Yahoo and Facebook. These readings are optional.


Asynchronous Replication, Clocks, and Consistency

Optimistic replication and related material dealing with event ordering, logical clocks, and vector clocks.

We discuss two papers from the Bayou system. A good overview of Bayou can be found in the revisionist summary paper:

But we want to get into the specifics of how asynchronous replication protocols work. For that we will need to dig into the SOSP 1997 paper:


Consensus

We discuss the Paxos algorithm in depth, touch on Byzantine consensus and Byzantine Fault Tolerance (BFT), and discuss the Fox/Brewer "CAP Theorem" and how it relates to the various systems we studied.


Authentication, Authorization, and Accountability

[BAN90] Michael Burrows, Martin Abadi, and Roger Needham. A logic of authentication. ACM Trans. Comput. Syst., 8(1):18-36, 1990. [ bib | DOI ]
[CATS07] Aydan R. Yumerefendi and Jeffrey S. Chase. Strong accountability for network storage. Trans. Storage, 3(3):11, 2007. [ bib | DOI ]


Topics

[Antfarm09] Ryan S. Peterson and Emin Gün Sirer. Antfarm: efficient content distribution with managed swarms. In NSDI'09: Proceedings of the 6th USENIX symposium on Networked Systems Design and Implementation, pages 107-122, April 2009. USENIX Association. [ bib ]
[Remus08] Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield. Remus: high availability via asynchronous virtual machine replication. In NSDI'08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, pages 161-174, April 2008. USENIX Association. [ bib ]