Virtual computing utilities
- SoftUDC: a software-based data center for utility computing.
By M. Kallahalla, M. Uysal, R. Swaminathan, D.E. Lowell, M. Wray, T. Christian, N. Edwards, C.I. Dalton, and F. Gittler, HP Labs.
IEEE Computer, Volume 37, Issue 11 (November 2004), pages 38-46.
- Commodity Grid Computing with Amazon's S3 and EC2, by Simson Garfinkel.
USENIX ;login:, February 2007, pp. 7-13.
- Optional: An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS, by Simson Garfinkel.
Technical Report TR-08-07, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, July 2007.
Cluster/storage services, service quality, SLAs
- Quorum: Flexible Quality of Service for Internet Services. By
Josep M. Blanquer, Antoni Batchelli, Klaus Schauser, and Rich Wolski, University of California, Santa Barbara. In NSDI 2005.
- Hippodrome: Running Circles Around Storage Administration. By
Eric Anderson, Michael Hobbs, Kimberly Keeton, Susan Spence, Mustafa Uysal, and Alistair Veitch, Hewlett-Packard Labs. In FAST 2002.
[Quorum slides], [other class slides], [Hippodrome slides]
Brick-based systems: state and availability
Both papers deal with the problem of reliable state in brick-based
systems: arrays of servers or storage nodes that are symmetric in the sense
that all elements of the array have equivalent functions. One advantage
of these systems is a flexible distribution of load across the nodes,
which is useful for adaptation and failure recovery. SSM addresses the
reliable-state problem in a simpler form, for soft state that may be discarded
after some time interval. FAB takes a similar approach for hard
(persistent) state in a brick-based storage array. FAB involves many details
that we will discuss later in the semester
(erasure coding, volume layout, quorum reconfiguration, etc.), so
don't worry about those for now. I would like you to understand the motivation
for the FAB structure and how it relates to SSM, and to understand how
FAB extends the SSM algorithm to apply to hard state that may be shared
by multiple clients. You can get by with reading the intro, Section 3.0,
and Section 4.1.
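For concreteness, here is a minimal sketch of the soft-state pattern SSM targets: a brick array holding discardable session state, written redundantly to a few bricks and read from any. The brick count, replication degree, and TTL are illustrative assumptions, not parameters from the paper.

    import random
    import time

    class Brick:
        """One node in a symmetric array; holds soft state with expiration."""
        def __init__(self):
            self.store = {}  # key -> (value, expiry_time)

        def put(self, key, value, ttl):
            self.store[key] = (value, time.time() + ttl)

        def get(self, key):
            entry = self.store.get(key)
            if entry is None or time.time() > entry[1]:
                return None  # missing or expired: soft state may be discarded
            return entry[0]

    class SoftStateStore:
        """Write each item to any W of N bricks; read from whichever brick
        answers. Load spreads across the array, and a failed brick loses
        nothing that cannot be regenerated or simply forgotten after the TTL."""
        def __init__(self, num_bricks=4, write_copies=2, ttl=30.0):
            self.bricks = [Brick() for _ in range(num_bricks)]
            self.write_copies = write_copies
            self.ttl = ttl

        def put(self, key, value):
            for brick in random.sample(self.bricks, self.write_copies):
                brick.put(key, value, self.ttl)

        def get(self, key):
            for brick in self.bricks:  # try bricks until one has a live copy
                value = brick.get(key)
                if value is not None:
                    return value
            return None  # lost or expired: the caller regenerates the state

    store = SoftStateStore()
    store.put("session:42", {"user": "alice"})
    print(store.get("session:42"))  # {'user': 'alice'} until the TTL passes

FAB must do more than this because hard state cannot be regenerated or forgotten; that is where its quorums and reconfiguration come in.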
[Slides for replicated state stores]
Atomic state updates in file systems, snapshots, and mirroring
We discuss requirements and implementation mechanisms for atomic durability
in a variety of contexts, focusing on file systems. The solutions are related
to atomic snapshots and mirroring.
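For a concrete reference point, here is a minimal sketch of the core mechanism the first two papers share: a write-ahead redo log with group commit, which batches several transactions' commit records into one synchronous disk write. The file name and record format are illustrative assumptions.

    import os

    class GroupCommitLog:
        """Write-ahead redo log: a transaction is durable once its records
        reach the log. Group commit amortizes the synchronous write by
        forcing a batch of pending commits with a single fsync."""
        def __init__(self, path="redo.log"):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
            self.pending = []  # records appended but not yet forced to disk

        def append(self, record: bytes):
            self.pending.append(record)

        def commit_group(self):
            # One write + one fsync makes every pending transaction durable.
            os.write(self.fd, b"".join(r + b"\n" for r in self.pending))
            os.fsync(self.fd)
            self.pending.clear()

    log = GroupCommitLog()
    log.append(b"T1: set balance=100")
    log.append(b"T2: set owner=alice")
    log.commit_group()  # both commits ride one disk write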
- Reimplementing the Cedar file system using logging and group commit. By R. Hagmann.
Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, November 1987.
- A simple and efficient implementation of a small database. By A. Birrell, M. Jones, and E. Wobber.
Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, November 1987.
- File System Design for an NFS File Server Appliance. By D. Hitz, J. Lau, and M. Malcolm.
Winter USENIX Technical Conference, 1994.
- SnapMirror: File System Based Asynchronous Mirroring for Disaster Recovery.
By Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman,
and Shane Owara. In USENIX Conference on File and Storage Technologies, January 2002.
[Slides for atomic durability].
Transactional storage
[Stasis slides]
Network File System (NFS), cache consistency, and leases
For NFS we focus on the recovery model and consistency:
the papers give an interesting historical and technical context, but don't get bogged
down in the details (performance of the original implementations, etc.).
The point is to get the basics of how the protocol works and how it has evolved.
- The Sun Network File System: Design, Implementation and Experience. By
R. Sandberg. Based on the original paper with David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon,
Proceedings of the Summer 1986 USENIX Technical Conference.
- NFS Version 3 Design and Implementation. By
Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, and David Hitz.
USENIX Summer Technical Conference, 1994.
Focus on the portions relating to the recovery model,
asynchronous writes, and cache consistency.
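The recovery model fits in one loop: the client retransmits until the server replies, so a crashed-and-rebooted server is indistinguishable from a slow one. This is safe only because the operations are idempotent (or deduplicated by a server reply cache). A minimal sketch, with a hypothetical transport function standing in for RPC:

    def nfs_call(send, request, timeout=1.0, max_timeout=30.0):
        """Retransmit until a reply arrives. The server keeps no per-client
        session state, so no recovery protocol is needed: the client just
        keeps trying ("NFS server not responding, still trying")."""
        while True:
            reply = send(request, timeout)  # hypothetical: returns None on timeout
            if reply is not None:
                return reply
            timeout = min(timeout * 2, max_timeout)  # back off and retry

    attempts = []
    def flaky_send(request, timeout):
        attempts.append(request)
        return "ok" if len(attempts) >= 3 else None  # replies on the third try

    print(nfs_call(flaky_send, ("WRITE", "/f", 0, b"data")))  # 'ok' after retries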
We also discuss data consistency semantics for distributed data sharing
and the evolution of lease mechanisms (also called delegations, callbacks,
and leased locks) as a solution for strong and robust consistency,
along with some of their tradeoffs and implications.
The NQ-NFS paper defines an early form of the consistency mechanisms in NFSv4.
- Not Quite NFS, Soft Cache Consistency for NFS. By Rick Macklem.
In USENIX Technical Conference, 1994.
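To make the lease mechanism concrete, here is a minimal sketch in the spirit of NQ-NFS read leases; the term length, interfaces, and recall behavior are illustrative assumptions. The key property is that consistency survives failures: an unreachable client's lease simply expires, so the server never blocks for more than one lease term.

    import time

    LEASE_TERM = 3.0  # seconds; kept short here so the demo finishes quickly

    def apply_write(filename, data):
        """Stand-in for the server's durable write path."""
        print(f"writing {len(data)} bytes to {filename}")

    class LeaseServer:
        """A client may trust its cached copy only while it holds an
        unexpired lease. A write recalls conflicting leases; if a holder
        never responds, the server just waits out the remaining term."""
        def __init__(self):
            self.read_leases = {}  # filename -> {client_id: expiry_time}

        def grant_read_lease(self, client_id, filename):
            expiry = time.time() + LEASE_TERM
            self.read_leases.setdefault(filename, {})[client_id] = expiry
            return expiry  # the client caches until expiry, renewing as needed

        def write(self, filename, data, recall):
            holders = self.read_leases.pop(filename, {})
            for client_id in holders:
                recall(client_id, filename)  # ask the client to invalidate now
            # A real server proceeds once the holders acknowledge; this
            # sketch conservatively waits for the leases to expire.
            wait = max(holders.values(), default=time.time()) - time.time()
            if wait > 0:
                time.sleep(wait)
            apply_write(filename, data)

    server = LeaseServer()
    server.grant_read_lease("clientA", "/f")
    server.write("/f", b"new contents", recall=lambda c, f: print(f"recall {f} from {c}"))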
[Slides on NFS, consistency, and leases]
Introduction to consensus
Before discussing more sophisticated distributed systems, we
introduce consensus issues and problems using some of the
consensus slides.
The asynchronous model fits Internet-based systems, so the
Fischer-Lynch-Paterson (FLP) impossibility result for consensus applies.
We discuss classical two-phase commit (2PC) and its blocking
conditions.
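A minimal sketch of the 2PC coordinator, with hypothetical participant stubs, makes the blocking window visible in the code: once a participant votes yes it is prepared, holds its locks, and cannot decide either way on its own.

    def two_phase_commit(participants, txn):
        """Phase 1 collects votes; each 'yes' voter is now prepared and must
        hold its locks until it learns the outcome. Phase 2 broadcasts the
        decision."""
        votes = [p.prepare(txn) for p in participants]
        decision = all(votes)
        # --- if the coordinator crashes here, prepared participants block:
        # --- they can neither commit nor abort unilaterally.
        for p in participants:
            (p.commit if decision else p.abort)(txn)
        return decision

    class Participant:
        """Hypothetical stub: always votes yes and prints the outcome."""
        def __init__(self, name): self.name = name
        def prepare(self, txn): print(f"{self.name}: prepared {txn}"); return True
        def commit(self, txn): print(f"{self.name}: commit {txn}")
        def abort(self, txn): print(f"{self.name}: abort {txn}")

    two_phase_commit([Participant("A"), Participant("B")], "T1")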
We defer a discussion of the Paxos nonblocking
asynchronous consensus algorithm until after we discuss a few
systems that use lock services based on Paxos. These include the
Frangipani scalable storage service and Google's cluster computing
tool set using the Chubby lock service.
Scalable storage systems
Frangipani is over 10 years old, but in my view it is still the
classic example of what is often called a parallel file system
or cluster file system: symmetric
file server nodes using locking to coordinate access
to a shared network block store. Frangipani's block store was
Petal, which has been superseded by FAB (above) among others.
Systems similar to Frangipani are now
widely used in cluster and supercomputing settings. They include
the Minnesota GFS, IBM's GPFS, and Lustre. Some systems centralize
metadata functions or factor them in different ways, such as NASD/Panasas,
Slice, IBM's StorageTank (TotalStorage SAN-FS), and NetApp GX.
The recent pNFS movement
supports parallel block storage in the context of standards-based NFS.
Ursa Minor and Ceph are recent works in this area.
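The structural idea behind these systems is easy to sketch: every symmetric server performs read-modify-write on shared metadata only while holding a lock from the lock service. The in-process classes below are illustrative stand-ins for a Paxos-backed lock service and a Petal/FAB-style shared virtual disk.

    import threading

    class LockService:
        """Stand-in for a distributed lock service."""
        def __init__(self):
            self._locks = {}
        def acquire(self, name):
            lock = self._locks.setdefault(name, threading.Lock())
            lock.acquire()
            return lock
        def release(self, lock):
            lock.release()

    class BlockStore:
        """Stand-in for a shared network virtual disk."""
        def __init__(self):
            self._blocks = {}
        def read(self, addr):
            return self._blocks[addr]
        def write(self, addr, value):
            self._blocks[addr] = value

    class FileServerNode:
        """Any node can serve any file: shared state lives in the block
        store, and the lock service serializes conflicting access."""
        def __init__(self, locks, blocks):
            self.locks, self.blocks = locks, blocks

        def append(self, inode_addr, data_addr, data):
            lock = self.locks.acquire(inode_addr)  # exclusive lock on this inode
            try:
                inode = self.blocks.read(inode_addr)  # read-modify-write
                self.blocks.write(data_addr, data)
                inode["size"] += len(data)
                inode["blocks"].append(data_addr)
                self.blocks.write(inode_addr, inode)
            finally:
                self.locks.release(lock)  # real servers cache locks until recalled

    locks, blocks = LockService(), BlockStore()
    blocks.write("inode7", {"size": 0, "blocks": []})
    FileServerNode(locks, blocks).append("inode7", "blk42", b"hello")
    print(blocks.read("inode7"))  # {'size': 5, 'blocks': ['blk42']}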
- Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee.
Frangipani: A scalable distributed file system. In
Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 224-237. ACM Press, October
1997.
- Optional: Ursa Minor: Versatile Cluster-based Storage.
Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie, Carnegie Mellon University. FAST 2005.
- Optional: Ceph: A Scalable, High-Performance Distributed File System.
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn, University of California, Santa Cruz. OSDI 2006.
[Slides on Petal/Frangipani, 6-up]
Inside Google services
Google Labs has produced many innovations and interesting papers on their
integrated tool set for data-intensive distributed cluster computing. The
hard distributed systems issues are encapsulated in the Chubby lock service,
and we focus on that. But we also discuss the computing model in
MapReduce and Bigtable and how it uses Chubby and Google's distributed
storage service (Google File System). Hadoop is an open-source variant
of these tools.
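To ground the programming model before the readings, here is the canonical word-count job expressed as the two functions a MapReduce user supplies, run by a toy in-process executor that stands in for the real cluster runtime:

    from collections import defaultdict

    # User code: the only two functions a MapReduce job supplies.
    def map_fn(doc_name, text):
        for word in text.split():
            yield word, 1  # emit intermediate (key, value) pairs

    def reduce_fn(word, counts):
        return word, sum(counts)  # combine all values for one key

    def run_mapreduce(inputs, map_fn, reduce_fn):
        """Toy single-process executor. The real system partitions map and
        reduce tasks across the cluster, reads and writes GFS, and uses
        Chubby to keep track of the master."""
        groups = defaultdict(list)  # the shuffle: group values by key
        for name, text in inputs.items():
            for key, value in map_fn(name, text):
                groups[key].append(value)
        return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

    docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'brown': 1, 'dog': 1, 'end': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}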
- The Chubby Lock Service for Loosely-Coupled Distributed Systems.
Mike Burrows, Google Inc. OSDI 2006.
- Bigtable: A Distributed Storage System for Structured Data.
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc. OSDI 2006.
- MapReduce: Simplified Data Processing on Large Clusters.
Jeffrey Dean and Sanjay Ghemawat. OSDI 2004.
[Google/MapReduce slides from Jeff Dean], [More Google slides],
[Slides for Google and Chubby]
Asynchronous replication, clocks, and consistency
Reading: Bayou
For the week of Monday, November 5 (three days), we discuss optimistic
replication and related material dealing with event ordering, logical
clocks, and vector clocks.
slides: [PDF, 6-up PDF]
We discuss two papers from the Bayou system. A good overview of Bayou
can be found in the revisionist summary paper, but we want to get into
the specifics of how asynchronous replication protocols work. For that
we will need to dig into the SOSP 1997 paper, Flexible Update Propagation
for Weakly Consistent Replication, by Petersen, Spreitzer, Terry, Theimer,
and Demers.
We also discuss vector clocks in some other classic systems.
These include the Coda high-availability file system, which supports disconnected (mobile)
operation, and the TreadMarks distributed shared memory system; both are easily found
on the web. We also touch on causal multicast and the state machine model as defined
in the classic Lamport paper, Time, Clocks, and the Ordering of Events in a
Distributed System (CACM, July 1978).
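A minimal sketch of the vector clock operations underlying Bayou's anti-entropy and Coda's version vectors; the node names and the conflict-reporting convention are illustrative:

    def vc_update(clock, node):
        """Tick the local component on a local event or send."""
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def vc_merge(a, b):
        """On receive: component-wise max of the two clocks."""
        return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

    def vc_leq(a, b):
        """a causally precedes (or equals) b iff every component of a <= b."""
        return all(v <= b.get(n, 0) for n, v in a.items())

    def vc_compare(a, b):
        if vc_leq(a, b) and vc_leq(b, a): return "equal"
        if vc_leq(a, b): return "a -> b"   # a happened before b
        if vc_leq(b, a): return "b -> a"
        return "concurrent"                # a potential update conflict

    x = vc_update({}, "A")                 # replica A writes: {A: 1}
    y = vc_update({}, "B")                 # replica B writes independently: {B: 1}
    print(vc_compare(x, y))                # concurrent: replicas must reconcile
    print(vc_compare(x, vc_merge(x, y)))   # a -> b: the merged clock dominates x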
Slides: [PDF, 6-up PDF]
Groups and group communication
Our look at Lamport's introduction to causal multicast
leads into a discussion of causal and totally ordered
group communication (CATOCS), also known as
atomic ordered multicast or virtual synchrony.
We discuss the process group approach
to reliable distributed computing, whose most forceful apostle
has been Ken Birman at Cornell. Dr. Birman and his colleagues
have worked with this approach through several generations of
systems over 20 years. We just touch on it.
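As a minimal illustration of the "totally ordered" half of CATOCS, here is a sketch of sequencer-based ordered multicast, one of the simplest ways to give every member the same delivery order; membership changes and failure handling, which are where these systems earn their keep, are omitted.

    import heapq

    class Sequencer:
        """Assigns one global sequence number per multicast message."""
        def __init__(self):
            self.next_seq = 0
        def order(self, msg):
            seq, self.next_seq = self.next_seq, self.next_seq + 1
            return seq, msg

    class Member:
        """Delivers messages strictly in sequence order, buffering any that
        arrive early, so every member sees the same total order."""
        def __init__(self):
            self.expected = 0
            self.pending = []   # min-heap of (seq, msg)
            self.delivered = []

        def receive(self, seq, msg):
            heapq.heappush(self.pending, (seq, msg))
            while self.pending and self.pending[0][0] == self.expected:
                _, m = heapq.heappop(self.pending)
                self.delivered.append(m)  # deliver to the application
                self.expected += 1

    seq = Sequencer()
    a, b = Member(), Member()
    m1, m2 = seq.order("deposit"), seq.order("withdraw")
    a.receive(*m2); a.receive(*m1)   # out-of-order arrival at member a
    b.receive(*m1); b.receive(*m2)
    assert a.delivered == b.delivered == ["deposit", "withdraw"]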
- The process group approach to reliable distributed computing, by Kenneth P. Birman,
in Communications of the ACM, Volume 36, Issue 12 (December 1993), pages 37-53.
- Optional: the original Isis/CATOCS paper.
Exploiting virtual synchrony in distributed systems, by K. Birman and T. Joseph,
in Proceedings of the Eleventh ACM Symposium on Operating Systems Principles,
Austin, Texas, pages 123-138, 1987.
Our discussion also touches on some of the points from the debate about CATOCS that
raged for a time. These readings are optional:
- Understanding the limitations of causally and totally ordered communication, by David R. Cheriton and Dale Skeen,
in Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles,
Asheville, North Carolina, pages 44-57, December 1993.
- A response to Cheriton and Skeen's criticism of causal and totally ordered communication, by Ken Birman,
in ACM SIGOPS Operating Systems Review, Volume 28, Issue 1 (January 1994), pages 11-21.
- Why bother with CATOCS?, by Robbert van Renesse,
in ACM SIGOPS Operating Systems Review, Volume 28, Issue 1 (January 1994), pages 22-27.
- To CATOCS or not to CATOCS, that is the ..., by Santosh K. Shrivastava,
in ACM SIGOPS Operating Systems Review, Volume 28, Issue 4 (October 1994), pages 11-14.
[Slides for virtual synchrony, 6up]
Consensus and highly available services
For the last week of classes, we finish up consensus.
We discuss the Paxos algorithm in depth, touch on Byzantine
consensus and Byzantine Fault Tolerance (BFT), and
discuss the Fox/Brewer "CAP Theorem" and how it relates
to the various systems we studied. A sketch of single-decree
Paxos appears after the readings below.
The new reading for
Paxos and consensus:
- How to build highly available systems with consensus, by Butler
Lampson. This paper teaches more than just Paxos, but I think it is the
best source for understanding Paxos. A full reading of this paper is
"optional", but it should be required reading for a PhD in computer systems.
- Paxos Made Live - An Engineering Perspective, by Tushar Chandra, Robert Griesemer, and Joshua Redstone, in PODC '07: 26th ACM Symposium on Principles of Distributed Computing. This paper has a good summary of Paxos and of the
issues in implementing it in Chubby and other systems.
- Optional: The Many Faces of Consensus in Distributed Systems.
By John Turek and Dennis Shasha.
IEEE Computer, Volume 25, Issue 6 (June 1992), pages 8-17.
[Slides for Consensus and Paxos, or "Paxos Really Made Simple"]
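As promised above, here is a minimal sketch of single-decree Paxos: acceptors run in-process, and retries, failure detection, and the multi-decree log machinery that real systems such as Chubby layer on top are all omitted.

    class Acceptor:
        """Remembers the highest ballot promised and the highest-ballot
        value accepted; a real acceptor makes both durable."""
        def __init__(self):
            self.promised = -1          # highest ballot we promised to honor
            self.accepted = (-1, None)  # (ballot, value) of highest accepted

        def prepare(self, ballot):
            if ballot > self.promised:
                self.promised = ballot
                return self.accepted    # promise + report any prior accepted value
            return None                 # reject: promised a higher ballot already

        def accept(self, ballot, value):
            if ballot >= self.promised:
                self.promised = ballot
                self.accepted = (ballot, value)
                return True
            return False

    def propose(acceptors, ballot, my_value):
        """One proposer round. Phase 1: collect promises from a majority.
        Phase 2: propose the highest-ballot value already accepted, or our
        own if none. A value is chosen once a majority accepts it."""
        majority = len(acceptors) // 2 + 1
        promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
        if len(promises) < majority:
            return None  # could not get a majority; retry with a higher ballot
        # Paxos's key invariant: adopt the value of the highest-ballot accept
        # seen, so any possibly-chosen value is preserved.
        prior_ballot, prior_value = max(promises)
        value = prior_value if prior_value is not None else my_value
        acks = sum(a.accept(ballot, value) for a in acceptors)
        return value if acks >= majority else None

    acceptors = [Acceptor() for _ in range(3)]
    print(propose(acceptors, ballot=1, my_value="X"))  # 'X' is chosen
    print(propose(acceptors, ballot=2, my_value="Y"))  # still 'X': a choice is stable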
Below are some topics we did not get to this semester, but wish we had.
Massive-scale storage and correlated failures
- Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated Failures.
Andreas Haeberlen, Alan Mislove, and Peter Druschel, Rice University. NSDI 2005.
- MOAT, NSDI 2006.
Incentives and accountability
- Do Incentives Build Robustness in BitTorrent?
Michael Piatek, Tomas Isdal, Thomas Anderson, and Arvind Krishnamurthy, University of Washington; Arun Venkataramani, University of Massachusetts Amherst. NSDI 2007.
- Strong Accountability for Network Storage.
Aydan R. Yumerefendi and Jeffrey S. Chase, Duke University. FAST 2007.