CPS 512 Distributed Systems Final study outline, Spring 22
http://www.cs.duke.edu/courses/spring22/compsci512/final-outline.txt

This is all shorthand, as an outline for concepts in the cited slide decks on the course web.
It starts from Feb 17, the start of material that was not in scope for the Feb 24 midterm.

13-consensus (2/17)
- Objective: make a service (a server application) reliable and available by replication.
- Application is a deterministic "state machine": run a copy on each of N servers (replicas).
- Each replica executes each operation/request/command in sequence, converges to same state.
- State machine replication (RSM/SMR): the consensus algorithm is independent of the app.
- Consensus safety: all replicas agree on the commands and their order (sequence): linearizable.
- Consensus algorithm is independent of the specific application. App must be deterministic.
- We focus on fail-stop Consensus algorithms: RSM/SMR with replicas that are faithful but may go silent.
- Three variants of Consensus algorithms/protocols: Viewstamped Replication, Paxos, or Raft.
- Replicas elect a leader/primary; the leader receives and sequences ops and replies to clients.
- Replicas monitor the leader with heartbeats/timers and (s)elect a new one if they lose contact with it.
- Consensus is CP in the CAP triangle: safe (consistent) but not necessarily live (available).
- FLP impossibility result (equivalent to CAP): a consensus alg may be safe (CP) or live (AP) but not both.
- Nuances of CAP and FLP: everything is fine if a majority stays up and partitions do not occur.
- Consensus uses majority quorum for safety, so it "gets stuck" if no quorum.
- A quorum is present exactly when a majority of servers are up and connected: who knows and how?
- An op commits only after a quorum agree to accept it at the same place (log slot) in the sequence: how? (see the sketch after this list)
- Leader replies to the client only after the op commits: how does the leader know it commits?
- Committed ops never revert (no forks, no rollbacks), even after leader failure: how?
- Replicas of the server application execute only committed ops and only in order: agree on a log prefix.
- Understand how/why quorum assures safety: a single timeline of ops, with no forks (linearizable).
- Clients track the current leader across fail-over: how?
- Requests and replies are neither lost nor duplicated, even across fail-over: how?
- Importance of views/terms. What causes a replica to start a new view/term (a Paxos "ballot")?
- In what respect is the view/term number a logical clock?
- Why is it important that the view/term number is a logical clock?
- What does a replica do when it learns of a new view number? How does it learn?
- When does a replica count votes? (two cases)
- When does a replica vote?
- Under what conditions might a replica vote no? (varies)
- On fail-over, how to ensure the new leader knows at least all the committed ops? (subtle, varies)
- Safety does not require that the new leader knows any uncommitted ops: why?
- If an uncommitted op is lost on fail-over, how is it recovered/restarted?
- How does a new leader know if a given op is committed? Possible answers: yes, no, maybe.
- "Maybe" is very subtle; e.g., see example scenarios and Raft Figure 8.
- If the leader knows an op is "maybe" committed, how does the leader ensure that the op commits?
- Outline a proof that Consensus is safe based on your answers to these questions.
- Is an RSM service "decentralized"? Discuss.
- What is the impact of RSM on service latency (response time)? Tail latency? Throughput? Cost?
- How does the choice of N affect RSM service latency? Tail latency? Throughput? Cost?
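
A minimal Go sketch (not from the slides) of the quorum bookkeeping a leader might keep for the "an op commits only after a quorum accepts it at the same log slot" point above. The names (Leader, RecordAck, commitIndex) are made up, and view/term checks, persistence, and the actual AppendEntries/Prepare messages are omitted:

    package main

    import "fmt"

    // Hypothetical leader-side bookkeeping for one replica group of size n.
    // acks[slot] is the set of replica IDs known to have accepted the entry
    // proposed for that slot in the current view/term.
    type Leader struct {
        n           int                  // number of replicas (including the leader)
        acks        map[int]map[int]bool // log slot -> replicas that accepted it
        commitIndex int                  // highest slot known committed (0 = none)
    }

    func NewLeader(n int) *Leader {
        return &Leader{n: n, acks: make(map[int]map[int]bool)}
    }

    // RecordAck notes that replica id accepted the entry at slot, then advances
    // commitIndex over any prefix of slots that a strict majority has accepted.
    // The leader replies to the client only after the op's slot is committed.
    func (l *Leader) RecordAck(slot, id int) {
        if l.acks[slot] == nil {
            l.acks[slot] = make(map[int]bool)
        }
        l.acks[slot][id] = true
        for len(l.acks[l.commitIndex+1]) > l.n/2 { // strict majority of n
            l.commitIndex++
        }
    }

    func main() {
        l := NewLeader(5) // majority is 3 of 5
        l.RecordAck(1, 1) // leader's own acceptance
        l.RecordAck(1, 2)
        fmt.Println("committed through:", l.commitIndex) // 0: only 2 of 5 accepted
        l.RecordAck(1, 3)
        fmt.Println("committed through:", l.commitIndex) // 1: majority reached
    }

Raft's commitIndex and VR's commit-number play this role; the leader replies to a client only after the commit point has passed the op's slot.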
14-topics
This was a catch-up day with no testable material. It previewed some later topics.
The slide deck has slides on RPC in exemplary storage services: NFS and RIFL.
- Reviewed reliable RPC in a sharded/replicated system, using RIFL K/V as an example.
- Network File System (NFS) as a service example.
- History: RPC in NFS.
- Object identifiers for RPC: NFS file handles.
- NFS state and the "stateless" failure recovery model.
- Implementing "statelessness": API design for idempotence and easy recovery, and its limitations.
- Impact of "stateless writes" on write throughput.
- Improving write throughput with the NFSv3 asynchronous write op and commit op.
- Client replays uncommitted async writes after a server failure.
- Use of sessionID to detect server failure; how sessionID is used in NFSv3 async writes.
- RIFL RPC: each RPC operates on an object replicated with Consensus.
- RIFL object state, including the per-object reply cache, is stored reliably by Consensus replication.
- So RIFL RPCs are exactly-once: APIs that are idempotent and "stateless" are no longer beneficial.

15-paxos
- Focus: Multi-Paxos as an example of Consensus replication, and comparison to choices in Raft and VR.
- Multi-Paxos is Paxos with a stable leader, similar to Raft and VR.
- With a stable leader, a change of leader (fail-over, election) occurs only if the leader appears to have failed.
- Multi-Paxos: a replica requests votes to be leader (Scout/Phase 1) only if the leader is silent for a liveness timeout.
- Tradeoffs for choosing liveness timeouts? E.g., risk of livelock.
- Separation of acceptor and replica roles: handling f concurrent failures needs only f+1 full replicas, not 2f+1.
- Asynchronous notifications of commitment by piggybacking (see figs illustrating this in Raft and VR too).
- Paxos: How to ensure that a new leader is up to date? E.g., knows at least all committed entries.
- Paxos: How does the leader ensure that any entries that might have committed do commit "for sure"?
- Raft: a new leader must be "at least as up to date" as each member of its voting quorum.
- Can you prove that a new leader in Raft knows at least all committed entries?
- In Paxos, can two or more nodes be acting as leader (Commander) concurrently? Can both commit slots?

16-lightning
- Material from lightning round presentations is not in scope for any exam.
- Blockchain material presented in this session is not in scope unless repeated in a later session.
- Bitcoin is our exemplary blockchain (open, public, trustless), though certain aspects are common to others.
- Blockchain (bitcoin) uses Nakamoto Consensus.
- RSM replication similar to fail-stop Consensus except where it is different.
- Leader changes frequently to limit the damage from an unfaithful leader.
- All ops/requests are signed under a keypair; an account number is a public key hash.
- Clients broadcast their ops/requests; they are eventually heard/propagated by a leader if valid.
- Each leader groups a sequence of ops into a "block" and broadcasts it asynchronously to all peers.
- Each block is securely linked to its predecessor block in the log (chain) by including a hash of the predecessor. (see the sketch after this list)
- The leader signs the block and receives a payment to its account if others vote to accept the block.
- When a replica receives a block, it checks for validity and accepts it if it is valid. (What conditions?)
- If replicas accept different concurrent competing blocks, a "fork" occurs and there are multiple versions.
- Forks heal because the longest branch dominates.
- Faithful replicas switch immediately to the dominant branch as soon as they learn of it.
- The leader for each block is chosen "randomly", e.g., via Proof of Work.
- A valid leader can prove to others that it is valid by cryptographically secure means.
- E.g., by Proof-of-work/Proof of Waste (PoW): a candidate leader guesses a nonce that yields a conforming hash.
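
This session is out of scope, but the hash-linking and longest-branch rules reappear in 22-nakamoto. A toy Go sketch, illustrative only (real blockchains also validate signatures, PoW difficulty, and the ops inside each block):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // A block carries the hash of its predecessor, so the chain cannot be
    // altered in the middle without changing every later hash.
    type Block struct {
        PrevHash [32]byte
        Payload  string // stands in for the signed ops in the block
    }

    func (b Block) Hash() [32]byte {
        return sha256.Sum256(append(b.PrevHash[:], []byte(b.Payload)...))
    }

    // validChain checks the hash links from the genesis block forward.
    func validChain(chain []Block) bool {
        for i := 1; i < len(chain); i++ {
            if chain[i].PrevHash != chain[i-1].Hash() {
                return false
            }
        }
        return true
    }

    // choose resolves a fork the Nakamoto way: take the longest valid branch.
    func choose(a, b []Block) []Block {
        if validChain(b) && (len(b) > len(a) || !validChain(a)) {
            return b
        }
        return a
    }

    func main() {
        g := Block{Payload: "genesis"}
        b1 := Block{PrevHash: g.Hash(), Payload: "block 1"}
        b2 := Block{PrevHash: b1.Hash(), Payload: "block 2"}
        short := []Block{g, b1}
        long := []Block{g, b1, b2}
        fmt.Println("longest branch length:", len(choose(short, long))) // 3
    }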
After Spring break:

17-2pc
- RSM replication and sharding are often used together: each shard is served by a replica group (e.g., RIFL, Lab4).
- Transactions propose a notion of "consistency" for server state that is stronger than others we have considered.
- Transactions are useful for complex operations that touch multiple data items.
- Transactions are executions of procedures/subprograms that may touch multiple data items.
- A transaction has a well-defined commit point at the end of the subprogram: it either commits or aborts.
- If a transaction commits, it is "all or nothing": all of its writes appear, or none of them.
- Transactions appear to execute and commit in serial order: they do not interfere with one another (isolated).
- Once a transaction commits, its updates are never reverted.
- An executing transaction sees the writes of preceding transactions.
- These consistency properties of transactions are called ACID: atomic, consistent, isolated, durable.
- What makes distributed transactions distributed? (Give an example of use.)
- What event initiates a 2PC protocol?
- What failure model does the 2PC protocol assume? What could go wrong under a more general failure model?
- Why does the two-phase commit (2PC) protocol require two phases?
- Why does 2PC require unanimous votes from participants, rather than merely a quorum? (see the sketch after this list)
- What causes a transaction to abort during 2PC?
- What happens in 2PC if the coordinator does not receive a vote from a participant (due to failure or message drop)?
- What happens in 2PC if a participant fails after voting, and does not receive the notification?
- What happens in 2PC if a non-failed participant does not receive a notification of the outcome of a prepared transaction?
- What happens in 2PC if a coordinator fails and does not recover?
- Impact of 2PC on tail latency?
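
A minimal Go sketch of the 2PC coordinator's two phases, referenced above. The Participant interface and all names are hypothetical; timeouts, retries, and logging of the decision are omitted. A single "no" vote (or a missing vote, treated as "no" after a timeout) forces abort; commitment requires unanimous "yes" votes:

    package main

    import "fmt"

    // Hypothetical participant interface: Prepare returns true only if the
    // participant can guarantee it will commit txID if told to (vote "yes").
    type Participant interface {
        Prepare(txID string) bool
        Commit(txID string)
        Abort(txID string)
    }

    // runTwoPC sketches the coordinator: phase 1 collects votes from every
    // participant; phase 2 notifies the outcome to all of them.
    func runTwoPC(txID string, parts []Participant) bool {
        commit := true
        for _, p := range parts { // phase 1: prepare/vote
            if !p.Prepare(txID) {
                commit = false
                break
            }
        }
        for _, p := range parts { // phase 2: notify outcome
            if commit {
                p.Commit(txID)
            } else {
                p.Abort(txID)
            }
        }
        return commit
    }

    // A trivial in-memory participant for the demo.
    type kvShard struct {
        name    string
        voteYes bool
    }

    func (s *kvShard) Prepare(tx string) bool { fmt.Println(s.name, "prepares", tx); return s.voteYes }
    func (s *kvShard) Commit(tx string)       { fmt.Println(s.name, "commits", tx) }
    func (s *kvShard) Abort(tx string)        { fmt.Println(s.name, "aborts", tx) }

    func main() {
        parts := []Participant{&kvShard{"shardA", true}, &kvShard{"shardB", true}}
        fmt.Println("T1 committed:", runTwoPC("T1", parts))
    }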
18-ctp
- How does RSM replication help to resolve the problem of a failed coordinator?
- Why not replicate the transaction coordinator too? (e.g., in client/server transactions like RIFL, Lab4)
- Contrast a 2PC round (phase 1 then phase 2) with a Consensus round.
- Client/server transaction model (e.g., Thor, RIFL, Lab4): clients read/write storage servers, execute/cache locally.
- Is the write set W(T) always a subset of the read set R(T) for a given transaction T?
- At what point do R(T) and W(T) become known to the client?
- At what point do R(T) and W(T) become known to a server?
- Can a server "guess" R(T), since it serves the get ops?
- For transactions on sharded state, is it sufficient to send to each shard just the subsets of R(T) and W(T) that it "owns"?
- Does any server know the entirety of R(T) and W(T), e.g., in RIFL transactions? Why?
- Quick discussion of transactional memory (Intel TSX) is not testable, i.e., out of scope for the exam.
- How does RIFL accommodate resharding, e.g., for reliable (AMO) RPC? What assumptions does it make?
- How does RIFL integrate reliable RPC (e.g., the reply cache) with the replication protocol?
- Garbage collection of completion records in RIFL is not testable, i.e., out of scope.
- What could go wrong if Ramcloud/RIFL omits its OCC validation check? ("current version is not expected version")
- A "prepared object" on a server is an element of W(T) for some T in the 2PC "prepared" state.
- Why does a server block writes to prepared objects? ("hide", "lock") What does it do if a write is attempted?
- What could go wrong if it did not block these writes?
- Can a server permit reads of prepared objects?
- What could go wrong if a server fails/restarts and "forgets" that some object is prepared?
- If a client (coordinator) for T fails and does not recover, what happens to T's prepared objects?
- CTP is "Cooperative Termination Protocol": server-driven 2PC recovery for coordinator failures.
- CTP: if the coordinator fails or disconnects, participants/servers restart the 2PC protocol when a liveness timer fires.
- CTP: abort T unless all participants already voted to prepare.
- Why can't CTP abort T if all participants have already voted to prepare T?
- CTP requires that each participant can identify the others. How does RIFL ensure this even across resharding?
- What if the coordinator reappears during CTP and tries to commit or abort T?
- How can RIFL transactions be "sure" that all participants are available for CTP?
- Does CTP ensure that 2PC is live? Are there any cases in which 2PC "stalls" even with CTP?
- If yes, what is the effect/impact of such a 2PC stall? How does it ever get "fixed"?

19-base
- "AP means forks". Discuss.
- "There are lots of ways to repair forks in AP services." Discuss.
- "An AP service can never be linearizable." Discuss.
- Is it possible to implement transactions in an AP system?
- In what sense is Dynamo (RWN quorum) "highly available", e.g., relative to replication by Consensus?
- How does the choice of R+W quorum sizes affect consistency and availability? (see the sketch after this list)
- Is there a configuration of RWN that is equivalent to Consensus? "Stronger" than Consensus?
- What is the RWN configuration that minimizes tail latency? What is the cost of this choice?
- What choices for RWN require some ability to "merge" updates to an object, e.g., to repair forks?
- How does Dynamo arrange for merging of updates to an object?
- Who performs the merging of updates to an object, and how do they know when/what to merge and how?
- Describe a scenario in which "syntactic reconciliation" of conflicting versions is possible. How is it done?
- Why is it easy to merge conflicting versions of a shopping cart? What is the cost?
- CALM/CRDT material: the meaning of monotonicity and CRDT are not testable.
- CALM: Programs "like shopping cart" are coordination-free.
- CALM: coordination-free: a distributed program that does not require agreement on ordering.
- Coordination-free programs can be replicated with protocols weaker than Consensus (not linearizable).
- CALM: coordination-free programs can be available and consistent even under partition.
- Coordination-free programs are the exception to the CAP rule.
- CALM is a "positive result" (can say yes for some) while CAP is "negative" (can't say yes for all).
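
A minimal Go sketch of two points from 19-base: the R + W > N quorum-intersection condition, and an application-level merge of forked versions (shopping cart as set union). The function names are made up; Dynamo's real get/put API and version-vector "context" handling are richer:

    package main

    import "fmt"

    // RWN quorum arithmetic: with N replicas, a write waits for W acks and a
    // read waits for R replies. If R+W > N, every read quorum intersects every
    // write quorum, so a read sees the latest completed write; smaller R and W
    // favor availability and latency but allow stale reads and forking writes.
    func readSeesLatestWrite(n, r, w int) bool {
        return r+w > n
    }

    // Application-specific merge for a fork: a shopping cart merges two
    // conflicting versions by set union. The cost: deleted items may reappear.
    func mergeCarts(a, b map[string]bool) map[string]bool {
        merged := make(map[string]bool)
        for item := range a {
            merged[item] = true
        }
        for item := range b {
            merged[item] = true
        }
        return merged
    }

    func main() {
        fmt.Println(readSeesLatestWrite(3, 2, 2)) // true: read quorum intersects write quorum
        fmt.Println(readSeesLatestWrite(3, 1, 1)) // false: fast, but stale reads and forks possible

        v1 := map[string]bool{"book": true}
        v2 := map[string]bool{"pen": true}
        fmt.Println(len(mergeCarts(v1, v2)), "items after merge") // 2
    }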
20-clocks
- Happens-before relation (a) for processes sending messages, (b) for threads accessing memory.
- Why is it "impossible" for a happens-before graph to have a cycle?
- If a memory system executes puts/gets in the "wrong" order, it can create a happens-before cycle.
- Don't try to run programs on memory systems that create such cycles.
- Modern memory systems are safe for correct programs.
- The three rules of happens-before.
- Why is program order important for happened-before? What if a program has a loop?
- What does "transitive closure of a relation" mean?
- Happens-before (a -> b) is also called causal precedence, or causality, even if a did not affect b.
- What is a partial order? How is it different from a total order?
- How can there be many possible total orders that are all consistent with (compliant with) the same partial order?
- Happens-before defines a (partial) order of events, also called "logical time".
- Happens-before defines no ordering among concurrent events.
- That is what we mean by concurrency: events are concurrent if neither happened-before the other.
- The idea of multicast as a "group chat thread". Replication as multicast.
- Examples of inconsistencies that might be observed if multicast delivery order is not causal.
- How does a cut define a global state of a distributed program as it executes ("wavefront")?
- What could go wrong if an observer observes a global state that is an inconsistent cut? What breaks?
- Is there a causal order of the events in an inconsistent cut? If so, in what sense is it "inconsistent"?
- Procedure to maintain logical clocks; timestamping of events with logical clocks.
- Is the numbering of Consensus views (ballots, terms) based on logical clocks? How/why?
- In Consensus, the leader preserves committed ops from all preceding views.
- What does that "preceding" mean with respect to logical time? Can ops commit in "concurrent" views?
- An ordering of events in a system by logical timestamps is a causal order.
- Is the logical clock ordering the only causal order for those events? How could I find a different causal order?
- Is the logical clock ordering of the events uniquely defined?
- Lamport's Mutex example: not testable.
- How to know when an incoming message is the "next" message in a causal order? (stable, committed)
- cbcast and catocs: not testable.

21-vector
- If two events have the same logical clock timestamp, what can we say about their happened-before order?
- What is the Clock Condition? Give an example that shows the converse to be false. ("false positives")
- How would you obtain the vector clock corresponding to a cut? Or draw the cut corresponding to a vector clock?
- Rules to manage vector clocks and vector timestamps. (see the sketch after this list)
- A vector clock has strictly more information than a logical clock: discuss.
- Why might you use a map to represent a vector clock? What do the x and y values in the map mean?
- Given vector clocks (timestamps) for two events, what can I say about their happened-before relationship?
- How could I use vector timestamps to define a linear timeline (total event order) that respects causality?
- In an AP timeline fork, what do we know about the vector timestamps for events in either branch of the fork?
- Bayou is an AP version of RSM consensus replication: discuss.
- AP replication must always have some notion of a tentative update: discuss.
- Bayou uses vector clocks to find the "delta" of two cuts: discuss.
- Could a committed update ever be lost in Bayou?
- Could a committed update ever revert in Bayou?
- Can Bayou ever know that two concurrent updates are concurrent?
- Can Bayou always tell that two concurrent updates are concurrent?
- A Bayou replica may reorder updates. Does it ever reorder updates that are not concurrent?
- Does a Bayou replica ever reorder updates that are not known committed?
- Does a Bayou replica ever insert newly discovered updates before known updates? When/why?
- How is it that Bayou can handle concurrent updates and merge conflicting updates without vector timestamps?
- Why is it necessary that a Bayou application is deterministic?
- Bayou handles stability/commitment without matrix clocks. What are the pros/cons of the solution, vs. matrix clocks?
- Bayou log truncation (discarding updates): not testable.
- Bayou dependency checks and merge procedures: when used, why needed? No deep questions on this, I promise.
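
A minimal Go sketch of the vector clock rules referenced above (the names VC, Tick, Merge, LessEq are mine): increment your own entry on a local or send event, take the elementwise max on receive, and compare elementwise to decide happened-before vs. concurrent:

    package main

    import "fmt"

    // A vector clock as a map from node ID to that node's event count
    // (the "x" is the node, the "y" is how many of its events are known).
    type VC map[string]int

    // Tick: increment own entry on a local event or before sending.
    func (v VC) Tick(self string) { v[self]++ }

    // Merge: on receive, take the elementwise max of the two clocks,
    // then Tick for the receive event itself.
    func (v VC) Merge(other VC) {
        for node, count := range other {
            if count > v[node] {
                v[node] = count
            }
        }
    }

    // LessEq reports whether v <= w elementwise, i.e., the event stamped v
    // happened-before (or equals) the event stamped w.
    func (v VC) LessEq(w VC) bool {
        for node, count := range v {
            if count > w[node] {
                return false
            }
        }
        return true
    }

    // Concurrent: neither happened-before the other.
    func Concurrent(a, b VC) bool { return !a.LessEq(b) && !b.LessEq(a) }

    func main() {
        a := VC{"p1": 2, "p2": 0}
        b := VC{"p1": 1, "p2": 3}
        fmt.Println(a.LessEq(b), Concurrent(a, b)) // false true: a and b are concurrent
    }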
22-nakamoto: Resolving forks, and blockchain forks
- "AP means forks" and "There are lots of ways to repair forks", revisited.
- Review Dynamo's RWN quorum from 19-base.
- RWN write: broadcast/multicast to all N replicas, wait for W responses.
- RWN read: multicast read to all N replicas, wait for R responses.
- Depending on the choice of R and W, stale reads are possible.
- Stale reads make it possible to have conflicting/concurrent writes. Fork!
- "The continuum of RWN choices makes consistency vs. availability a continuous tradeoff." Discuss.
- How to resolve forks? Detect conflicting values among the replicas, and pass them to the app: "merge this please".
- Why might a Dynamo coordinator return multiple values for a read (get)? (possibly)
- Could a single Dynamo replica return multiple values for a read? Or are the values always from different replicas?
- Dynamo version vectors are per-object: they count the number of ops on that object at each replica.
- They work the same as vector clocks/timestamps: use them to determine update ordering and concurrency.
- If the application discovers concurrent updates (returned from a get), it must merge them and write back the merged value.
- Example of application-specific merge: shopping cart (set union) from 19-base.
- After the merge, how does the app determine the timestamp ("context") to write back with the merged value?
- Blockchains (bitcoin) are also AP and may similarly fork due to races and partitions, and also disruption attacks.
- Blockchain forks occur when two or more contending leaders attempt to append a log segment concurrently.
- Blockchain replicas that "see" the fork take the longest ("heaviest") branch and discard any others.
- When a fork occurs, contending leaders (miners) race to build out each branch.
- One branch wins and everyone joins it: a "tug of war" in which everyone pulls for the winning team.
- When a replica discards a branch, it reverts any ops it received on the branch.
- Ops from a discarded branch typically reappear in the longer branch (possibly reordered) if still valid.
- Blockchain ops are "always tentative": there is no commit, no stability.
- Blockchains "work" because the probability that an op reverts approaches zero as the dominant branch grows over time.
- Other material on blockchains in this deck is not testable, unless it also appears in 24-pow.

23-tails
- Review of a key concept of the course: services as the unit of scaling, composition, state isolation, and recovery.
- Latency of a service: another term for response time (R): time for the server to generate a reply, or for the reply to arrive.
- Tail latency: R for the "slowest" replies (e.g., the slowest 1%, 0.1%, 0.01%, i.e., the 99th, 99.9th, and 99.99th percentiles).
- Why is tail latency important? Why is it so hard to predict (e.g., with back-of-napkin queue models)?
- We have seen how replication (Consensus, RWN), elastic services, transactions, and dynamic (re)sharding affect tail latency.
- What is the relationship between load balancing and tail latency?
- Why/how do parallel operations (e.g., Web search, 2PC) affect tail latency? (importance of stragglers)
- Why does tail latency tend to grow with the degree of parallelism ("fanout")?
- Why does latency vary across individual servers executing the same system/application code and workload?
- Tools for tail-tolerant services: what coordination choices tend to reduce tail latency?
- Compare the tail-tolerance of 2PC, Consensus (e.g., Paxos), and RWN.
- For RWN, how is tail latency related to configuration choices for availability and consistency?
- Probabilistic Bounded Staleness (PBS): a method to quantify consistency and latency distributions for RWN.
- PBS: understand the shape of the graphs in the slides, but details are not testable, out of scope.
- Power of Two Choices randomized load balancing (P2C) to select a "good" server: not testable, out of scope.
- Tail-tolerant adaptations: hedged and tied requests (from Tail at Scale). (see the sketch after this list)
- How might you apply the concept of hedged/tied requests to RWN quorum reads? Discuss benefits and risks.
- Harvesting search results: quit when "out of time and good enough". How is it "like" RWN quorum reads?
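
A minimal Go sketch of a hedged request, referenced above. The names and the hedge delay are made up; "tied requests" additionally tell each replica about the other copy so the duplicate can be cancelled:

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // hedgedGet sends the request to one replica; if no reply arrives within a
    // hedge delay (e.g., roughly the 95th-percentile latency), it sends the same
    // request to a second replica and takes whichever reply arrives first.
    // This trims the latency tail at the cost of some extra load.
    func hedgedGet(replicas []func() string, hedgeDelay time.Duration) string {
        ch := make(chan string, 2) // buffered so the losing reply does not block
        go func() { ch <- replicas[0]() }()
        select {
        case reply := <-ch:
            return reply
        case <-time.After(hedgeDelay):
            go func() { ch <- replicas[1]() }()
            return <-ch // first reply wins, from either replica
        }
    }

    func main() {
        slow := func() string { time.Sleep(200 * time.Millisecond); return "slow replica" }
        fast := func() string { time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond); return "fast replica" }
        fmt.Println(hedgedGet([]func() string{slow, fast}, 50*time.Millisecond))
    }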
24-pow
- How "blockchain is like Bayou".
- Byzantine fault model.
- Details of PBFT Consensus: not testable, out of scope.
- Proof of Work "mining" (PoW): why does it exist? What does it do? Why does Chase hate it so much? (see the sketch at the end of this outline)
- What are sybils? What is a sybil attack? (e.g., on a voting protocol)
- How does PoW defeat sybil attacks for leader selection?
- The bitcoin broadcast overlay: how does it differ from anti-entropy?
- What metrics does Bitcoin-NG use to quantify propagation time for new blocks?
- Why do longer propagation times lead to more forks (contending leaders, i.e., racing miners)?
- Why do shorter inter-block times increase forks?
- Why do larger blocks increase forks?
- How does Bitcoin-NG propose to improve payment throughput without increasing forks?
- Why are blockchain forks bad?
- Other blockchain details, or Bitcoin-NG details, are not testable.
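
A toy Go sketch of the PoW nonce search referenced above (also in 16-lightning). Real Bitcoin mining applies a numeric difficulty target to block headers, but the point is the same: the proof is expensive to produce and cheap to verify, which makes sybil identities costly for leader selection:

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "strings"
    )

    // mine searches for a nonce such that the hash of (blockData, nonce) has a
    // required number of leading zero hex digits. Verifying takes one hash.
    func mine(blockData string, difficulty int) (uint64, string) {
        prefix := strings.Repeat("0", difficulty)
        for nonce := uint64(0); ; nonce++ {
            buf := make([]byte, 8)
            binary.BigEndian.PutUint64(buf, nonce)
            h := sha256.Sum256(append([]byte(blockData), buf...))
            hex := fmt.Sprintf("%x", h)
            if strings.HasPrefix(hex, prefix) {
                return nonce, hex
            }
        }
    }

    func main() {
        nonce, hash := mine("block payload + prev hash", 4) // tiny difficulty for the demo
        fmt.Println("nonce:", nonce, "hash:", hash)
    }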