Below you will find pages that utilize the taxonomy term “scalable”
Kleppmann Chapter 09 - Consistency and Consensus
Consistency Guarantees

- eventual consistency: eventually the data converges
- different from transaction isolation:
  - isolation - avoiding race conditions due to concurrent execution
  - consistency - coordinating replica state
- levels:
  - strongest - linearizability
  - causality and total ordering
  - commit in a distributed system: the consensus problem

Linearizability

- the one-replica illusion (one copy of the data)
- guarantees a read returns the most recent value - a recency guarantee
- read A / write C / read A: ok
- read A / begin write C / read B / end write C / read A: not ok - the read is concurrent with the write
- once a read returns the new value, every later read must also return it (no flipping back)
- cas(x, v_old, v_new) - compare-and-set (a sketch follows this list)

vs Serializability

- serializability is about transactions guaranteeing some sequential order
- linearizability is a recency guarantee on individual reads and writes

Importance

- locking and leader election
  - a lock must be linearizable: all nodes must agree on who holds the lock
  - ZooKeeper - used for distributed locking and leader election
- constraints and uniqueness guarantees
  - enforcing uniqueness (username, filename) is similar to acquiring a lock, and similar to a cas operation
  - constraints such as "bank balance >= 0" or "two people don't book the same flight" require a single up-to-date value for the account balance or seat occupancy
  - uniqueness constraints in DBs are linearizable
  - foreign key and attribute constraints can be implemented without linearizability
- cross-channel timing dependencies
  - a file storage service is not linearizable: two requests go into it, one to store an image and another to resize it, and the resizer may see a stale copy
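As a concrete illustration of the cas primitive mentioned above, here is a minimal single-node sketch in Python. The `Register` class and its locking discipline are my own illustrative names, not from the book; the point is that cas on one node is easy, while providing the same guarantee across replicas is exactly the consensus problem the chapter builds toward.

```python
import threading

class Register:
    """A single register supporting atomic compare-and-set."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, v_old, v_new):
        """Set the register to v_new only if it still holds v_old."""
        with self._lock:
            if self._value == v_old:
                self._value = v_new
                return True
            return False

# Enforcing a uniqueness constraint (e.g. claiming a username) is a cas
# from "unclaimed" (None) to an owner:
username = Register(value=None)
assert username.cas(None, "alice")      # first claim succeeds
assert not username.cas(None, "bob")    # second claim fails
```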
Kleppmann Chapter 07 - Transactions
Definition of a transaction

- a group of several reads and writes
- a logical unit - looks like one operation
- all or nothing (commit, or abort/rollback)

Who needs transactions: anyone who needs certain guarantees (a worked example of the all-or-nothing behavior follows this list).

ACID

- Atomicity - abortability (resilient to faults)
- Consistency - some invariants about the data must hold
- Isolation - concurrent transactions are isolated from each other; serializability is often too strong
- Durability - committed means the data will not be lost, though there is no perfect guarantee

BASE

- Basically Available, Soft state, and Eventual consistency

Single-Object vs Multi-Object Operations

- many writes:
  - atomicity: all are written or none
  - isolation: one transaction cannot see a partial write of another
- identifying a multi-object transaction:
  - tying it to the client's TCP connection - bad
  - a separate transaction id
- single-object writes:
  - the storage engine provides atomicity (a log for crash recovery) and isolation (a lock)
  - atomic operations: increment, compare-and-set
- multi-object transactions: when do you need them?
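A minimal sketch of the all-or-nothing behavior, using Python's built-in sqlite3 module; the accounts table and amounts are illustrative, not from the chapter. Both writes commit together or neither is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute(
            "UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        # if anything raises in this block, neither UPDATE survives
except sqlite3.Error:
    pass  # the rollback has already happened

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 50), ('bob', 50)]
```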
Vector clocks
Vector clocks: A, B, C, and D are trying to set a date.
Alice starts things off:

- (Wed, (A:1))
- (Tue, (A:1, B:1))
- (Tue, (A:1, B:1, D:1))
- (Thu, (A:1, C:1)) -> conflict: (A:1, C:1) did not descend from (A:1, B:1, D:1)
Descend: vclk2 descends from vclk1 when, for each marker in vclk1, vclk2 has a corresponding marker that is greater than or equal.
Resolve: (Thu, (A:1, B:1, C:1, D:2)) - merge the conflicting clocks, with the resolving actor (D) incrementing its own marker (a sketch follows).
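A minimal sketch of the descend check and conflict resolution described above. Clocks are dicts mapping actor name to counter; the function names are my own, not from the post.

```python
def descends(vclk2, vclk1):
    """True if vclk2 descends from vclk1: every marker in vclk1 has a
    corresponding, greater-or-equal marker in vclk2."""
    return all(vclk2.get(actor, 0) >= count for actor, count in vclk1.items())

def merge(vclk1, vclk2):
    """Pointwise maximum of two clocks, used when resolving a conflict."""
    actors = set(vclk1) | set(vclk2)
    return {a: max(vclk1.get(a, 0), vclk2.get(a, 0)) for a in actors}

left = {"A": 1, "B": 1, "D": 1}   # (Tue, (A:1, B:1, D:1))
right = {"A": 1, "C": 1}          # (Thu, (A:1, C:1))

# neither clock descends from the other -> conflict
assert not descends(right, left) and not descends(left, right)

resolved = merge(left, right)
resolved["D"] = resolved.get("D", 0) + 1  # D writes the resolved value
print(resolved)  # {'A': 1, 'B': 1, 'D': 2, 'C': 1} -> (A:1, B:1, C:1, D:2)
```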
Problems: the width of the vector clock grows with the number of actors (clients). Growth can be kept under control by attaching a timestamp to each marker and pruning old clocks (a sketch follows).

Actor, revisited: some entity in the system that makes an update to an object.
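A sketch of the timestamp-based pruning mentioned above: each marker carries the wall-clock time of its last update, and the least-recently-updated markers are dropped once the clock exceeds a size bound. `MAX_ACTORS` and the `(counter, timestamp)` layout are my own illustrative choices; the trade-off is that pruned clocks can produce false conflicts.

```python
import time

MAX_ACTORS = 10  # illustrative bound on clock width

def bump(vclk, actor):
    """Record an update by `actor`, pruning if the clock grew too wide."""
    count, _ = vclk.get(actor, (0, 0.0))
    vclk[actor] = (count + 1, time.time())
    if len(vclk) > MAX_ACTORS:
        oldest = min(vclk, key=lambda a: vclk[a][1])  # least recently updated
        del vclk[oldest]  # trades width for occasional false conflicts
    return vclk
```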
DynamoDB
Highlights

- kv store
- data partitioning and replication via consistent hashing
- consistency facilitated by object versioning
- consistency among replicas during updates via quorum
- decentralized replica synchronization
- gossip-based distributed failure detection and membership

Background

- the authors confuse the 'C' in ACID with the 'C' in CAP
- 2.3 optimistic replication - conflict resolution:
  - when: allow writes/updates ("always writable"), which pushes complex conflict resolution onto the reader
  - who: resolving at the data store means "last write wins" - too simple; allow the client to do the conflict resolution

Related work

- "always writable" requirement
- trusted nodes
- simple k-v interface
- latency ~100-200 ms
- zero-hop DHT: each node can route a request directly to the appropriate node

System architecture

- components: data persistence, load balancing, membership, failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency, job scheduling, request marshalling, request routing, system monitoring, system alarming, configuration management
- the paper covers: partitioning, replication, versioning, membership, failure handling, scaling

System interface

- api: get and put
  - get(key): locates the object's replicas; returns conflicting versions along with a context
  - put(key, context, object): determines which replicas to write; context is metadata such as the version
- key and object are opaque arrays of bytes; an MD5 hash of the key generates a 128-bit id

Partitioning

- consistent hashing: each node is assigned a random position on the ring
- each data item is hashed onto the ring and served by the first node at a larger position, so each node serves the data between it and its predecessor
- each node is actually mapped to multiple points on the ring (tokens), i.e. virtual nodes
- advantage: if a node goes down, its load gets handled more evenly by the other nodes

Replication

- each data item is replicated at \(N\) hosts
- each key is assigned a coordinator node, which is in charge of replicating the data items that fall in its range
- the coordinator replicates the key at the \(N-1\) clockwise successor nodes on the ring, so each node is responsible for the region between it and its \(N\)th predecessor
- the preference list is the list of nodes responsible for a key; every node in the system can determine it for any key
- the preference list skips positions on the ring to ensure it contains only distinct physical nodes (see the sketch at the end of these notes)

Data versioning

- eventual consistency: data propagates asynchronously
- guarantees that writes cannot be forgotten or rejected
- each modification is a new, immutable version of the data
- version branching can happen in the presence of failures, resulting in conflicting versions; the client must perform reconciliation
- vector clocks of (node, counter) pairs capture causality: if the counters of one object's clock are less than or equal to those of a second clock for all nodes, the first is an ancestor
- on update, the client must specify which version is being updated by passing the context from an earlier read
- a timestamp is used to truncate the clock, which would otherwise keep growing

get and put operations

- route the request through a load balancer, or use a partition-aware client library that routes directly to the coordinator
- requests received through a load balancer land on a random node in the ring, which forwards them to the first among the top \(N\) nodes in the preference list
- quorum protocol: \(R + W > N\)
- put(): the coordinator generates the vector clock for the new version, writes it locally, and sends it to the \(N\) highest-ranked reachable nodes; if at least \(W-1\) nodes respond, the write is considered successful
- get(): the coordinator requests all existing versions of the data for that key and waits for \(R\) responses before returning the value(s) to the client

Handling failure

- hinted handoff
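A minimal sketch of consistent hashing with virtual nodes and a preference list, under the assumptions in the notes above (MD5 mapping keys onto a 128-bit ring, \(N\) replicas, the list skipping tokens so it holds only distinct physical nodes). Names like `Ring` and `TOKENS_PER_NODE` are illustrative, not from the paper.

```python
import hashlib
from bisect import bisect_right

TOKENS_PER_NODE = 8  # virtual nodes (tokens) per physical node
N = 3                # replication factor

def ring_pos(data: bytes) -> int:
    """MD5 hash to a position on the 128-bit ring."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # each physical node owns several tokens scattered around the ring
        self._tokens = sorted(
            (ring_pos(f"{node}-{i}".encode()), node)
            for node in nodes
            for i in range(TOKENS_PER_NODE)
        )

    def preference_list(self, key: bytes):
        """Walk clockwise from the key's position, collecting the first
        N distinct physical nodes (skipping repeated ones)."""
        start = bisect_right(self._tokens, (ring_pos(key),))
        chosen, seen = [], set()
        for i in range(len(self._tokens)):
            _, node = self._tokens[(start + i) % len(self._tokens)]
            if node not in seen:
                seen.add(node)
                chosen.append(node)
                if len(chosen) == N:
                    break
        return chosen  # chosen[0] acts as the coordinator for this key

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.preference_list(b"user:1234"))  # e.g. ['node-c', 'node-a', 'node-e']
```

Because tokens are placed pseudo-randomly, removing a node hands each of its token ranges to a different successor, which is the load-spreading advantage the notes mention for virtual nodes.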