de Candia is one of the founding fathers of the NoSQL movement, whose influence now extends well beyond the big-name websites, stretching into the data centers that underpin all sorts of businesses.
"If you look at every NoSQL solution out there, everyone goes back to the Amazon Dynamo paper or the Google BigTable paper," says Jason Hoffman, the chief technology officer at the San Francisco-based cloud computing outfit Joyent. "What would the world be like if no one at Google or Amazon ever wrote an academic paper?"
Dynamo is an eventually consistent, leaderless key-value store used by internal Amazon services that need to meet a 99.9% SLA. This paper, together with Google's BigTable paper, had a huge impact on later NoSQL storage designs, including Riak, Facebook's Cassandra, and LinkedIn's Voldemort.
As the paper puts it:
The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer...
Designing a production-ready storage system is hard to get right, and this is what Dynamo is famous for: picking the right techniques to solve each of these hard problems.
We will start with partitioning and replication. Like Chord, Dynamo uses Consistent Hashing (CH): nodes are placed on a ring-shaped keyspace by the hash of their ids, and a data object is assigned to the first node reached when walking clockwise from the hash of its key. To reduce skew, Dynamo improves on basic CH with a fixed-partition assignment scheme for rebalancing (section 6.2 of the paper). This keeps the number of partitions fixed, but the size of each partition grows linearly with the amount of data. Cassandra, on the other hand, uses random token assignment, so the number of partitions grows linearly with the number of hosts and the partition size shrinks accordingly.

Unlike Chord, or systems that rely on ZooKeeper for membership discovery, Dynamo uses a gossip-based protocol to look up the node that holds a given key. Lookups cost $O(1)$, compared to Chord's $O(\log N)$, at the price of looser membership consistency.

To achieve high availability and durability, Dynamo replicates each object to $N$ physical nodes: the coordinator determined by CH plus its $N-1$ successors on the ring. It uses version vectors to detect replication conflicts. Once a conflict is detected, Dynamo can either resolve it with last-write-wins based on physical timestamps, or propagate it back to the application to handle the merge. Neither strategy is ideal: last-write-wins can hurt durability, and pushing the complexity back to the client makes applications harder to build. A final note on Dynamo's replication: version vectors are limited and truncated to 10 entries. The consequences were unknown at the time the paper was written, but I believe it could lead to incorrect merges by the conflict resolver.
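To make this concrete, here is a minimal Python sketch (not Dynamo's actual code) of a consistent-hashing ring that maps a key to a coordinator and its $N-1$ successors; the node names, MD5 as the ring hash, and $N=3$ are illustrative assumptions.

```python
import hashlib
from bisect import bisect_right

def ring_hash(value: str) -> int:
    """Place a string on the ring keyspace by hashing it (MD5, as in the paper)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        # Each node sits on the ring at the hash of its id.
        self.ring = sorted((ring_hash(node), node) for node in nodes)

    def preference_list(self, key: str):
        """Coordinator = first node clockwise from hash(key); replicas = the next N-1 successors."""
        positions = [pos for pos, _ in self.ring]
        start = bisect_right(positions, ring_hash(key)) % len(self.ring)
        count = min(self.n_replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("cart:42"))  # first entry is the coordinator for this key
```

And a similarly hedged sketch of how version-vector conflict detection works: each entry counts the updates coordinated by one node, and two versions that do not causally dominate each other are concurrent.

```python
def compare(vv_a: dict, vv_b: dict) -> str:
    """Compare two version vectors (node -> counter)."""
    nodes = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(n, 0) >= vv_b.get(n, 0) for n in nodes)
    b_ge = all(vv_b.get(n, 0) >= vv_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a-descends-b"      # a supersedes b; b can be discarded
    if b_ge:
        return "b-descends-a"
    return "conflict"              # concurrent: last-write-wins, or hand both versions to the app

print(compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 2}))  # conflict
```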
What's Dynamo's story about read-your-writes consistency? Well, it has the concept of a coordinator node, determined by the usual CH rule, which accepts reads and writes. However, when the coordinator fails, other replicas are allowed to handle reads and writes, so Dynamo isn't a read-your-writes consistent system after all.
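A tiny sketch of that failover behavior (the helper function and node names are hypothetical, not from the paper): a request is coordinated by the first reachable node in the preference list, so after a failure a replica that may not have seen your last write can serve the read.

```python
def choose_coordinator(preference_list, healthy_nodes):
    """Pick the first reachable node in the preference list to coordinate a read or write.
    If the usual coordinator is down, another replica steps in, which is why
    read-your-writes is not guaranteed."""
    for node in preference_list:
        if node in healthy_nodes:
            return node
    raise RuntimeError("no reachable replica for this key")

prefs = ["node-c", "node-d", "node-a"]                   # node-c is the "real" coordinator
print(choose_coordinator(prefs, {"node-d", "node-a"}))   # node-d handles the request instead
```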
Eventual consistency means that, after writes stop, all replicas will eventually converge to the same state (from Dynamo's point of view, their Merkle tree roots become identical). To achieve this, the coordinator needs to replicate each put to all the replicas. However, Dynamo also wants the property of being "always writable", even in the face of failures. Dynamo uses a quorum mechanism to tolerate node failures; a typical setup is $N=3, R=2, W=2$. This alone is not enough to be "always writable", since it can only tolerate one node failure, so Dynamo uses Sloppy Quorum and Hinted Handoff. This is another example of Dynamo favoring availability over consistency.
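Below is a rough sketch of what a sloppy-quorum write with hinted handoff could look like; the function name, the node names, and the assumption that every reachable node immediately acks are mine, not from the paper.

```python
def sloppy_quorum_write(key, value, preference_list, healthy, n=3, w=2):
    """Write to the first N *reachable* nodes walking the preference list (sloppy quorum).
    A stand-in node stores a hint naming the replica it is covering for, so it can hand
    the data off once that replica recovers."""
    intended = preference_list[:n]                       # the "home" replicas for this key
    down = [node for node in intended if node not in healthy]
    targets = [(node, None) for node in intended if node in healthy]
    for node in preference_list[n:]:                     # walk past the top N to find stand-ins
        if not down:
            break
        if node in healthy:
            targets.append((node, down.pop(0)))          # hint = node this copy really belongs to
    acks = len(targets)                                  # assume every reachable target acks the put
    return acks >= w                                     # the put succeeds once W replicas acknowledge

prefs = ["node-c", "node-d", "node-a", "node-b"]         # N=3 plus one fallback
print(sloppy_quorum_write("cart:42", "v1", prefs, healthy={"node-c", "node-a", "node-b"}))  # True
```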
Lastly, detecting and handling failures is also important. Dynamo uses the same gossip protocol to detect failures: if a node cannot be reached, it is regarded as down. When the node comes back, Dynamo needs to quickly resynchronize data to the recovered node in the background (the anti-entropy process). It uses Merkle trees for comparison, which let nodes check whether the keys within a key range are up to date. The advantage of a Merkle tree is that it minimizes the amount of data that needs to be transferred during synchronization. Permanent node failure is handled by an administrator using command-line tools. The question that remains for me is how Dynamo can be confident that a node has become permanently unavailable and needs human intervention.
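As a rough illustration of the anti-entropy comparison (simplified: a real implementation would descend the tree level by level instead of comparing every leaf, and the hash choice and bucket layout here are assumptions):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(buckets):
    """Bottom-up Merkle tree: leaves are hashes of key-range buckets, parents hash child pairs."""
    level = [h(b) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else b""))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels                       # levels[-1][0] is the root hash

def out_of_sync_buckets(tree_a, tree_b):
    """If the roots match, the replicas agree and nothing is transferred; otherwise
    return the bucket indices whose contents differ and must be synchronized."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

replica_1 = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_2 = [b"k1=v1", b"k2=stale", b"k3=v3", b"k4=v4"]
print(out_of_sync_buckets(build_tree(replica_1), build_tree(replica_2)))  # [1]
```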
Discussion: