Redis is a popular in-memory data structure server. Historically, Redis has supported a number of ad hoc replication mechanisms, but none guaranteed anything stronger than causal consistency. Redis-Raft aims to bring strict serializability to Redis by means of the Raft consensus algorithm. We found twenty-one issues in development builds of Redis-Raft, including partial unavailability in healthy clusters, crashes, infinite loops on any request, stale reads, aborted reads, split brain leading to lost updates, and total data loss on any failover. All but one issue (a crash due to an assertion failure around snapshots) appear to be addressed in recent development builds. This work was funded by Redis Labs and conducted in accordance with the Jepsen ethics policy.

Redis is a fast single-threaded data structure server, commonly used as a cache, scratchpad, queue, or coordination mechanism between distributed applications. It offers operations over a broad array of datatypes, including binary blobs, lists, sets, sorted sets, maps, geohashes, counters, channels, streams, and more. In recent years, an increasing cohort of production users have deployed Redis as a system of record, spurring an increased focus on safety and reliability.

In addition to individual operations, Redis supports Lua scripts, and a transaction mechanism called MULTI which allows clients to group operations into an atomically executed batch. MULTI does not provide interactive transactions; the results of operations are only realized after the transaction is committed. The WATCH command aborts a transaction at EXEC time if any watched key was modified after it was watched, providing an optimistic concurrency control primitive.
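To make this concrete, here is a minimal sketch of WATCH-based optimistic concurrency using the redis-py client; the connection details, key name, and increment logic are illustrative assumptions, not part of the analysis.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def increment_balance(key, amount):
    """Read-modify-write a counter, retrying on concurrent modification."""
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)             # WATCH: EXEC aborts if key changes
                balance = int(pipe.get(key) or 0)
                pipe.multi()                # begin the queued MULTI block
                pipe.set(key, balance + amount)
                pipe.execute()              # EXEC: applies atomically, or raises
                return
            except redis.WatchError:
                # Another client modified the key between WATCH and EXEC; retry.
                continue
```

If another client writes the watched key between WATCH and EXEC, the EXEC aborts and the loop retries with a fresh read.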

1.1 Existing Replication Mechanisms

Redis offers several replication mechanisms, each with distinct tradeoffs. Redis' initial replication mechanism sent updates asynchronously from primary to secondary nodes; secondaries simply overwrote their state with whatever the primary happened to have at that point in time. Failover was performed by hand, or via third-party watchdogs like Pacemaker.

Redis Sentinel, introduced in 2012, allowed nodes to automatically select new primaries with the help of external processes executing a homegrown fault-detection and leader-election algorithm. Sentinel lost data during network partitions, and continues to do so as of this writing:

In general Redis + Sentinel as a whole are a an [sic] eventually consistent system where the merge function is last failover wins, and the data from old masters are discarded to replicate the data of the current master, so there is always a window for losing acknowledged writes.

A second replication strategy, Redis Cluster, provides transparent sharding and majority availability. Like Redis Sentinel, it uses asynchronous replication, and can lose acknowledged writes during some types of network failures:

Usually there are small windows where acknowledged writes can be lost. Windows to lose acknowledged writes are larger when clients are in a minority partition.

Redis Enterprise, a commercial offering from Redis Labs, includes a third replication strategy called “Active-Active Geo-Distribution”, which is based on Conflict-free Replicated Data Types (CRDTs). Sets use observed-removed sets, counters use a novel resettable counter implementation, and maps merge updates on a key-wise basis. Some datatypes, like strings and the values of maps, are resolved using last-write-wins, which is subject to lost updates.
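To illustrate why last-write-wins is subject to lost updates, consider a toy merge function (not Redis Labs' implementation) that keeps only the write with the higher timestamp when two replicas reconcile after a partition:

```python
# Toy last-write-wins register: merge keeps the value with the larger
# timestamp and silently discards the other write.
def lww_merge(a, b):
    # a and b are (timestamp, value) pairs from two replicas.
    return a if a[0] >= b[0] else b

left  = (100, "debit applied")     # written on one side of a partition
right = (101, "interest applied")  # written concurrently on the other

print(lww_merge(left, right))  # (101, 'interest applied'): the debit is lost
```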

Redis Sentinel, Redis Cluster, and Active-Active Geo-Distribution all allow lost updates, at least for some workloads. To mitigate this risk, Redis includes a WAIT command, which the documentation claims makes prior writes “durable even if a node catches on fire and never comes back to the cluster”. Moreover, Redis Enterprise claims to offer “full ACID compliance with its support for MULTI, EXEC, WAIT, DISCARD, and WATCH commands.”
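For reference, WAIT blocks until a given number of replicas acknowledge all prior writes on the connection, or a timeout expires. A brief redis-py sketch, with connection details, key name, and replica count assumed:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("account:1", 100)
# Block until at least 2 replicas acknowledge prior writes, or 1000 ms
# elapse; returns the number of replicas actually reached.
acked = r.wait(2, 1000)
if acked < 2:
    print("write may not have reached a majority of replicas")
```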

So, does Redis Enterprise lose updates? Or is it “full ACID”? The Redis-Raft documentation casts doubt on these ACID claims, noting that WAIT “does not make the system strongly consistent overall”. In discussions with Jepsen, Redis Labs clarified that Redis Enterprise can offer ACID characteristics, but only if 1.) replication is disabled, 2.) the write-ahead log is configured to fsync on every write, and 3.) users accept that failed transactions cannot be rolled back. These constraints are not clearly documented, but Redis Labs plans to document them in the future.

In short, users who want fault tolerance without lost updates need something stronger than existing Redis replication systems. Whereupon: Redis-Raft.

1.2 Redis-Raft

The fourth Redis replication mechanism, and the focus of the present work, is Redis-Raft, which uses the Raft consensus algorithm to replicate Redis’ state machine across a set of nodes.

Redis-Raft claims to make Redis “effectively a CP system”. Funneling all operations through the Raft log should make them linearizable. Since operations on different keys go through the same Raft state machine, and since MULTI transactions are implemented as a single Raft operation, Redis-Raft should also offer strict serializability, both for individual operations and for transactions.
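The intuition can be sketched as follows: if each committed Raft log entry carries an entire command batch, every replica applies that batch atomically and in the same total order. This is illustrative pseudocode only, not Redis-Raft's actual implementation:

```python
# Conceptual sketch: a replicated state machine where each log entry
# holds a whole MULTI batch, so the batch becomes visible atomically
# and in identical order on every replica. Not Redis-Raft's real code.
class RaftStateMachine:
    def __init__(self):
        self.store = {}          # the replicated key-value state
        self.last_applied = 0    # index of the last applied log entry

    def apply(self, index, batch):
        # Committed entries are applied strictly in log order.
        assert index == self.last_applied + 1
        for op, key, value in batch:      # the whole MULTI body...
            if op == "SET":
                self.store[key] = value
            elif op == "DEL":
                self.store.pop(key, None)
        self.last_applied = index         # ...becomes visible at once
```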

Redis-Raft started as a proof-of-concept in February 2018, and Redis Labs has been working towards a production release since mid-2019. During our collaboration, Redis-Raft was not publicly available, but Redis Labs plans to make the source available at RedisConf 20, and aims for a general-availability release as part of Redis 7.0.

We designed a test suite for Redis-Raft using the Jepsen testing library. Since Redis-Raft relies on features in the unstable branch of Redis, we ran our tests against Redis f88f866 and 6.0.3, and development builds of Redis-Raft from 1b3fbf6 through e0123a9. All tests were run on five-node Debian 9 clusters, on both LXC and EC2. We introduced a number of faults during our testing process, including process pauses, crashes, network partitions, clock skew, and membership changes.

Prior Jepsen tests have relied on a broad variety of workloads, each designed to detect different anomalies, or to compensate for performance limitations in other workloads. Over the last year, Jepsen collaborated with UC Santa Cruz's Peter Alvaro to design a new type of consistency checker which operates in linear (rather than exponential) time, works over a broad range of transactions and single-key operations, verifies a wide range of safety properties up to strict serializability, and provides understandable, localized counterexamples for safety-property violations. We call this checker Elle.

We used Elle exclusively in this analysis, measuring Redis' safety with respect to transactions over lists. Each transaction (or singleton operation) consists of read and append operations over a small, evolving set of keys. Reads return the current state of the list using LRANGE, and appends add a distinct element to the end of the list using RPUSH. Elle infers dependency relationships between these transactions, including write-write, write-read, and read-write data dependencies, as well as real-time orders, and looks for cycles in the resulting dependency graph as evidence of strict-serializability violations.
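Concretely, a single transaction in this workload might perform one append and one read as a MULTI batch. A redis-py sketch, with connection details, key names, and the appended element as illustrative assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# One list-append transaction: append a distinct element to one key and
# read the full state of another, submitted as a single MULTI/EXEC batch.
with r.pipeline(transaction=True) as pipe:
    pipe.rpush("key:8", "42")    # RPUSH: append 42 to the end of list 8
    pipe.lrange("key:9", 0, -1)  # LRANGE: read the entire list 9
    append_res, read_res = pipe.execute()
```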