The challenge of schema change over time

Why is change so hard? If we could update our software in lockstep everywhere, life would be easy. Database columns would be renamed simultaneously in clients and servers. Every API client would jump forward at once. Peers would agree to new protocols without dispute.

Alas, because we can’t actually change all our systems at once, our changes must often preserve both:

Backward compatibility: making new code compatible with existing data.
Forward compatibility: making existing code compatible with new data.

Backward compatibility is the more familiar of the two: the ability to open old documents in new versions of a program. Forward compatibility, the ability to open documents in formats invented in the future, is rarer. We can see forward compatibility in web browsers: new features are added to HTML in ways designed so that old browsers can still render new sites.

As any web developer already knows, writing forward-compatible HTML is more art than science. The Mozilla Developer Network maintains a guide with advice on the subject.
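Outside of HTML, the same two properties show up whenever structured data crosses a version boundary. Here is a minimal sketch (a hypothetical document format, not taken from any particular system): a version 2 reader supplies a default for a field that old documents lack, and a version 1 reader ignores fields it does not recognize in new documents.

```typescript
// Hypothetical document format: v1 has only a title, v2 adds optional tags.
type DocV1 = { title: string };
type DocV2 = { title: string; tags?: string[] };

// Backward compatibility: new code reads old data.
// A v2 reader fills in a default for the field old documents lack.
function readAsV2(raw: string): DocV2 {
  const parsed = JSON.parse(raw);
  return { title: parsed.title, tags: parsed.tags ?? [] };
}

// Forward compatibility: old code reads new data.
// A v1 reader simply ignores fields it does not know about.
function readAsV1(raw: string): DocV1 {
  const parsed = JSON.parse(raw);
  return { title: parsed.title }; // unknown fields like `tags` are dropped
}
```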

The need to maintain backward and forward compatibility appears in a wide variety of distributed systems, both centralized and decentralized. Let’s look more closely at several of these systems and the solutions they employ.

Stripe API versioning

Public web APIs face a particular challenge in maintaining backward compatibility with older clients. An organization like Stripe must balance its desire to evolve its API against its customers’ reluctance to change integrations that already work. The result is strong pressure to preserve backward compatibility over time, often across many versions.

Many API developers take an ad hoc approach to this problem. Developers rely on tribal knowledge to tell them which operations are safe—for example, they intuit that they can respond with additional data, trusting existing clients to ignore it, but cannot require additional data in requests, because existing clients won’t know to send it. Developers also often resort to shotgun parsing: scattering data checks and fallback values throughout the system’s main logic. This often leads not just to bugs but also to security vulnerabilities.
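The intuition behind these rules can be made concrete with a small sketch (the endpoint and field names are invented for illustration): extra response data is harmless because old clients never look at it, while a newly required request field breaks every client that does not yet know to send it.

```typescript
// Hypothetical request/response shapes for a payments endpoint.
type CreateChargeRequest = { amount: number; currency?: string };
type ChargeResponse = { id: string; amount: number; currency: string }; // `currency` added later

// Safe change: old clients that only know about `id` and `amount`
// deserialize the response and simply never read `currency`.

// Unsafe change: making `currency` required in the request.
function handleCreateCharge(req: CreateChargeRequest): ChargeResponse {
  if (req.currency === undefined) {
    // Every client written before the change fails here.
    throw new Error("missing required parameter: currency");
  }
  return { id: "ch_123", amount: req.amount, currency: req.currency };
}
```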

The term shotgun parsing was introduced by Bratus & Patterson to describe the habit of scattering parser-like behaviour throughout an application’s code. In addition to complicating programs, inconsistent handling of input data can result in security vulnerabilities, as described in their paper.
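The contrast can be sketched as follows (hypothetical field names): the first style checks and patches the data wherever it happens to be used, while the second parses it once at the boundary, so the rest of the program only ever sees well-formed values.

```typescript
// Shotgun parsing: ad hoc checks and fallbacks scattered through the main logic.
function displayName(user: any): string {
  return typeof user.name === "string" ? user.name : "anonymous";
}
function notify(user: any): void {
  if (!user.email || !String(user.email).includes("@")) return; // silently drops bad records
  // ... send the email ...
}

// Parsing at the boundary: validate once, then hand a trusted value to the rest of the code.
type User = { name: string; email: string };

function parseUser(raw: unknown): User {
  const u = raw as { name?: unknown; email?: unknown };
  if (typeof u.name !== "string" || typeof u.email !== "string" || !u.email.includes("@")) {
    throw new Error("malformed user record");
  }
  return { name: u.name, email: u.email };
}
```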

Stripe has developed an elegant approach to this problem. They have built a middleware system into their API server that intercepts incoming and outgoing communication and translates it between the current version of the system and the client’s requested version. As a result, developers at Stripe rarely need to concern themselves with the idiosyncrasies of old API versions: the middleware ensures that requests arrive translated into the current version. When they want to change the API’s format, they add a new translation to the top of the middleware “stack”, and older requests and responses pass through it on their way in and out.

Stripe uses an encapsulated middleware to translate API requests and responses between the current version and older versions still used by clients.
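A minimal sketch of the idea (not Stripe’s actual code; the migration shapes and the renamed field are invented): each migration knows how to lift a request forward one version and lower a response back down, and the middleware applies the relevant slice of the stack based on the version a client has pinned.

```typescript
// A hypothetical stack of versioned migrations, oldest first, newest last.
interface Migration {
  version: string; // e.g. an API release date, "2020-03-01"
  upRequest(req: any): any;    // translate a request from the previous version to this one
  downResponse(res: any): any; // translate a response from this version to the previous one
}

const migrations: Migration[] = [
  {
    version: "2020-03-01",
    // Example change: the API renamed `amount` to `amount_cents`.
    upRequest: (req) => ({ ...req, amount_cents: req.amount }),
    downResponse: (res) => ({ ...res, amount: res.amount_cents }),
  },
  // ...one entry per breaking change...
];

// Bring an old client's request up to the current version...
function translateRequest(req: any, clientVersion: string): any {
  return migrations
    .filter((m) => m.version > clientVersion)
    .reduce((r, m) => m.upRequest(r), req);
}

// ...and translate the current response back down to what the client expects.
function translateResponse(res: any, clientVersion: string): any {
  return migrations
    .filter((m) => m.version > clientVersion)
    .reverse()
    .reduce((r, m) => m.downResponse(r), res);
}
```

Because the stack is ordered by date, a client pinned to any version simply passes through every migration introduced after that date.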

Our work is inspired by the encapsulation provided by this system, but Stripe’s implementation has some limitations. Developers writing migration rules must implement translations by hand and write tests to ensure they are correct. And because Stripe’s system uses dates to order its migrations, it is limited to a single linear migration path.

Kafka and message formats

Large organizations often embrace an event streaming architecture, which helps them scale by decoupling the data entering their system from the processes that consume it. Apache Kafka is a scalable, persistent queue system often used for this purpose. Data schemas are paramount in these environments, because a malformed event sent through the system can crash downstream subscribers. Kafka streams can therefore require both backward and forward compatibility: old messages must be readable by new consumers, and new messages must be readable by old consumers.

To help manage this complexity, the Confluent Platform for Kafka provides a Schema Registry tool. This tool defines rules that help developers maintain schema compatibility—for example, if a schema needs to be backward compatible, the Schema Registry only allows adding optional fields, so that new code will never rely on the presence of a newly added field.
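As an illustration of the “optional field” rule, here is a sketch of two versions of a hypothetical Avro record schema, written out as TypeScript constants rather than .avsc files: the new field carries a default value, so a consumer using the new schema can fill in the gap when it reads records written with the old one.

```typescript
// Version 1 of a hypothetical event schema.
const userCreatedV1 = {
  type: "record",
  name: "UserCreated",
  fields: [
    { name: "id", type: "string" },
    { name: "email", type: "string" },
  ],
};

// Version 2 adds an optional field. Because it has a default,
// consumers on the new schema can still read old records that lack it.
const userCreatedV2 = {
  type: "record",
  name: "UserCreated",
  fields: [
    { name: "id", type: "string" },
    { name: "email", type: "string" },
    { name: "referrer", type: ["null", "string"], default: null },
  ],
};
```

Adding the same field without a default, by contrast, would be rejected under a backward-compatibility setting, because new consumers could not read old records that lack it.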


An excerpt from Confluent's list of rules for preserving compatibility in Avro. Developers can pick a desired compatibility level, which then limits the changes they can make to a schema.