July 15, 2009

For modern applications, the word “schema” has become synonymous with the tables, columns, constraints, indexes, and foreign keys in a relational database management system. A typical relational schema affects physical concerns (like record layout on disk) and logical concerns (like the cascading deletion of records in related tables).

Schemas have gotten a bad name because current RDBMS tools give them these rotten attributes:

  • Changes broadly lock data, requiring downtime or putting excessive, short-term load on the system.
  • The data integrity constraints don’t quite support application integrity constraints. For example, there’s no standard way in SQL to require a column to contain only valid URLs.
  • Coordinating schema changes with application code changes is difficult, largely because schema changes lack convenient coupling with code.

But, instead of solving these problems, there’s been a rebellion against schemas in next-generation object and document databases. In the case of CouchDB, for example, a document isn’t guaranteed to be anything more than valid JSON. That puts a lot of burden on the application, and it makes sharing a database between two applications unsafe and unreliable. It’s also caused some architects from the RDBMS side to falsely assume distributed document/object database architecture is inherently at odds with integrity enforcement.

What’s needed is a fresh perspective on schemas for the document database era. Mostly, we need to fix the errors and omissions of RDBMS schemas.

Current approaches

Let’s say an application is moving a table from schema A to schema B.

The current relational approach

In current relational systems, a DBA runs code on the database server that changes every row from schema A to schema B. Other administrators simultaneously deploy code to support the updated schema. The application is probably down for maintenance in the process.

The current document approach

In current document systems, code gets written to do the following:

  • Correctly read data in schema A or B.
  • Write data in schema B.
  • And if the developers ever want to remove support for reading A from the code, an updater that walks through schema A data and updates it to schema B.

The document-based application lacks the coordination and downtime issues of the relational application, but its code gets increasingly unwieldy. Where does the handling of older data schema happen? I’ve seen applications implement it anywhere from in data access objects/functions (ideally) all the way to the view/theme layer. The older schema support is often commented well but has no clear path to removal. No one wants to embark on a project to analyze all documents to determine the necessity of old schema support code, let alone risk application breakage by doing such cleanup.

The result is ever-increasing code complexity and little sense of distinct data versions.

What needs to happen

The document approach clearly has the technical edge, but the architecture and burdens are misplaced. So, we should augment today’s document database design with built-in schema management features modeled after how applications cope with schema changes.

Document databases need the following features added:

Basic data types

Document databases are wonderfully flexible for their model of making everything “just a document,” but supporting schemas in applications with more than one data type requires some awareness of type. An application could embed the type into each document, but then typing becomes inconsistent, and the schema update and validation functions below require lots of conditionals. It’s much simpler for the document database to actually understand type.

Schema versioning for each document (or any other non-global set)

One of the problems with current RDBMS schema systems is that they perform updates globally. They have to do this because all data in a table is assumed to be in the same structure dictated by the schema. Document databases go to the other extreme, supporting no schema. Supporting schema versioning at a non-global level would allow updates to happen lazily and without introspection and guesswork.

Similar to type, an application could embed the schema versions in the documents, but then versioning would not be understood by the database, impairing the efficiency and clarity of validation and update code.

Combined with type support, a document database implementing this feature would know the type and schema version for any document it stores, opening a wealth of possibilities described below.

Validation functions

For each type and schema version, this is a function that throws an exception if the document fails validation. “Validation” is however the application defines it, and the validation function runs in a Turing-complete language. In the degenerate case (a validation function that simply returns) allows the current, schema-less behavior. New documents written to the database are required to validate (before any updates) against a minimum schema version, if not the most recent.

Update functions

For each type and schema version except schema version zero, there is a function that accepts a document from schema version N-1 and returns a document in schema version N. The database then runs the returned document against the validator for schema version N. By running a series of updates, data can be upgraded from even schema version zero to the latest schema.

Referential overlays

Current relational systems implement cascading changes the same way they implement schema changes: update/delete all tuples as necessary at the same time. If type Q references type P with a CASCADE DELETE, then related instances of Q are immediately deleted when something of type P is deleted.

These sorts of cascading changes are very useful, and document databases can support them using what I call “referential overlays.” Instead of deleting Q items at the same time P items are deleted — which requires either an expensive traversal of all Q tuples or maintaining an index of Q-based references to P tuples — we can simply note the deletions of P and apply the changes at read or write time.

The referential overlay logs are easiest to implement in a centralized (though replicated) way. The centralization limits throughput, but the logs are lightweight and much faster than applying the actual cascading effects, as RDBMS tools do currently.

A Drupal-focused use of referential overlays would be the system of nodes and taxonomy terms. If a taxonomy term is deleted, we currently remove the term from all nodes having it. We do this immediately, even if 10,000 nodes have the term. With a referential overlay, we would instead note the deletion of the term and remove it as nodes are loaded. And, of course, we wouldn’t save nodes with references to now-deleted terms. The result is the appearance of a system-wide cascading change without the immediate overhead.

The gardener

The lazy-updating architecture above obviously results in a data store with lots of invalid references and outdated documents that have to be corrected at read-time. For many reasons, we don’t want these to continuously accumulate. Fortunately, there’s a happy middle-ground between the global locking updates of current RDBMS tools and accumulating cruft forever. I call it the gardener.

A real-world gardener prunes plants and pulls weeds. She works progressively across a large field, slowly moving through the rows to keep things tidy. Even though weeds aren’t getting picked in every row every day, things stay under control because the attention is frequent enough.

The “gardener” in the architecture here works the same way with the documents. Before starting a round through the whole document set, the gardener notes the current schema versions and position in the relational overlay log. The garden then moves through the documents, one by one, updating each one to the latest schema version and applying relational overlays (which may result in deletion). The gardener always updates to the latest schema version and overlay, even if either has changed since starting the gardening round.

Also like a real-world gardener, this approach horizontally scalable and can, of course, be scheduled to work at convenient times rather than interfering with competing work. Even across many servers, the gardeners can work in relative independence.

After a complete round through the system, the database can delete scheme validators and updaters for versions older than the one the gardener started the round with. (We can do this because we know the oldest schema version in use among all documents is the one the gardener started the round with.) Similarly, we can delete relational overlay records that pre-date the beginning of the gardener’s round.

Given the size of the data set, the speed of the gardener, and the number of changes being applied, we can establish an upper bound on the time from requesting a schema change or performing a cascading change and knowing the change is completely reflected in all documents. It can be comforting to know nothing in the database is so old that it might cause a problem when actually used.

Now, the application software can be unaware of the gardening and even the time from schema change to complete deployment because they’re handled transparently. The application software will never encounter obsolete references or documents in old schema formats because the document database will never deliver a document without updating, validating, and applying the relational overlays.

Other notes and thoughts

Coordinating code and schema

Schema updates are currently decoupled with code largely because running schema updates is an event:

  • Database writes will be blocked.
  • The servers will experience almost complete load saturation during the update.
  • For the above reasons, downtime might be scheduled.

Because the architecture proposed here is non-blocking, it’s possible to have the application trigger any schema update as new code starts running without lots of planning and downtime. For a web application, the schema update could run on the first page request with updated code.

Snapshotting

Because a schema update request is a lightweight event and records have (relatively) lazy updates, it’s possible to use snapshots with schema changes with low overhead. Let’s say a system is running with copy-on-write (COW) snapshots like LVM. Snapshotting load is directly proportional to the amount of data being written. COW-based snapshots therefore work much better with a lazy update strategy than an RDBMS that might rewrite gigabytes of rows during one schema change.

The same is true for implementing transactional DDL. There simply isn’t much data that needs to be retained to allow DDL rollbacks.

Failed updates

Above, I proposed treating the request time as the point of atomic schema changeover. It’s possible that some documents might fail post-update validation, or the updater itself might throw exceptions for some documents.

One useful feature of some RDBMS tools is transactional DDL (like in PostgreSQL). Any row failing an update reverts the entire update.

It’s possible to efficiently get a similar effect within the proposed architecture two different ways:

  • If the space is available, walk through all records, update and validate them, and store the updated versions in a way invisible to application. As records get created or updated by the still-live application, write documents with the current schema version and then run the update, creating and storing the updated record at the same time. If the update successfully runs on all records, including records being written during the update, atomically switch to using the updated records (or inform the DBA that the atomic switch is now possible and safe).
  • If there’s less space, do the same as above, but discard the updated records after they pass validation. Post-switch, records will be updated on read or by the gardener.

Either way, administrators can get confirmation of update success without the normal locks and downtime.

Where some RDBMS approaches maintain validity

There are some cases where this architecture may be insufficient, like when an application needs to look up references between documents rather than merely ensure documents do not list invalid ones. This is a case where an index has validity. But the architecture proposed here is more flexible than the RDBMS one of combining indexes, foreign keys, and cascading effect in one system. We could, for example, have an index with lazy updates or one maintained entirely outside the document database itself.

A Drupal-based case of this would be taxonomy listing pages, which display any node using a taxonomy term. Often, these listings need to be fast but not perfectly up-to-date. They could, say, lag 10 minutes behind the most updated data. It might also make sense to manage these indexes in a system like Hadoop or Solr.

Comments