Scaling databases for the enterprise is hard. You have to parallelize, avoid bottlenecks, and shard across multiple machines. You have to carefully consider tradeoffs between data integrity and constant uptime, between optimizing for reading and writing, between speed of development and speed at runtime. You have to integrate wildly disparate data sources, satisfy stakeholders with competing expectations, and find the structure hidden in unstructured data. Working with databases at the scale of global enterprise is about bringing order to chaos.
I recently had the opportunity to interview MarkLogic's Greg Meddles on this topic. Meddles has been bringing order to chaos for decades, scaling and integrating data systems in the financial sector, the intelligence community, and health care. He worked on IBM's Watson, and is currently the technical lead for healthcare.gov, which integrated data stores from dozens of federal and state government sources while scaling to millions of users. In our conversation, I asked him about some of the common challenges that come with scaling and integrating databases. Here are some of his insights.
Separating uniqueness from order
Meddles began with an example that may seem elementary, but it was a good place to start, as the problem is easy to understand and touches on some of the more complicated issues we discussed later:
"In the old days, most IDs came from an auto increment sequence in the database. This provided both a uniqueness and an ordering component. But once you get away from a system that lives on one or two servers, the coordination of uniqueness and ordering doesn't work anymore."
The solution, of course, is to separate uniqueness from ordering. Uniqueness can be provided in a distributed system a number of ways—for example, by including an IP address or by using a UUID generator. Ordering is then a matter of having timestamps of sufficient granularity.
This strategy entails pulling apart the functional requirements, and questioning the assumptions built into conventional one-server approaches—it’s a helpful way of approaching many of the problems of scale.
Bottlenecks and parallelization
According to Meddles, most scaling problems come down to bottlenecks of one sort or another. Too often, he says, people try to solve this by just throwing more computing power at the problem:
"You're out of memory on some particular Amazon instance, so you bump up to the next biggest in size. That is always the naive solution. Whatever you're doing, you'll usually end up doing more of it. Eventually, you'll end up throwing good money after bad."
But there's a better way, Meddles notes: "The way that you get around these problems is to do more things in parallel."
There are a number of different approaches to doing things in parallel, but common to most of them is simplicity and, in particular, avoidance of state. "Maintenance of state is one of the issues that prevents you from doing things in parallel," he says.
Our conversation turned to design tradeoffs that have to be made when scaling. He brought up the CAP theory, which states that of three desirable qualities—Consistency, Availability, and Partition failure tolerance—you can have only two. Since a large, distributed database is going to require partition failure tolerance, you are left to decide between consistency and availability.
"You don't have a choice about whether there's going to be an interruption. That's a given. But you have a choice: does your system stay up, and you run the risk of a split-brain? Or does your system guarantee that the data will always be consistent, but you run the risk of outage?"
So, the choice really comes down to what your system does. Content delivery systems, for example, can probably live with a small amount of "split-brain." But in a transactional system, such as banking, for example, a short outage might not be ideal, but bad data is absolutely unacceptable.
Integrating multiple data sources
One of the common problems in enterprise data systems not directly related to scaling is the integration of many different types of databases and systems. For example, you may have a legacy system that stores data in tab-delimited files, unstructured text files coming from handwritten notes, and one or more conventional database management systems—and data from all of these sources needs to be read by and integrated into a single system.
Often, there is a desire to simply move this data into a monolithic data store, so everything can be treated the same way. (This has certainly been my experience.) Meddles suggests, on the other hand, keeping data in its original format whenever possible. "You have to start with the data you actually have, not the data you wish you had."
Transforming data from its original format to the one true format of your system is not always a lossless process. You may need to access the original content, in its original format, for any number of reasons. To integrate it into your monolithic store, then, you may want to extract structured data and then include pointers to the original, providing in-application access to it when needed. At MarkLogic they refer to this pattern as an "envelope." This is especially useful for unstructured text or binary data.
As Meddles explains it: "If you can keep your data in the form that it wants to be in, that's generally the easiest way to go."
Scaling systems without limit
I asked Meddles if he could sum up his thoughts on scaling databases into a few words: "To do all of this on time and on budget is hard. It's really hard. But most of the problems are solvable by keeping things simple and doing a good job with your design. Systems can be scaled really big, really well. There are no limits to what you can do."
This post is a collaboration between O'Reilly and MarkLogic. See our statement of editorial independence.