When you launch your website, it will be the primary (probably the only) consumer of your database. All of the queries your database handles will be related to making your app go. If everything goes well, you’ll have lots of users and you’ll start collecting and generating lots of data about those users and your website’s operation in general.
Around this time, the business development team will start asking you questions. How many new users join your site each day? Each hour? Is there a geographic distribution to your user base? What features are heavily used and which go unused? How many repeat visitors did you have last week, and what was the revenue result of the costly marketing campaign?
The natural thing to do to answer these questions is to start building reports. You add a report that breaks down new visitors and repeat visitors by day and hour. You add a report that shows access by state—perhaps plotting hits on a U.S. and world map. You add a report that shows revenue events as they relate to different traffic sources—external links versus unreferred traffic versus links from your email campaign. You add more and more reports almost as quickly as your business users can request them. Your business users are delighted. They check the reports frequently to increase their pleasure at how well the website you wrote is functioning. That is, until suddenly, one day, performance plummets. All of the metrics in your reports take a nosedive. The business users, flustered, take to looking at the reports you’ve generated all day, hoping to make sense of what went wrong.
What went wrong is that your reports are killing your site. As your site’s popularity and success increase, so does the amount of data your reports need to process. What seemed like a reasonably fast query—maybe 10 seconds to give user statistics for the last month by hour—now takes 30 seconds or maybe even a couple of minutes. And since your company became so successful, you hired more people who are looking at those reports. And since each report now takes minutes to generate, your business users fire off a bunch of simultaneous reports and then go get a cup of coffee. All of this has the effect of bogging down your site and locking out the very users you are trying to serve web pages to.
I call this the Heisenberg Uncertainty Principle of Website Reporting. If you try to report out of the same database in which you are collecting your data, the simple act of loading the reports creates abnormal load on your database. That in turn makes your web pages slow, which causes your users to leave your site frustrated, which causes you to load more reports to figure out what’s going on, which frustrates yet more users, and so on. Whenever you look, you impact the system in a negative way.
The solution is simply to not run reporting queries on your production database. But that is easier said than done.
A common shortcut many people take around building a data warehouse is to create a slave copy of their database, and run heavy reporting queries there, out of the path of users. This is not a good idea. To understand why, it’s important to be familiar with the difference between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).
OLTP comprises the set of queries that store, update, and retrieve data. Examples would be creating blog entries, then displaying them to website visitors. OLTP queries are engineered to execute quickly, as there is generally a user waiting for the result of the query. Speed is realized through a highly normalized schema. Each row contains a small amount of data and pointers to related data in other tables. Virtually all websites are OLTP systems.
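For example, the OLTP statements behind the blog scenario above might look like the following sketch (the blog_entries table and its columns are hypothetical, not taken from any particular schema):

    -- Store a new blog entry (a user is actively waiting on this insert)
    INSERT INTO blog_entries (author_id, title, body, created_at)
    VALUES (42, 'Hello, world', 'My first post', now());

    -- Retrieve recent entries to render a page for a visitor
    SELECT id, title, body, created_at
    FROM blog_entries
    WHERE author_id = 42
    ORDER BY created_at DESC
    LIMIT 10;

Each statement touches a handful of indexed rows and returns in milliseconds, which is exactly the profile a normalized schema is built for.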
OLAP queries are geared toward garnering business intelligence out of large quantities of data. OLAP queries process millions upon millions of records generated by individual OLTP queries. A sample OLAP query might be one that answers the question, “How many customers who bought a sale item also bought a nonsale item, broken down by store location and week?” In addition to ad-hoc queries such as this one, nightly or quarterly generated reports are OLAP queries, too, and therefore real-time results are generally not a requirement of an OLAP system.
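As a rough sketch of what that sale-item question might look like in SQL (the orders, order_items, and stores tables here are purely illustrative), the query has to scan and group every order line in the system:

    -- Customers who bought both a sale item and a nonsale item,
    -- counted by store location and week (hypothetical schema)
    SELECT s.location,
           per_customer.week,
           COUNT(*) AS customers
    FROM (
      SELECT o.store_id,
             o.customer_id,
             date_trunc('week', o.ordered_at) AS week
      FROM orders o
      JOIN order_items i ON i.order_id = o.id
      GROUP BY o.store_id, o.customer_id, date_trunc('week', o.ordered_at)
      HAVING bool_or(i.on_sale) AND bool_or(NOT i.on_sale)
    ) AS per_customer
    JOIN stores s ON s.id = per_customer.store_id
    GROUP BY s.location, per_customer.week;

Unlike the blog queries above, there is no index that lets this query avoid churning through millions of rows.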
Because OLTP and OLAP queries are so different, it’s not surprising that a database design that is well-suited for OLTP may not be well-suited for OLAP, and vice versa. In fact, as the amount of data in a highly normalized system increases, coupled with increasing complexity of reporting queries, it’s often the case that reporting queries start to take seemingly infinite time, or infinite memory, or both.
In OLAP, the goal is not to have quick inserts, updates, and deletes, but rather to filter, group, and aggregate huge amounts of data based on certain criteria. For this task, highly normalized schemas result in lots of costly joins on massive amounts of data. Denormalizing the schema to avoid most, or even all, of the joins can make OLAP queries complete in a reasonable amount of time.
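To make that concrete, here is a rough sketch of the same revenue report run against a normalized schema and against a denormalized reporting table (both schemas are made up for illustration):

    -- Normalized: the aggregate must join several large tables
    SELECT s.location, p.category, SUM(i.quantity * i.unit_price) AS revenue
    FROM order_items i
    JOIN orders o   ON o.id = i.order_id
    JOIN stores s   ON s.id = o.store_id
    JOIN products p ON p.id = i.product_id
    GROUP BY s.location, p.category;

    -- Denormalized: location and category are copied onto each row,
    -- so the same report scans one wide table with no joins at all
    SELECT store_location, product_category, SUM(extended_price) AS revenue
    FROM sales_summary
    GROUP BY store_location, product_category;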
So a highly normalized database is good for normal site operations, but a denormalized database is good for reporting. Can these two be reconciled? Unfortunately, no, they should not be reconciled within the same database. Denormalizing data in an operational database can quickly lead to bugs (so-called insert, update, and delete anomalies). And staying normalized causes reporting queries to be unreasonably slow, and downright dangerous if they are executed in the same database as your OLTP operational queries.
The proper place to run reporting queries is in a data warehouse. A data warehouse is, in rough terms, a place where all of your historical data resides, and in a format that is optimized for complex reporting queries. OLAP systems rely on highly denormalized data, usually in a star or snowflake schema pattern, which increases the speed of processing huge amounts of data by eliminating joins on many tables. Here, inconsistencies are not a concern because your star schema data is generated from your DKNF data, which you bend over backward to keep accurate.
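As a minimal sketch (table and column names are illustrative only), a star schema for the sales example above centers a fact table of measurements on a set of small dimension tables:

    -- Dimension tables: small and descriptive, one row per store, product, or day
    CREATE TABLE store_dimension   (id serial PRIMARY KEY, name text, city text, state text);
    CREATE TABLE product_dimension (id serial PRIMARY KEY, name text, category text, on_sale boolean);
    CREATE TABLE date_dimension    (id serial PRIMARY KEY, day date, week int, month int, year int);

    -- Fact table: one row per sale, holding measures plus keys into the dimensions
    CREATE TABLE sales_facts (
      store_id       integer REFERENCES store_dimension(id),
      product_id     integer REFERENCES product_dimension(id),
      date_id        integer REFERENCES date_dimension(id),
      quantity       integer,
      extended_price numeric(10,2)
    );

A report groups the fact table by whichever dimension attributes it needs, so even very large scans involve only a few joins to small tables.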
Getting data out of your production database and into a data warehouse is not an easy task, though, and it’s hard to convince anyone that you need to spend loads of time building a data warehouse before you have any meaningful data to report on. However, as soon as your website appears to be doing reasonably well with users, it’s time to invest some resources in building a data warehouse. And do it before you build lots of one-off reports that will surely cause the Heisenberg Uncertainty Principle of Website Reporting to set in.
Therefore, while chaining a slave off your database to run reporting queries (an extremely common practice in the MySQL world) seems like low-hanging fruit, it’s really not the fruit you want. It’s really kind of like rotten fruit. The fresh, delicious fruit you want for reporting is a data warehouse.
Unfortunately, building a data warehouse is beyond the scope of this book; there are many books on the topic, and a good introduction is The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling by Ralph Kimball (Wiley). A topic that does fall squarely within the scope of this book, however, is ensuring that your website scales for users, and that is the subject of the next several chapters on schema design.