Chapter 4. Performance Tuning

You are truly blessed if performance tuning has never been an issue with a report you built. Usually it’s not a question of if, but when the performance of a business solution becomes a priority.

Generally, if you’ve taken the previous chapters seriously and transformed all data into a star schema data model, you’ve already taken an important step toward well-performing reports. The shape of the data model plays an important role in report performance. But, of course, many more factors determine how quickly reports return data or react to filters. Because this book is about data modeling, I limit myself to discussing performance-tuning topics only as they relate to data modeling.

My first computer had a Turbo button on the front of its case, next to the power button. I rarely used it in the first weeks, but sooner or later I asked myself: why should I run everything at a lower speed? The same applies to the data model you build. Why build a model that doesn’t run as quickly as possible? You should always keep performance in mind when building your data models.

Unfortunately, there’s no Turbo button in Power BI to hit after powering on. But there are concepts you can apply.

If you’re about my age, you may have had a paper list of phone numbers for your family, friends, and neighbors in your youth. Mine had the most important people first, and I added more and more people to it over time. When the list got to a decent length, scanning it every time I needed a number frustrated me. So I started a new list and split the names and numbers across different pages: one page per letter of the alphabet, sorted alphabetically by first name.

This principle applies to databases as well. You can create simple tables where new data is (chronologically) appended at the end of the table (or slotted in between rows after a row is deleted). Adding data to such a table takes very little time because there’s no need to find the right place; you can just use the next empty space. But you pay a penalty when you read from the table, because its full content has to be scanned for every query. Filters only reduce the result set, not the amount of scanning.

The alternative is to store all the rows of a table in a certain order. As long as your filter refers to the ordering key, finding the right rows can be faster: you skip all the rows that don’t fulfill the search condition, return all matching rows, and stop as soon as non-matching rows appear. In reality, this can be even faster, because databases store metadata specifically to speed up queries. But writing data into such a table will be a bit slower: the right position for the new rows has to be found, space may need to be made available at that position, and the metadata must be maintained.
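
Here is a minimal Python sketch of this trade-off, using a hypothetical list of (order_date, amount) rows. Real database engines work with pages and B-trees rather than Python lists, but the asymmetry between reads and writes is the same:

```python
from bisect import bisect_left, insort

# Heap-style table: writes are cheap, reads scan everything.
heap_rows = []
heap_rows.append(("2024-03-15", 100.0))       # O(1): use the next free slot
march = [r for r in heap_rows if r[0].startswith("2024-03")]  # O(n) full scan

# Ordered table: reads on the sort key locate matches quickly and stop early.
sorted_rows = [("2024-01-02", 50.0), ("2024-02-10", 75.0), ("2024-03-01", 20.0)]
lo = bisect_left(sorted_rows, ("2024-03",))   # first possible March row
hi = bisect_left(sorted_rows, ("2024-04",))   # first row past March
march = sorted_rows[lo:hi]                    # O(log n) to locate the range

# ...but writes must find the right position and make room there.
insort(sorted_rows, ("2024-02-20", 30.0))     # O(n): shifts later rows
```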

These examples illustrate a very important principle: you can trade space on disk or in memory for query speed. Speeding up queries this way will likely slow down write operations. In analytics, data is read frequently and must come back quickly, while refreshes (write operations) run only at scheduled points in time and are far fewer in number. Optimizing for read operations is therefore a good idea, even when it slows down the writes.

You can choose one of the following options for storing data in tables:

Storing only queries

You could opt not to physically store (duplicate) the data, but keep it in the original tables. Instead of the data, you store only the query that returns the data in the desired shape. The advantage is that no extra space is needed to store the (shaped) data, and no steps to update the data have to be scheduled: the query result is always fresh. Depending on the type of transformations and the way the source tables are stored, however, the query will need some time to run each time it is executed.
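
As a sketch (with hypothetical table and column names), storing only the query amounts to keeping a function that reshapes the original data on every call, much like a database view:

```python
# The original, untransformed source table.
orders = [
    {"product": "A", "qty": 2, "unit_price": 10.0},
    {"product": "A", "qty": 1, "unit_price": 10.0},
    {"product": "B", "qty": 5, "unit_price": 4.0},
]

def revenue_by_product():
    """The stored query: re-runs the transformation on every call."""
    result = {}
    for row in orders:
        result[row["product"]] = (result.get(row["product"], 0.0)
                                  + row["qty"] * row["unit_price"])
    return result

# Always fresh, but the full transformation cost is paid on every call.
print(revenue_by_product())  # {'A': 30.0, 'B': 20.0}
```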

Storing query results

Instead of running the queries against the data source every single time you need the data, you could store the result in a table and schedule a refresh. This will occupy space on disk or in memory, but speed up the queries because the result of the transformations is already persisted. The challenge is to schedule the refresh often enough so that the reports do not show stale data.
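
A minimal sketch of this idea follows; the function and variable names are hypothetical, and a real solution would persist the result to disk and use a scheduler instead of a manual call:

```python
import time

def run_query():
    """Stand-in for an expensive transformation against the source tables."""
    time.sleep(0.1)                 # simulate the query cost
    return {"A": 30.0, "B": 20.0}

# Refresh: run the query once and persist the result.
cache = {"result": run_query(), "refreshed_at": time.time()}

def refresh():
    """Scheduled job: recompute the result and record when it happened."""
    cache["result"] = run_query()
    cache["refreshed_at"] = time.time()

def read():
    """Reports read the persisted copy: fast, but only as fresh as the last refresh."""
    return cache["result"]
```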

Adding metadata

You can distinguish between metadata added automatically by the database system (like descriptive statistics) and metadata added explicitly (like indexes). A database index is like the index at the end of this book. Instead of scanning the whole book for the term “foreign key,” you can jump to the index, where important terms are ordered alphabetically. There you can quickly discover whether the book covers the term at all, along with the pages where it is discussed. While the book itself is ordered by its chapters, the index is ordered by an additional key. For tables, it is not uncommon to have more than one index.
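
The following Python sketch builds such an explicit index (column names are hypothetical): a separate structure, ordered by its own key, that points back to the rows instead of duplicating the whole table:

```python
rows = [
    {"customer": "Ada", "amount": 100.0},
    {"customer": "Bo",  "amount": 40.0},
    {"customer": "Ada", "amount": 25.0},
]

# Build the index once: key -> positions of matching rows (the write-side cost).
index = {}
for pos, row in enumerate(rows):
    index.setdefault(row["customer"], []).append(pos)

# Look up rows without scanning the table (the read-side gain).
ada_rows = [rows[pos] for pos in index.get("Ada", [])]

# Like a book index, it must be maintained: every new row updates it, too.
rows.append({"customer": "Bo", "amount": 60.0})
index.setdefault("Bo", []).append(len(rows) - 1)
```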

Adding data of different granularity to the data model

Querying a table by its sort order or via an index will be faster than querying by an order that doesn’t match the table’s, or without a covering index. But a query still needs to collect the necessary rows and calculate the aggregated values for the report (which typically does not show single transactions, but data grouped by dimensions). It would, of course, be faster if those aggregated values were already stored (persisted) in a table. This is what aggregation tables are about: they store the same information as the base table, but at different levels of granularity. For each report query, the table with the right level of aggregation is used.
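
A sketch of an aggregation table (table and column names are hypothetical): the same facts as the base table, pre-grouped at monthly granularity during the refresh:

```python
from collections import defaultdict

base_table = [  # one row per transaction
    {"date": "2024-01-05", "amount": 100.0},
    {"date": "2024-01-17", "amount": 40.0},
    {"date": "2024-02-02", "amount": 25.0},
]

# Built during refresh and stored alongside the base table.
sales_by_month = defaultdict(float)
for row in base_table:
    sales_by_month[row["date"][:7]] += row["amount"]   # "YYYY-MM" granularity

# A monthly report now reads two aggregated rows instead of scanning every
# transaction; the engine picks the table whose granularity fits the query.
print(dict(sales_by_month))  # {'2024-01': 140.0, '2024-02': 25.0}
```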

No matter which of these solutions you implement, all of them employ the same strategy: exchanging disk space for query runtime, at the price of a longer-running transformation and refresh of the user-facing data model.

Key Takeaways

A good data model takes query performance into account. By following the principles of the earlier chapters, you’ve already created a data model with good query performance. No matter which data model you design or which tools you use, you have a wide variety of possibilities to control performance by applying a mix of three options:

  • Directly querying the data source will always return the freshest information, but the query time might not be acceptable (due to complex transformations or a data source not designed for such ad hoc queries).

  • You can speed up queries by pre-computing all or part of the data needed. Transformations can be persisted; statistics and indexes help to find information faster; and data can be pre-aggregated at different levels of granularity. This takes up extra space in the database and must be refreshed regularly so that it does not contain stale data.

  • By cleverly trading off query time against the space used for the persisted data, you can achieve a balanced system that satisfies both the need for fast reports and the limits of the available storage resources.

In Part II, we leave the world of concepts and dive into Power BI Desktop and its possibilities for creating the data model, which will make the lives of your report creators easier.
