Chapter 1. SQL Server Setup and Configuration

Database servers never live in a vacuum. They belong to an ecosystem of one or more applications used by customers. Application databases are hosted on one or more instances of SQL Server, and these instances, in turn, run on physical or virtual hardware. The data is stored on disks that are usually shared with other customers and database systems. Finally, all components use a network for communication and storage.

The complexity and internal dependencies of database ecosystems make troubleshooting a very challenging task. From the customers’ standpoint, most problems present themselves as general performance issues: applications might feel slow and unresponsive, database queries might time out, or applications might not connect to the database. The root cause of the issues could be anywhere. Hardware could be malfunctioning or incorrectly configured; the database might have inefficient schema, indexing, or code; SQL Server could be overloaded; client applications could have bugs or design issues. This means you’ll need to take a holistic view of your entire system in order to identify and fix problems.

This book is about troubleshooting SQL Server issues. However, you should always start the troubleshooting process by analyzing your application’s ecosystem and SQL Server environment. This chapter will give you a set of guidelines on how to perform that validation and detect the most common inefficiencies in SQL Server configurations.

First, I’ll discuss the hardware and operating system setup. Next, I’ll talk about SQL Server and database configuration. I’ll also touch on the topics of SQL Server consolidation and the overhead that monitoring can introduce into the system.

Hardware and Operating System Considerations

In most cases, troubleshooting and performance tuning processes happen in production systems that host a lot of data and work under heavy loads. You have to deal with the issues and tune the live systems. Nevertheless, it is impossible to completely avoid a discussion about hardware provisioning, especially because after troubleshooting you may find that your servers cannot keep up with the load and need to be upgraded.

I am not going to recommend particular vendors, parts, or model numbers; computer hardware improves quickly and any such specific advice would be obsolete by the time this book is published. Instead, I’ll focus on common-sense considerations with long-term relevance.

CPU

The license cost of a commercial database engine is, by far, the most expensive part of the system. SQL Server is no exception: you could build a decent server for less than the retail price of four cores of SQL Server Enterprise Edition. You should buy the most powerful CPU your budget allows, especially if you are using a non-Enterprise Edition of SQL Server, which limits the number of cores you can utilize.

Pay attention to the CPU model. Each new generation of CPU will introduce performance improvements over the previous generations. You may get performance improvements of 10% to 15% just by choosing newer CPUs, even when both generations of CPU have the same clock speed.

In some cases, when licensing cost is not an issue, you may need to choose between slower CPUs with more cores and faster CPUs with fewer cores. The choice greatly depends on the system workload. In general, Online Transaction Processing (OLTP) systems, and especially In-Memory OLTP, benefit from higher single-core performance. Data warehouse and analytical workloads, on the other hand, may run better with a higher degree of parallelism and more cores.

Memory

There is a joke in the SQL Server community that goes like this:

  • Q. How much memory does SQL Server usually need?
  • A. More.

This joke has merit. SQL Server benefits from a large amount of memory, which allows it to cache more data. This, in turn, reduces disk input/output (I/O) activity and improves SQL Server’s performance. Therefore, adding more memory to the server may be the cheapest and fastest way to address some performance issues.

For example, suppose the system suffers from nonoptimized queries. You could reduce the impact of these queries by adding memory and eliminating the physical disk reads they introduce. This, obviously, does not solve the root cause of the problem. It is also dangerous, because as the data grows, it eventually may not fit into the cache. However, in some cases it may be acceptable as a temporary Band-Aid solution.

The Enterprise Edition of SQL Server does not limit the amount of memory it can utilize. Non-Enterprise editions have limitations. In terms of memory utilization, the Standard Edition of SQL Server 2014 and later can use up to 128 GB of RAM for the buffer pool, 32 GB of RAM per database for In-Memory OLTP data, and 32 GB of RAM for storing columnstore index segments. Web Edition memory usage is limited to half of what the Standard Edition provides. Factor those limits into your analysis when you are provisioning or upgrading non-Enterprise Edition instances of SQL Server. Don’t forget to allocate some additional memory to other SQL Server components, such as the plan cache and lock manager.
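
If you want a quick look at how much memory the server has and how much SQL Server is targeting, you can query sys.dm_os_sys_info, as in the minimal sketch below. The exact column set varies slightly between versions, so treat it as a starting point rather than a definitive check.

SELECT
  SERVERPROPERTY('Edition') AS [Edition]
  ,physical_memory_kb / 1024 AS [Physical Memory (MB)]
  ,committed_kb / 1024 AS [Committed Memory (MB)]
  ,committed_target_kb / 1024 AS [Committed Target (MB)]
FROM sys.dm_os_sys_info;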

In the end, add as much memory as you can afford. It is cheap nowadays. There is no need to over-allocate memory if your databases are small, but think about future data growth.

Disk Subsystem

A healthy, fast disk subsystem is essential for good SQL Server performance. SQL Server is a very I/O-intensive application—it is constantly reading from and writing data to disk.

There are many options for architecting the disk subsystem for SQL Server installations. The key is to build it in a way that provides low latency for I/O requests. For critical tier-1 systems, I recommend not exceeding 3 to 5 milliseconds (ms) of latency for data file reads and writes, and 1 ms to 2 ms of latency for transaction log writes. Fortunately, those numbers are now easily achieved with flash-based storage.

There’s a catch, though: when you troubleshoot I/O performance in SQL Server, you need to analyze the latency metrics within SQL Server rather than at the storage level. It is common to see significantly higher numbers in SQL Server than in storage key performance indicators (KPIs), due to the queueing that may occur with I/O-intensive workloads. (Chapter 3 will discuss how to capture and analyze I/O performance data.)
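
As a quick preview of that analysis, the following query estimates average per-file latency from the sys.dm_io_virtual_file_stats DMV. Treat it as a rough sketch: the counters are cumulative since the last restart, so they show long-term averages rather than current behavior.

SELECT
  DB_NAME(vfs.database_id) AS [Database]
  ,mf.name AS [File]
  ,vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads,0) AS [Avg Read Latency (ms)]
  ,vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes,0) AS [Avg Write Latency (ms)]
FROM
  sys.dm_io_virtual_file_stats(NULL,NULL) vfs JOIN sys.master_files mf ON
    vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY
  [Avg Write Latency (ms)] DESC;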

If your storage subsystem provides multiple performance tiers, I recommend putting the tempdb database on the fastest drive, followed by transaction log files and then data files. The tempdb database is a shared resource on the server, and it is essential that it has good I/O throughput.

The writes to transaction log files are synchronous. It is critical to have low write latency for those files. The writes to the transaction log are also sequential; however, remember that placing multiple log and/or data files on the same drive will lead to random I/O across multiple databases.

As a best practice, I’d put data and log files on different physical drives for maintainability and recoverability reasons. You need to look at the underlying storage configuration, though. In some cases, when disk arrays do not have enough spindles, splitting them across multiple LUNs may degrade the array’s performance.

In my systems, I do not split clustered and nonclustered indexes across multiple filegroups by placing them on different drives. It rarely improves I/O performance unless you can completely separate storage paths across the filegroups. On the other hand, this configuration can significantly complicate disaster recovery.

Finally, remember that some SQL Server technologies benefit from good sequential I/O performance. For example, In-Memory OLTP does not use random I/O at all, and the performance of sequential reads usually becomes the limiting factor for database startup and recovery. Data warehouse scans would also benefit from sequential I/O performance when B-Tree and columnstore indexes are not heavily fragmented. The difference between sequential and random I/O performance is not very significant with flash-based storage; however, it may be a big factor with magnetic drives.

Network

SQL Server communicates with clients and other servers via the network. Obviously, it needs to provide enough bandwidth to support that communication. There are a couple of items I want to mention in this regard.

First, you need to analyze the entire network topology when you troubleshoot network-related performance. Remember that a network’s throughput will be limited to the speed of its slowest component. For example, you may have a 10 Gbps uplink from the server; however, if you have a 1 Gbps switch in the network path, that would become the limiting factor. This is especially critical for network-based storage: make sure the I/O path to disks is as efficient as possible.

Second, it is a common practice to build a separate network for the cluster heartbeat in AlwaysOn Failover Clusters and AlwaysOn Availability Groups. In some cases, you may also consider building a separate network for all Availability Group traffic. This is a good approach that improves cluster reliability in simple configurations, when all cluster nodes belong to the same subnet and may utilize Layer 2 routing. However, in complex multisubnet setups, multiple networks may lead to routing issues. Be careful with such setups and make sure to properly utilize networks in cross-node communication, especially in virtual environments, which I will discuss in Chapter 15.

Virtualization adds another layer of complexity here. Consider a situation where you have a virtualized SQL Server cluster with nodes running on different hosts. You would need to check that the hosts can route the cluster network traffic separately from the client traffic. Serving all VLAN traffic through a single physical network card would defeat the purpose of a heartbeat network.

Operating Systems and Applications

As a general rule, I suggest using the most recent version of an operating system that supports your version of SQL Server. Make sure both the OS and SQL Server are patched, and implement a process to do patching regularly.

If you are using an old version of SQL Server (prior to 2016), use the 64-bit variant. In most cases, the 64-bit version outperforms the 32-bit version and scales better with the hardware.

Since SQL Server 2017, it’s been possible to use Linux to host the database server. From a performance standpoint, Windows and Linux versions of SQL Server are very similar. The choice of operating system depends on the enterprise ecosystem and on what your team is more comfortable supporting. Keep in mind that Linux-based deployments may require a slightly different High Availability (HA) strategy compared to a Windows setup. For example, you may have to rely on Pacemaker instead of Windows Server Failover Cluster (WSFC) for automatic failovers.

Use a dedicated SQL Server host when possible. Remember that it’s easier and cheaper to scale application servers—don’t waste valuable resources on the database host!

On the same note, do not run nonessential processes on the server. I see database engineers running SQL Server Management Studio (SSMS) in remote desktop sessions all the time. It is always better to work remotely and not consume server resources.

Finally, if you are required to run antivirus software on the server, exclude any database folders from the scan.

Virtualization and Clouds

Modern IT infrastructure depends heavily on virtualization, which provides additional flexibility, simplifies management, and reduces hardware costs. As a result, more often than not you’ll have to work with virtualized SQL Server infrastructure.

There is nothing wrong with that. Properly implemented virtualization gives you many benefits, with acceptable performance overhead. It adds another layer of HA with VMware vSphere vMotion or Hyper-V Live Migration. It allows you to seamlessly upgrade the hardware and simplifies database management. Unless yours is the edge case where you need to squeeze the most from the hardware, I suggest virtualizing your SQL Server ecosystem.

Note

The overhead from virtualization increases on large servers with many CPUs. However, it still may be acceptable in many cases.

Virtualization, however, adds another layer of complexity during troubleshooting. You need to pay attention to the host’s health and load in addition to guest virtual machine (VM) metrics. To make matters worse, the performance impact of an overloaded host might not be clearly visible in standard performance metrics in a guest OS.

I will discuss several approaches to troubleshooting the virtualization layer in Chapter 15; however, you can start by working with infrastructure engineers to confirm that the host is not over-provisioned. Pay attention to the number of physical CPUs and allocated vCPUs on the host along with physical and allocated memory. Mission-critical SQL Server VMs should have resources reserved for them to avoid a performance impact.

Aside from the virtualization layer, troubleshooting virtualized SQL Server instances is the same as troubleshooting physical ones. The same applies to cloud installations when SQL Server is running within VMs. After all, the cloud is just a different datacenter managed by an external provider.

Configuring Your SQL Server

The default configuration for the SQL Server setup process is relatively decent and may be suited to light and even moderate workloads. There are several things you need to validate and tune, however.

SQL Server Version and Patching Level

SELECT @@VERSION is the first statement I run during SQL Server system health checks. There are two reasons for this. First, it gives me a glimpse of the system’s patching strategy so that I can potentially suggest some improvements. Second, it helps me identify possible known issues that may exist in the system.

The latter reason is very important. Many times, customers have asked me to troubleshoot problems that had already been resolved in service packs and cumulative updates. Always look at the release notes to see if any of the issues mentioned look familiar; your problem may have already been fixed.
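
Along with SELECT @@VERSION, a few SERVERPROPERTY calls return the same information in a form that is easier to parse and compare. The ProductUpdateLevel property is not populated on older builds, so treat that column as optional.

SELECT
  SERVERPROPERTY('ProductVersion') AS [Build]
  ,SERVERPROPERTY('ProductLevel') AS [Service Pack]
  ,SERVERPROPERTY('ProductUpdateLevel') AS [Cumulative Update]
  ,SERVERPROPERTY('Edition') AS [Edition];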

You might consider upgrading to the newest version of SQL Server when possible. Each version introduces performance, functional, and scalability enhancements. This is especially true if you move to SQL Server 2016 or later from older versions. SQL Server 2016 was a milestone release that included many performance enhancements. In my experience, upgrading from SQL Server 2012 to 2016 or later can improve performance by 20% to 40% without any additional steps.

It is also worth noting that starting with SQL Server 2016 SP1, many former Enterprise Edition–only features became available in the lower-end editions of the product. Some of them, like data compression, allow SQL Server to cache more data in the buffer pool and improve system performance.

Obviously, you need to test the system prior to upgrading—there is always the chance for regressions. The risk is usually small with minor patches; however, it increases with the major upgrades. You can mitigate some risks with several database options, as you will see later in this chapter.

Instant File Initialization

Every time SQL Server grows data or transaction log files—either automatically or as part of the ALTER DATABASE command—it fills the newly allocated part of the file with zeros. This process blocks all sessions that are trying to write to the corresponding file and, in the case of the transaction log, halts the generation of log records. It may also generate a spike in I/O write workload.

That behavior cannot be changed for transaction log files; SQL Server always zeros them out. However, you can disable it for data files by enabling instant file initialization (IFI). This speeds up data file growth and reduces the time required to create or restore databases.

You can enable IFI by granting the SA_MANAGE_VOLUME_NAME permission, also known as the Perform Volume Maintenance Tasks permission, to the SQL Server startup account. This can be done in the Local Security Policy management application (secpol.msc). You will need to restart SQL Server for the change to take effect.

In SQL Server 2016 and later, you can also grant this permission as part of the SQL Server setup process, as shown in Figure 1-1.

Figure 1-1. Enabling IFI during SQL Server setup

You can check whether IFI is enabled by examining the instant_file_initialization_enabled column in the sys.dm_server_services dynamic management view (DMV). This column is available in SQL Server 2012 SP4, SQL Server 2016 SP1, and later. In older versions, you can run the code shown in Listing 1-1.

Listing 1-1. Checking if instant file initialization is enabled in old SQL Server versions
DBCC TRACEON(3004,3605,-1);
GO
CREATE DATABASE Dummy;
GO
EXEC sp_readerrorlog 0,1,N'Dummy';
GO
DROP DATABASE Dummy;
GO
DBCC TRACEOFF(3004,3605,-1);
GO

If IFI is not enabled, the SQL Server log will indicate that SQL Server is zeroing out the .mdf data file in addition to zeroing out the log .ldf file, as shown in Figure 1-2. When IFI is enabled, it will only show zeroing out of the log .ldf file.

Figure 1-2. Checking if instant file initialization is enabled

There is a small security risk associated with this setting. When IFI is enabled, database administrators may see some data from previously deleted OS files by looking at newly allocated data pages in the database. That risk is acceptable in most systems; if it is acceptable in yours, enable IFI.
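
On the builds that expose the DMV column mentioned earlier (SQL Server 2012 SP4, SQL Server 2016 SP1, and later), the check is a single query:

SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services
WHERE servicename LIKE 'SQL Server (%';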

tempdb Configuration

tempdb is the system database used to store temporary objects created by users and by SQL Server internally. It is a very active database and often becomes a source of contention in the system. I will discuss how to troubleshoot tempdb-related issues in Chapter 9; in this chapter, I’ll focus on configuration.

As already mentioned, you need to place tempdb on the fastest drive in the system. Generally speaking, this drive does not need to be redundant or persistent—the database is re-created at SQL Server startup, and a local SSD or cloud ephemeral storage would work fine. Remember, however, that SQL Server will go down if tempdb is unavailable, so factor that into your design.

If you are using a non-Enterprise Edition of SQL Server and the server has more memory than SQL Server can consume, you can put tempdb on the RAM drive. Don’t do this with SQL Server Enterprise Edition, though—you’ll usually achieve better performance by using that memory for the buffer pool.

Note

Pre-allocate tempdb files to the maximum size of the RAM drive and create additional small data and log files on disk to avoid running out of space. SQL Server will not use small on-disk files until RAM drive files are full.

The tempdb database should always have multiple data files. Unfortunately, the default configuration created at SQL Server setup is not optimal, especially in the old versions of the product. I will discuss how to fine-tune the number of data files in tempdb in Chapter 9, but you can use the following as a rule of thumb in the initial configuration:

  • If the server has eight or fewer CPU cores, create the same number of data files.

  • If the server has more than eight CPU cores, use either eight data files or one-fourth the number of cores, whichever is greater, rounding up in batches of four files. For example, use 8 data files in the 24-core server and 12 data files in the 40-core server.

Finally, make sure all tempdb data files have the same initial size and auto-growth parameters specified in megabytes (MB) rather than in percentages. This will allow SQL Server to better balance usage of the data files, reducing possible contention in the system.
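
A minimal sketch of that configuration is shown below. The logical file names (tempdev, temp2, templog) and sizes are illustrative; use the names reported by sys.master_files for your instance and size the files for your workload.

ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8192MB, FILEGROWTH = 1024MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = temp2, SIZE = 8192MB, FILEGROWTH = 1024MB);
-- Repeat for the remaining data files, keeping all of them the same size
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 4096MB, FILEGROWTH = 1024MB);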

Trace Flags

SQL Server uses trace flags to enable or change the behavior of some product features. Although Microsoft has introduced more and more database and server configuration options in new versions of SQL Server, trace flags are still widely used. You will need to check any trace flags that are present in the system; you may also need to enable some of them.

You can get the list of enabled trace flags by running the DBCC TRACESTATUS command. You can enable them in SQL Server Configuration Manager and/or by using the -T SQL Server startup option.

Let’s look at some common trace flags:

T1118
This trace flag prevents the use of mixed extents in SQL Server. It helps improve tempdb throughput in SQL Server 2014 and earlier versions by reducing the number of changes to allocation map pages and, therefore, the contention on them. This trace flag is not required in SQL Server 2016 and later, where tempdb does not use mixed extents by default.
T1117
With this trace flag, SQL Server auto-grows all data files in the filegroup when one of the files is out of space. It provides more balanced I/O distribution across data files. You should enable this trace flag to improve tempdb throughput in old versions of SQL Server; however, check if any user databases have filegroups with multiple unevenly sized data files. As with T1118, this trace flag is not required in SQL Server 2016 and later, where tempdb auto-grows all data files by default.
T2371
By default, SQL Server automatically updates statistics only after 20% of the data in the index has been changed. This means that with large tables, statistics are rarely updated automatically. The T2371 trace flag changes this behavior, making the statistics update threshold dynamic—the larger the table is, the lower the percentage of changes required to trigger the update. Starting with SQL Server 2016, you can also control this behavior via database compatibility level. Nevertheless, I still recommend enabling this trace flag unless all databases on the server have a compatibility level of 130 or above.
T3226
With this trace flag, SQL Server does not write information about successful database backups to the error log. This may help reduce the size of the logs, making them more manageable.
T1222
This trace flag writes deadlock graphs to the SQL Server error log. This flag is benign; however, it makes SQL Server logs harder to read and parse. It is also redundant—you can get a deadlock graph from a System_Health Extended Event session when needed. I usually remove this trace flag when I see it.
T4199
This trace flag and the QUERY_OPTIMIZER_HOTFIXES database option (in SQL Server 2016 and later) control the behavior of Query Optimizer hotfixes. When this trace flag is enabled, the hotfixes introduced in service packs and cumulative updates will be used. This may help address some Query Optimizer bugs and improve query performance; however, it also increases the risk of plan regressions after patching. I usually do not enable this trace flag in production systems unless it is possible to perform thorough regression testing of the system before patching.
T7412
This trace flag enables the lightweight execution profiling infrastructure in SQL Server 2016 and 2017. This allows you to collect execution plans and many execution metrics for the queries in the system, with little CPU overhead. I will discuss it in more detail in Chapter 5.

To summarize, in SQL Server 2014 and earlier, enable T1118, T2371, and potentially, T1117. In SQL Server 2016 and later, enable T2371 unless all databases have a compatibility level of 130 or above. After that, look at all other trace flags in the system and understand what they are doing. Some trace flags may be inadvertently installed by third-party tools and can negatively affect server performance.
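
To review what is currently enabled and to turn a flag on globally without a restart, use DBCC TRACESTATUS and DBCC TRACEON. Keep in mind that flags enabled this way do not survive a restart, so add them as -T startup parameters as well.

DBCC TRACESTATUS(-1);    -- list all globally enabled trace flags
DBCC TRACEON(2371, -1);  -- enable T2371 globally until the next restart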

Server Options

SQL Server provides many configuration settings. I’ll cover many of them in depth later in the book; however, there are a few settings worth mentioning here.

Optimize for Ad-hoc Workloads

The first configuration setting I’ll discuss is Optimize for Ad-hoc Workloads. This option controls how SQL Server caches the execution plans of ad-hoc (nonparameterized) queries. When the setting is disabled (the default), SQL Server caches the full execution plans of those statements, which may significantly increase plan cache memory usage. When the setting is enabled, SQL Server starts by caching a small structure (just a few hundred bytes) called a plan stub, replacing it with the full execution plan if the ad-hoc query is executed a second time.

In the majority of cases, ad-hoc statements are not executed repeatedly, and it is beneficial to enable the Optimize for Ad-hoc Workloads setting in every system. It could significantly reduce plan cache memory usage at the cost of infrequent additional recompilations of ad-hoc queries. Obviously, this setting would not affect the caching behavior of parameterized queries and T-SQL database code.

Note

Starting with SQL Server 2019 and in Azure SQL Database, you can control the behavior of the Optimize for Ad-hoc Workloads setting at the database level using the OPTIMIZE_FOR_AD_HOC_WORKLOADS database scoped configuration.
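
At the server level, enabling the setting is a standard sp_configure change. It is an advanced option, so show advanced options must be enabled first:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;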

Max Server Memory

The second important setting is Max Server Memory, which controls how much memory SQL Server can consume. Database engineers love to debate how to properly configure this setting, and there are different approaches to calculate the proper value for it. Many engineers even suggest leaving the default value in place and allowing SQL Server to manage it automatically. In my opinion, it is best to fine-tune this setting, but it’s important to do so correctly (Chapter 7 will discuss the details). An incorrect setting will impact SQL Server performance more than if you leave the default value in place.

One particular issue I often encounter during system health checks is severe under-provisioning of this setting. Sometimes people forget to change it after hardware or VM upgrades; other times, it’s incorrectly calculated in nondedicated environments, where SQL Server is sharing the server with other applications. In both cases, you can get immediate improvements by increasing the Max Server Memory setting or even reconfiguring it to the default value until you perform a full analysis later.
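
The setting is dynamic and takes effect without a restart. The value below is purely illustrative; it assumes a dedicated server with 128 GB of RAM, leaving roughly 16 GB for the OS, and the real number should come from the analysis described in Chapter 7.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 114688;  -- illustrative value for a 128 GB server
RECONFIGURE;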

Affinity mask

You need to check SQL Server affinity and, potentially, set an affinity mask if SQL Server is running on hardware with multiple non-uniform memory access (NUMA) nodes. In modern hardware, each physical CPU usually becomes a separate NUMA node. If you restrict SQL Server from using some of the physical cores, you need to balance SQL Server CPUs (or schedulers; see Chapter 2) evenly across NUMA nodes.

For example, if you are running SQL Server on a server with two 18-core Xeon processors and limiting SQL Server to 24 cores, you need to set the affinity mask to utilize 12 cores from each physical CPU. This will give you better performance than having SQL Server use 18 cores from the first processor and 6 cores from the second.

Listing 1-2 shows how to analyze the distribution of SQL Server schedulers (CPUs) between NUMA nodes. Look at the count of schedulers for each parent_node_id column in the output.

Listing 1-2. Checking the distribution of NUMA node schedulers (CPUs)
SELECT
  parent_node_id
  ,COUNT(*) as [Schedulers]
  ,SUM(current_tasks_count) as [Current]
  ,SUM(runnable_tasks_count) as [Runnable]
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
GROUP BY parent_node_id;
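
If the schedulers turn out to be unbalanced and you do need to restrict SQL Server to a subset of cores, you can set the affinity with ALTER SERVER CONFIGURATION. The CPU ranges below match the two 18-core CPU example above and must be adjusted to your hardware:

-- 12 cores from each CPU, assuming cores 0-17 map to node 0 and 18-35 to node 1
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 11, 18 TO 29;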

Parallelism

It is important to check parallelism settings in the system. Default settings, like MAXDOP = 0 and Cost Threshold for Parallelism = 5, do not work well in modern systems. As with Max Server Memory, it is better to fine-tune the settings based on the system workload (Chapter 6 will discuss this in detail). However, my rule of thumb for generic settings is as follows:

  • Set MAXDOP to one-fourth the number of available CPUs in OLTP systems and to one-half of them in data warehouse systems. In very large OLTP servers, keep MAXDOP at 16 or below. Do not exceed the number of schedulers in the NUMA node.

  • Set Cost Threshold for Parallelism to 50.

Starting with SQL Server 2016 and in Azure SQL Database, you can set MAXDOP at the database level using the command ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP. This is useful when the instance hosts databases that handle different workloads.
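
A sketch of those generic settings follows; the server-level values go through sp_configure, and the database-level MAXDOP override must be run in the target database. The specific numbers are examples, not recommendations for your hardware.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;
EXEC sp_configure 'max degree of parallelism', 8;  -- example: 32-core OLTP server
RECONFIGURE;

-- Database-level override (SQL Server 2016 and later); run in the target database
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;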

Configuration settings

As with trace flags, analyze other changes in configuration settings that have been applied on the server. You can examine current configuration options using the sys.configurations view. Unfortunately, SQL Server does not provide a list of default configuration values to compare, so you need to hardcode it, as shown in Listing 1-3. I am including just a few configuration settings to save space, but you can download the full version of the script from this book’s companion materials.

Listing 1-3. Detecting changes in server configuration settings
DECLARE
    @defaults TABLE
    (
        name SYSNAME NOT NULL PRIMARY KEY, 
        def_value SQL_VARIANT NOT NULL
    )

INSERT INTO @defaults(name,def_value) 
VALUES('backup compression default',0); 
INSERT INTO @defaults(name,def_value) 
VALUES('cost threshold for parallelism',5); 
INSERT INTO @defaults(name,def_value) 
VALUES('max degree of parallelism',0);
INSERT INTO @defaults(name,def_value) 
VALUES('max server memory (MB)',2147483647);
INSERT INTO @defaults(name,def_value) 
VALUES('optimize for ad hoc workloads',0); 
/* Other settings are omitted in the book */

SELECT
    c.name, c.description, c.value_in_use, c.value
    ,d.def_value, c.is_dynamic, c.is_advanced
FROM
    sys.configurations c JOIN @defaults d ON
        c.name = d.name
WHERE
    c.value_in_use <> d.def_value OR
    c.value <> d.def_value
ORDER BY
    c.name;

Figure 1-3 shows sample output of the preceding code. The discrepancy between the value and value_in_use columns indicates pending configuration changes that require a restart to take effect. The is_dynamic column shows if the configuration option can be modified without a restart.

Figure 1-3. Nondefault server configuration options

Configuring Your Databases

As the next step, you’ll need to validate several database settings and configuration options. Let’s look at them.

Database Settings

SQL Server allows you to change multiple database settings, tuning its behavior to meet system workload and other requirements. I’ll cover many of them later in the book; however, there are a few settings I would like to discuss here.

The first one is Auto Shrink. When this option is enabled, SQL Server periodically shrinks the database and releases unused free space from the files to the OS. While this looks appealing and promises to reduce disk space utilization, it may introduce issues.

The database shrink process works at the physical level. It locates empty space at the beginning of the file and moves allocated extents from the end of the file into that space, without taking extent ownership into consideration. This introduces noticeable load and leads to serious index fragmentation. What’s more, in many cases it’s useless: the database files simply expand again as the data grows. It’s always better to manage file space manually and disable Auto Shrink.

Another database option, Auto Close, controls how SQL Server caches data from the database. When Auto Close is enabled, SQL Server removes data pages from the buffer pool and execution plans from the plan cache when the database does not have any active connections. This leads to a performance impact for new sessions, when data needs to be cached and queries need to be compiled again.

With very few exceptions, you should disable Auto Close. One such exception may be an instance that hosts a large number of rarely accessed databases. Even then, I would consider keeping this option disabled and allowing SQL Server to retire cached data in the normal way.

Make sure the Page Verify option is set to CHECKSUM. This detects consistency errors more efficiently and helps resolve database corruption cases.
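
These three options are all set per database. A minimal sketch, using a hypothetical database named Orders:

ALTER DATABASE Orders SET AUTO_SHRINK OFF;
ALTER DATABASE Orders SET AUTO_CLOSE OFF;
ALTER DATABASE Orders SET PAGE_VERIFY CHECKSUM;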

Pay attention to the database recovery model. If a database uses the SIMPLE recovery model, it is impossible to recover past the last FULL or DIFFERENTIAL backup in the event of a disaster. If you find a database in this mode, immediately discuss it with the stakeholders, making sure they understand the risk of data loss.

Database Compatibility Level controls SQL Server’s compatibility and behavior on the database level. For example, if you are running SQL Server 2019 and have a database with a compatibility level of 130 (SQL Server 2016), SQL Server will behave as if the database is running on SQL Server 2016. Keeping the databases on the lower compatibility levels simplifies SQL Server upgrades by reducing possible regressions; however, it also blocks you from getting some new features and enhancements.

As a general rule, run databases on the latest compatibility level that matches the SQL Server version. Be careful when you change it: as with any version change, this may lead to regressions. Test the system before the change and make sure you can roll back the change if needed, especially if the database has a compatibility level of 110 (SQL Server 2012) or below. Increasing the compatibility level to 120 (SQL Server 2014) or above will enable a new cardinality estimation model and may significantly change execution plans for the queries. Test the system thoroughly to understand the impact of the change.

You can force SQL Server to use legacy cardinality estimation models with the new database compatibility levels by setting the LEGACY_CARDINALITY_ESTIMATION database option to ON in SQL Server 2016 and later, or by enabling the server-level trace flag T9481 in SQL Server 2014. This approach will allow you to perform upgrade or compatibility level changes in phases, reducing the impact to the system. (Chapter 5 will cover cardinality estimation in more detail and discuss how to reduce risks during SQL Server upgrades and database compatibility level changes.)
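
For example, on SQL Server 2019 you might raise the compatibility level and keep the legacy cardinality estimator during the transition period. The database name is again hypothetical, and the scoped configuration statement must run in that database:

ALTER DATABASE Orders SET COMPATIBILITY_LEVEL = 150;  -- SQL Server 2019

-- Optionally keep the legacy cardinality estimator while testing the new level
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;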

Transaction Log Settings

SQL Server uses write-ahead logging, persisting information about all database changes in a transaction log. SQL Server works with transaction logs sequentially, in merry-go-round fashion. In most cases, you won’t need multiple log files in the system—they make database administration more complicated and rarely improve performance.

Internally, SQL Server splits transaction logs into chunks called Virtual Log Files (VLFs) and manages them as single units. For example, SQL Server cannot truncate and reuse a VLF if it contains just a single active log record. Pay attention to the number of VLFs in the database. Too few very large VLFs will make log management and truncation suboptimal. Too many small VLFs will degrade the performance of transaction log operations. Try not to exceed several hundred VLFs in production systems.

The number of VLFs SQL Server adds when it grows a log depends on the SQL Server version and the size of the growth. In most cases, it creates 8 VLFs when the growth size is between 64 MB and 1 GB, or 16 VLFs with growth that exceeds 1 GB. Do not use percent-based auto-growth configuration, because it generates lots of unevenly sized VLFs. Instead, change the log auto-growth setting to grow the file in chunks. I usually use chunks of 1,024 MB, which generates 128 MB VLFs, unless I need a very large transaction log.

You can count the VLFs in the database with the sys.dm_db_log_info DMV in SQL Server 2016 and later. In older versions of SQL Server, you can obtain that information by running DBCC LOGINFO. If the transaction log isn’t configured well, consider rebuilding it. You can do this by shrinking the log to the minimal size and growing it in chunks of 1,024 MB to 4,096 MB.
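
A quick way to count VLFs in the current database with that DMV:

SELECT COUNT(*) AS [VLF Count]
FROM sys.dm_db_log_info(DB_ID());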

Do not auto-shrink transaction log files. They will grow again and affect performance when SQL Server zeroes out the file. It is better to pre-allocate the space and manage log file size manually. Do not restrict the maximum size and auto-growth, though—you want logs to grow automatically in case of emergencies. (Chapter 11 will provide more details on how to troubleshoot transaction log issues.)

Data Files and Filegroups

By default, SQL Server creates new databases using the single-file PRIMARY filegroup and one transaction log file. Unfortunately, this configuration is suboptimal from performance, database management, and HA standpoints.

SQL Server tracks space usage in the data files through system pages called allocation maps. In systems with highly volatile data, allocation maps can be a source of contention: SQL Server serializes access to them during their modifications (more about this in Chapter 10). Each data file has its own set of allocation map pages, and you can reduce contention by creating multiple files in the filegroup with the active modifiable data.

Ensure that data is evenly distributed across multiple data files in the same filegroup. SQL Server uses an algorithm called Proportional Fill, which writes most of the data to the file that has the most free space available. Evenly sized data files will help balance those writes, reducing allocation map contention. Make sure all data files in the filegroup have the same size and auto-growth parameters, specified in megabytes.

You may also want to enable the AUTOGROW_ALL_FILES filegroup option (available in SQL Server 2016 and later), which triggers auto-growth for all files in the filegroup simultaneously. You can use trace flag T1117 for this in prior versions of SQL Server, but remember that this flag is set on the server level and will affect all databases and filegroups in the system.
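
Enabling the option takes one statement per filegroup; the database and filegroup names here are hypothetical:

ALTER DATABASE Orders MODIFY FILEGROUP [FG1] AUTOGROW_ALL_FILES;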

It is often impractical or impossible to change the layout of existing databases. However, you may need to create new filegroups and move data around during performance tuning. Here are a few suggestions for doing this efficiently:

  • Create multiple data files in filegroups with volatile data. I usually start with four files and increase the number if I see latching issues (see Chapter 10). Make sure all data files have the same size and auto-growth parameters specified in megabytes; enable the AUTOGROW_ALL_FILES option. For filegroups with read-only data, one data file is usually enough.

  • Do not spread clustered indexes, nonclustered indexes, or large object (LOB) data across multiple filegroups. This rarely helps with performance and may introduce issues in cases of database corruption.

  • Place related entities (e.g., Orders and OrderLineItems) in the same filegroup. This will simplify database management and disaster recovery.

  • Keep the PRIMARY filegroup empty if possible.

Figure 1-4 shows an example of a database layout for a hypothetical eCommerce system. The data is partitioned and spread across multiple filegroups with the goal of minimizing downtime and utilizing partial database availability in case of disaster.1 It will also allow you to improve the backup strategy by implementing partial database backups and excluding read-only data from full backups.

Figure 1-4. Database layout for an eCommerce system

Analyzing the SQL Server Error Log

The SQL Server Error Log is another place I usually check at the beginning of troubleshooting. I like to see what errors it contains; they can point to areas that require follow-up. For example, errors 823 and 824 can indicate issues with the disk subsystem and/or database corruption.

You can read the content of the error log in SSMS. You can also get it programmatically using the xp_readerrorlog system stored procedure. The challenge here is the amount of data in the log: the noise from the information messages may hide useful data.

The code in Listing 1-4 helps you address that problem. It allows you to filter out unnecessary noise and focus on the error messages. You can control the behavior of the code with the following variables:

@StartDate and @EndDate
Define the time for analysis.
@NumErrorLogs
Specifies the number of log files to read if SQL Server rolls them over.
@ExcludeLogonErrors
Omits logon auditing messages.
@ShowSurroundingEvents and @ExcludeLogonSurroundingEvents
Allow you to retrieve the information messages around the error entries from the log. The time window for those messages is controlled by the @SurroundingEventsBeforeSecond and @SurroundingEventsAfterSecond variables.

The script produces two outputs. The first one shows the entries from the error log that include the word error. When the @ShowSurroundingEvents parameter is enabled, it also provides log entries around those error lines. You can exclude some log entries that contain the word error from the output by inserting them into the @ErrorsToIgnore table.

Listing 1-4. Analyzing the SQL Server error log
IF OBJECT_ID('tempdb..#Logs',N'U') IS NOT NULL DROP TABLE #Logs;
IF OBJECT_ID('tempdb..#Errors',N'U') IS NOT NULL DROP TABLE #Errors;
GO

CREATE TABLE #Errors
(
  LogNum INT NULL,
  LogDate DATETIME NULL,
  ID INT NOT NULL identity(1,1),
  ProcessInfo VARCHAR(50) NULL,
  [Text] NVARCHAR(MAX) NULL,
  PRIMARY KEY(ID)
);

CREATE TABLE #Logs
(
  [LogDate] DATETIME NULL,
  ProcessInfo VARCHAR(50) NULL,
  [Text] NVARCHAR(MAX) NULL
);

DECLARE 
  @StartDate DATETIME = DATEADD(DAY,-7,GETDATE())
  ,@EndDate DATETIME = GETDATE()
  ,@NumErrorLogs INT = 1
  ,@ExcludeLogonErrors BIT = 1
  ,@ShowSurroundingEvents BIT = 1
  ,@ExcludeLogonSurroundingEvents BIT = 1
  ,@SurroundingEventsBeforeSecond INT = 5
  ,@SurroundingEventsAfterSecond INT = 5
  ,@LogNum INT = 0;

DECLARE
  @ErrorsToIgnore TABLE
  (
    ErrorText NVARCHAR(1024) NOT NULL
  );

INSERT INTO @ErrorsToIgnore(ErrorText)
VALUES
  (N'Registry startup parameters:%'),
  (N'Logging SQL Server messages in file%'),
  (N'CHECKDB for database%finished without errors%');

WHILE (@LogNum <= @NumErrorLogs) 
BEGIN 
  INSERT INTO #Errors(LogDate,ProcessInfo,Text)
    EXEC [master].[dbo].[xp_readerrorlog] 
      @LogNum, 1, N'error', NULL, @StartDate, @EndDate, N'desc';
  IF @@ROWCOUNT > 0
    UPDATE #Errors SET LogNum = @LogNum WHERE LogNum IS NULL;  
  SET @LogNum += 1;
END;

IF @ExcludeLogonErrors = 1
  DELETE FROM #Errors WHERE ProcessInfo = 'Logon';

DELETE FROM e
FROM #Errors e
WHERE EXISTS
(
  SELECT *
  FROM @ErrorsToIgnore i
  WHERE e.Text LIKE i.ErrorText
);

-- Errors only
SELECT * FROM #Errors ORDER BY LogDate DESC;

IF @@ROWCOUNT > 0 AND @ShowSurroundingEvents = 1
BEGIN
  DECLARE
    @LogDate DATETIME
    ,@ID INT = 0

  WHILE 1 = 1
  BEGIN
    SELECT TOP 1 @LogNum = LogNum, @LogDate = LogDate, @ID = ID 
    FROM #Errors 
    WHERE ID > @ID
    ORDER BY ID;

    IF @@ROWCOUNT = 0
      BREAK;

    SELECT 
      @StartDate = DATEADD(SECOND, -@SurroundingEventsBeforeSecond, @LogDate)
      ,@EndDate = DATEADD(SECOND, @SurroundingEventsAfterSecond, @LogDate);

    INSERT INTO #Logs(LogDate,ProcessInfo,Text)
      EXEC [master].[dbo].[xp_readerrorlog] 
        @LogNum, 1, NULL, NULL, @StartDate, @EndDate;
  END;

  IF @ExcludeLogonSurroundingEvents = 1
    DELETE FROM #Logs WHERE ProcessInfo = 'Logon';

  DELETE FROM e
  FROM #Logs e
  WHERE EXISTS
  (
    SELECT *
    FROM @ErrorsToIgnore i
    WHERE e.Text LIKE i.ErrorText
  );

  SELECT * FROM #Logs ORDER BY LogDate DESC;
END

I am not going to show the full list of possible errors here; it may be excessive and, in many cases, is system specific. But you need to analyze any suspicious data from the output and understand its possible impact on the system.

Finally, I suggest setting up alerts for high-severity errors in SQL Server Agent, if this has not already been done. You can read the Microsoft documentation on how to do that.
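
As a starting point, you can create one alert per high-severity level with msdb.dbo.sp_add_alert, as in the sketch below. The alert name is arbitrary, and in practice you would also attach an operator notification with sp_add_notification.

EXEC msdb.dbo.sp_add_alert
  @name = N'Severity 21 Errors'
  ,@severity = 21
  ,@delay_between_responses = 60
  ,@include_event_description_in = 1;  -- include the error text in email notifications
-- Repeat for severity levels 19, 20, and 22 through 25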

Consolidating Instances and Databases

You can’t talk about SQL Server troubleshooting without discussing consolidation of database and SQL Server instances. While consolidating often reduces hardware and licensing costs, it doesn’t come for free; you need to analyze its possible negative impact on current or future system performance.

There is no universal consolidation strategy that can be used with every project. You should analyze the amount of data, the load, the hardware configuration, and your business and security requirements when making this decision. However, as a general rule, avoid consolidating OLTP and data warehouse/reporting databases on the same server when they are working under a heavy load (or, if they are consolidated, consider splitting them). Data warehouse queries usually process large amounts of data, which leads to heavy I/O activity and flushes the content of the buffer pool. Taken together, this negatively affects the performance of other systems.

In addition, analyze your security requirements when consolidating databases. Some security features, such as Audit, affect the entire server and add performance overhead for all databases on the server. Transparent Data Encryption (TDE) is another example: even though TDE is a database-level feature, SQL Server encrypts tempdb when either of the databases on the server has TDE enabled. This leads to performance overhead for all other systems.

As a general rule, do not keep databases with different security requirements on the same instance of SQL Server. Look at the trends and spikes in metrics and separate databases from each other when needed. (I will provide code to help you analyze CPU, I/O, and memory usage on a per-database basis later in the book.)

I suggest utilizing virtualization and consolidating multiple VMs on one or a few hosts, instead of putting multiple independent and active databases on a single SQL Server instance. This will give you much better flexibility, manageability, and isolation between the systems, especially if multiple SQL Server instances are running on the same server. It is much easier to manage their resource consumption when you virtualize them.

Observer Effect

The production deployment of every serious SQL Server system requires implementing a monitoring strategy. This may include third-party monitoring tools, code built based on standard SQL Server technologies, or both.

A good monitoring strategy is essential for SQL Server production support. It helps you be more proactive and reduces incident detection and recovery times. Unfortunately, it does not come for free—every type of monitoring adds overhead to the system. In some cases, this overhead may be negligible and acceptable; in others it may significantly affect server performance.

During my career as an SQL Server consultant, I’ve seen many cases of inefficient monitoring. For example, one client was using a tool that provided information about index fragmentation by calling the sys.dm_db_index_physical_stats function, in DETAILED mode, every four hours for every index in the database. This introduced huge spikes in I/O and cleared the buffer pool, leading to a noticeable performance hit. Another client used a tool that constantly polled various DMVs, adding significant CPU load to the server.
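
For comparison, the same fragmentation information is usually good enough when collected in LIMITED mode, which reads only the index pages above the leaf level and is dramatically cheaper than a DETAILED scan:

SELECT
  OBJECT_NAME(ips.object_id) AS [Table]
  ,ips.index_id
  ,ips.avg_fragmentation_in_percent
  ,ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
WHERE ips.page_count > 1000;  -- ignore small indexes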

Fortunately, in many cases you will be able to see those queries and evaluate their impact during system troubleshooting. However, this is not always the case with other technologies; one example is monitoring based on Extended Events (xEvents). While Extended Events is a great technology that allows you to troubleshoot complex problems in SQL Server, it is not the best choice as a profiling tool. Some events are heavy and may introduce large overhead in busy environments.

Let’s look at an example that creates an xEvents session that captures queries running in the system, as shown in Listing 1-5.

Listing 1-5. Creating an xEvents session to capture queries in the system
CREATE EVENT SESSION CaptureQueries ON SERVER
ADD EVENT sqlserver.rpc_completed
(
  SET collect_statement=(1)
  ACTION
  (
    sqlos.task_time
    ,sqlserver.client_app_name
    ,sqlserver.client_hostname
    ,sqlserver.database_name
    ,sqlserver.nt_username
    ,sqlserver.sql_text
  )
),
ADD EVENT sqlserver.sql_batch_completed
(
  ACTION
  (
    sqlos.task_time
    ,sqlserver.client_app_name
    ,sqlserver.client_hostname
    ,sqlserver.database_name
    ,sqlserver.nt_username
    ,sqlserver.sql_text
  )
),
ADD EVENT sqlserver.sql_statement_completed
ADD TARGET package0.event_file
(SET FILENAME=N'C:\PerfLogs\LongSql.xel',MAX_FILE_SIZE=(200))
WITH
(
  MAX_MEMORY =4096 KB
  ,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS
  ,MAX_DISPATCH_LATENCY=5 SECONDS
);

Next, deploy it to a server that operates under a heavy load with a large number of concurrent requests. Measure the throughput in the system, with and without the xEvents session running. Obviously, be careful—and don’t run it on the production server!

Figure 1-5 illustrates the CPU load and number of batch requests per second in both scenarios on one of my servers. As you can see, enabling the xEvents session decreased throughput by about 20%. To make matters worse, it would be very hard to detect the existence of that session on the server.

Figure 1-5. Server throughput with and without an active xEvents session

Obviously, the degree of impact would depend on the system’s workload. In either case, check for any unnecessary monitoring or data collection tools when you do the troubleshooting.

The bottom line: evaluate the monitoring strategy and estimate its overhead as part of your analysis, especially when the server hosts multiple databases. For example, xEvents works at the server level. While you can filter the events based on the database_id field, the filtering occurs after an event has been fired. This can affect all databases on the server.

Summary

System troubleshooting is a holistic process that requires you to analyze your entire ecosystem. You need to look at the hardware, OS, and virtualization layers, and at the SQL Server and database configurations, and adjust them as needed.

SQL Server provides many settings that you can use to fine-tune the installation to the system workload. There are also best practices that apply to most systems, including enabling the IFI and Optimize for Ad-Hoc Workloads settings, increasing the number of files in tempdb, turning on some trace flags, disabling Auto Shrink, and setting up correct auto-growth parameters for database files.

In the next chapter, I’ll talk about one of the most important components in SQL Server—SQLOS—and a troubleshooting technique called Wait Statistics.

Troubleshooting Checklist

  • Perform a high-level analysis of the hardware, network, and disk subsystem.

  • Discuss host configuration and load in virtualized environments with infrastructure engineers.

  • Check OS and SQL Server versions, editions, and patching levels.

  • Check if instant file initialization is enabled.

  • Analyze trace flags.

  • Enable Optimize for Ad-Hoc Workloads.

  • Check memory and parallelism settings on the server.

  • Look at tempdb settings (including number of files); check for trace flag T1118, and potentially T1117, in SQL Server versions prior to 2016.

  • Disable Auto Shrink for databases.

  • Validate data and transaction log file settings.

  • Check the number of VLFs in the transaction log files.

  • Check errors in the SQL Server log.

  • Check for unnecessary monitoring in the system.

1 For a deep dive into data partitioning and disaster recovery strategies, please see my book Pro SQL Server Internals, Second Edition (Apress, 2016).
