Windows 2000 Performance Guide: Help for Administrators and Application Developers

By Mark Friedman, Odysseas Pentakalos
Price: $44.95 USD
£31.95 GBP



Table of Contents

Chapter 1: Perspectives on Performance Management
Our goal in writing this book is to provide a good introduction to the Windows 2000 operating system and its hardware environment, focusing on understanding how it works from the standpoint of a performance analyst responsible for planning, configuration, and tuning. Our target audience consists of experienced performance analysts, system administrators, and developers who are already familiar with the Windows 2000 operating system environment.
Taken as a whole, Windows 2000 performance involves not just the operating system, but also the performance characteristics of various types of computer hardware, application software, and communications networks. The key operating system performance issues include understanding CPU scheduling and multiprogramming issues and the role and impact of virtual memory. Key areas of hardware include processor performance and the performance characteristics of I/O devices and other peripherals. Besides the operating system (OS), it is also important to understand how the database management system (DBMS) and transaction-monitoring software interact with both the OS and the application software and affect the performance of applications. Network transmission speeds and protocols are key determinants of communication performance. This is a lot of ground to cover in a single book, and there are many areas where our treatment of the topic is cursory at best. We have compiled a rather lengthy bibliography referencing additional readings in areas that we can discuss only briefly in the main body of this book.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Windows 2000 Evolution
Before we start to describe the way Windows 2000 works in detail, let's summarize the evolution of Microsoft's premier server operating system. While we try to concentrate on current NT technology (i.e., Windows 2000), this book encompasses all current NT releases from Version 3.5 onward, including current versions of NT 4.0 and Windows 2000. Each succeeding version of Windows NT incorporates significant architectural changes and improvements that affect the way the operating system functions. In many cases, these changes impact the interpretation of specific performance data counters, making the job of writing this book quite challenging.
Of necessity, the bulk of the book focuses on Windows 2000, with a secondary interest in Windows NT 4.0, as these are the releases currently available and in wide distribution. We have also had some experience running beta and prerelease versions of Windows XP. From a performance monitoring perspective, the 32-bit version of XP appears quite similar to Windows 2000. The few remaining NT 3.51 machines we stumble across are usually stable production environments running Citrix's WinFrame multiuser software that people are not willing to upgrade.
A daunting challenge to the reader who is familiar with some of the authoritative published works on NT performance is understanding that what was once good solid advice has become obsolete due to changes in the OS between Version 3, Version 4, and Windows 2000. Keeping track of the changes in Windows NT since Version 3 and how these changes impact system performance and tuning requires a little perspective on the evolution of the NT operating system, which is, after all, a relatively new operating system. Our purpose in highlighting the major changes here is to guide the reader who may be familiar with an older version of the OS. The following table summarizes these changes.
New feature/change
Additional content appearing in this section has been removed.
Tools of the Trade
Understanding the tools of the performance analyst's trade is essential for approaching matters of Windows 2000 performance. Rather than focus merely on tools and how they work, we chose to emphasize the use of analytic techniques to diagnose and solve performance problems. We stress an empirical approach, making observations of real systems under stress. If you are a newcomer to the Microsoft operating system but understand how Unix, Linux, Solaris, OpenVMS, MVS, or another full-featured OS works, you will find much that is familiar here. Moreover, despite breathtaking changes in computer architecture, many of the methods and techniques used to solve earlier performance problems can still be applied successfully today. This is a comforting thought as we survey the complex computing environments, a jumbled mass of hardware and software alternatives, that we are called upon to assess, configure, and manage today. Measurement methodology, workload characterization, benchmarking, decomposition techniques, and analytic queuing models remain as effective today as they were for the pioneering individuals who first applied these methods to problems in computer performance analysis.
Computer performance evaluation is significant because it is very closely associated with the productivity of the human users of computerized systems. Efficiency is the principal rationale for computer-based automation. Whenever productivity is a central factor in decisions regarding the application of computer technology to automate complex and repetitive processes, performance considerations loom large. These concerns were present at the dawn of computing and remain very important today. Even as computing technology improves at an unprecedented rate, the fact that our expectations of this technology grow at an even faster rate assures that performance considerations will continue to be important in the future.
In computer performance, productivity is often tangled up with cost. Unfortunately, the relationship between performance and cost is usually neither simple nor straightforward. Generally, more expensive equipment has better performance, but there are many exceptions. Frequently, equipment will perform well with some workloads but not others. In most buying decisions, it is important to understand the performance characteristics of the hardware, the performance characteristics of the specific workload, and how they match up with each other.
Performance and Productivity
As mentioned at the outset, there is a very important correlation between computer performance and productivity, and this provides a strong economic underpinning to the practice of performance management. Today, it is well established that good performance is an important element of human-computer interface design. (See Ben Shneiderman's Designing the User Interface: Strategies for Effective Human-Computer Interaction.) We have to explore a little further what it means to have "good" performance, but systems that provide fast, consistent response time are generally more acceptable to the people who use them. Systems with severe performance problems are often rejected outright by their users and fail. This leads to costly delays, expensive rewrites, and loss of productivity.
As computer performance analysts, we are interested in finding out what it takes to turn "bad" performance into "good" performance. Generally, the analysis focuses on two flavors of computer measurements. The first type of measurement data describes what is going on with the hardware, operating systems software, and application software that is running. These measurements reflect both activity rates and the utilization of key hardware components: how busy the processor, the disks, and the network are. Windows 2000 provides a substantial amount of performance data in this area: quantitative information on how busy different hardware components are, what processes are running, how much memory they are using, etc.
The second type of measurement data measures the capacity of the computer to do productive work. As discussed earlier in this chapter, the most common measures of productivity we have seen are throughput (usually measured in transactions per second) and response time. A measure of throughput describes the quantity of work performed, while response time measures how long it takes to complete the task. When users of your network complain about bad performance, you need a way to quantify what "bad" is. You need application response time measurements from the standpoint of the end user. Unfortunately, Windows 2000 is not very strong in this key area of measurement.
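To make the two productivity measures concrete, the following is a minimal sketch, assuming you have already captured a start and end time for each transaction from the end user's point of view. The function name `summarize` and the sample timings are hypothetical and are not drawn from any Windows 2000 facility.

```python
from statistics import mean


def summarize(transactions):
    """Compute throughput and response time from (start_sec, end_sec) pairs."""
    if not transactions:
        raise ValueError("no transactions to summarize")
    # Response time: how long each individual task took to complete.
    response_times = [end - start for start, end in transactions]
    # Throughput: quantity of work completed per second of elapsed time.
    elapsed = max(end for _, end in transactions) - min(start for start, _ in transactions)
    if elapsed <= 0:
        raise ValueError("measurement interval must be longer than zero")
    return {
        "throughput_per_sec": len(transactions) / elapsed,
        "mean_response_sec": mean(response_times),
        "max_response_sec": max(response_times),
    }


# Hypothetical timings for four transactions, in seconds.
sample = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.0), (3.0, 4.0)]
stats = summarize(sample)
print(stats)
```

Reporting the maximum alongside the mean is a deliberate choice here: users tend to complain about the slowest interactions, which an average can hide.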
Performance Management
Performance management remains one of the key system management disciplines performed by systems professionals of all stripes. For instance, as a LAN administrator, you may be responsible for setting up the hardware platform and operating system environment used to run the applications that your company needs to run its business. The performance aspects of this job include monitoring the hardware and software and making configuration changes and tuning adjustments to make things run better. To support capacity planning and budgeting, you may be called upon to provide measurement data to cost-justify hardware expenditures and as input to the capacity planning process that is designed to ensure that adequate resources continue to be available.
In a capacity planning role, you may be responsible for assuring that adequate resources are available over the long term. This may mean monitoring the current growth, factoring in new application projects, and trying to plan the hardware configuration that can supply adequate performance. This inevitably means that you will encounter your company's budget for new equipment purchases. You will be called upon to explain your recommendations in straightforward, nontechnical language. Perhaps your analysis will show that more hardware is needed to meet the performance objectives or else workers interacting with the system will not be fully productive. This is part of performance management, too.
Our approach to these and other performance management tasks is decidedly empirical. Remember, you cannot manage what you cannot measure. We recommend that you gather data from real systems in operation, analyze it, and make decisions based on that analysis. Those decisions may range from deciding how to configure a workstation running Windows 2000 Professional for optimal performance for a graphics-intensive application, to how to configure multiple Windows 2000 servers to support distributed database or messaging applications. For systems under development, you may need to discuss hardware and software infrastructure alternatives with the programmers responsible for implementing the application. If you get an opportunity to influence the design of an important application under development, try to link decisions about design trade-offs to actual measurements of prototype code running under different conditions of stress. Before setting parameters and making tuning adjustments, review and analyze performance statistics from the system. Afterwards, verify that the changes you made are working by looking at similar measurement data.
Problems of Scale
As fast as hardware is improving, it still cannot always meet user expectations. Computer applications are growing more intelligent and easier to use. However, because of this, they also require more resources to run. Additionally, computer applications are being deployed to more and more people living in every part of the world. This inevitably leads to problems of scale: applications may run all right in development, but run in fits and starts when you start to deploy them on a widespread basis.
Problems of scalability are some of the most difficult that application developers face. There are three facets to scalability:
  • The problems computers must solve get characteristically more difficult to compute as the number of elements involved grows.
  • People often have unrealistically high expectations about what computer technology can actually do.
  • There are sometimes absolute physical constraints on performance that cannot be avoided. This applies to long-distance telecommunications as much as it does to signal propagation inside a chip or across a communications bus.
We discuss these three factors in the following sections.
There are "hard" problems for computers that have a way of getting increasingly more complicated as the problem space itself grows larger. A classic example is the traveling salesman problem, where the computer calculates the most efficient route through multiple cities. This can be accomplished by brute force with a small number of cities, for example, by generating all the possible routes. As the number of cities increases, however, the number of possible routes the computer must generate and examine grows factorially, which is even faster than exponential growth. In the mathematics of computing, these problems are known as NP-complete. With many NP-complete problems, as the problem space grows more complicated, the computational effort using known methods explodes far beyond what even the fastest computers can do.
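The brute-force approach described above can be sketched in a few lines. This is an illustration only, assuming cities are points on a plane and distances are straight lines; the function names are our own. With the starting city fixed, there are (n - 1)! orderings of the remaining cities to examine.

```python
from itertools import permutations
from math import dist, factorial


def shortest_tour(cities):
    """Try every ordering of the cities and keep the shortest round trip."""
    start, rest = cities[0], cities[1:]
    best_route, best_length = None, float("inf")
    for order in permutations(rest):
        route = (start, *order, start)
        # Sum the straight-line distance of each leg of the round trip.
        length = sum(dist(a, b) for a, b in zip(route, route[1:]))
        if length < best_length:
            best_route, best_length = route, length
    return best_route, best_length


def routes_to_examine(city_count):
    """Number of orderings brute force must generate with the start fixed."""
    return factorial(city_count - 1)


corners = [(0, 0), (0, 1), (1, 1), (1, 0)]
route, length = shortest_tour(corners)
print(route, length)

for n in (5, 10, 15, 20):
    print(n, "cities:", routes_to_examine(n), "routes")
```

Four cities require only 6 orderings, but 20 cities already require 19!, roughly 1.2 x 10^17, which is why exhaustive search stops being practical so quickly.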
Performance Tools
We should consider ourselves lucky that so much attention has been paid to the discipline of computer performance evaluation by so many talented people over the years. There is a small but very active Special Interest Group of the Association for Computing Machinery called SIGMETRICS (see http://www.sigmetrics.org) devoted to the study of performance evaluation. There are many good university programs where topics in performance evaluation are taught and Ph.D. candidates are trained. For practitioners, there is a large professional association called the Computer Measurement Group (http://www.cmg.org) that sponsors an annual conference in December. The regular SIGMETRICS and CMG publications contain a wealth of invaluable material for the practicing performance analyst.
It is also fortunate that one of the things computers are good for is counting. It is easy to augment both hardware and software to keep track of what they are doing, although keeping the measurement overhead from overwhelming the amount of useful work being done is a constant worry. Over time, most vendors have embraced the idea of building measurement facilities into their products, and at times are even able to position them to competitive advantage because their measurement facilities render them more manageable. A system that provides measurement feedback is always more manageable than one that is a black box.
Users of Windows 2000 are fortunate that Microsoft has gone to the trouble of building in extensive performance monitoring facilities. We explore the use of specific Windows 2000 performance monitoring tools in subsequent chapters. The following discussion introduces the types of events monitored in Windows 2000, how measurements are taken, and what tools exist that access this measurement data. The basic measurement methodology used in Windows 2000 is the subject of more intense scrutiny in Chapter 2.
Measurement facilities are the prerequisite for the development of
Additional content appearing in this section has been removed.
Chapter 2: Measurement Methodology
Microsoft Windows 2000 comes equipped with extensive facilities for monitoring performance, and these have been available in the operating system since its inception. The Windows 2000 performance monitoring API is the name we coined to describe the built-in facilities for monitoring system activity in Windows 2000. In this chapter, we take a thorough look at this API, which is the main source of the performance statistics that system administrators and application developers utilize.
Performance Monitoring on Windows
The performance monitoring statistics available in Windows 2000 are quite extensive. At a basic level, they report on processor, memory, disk, and network usage, for example. Windows 2000 also measures and reports on the utilization of these system resources by application processes and execution threads. The extensive set of performance metrics collected is designed to assist system administrators and application developers, both of whom need to understand the impact of their decisions on overall system performance.
For example, the Windows 2000 32-bit application programming interface (the Win32 API) provides a complex set of operating system services for application developers to use, including thread scheduling, process virtual memory management, and file processing. The performance statistics Windows 2000 provides are complementary—they help programmers use these operating system services efficiently. Some applications written for Windows 2000, as we shall see, attempt to adjust automatically to their runtime environment. For instance, they may allocate a specific number of execution threads per processor, allowing the application to scale efficiently on multiprocessor machines. Or they may allocate memory-resident buffers for caching frequently used disk objects based on the amount of available RAM. These simple heuristics designed to allow an application to adjust automatically to its runtime environment neglect one vital element in computer performance, namely, the specific characteristics of your workload. To allow applications to scale across different hardware configurations and different workloads requires additional flexibility.
Many applications developed specifically for Windows 2000 provide external controls that allow system administrators to adjust the values of configuration and tuning parameters. Tuning knobs that determine how many threads to create or how much RAM to allocate impact application scalability and other aspects of its performance. We discuss a number of these tuning parameters later in this book. These parameters often focus on the application runtime services that the Win32 API specifies and the Windows 2000 operating system implements. Setting tuning parameters that influence the execution behavior of key applications requires a knowledge and understanding of the various performance statistics available.
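The kind of self-adjusting heuristic described above, together with the external tuning knobs that override it, might look like the following sketch. The function names, the default of two threads per processor, and the 25% memory fraction are all assumptions chosen for illustration; they are not taken from any particular Windows 2000 application.

```python
import os


def worker_threads(threads_per_cpu=2, cpu_count=None):
    """Size a thread pool from the processor count.

    threads_per_cpu is the tuning knob an administrator could override;
    cpu_count defaults to whatever the runtime environment reports.
    """
    cpus = cpu_count or os.cpu_count() or 1
    return max(1, threads_per_cpu * cpus)


def cache_buffer_bytes(available_ram_bytes, fraction=0.25, ceiling_bytes=None):
    """Size a memory-resident cache as a fraction of available RAM."""
    size = int(available_ram_bytes * fraction)
    if ceiling_bytes is not None:
        # An explicit ceiling lets the administrator cap the heuristic.
        size = min(size, ceiling_bytes)
    return size


print(worker_threads())                      # adapts to this machine
print(worker_threads(threads_per_cpu=4, cpu_count=2))
print(cache_buffer_bytes(1024 ** 3))         # 1 GB available
print(cache_buffer_bytes(1024 ** 3, ceiling_bytes=64 * 1024 ** 2))
```

Note that neither function knows anything about the workload. Choosing good values for the knobs is where measurement data comes in.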
Performance Monitoring API
Windows 2000 provides two distinct sets of performance monitoring services. The first is an interface for managing the collection and reporting of performance data objects. (Although Windows 2000 calls various sets of related performance statistics "objects," you may feel more comfortable thinking about them as records or record types.) Associated with each object is a set of related counters, in effect, numerical data fields. The Windows 2000 operating system kernel is extensively instrumented, and a wide range of detailed performance data is available on hardware resource usage, operating system activity, and active processes and threads across this interface. Besides the kernel, the I/O manager and the various networking services like TCP/IP are also instrumented.
The performance monitoring interface also defines a callable facility that allows a performance monitoring application (like the System Monitor) to retrieve the performance statistics that are collected. The API defines a set of data structures that are used to pass the performance statistics from the data collectors to performance monitoring applications.
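One way to picture the object and counter relationship is as a record type with named numeric fields. The sketch below is a conceptual model only: the `PerfObject` class is hypothetical and does not reproduce the actual data structures the API passes to monitoring applications, and the counter values are made up. The path notation it prints follows the familiar \Object(Instance)\Counter form.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PerfObject:
    """A performance object modeled as a record: a name plus numeric counters."""
    name: str
    instance: Optional[str] = None
    counters: Dict[str, float] = field(default_factory=dict)

    def counter_paths(self) -> List[str]:
        """Render each counter as \\Object(Instance)\\Counter."""
        target = f"{self.name}({self.instance})" if self.instance else self.name
        return [f"\\{target}\\{counter}" for counter in self.counters]


# Illustrative values only; these are not measurements.
processor = PerfObject("Processor", "_Total", {"% Processor Time": 12.5})
memory = PerfObject("Memory", counters={"Available Bytes": 64_000_000.0, "Pages/sec": 3.0})

for perf_object in (processor, memory):
    for path in perf_object.counter_paths():
        print(path)
```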
The Win32 performance monitoring interface is both open and extensible. That makes it possible for key applications and subsystems to add to the pool of available performance metrics. And, in fact, many of Microsoft's internally developed applications already exploit this extensible interface, having added their own performance statistics to those maintained by the operating system kernel. These applications include MS SQL Server, Exchange Server, Internet Information Server, and others. Many applications developed on or ported to Windows 2000 by other vendors also utilize this interface for performance data collection. As of this writing, we are aware of some 200 distinct performance objects that are available, with literally thousands of different performance data counters.
A good reason for using the standard facilities for Windows 2000 performance data collection is that the data provided is immediately accessible by the second set of services: the applications that Microsoft and other vendors provide for collecting and viewing performance data. While the Microsoft Windows 2000 Resource Kit mentions a dozen different performance tools that are available, these boil down to two principal tools. One is the Windows 2000 Task Manager, which beginning in NT Version 4.0 provides a compact, real-time view of system activity. The second and more complete and comprehensive tool is System Monitor (or Performance Monitor in NT 4.0), which has the capability to display all the available performance data in both graphical and tabular views, log performance data to a file, and generate alerts. Let's begin by familiarizing ourselves with these essential tools first.
Performance Data Logging
While designed primarily for interactive use, Sysmon (and Perfmon) also supports basic data logging facilities. Data being collected can be captured to a log file, which in turn can be used as input to the program's online display facilities. It is also possible to run multiple copies of Sysmon at once. While one copy is collecting data in real time, you can start another copy and examine a log file from a background data collection session. The Sysmon chart utility functions exactly the same way in real time as it does when it is displaying data from a file, except that the data you are viewing does not scroll off the screen, and you have additional commands to control the time range of the data you display.
To create a new log file using the System Monitor, select the Performance Logs and Alerts folder on the left side of the screen, and then access the Counter Logs folder. From the Action menu, select New Log Settings. After you name the new log settings, the Counter Log Properties dialog box appears. The Add button brings up a familiar Add Counters dialog that allows you to select the object, object instances, and counters you want to view. You also specify the collection interval, which can be in seconds, minutes, hours, or days. The shortest collection interval that System Monitor allows is one second.
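The essentials of a counter log (a set of selected counters, a collection interval, and an output file) can be sketched as a simple sampling loop. This is not how System Monitor is implemented. The `read_counters` callable is a hypothetical stand-in for whatever supplies the counter values, and the one-second floor simply mirrors the limit mentioned above.

```python
import csv
import io
import time

MIN_INTERVAL_SEC = 1.0  # mirrors the one-second floor described in the text


def log_counters(read_counters, out, interval_sec=1.0, samples=3,
                 sleep=time.sleep, clock=time.time):
    """Write one timestamped CSV row of counter values per interval."""
    if interval_sec < MIN_INTERVAL_SEC:
        raise ValueError("collection interval must be at least one second")
    writer = None
    for i in range(samples):
        row = {"timestamp": clock()}
        row.update(read_counters())
        if writer is None:
            # The first sample determines the column headings.
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        if i < samples - 1:
            sleep(interval_sec)


def fake_counters():
    """Hypothetical data source; a real collector would query the system."""
    return {"Pages/sec": 0.0}


buffer = io.StringIO()
# A no-op sleep keeps the demonstration instant; omit it to wait for real.
log_counters(fake_counters, buffer, interval_sec=1.0, samples=3,
             sleep=lambda seconds: None)
print(buffer.getvalue())
```

Passing the sleep and clock functions as parameters is a convenience for experimentation: the same loop can be exercised without waiting through each interval.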
Figure 2-15 expands the Performance Logs and Alerts node on the Microsoft Management Console that is used to anchor performance monitoring in Windows 2000. Two counter logs are shown. One is the System Overview performance data log that Microsoft configures by default. This logging file tracks three metrics that report overall processor utilization, physical disk contention, and paging activity. The second log file entry describes a background performance monitoring session that we defined. To understand what data is being logged and how, right-click on MyLogSettings and access the logging session properties.
Performance Monitoring Overhead
Now that we understand how the Windows 2000 performance monitoring interface works, we can tackle the issue of overhead. "What does all this Windows 2000 performance monitoring cost?" is a frequent question. Naturally, you want to avoid any situation where performance monitoring is so costly in its use of computer resources that it drastically influences the performance of the applications you care about. Performance monitoring must be part of the solution, not part of the problem. As long as the overhead of performance monitoring remains low, we can accept it as part of the cost of doing business. With performance monitoring, when something goes wrong, we at least have a good chance of finding out what happened. However, as we have already observed, logging performance data to a text file using System Monitor can be very CPU-intensive. When a logging session that is writing a text format data file is active, it can consume so much CPU time that it affects the performance of other applications you are trying to run.
Understanding how much overhead is involved in Windows 2000 performance monitoring is not a simple proposition. It helps to break up the overhead considerations into three major areas of concern:
  • The overhead involved in measuring Windows 2000
  • The overhead involved in gathering performance monitor data
  • The overhead involved in analyzing and other post-processing of the measurement data
We discuss these three areas of concern in the following sections.
The overhead of instrumentation refers to the system resources used in collecting performance measurements as they occur. With a few exceptions, the overhead of collecting both base and extended counters is built into Windows 2000. This overhead occurs whether or not there is a performance monitoring application like Sysmon actually looking at the information being collected. For example, Windows 2000 provides instrumentation that keeps track of which processes are using the CPU and for how long. This instrumentation is integral to the operating system; it cannot be turned off. Consequently, the overhead involved in making these measurements cannot be avoided.
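While the instrumentation cost is fixed, the cost of gathering the data depends on what you collect and how often, and it can be estimated empirically. The sketch below is one rough way to do so, assuming the collection step can be invoked as an ordinary function. The name `collection_overhead` and the stand-in workload are hypothetical.

```python
import time


def collection_overhead(collect, interval_sec, rounds=5):
    """Estimate the fraction of one processor a collector would consume.

    Runs collect() several times, totals the CPU time it used, and divides
    by the wall-clock time those rounds would span at the given interval.
    """
    if interval_sec <= 0 or rounds <= 0:
        raise ValueError("interval_sec and rounds must be positive")
    cpu_used = 0.0
    for _ in range(rounds):
        before = time.process_time()
        collect()
        cpu_used += time.process_time() - before
    return cpu_used / (rounds * interval_sec)


def pretend_collection():
    """Stand-in for gathering and formatting one sample of counter data."""
    return ",".join(str(value) for value in range(10_000))


fraction = collection_overhead(pretend_collection, interval_sec=1.0)
print(f"estimated overhead: {fraction:.2%} of one processor")
```

Lengthening the collection interval lowers the estimate proportionally, which is the usual first remedy when data gathering turns out to be too expensive.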
A Performance Monitoring Starter Set
We recognize that with so many Windows 2000 performance objects and counters to choose from, it is easy to be overwhelmed. To help readers who want to get started in Windows 2000 performance monitoring without having to read all the in-depth chapters that follow, we now present a set of recommended objects and counters that you can use to begin collecting performance data at regular intervals for your important Windows 2000 machines, shown in Table 2-1. Of course, we do not want to discourage you from reading the rest of the book; this simple list of counters will certainly not provide enough information to help you understand what is happening inside a Windows 2000 machine experiencing performance problems. Subsequent chapters of this book explain what these and other counters mean in some detail and show how to use them to solve practical performance problems.
We also include a brief discussion of some of the performance counters. The recommended set of counters to monitor includes System, Processor, Memory, Cache, Disk, and networking performance counters that are relevant to a wide variety of Windows 2000 server and workstation machines. We also recommend collecting process data for critical applications, along with application-specific objects and counters where appropriate. For example, for a Windows 2000 machine being used as a web server, the Internet Information Services Global object provides invaluable performance and tuning information.
The Usage Notes column accompanying the entries in the Windows 2000 performance monitoring starter set describes some convenient rules of thumb for detecting specific performance bottlenecks based on the associated counter value. In other cases, the Usage Notes discuss some popular misconceptions associated with specific Windows 2000 performance counter fields that sound like they mean one thing, but actually measure something else entirely. We understand that these Usage Notes may disagree with the recommendations of other authorities and even the official Windows 2000 Explain Text. Rest assured that we do not contradict the official Microsoft sources without good reason. You will find those reasons documented in ample detail in the body of this book. Useful new counters available only in Windows 2000 are also noted.
Chapter 3: Processor Performance
At the heart of any computer is a Central Processing Unit (CPU), or simply the processor for short. The processor is the hardware component responsible for computation: a machine that executes arithmetic and logical instructions presented to it in the form of computer programs. In this chapter we look at several important aspects of processor performance in a Windows 2000 system.
The processor hardware is capable of executing one set of instructions at a time, yet computers are loaded with programs that require execution concurrently. Generally, there are two types of programs that computers execute: systems programs and applications programs. An operating system like Windows 2000 is a systems program. It is a set of computer instructions like any program, but it is designed to interface directly with the computer hardware. The operating system, in particular, is responsible for controlling the hardware: the CPU, memory, network interface, and associated peripheral devices. The operating system is also responsible for determining what other programs actually get to run on the computer system. There is a very strong performance component to this aspect of operating systems software. The Windows 2000 designers are trying to make the experience of using their computing platform as pleasant as possible. This includes running applications fast and efficiently.
Like other operating systems, Windows 2000 includes a Scheduler component that controls what application programs the processor executes and in what order. We discuss the functions performed by the Windows 2000 Scheduler first, but limit our scope to the uniprocessor environment initially. Multiprocessors introduce a number of complications, which we address in Chapter 5.
Windows 2000 Design Goals
Before diving into the subject of the Windows 2000 Scheduler, it is a good idea to highlight two of the major design goals behind Windows 2000. These goals are portability and simplicity, and they are important to understand because they have influenced the decisions that were made about what features to put into the OS.
Portability means that the operating system should be able to run (and run well) on a very broad range of computing environments, from standalone workstations to very large, clustered systems. This is challenging to accomplish because the hardware and the workloads run on such a wide range of computing alternatives are, well, wide-ranging. From its inception, Windows NT, and now Windows 2000, was committed to accomplishing this goal without making too many major platform-specific optimizations. This is a carefully calculated and thoughtful stance, in our view.
Concessions to specific hardware features can definitely show a big payoff in the short term, which makes it very tempting for OS designers to make them. These same concessions are apt to be a drag on performance over the longer term, however. This is significant, since operating systems have a long shelf life. (MS-DOS is now over twenty years old, the original Unix kernel was developed twenty-five years ago, and there are elements of the IBM MVS mainframe operating system originally developed in the 1960s still running today. Remember that some of the vaunted "New Technology" in Windows NT/2000 is already over ten years old.)
Building Windows 2000 for the long haul and avoiding hardware-specific optimizations is important from the standpoint of portability. The Windows 2000 designers relied on the fact that over the long term, the power of the hardware used to run Windows 2000 would catch up and ultimately overcome any limitations of the OS. No sense succumbing to the temptation of making short-term, hardware-specific concessions when these features are not likely to endure.
The Thread Execution Scheduler
The component of Windows 2000 that interacts with the processor hardware is known as the Scheduler. Task scheduling is a fundamental part of any operating system and resides in the OS kernel, ntoskrnl.exe, for reasons that should become apparent during this discussion. The Windows 2000 thread scheduler supports foreground/background execution, multiprogramming, multiprocessing, and pre-emptive scheduling with priority queuing. It functions similarly to other major operating systems such as Compaq's OpenVMS, Unix V, and IBM's MVS. If you are familiar with task scheduling in one of these OSes, you should not have any trouble understanding how the Windows 2000 Scheduler works too. On the other hand, if you have prior exposure only to MS-DOS and Windows, you may find many new topics being introduced.
Windows 2000 is a multiprogramming operating system, which means that it manages and selects among multiple programs that can all be active in various stages of execution at the same time. The dispatchable unit in Windows 2000, representing the application or system code to be executed, is the thread. (Unix terminology is identical. NT Versions 4.0 and higher also support a lightweight variant of threads called fibers, but that is a detail we can ignore for now.) The Scheduler running inside the Windows 2000 operating system kernel keeps track of each thread in the system and points the processor hardware to threads that are ready to run.
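The idea of a Scheduler that tracks ready threads and hands the processor the most deserving one can be sketched with a priority queue. This is only a toy model, not the Windows 2000 implementation: the class name `ReadyQueue`, its methods, and the thread names and priority values are all hypothetical, and the sketch assumes that a larger number means a higher priority and that threads at the same priority are served first-in, first-out.

```python
import heapq
import itertools


class ReadyQueue:
    """Toy ready queue: the highest-priority ready thread is dispatched first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker, giving FIFO order
        # among threads that share the same priority.
        self._order = itertools.count()

    def make_ready(self, thread_name, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), thread_name))

    def dispatch(self):
        """Return the name of the next thread to run, or None if nothing is ready."""
        if not self._heap:
            return None
        _, _, thread_name = heapq.heappop(self._heap)
        return thread_name


# Hypothetical threads and priorities, chosen only for illustration.
queue = ReadyQueue()
queue.make_ready("spooler", priority=8)
queue.make_ready("ui", priority=13)
queue.make_ready("logger", priority=8)

dispatch_order = [queue.dispatch() for _ in range(3)]
print(dispatch_order)
```

The higher-priority thread runs first even though it became ready later, which is the essence of priority queuing; a real pre-emptive scheduler would also interrupt a running thread when a higher-priority one becomes ready.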
The basic rationale for multiprogramming is that most computing tasks do not execute instructions continuously. After a program thread executes for some period of time, it usually needs to perform an input/output (I/O) operation like reading information from the disk, printing characters on a printer, or drawing data on the display. While the program is waiting for this I/O function to complete, it does not need to hang on to the processor. An operating system that supports multiprogramming saves the status of a program that is waiting, restores its status when it is ready to resume execution, and finds something else that can run in the interim.
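A rough way to quantify this rationale is a common textbook approximation, which is not taken from this book: if each program spends a fraction p of its time waiting for I/O, and the waits are assumed to be independent, the processor is idle only when all n programs are waiting at once, so utilization is roughly 1 - p^n. The function name and the 80% wait figure below are illustrative assumptions.

```python
def cpu_utilization(io_wait_fraction, programs):
    """Estimate CPU utilization, assuming independent I/O waits.

    The CPU sits idle only when every program is waiting at the same time,
    which happens with probability io_wait_fraction ** programs.
    """
    return 1.0 - io_wait_fraction ** programs


# Assumed workload: each program waits on I/O 80% of the time.
for n in (1, 2, 4, 8):
    print(f"{n} program(s): {cpu_utilization(0.8, n):.1%} busy")
```

Under these assumptions a single I/O-bound program leaves the processor mostly idle, while running several of them together recovers much of that idle time. Real workloads violate the independence assumption, so treat the numbers as indicative only.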
Thread Scheduling Tuning
As we have seen, Windows 2000 provides an extensive set of facilities to set the dispatching priority of an application's executing threads. Applications like Internet Explorer, RAS, System Monitor, and SQL Server take full advantage of Windows 2000 thread scheduling priority internally. For applications that do not take full advantage of these facilities, the Windows 2000 thread Scheduler adjusts thread priorities dynamically to increase the system's responsiveness, maximize throughput, and prevent thread starvation. Applications like SQL Server and Exchange expose external tuning parameters that allow the administrator to configure the number of threads started and their priority, and system applications like the Windows 2000 Spooler and the file server component also have tuning parameters that control the number and priority of threads the process initiates. Windows 2000 itself also exposes some tuning options for adjusting the number of worker threads it creates and runs.
Clearly, the inside of a specific application is not the best vantage point from which to make tuning decisions that can impact system performance globally. In this area, Windows 2000 provides very little support to a system administrator who needs to adjust the relative priorities of different applications running on the same system. When sufficient processing capacity is available and the Ready Queue never backs up, an external mechanism to adjust dispatch priority is not necessary. But when these conditions do not hold, some form of tuning intervention is desirable. Keep in mind that Windows 2000 needs to operate over a wide range of processing environments where processing power varies from a single processor of various speeds to four- and eight-way multiprocessor configurations. No set of default parameters will be adequate across such a broad range of computing power.
In this section we review the Windows 2000 Scheduler adjustments that can be made externally by a system administrator or performance analyst. This is a relatively brief section because the options available are limited. This has created an opportunity for third-party developers to extend the set of controls available for processor scheduling. We describe the use of one third-party tool that has risen to this challenge, and discuss a new operating system interface introduced in Windows 2000 that will make it easier to build administrator controls to regulate thread scheduling in the future.
Chapter 4: Optimizing Application Performance
The best medicine for most performance problems is invariably prevention. Despite advances in software performance engineering, developing complex computer programs that are both functionally correct and efficient remains a difficult and time-consuming task. This chapter specifically looks at tuning Windows 2000 applications running on Intel hardware from the perspective of optimizing processor cycles and resource usage. Fine-tuning the execution path of code remains one of the fundamental disciplines of performance engineering.
To bring this topic into focus, we describe a case study where an application designed and developed specifically for the Microsoft Windows 2000 environment is subjected to a rigorous analysis of its performance using several commercially available CPU execution profiling tools. One of the development tools we used requires an understanding of the internal workings of Intel processors, and justifies a lengthy excursion into the area of Intel processor hardware performance.
The performance of Intel processor hardware is the focus of the second half of this chapter. We will look inside an Intel x86 microprocessor to dissect the complex mechanisms employed to execute computer instructions. You may encounter situations where a background and understanding of Intel processor hardware is relevant to solving a performance or capacity problem. This chapter also introduces a set of Intel processor hardware performance measurements that can be extremely useful in troubleshooting CPU performance problems. While the first part of this chapter should appeal to developers responsible for applications that need to run efficiently under Windows 2000, the second part should have wider appeal. The technical discussion of the Intel processor architecture lays the groundwork for our treatment of multiprocessor performance considerations in Chapter 5.
Background
The application that is the target of this analysis is a C language program written to collect Windows 2000 performance data continuously on an interval basis. Since the application is designed primarily for use as a performance tool, it is very important that it run efficiently. A tool designed to diagnose performance problems should not itself be the cause of performance problems. Moreover, the customers for this application, many of whom are experienced Windows 2000 performance analysts, are a very demanding group of users.
The application's structure and flow are straightforward. Following initialization, the program enters a continuous data collection loop. Inside this loop, Windows 2000 services are called to retrieve selected performance data across a well-documented Win32 interface. The program consists of a single executable module called dmperfss.exe and two adjunct helper dynamic load libraries (DLLs); it simply gathers performance statistics across this interface and logs the information collected to a local disk file. The optimization of the code within this inner loop was the focus of this study.
Some additional details about the dmperfss application's structure and logic are important to this discussion. To a large extent, the program's design is constrained by the Windows 2000 Win32 performance monitoring Application Programming Interface (API) discussed in Chapter 2 that is the source of the performance data being collected. The performance data in Windows 2000 is structured as a set of objects, each with an associated set of counters. (Individuals not accustomed to object-oriented programming terminology might feel more comfortable thinking about objects as either records or rows of a database table, and counters as fields or columns in a database table.) There are more than 200 different performance objects defined in Windows 2000: base objects, which are available on every system, and extended objects, which are available only if specific application packages like MS SQL Server or Lotus Notes are installed. Within each object, specific performance counters are defined. Approximately 20 different types of counters exist, but generally they fall into three basic categories: accumulators, instantaneous measures, and compound variables, as described in Chapter 2.
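The difference between an accumulator and an instantaneous measure matters to any interval-based collector, so a small sketch may help. It is not the dmperfss logic or the Win32 API: the function name, the field names, and the sample values are hypothetical. An accumulator only ever increases, so it becomes meaningful when the difference between two samples is divided by the elapsed time, whereas an instantaneous measure is simply reported as observed.

```python
def accumulator_rate(prev_value, curr_value, prev_time, curr_time):
    """Convert two samples of an ever-increasing counter into a per-second rate."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return (curr_value - prev_value) / elapsed


# Hypothetical samples taken 60 seconds apart (field names are illustrative).
first = {"time": 0.0, "page_faults_total": 120_000, "available_bytes": 64_000_000}
second = {"time": 60.0, "page_faults_total": 126_000, "available_bytes": 58_000_000}

# Accumulator: difference over the interval divided by the interval length.
faults_per_sec = accumulator_rate(
    first["page_faults_total"],
    second["page_faults_total"],
    first["time"],
    second["time"],
)

# Instantaneous measure: the latest observation is reported as-is.
available_now = second["available_bytes"]

print(faults_per_sec, available_now)
```

A collector therefore has to retain the previous sample of every accumulator between intervals, which is one reason the inner collection loop deserves attention when tuning.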
The Application Tuning Case Study
In this section we compare and contrast the three code profilers that we used to examine the application's consumption of processor cycles. All the tests were made on a Windows NT 4.0 Server installation, but the results apply to the Windows 2000 environment with almost no modification. Overall, we found both compiler add-on packages to be extremely useful and well worth the modest investment. Visual Quantify provides a highly intuitive user interface. It extends the CPU profiling information available through the built-in facilities of MS Visual C++ by incorporating information on many system services, and it greatly increased our understanding of the code's interaction with various Windows 2000 system services. VTune supplemented this understanding with even more detailed information about our code's interaction with the Windows 2000 runtime environment. VTune's unique analysis of Intel hardware performance provides singular insight into this critical area of application performance.
The first stage of the analysis used the code execution profiler built into MS Visual C++. This is a natural place for any tuning project to begin given that no additional software packages need to be licensed and installed. Runtime profiling is enabled from the Link tab of the Project, Settings dialog box inside Developer Studio, which also turns off incremental linking. Profiling can be performed at either the function or line level. Function-level profiling counts the number of calls to the function and adds code that keeps track of the time spent in execution while in the function. Line profiling is much higher impact and requires that debug code be generated. In this exercise, we report only the results of function profiling.
Select Profile from the Tools menu to initiate a profiling run. Additional profiling options are available at runtime. The most important of these allow the user to collect statistics on only specific modules within a program or to restrict line profiling to certain lines of code. At the conclusion of the run, a text report is available in the Output window under the Profile tab. You can view the report there or import the text into a spreadsheet where the data can be sorted and manipulated. The first few lines of the function-level profile report are shown here. The output is automatically sorted in the most useful sequence, showing the functions in order by the amount of time spent in the function.
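To make the idea of function-level profiling concrete, here is a minimal sketch of the same bookkeeping: count the calls to each function, accumulate the time spent inside it, and report functions in descending order of time. It does not reproduce the Visual C++ profiler or its report format; the decorator `profiled` and the two sample functions are hypothetical stand-ins.

```python
import functools
import time
from collections import defaultdict

call_counts = defaultdict(int)        # function name -> number of calls
elapsed_seconds = defaultdict(float)  # function name -> total time inside it


def profiled(func):
    """Wrap func so that each call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_counts[func.__name__] += 1
            elapsed_seconds[func.__name__] += time.perf_counter() - start
    return wrapper


@profiled
def format_record(n):
    # Hypothetical workload: build a comma-separated record.
    return ",".join(str(i) for i in range(n))


@profiled
def write_record(text):
    # Hypothetical workload: stand-in for writing a record to a log.
    return len(text)


for _ in range(100):
    write_record(format_record(200))


def report():
    """Return function names sorted by total time, largest first."""
    return sorted(elapsed_seconds, key=elapsed_seconds.get, reverse=True)


for name in report():
    print(f"{name:15s} calls={call_counts[name]:4d} time={elapsed_seconds[name]:.6f}s")
```

Note that the timing code itself adds overhead to every call, which is why instrumented runs are slower than normal execution and why finer-grained profiling costs more.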
Intel Processor Hardware Performance
VTune, we discovered, is targeted specifically for programs that execute on Intel hardware, and provides a very detailed and informative analysis of program execution behavior on the Intel Pentium processor family. It turns out that this analysis is not as useful for programs executing on newer Intel hardware, such as the Pentium Pro, Pentium II, Pentium III, or Pentium IV. However, learning how to use this detailed information requires quite a bit of understanding about the way that Pentium (and Pentium Pro) processor chips work.
The Intel IA-32 architecture is based on the original third-generation 32-bit 386 processor family. Today, the Intel 32-bit architecture is associated with Pentium (P5) and Pentium Pro, Pentium II, Pentium III, and Pentium IV processors (these four correspond to the P6 generation of Intel microprocessors). For example, the Pentium IV is a sixth-generation microprocessor (P6) running the Intel x86 instruction set. Hardware designers refer to the Intel x86 as a CISC (Complex Instruction Set Computer), a style of hardware that is no longer in vogue. Today, hardware designers generally prefer processor architectures based on RISC (Reduced Instruction Set Computers). The complex Intel x86 instruction set is a legacy of design decisions made twenty years ago at the dawn of the microprocessor age, when RISC concepts were not widely recognized. The overriding design consideration in the evolution of the Intel x86 microprocessor family is maintaining upward compatibility of code developed for earlier-generation machines produced over the last twenty years.
Table 4-1 summarizes the evolution of the Intel x86 microprocessor family starting with the 8080, first introduced in 1974. As semiconductor fabrication technology advanced and more transistors were available to the designers, Intel's chip designers added more and more powerful features to the microprocessor. For example, the 80286 (usually referred to as the 286) was a 16-bit machine with a form of extended addressing using segment registers. The next-generation 386 chip maintained compatibility with the 286's rather peculiar virtual memory addressing scheme while implementing a much more straightforward 32-bit virtual memory scheme. In contrast to the 16-bit 64K segmented architecture used in the 286, the 386 virtual addressing mode is known as a "flat" memory model.
Chapter 5: Multiprocessing
As discussed in the last chapter, one sure route to better performance is to buy denser microprocessor chips, which have more logic packed into less space and therefore run faster. Moore's Law, named for Intel cofounder Gordon Moore, says that microprocessor density and speeds will double every 18-24 months or so, and it has not let us down over the last 20 years. If you wait long enough, perhaps your performance problems will just go away with the next generation of computer chips! Another proven technique is multiprocessing, building computers with two, four, or more microprocessors, all capable of executing the same workload in parallel. Instead of waiting another 18 months for processor speed to double again, you might be able to take advantage of multiprocessing technology to double or quadruple your performance today. If you have a workload that is out of capacity on a single-processor system, a multiprocessor configuration running Windows 2000 may be the only reasonable alternative that offers you any hope of relief from these capacity constraints today.
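The arithmetic behind the doubling claim is easy to check. The sketch below simply compounds a doubling every 18 or 24 months over 20 years; the function name is hypothetical, and the calculation assumes perfectly regular doubling, which real hardware generations only approximate.

```python
def growth_factor(years, doubling_months):
    """Cumulative improvement after `years` if capability doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


slow = growth_factor(20, 24)  # doubling every 24 months: 10 doublings
fast = growth_factor(20, 18)  # doubling every 18 months: about 13.3 doublings

print(f"Over 20 years: between {slow:,.0f}x and {fast:,.0f}x")
```

Seen this way, a two-way or four-way multiprocessor buys at most the equivalent of one or two doublings, but it buys them immediately rather than in 18 to 48 months.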
Multiprocessing technology lets you harness the power of multiple microprocessors running a single copy of the Windows 2000 operating system. Enterprise-class server models with multiple CPUs abound. When is a two- or four-way multiprocessor solution a good answer for your processing needs? How much better performance should you expect from a server with multiple engines? What sorts of workloads lend themselves to multiprocessing? These are difficult questions to answer, but we try to tackle them here. For even bigger problems, you might even want to consider more exotic solutions where multiple multiprocessors are clustered to work together in parallel to power your web site, for example. However, clustered Windows 2000 solutions are beyond the scope of this book.
In this chapter, we introduce the performance considerations relevant to multiprocessors running Windows 2000. Per our usual method, we attempt to construct a conceptual framework that will allow you to formulate answers to these performance and capacity planning questions
Multiprocessing Basics
Very powerful multiprocessor (MP) hardware configurations are widely available today from many PC server manufacturers. Beginning as far back as the 486 processors, Intel hardware has provided the basic capability required to support multiprocessing designs. However, it was not until the widespread availability of the P6, known commercially as the Intel Pentium Pro microprocessors, that Intel began making processor chips that were specifically built with multiprocessing servers in mind. The internal bus and related bus mastering chip sets that Intel built for the P6 were designed to string up to four Pentium Pro microprocessors together. As we write this chapter, hardware manufacturers are bringing out high-end microprocessors based on the Intel Pentium III, specifically using the flavor of microprocessor code-named Xeon that is designed for 4-, 8-, and 16-way multiprocessors. Anyone interested in the performance of these large-scale, enterprise-class Windows 2000 Servers should be aware of the basic issues in multiprocessor design, performance, and capacity planning.
The specific type of multiprocessing implemented using P6, Pentium II, III, and IV chips is generally known as shared-memory multiprocessing. (There is no overall consensus on how to classify various parallel processing schemes, but, fortunately, most authorities at least can agree on what constitutes a shared-memory multiprocessor!) In this type of configuration, the processors operate totally independently of each other, but they do share a single copy of the operating system and access to main memory (i.e., RAM). A typical dual processor shared-memory configuration is illustrated in Figure 5-1, with two P6 processors that contain dedicated Level 2 caches. (They each have separate built-in Level 1 caches, too.) A two-way configuration simply means having twice the hardware: two identical sets of processors, caches, and internal buses. Similarly, a four-way configuration means having four of everything. Having duplicate caches is designed to promote scalability, since the cache is so fundamental to the performance of pipelined processors.
Cache Coherence
The cache effects of running on a shared-memory multiprocessor are probably the most salient factors limiting the scalability of this type of computer architecture. The various forms of processor cache, including Translation Lookaside Buffers (TLBs), code and data caches, and branch prediction tables, all play a critical role in the performance of pipelined machines like the Pentium, Pentium Pro, Pentium II, and Pentium III. For the sake of performance, in a multiprocessor configuration each CPU retains its own private cache memory, as depicted in Figure 5-10. We have seen that multiple threads executing inside the Windows 2000 kernel or running device driver code concurrently can attempt to access the same memory locations. Propagating changes to the contents of memory locations cached locally to other engines with their own private copies of the same shared-memory locations is a major issue, known as the cache coherence problem in shared-memory multiprocessors. Cache coherence issues also have significant performance ramifications.
Maintaining cache coherence in a shared-memory multiprocessor is absolutely necessary for programs to execute correctly. While independent program execution threads operate independently of each other for the most part, sometimes they must interact. Whenever they read and write common or shared-memory data structures, threads must communicate and coordinate accesses to these memory locations. This coordination inevitably has performance consequences. We illustrate this side effect by drawing on the example discussed earlier, where two kernel threads are attempting to gain access to the Windows 2000 Scheduler Ready Queue simultaneously. A global data structure like the Ready Queue that may be accessed by multiple threads executing concurrently on different processors must be protected by a lock on a multiprocessor. Let's look at how a lock word value set by one thread on one processor is propagated to cache memory in another processor where another thread is attempting to gain access to the same critical section.
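The need for a lock around a shared structure can be illustrated in software, even though the hardware side cannot. In the sketch below, two Python threads stand in for two processors adding entries to a shared queue, and a `threading.Lock` stands in for the lock word. All names are hypothetical, and the example shows only mutual exclusion: it does not model cache lines, bus traffic, or the kernel's actual spin lock.

```python
import threading
from collections import deque

ready_queue = deque()                # shared structure, like a global Ready Queue
ready_queue_lock = threading.Lock()  # stands in for the lock word guarding it
enqueued_total = 0                   # shared counter updated inside the critical section


def enqueue_threads(processor_id, count):
    """Simulate one processor adding `count` entries to the shared queue."""
    global enqueued_total
    for i in range(count):
        with ready_queue_lock:       # enter the critical section
            ready_queue.append((processor_id, i))
            enqueued_total += 1      # read-modify-write, safe only under the lock
        # The lock is released here, letting the other thread proceed.


workers = [
    threading.Thread(target=enqueue_threads, args=(cpu, 10_000))
    for cpu in range(2)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(len(ready_queue), enqueued_total)
```

Every acquisition forces the two threads to take turns, which is the software view of the serialization cost; on real multiprocessor hardware the lock word must additionally travel between the processors' private caches each time ownership changes.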
Pentium Pro Hardware Counters
The measurement facility in the Intel P6 or Pentium Pro processors (including Pentium II, III, and IV) was strengthened to help hardware designers cope with the demands of more complicated multiprocessor designs. By installing the CPUmon freeware utility available for download at http://www.sysinternals.com, system administrators and performance analysts can access these hardware measurements. While the use of these counters presumes a thorough understanding of Intel multiprocessing hardware, we hope that the previous discussion of multiprocessor design and performance has given you the confidence to start using them to diagnose specific performance problems associated with large-scale Windows 2000 multiprocessors. The P6 counters provide valuable insight into multiprocessor performance, including direct measurement of the processor instruction rate, Level 2 cache, TLB, branch prediction, and the all-important shared-memory bus.
CPUmon, a freeware Pentium counter utility, lets you enable and then disable the Pentium counters for a specific interval, as illustrated