The best medicine for most performance problems is invariably prevention. Despite advances in software performance engineering, developing complex computer programs that are both functionally correct and efficient remains a difficult and time-consuming task. This chapter specifically looks at tuning Windows 2000 applications running on Intel hardware from the perspective of optimizing processor cycles and resource usage. Fine-tuning the execution path of code remains one of the fundamental disciplines of performance engineering.
To bring this topic into focus, we describe a case study where an application designed and developed specifically for the Microsoft Windows 2000 environment is subjected to a rigorous analysis of its performance using several commercially available CPU execution profiling tools. One of the development tools we used requires an understanding of the internal workings of Intel processors, and justifies a lengthy excursion into the area of Intel processor hardware performance.
The performance of Intel processor hardware is the focus of the second half of this chapter. We will look inside an Intel x86 microprocessor to dissect the complex mechanisms employed to execute computer instructions. You may encounter situations where a background and understanding of Intel processor hardware is relevant to solving a performance or capacity problem. This chapter also introduces a set of Intel processor hardware performance measurements that can be extremely useful in troubleshooting CPU performance problems. While the first part of this chapter should appeal to developers responsible for applications that need to run efficiently under Windows 2000, the second part should have wider appeal. The technical discussion of the Intel processor architecture lays the groundwork for our treatment of multiprocessor performance considerations in Chapter 5.
The application that is the target of this analysis is a C language program written to collect Windows 2000 performance data continuously on an interval basis. Since the application is designed primarily for use as a performance tool, it is very important that it run efficiently. A tool designed to diagnose performance problems should not itself be the cause of performance problems. Moreover, the customers for this application, many of whom are experienced Windows 2000 performance analysts, are a very demanding group of users.
The application’s structure and flow are straightforward. Following initialization, the program enters a continuous data collection loop. Inside this loop, Windows 2000 services are called to retrieve selected performance data across a well-documented Win32 interface. The program consists of a single executable module called dmperfss.exe and two adjunct helper dynamic link libraries (DLLs); it simply gathers performance statistics across this interface and logs the information collected to a local disk file. The optimization of the code within this inner loop was the focus of this study.
Some additional details about the
application’s structure and logic are important to this
discussion. To a large extent, the program’s design is
constrained by the Windows 2000 Win32 performance monitoring
Application Programming Interface (API) discussed in Chapter 2 that is the source of the performance data
being collected. The performance data in Windows 2000 is structured
as a set of objects, each with an associated set
of counters. (Individuals not accustomed to
object-oriented programming terminology might feel more comfortable
thinking about objects as either records
or rows of a database table, and
counters as fields or columns in a database table.) There are more than 200
different performance objects defined in Windows 2000: base
objects, which are available on every system, and
extended objects, which are available only if
specific application packages like MS SQL Server or Lotus Notes are
installed. Within each object, specific performance counters are
defined. Approximately 20 different types of counters exist, but
generally they fall into three basic categories: accumulators,
instantaneous measures, and compound variables, as described in Chapter 2.
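The distinction between counter categories matters when raw data is turned into reported values. An accumulator is a running total, so a usable figure comes from the difference between two successive samples divided by the elapsed time, whereas an instantaneous measure can be reported as read. The following is a minimal sketch of the accumulator calculation only. The `counter_sample` structure, the `rate_per_second` function, and the tick-based timestamp are hypothetical names chosen for illustration; they are not taken from dmperfss or from the Win32 performance monitoring API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sample of one accumulator-style counter: a running total
   plus the timestamp (in clock ticks) at which it was read. */
typedef struct {
    uint64_t total;  /* running total since the counter started */
    uint64_t ticks;  /* timestamp of the sample, in clock ticks */
} counter_sample;

/* Convert two successive samples of an accumulator into a per-second rate.
   ticks_per_sec is the frequency of the timestamp clock (an assumption). */
double rate_per_second(counter_sample prev, counter_sample curr,
                       uint64_t ticks_per_sec)
{
    assert(ticks_per_sec > 0);

    /* No elapsed time, or the running total went backward (for example,
       after a reset): there is no meaningful rate to report. */
    if (curr.ticks <= prev.ticks || curr.total < prev.total)
        return 0.0;

    double elapsed_sec =
        (double)(curr.ticks - prev.ticks) / (double)ticks_per_sec;
    return (double)(curr.total - prev.total) / elapsed_sec;
}
```

Returning zero for a reset or a zero-length interval is one possible policy; a collector could equally well discard the sample.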
The dmperfss.exe application retrieves designated objects one at a time by making a call to the Win32 function RegQueryValueEx( ). Having collected a data sample, the program then makes the appropriate calculations for all the counter variables of interest, and writes the corresponding counter values to a comma-delimited data collection file. In some cases, there are multiple instances of an object that need to be reported. For example, a computer system with four processors reports four instances of the processor object each collection interval, plus a synthesized _Total instance. A parent-child relationship is defined for some object instances; threads are associated with a parent process. This means that both the parent and child object instances are retrieved using a single call to the performance monitoring interface.
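The output step described above, one comma-delimited record per object instance, might be sketched as follows. The record layout (object name, instance name, then counter values with two decimal places) and the function name `format_csv_record` are assumptions made for illustration; the actual dmperfss file format is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format one record as "object,instance,v1,v2,..." into buf.
   Returns the number of characters written (excluding the terminating
   null), or -1 if buf is too small to hold the whole record. */
int format_csv_record(char *buf, size_t size,
                      const char *object, const char *instance,
                      const double *values, size_t count)
{
    assert(buf != NULL && object != NULL && instance != NULL);

    int n = snprintf(buf, size, "%s,%s", object, instance);
    if (n < 0 || (size_t)n >= size)
        return -1;

    size_t used = (size_t)n;
    for (size_t i = 0; i < count; i++) {
        /* Append each counter value, preceded by the field delimiter. */
        n = snprintf(buf + used, size - used, ",%.2f", values[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

On a four-processor system, a collector built this way would call the function once for each processor instance and once more for the _Total instance in every interval.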
At the time of the study, there were no apparent performance problems
with the dmperfss data collection application. The
program is most often used to retrieve a carefully crafted subset of
the available data using collection intervals ranging from 1 to 15
minutes, with one-minute intervals being the recommended setting. At
those rates of data collection, the overhead of
dmperfss data collection was uniformly much less
than 1% additional processor utilization during the interval.
Nevertheless, a good overall assessment of the performance of an
application is almost always valuable in guiding future development.
Furthermore, there are good reasons to run data collection at much
more frequent intervals than current customer practice. Consequently, we wished to investigate
whether it was feasible to build a monitoring program that would
collect certain data at much more frequent intervals—perhaps as
frequently as once per second, or even more frequently.
The performance analysis of this application was initiated at a point in the development cycle where the code was reasonably mature and stable. Once a software program under development is functioning correctly, it is certainly appropriate to tackle performance optimization. But it should be stressed that performance considerations should be taken into account at every stage of application design, development, and deployment.
This case study focuses on analyzing the code execution path using commercially available profiling tools. CPU profiling tools evaluate a program while it is executing and report on which sections of code are executing as a proportion of overall execution time. This allows programmers to focus on routines that account for the most execution time delay. This information supplies an otherwise sorely missing empirical element to performance-oriented program design and development. Without reliable, quantitative information on code execution paths, programmers tend to rely on very subjective criteria to make decisions that affect application performance. These tools eliminate a lot of idle chatter around the coffee machine about why a program is running slowly and what can be done to fix it. Profilers provide hard evidence of where programs are spending their time during execution.
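To make the idea of execution time as a proportion concrete, the sketch below assumes a profiler that counts how many of its periodic samples landed in each routine, and then finds the routine with the largest share. The `profile_entry` structure, the `hottest_routine` function, and the sampling model itself are illustrative assumptions; they do not describe the output format of any particular profiling tool.

```c
#include <assert.h>
#include <stddef.h>

/* One line of a hypothetical profile: a routine and the number of
   profiler samples that were taken while it was executing. */
typedef struct {
    const char *name;    /* routine name */
    unsigned long hits;  /* samples that landed in this routine */
} profile_entry;

/* Return the index of the routine with the most samples and store its
   share of all samples (0.0 to 1.0) in *share. */
size_t hottest_routine(const profile_entry *entries, size_t count,
                       double *share)
{
    assert(entries != NULL && count > 0 && share != NULL);

    unsigned long total = 0;
    size_t best = 0;
    for (size_t i = 0; i < count; i++) {
        total += entries[i].hits;
        if (entries[i].hits > entries[best].hits)
            best = i;
    }

    /* Guard against an empty profile, where every hit count is zero. */
    *share = total ? (double)entries[best].hits / (double)total : 0.0;
    return best;
}
```

A routine that accounts for half of all samples is a far better tuning target than one that accounts for a few percent, however inefficient the latter may look in the source code.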
A code profiler also provides data to help evaluate alternative approaches to speeding up program execution. The data might tell you which compiler optimization options are worthwhile, which sections of code should be targeted for revision, and where inline assembler routines might prove most helpful. In planning for the next cycle of development, the results of a code execution profile improve the decision-making process.
The code in the program under analysis, as with most programs written for the Windows environment, makes extensive use of the Win32 application programming interface and C runtime services. One desirable outcome of code profiling is a greater understanding of the performance impact of various system services, where many Windows programs spend the majority of their time during execution. The time spent inside these calls to system services is like a black box, which means the programmer generally has little knowledge of the performance characteristics of these runtime services. Understanding their performance impact at least helps the programmer use these system services more efficiently when there are no viable alternatives to using them.
Three profiling tools were used in this study: the profiler built into the Microsoft Visual C++ compiler, Rational Visual Quantify, and Intel VTune. All three are popular commercial packages. In selecting these specific tools, no attempt was made to provide encyclopedic coverage of all the code profiling options available for the Windows 2000/Intel environment. Instead, we focused on using a few of the better-known and widely available tools for Windows 2000 program development to solve a real-world problem.
The Microsoft Visual C++ optimizing compiler was a natural choice because all the code development was performed using this tool. It is a widely used compiler for this environment, and its built-in code profiling tool is often the first choice for developers who might be reluctant to buy an additional software package. The Rational Visual Quantify program is one of the better-known profiler tools for C and C++ language development. Rational is a leading manufacturer of developer tools for Unix, Windows, and Windows 2000. The Visual Quantify program features integration with the Microsoft Visual Studio development environment and is usually reviewed in surveys of C++ development tools published in popular trade publications. Finally, we used the Intel VTune program because it has garnered wide acceptance within the development community. It is an optimization tool developed by Intel specifically for programs that run on Intel hardware. VTune is a standalone program and, we discovered, is more oriented toward assembly language development than typical application development using C or C++.
Integrating performance considerations into application design and development is associated with software performance engineering, a name originally popularized by Dr. Connie Smith in her 1990 textbook Performance Engineering of Software Systems. For a more recent (and less academic) survey of the field, see Chris Loosley and Frank Douglas, High-Performance Client/Server, published in 1998 by John Wiley and Sons. Unfortunately, while many professional developers endorse the goals of performance engineering, few put the principles enumerated by these authors into practice.
 For instance, accumulator values, which are running totals, and instantaneous values are both collected at the same rate. Would it be possible to collect instantaneous values more frequently and then summarize these sample observations? See also Chapter 3’s discussion of transient processes for at least one other good reason for running data collection at short intervals.