Java Performance Tuning
Java Performance Tuning By Jack Shirazi
September 2000
Pages: 436


Chapter 1: Introduction
The trouble with doing something right the first time is that nobody appreciates how difficult it was.
—Fortune
There is a general perception that Java programs are slow. Part of this perception is pure assumption: many people assume that if a program is not compiled, it must be slow. Part of this perception is based in reality: many early applets and applications were slow, because of nonoptimal coding, initially unoptimized Java Virtual Machines (VMs), and the overheads of the language.
In earlier versions of Java, you had to struggle hard and compromise a lot to make a Java application run quickly. More recently, there have been fewer reasons why an application should be slow. The VM technology and Java development tools have progressed to the point where a Java application (or applet, servlet, etc.) is not particularly handicapped. With good designs and by following good coding practices and avoiding bottlenecks, applications usually run fast enough. However, the truth is that the first (and even several subsequent) versions of a program written in any language are often slower than expected, and the reasons for this lack of performance are not always clear to the developer.
This book shows you why a particular Java application might be running slower than expected, and suggests ways to avoid or overcome these pitfalls and improve the performance of your application. In this book I've gathered several years of tuning experiences in one place. I hope you will find it useful in making your Java application, applet, servlet, and component run as fast as you need.
Throughout the book I use the generic words "application" and "program" to cover Java applications, applets, servlets, beans, libraries, and really any use of Java code. Where a technique can be applied only to some subset of these various types of Java programs, I say so. Otherwise, the technique applies across all types of Java programs.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Why Is It Slow?
This question is always asked as soon as the first tests are timed: "Where is the time going? I did not expect it to take this long." Well, the short answer is that it's slow because it has not been performance-tuned. In the same way the first version of the code is likely to have bugs that need fixing, it is also rarely as fast as it can be. Fortunately, performance tuning is usually easier than debugging. When debugging, you have to fix bugs throughout the code; in performance tuning, you can focus your effort on the few parts of the application that are the bottlenecks.
The longer answer? Well, it's true that there are overheads in the Java runtime system, mainly due to its virtual machine layer that abstracts Java away from the underlying hardware. It's also true that there are overheads from Java's dynamic nature. These overheads can cause a Java application to run slower than an equivalent application written in a lower-level language (just as a C program is generally slower than the equivalent program written in assembler). Java's advantages (namely, its platform-independence, memory management, powerful exception checking, built-in multithreading, dynamic resource loading, and security checks) add costs in terms of an interpreter, garbage collector, thread monitors, repeated disk and network accessing, and extra runtime checks.
For example, hierarchical method invocation requires an extra computation for every method call, because the runtime system has to work out which of the possible methods in the hierarchy is the actual target of the call. Most modern CPUs are designed to be optimized for fixed call and branch targets and do not perform as well when a significant percentage of calls need to be computed on the fly. On the other hand, good object-oriented design actually encourages many small methods and significant polymorphism in the method hierarchy. Compiler inlining is another frequently used technique that can significantly improve compiled code. However, this technique cannot be applied when it is too difficult to determine method calls at compile time, as is the case for many Java methods.
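To make the dispatch cost concrete, here is a minimal sketch of a call whose target can only be determined at runtime. The Shape, Circle, Square, and DispatchDemo names are illustrative assumptions, not classes from the text.

```java
// Illustrative types; none of these names come from the text.
abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
}

class DispatchDemo {
    // The compiler cannot know which area() each call will reach:
    // the target depends on the runtime class of every element.
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape shape : shapes) {
            total += shape.area();
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        System.out.println(totalArea(shapes));
    }
}
```

Because the array can hold any mix of subclasses, the call to area() inside the loop is exactly the kind of call that is hard to inline at compile time.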
The Tuning Game
Performance tuning is similar to playing a strategy game (but happily, you are usually paid to do it!). Your target is to get a better score (lower time) than the last score after each attempt. You are playing with, not against, the computer, the programmer, the design and architecture, the compiler, and the flow of control. Your opponents are time, competing applications, budgetary restrictions, etc. (You can complete this list better than I can for your particular situation.)
I once attended a customer who wanted to know if there was a "go faster" switch somewhere that he could just turn on and make the whole application go faster. Of course, he was not really expecting one, but checked just in case he had missed a basic option somewhere.
There isn't such a switch, but very simple techniques sometimes provide the equivalent. Techniques include switching compilers, turning on optimizations, using a different runtime VM, finding two or three bottlenecks in the code or architecture that have simple fixes, and so on. I have seen all of these give huge improvements to applications, sometimes a 20-fold speedup. Order-of-magnitude speedups are typical for the first round of performance tuning.
System Limitations and What to Tune
Three resources limit all applications:
  • CPU speed and availability
  • System memory
  • Disk (and network) input/output (I/O)
When tuning an application, the first step is to determine which of these is causing your application to run too slowly.
If your application is CPU-bound, you need to concentrate your efforts on the code, looking for bottlenecks, inefficient algorithms, too many short-lived objects (object creation and garbage collection are CPU-intensive operations), and other problems, which I will cover in this book.
If your application is hitting system-memory limits, it may be paging sections in and out of main memory. In this case, the problem may be caused by too many objects, or even just a few large objects, being erroneously held in memory; by too many large arrays being allocated (frequently used in buffered applications); or by the design of the application, which may need to be reexamined to reduce its running memory footprint.
On the other hand, external data access or writing to the disk can be slowing your application. In this case, you need to look at exactly what you are doing to the disks that is slowing the application: first identify the operations, then determine the problems, and finally eliminate or change these to improve the situation.
For example, one program I know of went through web server logs and did reverse lookups on the IP addresses. The first version of this program was very slow. A simple analysis of the activity being performed determined that the major time component of the reverse lookup operation was a network query. These network queries do not have to be done sequentially. Consequently, the second version of the program simply multithreaded the lookups to work in parallel, making multiple network queries simultaneously, and was much, much faster.
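The change from sequential to parallel lookups can be sketched roughly as follows. This is not the program described above: the ParallelLookup class, the pool size, and the pluggable resolver function are assumptions made for illustration. In a real log analyzer the resolver might wrap a call such as InetAddress.getByName(ip).getHostName(); here a fake resolver that sleeps stands in for the network query so the sketch runs without a network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: resolves many addresses in parallel rather than one by one.
class ParallelLookup {

    // 'resolver' stands in for the slow network query (an assumption of this sketch).
    static Map<String, String> lookupAll(List<String> addresses,
                                         Function<String, String> resolver,
                                         int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every query up front so that the waits overlap.
            List<Future<String>> pending = new ArrayList<>();
            for (String address : addresses) {
                pending.add(pool.submit(() -> resolver.apply(address)));
            }
            // Collect the results in the original order.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < addresses.size(); i++) {
                results.put(addresses.get(i), pending.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // A fake resolver that sleeps to imitate network latency.
        Function<String, String> slowResolver = ip -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "host-" + ip;
        };
        List<String> ips = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4");
        long start = System.currentTimeMillis();
        Map<String, String> names = lookupAll(ips, slowResolver, 4);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(names + " in " + elapsed + " ms");
    }
}
```

With four threads the four simulated queries wait at the same time, so the elapsed time should be closer to one query's latency than to the sum of all four.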
A Tuning Strategy
Here's a strategy I have found works well when attacking performance problems:
  1. Identify the main bottlenecks (look for about the top five bottlenecks, but go higher or lower if you prefer).
  2. Choose the quickest and easiest one to fix, and address it (except for distributed applications where the top bottleneck is usually the one to attack: see the following paragraph).
  3. Repeat from Step 1.
This procedure will get your application tuned the quickest. The advantage of choosing the "quickest to fix" of the top few bottlenecks rather than the absolute topmost problem is that once a bottleneck has been eliminated, the characteristics of the application change, and the topmost bottleneck may not even need to be addressed any longer. However, in distributed applications I advise you to target the topmost bottleneck. The characteristics of distributed applications are such that the main bottleneck is almost always the best to fix and, once fixed, the next main bottleneck is usually in a completely different component of the system.
Although this strategy is simple and actually quite obvious, I nevertheless find that I have to repeat it again and again: once programmers get the bit between their teeth, they just love to apply themselves to the interesting parts of the problems. After all, who wants to unroll loop after boring loop when there's a nice juicy caching technique you're eager to apply?
You should always treat the actual identification of the cause of the performance bottleneck as a science, not an art. The general procedure is straightforward:
  1. Measure the performance using profilers and benchmark suites, and by instrumenting code.
  2. Identify the locations of any bottlenecks.
  3. Think of a hypothesis for the cause of the bottleneck.
Perceived Performance
It is important to understand that the user has a particular view of performance that allows you to cut some corners. The user of an application sees changes as part of the performance. A browser that gives a running countdown of the amount left to be downloaded from a server is seen to be faster than one that just sits there, apparently hung, until all the data is downloaded. People expect to see something happening, and a good rule of thumb is that if an application is unresponsive for more than three seconds, it is seen to be slow. Some Human Computer Interface authorities put the user-patience limit at just two seconds; an IBM study from the early '70s suggested people's attention began to wander after waiting for more than just one second. For performance improvements, it is also useful to know that users are not generally aware of response time improvements of less than 20%. This means that when tuning for user perception, you should not deliver any changes to the users until you have made improvements that add more than a 20% speedup.
A few long response times make a bigger impression on the memory than many shorter ones. According to Arnold Allen, the perceived value of the average response time is not the average, but the 90th percentile value: the value that is greater than 90% of all observed response times. With a typical exponential distribution, the 90th percentile value is 2.3 times the average value. Consequently, so long as you reduce the variation in response times so that the 90th percentile value is smaller than before, you can actually increase the average response time, and the user will still perceive the application as faster. For this reason, you may want to target variation in response times as a primary goal. Unfortunately, this is one of the more complex targets in performance tuning: it can be difficult to determine exactly why response times are varying.
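The factor of 2.3 can be checked with a short derivation. It assumes response times T follow an exponential distribution with mean \mu; the symbols T, \mu, and t_{90} are introduced here only for this check.

```latex
P(T \le t) = 1 - e^{-t/\mu}
\quad\Longrightarrow\quad
1 - e^{-t_{90}/\mu} = 0.9
\quad\Longrightarrow\quad
t_{90} = \mu \ln 10 \approx 2.30\,\mu
```

Under that assumption, the 90th percentile is about 2.3 times the mean regardless of the mean's actual value.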
If the interface provides feedback and allows the user to carry on other tasks or abort and start another function (preferably both), the user sees this as a responsive interface and doesn't consider the application as slow as he might otherwise. If you give users an expectancy of how long a particular task might take and why, they often accept that this is as long as it has to take and adjust their expectations. Modern web browsers provide an excellent example of this strategy in practice. People realize that the browser is limited by the bandwidth of their connection to the Internet, and that downloading cannot happen faster than a given speed. Good browsers always try to show the parts they have already received so that the user is not blocked, and they also allow the user to terminate downloading or go off to another page at any time, even while a page is partly downloaded. Generally, it is not the browser that is seen to be slow, but rather the Internet or the server site. In fact, browser creators have made a number of tradeoffs so that their browsers appear to run faster in a slow environment. I have measured browser display of identical pages under identical conditions and found browsers that are actually faster at full page display, but seem slower because they do not display partial pages, or download embedded links concurrently, etc. Modern web browsers provide a good example of how to manage user expectations and perceptions of performance.
Starting to Tune
Before diving into the actual tuning, there are a number of considerations that will make your tuning phase run more smoothly and result in clearly achieved objectives.
Any application must meet the needs and expectations of its users, and a large part of those needs and expectations is performance. Before you start tuning, it is crucial to identify the target response times for as much of the system as possible. At the outset, you should agree with your users (directly if you have access to them, or otherwise through representative user profiles, market information, etc.) what the performance of the application is expected to be.
The performance should be specified for as many aspects of the system as possible, including:
  • Multiuser response times depending on the number of users (if applicable)
  • Systemwide throughput (e.g., number of transactions per minute for the system as a whole, or response times on a saturated network, again if applicable)
  • The maximum number of users, data, files, file sizes, objects, etc., the application supports
  • Any acceptable and expected degradation in performance between minimal, average, and extreme values of supported resources
Agree on target values and acceptable variances with the customer or potential users of the application (or whoever is responsible for performance) before starting to tune. Otherwise, you will not know where to target your effort, how far you need to go, whether particular performance targets are achievable at all, and how much tuning effort those targets may require. But most importantly, without agreed targets, whatever you achieve tends to become the starting point.
The following scenario is not unusual: a manager sees horrendous performance, perhaps a function that was expected to be quick, but takes 100 seconds. His immediate response is, "Good grief, I expected this to take no more than 10 seconds." Then, after a quick round of tuning that identifies and removes a huge bottleneck, function time is down to 10 seconds. The manager's response is now, "Ah, that's more reasonable, but of course I actually meant to specify 3 seconds—I just never believed you could get down so far after seeing it take 100 seconds. Now you can start tuning." You do not want your initial achievement to go unrecognized (especially if money depends on it), and it is better to know at the outset what you need to reach. Agreeing on targets before tuning makes everything clear to everyone.
What to Measure
The main measurement is always wall-clock time. You should use this measurement to specify almost all benchmarks, as it's the real-time interval that is most appreciated by the user. (There are certain situations, however, in which system throughput might be considered more important than the wall-clock time; e.g., servers, enterprise transaction systems, and batch or background systems.)
The obvious way to measure wall-clock time is to get a timestamp using System.currentTimeMillis( ) and then subtract this from a later timestamp to determine the elapsed time. This works well for elapsed time measurements that are not short. Other types of measurements have to be system-specific and often application-specific. You can measure:
  • CPU time (the time allocated on the CPU for a particular procedure)
  • The number of runnable processes waiting for the CPU (this gives you an idea of CPU contention)
  • Paging of processes
  • Memory sizes
  • Disk throughput
  • Disk scanning times
  • Network traffic, throughput, and latency
  • Transaction rates
  • Other system values
However, Java doesn't provide mechanisms for measuring these values directly, and measuring them requires at least some system knowledge, and usually some application-specific knowledge (e.g., what is a transaction for your application?).
You need to be careful when running tests that have small differences in timings. The first test is usually slightly slower than any other tests. Try doubling the test run so that each test is run twice within the VM (e.g., rename main( ) to maintest( ), and call maintest( ) twice from a new main( )).
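A minimal sketch of both ideas, timestamp subtraction and running the test twice in one VM, might look like this. The TimedTest class and the string-building workload are placeholders chosen for illustration, not code from the text.

```java
// Hypothetical benchmark skeleton; the class name and workload are illustrative.
class TimedTest {

    // The code under test, moved out of main() as suggested above.
    static long maintest() {
        long start = System.currentTimeMillis();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2_000_000; i++) {
            sb.append(i % 10);
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Built " + sb.length() + " chars in " + elapsed + " ms");
        return elapsed;
    }

    public static void main(String[] args) {
        maintest(); // first run: includes class loading and other startup costs
        maintest(); // second run: usually the more representative timing
    }
}
```

Comparing the two printed times on your own system gives a rough idea of how much startup overhead the first run carries.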
Don't Tune What You Don't Need to Tune
The most efficient tuning you can do is not to alter what works well. As they say, "If it ain't broke, don't fix it." This may seem obvious, but the temptation to tweak something just because you have thought of an improvement has a tendency to override this obvious statement.
The second most efficient tuning is to discard work that doesn't need doing. It is not at all uncommon for an application to be started with one set of specifications and to have some of the specifications change over time. Many times the initial specifications are much more generic than the final product. However, the earlier generic specifications often still have their stamps in the application. I frequently find routines, variables, objects, and subsystems that are still being maintained but are never used and never will be used, since some critical aspect of these resources is no longer supported. These redundant parts of the application can usually be chopped without any bad consequences, often resulting in a performance gain.
In general, you need to ask yourself exactly what the application is doing and why. Then question whether it needs to do it in that way, or even if it needs to do it at all. If you have third-party products and tools being used by the application, consider exactly what they are doing. Try to be aware of the main resources they use (from their documentation). For example, a zippy DLL (shared library) that is speeding up all your network transfers is using some resources to achieve that speedup. You should know that it is allocating larger and larger buffers before you start trying to hunt down the source of your mysteriously disappearing memory. Then you can realize that you need to use the more complicated interface to the DLL that restricts resource usage, rather than a simple and convenient interface. And you will have realized this before doing extensive (and useless) object profiling, because you would have been trying to determine why your application is being a memory hog.
Performance Checklist
  • Specify the required performance.
    • Ensure performance objectives are clear.
    • Specify target response times for as much of the system as possible.
    • Specify all variations in benchmarks, including expected response ranges (e.g., 80% of responses for X must fall within 3 seconds).
    • Include benchmarks for the full range of scaling expected (e.g., low to high numbers of users, data, files, file sizes, objects, etc.).
    • Specify and use a benchmark suite based on real user behavior. This is particularly important for multiuser benchmarks.
    • Agree on all target times with users, customers, managers, etc., before tuning.
  • Make your benchmarks long enough: over five seconds is a good target.
    • Use elapsed time (wall-clock time) for the primary time measurements.
    • Ensure the benchmark harness does not interfere with the performance of the application.
    • Run benchmarks before starting tuning, and again after each tuning exercise.
    • Take care that you are not measuring artificial situations, such as full caches containing exactly the data needed for the test.
  • Break down distributed application measurements into components, transfer layers, and network transfer times.
  • Tune systematically: understand what affects the performance; define targets; tune; monitor and redefine targets when necessary.
Chapter 2: Profiling Tools
If you only have a hammer, you tend to see every problem as a nail.
—Abraham Maslow
Before you can tune your application, you need tools that will help you find the bottlenecks in the code. I have used many different tools for performance tuning, and so far I have found the commercially available profilers to be the most useful. You can easily find several of these, together with reviews of them, by searching the Web using java+optimi and java+profile, or checking the various computer magazines. These tools are usually available free for an evaluation period, and you can quickly tell which you prefer using. If your budget covers it, it is worth getting several profilers: they often have complementary features and provide different details about the running code. I have included a list of profilers in Chapter 15.
All profilers have some weaknesses, especially when you want to customize them to focus on particular aspects of the application. Another general problem with profilers is that they frequently fail to work in nonstandard environments. Nonstandard environments should be rare, considering Java's emphasis on standardization, but most profiling tools work at the VM level, and the JVMPI (Java Virtual Machine Profiler Interface) was only beginning to be standardized in JDK 1.2, so incompatibilities do occur. Even after the JVMPI standard is finalized, I expect there will be some nonstandard VMs you may have to use, possibly a specialized VM of some sort; there are already many of these.
When tuning, I normally use one of the commercial profiling tools, and on occasion where the tools do not meet my needs, I fall back on a variation of one of the custom tools and information extraction methods presented in this chapter. Where a particular VM offers extra APIs that tell you about some running characteristics of your application, these custom tools are essential to access those extra APIs. Using a professional profiler and the proprietary tools covered in this chapter, you will have enough information to figure out where problems lie and how to resolve them. When necessary, you can successfully tune without a professional profiler, since the Sun VM does contain a basic profiler, which I cover in this chapter. However, this option is not ideal for the most rapid tuning.
Measurements and Timings
When looking at timings, be aware that different tools affect the performance of applications in different ways. Any profiler slows down the application it is profiling. The degree of slowdown can vary from a few percent to a few hundred percent. Using System.currentTimeMillis( ) in the code to get timestamps is the only reliable way to determine the time taken by each part of the application. In addition, System.currentTimeMillis( ) is quick and has no effect on application timing (as long as you are not measuring too many intervals or ridiculously short intervals; see the discussion in Section 1.7 in Chapter 1).
Another variation on timing the application arises from the underlying operating system. The operating system can allocate different priorities for different processes, and these priorities determine the importance the operating system applies to a particular process. This in turn affects the amount of CPU time allocated to a particular process compared to other processes. Furthermore, these priorities can change over the lifetime of the process. It is usual for server operating systems to gradually decrease the priority of a process over that process's lifetime. This means that the process will have shorter periods of the CPU allocated to it before it is put back in the runnable queue. An adaptive VM (like Sun's HotSpot) can give you the reverse situation, speeding up code shortly after it has started running (see Section 3.3).
Whether or not a process runs in the foreground can also be important. For example, on a machine with the workstation version of Windows (most varieties including NT, 95, 98, and 2000), foreground processes are given maximum priority. This ensures that the window currently being worked on is maximally responsive. However, if you start a test and then put it in the background so that you can do something else while it runs, the measured times can be very different from the results you would get if you left that test running in the foreground. This applies even if you do not actually do anything else while the test is running in the background. Similarly, on server machines, certain processes may be allocated maximum priority (for example, Windows NT and 2000 server version, as well as most Unix server configured machines, allocate maximum priority to network I/O processes).
Garbage Collection
The Java runtime system normally includes a garbage collector. Some of the commercial profilers provide statistics showing what the garbage collector is doing. You can also use the -verbosegc option with the VM. This option prints out time and space values for objects reclaimed and space recycled as the reclamations occur. The printout includes explicit synchronous calls to the garbage collector (using System.gc( )) as well as asynchronous executions of the garbage collector, as occurs in normal operation when free memory available to the VM gets low.
System.gc( ) does not necessarily force a synchronous garbage collection. Instead, the gc( ) call is really a hint to the runtime that now is a good time to run the garbage collector. The runtime decides whether to execute the garbage collection at that time and what type of garbage collection to run.
It is worth looking at some output from running with -verbosegc. The following code fragment creates lots of objects to force the garbage collector to work, and also includes some synchronous calls to the garbage collector:
package tuning.gc;
public class Test {
  public static void main(String[] args)
  {
    int SIZE = 4000;
    StringBuffer s;
    java.util.Vector v;

    //Create some objects so that the garbage collector 
    //has something to do
    for (int i = 0; i < SIZE; i++)
    {
      s = new StringBuffer(50);
      v = new java.util.Vector(30);
      s.append(i).append(i+1).append(i+2).append(i+3);
    }
    s = null;
    v = null;
    System.out.println("Starting explicit garbage collection");
    long time = System.currentTimeMillis( );
    System.gc( );
    System.out.println("Garbage collection took " + 
      (System.currentTimeMillis( )-time) + " millis");

    int[] arr = new int[SIZE*10];
    //null the variable so that the array can be garbage collected
    arr = null;
    System.out.println("Starting explicit garbage collection");
    //take the timestamp after the println, as in the first measurement,
    //so that both timings cover only the System.gc( ) call
    time = System.currentTimeMillis( );
    System.gc( );
    System.out.println("Garbage collection took " + 
      (System.currentTimeMillis( )-time) + " millis");
  }
}
Method Calls
The main focus of most profiling tools is to provide a profile of method calls. This gives you a good idea of where the bottlenecks in your code are and is probably the most important way to pinpoint where to target your efforts. By showing which methods and lines take the most time, a good profiling tool can save you time and effort in locating bottlenecks.
Most method profilers work by sampling the call stack at regular intervals and recording the methods on the stack. This regular snapshot identifies the method currently being executed (the method at the top of the stack) and all the methods below, to the program's entry point. By accumulating the number of hits on each method, the resulting profile usually identifies where the program is spending most of its time. This profiling technique assumes that the sampled methods are representative, i.e., if 10% of stacks sampled show method foo( ) at the top of the stack, then the assumption is that method foo( ) takes 10% of the running time. However, this is a sampling technique, and so it is not foolproof: methods can be missed altogether or have their weighting misrecorded if some of their execution calls are missed. But usually only the shortest tests are skewed. Any reasonably long test (i.e., over seconds, rather than milliseconds) will normally give correct results.
This sampling technique can be difficult to get right. It is not enough to simply sample the stack. The profiler must also ensure that it has a coherent stack state, so the call must be synchronized across the stack activities, possibly by temporarily stopping the thread. The profiler also needs to make sure that multiple threads are treated consistently, and that the timing involved in its activities is accounted for without distorting the regular sample time. Also, too short a sample interval causes the program to become extremely slow, while too long an interval results in many method calls being missed and hence misrepresentative profile results being generated.
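As a rough illustration of stack sampling, the toy sampler below counts which method is on top of one thread's stack at each sample. It is a sketch, not how any particular profiler is implemented: the StackSampler class, the busyWork( ) workload, and the 5-millisecond interval are assumptions. It relies on Thread.getStackTrace( ) to return a coherent snapshot of the target thread, and it ignores the multithreading and timing-correction issues described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy sampler: counts which method is on top of a thread's stack.
class StackSampler {

    // Samples 'target' every 'intervalMillis' until it terminates.
    static Map<String, Integer> sample(Thread target, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        while (target.isAlive()) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                // Element 0 is the method executing at the moment of the sample.
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop sampling if the sampler itself is interrupted
            }
        }
        return hits;
    }

    // Some CPU-bound work to profile.
    static double busyWork() {
        double sum = 0;
        for (int i = 1; i < 50_000_000; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> System.out.println(busyWork()));
        worker.start();
        Map<String, Integer> hits = sample(worker, 5);
        hits.forEach((method, count) -> System.out.println(count + "  " + method));
    }
}
```

Shortening the interval gives more samples at the cost of pausing the worker more often, which is the tradeoff between overhead and accuracy described above.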
Object-Creation Profiling
Unfortunately, the object-creation statistics available from the Sun JDK provide only very rudimentary information. Most profile tool vendors provide much better object-creation statistics, determining object numbers and identifying where particular objects are created in the code. My recommendation is to use a better (probably commercial) tool than the JDK profiler.
The heap-analysis tool (search www.java.sun.com for "HAT"), which can analyze the output of the default profiling mode in Java 2, provides a little more information from the profiler output, but if you are relying on this, profiling object creation will require a lot of effort. To use this tool, you must specify the binary output format in the profiling option:
java -Xrunhprof:format=b <classname>
I have used an alternate trick when a reasonable profiler is unavailable, cannot be used, or does not provide precisely the detail I need. This technique is to alter the java.lang.Object class to catch most nonarray object-creation calls. This is not a supported feature, but it does seem to work on most systems, because all constructors chain up to the Object class's constructor, and any explicitly created nonarray object calls the constructor in Object as its first execution point after the VM allocates the object on the heap. Objects that are created implicitly with a call to clone( ) or by deserialization do not call the Object class's constructor, and so are missed when using this technique.
Under the terms of the license granted by Sun, it is not possible to include or list an altered Object class with this book. But I can show you the simple changes to make to the java.lang.Object class to track object creation.
The change requires adding a line in the Object constructor to pass this to some object-creation monitor you are using. java.lang.Object does not have an explicitly defined constructor (it uses the default empty constructor), so you need to add one to the source and recompile. For any class other than
Additional content appearing in this section has been removed.
Monitoring Gross Memory Usage
The JDK provides two methods for monitoring the amount of memory used by the runtime system. The methods are freeMemory( ) and totalMemory( ) in the java.lang.Runtime class.
totalMemory( ) returns a long, which is the number of bytes currently allocated to the runtime system for this particular Java VM process. Within this memory allocation, the VM manages its objects and data. Some of this allocated memory is held in reserve for creating new objects. When the currently allocated memory fills up and the garbage collector cannot reclaim enough space, the VM requests more memory from the underlying system. If the underlying system cannot allocate any further memory, an OutOfMemoryError is thrown. Total memory can go up and down; some Java runtimes can return sections of unused memory to the underlying system while still running.
freeMemory( ) returns a long, which is the number of bytes available to the VM to create objects from the section of memory it controls (i.e., memory already allocated to the runtime by the underlying system). The free memory increases when a garbage collection successfully reclaims space used by dead objects, and also increases when the Java runtime requests more memory from the underlying operating system. The free memory decreases each time an object is created, and also when the runtime returns memory to the underlying system.
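Taken together, the two values give the memory in use: total minus free. The following is a minimal, non-graphical sketch of that calculation; the class name MemorySnapshot is hypothetical. The two Runtime calls are not made atomically, so the figures are approximate.

```java
/** Records totalMemory() and freeMemory() at one moment and derives the memory in use. */
class MemorySnapshot {
    final long total;
    final long free;

    MemorySnapshot(long total, long free) {
        this.total = total;
        this.free = free;
    }

    /** Reads both values from the runtime; the two reads are not atomic. */
    static MemorySnapshot take() {
        Runtime rt = Runtime.getRuntime();
        return new MemorySnapshot(rt.totalMemory(), rt.freeMemory());
    }

    /** Bytes currently occupied by objects, live or not yet collected. */
    long used() {
        return total - free;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            MemorySnapshot s = take();
            System.out.println("total=" + s.total + " free=" + s.free + " used=" + s.used());
            byte[][] temporary = new byte[100][10_000]; // generate some temporary objects
            Thread.sleep(200);
        }
    }
}
```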
It can be useful to monitor memory usage while an application runs: you can get a good feel for the hotspots of your application. You may be surprised to see steady decrements in the free memory available to your application when you were not expecting any change. This can occur when you continuously generate temporary objects from some routine; manipulating graphical elements frequently shows this behavior.
Monitoring memory with freeMemory( ) and totalMemory( ) is straightforward, and I include here a simple class that does this graphically. It creates three threads: one to periodically sample the memory, one to maintain a display of the memory usage graph, and one to run the program you are monitoring. Figure 2-1 shows a screen shot of the memory monitor after monitoring a run of the
Additional content appearing in this section has been removed.
Client/Server Communications
To tune client/server or distributed applications, you need to identify all communications that occur during execution. The most important factors to look for are the number of transfers of incoming and outgoing data, and the amounts of data transferred. These elements affect performance the most. Generally, if the amount of data per transfer is less than about one kilobyte, the number of transfers is the factor that limits performance. If the amount of data being transferred is more than about a third of the network's capacity, the amount of data is the factor limiting performance. Between these two endpoints, either the amount of data or the number of transfers can limit performance, although in general, the number of transfers is more likely to be the problem.
As an example, websurfing with a browser typically hits both problems at different times. A complex page with many parts presented from multiple sites can take longer to display completely than one simple page with 10 times more data. Many different sites are involved in displaying the complex page; each site needs to have its server name converted to an IP address, which can take many network transfers, and then each site needs to be connected to and downloaded from. The simple page needs only one name lookup and one connection, and this can make a huge difference. On the other hand, if the amount of data is large compared to the connection bandwidth (the speed of the Internet connection at the slowest link between your client and the server machine), the limiting factor is that bandwidth, and so the complex page may display more quickly than the simple page.
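One way to get at both figures from inside an application is to wrap the stream used for communication and count what passes through it. The sketch below is an illustration only: CountingOutputStream is a hypothetical class, and it counts write calls as a rough proxy for transfers, which is an assumption, since buffering or the network stack may combine or split writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Wraps an output stream and counts the bytes written and the number of write calls. */
class CountingOutputStream extends FilterOutputStream {
    private long bytes;
    private long writes;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytes++;
        writes++;
    }

    // write(byte[]) in FilterOutputStream delegates here, so it is counted once.
    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        out.write(buf, off, len);
        bytes += len;
        writes++;
    }

    long bytesWritten() {
        return bytes;
    }

    long writeCalls() {
        return writes;
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for a socket's output stream.
        CountingOutputStream counted = new CountingOutputStream(new ByteArrayOutputStream());
        counted.write("GET /index.html".getBytes());
        counted.write('\n');
        System.out.println(counted.writeCalls() + " writes, " + counted.bytesWritten() + " bytes");
    }
}
```

Dividing the byte count by the write count gives an average size per write, which can be compared against the one-kilobyte rule of thumb given earlier.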
Several generic tools are available for monitoring communication traffic, all aimed at system and network administrators (and quite expensive). I know of no general-purpose profiling tool targeted at application-level communications monitoring; normally, developers put their own monitoring capabilities into the application or use the trace mode in their third-party communications package, if they use one. (
Additional content appearing in this section has been removed.
Performance Checklist
  • Use system- and network-level monitoring utilities to assist when measuring performance.
  • Run tests on unloaded systems with the test running in the foreground.
    • Use System.currentTimeMillis( ) to get timestamps if you need to determine absolute times. Never use the timings obtained from a profiler as absolute times.
    • Account for all performance effects of any caches.
  • Get better profiling tools. The better your tools, the faster and more effective your tuning.
    • Pinpoint the bottlenecks in the application: with profilers, by instrumenting code (putting in explicit timing statements), and by analyzing the code.
    • Target the top five to ten methods, and choose the quickest to fix.
    • Speed up the bottleneck methods that can be fixed the quickest.
    • Improve the method directly when the method takes a significant percentage of time and is not called too often.
    • Reduce the number of times a method is called when the method takes a significant percentage of time and is also called frequently.
  • Use an object-creation profiler together with garbage-collection statistics to determine which objects are created in large amounts and which large objects are created.
    • See if the garbage collector executes more often than you expect.
    • Use the Runtime.totalMemory( ) and Runtime.freeMemory( )
Additional content appearing in this section has been removed.
Chapter 3: Underlying JDK Improvements
Throughout the progressive versions of Java, improvements have been made at all levels of the runtime system: in the garbage collector, in the code, in the VM handling of objects and threads, and in compiler optimizations. It is always worthwhile to check your own application benchmarks against each version (and each vendor's version) of the Java system you try out. Any differences in performance need to be identified and explained; if you can determine that a compiler from one version (or vendor) together with the runtime from another version (or vendor) speeds up your application, you may have the option of choosing the best of both worlds. Standard Java benchmarks tend to be of limited use in deciding which VMs provide the best performance for your application. You are always better off creating your own application benchmark suite for deciding which VM and compiler best suit your application.
The following sections identify some points to consider as you investigate different VMs, compilers, and JDK classes. If you control the target Java runtime environment, as is often the case with servlet and other server applications, more options are available to you, and we will look at these extra options too.
Garbage Collection
The effects of the garbage collector can be difficult to determine accurately. It is worth including some tests in your performance benchmark suite that are specifically arranged to identify these effects. You can do this only in a general way, since the garbage collector is not under your control. The basic way to see what the garbage collector is up to is to run with the -verbosegc option. This prints out time and space values for objects reclaimed and space recycled. The printout includes explicit synchronous calls to the garbage collector (using System.gc( )) as well as asynchronous executions of the garbage collector, as occurs in normal operation when free memory available to the VM gets low. You can try to force the VM to execute only synchronous garbage collections by using the -noasyncgc option to the Java executable (no longer available from JDK 1.2). This option does not actually stop the garbage-collector thread from executing: it still executes if the VM runs out of free memory (as opposed to just getting low on memory). Output from the garbage collector running with -verbosegc is detailed in Section 2.2.
The garbage collector usually works by freeing the memory that becomes available from objects that are no longer referenced or, if this does not free sufficient space, by expanding the available memory space by asking the operating system for more memory (up to a maximum specified to the VM with the -Xmx/-mx option). The garbage collector's space-reclamation algorithm tends to change with each version of the JDK.
Sophisticated generational garbage collectors, which smooth out the impact of the garbage collector, are now being used; HotSpot uses a state-of-the-art generational garbage collector. Analysis of object-oriented programs has shown that most objects are short-lived, fewer have medium lifespans, and very few objects are long-lived. Generational garbage collectors move objects through multiple spaces, each time copying live objects from one space to the next and reclaiming the space used by objects that are no longer alive. By concentrating on short-lived objects—the early spaces—and spending less time recycling space where older objects live, the garbage collector frees the maximum amount of space for the lowest impact.
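A benchmark test arranged to expose garbage-collection effects might look like the following sketch. It churns through short-lived objects and reports how many collections ran meanwhile. The class name GcChurn is hypothetical, and the java.lang.management API it uses arrived in a later JDK than those discussed here; on older runtimes the -verbosegc output serves the same purpose.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Creates many short-lived objects and reports how often the collector ran. */
class GcChurn {
    /** Sums collection counts over all collectors; -1 ("undefined") is treated as 0. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Allocates short-lived arrays; the checksum keeps the work from being optimized away. */
    static long churn(int objects) {
        long checksum = 0;
        for (int i = 0; i < objects; i++) {
            byte[] temp = new byte[1024]; // dead as soon as the iteration ends
            temp[0] = (byte) i;
            checksum += temp[0];
        }
        return checksum;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        long start = System.currentTimeMillis();
        long checksum = churn(5_000_000);
        long elapsed = System.currentTimeMillis() - start;
        long after = totalCollections();
        System.out.println("collections during run: " + (after - before)
                + ", elapsed ms: " + elapsed + ", checksum: " + checksum);
    }
}
```

Running the same test with different heap settings (for example, a different -Xmx value) gives a rough picture of how sensitive the timing is to the collector's behavior.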
Additional content appearing in this section has been removed.
Replacing JDK Classes
It is possible for you to replace JDK classes directly. Unfortunately, you can't distribute these altered classes with any application or applet unless you have complete control of the target environment. Although you often do have this control with in-house and enterprise-developed applications, most enterprises prefer not to deploy alterations to externally built classes. The alterations then would not be supported by the vendor (Sun in this case) and may violate the license, so contact the vendor if you need to do this. In addition, altering classes in this way can be a significant maintenance problem.
The upshot is that you can easily alter JDK-supplied classes for development purposes, which can be useful for various reasons including debugging and tuning. But if you need the functionality in your deployed application, you need to provide classes that are used instead of the JDK classes by redirecting method calls into your own classes.
Replacing JDK classes indirectly in this way is a valid tuning technique. Some JDK classes, such as StreamTokenizer (see Section 5.4), are inefficient and can be replaced quite easily since you normally use them in small, well-defined parts of a program. Other JDK classes, like Date, BigDecimal, and String, are used all over the place, and it can take a large effort to replace references with your own versions of these classes. The best way to replace these classes is to start from the design stage, so that you can consistently use your own versions throughout the application.
In Version 1.3 of the JDK, many of the java.lang.Math methods were changed from native to call the corresponding methods in java.lang.StrictMath. StrictMath provides bitwise consistency across platforms; earlier versions of Math used the platform-specific native functions that were not identical across all platforms. Unfortunately, StrictMath calculations are somewhat slower than the corresponding native functions. My colleague Kirk Pepperdine, who first pointed out the performance problem to me, puts it this way: "I've now got a bitwise-correct but excruciatingly slow program." The potential workarounds to this performance issue are all ugly: using an earlier JDK version, replacing the JDK class with an earlier version, or writing your own class to manage faster alternative floating-point calculations.
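The last workaround, routing calls through a class of your own, can be sketched as follows. AppMath is a hypothetical name, and the table-lookup sine is only one possible stand-in for a "faster alternative"; it trades accuracy (an error of up to roughly 0.001 at this table size) for simplicity. Whether it is actually faster on a given VM is an assumption that has to be measured.

```java
/** Application-owned math facade: call sites use AppMath rather than java.lang.Math directly. */
final class AppMath {
    private static final int STEPS = 3600; // table resolution: 0.1 degree
    private static final double[] SIN_TABLE = new double[STEPS];

    static {
        for (int i = 0; i < STEPS; i++) {
            SIN_TABLE[i] = StrictMath.sin(2 * Math.PI * i / STEPS);
        }
    }

    private AppMath() {
    }

    /** Bitwise-reproducible result, delegated to StrictMath. */
    static double sinExact(double radians) {
        return StrictMath.sin(radians);
    }

    /** Approximate sine by nearest-entry table lookup. */
    static double sinFast(double radians) {
        double turns = radians / (2 * Math.PI);
        double fraction = turns - Math.floor(turns); // position within one full turn
        int index = (int) Math.round(fraction * STEPS) % STEPS;
        return SIN_TABLE[index];
    }

    public static void main(String[] args) {
        double angle = 1.0;
        System.out.println("exact: " + sinExact(angle));
        System.out.println("fast:  " + sinFast(angle));
    }
}
```

Because every caller goes through the facade, the implementation behind sinFast( ) can be swapped without touching the rest of the application, which is the point of replacing a JDK class indirectly.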
Additional content appearing in this section has been removed.
Faster VMs
VM runtimes and Java compilers vary enormously over time and across vendors. More and more optimizations are finding their way into both VMs and compilers. Many possible compiler optimizations are considered in later sections of this chapter. In this section I focus on VM optimizations.
Different VMs have different running characteristics. Some VMs are intended purely for development and are highly suboptimal in terms of performance. These VMs may have huge inefficiencies, even in such basic operations as casting between different numeric types. One development VM I used had this behavior; it provided the foundation of an excellent development environment (actually my preferred environment), but was all but useless for performance testing, as any data type manipulation other than with ints or booleans produced highly varying and misleading times.
It is important to run any tests involving timing or profiling in the same VM you plan to run the application. You should test your application in the current "standard" VMs if your target environment is not fully defined.
There is, of course, nothing much you can do about speeding up any one VM (short of upgrading the CPUs). But you should be aware of the different VMs available, whether or not you control the deployment environment of your application. If you control the target environment, you can choose your VM appropriately. If you do not control the environment on which your application runs, remember that performance is partly user expectation. If you tell your user that VM "A" gives such and such a performance for your application, but VM "B" gives this other much slower performance, then you at least inform your user community of the implications of their choice of VM. This could also possibly put pressure on vendors with slower VMs to improve them.
The basic bytecode interpreter VM executes by decoding and executing bytecodes. This is slow, and is pure overhead, adding nothing to the functionality of the application. A just-in-time (JIT) compiler in a virtual machine eliminates much of this overhead by doing the bytecode fetch and decode just once. The first time the method is loaded, the decoded instructions are converted into machine code native for the CPU the system is running on. After that, future invocations of a particular method no longer incur the interpreter overhead. However, a JIT must be fast at compiling to avoid slowing the runtime, so extensive optimizations within the compile phase are unlikely. This means that the compiled code is often not as fast as it could be. A JIT also imposes a significantly larger memory footprint on the process.
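The effect of runtime compilation can often be seen by timing the same method over several rounds. The sketch below is only an illustration: WarmupTiming is a hypothetical class, the timings depend entirely on the VM and machine, and VMs differ in when they choose to compile a method, so no particular pattern is guaranteed.

```java
/** Times repeated calls to one method so that early and later rounds can be compared. */
class WarmupTiming {
    /** Simple CPU-bound work: 1*1 + 2*2 + ... + n*n. */
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int round = 1; round <= 10; round++) {
            long start = System.nanoTime();
            long result = sumOfSquares(1_000_000);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("round " + round + ": " + micros + " us (result " + result + ")");
        }
    }
}
```

When timing code for tuning purposes, it is usually worth discarding the first few rounds so that compilation cost is not mistaken for the steady-state cost of the method.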
Additional content appearing in this section has been removed.
Better Optimizing Compilers
Look out for Java code compilers that specifically target performance optimizations. These are increasingly available. (I suggest searching the Web for java+compile+optimi and checking in Java magazines. A list is also included in Chapter 15.) Of course, all compilers try to optimize code, but some are better than others. Some companies put a great deal of effort into making their compiler produce the tightest, fastest code, while others tend to be distracted by other aspects of the Java environment and put less effort into the compile phase.
There are also some experimental compilers around. For example, the JAVAR compiler (http://www.extreme.indiana.edu/hpjava/) is a prototype compiler that automatically parallelizes parts of a Java application to improve performance.
It is possible to write preprocessors to automatically achieve many of the optimizations you can get with optimizing compilers; indeed, you can think of an optimizing compiler as a preprocessor together with a basic compiler (though in many cases it is better described as a postprocessor and recompiler). However, writing such a preprocessor is a significant task. Even if you ignore the Java code parsing or bytecode parsing required, any one preprocessor optimization can take months to create and verify. To get close to the full set of optimizations listed in the following sections could take years of development. Fortunately, it is not necessary for you to make that effort, because optimizing compiler vendors are making the effort for you.
Optimizing compilers cannot change your code to use a better algorithm. If you are using an inefficient search routine, there may be hugely better search algorithms giving orders of magnitude speedups. But the optimizing compiler only tries to speed up the algorithm you are using (with a probable small incremental speedup). It is still important to profile applications to identify bottlenecks even if you intend to use an optimizing compiler.
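To make the distinction concrete, the sketch below compares a linear scan with the JDK's binary search on the same sorted array. The class name SearchComparison is hypothetical. No compiler turns the first routine into the second; that change has to come from the developer.

```java
import java.util.Arrays;

/** Contrasts a linear scan with a binary search over the same sorted data. */
class SearchComparison {
    /** Checks every element in turn; returns the index of key, or -1 if absent. */
    static int linearSearch(int[] values, int key) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == key) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sorted = new int[1_000_000];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = 2 * i;
        }
        int key = sorted[sorted.length - 1]; // worst case for the linear scan

        long t0 = System.nanoTime();
        int linearIndex = linearSearch(sorted, key);   // about 1,000,000 comparisons
        long t1 = System.nanoTime();
        int binaryIndex = Arrays.binarySearch(sorted, key); // about 20 comparisons
        long t2 = System.nanoTime();

        System.out.println("linear: index " + linearIndex + " in " + (t1 - t0) + " ns");
        System.out.println("binary: index " + binaryIndex + " in " + (t2 - t1) + " ns");
    }
}
```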
Additional content appearing in this section has been removed.
Sun's Compiler and Runtime Optimizations
As you can see from the previous sections, knowing how the compiler alters your code as it generates bytecodes is important for performance tuning. Some compiler optimizations can be canceled out if you write your code so that the compiler cannot apply its optimizations. In this section, I cover what you need to know to get the most out of the compilation stage if you are using the JDK compiler (javac).
There are several optimizations that occur at the compilation stage without your needing to specify any compilation options. These optimizations are not necessarily required because of specifications laid down in Java. Instead, they have become standard compiler optimizations. The JDK compiler always applies them, and consequently almost every other compiler applies them as well. You should always determine exactly what your specific compiler optimizes as standard, from the documentation provided or by decompiling example code.

Section 3.5.1.1: Literal constants are folded

This optimization is a concrete implementation of the ideas discussed in Section 3.4.2.5 earlier. In this implementation, multiple literal constants in an expression are "folded" by the compiler. For example, in the following statement:
int foo = 9*10;
the 9*10 is evaluated to 90 before compilation is completed. The result is as if the line read:
int foo = 90;
This optimization allows you to make your code more readable without having to worry about avoiding runtime overheads.

Section 3.5.1.2: String concatenation is sometimes folded

With the Java 2 compiler, string concatenations to literal constants are folded:
String foo = "hi Joe " + (9*10);
is compiled as if it read:
String foo = "hi Joe 90";
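One way to observe this folding without decompiling is to rely on the fact that compile-time constant strings are interned: a folded concatenation is the same object as the equivalent literal, whereas a string concatenated at runtime is a new object. The class name FoldingDemo below is hypothetical; the == comparisons are used deliberately to test identity and are not how strings should normally be compared.

```java
/** Shows that constant expressions are folded at compile time. */
class FoldingDemo {
    static final int FOO = 9 * 10;                       // folded to 90
    static final String GREETING = "hi Joe " + (9 * 10); // folded to "hi Joe 90"

    /** The argument is not a compile-time constant, so this concatenation happens at runtime. */
    static String concatAtRuntime(int n) {
        return "hi Joe " + n;
    }

    public static void main(String[] args) {
        // Folded constant: the same interned object as the literal.
        System.out.println(GREETING == "hi Joe 90");       // prints true

        // Runtime concatenation: equal content, but a distinct object.
        String runtime = concatAtRuntime(90);
        System.out.println(runtime.equals("hi Joe 90"));   // prints true
        System.out.println(runtime == "hi Joe 90");        // prints false
    }
}
```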
Additional content appearing in this section has been removed.
Compile to Native Machine Code
If you know the target environments of your application, you have the option of taking your Java application and compiling it to a machine-code executable. There is a variety of these compilers already available for various target platforms, and the list continues to grow. (Check the computer magazines or follow the compiler links on good Java web sites. See also the compilers listed in Chapter 15.) These compilers can often work directly from the bytecode (i.e., the .class files) without the source code, so any third-party classes and beans you use can normally be included.
If you follow this option, a standard technique to remain multiplatform is to start the application from a batch file that checks the platform and installs (or even starts) the application binary appropriate for that platform, falling back to the standard Java runtime if no binary is available. Of course, the batch file also needs to be multiplatform, but then you could build it in Java.
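A launcher of this kind, written in Java, might be sketched as follows. Everything specific in it is an assumption made for illustration: the class name Launcher, the binary paths under bin/, and the main class com.example.Main are all hypothetical. The sketch only decides which command to run; starting it (for instance with ProcessBuilder) is left out.

```java
import java.io.File;
import java.util.Locale;

/** Chooses a platform-specific binary if one is shipped, otherwise the standard Java runtime. */
class Launcher {
    /** Maps an os.name value to a hypothetical native binary path, or null if none is shipped. */
    static String nativeBinaryFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return "bin/myapp-windows.exe";
        }
        if (os.startsWith("linux")) {
            return "bin/myapp-linux";
        }
        return null;
    }

    /** Returns the command to start: the native binary if present, else "java mainClass". */
    static String[] commandFor(String osName, String mainClass) {
        String binary = nativeBinaryFor(osName);
        if (binary != null && new File(binary).canExecute()) {
            return new String[] {binary};
        }
        return new String[] {"java", mainClass}; // fall back to the standard runtime
    }

    public static void main(String[] args) {
        String[] command = commandFor(System.getProperty("os.name"), "com.example.Main");
        System.out.println("would start: " + String.join(" ", command));
    }
}
```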
But prepare to be disappointed with the performance of a natively compiled executable compared to the latest JIT-enabled runtime VMs. The compiled executable still needs to handle garbage collection, threads, exceptions, etc., all within the confines of the executable. These runtime features of Java do not necessarily compile efficiently into an executable. The performance of the executable may well depend on how much effort the compiler vendor has made in making those Java features run efficiently in the context of a natively compiled executable. The latest adaptive VMs have been shown to run some applications faster than running the equivalent natively compiled executable.
Advocates of the "compile to native executable" approach feel that the compiler optimizations will improve with time so that this approach will ultimately deliver the fastest applications. Luckily, this is a win-win situation for the performance of Java applications: try out both approaches if appropriate to you, and choose the one that works best.
Additional content appearing in this section has been removed.
Native Method Calls
For that extra zing in your application (but probably not applet), try out calls to native code. Wave goodbye to 100% pure Java certification, and say hello to added complexity to your development environment and deployment procedure. (If you are already in this situation for reasons other than performance tuning, there is little overhead to taking this route in your project.)
A couple of examples I've seen where native method calls were used for performance reasons were intensive number-crunching for a scientific application and parsing large amounts of data in restricted time. In these and other cases, the runtime application environment at the time could not get to the required speed using Java. I should note that the latter parsing problem would now be able to run fast enough in pure Java, but the original application was built with quite an early version of Java. In addition, some number crunchers find that the latest Java runtimes and optimizing compilers give them sufficient performance in Java without resorting to any native calls.
JNI itself has its own overhead, which means that if a pure Java implementation comes close to the native call performance, the JNI overhead will probably cancel any performance advantages from the native call. However, on occasion the underlying system can provide an optimized native call that is not available from Java and cannot be implemented to work as fast in pure Java. In this kind of situation, JNI is useful for tuning.
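Because the benefit of a native call is not guaranteed, one defensive arrangement is to keep a pure Java implementation alongside the native one, so that both can be timed and the application still runs where the library is missing. The sketch below assumes a hypothetical native library named fastchecksum and a hypothetical class Checksum; no such library ships with the JDK, so on an ordinary system the Java path is taken.

```java
/** Uses a native implementation when its library loads, and pure Java otherwise. */
class Checksum {
    private static final boolean NATIVE_AVAILABLE = loadNative();

    private static boolean loadNative() {
        try {
            System.loadLibrary("fastchecksum"); // hypothetical library name
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // library not installed or not permitted
        }
    }

    /** Implemented in the hypothetical native library; only called if it loaded. */
    private static native long nativeSum(byte[] data);

    /** Pure Java equivalent: sums the bytes as unsigned values. */
    static long javaSum(byte[] data) {
        long sum = 0;
        for (byte b : data) {
            sum += b & 0xFF;
        }
        return sum;
    }

    static long sum(byte[] data) {
        return NATIVE_AVAILABLE ? nativeSum(data) : javaSum(data);
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        System.out.println("native available: " + NATIVE_AVAILABLE);
        System.out.println("sum: " + sum(data));
    }
}
```

Timing sum( ) against javaSum( ) on realistic data shows whether the native call is worth its overhead for a particular workload.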
Another case in which JNI can be useful is reducing the numbers of objects created, though this should be less common: you should normally be able to do this directly in Java. I once encountered a situation where JNI was needed to avoid excessive objects. This was with an application that originally required the use of a native DLL service. The vendor of that DLL ported the service to Java, which the application developers would have preferred using, but unfortunately the vendor neglected to tune the ported code. This resulted in the situation where a native call to a particular set of services produced just a couple dozen objects, but the Java-ported code produced nearly 10,000 objects. Apart from this difference, the speeds of the two implementations were similar. However, the overhead in garbage collection caused a significant degradation in performance, which meant that the native call to the DLL was the preferred option.
Additional content appearing in this section has been removed.
Uncompressed ZIP/JAR Files
It is better to deliver your classes in a ZIP or JAR file than to deliver them one class at a time over the network or load them individually from separate files in the filesystem. This packaged delivery provides some of the benefits of clustering (see Section 14.1.2). The benefits gained from packaging class files come from reducing I/O overheads such as repeated file opening and closing, and possibly improving seek times. Within the ZIP or JAR file, the classes should not be compressed unless network download time is a factor for the application. The best way to deliver local classes for performance reasons is in an uncompressed ZIP or JAR file. Coincidentally, that's how they're delivered with the JDK.
It is possible to further improve the classloading times by packing the classes into the ZIP/JAR file in the order in which they are loaded by the application. You can determine the loading order by running the application with the -verbose option, but note that this ordering is fragile: slight changes in the application can easily alter the loading order of classes. A further extension to this idea is to include your own classloader that opens the ZIP/JAR file itself and reads in all files sequentially, loading them into memory immediately. Perhaps the final version of this performance improvement route is to dispense with the ZIP/JAR filesystem: it is quicker to load the files if they are concatenated together in one big file, with a header at the start of the file giving the offsets and names of the contained files. This is similar to the ZIP filesystem, but it is better if you read the header in one block, and read in and load the files directly rather than going through the java.util.zip classes.
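The first two ideas, an uncompressed archive written in a chosen order and a single sequential read of everything in it, can be sketched with the java.util.zip classes. StoredArchive is a hypothetical class, the archive is built in memory for brevity, and the step of defining classes from the bytes (a custom classloader) is omitted. The sketch uses InputStream.readAllBytes( ), which was added in a much later JDK than those discussed here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Writes files to an uncompressed ZIP in a given order and reads them back sequentially. */
class StoredArchive {
    /** Writes entries uncompressed (STORED), in the iteration order of the map. */
    static byte[] write(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.setMethod(ZipOutputStream.STORED);
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                byte[] data = file.getValue();
                CRC32 crc = new CRC32();
                crc.update(data);
                ZipEntry entry = new ZipEntry(file.getKey());
                // STORED entries need their size and CRC set before the data is written.
                entry.setSize(data.length);
                entry.setCompressedSize(data.length);
                entry.setCrc(crc.getValue());
                zip.putNextEntry(entry);
                zip.write(data);
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Reads every entry in archive order, in one pass, into memory. */
    static Map<String, byte[]> readAll(byte[] archive) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                files.put(entry.getName(), zip.readAllBytes());
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> inLoadOrder = new LinkedHashMap<>();
        inLoadOrder.put("app/Main.class", new byte[] {1, 2, 3}); // placeholder bytes
        inLoadOrder.put("app/Helper.class", new byte[] {4, 5});
        byte[] archive = write(inLoadOrder);
        System.out.println("entries read in order: " + readAll(archive).keySet());
    }
}
```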
One further optimization to this classloading tactic is to start the classloader running in a separate (low-priority) thread immediately after VM startup.
Additional content appearing in this section has been removed.
Performance Checklist