Chapter 4. Working with the JIT Compiler

The just-in-time (JIT) compiler is the heart of the Java Virtual Machine. Nothing in the JVM affects performance more than the compiler, and choosing a compiler is one of the first decisions made when running a Java application—whether you are a Java developer or an end user. Fortunately, in most situations the compiler needs little tuning beyond some basics.

This chapter covers the compiler in depth. It starts with some information on how the compiler works and discusses the advantages and disadvantages to using a JIT compiler. Then it moves on to which kinds of compilers are present within which versions of Java: understanding this and choosing the correct compiler for a situation is the most important step you must take to make applications run fast. Finally, it covers some intermediate and advanced tunings of the compiler; these tunings can help get those last few percentage points in the performance of an application.

Just-in-Time Compilers: An Overview

Some introductory material first; feel free to skip ahead if you understand the basics of just-in-time compilation.

Computers—and more specifically CPUs—can execute only a relatively few, specific instructions, which are called assembly or binary code. All programs that the CPU executes must therefore be translated into these instructions.

Languages like C++ and Fortran are called compiled languages because their programs are delivered as binary (compiled) code: the program is written, and then a static compiler produces a binary. The assembly code in that binary is targeted to a particular CPU. Compatible CPUs can execute the same binary: for example, AMD and Intel CPUs share a basic, common set of assembly language instructions, and later versions of CPUs almost always can execute the same set of instructions as previous versions of that CPU. The reverse is not always true; new versions of CPUs often introduce instructions that will not run on older versions of CPUs.

Languages like PHP and Perl, on the other hand, are interpreted. The same program source code can be run on any CPU as long as the machine has the correct interpreter (that is, the program called php or perl). The interpreter translates each line of the program into binary code as that line is executed.

There are advantages and disadvantages to each of these systems. Programs written in interpreted languages are portable: you can take the same code, drop it on any machine with the appropriate interpreter, and it will run. However, it might run slowly. As a simple case, consider what happens in a loop: the interpreter will retranslate each line of code when it is executed in the loop. The compiled code doesn’t need to repeatedly make that translation.

There are a number of factors that a good compiler takes into account when it produces a binary. One simple example of this is the order of the binary statements: not all assembly language instructions take the same amount of time to execute. A statement that adds the values stored in two registers might execute in one cycle, but retrieving (from main memory) the values needed for the addition may take multiple cycles.

Hence, a good compiler will produce a binary that executes the statement to load the data, executes some other instructions, and then—when the data is available—executes the addition. An interpreter that is looking at only one line of code at a time doesn’t have enough information to produce that kind of code; it will request the data from memory, wait for it to become available, and then execute the addition. Bad compilers will do the same thing, by the way, and it is not necessarily the case that even the best compiler can prevent the occasional wait for an instruction to complete.

For these (and other) reasons, interpreted code will almost always be measurably slower than compiled code: compilers have enough information about the program to provide a number of optimizations to the binary code that an interpreter simply cannot perform.

Interpreted code does have the advantage of portability. A binary compiled for a SPARC CPU obviously cannot run on an Intel CPU. But a binary that uses the latest AVX instructions of Intel’s Sandy Bridge processors cannot run on older Intel processors either. Hence, it is common for commercial software to be compiled to a fairly old version of a processor and not take advantage of the newest instructions available to it. There are various tricks around this, including shipping a binary with multiple shared libraries where the shared libraries execute performance-sensitive code and come with versions for various flavors of a CPU.

Java attempts to find a middle ground here. Java applications are compiled—but instead of being compiled into a specific binary for a specific CPU, they are compiled into an idealized assembly language. This assembly language (known as Java bytecodes) is then run by the java binary (in the same way that an interpreted PHP script is run by the php binary). This gives Java the platform independence of an interpreted language. Because it is executing an idealized binary code, the java program is able to compile the code into the platform binary as the code executes. This compilation occurs as the program is executed: it happens “just in time.”

The manner in which the Java Virtual Machine compiles this code as it executes is the focus of this chapter.

Hot Spot Compilation

As discussed in Chapter 1, the Java implementation discussed in this book is Oracle’s HotSpot JVM. This name (HotSpot) comes from the approach it takes toward compiling the code. In a typical program, only a small subset of code is executed frequently, and the performance of an application depends primarily on how fast those sections of code are executed. These critical sections are known as the hot spots of the application; the more the section of code is executed, the hotter that section is said to be.

Hence, when the JVM executes code, it does not begin compiling the code immediately. There are two basic reasons for this. First, if the code is going to be executed only once, then compiling it is essentially a wasted effort; it will be faster to interpret the Java bytecodes than to compile them and execute (only once) the compiled code.

But if the code in question is a frequently called method, or a loop that runs many iterations, then compiling it is worthwhile: the cycles it takes to compile the code will be outweighed by the savings in multiple executions of the faster compiled code. That trade-off is one reason that the compiler executes the interpreted code first—the compiler can figure out which methods are called frequently enough to warrant their compilation.

The second reason is one of optimization: the more times that the JVM executes a particular method or loop, the more information it has about that code. This allows the JVM to make a number of optimizations when it compiles the code.

A number of those optimizations (and ways to affect them) are discussed later in this chapter, but for a simple example, consider the case of the equals() method. This method exists in every Java object (since it is inherited from the Object class) and is often overridden. When the interpreter encounters the statement b = obj1.equals(obj2), it must look up the type (class) of obj1 in order to know which equals() method to execute. This dynamic lookup can be somewhat time-consuming.

Over time, say that the JVM notices that each time this statement is executed, obj1 is of type java.lang.String. Then the JVM can produce compiled code that directly calls the String.equals() method. Now the code is faster not only because it is compiled, but also because it can skip the lookup of which method to call.

It’s not quite as simple as that; it is quite possible the next time the code is executed that obj1 refers to something other than a String, so the JVM has to produce compiled code that deals with that possibility. Nonetheless, the overall compiled code here will be faster (at least as long as obj1 continues to refer to a String) because it skips the lookup of which method to execute. That kind of optimization can only be made after running the code for a while and observing what it does: this is the second reason why JIT compilers wait to compile sections of code.
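
As a concrete illustration, here is a minimal sketch of the kind of call site the JIT profiles (the class name EqualsExample and the literal values are hypothetical, not taken from the stock sample code used elsewhere in this chapter):

public class EqualsExample {
    public static void main(String[] args) {
        Object obj1 = "some string";     // at runtime, obj1 is always a String
        Object obj2 = "another string";
        boolean b = false;
        // After enough executions, the JIT observes that obj1 is always a
        // String, so it can emit a direct (type-guarded) call to
        // String.equals() rather than performing a full virtual method lookup.
        for (int i = 0; i < 100_000; i++) {
            b = obj1.equals(obj2);
        }
        System.out.println(b);
    }
}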

Quick Summary

  1. Java is designed to take advantage of the platform independence of scripting languages and the native performance of compiled languages.
  2. A Java class file is compiled into an intermediate language (Java bytecodes) that is then further compiled into assembly language by the JVM.
  3. Compilation of the bytecodes into assembly language performs a number of optimizations that greatly improve performance.

Basic Tunings: Client or Server (or Both)

The JIT compiler comes in two flavors, and the choice of which to use is often the only compiler tuning that needs to be done when running an application. In fact, choosing your compiler is something that must be considered even before Java is installed, since different Java binaries contain different compilers. That will get sorted out in just a bit; first, let’s figure out which one should be used in which circumstances.

The two compilers are known as client and server. These names come from the command-line argument used to select the compiler (e.g., either -client or -server). JVM developers (and even some tools) often refer to the compilers by the names C1 (compiler 1, client compiler) and C2 (compiler 2, server compiler). The names imply that the choice between them should be influenced by the hardware on which the program is running, but that’s not really true, especially today, some 15 years after the terms were first coined, when a “client” laptop has four to eight CPUs and 8 GB of memory (more processing power than a midrange server had when Java was first developed).

The primary difference between the two compilers is their aggressiveness in compiling code. The client compiler begins compiling sooner than the server compiler does. This means that during the beginning of code execution, the client compiler will be faster, because it will have compiled correspondingly more code than the server compiler.

The engineering trade-off here is the knowledge the server compiler gains while it waits: that knowledge allows the server compiler to make better optimizations in the compiled code. Ultimately, code produced by the server compiler will be faster than that produced by the client compiler. From a user’s perspective, the benefit to that trade-off is based on how long the program will run, and how important the startup time of the program is.

The obvious question here is why there needs to be a choice at all: couldn’t the JVM start with the client compiler, and then use the server compiler as code gets hotter? That technique is known as tiered compilation. With tiered compilation, code is first compiled by the client compiler; as it becomes hot, it is recompiled by the server compiler.

Experimental versions of tiered compilation are available in early releases of Java 7. It turns out that there are a number of technical difficulties here (notably in the different architectures of the two compilers), and as a result, tiered compilation didn’t perform well in those experimental versions. Starting in Java 7u4, those difficulties have largely been solved, and tiered compilation usually offers the best performance for an application.

In Java 7, tiered compilation has a few quirks, and so it is not the default setting. In particular, it is easy to exceed the JVM code cache size, which can prevent code from getting optimally compiled (though it is easy enough to address that, as is discussed in Intermediate Tunings for the Compiler). To use tiered compilation, specify the server compiler (either with -server or by ensuring it is the default for the particular Java installation being used), and ensure that the Java command line includes the flag -XX:+TieredCompilation (the default value of which is false). In Java 8, tiered compilation is enabled by default.
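
For example, assuming a 64-bit server JVM and a hypothetical application packaged as myapp.jar, a Java 7 command line enabling tiered compilation would look something like this:

java -server -XX:+TieredCompilation -jar myapp.jar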

To understand the trade-offs here, let’s look at a few examples.

Optimizing Startup

The client compiler is most often used when fast startup is the primary objective. The difference this makes on various applications is shown in Table 4-1.

Table 4-1. Startup time of various applications (in seconds)
Application    -client    -server    -XX:+TieredCompilation
HelloWorld     0.08       0.08       0.08
NetBeans       2.83       3.92       3.07
BigApp         51.5       54.0       52.0

In a simple HelloWorld application, neither compiler has an advantage because not enough code is run for either compiler to make any contribution. And for a task that lasts only 80 ms, we’d be hard-pressed to notice a difference if it did exist.

NetBeans is a fairly typical, moderately sized Java GUI application. On startup, it loads about 10,000 classes, performs initialization of several graphical objects, and so on. Here, the client compiler offers a significant advantage on startup: the server compiler starts 38.5% slower, and the 1-second difference will certainly be noticeable. Note that the tiered compiler isn’t quite as fast, though it is only about 8% slower, a fairly trivial difference.

This is the reason NetBeans—and many GUI programs like it, including the Java plug-in used by web browsers—uses the client compiler by default. Performance is often all about perception: if the initial startup seems faster, and everything else seems fine, users will tend to view the program that has started faster as being faster overall.

Finally, there is BigApp: a very large server program that loads more than 20,000 classes and performs extensive initialization. Because it is an application server, it will certainly need to use the server compiler. Even though a lot of processing is going on here, there is still a slightly noticeable benefit to the client compiler. What’s interesting about this example is one thing mentioned in Chapter 1: it’s not always the JVM that is the problem. In this case, so many JAR files must be read from disk that disk I/O is the gating factor for startup performance (otherwise, the startup difference would have been even more in favor of the client compiler).

Quick Summary

  1. The client compiler is most useful when the startup of an application is the overriding performance concern.
  2. Tiered compilation can achieve startup times very close to those obtained from the client compiler.

Optimizing Batch Operations

For batch applications—those that run a fixed amount of work—the choice of compiler boils down to which gets the best optimization in the amount of time the application runs. Table 4-2 shows an example of that.

Table 4-2. Time to execute batch applications
Number of stocks    -client          -server          -XX:+TieredCompilation
1                   0.142 seconds    0.176 seconds    0.165 seconds
10                  0.211 seconds    0.348 seconds    0.226 seconds
100                 0.454 seconds    0.674 seconds    0.472 seconds
1,000               2.556 seconds    2.158 seconds    1.910 seconds
10,000              23.78 seconds    14.03 seconds    13.56 seconds

Using the sample stock code discussed in Chapter 2, the application here requests 1 year’s history (plus the average and standard deviation of that history) for between 1 and 10,000 stocks.

For 1 to 100 stocks, the faster startup with the client compiler completes the job sooner, and if the goal is to process only 100 stocks, the client compiler is the best choice. After that, the performance advantage swings in favor of the server compiler (and particularly the server compiler with tiered compilation). Even for a limited number of calculations, tiered compilation is pretty close to the client compiler, making it a good candidate for all cases.

It is also interesting that tiered compilation is always slightly better than the standard server compiler. In theory, once the program has run enough to compile all the hot spots, the server compiler might be expected to achieve the best (or at least equal) performance. But in any application, there will almost always be some small section of code that is infrequently executed. It is better to compile that code—even if the compilation is not the best that might be achieved—than to execute that code in interpreted mode. And as is discussed later in this chapter (see Compilation Thresholds), the server compiler will likely never actually compile all the code in an application, even if it runs forever.

Quick Summary

  1. For jobs that run in a fixed amount of time, choose the compiler based on which one is the fastest at executing the actual job.
  2. Tiered compilation provides a reasonable default choice for batch jobs.

Optimizing Long-Running Applications

Finally, there is the difference that can be expected in the eventual performance of a long-running application when different compilers are used. Performance of long-running applications is typically measured by examining the throughput that an application delivers after it has been “warmed up”—meaning after it has run long enough that the important parts of the code have been compiled.

This example uses the basic stock calculator and puts it in a servlet; each call to the servlet will retrieve information for a random stock symbol for a period of 25 years. Using the fhb program discussed in Chapter 2, Table 4-3 shows how many operations per second the server produced after warm-up periods of 0, 60, and 300 seconds.

Table 4-3. Throughput of server applications
Warm-up period    -client    -server    -XX:+TieredCompilation
0 seconds         15.87      23.72      24.23
60 seconds        16.00      23.73      24.26
300 seconds       16.85      24.42      24.43

The measurement period here is 60 seconds, so even in the case where there is no warm-up, the compilers had an opportunity to get enough information to compile the hot spots; hence the server compilers are always better in this example. (Also, a lot of code was compiled during the startup of the application server.) As before, tiered compilation can compile just a little bit more code and squeeze out just a little more performance than the server compiler alone.

Quick Summary

For long-running applications, always choose the server compiler, preferably in conjunction with tiered compilation.

Java and JIT Compiler Versions

Now that differences between the compilers have been examined, let’s look at how to get the desired compiler. When you download Java, you must choose a version; the choice ultimately revolves around the platform you are using. However, the choice also impacts the JIT compiler(s) available to applications. The discussion so far has been about client and server compilers, but there are three versions of the JIT compiler:

  • A 32-bit client version (-client)
  • A 32-bit server version (-server)
  • A 64-bit server version (-d64)

To a certain extent, you choose the compiler you want to use by supplying the given argument (-server, etc.). However, things are not quite so simple.

When downloading Java for a given operating system, there are only two options: a 32-bit or a 64-bit binary. So clearly, the 32-bit binary can be expected to have (up to) two compilers, while the 64-bit binary will have only a single compiler. (In fact, the 64-bit binary will have two compilers, since the client compiler is needed to support tiered compilation. But a 64-bit JVM cannot be run with only the client compiler.)

Once installed, though, things become a little more complicated. On most platforms, the 32-bit and 64-bit binaries install separately. You can have both binaries installed on your computer, but you must refer to them via separate paths. Hence, on the machine I use for Linux testing, I have binaries installed in /export/VMs/jdk1.7.0-32bit and /export/VMs/jdk1.7.0-64bit, and I choose between them by setting my PATH accordingly.

On Solaris, things are different: the 64-bit installation overlays the 32-bit installation. Hence all three compilers are available from the same path. This makes it much easier for the end user; among other things, it means that if Java is installed system-wide in /usr/bin, a user can always specify via the command line which of the three possible compilers she wants. That kind of installation remains the exception. Things can be further complicated because developers of HotSpot often use Solaris as their primary development system, and hence discussions (and sometimes documentation) get confused about which installation paradigm is in use.

One last complication: for the sake of compatibility, the argument specifying which compiler to use is not rigorously followed. If you have a 64-bit JVM and specify -client, the application will use the 64-bit server compiler anyway. If you have a 32-bit JVM and you specify -d64, you will get an error that the given instance does not support a 64-bit JVM.

To summarize: the selection of the compiler is controlled by which JVM bits are installed and by the compiler argument passed to the JVM. Table 4-4 shows the result when the given argument is specified for the given installation.

Table 4-4. Result of compiler argument for OS combinations
Install bits      -client                  -server                  -d64
Linux 32-bit      32-bit client compiler   32-bit server compiler   Error
Linux 64-bit      64-bit server compiler   64-bit server compiler   64-bit server compiler
Mac OS X          64-bit server compiler   64-bit server compiler   64-bit server compiler
Solaris 32-bit    32-bit client compiler   32-bit server compiler   Error
Solaris 64-bit    32-bit client compiler   32-bit server compiler   64-bit server compiler
Windows 32-bit    32-bit client compiler   32-bit server compiler   Error
Windows 64-bit    64-bit server compiler   64-bit server compiler   64-bit server compiler

In Java 8, when the server compiler is the default in any of these cases, tiered compilation is also enabled by default.

What if no compiler argument is given at all? Then the JVM uses the default compiler for the machine on which the code is running: the default compiler is a runtime choice. This choice is made based on whether the JVM considers the machine to be a “client” machine or a “server” machine. That decision is based on a combination of the operating system and number of CPUs on the machine; Table 4-5 lists the various defaults.

Table 4-5. Default compiler based on OS and machine
OS                                              Default compiler
Windows, 32-bit, any number of CPUs             -client
Windows, 64-bit, any number of CPUs             -server
MacOS, any number of CPUs                       -server
Linux/Solaris, 32-bit, 1 CPU                    -client
Linux/Solaris, 32-bit, 2 or more CPUs           -server
Linux, 64-bit, any number of CPUs               -server
Solaris, 32-bit/64-bit overlay, 1 CPU           -client
Solaris, 32-bit/64-bit overlay, 2 or more CPUs  -server (32-bit mode)

These defaults are based on the notion that startup time is always the most important thing for 32-bit Windows machines, and Unix-based machines are generally more interested in long-running performance. As always, there are exceptions: certainly modern Windows-based machines can run powerful servers even in 32-bit mode, and in those cases the server compiler should be used. Similarly, many application servers use simple Java-based administrative commands to inspect or change their configuration; even on Unix-based machines, these are better run with the client compiler.

Quick Summary

  1. Different Java binaries support different compilers.
  2. The compilers supported by different binaries are inconsistent among operating systems and binary architectures.
  3. A program doesn’t necessarily use the compiler specified on the command line; which compiler actually runs depends on the platform’s support for that compiler.

Intermediate Tunings for the Compiler

For the most part, tuning the compiler is really just a matter of selecting the proper JVM and compiler switch (-client, -server or -XX:+TieredCompilation) for the installation on the target machine. Tiered compilation is usually the best choice for long-running applications and is within a few milliseconds of the performance of the client compiler on short-lived applications.

There are a few cases in which additional tunings are required; those cases are explored in this section.

Tuning the Code Cache

When the JVM compiles code, it holds the set of assembly-language instructions in the code cache. The code cache has a fixed size, and once it has filled up, the JVM is not able to compile any additional code.

It is easy to see the potential issue here if the code cache is too small. Some hot spots will get compiled, but others will not: the application will end up running a lot of (very slow) interpreted code.

This is more frequently an issue when using either the client compiler or tiered compilation. When the regular server compiler is used, it is somewhat unlikely that the number of classes eligible for compilation will fill the code cache; typically only a handful of classes will be compiled. But the number of classes eligible for compilation when using the client compiler (and hence also eligible for compilation when tiered compilation is enabled) is potentially much higher.

When the code cache fills up, the JVM will (usually) spit out a warning to that effect:

Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full.
         Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the
         code cache size using -XX:ReservedCodeCacheSize=

It is sometimes easy to miss this message, and some versions of Java 7 do not print it correctly when tiered compilation is enabled. Another way to determine if the compiler has ceased to compile code is to follow the output of the compilation log discussed later in this section.

Table 4-6 lists the default value of the code cache for various platforms.

Table 4-6. Default code cache based on platform
JVM type                                        Default code cache size
32-bit client, Java 8                           32 MB
32-bit server with tiered compilation, Java 8   240 MB
64-bit server with tiered compilation, Java 8   240 MB
32-bit client, Java 7                           32 MB
32-bit server, Java 7                           32 MB
64-bit server, Java 7                           48 MB
64-bit server with tiered compilation, Java 7   96 MB

In Java 7, the default size for tiered compilation is often insufficient, and it is often necessary to increase the code cache size. Large programs that use the client compiler may also need to increase the code cache size.

There really isn’t a good mechanism to figure out how much code cache a particular application needs. Hence, when you need to increase the code cache size, it is sort of a hit-and-miss operation; a typical option is to simply double or quadruple the default.

The maximum size of the code cache is set via the -XX:ReservedCodeCacheSize=N flag (where N is the default just mentioned for the particular compiler). The code cache is managed like most memory in the JVM: there is an initial size (specified by -XX:InitialCodeCacheSize=N). Allocation of the code cache starts at the initial size and increases as the cache fills up. The initial size of the code cache varies based on the chip architecture and compiler in use (on Intel machines, the client compiler starts with a 160 KB cache and the server compiler starts with a 2,496 KB cache). Resizing the cache happens in the background and doesn’t really affect performance, so setting the ReservedCodeCacheSize flag (i.e., setting the maximum code cache size) is all that is generally needed.
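
As a sketch, doubling the Java 7 tiered-compilation default of 96 MB for a hypothetical application (the jar name is a placeholder) would look like this:

java -server -XX:+TieredCompilation -XX:ReservedCodeCacheSize=192m -jar myapp.jar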

Is there a disadvantage to specifying a really large value for the maximum code cache size so that it never runs out of space? It depends on the resources available on the target machine. If a 1 GB code cache size is specified, then the JVM will reserve 1 GB of native memory space. That memory isn’t allocated until needed, but it is still reserved, which means that there must be sufficient virtual memory available on your machine to satisfy the reservation.

In addition, if the JVM is 32-bit, then the total size of the process cannot exceed 4 GB. That includes the Java heap, space for all the code of the JVM itself (including its native libraries and thread stacks), any native memory the application allocates (either directly or via the NIO libraries), and of course the code cache.

Those are the reasons the code cache is not unbounded and sometimes requires tuning for large applications (or even medium-sized applications when tiered compilation is used). Particularly on 64-bit machines, though, setting the value too high is unlikely to have a practical effect on the application: the application won’t run out of process space memory, and the extra memory reservation will generally be accepted by the operating system.

The size of the code cache can be monitored using jconsole by selecting the Memory Pool Code Cache chart on the Memory panel.
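
The same information is available programmatically via the standard java.lang.management API; the following minimal sketch (the class name CodeCacheMonitor is hypothetical) prints the usage of the memory pool, which is named "Code Cache" in Java 7 and 8:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class CodeCacheMonitor {
    public static void main(String[] args) {
        // Iterate over all memory pools and report the code cache usage.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Code Cache")) {
                System.out.printf("%s: used=%,d bytes, max=%,d bytes%n",
                        pool.getName(),
                        pool.getUsage().getUsed(),
                        pool.getUsage().getMax());
            }
        }
    }
}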

Quick Summary

  1. The code cache is a resource with a defined maximum size that affects the total amount of compiled code the JVM can run.
  2. Tiered compilation can easily use up the entire code cache in its default configuration (particularly in Java 7); monitor the code cache and increase its size if necessary when using tiered compilation.

Compilation Thresholds

This chapter has been somewhat vague in defining just what triggers the compilation of code. The major factor involved here is how often the code is executed; once it is executed a certain number of times, its compilation threshold is reached, and the compiler deems that it has enough information to compile the code.

There are tunings that affect these thresholds, which are discussed in this section. However, this section is really designed to give you better insight into how the compiler works (and introduce some terms). There is really only one case where the compilation thresholds might need to be tuned; that is discussed at the end of this section.

Compilation is based on two counters in the JVM: the number of times the method has been called, and the number of times any loops in the method have branched back. Branching back can effectively be thought of as the number of times a loop has completed execution, either because it reached the end of the loop itself or because it executed a branching statement like continue.

When the JVM executes a Java method, it checks the sum of those two counters and decides whether or not the method is eligible for compilation. If it is, the method is queued for compilation (see Compilation Threads for more details about queuing). This kind of compilation has no official name but is often called standard compilation.

But what if the method has a really long loop—or one that never exits and provides all the logic of the program? In that case, the JVM needs to compile the loop without waiting for a method invocation. So every time the loop completes an execution, the branching counter is incremented and inspected. If the branching counter has exceeded its individual threshold, then the loop (and not the entire method) becomes eligible for compilation.

This kind of compilation is called on-stack replacement (OSR), because even if the loop is compiled, that isn’t sufficient: the JVM has to have the ability to start executing the compiled version of the loop while the loop is still running. When the code for the loop has finished compiling, the JVM replaces the code (on-stack), and the next iteration of the loop will execute the much-faster compiled version of the code.
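
A minimal sketch of code that triggers OSR compilation (the class name is hypothetical): main() is invoked only once, so it never crosses the method-invocation threshold, but its loop branches back often enough to be compiled and swapped in while it is still running.

public class OsrExample {
    public static void main(String[] args) {
        long sum = 0;
        // main() runs once, but the loop's back-branch counter quickly
        // exceeds its threshold, so the loop is compiled via on-stack
        // replacement while the method is still executing.
        for (int i = 0; i < 10_000_000; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}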

Standard compilation is triggered by the value of the -XX:CompileThreshold=N flag. The default value of N for the client compiler is 1,500; for the server compiler it is 10,000. Changing the value of the CompileThreshold flag will cause the compiler to choose to compile the code sooner (or later) than it normally would have. Note, however, that although there is one flag here, the threshold is the sum of the back-edge loop counter and the method entry counter.
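
As a sketch, lowering the server compiler’s threshold to 8,000 for a hypothetical application (the jar name is a placeholder) would look like this:

java -server -XX:CompileThreshold=8000 -jar myapp.jar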

Changing the CompileThreshold flag has been a popular recommendation in performance circles for quite some time; in fact, you may have seen that Java benchmarks often use this flag (e.g., frequently setting it to 8,000 for the server compiler).

We’ve seen that there is a big difference between the ultimate performance of the client and server compilers, due largely to the information available to the compiler when it compiles a particular method. Lowering the compile threshold, particularly for the server compiler, runs the risk that the code may be compiled a little less optimally than possible—but testing on an application may show that there is in fact little difference between compiling after, say, 8,000 invocations instead of 10,000.

You can bet that vendors who submit benchmark results with that tuning have verified there is no performance difference between the two settings for that benchmark. They use the lower setting for two reasons:

  • It saves a little time in how long the application needs to warm up.
  • It can compile certain server methods that would otherwise never compile.

The first point here should be well understood, but why would the server compiler never compile an important method? It isn’t just that the compilation threshold hasn’t been reached yet: it’s that the compilation threshold will never be reached. This is because the counter values increase as methods and loops are executed, but they also decrease over time.

Periodically (specifically, when the JVM reaches a safepoint), the value of each counter is reduced. Practically speaking, this means that the counters are a relative measure of the recent hotness of the method or loop. One side effect of this is that somewhat-frequently executed code may never be compiled, even for programs that run forever (these methods are sometimes called lukewarm [as opposed to hot]). This is one case where reducing the compilation threshold can be beneficial, and it is another reason why tiered compilation is usually slightly faster than the server compiler alone. The next section will show how to determine if a particular method is not compiled; if methods in the critical path of the profiles for your application show they are not compiled, compilation can sometimes be achieved by reducing the compiler thresholds.

Quick Summary

  1. Compilation occurs when the number of times a method or loop has been executed reaches a certain threshold.
  2. Changing the threshold values can cause the code to be compiled sooner than it otherwise would.
  3. “Lukewarm” methods will never reach the compilation threshold (particularly for the server compiler) since the counters decay over time.

Inspecting the Compilation Process

The last of the intermediate tunings aren’t tunings per se: that is, they will not improve the performance of an application. Rather, they are the JVM flags (and other tools) that give visibility into the working of the compiler. The most important of these is -XX:+PrintCompilation (which by default is false).

If PrintCompilation is enabled, every time a method (or loop) is compiled, the JVM prints out a line with information about what has just been compiled. The output has varied somewhat between Java releases; the output discussed here became standardized in Java 7.
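
Enabling the flag is simply a matter of adding it to the command line; as a sketch (the jar name is a placeholder), the log can be captured by redirecting standard output:

java -XX:+PrintCompilation -jar myapp.jar > compile.log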

Most lines of the compilation log have the following format:

timestamp compilation_id attributes (tiered_level) method_name size deopt

The timestamp here is the time after the compilation has finished (relative to 0, which is when the JVM started).

The compilation_id is an internal task ID. Usually this number will simply increase monotonically, but sometimes with the server compiler (or anytime the number of compilation threads has been increased), you may see an out-of-order compilation ID. This indicates that compilation threads are running faster or slower relative to each other, but don’t conclude that one particular compilation task was somehow inordinately slow: it is usually just a function of thread scheduling (though OSR compilation is slow and often appears out of order).

The attributes field is a series of five characters that indicates the state of the code being compiled. If a particular attribute applies to the given compilation, the character shown in the following list is printed; otherwise, a space is printed for that attribute. Hence, the five-character attribute string may appear as two or more items separated by spaces. The various attributes are:

  • %: The compilation is OSR.
  • s: The method is synchronized.
  • !: The method has an exception handler.
  • b: Compilation occurred in blocking mode.
  • n: Compilation occurred for a wrapper to a native method.

The first three of these should be self-explanatory. The blocking flag will never be printed by default in current versions of Java; it indicates that compilation did not occur in the background (see Compilation Threads for more details about that). Finally, the native attribute indicates that the JVM generated some compiled code to facilitate the call into a native method.

If the program is not running with tiered compilation, the next field (tiered_level) will be blank. Otherwise, it will be a number indicating which tier has completed compilation (see Tiered Compilation Levels).

Next comes the name of the method being compiled (or the method containing the loop being compiled for OSR), which is printed as ClassName::method.

Next is the size (in bytes) of the code being compiled. This is the size of the Java bytecodes, not the size of the compiled code (so, unfortunately, this can’t be used to predict how large to size the code cache).

Finally, in some cases there will be a message at the end of the compilation line that indicates that some sort of deoptimization has occurred; these are typically the phrases “made not entrant” or “made zombie.” See Deoptimization for more details.

The compilation log may also include a line that looks like this:

timestamp compile_id COMPILE SKIPPED: reason

This line (with the literal text COMPILE SKIPPED) indicates that something has gone wrong with the compilation of the given method. There are two cases where this is expected, depending on the reason specified:

Code cache filled
The size of the code cache needs to be increased using the ReservedCodeCacheSize flag.
Concurrent classloading
The class was modified as it was being compiled. The JVM will compile it again later; you should expect to see the method recompiled later in the log.

In all cases (except the cache being filled), the compilation should be reattempted. If it is not, then there is an error that prevents compilation of the code. This is often a bug in the compiler, but the usual remedy in all cases is to refactor the code into something simpler that the compiler can handle.

Here are a few lines of output from enabling PrintCompilation on the stock servlet web application:

  28015  850             net.sdo.StockPrice::getClosingPrice (5 bytes)
  28179  905  s          net.sdo.StockPriceHistoryImpl::process (248 bytes)
  28226   25 %           net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
  28244  935             net.sdo.MockStockPriceEntityManagerFactory$\
                             MockStockPriceEntityManager::find (507 bytes)
  29929  939             net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
 106805 1568   !         net.sdo.StockServlet::processRequest (197 bytes)

This output includes only a few of the stock-related methods that have been compiled. A few interesting things to note: the first such method wasn’t compiled until 28 seconds after the application server was started, and 849 methods were compiled before it. In this case, all those other methods were methods of the application server (filtered out of this output). The application server took about 2 seconds to start; the remaining 26 seconds before anything else was compiled were essentially idle as the application server waited for requests.

The remaining lines are included to point out some interesting features. The process() method, as seen here and in the code listing, is synchronized. Inner classes are compiled just like any other class and appear in the output with the usual Java nomenclature: outer-classname$inner-classname. The processRequest() method shows up with the exception handler as expected.

Finally, recall the implementation of the StockPriceHistoryImpl constructor, which contains a large loop:

public StockPriceHistoryImpl(String s, Date startDate, Date endDate) {
    EntityManager em = emf.createEntityManager();
    Date curDate = new Date(startDate.getTime());
    symbol = s;
    while (!curDate.after(endDate)) {
        StockPrice sp = em.find(StockPrice.class, new StockPricePK(s, curDate));
        if (sp != null) {
            if (firstDate == null) {
                firstDate = (Date) curDate.clone();
            }
            prices.put((Date) curDate.clone(), sp);
            lastDate = (Date) curDate.clone();
        }
        curDate.setTime(curDate.getTime() + msPerDay);
    }
}

The loop is executed more often than the constructor itself, so the loop is subject to OSR compilation. Note that it took a while for that method to be compiled; its compilation ID is 25, but it doesn’t appear until other methods in the 900 range are being compiled. (It’s easy to read OSR lines like this example as 25% and wonder about the other 75%, but remember that the number is the compilation ID, and the % just signifies OSR compilation.) That is typical of OSR compilation; the stack replacement is harder to set up, but other compilation can continue in the meantime.

Quick Summary

  1. The best way to gain visibility into how code is being compiled is by enabling PrintCompilation.
  2. Output from enabling PrintCompilation can be used to make sure that compilation is proceeding as expected.

Advanced Compiler Tunings

This section fills in some remaining details on how compilation works, and in the process explores some additional tunings that can affect it. However, although these values can be changed, there is really little reason to do so; the tunings exist to a large degree to help JVM engineers diagnose the behavior of the JVM. If you’re quite curious as to how the compiler works, then this section will be interesting to you; otherwise, feel free to read ahead.

Compilation Threads

Compilation Thresholds mentioned that when a method (or loop) becomes eligible for compilation, it is queued for compilation. That queue is processed by one or more background threads. This means that compilation is an asynchronous process, which is a good thing; it allows the program to continue executing even while the code in question is being compiled. If a method is compiled using standard compilation, then the next method invocation will execute the compiled method; if a loop is compiled using OSR, then the next iteration of the loop will execute the compiled code.

These queues are not strictly first in, first out: methods whose invocation counters are higher have priority. So even when a program starts execution and has lots of code to compile, this priority ordering helps to ensure that the most important code will be compiled first. (This is another reason why the compilation ID in the PrintCompilation output can appear out of order.)

When the client compiler is in use, the JVM starts one compilation thread; the server compiler has two such threads. When tiered compilation is in effect, the JVM will by default start multiple client and server threads based on a somewhat complex equation involving double logs of the number of CPUs on the target platform. That works out to the values shown in Table 4-7.

Table 4-7. Default number of C1 and C2 compiler threads for tiered compilation
Number of CPUs    Number of C1 threads    Number of C2 threads
1                 1                       1
2                 1                       1
4                 1                       2
8                 1                       2
16                2                       6
32                3                       7
64                4                       8
128               4                       10

The number of compiler threads (for all three compiler options) can be adjusted by setting the -XX:CICompilerCount=N flag (with a default value given in the previous table). That is the total number of threads the JVM will use to process the queue(s); for tiered compilation, one-third of them (but at least one) will be used to process the client compiler queue, and the remaining threads (and also at least one) will be used to process the server compiler queue.
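
As a sketch, limiting a tiered-compilation JVM to three compiler threads (which works out to one C1 thread and two C2 threads) for a hypothetical application would look like this:

java -server -XX:+TieredCompilation -XX:CICompilerCount=3 -jar myapp.jar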

When might you consider adjusting this value? If a program is run on a single-CPU system, then having only one compiler thread might be slightly beneficial: there is limited CPU available, and having fewer threads contending for that resource will help performance in many circumstances. However, that advantage is limited only to the initial warm-up period; after that, the number of eligible methods to be compiled won’t really cause contention for the CPU. When the stock batching application was run on a single-CPU machine and the number of compiler threads was limited to one, the initial calculations were about 10% faster (since they didn’t have to compete for CPU as often). The more iterations that were run, the smaller the overall effect of that initial benefit, until all hot methods were compiled and the benefit was eliminated.

When tiered compilation is used, the number of threads can easily overwhelm the system, particularly if multiple JVMs are run at once (each of which will start many compilation threads). Reducing the number of threads in that case can help overall throughput (though again with the possible cost that the warm-up period will last longer).

Similarly, if lots of extra CPU cycles are available, then theoretically the program will benefit—at least during its warm-up period—when the number of compiler threads is increased. In real life, that benefit is extremely hard to come by. Further, if all that excess CPU is available, you’re much better off trying something that takes advantage of the available CPU cycles during the entire execution of the application (rather than just compiling faster at the beginning).

One other setting that applies to the compilation threads is the value of the -XX:+BackgroundCompilation flag, which by default is true. That setting means that the queue is processed asynchronously as just described. But that flag can be set to false, in which case when a method is eligible for compilation, code that wants to execute it will wait until it is in fact compiled (rather than continuing to execute in the interpreter). Background compilation is also disabled when -Xbatch is specified.

Quick Summary

  1. Compilation occurs asynchronously for methods that are placed on the compilation queue.
  2. The queue is not strictly ordered; hot methods are compiled before other methods in the queue. This is another reason why compilation IDs can appear out of order in the compilation log.

Inlining

One of the most important optimizations the compiler makes is to inline methods. Code that follows good object-oriented design often contains a number of attributes that are accessed via getters (and perhaps setters):

public class Point {
    private int x, y;

    public int getX() { return x; }
    public void setX(int i) { x = i; }
}

The overhead for invoking a method call like this is quite high, especially relative to the amount of code in the method. In fact, in the early days of Java, performance tips often argued against this sort of encapsulation precisely because of the performance impact of all those method calls. Fortunately, JVMs now routinely perform code inlining for these kinds of methods. Hence, you can write this code:

Point p = getPoint();
p.setX(p.getX() * 2);

and the compiled code will essentially execute this:

Point p = getPoint();
p.x = p.x * 2;

Inlining is enabled by default. It can be disabled using the -XX:-Inline flag, though it is such an important performance boost that you would never actually do that (for example, disabling inlining reduces the performance of the stock batching test by over 50%). Still, because inlining is so important, and perhaps because there are many other knobs to turn, recommendations are often made regarding tuning the inlining behavior of the JVM.

Unfortunately, there is no basic visibility into how the JVM inlines code. (If you compile the JVM from source, you can produce a debug version that includes the flag -XX:+PrintInlining. That flag provides all sorts of information about the inlining decisions that the compiler makes.) The best that can be done is to look at profiles of the code, and if there are simple methods near the top of the profiles that seem like they should be inlined, try some experiments with inlining flags.

The basic decision about whether to inline a method depends on how hot it is and its size. The JVM determines if a method is hot (i.e., called frequently) based on an internal calculation; it is not directly subject to any tunable parameters. If a method is eligible for inlining because it is called frequently, then it will be inlined only if its bytecode size is less than 325 bytes (or whatever is specified as the -XX:MaxFreqInlineSize=N flag). Otherwise, it is eligible for inlining only if it is small: less than 35 bytes (or whatever is specified as the -XX:MaxInlineSize=N flag).

Sometimes you will see recommendations that the value of the MaxInlineSize flag be increased so that more methods are inlined. One often-overlooked aspect of this relationship is that setting the MaxInlineSize value higher than 35 means that a method might be inlined when it is first called. However, if the method is called frequently—in which case its performance matters much more—then it would have been inlined eventually (assuming its size is less than 325 bytes). Otherwise, the net effect of tuning the MaxInlineSize flag is that it might reduce the warm-up time needed for a test, but it is unlikely that it will have a big impact on a long-running application.
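
If profiles do suggest experimenting, a hedged sketch of such a test might look like this (the jar name is a placeholder, and the values are arbitrary starting points rather than recommendations):

java -server -XX:MaxInlineSize=60 -XX:MaxFreqInlineSize=400 -jar myapp.jar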

Quick Summary

  1. Inlining is the most beneficial optimization the compiler can make, particularly for object-oriented code where attributes are well encapsulated.
  2. Tuning the inlining flags is rarely needed, and recommendations to do so often fail to account for the relationship between normal inlining and frequent inlining. Make sure to account for both cases when investigating the effects of inlining.

Escape Analysis

The server compiler performs some very aggressive optimizations if escape analysis is enabled (-XX:+DoEscapeAnalysis, which is true by default). For example, consider this class to work with factorials:

public class Factorial {
    private BigInteger factorial;
    private int n;
    public Factorial(int n) {
        this.n = n;
    }
    public synchronized BigInteger getFactorial() {
        if (factorial == null)
            factorial = ...;
        return factorial;
    }
}

To store the first 100 factorial values in an array, this code would be used:

ArrayList<BigInteger> list = new ArrayList<BigInteger>();
for (int i = 0; i < 100; i++) {
    Factorial factorial = new Factorial(i);
    list.add(factorial.getFactorial());
}

The factorial object is referenced only inside that loop; no other code can ever access that object. Hence, the JVM is free to perform a number of optimizations on that object:

  • It needn’t get a synchronization lock when calling the getFactorial() method.
  • It needn’t store the field n in memory; it can keep that value in a register. Similarly it can store the factorial object reference in a register.
  • In fact, it needn’t allocate an actual factorial object at all; it can just keep track of the individual fields of the object.

This kind of optimization is quite sophisticated: it is simple enough in this example, but these optimizations are possible even with more complex code. Depending on the code usage, not all optimizations will necessarily apply. But escape analysis can determine which of those optimizations are possible and make the necessary changes in the compiled code.

Escape analysis is enabled by default. In rare cases, it will get things wrong, in which case disabling it will lead to faster and/or more stable code. If you find this to be the case, then simplifying the code in question is the best course of action: simpler code will compile better. (It is a bug, however, and should be reported.)
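
In that rare situation, escape analysis can be disabled from the command line while the code is being investigated (the jar name is a placeholder):

java -server -XX:-DoEscapeAnalysis -jar myapp.jar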

Quick Summary

  1. Escape analysis is the most sophisticated of the optimizations the compiler can perform. This is the kind of optimization that frequently causes microbenchmarks to go awry.
  2. Escape analysis can often introduce “bugs” into improperly synchronized code.

Deoptimization

The discussion of the output of the PrintCompilation flag mentioned two cases where the compiler deoptimized the code. Deoptimization means that the compiler had to “undo” some previous compilation; the effect is that the performance of the application will be reduced—at least until the compiler can recompile the code in question.

There are two cases of deoptimization: when code is “made not entrant,” and when code is “made zombie.”

Not Entrant Code

There are two things that cause code to be made not entrant. One is due to the way classes and interfaces work, and one is an implementation detail of tiered compilation.

Let’s look at the first case. Recall that the stock application has an interface StockPriceHistory. In the sample code, this interface has two implementations: a basic one (StockPriceHistoryImpl) and one that adds logging (StockPriceHistoryLogger) to each operation. In the servlet code, the implementation used is based on the log parameter of the URL:

StockPriceHistory sph;
String log = request.getParameter("log");
if (log != null && log.equals("true")) {
    sph = new StockPriceHistoryLogger(...);
}
else {
    sph = new StockPriceHistoryImpl(...);
}
// Then the JSP makes calls to:
sph.getHighPrice();
sph.getStdDev();
// and so on

If a bunch of calls are made to http://localhost:8080/StockServlet (that is, without the log parameter), the compiler will see that the actual type of the sph object is StockPriceHistoryImpl. It will then inline code and perform other optimizations based on that knowledge.

Later, say a call is made to http://localhost:8080/StockServlet?log=true. Now the assumption the compiler made regarding the type of the sph object is false; the previous optimizations are no longer valid. This generates a deoptimization trap, and the previous optimizations are discarded. If a lot of additional calls are made with logging enabled, the JVM will quickly end up compiling that code and making new optimizations.

The compilation log for that scenario will include lines such as the following:

 841113   25 %           net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)
                                 made not entrant
 841113  937  s          net.sdo.StockPriceHistoryImpl::process (248 bytes)
                                 made not entrant
1322722   25 %           net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)
                                 made zombie
1322722  937  s          net.sdo.StockPriceHistoryImpl::process (248 bytes)
                                 made zombie

Note that both the OSR-compiled constructor and the standard-compiled methods have been made not entrant, and some time much later, they are made zombie.

Deoptimization sounds like a bad thing, at least in terms of performance, but that isn’t necessarily the case. The first example in this chapter that used the stock servlet application measured only the performance of the URL that triggers the StockPriceHistoryImpl path. With a 300-second warm-up, recall that test achieved about 24.4 OPS with tiered compilation.

Suppose that immediately after that test, a test is run that triggers the StockPriceHistoryLogger path—that is the scenario I ran to produce the deoptimization examples just listed. The full output of PrintCompilation shows that all the methods of the StockPriceHistoryImpl class get deoptimized when the requests for the logging implementation are started. But after deoptimization, if the path that uses the StockPriceHistoryImpl implementation is rerun, that code will get recompiled (with slightly different assumptions), and we will still end up seeing about 24.4 OPS (after another warm-up period).

That’s the best case, of course. What happens if the calls are intermingled such that the compiler can never really assume which path the code will take? Because of the extra logging, the path that includes the logging gets about 24.1 OPS through the servlet. If operations are mixed, we get about 24.3 OPS: just about what would be expected from an average. Similar results are observed in the batch program. So aside from a momentary point where the trap is processed, deoptimization has not affected the performance in any significant way.

The second thing that can cause code to be made not entrant is due to the way tiered compilation works. In tiered compilation, code is compiled by the client compiler, and then later compiled by the server compiler (and actually it’s a little more complicated than that, as discussed in the next section). When the code compiled by the server compiler is ready, the JVM must replace the code compiled by the client compiler. It does this by marking the old code as not entrant and using the same mechanism to substitute the newly compiled (and more efficient) code. Hence, when a program is run with tiered compilation, the compilation log will show a slew of methods that are made not entrant. Don’t panic: this “deoptimization” is in fact making the code that much faster.

The way to detect this is to pay attention to the tier level in the compilation log:

  40915   84 %     3       net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
  40923 3697       3       net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
  41418   87 %     4       net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
  41434   84 %     3       net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)
                                      made not entrant
  41458 3749       4       net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
  41469 3697       3       net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
                                      made not entrant
  42772 3697       3       net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
                                      made zombie
  42861   84 %     3       net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)
                                      made zombie

Here, the constructor is first OSR-compiled at level 3, and then fully compiled also at level 3. A second later, the OSR code becomes eligible for level 4 compilation, so it is compiled at level 4 and the level 3 OSR code is made not entrant. The same process then occurs for the standard compilation, and then finally the level 3 code becomes a zombie.
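
If you want to convince yourself that this particular churn comes from tiered compilation itself rather than from invalidated assumptions, one simple experiment is to run the same workload with tiered compilation disabled and compare the logs: with only the server compiler producing code, the client-to-server replacements (and their made not entrant entries) disappear, although deoptimizations caused by invalidated assumptions can still occur. MyApp here is just a placeholder class name:

  java -XX:-TieredCompilation -XX:+PrintCompilation MyApp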

Deoptimizing Zombie Code

When the compilation log reports that it has made zombie code, it is saying that it has reclaimed some previous code that was made not entrant. In the last example, after a test was run with the StockPriceHistoryLogger implementation, the code for the StockPriceHistoryImpl class was made not entrant. But there were still objects of the StockPriceHistoryImpl class around. Eventually all those objects were reclaimed by GC. When that happened, the compiler noticed that the methods of that class were now eligible to be marked as zombie code.

For performance, this is a good thing. Recall that the compiled code is held in a fixed-size code cache; when zombie methods are identified, it means that the code in question can be removed from the code cache, making room for other classes to be compiled (or limiting the amount of memory the JVM will need to allocate later).

The possible downside here is that if the code for the class is made zombie and then later reloaded and heavily used again, the JVM will need to recompile and reoptimize the code. Still, that’s exactly what happened in the scenario described above where the test was run without logging, then with logging, and then without logging; performance in that case was not noticeably affected. In general, the small recompilations that occur when zombie code is recompiled will not have a measurable effect on most applications.
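
If the code cache itself (rather than zombie churn) turns out to be under pressure, the cache can be reserved explicitly and its usage inspected. This is a minimal sketch: the size shown is purely illustrative rather than a recommendation, and MyApp is a placeholder class name:

  java -XX:ReservedCodeCacheSize=256m -XX:+PrintCodeCache MyApp

The -XX:+PrintCodeCache flag reports how much of the cache was actually used when the JVM exits, which helps in deciding whether a larger reservation is worthwhile.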

Quick Summary

  1. Deoptimization allows the compiler to back out previous versions of compiled code.
  2. Code is deoptimized when previous optimizations are no longer valid (e.g., because the type of the objects in question has changed).
  3. There is usually a small, momentary effect in performance when code is deoptimized, but the new code usually warms up quickly again.
  4. Under tiered compilation, code is deoptimized when it had previously been compiled by the client compiler and has now been optimized by the server compiler.

Tiered Compilation Levels

The compilation log for a program using tiered compilation prints the tier level at which each method is compiled. In the example from the last section, code was compiled up through level 4, even though to simplify the discussion so far, I’ve said there are only two compilers (plus the interpreter).

It turns out that there are five levels of execution, because the client compiler has three different levels. So the levels of compilation are:

  • 0: Interpreted code
  • 1: Simple C1 compiled code
  • 2: Limited C1 compiled code
  • 3: Full C1 compiled code
  • 4: C2 compiled code

A typical compilation log shows that most methods are first compiled at level 3: full C1 compilation. (All methods start at level 0, of course.) If they run often enough, they will get compiled at level 4 (and the level 3 code will be made not entrant). This is the most frequent path: the client compiler waits to compile something until it has information about how the code is used that it can leverage to perform optimizations.

If the server compiler queue is full, methods will be pulled from the server queue and compiled at level 2, which is the level at which the C1 compiler uses the invocation and back-edge counters (but doesn’t require profile feedback). That gets the method compiled more quickly; the method will later be compiled at level 3 after the C1 compiler has gathered profile information, and finally compiled at level 4 when the server compiler queue is less busy.

On the other hand, if the client compiler queue is full, a method that is scheduled for compilation at level 3 may become eligible for level 4 compilation while still waiting to be compiled at level 3. In that case, it is quickly compiled to level 2 and then transitioned to level 4.

Trivial methods may start in either level 2 or 3 but then go to level 1 because of their trivial nature. If the server compiler for some reason cannot compile the code, it will also go to level 1.

And of course when code is deoptimized, it goes to level 0.

There are flags that control some of this behavior, but expecting results when tuning at this level is quite optimistic. The best case for performance happens when methods are compiled as expected: tier 0 → tier 3 → tier 4. If methods frequently get compiled into tier 2 and extra CPU cycles are available, consider increasing the number of compiler threads; that will reduce the size of the server compiler queue. If no extra CPU cycles are available, then all you can do is attempt to reduce the size of the application.
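
If you do go down that path, the number of compiler threads can be set directly. The value here is purely illustrative (the default is typically derived from the number of CPUs), and MyApp is a placeholder; measure before and after, since additional compiler threads also consume CPU cycles:

  java -XX:CICompilerCount=6 MyApp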

Quick Summary

  1. Tiered compilation can operate at five distinct levels among the two compilers.
  2. Changing the path between levels is not recommended; this section just helps to explain the output of the compilation log.

Summary

This chapter has provided a lot of details about how just-in-time compilation works. From a tuning perspective, the simple choice here is to use the server compiler with tiered compilation for virtually everything; this will solve 90% of compiler-related performance issues. Just make sure that the code cache is sized large enough, and the compiler will provide pretty much all the performance that is possible.

This chapter also contains a lot of background about how the compiler works. One reason for this is so you can understand some of the general recommendations made in Chapter 1 regarding small methods and simple code, and the effects of the compiler on microbenchmarks that were described in Chapter 2. In particular:

  1. Don’t be afraid of small methods—and in particular getters and setters—because they are easily inlined. If you have a feeling that method call overhead can be expensive, you’re correct in theory (we showed that removing inlining has a huge impact on performance). But it’s not the case in practice, since the compiler fixes that problem by inlining those calls (a minimal sketch follows this list).
  2. Code that needs to be compiled sits in a compilation queue. The more code in the queue, the longer the program will take to achieve optimal performance.
  3. Although you can (and should) size the code cache, it is still a finite resource.
  4. The simpler the code, the more optimizations that can be performed on it. Profile feedback and escape analysis can yield much faster code, but complex loop structures and large methods limit their effectiveness.
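
To illustrate the first point, here is a minimal sketch (not taken from the sample application) of the kind of accessor the compiler routinely inlines:

// A tiny value class with a trivial getter.
class Point {
    private final int x;
    Point(int x) { this.x = x; }
    int getX() { return x; }   // small enough to be inlined at hot call sites
}

public class InlineDemo {
    // In a hot loop like this, the compiled code typically replaces the call
    // to getX() with a direct field load, so there is no per-call overhead.
    static long sum(Point[] points) {
        long total = 0;
        for (Point p : points) {
            total += p.getX();
        }
        return total;
    }

    public static void main(String[] args) {
        Point[] points = new Point[1_000];
        for (int i = 0; i < points.length; i++) {
            points[i] = new Point(i);
        }
        long total = 0;
        for (int i = 0; i < 100_000; i++) {
            total += sum(points);
        }
        System.out.println(total);   // prevents the work from being optimized away
    }
}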

Finally, if you profile your code and find some surprising methods at the top of your profile—methods you expect shouldn’t be there—you can use the information here to look into what the compiler is doing and to make sure it can handle the way your code is written.
