Chapter 4. Working with the JIT Compiler
The just-in-time (JIT) compiler is the heart of the Java Virtual Machine. Nothing in the JVM affects performance more than the compiler, and choosing a compiler is one of the first decisions made when running a Java application—whether you are a Java developer or an end user. Fortunately, in most situations the compiler needs little tuning beyond some basics.
This chapter covers the compiler in depth. It starts with some information on how the compiler works and discusses the advantages and disadvantages to using a JIT compiler. Then it moves on to which kinds of compilers are present within which versions of Java: understanding this and choosing the correct compiler for a situation is the most important step you must take to make applications run fast. Finally, it covers some intermediate and advanced tunings of the compiler; these tunings can help get those last few percentage points in the performance of an application.
Just-in-Time Compilers: An Overview
Computers—and more specifically CPUs—can execute only a relatively few, specific instructions, which are called assembly or binary code. All programs that the CPU executes must therefore be translated into these instructions.
Languages like C++ and Fortran are called compiled languages because their programs are delivered as binary (compiled) code: the program is written, and then a static compiler produces a binary. The assembly code in that binary is targeted to a particular CPU. Complementary CPUs can execute the same binary: for example, AMD and Intel CPUs share a basic, common set of assembly language instructions, and later versions of CPUs almost always can execute the same set of instructions as previous versions of that CPU. The reverse is not always true; new versions of CPUs often introduce instructions that will not run on older versions of CPUs.
Languages like PHP and Perl, on the other hand, are interpreted. The same program source code can be run on any CPU as long as the machine has the correct interpreter (that is, the program called php or perl). The interpreter translates each line of the program into binary code as that line is executed.
There are advantages and disadvantages to each of these systems. Programs written in interpreted languages are portable: you can take the same code, drop it on any machine with the appropriate interpreter, and it will run. However, it might run slowly. As a simple case, consider what happens in a loop: the interpreter will retranslate each line of code when it is executed in the loop. The compiled code doesn’t need to repeatedly make that translation.
There are a number of factors that a good compiler takes into account when it produces a binary. One simple example of this is the order of the binary statements: not all assembly language instructions take the same amount of time to execute. A statement that adds the values stored in two registers might execute in one cycle, but retrieving (from main memory) the values needed for the addition may take multiple cycles.
Hence, a good compiler will produce a binary that executes the statement to load the data, executes some other instructions, and then—when the data is available—executes the addition. An interpreter that is looking at only one line of code at a time doesn’t have enough information to produce that kind of code; it will request the data from memory, wait for it to become available, and then execute the addition. Bad compilers will do the same thing, by the way, and it is not necessarily the case that even the best compiler can prevent the occasional wait for an instruction to complete.
For these (and other) reasons, interpreted code will almost always be measurably slower than compiled code: compilers have enough information about the program to provide a number of optimizations to the binary code that an interpreter simply cannot perform.
Interpreted code does have the advantage of portability. A binary compiled for a SPARC CPU obviously cannot run on an Intel CPU. But a binary that uses the latest AVX instructions of Intel’s Sandy Bridge processors cannot run on older Intel processors either. Hence, it is common for commercial software to be compiled to a fairly old version of a processor and not take advantage of the newest instructions available to it. There are various tricks around this, including shipping a binary with multiple shared libraries where the shared libraries execute performance-sensitive code and come with versions for various flavors of a CPU.
Java attempts to find a middle ground here. Java applications are compiled, but instead of being compiled into a specific binary for a specific CPU, they are compiled into an idealized assembly language. This assembly language (known as Java bytecodes) is then run by the java binary (in the same way that an interpreted PHP script is run by the php binary). This gives Java the platform independence of an interpreted language. Because it is executing an idealized binary code, the java program is able to compile the code into the platform binary as the code executes. This compilation occurs as the program is executed: it happens “just in time.”
Hot Spot Compilation
As discussed in Chapter 1, the Java implementation discussed in this book is Oracle’s HotSpot JVM. This name (HotSpot) comes from the approach it takes toward compiling the code. In a typical program, only a small subset of code is executed frequently, and the performance of an application depends primarily on how fast those sections of code are executed. These critical sections are known as the hot spots of the application; the more the section of code is executed, the hotter that section is said to be.
Hence, when the JVM executes code, it does not begin compiling the code immediately. There are two basic reasons for this. First, if the code is going to be executed only once, then compiling it is essentially a wasted effort; it will be faster to interpret the Java bytecodes than to compile them and execute (only once) the compiled code.
But if the code in question is a frequently called method, or a loop that runs many iterations, then compiling it is worthwhile: the cycles it takes to compile the code will be outweighed by the savings in multiple executions of the faster compiled code. That trade-off is one reason that the compiler executes the interpreted code first—the compiler can figure out which methods are called frequently enough to warrant their compilation.
The second reason is one of optimization: the more times that the JVM executes a particular method or loop, the more information it has about that code. This allows the JVM to make a number of optimizations when it compiles the code.
A number of those optimizations (and ways to affect them) are discussed later in this chapter, but for a simple example, consider the case of the equals() method. This method exists in every Java object (since it is inherited from the Object class) and is often overridden. When the interpreter encounters the statement b = obj1.equals(obj2), it must look up the type (class) of obj1 in order to know which equals() method to execute. This dynamic lookup can be somewhat time-consuming.
Over time, say that the JVM notices that each time this statement is executed, obj1 is of a single type (say, java.lang.String). Then the JVM can produce compiled code that directly calls the String.equals() method. Now the code is faster not only because it is compiled, but also because it can skip the lookup of which method to call.
It’s not quite as simple as that; it is quite possible that the next time the code is executed, obj1 refers to something other than a String, so the JVM has to produce compiled code that deals with that possibility. Nonetheless, the overall compiled code here will be faster (at least as long as obj1 continues to refer to a String) because it skips the lookup of which method to execute. That kind of optimization can only be made after running the code for a while and observing what it does: this is the second reason why JIT compilers wait to compile sections of code.
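The guarded direct call described above can be imitated in plain Java. The sketch below is only an analogy: the class name GuardedCallSketch and its counters are hypothetical, and the real JIT makes this check in generated machine code rather than in Java source.

```java
public class GuardedCallSketch {
    // These counters exist only so the sketch can report which path ran.
    static int fastPathCalls = 0;
    static int slowPathCalls = 0;

    // Stands in for b = obj1.equals(obj2) after the receiver has so far
    // always been observed to be a String.
    static boolean guardedEquals(Object obj1, Object obj2) {
        if (obj1.getClass() == String.class) {
            // Guard passed: String is final, so this call has exactly one
            // possible target and needs no lookup.
            fastPathCalls++;
            return ((String) obj1).equals(obj2);
        }
        // Guard failed: fall back to the ordinary dynamic lookup.
        slowPathCalls++;
        return obj1.equals(obj2);
    }

    public static void main(String[] args) {
        System.out.println(guardedEquals("abc", "abc"));
        System.out.println(guardedEquals(Integer.valueOf(1), Integer.valueOf(1)));
        System.out.println("fast=" + fastPathCalls + " slow=" + slowPathCalls);
    }
}
```

The fallback branch is what lets the optimized code stay correct when the receiver turns out not to be a String.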
- Java is designed to take advantage of the platform independence of scripting languages and the native performance of compiled languages.
- A Java class file is compiled into an intermediate language (Java bytecodes) that is then further compiled into assembly language by the JVM.
- Compilation of the bytecodes into assembly language performs a number of optimizations that greatly improve performance.
Basic Tunings: Client or Server (or Both)
The JIT compiler comes in two flavors, and the choice of which to use is often the only compiler tuning that needs to be done when running an application. In fact, choosing your compiler is something that must be considered even before Java is installed, since different Java binaries contain different compilers. That will get sorted out in just a bit; first, let’s figure out which one should be used in which circumstances.
The two compilers are known as client and server. These names come from the command-line argument used to select the compiler (e.g., either -client or -server). JVM developers (and even some tools) often refer to the compilers by the names C1 (compiler 1, client compiler) and C2 (compiler 2, server compiler). The names imply that the choice between them should be influenced by the hardware on which the program is running, but that’s not really true, especially today: some 15 years after the terms were first used, your “client” laptop has four to eight CPUs and 8 GB of memory (which is more processing power than a midrange server had when Java was first developed).
The primary difference between the two compilers is their aggressiveness in compiling code. The client compiler begins compiling sooner than the server compiler does. This means that during the beginning of code execution, the client compiler will be faster, because it will have compiled correspondingly more code than the server compiler.
The engineering trade-off here is the knowledge the server compiler gains while it waits: that knowledge allows the server compiler to make better optimizations in the compiled code. Ultimately, code produced by the server compiler will be faster than that produced by the client compiler. From a user’s perspective, the benefit to that trade-off is based on how long the program will run, and how important the startup time of the program is.
The obvious question here is why there needs to be a choice at all: couldn’t the JVM start with the client compiler, and then use the server compiler as code gets hotter? That technique is known as tiered compilation. With tiered compilation, code is first compiled by the client compiler; as it becomes hot, it is recompiled by the server compiler.
Experimental versions of tiered compilation are available in early releases of Java 7. It turns out that there are a number of technical difficulties here (notably in the different architectures of the two compilers), and as a result, tiered compilation didn’t perform well in those experimental versions. Starting in Java 7u4, those difficulties have largely been solved, and tiered compilation usually offers the best performance for an application.
In Java 7, tiered compilation has a few quirks, and so it is not the default setting. In particular, it is easy to exceed the JVM code cache size, which can prevent code from getting optimally compiled (though it is easy enough to address that, as is discussed in Intermediate Tunings for the Compiler). To use tiered compilation, specify the server compiler (either with -server or by ensuring it is the default for the particular Java installation being used), and ensure that the Java command line includes the flag -XX:+TieredCompilation (the default value of which is false). In Java 8, tiered compilation is enabled by default.
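Whether tiered compilation is actually in effect for a running JVM can be checked from inside the application. The following is a minimal sketch that assumes a HotSpot-based JVM, since it relies on the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean; the class name TieredCheck and the "unknown" fallback value are illustrative choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class TieredCheck {
    // Returns the current value of a HotSpot flag as a string, or "unknown"
    // when the diagnostic bean or the flag is not available.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean == null ? "unknown" : bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // For example, IllegalArgumentException for an unrecognized flag.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("TieredCompilation = " + flagValue("TieredCompilation"));
    }
}
```

Running it once with and once without -XX:+TieredCompilation on the command line shows whether the flag changes the default for a given installation.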
To understand the trade-offs here, let’s look at a few examples.
The client compiler is most often used when fast startup is the primary objective. The difference this makes on various applications is shown in Table 4-1.
In a simple application, neither compiler has an advantage because not enough code is run for either compiler to make any contribution. And for a task that takes only 80 ms, we’d be hard-pressed to notice a difference if one did exist.
NetBeans is a fairly typical, moderately sized Java GUI application. On startup, it loads about 10,000 classes, performs initialization of several graphical objects, and so on. Here, the client compiler offers a significant advantage on startup: the server compiler starts 38.5% slower, and the 1-second difference will certainly be noticeable. Note that the tiered compiler isn’t quite as fast, though it is only about 8% slower, a fairly trivial difference.
This is the reason NetBeans—and many GUI programs like it, including the Java plug-in used by web browsers—uses the client compiler by default. Performance is often all about perception: if the initial startup seems faster, and everything else seems fine, users will tend to view the program that has started faster as being faster overall.
Finally, there is a very large server program that loads more than 20,000 classes and performs extensive initialization. Because it is an application server, it will certainly need to use the server compiler. Even though a lot of processing is going on here, there is still a slightly noticeable benefit to the client compiler. What’s interesting about this example is one thing mentioned in Chapter 1: it’s not always the JVM that is the problem. In this case, there are so many JAR files that must be read from disk that reading them is the gating factor for performance (otherwise, the startup difference would have been even more in favor of the client compiler).
- The client compiler is most useful when the startup of an application is the overriding performance concern.
- Tiered compilation can achieve startup times very close to those obtained from the client compiler.
Optimizing Batch Operations
For batch applications—those that run a fixed amount of work—the choice of compiler boils down to which gets the best optimization in the amount of time the application runs. Table 4-2 shows an example of that.
|Number of stocks|
Using the sample stock code discussed in Chapter 2, the application here requests 1 year’s history (plus the average and standard deviation of that history) for between 1 and 10,000 stocks.
For 1 to 100 stocks, the faster startup with the client compiler completes the job sooner, and if the goal is to process only 100 stocks, the client compiler is the best choice. After that, the performance advantage swings in favor of the server compiler (and particularly the server compiler with tiered compilation). Even for a limited number of calculations, tiered compilation is pretty close to the client compiler, making it a good candidate for all cases.
It is also interesting that tiered compilation is always slightly better than the standard server compiler. In theory, once the program has run enough to compile all the hot spots, the server compiler might be expected to achieve the best (or at least equal) performance. But in any application, there will almost always be some small section of code that is infrequently executed. It is better to compile that code—even if the compilation is not the best that might be achieved—than to execute that code in interpreted mode. And as is discussed later in this chapter (see Compilation Thresholds), the server compiler will likely never actually compile all the code in an application, even if it runs forever.
- For jobs that run in a fixed amount of time, choose the compiler based on which one is the fastest at executing the actual job.
- Tiered compilation provides a reasonable default choice for batch jobs.
Optimizing Long-Running Applications
Finally, there is the difference that can be expected in the eventual performance of a long-running application when different compilers are used. Performance of long-running applications is typically measured by examining the throughput that an application delivers after it has been “warmed up”—meaning after it has run long enough that the important parts of the code have been compiled.
This example uses the basic stock calculator and puts it in a servlet; each call to the servlet will retrieve information for a random stock symbol for a period of 25 years. Using the fhb program discussed in Chapter 2, Table 4-3 shows how many operations per second the server produced after warm-up periods of 0, 60, and 300 seconds.
The measurement period here is 60 seconds, so even in the case where there is no warm-up, the compilers had an opportunity to get enough information to compile the hot spots; hence the server compilers are always better in this example. (Also, a lot of code was compiled during the startup of the application server.) As before, tiered compilation can compile just a little bit more code and squeeze out just a little more performance than the server compiler alone.
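The warm-up-then-measure procedure can be sketched in a few lines. This is a hypothetical harness, not the fhb program: the class WarmupSketch, its work() method (a stand-in for one request), and the durations in main are all assumptions chosen for illustration.

```java
public class WarmupSketch {
    // Hypothetical workload standing in for one servlet request.
    static double work(int seed) {
        double sum = 0;
        for (int i = 1; i <= 1_000; i++) {
            sum += Math.sqrt(seed + i);
        }
        return sum;
    }

    // Runs the workload for roughly the given number of milliseconds and
    // returns how many operations completed.
    static long runFor(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long ops = 0;
        double sink = 0;
        while (System.nanoTime() < deadline) {
            sink += work((int) (ops % 100));
            ops++;
        }
        // Use the result so the loop body cannot be discarded as dead code.
        if (sink < 0) throw new IllegalStateException("unexpected negative sum");
        return ops;
    }

    // Discards the warm-up period, then reports throughput for the measured period.
    static double opsPerSecond(long warmupMillis, long measureMillis) {
        runFor(warmupMillis);
        long ops = runFor(measureMillis);
        return ops * 1000.0 / measureMillis;
    }

    public static void main(String[] args) {
        System.out.println("no warm-up:     " + opsPerSecond(0, 200) + " ops/sec");
        System.out.println("200 ms warm-up: " + opsPerSecond(200, 200) + " ops/sec");
    }
}
```

Because both measurements run in the same JVM, the second one also benefits from code compiled during the first; a fair comparison of warm-up periods would start a fresh JVM for each.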
For long-running applications, always choose the server compiler, preferably in conjunction with tiered compilation.
Java and JIT Compiler Versions
Now that differences between the compilers have been examined, let’s look at how to get the desired compiler. When you download Java, you must choose a version; the choice ultimately revolves around the platform you are using. However, the choice also impacts the JIT compiler(s) available to applications. The discussion so far has been about client and server compilers, but there are three versions of the JIT compiler:
A 32-bit client version (-client)
A 32-bit server version (-server)
A 64-bit server version (-d64)
To a certain extent, you choose the compiler you want to use by supplying the given argument (-client, -server, etc.). However, things are not quite so simple.
When downloading Java for a given operating system, there are only two options: a 32-bit or a 64-bit binary. So clearly, the 32-bit binary can be expected to have (up to) two compilers, while the 64-bit binary will have only a single compiler. (In fact, the 64-bit binary will have two compilers, since the client compiler is needed to support tiered compilation. But a 64-bit JVM cannot be run with only the client compiler.)
Once installed, though, things become a little more complicated. On most platforms, the 32-bit and 64-bit binaries install separately. You can have both binaries installed on your computer, but you must refer to them via separate paths. Hence, on the machine I use for Linux testing, I have binaries installed in /export/VMs/jdk1.7.0-32bit and /export/VMs/jdk1.7.0-64bit, and I choose between them by setting my PATH accordingly.
On Solaris, things are different: the 64-bit installation overlays the 32-bit installation. Hence all three compilers are available from the same path. This makes it much easier for the end user; among other things, it means that if Java is installed system-wide in /usr/bin, a user can always specify via the command line which of the three possible compilers she wants. That kind of installation remains the exception. Things can be further complicated because developers of HotSpot often use Solaris as their primary development system, and hence discussions (and sometimes documentation) get confused about which installation paradigm is in use.
One last complication: for the sake of compatibility, the argument specifying which compiler to use is not rigorously followed. If you have a 64-bit JVM and specify -client, the application will use the 64-bit server compiler anyway. If you have a 32-bit JVM and you specify -d64, you will get an error that the given instance does not support a 64-bit JVM.
To summarize: the selection of the compiler is controlled by which JVM bits are installed and by the compiler argument passed to the JVM. Table 4-4 shows the result when the given argument is specified for the given installation.
|Install bits|-client|-server|-d64|
|Linux 32-bit|32-bit client compiler|32-bit server compiler|Error|
|Linux 64-bit|64-bit server compiler|64-bit server compiler|64-bit server compiler|
|Mac OS X|64-bit server compiler|64-bit server compiler|64-bit server compiler|
|Solaris 32-bit|32-bit client compiler|32-bit server compiler|Error|
|Solaris 64-bit|32-bit client compiler|32-bit server compiler|64-bit server compiler|
|Windows 32-bit|32-bit client compiler|32-bit server compiler|Error|
|Windows 64-bit|64-bit server compiler|64-bit server compiler|64-bit server compiler|
In Java 8, when the server compiler is the default in any of these cases, tiered compilation is also enabled by default.
What if no compiler argument is given at all? Then the JVM uses the default compiler for the machine on which the code is running: the default compiler is a runtime choice. This choice is made based on whether the JVM considers the machine to be a “client” machine or a “server” machine. That decision is based on a combination of the operating system and number of CPUs on the machine; Table 4-5 lists the various defaults.
Windows, 32-bit, any number of CPUs
Windows, 64-bit, any number of CPUs
MacOS, any number of CPUs
Linux/Solaris, 32-bit, 1 CPU
Linux/Solaris, 32-bit, 2 or more CPUs
Linux, 64-bit, any number of CPUs
Solaris, 32-bit/64-bit overlay, 1 CPU
Solaris, 32-bit/64-bit overlay, 2 or more CPUs
These defaults are based on the notion that startup time is always the most important thing for 32-bit Windows machines, and Unix-based machines are generally more interested in long-running performance. As always, there are exceptions: certainly modern Windows-based machines can run powerful servers even in 32-bit mode, and in those cases the server compiler should be used. Similarly, many application servers use simple Java-based administrative commands to inspect or change their configuration; even on Unix-based machines, these are better run with the client compiler.
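Since the default is a runtime choice, the most direct way to learn what a given installation picked is to ask the running JVM. The sketch below uses the standard CompilationMXBean; the class name WhichCompiler is hypothetical, and the sun.arch.data.model property is a HotSpot-specific assumption that may be absent on other JVMs.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class WhichCompiler {
    // Summarizes the VM, its bit width, the CPU count, and the JIT in use.
    static String describe() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        // The bean is null when the JVM has no compilation system at all.
        String compiler = (jit == null) ? "none" : jit.getName();
        return "vm=" + System.getProperty("java.vm.name")
            + ", bits=" + System.getProperty("sun.arch.data.model", "unknown")
            + ", cpus=" + Runtime.getRuntime().availableProcessors()
            + ", jit=" + compiler;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with no compiler argument, and then again with an explicit one, shows both the default and whether the argument was honored on that platform.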
- Different Java binaries support different compilers.
- The compilers supported by different binaries are inconsistent among operating systems and binary architectures.
- A program doesn’t necessarily use the compiler specified; that depends on the platform support for that compiler.
Intermediate Tunings for the Compiler
For the most part, tuning the compiler is really just a matter of selecting the proper JVM and compiler switch (-client, -server, or -XX:+TieredCompilation) for the installation on the target machine. Tiered compilation is usually the best choice for long-running applications and is within a few milliseconds of the performance of the client compiler on short-lived applications.
There are a few cases in which additional tunings are required; those cases are explored in this section.
Tuning the Code Cache
When the JVM compiles code, it holds the set of assembly-language instructions in the code cache. The code cache has a fixed size, and once it has filled up, the JVM is not able to compile any additional code.
It is easy to see the potential issue here if the code cache is too small. Some hot spots will get compiled, but others will not: the application will end up running a lot of (very slow) interpreted code.
This is more frequently an issue when using either the client compiler or tiered compilation. When the regular server compiler is used, it is somewhat unlikely that the number of classes eligible for compilation will fill the code cache; typically only a handful of classes will be compiled. But the number of classes eligible for compilation when using the client compiler (and hence also eligible for compilation when tiered compilation is enabled) is potentially much higher.
When the code cache fills up, the JVM will (usually) spit out a warning to that effect:
Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
It is sometimes easy to miss this message, and some versions of Java 7 do not print it correctly when tiered compilation is enabled. Another way to determine if the compiler has ceased to compile code is to follow the output of the compilation log discussed later in this section.
Table 4-6 lists the default value of the code cache for various platforms.
|JVM type||Default code cache size|
32-bit client, Java 8
32-bit server with tiered compilation, Java 8
64-bit server with tiered compilation, Java 8
32-bit client, Java 7
32-bit server, Java 7
64-bit server, Java 7
64-bit server with tiered compilation, Java 7
In Java 7, the default size for tiered compilation is often insufficient, and it is often necessary to increase the code cache size. Large programs that use the client compiler may also need to increase the code cache size.
There really isn’t a good mechanism to figure out how much code cache a particular application needs. Hence, when you need to increase the code cache size, it is sort of a hit-and-miss operation; a typical option is to simply double or quadruple the default.
The maximum size of the code cache is set via the -XX:ReservedCodeCacheSize=N flag (where N is the default just mentioned for the particular compiler).
The code cache is managed like most memory in the JVM: there is an initial size (specified by -XX:InitialCodeCacheSize=N). Allocation of the code cache size starts at the initial size and increases as the cache fills up. The initial size of the code cache varies based on the chip architecture and compiler in use (on Intel machines, the client compiler starts with a 160 KB cache and the server compiler starts with a 2,496 KB cache). Resizing the cache happens in the background and doesn’t really affect performance, so setting the ReservedCodeCacheSize size (i.e., setting the maximum code cache size) is all that is generally needed.
Is there a disadvantage to specifying a really large value for the maximum code cache size so that it never runs out of space? It depends on the resources available on the target machine. If a 1 GB code cache size is specified, then the JVM will reserve 1 GB of native memory space. That memory isn’t allocated until needed, but it is still reserved, which means that there must be sufficient virtual memory available on your machine to satisfy the reservation.
In addition, if the JVM is 32-bit, then the total size of the process cannot exceed 4 GB. That includes the Java heap, space for all the code of the JVM itself (including its native libraries and thread stacks), any native memory the application allocates (either directly or via the NIO libraries), and of course the code cache.
Those are the reasons the code cache is not unbounded and sometimes requires tuning for large applications (or even medium-sized applications when tiered compilation is used). Particularly on 64-bit machines, though, setting the value too high is unlikely to have a practical effect on the application: the application won’t run out of process space memory, and the extra memory reservation will generally be accepted by the operating system.
The size of the code cache can be monitored using jconsole by selecting the Memory Pool Code Cache chart on the Memory panel.
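The same figures that jconsole charts can be read programmatically through the standard memory pool beans. This is a sketch under one assumption: that the code cache appears as one or more non-heap pools whose names contain "Code" (the exact pool names vary between Java releases). The class name CodeCacheMonitor is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class CodeCacheMonitor {
    // Assumption: code cache pools are non-heap pools with "Code" in the name.
    static boolean isCodePool(MemoryPoolMXBean pool) {
        return pool.getType() == MemoryType.NON_HEAP && pool.getName().contains("Code");
    }

    // Sums the bytes currently used across all matching pools.
    static long codeCacheUsedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (isCodePool(pool) && usage != null) {
                used += usage.getUsed();
            }
        }
        return used;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (isCodePool(pool) && usage != null) {
                // getMax() returns -1 when the pool has no defined maximum.
                System.out.println(pool.getName() + ": used=" + usage.getUsed()
                    + " max=" + usage.getMax());
            }
        }
        System.out.println("total used: " + codeCacheUsedBytes() + " bytes");
    }
}
```

Logging this total periodically in a long-running application gives an early sign that usage is approaching the configured maximum.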
- The code cache is a resource with a defined maximum size that affects the total amount of compiled code the JVM can run.
- Tiered compilation can easily use up the entire code cache in its default configuration (particularly in Java 7); monitor the code cache and increase its size if necessary when using tiered compilation.
Compilation Thresholds
This chapter has been somewhat vague in defining just what triggers the compilation of code. The major factor involved here is how often the code is executed; once it is executed a certain number of times, its compilation threshold is reached, and the compiler deems that it has enough information to compile the code.
There are tunings that affect these thresholds, which are discussed in this section. However, this section is really designed to give you better insight into how the compiler works (and introduce some terms). There is really only one case where the compilation thresholds might need to be tuned; that is discussed at the end of this section.
Compilation is based on two counters in the JVM: the number of times the method has been called, and the number of times any loops in the method have branched back. Branching back can effectively be thought of as the number of times a loop has completed execution, either because it reached the end of the loop itself or because it executed a branching statement like continue.
When the JVM executes a Java method, it checks the sum of those two counters and decides whether or not the method is eligible for compilation. If it is, the method is queued for compilation (see Compilation Threads for more details about queuing). This kind of compilation has no official name but is often called standard compilation.
But what if the method has a really long loop—or one that never exits and provides all the logic of the program? In that case, the JVM needs to compile the loop without waiting for a method invocation. So every time the loop completes an execution, the branching counter is incremented and inspected. If the branching counter has exceeded its individual threshold, then the loop (and not the entire method) becomes eligible for compilation.
This kind of compilation is called on-stack replacement (OSR), because even if the loop is compiled, that isn’t sufficient: the JVM has to have the ability to start executing the compiled version of the loop while the loop is still running. When the code for the loop has finished compiling, the JVM replaces the code (on-stack), and the next iteration of the loop will execute the much-faster compiled version of the code.
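A program that should provoke OSR compilation is easy to write: put a long loop in a method that is called only once. The sketch below is a hypothetical example (the class OsrDemo and the iteration count are arbitrary choices); the loop count is simply assumed to be large enough to cross the back-edge threshold.

```java
public class OsrDemo {
    // One long loop in a method that is invoked a single time, so only the
    // loop's branching counter can make this code eligible for compilation.
    static long loopWork(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i % 3;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(loopWork(50_000_000));
    }
}
```

Running this on a HotSpot JVM with the -XX:+PrintCompilation flag should show a compilation of loopWork marked with the % attribute, which indicates OSR.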
Standard compilation is triggered by the value of the -XX:CompileThreshold=N flag. The default value of N for the client compiler is 1,500; for the server compiler it is 10,000. Changing the value of the CompileThreshold flag will cause the compiler to choose to compile the code sooner (or later) than it normally would have. Note, however, that although there is one flag here, the threshold is compared against the sum of the back-edge loop counter and the method entry counter.
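The relationship between the single flag and the two counters can be written out directly. This is a simplified model, not JVM code: the class ThresholdSketch is hypothetical, and treating "reached" as greater-than-or-equal is an assumption about the exact comparison.

```java
public class ThresholdSketch {
    static final int CLIENT_THRESHOLD = 1_500;   // default N for the client compiler
    static final int SERVER_THRESHOLD = 10_000;  // default N for the server compiler

    // A method is eligible for standard compilation once the sum of its
    // method entry counter and back-edge counter reaches the threshold.
    static boolean eligible(int invocations, int backEdges, int threshold) {
        return invocations + backEdges >= threshold;
    }

    public static void main(String[] args) {
        // 1,000 calls whose loops branched back 600 times in total.
        System.out.println("client: " + eligible(1_000, 600, CLIENT_THRESHOLD));
        System.out.println("server: " + eligible(1_000, 600, SERVER_THRESHOLD));
    }
}
```

A method with a loop therefore becomes eligible after far fewer invocations than the flag value alone would suggest.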
Changing the CompileThreshold flag has been a popular recommendation in performance circles for quite some time; in fact, you may have seen that Java benchmarks often use this flag (e.g., frequently after 8,000 iterations for the server compiler).
We’ve seen that there is a big difference between the ultimate performance of the client and server compilers, due largely to the information available to the compiler when it compiles a particular method. Lowering the compile threshold, particularly for the server compiler, runs the risk that the code may be compiled a little less optimally than possible—but testing on an application may show that there is in fact little difference between compiling after, say, 8,000 invocations instead of 10,000.
You can bet that vendors who submit benchmark results with that tuning have verified there is no performance difference between the two settings for that benchmark. They use the lower setting for two reasons:
- It saves a little time in how long the application needs to warm up.
- It can compile certain server methods that would otherwise never compile.
The first point here should be well understood, but why would the server never compile an important method? It isn’t just that the compilation threshold hasn’t been reached yet: it’s that the compilation threshold will never be reached. This is because the counter values increase as methods and loops are executed, but they also decrease over time.
Periodically (specifically, when the JVM reaches a safepoint), the value of each counter is reduced. Practically speaking, this means that the counters are a relative measure of the recent hotness of the method or loop. One side effect of this is that somewhat-frequently executed code may never be compiled, even for programs that run forever (these methods are sometimes called lukewarm [as opposed to hot]). This is one case where reducing the compilation threshold can be beneficial, and it is another reason why tiered compilation is usually slightly faster than the server compiler alone. The next section will show how to determine if a particular method is not compiled; if methods in the critical path of the profiles for your application show they are not compiled, compilation can sometimes be achieved by reducing the compiler thresholds.
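A toy simulation shows why a steadily executed method can stay below the threshold forever. The decay rule here is an assumption made for illustration (the counter is halved at every simulated safepoint), and the class CounterDecaySketch and the call rates are hypothetical.

```java
public class CounterDecaySketch {
    // Simulates a method called callsPerInterval times between safepoints.
    // Returns the interval in which the counter reaches the threshold, or -1
    // if it has not done so after maxIntervals.
    static int intervalsToCompile(int callsPerInterval, int threshold, int maxIntervals) {
        int counter = 0;
        for (int interval = 1; interval <= maxIntervals; interval++) {
            counter += callsPerInterval;
            if (counter >= threshold) {
                return interval;
            }
            counter /= 2; // assumed decay: halve the counter at each safepoint
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("hot, threshold 10000:      " + intervalsToCompile(6_000, 10_000, 1_000));
        System.out.println("lukewarm, threshold 10000: " + intervalsToCompile(4_500, 10_000, 1_000));
        System.out.println("lukewarm, threshold 8000:  " + intervalsToCompile(4_500, 8_000, 1_000));
    }
}
```

Under this model, the counter for the lukewarm method levels off just below 9,000, so it never reaches 10,000 no matter how long the program runs, while a threshold of 8,000 is reached in the fourth interval.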
- Compilation occurs when the number of times a method or loop has been executed reaches a certain threshold.
- Changing the threshold values can cause the code to be compiled sooner than it otherwise would.
- “Lukewarm” methods will never reach the compilation threshold (particularly for the server compiler) since the counters decay over time.
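The effect of counter decay on a lukewarm method can be illustrated with a toy model. The sketch below is not HotSpot's actual bookkeeping: the class name CounterDecayDemo, the "halve the counter at each interval" decay rule, and the call rates are all assumptions chosen to make the behavior visible.

```java
// Toy model only: the halving decay and the numbers used here are
// illustrative assumptions, not the JVM's real counter implementation.
final class CounterDecayDemo {
    /**
     * Returns the 1-based interval at which the counter first reaches the
     * threshold, or -1 if it never does within maxTicks intervals.
     */
    static int ticksUntilCompiled(int callsPerTick, int threshold, int maxTicks) {
        int counter = 0;
        for (int tick = 1; tick <= maxTicks; tick++) {
            counter += callsPerTick;   // invocations during this interval
            if (counter >= threshold) {
                return tick;           // the method would be queued for compilation
            }
            counter /= 2;              // periodic decay (assumed: halve the counter)
        }
        return -1;
    }

    public static void main(String[] args) {
        // A lukewarm method: 4,500 calls per interval.
        System.out.println(ticksUntilCompiled(4_500, 10_000, 1_000));
        System.out.println(ticksUntilCompiled(4_500, 8_000, 1_000));
    }
}
```

In this model the counter of the lukewarm method levels off just below 9,000, so it never reaches a threshold of 10,000 however long the program runs, while a threshold of 8,000 is crossed after a few intervals.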
Inspecting the Compilation Process
The last of the intermediate tunings aren’t tunings per se: that is, they will not improve the performance of an application. Rather, they are the JVM flags (and other tools) that give visibility into the working of the compiler. The most important of these is -XX:+PrintCompilation (which by default is false). If PrintCompilation is enabled, every time a method (or loop) is compiled, the JVM prints out a line with information about what has just been compiled.
The output has varied somewhat between Java
releases; the output discussed here became standardized in Java 7.
Most lines of the compilation log have the following format:
timestamp compilation_id attributes (tiered_level) method_name size deopt
The timestamp here is the time after the compilation has finished (relative to 0, which is when the JVM started).
The compilation_id is an internal task ID. Usually this
number will simply increase monotonically, but sometimes with the server
compiler (or anytime the number of compilation threads has been increased),
you may see an out-of-order compilation ID. This indicates that compilation
threads are running faster or slower relative to each other, but don’t
conclude that one particular compilation task was somehow
inordinately slow: it is usually just a function of thread scheduling
(though OSR compilation is slow and often appears out of order).
The attributes field is a series of five characters that indicates
the state of the code being compiled. If a particular attribute applies to
the given compilation,
the character shown in the following list is printed; otherwise, a space is printed for
that attribute. Hence, the five-character attribute string may appear as
two or more items separated by spaces. The various attributes are:
%: The compilation is OSR.
s: The method is synchronized.
!: The method has an exception handler.
b: Compilation occurred in blocking mode.
n: Compilation occurred for a wrapper to a native method.
The first three of these should be self-explanatory. The blocking flag will never be printed by default in current versions of Java; it indicates that compilation did not occur in the background (see Compilation Threads for more details about that). Finally, the native attribute indicates that the JVM generated some compiled code to facilitate the call into a native method.
If the program is not running with tiered compilation, the next field (tiered_level) will be blank. Otherwise, it will be a number indicating which tier has completed compilation (see Tiered Compilation Levels).
Next comes the name of the method being compiled (or the method containing the loop being compiled for OSR), which is printed as ClassName::method.
Next is the size (in bytes) of the code being compiled. This is the size of the Java bytecodes, not the size of the compiled code (so, unfortunately, this can’t be used to predict how large to size the code cache).
Finally, in some cases there will be a message at the end of the compilation line that indicates that some sort of deoptimization has occurred; these are typically the phrases “made not entrant” or “made zombie.” See Deoptimization for more details.
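To make the field layout concrete, here is a minimal parsing sketch. The class CompilationLogLine and its field names are hypothetical, and the regular expression is an assumption that only covers lines shaped like the samples in this chapter (whitespace-separated fields, size printed as "(N bytes)"); real logs use fixed-width columns and have other line shapes that it does not attempt to handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any JDK tool.
final class CompilationLogLine {
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(\\d+)\\s+(\\d+)\\s+"      // timestamp, compilation_id
        + "((?:[%s!bn]+\\s+)*)"          // attributes (may be absent)
        + "(?:([0-4])\\s+)?"             // tiered_level (absent without tiered compilation)
        + "(\\S+::\\S+)"                 // method_name, printed as ClassName::method
        + "(?:\\s+@\\s+(-?\\d+))?"       // bytecode index printed on OSR lines
        + "\\s+\\((\\d+) bytes\\)"       // size of the Java bytecodes
        + "\\s*(.*)$");                  // trailing deoptimization message, if any

    final long timestampMs;
    final int compilationId;
    final String attributes;   // e.g. "%", "s", "!" or "" when none apply
    final int tier;            // -1 when no tier was printed
    final String method;
    final int sizeBytes;
    final String deopt;        // "" when no deoptimization message is present

    private CompilationLogLine(Matcher m) {
        timestampMs = Long.parseLong(m.group(1));
        compilationId = Integer.parseInt(m.group(2));
        attributes = m.group(3).replaceAll("\\s+", "");
        tier = m.group(4) == null ? -1 : Integer.parseInt(m.group(4));
        method = m.group(5);
        sizeBytes = Integer.parseInt(m.group(7));
        deopt = m.group(8).trim();
    }

    static CompilationLogLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        return new CompilationLogLine(m);
    }

    boolean isOsr() {
        return attributes.indexOf('%') >= 0;
    }
}
```

Each attribute group must be followed by whitespace in the pattern; without that requirement, the leading "n" of a package name such as net.sdo could be misread as the native attribute.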
The compilation log may also include a line that looks like this:
timestamp compile_id COMPILE SKIPPED: reason
This line (with the literal text
COMPILE SKIPPED) indicates that something
has gone wrong with the compilation of the given method. There are two
cases where this is expected, depending on the reason specified:
Code cache filled: The size of the code cache needs to be increased using the ReservedCodeCacheSize flag.
Concurrent classloading: The class was modified as it was being compiled.
In all cases (except the cache being filled), the compilation should be reattempted. If it is not, then there is an error that prevents compilation of the code. This is often a bug in the compiler, but the usual remedy in all cases is to refactor the code into something simpler that the compiler can handle.
Here are a few lines of output from enabling
PrintCompilation on the
stock servlet web application:
28015 850 net.sdo.StockPrice::getClosingPrice (5 bytes)
28179 905 s net.sdo.StockPriceHistoryImpl::process (248 bytes)
28226 25 % net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
28244 935 net.sdo.MockStockPriceEntityManagerFactory$MockStockPriceEntityManager::find (507 bytes)
29929 939 net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
106805 1568 ! net.sdo.StockServlet::processRequest (197 bytes)
This output includes only a few of the stock-related methods that have been compiled. A few interesting things to note: the first such method wasn’t compiled until 28 seconds after the application server was started, and 849 methods were compiled before it. In this case, all those other methods were methods of the application server (filtered out of this output). The application server took about 2 seconds to start; the remaining 26 seconds before anything else was compiled were essentially idle as the application server waited for requests.
The remaining lines are included to point out some interesting features. The process() method, as seen here and in the code listing, is synchronized. Inner classes are compiled just like any other class and appear in the output with the usual Java nomenclature: outer-classname$inner-classname. The processRequest() method shows up with the exception handler as expected.
Finally, recall the implementation of the StockPriceHistoryImpl constructor, which contains a large loop:
The loop is executed more often than the constructor itself, so the loop is subject to OSR compilation. Note that it took a while for that method to be compiled; its compilation ID is 25, but it doesn’t appear until other methods in the 900 range are being compiled. (It’s easy to read OSR lines like this example as 25% and wonder about the other 75%, but remember that the number is the compilation ID, and the % just signifies OSR compilation.) That is typical of OSR compilation; the stack replacement is harder to set up, but other compilation can continue in the meantime.
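A small stand-in shows the shape of code that is subject to OSR compilation. The class OsrLoopDemo below is hypothetical and unrelated to the stock application's sources; it only assumes what the text describes, namely a constructor that is invoked once but contains a loop that runs many times.

```java
// Hypothetical example class; the loop body is arbitrary busywork.
final class OsrLoopDemo {
    private final long sum;

    // The constructor is called once, so its invocation counter stays low.
    // The loop's back-edge counter is what becomes hot.
    OsrLoopDemo(int iterations) {
        long s = 0;
        for (int i = 0; i < iterations; i++) {
            s += i % 7;
        }
        sum = s;
    }

    long sum() {
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(new OsrLoopDemo(50_000_000).sum());
    }
}
```

Running this with PrintCompilation enabled should, on typical HotSpot JVMs, produce a line for OsrLoopDemo::<init> that carries the % attribute, although the compilation IDs, timestamps, and tier numbers will vary by release and machine.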
- The best way to gain visibility into how code is being compiled is by enabling PrintCompilation.
- Output from enabling PrintCompilation can be used to make sure that compilation is proceeding as expected.
Advanced Compiler Tunings
This section fills in some remaining details on how compilation works, and in the process explores some additional tunings that can affect it. However, although these values can be changed, there is really little reason to do so; the tunings exist to a large degree to help JVM engineers diagnose the behavior of the JVM. If you’re quite curious as to how the compiler works, then this section will be interesting to you; otherwise, feel free to read ahead.
Compilation Thresholds mentioned that when a method (or loop) becomes eligible for compilation, it is queued for compilation. That queue is processed by one or more background threads. This means that compilation is an asynchronous process, which is a good thing; it allows the program to continue executing even while the code in question is being compiled. If a method is compiled using standard compilation, then the next method invocation will execute the compiled method; if a loop is compiled using OSR, then the next iteration of the loop will execute the compiled code.
These queues are not strictly first in, first out: methods whose invocation
counters are higher have priority. So even when a program starts execution
and has lots of code to compile, this priority ordering helps to ensure
that the most important code will be compiled first. (This is another
reason why the compilation ID in the
PrintCompilation output can appear
out of order.)
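The priority ordering can be sketched with an ordinary priority queue. This is only a model of the idea described above: the class CompileQueueDemo, the Task type, and the method names and counts are assumptions, and the JVM's real queue selection logic is more involved.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative model only; not how the JVM implements its compile queues.
final class CompileQueueDemo {
    static final class Task {
        final String method;
        final int invocations;

        Task(String method, int invocations) {
            this.method = method;
            this.invocations = invocations;
        }
    }

    // Tasks with higher invocation counts are dequeued first.
    static PriorityQueue<Task> newQueue() {
        return new PriorityQueue<>(
            Comparator.comparingInt((Task t) -> t.invocations).reversed());
    }

    public static void main(String[] args) {
        PriorityQueue<Task> queue = newQueue();
        queue.add(new Task("Startup::init", 10));                  // queued first
        queue.add(new Task("StockPrice::getClosingPrice", 500));   // hottest
        queue.add(new Task("Util::format", 50));
        while (!queue.isEmpty()) {
            Task t = queue.poll();
            System.out.println(t.method + " (" + t.invocations + ")");
        }
    }
}
```

Because the hottest task leaves the queue first regardless of when it was added, a log of completed compilations would show its (later-assigned) ID ahead of the earlier ones.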
When the client compiler is in use, the JVM starts one compilation thread; the server compiler has two such threads. When tiered compilation is in effect, the JVM will by default start multiple client and server threads based on a somewhat complex equation involving double logs of the number of CPUs on the target platform. That works out to the values shown in Table 4-7.
|Number of CPUs|Number of C1 threads|Number of C2 threads|
The number of compiler threads (for all three compiler options) can be adjusted by setting the -XX:CICompilerCount=N flag (with a default value given in the previous table). That is the total number of threads the JVM will use to process the queue(s); for tiered compilation, one-third of them (but at least one) will be used to process the client compiler queue, and the remaining threads (and also at least one) will be used to process the server compiler queue.
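The one-third rule can be written out as simple arithmetic. The sketch below follows the description in the text; the class name CompilerThreadSplit is hypothetical, and rounding the one-third share down is an assumption about how the division is performed.

```java
// A sketch of the split described above, under the stated rounding assumption.
final class CompilerThreadSplit {
    // One-third of the total, but at least one, for the client (C1) queue.
    static int c1Threads(int total) {
        return Math.max(1, total / 3);
    }

    // The remaining threads, but at least one, for the server (C2) queue.
    static int c2Threads(int total) {
        return Math.max(1, total - c1Threads(total));
    }

    public static void main(String[] args) {
        for (int total : new int[] {2, 3, 6, 12}) {
            System.out.println(total + " total: C1=" + c1Threads(total)
                + ", C2=" + c2Threads(total));
        }
    }
}
```

Under these assumptions a total of 12 threads yields 4 for the client compiler queue and 8 for the server compiler queue.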
When might you consider adjusting this value? If a program is run on a single-CPU system, then having only one compiler thread might be slightly beneficial: there is limited CPU available, and having fewer threads contending for that resource will help performance in many circumstances. However, that advantage is limited only to the initial warm-up period; after that, the number of eligible methods to be compiled won’t really cause contention for the CPU. When the stock batching application was run on a single-CPU machine and the number of compiler threads was limited to one, the initial calculations were about 10% faster (since they didn’t have to compete for CPU as often). The more iterations that were run, the smaller the overall effect of that initial benefit, until all hot methods were compiled and the benefit was eliminated.
When tiered compilation is used, the number of threads can easily overwhelm the system, particularly if multiple JVMs are run at once (each of which will start many compilation threads). Reducing the number of threads in that case can help overall throughput (though again with the possible cost that the warm-up period will last longer).
Similarly, if lots of extra CPU cycles are available, then theoretically the program will benefit—at least during its warm-up period—when the number of compiler threads is increased. In real life, that benefit is extremely hard to come by. Further, if all that excess CPU is available, you’re much better off trying something that takes advantage of the available CPU cycles during the entire execution of the application (rather than just compiling faster at the beginning).
One other setting that applies to the compilation threads is the value of the -XX:+BackgroundCompilation flag, which by default is true. That setting means that the queue is processed asynchronously as just described. But that flag can be set to false, in which case when a method is eligible for compilation, code that wants to execute it will wait until it is in fact compiled (rather than continuing to execute in the interpreter). Background compilation is also disabled when -Xbatch is specified.
- Compilation occurs asynchronously for methods that are placed on the compilation queue.
- The queue is not strictly ordered; hot methods are compiled before other methods in the queue. This is another reason why compilation IDs can appear out of order in the compilation log.
One of the most important optimizations the compiler makes is to inline methods. Code that follows good object-oriented design often contains a number of attributes that are accessed via getters (and perhaps setters):
The overhead for invoking a method call like this is quite high, especially relative to the amount of code in the method. In fact, in the early days of Java, performance tips often argued against this sort of encapsulation precisely because of the performance impact of all those method calls. Fortunately, JVMs now routinely perform code inlining for these kinds of methods. Hence, you can write this code:
and the compiled code will essentially execute this:
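As a hedged illustration of the accessor pattern and of what inlining effectively does to a call site, the following sketch uses a hypothetical InlinePoint class (the class and method names are assumptions made for this example). The doubleXViaAccessors method is what a developer would write; doubleXDirect approximates what the compiled code executes once the getter and setter have been inlined.

```java
// Hypothetical class used only to illustrate getter/setter inlining.
final class InlinePoint {
    private int x;
    private int y;

    InlinePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int getX() { return x; }
    void setX(int i) { x = i; }
    int getY() { return y; }
    void setY(int i) { y = i; }

    // What the source code says: two method calls per update.
    static void doubleXViaAccessors(InlinePoint p) {
        p.setX(p.getX() * 2);
    }

    // Roughly what runs after inlining: direct field access, no calls.
    static void doubleXDirect(InlinePoint p) {
        p.x = p.x * 2;
    }
}
```

Both methods leave the object in the same state; the difference is only in how much call overhead the interpreter or compiled code has to pay.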
Inlining is enabled by default. It can be disabled using the -XX:-Inline flag, though it is such an important performance boost that you would never actually do that (for example, disabling inlining reduces the performance of the stock batching test by over 50%). Still, because inlining is so important, and perhaps because there are many other knobs to turn, recommendations are often made regarding tuning the inlining behavior of the JVM.
Unfortunately, there is no basic visibility into how the JVM inlines code. (If you compile the JVM from source, you can produce a debug version that includes the flag -XX:+PrintInlining. That flag provides all sorts of information about the inlining decisions that the compiler makes.) The best that can be done is to look at profiles of the code, and if there are simple methods near the top of the profiles that seem like they should be inlined, try some experiments with inlining flags.
The basic decision about whether to inline a method depends on how hot it is and its size. The JVM determines if a method is hot (i.e., called frequently) based on an internal calculation; it is not directly subject to any tunable parameters. If a method is eligible for inlining because it is called frequently, then it will be inlined only if its bytecode size is less than 325 bytes (or whatever is specified as the -XX:MaxFreqInlineSize=N flag). Otherwise, it is eligible for inlining only if it is small: less than 35 bytes (or whatever is specified as the -XX:MaxInlineSize=N flag).
Sometimes you will see recommendations that the value of the MaxInlineSize flag be increased so that more methods are inlined. One often overlooked aspect of this relationship is that setting the MaxInlineSize value higher than 35 means that a method might be inlined when it is first called. However, if the method is called frequently (in which case its performance matters much more), then it would have been inlined eventually (assuming its size is less than 325 bytes). Otherwise, the net effect of tuning the MaxInlineSize flag is that it might reduce the warm-up time needed for a test, but it is unlikely that it will have a big impact on a long-running application.
- Inlining is the most beneficial optimization the compiler can make, particularly for object-oriented code where attributes are well encapsulated.
- Tuning the inlining flags is rarely needed, and recommendations to do so often fail to account for the relationship between normal inlining and frequent inlining. Make sure to account for both cases when investigating the effects of inlining.
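The relationship between the two size limits can be sketched as a single predicate. This is a simplification under stated assumptions: the class InlineSizeRule is hypothetical, only the size check described in the text is modeled, and the JVM's own decision about whether a method is hot (and its other inlining heuristics) is reduced to a boolean parameter.

```java
// Simplified model of the size check only; not the JVM's full inlining policy.
final class InlineSizeRule {
    static final int DEFAULT_MAX_FREQ_INLINE_SIZE = 325; // limit for hot methods
    static final int DEFAULT_MAX_INLINE_SIZE = 35;       // limit for other methods

    static boolean sizeAllowsInlining(boolean hot, int bytecodeSize,
                                      int maxFreqInlineSize, int maxInlineSize) {
        int limit = hot ? maxFreqInlineSize : maxInlineSize;
        return bytecodeSize < limit;
    }

    public static void main(String[] args) {
        int size = 80; // bytecode size of a hypothetical method
        // Hot method: already eligible with the defaults.
        System.out.println(sizeAllowsInlining(true, size,
            DEFAULT_MAX_FREQ_INLINE_SIZE, DEFAULT_MAX_INLINE_SIZE));
        // Not hot, default limit of 35: too large.
        System.out.println(sizeAllowsInlining(false, size,
            DEFAULT_MAX_FREQ_INLINE_SIZE, DEFAULT_MAX_INLINE_SIZE));
        // Not hot, limit raised to 100: now eligible on first use.
        System.out.println(sizeAllowsInlining(false, size,
            DEFAULT_MAX_FREQ_INLINE_SIZE, 100));
    }
}
```

Raising the smaller limit changes the answer only for methods that are not hot; for a frequently called method of the same size, the larger limit already applied.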
The server compiler performs some very aggressive optimizations if escape analysis is enabled (-XX:+DoEscapeAnalysis, which is true by default). For example, consider this class to work with factorials:
To store the first 100 factorial values in an array, this code would be used:
The factorial object is referenced only inside that loop; no other code can ever access that object. Hence, the JVM is free to perform a number of optimizations on that object:
- It needn’t get a synchronization lock when calling the getFactorial() method.
- It needn’t store the field n in memory; it can keep that value in a register. Similarly, it can store the factorial object reference in a register.
- In fact, it needn’t allocate an actual factorial object at all; it can just keep track of the individual fields of the object.
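The kind of code these optimizations apply to can be sketched as follows. The listing is a reconstruction under assumptions, not the author's exact source: the names Factorial, getFactorial, and firstFactorials are illustrative, and the body of the factorial computation is filled in with a straightforward loop.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative class; the synchronized accessor and the int field n are the
// features that escape analysis can reason about.
final class Factorial {
    private BigInteger factorial;
    private final int n;

    Factorial(int n) {
        this.n = n;
    }

    synchronized BigInteger getFactorial() {
        if (factorial == null) {
            BigInteger result = BigInteger.ONE;
            for (int i = 2; i <= n; i++) {
                result = result.multiply(BigInteger.valueOf(i));
            }
            factorial = result;
        }
        return factorial;
    }

    // Stores the first 'count' factorial values (0! through (count-1)!).
    static List<BigInteger> firstFactorials(int count) {
        List<BigInteger> list = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // This object is only reachable from inside the loop body;
            // only the BigInteger result is stored in the list.
            Factorial factorial = new Factorial(i);
            list.add(factorial.getFactorial());
        }
        return list;
    }
}
```

Because no reference to the Factorial instance leaves the loop body, the compiler may be able to skip the lock, keep n in a register, and avoid the allocation altogether, as the list above describes.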
This kind of optimization is quite sophisticated: it is simple enough in this example, but these optimizations are possible even with more complex code. Depending on the code usage, not all optimizations will necessarily apply. But escape analysis can determine which of those optimizations are possible and make the necessary changes in the compiled code.
Escape analysis is enabled by default. In rare cases, it will get things wrong, in which case disabling it will lead to faster and/or more stable code. If you find this to be the case, then simplifying the code in question is the best course of action: simpler code will compile better. (It is a bug, however, and should be reported.)
- Escape analysis is the most sophisticated of the optimizations the compiler can perform. This is the kind of optimization that frequently causes microbenchmarks to go awry.
- Escape analysis can often introduce “bugs” into improperly synchronized code.
The discussion of the output of the PrintCompilation flag mentioned two cases where the compiler deoptimized the code. Deoptimization means that the compiler had to “undo” some previous compilation; the effect is that the performance of the application will be reduced, at least until the compiler can recompile the code in question.
There are two cases of deoptimization: when code is “made not entrant,” and when code is “made zombie.”
Not Entrant Code
Let’s look at the first case.
Recall that the stock application has an interface StockPriceHistory. In the sample code, this interface has two implementations: a basic one (StockPriceHistoryImpl) and one that adds logging (StockPriceHistoryLogger) to each operation. In the servlet code, the implementation used is based on the log parameter of the URL:
// Then the JSP makes calls to:
// and so on
If a bunch of calls are made to http://localhost:8080/StockServlet (that is, without the log parameter), the compiler will see that the actual type of the sph object is StockPriceHistoryImpl. It will then inline code and perform other optimizations based on that knowledge.
Later, say a call is made to
http://localhost:8080/StockServlet?log=true. Now the assumption the
compiler made regarding the type of the
sph object is false;
the previous optimizations
are no longer valid. This generates a deoptimization trap, and the previous
optimizations are discarded. If a lot of additional calls are made
with logging enabled, the JVM will quickly end up compiling that code and
making new optimizations.
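The scenario can be reproduced in miniature with a single call site that sees one implementation for a long time and then a second one. Everything in this sketch is hypothetical (the PriceHistory interface, the BasicHistory and LoggingHistory classes, and the DeoptDemo driver stand in for the stock application's types), and whether a given JVM actually deoptimizes here depends on its version and settings.

```java
// Hypothetical stand-ins for an interface with a basic and a logging implementation.
interface PriceHistory {
    double getHighPrice();
}

final class BasicHistory implements PriceHistory {
    private final double[] prices;

    BasicHistory(double... prices) {
        this.prices = prices.clone();
    }

    @Override
    public double getHighPrice() {
        double high = Double.NEGATIVE_INFINITY;
        for (double p : prices) {
            high = Math.max(high, p);
        }
        return high;
    }
}

final class LoggingHistory implements PriceHistory {
    private final PriceHistory delegate;
    private int calls; // stands in for real logging

    LoggingHistory(PriceHistory delegate) {
        this.delegate = delegate;
    }

    @Override
    public double getHighPrice() {
        calls++;
        return delegate.getHighPrice();
    }

    int calls() {
        return calls;
    }
}

final class DeoptDemo {
    // Mirrors choosing the implementation from a "log" request parameter.
    static PriceHistory choose(String log) {
        PriceHistory basic = new BasicHistory(10.0, 12.5, 11.0);
        return "true".equals(log) ? new LoggingHistory(basic) : basic;
    }

    // The call site below is where the compiler may assume a single receiver type.
    static double drive(PriceHistory sph, int calls) {
        double last = 0;
        for (int i = 0; i < calls; i++) {
            last = sph.getHighPrice();
        }
        return last;
    }

    public static void main(String[] args) {
        drive(choose(null), 1_000_000);   // only BasicHistory is seen at first
        drive(choose("true"), 1_000_000); // a second type arrives later
    }
}
```

Running DeoptDemo with PrintCompilation enabled may show DeoptDemo::drive being made not entrant once the second phase starts, followed by a fresh compilation of the same method.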
The compilation log for that scenario will include lines such as the following:
841113 25 % net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes) made not entrant
841113 937 s net.sdo.StockPriceHistoryImpl::process (248 bytes) made not entrant
1322722 25 % net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes) made zombie
1322722 937 s net.sdo.StockPriceHistoryImpl::process (248 bytes) made zombie
Note that both the OSR-compiled constructor and the standard-compiled methods have been made not entrant, and some time much later, they are made zombie.
Deoptimization sounds like a bad thing, at least in terms of performance, but that isn’t necessarily the case. The first example in this chapter that used the stock servlet application measured only the performance of the URL that triggers the StockPriceHistoryImpl path. With a 300-second warm-up, recall that test achieved about 24.4 OPS with tiered compilation.
Suppose that immediately after that test, a test is run that triggers the StockPriceHistoryLogger path; that is the scenario I ran to produce the deoptimization examples just listed. The full output of PrintCompilation shows that all the methods of the StockPriceHistoryImpl class get deoptimized when the requests for the logging implementation are started. But after deoptimization, if the path that uses the StockPriceHistoryImpl implementation is rerun, that code will get recompiled (with slightly different assumptions), and we will still end up seeing about 24.4 OPS (after another warm-up period).
That’s the best case, of course. What happens if the calls are intermingled such that the compiler can never really assume which path the code will take? Because of the extra logging, the path that includes the logging gets about 24.1 OPS through the servlet. If operations are mixed, we get about 24.3 OPS: just about what would be expected from an average. Similar results are observed in the batch program. So aside from a momentary point where the trap is processed, deoptimization has not affected the performance in any significant way.
The second thing that can cause code to be made not entrant is due to the way tiered compilation works. In tiered compilation, code is compiled by the client compiler, and then later compiled by the server compiler (and actually it’s a little more complicated than that, as discussed in the next section). When the code compiled by the server compiler is ready, the JVM must replace the code compiled by the client compiler. It does this by marking the old code as not entrant and using the same mechanism to substitute the newly compiled (and more efficient) code. Hence, when a program is run with tiered compilation, the compilation log will show a slew of methods that are made not entrant. Don’t panic: this “deoptimization” is in fact making the code that much faster.
The way to detect this is to pay attention to the tier level in the compilation log:
40915 84 % 3 net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
40923 3697 3 net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
41418 87 % 4 net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
41434 84 % 3 net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes) made not entrant
41458 3749 4 net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
41469 3697 3 net.sdo.StockPriceHistoryImpl::<init> (156 bytes) made not entrant
42772 3697 3 net.sdo.StockPriceHistoryImpl::<init> (156 bytes) made zombie
42861 84 % 3 net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes) made zombie
Here, the constructor is first OSR-compiled at level 3, and then fully compiled also at level 3. A second later, the OSR code becomes eligible for level 4 compilation, so it is compiled at level 4 and the level 3 OSR code is made not entrant. The same process then occurs for the standard compilation, and then finally the level 3 code becomes a zombie.
Deoptimizing Zombie Code
When the compilation log reports that it has made zombie code, it is saying that it has reclaimed some previous code that was made not entrant. In the last example, after a test was run with the StockPriceHistoryLogger implementation, the code for the StockPriceHistoryImpl class was made not entrant. But there were still objects of the StockPriceHistoryImpl class around. Eventually all those objects were reclaimed by GC. When that happened, the compiler noticed that the methods of that class were now eligible to be marked as zombie code.
For performance, this is a good thing. Recall that the compiled code is held in a fixed-size code cache; when zombie methods are identified, it means that the code in question can be removed from the code cache, making room for other classes to be compiled (or limiting the amount of memory the JVM will need to allocate later).
The possible downside here is that if the code for the class is made zombie and then later reloaded and heavily used again, the JVM will need to recompile and reoptimize the code. Still, that’s exactly what happened in the scenario described above where the test was run without logging, then with logging, and then without logging; performance in that case was not noticeably affected. In general, the small recompilations that occur when zombie code is recompiled will not have a measurable effect on most applications.
- Deoptimization allows the compiler to back out previous versions of compiled code.
- Code is deoptimized when previous optimizations are no longer valid (e.g., because the type of the objects in question has changed).
- There is usually a small, momentary effect in performance when code is deoptimized, but the new code usually warms up quickly again.
- Under tiered compilation, code is deoptimized when it had previously been compiled by the client compiler and has now been optimized by the server compiler.
Tiered Compilation Levels
The compilation log for a program using tiered compilation prints the tier level at which each method is compiled. In the example from the last section, code was compiled up through level 4, even though to simplify the discussion so far, I’ve said there are only two compilers (plus the interpreter).
It turns out that there are five levels of execution, because the client compiler has three different levels. So the level of compilation runs from:
- 0: Interpreted code
- 1: Simple C1 compiled code
- 2: Limited C1 compiled code
- 3: Full C1 compiled code
- 4: C2 compiled code
A typical compilation log shows that most methods are first compiled at level 3: full C1 compilation. (All methods start at level 0, of course.) If they run often enough, they will get compiled at level 4 (and the level 3 code will be made not entrant). This is the most frequent path: the client compiler waits to compile something until it has information about how the code is used that it can leverage to perform optimizations.
If the server compiler queue is full, methods will be pulled from the server queue and compiled at level 2, which is the level at which the C1 compiler uses the invocation and back-edge counters (but doesn’t require profile feedback). That gets the method compiled more quickly; the method will later be compiled at level 3 after the C1 compiler has gathered profile information, and finally compiled at level 4 when the server compiler queue is less busy.
On the other hand, if the client compiler queue is full, a method that is scheduled for compilation at level 3 may become eligible for level 4 compilation while still waiting to be compiled at level 3. In that case, it is quickly compiled to level 2 and then transitioned to level 4.
Trivial methods may start at either level 2 or level 3 but then go to level 1 because of their trivial nature. If the server compiler for some reason cannot compile the code, it will also go to level 1.
And of course when code is deoptimized, it goes to level 0.
There are flags that control some of this behavior, but expecting results when tuning at this level is quite optimistic. The best case for performance happens when methods are compiled as expected: tier 0 → tier 3 → tier 4. If methods frequently get compiled into tier 2 and extra CPU cycles are available, consider increasing the number of compiler threads; that will reduce the size of the server compiler queue. If no extra CPU cycles are available, then all you can do is attempt to reduce the size of the application.
- Tiered compilation can operate at five distinct levels among the two compilers.
- Changing the path between levels is not recommended; this section just helps to explain the output of the compilation log.
This chapter has provided a lot of details about how just-in-time compilation works. From a tuning perspective, the simple choice here is to use the server compiler with tiered compilation for virtually everything; this will solve 90% of compiler-related performance issues. Just make sure that the code cache is sized large enough, and the compiler will provide pretty much all the performance that is possible.
This chapter also contains a lot of background about how the compiler works. One reason for this is so you can understand some of the general recommendations made in Chapter 1 regarding small methods and simple code, and the effects of the compiler on microbenchmarks that were described in Chapter 2. In particular:
- Don’t be afraid of small methods—and in particular getters and setters—because they are easily inlined. If you have a feeling that the method overhead can be expensive, you’re correct in theory (we showed that removing inlining has a huge impact on performance). But it’s not the case in practice, since the compiler fixes that problem.
- Code that needs to be compiled sits in a compilation queue. The more code in the queue, the longer the program will take to achieve optimal performance.
- Although you can (and should) size the code cache, it is still a finite resource.
- The simpler the code, the more optimizations that can be performed on it. Profile feedback and escape analysis can yield much faster code, but complex loop structures and large methods limit their effectiveness.
Finally, if you profile your code and find some surprising methods at the top of your profile—methods you expect shouldn’t be there—you can use the information here to look into what the compiler is doing and to make sure it can handle the way your code is written.