Chapter 4. Working with the JIT Compiler
The just-in-time (JIT) compiler is the heart of the Java Virtual Machine; nothing controls the performance of your application more than the JIT compiler.
This chapter covers the compiler in depth. It starts with information on how the compiler works and discusses the advantages and disadvantages of using a JIT compiler. Until JDK 8 came along, you had to choose between two Java compilers. Today, those two compilers still exist but work in concert with each other, though in rare cases choosing one is necessary. Finally, we’ll look at some intermediate and advanced tunings of the compiler. If an application is running slowly without any obvious reason, those sections can help you determine whether the compiler is at fault.
Just-in-Time Compilers: An Overview
We’ll start with some introductory material; feel free to skip ahead if you understand the basics of just-in-time compilation.
Computers—and more specifically CPUs—can execute only a relatively few, specific instructions, which are called machine code. All programs that the CPU executes must therefore be translated into these instructions.
Languages like C++ and Fortran are called compiled languages because their programs are delivered as binary (compiled) code: the program is written, and then a static compiler produces a binary. The assembly code in that binary is targeted to a particular CPU. Compatible CPUs can execute the same binary: for example, AMD and Intel CPUs share a basic, common set of assembly language instructions, and later versions of CPUs almost always can execute the same set of instructions as previous versions of that CPU. The reverse is not always true; new versions of CPUs often introduce instructions that will not run on older versions of CPUs.
Languages like PHP and Perl, on the other hand, are interpreted. The same program source code can be run on any CPU as long as the machine has the correct interpreter (that is, the program called php or perl). The interpreter translates each line of the program into binary code as that line is executed.
Each system has advantages and disadvantages. Programs written in interpreted languages are portable: you can take the same code and drop it on any machine with the appropriate interpreter, and it will run. However, it might run slowly. As a simple case, consider what happens in a loop: the interpreter will retranslate each line of code when it is executed in the loop. The compiled code doesn’t need to repeatedly make that translation.
A good compiler takes several factors into account when it produces a binary. One simple example is the order of the binary statements: not all assembly language instructions take the same amount of time to execute. A statement that adds the values stored in two registers might execute in one cycle, but retrieving (from main memory) the values needed for the addition may take multiple cycles.
Hence, a good compiler will produce a binary that executes the statement to load the data, executes other instructions, and then—when the data is available—executes the addition. An interpreter that is looking at only one line of code at a time doesn’t have enough information to produce that kind of code; it will request the data from memory, wait for it to become available, and then execute the addition. Bad compilers will do the same thing, by the way, and it is not necessarily the case that even the best compiler can prevent the occasional wait for an instruction to complete.
For these (and other) reasons, interpreted code will almost always be measurably slower than compiled code: compilers have enough information about the program to provide optimizations to the binary code that an interpreter simply cannot perform.
Interpreted code does have the advantage of portability. A binary compiled for an ARM CPU obviously cannot run on an Intel CPU. But a binary that uses the latest AVX instructions of Intel’s Sandy Bridge processors cannot run on older Intel processors either. Hence, commercial software is commonly compiled to a fairly old version of a processor and does not take advantage of the newest instructions available to it. Various tricks around this exist, including shipping a binary with multiple shared libraries that execute performance-sensitive code and come with versions for various flavors of a CPU.
Java attempts to find a middle ground here. Java applications are compiled, but instead of being compiled into a specific binary for a specific CPU, they are compiled into an intermediate low-level language. This language (known as Java bytecode) is then run by the java binary (in the same way that an interpreted PHP script is run by the php binary). This gives Java the platform independence of an interpreted language. Because it is executing an idealized binary code, the java program is able to compile the code into the platform binary as the code executes. This compilation occurs as the program is executed: it happens “just in time.”
This compilation is still subject to platform dependencies. JDK 8, for example, cannot generate code for the latest instruction set of Intel’s Skylake processors, though JDK 11 can. I’ll have more to say about that in “Advanced Compiler Flags”.
The manner in which the Java Virtual Machine compiles this code as it executes is the focus of this chapter.
HotSpot Compilation
As discussed in Chapter 1, the Java implementation discussed in this book is Oracle’s HotSpot JVM. This name (HotSpot) comes from the approach it takes toward compiling the code. In a typical program, only a small subset of code is executed frequently, and the performance of an application depends primarily on how fast those sections of code are executed. These critical sections are known as the hot spots of the application; the more the section of code is executed, the hotter that section is said to be.
Hence, when the JVM executes code, it does not begin compiling the code immediately. There are two basic reasons for this. First, if the code is going to be executed only once, then compiling it is essentially a wasted effort; it will be faster to interpret the Java bytecodes than to compile them and execute (only once) the compiled code.
But if the code in question is a frequently called method or a loop that runs many iterations, then compiling it is worthwhile: the cycles it takes to compile the code will be outweighed by the savings in multiple executions of the faster compiled code. That trade-off is one reason that the compiler executes the interpreted code first—the compiler can figure out which methods are called frequently enough to warrant their compilation.
The second reason is one of optimization: the more times that the JVM executes a particular method or loop, the more information it has about that code. This allows the JVM to make numerous optimizations when it compiles the code.
Those optimizations (and ways to affect them) are discussed later in this chapter, but for a simple example, consider the equals() method. This method exists in every Java object (because it is inherited from the Object class) and is often overridden. When the interpreter encounters the statement b = obj1.equals(obj2), it must look up the type (class) of obj1 in order to know which equals() method to execute. This dynamic lookup can be somewhat time-consuming.

Over time, say the JVM notices that each time this statement is executed, obj1 is of type java.lang.String. Then the JVM can produce compiled code that directly calls the String.equals() method. Now the code is faster not only because it is compiled but also because it can skip the lookup of which method to call.

It’s not quite as simple as that; it is possible the next time the code is executed that obj1 refers to something other than a String. The JVM will create compiled code that deals with that possibility, which will involve deoptimizing and then reoptimizing the code in question (you’ll see an example in “Deoptimization”). Nonetheless, the overall compiled code here will be faster (at least as long as obj1 continues to refer to a String) because it skips the lookup of which method to execute. That kind of optimization can be made only after running the code for a while and observing what it does: this is the second reason JIT compilers wait to compile sections of code.
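To make that concrete, here is a minimal sketch (the class and method names are hypothetical, not from the book’s sample code) of the kind of call site the JIT can optimize this way. If profiling shows that obj1 at this call site is always a String, the compiled code can call String.equals() directly (and perhaps inline it) instead of performing the virtual lookup, typically guarded by a type check in case another type shows up later:

public class EqualsExample {
    // A virtual call site: which equals() runs depends on the runtime type of obj1.
    static boolean same(Object obj1, Object obj2) {
        return obj1.equals(obj2);
    }

    public static void main(String[] args) {
        boolean b = false;
        for (int i = 0; i < 1_000_000; i++) {
            // obj1 is always a String here, so the JIT can assume String.equals().
            b = same("stock" + (i % 10), "stock5");
        }
        System.out.println(b);
    }
}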
Quick Summary
- Java is designed to take advantage of the platform independence of scripting languages and the native performance of compiled languages.
- A Java class file is compiled into an intermediate language (Java bytecodes) that is then further compiled into assembly language by the JVM.
- Compilation of the bytecodes into assembly language performs optimizations that greatly improve performance.
Tiered Compilation
Once upon a time, the JIT compiler came in two flavors, and you had to install different versions of the JDK depending on which compiler you wanted to use. These compilers are known as the client and server compilers. In 1996, this was an important distinction; in 2020, not so much. Today, all shipping JVMs include both compilers (though in common usage, they are usually referred to as server JVMs).
Despite being called server JVMs, the distinction between client and server compilers persists; both compilers are available to and used by the JVM. So knowing this difference is important in understanding how the compiler works.
Historically, JVM developers (and even some tools) sometimes referred to the compilers by the names C1 (compiler 1, client compiler) and C2 (compiler 2, server compiler). Those names are more apt now, since any distinction between a client and server computer is long gone, so we’ll adopt those names throughout.
The primary difference between the two compilers is their aggressiveness in compiling code. The C1 compiler begins compiling sooner than the C2 compiler does. This means that during the beginning of code execution, the C1 compiler will be faster, because it will have compiled correspondingly more code than the C2 compiler.
The engineering trade-off here is the knowledge the C2 compiler gains while it waits: that knowledge allows the C2 compiler to make better optimizations in the compiled code. Ultimately, code produced by the C2 compiler will be faster than that produced by the C1 compiler. From a user’s perspective, the benefit to that trade-off is based on how long the program will run and how important the startup time of the program is.
When these compilers were separate, the obvious question was why there needed to be a choice at all: couldn’t the JVM start with the C1 compiler and then use the C2 compiler as code gets hotter? That technique is known as tiered compilation, and it is the technique all JVMs now use. It can be explicitly disabled with the -XX:-TieredCompilation flag (the default value of which is true); in “Advanced Compiler Flags”, we’ll discuss the ramifications of doing that.
Common Compiler Flags
Two commonly used flags affect the JIT compiler; we’ll look at them in this section.
Tuning the Code Cache
When the JVM compiles code, it holds the set of assembly-language instructions in the code cache. The code cache has a fixed size, and once it has filled up, the JVM is not able to compile any additional code.
It is easy to see the potential issue here if the code cache is too small. Some hot methods will get compiled, but others will not: the application will end up running a lot of (very slow) interpreted code.
When the code cache fills up, the JVM spits out this warning:
Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
It is sometimes easy to miss this message; another way to determine if the compiler has ceased to compile code is to follow the output of the compilation log discussed later in this section.
There really isn’t a good mechanism to figure out how much code cache a particular application needs. Hence, when you need to increase the code cache size, it is sort of a hit-and-miss operation; a typical option is to simply double or quadruple the default.
The maximum size of the code cache is set via the -XX:ReservedCodeCacheSize=N flag (where N is the desired maximum size; the default is given next). The code cache is managed like most memory in the JVM: there is an initial size (specified by -XX:InitialCodeCacheSize=N). Allocation of the code cache size starts at the initial size and increases as the cache fills up. The initial size of the code cache is 2,496 KB, and the default maximum size is 240 MB. Resizing the cache happens in the background and doesn’t really affect performance, so setting the ReservedCodeCacheSize size (i.e., setting the maximum code cache size) is all that is generally needed.
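For example, a command line like the following (the application name is a placeholder) doubles the default maximum code cache size:

$ java -XX:ReservedCodeCacheSize=480m -jar app.jar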
Is there a disadvantage to specifying a really large value for the maximum code cache size so that it never runs out of space? It depends on the resources available on the target machine. If a 1 GB code cache size is specified, the JVM will reserve 1 GB of native memory. That memory isn’t allocated until needed, but it is still reserved, which means that sufficient virtual memory must be available on your machine to satisfy the reservation.
In addition, if you still have an old Windows machine with a 32-bit JVM, the total process size cannot exceed 4 GB. That includes the Java heap, space for all the code of the JVM itself (including its native libraries and thread stacks), any native memory the application allocates (either directly or via the New I/O [NIO] libraries), and of course the code cache.
Those are the reasons the code cache is not unbounded and sometimes requires tuning for large applications. On 64-bit machines with sufficient memory, setting the value too high is unlikely to have a practical effect on the application: the application won’t run out of process space memory, and the extra memory reservation will generally be accepted by the operating system.
In Java 11, the code cache is segmented into three parts:
- Nonmethod code
- Profiled code
- Nonprofiled code
By default, the code cache is sized the same way (up to 240 MB), and you can still adjust the total size of the code cache by using the ReservedCodeCacheSize flag. In that case, the nonmethod code segment is allocated space according to the number of compiler threads (see “Compilation Threads”); on a machine with four CPUs, it will be about 5.5 MB. The other two segments then equally divide the remaining total code cache: for example, about 117.2 MB each on the machine with four CPUs (yielding 240 MB total).
You’ll rarely need to tune these segments individually, but if so, the flags are as follows:
- -XX:NonNMethodCodeHeapSize=N for the nonmethod code
- -XX:ProfiledCodeHeapSize=N for the profiled code
- -XX:NonProfiledCodeHeapSize=N for the nonprofiled code
The size of the code cache (and the JDK 11 segments) can be monitored in real time by using jconsole and selecting the Memory Pool Code Cache chart on the Memory panel. You can also enable Java’s Native Memory Tracking feature as described in Chapter 8.
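As a sketch of the second approach (the application name and process ID are placeholders, and the exact output format varies by JDK release), you start the JVM with Native Memory Tracking enabled and then query it with jcmd:

$ java -XX:NativeMemoryTracking=summary -jar app.jar
$ jcmd <pid> VM.native_memory summary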
Inspecting the Compilation Process
The second flag isn’t a tuning per se: it will not improve the performance of an application. Rather, the -XX:+PrintCompilation flag (which by default is false) gives us visibility into the workings of the compiler (though we’ll also look at tools that provide similar information).

If PrintCompilation is enabled, every time a method (or loop) is compiled, the JVM prints out a line with information about what has just been compiled.
Most lines of the compilation log have the following format:
timestamp compilation_id attributes (tiered_level) method_name size deopt
The timestamp here is the time after the compilation has finished (relative to 0, which is when the JVM started).
The compilation_id is an internal task ID. Usually, this number will simply increase monotonically, but sometimes you may see an out-of-order compilation ID. This happens most frequently when there are multiple compilation threads and indicates that compilation threads are running faster or slower relative to each other. Don’t conclude, though, that one particular compilation task was somehow inordinately slow: it is usually just a function of thread scheduling.
The attributes field is a series of five characters that indicates the state of the code being compiled. If a particular attribute applies to the given compilation, the character shown in the following list is printed; otherwise, a space is printed for that attribute. Hence, the five-character attribute string may appear as two or more items separated by spaces. The various attributes are as follows:

- %: The compilation is OSR.
- s: The method is synchronized.
- !: The method has an exception handler.
- b: Compilation occurred in blocking mode.
- n: Compilation occurred for a wrapper to a native method.
The first of these attributes refers to on-stack replacement (OSR). JIT compilation is an asynchronous process: when the JVM decides that a certain method should be compiled, that method is placed in a queue. Rather than wait for the compilation, the JVM then continues interpreting the method, and the next time the method is called, the JVM will execute the compiled version of the method (assuming the compilation has finished, of course).
But consider a long-running loop. The JVM will notice that the loop itself should be compiled and will queue that code for compilation. But that isn’t sufficient: the JVM has to have the ability to start executing the compiled version of the loop while the loop is still running—it would be inefficient to wait until the loop and enclosing method exit (which may not even happen). Hence, when the code for the loop has finished compiling, the JVM replaces the code (on stack), and the next iteration of the loop will execute the much faster compiled version of the code. This is OSR.
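As a simple, hypothetical illustration, the loop in the following main() method is a candidate for OSR compilation: main() itself is called only once, but its loop runs long enough to become hot, so running this class with -XX:+PrintCompilation will typically show a line with the % attribute for it:

public class OsrExample {
    public static void main(String[] args) {
        long sum = 0;
        // main() is called only once, but this loop iterates enough times
        // for the JVM to compile it via on-stack replacement.
        for (int i = 0; i < 1_000_000_000; i++) {
            sum += i % 7;
        }
        System.out.println(sum);
    }
}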
The next two attributes should be self-explanatory. The blocking flag will never be printed by default in current versions of Java; it indicates that compilation did not occur in the background (see “Compilation Threads” for more details). Finally, the native attribute indicates that the JVM generated compiled code to facilitate the call into a native method.
If tiered compilation has been disabled, the next field (tiered_level) will be blank. Otherwise, it will be a number indicating which tier has completed compilation.

Next comes the name of the method being compiled (or the method containing the loop being compiled for OSR), which is printed as ClassName::method.

Next is the size (in bytes) of the code being compiled. This is the size of the Java bytecodes, not the size of the compiled code (so, unfortunately, this can’t be used to predict how large to size the code cache).

Finally, in some cases a message at the end of the compilation line will indicate that some sort of deoptimization has occurred; these are typically the phrases made not entrant or made zombie. See “Deoptimization” for more details.
The compilation log may also include a line that looks like this:
timestamp compile_id COMPILE SKIPPED: reason
This line (with the literal text COMPILE SKIPPED) indicates that something has gone wrong with the compilation of the given method. In two cases this is expected, depending on the reason specified:

- Code cache filled: The size of the code cache needs to be increased using the ReservedCodeCacheSize flag.
- Concurrent classloading: The class was modified as it was being compiled. The JVM will compile it again later; you should expect to see the method recompiled later in the log.
In all cases (except the cache being filled), the compilation should be reattempted. If it is not, an error prevents compilation of the code. This is often a bug in the compiler, but the usual remedy in all cases is to refactor the code into something simpler that the compiler can handle.
Here are a few lines of output from enabling PrintCompilation on the stock REST application:

  28015  850    4   net.sdo.StockPrice::getClosingPrice (5 bytes)
  28179  905  s 3   net.sdo.StockPriceHistoryImpl::process (248 bytes)
  28226   25  % 3   net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
  28244  935    3   net.sdo.MockStockPriceEntityManagerFactory$\
                     MockStockPriceEntityManager::find (507 bytes)
  29929  939    3   net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
 106805 1568  ! 4   net.sdo.StockServlet::processRequest (197 bytes)
This output includes only a few of the stock-related methods (and not necessarily all of the lines related to a particular method). A few interesting things to note: the first such method wasn’t compiled until 28 seconds after the server was started, and 849 methods were compiled before it. In this case, all those other methods were methods of the server or JDK (filtered out of this output). The server took about 2 seconds to start; the remaining 26 seconds before anything else was compiled were essentially idle as the application server waited for requests.
The remaining lines are included to point out interesting features. The process() method is synchronized, so the attributes include an s. Inner classes are compiled just like any other class and appear in the output with the usual Java nomenclature: outer-classname$inner-classname. The processRequest() method shows up with the exception handler as expected.

Finally, recall the implementation of the StockPriceHistoryImpl constructor, which contains a large loop:
public StockPriceHistoryImpl(String s, Date startDate, Date endDate) {
    EntityManager em = emf.createEntityManager();
    Date curDate = new Date(startDate.getTime());
    symbol = s;
    while (!curDate.after(endDate)) {
        StockPrice sp = em.find(StockPrice.class, new StockPricePK(s, curDate));
        if (sp != null) {
            if (firstDate == null) {
                firstDate = (Date) curDate.clone();
            }
            prices.put((Date) curDate.clone(), sp);
            lastDate = (Date) curDate.clone();
        }
        curDate.setTime(curDate.getTime() + msPerDay);
    }
}
The loop is executed more often than the constructor itself, so the loop is subject to OSR compilation. Note that it took a while for that method to be compiled; its compilation ID is 25, but it doesn’t appear until other methods in the 900 range are being compiled. (It’s easy to read OSR lines like this example as 25% and wonder about the other 75%, but remember that the number is the compilation ID, and the % just signifies OSR compilation.) That is typical of OSR compilation; the stack replacement is harder to set up, but other compilation can continue in the meantime.
Tiered Compilation Levels
The compilation log for a program using tiered compilation prints the tier level at which each method is compiled. In the sample output, code was compiled either at level 3 or 4, even though we’ve discussed only two compilers (plus the interpreter) so far. It turns out that there are five levels of compilation, because the C1 compiler has three levels. So the levels of compilation are as follows:
- 0: Interpreted code
- 1: Simple C1 compiled code
- 2: Limited C1 compiled code
- 3: Full C1 compiled code
- 4: C2 compiled code
A typical compilation log shows that most methods are first compiled at level 3: full C1 compilation. (All methods start at level 0, of course, but that doesn’t appear in the log.) If a method runs often enough, it will get compiled at level 4 (and the level 3 code will be made not entrant). This is the most frequent path: the C1 compiler waits to compile something until it has information about how the code is used that it can leverage to perform optimizations.
If the C2 compiler queue is full, methods will be pulled from the C2 queue and compiled at level 2, which is the level at which the C1 compiler uses the invocation and back-edge counters (but doesn’t require profile feedback). That gets the method compiled more quickly; the method will later be compiled at level 3 after the C1 compiler has gathered profile information, and finally compiled at level 4 when the C2 compiler queue is less busy.
On the other hand, if the C1 compiler queue is full, a method that is scheduled for compilation at level 3 may become eligible for level 4 compilation while still waiting to be compiled at level 3. In that case, it is quickly compiled to level 2 and then transitioned to level 4.
Trivial methods may start in either level 2 or 3 but then go to level 1 because of their trivial nature. If the C2 compiler for some reason cannot compile the code, it will also go to level 1. And, of course, when code is deoptimized, it goes to level 0.
Flags control some of this behavior, but expecting results when tuning at this level is optimistic. The best case for performance happens when methods are compiled as expected: tier 0 → tier 3 → tier 4. If methods frequently get compiled into tier 2 and extra CPU cycles are available, consider increasing the number of compiler threads; that will reduce the size of the C2 compiler queue. If no extra CPU cycles are available, all you can do is attempt to reduce the size of the application.
Deoptimization
The discussion of the output of the PrintCompilation flag mentioned two cases of the compiler deoptimizing the code. Deoptimization means that the compiler has to “undo” a previous compilation. The effect is that the performance of the application will be reduced, at least until the compiler can recompile the code in question. Deoptimization occurs in two cases: when code is made not entrant and when code is made zombie.
Not entrant code
Two things cause code to be made not entrant. One is due to the way classes and interfaces work, and one is an implementation detail of tiered compilation.
Let’s look at the first case. Recall that the stock application has an interface StockPriceHistory. In the sample code, this interface has two implementations: a basic one (StockPriceHistoryImpl) and one that adds logging (StockPriceHistoryLogger) to each operation. In the REST code, the implementation used is based on the log parameter of the URL:
StockPriceHistory sph;
String log = request.getParameter("log");
if (log != null && log.equals("true")) {
    sph = new StockPriceHistoryLogger(...);
}
else {
    sph = new StockPriceHistoryImpl(...);
}
// Then the JSP makes calls to:
sph.getHighPrice();
sph.getStdDev();
// and so on
If a bunch of calls are made to http://localhost:8080/StockServlet (that is, without the log parameter), the compiler will see that the actual type of the sph object is StockPriceHistoryImpl. It will then inline code and perform other optimizations based on that knowledge.

Later, say a call is made to http://localhost:8080/StockServlet?log=true. Now the assumption the compiler made regarding the type of the sph object is incorrect; the previous optimizations are no longer valid. This generates a deoptimization trap, and the previous optimizations are discarded. If a lot of additional calls are made with logging enabled, the JVM will quickly end up compiling that code and making new optimizations.
The compilation log for that scenario will include lines such as the following:
 841113   25 %   net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)   made not entrant
 841113  937 s   net.sdo.StockPriceHistoryImpl::process (248 bytes)   made not entrant
1322722   25 %   net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)   made zombie
1322722  937 s   net.sdo.StockPriceHistoryImpl::process (248 bytes)   made zombie
Note that both the OSR-compiled constructor and the standard-compiled methods have been made not entrant, and some time much later, they are made zombie.
Deoptimization sounds like a bad thing, at least in terms of performance, but that isn’t necessarily the case. Table 4-1 shows the operations per second that the REST server achieves under deoptimization scenarios.
Scenario | OPS
---|---
Standard implementation | 24.4
Standard implementation after deopt | 24.4
Logging implementation | 24.1
Mixed impl | 24.3
The standard implementation will give us 24.4 OPS. Suppose that immediately after that test, a test is run that triggers the StockPriceHistoryLogger path; that is the scenario that ran to produce the deoptimization examples just listed. The full output of PrintCompilation shows that all the methods of the StockPriceHistoryImpl class get deoptimized when the requests for the logging implementation are started. But after deoptimization, if the path that uses the StockPriceHistoryImpl implementation is rerun, that code will get recompiled (with slightly different assumptions), and we will still end up seeing about 24.4 OPS (after another warm-up period).
That’s the best case, of course. What happens if the calls are intermingled such that the compiler can never really assume which path the code will take? Because of the extra logging, the path that includes the logging gets about 24.1 OPS through the server. If operations are mixed, we get about 24.3 OPS: just about what would be expected from an average. So aside from a momentary point where the trap is processed, deoptimization has not affected the performance in any significant way.
The second thing that can cause code to be made not entrant is the way tiered compilation works. When code is compiled by the C2 compiler, the JVM must replace the code already compiled by the C1 compiler. It does this by marking the old code as not entrant and using the same deoptimization mechanism to substitute the newly compiled (and more efficient) code. Hence, when a program is run with tiered compilation, the compilation log will show a slew of methods that are made not entrant. Don’t panic: this “deoptimization” is, in fact, making the code that much faster.
The way to detect this is to pay attention to the tier level in the compilation log:
 40915   84 %  3  net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
 40923 3697    3  net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
 41418   87 %  4  net.sdo.StockPriceHistoryImpl::<init> @ 48 (156 bytes)
 41434   84 %  3  net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)   made not entrant
 41458 3749    4  net.sdo.StockPriceHistoryImpl::<init> (156 bytes)
 41469 3697    3  net.sdo.StockPriceHistoryImpl::<init> (156 bytes)   made not entrant
 42772 3697    3  net.sdo.StockPriceHistoryImpl::<init> (156 bytes)   made zombie
 42861   84 %  3  net.sdo.StockPriceHistoryImpl::<init> @ -2 (156 bytes)   made zombie
Here, the constructor is first OSR-compiled at level 3 and then fully compiled also at level 3. A second later, the OSR code becomes eligible for level 4 compilation, so it is compiled at level 4 and the level 3 OSR code is made not entrant. The same process then occurs for the standard compilation, and finally the level 3 code becomes a zombie.
Deoptimizing zombie code
When the compilation log reports that it has made zombie code, it is saying that it has reclaimed previous code that was made not entrant. In the preceding example, after a test was run with the StockPriceHistoryLogger implementation, the code for the StockPriceHistoryImpl class was made not entrant. But objects of the StockPriceHistoryImpl class remained. Eventually all those objects were reclaimed by GC. When that happened, the compiler noticed that the methods of that class were now eligible to be marked as zombie code.
For performance, this is a good thing. Recall that the compiled code is held in a fixed-size code cache; when zombie methods are identified, the code in question can be removed from the code cache, making room for other classes to be compiled (or limiting the amount of memory the JVM will need to allocate later).
The possible downside is that if the code for the class is made zombie and then later reloaded and heavily used again, the JVM will need to recompile and reoptimize the code. Still, that’s exactly what happened in the previous scenario, where the test was run without logging, then with logging, and then without logging; performance in that case was not noticeably affected. In general, the small recompilations that occur when zombie code is recompiled will not have a measurable effect on most applications.
Quick Summary
- The best way to gain visibility into how code is being compiled is by enabling PrintCompilation.
- Output from enabling PrintCompilation can be used to make sure that compilation is proceeding as expected.
- Tiered compilation can operate at five distinct levels among the two compilers.
- Deoptimization is the process by which the JVM replaces previously compiled code. This usually happens in the context of C2 code replacing C1 code, but it can happen because of changes in the execution profile of an application.
Advanced Compiler Flags
This section covers a few other flags that affect the compiler. Mostly, this gives you a chance to understand even better how the compiler works; these flags should not generally be used. On the other hand, another reason they are included here is that they were once common enough to be in wide usage, so if you’ve encountered them and wonder what they do, this section should answer those questions.
Compilation Thresholds
This chapter has been somewhat vague in defining just what triggers the compilation of code. The major factor is how often the code is executed; once it is executed a certain number of times, its compilation threshold is reached, and the compiler deems that it has enough information to compile the code.
Tunings affect these thresholds. However, this section is really designed to give you better insight into how the compiler works (and introduce some terms); in current JVMs, tuning the threshold never really makes sense.
Compilation is based on two counters in the JVM: the number of times the method has been called, and the number of times any loops in the method have branched back. Branching back can effectively be thought of as the number of times a loop has completed execution, either because it reached the end of the loop itself or because it executed a branching statement like continue.
When the JVM executes a Java method, it checks the sum of those two counters and decides whether the method is eligible for compilation. If it is, the method is queued for compilation (see “Compilation Threads” for more details about queuing). This kind of compilation has no official name but is often called standard compilation.
Similarly, every time a loop completes an execution, the branching counter is incremented and inspected. If the branching counter has exceeded its individual threshold, the loop (and not the entire method) becomes eligible for compilation.
Tunings affect these thresholds. When tiered compilation is disabled, standard compilation is triggered by the value of the -XX:CompileThreshold=N flag. The default value of N is 10,000. Changing the value of the CompileThreshold flag will cause the compiler to choose to compile the code sooner (or later) than it normally would have. Note, however, that although there is one flag here, the threshold is calculated by adding the sum of the back-edge loop counter plus the method entry counter.
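For example, a command line like this (with a placeholder application name) would trigger standard compilation after roughly 8,000 invocations plus back edges; it makes sense only in combination with disabling tiered compilation:

$ java -XX:-TieredCompilation -XX:CompileThreshold=8000 -jar app.jar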
You can often find recommendations to change the CompileThreshold flag, and several publications of Java benchmarks use this flag (e.g., setting it so compilation occurs after 8,000 iterations). Some applications still ship with that flag set by default. But remember that I said this flag works when tiered compilation is disabled, which means that when tiered compilation is enabled (as it normally is), this flag does nothing at all. Use of this flag is really just a holdover from JDK 7 and earlier days.
This flag used to be recommended for two reasons: first, lowering it would improve startup time for an application using the C2 compiler, since code would get compiled more quickly (and usually with the same effectiveness). Second, it could cause some methods to get compiled that otherwise never would have been compiled.
That last point is an interesting quirk: if a program runs forever, wouldn’t we expect all of its code to get compiled eventually? That’s not how it works, because the counters the compilers use increase as methods and loops are executed, but they also decrease over time. Periodically (specifically, when the JVM reaches a safepoint), the value of each counter is reduced.
Practically speaking, this means that the counters are a relative measure of the recent hotness of the method or loop. One side effect is that somewhat frequently executed code may never be compiled by the C2 compiler, even for programs that run forever. These methods are sometimes called lukewarm (as opposed to hot). Before tiered compilation, this was one case where reducing the compilation threshold was beneficial.
Today, however, even the lukewarm methods will be compiled, though perhaps they could be ever-so-slightly improved if we could get them compiled by the C2 compiler rather than the C1 compiler. Little practical benefit exists, but if you’re really interested, try changing the flags -XX:Tier3InvocationThreshold=N (default 200) to get C1 to compile a method more quickly, and -XX:Tier4InvocationThreshold=N (default 5000) to get C2 to compile a method more quickly. Similar flags are available for the back-edge threshold.
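If you do want to experiment, the flags are set like any other JVM flags; the values here are arbitrary examples for experimentation, not recommendations:

$ java -XX:Tier3InvocationThreshold=100 -XX:Tier4InvocationThreshold=1000 -jar app.jar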
Compilation Threads
“Compilation Thresholds” mentioned that when a method (or loop) becomes eligible for compilation, it is queued for compilation. That queue is processed by one or more background threads.
These queues are not strictly first in, first out; methods whose invocation counters are higher have priority. So even when a program starts execution and has lots of code to compile, this priority ordering helps ensure that the most important code will be compiled first. (This is another reason the compilation ID in the PrintCompilation output can appear out of order.)
The C1 and C2 compilers have different queues, each of which is processed by (potentially multiple) different threads. The number of threads is based on a complex formula of logarithms, but Table 4-2 lists the details.
CPUs | C1 threads | C2 threads
---|---|---
1 | 1 | 1
2 | 1 | 1
4 | 1 | 2
8 | 1 | 2
16 | 2 | 6
32 | 3 | 7
64 | 4 | 8
128 | 4 | 10
The number of compiler threads can be adjusted by setting the -XX:CICompilerCount=N flag. That is the total number of threads the JVM will use to process the queue(s); for tiered compilation, one-third (but at least one) will be used to process the C1 compiler queue, and the remaining threads (but also at least one) will be used to process the C2 compiler queue. The default value of that flag is the sum of the two columns in the preceding table.
If tiered compilation is disabled, only the given number of C2 compiler threads are started.
When might you consider adjusting this value? Because the default value is based on the number of CPUs, this is one case where running with an older version of JDK 8 inside a Docker container can cause the automatic tuning to go awry. In such a circumstance, you will need to manually set this flag to the desired value (using the targets in Table 4-2 as a guideline based on the number of CPUs assigned to the Docker container).
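For example, for a container limited to four CPUs, you might set the flag to match the table’s value for four CPUs (the application name is a placeholder); with tiered compilation, one of those three threads will process the C1 queue and the other two the C2 queue:

$ java -XX:CICompilerCount=3 -jar app.jar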
Similarly, if a program is run on a single-CPU virtual machine, having only one compiler thread might be slightly beneficial: limited CPU is available, and having fewer threads contending for that resource will help performance in many circumstances. However, that advantage is limited only to the initial warm-up period; after that, the number of eligible methods to be compiled won’t really cause contention for the CPU. When the stock batching application was run on a single-CPU machine and the number of compiler threads was limited to one, the initial calculations were about 10% faster (since they didn’t have to compete for CPU as often). The more iterations that were run, the smaller the overall effect of that initial benefit, until all hot methods were compiled and the benefit was eliminated.
On the other hand, the number of threads can easily overwhelm the system, particularly if multiple JVMs are run at once (each of which will start many compilation threads). Reducing the number of threads in that case can help overall throughput (though again with the possible cost that the warm-up period will last longer).
Similarly, if lots of extra CPU cycles are available, then theoretically the program will benefit—at least during its warm-up period—when the number of compiler threads is increased. In real life, that benefit is extremely hard to come by. Further, if all that excess CPU is available, you’re much better off trying something that takes advantage of the available CPU cycles during the entire execution of the application (rather than just compiling faster at the beginning).
One other setting that applies to the compilation threads is the value of the -XX:+BackgroundCompilation flag, which by default is true. That setting means that the queue is processed asynchronously as just described. But that flag can be set to false, in which case when a method is eligible for compilation, code that wants to execute it will wait until it is in fact compiled (rather than continuing to execute in the interpreter). Background compilation is also disabled when -Xbatch is specified.
Inlining
One of the most important optimizations the compiler makes is to inline methods. Code that follows good object-oriented design often contains attributes that are accessed via getters (and perhaps setters):
public class Point {
    private int x, y;

    public int getX() {
        return x;
    }

    public void setX(int i) {
        x = i;
    }
}
The overhead for invoking a method call like this is quite high, especially relative to the amount of code in the method. In fact, in the early days of Java, performance tips often argued against this sort of encapsulation precisely because of the performance impact of all those method calls. Fortunately, JVMs now routinely perform code inlining for these kinds of methods. Hence, you can write this code:
Point p = getPoint();
p.setX(p.getX() * 2);
The compiled code will essentially execute this:
Point p = getPoint();
p.x = p.x * 2;
Inlining is enabled by default. It can be disabled using the -XX:-Inline flag, though it is such an important performance boost that you would never actually do that (for example, disabling inlining reduces the performance of the stock batching test by over 50%). Still, because inlining is so important, and perhaps because we have many other knobs to turn, recommendations are often made regarding tuning the inlining behavior of the JVM.

Unfortunately, there is no basic visibility into how the JVM inlines code. (If you compile the JVM from source, you can produce a debug version that includes the flag -XX:+PrintInlining; that flag provides all sorts of information about the inlining decisions that the compiler makes.) The best that can be done is to look at profiles of the code, and if any simple methods near the top of the profiles seem like they should be inlined, experiment with inlining flags.
The basic decision about whether to inline a method depends on how hot it is and its size. The JVM determines if a method is hot (i.e., called frequently) based on an internal calculation; it is not directly subject to any tunable parameters. If a method is eligible for inlining because it is called frequently, it will be inlined only if its bytecode size is less than 325 bytes (or whatever is specified as the -XX:MaxFreqInlineSize=N flag). Otherwise, it is eligible for inlining only if it is smaller than 35 bytes (or whatever is specified as the -XX:MaxInlineSize=N flag).
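If you do experiment with those flags, they are specified like any other JVM flags (the values shown here are arbitrary examples, not recommendations):

$ java -XX:MaxFreqInlineSize=325 -XX:MaxInlineSize=50 -jar app.jar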
Sometimes you will see recommendations that the value of the MaxInlineSize flag be increased so that more methods are inlined. One often overlooked aspect of this relationship is that setting the MaxInlineSize value higher than 35 means that a method might be inlined when it is first called. However, if the method is called frequently (in which case its performance matters much more), then it would have been inlined eventually (assuming its size is less than 325 bytes). Otherwise, the net effect of tuning the MaxInlineSize flag is that it might reduce the warm-up time needed for a test, but it is unlikely that it will have a big impact on a long-running application.
Quick Summary
- Inlining is the most beneficial optimization the compiler can make, particularly for object-oriented code where attributes are well encapsulated.
- Tuning the inlining flags is rarely needed, and recommendations to do so often fail to account for the relationship between normal inlining and frequent inlining. Make sure to account for both cases when investigating the effects of inlining.
Escape Analysis
The C2 compiler performs aggressive optimizations if escape analysis is enabled (-XX:+DoEscapeAnalysis, which is true by default). For example, consider this class to work with factorials:
public class Factorial {
    private BigInteger factorial;
    private int n;

    public Factorial(int n) {
        this.n = n;
    }

    public synchronized BigInteger getFactorial() {
        if (factorial == null)
            factorial = ...;
        return factorial;
    }
}
To store the first 100 factorial values in an array, this code would be used:
ArrayList<BigInteger> list = new ArrayList<BigInteger>();
for (int i = 0; i < 100; i++) {
    Factorial factorial = new Factorial(i);
    list.add(factorial.getFactorial());
}
The factorial object is referenced only inside that loop; no other code can ever access that object. Hence, the JVM is free to perform optimizations on that object:

- It needn’t get a synchronization lock when calling the getFactorial() method.
- It needn’t store the field n in memory; it can keep that value in a register. Similarly, it can store the factorial object reference in a register.
- In fact, it needn’t allocate an actual factorial object at all; it can just keep track of the individual fields of the object.
This kind of optimization is sophisticated: it is simple enough in this example, but these optimizations are possible even with more-complex code. Depending on the code usage, not all optimizations will necessarily apply. But escape analysis can determine which of those optimizations are possible and make the necessary changes in the compiled code.
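As a minimal sketch (not from the book’s sample code) of what “escaping” means: the Point allocated inside sumOfSquares() below never leaves that method, so escape analysis can keep its fields in registers and potentially skip the allocation entirely, while the object created in leak() is stored in a field, escapes the method, and must be allocated on the heap as usual.

public class EscapeExample {
    private Point saved;

    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int squared() { return x * x + y * y; }
    }

    long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1);  // does not escape; may never be allocated
            sum += p.squared();
        }
        return sum;
    }

    void leak(int x, int y) {
        saved = new Point(x, y);            // escapes via the field; allocated on the heap
    }
}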
Escape analysis is enabled by default. In rare cases, it will get things wrong. That is usually unlikely, and in current JVMs, it is rare indeed. Still, because there were once some high-profile bugs, you’ll sometimes see recommendations for disabling escape analysis. Those are likely not appropriate any longer, though as with all aggressive compiler optimizations, it’s not out of the question that disabling this feature could lead to more stable code. If you find this to be the case, simplifying the code in question is the best course of action: simpler code will compile better. (It is a bug, however, and should be reported.)
Quick Summary
- Escape analysis is the most sophisticated of the optimizations the compiler can perform. This is the kind of optimization that frequently causes microbenchmarks to go awry.
CPU-Specific Code
I mentioned earlier that one advantage of the JIT compiler is that it could emit code for different processors depending on where it was running. This presumes that the JVM is built with the knowledge of the newer processor, of course.
That is exactly what the compiler does for Intel chips. In 2011, Intel introduced Advanced Vector Extensions (AVX) for the Sandy Bridge (and later) chips. JVM support for those instructions soon followed. Then in 2016 Intel extended this to include AVX-512 instructions; those are present on Knights Landing and subsequent chips. Those instructions are not supported in JDK 8 but are supported in JDK 11.
Normally, this feature isn’t something you worry about; the JVM will detect the CPU that it is running on and select the appropriate instruction set. But as with all new features, sometimes things go awry.
Support for AVX-512 instructions was first introduced in JDK 9, though it was not enabled by default. In a couple of false starts, it was enabled by default and then disabled by default. In JDK 11, those instructions were enabled by default. However, beginning in JDK 11.0.6, those instructions are again disabled by default. Hence, even in JDK 11, this is still a work in progress. (This, by the way, is not unique to Java; many programs have struggled to get the support of the AVX-512 instructions exactly right.)
So it is that on some newer Intel hardware, running some programs, you may find that an earlier instruction set works much better. The kinds of applications that benefit from the new instruction set typically involve more scientific calculations than Java programs often do.
These instruction sets are selected with the -XX:UseAVX=N argument, where N is as follows:
- 0: Use no AVX instructions.
- 1: Use Intel AVX level 1 instructions (for Sandy Bridge and later processors).
- 2: Use Intel AVX level 2 instructions (for Haswell and later processors).
- 3: Use Intel AVX-512 instructions (for Knights Landing and later processors).
The default value for this flag will depend on the processor running the JVM; the JVM will detect the CPU and pick the highest supported value it can. Java 8 has no support for a level of 3, so 2 is the value you’ll see used on most processors. In Java 11 on newer Intel processors, the default is to use 3 in versions up to 11.0.5, and 2 in later versions.
This is one of the reasons I mentioned in Chapter 1 that it is a good idea to use the latest versions of Java 8 or Java 11, since important fixes like this are in those latest versions. If you must use an earlier version of Java 11 on the latest Intel processors, try setting the -XX:UseAVX=2 flag, which in many cases will give you a performance boost.
Speaking of code maturity: for completeness, I’ll mention that the -XX:UseSSE=N flag supports Intel Streaming SIMD Extensions (SSE) one to four. These extensions are for the Pentium line of processors. Tuning this flag in 2010 made some sense as all the permutations of its use were being worked out. Today, we can generally rely on the robustness of that flag.
Tiered Compilation Trade-offs
I’ve mentioned a few times that the JVM works differently when tiered compilation is disabled. Given the performance advantages it provides, is there ever a reason to turn it off?
One such reason might be when running in a memory-constrained environment. Sure, your 64-bit machine probably has a ton of memory, but you may be running in a Docker container with a small memory limit or in a cloud virtual machine that just doesn’t have quite enough memory. Or you may be running dozens of JVMs on your large machine. In those cases, you may want to reduce the memory footprint of your application.
Chapter 8 provides general recommendations about this, but in this section we’ll look at the effect of tiered compilation on the code cache.
Table 4-3 shows the result of starting NetBeans on my system, which has a couple dozen projects that will be opened at startup.
Compiler mode | Classes compiled | Committed code cache | Startup time
---|---|---|---
+TieredCompilation | 22,733 | 46.5 MB | 50.1 seconds
-TieredCompilation | 5,609 | 10.7 MB | 68.5 seconds
The C1 compiler compiled about four times as many classes and predictably required about four times as much memory for the code cache. In absolute terms, saving 34 MB in this example is unlikely to make a huge difference. Saving 300 MB in a program that compiles 200,000 classes might be a different choice on some platforms.
What do we lose by disabling tiered compilation? As the table shows, we do spend more time to start the application and load all project classes. But what about a long-running program, where you’d expect all the hot spots to get compiled?
In that case, given a sufficiently long warm-up period, execution should be about the same when tiered compilation is disabled. Table 4-4 shows the performance of our stock REST server after warm-up periods of 0, 60, and 300 seconds.
Warm-up period | -XX:-TieredCompilation | -XX:+TieredCompilation
---|---|---
0 seconds | 23.72 | 24.23
60 seconds | 23.73 | 24.26
300 seconds | 24.42 | 24.43
The measurement period is 60 seconds, so even when there is no warm-up, the compilers had an opportunity to get enough information to compile the hot spots; hence, there is little difference even when there is no warm-up period. (Also, a lot of code was compiled during the startup of the server.) Note that in the end, tiered compilation is still able to eke out a small advantage (albeit one that is unlikely to be noticeable). We discussed the reason for that when discussing compilation thresholds: there will always be a small number of methods that are compiled by the C1 compiler when tiered compilation is used that won’t be compiled by the C2 compiler.
The GraalVM
The GraalVM is a new virtual machine. It provides a means to run Java code, of course, but also code from many other languages. This universal virtual machine can also run JavaScript, Python, Ruby, R, and traditional JVM bytecodes from Java and other languages that compile to JVM bytecodes (e.g., Scala, Kotlin, etc.). Graal comes in two editions: a full open source Community Edition (CE) and a commercial Enterprise Edition (EE). Each edition has binaries that support either Java 8 or Java 11.
The GraalVM has two important contributions to JVM performance. First, an add-on technology allows the GraalVM to produce fully native binaries; we’ll examine that in the next section.
Second, the GraalVM can run in a mode as a regular JVM, but it contains a new implementation of the C2 compiler. This compiler is written in Java (as opposed to the traditional C2 compiler, which is written in C++).
The traditional JVM contains a version of the GraalVM JIT compiler, the exact version depending on when the JVM was built. These JIT releases come from the CE version of GraalVM, which is slower than the EE version; they are also typically out-of-date compared to the versions of GraalVM that you can download directly.
Within the JVM, using the GraalVM compiler is considered experimental, so to enable it, you need to supply these flags: -XX:+UnlockExperimentalVMOptions, -XX:+EnableJVMCI, and -XX:+UseJVMCICompiler. The default for all those flags is false.
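Putting those together, a command line like this (the application name is a placeholder) runs the JVM with its embedded GraalVM JIT:

$ java -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI \
    -XX:+UseJVMCICompiler -jar app.jar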
Table 4-5 shows the performance of the standard Java 11 compiler, the Graal compiler from EE version 19.2.1, and the GraalVM embedded in Java 11 and 13.
JVM/compiler | OPS
---|---
JDK 11/Standard C2 | 20.558
JDK 11/Graal JIT | 14.733
Graal 1.0.0b16 | 16.3
Graal 19.2.1 | 26.7
JDK 13/Standard C2 | 21.9
JDK 13/Graal JIT | 26.4
This is once again the performance of our REST server (though on slightly different hardware than before, so the baseline is only 20.5 OPS instead of 24.4).
It’s interesting to note the progression here: JDK 11 was built with a pretty early version of the Graal compiler, so the performance of that compiler lags the C2 compiler. The Graal compiler improved through its early access builds, though even its latest early access (1.0) build wasn’t as fast as the standard VM. Graal versions in late 2019 (released as production version 19.2.1), though, got substantially faster. The early access release of JDK 13 has one of those later builds and achieves close to the same performance with the Graal compiler, even while its C2 compiler is only modestly improved since JDK 11.
Precompilation
We began this chapter by discussing the philosophy behind a just-in-time compiler. Although it has its advantages, code is still subject to a warm-up period before it executes. What if in our environment a traditional compiled model would work better: an embedded system without the extra memory the JIT requires, or a program that completes before having a chance to warm up?
In this section, we’ll look at two experimental features that address that scenario. Ahead-of-time compilation is an experimental feature of the standard JDK 11, and the ability to produce a fully native binary is a feature of the Graal VM.
Ahead-of-Time Compilation
Ahead-of-time (AOT) compilation was first available in JDK 9 for Linux only, but in JDK 11 it is available on all platforms. From a performance standpoint, it is still a work in progress, but this section will give you a sneak peek at it.
AOT compilation allows you to compile some (or all) of your application in advance of running it. This compiled code becomes a shared library that the JVM uses when starting the application. In theory, this means the JIT needn’t be involved, at least in the startup of your application: your code should initially run at least as well as the C1 compiled code without having to wait for that code to be compiled.
In practice, it’s a little different: the startup time of the application is greatly affected by the size of the shared library (and hence the time to load that shared library into the JVM). That means a simple application like a “Hello, world” application won’t run any faster when you use AOT compilation (in fact, it may run slower depending on the choices made to precompile the shared library). AOT compilation is targeted toward something like a REST server that has a relatively long startup time. That way, the time to load the shared library is offset by the long startup time, and AOT produces a benefit. But remember as well that AOT compilation is an experimental feature, and smaller programs may see benefits from it as the technology evolves.
To use AOT compilation, you use the jaotc tool to produce a shared library containing the compiled classes that you select. Then that shared library is loaded into the JVM via a runtime argument.
The jaotc tool has several options, but the way that you’ll produce the best library is something like this:
$ jaotc --compile-commands=/tmp/methods.txt \
    --output JavaBaseFilteredMethods.so \
    --compile-for-tiered \
    --module java.base
This command will use a set of compile commands to produce a compiled version of the java.base module in the given output file. You have the option of AOT compiling a module, as we’ve done here, or a set of classes.
The time to load the shared library depends on its size, which is a function of the number of methods it contains. You can also load multiple shared libraries that precompile different parts of the code; that may be easier to manage, but the performance is the same, so we’ll concentrate on a single library.
While you might be tempted to precompile everything, you’ll obtain better performance if you judiciously precompile only subsets of the code. That’s why the recommendation here is to compile only the java.base module.
The compile commands (in the /tmp/methods.txt file in this example) also serve to limit the data that is compiled into the shared library. That file contains lines that look like this:
compileOnly java.net.URI.getHost()Ljava/lang/String;
This line tells jaotc that when it compiles the java.net.URI class, it should include only the getHost() method. We can have other lines referencing other methods from that class to include their compilation as well; in the end, only the methods listed in the file will be included in the shared library.
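For example, a small excerpt of such a file might look like this (the additional URI methods are shown only as an illustration; in practice, the list comes from the procedure described next):
compileOnly java.net.URI.getHost()Ljava/lang/String;
compileOnly java.net.URI.getPath()Ljava/lang/String;
compileOnly java.net.URI.getPort()I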
To create the list of compile commands, we need a list of every method that the application actually uses. To do that, we run the application like this:
$ java -XX:+UnlockDiagnosticVMOptions -XX:+LogTouchedMethods \
    -XX:+PrintTouchedMethodsAtExit <other arguments>
When the program exits, it will print a line for each method the program used, in a format like this:
java/net/URI.getHost:()Ljava/lang/String;
To produce the methods.txt file, save those lines, prepend each with the compileOnly directive, and remove the colon immediately preceding the method arguments.
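That transformation is mechanical enough to script; here is one possible sketch using sed (the file names are placeholders, and depending on your jaotc version you may also need to convert the slashes in the class name to dots, as in the compileOnly example shown earlier):
$ sed -e 's/^/compileOnly /' -e 's/:(/(/' touched_methods.txt > /tmp/methods.txt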
The classes that are precompiled by jaotc will use a form of the C1 compiler, so in a long-running program, they will not be optimally compiled. Hence the final option that we’ll need is --compile-for-tiered. That option arranges the shared library so that its methods are still eligible to be compiled by the C2 compiler.
If you are using AOT compilation for a short-lived program, it’s fine to leave out this argument, but remember that the target set of applications is servers. If we don’t allow the precompiled methods to become eligible for C2 compilation, the warmed-up performance of the server will be slower than what is ultimately possible.
Perhaps unsurprisingly, if you run your application with a library that has tiered compilation enabled and use the -XX:+PrintCompilation flag, you see the same code replacement technique we observed before: the AOT compilation will appear as another tier in the output, and you’ll see the AOT methods get made not entrant and replaced as the JIT compiles them.
Once the library has been created, you use it with your application like this:
$ java -XX:AOTLibrary=/path/to/JavaBaseFilteredMethods.so <other args>
If you want to make sure that the library is being used, include the -XX:+PrintAOT flag in your JVM arguments; that flag is false by default. Like the -XX:+PrintCompilation flag, the -XX:+PrintAOT flag will produce output whenever a precompiled method is used by the JVM. A typical line looks like this:
373 105 aot[ 1] java.util.HashSet.<init>(I)V
The first column here is the milliseconds since the program started, so it took 373 milliseconds until the constructor of the HashSet class was loaded from the shared library and began execution. The second column is an ID assigned to the method, and the third column tells us which library the method was loaded from. The index (1 in this example) is also printed by this flag:
18 1 loaded /path/to/JavaBaseFilteredMethods.so aot library
JavaBaseFilteredMethods.so is the first (and only) library loaded in this example, so its index is 1 (the second column), and subsequent references to aot with that index refer to this library.
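Putting the pieces together, a run that both uses the library and reports how it is being used might look like this (the library path is a placeholder):
$ java -XX:AOTLibrary=/path/to/JavaBaseFilteredMethods.so \
    -XX:+PrintAOT -XX:+PrintCompilation <other args>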
GraalVM Native Compilation
As we saw, AOT compilation is beneficial for relatively large programs but doesn’t help (and can hinder) small, quick-running programs. That is partly because the feature is still experimental and partly because its architecture requires the JVM to load the shared library.
The GraalVM, on the other hand, can produce full native executables that run without the JVM. These executables are ideal for short-lived programs. If you ran the earlier examples, you may have noticed references to GraalVM classes in some of the output (such as ignored errors): AOT compilation uses GraalVM as its foundation. Native compilation is an Early Adopter feature of the GraalVM; it can be used in production with the appropriate license but is not subject to warranty.
The GraalVM produces binaries that start up quite fast, particularly when comparing them to the running programs in the JVM. However, in this mode the GraalVM does not optimize code as aggressively as the C2 compiler, so given a sufficiently long-running application, the traditional JVM will win out in the end. Unlike AOT compilation, the GraalVM native binary does not compile classes using C2 during execution.
Similarly, the memory footprint of a native program produced by the GraalVM starts out significantly smaller than that of a traditional JVM. However, by the time a program runs and expands the heap, this memory advantage fades.
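As a sketch of the workflow (the class name is hypothetical, and the exact steps may vary by GraalVM release), you compile the class as usual and then hand it to GraalVM's native-image tool, which produces a standalone executable:
$ javac ListDirectory.java        # ListDirectory is a hypothetical file-counting class
$ native-image ListDirectory      # compiles the class into a native executable
$ ./listdirectory /path/to/count  # by default, the executable takes the class name, lowercased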
Limitations also exist on which Java features can be used in a program compiled into native code. These limitations include the following:
- Dynamic class loading (e.g., by calling Class.forName()).
- Finalizers.
- The Java Security Manager.
- JMX and JVMTI (including JVMTI profiling).
- Use of reflection often requires special coding or configuration (see the sketch after this list).
- Use of dynamic proxies often requires special configuration.
- Use of JNI requires special coding or configuration.
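For the reflection and dynamic proxy cases, a common approach in recent GraalVM releases is to run the application once on a regular JVM with GraalVM's tracing agent, which records the reflective and proxy usage it observes into configuration files that the native image build can then pick up. A minimal sketch, with a placeholder application jar:
$ java -agentlib:native-image-agent=config-output-dir=META-INF/native-image \
    -jar myapp.jar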
We can see all of this in action by using a demo program from the GraalVM project that recursively counts the files in a directory. With a few files to count, the native program produced by the GraalVM is quite small and fast, but as more work is done and the JIT kicks in, the traditional JVM compiler generates better code optimizations and is faster, as we see in Table 4-6.
| Number of files | Java 11.0.5 | Native application |
|---|---|---|
| 7 | 217 ms (36K) | 4 ms (3K) |
| 271 | 279 ms (37K) | 20 ms (6K) |
| 169,000 | 2.3 s (171K) | 2.1 s (249K) |
| 1.3 million | 19.2 s (212K) | 25.4 s (269K) |
The times here are the time to count the files; the total memory footprint of the run (measured at completion) is given in parentheses.
Of course, the GraalVM itself is rapidly evolving, and the optimizations within its native code can be expected to improve over time as well.
Summary
This chapter contains a lot of background about how the compiler works. This is so you can understand some of the general recommendations made in Chapter 1 regarding small methods and simple code, and the effects of the compiler on microbenchmarks that were described in Chapter 2. In particular:
- Don’t be afraid of small methods (in particular, getters and setters) because they are easily inlined. If you have a feeling that the method overhead can be expensive, you’re correct in theory (we showed that removing inlining significantly degrades performance). But it’s not the case in practice, since the compiler fixes that problem.

- Code that needs to be compiled sits in a compilation queue. The more code in the queue, the longer the program will take to achieve optimal performance.

- Although you can (and should) size the code cache, it is still a finite resource.

- The simpler the code, the more optimizations that can be performed on it. Profile feedback and escape analysis can yield much faster code, but complex loop structures and large methods limit their effectiveness.
Finally, if you profile your code and find some surprising methods at the top of your profile—methods you expect shouldn’t be there—you can use the information here to look into what the compiler is doing and to make sure it can handle the way your code is written.
1 One benefit of AOT compilation is faster startup, but application class data sharing gives (at least for now) a better benefit in terms of startup performance and is a fully supported feature; see “Class Data Sharing” for more details.