Chapter 4. Performance Testing Patterns and Antipatterns

Performance testing is undertaken for a variety of reasons. In this chapter we will introduce the different types of test that a team may wish to execute, and discuss best practices for each type.

In the second half of the chapter, we will outline some of the more common antipatterns that can plague a performance test or team, and explain refactored solutions that help prevent them from becoming a problem.

Types of Performance Test

Performance tests are frequently conducted for the wrong reasons, or conducted badly. The reasons for this vary widely, but are often rooted in a failure to understand the nature of performance analysis and a belief that “something is better than nothing.” As we will see repeatedly, this belief is often a dangerous half-truth at best.

One of the more common mistakes is to speak generally of “performance testing” without engaging with the specifics. In fact, there are many different types of large-scale performance tests that can be conducted on a system.

Note

Good performance tests are quantitative. They ask questions that produce a numeric answer that can be handled as an experimental output and subjected to statistical analysis.

The types of performance tests we will discuss in this book have largely independent (but somewhat overlapping) goals, so take care when thinking about the domain of any single test. A good rule of thumb when planning a performance test is simply to write down (and confirm with management or the customer) the quantitative questions that the test is intended to answer, and why they are important for the application under test.

Some of the most common test types, and an example question for each, are as follows:

Latency test

What is the end-to-end transaction time?

Throughput test

How many concurrent transactions can the system handle at its current capacity?

Load test

Can the system handle a specific load?

Stress test

What is the breaking point of the system?

Endurance test

What performance anomalies are discovered when the system is run for an extended period?

Capacity planning test

Does the system scale as expected when additional resources are added?

Degradation test

What happens when the system is partially failed?

Let’s look in more detail at each of these test types in turn.

Latency Test

This is one of the most common types of performance test, usually because it can be closely related to a system observable that is of direct interest to management: how long are our customers waiting for a transaction (or a page load)? This is a double-edged sword, because the quantitative question that a latency test seeks to answer seems so obvious that it can obscure the necessity of identifying quantitative questions for other types of performance tests.

Note

The goal of a latency tuning exercise is usually to directly improve the user experience, or to meet a service-level agreement.

However, even in the simplest of cases, a latency test has some subtleties that must be treated carefully. One of the most noticeable is that (as we will discuss fully in “Statistics for JVM Performance”) a simple mean (average) is not very useful as a measure of how well an application is reacting to requests.
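As a tiny illustration of why (the latency samples below are invented for the example), a handful of slow outliers can leave the mean describing no request that any user actually experienced, while the higher percentiles make the tail obvious:

    import java.util.Arrays;

    public class MeanVsPercentile {
        public static void main(String[] args) {
            // Invented latency samples in ms: mostly fast, with two slow outliers
            long[] latencies = {12, 14, 13, 15, 12, 13, 14, 12, 480, 510};
            double mean = Arrays.stream(latencies).average().orElse(0);
            long[] sorted = latencies.clone();
            Arrays.sort(sorted);
            long p90 = sorted[(int) Math.ceil(sorted.length * 0.9) - 1];
            System.out.printf("mean=%.1fms p90=%dms max=%dms%n",
                    mean, p90, sorted[sorted.length - 1]);
            // The mean (109.5 ms here) describes no request that any user actually
            // experienced; the tail values are what the unlucky users felt.
        }
    }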

Throughput Test

Throughput is probably the second most common quantity to be performance-tested. It can even be thought of as equivalent to latency, in some senses.

For example, when we are conducting a latency test, it is important to state (and control) the concurrent transactions ongoing when producing a distribution of latency results.

Note

The observed latency of a system should be stated at known and controlled throughput levels.

Equally, we usually conduct a throughput test while monitoring latency. We determine the “maximum throughput” by noticing when the latency distribution suddenly changes—effectively a “breaking point” (also called an inflection point) of the system. The point of a stress test, as we will see, is to locate such points and the load levels at which they occur.

A throughput test, on the other hand, is about measuring the observed maximum throughput before the system starts to degrade.

Load Test

A load test differs from a throughput test (or a stress test) in that it is usually framed as a binary test: “Can the system handle this projected load or not?” Load tests are sometimes conducted in advance of expected business events—for example, the onboarding of a new customer or market that is expected to drive greatly increased traffic to the application. Other examples of possible events that could warrant performing this type of test include advertising campaigns, social media events, and “viral content.”

Stress Test

One way to think about a stress test is as a way to determine how much spare headroom the system has. The test typically proceeds by placing the system into a steady state of transactions—that is, a specified throughput level (often the current peak). The test then ramps up the concurrent transactions slowly, until the system observables start to degrade.

The throughput value observed just before the observables begin to degrade is usually taken to be the maximum throughput of the system.
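As an illustration only, the sketch below ramps up the number of concurrent workers in steps and records a simple latency observation at each step, which is the essence of the ramp-up just described. The runTransaction() method is a placeholder for whatever operation the real system under test performs, and the step sizes and sample counts are arbitrary choices for the example:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class StressRamp {

        // Placeholder for a real transaction against the system under test
        static void runTransaction() throws InterruptedException {
            Thread.sleep(5); // simulate a 5 ms unit of work
        }

        public static void main(String[] args) throws Exception {
            // Ramp the number of concurrent workers up in steps of 10
            for (int workers = 10; workers <= 200; workers += 10) {
                ExecutorService pool = Executors.newFixedThreadPool(workers);
                List<Future<Long>> results = new ArrayList<>();
                for (int i = 0; i < workers * 50; i++) {
                    results.add(pool.submit(() -> {
                        long start = System.nanoTime();
                        runTransaction();
                        return (System.nanoTime() - start) / 1_000_000; // ms
                    }));
                }
                List<Long> latencies = new ArrayList<>();
                for (Future<Long> f : results) {
                    latencies.add(f.get());
                }
                pool.shutdown();
                Collections.sort(latencies);
                long p99 = latencies.get((int) (latencies.size() * 0.99) - 1);
                System.out.printf("workers=%d p99=%dms%n", workers, p99);
                // A sudden jump in p99 between steps suggests the breaking point
                // lies somewhere in that interval.
            }
        }
    }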

Endurance Test

Some problems manifest only over much longer periods of time (often measured in days). These include slow memory leaks, cache pollution, and memory fragmentation (especially for applications that use the Concurrent Mark and Sweep garbage collector, which may eventually suffer concurrent mode failure; see “CMS” for more details).

To detect these types of issues, an endurance test (also known as a soak test) is the usual approach. These are run at average (or high) utilization, but within observed loads for the system. During the test, resource levels are closely monitored to spot any breakdowns or exhaustions of resources.

This type of test is especially common for fast-response (or low-latency) systems, which frequently cannot tolerate the length of a stop-the-world pause caused by a full GC cycle (see Chapter 6 and subsequent chapters for more on stop-the-world events and related GC concepts).
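A minimal sketch of the resource monitoring involved, using the standard java.lang.management API to sample heap usage while the soak test runs (the one-minute interval and CSV-style output are arbitrary choices for the example):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapSampler {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            // Sample heap usage once a minute; in a real soak test this runs
            // alongside the load for hours or days and the output is graphed.
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                System.out.printf("%d,%d,%d%n",
                        System.currentTimeMillis(), heap.getUsed(), heap.getCommitted());
                Thread.sleep(60_000);
            }
        }
    }

A used-heap figure that keeps trending upward even after full GC cycles is the classic signature of a slow memory leak.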

Capacity Planning Test

Capacity planning tests bear many similarities to stress tests, but they are a distinct type of test. The role of a stress test is to find out what the current system will cope with, whereas a capacity planning test is more forward-looking and seeks to find out what load an upgraded system could handle.

For this reason, capacity planning tests are often carried out as part of a scheduled planning exercise, rather than in response to a specific event or threat.

Degradation Test

A degradation test is also known as a partial failure test. A general discussion of resilience and fail-over testing is outside the scope of this book, but suffice it to say that in the most highly regulated and scrutinized environments (including banks and financial institutions), failover and recovery testing is taken extremely seriously and is usually planned in meticulous depth.

For our purposes, the only type of resilience test we consider is the degradation test. The basic approach to this test is to see how the system behaves when a component or entire subsystem suddenly loses capacity while the system is running at simulated loads equivalent to usual production volumes. Examples could be application server clusters that suddenly lose members, databases that suddenly lose RAID disks, or network bandwidth that suddenly drops.

Key observables during a degradation test include the transaction latency distribution and throughput.

One particularly interesting subtype of partial failure test is known as the Chaos Monkey. This is named after a project at Netflix that was undertaken to verify the robustness of its infrastructure.

The idea behind Chaos Monkey is that in a truly resilient architecture, the failure of a single component should not be able to cause a cascading failure or have a meaningful impact on the overall system.

Chaos Monkey attempts to demonstrate this by randomly killing off live processes that are actually in use in the production environment.

In order to successfully implement Chaos Monkey–type systems, an organization must have the highest levels of system hygiene, service design, and operational excellence. Nevertheless, it is an area of interest and aspiration for an increasing number of companies and teams.

Best Practices Primer

When deciding where to focus your effort in a performance tuning exercise, there are three golden rules that can provide useful guidance:

  • Identify what you care about and figure out how to measure it.

  • Optimize what matters, not what is easy to optimize.

  • Play the big points first.

The second point has an important corollary: remind yourself not to fall into the trap of attaching too much significance to whatever quantity is easiest to measure. Not every observable is significant to the business, but it is sometimes tempting to report on an easy measure rather than the right one.

Top-Down Performance

One of the aspects of Java performance that many engineers miss at first encounter is that large-scale benchmarking of Java applications is usually easier than trying to get accurate numbers for small sections of code. We will discuss this in detail in Chapter 5.

Note

The approach of starting with the performance behavior of an entire application is usually called top-down performance.

To make the most of the top-down approach, a testing team needs a test environment, a clear understanding of what it needs to measure and optimize, and an understanding of how the performance exercise will fit into the overall software development lifecycle.

Creating a Test Environment

Setting up a test environment is one of the first tasks most performance testing teams will need to undertake. Wherever possible, this should be an exact duplicate of the production environment in all respects. This includes not only the application servers (which should have the same number of CPUs, the same version of the OS and Java runtime, etc.), but also the web servers, databases, load balancers, network firewalls, and so on. Any services that are not easy to replicate, or that lack sufficient QA capacity to handle a production-equivalent load (e.g., third-party network services), will need to be mocked for a representative performance testing environment.

Sometimes teams try to reuse or time-share an existing QA environment for performance testing. This can be possible for smaller environments or for one-off testing, but the management overhead and scheduling and logistical problems that it can cause should not be underestimated.

Note

Performance testing environments that are significantly different from the production environments that they attempt to represent often fail to achieve results that have any usefulness or predictive power in the live environment.

For traditional (i.e., non-cloud-based) environments, a production-like performance testing environment is relatively straightforward to achieve: the team simply buys as many physical machines as are in use in the production environment and then configures them in exactly the same way as production is configured.

Management is sometimes resistant to the additional infrastructure cost that this represents. This is almost always a false economy, but sadly many organizations fail to account correctly for the cost of outages. This can lead to a belief that the savings from not having an accurate performance testing environment are meaningful, because the risks introduced by a QA environment that does not mirror production are never properly accounted for.

Recent developments, notably the advent of cloud technologies, have changed this rather traditional picture. On-demand and autoscaling infrastructure means that an increasing number of modern architectures no longer fit the model of “buy servers, draw network diagram, deploy software on hardware.” The devops approach of treating server infrastructure as “livestock, not pets” means that much more dynamic approaches to infrastructure management are becoming widespread.

This makes the construction of a performance testing environment that looks like production potentially more challenging. However, it raises the possibility of setting up a testing environment that can be turned off when not in use. This can bring significant cost savings to the project, but it requires a proper process for starting up and shutting down the environment as scheduled.

Identifying Performance Requirements

Let’s recall the simple system model that we met in “A Simple System Model”. This clearly shows that the overall performance of a system is not solely determined by your application code. The container, operating system, and hardware all have a role to play.

Therefore, the metrics that we will use to evaluate performance should not be thought about solely in terms of the code. Instead, we must consider systems as a whole and the observable quantities that are important to customers and management. These are usually referred to as performance nonfunctional requirements (NFRs), and are the key indicators that we want to optimize.

Some goals are obvious:

  • Reduce the 95th percentile transaction time by 100 ms.

  • Improve system so that 5x throughput on existing hardware is possible.

  • Improve average response time by 30%.

Others may be less apparent:

  • Reduce resource cost to serve the average customer by 50%.

  • Ensure system is still within 25% of response targets, even when application clusters are degraded by 50%.

  • Reduce customer “drop-off” rate by 25% per 25 ms of latency.

An open discussion with the stakeholders as to exactly what should be measured and what goals are to be achieved is essential. Ideally, this discussion should form part of the first kick-off meeting for the performance exercise.

Java-Specific Issues

Much of the science of performance analysis is applicable to any modern software system. However, the nature of the JVM is such that there are certain additional complications that the performance engineer should be aware of and consider carefully. These largely stem from the dynamic self-management capabilities of the JVM, such as the dynamic tuning of memory areas.

One particularly important Java-specific insight is related to JIT compilation. Modern JVMs analyze which methods are being run to identify candidates for JIT compilation to optimized machine code. This means that if a method is not being JIT-compiled, then one of two things is true about the method:

  • It is not being run frequently enough to warrant being compiled.

  • The method is too large or complex to be analyzed for compilation.

The second condition is much rarer than the first. However, one early performance exercise for JVM-based applications is to switch on simple logging of which methods are being compiled and ensure that the important methods for the application’s key code paths are being compiled.
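On HotSpot JVMs, for example, the -XX:+PrintCompilation flag provides this kind of simple logging, printing a line for each compilation the JIT performs; the JAR name below is just a placeholder for the application under test:

    java -XX:+PrintCompilation -jar myapp.jar

The resulting output can then be searched to confirm that the methods on the application's key code paths actually appear.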

In Chapter 9 we will discuss JIT compilation in detail, and show some simple techniques for ensuring that the important methods of applications are targeted for JIT compilation by the JVM.

Performance Testing as Part of the SDLC

Some companies and teams prefer to think of performance testing as an occasional, one-off activity. However, more sophisticated teams tend to make ongoing performance tests, and in particular performance regression testing, an integral part of their software development lifecycle (SDLC).

This requires collaboration between developers and infrastructure teams to control which versions of code are present in the performance testing environment at any given time. It is also virtually impossible to implement without a dedicated testing environment.

Having discussed some of the most common best practices for performance, let’s now turn our attention to the pitfalls and antipatterns that teams can fall prey to.

Introducing Performance Antipatterns

An antipattern is an undesired behavior of a software project or team that is observed across a large number of projects.1 The frequency of occurrence leads to the conclusion (or suspicion) that some underlying factor is responsible for creating the unwanted behavior. Some antipatterns may at first sight seem to be justified, with their non-ideal aspects not immediately obvious. Others are the result of negative project practices slowly accreting over time.

In some cases the behavior may be driven by social or team constraints, or by common misapplied management techniques, or by simple human (and developer) nature. By classifying and categorizing these unwanted features, we develop a “pattern language” for discussing them, and hopefully eliminating them from our projects.

Performance tuning should always be treated as a very objective process, with precise goals set early in the planning phase. This is easier said than done: when a team is under pressure or not operating under reasonable circumstances, this can simply fall by the wayside.

Many readers will have seen the situation where a new client is going live or a new feature is being launched, and an unexpected outage occurs—in user acceptance testing (UAT) if you are lucky, but often in production. The team is then left scrambling to find and fix what has caused the bottleneck. This usually means performance testing has not been carried out, or the team “ninja” made an assumption and has now disappeared (ninjas are good at this).

A team that works in this way will likely fall victim to antipatterns more often than a team that follows good performance testing practices and has open and reasoned conversations. As with many development issues, it is often the human elements, such as communication problems, rather than any technical aspect that leads to an application having problems.

One interesting possibility for classification was provided in a blog post by Carey Flichel called “Why Developers Keep Making Bad Technology Choices”. The post specifically calls out five main reasons that cause developers to make bad choices. Let’s look at each in turn.

Boredom

Most developers have experienced boredom in a role, and for some this doesn’t have to last very long before they are seeking a new challenge or role either in the company or elsewhere. However, other opportunities may not be present in the organization, and moving somewhere else may not be possible.

It is likely many readers have come across a developer who is simply riding it out, perhaps even actively seeking an easier life. However, bored developers can harm a project in a number of ways. For example, they might introduce code complexity that is not required, such as writing a sorting algorithm directly in code when a simple Collections.sort() would be sufficient. They might also express their boredom by looking to build components with technologies that are unknown or perhaps don’t fit the use case just as an opportunity to use them—which leads us to the next section.

Résumé Padding

Occasionally the overuse of technology is not tied to boredom, but rather represents the developer exploiting an opportunity to boost their experience with a particular technology on their résumé (or CV). In this scenario, the developer is making an active attempt to increase their potential salary and marketability as they’re about to re-enter the job market. It’s unlikely that many people would get away with this inside a well-functioning team, but it can still be the root of a choice that takes a project down an unnecessary path.

The consequences of an unnecessary technology being added due to a developer’s boredom or résumé padding can be far-reaching and very long-lived, lasting for many years after the original developer has left for greener pastures.

Peer Pressure

Technical decisions are often at their worst when concerns are not voiced or discussed at the time choices are being made. This can manifest in a few ways; for example, perhaps a junior developer doesn’t want to make a mistake in front of more senior members of their team (“imposter syndrome”), or perhaps a developer fears coming across as uninformed on a particular topic. Another particularly toxic type of peer pressure is for competitive teams, wanting to be seen as having high development velocity, to rush key decisions without fully exploring all of the consequences.

Lack of Understanding

Developers may look to introduce new tools to help solve a problem because they are not aware of the full capability of their existing tools. It is often tempting to reach for a new and exciting technology component because it is great at performing one specific task. However, the additional technical complexity it introduces must be weighed against what the current tools can actually do.

For example, Hibernate is sometimes seen as the answer to simplifying translation between domain objects and databases. If there is only limited understanding of Hibernate on the team, developers can make assumptions about its suitability based on having seen it used in another project.

This lack of understanding can cause overcomplicated usage of Hibernate and unrecoverable production outages. By contrast, rewriting the entire data layer using simple JDBC calls allows the developer to stay on familiar territory. One of the authors taught a Hibernate course that contained a delegate in exactly this position; he was trying to learn enough Hibernate to see if the application could be recovered, but ended up having to rip out Hibernate over the course of a weekend—definitely not an enviable position.

Misunderstood/Nonexistent Problem

Developers may often use a technology to solve a particular issue where the problem space itself has not been adequately investigated. Without having measured performance values, it is almost impossible to understand the success of a particular solution. Often collating these performance metrics enables a better understanding of the problem.

To avoid antipatterns, it is important to ensure that communication about technical issues is open to all participants in the team, and actively encouraged. Where things are unclear, gathering factual evidence and building prototypes can help to steer team decisions. A technology may look attractive, but if the prototype does not measure up, the team can make a more informed decision about whether to adopt it.

Performance Antipatterns Catalogue

In this section we will present a short catalogue of performance antipatterns. The list is by no means exhaustive, and there are doubtless many more still to be discovered.

Distracted by Shiny

Description

The newest or coolest tech is often the first tuning target, as it can be more exciting to explore how newer technology works than to dig around in legacy code. It may also be that the code accompanying the newer technology is better written and easier to maintain. Both of these facts push developers toward looking at the newer components of the application.

Example comment

“It’s teething trouble—we need to get to the bottom of it.”

Reality

  • This is often just a shot in the dark rather than an effort at targeted tuning or measuring of the application.

  • The developer may not fully understand the new technology yet, and will tinker around rather than examine the documentation—often in reality causing other problems.

  • In the case of new technologies, examples online are often for small or sample datasets and don’t discuss good practice about scaling to an enterprise size.

Discussion

This antipattern is common in newly formed or less experienced teams. Eager to prove themselves, or to avoid becoming tied to what they see as legacy systems, they are often advocates for newer, “hotter” technologies—which may, coincidentally, be exactly the sort of technologies that would confer a salary uptick in any new role.

Therefore, the logical subconscious conclusion is that any performance issue should be approached by first taking a look at the new tech. After all, it’s not properly understood, so a fresh pair of eyes would be helpful, right?

Resolutions

  • Measure to determine the real location of the bottleneck.

  • Ensure adequate logging around the new component.

  • Look at best practices as well as simplified demos.

  • Ensure the team understands the new technology and establish a level of best practice across the team.

Distracted by Simple

Description

The team targets the simplest parts of the system first, rather than profiling the application overall and objectively looking for pain points in it. There may be parts of the system deemed “specialist” that only the original wizard who wrote them can edit.

Example comments

“Let’s get into this by starting with the parts we understand.”

“John wrote that part of the system, and he’s on holiday. Let’s wait until he’s back to look at the performance.”

Reality

  • The original developer understands how to tune (only?) that part of the system.

  • There has been no knowledge sharing or pair programming on the various system components, creating single experts.

Discussion

The dual of Distracted by Shiny, this antipattern is often seen in a more established team, which may be more used to a maintenance or keep-the-lights-on role. If the application has recently been merged or paired with newer technology, the team may feel intimidated or not want to engage with the new systems.

Under these circumstances, developers may feel more comfortable profiling only those parts of the system that are familiar, hoping that they will be able to achieve the desired goals without going outside of their comfort zone.

Of particular note is that both of these first two antipatterns are driven by a reaction to the unknown. In Distracted by Shiny, this manifests as a desire by the developer (or team) to learn more and gain advantage—essentially an offensive play. By contrast, Distracted by Simple is a defensive reaction, playing to the familiar rather than engaging with a potentially threatening new technology.

Resolutions

  • Measure to determine the real location of the bottleneck.

  • Ask for help from domain experts if the problem is in an unfamiliar component.

  • Ensure that developers understand all components of the system.

Performance Tuning Wizard

Description

Management has bought into the Hollywood image of a “lone genius” hacker and hired someone who fits the stereotype to move around the company and fix all performance issues using their perceived superior performance tuning skills.

Note

There are genuine performance tuning experts and companies out there, but most would agree that you have to measure and investigate any problem. It’s unlikely the same solution will apply to all uses of a particular technology in all situations.

Example comment

“I’m sure I know just where the problem is…”

Reality

  • The only thing a perceived wizard or superhero is likely to do is challenge the dress code.

Discussion

This antipattern can alienate developers in the team who perceive themselves to not be good enough to address performance issues. It’s concerning, as in many cases a small amount of profiler-guided optimization can lead to good performance increases (see Chapter 13).

That is not to say that there aren’t specialists who can help with specific technologies, but the idea that a lone genius will understand all performance issues from the outset is absurd. Most technologists who are genuine performance experts are specialists at measuring, and at problem solving based on those measurements.

Superhero types in teams can be very counterproductive if they are not willing to share knowledge or the approaches that they took to resolving a particular issue.

Resolutions

  • Measure to determine the real location of the bottleneck.

  • Ensure that any experts hired onto a team are willing to share and act as part of the team.

Tuning by Folklore

Description

While desperate to try to find a solution to a performance problem in production, a team member finds a “magic” configuration parameter on a website. Without testing the parameter the team applies it to production, because it must improve things exactly as it has for the person on the internet…

Example comment

“I found these great tips on Stack Overflow. This changes everything.”

Reality

  • The developer does not understand the context or basis of the performance tip, and the true impact is unknown.

  • It may have worked for that specific system, but that doesn’t mean the change will even have a benefit in another. In reality, it could make things worse.

Discussion

A performance tip is a workaround for a known problem—essentially a solution looking for a problem. Performance tips have a shelf life and usually age badly; someone will come up with a solution that will render the tip useless (at best) in a later release of the software or platform.

One source of performance advice that is usually particularly bad is admin manuals. They contain general advice devoid of context. Lawyers often insist on this vague advice and “recommended configurations” as an additional line of defense if the vendor is sued.

Java performance happens in a specific context, with a large number of contributing factors. If we strip away this context, then what is left is almost impossible to reason about, due to the complexity of the execution environment.

Note

The Java platform is also constantly evolving, which means a parameter that provided a performance workaround in one version of Java may not work in another.

For example, the switches used to control garbage collection algorithms frequently change between releases. What works in an older VM (version 6 or 7) may no longer apply in the current version (Java 8). There are even switches that are valid and useful in version 7 that will cause the VM to fail to start up in the forthcoming version 9.

A configuration change may be only one or two characters, but it can have a significant impact on a production environment if it is not carefully managed.

Resolutions

  • Only apply well-tested and well-understood techniques that directly affect the most important aspects of the system.

  • Look for and try out parameters in UAT, but as with any change it is important to prove and profile the benefit.

  • Review and discuss configuration with other developers and operations staff or devops.

The Blame Donkey

Description

Certain components are always identified as the issue, even if they had nothing to do with the problem.

For example, one of the authors saw a massive outage in UAT the day before go-live. A certain path through the code caused a table lock on one of the central database tables. An error occurred in the code and the lock was retained, rendering the rest of the application unusable until a full restart was performed. Hibernate was used as the data access layer and immediately blamed for the issue. However, in this case, the culprit wasn’t Hibernate but an empty catch block for the timeout exception that did not clean up the database connection. It took a full day for developers to stop blaming Hibernate and to actually look at their code to find the real bug.

Example comment

“It’s always JMS/Hibernate/A_N_OTHER_LIB.”

Reality

  • Insufficient analysis has been done to reach this conclusion.

  • The usual suspect is the only suspect in the investigation.

  • The team is unwilling to look wider to establish a true cause.

Discussion

This antipattern is often displayed by management or the business: in many cases they do not have a full understanding of the technical stack and are subject to unacknowledged cognitive biases, so they proceed by pattern matching. However, technologists are far from immune to it.

Technologists often fall victim to this antipattern when they have little understanding about the code base or libraries outside of the ones usually blamed. It is often easier to name a part of the application that is commonly the problem, rather than perform a new investigation. It can be the sign of a tired team, with many production issues at hand.

Hibernate is the perfect example of this: in many situations, use of Hibernate grows to the point where it is no longer set up or used correctly. The team then has a tendency to bash the technology, as they have seen it fail or perform badly in the past. However, the problem could just as easily be the underlying query, use of an inappropriate index, the physical connection to the database, the object mapping layer, or something else. Profiling to isolate the exact cause is essential.

Resolutions

  • Resist the pressure to rush to conclusions.

  • Perform analysis as normal.

  • Communicate the results of the analysis to all stakeholders (to encourage a more accurate picture of the causes of problems).

Missing the Bigger Picture

Description

The team becomes obsessed with trying out changes or profiling smaller parts of the application without fully appreciating the full impact of the changes. Engineers start tweaking JVM switches in an effort to gain better performance, perhaps based on an example or a different application in the same company.

The team may also look to profile smaller parts of the application using microbenchmarking (which is notoriously difficult to get right, as we will explore in Chapter 5).

Example comments

“If I just change these settings, we’ll get better performance.”

“If we can just speed up method dispatch time…”

Reality

  • The team does not fully understand the impact of changes.

  • The team has not profiled the application fully under the new JVM settings.

  • The overall system impact from a microbenchmark has not been determined.

Discussion

The JVM has literally hundreds of switches. This makes for a highly configurable runtime, but also gives rise to a great temptation to make use of all that configurability. Doing so is usually a mistake—the defaults and self-management capabilities are normally sufficient. Some of the switches also combine with each other in unexpected ways, which makes blind changes even more dangerous. Applications, even within the same company, are likely to behave and profile in completely different ways, so any recommended settings must be tried out and measured against your own application.
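For instance, the sheer number of switches, and the values currently in effect, can be inspected with the standard -XX:+PrintFlagsFinal flag; comparing this output before and after a proposed change in UAT is a simple way to confirm exactly what was altered:

    java -XX:+PrintFlagsFinal -version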

Performance tuning is a statistical activity, which relies on a highly specific context for execution. This implies that larger systems are usually easier to benchmark than smaller ones—because with larger systems, the law of large numbers works in the engineer’s favor, helping to correct for effects in the platform that distort individual events.

By contrast, the more we try to focus on a single aspect of the system, the harder we have to work to unweave the separate subsystems (e.g., threading, GC, scheduling, JIT compilation) of the complex environment that makes up the platform (at least in the Java/C# case). This is extremely hard to do, and the statistical handling required is delicate, and not a skillset that most software engineers have acquired along the way. This makes it very easy to produce numbers that do not accurately represent the behavior of the system aspect that the engineer believed they were benchmarking.

This has an unfortunate tendency to combine with the human bias to see patterns even when none exist. Together, these effects lead us to the spectacle of a performance engineer who has been deeply seduced by bad statistics or a poor control—an engineer arguing passionately for a performance benchmark or effect that their peers are simply not able to replicate.

There are a few other points to be aware of here. First, it’s difficult to evaluate the effectiveness of optimizations without a UAT environment that fully emulates production. Second, there’s no point in having an optimization that helps your application only in high-stress situations and kills performance in the general case—but obtaining sets of data that are typical of general application usage but also provide a meaningful test under load can be difficult.

Resolutions

Before making any change to switches live:

  1. Measure in production.

  2. Change one switch at a time in UAT.

  3. Ensure that your UAT environment has the same stress points as production.

  4. Ensure that test data is available that represents normal load in the production system.

  5. Test the change in UAT.

  6. Retest in UAT.

  7. Have someone recheck your reasoning.

  8. Pair with them to discuss your conclusions.

UAT Is My Desktop

Description

UAT environments often differ significantly from production, although not always in a way that’s expected or fully understood. Many developers will have worked in situations where a low-powered desktop is used to write code for high-powered production servers. However, it’s also becoming more common for a developer’s machine to be massively more powerful than the small servers deployed in production. Low-powered micro-environments are usually not a problem, as they can often be virtualized so that a developer has one of each. This is not true of high-powered production machines, which will often have significantly more cores, more RAM, and more efficient I/O than a developer’s machine.

Example comment

“A full-size UAT environment would be too expensive.”

Reality

  • Outages caused by differences in environments are almost always more expensive than a few more boxes.

Discussion

The UAT Is My Desktop antipattern stems from a different kind of cognitive bias than we have previously seen. This bias insists that doing some sort of UAT must be better than doing none at all. Unfortunately, this hopefulness fundamentally misunderstands the complex nature of enterprise environments. For any kind of meaningful extrapolation to be possible, the UAT environment must be production-like.

In modern adaptive environments, the runtime subsystems will make best use of the available resources. If these differ radically from those in the target deployment, they will make different decisions under the differing circumstances—rendering our hopeful extrapolation useless at best.

Resolutions

  • Track the cost of outages and opportunity cost related to lost customers.

  • Buy a UAT environment that is identical to production.

  • In most cases, the cost of the first far outweighs the cost of the second, but sometimes that case needs to be made explicitly to managers.

Production-Like Data Is Hard

Description

Also known as the DataLite antipattern, this antipattern relates to a few common pitfalls that people encounter while trying to represent production-like data. Consider a trade processing plant at a large bank that processes futures and options trades that have been booked but need to be settled. Such a system would typically handle millions of messages a day. Now consider the following UAT strategies and their potential issues:

  1. To make things easy to test, the mechanism is to capture a small selection of these messages during the course of the day. The messages are then all run through the UAT system.

    This approach fails to capture the burst-like behavior that the system could see. It may also miss warmup effects, such as heavier futures trading on one market before another market that trades options opens.

  2. To make the scenario easier to test, the trades and options are updated to use only simple values for assertion.

    This does not give us the “realness” of production data. If we are using an external library or system for options pricing, for example, our UAT dataset cannot tell us whether that production dependency has introduced a performance issue, because the range of calculations we are performing is only a simplified subset of the production data.

  3. To make things easier, all values are pushed through the system at once.

    This is often done in UAT, but it misses the key warmup effects and optimizations that occur when the data is fed in at a realistic rate.

Most of the time, the UAT test dataset is simplified to make things easier to manage. However, this simplification rarely produces useful results.

Example comments

“It’s too hard to keep production and UAT in sync.”

“It’s too hard to manipulate data to match what the system expects.”

“Production data is protected by security considerations. Developers should not have access to it.”

Reality

Data in UAT must be production-like for accurate results. If data is not available for security reasons, then it should be scrambled (aka masked or obfuscated) so it can still be used for a meaningful test. Another option is to partition UAT so developers still don’t see the data, but can see the output of the performance tests to be able to identify problems.
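As an illustration only (not a recommendation of any particular masking scheme), the sketch below uses a deterministic hash so that the same real value always maps to the same opaque token, preserving joins between masked datasets. The account-ID field is hypothetical, and in practice a low-entropy field like this would also need a secret salt to resist guessing:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class MaskExample {
        // Deterministic masking: the same real account ID always produces the
        // same token, so referential integrity across masked tables survives.
        static String maskAccountId(String accountId) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(accountId.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("ACCT-");
            for (int i = 0; i < 6; i++) {
                sb.append(String.format("%02x", hash[i] & 0xff));
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // Prints a stable 12-hex-character token for the (made-up) input
            System.out.println(maskAccountId("12345678"));
        }
    }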

Discussion

This antipattern also falls into the trap of “something must be better than nothing.” The idea is that testing against even out-of-date and unrepresentative data is better than not testing.

As before, this is an extremely dangerous line of reasoning. While testing at scale against something (even data that is nothing like production) can reveal flaws and omissions in the system, it also provides a false sense of security.

When the system goes live, and the usage patterns fail to conform to the expected norms that have been anchored by UAT data, the development and ops teams may well find that they have become complacent due to the warm glow that UAT has provided, and are unprepared for the sheer terror that can quickly follow an at-scale production release.

Resolutions

  • Consult data domain experts and invest in a process to migrate production data back into UAT, scrambling or obfuscating data if necessary.

  • Overprepare for releases for which you expect high volumes of customers or transactions.

Cognitive Biases and Performance Testing

Humans can be bad at forming accurate opinions quickly, even when faced with a problem where they can draw upon past experiences and similar situations.

Note

A cognitive bias is a psychological effect that causes the human brain to draw incorrect conclusions. It is especially problematic because the person exhibiting the bias is usually unaware of it and may believe they are being rational.

Many of the antipatterns that have been explored in this chapter are caused, in whole or in part, by one or more cognitive biases, which are in turn based on unconscious assumptions.

For example, with the Blame Donkey antipattern, if a component has caused several recent outages the team may be biased to expect that same component to be the cause of any new performance problem. Any data that’s analyzed may be more likely to be considered credible if it confirms the idea that the Blame Donkey is responsible. The antipattern combines aspects of the biases known as confirmation bias and recency bias (a tendency to assume that whatever has been happening recently will keep happening).

Note

A single component in Java can behave differently from application to application depending on how it is optimized at runtime. In order to remove any pre-existing bias, it is important to look at the application as a whole.

Biases can be complementary or dual to each other. For example, some developers may be biased to assume that the problem is not software-related at all, and the cause must be the infrastructure the software is running on; this is common in the Works for Me antipattern, characterized by statements like “This worked fine in UAT, so there must be a problem with the production kit.” The converse is to assume that every problem must be caused by software, because that’s the part of the system the developer knows about and can directly affect.

Reductionist Thinking

This cognitive bias stems from an analytical approach that insists that if you break a system into small enough pieces, you can understand the whole by understanding its constituent parts, and that understanding each part reduces the chance of incorrect assumptions being made.

The problem with this view is that in complex systems it just isn’t true. Nontrivial software (or physical) systems almost always display emergent behavior, where the whole is greater than a simple summation of its parts would indicate.

Confirmation Bias

Confirmation bias can lead to significant problems when it comes to performance testing, or when attempting to look at an application objectively. A confirmation bias is introduced, usually not intentionally, when a poor test set is selected or the results from a test are not analyzed in a statistically sound way. Confirmation bias is quite hard to counter, because there are often strong motivational or emotional factors at play (such as someone in the team trying to prove a point).

Consider an antipattern such as Distracted by Shiny, where a team member is looking to bring in the latest and greatest NoSQL database. They run some tests against data that isn’t like production data, because representing the full schema is too complicated for evaluation purposes. They quickly prove that on a test set the NoSQL database produces superior access times on their local machine. The developer has already told everyone this would be the case, and on seeing the results they proceed with a full implementation. There are several antipatterns at work here, all leading to new unproved assumptions in the new library stack.

Fog of War (Action Bias)

This bias usually manifests itself during outages or situations where the system is not performing as expected. The most common causes include:

  • Changes to the infrastructure that the system runs on, perhaps made without notification or without anyone realizing there would be an impact

  • Changes to libraries that the system is dependent on

  • A strange bug or race condition that manifests itself on the busiest day of the year

In a well-maintained application with sufficient logging and monitoring, these should generate clear error messages that will lead the support team to the cause of the problem.

However, too many applications have not tested failure scenarios and lack appropriate logging. Under these circumstances even experienced engineers can fall into the trap of needing to feel that they’re doing something to resolve the outage and mistaking motion for velocity—the “fog of war” descends.

At this time, many of the human elements discussed in this chapter can come into play if participants are not systematic about their approach to the problem. For example, an antipattern such as the Blame Donkey may shortcut a full investigation and lead the production team down a particular path of investigation—often missing the bigger picture. Similarly, the team may be tempted to break the system down into its constituent parts and look through the code at a low level without first establishing in which subsystem the problem truly resides.

It almost always pays to use a systematic approach to dealing with outage scenarios, leaving anything that does not require an immediate patch to the postmortem. However, this is the realm of human emotion, and it can be very difficult to take the tension out of the situation, especially during an outage.

Risk Bias

Humans are naturally risk averse and resistant to change, mostly because they have seen examples of change going wrong, and so they try to avoid that risk. This can be incredibly frustrating when taking small, calculated risks could move the product forward. Risk bias can be reduced significantly by having a robust set of unit tests and production regression tests; if either of these is not trusted, change becomes extremely difficult and the risk is not controlled.

This bias often manifests in a failure to learn from application problems (even service outages) and implement appropriate mitigation.

Ellsberg’s Paradox

As an example of how bad humans are at understanding probability, consider Ellsberg’s Paradox. Named for Daniel Ellsberg, the famous US military analyst and whistleblower, the paradox deals with the human preference for “known unknowns” over “unknown unknowns.”2

The usual formulation of Ellsberg’s Paradox is as a simple probability thought experiment. Consider a barrel, containing 90 colored balls—30 are known to be blue, and the rest are either red or green. The exact distribution of red and green balls is not known, but the barrel, the balls, and therefore the odds are fixed throughout.

The first step of the paradox is expressed as a choice of wagers. The player can choose to take either of two bets:

  A. The player will win $100 if a ball drawn at random is blue.

  B. The player will win $100 if a ball drawn at random is red.

Most people choose bet A, as it represents known odds: the likelihood of winning is exactly 1/3. However (assuming that when a ball is removed it is placed back in the same barrel, which is then rerandomized), when the player is presented with a second pair of bets something surprising happens. This time the options are:

  C. The player will win $100 if a ball drawn at random is blue or green.

  D. The player will win $100 if a ball drawn at random is red or green.

In this situation, bet D corresponds to known odds (2/3 chance of winning), so virtually everyone picks this option.

The paradox is that the set of choices A and D is irrational. Choosing A implicitly expresses an opinion about the distribution of red and green balls—effectively that “there are more green balls than red balls.” Therefore, if A is chosen, then the logical strategy is to pair it with C, as this would provide better odds than the safe choice of D.
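To see why, write r and g for the (unknown) numbers of red and green balls, so that r + g = 60. Then:

    P(win | A) = 30/90 = 1/3                (known)
    P(win | B) = r/90                       (unknown, anywhere from 0 to 2/3)
    P(win | C) = (30 + g)/90 = (90 - r)/90  (unknown)
    P(win | D) = (r + g)/90 = 60/90 = 2/3   (known)

Preferring A to B amounts to believing that r/90 < 1/3, i.e., that r < 30 and hence g > 30. But in that case (30 + g)/90 > 2/3, so the same belief implies that C should be preferred to D.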

Summary

When you are evaluating performance results, it is essential to handle the data in an appropriate manner and avoid falling into unscientific and subjective thinking. In this chapter, we have met some of the types of test, testing best practices, and antipatterns that are native to performance analysis.

In the next chapter, we’re going to move on to looking at low-level performance measurements, the pitfalls of microbenchmarks, and some statistical techniques for handling raw results obtained from JVM measurements.

1 The term was popularized by the book AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis, by William J. Brown, Raphael C. Malveau, Hays W. McCormick III, and Thomas J. Mowbray (New York: Wiley, 1998).

2 To reuse the phrase made famous by Donald Rumsfeld.
