Chapter 1. Introduction to Aerospike

Aerospike is a distributed NoSQL database with exceptional speed on both reads and writes and a track record of very high uptime. That sounds reasonable and normal for a database, but it doesn’t put the capabilities of this software into perspective.

In the data management industry, we carry certain assumptions about what a database can do. Aerospike doesn’t live by those restrictions. Submillisecond transaction speeds are normal, even on petabyte-scale data volumes, without resorting to huge clusters. This is not what any sensible, experienced data analyst or database administrator expects. To truly understand Aerospike and get the most out of it, you’ll have to set aside a certain amount of what you “know” and come at it with an open mind.

NoSQL databases have traditionally achieved high performance and scale by relaxing consistency guarantees. However, this trade-off is a poor choice for use cases that require a high level of data correctness. A majority of applications contain some data requiring correctness and consistency, such as financial data, as well as data that can tolerate limited consistency violations, such as data from clickstreams.

Aerospike maintains strong consistency along with high-performance characteristics that have been proven in hundreds of mission-critical production deployments with several years of continuous uptime in the face of typical hardware and network failures. The type of strong consistency that Aerospike implements can satisfy the stringent requirements needed for applications that are dependent on a system of record. Aerospike achieves its high performance on less hardware by making optimal use of flash storage, utilizing vertical scaling to effectively use the hardware on each server instance, and achieving excellent horizontal scaling using distributed clustering algorithms.

For the rest of this book, we’ll assume you have a good working knowledge of data, databases, and how to work with them. The book will focus on what makes Aerospike unique, and what you need to know about how it differs from other databases so that you can get the most out of it.

What Makes Aerospike Different

Strong consistency can be fast, even at scale.

While there are many important differences between Aerospike and other databases, the combination of strong consistency and high speed regardless of data scale is the essence of what makes Aerospike unique. And thanks to its ability to take advantage of flash storage devices, it does this on relatively small clusters.

First, consider Aerospike’s speed with consistency. A widespread view maintains that there will always be a major performance gap between running a system with relaxed consistency and one that supports strong consistency.

You’ve likely heard of Brewer’s CAP theorem, which holds that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. It has been one of the founding principles of NoSQL systems, yet Aerospike has spent over a decade pushing the envelope in this area with remarkable results.

Some Aerospike use cases require strong consistency without losing the system’s high performance capabilities. Aerospike’s unique strong consistency algorithm can achieve this, maintaining high performance and low latency while maximizing system availability.

The requirements Aerospike meets in order to accomplish this are:

  • Not losing writes under any circumstances, including during split-brain scenarios and other situations where multiple nodes are missing from the cluster

  • Avoiding stale reads

  • Both adding and removing capacity (nodes) through simple operational procedures that do not result in consistency or availability breaches

  • Allowing developers and system architects to choose strong consistency or higher availability on a per-dataset basis

  • Maintaining both availability and consistency during a rolling software upgrade process provided the number of nodes down at any time is fewer than the replication factor of the data

Availability and Consistency (AP and CP Modes)

Partition tolerance must always be prioritized in any clustered, distributed system. Aerospike lets you configure which of the remaining two properties matters most for your particular use case, availability or consistency: AP (availability, partition tolerance) mode or CP (consistency, partition tolerance) mode.

Many distributed databases sacrifice consistency, dropping to a state known as eventual consistency: the data on all nodes of the cluster will eventually converge to the same values, but there will be periods of time when it has not yet done so.

Transactional databases, however, generally require the opposite of eventual consistency—in other words, strong consistency. In these databases, any node can be queried and produce an identical result.

Regardless of which you choose to prioritize, consistency or availability, Aerospike works to maximize the other as well.

Availability first (AP Mode)

Even when configured to prioritize availability, in AP mode, violating consistency is rare in a properly running Aerospike system. In Chapter 5, you’ll learn more about the two conditions that might cause consistency to be affected in AP mode.

Consistency first (CP Mode)

While AP mode provides a surprisingly high level of data consistency, it isn’t perfect. When data correctness is essential, choose CP mode, prioritizing consistency over availability. Most systems that provide strong consistency require a minimum of three copies.1 So, if a cluster splits, one of the two subparts of the cluster can continue to accept writes if it holds a majority (two out of three) of the copies of a data partition.

Aerospike optimizes cluster size by allowing storage of only two copies. Using an adaptive scheme that adds more write copies on the fly in situations where they are necessary, Aerospike provides the theoretically correct result of a three-copy distributed system while only paying the infrastructure and processing costs for keeping two copies of the data.
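
To make this concrete from an application’s point of view, here is a minimal sketch using the Aerospike Java client. It assumes a running cluster with two namespaces: a hypothetical “clicks” namespace left in AP mode and a hypothetical “ledger” namespace configured with strong consistency on the server; the host address, set names, and keys are likewise illustrative. The mode itself is chosen per namespace in the server configuration; the client-side read policies shown here (readModeAP, readModeSC) only refine behavior within whichever mode a namespace uses.

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;
    import com.aerospike.client.policy.Policy;
    import com.aerospike.client.policy.ReadModeAP;
    import com.aerospike.client.policy.ReadModeSC;

    public class ConsistencyModes {
        public static void main(String[] args) {
            // Hypothetical seed node; the client discovers the rest of the cluster from it.
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            try {
                // "clicks" is assumed to be an AP-mode namespace (availability first).
                Policy apRead = new Policy();
                apRead.readModeAP = ReadModeAP.ONE;          // read from a single node
                Record click = client.get(apRead, new Key("clicks", "events", "user-42"));

                // "ledger" is assumed to have strong-consistency enabled (CP mode).
                Policy scRead = new Policy();
                scRead.readModeSC = ReadModeSC.LINEARIZE;    // strictest read guarantee
                Record balance = client.get(scRead, new Key("ledger", "accounts", "acct-1001"));

                System.out.println("click: " + click + ", balance: " + balance);
            } finally {
                client.close();
            }
        }
    }

If the keys do not exist, the reads simply return null; the point is that the consistency guarantee is a property of the namespace plus the read policy, not of a separate API.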

Chapter 5 discusses the few conditions when availability is impaired in CP mode, as well as other important aspects of the basic architecture of Aerospike. Chapter 9 dives even deeper into how this is accomplished.

Flash Optimization

Another key difference between Aerospike and other databases, which allows it to operate at in-memory database speeds while using around one-tenth the number of cluster nodes, is its ability to take advantage of the characteristics of flash storage in its hybrid storage approach.

Besides throughput and latency, a database is characterized by the amount of data it can store and process. Leveraging flash requires a different storage mechanism from the standard spinning disks that most databases were built around, and Aerospike’s software is designed to take advantage of those differences.

Flash storage (SSDs) can be a game changer for real-time data. While in-memory configurations, which use computer memory for all data storage and computations, can be somewhat faster, Aerospike has demonstrated that SSDs can increase the per-node capacity between 10 and 100 times, often with little perceptible loss of performance.

An Aerospike system that uses flash storage can manage dozens of terabytes of data on a single machine with submillisecond record access (read and write) times. This design is possible because the read latency characteristic of input and output (I/O) in SSDs is predictably fast, regardless of whether it is random or sequential. At the time of writing, a typical commodity server can be affordably set up with 30 TB of high-performance flash and 1 TB of memory—providing 30 times the capacity of a 1 TB in-memory database on the same server.

Aerospike supports multiple kinds of storage architectures: hybrid flash (aka hybrid memory), in-memory, and all-flash. To leverage flash storage, Aerospike implements a hybrid model where data resides in flash storage, and indexes reside entirely in memory. Figure 1-1 illustrates the hybrid and in-memory architectures side by side.

One interesting thing you’ll notice in Figure 1-1 is that two namespaces in Aerospike can be configured differently, one to use only memory, and the other to use a hybrid of memory and flash. They can coexist in the same node without issue. Both namespaces have their primary key index in memory. The main difference is that the in-memory namespace also stores its data in memory. The hybrid namespace uses a slightly more involved process to save its data to flash storage. Chapter 5 will describe this process in more depth.

Figure 1-1. Hybrid flash and in-memory architectures
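
The storage engine behind a namespace is a server-side configuration choice and is transparent to application code. As a rough illustration, the sketch below uses the Java client; the namespace names “cache” and “profiles” are hypothetical, assuming the first has been configured as in-memory and the second as hybrid with data on flash. The calls against both are identical:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;

    public class NamespaceStorage {
        public static void main(String[] args) {
            // Hypothetical single-node address for a local test cluster.
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            try {
                // Assumed in-memory namespace: primary index and data both in DRAM.
                Key sessionKey = new Key("cache", "sessions", "session-abc");
                // Assumed hybrid namespace: primary index in DRAM, data on flash.
                Key profileKey = new Key("profiles", "users", "user-42");

                // The API is the same; the storage engine is invisible to the application.
                client.put(null, sessionKey, new Bin("token", "xyz"));
                client.put(null, profileKey, new Bin("country", "DE"));

                Record profile = client.get(null, profileKey);
                System.out.println(profile);
            } finally {
                client.close();
            }
        }
    }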

Storage I/O is the slowest step in any application that uses a storage device, so Aerospike keeps the index entirely in memory: no storage I/O is required to traverse the index, even when data is stored on flash devices. This makes read performance in Aerospike high and predictable. This design is possible because the read latency characteristic of I/O in NAND flash (the type of flash drive most commonly used in servers) has little penalty for random access. Database access first traverses the index and acquires metadata that indicates where the required data is located. If the data element is located in a local cache, it can be returned without any I/O access at all. If it is in flash storage, a single I/O will be executed to bring the entire element into local memory.

Flash’s fast random reads come with a trade-off: SSDs tolerate only a limited number of write cycles. In order to avoid creating uneven wear on a single part of the SSD, and to colocate all the data of a record in one location, Aerospike does not perform in-place updates. Instead, it employs a copy-on-write mechanism using large block writes. This wears the SSD down evenly, improving device durability, and guarantees that all data for a particular record is in one location, thus preventing fragmentation of data in storage.

Gathering writes together into large blocks enables very high write throughput by combining hundreds of transactional writes into a single large block write and ensures that the internal coalescing algorithms inside a flash controller do less work. The copy-on-write mechanism is also beneficial for data correctness and recovery in case of hardware failures.

Aerospike is typically configured to bypass the operating system’s filesystem and instead use attached flash devices directly as block devices with a custom data layout. This avoids depending on filesystems that are not optimized for flash, which would generate extra I/O and wear, as well as the complexity of matching an operating system’s data commit policy to the filesystem’s.

These techniques are instrumental in providing Aerospike the unique ability to support extremely high application write rates in production while maintaining in-memory-class read response times, as well as high data correctness.

What Makes Aerospike Optimal for Submillisecond Workloads

Beyond its unique combination of high consistency and availability, and its small cluster sizes for demanding workloads at scale, the architecture of Aerospike is focused on read latencies lower than a millisecond on any volume of data, even at extreme throughput levels. You might expect this level of performance to require massive clusters of hardware or cloud instances, but thanks to its flash utilization and other optimizations, Aerospike frequently accomplishes this goal with a tenth (or fewer) of the nodes required by comparable purely in-memory databases. Several aspects of the Aerospike architecture make this possible, including its shared-nothing design, its ability to maintain predictably high performance with high availability, and its optimal use of each node in a cluster.

Avoiding Hotspots with Shared-Nothing Architecture

All servers in an Aerospike cluster are peers, with no master, leader, or other differentiated nodes. For horizontal scaling, Aerospike uses a shared-nothing architecture in which all nodes forming the database cluster are homogeneous—identical in terms of CPU, memory, storage, and networking capacity. A key component of this scheme is a uniform data partitioning algorithm that eliminates all skew in terms of data distribution across database cluster nodes, thereby ensuring there are no hotspots.

Access to flash storage is done in a massively parallel manner, creating a transparent data layout similar to a high-performance “RAID-0” striping system. The keys are mapped into various devices within nodes in a uniform manner. This ensures that an even amount of data is stored on every node and every flash device, thus making sure that all hardware is used equally and the load on all servers and storage devices is balanced.

These strategies, and several more you’ll learn about later, prevent hotspots and require no configuration changes even as the workload changes. For example, in Figure 1-2, you can see 80-way parallelism illustrated in a cluster of 16 nodes with 5 flash devices on each node.

Figure 1-2. Massively parallel data access on a 16-node hybrid flash Aerospike cluster showing data and processing load per node

Keeping Performance Predictable with Resilient Stability

There are multiple things that make Aerospike’s performance predictable and consistent. You just learned that storage device I/O is not required to access the index in hybrid-memory or in-memory configurations. For data access, the previous section noted that read latency for random access to SSDs is predictably low, and that bypassing the operating system’s filesystem makes every piece of data exactly one I/O operation away, regardless of whether access is random or sequential. Beyond that, several architectural features keep the cluster as stable and self-healing as possible so performance remains relatively constant over time.

Automatic partitioning of the key space and an automatic data rebalancing mechanism ensure that the transaction volume is distributed evenly across all nodes and stays robust even if node failures occur during the rebalancing process itself. The system is designed to be continuously available, so data rebalancing doesn’t impact cluster behavior.

Clusters self-heal without requiring operator intervention, even at demanding times. You can configure and provision your hardware capacity and set up replication/synchronization policies so that the database recovers from failures without affecting users, and you can sleep at night.

The mapping from partition to node (partition map) is exchanged and cached with the clients. Sharing of the partition map with the client is critical in making client-server interactions extremely efficient. This is why, in Aerospike, access to data from the client requires a single network request to the server node containing the data item and a maximum of one storage I/O operation each time. In steady state, the scale-out ability of the Aerospike cluster is purely a function of the number of clients and server nodes. This guarantees the linear scalability of the system as long as other parts of the system—like network interconnect—can absorb the load.
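
From the application’s side, none of this partition-map machinery is visible; it shows up only as consistently low latency. The hedged sketch below (Java client; the seed addresses, namespace “test”, set, and key are all hypothetical) connects to a cluster through a couple of seed nodes, after which the client maintains the partition map itself and routes every request directly to the owning node:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Host;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;
    import com.aerospike.client.policy.ClientPolicy;

    public class SingleHopAccess {
        public static void main(String[] args) {
            // Hypothetical seed nodes; the client contacts one, learns the full cluster
            // membership, and caches the partition map locally, refreshing it as needed.
            Host[] seeds = { new Host("10.0.0.1", 3000), new Host("10.0.0.2", 3000) };
            AerospikeClient client = new AerospikeClient(new ClientPolicy(), seeds);
            try {
                Key key = new Key("test", "demo", "user-42");

                // The client hashes the key, looks up the owning partition in its cached
                // map, and sends each request straight to that node: one network round
                // trip per operation, plus at most one storage I/O on the server.
                client.put(null, key, new Bin("name", "Ada"), new Bin("visits", 1));
                Record record = client.get(null, key);
                System.out.println(record);
            } finally {
                client.close();
            }
        }
    }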

To prevent hotspots that could affect performance, Aerospike colocates indexes and data, avoiding any cross-node traffic when running read operations or queries. Writes may require communication between multiple nodes based on the replication factor. Colocation of index and data, together with a robust data distribution hash function, gives you even data distribution across nodes. This ensures that:

  • Application workload is uniformly distributed across the cluster.

  • Performance of database operations is predictable.

  • Scaling the cluster up and down is easy.

  • Live cluster reconfiguration and subsequent data rebalancing is simple, nondisruptive, and efficient.

Reducing Cluster Size with Optimal Use of Each Node

For a system to operate at extremely high throughput with low latency, it is necessary not just to scale out across nodes but also to scale up (vertically) on each node. This section discusses system-level details that help Aerospike scale up to millions of transactions per second at submillisecond latencies. The techniques used by Aerospike apply to any data storage system.

The ability to scale vertically on each node effectively means the following:

  • Higher throughput levels on fewer nodes.

  • Better failure characteristics, since probability of a node failure typically increases as the number of nodes in a cluster increases.

  • Smaller operational footprint. Managing a 10-node cluster versus a 200-node cluster is a huge win for operators.

  • Lower total cost of ownership. This is especially true once you factor in SSD-based scaling.

The basic philosophy here is to enable the system to take full advantage of the hardware by leveraging it in the best way possible. To accomplish this, Aerospike is written in the C language and has a number of CPU and network optimizations that allow it to handle millions of single-record transactions per second with submillisecond latencies on a single node.

The scale-up architecture can reduce the cost of ownership of a system using the hybrid memory configuration by as much as 80% compared to the same system using a pure in-memory configuration. The surprising part is that no compromise is needed in application-level performance to get the savings in infrastructure and operational costs. See Chapter 5 for more information about the architecture that makes this possible.

Why Milliseconds Matter

For many workloads, Aerospike would have a ridiculously overpowered level of performance, like using a Formula 1 race car for your daily commute. However, for some use cases this level of speed isn’t just nice to have, it’s essential. Whether the scale of data is fairly small or extremely large, the key factor for success is often performance. For a variety of industries, particularly the more modern industries built on the speed expectations of digital natives, milliseconds matter as operations have to be executed within strict time-bound service-level agreements (SLAs).

Some examples:

  • The ad-tech industry has to handle millions of ad auctions per second, each within a bidding loop of 100 milliseconds. Too slow and no bid wins: the audience has clicked to a different window and the ad is never displayed.

  • As a video game is played, personalized in-game optimizations can mean the difference between a hit for the game and the gaming company and a game that flops.

  • For real-time events (ticket purchases for popular events, fantasy sports where millions of players are finalizing their teams in the final minutes before a match starts, etc.) the ability to handle enormous load spikes, up to 100× more than normal load, can make the difference between success and failure of a billion-dollar business.

  • For financial companies with digital payment options, detecting fraud within the transaction execution time itself (i.e., within 200 milliseconds) can mean the difference between normal customer service and big problems like financial loss, regulatory exposure, etc. With the right level of speed, fraud detection becomes fraud prevention, but only if the system can handle both the volume and the speed needed.

There are many industries where people do not yet see the need for submillisecond response times simply because they find it hard to conceive of such rapid access to their data. Faster data access can result in business changes that have considerable potential to increase top-line revenue.

Consider a trading company that runs compliance algorithms in a nightly batch on the mainframe, potentially missing noncompliant trades during the day. Faster data access could mean those compliance algorithms could be run every few minutes, eliminating this risk.

Or consider a digital identity management company that could verify a user’s identity three times faster, and with a higher level of confidence, by using more data in real time to make better decisions.

Chapter 10 discusses some in-depth examples of use cases that are ideal for Aerospike’s strengths.

Summary

In this book, you’ll learn the basics of how to go from being an Aerospike beginner to having the confidence to use it daily. You’ll learn a bit about how to model data in Aerospike, how to monitor it for issues, and how to optimize the underlying hardware. You’ll get some basic information about Aerospike architecture, more depth on what makes it unique, and guidance on different methods to work with it to achieve optimal performance. In the end, you should have a solid foundation that will let you move to the next level and put Aerospike to work for you. First, let’s take a look at getting it installed and building your first application.

1 Leslie Lamport, Lower Bounds for Asynchronous Consensus. Microsoft Research, Microsoft Corporation, MSR-TR-2004-72 (2006).
