Chapter 4. NewSQL and Operational Analytics

The NewSQL Revolution

Timing is everything. The progress that the technology and open source communities had made up to this point laid the foundation for the next evolutionary step in SQL processing. As we've seen, we started with single-node online transaction processing (OLTP) database engines, moved to hybrid OLTP/online analytical processing (OLAP) engines, and then on to dedicated NoSQL systems, which could be used in conjunction with one another to solve problems across many domains.

Across cloud computing and distributed systems, new design patterns and architectures were devised to provide the base for each set of additional innovations. These patterns opened up an entirely new set of capabilities that had previously only been dreamed of: massive horizontal scaling, reliable and durable distributed systems, and frameworks for processing streams of data, as we saw with both Kafka and Spark.

These steps paved the way for NewSQL and operational analytics. NewSQL combines the best of traditional relational systems with the horizontal scalability formerly found only in NoSQL; it has been endorsed by industry analysts such as Gartner and Forrester, as we describe in a moment. Let's look further into these topics now.

What Exactly Is NewSQL?

NewSQL databases represent the marriage of the ACID-based transactional consistency and reliability of OLTP databases with the high availability, horizontal scalability, and fault tolerance found in NoSQL databases. NewSQL databases have been redesigned from the ground up to support cloud-native architecture while balancing the trade-offs between consistency, availability, and partition tolerance that are required to run highly elastic, distributed systems in the cloud.

These new databases establish a holistic ecosystem that supports business needs across both OLTP (RDBMS) requirements and Business Intelligence (BI)-style OLAP workflows, all within a single hybrid system. (Gartner has coined the term HTAP, for hybrid transactional/analytical processing, to describe this new marriage.) It is through this merger of cross-functional capabilities that the stage has been set for the revolutionary operational database.

Gartner has described HTAP as creating new opportunities for dramatic innovation by business. Forrester calls this style of processing translytical, combining transactional and analytical processing in a single database, and it also sees bright prospects ahead.

A Brief Note on NewSQL and Database Theory

The CAP theorem governs the choices that can be made in database design. It states that a distributed database can provide only two of three desired characteristics at any given time: consistency, availability, and partition tolerance. For NewSQL databases, partition tolerance is taken as a given because scalability requires it.

NewSQL databases can be tuned to meet the operational needs of many different styles of workloads, favoring either consistency and partition tolerance (CP) or availability and partition tolerance (AP), and in some cases balancing between CP and AP behavior to achieve standard service-level agreements (SLAs) within a highly elastic environment.
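
To make the trade-off concrete, here is a minimal, purely illustrative Python sketch of how a per-workload consistency mode might behave; the Replica and ReplicaSet names are invented for this example and do not correspond to any vendor's API. A synchronous write waits for every replica before acknowledging (CP-leaning), while an asynchronous write acknowledges after the primary alone (AP-leaning).

    # Hypothetical sketch of a per-workload consistency knob; not a real API.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Replica:
        name: str
        log: List[str] = field(default_factory=list)

        def apply(self, record: str) -> bool:
            self.log.append(record)
            return True

    @dataclass
    class ReplicaSet:
        replicas: List[Replica]

        def write(self, record: str, mode: str = "sync") -> bool:
            primary, *secondaries = self.replicas
            primary.apply(record)
            if mode == "sync":
                # Wait for every secondary: stronger consistency, but an
                # unreachable replica stalls or fails the write (CP-leaning).
                return all(r.apply(record) for r in secondaries)
            # Acknowledge immediately and replicate in the background: the
            # write stays available during a partition, at the cost of
            # temporary divergence (AP-leaning).
            for r in secondaries:
                r.apply(record)
            return True

    rs = ReplicaSet([Replica("primary"), Replica("replica-1"), Replica("replica-2")])
    rs.write("INSERT INTO orders ...", mode="sync")   # transactional workload
    rs.write("page_view event", mode="async")         # analytics/event workload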

The Operational Database and Operational Analytics

Given the separation of concerns (SoC) between OLTP and OLAP queries, these hybrid databases must provide flexible distributed query processing that can continue to power mission-critical BI search queries while ingesting a near-continuous torrent of events, metrics, and transactions from across operational, edge, and mobile systems. They must not only provide an accurate view of up-to-date data, but also be capable of analyzing the equivalent of an entire data warehouse's worth of data on demand, in order to achieve the very fast query response times required to answer the ever-growing demands of the business.

Key Features of the Operational Database

The operational database needs to scale effectively to handle enormous, unbounded datasets, supporting simultaneous ingestion of continuous data while serving uninterrupted complex queries from hundreds to thousands of concurrent users (human users sending SQL queries, BI programs, and apps, including newer apps powered by machine learning and AI). The defining features of these databases are as follows.

Zero Downtime

In the modern age of cloud computing and SaaS, there is a zero-tolerance policy for service interruptions and any kind of downtime. Planned or unplanned maintenance can cost your company thousands, even millions, of dollars, as seen with Amazon's Prime Day outage in 2018. Your operational database must be highly available, fault-tolerant, and self-healing.

Shared-Nothing Scale-Out

Operational databases achieve high availability, fault tolerance, and consistency by incorporating a shared-nothing, scale-out architecture. Data exists only where it needs to, replicated to more than one place only as required to maintain fault tolerance, typically arranged as an active primary shard with passive replica shards. Gossip-style protocols allow each node within the distributed database to act as a query processing node, a search relay node, or a coordination node for data ingestion and for primary and recovery replication. The gossip analogy takes its roots from evolutionary psychology, in which gossip improved cooperation among people by directly affecting their social status, thereby keeping people honest. By allowing distributed components to "gossip," there are no secrets in the system and therefore no single points of failure.

The ability of each node within the stack to act on a query removes the need for a single master node and enables low end-to-end latency by running queries at the source of the data across a locally distributed subcluster. This architecture also provides strong durability for transactional data, because the primary data nodes receive an update, write the change, and broadcast it before a success state is returned.
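
As a rough illustration of the gossip-style coordination described above, the following Python sketch shows nodes exchanging heartbeat tables with random peers until every node converges on the same cluster-wide view. Real engines layer failure detection, versioning, and role metadata on top, so treat this only as a conceptual model.

    # Toy gossip round: nodes merge heartbeat tables with random peers so that
    # cluster membership state spreads without a central coordinator.
    import random
    from typing import Dict

    class Node:
        def __init__(self, name: str):
            self.name = name
            self.heartbeats: Dict[str, int] = {name: 0}

        def tick(self) -> None:
            # Each node bumps its own heartbeat every round.
            self.heartbeats[self.name] += 1

        def gossip_with(self, peer: "Node") -> None:
            # Keep the freshest heartbeat seen for every node, then share the
            # merged view with the peer.
            merged = dict(self.heartbeats)
            for node, beat in peer.heartbeats.items():
                merged[node] = max(beat, merged.get(node, -1))
            self.heartbeats = dict(merged)
            peer.heartbeats = dict(merged)

    nodes = [Node(f"node-{i}") for i in range(4)]
    for _ in range(10):                      # a few gossip rounds
        for n in nodes:
            n.tick()
            n.gossip_with(random.choice(nodes))
    print(nodes[0].heartbeats)               # every node converges on a cluster-wide view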

Balance Consistency and Availability

Operational databases favor consistency over availability while at the same time achieving high availability and fault tolerance through the shared-nothing, scale-out architecture. NewSQL databases like the one delivered by MemSQL enable tuning between consistency and availability to achieve the correct balance between CP and AP to handle the various needs of different business use cases.

High-Performance Data Ingestion

Loading data into the operational database should be high performance, adhering to strict data availability SLAs while achieving very low (subsecond) end-to-end refresh latencies—that is, with respect to the refresh of the underlying collections of data backing the database's logical tables. Ingestion should also be simplified by enabling continuous loading of data through existing external and database-specific data pipelines.

Streaming data sources like Apache Kafka, and the continuous ingestion of file streams from Amazon S3, Microsoft Azure Blob, or HDFS (common sources of data stored within a business's data warehouse or data lake), should be standard capabilities that ship with an operational database. Change Data Capture (CDC) also comes into play here, because operational databases are often expected to live alongside systems of record. Ingesting new data and refreshing the views of data within the database should not affect or interfere with system-wide query performance.
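
The sketch below shows one way such continuous loading can look from the application side, assuming the kafka-python and PyMySQL client libraries: a Kafka consumer writes small batches into a SQL table over the MySQL wire protocol (which MemSQL speaks). The topic, table, and connection details are placeholders, and an engine with native pipeline support can replace this hand-written loop entirely.

    # Illustrative continuous-load loop: Kafka topic -> SQL table in small batches.
    import json
    from kafka import KafkaConsumer   # pip install kafka-python
    import pymysql                    # MemSQL is MySQL wire-protocol compatible

    consumer = KafkaConsumer(
        "orders",                                   # hypothetical topic
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        enable_auto_commit=False)

    conn = pymysql.connect(host="memsql-host", user="ingest", password="...",
                           database="analytics")

    batch = []
    for message in consumer:
        batch.append((message.value["order_id"],
                      message.value["status"],
                      message.value["amount"]))
        if len(batch) >= 500:                       # small batches keep latency low
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO orders (order_id, status, amount) VALUES (%s, %s, %s)",
                    batch)
            conn.commit()
            consumer.commit()                       # advance offsets only after the write
            batch.clear()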

Fast Query Response Time

Operational databases need to elastically scale up to handle a massive number of concurrent queries made by hundreds and even thousands of connected users, with the same speed and reliability as traditional OLTP databases, while also supporting BI/OLAP join, rollup, drilldown, and aggregate workloads. Query response time is not measured in minutes, and in many cases not even seconds, but in milliseconds. Powering critical decision making requires subsecond response times on your business's most up-to-date data in order to drive insights and clearly highlight emerging trends.

Some databases, like MemSQL, combine highly optimized columnar data stored on disk, which achieves a high level of compression and extremely fast aggregate queries, with in-memory (RAM) rowstores, which support fast point queries and full row/columnar indexes across the data stored in the hybrid database. This combination achieves the best of both OLTP and OLAP while still supporting full ANSI SQL, along with all the analytical SQL language features expected for the myriad BI use cases.
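
As a hedged illustration of that hybrid layout, the following Python snippet creates a disk-backed columnstore table for scans and aggregates alongside an in-memory rowstore table for point lookups, then issues one query of each kind. The DDL follows MemSQL's documented columnstore-key style, but the table names and options here are illustrative; consult the engine's reference for the exact syntax.

    import pymysql  # MemSQL is MySQL wire-protocol compatible

    conn = pymysql.connect(host="memsql-host", user="app", password="...",
                           database="analytics")
    with conn.cursor() as cur:
        # Columnstore: compressed, disk-backed, fast aggregate scans.
        cur.execute("""CREATE TABLE page_views (
                           ts DATETIME, user_id BIGINT, url VARCHAR(2048),
                           KEY (ts) USING CLUSTERED COLUMNSTORE)""")
        # Rowstore (in-memory): fast point reads and writes with full indexes.
        cur.execute("""CREATE TABLE user_profiles (
                           user_id BIGINT PRIMARY KEY, plan VARCHAR(32),
                           updated DATETIME)""")
        # An analytical rollup hits the columnstore...
        cur.execute("""SELECT url, COUNT(*) AS views FROM page_views
                       GROUP BY url ORDER BY views DESC LIMIT 10""")
        # ...while an OLTP-style point query hits the rowstore.
        cur.execute("SELECT plan FROM user_profiles WHERE user_id = %s", (42,))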

Security to Stake Your Company On

Having a fast, reliable database also means having very strict security policies. Data access-level security policies protect the data within a shared database environment across all connected users. Grant-style permissions, like those that became popular for securing OLTP databases, provide the basic table- and view-level access rules within the database. It has also become popular to include column-level encryption/decryption policies, controlled by user-defined functions (UDFs) or one-way hashes, to ensure personal information or restricted data doesn't leak out to just any user. Enabling data owners to view critical information while adhering to extremely strict security policies, such as compliance with the European Union's General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), is a must. Additionally, end-to-end transport-level encryption and encryption at rest are must-haves for enterprise security.
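
One common pattern mentioned above, hashing a sensitive column on the way in so it can still be joined on and counted but never read back, can be sketched in a few lines of Python. The field names and salt handling here are deliberately simplified and are not a substitute for a reviewed security design.

    # Keyed one-way hash of a PII column before it is written to the database.
    import hashlib
    import hmac
    import os

    # Secret salt/key stays outside the database (e.g., in a secrets manager).
    SECRET_SALT = os.environ.get("PII_HASH_SALT", "change-me").encode("utf-8")

    def hash_pii(value: str) -> str:
        # Same input -> same token, so joins and counts still work, but the
        # original value cannot be recovered without the salt.
        return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()

    row = {"order_id": 1001,
           "email": hash_pii("jane@example.com"),   # stored as an opaque token
           "total": 59.95}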

Use Cases

Now that we understand what NewSQL is and how the operational database enables unified transactional and hybrid analytics workflows, we next explore how the integration of these new databases can be transformative when mixed into large enterprise architectures. We will see how complexity can be reduced while opening the door to more consistent, predictable workflows.

Kafka is the leading open source solution for event streaming, increasingly the platform of choice for data-driven interoperability, and the most widely supported streaming connector across SaaS offerings and open source projects. Apache Spark works natively with Kafka and, through plug-ins, provides an additional connectivity layer to most database technologies available on the market; Kafka's stream processing model enables exactly-once processing in Spark applications for many use cases. Kafka is also a native component of MemSQL's Pipelines ingestion architecture, as well as of Apache Druid and Apache Pinot.
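
To ground that connectivity claim, here is a minimal PySpark Structured Streaming sketch that reads a hypothetical orders topic from Kafka and writes it to a checkpointed sink. It assumes a Spark cluster with the Kafka connector on the classpath, and every topic, schema, and path name is a placeholder.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructType

    spark = SparkSession.builder.appName("kafka-orders-stream").getOrCreate()

    schema = (StructType()
              .add("order_id", StringType())
              .add("status", StringType())
              .add("amount", DoubleType()))

    orders = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "orders")                  # hypothetical topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("o"))
              .select("o.*"))

    # Checkpointing plus an idempotent or transactional sink is what gives the
    # effectively exactly-once behavior mentioned in the text.
    query = (orders.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/orders/")             # placeholder
             .option("checkpointLocation", "s3://example-bucket/chk/orders/")
             .start())
    query.awaitTermination()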

Now let’s dive deeper into two use cases across two separate industries to see how pivoting toward operational analytics has led to revolutionary changes across not just data architecture, but across businesses as a whole.

Ecommerce

Fanatics, a global leader in the licensed sports merchandise game, with more than $2 billion in annual revenue, had a complex data architecture, with different data sources for different internal users. It wanted to streamline its analytics and data architecture in order to provide a more holistic environment that could be used by data scientists, data engineers, executives, and nontechnical stakeholders alike. And the ecommerce platform had to seamlessly adapt and react to both planned and unplanned events—as in the case of championship game results, player trades, and injuries.

Moving to an Event-Driven Architecture

Fanatics had recently decided to undertake a larger, substantially more complex architectural transformation, moving from an older monolithic and deeply siloed system of separate applications to a purely event-driven streaming data architecture. This transformative move happened over a three-year period and introduced the "FanFlow" Kafka-based event bus—a flexible, event-based data delivery conduit that provides streams of data to power core data systems (see Figure 4-1).

Figure 4-1. Before: The previous Fanflow analytics architecture used different tools for different audiences

These events encapsulate the state changes within an order life cycle, such as additions to cart, checkout, fulfillment, and delivery, as well as user behavior interactions on the site, type-ahead suggestions, and point-of-sale events.
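
A hypothetical shape for one such order-lifecycle event, expressed as a small Python dataclass, might look like the following; the field names are illustrative and not Fanatics' actual schema.

    from dataclasses import dataclass, asdict
    import json
    import time
    import uuid

    @dataclass
    class OrderEvent:
        event_id: str
        event_type: str      # e.g., "cart_add", "checkout", "fulfillment", "delivery"
        order_id: str
        user_id: str
        ts_epoch_ms: int
        payload: dict

    event = OrderEvent(
        event_id=str(uuid.uuid4()),
        event_type="checkout",
        order_id="ORD-123",
        user_id="U-456",
        ts_epoch_ms=int(time.time() * 1000),
        payload={"items": 3, "total": 149.99})

    print(json.dumps(asdict(event)))   # serialized form published to the event bus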

This reinvestment was a monumental step in the right direction, but Fanatics still had some larger issues to resolve on its journey to a truly unified analytics platform. These problems took the form of three separate systems: a Lucene-based search index that saved raw events; an Apache Flink consumer with a Redis state machine that provided streaming analytics and metrics to executives via custom dashboards; and an Amazon S3 consumer that fed event streams into Fanatics' Spark clusters to model user behavior and support ad hoc queries from its Hive data stores, which powered an Apache Zeppelin notebook environment.

This split between tools and functionality for internal stakeholders caused problems because there was no single source-of-truth data store: the executives' views of the data were not synchronized with the Lucene-based stores or with the data views from the Hive/Zeppelin dashboards.

A further problem arose because the company's Lucene-based indexer couldn't keep up with huge peak traffic spikes, and the trickle-down effect made it difficult to manage data-level SLAs across the system.

Shifting to a Unified Operational Architecture

An operational database replaced the Lucene-based indexers, and the Spark and Flink jobs were converted to SQL-based processing jobs (Figure 4-2), which allowed for a more consistent, predictable development life cycle and more stable SLAs. Now, internal stakeholders can all operate off of the same underlying up-to-date, near-real-time data, without losing long development cycles to maintaining and managing multiple clusters running on many different technology stacks. This was a game changer and a huge win for Fanatics.

Figure 4-2. After: Fanflow feeds into robust analytics capabilities powered by MemSQL

Telecommunications

The telephone dates back to the late 1800s, and yet, almost a century and a half later, the industry continues to make leaps and bounds in its underlying technology. Telephony now enables global connectivity across all continents in the form of voice, video, and data. However, for many telecommunications companies, the industry-wide move to cloud-native stacks can seem out of reach, given that huge budgets were previously poured into custom hardware, special-purpose microprocessors, and proprietary software as a means to edge out the competition and gain footholds across global markets.

The silver lining here comes in the form of the hybrid cloud. Companies can continue to use their datacenters and still connect directly to the internet's backbone, utilizing the available interconnections to bridge their systems with those of Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).

Before moving to the hybrid architecture

Under the previous approach, datacenters had to be built up to handle the next 6 to 12 months of anticipated growth, with massive upfront investments across servers, routers, and switches, not to mention redundancy planning and the electrical power to run and cool the infrastructure. If a team wanted to build up a new service, the lead time alone would be cause for concern. All of this would tend to restrict creativity and free thinking, because the lead time from day one of work to production would make most new ideas cost and time prohibitive.

Additionally, providing a holistic view of operations across all hardware, in the field and in the datacenter, ran the risk of data growth expanding beyond the capabilities of the hardware that was available. Even though the size of platform-level sensor networks varies from company to company, the torrent of data required to monitor service-level operations, network operations, system-level operations, and connectivity operations is on the order of many billions of events worldwide every single day—and that is only the general, platform-level operational analytics.

Synchronizing state and maintaining consistent, predictable global data SLAs required infrastructure designed specifically for this kind of elastic scale, able to flex to handle the ebbs and flows of common traffic patterns as well as more anomalous traffic, like that of a natural disaster or a celebrity video going viral. As a compounding effect, slow ETL jobs and variable-latency networks around the globe restricted general data availability SLAs, and this caused many blind spots within the network.

Shifting to a hybrid architecture

Given that sensor networks lend themselves nicely to the event-stream paradigm, the first step is to define and agree upon the schemas that will define the data structures for the myriad sensor events. The next step is simply to emit the events into regionally distributed, cloud-hosted event buses, such as Amazon's hosted Kafka solution.
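
A minimal sketch of that emit step, assuming an agreed-upon JSON schema and a managed Kafka endpoint (Amazon's hosted offering is Amazon MSK), might look like this in Python; the topic, fields, and broker address are placeholders.

    import json
    import time
    from kafka import KafkaProducer   # pip install kafka-python

    REQUIRED_FIELDS = {"device_id", "region", "metric", "value", "ts_epoch_ms"}

    def validate(event: dict) -> dict:
        # A lightweight stand-in for schema validation against the agreed contract.
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"event missing fields: {missing}")
        return event

    producer = KafkaProducer(
        bootstrap_servers="msk-broker:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    event = validate({
        "device_id": "cell-site-0042",
        "region": "us-east",
        "metric": "dropped_packets",
        "value": 17,
        "ts_epoch_ms": int(time.time() * 1000)})

    # Keying by device_id keeps one device's events ordered within a partition
    # while spreading load across partitions for even distribution.
    producer.send("sensor-events", key=event["device_id"], value=event)
    producer.flush()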

This process tends to happen over many months and is managed much more effectively by initially piloting one end-to-end solution. The pilot teaches the team about important system requirements such as overall record size and the number of records produced per day, as well as how to efficiently partition topics and which primary key to use for even distribution. Afterward, the findings can be wrapped up into best-practice guides so that other teams can move faster when following suit.

Benefits of the hybrid architecture

By migrating to a near-real-time, streaming event-based architecture (Figure 4-3) using a NewSQL operational database, the company was able to keep all of the proprietary software, chips, and hardware, along with all of the datacenter points of presence, while also taking a tactical approach to harnessing the power of the cloud, improving operations at a global level.

Figure 4-3. Hybrid analytics and insights architecture

This reduced friction by enabling executives, managers, support staff, and engineers to easily view and query data in a unified platform. Outages were resolved faster because data was generally available to help pinpoint which systems were under pressure or experiencing network unavailability (which typically occurs due to network partitioning). Root cause analysis for any number of problems was reduced from weeks of manual collection and analysis of data across countless disparate systems to sometimes less than an hour.

By operating off of the same data pipelines, teams could easily tap into streams of data to produce new specialized streams to power near-real-time monitors of critical business operations, to feed into real-time dashboards, and to provide further engagement and self-service capabilities for internal and external stakeholders alike.
