Chapter 1. The Database Keeps Evolving, but Can It Keep Up?

There was a time when database choices were fairly limited. Enterprises could work with flat-file databases, or, eventually, more adaptable relational databases. These databases worked well for basic transactions and internal corporate operations, but users had to rely on IT teams to craft and generate reports about the state of their business. That all changed about a decade ago, with an explosion of new types of databases, built on the web and cloud, that put more power in the hands of end users—and gave database managers more powerful tools to serve fast-changing business needs.

Pushing the Boundaries of the Database Frontier

As databases made the transition to the cloud, they evolved from simple data storage frameworks into essential instruments for delivering better customer service and for understanding the business and its environment. Databases evolved with the platforms that arose within the computing world: from mainframes to midrange computers to personal computers, from proprietary to open source systems, and ultimately to the cloud. But the evolution isn’t stopping there.

There are many frontiers still open for the advancement of database solutions, with many issues that still need to be addressed: data silos that keep information from effectively reaching the users and applications that need it, data quality and integrity issues, a lag in adapting to new business realities, scalability issues in an era when data and associated applications are exploding, security issues, privacy requirements, and lack of talent to maintain data environments.

Businesses can’t afford to sit still, and neither can their databases. Database technology needs to be constantly refreshed. Today’s data teams are under immense pressure—tasked with delivering “data-driven” capabilities to their enterprises. At the same time, they have had a difficult time linking the components of the modern data architecture to the needs of the business. They need to do more with less—smaller budget, fewer resources, and ever-tighter deadlines. Today’s solutions may be faster and more flexible, but more is needed to align these solutions to the needs of the business.

Next, we’ll take a brief look at how that evolution is unfolding.

Relational Databases Open Data to the Outside World

Back in the days when the mainframe ruled, database designers recognized that users needed to specify declaratively what information they wanted. The traditional data systems of that time (such as IBM’s Information Management System [IMS] or the indexed sequential access method [ISAM]) required programming skills, but relational database systems made it easier for users to construct queries and execute them efficiently, without knowledge of the intricacies of the underlying servers and systems. The relational database introduced a separation between physical and logical implementations, enabling users to easily understand the nature of the data.

Rather than store data in the flat files known to larger, less flexible systems, relational databases maintain data in rows and columns, accessible via Structured Query Language (SQL), which provides a way to create, modify, and query the data sets stored within. Through the joining of tables, end users could view relationships between different data sets, such as sales numbers and regions. SQL introduced a simpler tool for data queries that allowed a far wider audience of users to query the database. The result was an unlocking of data for business intelligence.
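As a minimal sketch of the kind of join described above, the following uses Python's built-in sqlite3 module to relate sales numbers to regions. The table names, column names, and figures are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        sale_id   INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES regions(region_id),
        amount    REAL
    );
    INSERT INTO regions VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales VALUES (1, 1, 100.0), (2, 1, 250.0), (3, 2, 75.0);
""")


def sales_by_region(connection):
    """Join sales to regions and return (region name, total sales) rows."""
    query = """
        SELECT r.name, SUM(s.amount)
        FROM sales AS s
        JOIN regions AS r ON r.region_id = s.region_id
        GROUP BY r.name
        ORDER BY r.name
    """
    return connection.execute(query).fetchall()


totals = sales_by_region(conn)
print(totals)
```

The query states what is wanted (totals per region) and leaves the question of how to fetch and combine the rows to the database engine, which is the declarative quality that opened relational data to a wider audience.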

Relational databases quickly evolved into powerful relational database management systems (RDBMSs), offered as all-in-one management and development platforms by major vendors. RDBMSs have also evolved to include new ways to handle data, including object database capability and in-memory processing. Lately, they have been hosted within the cloud, either by their vendors or through platform-as-a-service (PaaS) providers. Cloud-based RDBMSs take advantage of the cloud’s ease of use and scalability to provide expanded capabilities.

Relational databases, supported by SQL tools for query access, introduced a simpler way to discover and leverage data, and opened data to a wide audience of users. This represented a key step to unlocking data to better understand trends among customers and markets. Relational databases have the following characteristics:

Advantages

Cross-platform; enables complex queries, joins, or transactions across multiple data sets

Challenges

Difficult to deploy; difficult to scale; requires extensive IT resources; SQL-based queries difficult to program; vendor lock-in; structured data only

NoSQL Databases Emerge to Offer Lightweight, and Even More Sociable, Alternatives

Relational databases helped in discovering and understanding trends within the business but were expensive in terms of multiuser or per-processor licensing, as well as difficult to set up and maintain. SQL itself required a robust understanding of its structure and commands. Seeking to avoid the complexity of building SQL-based queries for relational databases, along with their restrictions, a new breed of databases emerged: not only SQL (NoSQL) databases. The first generation of NoSQL databases focused on key-value stores (Berkeley DB and similar), text searching (Elasticsearch), and later document stores such as CouchDB and MongoDB.

NoSQL databases were designed for applications using unstructured or semistructured data, such as text or images, which were not well supported by RDBMSs of the time. Representing such data within a relational database was difficult: a nested, JSON-based document, for example, is very complex to model in an RDBMS. NoSQL databases were marketed by their creators as lighter, quicker, and easier to stand up than heavy RDBMSs, intended to enable data managers and developers to put applications into production at a faster rate. In addition, these databases were built for the web and cloud architectures.
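To illustrate why nested data is awkward to model relationally, here is a small sketch that takes one JSON-style order document and splits it into rows for three tables. The document shape, the table names, and the to_relational_rows helper are all hypothetical.

```python
import json

# Hypothetical customer order as a nested, JSON-style document.
order_doc = {
    "order_id": 1001,
    "customer": {"name": "Ada", "city": "London"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def to_relational_rows(doc):
    """Split one nested document into rows for three hypothetical tables."""
    customer = doc["customer"]
    return {
        "orders": [(doc["order_id"], customer["name"])],
        "customers": [(customer["name"], customer["city"])],
        "order_items": [
            (doc["order_id"], item["sku"], item["qty"]) for item in doc["items"]
        ],
    }


# Document-store view: the whole order is one serialized value.
as_document = json.dumps(order_doc)

# Relational view: the same order spread across three tables.
rows = to_relational_rows(order_doc)
print(rows["order_items"])
```

A document store can keep the order as a single value, whereas the relational form needs a schema, keys, and joins to put the order back together.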

NoSQL databases come in many flavors, each suited to a particular kind of task: document-oriented, key-value, graph, column-family, and multimodel databases. Each flavor has its own advantages, from greater flexibility in supporting data models to greater visibility of these models. For example, MongoDB gained traction because it provided developers with a simpler abstraction for object data. Other NoSQL databases, such as Cassandra or DynamoDB, emerged because the relational databases of the time could not easily scale across multiple nodes.

Graph databases offer another example: knowledge graphs are designed to capture rich relationships and contextual information. A knowledge graph can, for instance, map a social network, capturing relationships and attributes that can be used to derive insights, make inferences, and perform complex queries on the data.
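A rough sketch of the idea, using plain Python rather than any graph database: relationships are stored as (subject, relationship, object) triples, and simple inferences are derived by following them. The people, relationship names, and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples: (subject, relationship, object).
triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "works_at", "acme"),
    ("carol", "works_at", "acme"),
]


def build_index(edges):
    """Index objects by (subject, relationship) for quick lookups."""
    index = defaultdict(set)
    for subject, relation, obj in edges:
        index[(subject, relation)].add(obj)
    return index


def friends_of_friends(index, person):
    """Infer second-degree connections by following 'follows' twice."""
    direct = index.get((person, "follows"), set())
    indirect = set()
    for friend in direct:
        indirect |= index.get((friend, "follows"), set())
    return indirect - direct - {person}


def coworkers(edges, person):
    """Find other people who share an employer with the given person."""
    employers = {o for s, r, o in edges if s == person and r == "works_at"}
    return {
        s for s, r, o in edges
        if r == "works_at" and o in employers and s != person
    }


index = build_index(triples)
print(friends_of_friends(index, "alice"))
print(coworkers(triples, "alice"))
```

Neither result is stored directly; both are derived by traversing relationships, which is the kind of query a graph model is meant to make natural.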

NoSQL databases brought data and insights closer to business decision makers and analysts, helping them to contextualize and frame their available data assets. NoSQL databases have the following characteristics:

Advantages

Easy to deploy on a rapid basis; avoids vendor lock-in; few underlying IT resources required; supports unstructured data; provides visual understanding of data relationships

Challenges

Integrating multiple databases for multiple requirements; limited data consistency; may not support complex queries, joins, or transactions seen with SQL

Cloud Databases Remove the Limits

Then came the cloud. Cloud databases, as their name suggests, are managed services delivered via the cloud, either by a third-party hosting provider or by the database vendors themselves. Providers build automation into cloud databases, as well as tighter integration with the other services they offer. No hardware purchases are required to build and maintain cloud databases.

The most compelling advantage of cloud databases is their scalability on demand without the up-front costs of resident servers. They enable end users to automatically spin up new data functions and storage on demand. They also serve as backup and failover services, thereby ensuring high availability.

Security is another area where cloud databases may be more robust than their on-premises counterparts. That’s because security is an essential part of cloud providers’ culture, and these providers are better equipped with trained staff and the latest tools than their client companies.

Cloud databases removed many of the underlying systems and network challenges that impeded the full use of databases, enabling business users and analysts to focus on drawing insights from data, rather than underlying infrastructure. Cloud databases have the following characteristics:

Advantages

Easy to deploy on a rapid basis; highly scalable on demand; little up-front investment; no underlying IT resources required; supports unstructured data; technology is automatically refreshed or updated

Challenges

Cloud vendor lock-in; control over features and formats; cloud vendor business viability; data security; long-term costs

Distributed SQL Databases Containing HTAP Capabilities Spread the Power and Bring It All Together

It’s important to note that all of the previously mentioned database types are still in use, with many organizations using all forms for their various requirements. Every business case is unique, and there is no single “right” approach to leveraging data assets and applications for maximum performance.

Ultimately, the key to building a successful data environment in today’s digital age is bringing together the advantages of the previous generations of databases mentioned above with the requirements for speed and adaptability by data-driven enterprises. The next evolution of data environments, built on high-performance data architectures, brings together all these advantages, but without the baggage that each successive generation of databases brought. This new generation of databases is built on the advantages that distributed SQL databases provide, combined with the real-time capabilities of hybrid transactional/analytical processing (HTAP) databases.

Distributed SQL databases have been available for a number of years, enabling the storage and processing of data more locally to users or applications requesting data. Managing data closer to where it is needed—across multiple sites or nodes—decreases latency and provides a more modular approach to build for scalability. In addition, in the event of failure of a node, other nodes can pick up the slack, ensuring greater availability.
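The two ideas in this paragraph, serving requests from the closest node and letting other nodes take over on failure, can be sketched as toy routing logic. This is an illustration under simplifying assumptions and not how any particular distributed SQL database is implemented; the Node class, the latency figures, and pick_node are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical database node as seen from one client."""
    name: str
    latency_ms: float
    healthy: bool = True


def pick_node(nodes):
    """Choose the healthy node with the lowest latency to the client."""
    candidates = [node for node in nodes if node.healthy]
    if not candidates:
        raise RuntimeError("no healthy nodes available")
    return min(candidates, key=lambda node: node.latency_ms)


nodes = [Node("local", 5.0), Node("regional", 30.0), Node("remote", 120.0)]

# Normal operation: the nearest node serves the request.
first_choice = pick_node(nodes).name

# Simulated failure of the nearest node: the next closest one takes over.
nodes[0].healthy = False
fallback = pick_node(nodes).name

print(first_choice, fallback)
```

Real systems also have to replicate data and keep it consistent across nodes, which this sketch deliberately leaves out.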

Distributed SQL databases containing HTAP capabilities enable real-time processing and analysis of data as it is being generated. Combined with the flexibility of distributed SQL databases, they are able to process both online transactional processing (OLTP) and online analytical processing (OLAP) workloads within the same system, sharing a single source of truth without the delay of moving data between separate systems. This simplifies technology stacks and reduces data silos, which helps companies derive actionable insights directly from real-time updates and drive growth faster.

A high-performance data architecture is needed to take advantage of the benefits that HTAP databases have to offer. This architecture provides greatly enhanced capabilities for scalability, availability, and performance, combined with the flexibility of integrating with existing databases from any vendor. Working within this architecture, managers and professionals can make faster, more informed decisions based on current data. An architecture built on distributed SQL databases containing HTAP capabilities also enables more efficient and streamlined data processing, which can improve cost efficiency while expediting business operations and decision-making. With today’s innovations in generative AI, such as OpenAI’s GPT models, a simplified data architecture will play a more strategic role than ever before.

The database world continues to rapidly evolve, promising fresh approaches to helping organizations leverage the data that is critical for serving customers, increasing employee productivity, and moving forward with advanced analytic applications. Distributed SQL databases containing HTAP capabilities offer a simplified and consolidated approach for building a real-time, data-driven enterprise.

