Chapter 1. Database Options in the Cloud

The rush to the cloud is often measured in pure business terms—and its growing popularity is widely recognized. For instance, a recent Gartner report (requiring the submission of business information to view) finds that databases are growing at a rate of 68% in the cloud, whereas there is little on-premises growth outside of price increases and what they call “forced upgrades.”

But this dramatic shift in computing and database access also shows up in the types of services offered and the evolution of professional jobs in computing. Managed databases of many sorts are now part of every major cloud vendor’s offerings. The use of these databases will remove many tasks performed by a database administrator (DBA) in traditional environments where an organization owns its own hardware, which we’ll call on-premises environments. Moving to the cloud will add new tasks, change some existing tasks, and provide a subtly different context for understanding many of the tasks.

Cloud vendors doggedly keep up with changes in the database space and vary their database offerings to meet the diverse needs of their clients. Relational databases (in both transactional and data warehouse form) appear along with nonrelational databases, such as key/value and document stores.

This report helps the DBA and related staff—such as data scientists, data architects, and application developers—choose among cloud offerings. It explains on a general level what responsibilities the DBA should expect to perform in the cloud environment. Finally, it offers guidelines for migration.

The report does not cover arguments for or against moving to the cloud, because there are other resources to support this decision and because the decision is tied in so tightly with the particular traits of your databases and how you use them. Furthermore, we don’t review or recommend particular cloud offerings, although we refer to the offerings of the currently dominant cloud vendors: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.

This chapter includes the following:

  • A brief recap of the differences between on-premises computing and cloud computing, with an eye toward what is relevant to DBAs

  • Definition of a managed database

  • Types of managed databases offered by cloud providers

  • Overview of a DBA’s role in the cloud

Cloud offerings often mirror the software available for traditional, on-premises deployment. This means that you can move data from your own equipment to a comparable database offered by a cloud vendor. For instance, you can run MySQL or Oracle in the cloud instead of in your own datacenter, or you can use a database native to the cloud such as Amazon Aurora.

You should also be able to keep a foot in each territory (on-premises and cloud), moving data into the cloud at a pace that’s comfortable for your organization. You can also maintain two deployments forever, one on-premises and the other in the cloud. The cloud is often a good place to try something totally new, such as a big data project that you’ve never created before.

Although we expect most readers to be familiar with the cloud, its benefits, and various reasons to use or avoid it, the next section summarizes traits of the cloud that underlie the ideas in this report. After that, we can focus on databases.

High-Level Effects of Moving to the Cloud

As you move data to the cloud, you will encounter differences from on-premises deployments that affect your decisions. Following are the key changes of interest to a DBA:

  • You pay as you go, in hourly or monthly increments, to run each server (called an instance in the lingo of virtual computing). Thus, if you have a business intelligence (BI) application that you run every night, you can fire up an instance of a database for 8 hours during the night and avoid paying for the other 16 hours. With some services (such as AWS Reserved Instances) you can also run a database at a discount for a limited time.

  • Hardware and related infrastructure are handled by the vendor. This can take an enormous amount of work off the shoulders of system and database administrators. Instead of provisioning your own hardware, you choose the amount of CPU power, memory, and storage you want in your database instance. Vendors often have special offerings that can improve performance in ways related to strategies used on-premises: for instance, Microsoft Azure’s SQL Data Warehouse offers solid-state drive (SSD) storage, and Amazon DynamoDB has a special in-memory caching version.

  • The major cloud vendors (AWS, Azure, and Google Cloud) offer not only the isolation of your workloads in virtual machines (VMs), but the isolation of complete corporate networks in the cloud, a service called a virtual private cloud (VPC). Just as VMs allow each of your servers to use computing resources efficiently and securely, a VPC allows you to set up your corporate network in the cloud, potentially saving money while letting the cloud vendor manage a good deal of your network security. Database servers can participate in these VPCs. You can also set up a virtual private network (VPN) to communicate between the cloud’s VPC and your on-premises computers.

  • Cloud vendors offer multiple locations that are important for several purposes: to offer servers that are geographically close to the user bases of clients, provide redundancy in case of disaster, allow load balancing, and more. Some countries, particularly in the European Union, require that some types of data be stored within their legal jurisdictions, for privacy or other legal reasons. The concepts of regions, which may cover a large part of a continent, and availability zones (AZs) within those regions appear in AWS, Azure, and Google Cloud. Thus, you can distribute your workloads across several AZs in a region, replicating data and services, and set up load balancing across redundant copies of data to provide resilience. Some cloud vendors offer regions with special features to meet regulatory needs, particularly around security. All major cloud providers also offer content delivery network (CDN) services.

  • Cloud vendors offer sophisticated ways to scale services, provide fault tolerance, and perform load balancing. If your database instance fails and your configuration is set up for high availability, a new instance might be available. The client might need to restart sessions that were running on the old instance when it failed, but your administrative efforts go into configuring fault tolerance instead of getting up at 2:00 A.M. to restart the database. Similarly, if the load on a database becomes high, a new instance can start up to share the load, and then shut down when it is no longer needed. This is called elastic scaling.

  • Vendors normally offer command-line options, graphical interfaces, and APIs, each with strengths and weaknesses that are familiar to computer professionals. APIs permit automation and thus take work off your hands.

  • Some security that you need to worry about on-premises is handled by the vendor. Obviously, they control physical access to the servers. Depending on how you use their services, they can also protect you against low-level internet attacks such as port scans and denial-of-service (DoS) attacks. Later chapters lay out the aspects of security that are still the DBA’s responsibility.

  • Cloud vendors build in monitoring and performance tools that you can hook into your databases fairly easily, and that are always sprouting new features. Third-party monitoring tools might also be able to tap into and extend the built-in monitoring capabilities of the platform.

Many criteria for choosing a provider are general, not database specific: pricing, the ecosystem of third-party developers around each offering, stability, support for legal compliance regimes, added value in high-end tools for artificial intelligence, and more.

As you evaluate cloud offerings, think about future directions for your organization. For instance, will streaming and real-time data processing be a bigger part of your future? Which cloud offerings will support you a couple of years from now?

Self-Managed Versus Managed Databases

We now look specifically at databases. Your options in the cloud divide into two major categories:

Self-managed databases

In this cloud offering, the vendor provides just the hardware, the hypervisor to run your VM, and the API or other tools to manage deployment. You need to create a VM running an operating system and your application. Obviously, you can run anything you want in the VM, including a database engine of your choice. You need to perform most of the administrative tasks yourself, such as installing updates and configuring all of the networking options.

Managed databases

In this cloud offering, the vendor provides not only hardware but also the server software itself. Most vendors offer both traditional databases (such as Oracle and MySQL) and cloud native databases that are specific to that vendor.

Self-managed databases are much more like running your own database on-premises; managed databases change the way you work a lot more. Here are the key differences:

  • If you use a managed database, you don’t need to manually download, install, update, configure, or back up the database; the cloud provider does all of that. You can still modify parameters that instruct the database how to run (collation, cursors, connections, etc.) and control decisions such as whether to install new releases and how long to retain backups.

  • Sometimes the cloud vendor can provide a license for a proprietary database and include that cost with the cost of using the cloud service. In other cases, you are responsible for obtaining a license. If you already purchased a license from a proprietary database vendor, you can sometimes apply that to a managed database.

  • Because the vendor installs and runs the managed database, you are limited to the choices the vendor makes. You can’t ask for a database engine that isn’t supported by the vendor, or even a version that isn’t already supported. But the three big vendors (AWS, Azure, and Google) offer several proprietary and open-source databases as well as their own cloud native databases.

  • Self-managed and managed databases offer different parameters for configuration.

  • With managed databases, much of the security is handled by the vendor. But even a managed database leaves important security decisions up to you, such as to whom to give accounts and what limitations to place on tables or columns. Monitoring and auditing access attempts is usually still your job. You can also control security settings such as which IP addresses have access.

Managed databases therefore offer many advantages over on-premises deployments and self-managed databases, but you might have unique reasons to run your own database. The rest of this report covers managed databases because they offer a unique opportunity for a DBA to focus on activities that have more long-term impact on the business.

Cloud Native Databases

Managed databases can also be divided into two categories: traditional and cloud native. Traditional databases such as Oracle, SQL Server, MySQL, and PostgreSQL are frequently offered as managed databases by cloud providers. If you have built your organization around one of these databases on-premises, moving to the same database in the cloud simplifies migration. You are less likely to need to alter your applications and can use familiar tools to manage the databases. You can also mix offerings from different vendors or maintain your on-premises version of the database.

But the vendors have invested great effort into their own new offerings, sometimes called cloud native databases. Vendors provide evidence that the cloud native databases perform better, scale more easily, and are cheaper in the long run. Thus, testimonials from Autodesk and InfoScout suggest that AWS engineers have solved many of the scaling and efficiency problems of managing relational databases in the cloud with their own database, Amazon Aurora. Cloud native databases are also designed to scale enormously, a task that has historically been difficult for relational databases.

Types of Managed Databases

Most types of databases you find out in the field are also offered as managed databases. Additionally, cloud vendors have also developed their own cloud native databases, following common industry trends and offering performance benefits. The major types of supported databases include the following:

Relational databases

As mentioned earlier, some cloud databases are managed versions of popular databases in widespread use. For instance, Azure offers Microsoft’s traditional SQL Server. Amazon supports MariaDB, MySQL, Oracle, PostgreSQL, and SQL Server through its Amazon Relational Database Service (RDS). Such offerings help you to move databases more easily from on-premises installations.

In addition, vendors have created databases of their own. For instance, Azure provides Azure Cosmos DB, Google Cloud offers Cloud Spanner, and AWS offers Amazon Aurora.

Data warehouses

Although these are typically relational databases, they differ from the transaction-oriented databases internally and in their offerings. For instance, transaction-oriented databases usually store all the columns of a single row together so that you can quickly retrieve multiple columns about a customer or product. In contrast, data warehouses in the cloud tend to be columnar, meaning that they store data by column instead of by row. This greatly speeds up common warehouse queries like, “give me the ages of all customers who live in California.” The tools offered by cloud vendors with data warehouses focus on fast ingestion and extraction, facilitating their use in big data applications. Amazon Redshift, Google’s BigQuery, and Azure’s SQL Data Warehouse are examples of these offerings.

Nonrelational databases

This term is commonly used to cover a variety of different data stores that, unlike traditional relational databases, are built for special-application use cases. Cloud vendors offer a variety of these for different purposes:

  • Key/value databases offer quick storage and retrieval of values without support for more sophisticated operations.

  • Document databases store data as JSON documents with a flexible schema, allowing data to be stored and queried in the same format used in applications.

  • Graph databases store relationships among objects, making it easy to run algorithms such as finding the object with the most connections.

  • Search databases optimize the location of documents containing particular words.

  • Time series databases record events with timestamps and are optimized for time-related tasks such as graphing occurrences of events over time, useful for tracking events such as device readings or web postings.

  • Ledger databases are like enhanced, secure log files that record activities such as financial transactions, signing them, and making them immutable.

Other options, including in-memory caching databases, might also be available. Cloud vendors offer services for less structured storage (i.e., BLOBs or files): examples include Amazon Simple Storage Service (Amazon S3), Google Cloud Storage, and Azure Blob Storage. The services offer even cheaper options for archival storage, along with tools to move data in and out of storage easily.

This chapter, necessarily, has stated database features in general terms, with lots of hedging words such as “some” or “might.” This reluctance to be pinned down stems from the subtle differences that can be seen after you examine vendor offerings carefully. Some offerings have evolved further than others. Some databases offer easier administration or migration than others. You must take all of these factors into account when you choose a cloud offering.

When a Managed Database Might Not Be a Good Fit

A managed database service isn’t appropriate in a couple scenarios.

Your database size and IOPS are greater than the database limits.

Cloud native databases are designed to scale to meet extreme demands. But managed databases based on more traditional database engines often come with some upper limits on their size and the input/output operations per second (IOPS) they can support. Many organizations can fit within these limits. But if your database is truly enormous, and you have greater IOPS needs than what the provider can support, you might find your chosen managed database services to be infeasible. You should run a cloud native database or run your own instances in the cloud by renting the VMs and storage that you need.

You need full control over your databases.

If it’s important for you to have maximum control over your databases, you must manage your own databases in the cloud. If you need a specific database version that’s not supported by the cloud vendor, or you need to use database features or options that aren’t supported by the provider, you’re back on your own.

Role of the DBA in a Managed Database

As mentioned at the start of this report, moving to the cloud will change your job, sometimes in subtle ways. Chapter 2 covers this in detail, but we’ll just say here that after you relinquish hardware maintenance, system configuration, and other things a managed database takes care of, there is plenty of work left for you. Here are some tasks that the DBA often performs on a managed database:

  • Determining requirements, such as CPU power and memory.

  • Configuring database runs, including the automation of scaling, failure recovery, and taking snapshots of the database.

  • Choosing which regions and AZs to host the database instances.

  • Creating and manipulating databases and tables through the command line, console, or API offered by the vendor.

  • Authorizing user accounts, setting up groups, and controlling access to databases and parts of the databases. Some databases offer this access through identity and access management (IAM) and others use the traditional interfaces offered by the database in on-premises deployments.

  • Determining how to divide data into shards when you need to distribute it across multiple systems.

  • Setting database parameters that affect performance or resiliency.

  • Auditing access and monitoring database activity for the purposes of security, performance, and resilience.

  • Performing ingestion and data transfers through traditional extract, transform, and load (ETL) or more recent streaming data tools. These data transfers can involve accepting data from outside sources, providing data to big data tools such as Spark or Kafka, storing data in a data warehouse either in the cloud or on-premises, and any other relationship you need to establish with another tool to get data processing done.

  • Determining schemas or formats, together with data architects and application developers, and implementing them in relational or nonrelational data stores.

Most exciting, you’ll have more time to look at the business context for the database and think of ways to improve the value of the data you manage for your organization. We go into more detail about these topics in Chapter 2.

Get An Introduction to Cloud Databases now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.