Chapter 5. Heroku PostgreSQL

As Heroku grows, there is an increasing need to provide database services that match the ethos of the Heroku platform—simple provisioning and usage on demand at a low cost.

Thus, Heroku decided to develop the Heroku PostgreSQL service.

Why PostgreSQL?

Over the last few years, two open source database servers have become dominant on the Web: MySQL and PostgreSQL. In order to provide the best service available, Heroku decided to use PostgreSQL and to provide it as the default choice for all database hosting.

In the early days of Heroku PostgreSQL, this was somewhat surprising, as MySQL was easily the most widely used database server at the time. Over time, however, opinion has shifted toward PostgreSQL.

But why PostgreSQL?

The initial choice was based on the belief that PostgreSQL was operationally more reliable than MySQL, which is very important when you are managing hundreds of thousands of databases on a daily basis. Over time, Heroku has found PostgreSQL to get better and better as the development community around it adds new and improved functionality.

Additionally, and crucially, PostgreSQL is an open system and always will be (unlike MySQL). This means that, now and in the future, as long as you are using PostgreSQL for your database server, you will not be subject to any vendor lock-in. As a user, you are able to take your data wherever you please.

There are a number of other technical reasons why PostgreSQL is the favored option, too: for instance, transactional DDL (data definition language) statements, fast index creation with concurrent indexes, extensibility, and partial indexing and constraints.

Transactional DDL

If you’ve ever made a change to your database and had something fail midway, whether due to a constraint violation or some other problem, you understand the pain of quickly untangling the resulting mess. Typically, changes to a schema are intended to be run holistically, and if they fail, you want to roll back fully.

A few other databases, such as SQL Server, support this as well, but many (Oracle and MySQL among them) do not. PostgreSQL supports wrapping your DDL inside a transaction. This means that if an error does occur, you can simply roll back and have the previous DDL statements rolled back with it, leaving your schema migrations as safe as your data, and your application in a consistent state.
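
As a minimal sketch (the table, column, and index names here are hypothetical), a migration wrapped in a transaction either applies completely or not at all:

BEGIN;
-- Both statements succeed or fail together:
ALTER TABLE users ADD COLUMN nickname text;
CREATE INDEX idx_users_nickname ON users (nickname);
COMMIT;

Had either statement failed, the transaction would abort, leaving the schema exactly as it was.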

Concurrent Indexes

When you create an index with most traditional databases, it holds a lock on the table while it creates the index. This means that the table is more or less unusable during that time. When you’re starting out, this isn’t a problem, but as your data grows and you add indexes to improve performance, it could mean downtime just to add an index (not ideal in a production environment).

Not surprisingly, PostgreSQL has a great means of adding an index without holding that lock. Simply invoking CREATE INDEX CONCURRENTLY instead of CREATE INDEX will create your index without holding the lock.

Of course, as with many features, there are caveats. In the case of creating your index concurrently, it takes somewhere on the order of two to three times longer, and it cannot be done within a transaction.
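
For example (assuming a hypothetical orders table), the only difference between the two forms is a single keyword:

-- Takes a lock that blocks writes to the table while the index builds:
CREATE INDEX idx_orders_user_id ON orders (user_id);

-- Builds the same index without locking out writes
-- (two to three times slower, and not allowed inside a transaction):
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);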

Extensibility

Do you need to go beyond standard PostgreSQL? There’s a good chance that someone else has, and that there’s already an extension for it.

Extensions take PostgreSQL further with things such as geospatial support, JSON data types, key/value stores, and connecting to external data sources (e.g., Oracle, MySQL, and Redis).

Partial Indexing

Sometimes you care about an index on only a portion of your data, or about placing a constraint only where a certain condition is true.

Take an example case of the white pages. Within the white pages, you only have one active address, but you’ve had multiple addresses over recent years.

You likely wouldn’t care about the past addresses being indexed, but would want everyone’s current address to be indexed. With partial indexes, this becomes simple and straightforward:

CREATE INDEX idx_address_current ON address (user_id) WHERE current IS TRUE;
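
The same WHERE clause can back a constraint. As a sketch on the same hypothetical table, a partial unique index enforces that each user has at most one current address:

CREATE UNIQUE INDEX idx_one_current_address ON address (user_id)
    WHERE current IS TRUE;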

What Heroku Gives You

There are a number of reasons why you should use Heroku for hosting your PostgreSQL databases, some of which may not be immediately apparent. For instance, let’s take a look at typical requirements of a database administrator (DBA):

  • Set up and administer databases (perform backup, recovery, monitoring, and tuning as necessary).
  • Devise a scalable solution with the highest performance and greatest redundancy.
  • Deploy upgrades and patches.
  • Implement and test disaster recovery.

In essence, a qualified DBA needs to be able to set up, manage, secure, back up, scale, architect, and optimize data structures for day-to-day use.

With Heroku PostgreSQL, as with the rest of the Heroku platform, most of this is done for you. In fact, Heroku will do everything aside from architecting and optimizing your database. There is no need to worry about patching for security or stability, and there is no need to worry about backups; everything is taken care of for you. All that is left for you is to worry about your data and how to store it efficiently.

Because Heroku takes on a lot of the responsibilities of a DBA, you are freed up to consider the parts that are important to your particular application and your particular use case. Meanwhile, Heroku is doing its part to keep your database healthy, functioning, and running well, 24 hours a day.

Development versus Production

At this point, it is worth mentioning that there are two classes of service provided by Heroku: development and production.

The Heroku PostgreSQL development plans are designed to offer the database features required for development and testing, without the production-grade operations, monitoring, and support found in paid production plans. Advanced features such as fork, follow, and automatic database backups are not available on the development plans (although manual backups are available).

Production plans are suitable for production-scale applications. Additionally, the production databases offer a number of advantages over the development plans, including direct access (via psql or any native PostgreSQL library), stored procedures, and PostgreSQL 9.1 support.

Choosing the Right Plan

Heroku PostgreSQL plans vary primarily by the size of their in-memory data cache. The quoted cache size constitutes the total amount of RAM given to the PostgreSQL service. While a small amount of RAM is used for managing each connection and other tasks, PostgreSQL will take advantage of almost all this RAM for its caching needs.

PostgreSQL constantly manages the cache of your data: rows you’ve written, indexes you’ve made, and metadata PostgreSQL keeps. When the data needed for a query is entirely in that cache, performance is very fast. Queries made from cached data are often 100 to 1,000 times faster than from the full dataset. Well-engineered, high performance web applications will have 99% or more of their queries served from cache.

Conversely, having to fall back to disk is at least an order of magnitude slower. Additionally, columns with large data types (e.g., large text columns) are stored out-of-line via the Oversized-Attribute Storage Technique (TOAST), and accessing large amounts of this data can be slow.

Unfortunately, there is no great way of knowing exactly how much cache you need for a given use case. This is not a very satisfactory answer, but it is the truth. The recommended way is to move between plans, watch your application performance, and see what works best for your application’s access patterns.
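
While watching, one useful signal is the cache hit ratio. As a sketch (using PostgreSQL's standard statistics views, not a Heroku-specific feature), the following query reports the fraction of table reads served from cache in the current database:

SELECT
    sum(heap_blks_hit)::float
        / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;

If this number drops much below 0.99 under normal load, a larger plan may be worth trying.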

Access patterns vary greatly from application to application. Many applications only access a small, recently changed portion of their overall data. PostgreSQL will automatically keep that portion in cache as time goes on, and as a result these applications can perform well on the lower cost plans.

Applications that frequently access all of their data don’t have that luxury; these can experience dramatic increases in performance by ensuring that their datasets fit completely in memory. However, you will eventually reach the point where you have more data than the largest plan allows, and you will have to take more advanced approaches such as splitting your data across several databases.

Heroku PostgreSQL’s many plans are segmented into two broad tiers: starter and production. Each tier contains several individual plans. Although these two tiers share many features, there are several differences that will determine which plan is most appropriate for your use case.

Shared Features

The starter and production tier database plans all share the following features:

  • Fully managed database service with automatic health checks
  • Write-ahead log (WAL) off-premise storage every 60 seconds, ensuring minimal data loss
  • Data clips for easy and secure sharing of data and queries
  • SSL-protected psql/libpq ingress
  • PostgreSQL extensions

Together, these features combine to provide a safe and resilient home base for your data.

Starter Features

Note

The starter tier database plans are not intended for production-caliber applications or applications with high-uptime requirements. Therefore, if you are hosting an application that you consider critical, you are advised to use the production tier database plans.

The starter tier, which includes the dev and basic plans, has the following limitations:

  • Enforced row limits of 10,000 rows for dev and 10,000,000 for basic plans.
  • Max of 20 connections.
  • The lack of an in-memory cache limits the performance capabilities since the data can’t be accessed on low-latency storage.
  • Fork and follow, used to create replica databases and master-slave setups, are not supported.
  • Expected uptime of 99.5% each month.

In addition to these feature and resource limitations, starter tier database plans do not automatically record daily data snapshots. You will need to use the PGBackups add-on to manually configure this level of data protection and retention.

Production Features

As the name implies, the production tier of Heroku PostgreSQL is intended for production applications and includes the following feature additions to the starter tier:

  • No row limitations
  • Increasing amounts of in-memory cache
  • Fork and follow support
  • Max of 500 connections
  • Expected uptime of 99.95% each month

Management of production tier database plans is also much more robust, including:

  • Automatic daily snapshots with one-month retention
  • Priority service restoration on disruptions

Getting Started

So how do we provision and use a Heroku PostgreSQL instance? Fortunately, this is a very simple task.

Tip

A prerequisite of following this walkthrough is to have a Heroku account, verified with a credit card. To sign up, if you haven’t already, visit http://postgres.heroku.com/signup.

To provision a database, follow these steps:

  1. Go to your database dashboard, the page that loads immediately after logging in.
  2. Click the + in the header.
  3. Select the Dev Plan (which is free) when you are presented with a list of the options available.
  4. Click Create Database. You should now see a message that the database is being created. A few seconds later, this page will change to show you the name of your new database (e.g., “protected-wave-27”).
  5. Click the database name.

Your PostgreSQL database is now provisioned and ready to use. From the current screen, you can see information such as the host, database name, and credentials needed to connect.

By clicking the cog at the upper-right corner of the Connection Settings panel, you are able to select one of the commonly used methods of connecting to your database, and get a pregenerated string that you can use in your application. For instance, if you were to select ActiveRecord (the Ruby on Rails ORM), you would see something similar to this:

adapter:  postgresql
encoding: unicode
pool:     5
database: db3bjv2gmckc5c
username: fKpF7NNtm4kfZd
password: 7ARGxieXAnZEyyLMm9h
host:     ec2-23-21-85-231.compute-1.amazonaws.com
port:     5432

You will also be able to see other information, such as the database type and some basic statistics regarding size, version, and availability. From here, you are also able to see your current data clips and snapshots. More on that later.

Importing and Exporting Data

Once you have your database, how do you get your preexisting data in? And how do you get your data out of Heroku PostgreSQL for further querying offline? To make this process as efficient as possible, Heroku has made sure that there is no lock-in and that you can access your data in the normal PostgreSQL ways.

Importing Data

In many cases, you will want to import existing data into your database for use by your application. This is done via a dump file from another PostgreSQL database:

$ pg_dump -Fc --no-acl --no-owner my_old_postgresql > data.dump

Alternatively, a dump can be created via pgAdmin, the GUI tool for PostgreSQL administration. To do so, select a database from the Object Browser and click Tools → Backup. Set the filename to data.dump, use the COMPRESS format, and (under Dump Options #1) choose not to save Privilege or Tablespace.

The primary option for importing this data into your database is to use pg_restore.

Using pg_restore locally

The PostgreSQL pg_restore utility can restore your data.dump file to your database locally.

Tip

To install the pg_restore utility, you must install PostgreSQL locally. More information on doing this can be found on the PostgreSQL website.

With a DATABASE_URL of postgres://<USER>:<PASS>@<HOST>:<PORT>/<DBNAME>, the pg_restore command would be:

$ PGPASSWORD=<PASS> pg_restore --verbose --clean --no-acl --no-owner -h <HOST> \
   -U <USER> -d <DBNAME> -p <PORT> ~/data.dump

Be sure to replace the appropriate elements with those given in your connection settings.

For now, though, let’s import a sample 200 MB employee database that contains the following:

  • Employees
  • Departments
  • Salaries (many-to-one with employees)
  • Titles (many-to-one with employees)
  • Department employees (one-to-one with departments and employees)
  • Department managers (one-to-one with departments and employees)

First, download the dump file:

$ curl -o employees.dump https://heroku-data.s3.amazonaws.com/employees.dump

Then, use pg_restore to restore directly to your Heroku PostgreSQL database. Remember, the correct pg_restore syntax can be automatically generated from your database detail page:

$ PGPASSWORD=<PASS> pg_restore --verbose --clean --no-acl --no-owner -h <HOST> \
   -U <USER> -d <DBNAME> -p <PORT> employees.dump

Once complete, you should be able to connect to your database and query it as if it were local.

Exporting Data

So, now that you’ve got some data in your database, how do you get it out again? You could query it and write it all down on some slips of paper, but this will take a fair while, so it’s probably best to use one of the many more efficient methods.

Snapshots

Let’s say you need to do a full database export. You can download a database snapshot, which is essentially identical to the dump file that you created when importing. To create a snapshot, log in to your Heroku PostgreSQL account and go to your database list. Select the database you’re interested in and click the + in the Snapshots section. A few seconds later, the snapshot will have been created and can be downloaded.

Once downloaded, restore the dump to your local database as normal.
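
For instance, assuming a local database named mylocaldb and a downloaded file named latest.dump (both names are placeholders), the restore looks much like the import described earlier:

$ pg_restore --verbose --clean --no-acl --no-owner -d mylocaldb latest.dump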

CSV Exports

But what if you don’t want to download the whole database? Heroku will let you export the results of a single query as a comma-separated values (CSV) file via PostgreSQL’s built-in COPY support. One caveat: because your database lives on a remote server, a plain COPY ... TO 'file' would attempt to write the file on that server. psql’s \copy variant runs the same query but streams the results to a file on your own machine. So, connect to your database with psql (as described earlier in the chapter) and issue a command such as the following to create a new file locally containing the query data:

\copy (SELECT * FROM products) TO 'dump.csv' CSV HEADER

where:

SELECT * FROM products

is the query whose results you wish to download.

PGBackups

A very common requirement is backups. Although Heroku strives to ensure that there is no data loss in the event of some sort of digital catastrophe, it’s always a good idea to make sure you keep backups for your own peace of mind.

In order to use PGBackups, you must first provision the add-on. Note that you will have no access to data backups without this add-on:

$ heroku addons:add pgbackups
-----> Adding pgbackups to my_app done, v18 (free)

In order to create a backup manually (PGBackups also supports automated backups), you can simply request it from the command line:

$ heroku pgbackups:capture

HEROKU_POSTGRESQL_BLACK (DATABASE_URL)  ----backup--->  b251

Capturing... done
Storing... done

Should you require, you can then download your most recent backup:

$ heroku pgbackups:url
"http://s3.amazonaws.com/hkpgbackups/app1234567@heroku.com/…

If, however, you require access to anything other than the most recent backup, you can review your archive:

$ heroku pgbackups
ID   | Backup Time         | Size  | Database
-----+---------------------+-------+------------------------
a226 | 2013/02/22 20:02.19 | 5.3KB | DATABASE_URL
a227 | 2013/02/23 20:02.19 | 5.3KB | DATABASE_URL
b251 | 2013/02/24 16:08.02 | 5.3KB | HEROKU_POSTGRESQL_BLACK
b252 | 2013/02/24 16:08.53 | 5.3KB | HEROKU_POSTGRESQL_PINK 
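
Should you ever need to roll a database back to one of these backups, PGBackups also provides a restore command. As a sketch (reusing backup ID b251 from the listing above), this restores that backup into the database attached as DATABASE_URL:

$ heroku pgbackups:restore DATABASE b251

Note that restoring replaces the current contents of the target database, so it is worth capturing a fresh backup first.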

Data Clips

One common need of a database administrator is to share the data in their database with other people. Generally speaking, as these people are rarely technical, sharing dump files with them might not be the best approach. What’s more, this data commonly needs to be provided on a periodic basis. For instance, you may need to deliver an export of year-to-date sales by month to your manager each and every month.

Sharing information on the Internet is done by sharing URLs. URLs identify locations, books, videos, and even source code. Until now, there hasn’t been a convenient way to share data inside a database. That’s why Heroku introduced data clips. They are a fast and easy way to unlock the data in your database in the form of a secure URL.

Data clips allow the results of SQL queries on a Heroku PostgreSQL database to be easily shared. Simply create a query against a database within your databases list, and then share the resulting URL with coworkers, colleagues, or the world.

Data clips can be shared through email, Twitter, IRC, or any other medium, because they are just URLs. The recipients of a data clip are able to view the data in their browser or download it in JSON, CSV, XML, or Microsoft Excel formats.

As data changes rapidly in databases, so can data clips. They can either be locked to a point in time or set to refresh with live data. When locked to a point in time, data clips are guaranteed to show an unchanging snapshot of data, even if they are viewed hours, days, or years after the clip was created. Alternatively, when data clips are set to “now,” they provide a live view into the database in its current form.

Followers

So, now that we’ve talked about some of the data-in and data-out tasks that we commonly have to do day to day, let’s talk about the features that relate more to running Heroku PostgreSQL databases in an operational environment.

Although your database is hosted by Heroku and administered by some of the best DBAs in the business, there is still risk. Your database is running on one machine, for instance, and should that machine go pop, you’re looking at some downtime until the database is moved to another location. This is why Heroku developed followers.

Followers are essentially replicated databases: you have a single master database that receives read and write queries, and a follower, which is an identical database that mirrors the master database from a data point of view.

There are a few reasons to do this. First, as upgrading (or downgrading) your database requires replacement of the database itself, creating a follower on the new plan first allows you to simply switch across and then decommission the old database with very little downtime.

Another reason is to allow you to create a backup database that you can fall back on in case of a failure of the first. You could also use followers for read scaling: write queries are directed at the master, while read queries are directed at a follower to alleviate the load on a single database.
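
A follower is created at provisioning time. As a sketch (this mirrors the --fork syntax shown later in this chapter; the plan and color names are placeholders), you pass --follow with the config var of the database to be mirrored:

$ heroku addons:add heroku-postgresql:ronin --follow HEROKU_POSTGRESQL_RED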

Note, however, that followers cannot be instantly created, and that they take time to populate. Therefore, once you’ve created a follower, you will need to interrogate its completeness via the Heroku command-line interface:

$ heroku pg:info --app sushi
=== HEROKU_POSTGRESQL_RED
Plan         Ronin
Status       available
Data Size    82.8 GB
Tables       13
PG Version   9.0.5
Created      2011-06-15 09:58 PDT
=== HEROKU_POSTGRESQL_WHITE
Plan         Ronin
Status       following
Data Size    82.8 GB
Tables       13
PG Version   9.0.6
Created      2011-11-15 09:54 PDT
Following    HEROKU_POSTGRESQL_RED (DATABASE_URL)
Behind By    125 commits

Fast Database Changeovers

By now, you should see that it is possible to swap out your database for a new, more powerful one with very little downtime.

This changeover uses followers to minimize the downtime in migrating between databases. At a high level, a follower is created in order to move your data from one database to another. Once it has received the majority of the data and is closely following your main database, you will prevent new data from being written (usually by enabling maintenance mode on your app). The follower will then fully catch up to the main database and be promoted to be the primary database for the application.

Note

One thing that makes this entire process much simpler is to temporarily set your application as read-only. This won’t suit some applications, but making sure that your data isn’t changing will ease the process significantly.

In order to carry out a database changeover, follow this simple process:

  1. Create a follower for your database.
  2. Wait for the follower to catch up using the technique we’ve explained.
  3. As it’s important that no transactions are changing data, enable maintenance mode on your application:

    $ heroku maintenance:on --app sushi

    And scale every dyno down (in this instance, the workers and fast_workers dynos):

    $ heroku ps:scale workers=0 fast_workers=0 --app sushi
  4. Confirm that your follower is zero commits behind your master and unfollow your main database:

    $ heroku pg:unfollow HEROKU_POSTGRESQL_WHITE --app sushi
  5. Promote your follower to the main database:

    $ heroku pg:promote HEROKU_POSTGRESQL_WHITE --app sushi
    -----> Promoting HEROKU_POSTGRESQL_WHITE to DATABASE_URL... done
  6. And return your application to normal:

    $ heroku ps:scale workers=X fast_workers=Y --app sushi
    $ heroku maintenance:off --app sushi
  7. At this point, your original main database is unused and can be exported, destroyed, or simply left as is. Note, though, that you will still be charged for it until it is destroyed.

Forking

One last feature that is worth mentioning is forking. There is a common need to debug issues that occur in a live production environment—typically issues whereby a user has somehow managed to get themselves into a certain state. In these situations, the problems are usually data related, so the developer tasked with debugging the issue needs access to this data. You could download this data and re-create it locally, but if your database is large, this could be very inefficient.

Another scenario could be that you have a deployment coming up and you’d like to test some database changes against the production data before going live with the changes. Again, downloading a database that could be gigabytes in size would be a bad idea.

Therefore, Heroku has developed fork, which is essentially the practice of taking an existing Heroku PostgreSQL database and making a direct copy of it and the data associated with it at that particular time. This copy does not follow your main database or change in any other way, so it’s a good way of cloning a database for interrogation or testing against.

As these forked databases are just like any other database, you can connect a staging version of your application to it and run your code as normal, all without affecting your production application.

To fork a database:

$ heroku addons:add heroku-postgresql:ronin --fork HEROKU_POSTGRESQL_RED
-----> Adding heroku-postgresql:ronin to sushi... done, v72 ($200/mo)
       Attached as HEROKU_POSTGRESQL_ORANGE
-----> Database will become available after it completes forking
       Use 'heroku pg:wait' to track status

Preparing a database fork can take anywhere from several minutes to several hours, depending on the size of your dataset.

Other Features

Aside from those already described, the Heroku team is constantly adding new features to its PostgreSQL offering. This partly comes from the benefit of managing one of the largest fleets of PostgreSQL databases on the Web: by virtue of the sheer number of databases under its control, the team is able to gather a vast amount of usage data and a comprehensive list of the pain points its users are suffering. By using this data and contributing back to the open source PostgreSQL project, the platform can constantly improve.

Let’s look at some of the key features that have been added to PostgreSQL recently.

Extension Support

Databases are the well-known solution for storing data for an application. However, they sometimes lack functionality required by application developers, such as data encryption or cross-database reporting. As a result, developers are forced to write the needed functionality at their application layer. PostgreSQL 9.1, which already has an extensive collection of data types and functions, took the first step toward mitigating this by creating an extension system that allows the database’s functionality to be expanded.

Extensions allow related pieces of functionality, such as datatypes and functions, to be bundled together and installed in a database with a single command.

Heroku began supporting extensions in March 2012 with the release of hstore, the schemaless datatype for SQL. Users have taken advantage of hstore to increase their development agility by avoiding the need to predefine their schemas.

To install an extension, use the CREATE EXTENSION command in psql:

$ heroku pg:psql --app sushi
psql (9.1.4)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
=> CREATE EXTENSION citext;
CREATE EXTENSION
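
To see why an extension such as hstore is popular, here is a brief sketch (the table, column, and keys are hypothetical) of storing and querying schemaless key/value data:

=> CREATE EXTENSION hstore;
=> CREATE TABLE products (id serial PRIMARY KEY, attrs hstore);
=> INSERT INTO products (attrs) VALUES ('color => "blue", size => "M"');
=> SELECT id FROM products WHERE attrs -> 'color' = 'blue';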

Tip

A list of available extensions can be found at the Heroku DevCenter.

Improved Visibility

Visibility into your data has long been a problem for many application developers. In the current version of PostgreSQL (9.2 at the time of writing), all queries can be normalized and have data about them recorded. This allows you to gain insights such as:

  • How often is a query run?
  • How much time is spent running the query?
  • How much data is returned?

Each of these key pieces of data is critical when it comes to effectively optimizing your database’s performance.

You no longer need the old approach of poring through logs to gain this insight. Now your database itself contains what you need to improve its performance.

Ensuring such functionality is committed back to the PostgreSQL core is very important, as it prevents lock-in and creates a better ecosystem for the community as a whole.

Let’s take a look at how we can begin using some of this. First, turn on statement tracking with CREATE EXTENSION pg_stat_statements;. The resulting view holds one row per distinct normalized query, along with counters such as calls. Running the next query will then show your most frequently run queries:

SELECT
    calls,
    query
FROM
    pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
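
Frequency is only part of the picture. The same view also records timing and row counts; as a sketch (these columns are part of pg_stat_statements in PostgreSQL 9.2), the following shows where the database is actually spending its time:

SELECT
    calls,
    total_time,
    rows,
    query
FROM
    pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;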

JSON Support

Developers are always looking for more extensibility and power when working with and storing their data.

With PostgreSQL 9.2, there’s even more robust support for NoSQL within your SQL database in the form of JSON. When you use the JSON datatype, your data is validated as proper JSON before it’s allowed to be committed.

Beyond the datatype itself, there are several new functions available, including row_to_json and array_to_json. Using these functions, we can turn a row/record, or even an array of values, immediately into JSON to be used within an application or returned via an API:

$ heroku pg:psql
=> SELECT row_to_json(row('foo','bar', 1, 2));
              row_to_json
----------------------------------------
 {"f1":"foo","f2":"bar","f3":1,"f4":2}
(1 row)
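
array_to_json works in the same way. For instance (a minimal sketch):

=> SELECT array_to_json(ARRAY[1, 2, 3]);
 array_to_json
---------------
 [1,2,3]
(1 row)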

Range Type Support

The range datatype is another example of powerful data flexibility. It is a single column consisting of a from-and-to value. Your range can exist as a range of timestamps, can be alphanumeric or numeric, and can even have constraints placed on it to enforce common range conditions.

For example, this schema ensures that in creating a class schedule we can’t have two classes at the same time:

CREATE TABLE schedule (class int, during tsrange);
ALTER TABLE schedule ADD EXCLUDE USING gist (during WITH &&);

Then, attempting to add overlapping data, we would receive an error:

INSERT INTO schedule VALUES (3, '[2012-09-24 13:00, 2012-09-24 13:50)');
INSERT INTO schedule VALUES (1108, '[2012-09-24 13:30, 2012-09-24 14:00)');
ERROR:  conflicting key value violates exclusion
        constraint "schedule_during_excl"

Other data types appear all the time, so it’s well worth keeping an eye on Heroku’s DevCenter and blog for announcements of further types.
