Chapter 8. Databases and Persistence

Introduction

Credit: Aaron Watters, CTO, ReportLab

There are three kinds of people in this world: those who can count and those who can’t.

However, there are only two kinds of computer programs: toy programs and programs that interact with persistent databases of some kind. That is to say, most real computer programs must retrieve stored information and record information for future use. These days, this is true of almost every computer game, which can typically save and restore the state of the game at any time. So when I refer to toy programs, I mean programs written as exercises, or for the fun of programming. Nearly all real programs (such as programs that people get paid to write) have some persistent database storage/retrieval component.

When I was a Fortran programmer in the 1980s I noticed that although almost every program had to retrieve and store information, they almost always did it using home-grown methods. Furthermore, since the storage and retrieval part of the program was the least interesting component from the programmer’s point of view, these parts of the program were frequently implemented very sloppily and were hideous sources of intractable bugs. This repeated observation convinced me that the study and implementation of database systems sat at the core of programming pragmatics, and that the state of the art as I saw it then required much improvement.

Later, in graduate school, I was delighted to find that there was an impressive and sophisticated body of work relating to the implementation of database systems. The literature of database systems covered issues of concurrency, fault tolerance, distribution, query optimization, database design, and transaction semantics, among others. In typical academic fashion, many of the concepts had been elaborated to the point of absurdity (such as the silly notion of conditional multivalued dependencies), but much of the work was directly related to the practical implementation of reliable and efficient storage and retrieval systems. The starting point for much of this work was E. F. Codd’s seminal work “A Relational Model of Data for Large Shared Data Banks” from Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387, http://www.acm.org/classics/nov95/toc.html.

I also found that among my fellow graduate students, and even among most of the faculty, the same body of knowledge was either disregarded or regarded with some scorn. Everyone recognized that knowledge of conventional relational technology could be lucrative, but they generally considered such knowledge to be on the same level as knowing how to write (or more importantly, maintain) COBOL programs. This was not helped by the fact that the emerging database interface standard, SQL (which is now very well established), looked like an extension of COBOL and bore little obvious relationship to any modern programming language.

Those who were interested in database systems were generally drawn to alternatives to the relational model, such as functional or object-oriented database system implementations. There was also a small group of people interested in logic databases. Logic databases were largely an interesting footnote to the study of logic programming and prolog-like programming languages, but the underlying concepts also resonated strongly with Codd’s original vision for relational databases. The general feeling was that relational-database technology, at least at the level of SQL, was a mistake and that something better would soon overtake it.

Now it is more than a decade later, and there is little sign that anything will soon overtake SQL-based relational technology for the majority of data-based applications. In fact, relational-database technology seems more pervasive than ever. The largest software vendors—IBM, Microsoft, and Oracle—all provide various relational-database implementations as crucial components of their core offerings. Other large software firms, such as SAP and PeopleSoft, essentially provide layers of software built on top of a relational-database core.

Generally, relational databases have been augmented rather than replaced. Enterprise software-engineering dogma frequently espouses three-tier systems, in which the bottom tier is a carefully designed relational database, the middle tier defines a view of the database as business objects, and the top tier consists of applications or transactions that manipulate the business objects, with effects that ultimately translate to changes in the underlying relational tables.

Microsoft’s Open Database Connectivity (ODBC) standard provides a common programming API for SQL-based relational databases that permits programs to interact with many different database engines with no, or few, changes. For example, a Python program can be implemented using Microsoft Access as a backend database for testing and debugging purposes. Once the program is stable, it can be put into production use, remotely accessing a backend DB2 database on an IBM mainframe residing on another continent, by changing (at most) one line of code.

This is not to say that relational databases are appropriate for all applications. In particular, a computer game or engineering design tool that must save and restore sessions should probably use a more direct method of persisting the logical objects of the program than the flat tabular representation encouraged in relational-database design. However, even in domains such as engineering or scientific information, I must caution that a hybrid approach that uses some relational methods is often advisable. For example, I have seen a complex relational-database schema for archiving genetic-sequencing information—in which the sequences show up as binary large objects (BLOBs)—but a tremendous amount of important ancillary information can fit nicely into relational tables. But as the reader has probably surmised, I fear, I speak as a relational bigot.

Within the Python world there are many ways of providing persistence and database functionality. My favorite is Gadfly, a simple and minimal SQL implementation that works primarily with in-memory databases. It is my favorite for no other reason than because it is mine, and it’s biggest advantage is that if it becomes unworkable for you, it is easy to switch over to another, industrial-strength SQL engine. Many Gadfly users have started an application with Gadfly (because it was easy to use) and switched later (because they needed more).

However, many people may prefer to start by using other SQL implementations such as mySQL, MS-Access, Oracle, Sybase, MS SQL server, or others that provide the advantages of an ODBC interface (which Gadfly does not do).

Python provides a standard interface for accessing relational databases: the Python DB Application Programming Interface (Py-DBAPI), originally designed by Greg Stein. Each underlying database API requires a wrapper implementation of the Py-DBAPI, and implementations are available for many underlying database interfaces, notably Oracle and ODBC.

When the relational approach is overkill, Python provides built-in facilities for storing and retrieving data. At the most basic level, the programmer can manipulate files directly, as covered in Chapter 4. A step up from files, the marshal module allows programs to serialize data structures constructed from simple Python types (not including, for example, classes or class instances). marshal has the advantage in that it can retrieve large data structures with blinding speed. The pickle and cPickle modules allow general storage of objects, including classes, class instances, and circular structures. cPickle is so named because it is implemented in C and is consequently quite fast, but it remains noticeably slower than marshal.

While marshal and pickle provide basic serialization and deserialization of structures, the application programmer will frequently desire more functionality, such as transaction support and concurrency control. In this case, if the relational model doesn’t fit the application, a direct object database implementation such as the Z-Object Database (ZODB) might be appropriate (http://www.amk.ca/zodb/).

I must conclude with a plea to those who are dismissive of relational-database technology. Please remember that it is successful for some good reasons, and it might be worth considering. To paraphrase Churchill:

text = """ Indeed, it has been said that democracy is the worst form of
    government, except for all those others that have been tried
    from time to time. """
import string
for a, b in [("democracy", "SQL"), ("government", "database")]:
    text = string.replace(text, a, b)
print text

Get Python Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.