BUY THIS BOOK
Add to Cart

Print Book $34.95


Safari Books Online

What is this?

Add to UK Cart

Print Book £24.95

What is this?

Looking to Reprint this content?


Programming the Perl DBI
Programming the Perl DBI Database programming with Perl By Alligator Descartes, Tim Bunce
February 2000
Pages: 364

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Introduction
The subject of databases is a large and complex one, spanning many different concepts of structure, form, and expected use. There are also a multitude of different ways to access and manipulate the data stored within these databases.
This book describes and explains an interface called the Perl Database Interface, or DBI, which provides a unified interface for accessing data stored within many of these diverse database systems. The DBI allows you to write Perl code that accesses data without needing to worry about database- or platform-specific issues or proprietary interfaces.
We also take a look at non-DBI ways of storing, retrieving, and manipulating data with Perl, as there are occasions when the use of a database might be considered overkill but some form of structured data storage is required.
To begin, we shall discuss some of the more common uses of database systems in business today and the place that Perl and DBI takes within these frameworks.
In today's computing climate, databases are everywhere. In previous years, they tended to be used almost exclusively in the realm of mainframe-processing environments. Nowadays, with pizza-box sized machines more powerful than room-sized machines of ten years ago, high-performance database processing is available to anyone.
In addition to cheaper and more powerful computer hardware, smaller database packages have become available, such as Microsoft Access and mSQL. These packages give all computer users the ability to use powerful database technology in their everyday lives.
The corporate workplace has also seen a dramatic decentralization in database resources, with radical downsizing operations in some companies leading to their centralized mainframe database systems being replaced with a mixture of smaller databases distributed across workstations and PCs. The result is that developers and users are often responsible for the administration and maintenance of their own databases and datasets.
This trend towards mixing and matching database technology has some important downsides. Having replaced a centralized database with a cluster of workstations and multiple database types, companies are now faced with hiring skilled administration staff or training their existing administration staff for new skills. In addition, administrators now need to learn how to glue different databases together.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
From Mainframes to Workstations
In today's computing climate, databases are everywhere. In previous years, they tended to be used almost exclusively in the realm of mainframe-processing environments. Nowadays, with pizza-box sized machines more powerful than room-sized machines of ten years ago, high-performance database processing is available to anyone.
In addition to cheaper and more powerful computer hardware, smaller database packages have become available, such as Microsoft Access and mSQL. These packages give all computer users the ability to use powerful database technology in their everyday lives.
The corporate workplace has also seen a dramatic decentralization in database resources, with radical downsizing operations in some companies leading to their centralized mainframe database systems being replaced with a mixture of smaller databases distributed across workstations and PCs. The result is that developers and users are often responsible for the administration and maintenance of their own databases and datasets.
This trend towards mixing and matching database technology has some important downsides. Having replaced a centralized database with a cluster of workstations and multiple database types, companies are now faced with hiring skilled administration staff or training their existing administration staff for new skills. In addition, administrators now need to learn how to glue different databases together.
It is in this climate that a new order of software engineering has evolved, namely database-independent programming interfaces. If you thought administration staff had problems with downsizing database technology, developers may have been hit even harder.
A centralized mainframe environment implies that database software is written in a standard language, perhaps COBOL or C, and runs only on one machine. However, a distributed environment may support multiple databases on different operating systems and processors, with each development team choosing their preferred development environment (such as Visual Basic, PowerBuilder, Oracle Pro*C, Informix E/SQL, C++ code with ODBC—the list is almost endless). Therefore, the task of coordinating and porting software has rapidly gone from being relatively straightforward to extremely difficult.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Perl
Perl is a very high-level programming language originally developed in the 1980s by Larry Wall. Perl is now being developed by a group of individuals known as the Perl5-Porters under the watchful eye of Larry. One of Perl's many strengths is its ability to process arbitrary chunks of textual data, known as strings , in many powerful ways, including regular-expression string manipulation. This capability makes Perl an excellent choice for database programming, since the majority of information stored within databases is textual in nature. Perl takes the pain of manipulating strings out of programming, unlike C, which is not well-suited for that task. Perl scripts tend to be far smaller than equivalent C programs and are generally portable to other operating systems that run Perl with little or no modification.
Perl also now features the ability to dynamically load external modules , which are pieces of software that can be slotted into Perl to extend and enhance its functionality. There are literally hundreds of these modules available now, ranging from mathematical modules to three-dimensional graphics-rendering modules to modules that allow you to interact with networks and network software. The DBI is a set of modules for Perl that allows you to interact with databases.
In recent years, Perl has become a standard within many companies by just being immensely useful for many different applications, the "Swiss army knife of programming languages." It has been heavily used by system administrators who like its flexibility and usefulness for almost any job they can think of. When used in conjunction with DBI, Perl makes loading and dumping databases very straightforward, and its excellent data-manipulation capabilities allow developers to create and manipulate data easily.
Furthermore, Perl has been tacitly accepted as being the de facto language on the World Wide Web for writing CGI programs. What's this got to do with databases? Using Perl and DBI, you can quickly deploy powerful CGI scripts that generate dynamic web pages from the data contained within your databases. For example, online shopping catalogs can be stored within a database and presented to shoppers as a series of dynamically created web pages. The sample code for this book revolves around a database of archaeological sites that you can deploy on the Web.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
DBI in the Real World
DBI is being used in many companies across the world today, including large-scale, mission-critical environments such as NASA and Motorola. Consider the following testimonials by avid DBI users from around the world:
We developed and support a large scale telephone call logging and analysis system for a major client of ours. The system collects ~1 GB of call data per day from over 1,200,000 monitored phone numbers. ~424 GB has been processed so far (over 6,200,000,000 calls). Data is processed and loaded into Oracle using DBI and DBD::Oracle. The database holds rolling data for around 20 million calls. The system generates over 44,000 PostScript very high quality reports per month (~five pages with eleven color graphs and five tables) generated by using Perl to manipulate FrameMaker templates. [Values correct as of July 1999, and rising steadily.]
The whole system runs on three dual processor Sun SPARC Ultra 2 machines—one for data acquisition and processing, one for Oracle and the third does most of the report production (which is also distributed across the other two machines). Almost the entire system is implemented in Perl.
There is only one non-Perl program and that's only because it existed already and isn't specific to this system. The other non-Perl code is a few small libraries linked into Perl using the XS interface.
A quote from a project summary by a senior manager: "Less than a year later the service went live. This was subsequently celebrated as one of the fastest projects of its size and complexity to go from conception to launch."
Designed, developed, implemented, installed, and supported by the Paul Ingram Group, who received a "Rising to the Challenge" award for their part in the project. Without Perl, the system could not have been developed fast enough to meet the demanding go-live date. And without Perl, the system could not be so easily maintained or so quickly extended to meet changing requirements.
Tim Bunce, Paul Ingram Group
In 1997 I built a system for NASA's Langley Research Center in Virginia that puts a searchable web front end on a database of about 100,000 NASA-owned equipment items. I used Apache, DBI, Informix, WDB, and mod_perl on a Sparc 20. Ran like a charm. They liked it so much they used it to give demos at meetings on reorganizing the wind tunnels! Thing was, every time they showed it to people, I ended up extending the system to add something new, like tracking equipment that was in for repairs, or displaying GIFs of technical equipment so when they lost the spec sheet, they could look it up online. When it works, success feeds on itself.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
A Historical Interlude andStanding Stones
Throughout this book, we intersperse examples on relevant topics under discussion. In order to ensure that the examples do not confuse you any more than you may already be confused, let's discuss in advance the data we'll be storing and manipulating in the examples.
Primarily within the UK, but also within other countries around the world, there are many sites of standing stones or megaliths. The stones are arranged into rings, rows, or single or paired stones. No one is exactly sure what the purpose or purposes of these monuments are, but there are certainly a plethora of theories ranging from the noncommittal ``ritual'' use to the more definitive alien landing-pad theory. The most famous and visited of these monuments is Stonehenge, located on Salisbury Plain in the south of England. However, Stonehenge is a unique and atypical megalithic monument.
Part of the lack of understanding about megaliths stems from the fact that these monuments can be up to 5,000 years old. There are simply no records available to us that describe the monuments' purposes or the ritual or rationale behind their erection. However, there are lots of web sites that explore various theories.
The example code shown within this book, and the sample web application we'll also be providing, uses a database containing information on these sites.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Basic Non-DBI Databases
There are several ways in which databases organize the data contained within them. The most common of these is the relational database methodology. Databases that use a relational model are called Relational Database Management Systems , or RDBMSs. The most popular database systems nowadays (such as Oracle, Informix, and Sybase) are all relational in design.
But what does "relational" actually mean? A relational database is a database that is perceived by the user as a collection of tables, where a table is an unordered collection of rows. (Loosely speaking, a relation is a just a mathematical term for such a table.) Each row has a fixed number of fields, and each field can store a predefined type of data value, such as an integer, date, or string.
Another type of methodology that is growing in popularity is the object-oriented methodology, or OODBMS. With an object-oriented model, everything within the database is treated as an object of a certain class that has rules defined within itself for manipulating the data it encapsulates. This methodology closely follows that of object-oriented programming languages such as Smalltalk, C++, and Java. However, the DBI does not support any real OODBMS, so for the moment this methodology will not be discussed further.
Finally, there are several simplistic database packages that exist on various operating systems. These simple database packages generally do not feature the more sophisticated functionality that ``real'' database engines provide. They are, to all intents, only slightly sophisticated file-handling routines, not actually database packages. However, in their defense, they can be extremely fast, and in certain situations the sophisticated functionality that a ``real'' database system provides is simply an unnecessary overhead.
In this chapter, we'll be exploring some non-DBI databases, ranging from the very simplest of ASCII data files through to disk-based hash files supporting duplicate keys. Along the way, we'll consider concurrent access and locking issues, and some applications for the rather useful
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Storage Managers and Layers
Modern databases, no matter which methodology they implement, are generally composed of multiple layers of software. Each layer implements a higher level of functionality using the interfaces and services defined by the lower-level layers.
For example, flat-file databases are composed of pools of data with very few layers of abstraction. Databases of this type allow you to manipulate the data stored within the database by directly altering the way in which the data is stored within the data files themselves. This feature gives you a lot of power and flexibility at the expense of being difficult to use, minimal in terms of functionality, and nerve-destroying since you have no safety nets. All manipulation of the data files uses the standard Perl file operations, which in turn use the underlying operating system APIs.
DBM file libraries, like Berkeley DB, are an example of a storage manager layer that sits on top of the raw data files and allows you to manipulate the data stored within the database through a clearly defined API. This storage manager translates your API calls into manipulations of the data files on your behalf, preventing you from directly altering the structure of the data in such a manner that it becomes corrupt or unreadable. Manipulating a database via this storage manager is far easier and safer than doing it yourself.
You could potentially implement a more powerful database system on top of DBM files. This new layer would use the DBM API to implement more powerful features and add another layer of abstraction between you and the actual physical data files containing the data.
There are many benefits to using higher-level storage managers. The levels of abstraction between your code and the underlying database allow the database vendors to transparently add optimizations, alter the structure of the database files, or port the database engine to other platforms without you having to alter a single line of code.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Query Languages and Data Functions
Database operations can be split into those manipulating the database itself (that is, the logical and physical structure of the files comprising the database) and those manipulating the data stored within these files. The former topic is generally database-specific and can be implemented in various ways, but the latter is typically carried out by using a query language.
All query languages, from the lowest level of using Perl's string and numerical handling functions to a high-level query language such as SQL, implement four main operations with which you can manipulate the data. These operations are:
Fetching
The most commonly used database operation is that of retrieving data stored within a database. This operation is known as fetching, and returns the appropriate data in a form understood by the API host language being used to query the database. For example, if you were to use Perl to query an Oracle database for data, the data would be requested by using the SQL query language, and the rows returned would be in the form of Perl strings and numerics. This operation is also known as selecting data, from the SQL SELECT keyword used to fetch data from a database.
Storing
The corollary operation to fetching data is storing data for later retrieval. The storage manager layers translate values from the programming language into values understood by the database. The storage managers then store that value within the data files. This operation is also known as inserting data.
Updating
Once data is stored within a database, it is not necessarily immutable. It can be changed if required. For example, in a database storing information on products that can be purchased, the pricing information for each product may change over time. The operation of changing a value of existing data within the database is known as updating. It is important to note that this operation doesn't add items to or remove items from the database; rather, it just changes existing items.
Deleting
The final core operation that you generally want to perform on data is to
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Standing Stones and the Sample Database
Our small example databases throughout this chapter will contain information on megalithic sites within the UK. A more complex version of this database is used in the following chapters.
The main pieces of information that we wish to store about megaliths are the name of the site, the location of the site within the UK, a unique map reference for the site, the type of megalithic setting the site is (e.g., a stone circle or standing stone), and a description of what the site looks like.
For example, we might wish to store the following information about Stonehenge in our database:
Name:
Stonehenge
Location:
Wiltshire, England
Map Reference:
SU 123 400
Type:
Stone Circle and Henge
Description:
The most famous megalithic site in the world, comprised of an earthen bank, or henge, and several concentric rings of massive standing stones formed into trilithons.
With this simple database, we can retrieve all sorts of different pieces of information, such as, ``tell me of all the megalithic sites in Wiltshire,'' or ``tell me about all the standing stones in Orkney,'' and so on.
Now let's discuss the simplest form of database that you might wish to use: the flat-file database.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Flat-File Databases
The simplest type of database that we can create and manipulate is the old standby, the flat-file database. This database is essentially a file, or group of files, that contains data in a known and standard format that a program scans for the requested information. Modifications to the data are usually done by updating an in-memory copy of the data held in the file, or files, then writing the entire set of data back out to disk. Flat-file databases are typically ASCII text files containing one record of information per line. The line termination serves as the record delimiter.
In this section we'll be examining the two main types of flat-file database: files that separate fields with a delimiter character, and files that allocate a fixed length to each field. We'll discuss the pros and cons of each type of data file and give you some example code for manipulating them.
The most common format used for flat-file databases is probably the delimited file in which each field is separated by a delimiting character. And possibly the most common of these delimited formats is the comma-separated values (CSV) file, in which fields are separated from one another by commas. This format is understood by many common programs, such as Microsoft Access and spreadsheet programs. As such, it is an excellent base-level and portable format useful for sharing data between applications.
Other popular delimiting characters are the colon ( : ), the tab, and the pipe symbol ( | ). The Unix /etc/passwd file is a good example of a delimited file with each record being separated by a colon. Figure 2.1 shows a single record from an /etc/passwd file.
Figure 2.1: The /etc/passwd file record format
Since delimited files are a very low-level form of storage manager, any manipulations that we wish to perform on the data must be done using operating system functions and low-level query logic, such as basic string comparisons. The following program illustrates how we can open a data file containing colon-separated records of megalith data, search for a given site, and return the data if found:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Putting Complex Data into Flat Files
In our discussions of so-called "flat files" we've so far been storing, retrieving, and manipulating only that most basic of datatypes: the humble string. What can you do if you want to store more complex data, such as lists, hashes, or deeply nested data structures using references?
The answer is to convert whatever it is you want to store into a string. Technically that's known as marshalling or serializing the data. The Perl Module List has a section that lists several Perl modules that implement data marshalling.
We're going to take a look at two of the most popular modules, Data::Dumper and Storable, and see how we can use them to put some fizz into our flat files. These techniques are also applicable to storing complex Perl data structures in relational databases using the DBI, so pay attention.
The Data::Dumper module takes a list of Perl variables and writes their values out in the form of Perl code, which will recreate the original values, no matter how complex, when executed.
This module allows you to dump the state of a Perl program in a readable form quickly and easily. It also allows you to restore the program state by simply executing the dumped code using eval() or do().
The easiest way to describe what happens is to show you a quick example:
#!/usr/bin/perl -w
#
# ch02/marshal/datadumpertest: Creates some Perl variables and dumps them out.
#                              Then, we reset the values of the variables and
#                              eval the dumped ones ...

use Data::Dumper;

### Customise Data::Dumper's output style
### Refer to Data::Dumper documentation for full details
if ($ARGV[0] eq 'flat') {
    $Data::Dumper::Indent = 0;
    $Data::Dumper::Useqq  = 1;
}
$Data::Dumper::Purity = 1;

### Create some Perl variables
my $megalith  = 'Stonehenge';
my $districts = [ 'Wiltshire', 'Orkney', 'Dorset' ];
### Print them out
print "Initial Values: \$megalith  = " . $megalith . "\n" .
      "                \$districts = [ ". join(", ", @$districts) . " ]\n\n";

### Create a new Data::Dumper object from the database
my $dumper = Data::Dumper->new( [    $megalith, $districts  ],
                                [ qw( megalith  districts ) ] );

### Dump the Perl values out into a variable
my $dumpedValues = $dumper->Dump();

### Show what Data::Dumper has made of the variables!
print "Perl code produced by Data::Dumper:\n";
print $dumpedValues . "\n";

### Reset the variables to rubbish values
$megalith = 'Blah! Blah!';
$districts = [ 'Alderaan', 'Mordor', 'The Moon' ];

### Print out the rubbish values
print "Rubbish Values: \$megalith  = " . $megalith . "\n" .
      "                \$districts = [ ". join(", ", @$districts) . " ]\n\n";

### Eval the file to load up the Perl variables
eval $dumpedValues;
die if $@;

### Display the re-loaded values
print "Re-loaded Values: \$megalith  = " . $megalith . "\n" .
      "                  \$districts = [ ". join(", ", @$districts) . " ]\n\n";

exit;
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Concurrent Database Access and Locking
Before we start looking at DBM file storage management, we should discuss the issues that were flagged earlier regarding concurrent access to flat-file databases, as these problems affect all relatively low-level storage managers.
The basic problem is that concurrent access to files can result in undefined, and generally wrong, data being stored within the data files of a database. For example, if two users each decided to delete a row from the megalith database using the program shown in the previous section, then during the deletion phase, both users would be operating on the original copy of the database. However, whichever user's deletion finished first would be overwritten as the second user's deletion copied their version of the database over the first user's deletion. The first user's deletion would appear to have been magically restored. This problem is known as a race condition and can be very tricky to detect as the conditions that cause the problem are difficult to reproduce.
To avoid problems of multiple simultaneous changes, we need to somehow enforce exclusive access to the database for potentially destructive operations such as the insertion, updating, and deletion of records. If every program accessing a database were simply read-only, this problem would not appear, since no data would be changed. However, if any script were to alter data, the consistency of all other processes accessing the data for reading or writing could not be guaranteed.
One way in which we can solve this problem is to use the operating system's file-locking mechanism, accessed by the Perl flock() function. flock() implements a cooperative system of locking that must be used by all programs attempting to access a given file if it is to be effective. This includes read-only scripts, such as the query script listed previously, which can use flock() to test whether or not it is safe to attempt a read on the database.
The symbolic constants used in the following programs are located within the Fcntl package and can be imported into your scripts for use with
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
DBM Files and the BerkeleyDatabase Manager
DBM files are a storage management layer that allows programmers to store information in files as pairs of strings, a key, and a value. DBM files are binary files and the key and value strings can also hold binary data.
There are several forms of DBM files, each with its own strengths and weaknesses. Perl supports the ndbm , db , gdbm , sdbm , and odbm managers via the NDBM_File , DB_File , GDBM_File , SDBM_File , and ODBM_File extensions. There's also an AnyDBM_File module that will simply use the best available DBM. The documentation for the AnyDBM_File module includes a useful table comparing the different DBMs.
These extensions all associate a DBM file on disk with a Perl hash variable (or associative array ) in memory. The simple look like a hash programming interface lets programmers store data in operating system files without having to consider how it's done. It just works.
Programmers store and fetch values into and out of the hash, and the underlying DBM storage management layer will look after getting them on and off the disk.
In this section, we shall discuss the most popular and sophisticated of these storage managers, the Berkeley Database Manager, also known as the Berkeley DB. This software is accessed from Perl via the DB_File and Berkeley DB extensions. On Windows systems, it can be installed via the Perl package manager, ppm. On Unix systems, it is built by default when Perl is built only if the Berkeley DB library has already been installed on your system. That's generally the case on Linux, but on most other systems you may need to fetch and build the Berkeley DB library first.
In addition to the standard DBM file features, Berkeley DB and the DB_File module also provide support for several different storage and retrieval algorithms that can be used in subtly different situations. In newer versions of the software, concurrent access to databases and locking are also supported.
Prior to manipulating data within a Berkeley database, either a new database must be created or an existing database must be opened for reading. This can be done by using one of the following function calls:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The MLDBM Module
The MLDBM module is very useful for quickly writing complex Perl data structures to DBM files for persistent storage. The ML in MLDBM stands for multilevel and refers to its ability to store complex multilevel data structures. That's something that ordinary hashes, even hashes tied to DBM files, can't do.
The MLDBM module is an excellent example of a layered storage manager. It acts as a thin layer over another DBM module, but intercepts reads and writes to automatically serialize (or deserialize) the data using another module.
The module works by automatically serializing the Perl data structures that you wish to store into a single string, which is then stored within a DBM file. The data is recovered by deserializing the data from the stored string back into a valid Perl object. The actual interface for referencing the stored and retrieved data is identical to the API for DBM files. That makes it very easy to "drop in" use of MLDBM instead of your existing DBM module.
The following example shows how we could use DB_File for storage and Data::Dumper for displaying the restored data:
#!/usr/bin/perl -w
#
# ch02/mldbmtest: Demonstrates storing complex data structures in a DBM
#                 file using the MLDBM module.

use MLDBM qw( DB_File Data::Dumper );
use Fcntl;

### Remove the test file in case it exists already ...
unlink 'mldbmtest.dat';

tie my %database1, 'MLDBM', 'mldbmtest.dat', O_CREAT | O_RDWR, 0666
    or die "Can't initialize MLDBM file: $!\n";

### Create some megalith records in the database
%database1 = (
    'Avebury' => {
        name => 'Avebury',
        mapref => 'SU 103 700',
        location => 'Wiltshire'
    },
    'Ring of Brodgar' => {
        name => 'Ring of Brodgar',
        mapref => 'HY 294 133',
        location => 'Orkney'
    }
);

### Untie and retie to show data is stored in the file
untie %database1;

tie my %database2, 'MLDBM', 'mldbmtest.dat', O_RDWR, 0666
    or die "Can't initialize MLDBM file: $!\n";

### Dump out via Data::Dumper what's been stored ...
print Data::Dumper->Dump( [ \%database2 ] );

untie %database2;

exit;
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Summary
This has been a long chapter, both for you to read and for us to write. We've covered a lot of topics and, hopefully, given you some useful insights into database fundamentals and some new techniques for your mental toolbox.
We're almost ready to discuss the DBI itself, but before we do, we want to introduce you to the joys of SQL.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: SQL and Relational Databases
The Structured Query Language, or SQL, is a language designed for the purpose of manipulating data within databases.
In 1970, E. F. Codd, working for IBM, published a now classic paper, "A Relational Model of Data for Large Shared Data Banks," in which he laid down a set of abstract principles for data management that became known as the relational model. The entire field of relational database technology has its roots in that paper.
One of the many research projects sparked by that paper was the design and implementation of a language that could make interacting with relational databases simple. And it didn't make the programmer write horrendously complex sections of code to interact with the database.
This chapter serves to give the complete database neophyte a very limited overview of what SQL is and how you can do some simple tasks with it. Many of the more complex details of SQL's design and operation have been omitted or greatly simplified to allow the neophyte to learn enough to use the DBI in a simple, but effective, way. Section "Resources " in the Preface lists other books and web sites dedicated to SQL and relational database technologies.
The relational database model revolves around data storage units called tables, which have a number of attributes associated with them, called columns. For example, we might wish to store the name of the megalithic site, its location, what sort of site it is, and where it can be found on the map in our megaliths table. Each of these items of data would be a separate column.
In most large database systems, tables are created within containing structures known as schemas . A schema is a collection of logical data structures, or schema objects, such as tables and views. In some databases, a schema corresponds to a user created within the database. In others, it's a more general way of grouping related tables. For example, in our megalithic database, using Oracle, we have created a user called stones. Within the stones user's schema, the various tables that compose the megalithic database have been created.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Relational Database Methodology
The relational database model revolves around data storage units called tables, which have a number of attributes associated with them, called columns. For example, we might wish to store the name of the megalithic site, its location, what sort of site it is, and where it can be found on the map in our megaliths table. Each of these items of data would be a separate column.
In most large database systems, tables are created within containing structures known as schemas . A schema is a collection of logical data structures, or schema objects, such as tables and views. In some databases, a schema corresponds to a user created within the database. In others, it's a more general way of grouping related tables. For example, in our megalithic database, using Oracle, we have created a user called stones. Within the stones user's schema, the various tables that compose the megalithic database have been created.
Data is stored within a table in the form of rows . That is, the data for one site is stored within one row that contains the appropriate values for each column. This sort of data layout corresponds exactly to the row-column metaphor used by spreadsheets, ledgers, or even plain old tabulated lists you might scribble in a notepad.
An example of such a list containing megalithic data is:
Site            Location        Type                    Map Reference
----            --------        ----                    -------------
Callanish I     Western Isles   Stone Circle and Rows   NB 213 330
Stonehenge      Wiltshire       Stone Circle and Henge  SU 123 422
Avebury         Wiltshire       Stone Circle and Henge  SU 103 700
Sunhoney        Aberdeenshire   Recumbent Stone Circle  NJ 716 058
Lundin Links    Fife            Four Poster             NO 404 027
This system lends itself quite well to a generalized query such as ``Give me the names of all the megaliths'' or ``Give me the map locations of all the megaliths in Wiltshire.'' To perform these queries, we simply specify the columns we wish to see and the conditions each column in each row must meet to be returned as a valid result.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Datatypes and NULL Values
One of the most important aspects of the structures defined within a database, such as tables and views, is the datatype of each of the columns. Perl is a loosely typed language, whereas SQL is strongly typed. Thus, each field or value is of a given datatype that determines how values and fields are compared. For example, the mapref field within the megaliths table would not be much use if it could hold only dates!
Therefore, it is important to assign an appropriate datatype to each column. This avoids any potential confusion as to how the values stored within each column are to be interpreted, and also establishes how these values can be compared in query condition clauses.
There are several common datatypes. The most widely used of these can be grouped as follows:
Numeric datatypes
The grouping of numeric datatypes includes types such as integer and floating point (or real) numbers. These types, depending on your database, may include FLOAT , REAL , INTEGER , and NUMBER . Numeric datatypes are compared in the obvious way; that is, the actual values are tested.
Character datatypes
Character datatypes are used to store and manipulate textual data. Any characters whatsoever—digits or letters—can be stored within a character datatype.
However, if digits are stored within a character datatype, they will be treated as being a string of characters as opposed to a number. For example, they'll be sorted and ordered as strings and not numbers, so "10" will be less than "9".
Depending on your database system, there can be many different types of character datatypes such as CHAR , VARCHAR , VARCHAR2 , and so on. Most databases support at least the most basic of these, CHAR.
When being compared, character datatypes usually apply lexical ordering according to the character set being used by the database.
Date datatypes
Most database systems implement at least one datatype that contains date information, as opposed to a character datatype containing a string representation of a date. This allows you to perform arbitrary arithmetic on date values very easily. For example, you might wish to select rows where the date field corresponds to a Monday.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Querying Data
The first (and possibly most immediately useful) operation that SQL allows you to perform on data is to select and return rows of data from tables stored within the database. This activity forms the core of exactly what a database represents, a large repository of searchable information.
All SQL queries, no matter how simple or complex, use the SELECT keyword to specify the columns to fetch, the tables to fetch them from, and any conditions that must be met for the rows to be retrieved. SELECT falls into the group of commands known as Data Manipulation Language , or DML, commands.
The full syntax for SELECT can be intimidating to the new user, primarily because it sports a multitude of different ways in which the query can be customized. For example, you might wish to return only unique data rows, group certain rows together, or even specify how the returned rows should be sorted.
For the moment, we'll just look at the simplest cases.
In our example, we've tended to use capital letters for SQL commands and other reserved words, and lowercase letters for database object names (tables, columns, etc.). In most databases, the SQL commands are not case-sensitive, but the actual database object names may or may not be.
The simplest SQL query is to ask for certain columns in all rows of a table. The SELECT syntax for this form of query can be expressed as simply as:
SELECT column, column, ..., column
FROM table
or:
SELECT *
FROM table
which will query and fetch back all the columns within the specified table.
Therefore, to select some of the rows from some columns in the megaliths table, the following SQL statement can be used:
SELECT name, location, mapref
FROM megaliths
which would return the following information:
+---------------------------------------------------------------+
| name         | location                          | mapref     |
+---------------------------------------------------------------+
| Callanish I  | Callanish, Isle of Lewis          | NB 213 330 |
| Lundin Links | Lundin Links, Fife, Scotland      | NO 404 027 |
| Stonehenge   | Near Amesbury, Wiltshire, England | SU 123 400 |
| Avebury      | Avebury, Wiltshire, England       | SU 103 700 |
| Sunhoney     | Near Insch, Aberdeenshire         | NJ 716 058 |
+---------------------------------------------------------------+
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Modifying Data Within Tables
Read-only databases (that is, databases that only allow you to SELECT data from them) are very useful. Data warehouses are typically massive read-only databases populated with archived data mangled into a form suitable for reporting. However, in the cut-and-thrust world of transaction-processing databases, the ability to modify data within the database quickly and efficiently is of paramount importance.
There are several core operations that comprise the broader definition of data modification, namely:
  • Inserting new data into the database
  • Deleting data from the database
  • Updating, or modifying, existing data within the database
Each of these operations falls into the grouping of Data Manipulation Language commands, or DMLs, alongside SELECT.
We shall discuss each of these tasks in turn and apply the theory to our example database.
Before a database can be really of any use, data must be inserted into it by some means, either by manual data entry or with an automated batch loading program. The action of inserting data only applies to cases in which you wish to load a completely new record of information into the database. If the record already exists and merely requires modification of a column value, the update operation should be used instead.
Data inserts in the relational database model are done on a row-by-row basis: each record or item of information that you load into the database corresponds to a brand-new row within a given existing table. As each inserted record corresponds to a new row in one table, multitable inserts are not possible.
The SQL INSERT keyword provides a simple mechanism for inserting new rows of data into the database. For example, assuming the megaliths table is already present in the database and and contains the six columns shown earlier in Figure 3.1, a single row of data can be inserted into it using the following SQL statement:
INSERT INTO megaliths VALUES ( 0, 'Callanish I', 
                               '"Stonehenge of the North"',
                               'Western Isles',
                               'NB 213 330', 1 )
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Creating and Destroying Tables
The previous section discusses the operations SQL can perform to manipulate data stored as rows within tables in the database. However, there is a separate set of statements that covers the manipulation of the tables (and other objects) within the database themselves. These statements are known as Data Definition Language commands, or DDLs.
The operations that can be performed on tables are fairly basic, since they are quite far-reaching in their consequences. The two simplest operations available are:
Creating a new table
This is done via the CREATE TABLE command, the syntax of which varies depending on the database platform being used. However, this statement generally specifies the name of the table to be created and the definition of all the columns of the table (both names and datatypes).
For example, the SQL we used to create the megaliths table within our database was:
CREATE TABLE megaliths (
    id              INTEGER NOT NULL,
    name            VARCHAR(64),
    location        VARCHAR(64),
    description     VARCHAR(256),
    site_type_id    INTEGER,
    mapref          VARCHAR(16)
)
CREATE TABLE will create a brand-new table with the given definition, which will be completely empty until you insert rows into it.
Deleting, or dropping, an existing table
This action is as drastic as data modification can get. The actual table structure within the database is completely removed, as are any rows of data currently stored within that table. This operation cannot usually be rolled back from. Once the fatal statement is typed, the specified table has gone forever (unless you have made a backup).
The syntax for dropping tables is fairly standard across databases and is extremely straightforward. To completely get rid of our megaliths table, we can issue the SQL statement of:
DROP TABLE megaliths
There are other ways in which table definitions can be manipulated, and also other database structures that can be created (such as views and indexes). But these are beyond the scope of this book. You should consult your database documentation for more information.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Programming with the DBI
In this chapter, we'll discuss in detail the actual programming interface defined by the DBI module. We'll start with the very architecture of DBI, continue with explaining how to use the handles that DBI provides to interact with databases, then cover simple tasks such as connecting and disconnecting from databases. Finally, we'll discuss the important topic of error handling and describe some of the DBI's utility methods and functions. Future chapters will discuss how to manipulate data within your databases, as well as other advanced functionality.
The DBI architecture is split into two main groups of software: the DBI itself, and the drivers. The DBI defines the actual DBI programming interface, routes method calls to the appropriate drivers, and provides various support services to them. Specific drivers are implemented for each different type of database and actually perform the operations on the databases. Figure 4.1 illustrates this architecture.
Figure 4.1: The DBI architecture
Therefore, if you are authoring software using the DBI programming interface, the method you use is defined within the DBI module. From there, the DBI module works out which driver should handle the execution of the method and passes the method to the appropriate driver for actual execution. This is more obvious when you recognize that the DBI module does not perform any database work itself, nor does it even know about any types of databases whatsoever. Figure 4.2 shows the flow of data from a Perl script through to the database.
Figure 4.2: Data flow through DBI
Under this architecture, it is relatively straightforward to implement a driver for any type of database. All that is required is to implement the methods defined in the DBI specification, as supported by the DBI module, in a way that is meaningful for that database. The data returned from this module is passed back into the DBI module, and from there it is returned to the Perl program. All the information that passes between the DBI and its drivers is standard Perl datatypes, thereby preserving the isolation of the DBI module from any knowledge of databases.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
DBI Architecture
The DBI architecture is split into two main groups of software: the DBI itself, and the drivers. The DBI defines the actual DBI programming interface, routes method calls to the appropriate drivers, and provides various support services to them. Specific drivers are implemented for each different type of database and actually perform the operations on the databases. Figure 4.1 illustrates this architecture.
Figure 4.1: The DBI architecture
Therefore, if you are authoring software using the DBI programming interface, the method you use is defined within the DBI module. From there, the DBI module works out which driver should handle the execution of the method and passes the method to the appropriate driver for actual execution. This is more obvious when you recognize that the DBI module does not perform any database work itself, nor does it even know about any types of databases whatsoever. Figure 4.2 shows the flow of data from a Perl script through to the database.
Figure 4.2: Data flow through DBI
Under this architecture, it is relatively straightforward to implement a driver for any type of database. All that is required is to implement the methods defined in the DBI specification, as supported by the DBI module, in a way that is meaningful for that database. The data returned from this module is passed back into the DBI module, and from there it is returned to the Perl program. All the information that passes between the DBI and its drivers is standard Perl datatypes, thereby preserving the isolation of the DBI module from any knowledge of databases.
The separation of the drivers from the DBI itself makes the DBI a powerful programming interface that can be extended to support almost any database available today. Drivers currently exist for many popular databases including Oracle, Informix, mSQL, MySQL, Ingres, Sybase, DB2, Empress, SearchServer, and PostgreSQL. There are even drivers for XBase and CSV files.
These drivers can be used interchangeably with little modification to your programs. Couple this database-level portability with the portability of Perl scripts across multiple operating systems, and you truly have a rapid application development tool worthy of notice.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Handles
The DBI defines three main types of objects that you may use to interact with databases. These objects are known as handles. There are handles for drivers, which the DBI uses to create handles for database connections, which, in turn, can be used to create handles for individual database commands, known as statements. Figure 4.3 illustrates the overall structure of the way in which handles are related, and their meanings are described in the following sections.
Figure 4.3: DBI handles
Driver handles represent loaded drivers and are created when the driver is loaded and initialized by the DBI. There is exactly one driver handle per loaded driver. Initially, the driver handle is the only contact the DBI has with the driver, and at this stage, no contact has been made with any database through that driver.
The only two significant methods available through the driver handle are data_sources() , to enumerate what can be connected to, and connect() , to actually make a connection. These methods are more commonly invoked as DBI class methods, however, which we will discuss in more detail later in this chapter.
Since a driver handle completely encapsulates a driver, there's no reason why multiple drivers can't be simultaneously loaded. This is part of what makes the DBI such a powerful interface.
For example, if a programmer is tasked with the job of transferring data from an Oracle database to an Informix database, it is possible to write a single DBI program that connects simultaneously to both databases and simply passes the data backwards and forwards as needed. In this case, two driver handles would be created, one for Oracle and one for Informix. No problems arise from this situation, since each driver handle is a completely separate Perl object.
Within the DBI specification, a driver handle is usually referred to as $drh .
Driver handles should not normally be referenced within your programs. The actual instantiation of driver handles happens ``under the hood'' of DBI, typically when
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Data Source Names
When connecting to a database via the DBI, you need to tell the DBI where to find the database to connect to. For example, the database driver might require a database name, or a physical machine name upon which the database resides. This information is termed a data source name, and of all the aspects of DBI, this is possibly the most difficult to standardize due to the sheer number and diversity of connection syntaxes.
The DBI requires the data source name to start with the characters dbi:, much like a URL begins with http:, and then the name of the driver, followed by another colon—for example, dbi:Oracle:. Any text that follows is passed to the driver's own connect() method to interpret as it sees fit. Most drivers expect either a simple database name or, more often, a set of one or more name/value pairs separated with semicolons. Some common examples are listed later in this section.
For example, mSQL requires the hostname, database name, and potentially, the TCP/IP port number for connecting to the database server. However, Oracle may require only a single word that is an alias to a more complicated connection identifier that is stored in separate Oracle configuration files.
DBI offers two useful methods for querying which data sources are available to you for each driver you have installed on your system.
Firstly, you can get a list of all the available drivers installed on your machine by using the DBI->available_drivers() method. This returns a list with each element containing the data source prefix of an installed driver, such as dbi:Informix:.
Secondly, you can invoke the DBI->data_sources() method against one or more of the drivers returned by the DBI->available_drivers() method to enumerate which data sources are known to the driver. Calling the data_sources() method will actually load the specified driver and validate that it is completely and correctly installed. Because DBI dies if it can't load and initialize a driver, this method should be called inside an
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Connection and Disconnection
The main activity in database programming usually involves the execution of SQL statements within a database. However, to accomplish this task, a connection to a database must be established first. Furthermore, after all the work has been done, it is good manners to disconnect from the database to free up both your local machine resources and, more importantly, valuable database resources.
In the case of simple databases, such as flat-file or Berkeley DB files, ``connecting'' is usually as simple as opening the files for reading or using the tie mechanism. However, in larger database systems, connecting may be considerably more complicated.
A relatively simple RDBMS is mSQL, which has a simple method of connection: to connect, a program connects to a TCP/IP port on the computer running the database. This establishes a live connection within the database. However, more complex systems, such as Oracle, have a lot more internal security and housekeeping work that must be performed at connection time. They also have more data that needs to be specified by the program, such as the username and password that you wish to connect with.
By looking at a broad spectrum of database systems, the information required to connect can be boiled down to:
  1. The data source name, a string containing information specifying the driver to use, what database you wish to connect to, and possibly its whereabouts. This argument takes the format discussed in the previous section and is highly database-specific.
  2. The username that you wish to connect to the database as. To elaborate on the concept of usernames a little further, some databases partition the database into separate areas, called schemas, in which different users may create tables and manipulate data. Users cannot affect tables and data created by other users. This setup is similar to accounts on a multiuser computer system, in that users may create their own files, which can be manipulated by them, but not necessarily by other users. In fact, users may decide to disallow all access to their files, or tables, from all other users, or allow access to a select group or all users.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Error Handling
The handling of errors within programs, or the lack thereof, is one of the more common causes of questions concerning programming with DBI. Someone will ask "Why doesn't my program work?" and the answer generally runs along the lines of "Why aren't you performing error checking?" Sure enough, nine out of ten times when error checking is added, the exact error message appears and the cause for error is obvious.
Early versions of the DBI required programmers to perform their own error checking, in a traditional way similar to the examples listed earlier for connecting to a database. Each method that returned some sort of status indicator as to its success or failure should have been followed by an error condition checking statement. This is an excellent, slightly C-esque way of programming, but it quickly gets to be tiresome, and the temptation to skip the error checking grows.
The DBI now has a far more straightforward error-handling capability in the style of exception s. That is, when DBI internally detects that an error has occurred after a DBI method call, it can automatically either warn() or