Chapter 14. Data Persistence

My programs can share their data, either with other programs or with future invocations of themselves. To make that possible, I store the data outside of the program’s memory and then read it from that source to recreate it. I can put that data in a file or a database, send it over a network connection, or anything else I want to do with it.

I can even share data between different programs. For anything except simple applications, I’d probably want to use a robust database server and the DBI module. I won’t cover proper database servers such as MySQL, PostgreSQL, or Oracle. Perl works with those through DBI, and there’s already a great book for that in “Further Reading,” at the end of the chapter. This chapter is about lightweight techniques I can use when I don’t need a full server backend.

Flat Files

Conceptually and practically, the easiest way to save and reuse data is to write it as text to a file. I don’t need much to do it, and I can inspect the file, change the data if I like, and send it to other people without worrying about low-level details like byte ordering or internal data sizes. When I want to reuse the data, I read it back into the program. Even if I don’t use a real file, I can still use these techniques to send the data down a socket or in an email.

pack

The pack built-in takes data and turns it into a single string by using a template string to decide how to put the data together. It’s similar to sprintf, although, as its name suggests, the output string uses space as efficiently as possible:

#!/usr/bin/perl
# pack.pl

my $packed = pack( 'NCA*',  31415926, 32, 'Perl' );

print 'Packed string has length [' . length( $packed ) . "]\n";
print "Packed string is [$packed]\n";

The string that pack creates in this case is shorter than just stringing together the characters that make up the data, and certainly not as easy to read:

Packed string has length [9]
Packed string is [☐öˆ Perl]

The format string NCA* has one letter for each of the rest of the arguments and tells pack how to interpret it. The N treats its argument as a network-order unsigned long. The C treats its argument as an unsigned char, and the A treats its argument as an ASCII character. After the A I use a * as a repeat count to apply it to all the characters in its argument. Without the *, it would only pack the first character in Perl.
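Here’s a minimal sketch of what the repeat count does, using the same values as pack.pl. The file name and variable names are only for illustration:

```perl
#!/usr/bin/perl
# pack-repeat.pl (a hypothetical file name)
use strict;
use warnings;

# Without a repeat count, A takes only the first character
my $one = pack( 'A', 'Perl' );

# With *, A takes every character in its argument
my $all = pack( 'A*', 'Perl' );

# N always produces four bytes and C always produces one
my $long = pack( 'N', 31415926 );
my $char = pack( 'C', 32 );

printf "A  packs [%s]\n", $one;
printf "A* packs [%s]\n", $all;
printf "N  takes %d bytes\n", length $long;
printf "C  takes %d byte\n",  length $char;
```

The four bytes from N, the one byte from C, and the four characters of Perl account for the nine-character length of the packed string.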

Once I have my packed string, I can write it to a file, send it over a socket, or anything else I can do with strings. When I want to get back my data, I use unpack with the same template string:

my( $long, $char, $ascii ) = unpack( "NCA*", $packed );

print <<"HERE";
Long: $long
Char: $char
ASCII: $ascii
HERE

As long as I’ve done everything correctly, I get back the data I had when I started:

Long: 31415926
Char: 32
ASCII: Perl

I can pack several data together to form a record for a flat file database. Suppose my record comprises the ISBN, title, and author for a book. I can use three different A formats, giving each a length specifier. For each length, pack will either truncate the argument if it is too long or pad it with spaces if it’s shorter:

my( $isbn, $title, $author ) = ( 
        '0596527241', 'Mastering Perl', 'brian d foy' 
        );

my $record = pack( "A10 A20 A20", $isbn, $title, $author );

print "Record: [$record]\n";

The record is exactly 50 characters long, no matter which data I give it:

Record: [0596527241Mastering Perl      brian d foy         ]
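To take a record apart again, I can probably reuse the same template with unpack. This is a sketch under that assumption; the overlong title is made up to show the truncation, and the variable names are only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $template = 'A10 A20 A20';

# The made-up title is longer than 20 characters, so pack truncates it
my $record = pack( $template,
    '0596527241', 'A Title That Is Far Too Long', 'brian d foy' );

print "Record length: ", length( $record ), "\n";

# When unpacking, the A format strips the trailing padding
my( $isbn, $title, $author ) = unpack( $template, $record );

print "ISBN:   [$isbn]\n";
print "Title:  [$title]\n";
print "Author: [$author]\n";
```

The author field comes back without its padding, but the characters that pack cut from the title are gone for good, so the field widths have to be large enough for the real data.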

When I store this in a file along with several other records, I always know that the next 50 bytes are another record. The sysseek built-in puts me in the right position, and I can read an exact number of bytes with sysread (the perlfunc documentation warns against mixing the buffered seek with sysread):

open my $fh, '<', 'books.dat' or die "Could not open books.dat: $!";

sysseek $fh, 50 * $ARGV[0], 0;   # move to right record (0 means from the start)

sysread $fh, my $record, 50;     # read next record.
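Here’s a sketch of the whole round trip: write a few fixed-width records, then jump straight to one of them. The sample records are made up, and I use a temporary file from File::Temp instead of books.dat so the sketch cleans up after itself. I pair sysread with sysseek, the unbuffered counterpart of seek, since perlfunc advises against mixing seek with sysread:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

my $template    = 'A10 A20 A20';
my $record_size = 50;

# Hypothetical sample records, made up for this sketch
my @books = (
    [ '0000000001', 'First Title',  'Author One'   ],
    [ '0000000002', 'Second Title', 'Author Two'   ],
    [ '0000000003', 'Third Title',  'Author Three' ],
);

# Write every record, one after another, with no separators
my( $out, $filename ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} pack( $template, @$_ ) for @books;
close $out or die "Could not close $filename: $!";

open my $in, '<', $filename or die "Could not open $filename: $!";
binmode $in;

my $n = 1;   # zero-based record number, so this is the second record

sysseek $in, $record_size * $n, SEEK_SET
    or die "Could not seek in $filename: $!";

my $bytes = sysread $in, my $record, $record_size;
die "Short read from $filename"
    unless defined $bytes and $bytes == $record_size;

my( $isbn, $title, $author ) = unpack( $template, $record );
print "Record $n: $isbn / $title / $author\n";
```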

There are many other formats I can use in the template string, including every sort of number format and storage. If I want to inspect a string to see exactly what’s in it, I can unpack it with the H format to turn it into a hex string. I don’t have to unpack the string in $packed with the same template I used to create it:

my $hex = unpack( "H*", $packed );
print "Hex is [$hex]\n";

I can now see the hex values for the individual bytes in the string:

Hex is [01df5e76205065726c]
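The H format works in the other direction too. As a small sketch, I can pack the hex string back into bytes and then unpack those bytes with whichever template suits me. The variable names are only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $hex = '01df5e76205065726c';

# H* turns each pair of hex digits back into one byte
my $packed = pack( 'H*', $hex );

# The original template recovers the original values
my( $long, $char, $ascii ) = unpack( 'NCA*', $packed );
print "Long: $long\nChar: $char\nASCII: $ascii\n";

# C* gives the numeric value of every byte instead
my @bytes = unpack( 'C*', $packed );
print "Bytes: @bytes\n";
```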

The unpack built-in is also handy for reading binary files. Here’s a bit of code to read the Portable Network Graphics (PNG) data from Gisle Aas’s Image::Info distribution. In the while loop, he reads a chunk of eight bytes, which he unpacks as a long and a four-character ASCII string. The number is the length of the next block of data and the string is the block type. Further on in the subroutine he uses even more unpacks:

package Image::Info::PNG;

sub process_file {
        my $signature = my_read($fh, 8);
        die "Bad PNG signature"
        unless $signature eq "\x89PNG\x0d\x0a\x1a\x0a";

        $info->push_info(0, "file_media_type" => "image/png");
        $info->push_info(0, "file_ext" => "png");

        my @chunks;

        while (1) {
                my($len, $type) = unpack("Na4", my_read($fh, 8));

                ...
                }

         ...
         }

Data::Dumper

With almost no effort I can serialize Perl data structures as (mostly) human-readable text. The Data::Dumper module, which comes with Perl, turns its arguments into a textual representation that I can later turn back into the original data. I give its Dumper function a list of references to stringify:

#!/usr/bin/perl
#  data-dumper.pl

use Data::Dumper qw(Dumper);

my %hash = qw(
        Fred    Flintstone
        Barney  Rubble
        );

my @array = qw(Fred Barney Betty Wilma);

print Dumper( \%hash, \@array );

The program outputs text that represents the data structures as Perl code:

$VAR1 = {
                  'Barney' => 'Rubble',
                  'Fred' => 'Flintstone'
                };
$VAR2 = [
                  'Fred',
                  'Barney',
                  'Betty',
                  'Wilma'
                ];

I have to remember to pass it references to hashes or arrays; otherwise, Perl passes Dumper a flattened list of the elements and Dumper won’t be able to preserve the data structures. If I don’t like the variable names, I can specify my own. I give Data::Dumper->new an anonymous array of the references to dump and a second anonymous array of the names to use for them:

#!/usr/bin/perl
# data-dumper-named.pl

use Data::Dumper qw(Dumper);

my %hash = qw(
        Fred    Flintstone
        Barney  Rubble
        );

my @array = qw(Fred Barney Betty Wilma);

my $dd = Data::Dumper->new(
        [ \%hash, \@array ],
        [ qw(hash array) ]
        );

print $dd->Dump;

I can then call the Dump method on the object to get the stringified version. Now my references have the names I gave them:

$hash = {
         'Barney' => 'Rubble',
         'Fred' => 'Flintstone'
        };
$array = [
          'Fred',
          'Barney',
          'Betty',
          'Wilma'
         ];

The stringified version isn’t the same as what I had in the program, though. I had a hash and an array before but now I have references to them. If I prefix my names with an asterisk in my call to Data::Dumper->new, Data::Dumper stringifies the data as a named hash or array instead of a reference:

my $dd = Data::Dumper->new(
        [ \%hash, \@array ],
        [ qw(*hash *array) ]
        );

The stringified version no longer has references:

%hash = (
         'Barney' => 'Rubble',
         'Fred' => 'Flintstone'
        );
@array = (
          'Fred',
          'Barney',
          'Betty',
          'Wilma'
         );

I can then read these stringified data back into the program or even send them to another program. It’s already Perl code, so I can use the string form of eval to run it. I’ve saved the previous output in data-dumped.txt, and now I want to load it into my program. By using eval in its string form, I execute its argument in the same lexical scope. In my program I define %hash and @array as lexical variables but don’t assign anything to them. Those variables get their values through the eval and strict has no reason to complain:

#!/usr/bin/perl
# data-dumper-reload.pl
use strict;

my $data = do {
        if( open my $fh, '<', 'data-dumped.txt' ) { local $/; <$fh> }
        else { undef }
        };

my %hash;
my @array;

eval $data;

print "Fred's last name is $hash{Fred}\n";
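The string eval runs whatever code is in the text, so I only want to do this with data I trust, and I probably want to know when the eval fails. Here’s a sketch that checks for that. It dumps and reloads in the same program so that it stands alone, and the trailing 1 is my own addition so I can tell a failed eval from one that merely returned a false value:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for the contents of data-dumped.txt
my $dumped = do {
    my %source = ( Fred => 'Flintstone', Barney => 'Rubble' );
    my @source = qw(Fred Barney Betty Wilma);

    local $Data::Dumper::Sortkeys = 1;   # stable key order in the dump
    Data::Dumper->new(
        [ \%source, \@source ],
        [ qw(*hash *array) ]
    )->Dump;
};

# The lexicals that the eval-ed code assigns to
my %hash;
my @array;

# eval returns undef on failure, so the trailing 1 marks success
my $ok = eval "$dumped; 1";
die "Could not reload data: $@" unless $ok;

print "Fred's last name is $hash{Fred}\n";
print "The array has ", scalar @array, " elements\n";
```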

Since I dumped the variables to a file, I can also use do. We covered this partially in Intermediate Perl, although in the context of loading subroutines from other files. We advised against it then because either require or use works better for that. In this case, we’re reloading data and the do built-in has some advantages over eval. For this task, do takes a filename and it can search through the directories in @INC to find that file. When it finds it, it updates %INC with the path to the file. This is almost the same as require, but do will reparse the file every time whereas require or use only do that the first time. They both set %INC so they know when they’ve already seen the file and don’t need to do it again. Unlike require or use, do doesn’t mind returning a false value, either. If do can’t read the file, it returns undef and sets $! with the error message. If it reads the file but can’t compile it, it returns undef and sets $@. I modify my previous program to use do:

#!/usr/bin/perl
# data-dumper-reload-do.pl
use strict;

use Data::Dumper;

my $file = "./data-dumped.txt";  # explicit ./ because Perl 5.26 and later leave . out of @INC
print "Before do, \$INC{$file} is [$INC{$file}]\n";

{
no strict 'vars';

do $file;
print "After do, \$INC{$file} is [$INC{$file}]\n";

print "Fred's last name is $hash{Fred}\n";
}

When I use do, I lose out on one important feature of eval. Since eval executes the code in the current scope, it can see the lexical variables that are in scope. The do built-in can’t see them, so it can’t populate lexical variables; the file has to assign to package variables, which is why I turn off strict 'vars' around it.
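One way around that limitation, sketched here under my own assumptions, is to make the file a single expression and use the value that do returns. Data::Dumper’s Terse setting leaves off the variable name, so nothing in the file needs a variable at all. The temporary file stands in for a real data file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

my %hash = ( Fred => 'Flintstone', Barney => 'Rubble' );

# Terse drops the "$VAR1 =" part, leaving a bare expression
my $text = do {
    local $Data::Dumper::Terse = 1;
    Dumper( \%hash );
};

my( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} $text;
close $fh or die "Could not close $path: $!";

# do returns the value of the last expression in the file
my $reloaded = do $path;

unless( defined $reloaded ) {
    die "Could not parse $path: $@" if $@;
    die "Could not read $path: $!";
}

print "Fred's last name is $reloaded->{Fred}\n";
```

Because the result lands in a lexical that I declare myself, this version runs under strict without any exemptions.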

I find the dumping method especially handy when I want to pass around data in email. One program, such as a CGI program, collects the data for me to process later. I could stringify the data into some format and write code to parse that later, but it’s much easier to use Data::Dumper, which can also handle objects. I use my Business::ISBN module to parse a book number, then use Data::Dumper to stringify the object, so I can use the object in another program. I save the dump in isbn-dumped.txt:

#!/usr/bin/perl
# data-dumper-object.pl

use Business::ISBN;
use Data::Dumper;

my $isbn = Business::ISBN->new( '0596102062' );

my $dd = Data::Dumper->new( [ $isbn ], [ qw(isbn) ] );


open my( $fh ), ">", 'isbn-dumped.txt'
        or die "Could not save ISBN: $!";

print $fh $dd->Dump();

When I read the object back into a program, it’s like it’s been there all along since Data::Dumper outputs the data inside a call to bless:

$isbn = bless( {
                'country' => 'English',
                'country_code' => '0',
                'publisher_code' => 596,
                'valid' => 1,
                'checksum' => '2',
                'positions' => [
                                9,
                                4,
                                1
                               ],
                'isbn' => '0596102062',
                'article_code' => '10206'
               }, 'Business::ISBN' );

I don’t need to do anything special to make it an object but I still need to load the appropriate module to be able to call methods on the object. Just because I can bless something into a package doesn’t mean that package exists or has anything in it:

#!/usr/bin/perl
# data-dumper-object-reload.pl

use Business::ISBN;

my $data = do {
        if( open my $fh, '<', 'isbn-dumped.txt' ) { local $/; <$fh> }
        else { undef }
        };

my $isbn;

eval $data;

print "The ISBN is ", $isbn->as_string, "\n";
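A short sketch shows what happens when the class isn’t loaded. The package name No::Such::Class is made up, and nothing defines it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# No::Such::Class is a made-up name; no code defines this package
my $object = bless { isbn => '0596102062' }, 'No::Such::Class';

print "Blessed into ", ref( $object ), "\n";

# The data are all there...
print "The isbn key holds $object->{isbn}\n";

# ...but a method call dies because there is no code to run
my $string = eval { $object->as_string };
my $error  = $@;

print "Method call failed: $error" unless defined $string;
```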

Similar Modules

The Data::Dumper module might not be enough for me all the time and there are several other modules on CPAN that do the same job a bit differently. The concept is the same: turn data into text files and later turn the text file back into data. I can try to dump an anonymous subroutine:

use Data::Dumper;

my $closure = do {
        my $n = 10;

        sub { return $n++ }
        };

print Dumper( $closure );

I don’t get back anything useful, though. Data::Dumper knows it’s a subroutine, but it can’t say what it does:

$VAR1 = sub { "DUMMY" };
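Data::Dumper does have a Deparse setting that asks B::Deparse to reconstruct the source of the subroutine. This is a sketch of that setting rather than a fix: I get the code back, but not the value of $n that the closure captured, and the exact text of the output can vary between versions of perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $closure = do {
    my $n = 10;

    sub { return $n++ }
};

my $dumped = do {
    local $Data::Dumper::Deparse = 1;   # use B::Deparse for code refs
    Dumper( $closure );
};

# The body of the sub appears, but nothing says that $n started at 10
print $dumped;
```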

The Data::Dump::Streamer module can handle these situations to a limited extent although it has a problem with scoping. Since it must serialize the variables to which the code refs refer, those variables come back to life in the same scope as the code reference:

use Data::Dump::Streamer;

my $closure = do {
                my $n = 10;
                
                sub { return $n++ }
                };


print Dump( $closure );

With Data::Dump::Streamer I get the lexical variables and the code for my anonymous subroutine:

my ($n);
$n = 10;
$CODE1 = sub {
                   return $n++;
                 };

Since Data::Dump::Streamer serializes all of the code references in the same scope, all of the variables to which they refer show up in the same scope. There are some ways around that, but they may not always work. Use caution.

If I don’t like the variables Data::Dumper has to create, I might want to use Data::Dump, which simply creates the data:

#!/usr/bin/perl
use Business::ISBN;
use Data::Dump qw(dump);

my $isbn = Business::ISBN->new( '0596102062' );

print dump( $isbn );

The output is almost the same as that from Data::Dumper, although it is missing the $VARn stuff:

bless({
  article_code => 10_206,
  checksum => 2,
  country => "English",
  country_code => 0,
  isbn => "0596102062",
  positions => [9, 4, 1],
  publisher_code => 596,
  valid => 1,
}, "Business::ISBN")

When I eval this, I won’t create any variables. The only way to get back my object is to assign the result of eval to a variable, such as $isbn:

#!/usr/bin/perl
# data-dump-reload.pl

use Business::ISBN;

my $data = do {
        if( open my $fh, '<', 'data-dump.txt' ) { local $/; <$fh> }
        else { undef }
        };

my $isbn = eval $data;

print "The ISBN is ", $isbn->as_string, "\n";

There are several other modules on CPAN that can dump data, so if I don’t like any of these formats I have many other options.

YAML

YAML (YAML Ain’t Markup Language) is the same idea as Data::Dumper, although more concise and easier to read. YAML is becoming more popular in the Perl community and is already used in some module distribution maintenance. The META.yml file produced by various module distribution creation tools is YAML. Somewhat accidentally, the JavaScript Object Notation (JSON) is a valid YAML format. I write to a file that I give the extension .yml:

#!/usr/bin/perl
# yaml-dump.pl

use Business::ISBN;
use YAML qw(Dump);

my %hash = qw(
        Fred    Flintstone
        Barney  Rubble
        );

my @array = qw(Fred Barney Betty Wilma);

my $isbn = Business::ISBN->new( '0596102062' );

open my($fh), ">", 'dump.yml' or die "Could not write to file: $!\n";
print $fh Dump( \%hash, \@array, $isbn );

The output for the data structures is very compact although still readable once I understand its format. To get the data back, I don’t have to go through the shenanigans I experienced with Data::Dumper:

---
Barney: Rubble
Fred: Flintstone
---
- Fred
- Barney
- Betty
- Wilma
--- !perl/Business::ISBN
article_code: 10206
checksum: 2
country: English
country_code: 0
isbn: 0596102062
positions:
  - 9
  - 4
  - 1
publisher_code: 596
valid: 1

The YAML module provides a Load function to do it for me, although the basic concept is the same. I read the data from the file and pass the text to Load:

#!/usr/bin/perl
# yaml-load.pl

use Business::ISBN;
use YAML;

my $data = do {
        if( open my $fh, '<', 'dump.yml' ) { local $/; <$fh> }
        else { undef }
        };

my( $hash, $array, $isbn ) = Load( $data );

print "The ISBN is ", $isbn->as_string, "\n";

YAML’s only disadvantage is that it isn’t part of the standard Perl distribution yet and it relies on several noncore modules as well. As YAML becomes more popular this will probably improve. Some people have already come up with simpler implementations of YAML, including Adam Kennedy’s YAML::Tiny and Audrey Tang’s YAML::Syck.

Storable

The Storable module, which comes with Perl 5.7 and later, is one step up from the human-readable data dumps from the last section. The output it produces might be human-decipherable, but in general it’s not for human eyes. The module is mostly written in C, so part of its output exposes the architecture on which I built perl: the byte order of the data depends on the underlying architecture. On a big-endian machine, my G4 Powerbook for instance, I’ll get different output than on my little-endian MacBook. I’ll get around that in a moment.

The store function serializes the data and puts it in a file. Storable treats problems as exceptions (meaning it tries to die rather than recover), so I wrap the call to its functions in eval and look at the eval error variable $@ to see if something serious went wrong. More minor errors, such as output errors, don’t die but return undef, so I check that too and find the error in $! if it was related to something with the system (for example, it couldn’t open the output file):

#!/usr/bin/perl
# storable-store.pl

use Business::ISBN;
use Storable qw(store);

my $isbn = Business::ISBN->new( '0596102062' );

my $result = eval { store( $isbn, 'isbn-stored.dat' ) };

if( $@ )
        { warn "Serious error from Storable: $@" }
elsif( not defined $result )
        { warn "I/O error from Storable: $!" }

When I want to reload the data I use retrieve. As with store, I wrap my call in eval to catch any errors. I also add another check in my if structure to ensure I got back what I expected, in this case a Business::ISBN object:

#!/usr/bin/perl
# storable-retrieve.pl

use Business::ISBN;
use Storable qw(retrieve);

my $isbn = eval { retrieve( 'isbn-stored.dat' ) };

if( $@ )
        { warn "Serious error from Storable: $@" }
elsif( not defined $isbn )
        { warn "I/O error from Storable: $!" }
elsif( not eval { $isbn->isa( 'Business::ISBN' ) } )
        { warn "Didn't get back Business::ISBN object\n" }

print "I loaded the ISBN ", $isbn->as_string, "\n";

To get around this machine-dependent format, Storable can use network order, which is architecture-independent and is converted to the local order as appropriate. For that, Storable provides the same function names with a prepended “n.” Thus, to store the data in network order, I use nstore. The retrieve function figures it out on its own so there is no nretrieve function. In this example, I also use Storable’s functions to write directly to filehandles instead of a filename. Those functions have fd in their name:

my $result = eval { nstore( $isbn, 'isbn-stored.dat' ) };

open my $fh, ">", $file or die "Could not open $file: $!";
my $result = eval{ nstore_fd $isbn, $fh };

my $result = eval{ nstore_fd $isbn, \*STDOUT  };
my $result = eval{ nstore_fd $isbn, \*SOCKET  };

$isbn = eval { fd_retrieve(\*SOCKET) };
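Here’s a sketch that puts the filehandle versions together in one runnable program. A plain hash stands in for the Business::ISBN object, and a temporary file stands in for the socket, so both of those are my own substitutions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Storable qw(nstore_fd fd_retrieve);

# A plain hash stands in for the Business::ISBN object
my $book = { isbn => '0596102062', title => 'Intermediate Perl' };

my( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;   # the serialized data are binary

my $result = eval { nstore_fd( $book, $out ) };

if( $@ )
        { die "Serious error from Storable: $@" }
elsif( not defined $result )
        { die "I/O error from Storable: $!" }

close $out or die "Could not close $path: $!";

open my $in, '<', $path or die "Could not open $path: $!";
binmode $in;

# fd_retrieve works out the byte order on its own
my $copy = eval { fd_retrieve( $in ) };

die "Serious error from Storable: $@" if $@;
die "Nothing retrieved from $path" unless defined $copy;

print "Retrieved $copy->{title} ($copy->{isbn})\n";
```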

Now that you’ve seen filehandle references as arguments to Storable’s functions, I need to mention that it’s the data from those filehandles that Storable affects, not the filehandles themselves. I can’t use these functions to capture the state of a filehandle or socket that I can magically use later. That just doesn’t work, no matter how many people ask about it on mailing lists.

Freezing Data

The Storable module, which comes with Perl, can also freeze data into a scalar. I don’t have to store it in a file or send it to a filehandle; I can keep it in memory, although serialized. I might store that in a database or do something else with it. To turn it back into a data structure, I use thaw:

#!/usr/bin/perl
# storable-thaw.pl

use Business::ISBN;
use Data::Dumper;
use Storable qw(nfreeze thaw);

my $isbn = Business::ISBN->new( '0596102062' );

my $frozen = eval { nfreeze( $isbn ) };

if( $@ ) { warn "Serious error from Storable: $@" }

my $other_isbn = thaw( $frozen );

print "The ISBN is ", $other_isbn->as_string, "\n";

This has an interesting use. Once I serialize the data it’s completely disconnected from the variables in which I was storing it. All of the data are copied and represented in the serialization. When I thaw it, the data come back into a completely new data structure that knows nothing about the previous data structure.

Before I show that, I’ll show a shallow copy, in which I copy the top level of the data structure, but the lower levels are the same references. This is a common error in copying data. I think they are distinct copies only later to discover that a change to the copy also changes the original.

I’ll start with an anonymous array that comprises two other anonymous arrays. I want to look at the second value in the second anonymous array, which starts as Y. I look at that value in the original and the copy before and after I make a change in the copy. I make the shallow copy by dereferencing $AoA and using its elements in a new anonymous array. Again, this is the naive approach, but I’ve seen it quite a bit and probably even did it myself a couple or fifty times:

#!/usr/bin/perl
# shallow-copy.pl

my $AoA = [
        [ qw( a b ) ],
        [ qw( X Y ) ],
        ];

# make the shallow copy
my $shallow_copy = [ @$AoA ];

# Check the state of the world before changes
show_arrays( $AoA, $shallow_copy );

# Now, change the shallow_copy
$shallow_copy->[1][1] = "Foo";

# Check the state of the world after changes
show_arrays( $AoA, $shallow_copy );

print "\nOriginal: $AoA->[1]\nCopy: $shallow_copy->[1]\n";

sub show_arrays {
        foreach my $ref ( @_ ) {
                print "Element [1,1] is $ref->[1][1]\n";
                }
        }

When I run the program, I see from the output that the change to $shallow_copy also changes $AoA. When I print the stringified version of the reference for the corresponding elements in each array, I see that they are actually references to the same data:

Element [1,1] is Y
Element [1,1] is Y
Element [1,1] is Foo
Element [1,1] is Foo

Original: ARRAY(0x18006c4)
Copy: ARRAY(0x18006c4)

To get around the shallow copy problem I can make a deep copy by freezing and immediately thawing, and I don’t have to do any work to figure out the data structure. Once the data are frozen, they no longer have any connection to the source. I use nfreeze to get the data in network order just in case I want to send it to another machine:

use Storable qw(nfreeze thaw);

my $deep_copy = thaw( nfreeze( $isbn ) );

This is so useful that Storable provides the dclone function to do it in one step:

use Storable qw(dclone);

my $deep_copy = dclone $isbn;
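To see the difference from the shallow copy, here’s a sketch that repeats the earlier experiment on the same array of arrays, this time with dclone:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(dclone);

my $AoA = [
        [ qw( a b ) ],
        [ qw( X Y ) ],
        ];

# make the deep copy
my $deep_copy = dclone( $AoA );

# Now, change only the copy
$deep_copy->[1][1] = "Foo";

print "Original [1,1] is $AoA->[1][1]\n";
print "Copy     [1,1] is $deep_copy->[1][1]\n";

# References compare as numbers, so == tells me if they share data
print "Inner arrays are ",
        ( $AoA->[1] == $deep_copy->[1] ? "shared" : "separate" ), "\n";
```

This time the original keeps its Y, and the inner arrays are different references.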

Storable is much more interesting and useful than I’ve shown for this section. It can also handle file locking and has hooks to integrate it with classes so I can use its features for my objects. See the Storable documentation for more details.

The Clone::Any module by Matthew Simon Cavalletto provides the same functionality through a facade to several different modules that can make deep copies. With Clone::Any’s unifying interface, I don’t have to worry about which module I actually use or which one is installed on a remote system (as long as one of them is):

use Clone::Any qw(clone);

my $deep_copy = clone( $isbn );

DBM Files

The next step after Storable is tiny, lightweight databases. These don’t require a database server but still handle most of the work to make the data available in my program. There are several facilities for this, but I’m only going to cover a couple of them. The concept is the same even if the interfaces and fine details are different.

dbmopen

Since at least Perl 3, I’ve been able to connect to DBM files, which are hashes stored on disk. In the early days of Perl, when the language and practice were much more Unix-centric, DBM access was important since many system databases used that format. The DBM was a simple hash where I could specify a key and a value. I use dbmopen to connect a hash to the disk file, then use it like a normal hash. dbmclose ensures that all of my changes make it to the disk:

#!/usr/bin/perl
# dbmopen.pl

dbmopen %HASH, "dbm-open", 0644 or die "Could not open DBM file: $!";

$HASH{'0596102062'} = 'Intermediate Perl';

while( my( $key, $value ) = each %HASH ) {
        print "$key: $value\n";
        }

dbmclose %HASH;

In modern Perl the situation is much more complicated. The DBM format branched off into several competing formats, each of which had its own strengths and peculiarities. Some could only store values shorter than a certain length, or only store a certain number of keys, and so on.

Depending on the compilation options of the local perl binary, I might be using any of these implementations. That means that although I can safely use dbmopen on the same machine, I might have trouble sharing it between machines since the next machine might have used a different DBM library.
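If I do need one of these formats, I can name the implementation myself by tying the hash to a particular DBM class instead of letting dbmopen choose. This sketch assumes SDBM_File, which ships with the perl source and so is available on most builds; the temporary directory and the file name books are my own choices:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Spec;
use File::Temp qw(tempdir);
use SDBM_File;

my $dir  = tempdir( CLEANUP => 1 );
my $base = File::Spec->catfile( $dir, 'books' );

# Naming SDBM_File picks the DBM library explicitly
tie my %books, 'SDBM_File', $base, O_RDWR | O_CREAT, 0644
    or die "Could not tie $base: $!";

$books{'0596102062'} = 'Intermediate Perl';

untie %books;   # like dbmclose, this flushes and closes the files

# Tie a second hash to the same files to show the data persisted
tie my %again, 'SDBM_File', $base, O_RDWR, 0644
    or die "Could not reopen $base: $!";

my $title = $again{'0596102062'};
print "0596102062: $title\n";

untie %again;
```

That settles which library reads the file, although SDBM has its own limits on how much data a single key and value can hold.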

None of this really matters because CPAN has something much better.

DBM::Deep

Much more popular today is DBM::Deep, which I use anywhere that I would have previously used one of the other DBM formats. With this module, I can create arbitrarily deep, multilevel hashes or arrays. The module is pure Perl so I don’t have to worry about different library implementations, underlying details, and so on. As long as I have Perl, I have everything I need. It works without worry on a Mac, Windows, or Unix, any of which can share DBM::Deep files with any of the others. And best of all, it’s pure Perl.

Joe Huckaby created DBM::Deep with both an object-oriented interface and a tie interface (see Chapter 17). The documentation recommends the object interface, so I’ll stick to that here. With a single argument, the constructor uses it as a filename, creating the file if it does not already exist:

use DBM::Deep;

my $isbns = DBM::Deep->new( "isbns.db" );
if( $isbns->error ) {
        warn "Could not create database: " . $isbns->error . "\n";
        }

$isbns->{'0596102062'} = 'Intermediate Perl';

Once I have the DBM::Deep object, I can treat it just like a hash reference and use all of the hash operators.

Additionally, I can call methods on the object to do the same thing. I can even set additional features, such as file locking and flushing when I create the object:

#!/usr/bin/perl

use DBM::Deep;

my $isbns = DBM::Deep->new(
        file      => "isbn.db",
        locking   => 1,
        autoflush => 1,
        );
if( $isbns->error ) {
        warn "Could not create database: " . $isbns->error . "\n";
        }

$isbns->put( '0596102062', 'Intermediate Perl' );

my $value = $isbns->get( '0596102062' );

The module also handles objects based on arrays, which have their own set of methods. It has hooks into its inner mechanisms so I can define how it does its work.

By the time you read this book, DBM::Deep should already have transaction support thanks to the work of Rob Kinyon, its current maintainer. I can create my object and then use the begin_work method to start a transaction. Once I do that, nothing happens to the data until I call commit, which writes all of my changes to the data. If something goes wrong, I just call rollback to get to where I was when I started:

my $db = DBM::Deep->new( 'file.db' );

eval {
        $db->begin_work;

        ...

        die "Something didn't work" if $error;

        $db->commit;
        };

if( $@ )
        {
        $db->rollback;
        }

Summary

By stringifying Perl data I have a lightweight way to pass data between invocations of a program and even between different programs. Slightly more complicated are binary formats, although Perl comes with the modules to handle that too. No matter which one I choose, I have some options before I decide that I have to move up to a full database server.

Mastering Perl by brian d foy