Chapter 16. Working with Bits

Perl is a high-level language, so I don’t have to play with bits and bytes to get my job done. The trade-off, however, is that I have to let Perl manage how it stores everything. What if I want to control that? And what about the rest of the world that packs a lot of information into single bytes, such as Unix file permissions? Or what if my array of tens of thousands of numbers takes up too much memory? Falling back to working with the bits can help that.

Binary Numbers

Almost all of us deal with binary computers, even to the point that it seems redundant to say “binary.” When it gets down to the lowest levels, these deal with on or off, or what we’ve come to call 1 or 0. String enough of those 1s and 0s together, and I have the instructions to tell a computer to do something or the physical representation of data on a disk. And, although most of us don’t have to deal with computers at this level, some of this thinking has reached into high-level programming because we have to deal with lower levels at some point.

Consider, for instance, the arguments that I use with mkdir, chmod, or dbmopen to set the file mode (also known as permissions, but actually more than just that). Although I write the mode as a single base-8 number, its meaning depends on its particular bit pattern:

mkdir $dir, 0755;
chmod 0644, @files;
dbmopen %HASH, $db_file, 0644;

I also get the file mode as one of the return values from stat:

my $file_mode = ( stat( $file ) )[2];

On Unix and Unix-like systems, that file mode packs in a lot of information, including the file permissions for the owner, group, and others, as well as bits for setuid, setgid, and some other stuff. Once I have it, I need to pick it apart. Perl has all of the necessary operators to do this, but I’ll get to those later.

Writing in Binary

In some cases I might want to use a sequence of bits to represent a series of values. By giving every bit (or group of bits) meaning, I can use a single scalar to store multiple values while only incurring the scalar variable memory overhead once. Since computers can be very fast at bit operations, my operations on strings of bits won’t be that slow, although the rest of the programming around this technique may slow things down. In Chapter 17, I’ll use a bit string to store a DNA strand. While the memory requirements of my program drop dramatically, I don’t see impressive speeds. Oh well; I’m always trading one benefit for another.

Since Perl 5.6, I can specify values directly in binary using the 0b notation. We partially covered this in Chapter 2 of Learning Perl:

my $value = 0b1;    # same as 1, 0x01,  01
my $value = 0b10;   #         2, 0x02,  02
my $value = 0b1000; #         8, 0x08, 010

I can use embedded underscores to make long binary values easier for me to read; Perl simply ignores them. A byte is a sequence of eight bits, and a nybble^[54]is half of a byte:

my $value = 0b1010_0101              # by nybbles;
my $value = 0b11110000_00001111      # by bytes
my $value = 0b1111_0000___0000_1111  # by bytes and nybbles

Currently (and without hope for future inclusion), Perl does not have a bin built-in that acts like oct or hex to interpret a number in a particular base. I could write my own, and initially I did before Randal reminded me that the oct built-in handles binary, octal, and hexidecimal conversions:

my $number = oct( "0b110" ); # 6

Of course, once I assign the value to the variable, Perl just thinks of it as a number, which doesn’t have an inherent representation, although Perl shows me values in the decimal representation unless I specifically tell it to do something else. I can output values in binary format with the %b format specifier to printf or sprintf. In this example, I preface the value with the literal sequence 0b just to remind myself that I formatted the value in binary. All the ones and zeros give me a hint, but other bases use those digits, too:

#!/usr/bin/perl

my $value = 0b0011;

printf "The value is 0b%b\n", $value;

In that example I had to prefix the value with 0b myself. I can use a different sprintf sequence to get it for free. By using a hash symbol, #, after the % that starts the placeholder, Perl prefixes the number with a string to indicate the base:^[55]

my $number_string = printf '%#b', 12;   # prints "0b1100"

I can get more fancy by specifying a width for the format. To always get 32 places in my binary representation, I put that width before the format specifier. I also add a leading 0 so that Perl fills the empty columns with zeros. The literal 0b adds two characters to the value, so the total column width is 34:

printf "The value is 0b%034b\n", $value;

printf "The value is %#034b\n", $value;

I use this sort of code quite a bit since I often need to convert between bases, so I have some Perl one-liners to help me. I alias the following one-liners to commands I can use in the bash shell (your shell might do it differently). The d2h converts from decimal to hexadecimal, the o2b converts from octal to binary, and so on. These tiny scripts might come in handy as you go through this chapter:

# for bash. your shell is probably different.

alias d2h="perl -e 'printf qq|%X\n|, int( shift )'"
alias d2o="perl -e 'printf qq|%o\n|, int( shift )'"
alias d2b="perl -e 'printf qq|%b\n|, int( shift )'"

alias h2d="perl -e 'printf qq|%d\n|, hex( shift )'"
alias h2o="perl -e 'printf qq|%o\n|, hex( shift )'"
alias h2b="perl -e 'printf qq|%b\n|, hex( shift )'"

alias o2h="perl -e 'printf qq|%X\n|, oct( shift )'"
alias o2d="perl -e 'printf qq|%d\n|, oct( shift )'"
alias o2b="perl -e 'printf qq|%b\n|, oct( shift )'"

Bit Operators

Perl’s binary operators do the same things they do in C and for the most part act like they do in the particular version of the C library with which your Perl was compiled. Whenever I need to work in binary and look something up I usually reach for my C book,^[56]but mostly because that’s where I first learned it.

Unary NOT, ~

The unary NOT operator (sometimes called the complement operator), ~, returns the bitwise negation, or 1’s complement, of the value, based on integer size of the architecture.^[57]This means it doesn’t care what the sign of the numeric value is; it just flips all the bits:

my $value      = 0b1111_1111;
my $complement = ~ $value;
printf "Complement of\n\t%b\nis\n\t%b\n", $value, $complement;

I see that even though I gave it an 8-bit value, it comes back as a 32-bit value (because my MacBook has 32-bit integers):

Complement of
        11111111
is
        11111111111111111111111100000000

That’s not very nice output. I’d like the values to line up properly. To do that, I need the integer size. That’s easy enough to get from the Perl configuration, though (see Chapter 11). The integer size is in bytes, so I multiply by eight the value I get from Perl’s configuration:

#!/usr/bin/perl
# complement.pl

use Config;
my $int_size   = $Config{intsize} * 8;

print "Int size is $int_size\n";

my $value      = 0b1111_1111;
my $complement = ~ $value;
printf "Complement of\n\t%${int_size}b\nis\n\t%${int_size}b\n",
        $value, $complement;

Now my values line up properly, although I’d like it even more if I could see the leading zeros. You can figure that one out on your own:

Int size is 32
Complement of
                                11111111
is
        11111111111111111111111100000000

I also have to be careful how I use the result I get from a unary NOT. Depending on how I use it, I can get back different values. In the next example I put the bitwise NOT value in $negated. When I print $negated with printf, I see that I flipped all the bits, and that the negative value is one greater in magnitude than the positive one. That’s two’s complement thinking, and I won’t go into that here. However, when I print the number with a plain ol’ print, Perl treats it as an unsigned value, so that bit flipping doesn’t do anything to the sign for the numbers that started positive, and it makes negative numbers positive:

#!/usr/bin/perl
# unary-not.pl

foreach my $value ( 255, 128, 5, 65534  )
        {
        my $negated = ~ $value;

        printf "  value is %#034b  %d\n",   $value,   $value;

        printf "~ value is %#034b  %d\n", $negated, $negated;

        print  "  value is ", $negated, "\n\n";
        }

This gives me output that can be confusing to those who don’t know what’s happening (which means that I shouldn’t use this liberally if I want the next programmer to be able to figure out what’s going on):

  value is 0b00000000000000000000000011111111  255
~ value is 0b11111111111111111111111100000000  -256
  value is 4294967040

  value is 0b00000000000000000000000010000000  128
~ value is 0b11111111111111111111111101111111  -129
  value is 4294967167

  value is 0b00000000000000000000000000000101  5
~ value is 0b11111111111111111111111111111010  -6
  value is 4294967290

  value is 0b00000000000000001111111111111110  65534
~ value is 0b11111111111111110000000000000001  -65535
  value is 4294901761

The ~ operator also lives near the top of the precedence chart, so it’s going to do its work before most other operators have a chance to do theirs. Be careful with that.

Bitwise AND, &

What if I don’t want all of those bits in my previous examples? I’m stuck with Perl’s integer size, but I can use a bit mask to get rid of the excess, and that brings me to the next operator, bitwise AND, &.

The bitwise AND operator returns the bits set in both first and second arguments. If either value has a 0 in that position, the result has a zero in that position, too. Or, the result has a 1 in the same position only where both arguments have a 1. Usually the second argument is called a mask since its 0s hide those positions in the first argument:

  1010       value
& 1101       mask
------
  1000

This lets me select the parts of a bit field that interest me. In the previous section, I used the ~ to take the complement of an 8-bit value but got back a 32-bit value. If I wanted only the last eight bits, I could use & with a value that has the bits set in only the lowest byte:

my $eight_bits_only = $complement & 0b1111_1111;

I can do this with the hexadecimal representation to make it easier to read. The value 0xFF represents a byte with all bits set, so I can use that as the mask to hide everything but the lowest byte:

my $eight_bits_only = $complement & 0xFF;

This is also useful to select just the bits I need from a number. For instance, the Unix file mode that I get back from stat contains the owner, group, and other permissions encoded into two bytes. Each of the permissions gets a nybble, and the high nybble has various other information. To get the permissions, I just have to know (and use) the right bit masks. In this case, I specify them in octal, which corresponds to the representation I use for chmod and mkdir (either in Perl or on the command line):

my $mode = ( stat($file) )[2];

my $is_group_readable   = $mode & 040;
my $is_group_writable   = $mode & 020;
my $is_group_executable = $mode & 010;

I don’t like all of those magic number bit masks, though, so I can make them into constants (again, see Chapter 11):

use constant GROUP_READABLE   => 040;
use constant GROUP_WRITABLE   => 020;
use constant GROUP_EXECUTABLE => 010;

my $mode = ( stat($file) )[2];

my $is_group_readable   = $mode & GROUP_READABLE;
my $is_group_writable   = $mode & GROUP_WRITABLE;
my $is_group_executable = $mode & GROUP_EXECUTABLE;

I don’t even have to do that much work, though, because these already have well-known constants in the POSIX module. The fcntl_h export tag gives me the POSIX constants for file permission masks. Can you tell which one does what just by looking at them?

#!/usr/bin/perl
# posix-mode-constants.pl

use POSIX qw(:fcntl_h);

# S_IRGRP S_IROTH S_IRUSR
# S_IWGRP S_IWOTH S_IWUSR
# S_IXGRP S_IXOTH S_IXUSR
# S_IRWXG S_IRWXO S_IRWXU
# S_ISGID S_ISUID

my $mode = ( stat( $ARGV[0] ) )[2];

print "Group readable\n"   if $mode & S_IRGRP;
print "Group writable\n"   if $mode & S_IWGRP;
print "Group executable\n" if $mode & S_IXGRP;

Binary OR, |

The bitwise OR operator, |, returns the bits set in either (or both) operand. If a position in either argument has the bit set, the result has that bit set.

  1010
| 1110
------
  1110

I often use these to combine values, and you may have already been using them with operators such as sysopen and flock. Those built-ins take an argument that constrains (or allows) their action, and I build up those values by OR-ing values. Each of the values specifies a setting. The result is the combination of all of the settings.

The third argument to sysopen is its mode. If I knew the bit values for the mode settings, I could use them directly, but they might vary from system to system. I use the values from Fcntl instead. I used this in Chapter 3 to limit what my file open can do:

#!/usr/bin/perl -T

use Fcntl (:DEFAULT);

my( $file ) = $ARGV[0] =~ m/([A-Z0-9_.-]+)/gi;

sysopen( my( $fh ), $file, O_APPEND | O_CREAT )
        or die "Could not open file: $!\n";

For file locking, I OR the settings I want to get the right effect. The Fcntl module supplies the values as constants. In this example, I open a file in read/write mode and immediately try to get a lock on the file. I pass the combination on exclusive lock, LOCK_EX, and nonblocking lock, LOCK_NB, so if I can’t get the lock right away it dies. By OR-ing those constants, I form the right bit pattern to send to flock:

use Fcntl qw(:flock);

open my($fh), '<+', $file       or die "Connot open: $!";
flock( $fh, LOCK_EX | LOCK_NB ) or die "Cannot lock: $!";

...;

close $fh; # don't unlock, just close!

Without the LOCK_NB, my program would sit at the flock line waiting to get the lock. Although I simply exited the program in this example, I might want to sleep for a bit and try again, or do something else until I can get the lock.

Exclusive OR, ^

The bitwise XOR operator, ^, returns the bits set in either, but not both, operands. That’s the part that makes it exclusive. If a position in either argument has the bit set, the result has the bit set, but only if the same position in the other argument doesn’t have the bit set. That is, that bit can only be set in one of the arguments for it to be set in the result:

  1010
^ 1110
------
  0100

The bitwise operators also work on strings, although I’m not sure why anyone would ever want to do that outside of an Obfuscated Perl Contest. I’ll show one interesting example, good for quiz shows and contests, but leave the rest up to you. It’s all in perlop.

So, knowing that, what’s the difference between “perl” and “Perl”?

$ perl -e 'printf "[%s]\n", ("perl" ^ "Perl")'
[ ]

Okay, that’s a bit hard to see so I’ll use ord to translate that into its ASCII value:

$ perl -e 'printf "[%d]\n", ord("perl" ^ "Perl")'
[32]

It’s the space character! The ^ masks all of the positions where the bits are set in both strings, and only the first character is different. It turns out that they differ in exactly one bit.

I want to see the bit patterns that led to this. The ord built-in returns the numeric value that I format with %b:

$ perl -e 'printf "[%#10b]\n", ord("perl" ^ "Perl")'
[0b00100000]

How do I get that value? First, I get the values for the upper- and lowercase versions of the letter P:

$ perl -e 'printf "[%#10b]\n", ord( shift )' P
[0b01010000]
$ perl -e 'printf "[%#10b]\n", ord( shift )' p
[0b01110000]

When I XOR those, I get the bits that are set in only one of the characters. The lowercase characters in ASCII have the same bit values except for the bit 5, putting all the lowercase characters 32 positions above the uppercase ones:

  0101_0000
^ 0111_0000
----------
  0010_0000

This leads to the perlfaq1 answer that there is only one bit of difference between “perl” and “Perl”.^[58]

Left << and right >> shift operators

The bit-shift operators move the entire bit field either to the left, using <<, or to the right, using >>, and fill in the vacancies with zeros. The arrows point in the direction I’m shifting, and the most significant bit (the one that represents the greatest value) is on the left:

my $high_bit_set = 1 << 8;     # 0b1000_0000

my $second_byte  = 0xFF << 8;  # 0x00_00_FF_00

The bit-shift operators do not wrap values to the other end, although I could write my own subroutine to do that. I’ll just have to remember the parts I’m about to push off the end and add them to the other side. The length of my values depends on my particular architecture, just as I discussed earlier.

I use the bit-shift operator with the return value from system, which is two bytes (or whatever the libc version of wait returns). The low byte has signal and core information, but it’s the high byte that I actually want if I need to see the exit value of the external command. I simply shift everything to the right eight positions. I don’t need to mask the value since the low byte disappears during the shift:

my $rc = system( 'echo', 'Just another perl hacker, ' );
my $exit_status = $rc >> 8;

I don’t need to save the return value from system because Perl puts it in the special variable $?:

system( 'echo', 'Just another perl hacker, ' );
my $exit_status = $? >> 8;

I can also inspect $? to see what went wrong in case of an error. I have to know the proper masks:

my $signal_id   = $? & 0b01111111; # or 0177, 127, 0x7F
my $dumped_core = $? & 0b10000000; # or 0200, 128, 0x80

Bit Vectors

Bit vectors can save memory by using a single scalar to hold many values. I can use a long string of bits to store the values instead of using an array of scalars. Even the empty scalar takes up some memory; I have to pay for that scalar overhead with every scalar I create. Using Devel::Size, I can look at the size of a scalar:

#!/usr/bin/perl
# devel-size.pl

use Devel::Size qw(size);

my $scalar;

print "Size of scalar is " .
  size( $scalar ) . " bytes\n";

On my MacBook running Perl 5.8.8, this scalar takes up 12 bytes, and it doesn’t even have a value yet!

Size of scalar is 12 bytes.

I could use Devel::Peek to see some of this:

#!/usr/bin/perl
# devel-peek.pl

use Devel::Peek;

my $scalar;

print Dump( $scalar );

The output shows me that Perl has already set up some infrastructure to handle the scalar value:

SV = NULL(0x0) at 0x1807058
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY)

Even with nothing in it, the scalar variable has a reference count and the scalar flags. Now, imagine an array of several hundred or thousand scalar values, each with their own scalar overhead. That’s a lot of memory before I even think about the values.

I don’t need to use Perl’s arrays to store my data. If I have enough data and another way to store it and then access it, I can save a lot of memory by avoiding the Perl variable overhead.

The easiest thing I can do is use a long string where each character (or other number of characters) represents an element. I’ll pretend that I’m working with DNA (the biological sort, although you should probably use BioPerl for this sort of thing), and I’ll use the letters T, A, C, and G to represent the base pairs that make up the DNA strand (I do this in Chapter 17 where I talk about tied variables). Instead of storing the sequence as an array of scalars each holding one character (or even objects representing that base), I store them as sequential characters in a single string where I only get the scalar overhead once:

my $strand = 'TGACTTTAGCATGACAGATACAGGTACA';

I can then access the string with substr(), which I give a starting position and a length:

my $codon = substr( $strand, 3, 3 );

I can even change values since I can use substr() as an lvalue:

substr( $strand, 2, 3 ) = 'GAC';

Of course, I can hide these operations behind functions, or I can even make an object out of the string and call methods on it to get or change the parts I want.

One step up the sophistication ladder is pack() (see Chapter 14), which does much of the same thing but with much more flexibility. I can shove several different types into a string and pull them out again. I’ll skip the example and refer you to the Tie::Array::PackedC module, which stores a series of integers (or doubles) as a packed string instead of their numerical and possibly string values in separate scalar variables.

A bit vector does the same thing as the single string or the packed string. In one scalar value, it stores several values. Just like in my DNA example, or the stuff that pack() does, it’s up to me how I partition that bit vector and then represent the values.

The vec Function

The built-in vec() function treats a string as a bit vector. It divides the string into elements according to the bit size I specify, although that number has to be a power of two. It works in the same sense that substr() works on a string by pulling out part of it, although vec only works with one “element” at a time.

I can use any string that I like. In this example I use 8 for the bit size, which corresponds to (single-byte) characters:

#!/usr/bin/perl
# vec-string.pl

my $extract = vec "Just another Perl hacker,", 3, 8;

printf "I extracted %s, which is the character '%s'\n",
        $extract,
        chr($extract);

From the output, I see that $extract is the number, and I need to use chr to turn it back into its character representation:

I extracted 116, which is the character 't'

I can also start from scratch to build up the string. The vec function is an lvalue so I can assign to it. As with other things in Perl, the first element has the index 0. Since vec is dealing with bit fields, to replace the lowercase p in the string with its uppercase version, I need to use ord to get the numeric version I’ll assign to vec:

my $bit_field = "Just another perl hacker,";
vec( $bit_field, 13, 8 ) = ord('P');

print "$bit_field\n"; # "Just another Perl hacker,"

I showed earlier that there is only one bit of difference between “perl” and “Perl.” I don’t need to change the entire character; I could just assign to the right bit:^[59]

my $bit_field = "Just another perl hacker,";
vec( $bit_field, 109, 1 ) = 0;

print "$bit_field\n"; # "Just another Perl hacker,"

When using vec on a string, Perl treats it as a byte string, tossing away any other encoding that the string may have had. That is, vec can operate on any string, but it turns it into a byte string. That’s a good reason not use vec to play with strings that I want to use as strings:

#!/usr/bin/perl
# vec-drops-encoding.pl

use Devel::Peek;

# set the UTF-8 flag by including unicode sequence
my $string = "Has a unicode smiley --> \x{263a}\n";
Dump( $string );

# keeps the UTF-8 flag on access
print STDERR "-" x 50, "\n";
my $first_char = vec( $string, 0, 8 );
Dump( $string );

# loses the UTF-8 flag on assignment
print STDERR "-" x 50, "\n";
vec( $string, 0, 8 ) = ord('W');
Dump( $string );

The progression of the Devel::Peek output shows that I can create a string with the UTF8 flag. As raw bytes, I get the three bytes \342\230\272 but Perl knows that is a Unicode code point because of the encoding:

SV = PV(0x1801460) at 0x1800fb8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x401b10 "Has a unicode smiley --> \342\230\272\n"\0 
        [UTF8 "Has a unicode smiley --> \x{263a}\n"]
  CUR = 29
  LEN = 32

I can use vec to extract part of the string without affecting the UTF8 flag. Simply accessing the string through vec does set some magic on the variable, but it’s still UTF8:

--------------------------------------------------
SV = PVMG(0x180aca0) at 0x1800fb8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x401b10 "Has a unicode smiley --> \342\230\272\n"\0 
        [UTF8 "Has a unicode smiley --> \x{263a}\n"]
  CUR = 29
  LEN = 32
  MAGIC = 0x4059d0
        MG_VIRTUAL = &PL_vtbl_utf8
        MG_TYPE = PERL_MAGIC_utf8(w)
        MG_LEN = 27

Finally, once I change the string through vec, Perl treats it as a simple series of bytes. When I change the initial H to a W, Perl forgets all about the encoding. It’s up to me to provide the context and meaning of the bits once I use it as a bit vector. If I want to keep the string value, I should do something else:

--------------------------------------------------
SV = PVMG(0x180aca0) at 0x1800fb8
  REFCNT = 2
  FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x401b10 "Was a unicode smiley --> \342\230\272\n"\0
  CUR = 29
  LEN = 32
  MAGIC = 0x4059d0
        MG_VIRTUAL = &PL_vtbl_utf8
        MG_TYPE = PERL_MAGIC_utf8(w)
        MG_LEN = -1

Bit String Storage

The actual storage gets a bit tricky, so making a change and then inspecting the scalar I use to store everything, it may seem like the wrong thing is happening. Perl actually stores the bit vector as a string, so on inspection, I most likely see a lot of nonsense:

#!/usr/bin/perl
# vec-wacky-order.pl

{
my @chars = qw( a b c d 1 2 3 );

my $string = '';

for( my $i = 0; $i < @chars; $i++ )
        {
        vec( $string, $i, 8 ) = ord( $chars[$i] );
        }

print "\@chars string is ---> [$string]\n";
}

#------

{
my @nums = qw( 9 2 3 12 15 );

my $string = '';

for( my $i = 0; $i < @nums; $i++ )
        {
        vec( $string, $i, 4 ) = 0 + $nums[$i];
        }

print "\@nums string is ---> [$string]\n";

my $bit_string =  unpack( 'B*', $string );

$bit_string =~ s/(....)(?=.)/${1}_/g;

print "\$bit_string is ---> [ $bit_string ]\n";
}

With eight bits per element, Perl uses one byte for each element. That’s pretty easy to understand, and nothing tricky happens. The first element in the bit vector is the first byte, the second is the second byte, and so on. The first part of my program creates a string I can recognize, and I see the characters in the order I added them:

@chars string is ---> [abcd123]

The second part of the program is different. I set the bit size to 4 and add several numbers to it. As a string it doesn’t look anything like its elements, but when I look at the bit pattern I can make out my four-bit numbers, although not in the order I added them, and with an apparent extra one:

@nums string is ---> [)√]
$bit_string is ---> [ 0010_1001_1100_0011_0000_1111 ]
                        2    9   12   3     0   15

If I use one, two, or four bits per element, Perl still treats the string as bytes, but then orders the bits in a little-endian fashion. I’ll use the alphabet to illustrate the sequence again, this time for two bytes each. The proper order of the elements is A, B, C, D, but vec starts with the lower part of each byte, which is to the right, and fills the byte up towards the left before moving to the next byte:

4 bits:       B       A

2 bits:   D   C   B   A

1 bit:  H G F E D C B A

I wrote a little program to illustrate the ordering of the elements. For each of the bit lengths, I get the index of the last element (counting from zero) as well as the bit pattern of all the bits on for that bit length by using the oct function (although I have to remember to tack on the “0b” to the front). When I run this program, I’ll see a line that shows the bit field and a line right under it to show the actual storage:

#!/usr/bin/perl
# vec-4bits.pl

foreach my $bit_length ( qw( 4 2 1 ) )
        {
        print "Bit length is $bit_length\n";
        my $last    = (16 / $bit_length) - 1;
        my $on_bits = oct( "0b" . "1" x $bit_length );

        foreach my $index ( 0 .. $last )
                {
                my $string = "\000\000";

                vec( $string, $index, $bit_length ) = $on_bits;

                printf "%2d: ", $index;
                print show_string( $string ), "\n    ", show_ord( $string ), "\n";
                }
        print "\n";
        }

sub show_string
        {
        unpack( "b*", $_[0] );
        }

sub show_ord
        {
        my $result = '';

        foreach my $byte ( split //, $_[0] )
                {
                $result .= sprintf "%08b", ord($byte);
                }

        $result;
        }

If I really need to see the bit vector in ones and zeros, I can use unpack with the b format. This orders the bits in the way I would naturally expect, instead of the tricky order I showed when using a bit length less than 8 with vec:

$bit_string = unpack( "b*" , $bit_vector);

I really don’t need to worry about this, though, as long as I use vec to both access and store the values and use the same number of bits each time.

Storing DNA

In my earlier DNA example, I had four things to store ( T, A, C, G ). Instead of using a whole character (eight bits) to store each one of those as I did previously, I can use just two bits. In this example, I turn a 12-character string into a bit vector that is only 3 bytes long:

my %bit_codes = (
        T => 0b00,
        A => 0b11,
        C => 0b10,
        G => 0b01,
        );

# add the reverse mapping too
@bit_codes{values %bit_codes} = keys %bit_codes;

use constant WIDTH => 2;

my $bits = '';
my @bases = split //, 'CCGGAGAGATTA';

foreach my $i ( 0 .. $#bases ) {
        vec( $bits, $i, WIDTH ) = $bit_codes{ $bases[$i] };
        }

print "Length of string is " . length( $bits ) . "\n";

That’s my bit vector of 12 elements, and now I want to pull out the third element. I give vec() three arguments: the bit vector, the number of the element, and the width in bits of each element. I use the value that vec() returns to look up the base symbol in the hash, which maps both ways:

my $base = vec $bits, 2, WIDTH;
printf "The third element is %s\n", $bit_codes{ $base };

I could get more fancy by using four bits per element and using each bit to represent a base. That might seem like a waste of the other three bits, which should be turned off if I know the base already, but sometimes I don’t know the base. I might, for instance, only know that it’s not A, so it might be any of the others. Bioinformaticists have other letters to represent these cases (in this case, B, meaning “not A”), but I don’t need that right now.

Keeping Track of Things

In “Generating Sudoku” in The Perl Review, Eric Maki uses bit vectors to represent possible solution states to a Sudoku puzzle. He represents each puzzle row with nine bits, one for each square, and turns on a bit when that square has a value. A row might look like:

0 0 0 1 0 1 1 0 0

For each of the 9 rows in the puzzle, he adds another 9 bits, ending up with a bit string 81 bits long for all of the squares. His solution is a bit more complicated than that, but I’m just interested in the bit operations right now.

It’s very easy for him to check a candidate solution. Once any square has a value, he can eliminate all of the other solutions that also have a value in that square. He doesn’t have to do a lot of work to do that, though, because he just uses bit operations.

He knows which solutions to eliminate since a bitwise AND of the candidate row and the current solution have at least one bit in common. The pivot row is the one from the current solution that he compares to the same row in other candidate solutions. In this example, the rows have a bit in common. The result is a true value, and as before, I don’t need to do any shifting because I only need to know that the result is true, so the actual value is unimportant to me. Let me get to that in a minute:

  0 0 1 0 0 0 1 0 0 # candidate row
& 0 0 0 1 0 1 1 0 0 # pivot row
--------------------
  0 0 0 0 0 0 1 0 0 # bit set, eliminate row

In another case, the candidate row has no bits in common with the same row from the current solution, so an AND gives back all zeros:

  0 1 0 0 1 0 0 0 1 # still a candidate row
& 0 0 0 1 0 1 1 0 0 # pivot row
--------------------
  0 0 0 0 0 0 0 0 0 # false, still okay

I have to be careful here! Since vec() uses strings, and all strings except “0” are true (including “00” and so on), I can’t immediately decide based on the string value if it’s all zeros.

Eric uses bit operations for more than just puzzle solving, though. He also keeps track of all the rows he’s no longer considering. In all, there are 93 placement possibilities, and he stores that as a bit vector. Each bit is a candidate row, although if he sets a bit, that row is no longer a candidate. The index of that bit maps into an array he keeps elsewhere. By turning off rows in his bit mask, he doesn’t have to remove elements from the middle of his data structure, saving him a lot of time Perl would otherwise spend dealing with data structure maintenance. In this case, he uses a bit vector to save on speed, but uses more memory.

Once he knows that he’s going to skip a row, he sets that bit in the $removed bit vector:

vec( $removed, $row, 1 ) = 1;

When he needs to know all of the candidate rows still left, that’s just the bitwise negation of the removed rows. Be careful here! You don’t want the binding operator by mistake:

$live_rows = ( ~ $removed );

Summary

Although Perl mostly insulates me from the physical details of computers, sometimes I still have to deal with them when the data comes to me packed into bytes. Or, if Perl’s data structures take up too much memory for my problem, I might want to pack my data into bit strings to escape the Perl memory penalty. Once I have the bits, I work with them in mostly the same way I would in other languages.

Mastering Perl by brian d foy