Text-Processing Strategies

Ruby makes basic I/O operations dead simple, but it is still worth picking up a few general approaches to text processing. Here we’ll talk about two techniques that most programmers doing file processing will want to know about, and you’ll see what they look like in Ruby.

Advanced Line Processing

The case study for this chapter showed the most common use of File.foreach(), but there is more to be said about this approach. This section will highlight a couple of tricks worth knowing about when doing line-by-line processing.

Using Enumerator

The following example shows code that extracts and sums the totals found in a file that has entries similar to these:

some
lines
of
text
total: 12

other
lines
of
text
total: 16

more
text
total: 3

The following code shows how to do this without loading the whole file into memory:

sum = 0
File.foreach("data.txt") { |line| sum += line[/total: (\d+)/,1].to_f }

Here, we are using File.foreach as a direct iterator, building up our sum as we go. However, because foreach() returns an Enumerator when called without a block, we can write this in a cleaner way without sacrificing efficiency:

enum = File.foreach("data.txt")
sum = enum.inject(0) { |s,r| s + r[/total: (\d+)/,1].to_f }

The primary difference between the two approaches is that when you use File.foreach directly with a block, you are simply iterating line by line over the file, whereas the Enumerator gives you access to Enumerable’s more powerful ways of processing your data.

When we work with arrays, we don’t usually write code like this:

sum = 0
arr.each { |e| sum += e }

Instead, we typically let Ruby do more of the work for us:

sum = arr.inject(0) { |s,e| s + e }

For this reason, we should do the same thing with files. If we have an Enumerable method we want to use to transform or process a file, we should use the enumerator provided by File.foreach() rather than try to do our processing within the block. This will allow us to leverage the power behind Ruby’s Enumerable module rather than doing the heavy lifting ourselves.
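
For instance, here is a quick sketch of that idea using the same data.txt shown earlier. Enumerable#grep selects the matching lines and maps each one through its block, so we can pull out the totals and find the largest one without writing a manual loop; grep is just one of many Enumerable methods that become available this way.

enum = File.foreach("data.txt")

# Select the "total:" lines and convert each to an integer in one pass
totals = enum.grep(/total: (\d+)/) { |line| line[/total: (\d+)/, 1].to_i }

totals.max #=> 16 for the sample data shown earlier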

Tracking Line Numbers

If you’re interested in certain line numbers, there is no need to maintain a manual counter. You simply need to create a file handle to work with, and then make use of the File#lineno method. To illustrate this, we can very easily implement the Unix command head:

def head(file_name,max_lines = 10)
  File.open(file_name) do |file|
    file.each do |line|
      puts line
      break if file.lineno == max_lines
    end
  end
end
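
Calling it is straightforward; the filename here is just the sample data file from earlier:

head("data.txt")     # prints the first 10 lines
head("data.txt", 3)  # prints only the first 3 lines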

For a more interesting use case, we can consider a file that is formatted as two-line pairs, with the first line a key and the second a value:

first name
gregory
last name
brown
email
gregory.t.brown@gmail.com

Using File#lineno, this is trivial to process:

keys   = []
values = []

File.open("foo.txt") do |file|
  file.each do |line|
    (file.lineno.odd? ? keys : values) << line.chomp
  end
end

Hash[*keys.zip(values).flatten]

The result of this code is a simple hash, as you might expect:

 { "first name" => "gregory",
   "last name"  => "brown",
   "email"      => "gregory.t.brown@gmail.com" }

Though there is probably more we can say about iterating over files line by line, this should get you well on your way. For now, there are other important I/O strategies to investigate, so we’ll keep moving.

Atomic Saves

Although many file processing scripts can happily read in one file as input and produce another as output, sometimes we want to be able to do transformations directly on a single file. This isn’t hard in practice, but it’s a little bit less obvious than you might think.

It is technically possible to rewrite parts of a file using the "r+" file mode, but in practice this tends to be unwieldy. An alternative approach is to load the entire contents of a file into memory, manipulate the string, and then overwrite the original file. However, this approach is wasteful, and it is rarely the best way to go.
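
To make that contrast concrete, here is a minimal sketch of the in-memory approach; the filename and the substitution are only placeholders:

# Read the whole file into one string, transform it, then rewrite the file.
# Simple enough, but memory use grows with the size of the file.
contents = File.read("data.txt")
contents.gsub!(/foo/, "bar")

File.open("data.txt", "w") { |file| file << contents }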

As it turns out, there is a simple solution to this problem: work around it. Rather than trying to make direct changes to a file, or storing a string in memory and writing it back out after manipulation, we can make use of a temporary file and do line-by-line processing as normal. When we finish the job, we rename our temp file so that it replaces the original. Using this approach, we can easily make a backup of the original file if necessary, and also roll back changes upon error.

Let’s take a quick look at an example that demonstrates this general strategy. We’ll build a script that strips comments from Ruby files, allowing us to take source code such as this:

# The best class ever
# Anywhere in the world
class Foo

  # A useless comment
  def a
     true
  end

  #Another Useless comment
  def b
    false
  end

end

and turn it into comment-free code such as this:

class Foo

  def a
     true
  end

  def b
    false
  end

end

With the help of Ruby’s tempfile and fileutils standard libraries, this task is trivial:

require "tempfile"
require "fileutils"
temp = Tempfile.new("working")
File.foreach(ARGV[0]) do |line|
  temp << line unless line =~ /^\s*#/
end

temp.close
FileUtils.mv(temp.path,ARGV[0])

We initialize a new Tempfile object and then iterate over the file specified on the command line. We append each line to the Tempfile, as long as it is not a comment line. This is the first part of our task:

temp = Tempfile.new("working")
File.foreach(ARGV[0]) do |line|
  temp << line unless line =~ /^\s*#/
end

temp.close

Once we’ve written our Tempfile and closed the file handle, we then use FileUtils to rename it and replace the original file we were working on:

FileUtils.mv(temp.path,ARGV[0])

In two steps, we’ve efficiently modified a file without loading it entirely into memory or dealing with the complexities of using the r+ file mode. In many cases, the simple approach shown here will be enough.

Of course, because you are modifying a file in place, a poorly coded script could risk destroying your input file. For this reason, you might want to make a backup of your file. This can be done trivially with FileUtils.cp, as shown in the following reworked version of our example:

require "tempfile"
require "fileutils"

temp = Tempfile.new("working")
File.foreach(ARGV[0]) do |line|
  temp << line unless line =~ /^\s*#/
end

temp.close
FileUtils.cp(ARGV[0],"#{ARGV[0]}.bak")
FileUtils.mv(temp.path,ARGV[0])

This code makes a backup of the original file only after the temp file has been successfully populated and closed, which keeps a failed run from leaving a misleading .bak file behind while you are testing.

Sometimes it will make sense to do backups; other times, it won’t be essential. Of course, it’s better to be safe than sorry, so if you’re in doubt, just add the extra line of code for a bit more peace of mind.
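
Earlier we mentioned that this strategy also makes it easy to roll back changes upon error. As a sketch (the begin/rescue wrapper and filenames here are our own, not part of the original example), we can take the backup up front and restore it if anything fails before the new file lands in place:

require "tempfile"
require "fileutils"

source = ARGV[0]
backup = "#{source}.bak"

# Make the backup up front so it is available for rollback
FileUtils.cp(source, backup)

begin
  temp = Tempfile.new("working")

  File.foreach(source) do |line|
    temp << line unless line =~ /^\s*#/
  end

  temp.close
  FileUtils.mv(temp.path, source)
rescue StandardError
  # Something failed mid-processing, so put the original back in place
  FileUtils.mv(backup, source)
  raise
end

In this particular script the original isn’t touched until the final FileUtils.mv, so the rollback mostly guards against a failed or partial move; in more involved scripts that modify the source earlier, it becomes essential.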

The two strategies shown in this section will come up in practice again and again for those doing frequent text processing. They can even be used in combination when needed.

We’re about to close our discussion on this topic, but before we do that, it’s worth mentioning the following reminders:

  • When doing line-based file processing, File.foreach can be used as an Enumerator, unlocking the power of Enumerable. This provides an extremely handy way to search, traverse, and manipulate files without sacrificing efficiency.

  • If you need to keep track of which line of a file you are on while you are iterating over it, you can use File#lineno rather than incrementing your own counter.

  • When doing atomic saves, the tempfile standard library can be used to avoid unnecessary clutter.

  • Be sure to test any code that does atomic saves thoroughly, as there is real risk of destroying your original source files if backups are not made.
