Ruby makes basic I/O operations dead simple, but this doesn’t mean it’s a bad idea to pick up and apply some general approaches to text processing. Here we’ll talk about two techniques that most programmers doing file processing will want to know about, and you’ll see what they look like in Ruby.
The case study for this chapter showed the most common use of File.foreach(), but there is more to be said about this approach. This section will highlight a couple of tricks worth knowing about when doing line-by-line processing.
The following example shows code that extracts and sums the totals found in a file that has entries similar to these:
    some lines of text total: 12
    other lines of text total: 16
    more text total: 3
The following code shows how to do this without loading the whole file into memory:
    sum = 0
    File.foreach("data.txt") { |line| sum += line[/total: (\d+)/,1].to_f }
Here, we are using File.foreach as a direct iterator and building up our sum as we go. However, because foreach() returns an Enumerator when it is called without a block, we can actually write this in a cleaner way without sacrificing efficiency:
    enum = File.foreach("data.txt")
    sum = enum.inject(0) { |s,r| s + r[/total: (\d+)/,1].to_f }
The primary difference between the two approaches is that when you use File.foreach directly with a block, you are simply iterating line by line over the file, whereas the Enumerator gives you some more powerful ways of processing your data.
When we work with arrays, we don’t usually write code like this:
    sum = 0
    arr.each { |e| sum += e }
Instead, we typically let Ruby do more of the work for us:
sum = arr.inject(0) { |s,e| s + e }
For this reason, we should do the same thing with files. If we have an Enumerable method we want to use to transform or process a file, we should use the enumerator provided by File.foreach() rather than try to do our processing within the block. This will allow us to leverage the power behind Ruby’s Enumerable module rather than doing the heavy lifting ourselves.
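To make this a bit more concrete, here is a minimal sketch of two other Enumerable-style operations applied to the enumerator from File.foreach. The helper names sum_totals and first_total_over are hypothetical, the sample data is made up, and the code assumes a Ruby recent enough to have Enumerable#sum (2.4 or later).

```ruby
require "tempfile"

# The same pattern used in the examples above.
TOTAL = /total: (\d+)/

# Hypothetical helper: sum every total in the file.
# Lines without a total contribute 0.0, because nil.to_f is 0.0.
def sum_totals(path)
  File.foreach(path).sum { |line| line[TOTAL, 1].to_f }
end

# Hypothetical helper: return the first total greater than `limit`,
# or nil if there is none. `lazy` lets the chain stop at the first
# match instead of transforming every line up front.
def first_total_over(path, limit)
  File.foreach(path)
      .lazy
      .map { |line| line[TOTAL, 1] }
      .reject(&:nil?)
      .map(&:to_i)
      .find { |n| n > limit }
end

# Demo with made-up sample data.
Tempfile.create("totals") do |f|
  f.write("some text\ntotal: 12\nother text\ntotal: 16\nmore text\ntotal: 3\n")
  f.flush
  puts sum_totals(f.path)
  puts first_total_over(f.path, 12)
end
```

Neither helper loads the whole file into memory, which is the point of going through the enumerator rather than something like File.readlines.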
If you’re interested in certain line numbers, there is no need to maintain a manual counter. You simply need to create a file handle to work with, and then make use of the File#lineno method. To illustrate this, we can very easily implement the Unix command head:
    def head(file_name,max_lines = 10)
      File.open(file_name) do |file|
        file.each do |line|
          puts line
          break if file.lineno == max_lines
        end
      end
    end
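The same idea extends to pulling out an arbitrary range of lines. The following is only a sketch: lines_between is a hypothetical helper, not something from the standard library, and the sample data is made up.

```ruby
require "tempfile"

# Hypothetical helper: collect lines first..last (1-based, inclusive)
# without keeping a manual counter.
def lines_between(file_name, first, last)
  selected = []
  return selected if last < first

  File.open(file_name) do |file|
    file.each do |line|
      next if file.lineno < first   # lineno is 1 once the first line is read
      selected << line
      break if file.lineno >= last  # stop reading once the range is done
    end
  end
  selected
end

# Demo with made-up sample data.
Tempfile.create("lines") do |f|
  f.write("one\ntwo\nthree\nfour\nfive\n")
  f.flush
  puts lines_between(f.path, 2, 4)
end
```

Because of the break, the file is never read past the last line we care about, just as in head.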
For a more interesting use case, we can consider a file that is formatted in two-line pairs, the first line a key, the second a value:
    first name
    gregory
    last name
    brown
    email
    gregory.t.brown@gmail.com
Using File#lineno, this is trivial to process:
    keys = []
    values = []
    File.open("foo.txt") do |file|
      file.each do |line|
        (file.lineno.odd? ? keys : values) << line.chomp
      end
    end
    Hash[*keys.zip(values).flatten]
The result of this code is a simple hash, as you might expect:
{ "first name" => "gregory", "last name" => "brown", "email" => "gregory.t.brown@gmail.com" }
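For comparison, the same file can also be handled with the enumerator returned by File.foreach. This is a sketch rather than a replacement for the lineno version: read_pairs is a hypothetical helper, the sample data is made up, and the code assumes a Ruby with Enumerable#to_h (2.1 or later).

```ruby
require "tempfile"

# Hypothetical helper: read a file in which each key/value pair
# spans two lines, and return the pairs as a hash.
def read_pairs(path)
  File.foreach(path)   # Enumerator over the lines of the file
      .map(&:chomp)    # strip the trailing newline from each line
      .each_slice(2)   # group the lines as [key, value]
      .to_h
end

# Demo with made-up sample data.
Tempfile.create("pairs") do |f|
  f.write("first name\ngregory\nlast name\nbrown\n")
  f.flush
  p read_pairs(f.path)
end
```

One difference worth knowing about: if the file has an odd number of lines, to_h raises an error instead of quietly pairing the last key with nil.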
Though there is probably more we can say about iterating over files line by line, this should get you well on your way. For now, there are other important I/O strategies to investigate, so we’ll keep moving.
Although many file processing scripts can happily read in one file as input and produce another as output, sometimes we want to be able to do transformations directly on a single file. This isn’t hard in practice, but it’s a little bit less obvious than you might think.
It is technically possible to rewrite parts of a file using the "r+" file mode, but in practice this is usually unwieldy. An alternative approach is to load the entire contents of a file into memory, manipulate the string, and then overwrite the original file. However, this approach is wasteful, and is not the best way to go in most cases.
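To see why "r+" gets unwieldy, it helps to remember that writing in this mode overwrites bytes in place rather than inserting them. The following sketch uses a hypothetical overwrite_at helper and made-up data to show what happens when the replacement is longer than the text it replaces.

```ruby
require "tempfile"

# Hypothetical helper: overwrite the bytes starting at `offset` with `text`.
def overwrite_at(path, offset, text)
  File.open(path, "r+") do |file|
    file.seek(offset)
    file.write(text)
  end
end

# Demo with made-up sample data.
Tempfile.create("rplus") do |f|
  f.write("total: 12\ntotal: 16\n")
  f.flush

  # Same length as the text it replaces, so this works cleanly.
  overwrite_at(f.path, 0, "TOTAL")
  p File.read(f.path)  # => "TOTAL: 12\ntotal: 16\n"

  # Longer than the first line, so it eats into the second one.
  overwrite_at(f.path, 0, "grand total: 12\n")
  p File.read(f.path)  # => "grand total: 12\n 16\n"
end
```

Any edit that changes the length of a line means shifting everything after it by hand, which is exactly the bookkeeping we would rather avoid.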
As it turns out, there is a simple solution to this problem, and that is simply to work around it. Rather than trying to make direct changes to a file, or store a string in memory and then write it back out to the same file after manipulation, we can instead make use of a temporary file and do line-by-line processing as normal. When we finish the job, we can rename our temp file so as to replace the original. Using this approach, we can easily make a backup of the original file if necessary, and also roll back changes upon error.
Let’s take a quick look at an example that demonstrates this general strategy. We’ll build a script that strips comments from Ruby files, allowing us to take source code such as this:
    # The best class ever
    # Anywhere in the world
    class Foo
      # A useless comment
      def a
        true
      end
      #Another Useless comment
      def b
        false
      end
    end
and turn it into comment-free code such as this:
    class Foo
      def a
        true
      end
      def b
        false
      end
    end
With the help of Ruby’s tempfile and fileutils standard libraries, this task is trivial:
    require "tempfile"
    require "fileutils"

    temp = Tempfile.new("working")
    File.foreach(ARGV[0]) do |line|
      temp << line unless line =~ /^\s*#/
    end
    temp.close
    FileUtils.mv(temp.path,ARGV[0])
We initialize a new Tempfile object and then iterate over the file specified on the command line. We append each line to the Tempfile, as long as it is not a comment line. This is the first part of our task:
    temp = Tempfile.new("working")
    File.foreach(ARGV[0]) do |line|
      temp << line unless line =~ /^\s*#/
    end
    temp.close
Once we’ve written our Tempfile and closed the file handle, we then use FileUtils to rename it and replace the original file we were working on:
FileUtils.mv(temp.path,ARGV[0])
In two steps, we’ve efficiently modified a file without loading it entirely into memory or dealing with the complexities of using the "r+" file mode. In many cases, the simple approach shown here will be enough.
Of course, because you are modifying a file in place, a poorly coded script could risk destroying your input file. For this reason, you might want to make a backup of your file. This can be done trivially with FileUtils.cp, as shown in the following reworked version of our example:
    require "tempfile"
    require "fileutils"

    temp = Tempfile.new("working")
    File.foreach(ARGV[0]) do |line|
      temp << line unless line =~ /^\s*#/
    end
    temp.close
    FileUtils.cp(ARGV[0],"#{ARGV[0]}.bak")
    FileUtils.mv(temp.path,ARGV[0])
This code makes a backup of the original file only if the temp file is successfully populated, which prevents it from producing garbage during testing.
Sometimes it will make sense to do backups; other times, it won’t be essential. Of course, it’s better to be safe than sorry, so if you’re in doubt, just add the extra line of code for a bit more peace of mind.
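If you find yourself repeating this pattern, it can be wrapped up in a small helper that also cleans up after itself when something goes wrong. What follows is a sketch under a few assumptions, not a definitive implementation: rewrite_file and its backup: option are hypothetical names, and the temp file is deliberately created next to the original so that the final rename stays on one filesystem.

```ruby
require "tempfile"
require "fileutils"

# Hypothetical helper: rewrite `path` line by line. The block receives each
# line and returns its replacement, or nil to drop the line entirely.
def rewrite_file(path, backup: true)
  # Assumption: keeping the temp file in the same directory as the original
  # means the final mv is a rename rather than a copy across filesystems.
  temp = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    File.foreach(path) do |line|
      result = yield(line)
      temp << result if result
    end
    temp.close
    FileUtils.cp(path, "#{path}.bak") if backup
    FileUtils.mv(temp.path, path)
  ensure
    # If anything failed before the rename, this discards the partial temp
    # file and leaves the original untouched. After a successful rename
    # there is nothing left at the temp path to remove.
    temp.close!
  end
end

# Usage: strip comment lines from the file named on the command line.
if ARGV[0]
  rewrite_file(ARGV[0]) { |line| line unless line =~ /^\s*#/ }
end
```

One caveat to keep in mind: Tempfile creates its files with restrictive permissions, so the rewritten file may not have the same mode as the original unless you restore it yourself.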
The two strategies shown in this section will come up in practice again and again for those doing frequent text processing. They can even be used in combination when needed.
We’re about to close our discussion on this topic, but before we do that, it’s worth mentioning the following reminders:
- When doing line-based file processing, File.foreach can be used as an Enumerator, unlocking the power of Enumerable. This provides an extremely handy way to search, traverse, and manipulate files without sacrificing efficiency.
- If you need to keep track of which line of a file you are on while you are iterating over it, you can use File#lineno rather than incrementing your own counter.
- When doing atomic saves, the tempfile standard library can be used to avoid unnecessary clutter.
- Be sure to test any code that does atomic saves thoroughly, as there is real risk of destroying your original source files if backups are not made.