Ferret by David Balmain

Chapter 1. Getting Started

Installing Ferret

First things first: let’s get Ferret installed. Thanks to RubyGems, this is pretty easy. If you haven’t used RubyGems before, there is a great introduction at the RubyGems web site (http://docs.rubygems.org/). If you are on Windows and you used the Ruby One-Click Installer to install Ruby, you’ll have everything you need. Other systems, such as Linux or Mac, need to have make and a C compiler such as gcc to build the extension. Other than that, Ferret comes dependency-free. You simply need to run the gem install script:

dave$ sudo gem install ferret

Once this process successfully completes, you will have Ferret installed on your system. The easiest way to check that everything is working correctly is to open an irb session, as shown in Example 1-1.

Example 1-1. irb session
dave$ irb -rubygems
>> require 'ferret'                           #=> true
>> index = Ferret::I.new                      #=> #<Ferret::Index:...>
>> index << "Time heals all wounds"           #=> #<Ferret::Index:...>
>> index << "A rolling stone gathers no moss" #=> #<Ferret::Index:...>
>> index << "A stitch in time saves nine"     #=> #<Ferret::Index:...>
>> index << "Look before you leap"            #=> #<Ferret::Index:...>
>> index << "Time and tide wait for no man"   #=> #<Ferret::Index:...>
>> index << "Time wounds all heels"           #=> #<Ferret::Index:...>
>> puts index.search("time")

All we’ve done here is load RubyGems and Ferret, create a new in-memory index, add a few strings to it, and run a search, printing out the results. If everything is working correctly, you will see the results of your search printed out in order of relevance. It doesn’t get much simpler than that. irb is a great way to play around with Ferret and try out new things. Next, we’ll show you how to index all the text files under a particular directory.
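To get a feel for what "in order of relevance" means, here is a minimal sketch of a toy in-memory index written in plain Ruby. It is not how Ferret works internally: the ToyIndex class and its scoring rule (occurrences of the term divided by the number of words in the document) are assumptions made purely for illustration, and Ferret's real scoring is more sophisticated.

```ruby
# Toy stand-in for an in-memory index; NOT Ferret's implementation.
class ToyIndex
  def initialize
    @docs = []
  end

  # Add a string document and return self so that calls can be chained.
  def <<(text)
    @docs << text
    self
  end

  # Look up a stored document by its ID (its position in the index).
  def [](doc_id)
    @docs[doc_id]
  end

  # Return [doc_id, score] pairs, best match first.
  # Assumed scoring rule: occurrences of the term / words in the document.
  def search(term)
    term = term.downcase
    hits = []
    @docs.each_with_index do |text, doc_id|
      words = text.downcase.scan(/\w+/)
      count = words.count(term)
      hits << [doc_id, count.to_f / words.size] if count > 0
    end
    # Highest score first; ties are broken by document ID.
    hits.sort_by { |doc_id, score| [-score, doc_id] }
  end
end

toy = ToyIndex.new
toy << "Time heals all wounds" << "A rolling stone gathers no moss"
toy << "A stitch in time saves nine" << "Look before you leap"
toy << "Time and tide wait for no man" << "Time wounds all heels"

toy.search("time").each do |doc_id, score|
  puts format("%.3f  %s", score, toy[doc_id])
end
```

With this rule, shorter proverbs that mention "time" rank above longer ones, because the matching word makes up a larger share of the text.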

A Quick Example: Indexing the Filesystem

With the explosion of the Internet, a huge amount of information has become available to us. But it doesn’t matter how much information is available if we can’t find what we are looking for. Luckily, companies like Google and Yahoo! have come to the rescue by helping us find the information we need with their search engines.

More recently, the same thing has been happening on our personal computers. More and more of our personal lives are being stored on hard drives—everything from work documents and email to multimedia files and family photos. Carefully categorizing all this data and scanning through large hierarchies of folders just doesn’t cut it anymore. We need a fast way to access the data we need. Presently, some of the tools commonly used for this task, such as the built-in search in Windows, leave a lot to be desired. Spotlight on OS X is much closer to what we need.

By the end of this book, you’ll have built a search application that will make searching your hard drive as easy as searching the Web. In this section, we start with plain old text files. Let’s begin by writing a command-line indexing program that takes two arguments: the name of the directory we want to index, and the name of the directory in which the index will be stored. Take a look at Example 1-2.

Example 1-2. index.rb
  0 #!/usr/bin/env ruby
  1 require 'rubygems'
  2 require 'ferret'
  3 require 'fileutils'
  4 include Ferret
  5 include Ferret::Index
  6 
  7 def usage(message = nil)
  8   puts message if message
  9   puts "ruby #{File.basename(__FILE__)} <data dir> <index dir>"
 10   exit(1)
 11 end
 12 
 13 usage() if ARGV.size != 2
 14 usage("Directory '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
 15 $data_dir, $index_dir = ARGV
 16 begin
 17   FileUtils.mkdir_p($index_dir)
 18 rescue
 19   usage("Can't create index directory '#$index_dir'.")
 20 end
 21 
 22 index = Index.new(:path => $index_dir,          
 23                   :create => true)
 24 
 25 Dir["#$data_dir/**/*.txt"].each do |file_name|  
 26   index << {:file_name => file_name, :content => File.read(file_name)} 
 27 end
 28 index.optimize()                                
 29 index.close()

Most of this code handles command-line arguments and can be safely skimmed. The interesting part begins on line 22, where we create the index. The :path parameter specifies where the index will be stored. Setting the :create parameter to true tells Ferret to create a new index in that directory. Any index already residing there will be overwritten, so be careful when setting :create to true.
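One way to be careful is to check the target directory before asking for a fresh index. The sketch below uses only the Ruby standard library; index_dir_in_use? is a hypothetical helper name, not part of Ferret, and treating any non-empty directory as "in use" is an assumption that errs on the side of caution.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: true when index_dir exists and already contains
# something that creating a fresh index there could overwrite.
def index_dir_in_use?(index_dir)
  File.directory?(index_dir) && !Dir.empty?(index_dir)
end

Dir.mktmpdir do |tmp|
  fresh = File.join(tmp, "fresh_index")
  used  = File.join(tmp, "used_index")
  FileUtils.mkdir_p(used)
  # A placeholder file standing in for existing index data.
  File.write(File.join(used, "placeholder.dat"), "existing data")

  puts index_dir_in_use?(fresh)  # the directory does not exist yet
  puts index_dir_in_use?(used)   # the directory already holds a file
end
```

A script like index.rb could call such a helper before line 22 and stop with a usage message rather than silently replacing an existing index.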

Once the index is created, we need to add documents to it. Line 25 scans the directory tree for all files ending in .txt. Line 26 is where most of the action happens. We saw earlier that we can add simple Strings to an index; this time we use a Hash, because we want each document to have two fields: a :file_name field and a :content field. Later, we’ll learn about the Document class, which lets us assign weightings (or boosts, as they are known in Ferret) to documents and fields.
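If you want to see what the glob pattern on line 25 picks up before involving Ferret at all, the following sketch may help. It builds the same two-field Hashes that line 26 adds to the index, using a throwaway directory. The text_documents helper and the sample file names are made up for illustration.

```ruby
require 'tmpdir'
require 'fileutils'

# Build one Hash per .txt file under data_dir, with the same two fields
# that index.rb adds to the index.
def text_documents(data_dir)
  Dir["#{data_dir}/**/*.txt"].sort.map do |file_name|
    {:file_name => file_name, :content => File.read(file_name)}
  end
end

Dir.mktmpdir do |data_dir|
  FileUtils.mkdir_p(File.join(data_dir, "nested", "deeper"))
  File.write(File.join(data_dir, "a.txt"), "top level")
  File.write(File.join(data_dir, "nested", "deeper", "b.txt"), "two levels down")
  File.write(File.join(data_dir, "nested", "notes.md"), "not matched by *.txt")

  text_documents(data_dir).each do |doc|
    puts "#{doc[:file_name]}: #{doc[:content]}"
  end
end
```

The ** part of the pattern matches any number of nested directories, including none, so both a.txt and b.txt are found while notes.md is skipped.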

The Index#optimize method is called on line 28. This method optimizes the index for searching, and it is a good idea to call it whenever you index in batches.[1] On the following line, we close the index. Index#close makes sure that any data held in RAM is flushed to the index. It then commits the index and releases any locks that the Index object might be holding on it.

Creating an index is now simply a matter of running the indexer from the command line:

dave$ ruby index.rb text_files/ index_dir/

Now that we have an index, we need to be able to search it. That is why we built it, after all. The search code is as simple as the indexing code; take a look at Example 1-3.

Example 1-3. search.rb
  0 #!/usr/bin/env ruby
  1 require 'rubygems'
  2 require 'ferret'
  3 require 'fileutils'
  4 include Ferret
  5 include Ferret::Index
  6 
  7 def usage(message = nil)
  8   puts message if message
  9   puts "ruby #{File.basename(__FILE__)} <index dir> <search phrase>"
 10   exit(1)
 11 end
 12 
 13 usage() if ARGV.size != 2
 14 usage("Index '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
 15 $index_dir, $search_phrase = ARGV
 16 
 17 index = Index.new(:path => $index_dir) 
 18 
 19 results = []
 20 total_hits = index.search_each($search_phrase) do |doc_id, score| 
 21   results << "  #{score} - #{index[doc_id][:file_name]}" 
 22 end
 23 
 24 puts "#{total_hits} matched your query:\n" + results.join("\n")
 25 
 26 index.close()

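On line 20, search_each runs the query, passes the ID and score of each matching document to the block, and returns the total number of hits. On line 21 we simply format each result as a string. We use the document ID to look the document up in the index, and the document itself acts like a Hash object, so we can read its :file_name field. If you would like to build an index of a large number of text files, check out Project Gutenberg (http://www.gutenberg.org/). Go ahead and try out the search script: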

dave$ ruby search.rb index_dir/ "Moby Dick"
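The block-and-return-value pattern on lines 19 to 24 of search.rb can also be tried without an index on disk. In the sketch below, StubIndex is a hypothetical stand-in written for this example, not a Ferret class, and its file names and scores are made up. It only mimics the two calls search.rb relies on: search_each, which yields a document ID and score per hit and returns the hit count, and [], which returns a Hash-like document.

```ruby
# Hypothetical stand-in for an opened index, so the example runs without Ferret.
class StubIndex
  def initialize(docs, hits)
    @docs = docs   # doc_id => Hash of fields
    @hits = hits   # [[doc_id, score], ...], best match first
  end

  # Yield each hit and return the total number of hits,
  # mirroring how search.rb uses Index#search_each.
  def search_each(_query)
    @hits.each { |doc_id, score| yield doc_id, score }
    @hits.size
  end

  # Look up a document by ID; the result behaves like a Hash.
  def [](doc_id)
    @docs[doc_id]
  end
end

# Made-up documents and scores, for illustration only.
index = StubIndex.new(
  {0 => {:file_name => "moby_dick.txt"}, 1 => {:file_name => "typee.txt"}},
  [[0, 0.9], [1, 0.4]]
)

results = []
total_hits = index.search_each("whale") do |doc_id, score|
  results << "  #{score} - #{index[doc_id][:file_name]}"
end

puts "#{total_hits} matched your query:\n" + results.join("\n")
```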

Summary

So far we’ve met the Index class. This class is just a convenient, easy-to-use interface to the rest of the Ferret API. It does most of the hard work for you, such as parsing queries and keeping track of the IndexReader and IndexWriter classes behind the scenes, or knowing when to commit the index so that your search is always on the latest version of the index. There is a lot more that you can do with the Index class, and in most cases it will be all you need. But if you really want to take full advantage of all Ferret has to offer, you’ll need to find out what is going on behind the scenes. In Chapter 2, we’ll learn more about how indexing actually works and how to configure the index for your application.



[1] When doing incremental indexing, as you might do in a Rails application, it is better not to call the optimize method. You’ll learn more about this in the “Optimizing the Index” section in Chapter 3.
