Ferret

By David Balmain


Table of Contents

Chapter 1: Getting Started
Installing Ferret
First things first: let’s get Ferret installed. Thanks to RubyGems, this is pretty easy. If you haven’t used RubyGems before, there is a great introduction at the RubyGems web site (http://docs.rubygems.org/). If you are on Windows and you used the Ruby One-Click Installer to install Ruby, you’ll have everything you need. Other systems, such as Linux or Mac, need to have make and a C compiler such as gcc to build the extension. Other than that, Ferret comes free. You simply need to run the gem install script:
dave$ sudo gem install ferret
Once this process successfully completes, you will have Ferret installed on your system. The easiest way to check that everything is working correctly is to open an irb session, as shown in the following example.
Example: irb session
All we’ve done here is load RubyGems and Ferret, create a new in-memory index, add a few strings to it, and run a search, printing out the results. If everything is working correctly, you will see the results of your search printed out in order of relevance. It doesn’t get much simpler than that. irb is a great way to play around with Ferret and try out new things. Next, we’ll show you how to index all the text files under a particular directory.
A Quick Example: Indexing the Filesystem
With the explosion of the Internet, a huge amount of information has become available to us. But it doesn’t matter how much information is available if we can’t find what we are looking for. Luckily, companies like Google and Yahoo! have come to the rescue by helping us find the information we need with their search engines.
More recently, the same thing has been happening on our personal computers. More and more of our personal lives are being stored on hard drives—everything from work documents and email to multimedia files and family photos. Carefully categorizing all this data and scanning through large hierarchies of folders just doesn’t cut it anymore. We need a fast way to access the data we need. Presently, some of the tools commonly used for this task, such as the built-in search in Windows, leave a lot to be desired. Spotlight on OS X is much closer to what we need.
By the end of this book, you’ll have built a search application that will make searching your hard drive as easy as searching the Web. In this section, we start with plain old text files. Let’s begin by writing a command-line indexing program that takes two arguments: the name of the directory we want to index, and the name of the directory in which the index will be stored. Take a look at the index.rb example.
Example: index.rb
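What follows is a minimal sketch of how such a program might be laid out, not the book's own listing. It checks the two command-line arguments and turns each text file into a Hash of fields. The method names parse_args and each_text_document, the :path and :content field names, and the .txt filter are all assumptions made for illustration. The Ferret calls are left as comments because they need the ferret gem to be installed.

```ruby
require 'find'

# Validate the two command-line arguments described in the text:
# the directory to index and the directory that will hold the index.
def parse_args(argv)
  unless argv.size == 2
    raise ArgumentError, "usage: index.rb <directory to index> <index directory>"
  end
  source_dir, index_dir = argv
  unless File.directory?(source_dir)
    raise ArgumentError, "#{source_dir} is not a directory"
  end
  [source_dir, index_dir]
end

# Yield one Hash per .txt file found under dir. The field names :path
# and :content are hypothetical choices, not taken from the book.
def each_text_document(dir)
  Find.find(dir) do |path|
    next unless File.file?(path) && File.extname(path) == '.txt'
    yield({ :path => path, :content => File.read(path) })
  end
end

# Assumed wiring, shown as comments because it requires the ferret gem:
#   source_dir, index_dir = parse_args(ARGV)
#   index = Ferret::Index::Index.new(:path => index_dir, :create => true)
#   each_text_document(source_dir) { |doc| index << doc }
```

Keeping the file walking separate from the index calls makes the skeleton easy to try out on a scratch directory before pointing it at a real index.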
Most of this code is for command-line argument handling and can be safely skimmed over. The interesting part of the code is where we create the index. The :path parameter specifies where you want to store the index. Setting the :create parameter to true tells Ferret to create a new index in the specified directory. Any index already residing in the specified directory will be overwritten, so be careful when setting :create to true. We saw earlier that we can add simple
Summary
So far we’ve met the Index class. This class is just a convenient, easy-to-use interface to the rest of the Ferret API. It does most of the hard work for you, such as parsing queries and keeping track of the IndexReader and IndexWriter classes behind the scenes, or knowing when to commit the index so that your search is always on the latest version of the index. There is a lot more that you can do with the Index class, and in most cases it will be all you need. But if you really want to take full advantage of all Ferret has to offer, you’ll need to find out what is going on behind the scenes. In the next chapter, we’ll learn more about how indexing actually works and how to configure the index for your application.
Chapter 2: Indexing
When you are indexing with Ferret, you need to know about the following classes:
  • Ferret::Store::Directory
  • Ferret::Index::FieldInfo
  • Ferret::Index::FieldInfos
  • Ferret::Field
  • Ferret::Document
  • Ferret::Index::IndexWriter
  • Ferret::Index::IndexReader
We will discuss each of these classes in this chapter. We’ll begin by discussing index storage. This involves looking at the two implementations of the Directory class: RAMDirectory and FSDirectory. After that, we’ll look at the building blocks of the Ferret index (the Field and Document classes), and we’ll discuss document and field boosting. We’ll then look at setting up the index. This involves using the FieldInfo and FieldInfos classes and is probably the most important topic in this chapter. Finally, we’ll discuss how the actual indexing process works, at which time you’ll learn when to use the IndexReader and IndexWriter classes. Readers should pay special attention to the section on index locking in Chapter 3, as it seems to be the biggest problem area for new users of Ferret.
Index Storage
Ferret indexes are stored in a Ferret::Store::Directory. Directory is an abstract class that specifies how an index should be stored. Ferret comes with two implementations of the Directory class: Ferret::Store::FSDirectory for storing the index on the filesystem, and Ferret::Store::RAMDirectory for storing an index in memory. You’ve used both of these classes already: RAMDirectory in the irb session and FSDirectory in the filesystem indexing examples in Chapter 1. The Index class hides all the details, but when you pass a :path parameter to Index.new, an FSDirectory is used; otherwise, a RAMDirectory is used.
Most of the time, you will persist your index to the filesystem, so you’ll be using FSDirectory. For the most part, RAMDirectory is used internally during indexing. There are times, however, when RAMDirectory will come in handy. It is a little faster than FSDirectory, so if you have a small index that will fit in your available memory and you need a really fast search, you might choose to use a RAMDirectory. You can even load an existing filesystem index into a RAMDirectory:
   
RAMDirectory is most beneficial during testing. Each unit test can create a new index, which is automatically cleaned up at the end of the test.
Documents, Fields, and Boosts
The best way to think of an index is as a searchable array of documents. A Ferret document is a collection of fields representing a chunk of data that you want to make searchable. Whether that chunk of data is a database row, a Word document, or an MP3 file doesn’t matter. They are all just documents to Ferret. A Ferret document can be represented by the Ferret::Document class. This class extends Ruby’s Hash class, adding only a boost attribute. In fact, as you saw earlier, documents can also be Hashes, where the key is the name of the field and the value is the data stored in the field.
The term “document” can be quite confusing. We often need to talk about the idea of a document in an index that is implemented by the Document class. A document can represent a PDF or a text document, or it can represent something like a movie or a product. Make note of the formatting we use to distinguish documents from the Document class.
Earlier we mentioned that Documents have a boost attribute, but we didn’t say what boost was for. The boost attribute gives a document a higher weighting in the results of a search. By using the boost attribute, you can make more important documents appear higher in the search results. The default boost value is 1.0, so if you have a document that you consider to be important, you might set its boost to 100.0. Another document that you consider less important might have its boost set to 0.0001.
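To make the effect of a boost concrete, here is a toy model in plain Ruby. It assumes the boost acts as a simple multiplier on a raw relevance score; Ferret's real scoring involves more factors, so treat this only as an illustration of the direction of the effect. BoostedDocument and rank are hypothetical names, not Ferret classes or methods.

```ruby
# A stand-in for the idea behind Ferret::Document: a Hash of fields
# that also carries a boost. BoostedDocument is a hypothetical name.
class BoostedDocument < Hash
  attr_accessor :boost

  def initialize(boost = 1.0)
    super()          # plain Hash with no default value
    @boost = boost
  end
end

# Toy ranking: order [document, raw_score] pairs by raw_score * boost,
# highest first, and return just the documents.
def rank(scored_docs)
  scored_docs.sort_by { |doc, raw_score| -(raw_score * doc.boost) }
             .map { |doc, _| doc }
end

important = BoostedDocument.new(100.0)
important[:title] = "Annual report"

ordinary = BoostedDocument.new      # default boost of 1.0
ordinary[:title] = "Meeting notes"

# The ordinary document matches better (0.9 against 0.2), but the boost
# lifts the important one to the top of the list.
ranked = rank([[ordinary, 0.9], [important, 0.2]])
puts ranked.map { |doc| doc[:title] }.inspect
```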
To illustrate this point, let’s say you have an online bookstore that sells a number of books on fishing. If one of your users comes along and submits a query with the term “fishing”, you might show a list of the top 10 books found. But how do you rate the top 10? By default, Ferret returns the books in which the term “fishing” appears most frequently, relative to the size of the document. However, just having a high occurrence of the term “fishing” doesn’t necessarily make the book a great book on fishing. It would be better if you could
Setting Up the Index
We now know that the index is made up of documents, which, in turn, are made up of fields. What happens to the fields when they are added to the index? Are they all treated equally? By default, the answer is yes: each field is created with the same properties. But this is not always desirable. For example, Ferret stores all the text in all the fields unmodified by default. If you are indexing data from a database, this may not be necessary. Since you are already storing the data in the database, it is often pointless to store it in the Ferret index as well.
Each field in a Ferret index has its properties defined in a Ferret::Index::FieldInfo object. A FieldInfo is an immutable class with the following properties:
  • name
  • boost
  • stored?
  • compressed?
  • indexed?
  • tokenized?
  • omit_norms?
  • store_term_vectors?
  • store_positions?
  • store_offsets?
The FieldInfo#name property is a symbol used to match the FieldInfo object with a field in a document. FieldInfo#boost is the default boost that is given to each instance of the field when it is added to the index. This is where, for example, you would boost the :title field if you wanted it to have more weight in the search results than the :content field. The default value for #boost is 1.0.
The rest of these properties can be divided into three groups: store, index, and term_vector. These are the other parameters you can use to instantiate a new FieldInfo object. For example:
                   
                                
                                      
                              
                                 

:store

Fields in Ferret can be stored or unstored. You should store fields that you want to retrieve after a search. For example, you would probably store a file URL if you are indexing your filesystem, or the ID (primary key) of a database table row when indexing a database table. You can also use a Ferret index like a database, storing all of your data in it. If you want to add highlighting to your search results, you need to store the fields you want to highlight. If a field is stored, it can also be compressed. This is useful when storing large documents in the index and disk space becomes an issue. It makes no sense to have a field that is both unstored and compressed. shows the three options for the
Basic Indexing Operations
Now that you have the index set up, you are ready to start the indexing process. Ferret allows two write operations on an index: add and delete. By combining these two operations, you can perform an update operation. You’ll learn about updating in this section, along with retrieving documents from the index. In this section you’ll also get an informal introduction to the Ferret::Index::IndexReader and Ferret::Index::IndexWriter classes through the explanations and examples. You’ll learn more about these two classes later in the book.
Adding documents to the index is done with the aptly named IndexWriter class. You’ve already seen how to add documents using the Index class; if you look under the covers, you’ll find that the Index object is just using an IndexWriter to write documents to the index:
    

   


Notice the commit method at the end. None of the changes you make to an index through an IndexWriter are guaranteed to be applied until you call either commit, optimize, or close. The commit method simply applies all changes to the index, leaving the IndexWriter open. close does the same thing, then closes the IndexWriter. optimize is like commit in that it commits all changes, leaving the IndexWriter open for more index modifications. However, it also optimizes the index for searching. This process can be quite resource-intensive and should be called sparingly, usually after running a batch indexing process or once a day as a cron process.
Once there are documents in the index, you need to be able to retrieve them again. To retrieve documents, use an IndexReader:
   
You can think of an index as an array of documents and, in fact, you can reference the index just as you would an array:
  


  


  
This is where things start to get a little confusing. You can use IndexWriter to delete documents as you would expect. But you can also use IndexReader to delete documents. Both delete methods work in slightly different ways. We’ll start with IndexReader#delete
Indexing Non-String Datatypes
So far, we’ve only really talked about adding strings to the index. As far as Ferret is concerned, every field is a string. But sometimes we want to index other datatypes, such as numbers and dates. We’re going to take a moment to talk about best practices when indexing non-string datatypes, specifically storing special datatypes in their own field. We won’t mention how to handle numbers or dates within a larger string field (like in the string The 39 Steps). You’ll learn more about text-field analysis later in the book.
Indexing number fields is relatively straightforward. You don’t even need to convert them to strings when you add them to the document. However, you do need to think about how you set up the field. Make sure it is untokenized, as some Analyzers will strip all numbers and you’ll end up with an empty field:
           
The one exception is when you want to run range queries on a number field. For example, you may want to submit a query for all products between $5.00 and $25.00 or for all products that weigh less than 500 grams. In Ferret, the RangeQuery sorts fields lexicographically, so while 200 comes before 500, 70 comes after 500. To fix this, pad the numbers to a fixed width by prepending zeros. So instead of adding 5, 70, and 200, you would add 0005, 0070, and 0200, and instead of adding 3.45 and 101.95, you would add 0003.45 and 0101.95. This is pretty easy using Ruby’s printf-like notation:
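A minimal sketch of the padding just described, using Ruby's format operator. The helper names pad_int and pad_price and the chosen field widths (four integer digits, two decimals) are assumptions made for illustration; they are not part of Ferret.

```ruby
# Plain string comparison works character by character, so "70" ends up
# after "500" when unpadded numbers are sorted as strings.
unpadded = %w[5 70 200 500].sort

# Pad integers to a fixed width of four digits.
def pad_int(n)
  "%04d" % n
end

# Pad prices to four integer digits plus two decimals (seven characters
# in total, counting the decimal point).
def pad_price(x)
  "%07.2f" % x
end

# Once padded, string order and numeric order agree.
padded = [5, 70, 200, 500].map { |n| pad_int(n) }.sort

puts unpadded.inspect
puts padded.inspect
puts pad_price(3.45)
puts pad_price(101.95)
```

Whatever width you pick has to be large enough for the biggest value you will ever index, because changing it later means reindexing the field.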
           
               
                 
Later in the book, we’ll show you how to automate this in an Analyzer so that you no longer need to think about it.
As with number fields, you can add Times to a document; the time is converted to a String using its to_s method when it is added to the index:
        
Again, be sure to set the field type to untokenized. When it comes to dates, though, they can be written in so many different formats that it’s worth taking some time to consider which format to use. Like numbers, dates are common in range queries, so you should try to pick a date format that is ordered when sorted lexicographically. Any of the following will work, so pick one and stick to it:
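As an illustration of the property being asked for (and not a reproduction of the book's own list of formats), the sketch below compares a year-first, zero-padded format with a day-first one. Both format strings are assumptions chosen for the comparison.

```ruby
times = [
  Time.utc(2007, 12, 1),
  Time.utc(2006, 1, 15),
  Time.utc(2007, 2, 28)
]

# Year first, zero-padded: sorting the strings gives chronological order.
year_first = times.map { |t| t.strftime("%Y%m%d") }.sort

# Day first: sorting the strings no longer gives chronological order.
day_first = times.map { |t| t.strftime("%d/%m/%Y") }.sort

# The same dates in true chronological order, for comparison.
chronological = times.sort.map { |t| t.strftime("%d/%m/%Y") }

puts year_first.inspect
puts day_first.inspect
puts chronological.inspect
```

Any format that puts the most significant unit first and pads every unit to a fixed width has the same property.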
Summary
You now know how to set up an index, including what kinds of properties are available for each field. We also covered how to set up fields for indexing different datatypes, such as numbers and dates, as well as how to set up a field for sorting. You’ve learned about document and field boosting, and how to perform the basic operations add, get, update, and delete on a Ferret index. That pretty much covers everything you need to know about indexing with Ferret.
Chapter 3 takes indexing to the next level. You’ll learn more about how the indexing process works and how to tune your index to get the best possible performance from it. If performance is not a concern for you (and Ferret is usually fast enough out of the box), you can probably skip most of the next chapter. But everyone should carefully read the section on index locking.
Chapter 3: Advanced Indexing
So far, we’ve taken a black-box approach to Ferret. This chapter explains what is really going on during indexing and, in the process, explains how to tune your index for maximum performance. We conclude by explaining how locking works. It is crucial that you understand this, particularly if you want to run Ferret in a multithreaded or multiprocess environment.
How the Indexing Process Works
We are now going to show how a source document (such as an HTML document from the Web, a row from a database, or an image from your personal image collection) becomes a Ferret document stored in the index. Ferret is agnostic about the source document’s type. It doesn’t matter whether you are indexing an MP3 file, a text document, or one of your store’s products; Ferret treats it as a collection of string fields. So, the first step is to turn source documents into Documents. This is pretty easy with plain-text documents. With other text document types, such as PDF or HTML, you’ll need to write a parser/reader that extracts the searchable text from the documents. For an image file, you might have a parser that extracts EXIF tags. Database rows usually map pretty easily to Documents, and a framework for doing exactly this is covered elsewhere in the book.
Once you have a Document, you add it to an IndexWriter. This is where the magic begins. The Document’s fields are passed through an analyzer (if they are set to be tokenized) that breaks up the fields into searchable tokens however it sees fit. Once the Document has passed through the analyzer, it is buffered until the IndexWriter is ready to write a new segment to the index. Exactly how many documents are buffered depends on the IndexWriter’s :max_buffered_docs and :max_buffer_memory parameters, which are discussed in the “Tuning Indexing Performance” section later in this chapter.
When an IndexWriter hits its limit for the maximum number of buffered documents, or when it is committed, it writes the buffered documents to a segment in the Directory. In this example, the IndexWriter buffers 10 documents before writing them as a segment. As segments are written to the index, the IndexWriter maintains a segment stack. Each time a segment is pushed onto the stack, the IndexWriter checks to see if there are :merge_factor segments on top of the stack that are all the same size. If there are, they are popped off the stack, merged, the merged segment is pushed back onto the stack, and the process is repeated. In the same example, the merge factor is set to 3, so if there are three 10-document segments on top of the stack, they are merged into one segment. Likewise, three 30-document segments are merged into one 90-document segment, and so on.
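The segment stack is easy to model in a few lines of plain Ruby. The sketch below is a toy simulation, not Ferret's implementation: segments are represented only by their document counts, and SegmentStack is a hypothetical class name. It uses the merge factor of 3 and the 10-document flushes from the text.

```ruby
# Toy model of the segment stack. Each entry is the number of documents
# in one segment; the last element is the top of the stack.
class SegmentStack
  attr_reader :segments

  def initialize(merge_factor)
    @merge_factor = merge_factor
    @segments = []
  end

  # Push a newly written segment, then keep merging for as long as the
  # top merge_factor segments are all the same size.
  def push(size)
    @segments.push(size)
    while mergeable?
      merged = @segments.pop(@merge_factor).inject(:+)
      @segments.push(merged)
    end
  end

  private

  def mergeable?
    top = @segments.last(@merge_factor)
    top.size == @merge_factor && top.uniq.size == 1
  end
end

stack = SegmentStack.new(3)
10.times { stack.push(10) }   # ten flushes of 10 buffered documents each
p stack.segments              # => [90, 10]
```

After nine flushes the three 30-document segments collapse into a single 90-document segment, and the tenth flush leaves one small segment sitting on top of it.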
Tuning Indexing Performance
Ferret’s indexing performance is lightning-fast out of the box, so you’re justified in wondering whether you need to know how to make Ferret even faster. In most cases, you won’t need Ferret to go any faster than it already does. But if you are indexing gigabytes rather than megabytes and the indexing process is taking hours rather than seconds, you need to know how to push Ferret to its limits.
People who have used Lucene or earlier versions of Ferret might try to improve indexing speed by indexing to a RAMDirectory and then flushing the RAMDirectory to disk. That trick is now pointless; Ferret automatically indexes as many documents as it can in memory before flushing them to the Directory. You can ensure that all the indexing is done in memory by setting the :max_buffered_docs and :max_buffer_memory parameters to sufficiently large values.
The indexing process is regulated by the parameters shown, with their defaults, in the following table.
Table: Index parameters
Parameter             Default    Short description
:max_buffer_memory    16 MB      The maximum memory used by the IndexWriter before buffered documents are flushed to the index
:chunk_size           1 MB       The size of the memory chunks allocated to the memory pool during indexing
:merge_factor         10         The minimum number of similar-sized segments needed to trigger a merge
:max_buffered_docs    10,000     The maximum number of documents that will be buffered by the IndexWriter before they are flushed to the index
:max_merge_docs       Infinite   The maximum number of documents that will be merged into a segment
:max_field_length     10,000     The maximum number of terms from any field that will be added to the index
:use_compound_file    true       Specifies whether or not to write the index in compound file format
:index_skip_interval  128        The skip interval between terms in the term dictionary
:doc_skip_interval    16         The skip interval between document IDs in the term dictionary
In this section, we will go through each of these parameters in turn and explain the effects they have on the indexing processes. In some cases, the parameters will also affect search speed, so that will also be discussed.
Optimizing the Index
You now know that the indexing process can leave any number of segments in the index. Although indexing performance is unaffected by the number of segments in the index, search performance does depend on the number of segments. The fewer segments, the better the search performance. When you open an index for reading, each segment is opened with its own SegmentReader. Each of those SegmentReaders has its own term dictionary, term enumerators, document enumerators, etc., so they can chew up quite a few resources, not to mention the fact that each reader needs to be searched separately. Therefore, the fewer the segments the better when searching the index. This is even more important when running short-lived command-line programs. It takes much more time to read in an unoptimized index than to read in an optimized index.
IndexWriter has an optimize method that minimizes the number of segments in an index, making the index optimal for searching. The best time to optimize the index is at the end of a batch indexing session. If, however, you are incrementally indexing your data—as you might do when indexing a model in a Rails application—you need to be more careful deciding when to optimize the index. The optimizing process itself can be quite resource-intensive, and it prevents any other documents from being added to the index. Thus, it is certainly not a good idea to optimize the index after each document is added to the index.
It should be noted that for large indexes, some processes—like sorting—can take a very long time on unoptimized indexes. Sometimes it takes a lot longer to sort an unoptimized index than to both optimize and sort the index. So, if you are having performance problems reading the index (searching, sorting, filtering, etc.), the first thing you should try is optimizing the index.
Index Locking and Concurrency Issues
This section deals with one of the most confusing issues for users new to Ferret: index locking. Ferret was designed to be used in a multiprocess environment, so it comes with a built-in index locking mechanism. Basically, you can’t have two processes modifying the same index at the same time. And just because you are working in a single-process, single-threaded environment, it doesn’t mean you can forget about index locking. Let’s say, for example, that you want to delete a set of documents in the index by their document numbers, but you also have an IndexWriter open for adding documents to the index. You will need to close the IndexWriter before committing the deletions and then reopen the IndexWriter. Why, you ask? To answer this, you have to understand how index locking works.
The Ferret index currently uses two locks: a commit lock and a write lock. You must know which operations use which one of these locks and when. There are two classes that can obtain these locks: IndexWriter and IndexReader. The IndexWriter acquires the write lock as soon as it is opened and keeps it until it is closed. (Hence, the importance of closing IndexWriters when you have finished with them.) As a result, you can only ever have one IndexWriter open on an index at any time, and you can’t perform any write operations with an IndexReader while there is an IndexWriter open on the index. IndexWriter will also obtain the commit lock when you optimize, commit, or close the index. Furthermore, IndexWriter will obtain the commit lock unpredictably during indexing. IndexWriter will commit the index whenever segments are merged, so it is difficult to predict exactly when it will obtain the commit lock, except to say that it may happen whenever a document is added to the index.
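The consequence of a write lock that is held for the whole life of a writer can be shown with a toy lock in plain Ruby. This is only a model of the behavior described above: the WriteLock class, the lock-file name, and the use of an exclusively created file are assumptions, not a description of how Ferret implements its locks.

```ruby
require 'tmpdir'

# Toy exclusive lock backed by a lock file in the index directory.
class WriteLock
  def initialize(index_dir)
    @path = File.join(index_dir, 'write.lock')   # hypothetical file name
    @held = false
  end

  # Returns true if the lock was obtained, false if it is already taken.
  # File::EXCL makes creation fail when the file already exists.
  def obtain
    File.open(@path, File::WRONLY | File::CREAT | File::EXCL).close
    @held = true
  rescue Errno::EEXIST
    false
  end

  # Give the lock back, but only if this object actually holds it.
  def release
    File.delete(@path) if @held
    @held = false
  end
end

Dir.mktmpdir do |index_dir|
  writer_lock = WriteLock.new(index_dir)   # stands in for an open IndexWriter
  reader_lock = WriteLock.new(index_dir)   # stands in for an IndexReader that wants to delete

  puts writer_lock.obtain   # true: the writer takes the lock when it opens
  puts reader_lock.obtain   # false: the reader cannot write while the writer is open
  writer_lock.release       # closing the writer frees the lock
  puts reader_lock.obtain   # true: now the reader can proceed
  reader_lock.release
end
```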
IndexReader, on the other hand, acquires the write lock only when a write operation is called. There are three of these: delete, undelete, and set_norm. The commit lock is acquired only when you call the IndexReader’s commit
Summary
In this chapter, we covered some advanced indexing topics. You learned a little about what happens behind the scenes during the indexing process. We covered performance tuning for both indexing and searching in some detail, and we explained the Ferret index locking mechanism, briefly touching on some concurrency issues. By this stage, you are probably chomping at the bit to find out how to search the indexes you’ve been building. We will cover that in the next chapter.
Additional content appearing in this section has been removed.
Chapter 4: Search
Everything you’ve learned so far about creating indexes is pretty useless if you don’t know how to use those indexes to find what you are looking for. After all, that’s what Ferret is for. This chapter covers everything you need to know about searching in Ferret. We’ll start with the basic search classes followed by the various types of query. We’ll then talk about the query parser and Ferret’s own query language—FQL. We’ll then cover some more advanced topics such as sorting, filtering, and highlighting.
Additional content appearing in this section has been removed.
Overview of Searching Classes
Ferret’s search API is about as simple as its indexing API. In fact, if you are using the Index class, all you have to know is the search_each() method and a little bit of Ferret’s query language and you are set. However, if you take the time to learn the rest of the search API, you’ll discover a wealth of opportunities you didn’t even know existed.
The search API consists of the following classes:
  • IndexSearcher
  • Query
  • QueryParser
  • Filter
  • Sort
IndexSearcher, as the name would suggest, is used to search indexes. You can also use it to highlight and explain query results and read documents from the index (as you would with IndexReader). To create an IndexSearcher, you need to supply it with an IndexReader:
   
             
As usual, you can shortcut this by supplying it with a Directory or a filesystem path to the index:
    
Ferret contains more than 15 different types of query, each of which you’ll learn about later in this chapter. Basically, queries are built and combined to specify what exactly it is you are looking for. You can then pass them to the IndexSearcher so it will retrieve your result set. Queries are the fundamental building block of the search API.
With more than 15 different types of query (each with its own distinct API), it can get quite tedious to build them by hand. Succinct as Ruby code is, it is much easier to build queries using a simple query language, not to mention the fact that you wouldn’t want users to have to type Ruby code into your search box. For example, let’s say we wanted to search for all articles in a blog that have the words “ruby” and “ferret” in either the title field or the content field. You could use the QueryParser:
   
Or you could build the query yourself. The QueryParser is the magic behind the
Additional content appearing in this section has been removed.
Building Queries
Even if you are using the QueryParser to build all your queries, you’ll gain a better understanding of how searching works in Ferret by building each of the queries by hand. We’ll also include the Ferret Query Language (FQL) syntax for each different type of query as we go. As you read, you’ll find some queries that you can’t build even using the QueryParser, so it will be useful to learn about them as well.
Before we get started, we should mention that each Query has a boost field. Because you will usually be combining queries with a BooleanQuery, it can be useful to give some of those queries a higher weighting than the other clauses in the BooleanQuery. All Query objects also implement hash and eql?, so they can be used as keys in a Hash to cache query results.
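The reason hash and eql? matter for caching is that Ruby's Hash uses exactly those two methods to decide whether two keys are the same. The following sketch shows the idea with a stand-in class; ToyTermQuery is a hypothetical name for this illustration and is not Ferret's TermQuery.

```ruby
# ToyTermQuery is a hypothetical stand-in, not Ferret's TermQuery.
class ToyTermQuery
  attr_reader :field, :term, :boost

  def initialize(field, term, boost = 1.0)
    @field = field
    @term  = term
    @boost = boost
  end

  # Two queries count as the same Hash key when all of their parts match.
  def eql?(other)
    other.is_a?(ToyTermQuery) &&
      field == other.field && term == other.term && boost == other.boost
  end
  alias_method :==, :eql?

  # Equal queries must produce equal hash codes.
  def hash
    [field, term, boost].hash
  end
end

cache = {}
searches_run = 0

# Run the (pretend) search only when no equal query has been seen before.
lookup = lambda do |query|
  cache[query] ||= begin
    searches_run += 1
    ["doc-for-#{query.term}"]
  end
end

lookup.call(ToyTermQuery.new(:title, "ferret"))
lookup.call(ToyTermQuery.new(:title, "ferret"))      # equal key, served from the cache
lookup.call(ToyTermQuery.new(:title, "ferret", 2.0)) # different boost, so a new entry
puts "searches run: #{searches_run}"
```

Note that the key includes the boost, so two otherwise identical queries with different weightings are cached separately in this sketch.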
TermQuery is the most basic of all queries and is actually the building block for most of the other queries (even where you wouldn’t expect it, like in
Additional content appearing in this section has been removed.
QueryParser
You’ve now been introduced to all the different types of queries available in Ferret, and you’ve learned how to build different queries by hand. Some of it probably seems like a lot of work and it’s certainly not something you’d ask a user to do. Luckily, we can leave most of the work to the Ferret QueryParser. You’ve already seen many examples of the Ferret Query Language (FQL) in the previous section (“Building Queries”), and you’ll have noticed that most of the queries you can build in code can be described much more easily in FQL. In this section, we’ll talk about setting up the QueryParser, and then we’ll go into more detail about FQL.
The QueryParser has a number of parameters, as shown in the following table.
Table: QueryParser parameters
Parameter | Default | Short description
:default_field | :* | The default field to be searched; it can also be an array.
:analyzer | StandardAnalyzer | Analyzer used by the query parser to parse query terms.
:wild_card_downcase | true | Specifies whether wildcard queries should be downcased or not, since they are not analyzed by the parser.
:fields | [] | Lets the query parser know what fields are available for searching, particularly when :* is specified as the search field.
:validate_fields | false | Set to true if you want an exception to be raised if there is an attempt to search a nonexistent field.
:or_default | true | Use OR as the default Boolean operator.
:default_slop | 0 | Default slop to use in PhraseQueries.
:handle_parser_errors | true | QueryParser will quietly handle all parsing errors internally. If you’d like to handle them yourself, set this parameter to false.
:clean_string | true | QueryParser will quickly review the query string to make sure that quotes and brackets match up and special characters are escaped.
:max_clauses | 512 | The maximum number of clauses allowed in Boolean queries and the maximum number of terms allowed in multi, prefix, wildcard, or fuzzy queries.
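As a quick way to see how these defaults interact with your own settings, the sketch below writes the table out as a plain Ruby Hash and merges a couple of overrides into it. The constant name QUERY_PARSER_DEFAULTS is hypothetical, the analyzer is represented by its name as a string because no Ferret classes are loaded here, and the :title and :content fields are made up for the example.

```ruby
# Defaults copied from the table above. This Hash is illustrative only;
# it is not an object defined by Ferret.
QUERY_PARSER_DEFAULTS = {
  :default_field        => :*,
  :analyzer             => "StandardAnalyzer", # class name as a placeholder string
  :wild_card_downcase   => true,
  :fields               => [],
  :validate_fields      => false,
  :or_default           => true,
  :default_slop         => 0,
  :handle_parser_errors => true,
  :clean_string         => true,
  :max_clauses          => 512
}.freeze

# Override only what differs from the defaults: search two (hypothetical)
# fields and require all terms by turning off the OR default.
options = QUERY_PARSER_DEFAULTS.merge(
  :default_field => [:title, :content],
  :or_default    => false
)

options.each { |name, value| puts "#{name.inspect} => #{value.inspect}" }
```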
Additional content appearing in this section has been removed.
Filtering Search Results
We’ve already mentioned Filters in our discussion of ConstantScoreQuery and FilteredQuery. Filters are used to apply extra constraints to a result set. For example, suppose we want to restrict our search to documents that were created during the last month. We have two options: add a RangeQuery clause to our query, or apply a RangeFilter. The main advantage of using a Filter over a Query is that no score is taken into account, so a Filter can be a lot faster. In addition, Filters cache their results, so subsequent uses of the Filter perform even better. All caching is done against an instance of an IndexReader, so a new cache needs to be built each time a Filter is used against a different IndexReader.
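The per-reader caching described above can be pictured with a small plain-Ruby model. ToyCachingFilter is a hypothetical class written for this sketch, and each "reader" is modelled as a simple Array of documents; nothing here reflects how Ferret actually stores its filter cache.

```ruby
# ToyCachingFilter is a hypothetical illustration, not a Ferret class.
class ToyCachingFilter
  attr_reader :builds

  def initialize(&predicate)
    @predicate = predicate
    @cache  = {}.compare_by_identity # one entry per reader object
    @builds = 0
  end

  # Returns the document numbers that pass the filter for this reader,
  # computing them only the first time the reader is seen.
  def doc_nums(reader)
    @cache[reader] ||= begin
      @builds += 1
      reader.each_index.select { |i| @predicate.call(reader[i]) }
    end
  end
end

# Each "reader" is just an Array of documents in this model.
reader_a = [{ :month => "2007-03" }, { :month => "2007-04" }]
reader_b = [{ :month => "2007-04" }]

recent = ToyCachingFilter.new { |doc| doc[:month] >= "2007-04" }

recent.doc_nums(reader_a) # builds the cache for reader_a
recent.doc_nums(reader_a) # reuses the cached result
recent.doc_nums(reader_b) # different reader, so a new cache entry is built
puts "caches built: #{recent.builds}"
```

The practical consequence is the same as in the text: reopening an IndexReader throws away the benefit of the cache until the filter has been run once against the new reader.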
Filters also make it easy to apply constraints to user input queries. Filters are best used when applying commonly used constraints to a user’s query, such as restricting a search of a blog to only today’s postings or only to postings marked for publication.
There are only two standard Filters that come with Ferret:
  • RangeFilter
  • QueryFilter
RangeFilter takes the same parameters as RangeQuery, as described earlier in this chapter. Basically, you need to supply a :field and an upper and/or lower limit for that field. For example, if we want to restrict a search to products that are priced at $50.00 or more and less than $100.00, we would build the filter like this:
        
Note again the way we padded the price values. RangeFilter works only on fields that are correctly lexically sorted, so you need to remember to pad all number fields to a fixed width if you want to filter that field with a RangeFilter.
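The effect of padding is easy to check in plain Ruby, without any Ferret classes. In the sketch below, pad_price is a hypothetical helper and the width of 8 characters is an arbitrary choice; the point is only that fixed-width strings compare in the same order as the numbers they represent.

```ruby
# Pad a price to a fixed width so that string order matches numeric order.
# The helper name and the width of 8 are arbitrary choices for this sketch.
def pad_price(price)
  format("%08.2f", price)
end

prices = [5.0, 50.0, 75.5, 99.99, 100.0, 1000.0]

# Unpadded strings sort lexically, which puts "100.0" before "5.0".
unpadded = prices.map(&:to_s).sort
puts unpadded.inspect

# Padded strings sort in true numeric order.
padded = prices.map { |price| pad_price(price) }.sort
puts padded.inspect

# Half-open range check, 50.00 <= price < 100.00, done purely on strings.
lower = pad_price(50.0)
upper = pad_price(100.0)
in_range = padded.select { |s| s >= lower && s < upper }
puts in_range.inspect
```

Whatever width you pick has to be large enough for the biggest value you will ever index, because a value that overflows the width breaks the ordering again.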
QueryFilter makes use of a query to filter search results. The initial application of a QueryFilter will be just as slow as if you added the filter query as a :must clause to the actual query. However, after caching, subsequent use of the QueryFilter will be much faster.
A good example of where you might use a
Additional content appearing in this section has been removed.
Sorting Search Results
By default, documents are sorted by relevance and then by document ID if scores are equal. But what if we want to sort the result set by the value in one of the fields (e.g., price)? One way to do this is to retrieve the entire result set and make use of Ruby’s Array#sort method. However, this would take too long for large result sets, not to mention use up a lot of unnecessary memory. Searcher provides a :sort parameter for easy sorting. The easiest way to specify a sort is to pass a sort string. A sort string is a comma-separated list of field names with an optional DESC modifier to reverse the sort for that field. The type of the field is automatically detected and the field sorted accordingly. So Float fields will be sorted by Float value, and Integer fields will be sorted by Integer value. SCORE and DOC_ID can be used in place of field names to sort by relevance and internal document ID, respectively. Here are some examples:
   
   
   
Although this will do the job most of the time, you can be a little more explicit in describing how a result set is sorted by using the Sort API. You will also need to use the Sort API to take full advantage of sort caching. There are two classes in the Sort API: Sort and SortField.
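To make the sort-string format described earlier concrete, here is a small plain-Ruby sketch that splits such a string into field names and a reverse flag. The parse_sort_string helper is hypothetical and is not how Ferret parses sort strings internally, and the field names price and title are made up for the example.

```ruby
# Hypothetical helper: split a sort string into [field, reverse] pairs.
# It only illustrates the format; it is not Ferret's own parser.
def parse_sort_string(sort_string)
  sort_string.split(",").map do |part|
    name, modifier = part.strip.split(/\s+/, 2)
    [name, modifier.to_s.upcase == "DESC"]
  end
end

# Sort by price, highest first, then by title.
p parse_sort_string("price DESC, title")

# SCORE and DOC_ID stand in for relevance and internal document ID.
p parse_sort_string("SCORE, DOC_ID")
```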
Additional content appearing in this section has been removed.
Highlighting Query Results
Query highlighting, like excerpting, is one of the newer features in Ferret, added in version 0.10. Highlighting takes a query and returns the data from a document field with all of the matches in the field highlighted. Excerpting, on the other hand, takes excerpts from the field, preferably with matching terms, and highlights the terms in those excerpts. Both Ferret::Search::Searcher and Ferret::Index::Index classes have a highlight method. In this section, we’ll look at Index#highlight because it allows us to pass string queries instead of having to build Query objects (see the table of Index#highlight parameters below). Otherwise, both methods are essentially the same. To use the highlight method, you must supply a query and the document ID of the document you wish to highlight. A number of other parameters can be used to describe exactly how you want to highlight the field.
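The basic idea of highlighting, returning the field text with each match wrapped in a pair of tags, can be sketched in a few lines of plain Ruby. The toy_highlight helper below is hypothetical and far simpler than Ferret's highlighter: it matches literal whole words only, does no analysis or excerpting, and its closing tag of </b> is an assumption made for this sketch.

```ruby
# Hypothetical helper, not Ferret's highlighter: wrap every whole-word,
# case-insensitive match of the given terms in pre_tag and post_tag.
def toy_highlight(text, terms, pre_tag = "<b>", post_tag = "</b>")
  alternatives = terms.map { |term| Regexp.escape(term) }.join("|")
  pattern = /\b(?:#{alternatives})\b/i
  text.gsub(pattern) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

puts toy_highlight("Ferret is a search library for Ruby.", ["ruby", "ferret"])
# prints: <b>Ferret</b> is a search library for <b>Ruby</b>.
```

Passing different tags, for example terminal escape sequences, is how you would adapt the same idea for console output.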
Table: Index#highlight parameters
Parameter | Description
:field | Defaults to @options[:default_field]. The highlighter only works on one field at a time, so you need to specify which field it is you want to highlight. If you want to highlight multiple fields, you'll need to call this method multiple times.
:excerpt_length | Defaults to 150 bytes. This parameter specifies the length of excerpt to show. The algorithm for extracting excerpts attempts to fit as many matched terms into each excerpt as possible. If you’d simply like the complete field back with all matches highlighted, set this parameter to :all.
:num_excerpts | Specifies the number of excerpts you wish to retrieve. This defaults to 2, unless :excerpt_length is set to :all, in which case :num_excerpts is set to 1.
:pre_tag | To highlight matches, you need to specify short strings to place before and after each match. :pre_tag defaults to <b>, which is fine when printing HTML, but if you are printing results to the console, we recommend using something like
Additional content appearing in this section has been removed.
Summary