4.3 Searching Your Data Using Lucene.Net

Data is everywhere, whether it’s on the Internet, your local system, or networked hard drives. The challenge often isn’t in collecting and organizing your data but in finding it. Businesses collect data in a staggering array of formats, including Microsoft Outlook or Excel files, Access or SQL databases, PDFs, HTML files, plain old text files, and perhaps even custom application formats. That data often then gets scattered across a dizzying number of locations on different servers.

Chances are that your customers will need to deal with disparate data formats and with data stored in multiple locations. Furthermore, they will probably want to be able to exert some control over how searches are performed. Customers may want to be able to limit searches to certain keywords or to a particular set of data folders on a particular server, or to filter out information older than a particular date.

Google Desktop has made a splash by bringing this functionality to end users. Now you have the power to bring the same indexing and searching capabilities into your applications using Lucene.Net, a high-performance, scalable search engine library written in the C# language and utilizing the .NET Framework.

Lucene.Net at a Glance

Tool                   Lucene.Net
Version covered        1.4.3, 1.9, 1.9.1, and 2.0
Home page              http://incubator.apache.org/lucene.net/
Power Tools page       http://www.windevpowertools.com/tools/144
Summary                .NET-based search engine API for indexing and searching contents
License type           Apache License, version 2.0
Online resources       API documentation, mailing list at ASF
Supported Frameworks   .NET 1.1, 2.0

Getting Started

Lucene.Net is an open source project currently under incubation at the Apache Software Foundation (ASF). The source code can be downloaded from the project’s home page as a .zip archive or checked out from the Subversion repository.

Lucene.Net requires a Microsoft C# compiler and version 1.1 or 2.0 of the .NET Framework. It works with either Microsoft Visual Studio 2003 or 2005. The source comes with a solution for Visual Studio 2003.

NUnit is required if you want to run the test code. It can be downloaded from its home page at http://www.nunit.org.

You’ll also need SharpZipLib (discussed later in this chapter) if you want to support compressed indexing in Lucene.Net versions 1.9 and 1.9.1. SharpZipLib can be downloaded from its home page at http://www.icsharpcode.net/OpenSource/SharpZipLib/.

Using Lucene.Net

Lucene.Net is not a standalone search engine application. It can’t be used as-is out of the box to index and search your data or the Web. Out of the box, Lucene.Net can’t extract or read your binary data (such as Microsoft Office or PDF files), make use of SQL data, or crawl the Web.

Understanding this limitation is key to appreciating what Lucene.Net actually provides: a rich set of APIs that you call to first create a Lucene.Net index and later search that index. The task of extracting raw text from your binary data is your job. You have to write the code that reads formats such as Microsoft Office files, extracts the raw text, and passes that text to Lucene.Net, where it can finally be indexed and later searched.

After your raw text data has been indexed, you can use Lucene.Net’s API to search this data. Indexing and searching via Lucene.Net’s APIs is easy and yet very powerful.

Two groups of APIs make up Lucene.Net: the indexing APIs and the search APIs. You will spend most of your time writing code for the search APIs. However, before you can start searching, you must create indexes.

Creating an index

Indexing is the process of analyzing raw text data and converting it into a format that allows Lucene.Net to search that data quickly. A Lucene.Net index is optimized for fast random access to all words stored in the index. When you create a Lucene.Net index, you have the option to create multiple fields and store different data in each field. For example, if you are indexing Microsoft Office (Word, Excel, PowerPoint, etc.) files, you can create a field for the filename, a field for the file date, and a field for the body of the document. At search time, you can then narrow your query to just filenames, file dates, or document bodies, or you can combine two or more fields in a single query.

Example 4-1 shows a slightly modified version of the demo code found in Lucene.Net’s source-code distribution. This example application shows you how to create an index and populate it with data. It assumes that you have a folder holding several raw text files. If you don’t have such a folder, you’ll need to create one and populate it with some files. In addition, you will need an empty folder where the index will be stored. The example application will create a subfolder called index for this purpose.

Example 4-1. A Lucene.Net command-line sample application to index a filesystem

using System;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
using IndexWriter = Lucene.Net.Index.IndexWriter;
using Document = Lucene.Net.Documents.Document;
using Field = Lucene.Net.Documents.Field;
using DateTools = Lucene.Net.Documents.DateTools;

namespace Lucene.Net.Demo
{
  class IndexFiles
  {
    internal static readonly System.IO.FileInfo INDEX_DIR =
        new System.IO.FileInfo("index");

    [STAThread]
    public static void  Main(System.String[] args)
    {
      System.String usage = typeof(IndexFiles) + " <root_directory>";
      if (args.Length == 0)
      {
        System.Console.Error.WriteLine("Usage: " + usage);
        System.Environment.Exit(1);
      }

      // If the "index" directory already exists, exit so we don't
      // overwrite an existing index; otherwise IndexWriter will create it.
      bool tmpBool = System.IO.Directory.Exists(INDEX_DIR.FullName);
      if (tmpBool)
      {
        System.Console.Out.WriteLine("Cannot save index to '" +
            INDEX_DIR + "' directory, please delete it first");
        System.Environment.Exit(1);
      }

      System.IO.FileInfo docDir = new System.IO.FileInfo(args[0]);
      tmpBool = System.IO.Directory.Exists(docDir.FullName);
      if (!tmpBool)
      {
        System.Console.Out.WriteLine("Document directory '" +
            docDir.FullName + "' does not exist or is not readable, " +
            "please check the path");
        System.Environment.Exit(1);
      }

      System.DateTime start = System.DateTime.Now;
      try
      {
        IndexWriter writer =
            new IndexWriter(INDEX_DIR, new StandardAnalyzer( ), true);
        System.Console.Out.WriteLine("Indexing to directory '" +
                                     INDEX_DIR + "'...");
        IndexDocs(writer, docDir);
        System.Console.Out.WriteLine("Optimizing...");
        writer.Optimize( );
        writer.Close( );

        System.DateTime end = System.DateTime.Now;
        System.Console.Out.WriteLine((end - start).TotalMilliseconds +
                                     " total milliseconds");
      }
      catch (System.IO.IOException e)
      {
        System.Console.Out.WriteLine(" caught a " + e.GetType( ) +
                                     "\n with message: " + e.Message);
      }
    }

    public static void  IndexDocs(IndexWriter writer,
                                  System.IO.FileInfo file)
    {
      if (System.IO.Directory.Exists(file.FullName))
      {
        System.String[] files =
            System.IO.Directory.GetFileSystemEntries(file.FullName);
        if (files != null)
        {
          for (int i = 0; i < files.Length; i++)
          {
            IndexDocs(writer, new System.IO.FileInfo(files[i]));
          }
        }
      }
      else
      {
        System.Console.Out.WriteLine("adding " + file);
        writer.AddDocument(IndexDocument(file));
      }
    }

    public static Document IndexDocument(System.IO.FileInfo f)
    {
      // Make a new, empty document
      Document doc = new Document( );

      // Add the path of the file as a field named "path".
      // Use a field that is indexed (i.e., searchable), but don't
      // tokenize the field into words.
      doc.Add(new Field("path", f.FullName, Field.Store.YES,
                        Field.Index.UN_TOKENIZED));

      // Add the last modified date of the file to a field named
      // "modified". Use a field that is indexed (i.e., searchable),
      // but don't tokenize the field into words.
      doc.Add(new Field("modified",
                        DateTools.TimeToString(f.LastWriteTime.Ticks,
                        DateTools.Resolution.MINUTE),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));

      // Add the contents of the file to a field named "contents".
      // Specify a Reader, so that the text of the file is tokenized
      // and indexed, but not stored. Note that FileReader expects
      // the file to be in the system's default encoding. If that's
      // not the case, searching for special characters will fail.
      doc.Add(new Field("contents",
                        new System.IO.StreamReader(f.FullName,
                        System.Text.Encoding.Default)));

      // Return the document
      return doc;
    }
  }
}

The key Lucene.Net references used in this example application are StandardAnalyzer, IndexWriter, Document, and Field. We’ll take a look at each of these next.

Understanding analyzers

An analyzer, combined with a stemmer, plays an important role in Lucene.Net. During indexing, the analyzer and stemmer take a stream of raw text and break it into searchable terms. In addition, they remove "noise" from the text (commas, periods, question marks, etc.), as well as common stop words ("this," "that," "then," "is," "a," etc.). Removing noise and common words greatly speeds up searching.

If you want to index non-English data, you can write your own analyzer and stemmer. However, chances are that someone has already written one that fits the bill and contributed it to Lucene.Net. Stemmers are currently available for Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish; they can be found in the contrib folder of the distribution. If you want to write your own, you can use one of the existing analyzers and stemmers as a model.

Our example application uses the standard analyzer that comes with Lucene.Net.
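If you're curious about what the analyzer actually produces, you can feed it a string directly and print the resulting terms. The following is a minimal sketch, not part of the demo application, that assumes the TokenStream/Next( )/TermText( ) API of the 1.9/2.0 releases:

using System;
using Analyzer = Lucene.Net.Analysis.Analyzer;
using Token = Lucene.Net.Analysis.Token;
using TokenStream = Lucene.Net.Analysis.TokenStream;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;

namespace Lucene.Net.Demo
{
  class AnalyzerPeek
  {
    [STAThread]
    public static void Main(System.String[] args)
    {
      Analyzer analyzer = new StandardAnalyzer( );

      // Run a sample sentence through the analyzer and print each term
      TokenStream stream = analyzer.TokenStream("contents",
          new System.IO.StringReader("This is a test of the Analyzer!"));

      Token token;
      while ((token = stream.Next( )) != null)
      {
        System.Console.Out.WriteLine(token.TermText( ));
      }
      // Should print just "test" and "analyzer": stop words and
      // punctuation are dropped, and the terms are lowercased.
    }
  }
}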

Understanding the role of the IndexWriter

The following line:

IndexWriter writer =
    new IndexWriter(INDEX_DIR, new StandardAnalyzer( ), true);

creates or opens an index. This is done through the IndexWriter object. An IndexWriter is used whenever you want to add anything to or delete anything from an index. The first parameter is the path to the index. The second parameter is an analyzer (discussed in the previous section); if you wrote your own analyzer, you would specify it here. The last parameter tells the IndexWriter constructor whether to create a new index (true), overwriting anything already at that path, or to open an existing index for appending (false).
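If the index already exists and you simply want to append more documents to it, pass false instead. Here is a minimal sketch using the same INDEX_DIR and analyzer as Example 4-1:

// Reuse the existing index rather than overwriting it:
// false tells IndexWriter not to create a new index
IndexWriter writer =
    new IndexWriter(INDEX_DIR, new StandardAnalyzer( ), false);

// ... add more documents with writer.AddDocument( ) ...

writer.Optimize( );
writer.Close( );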

Once an index has been created or opened, you're ready to modify it. In our example, we are indexing a filesystem, which means we will read a folder and the subfolders it contains. As IndexDocs( ) recurses through the filesystem, it passes each file it visits to IndexDocument( ), which reads the file as text and constructs what is known as a Lucene.Net Document; that Document is then handed to the writer's AddDocument( ) method.

Think of a Document as a virtual document that contains metadata: the title, author, publication date, and chapters. For each file you index, a separate Document is created, like so:

Document doc = new Document( );

Adding data to a document

Once you’ve created a Document, you’ll need to add data to it. This is done by creating a Field for each piece of metadata in your file. For example, in the sample application, we created a Field called path that holds the path to the file we are indexing, a Field called modified that holds the date the file was last modified, and a Field called contents that holds the document’s raw text content. You can create more Fields as your application requires. When you create a Field, you also specify how it should be handled: whether its value is stored in the index and whether it is tokenized by the analyzer.

The three Fields in our sample application are added to a Document like so:

doc.Add(new Field("path", f.FullName, Field.Store.YES,
                  Field.Index.UN_TOKENIZED));

doc.Add(new Field("modified",
                  DateTools.TimeToString(f.LastWriteTime.Ticks,
                  DateTools.Resolution.MINUTE), Field.Store.YES,
                  Field.Index.UN_TOKENIZED));

doc.Add(new Field("contents", new System.IO.StreamReader(f.FullName,
                  System.Text.Encoding.Default)));
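The Field.Store and Field.Index arguments decide whether the original value is kept in the index and whether it is run through the analyzer. A few illustrative combinations follow (this is a sketch, and the field names used here are made up rather than part of the demo):

// Stored and tokenized: searchable word by word, and the original
// text can be read back from a search hit
doc.Add(new Field("title", "Windows Developer Power Tools",
                  Field.Store.YES, Field.Index.TOKENIZED));

// Stored but not indexed: carried along for display, but you
// can't search on it
doc.Add(new Field("owner", "jsmith",
                  Field.Store.YES, Field.Index.NO));

// Indexed but not stored: searchable, but the raw text is not kept
// in the index (the Reader-based "contents" field above behaves
// the same way)
doc.Add(new Field("keywords", "lucene index search",
                  Field.Store.NO, Field.Index.TOKENIZED));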

After you’ve populated a Document object with Field objects, you’re ready to add the Document to the index:

writer.AddDocument(IndexDocument(file));

Running the IndexFiles application

From the command line, run the IndexFiles application against the folder you have populated with raw text files. You can also simply point IndexFiles to the Lucene.Net source directory, and IndexFiles will index the Lucene.Net source files for you. To start IndexFiles, issue the following command from the bin directory: IndexFiles C:\Lucene.Net\. Once IndexFiles is done indexing your files, it creates a directory called index in the current directory and stores the index in it.

Searching an index

Searching in Lucene.Net follows the same general pattern as indexing, and it is where most of the library's power lies. Expect to spend more time with Lucene.Net's search APIs than with the indexing ones.

There are several ways you can search your data. You can search a single index, or you can search multiple indexes at once using MultiSearcher. Splitting your data across two or more indexes can give you faster searching, better tuning, and greater control.

For example, you can separate your data into date ranges, perhaps creating an index for each month. This will allow you to narrow your search to a particular month’s index or combine multiple months’ indexes. (Obviously, this kind of index creation doesn’t have to be date-related; it can be based on any useful criteria.)
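Here is a minimal sketch of searching two month-based indexes as one (the index folder names are hypothetical; MultiSearcher, Searchable, IndexSearcher, and Searcher all live in the Lucene.Net.Search namespace):

// Open one searcher per monthly index, then treat them as a single
// searcher; hits come back from both indexes
Searchable[] searchables = new Searchable[] {
    new IndexSearcher(@"index-2006-01"),
    new IndexSearcher(@"index-2006-02")
};
Searcher searcher = new MultiSearcher(searchables);
// searcher.Search(query) now returns hits from both indexes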

In addition to the MultiSearcher, Lucene.Net also offers the RemoteSearchable capability, which lets you search one or more indexes residing on different servers.

Lucene.Net also gives you the power and flexibility of searching on one or more fields, individually weighting (boosting) any of your fields, and applying query criteria such as Boolean AND, OR, and NOT operators, proximity searches, and date ranges. What's more, you can update an index and search it at the same time; once the update is done, just close your searcher and reopen it, and the updated data will be available.
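You can also build queries in code rather than parsing user input. The sketch below combines TermQuery objects in a BooleanQuery and boosts one clause; it uses the 2.0-style BooleanClause.Occur values (the 1.4.3 Add( ) overload takes required/prohibited flags instead) and assumes a Searcher like the MultiSearcher shown earlier. TermQuery, BooleanQuery, BooleanClause, and Hits live in Lucene.Net.Search, and Term lives in Lucene.Net.Index.

// "index" must appear in the contents field, "test" must not;
// a match on the path field counts twice as much as usual
TermQuery contents = new TermQuery(new Term("contents", "index"));
TermQuery path = new TermQuery(new Term("path", "demo"));
path.SetBoost(2.0f);

BooleanQuery query = new BooleanQuery( );
query.Add(contents, BooleanClause.Occur.MUST);
query.Add(path, BooleanClause.Occur.SHOULD);
query.Add(new TermQuery(new Term("contents", "test")),
          BooleanClause.Occur.MUST_NOT);

Hits hits = searcher.Search(query);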

Our Lucene.Net example application will show you how to search the index that we created in Example 4-1, where we indexed the filesystem. Example 4-2 shows a slightly modified version of the demo code found in Lucene.Net’s source-code distribution.

Example 4-2. A Lucene.Net command-line sample application to search an index

using System;
using Analyzer = Lucene.Net.Analysis.Analyzer;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
using Document = Lucene.Net.Documents.Document;
using QueryParser = Lucene.Net.QueryParsers.QueryParser;
using Hits = Lucene.Net.Search.Hits;
using IndexSearcher = Lucene.Net.Search.IndexSearcher;
using Query = Lucene.Net.Search.Query;
using Searcher = Lucene.Net.Search.Searcher;

namespace Lucene.Net.Demo
{
  class SearchFiles
  {
    [STAThread]
    public static void  Main(System.String[] args)
    {
      try
      {
        Searcher searcher = new IndexSearcher(@"index");
        Analyzer analyzer = new StandardAnalyzer( );


        // Read search queries from standard input, using the
        // console's default encoding
        System.IO.StreamReader streamReader =
            new System.IO.StreamReader(
                System.Console.OpenStandardInput( ),
                System.Text.Encoding.Default);
        while (true)
        {
          System.Console.Out.Write("Query: ");
          System.String line = streamReader.ReadLine( );

          if (line == null || line.Length == 0)
            break;

          Query query = QueryParser.Parse(line, "contents", analyzer);
          System.Console.Out.WriteLine("Searching for: " +
                                       query.ToString("contents"));

          Hits hits = searcher.Search(query);
          System.Console.Out.WriteLine(hits.Length( ) +
                                       " total matching documents");

          int HITS_PER_PAGE = 10;
          for (int start = 0; start < hits.Length( ); start += HITS_PER_PAGE)
          {
            int end = System.Math.Min(hits.Length( ), start + HITS_PER_PAGE);
            for (int i = start; i < end; i++)
            {
              Document doc = hits.Doc(i);
              System.String path = doc.Get("path");
              if (path != null)
              {
                System.Console.Out.WriteLine(i + ". " + path);
              }
              else
              {
                System.String url = doc.Get("url");
                if (url != null)
                {
                  System.Console.Out.WriteLine(i + ". " + url);
                  System.Console.Out.WriteLine("   - " + doc.Get("title"));
                }
                else
                {
                  System.Console.Out.WriteLine(i + ". " +
                                     "No path nor URL for this document");
                }
              }
            }

            if (hits.Length( ) > end)
            {
              System.Console.Out.Write("more (y/n) ? ");
              line = streamReader.ReadLine( );
              if (line == null || line.Length == 0 || line[0] == 'n')
                break;
            }
          }
        }
        searcher.Close( );
      }
      catch (System.Exception e)
      {
        System.Console.Out.WriteLine(" caught a " + e.GetType( ) +
                                     "\n with message: " + e.Message);
      }
    }
  }
}

In this example application, the key Lucene.Net references being used are StandardAnalyzer, Document, QueryParser, Hits, IndexSearcher, Query, and Searcher.

Understanding searchers

A Searcher is the front door to your index. Through it, you can search a single index or multiple indexes, located locally on your hard drive or remotely on different machines. The following line:

Searcher searcher = new IndexSearcher(@"index");

creates a Searcher object by instantiating an IndexSearcher. The parameter passed to IndexSearcher is the name of a folder containing an index, expressed as either a full path or a relative path.

Using analyzers in searching

We used analyzers when we created the index, so why do we need one again during searching? During indexing, we used an analyzer to clean up our raw text, and the same rules must be applied to the text a user types at the search prompt. Furthermore, the same type of analyzer must be used for searching as for indexing, or the search results will not be correct; even worse, no hits may be returned at all.

This line creates the matching analyzer:

Analyzer analyzer = new StandardAnalyzer( );

Revisiting documents

We also covered the Document class during indexing. At search time, we use a Document object to hold information about a hit resulting from a search query. The Document object contains the fields and the data in those fields.

In our example application, a reference to a Document object is retrieved like so:

Document doc = hits.Doc(i);
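Only fields that were stored at indexing time (Field.Store.YES) can be read back this way; the contents field in Example 4-1 was indexed but not stored, so doc.Get("contents") returns null. As a small sketch, the stored modified field can be converted back to a DateTime with DateTools, assuming the StringToDate( ) helper that matches the TimeToString( ) call used during indexing:

// "path" and "modified" were stored during indexing, so they can
// be read back from the hit
System.String path = doc.Get("path");
System.DateTime modified =
    Lucene.Net.Documents.DateTools.StringToDate(doc.Get("modified"));
System.Console.Out.WriteLine(path + " (last modified " + modified + ")");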

Parsing user input with QueryParser

A QueryParser works hand-in-hand with an analyzer. The job of the QueryParser is to take a user’s query, apply the same rules as the analyzer, and figure out what the user is searching for.

For example, if your search query is +cat +dog, the QueryParser will know that you are searching for documents that contain both the word cat and the word dog, looked up in the default field unless you specify otherwise.

Tip

The + option marks a term as a required part of the query.

Lucene.Net supports several such power-search features. You can do a Boolean search using OR, AND, and NOT terms, and you can limit your search to a particular field.

In our example application, a QueryParser is created like so:

Query query = QueryParser.Parse(line, "contents", analyzer);

Here, we pass three parameters to the parser. The first is the string that the user typed (the search query). The second parameter is the name of the default field to search, used whenever the query doesn't name a field explicitly with the field:term syntax. (If you need to search several fields at once, Lucene.Net also provides a MultiFieldQueryParser.) The final parameter is the analyzer.
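Given the fields created in Example 4-1, here are a few query strings the QueryParser understands (the terms themselves are just placeholders):

index writer                     either word, in the default contents field
+index +writer                   both words are required
"index writer"                   the exact phrase
contents:index AND path:demo     limit each term to a specific field
index NOT test                   match index, but exclude documents containing test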

Working with search hits

A Hits collection is what you get back as a result of running a search query. If your search query returns hits, you use the Hits object to iterate over a list of Document objects.

In our example application, a reference to a Hits object is returned like so:

Hits hits = searcher.Search(query);

Remember that we instantiated a Searcher object and pointed it at our index folder. Now we’re passing it a reference to the Query object discussed previously. This kind of abstraction is what makes Lucene.Net so flexible and powerful; working with an index is consistent, regardless of whether you’re using one or more indexes and whether they’re local or remote. Additionally, the search behavior is consistent, whether you have one query or a combination of queries.

Running the SearchFiles application

When you’re ready to run the application, move to the folder that contains the index subdirectory (the folder you ran IndexFiles from). Once you are in that folder, run the SearchFiles application by typing its name (using the fully qualified pathname if you haven’t copied the executable into that folder).

Getting Support

Since Lucene.Net is an open source project in incubation at the ASF, support is provided through its mailing list, which is linked from the project’s home page. Subscribe to the list and post your questions there. Questions are generally answered in a timely fashion, and the community is eager to grow.
