12.6. Creating an Index of XML Documents
Problem
You need to quickly search a collection of XML documents. To do this, you need an index of terms that keeps track of the context in which each term appears.
Solution
Use Jakarta Lucene and Jakarta Digester, and create an index of Lucene Document objects for the lowest level of granularity you wish to search. For example, if you are attempting to search for speeches in a Shakespeare play that contain specific terms, create a Lucene Document object for each speech. For the purposes of this recipe, assume that you are attempting to index Shakespeare plays stored in the following XML format:
<?xml version="1.0"?>
<PLAY>
  <TITLE>All's Well That Ends Well</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE>
      <TITLE>SCENE I. Rousillon. The COUNT's palace.</TITLE>
      <SPEECH>
        <SPEAKER>COUNTESS</SPEAKER>
        <LINE>In delivering my son from me, I bury a second husband.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>BERTRAM</SPEAKER>
        <LINE>And I in going, madam, weep o'er my father's death</LINE>
        <LINE>anew: but I must attend his majesty's command, to</LINE>
        <LINE>whom I am now in ward, evermore in subjection.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
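To make the "one Document per speech" granularity concrete, here is a minimal sketch that splits a play in the format shown above into one record per SPEECH element. It is not the recipe's implementation: it uses the JDK's built-in DOM parser rather than Jakarta Digester, and the SpeechRecord and SpeechSplitter classes are hypothetical names introduced only for illustration. Each record holds the values that would plausibly become the fields of one Lucene Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical value class (not from the recipe): one instance per <SPEECH>.
class SpeechRecord {
    final String play;
    final String speaker;
    final String text;

    SpeechRecord(String play, String speaker, String text) {
        this.play = play;
        this.speaker = speaker;
        this.text = text;
    }
}

// Hypothetical helper (not from the recipe): splits a play into speeches.
class SpeechSplitter {

    // Parses the play XML and returns one record per speech, in document order.
    // Assumes the play has a TITLE and every SPEECH has at least one SPEAKER.
    static List<SpeechRecord> split(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        org.w3c.dom.Document dom =
            builder.parse(new InputSource(new StringReader(xml)));

        // The first TITLE in document order is the title of the play itself.
        String play = dom.getElementsByTagName("TITLE").item(0).getTextContent();

        List<SpeechRecord> records = new ArrayList<>();
        NodeList speeches = dom.getElementsByTagName("SPEECH");
        for (int i = 0; i < speeches.getLength(); i++) {
            Element speech = (Element) speeches.item(i);

            // Only the first SPEAKER is kept; this is a simplification.
            String speaker =
                speech.getElementsByTagName("SPEAKER").item(0).getTextContent();

            // Join every LINE of the speech into one searchable block of text.
            NodeList lines = speech.getElementsByTagName("LINE");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < lines.getLength(); j++) {
                if (j > 0) {
                    text.append(' ');
                }
                text.append(lines.item(j).getTextContent());
            }
            records.add(new SpeechRecord(play, speaker, text.toString()));
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<PLAY><TITLE>All's Well That Ends Well</TITLE><ACT><TITLE>ACT I</TITLE>"
            + "<SCENE><TITLE>SCENE I. Rousillon. The COUNT's palace.</TITLE>"
            + "<SPEECH><SPEAKER>COUNTESS</SPEAKER>"
            + "<LINE>In delivering my son from me, I bury a second husband.</LINE>"
            + "</SPEECH></SCENE></ACT></PLAY>";
        for (SpeechRecord record : split(xml)) {
            System.out.println(record.speaker + ": " + record.text);
        }
    }
}
```

Choosing the speech as the unit means a search hit can point at a specific speaker and passage rather than at a whole play. Indexing at the scene or act level would give fewer, larger documents and coarser results.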
The following class creates a Lucene index of Shakespeare speeches, reading XML files for each play in the ./data/Shakespeare directory, and calling the PlayIndexer to create Lucene Document objects for every speech. These Document objects are then written to a Lucene index using an IndexWriter:
import java.io.File; ...