Chapter 4. Query and Results Management

Chapter 3 described the technology as far as the index being created. The next two steps in the search journey are to find the documents that match a query and then to provide a list of the documents ranked in descending order of relevance. This is where the search technology becomes more visible. At the risk of an inappropriate generalization, all search applications work along the lines set out in Chapter 3. It is in the management of queries and results, and in the display of results (described in Chapter 12), that differences become more obvious.

Query Management

Query management presents some substantial processing challenges. In the case of indexing, there is an enormous amount of information that the indexing process can use to “understand” the nature of the information content. Users then type in a single word and expect the search engine to undertake a mind meld and work out what the query is really about. The query processing stage has to be able to undertake four processes very quickly:

Check for obvious spelling mistakes and offer suggestions for correct spellings
Use stemming and lemmatization to develop a range of potential query terms
Identify entities or phrases that may need to be clarified or expanded
Apply some semantic analysis to gain an insight into the likely nature of the query

A good example of query management in action can be seen on the public website of Microsoft. Typing the word SharePoint will cause a drop-down list of key variants to appear, such as SharePoint 2007 and SharePoint 2010, as well as shared view. Sticking with SharePoint will produce a list of results that are entry-level publications on SharePoint for people who have no previous knowledge of the application. For the query SharePoint 2010 Disaster Recovery, none of the entry-level results are anywhere to be seen, as the query processor has makes the assumption that anyone asking this question clearly knows about SharePoint technology.

For many years now, there has been a significant amount of interest in using natural language processing (NLP—not to be confused with neuro-linguistic programming!) for queries. The user types a sentence such as Find all the projects we have carried out in India with a gross margin of more than 30%. This is a well-formed instruction, but the information needs to have been indexed in order for the application to provide an answer. The margin information might be held in a finance system and there is no link to the project lists held in SharePoint 2010.

As with all aspects of search technology, it only matters that the particular approach to query management taken by a vendor works for the queries that your organization is going to generate. This is why it is important to undertake the user requirements analysis, come up with personas and use cases, and then work up some typical queries that can be used in the proof-of-concept tests.

Exploratory Search

Enterprise search performance assessment is often fixated around the need to provide a specific document/information in response to a query, usually because the information architecture of the intranet or document management system is so broken that the document has effectively vanished. Little, if indeed any, attention is paid to exploratory search, where the user has little prior understanding of what query terms might be appropriate and how rich the collection might be in relevant documents. When users are presented with perhaps anything more than 200 relevant documents, they may well be overwhelmed with choice and then not know how to refine the search.

An engineer may be looking for ideas to reduce the failure of a power screwdriver in low temperatures. The immediate problem would be to define what is meant by “low temperature.” If she defines this to be anything below 0° C, then this could be used in a range search. However, if her company has an engineering base in the United States, there could be a problem over temperature indices, as the test documents might be in Fahrenheit. This is a somewhat artificial example, but should be enough to show that just developing a starting point for a query is far from easy, even for people skilled in the subject area.

Spellchecking

The quality of the search application’s spellchecker makes a significant difference to user satisfaction, as it speeds up the search process by not wasting time looking for words that do not exist. In addition to spotting incorrect spelling, a good spellchecker offers suggestions, and this feature can be extended to auto-complete by presenting users with a list of words that match the query terms as they are being typed into the search box. Finding the balance between being helpful and getting in the way is not easy.

The suggestions will be made from a spelling dictionary that is generated from the index terms, but it should also be possible to add in special terms that are of value to the organization. This could include key members of staff with names that are difficult to spell correctly or office locations in places like Rawalpindi (the second a in the name is easily forgotten because it is not pronounced). The same goes for the p in raspberry.

Retrieval Models

There are four different approaches (often referred to as models) to manage the process of matching the query against the index and delivering a set of results.

The first of these models is Boolean retrieval. It is named after George Boole (1815–1864), who as an English mathematician with a special interest in algebraic logic, in which logical propositions could be expressed in algebraic terms. Boole’s work was taken up by Claude Shannon in the late 1930s as the basis for managing telephone circuits, and later, the circuits in digital computers. Boolean algebra is characterized by the use of the operators AND, NOT, and OR.

A query about the London 2012 Olympics could be represented as follows:

London AND Olympics AND 2012

If the user was interested in information about both the 2012 and 1948 Olympics, then the following query could be used:

London AND Olympics AND (2012 OR 1948)

The nested logic within the parentheses is familiar to anyone who has had to create formulae in an Excel spreadsheet.

This approach was taken by all the early search applications, but has the fundamental problem that the documents returned either meet or do not meet the query term. There is no room for fuzziness. Adding in more terms to try to be specific can result in relevant documents being excluded. It is not possible to rank the set of results as a list in descending order of relevance. This order is referred to as a ranked list.

To overcome this problem, Gerard Salton developed the vector space model in the late 1960s, although he did not publish the core papers on his work until the early 1970s. The mathematics of this model is very complex, but in principle it enables the computation of how similar a document is to the terms in the query. Many current search applications make use of the vector space model.

Search application vendors are usually unwilling to reveal exactly which model they are using in their products, and in any case, it is not just the retrieval model but how the results are ranked that is of importance to a customer. Each has strong proponents, but there is no one ideal model.

Parametric Search

Parametric search enables the user to build complex Boolean queries from a set of facets or characteristics applicable to the content being searched. This is the approach that is widely used for Advanced Search on websites and intranets. Usually the parameters are selected from drop-down lists despite the challenges these present in terms of accessibility. One of the benefits of a parametric search is that it facilitates the use of the OR operator. Users can then search for information on projects in Kuwait OR Dubai OR Oman. Users of web search applications often overlook the implicit AND that is the default option, so that this query becomes Kuwait AND Dubai AND Oman, which only finds documents where all three locations are mentioned.

Although parametric search can be of significant benefit to users, the major problem is that a user does not know the extent to which a parameter may be having a major impact on the number of documents retrieved. In the days of remote access online database services such as Lockheed Dialog and SDC Orbit, it was possible to get a return from the search application of the number of documents that met each parameter and also the number that met all the parameters. This is rarely available in enterprise search applications, so the user has to wait for the results page to be presented before realizing that the selection of parameters is not optimal.

Filters and Facets

The value of facets and filters can be very considerable but only when implemented with care. The two terms are often used interchangeably, but the differences between a filter and a facet are important. A filter reduces a set of results by whether or not they meet defined criteria, such as being published in a particular year or being related to a specific subsidiary or sales region. A filter will therefore remove some items from a result set so that, for example, only sales reports for the Nordic region are presented and reports for North America are excluded. Any further search refinement is then carried out on the reduced set of results.

Facets present a set of characteristics of the information in the repository, listing out elements such as year of publication, geographic region, size of project, and perhaps even the name of the author. Once the results of a query are listed out, the sets of facets, usually on the lefthand side of the results page, show the counts of the number of documents that contain the element. Computationally, this is quite a challenge. The two main approaches are often referred to as top down and bottom up. In the top down approach, the number of hits per document is calculated from the inverted index. The bottom up approach works through the documents in the results set and then accumulates the number of occurrences. There are also some combined approaches, and exactly how the facet hit values are derived tends to be one of the nondisclosable elements of a commercial search application.

One result of the trade-offs that have to be made between the computational demands of accurate facet hit counting and being able to deliver results as expeditiously as possible is that the hit counts may be approximated. This is most obvious when the sum of the individual counts does not match the headline count.

For example, the Year facet may show there are 4,503 hits, but the individual year counts are 1,417, 734, 344, and 239, and continue to decrease by year. The user might be forgiven for wondering what happened to the other 2,000 or so results.

For filters and facets to work, there must be a very high standard of content quality and of metadata tagging. The user needs to be able to trust that if he selects 2013 sales reports, he does not also get some 2012 sales reports that have been tagged by a modified date and not the date of publication. A false drop like this on a set will cause the user to question the integrity of the search application.

Ranking

The emphasis on ranking models in this chapter is because they are, to a very significant extent, the magic sauce that differentiates search applications. Although there is some scope to develop innovative technologies for stemming and tokenizing, these primarily affect performance and the ability to index complex documents. When it comes to ranking, the benefits could well be immediately obvious but perhaps only for certain categories of content where a particular algorithm works very well. The only way to find out which is the better ranking approach adopted by two different vendors is to assess them on defined, known collections of documents under controlled test conditions. That is not going to happen in the harsh procurement world of enterprise search, even if it should. Ad hoc tests will just confuse the situation.

There are a number of approaches to trying to give the user the documents she needs on the first page of the search results. Absolute query and relative query boosting are two examples of static ranking, and are based on business rules. For a number of queries, there could be one or more documents that are important to display either as the first or second result, or above the list of results. For example, any search for some HR-related terms such as maternity leave or paternity leave will always result in the user being presented with both the global HR policy document and the HR policy for the local unit. Both may be highly relevant because a manager located in India may want to check what the rules are in Sweden. This is sometimes referred to as a Best Bet, or absolute query boosting.

Under relative query boosting, for certain queries there could be one or more documents that a user should be made aware of, but which do not merit being placed at the beginning of a results list. Any search on corporate performance might have a rule that ensures that the latest quarterly report is always in the top 20 results, or possibly a PowerPoint presentation given to investors.

Ranking by decreasing relevance is just one possible sequence. In the world of enterprise search, date order can be important. A manager either wants to find out the most recent project reports listed in reverse chronological order (most recent first), or needs to find out the first time that a particular chemical was synthesized in the company’s research laboratories (oldest first).

Without a doubt the most challenging task in search management is optimizing the ranking of search results. It requires a combination of knowledge, including the following:

How to change the weights of each of the ranking parameters
The content being searched
The language (specialized terms) of the organization
The business processes in the organization that might result in a requirement to search
The expectations of the search user

This knowledge is very unlikely to be found in one person and is the main reason why a search support team is essential in achieving the highest possible levels of search satisfaction. Ideally, your team will have a combined knowledge of computational linguistics, information science, the mathematics of probability, and a sprinkling of computer science.

Summarization

The entries in the results list will usually include a title, some additional data about the document (i.e., metadata), and then a summary of the document. There are many different ways of creating these summaries, including taking highly relevant sentences from the document and reproducing the text document, or displaying the search term within the context of a few words taken from the sentence in which it appears. This latter approach would seem to be an effective approach, but in a long document (a feature of enterprise requirements), this may just be a few sentences from a 200-page project report and may not be representative of the entire report.

A considerable amount of research continues to be undertaken into creating document summaries rather than using the rules-based approach, and it is very likely that there will be substantial enhancements to result summarization over the next few years. Although the algorithms are now relatively well developed, the challenge is how to endure that the processing time for summarization does not have a significant impact on the speed with which the results are presented.

Often little attention is paid to the presentation of summaries in search results, and yet there are often significant differences between search applications on how the summarization is achieved and presented. It is important to assess these in the specification and selection process, using some test collections to assess the differences. Another factor to take into account is the ease with which the summarization process can be optimized after installation.

Document Thumbnails

Another approach is to display an HTML thumbnail of the document with the search terms highlighted, and with the facility to step through each occurrence of the term. Again, the more terms, the less successful this approach becomes, but it is especially useful for PowerPoint presentations when users are looking for a slide on which they remember there was an especially clever diagram that they would like to reuse. The extent to which thumbnails can be generated will depend on the file format of the document.

Another benefit of using document thumbnails is that it avoids the need to open up the document in another application just to view it for long enough to determine that the document is not relevant and close it down again. That puts quite a load on the hardware and network bandwidth. The quality of the thumbnails varies as far as being an accurate representation of the document.

There are some third-party suppliers of software to generate document thumbnails, including the Finnish company Documill.

Query Auto-Completion

A popular feature of search applications is the listing of computer-generated queries that take the initial search term as a starting point. If I search for ford on Google, the suggestions returned are Ford, Ford Focus, Ford UK, and Ford Mondeo, all easily recognizable as frequent queries from a UK domain. Changing the search to ford depth (because I am interested in finding the deepest ford in the UK) then generates Ford depth gauge (which is close) but also Deptford Goth (a rock group) and Eagle ford depth, which is a rock formation in Texas.

As with so much of search technology, there is a lot going on behind query auto-completion (QAC). As the query is typed, the words are matched against a database of the frequency of query terms. This works well when there are very large numbers of queries logged by web search applications, but in an enterprise environment that is not the case. Instead, the suggestions are derived from the document index, though often only a partial index because of the need for a very quick response to the query. It is not unusual to find that this index is not security trimmed. Typing in redundancy might display redundancy discussions London and then mysteriously show no results because the documents relating to the discussions are locked down to a few designated managers.

Despite the apparent popularity of QAC, until recently, little research has been carried out on how users interact with the suggested queries. More research will undoubtedly now be carried out (it only takes one paper to catalyze a research program!), and over the next few years there are likely to be some significant improvements to QAC, to an important extent driven by the benefits in mobile search.

Federated Search

In theory, the ideal strategy for search would be to have a query box that enabled a user to search all the information in the organization with a single query, no matter in which repository the information is stored. Several vendors are currently suggesting that information silos have to be broken down if the organization is to flourish.

There are two options:

Option A: One big index

In principle, it is possible to crawl and index any number of individual search applications, or business applications with a search component, and create a single index. That is not difficult. What is difficult is creating a ranking list of results that make any sort of sense to the user, including presenting them in a consistent way.

Given the number of applications and thus, the size of the index, the results lists are likely to be quite long. Delivering results with high precision is very difficult. Disaster recovery planning is very important in this option, as there are many points of failure. It is important to manage crawl schedules with care—one schedule will certainly not meet all requirements.

Option B: Query federation

The query is managed by one search application, which then sends out the query to other search applications. Results from all the applications are then either integrated, or more usually, presented in a number of different sections of the search results page. It is not sensible to produce a “ranked” list of results, as ranking cannot be calculated as an algorithm of the ranking in each of the individual applications. Technically, the presentation of results from query federation is known as the interleaving problem—how can the results be presented so that the user is able to appreciate how the items have been ranked against one another?

Both options require the use of connectors (see Chapter 3), which are challenging to write and maintain. A small change in configuration in one of the queried applications may end up disconnecting the connector. Commercial search vendors, such as BA Insight and Coveo, have libraries of connectors that they maintain, but they can also be obtained from systems integrators such as Search Technologies. When a connector between two search applications fails (though often they just fail to perform as expected), there is always an interesting discussion between the vendors concerned about which end of the connector has failed. Connectors will also manage security protocols, either through early or late binding. Almost inevitably, matching the security models in each application will introduce some latency into the delivery of results, and this needs to be carefully managed.

This approach works well when it is possible to query a search application (e.g., from a commercial publisher) but not crawl and index it.

There are quite a number of factors to take into account when considering a federated search approach:

In both options, users are also able to query individual search applications. A lawyer looking for information from a matter database may not also wish to see apparently relevant results from the corporate intranet.
It may not be possible to implement query suggestion and offer the same set of filters and facets to help manage long results lists, as these are driven by the metadata schemas used on the respective repositories and indexes.
The user interface may become very difficult to use. If the results are divided up into small screen areas for each repository, then perhaps only a few results can be seen without scrolling. Google research indicates that users can scan a list of Google results in around nine seconds. This is a good benchmark against which to judge the time users take to review federated results
Managing access permissions can be a significant concern, as the permissions may not map uniformly across all the repositories. This can slow down the presentation of a complete list of results.

Another challenge with federated search is making sense of the search logs, especially in the case of Option B. In the case of either option, adding a new (or even upgraded) search application to the list of searches or repositories can make such substantial changes to the ranking of results that users should be forgiven for thinking that search is broken. A final challenge is how to cope with cloud-based applications, such as searching both on-premise SharePoint and Google Apps in the cloud.

The devil is in the details, and there is no substitute for a prolonged period of both requirements gathering and proof-of-concept testing. An increasing number of commercial vendors offer some form of federated search, but do take the time to read the small print and at the end of each sentence write a short essay on “what the implications are for us.” You can, of course, build a federated application in open source software, but at present this option is only for seriously brave and experienced search teams.

Summary

The technology described in this chapter is the base technology for all search applications. Each vendor, or open source application, will have its own approach to exactly how each of these processes is delivered, but trying to differentiate between them on the basis of these core processes is not a good use of time. As with any software application, it is not how a process is carried out but whether the results are of value to the organization.

Enterprise Search, 2nd Edition by Martin White