Using FULLTEXT Searches

Problem

You want to search through a lot of text.

Solution

Use a FULLTEXT index.

Discussion

You can use pattern matches to look through any number of rows, but as the amount of text goes up, the match operation can become quite slow. It’s also common to look for the same text in several string columns, which with pattern matching tends to result in unwieldy queries:

SELECT * from tbl_name
WHERE col1 LIKE 'pat' OR col2 LIKE 'pat' OR col3 LIKE 'pat' ...

A useful alternative (available as of MySQL 3.23.23) is to use FULLTEXT searching, which is designed for looking through large amounts of text, and can search multiple columns simultaneously. To use this capability, add a FULLTEXT index to your table, then use the MATCH operator to look for strings in the indexed column or columns. FULLTEXT indexing can be used with MyISAM tables, for columns of type CHAR, VARCHAR, or TEXT.

FULLTEXT searching is best illustrated with a reasonably good-sized body of text. If you don’t have a sample dataset, several repositories of freely available electronic text are available on the Internet. For the examples here, the one I’ve chosen is the complete text of the King James Version of the Bible (KJV), which is relatively large and has the advantage of being nicely structured by book, chapter, and verse. Because of its size, this dataset is not included with the recipes distribution, but is available separately as the mcb-kjv distribution at the MySQL Cookbook web site.[25] (See Appendix A.) The distribution includes a file kjv.txt that contains the verse records. Some sample records look like this:

O   Genesis 1   1   1   In the beginning God created the heaven and the earth.
O   Exodus  2   20  13  Thou shalt not kill.
N   Luke    42  17  32  Remember Lot's wife.

Each record contains the following fields:

  • Book section. This is either O or N, signifying the Old or New Testament.

  • Book name and corresponding book number, from 1 to 66.

  • Chapter and verse numbers.

  • Text of the verse.

To import the records into MySQL, create a table named kjv that looks like this:

CREATE TABLE kjv
(
    bsect   ENUM('O','N') NOT NULL,         # book section (testament)
    bname   VARCHAR(20) NOT NULL,           # book name
    bnum    TINYINT UNSIGNED NOT NULL,      # book number
    cnum    TINYINT UNSIGNED NOT NULL,      # chapter number
    vnum    TINYINT UNSIGNED NOT NULL,      # verse number
    vtext   TEXT NOT NULL                   # text of verse
) TYPE = MyISAM;

Then load the kjv.txt file into the table using this statement:

mysql> LOAD DATA LOCAL INFILE 'kjv.txt' INTO TABLE kjv;

You’ll notice that the kjv table contains columns for both book names (Genesis, Exodus, ...) and book numbers (1, 2, ...). The names and numbers have a fixed correspondence, and one can be derived from the other—a redundancy that means the table is not in normal form. It’s possible to eliminate the redundancy by storing just the book numbers (which take less space than the names), and then producing the names when necessary in query results by joining the numbers to a small mapping table that associates each book number with the corresponding name. But I want to avoid using joins at this point. Thus, the table includes book names so that search results can be interpreted more easily, and numbers so that the results can be sorted easily into book order.

After populating the table, prepare it for use in FULLTEXT searching by adding a FULLTEXT index. This can be done using an ALTER TABLE statement:[26]

mysql> ALTER TABLE kjv ADD FULLTEXT (vtext);

To perform a search using the index, use MATCH( ) to name the indexed column and AGAINST( ) to specify what text to look for. For example, to answer the question “How often does the name Mizraim occur?” (you’ve often wondered about that, right?), search the vtext column using this query:

mysql> SELECT COUNT(*) from kjv WHERE MATCH(vtext) AGAINST('Mizraim');
+----------+
| COUNT(*) |
+----------+
|        4 |
+----------+

To find out what those verses are, select the columns you want to see (the example here uses \G so that the results better fit the page):

mysql> SELECT bname, cnum, vnum, vtext
    -> FROM kjv WHERE MATCH(vtext) AGAINST('Mizraim')\G
*************************** 1. row ***************************
bname: Genesis
 cnum: 10
 vnum: 6
vtext: And the sons of Ham; Cush, and Mizraim, and Phut, and Canaan.
*************************** 2. row ***************************
bname: Genesis
 cnum: 10
 vnum: 13
vtext: And Mizraim begat Ludim, and Anamim, and Lehabim, and Naphtuhim,
*************************** 3. row ***************************
bname: 1 Chronicles
 cnum: 1
 vnum: 8
vtext: The sons of Ham; Cush, and Mizraim, Put, and Canaan.
*************************** 4. row ***************************
bname: 1 Chronicles
 cnum: 1
 vnum: 11
vtext: And Mizraim begat Ludim, and Anamim, and Lehabim, and Naphtuhim,

The results come out in book, chapter, and verse number order in this particular instance, but that’s actually just coincidence. By default, FULLTEXT searches compute a relevance ranking and use it for sorting. To make sure a search result is sorted the way you want, add an explicit ORDER BY clause:

SELECT bname, cnum, vnum, vtext
FROM kjv WHERE MATCH(vtext) AGAINST('search string')
ORDER BY bnum, cnum, vnum;

You can include additional criteria to narrow the search further. The following queries perform progressively more specific searches to find out how often the name Abraham occurs in the entire KJV, the New Testament, the book of Hebrews, and Chapter 11 of Hebrews:

mysql> SELECT COUNT(*) from kjv WHERE MATCH(vtext) AGAINST('Abraham');
+----------+
| COUNT(*) |
+----------+
|      216 |
+----------+
mysql> SELECT COUNT(*) from kjv
    -> WHERE MATCH(vtext) AGAINST('Abraham')
    -> AND bsect = 'N';
+----------+
| COUNT(*) |
+----------+
|       66 |
+----------+
mysql> SELECT COUNT(*) from kjv
    -> WHERE MATCH(vtext) AGAINST('Abraham')
    -> AND bname = 'Hebrews';
+----------+
| COUNT(*) |
+----------+
|       10 |
+----------+
mysql> SELECT COUNT(*) from kjv
    -> WHERE MATCH(vtext) AGAINST('Abraham')
    -> AND bname = 'Hebrews' AND cnum = 11;
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+

If you expect to use search criteria that include other non-FULLTEXT columns frequently, you can increase the performance of such queries by adding regular indexes to those columns. For example, to index the book, chapter, and verse number columns, do this:

mysql> ALTER TABLE kjv ADD INDEX (bnum), ADD INDEX (cnum), ADD INDEX (vnum);

Search strings in FULLTEXT queries can include more than just a single word, and you might suppose that adding additional words would make a search more specific. But in fact that widens it, because a FULLTEXT search returns records that contain any of the words. In effect, the query performs an OR search for any of the words. This is illustrated by the following queries, which identify successively larger numbers of verses as additional search words are added:

mysql> SELECT COUNT(*) from kjv
    -> WHERE MATCH(vtext) AGAINST('Abraham');
+----------+
| COUNT(*) |
+----------+
|      216 |
+----------+
mysql> SELECT COUNT(*) from kjv
    -> WHERE MATCH(vtext) AGAINST('Abraham Sarah');
+----------+
| COUNT(*) |
+----------+
|      230 |
+----------+
mysql> SELECT COUNT(*) from kjv
    -> WHERE MATCH(vtext) AGAINST('Abraham Sarah Ishmael Isaac');
+----------+
| COUNT(*) |
+----------+
|      317 |
+----------+

To perform a search where each word in the search string must be present, see Recipe 4.14.

If you want to use a FULLTEXT search that looks though multiple columns simultaneously, name them all when you construct the index:

ALTER TABLE tbl_name ADD FULLTEXT (col1, col2, col3);

To issue a search query that uses this index, name those same columns in the MATCH( ) list:

SELECT ... FROM tbl_name
WHERE MATCH(col1, col2, col3) AGAINST('search string');

See Also

FULLTEXT indexes provide a quick-and-easy way to set up a simple search engine. One way to use this capability is to provide a web-based interface to the indexed text. The MySQL Cookbook site includes a basic web-based KJV search page that demonstrates this.



[25] The mcb-kjv distribution is derived from the KJV text available at the Unbound Bible site at Biola University (http://unbound.biola.edu), but has been modified somewhat to make it easier to use for the recipes in this book. The mcb-kjv distribution includes notes that describe how it differs from the Biola distribution.

[26] It’s possible to include the index definition in the initial CREATE TABLE statement, but it’s generally faster to create a non-indexed table and then add the index with ALTER TABLE after populating the table than to load a large dataset into an indexed table.

Get MySQL Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.