Chapter 12. Tokenization
MarkLogic has a default set of rules it uses to tokenize content; that is, to break a stream of text into words, punctuation, and symbols. This default tokenization works well for normal text, but in some cases we might wish to change it. By doing so, we can alter how content is represented in the indexes.
Tokenizing Social Security Numbers
Problem
You want to search across Social Security Numbers from different sources, which may have been recorded with or without dashes. In the United States, a Social Security Number (SSN) is issued to citizens and eligible residents and serves as a unique identifier when interacting with the federal government. These numbers take the form NNN-NN-NNNN, where each N is a digit.
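To see why the two forms fail to match out of the box, it can help to look at what the default tokenizer produces for each. The following is a minimal sketch to run in Query Console. The sample values are made up for illustration, and the labels in the output strings are my own, not anything MarkLogic emits.

```xquery
xquery version "1.0-ml";

(: Illustrative only: list the tokens the default tokenizer produces
   for a dashed and an undashed SSN. The sample values are invented. :)
for $ssn in ("111-22-3333", "111223333")
return
  fn:string-join(
    for $token in cts:tokenize($ssn)
    return
      (: Label each token by its type so word and punctuation
         tokens are easy to tell apart in the result. :)
      typeswitch ($token)
        case cts:word return fn:concat("word:", $token)
        case cts:punctuation return fn:concat("punctuation:", $token)
        case cts:space return fn:concat("space:", $token)
        default return fn:concat("other:", $token),
    " "
  )
```

On a default configuration I would expect the dashed form to come back as several word and punctuation tokens and the undashed form as a single word token, which is why a word query written in one form does not match content stored in the other. Run it against your own database to confirm, since the result depends on the database's language and tokenization settings.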
Solution
Applies to MarkLogic versions 7 and higher
We’ll solve this problem using custom tokenization.
To develop this recipe, I used documents that looked like these two:
<doc><name>Alpha</name><ssn>111-22-3333</ssn></doc>
<doc><name>Alpha</name><ssn>123456789</ssn></doc>
The first step is to create a field with paths that target the elements (or JSON properties) that hold the SSNs. A field may have more than one path, so add a path for each element that has an SSN.
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $db-id := xdmp:database("Documents")
let $field-name := "SSN"
let $paths := ("/doc/ssn")
return
  admin:save-configuration(
    admin:database-set-field-value-searches(
      admin:database-add-field ...