MarkLogic Cookbook

by Dave Cassel
March 2018
Intermediate to advanced
34 pages
1h 33m
English
O'Reilly Media, Inc.

Chapter 12. Tokenization

MarkLogic has a default set of rules it uses to tokenize content; that is, to break a stream of text into words, punctuation, and symbols. This default tokenization works well for normal text, but in some cases we might wish to change it. By doing so, we can alter how content is represented in the indexes.
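
To see what the default rules produce, a quick experiment in Query Console can help. The following is a minimal sketch, not part of the recipe itself: it passes a made-up sample string to the built-in cts:tokenize function and labels each token by its type. The sample text and the label strings are assumptions chosen for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample text; substitute any string you want to inspect. :)
let $text := "Call 555-1234 today!"
(: cts:tokenize applies the default tokenization rules when no field is named. :)
for $token in cts:tokenize($text)
return
  typeswitch ($token)
    case cts:word return fn:concat("word: ", fn:string($token))
    case cts:punctuation return fn:concat("punctuation: ", fn:string($token))
    case cts:space return fn:concat("space: ", fn:string($token))
    default return fn:concat("other: ", fn:string($token))
```

Running this against your own sample values shows how characters such as dashes are classified, which is the behavior a custom tokenizer override changes.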

Tokenizing Social Security Numbers

Problem

You want to search across Social Security Numbers from different sources, which may have been recorded with or without dashes. In the United States, a Social Security Number (SSN) is used as a unique identifier when interacting with the federal government. These numbers take the form NNN-NN-NNNN, where each N is a digit.

Solution

Applies to MarkLogic versions 7 and higher

We’ll solve this problem using custom tokenization.

To develop this recipe, I used documents that looked like these two:

<doc>
  <name>Alpha</name>
  <ssn>111-22-3333</ssn>
</doc>
<doc>
  <name>Alpha</name>
  <ssn>123456789</ssn>
</doc>

The first step is to create a field with paths that target the elements (or JSON properties) that hold the SSNs. A field may have more than one path, so add a path for each element that has an SSN.

xquery version "1.0-ml";

import module namespace admin =
  "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $db-id := xdmp:database("Documents")
let $field-name := "SSN"
let $paths := (
  "/doc/ssn"
)
return
  admin:save-configuration(
    admin:database-set-field-value-searches(
      admin:database-add-field ...

Publisher Resources

ISBN: 9781491994610