MarkLogic Cookbook

Name: MarkLogic Cookbook
Author: Dave Cassel
ISBN: 9781491994603

March 2018

Intermediate to advanced

34 pages

1h 33m

English

O'Reilly Media, Inc.

Foreword
Preface
Acknowledgments
I. Implementing XQuery: Practical Solutions to Real-World Problems
1. Peak Performance
  Assert Query Mode (Problem, Solution, Discussion)
  Fast Distinct Values (Problem, Solution, Discussion)
2. Fun with Maps
  Check Whether Two Maps Are Equal (Problem, Solution, Discussion)
  Find the Intersection of a Sequence of Maps (Problem, Solution, Discussion)
  Apply a Function to All Values in a Map (Problem, Solution, Discussion)
3. Document Security
  List User Permissions on a Document (Problem, Solution, Discussion)
  Get Permissions with Role Names (Problem, Solution, Discussion)
4. Working with Documents
  Generate a Unique ID (Problem, Solution, Discussion)
  Find Binary Documents (Problem, Solution, Discussion)
  Find Recently Modified Binary Documents (Problem, Solution, Discussion)
5. The Task Server
  Cancel Active Tasks on the Task Server (Problem, Solution, Discussion)
  Cancel Active and Queued Tasks on the Task Server (Problem, Solution, Discussion)
6. Administration
  Find Hostnames in a Cluster (Problem, Solution, Discussion)
  Find Current and Effective MarkLogic Versions During Rolling Upgrade (Problem, Solution, Discussion)
II. Documents, Triples, and Values: Powering Search
7. Document Searches
  Search by Root Element (Problem, Solution, Discussion, See Also)
  Find Documents That Are Missing an Element (Problem, Solution, Discussion, See Also)
8. Scoring Search Results
  Sort Results to Promote Recent Documents (Problem, Solution, Discussion, See Also)
  Weigh Matches Based on Document Parts (Problem, Solution, Discussion, See Also)
9. Understanding Your Data and How It Gets Used
  Logging Search Requests (Problem, Solution, Discussion, See Also)
  Count Documents in Directories (Problem, Solution, Discussion, See Also)
10. Searching with the Optic API
  Paging Over Results (Problem, Solution, Discussion, See Also)
  Group By (Solution, Discussion, See Also)
  Extract Content from Retrieved Documents (Problem, Solution, Discussion, See Also)
  Select Documents Based on Criteria in Joined Documents (Problem, Solution, Discussion, See Also)
III. Transforming Data
11. Input Transformations
  Changing Date Format (Problem, Solution, Discussion)
  Converting Binaries to Base64 Strings and Back (Problem, Solution, Discussion, See Also)
  Ingesting an Aggregate JSON File with Many Documents Inside (Problem, Solution, Discussion)
12. Tokenization
  Tokenizing Social Security Numbers (Problem, Solution, Discussion)
13. Template-Driven Extraction
  Searching on Derived Data (Problem, Solution, Discussion, See Also)
  Using an IRI Namespace with TDE (Problem, Discussion, See Also)
14. Redaction
  Redacting Credit Card Numbers, Replacing with Digits (Problem, Solution, Discussion, See Also)
  Redacting ICD10 Codes (Problem, Solution, Discussion)

Content preview from MarkLogic Cookbook

Chapter 12. Tokenization

MarkLogic has a default set of rules it uses to tokenize content; that is, to break a stream of text into words, punctuation, and symbols. This default tokenization works well for normal text, but in some cases we might wish to change it. By doing so, we can alter how content is represented in the indexes.
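To see what the default rules do with a dashed value, one option is to run the text through the built-in cts:tokenize function and label each token by type. This is a minimal sketch for exploration in Query Console, not part of the recipe; the sample string and the label text are assumptions chosen for illustration.

```xquery
xquery version "1.0-ml";

(: Illustrative only: list the tokens that the default tokenizer
   produces for a dashed number, labeled by token type. :)
for $token in cts:tokenize("111-22-3333")
return
  typeswitch ($token)
    case cts:word return fn:concat("word: ", $token)
    case cts:punctuation return fn:concat("punctuation: ", $token)
    case cts:space return fn:concat("space: ", $token)
    default return fn:concat("other: ", $token)
```

Running the same loop on the undashed form of the number makes the difference between the two representations visible at the token level.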

Tokenizing Social Security Numbers

Problem

You want to search across Social Security Numbers from different sources, which may have been recorded with or without dashes. In the United States, each citizen has a Social Security Number (SSN), which is used as a unique identifier when interacting with the federal government. These numbers take the form NNN-NN-NNNN, where each N is a digit.
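The mismatch between the two recorded forms can be illustrated with a few lines of XQuery. This is only a sketch of the problem, not the recipe's solution (which relies on custom tokenization); the helper local:strip-dashes and the sample values are hypothetical.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not part of the recipe: removes dashes so that
   both recorded forms of an SSN reduce to the same nine digits. :)
declare function local:strip-dashes($ssn as xs:string) as xs:string
{
  fn:replace($ssn, "-", "")
};

let $with-dashes := "111-22-3333"
let $without-dashes := "111223333"
return (
  (: A plain string comparison treats the two forms as different values. :)
  $with-dashes eq $without-dashes,
  (: After stripping dashes, the two forms compare equal. :)
  local:strip-dashes($with-dashes) eq local:strip-dashes($without-dashes)
)
```

The query evaluates to false followed by true. Normalizing at query time like this would have to be repeated wherever SSNs are compared, which is one reason to handle the difference in the index instead.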

Solution

Applies to MarkLogic versions 7 and higher

We’ll solve this problem using custom tokenization.

To develop this recipe, I used documents that looked like these two:

<doc>
  <name>Alpha</name>
  <ssn>111-22-3333</ssn>
</doc>

<doc>
  <name>Alpha</name>
  <ssn>123456789</ssn>
</doc>

The first step is to create a field with paths that target the elements (or JSON properties) that hold the SSNs. A field may have more than one path, so add a path for each element that has an SSN.

xquery version "1.0-ml";

import module namespace admin =
  "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $db-id := xdmp:database("Documents")
let $field-name := "SSN"
let $paths := (
  "/doc/ssn"
)
return
  admin:save-configuration(
    admin:database-set-field-value-searches(
      admin:database-add-field ...

Publisher Resources

ISBN: 9781491994610
