Chapter 4. Working with Documents

Documents are the fundamental data structure for MarkLogic. This chapter addresses some common problems: generating unique identifiers for documents and finding binary documents (which can’t be directly searched by content).

Generate a Unique ID

Problem

Generate a unique identifier for each document.

Solution

Use the built-in sem:uuid-string() function:

  xdmp:document-insert(
    "/content/" || sem:uuid-string() || ".xml",
    $new-doc
  )

Discussion

The UUIDs this recipe generates are unique for all practical purposes, and generating one requires no coordination between requests. When generating unique IDs, many people start with the idea that the IDs should be monotonically increasing numbers. While this is conceptually reasonable, it has a hidden requirement: a single place where the next value is stored and updated. For instance, consider a document that tracks the next available number. A process that wants to insert a new document must acquire a write lock on the number-tracking document. Any other process that wants to insert must wait until it can acquire a write lock on that same document. This single shared resource prevents MarkLogic from working on otherwise noninterfering inserts in parallel. When faced with a requirement to generate monotonically increasing numbers for IDs, ask whether the IDs really need to be sequential or simply need to be unique. Most often, the real requirement is uniqueness. The sem:uuid() and sem:uuid-string() functions provide a fast way to meet it.
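To make the contention concrete, here is a minimal sketch of the counter-document approach described above. The URI /counters/next-id.xml, the <next-id> element, and the <doc/> payload are hypothetical names chosen for illustration, and the sketch assumes the counter document already exists.

```xquery
xquery version "1.0-ml";

(: Hypothetical counter document; assumed to exist and to contain
   <next-id>N</next-id>. :)
let $counter := fn:doc("/counters/next-id.xml")/next-id
let $next := xs:unsignedLong($counter)
return (
  (: Updating the counter takes a write lock on its document, so
     concurrent requests running this code must wait for one another. :)
  xdmp:node-replace($counter, <next-id>{$next + 1}</next-id>),
  xdmp:document-insert("/content/" || $next || ".xml", <doc/>)
)
```

The sem:uuid-string() version in the Solution touches no shared document, so two inserts never wait on each other to obtain an ID.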

Another point to consider is where to use the unique identifier. In a database of students, each new student’s information is stored in a document. Suppose the developer decides to build the URI as let $uri := "/student/" || $student-name || ".xml", where $student-name is the student’s first and last names, joined by a hyphen. To tell apart students who share a name, the developer adds a <student-id> element, populated with sem:uuid-string(). The URIs themselves can still collide, however. When inserting a new student record, the application code needs to check whether an existing student already has the URI that the standard process would build. If so, the code needs to modify the URI somehow, perhaps by adding a “-2” to the student’s name. Of course, to do that, the code must check whether that URI has already been taken, and so on. A much simpler approach is to use the unique ID in the URI, which avoids the concern entirely: let $uri := "/student/" || sem:uuid-string() || ".xml".
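As a minimal sketch of that simpler approach, the following generates the UUID once and uses it both in the URI and inside the document. The element names other than <student-id> and the sample values are assumptions made for illustration only.

```xquery
xquery version "1.0-ml";

(: Generate the ID once so the URI and the element agree. :)
let $id := sem:uuid-string()
let $uri := "/student/" || $id || ".xml"
return
  xdmp:document-insert(
    $uri,
    (: Hypothetical record shape with sample values. :)
    <student>
      <student-id>{$id}</student-id>
      <first-name>Ada</first-name>
      <last-name>Lovelace</last-name>
    </student>
  )
```

With this layout there is no need to probe for an existing URI before inserting, and two students with the same name simply get different documents.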

Find Binary Documents

Problem

Find the URIs of binary documents.

Solution

Applies to MarkLogic versions 7 and higher

xquery version "1.0-ml";

declare namespace qry  = "http://marklogic.com/cts/query";

let $binary-term :=
  xdmp:plan(/binary())//qry:term-query/qry:key/text()
return cts:uris((), (), cts:term-query($binary-term))

Discussion

This recipe returns a sequence of URIs for all the binary documents in the target database.

The implementation relies on how the /binary() XPath is interpreted. xdmp:plan() tells us how MarkLogic sees a query. Part of the result is the final plan:

<qry:final-plan xmlns:qry="http://marklogic.com/cts/query">
  <qry:and-query>
    <qry:term-query weight="0">
      <qry:key>7908746777995149422</qry:key>
      <qry:annotation>document-format(binary)</qry:annotation>
    </qry:term-query>
  </qry:and-query>
</qry:final-plan>

Notice the term-query part. In addition to storing a document, MarkLogic stores metadata about that document, and the metadata is queryable, too. Sometimes the trick is just figuring out how to specify that query. In this case, the recipe reads the term key from the xdmp:plan() output and passes it to cts:term-query.
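To see which terms a plan contains before relying on one of them, a small query like the following can help. It is only a sketch that lists each term key next to its annotation, using the same namespace declaration as the recipe.

```xquery
xquery version "1.0-ml";

declare namespace qry = "http://marklogic.com/cts/query";

(: List each term key in the plan alongside its annotation. :)
for $term in xdmp:plan(/binary())//qry:term-query
return fn:string($term/qry:key) || " : " || fn:string($term/qry:annotation)
```

For the plan shown above, this would pair the key with document-format(binary). Reading the key from the plan at run time, as the recipe does, avoids hard-coding a value that is an internal detail.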

You might ask, “Why not just use XPath, such as /binary()?” This would also work, but it works by retrieving the binaries themselves. You could take it a step further with /binary() ! fn:base-uri(.) to get just the URIs (which is what the recipe provides), but again, this requires loading up the actual documents and doing something with them. The beauty of the recipe is that it works on indexes.
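For comparison, here is the XPath alternative from the paragraph above written as a complete query. It is a sketch for side-by-side testing rather than a recommended replacement.

```xquery
xquery version "1.0-ml";

(: Visits every binary document to read its URI; simple, but it loads
   the documents instead of answering from indexes alone. :)
/binary() ! fn:base-uri(.)
```

One trade-off worth knowing: the recipe's cts:uris call depends on the URI lexicon being enabled for the database, whereas this XPath version does not.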

There’s one sneaky bit with this recipe: cts:term-query isn’t a published function. That means you should be careful where you use it, but for this recipe, it gets the job done.
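Because the recipe leans on an unpublished function and on the shape of the plan output, one cautious option is to verify what the plan returned before using it. The sketch below assumes the annotation text matches the document-format(binary) value shown earlier; the error name BINARY-TERM-NOT-FOUND is hypothetical.

```xquery
xquery version "1.0-ml";

declare namespace qry = "http://marklogic.com/cts/query";

(: Keep only the term whose annotation matches the one in the plan
   output shown earlier, and cast its key explicitly. :)
let $binary-term :=
  xdmp:plan(/binary())
    //qry:term-query[qry:annotation = "document-format(binary)"]
    /qry:key/xs:unsignedLong(.)
return
  if (fn:count($binary-term) eq 1)
  then cts:uris((), (), cts:term-query($binary-term))
  else
    (: Hypothetical error name; raised when the plan has an unexpected shape. :)
    fn:error(
      xs:QName("BINARY-TERM-NOT-FOUND"),
      "Expected exactly one document-format(binary) term in the plan"
    )
```

Failing loudly here is a design choice: if a future release changes the plan format, the query reports the problem instead of silently returning the wrong URIs.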

Find Recently Modified Binary Documents

Problem

Find binary documents that have been recently modified.

Solution

Applies to MarkLogic versions 7 and higher

xquery version "1.0-ml";

declare namespace qry  = "http://marklogic.com/cts/query";

let $binary-term :=
  xdmp:plan(/binary())//qry:term-query/qry:key/text()
let $query-start :=
  (fn:current-dateTime() - xs:dayTimeDuration("P1D"))
let $query-stop := fn:current-dateTime()
let $query := cts:and-query((
  cts:properties-fragment-query(
    cts:and-query((
      cts:element-range-query(
        xs:QName("prop:last-modified"), ">", $query-start),
      cts:element-range-query(
        xs:QName("prop:last-modified"), "<", $query-stop)
    ))
  ),
  cts:term-query($binary-term)
))
return (
  text{
    "Estimate:",
    xdmp:estimate(cts:search(fn:doc(), $query))
  },
  cts:uris((), ("limit=100"), $query)
)

Required Indexes

  • "maintain last modified" must be on

  • dateTime range index on prop:last-modified

Discussion

Recently modified binaries can be found if the "maintain last modified" option on the target database is active. You must also have a dateTime range index set up on prop:last-modified, so that the cts:element-range-query will work.
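Both prerequisites can be set in the Admin UI. As an alternative, here is a minimal sketch that uses the Admin API. The database name "Documents" is a placeholder, and the sketch assumes it runs as a user with administrative privileges and that the range index does not already exist.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder; substitute the target database. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()

(: Turn on "maintain last modified". :)
let $config :=
  admin:database-set-maintain-last-modified($config, $db, fn:true())

(: dateTime range index on prop:last-modified. :)
let $index :=
  admin:database-range-element-index(
    "dateTime",
    "http://marklogic.com/xdmp/property",  (: namespace of the prop prefix :)
    "last-modified",
    "",          (: collation is not used for dateTime values :)
    fn:false()   (: range value positions off :)
  )
let $config := admin:database-add-range-element-index($config, $db, $index)
return admin:save-configuration($config)
```

Documents modified before "maintain last modified" was turned on may have no prop:last-modified value, so the recipe would not match them.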

The xs:dayTimeDuration chosen for $query-start defines what “recent” means in this case. Notice that the last-modified date is stored in a property fragment, so this recipe uses a cts:properties-fragment-query to look for it.
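To change what “recent” means, only the duration needs to change. The sketch below pulls it into a $window variable, which is a name introduced here for illustration, and returns the resulting bounds so they can be inspected.

```xquery
xquery version "1.0-ml";

(: Hypothetical $window variable: "PT6H" is the last six hours,
   "P7D" would be the last seven days. :)
let $window := xs:dayTimeDuration("PT6H")
let $query-stop := fn:current-dateTime()
let $query-start := $query-stop - $window
return ($query-start, $query-stop)
```

The two values can be dropped into the recipe in place of its $query-start and $query-stop bindings.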

The recipe returns an estimate of the number of recently modified binaries, as well as the URIs of the first 100. Because the query is resolved from indexes, we can count on the estimate to be accurate.
