Extracting Information from Word Documents

XSLT can also be used to extract information from existing Word documents. This can be useful for tracking document metadata, aggregating document fragments, listing tracked changes—the sky is the limit. In this section, we’ll look at three examples: dumping the text of a document, extracting metadata from a document, and listing a document’s comments.

Dumping a Document’s Text Content

Sometimes, we are only interested in the textual content of a document and not its formatting. Because of the way that WordprocessingML is structured, dumping all the text content of a document is a very straightforward task. In fact, the empty XSLT stylesheet (shown in Example 3-4) gets us pretty close to what we want to do.

Example 3-4. The empty transformation, empty.xsl

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
</xsl:stylesheet>

All text content within a Word document is represented using text nodes in the WordprocessingML document. Since the empty stylesheet does not specify any explicit template rules, only the built-in template rules (defined in the XSLT recommendation) are applied. (See http://www.w3.org/TR/xslt#built-in-rule.) The built-in rule for elements is to keep processing (apply templates to children), and the built-in rule for text nodes is to copy them. The resulting behavior of the empty stylesheet is that all the text content of the source document is copied to the result tree without any element ...
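To make the built-in rules concrete, the following sketch spells them out as explicit template rules, following the definitions in the XSLT 1.0 recommendation. It should behave the same way as the empty stylesheet for a transformation that uses only the default mode. The filename, explicit-builtins.xsl, is a hypothetical name used here for illustration and is not one of the book's numbered examples.

```xml
<!-- explicit-builtins.xsl (hypothetical filename): the built-in
     template rules of XSLT 1.0, written out by hand -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Root node and elements: keep processing by applying
       templates to the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Text and attribute nodes: copy the string value to the result.
       Attributes are only reached if some rule selects them explicitly,
       which the rule above does not do. -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Processing instructions and comments: produce nothing -->
  <xsl:template match="processing-instruction()|comment()"/>

</xsl:stylesheet>
```

Because the element rule never selects attribute nodes, attribute values in the WordprocessingML source do not appear in the output; only the text nodes are copied.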

Office 2003 XML by Simon St. Laurent, Mary McRae, Evan Lenz
