Unravel the OpenOffice File Format

OpenOffice provides a suite of applications whose native file format consists of a set of XML files, compressed into a ZIP archive. This hack explores the basics of the OpenOffice file format.

OpenOffice (http://www.openoffice.org) is a suite of free, multiplatform, open source applications for the desktop, sponsored by Sun Microsystems (http://wwws.sun.com/software/star/openoffice/). The suite includes text-editor, spreadsheet, drawing, and presentation applications, each of which uses an XML-based file format. Table 4-2 lists the OpenOffice applications and their file extensions.

Each file is saved as a collection of XML documents stored in a ZIP archive. (You can also save documents in other formats, such as text, Rich Text Format, or HTML, and you can export a document as PDF.) The specification of the OpenOffice XML file format is maintained by an OASIS technical committee (http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office).

Table 4-2. OpenOffice applications and file extensions

OpenOffice application              File extension
Calc spreadsheet application        *.sxc
Calc templates                      *.stc
Draw graphics application           *.sxd
Draw templates                      *.std
Impress presentation application    *.sxi
Impress templates                   *.sti
Math application                    *.sxm
Master files                        *.sxg
Writer text editor application      *.sxw
Writer templates                    *.stw

In the OpenOffice subdirectory of the book’s file archive is a small file, foaf.sxw, a snippet taken from the FOAF hack [Hack #64]. It is shown in OpenOffice’s Writer application in Figure 4-5. You can use any ZIP tool to examine or extract the XML files from this ZIP file. I’ll use the unzip command-line tool that comes with most Unix systems and with Unix-like environments such as Cygwin (http://www.cygwin.com).

Figure 4-5. foaf.sxw in OpenOffice’s Writer application

While in the OpenOffice subdirectory, enter this command at a shell prompt:

unzip -l foaf.sxw

The -l option allows you to inspect the contents of the compressed file without extracting the files from it. This command produces:

Archive:  foaf.sxw
  Length     Date   Time    Name
 --------    ----   ----    ----
       30  04-04-04 04:51   mimetype
     4178  04-04-04 04:51   content.xml
     8062  04-04-04 04:51   styles.xml
     1174  04-04-04 04:51   meta.xml
     9180  04-04-04 04:51   settings.xml
      752  04-04-04 04:51   META-INF/manifest.xml
 --------                   -------
    23376                   6 files
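If unzip is not at hand, a similar listing can be produced with Python's standard zipfile module. The following is only a sketch: list_members and build_sample_archive are hypothetical helper names, and the in-memory archive is a stand-in for foaf.sxw so that the example runs on its own.

```python
import io
import zipfile


def list_members(archive):
    """Return (uncompressed size, name) pairs, roughly like `unzip -l`."""
    with zipfile.ZipFile(archive) as zf:
        return [(info.file_size, info.filename) for info in zf.infolist()]


def build_sample_archive():
    """Build a tiny stand-in archive in memory (contents are made up)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # A bare ZipInfo defaults to "stored", so this member is not compressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/vnd.sun.xml.writer")
        zf.writestr("content.xml", "<doc/>")
    buf.seek(0)
    return buf


for size, name in list_members(build_sample_archive()):
    print(f"{size:9d}  {name}")
```

To inspect the real file, pass its path instead, for example list_members("foaf.sxw").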

Extract these files into the OpenOffice subdirectory with:

unzip foaf.sxw

You’ll see this:

Archive:  foaf.sxw
 extracting: mimetype
  inflating: content.xml
  inflating: styles.xml
 extracting: meta.xml
  inflating: settings.xml
  inflating: META-INF/manifest.xml

Briefly, here’s what each of these files contains:

mimetype

Contains the file’s media type; e.g., application/vnd.sun.xml.writer.

content.xml

Holds the text content of the file.

meta.xml

Holds any meta information for the document. You can edit the meta information associated with this document by selecting File → Properties.

settings.xml

Contains information about the settings of the document.

styles.xml

Stores the styles applied to the document. You can apply styles to the document by selecting Format → Stylist (or by pressing F11).

META-INF/manifest.xml

Contains a list of XML and other files that make up the default OpenOffice representation of the document.
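As a rough illustration of how the mimetype and META-INF/manifest.xml members fit together, the sketch below reads both from an archive. The helper names are hypothetical, and the manifest namespace and the file-entry and full-path names used in the sample are assumptions about the OpenOffice 1.x manifest format, not something shown in this hack. For that reason the code matches on local names only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed shape of a manifest; the namespace URI and names are not from this hack.
SAMPLE_MANIFEST = """<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer" manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>"""


def local(name):
    """Strip a '{namespace}' prefix from an ElementTree tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def describe(archive):
    """Return the media type and the paths listed in META-INF/manifest.xml."""
    with zipfile.ZipFile(archive) as zf:
        media_type = zf.read("mimetype").decode("ascii").strip()
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    paths = []
    for elem in root.iter():
        if local(elem.tag) == "file-entry":
            attrs = {local(key): value for key, value in elem.attrib.items()}
            paths.append(attrs.get("full-path"))
    return media_type, paths


def build_sample_archive():
    """Build a small stand-in archive in memory so the sketch runs on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/vnd.sun.xml.writer")
        zf.writestr("content.xml", "<doc/>")
        zf.writestr("META-INF/manifest.xml", SAMPLE_MANIFEST)
    buf.seek(0)
    return buf


media_type, paths = describe(build_sample_archive())
print(media_type)
print(paths)
```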

Tip

When you do a File → Save As, you can click the “Save with password” checkbox. If you do this, all the XML files except meta.xml are saved as encrypted files.

For illustration, we’ll look at one of the files stored in the OpenOffice saved-file archive. Example 4-12 shows the XML markup that’s inside content.xml. This document is nicely indented because in the Tools → Options → Load/Save dialog box, under General settings, I’ve unchecked the “Size optimization for XML format (no pretty printing)” checkbox. It’s checked by default, meaning that normally the XML files are saved without indentation.

Example 4-12. content.xml from foaf.sxw

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE office:document-content PUBLIC 
"-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "office.dtd">

<office:document-content 
 xmlns:office="http://openoffice.org/2000/office"
 xmlns:style="http://openoffice.org/2000/style" 
 xmlns:text="http://openoffice.org/2000/text" 
 xmlns:table="http://openoffice.org/2000/table" 
 xmlns:draw="http://openoffice.org/2000/drawing" 
 xmlns:fo="http://www.w3.org/1999/XSL/Format" 
 xmlns:xlink="http://www.w3.org/1999/xlink"
 xmlns:number="http://openoffice.org/2000/datastyle"
 xmlns:svg="http://www.w3.org/2000/svg"
 xmlns:chart="http://openoffice.org/2000/chart"
 xmlns:dr3d="http://openoffice.org/2000/dr3d"
 xmlns:math="http://www.w3.org/1998/Math/MathML"
 xmlns:form="http://openoffice.org/2000/form"
 xmlns:script="http://openoffice.org/2000/script" 
 office:class="text" office:version="1.0">
 <office:script/>
 <office:font-decls>
  <style:font-decl style:name="Tahoma1" fo:font-family="Tahoma"/>
  <style:font-decl style:name="Lucida Sans Unicode" 
   fo:font-family="&apos;Lucida Sans Unicode&apos;" 
       style:font-pitch="variable"/>
  <style:font-decl style:name="MS Mincho" 
       fo:font-family="&apos;MS Mincho&apos;"
   style:font-pitch="variable"/>
  <style:font-decl style:name="Tahoma" fo:font-family="Tahoma" 
   style:font-pitch="variable"/>
  <style:font-decl style:name="Times New Roman" 
   fo:font-family="&apos;Times New Roman&apos;" 
       style:font-family-generic="roman"
   style:font-pitch="variable"/>
  <style:font-decl style:name="Arial" fo:font-family="Arial" 
   style:font-family-generic="swiss" style:font-pitch="variable"/>
 </office:font-decls>
 <office:automatic-styles>
  <style:style style:name="P1" style:family="paragraph" 
   style:parent-style-name="Text body">
   <style:properties fo:text-align="center" 
        style:justify-single-word="false"/>
  </style:style>
  <style:style style:name="fr1" style:family="graphics" 
   style:parent-style-name="Graphics">
   <style:properties style:vertical-pos="top" 
        style:vertical-rel="paragraph"
    style:horizontal-pos="center" style:horizontal-rel="paragraph"
    style:mirror="none" fo:clip="rect(0inch 0inch 0inch 0inch)" 
        draw:luminance="0%"
    draw:contrast="0%" draw:red="0%" draw:green="0%" draw:blue="0%" 
        draw:gamma="1"
    draw:color-inversion="false" draw:transparency="0%" 
    draw:color-mode="standard"/>
  </style:style>
 </office:automatic-styles>
 <office:body>
  <text:sequence-decls>
   <text:sequence-decl text:display-outline-level="0" 
          text:name="Illustration"/>
   <text:sequence-decl text:display-outline-level="0" 
          text:name="Table"/>
   <text:sequence-decl text:display-outline-level="0" 
          text:name="Text"/>
   <text:sequence-decl text:display-outline-level="0" 
          text:name="Drawing"/>
  </text:sequence-decls>
 <text:h text:style-name="Heading 1" text:level="1">Identify Yourself with FOAF,
 an Application of RDF</text:h><text:p text:style-name="Text body">
 FOAF provides a framework for creating and  publishing personal information
 in a machine-readable fashion. As you learn FOAF,  you will also
 get acquainted with RDF in a practical way as well.</text:p>
 <text:p text:style-name="Text body">The Friend of a Friend or FOAF project 
(http://www.foaf-project.org/) is a community-driven effort to define an RDF
 vocabulary for expressing metadata about people, and their interests,
 relationships and activities. Founded by Dan Brickley and Libby Miller, the FOAF
 project is an open community-lead initiative which is tackling head-on the wider
 Semantic Web goal of creating a machine processable web of data. Achieving this
 goal quickly requires a network-effect that will rapidly yield a mass of data.
 Network effects mean people. It seems a fairly safe bet that any early Semantic
 Web successes are going to be riding on the back of people-centric applications.
 Indeed, arguably everything interesting that we might want to describe on the
 Semantic Web was created by or involves people in some form or another. And FOAF
 is all about people.</text:p><text:p text:style-name="Text body">
  FOAF facilitates the creation of the Semantic Web equivalent of the 
 archetypal personal homepage: My name is Leigh, this is a picture of me, 
 I'm interested in XML, and here are some links to my friends. And
 just like the HTML version, FOAF documents can be linked together to form a web
 of data, with well-defined semantics.</text:p><text:p text:style-name=
 "Text body"> Being a W3C Resource Description Framework or RDF application 
 (http://www.w3.org/RDF/) means that FOAF can claim the usual benefits of being
  easily harvested and aggregated. And like all RDF vocabularies, it can be 
 easily combined with other vocabularies, allowing the capture of a very rich set
 of metadata. This hack introduces the basic terms of the FOAF vocabulary,
 illustrating them with a number of examples. The hack concludes with a brief
 review of the more interesting FOAF applications and considers some other uses 
 for the data. The FOAF graphic is shown in Figure A-1.</text:p>
 <text:p text:style-name="P1">Figure A-1: FOAFlets</text:p>
 <text:p text:style-name="Text body"/>
 <text:p text:style-name="Text body">
 <draw:image draw:style-name="fr1"
 draw:name="Graphic1" text:anchor-type="paragraph" svg:width="4.2201inch"
 svg:height="2.4299inch" draw:z-index="0"
 xlink:href="#Pictures/10000000000001A6000000F34FFA992C.jpg" 
 xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad"/></text:p>
 </office:body>
</office:document-content>

The XML documents in OpenOffice use DTDs [Hack #68] that come with the installed package, though XML Schema and RELAX NG schemas will be available in future versions. For example, on Windows, these files are installed by default in C:\Program Files\OpenOffice.org1.1.1\share\dtd\officedocument\1_0. The document type declaration at the top of this document references office.dtd. (These DTDs are not in the book’s file archive.) The document element is office:document-content, in the namespace http://openoffice.org/2000/office. Many other namespaces are declared on it as well, including some familiar ones, such as those for SVG [Hack #9] and XSL-FO [Hack #48].
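One way to see which namespaces a document such as content.xml declares is to collect the namespace events from Python's ElementTree parser. This is a minimal sketch: declared_namespaces is a hypothetical helper, and SAMPLE is a cut-down stand-in that reuses a few of the declarations from Example 4-12.

```python
import io
import xml.etree.ElementTree as ET

# Cut-down stand-in for content.xml, with three of its namespace declarations.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
 xmlns:office="http://openoffice.org/2000/office"
 xmlns:text="http://openoffice.org/2000/text"
 xmlns:svg="http://www.w3.org/2000/svg"
 office:class="text" office:version="1.0">
 <office:body/>
</office:document-content>
"""


def declared_namespaces(source):
    """Map each declared prefix to its namespace URI."""
    events = ET.iterparse(source, events=("start-ns",))
    return {prefix: uri for _event, (prefix, uri) in events}


for prefix, uri in declared_namespaces(io.BytesIO(SAMPLE)).items():
    print(f"{prefix:8s} {uri}")
```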

Various font declarations are stored in style:font-decl elements inside office:font-decls. Attributes with the fo: prefix are properties borrowed from XSL-FO. The office:automatic-styles element lists styles that are used in the document. Within office:body, the text:sequence-decls element contains markup used for numeric sequencing in the document. A heading appears in the text:h element, followed by body text in a series of text:p elements. The final text:p shows how OpenOffice defines a reference to a graphic with draw:image, including attributes from the SVG and XLink namespaces such as svg:width and xlink:href. The embedded graphic is stored in the Pictures subdirectory of foaf.sxw as the file 10000000000001A6000000F34FFA992C.jpg, as the xlink:href value indicates.
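To make the structure concrete, here is a small sketch that pulls the heading, the body text, and the image reference out of markup shaped like Example 4-12, using Python's ElementTree. The summarize function and the shortened SAMPLE document are illustrative assumptions; the namespace URIs and element names are taken from the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the namespaces used below (URIs as in Example 4-12).
NS = {
    "office": "http://openoffice.org/2000/office",
    "text": "http://openoffice.org/2000/text",
    "draw": "http://openoffice.org/2000/drawing",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Shortened stand-in for content.xml.
SAMPLE = """<office:document-content
 xmlns:office="http://openoffice.org/2000/office"
 xmlns:text="http://openoffice.org/2000/text"
 xmlns:draw="http://openoffice.org/2000/drawing"
 xmlns:xlink="http://www.w3.org/1999/xlink">
 <office:body>
  <text:h text:style-name="Heading 1" text:level="1">Identify Yourself with FOAF</text:h>
  <text:p text:style-name="Text body">FOAF is all about people.</text:p>
  <text:p text:style-name="Text body"><draw:image draw:name="Graphic1"
   xlink:href="#Pictures/10000000000001A6000000F34FFA992C.jpg" xlink:type="simple"/></text:p>
 </office:body>
</office:document-content>"""


def summarize(xml_text):
    """Return (headings, non-empty paragraphs, image paths) from the body."""
    root = ET.fromstring(xml_text)
    body = root.find("office:body", NS)
    headings = ["".join(h.itertext()).strip() for h in body.findall("text:h", NS)]
    paragraphs = ["".join(p.itertext()).strip() for p in body.findall("text:p", NS)]
    href = "{%s}href" % NS["xlink"]
    # The leading "#" marks a path inside the archive; drop it for display.
    images = [img.get(href).lstrip("#") for img in body.iter("{%s}image" % NS["draw"])]
    return headings, [p for p in paragraphs if p], images


headings, paragraphs, images = summarize(SAMPLE)
print(headings)
print(paragraphs)
print(images)
```

To run this against the real document, read content.xml out of the archive first and pass its contents to summarize.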

