Chapter 3. Text Document Basics

At this point we are ready to look at the specifics of the content.xml file for word processing documents. We will build up from the most basic elements, characters and paragraphs, to sections and pages. This chapter also covers the topic of lists and outlines in OpenDocument word processing documents.

Characters and Paragraphs

All OpenDocument files are based on Unicode and are encoded in UTF-8; see the section called “Unicode Encoding Schemes” for a discussion. As a result, you may freely mix characters from a variety of languages in an OpenDocument file, as shown in Figure 3.1, “Document with Mixed Languages”. It also means that the non-ASCII characters will not be easily viewable in a plain ASCII text editor.

Figure 3.1. Document with Mixed Languages
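
To see why mixed-language text does not display well in an ASCII-only editor, it can help to look at how many bytes each character occupies in UTF-8. The following is a minimal sketch; the function name `utf8_profile` and the sample string are illustrative assumptions, not part of OpenDocument.

```python
def utf8_profile(text: str) -> list[tuple[str, int]]:
    """Return (character, number of UTF-8 bytes) pairs for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# Hypothetical sample: Latin "A", accented "e" (U+00E9), and a CJK character (U+8A9E).
sample = "A\u00e9\u8a9e"

for ch, size in utf8_profile(sample):
    print(f"U+{ord(ch):04X} takes {size} byte(s) in UTF-8")
```

Only the first character fits in a single ASCII byte; the others are multi-byte sequences that an ASCII-only editor cannot render as the intended characters.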

Whitespace

In XML, whitespace in element content is typically not preserved unless specially designated. OpenDocument collapses consecutive whitespace characters, which are defined as space (0x0020), tab (0x0009), carriage return (0x000D), and line feed (0x000A), to a single space. How, then, does OpenDocument represent a document where whitespace is significant?
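
The collapsing rule itself can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the code of any particular OpenDocument application; the function name `collapse_whitespace` is hypothetical.

```python
import re

# The four characters the text defines as whitespace:
# space (0x0020), tab (0x0009), carriage return (0x000D), line feed (0x000A).
_WHITESPACE_RUN = re.compile(r"[\x20\x09\x0D\x0A]+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of the four whitespace characters with one space."""
    return _WHITESPACE_RUN.sub(" ", text)


print(collapse_whitespace("one  two\t\tthree\r\nfour"))
```

Note that other Unicode space characters, such as the no-break space, are deliberately left out of the pattern, because the rule names only these four.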

To handle extra spaces, OpenDocument uses the <text:s> element. This empty element has an optional attribute, text:c, which tells how many spaces occur in the document. If this attribute is absent, then the ...
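
As a rough sketch of how a writer might produce `<text:s>` elements, the Python function below turns each run of two or more spaces into one literal space followed by a `<text:s>` element that counts the remaining spaces. Several things here are assumptions rather than statements from the text: the function name `encode_spaces` is hypothetical, keeping the first space of a run as a literal character is one common convention, and `text:c` is always written explicitly so that the sketch does not depend on the attribute's default.

```python
import re
from xml.sax.saxutils import escape


def encode_spaces(text: str) -> str:
    """Return XML character content in which extra spaces survive collapsing.

    Assumption: the first space of each run stays a literal space and the
    rest are represented by one <text:s text:c="N"/> element.
    """

    def replace_run(match: re.Match) -> str:
        extra = len(match.group(0)) - 1  # spaces beyond the first one
        return f' <text:s text:c="{extra}"/>'

    # Escape &, <, and > first so that only our own markup is added afterwards.
    return re.sub(r" {2,}", replace_run, escape(text))


print(encode_spaces("Name:   value"))
```

Single spaces pass through unchanged, so ordinary prose is not cluttered with extra elements. Spaces at the very start or end of a paragraph are not treated specially in this sketch.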

OASIS OpenDocument Essentials: Using OASIS OpenDocument XML by J. David Eisenberg
