Chapter 1. A Brief Foray into Structured Content (a.k.a. XML)

Whenever we talk about eXtensible Markup Language (XML), we are talking about a type of structured content. In case you haven’t been exposed to these concepts, let’s take a brief look at them before we dive further into XML and InDesign.

The first XML concept is that of structure, sometimes called “hierarchy.” Structure is the organization of pieces of information into a grouping that makes sense to humans. For example, if you are going to describe a course within a college course catalog, at minimum you would give the course name and a brief description. To relate this course to the larger picture of getting a degree, you would provide information about the major that the course is part of, how many credit hours the course counts for, and the prerequisites, if there are any.

Looked at from the top down, a college offers programs of study consisting of courses in a sequence. Course credits have to add up to the required number for the degree program.

If you draw the relationships as boxes that contain information, you might see that a program of study contains a repeating set of information blocks, each holding a course name and description, as in Figure 1-1.

Figure 1-1. A diagram of a possible course catalog structure

Each piece of information that we want to identify and work with is given an element name. The top-level element (root) at the left of Figure 1-1 is named <ProgramsOfStudy> and consists of many individual <ProgramOfStudy> elements. Within each program, repeating blocks of course information make up a <CourseSequence> element.

Element names can be wordy, so that humans can read and understand what they mean, or terse, like Prg, Crs, and TCrd, if they are used mostly by computer programs. How you name XML elements depends on who (or what) has to work with the XML, and how. Here are some general naming rules: element names can’t start with a number, can’t contain spaces, and can’t contain certain “illegal” characters such as ?, >, &, and /.
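The naming rules above can be sketched as a small check. This is a simplified illustration that tests only the rules listed in this chapter; the function name `is_plausible_element_name` is hypothetical, and the XML specification defines additional constraints that this sketch does not cover.

```python
# Characters the text lists as "illegal" in element names (not an exhaustive set).
ILLEGAL_CHARS = set("?>&/")


def is_plausible_element_name(name: str) -> bool:
    """Return True if name passes the simplified rules described in the text."""
    if not name:
        return False
    if name[0].isdigit():                        # can't start with a number
        return False
    if any(ch.isspace() for ch in name):         # can't contain spaces
        return False
    if any(ch in ILLEGAL_CHARS for ch in name):  # can't contain listed characters
        return False
    return True


for candidate in ["ProgramOfStudy", "Programs of Study", "1stCourse", "Crs", "Q&A"]:
    print(candidate, is_plausible_element_name(candidate))
```

Only ProgramOfStudy and Crs pass; the other three candidates each break one of the listed rules.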

The second XML concept is semantics: naming things so that they are meaningful to you and others. So rather than generic names such as Titlemain, Titlesub, and List, you would use names that relate to the type of information you are organizing: ProgramName, ProgramDescription, ProgramRequirements, CourseName, CourseDescription, Credits, and so on.

Hierarchy and semantics are combined in structured content and can be translated into an abstract model of XML elements, such as in Example 1-1.

Example 1-1. A tree diagram of possible course catalog structure

ProgramsOfStudy
↳     ProgramOfStudy
    ↳    ProgramName
    ↳    ProgramDescription
    ↳    CourseSequence
       ↳    CourseDescriptions
          ↳    CourseDescription_Major
             ↳    CourseDescription_Name
             ↳    CourseCreditsHrs
             ↳    CourseDescription_Text
             ↳    CourseDescription_Footnote
          ↳    CourseDescription_Minor
             ↳    CourseDescription_Name
             ↳    CourseCreditsHrs
             ↳    CourseDescription_Text
             ↳    CourseDescription_Footnote
    ↳    ProgramRequirements
    ↳    TotalProgramCredits
    ↳    CumulativeGradePointAverage
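To show how a tree like this might look as actual XML, here is a minimal sketch that builds a subset of the elements with Python's standard `xml.etree.ElementTree` module and prints the result. The function name `build_catalog` and all text values (such as "Graphic Design" and "Typography I") are invented sample data, not content from a real catalog.

```python
import xml.etree.ElementTree as ET


def build_catalog() -> ET.Element:
    """Build a small sample tree that follows part of the structure in Example 1-1."""
    root = ET.Element("ProgramsOfStudy")
    program = ET.SubElement(root, "ProgramOfStudy")
    ET.SubElement(program, "ProgramName").text = "Graphic Design"        # sample value
    ET.SubElement(program, "ProgramDescription").text = "Sample text."   # sample value

    sequence = ET.SubElement(program, "CourseSequence")
    descriptions = ET.SubElement(sequence, "CourseDescriptions")

    # One major course; a real catalog would repeat this block many times.
    major = ET.SubElement(descriptions, "CourseDescription_Major")
    ET.SubElement(major, "CourseDescription_Name").text = "Typography I"
    ET.SubElement(major, "CourseCreditsHrs").text = "3"
    ET.SubElement(major, "CourseDescription_Text").text = "Sample text."

    ET.SubElement(program, "TotalProgramCredits").text = "3"
    return root


root = build_catalog()
if hasattr(ET, "indent"):   # pretty-printing helper, available in Python 3.9+
    ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```

Each level of indentation in the tree diagram corresponds to one level of nesting in the printed XML.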

If a structure of meaningful components will be used by more than one person or organization, it can be formalized with a set of rules, such as:

Every program of study must include more than one required major course, more than one required minor course, and more than one elective course. Additionally, the course credit hours must add up to the total credit hours required to complete the program of study, and the student’s cumulative grade point average must meet the minimum required to graduate.
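As a rough illustration of how a program might check the credit-hour part of such a rule, the sketch below sums the course credits in a small sample document and compares the sum with the stated total. The sample XML, its values, and the function name `credits_add_up` are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Invented sample data using element names from Example 1-1.
SAMPLE = """<ProgramOfStudy>
  <CourseSequence>
    <CourseDescriptions>
      <CourseDescription_Major><CourseCreditsHrs>3</CourseCreditsHrs></CourseDescription_Major>
      <CourseDescription_Major><CourseCreditsHrs>4</CourseCreditsHrs></CourseDescription_Major>
      <CourseDescription_Minor><CourseCreditsHrs>3</CourseCreditsHrs></CourseDescription_Minor>
    </CourseDescriptions>
  </CourseSequence>
  <TotalProgramCredits>10</TotalProgramCredits>
</ProgramOfStudy>"""


def credits_add_up(program: ET.Element) -> bool:
    """Return True if the course credit hours sum to TotalProgramCredits."""
    course_credits = sum(int(el.text) for el in program.iter("CourseCreditsHrs"))
    required = int(program.findtext("TotalProgramCredits"))
    return course_credits == required


program = ET.fromstring(SAMPLE)
print(credits_add_up(program))
```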

A set of rules for the structured content is called a schema or a Document Type Definition (DTD). The rules can be simple or complex, depending upon the number of elements and how they can be used (whether required or optional, how many times the element can occur, and within what contexts, etc.).

Rather than spend a lot of time exploring XML and DTDs at this point, I will consider them to be part of the problem-solving process for creating a content creation and publishing workflow. There are many resources for learning about XML and DTDs online.

The key points to keep in mind are what you call the pieces of content (the element names) and how they are organized (the structure). These points are factors in setting up your InDesign import and export processes. The names of your elements can be the same as, or different from, the names of paragraph styles that you use in InDesign.
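The relationship between element names and paragraph style names can be pictured as a simple lookup table. The sketch below is only a conceptual illustration in Python, not InDesign's own mapping mechanism; the style names and the helper `style_for` are hypothetical.

```python
from typing import Optional

# Hypothetical mapping from XML element names to paragraph style names.
TAG_TO_STYLE = {
    "ProgramName": "ProgramName",                # style shares the element's name
    "ProgramDescription": "Body Text",           # style named differently
    "CourseDescription_Name": "Course Heading",  # style named differently
}


def style_for(element_name: str) -> Optional[str]:
    """Look up the paragraph style for an element; None means no mapping exists."""
    return TAG_TO_STYLE.get(element_name)


print(style_for("ProgramName"))
print(style_for("ProgramDescription"))
print(style_for("CourseCreditsHrs"))
```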

XML element attributes provide additional information, typically to enable finer distinctions among content that is basically the same. For example, in a staff directory, an attribute might be used to mark a person as a department head, so that the person’s name gets special typographical treatment in InDesign.
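The staff directory example might look something like the following sketch, which parses a small sample and picks out the people marked as department heads. The element names, the `role` attribute, its `DepartmentHead` value, and the staff names are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: the role attribute distinguishes otherwise identical Person elements.
STAFF_XML = """<StaffDirectory>
  <Person role="DepartmentHead"><Name>Ana Ruiz</Name></Person>
  <Person><Name>Lee Chen</Name></Person>
</StaffDirectory>"""

directory = ET.fromstring(STAFF_XML)

# Collect the names of people whose role attribute marks them as department heads.
heads = [
    person.findtext("Name")
    for person in directory.iter("Person")
    if person.get("role") == "DepartmentHead"
]
print(heads)
```

Both entries use the same Person element; only the attribute separates the one that should be styled differently.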

Unless you are using a DTD or schema developed by someone else, you can name elements and attributes in ways that are meaningful for your organization. That’s why XML is “extensible”—you are not limited to a defined set of elements as you would be with HTML for web pages.

If you are using a DTD or schema provided by another organization, you will have to learn how the elements and attributes in it create the kind of structure that you will work with in InDesign. I’ll examine elements and attributes and their naming more in subsequent chapters.
