Working with XML
The Extensible Markup Language (XML) provides a framework for designing domain-specific markup languages. It is sometimes used for representing annotated text and for lexical resources. Unlike HTML with its predefined tags, XML permits us to make up our own tags. Unlike a database, XML permits us to create data without first specifying its structure, and it permits us to have optional and repeatable elements. In this section, we briefly review some features of XML that are relevant for representing linguistic data, and show how to access data stored in XML files using Python programs.
Using XML for Linguistic Structures
Thanks to its flexibility and extensibility, XML is a natural choice for representing linguistic structures. Here’s an example of a simple lexical entry.
Example 11-3.
<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <gloss>any of the larger cetacean mammals having a streamlined body and breathing through a blowhole on the head</gloss>
</entry>
It consists of a series of XML tags enclosed in angle brackets. Each opening tag, such as <gloss>, is matched with a closing tag, </gloss>; together they constitute an XML element. The preceding example has been laid out nicely using whitespace, but it could equally have been put on a single long line. Our approach to processing XML will usually not be sensitive to whitespace. In order for XML to be well formed, all opening tags must have corresponding closing tags, at the same level of nesting (i.e., the XML ...
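The points about whitespace and well-formedness can be checked with a short Python sketch. It assumes the standard library's xml.etree.ElementTree module as the parser, which this passage does not itself name; the helper names entry_fields and is_well_formed, and the constants PRETTY and SINGLE_LINE, are made up for illustration. The sketch parses the whale entry in two layouts, compares the results after normalizing whitespace, and then tests a string whose closing tags are improperly nested.

```python
import xml.etree.ElementTree as ET

# The lexical entry from Example 11-3, laid out with whitespace.
PRETTY = """
<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <gloss>any of the larger cetacean mammals having a streamlined body
    and breathing through a blowhole on the head</gloss>
</entry>
"""

# The same entry on a single long line.
SINGLE_LINE = (
    "<entry><headword>whale</headword><pos>noun</pos>"
    "<gloss>any of the larger cetacean mammals having a streamlined body "
    "and breathing through a blowhole on the head</gloss></entry>"
)


def entry_fields(xml_text):
    """Map each child tag of the root element to its text.

    Runs of whitespace inside the text are collapsed to single spaces,
    so the layout of the source does not affect the result.
    (Hypothetical helper, not part of any library.)
    """
    root = ET.fromstring(xml_text.strip())
    return {child.tag: " ".join((child.text or "").split()) for child in root}


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error.

    (Hypothetical helper, not part of any library.)
    """
    try:
        ET.fromstring(xml_text.strip())
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Both layouts yield the same fields once whitespace is normalized.
    print(entry_fields(PRETTY) == entry_fields(SINGLE_LINE))
    # The closing tags here are not at the same level of nesting.
    print(is_well_formed("<entry><headword>whale</entry></headword>"))
```

Normalizing whitespace in entry_fields is a design choice for this sketch: the parser itself preserves the line break and indentation inside the gloss text, so any whitespace insensitivity has to come from the processing code.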