Chapter 12. XML and HTML
XML and HTML are the most popular markup languages (textual ways of describing structured data). HTML is used to describe textual documents, like those you see on the Web. XML is used for just about everything else: data storage, messaging, configuration files, you name it. Nearly every software buzzword forged over the past few years involves XML.
Java and C++ programmers tend to regard XML as a lightweight, agile technology, and are happy to use it all over the place. XML is a lightweight technology, but only compared to Java or C++. Ruby programmers see XML from the other end of the spectrum, and from there it looks pretty heavy. Simpler formats like YAML and JSON usually work just as well (see Recipes 14.1 or 14.2), and are easier to manipulate. But to shun XML altogether would be to cut Ruby off from the rest of the world, and nobody wants that. This chapter covers the most useful ways of parsing, manipulating, slicing, and dicing XML and HTML documents.
There are two standard APIs for manipulating XML: DOM and SAX. Both are overkill for most everyday uses, and neither is a good fit for Ruby’s code block–heavy style. Ruby’s solution is to offer a pair of APIs that capture the style of DOM and SAX while staying true to the Ruby programming philosophy. Both APIs are in the standard library’s REXML package, written by Sean Russell.
Like DOM, the Document class parses an XML document into a nested tree of objects. You can navigate the tree with Ruby accessors ...
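As a rough illustration of the tree-style API, here is a minimal sketch that parses a small document with REXML::Document and walks it with ordinary Ruby accessors and a code block. The sample XML, the SAMPLE_XML constant, and the fish_colors helper are invented for this example. It also assumes the rexml library is installed; on recent Ruby versions it ships as a bundled gem rather than as part of the standard library proper.

```ruby
require 'rexml/document'

# Hypothetical sample document, invented for illustration.
SAMPLE_XML = <<~XML
  <aquarium>
    <fish color="blue" size="small"/>
    <fish color="orange" size="large"/>
    <decoration type="castle"/>
  </aquarium>
XML

# Hypothetical helper: collect the color attribute of every <fish>
# child of the root element, using a code block to visit each one.
def fish_colors(doc)
  colors = []
  doc.root.each_element('fish') { |fish| colors << fish.attributes['color'] }
  colors
end

# Parsing the string yields a tree of Ruby objects.
doc = REXML::Document.new(SAMPLE_XML)

puts doc.root.name                                        # prints "aquarium"
puts fish_colors(doc).inspect                             # prints ["blue", "orange"]
puts doc.root.elements['decoration'].attributes['type']   # prints "castle"
```

Here doc.root returns the top-level element, each_element yields child elements matching the given name, and elements[...] looks up the first matching child, so most navigation needs nothing beyond plain method calls and blocks.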