Processing Markup and Data Formats with Regular Expressions

This final chapter focuses on common tasks that come up when working with an assortment of common markup languages and data formats: HTML, XHTML, XML, CSV, and INI. Although we’ll assume at least basic familiarity with these technologies, a brief description of each is included next to make sure we’re on the same page before digging in. The descriptions concentrate on the basic syntax rules needed to correctly search through the data structures of each format. Other details will be introduced as we encounter relevant issues.

Although it’s not always apparent on the surface, some of these formats can be surprisingly complex to process and manipulate accurately, at least using regular expressions. When programming, it’s usually best to use dedicated parsers and APIs instead of regular expressions when performing many of the tasks in this chapter, especially if accuracy is paramount (e.g., if your processing might have security implications). However, we don’t ascribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, scraping data from a limited set of HTML files, fixing broken XML files, or dealing with file formats that look like but aren’t quite XML. There are some issues to be aware of, but reading through this chapter will ensure that you don’t stumble into ...

Get Regular Expressions Cookbook, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.