9.6. Decode XML Entities

Problem

You want to convert all character entities defined by the XML standard to their corresponding literal characters. The conversion should handle named character references (such as &, <, and ") as well as numeric character references (be they in decimal notation as Σ or Σ, or in hexadecimal notation as Σ, Σ, or Σ).

Solution

Regular expression

&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This regular expression includes three capturing groups. Only one of the groups participate in any particular match and capture a value. Using three groups like this allows you to easily check which type of entity was matched.

Replace matches with their corresponding literal characters

Use the regular expression just shown, together with the code in Recipe 3.16. The code examples listed there show how to perform a search-and-replace with replacement text generated in code.

When writing your replacement callback function, use backreferences to determine the appropriate replacement character. If group 1 captured a value, backreference 1 holds a numeric character reference in decimal notation, possibly with leading zeros. If group 2 captured a value, backreference 2 holds a numeric character reference in hexadecimal notation, possibly with leading zeros. If group 3 captured a value, backreference 3 holds an entity name. Use a lookup object, dictionary, ...

Get Regular Expressions Cookbook, 2nd Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.