9.6. Decode XML Entities
You want to convert all character entities defined by the
XML standard to their corresponding literal characters. The conversion
should handle named character references (such as
&quot;) as well as numeric character
references (be they in decimal notation as
&#931;, or in hexadecimal notation as &#x3A3;).
Regex options: None
This regular expression includes three capturing groups. Only one of the groups participates in any particular match and captures a value. Using three groups like this lets you easily check which type of entity was matched.
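As a minimal sketch of the group check in Python: the pattern below is an assumption, written to fit the three-group description (decimal digits, hexadecimal digits, entity name) rather than copied from the recipe, and the helper name entity_kind is hypothetical.

```python
import re

# Assumed pattern, not taken from the text:
#   group 1 = decimal digits, group 2 = hexadecimal digits, group 3 = entity name.
ENTITY_RE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));")

def entity_kind(match):
    """Report which of the three capturing groups participated in the match."""
    if match.group(1) is not None:
        return "decimal"
    if match.group(2) is not None:
        return "hexadecimal"
    return "named"

# Classify one reference of each type.
kinds = [entity_kind(m) for m in ENTITY_RE.finditer("&#931; &#x3A3; &quot;")]
```

In Python, a group that did not participate returns None, so the test is `is not None` rather than a truthiness check.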
Replace matches with their corresponding literal characters
Use the regular expression just shown, together with the code in Recipe 3.16. The code examples listed there show how to perform a search-and-replace with replacement text generated in code.
When writing your replacement callback function, use backreferences to determine the appropriate replacement character. If group 1 captured a value, backreference 1 holds a numeric character reference in decimal notation, possibly with leading zeros. If group 2 captured a value, backreference 2 holds a numeric character reference in hexadecimal notation, possibly with leading zeros. If group 3 captured a value, backreference 3 holds an entity name. Use a lookup object, dictionary, ...
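The callback logic described above might look like this in Python. This is a sketch under stated assumptions: the pattern is reconstructed from the description of the three groups, the names XML_ENTITIES, _replace, and decode_xml_entities are hypothetical, and leaving unknown names or out-of-range code points untouched is a design choice made for this example.

```python
import re

# Assumed pattern: group 1 = decimal, group 2 = hexadecimal, group 3 = name.
ENTITY_RE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|([0-9a-zA-Z]+));")

# The five entities predefined by the XML specification.
XML_ENTITIES = {"quot": '"', "amp": "&", "apos": "'", "lt": "<", "gt": ">"}

def _replace(match):
    """Return the literal character for one matched reference."""
    dec, hx, name = match.groups()
    try:
        if dec is not None:
            return chr(int(dec, 10))   # leading zeros are accepted
        if hx is not None:
            return chr(int(hx, 16))
    except (ValueError, OverflowError):
        # Code point outside the Unicode range: keep the original text.
        return match.group(0)
    # Named reference: look it up, keeping unknown names unchanged.
    return XML_ENTITIES.get(name, match.group(0))

def decode_xml_entities(text):
    """Decode XML character and entity references in a single pass."""
    return ENTITY_RE.sub(_replace, text)
```

Because re.sub scans the input once and does not rescan replacement text, a doubly escaped reference such as &amp;lt; decodes to &lt; rather than to a less-than sign.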