8.4. Match XML Names
Problem
You want to check whether a string is a legitimate XML name (a common syntactic construct). XML provides precise rules for the characters that can occur in a name, and reuses those rules for element, attribute, and entity names, processing instruction targets, and more. A name must start with a letter, underscore, or colon, which can be followed by any combination of letters, digits, underscores, colons, hyphens, and periods. That’s actually an approximate description, but it’s pretty close. The exact list of permitted characters depends on the version of XML in use.
Alternatively, you might want to splice a pattern for matching valid names into other XML-handling regexes, when the extra precision warrants the added complexity.
Following are some examples of valid names:
thing
_thing_2_
:Российские-Вещь
fantastic4:the.thing
日本の物
Note that letters from non-Latin scripts are allowed, even including the ideographic characters in the last example. Likewise, any Unicode digit is allowed after the first character, not just the Arabic numerals 0–9.
For comparison, here are several examples of invalid names that should not be matched by the regex:
thing!
thing with spaces
.thing.with.a.dot.in.front
-thingamajig
2nd_thing
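As a rough illustration, the approximate rule stated in the problem description (a letter, underscore, or colon first, then letters, digits, underscores, colons, hyphens, and periods) can be sketched in Python. This is not one of the exact, version-specific patterns; it leans on Python's Unicode-aware \w and \d, whose character sets only approximate the lists in the XML specifications. The names APPROX_XML_NAME and is_approx_xml_name are made up for this example.

```python
import re

# Approximation only: the first character is a letter, underscore, or colon;
# later characters may also be digits, hyphens, or periods.
# [^\W\d] reads as "a word character that is not a decimal digit".
# Python's \w and \d are stand-ins here, not the XML character lists.
APPROX_XML_NAME = re.compile(r"(?:[^\W\d]|:)[\w:.\-]*")

def is_approx_xml_name(candidate: str) -> bool:
    """Return True if the whole string fits the approximate name rule."""
    return APPROX_XML_NAME.fullmatch(candidate) is not None

# Try a few of the valid and invalid examples listed above.
for name in ["thing", "_thing_2_", "日本の物", "thing!", "2nd_thing"]:
    print(name, is_approx_xml_name(name))
```

Using fullmatch rather than search ensures that the entire string is tested, so a name with a stray space or punctuation mark in the middle is rejected instead of partially matched.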
Solution
Like identifiers in many programming languages, there is a set of characters that can occur in an XML name, and a subset that can be used as the first character. Those character lists are dramatically different for XML 1.0 Fourth Edition ...