Using Unicode with Regular Expressions

In addition to normal searching, where you match a literal pattern against a document (with potentially varying degrees of looseness in what constitutes a “match”), text is often searched for strings that match a regular expression. A regular expression describes a category of related strings; any string in that category is said to “match” the regular expression. For example, a regular expression such as

p[a-z]*g

might be used to specify a category consisting of all words that start with p and end with g, with any number of lowercase letters in between. The words “pig,” “pug,” “plug,” and “piling” would all match this regular expression. If you searched a string for this regular expression, the search would stop on any of those words (or any other word that began with ...
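
As a rough illustration of the example above, here is a minimal sketch using Python's standard `re` module. The word list, the sample sentence, and the variable names are assumptions made up for this illustration and are not from the original text. The `\b` word-boundary variant is also an addition: the bare pattern, when used for searching, can match inside a longer word.

```python
import re

# The pattern from the text: "p", then zero or more lowercase
# ASCII letters, then "g".
pattern = re.compile(r"p[a-z]*g")

# Hypothetical sample words. fullmatch() requires the whole string
# to belong to the category the expression describes.
words = ["pig", "pug", "plug", "piling", "pit", "Pig"]
full_matches = [w for w in words if pattern.fullmatch(w)]

# Searching a longer string: the bare pattern also stops inside
# "spring", because it is not anchored to word boundaries.
text = "spring plug"
loose_hits = [m.group() for m in pattern.finditer(text)]

# Adding \b on both sides restricts hits to whole words.
word_pattern = re.compile(r"\bp[a-z]*g\b")
word_hits = [m.group() for m in word_pattern.finditer(text)]

print(full_matches)
print(loose_hits)
print(word_hits)
```

Note that `[a-z]` covers only the 26 lowercase ASCII letters, so a capitalized “Pig” or a word containing an accented letter falls outside the category as written. Passing `re.IGNORECASE` when compiling is one way to accept the capitalized form.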
