We've seen in some examples that regular expressions can quickly become quite complex. One way of battling complexity is to split expressions up into separate expressions. A good example of this is finding hyphens in ranges and replacing them with en dashes. As we will see, it's not easy to come up with a single expression to accomplish this.
The ranges we're after are ranges of numbers, both Arabic and Roman (e.g., 34–78, v–ix), ranges of numbers in parentheses (as in (23)–(27)), and certain letter ranges (a-d), which could be preceded by a number (as in 6b-d). The first type, ranges of Arabic numbers such as 23–56, can be handled with this expression:
(?x) \b (\d+) - (\d+) \b
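We need to specify word boundaries so that each side of the hyphen is matched as a whole number. A bare search for \d-\d matches all kinds of things that are unlikely to be page ranges, such as ISBNs, telephone numbers, and grant numbers; anything with more than one hyphen is not a page range. Note that the word boundaries alone do not rule those out, since the position between a digit and a hyphen is itself a word boundary: the expression still finds 978-0 inside an ISBN, so such strings need a separate check.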
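To see how the expression behaves on real strings, here is a minimal sketch in Python's re module. The choice of Python, the function name to_en_dash, and the stricter variant STRICT_RANGE with its lookarounds are illustrative assumptions, not part of the original expression; other regex engines may differ in detail.

```python
import re

EN_DASH = "\u2013"

# The expression from the text, in free-spacing mode.
ARABIC_RANGE = re.compile(r"(?x) \b (\d+) - (\d+) \b")

# Hypothetical stricter variant: the lookarounds reject a match that is
# preceded or followed by another digit or hyphen.
STRICT_RANGE = re.compile(r"(?x) (?<![\d-]) \b (\d+) - (\d+) \b (?![\d-])")


def to_en_dash(text, pattern=STRICT_RANGE):
    """Replace the hyphen in each matched number range with an en dash."""
    return pattern.sub(r"\1" + EN_DASH + r"\2", text)


print(to_en_dash("pp. 23-56"))                             # pp. 23–56
print(to_en_dash("ISBN 978-0-596-52068-7"))                # unchanged
print(to_en_dash("ISBN 978-0-596-52068-7", ARABIC_RANGE))  # ISBN 978–0-596–52068-7
```

The last call shows why a second check is useful: with word boundaries alone, pairs of numbers inside a longer hyphenated string are still converted.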
Hyphens in ranges of Arabic numbers in parentheses such as (12)-(14) can be captured with the following expression:
(?x) \b ([\d()]+) - ([\d()]+) \b
In order to match a number with its enclosing parentheses, we need to define a character class that contains digits and parentheses, [\d()] (note again that the parentheses needn't be escaped in a character class).
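The parenthesized case can be tried out the same way. This is a sketch in Python's re module; the function name paren_ranges_to_en_dash is a hypothetical helper, and the behaviour described is that of Python's engine. There, \b cannot match between a space and a parenthesis, so the outermost parentheses end up outside the match; the replacement still produces the intended result because only the hyphen changes.

```python
import re

EN_DASH = "\u2013"

# The expression from the text: digits and parentheses on both sides of the hyphen.
PAREN_RANGE = re.compile(r"(?x) \b ([\d()]+) - ([\d()]+) \b")


def paren_ranges_to_en_dash(text):
    """Replace the hyphen in ranges such as (12)-(14) or 12-14 with an en dash."""
    return PAREN_RANGE.sub(r"\1" + EN_DASH + r"\2", text)


print(paren_ranges_to_en_dash("examples (23)-(27) and 5-9"))
# examples (23)–(27) and 5–9

# The captured groups show where the match actually starts and ends.
print(PAREN_RANGE.search("(12)-(14)").groups())
# ('12)', '(14')
```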
Hyphens in Roman page ranges such as ii-xv are replaced with an en dash ...