Specifying a Regular Expression for the Shortest Match
Problem
You’re trying to match a text pattern using regular expressions, but the regular expression is identifying the longest possible matches of a pattern. Instead, you would like to change it to find the shortest possible match.
Solution
This problem often arises in patterns that try to match text enclosed inside a pair of starting and ending delimiters (e.g., a quoted string). To illustrate, consider this example:
>>> import re
>>> str_pat = re.compile(r'\"(.*)\"')
>>> text1 = 'Computer says "no."'
>>> str_pat.findall(text1)
['no.']
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)
['no." Phone says "yes.']
>>>
In this example, the pattern r'\"(.*)\"' is attempting to match text enclosed inside double quotes. However, the * operator in a regular expression is greedy, so matching is based on finding the longest possible match. Thus, in the second example involving text2, the match runs from the first quote to the last one, incorrectly treating the two quoted strings and the text between them as a single match.
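To see exactly how much text the greedy pattern consumes, it can help to inspect the whole match rather than only the captured group. The following is a minimal sketch using re.search on the same text2 string; the variable names greedy, whole_match, and start/end are illustrative and not part of the original recipe.

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# Greedy: .* first runs to the end of the string, then backtracks
# only as far as needed to find a closing quote (the LAST one).
greedy = re.search(r'\"(.*)\"', text2)

whole_match = greedy.group(0)   # everything the pattern matched, quotes included
start, end = greedy.span()      # where the match begins and ends in text2

print(whole_match)              # "no." Phone says "yes."
print(start == text2.index('"'), end == len(text2))
```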
To fix this, add the ? modifier after the * operator in
the pattern, like this:
>>> str_pat = re.compile(r'\"(.*?)\"')
>>> str_pat.findall(text2)
['no.', 'yes.']
>>>
This makes the matching nongreedy, so each match stops at the first closing quote it can reach and the pattern produces the shortest match instead.
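As a self-contained comparison, the sketch below runs both versions of the pattern against the same inputs. The helper name compare and the constants GREEDY and NONGREEDY are hypothetical names chosen for illustration; only re.compile and findall come from the recipe itself.

```python
import re

# The two patterns from the recipe: they differ only by the ? after *.
GREEDY = re.compile(r'\"(.*)\"')
NONGREEDY = re.compile(r'\"(.*?)\"')

def compare(text):
    """Return (greedy_results, nongreedy_results) for the given text."""
    return GREEDY.findall(text), NONGREEDY.findall(text)

text1 = 'Computer says "no."'
text2 = 'Computer says "no." Phone says "yes."'

# With a single quoted string, both patterns agree.
print(compare(text1))

# With two quoted strings, only the nongreedy pattern separates them.
print(compare(text2))
```

With a single pair of quotes the two patterns return the same result, so the difference only shows up once the text contains more than one closing delimiter.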
Discussion
This recipe addresses one of the more common problems encountered ...