Useful Applications of Regular Expressions
The previous examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking whether a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.
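To make the three uses concrete, here is a minimal sketch that checks, extracts from, and modifies a single word. The word 'abstraction' and the patterns are chosen purely for illustration and do not come from the text above; re.search(), re.findall(), and re.sub() are standard functions of Python's re module.

```python
import re

word = 'abstraction'  # illustrative word, not from the corpus

# Check: does the pattern match anywhere in the word?
has_suffix = bool(re.search(r'tion$', word))

# Extract: collect every run of non-vowel characters
consonant_runs = re.findall(r'[^aeiou]+', word)

# Modify: remove the suffix, leaving a crude stem
stem = re.sub(r'tion$', '', word)

print(has_suffix)       # True
print(consonant_runs)   # ['bstr', 'ct', 'n']
print(stem)             # abstrac
```

Note that [^aeiou] matches any character that is not a lowercase vowel, so on mixed-case or punctuated input it would also pick up capitals and symbols.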
Extracting Word Pieces
The re.findall() (“find all”) function finds all (non-overlapping) matches of the given regular expression. Let’s find all the vowels in a word, then count them:
>>> import re
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16
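The “non-overlapping” qualification matters once a pattern can match more than one character. The following sketch, using the illustrative word 'queue' (an assumption, not an example from the text), shows that re.findall() resumes scanning after the end of each match, and contrasts this with a lookahead pattern, which is one common way to recover overlapping matches.

```python
import re

word = 'queue'  # illustrative word with four vowels in a row

# Non-overlapping: after matching 'ue' at position 1, scanning resumes
# at position 3, so the 'eu' straddling positions 2-3 is never reported.
pairs = re.findall(r'[aeiou]{2}', word)

# Overlapping: a zero-width lookahead consumes no characters, so every
# starting position is tried; the capture group holds the vowel pair.
overlapping_pairs = re.findall(r'(?=([aeiou]{2}))', word)

print(pairs)              # ['ue', 'ue']
print(overlapping_pairs)  # ['ue', 'eu', 'ue']
```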
Let’s look for all sequences of two or more vowels in some text, and determine their relative frequency:
>>> import nltk
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.most_common()
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95),
('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ...]
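The same counting idea can be tried without downloading a corpus. The sketch below substitutes collections.Counter from the standard library for nltk.FreqDist, and a small hand-picked word list (hypothetical, not drawn from the Treebank) for the corpus vocabulary, so the counts it produces are not comparable to the ones above.

```python
import re
from collections import Counter

# Hypothetical vocabulary standing in for the corpus word list
words = ['audio', 'queue', 'beautiful', 'radio', 'need', 'read']

# Count every maximal run of two or more vowels across all words
counts = Counter(vs for word in words
                 for vs in re.findall(r'[aeiou]{2,}', word))

print(counts['io'])           # 2  (from 'audio' and 'radio')
print(counts['ueue'])         # 1  ('queue' yields one four-vowel run)
print(counts.most_common(1))  # [('io', 2)]
```

Because {2,} is greedy, a long vowel run such as the one in 'queue' is reported once as a whole rather than split into pairs.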
Note
Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:
[int(n) for n in re.findall(?, '2009-12-31')]
Doing More ...