Regular Expressions for Detecting Word Patterns
Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety of such “word tests” in Table 1-4. Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.
Note
There are many other published introductions to regular
expressions, organized around the syntax of regular expressions and
applied to searching text files. Instead of doing this again, we focus
on the use of regular expressions at different stages of linguistic
processing. As usual, we’ll adopt a problem-based approach and present
new features only as they are needed to solve practical problems. In
our discussion we will mark regular expressions using chevrons like
this: «patt».
To use regular expressions in Python, we need to import the re library using: import re. We also need a list of words to search; we’ll use the Words Corpus again (Lexical Resources). We will preprocess it to remove any proper names.
>>> import re
>>> import nltk
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
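To see what the islower() filter does without loading the full corpus, here is a minimal sketch. The list named sample is a hypothetical stand-in for nltk.corpus.words.words('en'), not data taken from the corpus itself.

```python
# Hypothetical sample standing in for nltk.corpus.words.words('en').
sample = ['Aaron', 'abandoned', 'abbey', 'Zeus', 'walked', 'NASA']

# str.islower() is False for any word containing an uppercase letter,
# so capitalized proper names and all-caps entries are dropped.
wordlist = [w for w in sample if w.islower()]

print(wordlist)  # ['abandoned', 'abbey', 'walked']
```

Filtering on case is only a rough proxy for "not a proper name"; it works here because proper names in such word lists are typically capitalized.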
Using Basic Metacharacters
Let’s find words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. We need to specify the characters of interest, and use the dollar sign, which has a special behavior in the context of regular expressions ...
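As a rough illustration of how the dollar sign changes the match, the sketch below runs re.search over a small hypothetical word list (the words are chosen for illustration, not drawn from the Words Corpus). Without the dollar sign, the pattern matches ed anywhere in the word; with it, the match must end where the string ends.

```python
import re

# Hypothetical word list, used here in place of the Words Corpus.
wordlist = ['abandoned', 'edit', 'walked', 'bed', 'reduce', 'red']

# «ed$»: the dollar sign anchors the match to the end of the string.
ends_with_ed = [w for w in wordlist if re.search(r'ed$', w)]

# «ed»: no anchor, so "ed" may appear anywhere inside the word.
contains_ed = [w for w in wordlist if re.search(r'ed', w)]

print(ends_with_ed)  # ['abandoned', 'walked', 'bed', 'red']
print(contains_ed)   # ['abandoned', 'edit', 'walked', 'bed', 'reduce', 'red']
```

For this particular pattern, the anchored search selects the same words as the endswith('ed') test.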