Matching Words
Problem
You want to pick out words from a string.
Solution
Think long and hard about what you want a word to be and what separates one word from the next, then write a regular expression that embodies your decisions. For example:
/\S+/               # as many non-whitespace characters as possible
/[A-Za-z'-]+/       # as many letters, apostrophes, and hyphens
Discussion
Because words vary between applications, languages, and input
streams, Perl does not have built-in definitions of words. You must
make them from character classes and quantifiers yourself, as we did
above. The second pattern is an attempt to recognize "shepherd's" and
"sheep-shearing" each as single words.
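As a minimal sketch of how the two patterns from the Solution behave, the snippet below applies each one with the /g modifier in list context to collect every match. The sample sentence and the variable names ($text, @chunks, @words) are made up for illustration and are not part of the recipe.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen to contain an apostrophe and a hyphen.
my $text = "The shepherd's sheep-shearing went well.";

# Pattern 1: maximal runs of non-whitespace characters.
# Trailing punctuation stays attached to the last chunk.
my @chunks = $text =~ /\S+/g;

# Pattern 2: maximal runs of letters, apostrophes, and hyphens.
# The final period is not in the class, so it is left out.
my @words = $text =~ /[A-Za-z'-]+/g;

print join("|", @chunks), "\n";   # The|shepherd's|sheep-shearing|went|well.
print join("|", @words),  "\n";   # The|shepherd's|sheep-shearing|went|well
```

The only difference on this input is the final period, which shows how much the result depends on what you decide a word is.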
Most approaches will have limitations because of the vagaries of
written human languages. For instance, although the second pattern
successfully identifies "spank'd" and "counter-clockwise" as words, it
will also pull the "rd" out of "23rd Psalm". If you want to be more
precise when you pull words out of a string, you can specify what
surrounds the word. Normally, this should be a word boundary, not
whitespace:
/\b([A-Za-z]+)\b/            # usually best
/\s([A-Za-z]+)\s/            # fails at ends or w/ punctuation
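To make the difference concrete, here is a small sketch that runs the bare character class and both delimited patterns over one string. The input sentence and the array names (@loose, @bounded, @spaced) are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input with a digit-prefixed token and a comma.
my $text = "Read the 23rd Psalm, then rest";

# Bare character class: also grabs the "rd" inside "23rd".
my @loose = $text =~ /[A-Za-z'-]+/g;

# Word boundaries: "rd" is skipped because there is no boundary
# between "3" and "r" (both are \w characters).
my @bounded = $text =~ /\b([A-Za-z]+)\b/g;

# Whitespace delimiters: misses words at either end of the string
# and words that touch punctuation.
my @spaced = $text =~ /\s([A-Za-z]+)\s/g;

print join("|", @loose),   "\n";   # Read|the|rd|Psalm|then|rest
print join("|", @bounded), "\n";   # Read|the|Psalm|then|rest
print join("|", @spaced),  "\n";   # the|then
```

On this input the whitespace version loses "Read" and "rest" at the ends and "Psalm" next to the comma, while the \b version keeps all three.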
Although Perl provides \w, which matches a character that is part of a
valid Perl identifier, Perl identifiers are rarely what you think of
as words, since an identifier is a string of alphanumerics and
underscores, but not colons or quotes. Because it’s defined in terms
of \w,
\b may surprise you if you expect to match an ...