You want to convert people’s names from the “FirstName LastName” format to “LastName, FirstName” for use in an alphabetical listing. You additionally want to account for other name parts, so that you can, say, convert “FirstName MiddleNames Particles LastName Suffix” to “LastName, FirstName MiddleNames Particles Suffix.”
Unfortunately, it isn’t possible to reliably parse names using a regular expression. Regular expressions are rigid, whereas names are so flexible that even humans get them wrong. Determining the structure of a name or how it should be listed alphabetically often requires taking traditional and national conventions, or even personal preferences, into account. Nevertheless, if you’re willing to make certain assumptions about your data and can handle a moderate level of error, a regular expression can provide a quick solution.
The following regular expression has intentionally been kept simple, rather than trying to account for edge cases.
^(.+?)●([^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$2,●$1$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\2,●\1\3
Replacement text flavors: Python, Ruby
function formatName(name) {
    return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
                        "$2, $1$3");
}
Recipe 3.15 has code listings that will help you add this regex search-and-replace to programs written in other languages. Recipe 3.4 shows how to set the “case insensitive” option used here.
First, let’s take a look at this regular expression piece by piece. Higher-level comments are provided afterward to help explain which parts of a name are being matched by various segments of the regex. Since the regex is written here in free-spacing mode, the literal space characters have been escaped with backslashes:
^               # Assert position at the beginning of the string.
(               # Capture the enclosed match to backreference 1:
  .+?           #   Match one or more characters, as few times as possible.
)               # End the capturing group.
\               # Match a literal space character.
(               # Capture the enclosed match to backreference 2:
  [^\s,]+       #   Match one or more non-whitespace/comma characters.
)               # End the capturing group.
(               # Capture the enclosed match to backreference 3:
  ,?\           #   Match ", " or " ".
  (?:           #   Group but don't capture:
    [JS]r\.?    #     Match "Jr", "Jr.", "Sr", or "Sr.".
   |            #    Or:
    III?        #     Match "II" or "III".
   |            #    Or:
    IV          #     Match "IV".
  )             #   End the noncapturing group.
)?              # Make the group optional.
$               # Assert position at the end of the string.
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
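Native JavaScript regex literals have no free-spacing option, which is why XRegExp appears in the flavor list instead of JavaScript. As a rough sketch that avoids any library, a similar commenting effect can be approximated by assembling the pattern from annotated string fragments. The `nameRegex` name and the fragment layout are illustrative assumptions, not part of the recipe. Note that backslashes must be doubled inside string literals.

```javascript
// A library-free stand-in for free-spacing mode: each fragment of the
// pattern sits on its own line with a comment, then the pieces are joined.
const nameRegex = new RegExp(
  [
    "^",                            // Start of the string
    "(.+?)",                        // Group 1: first name, middle names, particles
    " ",                            // A literal space
    "([^\\s,]+)",                   // Group 2: last name
    "(,? (?:[JS]r\\.?|III?|IV))?",  // Group 3: optional suffix
    "$",                            // End of the string
  ].join(""),
  "i"                               // Case insensitive
);

console.log("John Smith".replace(nameRegex, "$2, $1$3"));
```

The joined pattern is intended to behave like the single-line regex in the solution; only the way it is written differs.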
This regular expression makes the following assumptions about the subject data:
It contains at least one first name and one last name (other name parts are optional).
The first name is listed before the last name (not the norm with some national conventions).
If the name contains a suffix, it is one of the values “Jr”, “Jr.”, “Sr”, “Sr.”, “II”, “III”, or “IV”, with an optional preceding comma.
A few more issues to consider:
The regular expression cannot identify compound surnames that don’t use hyphens. For example, “Sacha Baron Cohen” would be replaced with “Cohen, Sacha Baron”, rather than the correct listing, “Baron Cohen, Sacha”.
It does not keep particles in front of the family name, although this is sometimes called for by convention or personal preference (for example, the correct alphabetical listing of “Charles de Gaulle” is “de Gaulle, Charles” according to the Chicago Manual of Style, 16th Edition, which contradicts Merriam-Webster’s Biographical Dictionary on this particular name).
Because of the ‹^› and ‹$› anchors that bind the match to the beginning and end of the string, no replacement can be made if the entire subject text does not fit the pattern. Hence, if no suitable match is found (for example, if the subject text contains only one name), the name is left unaltered.
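The limitations listed above are easy to confirm by hand. The sketch below repeats the `formatName` function from the solution so that it runs on its own, then feeds it one name for each case: an unhyphenated compound surname, a surname particle, and a single name that does not fit the pattern.

```javascript
// Same regex and replacement text as the solution.
function formatName(name) {
  return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i, "$2, $1$3");
}

// Compound surname without a hyphen: only the final word is treated as the last name.
console.log(formatName("Sacha Baron Cohen")); // Cohen, Sacha Baron

// Surname particle: "de" stays with the first name instead of the family name.
console.log(formatName("Charles de Gaulle")); // Gaulle, Charles de

// Only one name: the pattern cannot match, so the text comes back unchanged.
console.log(formatName("Madonna"));           // Madonna
```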
As for how the regular expression works, it uses three capturing groups to split up the name. The pieces are then reassembled in the desired order via backreferences in the replacement string. Capturing group 1 uses the maximally flexible ‹.+?› pattern to grab the first name along with any number of middle names and surname particles, such as the German “von” or the French, Portuguese, and Spanish “de.” These name parts are handled together because they are listed sequentially in the output. Lumping the first and middle names together also helps avoid errors, because the regular expression cannot distinguish between a compound first name, such as “Mary Lou” or “Norma Jeane,” and a first name plus middle name. Even humans cannot accurately make the distinction just by visual examination.
Capturing group 2 matches the last name using ‹[^\s,]+›. Like the dot used in capturing group 1, the flexibility of this character class allows it to match accented characters and any other non-Latin characters. Capturing group 3 matches an optional suffix, such as “Jr.” or “III,” from a predefined list of possible values. The suffix is handled separately from the last name because it should continue to appear at the end of the reformatted name.
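To see what each capturing group actually holds before the replacement text reassembles the pieces, the match can be inspected directly. This is only a sketch: `splitName` and the property names `leading`, `last`, and `suffix` are hypothetical and chosen for illustration.

```javascript
// Same pattern as the solution; splitName is a hypothetical helper.
const nameRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

function splitName(name) {
  const match = nameRegex.exec(name);
  if (match === null) {
    return null; // The name does not fit the pattern.
  }
  return {
    leading: match[1],      // Group 1: first name, middle names, particles
    last: match[2],         // Group 2: last name
    suffix: match[3] || "", // Group 3: undefined when no suffix was matched
  };
}

console.log(splitName("Martin Luther King, Jr."));
// { leading: 'Martin Luther', last: 'King', suffix: ', Jr.' }

console.log(splitName("Ludwig van Beethoven"));
// { leading: 'Ludwig van', last: 'Beethoven', suffix: '' }
```

Group 3 keeps the comma and space that precede the suffix, which is why the replacement text can append it directly after group 1.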
Let’s go back for a minute to capturing group 1. Why was the dot within group 1 followed by the lazy ‹+?› quantifier, whereas the character class in group 2 was followed by the greedy ‹+› quantifier? If group 1 (which handles a variable number of elements and therefore needs to go as far as it can into the name) used a greedy quantifier, capturing group 3 (which attempts to match a suffix) wouldn’t have a shot at participating in the match. The dot from group 1 would match until the end of the string, and since capturing group 3 is optional, the regex engine would only backtrack enough to find a match for group 2 before declaring success. Capturing group 2 can use a greedy quantifier because its more restrictive character class only allows it to match one name.
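The effect of the quantifier choice can be checked with a side-by-side comparison. In the sketch below, `lazyRegex` is the regex from the solution, and `greedyRegex` is a deliberately altered copy (an illustration only) in which group 1 uses ‹.+› instead of ‹.+?›.

```javascript
// The solution's regex: group 1 is lazy.
const lazyRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

// Altered for comparison: group 1 is greedy. Not recommended.
const greedyRegex = /^(.+) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

const name = "John Smith Jr.";

// Lazy group 1 stops at "John", so group 3 gets to match the suffix.
console.log(name.replace(lazyRegex, "$2, $1$3"));   // Smith, John Jr.

// Greedy group 1 backtracks only far enough for group 2 to match "Jr.",
// so the suffix is mistaken for the last name and group 3 never participates.
console.log(name.replace(greedyRegex, "$2, $1$3")); // Jr., John Smith
```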
Table 4-2 shows some examples of how names are formatted using this regular expression and replacement string.
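To reproduce this kind of input/output listing yourself, a short loop over sample names is enough. The names below are chosen for illustration and are not claimed to be the table's entries; `formatName` repeats the function from the solution so the snippet is self-contained.

```javascript
// Same regex and replacement text as the solution.
function formatName(name) {
  return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i, "$2, $1$3");
}

// Illustrative sample names (an assumption, not the book's table data).
const samples = [
  "Robert Downey, Jr.",
  "John F. Kennedy",
  "Catherine Zeta-Jones",
  "Henry Ford II",
];

for (const input of samples) {
  console.log(input + " -> " + formatName(input));
}
// Robert Downey, Jr. -> Downey, Robert, Jr.
// John F. Kennedy -> Kennedy, John F.
// Catherine Zeta-Jones -> Zeta-Jones, Catherine
// Henry Ford II -> Ford, Henry II
```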
An added segment in the following regular expression allows you to output surname particles from a predefined list in front of the last name. Specifically, this regular expression accounts for the values “de”, “du”, “la”, “le”, “St”, “St.”, “Ste”, “Ste.”, “van”, and “von”. Any number of these values are allowed in sequence (for example, “de la”):
^(.+?)●((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n)●)*[^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$2,●$1$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\2,●\1\3
Replacement text flavors: Python, Ruby
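In JavaScript, this variation might be wrapped as follows. The function name `formatNameWithParticles` is a hypothetical choice for this sketch; the regex and replacement text are the ones shown above, with ordinary spaces in place of the ● markers.

```javascript
// Variation: listed particles directly before the last name move with it.
const particleRegex =
  /^(.+?) ((?:(?:d[eu]|l[ae]|Ste?\.?|v[ao]n) )*[^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

function formatNameWithParticles(name) {
  return name.replace(particleRegex, "$2, $1$3");
}

console.log(formatNameWithParticles("Charles de Gaulle"));    // de Gaulle, Charles
console.log(formatNameWithParticles("Jean de la Fontaine"));  // de la Fontaine, Jean
console.log(formatNameWithParticles("Ludwig van Beethoven")); // van Beethoven, Ludwig
console.log(formatNameWithParticles("John Smith Jr."));       // Smith, John Jr.
```

Because the match is case insensitive, a word such as “Van” or “La” that directly precedes the last name is treated as a particle even when it is really a middle name, so the results still depend on the assumptions you make about your data.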
Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.