You want to convert people’s names from the “FirstName LastName” format to “LastName, FirstName”, for use in an alphabetical listing. You additionally want to account for other name parts, so that you can, e.g., convert “FirstName MiddleNames Particles LastName Suffix” to “LastName, FirstName MiddleNames Particles Suffix”.
Unfortunately, it isn’t possible to reliably parse names using a regular expression. Regular expressions are rigid, whereas names are so flexible that even humans get them wrong. Determining the structure of a name or how it should be listed alphabetically often requires taking traditional and national conventions, or even personal preferences, into account. Nevertheless, if you’re willing to make certain assumptions about your data and can handle a moderate level of error, a regular expression can provide a quick solution.
The following regular expression has intentionally been kept simple, rather than trying to account for edge cases.
^(.+?)●([^\s,]+)(,?●(?:[JS]r\.?|III?|IV))?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$2,●$1$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\2,●\1\3
Replacement text flavors: Python, Ruby
function formatName(name) {
    return name.replace(/^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i,
                        "$2, $1$3");
}
See Recipe 3.15 for help implementing this regular expression with other programming languages.
First, let’s take a look at this regular expression piece by piece. Higher-level comments are provided afterward to help explain which parts of a name are being matched by various segments of the regex. Since the regex is written here in free-spacing mode, the literal space characters have been escaped with backslashes:
^               # Assert position at the beginning of the string.
(               # Capture the enclosed match to backreference 1...
  .+?           #   Match one or more characters, as few times as possible.
)               # End the capturing group.
\               # Match a literal space character.
(               # Capture the enclosed match to backreference 2...
  [^\s,]+       #   Match one or more characters that are not whitespace
                #   or commas.
)               # End the capturing group.
(               # Capture the enclosed match to backreference 3...
  ,?\           #   Match ", " or " ".
  (?:           #   Group but don't capture...
    [JS]r\.?    #     Match "Jr", "Jr.", "Sr", or "Sr.".
   |            #    or...
    III?        #     Match "II" or "III".
   |            #    or...
    IV          #     Match "IV".
  )             #   End the noncapturing group.
)?              # Repeat the group between zero and one time.
$               # Assert position at the end of the string.
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
This regular expression makes the following assumptions about the subject data:
It contains at least one first name and one last name (other name parts are optional).
The first name is listed before the last name.
If the name contains a suffix, it is one of the values “Jr”, “Jr.”, “Sr”, “Sr.”, “II”, “III”, or “IV”, with an optional preceding comma.
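These assumptions can be exercised with a short script. The following is a minimal sketch that reuses the regex and replacement text from the solution; the sample names are illustrative assumptions chosen to cover each suffix form, not data from the original recipe.

```javascript
// Regex and replacement text copied from the solution above.
const nameRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

function formatName(name) {
    return name.replace(nameRegex, "$2, $1$3");
}

// Illustrative inputs (assumed for this sketch).
const samples = [
    "John Smith",            // no suffix
    "John Smith Jr.",        // suffix without a comma
    "John Smith, Sr",        // suffix with a preceding comma
    "John Quincy Smith III", // middle name plus suffix
    "john smith iv"          // case-insensitive suffix match
];

const formatted = samples.map(formatName);
formatted.forEach((line) => console.log(line));
// Smith, John
// Smith, John Jr.
// Smith, John, Sr
// Smith, John Quincy III
// smith, john iv
```

Note that the optional comma is part of capturing group 3, so it is carried over to the output whenever it appears in the input.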
A few more issues to consider:
The regular expression cannot identify compound surnames that don’t use hyphens. For example, “Sacha Baron Cohen” would be replaced with “Cohen, Sacha Baron”, rather than the correct listing, “Baron Cohen, Sacha”.
It does not keep particles in front of the family name, although this is occasionally called for by convention or personal preference (for example, the correct alphabetical listing of “Charles de Gaulle” is “de Gaulle, Charles” according to the Chicago Manual of Style, 15th Edition, which contradicts Merriam-Webster’s Biographical Dictionary on this particular name).
Because of the ‹^› and ‹$› anchors that bind the match to the beginning and end of the string, no replacement can be made if the entire subject text does not fit the pattern. Hence, if no suitable match is found (for example, if the subject text contains only one name), the name is left unaltered.
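The limitations just listed are easy to reproduce. The sketch below applies the solution's regex to the two names mentioned above, plus a hyphenated surname and a single-word name that are assumptions added for illustration; `toListing` is a hypothetical helper name.

```javascript
// Regex and replacement text copied from the solution.
const nameRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

// Hypothetical helper name used only in this sketch.
function toListing(name) {
    return name.replace(nameRegex, "$2, $1$3");
}

const results = {
    // Unhyphenated compound surname: only the final word is treated as the surname.
    compound: toListing("Sacha Baron Cohen"),
    // Hyphenated surname (assumed example): kept whole, since a hyphen is
    // neither whitespace nor a comma.
    hyphenated: toListing("Daniel Day-Lewis"),
    // Particle: moved along with the first name instead of staying in front.
    particle: toListing("Charles de Gaulle"),
    // Single name (assumed example): the anchored pattern cannot match,
    // so replace() returns the input unchanged.
    single: toListing("Madonna")
};

console.log(results.compound);   // Cohen, Sacha Baron
console.log(results.hyphenated); // Day-Lewis, Daniel
console.log(results.particle);   // Gaulle, Charles de
console.log(results.single);     // Madonna
```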
As for how the regular expression works, it uses three capturing groups to split up the name. The pieces are then reassembled in the desired order via backreferences in the replacement string. Capturing group 1 uses the maximally flexible ‹.+?› pattern to grab the first name along with any number of middle names and surname particles, such as the German “von” or the French, Portuguese, and Spanish “de”. These name parts are handled together because they are listed sequentially in the output. Lumping the first and middle names together also helps avoid errors, because the regular expression cannot distinguish between a compound first name, such as “Mary Lou” or “Norma Jeane,” and a first name plus middle name. Even humans cannot accurately make the distinction just by visual examination.
Capturing group 2 matches the last name using ‹[^\s,]+›. Like the dot used in capturing group 1, the flexibility of this character class allows it to match accented characters and any other non-Latin characters.
Capturing group 3 matches an optional suffix, such as “Jr.” or “III,” from a predefined list of possible values. The suffix is handled separately from the last name because it should continue to appear at the end of the reformatted name.
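To see what each capturing group actually holds, it can help to inspect the match directly rather than only the replaced string. The following is a rough sketch; `splitName` and its property names are hypothetical, and the sample names are assumptions chosen for illustration.

```javascript
// Regex copied from the solution.
const nameRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

// Hypothetical helper: returns the three captured pieces, or null if the
// name does not fit the pattern.
function splitName(name) {
    const match = nameRegex.exec(name);
    if (match === null) {
        return null;
    }
    return {
        given: match[1],        // first name, middle names, particles
        last: match[2],         // a single run of non-whitespace, non-comma characters
        suffix: match[3] || ""  // group 3 is undefined when no suffix is present
    };
}

const withSuffix = splitName("Martin Luther King, Jr.");
console.log(withSuffix.given);  // Martin Luther
console.log(withSuffix.last);   // King
console.log(withSuffix.suffix); // , Jr.

// Accented characters are matched by both the dot and the negated class.
const accented = splitName("Antonín Dvořák");
console.log(accented.last);     // Dvořák
```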
Let’s go back for a minute to capturing group 1. Why was the dot within group 1 followed by the lazy ‹+?› quantifier, whereas the character class in group 2 was followed by the greedy ‹+› quantifier? If group 1 (which handles a variable number of elements and therefore needs to go as far as it can into the name) used a greedy quantifier, capturing group 3 (which attempts to match a suffix) wouldn’t have a shot at participating in the match. The dot from group 1 would match until the end of the string, and since capturing group 3 is optional, the regex engine would only backtrack enough to find a match for group 2 before declaring success. Capturing group 2 can use a greedy quantifier because its more restrictive character class only allows it to match one name.
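The effect of the quantifier choice can be checked side by side. In this sketch, `greedyRegex` is a deliberately altered variant (an assumption for illustration, not part of the recipe) in which group 1 uses ‹.+› instead of ‹.+?›; the sample name is also illustrative.

```javascript
// Solution regex: group 1 is lazy.
const lazyRegex = /^(.+?) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;
// Altered variant for comparison only: group 1 is greedy.
const greedyRegex = /^(.+) ([^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

const name = "Robert Downey Jr.";

// Lazy group 1 stops at "Robert", leaving "Downey" for group 2
// and " Jr." for group 3.
const lazyResult = name.replace(lazyRegex, "$2, $1$3");

// Greedy group 1 backtracks only to the last space, so "Jr." ends up
// in group 2 and group 3 never participates.
const greedyResult = name.replace(greedyRegex, "$2, $1$3");
const greedySuffix = greedyRegex.exec(name)[3];

console.log(lazyResult);   // Downey, Robert Jr.
console.log(greedyResult); // Jr., Robert Downey
console.log(greedySuffix); // undefined
```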
Table 4-2 shows some examples of how names are formatted using this regular expression and replacement string.
An added segment in the following regular expression allows you to output surname particles from a predefined list in front of the last name. Specifically, this regular expression accounts for the values “De”, “Du”, “La”, “Le”, “St”, “St.”, “Ste”, “Ste.”, “Van”, and “Von”. Any number of these values are allowed in sequence (for example, “de la”):
^(.+?)●((?:(?:D[eu]|L[ae]|Ste?\.?|V[ao]n)●)*[^\s,]+)↵ (,?●(?:[JS]r\.?|III?|IV))?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$2,●$1$3
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\2,●\1\3
Replacement text flavors: Python, Ruby
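As a rough JavaScript sketch of this variation, the regex can be written on a single line and used with the same replacement text. The function name `formatNameWithParticles` and the sample names are assumptions made for this example.

```javascript
// Particle-aware regex from above, written on one line.
const particleRegex =
    /^(.+?) ((?:(?:D[eu]|L[ae]|Ste?\.?|V[ao]n) )*[^\s,]+)(,? (?:[JS]r\.?|III?|IV))?$/i;

// Hypothetical function name for this sketch.
function formatNameWithParticles(name) {
    return name.replace(particleRegex, "$2, $1$3");
}

// Illustrative inputs (assumed for this sketch).
const particleSamples = [
    "Charles de Gaulle",
    "Jean de la Fontaine",   // two particles in sequence
    "Ludwig van Beethoven",
    "John Smith Jr."         // names without particles behave as before
];

const particleResults = particleSamples.map(formatNameWithParticles);
particleResults.forEach((line) => console.log(line));
// de Gaulle, Charles
// de la Fontaine, Jean
// van Beethoven, Ludwig
// Smith, John Jr.
```

Because the particle list is fixed, any particle not in it (or any middle name that happens to match one of the listed values directly before the surname) will still be placed according to the pattern rather than to convention.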