9.12. Extract CSV Fields from a Specific Column

Problem

You want to extract every field (record item) from the third column of a CSV file.

Solution

The regular expressions from Recipe 9.11 can be reused here to iterate over each field in a CSV subject string. With a bit of extra code, you can count the number of fields from left to right in each row, or record, and extract the fields at the position you’re interested in.

The following regular expression (shown with and without the free-spacing option) matches a single CSV field together with its preceding delimiter, capturing the delimiter in group 1 and the field in group 2. Because line breaks can appear within double-quoted fields, you cannot reliably find records by searching from the beginning of each line in your CSV string. By matching and stepping past fields one by one, you can determine which line breaks appear outside of double-quoted fields and therefore start a new record.
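To illustrate why a line-based search miscounts records, here is a minimal Python sketch. The sample data and the `count_records` helper are hypothetical and not part of the recipe. The helper assumes valid CSV in which every record, including the last, ends with a line break.

```python
# Hypothetical sample: the second record contains a line break inside quotes.
sample = 'id,note\n1,"first line\nsecond line"\n2,plain\n'

# Naive approach: treat every line break as a record boundary.
naive_rows = sample.splitlines()


def count_records(text):
    """Count line breaks that fall outside double-quoted fields."""
    in_quotes = False
    records = 0
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is preserved.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            records += 1
    return records


print(len(naive_rows))        # 4 lines
print(count_records(sample))  # 3 records
```

The naive split reports four rows because it cuts the quoted field in half, while the quote-aware count finds the three actual records.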

Tip

The regular expressions in this recipe are designed to work correctly with valid CSV files only, according to the format rules discussed in “Comma-Separated Values (CSV)”.

(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
( , | \r?\n | ^ )   # Capture the leading field delimiter to backref 1
(                   # Capture a single field to backref 2:
  [^",\r\n]+        #   Unquoted field
|                   #  Or:
  " (?:[^"]|"")* "  #   Quoted field (may contain escaped double quotes)
)?                  # The group is optional because fields may be empty
Regex options: Free-spacing
Regex flavors: ...
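
As one possible illustration of the counting approach described above, the following Python sketch walks the matches, resets a column counter whenever the delimiter in group 1 is not a comma, and collects the field at the requested position. The names `get_csv_column` and `unquote`, and the sample data, are assumptions made for this example. Unquoting the returned values, skipping a trailing line break, and accounting for input that begins with a delimiter are choices of this sketch rather than something the regex does on its own.

```python
import re

# Same pattern as the recipe: group 1 = leading delimiter, group 2 = field.
FIELD_RE = re.compile(r'(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?')


def unquote(field):
    """Hypothetical helper: turn a raw match into the field's value."""
    if field is None:          # group 2 did not participate: empty field
        return ""
    if field.startswith('"'):  # strip enclosing quotes, unescape "" to "
        return field[1:-1].replace('""', '"')
    return field


def get_csv_column(text, index):
    """Return the fields found at zero-based column `index`."""
    result = []
    column = 0
    for m in FIELD_RE.finditer(text):
        delimiter, field = m.group(1), m.group(2)
        if m.start() == 0 and delimiter and index == 0:
            # Input begins with a delimiter, so the very first field is empty.
            result.append("")
        if delimiter != "," and field is None and m.end() == len(text):
            break  # trailing line break (or empty input): no further record
        # A comma moves one column right; anything else starts a new record.
        column = column + 1 if delimiter == "," else 0
        if column == index:
            result.append(unquote(field))
    return result


# Hypothetical sample with an escaped quote, an embedded line break,
# and an empty third field in the last record.
sample = 'a,b,c\r\n1,"x, ""y""",3\n4,"multi\nline",\n'
third_column = get_csv_column(sample, 2)
print(third_column)  # ['c', '3', '']
```

Records with fewer fields than the requested position simply contribute nothing to the result, so the returned list may be shorter than the number of records.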
