8.10. Change the Delimiter Used in CSV Files

Problem

You want to change all field-delimiting commas in a CSV file to tabs. Commas that occur within double-quoted values should be left alone.

Solution

The following regular expression matches an individual CSV field along with its preceding delimiter, if any. The preceding delimiter is usually a comma, but can also be an empty string (i.e., nothing) when matching the first field of the first record, or a line break when matching the first field of any subsequent record. Every time a match is found, the field itself, including the double quotes that may surround it, is captured to backreference 2, and its preceding delimiter is captured to backreference 1.

Tip

The regular expressions in this recipe are designed to work correctly only with valid CSV files, according to the format rules under the term “Comma-Separated Values (CSV)” that was discussed earlier in this book.

(,|\r?\n|^)([^",\r\n]+|"(?:[^"]|"")*")?
Regex options: None (“^ and $ match at line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here is the same regular expression again in free-spacing mode:

( , | \r?\n | ^ )   # Capturing group 1 matches field delimiters
                    #   or the beginning of the string
(                   # Capturing group 2 matches a single field:
  [^",\r\n]+        #   a non-quoted field
|                   #  or...
  " (?:[^"]|"")* "  #   a quoted field (may contain escaped double-quotes)
)?                  # The group is optional because fields may be empty
Regex options: Free-spacing (“^ and $ ...

Get Regular Expressions Cookbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.