Parsing Comma-Separated Data
Problem
You have a data file containing comma-separated values that you need to read in, but these data fields may have quoted commas or escaped quotes in them. Most spreadsheets and database programs use comma-separated values as a common interchange format.
Solution
Use the procedure in Mastering Regular Expressions.
sub parse_csv {
my $text = shift; # record containing comma-separated values
my @new = ();
push(@new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text, -1,1) eq ',';
return @new; # list of values that were comma-separated
}Or use the standard Text::ParseWords module.
use Text::ParseWords;
sub parse_csv {
return quotewords(",",0, $_[0]);
}Discussion
Comma-separated input is a deceptive and complex format. It sounds
simple, but involves a fairly complex escaping system because the
fields themselves can contain commas. This makes the pattern matching
solution complex and rules out a simple split
/,/.
Fortunately, Text::ParseWords hides the complexity from you. Pass its
quotewords
function two arguments and the CSV string. The first argument is the
separator (a comma, in this case) and the second is a true or false
value controlling whether the strings are returned with quotes around
them.
If you want to represent quotation marks inside a field delimited by quotation marks, escape them with ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access