cut output is incorrect if delimiter character is within double quotes

65 Views Asked by At

cut -f2 -d"," - doesn't get the second column properly since first column is encapsulated within double quotes.

"Foo News, Videos (Android)","Foo News, Videos - Android","Foo News, Videos (Android)"
Bar News (iOS),"Bar News, Movie - iOS",Bar News & Movie (iOS)
3

There are 3 best solutions below

0
glenn jackman On

In bash 5.2, it's possible to do: this version of bash ships with a "delimiter-separated values" loadable module that respects double-quoted fields.

enable dsv
while IFS= read -r line; do
  dsv -a fields "$line"
  declare -p fields
  printf '%s\n' "${fields[1]}"
done < file.csv
declare -a fields=([0]="Foo News, Videos (Android)" [1]="Foo News, Videos - Android" [2]="Foo News, Videos (Android)")
Foo News, Videos - Android
declare -a fields=([0]="Bar News (iOS)" [1]="Bar News, Movie - iOS" [2]="Bar News & Movie (iOS)")
Bar News, Movie - iOS

I have this in my ~/.bashrc to make this easy:

# for loadable builtins
bash_root=${BASH%/bin/bash}
[[ -d "$bash_root/lib/bash" ]] && BASH_LOADABLES_PATH="$bash_root/lib/bash"
unset bash_root
0
Diego Torres Milano On

Or use python

#! /usr/bin/env python3

import csv
import sys

for row in csv.reader(sys.stdin):
    print(row[1])
0
ktc On

You can use some regex and awk to somewhat fit your need.

Let's say your data stored in test.csv.

cat test.csv | sed -r 's/("\s*,|,\s*"|"\s*,\s*")/\\,/g' | tr -d '"' | awk '{split($0, array, "\\\\,"); print array[2]}'
#             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#             1                                          2            3
  1. sed -r 's/("\s*,|,\s*"|"\s*,\s*")/\\,/g' : replace all "outer" comma and double quote into \, based on regex. example cases: ", ," ",". (You can choose other delimiter rather than \,.)
  2. tr -d '"' : remove the double quote at beginning and the end.
  3. awk '{split($0, array, "\\\\,"); print array[2]}' : split the delimiter via awk. (ref: How to split a delimited string into an array in awk?)

# result
Foo News, Videos - Android
Bar News, Movie - iOS

To parameterize column, just in case, you need to adjust single quote to escape variable in awk:

show_column () { 
  column=$1
  cat test.csv | sed -r 's/("\s*,|,\s*"|"\s*,\s*")/\\,/g' | tr -d '"' | awk '{split($0, array, "\\\\,"); print array['${column}']}' 
}

$ show_column 1
Foo News, Videos (Android)
Bar News (iOS)

$ show_column 2
Foo News, Videos - Android
Bar News, Movie - iOS

$ show_column 3
Foo News, Videos (Android)
Bar News & Movie (iOS)

Note

  1. If there are some cases like foo,bar,bar (without any double quotes) in your csv, you need to remove the sed part, just use awk to work. Because regex above are based on double quotes.
  2. If there are pure comma in column, e.g. "foo", ",", ",", "bar", should change regex to. The key strategy is to replace "outer" comma.
  3. About putting variables inside of single quote, the escape pattern is something like:
# before
 'foobar'
#|^^^^^^|


# after
 'foo'${my_var}'bar'
#|^^^|         |^^^|