BASH: Count identical lines

I have a file that contains:

VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoiceMailConfig60CharsTest
VoicemailDefaultTypeTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoicemailSettingsFromMessageModeScreenTest
VoicemailSettingsFromMessageModeScreenTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest

How do I replace the duplicate lines with counts, like this?

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

I am placing the pairs into an associative array. I tried using 'read' inside a 'while' statement, but the array gets lost. Here's my attempt:

unset line
tests=$(cat file.log)
echo "$tests" | 
    while read l; do 
        if [ "$l" == "${line}" ]; then
            let cnt++;
        else
            echo "${line} (${cnt})"
            line=${l}
            cnt=1
        fi
        export run_suites
    done

There are 6 best solutions below

1
MattT On BEST ANSWER

Assuming the formatting of the output doesn't exactly have to match

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

you can just use

sort <input_file> | uniq -c

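If you need the output format to match what you posted, you can use awk. Note that the iteration order of for (ind in duplicates) is unspecified, so pipe the result through sort if the order matters: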

awk '{duplicates[$1]++} END{for (ind in duplicates) {print ind,"("duplicates[ind]")"}}' <input_file>

Edit: Posted just after anubhava's answer... but leaving (unless people suggest I delete) because of the addition of the sort command.

1
anubhava On

You can use this simple awk script to get counts:

awk '{freq[$1]++} END{for (i in freq) print i, "(" freq[i] ")"}' file

VoiceMailConfig60CharsTest (1)
VoicemailSettingsFromMessageModeScreenTest (2)
VoiceMailIconSelectableTest (5)
VoicemailButtonTest (5)
VoicemailDefaultTypeTest (1)
VoicemailSettingsTest (7)

If you want to maintain the order of appearance in input then use:

awk '!freq[$1]++{order[++k]=$1} END{
    for (i=1; i<=k; i++) print order[i], "(" freq[order[i]] ")"}' file

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)
0
chepner On

If you don't care about that exact output format, just use sort and uniq:

$ sort file.log | uniq -c
5 VoicemailButtonTest
1 VoiceMailConfig60CharsTest
1 VoicemailDefaultTypeTest
5 VoiceMailIconSelectableTest
2 VoicemailSettingsFromMessageModeScreenTest
7 VoicemailSettingsTest

sort, of course, is unnecessary if the file is already sorted as in your question. If it isn't sorted, uniq -c will still work, but it only considers a line to be a duplicate if it is identical to the immediately preceding line:

$ printf 'a\nb\na' | uniq -c
1 a
1 b
1 a
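
If the exact "name (count)" format is needed, the uniq -c output can be reshaped afterwards. A minimal sketch, assuming each line is a single word without whitespace (as in the question); count_lines is a hypothetical helper name used only for illustration:

```bash
#!/usr/bin/env bash
# count_lines (hypothetical helper): read lines on stdin and print
# "line (count)" for each distinct line, in sorted order.
count_lines() {
    # uniq -c prints "<count> <line>"; awk swaps the two fields.
    # Assumes each line is a single whitespace-free word.
    sort | uniq -c | awk '{print $2, "(" $1 ")"}'
}

# Demo with made-up input instead of file.log.
printf '%s\n' b a b a a | count_lines
```

To use it on the file from the question, something like count_lines < file.log should work.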
0
Ed Morton On
$ awk '$1 != prev{if (NR>1) print prev, "("cnt")"; prev=$1; cnt=0} {cnt++} END{print prev, "("cnt")"}' file
VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

The above retains your input order and stores almost nothing in memory. It doesn't care whether your input is sorted; it relies only on all duplicate keys occurring contiguously in your input file, as in your example.
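
The same run-counting idea can also be written as a plain bash loop, which is close to what the question attempted. The sketch below is one possible version, not the answer's own code: it skips printing before the first run and flushes the final run after the loop, the two steps the original attempt was missing. The function name count_runs and the sample input are made up for illustration.

```bash
#!/usr/bin/env bash
# count_runs (hypothetical helper): read lines on stdin and print
# "line (count)" for each run of identical consecutive lines.
count_runs() {
    local line prev cnt=0
    while IFS= read -r line; do
        if (( cnt > 0 )) && [[ $line == "$prev" ]]; then
            # Same line as before: extend the current run.
            cnt=$(( cnt + 1 ))
        else
            # A new run starts: print the previous run, if there was one.
            if (( cnt > 0 )); then
                printf '%s (%d)\n' "$prev" "$cnt"
            fi
            prev=$line
            cnt=1
        fi
    done
    # Print the final run, which the loop body never reaches.
    if (( cnt > 0 )); then
        printf '%s (%d)\n' "$prev" "$cnt"
    fi
}

# Demo: the second "a" run is counted separately because it is not contiguous.
printf '%s\n' a a b a | count_runs
```

Like the awk version, this only merges duplicates that are adjacent, so ungrouped input needs a sort first.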

0
karakfa On

Without awk. This keeps the keys in order of first appearance and doesn't require sorted or grouped input.

cat -n file    |     # add line numbers for order
sort -s -k2    |     # stable sort on keys, so the first occurrence leads each group
uniq -f1 -c    |     # count keys, ignoring line no
sort -k2,2n    |     # sort by line no to recover initial order
sed -r 's/^\s*(\S+)\s+(\S+)\s+(\S+)/\3 (\1)/'     # format output, dropping uniq's leading padding
0
ctac_ On

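With a bash associative array:

unset tab
declare -A tab
# IFS= and -r keep each line exactly as written (no trimming, no backslash handling)
while IFS= read -r line; do
  count=${tab["$line"]:-0}        # treat an unseen key as 0
  tab["$line"]=$(( count + 1 ))
done < infile
# Quote the key list so keys containing spaces stay intact
for i in "${!tab[@]}"; do
  echo "$i (${tab[$i]})"
done | sort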
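
The redirection (done < infile) is what makes this work where the attempt in the question did not: in echo "$tests" | while ..., bash runs the loop in a subshell, so any variables or arrays it fills are gone once the pipeline finishes. A small sketch of the difference, with made-up input and array names (piped, redirected) chosen purely for illustration:

```bash
#!/usr/bin/env bash
declare -A piped redirected

# Pipe: the while loop runs in a subshell, so its assignments
# do not survive past the end of the pipeline.
printf '%s\n' a a b | while IFS= read -r line; do
    piped[$line]=$(( ${piped[$line]:-0} + 1 ))
done
echo "after pipe: ${#piped[@]} keys"

# Redirection from a process substitution: the loop runs in the
# current shell, so the array is still populated afterwards.
while IFS= read -r line; do
    redirected[$line]=$(( ${redirected[$line]:-0} + 1 ))
done < <(printf '%s\n' a a b)
echo "after redirection: ${#redirected[@]} keys"
```

Run as a script with default settings, the first array should come out empty and the second should hold two keys. With shopt -s lastpipe in a non-interactive shell, the last command of a pipeline runs in the current shell instead, which is another way around the problem.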