BASH: Count identical lines

Question

BASH: Count identical lines

6k Views Asked by Sean Walton At 31 October 2017 at 18:33

I have a file that contains:

VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoiceMailConfig60CharsTest
VoicemailDefaultTypeTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoicemailSettingsFromMessageModeScreenTest
VoicemailSettingsFromMessageModeScreenTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest

How do I replace the duplicate lines with counts:

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

I placing the pair into an associative array. I tried using 'read' inside a 'while' statement, but the array gets lost. Here's my attempt:

unset line
tests=$(cat file.log)
echo "$tests" | 
    while read l; do 
        if [ "$l" == "${line}" ]; then
            let cnt++;
        else
            echo "${line} (${cnt})"
            line=${l}
            cnt=1
        fi
        export run_suites
    done

Original Q&A

There are 6 best solutions below

anubhava On 31 October 2017 at 18:37

You can use this simple awk script to get counts:

awk '{freq[$1]++} END{for (i in freq) print i, "(" freq[i] ")"}' file

VoiceMailConfig60CharsTest (1)
VoicemailSettingsFromMessageModeScreenTest (2)
VoiceMailIconSelectableTest (5)
VoicemailButtonTest (5)
VoicemailDefaultTypeTest (1)
VoicemailSettingsTest (7)

If you want to maintain the order of appearance in input then use:

awk '!freq[$1]++{order[++k]=$1} END{
    for (i=1; i<=k; i++) print order[i], "(" freq[order[i]] ")"}' file

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

chepner On 31 October 2017 at 18:44

If you don't care about that exact output format, just use sort and uniq:

$ sort file.log | uniq -c
5 VoicemailButtonTest
1 VoiceMailConfig60CharsTest
1 VoicemailDefaultTypeTest
5 VoiceMailIconSelectableTest
2 VoicemailSettingsFromMessageModeScreenTest
7 VoicemailSettingsTest

sort, of course, is unnecessary if the file is already sorted as in your question. If it isn't sorted, uniq -c will still work, but it only considers a line to be a duplicate if it is identical to the immediately preceding line:

$ printf 'a\nb\na' | uniq -c
1 a
1 b
1 a

Ed Morton On 31 October 2017 at 18:58

$ awk '$1 != prev{if (NR>1) print prev, "("cnt")"; prev=$1; cnt=0} {cnt++} END{print prev, "("cnt")"}' file
VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

The above retains your input order and stores almost nothing in memory, it doesn't care if your input is sorted or not, it just relies on all duplicate keys occurring contiguously in your input file like you showed in your example.

karakfa On 31 October 2017 at 19:36

without awk keeping the order of the keys based on first appearance and doesn't require sorted or grouped input.

cat -n file    |     # add line numbers for order
sort -k2       |     # sort based on keys, ignoring line no
uniq -f1 -c    |     # count keys, ignoring line no
sort -k2,2n    |     # sort by line no to recover initial order
sed -r 's/(\S+)\s+(\S+)\s+(\S+)/\3 (\1)/'     # format output

ctac_ On 31 October 2017 at 23:45

With bash array

unset tab
declare -A tab
while read line;do
  let tab["$line"]=${tab["$line"]}+1
done < infile
for i in ${!tab[*]} ;do
  echo "$i  (${tab[$i]})"
done | sort

**MattT** · Accepted Answer · 2017-10-31T18:44:30.037000

Assuming the formatting of the output doesn't exactly have to match

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

you can just use

sort <input_file> | uniq -c

If you need the output to exactly match what you posted, you can use

awk '{duplicates[$1]++} END{for (ind in duplicates) {print ind,"("duplicates[ind]")"}}' <input_file>

Edit: Posted just after anubhava's answer... but leaving (unless people suggest I delete) because of the addition of the sort command.

BASH: Count identical lines

There are 6 best solutions below

Related Questions in BASH

Related Questions in AWK

Related Questions in SED

Related Questions in DUPLICATES

Related Questions in LINE-COUNT

Trending Questions

Popular # Hahtags

Popular Questions