uniq: printing all duplicated lines and repeat counts is NOT meaningless

142 Views Asked by midnite At 17 April 2023 at 20:44

Is there a way to show the duplicated counts with the actual duplicated lines repeated?

For example, input:

AAAA XXXX
AAAA YYYY
BBBB ZZZZ

Expected output:

2 AAAA XXXX
2 AAAA YYYY
1 BBBB ZZZZ

The Linux program uniq does not print the second duplicated line, 2 AAAA YYYY.

Linux command used:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | uniq --count --check-chars 4
      2 AAAA XXXX
      1 BBBB ZZZZ

The -D option of uniq prints all duplicate lines, but uniq rejects it in combination with --count and calls the combination meaningless:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | uniq --count -D --check-chars 4
uniq: printing all duplicated lines and repeat counts is meaningless
Try 'uniq --help' for more information.

In my actual use case, XXXX, YYYY, and ZZZZ are file paths, and AAAA and BBBB are the md5 hashes of the file contents. If the hashes of XXXX and YYYY are identical, I need to inspect both files. However, the uniq output does not give me the file path of YYYY.


Barmar On 17 April 2023 at 20:47

You can use join to combine the uniq output with the original input.

$ join -1 1 -2 2 <( printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ') <(printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | uniq --count --check-chars 4) | cut -d' ' -f1-3
AAAA XXXX 2
AAAA YYYY 2
BBBB ZZZZ 1
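Note that join expects both inputs to be sorted on the join field, which the sample input happens to be. For unsorted input such as raw md5sum output, a minimal sketch under some assumptions could sort first and use temporary files instead of process substitution. The function name count_with_join and the variable names are made up for illustration, and paths containing spaces are not handled well, because join splits fields on blanks.

```shell
# count_with_join is a hypothetical helper; $1 is a file of "HASH PATH" lines.
count_with_join() {
    sorted=$(mktemp)
    counts=$(mktemp)
    # join needs its inputs ordered on the join field, so sort by hash first.
    sort -k1,1 "$1" > "$sorted"
    # Count how often each hash occurs; uniq -c prints "COUNT HASH".
    cut -d' ' -f1 "$sorted" | uniq -c > "$counts"
    # Join field 1 of the sorted lines with field 2 (the hash) of the counts.
    join -1 1 -2 2 "$sorted" "$counts"
    rm -f "$sorted" "$counts"
}

# Usage (hashes.txt is a hypothetical file name):
#   count_with_join hashes.txt
```

Each output line has the form HASH PATH COUNT, matching the output shown above.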

Fravadona On 17 April 2023 at 21:34

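This short awk script might give you something usable. The order of the groups in the output is not guaranteed, because awk's for (key in array) loop iterates in an unspecified order: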

awk '
    { arr[$1] = arr[$1] FS $2 }
    END {
        for (md5 in arr) {
            n = split(arr[md5], paths)
            print md5, n
            for (i = 1; i <= n; i++)
                print "\t" paths[i]
        }
    }
'

BBBB 1
    ZZZZ
AAAA 2
    XXXX
    YYYY
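Since the goal is to find files with identical hashes, a variant that prints only the hashes seen more than once may be handy. The following is a sketch, not a tested tool: the function name list_duplicates is made up, and it assumes each line starts with the hash followed by blanks and then the path. It keeps everything after the hash as the path, so paths with spaces survive.

```shell
# list_duplicates is a hypothetical helper; it reads "HASH PATH" lines on stdin.
list_duplicates() {
    awk '
        {
            hash = $1
            path = $0
            sub(/^[^ \t]+[ \t]+/, "", path)   # drop the hash and the blanks after it
            n[hash]++                          # number of paths seen for this hash
            paths[hash] = paths[hash] "\t" path "\n"
        }
        END {
            # Group order is unspecified, as with any for-in loop in awk.
            for (hash in n)
                if (n[hash] > 1)
                    printf "%s %d\n%s", hash, n[hash], paths[hash]
        }
    '
}

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ\n' | list_duplicates
```

For the sample input this prints only the AAAA group with its two paths; BBBB is skipped because it occurs once.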

jared_mamrot On 18 April 2023 at 06:05

Not sure if there's an easier method, but one potential option using awk:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' | awk '{a[$1]++; b[NR] = $1; c[NR] = $1 FS $2} END{for (i=1; i<=length(b); i++) {print a[b[i]], c[i]}}'
2 AAAA XXXX
2 AAAA YYYY
1 BBBB ZZZZ

The same command, formatted for readability:

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ' |\
awk '{
    a[$1]++
    b[NR] = $1
    c[NR] = $1 FS $2
}

END {
    for (i = 1; i <= length(b); i++) {
        print a[b[i]], c[i]
    }
}'
2 AAAA XXXX
2 AAAA YYYY
1 BBBB ZZZZ
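A slightly more defensive variant of the same idea, as a minimal sketch: it stores each whole input line rather than only the first two fields, so paths containing spaces are kept intact, and it loops up to NR instead of calling length() on an array, which older awk implementations may not accept. The function name count_by_hash is made up for illustration.

```shell
# count_by_hash is a hypothetical helper; it reads "HASH PATH" lines on stdin.
count_by_hash() {
    awk '
        {
            count[$1]++      # occurrences of each hash (first field)
            key[NR] = $1     # hash of line NR
            line[NR] = $0    # full line, so paths with spaces are preserved
        }
        END {
            # Replay the lines in input order, prefixed with their group count.
            for (i = 1; i <= NR; i++)
                print count[key[i]], line[i]
        }
    '
}

printf 'AAAA XXXX\nAAAA YYYY\nBBBB ZZZZ\n' | count_by_hash
```

Like the original, this keeps every line in memory until the end of input, which should be fine for a list of file hashes but is worth keeping in mind for very large inputs.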