How to get the average number of words in files using the output of "wc -w"


I'm listing the number of words in a bunch of files and sorting the counts like this:

wc -w *.tex | sort -rn

which outputs a nice list of the files with the word count for each file:

   17423 total
    6481 panama-to-colombia.tex
    5516 the-salt-flats.tex
    5426 hiking-cordillera-huayhuash.tex

How can I also calculate and display the average number of words per file? i.e. a line at the bottom like:

5808 AVERAGE

Note: I'd like to find a solution that works for an arbitrary number of files in the list.


1
Cyrus On BEST ANSWER

I suggest appending this to your command:

| awk '{sum=sum+$1; print};END{print sum/2/(NR-1),"AVERAGE"}'

sum=sum+$1 adds the number in the first column ($1) of each row to the variable sum, and print outputs the current row unchanged. The average is calculated in the END block, after the last line has been read. Because the output of wc -w *.tex also includes the total line, sum ends up as twice the real total and NR is one more than the number of files, hence sum/2/(NR-1).
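To check the arithmetic without the original files, here is a small sketch that feeds the sample lines from the question through the same awk program. The function name avg_demo is hypothetical, and printf stands in for the real wc -w output.

```shell
# avg_demo (hypothetical name): replay the question's sample output through the awk program.
avg_demo() {
  printf '%s\n' \
    '17423 total' \
    '6481 panama-to-colombia.tex' \
    '5516 the-salt-flats.tex' \
    '5426 hiking-cordillera-huayhuash.tex' |
  awk '{sum=sum+$1; print};END{print sum/2/(NR-1),"AVERAGE"}'
}

# sum is 34846 (the total counted twice) and NR is 4, so this prints 17423/3.
avg_demo
```

The last line printed is 5807.67 AVERAGE. With a single file argument, wc prints no total line, so NR-1 would be 0 and the division would fail; the formula assumes at least two files.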


See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

0
Renaud Pacalet On
wc -w --total=never *.tex | datamash -W mean 1
5807.6666666667

If you prefer a rounded result, e.g., to 2 decimal places:

wc -w --total=never *.tex | datamash -WR2 mean 1
5807.67
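
If your wc does not accept --total (it appears only in newer GNU coreutils releases, as far as I know) or datamash is not installed, a rough equivalent can be sketched with a loop and awk. The helper name avg_words is hypothetical, and this is an alternative technique rather than what the answer above runs.

```shell
# avg_words (hypothetical helper): print the mean word count of the given files.
avg_words() {
  for f in "$@"; do
    wc -w < "$f"    # reading from stdin prints only the count: no file name, no total line
  done |
  awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
}
```

Calling avg_words *.tex prints the mean rounded to 2 decimal places, and prints nothing when no files are given.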
1
dawg On

You can do it entirely in awk:

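awk 'FNR==1 { files[FILENAME] = 0 }     # register each file when its first line is read
{ files[FILENAME] += NF }               # NF is the number of words on the current line
END {
    for (f in files) {                  # note: iteration order is unspecified
        total += files[f]
        print files[f], f
    }
    print total, "total"
    print total / length(files), "average"
}
' *.tex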

Prints:

5426 hiking-cordillera-huayhuash.tex
5516 the-salt-flats.tex
6481 panama-to-colombia.tex
17423 total
5807.67 average
0
pmf On

You could make it a function (or a script) that counts the words in the concatenation of its file arguments (cat -- "$@") and then divides that count by the number of file arguments ($#). Note that shell arithmetic is integer-only, so the result is truncated:

wc-wavg() { echo $(($(cat -- "$@" | wc -w) / $#)); }

wc-wavg *.tex
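
If you want a decimal result and a guard against being called with no arguments, the same idea can be sketched as follows. The name wc_wavg and the 2-decimal format are assumptions for illustration; awk is used only to do the floating-point division.

```shell
# wc_wavg (hypothetical variant): same approach, with decimal output and an argument check.
wc_wavg() {
  if [ "$#" -eq 0 ]; then
    echo "usage: wc_wavg FILE..." >&2
    return 1
  fi
  # Word count of all files concatenated, as in the function above.
  total=$(cat -- "$@" | wc -w)
  # Divide by the number of file arguments using awk, since $(( )) truncates.
  awk -v t="$total" -v n="$#" 'BEGIN { printf "%.2f\n", t / n }'
}
```

As with the original function, this counts the words of the concatenated input, so a file without a trailing newline can have its last word merge with the first word of the next file.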
0
Ed Morton On

You could do it all in awk. For example, take these input files, which include an empty file (fileempty) and a repeated file (file1), two of the rainy day cases likely to make a potential solution fail:

$ wc -w file1 fileempty file1
 2 file1
 0 fileempty
 2 file1
 4 total

and using GNU awk for ARGIND:

$ awk '
    {
        numWords[ARGIND] += NF
        tot += NF
    }
    END {
        fmt=" %" length(tot) "s %s\n"
        for ( i=1; i<=ARGIND; i++ ) {
            printf fmt, numWords[i]+0, ARGV[i]
        }
        printf fmt, tot+0, "total"
        printf fmt, tot / (ARGIND ? ARGIND : 1), "AVERAGE"
    }
' file1 fileempty file1
 2 file1
 0 fileempty
 2 file1
 4 total
 1.33333 AVERAGE

and just to show how that behaves for the other rainy day cases that come to mind:

  1. Just 1 input file:
$ awk '
    {numWords[ARGIND] += NF; tot += NF} END{fmt=" %" length(tot) "s %s\n"; for (i=1; i<=ARGIND; i++) { printf fmt, numWords[i]+0, ARGV[i] }; printf fmt, tot+0, "total"; printf fmt, tot / (ARGIND ? ARGIND : 1), "AVERAGE" }
' file1
 2 file1
 2 total
 2 AVERAGE
  2. An empty file as the only input:
$ awk '
    {numWords[ARGIND] += NF; tot += NF} END{fmt=" %" length(tot) "s %s\n"; for (i=1; i<=ARGIND; i++) { printf fmt, numWords[i]+0, ARGV[i] }; printf fmt, tot+0, "total"; printf fmt, tot / (ARGIND ? ARGIND : 1), "AVERAGE" }
' fileempty
 0 fileempty
 0 total
 0 AVERAGE
  3. No input file, just input from stdin (this may or may not be the desired output):
$ awk '
    {numWords[ARGIND] += NF; tot += NF} END{fmt=" %" length(tot) "s %s\n"; for (i=1; i<=ARGIND; i++) { printf fmt, numWords[i]+0, ARGV[i] }; printf fmt, tot+0, "total"; printf fmt, tot / (ARGIND ? ARGIND : 1), "AVERAGE" }
' <<!
> foo
> bar
> !
 2 total
 2 AVERAGE