apply sed only to the part of the file after last match in loop - shell / bash


I have a couple of large files (~1 GB each) with the following structure:

fooA iug9wa
fooA lauie
fooA nwgoieb
fooB wilgb
fooB rqgebepu
fooB ifbqeiu
...
fooN ibfiygb
fooN yvsiy
fooN aeviu

In the shell, I would like to replace each fooX with a sequential number from 1 to N. The fooX names consist of letters, numbers, "." and "_", and I have all of them listed, one per line, in foo.list.

I've used:

nfoos=$(wc -l < foo.list)

for i in $(seq 1 "$nfoos")
do
    currentfoo=$(sed "${i}q;d" foo.list)
    sed -i "s/${currentfoo}/$i/g" file1
    sed -i "s/${currentfoo}/$i/g" file2
    sed -i "s/${currentfoo}/$i/g" filen
done
done

However, with large files it takes forever. Since each fooX always appears later in the files than foo(X-1), I thought I could make sed search only the part of each file after the last match of the previous name, so that every iteration has less text to search. I've been trying to use labels and some multiline approaches, but the syntax keeps beating me.

Does anyone know how to make it work? (Doesn't necessarily have to use sed, but would be great if it worked in basic shell in Bash.)

Appreciate any help. And if you do, please explain each function/option/variable used so that I can figure out where I had been messing up.

Walter A On BEST ANSWER

You can use awk.
The first part of the awk command below fills the array a, mapping each name in foo.list to its line number; the second part replaces the first word of each line in file1 with that number.

awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list file1
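Since the question asks for each piece to be explained, here is the same command spread over several lines with comments. This is only a sketch: the scratch directory and the sample contents of foo.list and file1 are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: three names, one per line, and a small data file.
printf '%s\n' fooA fooB fooC > foo.list
printf '%s\n' 'fooA iug9wa' 'fooA lauie' 'fooB wilgb' 'fooC ibfiygb' > file1

result=$(awk '
    # NR counts lines across all input files, FNR restarts at 1 for each file,
    # so NR == FNR is true only while reading the first file (foo.list).
    NR == FNR { a[$1] = NR; next }   # store the line number of each name, skip to next line
    # For the second file: act only when the first field is a known name.
    $1 in a   { $1 = a[$1]; print }  # overwrite field 1 with its number and print the line
' foo.list file1)

printf '%s\n' "$result"
```

Two things to be aware of: lines whose first word is not in foo.list are not printed at all, and assigning to $1 makes awk rebuild the line with single spaces between fields.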

Once the output looks right, you can loop over your files:

for f in file1 file2 filen; do
  awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list "${f}" > "${f}.tmp" &&
  mv "${f}.tmp" "${f}"
done

The && ensures that the new file replaces the original only if awk exited successfully.
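To see the effect of && in isolation, here is a small sketch in which false stands in for a failing awk run. The file name and contents are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical original file.
printf '%s\n' 'fooA iug9wa' > file1

# "false" always exits with a non-zero status, like an awk run that failed.
# The redirection still creates an empty file1.tmp, but because the left side
# of && failed, mv is never executed and file1 is left alone.
false > file1.tmp && mv file1.tmp file1

content=$(cat file1)
printf '%s\n' "$content"
```

With ; instead of &&, the empty file1.tmp would have been moved over file1 and the data would be lost.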

Verpous On

Two optimizations:

  1. Use awk to generate a sed script which does all the replacements in a single run.

  2. Run sed -i with N file arguments instead of running sed N times with 1 file argument each.

awk '{ print "s/" $0 "/" NR "/g;" }' foo.list > temp_script
sed -i -f temp_script file1 file2 filen

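Now sed runs only once instead of once per name per file. Note that each name is used as a regular expression, so a "." in a name matches any character.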
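To make the two steps concrete, here is a sketch that shows what the generated script looks like and applies it to two data files in one sed call. The sample names and file contents are made up, and GNU sed is assumed for -i without a backup suffix.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data.
printf '%s\n' fooA fooB > foo.list
printf '%s\n' 'fooA iug9wa' 'fooB wilgb' > file1
printf '%s\n' 'fooB rqgebepu' > file2

# $0 is the whole line (the name) and NR is its line number in foo.list,
# so every name becomes one "s/name/number/g;" command.
awk '{ print "s/" $0 "/" NR "/g;" }' foo.list > temp_script
script=$(cat temp_script)
printf '%s\n' "$script"

# -f reads the commands from temp_script, -i edits the files in place
# (GNU sed assumed); both files are handled by a single sed process.
sed -i -f temp_script file1 file2
edited=$(cat file1 file2)
printf '%s\n' "$edited"
```

Every line is still tested against every s command, so the gain comes from reading each file once rather than once per name.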