Optimize prefix checks of lines while iterating through file

I'm writing a script in the form

while read LINE 
do
    [[ $LINE =~ ^headertag1 ]] && function1 && continue
    [[ $LINE =~ ^headertag2 ]] && function2 && continue
    ...
done < filename

As the number of tags increases, I will be doing too many checks per line. I can try sorting the common tags higher up, but I don't think that solves the fundamental issue. I'm not a software engineer. Are there programming concepts/methods that can improve this situation?

There are 3 best solutions below

stephanmg (BEST ANSWER)

I'm not sure this is what you're after, but if you want to tidy up your code and are tired of adding these guards repetitively, maybe this idea will help:

#!/bin/bash

declare -A tags            # associative arrays must be declared explicitly
tags[tag1]="some regex1"
tags[tag2]="some regex2"
tags[tag3]="some regex3"

function action() {
  # placeholder: do something with the regex that was passed in
  echo "perl -pe '$1' other-file.txt"
}

while IFS= read -r LINE; do
  for tag in "${!tags[@]}"; do
    [[ $LINE =~ ^$tag ]] && action "${tags[$tag]}"
  done
done < filename

Not sure if the OP is asking something like this.

Mika Feiler

Yes. For two tags, you can first find the longest common prefix of both (people have asked how to do that in Bash: Longest common prefix of two strings in bash), check whether the line starts with that prefix, and only then, after stripping the prefix from both the tag and the line, check whether the line starts with what remains of each tag.
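
To make that concrete, here is a rough sketch for the two-tag case, using the question's headertag1/headertag2 and filename, and assuming function1 and function2 are defined elsewhere:

prefix="headertag"                        # longest common prefix of the two tags
while IFS= read -r LINE; do
    [[ $LINE == "$prefix"* ]] || continue # one cheap check rejects non-matching lines
    rest=${LINE#"$prefix"}                # strip the shared prefix once
    [[ $rest == 1* ]] && { function1; continue; }
    [[ $rest == 2* ]] && { function2; continue; }
done < filename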

For more than two, you need to build a trie, also known as a prefix tree: https://en.wikipedia.org/wiki/Trie .

That Wikipedia article says

For the space-optimized presentation of prefix tree, see compact prefix tree.

And since your tags will share long common prefixes, that compact form is what you will end up with.

Since Bash doesn't have multidimensional associative arrays, you will have to either pick one of the approaches in https://en.wikipedia.org/wiki/Trie#Implementation_strategies or embed some other scripting language, such as Perl or Python, or GNU Awk (gawk), which, unlike standard Awk, has true multidimensional (nested) associative arrays.
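
As a rough, untested sketch of the gawk route: build a character trie out of gawk's nested arrays once, then walk it for each line. The tag names are the question's placeholders, and the "action" is just a print standing in for a real dispatch.

#!/bin/bash
gawk '
function insert(node, word, name,    c) {
    if (word == "") { node[""] = name; return }   # end of tag: record its action
    c = substr(word, 1, 1)
    if (!(c in node)) {
        node[c]["tmp"] = 1; delete node[c]["tmp"] # force creation of a subarray
    }
    insert(node[c], substr(word, 2), name)
}
function lookup(node, line,    c) {
    if ("" in node) return node[""]               # some tag is a prefix of line
    c = substr(line, 1, 1)
    if ((c in node) && isarray(node[c]))
        return lookup(node[c], substr(line, 2))
    return ""                                     # no tag matches
}
BEGIN {
    insert(trie, "headertag1", "action1")
    insert(trie, "headertag2", "action2")
}
{
    name = lookup(trie, $0)
    if (name != "") print name " matched: " $0
}
' filename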

Using Bash's optimized associative arrays

As suggested in a comment, we may consider extracting just the tag with a simpler regex and using it as a key into an associative array; associative arrays are hash tables and reasonably well optimized in Bash (how well they suit our needs can be checked in the sources). This works if we know what the tag is delimited by: for example, if it is always immediately followed by a : while never containing one, a simpler regex like the following is enough:

[[ $LINE =~ ^([^:]+): ]] && "${DICTIONARY_OF_FUNCTIONS["${BASH_REMATCH[1]}"]}"
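
Fleshed out (the handler names here are made up for illustration), that might look like:

#!/bin/bash

handle_tag1() { echo "first handler: $LINE"; }
handle_tag2() { echo "second handler: $LINE"; }

declare -A DICTIONARY_OF_FUNCTIONS=(
    [headertag1]=handle_tag1
    [headertag2]=handle_tag2
)

while IFS= read -r LINE; do
    [[ $LINE =~ ^([^:]+): ]] || continue            # everything before the first :
    func=${DICTIONARY_OF_FUNCTIONS[${BASH_REMATCH[1]}]}
    [[ -n $func ]] && "$func"                       # dispatch only on known tags
done < filename

The lookup is a single hash-table access, so the per-line cost stays constant no matter how many tags you add.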

Or: using Bash's function namespace as the lookup table

If all your tags look like /[a-z][a-z0-9]+/, or are otherwise acceptable to Bash as function names, and are delimited as in the associative-array method above, then you can interpolate the tag into a function name instead:

function the_function_for_tag_headertag1() {
    echo "hey it's the first one"
}
[[ $LINE =~ ^([^:]+): ]] && {
    func_name="the_function_for_tag_${BASH_REMATCH[1]}"
    # call the handler only if a function by that name actually exists
    type "${func_name}" &>/dev/null && "${func_name}"
}

dash-o

The test that you perform on each tag

    [[ $LINE =~ ^headertag1 ]] && function1 && continue

is extremely cheap (an in-memory regex match). Most likely it will take only a fraction of the I/O time associated with reading LINE (from a file or another process). Unless you are performing the test a very large number of times, this implementation is reasonable.
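
A quick (unscientific) way to check this on your own data, using the question's filename:

time while IFS= read -r LINE; do :; done < filename   # cost of the I/O alone

time while IFS= read -r LINE; do
    [[ $LINE =~ ^headertag1 ]]
done < filename                                       # I/O plus one regex test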

A note about style: if all patterns are prefix matches (or other similarly simple constructs), consider using a Bash case statement:

case "$LINE" in
   header1*) function1 ;;
   header2*) function2 ;;
   ...
esac

This will make the code more elegant, but it does not change the performance: both the regex and the glob pattern are cheap to evaluate.