Eliminating overlapping entries based on start / end values in bash


I have a tab-delimited file of entries, each with a start and an end position:

# Name \t Start \t End
Name1 \t 1 \t 3
Name2 \t 7 \t 9
Name3 \t 5 \t 8
Name4 \t 5 \t 6

I want to delete lines that overlap with earlier lines. In this example, the desired output would be:

Name1 \t 1 \t 3
Name2 \t 7 \t 9
Name4 \t 5 \t 6

What I have so far:

#!/bin/bash
while IFS= read -r line; do
     # Assign variable names (quote "$line" so the tabs survive word splitting)
     name=$(echo "$line" | cut -f 1)
     start=$(echo "$line" | cut -f 2)
     end=$(echo "$line" | cut -f 3)
     # I envision an if statement structured so that:
     # if [ "$end" -lt "$PreviousStart" ] || [ "$start" -gt "$PreviousEnd" ] ; then echo "$line" >> output.txt
done < file.txt

This is where I get stuck because I would need to check each line of output.txt (all the previous lines from my original file) and only print $line if the if statement is satisfied for all current lines of output.txt. I was thinking awk may have a solution for this that is less circuitous...

Any help is greatly appreciated

markp-fuso

Assumptions:

  • as we read a new line we need to test for overlaps against all previous non-overlapping lines
  • if a new line does not overlap with any of the previous non-overlapping lines then:
      a) save the new line as a new member of the group of non-overlapping lines, and
      b) print the new line to stdout

One awk idea:

awk '
BEGIN { FS=OFS="\t" }
      { for (i=1; i<=cnt; i++)                        # loop through array of previous non-overlapping lines
            if ( $2 <= end[i] && $3 >= start[i] )     # closed ranges overlap when each one starts no later than the other ends;
                                                      # this also catches a current line that completely contains a previous one
                 next                                 # if there is an overlap then skip this line and process the next line of input

        start[++cnt] = $2                             # we have a new non-overlapping line so save the start and end points
        end[cnt] = $3
        print                                         # print current line to stdout
      }
' file.txt

This generates:

Name1   1       3
Name2   7       9
Name4   5       6
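
To check the case where a new row completely encloses an earlier one, the same kind of loop can be wrapped in a shell function and fed one extra row. This is only a minimal sketch: the function name filter_overlaps is made up for illustration, the Name5 row (4 to 10) is added test data, and the overlap test is the two-comparison form for closed ranges.

```shell
#!/bin/bash
# filter_overlaps: hypothetical helper name, used here for illustration only.
# Reads tab-delimited "name start end" rows on stdin and prints each row
# that does not overlap any previously printed row (closed ranges).
filter_overlaps() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        for (i = 1; i <= cnt; i++)
            # two closed ranges overlap exactly when each one starts
            # no later than the other one ends
            if ($2 + 0 <= end[i] && $3 + 0 >= start[i])
                next
        start[++cnt] = $2 + 0   # remember the kept row as numbers
        end[cnt]     = $3 + 0
        print
    }'
}

# sample rows from the question plus one extra row (Name5) that encloses earlier rows
printf 'Name1\t1\t3\nName2\t7\t9\nName3\t5\t8\nName4\t5\t6\nName5\t4\t10\n' | filter_overlaps
```

With this input the function should print the Name1, Name2 and Name4 rows; Name5 is dropped because 4..10 encloses both 7..9 and 5..6.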

blhsing

You can keep the start and end positions of the previous non-overlapping records as indices and values in an awk array. For each record, iterate through the array to test whether the current start and end positions overlap with any stored pair, and skip the record if they do. Array indices are strings in awk, so s+0 makes sure the comparison with $3 is numeric:

awk -F'\t' '{for(s in a)if($2<=a[s]&&s+0<=$3)next;a[$2]=$3}1' file.txt

Demo: https://awk.js.org/?snippet=REkxdr
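
For readability, the same idea can be spelled out over several lines. This is a rough sketch rather than the exact one-liner: the wrapper name keep_first_nonoverlapping and the array name kept are invented for illustration, and every operand gets +0 so the comparisons are numeric even for multi-digit positions.

```shell
#!/bin/bash
# keep_first_nonoverlapping: hypothetical wrapper name, for illustration only.
# Reads tab-delimited "name start end" rows on stdin.
keep_first_nonoverlapping() {
    awk -F'\t' '
    {
        # kept[s] = e means an already printed row covers the closed range s..e
        for (s in kept)
            # array subscripts are strings, so add 0 to compare as numbers
            if ($2 + 0 <= kept[s] + 0 && s + 0 <= $3 + 0)
                next            # overlaps a kept row: drop the current row
        kept[$2] = $3           # no overlap found: remember this row
        print
    }'
}

# multi-digit positions: compared as strings, "7" would sort after "11"
printf 'X\t7\t12\nY\t10\t11\nZ\t13\t20\n' | keep_first_nonoverlapping
```

Here Y (10..11) lies inside X (7..12) and is dropped, so the X and Z rows are printed.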

Schmaehgrunza

#!/usr/bin/bash
declare name
declare -a startArr endArr
declare -i start end overlapping

while IFS=$'\t' read -r name start end; do
   if ((${#name}==0)); then continue; fi    # skip empty lines
   overlapping=0
   # compare the current entry against every entry kept so far
   for ((i=0; i<${#startArr[@]}; i++)); do
      # overlap: start inside a kept range, end inside a kept range, or the current range encloses a kept range
      if (( start >= startArr[i] && start <= endArr[i] || end >= startArr[i] && end <= endArr[i] || start < startArr[i] && end > endArr[i] )); then
         overlapping=1
         break
      fi
   done
   if ((overlapping)); then continue; fi
   echo "$name"$'\t'"$start"$'\t'"$end"
   startArr+=("$start"); endArr+=("$end")    # remember the kept entry
done < file1

file1:

Name1   1   3
Name2   7   9
Name3   5   8
Name4   5   6
Name5   4   10   # added in an edit: another kind of overlap (this range encloses earlier ones)