Eliminating overlapping entries based on start / end values in bash

96 Views Asked by Dody At 16 February 2024 at 00:37

I have a tab-delimited file of entries, each with a start and end position:

# Name \t Start \t End
Name1 \t 1 \t 3
Name2 \t 7 \t 9
Name3 \t 5 \t 8
Name4 \t 5 \t 6

I want to delete lines that overlap with earlier lines. In this example, the desired output would be:

Name1 \t 1 \t 3
Name2 \t 7 \t 9
Name4 \t 5 \t 6

What I have so far:

#!/bin/bash
while IFS=$'\n' read line; do
     # Assign variable names
     name=$(echo $line | cut -f 1)
     start=$(echo $line | cut -f 2)
     end=$(echo $line | cut -f 3)
     # I envision an if statement structured so that:
     # if [ $end < $PreviousStart ] || [ $start > $PreviousEnd ] ; then echo $line >> output.txt
done < file.txt

This is where I get stuck because I would need to check each line of output.txt (all the previous lines from my original file) and only print $line if the if statement is satisfied for all current lines of output.txt. I was thinking awk may have a solution for this that is less circuitous...

Any help is greatly appreciated.


There are 3 best solutions below

markp-fuso On 16 February 2024 at 01:09

Assumptions:

as we read a new line we need to test for overlaps against all previous non-overlapping lines
if a new line does not overlap with any of the previous non-overlapping lines then ...
a) we save the new line as a new member of the group of non-overlapping lines and
b) print the new line to stdout

One awk idea:

awk '
BEGIN { FS=OFS="\t" }
      { for (i=1; i<=cnt; i++)                            # loop through array of previous lines
            if ( $2 <= end[i] && $3 >= start[i] )         # overlap test; also catches a new range that fully contains a previous one
                 next                                     # if there is an overlap then skip this line and process the next line of input

        start[++cnt] = $2                                 # we have a new non-overlapping line so save the start and end points
        end[cnt] = $3
        print                                             # print current line to stdout
      }
' file.txt

This generates:

Name1   1       3
Name2   7       9
Name4   5       6

blhsing On 16 February 2024 at 01:21

You can keep previous non-overlapping start and end positions as indices and values in an array in awk so you can easily iterate through them for each record to test if the current start and end positions overlap with any of them, and skip the current record if they do:

awk '-F\t' '{for(s in a)if($2<=a[s]&&s+0<=$3)next;a[$2]=$3}1' file.txt

Demo: https://awk.js.org/?snippet=REkxdr
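
The one-liner is dense, so here is a spelled-out sketch of the same idea. It assumes tab-delimited input, integer positions, and closed (inclusive) ranges. The names `filter_overlaps` and `kept` are made up for illustration and are not from the answer. One detail worth noting: `for (s in kept)` hands back subscripts as strings, so the sketch adds 0 to force a numeric comparison once positions have more than one digit.

```shell
#!/bin/bash
# filter_overlaps: hypothetical helper; reads "Name<TAB>Start<TAB>End" on stdin
# and prints only the lines that do not overlap an earlier printed line.
filter_overlaps() {
    awk -F'\t' '
    {
        # kept[start] = end for every line printed so far
        for (s in kept)
            # Closed ranges [a,b] and [c,d] overlap when a <= d and c <= b.
            # s is an array subscript (a string), so s + 0 makes the test numeric.
            if ($2 <= kept[s] + 0 && s + 0 <= $3)
                next            # overlap found: drop this line
        kept[$2] = $3           # no overlap: remember the range
        print
    }'
}

# Sample data from the question, plus a line that fully contains an earlier range.
printf 'Name1\t1\t3\nName2\t7\t9\nName3\t5\t8\nName4\t5\t6\nName5\t4\t10\n' | filter_overlaps
```

On the sample data this should keep Name1, Name2 and Name4. Using the start position as the array key is safe here because two kept lines can never share a start: the second one would overlap the first and be skipped.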

Schmaehgrunza On 16 February 2024 at 21:01

#!/usr/bin/bash
declare name
declare -a startArr endArr
declare -i start end overlapping

while IFS=$'\t' read -r name start end; do
   if ((${#name}==0)); then continue; fi
   overlapping=0
   for ((i=0; i<${#startArr[@]}; i++)); do
      if (( start >= startArr[i] && start <= endArr[i] || end >= startArr[i] && end <= endArr[i] || start < startArr[i] && end > endArr[i] )); then
         overlapping=1
         break
      fi
   done
   if ((overlapping)); then continue; fi
   echo "$name"$'\t'"$start"$'\t'"$end"
   startArr+=("$start"); endArr+=("$end")
done < file1

file1:

Name1   1   3
Name2   7   9
Name3   5   8
Name4   5   6
Name5   4   10   # edited this line, other case of overlapping
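
As a side note, the three-clause test in the loop can be collapsed into two comparisons, provided every range has start <= end: two closed ranges overlap exactly when each one starts no later than the other ends. A minimal sketch follows; the function name `overlaps` is hypothetical and not part of the script above.

```shell
#!/bin/bash
# overlaps: hypothetical helper, not part of the answer above.
# Usage: overlaps START1 END1 START2 END2
# Succeeds (exit status 0) when the closed ranges share at least one position.
# Assumes integer arguments with START <= END for both ranges.
overlaps() {
    local -i s1=$1 e1=$2 s2=$3 e2=$4
    (( s1 <= e2 && s2 <= e1 ))
}

# The Name5 case from file1: 4..10 fully contains the kept range 7..9.
if overlaps 4 10 7 9; then echo "4-10 overlaps 7-9"; fi
# Name4 (5..6) sits in the gap between 1..3 and 7..9.
if ! overlaps 5 6 7 9; then echo "5-6 does not overlap 7-9"; fi
```

Inside the loop this would read `if overlaps "$start" "$end" "${startArr[i]}" "${endArr[i]}"; then ...`, which covers the partial-overlap and containment cases with a single condition.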
