Given a large (> 100MB) data frame of events with location and timestamps, how can I remove events synchronously occurring in all locations (i.e. putative noise) in R, MATLAB or Python (with reasonable performance)?
A minimal specification of the problem in R would be:
```r
pixel <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)
start <- c(1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7)
end <- c(2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9)
events <- data.frame(cbind(pixel, start, end))
# there was an event between 1 and 2s detected everywhere;
# this event would therefore be removed in the desired output:
#
# pixel start end
#     1     3   4
#     1     6   7
#     1     8   9
#     2     3   4
#     2     5   6
#     2     7   8
#     2     8   9
#     3     4   5
#     3     7   9
```
I tried to solve the problem with loops, but that solution is slow. (Experts often recommend "vectorizing" calculations, but I found no way to get rid of the loops.)
Additionally, I found a related post covering the problem in Python: "Pandas Data Frame - Remove Overlapping Intervals".
It seems to me that this type of problem should be a common one and is probably already solved by a package, but I couldn't find it.
I think your expected output is incomplete and has more rows than it should. Namely, all three pixels have an event spanning `1, 2` and `8, 9`, so we should be removing two rows from each `pixel`.

Here's a `data.table` solution. Note that since we want the comparison to be right-side-open (i.e., `1, 3` does not overlap with `3, 4`), I'll momentarily decrease `end` by an iota, set the keys (required for `foverlaps`), check for overlaps, then add back the iota I subtracted.

The `overlaps` column now represents a count of how many unique `pixel` values are found in the set of overlapping time ranges, including "self". When this number equals the total number of unique `pixel` values, the row overlaps with all other groups.

Follow-up proof, by-row:

All 3 rows should be removed (by my interpretation of your logic).
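For reference, the steps described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the exact code behind this answer: the tolerance `eps`, the helper column `end_open`, and the names `pairs`, `counts`, `n_pixels` and `result` are made up for illustration. Instead of decreasing `end` in place and restoring it afterwards, the sketch keeps the shrunken end in a separate column, which avoids floating-point drift in `end`.

```r
library(data.table)

events <- data.table(
  pixel = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  start = c(1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7),
  end   = c(2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9)
)

# Assumption: eps is smaller than any real gap between two events.
eps <- 1e-5

# Shrink each interval slightly so that (1, 3) and (3, 4) do not overlap.
events[, end_open := end - eps]

# foverlaps() needs the interval columns as the last two key columns.
setkey(events, start, end_open)

# Self-join: one row per pair of overlapping rows (xid, yid are row numbers
# in the keyed, i.e. sorted, table). Every row overlaps at least itself.
pairs <- foverlaps(events, events, which = TRUE)
pairs[, pixel := events$pixel[yid]]

# For each row, count the distinct pixels among its overlapping rows.
counts <- pairs[, .(overlaps = uniqueN(pixel)), by = xid]
events[counts$xid, overlaps := counts$overlaps]

# Keep only the rows that do NOT overlap with every pixel.
n_pixels <- uniqueN(events$pixel)
result <- events[overlaps < n_pixels, .(pixel, start, end)]
result <- result[order(pixel, start)]
print(result)
```

On the example data this should leave six rows (two removed per pixel). Note that the self-join materializes every overlapping pair, so on a frame well above 100MB with many simultaneous events the `pairs` table may get large; splitting the data into time windows first is one possible workaround.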