I am working with a large time series data set aggregated every minute of 2020. The dataset is getting values from sensors that are monitoring equipment in a thermal generation plant. The sensors measure values such as temperature, pressure, current, etc. and update the dataset with every reading.
I am looking to detect errors in the dataset caused by the sensors. One of the error types from the sensors occurs when the input from the sensors is stuck on a certain value. For example, one of the temperature sensors reported a value of 71.46 for 20 minutes straight when we know it should be fluctuating. I am trying to locate these errors in my current dataset, and hopefully train a model to check for recurring values in future datasets.
Ideally I'd want to be able to find the time windows in the dataset where you see a value recur 5 or more times in a row.
The data is in the form of a pandas time series DataFrame and the kernel is Python 3.6. Let me know if you have any suggestions.
I think a simple way to find out whether 5 consecutive values are the same could be to calculate a rolling average with a window size of 5 steps for all your values and then check the difference between values in adjacent rows. Not sure if this is too simplistic? If the value of the rolling average is the same at row `x` as in row `x+1`, then you are probably repeating the same value. Of course, the rolling average also stays the same whenever the new value coming into the window is exactly the same as the value exiting the window, even if the values in between differ, so this check can also flag windows that are not truly stuck. In pandas this amounts to a rolling mean followed by a row-wise difference, stored in a table such as `diff_table`.
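The rolling-average step described above could be sketched roughly as follows. The sample readings, the column names `temperature` and `pressure`, and the variable names other than `diff_table` are assumptions made for illustration; they are not taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level readings; the temperature sensor is "stuck"
# at 71.5 from 00:03 to 00:09 (7 readings in a row).
index = pd.date_range("2020-01-01 00:00", periods=12, freq="min")
df = pd.DataFrame(
    {
        "temperature": [70.0, 71.0, 72.0, 71.5, 71.5, 71.5,
                        71.5, 71.5, 71.5, 71.5, 73.0, 74.0],
        "pressure": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5,
                     4.0, 4.5, 5.0, 5.5, 6.0, 6.5],
    },
    index=index,
)

window = 5

# Rolling mean over the last 5 rows, computed per column.
rolling_mean = df.rolling(window=window).mean()

# Difference between each row's rolling mean and the previous row's.
# The first `window` rows are NaN because there is nothing to compare yet.
diff_table = rolling_mean.diff()

print(diff_table)
```

With real sensor values the differences may not be exactly zero because of floating-point rounding in the rolling sums, so comparing with a small tolerance is probably safer than testing for equality with `0`.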
And so now the problem becomes finding the rows/cells in `diff_table` where any value in a row is `0`, which is easy to check with a boolean comparison.
To find the start and end times of when this happens is a bit trickier, but if your timestamps are in the index of your table, you could create a `Series` with the same index and values of `1` and `0` depending on whether any of the columns in `diff_table` for a given row were `0` (i.e. had a repeat of 5 values). By again subtracting adjacent values in this series, you can then identify whether a row is the start of an interval (e.g. `1`, from 1 - 0) or the end of an interval (`-1`, from 0 - 1), depending on what values you have chosen.
That can help with finding the time at which the moving average started to be constant. So if you then subtract 5 minutes from that start time, you would get your real interval start, when the sensor started repeating.
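A minimal sketch of the flagging and interval steps, assuming the timestamps are in the index. The sample data and the names `flagged`, `change`, `starts`, `ends` and `real_starts` are hypothetical, and `np.isclose` is used in place of an exact comparison with `0` as a precaution against floating-point noise.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the sensor is stuck at 71.5 from 00:03 to 00:09.
index = pd.date_range("2020-01-01 00:00", periods=12, freq="min")
df = pd.DataFrame(
    {"temperature": [70.0, 71.0, 72.0, 71.5, 71.5, 71.5,
                     71.5, 71.5, 71.5, 71.5, 73.0, 74.0]},
    index=index,
)

window = 5
diff_table = df.rolling(window=window).mean().diff()

# True where the rolling mean did not change from the previous row.
# NaN entries compare as False.
is_zero = pd.DataFrame(
    np.isclose(diff_table.values, 0.0),
    index=diff_table.index,
    columns=diff_table.columns,
)

# 1 if any column in the row was flagged, otherwise 0.
flagged = is_zero.any(axis=1).astype(int)

# +1 marks the first flagged row of an interval,
# -1 marks the first unflagged row after an interval.
change = flagged.diff()
starts = change[change == 1].index
ends = change[change == -1].index

# Shift the start back by the window length to get the first repeated reading.
real_starts = starts - pd.Timedelta(minutes=window)

print(real_starts)
print(ends)
```

An interval that is still flagged on the last row has no `-1` marker, so it would need to be closed separately. Each flagged interval is only a candidate, since an unchanged rolling mean does not by itself prove that every value in the window was identical.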
There might be better ways, but this is the one I would give a shot if it was my problem.
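One possible alternative, which is not the rolling-average approach above but a different technique, is to label runs of identical consecutive values directly and keep the runs of 5 or more. This is only a sketch under the assumption that a stuck sensor reports bit-identical values; the sample data and the names `run_id`, `run_length`, `stuck` and `windows` are hypothetical.

```python
import pandas as pd

# Hypothetical readings: the sensor is stuck at 71.5 from 00:03 to 00:09.
index = pd.date_range("2020-01-01 00:00", periods=12, freq="min")
temperature = pd.Series(
    [70.0, 71.0, 72.0, 71.5, 71.5, 71.5,
     71.5, 71.5, 71.5, 71.5, 73.0, 74.0],
    index=index,
    name="temperature",
)

min_run = 5

# A new run starts whenever a value differs from the previous one,
# so the cumulative sum gives every run its own integer label.
run_id = (temperature != temperature.shift()).cumsum()

# Length of the run each reading belongs to.
run_length = run_id.map(run_id.value_counts())

# Readings that are part of a run of at least `min_run` identical values.
stuck = run_length >= min_run

# One (start, end, length) tuple per stuck run.
windows = [
    (group.index[0], group.index[-1], len(group))
    for _, group in temperature[stuck].groupby(run_id[stuck])
]

print(windows)
```

Because this compares neighbouring values directly, it does not flag windows where only the entering and exiting values happen to match, and it could also be used to confirm the candidates found with the rolling average.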