Is there a Python function for locating multiple identical values in a row in a time series data set?

I am working with a large time series data set aggregated every minute of 2020. The dataset is getting values from sensors that are monitoring equipment in a thermal generation plant. The sensors measure values such as temperature, pressure, current, etc. and update the dataset with every reading.

I am looking to detect errors in the dataset caused by the sensors. One of the error types from the sensors occurs when the input from the sensors is stuck on a certain value. For example, one of the temperature sensors reported a value of 71.46 for 20 minutes straight when we know it should be fluctuating. I am trying to locate these errors in my current dataset, and hopefully train a model to check for recurring values in future datasets.

Ideally I'd want to be able to find the time windows in the dataset where you see a value recur 5 or more times in a row.

The data is in a pandas time series DataFrame and the kernel is Python 3.6. Let me know if you have any suggestions.

robbo On

I think a simple way to find out whether 5 consecutive values are the same could be to calculate a rolling average with a 5-step window for all your values and then check the difference between values in adjacent rows. Not sure if this is too simplistic, but if the rolling average at row x is the same as at row x+1, then you are probably repeating the same value. The caveat is that the rolling average also stays the same whenever the new value coming into the window happens to equal the value exiting it, even if the values in between differ, so this check can flag rows that are not true repeats.

This can be done as follows:

# 5-step rolling mean of every column
roller = df.rolling(5).mean()
# change in the rolling mean from one row to the next
diff_table = roller - roller.shift(1)

And so now the problem becomes finding the rows/cells in diff_table where any value in a row is 0, which is easy:

import numpy as np
has_repeat = np.isclose(diff_table, 0).any(axis=1)  # True where any column's rolling mean is unchanged
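
As a rough illustration of the steps so far, here is a minimal sketch on made-up data. The column name temp and the readings are assumptions for the example, not values from the question. In this toy data the flag only fires on the sixth identical reading, because the difference of the rolling means compares the value entering the window with the one that just left it.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute readings; "temp" is stuck on 71.46 for six samples.
index = pd.date_range("2020-01-01 00:00", periods=12, freq="min")
readings = [70.1, 70.4, 71.46, 71.46, 71.46, 71.46,
            71.46, 71.46, 70.9, 70.2, 70.6, 70.3]
df = pd.DataFrame({"temp": readings}, index=index)

# Same steps as in the answer: rolling mean, then row-to-row difference.
roller = df.rolling(5).mean()
diff_table = roller - roller.shift(1)

# Wrap the boolean array in a Series so the timestamps stay attached.
has_repeat = pd.Series(np.isclose(diff_table, 0).any(axis=1), index=df.index)

# Timestamps where the rolling mean did not change.
flagged_times = has_repeat[has_repeat].index
print(flagged_times)
```

Wrapping the result in a Series with the original index keeps the flags aligned with the timestamps, which is what the next step relies on.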

Finding the start and end times of when this happens is a bit trickier. If your timestamps are in the index of your table, you could create a Series with the same index and values of 1 and 0, depending on whether any of the columns in diff_table for a given row were 0 (i.e. had a repeat of 5 values). By again subtracting adjacent values in this series, you can then identify whether a row is the start of an interval (1, from 1 - 0) or the end of an interval (-1, from 0 - 1), depending on what values you have chosen.

That helps you find the time at which the moving average started to be constant. If you then subtract 5 minutes from that start time, you get the real interval start, when the sensor started repeating.
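
The interval bookkeeping described above could look roughly like this. It is only a sketch: the flag values are made up, and the names edges, starts, ends and real_starts are hypothetical. It assumes one-minute spacing and that the flags do not begin or end in the middle of an interval.

```python
import pandas as pd

# Hypothetical flag series: True where the rolling mean was unchanged.
index = pd.date_range("2020-01-01 00:00", periods=14, freq="min")
flags = [False] * 6 + [True, True, True, False, False, True, False, False]
has_repeat = pd.Series(flags, index=index)

# Differencing the 0/1 flags gives +1 on the first flagged row of an interval
# and -1 on the first unflagged row after it.
edges = has_repeat.astype(int).diff().fillna(0)
starts = edges[edges == 1].index
ends = edges[edges == -1].index

# Shift each start back by the window length (5 one-minute steps) to estimate
# when the sensor actually began repeating.
window = 5
real_starts = starts - pd.Timedelta(minutes=window)

for begin, end in zip(real_starts, ends):
    print(begin, "->", end)
```

Note that each entry in ends is the first row after the flagged stretch, so the last flagged timestamp is one step earlier.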

There might be better ways, but this is the one I would give a shot if it was my problem.
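
One such alternative, which is not the rolling-average method above, is to label runs of identical values directly and keep the ones that are long enough. The sketch below is a different technique (run labelling with shift and cumsum); the function name find_stuck_runs and the sample data are hypothetical, and it assumes a unique time index and that a stuck sensor reports exactly equal values.

```python
import pandas as pd


def find_stuck_runs(series, min_length=5):
    """Return start, end, value and length of each run of identical values."""
    # A new run begins wherever the value differs from the previous one.
    run_id = series.ne(series.shift()).cumsum()
    timestamps = series.index.to_series()
    runs = pd.DataFrame({
        "start": timestamps.groupby(run_id).first(),
        "end": timestamps.groupby(run_id).last(),
        "value": series.groupby(run_id).first(),
        "length": series.groupby(run_id).size(),
    })
    # Keep only the runs that are at least min_length readings long.
    return runs[runs["length"] >= min_length].reset_index(drop=True)


# Hypothetical readings: 71.46 repeats five times, 70.9 only twice.
index = pd.date_range("2020-01-01 00:00", periods=10, freq="min")
temp = pd.Series(
    [70.1, 71.46, 71.46, 71.46, 71.46, 71.46, 70.9, 70.9, 70.2, 70.6],
    index=index,
)
stuck = find_stuck_runs(temp)
print(stuck)
```

Because this compares each value only with its neighbour, it does not depend on a window size and gives the start and end of each run without the 5-minute correction.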