Combining two dataframes with different column name in time-series

32 Views Asked by At

I have two data frames, one named sensor and one named train. The sensor data frame contains the data of a time series, whose index is the ts_sensor column. In the sensor data frame the names of the columns identify the name of the sensor, and each cell is filled with the value assumed by the sensor at the ts_sensor moment. The train data frame contains an id_sensor column filled with the name of the sensor (as in the sensor data frame) and the boot_threshold column in which each cell is filled with a threshold value for each sensor.

Here is the info related to the mentioned data frames.

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id_train             264 non-null    object 
 1   asset_id             264 non-null    object 
 2   train_type           264 non-null    object 
 3   equipment_type       264 non-null    object 
 4   id_sensor            264 non-null    object 
 5   unit                 262 non-null    object 
 6   is_operating         262 non-null    float64
 7   will_be_available    262 non-null    float64
 8   pi_acquisition_date  0 non-null      float64
 9   pi_description       264 non-null    object 
 10  sensor_description   264 non-null    object 
 11  tag_id               262 non-null    float64
 12  priority             262 non-null    float64
 13  for_ml               264 non-null    int64  
 14  onoff_parameter      264 non-null    int64  
 15  boot_threshold       264 non-null    int64  
 16  efficiency_index     264 non-null    int64  
dtypes: float64(5), int64(4), object(8)
memory usage: 35.2+ KB

sensor.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1729930 entries, 0 to 1729929
Columns: 135 entries, level_0 to Australia.Blacktip.ICSS.KPI.EE.GasTurbineEfficiency_A
dtypes: float64(132), int64(2), object(1)
memory usage: 1.7+ GB

sensor.columns
Index(['level_0', 'index', 'sensor_ts',
       'Australia.Blacktip.ICSS.360140413-100.PV',
       'Australia.Blacktip.ICSS.360140415-100.PV',
       'Australia.Blacktip.ICSS.360140417-100.PV',
       'Australia.Blacktip.ICSS.360140421-100.PV',
       'Australia.Blacktip.ICSS.360140423-100.PV',
       'Australia.Blacktip.ICSS.360140425-100.PV',
       'Australia.Blacktip.ICSS.360140433-100.PV',
       ...
       'Australia.Blacktip.ICSS.3601PDI144.DACA.PV',
       'Australia.Blacktip.ICSS.3601PIT111.PV',
       'Australia.Blacktip.ICSS.3601PIT141.PV',
       'Australia.Blacktip.ICSS.3601TIT042.PV',
       'Australia.Blacktip.ICSS.3601TIT052.PV',
       'Australia.Blacktip.ICSS.3601TIT111.PV',
       'Australia.Blacktip.ICSS.3601TIT122.PV',
       'Australia.Blacktip.ICSS.3601TIT141.PV',
       'Australia.Blacktip.ICSS.3601TIT152.PV',
       'Australia.Blacktip.ICSS.KPI.EE.GasTurbineEfficiency_A'],
      dtype='object', length=135)

I want to obtain a new data frame in which the time series is respected and only the sensors whose ts_sensor value is currently higher than the corresponding boot_threshold value are present.

Thank you all

1

There are 1 best solutions below

0
DataSciRookie On

Are you looking for that ?

import pandas as pd
import numpy as np   
# Simulated sensor data (a small part of the actual data for demonstration)
sensor_data = {
    'sensor_ts': pd.date_range(start='2023-01-01', periods=4, freq='D'),
    'Australia.Blacktip.ICSS.360140413-100.PV': [100, 200, 300, 400],
    'Australia.Blacktip.ICSS.360140415-100.PV': [100, 220, 330, 440],
}

sensor = pd.DataFrame(sensor_data)

# Simulated train data 
train_data = {
    'id_sensor': ['Australia.Blacktip.ICSS.360140413-100.PV', 'Australia.Blacktip.ICSS.360140415-100.PV'],
    'boot_threshold': [250, 300],
}

train = pd.DataFrame(train_data)

# Reshape sensor DataFrame from wide to long format
sensor_long = pd.melt(sensor, id_vars=['sensor_ts'], var_name='id_sensor', value_name='value')

# Merge the reshaped sensor data with the train data
merged_data = pd.merge(sensor_long, train, on='id_sensor')

# Filter to include only rows where sensor value is greater than the boot_threshold
filtered_data = merged_data[merged_data['value'] > merged_data['boot_threshold']]

# Displaying the filtered data for verification
filtered_data

Output :

sensor_ts   id_sensor   value                      boot_threshold
2   2023-01-03  Australia.Blacktip.ICSS.360140413-100.PV    300 250
3   2023-01-04  Australia.Blacktip.ICSS.360140413-100.PV    400 250
6   2023-01-03  Australia.Blacktip.ICSS.360140415-100.PV    330 300
7   2023-01-04  Australia.Blacktip.ICSS.360140415-100.PV    440 300