I am working with human sensor data and want to remove any outliers from the dataset. The data is sampled 50 times per second. My question: do I need to do some pre-processing before using the Isolation Forest model? I don't get any errors as such, but I want to use it the right way. I have never worked with time series data before, so any suggestions would be great.

The first 20 rows of the data are shown below:

[Image: Human activity sensor data]

After reading the file, I extract the sensor columns and pass them to the model straight away.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
from mpl_toolkits.mplot3d import Axes3D

data = pd.read_csv('walking.csv')

# Specify the columns that contain sensor readings (exclude 'act' and 'id' columns)
sensor_columns = ['rotationRate.x', 'rotationRate.y', 'rotationRate.z',
                   'userAcceleration.x', 'userAcceleration.y', 'userAcceleration.z'] 

# Combine all sensor columns into a feature vector
X = data[sensor_columns]

# Define a range of contamination values to test
contamination_values = np.arange(0.01, 0.11, 0.01)  # Adjust the range as needed

# Create an empty list to store cross-validation scores
cv_scores = []

# Initialize time-based cross-validation
tscv = TimeSeriesSplit(n_splits=5)  # You can adjust the number of splits as needed

# Loop through each contamination value and evaluate the model with time-based cross-validation
for contamination in contamination_values:
    scores = []
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        
        clf = IsolationForest(contamination=contamination, random_state=42)
        clf.fit(X_train)
        outlier_predictions = clf.predict(X_test)
        
        scores.append(np.mean(outlier_predictions == -1))  # Calculate the proportion of outliers
        
    cv_scores.append(np.mean(scores))

# Pick the contamination value with the highest mean score
# (note: the "score" here is the proportion of test points flagged as outliers, not an accuracy metric)
best_contamination = contamination_values[np.argmax(cv_scores)]
print("Best Contamination Value:", best_contamination)

# Train the final model with the best contamination value on the entire dataset
clf = IsolationForest(contamination=best_contamination, random_state=42)
clf.fit(X)
outlier_predictions = clf.predict(X)

# Create a new column to mark outliers in your original DataFrame
data['is_outlier'] = outlier_predictions

# Print the number of outliers detected
print("Number of Outliers Detected:", np.sum(outlier_predictions == -1))
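
A side note on the selection step in the code above: the quantity being maximized is the fraction of points flagged as outliers, and that fraction tends to follow `contamination` itself, so `np.argmax` will most likely just return the largest candidate. The sketch below illustrates this on synthetic data. The array `X_demo` and the candidate list are made up for illustration and are not taken from `walking.csv`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in with 6 columns, like the sensor features (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 6))

candidates = [0.01, 0.05, 0.10]
flagged_fraction = []
for c in candidates:
    model = IsolationForest(contamination=c, random_state=42).fit(X_demo)
    # Fraction of points the fitted model labels as outliers (-1)
    flagged_fraction.append(float(np.mean(model.predict(X_demo) == -1)))

for c, f in zip(candidates, flagged_fraction):
    print(f"contamination={c:.2f} -> flagged fraction={f:.3f}")
```

If that pattern holds on the real data, the loop is not really tuning anything. Without labelled outliers, `contamination` is probably better chosen from domain knowledge about how much of the signal is expected to be faulty.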

I get the number of outliers, along with a warning that says: "X does not have valid feature names, but IsolationForest was fitted with feature names".
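
One possible way to avoid that warning, assuming it arises because the model records the DataFrame's column names at fit time and is later scored on a plain array, is to pass NumPy arrays consistently to both `fit` and `predict`. This is a minimal sketch on synthetic data. The column names and the `contamination` value are placeholders, not the real sensor columns or a recommended setting.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the sensor columns (hypothetical data, not walking.csv)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a.x", "a.y", "a.z"])

# Convert once, so the model never records feature names
X_values = X.to_numpy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf = IsolationForest(contamination=0.05, random_state=42)
    clf.fit(X_values)
    labels = clf.predict(X_values)

feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
print("feature-name warnings:", len(feature_name_warnings))
print("flagged as outliers:", int(np.sum(labels == -1)))
```

The labels come back in row order, so they can still be assigned to a new column of the original DataFrame. Upgrading scikit-learn may also make the warning go away, but that is an assumption worth checking against the installed version.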
