How to exclude abnormal data points and smooth the data before linear fitting

Question

68 Views Asked by FreeAir At 23 March 2024 at 19:10

I want to calculate the modulus by a linear fit of the stress-strain curve. However, the pressure data scatter a lot, so the fit is sometimes poor. I think two things could improve the situation. First, some extremely large or small values need to be ignored. Second, the pressure data could perhaps be processed in some way before the fit; for example, smoothing or a running average might help.

The following is the Python code I used for linear fitting.

import sys
import numpy as np
from scipy.optimize import curve_fit
from pathlib import Path
import os


def f(x, a, b):
   return a + b * x

def get_modulus(BoxValue, PressureValue, strain_cutoff = 0.012):
   strain_all = np.array(BoxValue / BoxValue[0] - 1)
   stress_all = np.array(- PressureValue / 10000)
   strain = strain_all[strain_all <= strain_cutoff]
   stress = stress_all[strain_all <= strain_cutoff]
   popt, pcov = curve_fit(f, strain, stress)
   # modulus = popt[1]
   # modulus_std_dev = np.sqrt(np.diag(pcov))[1]
   return popt[1], np.sqrt(np.diag(pcov))[1]

BoxValue = np.loadtxt("box-xx_0water_150peg.dat")[:,1]
PressureValue = np.loadtxt("pres-xx_0water_150peg.dat")[:,1]
modulus, modulus_std = get_modulus(BoxValue, PressureValue)

Two sets of data can be downloaded here.

The first one: box data, and pressure data

The second one: box data, and pressure data

The pressure data of the second one has an abnormal point on the 180th line, which I think should be excluded from the fitting.

Could you please tell me what is the best practice to do this and provide the code as well?

Any suggestions or comments are welcome.

There is 1 best solution below

**jlandercy** · Answer 1 · 2024-03-28T10:23:18.820000

TL;DR

Your datasets are too far from linear to be handled by robust regression alone. You definitely need to pre-process the data before you can regress the modulus.
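One possible way to do such pre-processing, sketched with NumPy only: flag points that sit far from a centred rolling median (a Hampel-style filter), then take a running mean over the remaining points. The helper names `rolling` and `clean_stress`, the window of 11 samples, and the 3-sigma threshold are illustrative assumptions rather than values tuned to these datasets, and the synthetic data below only stand in for the real files.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling(values, window, func):
    """Apply func over a centred window of odd length; edges use fewer points."""
    half = window // 2
    # Pad with NaN so every sample gets a full-length window.
    padded = np.pad(np.asarray(values, dtype=float), half, constant_values=np.nan)
    return func(sliding_window_view(padded, window), axis=1)


def clean_stress(stress, window=11, n_sigmas=3.0):
    """Return (smoothed stress, boolean outlier mask). Hypothetical helper."""
    stress = np.asarray(stress, dtype=float)
    # Local reference level, robust to isolated spikes.
    median = rolling(stress, window, np.nanmedian)
    deviation = np.abs(stress - median)
    # Local spread: rolling median of absolute deviations, scaled by 1.4826
    # so that it matches a standard deviation for Gaussian noise.
    mad = rolling(deviation, window, np.nanmedian)
    is_outlier = deviation > n_sigmas * 1.4826 * mad
    # Drop flagged points, then smooth with a running mean that skips NaN.
    masked = np.where(is_outlier, np.nan, stress)
    smoothed = rolling(masked, window, np.nanmean)
    return smoothed, is_outlier


# Synthetic stand-in for a stress-strain curve with one abnormal point.
rng = np.random.default_rng(0)
strain = np.linspace(0.0, 0.012, 200)
stress = 30.0 * strain + rng.normal(0.0, 0.005, strain.size)
stress[50] += 5.0

smoothed, is_outlier = clean_stress(stress)
print("spike flagged:", bool(is_outlier[50]))
```

Note that smoothing correlates neighbouring points, so the standard error reported by a subsequent fit will look smaller than it really is. Fitting only the unflagged raw points (`strain[~is_outlier]`, `stress[~is_outlier]`) avoids that.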

Robust regression

First we process your data:

import pandas as pd

def prepare(suffix):
    # Load:
    x = pd.read_csv("box" + suffix, sep="\t", header=None, names=["id", "x"])
    y = pd.read_csv("pres" + suffix, sep="\t", header=None, names=["id", "y"])
    # Merge:
    data = x.merge(y, on=["id"])
    # Post process:
    data["strain"] = data["x"] / data["x"][0] - 1.
    data["stress"] = - data["y"] / 10_000.
    return data

Then we perform Robust Linear Regression on it:

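import matplotlib.pyplot as plt
from sklearn.linear_model import TheilSenRegressor
from sklearn.metrics import r2_score

def analyse(data):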
    # Regress:
    X, y = data["strain"].values.reshape(-1, 1), data["stress"].values
    regressor = TheilSenRegressor()
    regressor.fit(X, y)
    # Predict:
    yhat = regressor.predict(X)
    score = r2_score(y, yhat)
    # Render:
    fig, axe = plt.subplots()
    axe.scatter(X, y)
    axe.plot(X, yhat, color="orange")
    axe.set_title("Regression")
    axe.set_xlabel("Strain")
    axe.set_ylabel("Stress")
    axe.grid()
    return {
        "slope": regressor.coef_,
        "intercept": regressor.intercept_,
        "score": score,
        "axe": axe
    }

The first dataset returns the following (the data are probably not linear at all):

df1 = prepare("-xx_0water_150peg.dat")
sol1 = analyse(df1)
#{'slope': array([2.22561879]),
# 'intercept': 0.025513888992521497,
# 'score': 0.7902902090920214,
# 'axe': <AxesSubplot:title={'center':'Regression'}, xlabel='Strain', ylabel='Stress'>}

The second dataset returns the following (it appears to mix several behaviours):

df2 = prepare("-yy_200water_150peg.dat")
sol2 = analyse(df2)
#{'slope': array([29.5666213]),
# 'intercept': 0.0008528267952350494,
# 'score': 0.007683701094953643,
# 'axe': <AxesSubplot:title={'center':'Regression'}, xlabel='Strain', ylabel='Stress'>}

Here the strong outlier does not affect the regression.
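To illustrate why a single strong outlier barely moves a robust fit, here is a small sketch on synthetic data (not the files from the question). It uses the classic one-dimensional Theil-Sen estimator, the median of all pairwise slopes, written by hand in NumPy. This is an assumption made for illustration: it is not scikit-learn's `TheilSenRegressor` implementation, and `theil_sen_slope` is a hypothetical helper name.

```python
import numpy as np


def theil_sen_slope(x, y):
    """Median of the slopes between all pairs of points with distinct x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(x.size, k=1)  # every pair (i, j) with i < j
    dx = x[j] - x[i]
    keep = dx != 0
    return np.median((y[j] - y[i])[keep] / dx[keep])


# Synthetic linear data with true slope 30 and one strong outlier.
rng = np.random.default_rng(0)
strain = np.linspace(0.0, 0.012, 100)
stress = 30.0 * strain + rng.normal(0.0, 0.005, strain.size)
stress[90] += 10.0

ols_slope = np.polyfit(strain, stress, 1)[0]  # ordinary least squares
robust_slope = theil_sen_slope(strain, stress)
print(f"least squares: {ols_slope:.1f}, Theil-Sen: {robust_slope:.1f}")
```

The outlier takes part in only a small fraction of the point pairs, so the median slope stays close to the true value, while the least-squares slope is pulled strongly towards the spike.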

But if we zoom in, at least two different behaviours become visible.

Conclusions

Robust regression helps to filter out a few strong outliers without having to handle them manually.
Your datasets are non-linear to some extent:
- The first dataset exhibits a negative curvature.
- The second dataset contains at least two behaviours that need to be split before analysis; if you wish to automate this, add a clustering step before fitting.
You need to decide whether the intercept should be fitted: stress-strain curves generally pass through the origin (no strain, no stress).
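If you decide that the curve must pass through the origin, the least-squares slope has a closed form, sketched below. The function name `slope_through_origin` and the toy numbers are assumptions for illustration only.

```python
import numpy as np


def slope_through_origin(strain, stress):
    """Least-squares slope of stress = modulus * strain, with no intercept."""
    strain = np.asarray(strain, dtype=float)
    stress = np.asarray(stress, dtype=float)
    # Minimising sum((stress - m * strain)**2) over m gives this ratio.
    return np.sum(strain * stress) / np.sum(strain * strain)


# Toy data lying exactly on a line through the origin with slope 30.
strain = np.array([0.0, 0.004, 0.008, 0.012])
stress = 30.0 * strain
modulus = slope_through_origin(strain, stress)
print(modulus)
```

In the question's code, the equivalent change would be to fit a one-parameter model such as `b * x` with `curve_fit`. With scikit-learn, `TheilSenRegressor(fit_intercept=False)` is the robust counterpart.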
