How to exclude abnormal data points and smooth the data before linear fitting

I want to calculate the modulus by a linear fit of the stress-strain curve. However, because the pressure data are very scattered, the fit is sometimes poor. I think two things could improve the situation. First, some extremely large or small values need to be ignored. Second, the pressure data could be processed in some way before the fitting; for example, smoothing or a running average might help.

The following is the Python code I used for linear fitting.

import numpy as np
from scipy.optimize import curve_fit


def f(x, a, b):
   return a + b * x

def get_modulus(BoxValue, PressureValue, strain_cutoff = 0.012):
   strain_all = np.array(BoxValue / BoxValue[0] - 1)
   stress_all = np.array(- PressureValue / 10000)
   strain = strain_all[strain_all <= strain_cutoff]
   stress = stress_all[strain_all <= strain_cutoff]
   popt, pcov = curve_fit(f, strain, stress)
   # modulus = popt[1]
   # modulus_std_dev = np.sqrt(np.diag(pcov))[1]
   return popt[1], np.sqrt(np.diag(pcov))[1]

BoxValue = np.loadtxt("box-xx_0water_150peg.dat")[:, 1]
PressureValue = np.loadtxt("pres-xx_0water_150peg.dat")[:, 1]
modulus, modulus_std = get_modulus(BoxValue, PressureValue)

Two sets of data can be downloaded here.

The first one: box data, and pressure data

The second one: box data, and pressure data

The pressure data of the second one has an abnormal point on the 180th line, which I think should be excluded from the fitting.

Could you please tell me what is the best practice to do this and provide the code as well?

Any suggestions or comments are welcome.

jlandercy On

TL;DR

Your dataset is too far from linearity to be handled by robust regression alone. You definitely need to pre-process the dataset before you can regress the modulus.
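As a rough idea of what such pre-processing could look like, here is a minimal sketch, not a definitive recipe. It flags outliers with a Hampel-style filter (rolling median plus scaled median absolute deviation) and then applies a running mean. The function names, the window sizes, the 3-sigma threshold and the synthetic data are all assumptions made for illustration; they are not tuned to your files.

```python
import numpy as np

def hampel_mask(values, window=11, n_sigmas=3.0):
    """Return a boolean mask that is True for points kept as inliers."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    keep = np.ones(values.size, dtype=bool)
    for i in range(values.size):
        # Neighbourhood centred on i, truncated at the array edges.
        neighbours = values[max(0, i - half): i + half + 1]
        median = np.median(neighbours)
        # 1.4826 scales the MAD to a standard deviation for Gaussian noise.
        mad = 1.4826 * np.median(np.abs(neighbours - median))
        if mad > 0 and abs(values[i] - median) > n_sigmas * mad:
            keep[i] = False
    return keep

def running_mean(values, window=5):
    """Moving average; the output is shorter by window - 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic illustration (NOT the question's data): a noisy line with one spike.
rng = np.random.default_rng(0)
strain = np.linspace(0.0, 0.012, 200)
stress = 2.0 * strain + rng.normal(0.0, 0.002, strain.size)
stress[179] += 1.0  # artificial outlier, mimicking the abnormal point

keep = hampel_mask(stress)
clean_strain, clean_stress = strain[keep], stress[keep]
# Smooth both columns with the same window so they stay aligned.
smooth_strain = running_mean(clean_strain, window=5)
smooth_stress = running_mean(clean_stress, window=5)
```

Keep in mind that a running mean makes neighbouring points correlated, so the standard deviation reported by a subsequent least-squares fit will tend to look better than it really is.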

Robust regression

First we process your data:

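import pandas as pd

def prepare(suffix):
    # Load the box and pressure series (tab-separated columns: id, value):
    x = pd.read_csv("box" + suffix, sep="\t", header=None, names=["id", "x"])
    y = pd.read_csv("pres" + suffix, sep="\t", header=None, names=["id", "y"])
    # Merge both series on their shared id column:
    data = x.merge(y, on=["id"])
    # Post process: strain relative to the first box value, rescaled and sign-flipped pressure as stress:
    data["strain"] = data["x"] / data["x"].iloc[0] - 1.
    data["stress"] = - data["y"] / 10_000.
    return data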

Then we perform Robust Linear Regression on it:

import matplotlib.pyplot as plt
from sklearn.linear_model import TheilSenRegressor
from sklearn.metrics import r2_score

def analyse(data):
    # Regress (scikit-learn expects a 2D feature matrix):
    X, y = data["strain"].values.reshape(-1, 1), data["stress"].values
    regressor = TheilSenRegressor()
    regressor.fit(X, y)
    # Predict:
    yhat = regressor.predict(X)
    score = r2_score(y, yhat)
    # Render:
    fig, axe = plt.subplots()
    axe.scatter(X, y)
    axe.plot(X, yhat, color="orange")
    axe.set_title("Regression")
    axe.set_xlabel("Strain")
    axe.set_ylabel("Stress")
    axe.grid()
    return {
        "slope": regressor.coef_,
        "intercept": regressor.intercept_,
        "score": score,
        "axe": axe
    }

The first dataset returns (probably not linear at all):

df1 = prepare("-xx_0water_150peg.dat")
sol1 = analyse(df1)
#{'slope': array([2.22561879]),
# 'intercept': 0.025513888992521497,
# 'score': 0.7902902090920214,
# 'axe': <AxesSubplot:title={'center':'Regression'}, xlabel='Strain', ylabel='Stress'>}

[Figure: stress versus strain for the first dataset, with the fitted regression line]

The second dataset returns (it appears to mix multiple setups):

df2 = prepare("-yy_200water_150peg.dat")
sol2 = analyse(df2)
#{'slope': array([29.5666213]),
# 'intercept': 0.0008528267952350494,
# 'score': 0.007683701094953643,
# 'axe': <AxesSubplot:title={'center':'Regression'}, xlabel='Strain', ylabel='Stress'>}

Where the strong outlier does not affect the regression:

[Figure: stress versus strain for the second dataset, with the fitted regression line]

But if we zoom in, we see at least two different behaviours:

[Figure: zoomed view of the second dataset]
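To illustrate why a single strong outlier barely moves a Theil-Sen fit while it can wreck ordinary least squares, here is a small self-contained sketch. It uses the textbook one-dimensional Theil-Sen estimator (median of all pairwise slopes), which is an assumption for illustration and not scikit-learn's exact implementation. The helper name theil_sen_1d and the synthetic data are made up as well.

```python
import numpy as np

def theil_sen_1d(x, y):
    """Median of pairwise slopes, with a median-based intercept."""
    i, j = np.triu_indices(x.size, k=1)  # all index pairs with i < j
    dx = x[j] - x[i]
    valid = dx != 0                      # skip pairs with identical x
    slope = np.median((y[j] - y[i])[valid] / dx[valid])
    intercept = np.median(y - slope * x)
    return slope, intercept

# Synthetic illustration (NOT the question's data): true slope 30, one outlier.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 0.012, 100)
y = 30.0 * x + rng.normal(0.0, 0.01, x.size)
y[90] += 5.0  # one strong outlier near the end of the range

ols_slope, ols_intercept = np.polyfit(x, y, 1)  # ordinary least squares
ts_slope, ts_intercept = theil_sen_1d(x, y)     # robust estimate

print(f"OLS slope:       {ols_slope:.2f}")
print(f"Theil-Sen slope: {ts_slope:.2f}")
```

The least-squares slope is pulled far from the true value by the single spike, whereas the median of pairwise slopes stays close to it, because only a small fraction of the pairs involve the outlier.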

Conclusions

  • Robust regression will help you filter out a few strong outliers without having to handle them manually.
  • Your datasets are nonlinear to some extent:
    • The first dataset exhibits a negative curvature.
    • The second dataset contains at least two behaviours that need to be split before analysis; if you wish to automate this, add a clustering step before fitting.
  • You need to decide whether the intercept should be fitted; stress-strain curves generally pass through the origin (no strain, no stress).
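On the last point, if you decide the curve must pass through the origin, the least-squares slope has a closed form: slope = sum(strain * stress) / sum(strain ** 2). A minimal sketch follows, assuming plain (non-robust) least squares and synthetic data; the helper name slope_through_origin is hypothetical.

```python
import numpy as np

def slope_through_origin(strain, stress):
    """Least-squares slope for the model stress = modulus * strain."""
    strain = np.asarray(strain, dtype=float)
    stress = np.asarray(stress, dtype=float)
    return float(strain @ stress / (strain @ strain))

# Synthetic illustration (NOT the question's data): true modulus 2.0.
rng = np.random.default_rng(2)
strain = np.linspace(0.0, 0.012, 50)
stress = 2.0 * strain + rng.normal(0.0, 0.001, strain.size)

# Fit constrained to pass through (0, 0):
modulus_origin = slope_through_origin(strain, stress)
# Fit with a free intercept, for comparison:
modulus_free, intercept_free = np.polyfit(strain, stress, 1)

print(f"through origin: {modulus_origin:.3f}")
print(f"free intercept: {modulus_free:.3f} (intercept {intercept_free:.5f})")
```

If you prefer to stay with the robust estimator, TheilSenRegressor also accepts fit_intercept=False, which imposes the same constraint.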