Gnuplot - How to ignore outliers for the fit?


I have started working with Gnuplot and tried out a few things. Now I'm wondering how to automatically exclude outliers from a fit. An example is shown in the figure: the data point at (4, 50) in the second data set.

[Figure: Outlier in "data set 2" distorts the fit]

The data set is included in the script below.

I've found a similar question here, but I couldn't make it work for my example. There are probably many different approaches, and I'm not that experienced with Gnuplot or similar software, so I would be glad for suggestions on a possible way to identify outliers and exclude them from the fit.

I'm using the gnuplottex package in LaTeX (texlive) on Windows 10. The gnuplot code:

\begin{gnuplot}[terminal=tikz, terminaloptions={color size 7cm,5cm}]
reset session

$Data <<EOD
#data
x   y1  y2  y3  y4
1   1   6   4   2   
2   4   10  1   1   
3   9   15  0   0.5 
4   16  50  1   2   
5   25  31  4   5   
6   36  42  9   12  
7   49  55  30  23
EOD

datafile = 'data.dat'
set print 'parameters.dat'

#_____________Set the label for data points________________________
set key top left                            # set position of legend
set key Left                                # set raggedleft
set key samplen 2 spacing 1.2 font ",8" # set fontsize and spacing
set key noautotitle 

###1__________Define function and number of columns_________________________
f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]

do for [col=colMin:colMax] {
    a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
    fit f(x,a,b,c) datafile u 1:col via a,b,c
    A[col] = a;  B[col] = b;  C[col] = c
    
    print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}

plot for [col=colMin:colMax] datafile u 1:col ls col, \
     for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col, \
     for [col=colMin:colMax] keyentry w lp ls col \
     title sprintf("$y%d$",col-1)
\end{gnuplot}

theozh (best answer)

As mentioned in the comments, you have to define somehow what you consider an outlier. There are certainly several ways to do that. I'm not claiming that this is the best way; just consider it a starting point.

Some Comments:

  • do a first fit with all data points
  • define an absolute distance OutlierDist beyond which a point counts as an outlier
  • plot the data into a table $NOOUTLIERS; if the absolute distance to the fitted curve is >= OutlierDist, write NaN into the second column and the original value into the third column
  • fit a second time (without the outliers)
  • plot the data, the fitted curves (second fit) and, if desired, the outliers

This can certainly be optimized.

Data: "SO77774328.dat"

x   y1  y2  y3  y4
1    1    6   4    2
2    4   10   1    1
3    9   15   0    0.5
4   16   50   1    2
5   25   31   4    5
6   36   42   9   12
7   49   55  30   23

Script:

### remove outliers for fitting
reset session

FILE     = "SO77774328.dat"
PARAMS_1 = "SO77774328_1.par"
PARAMS_2 = "SO77774328_2.par"

f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]

set print PARAMS_1
do for [col=colMin:colMax] {
    a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
    fit f(x,a,b,c) FILE u 1:col via a,b,c
    A[col] = a;  B[col] = b;  C[col] = c
    print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}
unset print

# write data to table with outliers --> NaN
OutlierDist = 10   # outlier distance
dev(colX,colY) = abs(column(colY)-f(column(colX),A[colY],B[colY],C[colY])) >= OutlierDist ? NaN :  column(colY)
set table $NOOUTLIERS
    do for [colY=colMin:colMax] {
        plot FILE u 1:(v0=dev(1,colY)):(v0!=v0?column(colY):NaN) lc var
    }
unset table

# fit again
set print PARAMS_2
do for [col=colMin:colMax] {
    i = col-colMin   # datablock index
    a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
    fit f(x,a,b,c) $NOOUTLIERS index i u 1:2 via a,b,c
    A[col] = a;  B[col] = b;  C[col] = c
    print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}
unset print

set key noautotitle left top

plot for [col=colMin:colMax] FILE u 1:col ls col-1, \
     for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col-1, \
     for [col=colMin:colMax] keyentry w lp ls col-1 title sprintf("y%d",col-1), \
     $NOOUTLIERS u 1:(valid(2) ? NaN : column(3)) w p pt 6 ps 2 lc "red" ti "Outlier"
### end of script

Result:

enter image description here

Karl

You simply set ordinate values that are too far from the fitted function to NaN, using the ternary operator (a ? b : c):

maxdev=1.3 # for example
fit f(x) dataf using 1:(abs(f($1)-$2)>maxdev ? NaN : $2) via <parameters>

Absolute vertical distance is of course just the simplest example. A sensible outlier criterion for a given problem will require a lot more thought, to avoid compromising the validity of your results, especially of your fit errors.
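One possible refinement, as a rough sketch only: scale the cutoff with the scatter of the data instead of hard-coding it. After every fit, gnuplot stores the RMS of the residuals in the variable FIT_STDFIT, so the threshold can be set to a multiple of it. The parabola model, the start values, the file name, the choice of column 3 (y2) and the factor k are assumptions made for illustration. Note that with 7 points and 3 parameters no single residual can exceed sqrt(7-3) = 2 times FIT_STDFIT, so k has to stay clearly below 2 for such a small data set.

```gnuplot
# Sketch: reject points whose residual exceeds k times the RMS residual
# of a first fit. Model, start values, file name and k are assumptions.
f(x) = a*(x-b)**2 + c
dataf = "SO77774328.dat"
a=1; b=1; c=4
set fit quiet nolog

# first pass: fit with all points (column 3 holds y2)
fit f(x) dataf using 1:3 via a,b,c

k      = 1.5                 # hypothetical cutoff factor
maxdev = k * FIT_STDFIT      # FIT_STDFIT = RMS of residuals of the last fit

# second pass: points farther than maxdev from the first-pass curve become NaN
fit f(x) dataf using 1:(abs(f($1)-$3) > maxdev ? NaN : $3) via a,b,c
print sprintf("maxdev=%g  a=%.4f b=%.4f c=%.4f", maxdev, a, b, c)
```

Keep in mind that a strong outlier inflates FIT_STDFIT itself, so this criterion is only a slightly less arbitrary starting point and not a statistically rigorous test.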

Of course you need to have a rather good set of parameters already, or after two or three iterations, all points will be far from the current function. So definitely start with

 fit f(x) dataf using 1:2 via <parameters>

and perhaps then continue with a large value for "maxdev", and reduce it stepwise. Disclaimer: always give a good explanation why you exclude datapoints, and mark them as excluded outliers. ;)

 plot dataf us 1:(abs(f($1)-$2)>maxdev ? NaN : $2) title "data", \
      dataf us 1:(abs(f($1)-$2)>maxdev ? $2 : NaN) title "outliers", \
      f(x) title "fit"
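
The stepwise tightening of maxdev described above could be sketched like this. Again, the parabola model, the start values, the file name, the choice of column 3 and the list of thresholds are assumptions for illustration, not recommended values. Each fit reads the data once when it starts, so the points are filtered against the curve from the previous pass.

```gnuplot
# Sketch: start with all points, then tighten maxdev step by step.
# Model, start values, file name and thresholds are assumptions.
f(x) = a*(x-b)**2 + c
dataf = "SO77774328.dat"
a=1; b=1; c=4
set fit quiet nolog

# initial fit with every point, to get reasonable parameters
fit f(x) dataf using 1:3 via a,b,c

steps = "40 20 10"           # hypothetical thresholds, from large to small
do for [i=1:words(steps)] {
    maxdev = real(word(steps,i))
    # points farther than maxdev from the previous curve are set to NaN and skipped
    fit f(x) dataf using 1:(abs(f(column(1))-column(3)) > maxdev ? NaN : column(3)) via a,b,c
    print sprintf("maxdev=%g  a=%.4f b=%.4f c=%.4f", maxdev, a, b, c)
}
```

Printing the parameters after each pass makes it easy to see whether they settle or keep drifting as more points are excluded.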