I have started working with gnuplot and tried out a few things. Now I am wondering how to automatically exclude outliers from a fit. An example is shown in the figure: the data point at (4, 50) in the second data set.
And the data set:
I've found a similar question here, but I couldn't make it work for my example. There are probably many different approaches, and I'm not that experienced with gnuplot or similar software, so I would be glad to get suggestions for a possible way to identify outliers.
I'm using the gnuplottex package in LaTeX (texlive) on Windows 10. The gnuplot code:
\begin{gnuplot}[terminal=tikz, terminaloptions={color size 7cm,5cm}]
reset session
$Data <<EOD
#data
x y1 y2 y3 y4
1 1 6 4 2
2 4 10 1 1
3 9 15 0 0.5
4 16 50 1 2
5 25 31 4 5
6 36 42 9 12
7 49 55 30 23
EOD
datafile = 'data.dat'
set print 'parameters.dat'
#_____________Set the label for data points________________________
set key top left # set position of legend
set key Left # left-justify the key text
set key samplen 2 spacing 1.2 font ",8" # set fontsize and spacing
set key noautotitle
###1__________Define function and number of columns_________________________
f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]
do for [col=colMin:colMax] {
    a=1; b=1; c=4 # some initial values, sometimes 0 or NaN is not a good start
    fit f(x,a,b,c) datafile u 1:col via a,b,c
    A[col] = a; B[col] = b; C[col] = c
    print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}
plot for [col=colMin:colMax] datafile u 1:col ls col, \
     for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col, \
     for [col=colMin:colMax] keyentry w lp ls col \
     title sprintf("$y%d$",col-1)
\end{gnuplot}
As mentioned in the comments, you have to somehow define what you consider an outlier. There are certainly several ways to do that. I'm not claiming that this is the best way; just consider it a starting point.
Some Comments:
- OutlierDist: what you consider an outlier
- the data is written into the datablock $NOOUTLIERS, and if the absolute distance to the fitted curve is >= OutlierDist, then write NaN into the second column and the original value into the 3rd column.
This can certainly be optimized.
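For illustration, here is a minimal sketch of how the OutlierDist / $NOOUTLIERS idea could look for a single data set. It is not the original script: the threshold value 10, the single-column datablock $Data (x and y2 taken from the question) and the plot styling are assumptions made only for this example. It relies on fit skipping points whose value is NaN.

```gnuplot
# Sketch only: OutlierDist = 10 is an assumed threshold, tune it to your data.
reset session

# x and y2 from the question's data (assumed layout: two columns, no header)
$Data <<EOD
1  6
2 10
3 15
4 50
5 31
6 42
7 55
EOD

f(x,a,b,c)  = a*(x-b)**2 + c
OutlierDist = 10
set fit quiet nolog

# Pass 1: fit all points, including the suspected outlier.
a=1; b=1; c=4
fit f(x,a,b,c) $Data u 1:2 via a,b,c

# Copy the data into $NOOUTLIERS:
#   column 2 = NaN if |y - f(x)| >= OutlierDist, otherwise y
#   column 3 = original y value
set table $NOOUTLIERS
    plot $Data u 1:(abs($2-f($1,a,b,c)) >= OutlierDist ? NaN : $2):2 w table
unset table

# Pass 2: refit using column 2; rows containing NaN are skipped by fit.
fit f(x,a,b,c) $NOOUTLIERS u 1:2 via a,b,c

plot $NOOUTLIERS u 1:3 w p pt 6 title "all points", \
     $NOOUTLIERS u 1:2 w p pt 7 title "used for fit", \
     f(x,a,b,c) w l title "fit without outliers"
```

A fixed absolute distance is only one possible criterion. Since the first fit is itself pulled towards the outlier, the threshold has to be chosen with some care, or the two passes repeated.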
Data:
SO77774328.dat
Script:
Result: