ELKI's LOF implementation for heavily duplicated data

561 Views Asked by Ira At 03 September 2015 at 05:02

Does ELKI fail for data that has many duplicate values? I have files with more than 2 million one-dimensional observations, but they contain only a few hundred unique values; the rest are duplicates. When I run such a file through ELKI for LOF or LoOP calculations, it returns NaN as the outlier score for any k less than the number of occurrences of the most frequent value. I imagine the LRD calculation causes this problem if duplicates are taken as nearest neighbours. But shouldn't it avoid doing this? Can we rely on the results ELKI produces in such cases?



Erich Schubert On 14 September 2015 at 08:22 BEST ANSWER

It is not so much a matter of ELKI, but of the algorithms.

Most outlier detection algorithms use the k nearest neighbors. If these neighbors are duplicates of the point itself, all the distances involved are 0, and the resulting values can be problematic. In LOF, the neighbors of duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably reach NaN due to a division by 0 if there are too many duplicates.

But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit these effects. There are some methods to avoid or reduce the effects:

add jitter to the data set
remove duplicates (but never consider highly duplicated values outliers!)
increase the neighborhood size
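
The first two workarounds can be sketched in a few lines of Python. This is only an illustration, not part of ELKI: the helper names `add_jitter` and `collapse_duplicates` and the jitter scale are assumptions made for this example, and the scale would have to be chosen small relative to the gaps between distinct values in the real data.

```python
import random
from collections import Counter

def add_jitter(values, scale, seed=0):
    """Return a copy of values with small uniform noise in [-scale, scale] added.

    Hypothetical helper: 'scale' is an assumption and should be much smaller
    than the smallest gap between distinct values.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]

def collapse_duplicates(values):
    """Return a Counter mapping each distinct value to its frequency.

    Keeping the counts makes it possible to exclude highly duplicated
    values from the outlier candidates afterwards.
    """
    return Counter(values)

data = [1.0] * 5 + [1.5, 3.0]
jittered = add_jitter(data, scale=1e-3)
counts = collapse_duplicates(data)
```

After jittering, the five copies of 1.0 are slightly different values, so their mutual distances are no longer 0. After collapsing, `counts[1.0]` is 5, which records that 1.0 is a frequent value rather than an outlier candidate.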

It is easy to prove that such results do arise in LOF/LoOP equations if the data has duplicates.
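
A small experiment makes this concrete. The following is a minimal, quadratic-time sketch of the textbook LOF definitions for 1-dimensional data, written for illustration only; it is not ELKI's code, and the function name `lof_scores` and the toy data are assumptions of this example. When a value occurs more than k times, all reachability distances among its copies are 0, the local reachability density becomes infinite, and the LOF ratio becomes inf/inf.

```python
import math

def lof_scores(values, k):
    """Textbook LOF on 1-D data (O(n^2)); illustrative sketch only."""
    n = len(values)
    kdist, neighbors = [], []
    for i in range(n):
        # Distances from point i to every other point, sorted ascending.
        dists = sorted((abs(values[i] - values[j]), j) for j in range(n) if j != i)
        kd = dists[k - 1][0]                      # k-distance of point i
        kdist.append(kd)
        neighbors.append([j for d, j in dists if d <= kd])  # includes ties

    lrd = []
    for i in range(n):
        # Sum of reachability distances: max(k-distance(o), d(p, o)).
        total = sum(max(kdist[j], abs(values[i] - values[j])) for j in neighbors[i])
        # A sum of 0 means infinite density (this is the duplicate case).
        lrd.append(math.inf if total == 0 else len(neighbors[i]) / total)

    scores = []
    for i in range(n):
        ratios = [lrd[j] / lrd[i] for j in neighbors[i]]  # inf / inf gives nan
        scores.append(sum(ratios) / len(ratios))
    return scores

data = [1.0] * 5 + [1.5] + [3.0, 3.2, 3.4, 3.6]
scores_k3 = lof_scores(data, k=3)   # k below the number of copies of 1.0
scores_k5 = lof_scores(data, k=5)   # k equal to the number of copies of 1.0
```

With k=3, the five copies of 1.0 receive NaN, the nearby point 1.5 (whose neighbors are all duplicates) receives infinity, and the remaining points receive finite scores. With k=5, every neighborhood reaches beyond the duplicates and all scores in this toy data set are finite, which matches the behaviour described in the question.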

This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to be close to the original publication, so we avoid making unpublished changes. But if a "LOFdup" method is published and contributed to ELKI, we would obviously add it.

Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special, because it is ordered - this allows for both optimizations and much more advanced statistics that would be infeasible or require too many observations on multivariate data. LOF and similar methods are very basic statistics (so basic that many statisticians would outright reject them as "stupid" or "naive") - with the key benefit that they easily scale to large, multivariate data sets. Sometimes naive methods such as naive Bayes can work very well in practice; the same holds for LOF and LoOP: there are some questionable decisions in the algorithms. But they work, and scale. Just as with naive Bayes - the independence assumption in naive Bayes is questionable, but naive Bayes classification often works well, and scales very well.
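
As a rough illustration of the kernel density estimation route for 1-dimensional data, the sketch below builds a Gaussian KDE in plain Python. The function name `kde_density`, the Gaussian kernel, and the bandwidth of 0.5 are assumptions chosen for this example, not a recommendation from the answer; in practice the bandwidth needs to be selected for the data. Duplicates are handled by weighting each distinct value by its count, so each evaluation costs time proportional to the number of distinct values rather than the number of observations.

```python
import math
from collections import Counter

def kde_density(values, bandwidth):
    """Return a function estimating the density with a Gaussian kernel.

    Hypothetical helper: duplicates are collapsed into (value, count)
    pairs, so repeated values add weight instead of causing divisions by 0.
    """
    counts = Counter(values)
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            c * math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
            for v, c in counts.items()
        )

    return density

data = [1.0] * 5 + [1.5, 3.0, 3.2, 3.4, 3.6, 9.0]
density = kde_density(data, bandwidth=0.5)   # bandwidth is an assumed value
d_frequent = density(1.0)   # heavily duplicated value
d_cluster = density(3.2)    # value inside a small cluster
d_isolated = density(9.0)   # isolated value
```

Here the heavily duplicated value 1.0 gets the highest estimated density and the isolated value 9.0 the lowest, and no score is NaN or infinite. Low-density values are the outlier candidates.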

In other words, this is not a bug in ELKI. The implementation does what is published.
