Does ELKI fail for data that contains many duplicate values? I have files with more than 2 million one-dimensional observations, but they contain only a few hundred unique values; the rest are duplicates. When I run such a file through ELKI for LOF or LoOP calculations, it returns NaN as the outlier score for any k less than the number of occurrences of the most frequent value. I imagine the LRD calculation causes this problem if duplicates are taken as nearest neighbours. But shouldn't ELKI avoid doing that? Can we rely on the results ELKI produces in such cases?
ELKI's LOF implementation for heavily duplicated data
561 Views Asked by Ira At
This is not so much a matter of ELKI as of the algorithms themselves.
Most outlier detection algorithms use the k nearest neighbors. If these neighbors are duplicates of the query point, all the distances involved are 0, and the resulting values can be problematic. In LOF, the neighbors of duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably reach NaN due to a division by 0 if there are too many duplicates.
That is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit these effects. There are some methods to avoid/reduce the effects:
It is easy to prove that such results do arise in LOF/LoOP equations if the data has duplicates.
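To make this concrete, here is a minimal pure-Python sketch of the textbook LOF definitions (k-distance, reachability distance, lrd, LOF) on a tiny 1-dimensional data set. It is not ELKI's code; the function name `lof_scores` and the toy data are assumptions made for illustration. When the mean reachability distance is 0, the sketch sets lrd to infinity, which is what the division 1/0 yields in IEEE floating-point arithmetic.

```python
import math

def lof_scores(values, k):
    """Textbook LOF on 1-D data (illustrative sketch, not ELKI's implementation)."""
    n = len(values)

    def dist(i, j):
        return abs(values[i] - values[j])

    kdist, neigh = [], []
    for i in range(n):
        # Distances to all other points, sorted ascending.
        d = sorted((dist(i, j), j) for j in range(n) if j != i)
        kd = d[k - 1][0]                      # distance to the k-th nearest neighbor
        kdist.append(kd)
        neigh.append([j for dj, j in d if dj <= kd])  # includes ties

    lrd = []
    for i in range(n):
        # reach-dist(i, o) = max(k-distance(o), d(i, o))
        mean_reach = sum(max(kdist[o], dist(i, o)) for o in neigh[i]) / len(neigh[i])
        # Python raises on 1/0, so the infinite density is set explicitly.
        lrd.append(math.inf if mean_reach == 0 else 1.0 / mean_reach)

    lof = []
    for i in range(n):
        ratios = [lrd[o] / lrd[i] for o in neigh[i]]  # inf / inf evaluates to nan
        lof.append(sum(ratios) / len(ratios))
    return lrd, lof

if __name__ == "__main__":
    data = [1.0, 1.0, 1.0, 1.0, 5.0]  # four duplicates and one distinct value
    for k in (2, 4):
        lrd, lof = lof_scores(data, k)
        print(k, lrd, lof)
```

With k = 2, every duplicate has k-distance 0, so its lrd is infinite and its LOF is inf/inf = NaN, while the point 5.0 (whose neighbors are the duplicates) gets a LOF of infinity. With k = 4, which is no longer smaller than the number of occurrences of the duplicated value, all distances involved are positive and every score is finite.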
This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to stay close to the original publications, so we avoid making unpublished changes. If a "LOFdup" method is published and contributed to ELKI, we would obviously add it.
Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special because it is ordered: this allows for both optimizations and much more advanced statistics that would be infeasible, or require too many observations, on multivariate data. LOF and similar methods are very basic statistics (so basic that many statisticians would outright reject them as "stupid" or "naive"), with the key benefit that they scale easily to large, multivariate data sets. Sometimes naive methods work very well in practice. Naive Bayes is the usual example: its independence assumption is questionable, but the classifier often works well and scales very well. The same holds for LOF and LoOP: there are some questionable decisions in the algorithms, but they work, and they scale.
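As a rough illustration of the kernel density route for 1-dimensional data with many duplicates, the sketch below collapses the data to its unique values with counts and evaluates a Gaussian kernel density estimate at each unique value. The function name `weighted_kde_density` and the fixed `bandwidth` argument are assumptions for this example; choosing the bandwidth well is a topic of its own and is not addressed here.

```python
import math
from collections import Counter

def weighted_kde_density(values, bandwidth):
    """Gaussian KDE evaluated at each unique value (illustrative sketch).

    Duplicates are merged into (value, count) pairs, so the cost depends on
    the number of unique values rather than the number of observations.
    """
    counts = Counter(values)
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    density = {}
    for x in counts:
        total = 0.0
        for y, c in counts.items():
            z = (x - y) / bandwidth
            total += c * math.exp(-0.5 * z * z)  # each duplicate contributes once
        density[x] = norm * total
    return density

if __name__ == "__main__":
    # Hypothetical data: two heavily duplicated values and one isolated value.
    data = [1.0] * 1000 + [1.1] * 800 + [5.0]
    dens = weighted_kde_density(data, bandwidth=0.5)
    for value in sorted(dens, key=dens.get):
        print(value, dens[value])  # lowest density (most outlying) first
```

Because duplicates only add weight to the estimate, the density stays finite and positive everywhere, and a low density marks a value as unusual. With a few hundred unique values, the double loop over unique values is cheap even for millions of observations.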
In other words, this is not a bug in ELKI. The implementation does what is published.