elki | ELKI Data Mining Toolkit | Predictive Analytics library
kandi X-RAY | elki Summary
ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers many data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions in particular of new methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms.
Top functions reviewed by kandi - BETA
- Runs the clustering algorithm
- Updates the estimate and weights for the outliers
- Get the distance
- Build a single-element ensemble
- Performs Hessenberg reduction
- Sets the real vector
- Back-transform the eigenvectors of a matrix
- Backsubstitute a complex vector
- Runs the P3C algorithm on the given database
- Run the detection algorithm
- Process a single relation
- Runs an SVM
- Split the set of objects
- Run the greedy algorithm
- Run the clustering algorithm
- Load multiple objects
- Choose the initial medoids
- Performs the SLINK algorithm on the given database
- Splits the given entries into two points
- Executes the SLINK algorithm
- Run the LB-ABOD algorithm on the given data set
- Preprocess the graph
- Run the benchmark
- Preprocess the data
- Loads the data from the primary source connection
- Performs the SUBCLU algorithm on the given database
elki Key Features
elki Examples and Code Snippets
# install Java (macOS / Homebrew; newer Homebrew versions use `brew install --cask`)
brew cask install java
# fetch the sources and build the bundled jar
git clone https://github.com/elki-project/elki.git
cd elki
./gradlew shadowJar
./gradlew build
# launch the bundle (macOS); equivalently: java -jar elki-bundle-0.7.2-SNAPSHOT.jar
open elki-bundle-0.7.2-SNAPSHOT.jar
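Once the bundle is built, ELKI can also be run headlessly from the command line instead of through the MiniGUI. This is only a sketch: the input file and parameter values are placeholders, and the option names follow ELKI 0.7.x conventions, so verify them in the MiniGUI parameter table.
java -jar elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication \
  -dbc.in data.csv \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k 3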
Community Discussions
Trending Discussions on elki
QUESTION
I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that ~85% of it lies in one huge cluster (anomaly detection). The only technique I have been able to come up with that lets me process more data is to replace a big chunk of that huge cluster with one data point, in such a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips on whether I'm doing this right, or if there is a better way to reduce the complexity of DBSCAN when you know that most data is in one cluster centered around (0.0,0.0)?
...ANSWER
Answered 2020-Mar-17 at 12:53
Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.
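For example, a spatial index can be requested on the command line when running DBSCAN. A sketch only: the file name, epsilon, and minPts are placeholders, and the class and option names follow ELKI 0.7.x conventions, so check them against your version.
java -jar elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication \
  -dbc.in points.csv \
  -db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
  -algorithm clustering.DBSCAN \
  -dbscan.epsilon 0.01 -dbscan.minpts 20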
DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster and remove it, or replace it with a smaller approximation?
Subsample. There usually is next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data in, e.g., 32 subsets, then cluster each of these subsets, and join the results back. These 32 parts can be trivially processed in parallel on separate cores or computers; but because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32=1024. This in particular holds for DBSCAN: larger data usually means you also want to use much larger minPts. But then the results will not differ much from a subsample with smaller minPts.
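A rough sketch of that split-and-cluster strategy using standard Unix tools and the ELKI bundle (file names, epsilon, and minPts are placeholders, not values from the original answer):
# shuffle the rows and split them into 32 parts of roughly equal size
shuf points.csv | split -n l/32 - part_
# cluster each part independently; the parts could also be spread across machines
for f in part_*; do
  java -jar elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication \
    -dbc.in "$f" \
    -algorithm clustering.DBSCAN \
    -dbscan.epsilon 0.01 -dbscan.minpts 20 \
    -resulthandler ResultWriter -out "out_$f"
done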
But by any means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving this problem. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data, and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with the data set size, don't bother scaling to big data; the overhead is too high unless you can prove the value.
QUESTION
I am trying to run K-Means using ELKI MiniGUI. I have a CSV dataset of 15 features (columns) and a label column. I would like to do multiple runs of K-Means with different combinations of the feature columns.
Is there anywhere in the MiniGUI where I can specify the indices of which columns I would like to be used for clustering?
If not, what is the simplest way to achieve this by changing/extending ELKI in Java?
...ANSWER
Answered 2020-Mar-10 at 08:16
This is obviously easily achievable with Java code, or simply by preprocessing the data as necessary. Generate 10 variants, then launch ELKI via the command line.
But there is a filter to select columns: NumberVectorFeatureSelectionFilter. It can be used to restrict the analysis to, for example, columns 0, 1, 2 in the numeric part (labels are treated separately at this point; this is a vector transformation).
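A hedged sketch of how such a filter is typically passed as ELKI parameters (the attribute-selection option name is an assumption here and should be verified in the MiniGUI parameter table):
# assumed option IDs; confirm against your ELKI version
-dbc.filter transform.NumberVectorFeatureSelectionFilter
-projectionfilter.selectedattributes 0,1,2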
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported