elki | ELKI Data Mining Toolkit | Predictive Analytics library
kandi X-RAY | elki Summary
ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers many data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions in particular of new methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms.
Top functions reviewed by kandi - BETA
- Runs the clustering algorithm
- Updates the estimate and weights for the outliers
- Get the distance
- Build a single-element ensemble
- Performs Hessenberg reduction
- Sets the real vector
- Back-transform the eigenvectors of a matrix
- Backsubstitute a complex vector
- Runs the P3C algorithm on the given database
- Run the detection algorithm
- Process a single relation
- Runs an SVM
- Split the set of objects
- Run the greedy algorithm
- Run the clustering algorithm
- Load multiple objects
- Choose the initial medoids
- Performs the SLINK algorithm on the given database
- Splits the given entries into two points
- Executes the SLINK algorithm
- Run the LB-ABOD algorithm on the given data set
- Preprocess the graph
- Run the benchmark
- Preprocess the data
- Loads the data from the primary source connection
- Performs the SUBCLU algorithm on the given database
elki Key Features
elki Examples and Code Snippets
# install Java (macOS / Homebrew; newer Homebrew versions use `brew install --cask`)
brew cask install java
# fetch the sources and build the bundled jar
git clone https://github.com/elki-project/elki.git
cd elki
./gradlew shadowJar
./gradlew build
# launch the bundle (macOS); equivalently: java -jar elki-bundle-0.7.2-SNAPSHOT.jar
open elki-bundle-0.7.2-SNAPSHOT.jar
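Once the bundle is built, ELKI can also be run headlessly from the command line instead of through the MiniGUI. This is only a sketch: the input file and parameter values are placeholders, and the option names follow ELKI 0.7.x conventions, so verify them in the MiniGUI parameter table.
java -jar elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication \
  -dbc.in data.csv \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k 3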
Community Discussions
Trending Discussions on elki
QUESTION
I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that ~85% of it lies in one huge cluster (anomaly detection). The only technique I have been able to come up with that lets me process more data is to replace a big chunk of that huge cluster with one data point, in such a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips on whether I'm doing this right, or if there is a better way to reduce the complexity of DBSCAN when you know that most data is in one cluster centered around (0.0,0.0)?
...ANSWER
Answered 2020-Mar-17 at 12:53
Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.
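For example, a spatial index can be requested on the command line when running DBSCAN. A sketch only: the file name, epsilon, and minPts are placeholders, and the class and option names follow ELKI 0.7.x conventions, so check them against your version.
java -jar elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication \
  -dbc.in points.csv \
  -db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
  -algorithm clustering.DBSCAN \
  -dbscan.epsilon 0.01 -dbscan.minpts 20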
DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster and remove it, or replace it with a smaller approximation?
Subsample. There usually is next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data in, e.g., 32 subsets, then cluster each of these subsets, and join the results back. These 32 parts can be trivially processed in parallel on separate cores or computers; but because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32=1024. This in particular holds for DBSCAN: larger data usually means you also want to use much larger minPts. But then the results will not differ much from a subsample with smaller minPts.
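A rough sketch of that split-and-cluster strategy using standard Unix tools and the ELKI bundle (file names, epsilon, and minPts are placeholders, not values from the original answer):
# shuffle the rows and split them into 32 parts of roughly equal size
shuf points.csv | split -n l/32 - part_
# cluster each part independently; the parts could also be spread across machines
for f in part_*; do
  java -jar elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication \
    -dbc.in "$f" \
    -algorithm clustering.DBSCAN \
    -dbscan.epsilon 0.01 -dbscan.minpts 20 \
    -resulthandler ResultWriter -out "out_$f"
done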
But by any means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving this problem. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data, and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with the data set size, don't bother scaling to big data; the overhead is too high unless you can prove the value.
QUESTION
I am trying to run K-Means using ELKI MiniGUI. I have a CSV dataset of 15 features (columns) and a label column. I would like to do multiple runs of K-Means with different combinations of the feature columns.
Is there anywhere in the MiniGUI where I can specify the indices of which columns I would like to be used for clustering?
If not, what is the simplest way to achieve this by changing/extending ELKI in Java?
...ANSWER
Answered 2020-Mar-10 at 08:16
This is obviously easily achievable with Java code, or simply by preprocessing the data as necessary. Generate 10 variants, then launch ELKI via the command line.
But there is a filter to select columns: NumberVectorFeatureSelectionFilter. It can be used to restrict the analysis to, for example, columns 0, 1, 2 in the numeric part (labels are treated separately at this point; this is a vector transformation).
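A hedged sketch of how such a filter is typically passed as ELKI parameters (the attribute-selection option name is an assumption here and should be verified in the MiniGUI parameter table):
# assumed option IDs; confirm against your ELKI version
-dbc.filter transform.NumberVectorFeatureSelectionFilter
-projectionfilter.selectedattributes 0,1,2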
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported