elki | ELKI Data Mining Toolkit | Predictive Analytics library

by elki-project · Java · Version: 0.8.0 · License: AGPL-3.0

kandi X-RAY | elki Summary

elki is a Java library typically used in Analytics and Predictive Analytics applications. It has no reported bugs or vulnerabilities, a build file is available, it carries a strong copyleft license (AGPL-3.0), and it has high support. You can download it from GitHub or Maven.

ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. To achieve high performance and scalability, ELKI offers many data index structures, such as the R*-tree, that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions, in particular of new methods. ELKI aims to provide a large collection of highly parameterizable algorithms, to allow easy and fair evaluation and benchmarking of algorithms.
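As a minimal illustration of this parameterizable design, an algorithm can be launched from the command line of the release bundle (a sketch: KDDCLIApplication is ELKI's command-line entry point, and the option names below follow the 0.8.0 parameter index, so verify them against your release; the MiniGUI exposes the same parameters):

# run DBSCAN on a CSV file and write the clustering result to a directory
java -jar elki-bundle-0.8.0.jar KDDCLIApplication \
  -dbc.in input.csv \
  -algorithm clustering.dbscan.DBSCAN \
  -dbscan.epsilon 0.02 -dbscan.minpts 20 \
  -resulthandler ResultWriter -out output/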

Support

elki has a highly active ecosystem.
It has 730 stars, 309 forks, and 57 watchers.
It has had no major release in the last 12 months.
There are 3 open issues and 56 closed issues. On average, issues are closed in 64 days. There are no pull requests.
It has a negative sentiment in the developer community.
The latest version of elki is 0.8.0.

Quality

              elki has 0 bugs and 0 code smells.

Security

              elki has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              elki code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              elki is licensed under the AGPL-3.0 License. This license is Strong Copyleft.
Strong copyleft licenses enforce sharing: derivative works must be released under the same terms. They are a good fit when you are creating open source projects yourself.

Reuse

elki releases are available on the GitHub Releases page, and a deployable package is available on Maven.
A build file is available, so you can also build the component from source.
              Installation instructions, examples and code snippets are available.
elki saves you an estimated 192,267 person-hours of effort in developing the same functionality from scratch.
It has 209,208 lines of code, 16,586 functions and 2,989 files.
It has medium code complexity. Code complexity directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed elki and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality elki implements, and to help you decide if it suits your requirements.
            • Runs the clustering algorithm
            • Updates the estimate and weights for the outliers
            • Get the distance
• Build a single-element ensemble
            • Performs Hessenberg reduction
            • Sets the real vector
• Apply the back transformation to the eigenvectors of a matrix
            • Backsubstitute a complex vector
            • Runs the P3C algorithm on the given database
            • Run the detection algorithm
            • Process a single relation
• Runs an SVM
            • Split the set of objects
• Run the greedy algorithm
            • Run the clustering algorithm
            • Load multiple objects
            • Choose the initial medoids
            • Performs the SLINK algorithm on the given database
            • Splits the given entries into two points
            • Executes the SLINK algorithm
• Run the LB-ABOD algorithm on the given data set
            • Preprocess the graph
            • Run the benchmark
            • Preprocess the data
            • Loads the data from the primary source connection
            • Performs a SUBCLU algorithm on the given database

            elki Key Features

            No Key Features are available at this moment for elki.

            elki Examples and Code Snippets

Could not run elki.sh on macOS
License: Strong Copyleft (CC BY-SA 4.0)

# install a Java runtime (note: newer Homebrew versions use `brew install --cask` instead of `brew cask install`)
brew cask install java
# fetch the ELKI sources and build the self-contained bundle jar
git clone https://github.com/elki-project/elki.git
cd elki
./gradlew shadowJar
./gradlew build
# launch the bundle (equivalently: java -jar elki-bundle-0.7.2-SNAPSHOT.jar)
open elki-bundle-0.7.2-SNAPSHOT.jar

            Community Discussions

            QUESTION

            DBSCAN: How to Cluster Large Dataset with One Huge Cluster
            Asked 2020-Mar-17 at 12:53

            I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.

            Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?

Next, about 85% of my data lies in one huge cluster (this is for anomaly detection). The only technique I have been able to come up with to allow me to process more data is to replace a big chunk of that huge cluster with a single data point placed so that it can still reach all of the chunk's neighbours (the deleted chunk is smaller than epsilon).

Can anyone provide tips on whether I'm doing this right, or on a better way to reduce the complexity of DBSCAN when you know that most of the data is in one cluster centered around (0.0, 0.0)?

            ...

            ANSWER

            Answered 2020-Mar-17 at 12:53
1. Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help (see the first sketch after this list).

2. DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection method (e.g., LOF; see the first sketch below). There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.

            3. If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster, and remove it / replace it with a smaller approximation.

4. Subsample. There is usually next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data into, e.g., 32 subsets, clustering each of these subsets, and joining the results back together (see the second sketch below). These 32 parts can be trivially processed in parallel on separate cores or computers; and because the underlying problem is quadratic in nature, the speedup will be anywhere between 32x and 32*32 = 1024x. This holds in particular for DBSCAN: larger data usually means you also want to use a much larger minPts, but then the results will not differ much from a subsample with a smaller minPts.
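For illustration, points 1 and 2 can be combined in a single command-line run: request an R*-tree index and use the density-based LOF detector (a minimal sketch; the index factory class path and the option names follow the 0.8.0 parameter index and should be verified against your release):

# build an R*-tree index over the data, then run Local Outlier Factor (k is illustrative)
java -jar elki-bundle-0.8.0.jar KDDCLIApplication \
  -dbc.in data.csv \
  -db.index elki.index.tree.spatial.rstarvariants.rstar.RStarTreeFactory \
  -algorithm outlier.lof.LOF \
  -lof.k 20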

But by all means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving it. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.

Even if you have "big" data, and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with data set size, don't bother scaling to big data; the overhead is too high unless you can prove the value.
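The subsampling strategy from point 4 can be scripted with standard tools (a rough sketch; file names and parameter values are illustrative, and each part could equally be sent to a different machine):

# shuffle once, split into 32 equal-line parts, and cluster each part independently
shuf data.csv -o shuffled.csv
split -n l/32 shuffled.csv part_
for f in part_*; do
  java -jar elki-bundle-0.8.0.jar KDDCLIApplication \
    -dbc.in "$f" \
    -algorithm clustering.dbscan.DBSCAN \
    -dbscan.epsilon 0.01 -dbscan.minpts 20 \
    -resulthandler ResultWriter -out "out_$f"
done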

            Source https://stackoverflow.com/questions/60713059

            QUESTION

            ELKI: How to Specify Feature Columns of CSV for K-Means
            Asked 2020-Mar-10 at 08:16

            I am trying to run K-Means using ELKI MiniGUI. I have a CSV dataset of 15 features (columns) and a label column. I would like to do multiple runs of K-Means with different combinations of the feature columns.

Is there anywhere in the MiniGUI where I can specify the indices of the columns I would like to be used for clustering?

If not, what is the simplest way to achieve this by changing/extending ELKI in Java?

            ...

            ANSWER

            Answered 2020-Mar-10 at 08:16

This is obviously easy to achieve with Java code, or simply by preprocessing the data as necessary. Generate 10 variants, then launch ELKI via the command line.

But there is a filter to select columns: NumberVectorFeatureSelectionFilter. It operates only on the numeric part of the data (labels are treated separately at this point; this is a vector transformation). To use only columns 0, 1, and 2, configure it as a database-connection filter, as in the sketch below:
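A hedged reconstruction of the parameterization (the option names follow the ELKI parameter index, and the k-means class name varies across releases, e.g. KMeansLloyd in the 0.7.x series, so verify both against your version; in the MiniGUI the same values go into the corresponding parameter fields):

# keep only numeric columns 0, 1 and 2, then run k-means with k=10
java -jar elki-bundle-0.8.0.jar KDDCLIApplication \
  -dbc.in data.csv \
  -dbc.filter transform.NumberVectorFeatureSelectionFilter \
  -projectionfilter.selectedattributes 0,1,2 \
  -algorithm clustering.kmeans.KMeansLloyd \
  -kmeans.k 10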

            Source https://stackoverflow.com/questions/60603032

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install elki

You can download ELKI, including source code, on the Releases page. ELKI uses the AGPLv3 License, a well-known open source license. There is a list of Publications that accompany the ELKI releases. When using ELKI in your scientific work, you should cite the publication corresponding to the ELKI release you are using, to give credit. This also helps to improve the repeatability of your experiments. We would also appreciate it if you contributed your algorithm to ELKI, to allow others to reproduce your results and compare with your algorithm (which in turn will likely earn you citations). We try to document every publication used in implementing ELKI: the RelatedPublications page is generated from the source code annotations.

            Support

Beginners may want to start at the HowTo documents, Examples and Tutorials, which help with difficult configuration scenarios and with getting started in ELKI development. This website serves as the community development hub and task tracker for bug reports, tutorials, FAQ entries, general issues and development tasks. The most important documentation pages are: Tutorial, JavaDoc, FAQ, InputFormat, DataTypes, DistanceFunctions, DataSets, Development, Parameterization, Visualization, Benchmarking, and the list of Algorithms and RelatedPublications.
Clone
          • HTTPS

            https://github.com/elki-project/elki.git

          • CLI

            gh repo clone elki-project/elki

• SSH

            git@github.com:elki-project/elki.git
