mean-shift | Mean-shift clustering in Python in 100 lines | Machine Learning library
kandi X-RAY | mean-shift Summary
mean-shift clustering algorithm in python using numpy only. Running the code: python mean-shift.py. Input: a 2-D array of points, e.g. [[-0.85 -1.04 ] [ 1.18 -1.12 ] [ 1.237 1.242] [ 1.401 -1.81 ] [ 0.999 -0.518] [-1.013 -1.112] [-1.259 -0.561] [-0.878 0.884] …]
Top functions reviewed by kandi - BETA
- Performs clustering
- Cluster points
- Helper function to shift a point
- Return the distance between two vectors
- Return a list of n colors
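The helpers listed above can be sketched in numpy alone. The function names and parameters below are illustrative, not necessarily those used in the repository; a flat (uniform) kernel is assumed.

```python
import numpy as np

def euclidean_distance(a, b):
    # Return the distance between two vectors.
    return np.sqrt(np.sum((a - b) ** 2))

def shift_point(point, points, radius):
    # Shift a point to the mean of all original points within `radius` of it.
    neighbors = points[np.linalg.norm(points - point, axis=1) <= radius]
    return neighbors.mean(axis=0)

def mean_shift(points, radius=1.0, max_iter=50, tol=1e-4):
    # Iteratively shift every point until the positions stop moving,
    # then group points whose final positions (nearly) coincide.
    shifted = points.copy()
    for _ in range(max_iter):
        moved = np.array([shift_point(p, points, radius) for p in shifted])
        converged = np.max(np.linalg.norm(moved - shifted, axis=1)) < tol
        shifted = moved
        if converged:
            break
    labels = -np.ones(len(points), dtype=int)
    centers = []
    for i, p in enumerate(shifted):
        for j, c in enumerate(centers):
            if euclidean_distance(p, c) < radius / 2:
                labels[i] = j
                break
        else:
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

# two well-separated blobs -> two clusters
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
labels, centers = mean_shift(data, radius=1.5)
print(len(centers))
```

The O(n²) neighbor search per iteration is what keeps such an implementation short; a KD-tree would speed it up at the cost of more code.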
mean-shift Key Features
mean-shift Examples and Code Snippets
Community Discussions
Trending Discussions on mean-shift
QUESTION
We've been using k-means for clustering our logs. A typical dataset has 10 million samples with 100k+ features.
To find the optimal k, we run multiple k-means jobs in parallel and pick the one with the best silhouette score. In 90% of the cases we end up with k between 2 and 100. Currently we are using scikit-learn KMeans. For such a dataset, clustering takes around 24h on an ec2 instance with 32 cores and 244 GB of RAM.
I am currently researching a faster solution.
What I have already tested:
Kmeans + Mean Shift Combination - a little better (for k=1024 --> ~13h) but still slow.
Kmcuda library - doesn't have support for sparse matrix representation. It would require ~3TB RAM to represent that dataset as a dense matrix in memory.
Tensorflow (tf.contrib.factorization.python.ops.KmeansClustering()) - only started investigating today, but either I am doing something wrong, or I do not know how to cook it. On my first test with 20k samples and 500 features, clustering on a single GPU is slower than on a single CPU thread.
Facebook FAISS - no support for sparse representation.
There is PySpark MlLib Kmeans next on my list. But would it make sense on 1 node?
Would training be faster for my use case on multiple GPUs, e.g. TensorFlow with 8 Tesla V100s?
Is there any magical library that I haven't heard of?
Or just simply scale vertically?
...ANSWER
Answered 2019-Oct-11 at 22:06
Choose the algorithm wisely. There are clever algorithms, and there are stupid algorithms for k-means. Lloyd's is stupid, but it is the only one you will find on GPUs so far. It wastes a lot of resources with unnecessary computations. Because GPU and "big data" people do not care about resource efficiency... Good algorithms include Elkan's, Hamerly's, Yinyang, Exponion, Annulus, etc. - these are much faster than Lloyd's.
Sklearn is one of the better tools here, because it at least includes Elkan's algorithm. But if I am not mistaken, it may be making a dense copy of your data repeatedly. Maybe in chunks so you don't notice it. When I compared k-means from sklearn with my own spherical k-means in Python, my implementation was many times faster. I can only explain this with me using sparse optimizations while the sklearn version performed dense operations. But maybe this has been improved since.
Implementation quality is important. There was an interesting paper about benchmarking k-means. Let me Google it:
Kriegel, H. P., Schubert, E., & Zimek, A. (2017). The (black) art of runtime evaluation: Are we comparing algorithms or implementations?. Knowledge and Information Systems, 52(2), 341-378.
They show how supposedly the same algorithm can have orders of magnitude runtime differences, depending on implementation differences. Spark does not fare very well there... It has too high overheads and too slow algorithms.
You don't need all the data.
K-means works with averages. The quality of the mean very slowly improves as you add more data. So there is little use in using all the data you have. Just use a large enough sample, and the results should be of almost the same quality. You can exploit this also for seeding. Run on a smaller set first, then add more data for refinement.
Because your data is sparse, there is a high chance that k-means is not the right tool anyway. Have you tested the quality of your results? How do you ensure attributes are appropriately scaled? How much is the result determined simply by where the vectors are 0, and not by the actual non-zero values? Do results actually improve from rerunning k-means so often? What if you do not rerun k-means ever again? What if you just run it on a sample, as discussed above? What if you just pick k random centers and do 0 iterations of k-means? What is your best silhouette? Chances are that you cannot measure the difference and are just wasting time and resources for nothing! So what do you do to ensure reliability of your results?
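The sample-then-refine idea from the answer can be sketched with a plain numpy Lloyd iteration. This is illustrative only (Lloyd's is exactly the "stupid" algorithm the answer warns about; in practice you would use Elkan's, which scikit-learn offers), and all names and sizes below are made up for the demo:

```python
import numpy as np

def lloyd_kmeans(X, centers, n_iter=10):
    # A few Lloyd iterations: assign each point to its nearest
    # center, then move each center to the mean of its points.
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.2, (500, 5)) for m in (-3, 0, 3)])

# 1) run many iterations on a small sample to get good seeds cheaply
sample = X[rng.choice(len(X), size=150, replace=False)]
seeds = sample[rng.choice(len(sample), size=3, replace=False)].copy()
seeds, _ = lloyd_kmeans(sample, seeds, n_iter=10)

# 2) refine with only a couple of iterations on the full data
centers, labels = lloyd_kmeans(X, seeds, n_iter=2)
print(centers.shape)
```

The expensive full-data pass runs only a few iterations because the seeds from the sample are already close to the final means.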
QUESTION
I am trying to segment a colour image using the Mean-Shift algorithm using scikit-learn. There is something I would like to know about the MeanShift fit_predict() function. In the documentation for the MeanShift algorithm, it states that fit_predict() performs clustering on X and returns cluster labels.
What exactly are the cluster labels? Are they the labels for all the clusters the algorithm found, or is there a label for each data sample returned? Any insights are appreciated.
...ANSWER
Answered 2019-Jul-12 at 08:44
A label is returned for each training sample. fit_predict() is simply a combination of the fit() and predict() functions.
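A quick way to see this on toy data (the bandwidth value here is arbitrary, chosen only to separate the two point groups):

```python
import numpy as np
from sklearn.cluster import MeanShift

# ten samples: five at the origin, five at (10, 10)
X = np.vstack([np.full((5, 2), 0.0), np.full((5, 2), 10.0)])

# fit_predict returns one cluster label per sample, so the
# result has the same length as X
labels = MeanShift(bandwidth=2).fit_predict(X)
print(labels.shape)  # (10,)
```

Each entry of `labels` says which of the discovered clusters that sample belongs to; the number of distinct values is the number of clusters found.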
QUESTION
I am trying to segment a colour image using Mean-Shift clustering with sklearn. I have read the image into a numpy array; however, I want to extract each colour channel (R, G, B) so that I can use each as a variable for classification.
I have found the following code online, which extracts the RGB colour channels of an image which is represented as a numpy array.
...ANSWER
Answered 2019-Jul-12 at 01:16
A normal picture has 3 channels: Red, Green, and Blue.
When you read a picture with a tool (for example OpenCV), it returns a numpy array with shape (height x width x channels).
The order of the channels depends on the tool you used: if you read with OpenCV the order is Blue first, then Green, then Red (BGR), whereas matplotlib.pyplot.imread returns Red-Green-Blue (RGB).
That code was written like that because it reads the picture with OpenCV.
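For a (height x width x 3) array, the channels are just slices along the last axis, and reversing that axis converts between BGR and RGB. A sketch on a synthetic image (no image file or OpenCV needed):

```python
import numpy as np

# synthetic 2x2 image in RGB order: pure red everywhere
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[..., 0] = 255              # red channel

r = img[:, :, 0]               # extract each colour channel
g = img[:, :, 1]
b = img[:, :, 2]

bgr = img[:, :, ::-1]          # RGB -> BGR (or back): reverse the last axis
print(r.max(), g.max(), b.max())   # 255 0 0
print(bgr[0, 0])                   # [  0   0 255]
```

Each channel slice is a 2-D array you can flatten and feed to a clustering algorithm as a feature.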
QUESTION
I am trying to segment a colour image using the Mean-Shift algorithm using sklearn. I have the following code:
...ANSWER
Answered 2019-Jul-11 at 02:06
It's because the image you're loading does not have plain RGB values (if you look at the dimensions, the last one is 4, i.e. the image has an alpha channel).
You need to first convert it to RGB like this:
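A minimal sketch of the conversion, assuming the loaded array has shape (h, w, 4) with the alpha channel last (RGBA) — the array here is synthetic, standing in for the loaded image:

```python
import numpy as np

# stand-in for an image loaded as RGBA
rgba = np.random.default_rng(0).integers(0, 256, size=(4, 4, 4)).astype(np.uint8)
print(rgba.shape)        # (4, 4, 4) -> last dimension is 4 (R, G, B, alpha)

# keep only the first three channels to get RGB
rgb = rgba[:, :, :3]
print(rgb.shape)         # (4, 4, 3)
```

Note that matplotlib.pyplot.imread returns PNG data as floats in [0, 1]; the same slicing applies, but scale by 255 first if your pipeline expects 8-bit values.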
QUESTION
I'm using Mean-shift clustering (https://scikit-learn.org/stable/modules/clustering.html#mean-shift), in which the labels of clusters are obtained from this source: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
However, it's not clear how the cluster labels (0, 1, ...) are generated. Apparently, label 0 is the cluster with the most elements. Is this a general rule?
How do other algorithms work? Is the order "random", or do the algorithms assign label 0 to the largest cluster they detect?
Thanks!
PS: it's easy to order the labels according to this rule; my question is more theoretical.
...ANSWER
Answered 2019-Jun-09 at 07:30
In many cases, the cluster order depends on the initialization. If you provide the initial values, then this order will be preserved.
If you do not provide such initial values, the order will usually be based on the data order. The first item is likely to belong to the first cluster, for example (barring noise in some algorithms, such as DBSCAN).
Now quantity (cluster size) has an interesting effect: assuming that your data is randomly ordered (and not, for example, ordered by some synthetic data generation process) then the first element is more likely to belong to the "largest" cluster, so this cluster is most likely to come first even with "random" order.
Now in sklearn's mean-shift (which in my opinion contains an error in the final assignment rule) the authors decided to sort by "intensity" apparently, but I don't remember any such rule in the original papers. https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/cluster/mean_shift_.py#L222
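As the question's PS notes, re-ordering the labels after the fact is easy. A numpy sketch that maps the largest cluster to label 0, the next to 1, and so on (the helper name is made up):

```python
import numpy as np

def relabel_by_size(labels):
    # Map the most frequent original label to 0, the next to 1, ...
    uniq, counts = np.unique(labels, return_counts=True)
    order = uniq[np.argsort(-counts)]          # labels sorted by descending size
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[l] for l in labels])

labels = np.array([2, 2, 2, 0, 0, 1])
print(relabel_by_size(labels))   # [0 0 0 1 1 2]
```

This works with the output of any clustering algorithm, since it only looks at label frequencies.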
QUESTION
This is for a class and I would really appreciate your help! I made some changes based on a comment I received, but now I get another error. I need to modify an existing function that implements the mean-shift algorithm, but instead of initializing all the points as the first set of centroids, the function creates a grid of centroids, with the grid spacing based on the radius. I also need to delete the centroids that don't contain any data points. My issue is that I don't understand how to fix the error I get!
...ANSWER
Answered 2019-Mar-23 at 20:52
When you run that loop, for i in centroids, the i iterated over centroids isn't a number, it is a vector, which is why an error pops up. For example, the first i value might be equal to [0 1 2 0 1 2 0 1 2], so taking an index of that doesn't make sense: your code is effectively saying centroid = centroid[n1 n2 ... nk]. To fix it, you really need to change how your initialize-centroids function works. Also, meshgrid won't create an N-dimensional grid, so your meshgrid approach might work for 2 dimensions but not for N. I hope that helps.
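One way to build a grid of candidate centroids that works in any number of dimensions is itertools.product over per-axis ranges, instead of meshgrid. This is a sketch, not the poster's actual code; the function name and the pruning rule (drop centroids with no data point within the radius) are assumptions based on the question:

```python
import itertools
import numpy as np

def grid_centroids(data, radius):
    # Build a grid of candidate centroids spaced `radius` apart over the
    # bounding box of the data, for any number of dimensions.
    lo, hi = data.min(axis=0), data.max(axis=0)
    axes = [np.arange(l, h + radius, radius) for l, h in zip(lo, hi)]
    candidates = np.array(list(itertools.product(*axes)))
    # Delete centroids that have no data point within `radius`.
    keep = [c for c in candidates
            if np.any(np.linalg.norm(data - c, axis=1) <= radius)]
    return np.array(keep)

data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
cents = grid_centroids(data, radius=1.0)
print(cents.shape)   # far fewer centroids than the full 6x6 grid
```

Because itertools.product iterates over one axis list per dimension, the same code handles 2-D and N-D data, which sidesteps the meshgrid limitation mentioned in the answer.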
QUESTION
I am looking to find peak regions in 2D data (if you will, grayscale images or 2D landscapes, created through a Hough transform). By peak region I mean a locally maximal peak, yet NOT a single point but a part of the surrounding contributing region that goes with it. I know, this is a vague definition, but maybe the word mountain or the images below will give you an intuition of what I mean.
The peaks marked in red (1-4) are what I want, the ones in pink (5-6) examples for the "grey zone", where it would be okay if those smaller peaks are not found but also okay if they are.
Images contain between 1-20 peaked regions, different in height. The 2D data for above surf plot is shown below with a possible result (orange corresponds to Peak 1, green corresponds to Peak 2 a/b, ...). Single images for tests can be found in the description links:
Image left: input image - - - - middle: (okaish) result - - - - right: result overlayed over image.
The result above was produced using simple thresholding (MATLAB code):
...ANSWER
Answered 2017-May-09 at 19:26
In such peak-finding problems, I mostly use morphological operations. Since Hough transform results are mostly noisy, I prefer blurring first, then applying tophat and extended maxima transforms. Then, for each local maximum, find the region around it with adaptive thresholding. Here is a sample code:
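The answer's pipeline was given in MATLAB; a rough Python equivalent of the same idea (blur, find local maxima, keep the connected region around each maximum) can be sketched with scipy.ndimage. The function name, filter sizes, and the relative threshold below are illustrative choices, not the original code:

```python
import numpy as np
from scipy import ndimage

def peak_regions(img, blur_sigma=1.0, size=3, rel_thresh=0.5):
    # Blur to suppress noise, detect local maxima, then keep the
    # connected thresholded region around each maximum.
    smooth = ndimage.gaussian_filter(img.astype(float), blur_sigma)
    local_max = (smooth == ndimage.maximum_filter(smooth, size=size)) & (smooth > 0)
    regions, _ = ndimage.label(smooth > rel_thresh * smooth.max())
    # Keep only regions that actually contain a local maximum.
    good = np.unique(regions[local_max & (regions > 0)])
    return np.isin(regions, good) & (regions > 0)

img = np.zeros((20, 20))
img[5, 5] = img[14, 14] = 10.0          # two isolated peaks
mask = peak_regions(img)
print(ndimage.label(mask)[1])           # number of peak regions found
```

Each True region in the returned mask is one "mountain": a local maximum together with its surrounding contributing area, which matches the question's notion of a peak region rather than a single point.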
QUESTION
I am using 2 sample scripts to check whether Python 3.6 can make use of the OpenCL functionality of OpenCV on Windows. I have tried to run a couple of the CAMSHIFT examples provided in samples and checked to see whether I have OpenCL.
I would love to know why Python shows that I do not have OpenCL while C++ with VS shows that I have OpenCL-enabled devices.
System Info:
OpenCV 3.2.0 built from source with OpenCL on and with the Python and numpy bindings added
Python 3.6 for windows 64bit
Visual Studio 2015 community edition
Numpy using
...ANSWER
Answered 2017-Mar-09 at 22:35
Try this:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install mean-shift
You can use mean-shift like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.