K-means- | Analyze airline customer value with pandas, numpy, the K-means algorithm, and matplotlib | Machine Learning library
kandi X-RAY | K-means- Summary
Analyze airline customer value using pandas, numpy, the K-means algorithm, and matplotlib
Top functions reviewed by kandi - BETA
- calculate decimal value
Community Discussions
Trending Discussions on K-means-
QUESTION
Following this example of K-means clustering, I want to recreate the same result, only I'm very keen for the final image to contain just the quantized colours (plus a white background). As it is, the colour bars get smooshed together into a pixel line of blended colours.
Whilst they look very similar, the image (top half) is what I've got from cv2; it contains 38 colours in total. The lower image has only 10 colours and is what I'm after.
Let's look at a bit of that with 6 times magnification:
I've tried:
...ANSWER
Answered 2021-May-18 at 16:27
I recommend you show the image using cv2.imshow instead of matplotlib. cv2.imshow shows the image "pixel to pixel" by default, while matplotlib.pyplot matches the image dimensions to the size of the axes.
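To see why pixel-for-pixel display preserves the quantized palette, here is a small numpy sketch (the 2x3 image is made up): integer upscaling by pixel repetition, which is what nearest-neighbour display amounts to, introduces no new blended colours, unlike interpolated resizing.

```python
import numpy as np

# Hypothetical quantized image: 2x3 pixels, 3 channels, only two colours.
img = np.array([[[255, 0, 0], [0, 255, 0], [255, 0, 0]],
                [[0, 255, 0], [255, 0, 0], [0, 255, 0]]], dtype=np.uint8)

def magnify(img, k):
    """Integer upscaling by pixel repetition (nearest-neighbour):
    every output pixel is a copy of an input pixel, so no
    blended colours are introduced."""
    return np.repeat(np.repeat(img, k, axis=0), k, axis=1)

big = magnify(img, 6)
print(big.shape)                                   # (12, 18, 3)
# The set of distinct colours is unchanged by the upscale.
print(len(np.unique(big.reshape(-1, 3), axis=0)))  # 2
```

Any interpolating resize (bilinear, bicubic) would instead average neighbouring pixels at colour boundaries, which is exactly where the extra 28 colours come from.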
QUESTION
I've been trying to run RAPIDS on Google Colab Pro, and have successfully installed the cuml and cudf packages; however, I am unable to run even the example scripts.
TL;DR: Any time I try to run the fit function for cuml on Google Colab I get the following error. I get this when using the demo examples, both for installation and then for cuml. This happens for a range of cuml examples (I first hit this trying to run UMAP).
...ANSWER
Answered 2021-May-06 at 17:13
Colab retains cupy==7.4.0 despite conda installing cupy==8.6.0 during the RAPIDS install. It is a custom install. I just had success pip-installing cupy-cuda110==8.6.0 BEFORE installing RAPIDS, with !pip install cupy-cuda110==8.6.0:
I'll be updating the script soon so that you won't have to do it manually, but I want to test a few more things out first. Thanks again for letting us know!
EDIT: script updated.
QUESTION
I want to cluster PDF documents based on their structure, not only their text content.
The main problem with the text-only approach is that it loses information such as: does the document have a PDF form structure, is it just a plain document, or does it contain pictures?
For our further processing this information is most important. My main goal now is to be able to classify a document mainly by its structure, not only its text content.
The documents to classify are stored in a SQL database as byte[] (varbinary), so my idea is to use this raw data for classification, without prior text conversion.
If I look at the hex output of these data, I can see repeating structures which seem to correspond to the different document classes I want to separate. You can see some similar byte patterns as a first impression in my attached screenshot.
So my idea is to train a K-means model with e.g. the hex output string. In the next step I would try to find the best number of clusters with the elbow method, which should be around 350-500.
The size of the PDF data varies between 20 kB and 5 MB, mostly around 150 kB. To train the model I have 30k+ documents.
When I research this, the results are sparse. I only found this article, which makes me unsure about the best way to solve my task: https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided
My questions are:
- Is K-Means the best algorithm for my goal?
- What method would you recommend?
- How to normalize or transform the data for the best results?
ANSWER
Answered 2021-Feb-27 at 20:10
As Ian said in the comments, using the raw data seems like a bad idea.
With further research I found it best to first read the structure of the PDF file, e.g. with an approach like this:
https://github.com/Uzi-Granot/PdfFileAnaylyzer
I normalized and clustered the data based on this information, which gives me good results.
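As a rough sketch of that pipeline, suppose a PDF analyzer such as the one linked above yields a few structural counts per document (the feature names and values below are purely illustrative): standardize them, then run K-means. A minimal numpy-only version of Lloyd's algorithm with explicit seeds:

```python
import numpy as np

# Hypothetical structural features per document (names are illustrative):
# [page count, number of form fields, number of embedded images]
feats = np.array([
    [1,  0, 0],   # plain letters
    [2,  0, 1],
    [4, 12, 0],   # form-heavy documents
    [5, 10, 0],
    [3,  0, 8],   # image-heavy documents
    [2,  0, 9],
], dtype=float)

# Standardise each column so no single feature dominates the distance.
X = (feats - feats.mean(axis=0)) / feats.std(axis=0)

def kmeans(X, init, iters=20):
    """Plain Lloyd's algorithm with explicit initial centres."""
    centers = X[init].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels

labels = kmeans(X, init=[0, 2, 4])  # one seed from each apparent group
print(labels)
```

This clusters on interpretable structure (forms vs. images vs. plain text) rather than raw bytes, which sidesteps the binary-data problem the IBM article warns about.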
QUESTION
I am performing a binary classification of a partially labeled dataset. I have a reliable estimate of its 1's, but not of its 0's.
From sklearn KMeans documentation:
...ANSWER
Answered 2020-Nov-20 at 20:14
I'm reasonably confident this works as intended, but please correct me if you spot an error (cobbled together from GeeksforGeeks):
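The referenced code isn't shown here, but one common way to exploit a reliable set of 1's is to seed the cluster centres yourself: start one centre at the centroid of the known positives and iterate plain Lloyd steps. A minimal numpy sketch on synthetic data (all values made up; this is one approach, not necessarily the linked one):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: class 1 around (3, 3), class 0 around (0, 0).
pos = rng.normal(3, 0.5, size=(50, 2))
neg = rng.normal(0, 0.5, size=(50, 2))
X = np.vstack([pos, neg])

# Suppose only a few 1's are reliably labelled.
known_pos = pos[:10]

# Seed the two centres: one at the known-positive centroid,
# the other at the overall centroid as a crude start for "the rest".
c1 = known_pos.mean(axis=0)
c0 = X.mean(axis=0)
for _ in range(10):  # plain Lloyd iterations
    d1 = np.linalg.norm(X - c1, axis=1)
    d0 = np.linalg.norm(X - c0, axis=1)
    labels = (d1 < d0).astype(int)
    c1 = X[labels == 1].mean(axis=0)
    c0 = X[labels == 0].mean(axis=0)

print(labels[:50].mean(), labels[50:].mean())  # ~1.0 and ~0.0
```

With scikit-learn's KMeans the same idea is expressed by passing the two seed centroids via the `init` array parameter.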
QUESTION
I'm trying to reduce the input data size by first performing K-means clustering in R, then sampling 50-100 samples per representative cluster for downstream classification and feature selection.
The original dataset was split 80/20, and the 80% went into K-means training. The input data has 2 columns of labels and 110 columns of numeric variables. From the label column, I know there are 7 different drug treatments. In parallel, I tested the elbow method to find the optimal K for the number of clusters; it is around 8. So I picked 10, to have more data clusters to sample for downstream work.
Now that I have finished running model <- kmeans(), the output list got me a little confused about what to do. Since I had to scale only the numeric variables to put into the kmeans function, the output cluster memberships no longer carry the treatment labels. This I can overcome by appending the cluster membership to the original training data table.
Then for the 10 centroids, how do I find out what the labels are? I can't just do
...ANSWER
Answered 2020-Nov-01 at 22:31First we need a reproducible example of your data:
QUESTION
I am using a k-modes model (mymodel) which was created from a data frame mydf1. I am looking to assign the nearest cluster of mymodel to each row of a new data frame mydf2.
Similar to this question, just with k-modes instead of k-means. The predict function of the flexclust package only works with numeric data, not categorical.
A short example:
...ANSWER
Answered 2020-Sep-29 at 09:08We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.
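The question is about R, but the assignment rule itself is tiny: the k-modes dissimilarity between a row and a cluster mode is the number of positions where the categories differ (simple matching). A language-agnostic sketch, shown here in Python with made-up modes and rows:

```python
# Hypothetical cluster modes (as produced by a fitted k-modes model)
# and new categorical rows to assign (standing in for mydf2).
modes = [("a", "x", "low"), ("b", "y", "high")]
new_rows = [("a", "y", "low"), ("b", "y", "low")]

def nearest_cluster(row, modes):
    """Assign a row to the mode with the fewest mismatching categories
    (simple matching dissimilarity, as used by k-modes)."""
    dists = [sum(r != m for r, m in zip(row, mode)) for mode in modes]
    return dists.index(min(dists))

assignments = [nearest_cluster(r, modes) for r in new_rows]
print(assignments)  # [0, 1]
```

The R version is the same loop over the modes stored in the fitted model object, counting category mismatches per row.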
QUESTION
I am new to machine learning and I am using
...ANSWER
Answered 2020-Aug-24 at 17:35
The Iris dataset contains 4 features describing three different types of flowers (i.e. 3 classes). Therefore, each point in the dataset is located in a 4-dimensional space; the same applies to the centroids, so to describe their position you need 4 coordinates.
In examples it's easier to use 2-dimensional (sometimes 3-dimensional) data, as it is easier to plot and display for teaching purposes, but the centroids will have as many coordinates as your data has dimensions (i.e. features), so with the Iris dataset you would expect 4 coordinates.
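This is quick to verify with scikit-learn (the 3-cluster choice mirrors the 3 classes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples x 4 features
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One centroid per cluster, each with one coordinate per feature.
print(km.cluster_centers_.shape)  # (3, 4)
```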
QUESTION
This question will unfortunately be a duplicate, but I could not fix the issue in my code even after looking at the other similar questions and their related answers. I need to split my dataset into a train and a test dataset. However, it seems I am making an error when I add a new column for the predicted cluster. The error that I get is:
...ANSWER
Answered 2020-May-20 at 23:54
IMHO, train_test_split gives you a tuple, and when you do copy(), that copy() is a tuple's operation, not pandas'. This triggers pandas' infamous copy warning.
So you only create a shallow copy of the tuple, not of its elements. In other words
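A minimal sketch of the fix (the DataFrame here is made up): unpack the result of train_test_split first, then copy the split DataFrame itself before adding the cluster column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10), "y": range(10)})

# train_test_split returns a sequence of splits; unpack it first.
train, test = train_test_split(df, test_size=0.3, random_state=0)

# Deep-copy the DataFrame itself before mutating it, so the new column
# goes to an independent frame rather than a slice of the original.
train = train.copy()
train["cluster"] = 0  # no SettingWithCopyWarning
print(train.shape)    # (7, 3)
```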
QUESTION
I'm trying to learn sklearn. As I understand from step 5 of the following example, the predicted clusters can be mislabelled, and it would be up to me to relabel them properly. This is also done in an example on scikit-learn: labels must be re-assigned so that the results of the clustering and the ground truth match by colour.
How would I know if the labels of the predicted clusters match the initial data labels, and how do I readjust the label indices to properly match the two sets?
...ANSWER
Answered 2020-Mar-30 at 07:00With clustering, there's no meaningful order or comparison between clusters, we're just finding groups of observations that have something in common. There's no reason to refer to one cluster as 'the blue cluster' vs 'the red cluster' (unless you have some extra knowledge about the domain). For that reason, sklearn will arbitrarily assign numbers to each cluster.
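Given that, a common fix when ground truth is available is to map each arbitrary cluster id to the majority true label inside it. A small numpy sketch with made-up labels (for a collision-proof one-to-one mapping, scipy's linear_sum_assignment is the more robust choice):

```python
import numpy as np

# Hypothetical: true labels and the arbitrary ids K-means assigned.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # same grouping, shuffled ids

# Map each predicted cluster id to the most common true label inside it.
mapping = {c: np.bincount(y_true[y_pred == c]).argmax()
           for c in np.unique(y_pred)}
relabelled = np.array([mapping[c] for c in y_pred])

print((relabelled == y_true).all())  # True
```

Majority voting can in principle map two clusters to the same label when clusters are badly mixed, which is when the assignment-based approach pays off.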
QUESTION
I am testing this code.
...ANSWER
Answered 2020-Jan-03 at 01:33
The problem may be with the format of your data. Most models will expect a data frame.
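The question's code is not shown here, but as an illustration of the point, wrapping raw arrays in a pandas DataFrame before fitting might look like this (column names are made up):

```python
import pandas as pd

# Hypothetical raw input: a plain list of lists.
raw = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Wrap it in a DataFrame with named columns before fitting a model.
df = pd.DataFrame(raw, columns=["feature_1", "feature_2"])
print(df.shape)  # (3, 2)
```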
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install K-means-
You can use K-means- like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.