How to perform K-medoids clustering using scikit-learn Python

by vsasikalabe · Updated: May 8, 2023


Scikit-learn is also known as sklearn; both names refer to the same package, which you can install with pip install scikit-learn. Scikit-learn is a data analysis library for Python. K-medoids clustering is a partition-based method. It addresses problems of k-means, such as empty clusters and sensitivity to noise, by selecting an actual central member of each cluster (the medoid) as its representative, although it is computationally more complex. The median is the middle value of a dataset: 50% of the data point values are smaller than or equal to it, and the remaining 50% are greater than or equal to it.


Summary statistics related to the median (each is sketched in numpy just below):

  • Mean
  • Mode
  • Median
  • Range
  • Midrange
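
To make these concrete, here is a small illustrative numpy sketch of our own (not part of the original solution) computing each measure for a toy dataset:

import numpy as np

data = np.array([2, 3, 3, 5, 7, 10])

mean = data.mean()                           # arithmetic average -> 5.0
median = np.median(data)                     # middle value       -> 4.0
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]             # most frequent      -> 3
data_range = data.max() - data.min()         # spread             -> 8
midrange = (data.max() + data.min()) / 2     # midpoint of range  -> 6.0

print(mean, median, mode, data_range, midrange)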


The KMedoids() estimator, provided by the companion package scikit-learn-extra (which follows the scikit-learn API), implements the Partition Around Medoids (PAM) algorithm for finding medoids. Earlier approaches used the cluster-medoid distances more naively, but the results tended toward bad solutions; PAM is simple and fast compared to other partitioning algorithms. K-medoids is the classical partitioning technique of clustering: it divides the dataset's objects into k clusters, so k must be specified before the algorithm runs. K-medoids is an alternative to k-means clustering that is less sensitive to noise, because it uses medoids (actual data points) rather than means as cluster representatives.
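
As a minimal sketch, assuming the scikit-learn-extra package is installed (pip install scikit-learn-extra):

import numpy as np
from sklearn_extra.cluster import KMedoids   # from scikit-learn-extra, not core scikit-learn

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# method='pam' selects the classic Partition Around Medoids algorithm
kmedoids = KMedoids(n_clusters=2, method='pam', random_state=0).fit(X)

print(kmedoids.labels_)           # cluster assignment for each point
print(kmedoids.cluster_centers_)  # the medoids: actual points taken from X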


The sum of squared errors (SSE) measures cluster quality. When several potential medoids are equidistant from the cluster members, np.argmin breaks the tie by returning the lowest index. Clustering can be performed with many different approaches; what matters is the distance between the non-medoid objects and the medoid. The distances most often used in clustering methods are the Manhattan and Minkowski distances. The pairwise distances can be precomputed and held in memory for the duration of the fit, and Principal Component Analysis can be used to reduce the dimensionality first.
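
For example, a medoid can be picked directly from a precomputed Manhattan distance matrix; this small sketch (separate from the solution's snippet) also shows np.argmin resolving a tie by returning the lowest index:

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5]])

# pairwise Manhattan distances, held in memory for the duration of the fit
D = pairwise_distances(X, metric='manhattan')

# the medoid minimizes the total distance to all points; here points
# 0, 1, and 2 tie with a total of 12, so np.argmin returns index 0
medoid_index = np.argmin(D.sum(axis=0))
print(X[medoid_index])   # -> [0 0]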


Using a distance formula, such as the Euclidean or Manhattan distance, each data point is assigned to its closest centroid. Each cluster is represented by a selected object within it: the centrally located point. The silhouette method is a standard approach for determining the optimal number of clusters.
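
Here is a minimal sketch of the silhouette method using scikit-learn's silhouette_score; the candidate k with the highest score suggests the best number of clusters:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

# score each candidate number of clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))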


The call np.random.seed(0) fixes the starting point of the random number generator, making runs reproducible. Batch size is an important hyperparameter in machine learning: it is the number of sample points processed before the model's internal state is updated. In k-means, each cluster center is the arithmetic mean of the points assigned to it, and each point is assigned to its closest center. k-means++ removes a drawback of k-means, namely its dependence on the initial placement of the centroids. subplots_adjust() is a function that changes the spacing between subplots; its parameters (left, right, top, bottom, wspace, and hspace) control the subplots' positions. K-means itself is a clustering method used to group similar data points: it takes a set of data points as input and divides them into a specified number of clusters, each represented by a centroid.
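
This small standalone sketch (separate from the main snippet below) shows seeding and k-means++ initialization together:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)             # fix the RNG's starting point so runs repeat
X = np.random.randn(200, 2)

# init='k-means++' spreads the starting centroids apart, removing much of
# k-means' sensitivity to random initialization
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # each center is the mean of its cluster's points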


This is an example of how to perform K-medoids clustering using scikit-learn Python: 

Fig: Preview of the output that you will get on running this code from your IDE.

Code

In this solution, we use the matplotlib, numpy, and scikit-learn libraries of Python.

# Note: np.sort(centers, axis=0) sorts each coordinate column independently,
# which can scramble the rows and even produce duplicate "centers" that do
# not match any original center, as these examples show:
#
# original | sorted
# [ 1, -1] | [-1, -1]
# [-1, -1] | [ 1, -1]
# [ 1,  1] | [ 1,  1]
#
# original | sorted
# [-1, -1] | [-1, -1]
# [-1,  1] | [-1, -1]
# [ 1, -1] | [ 1,  1]
# [ 1,  1] | [ 1,  1]

"""Comparison of the K-Means and MiniBatchKMeans clustering algorithms."""
print(__doc__)

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
np.random.seed(0)

batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

# #############################################################################
# Compute clustering with Means

k_means = KMeans(init='k-means++', n_clusters=4, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0

# #############################################################################
# Compute clustering with MiniBatchKMeans

mbk = MiniBatchKMeans(init='k-means++', n_clusters=4, batch_size=batch_size,
                      n_init=10, max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0

# #############################################################################
# Plot result

fig = plt.figure(figsize=(8, 3))
fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)
colors = ['#4EACC5', '#FF9C34', '#4E9A06', '#123456']

# We want to have the same colors for the same cluster from the
# MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
# closest one.
# Caution: np.sort(..., axis=0) sorts each coordinate column independently
# (see the tables at the top of the file), so it can mispair the centers;
# the alternatives shown after plt.show() are more robust.
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
mbk_means_cluster_centers = np.sort(mbk.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)
order = pairwise_distances_argmin(k_means_cluster_centers,
                                  mbk_means_cluster_centers)

# KMeans
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8,  'train time: %.2fs\ninertia: %f' % (
    t_batch, k_means.inertia_))

# MiniBatchKMeans
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == order[k]
    cluster_center = mbk_means_cluster_centers[order[k]]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('MiniBatchKMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' %
         (t_mini_batch, mbk.inertia_))

# Initialise the different array to all False
different = (mbk_means_labels == 4)
ax = fig.add_subplot(1, 3, 3)

for k in range(n_clusters):
    different += ((k_means_labels == k) != (mbk_means_labels == order[k]))

identic = np.logical_not(different)
ax.plot(X[identic, 0], X[identic, 1], 'w',
        markerfacecolor='#bbbbbb', marker='.')
ax.plot(X[different, 0], X[different, 1], 'w',
        markerfacecolor='m', marker='.')
ax.set_title('Difference')
ax.set_xticks(())
ax.set_yticks(())

plt.show()

# Two alternative, more robust ways to pair the cluster centers; either is
# meant to replace the np.sort(..., axis=0) lines above.

# (a) Order the cluster centers by their x and y coordinates, weighted by
# 1 and 0.1 respectively, so both lists end up in the same spatial order.
k_order = np.argsort(k_means.cluster_centers_[:, 0] + k_means.cluster_centers_[:, 1] * 0.1)
mbk_order = np.argsort(mbk.cluster_centers_[:, 0] + mbk.cluster_centers_[:, 1] * 0.1)
k_means_cluster_centers = k_means.cluster_centers_[k_order]
mbk_means_cluster_centers = mbk.cluster_centers_[mbk_order]

# (b) Match each KMeans center directly with its closest MiniBatchKMeans
# center using pairwise_distances_argmin.
mbk_order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
k_means_cluster_centers = k_means.cluster_centers_
mbk_means_cluster_centers = mbk.cluster_centers_[mbk_order]

Instructions

Follow the steps carefully to get the output easily.

  1. Download and install the PyCharm Community Edition on your computer.
  2. Open the terminal and install the required libraries with the following commands.
  3. Install matplotlib - pip install matplotlib
  4. Install numpy - pip install numpy
  5. Install scikit-learn - pip install scikit-learn
  6. Create a new Python file in your IDE.
  7. Copy the snippet using the 'copy' button and paste it into your Python file.
  8. Run the current file to generate the output. (The snippet already imports make_blobs from sklearn.datasets, so no edit is needed for recent scikit-learn versions.)


I hope you found this useful.


I found this code snippet by searching for 'scikit-learn: Comparison of the K-Means and MiniBatchKMeans clustering algorithms' on kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. PyCharm Community Edition 2022.3.1
  2. Python 3.11.1
  3. matplotlib 3.7.1
  4. numpy 1.24.2
  5. scikit-learn 1.2.2


Using this solution, we can perform K-medoids clustering with scikit-learn in Python in simple steps. It also provides an easy, hassle-free way to create a hands-on working version of the code.

Dependent Libraries

  • matplotlib by matplotlib (v3.7.1) - matplotlib: plotting with Python
  • numpy by numpy (v1.25.0rc1, BSD-3-Clause) - The fundamental package for scientific computing with Python.
  • scikit-learn by scikit-learn (v1.2.2, BSD-3-Clause) - scikit-learn: machine learning in Python

If you do not have the matplotlib, scikit-learn, and numpy libraries required to run this code, you can install them by clicking on the links above.

You can search for any dependent library on kandi, like matplotlib, scikit-learn, and numpy.

FAQ:

1. What is Partition Around Medoids? How does it differ from other partitioning techniques?

PAM (Partition Around Medoids) is the algorithm used to find the medoids. Compared to other partitioning methods, it is simple and fast.


2. How does the PAM algorithm handle a non-medoid data point?

PAM works with k representative objects, the medoids, chosen from the observations in the dataset. After finding a set of k medoids, clusters are constructed by assigning each observation to its nearest medoid. Each selected medoid is then tentatively exchanged with each non-medoid data point, and the objective function, the sum of the dissimilarities of all objects to their nearest medoid, is recalculated. A swap is kept only if it improves the quality of the clustering.
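
As an illustration, here is a toy sketch of the swap step; the helper total_cost is our own hypothetical name, and Euclidean distance stands in for the dissimilarity:

import numpy as np
from sklearn.metrics import pairwise_distances

def total_cost(D, medoids):
    # sum of each point's dissimilarity to its nearest medoid
    return D[:, medoids].min(axis=1).sum()

X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
D = pairwise_distances(X)

medoids = [0, 3]                     # current medoid indices
for candidate in range(len(X)):      # try swapping in each non-medoid point
    if candidate in medoids:
        continue
    for i in range(len(medoids)):
        trial = medoids.copy()
        trial[i] = candidate
        if total_cost(D, trial) < total_cost(D, medoids):
            medoids = trial          # keep the swap that lowers the cost

print(medoids)                       # [0, 3] is already optimal for this data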


3. Is k-means clustering or partitional clustering more effective for k-medoids in Python?

K-medoids is itself a partitional clustering method. It uses medoids (actual cluster objects) instead of the means of the cluster objects, which makes it less sensitive to noise and outliers than k-means.


4. Which clustering algorithms are well suited to use with the PAM algorithm?

K-means-style partitioning algorithms are the closest fit to PAM. Both follow a few simple steps, starting by selecting k representatives, where k equals the number of clusters.


5. Can you provide examples of successful applications of the algorithm to real-world problems?

1. Identifying fake news
2. Spam filtering
3. Marketing and sales
4. Classifying network traffic
5. Identifying criminal activities
6. Document analysis


6. Are there any open-source libraries available that implement k-medoids in Python?

Yes. The scikit-learn-extra package provides a KMedoids estimator (pip install scikit-learn-extra). For the snippet in this solution, you also need:

• Install matplotlib - pip install matplotlib
• Install numpy - pip install numpy
• Install scikit-learn - pip install scikit-learn


7. What makes the PAM algorithm a fast clustering method compared to other methods?

Compared to other methods, PAM is a simple and fast clustering method that rectifies the drawbacks of k-means. Because PAM handles noise better and minimizes the sum of dissimilarities rather than squared distances, its results are also easier to explain.

Support

1. For any support on kandi solution kits, please use the chat
2. For further learning resources, visit the Open Weaver Community learning page
