How to perform K-medoids clustering using scikit-learn Python
by vsasikalabe Updated: May 8, 2023
Solution Kit
Scikit-learn is also known as sklearn; both names refer to the same package, which you can install with pip install scikit-learn. Scikit-learn is a data-analysis and machine-learning library for Python. K-medoids clustering is a partition-based method. It addresses the same problems as k-means, which can produce empty clusters and is sensitive to noise, but instead of averaging points it selects the central member of each cluster as its representative, which is computationally more involved. The median is the middle value of a dataset: 50% of the data point values are smaller than or equal to it, and the remaining 50% are greater than or equal to it.
Related measures of central tendency and spread (a small NumPy example follows this list):
- Mean
- Mode
- Median
- Range
- Midrange
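As a quick illustration, these measures can be computed with NumPy and the standard-library statistics module; the array below is made-up sample data.

```python
import numpy as np
from statistics import mode

data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 1])    # made-up sample values

mean = np.mean(data)                            # arithmetic mean
median = np.median(data)                        # middle value: 50% at or below, 50% at or above
most_common = mode(data.tolist())               # mode: the most frequent value
value_range = np.max(data) - np.min(data)       # range: max minus min
midrange = (np.max(data) + np.min(data)) / 2    # midrange: midpoint between min and max

print(mean, median, most_common, value_range, midrange)
```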
The KMedoids() estimator, provided by the scikit-learn-extra companion package to scikit-learn, implements the Partitioning Around Medoids (PAM) algorithm, which we can use to find the medoids. Earlier approaches that simply minimize the cluster-to-medoid distance greedily tend to produce bad solutions, whereas PAM is simple and fast compared with other partitioning algorithms. K-medoids is a classical partitioning technique of clustering: it divides the data set objects into k clusters, so we must specify k before the algorithm is executed. K-medoids is an alternative to k-means clustering; it is less sensitive to noise than k-means because it uses medoids, actual members of the data set, as cluster centers instead of means.
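Below is a minimal sketch of fitting KMedoids with the PAM method. It assumes the scikit-learn-extra companion package is installed (pip install scikit-learn-extra) and uses synthetic data from make_blobs.

```python
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids  # provided by scikit-learn-extra

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# method="pam" selects the Partitioning Around Medoids algorithm;
# k (n_clusters) must be specified before fitting.
kmedoids = KMedoids(n_clusters=3, metric="manhattan", method="pam", random_state=0)
labels = kmedoids.fit_predict(X)

print("Medoids (actual data points):")
print(kmedoids.cluster_centers_)
print("Medoid indices:", kmedoids.medoid_indices_)
```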
The sum of squared errors (SSE) measures cluster quality. When several potential medoids have cluster members at equal distances, np.argmin always returns the lowest index. Clustering can be performed with many different approaches, but it always depends on the distances between the non-medoid objects and the medoids. The most commonly used distances in clustering methods are the Manhattan and Minkowski distances. The pairwise distances can be precomputed and kept in memory for the duration of the fit, and Principal Component Analysis (PCA) can be used to reduce the number of dimensions beforehand.
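For illustration, here is a small sketch that precomputes pairwise Manhattan distances and picks the medoid of one cluster as the point with the smallest summed distance; np.argmin resolves ties by returning the lowest index. The point values are made up.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Illustrative points belonging to a single cluster
points = np.array([[1.0, 2.0], [2.0, 2.0], [2.0, 3.0], [8.0, 8.0]])

# Precompute pairwise Manhattan distances and keep them in memory
D = pairwise_distances(points, metric="manhattan")

# The medoid minimizes the sum of distances to the other members;
# np.argmin breaks ties by returning the lowest index.
total_dissimilarity = D.sum(axis=1)
medoid_index = np.argmin(total_dissimilarity)

print("Sum of distances per point:", total_dissimilarity)
print("Medoid:", points[medoid_index])
```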
Using a distance formula, such as the Euclidean or Manhattan distance, we assign every data point to its closest center. In k-medoids, each cluster is represented by a selected object within the cluster; these selected, centrally located points are the medoids. The silhouette method is a common approach for determining the optimal number of clusters.
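Here is a sketch of the silhouette method for choosing the number of clusters, shown with scikit-learn's KMeans and silhouette_score on synthetic data; the same scoring works for labels produced by k-medoids.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Compute the average silhouette score for several candidate values of k;
# the k with the highest score is a reasonable choice.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```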
The call np.random.seed(0) sets the starting point of NumPy's random number generator so that results are reproducible. Batch size is an important hyperparameter in machine learning: it is the number of sample points processed before the model's internal state (here, the cluster centers) is updated. In k-means, the cluster center is the arithmetic mean of all the points in the cluster, and each point ends up close to its cluster center. The k-means++ initialization removes the drawback of k-means being dependent on the initial placement of the centroids. matplotlib's subplots_adjust() is a function that changes the spacing between plots; the subplot positions can be adjusted with parameters such as left, right, top, bottom, wspace, and hspace. K-means is a clustering method used to group similar data points: it takes a set of data points as input and divides them into a specified number of clusters, each represented by a centroid.
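The sketch below ties these pieces together: a fixed random seed, k-means++ initialization, a MiniBatchKMeans batch_size, and subplots_adjust() to control spacing. The data and parameter values are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

np.random.seed(0)  # fix the random number generator's starting point

X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)

# k-means++ avoids the dependence on a poor initial centroid choice;
# batch_size is the number of samples used per centroid update in MiniBatchKMeans.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, init="k-means++", n_init=10,
                      batch_size=256, random_state=0).fit(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
fig.subplots_adjust(left=0.06, right=0.98, bottom=0.1, top=0.9, wspace=0.3)

for ax, model, title in [(axes[0], km, "KMeans"), (axes[1], mbk, "MiniBatchKMeans")]:
    ax.scatter(X[:, 0], X[:, 1], c=model.labels_, s=10)
    ax.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
               c="red", marker="x", s=100)
    ax.set_title(title)

plt.show()
```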
This is an example of how to perform K-medoids clustering using scikit-learn Python:
Fig: Preview of the output that you will get on running this code from your IDE.
Code
In this solution, we used the matplotlib and numpy libraries of Python.
Instructions
Follow the steps carefully to get the output easily.
- Download and Install the PyCharm Community Edition on your computer.
- Open the terminal and install the required libraries with the following commands.
- Install matplotlib - pip install matplotlib
- Install numpy - pip install numpy
- Install sklearn - pip install scikit-learn
- Create a new Python file on your IDE.
- Copy the snippet using the 'copy' button and paste it into your Python file.
- Remove .samples_generator from line no 21 of the snippet (see the import fix sketched after these steps).
- Run the current file to generate the output.
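For the .samples_generator step, only the import path changes; assuming the snippet imports make_blobs as the original scikit-learn example does, the fix looks like this:

```python
# Before (older scikit-learn versions exposed the samples_generator module):
# from sklearn.datasets.samples_generator import make_blobs

# After removing .samples_generator (works with current scikit-learn releases):
from sklearn.datasets import make_blobs
```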
I hope you found this useful.
I found this code snippet by searching for 'scikit-learn: Comparison of the K-Means and MiniBatchKMeans clustering algorithms' on kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- PyCharm Community Edition 2022.3.1
- The solution is created in Python 3.11.1 Version
- matplotlib 3.7.1 Version
- numpy 1.24.2 Version
- scikit-learn 1.2.2 Version
Using this solution, we are able to perform K-medoids clustering using scikit-learn in Python with simple steps. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of code that helps us perform K-medoids clustering in Python.
Dependent Libraries
- matplotlib by matplotlib: plotting with Python. Python | 17559 | Version: v3.7.1 | License: No License
- numpy by numpy: The fundamental package for scientific computing with Python. Python | 23755 | Version: v1.25.0rc1 | License: Permissive (BSD-3-Clause)
- scikit-learn by scikit-learn: machine learning in Python. Python | 54584 | Version: 1.2.2 | License: Permissive (BSD-3-Clause)
If you do not have the matplotlib, scikit-learn, and numpy libraries that are required to run this code, you can install them by clicking on the links above.
You can search for any dependent library on kandi like matplotlib, scikit-learn, and Numpy.
FAQ:
1. What is Partition Around Medoids? How does it differ from other partitioning techniques?
PAM, Partitioning Around Medoids, is the algorithm used to find the medoids. Unlike partitioning techniques that represent each cluster by the mean of its points, PAM represents each cluster by an actual data point, the medoid. Compared to other partitioning methods, this algorithm is fast and simple.
2. How does the PAM algorithm handle a non-medoid data point?
The PAM algorithm is built around k representative objects, or medoids, chosen from the observations of the dataset. After finding an initial set of k medoids, clusters are constructed by assigning each observation to its nearest medoid. The algorithm then considers exchanging each selected medoid with each non-medoid data point and computes the objective function, the sum of the dissimilarities of all objects to their nearest medoid. A swap is kept only if it improves the quality of the clustering, that is, if exchanging the selected object (medoid) with the non-selected object lowers the objective.
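As an illustrative sketch, not the library's internal implementation, one pass of this swap step can be written as follows; the helper names and data are made up.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def total_cost(D, medoid_idx):
    """Sum of dissimilarities of every object to its nearest medoid."""
    return D[:, medoid_idx].min(axis=1).sum()

def pam_swap_pass(X, medoid_idx, metric="manhattan"):
    """Try exchanging each medoid with each non-medoid and keep a swap
    only when it lowers the objective (the sum of dissimilarities)."""
    D = pairwise_distances(X, metric=metric)
    medoid_idx = list(medoid_idx)
    best = total_cost(D, medoid_idx)
    for i in range(len(medoid_idx)):
        for h in range(len(X)):
            if h in medoid_idx:
                continue
            candidate = medoid_idx.copy()
            candidate[i] = h
            cost = total_cost(D, candidate)
            if cost < best:          # the swap improves clustering quality
                best = cost
                medoid_idx = candidate
    return medoid_idx, best

# Tiny made-up example with an arbitrary initial choice of medoids
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
medoids, cost = pam_swap_pass(X, medoid_idx=[0, 3])
print("Medoid indices:", medoids, "Total dissimilarity:", cost)
```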
3. Is k-means clustering or Partitional Clustering more effective for k-medoids Python?
K-medoids is itself a partitional clustering method, and it is often more effective than k-means on noisy data. It uses medoids, actual cluster members, as cluster centers instead of the means of the cluster objects, which makes the algorithm less sensitive to noise and outliers when compared to k-means.
4. Which clustering algorithms are well suited to use with the PAM algorithm?
K-means-style partitional clustering workflows are well suited to use with the PAM algorithm, since both follow only a few steps: first, select the k centers, where k equals the number of clusters, and then repeatedly assign points and update the centers.
5. Can you provide examples of successful applications of the algorithm to real-world problems?
- Identifying Fake News
- Spam filter
- Marketing and Sales
- Classifying network traffic
- Identifying criminal activities
- Document analysis
6. Are there any open-source libraries available that implement k medoids Python?
Yes. The KMedoids estimator itself is provided by the open-source scikit-learn-extra package (pip install scikit-learn-extra). The supporting libraries used in this solution can be installed with:
- Install matplotlib - pip install matplotlib
- Install numpy - pip install numpy
- Install sklearn - pip install scikit-learn
7. What makes the PAM algorithm a fast-clustering method compared to other methods?
Compared to other methods, the PAM algorithm is a simple and fast clustering method, and it rectifies several drawbacks of k-means. Because PAM minimizes the sum of dissimilarities rather than squared distances, it handles noise better, and it makes the result more explainable.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page