Clustering in machine learning: K-Means and K-Medoids examples

Nastasia Saby
Zenika
May 2, 2019


Machine learning is about making computers do tasks without being explicitly programmed. We want computers to learn and apply this knowledge to new situations.

There are different ways to do machine learning. With supervised machine learning, a computer learns from the past to predict the future. With unsupervised learning, the idea is to find patterns inside data.

Clustering

We can do unsupervised machine learning with clustering. Clustering is a way to partition data: we group similar items together.

For instance, suppose we have diamonds with their prices. We want to partition them to get an idea of the cheap, medium, and expensive diamonds.
We want three clusters: cheap, medium, and expensive.

In reality, we run tests to determine the best number of clusters. Suppose we have found that this number is 3.

Clustering is a way to put the most similar items together in the same group. Very often, clustering is used to better understand data (user segmentation, repeating patterns in financial transactions, etc.).

In our example, we want to group our diamonds to understand them better. Then, we could explore each group to see whether some prices look suspiciously cheap. The idea is to discover anomalies.

To do that, we first need to cluster them. Let's go!

To cluster this, we can use an algorithm as follows (a code sketch follows the list):

Suppose we have 9 diamonds.

1. Take 3 random diamonds as the centers.
2. Associate the other diamonds to the closest center: we have our first clusters.
3. Determine new centers from the first clusters: one center for each cluster.
4. Associate the diamonds to the new centers.
5. Determine new centers again from this second clustering.
6. Repeat until "convergence" (the clusters do not move anymore or we have reached a defined number of iterations).
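
Here is a minimal Python sketch of these steps on one-dimensional prices. The nine example prices are made up for illustration, and using the mean as the new center is an assumption (it is the K-Means variant we discuss below):

    import random

    # Nine made-up diamond prices, for illustration only.
    prices = [2, 40, 45, 100, 102, 110, 115, 300, 310]
    k = 3

    # 1. Take k random diamonds as the initial centers.
    centers = random.sample(prices, k)

    for _ in range(100):  # or stop after a defined number of iterations
        # 2. and 4. Associate each diamond to the closest center.
        clusters = [[] for _ in range(k)]
        for price in prices:
            closest = min(range(k), key=lambda i: abs(price - centers[i]))
            clusters[closest].append(price)

        # 3. and 5. Determine new centers: here, the mean of each cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]

        # 6. Stop at "convergence": the centers do not move anymore.
        if new_centers == centers:
            break
        centers = new_centers

    print(centers)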

What interests me is the way we determine new diamonds as the centers of their clusters. Each time we have associated diamonds to the centers, we determine new ones.

Centers of data


The question is how to define the "center" of some data.

We can define it in different ways.

The most famous ones are the mean and the median.

A mean takes the sum of all the diamond prices and divides this sum by the number of diamonds.

Suppose we have 5 diamonds with prices 2, 100, 102, 110, 115.

The mean is 85.8.

It is less than most of our prices. The weird value 2 has a huge weight in the computation. This value is an outlier. It is an extreme value that is not representative of the majority of our data.

We see how a mean can give us a misleading picture of our data.

Another way to define the center of some data is to use a median. We sort the values and take the middle one.

With our 5 diamonds, our values are already sorted. The median is the third one: 102.
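
We can check both centers with Python's standard library:

    from statistics import mean, median

    prices = [2, 100, 102, 110, 115]

    print(mean(prices))    # 85.8 -> dragged down by the outlier 2
    print(median(prices))  # 102  -> the middle value, robust to the outlier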

Centers of clusters

We find this same logic with clustering in machine learning.

K-Means is a kind of clustering algorithm, maybe the most famous. The center of a cluster for K-Means is the mean. Consequently, it is sensitive to outliers.

With our 5 diamonds (2, 100, 102, 110, 115), K-Means considers the center as 85.8.

K-Medoids is another kind of clustering algorithm. It uses another way to compute the centers.

For K-Medoids, we take each diamond and compute the sum of its distances to the other diamonds. Then we select the diamond with the minimum total distance.

Sum of distances from the diamond priced 2 to the others = 419
Sum of distances from the diamond priced 100 to the others = 125
Sum of distances from the diamond priced 102 to the others = 123
Sum of distances from the diamond priced 110 to the others = 131
Sum of distances from the diamond priced 115 to the others = 146

This time, we choose 102 as the center. We call it a medoid. It is a better option in our case.
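
The same computation, written as a short Python sketch:

    prices = [2, 100, 102, 110, 115]

    # For each diamond, sum the distances to all the other diamonds.
    total_distances = {p: sum(abs(p - q) for q in prices if q != p)
                       for p in prices}
    print(total_distances)  # {2: 419, 100: 125, 102: 123, 110: 131, 115: 146}

    # The medoid is the diamond with the minimum total distance.
    print(min(total_distances, key=total_distances.get))  # 102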

A medoid, like a median, is not sensitive to outliers. But a medoid is not a median: a medoid is always one of the actual data points, while a median may not be.

Conclusion

K-Medoids is more robust because it is less sensitive to outliers.

K-Means is more efficient: computing the distances between every pair of diamonds takes more time than computing a mean.

There are other ways to do clustering; I focused only on these two. I think the difference between them is interesting.
