3D Visualization of K-means Clustering

Published in Analytics Vidhya · 4 min read · Nov 7, 2018

In the previous post, I explained how to choose the optimal K value for K-Means Clustering. Since the main purpose of that post was not to introduce the implementation of K-means, I used the built-in functions of the sklearn library and assumed the reader already knew how the algorithm works.

Today, however, I want to focus on the implementation part and make a tutorial about K-means. We will write custom functions to calculate the distances between the data points and the cluster centers, and to update the centers after each iteration. At the end, we will visualize our clusters in a 3D graph and get a really cool picture. So, if you are ready, let’s get started!

In this tutorial, we will be using the Iris dataset, as we did in the previous one. We already know that the Iris dataset contains 3 different types of flowers and 4 features for each flower. However, we will use only 3 of the features, since we can’t visualize a 4-dimensional space. Because we are looking for 3 types of flowers, we start by choosing 3 random cluster centers.

import numpy as np
from sklearn import datasets
import random

iris = datasets.load_iris()
x = iris.data
target = iris.target

# Find the minimum and maximum of the first 3 features
min_of_features = np.zeros((1, 3))
max_of_features = np.zeros((1, 3))
for i in range(3):
    min_of_features[0, i] = min(x[:, i])
    max_of_features[0, i] = max(x[:, i])

# Draw 3 random cluster centers within the range of each feature
cluster_centers = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        cluster_centers[i, j] = round(random.uniform(min_of_features[0, j], max_of_features[0, j]), 3)

cluster_centers is a 3x3 matrix containing 3 centers with 3 dimensions each. I found the minimum and maximum values of each feature and assigned each center coordinate a random value between them, so that the centers start close to the data points.
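
As a side note, if you want the run to be reproducible, you could seed the random number generator before drawing the centers. This is optional and not part of the original code; the seed value below is arbitrary:

# Optional: fix the seed so the random initial centers (and therefore the
# final clustering) are the same on every run.
random.seed(42)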

Now that our cluster centers are assigned, we can start calculating the distance between each data point and each cluster center. We have 3 cluster centers, so we will have 3 distance values for each data point. For clustering, we choose the closest center and assign the data point to it. Let’s see the code for this part:

# a is a row vector with 4 features (only the first 3 are used)
# b is a 3x3 matrix containing the cluster centers
def distance_find(a, b):
    total_1 = np.square(a[0]-b[0, 0]) + np.square(a[1]-b[0, 1]) + np.square(a[2]-b[0, 2])
    total_2 = np.square(a[0]-b[1, 0]) + np.square(a[1]-b[1, 1]) + np.square(a[2]-b[1, 2])
    total_3 = np.square(a[0]-b[2, 0]) + np.square(a[1]-b[2, 1]) + np.square(a[2]-b[2, 2])

    vec = np.array([total_1, total_2, total_3])
    if min(vec) == total_1:
        return 0
    elif min(vec) == total_2:
        return 1
    elif min(vec) == total_3:
        return 2

The above code computes 3 (squared) distance values for each data point and compares them. If, for instance, total_1 is the minimum of the 3, we conclude that the first cluster center is the closest and return 0.
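
The same assignment can also be written more compactly with NumPy broadcasting. The sketch below is an optional alternative, not part of the tutorial code, and the function name is my own:

# Optional, vectorized alternative to distance_find (illustrative only).
def distance_find_vectorized(a, b):
    # a: data point (only its first 3 features are used), b: 3x3 matrix of centers
    diff = b - a[:3]                        # broadcast: each center minus the point
    squared_distances = np.sum(diff**2, axis=1)
    return int(np.argmin(squared_distances))   # index of the closest center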

We have completed the distance measurement part, but we are not done yet. Since we started with random cluster centers, we don’t expect our clusters to be correct. We need to keep updating the cluster centers until the system is stable and no change occurs in the clusters.

Updating the cluster centers means taking the average of each cluster and assigning that value as the new center. Averaging pulls each center toward its own points and reduces the influence of outliers and noise, so the points within each cluster end up as close to their center as possible and we get better groups.

def mean_finder():
    # cluster_labels and x are read from the enclosing scope
    cluster_new = np.zeros((3, 3))
    for i in range(3):
        number_of_elements = sum(cluster_labels == i)
        for j in range(3):
            total = 0
            for z in range(len(cluster_labels)):
                if cluster_labels[z] == i:
                    total = total + x[z, j]
            cluster_new[i, j] = round(total/(number_of_elements[0]+0.001), 4)
    return cluster_new

Above, we wrote the function that updates the cluster centers. It averages the data points in each cluster over the 3 features we use and returns a new 3x3 matrix of centers. These centers should be better than the previous ones, since we are reducing the level of randomness. Note that I added 0.001 to the denominator in “cluster_new[i,j]=round(total/(number_of_elements[0]+0.001),4)” to avoid a division by zero (and the resulting NaN) when a cluster happens to be empty.
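
For reference, the same update can be done with boolean masks instead of explicit loops. The snippet below is just an illustrative alternative; the function name and arguments are mine, not part of the original code:

# Optional, vectorized alternative to mean_finder (illustrative only).
def mean_finder_vectorized(x, cluster_labels):
    cluster_new = np.zeros((3, 3))
    for i in range(3):
        mask = (cluster_labels[:, 0] == i)       # points assigned to cluster i
        if mask.any():
            cluster_new[i, :] = x[mask, :3].mean(axis=0)
        # if a cluster is empty, its center stays at zero here; the 0.001
        # trick in the tutorial code serves the same purpose of avoiding NaN
    return cluster_new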

Now that our functions are ready, we can run the code and observe the results:

cluster_labels = np.zeros((len(x), 1))

for iteration in range(15):
    for i in range(len(x)):
        row = x[i, :]
        cluster_labels[i] = distance_find(row, cluster_centers)
    cluster_centers = mean_finder()
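
If you prefer not to hard-code the number of iterations, you could stop as soon as the centers stop moving. The loop below is one possible variant of the main loop, a sketch rather than the code used for the plots in this post:

# Optional variant: stop early once the centers converge
# (the tutorial itself simply runs 15 fixed iterations).
for iteration in range(100):
    for i in range(len(x)):
        cluster_labels[i] = distance_find(x[i, :], cluster_centers)
    new_centers = mean_finder()
    if np.allclose(new_centers, cluster_centers):   # no meaningful change
        break
    cluster_centers = new_centers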

That’s all! I set the number of iterations to 15 and expect a good result, since the dataset is small (150 rows and 4 columns). Now we are ready to see our 3D graphs. Let’s first see what the graph looked like before clustering, so we can compare it with the output of our model:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(x[:50, 0], x[:50, 1], x[:50, 2], color='red')
ax.scatter(x[50:100, 0], x[50:100, 1], x[50:100, 2], color='green')
ax.scatter(x[100:150, 0], x[100:150, 1], x[100:150, 2], color='blue')
plt.show()
[Figure: 3D scatter plot of the Iris data, colored by true species, before clustering]

And let’s see how it looks after clustering:

cluster_labels2 = np.zeros(len(x))
cluster_labels2[:] = cluster_labels[:, 0]

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(x[cluster_labels2 == 0, 0], x[cluster_labels2 == 0, 1], x[cluster_labels2 == 0, 2], color='red')
ax.scatter(cluster_centers[0, 0], cluster_centers[0, 1], cluster_centers[0, 2], color='red', marker='o', s=120)
ax.scatter(x[cluster_labels2 == 2, 0], x[cluster_labels2 == 2, 1], x[cluster_labels2 == 2, 2], color='green')
ax.scatter(cluster_centers[2, 0], cluster_centers[2, 1], cluster_centers[2, 2], color='green', marker='o', s=120)
ax.scatter(x[cluster_labels2 == 1, 0], x[cluster_labels2 == 1, 1], x[cluster_labels2 == 1, 2], color='blue')
ax.scatter(cluster_centers[1, 0], cluster_centers[1, 1], cluster_centers[1, 2], color='blue', marker='o', s=120)
plt.show()
[Figure: 3D scatter plot of the K-means result; the larger markers are the cluster centers]

As you can see, the two graphs are quite similar, but there are some differences worth emphasizing. For instance, the green point indicated by the arrow was originally blue, yet it was clustered with the green points. The reason is straightforward: since the point is closer to the green cluster center than to the blue one, it was assigned to the green cluster. By the way, the bigger balls in the middle of each cluster represent the cluster centers. So we can see that a cluster center is a point that carries the average features of its whole cluster.
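
If you want to verify that last claim numerically, you can compare each final center with the mean of the points assigned to it. The check below is just an illustration using the variable names from the code above:

# Optional sanity check: each final center should be (approximately) the mean
# of its assigned points, up to the rounding and the 0.001 term in mean_finder.
for i in range(3):
    members = x[cluster_labels2 == i, :3]
    if len(members) > 0:
        print(i, cluster_centers[i], members.mean(axis=0).round(3))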

I hope you enjoyed the tutorial and improved your understanding of K-means Clustering. See you in the next post!
