Machine Learning

K-means Clustering

ClusterIris

K-means is a partitioning method. It requires a prespecified number of clusters (k). The algorithm divides the data into k clusters and then iterates until it converges on a stable assignment of items to the k clusters. It clusters by minimizing the within-cluster sum of squares.
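Since the post title suggests the iris data, here is a minimal sketch of K-means in practice, assuming scikit-learn is available; n_clusters=3 and the variable names are illustrative choices, not something from the original post.

```python
# Minimal sketch: K-means on the iris data, assuming scikit-learn is installed.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                         # 150 items, 4 features each
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)                   # cluster index (0, 1, or 2) for each item

print(labels[:10])                           # first few cluster assignments
print(km.inertia_)                           # within-cluster sum of squares at convergence
```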

Formally, K-means clustering works by partitioning N items into K clusters C_k, where k = 1, ..., K. An item can belong to only one cluster, so the clusters are disjoint. Each item n = 1, ..., N has a feature vector x_n, and \mu_k is the mean vector of the items in cluster C_k. The N items are assigned to clusters by minimizing the sum of squares

\displaystyle \sum_{k=1}^K \sum_{n \in C_k} \| x_n - \mu_k \|^2

Step 1. Assign items randomly to the K clusters C_k.

Step 2. Compute \mu_k for each cluster C_k. Note that \mu_k is a vector whose length equals the number of features, containing the mean of each feature over the items in the cluster.

Step 3. Reassign every item to the cluster C_k whose mean vector \mu_k is closest to that item's feature vector x_n. Steps 2 and 3 are repeated until the assignments no longer change (see the sketch below).
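The three steps above can be written out directly. The sketch below is a plain NumPy version with illustrative names (kmeans, n_iter), iterating steps 2 and 3 until the assignments stop changing.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative NumPy sketch of the three steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: assign items randomly to the K clusters.
    labels = rng.integers(0, k, size=len(X))
    centers = np.zeros((k, X.shape[1]))
    for _ in range(n_iter):
        # Step 2: compute the mean vector mu_k for each cluster C_k.
        for j in range(k):
            members = X[labels == j]
            if len(members):                      # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        # Step 3: reassign every item to the cluster whose mean vector is closest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # converged: no item changed cluster
            break
        labels = new_labels
    return labels, centers
```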

If the variables do not have equal variance, or one variable is on a much larger scale than the others, that variable will dominate the distance calculations and therefore the cluster assignments. Thus, it is common to normalize all variables/features so they contribute equally.
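A short sketch of per-feature standardization (zero mean, unit variance); the toy array here is made up purely for illustration.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [1.5, 220.0]])                      # second feature is on a much larger scale

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
print(X_scaled.mean(axis=0).round(6))             # approximately 0 for every feature
print(X_scaled.std(axis=0))                       # 1 for every feature
```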

It is easy to see how this algorithm becomes K-medoids if the median is used instead of the mean. K-medoids is more robust to outliers than K-means.
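A hedged sketch of that change: the update is identical to Step 2 above, except the per-feature median replaces the mean (robust_center is an illustrative name, not from the original post).

```python
import numpy as np

def robust_center(members):
    # members: the items currently assigned to one cluster (rows of X)
    # Per-feature median instead of the mean, as described above; swapping this
    # into the Step 2 update of the kmeans() sketch gives the robust variant.
    return np.median(members, axis=0)
```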
