Machine Learning

K-means Clustering


K-means is a partitioning method. It requires a prespecified number of clusters, k. The algorithm divides the data into k clusters and then iterates until the cluster assignments converge. It clusters by minimizing the within-cluster sum of squares.

Formally, K-means clustering partitions N items into K clusters C_k, where k = 1, ..., K. An item can belong to only one cluster, so the clusters are disjoint. Each item n = 1, ..., N has a feature vector x_n, and \mu_k is the mean vector of the items in cluster C_k. The N items are assigned to clusters by minimizing the sum of squares

\displaystyle \sum_{k=1}^K \sum_{n \in C_k} \| x_n - \mu_k \|^2
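The objective above can be evaluated directly for any given assignment. Below is a minimal NumPy sketch (the function name is my own; it is not from the text) that computes the within-cluster sum of squares for N items with labels in 0..K-1:

```python
import numpy as np

def within_cluster_ss(X, labels, K):
    """Within-cluster sum of squares for a given assignment.

    X: (N, d) array of feature vectors x_n
    labels: (N,) array of cluster indices in 0..K-1
    """
    total = 0.0
    for k in range(K):
        members = X[labels == k]
        if len(members) == 0:
            continue  # empty clusters contribute nothing
        mu_k = members.mean(axis=0)  # mean vector of cluster C_k
        total += ((members - mu_k) ** 2).sum()
    return total
```

K-means searches over assignments to make this quantity as small as possible.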

Step 1. Assign items randomly to the K clusters C_k.

Step 2. Compute \mu_k for each cluster C_k. Note that \mu_k is a vector whose length equals the number of features, containing the mean of each feature over the items in the cluster.

Step 3. Reassign every item to the cluster C_k whose mean vector \mu_k is closest to that item's feature vector. Steps 2 and 3 are repeated until the assignments no longer change.
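The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation; the function name, the `max_iter` cap, and the rule for reseeding an emptied cluster with a random item are my own choices, not from the text:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """K-means following the three steps above, on an (N, d) array X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step 1: assign items randomly to the K clusters.
    labels = rng.integers(0, K, size=N)
    means = np.empty((K, X.shape[1]))
    for _ in range(max_iter):
        # Step 2: compute the mean vector mu_k of each cluster C_k.
        for k in range(K):
            members = X[labels == k]
            # assumption: if a cluster empties out, reseed it with a random item
            means[k] = members.mean(axis=0) if len(members) else X[rng.integers(N)]
        # Step 3: reassign each item to the cluster with the nearest mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
    return labels, means
```

On well-separated data the loop typically converges in a handful of iterations, since each reassignment can only decrease the within-cluster sum of squares.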

If the variables have unequal variances, or one variable is on a much larger scale than the others, that variable will dominate the K-means distance computations. It is therefore common to normalize all variables/features so that they contribute equally.
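A common way to do this is to standardize each feature to zero mean and unit variance before clustering. A minimal sketch (the function name and the handling of constant features are my own choices):

```python
import numpy as np

def standardize(X):
    """Scale each column of X to zero mean and unit variance so that
    every feature contributes equally to the squared-distance objective."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # assumption: leave constant features unchanged
    return (X - mu) / sigma
```

After this transformation a feature measured in the thousands no longer outweighs one measured in fractions.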

It is easy to see how this algorithm becomes K-medoids if the median is used in place of the mean. K-medoids is more robust to outliers than K-means.
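Only Step 2 changes in this variant: the cluster center becomes the per-feature median of its members (this median-based update is often called K-medians; K-medoids in the strict sense additionally restricts each center to be an actual item). A minimal sketch of the modified update step, with a hypothetical function name:

```python
import numpy as np

def median_update(X, labels, K):
    """Cluster centers as per-feature medians of their members.

    Replacing the mean with the median in Step 2 damps the pull
    that outliers exert on each cluster center.
    """
    return np.array([np.median(X[labels == k], axis=0) for k in range(K)])
```

For example, with one cluster containing the values 0, 1, 2, and 100, the mean (25.75) is dragged toward the outlier while the median (1.5) stays near the bulk of the data.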