# K-means Clustering

K-means is a partitioning method. It requires a prespecified number of clusters ($k$). The algorithm divides the data into $k$ clusters and then iterates, reassigning items until the cluster memberships converge. It clusters by minimizing the within-cluster sum of squares.

Formally, K-means clustering partitions $N$ items into $K$ clusters $C_k$, where $k=1,\dots,K$. An item can belong to only one cluster, so the clusters are disjoint. Each item $n=1,\dots,N$ has a feature vector $x_n$, and $\mu_k$ is the mean vector of the items in cluster $C_k$. The $N$ items are assigned to clusters by minimizing the sum of squares

$\displaystyle \sum_{k=1}^K \sum_{n \in C_k} \| x_n - \mu_k \|^2$
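The objective above can be computed directly from a set of cluster assignments. A minimal sketch in NumPy (the function name `within_cluster_ss` is illustrative, not from the original):

```python
import numpy as np

def within_cluster_ss(X, labels, k):
    """Within-cluster sum of squares: for each cluster, sum the squared
    distances from its items to the cluster's mean vector."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members) == 0:
            continue
        mu = members.mean(axis=0)            # mean vector of cluster c
        total += ((members - mu) ** 2).sum()
    return total
```

This is the quantity K-means drives down at every iteration; comparing it across runs with different random starts is one way to pick the best clustering.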

Step 1. Assign items randomly to the $K$ clusters $C_k$.

Step 2. Compute $\mu_k$ for each cluster $C_k$. Note that $\mu_k$ is a vector whose length equals the number of features, containing the mean of each feature over the items in the cluster.

Step 3. Reassign every item to the cluster $C_k$ whose mean vector $\mu_k$ is closest to that item.

Step 4. Repeat steps 2 and 3 until no item changes cluster.
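The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation (the function name `kmeans` and the empty-cluster reseeding rule are my assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means following the steps above: random initial
    assignment, mean update, nearest-mean reassignment, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: assign items randomly to the K clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the mean vector mu_k of each cluster.
        # (An empty cluster is reseeded to a random item -- an assumption.)
        means = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                          else X[rng.integers(len(X))] for c in range(k)])
        # Step 3: reassign each item to the cluster with the closest mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # Step 4: stop on convergence
            break
        labels = new_labels
    return labels, means
```

Because the initial assignment is random, different seeds can converge to different local minima; in practice the algorithm is run several times and the solution with the smallest within-cluster sum of squares is kept.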

If the variables do not have equal variance, or one variable is on a much larger scale than the others, that variable will dominate the K-means distance computations. It is therefore common to normalize all variables/features so they contribute equally.
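A common normalization is to standardize each feature to zero mean and unit variance before clustering. A small sketch (the function name `standardize` is illustrative):

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance so
    that no single large-scale variable dominates the distances."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # leave constant features unchanged
    return (X - mu) / sd
```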

It is easy to see how this algorithm becomes K-medians if the median is used instead of the mean; the closely related K-medoids variant instead requires each cluster center to be an actual item in the data. Both are more robust to outliers than K-means.
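The median variant only changes the center-update step. A sketch of that single change (the function name `median_update` is illustrative):

```python
import numpy as np

def median_update(X, labels, k):
    """Cluster-center update using the per-feature median instead of the
    mean; a single far-off outlier shifts the median far less than it
    shifts the mean, which is the source of the robustness."""
    return np.array([np.median(X[labels == c], axis=0) for c in range(k)])
```

For example, with items 0, 1, and an outlier at 100 in one cluster, the median center is 1 while the mean center would be pulled out to about 33.7.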