Machine Learning

Normalizing and Standardizing

There is a difference between normalizing and standardizing data. Both are common steps in preprocessing feature variables for a machine learning model.
Standardizing centers the data by subtracting the mean, and scales by the standard deviation. This centers the data around 0 as the mean, with unit variance. This is typically what is used to preprocess feature vectors in machine learning.

\displaystyle x_{new}=\frac{x-\bar{x}}{sd}

Normalizing scales the data between 0 and 1. Note that normalizing will lose outliers.

\displaystyle x_{new}=\frac{x_i-min(x)}{max(x)-min(x)}

Scikit-learn in python has some great preprocessing modules for standardizing and normalization.

Standardize to 0 mean with unit variance

sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

Normalize to (0,1) range.

sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True, return_norm=False)

One thought on “Normalizing and Standardizing

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s