Machine Learning

# Normalizing and Standardizing

There is a difference between normalizing and standardizing data. Both are common steps in preprocessing feature variables for a machine learning model.
Standardizing centers the data by subtracting the mean and scales it by the standard deviation, producing data with zero mean and unit variance. This is the transformation most commonly applied to feature vectors in machine learning.

$\displaystyle x_{new}=\frac{x_i-\bar{x}}{s}$

where $\bar{x}$ is the mean and $s$ is the standard deviation.
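The formula can be sketched directly in NumPy (a minimal illustration, not taken from the original notes):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# standardize: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()

# the result has zero mean and unit variance
# x_std ≈ [-1.4142, -0.7071, 0.0, 0.7071, 1.4142]
```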

Normalizing (min-max scaling) rescales the data into the range [0, 1]. Note that min-max scaling is sensitive to outliers: a single extreme value compresses the rest of the data into a narrow band.

$\displaystyle x_{new}=\frac{x_i-min(x)}{max(x)-min(x)}$
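The min-max formula translates to NumPy just as directly (again, a minimal illustration on made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# min-max scaling: shift by the minimum, divide by the range
x_norm = (x - x.min()) / (x.max() - x.min())
# x_norm = [0.0, 0.25, 0.5, 0.75, 1.0]
```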

scikit-learn in Python has useful preprocessing utilities for both standardization and normalization.

Standardize to 0 mean with unit variance:

```python
sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
```


Scale each sample to unit norm. Note that despite the name, `normalize` rescales each row to unit L2 (or L1) norm; it does not perform the (0, 1) min-max scaling above. For that, use `sklearn.preprocessing.MinMaxScaler`.

```python
sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True, return_norm=False)
```
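A short usage sketch (assuming scikit-learn is installed): `scale` standardizes each column to zero mean and unit variance, while `MinMaxScaler` applies the min-max transform to the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import scale, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# standardize each column: zero mean, unit variance
X_std = scale(X)

# min-max scale each column into [0, 1]
X_mm = MinMaxScaler().fit_transform(X)
```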