Machine Learning

Exploratory Analysis

Exploratory analysis involves data cleaning, feature engineering, data summarization, and trend analysis.

It is common to be faced with a dataset with high dimensionality. This means that there are many possible features (or variables) that can be used in a machine learning algorithm or statistical model. Typically only a small set of these features will actually be informative to a model, and many features will be highly correlated. A large set of features can lead to very long processing times and overfitting. Dumping a large number of features into a model is a great way to get a garbage-in, garbage-out model. The features need to be reduced and/or transformed before being used to develop a model. Exploratory analysis explores the data for meaningful trends, creates meaningful features, reduces dimensionality, and cleans the data so that it is prepared for analysis.

An exploratory analysis can include some of these steps:

  • Remove meaningless features from the dataset. If a feature does not have a statistically significant relationship with the response variable y, it is probably not useful to include it in the model.
  • Calculate new features from the dataset.
  • Normalize or standardize the data.
  • Impute missing data.
  • Apply transformations such as a log transformation or a first-order difference.
  • Apply dimensionality reduction methods such as PCA to decorrelate and summarize the features.
  • Plot the data and explore trends.
  • Test hypotheses with the data.
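
As a small illustration of the transformation step in the list above, here is a minimal sketch of a log transformation followed by a first-order difference. The function names and the sample series are hypothetical, and the log transform assumes every value is strictly positive.

```python
import math

def log_transform(values):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(v) for v in values]

def first_difference(values):
    """First-order difference: each value minus the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical series that grows by a factor of 10 at every step.
series = [1.0, 10.0, 100.0, 1000.0]
logged = log_transform(series)      # exponential growth becomes linear
diffs = first_difference(logged)    # linear growth becomes a constant step
```

After both transformations the series is a constant step of log(10), which is usually much easier for a model to work with than the raw values.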

Normalizing and Standardizing

There is a difference between normalizing and standardizing data. Both are common steps in preprocessing feature variables for a machine learning model.
Standardizing centers the data by subtracting the mean, and scales by the standard deviation. This centers the data around 0 as the mean, with unit variance. This is typically what is used to preprocess feature vectors in machine learning.

\displaystyle x_{new}=\frac{x-\bar{x}}{sd}

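Normalizing scales the data between 0 and 1. Note that normalizing is sensitive to outliers: a single extreme value sets the minimum or maximum, which compresses the remaining values into a narrow part of the range.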

\displaystyle x_{new}=\frac{x-\min(x)}{\max(x)-\min(x)}

Scikit-learn in Python has some great preprocessing functions for standardizing and normalizing.

Standardize to 0 mean with unit variance

sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

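Normalize to the (0,1) range. Note that sklearn.preprocessing.normalize rescales each sample to unit norm rather than to a fixed range, so min-max scaling is done with minmax_scale instead.

sklearn.preprocessing.minmax_scale(X, feature_range=(0, 1), axis=0, copy=True)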
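
To make the two formulas concrete, here is a dependency-free sketch of standardizing and min-max normalizing a single feature. The function names and the sample data are hypothetical, and the population standard deviation is an assumption chosen to mirror the usual preprocessing convention.

```python
import statistics

def standardize(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature with one large outlier.
feature = [2.0, 4.0, 6.0, 8.0, 100.0]
z = standardize(feature)            # mean 0, unit variance
scaled = min_max_normalize(feature) # values in [0, 1]
```

In this example the outlier maps to 1 and the other four values all land below 0.1, which shows how a single extreme value dominates the min-max range.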

Recommender Systems: Content-Based Filtering

Content-based filtering (abbreviated CBR here) is a type of recommender system that uses a description of the item and a profile of the user’s interests to make recommendations. CBR matches the descriptions and attributes of items to the user’s selected interests and preferences. It uses ML algorithms such as logistic regression or decision trees to make these matches, and the results are converted to a ranked recommendation. Content-based filtering systems rely on the item in question, not on the preferences of other users. The benefit of this method is that it does not require any historic data or training data from other users.

Machine learning is used to build user profiles if they do not already exist.

The user profile is typically built from historic user data as a functional relationship that represents how the user would rate a given item.

Figure: Decision tree classification of items based on features

In some cases this taste profile is already known, for example, a survey filled out when signing up for the website could build a user profile.
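
The matching step can be sketched with a logistic model over item features. This is only an illustration under stated assumptions: the items, feature names, weights, and bias below are all hypothetical, and in practice the weights would be fitted to the user's own history or derived from a sign-up survey.

```python
import math

# Hypothetical item descriptions: feature -> value (1.0 means the item has it).
items = {
    "Movie A": {"comedy": 1.0, "action": 0.0, "romance": 1.0},
    "Movie B": {"comedy": 0.0, "action": 1.0, "romance": 0.0},
    "Movie C": {"comedy": 1.0, "action": 1.0, "romance": 0.0},
}

# Hypothetical user profile: one weight per feature plus a bias term.
profile = {"comedy": 2.0, "action": -1.0, "romance": 0.5}
bias = -0.5

def like_probability(features, weights, bias):
    """Logistic model: probability that the user likes an item with these features."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def rank_items(items, weights, bias):
    """Item names sorted from most to least likely to be liked."""
    return sorted(items,
                  key=lambda name: like_probability(items[name], weights, bias),
                  reverse=True)

ranking = rank_items(items, profile, bias)
```

Only this one user's profile and the item features are used, so no data from other users is needed to produce the ranking.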



Advantages

  • Does not depend on data from other users
  • Can recommend unpopular or new items
  • Easy to explain why items were recommended, because each recommendation is tied to item features

Disadvantages

  • Is not predictive (cannot recommend new types of items to users)

Recommender Systems: Collaborative Filtering

A recommender system is an algorithm that helps users discover new content. It is popular in online marketing. Recommender systems are known to drastically increase an online vendor’s revenue.

Collaborative filtering is a type of recommender system. It is an algorithm that makes automatic predictions (recommendations) to a user based on the preferences of a set of similar users. It requires a large amount of data from users. Collaborative filtering does not necessarily require machine learning, and can rely on deterministic calculations instead.

Cosine similarity or Pearson correlation is used to measure the similarity between two users and to create the neighborhood of N users.

Let x and y be the rating vectors of two users.

cosine similarity


\displaystyle sim(x,y)=\cos(\theta)=\frac{x \cdot y}{\|x\| \|y\|}=\frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}

\cos(\theta)=1 when \theta=0 : x and y point in the same direction, so the two users’ preferences are equivalent
\cos(\theta)=0 when \theta=\frac{\pi}{2} : x and y are dissimilar

The cosine similarity is equivalent to the Pearson correlation coefficient if the x and y vectors are centered by subtracting their means.

Users tend to rate on different scales. Rating vectors can be normalized by dividing by the user’s mean rating or by using the difference from the mean.
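
The relationship between cosine similarity and Pearson correlation can be checked with a short sketch. The function names and the two rating vectors are hypothetical. The second user rates every item exactly one point lower than the first.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(x):
    """Subtract the mean so the vector is expressed as differences from the mean."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def pearson(x, y):
    """Pearson correlation computed as the cosine similarity of centered vectors."""
    return cosine_similarity(center(x), center(y))

# Hypothetical ratings of the same four items by two users.
alice = [5.0, 4.0, 2.0, 3.0]
bob = [4.0, 3.0, 1.0, 2.0]

raw = cosine_similarity(alice, bob)   # slightly below 1 because the scales differ
centered = pearson(alice, bob)        # approximately 1 once each user's mean is removed
```

Centering removes the difference in rating scale, so the two users are recognized as having the same taste.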

If there are many users, calculations can be computationally intensive. It is common to cluster users into groups and perform recommendations for each group of users.

Item to item collaborative filtering is a similar method that finds a neighborhood of similar items instead of users. Estimates of ratings are based on ratings for similar items for each user.
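
To make the neighborhood idea concrete, here is a sketch of a user-based prediction. The text above does not specify how neighbor ratings are combined, so the similarity-weighted average used here is an assumption, as are the function names and the ratings data. An item-to-item variant would swap the roles of users and items.

```python
import math

# Hypothetical ratings: user -> {item: rating}. Missing items are unrated.
ratings = {
    "ann":  {"A": 5.0, "B": 4.0, "C": 1.0},
    "ben":  {"A": 4.0, "B": 5.0, "C": 1.0, "D": 5.0},
    "cara": {"A": 1.0, "B": 2.0, "C": 5.0, "D": 1.0},
}

def cosine(u, v):
    """Cosine similarity over the items that both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item, ratings, n_neighbors=2):
    """Similarity-weighted average of the N most similar users' ratings for item."""
    candidates = [(cosine(ratings[user], other), other[item])
                  for name, other in ratings.items()
                  if name != user and item in other]
    neighbors = sorted(candidates, reverse=True)[:n_neighbors]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # nobody in the neighborhood has rated this item
    return sum(sim * rating for sim, rating in neighbors) / total

estimate = predict("ann", "D", ratings)
```

Because ann's ratings are much closer to ben's than to cara's, the estimate for item D is pulled toward ben's rating.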


Advantages

  • Does not require feature data for the items. No feature selection needed.
  • Can be predictive.

Disadvantages

  • Does not work well with small datasets.
  • Cannot recommend new/unrated items.
  • Defaults to recommending popular items (popularity bias).

Principal Component Analysis

PCA is used to reduce the dimensions of a large data set such as a set of feature data x. The high-dimensional data is summarized using orthogonal transformations into uncorrelated principal components.

The dimension reduction is done by only using the eigenvectors (principal components) with large eigenvalues (the vectors that explain the most variance). The components are ordered by the variance they explain, with the first component explaining the most, so an elbow plot of the eigenvalues is used to determine the number of significant principal components for an analysis. The eigenvectors of the covariance matrix \Sigma form an orthogonal basis in which the transformed features are uncorrelated.

Covariance Matrix

The covariance matrix \Sigma is a matrix of all possible covariances between a set of variables X_1, ..., X_n . The (i, j) entry of \Sigma is \Sigma_{(i, j)}=cov(X_i, X_j) .
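
As an illustration, the definition can be computed directly. This is a minimal sketch with a hypothetical function name and a tiny made-up dataset. It assumes the sample covariance, which divides by the number of observations minus one.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of data given as rows of observations."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
             for j in range(d)]
            for i in range(d)]

# Hypothetical data: three observations of two variables, the second twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
sigma = covariance_matrix(data)
```

The diagonal entries are the variances of the two variables and the off-diagonal entries are their covariance, so the matrix is symmetric.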


Every eigenvector has a corresponding eigenvalue. The eigenvalue represents the amount of variance in the direction of the eigenvector. The eigenvalues of \Sigma are found by solving

\det(\Sigma-\lambda I)=0 for the eigenvalues \lambda .


The principal components are the eigenvectors of the covariance matrix \Sigma . Each eigenvector represents a direction. For n-dimensional data there are n eigenvectors. The eigenvectors v of \Sigma are found by solving

(\Sigma-\lambda I)v=0

for v for each specific eigenvalue \lambda , which will result in a set of n eigenvectors and n eigenvalues for an n \times n matrix.


The eigenvectors form the columns of an orthogonal matrix W, which is used as a transformation matrix on the features x to create a set of new features from the data: y=W^T x
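
The whole procedure can be sketched for two-dimensional data, where the eigenvalues of the 2 x 2 covariance matrix have a closed form. This is an illustration under several assumptions: the function names and data are hypothetical, the data is mean-centered before projection, and real analyses with more features would rely on a numerical eigenvalue routine.

```python
import math

def covariance_2d(points):
    """Means and sample covariance entries (s_xx, s_xy, s_yy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def eigen_2x2_symmetric(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] as (eigenvalue, unit vector), largest first."""
    if b == 0.0:
        return sorted([(a, (1.0, 0.0)), (c, (0.0, 1.0))], reverse=True)
    mid = (a + c) / 2.0
    half_gap = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    pairs = []
    for lam in (mid + half_gap, mid - half_gap):
        vx, vy = b, lam - a                  # solves (Sigma - lam I) v = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

def pca_2d(points):
    """Eigenvalues, principal directions, and projected scores y = W^T (x - mean)."""
    (mx, my), (sxx, sxy, syy) = covariance_2d(points)
    (lam1, w1), (lam2, w2) = eigen_2x2_symmetric(sxx, sxy, syy)
    scores = [((x - mx) * w1[0] + (y - my) * w1[1],
               (x - mx) * w2[0] + (y - my) * w2[1]) for x, y in points]
    return (lam1, lam2), (w1, w2), scores

# Hypothetical correlated data.
points = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 5.0)]
eigenvalues, components, scores = pca_2d(points)
explained = eigenvalues[0] / sum(eigenvalues)   # share of variance on the first component
```

Keeping only the first coordinate of each score reduces the data to one dimension while retaining the share of variance given by `explained`.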




K-means Clustering


K-means is a partitioning method. It requires a prespecified number of clusters (K). The algorithm divides the data into that number of clusters, and then iterates until the assignment of items to the K clusters stops changing. This algorithm clusters by minimizing the within-cluster sum of squares.

Formally, K-means clustering works by clustering N items into K clusters called C_k, where k=1,...,K. An item can only belong to one cluster, so the clusters are disjoint. Each item n=1,...,N has a vector of features x_n, and \mu_k is the mean vector of the items in cluster C_k. The N items are assigned to clusters by minimizing the sum of squares

\displaystyle \sum_{k=1}^K \sum_{n \in C_k} \| x_n- \mu_k\|^2

Step 1. Assign items randomly to the K clusters C_k .

Step 2. Compute \mu_k for each cluster C_k . Note that \mu_k is a vector whose length is the number of features, and it contains the mean of each feature in x_n over the items in the cluster.

Step 3. Reassign every item to the cluster C_k whose mean vector \mu_k is closest to that item’s feature vector x_n .

Step 4. Repeat steps 2 and 3 until the cluster assignments no longer change.
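
The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the seed, the iteration cap, and the handling of empty clusters are all assumptions, and the sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_vector(points):
    """Mean of each feature over a list of feature vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: assign items randomly to the K clusters.
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iter):
        # Step 2: compute the mean vector of each cluster.
        centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: an empty cluster is re-seeded with a random item.
            centers.append(mean_vector(members) if members else list(rng.choice(points)))
        # Step 3: reassign every item to the cluster with the closest mean.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        # Repeat steps 2 and 3 until the assignments stop changing.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, k=2)
```

With groups this far apart the two clusters recover the two groups. On less clearly separated data the result can depend on the random starting assignment, so it is common to run the algorithm several times.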

If there is not equal variance in the data, or one variable is on a much larger scale than others, that variable will influence K-means decisions the most. Thus, it is common to normalize all variables/features so they contribute equally.

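It is easy to see how this algorithm becomes K-medians if the median is used instead of the mean. The closely related K-medoids method uses an actual data point as the center of each cluster. Both are more robust to outliers than K-means.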
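
The effect of an outlier on the cluster center can be seen in a small sketch. The function names and the data are hypothetical, and the median is taken separately for each feature.

```python
import statistics

def mean_center(points):
    """Cluster center as the mean of each feature."""
    return [statistics.mean(p[j] for p in points) for j in range(len(points[0]))]

def median_center(points):
    """Cluster center as the median of each feature."""
    return [statistics.median(p[j] for p in points) for j in range(len(points[0]))]

# Hypothetical cluster: four nearby points plus one outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
by_mean = mean_center(cluster)      # dragged toward the outlier
by_median = median_center(cluster)  # stays among the nearby points
```

The mean-based center moves far away from the four nearby points, while the median-based center remains among them.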