Machine Learning

Exploratory Analysis

Exploratory analysis involves data cleaning, feature engineering, data summarization, and trend analysis.

It is common to be faced with a dataset with high dimensionality. This means that there are many possible features (or variables) that can be used in a machine learning algorithm or statistical model. Typically only a small set of these features will actually be informative to a model, and many features will be highly correlated. A large set of features can lead to very long processing times and overfitting. Dumping a large number of features into a model is a great way to get a garbage-in, garbage-out model. The features need to be reduced and/or transformed before being used to develop a model. Exploratory analysis explores the data for meaningful trends, creates meaningful features, reduces dimensionality, and cleans the data so that it is prepared for analysis.

An exploratory analysis can include some of these steps:

  • Remove meaningless features from the dataset. If a feature does not have a statistically significant relationship with the response variable y, it is probably not useful to include it in the model.
  • Calculate new features from the dataset.
  • Normalize or standardize the data.
  • Impute missing data.
  • Apply transformations such as a log transformation or a first-order difference.
  • Apply dimensionality reduction methods such as PCA to decorrelate and summarize the features.
  • Plot the data and explore trends.
  • Test hypotheses with the data.
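
A minimal pandas/scikit-learn sketch of a few of these steps on a small made-up data frame (all column names and values are hypothetical, not from a real dataset):

import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical raw feature data
df = pd.DataFrame({
    "income": [40000, 52000, np.nan, 61000, 250000],
    "age":    [23, 35, 41, np.nan, 52],
    "visits": [1, 3, 2, 8, 4],
})

# Impute missing data with the column median
df = df.fillna(df.median())

# Calculate a new feature and apply a log transformation to a skewed feature
df["income_per_visit"] = df["income"] / df["visits"]
df["log_income"] = np.log(df["income"])

# Standardize the features so they are on a comparable scale
X = scale(df[["log_income", "age", "income_per_visit"]])

# Plot the data and explore trends (requires matplotlib)
pd.plotting.scatter_matrix(df, figsize=(6, 6))
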
Machine Learning

Normalizing and Standardizing

There is a difference between normalizing and standardizing data. Both are common steps in preprocessing feature variables for a machine learning model.
Standardizing centers the data by subtracting the mean and scales it by the standard deviation, giving each feature a mean of 0 and unit variance. This is typically what is used to preprocess feature vectors in machine learning.

\displaystyle x_{new}=\frac{x-\bar{x}}{sd}

Normalizing scales the data between 0 and 1. Note that min-max normalization is sensitive to outliers: a single extreme value compresses the rest of the data into a narrow range.

\displaystyle x_{new}=\frac{x_i-min(x)}{max(x)-min(x)}

Scikit-learn in Python has some great preprocessing functions for standardizing and normalizing.

Standardize to 0 mean with unit variance

sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
 

Scale to the (0,1) range with min-max normalization (matching the formula above).

sklearn.preprocessing.minmax_scale(X, feature_range=(0, 1), axis=0, copy=True)

Note that sklearn.preprocessing.normalize(X, norm='l2') is a different operation: it rescales each sample (row) to unit norm rather than scaling each feature to a fixed range.
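
A quick usage sketch on a toy array (the values are made up; this assumes scikit-learn and NumPy are installed):

import numpy as np
from sklearn.preprocessing import scale, minmax_scale

# Toy feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_std = scale(X)          # each column: mean 0, unit variance
X_norm = minmax_scale(X)  # each column: scaled to the (0, 1) range

print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately [0, 0] and [1, 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))  # [0, 0] and [1, 1]
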
Tutorials

Easy Animation in R

I recently animated some figures using ggplot and ImageMagick. ImageMagick is an application independent of R. A set of ggplots is created in R, and ImageMagick is used in the command line to merge the set of plots into a gif.

To get started, download ImageMagick from https://www.imagemagick.org/script/binary-releases.php and follow the installation instructions for your operating system. I work on Windows, so this tutorial will use that.

[Animated gif: mapcrimes.gif]

library(ggmap)
#subset data to just one day for an example
crime <- crime[crime$time<'2010-01-05 0:00:00', ]
#create list of times used to save unique plots
times <- unique(crime$time)
#get map from ggmap
houston <- get_map(location = 'Houston', zoom = 10,
                   maptype = "roadmap",
                   color = "bw")

#loop through all times, and save each ggplot as a jpeg
for(i in 1:length(times)){
    t <- times[i]
    m <- ggmap(houston) + geom_point(data = crime[crime$time == t,],
                                     mapping = aes(x = lon, y = lat, color = offense),
                                     size = 5) +
        ggtitle(paste("Houston crime\n", t)) +
        theme(legend.position = "bottom") +
        scale_colour_discrete(drop = FALSE)
    #zero-pad the index so the frames sort in order for ImageMagick
    title <- sprintf("map%03d.jpg", i)
    jpeg(title)
    print(m)
    dev.off()
}

Once all the jpeg plots are saved to your directory, open a command prompt and navigate to the folder where those plots are saved. Then run the following command, which uses all the individual jpeg files to create one animated gif.

magick *.jpg mapcrimes.gif

Here is another little example.

[Animated gif: tscrimes.gif]

library(dplyr)
library(ggmap)

crime$date <- as.Date(crime$time)
crime <- crime[crime$date<'2010-03-01', ]
crime <- crime %>%
         group_by(date) %>%
         summarize(total=n())
times <- unique(crime$date)

for(i in 1:length(times)){
    t <- times[i]
    #zero-pad the index so the frames sort in order for ImageMagick
    title <- sprintf("tsplot%03d.png", i)
    png(title)
    plot(crime[crime$date <= t,], col = "blue", main = "Crimes in Houston Jan-Mar 2010",
         xlim = c(min(times), max(times)),
         ylim = c(200, 450))
    lines(crime[crime$date <= t,], lwd = 2, col = "blue")
    dev.off()
}

And create the gif in the command line using
magick *.png tscrimes.gif

P.S. The R package animation can produce the same animations without using the command line.

Machine Learning

Recommender Systems: Content-Based Filtering

Content-based filtering (CBF) is a type of recommender system that uses descriptions of items and a profile of the user’s interests to make recommendations. CBF matches item descriptions and attributes to the user’s selected interests and preferences. CBF often uses ML algorithms such as logistic regression or decision trees to make these matches, and the results are converted into a ranked list of recommendations. Content-based filtering relies on the item in question and the user’s own profile, not on other users’ preferences. The benefit of this method is that it does not require historic training data from other users.

Machine learning is used to build user profiles if they do not already exist.

The user profile is typically built from the user’s historic data to form a functional relationship representing how that user would rate a given item.

Decision tree classification of items based on features

In some cases this taste profile is already known, for example, a survey filled out when signing up for the website could build a user profile.

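As a rough sketch of the decision-tree idea above, the example below fits a classifier to hypothetical item feature vectors labeled with one user's past likes and dislikes, then ranks unseen items by the predicted probability of a like (all feature names and values are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical item features: [action, comedy, romance] content scores per movie
item_features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.9],
])
# 1 = this user liked the item, 0 = disliked (the user's history/profile)
liked = np.array([1, 1, 0, 0])

# Fit a decision tree relating item content to the user's preferences
profile_model = DecisionTreeClassifier(max_depth=2, random_state=0)
profile_model.fit(item_features, liked)

# Score new, unseen items and rank them by predicted probability of a like
new_items = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])
scores = profile_model.predict_proba(new_items)[:, 1]
ranking = np.argsort(scores)[::-1]  # indices of new items, best first
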

Pros

  • Does not depend on data from other users
  • Can recommend unpopular or new items
  • Easy to explain why an item was recommended, since recommendations are based on the item’s features

Cons

  • Limited discovery: cannot recommend types of items outside the user’s existing interest profile
Machine Learning

Recommender Systems: Collaborative Filtering

A recommender system is an algorithm that helps users discover new content. It is popular in online marketing, and recommender systems can significantly increase an online vendor’s revenue.

Collaborative filtering is a type of recommender system. It is an algorithm that makes automatic predictions (recommendations) for a user based on the preferences of a set of similar users. It requires a large amount of user data. Collaborative filtering does not necessarily require machine learning; the basic approach uses deterministic similarity calculations.

Cosine similarity or Pearson correlation is used to measure the similarity between two users and to create the neighborhood of the N most similar users.

Let x and y be the rating vectors of two users.

Cosine similarity

\displaystyle sim(x,y)=\cos(\theta)=\frac{x \cdot  y}{|x||y|}=\frac{\Sigma(xy)}{\sqrt{\Sigma x^2} \sqrt{\Sigma y^2}}

\cos(\theta)=1 , \theta=0 : x and y are equivalent
\cos(\theta)=0 , \theta=\frac{\pi}{2} : x and y are dissimilar (orthogonal)

The cosine similarity is equivalent to the Pearson correlation coefficient if the x and y vectors are first centered by subtracting their means.

Users tend to rate on different scales. Rating vectors can be normalized, for example by dividing by each user’s mean rating or by subtracting it.

If there are many users, calculations can be computationally intensive. It is common to cluster users into groups and perform recommendations for each group of users.

Item to item collaborative filtering is a similar method that finds a neighborhood of similar items instead of users. Estimates of ratings are based on ratings for similar items for each user.
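
A minimal sketch of user-based collaborative filtering on a small made-up ratings matrix (users in rows, items in columns, 0 meaning unrated; for simplicity, unrated entries are treated as zeros when computing similarity). A missing rating is estimated as the similarity-weighted average of the ratings from the N most similar users who rated that item:

import numpy as np

# Hypothetical user-item ratings matrix (rows = users, columns = items, 0 = unrated)
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def cosine_sim(x, y):
    # Cosine similarity between two rating vectors
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def predict(R, user, item, n_neighbors=2):
    # Similarity of the target user to every other user
    sims = np.array([cosine_sim(R[user], R[u]) if u != user else -1.0
                     for u in range(R.shape[0])])
    # Keep the N most similar users who actually rated the item
    rated = np.where(R[:, item] > 0)[0]
    neighbors = rated[np.argsort(sims[rated])[::-1][:n_neighbors]]
    # Similarity-weighted average of the neighbors' ratings
    return sims[neighbors] @ R[neighbors, item] / sims[neighbors].sum()

print(predict(R, user=0, item=2))  # predicted rating of item 2 for user 0
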

Pros

  • Does not require feature data for the items. No feature selection needed.
  • Can recommend new types of items to a user, since recommendations come from similar users rather than the user’s own history.

Cons

  • Does not work well with small datasets.
  • Cannot recommend new/unrated items.
  • Tends to recommend popular items (popularity bias).
Tutorials

A Simple Tutorial of the ggmap package in R

The R package ggmap brings ggplot2 plotting capabilities to Google Maps. I have recently used it for a few projects and wanted to save a few quick sample plots so I can easily return to creating similar ones. Here are a few examples of ggmap plots.

The first example uses the ‘crime’ dataset that comes in the ggmap package. This dataset shows crimes in the Houston area.


require(ggmap)
require(ggplot2)

houston <- get_map(location = 'Houston', zoom = 10, maptype = "roadmap")

ggmap(houston) +
    geom_point(data = crime[crime$offense == "murder",],
               mapping = aes(x = lon, y = lat),
               color = "hotpink", shape = 3, size = 3) +
    ggtitle("ggmap example using\nthe crime R dataset") +
    theme(plot.title = element_text(hjust = 0.5))

This second plot is an example using the satellite map type. It shows earthquakes near Fiji. The data is from the ‘quakes’ dataset in the R package ‘datasets’.


require(datasets)
require(ggmap)
require(ggplot2)

fiji <- get_map(location = c(mean(quakes$long), mean(quakes$lat)), zoom = 5,
                maptype = "satellite")

ggmap(fiji) +
    geom_point(data = quakes, mapping = aes(x = long, y = lat, color = depth, size = mag)) +
    scale_colour_gradient(low = "hotpink", high = "yellow") +
    ggtitle("ggmap example using\nthe Fiji earthquakes R dataset") +
    theme(plot.title = element_text(hjust = 0.5))

 

Machine Learning

Principal Component Analysis

PCA is used to reduce the dimensionality of a large data set, such as a set of feature data x. The high-dimensional data is summarized, via an orthogonal transformation, into uncorrelated principal components.

The dimension reduction is done by keeping only the eigenvectors (principal components) with the largest eigenvalues, i.e. the directions that explain the most variance. The components are ordered by the variance they explain, so an elbow (scree) plot of the eigenvalues is used to choose how many principal components to keep. The set of eigenvectors forms an orthogonal basis derived from the covariance matrix \Sigma .

Covariance Matrix

The covariance matrix \Sigma is the matrix of all pairwise covariances between a set of variables X_1, \dots, X_n . The (i, j) entry of \Sigma is \Sigma_{ij}=cov(X_i, X_j) .

Eigenvalues

Every eigenvector has a corresponding eigenvalue. The eigenvalue represents the amount of variance in the direction of its eigenvector. The eigenvalues of \Sigma are found by solving

\det(\Sigma-\lambda I)=0 for the eigenvalues \lambda .

Eigenvectors

The principal components are the eigenvectors of the covariance matrix \Sigma . Each eigenvector represents a direction. For n-dimensional data there are n eigenvectors. The eigenvector x of \Sigma corresponding to a specific eigenvalue \lambda is found by solving

(\Sigma-\lambda I)x=0

for x, which results in a set of n eigenvectors and n eigenvalues for an n \times n matrix.

Transformation

The eigenvectors form an orthogonal matrix W that is used as a transformation matrix on the features, creating a set of new, uncorrelated features from the data: y=W^T x
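
A minimal NumPy sketch of this procedure on made-up correlated data (the variable names are illustrative): center the features, form the covariance matrix \Sigma , take its eigendecomposition, sort by eigenvalue, and project onto the leading components.

import numpy as np

# Toy data: 100 samples, 3 features, two of them strongly correlated
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(100, 1)),
               2 * z + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# Center the data, then form the covariance matrix Sigma
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# Eigendecomposition of Sigma (eigh is used because Sigma is symmetric)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sort components by decreasing eigenvalue (variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

# Keep the top k components and transform: y = W^T x, applied row-wise
k = 2
Y = Xc @ W[:, :k]
print(eigvals / eigvals.sum())  # proportion of variance explained per component
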


Machine Learning

K-means Clustering


K-means is a partitioning method. It requires a prespecified number of clusters (k). The algorithm divides the data into that number of clusters and then iterates until the cluster assignments converge. K-means clusters by minimizing the within-cluster sum of squares.

Formally, K-means clustering works by clustering N items into K clusters called C_k , where k=1,...,K. An item can only belong to one cluster, so the clusters are disjoint. Each item n=1,...,N has a vector of features x_n , and \mu_k is the mean vector of the items in cluster C_k . The N items are assigned to clusters by minimizing the sum of squares

\displaystyle \sum_{k=1}^K \sum_{n \in C_k} \| x_n- \mu_k \|^2

Step 1. Assign items randomly to the K clusters C_k .

Step 2. Compute \mu_k for each cluster C_k . Note that \mu_k is a vector with one entry per feature, containing the mean of each feature over the items in C_k .

Step 3. Reassign every item to the cluster C_k whose mean vector \mu_k is closest to that item. Steps 2 and 3 are repeated until the cluster assignments no longer change.
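
A minimal NumPy sketch of these steps (the data and the kmeans helper are illustrative only; the edge case of an empty cluster is not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: assign items randomly to the K clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Step 2: compute the mean vector mu_k of each cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each item to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments unchanged: converged
            break
        labels = new_labels
    return labels, centers

# Example: two made-up Gaussian blobs in 2 dimensions
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
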

If the features do not have equal variance, or one variable is on a much larger scale than the others, that variable will dominate the distance calculations. Thus, it is common to normalize or standardize all features so they contribute equally.

It is easy to see how this algorithm becomes K-medoids if each cluster center is restricted to be an actual item (a medoid) rather than the mean. K-medoids is more robust to outliers.