Machine Learning

Exploratory Analysis

Exploratory analysis involves data cleaning, feature engineering, data summarization, and trend analysis.

It is common to be faced with a dataset with high dimensionality. This means that there are many possible features (or variables) that can be used in a machine learning algorithm or statistical model. Typically only a small set of these features will actually be informative to a model, and many features will be highly correlated. A large set of features can lead to very long processing times and overfitting. Dumping a large number of features into a model is a great way to get a garbage-in, garbage-out model. The features need to be reduced and/or transformed before being used to develop a model. Exploratory analysis explores the data for meaningful trends, creates meaningful features, reduces dimensionality, and cleans the data so that it is prepared for analysis.

An exploratory analysis can include some of these steps:

  • Remove meaningless features from the dataset. If a feature does not have a statistically significant relationship with the response variable y, it is probably not useful to include it in the model.
  • Calculate new features from the dataset.
  • Normalize or standardize the data.
  • Impute missing data.
  • Apply transformations such as a log transformation or a first-order difference.
  • Apply dimensionality reduction methods such as PCA to decorrelate and summarize the features.
  • Plot the data and explore trends.
  • Test hypotheses with the data.
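
A minimal pandas/scikit-learn sketch of a few of these steps on a small made-up data frame (all column names and values are hypothetical, not from a real dataset):

import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical raw feature data
df = pd.DataFrame({
    "income": [40000, 52000, np.nan, 61000, 250000],
    "age":    [23, 35, 41, np.nan, 52],
    "visits": [1, 3, 2, 8, 4],
})

# Impute missing data with the column median
df = df.fillna(df.median())

# Calculate a new feature and apply a log transformation to a skewed feature
df["income_per_visit"] = df["income"] / df["visits"]
df["log_income"] = np.log(df["income"])

# Standardize the features so they are on a comparable scale
X = scale(df[["log_income", "age", "income_per_visit"]])

# Plot the data and explore trends (requires matplotlib)
pd.plotting.scatter_matrix(df, figsize=(6, 6))
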
Machine Learning

Normalizing and Standardizing

There is a difference between normalizing and standardizing data. Both are common steps in preprocessing feature variables for a machine learning model.
Standardizing centers the data by subtracting the mean and scales it by the standard deviation, giving each feature a mean of 0 and unit variance. This is typically what is used to preprocess feature vectors in machine learning.

\displaystyle x_{new}=\frac{x-\bar{x}}{sd}

Normalizing scales the data between 0 and 1. Note that min-max normalization is sensitive to outliers: a single extreme value compresses the rest of the data into a narrow range.

\displaystyle x_{new}=\frac{x_i-min(x)}{max(x)-min(x)}

Scikit-learn in Python has some great preprocessing functions for standardizing and normalizing.

Standardize to 0 mean with unit variance

sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
 

Scale to the (0,1) range with min-max normalization (matching the formula above).

sklearn.preprocessing.minmax_scale(X, feature_range=(0, 1), axis=0, copy=True)

Note that sklearn.preprocessing.normalize(X, norm='l2') is a different operation: it rescales each sample (row) to unit norm rather than scaling each feature to a fixed range.
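
A quick usage sketch on a toy array (the values are made up; this assumes scikit-learn and NumPy are installed):

import numpy as np
from sklearn.preprocessing import scale, minmax_scale

# Toy feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_std = scale(X)          # each column: mean 0, unit variance
X_norm = minmax_scale(X)  # each column: scaled to the (0, 1) range

print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately [0, 0] and [1, 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))  # [0, 0] and [1, 1]
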
Tutorials

Easy Animation in R

I recently animated some figures using ggplot and ImageMagick. ImageMagick is an application independent of R. A set of ggplots is created in R, and ImageMagick is used in the command line to merge the set of plots into a gif.

To get started, download ImageMagick from https://www.imagemagick.org/script/binary-releases.php and follow the installation instructions for your operating system. I work on Windows, so this tutorial will use that.

[Animated gif: mapcrimes.gif]

library(ggmap)
#subset data to just one day for an example
crime <- crime[crime$time<'2010-01-05 0:00:00', ]
#create list of times used to save unique plots
times <- unique(crime$time)
#get map from ggmap
houston <- get_map(location = 'Houston', zoom = 10,
                   maptype = "roadmap",
                   color = "bw")

#loop through all times, and save each ggplot as a jpeg
for(i in 1:length(times)){
    t <- times[i]
    m <- ggmap(houston) + geom_point(data = crime[crime$time == t,],
                                     mapping = aes(x = lon, y = lat, color = offense),
                                     size = 5) +
        ggtitle(paste("Houston crime\n", t)) +
        theme(legend.position = "bottom") +
        scale_colour_discrete(drop = FALSE)
    #zero-pad the index so the frames sort in order for ImageMagick
    title <- sprintf("map%03d.jpg", i)
    jpeg(title)
    print(m)
    dev.off()
}

Once all the jpeg plots are saved to your directory, open a command prompt and navigate to the folder where those plots are saved. Then run the following command, which uses all the individual jpeg files to create one animated gif.

magick *.jpg mapcrimes.gif

Here is another little example.

[Animated gif: tscrimes.gif]

library(dplyr)
library(ggmap)

crime$date <- as.Date(crime$time)
crime <- crime[crime$date<'2010-03-01', ]
crime <- crime %>%
         group_by(date) %>%
         summarize(total=n())
times <- unique(crime$date)

for(i in 1:length(times)){
    t <- times[i]
    #zero-pad the index so the frames sort in order for ImageMagick
    title <- sprintf("tsplot%03d.png", i)
    png(title)
    plot(crime[crime$date <= t,], col = "blue", main = "Crimes in Houston Jan-Mar 2010",
         xlim = c(min(times), max(times)),
         ylim = c(200, 450))
    lines(crime[crime$date <= t,], lwd = 2, col = "blue")
    dev.off()
}

And create the gif in the command line using
magick *.png tscrimes.gif

P.S. The R package animation can produce the same animations without using the command line.

Machine Learning

Recommender Systems: Content-Based Filtering

Content-based filtering (CBF) is a type of recommender system that uses descriptions of items and a profile of the user’s interests to make recommendations. CBF matches item descriptions and attributes to the user’s selected interests and preferences. CBF often uses ML algorithms such as logistic regression or decision trees to make these matches, and the results are converted into a ranked list of recommendations. Content-based filtering relies on the item in question and the user’s own profile, not on other users’ preferences. The benefit of this method is that it does not require historic training data from other users.

Machine learning is used to build user profiles if they do not already exist.

The user profile is typically built from the user’s historic data to form a functional relationship representing how that user would rate a given item.

Decision tree classification of items based on features

In some cases this taste profile is already known, for example, a survey filled out when signing up for the website could build a user profile.

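As a rough sketch of the decision-tree idea above, the example below fits a classifier to hypothetical item feature vectors labeled with one user's past likes and dislikes, then ranks unseen items by the predicted probability of a like (all feature names and values are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical item features: [action, comedy, romance] content scores per movie
item_features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.9],
])
# 1 = this user liked the item, 0 = disliked (the user's history/profile)
liked = np.array([1, 1, 0, 0])

# Fit a decision tree relating item content to the user's preferences
profile_model = DecisionTreeClassifier(max_depth=2, random_state=0)
profile_model.fit(item_features, liked)

# Score new, unseen items and rank them by predicted probability of a like
new_items = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])
scores = profile_model.predict_proba(new_items)[:, 1]
ranking = np.argsort(scores)[::-1]  # indices of new items, best first
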

Pros

  • Does not depend on data from other users
  • Can recommend unpopular or new items
  • Easy to explain why an item was recommended, since recommendations are based on the item’s features

Cons

  • Limited discovery: cannot recommend types of items outside the user’s existing interest profile
Machine Learning

Recommender Systems: Collaborative Filtering

A recommender system is an algorithm that helps users discover new content. It is popular in online marketing, and recommender systems can significantly increase an online vendor’s revenue.

Collaborative filtering is a type of recommender system. It is an algorithm that makes automatic predictions (recommendations) for a user based on the preferences of a set of similar users. It requires a large amount of user data. Collaborative filtering does not necessarily require machine learning; the basic approach uses deterministic similarity calculations.

Cosine similarity or Pearson correlation is used to measure the similarity between two users and to create the neighborhood of the N most similar users.

Let x and y be the rating vectors of two users.

Cosine similarity

\displaystyle sim(x,y)=\cos(\theta)=\frac{x \cdot  y}{|x||y|}=\frac{\Sigma(xy)}{\sqrt{\Sigma x^2} \sqrt{\Sigma y^2}}

\cos(\theta)=1 , \theta=0 : x and y are equivalent
\cos(\theta)=0 , \theta=\frac{\pi}{2} : x and y are dissimilar (orthogonal)

The cosine similarity is equivalent to the Pearson correlation coefficient if the x and y vectors are first centered by subtracting their means.

Users tend to rate on different scales. Rating vectors can be normalized, for example by dividing by each user’s mean rating or by subtracting it.

If there are many users, calculations can be computationally intensive. It is common to cluster users into groups and perform recommendations for each group of users.

Item to item collaborative filtering is a similar method that finds a neighborhood of similar items instead of users. Estimates of ratings are based on ratings for similar items for each user.
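
A minimal sketch of user-based collaborative filtering on a small made-up ratings matrix (users in rows, items in columns, 0 meaning unrated; for simplicity, unrated entries are treated as zeros when computing similarity). A missing rating is estimated as the similarity-weighted average of the ratings from the N most similar users who rated that item:

import numpy as np

# Hypothetical user-item ratings matrix (rows = users, columns = items, 0 = unrated)
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def cosine_sim(x, y):
    # Cosine similarity between two rating vectors
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def predict(R, user, item, n_neighbors=2):
    # Similarity of the target user to every other user
    sims = np.array([cosine_sim(R[user], R[u]) if u != user else -1.0
                     for u in range(R.shape[0])])
    # Keep the N most similar users who actually rated the item
    rated = np.where(R[:, item] > 0)[0]
    neighbors = rated[np.argsort(sims[rated])[::-1][:n_neighbors]]
    # Similarity-weighted average of the neighbors' ratings
    return sims[neighbors] @ R[neighbors, item] / sims[neighbors].sum()

print(predict(R, user=0, item=2))  # predicted rating of item 2 for user 0
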

Pros

  • Does not require feature data for the items. No feature selection needed.
  • Can recommend new types of items to a user, since recommendations come from similar users rather than the user’s own history.

Cons

  • Does not work well with small datasets.
  • Cannot recommend new/unrated items.
  • Tends to recommend popular items (popularity bias).
Tutorials

A Simple Tutorial of the ggmap package in R

The R package ggmap brings ggplot2 plotting capabilities to Google Maps. I have recently used it for a few projects and wanted to save a few quick sample plots so I can easily return to creating similar ones. Here are a few examples of ggmap plots.

The first example uses the ‘crime’ dataset that comes in the ggmap package. This dataset shows crimes in the Houston area.


require(ggmap)
require(ggplot2)

houston <- get_map(location = 'Houston', zoom = 10, maptype = "roadmap")

ggmap(houston) +
    geom_point(data = crime[crime$offense == "murder",],
               mapping = aes(x = lon, y = lat),
               color = "hotpink", shape = 3, size = 3) +
    ggtitle("ggmap example using\nthe crime R dataset") +
    theme(plot.title = element_text(hjust = 0.5))

This second plot is an example using the satellite map type. It shows earthquakes near Fiji. The data is from the ‘quakes’ dataset in the R package ‘datasets’.


require(datasets)
require(ggmap)
require(ggplot2)

fiji <- get_map(location = c(mean(quakes$long), mean(quakes$lat)), zoom = 5,
                maptype = "satellite")

ggmap(fiji) +
    geom_point(data = quakes, mapping = aes(x = long, y = lat, color = depth, size = mag)) +
    scale_colour_gradient(low = "hotpink", high = "yellow") +
    ggtitle("ggmap example using\nthe Fiji earthquakes R dataset") +
    theme(plot.title = element_text(hjust = 0.5))

 

Machine Learning

Principal Component Analysis

PCA is used to reduce the dimensionality of a large data set, such as a set of feature data x. The high-dimensional data is summarized, via an orthogonal transformation, into uncorrelated principal components.

The dimension reduction is done by keeping only the eigenvectors (principal components) with the largest eigenvalues, i.e. the directions that explain the most variance. The components are ordered by the variance they explain, so an elbow (scree) plot of the eigenvalues is used to choose how many principal components to keep. The set of eigenvectors forms an orthogonal basis derived from the covariance matrix \Sigma .

Covariance Matrix

The covariance matrix \Sigma is the matrix of all pairwise covariances between a set of variables X_1, \dots, X_n . The (i, j) entry of \Sigma is \Sigma_{ij}=cov(X_i, X_j) .

Eigenvalues

Every eigenvector has a corresponding eigenvalue. The eigenvalue represents the amount of variance in the direction of its eigenvector. The eigenvalues of \Sigma are found by solving

\det(\Sigma-\lambda I)=0 for the eigenvalues \lambda .

Eigenvectors

The principal components are the eigenvectors of the covariance matrix \Sigma . Each eigenvector represents a direction. For n-dimensional data there are n eigenvectors. The eigenvector x of \Sigma corresponding to a specific eigenvalue \lambda is found by solving

(\Sigma-\lambda I)x=0

for x, which results in a set of n eigenvectors and n eigenvalues for an n \times n matrix.

Transformation

The eigenvectors form an orthogonal matrix W that is used as a transformation matrix on the features, creating a set of new, uncorrelated features from the data: y=W^T x
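
A minimal NumPy sketch of this procedure on made-up correlated data (the variable names are illustrative): center the features, form the covariance matrix \Sigma , take its eigendecomposition, sort by eigenvalue, and project onto the leading components.

import numpy as np

# Toy data: 100 samples, 3 features, two of them strongly correlated
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(100, 1)),
               2 * z + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# Center the data, then form the covariance matrix Sigma
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# Eigendecomposition of Sigma (eigh is used because Sigma is symmetric)
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sort components by decreasing eigenvalue (variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

# Keep the top k components and transform: y = W^T x, applied row-wise
k = 2
Y = Xc @ W[:, :k]
print(eigvals / eigvals.sum())  # proportion of variance explained per component
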


Machine Learning

K-means Clustering


K-means is a partitioning method. It requires a prespecified number of clusters (k). The algorithm divides the data into that number of clusters and then iterates until the cluster assignments converge. K-means clusters by minimizing the within-cluster sum of squares.

Formally, K-means clustering works by clustering N items into K clusters called C_k , where k=1,...,K. An item can only belong to one cluster, so the clusters are disjoint. Each item n=1,...,N has a vector of features x_n , and \mu_k is the mean vector of the items in cluster C_k . The N items are assigned to clusters by minimizing the sum of squares

\displaystyle \sum_{k=1}^K \sum_{n \in C_k} \| x_n- \mu_k \|^2

Step 1. Assign items randomly to the K clusters C_k .

Step 2. Compute \mu_k for each cluster C_k . Note that \mu_k is a vector with one entry per feature, containing the mean of each feature over the items in C_k .

Step 3. Reassign every item to the cluster C_k whose mean vector \mu_k is closest to that item. Steps 2 and 3 are repeated until the cluster assignments no longer change.
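
A minimal NumPy sketch of these steps (the data and the kmeans helper are illustrative only; the edge case of an empty cluster is not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: assign items randomly to the K clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Step 2: compute the mean vector mu_k of each cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each item to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments unchanged: converged
            break
        labels = new_labels
    return labels, centers

# Example: two made-up Gaussian blobs in 2 dimensions
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
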

If the features do not have equal variance, or one variable is on a much larger scale than the others, that variable will dominate the distance calculations. Thus, it is common to normalize or standardize all features so they contribute equally.

It is easy to see how this algorithm becomes K-medoids if each cluster center is restricted to be an actual item (a medoid) rather than the mean. K-medoids is more robust to outliers.