Exploratory analysis involves data cleaning, feature engineering, data summarization, and trend analysis.
It is common to be faced with a dataset of high dimensionality, meaning there are many possible features (or variables) that could be used in a machine learning algorithm or statistical model. Typically only a small subset of these features is actually informative to a model, and many features will be highly correlated. A large set of features can lead to very long processing times and overfitting: dumping a large number of features into a model is a great way to get a garbage-in, garbage-out model. The features need to be reduced and/or transformed before being used to develop a model. Exploratory analysis explores the data for meaningful trends, creates meaningful features, reduces dimensionality, and cleans the data so that it is prepared for analysis.
An exploratory analysis can include some of these steps:
- Remove meaningless features from the dataset. If a feature does not have a statistically significant relationship with the response variable y, it is probably not useful to include it in the model.
- Calculate new features from the dataset.
- Normalize or standardize the data
- Impute missing data
- Apply transformations such as a log transformation or a first-order difference.
- Apply dimensionality reduction methods such as PCA to decorrelate and summarize the features
- Plot the data and explore trends
- Test hypotheses with the data
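Two of the transformations above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy series, not a full preprocessing pipeline:

```python
import math

# Hypothetical toy series to illustrate two common transformations.
series = [100.0, 110.0, 125.0, 150.0, 180.0]

# Log transformation: compresses large values and can stabilize variance.
logged = [math.log(v) for v in series]

# First-order difference: removes a trend by looking at period-to-period changes.
diffed = [series[i] - series[i - 1] for i in range(1, len(series))]

print(diffed)  # [10.0, 15.0, 25.0, 30.0]
```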
There is a difference between normalizing and standardizing data. Both are common steps in preprocessing feature variables for a machine learning model.
Standardizing centers the data by subtracting the mean, and scales by the standard deviation. This centers the data around 0 as the mean, with unit variance. This is typically what is used to preprocess feature vectors in machine learning.
Normalizing scales the data to the range 0 to 1. Note that min-max normalization is sensitive to outliers: a single extreme value is mapped to an endpoint of the range and compresses the rest of the data into a narrow band.
Scikit-learn in Python has some great preprocessing modules for standardization and normalization.
Standardize to 0 mean with unit variance
sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Normalize to the (0, 1) range.
sklearn.preprocessing.minmax_scale(X, feature_range=(0, 1), axis=0, copy=True)
Note that sklearn.preprocessing.normalize(X, norm='l2', axis=1) does something different: it rescales each sample (row) to unit norm rather than mapping each feature into (0, 1).
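To make the distinction concrete, here is a minimal pure-Python sketch of what the two scikit-learn operations compute on a single feature column (using the population standard deviation, as sklearn's scaler does):

```python
import math

def standardize(x):
    """Zero mean, unit variance: what sklearn.preprocessing.scale does per column."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))  # population std
    return [(v - mean) / std for v in x]

def min_max_normalize(x):
    """Rescale to the [0, 1] range: what MinMaxScaler does per column."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

x = [2.0, 4.0, 6.0, 8.0]
print(standardize(x))        # centered on 0 with unit variance
print(min_max_normalize(x))  # endpoints map to exactly 0.0 and 1.0
```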
Content-based filtering (CBF) is a type of recommender system that uses a description of the item and a profile of the user's interests to make recommendations. CBF matches descriptions and attributes of items to the user's selected interests/preferences, using ML algorithms such as logistic regression or decision trees, and converts the results into a ranked recommendation. Content-based filtering systems rely on the item in question, not on other users' preferences. The benefit of this method is that it does not require historic data/training data/data from other users.
Machine learning is used to build user profiles if they do not already exist.
The user profile is typically built from historic user data to form a functional relationship representing how a user would rate a certain item (for example, a decision tree that classifies items based on their features).
In some cases this taste profile is already known; for example, a survey filled out when signing up for the website could build a user profile.
- Does not depend on data from other users
- Can recommend unpopular or new items
- Easy to explain why items were recommended, because recommendations are based on item features
- Is not predictive (cannot recommend new types of items the user has not already shown interest in)
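The core idea can be sketched with a simple weighted match between item feature vectors and a user profile. The item names, feature names, and scores below are all hypothetical; a real system would learn the profile weights from the user's history with a model such as logistic regression:

```python
# Items described by feature scores (hypothetical genres and values).
items = {
    "Movie A": {"action": 0.9, "comedy": 0.1, "drama": 0.0},
    "Movie B": {"action": 0.1, "comedy": 0.8, "drama": 0.3},
    "Movie C": {"action": 0.2, "comedy": 0.2, "drama": 0.9},
}

# User profile: interest weights over the same features,
# learned from history or taken from a signup survey.
user_profile = {"action": 0.7, "comedy": 0.1, "drama": 0.2}

def score(item_features, profile):
    # Simple match score: weighted sum of item features by user interests.
    return sum(profile[f] * w for f, w in item_features.items())

# Rank items by match score, best first.
ranking = sorted(items, key=lambda name: score(items[name], user_profile), reverse=True)
print(ranking)  # ['Movie A', 'Movie C', 'Movie B']
```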
A recommender system is an algorithm that helps users discover new content. It is popular in online marketing. Recommender systems are known to drastically increase an online vendor’s revenue.
Collaborative filtering is a type of recommender system. It is an algorithm that makes automatic predictions (recommendations) to a user based on the preferences of a set of similar users. It requires a large amount of data from users. Collaborative filtering does not necessarily require machine learning; it can be based on deterministic similarity calculations.
Cosine similarity or Pearson correlation is used to measure the similarity between two users and to create a neighborhood of N similar users.
Let x and y be vectors of users' ratings. The cosine similarity between them is
sim(x, y) = (x . y) / (||x|| ||y||)
sim(x, y) = 1: x and y are equivalent
sim(x, y) = 0: x and y are dissimilar
The cosine similarity is equivalent to the Pearson correlation coefficient if the x and y vectors are first centered by subtracting their means.
Users tend to rate on different scales. Rating vectors can be normalized by dividing by the user's mean rating or by taking the difference from the mean.
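A minimal sketch of both points, on two hypothetical rating vectors: cosine similarity on the raw ratings, and cosine similarity after mean-centering, which equals the Pearson correlation:

```python
import math

def cosine_similarity(x, y):
    # sim(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def mean_center(x):
    mu = sum(x) / len(x)
    return [v - mu for v in x]

# Two users' ratings of the same five items (hypothetical values).
u = [5, 4, 1, 2, 3]
v = [4, 5, 2, 1, 3]

print(cosine_similarity(u, v))                            # raw cosine similarity
print(cosine_similarity(mean_center(u), mean_center(v)))  # equals the Pearson correlation
```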
If there are many users, calculations can be computationally intensive. It is common to cluster users into groups and perform recommendations for each group of users.
Item to item collaborative filtering is a similar method that finds a neighborhood of similar items instead of users. Estimates of ratings are based on ratings for similar items for each user.
- Does not require feature data for the items. No feature selection needed.
- Can be predictive.
- Does not work well with small datasets.
- Cannot recommend new/unrated items.
- Defaults to recommending popular items. Popularity bias.
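The item-to-item variant can be sketched as a similarity-weighted average: a user's rating for a target item is estimated from their ratings of the other items, weighted by item-item cosine similarity. The ratings matrix below is hypothetical, and for simplicity the sketch computes similarity from full rating vectors (in practice the target user's missing rating would be excluded):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Rows are items, columns are users (hypothetical ratings).
item_ratings = {
    "item1": [5, 3, 4],
    "item2": [4, 3, 5],
    "item3": [1, 5, 2],
}

def predict(target_item, user_index, ratings):
    """Similarity-weighted average of the user's ratings on the other items."""
    num = den = 0.0
    for item, vec in ratings.items():
        if item == target_item:
            continue
        sim = cosine(ratings[target_item], vec)
        num += sim * vec[user_index]
        den += abs(sim)
    return num / den

print(predict("item1", 0, item_ratings))
```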
PCA (principal component analysis) is used to reduce the dimensions of a large dataset, such as a set of feature data. The high-dimensional data is summarized using an orthogonal transformation into uncorrelated principal components.
The dimension reduction is done by selecting/using only the eigenvectors (principal components) with large eigenvalues (the vectors that explain the most variance). The first component explains the most variance in the data, so an elbow plot of the eigenvalues is used to determine the number of principal components that are significant for an analysis. The set of eigenvectors forms an uncorrelated orthogonal basis for the covariance matrix.
The covariance matrix S is a matrix of all possible covariances between a set of variables x_1, ..., x_n. The (i, j) entry of S is Cov(x_i, x_j).
Every eigenvector has a corresponding eigenvalue. The eigenvalue represents the amount of variance in the direction of its eigenvector. The eigenvalues of S are found by solving
det(S - lambda * I) = 0
for the eigenvalues lambda_1, ..., lambda_n.
The principal components are the eigenvectors of the covariance matrix S. Each eigenvector represents a direction; for n-dimensional data there are n eigenvectors. The eigenvectors of S are found by solving
(S - lambda_i * I) v_i = 0
for v_i for each specific eigenvalue lambda_i, which results in a set of n eigenvectors and n eigenvalues for an n x n matrix S.
The eigenvectors form an orthogonal matrix that is used as a transformation matrix on the features to create a set of new, uncorrelated features from the data.
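The whole procedure can be sketched with NumPy's eigen-decomposition on a toy dataset. This is a bare-bones illustration (in practice sklearn.decomposition.PCA handles centering, sorting, and projection for you); the data below is synthetic, with the second feature deliberately constructed to be nearly a multiple of the first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data with strong correlation between the two features.
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# 1. Center the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigen-decomposition: the eigenvectors are the principal components,
#    the eigenvalues are the variance explained along each component.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by explained variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top component to reduce the 2-D data to 1-D.
Z = Xc @ eigvecs[:, :1]

print(eigvals / eigvals.sum())  # fraction of variance explained per component
```

Because the two features are nearly collinear, the first component captures almost all of the variance; this is exactly what the elbow plot of the eigenvalues would show.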