This is a study note on the book An Introduction to Statistical Learning with Applications in R, with my own experimental R-code for each topic. Cheers!

Navigation

00. Introduction
01. Linear Model
02. Tree-Based Method
03. Unsupervised Learning
04. PCA - A Deeper Dive
05. A Side Note on Hypothesis Test and Confidence Interval


Introduction

Suppose we observe a quantitative response $Y$ and $p$ different predictors $X_1, X_2, \dots, X_p$. We assume that there is an underlying relationship between $Y$ and $X = (X_1, \dots, X_p)$:

$Y = f(X) + \epsilon$

We want to estimate $f$ mainly for two purposes:

  • prediction: in the case where $X$ is readily available but $Y$ is not, we estimate $f$ with $\hat{f}$ and use it to predict $Y$ with $\hat{Y} = \hat{f}(X)$. Here $\hat{f}$ can be a black box, such as a highly non-linear method that offers accuracy over interpretability.
  • inference: in the case where we are more interested in how $Y$ is affected by changes in each $X_j$. We need to know the exact form of $f$. For example, linear models are often used, which offer interpretable inference but are sometimes inaccurate.

Feature

There are important and subtle differences between a feature and a variable.

  • variable: raw data
  • feature: data that is transformed or derived from the raw data. A feature can be more predictive and have a more direct relationship with the target variable, but it isn’t immediately represented by the raw variables.

The need for feature engineering arises from limitations of modeling algorithms:

  1. the curse of dimensionality (leading to statistical insignificance)
  2. the need to represent the signal in a meaningful and interpretable way
  3. the need to capture complex signals in the data accurately
  4. computational feasibility when the number of features gets large.

Feature Transformation

The Occam’s Razor principle states that simpler solutions are more likely to be correct than complex ones.

Consider the following example of modeling a response that grows exponentially with the predictor. After applying a log transformation, we can view the relationship from a different viewpoint provided by the new feature space, in which a simpler model may achieve more predictive power than a complex model in the original input space.

library(ggplot2)
library(gridExtra)

x1 <- runif(100, 1, 10)
x2 <- exp(x1)
df <- data.frame(x1 = x1, x2 = x2)
df$logx2 <- log(df$x2)

options(repr.plot.width=6, repr.plot.height=3)
p1 <- ggplot(data = df, aes(x = x1, y = x2)) + geom_point(size=0.3)
p2 <- ggplot(data = df, aes(x = x1, y = logx2)) + geom_point(size=0.3)
grid.arrange(p1, p2, nrow=1)

[Figure: scatter plots of x2 vs x1 (left) and log(x2) vs x1 (right)]


Consider another classification problem, in which we want to identify the boundary between two classes. A complex model would draw a circle as the divider. A simpler approach would be to create a new feature: the distance of each point from the origin. The divider then becomes a much simpler straight line.

x1 <- runif(1000,-1,1)
x2 <- runif(1000,-1,1)
class <- ifelse(sqrt(x1^2 +x2^2) < 0.5, "A", "B")

df <- data.frame(x1 = x1, x2 = x2, class = class)
p1 <- ggplot(data = df, aes(x = x1, y = x2, color = class)) +
geom_point(size=1)

df$dist_from_0 <- sqrt(df$x1^2 + df$x2^2)
p2 <- ggplot(data = df, aes(x = 0, y = dist_from_0, color = class)) +
geom_point(position = "jitter", size=1) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
annotate("segment", x = -0.5, xend = 0.5, y = 0.5, yend = 0.5)

options(repr.plot.width=8, repr.plot.height=3)
grid.arrange(p1, p2, nrow=1)

[Figure: classes in the original (x1, x2) space (left) and in the distance-from-origin feature space (right)]

Feature Selection

For $p$ predictor variables, there are a total of $2^p$ possible models. We use feature selection to choose a smaller subset of the variables to model; a few greedy strategies are listed below, followed by a short R sketch.

  • Forward Selection We begin with the null model containing only an intercept. We then fit simple linear regressions and add the variable that gives the lowest RSS. We keep adding the next best variable in the same way until some stopping rule is met.
  • Backward Selection We begin with the full model and remove the variable with the largest coefficient p-value. Re-fit and remove the next.
  • Mixed Selection We begin with the null model and apply forward selection. Whenever the p-value for a variable in the model exceeds a threshold, we remove it.
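
A rough illustration (my own, not from the book): the step function in base R performs stepwise selection using AIC rather than raw p-values, but the mechanics are the same.

# forward and backward stepwise selection on mtcars via AIC
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept-only model
full_model <- lm(mpg ~ ., data = mtcars)   # model with all predictors

forward_fit <- step(null_model,
                    scope = list(lower = formula(null_model), upper = formula(full_model)),
                    direction = "forward", trace = FALSE)

backward_fit <- step(full_model, direction = "backward", trace = FALSE)

summary(forward_fit)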

Regression Problems

In regression problems the response variable is quantitative, and we use the mean squared error, or MSE, to measure the quality of the estimator $\hat{f}$:

$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$

A fundamental property of statistical learning that holds regardless of the particular dataset and statistical method is that as model flexibility increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE.
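
A quick way to see this (my own simulation, not from the book) is to fit polynomials of increasing degree to simulated data and track the training and test MSE.

set.seed(1)
df <- data.frame(x = runif(200, -2, 2))
df$y <- sin(2 * df$x) + rnorm(200, sd = 0.3)
train_idx <- sample(nrow(df), 100)

mse <- function(actual, pred) mean((actual - pred)^2)

flex <- sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d), data = df[train_idx, ])
  c(degree = d,
    train = mse(df$y[train_idx], predict(fit)),
    test = mse(df$y[-train_idx], predict(fit, newdata = df[-train_idx, ])))
})

t(flex)   # the training MSE keeps falling; the test MSE eventually turns back up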

Bias vs Variance

The bias-variance trade-off decomposes the expected test MSE at a single data point $x_0$:

$E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\epsilon)$

where:

  • variance refers to the variance of the estimator $\hat{f}$ across different training datasets. A highly flexible $\hat{f}$ leads to high variance, as even small changes in the data induce changes in $\hat{f}$’s form.
  • bias refers to the error due to estimator inflexibility. For example, using a linear $\hat{f}$ to estimate a non-linear relationship leads to high bias.

Classification Problems

In classification problems the response variable is qualitative, and we use the error rate to measure the quality of the estimator: $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$.

The Bayes Classifier

The Bayes classifier predicts the class based on the combination of the prior probability and the likelihood given the predictor values. With categories $j = 1, \dots, J$ and predictor values $x_0$, an observation is assigned to the category $j$ with the maximum posterior probability $\Pr(Y = j \mid X = x_0)$.

By Bayes’ theorem:

$\Pr(Y = j \mid X = x_0) = \frac{\pi_j f_j(x_0)}{\sum_{k=1}^{J} \pi_k f_k(x_0)}$, where $\pi_j = \Pr(Y = j)$ is the prior and $f_j(x_0)$ is the likelihood of $x_0$ given class $j$.

The naive Bayes classifier assumes independence between the predictors $X_1, \dots, X_p$ given the class, and the likelihood becomes:

$f_j(x_0) = \prod_{k=1}^{p} \Pr(X_k = x_{0k} \mid Y = j)$

When $X_k$ is continuous, the Gaussian naive Bayes classifier assumes that $X_k \mid Y = j \sim N(\mu_{jk}, \sigma_{jk}^2)$.

In R, we use the naiveBayes function from the e1071 package to predict Survival from the Titanic dataset.

library(e1071)
library(caret)
set.seed(9999)
df=as.data.frame(Titanic)
df <- df[rep.int(seq_len(nrow(df)), df$Freq),]
df$Freq <- NULL

# create a 75/25 train/test split
partition <- createDataPartition(df$Survived, list=FALSE, p=0.75)
train <- df[partition, ]
test <- df[-partition, ]

# fit naive bayes classifier
bayes <- naiveBayes(Survived ~ ., data=train)

# 10-fold validation
bayes.cv <- train(Survived ~ ., data=train,
method = "nb",
trControl = trainControl(method = "cv",
number = 10))

Viewing the model results, the prior probabilities are shown in the “A-priori probabilities” section.

> bayes

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
No Yes
0.6767554 0.3232446

The likelihoods are shown in the “Conditional probabilities” section.

Conditional probabilities:
Class
Y 1st 2nd 3rd Crew
No 0.09481216 0.10554562 0.34615385 0.45348837
Yes 0.28838951 0.15730337 0.24344569 0.31086142

Sex
Y Male Female
No 0.91323792 0.08676208
Yes 0.53183521 0.46816479

Age
Y Child Adult
No 0.03130590 0.96869410
Yes 0.06928839 0.93071161


The confusionMatrix function from the caret package returns a test Accuracy of 0.7978, which corresponds to a test error rate of about 0.202 for the naive Bayes model.

Theoretically, the Bayes classifier produces the lowest possible error rate, but it requires knowing the true conditional distribution $\Pr(Y \mid X)$, which is not the case with real data. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.

> confusionMatrix(predict(bayes, test), test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 343 82
Yes 29 95

Accuracy : 0.7978

With 10-fold cross validation, the test error rate is 0.2095.

> confusionMatrix(predict(bayes.cv, test), test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 332 75
Yes 40 102

Accuracy : 0.7905

K-Nearest Neighbors

The KNN classifier estimates the conditional probability of each class based on the set of $K$ points in the training data that are closest to $x_0$, denoted $\mathcal{N}_0$:

$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = j)$

Note that this estimated probability is one minus the local error rate of predicting class $j$, and therefore with KNN we are picking the class $j$ that minimizes the local error rate given $x_0$.

When $K = 1$, the estimator produces a training error rate of $0$, but the test error rate might be quite high due to overfitting; the method is very flexible, with low bias and high variance. As $K$ grows, the method becomes less flexible and the decision boundary approaches a linear one, with higher bias and lower variance.

In R, we use the knn function (with K=1) in the class library to predict Survival from the Titanic dataset.

library(e1071)
library(class)
set.seed(9999)
df <- as.data.frame(Titanic)
df <- df[rep.int(seq_len(nrow(df)), df$Freq),]
df$Freq <- NULL

# The "knn" function in the "class" library only works with numeric data
df$iClass <- as.integer(df$Class)
df$iSex <- as.integer(df$Sex)
df$iAge <- as.integer(df$Age)
df$iSurvived <- as.integer(df$Survived)

# Create random 75/25 train/test split
train_ind <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))
df_train <- df[train_ind, c(5:7)]
df_test <- df[-train_ind, c(5:7)]
df_train_cl <- df[train_ind, 4]
df_test_cl <- df[-train_ind, 4]

# Fit KNN
knn <- knn(train = df_train,
test = df_test,
cl = df_train_cl,
k=1)

# 10-fold CV
knn.cv <- tune.knn(x = df[, c(5:7)],
y = df[, 4],
k = 1:20,
tunecontrol=tune.control(sampling = "cross"),
cross=10)

The confusion matrix shows an error rate of 0.1942.

> confusionMatrix(knn, df_test_cl)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 364 101
Yes 6 80

Accuracy : 0.8058


Now run KNN with the tune wrapper to perform 10-fold cross validation. The result recommends KNN with K=1, which turns out to be the same as what we originally tested.

> knn.cv

Parameter tuning of ‘knn.wrapper’:

- sampling method: 10-fold cross validation

- best parameters:
k
1

- best performance: 0.2157576

Summarizing the test error rate for naiveBayes and KNN.

naiveBayes (cv10):          0.2095
KNN (cv10, K=1): 0.1942


Linear Model

Simple Linear Regression

A simple linear regression assumes that:

$Y = \beta_0 + \beta_1 X + \epsilon$

Coefficient Estimate

Given data points $(x_1, y_1), \dots, (x_n, y_n)$, let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. We define the residual sum of squares, or RSS, as:

$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Minimizing RSS as the objective function, we can solve for $\hat{\beta}_0$ and $\hat{\beta}_1$:

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$

Coefficient Estimate - Gaussian Residual

Furthermore, if we assume the individual error terms are i.i.d. Gaussian, i.e. $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$,

we now have a conditional pdf of $y_i$ given $x_i$:

$f(y_i \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)$

The log-likelihood for the parameters $\beta_0$, $\beta_1$, and $\sigma^2$ can be written as follows:

$\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$

Setting the partial derivative of the log-likelihood with respect to each parameter to zero, we obtain the maximum likelihood estimates, which for $\beta_0$ and $\beta_1$ coincide with the least squares estimates above.

MLE, or maximum likelihood estimation, is a frequentist approach for estimating model parameters based on the assumed underlying model form and error distribution, by maximizing the probability of seeing what we saw in the data. MLE is a near-optimal method of estimation, and is optimal in many cases.

If we assume a Gaussian error distribution and a linear model, then the conclusion above states that maximizing the likelihood is the SAME as minimizing the RSS objective function.

More on frequentist vs Bayesian approaches can be found in this SOA work paper. Also see this CMU lecture note for more detail regarding the derivation.
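
Since the MLE here coincides with the least squares solution, a quick sanity check (my own illustration, not from the book) is to compare the closed-form estimates with the output of lm():

# the closed-form least squares estimates match the coefficients returned by lm()
set.seed(42)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # agrees with the closed-form estimates above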

Model Fit

Recall that we assume there is an underlying relationship between $Y$ and $X$: $Y = f(X) + \epsilon$.

The residual standard error, or RSE, estimates the standard deviation of $\epsilon$: $RSE = \sqrt{RSS/(n-2)}$. Note that RSE is an absolute measure of the lack of fit of the model and depends on the units of $Y$.

The $R^2$ statistic measures the proportion of variance explained by the regression: $R^2 = 1 - RSS/TSS$, where $TSS = \sum_i (y_i - \bar{y})^2$ is the total sum of squares, which measures the total variance in $Y$.

In a simple regression setting, $R^2$ equals the squared correlation between $X$ and $Y$.

Residual Plot

Here is a good article on how to interpret your residual plot.

Summarizing the approaches for different residual issues:

  1. Y-axis Unbalanced: transform target
  2. X-axis Unbalanced: transform predictor
  3. Heteroscedasticity: transform target/predictor
  4. Non-Linearity: transform predictor; create non-linear model
  5. Outlier: transform target/predictor; remove/assess the outlier;

Multiple Linear Regression

The multiple linear regression takes the form:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$

F-statistic

We use the F-statistic to test the null hypothesis that there is no relationship between the predictors and the target variable, $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$.

We calculate the F-statistic as follows:

$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$

We expect $F \approx 1$ if $H_0$ is true, and $F$ substantially greater than 1 otherwise.

Potential Problems

Non-Linearity Residual plots are useful to detect whether the underlying relationship is non-linear. One of the solutions to this problem is to fit transformations of the predictors such as $\log X$, $\sqrt{X}$, and $X^2$.

Collinearity This is where two or more predictors are closely correlated with each other, or “collinear”. Collinearity reduces the accuracy of the coefficient estimates by increasing their standard errors and p-values. One way to detect collinearity is from the correlation matrix or the “pairs” plot in R.

There are two solutions: drop one of the collinear predictors, or combine the collinear predictors into a single predictor.

Multi-collinearity This is where collinearity exists between three or more predictors even when none of the pairwise correlations are high. The best way to assess multi-collinearity is through the variance inflation factor, or VIF.

A VIF of 1 indicates no collinearity. As a rule of thumb, a VIF that exceeds 5 or 10 indicates a problematic amount of collinearity.
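
As an illustration (my own, not from the book), the VIF of a predictor can be computed as $1/(1 - R^2)$ from regressing that predictor on all the others; the car package also provides a vif function.

# VIF for the "wt" predictor in mtcars, computed by hand
r2_wt <- summary(lm(wt ~ . - mpg, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt

# the same quantity via the car package (if installed)
# library(car)
# vif(lm(mpg ~ ., data = mtcars))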

Outliers Use residual plot to detect and potentially remove outliers.

Linear Model Selection and Regularization

Linear models can be improved by using alternative fitting procedures, which can produce better prediction accuracy and model interpretability.

  • Subset Selection Select a subset of the original predictors and then fit the model.

  • Shrinkage/Regularization Fit the model by shrinking coefficient estimates towards zero, therefore reducing variances of the coefficient estimates.

  • Dimension Reduction Project the predictors onto an M-dimensional subspace, then use the M projections as predictors to fit a linear model with least squares.

Subset Selection

The best subset selection fits a separate least squares regression for each possible combination of the $p$ predictors, creating $2^p$ models to compare.

The forward stepwise selection and backward stepwise selection fit a total of $1 + p(p+1)/2$ models. Specifically, at each step a predictor is added/removed from the model only if it gives the greatest additional improvement (lowest RSS or highest $R^2$) among all the remaining predictors.

After the selection process, we need to determine the optimal model that gives the lowest potential test error, either through:

  • Cross-validation, or
  • Adjusted train error

    • Example 1: $C_p = \frac{1}{n}\left(RSS + 2 d \hat{\sigma}^2\right)$, where $d$ is the number of predictors in the subset and $\hat{\sigma}^2$ is the variance of the error estimated using the full model containing all predictors. Essentially, a penalty term is added to the training RSS to adjust for the fact that the training error tends to underestimate the test error.

    • Example 2: $AIC = \frac{1}{n\hat{\sigma}^2}\left(RSS + 2 d \hat{\sigma}^2\right)$. The AIC, or Akaike information criterion, uses the maximum likelihood function to assess the relative quality of statistical models given a set of data. In linear models with Gaussian error terms, maximizing the likelihood is equivalent to minimizing RSS, and therefore $C_p$ and AIC are proportional to each other.

    • Example 3: $BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^2\right)$, which places a heavier penalty on models with many predictors, since $\log n > 2$ for $n > 7$.

    • Example 4: adjusted $R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)}$. While RSS always decreases in the stepwise selection as the number of predictors increases, $RSS/(n - d - 1)$ may or may not decrease. (See the R sketch after this list.)
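
A sketch of how these criteria are used in practice (my own example, not from the book): the regsubsets function in the leaps package reports $C_p$, BIC, and adjusted $R^2$ for the best model of each size.

library(leaps)

# best subset selection on mtcars, comparing the best model of each size
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
fit_summary <- summary(fit)

which.min(fit_summary$cp)     # model size preferred by Cp
which.min(fit_summary$bic)    # model size preferred by BIC
which.max(fit_summary$adjr2)  # model size preferred by adjusted R^2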

Shrinkage

Ridge Regression

Recall that in least squares regression, the coefficients are estimated by minimizing the objective function RSS.

In ridge regression, the objective function includes an additional shrinkage penalty, where $\lambda \ge 0$ is a tuning parameter. Note that we do not want to shrink the intercept $\beta_0$, which is simply a measure of the mean of the response:

$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2$

  • As $\lambda$ increases, the variance decreases and the bias increases. The model fit usually improves initially as the variance decreases, but worsens at some point when the bias starts to increase rapidly. Cross-validation is often used to select the optimal $\lambda$.
  • As $\lambda \to \infty$, the coefficient estimates approach $0$.

Ridge regression works well when the linear model has low bias but high variance, e.g. when the underlying relationship is close to linear. It trades a small increase in bias for a large decrease in variance.

Additionally, it is important to standardize all features when applying regularization. Imagine the same feature measured in dollars versus in thousands of dollars: the thousands-of-dollars version takes values 1,000 times smaller, so its coefficient must be 1,000 times larger and is penalized much more heavily, even though it carries the same information.

Lasso Regression

Although ridge regression shrinks the coefficients, it does not eliminate excess predictors. Model interpretation can be an issue for ridge regression when the number of predictors is large. The lasso overcomes this issue and forces some coefficients to be exactly $0$.

However, feature selection using regularization techniques such as the lasso has its own limitations, for example around model interpretability. In addition, the features selected are optimized within a linear model and may not necessarily translate to other model forms.

Note the difference between the $\ell_2$ (ridge) and $\ell_1$ (lasso) penalties:

  • when the coefficients are greater than 1 in absolute value (the parameters are large), the $\ell_2$ penalty is greater than the $\ell_1$ penalty, and ridge provides more shrinkage.
  • when the coefficients are smaller than 1 in absolute value (the parameters are small), the $\ell_1$ penalty is greater than the $\ell_2$ penalty, and lasso provides more shrinkage.

In R, we use the glmnet package to compute ridge and lasso regressions to predict mpg from the mtcars built-in data set.

library(glmnet)
library(caret)
set.seed(9999)

# data
df <- as.data.frame(mtcars)
x <- model.matrix(mpg~., df)[, -1]
y <- df$mpg

# 75/25 train/test split
partition <- createDataPartition(df$mpg, list = FALSE, p = .75)
df_train <- df[partition, ]
df_test <- df[-partition, ]
x_train <- x[partition, ]
x_test <- x[-partition, ]
y_train <- y[partition]
y_test <- y[-partition]

# fit regression
m1 <- lm(mpg ~ ., df_train)
m1.pred <- predict(m1, df_test)
m1.mse <- round(mean((y_test - m1.pred)^2), 2)

# fit ridge regression
m2 <- cv.glmnet(x_train, y_train, alpha=0, nfolds=6)
m2.bestlambda <- m2$lambda.min
m2.pred <- predict(m2, s=m2.bestlambda, x_test)
m2.mse <- round(mean((y_test - m2.pred)^2), 2)

# fit lasso regression
m3 <- cv.glmnet(x_train, y_train, alpha=1, nfolds=6)
m3.bestlambda <- m3$lambda.min
m3.pred <- predict(m3, s=m3.bestlambda, x_test)
m3.mse <- round(mean((y_test - m3.pred)^2), 2)


# get coefficients
m2.best <- glmnet(x_train, y_train, alpha=0, lambda=m2.bestlambda)
m3.best <- glmnet(x_train, y_train, alpha=1, lambda=m3.bestlambda)
comp <- cbind(coef(m1), coef(m2.best), coef(m3.best))
colnames(comp) <- c("original", "ridge", "lasso")

The test MSEs are as follows. Note that both ridge and lasso regression perform better than the original regression.

> m1.mse
[1] 16.64
> m2.mse
[1] 3.66
> m3.mse
[1] 6.61

We also compare the coefficients from the ordinary regression and the regularized regressions with the CV-optimal $\lambda$.

> comp
11 x 3 sparse Matrix of class "dgCMatrix"
original ridge lasso
(Intercept) -19.52071389 19.420775180 14.002192375
cyl 1.69431225 -0.275408324 .
disp 0.01185998 -0.004705579 .
hp -0.01449594 -0.011129305 -0.002545135
drat 3.08676192 1.272808791 1.334526777
wt -3.19280650 -1.137507747 -2.254408320
qsec 1.02473436 0.117671597 0.331182874
vs 0.97127211 0.677494227 .
am 2.63740010 1.633418877 1.703501439
gear 3.36943552 0.794847062 1.571031414
carb -1.45443855 -0.588427657 -1.162118484

Resampling Method

Resampling methods involve repeatedly drawing samples from a training set to re-fit the model.

Cross-Validation

We often use the test error rate to determine and compare how well a statistical learning model performs. However, in the absence of a large designated test set, the test error rate can be difficult to estimate, and the training error rate is often quite different from the test error rate. Therefore, cross-validation can be used to estimate the test error rate by creating validation sets from the training data.

A $k$-fold cross-validation involves randomly dividing the observations into $k$ groups. The process is repeated $k$ times, where each group in turn is treated as a validation set while the model is fitted on the remaining $k - 1$ groups. The $k$-fold CV estimate of the test MSE is the average of the MSEs from the validation sets:

$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i$

The leave-one-out cross-validation, or LOOCV, is a special case of $k$-fold CV with $k = n$. LOOCV requires fitting the model $n$ times, with $n$ equal to the number of training data points. $k$-fold CV with $k = 5$ or $10$ is more feasible computationally, as it only needs to fit the model $k$ times.
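
A minimal sketch (my own) using cv.glm from the boot package, which reports the LOOCV and 10-fold CV estimates of the test MSE for a linear model:

library(boot)

# cv.glm needs a glm object; a gaussian glm is equivalent to lm
fit <- glm(mpg ~ wt + hp, data = mtcars)

cv.glm(mtcars, fit)$delta[1]           # LOOCV estimate of the test MSE
cv.glm(mtcars, fit, K = 10)$delta[1]   # 10-fold CV estimate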

Bootstrap

Bootstrapping provides a measure of accuracy of either a parameter estimate or a given statistical learning method. It can be used to estimate the variance of a parameter estimate by repeatedly re-sampling the same data set with replacement (i.e. duplicate data entries are allowed) and re-calculating the parameter on each re-sample.
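
A minimal sketch (my own) estimating the standard error of the median of mpg in mtcars by resampling with replacement:

set.seed(9999)

# bootstrap the median of mpg: resample with replacement B times
B <- 1000
boot_medians <- replicate(B, median(sample(mtcars$mpg, replace = TRUE)))

sd(boot_medians)   # bootstrap estimate of the standard error of the median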

Hyper-Parameter Tuning

We specify numerous constants called hyperparameters during the modeling process, e.g. $K$ in KNN, $\lambda$ in ridge and lasso regression, etc. To set these constants such that the model predicts accurately while avoiding over-complexity and overfitting, we tune the hyperparameters with cross-validation. For more details see the auto claim notebook.

Generalized Linear Model

Generalized linear models were developed by Nelder and Wedderburn in a paper published in 1972 to provide a flexible framework that introduces a link function to transform the linear combination of predictors.

An Ordinary Linear Model has many limitations:

  1. As an ordinary linear model produces a numeric response, it requires an assumed ordering to predict qualitative responses.
  2. Negative values may be predicted when not allowed.
  3. When the variance of the target variable depends on the mean, the homoscedasticity assumption is violated; the least squares estimator is then no longer the MLE, and various statistical tests would not hold.
  4. Sensitive to outliers.
  5. Does not perform well with non-linear relationships.


The generalized linear model relaxes the assumptions of the OLM. First, GLM relaxes the normal residual assumption and allows the target variable to follow any distribution within the exponential family:

$f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)$

For each such distribution, there exists a canonical link function $g$ associated with it that simplifies the mathematics of solving the GLM analytically.

  • Normal => Identity: $g(\mu) = \mu$
  • Exponential/Gamma => Negative Inverse: $g(\mu) = -\mu^{-1}$
  • Inverse Gaussian => Inverse Square: $g(\mu) = \mu^{-2}$
  • Poisson => Log: $g(\mu) = \log(\mu)$
  • Bernoulli/Binomial/Multinomial => Logit: $g(\mu) = \log\frac{\mu}{1-\mu}$

We can either choose the canonical link function or pick another one (which may not lead to a converged GLM solution, however). With this link function, GLM assumes that the expectation of the target is the inverse-linked linear combination of the predictors:

$E[Y \mid X] = \mu = g^{-1}(X\beta)$

With all of the above assumptions satisfied, the coefficients of a GLM can then be solved for, typically by maximum likelihood via iteratively reweighted least squares.

Logistic Regression

The logistic regression model is popular for classification problems. With two response classes, we can calculate the probability of the response falling in each class and predict the response by choosing the class with the higher probability. For more than two response classes, multiple-class logistic regression is available, but discriminant analysis is more popular.

We define $p(X) = \Pr(Y = 1 \mid X)$ as our response of interest and use the logit link function:

$\text{logit}(p(X)) = \log\frac{p(X)}{1 - p(X)}$

Since we assume a linear relationship between our predictors and the linked response, we have:

$\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$

Therefore,

$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$

Now we have the nice property that $0 \le p(X) \le 1$, which is exactly what we want when modeling probability responses. The quantity $\frac{p(X)}{1 - p(X)}$ is called the odds.

In R, we use the glm function (with family=binomial) to predict Survived from the Titanic dataset.

library(e1071)
library(caret)
set.seed(9999)
df <- as.data.frame(Titanic)
df <- df[rep.int(seq_len(nrow(df)), df$Freq), ]
df <- subset(df, select = -c(Freq))

# Binarize class, sex and age
df.new <- predict(dummyVars("~Class+Sex+Age",
df,
sep="_",
fullRank=TRUE), df)
df.new <- as.data.frame(df.new)
df.new["Survived"] <- df$Survived
df <- df.new

# Create random 80/20 train/test split
train_ind <- sample(seq_len(nrow(df)), size = floor(0.80 * nrow(df)))
df_train <- df[train_ind, ]
df_test <- df[-train_ind, ]

# Fit Logistic Regression
model <- glm(Survived~.,
df_train,
family=binomial)
summary(model)
contrasts(df$Survived)

# Predict In-Sample
prob_train <- predict(model, df_train, type="response")
pred_train <- rep("No", nrow(df_train))
pred_train[prob_train > .5] <- "Yes"

# Predict Out-Of-Sample
prob_test <- predict(model, df_test, type="response")
pred_test <- rep("No", nrow(df_test))
pred_test[prob_test > .5] <- "Yes"

From summary(model), note that most coefficients are significant.

> summary(model)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3094 0.3132 0.988 0.323
Class_2nd -1.1163 0.2187 -5.105 3.31e-07 ***
Class_3rd -1.7041 0.1918 -8.883 < 2e-16 ***
Class_Crew -0.8848 0.1773 -4.991 6.01e-07 ***
Sex_Female 2.3691 0.1555 15.235 < 2e-16 ***
Age_Adult -0.6289 0.2767 -2.273 0.023 *

Note that a probability of 1 corresponds to “Yes” in the Survived variable.

> contrasts(df$Survived)
Yes
No 0
Yes 1

The in-sample confusion matrix:

> confusionMatrix(as.factor(pred_train), df_train$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 1087 296
Yes 102 275

Accuracy : 0.7739

The out-of-sample confusion matrix:

> confusionMatrix(as.factor(pred_test), df_test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 277 66
Yes 24 74

Accuracy : 0.7959

Comparing the test error rate between naiveBayes, KNN and logistic regression.

naiveBayes (cv10):          0.2095
KNN (cv10, K=1): 0.1942
logistic regression: 0.2041

Poisson Regression

The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event.

We can fit a Poisson regression if we observe that the frequency distribution of the response variable exhibits a Poisson shape. We assume that the log of the expected response has an underlying linear relationship in the predictors.

That is, we assume that for each observation, $Y_i \sim \mathrm{Poisson}(\lambda_i)$ with $\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$. The log link function ensures that $\lambda_i$ is strictly positive.

Note that we have made the strong assumption that for each observation, the mean and variance of $Y_i$ are the same, as dictated by the Poisson distribution. However, if the data shows larger variance than expected, or overdispersion, we can then use quasi-Poisson regression, which relaxes the Poisson assumption by allowing the variance to be proportional to the mean.

In the diamonds dataset from the ggplot2 package, we plotted the histogram of the price data from 50,000 observations.

library(ggplot2)
df <- as.data.frame(diamonds)
ggplot(df, aes(x=price)) + geom_histogram(binwidth=1)

The price data shows resemblance to a Poisson distribution.

[Figure: histogram of diamond price]

We fitted four different models: linear, ridge, lasso, and Poisson regressions.

library(e1071)
library(ggplot2)
library(caret)
library(glmnet)
set.seed(9999)
df <- as.data.frame(diamonds)

# caret::dummyVars does not work well with ordered factor. change to unordered.
df["cut_1"] <- factor(df$cut, order=FALSE)
df["color_1"] <- factor(df$color, order=FALSE)
df["clarity_1"] <- factor(df$clarity, order=FALSE)

# Binarize Category Variable
dummy <- dummyVars(~ cut_1 + color_1 + clarity_1, df, sep="_", fullRank=TRUE)
df.new <- as.data.frame(predict(dummy, newdata = df))
df.new["carat"] <- df$carat
df.new["price"] <- df$price
df <- df.new
x <- model.matrix(price~., df)[, -1]
y <- df$price

# Create random 80/20 train/test split
partition <- sample(seq_len(nrow(df)), size = floor(0.80 * nrow(df)))
df_train <- df[partition, ]
df_test <- df[-partition, ]
x_train <- x[partition, ]
x_test <- x[-partition, ]
y_train <- y[partition]
y_test <- y[-partition]

# Fit Linear Regression
m1 <- lm(price~., df_train)
m1.pred <- predict(m1, df_test)
m1.mse <- round(mean((y_test - m1.pred)^2), 2)

# Fit Ridge Regression
m2 <- cv.glmnet(x_train, y_train, alpha=0, nfolds=10)
m2.bestlambda <- m2$lambda.min
m2.pred <- predict(m2, s=m2.bestlambda, x_test)
m2.mse <- round(mean((y_test - m2.pred)^2), 2)

# Fit Lasso Regression
m3 <- cv.glmnet(x_train, y_train, alpha=1, nfolds=10)
m3.bestlambda <- m3$lambda.min
m3.pred <- predict(m3, s=m3.bestlambda, x_test)
m3.mse <- round(mean((y_test - m3.pred)^2), 2)

# Fit Poisson Regression
m4 <- glm(price~., df_train, family=poisson(link="log"))
m4.pred <- predict(m4, df_test)
m4.mse <- round(mean((y_test - exp(m4.pred))^2), 2)

# Fit quasiPoisson Regression
m5 <- glm(price~., df_train, family=quasipoisson(link="log"))
m5.pred <- predict(m5, df_test)
m5.mse <- round(mean((y_test - exp(m5.pred))^2), 2)

# Compare Coefficient
m2.best <- glmnet(x_train, y_train, alpha=0, lambda=m2.bestlambda)
m3.best <- glmnet(x_train, y_train, alpha=1, lambda=m3.bestlambda)
comp <- cbind(coef(m1), coef(m2.best), coef(m3.best), coef(m4), coef(m5))
colnames(comp) <- c("original", "ridge", "lasso", "Poisson", "quasiPoi")

Showing the results: the lasso regression improves upon ridge here. However, the Poisson regression shows a very high MSE, and it is not improved by using quasi-Poisson to deal with overdispersion.

> m1.mse
[1] 1296082

> m2.mse
[1] 1662769

> m3.mse
[1] 1296773

> m4.mse
[1] 5237564

> m5.mse
[1] 5237564

Comparing coefficients:

> comp
19 x 5 sparse Matrix of class "dgCMatrix"
original ridge lasso Poisson quasiPoi
(Intercept) -7369.2859 -2751.693983 -7083.4777 5.17185599 5.17185599
cut_1_Good 686.6877 112.163397 657.1100 0.17515789 0.17515789
`cut_1_Very Good` 864.3160 319.828814 838.9901 0.20528993 0.20528993
cut_1_Premium 893.6342 367.364011 866.6581 0.18009057 0.18009057
cut_1_Ideal 1025.5579 421.231162 999.8031 0.20406528 0.20406528
color_1_E -225.3985 -23.486004 -205.5637 -0.05660526 -0.05660526
color_1_F -294.1807 2.717515 -274.2126 -0.03079075 -0.03079075
color_1_G -522.9599 -112.556133 -501.1661 -0.10597258 -0.10597258
color_1_H -997.6266 -490.885083 -975.9837 -0.25411004 -0.25411004
color_1_I -1452.1455 -749.323400 -1426.9279 -0.42200863 -0.42200863
color_1_J -2343.2864 -1450.247841 -2315.1429 -0.63819087 -0.63819087
clarity_1_SI2 2612.5463 -472.362109 2345.9392 1.14176901 1.14176901
clarity_1_SI1 3555.6013 149.192295 3287.0687 1.38540593 1.38540593
clarity_1_VS2 4210.3193 671.522828 3940.7103 1.50843614 1.50843614
clarity_1_VS1 4507.7144 861.334888 4235.9431 1.58850692 1.58850692
clarity_1_VVS2 4965.2210 1192.175019 4691.8855 1.66443836 1.66443836
clarity_1_VVS1 5072.6388 1162.725817 4796.5710 1.61060572 1.61060572
clarity_1_IF 5436.6567 1476.377453 5158.0224 1.72336616 1.72336616
carat 8899.8818 7672.223943 8883.8976 1.65798106 1.65798106

We are curious as to why the Poisson regression performs much worse than a simple linear regression, when the response variable clearly shows a Poisson-like shape.

p1 <- ggplot() +
geom_point(data=df_test, aes(x=as.numeric(row.names(df_test)), y=price),
shape=".", color="black") +
geom_point(aes(x=as.numeric(row.names(df_test)), y=m1.pred),
shape=".", color="red")

p2 <- ggplot() +
geom_point(data=df_test, aes(x=as.numeric(row.names(df_test)), y=price),
shape=".", color="black") +
geom_point(aes(x=as.numeric(row.names(df_test)), y=exp(m4.pred)),
shape=".", color="red")

We first look at the linear regression fit, where the black dots are the original data and the red dots are the fitted data:

[Figure: linear regression predictions (red) against actual prices (black)]

We then look at the Poisson regression fit. Unfortunately, the Poisson regression creates ultra-high predictions for some values, which skews the MSE metric. This is because we forgot to log the numerical variable (carat) when using the log link function, which results in an exponential shape for the predictions.

[Figure: Poisson regression predictions (red) against actual prices (black)]

We change the code as follows:

# Fit Poisson Regression, replacing carat with log(carat)
m4 <- glm(price ~ . - carat + log(carat), df_train, family=poisson(link="log"))
m4.pred <- predict(m4, df_test)
m4.mse <- round(mean((y_test - exp(m4.pred))^2), 2)

# Fit quasiPoisson Regression, replacing carat with log(carat)
m5 <- glm(price ~ . - carat + log(carat), df_train, family=quasipoisson(link="log"))
m5.pred <- predict(m5, df_test)
m5.mse <- round(mean((y_test - exp(m5.pred))^2), 2)

The MSEs are now better:

> m4.mse
[1] 2544181
> m5.mse
[1] 2544181

Plotting the predictions: other than one prediction outlier, the overall fit is better than when we did not log the carat.

[Figure: Poisson regression predictions after logging carat]

Goodness of Fit

Deviance is a measure of the goodness of fit of a generalized linear model, playing a role similar to the RSS in the simple linear model: it is defined as $-2\big(\log L(\text{fitted model}) - \log L(\text{saturated model})\big)$. The baseline value is called the null deviance and is the deviance calculated when the response variable is predicted using only its sample mean.

Adding additional features to the model generally decreases the deviance and decreases the residual degrees of freedom.


Tree-Based Method

Tree-based methods involve segmenting the predictor space into several regions and making a prediction for a given observation based on the region it falls in. There are several advantages to tree-based methods:

  • Easy to explain
  • Intuitive to human reasoning
  • Graphic and interpretable
  • No need for dummy variables for qualitative data

On the other hand, disadvantages:

  • Lower predictive accuracy
  • Sensitive to change in data

Several techniques can significantly improve on these disadvantages, namely bagging, random forests and boosting.

Regression Tree

In a linear regression model, the underlying relationship is assumed to be:

$f(X) = \beta_0 + \sum_{j=1}^{p}\beta_j X_j$

Whereas in a regression tree model, the underlying $f$ is assumed to be:

$f(X) = \sum_{m=1}^{M} c_m \cdot I(X \in R_m)$

where each $R_m$ represents a partition of the feature space. The goal is to solve for the partition set $R_1, \dots, R_M$ that minimizes the objective function:

$\sum_{m=1}^{M}\sum_{i:\, x_i \in R_m}\left(y_i - \hat{y}_{R_m}\right)^2$

To find the partition efficiently, we introduce recursive binary splitting, which is a top-down and greedy approach. It is greedy because it is short-sighted: it always chooses the current best split, instead of the split that is optimal overall.

Due to the greedy nature, it is preferred that we first grow a complex tree and then prune it back, so that all potentially large reductions in RSS are captured.

Cost Complexity Pruning

The cost complexity pruning approach aims to minimize the following objective function for each value of $\alpha$:

$\sum_{m=1}^{|T|}\sum_{i:\, x_i \in R_m}\left(y_i - \hat{y}_{R_m}\right)^2 + \alpha\,|T|$

where $|T|$ is the number of terminal nodes of the subtree $T \subset T_0$, with $T_0$ the original un-pruned tree. The tuning parameter $\alpha$ controls the complexity of the subtree $T$, penalizing any increase in the number of nodes. The goal is to prune the tree under various values of $\alpha$ and then use cross-validation to select the best $\alpha$.

This is similar to the lasso objective, which also introduces a tuning parameter to control the complexity of a linear model.

In R, we use the rpart library, which stands for recursive partitioning and regression trees, to fit a regression tree to predict mpg in our mtcars dataset.

library(caret)
library(rpart)
library(rpart.plot)
set.seed(9999)
df <- as.data.frame(mtcars)

# create a 75/25 train/test split
partition <- createDataPartition(df$mpg, list=FALSE, p=0.75)
train <- df[partition, ]
test <- df[-partition, ]

# fit regression tree
t1 <- rpart(mpg ~ ., train)
t1.predict <- predict(t1, test)
t1.mse <- round(mean((test$mpg - t1.predict)^2), 2)

The test MSE, compared to the previous linear models:

# regression tree
> t1.mse
[1] 11.59

# linear regression
> m1.mse
[1] 16.64
> m2.mse
[1] 3.66
> m3.mse
[1] 6.61

Plotting the tree with the rpart.plot(t1) command:

[Figure: the fitted regression tree]

Classification Tree

Classification trees are very similar to regression trees except that we need an alternative to RSS when deciding on a split. There are three common approaches.

  • Classification error rate: $E = 1 - \max_k(\hat{p}_{mk})$

Where $\hat{p}_{mk}$ represents the proportion of training observations in the $m$-th region that belong to the $k$-th class. However, this approach is not sufficiently sensitive to node impurity for tree-growing, compared to the next two.

  • Gini index: $G = \sum_{k=1}^{K}\hat{p}_{mk}\left(1 - \hat{p}_{mk}\right)$

Note that the Gini index decreases as all of the $\hat{p}_{mk}$'s get closer to $0$ or $1$. Therefore it is a measure of node purity.

  • Entropy: $D = -\sum_{k=1}^{K}\hat{p}_{mk}\log\hat{p}_{mk}$

The entropy is also a measure of node purity and is numerically similar to the Gini index.

For a two-class decision tree, the impurity measures calculated from the different methods for a given class proportion are simulated below with Python.

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.001, 1, 0.001)  # start above 0 to avoid log(0)
y1 = x*np.log(x)/np.log(2)*-1 + (1-x)*np.log(1-x)/np.log(2)*-1
y2 = x*(1-x) + (1-x)*(1-(1-x))
y3 = 1-np.maximum(x, 1-x)

fig = plt.figure(figsize=(6, 6))
plt.plot(x, y1)
plt.plot(x, y2)
plt.plot(x, y3)
plt.legend(['entropy', 'gini', 'class error'], loc='upper right')

[Figure: entropy, Gini, and classification error as functions of the class proportion]

We can see that all three methods are similar and consistent with each other. Entropy and the Gini index are more sensitive to changes in the node probabilities and are therefore preferable when growing the trees. The classification error rate is more often used during complexity pruning.

Information Gain

Once the measure of node purity is calculated, we want to maximize the information gain from each split. Using entropy $D$ as the measure, with $N$ the number of observations under the parent node and $N_j$ the number under child node $j$: $IG = D(\text{parent}) - \sum_{j}\frac{N_j}{N} D(\text{child}_j)$.

In R, we use the rpart library to create a classification tree to predict Survived in the Titanic data set.

library(caret)
library(rpart)
library(rpart.plot)
set.seed(9999)
df <- as.data.frame(Titanic)
df <- df[rep.int(seq_len(nrow(df)), df$Freq), ]
df <- subset(df, select = -c(Freq))

# create a 75/25 train/test split
partition <- createDataPartition(df$Survived, list=FALSE, p=0.75)
train <- df[partition, ]
test <- df[-partition, ]

# fit classification tree
t2 <- rpart(Survived ~ ., train)
t2.prob <- predict(t2, test)
t2.pred <- rep("No", nrow(t2.prob))
t2.pred[t2.prob[, 2] > .5] <- "Yes"

# pruning
t2.prune <- prune(t2, cp = 0.05)
rpart.plot(t2.prune)

# predict with the pruned tree
t2.prune.prob <- predict(t2.prune, test)
t2.prune.pred <- rep("No", nrow(t2.prune.prob))
t2.prune.pred[t2.prune.prob[, 2] > .5] <- "Yes"

The confusion matrix shows:

> confusionMatrix(as.factor(t2.pred), test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 368 105
Yes 4 72

Accuracy : 0.8015

Plotting the tree with rpart.plot(t2).
[Figure: the fitted classification tree]

Plotting the complexity parameter against the cross-validated relative error with plotcp(t2). Although the relative error is lowest for a larger tree, a comparable level of relative error is achieved with only 2 terminal nodes. Because decision trees are prone to overfitting, here we manually prune the tree back to 2 nodes.

[Figure: complexity parameter vs cross-validated relative error]

Plotting the tree with rpart.plot(t2.prune).

[Figure: the pruned classification tree]

The confusion matrix after pruning. We lose a small bit of out-of-sample accuracy due to manual pruning.

> confusionMatrix(as.factor(t2.prune.pred), test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 343 83
Yes 29 94

Accuracy : 0.796

Comparing the test error rate with the previous classifiers:

naiveBayes (cv10):           0.2095
KNN (cv10, K=1): 0.1942
classification tree: 0.1985
classification tree (prune): 0.2040

Confusion Matrix

The confusion matrix is a convenient summary of the model prediction. In our previous example with the un-pruned tree:

> confusionMatrix(as.factor(t2.pred), test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 368 105
Yes 4 72

# mapping the number to definition
Reference
Prediction No Yes
No TN FN
Yes FP TP

There are four types of prediction:

  • True Positive (TP): 72
  • True Negative (TN): 368
  • False Positive (FP), or Type I Error: 4
  • False Negative (FN), or Type II Error: 105

Several metrics can be computed:

  • Accuracy: (TP + TN) / N = (72 + 368) / 549 = 0.8015
  • Error Rate: (FP + FN) / N = 1 - Accuracy = 0.1985
  • Precision: TP / (Predicted Positive) = 72 / (72 + 4) = 0.9474
  • Sensitivity: TP / (Actually Positive) = 72 / (72 + 105) = 0.4068

Receiver Operator Characteristic Curve

The ROC curve can be used to evaluate the performance of our model. It plots the TPR (true positive rate) against the FPR (false positive rate) over a range of cutoff values:

  • TPR = TP / (Actually Positive) = 0.4068
  • FPR = FP / (Actually Negative) = 0.0107

In python, we can create a ROC curve from our previous prediction TPR and FPR:

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 1, 0.0001)
plt.plot(x, x)
plt.xlabel('FPR')
plt.ylabel('TPR')

x2 = 0.0107
y2 = 0.4068
plt.scatter(x2, y2, s=20, color='red')
plt.plot([0, x2], [0, y2], 'r-', color='red')
plt.plot([x2, 1], [y2, 1], 'r-', color='red')
plt.title('ROC Plot')

[Figure: ROC plot through the single TPR/FPR point]

The baseline diagonal connects the cutoff where every observation in the test set is predicted positive (point x=1, y=1) to the cutoff where every observation is predicted negative (point x=0, y=0), and corresponds to a model with no discriminating power. The area under the red lines and above the x-axis is an estimate of the model fit and is called the AUC, or area under the ROC curve.

An AUC of 1 means that the model has a TPR of 1 and FPR of 0.
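
As a sketch (my own, assuming the t2 classification tree, its predicted probabilities t2.prob, and the test set from above are still in memory), the pROC package can compute the full ROC curve and the AUC directly:

library(pROC)

# ROC curve over all cutoffs using the predicted probability of "Yes"
roc_obj <- roc(response = test$Survived, predictor = t2.prob[, 2])
auc(roc_obj)    # area under the curve
plot(roc_obj)   # ROC plot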

Ensemble Methods

Bagging

Decision tree methods in general suffer from high model variance, compared to linear regression, which shows low variance. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical model without materially affecting the bias. It is particularly useful in the decision tree context.

  • For regression trees, construct $B$ regression trees using bootstrapped (repeatedly resampled) training sets. These trees are grown deep and are not pruned, and therefore have high variance. At the end, average the $B$ trees' predictions to reduce the variance.

  • For classification trees, construct $B$ classification trees in the same way. When predicting a test observation, take the majority vote across the $B$ trees.

Note that bagging results are more accurate but harder to visualize. We can obtain variable importances by computing the total RSS/Gini decrease from splits over each predictor, providing better interpretation of the results.

Random Forest

Random forests improve upon bagged trees by de-correlating the trees. At each split, only a random subset of roughly $\sqrt{p}$ predictors is considered, which prevents a single feature with a large influence from dominating every tree.

In R, use the randomForest package:

library(caret)
library(randomForest)
set.seed(9999)
df <- as.data.frame(Titanic)
df <- df[rep.int(seq_len(nrow(df)), df$Freq), ]
df <- subset(df, select = -c(Freq))

# create a 75/25 train/test split
partition <- createDataPartition(df$Survived, list=FALSE, p=0.75)
train <- df[partition, ]
test <- df[-partition, ]


# fit random forest
rf <- randomForest(formula = Survived ~ .,
data = train,
ntree = 100,
importance = TRUE)

In-sample confusion matrix:

> rf

Call:
randomForest(formula = Survived ~ ., data = train, ntree = 100, importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 1

OOB estimate of error rate: 23.43%
Confusion matrix:
No Yes class.error
No 1063 55 0.04919499
Yes 332 202 0.62172285

Out-of-sample confusion matrix:

> confusionMatrix(as.factor(predict(rf, test)), test$Survived)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 368 108
Yes 4 69

Accuracy : 0.796

Comparing the test error rate with the previous classifiers:

naiveBayes (cv10):           0.2095
KNN (cv10, K=1): 0.1942
classification tree: 0.1985
classification tree (prune): 0.2040
random forest: 0.2040

Boosting

Boosting is also a general-purpose procedure for improving the accuracy of a statistical model. In boosting, we repeatedly fit new trees to the residuals from the current model and add the new trees to the model such that a loss function is minimized (subject to a shrinkage parameter $\lambda$, typically around 0.01, to control overfitting). CV is often used to determine the total number of trees to be fitted.

A gradient boosting machine is an algorithm that calculates the gradient of the loss function and updates the parameters such that the model moves in the direction of the negative gradient, thus closer to a minimum of the loss function.

  • XGBoost is an open-source software library that provides a gradient boosting framework for R. XGBoost initially started as a research project by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group.
library(caret)
library(xgboost)
library(pROC)
set.seed(9999)
df <- as.data.frame(Titanic)
df <- df[rep.int(seq_len(nrow(df)), df$Freq), ]
df <- subset(df, select = -c(Freq))

# turn survival into 0 and 1
Survived_Ind <- rep(0, nrow(df))
Survived_Ind[df$Survived == "Yes"] <- 1
df$Survived_Ind <- Survived_Ind


# create a 75/25 train/test split
partition <- createDataPartition(df$Survived_Ind, list=FALSE, p=0.75)
train <- df[partition, ]
test <- df[-partition, ]
test_2 <- test
train <- subset(train, select = -c(Survived))
test <- subset(test, select = -c(Survived))


# create model frame for xgboost input
train.mf <- model.frame(as.formula("Survived_Ind ~."), data = train)
test.mf <- model.frame(as.formula("Survived_Ind ~."), data = test)

# create a model matrix only contains numerical values.
train.mm <- model.matrix(attr(train.mf, "terms"), data = train)
test.mm <- model.matrix(attr(test.mf, "terms"), data = test)

# [optional] create A XGB dense matrix contains an R matrix and metadata
train.dm <- xgb.DMatrix(train.mm, label = train$Survived_Ind, missing = -1)
test.dm <- xgb.DMatrix(test.mm, label = test$Survived_Ind, missing = -1)


# create xgboost parameter list
params <- list("booster" = "gbtree", # "gblinear" for glm
"objective" = "binary:logistic", # the output here is a probability
"eval_metric" = "auc",
"eta" = 0.1, # lambda
"subsample" = 0.6, # proportion of observations
"colsample_bytree" = 0.6, # proportion of features
"max_depth" = 5) # depth of the decision tree

# train xgboost model
model.cv <- xgb.cv(params = params,
data = train.dm,
nrounds = 1000, # the number of trees / iterations
prediction = FALSE, # storage of prediction under each tree
print_every_n = 25,
early_stopping_rounds = 50,
maximize = TRUE, # AUC metric -> maximize
nfold = 6) # cv

# fit final model
model <- xgb.train(params = params,
data = train.dm,
nrounds = model.cv$best_iteration,
prediction = FALSE)

# format prediction
xgb.prob <- predict(model, test.dm)
xgb.pred <- rep("No", sum(lengths(xgb.prob)))
xgb.pred[xgb.prob > 0.5] <- "Yes"

Feature importance:

> xgb.importance(feature_names = dimnames(train.dm)[[2]], model = model)
Feature Gain Cover Frequency
1: Class3rd 0.8347889 0.5957944 0.5
2: Class2nd 0.1652111 0.4042056 0.5

AUC result:

> auc(test$Survived_Ind, xgb.prob)
Area under the curve: 0.6033

Confusion matrix:

> confusionMatrix(as.factor(xgb.pred), as.factor(test_2$Survived))
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 370 180
Yes 0 0

Accuracy : 0.6727

Unfortunately, xgboost predicts everything to be “No” here.

Comparing the test error rate with the previous classifiers:

naiveBayes (cv10):           0.2095
KNN (cv10, K=1): 0.1942
classification tree: 0.1985
classification tree (prune): 0.2040
random forest: 0.2040
xgboost: 0.3273

Ensemble Model Interpretation

Feature Importance ranks the contribution of each feature.
Partial Dependence Plots visualize the model’s average dependence on a specific feature (or a pair of features); see the sketch below.
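
A minimal sketch (my own, assuming the rf random forest and train data from the earlier example are still in memory) using the tools built into the randomForest package:

# feature importance: mean decrease in accuracy / Gini across splits on each variable
importance(rf)
varImpPlot(rf)

# partial dependence of the predicted survival probability on passenger Class
partialPlot(rf, pred.data = train, x.var = Class, which.class = "Yes")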


Unsupervised Learning

In supervised learning, we are provided a set of $n$ observations, each containing $p$ features, and a response variable $Y$. We are interested in predicting $Y$ using the observations and their features.

In unsupervised learning, we are interested in exploring hidden relationships within the data themselves, without involving any response variable. It is “unsupervised” in the sense that the learning outcome is subjective, unlike supervised learning, in which specific metrics such as error rates are used to evaluate learning outcomes.

PCA

See the PCA section below.

K-Means Clustering

Clustering seeks to partition data into homogeneous subgroups. K-means clustering partitions the data into $K$ distinct, non-overlapping clusters $C_1, \dots, C_K$ by minimizing the total within-cluster variation $\sum_{k=1}^{K} W(C_k)$, where $W(C_k)$ is the sum of all pairwise squared Euclidean distances between the observations in cluster $C_k$, divided by the number of observations in the cluster.

Algorithm

The algorithm divides the data into $K$ initial clusters, then iteratively reassigns each observation to the cluster with the closest centroid (the mean of the observations currently assigned to that cluster).

The K-means algorithm finds a local optimum and therefore depends on the initial cluster assignment. It is important to repeat the process with different initial assignments and then select the best result based on the minimum total within-cluster variation.

  • The nstart parameter in the kmeans function in R specifies the number of random starts to try; the run that ends with the best objective function value is selected.

It is also important to check for outliers, as the algorithm may let an outlier become its own cluster and stop improving.

Standardization

It is essential to standardize the variables before performing k-means cluster analysis, as the objective function (Euclidean distance, etc.) is calculated from the raw values of the variables.

Curse of Dimensionality

The curse of dimensionality describes the problems that arise when performing clustering in higher-dimensional spaces, where:

  • visualization becomes harder
  • as the number of dimensions increases, the Euclidean distances between data points become nearly the same on average.

The solution is to reduce the dimensionality before using clustering technique.

The Elbow Method

Each cluster replaces its data with its center. In other words, with a clustering model we try to predict which cluster a data point belongs to.

A good model would explain more of the variance in the data with its cluster assignments. The elbow method looks at a statistic that measures how much variation is explained as a function of the number of clusters (for example, the ratio of between-cluster to within-cluster variation, or simply the total within-cluster sum of squares).

As soon as the marginal gain in this statistic drops off when adding a new cluster, we use that number of clusters.
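
A minimal sketch (my own) of the within-cluster sum of squares version of the elbow plot, using the built-in kmeans function on standardized data:

set.seed(9999)
x <- scale(mtcars)   # standardize before clustering

# total within-cluster sum of squares for K = 1..10, each with 25 random starts
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster SS")   # look for the "elbow"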

Hierarchical Clustering

Hierarchical clustering provides flexibility in the number of clusters $K$. It results in a tree-based representation of the data called a dendrogram, which is built either bottom-up (agglomerative) or top-down (divisive).

Hierarchical clustering assumes that there exists a hierarchical structure. In most feature generation cases, we prefer k-means clustering instead.

Agglomerative

Agglomerative hierarchical clustering starts off by assigning each data point to its own cluster. At each step of the clustering process, the two most similar clusters (minimum distance among all pairs) are merged, where the distance between two clusters is calculated between their closest elements (single linkage) or their furthest elements (complete linkage).
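
A minimal sketch (my own) using the built-in hclust function with complete linkage, then cutting the dendrogram into a chosen number of clusters:

x <- scale(mtcars)                          # standardize the features

hc <- hclust(dist(x), method = "complete")  # agglomerative, complete linkage
plot(hc)                                    # dendrogram

clusters <- cutree(hc, k = 3)               # cut into 3 clusters
table(clusters)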


PCA - A Deeper Dive

PCA finds a low-dimensional representation of a dataset that contains as much of the variation as possible. Each of the $n$ observations lives in a $p$-dimensional space, but not all dimensions are equally interesting.

Linear Algebra Review

Let $A$ be an $n \times n$ matrix with entries $a_{ij}$. For $n = 2$, the determinant of $A$ can be calculated as $\det(A) = a_{11}a_{22} - a_{12}a_{21}$; for larger $n$ it can be computed by cofactor expansion.

Properties of determinant:

A real number $\lambda$ is an eigenvalue of $A$ if there exists a non-zero vector $v$ (an eigenvector) in $\mathbb{R}^n$ such that $Av = \lambda v$.

The determinant $\det(A - \lambda I)$ is called the characteristic polynomial of $A$. The equation $\det(A - \lambda I) = 0$ is called the characteristic equation of $A$, and the eigenvalues are the real roots of this equation. It can be shown that $\det(A) = \prod_i \lambda_i$ and $\mathrm{tr}(A) = \sum_i \lambda_i$.

Matrix $A$ is invertible if there exists a matrix $B$ such that $AB = BA = I$. A square matrix is invertible if and only if its determinant is non-zero. A non-square matrix does not have an inverse.

An $n \times n$ matrix $A$ is called diagonalizable if and only if it has $n$ linearly independent eigenvectors. Let $P$ denote the matrix whose columns are the eigenvectors of $A$ and $D$ the diagonal matrix of the corresponding eigenvalues. Then $A = P D P^{-1}$.

If matrix $A$ is symmetric, then:

  • all eigenvalues of $A$ are real numbers
  • eigenvectors of $A$ from distinct eigenvalues are orthogonal

Matrix $A$ is positive semi-definite if and only if any of the following holds:

  • $x^T A x \ge 0$ for any vector $x$
  • all eigenvalues of $A$ are non-negative
  • all principal submatrices have non-negative determinants.

Matrix $A$ is positive definite if and only if any of the following holds:

  • $x^T A x > 0$ for any non-zero vector $x$
  • all eigenvalues of $A$ are positive
  • all of the upper-left (leading principal) submatrices have positive determinants.

All covariance and correlation matrices must be symmetric and positive semi-definite. If there is no perfect linear dependence between the random variables, then the matrix is positive definite.

Let $A$ be an invertible matrix. The LU decomposition breaks $A$ down as the product $A = LU$ of a lower triangular matrix $L$ and an upper triangular matrix $U$. Some applications are:

  • solving $Ax = b$: first solve $Ly = b$ by forward substitution, then $Ux = y$ by back substitution
  • computing the determinant: $\det(A) = \det(L)\det(U)$, the product of the diagonal entries of $L$ and $U$

Let $A$ be a symmetric positive definite matrix. The Cholesky decomposition expands on the LU decomposition and breaks $A$ down as $A = U^T U$, where $U$ is a unique upper triangular matrix with positive diagonal entries. The Cholesky decomposition can be used to generate correlated random variables in Monte Carlo simulation.
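
A minimal sketch (my own) of that last point: generating two standard normal variables with a target correlation of 0.8 using chol():

set.seed(9999)
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)        # target correlation matrix

U <- chol(Sigma)                            # upper triangular, Sigma = t(U) %*% U
Z <- matrix(rnorm(2 * 10000), ncol = 2)     # independent standard normals
X <- Z %*% U                                # correlated draws

cor(X)   # close to the target correlation matrix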

Matrix Interpretation

Consider an $n \times p$ data matrix $X$ whose columns $X_1, \dots, X_p$ have been centered.

To find the first principal component $Z_1$, we define it as the normalized linear combination of the features that has the largest variance, where the loadings are normalized to have unit length:

$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p, \qquad \sum_{j=1}^{p}\phi_{j1}^2 = 1$

Or equivalently, for each score:

$z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \dots + \phi_{p1}x_{ip}$

In matrix form:

$z_1 = X\phi_1$

Finally, the first principal component loading vector $\phi_1$ solves the optimization problem that maximizes the sample variance of the scores $z_{i1}$. The objective function can be formulated as follows and solved via an eigen decomposition:

$\max_{\phi_1}\ \frac{1}{n}\sum_{i=1}^{n} z_{i1}^2 \quad \text{subject to } \sum_{j=1}^{p}\phi_{j1}^2 = 1$

To find the second principal component loading $\phi_2$, use the same objective function with $\phi_2$ in place of $\phi_1$ and include the additional constraint that $\phi_2$ is orthogonal to $\phi_1$.

Geometric Interpretation

The loading matrix $\Phi$ defines a linear transformation that projects the data from the original feature space into a lower-dimensional subspace in which the data has the most variance. The result of the projection, $Z = X\Phi$, is the factor (score) matrix, also known as the principal components.

In other words, the principal component vectors span a low-dimensional linear subspace that is the closest (in average squared Euclidean distance) to the observations.

Eigen Decomposition

Given a data matrix $X$, the objective of PCA is to find a lower-dimensional factor matrix $F$, from which a reconstruction $\hat{X}$ can be formed such that the distance between the covariance matrices of $X$ and $\hat{X}$ is minimized.

The covariance matrix of $X$ is a symmetric positive semi-definite matrix, therefore we have the eigen decomposition $\mathrm{Cov}(X) = \sum_{j=1}^{p}\lambda_j e_j e_j^T$, where the $e_j$'s are eigenvectors of the covariance matrix and the $\lambda_j$'s are the corresponding eigenvalues. Note that an eigenvalue can be zero if the columns of $X$ are linearly dependent.

If we ignore the normalizing constant and define the loading matrix $\Phi$ from the eigenvectors and the factor matrix $F = X\Phi$, then the covariance of $X$ can be written in terms of $\Phi$ and the eigenvalues.

Now comes the PCA idea: rank the eigenvalues $\lambda_j$ in descending order and pick $k$ such that the first $k$ eigenvalues account for most of the total variance.

Now we observe that the truncated covariance matrix built from the first $k$ eigenvalues is also positive semi-definite. Following a similar decomposition, we obtain a reduced loading matrix $\Phi_k$ and factor matrix $F_k$.

Here we have it: a dimension-reduced factor matrix $F_k$, whose projection back to the original space, $\hat{X} = F_k\Phi_k^T$, has a covariance similar to that of the original dataset $X$.

Practical Considerations

PCA excels at identifying latent variables from the measurable variables. PCA can only be applied to numeric data, while categorical variables need to be binarized beforehand.

  • Centering: yes.

  • Scaling:

    • if the range and scale of the variables are different, the correlation matrix is typically used to perform PCA, i.e. each variable is scaled to have a standard deviation of 1
    • otherwise, if the variables are in the same units of measure, using the covariance matrix (not standardizing the variables) could reveal interesting properties of the data
  • Uniqueness: each loading vector $\phi_j$ is unique up to a sign flip, as it can point in either direction along the same subspace. The same applies to the score vector $z_j$, since $z_j = X\phi_j$.

  • Proportion of Variance Explained: the total variance in a (centered) data set is $\sum_{j=1}^{p}\frac{1}{n}\sum_{i=1}^{n}x_{ij}^2$, and the variance explained by the $m$-th principal component is $\frac{1}{n}\sum_{i=1}^{n}z_{im}^2$. The PVE of the $m$-th component is the ratio of the two. (See the R sketch after this list.)
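
A minimal sketch (my own) with prcomp on mtcars, where scale. = TRUE performs PCA on the correlation matrix and the PVE follows from the component standard deviations:

pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# proportion of variance explained by each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)
cumsum(pve)                 # cumulative PVE

head(pca$rotation[, 1:2])   # loadings of the first two components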


A Side Note on Hypothesis Test and Confidence Interval

A formal statistical data analysis includes:

  • hypothesis testing: seeing whether a structure is present
  • confidence interval: setting error limits on estimates
  • prediction interval: setting error limits on future outcomes

We will discuss a typical setting as follows:

  • Let $\theta$ be the unknown parameter, and $\hat{\theta}_n$ the parameter estimator based on $n$ observations.
  • Let $\widehat{se}$ be the estimator of the standard error of $\hat{\theta}_n$.
  • In rare circumstances where the standardized quantity $(\hat{\theta}_n - \theta)/\widehat{se}$ has a distribution independent of the unknown $\theta$, such as in regression settings where the error terms are i.i.d. normally distributed, we can show that it follows a t-distribution with the appropriate degrees of freedom.
  • However, asymptotically as $n \to \infty$, the Central Limit Theorem applies, which releases us from the restrictive assumption above. Under a variety of regularity conditions, $(\hat{\theta}_n - \theta)/\widehat{se} \xrightarrow{d} N(0, 1)$.
  • The $(1-\alpha)$ confidence interval can then be computed as $\hat{\theta}_n \pm z_{1-\alpha/2}\,\widehat{se}$.
  • The hypothesis test of $H_0: \theta = \theta_0$ would:

    • accept $H_0$ if $|\hat{\theta}_n - \theta_0| \le z_{1-\alpha/2}\,\widehat{se}$ (i.e. $\theta_0$ lies inside the confidence interval)
    • reject $H_0$ if $|\hat{\theta}_n - \theta_0| > z_{1-\alpha/2}\,\widehat{se}$
  • The p-value is the significance level at which the hypothesis test is indifferent between accepting and rejecting.

  • In practice, the t-distribution is typically used for regression data as it is more conservative (wider intervals due to heavier tails). For non-regression data, the normal distribution is often used. (See the R sketch after this list.)
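
A minimal sketch (my own) of a 95% confidence interval for a mean, comparing the normal quantile with the more conservative t quantile:

set.seed(9999)
x <- rnorm(30, mean = 5, sd = 2)

theta_hat <- mean(x)
se_hat <- sd(x) / sqrt(length(x))

theta_hat + c(-1, 1) * qnorm(0.975) * se_hat                   # normal-based CI
theta_hat + c(-1, 1) * qt(0.975, df = length(x) - 1) * se_hat  # t-based CI (slightly wider)

t.test(x)$conf.int   # matches the t-based interval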

Some background on LLN and CLT:

  • Given i.i.d. random variables $X_1, X_2, \dots$, let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.

  • The Law of Large Numbers states that if $E|X_1| < \infty$, then $\bar{X}_n \to \mu = E[X_1]$ a.s.

  • The Central Limit Theorem states that if $\sigma^2 = \mathrm{Var}(X_1) < \infty$, then $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$.

  • Therefore, if we have a consistent estimator $\hat{\sigma}$ for $\sigma$, meaning $\hat{\sigma} \to \sigma$ in probability as $n \to \infty$, Slutsky's Theorem concludes that $\sqrt{n}(\bar{X}_n - \mu)/\hat{\sigma} \xrightarrow{d} N(0, 1)$.
  • We can then use the normal distribution to set confidence intervals and perform hypothesis tests.




Reference:

  • An Introduction to Statistical Learning with Applications in R, James, Witten, Hastie and Tibshirani
  • FINM 33601 Lecture Note, Y. Balasanov, University of Chicago