Despite its importance, the first section is not overly complicated so I will try to give a more holistic summary instead of linearly moving through the chapter.
The major concept here is cross-validation which broadly encompasses the idea of comparing performance on the training and testing data sets. Most statistical learning techniques minimize the training error rate but this is not a sufficient metric to declare the model successful or select one technique over another. Our ultimate goal for predictive modeling is to have a low test error rate. Cross validation methods attempt to help us quantify the test error rate as well as compare the test error between multiple models.
At the heart of this topic is a tension between wanting to use as much data as possible to training our learning method and also wanting to test it on fresh data. If we apportion all of the data for training, we will not have a good estimate of the testing error rate. Conversely, if we sacrifice a large portion of our data to test the model, the model fit may not be as good due to a smaller sample size. This will overestimate the testing error rate. While the way I described this is a bit of a false dichotomy, it exemplifies the competing interests of a tight model fit and the ability to test it.
There are three methods described in the book about how to go about validating models and/or selecting among them. From my reading, the authors clearly prefer the third method, but they present the others for intuition and comparison purposes. Here I will outline them:
There are a couple of other notes given to us here. One is that it is generally quite difficult to estimate the true test MSE/error rate. While these methods can be shown to be effective in simulations, real data may be more heterogeneous and wild than your simulation could account for. Another important note is that sometimes we’re less interested in estimating the actual test MSE but rather we want to compare the estimated test MSE between several models in order to select a ‘best’ one. An example of this would be selecting \(K\) in the KNN method.
Anther point made here, that I think is not obvious, is that the LOOCV is not strictly better than k-fold method. It would seem that doing all those models would only provide less variance but this is only partly true. The LOOCV has the lowest bias of these three methods since there is absolutely no luck involved in the sets partitioned for testing however it actually has a potential for higher variance than the k-fold classification. The reason for this is that the models that we create with LOOCV are almost identical and thus the errors they produce are highly correlated. Averaging these will produce a higher variance than that number of uncorrelated observations. This is (at least one of the ?) reason why \(k\) is typically chosen as a relatively small integer such as five or ten. If there isn’t a good example in the lab or exercises, I’ll simulate this later.
The final thing to note here is that although these examples were given with continuous responses, it’s easy to translate this to a classification problem. Instead we compute the k-fold error rate as:
\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^k\frac{1}{n_i}\text{Err}_i\] Where \(n_i\) is the number of observations in the \(i^{th}\) fold and \(\text{Err}_i=\sum_{i=1}^{n_k}I(y_i\neq\hat{y_i})\) is the number of observations that were misclassified. I intuitioned this formula as the example given in the book is for the LOOCV but I believe it should be the same idea. For each fold, we just find the proportion of observations that are misclassified and then average these proportions to estimate the testing error rate. There are some very nice examples given at the end where they show selecting the maximum power of terms for a logistic regression model as well as values of \(K\) for KNN. In these cases, choosing the appropriate parameters based off the minimum k-fold testing error rate provides good guidance as to the best fitting model.
The second portion of this chapter is dedicated to the bootstrap or bootstrapping methods. These are ways of estimating uncertainty or variability in summary quantities or learning models that are derived from sampling the training data with replacement repeatedly. I like these methods a lot and would really like to learn more about them. One of my former stats teachers told us that if we’re mediocre at math, we can usually simulate our way out of a lot of problems. This is the primary use case for bootstraps; sometimes it’s too much effort/impossible to find a closed form for estimates such as standard error so we just make a simulation of sorts with our data and estimate it that way.
Much of this chapter is an example so I’ll go through it. Suppose we have an amount of money and are going to invest \(\alpha\) proportion of it into a fund with \(X\) return and \(1-\alpha\) into a fund with return \(Y\). Our goal is to minimize the variability of the joint returns:
\[\sigma^2=Var(\alpha X+(1-\alpha)Y)\] \[\sigma^2=\alpha^2Var(X)+(1-\alpha)^2Var(Y) + 2\alpha(1-\alpha)Cov(X,Y)\] Now we assume that \(Var(X), Var(Y), \&\text{ } Cov(X,Y)\) are fixed so we can take a derivative with respect to \(\alpha\): \[\frac{d\sigma^2}{d\alpha}=2\alpha Var(X)-2(1-\alpha)Var(Y) + (2-4\alpha)Cov(X,Y)\] Setting to zero and solving: \[2\alpha Var(X)-2Var(Y) +2\alpha Var(Y) + 2Cov(X,Y)-4\alpha Cov(X,Y)=0\] \[\alpha=\frac{Var(Y)-Cov(X,Y)}{Var(X)+Var(Y)-2Cov(X,Y)}\] So if we are able to estimate the variability of each fund and their covariance, we can figure out which \(\alpha\) provides the most stable return:
\[\hat{\alpha}=\frac{\hat{\sigma}_Y^2-\hat{\sigma}_{X,Y}}{\hat{\sigma}_X^2+\hat{\sigma}_Y^2-2\hat{\sigma}_{X,Y}}\] Here is where we would apply the bootstrap; estimate each of the quantities on the RHS by repeatedly sampling the training data with replacement and computing sample variances and covariance. The more specific language that they use to describe the bootstrap is as follows. For some large number, \(B\), we draw that many samples, \(Z^{*1},Z^{*2},...,Z^{*B}\), where each \(Z^{*i}\) is a sample of size \(N\) drawn with replacement. After drawing these samples, we can compute whatever measurement we want such as the standard error of \(\alpha\) in the previous example:
\[S.E._B(\hat{\alpha})=\sqrt{\frac{1}{B-1}\sum_{r=1}^B(\hat{\alpha}^{*r}-\frac{1}{B}\sum_{s=1}^B\hat{\alpha}^{*s})^2}\] So this is just computing the sample standard deviation of \(\alpha\).
So for this, we’re going to go through the three cross validation methods and then do an example of the bootstrap with a linear model as the example. The first example will simply be predicting the mpg
of the cars in the auto
data set via their horsepower
.
First, the validation set approach:
set.seed(1)
# Select half the rows of the auto set for training
auto_train = Auto %>%
slice(
sample.int(392, 196)
)
auto_test = Auto %>%
anti_join(auto_train)
# Here's our training model:
auto_vset_mod = lm(
formula = mpg ~ horsepower,
data = auto_train
)
# Compute the testing set MSE
auto_vset_test_mse = predict(
auto_vset_mod,
auto_test
)
auto_vset_test_mse = mean((auto_vset_test_mse - auto_test$mpg)^2)
auto_vset_test_mse
## [1] 23.26601
So the estimated MSE is about 23 which is somewhat large considering that the middle half of the entire Auto mpg range is [17,29]. The value shown in the book is about 26 but I was not able to replicated this, even by copying their code exactly in a new session so…. not sure what’s going on there.
We also can compare this to the model with a square term:
# Train again with a square term:
auto_vset_mod_sq = lm(
formula = mpg ~ poly(horsepower, degree = 2),
data = auto_train
)
# Compute the testing set MSE
auto_vset_sq_test_mse = predict(
auto_vset_mod_sq,
auto_test
)
auto_vset_sq_test_mse = mean((auto_vset_sq_test_mse - auto_test$mpg)^2)
auto_vset_sq_test_mse
## [1] 18.71646
So with the square term we get a significantly lower test error rate. We can test with a different training set. Here we just set the seed to 2 instead of 1:
## [1] 25.72651
This shows us the variability in these estimates just due to the exact validation set we choose.
Let’s move on to the LOOCV method. Here we’ll fit the same model and use the boot
library to compute some of the CV statistics:
auto_full_mod = glm(
formula = mpg ~ horsepower,
data = Auto
)
auto_full_cv = cv.glm(
data = Auto,
glmfit = auto_full_mod
)
names(auto_full_cv)
## [1] "call" "K" "delta" "seed"
auto_full_cv$delta
## [1] 24.23151 24.23114
The values in delta
are the estimate of the MSE and the ‘adjusted’ version that attempts to compensate for bias when \(K<N\). I guess we’ll get to exactly what that means soon. In any case, this function is nice because it can do the LOOCV quickly. I’d be surprised if it doesn’t use the closed form version instead of fitting all the models.
We can test the various degrees of horsepower
and see how they compare in the testing MSE using LOOCV:
auto_poly = function(deg){
cv.glm(
data = Auto,
glmfit = glm(
formula = mpg ~ poly(horsepower, degree = deg),
data = Auto
)
)$delta[1]
}
vapply(1:5, auto_poly, numeric(1))
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
Here we see a large decline in test MSE and then little change after that.
Lastly, we can try the k-fold method. First, observe that there is some randomness in the statistics we get:
set.seed(17)
kfold_mses = rep(NA, length = 5)
for(i in 1:5){
kfold_mses[i] = cv.glm(
data = Auto,
glmfit = auto_full_mod,
K = 10
)$delta[1]
}
kfold_mses
## [1] 24.27207 24.08261 24.29863 24.25307 24.22995
For the second part of the lab, we’ll look at implementing the example previously given where we’re apportioning an investment between two funds. There is a data set for this:
dim(Portfolio)
## [1] 100 2
head(Portfolio)
## X Y
## 1 -0.8952509 -0.2349235
## 2 -1.5624543 -0.8851760
## 3 -0.4170899 0.2718880
## 4 1.0443557 -0.7341975
## 5 -0.3155684 0.8419834
## 6 -1.7371238 -2.0371910
We will write a function that computes the proporiton we should allocate to fund X first:
compute_alpha = function(df, indices){
sub_df = df %>%
slice(indices)
with(
data = sub_df,
(var(Y) - cov(X,Y))/(var(X) + var(Y) -2*cov(X, Y))
)
}
compute_alpha(Portfolio, 1:100)
## [1] 0.5758321
Then, we can use the bootstrap library to compute one thousand re-samplings of this statistic:
boot(
data = Portfolio,
statistic = compute_alpha,
R = 1000
)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = compute_alpha, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5758321 0.006376356 0.08992828
Here we get the point estimate and standard error for this statistic. I rather like this library and I think next time I’m approached with one of these problems I’ll consider looking deeper into it. The few times I’ve done this before, I’ve just written the simulation myself in full. Kinda curious what they mean by bias in this context. It’s not entirely clear what’s going on with the arguments of this function at the moment. It looks like it’s calculating statistic(data, indices)
each time. Here’s another example: we’ll compute the coefficients for the linear model by bootstrapping and then compare with the given standard error:
set.seed(1)
get_coef = function(df, indices){
lm(
mpg ~ horsepower,
data = df,
subset = indices
) %>%
coef()
}
boot(
data = Auto,
statistic = get_coef,
R = 1000
)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = get_coef, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.0553942585 0.843931305
## t2* -0.1578447 -0.0006285291 0.007367396
Then if we compare it to the output of the linear model in full, it’s similar but a bit off:
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
The linear model is computing these standard errors based off the assumptions that the trend is indeed linear and that the coefficients have a normal sampling distribution. Because the relationship is somewhat non-linear, the bootstrap estimates a higher (and more accurate) variance than the plain linear model.
They ask to derive the formulation of \(\alpha\) in the bootstrapping example and I did this in the text.
What is the probability that a particular bootstrap sample where the sample size is N? Well the probability that the observation is not the first observation is \(1-\frac{1}{N}\). Then the probability that the specific observation is none of the observations in the bootstrap sample is \(1-(1-\frac{1}{N})^N\). Here’s a graph of what that looks like as we increase the sample size.
We can do a little simulation:
one_in = function(x){
1 %in% sample(1:100, replace = TRUE)
}
vapply(
1:10000,
one_in,
logical(1)
) %>%
mean()
## [1] 0.6442
I wonder if there is some intuitive way we could have known this. We could go through figuring out what the limit is but I was kind of hoping to knwo why there is abut a 62% chance of this happening.
I made nice notes comparing k-fold classification with the other two already
Suppose we use a statistical learning method to make a prediction about \(Y\) for some value of \(X\). If we want to estimate the the sample standard deviation of these, we could use a bootstrap. A plausible method could be to sample the data with replacement, fit the statistical learning method, and then make a prediction at \(X\). After repeating this many times, we could compute the sample standard deviation from these predictions and use it as an estimate for the general standard deviation about the prediction.
We’re going to return to the default
data set where we’ll use the validation set approach to estimate the testing error of modeling whether someone defaults as a function of income
and balance
. We’ll also consider possible improvement by including student
into the model.
# This will be easier if we just make the outcome literally binary:
Default = Default %>%
mutate(
default_bin = as.numeric(default == 'Yes')
)
default_mod = function(seed){
set.seed(seed)
def_train = sample_frac(Default, 0.5)
def_test = anti_join(Default, def_train)
mod1 = glm(
default_bin ~ income + balance,
data = def_train,
family = 'binomial'
)
mod1_err = predict(
mod1,
newdata = def_test,
type = 'response'
) %>%
round()
mod1_err = mean(mod1_err != def_test$default_bin)
mod2 = glm(
default_bin ~ income + balance + student,
data = def_train,
family = 'binomial'
)
mod2_err = predict(
mod2,
newdata = def_test,
type = 'response'
) %>%
round()
mod2_err = mean(mod2_err != def_test$default_bin)
tibble(
null_error = mean(def_train$default_bin),
mod1_err = mod1_err,
mod2_err = mod2_err
)
}
(1:3) %>%
lapply(default_mod) %>%
bind_rows()
## # A tibble: 3 x 3
## null_error mod1_err mod2_err
## <dbl> <dbl> <dbl>
## 1 0.0352 0.0254 0.026
## 2 0.034 0.0238 0.0246
## 3 0.0356 0.0264 0.0272
The overall performance is not great; we only reduced the error rate by about 30% from the null model. The student
variable does not appear to add anything to the model. I believe we previously saw that this was simply associated with balance and not effective when combined.
We will now compare some estimates of standard error for the logistic regression coefficients with those produced from bootstrap samples. Begin by just looking at the coefficient estimates and their standard errors:
glm(
default_bin ~ income + balance,
data = Default,
family = 'binomial'
) %>%
broom::tidy(exponentiate = TRUE)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.00000973 0.435 -26.5 2.96e-155
## 2 income 1.00 0.00000499 4.17 2.99e- 5
## 3 balance 1.01 0.000227 24.8 3.64e-136
Now we write a function for boot
in order to estimate these:
default_coef = function(data, index){
Default %>%
slice(index) %>%
glm(
formula = default_bin ~ income + balance,
family = 'binomial'
) %>%
coef() %>%
exp()
}
boot(
data = Default,
statistic = default_coef,
R = 1000
)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = default_coef, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 9.728329e-06 5.230880e-07 4.565895e-06
## t2* 1.000021e+00 2.071371e-07 4.860660e-06
## t3* 1.005663e+00 1.989182e-05 2.323510e-04
Extremely similar. Much power, very bootstrap
Here we’re going to use the Weekly
data set to model Direction
using Lag1
and Lag2
as predictors. We’ll manually compute the LOOCV testing error rate.
make_loocv = function(i){
mod = glm(
formula = Direction ~ Lag1 + Lag2,
data = slice(Weekly, -1*i),
family = 'binomial'
)
Weekly %>%
slice(i) %>%
mutate(
pred_prob = predict.glm(
mod,
newdata = slice(Weekly, i),
type = 'response'
),
prediction = ifelse(
pred_prob > 0.5,
'Up',
'Down'
),
prediction = factor(
prediction,
levels = c('Down', 'Up')
)
)
}
weekly_loocv = lapply(
1:nrow(Weekly),
make_loocv
) %>%
bind_rows()
# Testing error rate estimate:
mean(weekly_loocv$Direction != weekly_loocv$prediction)
## [1] 0.4499541
So here we get an error rate that is quite similar to the null model which is unsurprising given how week of a relationship there is between the predictors and response.
We’re going manually compute the LOOCV again manually and this time we’ll try up to a fourth power of the predictor. We’ll then compare the models as well as the results using different seeds.
# Here's the simulated data we've been given:
set.seed(1)
sim_dat = tibble(
x = rnorm(100),
y = x-2*x^2+rnorm(100)
)
# So just from the formulation, we should expect the quadratic model to perform the best.
# First a functin to make the polynomial regression models:
pwr_mod = function(dat, deg){
glm(
formula = y ~ poly(x, degree = deg),
data = dat
)
}
# Make the model without each observation and predict:
make_loocv2 = function(i){
mods = lapply(
X = 1:4,
FUN = pwr_mod,
dat = slice(sim_dat, -1*i)
)
predictions = lapply(
X = mods,
FUN = predict.glm,
FUN.VALUE = numeric(1),
newdata = slice(sim_dat, i)
)
names(predictions) = paste0('deg_', 1:4)
as_tibble(predictions)
}
# now run it:
sim_dat = (1:nrow(sim_dat)) %>%
lapply(make_loocv2) %>%
bind_rows() %>%
bind_cols(sim_dat) %>%
mutate(
across(
matches('deg_'),
function(x){x - sim_dat$y}
)
)
# What are the mse:
sim_dat %>%
summarize(
across(
matches('deg'),
function(x){mean((x - mean(x))^2)}
)
)
## # A tibble: 1 x 4
## deg_1 deg_2 deg_3 deg_4
## <dbl> <dbl> <dbl> <dbl>
## 1 7.29 0.937 0.957 0.954
As predicted, the quadratic model has the lowest estimated test MSE. We’re also asked about different seeds but we know this should not affect the results of LOOCV as there is no randomness in the process.
The final question uses the Boston
data set. We’ll go through a series of related questions:
medv
(median value of owner occupied homes in 1000$)Boston = MASS::Boston
medv_mean = mean(Boston$medv)
medv_mean
## [1] 22.53281
sd(Boston$medv)/sqrt(nrow(Boston))
## [1] 0.4088611
medv_boot = boot(
data = Boston,
statistic = function(data, index){
data %>%
slice(index) %>%
pluck('medv') %>%
mean()
},
R = 1000
)
medv_boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston, statistic = function(data, index) {
## data %>% slice(index) %>% pluck("medv") %>% mean()
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.008186957 0.4100113
Quite close as expected.
medv
and compare the results to a t-test.medv_boot_se = sd(medv_boot$t)
paste0(
'Bootstrapped CI = (',
round(medv_mean - 1.96*medv_boot_se, 3),
', ',
round(medv_mean + 1.96*medv_boot_se, 3),
')'
)
## [1] "Bootstrapped CI = (21.729, 23.336)"
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
# First just get estimates of these from the raw data:
Boston %>%
summarize(
median = median(medv),
q10 = quantile(medv, 0.1)
)
## median q10
## 1 21.2 12.75
# Now bootstrap:
boot(
data = Boston,
statistic = function(data, index){
medvs = data %>%
slice(index) %>%
pluck('medv')
medvs = c(
median(medvs),
quantile(medvs, 0.1)
)
names(medvs) = c('median', 'q10')
medvs
},
R = 1000
)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston, statistic = function(data, index) {
## medvs = data %>% slice(index) %>% pluck("medv")
## medvs = c(median(medvs), quantile(medvs, 0.1))
## names(medvs) = c("median", "q10")
## medvs
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.20 -0.0392 0.3775790
## t2* 12.75 0.0137 0.4826699
We again get good point estimates and now have access to standard errors of these difficult to wrangle statistics. I was hoping that applying names to the output vector would put them in the print
statement but oh well.
Alright, so I thought this chapter was quite nice. I didn’t know these cross validation methods at all and it was nice to work with them. I didn’t get the feeling that this was particuarly difficult however I’m sure that there are some more advanced methods and scenarios that would make this more challenging.