Chapter 5: Resampling Methods

Definitions
Notes
- 5.1: Cross-Validation
- 5.2: The Bootstrap
Lab
Exercises
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9

Definitions

5.1

Resampling methods: Statistical techniques derived from drawing random samples with replacement from the training data.
Model assessment: Evaluating the performance of a model.
Model selection: Process of comparing various statistical learning methods and selecting one (or maybe several).
Validation set approach: Setting aside a portion of the data as a test set
Leave on out cross validation (LOOCV): Fitting the statistical method repeatedly each with a different observation excluded as a test set. Test summary statistics are averaged across the 1 observation test sets.
K-fold cross validation: Method of cross validation in which we divide the data into k equally sized sets. Repeatedly fit the statistical learning method into

5.2

Bootstrap: A method of quantifying variation or uncertainty of a statistic or model by repeatedly sampling the training data with replacement.

Notes

5.1: Cross-Validation

Despite its importance, the first section is not overly complicated so I will try to give a more holistic summary instead of linearly moving through the chapter.

The major concept here is cross-validation which broadly encompasses the idea of comparing performance on the training and testing data sets. Most statistical learning techniques minimize the training error rate but this is not a sufficient metric to declare the model successful or select one technique over another. Our ultimate goal for predictive modeling is to have a low test error rate. Cross validation methods attempt to help us quantify the test error rate as well as compare the test error between multiple models.

At the heart of this topic is a tension between wanting to use as much data as possible to training our learning method and also wanting to test it on fresh data. If we apportion all of the data for training, we will not have a good estimate of the testing error rate. Conversely, if we sacrifice a large portion of our data to test the model, the model fit may not be as good due to a smaller sample size. This will overestimate the testing error rate. While the way I described this is a bit of a false dichotomy, it exemplifies the competing interests of a tight model fit and the ability to test it.

There are three methods described in the book about how to go about validating models and/or selecting among them. From my reading, the authors clearly prefer the third method, but they present the others for intuition and comparison purposes. Here I will outline them:

Validation set approach: In this method, we randomly select a subset of our data to be left out from the training process. The authors give an example of leaving out half although I have more frequently seen a ratio of 4:1, train to test, used in other materials. Once the training is done, we compute the MSE, test error rate, or any other metrics of fit on the data that was left out. In this way, we allow the model to be tested on data that it has not seen before which simulates the idea of applying this in the wild. While this makes intuitive sense, there are two major drawbacks:

The exact partition of data into training and testing sets can have a dramatic effect on the statistics of model fit. If we happen to select a training set that does not have a good representation of the patterns in the broader population, we can end up having a very poor estimate of the training error rate.
Because we set aside a chunk of the data, the training set will necessarily be smaller and thus the fit may be poorer. As a result, the testing error rate may be overestimated due to a less tuned model.

Leave one out cross validation (LOOCV): Alternatively, we could leave a single observation out of the model and use this as a test set. Doing this once would be a highly volatile method but if we do it with every observation in the data set, we can average these errors to estimate the testing error rate. The error for a continuous response could be estimated like this: \[CV_{(n)}=\frac{1}{n}\sum_{i=1}^n\text{MSE}_i\] Where $\text{MSE}_i$ is $(y_i-\hat{y_i})^2$. Here we’re computing N models and computing one unbiased estimate of the testing MSE from each. Then averaging these gives a more robust estimate of the testing MSE. By doing this, we avoid the two problems mentioned with the validation set approach. First, we will be using almost all the data in each model fit so we will not inflate the test error rate by using an arbitrarily small data set to train our learning method. Secondly, possible bias from selecting a particular subset of the data will not matter because LOOCV does the same thing for every observation. The major drawback of this method is that it’s very computationally expensive. Until recently, this would not be possible with many large data sets and/or complex models but now we can try it. Even so, I suspect that you would not want to use this method as any kind of preliminary step. Perhaps once you’ve found a model that you’re overall pleased with, you could set this to run overnight in order to get a more accurate estimate of the training MSE/error rate. One other thing to mention here is that there is a nice closed form for linear regression (even with polynomial terms) that uses the converse of the leverage as a weight.
k-fold cross validation: This last method is a bit between the other two and definitely seems to be preferred by the authors. Instead of just picking on subset/observation to leave out, we partition the data into $k$ subsets. Then we apply the validation set approach $k$ times where each time we’re using $\frac{1}{k}^{th}$ of the data as a validation set. We then compute the estimate of the test error way in a similar fashion as the LOOCV method: \[CV_{(k)}=\frac{1}{k}\sum_{i=1}^k\text{MSE}_i\] By doing this, we are able to partially solve the problems of each of the previous methods. We are not nearly as reliant on the exact subset we pick as the first method and we will use a larger portion of the data to fit the models. We’re told that typically people choose $k=5$ or $k=10$ but in general, $k << N$, so the number of models that will need to be fit is quite a bit smaller than for LOOCV meaning this can be done more quickly.

There are a couple of other notes given to us here. One is that it is generally quite difficult to estimate the true test MSE/error rate. While these methods can be shown to be effective in simulations, real data may be more heterogeneous and wild than your simulation could account for. Another important note is that sometimes we’re less interested in estimating the actual test MSE but rather we want to compare the estimated test MSE between several models in order to select a ‘best’ one. An example of this would be selecting $K$ in the KNN method.
Anther point made here, that I think is not obvious, is that the LOOCV is not strictly better than k-fold method. It would seem that doing all those models would only provide less variance but this is only partly true. The LOOCV has the lowest bias of these three methods since there is absolutely no luck involved in the sets partitioned for testing however it actually has a potential for higher variance than the k-fold classification. The reason for this is that the models that we create with LOOCV are almost identical and thus the errors they produce are highly correlated. Averaging these will produce a higher variance than that number of uncorrelated observations. This is (at least one of the ?) reason why $k$ is typically chosen as a relatively small integer such as five or ten. If there isn’t a good example in the lab or exercises, I’ll simulate this later.
The final thing to note here is that although these examples were given with continuous responses, it’s easy to translate this to a classification problem. Instead we compute the k-fold error rate as:

\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^k\frac{1}{n_i}\text{Err}_i\] Where $n_i$ is the number of observations in the $i^{th}$ fold and $\text{Err}_i=\sum_{i=1}^{n_k}I(y_i\neq\hat{y_i})$ is the number of observations that were misclassified. I intuitioned this formula as the example given in the book is for the LOOCV but I believe it should be the same idea. For each fold, we just find the proportion of observations that are misclassified and then average these proportions to estimate the testing error rate. There are some very nice examples given at the end where they show selecting the maximum power of terms for a logistic regression model as well as values of $K$ for KNN. In these cases, choosing the appropriate parameters based off the minimum k-fold testing error rate provides good guidance as to the best fitting model.

5.2: The Bootstrap

The second portion of this chapter is dedicated to the bootstrap or bootstrapping methods. These are ways of estimating uncertainty or variability in summary quantities or learning models that are derived from sampling the training data with replacement repeatedly. I like these methods a lot and would really like to learn more about them. One of my former stats teachers told us that if we’re mediocre at math, we can usually simulate our way out of a lot of problems. This is the primary use case for bootstraps; sometimes it’s too much effort/impossible to find a closed form for estimates such as standard error so we just make a simulation of sorts with our data and estimate it that way.
Much of this chapter is an example so I’ll go through it. Suppose we have an amount of money and are going to invest $\alpha$ proportion of it into a fund with $X$ return and $1-\alpha$ into a fund with return $Y$. Our goal is to minimize the variability of the joint returns:

\[\sigma^2=Var(\alpha X+(1-\alpha)Y)\] \[\sigma^2=\alpha^2Var(X)+(1-\alpha)^2Var(Y) + 2\alpha(1-\alpha)Cov(X,Y)\] Now we assume that $Var(X), Var(Y), \&\text{ } Cov(X,Y)$ are fixed so we can take a derivative with respect to $\alpha$: \[\frac{d\sigma^2}{d\alpha}=2\alpha Var(X)-2(1-\alpha)Var(Y) + (2-4\alpha)Cov(X,Y)\] Setting to zero and solving: \[2\alpha Var(X)-2Var(Y) +2\alpha Var(Y) + 2Cov(X,Y)-4\alpha Cov(X,Y)=0\] \[\alpha=\frac{Var(Y)-Cov(X,Y)}{Var(X)+Var(Y)-2Cov(X,Y)}\] So if we are able to estimate the variability of each fund and their covariance, we can figure out which $\alpha$ provides the most stable return:

\[\hat{\alpha}=\frac{\hat{\sigma}_Y^2-\hat{\sigma}_{X,Y}}{\hat{\sigma}_X^2+\hat{\sigma}_Y^2-2\hat{\sigma}_{X,Y}}\] Here is where we would apply the bootstrap; estimate each of the quantities on the RHS by repeatedly sampling the training data with replacement and computing sample variances and covariance. The more specific language that they use to describe the bootstrap is as follows. For some large number, $B$, we draw that many samples, $Z^{*1},Z^{*2},...,Z^{*B}$, where each $Z^{*i}$ is a sample of size $N$ drawn with replacement. After drawing these samples, we can compute whatever measurement we want such as the standard error of $\alpha$ in the previous example:

\[S.E._B(\hat{\alpha})=\sqrt{\frac{1}{B-1}\sum_{r=1}^B(\hat{\alpha}^{*r}-\frac{1}{B}\sum_{s=1}^B\hat{\alpha}^{*s})^2}\] So this is just computing the sample standard deviation of $\alpha$.

Lab

So for this, we’re going to go through the three cross validation methods and then do an example of the bootstrap with a linear model as the example. The first example will simply be predicting the mpg of the cars in the auto data set via their horsepower.

First, the validation set approach:

set.seed(1)
# Select half the rows of the auto set for training
auto_train = Auto %>%
  slice(
    sample.int(392, 196)
  )
auto_test = Auto %>%
  anti_join(auto_train)

# Here's our training model:
auto_vset_mod = lm(
  formula = mpg ~ horsepower,
  data = auto_train
)

# Compute the testing set MSE
auto_vset_test_mse = predict(
  auto_vset_mod,
  auto_test
)
auto_vset_test_mse = mean((auto_vset_test_mse - auto_test$mpg)^2)
auto_vset_test_mse

## [1] 23.26601

So the estimated MSE is about 23 which is somewhat large considering that the middle half of the entire Auto mpg range is [17,29]. The value shown in the book is about 26 but I was not able to replicated this, even by copying their code exactly in a new session so…. not sure what’s going on there.

We also can compare this to the model with a square term:

# Train again with a square term:
auto_vset_mod_sq = lm(
  formula = mpg ~ poly(horsepower, degree = 2),
  data = auto_train
)

# Compute the testing set MSE
auto_vset_sq_test_mse = predict(
  auto_vset_mod_sq,
  auto_test
)
auto_vset_sq_test_mse = mean((auto_vset_sq_test_mse - auto_test$mpg)^2)
auto_vset_sq_test_mse

## [1] 18.71646

So with the square term we get a significantly lower test error rate. We can test with a different training set. Here we just set the seed to 2 instead of 1:

## [1] 25.72651

This shows us the variability in these estimates just due to the exact validation set we choose.
Let’s move on to the LOOCV method. Here we’ll fit the same model and use the boot library to compute some of the CV statistics:

auto_full_mod = glm(
  formula = mpg ~ horsepower,
  data = Auto
)
auto_full_cv = cv.glm(
  data = Auto,
  glmfit = auto_full_mod
)
names(auto_full_cv)

## [1] "call"  "K"     "delta" "seed"

auto_full_cv$delta

## [1] 24.23151 24.23114

The values in delta are the estimate of the MSE and the ‘adjusted’ version that attempts to compensate for bias when $K<N$. I guess we’ll get to exactly what that means soon. In any case, this function is nice because it can do the LOOCV quickly. I’d be surprised if it doesn’t use the closed form version instead of fitting all the models.
We can test the various degrees of horsepower and see how they compare in the testing MSE using LOOCV:

auto_poly = function(deg){
  cv.glm(
    data = Auto,
    glmfit = glm(
      formula = mpg ~ poly(horsepower, degree = deg),
      data = Auto
    )
  )$delta[1]
}

vapply(1:5, auto_poly, numeric(1))

## [1] 24.23151 19.24821 19.33498 19.42443 19.03321

Here we see a large decline in test MSE and then little change after that.

Lastly, we can try the k-fold method. First, observe that there is some randomness in the statistics we get:

set.seed(17)
kfold_mses = rep(NA, length = 5)

for(i in 1:5){
  kfold_mses[i] = cv.glm(
    data = Auto,
    glmfit = auto_full_mod,
    K = 10
  )$delta[1]
}
kfold_mses

## [1] 24.27207 24.08261 24.29863 24.25307 24.22995

For the second part of the lab, we’ll look at implementing the example previously given where we’re apportioning an investment between two funds. There is a data set for this:

dim(Portfolio)

## [1] 100   2

head(Portfolio)

##            X          Y
## 1 -0.8952509 -0.2349235
## 2 -1.5624543 -0.8851760
## 3 -0.4170899  0.2718880
## 4  1.0443557 -0.7341975
## 5 -0.3155684  0.8419834
## 6 -1.7371238 -2.0371910

We will write a function that computes the proporiton we should allocate to fund X first:

compute_alpha = function(df, indices){
  sub_df = df %>% 
    slice(indices)
  with(
    data = sub_df,
    (var(Y) - cov(X,Y))/(var(X) + var(Y) -2*cov(X, Y))
  )
  
}
compute_alpha(Portfolio, 1:100)

## [1] 0.5758321

Then, we can use the bootstrap library to compute one thousand re-samplings of this statistic:

boot(
  data = Portfolio,
  statistic = compute_alpha,
  R = 1000
)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = compute_alpha, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original      bias    std. error
## t1* 0.5758321 0.006376356  0.08992828

Here we get the point estimate and standard error for this statistic. I rather like this library and I think next time I’m approached with one of these problems I’ll consider looking deeper into it. The few times I’ve done this before, I’ve just written the simulation myself in full. Kinda curious what they mean by bias in this context. It’s not entirely clear what’s going on with the arguments of this function at the moment. It looks like it’s calculating statistic(data, indices) each time. Here’s another example: we’ll compute the coefficients for the linear model by bootstrapping and then compare with the given standard error:

set.seed(1)
get_coef = function(df, indices){
  lm(
    mpg ~ horsepower,
    data = df,
    subset = indices
  ) %>% 
    coef()
}

boot(
  data = Auto,
  statistic = get_coef,
  R = 1000
)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = get_coef, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0553942585 0.843931305
## t2* -0.1578447 -0.0006285291 0.007367396

Then if we compare it to the output of the linear model in full, it’s similar but a bit off:

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

The linear model is computing these standard errors based off the assumptions that the trend is indeed linear and that the coefficients have a normal sampling distribution. Because the relationship is somewhat non-linear, the bootstrap estimates a higher (and more accurate) variance than the plain linear model.

Exercises

1

They ask to derive the formulation of $\alpha$ in the bootstrapping example and I did this in the text.

2

What is the probability that a particular bootstrap sample where the sample size is N? Well the probability that the observation is not the first observation is $1-\frac{1}{N}$. Then the probability that the specific observation is none of the observations in the bootstrap sample is $1-(1-\frac{1}{N})^N$. Here’s a graph of what that looks like as we increase the sample size.

We can do a little simulation:

one_in = function(x){
  1 %in% sample(1:100, replace = TRUE)
}

vapply(
  1:10000, 
  one_in,
  logical(1)
) %>% 
  mean()

## [1] 0.6442

I wonder if there is some intuitive way we could have known this. We could go through figuring out what the limit is but I was kind of hoping to knwo why there is abut a 62% chance of this happening.

3

I made nice notes comparing k-fold classification with the other two already

4

Suppose we use a statistical learning method to make a prediction about $Y$ for some value of $X$. If we want to estimate the the sample standard deviation of these, we could use a bootstrap. A plausible method could be to sample the data with replacement, fit the statistical learning method, and then make a prediction at $X$. After repeating this many times, we could compute the sample standard deviation from these predictions and use it as an estimate for the general standard deviation about the prediction.

5

We’re going to return to the default data set where we’ll use the validation set approach to estimate the testing error of modeling whether someone defaults as a function of income and balance. We’ll also consider possible improvement by including student into the model.

# This will be easier if we just make the outcome literally binary:
Default = Default %>% 
  mutate(
    default_bin = as.numeric(default == 'Yes')
  )

default_mod = function(seed){
  
  set.seed(seed)
  def_train = sample_frac(Default, 0.5)
  def_test = anti_join(Default, def_train)
  
  mod1 = glm(
    default_bin ~ income + balance,
    data = def_train,
    family = 'binomial'
  )
  mod1_err = predict(
    mod1,
    newdata = def_test,
    type = 'response'
  ) %>% 
    round()
  mod1_err = mean(mod1_err != def_test$default_bin)
  
  mod2 = glm(
    default_bin ~ income + balance + student,
    data = def_train,
    family = 'binomial'
  )
  mod2_err = predict(
    mod2,
    newdata = def_test,
    type = 'response'
  ) %>% 
    round()
  mod2_err = mean(mod2_err != def_test$default_bin)
  
  tibble(
    null_error = mean(def_train$default_bin),
    mod1_err = mod1_err,
    mod2_err = mod2_err
  )
}


(1:3) %>% 
  lapply(default_mod) %>% 
  bind_rows()

## # A tibble: 3 x 3
##   null_error mod1_err mod2_err
##        <dbl>    <dbl>    <dbl>
## 1     0.0352   0.0254   0.026 
## 2     0.034    0.0238   0.0246
## 3     0.0356   0.0264   0.0272

The overall performance is not great; we only reduced the error rate by about 30% from the null model. The student variable does not appear to add anything to the model. I believe we previously saw that this was simply associated with balance and not effective when combined.

6

We will now compare some estimates of standard error for the logistic regression coefficients with those produced from bootstrap samples. Begin by just looking at the coefficient estimates and their standard errors:

glm(
  default_bin ~ income + balance,
  data = Default,
  family = 'binomial'
) %>% 
  broom::tidy(exponentiate = TRUE)

## # A tibble: 3 x 5
##   term          estimate  std.error statistic   p.value
##   <chr>            <dbl>      <dbl>     <dbl>     <dbl>
## 1 (Intercept) 0.00000973 0.435         -26.5  2.96e-155
## 2 income      1.00       0.00000499      4.17 2.99e-  5
## 3 balance     1.01       0.000227       24.8  3.64e-136

Now we write a function for boot in order to estimate these:

default_coef = function(data, index){
  
  Default %>% 
    slice(index) %>% 
    glm(
      formula = default_bin ~ income + balance,
      family = 'binomial'
    ) %>% 
    coef() %>% 
    exp()
}

boot(
  data = Default,
  statistic = default_coef,
  R = 1000
)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = default_coef, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 9.728329e-06 5.230880e-07 4.565895e-06
## t2* 1.000021e+00 2.071371e-07 4.860660e-06
## t3* 1.005663e+00 1.989182e-05 2.323510e-04

Extremely similar. Much power, very bootstrap

7

Here we’re going to use the Weekly data set to model Direction using Lag1 and Lag2 as predictors. We’ll manually compute the LOOCV testing error rate.

make_loocv = function(i){
  
  mod = glm(
    formula = Direction ~ Lag1 + Lag2,
    data = slice(Weekly, -1*i),
    family = 'binomial'
  )
  
  Weekly %>% 
    slice(i) %>% 
    mutate(
      pred_prob = predict.glm(
        mod, 
        newdata = slice(Weekly, i), 
        type = 'response'
      ),
      prediction = ifelse(
        pred_prob > 0.5,
        'Up',
        'Down'
      ),
      prediction = factor(
        prediction,
        levels = c('Down', 'Up')
      )
    )
  
}

weekly_loocv = lapply(
  1:nrow(Weekly),
  make_loocv
) %>% 
  bind_rows()

# Testing error rate estimate:
mean(weekly_loocv$Direction != weekly_loocv$prediction)

## [1] 0.4499541

So here we get an error rate that is quite similar to the null model which is unsurprising given how week of a relationship there is between the predictors and response.

8

We’re going manually compute the LOOCV again manually and this time we’ll try up to a fourth power of the predictor. We’ll then compare the models as well as the results using different seeds.

# Here's the simulated data we've been given:
set.seed(1)
sim_dat = tibble(
  x = rnorm(100),
  y = x-2*x^2+rnorm(100)
)
# So just from the formulation, we should expect the quadratic model to perform the best.

# First a functin to make the polynomial regression models:
pwr_mod = function(dat, deg){
  
  glm(
    formula = y ~ poly(x, degree = deg),
    data = dat
  )
  
}

# Make the model without each observation and predict:
make_loocv2 = function(i){
  
  mods = lapply(
    X = 1:4,
    FUN = pwr_mod,
    dat = slice(sim_dat, -1*i)
  )
  
  predictions = lapply(
    X = mods,
    FUN = predict.glm,
    FUN.VALUE = numeric(1),
    newdata = slice(sim_dat, i)
  )
  
  names(predictions) = paste0('deg_', 1:4)
  as_tibble(predictions)
  
}

# now run it:
sim_dat = (1:nrow(sim_dat)) %>% 
  lapply(make_loocv2) %>% 
  bind_rows() %>% 
  bind_cols(sim_dat) %>% 
  mutate(
    across(
      matches('deg_'),
      function(x){x - sim_dat$y}
    )
  )

# What are the mse:
sim_dat %>% 
  summarize(
    across(
      matches('deg'),
      function(x){mean((x - mean(x))^2)}
    )
  )

## # A tibble: 1 x 4
##   deg_1 deg_2 deg_3 deg_4
##   <dbl> <dbl> <dbl> <dbl>
## 1  7.29 0.937 0.957 0.954

As predicted, the quadratic model has the lowest estimated test MSE. We’re also asked about different seeds but we know this should not affect the results of LOOCV as there is no randomness in the process.

9

The final question uses the Boston data set. We’ll go through a series of related questions:

Provide an estimate of the population meand of medv (median value of owner occupied homes in 1000$)

Boston = MASS::Boston
medv_mean = mean(Boston$medv)
medv_mean

## [1] 22.53281

Estimate the standard error:

sd(Boston$medv)/sqrt(nrow(Boston))

## [1] 0.4088611

Estimate the standard error using a bootstrap:

medv_boot = boot(
  data = Boston,
  statistic = function(data, index){
    data %>% 
      slice(index) %>% 
      pluck('medv') %>% 
      mean()
  },
  R = 1000
)

medv_boot

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = function(data, index) {
##     data %>% slice(index) %>% pluck("medv") %>% mean()
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.008186957   0.4100113

Quite close as expected.

Use the bootstrap to compute a 95% CI for medv and compare the results to a t-test.

medv_boot_se = sd(medv_boot$t) 
paste0(
  'Bootstrapped CI = (', 
  round(medv_mean - 1.96*medv_boot_se, 3), 
  ', ', 
  round(medv_mean + 1.96*medv_boot_se, 3),
  ')'
)

## [1] "Bootstrapped CI = (21.729, 23.336)"

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Repeat this process with the median and tenth percentile. Order statistics don’t have (or at least I’m not aware of) nice formulae for computing their standard errors so the bootstrap is very useful in this case:

# First just get estimates of these from the raw data:
Boston %>% 
  summarize(
    median = median(medv),
    q10 = quantile(medv, 0.1)
  )

##   median   q10
## 1   21.2 12.75

# Now bootstrap:
boot(
  data = Boston,
  statistic = function(data, index){
    medvs = data %>% 
      slice(index) %>% 
      pluck('medv')
    medvs = c(
      median(medvs),
      quantile(medvs, 0.1)
    )
    names(medvs) = c('median', 'q10')
    medvs
  },
  R = 1000
)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = function(data, index) {
##     medvs = data %>% slice(index) %>% pluck("medv")
##     medvs = c(median(medvs), quantile(medvs, 0.1))
##     names(medvs) = c("median", "q10")
##     medvs
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    21.20 -0.0392   0.3775790
## t2*    12.75  0.0137   0.4826699

We again get good point estimates and now have access to standard errors of these difficult to wrangle statistics. I was hoping that applying names to the output vector would put them in the print statement but oh well.

Alright, so I thought this chapter was quite nice. I didn’t know these cross validation methods at all and it was nice to work with them. I didn’t get the feeling that this was particuarly difficult however I’m sure that there are some more advanced methods and scenarios that would make this more challenging.