Definitions


Notes

This chapter focuses on a number of tweaks we can make to linear models. The most important concerns addressed here are:

  1. Reducing the variance of models produced in situations where the number of predictors is close to the number of observations.
  2. Improving interpretability by reducing the total number of predictors.

The methods described are:

  1. Subset selection: Choosing only a subset of predictors to use.
  2. Shrinkage methods: Reducing variance of model fitting by diminishing the effect of some or all of the predictors.
  3. Dimension reduction: Reframing the problem in terms of combinations of variables, thereby reducing the number of predictors used.

6.1: Subset Selection

It’s often the case that we have many more variables than necessary to create an effective statistical learning model. This can happen because some predictors are unrelated to the outcomes, because a complicated model is more difficult to interpret, or because the number of predictors strains or prohibits the model fit. In order to avoid these problems, we will use subset selection methods.

As a theoretical starting point, we can consider the best subset selection method. In some cases this is feasible, but I get the feeling that it serves mostly as a foil for the other methods discussed. Essentially, it involves fitting every possible model (all \(2^p\) of them) and then using some kind of cross validation method to compare them. The model with the best fit under that method would be chosen as the ‘best subset’ of predictors. Unfortunately, this is computationally expensive, so it is not ideal for most situations.

A variation on this is given where we reduce the time spent cross validating by following these steps:

  1. Begin with a null model \(\mathcal{M}_0\) using no predictors.
  2. For \(0<i\leq p\), create all the models with \(i\) predictors. Select the model with the least RSS (or the best value of another measure, such as deviance) as \(\mathcal{M}_i\).
  3. Cross validate each \(\mathcal{M}_i\) and select the model with the lowest test MSE, \(C_p\), AIC, BIC, adjusted \(R^2\), or any other metric.

Personally, I don’t find this algorithm to be that compelling because the only thing that it reduces over testing every possible model is the cross validation step. In practice, the model fitting will probably be the longest time investment and this still involves fitting every possible model.
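
Just to make the procedure concrete for myself, here’s a minimal sketch of it in Python (my own, not from the book, whose labs use R). It assumes \(X\) is a numeric \(n \times p\) numpy array and \(y\) a length-\(n\) array, and it brute-forces every subset, so it’s only sensible for small \(p\):

```python
# Minimal sketch of best subset selection (steps 1-3 above).
from itertools import combinations
import numpy as np

def ols_rss(X, y, cols):
    """Fit OLS on the given columns (plus an intercept) and return the RSS."""
    design = np.column_stack([np.ones(len(y))] + [X[:, [j]] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def best_subsets(X, y):
    """For each size i = 0..p, return the lowest-RSS subset of that size (M_i)."""
    p = X.shape[1]
    best = {0: ((), ols_rss(X, y, []))}                       # M_0: intercept only
    for i in range(1, p + 1):
        candidates = combinations(range(p), i)                # all models with i predictors
        best[i] = min(((c, ols_rss(X, y, list(c))) for c in candidates),
                      key=lambda pair: pair[1])
    return best       # compare M_0..M_p with cross validation, AIC, BIC, etc.
```

The last step, comparing across sizes, is exactly where the cross validation or adjusted-error measures described later in the section come in.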

The next set of methods all fall under the heading of stepwise selection. These methods involve some kind of iterative re-fitting of models in which we start with some initial model and then repeatedly add and/or remove predictors to the model according to some criteria related to the potential change in model fit. Forward and backward selection work like this:

  1. Begin with a null model \(\mathcal{M}_0\). In the case of forward, this is the intercept only. For backward, we start with the full model.
  2. Add or remove (depending on direction) predictors one by one and compute a summary statistic for each, such as RSS. Select the best of these models as \(\mathcal{M}_1\).
  3. For each subsequent \(1<i\leq p\), build upon the previous model. Again consider the \(p-i+1\) candidate predictors, choose one to add/remove from the model, and set the best of these models as \(\mathcal{M}_i\).
  4. Cross validate \(\mathcal{M}_0, \mathcal{M}_1, ..., \mathcal{M}_p\) or use another metric to choose the best subset of predictors.
These are pretty reasonable ways to select a model in my opinion, as they do not have to fit nearly as many models (\(1+\frac{p(p+1)}{2}\) instead of \(2^p\)); a rough code sketch of forward selection follows the notes below.

Some other things to note about stepwise selection methods:
  1. Forward and backward selection do not necessarily yield the same results, although the subsets are often similar or identical.
  2. There is no guarantee that the best model is actually considered in these methods. They are what you might call ‘greedy’ algorithms in that they try to maximize improvement at each step. The true best model could involve just two predictors, neither of which would be selected in the first round of forward selection.
  3. When \(n<p\), backward selection cannot be performed and forward selection will terminate at subsets of size \(n\).
  4. There are hybrid approaches in which we set criteria to enter as well as leave. These methods behave something like ‘after each additional variable is added, the other predictors are re-evaluated for removal’, with some kind of stopping procedure to make sure this doesn’t go in loops.
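
Here is the forward selection sketch I promised above, in the same spirit as the best subset one (again my own Python on a numeric \(X\) and \(y\), not anything from the book):

```python
# Rough sketch of forward stepwise selection: at each step, add whichever
# remaining predictor reduces the RSS the most.
import numpy as np

def ols_rss(X, y, cols):
    """RSS of an OLS fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, [j]] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    p = X.shape[1]
    chosen, remaining = [], list(range(p))
    path = [(tuple(chosen), ols_rss(X, y, chosen))]           # M_0: intercept only
    for _ in range(p):
        best_j = min(remaining, key=lambda j: ols_rss(X, y, chosen + [j]))
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append((tuple(chosen), ols_rss(X, y, chosen)))   # M_1, M_2, ..., M_p
    return path    # pick among these by cross validation or an adjusted error measure
```

Backward selection would be the mirror image: start with all \(p\) predictors and repeatedly drop the one whose removal increases the RSS the least.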

The next portion of the text describes some ways of choosing the ‘best’ subset. One method, which we’ve already covered, is just to use the cross validation methods described in chapter 5. Alternatively, there are some methods of adjusting the training error rate in order to select a model that should have a small testing error rate. The listed methods are Mallows’ \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the adjusted \(R^2\).

Mallows’ \(C_p\) is just a transformation of the residual sum of squares: \[C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2)\]

Where \(d\) is the number of predictors used and \(\hat{\sigma}^2\) is the estimate of the variance of \(\epsilon\) from the regression estimate. In the linear regression case I believe this should be the MSE. We’re told that if the regression model is unbiased (truly meets the assumptions of linearity, etc.), then \(C_p\) is an unbiased estimate of the testing MSE. Therefore, we would like to select the model with the smallest value of \(C_p\). From the formula, we can also see that \(C_p\) puts a penalty on the number of predictors, so unlike \(R^2\), simpler models can be selected via this method. An alternative formulation is given as:

\[C'_p=\frac{RSS}{\hat{\sigma}^2}+2d-n\] Which is minimized by the same model as \(C_p\) among all the models in consideration.

The next measure that we could consider here is the Akaike information criterion, which is given as:

\[AIC=2d-2\ln(L)\approx\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)\] Where \(L\) is the likelihood of the model and \(-2\ln(L)\) is the deviance. So having more predictors increases the AIC, as does a smaller \(L\) (and thus a larger deviance). I’m reading from some other sources that this quantity estimates the ‘information lost’ via the estimate of \(f\). It’s not clear to me exactly what this means, but it’s not something we want and thus we choose to minimize the AIC. There are also corrected versions of this because apparently the desired properties are only asymptotically achieved. Another point to note is that this doesn’t tell us anything about the absolute fit of the model, only the relative efficacy between models.

It’s directly proportional to \(C_p\) so from that perspective, it probably is not worthwhile computing them both… as far as I can tell.

Another measure in this line is the Bayesian information criterion, which is quite similar to the AIC but puts a different weight on the number of predictors:

\[BIC=d\cdot \ln(n)-2\ln(L)\approx \frac{1}{n\hat{\sigma}^2}(RSS+\ln(n)d\hat{\sigma}^2)\] In general, we will have data sets with more than 7 observations so \(\ln(n)>2\). This indicates that the BIC more heavily penalizes models with many predictors than the AIC does.

The last measure listed here is the adjusted \(R^2\), which I have heard lots of complaints about but have not looked into the reasoning for these complaints yet. It’s given as: \[\text{Adj. }R^2=1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\] Here the goal is to maximize the adjusted \(R^2\), which really just means minimizing \(RSS/(n-d-1)\) since the TSS is fixed for a given data set. Unlike \(R^2\), this can potentially go down with more predictors if the addition of a variable does not provide a sufficient reduction in RSS. The authors note that while there is a decent intuition behind the adjusted \(R^2\), there is no real theoretical derivation of it as a useful metric.
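
To keep these four formulas straight, here’s a tiny helper (mine, not the book’s) that computes them for a single fitted model, assuming we already know its RSS, the TSS, the sample size \(n\), the number of predictors \(d\), and an estimate \(\hat{\sigma}^2\) of the error variance from the full model:

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 as written above (Gaussian linear model)."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)          # proportional to Cp
    bic = (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)  # heavier penalty on d
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adj_R2": adj_r2}
```

We’d compute these for each candidate \(\mathcal{M}_i\) and pick the model that minimizes \(C_p\)/AIC/BIC or maximizes the adjusted \(R^2\).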

The last portion of this section describes using cross validation methods to select a model. The methods are the same as described in chapter 5. One other thing they note is that people sometimes use the one-standard-error rule. This is a method of selecting among models with similar testing MSE/error rates. It basically suggests that if there are multiple models within one standard error of the lowest estimated test MSE, choose the one with the fewest predictors. This allows you to pick a more parsimonious model from among comparable options.
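
A sketch of that rule, assuming we already have a cross validation error estimate and a standard error for each model size:

```python
def one_se_rule(cv_err, cv_se):
    """Smallest model whose CV error is within one standard error of the best.
    cv_err[d] and cv_se[d] are assumed to be the CV error estimate and its
    standard error for the model with d predictors."""
    best = min(range(len(cv_err)), key=lambda d: cv_err[d])
    threshold = cv_err[best] + cv_se[best]
    return min(d for d in range(len(cv_err)) if cv_err[d] <= threshold)
```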

6.2: Shrinkage Methods

The next section describes two techniques that were previously unknown to me: ridge and lasso regression. These techniques aim to solve the following problem: although linear models are less flexible than many of the other techniques described in this book, they can still be too flexible for a particular problem. In particular, if you have many predictors, the model fit can pick up on irrelevant noise from some of them, which leads to increased errors in the testing set. Another issue that can cause high variance in linear models is highly correlated predictors.

As a result, ridge and lasso models introduce some additional bias in the form of a constraint on the coefficients in order to reduce the variance of the estimates compared to ordinary least squares regression. These are called shrinkage methods because they reduce the influence of one or more of the predictors in order to produce coefficient estimates that will be more stable over different data sets. The reason this counts as bias is that even if the assumptions of the linear model are met, the penalty pulls the coefficient estimates away from the least squares solution. Another source I looked at said that these methods desensitize linear models so as to reduce overfitting.

The formulation for these methods is similar. Instead of simply minimizing the sum of squared errors, as OLS does, we minimize this quantity: \[\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{i,j}\right)^2+\lambda f(\boldsymbol{\beta})\] The first term is just the sum of squared errors (RSS), \(\lambda\geq0\) is called a tuning parameter, and \(f(\boldsymbol{\beta})\geq0\) is a function of the coefficients (but not the intercept).

So what is this doing? Unlike OLS, we now have two considerations to take into account when selecting the coefficient estimates. First, we still consider the size of the errors in that we want a small RSS. The second term tells us that we’re also trying to have a small \(f(\boldsymbol{\beta})\). I imagine that there are other constraints you could use, but in the case of the ridge and lasso, \(f\) is increasing with respect to the magnitude of each coefficient.

For the ridge, \(f(\boldsymbol{\beta})\) is the square of what is called the ‘ell 2’ (\(\ell_2\)) norm, \(\|\boldsymbol{\beta}\|_2=\sqrt{\sum_{j=1}^p\beta_j^2}\): \[f(\boldsymbol{\beta})=\|\boldsymbol{\beta}\|_2^2=\sum_{j=1}^p\beta_j^2\] And for the lasso it’s the \(\ell_1\) norm: \[f(\boldsymbol{\beta})=\|\boldsymbol{\beta}\|_1=\sum_{j=1}^p|\beta_j|\] I imagine that \(\ell_3\) would be the cube root of the sum of the absolute values cubed? In any case, we can see that these functions increase as the coefficient estimates grow in magnitude. As a result, smaller coefficient estimates will be favored when fitting the ridge and lasso models. This is where the shrinkage comes from; larger coefficient estimates incur a higher penalty and so are only kept if they buy a sufficiently large reduction in RSS.
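
Written out directly (again, just my own sketch of the objective above, not anything from the book’s labs):

```python
import numpy as np

def penalized_rss(X, y, beta0, beta, lam, penalty="ridge"):
    """RSS plus the ridge (sum of squares) or lasso (sum of absolute values) penalty.
    X is an (n, p) numpy array, y a length-n array, beta a length-p array."""
    resid = y - (beta0 + X @ beta)
    rss = float(resid @ resid)
    if penalty == "ridge":
        return rss + lam * float(np.sum(beta ** 2))
    return rss + lam * float(np.sum(np.abs(beta)))
```

Fitting the ridge or lasso then amounts to finding the \(\beta_0, \boldsymbol{\beta}\) that minimize this for a given \(\lambda\).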

When we set \(\lambda=0\) we get the OLS estimates. When \(\lambda>>0\), the coefficients will be driven towards zero and we’ll get something close to the null model. As the tuning parameter is increased, the bias will increase (nonlinearly) and the variance will decrease (also nonlinearly). The graphs in an example given suggest that there may be some ‘sweet spot’ where you add a little bit of bias, get a sizable reduction in variance, and then stop increasing \(\lambda\). Therefore, choosing the tuning parameter is an issue, and it is discussed a little later.

When I first saw this I thought it seemed extremely silly, as it initially looks like the penalty will not be applied equally across the various predictors if they have different scales. This is true; however, the authors note that people typically standardize the predictors (divide each by its sample standard deviation) before using these techniques.
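
Here’s what that looks like with scikit-learn on made-up data (the book’s labs use R’s glmnet instead; note that scikit-learn calls the tuning parameter alpha rather than \(\lambda\), and its penalties are scaled a bit differently, so the exact values here aren’t meant to match anything in the text):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                    # toy data: 100 observations, 10 predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)         # put every predictor on the same scale

ridge = Ridge(alpha=1.0).fit(X_std, y)
lasso = Lasso(alpha=0.1).fit(X_std, y)
print(ridge.coef_)                                # shrunken, but none exactly zero
print(lasso.coef_)                                # typically several of these are exactly zero
```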

They give another formulation of this, which is that we could consider this problem: \[\text{minimize}\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{i,j}\right)^2 \text{ subject to }f(\boldsymbol{\beta})\leq s\] They say that once you choose a \(\lambda\), there is some \(s\) for which this problem produces the same coefficient estimates as the lasso/ridge regression. I didn’t find this transition to be very clear, but I found a youtube video that took a different tack.

Instead of starting with the quantity involving \(\lambda\), they start by saying we’re going to minimize the RSS under the constraint that \(\|\boldsymbol{\beta}\|_2^2\leq c^2\) (for ridge regression). This is essentially saying that the betas must be confined within a circle/sphere/hypersphere. Then the video says that in order to minimize RSS under this constraint, we use a Lagrange multiplier. I had basically blocked this out from my memory, but essentially if we have a function \(f(x)\) that we want to minimize and a constraint function \(g(x)=0\), then we can look for the saddle points of \[\mathcal{L}(x, \lambda)=f(x)+\lambda g(x)\] And among these we will find the constrained minimum of \(f(x)\). (The sign on \(\lambda\) is just a convention; I’m using \(+\) here so it lines up with the penalty form above.)

For our case, we can assume that the OLS estimate will be outside of our constraint region (otherwise we would just use the OLS estimate) and thus the optimal solution will be on the edge of the circle and not inside. Thus we’ll make the constraint an equality: \(g(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2-c^2=0\).

Now our Lagrange multiplier thingie is like this:

\[\mathcal{L}(\boldsymbol{\beta}, \lambda)=\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{i,j}\right)^2+\lambda(\|\boldsymbol{\beta}\|_2^2-c^2)\] And then if we fix \(\lambda\), we can drop the \(-\lambda c^2\) since this will just be a constant in our minimization task. This gives us the quantity that the authors describe.

Okay, I’m glad that I went through that as I feel a bit better about the connections of these two formulations of the ridge/lasso. When we choose a particular lambda, this sets the budget for our entire set of non-intercept coefficients. We are setting a maximum for them collectively which is the shrinkage. The exact shape of the constraint applied is what differentiates the ridge and lasso. We’ll see that this does have consequences in how they behave and perform in different settings.

Another aspect of this that I found interesting is that this can be similar to the subset selection we previously discussed. We could write the subset selection process for at most \(s\) predictors as:

\[\text{minimize}\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{i,j}\right)^2 \text{ subject to }\sum_{j=1}^pI(\beta_j\neq0)\leq s\] In the case of the ridge, the coefficients will typically not (never?) reach zero, but for the lasso this is quite likely if we keep cranking up \(\lambda\). I found this very interesting and not at all intuitive, so I spent (probably too long) trying to get some intuition about why the lasso can be used to perform variable selection but the ridge cannot.
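
Before getting into the geometry, a quick empirical check of that claim (toy data and scikit-learn, so take the exact numbers with a grain of salt):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(200, 5)))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=200)

for lam in [0.01, 0.1, 1.0, 10.0]:
    lasso_zeros = int(np.sum(Lasso(alpha=lam).fit(X, y).coef_ == 0.0))
    ridge_zeros = int(np.sum(Ridge(alpha=lam).fit(X, y).coef_ == 0.0))
    print(f"lambda={lam:>5}: lasso zero coefs={lasso_zeros}, ridge zero coefs={ridge_zeros}")
```

As \(\lambda\) grows, the lasso zeroes out more and more coefficients, while the ridge coefficients only shrink toward zero without ever reaching it.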

If we think about the case of two predictors and set the constraint to \(c^2=1\), then the possible betas for the lasso live in a square (yellow) and those for the ridge lie in a circle (green). If the OLS estimate is \(\beta_1 =\beta_2=2\), we could plot some contour lines of the RSS out from this point (note they would probably be ellipses, not circles).

The intersection points between the contour lines and the constraints are the estimates that the lasso and ridge will select.

The intuition given by the authors on why the lasso will zero out particular coefficients but the ridge will not is based on this visualization. I don’t have a particularly good geometric intuition but here is what I see:

If we think about the OLS estimates as being uniformly distributed on the plane, when we extend circular contour lines out from them, there aren’t any points on the circle that are particularly likely to be the first intersection point. Conversely, the corners of the lasso’s square will have a larger proportion of the possible contours hit them than any particular point on the sides of the square. I thought about trying to prove this but I thought maybe an expansion on this visual would be more illuminating. Here I have just drawn the top half of the square and a few different contour rings:

So I think if the least squares estimate is in that upper triangle, then it will intersect the corner and drop one of the two variables; otherwise, it will intersect the side and keep a combination of the two. This picture simplifies things since the actual contour curves will rarely be circular, but I think a similar logic will apply to various ellipses, etc.

The next thing noted is that there are differences in the situations where these techniques perform well. If it’s the case that there are some variables that truly don’t relate to the outcome (frequently, in my experience), the lasso method will perform better. If this is not the case, then the lasso will be more biased than the ridge regression because you will set some coefficients to zero when they actually have a weak relationship with the outcome.

Another thing they note here, through a somewhat contrived example, is that the shrinkage the lasso applies is by a fixed quantity (\(\beta-\delta\)), whereas the ridge method applies a scaling shrinkage (\(\beta\cdot\delta\)). As a result, the ridge will apply the strongest shrinkage (in absolute terms) to the largest parameters and the least to the weakest, so in situations where you have a lot of weaker predictors, this will be better than the lasso. Another thing to think about with the difference in the shrinkages applied is that scaling a non-zero value will never make it exactly zero, but sliding it by a fixed amount will, which is in line with the lasso performing variable selection while the ridge method does not.
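
My reading of that contrived example, as a sketch (these per-coefficient formulas only hold in the simplified setting the book uses, so treat them as illustrative rather than general):

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge in the simplified case: rescale every coefficient by the same factor."""
    return beta_ols / (1 + lam)                    # never exactly zero (unless it started there)

def lasso_shrink(beta_ols, lam):
    """Lasso in the simplified case: slide each coefficient toward zero by lam/2,
    clipping at zero (soft-thresholding)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

beta = np.array([3.0, 0.5, -0.2])
print(ridge_shrink(beta, 1.0))   # [ 1.5   0.25 -0.1 ]  -- scaled, all still non-zero
print(lasso_shrink(beta, 1.0))   # [ 2.5   0.    0.  ]  -- small ones pushed to exactly zero
```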

The authors also give some other comments about how this can be interpreted from a Bayesian perspective, but I didn’t understand this too well so maybe we’ll get back to it a bit late-ah.

The last bit to mention is the method to select a tuning parameter. The authors mainly suggest that we should just use cross validation to test a number of different values for the tuning parameter. I don’t have any experience doing this, so I don’t really know what reasonable values would be to start with. In the absence of any other guidance or experience, I suspect you would just try an array of values and, if the minimum is at one of the ends of your array of \(\lambda\text{s}\), continue to try more values off in that direction until you see a clear dip in the testing MSE estimates.
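
Something like this is what I have in mind, using scikit-learn’s built-in cross validation over a log-spaced grid (my guess at a reasonable workflow, not a recommendation from the book):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(150, 20)))
y = X[:, 0] - X[:, 1] + rng.normal(size=150)

lambdas = np.logspace(-4, 2, 50)                 # candidate tuning parameters
fit = LassoCV(alphas=lambdas, cv=10).fit(X, y)
print(fit.alpha_)                                # lambda with the lowest CV error
if fit.alpha_ in (lambdas.min(), lambdas.max()):
    print("chosen value is at the edge of the grid -- widen the grid and refit")
```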

6.3: Dimension Reduction Methods

The next section concerns methods to deal with high dimensionality. There are a variety of applications for this but they chiefly involve us trying to simplify problems that have too many predictors to either actually fit models or interpret them. I used PCA a bit some years ago to create some visualizations and am eager to see if it’s easier to understand this go around.

One method of dimension reduction is to form a set of linear combinations, \(Z_1, Z_2,...,Z_M\), of our initial predictors \(X_1, X_2,..., X_p\), where \(0<M<p\). Here each \(Z_m\) can be written as: \[Z_m = \sum_{j=1}^p\phi_{j,m}X_j\] Where each \(\phi_{j,m}\) is a constant chosen by the particular dimension reduction method. After creating these transformations of our initial predictors, we can use them to form a regression model:

\[y_i=\theta_0+\sum_{m=1}^M\theta_mz_{i,m}+\epsilon_i\text{, i = 1, 2,...,n}\] Where \(z_{i,m}\) is the value of \(Z_m\) for the \(i\)th observation.
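
The most familiar instance of this (to me) is principal components regression, where the \(Z_m\) are the first \(M\) principal components of \(X\). A sketch with scikit-learn on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 15))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

M = 3                                # number of linear combinations Z_1, ..., Z_M to keep
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.named_steps["linearregression"].coef_)   # the estimated theta_1, ..., theta_M
```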

6.4: Considerations in High Dimensions

Lab

Exercises

1