Definitions

4.2 4.3 4.4

Notes

4.2: Why not linear regression?

For many classification problems, the reason linear regression is not appropriate is that a numeric coding scheme implicitly puts distances between the different classes when there may be none at all. The example of mild, medium, and severe is ordered, but it’s not clear that the distance between mild and medium is the same as from medium to severe. The case of binary variables is closer to appropriate, but linear regression will still fail here in that it will sometimes predict values larger than one or less than zero.

The book uses this shorthand \(P(Y=1|X)=p(x)\) which… I don’t like that much but we’ll stick with it.

4.3: Logistic regression

In order to deal with the issue of potentially predicting values outside of (0,1), we can instead use a transformation (link function) of the probability that \(Y=1\) given \(X\). The most commonly used one is the logistic function:

\[p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\] Here’s a graph of the logistic function for different values of \(\beta_0\) and \(\beta_1\):

From this, we can see how the coefficients shape the probability curve. \(\beta_0\) controls where the transition in probability is centered. The sign of \(\beta_1\) controls whether low values of \(X\) are associated with high or low probabilities of \(Y=1\). The magnitude of \(\beta_1\) controls how quickly the transition happens.
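To see those effects concretely, here’s a quick sketch (using ggplot2, which the labs below rely on anyway) that draws the curve for a few values of \(\beta_1\) with \(\beta_0=0\); the object names are my own:

# Sketch: the logistic curve for beta0 = 0 and a few values of beta1
library('ggplot2')

logistic = function(x, b0, b1) exp(b0 + b1*x)/(1 + exp(b0 + b1*x))

curves = expand.grid(x = seq(-6, 6, by = 0.1), b1 = c(0.5, 1, -2))
curves$p = logistic(curves$x, b0 = 0, b1 = curves$b1)

ggplot(curves, aes(x = x, y = p, color = factor(b1))) +
  geom_line() +
  labs(y = 'p(X)', color = 'beta1')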

For interpretation, we generally like to reformulate the logistic function like this:

\[\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1X}\] By doing this, we can interpret the various betas as either odds, odds ratios, or changes in odds ratio for a given 1 unit change in a continuous predictor. A further transformation of this gets us to:

\[log(\frac{p(X)}{1-p(X)}) = \beta_0+\beta_1X\] The left-hand side is called the logit, and logistic regression models it as a linear combination of the predictors.

The method of estimating the coefficients is done through maximum likelihood which is one of the few statistical concepts that I found to have interesting philosophical parallels. Essentially what we do is select the set of estimates that produces the highest probability of observing the given sample. To me this seems like ‘whatever you see, that should not be viewed as an outlier without additional information.’ Selecting the coefficients comes from the likelihood function which for a binary outcome and two coefficients would be:

\[l(\beta_0,\beta_1|X) = \prod_{i:y_i=1}p(x_i)\prod_{j:y_j=0}(1-p(x_j))\] In this case, we’re thinking of the data as fixed and we write a function that has the parameters as the only variables. Whichever parameters produce the global maximum of \(l\) are the ones we choose. In the context of logistic regression it would look like this:

\[l(\beta_0,\beta_1|X) = \prod_{i:y_i=1}\frac{e^{\beta_0+\beta_1x_i}}{1+e^{\beta_0+\beta_1x_i}}\prod_{j:y_j=0}\Big(1-\frac{e^{\beta_0+\beta_1x_j}}{1+e^{\beta_0+\beta_1x_j}}\Big)\] Then we would take the log to turn the products into sums before maximizing.
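As a sketch of the idea (not of how glm() actually does the fitting), we can maximize the log-likelihood directly with optim() on simulated data and compare against glm(); the data and object names here are made up for illustration:

# Sketch: estimate (beta0, beta1) by maximizing the log-likelihood directly
set.seed(1)
x = rnorm(200)
y = rbinom(200, size = 1, prob = plogis(-1 + 2*x))  # true beta0 = -1, beta1 = 2

neg_log_lik = function(beta) {
  p = plogis(beta[1] + beta[2]*x)
  -sum(y*log(p) + (1 - y)*log(1 - p))
}

optim(c(0, 0), neg_log_lik)$par        # should be close to...
coef(glm(y ~ x, family = 'binomial'))  # ...the glm() estimates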

The authors then go into some discussion about how predictions are made, extensions to more than one predictor, and confounders that I will not repeat here. They also note that it is possible to extend logistic regression to more than two classes but that it is generally not very effective compared to other methods.

4.4 Linear discriminant analysis

An alternative to logistic regression is linear discriminant analysis. This method is new to me so I’m excited to get into more new material. In a broader sense, the method works somewhat in reverse of logistic regression. The reasons given that this method could be preferable to logistic regression are:

  • In the case of highly separated levels of \(Y\), logistic regression can have unstable parameter estimates. I take this to mean that the standard error of the parameter estimates will be larger in the case that \(X\) is highly predictive.
  • If n is small and the predictors are approximately normal, then LDA is more stable.
  • When we have more than one class in the response, LDA is preferable.

Logistic regression works by modeling the transformed response in terms of a linear combination of the predictors. With LDA, by contrast, we model the distribution of each predictor within each level of the response. We then use Bayes theorem to reverse direction and model the response in terms of these distributions. To me, this method feels like:

  1. Figure out how the distributions of \(X\) are within each response class.
  2. Determine which distribution the observation most likely came from and select the appropriate class based off this.

Let’s get into the details. First we start with a response, \(Y\), that can take one of \(K\geq2\) classes. For each class \(k\), we define \(\pi_k\) as the unconditional (prior) probability \(P(Y=k)\). Frequently we would just use the observed frequency within the training data but if we had something like census information that could better inform our priors, that could be used. Then for each class, we define the density function as:

\[f_k(x)=P(X=x|Y=k)\] So this is the portion where we model the distribution of the predictors for each class. They mention that this is sort of symbolic in this case because the density function for a continuous variable (which is what is described for LDA here) would require a small region around \(x\) and not equality. Then we use Bayes theorem to convert this back into a probability of \(Y\) given \(X\):

\[p_k(x)=P(Y=k|X=x)=\frac{P(Y=k \cap X=x)}{P(X=x)}=\frac{P(Y=k)P(X=x|Y=k)}{\sum_{l=1}^KP(Y=l)P(X=x|Y=l)}=\frac{\pi_k\cdot f_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}\] So then once we have all our estimates of the priors (\(\pi_k\)) and the class-conditional densities (\(f_k(x)\)), we can then compute the probabilities of each training observation being in each outcome class and select the most likely one. The prior probabilities are relatively easy and from my understanding, most Bayesian methods are relatively insensitive to poor choices of prior. The issue is more that estimating the densities is hard without some assumptions. One method of doing this (I assume you could use alternative distributions/assumptions depending on how the training data actually looks) is to just assume that the predictor has a normal distribution.

In this case, we can use the density function (pdf) for the Gaussian distribution for \(f_k(x)\):

\[f_k(x)=\frac{1}{\sqrt{2\pi}\sigma_k}\cdot\text{exp}(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)\] So this wildboy basically tells us what the probability is of having a predictor value within a narrow interval around \(x\). The authors suggest we assume the variance is the same for each response level for the moment. We can substitute the normal pdf in the formula for \(p_k(x)\):

\[p_k(x)=\frac{\pi_k\cdot \frac{1}{\sqrt{2\pi}\sigma}\cdot\text{exp}(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma}\cdot\text{exp}(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}\] Then, for ease of computation and simplification, they suggest taking a log of this. Because logarithms are increasing, maximizing this will produce the same solution. Cancel the constant term first:

\[\delta_k(x)=log(\pi_k\cdot\text{exp}(-\frac{1}{2\sigma^2}(x-\mu_k)^2))-log(\sum_{l=1}^K\pi_l\cdot\text{exp}(-\frac{1}{2\sigma^2}(x-\mu_l)^2))\]

Hmm, I spent a bit of time trying to unfurl this. The key is that the second log term is the same for every class, and after expanding \((x-\mu_k)^2\) the \(-x^2/2\sigma^2\) piece is also common to every class, so both can be dropped when comparing classes. What’s left is the discriminant function for the probability that \(Y=k\) given \(X=x\):

\[\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\] If we were in a situation with two levels for \(Y\) and equal priors, then in order for us to classify the observation as \(Y=1\) we would need: \[\delta_1(x)-\delta_2(x)=x\cdot\frac{\mu_1}{\sigma^2}-\frac{\mu_1^2}{2\sigma^2}-\Big[x\cdot\frac{\mu_2}{\sigma^2}-\frac{\mu_2^2}{2\sigma^2}\Big]>0\] \[\delta_1(x)-\delta_2(x)=x\cdot\frac{\mu_1-\mu_2}{\sigma^2}-\frac{\mu_1^2-\mu_2^2}{2\sigma^2}>0\] \[x>\frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}=\frac{\mu_1+\mu_2}{2}\quad(\text{assuming }\mu_1>\mu_2)\] And this is a very sensible result. If we had equal samples of the two levels and the two distributions had equal variances, we would just need the observation to be closer to the mean of the \(Y=1\) distribution to assign it to that class. In all, I like the intuition about this technique quite a bit. It feels like you’re taking into account the idea of ‘prevalence’ from epidemiology in terms of weighting your guesses a bit more based off sample proportions (or other priors), but then most of the decision comes from simply how far apart the distributions are and how variable they are.

In this simple case, the most common parameter estimates needed to compute \(\delta_k(x)\) are:

\[\hat{\mu_k}=\frac{1}{n_k}\sum_{i:y_i=k}x_i\]

\[\hat{\sigma}^2=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2\] Here \(\hat{\mu}_k\) is just the average of the predictor values when \(Y=k\) and the authors say that \(\hat{\sigma}^2\) is the ‘weighted average of the sample variances.’ At first this didn’t look like a weighted average to me, but rewriting it as \(\hat{\sigma}^2=\sum_{k=1}^K\frac{n_k-1}{n-K}\hat{\sigma}_k^2\) shows that it is: each class’s sample variance gets a weight of roughly its relative frequency. In effect, the variance is computed across the entire sample, but the mean used for each observation depends on its class, which seems reasonable given the homogeneity-of-variance assumption.
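To make the one-predictor case concrete, here’s a small sketch that estimates \(\hat{\pi}_k\), \(\hat{\mu}_k\), and the pooled \(\hat{\sigma}^2\) on simulated data and evaluates \(\delta_k(x)\) by hand (the data and object names here are mine, not the book’s):

# Sketch: one-predictor LDA with K = 2 classes (coded 0/1), computed by hand
set.seed(1)
n = 100
y = rbinom(n, size = 1, prob = 0.6)
x = rnorm(n, mean = ifelse(y == 1, 2, 0), sd = 1)

pi_hat     = table(y)/n                                    # priors from class frequencies
mu_hat     = tapply(x, y, mean)                            # class means
sigma2_hat = sum((x - mu_hat[as.character(y)])^2)/(n - 2)  # pooled variance

delta = function(x0, k) {
  x0*mu_hat[k]/sigma2_hat - mu_hat[k]^2/(2*sigma2_hat) + log(pi_hat[k])
}

# Classify a new point by whichever discriminant is larger
x0 = 1
c(delta(x0, '0'), delta(x0, '1'))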

One thing to note here is that if the data is substantially not normal, the central limit theorem will not save us here. This is because we are not trying to estimate a mean, we’re using the literal PDF of the normal distribution. If we have predictors with substantially non-normal distributions, the values from \(f_k(x)\) will be very unreliable and thus the LDA will be weak. I suspect that if we’re not using some other version of LDA, we will probably work to transform variables to look normal-ish when needed.

Typically one predictor is not going to be sufficient, so in order to extend this, we can move to a multivariate Gaussian distribution for the predictors where \(X = (X_1, X_2, ..., X_p)\). Here, each \(X_i\) has a normal distribution, but we allow different parameters as well as correlations between them. There are some nice 3D plots in the book that show these like hills. If the predictors are uncorrelated, the base of the hill is circular (or an axis-aligned ellipse); correlation stretches it into an ellipse tilted along some other direction.

For a multivariate Gaussian distribution with \(p\) dimensions, we will write \(X\sim N(\mu,\Sigma)\) where

\[E(X)=\mu=\begin{bmatrix}\mu_1\\\mu_2\\\vdots\\\mu_p\end{bmatrix}\] And \[Cov(X)=\Sigma=\begin{bmatrix}\sigma_1^2&\sigma_{1,2}&\cdots&\sigma_{1,p}\\&\sigma_2^2&\cdots&\sigma_{2,p}\\&&\ddots&\vdots\\&&&\sigma_p^2\end{bmatrix}\] Then the multivariate Gaussian density is:

\[f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))\] It’s important to note that in this formulation, we’re assuming that the covariance structure is the same for all classes but this is later relaxed in QDA.
Plugging this into Bayes’ theorem, taking logs, and dropping the terms common to every class, the discriminant again simplifies to a linear function of the observation involving the estimated mean, covariance, and prior:

\[\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log(\pi_k)\] As with the case where \(p=1\), we would compute this quantity for each \(k\) and classify \(x_i\) to the class with the largest value of \(\delta_k(x_i)\). The Bayes decision boundary between classes \(Y=k\) and \(Y=l\) will be the values of \(X\) such that \(\delta_k(X)=\delta_l(X)\) for \(k\neq l\). In the case of \(K=3\) levels of the outcome, we will have 3 pairwise Bayes decision boundaries, although these do not need to be distinct. I can imagine a situation in which we sampled the same class twice and labeled it as two different classes. In that case, each of these would have the same Bayes decision boundary with the third class.
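As a tiny sketch of what evaluating this looks like in R, here the \(\mu_k\), \(\Sigma\), \(\pi_k\), and \(x\) values are all made up for illustration:

# Sketch: evaluate the LDA discriminant for one class with p = 2 predictors
Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2)  # shared covariance (made up)
mu_k  = c(1, 2)                              # class k mean vector (made up)
pi_k  = 0.4                                  # class k prior (made up)
x     = c(0.5, 1.5)                          # observation to classify

Sigma_inv = solve(Sigma)
delta_k = t(x) %*% Sigma_inv %*% mu_k -
  0.5 * t(mu_k) %*% Sigma_inv %*% mu_k +
  log(pi_k)
delta_k  # repeat for each class and pick the largest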

The authors then go into a somewhat lengthy discussion of confusion matrices and the various quantities that are often computed from them. Since I already yammered on about this previously, I’ll omit this from the notes.

One thing that they mentioned here that I only touched on in the linked post is how different kinds of errors correspond to different subject matter problems. They give the example of the default data set in which a customer defaulting comes with an especially high price for the company. As a result, they may choose to select a decision rule other than that which produces the lowest overall error rate. Say that we have two classes and class \(k=1\) is especially detrimental and we really do not want to miss observations that could be in that class (say defaulting on the loan). Then we might choose some other rule like classifying \(X\) to \(Y=1\) if:

\[P(Y=1|X=x)>\alpha \quad\text{for some }\alpha<0.5\] In which case, even if our LDA only gives say a 25% chance of \(Y=1\) (and 75% chance of \(Y=0\)), we may choose to assign the observation to \(Y=1\) (using, say, \(\alpha=0.2\)) because the consequence of a false negative is so large.

The last addition to this method is that we relax the assumption that the covariance matrix is the same for each class. This is called quadratic discriminant analysis. By doing this, we allow for more complicated relationships between the predictors as they do not need to be simple location shifts across the classes. Instead, an entirely unique multivariate Gaussian distribution can be chosen for each level of the outcome.

The major benefit that this has is that we have a substantially more flexible method of modeling the data. The major drawback is that there are many more parameters to estimate. For LDA, we just need to estimate the mean for each class and the shared covariance matrix (\(Kp + p(p+1)/2\)); we estimate the mean of each predictor within each class (\(Kp\)) and then the upper triangle of one covariance matrix (\(p(p+1)/2\)). By contrast, QDA will need to estimate a covariance matrix for each class, so the total number of parameters needed is \(K(p + p(p+1)/2)\). When the data set is small, the number of predictors is large, and/or the number of classes is large, this can become a problem. The estimates will become unstable (high variance) and either LDA or logistic regression will become preferable.
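To put numbers on it: with the \(p=2\), \(K=2\) setup used in the lab below, LDA estimates \(2\cdot2+\frac{2\cdot3}{2}=7\) parameters while QDA estimates \(2\,(2+3)=10\); with \(p=20\) and \(K=3\), the counts are \(60+210=270\) for LDA versus \(3\,(20+210)=690\) for QDA.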

For QDA, we can write the discriminant function like this:

\[\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}log(|\Sigma_k|)+log(\pi_k)\] As with LDA, we would assign each observation to the class where \(\delta_k(x)\) is largest.

4.5 Comparison of Classification Methods

In the last section, we make some comparisons between methods. One important note that relates LDA and logistic regression is that they have similar formulations. It’s possible to write the log odds produced from an LDA as a linear combination in \(X\). If we have two classes of outcome and one predictor then

\[\frac{p_1(x)}{p_2(x)}=\frac{\frac{\pi_1\cdot f_1(x)}{\sum_{l=1}^K\pi_lf_l(x)}}{\frac{\pi_2\cdot f_2(x)}{\sum_{l=1}^K\pi_lf_l(x)}}=\frac{\pi_1\cdot f_1(x)}{\pi_2\cdot f_2(x)}\] And then the log odds would be

\[log(\frac{p_1(x)}{p_2(x)})=log(\pi_1\cdot f_1(x))-log(\pi_2\cdot f_2(x))=log(f_1(x)/f_2(x))+log(\pi_1/\pi_2)\] Substituting in the normal pdf:

\[log(\frac{p_1(x)}{p_2(x)})=-\frac{1}{2\sigma^2}\big[(x-\mu_1)^2-(x-\mu_2)^2\big]+log(\pi_1/\pi_2)\] The normalizing constants cancel, and the prior term is just a constant (\(c_0\)). Expanding the squares, the \(x^2\) terms cancel as well and we can unroll the rest:

\[log(\frac{p_1(x)}{p_2(x)})=x\cdot\frac{\mu_1-\mu_2}{\sigma^2}-\frac{\mu_1^2-\mu_2^2}{2\sigma^2}+c_0 = c_1x+c_0'\] So basically what this shows is that both LDA and logistic regression model the log odds as a linear combination of \(X\). This means that in some sense they have a similar degree of flexibility. The coefficients are estimated via maximum likelihood for logistic regression and via the estimated normal distributions for LDA. To me, I kind of take this as evidence that logistic regression seems preferable when \(K=2\) because it is more flexible in the types of predictors it can use. The scenarios detailed in this section don’t give a good example of when LDA is clearly superior to logistic regression although my impression is that this is likely to occur when \(K>2\).

Finally, the authors make some comparisons between the logistic regression, LDA, QDA, and KNN (2 formulations). They note that QDA is more flexible than either logistic regression and LDA but less flexible than KNN. The scenarios detailed at the end give these general points:

  • When the assumptions of LDA or logistic regression are met, these methods outperform QDA and KNN. This is because the added flexibility ends up responding to noise more than true trends.
  • If the underlying distributions of predictors are not normal (the example given is a t distribution), this does not affect logistic regression much but will worsen the performance of LDA. In this case, both still outperform KNN and QDA assuming that the covariance is equal across classes.
  • If the underlying distributions of the predictors still look normal but there are heterogeneous covariance structures between classes or nonlinear relationships with the outcome (say quadratic or interactions), then QDA will perform better than any other method. In this case, QDA will not be biased like LDA or logistic regression and will have lower variance than KNN. In the case of non-linear relationships between predictors and the response, it may be possible to remedy this using transformations of variables in a logistic regression model.
  • In the case of distinctly non-normal predictors, highly complex relationships between predictors, and/or strongly non-linear/quadratic relationships between predictors and outcomes, the parametric methods will be too biased in comparison to KNN. The authors also note that there are a number of issues in selecting K for KNN, and in their examples K=1 does quite badly under every scenario. Apparently there are methods of choosing this value wisely later in the book.

Lab

The lab for this section comes from the data set Smarket which contains data about the S&P 500. We have several lagged versions of the daily return. It looks like we have 252 days for most of the years. This looks like a weird data set to model with these methods as the observations are serially dependent rather than independent.

##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up

We have the volume of shares traded on that day, the percentage return on the day (Today), and whether the market is up or down. Can look at the correlations:

##              Year         Lag1         Lag2         Lag3         Lag4         Lag5      Volume        Today
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718  0.029787995  0.53900647  0.030095229
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911 -0.005674606  0.04090991 -0.026155045
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533 -0.003557949 -0.04338321 -0.010250033
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036 -0.018808338 -0.04182369 -0.002447647
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000 -0.027083641 -0.04841425 -0.006899527
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641  1.000000000 -0.02200231 -0.034860083
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246 -0.022002315  1.00000000  0.014591823
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527 -0.034860083  0.01459182  1.000000000

The day-to-day correlations are quite small, but it’s not surprising that the correlations are a bit stronger between adjacent days than between days further apart. There’s quite a low relationship between Today and Volume, which makes me think people trade a lot of stock in both losing and winning environments.

We can predict whether the market will go up with a logistic model:

smark_logistic = glm(
  formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket,
  family = 'binomial'
)

summary(smark_logistic)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

We can get some predictions in the form of probabilities (or log odds, using a different value of the type argument):

predict(smark_logistic, type = 'response')[1:10]
##         1         2         3         4         5         6         7         8         9        10 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 0.5176135 0.4888378
#  If you want to verify that the interpretation of the categorical variable
# was as you expected:
contrasts(Smarket$Direction)
##      Up
## Down  0
## Up    1

If we want to make a confusion matrix, we can turn all the predicted values into ‘Up’ if \(P(Y=\text{Up}) > 0.5\):

smark_predictions = ifelse(
  predict(smark_logistic, type = 'response') > 0.5,
  'Up',
  'Down'
)

table(smark_predictions, Smarket$Direction)
##                  
## smark_predictions Down  Up
##              Down  145 141
##              Up    457 507

Then an example calculation here would be that the sensitivity is \(\frac{507}{507+141}=0.782\). Seems like we have a lot of false positives though. The overall training accuracy is 52.2% (error rate 47.8%), which is not much better than the approximately 50/50 split between up and down in the training set.
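For reference, these quantities can be computed directly from the predictions rather than read off the table by hand:

# Overall training accuracy and the sensitivity for 'Up', computed directly
mean(smark_predictions == Smarket$Direction)          # about 0.522
sum(smark_predictions == 'Up' & Smarket$Direction == 'Up') /
  sum(Smarket$Direction == 'Up')                      # sensitivity, about 0.78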

To further investigate this model, they suggest that we train the model on the data pre-2005 and test on 2005’s data:

smark_train = Smarket %>% 
  filter(Year < 2005)

smark_test = Smarket %>% 
  filter(Year == 2005)

smark_logistic2 = glm(
  formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = smark_train,
  family = 'binomial'
)

smark_test %>% 
  mutate(
    pred = predict(smark_logistic2, smark_test, type = 'response'),
    pred_val = ifelse(
     pred > 0.5,
     'Up',
     'Down'
    )
  ) %>% 
  count(
    Direction,
    pred_val
  )
##   Direction pred_val  n
## 1      Down     Down 77
## 2      Down       Up 34
## 3        Up     Down 97
## 4        Up       Up 44

The test error rate for this is \(\frac{34+97}{252} = 0.52\), which is a bit worse than flipping a coin. The authors give this minor tweak: remove the unhelpful lag variables farther from the current date. Doing this bases the predictions only on the more relevant variables:

smark_logistic3 = glm(
  formula = Direction ~ Lag1 + Lag2 ,
  data = smark_train,
  family = 'binomial'
)

smark_test %>% 
  mutate(
    pred = predict(smark_logistic3, smark_test, type = 'response'),
    pred_val = ifelse(
     pred > 0.5,
     'Up',
     'Down'
    )
  ) %>% 
  count(
    Direction,
    pred_val
  )
##   Direction pred_val   n
## 1      Down     Down  35
## 2      Down       Up  76
## 3        Up     Down  35
## 4        Up       Up 106

Now we’re only making 111 mistakes instead of 131.

Next we’ll give LDA a shot. Excited to do something new with the code:

library('MASS')

smark_lda = lda(
  formula = Direction ~ Lag1 + Lag2,
  data = smark_train
)

smark_lda
## Call:
## lda(Direction ~ Lag1 + Lag2, data = smark_train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293

We can verify that some of these estimates are what we expected:

# Priors:
smark_train %>% 
  count(Direction) %>% 
  mutate(m = n/sum(n))
##   Direction   n        m
## 1      Down 491 0.491984
## 2        Up 507 0.508016
smark_train %>% 
  group_by(Direction) %>% 
  summarize(
    Lag1_mean = mean(Lag1),
    Lag2_mean = mean(Lag2)
  )
## # A tibble: 2 x 3
##   Direction Lag1_mean Lag2_mean
##   <fct>         <dbl>     <dbl>
## 1 Down         0.0428    0.0339
## 2 Up          -0.0395   -0.0313

The predictions from this kind of model come in a few pieces. We can get the class predictions for each observation (the class that maximizes \(\delta\)):

smark_lda_pred = predict(smark_lda, newdata = smark_test)
smark_lda_pred$class[1:5]
## [1] Up Up Up Up Up
## Levels: Down Up

The posterior estimates for each observation and class:

smark_lda_pred$posterior %>% 
  as_tibble() %>% 
  head()
## # A tibble: 6 x 2
##    Down    Up
##   <dbl> <dbl>
## 1 0.490 0.510
## 2 0.479 0.521
## 3 0.467 0.533
## 4 0.474 0.526
## 5 0.493 0.507
## 6 0.494 0.506

And the linear discriminants:

smark_lda_pred$x[1:5]
## [1]  0.08293096  0.59114102  1.16723063  0.83335022 -0.03792892

So from here we can check the accuracy of the LDA:

table(smark_lda_pred$class, smark_test$Direction)
##       
##        Down  Up
##   Down   35  35
##   Up     76 106

This is basically the same as what we got for the logistic regression. If we wanted to apply different classification thresholds, we would just apply them to the posterior estimates.
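For example, here’s a sketch of requiring more than 90% posterior probability of ‘Down’ before calling a day Down, rather than the default 50% (the object name is my own):

# Sketch: reclassify using a stricter posterior threshold for 'Down'
strict_down = ifelse(
  smark_lda_pred$posterior[, 'Down'] > 0.9,
  'Down',
  'Up'
)
table(strict_down, smark_test$Direction)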

Here’s the analogous code using QDA:

smark_qda = qda(
  formula = Direction ~ Lag1 + Lag2,
  data = smark_train
)

The components of this are similar, but now the scaling element holds a separate transformation for each class, reflecting the class-specific covariance matrices:

smark_qda$scaling
## , , Down
## 
##            1           2
## Lag1 -0.8147 -0.02102907
## Lag2  0.0000 -0.80724672
## 
## , , Up
## 
##               1           2
## Lag1 -0.8119072 -0.01505382
## Lag2  0.0000000 -0.81929901

We can also create the confusion matrix and compute the accuracy:

smark_qda_pred = predict(
  smark_qda,
  newdata = smark_test
)
table(smark_qda_pred$class, smark_test$Direction)
##       
##        Down  Up
##   Down   30  20
##   Up     81 121
mean(smark_qda_pred$class == smark_test$Direction)
## [1] 0.5992063

This does a bit better than the other two methods.

Finally, we can try this with KNN:

library('class')

set.seed(1)
smarket_knn = knn(
  train = dplyr::select(smark_train, Lag1, Lag2),
  test = dplyr::select(smark_test, Lag1, Lag2),
  cl = smark_train$Direction,
  k = 1
)

table(smarket_knn, smark_test$Direction)
##            
## smarket_knn Down Up
##        Down   43 58
##        Up     68 83
mean(smarket_knn == smark_test$Direction)
## [1] 0.5

I imagine you have to set the seed to reproduce your work because of some method it uses to deal with ties.

Using \(K=1\) gives us very poor accuracy. We can increase K a bit though to see if that helps:

smarket_knn3 = knn(
  train = dplyr::select(smark_train, Lag1, Lag2),
  test = dplyr::select(smark_test, Lag1, Lag2),
  cl = smark_train$Direction,
  k = 3
)

table(smarket_knn3, smark_test$Direction)
##             
## smarket_knn3 Down Up
##         Down   48 54
##         Up     63 87
mean(smarket_knn3 == smark_test$Direction)
## [1] 0.5357143

The authors conclude this section by saying that QDA provides the best results here. Assuming that we’re okay with this method of validating our models, I agree.

The last part of this lab has to do with using the Caravan data set. This has a ton of sociodemographic and product-ownership variables. We also have a record of whether each customer purchased caravan insurance:

table(Caravan$Purchase)
## 
##   No  Yes 
## 5474  348

Because KNN relies on distance (in some metric), it can be the case that the scale of certain variables causes them to have an outsized influence on the results. To avoid this, we can standardize the variables. The scale function feels kinda old as it returns a matrix regardless of whether you feed it a vector or not. Anyways, you can do the whole data frame at once like this:

Caravan[,1:85] = scale(Caravan[,1:85])

cara_test = Caravan %>% 
  slice(1:1000)  

cara_train = Caravan %>% 
  slice(-(1:1000))

Make some KNN predictions:

set.seed(1)
cara_knn1 = knn(
  train = dplyr::select(cara_train, -Purchase),
  test = dplyr::select(cara_test, -Purchase),
  cl = cara_train$Purchase,
  k = 1
)

mean(cara_knn1 == cara_test$Purchase)
## [1] 0.882
mean(cara_test$Purchase == 'Yes')
## [1] 0.059

So with \(K=1\) we’re worse than the null classifier by a bit. The authors present a scenario in which we are trying to sell insurance and would like to target customers. If that’s the case, correctly identifying customers who will not purchase is not important and instead we want to maximize the positive predictive value. In this case we have:

table(cara_knn1, cara_test$Purchase)
##          
## cara_knn1  No Yes
##       No  873  50
##       Yes  68   9

From this we can see that the positive predictive value here is \(\frac{9}{77}=11.7\%\) which is quite a bit better than the 6% we would achieve by guessing at random. Increasing \(K\) improves this further:

cara_knn3 = knn(
  train = dplyr::select(cara_train, -Purchase),
  test = dplyr::select(cara_test, -Purchase),
  cl = cara_train$Purchase,
  k = 3
)

cara_knn5 = knn(
  train = dplyr::select(cara_train, -Purchase),
  test = dplyr::select(cara_test, -Purchase),
  cl = cara_train$Purchase,
  k = 5
)

table(cara_knn3, cara_test$Purchase)
##          
## cara_knn3  No Yes
##       No  920  54
##       Yes  21   5
table(cara_knn5, cara_test$Purchase)
##          
## cara_knn5  No Yes
##       No  930  55
##       Yes  11   4

Compare with the logistic model. Our PPV depends on the threshold for selecting yes:

cara_logistic = glm(
  Purchase ~ .,
  data = cara_train,
  family = 'binomial'
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cara_logistic_pred = predict(
  cara_logistic,
  cara_test,
  type = 'response'
)

table(cara_logistic_pred > 0.5, cara_test$Purchase)
##        
##          No Yes
##   FALSE 934  59
##   TRUE    7   0
table(cara_logistic_pred > 0.25, cara_test$Purchase)
##        
##          No Yes
##   FALSE 919  48
##   TRUE   22  11

So setting the threshold at 25% gives a PPV of \(\frac{11}{33}\approx33\%\), which is actually better than the KNN models (\(\frac{5}{26}\approx19\%\) for K=3 and \(\frac{4}{15}\approx27\%\) for K=5).

Exercises

1

We’re asked to show that the logistic function and logit representation are equivalent. Begin with:

\[P(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\] Then if we convert this to an odds:

\[\frac{P(X)}{1-P(X)} = \frac{\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}}{1-\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}}\] Then combine the pieces of the denominator:

\[\frac{P(X)}{1-P(X)} = \frac{\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}}{\frac{1}{1+e^{\beta_0+\beta_1X}}}=e^{\beta_0+\beta_1X} \]

2

Next question is just to argue that maximizing \(p_k(x)\) is the same as maximizing the discriminant function. I didn’t go through all the algebra of this in my notes (yet) but I don’t think we even need that. Assuming we believe that the discriminant is the log of \(p_k(x)\) up to terms that don’t depend on \(k\), it is clear that this is true because the logarithm is an increasing function (\(A>B\Rightarrow log(A)>log(B)\)).
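For completeness, the algebra is short. Plugging the shared-variance normal density into \(p_k(x)\) and taking logs:

\[log(p_k(x)) = \underbrace{x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+log(\pi_k)}_{\delta_k(x)}-\frac{x^2}{2\sigma^2}-log\Big(\sqrt{2\pi}\,\sigma\sum_{l=1}^K\pi_lf_l(x)\Big)\]

The last two terms do not depend on \(k\), so ranking the classes by \(p_k(x)\), by \(log(p_k(x))\), or by \(\delta_k(x)\) gives the same answer.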

3

Assume that we are performing a QDA with p=1 predictors and class specific variances. Demonstrate that the Bayes decision boundary is not linear but instead is quadratic.
In this situation, we have a different variance for each class but no covariance to deal with. The probability that \(X\) belongs to class \(k\) is given by:

\[p_k(x)=\frac{\pi_k\cdot \frac{1}{\sqrt{2\pi}\sigma_k}\cdot\text{exp}(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma_l}\cdot\text{exp}(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2)}\] If we have two classes, \(k\neq j\), then the Bayes decision boundary is:

\[p_k(x)=p_j(x)\]

\[\frac{\pi_k\cdot \frac{1}{\sqrt{2\pi}\sigma_k}\cdot\text{exp}(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma_l}\cdot\text{exp}(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2)}=\frac{\pi_j\cdot \frac{1}{\sqrt{2\pi}\sigma_j}\cdot\text{exp}(-\frac{1}{2\sigma_j^2}(x-\mu_j)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma_l}\cdot\text{exp}(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2)}\] Dropping the denominators and taking logs gives: \[log(\pi_k)- log(\sqrt{2\pi}\sigma_k)-\frac{1}{2\sigma_k^2}(x-\mu_k)^2=log(\pi_j)- log(\sqrt{2\pi}\sigma_j)-\frac{1}{2\sigma_j^2}(x-\mu_j)^2\] And then from here we can see that solving this will be some quadratic equation since, unlike the case where \(\sigma^2_k=\sigma^2_j\), the coefficients of \(x^2\) differ and do not cancel out.

4

This question has several parts and asks us to look at the curse of dimensionality. We’ll start by looking at a few scenarios:

  1. Assume we have only one feature, \(X\), and it comes from a uniform distribution U(0,1). We use the observations within 10 percentage points of an observation to classify test observation \(x_i\). We’re asked what fraction of the available observations we’ll use on average to make a prediction. Well, it’s quite close to 10%; however, the ends of the distribution use less than 10%, so the expected proportion of observations used is:

\[E(\text{prop used})=\frac{1}{100}\Big[2\int_{0}^{5}(5+x)\,dx+10\cdot 90\Big]= \frac{1}{100}\Big[\big[10x+x^2\big]_{0}^{5}+900\Big] = 9.75\%\] Guess I could have just done that with geometry but I wanted to write the fancy integral sign.

  2. Now extend this to two predictors, \(X_1\) and \(X_2\), both uniformly distributed on [0,1]. The text doesn’t say they’re independent but let’s assume they are. What proportion of observations will we use now?
    In the previous one I had scaled this to 100 but I think it’s more intuitive to just think about a unit square with area 1. It’s been about 12 years since I took multivariable calculus and it wasn’t a great subject for me then so hopefully I don’t botch these calculations too badly. In any case, our intuition should be that for one dimension we use a bit less than 10% of the data, for two we use about \(10\%^2\), etc.

The area of the sample space used, for a test point near the lower-left of the square (the other edges and corners are symmetric), is

\[A(X,Y)=\begin{cases}0.01, & X, Y \in [0.05,0.95] \\0.1\,(X+0.05), & Y \in [0.05,0.95] \text{ and }X < 0.05\\0.1\,(Y+0.05), & X \in [0.05,0.95] \text{ and }Y < 0.05\\(X+0.05)(Y+0.05), & X < 0.05 \text{ and }Y < 0.05\\ \end{cases}\]

So… I think to compute this we can do three cases?

  1. The first case is just the center of the square where we use the full \(0.1\times0.1\) neighborhood. This total is \(V_1=0.9\cdot 0.9\cdot (0.1)^2 = 0.0081\)
  2. The second scenario is where we are in the middle 90% of one variable and the extreme 5% of the other. These are like the four edges, not corners, of the square. There are 4 of these so

\[V_2 = 4\int_{0.05}^{0.95}\int_{0}^{0.05}0.1(0.05+x)\,dx\,dy=0.4\int_{0.05}^{0.95}\Big[0.05x+\frac{x^2}{2}\Big]_{0}^{0.05}dy=0.4\cdot 0.9\cdot0.00375=0.00135\]

  3. The third part is the four corners:

\[V_3 = 4\int_{0}^{0.05}\int_{0}^{0.05}(y+0.05)(x+0.05)\,dx\,dy=4\Big(\int_{0}^{0.05}(x+0.05)\,dx\Big)^2=4\cdot(0.00375)^2\approx0.00006\]

So then \[V_{total}=.0081+0.00135+0.00006=0.0095\] We did a little rounding but it’s essentially the square of the answer from part 1 (\(0.0975^2\approx0.0095\)), so that seems right.

  3. Now if we have 100 predictors, we expect to use about \(100\cdot0.1^{100}\) percent of the data.
  4. The above example imagined using a fixed ‘distance’ of within 0.05 of the observation. As the number of predictors grows, the proportion of observations within that ‘distance’ shrinks. As a result, when we fix our \(K\), the odds of an element of \(\mathcal{N}_0\) being within that threshold start to drop i.e. the distance between the test observation and each of its nearest neighbors grows. If the training set is not particularly large, then this problem will only be exacerbated.
  5. Now imagine that we are going to make a \(p\) dimensional hypercube about a test observation. If we want that cube to contain 10% of the total observations, what does the length of the cube have to be? Assuming we stick with the same independent uniform distributions, then the total ‘volume’ would still be 1. Then for the cube to be 10% of this volume we need \(0.1 = l^p\Rightarrow l = \sqrt[p]{0.1}\). Say \(p=5\), then \(l=0.63\), which means that with respect to each variable, we will be associating more dissimilar values in order to make the set \(\mathcal{N}_0\).

I feel like this was a very well done question. I think that I understood the problems of dimensionality only in that some variables would have an equal gravitational pull but simply not be associated with the outcome. This shows that even if all of the inputs are good predictors in their own right, having a lot of them with a small training set will cause us to associate distant values when using KNN.
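A quick numeric check of the last part, computing the required side length for several values of \(p\):

# Side length of a hypercube that contains 10% of a unit hypercube's volume
p = c(1, 2, 5, 10, 100)
round(0.1^(1/p), 3)
# For p = 100 the 'cube' has to span nearly the full range of every predictor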

5

These questions are about differences between LDA and QDA.

  1. If the Bayes decision boundary is linear, we would probably still expect QDA to perform better on the training set but worse on the test set. This is because more flexible models generally have lower training error, but QDA will probably pick up on non-existent patterns and thus underperform on the test set.
  2. In the case of a non-linear boundary, we should expect QDA to perform better in both the training and test sets.
  3. As we increase the sample size, we should expect QDA to improve its accuracy rate relative to LDA. This is for two reasons. One is that we have to estimate many more parameters for QDA and this penalty will represent a smaller fraction of the total sample as N increases. Additionally, frivolous patterns that QDA picks up on in a small data set are less likely to appear in a larger training set. If we have more data to train on, QDA will be able to pick up on larger non-linear patterns that are not determined by only a few points.
  4. Given that the Bayes decision boundary is linear, is it true that QDA will still have a lower test error rate because a flexible model can still model a linear relationship? I do not think this is true. If we have a large sample, I believe that the models will perform similarly but the LDA will still outperform because the additional flexibility will only pick up on noise.

6

Suppose we are modeling the chance a student gets an A in a course as a function of hours studied \(X_1\) and their GPA \(X_2\). Estimated regression model is:

\[\hat{Y}=-6+0.05X_1+X_2\]

  1. The log odds that a student who studied for 40 hours with a 3.5 GPA gets an A is \[\hat{Y}=-6+0.05\cdot 40+3.5=-0.5\] And the associated probability is \(\frac{e^{-0.5}}{1+e^{-0.5}}\approx0.38\)
  2. In order to have a 50% chance of getting an A, this student would have to study for: \[0=-6+0.05\cdot X_1+3.5\Rightarrow X_1=50 \text{ hours}\] (Quick check below.)
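A quick check of both parts in R (plogis() is the logistic CDF, i.e., the inverse logit):

plogis(-6 + 0.05*40 + 3.5)  # part 1: predicted probability, about 0.38
(6 - 3.5)/0.05              # part 2: hours needed for a 50% chance = 50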

7

Now we have a situation where we’re going to use LDA to predict the probability of a company turning a profit based on their percent profit from the previous year. We assume that the variances are equal (\(\sigma^2=36\)), that 80% of companies made profits, and that the mean profit percents for companies who did and did not make profits were 10% and 0% respectively.

Let’s call \(Y=0\) the class for non-profitable companies and \(Y=1\) for profitable ones, and suppose we observe a company with \(x=4\) percent profit last year. Then using the Gaussian density function we have:

\[p_1(x)=\frac{\pi_1\cdot \frac{1}{\sqrt{2\pi}\sigma}\cdot\text{exp}(-\frac{1}{2\sigma^2}(x-\mu_1)^2)}{\sum_{l=1}^2\pi_l\frac{1}{\sqrt{2\pi}\sigma}\cdot\text{exp}(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}=\frac{0.8\cdot \frac{1}{\sqrt{2\pi}\cdot6}\cdot\text{exp}(-\frac{1}{2\cdot36}(4-10)^2)}{0.8\cdot \frac{1}{\sqrt{2\pi}\cdot6}\cdot\text{exp}(-\frac{1}{2\cdot36}(4-10)^2)+0.2\cdot \frac{1}{\sqrt{2\pi}\cdot6}\cdot\text{exp}(-\frac{1}{2\cdot36}(4-0)^2)}\] This is a lot easier to do with an r command:

p1 = 0.8*dnorm(x = 4, mean = 10, sd = 6)/(0.8*dnorm(x = 4, mean = 10, sd = 6) + 0.2*dnorm(x = 4, mean = 0, sd = 6))
round(p1, 3)
## [1] 0.752

8

We’re given a scenario where we divide the data into equal-sized training and testing sets. A logistic model gives a 20% error rate on the training data and 30% on the testing data. A KNN model with K=1 gives an average error rate of 18% across the two samples. Which should we use? I had to look at another solution for this since my initial impression was way off. The key is that KNN with K=1 always has a zero percent training error rate. As a result, the test error rate for that KNN would be 36% and thus worse than the logistic model. If K>1 it would not be possible to tell: one possibility is that the KNN’s error rate is 4% on the training data and 32% on the testing data, in which case we would want the logistic model; on the other hand, if it was 16% on the training data and 20% on the testing data, we would want the KNN model.
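Spelling out the arithmetic for the \(K=1\) case: the average error rate constraint forces the test error to be

\[\frac{0\%+\text{test error}}{2}=18\%\;\Rightarrow\;\text{test error}=36\%>30\%\]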

9

The last conceptual question is very easy. Just converting odds to probabilities. An odds of 0.37 corresponds to a probability of \(\frac{0.37}{1+0.37} = 27\%\). A 16% probability corresponds to an odds of \(\frac{0.16}{1-0.16} = 0.19\)

10

Here we’re going to start out with the Weekly data set. This has 21 years of weekly returns (1990–2010).

  1. Investigate the data set for patterns
Weekly = ISLR::Weekly

head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
# basically the same thing as the Smarket data set
cor(dplyr::select(Weekly,-Direction))
##               Year         Lag1        Lag2        Lag3         Lag4         Lag5      Volume        Today
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923 -0.030519101  0.84194162 -0.032459894
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876 -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535 -0.072499482 -0.08551314  0.059166717
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865  0.060657175 -0.06928771 -0.071243639
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000 -0.075675027 -0.06107462 -0.007825873
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027  1.000000000 -0.05851741  0.011012698
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617 -0.058517414  1.00000000 -0.033077783
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873  0.011012698 -0.03307778  1.000000000

Very similar to the other stock data set. Small relationships between days and a large relationship between year & volume.

  2. Here’s a logistic model using everything except Year & Today:
weekly_logistic = glm(
  formula = Direction ~ . - Year -Today,
  data = Weekly,
  family = 'binomial'
)

summary(weekly_logistic)
## 
## Call:
## glm(formula = Direction ~ . - Year - Today, family = "binomial", 
##     data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
1-weekly_logistic$deviance/weekly_logistic$null.deviance
## [1] 0.00658015

Very few relationships between predictors and outcomes. We see that the return from two weeks before (Lag2) is a positive predictor of the market going up this week.

  3. The confusion matrix looks like this:
weekly_logistic_pred = predict(
  weekly_logistic,
  type = 'response'
)

table(
  weekly_logistic_pred > 0.5,
  Weekly$Direction
)
##        
##         Down  Up
##   FALSE   54  48
##   TRUE   430 557

The logistic errors are heavily concentrated in false positives. The sensitivity is 92% but the specificity is only 11%.

  4. Set the data prior to 2009 as the training set and the rest of the data as the test set. Use only Lag2 as a predictor and report the confusion matrix. Repeat this for LDA, QDA, and KNN with K=1. I’ll skip the code but here are the results:
## # A tibble: 4 x 4
##   model    accuracy    fp    fn
##   <chr>       <dbl> <dbl> <dbl>
## 1 logistic    0.625    34     5
## 2 lda         0.625    34     5
## 3 qda         0.587    43     0
## 4 knn         0.510    22    29

So from this we can see that the logistic regression and LDA are quite similar, which indicates a linear decision boundary. The other two methods seem to provide a poorer fit, which is not unexpected with a relatively small data set.

  5. The last question just suggests that we try some different combinations of variables, including transformations and interactions, as well as different values of K. I think in the interest of time, I’ll just do some square terms, log transforms, and one interaction. Then I’ll combine any of these that seem successful for a ‘final’ model.

The lag variables seem relatively normal, but Volume is right-skewed, so from the perspective of the LDA and QDA models a log transform of Volume might do better. Since Lag2 was the strongest predictor, we’ll try a square term of it. It looks like the adjacent weeks have the strongest relationships among the lags so let’s try lag1:lag2 as well as lag2:log(Volume). We’ll also try K = 1, 3, 5 for each of these specifications.

# Apply these transformations:
Weekly = Weekly %>% 
  mutate(
    lag2_sq = Lag2*Lag2,
    volume_log = log(Volume),
    lag2_volume_log = Lag2*volume_log,
    lag2_lag1 = Lag2*Lag1
  )
weekly_train = Weekly %>% 
  filter(Year < 2009)
weekly_test = Weekly %>% 
  filter(Year >= 2009)

# List of predictor sets to try:
model_performances = list(
  c('Lag2'),
  c('Lag2', 'volume_log'),
  c('Lag2', 'lag2_sq'),
  c('Lag2', 'Lag1', 'lag2_lag1'),
  c('Lag2', 'volume_log', 'lag2_volume_log'),
  c('Lag2', 'lag2_sq', 'Lag1', 'lag2_lag1', 'volume_log', 'lag2_volume_log')
)

I’ll hide the code for the next part but basically I wrote a function that would try these different sets of predictors on the different models. The best performers are here:

## # A tibble: 36 x 5
##    model    accuracy    fp    fn preds                              
##    <chr>       <dbl> <dbl> <dbl> <chr>                              
##  1 logistic    0.625    34     5 Lag2                               
##  2 lda         0.625    34     5 Lag2                               
##  3 logistic    0.625    35     4 Lag2 + lag2_sq                     
##  4 qda         0.625    36     3 Lag2 + lag2_sq                     
##  5 knn5        0.615    16    24 Lag2 + Lag1 + lag2_lag1            
##  6 lda         0.615    36     4 Lag2 + lag2_sq                     
##  7 logistic    0.596    27    15 Lag2 + volume_log + lag2_volume_log
##  8 knn3        0.587    20    23 Lag2 + volume_log + lag2_volume_log
##  9 lda         0.587    28    15 Lag2 + volume_log + lag2_volume_log
## 10 qda         0.587    43     0 Lag2                               
## # ... with 26 more rows

The null classifier in this case would be predicting that the market goes up every week, and that has an accuracy of 58.6%. With this in mind, we’re not looking so good on any of these. It looks like we could not really improve the performance with alternative predictors and/or transformations. That makes some sense to me given what the initial logistic regression output looked like with all of the possible predictors. In general that’s not a definitive way to tell, since transformations certainly could have performed better, but when the raw correlations are this low we shouldn’t be too optimistic about finding them.
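Since I hid the model-comparison code above, here’s a minimal sketch of what such a helper could look like; the function name and structure are my own, and KNN plus the fp/fn bookkeeping are omitted for brevity:

# Sketch: fit logistic/LDA/QDA on a set of predictors and report test accuracy
fit_models = function(preds, train, test) {
  f = reformulate(preds, response = 'Direction')
  fits = list(
    logistic = glm(f, data = train, family = 'binomial'),
    lda      = lda(f, data = train),
    qda      = qda(f, data = train)
  )
  sapply(names(fits), function(m) {
    pred = if (m == 'logistic') {
      ifelse(predict(fits[[m]], test, type = 'response') > 0.5, 'Up', 'Down')
    } else {
      as.character(predict(fits[[m]], test)$class)
    }
    mean(pred == test$Direction)
  })
}

fit_models(c('Lag2', 'lag2_sq'), weekly_train, weekly_test)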

11

The next question is somewhat similar in that we are going to try these same classification strategies on the Auto data set with the response being an above median mpg. We’ll first choose a subset of predictors visually and then try the different models. We’ll then compute some summary statistics on performance.

# first divide into training and test:
auto = Auto %>% 
  dplyr::select(-name) %>%
  mutate(
    origin = factor(origin),
    mpg01 = factor(
      as.numeric(mpg > median(mpg)),
      levels = 0:1
    )
  )

set.seed(0)
auto_train = auto %>% 
  sample_frac(0.8)
auto_test = auto %>% 
  anti_join(auto_train)

# So for the continuous predictors, I think we can just scatter them.
# we will eventually make this binary
auto_train %>% 
  # Real tired of specifying which select. Next chapter ima just load tidyverse after MASS
  dplyr::select(-origin) %>% 
  mutate(
    across(
      -matches('mpg'),
      function(x){(x-mean(x))/sd(x)}
    )
  ) %>% 
  pivot_longer(
    cols = -matches('mpg'),
    names_to = 'var',
    values_to = 'val'
  ) %>% 
  ggplot(
    aes(
      x = val, 
      y = mpg,
      color = mpg01
    )
  ) + 
  geom_point(size = 2) +
  theme_minimal() +
  facet_wrap(.~var)

chisq.test(x = auto_train$mpg01, y = auto_train$origin)
## 
##  Pearson's Chi-squared test
## 
## data:  auto_train$mpg01 and auto_train$origin
## X-squared = 85.153, df = 2, p-value < 2.2e-16

So from this investigation, it looks like every variable is possibly useful but acceleration and year seem a bit less clear. Displacement, horsepower, and weight all seem nonlinear so maybe we will try some square terms for those. How about the normality assumption for LDA/QDA?

Looking at this also makes me not want to use acceleration in the models since, again, it seems to have a low relationship with mpg. Year and cylinders look pretty non-normal to me, so let’s exclude those from the LDA/QDA. Perhaps we’ll try with and without to see if it actually does worse.

auto = Auto %>% 
  dplyr::select(-name) %>%
  mutate(
    origin = factor(origin),
    mpg01 = factor(
      as.numeric(mpg > median(mpg)),
      levels = 0:1
    ),
    displacement_sq = displacement^2,
    horsepower_sq = horsepower^2,
    weight_sq = weight^2
  )

set.seed(0)
auto_train = auto %>% 
  sample_frac(0.8)
auto_test = auto %>% 
  anti_join(auto_train)

base_preds = c(
  'origin',
  'cylinders',
  'displacement',
  'horsepower',
  'weight',
  'year'
)

full_preds = c(
  base_preds,
  'displacement_sq',
  'horsepower_sq',
  'weight_sq'
)

lda_preds = full_preds %>% 
  setdiff(c('origin', 'cylinders'))

pred_names = c('base', 'add square terms', 'remove factors')

Now I’ll use a similar function as last time to do this:

## # A tibble: 18 x 5
##    model    accuracy    fp    fn preds           
##    <chr>       <dbl> <dbl> <dbl> <chr>           
##  1 logistic    0.923     4    35 base            
##  2 lda         0.91      7    37 base            
##  3 qda         0.91      6    36 base            
##  4 knn1        0.872     6    33 base            
##  5 knn3        0.897     4    33 base            
##  6 knn5        0.897     4    33 base            
##  7 logistic    0.936     3    35 add square terms
##  8 lda         0.936     5    37 add square terms
##  9 qda         0.923     6    37 add square terms
## 10 knn1        0.885     6    34 add square terms
## 11 knn3        0.923     4    35 add square terms
## 12 knn5        0.897     4    33 add square terms
## 13 logistic    0.923     4    35 remove factors  
## 14 lda         0.949     4    37 remove factors  
## 15 qda         0.91      7    37 remove factors  
## 16 knn1        0.885     6    34 remove factors  
## 17 knn3        0.923     4    35 remove factors  
## 18 knn5        0.897     4    33 remove factors

In general, these models do much better than in the previous example. The null model would achieve 50% accuracy and here we’re in the high 80s and low 90s. The model that does the best is the LDA without the non-normal-seeming features, which is in agreement with my previous thought that we really will care about violations of this assumption in practice. The logistic models generally do pretty well and the KNN with K=3 is doing well too. Using KNN with K=1 just seems like a bad idea in general. I think the excess flexibility (high variance) will cause it to be highly volatile, and I would be surprised if this is a common choice for K.

The remaining exercises are a tutorial on function writing which I do not need and another similar example on employing this which I think I’ll skip since taking these long notes is getting me behind my study group.