6  Regularization and Penalized Models

6.1 Overview of Unit

6.1.1 Learning Objectives

  • Subsetting approaches: Forward, Backward, Best Subset (covered in reading only)
  • Cost and Loss functions
    • What are they and how are they used
    • What are the specific formulas for linear model, logistic regression, and variants of glmnet (ridge, LASSO, full elasticnet)
  • What is regularization
    • What are its benefits?
    • What are its costs?
  • How does lambda affect bias-variance trade-off in glmnet
  • What does alpha do?
  • Feature engineering approaches for dimensionality reduction: PCA (covered in reading only)
  • Other algorithms that do feature selection/dimensionality reduction: PCR and PLS (covered in reading only)
  • Contrasts of PCA, PCR, PLS, and glmnet/LASSO for dimensionality reduction (covered in reading only)

6.1.2 Readings

Post questions to the readings channel in Slack

6.1.3 Lecture Videos

Post questions to the video-lectures channel in Slack


6.1.4 Application Assignment and Quiz

Submit the application assignment here and complete the unit quiz by 8 pm on Wednesday, February 28th


6.2 Introduction to Penalized/Regularized Statistical Algorithms

6.2.1 Overview

Complex (e.g., flexible) models increase the chance of overfitting to the training set. This leads to:

  • Poor prediction
  • Burdensome prediction models for implementation (need to measure lots of predictors)
  • Low power to test hypotheses about predictor effects

Complex models are difficult to interpret


Regularization is a technique that:

  • Reduces overfitting
  • Allows for p >> n (!!!)
  • May yield more interpretable models (LASSO, Elastic Net)
  • May reduce implementation burden (LASSO, Elastic Net)

Regularization does this by applying a penalty to the parametric model coefficients (parameter estimates)

  • This constrains/shrinks these coefficients to yield a simpler/less overfit model
  • Some types of penalties shrink the coefficients to zero (feature selection)

We will consider three approaches to regularization

  • L2 (Ridge)
  • L1 (LASSO)
  • Elastic net

These approaches are available for both regression and classification problems and for a variety of parametric statistical algorithms


6.2.2 Cost functions

To understand regularization, we need to first explicitly consider loss/cost functions for the parametric statistical models we have been using.

  • A loss function quantifies the error between a single predicted and observed outcome within some statistical model.

  • A cost function is simply the aggregate of the loss across all observations in the training sample.

Optimization procedures (least squares, maximum likelihood, gradient descent) seek to determine a set of parameter estimates that minimize some specific cost function for the training sample.


The cost function for the linear model is the mean squared error (squared loss):

  • \(\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}\)

  • No constraints or penalties are placed on the parameter estimates (\(\beta_k\))

  • They can take on any values; the only goal is to minimize the MSE in the training sample


The cost function for logistic regression is log loss:

  • \(\frac{1}{n}\sum_{i = 1}^{n} -Y_ilog(\hat{Y_i}) - (1-Y_i)log(1-\hat{Y_i})\)

  • where \(Y_i\) is coded 0,1 and \(\hat{Y_i}\) is the predicted probability that Y = 1

  • Again, no constraints or penalties are placed on the parameter estimates (\(\beta_k\))

  • They can take on any values; the only goal is to minimize the sum of the log loss in the training sample
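As a concrete sketch (the function names are just for illustration), these two cost functions can be written directly in R:

# Squared loss aggregated across the training sample (mean squared error)
cost_mse <- function(y, y_hat) {
  mean((y - y_hat)^2)
}

# Log loss for y coded 0/1, where y_hat is the predicted probability that y = 1
cost_log_loss <- function(y, y_hat) {
  mean(-y * log(y_hat) - (1 - y) * log(1 - y_hat))
}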


6.4 Intuitions about Penalized Cost Functions and Regularization

This is an example from a series of wonderfully clear lectures in a machine learning course by Andrew Ng on Coursera.


Let's imagine a training set:

  • House sale price predicted by house size
  • True DGP is quadratic. Diminishing increase in sale price as size increases
  • N = 5 in training set




If we fit a linear model with size as the only feature…

  • \(\hat{sale\_price_i} = \beta_0 + \beta_1 * size\)

  • In this training set, we might get the model below (in blue)

  • This is a biased model (predicts too high for low and high house sizes; predicts too low for moderate size houses)

  • If we took this model to new data from the same quadratic DGP, it would clearly not predict very well


Let's consider the other extreme

  • If we fit a 4th order polynomial model using size…
  • \(\hat{sale\_price_i} = \beta_0 + \beta_1 * size + \beta_2 * size^2 + \beta_3 * size^3 + \beta_4 * size^4\)
  • In this training set, we would get the model below (in blue)
  • This model is overfit to this training set. It would not predict well in new data from the same quadratic DGP
  • Also, the model would have high variance (if we estimated the parameters in another N = 5 training set, they would be very different)


This problem with overfitting and variance isn’t limited to polynomial regression.

  • We would have the same problem (perfect fit in training with poor fit in new val data) if we predicted housing prices with many features when the training N = 5. e.g.,

  • \(\hat{sale\_price_i} = \beta_0 + \beta_1 * size + \beta_2 * year\_built + \beta_3 * num\_garages + \beta_4 * quality\)


Obviously, the correct model to fit is a second order polynomial model with size

  • \(\hat{sale\_price_i} = \beta_0 + \beta_1 * size + \beta_2 * size^2\)
  • But we couldn’t know this with real data because we wouldn’t know the underlying DGP
  • When we don’t know the underlying DGP, we need to be able to consider potentially complex models with many features in some way that diminishes the potential problem with overfitting/model variance


What if we still fit a fourth order polynomial but changed the cost function to penalize the absolute value of \(\beta_3\) and \(\beta_4\) parameter estimates?

Typical cost based on MSE/squared loss:

  • \(\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}\)

Our new cost function:

  • \([\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [1000 * |\beta_3| + 1000 * |\beta_4|]\)

\([\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [1000 * |\beta_3| + 1000 * |\beta_4|]\)

  • The only way to make the value of this new cost function small is to make \(\beta_3\) and \(\beta_4\) small

  • If we make the penalty applied to \(\beta_3\) and \(\beta_4\) large (e.g., 1000 as above), we will end up with parameter estimates for these two features that are approximately 0.

  • With a sufficient penalty applied, their parameter estimates will only move away from zero to the degree that doing so produces a large enough drop in MSE to offset the penalty in the overall aggregate cost function.


\([\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [1000 * |\beta_3| + 1000 * |\beta_4|]\)

  • With this penalty in place, our final model might shift from the blue model to the pink model below. The pink model is mostly quadratic but with a few extra “wiggles” if \(\beta_3\) and \(\beta_4\) are not exactly 0.
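A minimal sketch of this ad hoc penalized cost (the 1000 penalty and the choice of \(\beta_3\) and \(\beta_4\) come from the illustration above, not from any real algorithm):

# MSE plus a fixed penalty on the absolute values of beta_3 and beta_4
cost_penalized <- function(y, y_hat, beta_3, beta_4, penalty = 1000) {
  mean((y - y_hat)^2) + penalty * abs(beta_3) + penalty * abs(beta_4)
}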


Of course, we don’t typically know in advance which parameter estimates to penalize.

  • Instead, we apply some penalty to all the parameter estimates (except \(\beta_0\))
  • This shrinks the parameter estimates for all the features to some degree
  • However, features that do reduce MSE meaningfully will be “worth” including with non-zero parameter estimates
  • You can also control the shrinkage by controlling the size of the penalty (see the sketch below)
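To preview this with the glmnet package that we use later in the unit, the sketch below (simulated data, illustrative values) fits a Ridge-type model and shows that all of the coefficients shrink toward zero as the penalty increases:

library(glmnet)

set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)  # 5 features
y <- drop(x %*% rep(1, 5) + rnorm(100, 0, 5))      # all true betas = 1

fit <- glmnet(x, y, alpha = 0)   # alpha = 0 applies the Ridge (L2) penalty
coef(fit, s = c(1, 10, 100))     # larger s (lambda) -> more shrinkage toward 0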

In general, regularization produces models that:

  • Are simpler (e.g. smoother, smaller coefficients/parameter estimates)
  • Are less prone to overfitting
  • Allow for models with p >> n
  • Are sometimes more interpretable (LASSO, Elastic Net)

These benefits are provided by the introduction of some bias into the parameter estimates

This allows for a bias-variance trade-off where some bias is introduced for a big reduction in variance of model fit


We will now consider three regularization approaches that introduce different types of penalties to shrink the parameter estimates

  • L2 (Ridge)
  • L1 (LASSO)
  • Elastic net

These approaches are available for both regression and classification problems and for a variety of parametric statistical algorithms

A fourth common regularized classification model (also sometimes used for regression) is the support vector machine (not covered in class but commonly used as well and easy to understand with this foundation)

Each of these approaches uses a different specific penalty, which has implications for how the model performs in different settings


6.5 Ridge Regression

The cost function for Ridge Regression is:

  • \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)

It has two components:

  • Inside the left brackets is the SSE from linear regression
  • Inside the right brackets is the Ridge penalty.


This penalty:

  • Includes the sum of the squared parameter estimates (excluding \(\beta_0\)). Squaring removes the sign of these parameter estimates.
  • This sum is multiplied by \(\lambda\), a hyperparameter in Ridge regression. Lambda allows us to tune the size of the penalty.
  • This is an application of the L2 norm (matrix algebra) to the vector of parameter estimates (see the sketch below)
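A minimal sketch of this cost function in R (for intuition only; glmnet computes this internally and parameterizes it slightly differently):

# Ridge cost: (1/n) * (SSE + lambda * sum of squared coefficients, excluding beta_0)
cost_ridge <- function(y, y_hat, betas, lambda) {
  (sum((y - y_hat)^2) + lambda * sum(betas^2)) / length(y)
}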

\(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)

Question: What will happen to a Ridge regression model’s parameter estimates and its performance (i.e., its bias & variance) as lambda increases/decreases?
Show Answer
As lambda increases, the model becomes less flexible b/c its parameter estimates 
become constrained/shrunk.  This will increase bias but decrease variance for model 
performance.

\(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)

Question: What is the special case of Ridge regression when lambda = 0?
Show Answer
The OLS regression is a special case where lambda = 0 (i.e., no penalty is applied).  
This is the most flexible. It is unbiased but with higher variance than for 
non-zero values of lambda

Let's compare Ridge regression to OLS (ordinary least squares with squared loss cost function) linear regression

  • Ridge parameter estimates are biased but have lower variance (smaller SE) than OLS

  • Ridge may predict better in new data

    • This depends on the value of \(\lambda\) selected and its impact on bias-variance trade-off in Ridge regression vs. OLS
    • There does exist a value of \(\lambda\) for which Ridge predicts better than OLS in new data
  • Ridge regression (but not OLS) allows for p > (or even >>) n

  • Ridge regression (but not OLS) accommodates highly correlated (or even perfectly multi-collinear) features

  • OLS (but not Ridge regression) is scale invariant

    • You should scale (mean-center and standardize) features for use with Ridge regression

\(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)

Question: Why does the scale of the features matter for Ridge regression?
Show Answer
Features with bigger SDs will have smaller parameter estimates.  Therefore they 
will be less affected by the penalty.



Unless the features are on the same scale to start, you should standardize them for all applications (regression and classification) of Ridge (and also LASSO and elastic net). You can handle this during feature engineering in the recipe.
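For example, a recipe like the sketch below (data_trn is a placeholder for your training data) handles the standardization; the same pattern appears in the empirical examples later in this unit:

rec <- recipe(y ~ ., data = data_trn) |> 
  step_normalize(all_predictors())  # sets mean = 0, sd = 1 for every feature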


6.6 LASSO Regression

LASSO is an acronym for Least Absolute Shrinkage and Selection Operator

The cost function for LASSO Regression is:

  • \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} |\beta_j|\:])\)

It has two components:

  • Inside the left brackets is the SSE from linear regression
  • Inside the right brackets is the LASSO penalty.


This penalty:

  • Includes the sum of the absolute value of the parameter estimates (excluding \(\beta_0\)). The absolute value removes the sign of these parameter estimates.
  • This sum is multiplied by \(\lambda\), a hyperparameter in LASSO regression. Lambda allows us to tune the size of the penalty.
  • This is an application of the L1 norm to the vector of parameter estimates

6.6.1 LASSO vs. Ridge Comparison

With respect to the parameter estimates:

  • LASSO yields a sparse solution (some parameter estimates are set to exactly zero)

  • Ridge tends to retain all features (parameter estimates don’t get set to exactly zero)

  • LASSO tends to select one feature from a correlated group and sets the others to zero

  • Ridge shrinks all parameter estimates for correlated features

Ridge tends to outperform LASSO with respect to prediction in new data. There are cases where LASSO can predict better (when most features have zero effect and only a few are non-zero), but even in those cases, Ridge is competitive.


6.6.2 Advantages of LASSO

  • Does feature selection (sets parameter estimates to exactly 0)

    • Yields a sparse solution
    • Sparse model is more interpretable?
    • Sparse model is easier to implement? (fewer features included so don’t need to measure as many predictors)
  • More robust to outliers (similar to LAD vs. OLS)

  • Tends to do better when there are a small number of robust features and the others are close to zero or zero


6.6.3 Advantages of Ridge

  • Computationally superior (closed-form solution vs. iterative; only one solution minimizes the cost function)
  • More robust to measurement error in features (remember no measurement error is an assumption for unbiased estimates in OLS regression)
  • Tends to do better when there are many features with large (and comparable) effects (i.e., most features are related to the outcome)

6.7 Elastic Net Regression

The Elastic Net blends the L1 and L2 penalties to obtain the benefits of each of those approaches.

We will use the implementation of the Elastic Net in glmnet in R.


You can also read additional introductory documentation for this package


In the Gaussian regression context, the Elastic Net cost function is:

  • \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda (\alpha\sum_{j = 1}^{p} |\beta_j| + (1-\alpha)\sum_{j = 1}^{p} \beta_j^{2})\:])\)

This model has two hyper-parameters

  • \(\lambda\) controls the degree of regularization as before
  • \(\alpha\) is a “mixing” parameter that blends the degree of L1 and L2 contributions to the aggregate penalty. (Proportion of LASSO penalty)
    • \(\alpha\) = 1 results in the LASSO model
    • \(\alpha\) = 0 results in the Ridge model
    • Intermediate values for \(\alpha\) blend these penalties together proportionally to include more or less LASSO penalty

As before (e.g., KNN), the best values of \(\lambda\) (and \(\alpha\)) can be selected with resampling using tune_grid()

The grid needs to have crossed values of both penalty (\(\lambda\)) and mixture (\(\alpha\)) for glmnet (see the sketch below)

  • Can use expand_grid()
  • Only penalty is needed in grid if fitting a Ridge or LASSO model.
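As a sketch (the grid values here are illustrative; the empirical examples below use similar grids), the blended penalty and a crossed grid for tune_grid() might look like this:

# Elastic net penalty: alpha blends the L1 (LASSO) and L2 (Ridge) penalties
penalty_enet <- function(betas, lambda, alpha) {
  lambda * (alpha * sum(abs(betas)) + (1 - alpha) * sum(betas^2))
}

# Crossed grid of penalty (lambda) and mixture (alpha) values
grid_enet <- expand_grid(penalty = exp(seq(-4, 4, length.out = 100)),
                         mixture = seq(0, 1, length.out = 6))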

6.8 Empirical Example 1: Many “good” but correlated predictors

For the first example, we will simulate data with:

  • Many correlated predictors
  • All related to outcome
  • Get a small training sample
  • Get a big test sample (for more precise estimate of model performance)

First we set the parameters for our simulation

n_cases_trn <- 100
n_cases_test <- 1000
n_x <- 20
covs_x <- 50
vars_x <- 100
b_x <- rep(1, n_x) # one unit change in y for 1 unit change in x
y_error <- 100

Then we draw samples from population

set.seed(12345)
mu <- rep(0, n_x)  # means for all variables = 0
sigma <- matrix(covs_x, nrow = n_x, ncol = n_x)
diag(sigma) <- vars_x  
sigma
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
 [1,]  100   50   50   50   50   50   50   50   50    50    50    50    50
 [2,]   50  100   50   50   50   50   50   50   50    50    50    50    50
 [3,]   50   50  100   50   50   50   50   50   50    50    50    50    50
 [4,]   50   50   50  100   50   50   50   50   50    50    50    50    50
 [5,]   50   50   50   50  100   50   50   50   50    50    50    50    50
 [6,]   50   50   50   50   50  100   50   50   50    50    50    50    50
 [7,]   50   50   50   50   50   50  100   50   50    50    50    50    50
 [8,]   50   50   50   50   50   50   50  100   50    50    50    50    50
 [9,]   50   50   50   50   50   50   50   50  100    50    50    50    50
[10,]   50   50   50   50   50   50   50   50   50   100    50    50    50
[11,]   50   50   50   50   50   50   50   50   50    50   100    50    50
[12,]   50   50   50   50   50   50   50   50   50    50    50   100    50
[13,]   50   50   50   50   50   50   50   50   50    50    50    50   100
[14,]   50   50   50   50   50   50   50   50   50    50    50    50    50
[15,]   50   50   50   50   50   50   50   50   50    50    50    50    50
[16,]   50   50   50   50   50   50   50   50   50    50    50    50    50
[17,]   50   50   50   50   50   50   50   50   50    50    50    50    50
[18,]   50   50   50   50   50   50   50   50   50    50    50    50    50
[19,]   50   50   50   50   50   50   50   50   50    50    50    50    50
[20,]   50   50   50   50   50   50   50   50   50    50    50    50    50
      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]    50    50    50    50    50    50    50
 [2,]    50    50    50    50    50    50    50
 [3,]    50    50    50    50    50    50    50
 [4,]    50    50    50    50    50    50    50
 [5,]    50    50    50    50    50    50    50
 [6,]    50    50    50    50    50    50    50
 [7,]    50    50    50    50    50    50    50
 [8,]    50    50    50    50    50    50    50
 [9,]    50    50    50    50    50    50    50
[10,]    50    50    50    50    50    50    50
[11,]    50    50    50    50    50    50    50
[12,]    50    50    50    50    50    50    50
[13,]    50    50    50    50    50    50    50
[14,]   100    50    50    50    50    50    50
[15,]    50   100    50    50    50    50    50
[16,]    50    50   100    50    50    50    50
[17,]    50    50    50   100    50    50    50
[18,]    50    50    50    50   100    50    50
[19,]    50    50    50    50    50   100    50
[20,]    50    50    50    50    50    50   100
x <- MASS::mvrnorm(n = n_cases_trn, mu, sigma) |> 
  magrittr::set_colnames(str_c("x_", 1:n_x)) |>
  as_tibble()
data_trn_1 <- x |> 
  mutate(y = rowSums(t(t(x)*b_x)) + rnorm(n_cases_trn, 0, y_error)) |>  
  glimpse()
Rows: 100
Columns: 21
$ x_1  <dbl> 0.9961431, 14.6304657, 4.7113834, 5.4731915, 2.2835857, -16.15454…
$ x_2  <dbl> 8.6957591, 0.8347032, -0.5353713, -12.5511722, -1.0787145, -16.42…
$ x_3  <dbl> 10.6129264, 20.4883323, -8.6503901, -1.5413989, -17.7641062, -14.…
$ x_4  <dbl> 11.80601585, 11.29743568, 4.46220219, 0.79397084, -3.78878523, -2…
$ x_5  <dbl> 18.835971, 6.095917, -3.272026, -1.839770, -1.610545, -6.040994, …
$ x_6  <dbl> 2.223292893, 11.500104110, -0.457979676, -8.496314755, 7.87014122…
$ x_7  <dbl> 10.5129546, 6.4060743, 12.0210560, -13.7151243, 11.1422588, -14.2…
$ x_8  <dbl> 8.8124570, 9.6462680, 2.4037228, -0.2015028, 6.3737654, -3.064850…
$ x_9  <dbl> -3.67513802, 3.26579725, -7.25050948, -12.31704216, -1.69613573, …
$ x_10 <dbl> -2.38262537, -6.46882639, 1.83840881, 2.57893960, 6.40948837, -27…
$ x_11 <dbl> 14.9730316, 15.6370590, -1.1323317, 6.5593694, 5.1039078, -7.5830…
$ x_12 <dbl> -4.47755314, 0.04826209, 2.59168486, -4.26699855, 16.49056393, -1…
$ x_13 <dbl> 8.5928287, -4.0292190, -14.4559758, 6.0387060, 2.3659570, -13.319…
$ x_14 <dbl> -10.54664618, 3.85439473, -7.14415536, 5.64697664, 2.86171243, -1…
$ x_15 <dbl> -5.9948683, 0.6097567, 4.8671687, 0.7012719, 14.8754506, 3.478677…
$ x_16 <dbl> 7.9805572, 12.1442098, -2.2095452, -9.1514837, 10.0124739, -2.676…
$ x_17 <dbl> -2.4674864, -13.6498178, 1.0131614, 3.7409088, 5.9397242, -10.243…
$ x_18 <dbl> 9.9571661, 5.7690926, 5.3204675, -13.9289436, 14.6130884, -19.710…
$ x_19 <dbl> -3.37220750, 1.23419418, -6.67125147, -4.95630437, 8.94440082, -5…
$ x_20 <dbl> 3.7686080, 3.4971897, -3.2892748, -14.2852637, -1.5467975, -28.72…
$ y    <dbl> 24.067076, 210.433707, -73.482134, 44.144652, 228.535603, -259.78…
x <- MASS::mvrnorm(n = n_cases_test, mu, sigma) |> 
  magrittr::set_colnames(str_c("x_", 1:n_x)) |>
  as_tibble() 
data_test_1 <- x |> 
  mutate(y = rowSums(t(t(x)*b_x)) + rnorm(n_cases_test, 0, y_error))

Set up a tibble to track model performance in train and test sets

  • We are using the test set repeatedly to get a rigorous held-out performance estimate, separate from the model selection process.
  • Just for our understanding
  • We would not choose a model configuration based on test set error
error_ex1 <- tibble(model = character(), 
                    rmse_trn = numeric(), 
                    rmse_test = numeric()) |> 
  glimpse()
Rows: 0
Columns: 3
$ model     <chr> 
$ rmse_trn  <dbl> 
$ rmse_test <dbl> 

6.8.1 Fit a standard (OLS) linear regression

Fit the linear model

  • No feature engineering needed. Can use raw predictors as features
  • No resampling needed b/c there are no hyperparameters
fit_lm_1 <- 
  linear_reg() |> 
  set_engine("lm") |> 
  fit(y ~ ., data = data_trn_1)

fit_lm_1 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)   -5.22      12.0    -0.436   0.664 
 2 x_1            1.22       1.40    0.869   0.388 
 3 x_2           -1.42       1.62   -0.874   0.385 
 4 x_3           -0.679      1.59   -0.427   0.671 
 5 x_4            0.392      1.49    0.263   0.793 
 6 x_5            4.20       1.81    2.33    0.0226
 7 x_6            0.804      1.58    0.510   0.612 
 8 x_7            0.953      1.50    0.634   0.528 
 9 x_8            2.30       1.53    1.50    0.137 
10 x_9            2.37       1.53    1.55    0.126 
11 x_10           0.758      1.67    0.454   0.651 
12 x_11           0.530      1.87    0.283   0.778 
13 x_12           1.44       1.78    0.813   0.419 
14 x_13           2.40       1.65    1.45    0.150 
15 x_14           3.26       1.89    1.72    0.0889
16 x_15           0.110      1.53    0.0719  0.943 
17 x_16          -1.28       1.68   -0.761   0.449 
18 x_17          -0.144      1.55   -0.0926  0.926 
19 x_18           2.74       1.66    1.65    0.102 
20 x_19           1.36       1.61    0.843   0.402 
21 x_20          -1.49       1.67   -0.893   0.375 

Irreducible error was set by y_error (100)

  • Overfit to train
  • Much worse in test
rmse_vec(truth = data_trn_1$y, 
         estimate = predict(fit_lm_1, data_trn_1)$.pred)
[1] 91.86214
rmse_vec(truth = data_test_1$y, 
         estimate = predict(fit_lm_1, data_test_1)$.pred)
[1] 112.3208

6.8.2 Fit LASSO

LASSO, Ridge, and glmnet all need features on the same scale to apply the penalty consistently

  • Use step_normalize(). This sets mean = 0, sd = 1 (NOTE: Bad name as it does NOT change shape of distribution!)
  • Can use same recipe for LASSO, Ridge, and glmnet
  • Can use same train and test feature matrices as well
rec_1 <- recipe(y ~ ., data = data_trn_1) |> 
  step_normalize(all_predictors())

rec_prep_1 <- rec_1 |> 
  prep(data_trn_1)

feat_trn_1 <- rec_prep_1 |> 
  bake(NULL)

feat_test_1 <- rec_prep_1 |> 
  bake(data_test_1)

Set up splits for resampling for tuning hyperparameters

  • Use bootstrap for more precise estimation (even if more biased). Good for selection
  • Can use same bootstrap splits for LASSO, Ridge, and glmnet
set.seed(20140102)
splits_boot_1 <- data_trn_1 |> 
   bootstraps(times = 100, strata = "y")  

Now onto the LASSO….

We need to tune \(\lambda\) (tidymodels calls this penalty)

  • \(\alpha\) = 1 (tidymodels calls this mixture)
  • Set up grid with exponential values for penalty
  • glmnet uses warm starts so can fit lots of values for \(\lambda\) quickly
  • Could also use cv.glmnet() directly in glmnet package to find good values. See get_lamdas() in fun_modeling.R
grid_lasso <- expand_grid(penalty = exp(seq(-4, 4, length.out = 500)))
fits_lasso_1 <- xfun::cache_rds(
  expr = {
  linear_reg(penalty = tune(), 
               mixture = 1) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_1, 
              resamples = splits_boot_1, 
              grid = grid_lasso, 
              metrics = metric_set(rmse))

   },
   rerun = rerun_setting,
   dir = "cache/006/",
   file = "fits_lasso_1")

Evaluate model performance in validation sets (OOB)

Make sure that you have hit a clear minimum (bottom of U or at least an asymptote)

plot_hyperparameters(fits_lasso_1, hp1 = "penalty", metric = "rmse")


Fit best configuration (i.e., best lambda) to full train set

  • Use select_best()
  • Don’t forget to indicate which column (penalty)
fit_lasso_1 <-
  linear_reg(penalty = select_best(fits_lasso_1)$penalty, 
             mixture = 1) |>
  set_engine("glmnet") |> 
  fit(y ~ ., data = feat_trn_1)

We can now use tidy() to look at the LASSO parameter estimates

  • tidy() uses Matrix package, which has conflicts with tidyr. Load the package without those conflicting functions
library(Matrix, exclude = c("expand", "pack", "unpack"))

Now call tidy()

  • Notice that LASSO sets some \(\beta\) to 0 even though none are 0 in DGP
  • LASSO is not great at reproducing the DGP!
fit_lasso_1 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)    29.9     11.3
 2 x_1             7.26    11.3
 3 x_2             0       11.3
 4 x_3             0       11.3
 5 x_4             0       11.3
 6 x_5            34.2     11.3
 7 x_6             8.00    11.3
 8 x_7             1.56    11.3
 9 x_8            17.4     11.3
10 x_9            20.9     11.3
11 x_10            7.95    11.3
12 x_11            3.49    11.3
13 x_12            7.01    11.3
14 x_13           20.8     11.3
15 x_14           25.8     11.3
16 x_15            4.85    11.3
17 x_16            0       11.3
18 x_17            0       11.3
19 x_18           23.5     11.3
20 x_19            9.56    11.3
21 x_20            0       11.3

Irreducible error was set by y_error (100)

  • Somewhat overfit to train
  • Somewhat better in test
(error_ex1 <- error_ex1 |> 
  bind_rows(tibble(model = "LASSO model",                       
                   rmse_trn = rmse_vec(truth = feat_trn_1$y, 
                                       estimate = predict(fit_lasso_1,
                                                          feat_trn_1)$.pred),
                   rmse_test = rmse_vec(truth = feat_test_1$y, 
                                        estimate = predict(fit_lasso_1,
                                                           feat_test_1)$.pred))))
# A tibble: 2 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     91.9      112.
2 LASSO model      94.7      107.

6.8.3 Fit Ridge

Fit Ridge algorithm

  • Tune \(\lambda\) (penalty)
  • May need to experiment to get right range of values for lambda
  • \(\alpha\) = 0 (mixture)
  • Evaluate model configurations in OOB validation sets
grid_ridge <- expand_grid(penalty = exp(seq(-1, 7, length.out = 500)))
fits_ridge_1 <- xfun::cache_rds(
  expr = {
    linear_reg(penalty = tune(), 
               mixture = 0) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_1, 
              resamples = splits_boot_1, 
              grid = grid_ridge, 
              metrics = metric_set(rmse))

  },
  rerun = rerun_setting,
  dir = "cache/006/",
  file = "fits_ridge_1")

Review hyperparameter plot

plot_hyperparameters(fits_ridge_1, hp1 = "penalty", metric = "rmse")


Fit best model configuration (i.e., best lambda) in full train set

  • Notice that no \(\beta\) are exactly 0
  • Why are parameter estimates not near 1 for LASSO and Ridge?
fit_ridge_1 <-
  linear_reg(penalty = select_best(fits_ridge_1)$penalty, 
             mixture = 0) |>
  set_engine("glmnet") |> 
  fit(y ~ ., data = feat_trn_1)

fit_ridge_1 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)    29.9     281.
 2 x_1            10.0     281.
 3 x_2             5.53    281.
 4 x_3             5.84    281.
 5 x_4             6.38    281.
 6 x_5            14.4     281.
 7 x_6             9.78    281.
 8 x_7             5.80    281.
 9 x_8            12.2     281.
10 x_9            13.2     281.
11 x_10           10.5     281.
12 x_11           10.3     281.
13 x_12            8.43    281.
14 x_13           12.8     281.
15 x_14           12.3     281.
16 x_15           10.2     281.
17 x_16            3.93    281.
18 x_17            7.67    281.
19 x_18           12.3     281.
20 x_19           11.2     281.
21 x_20            5.91    281.

Irreducible error was set by y_error (100)

  • Much less overfit to train
  • Still not bad in test
(error_ex1 <- error_ex1 |> 
  bind_rows(tibble(model = "Ridge model",   
                   rmse_trn = rmse_vec(truth = feat_trn_1$y, 
                                       estimate = predict(fit_ridge_1,
                                                          feat_trn_1)$.pred),
                   rmse_test = rmse_vec(truth = feat_test_1$y, 
                                        estimate = predict(fit_ridge_1,
                                                           feat_test_1)$.pred))))
# A tibble: 3 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     91.9      112.
2 LASSO model      94.7      107.
3 Ridge model      98.6      103.

6.8.4 Fit glmnet

Now we need to tune both

  • \(\lambda\) (penalty)
  • \(\alpha\) (mixture)
  • Typical to only evaluate a small number of values for \(\alpha\)
  • Warm starts across \(\lambda\)
grid_glmnet <- expand_grid(penalty = exp(seq(-1, 7, length.out = 500)),
                           mixture = seq(0, 1, length.out = 6))
fits_glmnet_1 <- xfun::cache_rds(
  expr = {
    linear_reg(penalty = tune(), 
               mixture = tune()) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_1, 
              resamples = splits_boot_1, 
              grid = grid_glmnet, 
              metrics = metric_set(rmse))
  
  },
  rerun = rerun_setting,
  dir = "cache/006/",
  file = "fits_glmnet_1")

plot_hyperparameters(fits_glmnet_1, hp1 = "penalty", hp2 = "mixture", metric = "rmse")


Fit best configuration in full train set

  • Can use select_best() for both hyperparameters, separately
  • Ridge was best (but cool that glmnet could determine that empirically!)
select_best(fits_glmnet_1)
# A tibble: 1 × 3
  penalty mixture .config                
    <dbl>   <dbl> <chr>                  
1    281.       0 Preprocessor1_Model0415
fit_glmnet_1 <-
  linear_reg(penalty = select_best(fits_glmnet_1)$penalty, 
             mixture = select_best(fits_glmnet_1)$mixture) |>
  set_engine("glmnet") |> 
  fit(y ~ ., data = feat_trn_1)

fit_glmnet_1 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)    29.9     281.
 2 x_1            10.0     281.
 3 x_2             5.53    281.
 4 x_3             5.84    281.
 5 x_4             6.38    281.
 6 x_5            14.4     281.
 7 x_6             9.78    281.
 8 x_7             5.80    281.
 9 x_8            12.2     281.
10 x_9            13.2     281.
11 x_10           10.5     281.
12 x_11           10.3     281.
13 x_12            8.43    281.
14 x_13           12.8     281.
15 x_14           12.3     281.
16 x_15           10.2     281.
17 x_16            3.93    281.
18 x_17            7.67    281.
19 x_18           12.3     281.
20 x_19           11.2     281.
21 x_20            5.91    281.

A final comparison of training and test error for the four statistical algorithms

(error_ex1 <- error_ex1 |> 
  bind_rows(tibble(model = "glmnet model",   
                   rmse_trn = rmse_vec(truth = feat_trn_1$y, 
                                       estimate = predict(fit_glmnet_1,
                                                          feat_trn_1)$.pred),
                   rmse_test = rmse_vec(truth = feat_test_1$y, 
                                        estimate = predict(fit_glmnet_1,
                                                           feat_test_1)$.pred))))
# A tibble: 4 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     91.9      112.
2 LASSO model      94.7      107.
3 Ridge model      98.6      103.
4 glmnet model     98.6      103.

6.9 Empirical Example 2: Good and Zero Predictors

For the second example, we will simulate data with:

  • Two sets of correlated predictors
  • First set related to outcome
  • Second set unrelated to outcome
  • Get a small training sample
  • Get a big test sample (for more precise estimates of performance of our model configurations)

Set up simulation parameters

n_cases_trn <- 100
n_cases_test <- 1000
n_x <- 20
covs_x <- 50 
vars_x <- 100
b_x <- rep(c(1,0), each = n_x / 2)
y_error <- 100

mu <- rep(0, n_x)  

sigma <- matrix(0, nrow = n_x, ncol = n_x)
for (i in 1:(n_x/2)){
  for(j in 1:(n_x/2)){
    sigma[i, j] <- covs_x
  }
} 
for (i in (n_x/2 + 1):n_x){
  for(j in (n_x/2 + 1):n_x){
    sigma[i, j] <- covs_x
  }
} 

diag(sigma) <- vars_x  

Simulate predictors and Y

set.seed(2468)

x <- MASS::mvrnorm(n = n_cases_trn, mu, sigma) |> 
  magrittr::set_colnames(str_c("x_", 1:n_x)) |>
  as_tibble()
data_trn_2 <- x |> 
  mutate(y = rowSums(t(t(x)*b_x)) + rnorm(n_cases_trn, 0, y_error)) |>  
  glimpse()
Rows: 100
Columns: 21
$ x_1  <dbl> -1.4627996, -1.8160400, 2.9068102, -6.5569418, -8.0480270, 9.2138…
$ x_2  <dbl> 17.7272165, 0.6310498, 9.7749301, -9.7193778, -4.3778964, 6.72976…
$ x_3  <dbl> -0.5051625, -17.7689516, 6.3105699, -10.5572604, -3.9617496, 3.90…
$ x_4  <dbl> 3.281062, 8.955012, 3.352868, -11.268461, -4.286181, 13.815268, -…
$ x_5  <dbl> 8.35180401, 6.81665987, 0.06645588, -9.66213146, -8.85459170, 16.…
$ x_6  <dbl> -5.9821809, 18.5962999, -0.8264266, 1.4451062, -9.3772084, 14.342…
$ x_7  <dbl> -12.40624948, 4.28231426, 1.53437679, 5.47908047, -4.77878760, 15…
$ x_8  <dbl> 3.479352, -7.562426, 16.716015, -4.841876, 1.927100, 27.236838, 1…
$ x_9  <dbl> 1.102297, 1.638666, 10.573017, 5.936702, 2.084463, 11.768179, 5.7…
$ x_10 <dbl> -6.6174049, 9.8993606, 5.6307484, 5.7061525, -12.3675238, 0.64541…
$ x_11 <dbl> -0.797748528, -18.721814405, -3.992559581, -21.480685628, -4.0253…
$ x_12 <dbl> 6.7928298, -7.3322948, -5.8933992, -13.0153958, 2.4618048, -12.34…
$ x_13 <dbl> -5.68712456, -1.01785876, 7.86559179, -18.79422479, -3.31514615, …
$ x_14 <dbl> -9.99575229, -16.35133901, -18.02068048, -26.14924286, 1.46020815…
$ x_15 <dbl> -4.2672643, -4.4350928, -13.6207539, -15.1147495, -5.5577594, -13…
$ x_16 <dbl> 8.5520238, -2.5639516, 5.5768023, -19.1732900, -4.1939131, 0.7558…
$ x_17 <dbl> 4.93829278, -14.17190979, -10.23808120, -8.90013599, -9.33297489,…
$ x_18 <dbl> -1.987745, -9.259699, -13.225873, -28.090077, -13.168306, -1.3565…
$ x_19 <dbl> 1.5085562, -13.9721276, -5.6622726, -10.9398210, -22.5233973, -8.…
$ x_20 <dbl> 1.2541564, -19.0101170, -1.6780834, -16.9207862, -7.2618287, 1.13…
$ y    <dbl> -94.94590, 69.75956, 88.46711, 15.72162, 145.49463, 101.30513, 10…
x <- MASS::mvrnorm(n = n_cases_test, mu, sigma) |> 
  magrittr::set_colnames(str_c("x_", 1:n_x)) |>
  as_tibble()
data_test_2 <- x |> 
  mutate(y = rowSums(t(t(x)*b_x)) + rnorm(n_cases_test, 0, y_error)) |>  
  glimpse()
Rows: 1,000
Columns: 21
$ x_1  <dbl> -11.082333, 10.623468, -7.962704, -4.360526, 4.170808, -8.056227,…
$ x_2  <dbl> 4.4892149, 6.5181854, 7.4099931, 5.1254868, 12.7502461, -9.275426…
$ x_3  <dbl> -3.94113759, 2.65646872, -0.10913588, 13.39114437, 12.99616116, 3…
$ x_4  <dbl> -17.3732360, 2.6196690, -6.8545899, 3.6877167, 9.5982646, 7.09623…
$ x_5  <dbl> 2.1853683, 25.4130435, 0.2749666, 3.9995128, 16.5471387, 9.649271…
$ x_6  <dbl> -13.8387545, -7.7514735, -1.0824170, -10.3723020, 13.3474819, -4.…
$ x_7  <dbl> -8.1371882, 14.0102608, 2.6593081, 5.2537524, 0.2673897, -5.48200…
$ x_8  <dbl> -22.8820028, 0.8089559, -2.8108121, 10.4672192, 10.9376332, 1.937…
$ x_9  <dbl> -20.8906202, 13.3578359, 19.4734781, 0.7248788, 7.3258207, 14.565…
$ x_10 <dbl> -18.2918851, 3.0034874, 5.3284673, -1.6321880, 5.1493291, 4.60337…
$ x_11 <dbl> 18.672349, 11.091231, 14.868589, 9.865558, 4.726164, 14.227183, 1…
$ x_12 <dbl> 5.5770933, 0.5463577, -9.0256983, 3.8787265, 9.8113720, -3.925492…
$ x_13 <dbl> -6.996611892, 19.551699340, 15.284866989, -2.964783803, 3.1004588…
$ x_14 <dbl> 21.031040, 13.320671, 18.770094, 10.911134, -0.382450, 18.983760,…
$ x_15 <dbl> -1.06370853, 14.64546199, -2.46018710, 3.09542088, -0.52366758, 8…
$ x_16 <dbl> -0.8450554, 17.3738247, 4.3254573, 12.7759481, 13.7785211, 13.078…
$ x_17 <dbl> 4.6867098, -1.5470071, 2.1471326, -9.5172721, 11.7413577, 6.54228…
$ x_18 <dbl> 6.3226155, 9.0624381, 0.5223122, 6.1315901, 5.0103132, 7.2802472,…
$ x_19 <dbl> 7.6803428, -2.6084423, 4.2716484, 22.0685295, 0.4731131, 8.255899…
$ x_20 <dbl> -5.4343168, -4.9070165, 13.5187641, 14.4286786, 8.4636535, 8.9575…
$ y    <dbl> -83.529025, 83.571326, 86.073513, -95.654257, 79.326001, 114.5172…

Set up a tibble to track model performance in train and test

error_ex2 <- tibble(model = character(), rmse_trn = numeric(), rmse_test = numeric()) |> 
  glimpse()
Rows: 0
Columns: 3
$ model     <chr> 
$ rmse_trn  <dbl> 
$ rmse_test <dbl> 

6.9.1 Fit a standard (OLS) linear regression

Fit and evaluate the linear model

fit_lm_2 <- 
  linear_reg() |> 
  set_engine("lm") |> 
  fit(y ~ ., data = data_trn_2)

fit_lm_2 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept) -14.0        10.8    -1.30    0.198 
 2 x_1          -0.372       1.45   -0.256   0.798 
 3 x_2           3.54        1.39    2.55    0.0126
 4 x_3           1.54        1.53    1.01    0.317 
 5 x_4           2.90        1.42    2.05    0.0441
 6 x_5           2.98        1.50    1.98    0.0509
 7 x_6          -1.28        1.34   -0.956   0.342 
 8 x_7           0.821       1.48    0.553   0.582 
 9 x_8           1.38        1.26    1.10    0.276 
10 x_9           0.280       1.25    0.225   0.823 
11 x_10          0.0247      1.33    0.0186  0.985 
12 x_11         -2.09        1.56   -1.34    0.185 
13 x_12          1.71        1.47    1.16    0.248 
14 x_13          1.33        1.42    0.936   0.352 
15 x_14         -0.927       1.24   -0.747   0.457 
16 x_15          2.99        1.53    1.96    0.0536
17 x_16         -0.602       1.40   -0.431   0.668 
18 x_17          1.78        1.28    1.39    0.168 
19 x_18         -1.73        1.73   -1.00    0.321 
20 x_19         -1.28        1.37   -0.938   0.351 
21 x_20         -2.83        1.24   -2.28    0.0253

Irreducible error was set by y_error (100)

  • Very overfit to train
  • Very poor performance in test
(error_ex2 <- error_ex2 |> 
  bind_rows(tibble(model = "linear model",                       
                   rmse_trn = rmse_vec(truth = data_trn_2$y, 
                                       estimate = predict(fit_lm_2,
                                                          data_trn_2)$.pred),
                   rmse_test = rmse_vec(truth = data_test_2$y, 
                                        estimate = predict(fit_lm_2,
                                                           data_test_2)$.pred))))
# A tibble: 1 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     83.4      117.

6.9.2 Fit LASSO

For all glmnet algorithms, set up:

  • Recipe
  • Feature matrices
  • Bootstraps for model configuration selection (tuning)
rec_2 <- recipe(y ~ ., data = data_trn_2) |> 
  step_normalize(all_predictors())

rec_prep_2 <- rec_2 |> 
  prep(data_trn_2)

feat_trn_2 <- rec_prep_2 |> 
  bake(NULL)

feat_test_2 <- rec_prep_2 |> 
  bake(data_test_2)

set.seed(20140102)
splits_boot_2 <- data_trn_2 |> 
   bootstraps(times = 100, strata = "y") 

Tune \(\lambda\) for LASSO

  • We can use the same penalty grids from the earlier example because the sample size and number of features haven't changed, so they are likely still good
fits_lasso_2 <- xfun::cache_rds(
  expr = {
    linear_reg(penalty = tune(), 
               mixture = 1) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_2, 
              resamples = splits_boot_2, 
              grid = grid_lasso, 
              metrics = metric_set(rmse))

  },
  rerun = rerun_setting,
  dir = "cache/006/",
  file = "fits_lasso_2")

Plot hyperparameters

plot_hyperparameters(fits_lasso_2, hp1 = "penalty", metric = "rmse")


Fit best LASSO to full training set

  • Notice the many \(\beta\) = 0
  • It did set some of the “good” features to 0 as well
fit_lasso_2 <-
  linear_reg(penalty = select_best(fits_lasso_2)$penalty, 
             mixture = 1) |>
  set_engine("glmnet") |> 
  fit(y ~ ., data = feat_trn_2)

fit_lasso_2 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)    5.52     12.7
 2 x_1            1.02     12.7
 3 x_2           21.4      12.7
 4 x_3            8.60     12.7
 5 x_4           23.3      12.7
 6 x_5           19.9      12.7
 7 x_6            0        12.7
 8 x_7            0        12.7
 9 x_8           10.8      12.7
10 x_9            0.589    12.7
11 x_10           0        12.7
12 x_11          -0.632    12.7
13 x_12           0        12.7
14 x_13           0        12.7
15 x_14           0        12.7
16 x_15           0        12.7
17 x_16           0        12.7
18 x_17           0        12.7
19 x_18           0        12.7
20 x_19           0        12.7
21 x_20          -6.93     12.7

Irreducible error was set by y_error (100)

  • Somewhat overfit to train
  • Good in test
(error_ex2 <- error_ex2 |> 
  bind_rows(tibble(model = "LASSO model",                       
                   rmse_trn = rmse_vec(truth = feat_trn_2$y, 
                                       estimate = predict(fit_lasso_2,
                                                          feat_trn_2)$.pred),
                   rmse_test = rmse_vec(truth = feat_test_2$y, 
                                        estimate = predict(fit_lasso_2,
                                                           feat_test_2)$.pred))))
# A tibble: 2 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     83.4      117.
2 LASSO model      92.9      103.

6.9.3 Fit Ridge

Tune \(\lambda\) for Ridge

  • Can use the Ridge penalty grid from example 1
fits_ridge_2 <- xfun::cache_rds(
  expr = {
    linear_reg(penalty = tune(), 
               mixture = 0) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_2, 
              resamples = splits_boot_2, 
              grid = grid_ridge, 
              metrics = metric_set(rmse))

  },
  rerun = rerun_setting,
  dir = "cache/006/",
  file = "fits_ridge_2")

Plot hyperparameters

plot_hyperparameters(fits_ridge_2, hp1 = "penalty", metric = "rmse")



Fit best Ridge to full training set

  • Notice no \(\beta\) = 0
fit_ridge_2 <-
  linear_reg(penalty = select_best(fits_ridge_2)$penalty, 
             mixture = 0) |>
  set_engine("glmnet") |> 
  fit(y ~ ., data = feat_trn_2)

fit_ridge_2 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)     5.52    132.
 2 x_1             7.19    132.
 3 x_2            14.7     132.
 4 x_3             9.23    132.
 5 x_4            14.6     132.
 6 x_5            12.7     132.
 7 x_6             4.12    132.
 8 x_7             6.27    132.
 9 x_8             9.75    132.
10 x_9             5.83    132.
11 x_10            5.18    132.
12 x_11           -6.65    132.
13 x_12            4.21    132.
14 x_13            3.84    132.
15 x_14           -2.49    132.
16 x_15            5.80    132.
17 x_16           -1.39    132.
18 x_17            4.19    132.
19 x_18           -5.07    132.
20 x_19           -5.88    132.
21 x_20           -8.57    132.

Irreducible error was set by y_error (100)

  • Somewhat overfit to train
  • Still slightly better than LASSO in test
(error_ex2 <- error_ex2 |> 
  bind_rows(tibble(model = "Ridge model",                       
                   rmse_trn = rmse_vec(truth = feat_trn_2$y, 
                                       estimate = predict(fit_ridge_2,
                                                          feat_trn_2)$.pred),
                   rmse_test = rmse_vec(truth = feat_test_2$y, 
                                        estimate = predict(fit_ridge_2,
                                                           feat_test_2)$.pred))))
# A tibble: 3 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     83.4      117.
2 LASSO model      92.9      103.
3 Ridge model      90.6      102.

6.9.4 Fit Complete glmnet

Tune \(\lambda\) and \(\alpha\) for glmnet

fits_glmnet_2 <- xfun::cache_rds(
  expr = {
    linear_reg(penalty = tune(), 
               mixture = tune()) |> 
    set_engine("glmnet") |> 
    tune_grid(preprocessor = rec_2, 
              resamples = splits_boot_2, 
              grid = grid_glmnet, 
              metrics = metric_set(rmse))

  },
  rerun = rerun_setting,
  dir = "cache/006/",
  file = "fits_glmnet_2")

Plot hyperparameters

plot_hyperparameters(fits_glmnet_2, hp1 = "penalty", hp2 = "mixture", metric = "rmse")


Fit Best glmnet in full train set

  • Still Ridge (but won’t always be)
select_best(fits_glmnet_2)
# A tibble: 1 × 3
  penalty mixture .config                
    <dbl>   <dbl> <chr>                  
1    132.       0 Preprocessor1_Model0368
fit_glmnet_2 <-
  linear_reg(penalty = select_best(fits_glmnet_2)$penalty, 
             mixture = select_best(fits_glmnet_2)$mixture) |>
  set_engine("glmnet") |> 
  fit(y ~ ., data = feat_trn_2)

fit_glmnet_2 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)     5.52    132.
 2 x_1             7.19    132.
 3 x_2            14.7     132.
 4 x_3             9.23    132.
 5 x_4            14.6     132.
 6 x_5            12.7     132.
 7 x_6             4.12    132.
 8 x_7             6.27    132.
 9 x_8             9.75    132.
10 x_9             5.83    132.
11 x_10            5.18    132.
12 x_11           -6.65    132.
13 x_12            4.21    132.
14 x_13            3.84    132.
15 x_14           -2.49    132.
16 x_15            5.80    132.
17 x_16           -1.39    132.
18 x_17            4.19    132.
19 x_18           -5.07    132.
20 x_19           -5.88    132.
21 x_20           -8.57    132.

Irreducible error was set by y_error (100)

  • Somewhat overfit to train
  • Still not bad in test
(error_ex2 <- error_ex2 |> 
  bind_rows(tibble(model = "glmnet model",   
                   rmse_trn = rmse_vec(truth = feat_trn_2$y, 
                                       estimate = predict(fit_glmnet_2,
                                                          feat_trn_2)$.pred),
                   rmse_test = rmse_vec(truth = feat_test_2$y, 
                                        estimate = predict(fit_glmnet_2,
                                                           feat_test_2)$.pred))))
# A tibble: 4 × 3
  model        rmse_trn rmse_test
  <chr>           <dbl>     <dbl>
1 linear model     83.4      117.
2 LASSO model      92.9      103.
3 Ridge model      90.6      102.
4 glmnet model     90.6      102.

6.10 LASSO for Feature (e.g., Covariate) Selection?

Let's consider a typical explanatory setting in Psychology

  • A focal dichotomous IV (your experimental manipulation)
  • A number of covariates (some good, some bad)
  • A quantitative outcome (y)
  • Covariates are uncorrelated with IV b/c IV is manipulated

Let’s pretend the previous 20 xs were your covariates


What are your options to test the IV prior to this course?

  • You want to use covariates to increase power

  • BUT you don’t know which covariates to use

    • You might use all of them

    • Or you might use none of them (a clear lost opportunity)

    • Or you might hack it by using those covariates that increase your focal IV effect (very bad!)


Now, we might use the feature selection characteristics of LASSO to select which covariates are included

There are two possibilities that occur to me


  1. Use LASSO to build the best DGP for a covariates-only model

  • Could be more conservative (fewer covariates) by using the largest penalty within 1 SE of the best performance, though this is less flexible (i.e., it will set more parameter estimates to 0)
  • Follow up with a linear model (using lm()), regressing y on iv and the covariates from the LASSO that are non-zero (see the sketch after this list)
fit_lasso_2 |> 
  tidy() |> 
  print(n = 21)
# A tibble: 21 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)    5.52     12.7
 2 x_1            1.02     12.7
 3 x_2           21.4      12.7
 4 x_3            8.60     12.7
 5 x_4           23.3      12.7
 6 x_5           19.9      12.7
 7 x_6            0        12.7
 8 x_7            0        12.7
 9 x_8           10.8      12.7
10 x_9            0.589    12.7
11 x_10           0        12.7
12 x_11          -0.632    12.7
13 x_12           0        12.7
14 x_13           0        12.7
15 x_14           0        12.7
16 x_15           0        12.7
17 x_16           0        12.7
18 x_17           0        12.7
19 x_18           0        12.7
20 x_19           0        12.7
21 x_20          -6.93     12.7
  • You clearly improved your best guess on covariates to include
  • You will regress y on iv and the 9 covariates with non-zero effects

  2. Use LASSO to build the best DGP including iv and covariates, but don’t penalize iv

  • Look at the penalty.factor argument in glmnet() (default is rep(1, nvars)); setting an entry to 0 means that feature is not penalized
  • Can fit LASSO with unbiased? estimate of iv
  • Need to bootstrap for SE for iv (next unit)
  • Only appropriate if IV is manipulated
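Here is a hedged sketch of both options. It assumes a hypothetical data frame data_trn_iv that contains y, the manipulated iv, and the same 20 covariates; fit_lasso_2 is the covariates-only LASSO fit from above.

# Option 1: keep covariates with non-zero LASSO estimates, then test iv with lm()
covs_keep <- fit_lasso_2 |> 
  tidy() |> 
  filter(term != "(Intercept)", estimate != 0) |> 
  pull(term)

fit_explain <- lm(reformulate(c("iv", covs_keep), response = "y"), 
                  data = data_trn_iv)

# Option 2: fit the LASSO directly with glmnet(), leaving iv unpenalized
# (a penalty.factor of 0 for a column means that coefficient is never shrunk)
x_mat <- as.matrix(data_trn_iv[, c("iv", str_c("x_", 1:20))])
fit_pf <- glmnet::glmnet(x_mat, data_trn_iv$y, alpha = 1,
                         penalty.factor = c(0, rep(1, 20)))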

Should really conduct simulation study of both of these options (vs. all and no covariates).

  • I want to
  • Want to do the study with me?

6.11 Ridge, LASSO, and Elastic net models for other Y distributions

These penalties can be added to the cost functions of other generalized linear models to yield regularized/penalized versions of those models as well. For example

L1 penalized (LASSO) logistic regression (w/ labels coded 0,1):

  • \(\frac{1}{n}([\:\sum_{i = 1}^{n} -Y_ilog(\hat{Y_i}) - (1-Y_i)log(1-\hat{Y_i})\:]\:+\:[\:\lambda\sum_{j = 1}^{p} |\beta_j|\:])\)

For L2 penalized (Ridge) logistic regression (w/ labels coded 0,1):

  • \(\frac{1}{n}([\:\sum_{i = 1}^{n} -Y_ilog(\hat{Y_i}) - (1-Y_i)log(1-\hat{Y_i})\:]\:+\:[\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)

glmnet implements:

  • family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian")
  • Full range of \(\alpha\) to mix two types of penalties
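As a sketch, fitting an L1-penalized (LASSO) logistic regression with the tidymodels workflow from this unit only requires swapping linear_reg() for logistic_reg(); data_trn_class and the factor outcome disease are hypothetical placeholders:

fit_lasso_logistic <- 
  logistic_reg(penalty = 0.1, mixture = 1) |>  # mixture = 1 -> pure LASSO penalty
  set_engine("glmnet") |> 
  fit(disease ~ ., data = data_trn_class)      # disease must be a factor

As in the examples above, penalty (\(\lambda\)) would normally be tuned with tune_grid() rather than set to a fixed value.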

6.12 Discussion

6.12.1 Announcements

  • Midterm - application and conceptual parts
  • Can we have a midterm review session?

6.12.2 Feature matrix vs. raw predictors

When to use which, and why could we use raw predictors in the simulated data examples from the lectures?


6.12.3 Cost functions

What is a cost function?

Linear model

  • \(\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}\)

Ridge (L2)

  • \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)

LASSO (L1)

  • \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} |\beta_j|\:])\)

Elastic Net

  • \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda (\alpha\sum_{j = 1}^{p} |\beta_j| + (1-\alpha)\sum_{j = 1}^{p} \beta_j^{2})\:])\)

6.12.4 LASSO, Ridge, and elastic net questions

  • The pros and cons of Lasso and Ridge vs. Elastic Net

  • I’m still a little confused as to why you would ever use Ridge or LASSO separately when you can just selectively use one or the other through elastic net. Wouldn’t it make sense to just always use elastic net and then change the penalty accordingly for when you wanted to use a Ridge or LASSO regression approach?

    • Pure LASSO
    • model variance
    • two self-report measure scales vs. items
  • Some explanation why Lasso is more robust to outliers and ridge is more robust to measurement error would be appreciated.

  • I’m not very clear why LASSO could limit some parameters to be zero but ridge regression cannot. Can we go through this a bit?

  • How do you know what numbers to start at for tuning lambda (like in the code below)? I think John mentioned he has a function to find these, but I’m wondering if there are any rules of thumb.

  • How do we know how “strong” of a penalty we need to be applying to our cost function? Does the reduction in variance increase as we increase the strength of the penalty?

  • Is a decreased number of features always good or bad, or does it depend on the model/recipe?

  • Can you talk more about how to interpret the scale of the parameter estimates? In the lecture you said the following and I’m not quite sure what that means:

  • I might be totally wrong, but I wonder whether we have to care about multi-collinearity or high dimensionality in classification as well, or whether this is limited to regression such that we only regularize regression models?


6.12.5 Stepwise questions

  • Could you go over Forward, Backward, and Best Subset subsetting? I think I understand the algorithm they use, but I do not understand the penalty function they use for picking the “best” model. In the book, it looks like it uses R-squared to pick the best model, but wouldn’t the full model always have the greatest R-squared?

  • Is there a reason why we do not discuss variable selection using subset methods?

  • Is there specific cases when you would pick backwards or forwards selection, or is it up to the researcher?

  • Training vs. validation error in stepwise approaches.

  • Use of AIC, BIC, Cp, and adjusted R2 vs. cross-validation


6.12.6 Explanatory goals

  • Can LASSO be used for variable selection when engaging in cross sectional data analysis to identify which variables in a large set of Xs are important for a particular outcome?

  • In practice, what elements should be considered before selecting IVs and covariates?


6.12.7 PCA

  • https://setosa.io/ev/principal-component-analysis/
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2023. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer Texts in Statistics. New York: Springer-Verlag.