Unit 06 Lab Agenda

Author

Coco Yu

Published

March 10, 2025

Homework specific

I was wondering how to remove the x in both ms_sub_class and pid? I am still working on the homework assignment and will probably figure this out as I complete it. But so far I have not been able to remove it after turning it into a factor.

str_replace_all() operates on character vectors, so it won't change a factor directly. Apply it before converting the column to a factor, or use forcats::fct_relabel() to edit the factor levels in place.
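A minimal sketch, assuming the homework tibble is called data_all and the tidyverse is loaded (column names taken from the question):

data_all <- data_all |> 
  mutate(
    pid = str_replace_all(as.character(pid), "x", ""),  # character column: strip the x directly
    ms_sub_class = fct_relabel(ms_sub_class, \(lvl) str_replace_all(lvl, "x", ""))  # factor: edit the levels in place
  )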

Parallel Processing & Cache

Is there any way that I can set up R to run my code faster?

Yes! We can use parallel processing and/or caching to speed up computation. You can find explanations in lab 5 too (I’m copying and pasting the examples Zihan put up there!).

Parallel processing

Parallel processing allows us to speed up computation by executing multiple tasks simultaneously with multiple CPU cores.

We can use parallel processing when:

  • The fitting process for each configuration is independent of the others

  • The order in which the configurations are fit doesn’t matter

  • When these two criteria are met, the processes can be run in parallel, often with a big time savings

Find more information here 😊!

Set up parallel processing

# create one worker per physical core
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
# register the cluster as the parallel backend for tuning and other foreach-based functions
doParallel::registerDoParallel(cl)
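When you are done, free the workers by stopping the cluster:

parallel::stopCluster(cl)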

Using cache

Can we review how to use the cache in our code?

I’m just curious about rerun = rerun_setting and how this makes the model rerun or if it is skipped. Is it just when the data are changed or a parameter is adjusted that it should be rerun?

When we use caching, we store the results of processing steps as objects on disk. The next time we run through the process, we read those objects in directly instead of recomputing them. This can save a lot of time when working with large-scale data and repetitive procedures.

If you want to set up the rerun option globally, assign it to a variable.

Set up cache

library(xfun, include.only = "cache_rds")
rerun_setting <- FALSE  # global switch: flip to TRUE after changing cached code

Example code

results <- cache_rds(
  expr = {
    Sys.sleep(10)
    results <- c(1, 2, 3)
  }, 
  dir = "cache/",
  file = "cache_demo_xfun",
  rerun = rerun_setting)

results
[1] 1 2 3

When you set rerun_setting = TRUE, cache_rds() overwrites anything stored in your cache file and runs the code from scratch. When it is set to FALSE, it reads in whatever is stored in your cache file and continues with the rest of the script. It is important to overwrite your cache file (i.e., set rerun_setting = TRUE) every time you change something in the cached code; otherwise you may keep reading stale results.

Generating Data

It would be great to cover some techniques for generating our own fake data, e.g., with MASS functions.

set.seed(123)

n <- 1000

# Generate 3 correlated numeric predictors
mu <- c(5, 100, 0.1)  # define mean for each predictor
sigma <- matrix(.7, nrow = 3, ncol = 3) # base correlation
diag(sigma) <- 1

(sigma)
     [,1] [,2] [,3]
[1,]  1.0  0.7  0.7
[2,]  0.7  1.0  0.7
[3,]  0.7  0.7  1.0
data <- MASS::mvrnorm(n, mu, sigma) |> 
  magrittr::set_colnames(paste0("x", 1:3)) |>
  as_tibble()

# Generate 5 independent numeric variables
data <- data |> bind_cols(tibble(
  runif(n, min = 0, max = 1),      # Uniform distribution (0–1)
  rnorm(n, mean = 200, sd = 50),  # Normal distribution (high mean)
  rpois(n, lambda = 5),           # Poisson-distributed data
  rbeta(n, shape1 = 2, shape2 = 5) * 100, # Beta distribution scaled
  sample(1000:5000, n, replace = TRUE)  # Discrete uniform
))

# Generate categorical predictors
data <- data |> bind_cols(tibble(
  sample(c("A", "B", "C", "D"), n, replace = TRUE,     # Nominal
         prob = c(0.6, 0.2, 0.02, 0.18)), # make the data sparse
  sample(c("Low", "Medium", "High"), n, replace = TRUE),  # Ordinal
  sample(letters[1:5], n, replace = TRUE),  # Random categorical levels
  sample(0:1, n, replace = TRUE, prob = c(0.7, 0.3)),  # Binary
  sample(c("Yes", "No"), n, replace = TRUE)  # Binary categorical
))

data <- data |> magrittr::set_colnames(paste0("x", 1:13)) |> 
  glimpse()
Rows: 1,000
Columns: 13
$ x1  <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2  <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3  <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4  <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5  <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6  <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7  <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8  <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9  <chr> "A", "A", "A", "A", "B", "B", "A", "A", "D", "A", "D", "D", "A", "…
$ x10 <chr> "Medium", "Medium", "Low", "Medium", "High", "High", "Low", "High"…
$ x11 <chr> "b", "b", "e", "c", "c", "e", "a", "b", "e", "e", "b", "e", "b", "…
$ x12 <int> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "…
data |> 
  select(all_of(paste0("x", 1:8))) |> 
  cor(use = "pairwise.complete.obs") |>
  corrplot::corrplot.mixed()

# simulate outcome data that follow certain data generation process (linear model)
data <- data |> 
  mutate(y = 5 + 2 * x1 + 4 * x3 + .3 * x5 + 10 * x6 + 3 * x12 + 
           rnorm(n, mean = 0, sd = 5))

Introduce random missingness to the data

set.seed(123)

data$x2[sample(1:1000, size = 5)] <- NA
data$x3[sample(1:1000, size = 15)] <- NA
data$x8[sample(1:1000, size = 80)] <- NA
data$x12[sample(1:1000, size = 2)] <- NA
skimr::skim_without_charts(data)
Data summary
Name data
Number of rows 1000
Number of columns 14
_______________________
Column type frequency:
character 4
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
x9 0 1 1 1 0 4 0
x10 0 1 3 6 0 3 0
x11 0 1 1 1 0 5 0
x13 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
x1 0 1.00 5.03 1.03 1.53 4.32 5.03 5.66 8.40
x2 5 1.00 100.01 0.98 97.35 99.35 100.04 100.65 103.03
x3 15 0.98 0.10 0.96 -2.81 -0.53 0.10 0.75 2.92
x4 0 1.00 0.50 0.28 0.00 0.26 0.49 0.73 1.00
x5 0 1.00 198.94 49.94 55.62 166.16 201.16 232.01 372.30
x6 0 1.00 4.97 2.23 0.00 3.00 5.00 6.00 15.00
x7 0 1.00 28.82 15.96 1.09 16.45 26.47 39.57 80.43
x8 80 0.92 3029.55 1153.66 1000.00 2056.50 3091.00 3969.25 4998.00
x12 2 1.00 0.30 0.46 0.00 0.00 0.00 1.00 1.00
y 0 1.00 125.52 28.51 25.61 106.03 124.47 144.47 242.34

Data Cleaning

Cleaning the data. The application assignment due this Friday required a lot of data cleaning with the variable names and the data values. I was wondering if we could have a refresher on what functions to use to clean the data (e.g., remove quotation marks, unnecessary leading x’s, etc.).

More on cleaning data. Good practice? Concise methods? The data this week is dreadful to look at.

One question I’m curious about comes from the ames dataset for this week. That dataset has ordinal variables such as “electrical” with very low frequency counts at certain levels: only one observation at the “ms” level. When splitting the data into training and test sets, this one observation was assigned to the test set, so the level is missing from the training set and shows up there as a new “NA” level. I wonder if this could become a problem for model performance and if you have any suggestions to resolve such a situation.

Yes, rare levels can cause exactly this problem. Two common remedies are to pool infrequent levels into an “other” category with step_other() (as we do with x9 below) and to map levels unseen during training to a placeholder with step_novel().
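A minimal sketch of that remedy (the outcome, data object, and threshold here are placeholders for the ames setup):

rec <- recipe(sale_price ~ ., data = data_trn) |> 
  step_novel(all_nominal_predictors()) |>                  # levels unseen in training -> "new"
  step_other(all_nominal_predictors(), threshold = 0.05)   # pool rare levels into "other"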

Clean variable names

data <- data |> janitor::clean_names()

Class variables

data <- data |> 
  mutate(
    across(all_of(paste0("x", 9:13)), as.factor)
  ) |> 
  glimpse()
Rows: 1,000
Columns: 14
$ x1  <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2  <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3  <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4  <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5  <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6  <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7  <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8  <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9  <fct> A, A, A, A, B, B, A, A, D, A, D, D, A, A, A, A, A, B, D, D, A, B, …
$ x10 <fct> Medium, Medium, Low, Medium, High, High, Low, High, Low, High, Hig…
$ x11 <fct> b, b, e, c, c, e, a, b, e, e, b, e, b, b, d, b, a, c, d, c, a, e, …
$ x12 <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <fct> Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, No, Yes, No, Yes, Yes…
$ y   <dbl> 155.70534, 130.49806, 78.58449, 123.07257, 166.10833, 93.14634, 12…

Clean variable responses

# tidy_responses() is a helper from the course function scripts
data <- data |> mutate(across(where(is.factor), tidy_responses)) |> 
  glimpse()
Rows: 1,000
Columns: 14
$ x1  <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2  <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3  <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4  <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5  <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6  <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7  <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8  <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9  <fct> a, a, a, a, b, b, a, a, d, a, d, d, a, a, a, a, a, b, d, d, a, b, …
$ x10 <fct> medium, medium, low, medium, high, high, low, high, low, high, hig…
$ x11 <fct> b, b, e, c, c, e, a, b, e, e, b, e, b, b, d, b, a, c, d, c, a, e, …
$ x12 <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <fct> yes, no, no, yes, yes, yes, no, no, yes, no, no, yes, no, yes, yes…
$ y   <dbl> 155.70534, 130.49806, 78.58449, 123.07257, 166.10833, 93.14634, 12…

Relevel factors if deemed necessary

I’m releveling x10 here because I want to transform it into numeric values later.

data <- data |> mutate(x10 = fct_relevel(x10, c("low", "medium", "high"))) |> 
  glimpse()
Rows: 1,000
Columns: 14
$ x1  <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2  <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3  <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4  <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5  <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6  <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7  <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8  <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9  <fct> a, a, a, a, b, b, a, a, d, a, d, d, a, a, a, a, a, b, d, d, a, b, …
$ x10 <fct> medium, medium, low, medium, high, high, low, high, low, high, hig…
$ x11 <fct> b, b, e, c, c, e, a, b, e, e, b, e, b, b, d, b, a, c, d, c, a, e, …
$ x12 <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <fct> yes, no, no, yes, yes, yes, no, no, yes, no, no, yes, no, yes, yes…
$ y   <dbl> 155.70534, 130.49806, 78.58449, 123.07257, 166.10833, 93.14634, 12…

Train/validation/test split

Can you do grouped k-fold CV with multiple groups?

group_vfold_cv() only allows one grouping variable at a time. However, we can work around this for multiple groups: paste the grouping variables together into a single combined variable and group on that.

For example, say we want to group observations based on x11 and x13:

group_cv_folds <- data |> 
  mutate(combined_group = paste0(x11, "_", x13)) |> 
  group_vfold_cv(group = combined_group, v = 5)

print(group_cv_folds)
# Group 5-fold cross-validation 
# A tibble: 5 × 2
  splits            id       
  <list>            <chr>    
1 <split [812/188]> Resample1
2 <split [779/221]> Resample2
3 <split [800/200]> Resample3
4 <split [819/181]> Resample4
5 <split [790/210]> Resample5

Feature Engineering

Comparison of standardization methods

Regarding standardizing variables before regularization, I wonder if the same applies to categorical variables. Should we also standardize categorical variables? If so, wouldn’t that make interpretability difficult?

If we dummy code categorical variables, we don’t need to standardize them (the resulting features are already on a 0/1 scale). However, if we use other encoding methods (e.g., ordinal encoding), we should standardize them.
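A sketch contrasting the two paths with this lab’s simulated data (step_ordinalscore() requires an ordered factor, so we convert x10 first; the recipe names are arbitrary):

# dummy coding: 0/1 indicator columns, no standardization needed
rec_dummy <- recipe(y ~ ., data = data) |> 
  step_dummy(x13)

# ordinal encoding: integer scores on an arbitrary scale, so standardize afterward
rec_ordinal <- recipe(y ~ ., data = data) |> 
  step_mutate(x10 = factor(x10, ordered = TRUE)) |> 
  step_ordinalscore(x10) |> 
  step_normalize(x10)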

What is the difference between step_normalize() and step_scale()? I feel like we have used both and they serve a similar purpose. Is there any difference or any circumstance where one is better than the other?

What’s the difference between step_scale and step_range?

step_normalize() creates a specification of a recipe step that will normalize numeric data to have a standard deviation of one and a mean of zero. step_scale() creates a specification of a recipe step that will normalize numeric data to have a standard deviation of one. step_range() creates a specification of a recipe step that will normalize numeric data to be within a pre-defined range of values.

step_normalize() and step_scale() are similar in that they both standardize the data, but step_normalize() also centers the data by subtracting the mean. step_range() is different because it scales the data to a specific range, rather than standardizing it.

Step | Formula | Description | When to use
step_normalize() | \(x' = \frac{x - \mu}{\sigma}\) | centers data by the mean; scales data by the SD | models sensitive to magnitude (e.g., linear regression, PCA, ridge/lasso regression)
step_scale() | \(x' = \frac{x}{\sigma}\) | scales data by the SD (without centering) | when you want to adjust scale without shifting the center
step_range() | \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\) | scales data to a specific range | models that require a specific range (e.g., neural networks, KNN, SVM)

💻 Sample Codes

# step_yeojohnson()
data_yeojohnson <- recipe(y ~ ., data = data) |> 
  step_YeoJohnson(x7) |> 
  prep(data) |> 
  bake(NULL)

# step_normalize()
data_normalize <- recipe(y ~ ., data = data) |> 
  step_normalize(x7) |> 
  prep(data) |> 
  bake(NULL)

# step_scale()
data_scale <- recipe(y ~ ., data = data) |> 
  step_scale(x7) |> 
  prep(data) |> 
  bake(NULL) 

# step_range()
data_range <- recipe(y ~ ., data = data) |> 
  step_range(x7) |> 
  prep(data) |> 
  bake(NULL) 

Visualization

When applying transformations and scaling in the recipe function, is there a method to measure or visualize how well they are doing? Or should this be done when checking the univariate stats?

We can visualize the distribution of the data before and after applying the transformations to see how they affect the data.

cowplot::plot_grid(
  data |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "Original Data"),
  data_yeojohnson |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_yeojohnson()"),
  data_normalize |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_normalize()"),
  data_scale |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_scale()"),
  data_range |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_range()")
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (message repeated once per panel)

bind_rows(
  data |> 
    summarize(x7_mean = mean(x7, na.rm = TRUE), 
              x7_sd = sd(x7, na.rm = TRUE)),
  data_yeojohnson |>
    summarize(x7_mean = mean(x7, na.rm = TRUE), 
              x7_sd = sd(x7, na.rm = TRUE)),
  data_normalize |>
    summarize(x7_mean = mean(x7, na.rm = TRUE), 
              x7_sd = sd(x7, na.rm = TRUE)),
  data_scale |>
    summarize(x7_mean = mean(x7, na.rm = TRUE), 
              x7_sd = sd(x7, na.rm = TRUE)),
  data_range |>
    summarize(x7_mean = mean(x7, na.rm = TRUE), 
              x7_sd = sd(x7, na.rm = TRUE))
) |> 
  mutate(method = c("Original Data", "step_YeoJohnson", "step_normalize()", 
                    "step_scale()", "step_range()")) |> 
  select(method, everything())
# A tibble: 5 × 3
  method            x7_mean  x7_sd
  <chr>               <dbl>  <dbl>
1 Original Data    2.88e+ 1 16.0  
2 step_YeoJohnson  8.64e+ 0  3.11 
3 step_normalize() 1.07e-15  1    
4 step_scale()     1.81e+ 0  1    
5 step_range()     3.49e- 1  0.201

Machine Learning Models

Maybe having a demo of going over the steps of PCR and PLS regression.

When we want more specifics about any model, we can go to the tidymodels parsnip documentation and search for the model we are interested in. For example, we can find information on PLS regression there.

There is no PCR model in tidymodels, but we can use PCA when building recipes to reduce the dimensionality of the data and then use a linear regression model to predict the outcome.
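A minimal sketch of that PCR-style workflow on this lab’s simulated data (rec_pcr and fit_pcr are arbitrary names, and num_comp = 5 is an arbitrary choice; in practice you would tune it):

rec_pcr <- recipe(y ~ ., data = data) |> 
  step_impute_mean(all_numeric_predictors()) |>   # handle the missingness we added above
  step_impute_mode(all_nominal_predictors()) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_numeric_predictors()) |>     # PCA needs standardized inputs
  step_pca(all_numeric_predictors(), num_comp = 5)

fit_pcr <- linear_reg() |> 
  set_engine("lm") |> 
  fit(y ~ ., data = rec_pcr |> prep(data) |> bake(NULL))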

Hyperparameter tuning

Selecting best range

Selecting values for lambda and alpha (i.e. tuning for the hyperspace grid). How can one try to choose the best range of values?

I would hope to have more explanation on tuning the penalty hyperparameter

Is there a way of getting a good alpha or lambda value that does not include guessing?

In terms of alpha (which ranges from 0 to 1), we typically do a grid search of 11 values from 0 to 1. For lambda, we can do a grid search over a wide range (on the log scale) and use plotting to check that the best value falls inside the range rather than at its edge.

John also has an example function called get_lambdas that generates a grid of best-guess values for lambda.

recipe <- recipe(y ~ ., data = data) |> 
  step_rm(x11, x13) |> 
  step_mutate(x10 = as.numeric(x10)) |>
  step_impute_mean(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |> 
  step_pca(x1:x3, num_comp = tune(), options = list(center = TRUE, scale. = TRUE)) |> 
  step_normalize(all_numeric_predictors()) |> 
  step_other(x9) |>
  step_dummy(all_nominal_predictors()) |> 
  step_nzv(all_predictors())

grid <- expand_grid(
  num_comp = 1:2,
  penalty = exp(seq(-30, 3, length.out = 500)),
  mixture = seq(0, 1, length = 10)
)

fit <- linear_reg(penalty = tune(), mixture = tune()) |> 
  set_engine("glmnet") |> 
  tune_grid(
    preprocessor = recipe,
    resamples = group_cv_folds,
    grid = grid,
    metrics = metric_set(rmse)
  )

plot_hyperparameters(fit, hp1 = "penalty", hp2 = "mixture", metric = "rmse")

More on the expand_grid stuff and looking at multiple dimensions there.

When using linear_reg() with set_engine("glmnet") and tune_grid(), how does tune_grid() choose the best model?

tune_grid() doesn’t choose a model itself: it fits every hyperparameter combination in the grid on every resample and records the performance metrics. We then inspect the results with show_best() or pick a configuration with select_best().

grid |> head(500) |> print_kbl()
num_comp penalty mixture
1 0 0.00
1 0 0.11
1 0 0.22
1 0 0.33
1 0 0.44
1 0 0.56
1 0 0.67
1 0 0.78
1 0 0.89
1 0 1.00
… (first 10 of the 500 printed rows shown; all 500 have num_comp = 1 and penalties so small, e.g. exp(-30), that they round to 0 at two decimals)
show_best(fit, metric = "rmse", n = 10) |> print_kbl()
penalty mixture num_comp .metric .estimator mean n std_err .config
0.09 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4919
0.10 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4920
0.09 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4918
0.08 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4917
0.11 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4921
0.08 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4916
0.07 1.00 2 rmse standard 5.07 5 0.07 Preprocessor2_Model4915
0.12 1.00 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4922
0.07 1.00 2 rmse standard 5.07 5 0.07 Preprocessor2_Model4914
0.09 0.89 2 rmse standard 5.07 5 0.06 Preprocessor2_Model4419
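A sketch of what comes next (best and fit_best are arbitrary names; finalize_recipe() fills the tuned num_comp into the recipe):

best <- select_best(fit, metric = "rmse")

fit_best <- linear_reg(penalty = best$penalty, mixture = best$mixture) |> 
  set_engine("glmnet") |> 
  fit(y ~ ., data = recipe |> finalize_recipe(best) |> prep(data) |> bake(NULL))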

Why would we need to tune both the mixture hyperparameter and the penalty hyperparameter in elastic net regression but only penalty hyperparameter in LASSO model?

In LASSO regression, the mixture is fixed at 1, so the penalty hyperparameter is the only thing to tune: it controls the amount of regularization applied to the model. Elastic net adds the mixture hyperparameter, which controls the balance between the LASSO (L1) and ridge (L2) penalties, while penalty still controls the overall amount of regularization. A mixture of 1 in elastic net regression is equivalent to LASSO, while a mixture of 0 is equivalent to ridge regression.
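Concretely, the regularization term that glmnet adds to the cost (writing mixture as \(\alpha\) and penalty as \(\lambda\)) is:

\(\lambda \left[ \alpha \sum_j |\beta_j| + \frac{1 - \alpha}{2} \sum_j \beta_j^2 \right]\)

so \(\alpha = 1\) leaves only the L1 (LASSO) term and \(\alpha = 0\) leaves only the L2 (ridge) term.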

Regularization

The web book mentions using penalty.factor to prevent the IV from being penalized, but are there any best practices for tuning this? Could we allow some shrinkage of the IV’s effect while still maintaining an unbiased estimate?

How can we use LASSO for feature selection in a dataset where the independent variable (IV) is dichotomous (e.g., an experimental manipulation), and we want to select the best covariates to predict a quantitative outcome (y)? I particularly want to focus on bootstrapping to find the best lambda value and following up with a linear model.

We do not tune the penalty factor; instead, we can set it to 0 for the IVs (manipulated variables) we do not want penalized. Those IVs then stay in the model with unpenalized coefficients while all other features are still regularized.
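A minimal sketch using glmnet directly, treating this lab’s dichotomous x12 as the unpenalized IV (the same penalty.factor vector can also be passed to the glmnet engine through set_engine() in tidymodels):

d <- tidyr::drop_na(data)                   # glmnet does not accept missing values
x <- model.matrix(y ~ ., data = d)[, -1]    # numeric design matrix, intercept dropped
pf <- rep(1, ncol(x))                       # default penalty factor: penalize everything
pf[grepl("^x12", colnames(x))] <- 0         # exempt the IV's dummy column from shrinkage

fit_lasso <- glmnet::glmnet(x, d$y, alpha = 1, penalty.factor = pf)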

Is it correct that we only need to consider the cost function but not the loss function, and that the glmnet package only has arguments for the cost function?

The loss function measures how well our predictions match the observed values (e.g., mean squared error). The cost function is what the algorithm actually minimizes to find the best model parameters: the loss plus the regularization term. In glmnet, the loss is determined by the model family (e.g., squared error for a gaussian outcome), so the arguments we set (alpha and lambda) only configure the regularization part of the cost function.
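Putting the pieces together for a continuous outcome, the full cost that glmnet minimizes is the loss plus the regularization term from above:

\(\text{cost} = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left[ \alpha \sum_j |\beta_j| + \frac{1 - \alpha}{2} \sum_j \beta_j^2 \right]\)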

Interaction terms in high-dimensional regularization

We can add interaction terms to the model with the step_interact() function in the recipe. Note that including many interactions can produce a very high-dimensional feature set, which raises the risk of overfitting, so we should be thoughtful about which interactions to include. Regularization can help select the most important interaction terms and prevent overfitting. However, in LASSO regression, interpretation gets harder if an interaction term is retained but its main effects are not; we can use penalty.factor to prevent the main effects from being penalized.
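A minimal sketch with this lab’s simulated data (step_interact() expects numeric inputs, hence the step_dummy() first; the x1 × x13 interaction is an arbitrary example and rec_int an arbitrary name):

rec_int <- recipe(y ~ ., data = data) |> 
  step_dummy(all_nominal_predictors()) |>   # interactions are built from dummy-coded columns
  step_interact(~ x1:starts_with("x13_"))   # creates x1 x x13_yes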

The decision to include interaction terms in the model should be based on domain knowledge and theory. If we believe that the interaction terms are important for predicting the outcome, we should include them in the model. If we are unsure, we can use regularization to select the most important interaction terms.