cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)
Unit 06 Lab Agenda
Homework specific
I was wondering how to remove the x in both ms_sub_class and pid? I am still working on the homework assignment and will probably figure this out as I complete it. But so far I have not been able to remove it after turning it into a factor.
Convert the factor back to character first, then strip the leading x (the pattern "^x" matches only an x at the start of the string):
str_replace_all(as.character(text), "^x", "")
Parallel Processing & Cache
Is there any way that I can set my R to run things faster in coding?
Yes! We can use parallel processing and/or caching to speed up computation time. You can find explanations in lab 5 too (I’m copying and pasting the examples Zihan put up there!).
Parallel processing
Parallel processing allows us to speed up computation by executing multiple tasks simultaneously with multiple CPU cores.
We can use parallel processing when:
The fitting process for each configuration is independent of the others
The order in which the configurations are fit doesn't matter either
When these two criteria are met, the processes can be run in parallel, often with big time savings.
You can find more information here 😊!
Set up parallel processing
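A minimal sketch of the setup, plus the matching teardown (easy to forget), assuming the parallel and doParallel packages are installed:

```r
# create a PSOCK cluster with one worker per physical core
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))

# register it so that downstream tuning/resampling code can use it
doParallel::registerDoParallel(cl)

# ... run your tuning / resampling code here ...

# stop the cluster when you are done to free the worker processes
parallel::stopCluster(cl)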
Using cache
Can we review how to use the cache in our code?
I'm just curious about the rerun = rerun_setting and how this makes the model rerun or skip. Is it that if the data is changed or a parameter is adjusted, it should be rerun?
When we cache, we store the results of processing steps as objects on disk. The next time we run through the process, we can read those objects in directly. This can save a lot of time when working with large-scale data and repetitive procedures.
If you want to set the rerun option globally, assign it to a variable.
Set up cache
library(xfun, include.only = "cache_rds")

rerun_setting <- FALSE
Example code
results <- cache_rds(
  expr = {
    Sys.sleep(10)
    results <- c(1, 2, 3)
    results
  },
  dir = "cache/",
  file = "cache_demo_xfun",
  rerun = rerun_setting)

results
[1] 1 2 3
When you set rerun_setting = TRUE, you overwrite anything stored in your cache file and run the code from scratch. When it is set to FALSE, you read in whatever is stored in your cache file and skip the cached computation. It is important to note that you should overwrite your cache file (i.e., rerun_setting = TRUE) every time you change something in the cached code.
Generating Data
It would be great to cover some techniques for generating our own fake data (e.g., with MASS functions).
set.seed(123)

n <- 1000

# Generate 3 correlated numeric predictors
mu <- c(5, 100, 0.1) # define mean for each predictor
sigma <- matrix(.7, nrow = 3, ncol = 3) # base correlation
diag(sigma) <- 1
(sigma)
[,1] [,2] [,3]
[1,] 1.0 0.7 0.7
[2,] 0.7 1.0 0.7
[3,] 0.7 0.7 1.0
data <- MASS::mvrnorm(n, mu, sigma) |>
  magrittr::set_colnames(paste0("x", 1:3)) |>
  as_tibble()
# Generate 5 independent numeric variables
data <- data |> bind_cols(
  independent_numeric <- tibble(
    runif(n, min = 0, max = 1), # Uniform distribution (0–1)
    rnorm(n, mean = 200, sd = 50), # Normal distribution (high mean)
    rpois(n, lambda = 5), # Poisson-distributed data
    rbeta(n, shape1 = 2, shape2 = 5) * 100, # Beta distribution scaled
    sample(1000:5000, n, replace = TRUE) # Discrete uniform
  ))
# Generate categorical predictors
data <- data |> bind_cols(tibble(
  sample(c("A", "B", "C", "D"), n, replace = TRUE, # Nominal
         prob = c(0.6, 0.2, 0.02, 0.18)), # make the data sparse
  sample(c("Low", "Medium", "High"), n, replace = TRUE), # Ordinal
  sample(letters[1:5], n, replace = TRUE), # Random categorical levels
  sample(0:1, n, replace = TRUE, prob = c(0.7, 0.3)), # Binary
  sample(c("Yes", "No"), n, replace = TRUE) # Binary categorical
))
data <- data |> magrittr::set_colnames(paste0("x", 1:13)) |>
  glimpse()
Rows: 1,000
Columns: 13
$ x1 <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2 <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3 <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4 <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5 <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6 <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7 <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8 <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9 <chr> "A", "A", "A", "A", "B", "B", "A", "A", "D", "A", "D", "D", "A", "…
$ x10 <chr> "Medium", "Medium", "Low", "Medium", "High", "High", "Low", "High"…
$ x11 <chr> "b", "b", "e", "c", "c", "e", "a", "b", "e", "e", "b", "e", "b", "…
$ x12 <int> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "…
data |>
  select(paste0("x", 1:8)) |>
  cor(use = "pairwise.complete.obs") |>
  corrplot::corrplot.mixed()
# simulate outcome data that follows a known data-generating process (linear model)
data <- data |>
  mutate(y = 5 + 2 * x1 + 4 * x3 + .3 * x5 + 10 * x6 + 3 * x12 +
           rnorm(n, mean = 0, sd = 5))
Introduce random missingness to the data
set.seed(123)

data$x2[sample(1:1000, size = 5)] <- NA
data$x3[sample(1:1000, size = 15)] <- NA
data$x8[sample(1:1000, size = 80)] <- NA
data$x12[sample(1:1000, size = 2)] <- NA

skimr::skim_without_charts(data)
Name | data |
Number of rows | 1000 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
x9 | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
x10 | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
x11 | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
x13 | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
x1 | 0 | 1.00 | 5.03 | 1.03 | 1.53 | 4.32 | 5.03 | 5.66 | 8.40 |
x2 | 5 | 1.00 | 100.01 | 0.98 | 97.35 | 99.35 | 100.04 | 100.65 | 103.03 |
x3 | 15 | 0.98 | 0.10 | 0.96 | -2.81 | -0.53 | 0.10 | 0.75 | 2.92 |
x4 | 0 | 1.00 | 0.50 | 0.28 | 0.00 | 0.26 | 0.49 | 0.73 | 1.00 |
x5 | 0 | 1.00 | 198.94 | 49.94 | 55.62 | 166.16 | 201.16 | 232.01 | 372.30 |
x6 | 0 | 1.00 | 4.97 | 2.23 | 0.00 | 3.00 | 5.00 | 6.00 | 15.00 |
x7 | 0 | 1.00 | 28.82 | 15.96 | 1.09 | 16.45 | 26.47 | 39.57 | 80.43 |
x8 | 80 | 0.92 | 3029.55 | 1153.66 | 1000.00 | 2056.50 | 3091.00 | 3969.25 | 4998.00 |
x12 | 2 | 1.00 | 0.30 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
y | 0 | 1.00 | 125.52 | 28.51 | 25.61 | 106.03 | 124.47 | 144.47 | 242.34 |
Data Cleaning
Cleaning the data. The application assignment due this Friday required a lot of data cleaning with the variable names and the data values. I was wondering if we could have a refresher about what functions to use to clean the data (e.g., removing quotation marks, unnecessary leading x's, etc.)
More on cleaning data. Good practice? Concise methods? The data this week is dreadful to look at.
One question I'm curious about comes from the ames dataset for this week. In this dataset, there are ordinal variables such as "electrical" that have very low frequency counts at certain levels: only one value for the "ms" level. When splitting the data into training and test sets, this one value was assigned to the test set, resulting in a missing level in this column, which leads to the creation of a new level called "NA" in this column within the training set. I wonder if this could become a problem for model performance and if you have any suggestions to resolve such a situation.
Clean variable name
data <- data |> janitor::clean_names()
Class variable
data <- data |>
  mutate(
    across(paste0("x", 9:13), as.factor)
  ) |>
  glimpse()
Rows: 1,000
Columns: 14
$ x1 <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2 <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3 <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4 <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5 <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6 <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7 <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8 <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9 <fct> A, A, A, A, B, B, A, A, D, A, D, D, A, A, A, A, A, B, D, D, A, B, …
$ x10 <fct> Medium, Medium, Low, Medium, High, High, Low, High, Low, High, Hig…
$ x11 <fct> b, b, e, c, c, e, a, b, e, e, b, e, b, b, d, b, a, c, d, c, a, e, …
$ x12 <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <fct> Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, No, Yes, No, Yes, Yes…
$ y <dbl> 155.70534, 130.49806, 78.58449, 123.07257, 166.10833, 93.14634, 12…
Clean variable response
data <- data |> mutate(across(where(is.factor), tidy_responses)) |>
  glimpse()
Rows: 1,000
Columns: 14
$ x1 <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2 <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3 <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4 <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5 <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6 <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7 <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8 <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9 <fct> a, a, a, a, b, b, a, a, d, a, d, d, a, a, a, a, a, b, d, d, a, b, …
$ x10 <fct> medium, medium, low, medium, high, high, low, high, low, high, hig…
$ x11 <fct> b, b, e, c, c, e, a, b, e, e, b, e, b, b, d, b, a, c, d, c, a, e, …
$ x12 <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <fct> yes, no, no, yes, yes, yes, no, no, yes, no, no, yes, no, yes, yes…
$ y <dbl> 155.70534, 130.49806, 78.58449, 123.07257, 166.10833, 93.14634, 12…
Relevel factors if deemed necessary
I’m releveling x10 here because I want to transform it into numeric values later
data <- data |> mutate(x10 = fct_relevel(x10, c("low", "medium", "high"))) |>
  glimpse()
Rows: 1,000
Columns: 14
$ x1 <dbl> 4.053361, 4.329041, 6.386110, 5.003954, 3.975538, 6.999359, 5.5239…
$ x2 <dbl> 99.91951, 99.93490, 101.60793, 99.62042, 100.61825, 101.53961, 101…
$ x3 <dbl> -0.37678056, 0.21842966, 1.28841502, 0.66482472, 0.85313137, 1.163…
$ x4 <dbl> 0.44026102, 0.39739215, 0.37154765, 0.52880857, 0.07378541, 0.7168…
$ x5 <dbl> 273.91672, 129.66066, 105.80139, 186.13169, 221.52139, 193.56067, …
$ x6 <int> 7, 8, 3, 5, 8, 1, 3, 3, 7, 2, 8, 4, 8, 5, 5, 6, 6, 5, 1, 2, 6, 6, …
$ x7 <dbl> 19.610964, 56.562639, 10.725169, 50.625442, 35.825848, 41.604957, …
$ x8 <int> 2187, 1785, 2729, 3845, 1920, 3983, 4871, 4785, 4552, 1204, 1074, …
$ x9 <fct> a, a, a, a, b, b, a, a, d, a, d, d, a, a, a, a, a, b, d, d, a, b, …
$ x10 <fct> medium, medium, low, medium, high, high, low, high, low, high, hig…
$ x11 <fct> b, b, e, c, c, e, a, b, e, e, b, e, b, b, d, b, a, c, d, c, a, e, …
$ x12 <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
$ x13 <fct> yes, no, no, yes, yes, yes, no, no, yes, no, no, yes, no, yes, yes…
$ y <dbl> 155.70534, 130.49806, 78.58449, 123.07257, 166.10833, 93.14634, 12…
Train/validation/test split
Can you do grouped kfold CV with multiple groups?
group_vfold_cv() only allows one grouping variable at a time. However, we can use a workaround to do grouped k-fold CV with multiple groups.
For example, say we want to group observations based on x11 and x13:
group_cv_folds <- data |>
  mutate(combined_group = paste0(x11, "_", x13)) |>
  group_vfold_cv(group = combined_group, v = 5)

print(group_cv_folds)
# Group 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [812/188]> Resample1
2 <split [779/221]> Resample2
3 <split [800/200]> Resample3
4 <split [819/181]> Resample4
5 <split [790/210]> Resample5
Feature Engineering
Comparison of standardization methods
Regarding standardizing variables before regularization, I wonder if the same applies to categorical variables. Should we also standardize categorical variables? If so, wouldn’t that make interpretability difficult?
If we dummy-code categorical variables, we don't need to standardize them. However, if we use other encoding methods (e.g., ordinal encoding), we should standardize them.
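As a rough sketch of what that looks like in a recipe (using x10 as the ordinal case and the other factors as the nominal case from the simulated data above; this assumes x10's levels are already in the right order so that as.numeric gives a meaningful score):

```r
rec <- recipe(y ~ ., data = data) |>
  step_mutate(x10 = as.numeric(x10)) |>        # ordinal encoding of x10
  step_normalize(all_numeric_predictors()) |>  # standardizes x10 along with the other numerics
  step_dummy(all_nominal_predictors())         # dummy codes stay as 0/1, no standardization needed
```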
What is the difference between step_normalize() and step_scale()? I feel like we have used both and they serve a similar purpose. Is there any difference, or any circumstance where one is better than the other?
What’s the difference between step_scale and step_range?
step_normalize() creates a specification of a recipe step that will normalize numeric data to have a standard deviation of one and a mean of zero. step_scale() creates a specification of a recipe step that will normalize numeric data to have a standard deviation of one. step_range() creates a specification of a recipe step that will normalize numeric data to be within a pre-defined range of values.
step_normalize() and step_scale() are similar in that they both standardize the data, but step_normalize() also centers the data by subtracting the mean. step_range() is different because it scales the data to a specific range, rather than standardizing it.
Step | Formula | Description | When to use |
---|---|---|---|
step_normalize() | \(x' = \frac{x - \mu}{\sigma}\) | centers data by the mean; scales data by the SD | models sensitive to magnitude (e.g., linear regression, PCA, ridge/lasso regression) |
step_scale() | \(x' = \frac{x}{\sigma}\) | scales data by the SD (without centering) | when you want to adjust scale without shifting the center |
step_range() | \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\) | scales data to a specific range | models that require a specific range (e.g., neural networks, knn, svm) |
💻 Sample Codes
# step_YeoJohnson()
data_yeojohnson <- recipe(y ~ ., data = data) |>
  step_YeoJohnson(x7) |>
  prep(data) |>
  bake(NULL)

# step_normalize()
data_normalize <- recipe(y ~ ., data = data) |>
  step_normalize(x7) |>
  prep(data) |>
  bake(NULL)

# step_scale()
data_scale <- recipe(y ~ ., data = data) |>
  step_scale(x7) |>
  prep(data) |>
  bake(NULL)

# step_range()
data_range <- recipe(y ~ ., data = data) |>
  step_range(x7) |>
  prep(data) |>
  bake(NULL)
Visualization
When applying transformation and scaling in the recipe function, is there a method to measure or visualize how well they are doing? Or should this be done when checking the univariate stats?
We can visualize the distribution of the data before and after applying the transformations to see how they affect the data.
cowplot::plot_grid(
  data |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "Original Data"),
  data_yeojohnson |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_yeojohnson()"),
  data_normalize |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_normalize()"),
  data_scale |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_scale()"),
  data_range |> ggplot(aes(x = x7)) + geom_histogram() + labs(title = "step_range()")
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
bind_rows(
  data |>
    summarize(x7_mean = mean(x7, na.rm = TRUE),
              x7_sd = sd(x7, na.rm = TRUE)),
  data_yeojohnson |>
    summarize(x7_mean = mean(x7, na.rm = TRUE),
              x7_sd = sd(x7, na.rm = TRUE)),
  data_normalize |>
    summarize(x7_mean = mean(x7, na.rm = TRUE),
              x7_sd = sd(x7, na.rm = TRUE)),
  data_scale |>
    summarize(x7_mean = mean(x7, na.rm = TRUE),
              x7_sd = sd(x7, na.rm = TRUE)),
  data_range |>
    summarize(x7_mean = mean(x7, na.rm = TRUE),
              x7_sd = sd(x7, na.rm = TRUE))
) |>
  mutate(method = c("Original Data", "step_YeoJohnson", "step_normalize()",
                    "step_scale()", "step_range()")) |>
  select(method, everything())
# A tibble: 5 × 3
method x7_mean x7_sd
<chr> <dbl> <dbl>
1 Original Data 2.88e+ 1 16.0
2 step_YeoJohnson 8.64e+ 0 3.11
3 step_normalize() 1.07e-15 1
4 step_scale() 1.81e+ 0 1
5 step_range() 3.49e- 1 0.201
Machine Learning Models
Maybe have a demo going over the steps of PCR and PLS regression.
When we want more specifics about any model, we can go to the tidymodels parsnip documentation and search for the model we are interested in. For example, we can find information on PLS regression here.
There is no PCR model in tidymodels, but we can use PCA when building recipes to reduce the dimensionality of the data and then use a linear regression model to predict the outcome.
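A sketch of that two-step "PCR" approach (the number of components is an arbitrary illustration here; in practice you would tune num_comp as shown later in this page):

```r
# PCA in the recipe for dimension reduction, then ordinary linear regression
rec_pcr <- recipe(y ~ ., data = data) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

fit_pcr <- linear_reg() |>
  set_engine("lm") |>
  fit(y ~ ., data = rec_pcr |> prep() |> bake(NULL))
```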
Hyperparameter tuning
Selecting best range
Selecting values for lambda and alpha (i.e. tuning for the hyperspace grid). How can one try to choose the best range of values?
I would hope to have more explanation on tuning the penalty hyperparameter
Is there a way of getting a good alpha or lambda value that does not include guessing?
In terms of alpha (which ranges from 0 to 1), we typically do a grid search of 11 values from 0 to 1. For lambda, we can do a grid search over a wide range, and use plotting to check that the best value falls inside, not at the edge of, the range we searched.
John also has an example function called get_lambdas that generates a grid of best-guess values for lambda.
recipe <- recipe(y ~ ., data = data) |>
  step_rm(x11, x13) |>
  step_mutate(x10 = as.numeric(x10)) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_pca(x1:x3, num_comp = tune(), options = list(center = TRUE, scale. = TRUE)) |>
  step_normalize(all_numeric_predictors()) |>
  step_other(x9) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors())

grid <- expand_grid(
  num_comp = 1:2,
  penalty = exp(seq(-30, 3, length.out = 500)),
  mixture = seq(0, 1, length = 10)
)

fit <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  tune_grid(
    preprocessor = recipe,
    resamples = group_cv_folds,
    grid = grid,
    metrics = metric_set(rmse)
  )
plot_hyperparameters(fit, hp1 = "penalty", hp2 = "mixture", metric = "rmse")
More on the expand_grid stuff and looking at multiple dimensions there.
When using linear_reg() with set_engine("glmnet") and tune_grid(), how does tune_grid() choose the best model?
grid |> head(500) |> print_kbl()
num_comp | penalty | mixture |
---|---|---|
1 | 0 | 0.00 |
1 | 0 | 0.11 |
1 | 0 | 0.22 |
1 | 0 | 0.33 |
1 | 0 | 0.44 |
1 | 0 | 0.56 |
1 | 0 | 0.67 |
1 | 0 | 0.78 |
1 | 0 | 0.89 |
1 | 0 | 1.00 |
… | … | … |
(rows 11–500 omitted; at this rounding, penalty displays as 0 throughout)
show_best(fit, metric = "rmse", n = 10) |> print_kbl()
penalty | mixture | num_comp | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
0.09 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4919 |
0.10 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4920 |
0.09 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4918 |
0.08 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4917 |
0.11 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4921 |
0.08 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4916 |
0.07 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.07 | Preprocessor2_Model4915 |
0.12 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4922 |
0.07 | 1.00 | 2 | rmse | standard | 5.07 | 5 | 0.07 | Preprocessor2_Model4914 |
0.09 | 0.89 | 2 | rmse | standard | 5.07 | 5 | 0.06 | Preprocessor2_Model4419 |
Why would we need to tune both the mixture hyperparameter and the penalty hyperparameter in elastic net regression but only penalty hyperparameter in LASSO model?
In LASSO regression, the penalty hyperparameter controls the amount of regularization applied to the model. In elastic net regression, both the penalty and mixture hyperparameters control the amount of regularization applied to the model. The mixture hyperparameter controls the balance between LASSO and ridge regression, while the penalty hyperparameter controls the overall amount of regularization. A mixture of 1 in elastic net regression is equivalent to LASSO regression, while a mixture of 0 is equivalent to ridge regression.
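To make that concrete, the only difference in the parsnip specs is whether mixture is fixed or tuned (a sketch, mirroring the engine used elsewhere on this page):

```r
# LASSO: mixture fixed at 1, tune only the penalty (lambda)
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

# Elastic net: tune both penalty (lambda) and mixture (alpha)
enet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")
```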
Regularization
The web book mentions using penalty.factor to prevent the IV from being penalized, but are there any best practices for tuning this? Could we allow some shrinkage of the IV’s effect while still maintaining an unbiased estimate?
How can we use LASSO for feature selection in a dataset where the independent variable (IV) is dichotomous (e.g., an experimental manipulation), and we want to select the best covariates to predict a quantitative outcome (y)? I particularly want to focus on bootstrapping to find the best lambda value and following up with a linear model.
We do not tune the penalty factor, but we can set it to 0 for the IVs (manipulated variables) we do not want penalized. The IV is then still included in the model but is never shrunk, which is useful when we want to retain an unbiased estimate of its effect.
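A minimal sketch with glmnet directly (assuming x is a numeric predictor matrix whose first column is the IV, and y is the outcome vector; both are hypothetical names here):

```r
# penalty.factor multiplies lambda separately for each coefficient;
# a value of 0 means that coefficient is never penalized
fit_glmnet <- glmnet::glmnet(
  x, y,
  penalty.factor = c(0, rep(1, ncol(x) - 1))  # exempt the first column (the IV)
)
```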
Is it correct that we only need to consider the cost function, not the loss function, and that the glmnet package only has arguments for the cost function?
The loss function measures how well our predictions match the actual values (e.g., MSE, RMSE). The cost function is what we actually minimize to find the best model parameters; in regularized regression it is the loss function plus the regularization term. In glmnet, we only need to specify the pieces of the cost function (e.g., the penalty), because minimizing the loss is already part of the optimization process.
Interaction terms in High-Dimension regularization
We can add interaction terms to the model by using the step_interact() function in the recipe. Note that including many interactions can produce a very high-dimensional feature set. This can lead to overfitting, so we should be cautious when selecting which interactions to include. We can use regularization to select the most important interaction terms and prevent overfitting. However, in LASSO regression, interpretation becomes harder if an interaction term is retained but its main effects are not. We can use penalty.factor to prevent the main effects from being penalized.
The decision to include interaction terms in the model should be based on domain knowledge and theory. If we believe that the interaction terms are important for predicting the outcome, we should include them in the model. If we are unsure, we can use regularization to select the most important interaction terms.
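As a sketch, step_interact() takes a one-sided formula, and interactions involving factors should be specified after dummy coding (x1 and x9 are just illustrative picks from the simulated data above):

```r
rec_int <- recipe(y ~ ., data = data) |>
  step_dummy(all_nominal_predictors()) |>
  # interact x1 with all dummy columns created from x9
  step_interact(~ x1:starts_with("x9"))
```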