cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)
Before we dive into resampling, we need to introduce two coding techniques that can save us a lot of time when implementing resampling methods: parallel processing and caching.
When using resampling, we often end up fitting many, many model configurations.
Critically, these fits are independent of each other, so they can be done in parallel across the cores available on our computer.
To do this in R, we need to set up a parallel processing backend.
TLDR - copy the following code chunk into your scripts after you load your other libraries (e.g., tidyverse and tidymodels):
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)
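A brief aside (optional, and not part of the workflow above): the cluster stays registered for the rest of your R session. If you ever want to release those cores at the end of a long script, you can shut the backend down. A minimal sketch:

parallel::stopCluster(cl)   # shut down the worker processes
foreach::registerDoSEQ()    # register the sequential backend so later foreach-based code still works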
Even with parallel processing, resampling procedures can STILL take a lot of time, particularly on notebook computers that don't have many cores available.
In these instances, you may also want to consider caching the results.
But use caching with care: a cached result is read from disk as-is, so you need to make sure it gets recalculated when the code or data that produced it change.
In other notes, we describe three options to cache calculations that are available in R.
We will use xfun::cache_rds(). Read its documentation (?xfun::cache_rds) if you plan to use it.
Start by loading only that function from the xfun package. You can add this line of code after your other libraries (e.g., tidyverse, tidymodels):

library(xfun, include.only = "cache_rds")
To use the function:
Pass the code you want to cache (expr =) to the function inside a set of curly brackets {}.
Provide a directory (dir =) and filename (file =) for the rds file that will save the cached calculations. Note that the / at the end of the path is needed.
Pass rerun = FALSE as a third argument. We will set this with a variable (rerun_setting; see below) so we can force fresh calculations for all cached chunks by changing one line.
You can also provide hash = so the code reruns automatically when the objects it depends on change. See more details at the previous link.

cache_rds(
  expr = {
  },
  dir = "cache/",
  file = "filename",
  rerun = rerun_setting
)
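To make the mechanics concrete, here is a toy example that is not part of this unit's analyses (the slow computation and file name are made up). The first run executes the code inside expr and writes the rds file; later runs simply read that file unless rerun is set to TRUE.

library(xfun, include.only = "cache_rds")

rerun_setting <- FALSE

slow_result <- cache_rds(
  expr = {
    Sys.sleep(5)          # stand-in for a slow calculation
    mean(rnorm(1000000))
  },
  dir = "cache/",
  file = "slow_result",
  rerun = rerun_setting)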
We will demonstrate the use of this function throughout the book. BUT you do not need to use it if you find it confusing.
We will use resampling for two goals:
Selecting the best model configuration among many candidate configurations
Evaluating the expected performance of a model configuration in new data
For both of these goals, we are using new data to estimate the performance of our model configuration(s)
There are two kinds of problems that can emerge from using a sub-optimal resampling approach: bias and variance in our performance estimates.
Essentially, this is the bias and variance problem again, but now not with respect to the model's actual performance but instead with our estimate of how the model will perform.
This is a very important distinction to keep in mind or you will be confused as we discuss bias and variance going forward. We have: bias and variance of the model itself (its predictions), and bias and variance of our estimate of the model's performance.
Let's get a dataset for this unit. We will use the heart disease dataset from the UCI Machine Learning Repository. We will focus on the Cleveland data subset, whose variables are defined in this data dictionary.
These data are less well prepared than the data we have worked with previously. We use rename() to add tidy variable names.

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
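The chunk that read in these data is not shown above. Here is a sketch of what it likely looked like, assuming a local copy of the raw Cleveland file (the path is an assumption; the raw file has no header row and codes missing values as "?"):

data_all <- read_csv("data/cleveland.csv", col_names = FALSE, na = "?") |>
  rename(age = X1, sex = X2, cp = X3, rest_bp = X4, chol = X5,
         fbs = X6, rest_ecg = X7, max_hr = X8, exer_ang = X9,
         exer_st_depress = X10, exer_st_slope = X11, ca = X12,
         thal = X13, disease = X14)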
Code categorical variables as factors with meaningful text labels (and no spaces)
data_all <- data_all |>
  mutate(disease = factor(disease, levels = 0:4,
labels = c("no", "yes", "yes", "yes", "yes")),
sex = factor(sex, levels = c(0, 1), labels = c("female", "male")),
fbs = factor(fbs, levels = c(0, 1), labels = c("no", "yes")),
exer_ang = factor(exer_ang, levels = c(0, 1), labels = c("no", "yes")),
exer_st_slope = factor(exer_st_slope, levels = 1:3,
labels = c("upslope", "flat", "downslope")),
cp = factor(cp, levels = 1:4,
labels = c("typ_ang", "atyp_ang", "non_anginal", "non_anginal")),
rest_ecg = factor(rest_ecg, levels = 0:2,
labels = c("normal", "abnormal1", "abnormal2")),
rest_ecg = fct_collapse(rest_ecg,
abnormal = c("abnormal1", "abnormal2")),
thal = factor(thal, levels = c(3, 6, 7),
labels = c("normal", "fixeddefect", "reversabledefect"))) |>
glimpse()
Rows: 303
Columns: 14
$ age <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44…
$ sex <fct> male, male, male, male, female, male, female, female, …
$ cp <fct> typ_ang, non_anginal, non_anginal, non_anginal, atyp_a…
$ rest_bp <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140,…
$ chol <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192,…
$ fbs <fct> yes, no, no, no, no, no, no, no, no, yes, no, no, yes,…
$ rest_ecg <fct> abnormal, abnormal, abnormal, normal, abnormal, normal…
$ max_hr <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148,…
$ exer_ang <fct> no, yes, yes, no, no, no, no, yes, no, yes, no, no, ye…
$ exer_st_depress <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4,…
$ exer_st_slope <fct> downslope, flat, flat, downslope, upslope, upslope, do…
$ ca <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ thal <fct> fixeddefect, normal, reversabledefect, normal, normal,…
$ disease <fct> no, yes, yes, no, no, no, yes, no, yes, yes, no, no, y…
We won't do EDA in this unit but let's at least do a quick skim to inform ourselves. Note that we have a small amount of missing data, including on thal, which is categorical.

data_all |> skim_all()
Name: data_all
Number of rows: 303
Number of columns: 14
Column type frequency: factor 8, numeric 6
Group variables: None
Variable type: factor
skim_variable | n_missing | complete_rate | n_unique | top_counts |
---|---|---|---|---|
sex | 0 | 1.00 | 2 | mal: 206, fem: 97 |
cp | 0 | 1.00 | 3 | non: 230, aty: 50, typ: 23 |
fbs | 0 | 1.00 | 2 | no: 258, yes: 45 |
rest_ecg | 0 | 1.00 | 2 | abn: 152, nor: 151 |
exer_ang | 0 | 1.00 | 2 | no: 204, yes: 99 |
exer_st_slope | 0 | 1.00 | 3 | ups: 142, fla: 140, dow: 21 |
thal | 2 | 0.99 | 3 | nor: 166, rev: 117, fix: 18 |
disease | 0 | 1.00 | 2 | no: 164, yes: 139 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1.00 | 54.44 | 9.04 | 29 | 48.0 | 56.0 | 61.0 | 77.0 | -0.21 | -0.55 |
rest_bp | 0 | 1.00 | 131.69 | 17.60 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | 0.70 | 0.82 |
chol | 0 | 1.00 | 246.69 | 51.78 | 126 | 211.0 | 241.0 | 275.0 | 564.0 | 1.12 | 4.35 |
max_hr | 0 | 1.00 | 149.61 | 22.88 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | -0.53 | -0.09 |
exer_st_depress | 0 | 1.00 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | 1.26 | 1.50 |
ca | 4 | 0.99 | 0.67 | 0.94 | 0 | 0.0 | 0.0 | 1.0 | 3.0 | 1.18 | 0.21 |
We will be fitting a logistic regression with all of the predictors for the first half of this unit.
Let's set up a recipe for feature engineering with this statistical algorithm.
rec_lr <- recipe(disease ~ ., data = data_all) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())
The order of steps in a recipe matters.
While your project's needs may vary, here is a suggested order of potential steps that should work for most problems according to the tidymodels folks: impute missing data, handle factor levels, apply individual transformations (e.g., for skew), discretize (only if truly needed), create dummy variables, create interactions, normalize (center, scale, range), and finish with multivariate transformations (e.g., PCA).
To date, you have essentially learned how to do the single validation set approach (although we haven’t called it that)
With this approach, we would take our full n = 303 and split it once into a held-in (training) set and a single held-out set.
If our goal was to evaluate the expected performance of a single model configuration in new data, we would call that held-out set a test set.
If our goal was to select the best model configuration among many candidate configurations, we would call that held-out set a validation set.
We call this the single validation set approach but that single held-out set can be either a validation or test set depending on our goals
If you need to BOTH select a best model configuration AND evaluate that best model configuration, you would need both a validation and a test set.
We have been doing the single validation set approach all along but we will provide one more example now (with a 50/50 split) to transition the code we are using to a more general workflow that will accommodate our more complicated resampling approaches
In the first half of this unit, we will focus on assessing the performance of a single model configuration
We will call the held-out set a test set and use it to evaluate the expected future performance of this single configuration
Previously, we would have done this as follows:
set.seed(19690127)
splits <- data_all |>
  initial_split(prop = 0.5, strata = "disease")

data_trn <- analysis(splits)
data_trn |> nrow()
[1] 151
data_test <- assessment(splits)
data_test |> nrow()
[1] 152
rec_prep <- rec_lr |>
  prep(data_trn)

feat_trn <- rec_prep |>
  bake(NULL)

feat_test <- rec_prep |>
  bake(data_test)

fit_lr <-
  logistic_reg() |>
set_engine("glm") |>
fit(disease ~ ., data = feat_trn)
accuracy_vec(feat_test$disease, predict(fit_lr, feat_test, type = "class")$.pred_class)
[1] 0.8355263
Now let's do this in a new and more efficient workflow. We use validation_split() rather than initial_split().
set.seed(19690127)
splits_validate <- data_all |>
  validation_split(prop = 0.5, strata = "disease")
Warning: `validation_split()` was deprecated in rsample 1.2.0.
ℹ Please use `initial_validation_split()` instead.
Now we can fit our model configuration in our training set(s) and calculate performance metric(s) in the held-out sets using fit_resamples()
fits_lr <-
  logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_lr, resamples = splits_validate,
metrics = metric_set(accuracy))
The object that is returned (we will call it fits_) is NOT a model fit with our model configuration (what we got using fit(), which we called fit_).
Instead, it contains the performance metrics for the configuration, estimated in the held-out set(s).
We pull these performance estimates out of the fits object using collect_metrics()
fits_lr |>
  collect_metrics(summarize = FALSE)
# A tibble: 1 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 validation accuracy binary 0.836 Preprocessor1_Model1
The model configuration was fit in the training set. The training set had N = 151 participants.
This model was trained with N = 151 but we have 303 participants. If we trained the same model configuration with all of our data, that model would be expected to perform better than the N = 151 model.
Increasing the sample size used to fit our model configuration will decrease model variance but not change model bias. This will produce overall lower error.
This might not be true if the additional data were not of similar quality to our training data, but we know our validation/test set is similar because we did a random resample.
So why didn't we just fit the model with all of the data in the first place? Because then we would not have had any new data left to get an estimate of its performance in new data!
If you plan to actually use your model in the real world for prediction, you should always re-fit the best configuration using all available data!
Our estimate will likely be biased. It will underestimate the true expected performance of our final model. We can think of it as a lower bound on that expected performance. The amount of bias will be a function of the difference between the sample size of the training set and the size of the full dataset. If we want less biased estimates, we want to allocate as much data as possible to the training set when estimating the performance of our final model configuration (but this will come with other costs!)
Using a training set with 80% of the sample will yield a less biased (under) estimate of the final (using all data) model performance than a training set with 50% of the sample.
However, using a test set of 20% of the data will produce a more variable (less precise) estimate of performance than the 50% test set.
This is another bias-variance trade off but now instead of talking about model performance, we are seeing that we have to trade off bias and variance in our estimate of the model performance too!
This recognition of a bias-variance trade-off in our performance estimates is what motivates the more complicated resampling approaches we will now consider.
In our example, we plan to use this model for future predictions, so now let's fit it a final time using the full dataset.
Make a feature matrix for the full dataset
We are now using the full data set as our new training set so we prep and bake with the full dataset
rec_prep <- rec_lr |>
  prep(data_all)

feat_all <- rec_prep |>
  bake(NULL)
And then fit your model configuration
fit_lr <-
  logistic_reg() |>
set_engine("glm") |>
fit(disease ~ ., data = feat_all)
fit_lr |> tidy()
# A tibble: 17 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.06 2.79 -1.81 0.0704
2 age -0.0185 0.0235 -0.787 0.431
3 rest_bp 0.0240 0.0109 2.20 0.0279
4 chol 0.00414 0.00376 1.10 0.271
5 max_hr -0.0204 0.0104 -1.96 0.0497
6 exer_st_depress 0.275 0.212 1.30 0.195
7 ca 1.32 0.263 5.00 0.000000566
8 sex_male 1.43 0.489 2.92 0.00345
9 cp_atyp_ang 1.05 0.752 1.40 0.162
10 cp_non_anginal 1.24 0.602 2.06 0.0396
11 fbs_yes -0.812 0.522 -1.55 0.120
12 rest_ecg_abnormal 0.469 0.364 1.29 0.197
13 exer_ang_yes 1.24 0.401 3.10 0.00195
14 exer_st_slope_flat 1.08 0.442 2.43 0.0150
15 exer_st_slope_downslope 0.468 0.819 0.572 0.568
16 thal_fixeddefect 0.262 0.759 0.345 0.730
17 thal_reversabledefect 1.41 0.397 3.54 0.000396
If we need to predict disease in the future, this is the model we would use (with these parameter estimates)
Our estimate of its future accuracy is based on our previous assessment using the held-in training set to fit the model configuration and the held-out test set to estimate its performance
This estimate should be considered a lower bound on its expected performance
Let’s turn to a new resampling technique and start with some questions to motivate it
Put all but one case into the training set (i.e., leave only one case out in the test set). In our example, you would fit a model with n = 302. This model will have essentially equivalent overfitting as an n = 303 model, so it will not yield much bias when we use it to estimate the performance of the n = 303 model.
You will estimate performance with only n = 1 in the test set. This means there will be high variance in your performance estimate.
Repeat this split between training and test n times so that there are n different sets of n = 1 test sets. Then average the performance across all n of these test sets to get a more stable estimate of performance. Averaging is a good way to reduce the variance of any estimate.
This is leave one out cross-validation!
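A quick standalone simulation (with made-up numbers, not our heart disease data) shows why averaging helps: the SD of an average of independent noisy estimates shrinks by roughly the square root of the number of estimates being averaged. (In LOOCV the estimates are not fully independent, a point we return to below.)

set.seed(123)
# 10,000 single noisy accuracy estimates (true accuracy = 0.80, SD = 0.10)
single_estimates <- replicate(10000, rnorm(1, mean = 0.80, sd = 0.10))
# 10,000 averages, each over 30 independent noisy estimates
averaged_estimates <- replicate(10000, mean(rnorm(30, mean = 0.80, sd = 0.10)))
sd(single_estimates)    # about 0.10
sd(averaged_estimates)  # about 0.10 / sqrt(30), i.e., roughly 0.018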
Comparisons across LOOCV and single validation set approaches
The performance estimate from LOOCV has less bias than the single validation set method (because the models that are used to estimate performance were fit with close to the full n of the final model that will be fit to all the data)
LOOCV uses all observations in the held-out set at some point. This may yield less variance than a single 20% or 50% validation set…
but the n training sets in LOOCV overlap almost completely, so the n held-out performance estimates are highly correlated, and averaging them does not reduce variance as much as you might hope (even though, across the n splits, all of the data are eventually used in held-out sets).
K-fold cross validation (next method) improves the variance of the average performance metric by averaging across more independent (less overlapping) training sets
If you want LOOCV, you can simply substitute loo_cv() for vfold_cv() in the next example.

K-fold cross validation
Common values of K are 5 and 10
Note that K is sometimes referred to as V in some fields/literatures (Don’t blame me!)
Visualization of K-fold
Let’s demonstrate the code for K-fold Cross-validation
Note that we stratify on our outcome, disease.

splits_kfold <- data_all |>
  vfold_cv(v = 10, repeats = 1, strata = "disease")
splits_kfold
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [272/31]> Fold01
2 <split [272/31]> Fold02
3 <split [272/31]> Fold03
4 <split [272/31]> Fold04
5 <split [273/30]> Fold05
6 <split [273/30]> Fold06
7 <split [273/30]> Fold07
8 <split [273/30]> Fold08
9 <split [273/30]> Fold09
10 <split [274/29]> Fold10
We fit the model configuration with fit_resamples() as before, passing in our splits (splits_kfold) and recipe (rec_lr), and specifying our performance metric(s) with metric_set().
fits_lr_kfold <-
  logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_lr,
resamples = splits_kfold,
metrics = metric_set(accuracy))
Then, we review performance estimates in held out folds using collect_metrics()
First with summarize = FALSE to see the separate performance estimate from each held-out fold.

metrics_kfold <- collect_metrics(fits_lr_kfold, summarize = FALSE)

metrics_kfold |> print_kbl()
id | .metric | .estimator | .estimate | .config |
---|---|---|---|---|
Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Fold02 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Fold03 | accuracy | binary | 0.71 | Preprocessor1_Model1 |
Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Fold05 | accuracy | binary | 0.70 | Preprocessor1_Model1 |
Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Fold07 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Fold08 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Fold09 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Fold10 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
metrics_kfold |> plot_hist(".estimate")
Then with summarize = TRUE to average performance over the 10 held-out folds (and get the standard error of that average).

collect_metrics(fits_lr_kfold, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.809 10 0.0242 Preprocessor1_Model1
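If it helps demystify these summary values: the mean and std_err above should simply be the mean of the 10 fold-level estimates and their standard error (SD divided by the square root of the number of folds). A quick optional check using the unsummarized metrics:

metrics_kfold |>
  summarize(mean = mean(.estimate),
            n = n(),
            std_err = sd(.estimate) / sqrt(n()))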
As a last step, we still fit the final model as before using the full dataset
We can use the feature matrix for the full dataset (feat_all) from earlier.

fit_lr <-
  logistic_reg() |>
set_engine("glm") |>
fit(disease ~ ., data = feat_all)
fit_lr |> tidy()
# A tibble: 17 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.06 2.79 -1.81 0.0704
2 age -0.0185 0.0235 -0.787 0.431
3 rest_bp 0.0240 0.0109 2.20 0.0279
4 chol 0.00414 0.00376 1.10 0.271
5 max_hr -0.0204 0.0104 -1.96 0.0497
6 exer_st_depress 0.275 0.212 1.30 0.195
7 ca 1.32 0.263 5.00 0.000000566
8 sex_male 1.43 0.489 2.92 0.00345
9 cp_atyp_ang 1.05 0.752 1.40 0.162
10 cp_non_anginal 1.24 0.602 2.06 0.0396
11 fbs_yes -0.812 0.522 -1.55 0.120
12 rest_ecg_abnormal 0.469 0.364 1.29 0.197
13 exer_ang_yes 1.24 0.401 3.10 0.00195
14 exer_st_slope_flat 1.08 0.442 2.43 0.0150
15 exer_st_slope_downslope 0.468 0.819 0.572 0.568
16 thal_fixeddefect 0.262 0.759 0.345 0.730
17 thal_reversabledefect 1.41 0.397 3.54 0.000396
If we need to predict disease in the future, this is the fitted model we would use (with these parameter estimates)
Our estimate of its future accuracy is 0.8090026 with a standard error of 0.0241847
Comparisons between K-fold vs. LOOCV and Single Validation set
For Bias: K-fold has a bit more bias than LOOCV (its training sets are somewhat smaller; e.g., 90% of the data when K = 10) but less bias than a 50% single validation set.
For Variance:
K-fold has less variance than LOOCV
K-fold has less variance than single validation set b/c it uses all data as test at some point (vs. a subset of held-out test data)
K-fold is less computationally expensive than LOOCV (though more expensive than single validation set)
K-fold is generally preferred over both of these other approaches
K-fold is less computationally intensive BUT still can be costly. Particularly when you are getting performance estimates for multiple model configurations (more on that when we learn how to tune hyperparameters)
So you may want to start caching the fits_ object so you don't have to recalculate it.
Here is a demonstration
rerun_setting <- TRUE
We wrap the code that fits and evaluates the models inside cache_rds(), passing it to expr inside {}.
The result is saved as an rds file in the cache folder, with a filename that begins with fits_lr_kfold and ends with a hash.
Note that objects used inside the chunk (e.g., rec_lr, splits_kfold) will not trigger a recalculation on their own if they change; you would need to pass them to hash = or force a rerun.
You can set rerun = TRUE temporarily, or you can change rerun_setting <- TRUE at the top of your script to do fresh calculations for all of your cached code chunks.
Set rerun_setting <- TRUE when you are done with your development to make sure everything is accurate (review output carefully for any changes!).

fits_lr_kfold <- cache_rds(
  expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_lr,
resamples = splits_kfold,
metrics = metric_set(accuracy))
}, dir = "cache/005/",
file = "fits_lr_kfold",
rerun = rerun_setting)
You can repeat the K-fold procedure multiple times with new splits for a different mix of K folds each time
Two benefits: averaging over more held-out sets (K × repeats of them) further reduces the variance of our average performance estimate, and the larger set of held-out estimates gives us a better picture of the distribution of performance across resamples.
But it is computationally expensive (depending on number of repeats)
An example of Repeated K-fold Cross-validation
set.seed(19690127)
splits_kfold10x <- data_all |>
  vfold_cv(v = 10, repeats = 10, strata = "disease")
splits_kfold10x
# 10-fold cross-validation repeated 10 times using stratification
# A tibble: 100 × 3
splits id id2
<list> <chr> <chr>
1 <split [272/31]> Repeat01 Fold01
2 <split [272/31]> Repeat01 Fold02
3 <split [272/31]> Repeat01 Fold03
4 <split [272/31]> Repeat01 Fold04
5 <split [273/30]> Repeat01 Fold05
6 <split [273/30]> Repeat01 Fold06
7 <split [273/30]> Repeat01 Fold07
8 <split [273/30]> Repeat01 Fold08
9 <split [273/30]> Repeat01 Fold09
10 <split [274/29]> Repeat01 Fold10
# ℹ 90 more rows
fits_lr_kfold10x <- cache_rds(
  expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_lr,
resamples = splits_kfold10x,
metrics = metric_set(accuracy))
}, dir = "cache/005/",
file = "fits_lr_kfold10x",
rerun = rerun_setting)
metrics_kfold10x <- collect_metrics(fits_lr_kfold10x, summarize = FALSE)

metrics_kfold10x |> print_kbl()
id | id2 | .metric | .estimator | .estimate | .config |
---|---|---|---|---|---|
Repeat01 | Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat01 | Fold02 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat01 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat01 | Fold04 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat01 | Fold05 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat01 | Fold06 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat01 | Fold07 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat01 | Fold08 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat01 | Fold09 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat01 | Fold10 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Repeat02 | Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat02 | Fold02 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold04 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold05 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat02 | Fold06 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat02 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat02 | Fold08 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold09 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat02 | Fold10 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Repeat03 | Fold01 | accuracy | binary | 0.94 | Preprocessor1_Model1 |
Repeat03 | Fold02 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat03 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat03 | Fold04 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat03 | Fold05 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat03 | Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat03 | Fold07 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat03 | Fold08 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat03 | Fold09 | accuracy | binary | 0.70 | Preprocessor1_Model1 |
Repeat03 | Fold10 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat04 | Fold01 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat04 | Fold02 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat04 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat04 | Fold04 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat04 | Fold05 | accuracy | binary | 0.67 | Preprocessor1_Model1 |
Repeat04 | Fold06 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat04 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat04 | Fold08 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat04 | Fold09 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat04 | Fold10 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Repeat05 | Fold01 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat05 | Fold02 | accuracy | binary | 0.74 | Preprocessor1_Model1 |
Repeat05 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat05 | Fold04 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat05 | Fold05 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat05 | Fold06 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat05 | Fold07 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Repeat05 | Fold08 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat05 | Fold09 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat05 | Fold10 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Repeat06 | Fold01 | accuracy | binary | 0.94 | Preprocessor1_Model1 |
Repeat06 | Fold02 | accuracy | binary | 0.74 | Preprocessor1_Model1 |
Repeat06 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat06 | Fold04 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat06 | Fold05 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat06 | Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat06 | Fold07 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat06 | Fold08 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat06 | Fold09 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat06 | Fold10 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat07 | Fold01 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat07 | Fold02 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat07 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat07 | Fold04 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat07 | Fold05 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat07 | Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat07 | Fold07 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat07 | Fold08 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat07 | Fold09 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat07 | Fold10 | accuracy | binary | 0.69 | Preprocessor1_Model1 |
Repeat08 | Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat08 | Fold02 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat08 | Fold03 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat08 | Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat08 | Fold05 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat08 | Fold06 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Repeat08 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat08 | Fold08 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat08 | Fold09 | accuracy | binary | 0.67 | Preprocessor1_Model1 |
Repeat08 | Fold10 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat09 | Fold01 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat09 | Fold02 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat09 | Fold03 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat09 | Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat09 | Fold05 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Repeat09 | Fold06 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat09 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat09 | Fold08 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat09 | Fold09 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat09 | Fold10 | accuracy | binary | 0.72 | Preprocessor1_Model1 |
Repeat10 | Fold01 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat10 | Fold02 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat10 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat10 | Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat10 | Fold05 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat10 | Fold06 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat10 | Fold07 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat10 | Fold08 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat10 | Fold09 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat10 | Fold10 | accuracy | binary | 0.76 | Preprocessor1_Model1 |
metrics_kfold10x |> plot_hist(".estimate", bins = 10)
The average performance estimate (and its SE) across the 100 held-out folds:
collect_metrics(fits_lr_kfold10x, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.824 100 0.00604 Preprocessor1_Model1
Comparisons between repeated K-fold and K-fold
Repeated K-fold: has essentially the same bias as K-fold (the training sets are the same size) but lower variance for the average performance estimate (because we average over many more held-out folds), at a higher computational cost.
Repeated K-fold is preferred over K-fold to the degree possible based on computational limitations (parallel, N, p, statistical algorithm, # of model configurations)
We have to be particularly careful with resampling methods when we have repeated observations for the same participant (or unit of analysis more generally)
We can handle this by using group_vfold_cv() and then proceeding as before with all other analyses/code.
We set its group argument to the name of the variable that codes for subid (or whatever unit of analysis is repeated), as shown in the sketch below.
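A minimal sketch of what this looks like; the data frame (data_longitudinal) and the subid variable here are hypothetical, not part of our heart disease data:

# all rows from the same subid are kept together in the same fold
splits_grouped <- data_longitudinal |>
  group_vfold_cv(group = "subid", v = 10)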
Let's now turn to bootstrap resampling. A bootstrap sample is a random sample taken with replacement (i.e., the same observations can be sampled multiple times within one bootstrap sample)
If you bootstrap a new sample of size n from a dataset with sample size n, approximately 63.2% of the original observations end up in the bootstrap sample
The remaining 36.8% of the observations are often called the “out of bag” (OOB) samples
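If you want to convince yourself of that 63.2% figure, here is a quick standalone check (not needed for any of the analyses below). The expected proportion of unique observations in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows.

n <- 303
1 - (1 - 1/n)^n   # about 0.63

# or simulate it directly
set.seed(102030)
mean(replicate(1000, length(unique(sample(1:n, n, replace = TRUE))) / n))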
Bootstrap Resampling
An example of Bootstrap resampling
Again, note that we stratify on our outcome, disease.
set.seed(19690127)
splits_boot <- data_all |>
  bootstraps(times = 100, strata = "disease")
splits_boot
# Bootstrap sampling using stratification
# A tibble: 100 × 2
splits id
<list> <chr>
1 <split [303/115]> Bootstrap001
2 <split [303/123]> Bootstrap002
3 <split [303/105]> Bootstrap003
4 <split [303/115]> Bootstrap004
5 <split [303/114]> Bootstrap005
6 <split [303/115]> Bootstrap006
7 <split [303/113]> Bootstrap007
8 <split [303/95]> Bootstrap008
9 <split [303/101]> Bootstrap009
10 <split [303/115]> Bootstrap010
# ℹ 90 more rows
fits_lr_boot <- cache_rds(
  expr = {
logistic_reg() |>
set_engine("glm") |>
fit_resamples(preprocessor = rec_lr,
resamples = splits_boot,
metrics = metric_set(accuracy))
},dir = "cache/005/",
file = "fits_lr_boot",
rerun = rerun_setting)
metrics_boot <- collect_metrics(fits_lr_boot, summarize = FALSE)

metrics_boot |> print_kbl()
id | .metric | .estimator | .estimate | .config |
---|---|---|---|---|
Bootstrap001 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap002 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap003 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap004 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap005 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap006 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap007 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap008 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap009 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap010 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap011 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap012 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap013 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap014 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap015 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap016 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap017 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap018 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap019 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap020 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap021 | accuracy | binary | 0.74 | Preprocessor1_Model1 |
Bootstrap022 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap023 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap024 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap025 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap026 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap027 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap028 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap029 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap030 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap031 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap032 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap033 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap034 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap035 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap036 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap037 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap038 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap039 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap040 | accuracy | binary | 0.88 | Preprocessor1_Model1 |
Bootstrap041 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap042 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap043 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Bootstrap044 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap045 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap046 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap047 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap048 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap049 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap050 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap051 | accuracy | binary | 0.75 | Preprocessor1_Model1 |
Bootstrap052 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap053 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap054 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap055 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap056 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap057 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap058 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap059 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap060 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap061 | accuracy | binary | 0.72 | Preprocessor1_Model1 |
Bootstrap062 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap063 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap064 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap065 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap066 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap067 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap068 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap069 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap070 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap071 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap072 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap073 | accuracy | binary | 0.76 | Preprocessor1_Model1 |
Bootstrap074 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap075 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap076 | accuracy | binary | 0.89 | Preprocessor1_Model1 |
Bootstrap077 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap078 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap079 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap080 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap081 | accuracy | binary | 0.88 | Preprocessor1_Model1 |
Bootstrap082 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap083 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap084 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap085 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap086 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap087 | accuracy | binary | 0.76 | Preprocessor1_Model1 |
Bootstrap088 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap089 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap090 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap091 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap092 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap093 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap094 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap095 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap096 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap097 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap098 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap099 | accuracy | binary | 0.75 | Preprocessor1_Model1 |
Bootstrap100 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
metrics_boot |> plot_hist(".estimate", bins = 10)
collect_metrics(fits_lr_boot, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.817 100 0.00338 Preprocessor1_Model1
Relevant comparisons, strengths/weaknesses of the bootstrap for resampling:
Bootstrap yields low variance performance estimates, but they are (pessimistically) biased because each bootstrap model is trained on a sample that contains only about 63% unique observations.
Can also represent the variance of our held-out error (like repeated K-fold)
Used primarily for selecting among model configurations when you don’t care about bias and just want a precise selection metric
Useful in explanation scenarios where you just need the “best” model
“Inner loop” of nested cross validation (more on this later)
In all of the previous examples, we have used various resampling methods only to evaluate the performance of a single model configuration in new data. In these instances, we were treating the held-out sets as test sets.
Resampling is also used to get held out performance estimates to select best model configurations.
Best means the model configuration that performs the best in new data and therefore is closest to the true DGP for the data
For example, we might want to select among model configurations in an explanatory scenario to have a principled approach to determine the model configuration that best matches the true DGP (and would be best to test your hypotheses).
We can simply get performance estimates for each configuration using one of the previously described resampling methods
One additional common scenario where you will do model selection across many model configurations is when “tuning” (i.e., selecting) the best values for hyperparameters for a statistical algorithm (e.g., k in KNN).
tidymodels makes this easy, and it follows a very similar workflow as earlier with a few changes:
We set the hyperparameter(s) we want to tune to tune() in the algorithm specification and indicate the candidate values to consider in a grid (the tune package functions can decide on values for you in some cases).
We use tune_grid() rather than fit_resamples() to fit and evaluate the model configurations that differ with respect to their hyperparameters.

Let's use bootstrap resampling to select the best K for KNN applied to our heart disease dataset.
We can use the same splits we established before (splits_boot).
We need a slightly different recipe for KNN vs. logistic regression
rec_knn <- recipe(disease ~ ., data = data_all) |>
  step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_range(all_numeric()) |>
step_dummy(all_nominal_predictors())
The fitting process is what is different
hyper_grid <- expand.grid(neighbors = seq(1, 150, by = 3))
hyper_grid
neighbors
1 1
2 4
3 7
4 10
5 13
6 16
7 19
8 22
9 25
10 28
11 31
12 34
13 37
14 40
15 43
16 46
17 49
18 52
19 55
20 58
21 61
22 64
23 67
24 70
25 73
26 76
27 79
28 82
29 85
30 88
31 91
32 94
33 97
34 100
35 103
36 106
37 109
38 112
39 115
40 118
41 121
42 124
43 127
44 130
45 133
46 136
47 139
48 142
49 145
50 148
Rather than calling fit_resamples() multiple times to estimate the performance of those different configurations, we call tune_grid() once and pass it our grid of hyperparameter values with grid =.
fits_knn_boot <- cache_rds(
  expr = {
nearest_neighbor(neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification") |>
tune_grid(preprocessor = rec_knn,
resamples = splits_boot,
grid = hyper_grid,
metrics = metric_set(accuracy))
},dir = "cache/005/",
file = "fits_knn_boot",
rerun = rerun_setting)
Reviewing performance of model configurations is similar to before but now with multiple configurations
Here we use summarize = TRUE to get the average performance across the 100 held-out sets for each configuration (you could still use summarize = FALSE to see every individual estimate).

collect_metrics(fits_knn_boot, summarize = TRUE)
# A tibble: 50 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy binary 0.749 100 0.00362 Preprocessor1_Model01
2 4 accuracy binary 0.749 100 0.00362 Preprocessor1_Model02
3 7 accuracy binary 0.771 100 0.00367 Preprocessor1_Model03
4 10 accuracy binary 0.782 100 0.00353 Preprocessor1_Model04
5 13 accuracy binary 0.787 100 0.00345 Preprocessor1_Model05
6 16 accuracy binary 0.794 100 0.00329 Preprocessor1_Model06
7 19 accuracy binary 0.798 100 0.00314 Preprocessor1_Model07
8 22 accuracy binary 0.800 100 0.00313 Preprocessor1_Model08
9 25 accuracy binary 0.802 100 0.00311 Preprocessor1_Model09
10 28 accuracy binary 0.803 100 0.00306 Preprocessor1_Model10
# ℹ 40 more rows
collect_metrics(fits_knn_boot, summarize = TRUE) |>
ggplot(aes(x = neighbors, y = mean)) +
geom_line()
K (neighbors) is affecting the bias-variance trade-off. As K increases, model bias increases but model variance decreases. In most instances, model variance decreases faster than model bias increases. Therefore performance should increase and then peak at a good point along the bias-variance trade-off. Beyond this optimal value, performance should decrease again. You want to select a hyperparameter value that is associated with peak (or near peak) performance.
The simplest way to select among model configurations (e.g., hyperparameters) is to choose the model configuration with the best performance
show_best(fits_knn_boot, n = 10)
Warning in show_best(fits_knn_boot, n = 10): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 10 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 145 accuracy binary 0.819 100 0.00321 Preprocessor1_Model49
2 148 accuracy binary 0.819 100 0.00323 Preprocessor1_Model50
3 142 accuracy binary 0.819 100 0.00324 Preprocessor1_Model48
4 124 accuracy binary 0.818 100 0.00310 Preprocessor1_Model42
5 121 accuracy binary 0.818 100 0.00309 Preprocessor1_Model41
6 136 accuracy binary 0.818 100 0.00323 Preprocessor1_Model46
7 139 accuracy binary 0.818 100 0.00323 Preprocessor1_Model47
8 133 accuracy binary 0.818 100 0.00319 Preprocessor1_Model45
9 130 accuracy binary 0.818 100 0.00316 Preprocessor1_Model44
10 127 accuracy binary 0.818 100 0.00314 Preprocessor1_Model43
select_best(fits_knn_boot)
Warning in select_best(fits_knn_boot): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 1 × 2
neighbors .config
<dbl> <chr>
1 145 Preprocessor1_Model49
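These warnings are harmless here because accuracy is our only metric, but you can silence them by naming the metric explicitly. For example:

show_best(fits_knn_boot, metric = "accuracy", n = 10)
select_best(fits_knn_boot, metric = "accuracy")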
The next most common is to choose the simplest (least flexible) model that has performance within one SE of the best performing configuration.
We can do this with select_by_one_std_err(). We indicate that simpler (less flexible) KNN models are those with more neighbors by passing desc(neighbors).

select_by_one_std_err(fits_knn_boot,
                      desc(neighbors))
Warning in select_by_one_std_err(fits_knn_boot, desc(neighbors)): No value of
`metric` was given; "accuracy" will be used.
# A tibble: 1 × 2
neighbors .config
<dbl> <chr>
1 148 Preprocessor1_Model50
rec_prep <- rec_knn |>
  prep(data_all)

feat_all <- rec_prep |>
  bake(NULL)
We can use one of the select_*() functions from above to plug this best hyperparameter value into our specification of the algorithm.
Note that we now fit using all the data and switch to fit()
rather than tune_grid()
fit_knn_best <-
  nearest_neighbor(neighbors = select_best(fits_knn_boot)$neighbors) |>
set_engine("kknn") |>
set_mode("classification") |>
fit(disease ~ ., data = feat_all)
Warning in select_best(fits_knn_boot): No value of `metric` was given; "accuracy" will be used.
No value of `metric` was given; "accuracy" will be used.
Resampling methods can be used to get model performance estimates to select the best model configuration and/or evaluate that best model
So far we have done EITHER selection OR evaluation but not both together
The concepts to both select the best configuration and evaluation it are similar but it requires different (slightly more complicated) resampling than what we have done so far
If you use your held-out resamples to select the best model among a number of model configurations then the same held-out resamples cannot also be used to evaluate the performance of that same best model
If you do, the performance metric will have optimization bias. To the degree that there is any noise (i.e., variance) in the measurement of performance, selecting the best model configuration will capitalize on this noise.
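A small standalone simulation (with made-up numbers) shows this optimization bias: even when 50 configurations all have a true accuracy of 0.80, the best observed estimate among them will tend to land above 0.80 simply because of noise.

set.seed(456)
# noisy accuracy estimates for 50 equally good configurations, repeated 1000 times
best_observed <- replicate(1000, max(rnorm(50, mean = 0.80, sd = 0.02)))
mean(best_observed)   # noticeably above the true value of 0.80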
You need to use one set of held out resamples (validation sets) to select the best model. Then you need a DIFFERENT set of held out resamples (test sets) to evaluate that best model.
There are two strategies for this:
Combine one of the resampling methods (e.g., bootstrap resampling) for selection with a single held-out test set for evaluation (made with initial_split()).
Use nested resampling (described at the end of this unit).

Other Observations about Common Practices:
First we divide our data into training and test sets using initial_split().
Next, we use bootstrap resampling with the training set to split training into many held-in and held-out sets. We use these held-out (OOB) sets as validation sets to select the best model configuration based on mean/median performance across those sets.
After we select the best model configuration using bootstrap resampling within the training set, we fit that configuration to the full training set and evaluate it in the held-out test set.
Of course, in the end, if you plan to use the model, you will refit this final model configuration to the FULL dataset, but the performance estimate for that model will come from the test set in the previous step (there are no more data to estimate new performance).
We use initial_split() for the first train/test split.

set.seed(123456)

splits_test <- data_all |>
  initial_split(prop = 2/3, strata = "disease")

data_trn <- splits_test |>
  analysis()

data_test <- splits_test |>
  assessment()

splits_boot_trn <- data_trn |>
  bootstraps(times = 100, strata = "disease")
We consider the same values of neighbors as before (hyper_grid).

fits_knn_boot_trn <- cache_rds(
  expr = {
nearest_neighbor(neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification") |>
tune_grid(preprocessor = rec_knn,
resamples = splits_boot_trn,
grid = hyper_grid,
metrics = metric_set(accuracy))
},dir = "cache/005/",
file = "fits_knn_boot_trn",
rerun = rerun_setting)
K = 91 is the best model configuration as determined by bootstrap resampling within the training set.
BUT this is NOT the correct estimate of its performance in new data
We compared 50 model configurations (values of k). This performance estimate may have some optimization bias (though 50 model configurations is really not THAT many)
show_best(fits_knn_boot_trn, n = 10)
Warning in show_best(fits_knn_boot_trn, n = 10): No value of `metric` was
given; "accuracy" will be used.
# A tibble: 10 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 91 accuracy binary 0.815 100 0.00309 Preprocessor1_Model31
2 97 accuracy binary 0.815 100 0.00305 Preprocessor1_Model33
3 100 accuracy binary 0.815 100 0.00302 Preprocessor1_Model34
4 124 accuracy binary 0.815 100 0.00299 Preprocessor1_Model42
5 121 accuracy binary 0.815 100 0.00299 Preprocessor1_Model41
6 88 accuracy binary 0.815 100 0.00306 Preprocessor1_Model30
7 79 accuracy binary 0.815 100 0.00300 Preprocessor1_Model27
8 94 accuracy binary 0.815 100 0.00302 Preprocessor1_Model32
9 85 accuracy binary 0.814 100 0.00299 Preprocessor1_Model29
10 73 accuracy binary 0.814 100 0.00309 Preprocessor1_Model25
show_best(fits_knn_boot_trn, n = 10)$mean
Warning in show_best(fits_knn_boot_trn, n = 10): No value of `metric` was
given; "accuracy" will be used.
[1] 0.8149497 0.8148387 0.8147184 0.8147127 0.8146859 0.8146851 0.8145969
[8] 0.8145515 0.8144602 0.8144509
select_best(fits_knn_boot_trn)
Warning in select_best(fits_knn_boot_trn): No value of `metric` was given;
"accuracy" will be used.
# A tibble: 1 × 2
neighbors .config
<dbl> <chr>
1 91 Preprocessor1_Model31
rec_prep <- rec_knn |>
  prep(data_trn)

feat_trn <- rec_prep |>
  bake(NULL)

feat_test <- rec_prep |>
  bake(data_test)
fit_knn_best <-
  nearest_neighbor(neighbors = select_best(fits_knn_boot_trn)$neighbors) |>
set_engine("kknn") |>
set_mode("classification") |>
fit(disease ~ ., data = feat_trn)
Warning in select_best(fits_knn_boot_trn): No value of `metric` was given; "accuracy" will be used.
No value of `metric` was given; "accuracy" will be used.
accuracy_vec(feat_test$disease, predict(fit_knn_best, feat_test)$.pred_class)
[1] 0.8333333
Our best estimate of how accurate a model with k = 91 would be in new data is 0.8333333.
And now the final, mind-blowing extension!!!!!
The bootstrap resampling + test set approach to simultaneously select and evaluate models is commonly used
However, it suffers from the same problems as the single train/validation/test set approach when it comes to evaluating the performance of the final best model: the estimate is somewhat biased (the selected configuration is fit with only part of the data) and it can be imprecise (it comes from a single, relatively small test set).
Nested resampling offers an improvement with respect to these two issues.
Nested resampling involves two loops: an inner loop of resampling used to select the best model configuration, and an outer loop of resampling used to evaluate that configuration in held-out data.
Nested resampling is VERY CONFUSING at first (like the first year you use it!)
Nested resampling isn’t fully supported by tidymodels as of yet. You have to do some coding to iterate over the outer loop
Application of nested resampling is outside the scope of this course but you should understand it conceptually. For further reading on the implementation of this method, see an example provided by the tidymodels folks.
A ‘simple’ example using bootstrap for inner loop and 10-fold CV for outer loop
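For a sense of what that setup looks like in code, rsample can at least build the nested splits for you. This is a minimal sketch only; selecting within the inner loop and evaluating across the outer folds still requires your own iteration code, as described in the tidymodels example linked above.

# outer loop: 10-fold CV; inner loop: 100 bootstraps within each outer training set
splits_nested <- nested_cv(data_all,
                           outside = vfold_cv(v = 10, strata = "disease"),
                           inside = bootstraps(times = 100, strata = "disease"))
splits_nested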
Nested resampling evaluates a fitting and selection process, not a specific model configuration!
You therefore need to select a final model configuration using the same resampling with the full data.
You then need to fit that new model configuration to the full data
That was the last two steps on the previous page
The inner loop is used for selecting models. Bootstrap yields low variance performance estimates (but they are biased). We want low variance to select best model configuration. K-fold is a good method for less biased performance estimates. We want less bias in our final evaluation of our best model. You can do repeated K-fold in the outer loop to both reduce its variance and give you a sense of the performance sampling distribution. BUT VERY COMPUTATIONALLY INTENSIVE
Final words on resampling:
Iterative methods (K-fold, bootstrap) are superior to single validation set approach wrt bias-variance trade-off in performance measurement
K-fold resampling should be used if you are looking for a performance estimate of a single model configuration.
Bootstrap resampling should be used if you are looking only to choose among model configurations but don’t need an independent assessment of that final model
Bootstrap resampling + Test set or Nested Resampling should be used when you plan to both select among model configurations AND evaluate the best model
In scenarios where you will not have one final test set but will eventually use all the data as test data at some point (i.e., K-fold for evaluation of a single model configuration, or nested CV), you still refit your final model to all of the data and use the aggregated held-out estimates as your estimate of its performance.