cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)
5 Resampling Methods for Model Selection and Evaluation
5.1 Overview of Unit
5.1.1 Learning Objectives
- Bias vs. variance with respect to model performance estimates
- How this differs from bias vs. variance of the model itself
- Methods for computationally intense calculations
- Parallel processing
- Cache
- Types of resampling
- Validation set approach
- Leave One Out CV
- K-Fold and Repeated K-Fold
- Grouped K-Fold
- Bootstrap resampling
- Use of resampling for tuning hyperparameters
- Combining these resampling approaches with a Test set
- Used for simultaneous model selection and evaluation
- Single independent test set
- Advanced topic: Nested resampling
5.1.2 Readings
Kuhn and Johnson (2018) Chapter 4, pp 61 - 80
Supplemental: James et al. (2023) Chapter 5, pp 197 - 208
Post questions to the readings channel in Slack
5.1.3 Lecture Videos
- Lecture 1: Overview & Parallel Processing ~ 16 mins
- Lecture 2: Introduction to Resampling ~ 11 mins
- Lecture 3: Single Validation/Test Set Approach ~ 26 mins
- Lecture 4: Leave One Out Cross Validation ~ 9 mins
- Lecture 5: K-fold Cross Validation Approaches ~ 21 mins
- Lecture 6: Repeated and Grouped K-fold Approaches ~ 11 mins
- Lecture 7: Bootstrap Resampling ~ 11 mins
- Lecture 8: Using Resampling to Select Best Model Configurations ~ 17 mins
- Lecture 9: Resampling for Both Model Selection and Evaluation ~ 11 mins
- Lecture 10: Nested Resampling ~ 14 mins
Post questions to the video-lectures channel in Slack
5.1.4 Application Assignment and Quiz
Post questions to application-assignments Slack channel
Submit the application assignment here and complete the unit quiz by 8 pm on Wednesday, February 21st
5.2 Some Technical Details for Costly Computations
Before we dive into resampling, we need to introduce two coding techniques that can save us a lot of time when implementing resampling methods
- Parallel processing
- Caching time-consuming computations
5.2.1 Parallel Processing
When using resampling, we often end up fitting many, many model configurations
- This can be the same model configuration in many different training sets
- Or many different model configurations in many different training sets (even more computationally demanding)
Critically
- The fitting process for each of these configurations is independent of the others
- The order that the configurations are fit doesn’t matter either
- When these two criteria are met, the processes can be run in parallel with an often big time savings
To do this in R, we need to set up a parallel processing backend
- There are lots of options and details, depending on the code you intend to run in parallel, if you want to do it really well
- We can discuss some of these issues/details and other solutions (e.g., High Throughput Computing at CHTC)
- Some options are OS specific
- We provide more details elsewhere
TLDR - copy the following code chunk into your scripts after you load your other libraries (e.g., tidyverse and tidymodels)
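Here is that chunk. It is the same setup used at the top of this unit: it starts a PSOCK cluster with one worker per physical core and registers it as the backend via doParallel (this assumes the parallel and doParallel packages are installed)

cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
doParallel::registerDoParallel(cl)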
5.2.2 Using Cache
Even with parallel processing, resampling procedures can STILL take a lot of time, particularly on notebook computers that don’t have a lot of cores available
In these instances, you may also want to consider caching the result
- When you cache some set of calculations, you are essentially saving the results of the calculations
- If you need to run the script again, you simply load the saved calculations again from disk, rather than re-calculating them (it's much quicker to just read them from a file)
But…
- You need to redo the calculations if you change anything in your script that could affect them
- This is called “invalidating the cache”
- You need to be very careful to reuse the cache when you can but also to invalidate it when the calculations have changed
In other notes, we describe three options to cache calculations that are available in R.
- You should read more about those options if you plan to use one
- Our preferred solution is to use xfun::cache_rds()
- Read the help for this function (?xfun::cache_rds) if you plan to use it
- Cache is complicated and can lead to errors
- But cache can also save you a lot of time during development!
Start by loading only that function from the xfun package. You can add this line of code after your other libraries (e.g., tidyverse, tidymodels)
library(xfun, include.only = "cache_rds")
To use the function:
- You will pass the code for the calculations you want to cache as the first argument (expr) to the function, inside a set of curly brackets {}
- You need to list the path (dir =) and filename (file =) for the rds file that will save the cached calculations
  - The / at the end of the path is needed
  - You should use a meaningful (and distinct) filename
- Provide rerun = FALSE as a third argument
  - You can set this to TRUE temporarily if you need to invalidate the cache to redo the calculations
  - We like to set it up as an environment variable (see rerun_setting below)
  - Keep it as FALSE during development
  - Set it to TRUE at the end of development so that we make sure we didn't make any cache invalidation errors
- You may also provide a list of globals to hash =. See more details at the previous link
cache_rds(
  expr = {
  },
  dir = "cache/",
  file = "filename",
  rerun = rerun_setting
)
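For example, here is a toy use of the function. The slow computation inside expr is purely hypothetical; the point is that the first run takes about 10 seconds and writes an rds file to cache/, while later runs simply read that file back

toy_result <- cache_rds(
  expr = {
    Sys.sleep(10)              # stand-in for an expensive computation
    sum(sqrt(seq_len(1e6)))
  },
  dir = "cache/",
  file = "toy_result",
  rerun = FALSE)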
We will demonstrate the use of this function throughout the book. BUT you do not need to use it if you find it confusing.
5.3 Introduction to Resampling
We will use resampling for two goals:
- To select among model configurations based on relative performance estimates of these configurations in new data
- To evaluate the performance of our best/final model configuration in new data
For both of these goals we are using new data to estimate performance of model configuration(s)
There are two kinds of problems that can emerge from using a sub-optimal resampling approach
- We can get a biased estimate of model performance (i.e., we can systematically under or over-estimate its performance)
- We can get an imprecise estimate of model performance (i.e., high variance in our model performance metric if it was repeatedly calculated in different samples of held-out data)
Essentially, this is the bias and variance problem again, but now not with respect to the model's actual performance but instead with respect to our estimate of how the model will perform
This is a very important distinction to keep in mind or you will be confused as we discuss bias and variance going forward. We have:
- bias and variance of model performance (i.e., the predictions the model makes)
- bias and variance of our estimate of how well the model will perform in new data
- different factors affect each
Let's get a dataset for this unit. We will use the heart disease dataset from the UCI Machine Learning Repository. We will focus on the Cleveland data subset, whose variables are defined in this data dictionary
These data are less well prepared
- No variable/column names exist
- NA is coded with ?
- Use rename() to add tidy variable names
data_all <- read_csv(here::here(path_data, "cleveland.csv"),
                     col_names = FALSE,   # 1
                     na = "?") |>         # 2
  rename(age = X1,
         sex = X2,
         cp = X3,
         rest_bp = X4,
         chol = X5,
         fbs = X6,
         rest_ecg = X7,
         max_hr = X8,
         exer_ang = X9,
         exer_st_depress = X10,
         exer_st_slope = X11,
         ca = X12,
         thal = X13,
         disease = X14)

1. Indicating that column names are NOT on the first row. First row begins with data
2. Specifying a non-standard value for NA
Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code categorical variables as factors with meaningful text labels (and no spaces)
data_all <- data_all |>
  mutate(disease = factor(disease, levels = 0:4,
                          labels = c("no", "yes", "yes", "yes", "yes")),
         sex = factor(sex, levels = c(0, 1), labels = c("female", "male")),
         fbs = factor(fbs, levels = c(0, 1), labels = c("no", "yes")),
         exer_ang = factor(exer_ang, levels = c(0, 1), labels = c("no", "yes")),
         exer_st_slope = factor(exer_st_slope, levels = 1:3,
                                labels = c("upslope", "flat", "downslope")),
         cp = factor(cp, levels = 1:4,
                     labels = c("typ_ang", "atyp_ang", "non_anginal", "non_anginal")),
         rest_ecg = factor(rest_ecg, levels = 0:2,
                           labels = c("normal", "abnormal1", "abnormal2")),
         rest_ecg = fct_collapse(rest_ecg,
                                 abnormal = c("abnormal1", "abnormal2")),
         thal = factor(thal, levels = c(3, 6, 7),
                       labels = c("normal", "fixeddefect", "reversabledefect"))) |>
  glimpse()
Rows: 303
Columns: 14
$ age <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44…
$ sex <fct> male, male, male, male, female, male, female, female, …
$ cp <fct> typ_ang, non_anginal, non_anginal, non_anginal, atyp_a…
$ rest_bp <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140,…
$ chol <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192,…
$ fbs <fct> yes, no, no, no, no, no, no, no, no, yes, no, no, yes,…
$ rest_ecg <fct> abnormal, abnormal, abnormal, normal, abnormal, normal…
$ max_hr <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148,…
$ exer_ang <fct> no, yes, yes, no, no, no, no, yes, no, yes, no, no, ye…
$ exer_st_depress <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4,…
$ exer_st_slope <fct> downslope, flat, flat, downslope, upslope, upslope, do…
$ ca <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ thal <fct> fixeddefect, normal, reversabledefect, normal, normal,…
$ disease <fct> no, yes, yes, no, no, no, yes, no, yes, yes, no, no, y…
We won't do EDA in this unit but let's at least do a quick skim to inform ourselves
- 303 cases
- a dichotomous outcome, disease (yes or no for heart disease)
- 7 other categorical predictors
- 6 numeric predictors
- 2 missing values for thal, which is categorical
- 4 missing values for ca, which is numeric

data_all |> skim_all()
Name | data_all |
Number of rows | 303 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
factor | 8 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | n_unique | top_counts |
---|---|---|---|---|
sex | 0 | 1.00 | 2 | mal: 206, fem: 97 |
cp | 0 | 1.00 | 3 | non: 230, aty: 50, typ: 23 |
fbs | 0 | 1.00 | 2 | no: 258, yes: 45 |
rest_ecg | 0 | 1.00 | 2 | abn: 152, nor: 151 |
exer_ang | 0 | 1.00 | 2 | no: 204, yes: 99 |
exer_st_slope | 0 | 1.00 | 3 | ups: 142, fla: 140, dow: 21 |
thal | 2 | 0.99 | 3 | nor: 166, rev: 117, fix: 18 |
disease | 0 | 1.00 | 2 | no: 164, yes: 139 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1.00 | 54.44 | 9.04 | 29 | 48.0 | 56.0 | 61.0 | 77.0 | -0.21 | -0.55 |
rest_bp | 0 | 1.00 | 131.69 | 17.60 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | 0.70 | 0.82 |
chol | 0 | 1.00 | 246.69 | 51.78 | 126 | 211.0 | 241.0 | 275.0 | 564.0 | 1.12 | 4.35 |
max_hr | 0 | 1.00 | 149.61 | 22.88 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | -0.53 | -0.09 |
exer_st_depress | 0 | 1.00 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | 1.26 | 1.50 |
ca | 4 | 0.99 | 0.67 | 0.94 | 0 | 0.0 | 0.0 | 1.0 | 3.0 | 1.18 | 0.21 |
We will be fitting a logistic regression with all of the predictors for the first half of this unit
Let's set up a recipe for feature engineering with this statistical algorithm
- Impute missing data for all numeric predictors using median imputation
- Impute missing data for all nominal predictors using the modal value
- Dummy code all nominal predictors
rec_lr <- recipe(disease ~ ., data = data_all) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())
The order of steps in a recipe matters
While your project's needs may vary, here is a suggested order of potential steps that should work for most problems according to the tidymodels folks:
- [Convert character to factor] (we do this outside our recipe as part of cleaning)
- Impute
- Individual transformations for skewness and other issues
- Discretize (if needed and if you have no other choice)
- Create dummy variables
- Create interactions
- Normalization steps (center, scale, range, etc)
- Multivariate transformation (e.g. PCA, spatial sign, etc)
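As a concrete (hypothetical) illustration of that ordering, a recipe that needed several of these steps might look like the sketch below. The specific steps chosen here are only examples, not a recommendation for these data

rec_example <- recipe(disease ~ ., data = data_all) |>
  step_impute_median(all_numeric_predictors()) |>   # impute
  step_YeoJohnson(all_numeric_predictors()) |>      # individual transformations
  step_dummy(all_nominal_predictors()) |>           # dummy variables
  step_interact(~ starts_with("sex"):age) |>        # interactions
  step_normalize(all_numeric_predictors())          # normalization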
5.4 The single validation (test) set approach
To date, you have essentially learned how to do the single validation set approach (although we haven’t called it that)
With this approach, we would take our full n = 303 and:
- Split into one training set and one held-out set
- Fit a model in our training set
- Use this trained model to predict scores in held-out set
- Calculate a performance metric (e.g., accuracy, rmse) based on predicted and observed scores in the held-out set
If our goal was to evaluate the expected performance of a single model configuration in new data
- We called this held-out set a test set
- We would report this performance metric from the held-out test set as our estimate of the performance of our model in new data
If our goal was to select the best model configuration among many candidate configurations
- We called this held-out set a validation set
- We would use this performance metric from the held-out validation set to select the best model configuration
We call this the single validation set approach but that single held-out set can be either a validation or test set depending on our goals
If you need to BOTH select a best model configuration AND evaluate that best model configuration, you would need both a validation and a test set.
We have been doing the single validation set approach all along but we will provide one more example now (with a 50/50 split) to transition the code we are using to a more general workflow that will accommodate our more complicated resampling approaches
In the first half of this unit, we will focus on assessing the performance of a single model configuration
- Logistic regression algorithm
- No hyperparameters
- Features based on all available predictors
We will call the held-out set a test set and use it to evaluate the expected future performance of this single configuration
Previously:
- We fit the model configuration in the training set and then made predictions for observations in the held-out test set in separate steps
- We did this in separate steps so you could better understand the process
- I will show you that first again as a baseline
Then:
- We will do these tasks in one step using validation_split()
- I will show you this combined approach second
- This latter approach will be an example for how we code this for our more complicated resampling approaches
- Let’s do a 50/50 split, stratified on our outcome, disease
set.seed(19690127)
splits <- data_all |>
  initial_split(prop = 0.5, strata = "disease")

data_trn <- analysis(splits)
data_trn |> nrow()
[1] 151

data_test <- assessment(splits)
data_test |> nrow()
[1] 152
- Make features for train and test (skim them on your own time!)
rec_prep <- rec_lr |>
  prep(data_trn)

feat_trn <- rec_prep |>
  bake(NULL)

feat_test <- rec_prep |>
  bake(data_test)
- Fit model in train
fit_lr <-
  logistic_reg() |>
  set_engine("glm") |>
  fit(disease ~ ., data = feat_trn)
- Evaluate model in test
accuracy_vec(feat_test$disease, predict(fit_lr, feat_test, type = "class")$.pred_class)
[1] 0.8355263
Now let's do this in a new and more efficient workflow
- We still start by setting up a splits object
- Note use of validation_split() rather than initial_split()
- We will use a variety of functions at this step depending on how we decide to handle resampling

set.seed(19690127)
splits_validate <- data_all |>
  validation_split(prop = 0.5, strata = "disease")
Warning: `validation_split()` was deprecated in rsample 1.2.0.
ℹ Please use `initial_validation_split()` instead.
Now we can fit our model configuration in our training set(s) and calculate performance metric(s) in the held-out sets using fit_resamples()
- You can and should read more about this function
- Takes algorithm (broad category, engine, and mode if needed), recipe, and splits as inputs
- Specify the (set of) metrics we want to estimate for our model configuration
- Don’t need to explicitly create feature matrices for held-in and held-out sets.
- But also don’t see these feature matrices
- May still want to create and skim them as a check?
fits_lr <-
  logistic_reg() |>
  set_engine("glm") |>
  fit_resamples(preprocessor = rec_lr, resamples = splits_validate,
                metrics = metric_set(accuracy))
The object that is returned (we will call it fits_) is NOT a fitted model for our model configuration (which is what we got using fit() and called fit_)
Instead, it contains the performance metrics for the configuration, estimated by
- Fitting the model configuration in the held in set(s) and then
- Predicting into the held-out sets
We pull these performance estimates out of the fits object using collect_metrics()
- There is one performance estimate
- It is for our model configuration's performance in the test set
- It is an estimate of how well our model configuration will work with new data
- It matches what we got previously when doing this manually
fits_lr |>
  collect_metrics(summarize = FALSE)
# A tibble: 1 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 validation accuracy binary 0.836 Preprocessor1_Model1
If you plan to actually use your model in the real world for prediction, you should always re-fit the best configuration using all available data!
This recognition of a bias-variance trade-off in our performance estimates is what motivates the more complicated resampling approaches we will now consider.
In our example, we plan to use this model for future predictions, so now let's fit it a final time using the full dataset
- We do this manually
- NOTE: tidymodels has routines to do all of this including fitting final models (read more about “workflows”)
- We do not use them in the course because they hide steps that are important for conceptual understanding
- We do not use them in our lab because we break apart all of these steps to train models using high throughput computing
Make a feature matrix for the full dataset
We are now using the full data set as our new training set so we prep and bake with the full dataset
rec_prep <- rec_lr |>
  prep(data_all)

feat_all <- rec_prep |>
  bake(data_all)
And then fit your model configuration
fit_lr <-
  logistic_reg() |>
  set_engine("glm") |>
  fit(disease ~ ., data = feat_all)

fit_lr |> tidy()
# A tibble: 17 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.06 2.79 -1.81 0.0704
2 age -0.0185 0.0235 -0.787 0.431
3 rest_bp 0.0240 0.0109 2.20 0.0279
4 chol 0.00414 0.00376 1.10 0.271
5 max_hr -0.0204 0.0104 -1.96 0.0497
6 exer_st_depress 0.275 0.212 1.30 0.195
7 ca 1.32 0.263 5.00 0.000000566
8 sex_male 1.43 0.489 2.92 0.00345
9 cp_atyp_ang 1.05 0.752 1.40 0.162
10 cp_non_anginal 1.24 0.602 2.06 0.0396
11 fbs_yes -0.812 0.522 -1.55 0.120
12 rest_ecg_abnormal 0.469 0.364 1.29 0.197
13 exer_ang_yes 1.24 0.401 3.10 0.00195
14 exer_st_slope_flat 1.08 0.442 2.43 0.0150
15 exer_st_slope_downslope 0.468 0.819 0.572 0.568
16 thal_fixeddefect 0.262 0.759 0.345 0.730
17 thal_reversabledefect 1.41 0.397 3.54 0.000396
If we need to predict disease in the future, this is the model we would use (with these parameter estimates)
Our estimate of its future accuracy is based on our previous assessment using the held-in training set to fit the model configuration and the held-out test set to estimate its performance
This estimate should be considered a lower bound on its expected performance
5.5 Leave One Out Cross Validation
Let’s turn to a new resampling technique and start with some questions to motivate it
Comparisons across LOOCV and single validation set approaches
The performance estimate from LOOCV has less bias than the single validation set method (because the models that are used to estimate performance were fit with close to the full n of the final model that will be fit to all the data)
LOOCV uses all observations in the held-out set at some point. This may yield less variance than single 20% or 50% validation set?
but…
- LOOCV can be computationally expensive (need to fit and evaluate the same model configuration n times).
- This is a real problem when you are also working with a high number of model configurations (i.e., number fits = n * number of model configurations).
LOOCV eventually uses all the data in the held-out set across the ‘n’ held-out sets.
- Averaging also helps reduce variance in the performance metric.
- However, averaging reduces variance to a greater degree when the performance measures being averaged are less related/more independent.
- The n fitted models are very similar in LOOCV b/c they are each fit on almost the same data (each with n-1 observations)
K-fold cross validation (next method) improves the variance of the average performance metric by averaging across more independent (less overlapping) training sets
- For this reason, it is superior and (always?) preferred over LOOCV
- We are not demonstrating LOOCV b/c we strongly prefer other methods (k-fold)
- Still important to understand it conceptually and its strengths/weaknesses
- If you wanted to use this resampling approach, simply substitute loo_cv() for vfold_cv() in the next example (see the sketch below)
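Here is what forming those splits would look like, as a minimal sketch only (we do not run it, and splits_loocv is just an illustrative name)

splits_loocv <- data_all |>
  loo_cv()

# splits_loocv would then be passed to fit_resamples() exactly like the k-fold splits below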
5.6 K-fold Cross Validation
K-fold cross validation
- Divide the observations into K equal size independent “folds” (each observation appears in only one fold)
- Hold out 1 of these folds (1/Kth of the dataset) to use as a held-out set
- Fit a model in the remaining K-1 folds
- Repeat until each of the folds has been held out once
- Performance estimate is the average performance across the K held-out folds
Common values of K are 5 and 10
Note that K is sometimes referred to as V in some fields/literatures (Don’t blame me!)
Visualization of K-fold
Let’s demonstrate the code for K-fold Cross-validation
- First, we split into 10 folds (default)
- repeats = 1 (default; more on this in a bit)
- stratify on disease

splits_kfold <- data_all |>
  vfold_cv(v = 10, repeats = 1, strata = "disease")

splits_kfold
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [272/31]> Fold01
2 <split [272/31]> Fold02
3 <split [272/31]> Fold03
4 <split [272/31]> Fold04
5 <split [273/30]> Fold05
6 <split [273/30]> Fold06
7 <split [273/30]> Fold07
8 <split [273/30]> Fold08
9 <split [273/30]> Fold09
10 <split [274/29]> Fold10
- Fit the model configuration in the first 9 folds, evaluate in the 10th fold. Repeat 9 more times for each additional held-out fold
- Use fit_resamples() as before
- Still no need to remake features for held-in and held-out sets. Just provide the splits and the recipe
- Set performance metrics with metric_set()
fits_lr_kfold <-
  logistic_reg() |>
  set_engine("glm") |>
  fit_resamples(preprocessor = rec_lr,
                resamples = splits_kfold,
                metrics = metric_set(accuracy))
Then, we review performance estimates in the held-out folds using collect_metrics()
- Can see performance in all folds using summarize = FALSE
- The performance estimates are not all the same. This is what we mean when we talk about the variance in our performance estimates. You couldn't see it before with only one held-out set but now that we have 10, it becomes more concrete. Ideally this variance is as low as possible.
metrics_kfold <- collect_metrics(fits_lr_kfold, summarize = FALSE)

metrics_kfold |> print_kbl()
id | .metric | .estimator | .estimate | .config |
---|---|---|---|---|
Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Fold02 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Fold03 | accuracy | binary | 0.71 | Preprocessor1_Model1 |
Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Fold05 | accuracy | binary | 0.70 | Preprocessor1_Model1 |
Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Fold07 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Fold08 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Fold09 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Fold10 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
- Could plot this as a histogram to visualize this variance (i.e., the sampling distribution of performance estimates)
- Would be better if we had more folds (see repeats in a bit)
metrics_kfold |> plot_hist(".estimate")
- Can see the average performance over folds along with its standard error using summarize = TRUE
- This average will have lower variance (which is estimated by the standard error; still not zero!)
collect_metrics(fits_lr_kfold, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.809 10 0.0242 Preprocessor1_Model1
As a last step, we still fit the final model as before using the full dataset
- We already have the feature matrix for the full dataset (feat_all) from earlier
- Otherwise, remake it
fit_lr <-
  logistic_reg() |>
  set_engine("glm") |>
  fit(disease ~ ., data = feat_all)

fit_lr |> tidy()
# A tibble: 17 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5.06 2.79 -1.81 0.0704
2 age -0.0185 0.0235 -0.787 0.431
3 rest_bp 0.0240 0.0109 2.20 0.0279
4 chol 0.00414 0.00376 1.10 0.271
5 max_hr -0.0204 0.0104 -1.96 0.0497
6 exer_st_depress 0.275 0.212 1.30 0.195
7 ca 1.32 0.263 5.00 0.000000566
8 sex_male 1.43 0.489 2.92 0.00345
9 cp_atyp_ang 1.05 0.752 1.40 0.162
10 cp_non_anginal 1.24 0.602 2.06 0.0396
11 fbs_yes -0.812 0.522 -1.55 0.120
12 rest_ecg_abnormal 0.469 0.364 1.29 0.197
13 exer_ang_yes 1.24 0.401 3.10 0.00195
14 exer_st_slope_flat 1.08 0.442 2.43 0.0150
15 exer_st_slope_downslope 0.468 0.819 0.572 0.568
16 thal_fixeddefect 0.262 0.759 0.345 0.730
17 thal_reversabledefect 1.41 0.397 3.54 0.000396
If we need to predict disease in the future, this is the fitted model we would use (with these parameter estimates)
Our estimate of its future accuracy is 0.8090026 with a standard error of 0.0241847
Comparisons between K-fold vs. LOOCV and Single Validation set
For Bias:
- K-fold typically has less bias than the single validation set method
- E.g. 10-fold fits models with 9/10th of the data vs. 50% or 80%, etc
- K-fold has somewhat more bias than LOOCV because LOOCV uses n - 1 observations for fitting models
For Variance:
K-fold has less variance than LOOCV
- Like LOOCV, it uses all observations in test at some point
- The averaged models are more independent b/c models are fitted on less overlapping training sets
K-fold has less variance than single validation set b/c it uses all data as test at some point (vs. a subset of held-out test data)
K-fold is less computationally expensive than LOOCV (though more expensive than single validation set)
K-fold is generally preferred over both of these other approaches
K-fold is less computationally intensive BUT still can be costly. Particularly when you are getting performance estimates for multiple model configurations (more on that when we learn how to tune hyperparameters)
So you may want to start caching the fits_ object so you don't have to recalculate it.
Here is a demonstration
- First we set up an environment variable that we can flip between true/false to invalidate all our cached calculations
- Put this near the top of your code with other environment settings so it's easy to find

rerun_setting <- TRUE
- Now use the cache_rds() function
- Put the resampling code inside of {}
- The cached file will be saved in the cache/005/ folder with filename fits_lr_kfold_HASH
- If you change any code inside the function, it will invalidate the cache and re-run the code
- However, it will not invalidate the cache if you change objects or data that affect this code but are outside it, e.g.,
  - Fix errors in data
  - Update rec_lr
  - Update splits_kfold
  - Be VERY careful!
- You can manually invalidate the cache for just this code chunk by setting rerun = TRUE temporarily, or you can change rerun_setting <- TRUE at the top of your script to do fresh calculations for all of your cached code chunks
- You can set rerun_setting <- TRUE when you are done with your development to make sure everything is accurate (review output carefully for any changes!)
fits_lr_kfold <- cache_rds(
  expr = {
    logistic_reg() |>
      set_engine("glm") |>
      fit_resamples(preprocessor = rec_lr,
                    resamples = splits_kfold,
                    metrics = metric_set(accuracy))
  },
  dir = "cache/005/",
  file = "fits_lr_kfold",
  rerun = rerun_setting)
5.7 Repeated K-fold Cross Validation
You can repeat the K-fold procedure multiple times with new splits for a different mix of K folds each time
Two benefits:
- More stable performance estimate (because averaged over more folds: repeats * K)
- Many more estimates of performance to characterize (SE; plot) of your performance estimate
But it is computationally expensive (depending on number of repeats)
An example of Repeated K-fold Cross-validation
- Splits with repeats = 10 (will do 10 different splits of 10-fold)
set.seed(19690127)
splits_kfold10x <- data_all |>
  vfold_cv(v = 10, repeats = 10, strata = "disease")

splits_kfold10x
# 10-fold cross-validation repeated 10 times using stratification
# A tibble: 100 × 3
splits id id2
<list> <chr> <chr>
1 <split [272/31]> Repeat01 Fold01
2 <split [272/31]> Repeat01 Fold02
3 <split [272/31]> Repeat01 Fold03
4 <split [272/31]> Repeat01 Fold04
5 <split [273/30]> Repeat01 Fold05
6 <split [273/30]> Repeat01 Fold06
7 <split [273/30]> Repeat01 Fold07
8 <split [273/30]> Repeat01 Fold08
9 <split [273/30]> Repeat01 Fold09
10 <split [274/29]> Repeat01 Fold10
# ℹ 90 more rows
- Everything else is the same!
fits_lr_kfold10x <- cache_rds(
  expr = {
    logistic_reg() |>
      set_engine("glm") |>
      fit_resamples(preprocessor = rec_lr,
                    resamples = splits_kfold10x,
                    metrics = metric_set(accuracy))
  },
  dir = "cache/005/",
  file = "fits_lr_kfold10x",
  rerun = rerun_setting)
- Individual estimates across 100 held-out folds
metrics_kfold10x <- collect_metrics(fits_lr_kfold10x, summarize = FALSE)

metrics_kfold10x |> print_kbl()
id | id2 | .metric | .estimator | .estimate | .config |
---|---|---|---|---|---|
Repeat01 | Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat01 | Fold02 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat01 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat01 | Fold04 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat01 | Fold05 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat01 | Fold06 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat01 | Fold07 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat01 | Fold08 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat01 | Fold09 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat01 | Fold10 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Repeat02 | Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat02 | Fold02 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold04 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold05 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat02 | Fold06 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat02 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat02 | Fold08 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat02 | Fold09 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat02 | Fold10 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Repeat03 | Fold01 | accuracy | binary | 0.94 | Preprocessor1_Model1 |
Repeat03 | Fold02 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat03 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat03 | Fold04 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat03 | Fold05 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat03 | Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat03 | Fold07 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat03 | Fold08 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat03 | Fold09 | accuracy | binary | 0.70 | Preprocessor1_Model1 |
Repeat03 | Fold10 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat04 | Fold01 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat04 | Fold02 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat04 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat04 | Fold04 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat04 | Fold05 | accuracy | binary | 0.67 | Preprocessor1_Model1 |
Repeat04 | Fold06 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat04 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat04 | Fold08 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat04 | Fold09 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat04 | Fold10 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Repeat05 | Fold01 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat05 | Fold02 | accuracy | binary | 0.74 | Preprocessor1_Model1 |
Repeat05 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat05 | Fold04 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat05 | Fold05 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat05 | Fold06 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat05 | Fold07 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Repeat05 | Fold08 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat05 | Fold09 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat05 | Fold10 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Repeat06 | Fold01 | accuracy | binary | 0.94 | Preprocessor1_Model1 |
Repeat06 | Fold02 | accuracy | binary | 0.74 | Preprocessor1_Model1 |
Repeat06 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat06 | Fold04 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat06 | Fold05 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat06 | Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat06 | Fold07 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat06 | Fold08 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat06 | Fold09 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat06 | Fold10 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat07 | Fold01 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat07 | Fold02 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat07 | Fold03 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat07 | Fold04 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat07 | Fold05 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat07 | Fold06 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat07 | Fold07 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat07 | Fold08 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Repeat07 | Fold09 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat07 | Fold10 | accuracy | binary | 0.69 | Preprocessor1_Model1 |
Repeat08 | Fold01 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat08 | Fold02 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat08 | Fold03 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat08 | Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat08 | Fold05 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat08 | Fold06 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Repeat08 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat08 | Fold08 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat08 | Fold09 | accuracy | binary | 0.67 | Preprocessor1_Model1 |
Repeat08 | Fold10 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat09 | Fold01 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat09 | Fold02 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat09 | Fold03 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat09 | Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat09 | Fold05 | accuracy | binary | 0.93 | Preprocessor1_Model1 |
Repeat09 | Fold06 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Repeat09 | Fold07 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat09 | Fold08 | accuracy | binary | 0.73 | Preprocessor1_Model1 |
Repeat09 | Fold09 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Repeat09 | Fold10 | accuracy | binary | 0.72 | Preprocessor1_Model1 |
Repeat10 | Fold01 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat10 | Fold02 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat10 | Fold03 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Repeat10 | Fold04 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Repeat10 | Fold05 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat10 | Fold06 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat10 | Fold07 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Repeat10 | Fold08 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat10 | Fold09 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Repeat10 | Fold10 | accuracy | binary | 0.76 | Preprocessor1_Model1 |
- Histogram of those 100 individual estimates
metrics_kfold10x |> plot_hist(".estimate", bins = 10)
Average performance estimate (and its SE) across the 100 held-out folds
collect_metrics(fits_lr_kfold10x, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.824 100 0.00604 Preprocessor1_Model1
- You should also refit a final model in the full data at the end as before
- We won't demonstrate that here
Comparisons between repeated K-fold and K-fold
Repeated K-fold:
- Has same bias as K-fold (still fitting models with K-1 folds)
- Has all the benefits of single K-fold
- Has even more stable estimate of performance (mean over more folds/repeats)
- Provides more info about distribution for the performance estimate
- But is more computationally expensive
Repeated K-fold is preferred over K-fold to the degree possible based on computational limitations (parallel, N, p, statistical algorithm, # of model configurations)
5.8 Grouped K-fold
We have to be particularly careful with resampling methods when we have repeated observations for the same participant (or unit of analysis more generally)
- We can often predict an individual's own data better using some of their own data.
- If our model will not ever encounter that individual again, this will optimistically bias our estimate of our model's performance with new/future observations.
- We can remove that bias by making sure that all observations from an individual are grouped together so that they are always either held in or held out but never split across both.
- It is easy to do a grouped K-fold by making splits using group_vfold_cv() and then proceeding as before with all other analyses/code
  - Set the group argument to the name of the variable that codes for subid or whatever unit of analysis is repeated (see the sketch below)
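Here is a minimal sketch of what that would look like. Our heart disease data have only one row per person, so both the data object and the subid grouping variable below are hypothetical

splits_group_kfold <- data_with_repeats |>
  group_vfold_cv(group = "subid", v = 10)

# splits_group_kfold is then passed to fit_resamples() like any other splits object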
5.9 Bootstrap Resampling
A bootstrap sample is a random sample taken with replacement (i.e., same observations can be sampled multiple times within one bootstrap sample)
If you bootstrap a new sample of size n from a dataset with sample size n, approximately 63.2% of the original observations end up in the bootstrap sample
The remaining 36.8% of the observations are often called the “out of bag” (OOB) samples
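That 63.2% figure is the probability that any given observation is drawn at least once in n draws with replacement, 1 - (1 - 1/n)^n, which approaches 1 - exp(-1) = 0.632 as n grows. A quick check with our sample size

n <- 303
1 - (1 - 1/n)^n    # roughly 0.63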
Bootstrap Resampling
- Creates B bootstrap samples of size n = n from the original dataset
- For any specific bootstrap (b)
- Model(s) are fit to the bootstrap sample
- Model performance is evaluated in the associated out of bag (held-out) samples
- This is repeated B times such that you have B assessments of model performance
An example of bootstrap resampling
- Again, all that changes is how you form the splits/resamples
- You will use bootstraps() to form the splits
- Here are 100 bootstraps stratified on disease
set.seed(19690127)
splits_boot <- data_all |>
  bootstraps(times = 100, strata = "disease")
splits_boot
# Bootstrap sampling using stratification
# A tibble: 100 × 2
splits id
<list> <chr>
1 <split [303/115]> Bootstrap001
2 <split [303/123]> Bootstrap002
3 <split [303/105]> Bootstrap003
4 <split [303/115]> Bootstrap004
5 <split [303/114]> Bootstrap005
6 <split [303/115]> Bootstrap006
7 <split [303/113]> Bootstrap007
8 <split [303/95]> Bootstrap008
9 <split [303/101]> Bootstrap009
10 <split [303/115]> Bootstrap010
# ℹ 90 more rows
- Everything else is the same!
fits_lr_boot <- cache_rds(
  expr = {
    logistic_reg() |>
      set_engine("glm") |>
      fit_resamples(preprocessor = rec_lr,
                    resamples = splits_boot,
                    metrics = metric_set(accuracy))
  },
  dir = "cache/005/",
  file = "fits_lr_boot",
  rerun = rerun_setting)
- 100 individual performance estimates from the 100 OOB sets
metrics_boot <- collect_metrics(fits_lr_boot, summarize = FALSE)

metrics_boot |> print_kbl()
id | .metric | .estimator | .estimate | .config |
---|---|---|---|---|
Bootstrap001 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap002 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap003 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap004 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap005 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap006 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap007 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap008 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap009 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap010 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap011 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap012 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap013 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap014 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap015 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap016 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap017 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap018 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap019 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap020 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap021 | accuracy | binary | 0.74 | Preprocessor1_Model1 |
Bootstrap022 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap023 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap024 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap025 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap026 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap027 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap028 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap029 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap030 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap031 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap032 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap033 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap034 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap035 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap036 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap037 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap038 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap039 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap040 | accuracy | binary | 0.88 | Preprocessor1_Model1 |
Bootstrap041 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap042 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap043 | accuracy | binary | 0.90 | Preprocessor1_Model1 |
Bootstrap044 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap045 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap046 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap047 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap048 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap049 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap050 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap051 | accuracy | binary | 0.75 | Preprocessor1_Model1 |
Bootstrap052 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap053 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap054 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap055 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap056 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap057 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap058 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap059 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap060 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap061 | accuracy | binary | 0.72 | Preprocessor1_Model1 |
Bootstrap062 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap063 | accuracy | binary | 0.86 | Preprocessor1_Model1 |
Bootstrap064 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap065 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap066 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap067 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap068 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap069 | accuracy | binary | 0.87 | Preprocessor1_Model1 |
Bootstrap070 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap071 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap072 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap073 | accuracy | binary | 0.76 | Preprocessor1_Model1 |
Bootstrap074 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap075 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap076 | accuracy | binary | 0.89 | Preprocessor1_Model1 |
Bootstrap077 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap078 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap079 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap080 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap081 | accuracy | binary | 0.88 | Preprocessor1_Model1 |
Bootstrap082 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap083 | accuracy | binary | 0.77 | Preprocessor1_Model1 |
Bootstrap084 | accuracy | binary | 0.79 | Preprocessor1_Model1 |
Bootstrap085 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap086 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap087 | accuracy | binary | 0.76 | Preprocessor1_Model1 |
Bootstrap088 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap089 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap090 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap091 | accuracy | binary | 0.85 | Preprocessor1_Model1 |
Bootstrap092 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
Bootstrap093 | accuracy | binary | 0.83 | Preprocessor1_Model1 |
Bootstrap094 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap095 | accuracy | binary | 0.82 | Preprocessor1_Model1 |
Bootstrap096 | accuracy | binary | 0.81 | Preprocessor1_Model1 |
Bootstrap097 | accuracy | binary | 0.84 | Preprocessor1_Model1 |
Bootstrap098 | accuracy | binary | 0.78 | Preprocessor1_Model1 |
Bootstrap099 | accuracy | binary | 0.75 | Preprocessor1_Model1 |
Bootstrap100 | accuracy | binary | 0.80 | Preprocessor1_Model1 |
- Histogram of those 100 performance estimates
metrics_boot |> plot_hist(".estimate", bins = 10)
- Average performance over those 100 estimates
collect_metrics(fits_lr_boot, summarize = TRUE)
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.817 100 0.00338 Preprocessor1_Model1
- You should also refit a final model in the full data at the end as before to get the final single fitted model for later use
Relevant comparisons, strengths/weaknesses for bootstrap for resampling
- The bootstrap resampling performance estimate will have higher bias than K-fold using typical K values (bias equivalent to about K = 2)
- Although the training sets have the full n, they only include about 63% unique observations. These models underperform models trained on sets with 80 - 90% unique observations
- With smaller training set sizes, this bias is considered too high by some (Kuhn)
- The bootstrap resampling performance estimate will have less variance than K-fold
- Compare SE of accuracy for 100 resamples using k-fold with repeats: 0.0060406 vs. bootstrap: 0.0033803
- With 1000 bootstraps (and test sets with ~ 37% of n) you can get a very precise estimate of held-out error
Can also represent the variance of our held-out error (like repeated K-fold)
Used primarily for selecting among model configurations when you don’t care about bias and just want a precise selection metric
Useful in explanation scenarios where you just need the “best” model
“Inner loop” of nested cross validation (more on this later)
5.10 Using Resampling to Select Best Model Configurations
In all of the previous examples, we have used various resampling methods only to evaluate the performance of a single model configuration in new data. In these instances, we were treating the held-out sets as test sets.
Resampling is also used to get held out performance estimates to select best model configurations.
Best means the model configuration that performs the best in new data and therefore is closest to the true DGP for the data
For example, we might want to select among model configurations in an explanatory scenario to have a principled approach to determine the model configuration that best matches the true DGP (and would be best to test your hypotheses). e.g.,
- Selecting covariates to include
- Deciding on X transformations
- Outlier identification approach
- Statistical algorithm
We can simply get performance estimates for each configuration using one of the previously described resampling methods
- We would call the held-out data (the single set, the folds, the OOB samples) a validation set
- We select the model configuration with the best mean (or median?) across our resampled validation sets on the relevant performance metric.
One additional common scenario where you will do model selection across many model configurations is when "tuning" (i.e., selecting) the best values for hyperparameters for a statistical algorithm (e.g., k in KNN).
tidymodels makes this easy and it follows a very similar workflow as earlier with a few changes
- We will need to indicate which hyperparameters we plan to tune in the statistical algorithm
- We need to (or can) select values to consider for that hyperparameter (or we can let the tune package functions decide in some cases)
- We will now use tune_grid() rather than fit_resamples() to fit and evaluate the model configurations that differ with respect to their hyperparameters
- This IS computationally costly. We are now fitting and estimating performance for multiple configurations across multiple held-in/held-out sets
Let's use bootstrap resampling to select the best K for KNN applied to our heart disease dataset
We can use the same splits we established before (splits_boot)
We need a slightly different recipe for KNN vs. logistic regression
- We have to scale (let's range correct) the features
- No need to do this to the dummy features. They are already range corrected
rec_knn <- recipe(disease ~ ., data = data_all) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_range(all_numeric()) |>
  step_dummy(all_nominal_predictors())
The fitting process is what is different
- We set up a tibble with values of the hyperparameters to consider
- We indicate which hyperparameters need to be tuned
hyper_grid <- expand.grid(neighbors = seq(1, 150, by = 3))

hyper_grid
neighbors
1 1
2 4
3 7
4 10
5 13
6 16
7 19
8 22
9 25
10 28
11 31
12 34
13 37
14 40
15 43
16 46
17 49
18 52
19 55
20 58
21 61
22 64
23 67
24 70
25 73
26 76
27 79
28 82
29 85
30 88
31 91
32 94
33 97
34 100
35 103
36 106
37 109
38 112
39 115
40 118
41 121
42 124
43 127
44 130
45 133
46 136
47 139
48 142
49 145
50 148
- When model configurations differ by features or statistical algorithms, we have to use fit_resamples() multiple times to estimate performance of those different configurations
- Held-out performance estimates for model configurations that differ by hyperparameter values are obtained in one step using tune_grid()
- We need to set grid = hyper_grid
fits_knn_boot <- cache_rds(
  expr = {
    nearest_neighbor(neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification") |>
      tune_grid(preprocessor = rec_knn,
                resamples = splits_boot,
                grid = hyper_grid,
                metrics = metric_set(accuracy))
  },
  dir = "cache/005/",
  file = "fits_knn_boot",
  rerun = rerun_setting)
Reviewing performance of model configurations is similar to before but now with multiple configurations
- We can see the average performance over folds along with its standard error using summarize = TRUE
- We could see the performance of each configuration in EACH fold too, but there are lots of them (use summarize = FALSE)
collect_metrics(fits_knn_boot, summarize = TRUE)
# A tibble: 50 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy binary 0.749 100 0.00362 Preprocessor1_Model01
2 4 accuracy binary 0.749 100 0.00362 Preprocessor1_Model02
3 7 accuracy binary 0.771 100 0.00367 Preprocessor1_Model03
4 10 accuracy binary 0.782 100 0.00353 Preprocessor1_Model04
5 13 accuracy binary 0.787 100 0.00345 Preprocessor1_Model05
6 16 accuracy binary 0.794 100 0.00329 Preprocessor1_Model06
7 19 accuracy binary 0.798 100 0.00314 Preprocessor1_Model07
8 22 accuracy binary 0.800 100 0.00313 Preprocessor1_Model08
9 25 accuracy binary 0.802 100 0.00311 Preprocessor1_Model09
10 28 accuracy binary 0.803 100 0.00306 Preprocessor1_Model10
# ℹ 40 more rows
- We can plot average performance by the values of the hyperparameter
collect_metrics(fits_knn_boot, summarize = TRUE) |>
ggplot(aes(x = neighbors, y = mean)) +
geom_line()
K (neighbors) is affecting the bias-variance trade-off. As K increases, model bias increases but model variance decreases. In most instances, model variance decreases faster than model bias increases. Therefore performance should increase and then peak at a good point along the bias-variance trade-off. Beyond this optimal value, performance should decrease again. You want to select a hyperparameter value that is associated with peak (or near peak) performance.
The simplest way to select among model configurations (e.g., hyperparameters) is to choose the model configuration with the best performance
show_best(fits_knn_boot, n = 10)
# A tibble: 10 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 145 accuracy binary 0.819 100 0.00321 Preprocessor1_Model49
2 148 accuracy binary 0.819 100 0.00323 Preprocessor1_Model50
3 142 accuracy binary 0.819 100 0.00324 Preprocessor1_Model48
4 124 accuracy binary 0.818 100 0.00310 Preprocessor1_Model42
5 121 accuracy binary 0.818 100 0.00309 Preprocessor1_Model41
6 136 accuracy binary 0.818 100 0.00323 Preprocessor1_Model46
7 139 accuracy binary 0.818 100 0.00323 Preprocessor1_Model47
8 133 accuracy binary 0.818 100 0.00319 Preprocessor1_Model45
9 130 accuracy binary 0.818 100 0.00316 Preprocessor1_Model44
10 127 accuracy binary 0.818 100 0.00314 Preprocessor1_Model43
select_best(fits_knn_boot)
# A tibble: 1 × 2
neighbors .config
<dbl> <chr>
1 145 Preprocessor1_Model49
The next most common approach is to choose the simplest (least flexible) model that has performance within one SE of the best performing configuration.
- Use select_by_one_std_err()
- Sort performance results from least to most flexible (e.g., desc(neighbors))
select_by_one_std_err(fits_knn_boot,
desc(neighbors))
# A tibble: 1 × 9
neighbors .metric .estimator mean n std_err .config .best
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl>
1 148 accuracy binary 0.819 100 0.00323 Preprocessor1_Model50 0.819
.bound
<dbl>
1 0.816
- We should also refit a final model with the “best” hyperparameter using the full data as before
rec_prep <- rec_knn |>
  prep(data_all)

feat_all <- rec_prep |>
  bake(data_all)
We can use the select_*() functions from above to plug this best hyperparameter into our specification of the algorithm
Note that we now fit using all the data and switch to fit() rather than tune_grid()
fit_knn_best <-
  nearest_neighbor(neighbors = select_best(fits_knn_boot)$neighbors) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(disease ~ ., data = feat_all)
- However, we can’t use the previous bootstrap resampling to evaluate this final/best model because we already used it to select the best configuration
- We need new, held-out data to evaluate it (a new test set!)
5.11 Resampling for Both Model Selection and Evaluation
Resampling methods can be used to get model performance estimates to select the best model configuration and/or evaluate that best model
So far we have done EITHER selection OR evaluation but not both together
The concepts needed to both select the best configuration and evaluate it are similar, but doing both requires different (slightly more complicated) resampling than what we have done so far
If you use your held-out resamples to select the best model among a number of model configurations, then those same held-out resamples cannot also be used to evaluate the performance of that best model
If they are, the performance estimate will have optimization bias. To the degree that there is any noise (i.e., variance) in the measurement of performance, selecting the best model configuration will capitalize on that noise.
You need to use one set of held out resamples (validation sets) to select the best model. Then you need a DIFFERENT set of held out resamples (test sets) to evaluate that best model.
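To build intuition for why optimization bias arises, here is a toy simulation (not from the readings; the numbers are invented) in which 50 configurations all have exactly the same true accuracy but are measured with noise. Picking the best measured configuration systematically overstates its true performance.
# Toy example: 50 configurations that ALL have a true accuracy of .80, each measured
# with noise (SE = .02). The maximum of the 50 noisy estimates is, on average,
# noticeably higher than .80 -- that gap is optimization bias
set.seed(20240221)
best_measured <- replicate(1000, max(rnorm(50, mean = 0.80, sd = 0.02)))
mean(best_measured)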
There are two strategies for this:
- Strategy 1:
  - First, hold out a test set for final/best model evaluation (using initial_split()).
  - Then use one of the above resampling methods (single validation set approach, k-fold, or bootstrap) to select the best model configuration.
  - Bootstrap is likely the best option b/c it is typically more precise (though biased)
- Strategy 2: Nested resampling. More on this in a moment
Other Observations about Common Practices:
- Simple resampling methods with the full sample (and no held-out test set) to both select AND evaluate are still common
- Failure by some (even Kuhn) to appreciate the degree of optimization bias
- Particular problem in Psychology because of small n (high variance in our performance metric)?
- Can be fine if you just want to find the best model configuration but don’t need to evaluate its performance rigorously (i.e., you don’t really care that much about the validity of the performance estimate for your final model and are fine with it being a bit optimistic)
5.11.1 Bootstrap with Test Set
First, we divide our data into training and test sets using initial_split()
Next, we use bootstrap resampling with the training set to split training into many held-in and held-out sets. We use these held-out (OOB) sets as validation sets to select the best model configuration based on mean/median performance across those sets.
After we select the best model configuration using bootstrap resampling of the training set
- We refit that model configuration in the FULL training set
- And we use that model to predict into the test set to evaluate it
Of course, in the end, if you plan to use the model, you will refit this final model configuration to the FULL dataset, but the performance estimate for that model will come from the test set in the previous step (there are no more data to estimate new performance)
- Use initial_split() for the first train/test split
set.seed(123456)
splits_test <- data_all |>
  initial_split(prop = 2/3, strata = "disease")

data_trn <- splits_test |>
  analysis()

data_test <- splits_test |>
  assessment()
- Use training set for model selection via resampling (in this case bootstrap)
- No need for another seed
splits_boot_trn <- data_trn |>
  bootstraps(times = 100, strata = "disease")
- Use the same grid of hyperparameters we set up earlier (hyper_grid)
- Fit model configurations that vary by K in all 100 bootstrap samples
- Make predictions and calculate accuracy for these fitted models in 100 OOB (validation) sets
fits_knn_boot_trn <- cache_rds(
  expr = {
    nearest_neighbor(neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification") |>
      tune_grid(preprocessor = rec_knn,
                resamples = splits_boot_trn,
                grid = hyper_grid,
                metrics = metric_set(accuracy))
  },
  dir = "cache/005/",
  file = "fits_knn_boot_trn",
  rerun = rerun_setting)
- Select the best model configuration (best k)
K = 91 is the best model configuration determined by bootstrap resampling
BUT this is NOT the correct estimate of its performance in new data
We compared 50 model configurations (values of k). This performance estimate may have some optimization bias (though 50 model configurations is really not THAT many)
show_best(fits_knn_boot_trn, n = 10)
# A tibble: 10 × 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 91 accuracy binary 0.815 100 0.00309 Preprocessor1_Model31
2 97 accuracy binary 0.815 100 0.00305 Preprocessor1_Model33
3 100 accuracy binary 0.815 100 0.00302 Preprocessor1_Model34
4 124 accuracy binary 0.815 100 0.00299 Preprocessor1_Model42
5 121 accuracy binary 0.815 100 0.00299 Preprocessor1_Model41
6 88 accuracy binary 0.815 100 0.00306 Preprocessor1_Model30
7 79 accuracy binary 0.815 100 0.00300 Preprocessor1_Model27
8 94 accuracy binary 0.815 100 0.00302 Preprocessor1_Model32
9 85 accuracy binary 0.814 100 0.00299 Preprocessor1_Model29
10 73 accuracy binary 0.814 100 0.00309 Preprocessor1_Model25
- show exact means
show_best(fits_knn_boot_trn, n = 10)$mean
[1] 0.8149497 0.8148387 0.8147184 0.8147127 0.8146859 0.8146851 0.8145969
[8] 0.8145515 0.8144602 0.8144509
- select best
select_best(fits_knn_boot_trn)
# A tibble: 1 × 2
neighbors .config
<dbl> <chr>
1 91 Preprocessor1_Model31
- Fit the k = 91 model configuration in the full training set
rec_prep <- rec_knn |>
  prep(data_trn)

feat_trn <- rec_prep |>
  bake(NULL)

feat_test <- rec_prep |>
  bake(data_test)

fit_knn_best <-
  nearest_neighbor(neighbors = select_best(fits_knn_boot_trn)$neighbors) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(disease ~ ., data = feat_trn)
- Use that fitted model to predict into test set
accuracy_vec(feat_test$disease, predict(fit_knn_best, feat_test)$.pred_class)
[1] 0.8333333
Our best estimate of how accurate a model with k = 91 would be in new data is 0.8333333.
- However, it was only fit with n = 203 (the training set).
- If we truly want the best model, we should now train once again with all of our data (see the sketch after this list).
- This model would likely perform even better b/c it is trained with a larger n.
- However, we have no more new data to evaluate it!
- There is no perfect performance estimate!
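A minimal sketch of that final refit, assuming the objects created above; the names rec_prep_all, feat_all_final, and fit_knn_final are just illustrative.
# Refit the selected configuration (best K from the training-set bootstraps) using
# ALL of the data; its performance estimate still comes from the test set above
rec_prep_all <- rec_knn |>
  prep(data_all)

feat_all_final <- rec_prep_all |>
  bake(NULL)

fit_knn_final <-
  nearest_neighbor(neighbors = select_best(fits_knn_boot_trn)$neighbors) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(disease ~ ., data = feat_all_final)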
5.12 Nested Resampling
And now the final, mind-blowing extension!!!!!
The bootstrap resampling + test set approach to simultaneously select and evaluate models is commonly used
However, it suffers from the same problems as the single train, validation, test set approach when it comes to evaluating the performance of the final best model
- It only uses a single small held out test set. In this case, 1/3 of the total sample size.
- This will yield a high variance/imprecise estimate of model performance
- It also yields a biased estimate of model performance
- The model we evaluated in test was fit to only the training data which was only 2/3 of total sample size
- Yet our true final model is trained with the full dataset
- We are likely underestimating its true performance
Nested resampling offers an improvement with respect to these two issues
Nested resampling involves two loops
- The inner loop is used for model selection
- The outer loop is used for model evaluation
Nested resampling is VERY CONFUSING at first (like the first year you use it!)
Nested resampling isn’t fully supported by tidymodels as of yet. You have to do some coding to iterate over the outer loop
Application of nested resampling is outside the scope of this course but you should understand it conceptually. For further reading on the implementation of this method, see an example provided by the tidymodels folks.
A ‘simple’ example using bootstrap for the inner loop and 10-fold CV for the outer loop
- Divide the full sample into 10 folds
- Iterate through those 10 folds as follows (this is the outer loop)
- Hold out fold 1
- Use folds 2-10 to do the inner loop bootstrap
- Bootstrap these folds B times
- Fit models in B bootstrap samples
- Calculate selection performance metrics in B out-of-bag samples
- Average the B bootstrapped selection performance metrics for each model configuration
- Select the best model configuration using this average bootstrapped selection performance metric
- Use best model configuration from folds 2-10 to make predictions for the held-out fold 1 to get the first (of ten) evaluation performance metrics
- Repeat for held-out folds 2 - 10
- Average the 10 evaluation performance metrics. This is the expected performance of a best model configuration (selected by B bootstraps) in new data. [You still don’t know what configuration you should use because you have 10 ‘best’ model configurations]
- Do B bootstraps with the full sample to select the best model configuration
- Fit this best model configuration to the full data
Nested resampling evaluates a fitting and selection process not a specific model configuration!
You therefore need to select a final model configuration using the same resampling with the full data
You then need to fit that new model configuration to the full data
Those are the last two steps in the list above
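For those curious what the objects could look like, here is a minimal conceptual sketch that uses rsample::nested_cv() to set up the resamples (10-fold CV outside, bootstrap inside). The tuning loop over the outer folds still has to be written by hand (e.g., with purrr), so this is only the setup, not a full implementation, and it assumes the same data_all and disease outcome used above.
# Sketch: nested resamples with 10-fold CV for the outer (evaluation) loop and
# 100 bootstraps for the inner (selection) loop
library(rsample)

set.seed(123456)
splits_nested <- nested_cv(data_all,
                           outside = vfold_cv(v = 10, strata = "disease"),
                           inside = bootstraps(times = 100, strata = "disease"))

# Each row of splits_nested is one outer fold; its inner_resamples column holds the
# 100 bootstraps of that fold's held-in data (used to select the best configuration).
# One would then iterate over the 10 outer rows: tune on the inner bootstraps, refit
# the selected configuration on the outer held-in (analysis) data, and evaluate it
# on the outer held-out (assessment) fold
splits_nested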
5.13 Data Exploration with Advanced Resampling Methods
Final words on resampling:
Iterative methods (K-fold, bootstrap) are superior to single validation set approach wrt bias-variance trade-off in performance measurement
K-Fold resampling should be used if you are looking for a performance estimate of a single model configuration
Bootstrap resampling should be used if you are looking only to choose among model configurations but don’t need an independent assessment of that final model
Bootstrap resampling + Test set or Nested Resampling should be used when you plan to both select among model configurations AND evaluate the best model
In scenarios where you will not have one test set but will eventually use all the data as test at some point (i.e., k-fold for evaluation of a single model configuration or Nested CV)
- Think carefully about how you do EDA
- Resampling reduces overfitting in the model selection process
- Can use eyeball sample (10-20% of full data) with little impact on final performance measure
5.14 Discussion
5.14.1 Announcements
- Starting Unit 6 (Regularization/Penalized models) at end of class today; All set!
- Unit 7 is Mid-term Exam unit
- No readings, lecture or quiz
- Application assignment as exam (due at normal application assignment day/time; Weds, March 6th at 8pm)
- Conceptual exam during discussion meeting (Thursday March 7th at 11 am)
- Room assignments; 121 next week but staying in 338; Thanks for the feedback!
- EDA with new datasets (and the last application assignment)
Feedback
- concepts in color?, code visually distinct!!
- tables for contrasts vs. text. - will try for this year but can’t promise (still working on terms)
- reading papers using the methods - have one selected will consider more once we know enough!
- other ways to submit questions linked to lecture slides??
- direct links to slides format?
- captions? Help me!
5.14.2 Bias and Variance
- General definitions - need to think about repeated estimation of something (\(\hat{f}\), \(\hat{DGP}\); \(\hat{accuracy}\), \(\hat{rmse}\))
- Examples for our models (\(\hat{f}\), \(\hat{DGP}\))
- Examples for our performance estimates (\(\hat{accuracy}\), \(\hat{rmse}\))
5.14.3 Describe process to use resampling to get a performance estimate to evaluate a single model configuration
- Which method?
- Use held-in/held-out terminology in addition to train/test
- Discuss the bias and variance of the performance estimate of this configuration. Consider implications of:
- The method
- size of the held-in data
- size of held out data
5.14.4 Describe process to use resampling to get performance estimates to select a best model configuration among many (explanatory setting?)
- When done (what are you selecting among)?
- Which method?
- Use held-in/held-out terminology in addition to train/val
- Discuss the bias and variance of the performance estimate of this configuration
- Discuss the bias and variance of the performance estimate used to select the best configuration. Consider implications of:
- The method
- size of the held-in data
- size of held out data
- What are the implications if you use that same performance estimate to evaluate that best model configuration (i.e., estimate its performance in new data)?
5.14.5 Describe process to use resampling (other than nested) to get performance estimates to both select best configuration and evaluate it in new data
- Which method?
- Describe how to do it using held-in/held-out terminology in addition to train/val/test
- Discuss the bias and variance of the performance estimate used to select the best configuration. Consider implications of:
- The method
- size of the held-in data
- size of held out data
- Discuss the bias and variance of the performance estimate used to evaluate that best configuration. Consider implications of:
- The method
- size of the held-in data
- size of held out data
5.14.6 Describe process to use Nested CV to get performance estimates to both select best configuration and evaluate it in new data
- When used?
- Describe how to do it using held-in/held-out terminology in addition to train/val/test
- Discuss the bias and variance of the performance estimate used to select the best configuration. Consider implications of:
- The method
- size of the held-in data
- size of held out data
- Discuss the bias and variance of the performance estimate used to evaluate that best configuration. Consider implications of:
- The method
- size of the held-in data
- size of held out data
5.14.7 Methods to EITHER select best model configuration (among many) OR evaluate a single best model configuration
What are pros, cons, and when to use?
- Single validation (or test) set
- LOOCV
- K-fold
- Repeated k-fold
- Bootstrap
- ALSO: grouped k-fold
5.14.8 Methods to BOTH select best model configuration and evaluate that best configuration
- What is optimization bias (it’s just a form of bias but given a fancy name due to its source)?
- Combine above method with held out test set
- Nested CV (see tutorial)
5.14.9 Discuss data leakage with resampling
- Why is data leakage a problem?
- How does noise factor into the problem?
- Why does resampling substantially reduce data leakage?
- How to use eyeball sample with single held out test set
5.14.10 A final interesting question…
Biased estimate of model performance has troubled me in the last assignment. I did repeated k-fold cross validation with a KNN model where k = 5. The effective sample size was n ~ 800 during model selection. Then I refit the model with the full training data where n ~ 1000 and submitted my predictions for the test set. After some testing with one of the TAs, I figured that my submitted model with a larger N performed worse than the model evaluated with resamples, if I used the same K determined via resampling (number of nearest neighbors). My hunch was that the optimal K (number of nearest neighbors) was different in a model trained with n ~ 800 than a model trained with n ~ 1000. Is this a possible explanation? If so, would LOOCV better serve the purpose of selecting a K for the final KNN model (so that the training set for selecting K is very much the same as the training set used for the final model)?
- On average, what do you expect about the performance estimate of your final model configuration in test vs. the performance estimates used to select the best model configuration?
- What if you used bootstrap to select best configuration?
- …but, what about choosing k from train and then using that k in a model fit to train and val?
- best k
- best k within ± 1 SE (lower limit or higher limit)